| Title: | Tools to Support the 'WiscLab' |
|---|---|
| Description: | A collection of 'R' functions for use (and re-use) across 'WiscLab' projects. |
| Authors: | Tristan Mahr [aut, cre] (ORCID: <https://orcid.org/0000-0002-8890-5116>) |
| Maintainer: | Tristan Mahr <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.1.9000 |
| Built: | 2026-06-04 21:34:25 UTC |
| Source: | https://github.com/tjmahr/wisclabmisc |
An audit is a list() with two fields:
data: most recently updated (or poked) data
notes: a history of the fn calls and results applied to the object.
Usage:
audit_wrap(data) wraps some data inside of an audit.
audit_peek(x, fn, ...) applies a function to the data, log the results
in the notes, but not overwrite the audit data.
audit_poke(x, fn, ...) applies a function to the data, log the results
in the notes, and overwrite the audit data.
audit_unwrap() returns the underlying data.
audit_wrap(data) audit_peek(x, fn, ...) audit_poke(x, fn, ...) audit_unwrap(x)audit_wrap(data) audit_peek(x, fn, ...) audit_poke(x, fn, ...) audit_unwrap(x)
data |
data to wrap into an audit |
x |
an audit object |
fn |
a function to apply to the data inside of the audit object |
... |
further arguments for |
There is no audit_notes() function at this time because it is not
clear what the most convenient or tidy format would be for the notes.
These container was designed for piping a vector of filenames through a series of data validation functions. Therefore, the intended use case is a vector of a data and not larger objects.
audit_wrap(), audit_peek(), audit_poke() return audit objects.
audit_unwrap() returns the $data from an audit.
x <- letters |> audit_wrap() |> audit_peek(rev) |> audit_peek(toupper) # Note that none of the underlying data has changed. This is useful for # finding values that fail validation checks. x # Use _poke() to update the data: x <- letters |> audit_wrap() |> audit_peek(rev) |> audit_poke(toupper) |> audit_peek(getElement, 1) |> audit_peek(function(x) character(0)) x # Index into the notes to pull the data x$notes[[3]][["result"]]x <- letters |> audit_wrap() |> audit_peek(rev) |> audit_peek(toupper) # Note that none of the underlying data has changed. This is useful for # finding values that fail validation checks. x # Use _poke() to update the data: x <- letters |> audit_wrap() |> audit_peek(rev) |> audit_poke(toupper) |> audit_peek(getElement, 1) |> audit_peek(function(x) character(0)) x # Index into the notes to pull the data x$notes[[3]][["result"]]
Set default arguments for brms model fitting
brms_args_create( ..., .backend = "cmdstanr", .threads = 2, .chains = 4, .cores = 4, .iter = 2000, .silent = 0, .file_refit = "on_change", .refresh = 25, .control = list() )brms_args_create( ..., .backend = "cmdstanr", .threads = 2, .chains = 4, .cores = 4, .iter = 2000, .silent = 0, .file_refit = "on_change", .refresh = 25, .control = list() )
... |
arguments to |
.backend, .threads, .chains, .cores, .iter, .silent, .file_refit, .refresh, .control
|
These arguments set default default values. Overwrite these defaults by
using the argument without the |
One can set the default value for adapt_delta directly
(adapt_delta = .98). This value will be propagated to
control$adapt_delta.
A function for generating a list of arguments to brms::brm().
brm_args <- brms_args_create() # using package-provided defaults brm_args() # overwriting a default value brm_args(iter = 500) # adapt_delta is handled specially brm_args(adapt_delta = .95) # adapt_delta is handled specially brm_args(adapt_delta = .95) # We can overwrite the package-provided defaults other_brm_args <- brms_args_create(iter = 4000, backend = "rstan") other_brm_args() # And overwrite them too other_brm_args(backend = "cmdstanr")brm_args <- brms_args_create() # using package-provided defaults brm_args() # overwriting a default value brm_args(iter = 500) # adapt_delta is handled specially brm_args(adapt_delta = .95) # adapt_delta is handled specially brm_args(adapt_delta = .95) # We can overwrite the package-provided defaults other_brm_args <- brms_args_create(iter = 4000, backend = "rstan") other_brm_args() # And overwrite them too other_brm_args(backend = "cmdstanr")
check_model_centiles() computes centiles from a model and computes the
calibration of each centile. check_computed_centiles() works on a dataframe
with precomputed c[XX] columns of centiles.
check_model_centiles( data, model, var_x, var_y, centiles = c(5, 10, 25, 50, 75, 90, 95) ) check_computed_centiles(data, var_y)check_model_centiles( data, model, var_x, var_y, centiles = c(5, 10, 25, 50, 75, 90, 95) ) check_computed_centiles(data, var_y)
data |
a dataset used to fit a model. If the dataframe is grouped with
|
model |
a gamlss model prepared by |
var_x, var_y
|
bare column names of the predictor and outcome variables |
centiles |
centiles to use for prediction. Defaults to |
a tibble the number of points and the percentage of points less than or equal to each quantile value.
Ages are rounded down to the nearest month. A difference of 20 months, 29 days is interpreted as 20 months.
chrono_age(t1, t2)chrono_age(t1, t2)
t1, t2
|
dates in "yyyy-mm-dd" format |
the chronological ages in months. NA is returned if the age cannot be computed.
# Two years exactly chrono_age("2014-01-20", "2012-01-20") #> 24 # Shift a year chrono_age("2014-01-20", "2013-01-20") #> 12 chrono_age("2014-01-20", "2011-01-20") #> 36 # Shift a month chrono_age("2014-01-20", "2012-02-20") #> 23 chrono_age("2014-01-20", "2011-12-20") #> 25 # 3 months exactly chrono_age("2014-05-10", "2014-02-10") #> 3 # Borrow a month when the earlier date has a later day chrono_age("2014-05-10", "2014-02-11") #> 2, equal to 2 months, 29 days rounded down to nearest month # Inverted argument order chrono_age("2012-01-20", "2014-01-20") #> 24 # Multiple dates t1 <- c("2012-01-20", "2014-02-10", "2010-10-10") t2 <- c("2014-01-20", "2014-05-10", "2014-11-10") chrono_age(t1, t2) #> [1] 24 3 49# Two years exactly chrono_age("2014-01-20", "2012-01-20") #> 24 # Shift a year chrono_age("2014-01-20", "2013-01-20") #> 12 chrono_age("2014-01-20", "2011-01-20") #> 36 # Shift a month chrono_age("2014-01-20", "2012-02-20") #> 23 chrono_age("2014-01-20", "2011-12-20") #> 25 # 3 months exactly chrono_age("2014-05-10", "2014-02-10") #> 3 # Borrow a month when the earlier date has a later day chrono_age("2014-05-10", "2014-02-11") #> 2, equal to 2 months, 29 days rounded down to nearest month # Inverted argument order chrono_age("2012-01-20", "2014-01-20") #> 24 # Multiple dates t1 <- c("2012-01-20", "2014-02-10", "2010-10-10") t2 <- c("2014-01-20", "2014-05-10", "2014-11-10") chrono_age(t1, t2) #> [1] 24 3 49
Create an ROC curve from observed data
compute_empirical_roc( data, response, predictor, direction = "auto", best_weights = c(1, 0.5), levels = NULL, ... )compute_empirical_roc( data, response, predictor, direction = "auto", best_weights = c(1, 0.5), levels = NULL, ... )
data |
a dataframe containing responses (groupings) and predictor variable |
response |
a bare column name with the group status (control vs. cases).
If the response has more than two groups, the first element of |
predictor |
a bare column name with the predictor to use for classification |
direction |
|
best_weights |
weights for computing the best ROC curve points. Defaults
to |
levels |
two-element vector |
... |
additional arguments passed to |
a new dataframe of ROC coordinates is returned with columns for the
predictor variable, .sensitivities, .specificities, .auc,
.direction, .controls, .cases, .n_controls, .n_cases,
.is_best_youden and .is_best_closest_topleft.
set.seed(100) x1 <- rnorm(100, 4, 1) x2 <- rnorm(100, 2, .5) both <- c(x1, x2) steps <- seq(min(both), max(both), length.out = 200) d1 <- dnorm(steps, mean(x1), sd(x1)) d2 <- dnorm(steps, mean(x2), sd(x2)) data <- tibble::tibble( y = steps, d1 = d1, d2 = d2, outcome = rbinom(200, 1, prob = 1 - (d1 / (d1 + d2))), group = ifelse(outcome, "case", "control") ) # get an ROC on the fake data compute_empirical_roc(data, outcome, y) # this guess the cases and controls from the group name and gets it wrong compute_empirical_roc(data, group, y) # better compute_empirical_roc(data, group, y, levels = c("control", "case"))set.seed(100) x1 <- rnorm(100, 4, 1) x2 <- rnorm(100, 2, .5) both <- c(x1, x2) steps <- seq(min(both), max(both), length.out = 200) d1 <- dnorm(steps, mean(x1), sd(x1)) d2 <- dnorm(steps, mean(x2), sd(x2)) data <- tibble::tibble( y = steps, d1 = d1, d2 = d2, outcome = rbinom(200, 1, prob = 1 - (d1 / (d1 + d2))), group = ifelse(outcome, "case", "control") ) # get an ROC on the fake data compute_empirical_roc(data, outcome, y) # this guess the cases and controls from the group name and gets it wrong compute_empirical_roc(data, group, y) # better compute_empirical_roc(data, group, y, levels = c("control", "case"))
Compute overlap rate for (phoneme alignment) intervals
compute_overlap_rate(x1, x2, y1, y2)compute_overlap_rate(x1, x2, y1, y2)
x1, x2
|
start and end times for the first interval |
y1, y2
|
start and end times for the second interval |
Paulo and Oliveira (2004) provide an "overlap rate" statistic for computing the amount of overlap between two (time) intervals. To my knowledge, nobody has described the Overlap Rate in this way, but it is the Jaccard index applied to time intervals.
Let and
be the sets of times spanned by the
intervals and . Then, is the intersection or the
times covered by both intervals, and is the union or the
times covered by either interval. The size of a set is denoted
. Then the overlap rate is the Jaccard index or the proportion of
elements that the two sets have in common:
the overlap rate
Paulo, S., & Oliveira, L. C. (2004). Automatic Phonetic Alignment and Its Confidence Measures. In J. L. Vicedo, P. Martínez-Barco, R. Muńoz, & M. Saiz Noeda (Eds.), Advances in Natural Language Processing (pp. 36–44). Springer. https://doi.org/10.1007/978-3-540-30228-5_4
compute_overlap_rate( c(0.0, 0.0, 0.0, 0.0), c(1.0, 1.0, 1.0, NA), c(0.5, 2.0, 1.0, 1.0), c(2.0, 3.0, 2.0, 2.0) )compute_overlap_rate( c(0.0, 0.0, 0.0, 0.0), c(1.0, 1.0, 1.0, NA), c(0.5, 2.0, 1.0, 1.0), c(2.0, 3.0, 2.0, 2.0) )
Compute positive and negative predictive value
compute_predictive_value_from_rates(sensitivity, specificity, prevalence)compute_predictive_value_from_rates(sensitivity, specificity, prevalence)
sensitivity, specificity, prevalence
|
vectors of confusion matrix rates |
These vectors passed into this function should be some common length and/or length 1. For example, 4 sensitivities, 4 specificities and 1 incidence will work because the sensitivities and specificities have a common length and we can safely recycle (reuse) the incidence value. But 4 sensitivities, 2 specificities, and 1 incidence will fail because there is not a common length.
a tibble with the columns sensitivity, specificity, prevalence,
ppv, npv where ppv and npv are the positive predictive value and
the negative predictive value.
compute_predictive_value_from_rates( sensitivity = .9, specificity = .8, prevalence = .05 ) compute_predictive_value_from_rates( sensitivity = .67, specificity = .53, prevalence = c(.15, .3) )compute_predictive_value_from_rates( sensitivity = .9, specificity = .8, prevalence = .05 ) compute_predictive_value_from_rates( sensitivity = .67, specificity = .53, prevalence = c(.15, .3) )
pROC::roc() does not support observation-weights when computing ROC curves.
This function fills that gap.
compute_sens_spec_from_ecdf( data, response, predictor, weights = NULL, direction = NULL, levels = NULL )compute_sens_spec_from_ecdf( data, response, predictor, weights = NULL, direction = NULL, levels = NULL )
data |
a dataframe containing responses (groupings) and predictor variable |
response |
a bare column name with the group status (control vs. cases).
If the response has more than two groups, the first element of |
predictor |
a bare column name with the predictor to use for classification |
weights |
a bare column name for the observation weights. If |
direction |
|
levels |
two-element vector |
The .sensitivities and .specificities columns are calculated directly
from the (weighted) ECDFs of the controls and cases, so no call to
pROC::roc() is made.
.auc is calculated using trapezoid_auc().
c(-Inf, Inf) are added to the predictor vector so that sensitivities and
specificities range from 0 to 1.
A dataframe of stepwise empirical ROC coordinates computed from
(optionally weighted) data. The output includes columns for the predictor
variable, .sensitivities, .specificities, .auc, .comparison,
.n_controls, .n_cases, .w_controls, .w_cases, .comparison,
.direction, .response, .controls, and .cases. n_ columns contain
the number of observations for that predictor value and w_ contain the
total weight of the observations for that predictor value.
# Simulate 3-class dataset set.seed(100) n <- 50 means <- c("A" = 0, "B" = 1, "C" = 2) y <- sample(c("A", "B", "C"), n, replace = TRUE) x <- rnorm(n, mean = means[y], sd = 1) w <- runif(n, 0.5, 2) df <- data.frame(y, x, w) # Compare "A" (controls) to "C" (cases) roc_tbl <- compute_sens_spec_from_ecdf( data = df, response = y, predictor = x, weights = w, direction = "control-low", levels = c("A", "C") ) dplyr::glimpse(roc_tbl) # Compare "B" (controls) to "C" (cases) roc_tbl2 <- compute_sens_spec_from_ecdf( data = df, response = y, predictor = x, weights = w, direction = "control-low", levels = c("B", "C") ) dplyr::glimpse(roc_tbl2) library(ggplot2) # Plotting can be tricky bc sens and spec values repeat. geom_path() does the # right thing and walks along threshold values to plot the sens-spec pairs ggplot(roc_tbl) + aes(x = .specificities, y = .sensitivities) + geom_path() + scale_x_reverse() + coord_fixed(ratio = 1)# Simulate 3-class dataset set.seed(100) n <- 50 means <- c("A" = 0, "B" = 1, "C" = 2) y <- sample(c("A", "B", "C"), n, replace = TRUE) x <- rnorm(n, mean = means[y], sd = 1) w <- runif(n, 0.5, 2) df <- data.frame(y, x, w) # Compare "A" (controls) to "C" (cases) roc_tbl <- compute_sens_spec_from_ecdf( data = df, response = y, predictor = x, weights = w, direction = "control-low", levels = c("A", "C") ) dplyr::glimpse(roc_tbl) # Compare "B" (controls) to "C" (cases) roc_tbl2 <- compute_sens_spec_from_ecdf( data = df, response = y, predictor = x, weights = w, direction = "control-low", levels = c("B", "C") ) dplyr::glimpse(roc_tbl2) library(ggplot2) # Plotting can be tricky bc sens and spec values repeat. geom_path() does the # right thing and walks along threshold values to plot the sens-spec pairs ggplot(roc_tbl) + aes(x = .specificities, y = .sensitivities) + geom_path() + scale_x_reverse() + coord_fixed(ratio = 1)
Create an ROC curve from smoothed densities
compute_smooth_density_roc( data, controls, cases, along = NULL, best_weights = c(1, 0.5), direction = "auto" )compute_smooth_density_roc( data, controls, cases, along = NULL, best_weights = c(1, 0.5), direction = "auto" )
data |
a dataframe containing densities |
controls, cases
|
bare column names for the densities of the control group and case group |
along |
optional bare column name for the response values |
best_weights |
weights for computing the best ROC curve points. Defaults
to |
direction |
|
the dataframe is updated with new columns for the .sensitivities,
.specificities, .auc, .roc_row, .is_best_youden and
.is_best_closest_topleft.
set.seed(100) x1 <- rnorm(100, 4, 1) x2 <- rnorm(100, 2, .5) both <- c(x1, x2) steps <- seq(min(both), max(both), length.out = 200) d1 <- dnorm(steps, mean(x1), sd(x1)) d2 <- dnorm(steps, mean(x2), sd(x2)) data <- tibble::tibble( y = steps, d1 = d1, d2 = d2, outcome = rbinom(200, 1, prob = 1 - (d1 / (d1 + d2))), group = ifelse(outcome, "case", "control") ) compute_smooth_density_roc(data, d1, d2) compute_smooth_density_roc(data, d1, d2, along = y) # terrible ROC because the response is not present (just the densities) data_shuffled <- data[sample(seq_len(nrow(data))), ] compute_smooth_density_roc(data_shuffled, d1, d2) # sorted along response first: correct AUC compute_smooth_density_roc(data_shuffled, d1, d2, along = y)set.seed(100) x1 <- rnorm(100, 4, 1) x2 <- rnorm(100, 2, .5) both <- c(x1, x2) steps <- seq(min(both), max(both), length.out = 200) d1 <- dnorm(steps, mean(x1), sd(x1)) d2 <- dnorm(steps, mean(x2), sd(x2)) data <- tibble::tibble( y = steps, d1 = d1, d2 = d2, outcome = rbinom(200, 1, prob = 1 - (d1 / (d1 + d2))), group = ifelse(outcome, "case", "control") ) compute_smooth_density_roc(data, d1, d2) compute_smooth_density_roc(data, d1, d2, along = y) # terrible ROC because the response is not present (just the densities) data_shuffled <- data[sample(seq_len(nrow(data))), ] compute_smooth_density_roc(data_shuffled, d1, d2) # sorted along response first: correct AUC compute_smooth_density_roc(data_shuffled, d1, d2, along = y)
This package provides dataframes of information about the consonants and
vowels in American English. The following datasets collect acquisition
(acq) features which (try to) characterize the expected acquisition or
speech-motor difficulty of speech sounds. See also
data_features_consonants.
data_acq_consonants data_acq_vowelsdata_acq_consonants data_acq_vowels
An object of class tbl_df (inherits from tbl, data.frame) with 24 rows and 16 columns.
An object of class tbl_df (inherits from tbl, data.frame) with 17 rows and 8 columns.
data_acq_consonants provides the following features:
knitr::kable(data_acq_consonants)
| phone | cmubet | wiscbet | cm2020_90_age_mean | cm2020_90_age_sd | cm2020_90_age_min | cm2020_90_age_max | cm2020_90_num_studies | cm2020_90_stage | s93_eights | k1992_set | kd2018_complexity | hml84_frequency | hml84_log10fpm | mhr82_frequency | mhr82_log10fpm |
| p | P | p | 33.2 | 6.9 | 24 | 48 | 12 | early | early | 1 | 3 | 50694 | 4.276690 | 14851 | 4.230197 |
| b | B | b | 31.4 | 7.8 | 24 | 48 | 13 | early | early | 2 | 4 | 51831 | 4.286323 | 18027 | 4.314365 |
| t | T | t | 38.5 | 9.2 | 24 | 60 | 13 | early | middle | 3 | 5 | 188536 | 4.847127 | 69108 | 4.897970 |
| d | D | d | 35.7 | 6.7 | 24 | 48 | 13 | early | early | 2 | 4 | 102205 | 4.581205 | 47838 | 4.738214 |
| k | K | k | 37.7 | 7.3 | 24 | 48 | 13 | early | middle | 2 | 4 | 73250 | 4.436541 | 25462 | 4.464334 |
| g | G | g | 36.8 | 6.6 | 24 | 48 | 13 | early | middle | 2 | 4 | 19422 | 3.860027 | 15789 | 4.256796 |
| tʃ | CH | tsh | 53.5 | 10.7 | 36 | 72 | 12 | middle | middle | 4 | 6 | 17147 | 3.805921 | 3149 | 3.556614 |
| dʒ | JH | dzh | 51.0 | 11.8 | 36 | 72 | 13 | middle | middle | 4 | 6 | 13220 | 3.692964 | 3015 | 3.537729 |
| m | M | m | 33.2 | 6.7 | 24 | 48 | 13 | early | early | 1 | 3 | 75850 | 4.451689 | 25799 | 4.470044 |
| n | N | n | 33.1 | 7.4 | 24 | 48 | 13 | early | early | 1 | 3 | 204939 | 4.883358 | 68331 | 4.893059 |
| ŋ | NG | ng | 40.3 | 10.8 | 24 | 55 | 10 | early | middle | 3 | 5 | 13692 | 3.708200 | 9102 | 4.017578 |
| f | F | f | 38.3 | 6.3 | 24 | 48 | 13 | early | middle | 2 | 4 | 51526 | 4.283759 | 10731 | 4.089082 |
| v | V | v | 50.8 | 10.8 | 36 | 66 | 12 | middle | middle | 4 | 6 | 66490 | 4.394489 | 8753 | 4.000598 |
| θ | TH | th | 77.0 | 7.4 | 72 | 96 | 10 | late | late | 4 | 6 | 18300 | 3.834184 | 4337 | 3.695631 |
| ð | DH | dh | 69.0 | 11.3 | 54 | 96 | 12 | late | late | 4 | 6 | 108602 | 4.607571 | 35995 | 4.614683 |
| s | S | s | 51.3 | 16.3 | 24 | 84 | 12 | middle | late | 4 | 6 | 114733 | 4.631421 | 33752 | 4.586741 |
| z | Z | z | 56.8 | 14.3 | 30 | 84 | 11 | middle | late | 4 | 6 | 54454 | 4.307763 | 24027 | 4.439141 |
| ʃ | SH | sh | 55.0 | 10.5 | 36 | 72 | 12 | middle | late | 4 | 6 | 21756 | 3.909312 | 4844 | 3.743645 |
| ʒ | ZH | zh | 70.7 | 12.2 | 60 | 84 | 3 | late | late | 4 | 6 | 1488 | 2.744336 | 100 | 2.058441 |
| h | HH | h | 35.0 | 7.0 | 24 | 48 | 13 | early | early | 1 | 3 | 51235 | 4.281300 | 18896 | 4.334811 |
| l | L | l | 53.8 | 10.4 | 24 | 60 | 12 | middle | late | 3 | 5 | 98287 | 4.564229 | 29343 | 4.525946 |
| r | R | r | 66.6 | 18.6 | 30 | 96 | 12 | late | late | 3 | 5 | 121548 | 4.656481 | 29534 | 4.528764 |
| w | W | w | 35.2 | 6.8 | 24 | 48 | 13 | early | early | 1 | 3 | 60251 | 4.351697 | 25171 | 4.459342 |
| j | Y | j | 45.8 | 11.0 | 30 | 60 | 13 | early | early | 2 | 4 | 16510 | 3.789480 | 8544 | 3.990103 |
Description of each column:
phone in IPA
phone in the CMU alphabet
phone in an older system used by our lab
Age of acquisition statistics reported by Crowe & McLeod (2020). Statistics are the mean, SD, min and max age (in months) when children reached 90% accuracy on a consonant.
Number of studies used by Crowe & McLeod (2020) to compute the corresponding statistics.
Developmental stage assigned to the consonant by
Crowe & McLeod (2020). Sounds with an age_mean before 48 months are
early, before 60 months are middle, and of 60 or older are late.
Developmental stage of Shriberg (1993)—that is,
the early 8, middle 8 and late 8 consonants.
Developmental set from Kent (1992). Sets corresponds to the age of 90% mastery in Sander (1972): Set 1 is mastered at age 3-years-old, Set 2 at age 4, Set 3 at age 6, and Set 4 at a later age.
Phonetic complexity scores from Kuruvilla-Dugdale et al. (2018). This scoring system is based on the development description of vowels and consonants in Kent (1992). The scores for individual segments range from 1 for the earliest vowels to 6 for the last-acquired consonants. Under this system, assign a score to each part of a syllable (onset, nucleus, coda) using these scores when the syllable part is a single segment and using scores of 7 and 8 for 2-consonant and 3-consonant clusters, respectively.
Raw frequency and log10 frequency per million of the phoneme in the Hoosier Mental Lexicon (Nusbaum, Pisoni, Pisoni, 1984) word-frequency dictionary.
Raw frequency and log10 frequency per million of the phoneme in the Moe, Hopkins, and Rush (1982) word frequency dictionary of first-graders.
data_acq_vowels provides the following features:
knitr::kable(data_acq_vowels)
| phone | cmubet | wiscbet | kd2018_complexity | hml84_frequency | hml84_log10fpm | mhr82_frequency | mhr82_log10fpm |
| i | IY | i | 2 | 82430 | 4.487818 | 28537 | 4.513850 |
| ɪ | IH | I | 4 | 195348 | 4.862542 | 43999 | 4.701884 |
| eɪ | EY | eI | 4 | 38966 | 4.162419 | 16679 | 4.280611 |
| ɛ | EH | E | 3 | 73107 | 4.435692 | 31322 | 4.554291 |
| æ | AE | ae | 4 | 106838 | 4.600459 | 40290 | 4.663639 |
| ʌ | AH | ^ | 1 | 44555 | 4.220629 | 22826 | 4.416871 |
| ə | AH | 4 | 1 | 231568 | 4.936412 | 38977 | 4.649250 |
| u | UW | u | 2 | 57320 | 4.330039 | 19188 | 4.341471 |
| ʊ | UH | U | 4 | 13557 | 3.703897 | 4288 | 3.690696 |
| oʊ | OW | oU | 2 | 61954 | 4.363802 | 17330 | 4.297240 |
| ɔ | AO | c | 3 | 26110 | 3.988540 | 12004 | 4.137767 |
| ɑ | AA | @ | 1 | 36589 | 4.135084 | 17326 | 4.297140 |
| aʊ | AW | @U | 3 | 16079 | 3.777992 | 7377 | 3.926321 |
| aɪ | AY | @I | 3 | 39275 | 4.165849 | 23807 | 4.435146 |
| ɔɪ | OY | cI | 3 | 2140 | 2.902147 | 744 | 2.930014 |
| ɝ | ER | 3^ | 5 | 17048 | 3.803406 | 4227 | 3.684474 |
| ɚ | ER | 4^ | 5 | 41966 | 4.194631 | 10676 | 4.086850 |
phone in IPA
phone in the CMU alphabet
phone in an older system used by our lab
Phonetic complexity scores from Kuruvilla-Dugdale et al. (2018). This scoring system is based on the development description of vowels and consonants in Kent (1992). The scores for individual segments range from 1 for the earliest vowels to 6 for the last-acquired consonants. Under this system, assign a score to each part of a syllable (onset, nucleus, coda) using these scores when the syllable part is a single segment and using scores of 7 and 8 for 2-consonant and 3-consonant clusters, respectively.
Raw frequency and log10 frequency per million of the phoneme in the Hoosier Mental Lexicon (Nusbaum, Pisoni, Pisoni, 1984) word-frequency dictionary.
Raw frequency and log10 frequency per million of the phoneme in the Moe, Hopkins, and Rush (1982) word frequency dictionary of first-graders.
Crowe and McLeod (2020, below as the cm2020_ variables) provides a
systematic review and summary statistics for age of acquisition norms for
English consonants. They scoured the literature of acquisition ages for
individual consonants and computed summary statistics on them. They
considered just accuracy of sounds when produced in single words. Their
sources include a mix of a journal articles and norms for articulation
assessments. They do not weight statistics from individual studies by sample
size or sampling procedure.
I prepared the Crowe and McLeod (2020) data by copying the relevant numbers from their Table 2 making the following changes: 1) rounding mean and SD values to 1 decimal point (3 days for ages in months), 2) dropping /ʍ/, 3) using /r/, /g/, /tʃ/, /dʒ/ for IPA characters instead of the specialized characters used in the article.
The hml84_frequency column provides the frequency count for the phonemes in
the Hoosier Mental Lexicon (Nusbaum, Pisoni, Pisoni, 1984). That is, we count
how many times the phonemes appear in each word in the word list and weight
them by the word frequency. For example, "ad" has two phonemes and a corpus
frequency of 99, so it counts for 99 /æ/ tokens and 99 /d/ tokens.
The HML frequency counts derive from the Brown Corpus of one million English words that were printed/published in 1961. The HML provides frequencies of phonological words, and homophones are combined into a single entry. For example, the word "ad" has a frequency of 99 (11 ad tokens plus 88 add tokens). That's why, I suppose, it's a mental lexicon. Approximately 8,000 words in the HML were not in the K&F frequency word list, and these are apparently assigned a frequency of 1.
The mhr82_frequency column was constructed in a similar way but the frequencies
were based on a corpus of words used by first-graders
(Moe, Hopkins, & Rush, 1982).
The hml84_log10fpm and mhr82_log10fpm columns provide the frequency in log-10
frequency per million which is more appropriate for analyses. Computing
frequency per million normalize the frequency counts across different
corpora, and log-frequency is better suited than raw or normalized frequency
counts.
I computed these phoneme frequencies independently, but retrieved my copies of the HML and MHR frequency-pronunciation tables from a course by Smith, Beckman and Foltz (2016).
The English consonants are often broken down into three developmental classes, based on Shriberg (1993):
Early 8: m b j n w d p h
Middle 8: t ŋ k g f v tʃ dʒ,
Late 8: ʃ θ s z ð l r ʒ
This classification is included as the s93_eights column.
From these names alone, we might interpret these classes such that sounds in the Early 8 would be acquired before the ones in the Middle 8, and likewise that the Middle 8 would be acquired before the Late 8. But these classes were not created by examining patterns of typical consonant acquisition.
For some context, Shriberg (1993) introduces the Early 8, Middle 8, and Late 8 data by describing the following panel of the article's Figure 7:
About which, Shriberg (1993) says: "The values for this trend, which is a profile of consonant mastery, were taken from a group of 64 3- to 6-year-old speech-delayed children Shriberg, Kwiatkowski, & Gruber, 1992). Severity of involvement of the 24 English consonants is represented as the percentage correct for each consonant sorted in decreasing order from left to right. Notice that the most obvious breaks in this function allow for a division of the 24 consonants into three groups of eight sounds termed the Early-8, averaging over 75% correct, the Middle-8, averaging 25%-75% correct, and the Late-8, including consonants averaging less than 25% correct in continuous conversational speech (/ʒ/ is infrequently represented in young, speech-delayed children's spontaneous conversational speech)."
So, there were 64 3–6-year-old children with speech delays, and consonant sounds were divided into three classes based on how often these children produced the sounds correctly on average in a conversational speech sample. This classification is not so much a measure of the relative ordering of speech sound development as it is the relative difficulty of these sounds for children with a speech delay of unknown origin. It would be more appropriate to replace the levels of Early/Middle/Late with Easy/Medium/Hard.
Phonetic complexity measures (k1992_set and kd2018_complexity)
assign the speech sounds different complexity levels based on biological
principles outlined in Kent (1992). Because Kent (1992) is a book
chapter that is not floating around online, it's worthwhile to review
the provenance of these complexity measures. In short, Kent (1992) applied
interpreted consonant and vowel development data in terms of their
motor demands.
Sander (1972) set out to construct a set of developmental norms for typical consonant acquisition in English. His big idea was to include the median age of acquisition as well as the 90th percentile age of acquisition. The median can tell us something about the average acquisition of the speech sounds, and the 90th percentile can set a benchmark for delayed acquisition. There are some quirks of the methodology. First, Sander (1972) was targeting "customary articulation" which was defined using production accuracy average across word positions. So, the age of 50% customary articulation for /t/ is the earliest age when the average of word-initial accuracy, word-medial accuracy and word-final accuracy is greater than 50%. Second, the norms for this study were created by augmenting data from 3–8-year-olds (Templin, 1957; n = 480) with some earlier data for 2-year-olds (Wellman et al. 1931; n = 15).
Sander (1972) presented these acquisition norms in the following figure:
Kent (1992) aimed to explain the course of English sound development in terms of biological and motoric principles. He examined the ages of 90% acquisition from Sander (1972)—that is, the right edges of the bars in the previous figure—and observed that /p m n w h/ are mastered at age 3, /b d k g j f/ at age 4, /t ŋ r l/ at age 6 and /s z ʃ ʒ v θ ð tʃ dʒ/ after age 7. He then described motoric demands in each of these sets of sounds. I'll paraphrase:
Set 1 requires fast "ballistic" movements for stops /p m n/, slow "ramp" movements for /w h/, velopharyngeal control for oral-nasal contrast, laryngeal control for voicing contrast.
Set 2 adds more stops /b d k g/ and another ramp /j/ and a new place of articulation (velars), but also requires "fine force regulation for frication" for /f/.
Set 3 adds more stops /t ŋ/, but also requires tongue "bending" for /r/ and /l/.
Set 4 adds more lingual fricatives /s z ʃ ʒ θ ð/ which require tongue bending and fine force control along with /v tʃ dʒ/. Kent does not characterize the motor demands for the affricates /tʃ dʒ/.
Let's pause for a moment and observe that this breakdown is just an attempt to describe the Sander (1972) norms, and it is somewhat underdeveloped. For example, why is /t/ in Set 3 but /d/ in Set 2? It is not answered here, but I think this late mastery is an artefact of Sander's requirement of 90% accuracy averaging over the three word-positions. The medial and final productions of /t/ might require allophonic variation in /t/ (e.g., flapping or glottalization), so mastery of /t/ would require different motor gestures and some phonological knowledge on the part of the child. But in Kent's description /t/ is a later-mastered ballistic movement.
Still, the main point of Kent's description, I think, is that lingual (tongue) consonants are more difficult. Elsewhere in the chapter, Kent (1992) describes how the tongue is a "muscular hydrostat" like an elephant trunk, and bending a hydrostat requires coordination of different muscle directions:
"Gaining motor control over a hydrostat presents some special problems to the young child learning speech. For one, bending the hydrostat is unlike bending a jointed structure such as a finger. The tongue has no joints per se; it flexes by appropriate contraction of its three-dimensional network of intrinsic longitudinal, vertical, and transverse fibers. Bending a hydrostat requires that muscle fibers be shortened on one aspect simultaneously with a resistance to a change in diameter (Smith and Kier 1989). If the diameter change is not resisted, then the hydrostat will shorten on one side but will not bend. To use the tongue in speech, the child must learn to control the tongue to meet skeletal, movement, and shaping requirements, often simultaneously. These special characteristics of the tongue may well play a role in vowel and consonant mastery."
Kim and colleagues (2010) applied these developmental sets (k1992_set) as
articulatory complexity levels while
examining consonant errors in dysarthric speech. They then asked
questions such as whether more complex consonants had more consonant
errors than less complex ones (yes) or whether lower intelligibility
speakers made more complexity-reducing consonant substitutions than
higher intelligibility speakers (apparently so). Examining the speech
of 5-year-olds, Allison and Hustad (2014) later used these complexity
levels as a way to score the phonetic complexity of sentences. They
assigned consonants 1–5 scores (the 1–4 complexity levels with a score
of 5 for consonants clusters), and summed up the scores to provide a
complexity score for a sentence. Three of the eight 5-year-olds with
dysarthria showed a negative effect of sentence complexity on
intelligibility.
Kent (1992) also described the yearly developmental progression of vowels. I'll paraphrase again:
By age 1: Infants produce vocants (vowel precursors) which correspond to the low-front, central and low-back vowels /æ ɛ ʌ ə ɑ/. Thus, the tongue only moves in the anterior-posterior direction (i.e., there is limited up-down movement).
By age 2: Toddlers produce the "maximally dissimilar" corner vowels /i u ɑ/ and produce /o/ and the central vowels /ʌ ə/.
By age 3: Children incorporate two lower vowels /ɛ ɔ/ and the diphthongs /aɪ aʊ ɔɪ/ which require gliding movements.
By age 4: Children incorporate the remaining non-rhotic vowels /ʊ ɪ e æ/. The appearance of the front vowels suggests that tongue-jaw coordination is a relatively late motor achievement. (/i/ appears earlier because its extreme height is easy.)
Lastly: Children incorporate /ɚ ɝ/ last because these r-colored vowels require tongue bending.
Kuruvilla-Dugdale and colleagues (2018) used this description to
incorporate vowels into the phonetic complexity scale
(kd2018_complexity). The /ʌ ə ɑ/ vocants from age 1 and the vowels
from age 2 mark the bottom of the complexity scale. The vowels that are
acquired at ages 3, 4 and afterwards are assigned to the consonant
complexity levels with the same age of mastery. Finally, consonant
clusters serve as the ceiling for the scale:
Table: Phonetic complexity scores from Kuruvilla-Dugdale et al. (2018).
| kd2018_complexity | consonants | vowels |
| 1 | ʌ ə ɑ | |
| 2 | i u oʊ | |
| 3 | p m n h w | ɛ ɔ aʊ aɪ ɔɪ |
| 4 | b d k g f j | ɪ eɪ æ ʊ |
| 5 | t ŋ l r | ɝ ɚ |
| 6 | tʃ dʒ v θ ð s z ʃ ʒ | |
| 7 | 2-consonant clusters | |
| 8 | 3-consonant clusters |
It is not clear how to apply this scale, so my approach has been to break words into subsyllabic units and assign scores to the syllable onsets, nucleui and codas in each word. For example, "jump" is /dʒ/ + /ʌ/ + /mp/ so it would have complexity of 6 + 1 + 7 = 14, and "jumper" includes a syllable break between the cluster, so it would have a score 6 + 1 + 1 + 1 + 5 = 14.
Kuruvilla-Dugdale and colleagues (2018) used this scoring system to compare intelligibility for low complexity versus high complexity words. For example, for speakers with ALS and mild dysarthria, there was statistically clear reduction in intelligibility for high complexity words but not for low complexity words. I applied this scoring system on single-word intelligibility in children's speech (Mahr & Hustad, 2023). There was a probable but not statistically clear negative effect of complexity on intelligibility over and above the effects of age, word frequency and word neighborhood competition. (Regrettably, I coded the one consonant cluster in the word list with a complexity of 8 instead of 7, but otherwise this is the approach.)
Allison, K. M., & Hustad, K. C. (2014). Impact of sentence length and phonetic complexity on intelligibility of 5-year-old children with cerebral palsy. International Journal of Speech-Language Pathology, 16(4), 396–407. https://doi.org/10.3109/17549507.2013.876667
Crowe, K., & McLeod, S. (2020). Children’s English Consonant Acquisition in the United States: A Review. American Journal of Speech-Language Pathology, 29(4), 2155–2169. https://doi.org/10.1044/2020_AJSLP-19-00168
Kent, R. D. (1992). The Biology of Phonological Development. In C. A. Ferguson, L. Menn, & C. Stoel-Gammon (Eds.), Phonological development: Models, research, implications (pp. 65–90). York Press.
Kim, H., Martin, K., Hasegawa-Johnson, M., & Perlman, A. (2010). Frequency of consonant articulation errors in dysarthric speech. Clinical Linguistics and Phonetics, 24(10), 759–770. https://doi.org/10.3109/02699206.2010.497238
Kuruvilla-Dugdale, M., Custer, C., Heidrick, L., Barohn, R., & Govindarajan, R. (2018). A Phonetic Complexity-Based Approach for Intelligibility and Articulatory Precision Testing: A Preliminary Study on Talkers With Amyotrophic Lateral Sclerosis. Journal of Speech, Language, and Hearing Research, 61(9), 2205–2214. https://doi.org/10.1044/2018_JSLHR-S-17-0462
Mahr, T. J., & Hustad, K. C. (2023). Lexical Predictors of Intelligibility in Young Children’s Speech. Journal of Speech, Language, and Hearing Research, 66(8S), 3013–3025. https://doi.org/10.1044/2022_JSLHR-22-00294
Sander, E. K. (1972). When are Speech Sounds Learned? Journal of Speech and Hearing Disorders, 37(1), 55–63. https://doi.org/10.1044/jshd.3701.55
Shriberg, L. D. (1993). Four New Speech and Prosody-Voice Measures for Genetics Research and Other Studies in Developmental Phonological Disorders. Journal of Speech, Language, and Hearing Research, 36(1), 105–140. https://doi.org/10.1044/jshr.3601.105
A dataset of simulated intelligibility scores for testing and demonstrating modeling functions. These were created by fitting a Bayesian model of the raw Hustad and colleagues (2020) and drawing 1 sample from the posterior distribution of expected predictions (i.e., "epreds). In other words, these values are model predictions of the original dataset. They are correlated with original dataset values at r = .86. We might think of the simulation as adding random noise to the original dataset.
data_example_intelligibility_by_lengthdata_example_intelligibility_by_length
A data frame with 694 rows and 5 variables:
identifier for the child
child's age in months
length of the child's longest utterance
utterance length
child's intelligibility for the given utterance length (proportion of words said by the child that were correctly transcribed by two listeners)
Hustad, K. C., Mahr, T., Natzke, P. E. M., & Rathouz, P. J. (2020). Development of Speech Intelligibility Between 30 and 47 Months in Typically Developing Children: A Cross-Sectional Study of Growth. Journal of Speech, Language, and Hearing Research, 63(6), 1675–1687. https://doi.org/10.1044/2020_JSLHR-20-00008
A dataset of fake intelligibility scores for testing and demonstrating
modeling functions. These were created by randomly sampling 200 rows of an
intelligibility dataset and adding random noise to the age_months and
intelligibility variables. These values do not measure any real children
but represent plausible age and intelligibility measurements from our kind of
work.
data_fake_intelligibilitydata_fake_intelligibility
A data frame with 200 rows and 2 variables:
child's age in months
child's intelligibility (proportion of words said by the child that were correctly transcribed by two listeners)
A dataset of fake speaking rate measures for testing and demonstrating
modeling functions. These were created by randomly
sampling 200 rows of a speaking rate dataset and adding random noise to the
age_months and speaking_sps variables. These values do not measure any
real children but represent plausible age and rate measurements from our
kind of work.
data_fake_ratesdata_fake_rates
A data frame with 200 rows and 2 variables:
child's age in months
child's speaking rate in syllables per second
This package provides dataframes of information about the consonants and vowels in American English. The phonetic features are conventional descriptions of how sounds are produced. See also data_acq_consonants.
data_features_consonants data_features_vowelsdata_features_consonants data_features_vowels
An object of class tbl_df (inherits from tbl, data.frame) with 24 rows and 11 columns.
An object of class tbl_df (inherits from tbl, data.frame) with 17 rows and 13 columns.
Most of the phonetic features are self-evident and definitional. For example, /p/ is the bilabial voiceless stop. For fuzzier features, I consulted the general IPA chart and the Wikipedia page on English phonology. These issues included things like: what are the lax vowels again? or the last two rows of the consonant tables are approximants, so /r,l,j/ are approximants.
Some features have alternative feature sets in order to accommodate degrees
of aggregation. For example, /r,l,j,w/ are approximant in manner
but divided into liquid and glide in manner_alt.
data_features_consonants is a dataframe with 24 rows and
10 variables.
knitr::kable(data_features_consonants)
| phone | cmubet | wiscbet | voicing | voicing_alt | manner | manner_alt | place | place_fct | sonorance | sonorance_alt |
| p | P | p | voiceless | spread_glottis | stop | stop | labial | labial | obstruent | obstruent |
| b | B | b | voiced | plain | stop | stop | labial | labial | obstruent | obstruent |
| t | T | t | voiceless | spread_glottis | stop | stop | alveolar | alveolar | obstruent | obstruent |
| d | D | d | voiced | plain | stop | stop | alveolar | alveolar | obstruent | obstruent |
| k | K | k | voiceless | spread_glottis | stop | stop | velar | velar | obstruent | obstruent |
| g | G | g | voiced | plain | stop | stop | velar | velar | obstruent | obstruent |
| tʃ | CH | tsh | voiceless | spread_glottis | affricate | affricate | postalveolar | postalveolar | obstruent | strident |
| dʒ | JH | dzh | voiced | plain | affricate | affricate | postalveolar | postalveolar | obstruent | strident |
| m | M | m | voiced | NA | nasal | nasal | labial | labial | sonorant | sonorant |
| n | N | n | voiced | NA | nasal | nasal | alveolar | alveolar | sonorant | sonorant |
| ŋ | NG | ng | voiced | NA | nasal | nasal | velar | velar | sonorant | sonorant |
| f | F | f | voiceless | spread_glottis | fricative | fricative | labiodental | labiodental | obstruent | strident |
| v | V | v | voiced | plain | fricative | fricative | labiodental | labiodental | obstruent | strident |
| θ | TH | th | voiceless | spread_glottis | fricative | fricative | dental | dental | obstruent | obstruent |
| ð | DH | dh | voiced | plain | fricative | fricative | dental | dental | obstruent | obstruent |
| s | S | s | voiceless | spread_glottis | fricative | fricative | alveolar | alveolar | obstruent | strident |
| z | Z | z | voiced | plain | fricative | fricative | alveolar | alveolar | obstruent | strident |
| ʃ | SH | sh | voiceless | spread_glottis | fricative | fricative | postalveolar | postalveolar | obstruent | strident |
| ʒ | ZH | zh | voiced | plain | fricative | fricative | postalveolar | postalveolar | obstruent | strident |
| h | HH | h | voiceless | spread_glottis | fricative | fricative | glottal | glottal | obstruent | obstruent |
| l | L | l | voiced | NA | approximant | liquid | alveolar | alveolar | sonorant | sonorant |
| r | R | r | voiced | NA | approximant | liquid | postalveolar | postalveolar | sonorant | sonorant |
| w | W | w | voiced | NA | approximant | glide | labiovelar | NA | sonorant | sonorant |
| j | Y | j | voiced | NA | approximant | glide | palatal | palatal | sonorant | sonorant |
Description of each column:
phone in IPA
phone in the CMU alphabet
phone in an older system used by our lab
voiced versus voiceless
spread_glottis versus plain
manner of articulation
alternative manner coding that separates approximants into liquids and glides
place of articulation
place coded as a factor and ordered based on
frontness of the articulators. labiovelar is recoded as NA.
obstruent versus sonorant
obstruant versus sonorant versus strident.
Levels of the factor columns:
data_features_consonants |> lapply(levels) |> Filter(length, x = _) #> $place_fct #> [1] "labial" "labiodental" "dental" "alveolar" "postalveolar" #> [6] "palatal" "velar" "glottal"
The CMU alphabet does not include a glottal stop.
Here /f,v/ are coded as strident following Wikipedia and Sound Pattern of English. If this feature value doesn't seem right, we should probably use an alternative feature of sibilant for the stridents minus /f,v/.
The alternative voicing scheme was suggested by a colleague because of how
the voice-voiceless phonetic contrast is achieved with different articulatory
strategies in different languages. Note that voicing_alt does not assign
a feature to nasals or approximants.
data_features_vowels is a dataframe with 17 rows and
11 variables.
knitr::kable(data_features_vowels)
| phone | cmubet | wiscbet | hint | manner | manner_alt | tenseness | height | height_fct | height_alt | backness | backness_fct | rounding |
| i | IY | i | beat | vowel | vowel | tense | high | high | high | front | front | unrounded |
| ɪ | IH | I | bit | vowel | vowel | lax | mid-high | mid-high | high | front | front | unrounded |
| eɪ | EY | eI | bait | vowel | vowel | tense | mid-high | mid-high | mid | front | front | unrounded |
| ɛ | EH | E | bet | vowel | vowel | lax | mid-low | mid-low | mid | front | front | unrounded |
| æ | AE | ae | bat | vowel | vowel | lax | low | low | low | front | front | unrounded |
| ʌ | AH | ^ | but | vowel | vowel | lax | mid-low | mid-low | mid | central | central | unrounded |
| ə | AH | 4 | comma | vowel | vowel | lax | mid-low | mid-low | mid | central | central | unrounded |
| u | UW | u | boot | vowel | vowel | tense | high | high | high | back | back | rounded |
| ʊ | UH | U | book | vowel | vowel | lax | mid-high | mid-high | high | back | back | rounded |
| oʊ | OW | oU | boat | vowel | vowel | tense | mid-high | mid-high | mid | back | back | rounded |
| ɔ | AO | c | bought | vowel | vowel | tense | mid-low | mid-low | low | back | back | rounded |
| ɑ | AA | @ | bot | vowel | vowel | tense | low | low | low | back | back | unrounded |
| aʊ | AW | @U | bout | vowel | diphthong | diphthong | diphthong | NA | diphthong | diphthong | NA | diphthong |
| aɪ | AY | @I | bite | vowel | diphthong | diphthong | diphthong | NA | diphthong | diphthong | NA | diphthong |
| ɔɪ | OY | cI | boy | vowel | diphthong | diphthong | diphthong | NA | diphthong | diphthong | NA | diphthong |
| ɝ | ER | 3^ | letter | vowel | r-colored | r-colored | mid-low | mid-low | mid | central | central | r-colored |
| ɚ | ER | 4^ | burt | vowel | r-colored | r-colored | mid-low | mid-low | mid | central | central | r-colored |
Description of each column:
phone in IPA
phone in the CMU alphabet
phone in an older system used by our lab
a word containing the selected vowel
manner of articulation
alternative manner with vowel, diphthong and r-colored
tense versus lax (versus diphthong and r-colored)
vowel height on a four-level scale
height coded as a factor ordered high,
mid-high, mid-low, low. diphthong is recoded to NA.
vowel height on a three-level scale
vowel backness
backness coded as a factor ordered front,
central, back. diphthong is recoded to NA.
unrounded versus rounded (versus diphthong and r-colored)
Levels of the factor columns:
data_features_vowels |> lapply(levels) |> Filter(length, x = _) #> $height_fct #> [1] "high" "mid-high" "mid-low" "low" #> #> $backness_fct #> [1] "front" "central" "back"
I don't consider /eɪ/ and /oʊ/ to be diphthongs, but perhaps manner_alt
could encode the difference of these vowels from the others.
In the CMU alphabet and ARPAbet, vowels can include a number to indicate
vowel stress, so AH1 or AH2 is /ʌ/ but AH0 is /ə/.
The vowel features for General American English, according to Wikipedia, are as follows:
| Front | Central | Back | ||||
|---|---|---|---|---|---|---|
| lax | tense | lax | tense | lax | tense | |
| Close | ɪ | i | ʊ | u | ||
| Mid | ɛ | eɪ | ə | (ɜ) | oʊ | |
| Open | æ | (ʌ) | ɑ | (ɔ) | ||
| Diphthongs | aɪ ɔɪ aʊ | |||||
I adapted these features in this way:
A four-level breakdown of height (high, mid-high, mid-low, low) was used instead of a three-level one (close, mid, open).
tense and lax features were directly borrowed. Diphthongs and r-colored vowels are were not assign a tenseness.
/ɑ/ moved to back (following the general IPA)
diphthongs have no backness or height
r-colored vowels were given the backness and height of the /ʌ,ə/
Based on the assumption that /ʌ,ə/ are the same general vowel with differing stress, these vowels have the same features. This definition clashes with the general IPA chart which places /ʌ/ as a back vowel. However, /ʌ/ is a conventional notation. Quoting Wikipedia again: "Although the notation /ʌ/ is used for the vowel of STRUT in RP and General American, the actual pronunciation is closer to a near-open central vowel [ɐ] in RP and advanced back [ʌ̟] in General American." That is, /ʌ/ is fronted in American English (hence, mid) in American English.
We use items from the Test of Children's Speech (TOCS+, Hodge & Gotzke, 2014) in our child speech and listener intelligibility studies.
data_tocs_itemsdata_tocs_items
A data frame with 103 rows and 4 variables:
whether the item is a single-word or multiword utterance
number of words in the prompt
identifier for the item (e.g., within filenames)
the actual prompt presented to the child
any notes about an item's usage in experiments or analyses
Most of these items are subset from a larger pool of items on the TOCS+. Hustad and colleagues (2021) is a supplemental material for a larger study about speech intelligibility which describes where the items were taken from within the larger TOCS+ task. (It is unclear, as I write this in 2026, which items were added by our group to the TOCS+ pool, so I have added "possibly not an original TOCS item" to items that do not appear in that supplemental material.)
In this picture-prompted repetition task, a child hears a prompt,
repeats it (production), and later on listeners transcribe it
(transcription). In our listener-response database, we differentiate these
three forms with fields named tocs_prompt (prompt), sentence
(production), and response (transcription).
Hodge, M. M., & Gotzke, C. (2014). Construct-related validity of the TOCS+ measures: Comparison of intelligibility and speaking rate scores in children with and without speech disorders. Journal of Communication Disorders, 51, 51–63. https://doi.org/10.1016/j.jcomdis.2014.06.007
Hustad, K. C., Mahr, T. J., Natzke, P., & Rathouz, P. J. (2021). Supplemental Material 1: Intelligibility growth between 30 and 119 months (Hustad et al., 2021) (Version 1). ASHA Journals. https://doi.org/10.23641/asha.16583426.v1
file_replace_name() uses stringr::str_replace() to rename files.
file_rename_with() allows you to rename files with a generic
string-transforming function.
file_replace_name( path, pattern, replacement, .dry_run = FALSE, .overwrite = FALSE ) file_rename_with(path, .fn, ..., .dry_run = FALSE, .overwrite = FALSE)file_replace_name( path, pattern, replacement, .dry_run = FALSE, .overwrite = FALSE ) file_rename_with(path, .fn, ..., .dry_run = FALSE, .overwrite = FALSE)
path |
vector of paths for files to rename |
pattern, replacement
|
arguments forwarded to |
.dry_run |
when |
.overwrite |
Whether to overwrite files. Defaults to |
.fn |
function to call file paths |
... |
arguments passed onto |
Only the basename of the file (returned by basename()
undergoes string replacement).
the contents of paths with updated file names. Duplicated elements
are removed. This function throws an error if a name collision is detected
(where two files are both renamed into the same target path).
# With .dry_run = TRUE, we can make up some file paths. dir <- "//some-fake-location/" path <- file.path( dir, c("report_1.csv", "report_2.csv", "report-1.csv", "skipped.csv") ) updated <- file_replace_name(path, "report_", "report-", .dry_run = TRUE) # Collisions are detected updated <- file_replace_name(path, "report_\\d", "report-1", .dry_run = TRUE)# With .dry_run = TRUE, we can make up some file paths. dir <- "//some-fake-location/" path <- file.path( dir, c("report_1.csv", "report_2.csv", "report-1.csv", "skipped.csv") ) updated <- file_replace_name(path, "report_", "report-", .dry_run = TRUE) # Collisions are detected updated <- file_replace_name(path, "report_\\d", "report-1", .dry_run = TRUE)
The function fits the same type of GAMLSS model as used in Hustad and colleagues (2021):
A beta regression model (via gamlss.dist::BE()) with natural cubic splines
on the mean (mu) and scale (sigma). This model is fitted using this package's
mem_gamlss() wrapper function.
fit_beta_gamlss(data, var_x, var_y, df_mu = 3, df_sigma = 2, control = NULL) fit_beta_gamlss_se( data, name_x, name_y, df_mu = 3, df_sigma = 2, control = NULL ) predict_beta_gamlss(newdata, model, centiles = c(5, 10, 50, 90, 95)) optimize_beta_gamlss_slope( model, centiles = 50, interval = c(30, 119), maximum = TRUE ) uniroot_beta_gamlss(model, centiles = 50, targets = 0.5, interval = c(30, 119))fit_beta_gamlss(data, var_x, var_y, df_mu = 3, df_sigma = 2, control = NULL) fit_beta_gamlss_se( data, name_x, name_y, df_mu = 3, df_sigma = 2, control = NULL ) predict_beta_gamlss(newdata, model, centiles = c(5, 10, 50, 90, 95)) optimize_beta_gamlss_slope( model, centiles = 50, interval = c(30, 119), maximum = TRUE ) uniroot_beta_gamlss(model, centiles = 50, targets = 0.5, interval = c(30, 119))
data |
a data frame |
var_x, var_y
|
(unquoted) variable names giving the predictor variable
(e.g., |
df_mu, df_sigma
|
degrees of freedom. If |
control |
a |
name_x, name_y
|
quoted variable names giving the predictor variable
(e.g., |
newdata |
a one-column dataframe for predictions |
model |
a model fitted by |
centiles |
centiles to use for prediction. Defaults to
|
interval |
for |
maximum |
for |
targets |
for |
There are two versions of this function. The main version is
fit_beta_gamlss(), and it works with unquoted column names (e.g.,
age). The alternative version is fit_beta_gamlss_se(); the final
"se" stands for "Standard Evaluation". This designation means that the
variable names must be given as strings (so, the quoted "age" instead of
bare name age). This alternative version is necessary when we fit several
models using parallel computing with furrr::future_map() (as when using
bootstrap resampling).
predict_centiles() will work with this function, but it will likely throw a
warning message. Therefore, predict_beta_gamlss() provides an alternative
way to compute centiles from the model. This function manually computes the
centiles instead of relying on gamlss::centiles(). The main difference is
that new x values go through splines::predict.ns() and then these are
multiplied by model coefficients.
optimize_beta_gamlss_slope() computes the point (i.e., age) and rate of
steepest growth for different quantiles. This function wraps over the
following process:
an internal prediction function computes a quantile at some x from model
coefficients and spline bases.
another internal function uses numDeriv::grad() to get the gradient
of this prediction function for x.
optimize_beta_gamlss_slope() uses stats::optimize() on the gradient
function to find the x with the maximum or minimum slope.
uniroot_beta_gamlss() also uses this internal prediction function to find
when a quantile growth curve crosses a given value. stats::uniroot() finds
where a function crosses 0 (a root). If we modify our prediction function to
always subtract .5 at the end, then the root for this prediction function
would be the x value where the predicted value crosses .5. That's the idea
behind how uniroot_beta_gamlss() works. In our work, we would use this
approach to find, say, the age (root) when children in the 10th percentile
(centiles) cross 50% intelligibility (targets).
This part is a brief note that GAMLSS uses a different parameterization of the beta distribution for its beta family than other packages.
The canonical parameterization of the beta distribution uses shape parameters
and and the probability density function:
where is the beta function.
For beta regression, the distribution is reparameterized so that there is a
mean probability and some other parameter that represents the
spread around that mean. In GAMLSS (gamlss.dist::BE()), they use a scale
parameter (larger values mean more spread around mean).
Everywhere else—betareg::betareg() and rstanarm::stan_betareg() in
vignette("betareg", "betareg"), brms::Beta() in
vignette("brms_families", "brms"), mgcv::betar()—it's a precision
parameter (larger values mean more precision, less spread around
mean). Here is a comparison:
for fit_beta_gamlss() and fit_beta_gamlss_se(), a
mem_gamlss()-fitted model. The .user data in the model includes degrees
of freedom for each parameter and the splines::ns() basis for each
parameter. For predict_beta_gamlss(), a dataframe containing the
model predictions for mu and sigma, plus columns for each centile in
centiles. For optimize_beta_gamlss_slope(), a dataframe with the
optimized x values (maximum or minimum), the gradient at that
x value (objective), and the quantile (quantile). For
uniroot_beta_gamlss(), a dataframe one row per quantile/target
combination with the results of calling stats::uniroot(). The
root column is the x value where the quantile curve crosses the
target value.
Associated article: https://doi.org/10.1044/2021_JSLHR-21-00142
data_fake_intelligibility m <- fit_beta_gamlss( data_fake_intelligibility, age_months, intelligibility ) # using "qr" in summary() just to suppress a warning message summary(m, type = "qr") # Alternative interface d <- data_fake_intelligibility m2 <- fit_beta_gamlss_se( data = d, name_x = "age_months", name_y = "intelligibility" ) coef(m2) == coef(m) # how to use control to change gamlss() behavior m_traced <- fit_beta_gamlss( data_fake_intelligibility, age_months, intelligibility, control = gamlss::gamlss.control(n.cyc = 15, trace = TRUE) ) # The `.user` space includes the spline bases, so that we can make accurate # predictions of new xs. names(m$.user) # predict logit(mean) at 55 months: logit_mean_55 <- cbind(1, predict(m$.user$basis_mu, 55)) %*% coef(m) logit_mean_55 stats::plogis(logit_mean_55) # But predict_gen_gamma_gamlss() does this work for us and also provides # centiles new_ages <- data.frame(age_months = 48:71) centiles <- predict_beta_gamlss(new_ages, m) centiles # Confirm that the manual prediction matches the automatic one centiles[centiles$age_months == 55, "mu"] stats::plogis(logit_mean_55) if(requireNamespace("ggplot2", quietly = TRUE)) { library(ggplot2) ggplot(pivot_centiles_longer(centiles)) + aes(x = age_months, y = .value) + geom_line(aes(group = .centile, color = .centile_pair)) + geom_point( aes(y = intelligibility), data = subset( data_fake_intelligibility, 48 <= age_months & age_months <= 71 ) ) } # Age of steepest growth for each centile optimize_beta_gamlss_slope( model = m, centiles = c(5, 10, 50, 90), interval = range(data_fake_intelligibility$age_months) ) # Manual approach: Make fine grid of predictions and find largest jump centiles_grid <- predict_beta_gamlss( newdata = data.frame(age_months = seq(28, 95, length.out = 1000)), model = m ) centiles_grid[which.max(diff(centiles_grid$c5)), "age_months"] # When do children in different centiles reach 50%, 70% intelligibility? uniroot_beta_gamlss( model = m, centiles = c(5, 10, 50), targets = c(.5, .7) )data_fake_intelligibility m <- fit_beta_gamlss( data_fake_intelligibility, age_months, intelligibility ) # using "qr" in summary() just to suppress a warning message summary(m, type = "qr") # Alternative interface d <- data_fake_intelligibility m2 <- fit_beta_gamlss_se( data = d, name_x = "age_months", name_y = "intelligibility" ) coef(m2) == coef(m) # how to use control to change gamlss() behavior m_traced <- fit_beta_gamlss( data_fake_intelligibility, age_months, intelligibility, control = gamlss::gamlss.control(n.cyc = 15, trace = TRUE) ) # The `.user` space includes the spline bases, so that we can make accurate # predictions of new xs. names(m$.user) # predict logit(mean) at 55 months: logit_mean_55 <- cbind(1, predict(m$.user$basis_mu, 55)) %*% coef(m) logit_mean_55 stats::plogis(logit_mean_55) # But predict_gen_gamma_gamlss() does this work for us and also provides # centiles new_ages <- data.frame(age_months = 48:71) centiles <- predict_beta_gamlss(new_ages, m) centiles # Confirm that the manual prediction matches the automatic one centiles[centiles$age_months == 55, "mu"] stats::plogis(logit_mean_55) if(requireNamespace("ggplot2", quietly = TRUE)) { library(ggplot2) ggplot(pivot_centiles_longer(centiles)) + aes(x = age_months, y = .value) + geom_line(aes(group = .centile, color = .centile_pair)) + geom_point( aes(y = intelligibility), data = subset( data_fake_intelligibility, 48 <= age_months & age_months <= 71 ) ) } # Age of steepest growth for each centile optimize_beta_gamlss_slope( model = m, centiles = c(5, 10, 50, 90), interval = range(data_fake_intelligibility$age_months) ) # Manual approach: Make fine grid of predictions and find largest jump centiles_grid <- predict_beta_gamlss( newdata = data.frame(age_months = seq(28, 95, length.out = 1000)), model = m ) centiles_grid[which.max(diff(centiles_grid$c5)), "age_months"] # When do children in different centiles reach 50%, 70% intelligibility? uniroot_beta_gamlss( model = m, centiles = c(5, 10, 50), targets = c(.5, .7) )
The function fits the same type of GAMLSS model as used in Mahr and colleagues (2021): A
generalized gamma regression model (via gamlss.dist::GG()) with natural
cubic splines on the mean (mu), scale (sigma), and shape (nu) of the
distribution. This model is fitted using this package's mem_gamlss()
wrapper function.
fit_gen_gamma_gamlss( data, var_x, var_y, df_mu = 3, df_sigma = 2, df_nu = 1, control = NULL ) fit_gen_gamma_gamlss_se( data, name_x, name_y, df_mu = 3, df_sigma = 2, df_nu = 1, control = NULL ) predict_gen_gamma_gamlss(newdata, model, centiles = c(5, 10, 50, 90, 95))fit_gen_gamma_gamlss( data, var_x, var_y, df_mu = 3, df_sigma = 2, df_nu = 1, control = NULL ) fit_gen_gamma_gamlss_se( data, name_x, name_y, df_mu = 3, df_sigma = 2, df_nu = 1, control = NULL ) predict_gen_gamma_gamlss(newdata, model, centiles = c(5, 10, 50, 90, 95))
data |
a data frame |
var_x, var_y
|
(unquoted) variable names giving the predictor variable
(e.g., |
df_mu, df_sigma, df_nu
|
degrees of freedom. If |
control |
a |
name_x, name_y
|
quoted variable names giving the predictor variable
(e.g., |
newdata |
a one-column dataframe for predictions |
model |
a model fitted by |
centiles |
centiles to use for prediction. Defaults to |
There are two versions of this function. The main version is
fit_gen_gamma_gamlss(), and it works with unquoted column names (e.g.,
age). The alternative version is fit_gen_gamma_gamlss_se(); the final
"se" stands for "Standard Evaluation". This designation means that the
variable names must be given as strings (so, the quoted "age" instead of
bare name age). This alternative version is necessary when we fit several
models using parallel computing with furrr::future_map() (as when using
bootstrap resampling).
predict_centiles() will work with this function, but it will likely throw a
warning message. Therefore, predict_gen_gamma_gamlss() provides an
alternative way to compute centiles from the model. This function manually
computes the centiles instead of relying on gamlss::centiles(). The main
difference is that new x values go through splines::predict.ns() and then
these are multiplied by model coefficients.
for fit_gen_gamma_gamlss() and fit_gen_gamma_gamlss_se(), a
mem_gamlss()-fitted model. The .user data in the model includes degrees
of freedom for each parameter and the splines::ns() basis for each
parameter. For predict_gen_gamma_gamlss(), a dataframe containing the
model predictions for mu, sigma, and nu, plus columns for each centile in
centiles.
Associated article: https://doi.org/10.1044/2021_JSLHR-21-00206
data_fake_rates m <- fit_gen_gamma_gamlss(data_fake_rates, age_months, speaking_sps) # using "qr" in summary() just to suppress a warning message summary(m, type = "qr") # Alternative interface d <- data_fake_rates m2 <- fit_gen_gamma_gamlss_se( data = d, name_x = "age_months", name_y = "speaking_sps" ) coef(m2) == coef(m) # how to use control to change gamlss() behavior m_traced <- fit_gen_gamma_gamlss( data_fake_rates, age_months, speaking_sps, control = gamlss::gamlss.control(n.cyc = 15, trace = TRUE) ) # The `.user` space includes the spline bases, so that we can make accurate # predictions of new xs. names(m$.user) # predict log(mean) at 55 months: log_mean_55 <- cbind(1, predict(m$.user$basis_mu, 55)) %*% coef(m) log_mean_55 exp(log_mean_55) # But predict_gen_gamma_gamlss() does this work for us and also provides # centiles new_ages <- data.frame(age_months = 48:71) centiles <- predict_gen_gamma_gamlss(new_ages, m) centiles # Confirm that the manual prediction matches the automatic one centiles[centiles$age_months == 55, "mu"] exp(log_mean_55) if(requireNamespace("ggplot2", quietly = TRUE)) { library(ggplot2) ggplot(pivot_centiles_longer(centiles)) + aes(x = age_months, y = .value) + geom_line(aes(group = .centile, color = .centile_pair)) + geom_point( aes(y = speaking_sps), data = subset( data_fake_rates, 48 <= age_months & age_months <= 71 ) ) } # Example of 0-df splines m <- fit_gen_gamma_gamlss( data_fake_rates, age_months, speaking_sps, df_mu = 0, df_sigma = 2, df_nu = 0 ) coef(m, what = "mu") coef(m, what = "sigma") coef(m, what = "nu") # mu and nu fixed, c50 mostly locked in predict_gen_gamma_gamlss(new_ages, m)[c(1, 9, 17, 24), ]data_fake_rates m <- fit_gen_gamma_gamlss(data_fake_rates, age_months, speaking_sps) # using "qr" in summary() just to suppress a warning message summary(m, type = "qr") # Alternative interface d <- data_fake_rates m2 <- fit_gen_gamma_gamlss_se( data = d, name_x = "age_months", name_y = "speaking_sps" ) coef(m2) == coef(m) # how to use control to change gamlss() behavior m_traced <- fit_gen_gamma_gamlss( data_fake_rates, age_months, speaking_sps, control = gamlss::gamlss.control(n.cyc = 15, trace = TRUE) ) # The `.user` space includes the spline bases, so that we can make accurate # predictions of new xs. names(m$.user) # predict log(mean) at 55 months: log_mean_55 <- cbind(1, predict(m$.user$basis_mu, 55)) %*% coef(m) log_mean_55 exp(log_mean_55) # But predict_gen_gamma_gamlss() does this work for us and also provides # centiles new_ages <- data.frame(age_months = 48:71) centiles <- predict_gen_gamma_gamlss(new_ages, m) centiles # Confirm that the manual prediction matches the automatic one centiles[centiles$age_months == 55, "mu"] exp(log_mean_55) if(requireNamespace("ggplot2", quietly = TRUE)) { library(ggplot2) ggplot(pivot_centiles_longer(centiles)) + aes(x = age_months, y = .value) + geom_line(aes(group = .centile, color = .centile_pair)) + geom_point( aes(y = speaking_sps), data = subset( data_fake_rates, 48 <= age_months & age_months <= 71 ) ) } # Example of 0-df splines m <- fit_gen_gamma_gamlss( data_fake_rates, age_months, speaking_sps, df_mu = 0, df_sigma = 2, df_nu = 0 ) coef(m, what = "mu") coef(m, what = "sigma") coef(m, what = "nu") # mu and nu fixed, c50 mostly locked in predict_gen_gamma_gamlss(new_ages, m)[c(1, 9, 17, 24), ]
Observations are scale()-ed before clustering.
fit_kmeans(data, k, vars, args_kmeans = list())fit_kmeans(data, k, vars, args_kmeans = list())
data |
a dataframe |
k |
number of clusters to create |
vars |
variable selection for clustering. Select multiple variables
with |
args_kmeans |
additional arguments passed to |
Note that each variable is scaled() before clustering
and then cluster means are unscaled to match the original data scale.
This function provides the original kmeans labels as .kmeans_cluster but
other alternative labeling based on different sortings of the data. These are
provided in order to deal with label-swapping in Bayesian models. See
bootstrapping example below.
the original data but augmented with additional columns for
clustering details. including .kmeans_cluster (cluster number of each
observation, as a factor) and .kmeans_k (selected number of clusters).
Cluster-level information is also included. For example, suppose that
we cluster using the variable x. Then the output will have a
column .kmeans_x giving the cluster mean for x and
.kmeans_rank_x giving the cluster labels reordered using the cluster
means for x. The column .kmeans_sort contains the cluster sorted
using the first principal component of the scaled variables. All columns
of cluster indices are a factor() so that they can be plotted as
discrete variables.
data_kmeans <- fit_kmeans(mtcars, 3, c(mpg, wt, hp)) library(ggplot2) ggplot(data_kmeans) + aes(x = wt, y = mpg) + geom_point(aes(color = .kmeans_cluster)) ggplot(data_kmeans) + aes(x = wt, y = mpg) + geom_point(aes(color = .kmeans_rank_wt)) # Example of label swapping set.seed(123) data_boots <- lapply( 1:10, function(x) { rows <- sample(seq_len(nrow(mtcars)), replace = TRUE) data <- mtcars[rows, ] data$.bootstrap <- x data } ) |> lapply(fit_kmeans, k = 3, c(mpg, wt, hp)) |> dplyr::bind_rows() |> dplyr::select(.bootstrap, dplyr::starts_with(".kmeans_")) |> dplyr::distinct() # Clusters start off in random locations and move to center, so the labels # differ between model runs and across bootstraps. ggplot(data_boots) + aes(x = .kmeans_wt, y = .kmeans_mpg) + geom_point(aes(color = .kmeans_cluster)) + labs(title = "k-means centers on 10 bootstraps") # Labels sorted using first principal component # so the labels are more consistent. ggplot(data_boots) + aes(x = .kmeans_wt, y = .kmeans_mpg) + geom_point(aes(color = .kmeans_sort)) + labs(title = "k-means centers on 10 bootstraps")data_kmeans <- fit_kmeans(mtcars, 3, c(mpg, wt, hp)) library(ggplot2) ggplot(data_kmeans) + aes(x = wt, y = mpg) + geom_point(aes(color = .kmeans_cluster)) ggplot(data_kmeans) + aes(x = wt, y = mpg) + geom_point(aes(color = .kmeans_rank_wt)) # Example of label swapping set.seed(123) data_boots <- lapply( 1:10, function(x) { rows <- sample(seq_len(nrow(mtcars)), replace = TRUE) data <- mtcars[rows, ] data$.bootstrap <- x data } ) |> lapply(fit_kmeans, k = 3, c(mpg, wt, hp)) |> dplyr::bind_rows() |> dplyr::select(.bootstrap, dplyr::starts_with(".kmeans_")) |> dplyr::distinct() # Clusters start off in random locations and move to center, so the labels # differ between model runs and across bootstraps. ggplot(data_boots) + aes(x = .kmeans_wt, y = .kmeans_mpg) + geom_point(aes(color = .kmeans_cluster)) + labs(title = "k-means centers on 10 bootstraps") # Labels sorted using first principal component # so the labels are more consistent. ggplot(data_boots) + aes(x = .kmeans_wt, y = .kmeans_mpg) + geom_point(aes(color = .kmeans_sort)) + labs(title = "k-means centers on 10 bootstraps")
Convert between age in months, years;months, and yymm age formats
format_year_month_age(x, sep = ";") parse_year_month_age(x, sep = ";") parse_yymm_age(x, start = 1L)format_year_month_age(x, sep = ";") parse_year_month_age(x, sep = ";") parse_yymm_age(x, start = 1L)
x |
|
sep |
Separator to use for |
start |
For |
For format_year_month_age(), ages of NA return "NA;NA".
For parse_year_month_age(), values that cannot be parsed
return NA.
This format by default is not numerically ordered. This means that c("2;0", "10;10", "10;9") would sort as c("10;10", "10;9", "2;0"). The function
stringr::str_sort(..., numeric = TRUE) will sort this vector correctly.
format_year_month_age() returns a character vector in "years;months"
format (or years{sep}months format more generally).
parse_year_month_age() returns a vector of ages in months.
parse_yymm_age(): returns a vector of ages in months.
ages <- c(26, 58, 25, 67, 21, 59, 36, 43, 27, 49, NA) ym_ages <- format_year_month_age(ages) ym_ages parse_year_month_age(ym_ages) parse_yymm_age(c("0204", "0310")) parse_yymm_age(c("ab_0204", "ab_0310"), start = 4)ages <- c(26, 58, 25, 67, 21, 59, 36, 43, 27, 49, NA) ym_ages <- format_year_month_age(ages) ym_ages parse_year_month_age(ym_ages) parse_yymm_age(c("0204", "0310")) parse_yymm_age(c("ab_0204", "ab_0310"), start = 4)
Impute missing data at different utterance lengths using successive linear models.
impute_values_by_length( data, var_y, var_length, id_cols = NULL, include_max_length = FALSE, data_train = NULL )impute_values_by_length( data, var_y, var_length, id_cols = NULL, include_max_length = FALSE, data_train = NULL )
data |
dataframe in which to impute missing value |
var_y |
bare name of the response variable for imputation |
var_length |
bare name of the length variable |
id_cols |
a selection of variable names that uniquely identify each
group of related observations. For example, |
include_max_length |
whether to use the maximum length value as a
predictor in the imputation models. Defaults to |
data_train |
(optional) dataframe used to train the imputation models.
For example, we might have data from a reference group of children in
|
a dataframe with the additional columns {var_y}_imputed (the
imputed value), .max_{var_length} with the highest value of var_length
with observed data, and {var_y}_imputation for labeling whether
observations were "imputed" or "observed".
In Hustad and colleagues (2020), we modeled intelligibility data in young children's speech. Children would hear an utterance and then they would repeat it. The utterances started at 2 words in length, then increased to 3 words in length, and so on in batches of 10 sentences, all the way to 7 words in length. There was a problem, however: Not all of the children could produce utterances at every length. Specifically, if a child could not reliably produced 5 utterances of a given length length, the task was halted. So given the nature of the task, if a child had produced 5-word utterances, they also produced 2–4-word utterances as well.
The length of the utterance probably influenced the outcome variable: Longer utterances have more words that might help a listener understand the sentence, for example. Therefore, it did not seem appropriate to ignore the missing values. We used the following two-step procedure (see the Supplemental Materials for more detail):
Remark about data and data_train: One might ask, why shouldn't some
children help train the data imputation models? Let's consider a
norm-referenced standardized testing scenario: We have a new participant
(observations in data), and we want to know how they compare to their age
peers (participants in data_train). By separating out data_train and
fixing it to a reference group, we can apply the same adjustment/imputation
procedure to all new participants.
Hustad, K. C., Mahr, T., Natzke, P. E. M., & Rathouz, P. J. (2020). Development of Speech Intelligibility Between 30 and 47 Months in Typically Developing Children: A Cross-Sectional Study of Growth. Journal of Speech, Language, and Hearing Research, 63(6), 1675–1687. https://doi.org/10.1044/2020_JSLHR-20-00008
Hustad, K. C., Mahr, T., Natzke, P. E. M., & J. Rathouz, P. (2020). Supplemental Material S1 (Hustad et al., 2020). ASHA journals. https://doi.org/10.23641/asha.12330956.v1
set.seed(1) fake_data <- tibble::tibble( child = c( "a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "c", "c", "c", "c", "c", "e", "e", "e", "e", "e", "f", "f", "f", "f", "f", "g", "g", "g", "g", "g", "h", "h", "h", "h", "h", "i", "i", "i", "i", "i" ), level = c(1:5, 1:5, 1:5, 1:5, 1:5, 1:5, 1:5, 1:5), x = c( c(100, 110, 120, 130, 150) + c(-8, -5, 0, NA, NA), c(100, 110, 120, 130, 150) + c(6, 6, 4, NA, NA), c(100, 110, 120, 130, 150) + c(-5, -5, -2, 2, NA), c(100, 110, 120, 130, 150) + rbinom(5, 12, .5) - 6, c(100, 110, 120, 130, 150) + rbinom(5, 12, .5) - 6, c(100, 110, 120, 130, 150) + rbinom(5, 12, .5) - 6, c(100, 110, 120, 130, 150) + rbinom(5, 12, .5) - 6, c(100, 110, 120, 130, 150) + rbinom(5, 12, .5) - 6 ) ) data_imputed <- impute_values_by_length( fake_data, x, level, id_cols = c(child), include_max_length = FALSE ) if (requireNamespace("ggplot2")) { library(ggplot2) ggplot(data_imputed) + aes(x = level, y = x_imputed) + geom_line(aes(group = child)) + geom_point(aes(color = x_imputation)) }set.seed(1) fake_data <- tibble::tibble( child = c( "a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "c", "c", "c", "c", "c", "e", "e", "e", "e", "e", "f", "f", "f", "f", "f", "g", "g", "g", "g", "g", "h", "h", "h", "h", "h", "i", "i", "i", "i", "i" ), level = c(1:5, 1:5, 1:5, 1:5, 1:5, 1:5, 1:5, 1:5), x = c( c(100, 110, 120, 130, 150) + c(-8, -5, 0, NA, NA), c(100, 110, 120, 130, 150) + c(6, 6, 4, NA, NA), c(100, 110, 120, 130, 150) + c(-5, -5, -2, 2, NA), c(100, 110, 120, 130, 150) + rbinom(5, 12, .5) - 6, c(100, 110, 120, 130, 150) + rbinom(5, 12, .5) - 6, c(100, 110, 120, 130, 150) + rbinom(5, 12, .5) - 6, c(100, 110, 120, 130, 150) + rbinom(5, 12, .5) - 6, c(100, 110, 120, 130, 150) + rbinom(5, 12, .5) - 6 ) ) data_imputed <- impute_values_by_length( fake_data, x, level, id_cols = c(child), include_max_length = FALSE ) if (requireNamespace("ggplot2")) { library(ggplot2) ggplot(data_imputed) + aes(x = level, y = x_imputed) + geom_line(aes(group = child)) + geom_point(aes(color = x_imputation)) }
Compute entropy and related measures
info_surprisal(x, base = NULL) info_entropy(p, base = NULL) info_cross_entropy(p, q, base = NULL) info_kl_divergence(p, q, base = NULL) info_kl_divergence_matrix(mat, base = NULL)info_surprisal(x, base = NULL) info_entropy(p, base = NULL) info_cross_entropy(p, q, base = NULL) info_kl_divergence(p, q, base = NULL) info_kl_divergence_matrix(mat, base = NULL)
x |
a vector of probabilities |
base |
the base of the logarithm. Defaults to e ( |
p, q
|
a probability distribution (vector of probabilities that sum to 1) |
mat |
a matrix where each row is a probability distribution |
the entropy
Given a probability x, the surprisal value (or information content) of
x is the log of the inverse probability, log(1 / x). Rare events (smaller
probabilities) have larger surprisal values than more common events. The word
"surprise" here conveys how unexpected or informative an event is. (The idea
that surprises are "informative" or contain information makes sense with the
intuition that there isn't much to be learned from predictable events.)
Because of how division works with logarithms, log(1 / x) simplifies so
that info_surprisal(x) is the negative log probability,-log(x). The units
of the information content depends on the base of the logarithm. Our
functions by default use the natural log() which provides information in
nats, but it is also common to see log2()-based surprisals which provide
information in bits.
Given a probability distribution p—a vector of probabilities that
sum to 1—the entropy of the distribution is the
probability-weighted average of the surprisal values. A weighted average
is sum(weights * values) / sum(weights), but because the weights here
are probabilities that sum to 1, the info_entropy(p) is sum(p * info_surprisal(p)). Entropy can be interpreted as a measure of
uncertainty in the distribution; it's the expected surprisal value in
the distribution.
In info_entropy(p) the surprisal values and their weights came from
the same probability distribution. But this doesn't need to be the case.
Suppose that there are ground-truth probabilities p and there are also
estimated probabilities q. info_entropy(q) computes a weighted
average where events with surprisal info_surprisal(q) have frequencies
of q. But if we knew the true frequencies, the surprisals in q would
occur with frequency p, so their weighted average should be sum(p * info_surprisal(q)). This value is the cross entropy of q with
respect to p, and info_cross_entropy(p, q) implements this function.
Cross entropy is commonly used as a loss function in machine learning.
Gibb's inequality
says that info_entropy(p) <= info_cross_entropy(p, q). Unless p and
q are the same distribution, there will always be some excess
uncertainty or surprise in the cross entropy compared to the entropy:
info_cross_entropy(p, q) = info_entropy(p) + *excess*. This excess is
the Kullback-Liebler divergence (KL divergence or relative entropy).
Due to the properties of logarithms, info_kl_divergence(p, q)
simplifies to sum (p * log(p / q)). KL divergence is an important
quantity for comparing probability distributions. For example, each row
in a confusion matrix is a probability distribution, and KL divergence
can identify which rows are most similar. info_kl_divergence_matrix(mat)
computes a distance matrix using on each pair of rows in mat using KL
divergence.
wikipedia_example <- rbind( p = c(9, 12, 4) / 25, q = c(1, 1, 1) / 3 ) info_kl_divergence_matrix(wikipedia_example)wikipedia_example <- rbind( p = c(9, 12, 4) / 25, q = c(1, 1, 1) / 3 ) info_kl_divergence_matrix(wikipedia_example)
Join data onto resampled IDs
join_to_split(x, y, by, validate = FALSE)join_to_split(x, y, by, validate = FALSE)
x |
an rset object created by |
y |
y dataframe with a column of the id values which was resampled to
create |
by |
the name of column in y with the data |
validate |
whether to validate the join by counting the number of rows
associated with each id. Defaults to |
the original rset object with its x$data updated to join with y
and with the row numbers x$in_id updated to work on the expanded dataset.
library(dplyr) data_trees <- tibble::as_tibble(datasets::Orange) data_tree_ids <- distinct(data_trees, Tree) # Resample ids data_bootstraps <- data_tree_ids %>% rsample::bootstraps(times = 20) %>% rename(splits_id = splits) %>% # Attach data to resampled ids mutate( data_splits = splits_id %>% purrr::map( join_to_split, data_trees, by = "Tree", validate = TRUE ) ) data_bootstrapslibrary(dplyr) data_trees <- tibble::as_tibble(datasets::Orange) data_tree_ids <- distinct(data_trees, Tree) # Resample ids data_bootstraps <- data_tree_ids %>% rsample::bootstraps(times = 20) %>% rename(splits_id = splits) %>% # Attach data to resampled ids mutate( data_splits = splits_id %>% purrr::map( join_to_split, data_trees, by = "Tree", validate = TRUE ) ) data_bootstraps
This function is a wrapper around logitnorm::momentsLogitnorm().
logitnorm_mean(mu, sigma)logitnorm_mean(mu, sigma)
mu |
mean(s) on the logit scale |
sigma |
standard deviation(s) on the logit scale |
the means of the distributions
x <- logitnorm_mean(2, 1) x # compare to simulation set.seed(100) rnorm(1000, 2, 1) |> plogis() |> mean()x <- logitnorm_mean(2, 1) x # compare to simulation set.seed(100) rnorm(1000, 2, 1) |> plogis() |> mean()
Think of it as a gamlss model with memories (mem. gamlss).
mem_gamlss(...)mem_gamlss(...)
... |
arguments passed to gamlss::gamlss() |
the fitted model object but updated to include user information in
model$.user. Includes the dataset used to fit the model
model$.user$data, the session info model$.user$session_info and the
call used to fit the model model$.user$call. model$call is updated to
match
gamlss has trouble doing predictions without the original training data.
predict_centiles(newdata, model, centiles = c(5, 10, 50, 90, 95), ...) pivot_centiles_longer(data)predict_centiles(newdata, model, centiles = c(5, 10, 50, 90, 95), ...) pivot_centiles_longer(data)
newdata |
a one-column dataframe for predictions |
model |
a gamlss model prepared by |
centiles |
centiles to use for prediction. Defaults to |
... |
arguments passed to |
data |
centile predictions to reshape for |
a tibble with fitted centiles for predict_centiles() and a
long-format tibble with one centile value per row in
pivot_centiles_longer()
Custom function for printing duckdb database connections
print_duckdb(object)print_duckdb(object)
object |
a database connection to a duckdb database |
Use the following to overwrite the S4 method for printing duckdb objects.
setMethod("show", "duckdb_connection", wisclabmisc::print_duckdb)
NULL invisibly
skip_block() is a lightweight control-flow helper to deliberately
skip execution of a block of code while still documenting that the block
exists. It is intended as an alternative to commenting out code
or wrapping code in if (FALSE) { ... }.
skip_block(...)skip_block(...)
... |
Either a single braced expression containing the code block to skip, or a character string followed by a single braced expression. When a message is supplied, it is printed instead of the default message. |
The function captures the unevaluated code block, reports how many lines would have run, optionally prints a user-supplied message, and then returns invisibly without evaluating the code.
skip_block() uses non-standard evaluation to capture the code block via
substitute(). The code block is never evaluated.
Supplying more than two arguments, or supplying no code block, is an error.
NULL, invisibly.
skip_block({ Sys.sleep(10) }) skip_block("Skipping slow preprocessing step", { Sys.sleep(10) } )skip_block({ Sys.sleep(10) }) skip_block("Skipping slow preprocessing step", { Sys.sleep(10) } )
Select and row-bind multiple database tables from the same source together
tbl_bind(src, table_selection, id_name = ".source")tbl_bind(src, table_selection, id_name = ".source")
src |
a database connection |
table_selection |
a |
id_name |
Name for the table ID column. The names of the original tables
will be included in a column named by |
In dplyr, tbl(src, table_name) names and queries a database table.
tbl_bind(src, <selection>) extends this idea into selecting and
combining multiple database tables.
To manually combine two tbls x and y, use dplyr::union_all(x, y).
a tbl (query to a remote table)
library(dplyr) db <- duckdb::duckdb() |> DBI::dbConnect() db DBI::dbWriteTable(db, "mtcars_g1", mtcars[mtcars$cyl == 4, ]) DBI::dbWriteTable(db, "mtcars_g2", mtcars[mtcars$cyl == 6, ]) DBI::dbWriteTable(db, "mtcars_g3", mtcars[mtcars$cyl == 8, ]) DBI::dbWriteTable(db, "trees", trees) DBI::dbListTables(db) r <- db |> tbl_bind(starts_with("mtcars")) |> count(.source) r # the query is several UNIONs show_query(r) DBI::dbDisconnect(db, shutdown = TRUE)library(dplyr) db <- duckdb::duckdb() |> DBI::dbConnect() db DBI::dbWriteTable(db, "mtcars_g1", mtcars[mtcars$cyl == 4, ]) DBI::dbWriteTable(db, "mtcars_g2", mtcars[mtcars$cyl == 6, ]) DBI::dbWriteTable(db, "mtcars_g3", mtcars[mtcars$cyl == 8, ]) DBI::dbWriteTable(db, "trees", trees) DBI::dbListTables(db) r <- db |> tbl_bind(starts_with("mtcars")) |> count(.source) r # the query is several UNIONs show_query(r) DBI::dbDisconnect(db, shutdown = TRUE)
Extract the TOCS details from a string (usually a filename)
tocs_item(xs) tocs_type(xs) tocs_length(xs)tocs_item(xs) tocs_type(xs) tocs_length(xs)
xs |
a character vector |
tocs_item() returns the substring with the TOCS item, tocs_type()
returns whether the item is "single-word" or "multiword", and
tocs_length() returns the length of the TOCS item (i.e., the number of
words).
x <- c( "XXv16s7T06.lab", "XXv15s5T06.TextGrid", "XXv13s3T10.WAV", "XXv18wT11.wav", "non-matching", "s2T01", "XXv01s4B01.wav", "XXv01wB01.wav", # sometimes these have tags for *v*irtual visits or recording attempts "XXv13s3T10v.WAV", "XXv13s3T10a.lab" ) data.frame( x = x, item = tocs_item(x), type = tocs_type(x), length = tocs_length(x) )x <- c( "XXv16s7T06.lab", "XXv15s5T06.TextGrid", "XXv13s3T10.WAV", "XXv18wT11.wav", "non-matching", "s2T01", "XXv01s4B01.wav", "XXv01wB01.wav", # sometimes these have tags for *v*irtual visits or recording attempts "XXv13s3T10v.WAV", "XXv13s3T10a.lab" ) data.frame( x = x, item = tocs_item(x), type = tocs_type(x), length = tocs_length(x) )
Compute AUCs using the trapezoid method
trapezoid_auc(xs, ys) partial_trapezoid_auc(xs, ys, xlim)trapezoid_auc(xs, ys) partial_trapezoid_auc(xs, ys, xlim)
xs, ys
|
x and y positions |
xlim |
two-element vector (a range) of the |
the area under the curve computed using the trapezoid method. For
partial_trapezoid_auc(), the partial area under the curve is computed.
# Predict whether a car has automatic vs. manual transmission from mpg r <- pROC::roc(am ~ mpg, mtcars) pROC::auc(r) trapezoid_auc(r$specificities, r$sensitivities) pROC::auc(r, partial.auc = c(.9, 1), partial.auc.focus = "sp") partial_trapezoid_auc(r$specificities, r$sensitivities, c(.9, 1)) pROC::auc(r, partial.auc = c(.9, 1), partial.auc.focus = "se") partial_trapezoid_auc(r$sensitivities, r$specificities, c(.9, 1)) pROC::auc(r, partial.auc = c(.1, .9), partial.auc.focus = "sp") partial_trapezoid_auc(r$specificities, r$sensitivities, c(.1, .9)) pROC::auc(r, partial.auc = c(.1, .9), partial.auc.focus = "se") partial_trapezoid_auc(r$sensitivities, r$specificities, c(.1, .9))# Predict whether a car has automatic vs. manual transmission from mpg r <- pROC::roc(am ~ mpg, mtcars) pROC::auc(r) trapezoid_auc(r$specificities, r$sensitivities) pROC::auc(r, partial.auc = c(.9, 1), partial.auc.focus = "sp") partial_trapezoid_auc(r$specificities, r$sensitivities, c(.9, 1)) pROC::auc(r, partial.auc = c(.9, 1), partial.auc.focus = "se") partial_trapezoid_auc(r$sensitivities, r$specificities, c(.9, 1)) pROC::auc(r, partial.auc = c(.1, .9), partial.auc.focus = "sp") partial_trapezoid_auc(r$specificities, r$sensitivities, c(.1, .9)) pROC::auc(r, partial.auc = c(.1, .9), partial.auc.focus = "se") partial_trapezoid_auc(r$sensitivities, r$specificities, c(.1, .9))
For each participant, we find their length of longest utterance. We predict this longest utterance length as a nonlinear function of some variable, and we compute the probability of reaching each utterance length at each value of the predictor variable. These probabilities are then normalized to provide weights for each utterance length.
weight_lengths_with_ordinal_model( data_train, var_length, var_x, id_cols, spline_df = 2, data_join = NULL )weight_lengths_with_ordinal_model( data_train, var_length, var_x, id_cols, spline_df = 2, data_join = NULL )
data_train |
dataframe used to train the ordinal model. |
var_length |
bare name of the length variable. For example,
|
var_x |
bare name of the predictor variable. For example, |
id_cols |
a selection of variable names that uniquely identify each
group of related observations. For example, |
spline_df |
number of degrees of freedom to use for the ordinal regression model. |
data_join |
(optional) dataset to use join the weights onto. This feature is necessary because we want to train a dataset on the observed data but supply the weights to the dataset with missing values imputed. |
the probability and weights of each utterance length at each observed
value of var_x. These are in the added columns
{var_length}_prob_reached and {var_length}_weight, respectively.
Hustad, K. C., Mahr, T., Natzke, P. E. M., & Rathouz, P. J. (2020). Development of Speech Intelligibility Between 30 and 47 Months in Typically Developing Children: A Cross-Sectional Study of Growth. Journal of Speech, Language, and Hearing Research, 63(6), 1675–1687. https://doi.org/10.1044/2020_JSLHR-20-00008
Hustad, K. C., Mahr, T., Natzke, P. E. M., & J. Rathouz, P. (2020). Supplemental Material S1 (Hustad et al., 2020). ASHA journals. https://doi.org/10.23641/asha.12330956.v1
data_weights <- weight_lengths_with_ordinal_model( data_example_intelligibility_by_length, tocs_level, age_months, child, spline_df = 2 ) if (requireNamespace("ggplot2")) { library(ggplot2) p1 <- ggplot(data_weights) + aes(x = age_months, y = tocs_level_prob_reached) + geom_line(aes(color = ordered(tocs_level)), linewidth = 1) + scale_color_ordinal(end = .85) + labs(y = "Prob. of reaching length", color = "Utterance length") print(p1) p2 <- p1 + aes(y = tocs_level_weight) + labs(y = "Weight of utterance length") print(p2) }data_weights <- weight_lengths_with_ordinal_model( data_example_intelligibility_by_length, tocs_level, age_months, child, spline_df = 2 ) if (requireNamespace("ggplot2")) { library(ggplot2) p1 <- ggplot(data_weights) + aes(x = age_months, y = tocs_level_prob_reached) + geom_line(aes(color = ordered(tocs_level)), linewidth = 1) + scale_color_ordinal(end = .85) + labs(y = "Prob. of reaching length", color = "Utterance length") print(p1) p2 <- p1 + aes(y = tocs_level_weight) + labs(y = "Weight of utterance length") print(p2) }