wisclabmisc includes for that
making with GAMLSS models easier. Normally, when gamlss()
fit a model, it does not store the data alongside the model. We provide
a function that fixes that. We also provide some functions extracting
centiles (percentile curves) from models easiers.
gamlss()
that remembers the datamem_gamlss()
(memory gamlss) provides a drop-in
replacement for the gamlss()
function.
library(wisclabmisc)
library(gamlss)
library(tidyverse)
data <- as.data.frame(nlme::Orthodont)
model <- mem_gamlss(distance ~ age, data = data)
#> GAMLSS-RS iteration 1: Global Deviance = 505.577
#> GAMLSS-RS iteration 2: Global Deviance = 505.577
The only difference between mem_gamlss()
and
gamlss()
is that the modified version includes a bundle of
data in .user
that records the original dataset, session
information and the call used to fit the model.
str(model$.user, max.level = 1)
#> List of 3
#> $ data :'data.frame': 108 obs. of 4 variables:
#> $ session_info:List of 2
#> ..- attr(*, "class")= chr [1:2] "session_info" "list"
#> $ call : language mem_gamlss(distance ~ age, data = data)
gamlss does not store the data as part of the model object, and we need the dataset because prediction and centile prediction often fails without the dataset:
newdata <- distinct(data, age)
centiles.pred(
model,
cent = c(25, 50, 75),
xname = "age",
xvalues = newdata$age,
plot = FALSE
)
#> Error in data.frame(data, source = namelist): arguments imply differing number of rows: 4, 5
But including the original dataset works:
centiles.pred(
model,
cent = c(25, 50, 75),
xname = "age",
xvalues = newdata$age,
plot = FALSE,
data = model$.user$data
)
#> x 25 50 75
#> 1 8 20.34723 22.04259 23.73796
#> 2 10 21.66760 23.36296 25.05833
#> 3 12 22.98797 24.68333 26.37870
#> 4 14 24.30834 26.00370 27.69907
(“Centile prediction” means predicting the percentiles of the data
along a single variable. That’s why the above function just needs a
single xname
: A single predictor variable is used. We use
centile prediction compute growth curves so that we can look at smooth
changes in the percentiles over age.)
This package provides predict_centiles()
as a
streamlined version of the above code, but:
mem_gamlss()
age
instead of
x
)q
(for quantile)centiles <- predict_centiles(
newdata,
model,
cent = c(25, 50, 75)
)
centiles
#> # A tibble: 4 × 4
#> age c25 c50 c75
#> <dbl> <dbl> <dbl> <dbl>
#> 1 8 20.3 22.0 23.7
#> 2 10 21.7 23.4 25.1
#> 3 12 23.0 24.7 26.4
#> 4 14 24.3 26.0 27.7
Those predicted centiles are in wide format. We can tidy them into a
long format with pivot_centiles_longer()
. This also
includes .pair
column that helps mark commonly paired
quantiles 25:75, 10:90, and 5:95.
pivot_centiles_longer(centiles)
#> # A tibble: 12 × 4
#> age .centile .value .centile_pair
#> <dbl> <dbl> <dbl> <chr>
#> 1 8 25 20.3 centiles 25, 75
#> 2 8 50 22.0 median
#> 3 8 75 23.7 centiles 25, 75
#> 4 10 25 21.7 centiles 25, 75
#> 5 10 50 23.4 median
#> 6 10 75 25.1 centiles 25, 75
#> 7 12 25 23.0 centiles 25, 75
#> 8 12 50 24.7 median
#> 9 12 75 26.4 centiles 25, 75
#> 10 14 25 24.3 centiles 25, 75
#> 11 14 50 26.0 median
#> 12 14 75 27.7 centiles 25, 75
Half of the data should be above the 50% centile line and half should
be below the 50% centile line. The same holds for the other centile
lines. This check_sample_centiles()
performs this check by
computing the percentages of observations less than or equal to each
centile line.
check_sample_centiles(data, model, age, distance)
#> # A tibble: 7 × 4
#> .centile n n_under_centile percent_under_centile
#> <dbl> <int> <int> <dbl>
#> 1 5 108 6 5.56
#> 2 10 108 9 8.33
#> 3 25 108 25 23.1
#> 4 50 108 61 56.5
#> 5 75 108 85 78.7
#> 6 90 108 95 88.0
#> 7 95 108 100 92.6
Which matches the gamlss package’s output:
centiles(
model,
model$.user$data$age,
data = model$.user$data,
cent = c(5, 10,25, 50, 75, 90, 95),
plot = FALSE
)
#> % of cases below 5 centile is 5.555556
#> % of cases below 10 centile is 8.333333
#> % of cases below 25 centile is 23.14815
#> % of cases below 50 centile is 56.48148
#> % of cases below 75 centile is 78.7037
#> % of cases below 90 centile is 87.96296
#> % of cases below 95 centile is 92.59259
This function also supports grouped data to check centile performance for different subsets of data.
data %>%
mutate(age_bin = ntile(age, 2)) %>%
group_by(age_bin) %>%
check_sample_centiles(model, age, distance)
#> # A tibble: 14 × 5
#> age_bin .centile n n_under_centile percent_under_centile
#> <int> <dbl> <int> <int> <dbl>
#> 1 1 5 54 3 5.56
#> 2 1 10 54 4 7.41
#> 3 1 25 54 13 24.1
#> 4 1 50 54 29 53.7
#> 5 1 75 54 44 81.5
#> 6 1 90 54 49 90.7
#> 7 1 95 54 51 94.4
#> 8 2 5 54 3 5.56
#> 9 2 10 54 5 9.26
#> 10 2 25 54 12 22.2
#> 11 2 50 54 32 59.3
#> 12 2 75 54 41 75.9
#> 13 2 90 54 46 85.2
#> 14 2 95 54 49 90.7
This output also matches the output provide by gamlss’s
centile.split()
function:
centiles.split(
model,
model$.user$data$age,
data = model$.user$data,
n.inter = 2,
cent = c(5, 10,25, 50, 75, 90, 95),
plot = FALSE
)
#> 7 to 11 11 to 15
#> 5 5.555556 5.555556
#> 10 7.407407 9.259259
#> 25 24.074074 22.222222
#> 50 53.703704 59.259259
#> 75 81.481481 75.925926
#> 90 90.740741 85.185185
#> 95 94.444444 90.740741