Tools for ROC curves

library(tidyverse)
library(wisclabmisc)
library(pROC, exclude = c("cov", "smooth", "var"))
#> Type 'citation("pROC")' for a citation.

A primer on ROC curves

wisclabmisc provides functions for tidying results from ROC curves. These curves arise in diagnostic or classification settings where we want to use some test score to determine whether an individual belongs in a control group versus a case group. This binary classification could be normal versus clinical status, regular email versus spam status, and so on. I use the terminology control and case to follow the pROC package’s interface.

In this classification literature, there are tons and tons of statistics to describe classifier performance. The ROC curve centers around the two important quantities of sensitivity and specificity:

  • sensitivity is the proportion of true cases correctly identified as cases.
    • Also called the true positive rate or recall.
    • If I apply my spam classifier to 100 spam emails, how many will be correctly flagged as spam?
    • P(case result | case status)
    • Sensitivity makes sense to me if I think about the problem as detecting something subtle. (Like a Jedi being “force sensitive” or Spider-Man’s Spidey sense tingling when he’s in danger.)
  • specificity is the proportion of true controls correctly identified as controls.
    • Also called the true negative rate or selectivity.
    • If I apply my spam classifier to 100 safe (ham) emails, how many will be correctly ignored?
    • P(control result | control status)
    • Specificity is not a great term; selectivity makes slightly more sense. We don’t want the sensor to trip over noise: It needs to be specific or selective.

Suppose our diagnostic instrument provides a score, and we have to choose a diagnostic threshold for one of these scores. For example, suppose we decide that scores above 60 indicate that an email is probably spam and can be moved into the spam folder. Then that threshold will have its own sensitivity and specificity attached to it: we can look at the proportion of spam emails with scores at or above 60 (sensitivity), and we can look at the proportion of ham emails with scores below 60 (specificity). Each number we choose for the threshold will have its own sensitivity and specificity score, so an ROC curve is a visualization of how sensitivity and specificity change along the range of threshold scores. (More impenetrable terminology: ROC stands for “receiver operating characteristic”, having something to do with detections made by radar receivers at different operating levels.)
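
As a toy example, we can run that threshold arithmetic directly in R. The scores below are made up for illustration; only the threshold logic matters.

# Hypothetical classifier scores for known spam and known ham emails
spam_scores <- c(88, 72, 65, 59, 93, 41, 80, 77)
ham_scores  <- c(12, 35, 48, 61, 22, 8, 55, 30)
threshold <- 60

# Sensitivity: proportion of true spam at or above the threshold
mean(spam_scores >= threshold)
#> [1] 0.75

# Specificity: proportion of true ham below the threshold
mean(ham_scores < threshold)
#> [1] 0.875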

A worked example

We can work through an example ROC curve using the pROC package. pROC provides the aSAH dataset, which contains “several clinical and one laboratory variable of 113 patients with an aneurysmal subarachnoid hemorrhage” (hence, aSAH).

We have the outcome (Good versus Poor) and some measure called s100b. We can see that there are many more Good outcomes near 0 than there are Poor outcomes.

data <- as_tibble(aSAH)
data
#> # A tibble: 113 × 7
#>    gos6  outcome gender   age wfns  s100b  ndka
#>    <ord> <fct>   <fct>  <int> <ord> <dbl> <dbl>
#>  1 5     Good    Female    42 1      0.13  3.01
#>  2 5     Good    Female    37 1      0.14  8.54
#>  3 5     Good    Female    42 1      0.1   8.09
#>  4 5     Good    Female    27 1      0.04 10.4 
#>  5 1     Poor    Female    42 3      0.13 17.4 
#>  6 1     Poor    Male      48 2      0.1  12.8 
#>  7 4     Good    Male      57 5      0.47  6   
#>  8 1     Poor    Male      41 4      0.16 13.2 
#>  9 5     Good    Female    49 1      0.18 15.5 
#> 10 4     Good    Female    75 2      0.1   6.01
#> # ℹ 103 more rows
count(data, outcome)
#> # A tibble: 2 × 2
#>   outcome     n
#>   <fct>   <int>
#> 1 Good       72
#> 2 Poor       41

ggplot(data) + 
  aes(x = s100b, y = outcome) + 
  geom_point(
    position = position_jitter(width = 0, height = .2),
    size = 3,
    alpha = .2
  ) +
  theme_grey(base_size = 12) +
  labs(y = NULL)

For each point in a grid of points along s100b, we can compute the proportions of patients in each group above or below that threshold. We can then plot these proportions to visualize the trade-off between sensitivity and specificity as the threshold changes.

# Split the data by group and build a grid of thresholds
# spanning slightly beyond the observed range of s100b
by_outcome <- split(data, data$outcome)
smallest_diff <- min(diff(unique(sort(data$s100b))))
grid <- tibble(
  threshold = seq(
    min(data$s100b) - smallest_diff, 
    max(data$s100b) + smallest_diff, 
    length.out = 200
  )
)

roc_coordinates <- grid %>% 
  rowwise() %>% 
  summarise(
    threshold = threshold,
    prop_poor_above = mean(by_outcome$Poor$s100b >= threshold),
    prop_good_below = mean(by_outcome$Good$s100b < threshold),
  )

ggplot(roc_coordinates) + 
  aes(x = threshold) + 
  geom_step(aes(y = prop_poor_above)) + 
  geom_step(aes(y = prop_good_below)) +
  annotate("text", x = 2, y = .9, hjust = 1, label = "specificity") + 
  annotate("text", x = 2, y = .1, hjust = 1, label = "sensitivity") +
  labs(
    title = "Sensitivity and specificity as cumulative proportions",
    x = "threshold (diagnosis when score >= threshold)",
    y = NULL
  )

It took me about 5 tries to get this plot correct. I am able to convince myself that it is right by noting that all of the Good outcomes are less than .51, so a threshold of .51 should not catch a single Good outcome and hence has a specificity of 1. Conversely, there is just one Poor outcome above 1, so a threshold of 1 is going to detect just one Poor outcome and hence has a very low sensitivity.
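
We can double-check both claims with the by_outcome split from above:

# Largest Good value (should be below .51)
max(by_outcome$Good$s100b)

# Specificity at a threshold of .51 (should be 1)
mean(by_outcome$Good$s100b < .51)

# Poor outcomes caught by a threshold of 1 (per the claim above, just one)
sum(by_outcome$Poor$s100b >= 1)

# Sensitivity at a threshold of 1 (should be tiny)
mean(by_outcome$Poor$s100b >= 1)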

If we ignore the threshold in our visualization, we can (finally) plot a canonical ROC curve. It shows specificity in reverse order so that the ideal point is the top-left corner (sensitivity = 1, specificity = 1).

roc_coordinates <- roc_coordinates %>% 
  rename(
    sensitivities = prop_poor_above, 
    specificities = prop_good_below
  ) %>%
  # otherwise the stair-steps look wrong
  arrange(sensitivities)

p <- ggplot(roc_coordinates) + 
  aes(x = specificities, y = sensitivities) + 
  geom_step() +
  scale_x_reverse() + 
  coord_fixed() + 
  theme_grey(base_size = 14)
p

We can compare our plot to the one provided by the pROC package. We find a perfect match in our sensitivity and specificity values.

roc <- pROC::roc(data, response = outcome, predictor = s100b)
#> Setting levels: control = Good, case = Poor
#> Setting direction: controls < cases
plot(roc)


# Elements 2 and 3 of the roc object are the sensitivities
# and specificities
proc_coordinates <- roc[2:3] %>% 
  as.data.frame() %>% 
  arrange(sensitivities)

# Plot the pROC curve as a wide semi-transparent blue
# band on top of ours
p + 
  geom_step(
    data = proc_coordinates, 
    color = "blue",
    alpha = .5,
    linewidth = 2
  )

Instead of computing ROC curves by hand, we defer the calculation of ROC curves to the pROC package because it is easy to get confused when calculating sensitivity and specificity and because pROC provides other tools for working with ROC curves. Thus, wisclabmisc’s goal with ROC curves is to provide helper functions that fit ROC curves with pROC and return the results in a nice dataframe.

We contrast two types of ROC curves:

  • an empirical ROC curve where the raw data is used to make a jagged ROC curve
  • a (smooth) density ROC curve where the densities of two distributions are used to make a smooth ROC curve.

Empirical ROC curves

Let’s return to the above example, predicting the group label outcome (case: Poor, control: Good) from the predictor s100b.

r <- pROC::roc(data, outcome, s100b)
#> Setting levels: control = Good, case = Poor
#> Setting direction: controls < cases
r
#> 
#> Call:
#> roc.data.frame(data = data, response = outcome, predictor = s100b)
#> 
#> Data: s100b in 72 controls (outcome Good) < 41 cases (outcome Poor).
#> Area under the curve: 0.7314

From the messages, we can see that pROC::roc() makes a few decisions for us: that Good is the control level and Poor is the case level, and that controls should have a lower s100b than cases.

pROC::roc() returns an roc object which bundles all of the data and model results together. Ultimately, we want the results in a dataframe so that one row will provide the sensitivity and specificity for each threshold value.

class(r)
#> [1] "roc"
str(r, max.level = 1, give.attr = FALSE)
#> List of 15
#>  $ percent           : logi FALSE
#>  $ sensitivities     : num [1:51] 1 0.976 0.976 0.976 0.976 ...
#>  $ specificities     : num [1:51] 0 0 0.0694 0.1111 0.1389 ...
#>  $ thresholds        : num [1:51] -Inf 0.035 0.045 0.055 0.065 ...
#>  $ direction         : chr "<"
#>  $ cases             : num [1:41] 0.13 0.1 0.16 0.12 0.44 0.71 0.49 0.07 0.33 0.09 ...
#>  $ controls          : num [1:72] 0.13 0.14 0.1 0.04 0.47 0.18 0.1 0.1 0.04 0.08 ...
#>  $ fun.sesp          :function (thresholds, controls, cases, direction)  
#>  $ auc               : 'auc' num 0.731
#>  $ call              : language roc.data.frame(data = data, response = outcome, predictor = s100b)
#>  $ original.predictor: num [1:113] 0.13 0.14 0.1 0.04 0.13 0.1 0.47 0.16 0.18 0.1 ...
#>  $ original.response : Factor w/ 2 levels "Good","Poor": 1 1 1 1 2 2 1 2 1 1 ...
#>  $ predictor         : num [1:113] 0.13 0.14 0.1 0.04 0.13 0.1 0.47 0.16 0.18 0.1 ...
#>  $ response          : Factor w/ 2 levels "Good","Poor": 1 1 1 1 2 2 1 2 1 1 ...
#>  $ levels            : chr [1:2] "Good" "Poor"

We can get close to a dataframe by manipulating the list or by using coords(). pROC::coords() has additional features that allow it to identify the “best” ROC points, but it strips off useful data like the direction used.

r[1:5] %>% 
  as.data.frame() %>% 
  tibble::as_tibble()
#> # A tibble: 51 × 5
#>    percent sensitivities specificities thresholds direction
#>    <lgl>           <dbl>         <dbl>      <dbl> <chr>    
#>  1 FALSE           1            0        -Inf     <        
#>  2 FALSE           0.976        0           0.035 <        
#>  3 FALSE           0.976        0.0694      0.045 <        
#>  4 FALSE           0.976        0.111       0.055 <        
#>  5 FALSE           0.976        0.139       0.065 <        
#>  6 FALSE           0.902        0.222       0.075 <        
#>  7 FALSE           0.878        0.306       0.085 <        
#>  8 FALSE           0.829        0.389       0.095 <        
#>  9 FALSE           0.780        0.486       0.105 <        
#> 10 FALSE           0.756        0.542       0.115 <        
#> # ℹ 41 more rows

pROC::coords(r) %>% 
  tibble::as_tibble()
#> # A tibble: 51 × 3
#>    threshold specificity sensitivity
#>        <dbl>       <dbl>       <dbl>
#>  1  -Inf          0            1    
#>  2     0.035      0            0.976
#>  3     0.045      0.0694       0.976
#>  4     0.055      0.111        0.976
#>  5     0.065      0.139        0.976
#>  6     0.075      0.222        0.902
#>  7     0.085      0.306        0.878
#>  8     0.095      0.389        0.829
#>  9     0.105      0.486        0.780
#> 10     0.115      0.542        0.756
#> # ℹ 41 more rows

wisclabmisc provides compute_empirical_roc() which combines results from pROC::roc() and pROC::coords() into a tibble. It includes metadata about the .controls and .cases levels, the .direction of the relationship, and the overall .auc of the curve. It also identifies two “best” coordinates with .is_best_youden and .is_best_closest_topleft. Finally, it retains the name of the predictor variable.

compute_empirical_roc(data, outcome, s100b)
#> Setting levels: control = Good, case = Poor
#> Setting direction: controls < cases

We can still see the messages emitted by the pROC::roc() call when we use compute_empirical_roc(). We can pass the arguments direction and levels to pROC::roc() to silence these messages.

data_roc <- compute_empirical_roc(
  data, 
  outcome, 
  s100b, 
  direction = "<",
  levels = c("Good", "Poor")
)
data_roc
#> # A tibble: 51 × 11
#>       s100b .specificities .sensitivities  .auc .direction .controls .cases
#>       <dbl>          <dbl>          <dbl> <dbl> <chr>      <chr>     <chr> 
#>  1 -Inf             0               1     0.731 <          Good      Poor  
#>  2    0.035         0               0.976 0.731 <          Good      Poor  
#>  3    0.045         0.0694          0.976 0.731 <          Good      Poor  
#>  4    0.055         0.111           0.976 0.731 <          Good      Poor  
#>  5    0.065         0.139           0.976 0.731 <          Good      Poor  
#>  6    0.075         0.222           0.902 0.731 <          Good      Poor  
#>  7    0.085         0.306           0.878 0.731 <          Good      Poor  
#>  8    0.095         0.389           0.829 0.731 <          Good      Poor  
#>  9    0.105         0.486           0.780 0.731 <          Good      Poor  
#> 10    0.115         0.542           0.756 0.731 <          Good      Poor  
#> # ℹ 41 more rows
#> # ℹ 4 more variables: .n_controls <int>, .n_cases <int>, .is_best_youden <lgl>,
#> #   .is_best_closest_topleft <lgl>

According to the help page for pROC::coords(), Youden’s J statistic identifies the point that is the farthest vertical distance from the diagonal line. The other “best” point is the point closest to the upper-left corner. The following plot labels each of these distances. Here, the Youden point and the topleft point are the same point.
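
Before plotting, we can verify both criteria by hand from the coordinates (a quick check; compute_empirical_roc() already flags these rows for us):

# Youden's J: maximize the vertical distance above the diagonal,
# i.e., sensitivity + specificity - 1
data_roc %>% 
  mutate(j = .sensitivities + .specificities - 1) %>% 
  slice_max(j, n = 1)

# Topleft: minimize the squared distance to the corner
# (specificity = 1, sensitivity = 1)
data_roc %>% 
  mutate(d2 = (1 - .specificities)^2 + (1 - .sensitivities)^2) %>% 
  slice_min(d2, n = 1)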

data_roc <- data_roc %>% 
  arrange(.sensitivities)

p_best <- ggplot(data_roc) + 
  aes(x = .specificities, y = .sensitivities) + 
  geom_abline(
    slope = 1, 
    intercept = 1, 
    linetype = "dotted", 
    color = "grey20"
  ) +
  geom_step() + 
  geom_segment(
    aes(xend = .specificities, yend = 1 - .specificities),
    data = . %>% filter(.is_best_youden),
    color = "blue",
    linetype = "dashed"
  ) + 
  geom_segment(
    aes(xend = 1, yend = 1),
    data = . %>% filter(.is_best_closest_topleft),
    color = "maroon",
    linetype = "dashed"
  ) + 
  # Basically, finding a point 9/10ths of the way
  # along the line
  geom_text(
    aes(
      x = weighted.mean(c(1, .specificities), c(9, 1)), 
      y = weighted.mean(c(1, .sensitivities), c(9, 1)), 
    ),
    data = . %>% filter(.is_best_closest_topleft),
    color = "maroon",
    label = "closest to topleft",
    hjust = 0, 
    nudge_x = .02,
    size = 5
  ) + 
  geom_text(
    aes(
      x = .specificities, 
      y = weighted.mean(c(1 - .specificities, .sensitivities), c(1, 2)), 
    ),
    data = . %>% filter(.is_best_youden),
    color = "blue",
    label = "Youden's J\n(max height above diagonal)",
    hjust = 0,
    vjust = .5,
    nudge_x = .02,
    size = 5
  ) + 
  annotate(
    "text",
    x = .91,
    y = .05,
    hjust = 0, 
    size = 5,
    label = "diagonal: random classifier",
    color = "grey20"
  ) +
  scale_x_reverse() +
  coord_fixed() +
  theme_grey(base_size = 12)
p_best

(Smooth) density ROC curves

Instead of looking at the observed data, let’s assume the s100b values in each group are drawn from a normal distribution but the means and scales (standard deviations) are different for the two groups. We can compute each group’s mean and standard deviation and then plot the normal density curves on top of each other. Pepe (2003) refers to this approach as the “binormal ROC curve”.

data_stats <- data %>% 
  group_by(outcome) %>% 
  summarise(
    mean = mean(s100b),
    sd = sd(s100b)
  ) 

l_control <- data_stats %>% 
  filter(outcome == "Good") %>% 
  as.list()

l_case <- data_stats %>% 
  filter(outcome != "Good") %>% 
  as.list()

ggplot(data) + 
  aes(x = s100b, color = outcome) + 
  # include a "rug" at the bottom
  geom_jitter(aes(y = -.2), width = 0, height = .15, alpha = .4) +
  stat_function(
    data = . %>% filter(outcome == "Good"),
    fun = dnorm, 
    args = list(mean = l_control$mean, sd = l_control$sd)
  ) +
  stat_function(
    data = . %>% filter(outcome != "Good"),
    fun = dnorm, 
    args = list(mean = l_case$mean, sd = l_case$sd)
  ) +
  geom_text(
    aes(x = mean, y = dnorm(mean, mean, sd), label = outcome),
    data = data_stats,
    vjust = "inward",
    hjust = 0,
    nudge_x = .05,
    nudge_y = .05,
    size = 4
  ) +
  theme_grey(14) +
  theme(legend.position = "top", legend.justification = "left") +
  labs(y = NULL) +
  guides(color = "none")
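
As an aside, this binormal model has a closed-form AUC: Φ((μ_case − μ_control) / √(σ_control² + σ_case²)), where Φ is the standard normal CDF. Here is a quick sketch using the group statistics computed above (this closed form is a standard binormal result, not something pROC computes here):

# Closed-form binormal AUC: P(case score > control score)
# for two independent normal distributions
pnorm(
  (l_case$mean - l_control$mean) / sqrt(l_control$sd^2 + l_case$sd^2)
)

The density-based AUC computed below can differ somewhat from this value because the grid of densities is truncated at the observed range of the data.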

At various points along the x-axis range, stat_function() computes dnorm() (the density of the normal curve). We can do that by hand too. We take the full range of the data and then, within each group, generate a set of points along that range and compute that group’s density at each point.

data_grid <- data %>% 
  mutate(
    xmin = min(s100b),
    xmax = max(s100b)
  ) %>% 
  group_by(outcome) %>% 
  reframe(
    x = seq(xmin[1], xmax[1], length.out = 200),
    group_mean = mean(s100b),
    group_sd = sd(s100b),
    density = dnorm(x, group_mean, group_sd)
  ) 
data_grid
data_grid
#> # A tibble: 400 × 5
#>    outcome      x group_mean group_sd density
#>    <fct>    <dbl>      <dbl>    <dbl>   <dbl>
#>  1 Good    0.03        0.162    0.131    1.84
#>  2 Good    0.0403      0.162    0.131    1.98
#>  3 Good    0.0505      0.162    0.131    2.13
#>  4 Good    0.0608      0.162    0.131    2.27
#>  5 Good    0.0710      0.162    0.131    2.40
#>  6 Good    0.0813      0.162    0.131    2.53
#>  7 Good    0.0915      0.162    0.131    2.64
#>  8 Good    0.102       0.162    0.131    2.75
#>  9 Good    0.112       0.162    0.131    2.84
#> 10 Good    0.122       0.162    0.131    2.91
#> # ℹ 390 more rows

Next, we pivot the data to a wide format because we will be comparing the two densities at each point.

data_dens <- data_grid %>% 
  rename(s100b = x) %>% 
  select(-group_mean, -group_sd) %>% 
  pivot_wider(names_from = outcome, values_from = density)
data_dens
#> # A tibble: 200 × 3
#>     s100b  Good  Poor
#>     <dbl> <dbl> <dbl>
#>  1 0.03    1.84 0.659
#>  2 0.0403  1.98 0.676
#>  3 0.0505  2.13 0.694
#>  4 0.0608  2.27 0.711
#>  5 0.0710  2.40 0.729
#>  6 0.0813  2.53 0.746
#>  7 0.0915  2.64 0.763
#>  8 0.102   2.75 0.780
#>  9 0.112   2.84 0.797
#> 10 0.122   2.91 0.813
#> # ℹ 190 more rows

pROC::roc() can compute an ROC curve from these densities. Note that the interface here is different: we do not provide a dataframe and names of columns in that dataframe. Instead, we provide two vectors of densities, and in fact, those densities are lost after computing the ROC curve.

data_dens <- arrange(data_dens, s100b)
r_dens <- roc(
  density.controls = data_dens$Good, 
  density.cases = data_dens$Poor
)
r_dens
#> 
#> Call:
#> roc.default(density.controls = data_dens$Good, density.cases = data_dens$Poor)
#> 
#> Data: (unknown) in 0 controls ((unknown) ) < 0 cases ((unknown) ).
#> Smoothing: density with controls: data_dens$Good; and cases: data_dens$Poor
#> Area under the curve: 0.8299
plot(r_dens)

The roc object here returns the coordinates with sensitivity in decreasing order, so it is not obvious how to map these sensitivities back to the original densities. In terms of the earlier density plot, we don’t know whether the sensitivities move up the x axis or down the x axis.

Let’s restate the problem again, for clarity:

  • We want to map thresholds to densities to ROC coordinates and map ROC coordinates back to densities to thresholds.
  • With pROC::roc(density.controls, density.cases), we hit a brick wall and cannot map backwards from ROC coordinates because the sensitivities may have been reversed with respect to the densities.

Fortunately, if we compute the sensitivities by hand, we can figure out how the coordinates were ordered. We try both orderings and find the one that better matches the one provided by pROC::roc().

# direction > : Good > threshold >= Poor
sens_gt <- rev(cumsum(data_dens$Poor) / sum(data_dens$Poor))
# direction < : Good < threshold <= Poor
sens_lt <- 1 - (cumsum(data_dens$Poor) / sum(data_dens$Poor))
# Which ordering did the model use?
fitted_sensitivities <- r_dens$sensitivities[-c(1, 201)]

mean(fitted_sensitivities - sens_lt)
#> [1] 0.004999997
mean(fitted_sensitivities - sens_gt)
#> [1] -0.530585

Because the < direction better matched the ROC results, we conclude that the sensitivities follow the same order as the densities.

compute_smooth_density_roc() uses a similar heuristic to determine the order of the ROC coordinates with respect to the original densities. As a result, we can map the original threshold values to sensitivity and specificity values. The function also lets us use column names directly.

data_smooth <- compute_smooth_density_roc(
  data = data_dens, 
  controls = Good, 
  cases = Poor, 
  along = s100b
)
data_smooth
#> # A tibble: 202 × 10
#>     s100b  Good  Poor .sensitivities .specificities  .auc .roc_row .direction
#>     <dbl> <dbl> <dbl>          <dbl>          <dbl> <dbl>    <int> <chr>     
#>  1 0.03    1.84 0.659          1             0      0.830        2 <         
#>  2 0.0403  1.98 0.676          0.992         0.0221 0.830        3 <         
#>  3 0.0505  2.13 0.694          0.984         0.0460 0.830        4 <         
#>  4 0.0608  2.27 0.711          0.975         0.0716 0.830        5 <         
#>  5 0.0710  2.40 0.729          0.967         0.0989 0.830        6 <         
#>  6 0.0813  2.53 0.746          0.958         0.128  0.830        7 <         
#>  7 0.0915  2.64 0.763          0.949         0.158  0.830        8 <         
#>  8 0.102   2.75 0.780          0.939         0.190  0.830        9 <         
#>  9 0.112   2.84 0.797          0.930         0.223  0.830       10 <         
#> 10 0.122   2.91 0.813          0.920         0.257  0.830       11 <         
#> # ℹ 192 more rows
#> # ℹ 2 more variables: .is_best_youden <lgl>, .is_best_closest_topleft <lgl>

compute_smooth_density_roc() also provides coordinates for the “best” thresholds by the Youden or topleft criteria. Because of the consistency between the two functions, we can just replace the data used to make the annotated ROC curve with the smoothed ROC coordinates. In this case, the Youden and topleft points are different.

p_best + list(data_smooth)
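
For example, we can pull out the rows flagged as best and read off the smoothed thresholds (an illustration using the columns shown above):

data_smooth %>% 
  filter(.is_best_youden | .is_best_closest_topleft) %>% 
  select(
    s100b, .sensitivities, .specificities, 
    .is_best_youden, .is_best_closest_topleft
  )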

As a final demonstration, let’s compare the smooth and empirical ROC sensitivity and specificity values along the threshold values.

ggplot(data_smooth) + 
  aes(x = s100b) + 
  geom_line(
    aes(color = "smooth", linetype = "smooth", y = .sensitivities)
  ) + 
  geom_line(
    aes(color = "empirical", linetype = "empirical", y = .sensitivities),
    data = data_roc
  ) + 
  geom_line(
    aes(color = "smooth", linetype = "smooth", y = .specificities)
  ) + 
  geom_line(
    aes(color = "empirical", linetype = "empirical", y = .specificities),
    data = data_roc
  ) +
  annotate("text", x = 2, y = .9, hjust = 1, label = "specificity") + 
  annotate("text", x = 2, y = .1, hjust = 1, label = "sensitivity") +
  labs(
    color = "ROC type", 
    linetype = "ROC type",
    y = NULL
  ) + 
  theme_grey(base_size = 12) + 
  theme(legend.position = "top")
#> Warning: Removed 2 rows containing missing values or values outside the scale range
#> (`geom_line()`).
#> Removed 2 rows containing missing values or values outside the scale range
#> (`geom_line()`).