--- title: "Tools for ROC curves" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Tools for ROC curves} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(tidyverse) ``` ```{r setup} library(tidyverse) library(wisclabmisc) library(pROC, exclude = c("cov", "smooth", "var")) ``` ## A primer on ROC curves wisclabmisc provides functions for tidying results from ROC curves. These curves arise in diagnostic or classification settings where we want to use some test score to determine whether an individual belongs in a *control* group versus a *case* group. This binary classification could be normal versus clinical status, regular email versus spam status, and so on. I use the terminology *control* and *case* to follow the pROC package's interface. In this classification literature, there are tons and tons of statistics to describe classifier performance. The ROC curve centers around the two important quantities of [*sensitivity* and *specificity*](https://en.wikipedia.org/wiki/Sensitivity_and_specificity): - **sensitivity** is the proportion of true cases correctly identified as cases. - Also called the *true positive rate* or *recall.* - If I apply my spam classifier to 100 spam emails, how many will be correctly flagged as spam? - P(case result | case status) - *Sensitivity* makes sense to me if I think about the problem as detecting something subtle. (Like a Jedi being "force sensitive" or Spider-Man's Spidey sense tingling when he's in danger.) - **specificity** is the proportion of true controls correctly identified as controls. - Also called the *true negative rate* or *selectivity*. - If I apply my spam classifier to 100 safe (ham) emails, how many will be correctly ignored? - P(control result | control status) - *Specificity* is not a great term; *selectivity* makes slightly more sense. We don't want the sensor to trip over noise: It needs to be specific or selective. Suppose our diagnostic instrument provides a score, and we have to choose a diagnostic threshold for one of these scores. For example, suppose we decide that scores above 60 indicate that an email is probably spam and can be moved into the spam folder. Then that threshold will have its own specificity attached to it. We can look at the proportion of spam emails that are equal to or above 60 (sensitivity), and we can look at the proportion of ham emails that are below 60 (specificity). Each number we choose for the threshold will have its own sensitivity and specificity score, so **an ROC curve is a visualization of how sensitivity and specificity change along the range of threshold scores**. (More impenetrable terminology: ROC stands for "receiver operating characteristic", having something to do with detections made by [radar receivers at different operating levels][radar].) ### A worked example We can work through an example ROC curve using the pROC package. pROC provides the `aSAH` dataset which provides "several clinical and one laboratory variable of 113 patients with an aneurysmal subarachnoid hemorrhage" (hence, `aSAH`). We have the `outcome` (`Good` versus `Poor`) and some measure called `s100b`. We can see that that are many more `Good` outcomes near 0 and there are `Poor` outcomes. 
```{r data-points, fig.width = 4, fig.height = 3}
data <- as_tibble(aSAH)
data
count(data, outcome)

ggplot(data) +
  aes(x = s100b, y = outcome) +
  geom_point(
    position = position_jitter(width = 0, height = .2),
    size = 3,
    alpha = .2,
  ) +
  theme_grey(base_size = 12) +
  labs(y = NULL)
```

For each point in a grid of points along `s100b`, we can treat that point as a threshold and compute the proportion of patients in each group above or below it. We can then plot these proportions to visualize the trade-off between specificity and sensitivity as the threshold changes.

```{r sens-spec, fig.height = 3, fig.width = 5}
by_outcome <- split(data, data$outcome)
smallest_diff <- min(diff(unique(sort(data$s100b))))

grid <- tibble(
  threshold = seq(
    min(data$s100b) - smallest_diff,
    max(data$s100b) + smallest_diff,
    length.out = 200
  )
)

roc_coordinates <- grid %>%
  rowwise() %>%
  summarise(
    threshold = threshold,
    prop_poor_above = mean(by_outcome$Poor$s100b >= threshold),
    prop_good_below = mean(by_outcome$Good$s100b < threshold),
  )

ggplot(roc_coordinates) +
  aes(x = threshold) +
  geom_step(aes(y = prop_poor_above)) +
  geom_step(aes(y = prop_good_below)) +
  annotate("text", x = 2, y = .9, hjust = 1, label = "specificity") +
  annotate("text", x = 2, y = .1, hjust = 1, label = "sensitivity") +
  labs(
    title = "Sensitivity and specificity as cumulative proportions",
    x = "threshold (diagnosis when score >= threshold)",
    y = NULL
  )
```

It took me about 5 tries to get this plot correct. I am able to convince myself by noting that all of the `Good` outcomes are less than .51, so a threshold of .51 or higher should not catch a single `Good` outcome and hence has a specificity of 1. Conversely, there is just one `Poor` outcome above 1, so a threshold of 1 is going to detect 1 `Poor` outcome and hence have a very low sensitivity.

If we ignore the threshold in our visualization, we can (finally) plot a canonical ROC curve. It shows specificity in reverse order so that the most ideal point is the top left corner (sensitivity = 1, specificity = 1).

```{r roc-1, fig.width = 3, fig.height = 3}
roc_coordinates <- roc_coordinates %>%
  rename(
    sensitivities = prop_poor_above,
    specificities = prop_good_below
  ) %>%
  # otherwise the stair-steps look wrong
  arrange(sensitivities)

p <- ggplot(roc_coordinates) +
  aes(x = specificities, y = sensitivities) +
  geom_step() +
  scale_x_reverse() +
  coord_fixed() +
  theme_grey(base_size = 14)
p
```

We can compare our plot to the one provided by the pROC package. We find a perfect match in our sensitivity and specificity values.

```{r base-roc, fig.width = 3, fig.height = 3}
roc <- pROC::roc(data, response = outcome, predictor = s100b)
plot(roc)

proc_coordinates <- roc[2:3] %>%
  as.data.frame() %>%
  arrange(sensitivities)

# Plot the pROC points as a wide semi-transparent blue
# band on top of ours
p +
  geom_step(
    data = proc_coordinates,
    color = "blue",
    alpha = .5,
    size = 2
  )
```

Instead of computing ROC curves by hand, we defer the calculation of ROC curves to the pROC package because it is easy to get confused when calculating sensitivity and specificity and because pROC provides other tools for working with ROC curves. Thus, wisclabmisc's goal with ROC curves is to provide helper functions that fit ROC curves with pROC and return the results in a nice dataframe.

We contrast two types of ROC curves:

  - an **empirical** ROC curve where the raw data is used to make a jagged ROC curve
  - a **(smooth) density** ROC curve where the densities of two distributions are used to make a smooth ROC curve.
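
Before looking at each of those in turn, here is one last check on the worked example above. The area under the ROC curve (AUC) is a common one-number summary of classifier performance, and the area under our hand-made curve should be close to the AUC that pROC reports. The following sketch is only a sanity check with the trapezoidal rule; it is not part of the wisclabmisc workflow, and the two values may differ slightly because we approximate a step function on a 200-point grid.

```{r}
# Rough sanity check (not a wisclabmisc function): approximate the area
# under the hand-computed ROC curve with the trapezoidal rule.
ord <- order(1 - roc_coordinates$specificities)
fpr <- (1 - roc_coordinates$specificities)[ord]
tpr <- roc_coordinates$sensitivities[ord]

# sum of trapezoid areas: width times average height
sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

# AUC that pROC computed from the raw data
pROC::auc(roc)
```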
## Empirical ROC curves

Let's return to the above example, predicting the group label `outcome` (case: `Poor`, control: `Good`) from the predictor `s100b`.

```{r basic-roc, fig.width = 3, fig.height = 3}
r <- pROC::roc(data, outcome, s100b)
r
```

From the messages, we can see that `pROC::roc()` makes a few decisions for us: that `Good` is the `control` level and `Poor` is the `case` level, and that controls should have a lower `s100b` than cases.

`pROC::roc()` returns an `roc` object which bundles all of the data and model results together. Ultimately, we want the results in a dataframe so that one row will provide the sensitivity and specificity for each threshold value.

```{r}
r
class(r)
str(r, max.level = 1, give.attr = FALSE)
```

We can get close to a dataframe by manipulating the list or by using `coords()`. `pROC::coords()` has additional features that allow it to identify the "best" ROC points, but it strips off useful data like the direction used.

```{r}
r[1:5] %>%
  as.data.frame() %>%
  tibble::as_tibble()

pROC::coords(r) %>%
  tibble::as_tibble()
```

wisclabmisc provides `compute_empirical_roc()` which combines results from `pROC::roc()` and `pROC::coords()` into a tibble. It includes metadata about the `.controls` and `.cases` levels, the `.direction` of the relationship, and the overall `.auc` of the curve. It also identifies two "best" coordinates with `.is_best_youden` and `.is_best_closest_topleft`. Finally, it retains the name of the predictor variable.

```{r}
compute_empirical_roc(data, outcome, s100b)
```

We can still see the messages emitted by the `pROC::roc()` call when we use `compute_empirical_roc()`. We can pass the arguments `direction` and `levels` to `pROC::roc()` to silence these messages.

```{r}
data_roc <- compute_empirical_roc(
  data,
  outcome,
  s100b,
  direction = "<",
  levels = c("Good", "Poor")
)
data_roc
```

According to the help page for `pROC::coords()`, Youden's J statistic is the point that is the farthest vertical distance from the diagonal line. The other "best" point is the point closest to the upper-left corner. The following plot labels each of these distances. The Youden point and the topleft point here are the same point.

```{r roc-annotated}
data_roc <- data_roc %>%
  arrange(.sensitivities)

p_best <- ggplot(data_roc) +
  aes(x = .specificities, y = .sensitivities) +
  geom_abline(
    slope = 1,
    intercept = 1,
    linetype = "dotted",
    color = "grey20"
  ) +
  geom_step() +
  geom_segment(
    aes(xend = .specificities, yend = 1 - .specificities),
    data = . %>% filter(.is_best_youden),
    color = "blue",
    linetype = "dashed"
  ) +
  geom_segment(
    aes(xend = 1, yend = 1),
    data = . %>% filter(.is_best_closest_topleft),
    color = "maroon",
    linetype = "dashed"
  ) +
  # Basically, finding a point 9/10ths of the way
  # along the line
  geom_text(
    aes(
      x = weighted.mean(c(1, .specificities), c(9, 1)),
      y = weighted.mean(c(1, .sensitivities), c(9, 1)),
    ),
    data = . %>% filter(.is_best_closest_topleft),
    color = "maroon",
    label = "closest to topleft",
    hjust = 0,
    nudge_x = .02,
    size = 5
  ) +
  geom_text(
    aes(
      x = .specificities,
      y = weighted.mean(c(1 - .specificities, .sensitivities), c(1, 2)),
    ),
    data = . %>% filter(.is_best_youden),
    color = "blue",
    label = "Youden's J\n(max height above diagonal)",
    hjust = 0,
    vjust = .5,
    nudge_x = .02,
    size = 5
  ) +
  annotate(
    "text",
    x = .91,
    y = .05,
    hjust = 0,
    size = 5,
    label = "diagonal: random classifier",
    color = "grey20"
  ) +
  scale_x_reverse() +
  coord_fixed() +
  theme_grey(base_size = 12)
p_best
```

## (Smooth) density ROC curves

Instead of looking at the observed data, let's assume the `s100b` values in each group are drawn from a normal distribution, but that the means and scales (standard deviations) differ between the two groups. We can compute each group's mean and standard deviation and then plot the normal density curves on top of each other. Pepe (2003) refers to this approach as the "binormal ROC curve".

```{r data-points-2, fig.width = 4, fig.height = 3}
data_stats <- data %>%
  group_by(outcome) %>%
  summarise(
    mean = mean(s100b),
    sd = sd(s100b)
  )

l_control <- data_stats %>%
  filter(outcome == "Good") %>%
  as.list()

l_case <- data_stats %>%
  filter(outcome != "Good") %>%
  as.list()

ggplot(data) +
  aes(x = s100b, color = outcome) +
  # include a "rug" at the bottom
  geom_jitter(aes(y = -.2), width = 0, height = .15, alpha = .4) +
  stat_function(
    data = . %>% filter(outcome == "Good"),
    fun = dnorm,
    args = list(mean = l_control$mean, sd = l_control$sd)
  ) +
  stat_function(
    data = . %>% filter(outcome != "Good"),
    fun = dnorm,
    args = list(mean = l_case$mean, sd = l_case$sd)
  ) +
  geom_text(
    aes(x = mean, y = dnorm(mean, mean, sd), label = outcome),
    data = data_stats,
    vjust = "inward",
    hjust = 0,
    nudge_x = .05,
    nudge_y = .05,
    size = 4
  ) +
  theme_grey(14) +
  theme(legend.position = "top", legend.justification = "left") +
  labs(y = NULL) +
  guides(color = "none")
```

At various points along the *x*-axis range, `stat_function()` computes `dnorm()` (the density of the normal curves). We can do that by hand too. We take the full range of the data, and then within each group, generate a set of points along that range and compute that group's density at each point.

```{r}
data_grid <- data %>%
  mutate(
    xmin = min(s100b),
    xmax = max(s100b)
  ) %>%
  group_by(outcome) %>%
  summarise(
    x = seq(xmin[1], xmax[1], length.out = 200),
    group_mean = mean(s100b),
    group_sd = sd(s100b),
    density = dnorm(x, group_mean, group_sd),
    .groups = "drop"
  )
data_grid
```

Next, we pivot to a wide format because we will be comparing the two densities at each point.

```{r}
data_dens <- data_grid %>%
  rename(s100b = x) %>%
  select(-group_mean, -group_sd) %>%
  pivot_wider(names_from = outcome, values_from = density)
data_dens
```

`pROC::roc()` can compute an ROC curve from these densities. Note that the interface here is different. We do not provide a dataframe and the names of columns in that dataframe. Instead, we provide two vectors of densities, and in fact, those densities are lost after computing the ROC curve.

```{r smooth-roc, fig.width = 3, fig.height = 3}
data_dens <- arrange(data_dens, s100b)

r_dens <- roc(
  density.controls = data_dens$Good,
  density.cases = data_dens$Poor
)
r_dens
plot(r_dens)
```

The `roc` object here stores the coordinates with sensitivity in decreasing order, so it is not obvious how to map these sensitivities back to the original densities. In terms of the earlier density plot, we don't know whether the sensitivities move up the *x* axis or down the *x* axis.

Let's restate the problem again, for clarity:

- We want to map thresholds to densities to ROC coordinates and map ROC coordinates back to densities to thresholds.
- With `pROC::roc(density.controls, density.cases)`, we hit a brick wall and cannot map backwards from ROC coordinates because the sensitivities may have been reversed with respect to the densities.

Fortunately, if we compute the sensitivities by hand, we can figure out how the coordinates were ordered. We try both orderings and find the one that best matches the one provided by `pROC::roc()`.

```{r}
# direction > : Good > threshold >= Poor
sens_gt <- rev(cumsum(data_dens$Poor) / sum(data_dens$Poor))

# direction < : Good < threshold <= Poor
sens_lt <- 1 - (cumsum(data_dens$Poor) / sum(data_dens$Poor))

# Which ordering did the model use?
fitted_sensitivities <- r_dens$sensitivities[-c(1, 201)]
mean(fitted_sensitivities - sens_lt)
mean(fitted_sensitivities - sens_gt)
```

Because the `<` direction better matched the ROC results, we conclude that the sensitivities follow the same order as the densities.

`compute_smooth_density_roc()` uses a similar heuristic to determine the order of the ROC coordinates with respect to the original densities. As a result, we can map the original threshold values to sensitivity and specificity values. The function also lets us use column names directly.

```{r}
data_smooth <- compute_smooth_density_roc(
  data = data_dens,
  controls = Good,
  cases = Poor,
  along = s100b
)
data_smooth
```

`compute_smooth_density_roc()` also provides coordinates for the "best" thresholds by the Youden or topleft criteria. Because of the consistency between the two functions, we can just replace the data used to make the annotated ROC curve with the smoothed ROC coordinates. In this case, the Youden and topleft points are different.

```{r roc-annotated-smooth}
p_best + list(data_smooth)
```

As a final demonstration, let's compare the smooth and empirical ROC sensitivity and specificity values along the threshold values.

```{r another-comparison, fig.height = 3, fig.width = 4}
ggplot(data_smooth) +
  aes(x = s100b) +
  geom_line(
    aes(color = "smooth", linetype = "smooth", y = .sensitivities),
  ) +
  geom_line(
    aes(color = "empirical", linetype = "empirical", y = .sensitivities),
    data = data_roc
  ) +
  geom_line(
    aes(color = "smooth", linetype = "smooth", y = .specificities)
  ) +
  geom_line(
    aes(color = "empirical", linetype = "empirical", y = .specificities),
    data = data_roc
  ) +
  annotate("text", x = 2, y = .9, hjust = 1, label = "specificity") +
  annotate("text", x = 2, y = .1, hjust = 1, label = "sensitivity") +
  labs(
    color = "ROC type",
    linetype = "ROC type",
    y = NULL
  ) +
  theme_grey(base_size = 12) +
  theme(legend.position = "top")
```

[radar]: https://stats.stackexchange.com/a/398154/14825 "What is the origin of the “receiver operating characteristic” (ROC) terminology?"
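
One nice property of the binormal model is that its AUC has a simple closed form: when both groups are normal, the AUC works out to `pnorm((mean_case - mean_control) / sqrt(sd_control^2 + sd_case^2))`. As a rough sanity check (a sketch, not a wisclabmisc function), we can plug in the group summaries computed earlier (`l_case`, `l_control`) and compare the result with the AUC of the empirical ROC curve.

```{r}
# Rough sanity check (not a wisclabmisc function): the closed-form AUC of
# the binormal model, using the group means and SDs computed earlier.
pnorm(
  (l_case$mean - l_control$mean) /
    sqrt(l_control$sd^2 + l_case$sd^2)
)

# AUC of the empirical ROC curve, for comparison
pROC::auc(r)
```

The two values will not match exactly, because the empirical curve makes no distributional assumptions, but they should land in the same neighborhood.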