Diagnostic Tests
Introduction
Suppose we have a quantitative measurement Xi on experimental or observed units i = 1, ..., n, and a characteristic Yi = 0 or Yi = 1 (e.g. case/control status). The measurement Xi is thought to be related to the characteristic Yi in the sense that units with higher Xi values are more likely to have Yi = 1. We can make a prediction about Yi based on Xi by setting a threshold value T, and predicting Yi = 1 when Xi > T. This is called a “diagnostic test.”
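As a minimal sketch (not from the slides), the prediction rule can be written in R; the vector x, the threshold thr, and the example values are all hypothetical:

x = c(2.1, 0.4, 3.7, 1.2, 5.0)  ## hypothetical measurements X_i
thr = 2.5                       ## hypothetical threshold T
yhat = as.integer(x > thr)      ## predict Y_i = 1 when X_i > T
yhat                            ## 0 0 1 0 1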
Applications of diagnostic testing
Cancer detection: The amount or concentration of a protein Xi in serum obtained from person i may be used to predict whether the person has a particular form of cancer.
Credit scoring: A person’s credit score at the time that he or she receives a loan may be used to predict whether the loan is repaid on time.
Labeling conventions
- The labeling of outcome categories as 1 or 0 is arbitrary in principle – for example, we could label cancer as 1 and non-cancer as 0, or vice versa. But in practice, label 1 is typically used for the rarer category, or the category that would require some action or intervention. Label 0 usually denotes a default category that requires no action.
- Depending on the situation, it may be that either larger values of X or smaller values of X are associated with higher probabilities that Yi = 1. In the latter case we can work with −Xi, or use prediction rules of the form Xi < T rather than Xi > T.
Diagnostic testing terminology
A diagnostic test is a balance between two types of successful predictions and two types of errors:
- Successful predictions:
True positive: a situation in which Xi > T and Yi = 1, for example when a person with cancer is predicted to have cancer.
True negative: a situation in which Xi < T and Yi = 0, for example when a cancer-free person is predicted to be cancer-free.
- Errors:
False positive: a situation in which Xi > T but Yi = 0, for example when a person is predicted to have cancer but actually does not.
False negative: a situation in which Xi < T but Yi = 1, for example when a person is predicted to be cancer-free but actually has cancer.
Marginal categories
- The actual status of a unit is positive or negative:
Positive: everyone with Yi = 1 (all true positives and false negatives). The proportion of positives is often called the “prevalence.”
Negative: everyone with Yi = 0 (all false positives and true negatives).
- The predicted status of a unit is “called positive” or “called negative”:
Called positive: everyone with Xi > T (all true positives and false positives).
Called negative: everyone with Xi < T (all true negatives and false negatives).
The relationships among all these terms are summarized as follows:

            Called positive    Called negative
Positive    True positive      False negative
Negative    False positive     True negative
Sensitivity and specificity
A common way to evaluate a diagnostic test is in terms of sensitivity and specificity.
Sensitivity: the proportion of positive units that are called positive; the population value is P(Xi > T | Yi = 1).
Specificity: the proportion of negative units that are called negative; the population value is P(Xi < T | Yi = 0).
Since sensitivity and specificity are calculated conditionally on case/control status (Yi), they can be estimated using either a population sample or a case/control sample.
- 1 − specificity is called the “false positive rate” (FPR).
- 1 − sensitivity is called the “false negative rate” (FNR).
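As a sketch of how these proportions are estimated in R (the function name and the simulated data are ours, not from the slides), assuming x holds the measurements, y the 0/1 status, and thr the threshold T:

sens_spec = function(x, y, thr) {
    sens = mean(x[y == 1] > thr)  ## proportion of positives called positive
    spec = mean(x[y == 0] < thr)  ## proportion of negatives called negative
    c(sensitivity=sens, specificity=spec)
}

## Example on simulated data:
set.seed(1)
y = rbinom(200, 1, 0.3)
x = rnorm(200, mean=ifelse(y == 1, 1, 0))
sens_spec(x, y, 0.5)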
Example: Suppose we have a biomarker Xi for colon cancer such that 75% of people with colon cancer have Xi > T and 5% of people without colon cancer have Xi > T. Thus the sensitivity is 75% and the specificity is 100% − 5% = 95%. We then screen 1000 people from a population with 15% colon cancer prevalence. We should expect the following results:

            Called positive               Called negative
Positive    1000 · 0.15 · 0.75 = 112.5    1000 · 0.15 · 0.25 = 37.5
Negative    1000 · 0.85 · 0.05 = 42.5     1000 · 0.85 · 0.95 = 807.5

The overall error rate is 80/1000 = 8%, and there is a rough balance between false positives and false negatives. Most of the people who have colon cancer are detected.
Example: Now suppose we are screening for pancreatic cancer, with a prevalence of 0.5%, using a test with the same sensitivity and specificity. We expect to get:

            Called positive                Called negative
Positive    1000 · 0.005 · 0.75 = 3.75     1000 · 0.005 · 0.25 = 1.25
Negative    1000 · 0.995 · 0.05 = 49.75    1000 · 0.995 · 0.95 = 945.25

The overall error rate improves to 51/1000 ≈ 5%. The errors overwhelmingly consist of cancer-free false positives. Note that we could get an error rate of 0.5% by predicting everybody to be cancer-free.
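The expected counts in both tables can be reproduced with a short R sketch (the function expected_counts is ours):

expected_counts = function(n, prev, sens, spec) {
    m = rbind(Positive=c(n * prev * sens, n * prev * (1 - sens)),
              Negative=c(n * (1 - prev) * (1 - spec), n * (1 - prev) * spec))
    colnames(m) = c("Called positive", "Called negative")
    m
}
expected_counts(1000, 0.15, 0.75, 0.95)   ## colon cancer example
expected_counts(1000, 0.005, 0.75, 0.95)  ## pancreatic cancer example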
Sensitivity and specificity for normal populations
Suppose that X | Y = 0 is normal with mean µ0 and standard deviation σ0, and X | Y = 1 is normal with mean µ1 and standard deviation σ1. Then

Sensitivity = P(X > T | Y = 1)
            = P((X − µ1)/σ1 > (T − µ1)/σ1 | Y = 1)
            = P(Z > (T − µ1)/σ1)
            = 1 − P(Z ≤ (T − µ1)/σ1),

where Z is a standard normal variable. P(Z ≤ z) can be obtained from a normal probability table.
Exercise: Derive a similar formula for specificity.
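In R, pnorm plays the role of the normal table; a sketch with illustrative parameter values (not from the slides):

mu1 = 2; s1 = 1   ## mean and SD of X | Y = 1
mu0 = 0; s0 = 1   ## mean and SD of X | Y = 0
thr = 1           ## the threshold T
sens = 1 - pnorm((thr - mu1) / s1)  ## P(X > T | Y = 1)
spec = pnorm((thr - mu0) / s0)      ## P(X < T | Y = 0); cf. the exercise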
Positive and negative predictive values
Another way to evaluate a diagnostic test is based on the positive and negative predictive values.
Positive predictive value (PPV): the proportion of units called positive that are positive; the population value is P(Yi = 1 | Xi > T).
Negative predictive value (NPV): the proportion of units called negative that are negative; the population value is P(Yi = 0 | Xi < T).
- 1 − PPV is called the “false discovery rate” – the proportion of called positives that are negative.
Relationships between sensitivity, specificity, positive predictive value, and negative predictive value
If we know the prevalence, we can use Bayes’ theorem to convert between sensitivity/specificity and positive/negative predictive values. For example:

P(Yi = 1 | Xi > T) = P(Xi > T | Yi = 1) P(Yi = 1) / P(Xi > T)

that is,

PPV = sensitivity · prevalence / P(positive call).

Exercise: Derive a similar relationship for NPV.
Note: If prevalence/P(positive call) is approximately 1, then the PPV and sensitivity are similar.
Note: PPV depends on the prevalence, so it cannot be estimated from a case/control sample unless we have an independent estimate of the prevalence.
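A sketch of this conversion in R (the function name is ours); it reproduces the worked examples below:

ppv_npv = function(sens, spec, prev) {
    p_pos = sens * prev + (1 - spec) * (1 - prev)  ## P(positive call)
    c(PPV = sens * prev / p_pos,
      NPV = spec * (1 - prev) / (1 - p_pos))
}
ppv_npv(0.75, 0.95, 0.15)    ## colon cancer example
ppv_npv(0.75, 0.95, 0.005)   ## pancreatic cancer example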
Example: The probability of being called positive in the colon cancer example above is 0.15 · 0.75 + 0.85 · 0.05 = 0.155. Thus the positive predictive value is 0.75 · 0.15/0.155 ≈ 0.73.
Exercise: Show that the negative predictive value for the colon cancer example is approximately 0.96.
Example: For the pancreatic cancer example, the probability of being called positive is 0.005 · 0.75 + 0.995 · 0.05 = 0.0535, so the positive predictive value is 0.75 · 0.005/0.0535 ≈ 0.07.
Exercise: Show that the negative predictive value for the pancreatic cancer example is approximately 0.999.
Note that pancreatic cancer screening looks easier than colon cancer screening based on the overall error rate (5% versus 8%), but the PPV reveals that the pancreatic cancer test produces a high fraction of false positives.
Which cancer is truly easier to detect? It depends on the follow-up:
- Suppose that for colon cancer there is a secondary test that can quickly and safely differentiate the 113 true positives from the 43 false positives, and there is a treatment that substantially helps 50% of people whose colon cancer is detected at screening. Then the 43 false positives only need to go through the inconvenience and stress of a secondary test, and half of the 113 true positives have substantially improved outcomes.
- Suppose that for pancreatic cancer the only way to confirm the disease is by an invasive procedure that has a 10% rate of serious complications, and therapy only improves the outcome for 20% of people with the disease. Then about 5 (≈ 49.75 · 0.10) healthy people are put at serious risk in order to identify about 5 people with pancreatic cancer, of whom only one on average will benefit from treatment.
Note: the numbers used for the colon and pancreatic cancer examples are made up, but are roughly realistic.
ROC curves
Suppose we want to evaluate how much information a measurement Xi contains about a characteristic Yi, but we don’t yet want to fix a specific threshold value T. A graphical approach is to plot sensitivity on the vertical axis against 1 − specificity on the horizontal axis for all possible values of T.
[Figure: sensitivity plotted against 1 − specificity, with three operating points marked]

         Sensitivity   Specificity
Red      0.93          0.31
Blue     0.76          0.62
Green    0.50          0.84
The following facts constrain a plot of sensitivity against 1 − specificity:
- As T decreases, the sensitivity P(X > T | Y = 1) is non-decreasing.
- As T decreases, the specificity P(X < T | Y = 0) is non-increasing, so 1 − specificity is non-decreasing.
- When T is +∞, the sensitivity and 1 − specificity are both 0.
- When T is −∞, the sensitivity and 1 − specificity are both 1.
ROC curves
A plot of sensitivity against 1 − specificity is called a “Receiver Operating Characteristic curve,” or “ROC curve.” Due to the constraints discussed above, a ROC curve is a non-decreasing path from (0, 0) to (1, 1).
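A sketch of computing and plotting an empirical ROC curve in R (the function name and the simulated data are ours):

roc_points = function(x, y) {
    tt = sort(unique(x), decreasing=TRUE)  ## candidate thresholds
    sens = sapply(tt, function(t) mean(x[y == 1] >= t))
    fpr = sapply(tt, function(t) mean(x[y == 0] >= t))
    cbind(fpr=c(0, fpr), sens=c(0, sens))  ## path from (0,0) to (1,1)
}
set.seed(2)
x0 = rnorm(100)
x1 = rnorm(100, mean=1)
r = roc_points(c(x0, x1), rep(c(0, 1), each=100))
plot(r[,"fpr"], r[,"sens"], type="l", xlab="1-Specificity", ylab="Sensitivity")
abline(0, 1, lty=2)  ## the diagonal of an uninformative test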
Reading and interpreting ROC curves
- If X contains no information about Y, the sensitivity is P(X > T | Y = 1) = P(X > T), and the specificity is P(X < T | Y = 0) = P(X < T). Therefore 1 − specificity is P(X > T), so the ROC curve is a plot of P(X > T) against P(X > T) – a diagonal line from (0, 0) to (1, 1). Note that in this case sensitivity = 1 − specificity, or sensitivity + specificity = 1.
- If X is perfectly informative about Y, then there exists a point T such that P(X > T | Y = 1) = 1 and P(X < T | Y = 0) = 1. We can always determine the value of Y based on whether X is greater than or less than T. In this case the ROC curve is a path from (0, 0) to (0, 1) to (1, 1).
- If X is partially informative about Y, then for at least some values of T, sensitivity + specificity > 1, so the ROC curve is sometimes above the diagonal. The more it lies above the diagonal, the better.
- If the ROC curve is usually or always below the diagonal, the relationship between X and Y is inverted, and we should be using −X rather than X to form our predictions.
Graphs of population ROC curves The following plots show population ROC curves (right side) together with the population densities (left side) of X values in the Y = 0 group (orange) and in the Y = 1 group (blue).
[Figures: population densities of X for Y = 0 (orange) and Y = 1 (blue) on the left, with the corresponding population ROC curves on the right, for AUC = 0.56, 0.76, 0.92, 0.81, and 0.98]
Area under the curve (AUC)
The ROC curve always lies in the “unit box” (0, 1) × (0, 1).
- In the most favorable situation for prediction, the ROC curve consists of the left and top edges of the box, so the area under the ROC curve is 1 (the area of the whole box).
- In the least favorable situation for prediction, X and Y are independent, the ROC curve follows the diagonal from (0, 0) to (1, 1), and the area under the ROC curve is 1/2.
In general, the area under the ROC curve (AUC) can be used as an overall measure of the information in X about Y. The AUC can fall anywhere between 0 and 1, but if the correct orientation of X is known, the AUC will fall between 1/2 and 1. Higher AUC values correspond to a greater amount of information in X about Y.
Sampling interpretation of the AUC
Suppose a positive unit (Yi = 1) and a negative unit (Yj = 0) are selected at random. The population AUC is the probability that Xi > Xj.
The sample AUC is also known as the “Mann-Whitney statistic,” and can be equivalently calculated as

AUC = Σ_{i,j} I(Xi > Xj, Yi = 1, Yj = 0) / Σ_{i,j} I(Yi = 1, Yj = 0).

The AUC can be calculated in R as follows:

wilcox.test(X1, X0)$statistic / (length(X1) * length(X0))
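As a check on the pairwise form (the simulated data are illustrative), the sample AUC can also be computed directly by comparing all (positive, negative) pairs:

set.seed(3)
X0 = rnorm(50)            ## X values for units with Y = 0
X1 = rnorm(50, mean=1)    ## X values for units with Y = 1
mean(outer(X1, X0, ">"))  ## proportion of pairs with Xi > Xj
wilcox.test(X1, X0)$statistic / (length(X1) * length(X0))  ## same value (no ties)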
Inference
Sensitivity, specificity, PPV, and NPV are proportions. For example, the population sensitivity is p = P(X > T | Y = 1), which we estimate as

p̂ = Σ_i I(Xi > T) Yi / Σ_i Yi.

If Σ_i Yi is fixed (as in a case/control study), p̂ is a simple average. In this case, it is unbiased and has variance var(p̂) = p(1 − p)/n1, where n1 = Σ_i Yi.
If the data are from a random sample of size N, then n1 is random. In this case, the estimate of sensitivity is still unbiased, but the variance is larger than in a case/control study. The conditional variance is

var(p̂ | n1) = p(1 − p)/n1.

Using the law of total variation, we get

var(p̂) = var E(p̂ | n1) + E var(p̂ | n1) = 0 + p(1 − p) Σ_{n=1}^{N} P(n1 = n)/n.

Since n1 is the number of cases out of a total sample size of N, and each sampled unit has a fixed probability q of being a case, n1 has a binomial distribution:

P(n1 = n) = (N choose n) q^n (1 − q)^(N−n).
How does Σ_{n=1}^{N} P(n1 = n)/n (that is, E(1/n1)) compare to 1/n1 when n1 is fixed, as in a case/control study? This tells us about the efficiency of a case/control study compared to a random population sample for estimating the sensitivity. These plots compare standard errors for the two types of sampling when the total sample size is 50.
[Figures: standard errors of the estimated sensitivity under case/control and population sampling, and the difference between them, plotted against P(Y = 1), for total sample size 50]
The difference could be important if P(Y = 1) < 0.2 or so.
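A sketch of how these standard errors can be computed in R (the function name is ours; we condition on n1 ≥ 1 so that 1/n1 is defined, which is an assumption about how the figure was produced):

se_compare = function(N, q, p) {
    n = 1:N
    w = dbinom(n, N, q)
    e_inv_n1 = sum(w / n) / sum(w)  ## E(1/n1), given n1 >= 1
    c(case_control = sqrt(p * (1 - p) / (N * q)),  ## n1 fixed at N*q
      population = sqrt(p * (1 - p) * e_inv_n1))
}
se_compare(50, 0.1, 0.75)  ## total sample size 50, prevalence 0.1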
Sample ROC curves for various sample sizes ROC curves based on data fluctuate around their mean value and become more accurate as the sample size increases. The following plots show sample ROC curves when X|Y = 0 is standard normal and X|Y = 1 is normal with mean 1 and variance 1.
[Figures: sample ROC curves for per-group sample sizes 25, 50, 100, and 200]
Inference for the AUC
For most statistics, the standard error of the statistic based on a sample of size N approximately has the form SE ≈ c/√N. When this holds, we can form a log/log plot of SE against sample size, and the slope will be −1/2:

log SE ≈ log(c) − log(N)/2.
Here is the plot for AUC:
[Figure: log2 SE(AUC) plotted against log2 total sample size, with observed standard errors and a linear fit (grey line)]
The slope of the grey line is −1.16, so this is not a typical statistic. It appears that SE ≈ c/N for the AUC. There are complicated analytic expressions for the standard error of the AUC, but the bootstrap is also a good approach.
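A Monte Carlo sketch of how such a plot can be produced (all names and settings are ours), using the normal populations from the earlier simulation:

se_auc = function(N, nrep=500) {
    a = replicate(nrep, {
        x0 = rnorm(N / 2)          ## X | Y = 0
        x1 = rnorm(N / 2, mean=1)  ## X | Y = 1
        mean(outer(x1, x0, ">"))   ## sample AUC
    })
    sd(a)                          ## Monte Carlo estimate of SE(AUC)
}
Ns = c(50, 100, 200, 400)
ses = sapply(Ns, se_auc)
coef(lm(log2(ses) ~ log2(Ns)))[2]  ## fitted slope of the log/log plot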
Generalization of the threshold value
A common way to set the threshold value T is to specify a lower bound κ on specificity, and set T to the lowest value such that the sample specificity in the training set is greater than κ. What is the distribution of threshold values associated with this procedure? What is the distribution of population specificity values associated with this procedure?
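A simulation sketch of this procedure for the normal setting on the next slide (the function name is ours, and we restrict T to observed values): for each simulated control sample, find the smallest order statistic whose sample specificity exceeds κ, then record the population specificity pnorm(T):

choose_T = function(x0, kappa) {
    xs = sort(x0)
    k = which(sapply(xs, function(t) mean(x0 < t)) > kappa)[1]
    xs[k]  ## smallest observed threshold with sample specificity > kappa
}
set.seed(4)
tvals = replicate(2000, choose_T(rnorm(50), 0.9))  ## n = 50 controls
pop_spec = pnorm(tvals)  ## population specificity at the selected T
hist(pop_spec, main="Population specificity of the selected threshold")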
Distributions of threshold values T when κ = 0.9, X|Y = 0 is standard normal and X|Y = 1 is normal with mean µ1 and variance 1. Both groups have sample size n.
[Figures: densities of the selected threshold T for (n = 25, µ1 = 1), (n = 50, µ1 = 1), (n = 50, µ1 = 2), and (n = 100, µ1 = 1)]
Population specificities corresponding to the distributions of threshold values on the previous slide.

[Figures: densities of the population specificity for (n = 25, µ1 = 1), (n = 50, µ1 = 1), (n = 50, µ1 = 2), and (n = 100, µ1 = 1)]
Parametric bootstrap for ROC analysis Suppose a diagnostic test is available that has an AUC of 0.8. Someone has developed a new test, and wants to show that it is superior to the gold standard. The following code outlines the parametric bootstrap using normal models for the X|Y = 0 and X|Y = 1 populations.
## Estimate means and standard deviations for the X|Y=0 and X|Y=1
## populations.
m0 = mean(X0)
s0 = sd(X0)
m1 = mean(X1)
s1 = sd(X1)

nboot = 1000  ## The number of bootstrap samples to use.
auc = rep(0, nboot)

for (k in 1:nboot) {
    ## Generate a bootstrap data set.
    x0 = rnorm(length(X0), mean=m0, sd=s0)
    x1 = rnorm(length(X1), mean=m1, sd=s1)
    auc[k] = wilcox.test(x1, x0)$statistic / (length(x1) * length(x0))
}

auc = sort(auc)
lb = auc[0.025*nboot]  ## The lower bound of the CI.
ub = auc[0.975*nboot]  ## The upper bound of the CI.
The following plots show the observed ROC curve in red, along with 10 ROC curves from parametric bootstrap samples in grey.
[Figures: observed ROC curve (red) with 10 parametric bootstrap ROC curves (grey), for per-group sample sizes 25 and 50]
         µ1 (σ1)   µ0 (σ0)   n    µ̂1 (σ̂1)       µ̂0 (σ̂0)       AUC    95% CI
Left     1 (1)     0 (1)     25   1.03 (0.97)    −0.09 (0.85)    0.80   (0.66, 0.92)
Right    1 (1)     0 (1)     50   1.24 (0.95)    −0.08 (0.99)