W.F. Sensakovic, PhD, DABR, MRSC
Attendees/trainees should not construe any of the discussion or content of the session as insider information about the American Board of Radiology or its examinations.
– Outline subtle tumor
– Clinical decision or Human visual system
– Does widget “A” make it easier for the observer to detect the microcalcification?
A group of observers reviews a set of subject images to generate data that is then analyzed
Detection, Delineation, Diagnosis
A group of radiologists reviews a set of CT scans (FBP or iterative reconstruction) and records the probability of malignancy. ROC analysis then determines if iterative reconstruction impacts diagnosis.
Widely used scale:
– Definitely or almost definitely malignant
– Probably malignant
– Possibly malignant
– Probably benign
– Definitely or almost definitely benign
Based on: Swets JA, et al. Assessment of Diagnostic Technologies. Science 205(4408):753 (1979)
Including clinical relevance:
– Malignant—diagnosis apparent—warrants appropriate clinical management
– Malignant—diagnosis uncertain—warrants further diagnostic study/biopsy
– I’m not certain—warrants further diagnostic study
– Benign—no follow‐up necessary
Based on: Potchen EJ. Measuring Observer Performance in Chest Radiology: Some Experiences. J Am Coll Radiol 3:423 (2006)
– Continuous rating data are recommended, particularly for single‐reader studies
– Wagner RF, et al. Continuous versus Categorical Data for ROC Analysis: Some Quantitative Considerations. Acad Radiol 8(4):328 (2001).
– Observers can use continuous scales for ratings
– Rockette HE, et al. The use of continuous and discrete confidence judgments in receiver operating characteristic studies of diagnostic imaging. Invest Radiol 27:169 (1992).
[Rating scale: Probability of Malignancy, 0% to 100%]
– Abnormal: biopsy or other gold standard
– Normal: follow‐up (e.g., 1‐year) post imaging
– In a 3‐system comparison, the “best” system depended on the method used for truth
– Report variability in consensus
Multi‐reader, full factorial design
– Every observer reads every case in every modality
– Data correlations allow us to get the highest power and lowest sample requirements
– ROC software listed later
– Some older packages no longer run natively, but may still run on an emulator such as DOSBox (https://www.dosbox.com)
– y‐axis: Sensitivity (true positive fraction)
– x‐axis: 1‐Specificity (false positive fraction)
Does iterative reconstruction impact diagnosis of malignancy in lung lesions?
Case #/Truth    With IR    W/O IR
1/Malignant     10.0       8.9
2/Benign        4.4        6.3
3/Benign        3.4        2.7
5/Malignant     5.6        5.2
6/Malignant     7.7        7.0
7/Malignant     9.2        8.1
…               …          …

[ROC plot: True Positive Fraction (Sensitivity) vs. False Positive Fraction (1‐Specificity), one curve With IR and one W/O IR]
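A minimal Python sketch of how such a rating table becomes empirical ROC points (the arrays mirror the illustrative table above; a real analysis would use the fitted-curve software cited later):

import numpy as np

truth = np.array([1, 0, 0, 1, 1, 1])                  # 1 = malignant, 0 = benign
with_ir = np.array([10.0, 4.4, 3.4, 5.6, 7.7, 9.2])   # ratings with IR

def roc_points(ratings, truth):
    # Sweep a decision threshold over every observed rating and record
    # (false positive fraction, true positive fraction) at each setting
    points = []
    for t in np.unique(ratings):
        call_malignant = ratings >= t
        tpf = call_malignant[truth == 1].mean()   # sensitivity
        fpf = call_malignant[truth == 0].mean()   # 1 - specificity
        points.append((fpf, tpf))
    return sorted(points)

print(roc_points(with_ir, truth))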
Does iterative reconstruction impact diagnosis of malignancy in lung lesions?
[ROC curves: True Positive Fraction (Sensitivity) vs. False Positive Fraction (1‐Specificity), With IR and W/O IR]
– AUC = 0.8
– AUC = 0.8 vs. AUC = 0.7
AUC interpretation: the average percent correct if observers are shown random malignant/benign pairs and asked to choose the malignant case (a two‐alternative forced choice)
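A quick numerical check of this interpretation (a minimal sketch; the example ratings are the hypothetical table values above):

import numpy as np

# Ratings of truly malignant and truly benign cases (illustrative values)
malignant = np.array([10.0, 5.6, 7.7, 9.2])
benign = np.array([4.4, 3.4])

# AUC = fraction of malignant/benign pairs where the malignant case is
# rated higher (ties count one half) -- the 2AFC percent correct
wins = (malignant[:, None] > benign[None, :]).sum()
ties = (malignant[:, None] == benign[None, :]).sum()
auc = (wins + 0.5 * ties) / (malignant.size * benign.size)
print(auc)   # 1.0 here: every malignant rating exceeds every benign rating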
– Calculate ROC and AUC for each observer
– Calculate combined ROC and AUC with dispersion
– Perform hypothesis test to determine if AUCs from 2 treatments significantly differ (see the sketch below)
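A simplified sketch of these three steps with hypothetical per‐reader AUCs. A rigorous multi‐reader analysis should use the MRMC software cited later; the naive paired t‐test below ignores correlations between cases.

import numpy as np
from scipy import stats

def empirical_auc(ratings, truth):
    # Step 1 helper: trapezoidal (Mann-Whitney) AUC for one observer
    pos, neg = ratings[truth == 1], ratings[truth == 0]
    gt = (pos[:, None] > neg[None, :]).sum()
    eq = (pos[:, None] == neg[None, :]).sum()
    return (gt + 0.5 * eq) / (pos.size * neg.size)

# Hypothetical per-reader AUCs under the two treatments
auc_a = np.array([0.82, 0.79, 0.85, 0.80])
auc_b = np.array([0.74, 0.71, 0.76, 0.69])

print(auc_a.mean(), auc_a.std(ddof=1))      # step 2: combined AUC with dispersion
t_stat, p = stats.ttest_rel(auc_a, auc_b)   # step 3: naive paired comparison
print(p)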
– Curve fitting can be unstable with a small number of rating categories
– Zweig MH, Campbell G. Receiver‐operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine. Clin Chem 39:561 (1993).
– Degenerate fits can occur if there are too few samples or if ratings are confined to a narrow range
– Metz CE. Practical Aspects of CAD Research Assessment Methodologies for CAD. Presented at the AAPM annual meeting.
– Readers may be treated as a random or fixed effect
– Similarly, for cases
– Sensitivity 25%‐100% depending on case selection
– Case selection should yield a sufficient number of false‐positive responses
– Rockette, et al. Selection of subtle cases for observer‐performance studies: The importance of knowing the true diagnosis (1998).
– Study prevalence need not match disease population prevalence
– ROC AUC stable between 2%–28% study prevalence, but small increases in observer ratings are seen with low prevalence
– Minimum effect size of interest
– How much the difference varies
– Run a small pilot
– Program uses pilot data and resampling/Monte Carlo simulation to estimate variance for various model components (reader, case, etc.); see the sketch after this list
pairs (near equal for normal/abnormal)
– ICRU Report 79
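A minimal sketch of the resampling idea (the pilot AUCs and single variance component are hypothetical; dedicated programs model reader and case components jointly):

import numpy as np

rng = np.random.default_rng(1)
pilot_aucs = np.array([0.81, 0.77, 0.84])   # hypothetical pilot, one AUC per reader

# Bootstrap the readers to estimate the reader-component variance of the
# mean AUC, one input to the sample-size calculation
boot_means = [rng.choice(pilot_aucs, size=pilot_aucs.size, replace=True).mean()
              for _ in range(2000)]
print(np.var(boot_means))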
Pilot Data
– Non‐clinical task, specialized software, new modality
– 45% of truth cases contained errors
– Armato SG, et al. The Lung Image Database Consortium (LIDC): Ensuring the Integrity of Expert‐Defined “Truth”. Acad Radiol 14:1455 (2007)
– Clinical conditions and equipment
– A few weeks rule‐of‐thumb (unless case is unusual)
– Block study design (see refs. below)
– Metz CE. Some practical issues of experimental design and data analysis in radiological ROC studies. Invest Radiol 24:234 (1989).
– Metz CE. Fundamental ROC analysis. In: Beutel J, et al. Handbook of medical imaging. Vol 1. Bellingham, WA: SPIE Press, 2000.
– Sp 0.9:
– Study description mindful of blinding
– Types of relevant abnormalities and their precise study definition
– How to perform task and record data
– Unique conditions observers should or should not consider
– Aim to detect moderate, but important differences
– ~5% (ICRU Report 79)
– Bigger difference could be seen with easier testing methodology
– Smaller differences might be too costly or clinically insignificant
A group of radiologists reviews a set of chest radiographs (CR and DR) to determine if pneumonia is present. ROC determines if the modalities are equivalent.
– Methodology is essentially the same as in diagnosis
Reduced observer variability: 20% → 7%
– Abnormal—diagnosis apparent—warrants appropriate clinical management
– Abnormal—diagnosis uncertain—warrants further diagnostic study
– I’m not certain—warrants further diagnostic study
– Abnormal—but not clinically significant
– Normal
Potchen EJ. Measuring Observer Performance in Chest Radiology: Some Experiences. J Am Coll Radiol 3:423 (2006)
Clinical relevance reduces variability and thus sample size requirements
A group of radiologists reviews a set of radiographs with and without a CAD system, marks the centroid of nodules if present, and gives confidence ratings. FROC determines if CAD helps.
– How close a mark must be to a lesion to count as a “hit”:
– 50% ROI overlap
– Radius based on size of largest lesion
– Thompson JD, et al. Analysing data from observer studies in medical imaging research: an introductory guide to free‐response observer performance studies. Br J Radiol 86 (2013)
[Rating scale: Confidence, 0% to 100%]
– Check whether outlines are impacted by window/level settings
– “Distance” between outlines quantifies their disagreement
– Know exact size
– Clinically relevant?
– Union/Intersection
– P‐Map
– STAPLE
Jaccard Coefficient
– Count pixels in intersection
– Count pixels in union
– Divide intersection by union: J = |A ∩ B| / |A ∪ B|
– Dice coefficient: D = 2J/(1+J)
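A minimal sketch of these steps on binary masks (array sizes and offsets are illustrative):

import numpy as np

a = np.zeros((10, 10), dtype=bool); a[2:7, 2:7] = True   # outline A as a filled mask
b = np.zeros((10, 10), dtype=bool); b[4:9, 4:9] = True   # outline B

intersection = np.logical_and(a, b).sum()   # pixels in the intersection
union = np.logical_or(a, b).sum()           # pixels in the union
j = intersection / union                    # Jaccard
d = 2 * j / (1 + j)                         # Dice, equivalently 2*intersection/(|A|+|B|)
print(j, d)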
Example values (D = 2J/(1+J)):
Jaccard: 0.0, 0.20, 0.33, 0.50, 0.67
Dice:    0.0, 0.33, 0.50, 0.67, 0.80
Distance
– Easy to understand
– Meaningful units
Distance
– Find the shortest absolute distance from each boundary point of A to the boundary of B
– Repeat for B to A
– Compute summary stats
Different metrics emphasize different aspects of the difference:
– Dice/Jaccard
– Average distance
Hausdorff distance:
– Take a point in A and find the shortest distance to B
– Repeat for all points of A
– Take the maximum of the shortest distances: h(A,B)
– Repeat for h(B,A)
– Hausdorff distance = max of h(A,B) and h(B,A)
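A brute‐force sketch of the directed and symmetric Hausdorff distances (the boundary point sets are illustrative):

import numpy as np

A = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])   # boundary points of A
B = np.array([[0.0, 0.2], [2.0, 0.0]])               # boundary points of B

def directed_h(P, Q):
    # h(P,Q): shortest distance from each point of P to Q, then the maximum
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=2)
    shortest = d.min(axis=1)
    # shortest.mean() would give the "average distance" discussed earlier
    return shortest.max()

H = max(directed_h(A, B), directed_h(B, A))   # symmetric Hausdorff distance
print(H)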
– Observers anchor to whatever is already drawn
– 48% increase in Jaccard when a previous outline was used
– Sensakovic et al. The influence of initial outlines on manual segmentation. Med Phys. 37(5):2153 (2010).
– Discrete‐space vs. continuous‐space boundary definitions can alter measurements by 20%
– Sensakovic, WF et al. Discrete‐space versus continuous‐space lesion boundary and area definitions. Med Phys 35(9): 4070 (2008).
– For hypothesis testing of these metrics, see “Practical Statistics for Medical Physicists” from the last two AAPM annual meetings
References: observer performance in medicine in general
– ICRU Report 79
– Wagner RF, et al. Assessment of Medical Imaging Systems and Computer Aids: A Tutorial Review. Acad Radiol 14:723 (2007)
– Chakraborty DP. New Developments in Observer Performance Methodology in Medical Imaging. Semin Nucl Med 41(6):401 (2011)
– Obuchowski NA, Beiden SV, Berbaum KS, et al. Multi‐reader, multi‐case ROC analysis: an empirical comparison of five methods. Acad Radiol 11:980–995 (2004)
– Toledano A. Three methods for analyzing correlated ROC curves: a comparison in real data sets from multi‐reader, multi‐case studies with a factorial design. Stat Med 22:2919–2933 (2003)
– Obuchowski NA. Multireader receiver operating characteristic studies: a comparison of study designs. Acad Radiol 2:709–716 (1995)
– Potchen EJ. Measuring Observer Performance in Chest Radiology: Some Experiences. J Am Coll Radiol 3:423 (2006)
– Hillis SL, et al. Power Estimation for the Dorfman‐Berbaum‐Metz Method. Acad Radiol 11:1260 (2004)
– See also some of the software listed
– Chakraborty DP, et al. Observer studies involving detection and localization: modeling, analysis, and validation. Med Phys 31:2313 (2004)
– Chakraborty DP. New Developments in Observer Performance Methodology in Medical Imaging. Semin Nucl Med 41(6):401 (2011)
– Thompson JD, et al. Analysing data from observer studies in medical imaging research: an introductory guide to free‐response observer performance studies. Br J Radiol 86 (2013)
– Thompson JD, et al. The Value of Observer Performance Studies in Dose Optimization: A Focus on Free‐Response Receiver Operating Characteristic Methods. J Nucl Med Technol 41:57 (2013)
– …ceiverOperatingCharacteristicROC/MRMCAnalysis/tabid/116/Default.aspx
– Implementation and Analysis of Observer Studies in Medical Physics. Med Phys 43:3714 (2016); http://dx.doi.org/10.1118/1.4957320
– Phases, Levels, Controls, and All That: An Informal Session On Clinical Trials
– Use and Abuse of Common Statistics in Radiological Physics
– Uncertainty and Issues in Biological Modeling for the Modern Medical Physicist
– Clinical trials and the medical physicist: design, analysis, and our role
– Implementation and Analysis of Observer Studies in Medical Physics
– Analysis of Dependent Variables: Correlation and Simple Regression
– Hypothesis or Hypotheses, That is the Question
Correlation: Review of Terminology
– X is independent; Y is dependent
– Linear: an increase in X leads to a proportional increase in Y
– Monotonic: an increase in X leads to some increase in Y
Correlation: Metrics of Interest
Three major correlation metrics: Pearson’s r, Spearman’s ⍴, Fleiss’ κ
Correlation: Pearson’s r
– Measures the strength of linear association between the dependent and independent variable
[Six example scatter plots with Pearson’s r = 1.00, 0.97, 0.76, 0.96, 0.96, 0.75; see the sketch below]
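A minimal sketch of Pearson’s r from its definition, checked against scipy (the data are synthetic):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * x + rng.normal(0.0, 0.1, x.size)   # noisy linear relationship

# r = covariance of (x, y) normalized by the two standard deviations
r = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean())**2) * np.sum((y - y.mean())**2))
print(r, stats.pearsonr(x, y)[0])            # the two values agree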
Correlation: Spearman’s ⍴
– Measures the strength of monotonic association between the dependent and independent variable
Rank‐transform the (X,Y) pairs, then compute Pearson’s r on the ranks:
– Raw: (0,0) → Rank: (1,1)
– Raw: (0.05,0.0025) → Rank: (2,2)
– …
– Raw: (1,1) → Rank: (20,20)
– Pearson’s r of the rank‐transformed data: 1.00
[Example scatter plots: r = 1.00 / ⍴ = 1.00; r = 0.97 / ⍴ = 0.97; r = 0.76 / ⍴ = 0.90; r = 0.96 / ⍴ = 1.00; r = 0.96 / ⍴ = 0.99; r = 0.75 / ⍴ = 0.89]
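A minimal sketch confirming the rank‐transform picture above, using the monotonic curve y = x² (synthetic data):

import numpy as np
from scipy import stats

x = np.linspace(0.0, 1.0, 20)
y = x**2                                   # monotonic but nonlinear

rho = stats.spearmanr(x, y)[0]
r_of_ranks = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))[0]
print(rho, r_of_ranks)                     # both 1.0: the ranks line up exactly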
Correlation: Which Metric?
Continuous variables; “When one goes up, does the other (reliably) go down?”
[Plot: relative change from baseline in disease volume and lung volume vs. months after baseline]
Z.E. Labby et al, J Thorac Oncol 8 (2013)
Answer: Spearman’s ⍴
Correlation: Fleiss’ κ
Response classifications (counts per category):
Progression:  6  11   7  11  14
Stable:      17  10  19  15   9
Partial:      7  10   5   4   8
Complete:     1   1
κ = 0.64 (“substantial” agreement per Landis and Koch)
Landis and Koch, Biometrics, 33,159–174 (1977)
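A minimal implementation of Fleiss’ κ (the 4-case, 5-rater count matrix is illustrative, not the per-case data behind the table above):

import numpy as np

def fleiss_kappa(counts):
    # counts[i, j] = number of raters placing case i in category j
    N = counts.shape[0]
    n = counts[0].sum()                      # raters per case (constant)
    p_j = counts.sum(axis=0) / (N * n)       # overall category proportions
    P_i = (np.sum(counts**2, axis=1) - n) / (n * (n - 1))  # per-case agreement
    P_bar, P_e = P_i.mean(), np.sum(p_j**2)
    return (P_bar - P_e) / (1 - P_e)

counts = np.array([[5, 0, 0],
                   [3, 2, 0],
                   [0, 4, 1],
                   [1, 1, 3]])
print(fleiss_kappa(counts))   # ~0.33 for this illustrative matrix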
Correlation vs. Agreement
Important question: Do you already know that the two variables will be correlated? Example: Tumor volumes as assessed by Physician vs. Algorithm
– When two methods measure the same quantity (whose true value remains unknown), correlation isn’t as meaningful
– Assess agreement between the variables instead
Bland and Altman, “Statistical methods for assessing agreement between two methods of clinical measurement,” Lancet 327, 307 (1986).
[Bland‐Altman plot: Difference (Physician − Algorithm) vs. Average of Physician and Algorithm, with lines at the Mean, Mean + 2SD, and Mean − 2SD]
Bland and Altman, “Statistical methods for assessing agreement between two methods of clinical measurement,” Lancet 327, 307 (1986).
– For ratio data: take ln of the measurements, compute the limits of agreement, then exponentiate to get relative agreement bounds
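A minimal sketch of the Bland‐Altman computation, including the log variant above (the volume pairs are hypothetical):

import numpy as np

physician = np.array([10.2, 15.1, 7.8, 22.4, 12.0])  # tumor volumes, cm^3
algorithm = np.array([9.8, 16.0, 7.2, 21.1, 12.9])

diff = physician - algorithm
mean_pair = (physician + algorithm) / 2     # x-axis of the Bland-Altman plot
bias, sd = diff.mean(), diff.std(ddof=1)
print(bias, bias - 2 * sd, bias + 2 * sd)   # mean difference and +/- 2SD limits

# Log variant: difference the ln(volumes), then exponentiate the limits
# to get relative (ratio) agreement bounds
ldiff = np.log(physician) - np.log(algorithm)
lb, lsd = ldiff.mean(), ldiff.std(ddof=1)
print(np.exp(lb - 2 * lsd), np.exp(lb + 2 * lsd))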
Case #/Truth    With IR
1/Malignant     10.0
2/Benign        4.4
3/Benign        3.4
5/Malignant     5.6
6/Malignant     7.7
7/Malignant     9.2
…               …

[Histogram: # of ratings vs. observer malignancy rating (0.0 = definitely benign, 5.0 = indeterminate, 10.0 = definitely malignant), shown separately for actually malignant and actually benign cases, With IR. Sliding a decision threshold along the rating axis partitions cases into TP, FP, TN, FN; plotting True Positive Fraction (Sensitivity) vs. False Positive Fraction (1‐Specificity) as the threshold moves traces out the ROC curve.]
AUC comparison not appropriate if ROC curves cross each other
[ROC plot: two crossing curves, True Positive Fraction (Sensitivity) vs. False Positive Fraction (1‐Specificity)]
– Consider analyzing only the diagnostically relevant portion of the curve (partial AUC)
– McClish DK. Analyzing a Portion of the ROC Curve. Medical Decision Making 9 (3): 190 (1989)
– If you already have Sp and Se… why bother with ROC?
– Based on example given by CE Metz during lectures at the University of Chicago, 2003
– Use localization information
– More than one response per subject
– ROC describes overall performance, not specific operating points (TP,FP) or (PPV, NPV) values
– Reader ability and experience, societal norms, and even disease prevalence in the study can impact specific operating points