
SLIDE 1

2/9/2017

W.F. Sensakovic, PhD, DABR, MRSC

Attendees/trainees should not construe any of the discussion or content of the session as insider information about the American Board of Radiology or its examinations.

Public Domain

SLIDE 2

  • Task is complex

– Outline subtle tumor

  • Unquantifiable human element

– Clinical decision-making or the human visual system

  • Human response is goal

– Does widget “A” make it easier for the observer to detect the microcalcification?

Bunch of observers look at a bunch of subject images to create data that is then analyzed

SLIDE 3

CC 3.0: Zzyzx11

CC 3.0 Aaron Dodson, from The Noun Project

Detection • Delineation • Diagnosis

Bunch of radiologists look at a bunch of CT scans (FBP or iterative recon) to record a probability of malignancy for each. ROC analysis determines if iterative reconstruction impacts diagnosis.

SLIDE 4

Widely Used Scale
  • Definitely or almost definitely malignant
  • Probably malignant
  • Possibly malignant
  • Probably benign
  • Definitely or almost definitely benign

Based on: Swets JA, et al. Assessment of Diagnostic Technologies. Science 205(4408):753 (1979)

Including clinical relevance
  • Malignant—diagnosis apparent—warrants appropriate clinical management
  • Malignant—diagnosis uncertain—warrants further diagnostic study/biopsy
  • I’m not certain—warrants further diagnostic study
  • Benign—no follow‐up necessary

Based on: Potchen EJ. Measuring Observer Performance in Chest Radiology: Some Experiences. J Am Coll Radiol 3:423 (2006)

  • Typically 5‐7 categories
  • Validated scale if available and appropriate
  • Continuous vs. categorical makes the biggest difference for single-reader studies
    – Wagner RF, et al. Continuous versus Categorical Data for ROC Analysis: Some Quantitative Considerations. Acad Radiol 8(4):328 (2001).
  • No practical difference between discrete and continuous rating scales
    – Rockette HE, et al. The use of continuous and discrete confidence judgments in receiver operating characteristic studies of diagnostic imaging techniques. Invest Radiol 27(2):169 (1992).

(Continuous scale: Probability of Malignancy, 0% to 100%)

SLIDE 5

  • Best
    – Abnormal: Biopsy or other gold standard
    – Normal: Follow‐up (e.g., 1‐year) post imaging
  • Combined reads (expert panel)
    – In a 3-system comparison, the “best” system depended on the method used for truth
      • Revesz G, et al. The effect of verification on the assessment of imaging techniques. Invest Radiol 18:194 (1983).
    – Report variability in consensus
      • Bankier AA, et al. Consensus Interpretation in Imaging Research: Is There a Better Way? Radiology 257:14 (2010).

  • Task is binary (e.g., Malignant vs. Benign)
  • Multi‐Reader, Multi‐Case (MRMC)
  • Multiple treatments (e.g., IR vs. FBP)
  • Traditional, Fully‐Crossed, Paired‐Case Paired‐Reader, Full Factorial
    – Every observer reads every case in every modality
    – Data correlations allow us to get the highest power and lowest sample requirements

SLIDE 6

  • Software (free or not) does it for you
    – ROC software listed later
    – Some is unsupported or not functional on modern computers, but may still run on an emulator such as DOSBox (https://www.dosbox.com)

MATH

  • True Positive (TP)

– Sensitivity

  • False Positive (FP)

– 1‐Specificity

  • True Negative (TN)
  • False Negative (FN)
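These four counts combine directly into the two ROC axes. A minimal sketch in Python (the function name and framing are illustrative, not from the talk):

```python
# Minimal sketch: sensitivity and specificity from the four counts
# produced by binary decisions at a fixed threshold.
def sens_spec(tp, fp, tn, fn):
    sensitivity = tp / (tp + fn)  # true positive fraction
    specificity = tn / (tn + fp)  # 1 - false positive fraction
    return sensitivity, specificity
```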

CC 3.0: Marco Evangelista

SLIDE 7

Does iterative reconstruction impact diagnosis of malignancy in lung lesions?

Public Domain

Case #/Truth    Obs. 1    Obs. 2
1/Malignant     10.0      8.9
2/Benign         4.4      6.3
3/Benign         3.4      2.7
5/Malignant      5.6      5.2
6/Malignant      7.7      7.0
7/Malignant      9.2      8.1
…                …        …

[ROC plot: True Positive Fraction (Sensitivity) vs. False Positive Fraction (1‐Specificity), one curve With IR and one W/O IR]
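A minimal sketch of how each ROC point arises from a rating table like the one above: sweep a decision threshold over the ratings and compute the (FPF, TPF) pair at each threshold (function and variable names are illustrative):

```python
import numpy as np

def roc_points(ratings, truth):
    """Empirical ROC points. truth: 1 = malignant, 0 = benign."""
    ratings = np.asarray(ratings, dtype=float)
    truth = np.asarray(truth, dtype=int)
    thresholds = np.unique(ratings)[::-1]  # sweep high to low
    tpf = [np.mean(ratings[truth == 1] >= t) for t in thresholds]
    fpf = [np.mean(ratings[truth == 0] >= t) for t in thresholds]
    return np.array(fpf), np.array(tpf)

# Obs. 1 ratings from the table above
fpf, tpf = roc_points([10.0, 4.4, 3.4, 5.6, 7.7, 9.2], [1, 0, 0, 1, 1, 1])
```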

SLIDE 8

Does iterative reconstruction impact diagnosis of malignancy in lung lesions?

[ROC plot: True Positive Fraction (Sensitivity) vs. False Positive Fraction (1‐Specificity), With IR and W/O IR]

  • Yes, it improves diagnosis
  • By how much?
    – AUC = 0.8

SLIDE 9

Does iterative reconstruction impact diagnosis of malignancy in lung lesions?

[ROC plot: With IR and W/O IR curves]

  • Yes, it improves diagnosis
  • By how much?
    – AUC = 0.8 (With IR)
    – AUC = 0.7 (W/O IR)

AUC is the average percent correct if an observer is shown a random malignant case and a random benign case and asked to choose the malignant one (see the sketch below).
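That two-alternative forced-choice reading of AUC can be computed directly: AUC equals the fraction of malignant/benign pairs in which the malignant case gets the higher rating, counting ties as half. A minimal sketch (names are illustrative):

```python
import numpy as np

def auc_2afc(ratings, truth):
    r = np.asarray(ratings, dtype=float)
    t = np.asarray(truth, dtype=int)
    pos, neg = r[t == 1], r[t == 0]              # malignant vs. benign ratings
    wins = (pos[:, None] > neg[None, :]).sum()   # malignant rated higher
    ties = (pos[:, None] == neg[None, :]).sum()  # ties count half
    return (wins + 0.5 * ties) / (len(pos) * len(neg))
```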

SLIDE 10

  • ROC software will (generally):
    – Calculate ROC and AUC for each observer
    – Calculate combined ROC and AUC with dispersion
    – Perform a hypothesis test to determine if AUCs from 2 treatments significantly differ
  • Non‐parametric ROC gives biased underestimates with a small number of rating categories
    – Zweig MH, Campbell G. Receiver operating characteristic plots: a fundamental evaluation tool in clinical medicine. Clin Chem 39:561 (1993).
  • Parametric (semi‐parametric) may perform poorly if there are too few samples or if ratings are confined to a narrow range
    – Metz CE. Practical Aspects of CAD Research Assessment Methodologies for CAD. Presented at the AAPM annual meeting.
  • Only generalizable to the population of all observers if observer is treated as a random effect instead of a fixed effect
    – Similarly, for cases

SLIDE 11

  • Comparisons should be on the same cases
    – Sensitivity 25%‐100% depending on case selection
      • Nishikawa RM, et al. Effect of case selection on the performance of computer‐aided detection schemes. Med Phys 21:265 (1994)
  • Normal-case subtlety must be considered to ensure a sufficient number of false‐positive responses
    – Rockette HE, et al. Selection of subtle cases for observer‐performance studies: The importance of knowing the true diagnosis (1998).
  • Study disease prevalence does not need to match population disease prevalence
    – ROC AUC is stable between 2%‐28% study prevalence, but small increases in observer ratings are seen with low prevalence
      • Gur D, et al. Prevalence effect in a laboratory environment. Radiology 228:10 (2003).
      • Gur D, et al. The Prevalence Effect in a Laboratory Environment: Changing the Confidence Ratings. Acad Radiol 14:49 (2007).

SLIDE 12

  • We need to know:
    – Minimum effect size of interest
      • Smaller needs more cases for testing
      • Appendix C of ICRU 79 converts ΔSe (at a fixed Sp) to ΔAUC
    – How much the difference varies
      • More variation needs more cases for testing

CC BY‐SA 2.0 Barry Stock

  • Sample size software (see references)
    – Run a small pilot
    – Program uses pilot data and resampling/Monte Carlo simulation to estimate variance for various model components (reader, case, etc.)
  • Typical power 0.8 and α of 0.05
  • Typical numbers are 3‐5 observers and 100 case pairs (near equal for normal/abnormal)
    – ICRU Report 79

SLIDE 13

  • 50 observers, 530 cases each . . . probably pass

Pilot Data

  • Observer training
    – Non‐clinical task, specialized software, new modality
  • Data/truth verification
    – 45% of truth cases contained errors
      • Armato SG, et al. The Lung Image Database Consortium (LIDC): Ensuring the Integrity of Expert‐Defined “Truth.” Acad Radiol 14:1455 (2007)
  • Display and acquisition
    – Clinical conditions and equipment

SLIDE 14

  • Bias from re‐reading
    – A few weeks rule‐of‐thumb (unless case is unusual)
    – Block study design (see refs. below)
      • Metz CE. Some practical issues of experimental design and data analysis in radiological ROC studies. Invest Radiol 24:234 (1989).
      • Metz CE. Fundamental ROC analysis. In: Beutel J, et al. Handbook of medical imaging. Vol 1. Bellingham, WA: SPIE Press, 2000.
  • Observer experience
    – At Sp = 0.9:
      • Se = 0.76 (high-volume mammographers)
      • Se = 0.65 (low-volume mammographers)
      • Esserman L, et al. Improving the accuracy of mammography: volume and outcome relationships. J Natl Cancer Inst 94(5):369 (2002)
  • According to ICRU Report 79
    – Study description mindful of blinding
    – Types of relevant abnormalities and their precise study definition
    – How to perform the task and record data
    – Unique conditions observers should or should not consider

SLIDE 15

  • ROC is costly (time and/or money)
  • Best used when looking for small to moderate, but important, differences
    – ~5% (ICRU Report 79)
    – Bigger differences could be seen with easier testing methodology
    – Smaller differences might be too costly or clinically insignificant

  • 1. No localization

Bunch of radiologists look at a bunch of chest radiographs (CR and DR) to determine if pneumonia is present. ROC determines if the modalities are equivalent.

SLIDE 16

  • Rating scales, sample size, and truth are essentially the same as in the diagnosis observer study . . .
  • . . . but the tasks are very different!

Public Domain; CC 2.0: Abhijit Tembhekar

Reduced observer variability (20% → 7%) with a clinically relevant scale:
  • Abnormal—diagnosis apparent—warrants appropriate clinical management
  • Abnormal—diagnosis uncertain—warrants further diagnostic study
  • I’m not certain—warrants further diagnostic study
  • Abnormal—but not clinically significant
  • Normal

Potchen EJ. Measuring Observer Performance in Chest Radiology: Some Experiences. J Am Coll Radiol 3:423 (2006)

Clinical relevance reduces variability and thus sample size requirements

SLIDE 17

  • 2. Localization

Bunch of radiologists look at a bunch of radiographs, with and without a CAD system, mark the centroid of nodules if present, and give confidence ratings. FROC determines if CAD helps.

Courtesy William F. Sensakovic

  • Mark lesion centroid
  • Determine how close the mark must be for a “hit” (see the sketch below)
    – 50% ROI overlap
    – Radius based on size of largest lesion
      • Haygood TM, et al. On the choice of acceptance radius in free‐response observer performance studies. Br J Radiol 86 (2013)

(Confidence scale: 0% to 100%)
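A minimal sketch of the acceptance-radius convention (hypothetical names; the study must pre-specify the radius, e.g., from the largest lesion size):

```python
import math

def is_hit(mark_xy, lesion_xy, radius):
    """True if the observer's mark falls within the acceptance radius
    of the true lesion centroid (all arguments in the same units)."""
    return math.dist(mark_xy, lesion_xy) <= radius
```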

SLIDE 18

Bunch of dosimetrists outline the brainstem on CT scans displayed at two different window/level settings. “Distance” between outlines is calculated. ANOVA is used to test if outlines are impacted by window/level settings.

  • Phantom
    – Know exact size
    – Clinically relevant?
  • Combined outlines on patient images
    – Union/Intersection
    – P‐Map
      • Meyer CR, et al. Evaluation of Lung MDCT Nodule Annotation Across Radiologists and Methods. Acad Radiol 13(10):1254 (2006).
    – STAPLE
      • Warfield SK. Simultaneous truth and performance level estimation (STAPLE): an algorithm for the validation of image segmentation. IEEE Trans Med Imaging 23(7):903 (2004).
SLIDE 19

Jaccard Similarity Coefficient

  • Jaccard (see the sketch below)
    – Count pixels in the intersection (A ∩ B)
    – Count pixels in the union (A ∪ B)
    – Divide intersection by union
  • Dice
    – D = 2J/(1+J)
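A minimal sketch of both overlap metrics on binary masks (NumPy arrays of the same shape; names are illustrative):

```python
import numpy as np

def jaccard(a, b):
    a, b = a.astype(bool), b.astype(bool)
    inter = np.logical_and(a, b).sum()  # pixels in A ∩ B
    union = np.logical_or(a, b).sum()   # pixels in A ∪ B
    return inter / union

def dice(a, b):
    j = jaccard(a, b)
    return 2 * j / (1 + j)  # equivalent to 2|A ∩ B| / (|A| + |B|)
```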

SLIDE 20

Dice    Jaccard
0.0     0.0
0.33    0.20
0.50    0.33
0.67    0.50
0.80    0.67

(Each row satisfies D = 2J/(1+J).)

  • Average Euclidean Distance
    – Easy to understand
    – Meaningful units

SLIDE 21

  • Average Euclidean Distance (sketch below)
    – Find the shortest absolute distance from each boundary point of A to the boundary of B
    – Repeat for B to A
    – Summary stats
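A minimal sketch of the symmetric average boundary distance for contours given as (N, 2) point arrays (names are illustrative; real boundaries should be sampled densely enough that point spacing does not dominate the result):

```python
import numpy as np

def avg_boundary_distance(a_pts, b_pts):
    a = np.asarray(a_pts, dtype=float)
    b = np.asarray(b_pts, dtype=float)
    # pairwise distances: d[i, j] = ||a_i - b_j||
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    a_to_b = d.min(axis=1)  # shortest distance from each A point to B
    b_to_a = d.min(axis=0)  # shortest distance from each B point to A
    return 0.5 * (a_to_b.mean() + b_to_a.mean())
```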

  • Fail to capture difference
    – Dice/Jaccard: ~0.9
    – Average distance: <1 mm
SLIDE 22

  • Hausdorff distance (sketch below)
    – Take a point in A and find the shortest distance to B
    – Repeat for all points of A
    – Take the maximum of the shortest distances: h(A,B)
    – Repeat for h(B,A)
    – Hausdorff distance = max of h(A,B) and h(B,A)
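The same pairwise-distance matrix used for the average distance also gives the Hausdorff distance; a minimal sketch (names are illustrative):

```python
import numpy as np

def hausdorff(a_pts, b_pts):
    a = np.asarray(a_pts, dtype=float)
    b = np.asarray(b_pts, dtype=float)
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    h_ab = d.min(axis=1).max()  # h(A,B): worst shortest distance, A to B
    h_ba = d.min(axis=0).max()  # h(B,A)
    return max(h_ab, h_ba)
```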

SLIDE 23

  • Observers tend to agree with whatever is already drawn
    – 48% increase in Jaccard when a previous outline is used
    – Sensakovic WF, et al. The influence of initial outlines on manual segmentation. Med Phys 37(5):2153 (2010).
  • Different boundary definitions can alter measurements by 20%
    – Sensakovic WF, et al. Discrete‐space versus continuous‐space lesion boundary and area definitions. Med Phys 35(9):4070 (2008).

With Permission: Sensakovic WF, et al. Discrete‐space versus continuous‐space lesion boundary and area definitions. Med Phys 35(9):4070 (2008).

  • Summary statistics, regression, hypothesis testing
    – See “Practical Statistics for Medical Physicists” from the last two AAPM annual meetings

CC 3.0: ReubenGBrewer

SLIDE 24

  • Many intricacies to running an observer study
  • Among the most respected studies in radiology and medicine in general

  • Review (ROC and some FROC)
    – ICRU Report 79
    – Wagner RF, et al. Assessment of Medical Imaging Systems and Computer Aids: A Tutorial Review. Acad Radiol 14:723 (2007)
    – Chakraborty DP. New Developments in Observer Performance Methodology in Medical Imaging. Semin Nucl Med 41(6):401 (2011)
  • Comparing ROC methods
    – Obuchowski NA, Beiden SV, Berbaum KS, et al. Multi‐reader, multicase ROC analysis: an empirical comparison of five methods. Acad Radiol 11:980–995 (2004)
    – Toledano A. Three methods for analyzing correlated ROC curves: A comparison in real data sets from multi‐reader, multi‐case studies with a factorial design. Stat Med 22:2919–2933 (2003)
  • Study design
    – Obuchowski NA. Multireader receiver operating characteristic studies: a comparison of study designs. Acad Radiol 2:709–716 (1995)
    – Potchen EJ. Measuring Observer Performance in Chest Radiology: Some Experiences. J Am Coll Radiol 3:423 (2006)
  • Power and sample size
    – Hillis SL, et al. Power Estimation for the Dorfman‐Berbaum‐Metz Method. Acad Radiol 11:1260 (2004)
    – See also some of the software listed

SLIDE 25

  • FROC and JAFROC
    – Chakraborty DP, et al. Observer studies involving detection and localization: Modeling, analysis, and validation. Med Phys 31:2313 (2004)
    – Chakraborty DP. New Developments in Observer Performance Methodology in Medical Imaging. Semin Nucl Med 41(6):401 (2011)
    – Thompson JD, et al. Analysing data from observer studies in medical imaging research: An introductory guide to free‐response techniques. Radiography 20:295 (2014)
    – Thompson JD, et al. The Value of Observer Performance Studies in Dose Optimization: A Focus on Free‐Response Receiver Operating Characteristic Methods. J Nucl Med Technol 41:57 (2013)

  • http://www.lerner.ccf.org/qhs/software/
  • http://metz‐roc.uchicago.edu/MetzROC/software
  • http://perception.radiology.uiowa.edu/Software/ReceiverOperatingCharacteristicROC/MRMCAnalysis/tabid/116/Default.aspx
  • http://www.devchakraborty.com/index.php
  • http://didsr.github.io/iMRMC/
  • Web-search your favorite software package and ROC
SLIDE 26

  • Sensakovic WF. MO‐FG‐206‐02: Implementation and Analysis of Observer Studies in Medical Physics. Med Phys 43:3714 (2016); http://dx.doi.org/10.1118/1.4957320
  • 2015 (Virtual Library)
    – Phases, Levels, Controls, and All That: An Informal Session On Clinical Trials
      • http://www.aapm.org/education/VL/vl.asp?id=4686
    – Use and Abuse of Common Statistics in Radiological Physics
      • http://www.aapm.org/education/VL/vl.asp?id=4685
    – Uncertainty and Issues in Biological Modeling for the Modern Medical Physicist
      • http://www.aapm.org/education/VL/vl.asp?id=4687
  • 2016 (Handouts Available)
    – Clinical trials and the medical physicist: design, analysis, and our role
    – Implementation and Analysis of Observer Studies in Medical Physics
    – Analysis of Dependent Variables: Correlation and Simple Regression
    – Hypothesis or Hypotheses, That is the Question
      • http://www.aapm.org/meetings/2016AM/PRAbs.asp?mid=115&aid=31918
  • https://creativecommons.org/licenses/
SLIDE 27

Correlation: Review of Terminology

  • Dependent vs. Independent Variables
  • Standard plot: X is independent, Y is dependent
  • Linear vs. Monotonic
    – Linear: increase in X leads to a proportional increase in Y
    – Monotonic: increase in X leads to some increase in Y

(Correlation slides: Labby, AAPM 2016)

[Plot: y vs. x, illustrating linear vs. monotonic relationships]

SLIDE 28

Correlation: Review of Terminology

  • Variable Type
    – Continuous (Example: Ionization chamber charge collected vs. Dose delivered)
    – Discrete (Example: Number of patients seen vs. Calendar year)
    – Ordinal (Example: Severity of normal tissue toxicity vs. Prescription level)
    – Categorical (Example: RECIST response classification vs. Radiologist observer)

SLIDE 29

Correlation: Metrics of Interest

  • Four big categories of data
    – Continuous
    – Discrete
    – Ordinal
    – Categorical
  • Three major correlation metrics
    – Pearson’s r
    – Spearman’s ⍴
    – Fleiss’ κ

SLIDE 30

Correlation: Pearson’s r

  • “Linear” or “Product-Moment” correlation
  • Applies only to continuous data
  • Parametric correlation
  • Tendency of the dependent variable to increase linearly with the independent variable
  • Key Point:
    – There is an assumed form to the relationship
    – Linear, and therefore also monotonic
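A minimal sketch using SciPy (the data values are illustrative):

```python
from scipy.stats import pearsonr

x = [1.0, 2.0, 3.0, 4.0, 5.0]  # independent variable
y = [1.1, 1.9, 3.2, 3.9, 5.1]  # dependent, roughly linear in x
r, p_value = pearsonr(x, y)    # r is close to 1.0 for this data
```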

[Six example scatter plots: r = 1.00, 0.97, 0.76, 0.96, 0.96, 0.75]

SLIDE 31


Correlation: Spearman’s ⍴

  • “Rank” correlation
  • Applies to continuous, discrete, or ordinal data
  • Non-parametric correlation
  • Tendency of the dependent variable to increase with the independent variable
  • Key Point:
    – There is no assumed relationship, only monotonicity
    – Math: Pearson’s r of rank-transformed data
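A minimal sketch showing the key point: a monotonic but nonlinear relationship still gives ⍴ = 1 (data values are illustrative):

```python
from scipy.stats import spearmanr

x = [0.0, 0.05, 0.2, 0.5, 1.0]
y = [v ** 2 for v in x]         # monotonic but not linear
rho, p_value = spearmanr(x, y)  # rho = 1.0; Pearson's r would be < 1
```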

SLIDE 32

Correlation: Spearman’s ⍴

[Worked example: scatter of (X,Y) pairs on a monotonic curve; the raw pair (0,0) becomes the rank pair (1,1)]

SLIDE 33

Correlation: Spearman’s ⍴

[Worked example continued: raw (0.05, 0.0025) becomes rank (2,2), … , raw (1,1) becomes rank (20,20); Pearson’s r of the rank-transformed data: 1.00]

SLIDE 34

Correlation: Spearman’s ⍴

[Six example scatter plots comparing the metrics: r = 1.00, ⍴ = 1.00; r = 0.97, ⍴ = 0.97; r = 0.76, ⍴ = 0.90; r = 0.96, ⍴ = 1.00; r = 0.96, ⍴ = 0.99; r = 0.75, ⍴ = 0.89]

SLIDE 35

Correlation: Which Metric?

Continuous variables; “When one goes up, does the other (reliably) go down?”

[Plot: Relative Change from Baseline (−1.0 to 1.0) vs. Months after Baseline (1 to 7), curves for Disease Volume and Lung Volume]

Z.E. Labby et al, J Thorac Oncol 8 (2013)

Answer: Spearman’s ⍴

SLIDE 36

Correlation: Fleiss’ κ

  • Categorical correlation
  • Applies only to categorical data
    – Categorical data could be inherently ordinal
  • Non-parametric correlation
  • How well do independent categories sort dependent categories?
  • Math: number of dependent-independent pairs in agreement over the number expected by chance alone (see the sketch below)
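A minimal sketch of the standard Fleiss’ κ computation; it assumes a subjects-by-categories count matrix with each row summing to the number of raters (which differs from the per-observer summary table shown below; names are illustrative):

```python
import numpy as np

def fleiss_kappa(counts):
    """counts: (N subjects x k categories); each row sums to n raters."""
    counts = np.asarray(counts, dtype=float)
    n = counts[0].sum()                        # raters per subject
    p_j = counts.sum(axis=0) / counts.sum()    # overall category shares
    p_i = ((counts ** 2).sum(axis=1) - n) / (n * (n - 1))  # per-subject agreement
    p_bar, p_e = p_i.mean(), (p_j ** 2).sum()  # observed vs. chance agreement
    return (p_bar - p_e) / (1 - p_e)
```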

Correlation: Fleiss’ κ

  • Example:
    – 5 radiologists contour tumors in 31 patients
    – Response classification from baseline to post-chemo CT scans: Progressive Disease, Stable Disease, Partial Response, Complete Response

Response       Obs. 1   Obs. 2   Obs. 3   Obs. 4   Obs. 5
Progression       6       11        7       11       14
Stable           17       10       19       15        9
Partial           7       10        5        4        8
Complete          1        0        0        1        0

κ = 0.64

Landis and Koch, Biometrics 33:159–174 (1977)

SLIDE 37

Correlation vs. Agreement

  • Quick tangent…
  • Important question: Do you already know that the two variables will be correlated?
    – Example: Tumor volumes as assessed by Physician vs. Algorithm

  • Especially with implicit independent variables (i.e., the true value remains unknown), correlation isn’t as meaningful
  • Correlation is only the strength of a relationship between two variables
  • Agreement is the actual 1:1 accuracy

Bland and Altman, “Statistical methods for assessing agreement between two methods of clinical measurement,” Lancet 327, 307 (1986).

SLIDE 38

Correlation vs. Agreement

[Bland-Altman plot: Difference (Physician − Algorithm) vs. Average of Physician and Algorithm, with lines at the Mean and Mean ± 2SD]

Bland and Altman, “Statistical methods for assessing agreement between two methods of clinical measurement,” Lancet 327, 307 (1986).

Correlation vs. Agreement

  • Absolute agreement vs. Relative agreement
    – Absolute: plot raw differences
    – Relative: plot log differences
    – Get the mean and SD of the log-transformed data, then apply the exponential to get relative agreement bounds (see the sketch below)
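A minimal sketch of the absolute (raw-difference) Bland-Altman limits of agreement; for the relative form, pass log-transformed measurements and exponentiate the resulting bounds (names are illustrative):

```python
import numpy as np

def bland_altman(m1, m2):
    m1, m2 = np.asarray(m1, dtype=float), np.asarray(m2, dtype=float)
    diff = m1 - m2         # e.g., Physician - Algorithm
    avg = 0.5 * (m1 + m2)  # x-axis of the Bland-Altman plot
    mean_d = diff.mean()
    sd_d = diff.std(ddof=1)
    return avg, diff, mean_d, (mean_d - 2 * sd_d, mean_d + 2 * sd_d)
```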


Bland and Altman, “Statistical methods for assessing agreement between two methods of clinical measurement,” Lancet 327, 307 (1986).

ln(Physician) − ln(Algorithm) = ln(Physician / Algorithm)
SLIDE 39

Case #/Truth    Obs. 1
1/Malignant     10.0
2/Benign         4.4
3/Benign         3.4
5/Malignant      5.6
6/Malignant      7.7
7/Malignant      9.2
…                …

[Histograms of the number of ratings at each observer malignancy rating, from 0.0 (Definitely Benign) through 5.0 (Indeterminate) to 10.0 (Definitely Malignant), for actually malignant vs. actually benign cases (With IR). Sweeping a decision threshold along the rating axis partitions cases into TP/FP/TN/FN and traces out the curve of True Positive Fraction (Sensitivity) vs. False Positive Fraction (1‐Specificity).]

SLIDE 40

AUC comparison is not appropriate if ROC curves cross each other

[ROC plot: two crossing curves; one better for screening, the other maybe better for diagnostic use]

  • Partial AUC
    – McClish DK. Analyzing a Portion of the ROC Curve. Medical Decision Making 9(3):190 (1989)

Sp and Se . . . Why bother with ROC?

  • CR: Better Se, Worse Sp
  • DR: Better Sp, Worse Se
  • Which is better?

Based on an example given by CE Metz during lectures at the University of Chicago, 2003

SLIDE 41

  • JAFROC
    – Uses localization information
    – More than one response per subject
      • Chakraborty DP, et al. Observer studies involving detection and localization: Modeling, analysis, and validation. Med Phys 31:2313 (2004)
  • Practically, comparisons need whole ROC curves and not specific operating points (TP,FP) or (PPV,NPV) values
    – Reader ability and experience, societal norms, and even disease prevalence in the study can impact specific operating points and (PPV,NPV) values
      • Wagner RF, et al. Assessment of Medical Imaging Systems and Computer Aids: A Tutorial Review. Acad Radiol 14:723 (2007)