

SLIDE 1

Assessing inter-rater agreement in Stata

Daniel Klein klein.daniel.81@gmail.com klein@incher.uni-kassel.de

University of Kassel INCHER-Kassel

15th German Stata Users Group meeting, Berlin, June 23, 2017

SLIDE 2

◮ Interrater agreement and Cohen's Kappa: A brief review
◮ Generalizing the Kappa coefficient
◮ More agreement coefficients
◮ Statistical inference and benchmarking agreement coefficients
◮ Implementation in Stata
◮ Examples

SLIDE 3

Interrater agreement

What is it?

An imperfect working definition

Define interrater agreement as the propensity of two or more raters (coders, judges, . . . ) to independently classify a given subject (unit of analysis) into the same predefined category.

SLIDE 4

Interrater agreement

How to measure it?

◮ Consider
  ◮ r = 2 raters
  ◮ n subjects
  ◮ q = 2 categories

                 Rater B
Rater A          1        2        Total
1                n11      n12      n1.
2                n21      n22      n2.
Total            n.1      n.2      n

◮ The observed proportion of agreement is

$$p_o = \frac{n_{11} + n_{22}}{n}$$

SLIDE 5

Cohen’s Kappa

The problem of chance agreement

The problem

◮ Observed agreement may be due to . . .

  ◮ subject properties
  ◮ chance

Cohen’s (1960) solution

◮ Define the proportion of agreement expected by chance as

$$p_e = \frac{n_{1\cdot}}{n} \times \frac{n_{\cdot 1}}{n} + \frac{n_{2\cdot}}{n} \times \frac{n_{\cdot 2}}{n}$$

◮ Then define Kappa as

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$
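
A quick worked example with hypothetical counts (not from the slides): suppose $n_{11} = 40$, $n_{12} = 10$, $n_{21} = 5$, $n_{22} = 45$, and $n = 100$. Then

$$p_o = \frac{40 + 45}{100} = 0.85, \qquad p_e = \frac{50}{100} \times \frac{45}{100} + \frac{50}{100} \times \frac{55}{100} = 0.50, \qquad \kappa = \frac{0.85 - 0.50}{1 - 0.50} = 0.70.$$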

SLIDE 6

Cohen’s Kappa

Partial agreement and weighted Kappa

The Problem

◮ For q > 2 (ordered) categories raters might partially agree
◮ The Kappa coefficient cannot reflect this

Cohen’s (1968) solution

◮ Assign a set of weights to the cells of the contingency table

◮ Define linear weights

$$w_{kl} = 1 - \frac{|k - l|}{|q_{\max} - q_{\min}|}$$

◮ Define quadratic weights

$$w_{kl} = 1 - \frac{(k - l)^2}{(q_{\max} - q_{\min})^2}$$

SLIDE 7

Cohen’s Kappa

Quadratic weights (Example)

◮ Weighting matrix for q = 4 categories
◮ Quadratic weights

                 Rater B
Rater A          1        2        3        4
1                1.00
2                0.89     1.00
3                0.56     0.89     1.00
4                0.00     0.56     0.89     1.00
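
The entries follow directly from the quadratic weight formula on the previous slide; for example, $w_{13} = 1 - \frac{(1 - 3)^2}{(4 - 1)^2} = 1 - \frac{4}{9} \approx 0.56$ and $w_{14} = 1 - \frac{(1 - 4)^2}{(4 - 1)^2} = 0$.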

SLIDE 8

Generalizing Kappa

Missing ratings

The problem

◮ Some subjects classified by only one rater
◮ Excluding these subjects reduces accuracy

Gwet’s (2014) solution

(also see Krippendorff 1970, 2004, 2013)

◮ Add a dummy category, X, for missing ratings
◮ Base po on subjects classified by both raters
◮ Base pe on subjects classified by one or both raters
◮ Potential problem: no explicit assumption about the type of missing data (MCAR, MAR, MNAR)

SLIDE 9

Missing ratings

Calculation of po and pe

                 Rater B
Rater A          1      2      ...    q      X      Total
1                n11    n12    ...    n1q    n1X    n1.
2                n21    n22    ...    n2q    n2X    n2.
...              ...    ...    ...    ...    ...    ...
q                nq1    nq2    ...    nqq    nqX    nq.
X                nX1    nX2    ...    nXq           nX.
Total            n.1    n.2    ...    n.q    n.X    n

◮ Calculate po and pe as

$$p_o = \sum_{k=1}^{q} \sum_{l=1}^{q} \frac{w_{kl}\, n_{kl}}{n - (n_{\cdot X} + n_{X\cdot})} \qquad\text{and}\qquad p_e = \sum_{k=1}^{q} \sum_{l=1}^{q} w_{kl}\, \frac{n_{k\cdot}}{n - n_{X\cdot}} \times \frac{n_{\cdot l}}{n - n_{\cdot X}}$$

SLIDE 10

Generalizing Kappa

Three or more raters

◮ Consider three pairs of raters {A, B}, {A, C}, {B, C}
◮ Agreement might be observed for . . .
  ◮ 0 pairs
  ◮ 1 pair
  ◮ all 3 pairs
◮ It is not possible for only two pairs to agree (if A agrees with B and with C, then B and C agree as well)
◮ Define agreement as the average agreement over all pairs
  ◮ here 0, 0.33, or 1 (e.g., for ratings (1, 1, 2) only the pair {A, B} agrees, giving 1/3)
◮ With r = 3 raters and q = 2 categories, $p_o \geq 1/3$ by design: at least two of the three raters must pick the same category

SLIDE 11

Three or more raters

Observed agreement

◮ Organize the data as n × q matrix

                 Category
Subject          1      ...    k      ...    q      Total
1                r11    ...    r1k    ...    r1q    r1
...
i                ri1    ...    rik    ...    riq    ri
...
n                rn1    ...    rnk    ...    rnq    rn
Average          r̄1     ...    r̄k     ...    r̄q     r̄

◮ Average observed agreement over all pairs of raters

$$p_o = \frac{1}{n'} \sum_{i=1}^{n'} \sum_{k=1}^{q} \frac{r_{ik}\left(\sum_{l=1}^{q} w_{kl}\, r_{il} - 1\right)}{r_i\,(r_i - 1)}$$

where n′ denotes the number of subjects rated by at least two raters
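
With identity weights ($w_{kk} = 1$ and $w_{kl} = 0$ for $k \neq l$) the inner sum reduces to $r_{ik}$, so the per-subject contribution for the slide-10 example of ratings (1, 1, 2) from three raters is

$$\frac{2(2 - 1) + 1(1 - 1)}{3(3 - 1)} = \frac{2}{6} = \frac{1}{3},$$

matching the value 0.33 quoted on the previous slide.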

SLIDE 12

Three or more raters

Chance agreement

◮ Fleiss (1971) expected proportion of agreement

$$p_e = \sum_{k=1}^{q} \sum_{l=1}^{q} w_{kl}\, \pi_k \pi_l \qquad\text{with}\qquad \pi_k = \frac{1}{n} \sum_{i=1}^{n} \frac{r_{ik}}{r_i}$$

◮ Fleiss’ Kappa does not reduce to Cohen’s Kappa

◮ It instead reduces to Scott's π
◮ Conger (1980) generalizes Cohen's Kappa (the formula is somewhat complex)
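
A small hypothetical check (values not from the slides): with q = 2 categories, identity weights, and average classification proportions $\pi_1 = 0.7$ and $\pi_2 = 0.3$, Fleiss' chance agreement is $p_e = 0.7^2 + 0.3^2 = 0.58$.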

SLIDE 13

Generalizing Kappa

Any level of measurement

◮ Krippendorff (1970, 2004, 2013) introduces more weights

(calling them difference functions)

  ◮ ordinal
  ◮ ratio
  ◮ circular
  ◮ bipolar

◮ Gwet (2014) suggests

Data metric              Weights
nominal/categorical      none (identity)
ordinal                  ordinal
interval                 linear, quadratic, radical
ratio                    any

◮ Rating categories must be predefined

SLIDE 14

More agreement coefficients

A general form

◮ Gwet (2014) discusses (more) agreement coefficients of the form

$$\kappa_{\cdot} = \frac{p_o - p_e}{1 - p_e}$$

◮ Differences only in chance agreement pe

◮ Brennan and Prediger (1981) coefficient (κn)

$$p_e = \frac{1}{q^2} \sum_{k=1}^{q} \sum_{l=1}^{q} w_{kl}$$

◮ Gwet's (2008, 2014) AC (κG)

$$p_e = \frac{\sum_{k=1}^{q} \sum_{l=1}^{q} w_{kl}}{q(q - 1)} \sum_{k=1}^{q} \pi_k (1 - \pi_k)$$
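
With identity weights $\sum_k \sum_l w_{kl} = q$, so the Brennan and Prediger chance agreement reduces to $p_e = 1/q$ (0.5 for two categories), while Gwet's AC reduces to the unweighted AC1 with $p_e = \frac{1}{q - 1} \sum_{k=1}^{q} \pi_k (1 - \pi_k)$.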

SLIDE 15

More agreement coefficients

Krippendorff’s alpha

◮ Gwet (2014) obtains Krippendorff's alpha as

$$\kappa_\alpha = \frac{p_o - p_e}{1 - p_e}$$

with

$$p_o = \left(1 - \frac{1}{n'\bar r}\right) p'_o + \frac{1}{n'\bar r}, \qquad p'_o = \frac{1}{n'} \sum_{i=1}^{n'} \sum_{k=1}^{q} \frac{r_{ik}\left(\sum_{l=1}^{q} w_{kl}\, r_{il} - 1\right)}{\bar r\,(r_i - 1)}$$

and

$$p_e = \sum_{k=1}^{q} \sum_{l=1}^{q} w_{kl}\, \pi'_k \pi'_l \qquad\text{with}\qquad \pi'_k = \frac{1}{n'} \sum_{i=1}^{n'} \frac{r_{ik}}{\bar r}$$

SLIDE 16

Statistical inference

Approaches

◮ Model-based (analytic) approach
  ◮ based on theoretical distribution under H0
  ◮ not necessarily valid for confidence interval construction
◮ Bootstrap
  ◮ valid confidence intervals with few assumptions
  ◮ computationally intensive
◮ Design-based (finite population)
  ◮ first introduced by Gwet (2014)
  ◮ sample of subjects drawn from subject universe
  ◮ sample of raters drawn from rater population

SLIDE 17

Statistical inference

Design-based approach

◮ Inference conditional on the sample of raters

$$V(\kappa) = \frac{1 - f}{n(n - 1)} \sum_{i=1}^{n} \left(\kappa^\star_i - \kappa\right)^2$$

where

$$\kappa^\star_i = \kappa_i - 2\,(1 - \kappa)\,\frac{p_{ei} - p_e}{1 - p_e} \qquad\text{with}\qquad \kappa_i = \frac{n}{n'} \times \frac{p_{oi} - p_e}{1 - p_e}$$

$p_{ei}$ and $p_{oi}$ are the subject-level expected and observed agreement

SLIDE 18

Benchmarking agreement coefficients

Benchmark scales

◮ How do we interpret the extent of agreement?
◮ Landis and Koch (1977) suggest

Coefficient        Interpretation
< 0.00             Poor
0.00 to 0.20       Slight
0.21 to 0.40       Fair
0.41 to 0.60       Moderate
0.61 to 0.80       Substantial
0.81 to 1.00       Almost Perfect

◮ Similar scales proposed (e.g., Fleiss 1981, Altman 1991)

SLIDE 19

Benchmarking agreement coefficients

Probabilistic approach

The Problem

◮ Precision of estimated agreement coefficients depends on

  ◮ the number of subjects
  ◮ the number of raters
  ◮ the number of categories

◮ Common practice of benchmarking ignores this uncertainty

Gwet’s (2014) solution

◮ Probabilistic benchmarking method

  1. Compute the probability for a coefficient to fall into each benchmark interval
  2. Calculate the cumulative probability, starting from the highest level
  3. Choose the benchmark interval associated with a cumulative probability larger than a given threshold

SLIDE 20

Interrater agreement in Stata

Kappa

◮ kap, kappa (StataCorp.)
  ◮ Cohen's Kappa, Fleiss' Kappa for three or more raters
  ◮ Casewise deletion of missing values
  ◮ Linear, quadratic, and user-defined weights (two raters only)
  ◮ No confidence intervals
◮ kapci (SJ)
  ◮ Analytic confidence intervals for two raters and two ratings
  ◮ Bootstrap confidence intervals
◮ kappci (kaputil, SSC)
  ◮ Confidence intervals for binomial ratings (uses ci for proportions)
◮ kappa2 (SSC)
  ◮ Conger's (weighted) Kappa for three or more raters
  ◮ Uses available cases
  ◮ Jackknife confidence intervals
  ◮ Majority agreement

SLIDE 21

Interrater agreement in Stata

Krippendorff’s alpha

◮ krippalpha (SSC)
  ◮ Ordinal, quadratic, and ratio weights
  ◮ No confidence intervals
◮ kalpha (SSC)
  ◮ Ordinal, quadratic, ratio, circular, and bipolar weights
  ◮ (Pseudo-) bootstrap confidence intervals (not recommended)
◮ kanom (SSC)
  ◮ Two raters with nominal ratings only
  ◮ No weights (for disagreement)
  ◮ Confidence intervals (delta method)
  ◮ Supports basic features of complex survey designs

SLIDE 22

Interrater agreement in Stata

Kappa, etc.

◮ kappaetc (SSC) (see the usage sketch below)
  ◮ Observed agreement, Cohen and Conger's Kappa, Fleiss' Kappa, Krippendorff's alpha, Brennan and Prediger coefficient, Gwet's AC
  ◮ Uses available cases, optional casewise deletion
  ◮ Ordinal, linear, quadratic, radical, ratio, circular, bipolar, power, and user-defined weights
  ◮ Confidence intervals for all coefficients (design-based)
  ◮ Standard errors conditional on sample of subjects, sample of raters, or unconditional
  ◮ Benchmarking estimated coefficients (probabilistic and deterministic)
  ◮ . . .
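
A minimal usage sketch (not part of the slides): kappaetc is installed from SSC; the wgt() argument shown here is an assumption based on the weight list above, so check help kappaetc for the exact option names.

. * install the package once from SSC
. ssc install kappaetc
. * classifications by three raters stored in rater1-rater3 (hypothetical variables);
. * request quadratic weights (assumed option spelling) and probabilistic benchmarking
. kappaetc rater1 rater2 rater3, wgt(quadratic) benchmark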

SLIDE 23

Kappa paradoxes

Dependence on marginal totals

                 Rater B
Rater A          1      2      Total
1                45     15     60
2                25     15     40
Total            70     30     100

po = 0.60   κn = 0.20   κ = 0.13   κF = 0.12   κG = 0.27   κα = 0.13

                 Rater B
Rater A          1      2      Total
1                25     35     60
2                5      35     40
Total            30     70     100

po = 0.60   κn = 0.20   κ = 0.26   κF = 0.19   κG = 0.21   κα = 0.20

Tables from Feinstein and Cicchetti 1990
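
A worked check of Cohen's Kappa for the two tables: both show the same observed agreement, $p_o = 0.60$, but the chance agreement depends on the marginal totals,

$$p_e^{(1)} = 0.60 \times 0.70 + 0.40 \times 0.30 = 0.54, \qquad p_e^{(2)} = 0.60 \times 0.30 + 0.40 \times 0.70 = 0.46,$$

so $\kappa^{(1)} = (0.60 - 0.54)/(1 - 0.54) \approx 0.13$ and $\kappa^{(2)} = (0.60 - 0.46)/(1 - 0.46) \approx 0.26$: identical observed agreement, noticeably different Kappa.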

SLIDE 24

Kappa paradoxes

High agreement, low Kappa

                 Rater B
Rater A          1      2      Total
1                118    5      123
2                2      0      2
Total            120    5      125

po = 0.94   κn = 0.89   κ = −0.02   κF = −0.03   κG = 0.94   κα = −0.02

Table from Gwet 2008
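
The numbers are easy to verify: $p_o = 118/125 \approx 0.94$, while the chance agreement implied by the marginals is $p_e = \frac{123}{125} \cdot \frac{120}{125} + \frac{2}{125} \cdot \frac{5}{125} \approx 0.945$, so $\kappa = (0.944 - 0.945)/(1 - 0.945) \approx -0.02$ despite near-perfect observed agreement. Brennan and Prediger's chance agreement of $p_e = 1/q = 0.5$ instead yields $\kappa_n = (0.944 - 0.5)/0.5 \approx 0.89$.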

SLIDE 25

Kappa paradoxes

Independence of center cells, rows, and columns with quadratic weights

                 Rater B
Rater A          1      2      3      Total
1                1      15     1      17
2                3      0      3      6
3                2      3      2      7
Total            6      18     6      30

po = 0.10   po(w2) = 0.70   κn(w2) = 0.10   κ(w2) = 0.00   κF(w2) = −0.05   κG(w2) = 0.15   κα(w2) = −0.03

                 Rater B
Rater A          1      2      3      Total
1                1      1      1      3
2                3      17     3      23
3                2      0      2      4
Total            6      18     6      30

po = 0.67   po(w2) = 0.84   κn(w2) = 0.53   κ(w2) = 0.00   κF(w2) = 0.00   κG(w2) = 0.69   κα(w2) = 0.02

(w2 denotes quadratic weights)

Tables from Warrens 2012

SLIDE 26

Benchmarking

Set up from Gwet (2014)

. tabi 75 1 4 \ 5 4 1 \ 0 0 10, nofreq replace

. expand pop
(2 zero counts ignored; observations not deleted)
(93 observations created)

. drop if !pop
(2 observations deleted)

. rename (row col) (ratera raterb)

. tabulate ratera raterb

           raterb
ratera     1      2      3      Total
1          75     1      4      80
2          5      4      1      10
3          0      0      10     10
Total      80     5      15     100
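
expand pop turns the 3 × 3 frequency table left in memory by tabi, replace into one observation per rated subject; after renaming, ratera and raterb hold the two raters' classifications. As a quick cross-check (a sketch, not from the slides), the official kap command can be run on the same data; for two raters with complete ratings its Kappa estimate should match the Cohen/Conger's Kappa reported by kappaetc on the next slide.

. kap ratera raterb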

SLIDE 27

Benchmarking

Interrater agreement

. kappaetc ratera raterb

Interrater agreement             Number of subjects          =   100
                                 Ratings per subject         =     2
                                 Number of rating categories =     3

                         Coef.   Std. Err.      t    P>|t|   [95% Conf. Interval]
Percent Agreement       0.8900      0.0314   28.30   0.000     0.8276     0.9524
Brennan and Prediger    0.8350      0.0472   17.70   0.000     0.7414     0.9286
Cohen/Conger's Kappa    0.6765      0.0881    7.67   0.000     0.5016     0.8514
Fleiss' Kappa           0.6753      0.0891    7.58   0.000     0.4985     0.8520
Gwet's AC               0.8676      0.0394   22.00   0.000     0.7893     0.9458
Krippendorff's alpha    0.6769      0.0891    7.60   0.000     0.5002     0.8536

SLIDE 28

Benchmarking

Probabilistic method

. kappaetc, benchmark showscale

Interrater agreement             Number of subjects          =   100
                                 Ratings per subject         =     2
                                 Number of rating categories =     3

                                                P in     cum. P    Probabilistic
                         Coef.   Std. Err.   interval    > 95%   [Benchmark Interval]
Percent Agreement       0.8900      0.0314      0.997     0.997    0.8000    1.0000
Brennan and Prediger    0.8350      0.0472      0.230     1.000    0.6000    0.8000
Cohen/Conger's Kappa    0.6765      0.0881      0.193     0.999    0.4000    0.6000
Fleiss' Kappa           0.6753      0.0891      0.199     0.998    0.4000    0.6000
Gwet's AC               0.8676      0.0394      0.955     0.955    0.8000    1.0000
Krippendorff's alpha    0.6769      0.0891      0.194     0.999    0.4000    0.6000

Benchmark scale
 <0.0000                Poor
 0.0000-0.2000          Slight
 0.2000-0.4000          Fair
 0.4000-0.6000          Moderate
 0.6000-0.8000          Substantial
 0.8000-1.0000          Almost Perfect
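
To see how the probabilistic benchmarking works, take Gwet's AC: with a point estimate of 0.8676 and standard error 0.0394, a rough normal-approximation check (kappaetc reports t statistics, so its probabilities differ slightly) gives

$$P(\kappa_G \ge 0.80) \approx \Phi\left(\frac{0.8676 - 0.80}{0.0394}\right) = \Phi(1.72) \approx 0.96,$$

so the cumulative probability already exceeds 95% in the top interval and Gwet's AC is benchmarked as Almost Perfect. For Cohen/Conger's Kappa the cumulative probability first exceeds 95% at the 0.40–0.60 interval, so despite a point estimate of about 0.68 it is benchmarked as Moderate rather than Substantial.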
