

SLIDE 1

Quality control of corpus annotation through reliability measures

Ron Artstein

Department of Computer Science, University of Essex
artstein@essex.ac.uk

ACL-2007 tutorial, 24 June 2007

Thanks to EPSRC grant GR/S76434/01, ARRAU (Anaphora Resolution and Underspecification)

SLIDE 2

Annotated corpora 2

Annotated corpora are needed for:
- Supervised learning – training and evaluation
- Unsupervised learning – evaluation
- Hand-crafted systems – evaluation
- Analysis of text

Quality control: annotations need to be correct.

SLIDE 3

Correctness and reliability 3

- Systems are evaluated with respect to a standard; the standard is taken to be correct
- During corpus creation, no such standard exists
- As a minimum, annotation should be reliable
- Qualitative evaluation is also necessary

SLIDE 4

Reliability and agreement 4

Reliability = consistency
- Needs to be measured on the same text
- Different annotators

If independent annotators mark a text the same way:
- they have internalized the same scheme (instructions)
- they will apply it consistently to new data
- their annotations might be correct

SLIDE 5

Reliability studies 5

Reliability data:
- Sample of the corpus
- Multiple annotators

Annotators must work independently
- Otherwise we can't compare them

Results do not generalize from one domain to another
- Annotators internalized a scheme for a newswire corpus
- They may apply it differently to an email corpus

SLIDE 6

Measuring agreement 6

Agreement measures are not hypothesis tests

- Evaluating magnitude, not existence/lack of effect
- Not comparing two hypotheses
- No clear probabilistic interpretation

SLIDE 7

Observed agreement 7

Observed agreement: proportion of items on which two coders agree.

Detailed listing:
Item   Coder 1   Coder 2
a      Boxcar    Tanker
b      Tanker    Boxcar
c      Boxcar    Boxcar
d      Boxcar    Tanker
e      Tanker    Tanker
f      Tanker    Tanker
...

Contingency table:
          Boxcar   Tanker   Total
Boxcar      41        3       44
Tanker       9       47       56
Total       50       50      100

Agreement: (41 + 47) / 100 = 0.88
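As a minimal sketch of this computation (Python, with illustrative toy data rather than the full 100-item sample behind the table):

```python
def observed_agreement(coder1, coder2):
    """Proportion of items on which two coders assign the same label."""
    assert len(coder1) == len(coder2)
    agreeing = sum(1 for a, b in zip(coder1, coder2) if a == b)
    return agreeing / len(coder1)

# Toy data mirroring items a-f of the detailed listing above.
coder1 = ["Boxcar", "Tanker", "Boxcar", "Boxcar", "Tanker", "Tanker"]
coder2 = ["Tanker", "Boxcar", "Boxcar", "Tanker", "Tanker", "Tanker"]
print(observed_agreement(coder1, coder2))  # 3 of 6 items agree -> 0.5
```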

SLIDE 8

Chance agreement 8

Some agreement is expected by chance alone. Two coders randomly assigning “Boxcar” and “Tanker” labels will agree half of the time. The amount expected by chance varies depending on the annotation scheme and on the annotated data. Meaningful agreement is the agreement above chance. Similar to the concept of “baseline” for system evaluation.

SLIDE 9

Correction for chance 9

How much of the observed agreement is above chance?

        A     B   Total
A      44     6    50
B       6    44    50
Total  50    50   100

Decompose the counts into a chance-like part and a part above it (agreement is the diagonal sum):

Total            =   Chance           +   Above
44   6               6   6                38   0
 6  44               6   6                 0  38
(agreement 88)       (agreement 12)       (agreement 76)

Agreement: 88/100    Due to chance: 12/100    Above chance: 76/100

SLIDE 10

Correction for chance 10

How much of the observed agreement is above chance?

        A     B    C    D   Total
A      22     1    1    1    25
B       1    22    1    1    25
C       1     1   22    1    25
D       1     1    1   22    25
Total  25    25   25   25   100

SLIDE 11

Correction for chance 11

Decompose the counts into a chance-like part and a part above it (agreement is the diagonal sum):

Total              =   Chance          +   Above
22  1  1  1            1  1  1  1          21  0  0  0
 1 22  1  1            1  1  1  1           0 21  0  0
 1  1 22  1            1  1  1  1           0  0 21  0
 1  1  1 22            1  1  1  1           0  0  0 21
(agreement 88)         (agreement 4)       (agreement 84)

Agreement: 88/100    Due to chance: 4/100    Above chance: 84/100

SLIDE 12

Correction for chance 12

        A     B   Total
A      44     6    50
B       6    44    50
Total  50    50   100

Agreement: 88/100    Due to chance: 12/100    Above chance: 76/100

        A     B    C    D   Total
A      22     1    1    1    25
B       1    22    1    1    25
C       1     1   22    1    25
D       1     1    1   22    25
Total  25    25   25   25   100

Agreement: 88/100    Due to chance: 4/100    Above chance: 84/100

SLIDE 13

Expected agreement 13

- Observed agreement (Ao): proportion of actual agreement
- Expected agreement (Ae): expected value of Ao
- Amount of agreement above chance: Ao − Ae
- Maximum possible agreement above chance: 1 − Ae
- Proportion of agreement above chance attained: (Ao − Ae) / (1 − Ae)
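A minimal Python sketch of the chance correction itself (the function name is mine); the coefficients that follow all reuse this ratio with different definitions of Ae:

```python
def chance_corrected(ao, ae):
    """Proportion of the possible above-chance agreement that was attained."""
    return (ao - ae) / (1 - ae)

print(chance_corrected(0.88, 0.5))   # 0.76, matching the 76/100 above chance in the two-category table
print(chance_corrected(0.88, 0.25))  # 0.84, matching the 84/100 in the four-category table
```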

SLIDE 14

Expected agreement 14

Big question: how to calculate the amount of agreement expected by chance (Ae)?

SLIDE 15

S: same chance for all coders and categories 15

Number of category labels: q
Probability of one coder picking a particular category qa: 1/q
Probability of both coders picking a particular category qa: (1/q)²
Probability of both coders picking the same category:

Ae = q · (1/q)² = 1/q
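A sketch of S in Python, under the assumption above that every one of the q categories is equally likely (function and variable names are illustrative):

```python
def s_coefficient(coder1, coder2, num_categories):
    """S: expected agreement assumes all categories are equally likely."""
    ao = sum(a == b for a, b in zip(coder1, coder2)) / len(coder1)
    ae = 1 / num_categories
    return (ao - ae) / (1 - ae)
```

Note that S depends on how many categories the scheme defines, not on how often they are used, which is what the next slide illustrates.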

SLIDE 16

Are all categories equally likely? 16

Two-category scheme:

        A     B   Total
A      44     6    50
B       6    44    50
Total  50    50   100

Ao = 0.88    Ae = 1/2 = 0.5    S = (0.88 − 0.5) / (1 − 0.5) = 0.76

Same data with a four-category scheme (C and D unused):

        A     B    C    D   Total
A      44     6              50
B       6    44              50
C
D
Total  50    50             100

Ao = 0.88    Ae = 1/4 = 0.25    S = (0.88 − 0.25) / (1 − 0.25) = 0.84

SLIDE 17

π: different chance for different categories 17

Total number of judgments: N
Probability of one coder picking a particular category qa: nqa / N
Probability of both coders picking a particular category qa: (nqa / N)²
Probability of both coders picking the same category:

Ae = Σq (nq / N)² = (1/N²) Σq nq²
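A corresponding sketch for π (Scott's π), where the chance of each category is estimated from the pooled judgments of both coders; names are illustrative:

```python
from collections import Counter

def pi_coefficient(coder1, coder2):
    """pi: chance agreement from the pooled category distribution."""
    n_items = len(coder1)
    ao = sum(a == b for a, b in zip(coder1, coder2)) / n_items
    pooled = Counter(coder1) + Counter(coder2)   # n_q counted over both coders
    n_judgments = 2 * n_items                    # N
    ae = sum((n / n_judgments) ** 2 for n in pooled.values())
    return (ao - ae) / (1 - ae)
```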

SLIDE 18

Comparison of S and π 18

        A     B    C   Total
A      44     6         50
B       6    44         50
C
Total  50    50        100

Ao = 0.88    S = (0.88 − 1/3) / (1 − 1/3) = 0.82    π = (0.88 − 0.5) / (1 − 0.5) = 0.76

        A     B    C   Total
A      77     1    2    80
B       1     6    3    10
C       2     3    5    10
Total  80    10   10   100

Ao = 0.88    S = (0.88 − 1/3) / (1 − 1/3) = 0.82    π = (0.88 − 0.66) / (1 − 0.66) ≈ 0.65

We can prove that for any sample, the expected agreement of π is at least that of S, hence π ≤ S.

SLIDE 19

Prevalence 19

Is the following annotation reliable?
Two annotators disambiguate 1000 instances of the word love:
- emotion
- zero (as in tennis)
Each annotator found 995 instances of ‘emotion’ and 5 instances of ‘zero’.
The annotators marked different instances of ‘zero’. Agreement: 99%!

          emotion   zero   Total
emotion     990       5     995
zero          5               5
Total       995       5    1000

Ao = 0.99    S = (0.99 − 0.5) / (1 − 0.5) = 0.98    π = (0.99 − 0.99005) / (1 − 0.99005) ≈ −0.005

SLIDE 20

Prevalence 20

When one category is dominant:
- High agreement does not indicate high reliability
- π measures agreement on the rare category
- Therefore, π is a good indicator of reliability

SLIDE 21

Individual annotator bias 21

Different annotators have different interpretations of the instructions (bias/prejudice). Does this affect expected agreement?

SLIDE 22

κ: different chance for different coders 22

Total number of items: i
Probability of coder cx picking a particular category qa: ncxqa / i
Probability of both coders picking category qa: (nc1qa / i) · (nc2qa / i)
Probability of both coders picking the same category:

Ae = Σq (nc1q / i) · (nc2q / i) = (1/i²) Σq nc1q · nc2q
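A sketch for κ (Cohen's κ), where each coder gets their own chance distribution; names are illustrative:

```python
from collections import Counter

def kappa_coefficient(coder1, coder2):
    """kappa: a separate chance distribution for each coder."""
    n_items = len(coder1)
    ao = sum(a == b for a, b in zip(coder1, coder2)) / n_items
    dist1, dist2 = Counter(coder1), Counter(coder2)
    ae = sum((dist1[q] / n_items) * (dist2[q] / n_items) for q in dist1)
    return (ao - ae) / (1 - ae)
```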

SLIDE 23

Comparison of π and κ 23

        A     B    C   Total
A      38          12    50
B            12          12
C                  38    38
Total  38    12    50   100

Ao = 0.88    π = (0.88 − 0.4016) / (1 − 0.4016) ≈ 0.7995    κ = (0.88 − 0.3944) / (1 − 0.3944) ≈ 0.8018

        A     B    C   Total
A      17          40    57
B            26          26
C                  17    17
Total  17    26    57   100

Ao = 0.6    π = (0.6 − 0.3414) / (1 − 0.3414) ≈ 0.3927    κ = (0.6 − 0.2614) / (1 − 0.2614) ≈ 0.4584

We can prove that for any sample, the expected agreement of π is at least that of κ, hence π ≤ κ.

SLIDE 24

Individual annotator bias 24

- Different interpretations of the instructions = lack of reliability → π preferable to κ
- High agreement entails small differences between coders → small numerical difference between π and κ
- Differences among coders are diluted when more coders are used → small numerical difference between π and κ

SLIDE 25

Multiple coders 25

Multiple coders: agreement is the proportion of agreeing pairs.

Item   Coder 1    Coder 2    Coder 3    Coder 4    Pairs
a      Boxcar     Tanker     Boxcar     Tanker     2/6
b      Tanker     Boxcar     Boxcar     Boxcar     3/6
c      Boxcar     Boxcar     Boxcar     Boxcar     6/6
d      Tanker     Engine 2   Boxcar     Tanker     1/6
e      Engine 2   Tanker     Boxcar     Engine 1   0/6
f      Tanker     Tanker     Tanker     Tanker     6/6
g      Engine 1   Engine 1   Engine 1   Engine 1   6/6
...

SLIDE 26

Multiple coders 26

Numerical interpretation: when 3 of 4 coders agree, only 3 of 6 pairs agree
Graphical representation: a contingency table would require multiple dimensions...
Expected agreement: the probability of agreement for an arbitrary pair of coders

SLIDE 27

K: multiple coders 27

Confusing terminology: K is a generalization of π.

Total number of judgments: N
Probability of an arbitrary coder picking a particular category qa: nqa / N
Probability of two coders picking a particular category qa: (nqa / N)²
Probability of two arbitrary coders picking the same category:

Ae = Σq (nq / N)² = (1/N²) Σq nq²

SLIDE 28

Multiple coders – example 28

Item   Cod-1   Cod-2   Cod-3   Cod-4   Pairs
(a)    Box     Box     Box     Box     6/6
(b)    Box     Box     Box     Box     6/6
(c)    E-2     E-2     E-2     E-2     6/6
(d)    Tank    Tank    Tank    Tank    6/6
(e)    E-1     E-1     E-1     E-1     6/6
(f)    E-1     Box     E-1     E-1     3/6
(g)    Tank    Tank    Tank    Tank    6/6
(h)    Box     Box     Box     Box     6/6
(i)    Box     Box     Box     Box     6/6
(j)    Box     Box     E-1     Box     3/6
(k)    E-2     E-2     E-2     E-2     6/6
(l)    Box     Tank    Box     Box     3/6
(m)    E-1     E-1     E-1     E-1     6/6
(n)    Tank    Tank    Tank    Tank    6/6
(o)    E-1     E-1     E-1     E-1     6/6
(p)    E-2     E-2     E-2     Tank    3/6
(q)    Box     Box     Box     Box     6/6
(r)    Box     Box     Box     Box     6/6
(s)    E-1     E-1     Tank    E-1     3/6
(t)    Box     Box     Box     Box     6/6
(u)    Box     Box     Box     Box     6/6
(v)    E-1     E-1     E-1     E-1     6/6
(w)    Tank    Tank    Tank    Tank    6/6
(x)    Box     Box     Box     Box     6/6
(y)    Box     Box     Box     Tank    3/6

25 items, 100 judgments: Box 46, Tank 20, E-1 23, E-2 11.
Observed agreement: Ao = 132/150 = 0.88
Expected agreement: Ae = 0.46² + 0.20² + 0.23² + 0.11² = 0.3166
K = (0.88 − 0.3166) / (1 − 0.3166) ≈ 0.8244
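A sketch of K for multiple coders (one label tuple per item), following the pairwise-agreement and pooled-distribution definitions above; the function name is mine:

```python
from collections import Counter
from itertools import combinations

def k_coefficient(items):
    """K (multi-coder generalization of pi): chance-corrected pairwise agreement."""
    num_coders = len(items[0])
    pairs_per_item = num_coders * (num_coders - 1) / 2
    ao = sum(
        sum(a == b for a, b in combinations(labels, 2)) / pairs_per_item
        for labels in items
    ) / len(items)
    pooled = Counter(label for labels in items for label in labels)
    n_judgments = len(items) * num_coders
    ae = sum((n / n_judgments) ** 2 for n in pooled.values())
    return (ao - ae) / (1 - ae)

# On the 25-item table above this should reproduce Ao = 0.88, Ae = 0.3166, K ≈ 0.8244.
```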

SLIDE 29

Are all disagreements the same? 29

Some disagreements are more important than others
- Boxcar/engine is more serious than engine 1/engine 2
- Depends on the application

Need to count and weigh the disagreements
- Not only agreeing pairs
- Requires a principled method of assigning weights

SLIDE 30

Agreement and disagreement 30

Observed disagreement: Do = 1 − Ao
Expected disagreement: De = 1 − Ae
Chance-corrected agreement:

1 − Do/De = 1 − (1 − Ao)/(1 − Ae) = ((1 − Ae) − (1 − Ao)) / (1 − Ae) = (Ao − Ae) / (1 − Ae)

SLIDE 31

Weights 31

Three labels: Boxcar, Engine 1, Engine 2. Three weights:
- Identical judgments: disagreement = 0 (agreement = 1)
- Engine 1 / Engine 2: disagreement = 0.5 (agreement = 0.5)
- Boxcar / engine: disagreement = 1 (agreement = 0)

Weight table (disagreement):
        Box   E-1   E-2
Box            1     1
E-1      1           0.5
E-2      1    0.5

SLIDE 32

Weighted kappa κw 32

Observed disagreement:

        Box   E-1   E-2   Total
Box      29    1            30
E-1       1   39    10      50
E-2           10    10      20
Total    30   50    20     100

Multiply each cell by its disagreement weight and sum:
Do = (1·1 + 1·1 + 10·0.5 + 10·0.5) / 100 = 12/100 = 0.12

Expected disagreement (cells from the product of the marginals):

        Box   E-1   E-2   Total
Box       9   15     6     30
E-1      15   25    10     50
E-2       6   10     4     20
Total    30   50    20    100

De = (15·1 + 6·1 + 15·1 + 10·0.5 + 6·1 + 10·0.5) / 100 = 52/100 = 0.52

κw = 1 − 0.12/0.52 ≈ 0.77    (for comparison, unweighted K = (0.78 − 0.38) / (1 − 0.38) ≈ 0.65)
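A sketch of the weighted computation in Python; the weight function mirrors the disagreement table two slides back, and all names are illustrative:

```python
from collections import Counter

def weighted_kappa(coder1, coder2, weight):
    """Weighted kappa = 1 - Do/De, with a per-pair disagreement weight in [0, 1]."""
    n = len(coder1)
    do = sum(weight(a, b) for a, b in zip(coder1, coder2)) / n
    dist1, dist2 = Counter(coder1), Counter(coder2)
    de = sum(
        (dist1[a] / n) * (dist2[b] / n) * weight(a, b)
        for a in dist1 for b in dist2
    )
    return 1 - do / de

def train_weight(a, b):
    """Disagreement weights: identical = 0, engine/engine = 0.5, boxcar/engine = 1."""
    if a == b:
        return 0.0
    if {a, b} == {"E-1", "E-2"}:
        return 0.5
    return 1.0
```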

SLIDE 33

Krippendorff’s α: a generalized weighted coefficient 33

Krippendorff’s α:
- Generalization of K with various distance metrics
- Allows multiple coders
- Similar to K when categories are nominal
- Allows numerical category labels
- Related to ANOVA (analysis of variance)

SLIDE 34

Analysis of variance 34

Numerical judgments (e.g. magnitude estimation)
Single-variable ANOVA, each item = a separate level

F = between-level variance / error variance
- F = 1: levels non-distinct; random
- F > 1: levels distinct to some extent; effect exists

error variance / total variance
- 0: no error; perfect agreement
- 1: random; no distinction
- 2: maximal value

α = 1 − error variance / total variance

SLIDE 35

Example of α 35

Item   C-1   C-2   C-3   C-4   C-5   Mean   Variance
(a)     7     7     7     7     7     7.0     0.0
(b)     5     4     5     6     5     5.0     0.5
(c)     5     5     5     6     4     5.0     0.5
(d)     7     8     6     7     7     7.0     0.5
(e)     4     2     3     3     2     2.8     0.7
(f)     6     7     6     6     6     6.2     0.2
(g)     6     6     6     5     6     5.8     0.2
(h)     7     6     9     6     9     7.4     2.3
(i)     5     5     5     4     5     4.8     0.2
(j)     4     5     2     4     6     4.2     2.2
(k)     3     5     2     4     4     3.6     1.3
(l)     5     5     6     6     5     5.4     0.3
(m)     3     4     2     3     3     3.0     0.5
(n)     2     3     4     3     4     3.2     0.7
(o)     7     7     6     7     7     6.8     0.2
(p)     7     8     7     8     7     7.4     0.3
(q)     3     3     3     1     3     2.6     0.8
(r)     4     2     4     2     4     3.2     1.2
(s)     3     2     3     3     3     2.8     0.2
(t)     4     4     2     4     4     3.6     0.8
(u)     5     6     4     5     6     5.2     0.7
(v)     4     3     4     3     1     3.0     1.5
(w)     6     6     7     5     7     6.2     0.7
(x)     4     5     2     4     3     3.6     1.3
(y)     4     5     5     6     5     5.0     0.5

Mean variance per item: 0.732
Overall: 25 items, 125 judgments. Distribution of judgments:
‘1’ 2   ‘2’ 11   ‘3’ 19   ‘4’ 24   ‘5’ 23   ‘6’ 22   ‘7’ 19   ‘8’ 3   ‘9’ 2
Mean: 4.792, Variance: 3.085

α = 1 − 0.732 / 3.085 = 0.763
F(24, 100) = 12.891 / 0.732 = 17.611, p < 10⁻¹⁵

SLIDE 36

α with different distance metrics 36

General formula for α:

α = 1 − error variance / total variance = 1 − mean item distance / mean overall distance = 1 − Do / De

Observed and expected disagreements computed with various distance metrics

SLIDE 37

Distance metrics for α 37

Interval α (numeric values): dab = (a − b)²
Nominal α (all disagreements equal): dab = 0 if a = b, 1 if a ≠ b
Nominal α ≈ K

SLIDE 38

Computing α: observed disagreement 38

Number of coders: c
Number of items: i
Distance of a single pair of labels qa, qb: d(qa, qb)

Observed disagreement:
- Number of judgment pairs per item: c(c − 1)
- Mean distance within item i: (1 / c(c − 1)) Σqa Σqb ni,qa · ni,qb · d(qa, qb)
- Mean distance within items: Do = (1 / ic(c − 1)) Σi Σqa Σqb ni,qa · ni,qb · d(qa, qb)

SLIDE 39

Computing α: expected disagreement 39

Number of coders: c
Number of items: i
Distance of a single pair of labels qa, qb: d(qa, qb)

Expected disagreement:
- Total number of judgment pairs: ic(ic − 1)
- Overall mean distance: De = (1 / ic(ic − 1)) Σqa Σqb nqa · nqb · d(qa, qb)
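A sketch of α in Python following the Do and De formulas above, assuming complete data (every coder labels every item); the two distance functions implement the nominal and interval metrics from the earlier slide:

```python
from collections import Counter

def krippendorff_alpha(items, distance):
    """alpha = 1 - Do/De; `items` is a list of label tuples, one tuple per item."""
    i, c = len(items), len(items[0])
    # Observed disagreement: mean distance between judgments within an item.
    do = sum(
        distance(a, b) for labels in items for a in labels for b in labels
    ) / (i * c * (c - 1))
    # Expected disagreement: mean distance between any two judgments overall.
    counts = Counter(label for labels in items for label in labels)
    de = sum(
        counts[a] * counts[b] * distance(a, b) for a in counts for b in counts
    ) / (i * c * (i * c - 1))
    return 1 - do / de

nominal = lambda a, b: 0.0 if a == b else 1.0   # nominal alpha, approx. K
interval = lambda a, b: (a - b) ** 2            # interval alpha for numeric labels
```

On the numeric magnitude-estimation data a few slides back, the interval metric should give a value close to the reported 0.763.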

SLIDE 40

Summary 40

- For nominal agree/disagree distinctions, K ≈ α: use either coefficient
- For grades of agreement, use α: take care with choosing the distance metric

SLIDE 41

Interpreting agreement 41

Agreement measures are not hypothesis tests

- Evaluating magnitude, not existence/lack of effect
- Not comparing two hypotheses
- No clear probabilistic interpretation

SLIDE 42

Agreement values (historical note) 42

Krippendorff 1980, page 147: "In a study by Brouwer et al. (1969) we adopted the policy of reporting on variables only if their reliability was above .8 and admitted variables with reliability between .67 and .8 only for drawing highly tentative and cautious conclusions. These standards have been continued in work on cultural indicators (Gerbner et al., 1979) and might serve as a guideline elsewhere."

Carletta 1996, page 252: "[Krippendorff] says that content analysis researchers generally think of K > .8 as good reliability, with .67 < K < .8 allowing tentative conclusions to be drawn."

SLIDE 43

Agreement and error 43

Agreement metrics are difficult to understand. Can we relate the amount of agreement to an error rate?
- Assumes the existence of a "correct" annotation
- Requires an explicit model of annotator error

SLIDE 44

Model I: concentrated error 44

Error model assumptions (inspired by but different from Aickin):
- Items are either easy or hard
- Coders always agree on easy items
- Coders classify hard items at random
- a: proportion of easy items

Ao = a + (1 − a) · Ae(hard)

a = (Ao − Ae(hard)) / (1 − Ae(hard))

SLIDE 45

Model I: concentrated error 45

a = (Ao − Ae(hard)) / (1 − Ae(hard))

Additional assumption: Ae = Ae(hard)
- Interpretation: distribution of hard judgments = distribution of easy items

Then: a = K (or α)
- Interpretation: K or α = proportion of principled judgments
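For example (an illustration using the two-category table from the earlier chance-correction slides): with Ao = 0.88 and Ae(hard) = Ae = 0.5, the model gives a = (0.88 − 0.5) / (1 − 0.5) = 0.76, i.e. 76% of the items are treated as easy, which is exactly the chance-corrected agreement computed for that table.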

SLIDE 46

Model II: evenly spread error 46

Error model assumptions:
- Fixed probability p of a non-random (principled) judgment
- Distribution of random judgments = distribution of principled judgments

Category labels: q1, ..., qn
True distribution: P(q1), ..., P(qn)

Expected agreement on an item of (true) category q:

(p + (1 − p)P(q))² + Σq′≠q ((1 − p)P(q′))²

SLIDE 47

Model II: evenly spread error 47

E(Ao) = Σq∈Q P(q) · [ (p + (1 − p)P(q))² + Σq′≠q ((1 − p)P(q′))² ] = p² + (1 − p²) Σq∈Q (P(q))²

E(Ae) ≈ Σq∈Q (P(q))²

E(K) ≈ ( [p² + (1 − p²)E(Ae)] − E(Ae) ) / (1 − E(Ae)) = p²

SLIDE 48

Comparing the two error models 48

- Random judgments concentrated in specific items: proportion of principled judgments = K
- Random judgments uniformly spread among items: proportion of principled judgments = √K
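A small simulation sketch of the evenly-spread model (illustrative, assuming a uniform true distribution over three categories): each judgment is principled with probability p and random otherwise, and the resulting agreement coefficient should land near p².

```python
import random

def simulate_k(num_items=100_000, p=0.8, categories=("A", "B", "C")):
    """Simulate two coders under the evenly-spread error model and return K."""
    def judge(true_label):
        # Principled judgment with probability p, otherwise a random label
        # drawn from the same (uniform) distribution as the principled ones.
        return true_label if random.random() < p else random.choice(categories)

    truth = [random.choice(categories) for _ in range(num_items)]
    coder1 = [judge(t) for t in truth]
    coder2 = [judge(t) for t in truth]
    ao = sum(a == b for a, b in zip(coder1, coder2)) / num_items
    pooled = coder1 + coder2
    ae = sum((pooled.count(q) / len(pooled)) ** 2 for q in categories)
    return (ao - ae) / (1 - ae)

print(simulate_k(p=0.8))  # expected to come out near 0.8 ** 2 = 0.64
```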

SLIDE 49

The single number problem 49

One category prevalent: K is sensitive to the rare categories

        A     B    C   Total
A      92     1    1    94
B       1          2     3
C       1     2          3
Total  94     3    3   100

Ao = 0.92    Ae = 0.8854    K = (0.92 − 0.8854) / (1 − 0.8854) ≈ 0.30

Two categories prevalent: K ignores the rare category

        A     B    C   Total
A      46     2    1    49
B       2    46    1    49
C       1     1          2
Total  49    49    2   100

Ao = 0.92    Ae = 0.4806    K = (0.92 − 0.4806) / (1 − 0.4806) ≈ 0.85

SLIDE 50

Latent Class Analysis 50

Model:
- Unknown number of underlying classes
- Each class has a unique distribution for emitting category labels
- Estimate the underlying probabilities from the observed labels

Allows analysis in terms of diagnostic accuracy:
- Probability of a class given a label (or set of labels)
- Probability of labels given an underlying class
