Ground Truth, Machine Learning, and the Mechanical Turk


SLIDE 1

Ground Truth, Machine Learning, and the Mechanical Turk

Bob Carpenter (w. Emily Jamison, Breck Baldwin)

Alias-i, Inc.

SLIDE 2

Supervised Machine Learning

  • 1. Define coding standard mapping inputs to outputs, e.g.:

– English word → stem
– newswire text → person name spans
– biomedical text → genes mentioned

  • 2. Collect inputs and code “gold standard” training data
  • 3. Develop and train statistical model using data
  • 4. Apply to unseen inputs
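
A minimal sketch of steps 2–4, assuming Python and scikit-learn (purely illustrative; the talk's own tooling was LingPipe and R/BUGS). The tiny corpus and its labels are hypothetical:

```python
# Steps 2-4 in miniature with scikit-learn (illustrative; not the talk's tooling).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Step 2: a tiny hand-coded "gold standard" (hypothetical examples);
# output 1 = text contains a person name, 0 = it does not.
texts = ["Smith visited Paris", "the cat sat down", "Jones met Brown"]
labels = [1, 0, 1]

# Step 3: develop and train a statistical model on the coded data.
vectorizer = CountVectorizer()
model = LogisticRegression().fit(vectorizer.fit_transform(texts), labels)

# Step 4: apply the trained model to unseen inputs.
print(model.predict(vectorizer.transform(["Brown arrived yesterday"])))
```
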
SLIDE 3

Coding Bottleneck

  • Bottleneck is collecting training corpus
  • Commercial data is expensive (e.g. LDC, ELRA)
  • Academic corpora typically restrictively licensed
  • Limited to existing corpora
  • For new problems, use: self, grad students, temps, interns, . . .

  • Mechanical Turk to the rescue
SLIDE 4

Case Studies

SLIDE 5

Case 1: Named Entities

SLIDE 6

Named Entities Worked

  • Conveying the coding standard

– official MUC-6 standard dozens of pages
– examples are key
– (maybe a qualifying exam)

  • User Interface Problem

– highlighting with mouse too fiddly (cf. Fitts’s Law)
– one entity type at a time (vs. pulldown menus)
– checkboxes (vs. highlighting spans)

SLIDE 7

Discussion: Named Entities

  • 190K tokens, 64K capitalized, 4K names
  • Less than a week at 2 cents/400 tokens
  • Turkers overall better than LDC data

– Correctly Rejected: Webster’s, Seagram, Du Pont, Buick-Cadillac, Moon, erstwhile Phineas Foggs
– Incorrectly Accepted: Tass
– Missed Punctuation: J E. “Buster” Brown

  • Many Turkers no better than chance

(cf. social psychology research by Yochai Benkler, Harvard)

SLIDE 8

Case 2: Morphological Stemming

SLIDE 9

Morphological Stemming Worked

  • Three iterations on coding standard

– simplified task to one stem

  • Four iterations on final standard instructions

– added previously confusing examples

  • Added qualifying test
SLIDE 10

Case 3: Gene Linkage

SLIDE 11

Gene Linkage Failed

  • Could get Turkers to pass qualifier
  • Could not get Turkers to take task even at $1/HIT
  • Doing coding ourselves (5-10 minutes/HIT)
  • How to get Turkers to do these complex tasks?

– Low concentration tasks done quickly
– Compatible with studies of why Turkers Turk

SLIDE 12

Inferring Gold Standards

SLIDE 13

Voted Gold Standard

  • Turkers vote
  • Label with majority category
  • Censor if no majority
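
As a sketch, assuming binary or categorical labels (the voting-with-censoring rule is from the slide; the function name is mine), this is just a strict-majority rule:

```python
from collections import Counter

def voted_label(labels):
    """Label with the majority category; censor (return None) if no strict majority."""
    category, votes = Counter(labels).most_common(1)[0]
    return category if votes > len(labels) / 2 else None

print(voted_label([1, 1, 0]))  # -> 1 (majority)
print(voted_label([1, 0]))     # -> None (tie: censored)
```
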
SLIDE 14

Some Labeled Data

  • Seed the data with cases with known labels
  • Use known cases to estimate coder accuracy
  • Vote with adjustment for accuracy (see the sketch after this list)
  • Requires a relatively large number of items for

– estimating accuracies well
– liveness for new items

  • Gold may not be as pure as requesters think
  • Some preference tasks have no “right” answer

– e.g. Bing vs. Google, Facestat, Colors, ...
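
A sketch of the accuracy-adjusted vote, under strong simplifying assumptions: binary labels and a single symmetric accuracy per coder (the next slide splits this into sensitivity and specificity). All names here are mine:

```python
import math

def coder_accuracy(answers, gold):
    """Accuracy on the seeded items with known labels (add-one smoothed,
    so coders with few seed items shrink toward 0.5)."""
    done = [i for i in gold if i in answers]
    correct = sum(answers[i] == gold[i] for i in done)
    return (correct + 1) / (len(done) + 2)

def adjusted_vote(item_labels, accuracy):
    """Naive-Bayes combination: each coder's vote weighted by log-odds of accuracy."""
    score = 0.0
    for coder, label in item_labels.items():
        acc = accuracy[coder]
        score += (1 if label == 1 else -1) * math.log(acc / (1 - acc))
    return 1 if score > 0 else 0

gold = {"item1": 1, "item2": 0}                       # seeded items with known labels
answers = {"a": {"item1": 1, "item2": 0, "item3": 1},
           "b": {"item1": 0, "item2": 0, "item3": 0},
           "c": {"item1": 1, "item2": 1, "item3": 1}}
acc = {c: coder_accuracy(a, gold) for c, a in answers.items()}
print(adjusted_vote({c: a["item3"] for c, a in answers.items()}, acc))
```

Note that a coder whose estimated accuracy is exactly 0.5 gets zero weight, which is the desired behavior for coders no better than chance.
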

SLIDE 15

Estimate Everything

  • Gold standard labels
  • Coder accuracies

– sensitivity (complement of false negative rate; misses)
– specificity (complement of false positive rate; false alarms)
– imbalance indicates bias; high values indicate accuracy

  • Coding standard difficulty

– average accuracies
– variation among coders

  • Item difficulty (important, but not enough data)
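
A point-estimate sketch of "estimate everything", in the spirit of Dawid and Skene's classic EM approach, assuming binary labels and a complete coder-by-item table (the talk's actual model is fully Bayesian; see the later slides):

```python
import numpy as np

# x[i, j] = label coder j gave item i (toy data; real tables are mostly missing).
x = np.array([[1, 1, 0],
              [0, 0, 0],
              [1, 0, 1],
              [1, 1, 1]], dtype=float)

pi, sens, spec = 0.5, np.full(3, 0.7), np.full(3, 0.7)
for _ in range(100):
    # E-step: posterior probability that each item's true label is 1.
    log1 = np.log(pi) + np.where(x == 1, np.log(sens), np.log1p(-sens)).sum(axis=1)
    log0 = np.log(1 - pi) + np.where(x == 0, np.log(spec), np.log1p(-spec)).sum(axis=1)
    p1 = 1.0 / (1.0 + np.exp(log0 - log1))
    # M-step: re-estimate prevalence, sensitivities, specificities (lightly smoothed).
    pi = p1.mean()
    sens = (p1 @ (x == 1) + 1) / (p1.sum() + 2)
    spec = ((1 - p1) @ (x == 0) + 1) / ((1 - p1).sum() + 2)

print(p1.round(3), sens.round(3), spec.round(3))
```
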
SLIDE 16

Benefits of Estimation

  • Full Bayesian posterior inference

– probabilistic “gold standard”
– compatible with Bayesian machine learning

  • Works better than voting with threshold

– largest benefit with few Turkers/item
– evaluated with known “gold standard”

  • Compatible with adding gold standard cases
SLIDE 17

Why We Need Task Difficulty

  • What’s your estimate for:

– a baseball player who goes 12 for 20?
– a market that goes down 9 out of 10 days?
– a coin that lands heads 3 out of 10 times?
– . . .

  • Smooth estimates for coders with few items
  • Hierarchical model of accuracy prior
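
The point of the quiz is shrinkage: a raw proportion overstates what we know from few observations. A worked example, assuming a hypothetical Beta(42, 18) population prior (mean 0.7) of the kind the hierarchical model would estimate:

```python
# Raw estimate vs. posterior mean under a Beta(42, 18) prior (hypothetical values).
alpha, beta = 42.0, 18.0        # in the talk, these come from the hierarchical model
hits, trials = 12, 20           # "goes 12 for 20"
raw = hits / trials                                # 0.600
shrunk = (hits + alpha) / (trials + alpha + beta)  # 0.675, pulled toward 0.7
print(raw, shrunk)
```
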
SLIDE 18

Soft Gold Standard

  • Is a 24-karat gold standard even possible?
  • Some items are really marginal
  • Traditional approach

– censoring disagreements
– adjudicating disagreements (revise standard)
– adjudication may not converge

  • Posterior uncertainty can be modeled
SLIDE 19

Active Learning

  • Choose most useful items to code next
  • Balance two criteria

– high uncertainty (sketched below)
– high typicality (how to measure?)

  • Can get away with fewer coders/item
  • May introduce sampling bias
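
One common instantiation of the uncertainty criterion, as a sketch (the typicality weighting is left open, as the slide notes):

```python
import numpy as np

p1 = np.array([0.51, 0.97, 0.03, 0.45, 0.88])  # posterior P(label = 1) per item
# Uncertainty sampling: code next the items whose posterior is closest to 0.5.
next_to_code = np.argsort(np.abs(p1 - 0.5))[:2]
print(next_to_code)  # -> [0 3], the two most uncertain items
```
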
SLIDE 20

Code-a-Little, Learn-a-Little

  • Semi-automated coding
  • System suggests labels
  • Coders correct labels
  • Much faster coding
  • But may introduce bias
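
A schematic of the loop, with `train` and `review` as stand-in callables rather than any particular toolkit's API: the system proposes labels, the coder corrects them, and the model is retrained on the growing corpus.

```python
def code_a_little_learn_a_little(train, review, unlabeled, seed, batch_size=10):
    """train(pairs) -> model (a callable x -> label);
    review(x, suggested) -> corrected label (the human coder).
    Both are placeholders, not a real learner or UI."""
    corpus, model = list(seed), train(seed)
    for start in range(0, len(unlabeled), batch_size):
        batch = unlabeled[start:start + batch_size]
        suggested = [(x, model(x)) for x in batch]           # system suggests labels
        corpus += [(x, review(x, y)) for x, y in suggested]  # coder corrects them
        model = train(corpus)                                # learn a little
    return model, corpus
```
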
SLIDE 21

Statistical Inference Model

SLIDE 22

5 Dentists Diagnosing Cavities

Dentists   Count      Dentists   Count
00000      1880       10000        22
00001       789       10001        26
00010        43       10010         6
00011        75       10011        14
00100        23       10100         1
00101        63       10101        20
00110         8       10110         2
00111        22       10111        17
01000       188       11000         2
01001       191       11001        20
01010        17       11010         6
01011        67       11011        27
01100        15       11100         3
01101        85       11101        72
01110         8       11110         1
01111        56       11111       100

(Each five-digit pattern gives the five dentists’ diagnoses for one item; Count is the number of items with that pattern.)

SLIDE 23

Posteriors for Dentist Accuracies

[Figure: posterior densities of annotator specificities (a.0) and sensitivities (a.1), one curve per annotator (key: 1–5)]

  • Posterior density vs. point estimates (e.g. mean)
SLIDE 24

Posteriors for Dentistry Data Items

[Figure: posterior distribution of the probability of a cavity (0 to 1) for each of the 32 response patterns 00000–11111]

Accounts for bias, so very different from simple vote!

SLIDE 25

Beta-Binomial “Random Effects”

[Figure: plate diagram of the model. Hyperparameters α0, β0, α1, β1 feed the coder-specific parameters θ0,j and θ1,j (plate J); prevalence π generates true categories ci (plate I), which together with the coder parameters generate the labels xk (plate K).]

SLIDE 26

Sampling Notation

Coders don’t all code the same items

ci ∼ Bernoulli(π)
a0,j ∼ Beta(α0, β0)
a1,j ∼ Beta(α1, β1)
xk ∼ Bernoulli(cik a1,jk + (1 − cik)(1 − a0,jk))
π ∼ Beta(1, 1)
α0/(α0 + β0) ∼ Beta(1, 1)
α0 + β0 ∼ Polynomial(−5/2)
α1/(α1 + β1) ∼ Beta(1, 1)
α1 + β1 ∼ Polynomial(−5/2)
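
A forward simulation of this notation in numpy may make the indexing concrete (sizes and hyperparameter values are made up, and the Polynomial(−5/2) hyperprior is replaced by fixed α, β here):

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K = 500, 5, 2500                     # items, coders, labels (hypothetical)
pi = rng.beta(1, 1)                        # prevalence
a0 = rng.beta(40, 8, size=J)               # specificities a_{0,j}
a1 = rng.beta(20, 8, size=J)               # sensitivities a_{1,j}
c = rng.binomial(1, pi, size=I)            # true item categories c_i
i_k = rng.integers(0, I, size=K)           # item coded by the k-th label
j_k = rng.integers(0, J, size=K)           # coder who produced the k-th label
# x_k ~ Bernoulli(c_{i_k} a_{1,j_k} + (1 - c_{i_k})(1 - a_{0,j_k}))
x = rng.binomial(1, np.where(c[i_k] == 1, a1[j_k], 1 - a0[j_k]))
```
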

SLIDE 27

Posterior for Coding Difficulty/Variance

These are posteriors from a simulation with cross-hairs giving “true” answer:

[Figure: posterior for the sensitivity mean α1/(α1 + β1) (x-axis, 0.60–0.90) against the scale α1 + β1 (y-axis, 20–100)]

[Figure: posterior for the specificity mean α0/(α0 + β0) (x-axis, 0.60–0.90) against the scale α0 + β0 (y-axis, 20–100)]

SLIDE 28

Extending Model

  • Task difficulty

– Logistic Item-Response models
– Need more than 5–10 coders/item for tight posterior

  • Easy to extend coding types

– Multinomial responses (see the sketch below)
– Ordinal responses (e.g. Likert 1–5 scale)
– Scalar responses
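
For instance, the multinomial case replaces each coder's sensitivity/specificity pair with a full confusion matrix whose rows get Dirichlet priors; a generative sketch with all sizes and prior values hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
C, J = 3, 5                                # categories, coders (hypothetical)
prior = np.eye(C) * 8 + 1                  # Dirichlet prior favoring the diagonal
# theta[j, c] = coder j's response distribution given true category c.
theta = np.array([[rng.dirichlet(row) for row in prior] for _ in range(J)])
c_i = 2                                    # a true category for some item i
x_k = rng.choice(C, p=theta[0, c_i])       # coder 0's (possibly wrong) label
```
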

SLIDE 29

The End

  • References

– http://lingpipe-blog.com

  • Contact

– carp@alias-i.com

  • R/BUGS Code Subversion Repository

– http://alias-i.com/lingpipe/sandbox