Speech Transcription with Crowdsourcing - PowerPoint PPT Presentation


SLIDE 1

Speech Transcription with Crowdsourcing

Crowdsourcing and Human Computation
Instructor: Chris Callison-Burch
Thanks to Scott Novotney for today's slides!

SLIDE 2

Lecture Takeaways

1. Get more data, not better data
2. Use other Turkers to do QC for you
3. Non-English crowdsourcing is not easy

SLIDE 3

Siri in Five Minutes

"Should I bring an umbrella?"
"Yes, it will rain today."

SLIDE 4

Siri in Five Minutes

"Should I bring an umbrella?"
"Yes, it will rain today."

Automatic Speech Recognition

SLIDE 5

Digit Recognition

SLIDE 6

Digit Recognition

SLIDE 7

Digit Recognition

P(one | audio) =

SLIDE 8

Digit Recognition

P(one | audio) = P(audio | one) P(one) / P(audio)

SLIDE 9

Digit Recognition

P(one | audio) = P(audio | one) P(one) / P(audio)

Acoustic Model: P(audio | one)    Language Model: P(one)

SLIDE 10

Digit Recognition

P(one | audio) = P(audio | one) P(one) / P(audio)
P(two | audio) = P(audio | two) P(two) / P(audio)

SLIDE 11

Digit Recognition

P(one | audio) = P(audio | one) P(one) / P(audio)
P(two | audio) = P(audio | two) P(two) / P(audio)
P(zero | audio) = P(audio | zero) P(zero) / P(audio)
. . .

SLIDE 12

Digit Recognition

P(one | audio) = P(audio | one) P(one) / P(audio)
P(two | audio) = P(audio | two) P(two) / P(audio)
P(zero | audio) = P(audio | zero) P(zero) / P(audio)
. . .
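To make the decision rule concrete, here is a minimal runnable sketch of the Bayes-rule scoring these slides build up. Everything in it is illustrative: the acoustic and language model scores are invented placeholder numbers, since the lecture deliberately skips how the real models work.

```python
# Toy digit recognizer: pick the digit w maximizing P(audio | w) * P(w).
# The shared denominator P(audio) never changes the argmax, but we
# compute it anyway to show the full posterior from the slides.

DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

# Hypothetical acoustic-model likelihoods P(audio | w) for one audio clip.
acoustic = {w: 0.01 for w in DIGITS}
acoustic["one"] = 0.40
acoustic["nine"] = 0.20

# Uniform language model P(w): every digit equally likely a priori.
language = {w: 1.0 / len(DIGITS) for w in DIGITS}

def score(w):
    """Unnormalized posterior: P(audio | w) * P(w)."""
    return acoustic[w] * language[w]

evidence = sum(score(w) for w in DIGITS)            # P(audio)
for w in sorted(DIGITS, key=score, reverse=True)[:3]:
    print(f"P({w} | audio) = {score(w) / evidence:.3f}")
print("recognized digit:", max(DIGITS, key=score))  # -> one
```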

SLIDE 13

Evaluating Performance

Reference:  THIS IS AN EXAMPLE SENTENCE

SLIDE 14

Evaluating Performance

Reference:  THIS IS AN EXAMPLE SENTENCE
Hypothesis: THIS IS EXAMPLE CENT TENSE

SLIDE 15

Evaluating Performance

Reference:  THIS IS AN EXAMPLE SENTENCE
Hypothesis: THIS IS ** EXAMPLE CENT TENSE
Score:              Del.         Sub. Ins.

SLIDE 16

Evaluating Performance

Reference:  THIS IS AN EXAMPLE SENTENCE
Hypothesis: THIS IS ** EXAMPLE CENT TENSE
Score:              Del.         Sub. Ins.

WER = (#sub + #ins + #del) / #ref = (1 + 1 + 1) / 5 = 60%
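The WER on this slide comes from a Levenshtein alignment between hypothesis and reference words. Here is a small self-contained implementation, a sketch assuming the standard dynamic-programming formulation (not the study's actual scoring tool):

```python
def wer(reference, hypothesis):
    """Word error rate: (#sub + #ins + #del) / #ref, computed with
    Levenshtein dynamic programming over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),  # sub / match
                d[i - 1][j] + 1,                               # deletion
                d[i][j - 1] + 1)                               # insertion
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("THIS IS AN EXAMPLE SENTENCE",
          "THIS IS EXAMPLE CENT TENSE"))   # 3 edits / 5 words = 0.6
```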

SLIDE 17

Evaluating Performance

Reference:  THIS IS AN EXAMPLE SENTENCE
Hypothesis: THIS IS ** EXAMPLE CENT TENSE
Score:              Del.         Sub. Ins.

WER = (#sub + #ins + #del) / #ref = (1 + 1 + 1) / 5 = 60%

• Some examples (lower is better)
  – YouTube: ~50%
  – Automatic closed captions for news: ~12%
  – Siri/Google Voice: ~5%

SLIDE 18

Probabilistic Modeling

• Both models are statistical
  – I'm going to completely skip over how they work
• Need training data
  – Audio of people saying "one three zero four"
  – Matching transcript "one three zero four"

argmax over W of  P(audio | W) P(W)

Acoustic Model: P(audio | W)    Language Model: P(W)

SLIDE 19

Why do we need data?

[Plot: test-set WER (10–60%) vs. hours of manual training data (1–10,000, log scale).]

SLIDE 20

Motivation

• Speech recognition models are hungry for data
  – ASR requires thousands of hours of transcribed audio
  – In-domain data is needed to overcome mismatches in language, speaking style, acoustic channel, noise, etc.
• Conversational telephone speech transcription is difficult
  – Spontaneous speech between intimates
  – Rapid speech, phonetic reductions, and varied speaking styles
• Expensive and time-consuming
  – $150 per hour of transcription
  – 50 hours of effort per hour of transcription
  – Deploying to new domains is slow and expensive

SLIDE 21

Evaluating Mechanical Turk

• Prior work judged quality by comparing Turkers to experts
  – 10 Turkers match an expert on many NLP tasks (Snow et al., 2008)
• Other Mechanical Turk speech transcription had low WER
  – Robot instructions: ~3% WER (Marge, 2010)
  – Street addresses, travel dialogue: ~6% WER (McGraw, 2010)
• The right metric depends on the data consumer
  – Humans: WER on the transcribed data
  – Systems: WER on a test set decoded with a trained system

SLIDE 22

English Speech Corpus

• English Switchboard corpus
  – Ten-minute conversations about an assigned topic
  – Two existing transcriptions for a twenty-hour subset:
    • LDC: high quality, ~50xRT transcription time
    • Fisher 'QuickTrans' effort: 6xRT transcription time
• CallFriend language-identification corpora
  – Korean, Hindi, Tamil, Farsi, and Vietnamese
  – Conversations between friends, from the U.S. to the home country
  – Mixture of English and the native language
  – Only Korean has existing LDC transcriptions

SLIDE 23

Transcription Task

[Screenshot of the transcription HIT, showing the pay rate, an audio player, and a text box.]

Example transcription: OH WELL I GUESS RETIREMENT THAT KIND OF THING WHICH I DON'T WORRY MUCH ABOUT UH AND WE HAVE A SOCCER TEAM THAT COMES AND GOES WE DON'T EVEN HAVE THAT PRETTY

SLIDE 24

Speech Transcription for $5/hour

• Paid $300 to transcribe 20 hours of Switchboard three times
  – $5 per hour of transcription ($0.05 per utterance)
  – 1089 Turkers completed the task in six days
  – 30 utterances transcribed on average (earning 15 cents)
  – 63 Turkers completed more than 100 utterances
• Some people complained about the pay
  – "wow that's a lot of dialogue for $.05"
  – "this stuff is really hard. pay per hit should be higher"
• Many enjoyed the task and found it interesting
  – "Very interesting exercise. would welcome more hits."
  – "You don't grow pickles they are cucumbers!!!!"

SLIDE 25

Turker Transcription Rate

[Histogram: number of Turkers by transcription time / utterance length (xRT), with reference lines at Fisher QuickTrans (6xRT) and historical estimates (50xRT).]

SLIDE 26

Dealing with Real World Data

• Every word in the transcripts needs a pronunciation
  – Misspellings, novel proper-name spellings, jeez vs. geez
  – Inconsistent hesitation markings, myriad 'uh-huh' spellings
  – 26% of utterances contained OOVs (10% of the vocabulary)
• Lots of elbow grease to prepare the phonetic dictionary
• Turkers found creative ways not to follow instructions
  – Comments like "hard to hear" or "did the best I could :)"
  – Entered transcriptions into the wrong text box
  – But very few typed in gibberish
• We did not explicitly filter comments, etc.
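As a rough illustration of that cleanup, here is a sketch of a normalization pass. The spelling table and comment patterns are invented examples, not the study's actual pipeline:

```python
import re

# Hypothetical normalization table: collapse variant spellings into one
# canonical form so every word gets a single pronunciation entry.
CANONICAL = {
    "geez": "jeez",
    "uhhuh": "uh-huh", "mhm": "uh-huh",
    "um": "uh", "umm": "uh",
}

# Illustrative patterns for side comments typed into the transcript box.
COMMENT_RE = re.compile(r"hard to hear|did the best i could|:\)")

def normalize(line):
    """Lowercase, strip punctuation, map variant spellings; return None
    for lines that look like worker comments rather than transcripts."""
    text = line.lower().strip()
    if COMMENT_RE.search(text):
        return None                  # drop the comment, re-post the audio
    text = re.sub(r"[^a-z' ]+", " ", text)
    words = [CANONICAL.get(w, w) for w in text.split()]
    return " ".join(words)

print(normalize("Geez, um... did the best I could :)"))  # None
print(normalize("Geez, UM I guess retirement"))  # jeez uh i guess retirement
```
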
SLIDE 27

Disagreement with Experts

[Histogram: normalized density of average Turker disagreement against the LDC reference; 23% mean disagreement.]

Transcription                                 WER
well ITS been nice talking to you again       12%
well it's been [DEL] A NICE PARTY JENGA       71%
well it's been nice talking to you again       0%

SLIDE 28

Estimation of Turker Skill

[Histogram: normalized density of average Turker disagreement; true disagreement of 23%, estimated disagreement of 25%.]

Transcription                                 WER   Est. WER
well ITS been nice talking to you again       12%   43%
well it's been [DEL] A NICE PARTY JENGA       71%   78%
well it's been nice talking to you again       0%   37%
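The estimated column can be produced without any gold standard: score each Turker's transcription against the other Turkers' transcriptions of the same utterance and average. A minimal sketch, reusing the `wer` function from the earlier block; the data layout is invented for illustration:

```python
from collections import defaultdict
from statistics import mean

# Invented layout: transcripts[utterance_id] = {turker_id: transcription}.
transcripts = {
    "utt1": {"A": "well it's been nice talking to you again",
             "B": "well ITS been nice talking to you again",
             "C": "well it's been A NICE PARTY JENGA"},
    # ...thousands more utterances in the real collection
}

def estimate_disagreement(transcripts):
    """Average each Turker's WER against the other Turkers' versions of
    the same utterances: a skill estimate needing no expert reference.
    Assumes the wer() function from the earlier sketch is in scope."""
    per_turker = defaultdict(list)
    for versions in transcripts.values():
        for turker, hyp in versions.items():
            others = [ref for t, ref in versions.items() if t != turker]
            if others:
                per_turker[turker].append(
                    mean(wer(ref, hyp) for ref in others))
    return {t: mean(scores) for t, scores in per_turker.items()}

print(estimate_disagreement(transcripts))
# Turker C disagrees most with A and B, so C gets the worst estimate.
```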

SLIDE 29

Rating Turkers: Expert vs. Non-Expert

[Scatter plot: each Turker's disagreement against the expert (y-axis) vs. disagreement against the other Turkers (x-axis).]

SLIDE 30

Selecting Turkers by Estimated Skill

[Scatter plot: disagreement against the expert vs. disagreement against the other Turkers.]

SLIDE 31

Selecting Turkers by Estimated Skill

[Same scatter plot, divided into quadrants by selection thresholds; quadrant percentages: 57%, 4.5%, 25%, 12%.]

SLIDE 32

Selecting Turkers by Estimated Skill

[Animation build: same scatter plot as the previous slide.]

SLIDE 33

Selecting Turkers by Estimated Skill

[Animation build: same scatter plot as the previous slide.]

SLIDE 34

Selecting Turkers by Estimated Skill

[Animation build: same scatter plot as the previous slide.]

SLIDE 35

Finding the Right Turkers

[Plot: F-score for selecting good Turkers vs. WER selection threshold; mean disagreement of 23%.]

SLIDE 36

Finding the Right Turkers

[Same plot, annotated: mean disagreement of 23%; easy to reject bad workers, hard to find good workers.]
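One way to read this curve: treat Turker selection as a retrieval problem. Accept every Turker whose estimated disagreement falls below a threshold, then compute an F-score against the expert-based notion of a good worker. A sketch with invented numbers (only the 23% figure comes from the slide):

```python
def selection_f_score(estimated, expert, threshold, good_cutoff=0.23):
    """Accept Turkers whose estimated disagreement (vs. other Turkers)
    is below `threshold`; score that choice against 'truly good' Turkers,
    defined here as expert disagreement below `good_cutoff` (the slide's
    23% mean, used as an illustrative cutoff)."""
    accepted = {t for t, e in estimated.items() if e < threshold}
    good = {t for t, e in expert.items() if e < good_cutoff}
    tp = len(accepted & good)
    precision = tp / len(accepted) if accepted else 0.0
    recall = tp / len(good) if good else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Invented disagreement numbers for four hypothetical Turkers.
estimated = {"A": 0.37, "B": 0.78, "C": 0.25, "D": 0.55}
expert    = {"A": 0.12, "B": 0.71, "C": 0.09, "D": 0.40}
for threshold in (0.3, 0.4, 0.6):
    print(threshold, round(selection_f_score(estimated, expert, threshold), 2))
```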

SLIDE 37

Selecting Turkers by Estimated Skill

[Scatter plot after selection: disagreement against the expert vs. against the other Turkers; quadrant percentages: 92%, 2%, 4%, 1%.]

SLIDE 38

Reducing Disagreement

Selection               LDC Disagreement
None                    23%
System Combination      21%
Estimated Best Turker   20%
Oracle Best Turker      18%
Oracle Best Utterance   13%

SLIDE 39

Mechanical Turk for ASR Training

  • The ultimate test is system performance
    – Build acoustic and language models
    – Decode a test set and compute WER
    – Compare to systems trained on equivalent expert transcriptions
  • 23% professional disagreement might seem worrying
    – How does it affect system performance?
    – Do reductions in disagreement transfer to system gains?
    – What are the best practices for improving ASR performance?

SLIDE 40

Breaking Down The Degradation

  • Measured test WER degradation from 1 to 16 hours of training data
    – 3% relative degradation for the acoustic model
    – 2% relative degradation for the language model
    – 5% relative degradation for both
    – Despite 23% transcription disagreement with LDC

[Plot: system WER (35–60%) vs. hours of training data (1–16), comparing LDC and MTurk acoustic models and language models.]

SLIDE 41

Value of Repeated Transcription

  • Each utterance was transcribed three times
  • What is the value of this duplicate effort?

– Instead of dreaming up a better combination method, use the oracle error rate as an upper bound on system combination

  • Cutting disagreement in half reduced degradation by half
  • System combination has at most 2.5% WER to recover

Transcription   LDC Disagreement   ASR WER
Random          23%                42.0%
Oracle          13%                40.9%
LDC             –                  39.5%
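The oracle rows are a simple upper bound: for each utterance, peek at the reference and keep whichever of the three transcriptions has the lowest WER. A sketch reusing the `wer` function and the example `transcripts` layout from the earlier blocks:

```python
def oracle_select(transcripts, references):
    """Per utterance, keep the transcription with the lowest WER against
    the reference: an upper bound on any real combination method, since
    it peeks at the gold standard. Reuses wer() from the earlier sketch."""
    return {utt: min(versions.values(),
                     key=lambda hyp: wer(references[utt], hyp))
            for utt, versions in transcripts.items()}

# Illustrative reference for the example utterance defined earlier.
references = {"utt1": "well it's been nice talking to you again"}
print(oracle_select(transcripts, references))
# -> keeps Turker A's perfect transcription of utt1
```
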
SLIDE 42

How to Best Spend Resources?

  • Given a fixed transcription budget, either:

– Transcribe as much audio as possible, or
– Improve quality by transcribing redundantly

  • With a 60-hour transcription budget:
    – 42.0% WER: 20 hours transcribed once
    – 40.9% WER: oracle selection from 20 hours transcribed three times
    – 37.6% WER: 60 hours transcribed once
    – 39.5% WER: 20 hours professionally transcribed
  • Get more data, not better data
    – Compare 37.6% WER versus 40.9% WER
  • Even expert data is outperformed by more lower-quality data
    – Compare 39.5% WER to 37.6% WER

Transcription   Hours   Cost   ASR WER
MTurk           20      $100   42.0%
Oracle MTurk    20      $300   40.9%
MTurk           60      $300   37.6%
LDC             20      –      39.5%

SLIDE 43

How to Best Spend Resources?

  • Given a fixed transcription budget, either:

– Transcribe as much audio as possible, or
– Improve quality by transcribing redundantly

  • With a 60-hour transcription budget:
    – 42.0% WER: 20 hours transcribed once
    – 40.9% WER: oracle selection from 20 hours transcribed three times
    – 37.6% WER: 60 hours transcribed once
    – 39.5% WER: 20 hours professionally transcribed
  • Get more data, not better data
    – Compare 37.6% WER versus 40.9% WER
  • Even expert data is outperformed by more lower-quality data
    – Compare 39.5% WER to 37.6% WER

Transcription   Hours   Cost     ASR WER
MTurk           20      $100     42.0%
Oracle MTurk    20      $300     40.9%
MTurk           60      $300     37.6%
LDC             20      ~$3000   39.5%

SLIDE 44

Comparing Cost of Reducing WER

[Plot: system WER (35–45%) vs. cost of transcription (log scale, $100–$10,000), comparing $150/hr professional, $90/hr CastingWords, $15/hr MTurk with oracle QC, and $5/hr Mechanical Turk.]

SLIDE 45

Comparing Cost of Reducing WER

[Same cost-vs.-WER plot as the previous slide.]

SLIDE 46

Korean

  • Tiny labor pool (initially two Turkers, versus 1089 for English)
  • Posted a separate 'Pyramid Scheme' referral HIT (arithmetic spelled out below)

    – Paid the referrer 25% of what the referred Turker earned transcribing
    – Transcription cost $25/hour instead of $20/hour
    – 80% of transcriptions came from referrals
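The referral arithmetic, spelled out as a toy calculation (not actual payment code):

```python
# Toy referral-bonus arithmetic from the slide: the worker keeps the
# base rate, and the referrer gets 25% of what the referred worker earns.
BASE_RATE = 20.0        # $ per hour of transcription paid to the worker
REFERRAL_SHARE = 0.25   # referrer's cut, paid on top of the base rate

def cost_per_hour(referred: bool) -> float:
    bonus = BASE_RATE * REFERRAL_SHARE if referred else 0.0
    return BASE_RATE + bonus

print(cost_per_hour(referred=False))  # 20.0 -> $20/hour
print(cost_per_hour(referred=True))   # 25.0 -> $25/hour, as on the slide
```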

  • Transcribed three hours in five weeks
    – Paid 8 Turkers $113 at a transcription rate of 10xRT
  • Despite 17% Turker disagreement (CER), test CER only degrades by 1.5% relative
    – From 51.3% CER to 52.1% CER
    – Reinforces the English conclusions about the usefulness of noisy data for training an ASR system

SLIDE 47

Tamil and Hindi

  • Collected one hour of transcripts
    – Much larger labor pool – how many?
    – Paid $20/hour, finished in 8 days
    – Difficult to accurately convey instructions
      • Many translated the Hindi audio into English
  • No clear conclusions
    – A private contractor provided the reference transcriptions
    – Very high disagreement (80%+) for both languages
      • Reference transcripts were inaccurate
      • Colloquial speech, poor audio quality
      • English speech irregularly transliterated into Devanagari
      • Lax gender agreement in both speaking and transcribing
    – Hindi ASR might simply be a hard task

SLIDE 48

English Conclusions

  • Mechanical Turk can quickly and cheaply transcribe difficult audio like English CTS
    – 10 hours a day for $5 per hour
  • Turker skill can be predicted reasonably well without gold-standard data
    – But this turns out not to be as important as we thought
    – Oracle selection still only cuts disagreement in half
  • Trained models show little degradation despite 23% professional disagreement
    – Even perfect expert agreement has a small impact on system performance (2.5% reduction in WER)
    – Resources are better spent getting more data than better data

SLIDE 49

Foreign Language Conclusions

  • Non-English Turkers are on Mechanical Turk
    – But it is not a field of dreams: "if you post it, they will come" does not hold
  • Korean results reinforce the English conclusions
    – 0.8% system degradation despite 17% disagreement
    – $20/hour (still very cheap)
  • Small amounts of errorful data are useful
    – Poor models can still produce usable systems
      • 90% topic-classification accuracy is possible despite 80%+ WER
    – Semi-supervised methods can bootstrap initial models
      • 51% WER reduced to 27% with a one-hour acoustic model
    – Noisy data is much more useful than you think

SLIDE 50

Swahili and Amharic (Gelas, 2011)

  • Two under-resourced African languages
    – 17M people speak Amharic in Ethiopia
    – 50M speak Swahili in East Africa (Kenya, Congo, etc.)
  • Not many workers on MTurk
    – 12 Amharic, 3 Swahili
  • And they generated data very slowly
    – 0.75 hrs after 73 days, 1.5 hrs after 12 days
  • But despite being worse than professionals
    – 16% WER, 27.7% WER
  • ASR systems performed as well as with professional transcripts
  • In the end, researchers paid grad students $103 per hour of transcription to get 12 hours, versus $37/hr on MTurk

SLIDE 51

Other Speech Tasks

  • Use MTurk to elicit speech for the target domain
    – Data must be collected with a microphone, so point Turkers to an app instead
  • Use Turkers to perform verification and correction
    – Listen to <audio, transcript> pairs and verify right or wrong
    – Correct automatic speech output
  • Speech science
    – How sensitive are humans to noise?
    – Can they detect accent, fluency, etc.?
  • System evaluation
    – Synthesized speech (but again, non-English was tough)
    – Spoken dialog systems, a.k.a. Siri

SLIDE 52

If You’re Curious

  • Praat – http://www.fon.hum.uva.nl/praat/
    – Speech analysis
  • Kaldi – open-source, state-of-the-art recognizer
    – http://kaldi.sourceforge.net/
  • Linguistic Data Consortium
    – Based right here at Penn!
    – Creates almost all of the speech corpora used in research

SLIDE 53

BACKUP

SLIDE 54

Cheaply Estimating Turker Skill

[Plot: difference from the professional estimate vs. number of utterances used to estimate disagreement.]