Speech Processing 15-492/18-492 Speech Synthesis Evaluation

Evaluating Speech Synthesis
- How good is the voice?
  - This voice is a 45.67
- Is voice X better than voice Y
- Why?

Evaluation
- Objective measures
  - Run a program and get a number
- Subjective measures
  - Have human listeners extract a score
- Do objective and subjective scores correlate?
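One way to check whether the two kinds of score agree is a correlation coefficient computed over a set of voices. The sketch below uses Pearson's r; the slides do not say which statistic to use, and the scores are hypothetical values made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)

# Hypothetical data, one entry per voice:
# an objective distance-style score (lower is better) and a mean listener score.
objective = [4.1, 5.0, 5.8, 6.9]
subjective = [4.2, 3.6, 3.1, 2.4]

r = pearson(objective, subjective)
print(f"r = {r:.3f}")
```

A value near -1 or +1 would suggest the objective measure tracks listener judgments on these voices; with only a handful of voices the estimate is very noisy.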

Human Tests
- Synthesis people are warped
  - The more you listen the better it becomes
  - They hear things others don’t
- Non-synthesis people are warped
  - People very sensitive to listening conditions
  - What question do you ask
  - What hardware you play it on
- There are (at least) two orthogonal scales
  - Understandable
  - Natural

Standard Tests
- DRT: diagnostic rhyme tests
  - Test confusable phones
  - “bat” vs “pat”
  - Good for identifying phone errors
  - Sometimes in carrier sentences
    - Now we will say pat again.
  - Unit selection
    - Just include the standard words in the database
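Scoring a rhyme test can be sketched as counting, for each confusable pair, how often the listener picked the word that was actually played. The trial format and the function name `drt_scores` are assumptions for illustration, and the responses are invented.

```python
from collections import Counter

def drt_scores(trials):
    """Fraction of correct responses per confusable word pair.

    trials: iterable of (target, alternative, response) tuples, where
    target is the word that was played and response is the listener's choice.
    """
    correct = Counter()
    total = Counter()
    for target, alternative, response in trials:
        pair = tuple(sorted((target, alternative)))  # same key for bat/pat and pat/bat
        total[pair] += 1
        if response == target:
            correct[pair] += 1
    return {pair: correct[pair] / total[pair] for pair in total}

# Hypothetical listener responses for the "bat" vs "pat" contrast.
trials = [
    ("bat", "pat", "bat"),
    ("pat", "bat", "bat"),  # listener heard "bat" when "pat" was played
    ("bat", "pat", "bat"),
    ("pat", "bat", "pat"),
]
print(drt_scores(trials))
```

Pairs with a low score point at the phone contrast the voice is getting wrong.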

Standard Tests
- SUS: Semantically unpredictable sentences
  - Det adj noun verb det adj noun
  - Automatically filled in with low frequency words
    - The parklike holders threw the vague vegetables
    - The simplistic consonants swam the episcopal quartet
    - The dark geniuses woke the humane emptiness.
    - The masterly serials withdrew the collaborative brochure
  - Test for understandability
  - Ask users to type in what they hear
  - Good as discrimination
  - Very hard for even fluent non-natives
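Filling the "det adj noun verb det adj noun" template can be sketched as random draws from per-category word lists. The tiny lexicon below just reuses words from the example sentences; a real setup would presumably draw from a tagged list of low frequency words, and `make_sus` is a hypothetical name.

```python
import random

# Hypothetical mini-lexicon built from the slide's example sentences.
LEXICON = {
    "adj": ["parklike", "vague", "simplistic", "episcopal", "dark", "humane"],
    "noun": ["holders", "vegetables", "consonants", "geniuses", "serials"],
    "verb": ["threw", "swam", "woke", "withdrew"],
}

def make_sus(rng):
    """Fill the template 'det adj noun verb det adj noun' with random words.

    Adjectives and nouns are sampled without replacement so that a word
    is not repeated within one sentence. The determiner is fixed to 'the'.
    """
    adj1, adj2 = rng.sample(LEXICON["adj"], 2)
    noun1, noun2 = rng.sample(LEXICON["noun"], 2)
    verb = rng.choice(LEXICON["verb"])
    return f"The {adj1} {noun1} {verb} the {adj2} {noun2}."

rng = random.Random(0)  # seeded so the test set can be regenerated
for _ in range(3):
    print(make_sus(rng))
```

Past-tense verbs sidestep subject-verb agreement, which keeps the generator simple.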

Standard tests
- MOS: mean opinion scores
  - 1-5 quality, naturalness, “like it”
  - Take average score
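Taking the average is a one-liner, but a single number hides how much listeners disagree. The sketch below also reports an approximate 95% interval using a normal approximation; that interval is an addition for illustration, not something the slides prescribe, and the ratings are invented.

```python
import math
import statistics

def mos(ratings):
    """Return (mean opinion score, half-width of an approximate 95% interval).

    ratings: list of listener scores on the 1-5 scale.
    The interval uses the normal approximation 1.96 * s / sqrt(n),
    which is rough for small listener counts.
    """
    if not ratings:
        raise ValueError("no ratings given")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mean = statistics.mean(ratings)
    if len(ratings) < 2:
        return mean, float("nan")
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width

# Hypothetical ratings for one voice.
ratings = [3, 4, 5, 4, 4]
score, ci = mos(ratings)
print(f"MOS = {score:.2f} +/- {ci:.2f}")
```

If the intervals of voice X and voice Y overlap heavily, the MOS difference alone is weak evidence that one is better.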

Some experimental problems
- Order of presentation
- Other aids change perception
  - Showing the text makes it much easier
  - Having a talking head “improves” the synthesis
- Hardware quality
  - Some voices better on the telephone
  - Loud speaker quality (headphone quality)
  - Room acoustics
  - Volume
- Understandability
  - Harder if doing other task
- Personal preference
  - Voice is fully understandable but “creepy”
  - Voice is incomprehensible but “funny”
  - Sounds like my grade school teacher
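A common remedy for order-of-presentation effects is to counterbalance: give different listeners different orders so that each voice appears in each position equally often. The slides do not prescribe a design; the cyclic Latin square below is one simple option, sketched under that assumption.

```python
def latin_square_orders(items):
    """Return len(items) presentation orders forming a cyclic Latin square.

    Row i is the item list rotated left by i, so every item appears
    exactly once in every position across the rows.
    """
    n = len(items)
    return [[items[(row + col) % n] for col in range(n)] for row in range(n)]

# Hypothetical voice labels; listener k would get row k % len(orders).
orders = latin_square_orders(["voice_A", "voice_B", "voice_C", "voice_D"])
for listener, order in enumerate(orders):
    print(listener, order)
```

This balances position only. It does not balance which voice is heard immediately before which, so carry-over effects can remain.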

TTS Evaluation
- How good are your ears?

SUS Sentences
- sus_00022
- sus_00012
- sus_00005
- sus_00017

SUS Sentences
- The serene adjustments foresaw the acceptable acquisition
- The temperamental gateways forgave the weatherbeaten finalist
- The sorrowful premieres sang the ostentatious gymnast
- The disruptive billboards blew the sugary endorsement
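When listeners type in what they hear, the typed text can be compared with the reference sentence word by word. The sketch below uses word-level edit distance (a word error rate), which is one common choice and an assumption here; the listener transcription is made up.

```python
def word_errors(reference, hypothesis):
    """Return (edit distance in words, number of reference words).

    Counts substitutions, insertions and deletions needed to turn the
    reference word sequence into the hypothesis, ignoring case.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] holds the distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # match or substitution
        prev = cur
    return prev[-1], len(ref)

def word_error_rate(reference, hypothesis):
    errors, n = word_errors(reference, hypothesis)
    if n == 0:
        raise ValueError("empty reference")
    return errors / n

reference = "The sorrowful premieres sang the ostentatious gymnast"
typed = "the sorrowful premiers sang the gymnast"  # hypothetical listener input
print(word_errors(reference, typed), word_error_rate(reference, typed))
```

Exact string matching counts spelling slips such as "premiers" as errors, so a real test would probably need some spelling normalization before scoring.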

TTS Evaluation
- In mud eels are, in mud none are
- A 1918 state constitutional amendment made Massachusetts one of 23 states where citizens can enact laws by plebiscite.
- Which is which
  - The numbers are 25 and 34.
  - The numbers 20 5 and 34.
- What is the temperature in Pittsburgh

Objective Synthesis Tests
- Text analysis
  - How well do you cover NSWs
  - How well do you cover homographs
- Lexical coverage
  - How often do you see a new word
- Lexical correctness
  - How correct are pronunciations
  - For unseen words
  - For seen words
- Phonetic intelligibility
  - DRT tests
- Semantic intelligibility
  - SUS tests
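"How often do you see a new word" can be measured as an out-of-vocabulary rate: the share of tokens in some test text that are missing from the pronunciation lexicon. The sketch below assumes whitespace tokenization and a lexicon held as a plain set of lowercase words; both are simplifications, and the lexicon contents are invented.

```python
def oov_rate(tokens, lexicon):
    """Return (fraction of tokens not in the lexicon, sorted unseen word types)."""
    tokens = [t.lower() for t in tokens]
    if not tokens:
        raise ValueError("no tokens given")
    unseen = [t for t in tokens if t not in lexicon]
    return len(unseen) / len(tokens), sorted(set(unseen))

# Hypothetical lexicon that happens to lack the place name.
lexicon = {"what", "is", "the", "temperature", "in"}
tokens = "What is the temperature in Pittsburgh".split()

rate, new_words = oov_rate(tokens, lexicon)
print(f"OOV rate = {rate:.2%}, unseen = {new_words}")
```

The unseen words are the ones whose pronunciations come from letter-to-sound rules, so they are also the natural sample for checking lexical correctness on unseen words.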

Blizzard Challenge
- Annual Event from 2005
- Distribute large databases of speech
- Participants
  - Build a voice
  - Synthesize a set of sentences
- Listeners
  - Listen and grade results

Blizzard Challenge
- 2005: US English synthesis, 4 voices, 1 hour each
  - 4 teams plus “Studio” (human speech)
- 2006: US English: 1 voice: 6 hours and 1 hour
  - 12 teams
- 2007: US English: 1 voice: 9 hours and 1 hour
  - 14 teams
- 2008: UK English: 15 hours: Mandarin 5 hours
  - 19 teams
  - Split between industry and academia
  - Split between Asia, Europe, Americas.

Listeners
- Three sets of listeners
  - Speech experts (participants)
  - Paid undergrads (native speakers)
  - Volunteers
- Types of tests
  - MOS tests (1-5)
  - SUS tests
  - DRT tests
- About 300 listeners in total

Listening
- Web based
  - So everyone did it in a different environment
  - But we got access to more people
  - Asked to do it in quiet office with headphone
  - Could listen multiple times

Blizzard Challenge Results
- Speech Experts
  - Like synthesis better
  - Understand synthesis better
- Volunteers don’t always finish tests
- Undergrads sometimes finish tests
  - (or put in filler answers)
- Results were correlated over different subgroups
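One way to check that listener subgroups agree is to rank the systems by each group's mean score and compare the rankings. The sketch below uses Spearman's rank correlation in its no-ties form; the slides do not say which measure was used, and the system names and scores are hypothetical.

```python
def ranks(scores):
    """Map each system name to its rank (1 = highest score). Assumes no ties."""
    order = sorted(scores, key=scores.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(order)}

def spearman(a, b):
    """Spearman rank correlation between two {system: score} dicts (no ties)."""
    if a.keys() != b.keys():
        raise ValueError("both groups must score the same systems")
    n = len(a)
    if n < 2:
        raise ValueError("need at least two systems")
    ra, rb = ranks(a), ranks(b)
    d2 = sum((ra[k] - rb[k]) ** 2 for k in a)
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical mean scores per system from two listener groups.
experts = {"A": 3.9, "B": 3.1, "C": 2.4, "D": 3.5}
volunteers = {"A": 3.4, "B": 2.9, "C": 2.0, "D": 2.8}

print(f"rho = {spearman(experts, volunteers):.2f}")
```

Experts may give higher scores overall and still produce the same ordering of systems, which is why a rank-based measure is a reasonable fit here.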

Application Tests
- How does it work *in* the application
- With real application data
- A good voice is not noticed
- Have *real* users evaluate it
- Give them a choice (even if artificial)
  - CEO chooses the one they like!
