Speech Processing 11-492/18-492
Speech Synthesis Evaluation
Evaluating Speech Synthesis
How good is the voice?
This voice is a 45.67
Is voice X better than voice Y?
Why?
Evaluation
Objective measures
Run a program and get a number
Subjective measures
Have human listeners give a score
Do objective and subjective scores correlate?
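One way to check whether objective and subjective scores correlate is to compute Pearson and Spearman correlations across a set of voices. The sketch below is illustrative only: the score lists are made-up numbers, and the objective measure is assumed to be a distortion-style score where lower is better.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical data for five voices (not from the lecture).
objective = [4.8, 5.5, 6.1, 6.9, 7.4]   # assumed distortion score, lower is better
mos = [4.1, 3.6, 3.7, 2.9, 2.5]         # listener scores, higher is better

print("Pearson: ", round(pearson(objective, mos), 3))
print("Spearman:", round(spearman(objective, mos), 3))
```

A strongly negative value here would mean the objective measure tracks listener judgments; a value near zero would mean it does not.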
Human Tests
Synthesis people are warped
The more you listen the better it becomes
They hear things others don’t
Non-synthesis people are warped
People very sensitive to listening conditions
What question do you ask
What hardware you play it on
There are (at least) two orthogonal scales
Understandability
Naturalness
Standard Tests
DRT: diagnostic rhyme tests
Test confusable phones
“bat” vs “pat”
Good for identifying phone errors
Sometimes in carrier sentences
Now we will say pat again.
Unit selection
Just include the standard words in the database
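A rhyme test like the one above reduces to counting how often listeners pick the intended word from a confusable pair. The sketch below assumes a two-alternative forced choice and offers an optional correction for guessing (right minus wrong, over total), which is one common convention for such tests; the response data are hypothetical.

```python
def drt_score(responses, correct_for_guessing=True):
    """Score a two-alternative rhyme test.

    responses: list of (target_word, chosen_word) pairs.
    Returns a percentage. With correct_for_guessing, a listener who
    guesses at random scores about 0 rather than about 50.
    """
    total = len(responses)
    if total == 0:
        raise ValueError("no responses to score")
    right = sum(1 for target, chosen in responses if target == chosen)
    wrong = total - right
    if correct_for_guessing:
        return 100.0 * (right - wrong) / total
    return 100.0 * right / total

# Hypothetical listener responses: 9 of 10 items heard correctly.
responses = [("pat", "pat")] * 5 + [("bat", "bat")] * 4 + [("bat", "pat")]

print(drt_score(responses))                              # chance-corrected
print(drt_score(responses, correct_for_guessing=False))  # raw percent correct
```

Keeping the (target, chosen) pairs rather than just a count also makes it possible to list which phone contrasts were confused.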
Standard Tests
SUS: Semantically unpredictable sentences
Det adj noun verb det adj noun
Automatically filled in with low frequency words
The parklike holders threw the vague vegetables
The simplistic consonants swam the episcopal quartet
The dark geniuses woke the humane emptiness.
The masterly serials withdrew the collaborative brochure
Test for understandability
Ask users to type in what they hear
Good for discrimination
Very hard for even fluent non-natives
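The template filling described above can be sketched in a few lines. The word lists here are tiny hypothetical samples taken from the example sentences; a real generator would presumably draw from a frequency-ranked lexicon and pick low frequency entries for each part of speech.

```python
import random

# Slot pattern from the slide: det adj noun verb det adj noun.
TEMPLATE = ["det", "adj", "noun", "verb", "det", "adj", "noun"]

# Hypothetical word lists (assumption: in practice these come from a
# part-of-speech tagged lexicon filtered to low frequency words).
WORDS = {
    "det": ["the"],
    "adj": ["parklike", "vague", "simplistic", "episcopal", "humane", "masterly"],
    "noun": ["holders", "vegetables", "consonants", "geniuses", "serials"],
    "verb": ["threw", "swam", "woke", "withdrew"],
}

def make_sus(rng):
    """Fill each template slot with a random word of the right category."""
    words = [rng.choice(WORDS[slot]) for slot in TEMPLATE]
    sentence = " ".join(words)
    return sentence[0].upper() + sentence[1:] + "."

rng = random.Random(0)  # fixed seed so the test set is reproducible
for _ in range(3):
    print(make_sus(rng))
```

Because the words are chosen independently, listeners cannot use meaning to guess a word they did not hear clearly, which is the point of the test.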
Standard Tests
MOS: mean opinion scores
1-5 quality, naturalness, “like it”
Take average score
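Taking the average score is simple, but it helps to report how uncertain that average is. The sketch below computes a MOS with an approximate 95% confidence interval using a normal approximation; the ratings are hypothetical, and with few listeners a t-based interval would be wider.

```python
from math import sqrt
from statistics import mean, stdev

def mos_with_ci(ratings, z=1.96):
    """Return (mean opinion score, half-width of an approximate 95% CI).

    ratings: integer scores on the 1-5 scale from individual listeners.
    z=1.96 assumes a normal approximation (an assumption of this sketch).
    """
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    m = mean(ratings)
    half_width = z * stdev(ratings) / sqrt(len(ratings))
    return m, half_width

# Hypothetical ratings from ten listeners for one voice.
ratings = [4, 3, 5, 4, 4, 2, 3, 4, 5, 4]
score, half_width = mos_with_ci(ratings)
print(f"MOS = {score:.2f} +/- {half_width:.2f}")
```

Two voices whose intervals overlap heavily cannot be confidently ranked from this data alone.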
Some experimental problems
Order of presentation
Other aids change perception
Showing the text makes it much easier
Having a talking head “improves” the synthesis
Hardware quality
Some voices better on the telephone
Loudspeaker quality (headphone quality)
Room acoustics
Volume
Understandability
Harder if doing other task
Personal preference
Voice is fully understandable but “creepy”
Voice is incomprehensible but “funny”
Sounds like my grade school teacher
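Order-of-presentation effects are often reduced by rotating the order across listener groups so that every system is heard in every position equally often. The sketch below builds a simple cyclic Latin square; the system names are placeholders, and this is one possible design rather than the procedure used in the lecture.

```python
def latin_square_orders(systems):
    """Return one presentation order per listener group.

    Row r is the system list rotated by r positions, so each system
    appears exactly once in each presentation position across groups.
    """
    n = len(systems)
    return [[systems[(row + col) % n] for col in range(n)] for row in range(n)]

# Placeholder system names (hypothetical).
orders = latin_square_orders(["A", "B", "C", "D"])
for group, order in enumerate(orders, start=1):
    print(f"listener group {group}: {' '.join(order)}")
```

A cyclic square balances position but not which system precedes which; shuffling the rows and columns, or using a balanced design, addresses that if it matters.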
TTS Evaluation
How good are your ears?
SUS Sentences
sus_00005
sus_00012
sus_00017
sus_00022
SUS Sentences
The sorrowful premieres sang the - stentation gymnast
The temperamental gateways forgave the weatherbeaten finalist
The disruptive billboards blew the sugary endorsement
The serene adjustments foresaw the acceptable acquisition
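When listeners type in what they hear, the typed text can be scored against reference sentences like those above with word error rate (WER). The sketch below uses a standard word-level edit distance; the typed response is a made-up example, and the normalization (lowercasing, stripping punctuation) is an assumption.

```python
import string

def normalize(text):
    """Lowercase, strip punctuation, and split into words."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()

def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

reference = "The disruptive billboards blew the sugary endorsement"
typed = "the disruptive bill boards blew the sugar endorsement"  # hypothetical
print(round(word_error_rate(reference, typed), 3))
```

Exact-match scoring is harsh on spelling variants, so a real evaluation may want to accept homophones or obvious typos before counting errors.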
TTS Evaluation
In mud eels are, in mud none are
A 1918 state constitutional amendment made Massachusetts one of 23 states where citizens can enact laws by plebiscite.
Which is which?
The numbers are 25 and 34.
The numbers 20 5 and 34.
What is the temperature in Pittsburgh?
Objective Synthesis Tests
Text analysis
How well do you cover NSWs
How well do you cover homographs
Lexical coverage
How often do you see a new word
Lexical correctness
How correct are pronunciations
For unseen words
For seen words
Phonetic intelligibility
DRT tests
Semantic intelligibility
SUS tests
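Lexical coverage ("how often do you see a new word") is one of the easier objective measures to compute: count the share of tokens in test text that are missing from the pronunciation lexicon. The lexicon and text below are tiny hypothetical stand-ins for a real dictionary and real application data.

```python
def oov_rate(tokens, lexicon):
    """Return (fraction of tokens not in the lexicon, sorted unseen word types)."""
    if not tokens:
        raise ValueError("no tokens to check")
    lowered = [t.lower() for t in tokens]
    unseen = [t for t in lowered if t not in lexicon]
    return len(unseen) / len(lowered), sorted(set(unseen))

# Hypothetical lexicon and test text.
lexicon = {"what", "is", "the", "temperature", "in"}
tokens = "What is the temperature in Pittsburgh".split()

rate, new_words = oov_rate(tokens, lexicon)
print(f"OOV rate: {rate:.1%}, unseen words: {new_words}")
```

The unseen words are exactly the ones that fall through to letter-to-sound rules, so they are also the list to check for pronunciation correctness on unseen words.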
Blizzard Challenge
Annual Event from 2005 (15 years plus)
Distribute large databases of speech
Participants
Build a voice
Synthesize a set of sentences
Listeners
Listen and grade results
Blizzard Challenge
2005: US English synthesis, 4 voices, 1 hour each
4 teams plus “Studio” (human speech)
2006: US English: 1 voice: 6 hours and 1 hour
12 teams
2007: US English: 1 voice: 9 hours and 1 hour
14 teams
2008: UK English: 15 hours: Mandarin 5 hours
19 teams
2009: UK English: 15 hours: Mandarin 5 hours
2010: UK English 18 hours: Mandarin 6 hours
2010- Audio Books, Indian Languages, Speaking in Noise
Split between industry and academia
Split between Asia, Europe, America (mostly Europe and Asia).
Listeners
Three sets of listeners
Speech experts (participants)
Paid undergrads (native speakers)
Volunteers
Types of tests
MOS tests (1-5)
SUS tests
DRT tests
About 300 listeners in total
Listening
Web based
So everyone did it in a different environment
But we got access to more people
Asked to do it in a quiet office with headphones
Could listen multiple times
Blizzard Challenge Results
Speech Experts
Like synthesis better
Understand synthesis better
Volunteers don’t always finish tests
Undergrads sometimes finish tests
(or put in filler answers)
Results were correlated over different subgroups
Application Tests
How does it work *in* the application
With real application data
A good voice is not noticed
Have *real* users evaluate it
Give them a choice (even if artificial)
CEO chooses the one they like!
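When users are given a choice between two voices, the preference counts can be checked against chance with an exact two-sided sign test. The sketch below is one simple way to do that; the counts are hypothetical, and "no preference" answers are assumed to have been dropped beforehand.

```python
from math import comb

def preference_p_value(prefer_x, prefer_y):
    """Exact two-sided sign test for a forced-choice preference.

    Null hypothesis: each listener is equally likely to pick either voice.
    """
    n = prefer_x + prefer_y
    if n == 0:
        raise ValueError("no preferences recorded")
    k = max(prefer_x, prefer_y)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical outcome: 15 of 20 users preferred voice X.
p = preference_p_value(15, 5)
print(f"p = {p:.4f}")
```

A small p-value says the preference is unlikely to be chance; it does not say why users preferred that voice, which still needs follow-up questions.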
Clearer Spoken Output
In Let’s Go Bus Domain
Lexical Choice
The next bus is at 10:23
The next bus is in 11 minutes
Prosodic variation
The next bus is at 10:23
The next bus is at, 10:23.
Spectral variation
Clear articulation (when asked to repeat)
The next bus is at, 10:23.
Summary
TTS Evaluation is hard
But not impossible
Clear ways (that are consistent) are available
MOS scores
SUS
Application based testing