Speech Processing 15-492/18-492: Speech Synthesis Evaluation
Evaluating Speech Synthesis
- How good is the voice?
  - “This voice is a 45.67”
- Is voice X better than voice Y?
  - Why?
Evaluation
- Objective measures
  - Run a program and get a number
- Subjective measures
  - Have human listeners give a score
- Do objective and subjective scores correlate?
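One way to check whether an objective measure tracks listener judgments is to compute a correlation coefficient over a set of voices. The sketch below is a minimal illustration, assuming one objective number and one mean listener score per voice. The variable names and all of the numbers are made up for the example and do not come from any real evaluation.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: one entry per synthetic voice.
objective_distance = [6.1, 5.4, 7.0, 4.8, 5.9]  # program output, lower is better
listener_score = [3.1, 3.6, 2.4, 4.0, 3.2]      # mean human rating, higher is better

r = pearson(objective_distance, listener_score)
print(f"Pearson r = {r:.3f}")  # strongly negative here: more distance, lower rating
```

A value near -1 or +1 suggests the objective number could stand in for listeners on this kind of data; a value near 0 suggests it measures something listeners do not care about.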
Human Tests
- Synthesis people are warped
  - The more you listen the better it becomes
  - They hear things others don’t
- Non-synthesis people are warped
  - People are very sensitive to listening conditions
  - What question do you ask
  - What hardware you play it on
- There are (at least) two orthogonal scales
  - Understandable
  - Natural
Standard Tests
- DRT: diagnostic rhyme tests
  - Test confusable phones
  - “bat” vs “pat”
  - Good for identifying phone errors
  - Sometimes in carrier sentences: “Now we will say pat again.”
- Unit selection
  - Just include the standard words in the database
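The DRT procedure above can be sketched as a small script: build one carrier-sentence prompt per word, let the listener choose between the two rhyming words, and count the wrong choices. This is only an illustration. Apart from “bat”/“pat” and the carrier sentence, the word pairs, function names, and listener responses are assumptions.

```python
CARRIER = "Now we will say {word} again."

# Word pairs that differ in one initial phone.
# Only "bat"/"pat" comes from the slide; the others are illustrative.
PAIRS = [("bat", "pat"), ("dip", "tip"), ("van", "fan")]

def make_trials(pairs):
    """One trial per word: the prompt to synthesize, the two choices, the answer."""
    trials = []
    for a, b in pairs:
        for target in (a, b):
            trials.append({
                "prompt": CARRIER.format(word=target),
                "choices": (a, b),
                "target": target,
            })
    return trials

def error_rate(trials, responses):
    """Fraction of trials where the listener picked the wrong word."""
    if len(trials) != len(responses):
        raise ValueError("need exactly one response per trial")
    wrong = sum(1 for t, r in zip(trials, responses) if r != t["target"])
    return wrong / len(trials)

trials = make_trials(PAIRS)
# Hypothetical listener who hears "bat" both times in the first pair.
responses = ["bat", "bat", "dip", "tip", "van", "fan"]
print(f"DRT error rate: {error_rate(trials, responses):.2f}")
```

Grouping the errors by pair rather than pooling them would point at the specific phone contrast the voice gets wrong.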
Standard Tests
- SUS: Semantically unpredictable sentences
  - Template: det adj noun verb det adj noun
  - Automatically filled in with low frequency words
    - The parklike holders threw the vague vegetables.
    - The simplistic consonants swam the episcopal quartet.
    - The dark geniuses woke the humane emptiness.
    - The masterly serials withdrew the collaborative brochure.
- Test for understandability
  - Ask users to type in what they hear
  - Good for discrimination
  - Very hard for even fluent non-natives
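Filling the det adj noun verb det adj noun template can be sketched in a few lines. The lexicon below is a tiny hand-made one built from words in the example sentences, purely as an assumption for illustration; a real test would draw low-frequency words from a large part-of-speech tagged word list. The function and variable names are hypothetical.

```python
import random

TEMPLATE = ["det", "adj", "noun", "verb", "det", "adj", "noun"]

# Tiny illustrative lexicon; a real SUS generator would use a large
# tagged word list restricted to low-frequency entries.
LEXICON = {
    "det": ["the"],
    "adj": ["parklike", "vague", "simplistic", "episcopal", "humane"],
    "noun": ["holders", "vegetables", "consonants", "quartet", "geniuses"],
    "verb": ["threw", "swam", "woke", "withdrew"],
}

def make_sus(rng):
    """Fill the template slot by slot, never reusing a content word."""
    words = []
    for pos in TEMPLATE:
        if pos == "det":
            candidates = LEXICON[pos]
        else:
            candidates = [w for w in LEXICON[pos] if w not in words]
        words.append(rng.choice(candidates))
    sentence = " ".join(words)
    return sentence[0].upper() + sentence[1:] + "."

rng = random.Random(0)  # fixed seed so the test set is reproducible
for _ in range(3):
    print(make_sus(rng))
```

Because the words are chosen independently of each other, listeners cannot use meaning to guess a word they did not hear clearly, which is the point of the test.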
Standard Tests
- MOS: mean opinion scores
  - 1-5 scale: quality, naturalness, “like it”
  - Take average score
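Taking the average is a one-liner, but reporting an interval alongside it shows how much the number depends on who happened to listen. The sketch below assumes integer ratings on the 1-5 scale and uses a normal approximation for the 95% interval, which is an assumption and is rough for small listener counts. The ratings are made up.

```python
import math
import statistics

def mos_with_ci(ratings, z=1.96):
    """Mean opinion score plus the half-width of an approximate 95% interval."""
    for r in ratings:
        if not 1 <= r <= 5:
            raise ValueError(f"rating {r} is outside the 1-5 scale")
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate spread")
    mean = statistics.mean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width

# Hypothetical ratings of one voice from eight listeners.
ratings = [4, 3, 5, 4, 4, 2, 3, 4]
mos, half_width = mos_with_ci(ratings)
print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```

If the intervals for voice X and voice Y overlap heavily, the test has probably not shown that one is better than the other.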
Some experimental problems
- Order of presentation
- Other aids change perception
  - Showing the text makes it much easier
  - Having a talking head “improves” the synthesis
- Hardware quality
  - Some voices are better on the telephone
  - Loudspeaker quality (headphone quality)
  - Room acoustics
  - Volume
- Understandability
  - Harder if doing another task
- Personal preference
  - Voice is fully understandable but “creepy”
  - Voice is incomprehensible but “funny”
  - Sounds like my grade school teacher
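A common way to keep order of presentation from favoring one system is to rotate the order across listeners with a Latin square, so each system is heard equally often in each position. The sketch below uses a simple cyclic square; the function names and system labels are hypothetical. Note that a cyclic square balances position only, not which system immediately precedes which.

```python
def latin_square_orders(systems):
    """Row i is the list rotated by i, so each system appears once per position."""
    n = len(systems)
    return [[systems[(i + j) % n] for j in range(n)] for i in range(n)]

def order_for_listener(systems, listener_index):
    """Assign listeners to rows of the square in round-robin fashion."""
    orders = latin_square_orders(systems)
    return orders[listener_index % len(orders)]

systems = ["A", "B", "C", "D"]  # hypothetical system labels
for listener in range(4):
    print(listener, order_for_listener(systems, listener))
```

With a listener count that is a multiple of the number of systems, every system gets the same number of first, second, third, and last slots.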
TTS Evaluation
- How good are your ears?
SUS Sentences
- sus_00022
- sus_00012
- sus_00005
- sus_00017
SUS Sentences
- The serene adjustments foresaw the acceptable acquisition
- The temperamental gateways forgave the weatherbeaten finalist
- The sorrowful premieres sang the ostentatious gymnast
- The disruptive billboards blew the sugary endorsement
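Once listeners have typed what they heard, each transcript can be scored against the reference sentence with a word error rate. The sketch below uses word-level edit distance (substitutions, insertions, deletions) after lowercasing and stripping punctuation; that normalization is an assumption, and the typed responses are invented for the example.

```python
import string

def normalize(text):
    """Lowercase, drop ASCII punctuation, and split into words."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()

def word_error_rate(reference, typed):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(typed)
    if not ref:
        raise ValueError("reference sentence is empty")
    prev = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)

reference = "The serene adjustments foresaw the acceptable acquisition"
print(word_error_rate(reference, "the serene adjustment foresaw the acceptable acquisition"))
print(word_error_rate(reference, "the serene foresaw the acquisition"))
```

Exact string matching is harsh on spelling slips for rare words, so a real evaluation may want a spelling-tolerant comparison; that is left out here.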
TTS Evaluation
- In mud eels are, in mud none are
- A 1918 state constitutional amendment made Massachusetts one of 23 states where citizens can enact laws by plebiscite.
- Which is which
  - The numbers are 25 and 34.
  - The numbers 20 5 and 34.
- What is the temperature in Pittsburgh
Objective Synthesis Tests
- Text analysis
  - How well do you cover NSWs (non-standard words)
  - How well do you cover homographs
- Lexical coverage
  - How often do you see a new word
- Lexical correctness
  - How correct are pronunciations
    - For unseen words
    - For seen words
- Phonetic intelligibility
  - DRT tests
- Semantic intelligibility
  - SUS tests
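Lexical coverage is the easiest of these to turn into a program that returns a number: count how many tokens in a test text are missing from the pronunciation lexicon. The sketch below assumes the lexicon is available as a plain set of lowercase words, which is a simplification, and its tokenizer deliberately keeps alphabetic tokens only, so non-standard words such as "10:23" are ignored rather than counted.

```python
import re

def tokenize(text):
    """Lowercase alphabetic tokens only; digits and symbols are skipped."""
    return re.findall(r"[a-z']+", text.lower())

def oov_rate(text, lexicon):
    """Return (fraction of tokens not in the lexicon, sorted unknown words)."""
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("no word tokens found in text")
    unknown = [t for t in tokens if t not in lexicon]
    return len(unknown) / len(tokens), sorted(set(unknown))

# Hypothetical toy lexicon and test sentence.
lexicon = {"the", "next", "bus", "is", "at", "in", "minutes"}
rate, unknown = oov_rate("The next bus is in eleven minutes", lexicon)
print(f"OOV rate: {rate:.2f}, unknown words: {unknown}")
```

Running this over text from the target domain gives a rough answer to "how often do you see a new word"; whether the predicted pronunciations for those words are correct is a separate check.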
Blizzard Challenge
- Annual event since 2005
- Distribute large databases of speech
- Participants
  - Build a voice
  - Synthesize a set of sentences
- Listeners
  - Listen and grade results
Blizzard Challenge
- 2005: US English synthesis, 4 voices, 1 hour each
  - 4 teams plus “Studio” (human speech)
- 2006: US English, 1 voice: 6 hours and 1 hour
  - 12 teams
- 2007: US English, 1 voice: 9 hours and 1 hour
  - 14 teams
- 2008: UK English, 15 hours; Mandarin, 5 hours
  - 19 teams
  - Split between industry and academia
  - Split between Asia, Europe, Americas
Listeners
- Three sets of listeners
  - Speech experts (participants)
  - Paid undergrads (native speakers)
  - Volunteers
- Types of tests
  - MOS tests (1-5)
  - SUS tests
  - DRT tests
- About 300 listeners in total
Listening
- Web based
  - So everyone did it in a different environment
  - But we got access to more people
- Asked to do it in a quiet office with headphones
- Could listen multiple times
Blizzard Challenge Results
- Speech experts
  - Like synthesis better
  - Understand synthesis better
- Volunteers don’t always finish tests
- Undergrads sometimes finish tests
  - (or put in filler answers)
- Results were correlated over different subgroups
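Agreement between listener subgroups can be checked by ranking the systems within each subgroup and comparing the rankings. The sketch below computes Spearman's rank correlation on per-system mean scores. All of the scores are invented for illustration and are not Blizzard Challenge results; the function names are hypothetical.

```python
def ranks(values):
    """Average ranks (1 = smallest); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank lists."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical mean scores for four systems, one list per listener group.
experts = [4.1, 3.5, 3.0, 2.2]
undergrads = [3.6, 3.1, 2.5, 1.9]
volunteers = [3.8, 2.9, 3.0, 2.0]

print(spearman(experts, undergrads))  # same ordering of systems
print(spearman(experts, volunteers))  # two middle systems swapped
```

Rank correlation fits the observation above: experts may give higher absolute scores than other listeners while still ordering the systems the same way.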
Application Tests
- How does it work *in* the application
  - With real application data
  - A good voice is not noticed
- Have *real* users evaluate it
  - Give them a choice (even if artificial)
  - CEO chooses the one they like!
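When users are given a choice between two voices, the resulting preference counts can be checked against chance with a sign test. The sketch below computes a two-sided exact binomial p-value under the assumption that each user makes one independent choice and that "no preference" answers have been dropped beforehand. The counts are made up.

```python
from math import comb

def sign_test_p_value(prefer_x, prefer_y):
    """Two-sided exact binomial test of the hypothesis of no preference (p = 0.5)."""
    n = prefer_x + prefer_y
    if n == 0:
        raise ValueError("need at least one preference judgment")
    k = max(prefer_x, prefer_y)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical outcome: 15 users chose voice X, 5 chose voice Y.
p = sign_test_p_value(15, 5)
print(f"p = {p:.4f}")
```

A small p-value says the preference is unlikely to be chance; it does not say why users preferred that voice.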
Clearer Spoken Output
- In the Let’s Go Bus domain
- Lexical choice
  - The next bus is at 10:23
  - The next bus is in 11 minutes
- Prosodic variation
  - The next bus is at 10:23
  - The next bus is at, 10:23.
- Spectral variation
  - Clear articulation (when asked to repeat)
  - The next bus is at, 10:23.
Summary
- TTS evaluation is hard
  - But not impossible
- Clear ways (that are consistent) are available
  - MOS scores
  - SUS
  - Application-based testing
HW2: TTS
- Due 3:30pm Monday October 20th
- Install Festival and Festvox
- Find 10 errors in each of two different synthesizers
- Build a voice
  - A Talking Clock
  - A general voice
  - (or both)