Speech Processing 11-492/18-492

SLIDE 1

Speech Processing 11-492/18-492

Speech Synthesis Evaluation

SLIDE 2

Evaluating Speech Synthesis

How good is the voice?

This voice is a 45.67

Is voice X better than voice Y?

Why?

SLIDE 3

Evaluation

Objective measures

Run a program and get a number

Subjective measures

Have human listeners give a score

Do objective and subjective scores correlate?
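One way to check whether objective and subjective scores correlate is to compute a correlation coefficient over a set of voices. The sketch below uses Pearson's r. The scores are made-up illustrative values, and the choice of "lower is better" for the objective measure is an assumption, not something stated on the slide.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: one objective score and one listener score per voice.
objective = [6.1, 5.4, 5.0, 4.2]   # program output, lower is better (assumed)
subjective = [2.1, 2.9, 3.4, 4.0]  # mean listener rating, higher is better

r = pearson_r(objective, subjective)
print(f"Pearson r = {r:.2f}")
```

With these made-up values r is strongly negative, which is what agreement looks like when one scale runs "lower is better" and the other "higher is better".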

SLIDE 4

Human Tests

Synthesis people are warped

The more you listen the better it becomes

They hear things others don’t

Non-synthesis people are warped

People are very sensitive to listening conditions

What question do you ask

What hardware you play it on

There are (at least) two orthogonal scales

Understandability

Naturalness

SLIDE 5

Standard Tests

DRT: diagnostic rhyme tests

Test confusable phones

“bat” vs “pat”

Good for identifying phone errors

Sometimes in carrier sentences

Now we will say pat again.

Unit selection

Just include the standard words in the database
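A DRT trial can be sketched as: play one word of a confusable pair (optionally inside the carrier sentence), let the listener pick which word they heard, then score the responses. The scoring formula below, 100 * (right - wrong) / total, is a commonly used correction for guessing in two-choice tests; treating it as the scoring rule here is an assumption, and the responses are invented.

```python
def carrier_sentence(word, template="Now we will say {} again."):
    """Embed a test word in the carrier sentence from the slide."""
    return template.format(word)

def drt_score(responses):
    """Score a two-choice rhyme test.

    responses: list of (played_word, chosen_word) pairs.
    Returns 100 * (right - wrong) / total, so pure guessing scores about 0.
    """
    if not responses:
        raise ValueError("no responses to score")
    right = sum(1 for played, chosen in responses if played == chosen)
    wrong = len(responses) - right
    return 100.0 * (right - wrong) / len(responses)

# Hypothetical listener responses for the bat/pat pair.
responses = [("bat", "bat"), ("pat", "bat"), ("pat", "pat"), ("bat", "bat")]

print(carrier_sentence("pat"))
print(drt_score(responses))
```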

SLIDE 6

Standard Tests

SUS: Semantically unpredictable sentences

Det adj noun verb det adj noun

Automatically filled in with low frequency words

The parklike holders threw the vague vegetables

The simplistic consonants swam the episcopal quartet

The dark geniuses woke the humane emptiness.

The masterly serials withdrew the collaborative brochure

Test for understandability

Ask users to type in what they hear

Good for discrimination

Very hard for even fluent non-natives
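The template filling described above can be sketched in a few lines. This is only an illustration: the word lists are tiny and hand-picked from the example sentences, whereas a real generator would presumably draw low-frequency words from a part-of-speech-tagged lexicon with frequency counts. The names TEMPLATE, WORDS and make_sus are hypothetical.

```python
import random

# Slot sequence from the slide: det adj noun verb det adj noun
TEMPLATE = ["det", "adj", "noun", "verb", "det", "adj", "noun"]

# Hypothetical word lists, taken from the example sentences on the slide.
WORDS = {
    "det": ["the"],
    "adj": ["parklike", "vague", "simplistic", "episcopal", "dark", "humane"],
    "noun": ["holders", "vegetables", "consonants", "quartet", "geniuses"],
    "verb": ["threw", "swam", "woke", "withdrew"],
}

def make_sus(rng):
    """Fill each template slot with a random word of the right category."""
    words = [rng.choice(WORDS[slot]) for slot in TEMPLATE]
    sentence = " ".join(words)
    return sentence[0].upper() + sentence[1:] + "."

rng = random.Random(0)  # fixed seed so a test set can be regenerated
for _ in range(3):
    print(make_sus(rng))
```

Because each slot is filled independently, the result is grammatical but the listener cannot guess a word from its neighbours.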

SLIDE 7

Standard tests

MOS: mean opinion scores

1-5 quality, naturalness, “like it”

Take average score
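Taking the average is a one-liner, but a single number hides how much listeners disagree, so the sketch below also reports an approximate 95% confidence half-width. The ratings are invented, and the normal approximation (z = 1.96) is an assumption that is rough for small listener counts.

```python
import math
import statistics

def mos_with_ci(ratings, z=1.96):
    """Mean opinion score plus an approximate 95% confidence half-width."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    mean = statistics.mean(ratings)
    # Standard error of the mean, scaled by z (normal approximation).
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width

# Hypothetical 1-5 ratings from eight listeners for one voice.
ratings = [4, 3, 5, 4, 4, 2, 3, 4]
mean, half_width = mos_with_ci(ratings)
print(f"MOS = {mean:.2f} +/- {half_width:.2f}")
```

If the intervals for voice X and voice Y overlap heavily, the difference in their averages may not mean much.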

SLIDE 8

Some experimental problems

Order of presentation

Other aids change perception

Showing the text makes it much easier

Having a talking head “improves” the synthesis

Hardware quality

Some voices are better on the telephone

Loudspeaker quality (headphone quality)

Room acoustics

Volume

Understandability

Harder if doing another task

Personal preference

Voice is fully understandable but “creepy”

Voice is incomprehensible but “funny”

Sounds like my grade school teacher
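One common way to reduce the order-of-presentation problem listed above is to give different listeners different orders, so that each voice appears in each position equally often. The sketch below builds a cyclic Latin square for that purpose. It is an illustration rather than a procedure from the lecture, and the voice labels are placeholders.

```python
def latin_square_orders(items):
    """Cyclic Latin square: row i is the item list rotated left by i."""
    n = len(items)
    return [[items[(i + j) % n] for j in range(n)] for i in range(n)]

voices = ["A", "B", "C", "D"]  # placeholder voice labels
orders = latin_square_orders(voices)

# Assign listeners to rows in rotation.
for listener_id in range(6):
    order = orders[listener_id % len(orders)]
    print(listener_id, order)
```

A cyclic square balances position only. It does not balance which voice immediately precedes which, so it reduces order effects rather than removing them.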

SLIDE 9

TTS Evaluation

How good are your ears?

SLIDE 10

SUS Sentences

sus_00005

sus_00012

sus_00017

sus_00022

SLIDE 11

SUS Sentences

The sorrowful premieres sang the stentation gymnast

The temperamental gateways forgave the weatherbeaten finalist

The disruptive billboards blew the sugary endorsement

The serene adjustments foresaw the acceptable acquisition
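When listeners type in what they hear, the typed text has to be compared with the reference sentence. A minimal sketch, assuming word error rate (word-level edit distance divided by reference length) as the measure; the typed response below is invented.

```python
import string

def normalize(text):
    """Lowercase and strip punctuation so typing style is not counted as an error."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()

def word_error_rate(reference, typed):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(typed)
    if not ref:
        raise ValueError("reference sentence is empty")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # reference word missed
                            curr[j - 1] + 1,      # extra word typed
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

reference = "The disruptive billboards blew the sugary endorsement"
typed = "the disruptive billboard blew the sugary endorsement"  # hypothetical answer
wer = word_error_rate(reference, typed)
print(f"WER = {wer:.3f}")
```

Exact word matching is strict: "billboard" for "billboards" counts as a full error, so a real test may want spelling-tolerant matching.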

SLIDE 12

TTS Evaluation

SLIDE 13

TTS Evaluation

In mud eels are, in mud none are

A 1918 state constitutional amendment made Massachusetts one of 23 states where citizens can enact laws by plebiscite.

Which is which?

The numbers are 25 and 34.

The numbers 20 5 and 34.

What is the temperature in Pittsburgh?

SLIDE 14

Objective Synthesis Tests

Text analysis

How well do you cover NSWs (non-standard words)

How well do you cover homographs

Lexical coverage

How often do you see a new word

Lexical correctness

How correct are pronunciations

For unseen words

For seen words

Phonetic intelligibility

DRT tests

Semantic intelligibility

SUS tests
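Lexical coverage is the easiest of these to turn into "run a program and get a number". The sketch below counts how often a running-text token is missing from the pronunciation lexicon. The lexicon and the test text are toy examples, and whitespace tokenization is a simplifying assumption.

```python
def oov_rate(tokens, lexicon):
    """Return (fraction of tokens not in the lexicon, sorted list of missing words)."""
    if not tokens:
        raise ValueError("no tokens to check")
    unknown = [t.lower() for t in tokens if t.lower() not in lexicon]
    return len(unknown) / len(tokens), sorted(set(unknown))

# Toy lexicon: a real one would hold tens of thousands of entries.
lexicon = {"what", "is", "the", "temperature", "in"}
tokens = "What is the temperature in Pittsburgh".split()

rate, missing = oov_rate(tokens, lexicon)
print(f"new-word rate = {rate:.2f}, missing = {missing}")
```

The missing words are the ones whose pronunciations have to be predicted, which is where the "unseen words" correctness measure applies.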

SLIDE 15

Blizzard Challenge

Annual Event from 2005 (15 years plus)

Distribute large databases of speech

Participants

Build a voice

Synthesize a set of sentences

Listeners

Listen and grade results

SLIDE 16

Blizzard Challenge

2005: US English synthesis, 4 voices, 1 hour each

4 teams plus “Studio” (human speech)

2006: US English: 1 voice: 6 hours and 1 hour

12 teams

2007: US English: 1 voice: 9 hours and 1 hour

14 teams

2008: UK English: 15 hours; Mandarin: 5 hours

19 teams

2009: UK English: 15 hours; Mandarin: 5 hours

2010: UK English: 18 hours; Mandarin: 6 hours

2010 onward: audio books, Indian languages, speaking in noise

Split between industry and academia

Split between Asia, Europe, America (mostly Europe and Asia).

SLIDE 17

Listeners

Three sets of listeners

Speech experts (participants)

Paid undergrads (native speakers)

Volunteers

Types of tests

MOS tests (1-5)

SUS tests

DRT tests

About 300 listeners in total

SLIDE 18

Listening

Web based

So everyone did it in a different environment

But we got access to more people

Asked to do it in a quiet office with headphones

Could listen multiple times

SLIDE 19

Blizzard Challenge Results

Speech Experts

Like synthesis better

Understand synthesis better

Volunteers don’t always finish tests

Undergrads sometimes finish tests

(or put in filler answers)

Results were correlated over different subgroups
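"Correlated over different subgroups" can be checked by asking whether two listener groups rank the systems in the same order, even if one group scores everything higher. A minimal sketch using Spearman's rank correlation; the per-system scores are invented for illustration and are not Blizzard results.

```python
import math

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / (sx * sy)

# Hypothetical mean scores per system from two listener groups.
experts = {"A": 3.9, "B": 3.1, "C": 2.4, "D": 3.5}
volunteers = {"A": 3.4, "B": 2.8, "C": 2.0, "D": 3.0}

systems = sorted(experts)
rho = spearman_rho([experts[s] for s in systems],
                   [volunteers[s] for s in systems])
print(f"Spearman rho = {rho:.2f}")
```

In this made-up data the experts score every system higher than the volunteers do, yet both groups order the systems identically, so rho is 1.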

SLIDE 20

Application Tests

How does it work in the application

With real application data

A good voice is not noticed

Have real users evaluate it

Give them a choice (even if artificial)

CEO chooses the one they like!

SLIDE 21

Clearer Spoken Output

In the Let’s Go bus domain

Lexical Choice

The next bus is at 10:23

The next bus is in 11 minutes

Prosodic variation

The next bus is at 10:23

The next bus is at, 10:23.

Spectral variation

Clear articulation (when asked to repeat)

The next bus is at, 10:23.

SLIDE 22

Summary

TTS Evaluation is hard

But not impossible

Clear ways (that are consistent) are available

MOS scores

SUS

Application based testing
