SLIDE 1

SPEECH SYNTHESIS EVALUATION

SÉBASTIEN LE MAGUER

ADAPT CENTRE, SIGMEDIA LAB, EE ENGINEERING, TRINITY COLLEGE DUBLIN 11-07-2019

SLIDE 2

LET’S RECAPITULATE [2/2]

SLIDE 3

WHAT WE HEAR NOWADAYS

Objectively

  • WaveNet was a game changer
  • Tacotron: easy to use if you have enough data

What you may read

"Human-like", "High-fidelity", "Highly Natural", . . .

SLIDE 4

KEY QUESTIONS/PROBLEMS

AI Hype

  • https://www.economist.com/technology-quarterly/2020/06/11/an-understanding-of-ais-limitations-is-starting-to-sink-in
  • Environmental issues: [Strubell et al., (2019)] and follow-up work

Problematic question

Is the quality really that good?

Fundamental questions

Did we solve anything? If yes, what did we solve?

SLIDE 5

LET’S GET STARTED

SLIDE 6

WHAT IS EVALUATION

SLIDE 7

WHAT IS EVALUATION

The ideal

Being able to describe in detail what a system brings compared to other ones

In practice

Classify/order the systems based on their synthesized speech

SLIDE 8

WHAT TO EVALUATE?

  • acoustic prosody: temporal structure, tonal structure, amplitude profile
  • symbolic prosody: syllabic stress, word accent, sentence mode
  • pronunciation of words, also in sentence context
  • phrasing, rhythm
  • voice quality: inherent or introduced by signal processing?
  • discontinuities in unit concatenation
  • . . .

SLIDE 9

EVALUATION AXES

  • Intelligibility
  • Similarity
  • Naturalness

SLIDE 10

WHERE DOES IT TAKE PLACE?

[Pipeline diagram: offline training stage (text corpus → NLP → linguistic descriptions / acoustic parameters → models) and online generation stage (text → NLP → linguistic description → rendering → acoustic parameters → signal), with objective and subjective evaluation points marked along the pipeline.]

SLIDE 11

OBJECTIVE EVALUATION

SLIDE 12

OBJECTIVE EVALUATION - THE METRICS

Which axes

  • Intelligibility: not commonly used
  • Similarity: assessment/validation

The main metrics

  • Spectrum: MCD, RMSE, Euclidean distances
  • F0: RMSE (Hz/cent), V/UV ratio, LL-ratio
  • BAP: RMSE
  • Duration: RMSE, syllable/phoneme rate
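
To make these metrics concrete, here is a minimal sketch (not from the talk) of two of them, assuming you already have time-aligned mel-cepstral coefficients and F0 tracks extracted from the natural reference and the synthesized utterance:

```python
import numpy as np

def mel_cepstral_distortion(mcep_ref, mcep_syn):
    """Mean frame-wise Mel-Cepstral Distortion (MCD) in dB.

    mcep_ref, mcep_syn: (n_frames, n_coeffs) arrays of mel-cepstral
    coefficients, already time-aligned (e.g. by DTW) and usually with
    the 0th (energy) coefficient excluded.
    """
    diff = mcep_ref - mcep_syn
    # 10 / ln(10) * sqrt(2) is the constant of the usual MCD definition
    const = 10.0 / np.log(10.0) * np.sqrt(2.0)
    return float(np.mean(const * np.sqrt(np.sum(diff ** 2, axis=1))))

def f0_rmse_cent(f0_ref, f0_syn):
    """F0 RMSE in cents, computed only on frames voiced in both tracks."""
    voiced = (f0_ref > 0) & (f0_syn > 0)
    cents = 1200.0 * np.log2(f0_syn[voiced] / f0_ref[voiced])
    return float(np.sqrt(np.mean(cents ** 2)))
```
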

SLIDE 13

SUBJECTIVE EVALUATION

SLIDE 14

SUBJECTIVE EVALUATION - INTRODUCTION

Subjective evaluations are human focused. . . so expensive!

They should be really carefully prepared:

  • you won't be able to repeat the test if you mess up the preparation
  • the analysis of the results depends a lot on the preparation
  • be careful about the question asked and the targeted listeners (see checklist later!)

Generally at least 3 systems are involved:

  • the original voice
  • a reference (anchor) system
  • the analyzed system

SLIDE 15

INTELLIGIBILITY TEST

Semantically Unpredictable Sentences (SUS)

  • Unpredictable ⇒ forces the listener to "decipher" the message
  • Syntax is correct
  • Example: "A table eat the doctor"

Protocol guideline

Each step:

  • 1. The listener listens to a SUS
  • 2. They write down what they heard (a joker character for words not heard)
  • 3. A distance is computed between what was typed and the original sentence

Score = Word Error Rate (or, less commonly, Phone Error Rate)
A nice paper: [Benoit, (1990)]
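
As an illustration of the scoring step (not from the talk), a minimal Word Error Rate implementation over a reference SUS and the listener's transcript; joker/unheard tokens are simply treated as ordinary words here:

```python
def word_error_rate(reference, hypothesis):
    """WER between a reference SUS and what the listener typed.

    Standard Levenshtein (edit) distance over words, normalised by the
    number of reference words; substitutions, insertions and deletions
    all cost 1.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# e.g. word_error_rate("a table eat the doctor", "a fable * the doctor") == 0.4
```
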

SLIDE 16

THE STANDARD PROTOCOLS FOR SUBJECTIVE EVALUATION

SLIDE 17

SCORING METHODOLOGIES

The ACR protocol [ITU-T, (1996)]

  • Absolute Category Rating (ACR) ⇒ Mean Opinion Score (MOS)
  • Scores from 1 (bad) to 5 (excellent)

Key points

  • Systems and utterances are randomized (Latin-square design)
  • The question asked is going to condition the listener ⇒ caution!
  • Major problem: scores are "flattened"
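
Purely for illustration (the helper names are mine, not from the talk), a tiny sketch of the two mechanics above: averaging ACR ratings into a per-system MOS, and a cyclic Latin square for balancing which listener group hears which system for which set of utterances:

```python
import numpy as np

def mos_per_system(ratings):
    """Mean Opinion Score per system from ACR ratings (1..5).

    ratings: dict mapping system name -> list of individual scores.
    """
    return {system: float(np.mean(scores)) for system, scores in ratings.items()}

def latin_square(n):
    """Cyclic n x n Latin square: entry (i, j) = (i + j) % n.

    Rows can index listener groups and columns utterance groups, so that
    every system (value) appears exactly once per row and per column.
    """
    return [[(i + j) % n for j in range(n)] for i in range(n)]

# e.g. latin_square(3) -> [[0, 1, 2], [1, 2, 0], [2, 0, 1]]
```
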

SLIDE 18

ACR - INTERFACE

SLIDE 19

PREFERENCE-BASED METHODOLOGIES

AB(X) test

  • 2 samples (A) and (B) are presented
  • 3 choices: A, B, and no preference
  • ABX: a fixed reference X is also presented

Key points

  • Stricter than ACR ⇒ results are more significant
  • "No preference" can be removed ⇒ post-processing analysis required!
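
One common way to run the post-processing analysis mentioned above is a two-sided binomial (sign) test on the preference counts; how to treat the "no preference" votes is a design choice, and the sketch below simply discards them (an assumption, not part of the protocol):

```python
from scipy import stats

def ab_preference_test(n_prefer_a, n_prefer_b, n_no_pref=0, alpha=0.05):
    """Two-sided binomial test of A-vs-B preferences against a 50/50 null.

    n_no_pref is accepted but deliberately discarded (one possible choice;
    splitting the ties evenly between A and B is another).
    """
    n = n_prefer_a + n_prefer_b
    result = stats.binomtest(n_prefer_a, n=n, p=0.5, alternative="two-sided")
    return {"p_value": result.pvalue,
            "prefer_a_rate": n_prefer_a / n,
            "significant": result.pvalue < alpha}

# e.g. ab_preference_test(70, 30, n_no_pref=20) reports a significant preference for A
```
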

SLIDE 20

AB - INTERFACE

SLIDE 21

MUSHRA

MUltiple Stimuli with Hidden Reference and Anchor [ITU-R, (2001)]

  • Idea: combining scoring and preference
  • Continuous score from 0 to 100, with labelled steps every 20 points
  • Some constraints:
    • a given reference + the same reference hidden (consistency check)
    • given anchors

Key points

  • Mixes the scoring and preference methodologies
  • But:
    • difficult from the listener's perspective
    • small differences are difficult to interpret
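
A sketch of one possible post-screening step that exploits the hidden reference as a consistency check; the 90-point threshold and 15% failure rate follow a rule from later revisions of the MUSHRA recommendation and should be treated as assumptions here:

```python
def screen_listeners(ratings, reference_system="hidden_ref",
                     threshold=90, max_fail_rate=0.15):
    """Post-screening sketch for MUSHRA results.

    ratings: dict mapping listener -> dict mapping (system, utterance) -> score.
    A listener is rejected when they rate the hidden reference below
    `threshold` on more than `max_fail_rate` of their trials.
    """
    kept = {}
    for listener, scores in ratings.items():
        ref_scores = [s for (system, _utt), s in scores.items()
                      if system == reference_system]
        fails = sum(s < threshold for s in ref_scores)
        if ref_scores and fails / len(ref_scores) <= max_fail_rate:
            kept[listener] = scores
    return kept
```
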

SLIDE 22

MUSHRA - INTERFACE

SLIDE 23

WHAT TO DO WITH THE RESULTS

SLIDE 24

STATISTICAL ANALYSIS

Why?

We are using a sample ⇒ we want to generalize

How: using statistical tests

  • Generally a t-test or a Wilcoxon-based test
  • Generally set α = 0.05
  • Report the confidence interval and the effect size

Important !!!

Be careful and honest with the conclusion
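
A minimal sketch of such an analysis for two systems with paired listening-test scores, using SciPy; which test and which effect-size measure are appropriate depends on your design, so treat this as an illustration rather than a recipe:

```python
import numpy as np
from scipy import stats

def compare_systems(scores_a, scores_b, alpha=0.05):
    """Paired comparison of two systems' per-utterance (or per-listener) scores."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    diff = a - b
    # Parametric and non-parametric paired tests
    _, t_p = stats.ttest_rel(a, b)
    _, w_p = stats.wilcoxon(a, b)
    # Confidence interval on the mean difference
    ci = stats.t.interval(1 - alpha, len(diff) - 1,
                          loc=diff.mean(), scale=stats.sem(diff))
    # Cohen's d on the paired differences as a simple effect size
    d = diff.mean() / diff.std(ddof=1)
    return {"t_p": float(t_p), "wilcoxon_p": float(w_p),
            "ci": ci, "effect_size_d": float(d),
            "significant": t_p < alpha}
```
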

SLIDE 25

A COUNTER EXAMPLE (BLOG POST, PAPER IS BETTER)

Graphic results

SLIDE 26

A COUNTER EXAMPLE (BLOG POST, PAPER IS BETTER)

Graphic results "Explanation"

". . . were obtained in blind tests with human subjects (from over 500 ratings on 100 test sentences). As we can see, WaveNets reduce the gap between the state of the art and human-level performance by over 50% for both US English and Mandarin Chinese."

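For context, "reducing the gap ... by over 50%" is normally computed as the fraction of the baseline-to-human MOS difference covered by the new system; the numbers below are made up purely to illustrate the arithmetic, they are not the actual reported results:

```python
def gap_closed(mos_baseline, mos_new, mos_human):
    """Fraction of the baseline-to-human MOS gap covered by the new system.

    This is the usual reading of claims such as "reduces the gap by over
    50%"; note it says nothing about absolute MOS, listener variance, or
    statistical significance.
    """
    return (mos_new - mos_baseline) / (mos_human - mos_baseline)

# Hypothetical numbers only (not the actual reported scores):
# gap_closed(3.9, 4.2, 4.5) ≈ 0.5, i.e. "50% of the gap closed".
```
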
SLIDE 27

HOW TO INTERPRET RESULTS

Results taken from [Al-Radhi et al., (2018)]

SLIDE 28

VALIDATION

SLIDE 29

SOME PRECAUTIONS

Results have to be reproducible!

Environment setup reproducibility

  • Description of the conditions (speakers/headphones, . . . )

Protocol reproducibility

  • Description of the test
  • Description of the question
  • Description of the corpora
  • Description of the "cognitive aspect" (duration, pauses?, . . . )

Statistical reproducibility

  • Description of the listeners (number, expert vs non-expert, . . . )
  • Statistical analysis (confidence interval, . . . )

SLIDE 30

CHECKLIST - 1 [WESTER ET AL., (2015)]

☐ What test to use?
  • MOS, MUSHRA, preference, intelligibility, and same/different judgments all fit different situations.

☐ Which question(s) to ask?
  • Be aware that the question you ask may influence the answer you get. The terms you use may be interpreted differently by listeners, e.g., what does “quality” or “naturalness” actually mean?

☐ Which data to use for testing?
  • Factor out aspects that affect the evaluation, but which are unrelated to the research question studied.

☐ Is the evaluation material unbiased and free of training data?

☐ Is a reference needed?
  • Consider giving a reference or adding training material, particularly for intonation evaluation. Also consider the case for including other anchors.

SLIDE 31

CHECKLIST - 2 [WESTER ET AL., (2015)]

☐ What type of listeners?
  • Native vs. non-native? Speech experts vs. naive listeners? Age, gender, hearing impairments? Different listener groups can lead to different results.

☐ How many listeners to use?

☐ How many data points are needed?

☐ Is the task suitable for human listeners?
  • Take into consideration listener boredom, fatigue, and memory constraints, as well as cognitive load.

☐ Can you use crowd-sourcing?
  • The biggest concern here is how to ensure the quality of the test-takers.

☐ How is the experiment going to be conducted?
  • With headphones or speakers, over the web or in a listening booth?

SLIDE 32

SOME BIASES (EX: [CLARK ET AL., (2019)])

Background

Analyzes how to evaluate long utterances in an ACR-based test

Some results

SLIDE 33

THE BLIZZARD CHALLENGE

SLIDE 34

THE BLIZZARD CHALLENGE

Website: http://festvox.org/blizzard

When?

  • every year since 2005

Who (participants)?

  • universities
  • some companies

Which kind of systems?

  • Parametric
  • Unit selection
  • Hybrid

Philosophy

  • Focus on the analysis and the exchange rather than pure rating!
  • Results are made anonymous (however, by reading the different papers you can rebuild the results)

SLIDE 35

WHAT’S NOW

SLIDE 36

CURRENT SITUATION

Strong need for new protocols ([Wagner et al., (2019)])

  • MOS, MUSHRA are not refined enough!
  • What does a preference or a score mean?
  • Get more precise feedback, qualify the speech

SLIDE 37

SUBJECTIVE EVALUATION - THE NEW WAYS

Behavioural

Task focused (reaction time, completion, . . . ) [Wagner and Betz, (2017)]

Physiological

  • Pupillometry [Govender and King, (2018)]; be careful: [Winn et al., (2018)]
  • EEG [Parmonangan et al., (2019)] (background figure: [Siuly et al., (2016)]); be careful: [Belardinelli et al., (2019)]

SLIDE 38

OBJECTIVE EVALUATION - THE BIG MISCONCEPTION

For more details, see [Wagner et al., (2019)]

Subjective evaluation seems way better

  • More robust
  • Humans are involved in the loop

Key problem(s)

  • It is expensive
  • What do we learn about the signal?

Objective evaluation - 2 goals:

  • 1. Pointing out differences
  • 2. Classifying systems

SLIDE 39

TAKE HOME MESSAGES [6/6]

SLIDE 40

WHAT TO REMEMBER

Speech synthesis ≠ easy problem

  • A lot of human effort
  • A lot of computer effort

Different solutions for different problems

  • Parametric synthesis: handcrafted + database
  • Control/speed vs quality

SLIDE 41

THE CURRENT STATE

The DNN (r)evolution

  • Everything is moving to DNN-based architectures
  • Definitely a jump in quality (but how much and why?)

Don't forget the "user"

  • CHI: GAFA (obviously!), startups
  • Blind people: using diphone synthesis (why?)
  • A lot of others: speech researchers/scientists, entertainment, . . .

SLIDE 42

SOME SENSITIVE POINTS

A big potential danger

Spoofing (See challenge ASVSpoof [Wu et al., (2015)])

Black box vs control

  • DNN: we don't understand what the system is doing
  • And what happens when it fails?
  • Environmental issues [Strubell et al., (2019)]

Evaluation

  • Important issue
  • Information vs marketing

SLIDE 43

SOME DIRECTIONS

Expressive speech synthesis

  • Long sentences remain a challenge
  • Lots of potential applications

Security / sociology

  • Spoofing
  • Anonymization / gender neutrality

Evaluation / control

  • Tools to test hypotheses
  • Need to qualify the synthesis

SLIDE 44

A NICE RESOURCE

Synsig

  • https://www.synsig.org/index.php/Main_Page

Prof. Simon King's website

  • http://www.speech.zone/

SLIDE 45

BIBLIOGRAPHY [1/1]

SLIDE 46

Belardinelli, Paolo et al. (May 2019). “Reproducibility in TMS–EEG studies: A call for data sharing, standard procedures and effective experimental control”. In: Brain Stimulation: Basic, Translational, and Clinical Research in Neuromodulation 12.3, pp. 787–790.

Benoit, Christian (1990). “An intelligibility test using semantically unpredictable sentences: towards the quantification of linguistic complexity”. In: Speech Communication 9.4, pp. 293–304.

Clark, Rob et al. (2019). “Evaluating Long-form Text-to-Speech: Comparing the Ratings of Sentences and Paragraphs”. In: Proc. 10th ISCA Speech Synthesis Workshop, pp. 99–104.

Govender, Avashna and Simon King (2018). “Measuring the Cognitive Load of Synthetic Speech Using a Dual Task Paradigm”. In: Interspeech, pp. 2843–2847.

ITU-R, Recommendation (2001). “BS.1534-1: Method for the subjective assessment of intermediate sound quality (MUSHRA)”. In: International Telecommunications Union, Geneva.

ITU-T (1996). P.800: Methods for objective and subjective assessment of quality. Tech. rep.

Parmonangan, Ivan Halim et al. (2019). “Speech Quality Evaluation of Synthesized Japanese Speech Using EEG”. In: Proc. Interspeech 2019, pp. 1228–1232.

Al-Radhi, Mohammed Salah et al. (2018). “A Continuous Vocoder Using Sinusoidal Model for Statistical Parametric Speech Synthesis”. In: International Conference on Speech and Computer. Springer, pp. 11–20.

Siuly, Siuly et al. (2016). “Electroencephalogram (EEG) and Its Background”. In: EEG Signal Analysis and Classification: Techniques and Applications. Cham: Springer International Publishing, pp. 3–21.

SLIDE 47

BIBLIOGRAPHY II

Strubell, Emma et al. (July 2019). “Energy and Policy Considerations for Deep Learning in NLP”. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, pp. 3645–3650.

Wagner, Petra and Simon Betz (2017). “Speech Synthesis Evaluation – Realizing a Social Turn”. In: Tagungsband Elektronische Sprachsignalverarbeitung (ESSV). Saarbrücken, pp. 167–172.

Wagner, Petra et al. (2019). “Speech Synthesis Evaluation — State-of-the-Art Assessment and Suggestion for a Novel Research Program”. In: Proc. 10th ISCA Speech Synthesis Workshop, pp. 105–110.

Wester, Mirjam et al. (2015). “Are we using enough listeners? No! An empirically-supported critique of Interspeech 2014 TTS evaluations”. In: Annual Conference of the International Speech Communication Association (Interspeech), pp. 3476–3480.

Winn, Matthew B. et al. (Jan. 2018). “Best Practices and Advice for Using Pupillometry to Measure Listening”. In: Trends Hear. 22.

Wu, Zhizheng et al. (2015). “ASVspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge”. In: Sixteenth Annual Conference of the International Speech Communication Association.