SPEECH SYNTHESIS EVALUATION
Sébastien Le Maguer
ADAPT Centre, Sigmedia Lab, EE Engineering, Trinity College Dublin
11-07-2019
LET’S RECAPITULATE [2/2]
WHAT WE HEAR NOWADAYS
Objectively
WaveNet was a game changer
Tacotron: easy to use if you have enough data
What you may read
"Human-like", "High-fidelity", "Highly Natural", . . .
KEY QUESTIONS/PROBLEMS
AI Hype
https://www.economist.com/technology-quarterly/2020/06/11/an-understanding-of-ais-limitations-is-starting-to-sink-in
Environmental issues: [Strubell et al., (2019)] and follow-up work
Problematic question
Is the quality really that good?
Fundamental questions
Did we solve anything? If yes, what did we solve?
LET’S GET STARTED
WHAT IS EVALUATION
WHAT IS EVALUATION
The ideal
Being able to describe in detail what a system brings compared to other ones
In practice
Classify/order the systems based on their synthesis
WHAT TO EVALUATE?
acoustic prosody: temporal structure, tonal structure, amplitude profile
symbolic prosody: syllabic stress, word accent, sentence mode
pronunciation of words, also in sentence context
phrasing, rhythm
voice quality: inherent or introduced by signal processing?
discontinuities in unit concatenation
. . .
EVALUATION AXES
Intelligibility
Similarity
Naturalness
WHERE DOES IT TAKE PLACE?
[Diagram: offline training stage (text corpus → NLP → linguistic descriptions; signal → acoustic parameters → models) and online generation stage (text → NLP → linguistic descriptions → rendering → acoustic parameters → signal), with objective and subjective evaluation points along the pipeline]
OBJECTIVE EVALUATION
OBJECTIVE EVALUATION - THE METRICS
Which axes
Intelligibility: not much used
Similarity: assessment/validation
The main metrics
Spectrum: MCD, RMSE, Euclidean distances
F0: RMSE (Hz/cents), V/UV ratio, LL-ratio
BAP: RMSE
Duration: RMSE, syllable/phoneme rate
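To make two of these metrics concrete, here is a minimal Python sketch of MCD and of F0 RMSE in cents. It assumes the reference and synthesized features are already extracted and time-aligned (e.g., via DTW), which in practice is where most of the work lies.

import numpy as np

def mel_cepstral_distortion(mc_ref, mc_syn):
    # Frame-averaged MCD in dB over aligned mel-cepstra of shape
    # (frames, dims), with the 0th (energy) coefficient already removed.
    diff = mc_ref - mc_syn
    return (10.0 / np.log(10.0)) * np.mean(np.sqrt(2.0 * np.sum(diff ** 2, axis=1)))

def f0_rmse_cents(f0_ref, f0_syn):
    # RMSE in cents (1200 cents = 1 octave), restricted to frames
    # that are voiced (F0 > 0) in both the reference and the synthesis.
    voiced = (f0_ref > 0) & (f0_syn > 0)
    cents = 1200.0 * np.log2(f0_syn[voiced] / f0_ref[voiced])
    return np.sqrt(np.mean(cents ** 2))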
SUBJECTIVE EVALUATION
SUBJECTIVE EVALUATION - INTRODUCTION
Subjective evaluations are human focused . . . so expensive!
They should be very carefully prepared:
- you won't be able to repeat the test if you mess up the preparation
- the analysis of the results depends a lot on the preparation
- be careful about the question asked and the targeted listeners (see checklist later!)
Generally at least 3 systems are involved:
- the original voice
- a reference (anchor) system
- the analyzed system
INTELLIGIBILITY TEST
Semantically Unpredictable Sentences (SUS)
Unpredictable ⇒ forces the listener to "decipher" the message
Syntax is correct
Example: A table eats the doctor
Protocol guideline
One trial:
1. The listener listens to a SUS
2. They write down what they heard (a joker character for words not heard)
3. A distance is computed between what was typed and the original sentence
Score = Word Error Rate (or, less commonly, Phone Error Rate)
A nice paper: [Benoit, (1990)]
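For illustration, a minimal word error rate computation via word-level Levenshtein distance; the transcripts in the usage comment are hypothetical.

def word_error_rate(reference, hypothesis):
    # WER = (substitutions + deletions + insertions) / reference length.
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# e.g. word_error_rate("a table eats the doctor",
#                      "a fable eats the ? doctor")  -> 0.4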
THE STANDARD PROTOCOLS FOR SUBJECTIVE EVALUATION
SCORING METHODOLOGIES
The ACR protocol [ITU-T, (1996)]
Absolute Category Rating (ACR) ⇒ Mean Opinion Score (MOS)
Scores from 1 (bad) to 5 (excellent)
Key points
Systems and utterances are randomized (Latin-square design)
The question asked is going to condition the listener ⇒ caution!
Major problem: scores are "flattened"
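A minimal sketch of the randomization idea: a simple cyclic Latin square (the balanced variant typically used in practice also alternates directions to counter order effects, which this sketch omits).

def latin_square(n):
    # Row i is [i, i+1, ..., i+n-1] mod n: each system appears exactly
    # once per row and once in each presentation position across rows.
    return [[(i + j) % n for j in range(n)] for i in range(n)]

# Assign row (i mod n) of latin_square(n_systems) as listener i's
# presentation order, so positions are balanced across listeners.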
ACR - INTERFACE
PREFERENCE-BASED METHODOLOGIES
AB(X) test
Two samples (A) and (B) are presented
3 choices: A, B, and no preference
ABX: a fixed reference X is also presented
Key points
Stricter than ACR ⇒ more significant results
"no preference" can be removed ⇒ post-processing analysis required!
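A minimal sketch of one common post-processing choice (the vote counts below are hypothetical): drop the "no preference" votes and test the remaining A/B votes against a 50/50 null with a binomial (sign) test.

from scipy.stats import binomtest

prefer_a, prefer_b, no_pref = 48, 30, 22   # hypothetical vote counts

n = prefer_a + prefer_b                    # "no preference" votes dropped
result = binomtest(prefer_a, n=n, p=0.5)   # two-sided by default
print(f"{prefer_a}/{n} prefer A, p = {result.pvalue:.3f}")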
AB - INTERFACE
MUSHRA
MUltiple Stimuli with Hidden Reference and Anchor [ITU-R, (2001)]
Idea: combine scoring and preference
Continuous score from 0 to 100, with labelled steps every 20
Some constraints:
- a given reference + the same reference hidden (consistency check)
- given anchors
Key points
Mixes the scoring and preference methodologies
But:
- difficult from the listener's perspective
- small differences are difficult to interpret
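The hidden reference enables listener post-screening. As an illustrative sketch: the specific rule shown here (exclude listeners who rate the hidden reference below 90 on more than 15% of trials) follows the spirit of a later revision of the recommendation (BS.1534-3), and the array layout is an assumption.

import numpy as np

def screen_listeners(ratings, ref_idx, threshold=90, max_fail=0.15):
    # ratings: array of shape (listeners, trials, systems), scores 0..100.
    # Keep only listeners whose hidden-reference score falls below
    # `threshold` on at most `max_fail` of their trials.
    ref_scores = ratings[:, :, ref_idx]
    fail_rate = (ref_scores < threshold).mean(axis=1)
    return ratings[fail_rate <= max_fail]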
MUSHRA - INTERFACE
WHAT TO DO WITH THE RESULTS
STATISTICAL ANALYSIS
Why?
We are using a sample ⇒ we want to generalize
How: using statistical tests
Generally a t-test or a Wilcoxon-based test
Generally set α = 0.05
Report the confidence interval and the effect size
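A minimal sketch of such an analysis (the per-listener scores are hypothetical): a paired t-test and its non-parametric Wilcoxon counterpart, plus an effect size and a confidence interval on the mean difference.

import numpy as np
from scipy import stats

# Hypothetical per-listener mean scores for systems A and B (paired design).
mos_a = np.array([3.8, 4.1, 3.5, 4.0, 3.9, 4.2, 3.7, 4.0])
mos_b = np.array([3.5, 3.9, 3.4, 3.6, 3.8, 3.9, 3.5, 3.7])

t, p_t = stats.ttest_rel(mos_a, mos_b)   # paired t-test
w, p_w = stats.wilcoxon(mos_a, mos_b)    # non-parametric alternative

diff = mos_a - mos_b
d = diff.mean() / diff.std(ddof=1)       # Cohen's d for paired samples
ci = stats.t.interval(0.95, len(diff) - 1,
                      loc=diff.mean(), scale=stats.sem(diff))
print(f"t-test p={p_t:.3f}, Wilcoxon p={p_w:.3f}, d={d:.2f}, 95% CI={ci}")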
Important !!!
Be careful and honest with the conclusion
A COUNTER EXAMPLE (BLOG POST, PAPER IS BETTER)
Graphic results
"Explanation":
". . . were obtained in blind tests with human subjects (from over 500 ratings on 100 test sentences). As we can see, WaveNets reduce the gap between the state of the art and human-level performance by over 50% for both US English and Mandarin Chinese."
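For the record, this is the arithmetic behind a "gap reduced by X%" claim (the MOS values below are hypothetical, not the blog's): a large relative gap reduction can correspond to a small absolute difference on the 5-point scale.

# Hypothetical MOS values, for illustration only.
mos_baseline, mos_new, mos_human = 3.9, 4.2, 4.5

gap_closed = (mos_new - mos_baseline) / (mos_human - mos_baseline)
print(f"gap reduced by {gap_closed:.0%}")   # 50% here, yet the absolute
# MOS gain is only 0.3 points on a 5-point scale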
HOW TO INTERPRET RESULTS
Results taken from [Al-Radhi et al., (2018)]
VALIDATION
SOME PRECAUTIONS
Results have to be reproducible!
Environment setup reproducibility
- description of the conditions (speakers/headphones, . . . )
Protocol reproducibility
- description of the test
- description of the question
- description of the corpora
- description of the "cognitive aspect" (duration, pauses?, . . . )
Statistical reproducibility
- description of the listeners (number, expert vs. non-expert, . . . )
- statistical analysis (confidence interval, . . . )
CHECKLIST - 1 [WESTER ET AL., (2015)]
□ What test to use?
- MOS, MUSHRA, preference, intelligibility, and same/different judgments all fit different situations.
□ Which question(s) to ask?
- Be aware that the question you ask may influence the answer you get. The terms you use may be interpreted differently by listeners, e.g., what does "quality" or "naturalness" actually mean?
□ Which data to use for testing?
- Factor out aspects that affect the evaluation but are unrelated to the research question studied.
□ Is the evaluation material unbiased and free of training data?
□ Is a reference needed?
- Consider giving a reference or adding training material, particularly for intonation evaluation. Also consider the case for including other anchors.
CHECKLIST - 2 [WESTER ET AL., (2015)]
□ What type of listeners?
- Native vs. non-native? Speech experts vs. naive listeners? Age, gender, hearing impairments? Different listener groups can lead to different results.
□ How many listeners to use?
□ How many data points are needed?
□ Is the task suitable for human listeners?
- Take into consideration listener boredom, fatigue, and memory constraints, as well as cognitive load.
□ Can you use crowd-sourcing?
- The biggest concern here is how to ensure the quality of the test-takers.
□ How is the experiment going to be conducted?
- With headphones or speakers, over the web or in a listening booth?
SOME BIASES (EX: [CLARK ET AL., (2019)])
Background
Analyzes how to evaluate long utterances in an ACR-based test
Some results
THE BLIZZARD CHALLENGE
THE BLIZZARD CHALLENGE
Website: http://festvox.org/blizzard
When?
- every year since 2005
Who (participants)?
- universities
- some companies
Which kinds of systems?
- parametric
- unit selection
- hybrid
Philosophy
- focus on the analysis and the exchange rather than pure rating!
- results are made anonymous (however, by reading the different papers you can rebuild the results)
WHAT NOW?
CURRENT SITUATION
Strong need for new protocols ([Wagner et al., (2019)])
MOS and MUSHRA are not refined enough!
What does a preference or a score actually mean?
Get more precise feedback; qualify the speech
SUBJECTIVE EVALUATION - THE NEW WAYS
Behavioural
Task-focused measures (reaction time, completion, . . . ) [Wagner and Betz, (2017)]
Physiological
Pupillometry [Govender and King, (2018)] (illustration: Peter Lamb/123RF)
- be careful: [Winn et al., (2018)]
EEG [Parmonangan et al., (2019)] (illustration: [Siuly et al., (2016)])
- be careful: [Belardinelli et al., (2019)]
OBJECTIVE EVALUATION - THE BIG MISCONCEPTION
For more details, see [Wagner et al., (2019)]
Subjective evaluation seems way better
More robust
A human is in the loop
Key problem(s)
It is expensive
What do we learn about the signal?
Objective evaluation - 2 goals:
1. Pointing out differences
2. Classifying systems
TAKE HOME MESSAGES [6/6]
WHAT TO REMEMBER
Speech synthesis ≠ easy problem
A lot of human effort
A lot of computer effort
Different solutions for different problems
Parametric synthesis: handcrafted + database
Control/speed vs. quality
THE CURRENT STATE
The DNN (r)evolution
Everything is moving to DNN-based architectures
Definitely a jump in quality (but how much, and why?)
Don’t forget the "user"
CHI: GAFA (obviously!), startups
Blind people: using diphone synthesis (why?)
A lot of others: speech researchers/scientists, entertainment, . . .
SOME SENSITIVE POINTS
A big potential danger
Spoofing (see the ASVspoof challenge [Wu et al., (2015)])
Black box vs control
DNN: we don't understand what the system is doing
And when it fails?
Environmental issues [Strubell et al., (2019)]
Evaluation
Important issue
Information vs. marketing
SOME DIRECTIONS
Expressive speech synthesis
Long sentences remain a challenge
Lots of potential applications
Security / sociology
Spoofing Anonymization / Gender neutrality
Evaluation / control
Tools to test hypotheses
Need to qualify the synthesis
A NICE RESOURCE
Synsig
https://www.synsig.org/index.php/Main_Page
- Prof. Simon King's website
http://www.speech.zone/
BIBLIOGRAPHY
Al-Radhi, Mohammed Salah et al. (2018). "A Continuous Vocoder Using Sinusoidal Model for Statistical Parametric Speech Synthesis". In: International Conference on Speech and Computer. Springer, pp. 11–20.
Belardinelli, Paolo et al. (2019). "Reproducibility in TMS–EEG studies: A call for data sharing, standard procedures and effective experimental control". In: Brain Stimulation 12.3, pp. 787–790.
Benoit, Christian (1990). "An intelligibility test using semantically unpredictable sentences: towards the quantification of linguistic complexity". In: Speech Communication 9.4, pp. 293–304.
Clark, Rob et al. (2019). "Evaluating Long-form Text-to-Speech: Comparing the Ratings of Sentences and Paragraphs". In: Proc. 10th ISCA Speech Synthesis Workshop, pp. 99–104.
Govender, Avashna and Simon King (2018). "Measuring the Cognitive Load of Synthetic Speech Using a Dual Task Paradigm". In: Proc. Interspeech, pp. 2843–2847.
ITU-R (2001). Recommendation BS.1534-1: Method for the subjective assessment of intermediate sound quality (MUSHRA). International Telecommunication Union, Geneva.
ITU-T (1996). P.800: Methods for objective and subjective assessment of quality. Tech. rep.
Parmonangan, Ivan Halim et al. (2019). "Speech Quality Evaluation of Synthesized Japanese Speech Using EEG". In: Proc. Interspeech, pp. 1228–1232.
Siuly, Siuly et al. (2016). "Electroencephalogram (EEG) and Its Background". In: EEG Signal Analysis and Classification: Techniques and Applications. Cham: Springer International Publishing, pp. 3–21.
Strubell, Emma et al. (2019). "Energy and Policy Considerations for Deep Learning in NLP". In: Proc. 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy, pp. 3645–3650.
Wagner, Petra and Simon Betz (2017). "Speech Synthesis Evaluation – Realizing a Social Turn". In: Tagungsband Elektronische Sprachsignalverarbeitung (ESSV). Saarbrücken, pp. 167–172.
Wagner, Petra et al. (2019). "Speech Synthesis Evaluation — State-of-the-Art Assessment and Suggestion for a Novel Research Program". In: Proc. 10th ISCA Speech Synthesis Workshop, pp. 105–110.
Wester, Mirjam et al. (2015). "Are we using enough listeners? No! An empirically-supported critique of Interspeech 2014 TTS evaluations". In: Proc. Interspeech, pp. 3476–3480.
Winn, Matthew B. et al. (2018). "Best Practices and Advice for Using Pupillometry to Measure Listening". In: Trends in Hearing 22.
Wu, Zhizheng et al. (2015). "ASVspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge". In: Proc. Interspeech.