SLIDE 1

SPEECH SYNTHESIS EVALUATION

SÉBASTIEN LE MAGUER

ADAPT CENTRE, SIGMEDIA LAB, EE ENGINEERING, TRINITY COLLEGE DUBLIN 11-07-2019

SLIDE 2

LET’S RECAPITULATE [2/2]

SLIDE 3

WHAT WE HEAR NOWADAYS

Objectively

  • WaveNet was a game changer
  • Tacotron: easy to use if you have enough data

What you may read

"Human-like", "High-fidelity", "Highly Natural", . . .

SLIDE 4

KEY QUESTIONS/PROBLEMS

AI Hype

  • https://www.economist.com/technology-quarterly/2020/06/11/an-understanding-of-ais-limitations-is-starting-to-sink-in
  • Environmental issues: [Strubell et al., (2019)] and follow-up work

Problematic question

Is the quality really that good?

Fundamental questions

Did we solve anything? If yes, what did we solve?

SLIDE 5

LET’S GET STARTED

SLIDE 6

WHAT IS EVALUATION

SLIDE 7

WHAT IS EVALUATION

The ideal

Being able to describe in detail what a system brings compared to other ones

In practice

Classify/order the systems based on their synthesized speech

SLIDE 8

WHAT TO EVALUATE?

  • acoustic prosody: temporal structure, tonal structure, amplitude profile
  • symbolic prosody: syllabic stress, word accent, sentence mode
  • pronunciation of words, also in sentence context
  • phrasing, rhythm
  • voice quality: inherent or introduced by signal processing?
  • discontinuities in unit concatenation
  • . . .

SLIDE 9

EVALUATION AXES

  • Intelligibility
  • Similarity
  • Naturalness

SLIDE 10

WHERE DOES IT TAKE PLACE?

[Pipeline diagram: offline training stage (text corpus → NLP → linguistic descriptions / acoustic parameters → models) and online generation stage (text → NLP → linguistic description → rendering → acoustic parameters → signal), with objective and subjective evaluation points marked along the pipeline.]

SLIDE 11

OBJECTIVE EVALUATION

SLIDE 12

OBJECTIVE EVALUATION - THE METRICS

Which axes

  • Intelligibility: not commonly used
  • Similarity: assessment/validation

The main metrics

  • Spectrum: MCD, RMSE, Euclidean distances
  • F0: RMSE (Hz/cent), V/UV ratio, LL-ratio
  • BAP: RMSE
  • Duration: RMSE, syllable/phoneme rate
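
To make these metrics concrete, here is a minimal sketch (not from the talk) of two of them, assuming you already have time-aligned mel-cepstral coefficients and F0 tracks extracted from the natural reference and the synthesized utterance:

```python
import numpy as np

def mel_cepstral_distortion(mcep_ref, mcep_syn):
    """Mean frame-wise Mel-Cepstral Distortion (MCD) in dB.

    mcep_ref, mcep_syn: (n_frames, n_coeffs) arrays of mel-cepstral
    coefficients, already time-aligned (e.g. by DTW) and usually with
    the 0th (energy) coefficient excluded.
    """
    diff = mcep_ref - mcep_syn
    # 10 / ln(10) * sqrt(2) is the constant of the usual MCD definition
    const = 10.0 / np.log(10.0) * np.sqrt(2.0)
    return float(np.mean(const * np.sqrt(np.sum(diff ** 2, axis=1))))

def f0_rmse_cent(f0_ref, f0_syn):
    """F0 RMSE in cents, computed only on frames voiced in both tracks."""
    voiced = (f0_ref > 0) & (f0_syn > 0)
    cents = 1200.0 * np.log2(f0_syn[voiced] / f0_ref[voiced])
    return float(np.sqrt(np.mean(cents ** 2)))
```
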

SLIDE 13

SUBJECTIVE EVALUATION

SLIDE 14

SUBJECTIVE EVALUATION - INTRODUCTION

Subjective evaluations are human focused. . . so expensive!

They should be really carefully prepared:

  • you won't be able to repeat the test if you mess up the preparation
  • the analysis of the results depends a lot on the preparation
  • be careful about the question asked and the targeted listeners (see checklist later!)

Generally at least 3 systems are involved:

  • the original voice
  • a reference (anchor) system
  • the analyzed system

SLIDE 15

INTELLIGIBILITY TEST

Semantically Unpredictable Sentences (SUS)

  • Unpredictable ⇒ forces the listener to "decipher" the message
  • Syntax is correct
  • Example: "A table eat the doctor"

Protocol guideline

Each step:

  • 1. The listener listens to a SUS
  • 2. They write down what they heard (a joker character for words not heard)
  • 3. A distance is computed between what was typed and the original sentence

Score = Word Error Rate (or, less commonly, Phone Error Rate)
A nice paper: [Benoit, (1990)]
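
As an illustration of the scoring step (not from the talk), a minimal Word Error Rate implementation over a reference SUS and the listener's transcript; joker/unheard tokens are simply treated as ordinary words here:

```python
def word_error_rate(reference, hypothesis):
    """WER between a reference SUS and what the listener typed.

    Standard Levenshtein (edit) distance over words, normalised by the
    number of reference words; substitutions, insertions and deletions
    all cost 1.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# e.g. word_error_rate("a table eat the doctor", "a fable * the doctor") == 0.4
```
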

SLIDE 16

THE STANDARD PROTOCOLS FOR SUBJECTIVE EVALUATION

SLIDE 17

SCORING METHODOLOGIES

The ACR protocol [ITU-T, (1996)]

  • Absolute Category Rating (ACR) ⇒ Mean Opinion Score (MOS)
  • Scores from 1 (bad) to 5 (excellent)

Key points

  • Systems and utterances are randomized (Latin-square design)
  • The question asked is going to condition the listener ⇒ caution!
  • Major problem: scores are "flattened"
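
Purely for illustration (the helper names are mine, not from the talk), a tiny sketch of the two mechanics above: averaging ACR ratings into a per-system MOS, and a cyclic Latin square for balancing which listener group hears which system for which set of utterances:

```python
import numpy as np

def mos_per_system(ratings):
    """Mean Opinion Score per system from ACR ratings (1..5).

    ratings: dict mapping system name -> list of individual scores.
    """
    return {system: float(np.mean(scores)) for system, scores in ratings.items()}

def latin_square(n):
    """Cyclic n x n Latin square: entry (i, j) = (i + j) % n.

    Rows can index listener groups and columns utterance groups, so that
    every system (value) appears exactly once per row and per column.
    """
    return [[(i + j) % n for j in range(n)] for i in range(n)]

# e.g. latin_square(3) -> [[0, 1, 2], [1, 2, 0], [2, 0, 1]]
```
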

SLIDE 18

ACR - INTERFACE

SLIDE 19

PREFERENCE-BASED METHODOLOGIES

AB(X) test

  • 2 samples (A) and (B) are presented
  • 3 choices: A, B, and no preference
  • ABX: a fixed reference X is also presented

Key points

  • Stricter than ACR ⇒ results are more significant
  • "No preference" can be removed ⇒ post-processing analysis required!
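
One common way to run the post-processing analysis mentioned above is a two-sided binomial (sign) test on the preference counts; how to treat the "no preference" votes is a design choice, and the sketch below simply discards them (an assumption, not part of the protocol):

```python
from scipy import stats

def ab_preference_test(n_prefer_a, n_prefer_b, n_no_pref=0, alpha=0.05):
    """Two-sided binomial test of A-vs-B preferences against a 50/50 null.

    n_no_pref is accepted but deliberately discarded (one possible choice;
    splitting the ties evenly between A and B is another).
    """
    n = n_prefer_a + n_prefer_b
    result = stats.binomtest(n_prefer_a, n=n, p=0.5, alternative="two-sided")
    return {"p_value": result.pvalue,
            "prefer_a_rate": n_prefer_a / n,
            "significant": result.pvalue < alpha}

# e.g. ab_preference_test(70, 30, n_no_pref=20) reports a significant preference for A
```
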

SLIDE 20

AB - INTERFACE

SLIDE 21

MUSHRA

MUltiple Stimuli with Hidden Reference and Anchor [ITU-R, (2001)]

  • Idea: combining scoring and preference
  • Continuous score from 0 to 100, with labelled steps every 20 points
  • Some constraints:
    • a given reference + the same reference hidden (consistency check)
    • given anchors

Key points

  • Mixes the scoring and preference methodologies
  • But:
    • difficult from the listener's perspective
    • small differences are difficult to interpret
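
A sketch of one possible post-screening step that exploits the hidden reference as a consistency check; the 90-point threshold and 15% failure rate follow a rule from later revisions of the MUSHRA recommendation and should be treated as assumptions here:

```python
def screen_listeners(ratings, reference_system="hidden_ref",
                     threshold=90, max_fail_rate=0.15):
    """Post-screening sketch for MUSHRA results.

    ratings: dict mapping listener -> dict mapping (system, utterance) -> score.
    A listener is rejected when they rate the hidden reference below
    `threshold` on more than `max_fail_rate` of their trials.
    """
    kept = {}
    for listener, scores in ratings.items():
        ref_scores = [s for (system, _utt), s in scores.items()
                      if system == reference_system]
        fails = sum(s < threshold for s in ref_scores)
        if ref_scores and fails / len(ref_scores) <= max_fail_rate:
            kept[listener] = scores
    return kept
```
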

SLIDE 22

MUSHRA - INTERFACE

SLIDE 23

WHAT TO DO WITH THE RESULTS

SLIDE 24

STATISTICAL ANALYSIS

Why?

We are using a sample ⇒ we want to generalize

How: using statistical tests

  • Generally a t-test or a Wilcoxon-based test
  • Generally set α = 0.05
  • Report the confidence interval and the effect size

Important !!!

Be careful and honest with the conclusion
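
A minimal sketch of such an analysis for two systems with paired listening-test scores, using SciPy; which test and which effect-size measure are appropriate depends on your design, so treat this as an illustration rather than a recipe:

```python
import numpy as np
from scipy import stats

def compare_systems(scores_a, scores_b, alpha=0.05):
    """Paired comparison of two systems' per-utterance (or per-listener) scores."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    diff = a - b
    # Parametric and non-parametric paired tests
    _, t_p = stats.ttest_rel(a, b)
    _, w_p = stats.wilcoxon(a, b)
    # Confidence interval on the mean difference
    ci = stats.t.interval(1 - alpha, len(diff) - 1,
                          loc=diff.mean(), scale=stats.sem(diff))
    # Cohen's d on the paired differences as a simple effect size
    d = diff.mean() / diff.std(ddof=1)
    return {"t_p": float(t_p), "wilcoxon_p": float(w_p),
            "ci": ci, "effect_size_d": float(d),
            "significant": t_p < alpha}
```
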

SLIDE 25

A COUNTER EXAMPLE (BLOG POST, PAPER IS BETTER)

Graphic results

SLIDE 26

A COUNTER EXAMPLE (BLOG POST, PAPER IS BETTER)

Graphic results "Explanation"

". . . were obtained in blind tests with human subjects (from over 500 ratings on 100 test sentences). As we can see, WaveNets reduce the gap between the state of the art and human-level performance by over 50% for both US English and Mandarin Chinese."

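For context, "reducing the gap ... by over 50%" is normally computed as the fraction of the baseline-to-human MOS difference covered by the new system; the numbers below are made up purely to illustrate the arithmetic, they are not the actual reported results:

```python
def gap_closed(mos_baseline, mos_new, mos_human):
    """Fraction of the baseline-to-human MOS gap covered by the new system.

    This is the usual reading of claims such as "reduces the gap by over
    50%"; note it says nothing about absolute MOS, listener variance, or
    statistical significance.
    """
    return (mos_new - mos_baseline) / (mos_human - mos_baseline)

# Hypothetical numbers only (not the actual reported scores):
# gap_closed(3.9, 4.2, 4.5) ≈ 0.5, i.e. "50% of the gap closed".
```
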
SLIDE 27

HOW TO INTERPRET RESULTS

Results taken from [Al-Radhi et al., (2018)]

SLIDE 28

VALIDATION

SLIDE 29

SOME PRECAUTIONS

Results have to be reproducible!

Environment setup reproducibility

  • Description of the conditions (speakers/headphones, . . . )

Protocol reproducibility

  • Description of the test
  • Description of the question
  • Description of the corpora
  • Description of the "cognitive aspect" (duration, pauses?, . . . )

Statistical reproducibility

  • Description of the listeners (number, expert vs non-expert, . . . )
  • Statistical analysis (confidence interval, . . . )

SLIDE 30

CHECKLIST - 1 [WESTER ET AL., (2015)]

☐ What test to use?
  • MOS, MUSHRA, preference, intelligibility, and same/different judgments all fit different situations.

☐ Which question(s) to ask?
  • Be aware that the question you ask may influence the answer you get. The terms you use may be interpreted differently by listeners, e.g., what does “quality” or “naturalness” actually mean?

☐ Which data to use for testing?
  • Factor out aspects that affect the evaluation, but which are unrelated to the research question studied.

☐ Is the evaluation material unbiased and free of training data?

☐ Is a reference needed?
  • Consider giving a reference or adding training material, particularly for intonation evaluation. Also consider the case for including other anchors.

SLIDE 31

CHECKLIST - 2 [WESTER ET AL., (2015)]

☐ What type of listeners?
  • Native vs. non-native? Speech experts vs. naive listeners? Age, gender, hearing impairments? Different listener groups can lead to different results.

☐ How many listeners to use?

☐ How many data points are needed?

☐ Is the task suitable for human listeners?
  • Take into consideration listener boredom, fatigue, and memory constraints, as well as cognitive load.

☐ Can you use crowd-sourcing?
  • The biggest concern here is how to ensure the quality of the test-takers.

☐ How is the experiment going to be conducted?
  • With headphones or speakers, over the web or in a listening booth?

SLIDE 32

SOME BIASES (EX: [CLARK ET AL., (2019)])

Background

Analyzes how to evaluate long utterances in an ACR-based test

Some results

SLIDE 33

THE BLIZZARD CHALLENGE

SLIDE 34

THE BLIZZARD CHALLENGE

Website: http://festvox.org/blizzard

When?

  • every year since 2005

Who (participants)?

  • universities
  • some companies

Which kind of systems?

  • Parametric
  • Unit selection
  • Hybrid

Philosophy

  • Focus on the analysis and the exchange rather than pure rating!
  • Results are made anonymous (however, by reading the different papers you can rebuild the results)

SLIDE 35

WHAT’S NOW

SLIDE 36

CURRENT SITUATION

Strong need for new protocols ([Wagner et al., (2019)])

  • MOS, MUSHRA are not refined enough!
  • What does a preference or a score mean?
  • Get more precise feedback, qualify the speech

SLIDE 37

SUBJECTIVE EVALUATION - THE NEW WAYS

Behavioural

Task focused (reaction time, completion, . . . ) [Wagner and Betz, (2017)]

Physiological

  • Pupillometry [Govender and King, (2018)]; be careful: [Winn et al., (2018)]
  • EEG [Parmonangan et al., (2019)] (background figure: [Siuly et al., (2016)]); be careful: [Belardinelli et al., (2019)]

SLIDE 38

OBJECTIVE EVALUATION - THE BIG MISCONCEPTION

For more details, see [Wagner et al., (2019)]

Subjective evaluation seems way better

  • More robust
  • Humans are involved in the loop

Key problem(s)

  • It is expensive
  • What do we learn about the signal?

Objective evaluation - 2 goals:

  • 1. Pointing out differences
  • 2. Classifying systems

SLIDE 39

TAKE HOME MESSAGES [6/6]

SLIDE 40

WHAT TO REMEMBER

Speech synthesis ≠ easy problem

  • A lot of human effort
  • A lot of computer effort

Different solutions for different problems

  • Parametric synthesis: handcrafted + database
  • Control/speed vs quality

SLIDE 41

THE CURRENT STATE

The DNN (r)evolution

  • Everything is moving to DNN-based architectures
  • Definitely a jump in quality (but how much and why?)

Don't forget the "user"

  • CHI: GAFA (obviously!), startups
  • Blind people: using diphone synthesis (why?)
  • A lot of others: speech researchers/scientists, entertainment, . . .

SLIDE 42

SOME SENSITIVE POINTS

A big potential danger

Spoofing (See challenge ASVSpoof [Wu et al., (2015)])

Black box vs control

  • DNN: we don't understand what the system is doing
  • And what happens when it fails?
  • Environmental issues [Strubell et al., (2019)]

Evaluation

  • Important issue
  • Information vs marketing

SLIDE 43

SOME DIRECTIONS

Expressive speech synthesis

  • Long sentences remain a challenge
  • Lots of potential applications

Security / sociology

  • Spoofing
  • Anonymization / gender neutrality

Evaluation / control

  • Tools to test hypotheses
  • Need to qualify the synthesis

SLIDE 44

A NICE RESOURCE

Synsig

  • https://www.synsig.org/index.php/Main_Page

Prof. Simon King's website

  • http://www.speech.zone/

SLIDE 45

BIBLIOGRAPHY [1/1]

SLIDE 46

Belardinelli, Paolo et al. (May 2019). “Reproducibility in TMS–EEG studies: A call for data sharing, standard procedures and effective experimental control”. In: Brain Stimulation: Basic, Translational, and Clinical Research in Neuromodulation 12.3, pp. 787–790.

Benoit, Christian (1990). “An intelligibility test using semantically unpredictable sentences: towards the quantification of linguistic complexity”. In: Speech Communication 9.4, pp. 293–304.

Clark, Rob et al. (2019). “Evaluating Long-form Text-to-Speech: Comparing the Ratings of Sentences and Paragraphs”. In: Proc. 10th ISCA Speech Synthesis Workshop, pp. 99–104.

Govender, Avashna and Simon King (2018). “Measuring the Cognitive Load of Synthetic Speech Using a Dual Task Paradigm”. In: Interspeech, pp. 2843–2847.

ITU-R, Recommendation (2001). “BS.1534-1: Method for the subjective assessment of intermediate sound quality (MUSHRA)”. In: International Telecommunications Union, Geneva.

ITU-T (1996). P.800: Methods for objective and subjective assessment of quality. Tech. rep.

Parmonangan, Ivan Halim et al. (2019). “Speech Quality Evaluation of Synthesized Japanese Speech Using EEG”. In: Proc. Interspeech 2019, pp. 1228–1232.

Al-Radhi, Mohammed Salah et al. (2018). “A Continuous Vocoder Using Sinusoidal Model for Statistical Parametric Speech Synthesis”. In: International Conference on Speech and Computer. Springer, pp. 11–20.

Siuly, Siuly et al. (2016). “Electroencephalogram (EEG) and Its Background”. In: EEG Signal Analysis and Classification: Techniques and Applications. Cham: Springer International Publishing, pp. 3–21.

SLIDE 47

BIBLIOGRAPHY II

Strubell, Emma et al. (July 2019). “Energy and Policy Considerations for Deep Learning in NLP”. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, pp. 3645–3650.

Wagner, Petra and Simon Betz (2017). “Speech Synthesis Evaluation – Realizing a Social Turn”. In: Tagungsband Elektronische Sprachsignalverarbeitung (ESSV). Saarbrücken, pp. 167–172.

Wagner, Petra et al. (2019). “Speech Synthesis Evaluation — State-of-the-Art Assessment and Suggestion for a Novel Research Program”. In: Proc. 10th ISCA Speech Synthesis Workshop, pp. 105–110.

Wester, Mirjam et al. (2015). “Are we using enough listeners? No! An empirically-supported critique of Interspeech 2014 TTS evaluations”. In: Annual Conference of the International Speech Communication Association (Interspeech), pp. 3476–3480.

Winn, Matthew B. et al. (Jan. 2018). “Best Practices and Advice for Using Pupillometry to Measure Listening”. In: Trends Hear. 22.

Wu, Zhizheng et al. (2015). “ASVspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge”. In: Sixteenth Annual Conference of the International Speech Communication Association.