SLIDE 1

Are we using enough listeners? No! An empirically-supported critique of Interspeech 2014 TTS evaluations

Mirjam Wester, Cassia Valentini-Botinhao and Gustav Eje Henter

  • CSTR, University of Edinburgh
SLIDE 2

Introduction

  • Objective measures aren’t good enough at measuring the perceptual quality of synthetic speech.
  • Subjective listening tests remain the gold standard:
  • Mean Opinion Score (MOS) tests
  • Preference tests
  • ABX tests
  • Transcription tasks
  • MUSHRA tests
  • Despite the many listening-test guidelines that exist, contemporary evaluations are often very poor because they do not take these guidelines into account.
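To make concrete what a MOS test computes, here is a minimal Python sketch (stdlib only; the ratings and the function name are hypothetical, not from the talk) that averages 1-5 opinion scores for one system and attaches a bootstrap confidence interval:

```python
import random
import statistics

def mos_with_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Mean Opinion Score with a percentile bootstrap confidence interval."""
    rng = random.Random(seed)
    mos = statistics.mean(scores)
    # Resample the listener scores with replacement and collect the means.
    boot_means = sorted(
        statistics.mean(rng.choices(scores, k=len(scores)))
        for _ in range(n_boot)
    )
    lo = boot_means[int((alpha / 2) * n_boot)]
    hi = boot_means[int((1 - alpha / 2) * n_boot)]
    return mos, (lo, hi)

# Hypothetical 1-5 ratings for one synthesis system
ratings = [4, 3, 5, 4, 4, 3, 2, 4, 5, 3]
mos, (lo, hi) = mos_with_ci(ratings)
print(f"MOS = {mos:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```

With only ten ratings the interval is wide, which is exactly the talk's point about small listener pools.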

SLIDE 3

Our study

Common shortcomings in subjective evaluations from Interspeech 2014

  • Using Blizzard 2013 data we show the importance of:
  • Sufficient participants
  • Sufficient test material
  • A checklist of elements that should be considered when designing a good listening test

slide-4
SLIDE 4-11

Interspeech 2014

  • Number of speech synthesis studies at Interspeech 2014 using a particular number of listeners:

    Number of listeners   Preference test   MOS
    1-10                  10                8
    11-20                 5                 5
    21-30                 -                 1
    31-50                 4                 5
    >50                   3                 3
    Not stated            2                 -
    Total studies         24                22

SLIDE 12

Missing details IS-2014

slide-13
SLIDE 13-18

Missing details IS-2014

  • The demographics of the listeners (native or non-native, age, accent, possible hearing impairments).
  • The language of the synthesised speech.
  • The domain of the sentence material (training and test).
  • The number of test samples (sentences, words, paragraphs).
  • The specific question participants were asked to answer.
  • The listening conditions (headphones or speakers, listening booth or on the web).

SLIDE 19

Blizzard 2013

  • To illustrate the importance of sentence coverage and of the number and type of listeners, we re-analysed the Blizzard 2013 data.
  • The most recent English-language evaluation.
  • Focus on the main task: MOS tests for naturalness and similarity (EH1).

    Listener type                          Number of listeners
    EE (paid / lab / native)               50
    ER (volunteers / not controlled)       92
    ES (speech experts / not controlled)   52
    All                                    194

SLIDE 20

Blizzard data details

  • 11 systems including natural speech
  • 11 listener groups
  • 4-5 listeners per group for EE | 5-10 for ER | 3-5 for ES
  • Each listener scores each system:
  • 5 times for naturalness
  • once for similarity
  • MOS test was used for both naturalness and similarity
SLIDE 21

Re-analysis of Blizzard

  • We analysed progressively larger subsets of the data, measuring:
  • 1. the number of significantly different system pairs
  • 2. the rank correlation between the ranking given by the current data subset and the ranking obtained when considering all participants for the test in question
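The two subset metrics above can be sketched in Python (stdlib only). This is an illustrative reconstruction, not the authors' code: a permutation test stands in for whatever pairwise significance test was actually used, and all function names are hypothetical.

```python
import random
from itertools import combinations
from statistics import mean

def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Two-sided permutation test for a difference in mean score
    between two systems (a stand-in for the paper's pairwise tests)."""
    rng = random.Random(seed)
    observed = abs(mean(x) - mean(y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(x)]) - mean(pooled[len(x):]))
        if diff >= observed:
            hits += 1
    return hits / n_perm

def significant_pairs(scores_by_system, alpha=0.05):
    """Metric 1: count system pairs whose scores differ significantly."""
    return sum(
        permutation_pvalue(scores_by_system[a], scores_by_system[b]) < alpha
        for a, b in combinations(scores_by_system, 2)
    )

def spearman_rho(xs, ys):
    """Metric 2: Spearman rank correlation (no ties assumed, for brevity)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical MOS means for four systems: a listener subset vs. all listeners
subset_means = [3.1, 2.4, 4.0, 3.5]
full_means = [3.0, 2.6, 4.2, 3.4]
print(spearman_rho(subset_means, full_means))  # identical ranking -> 1.0
```

In the re-analysis, these two numbers are tracked as the listener subset grows, which is what the following figures plot.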

SLIDE 22

Participants (I)

  • Overall, the Blizzard similarity tests resulted in fewer significant differences than the naturalness evaluation.

[Figure: number of system pairs found significantly different vs. number of listeners, for the Naturalness and Similarity tests; curves for the EE, ER, ES and ALL listener groups]

SLIDE 23

Participants (II)

  • Naturalness: 30 paid participants (EE) were sufficient for a strong correlation (>0.98).
  • Similarity: the results never quite reach stability.

[Figure: rank correlation with the full-data ranking vs. number of listeners, for the Naturalness and Similarity tests; curves for the EE, ER, ES and ALL listener groups]

SLIDE 24

Participants (III)

  • EE (paid listeners) correlate best with the full-data rankings.
  • ER (volunteers) consistently give low rank correlations and the fewest significant pairs for a given number of listeners.
  • ES (expert listeners) identify a large number of significant differences in naturalness, but their rank correlation with the overall full-data picture was either close to average (naturalness) or the lowest observed (similarity).

[Figures: the significant-pairs and rank-correlation plots repeated, for the Naturalness and Similarity tests (EE, ER, ES, ALL)]

SLIDE 25

Data Coverage (I)

  • Judgments change substantially between listener groups, particularly for the similarity scores.

[Figure: per-group score distributions; panels: Naturalness and Similarity]

SLIDE 26

Data Coverage (II)

  • The big gap between the naturalness and similarity tasks in the previous figures can largely be explained by the difference in the number of scores collected per listener.

[Figure: number of system pairs found significantly different vs. number of datapoints, for the Naturalness and Similarity tests; curves for the EE, ER, ES and ALL listener groups]

SLIDE 27

Summary

  • The Blizzard analyses suggest that at least 30 listeners are needed for reliable results.

  • Each listener should listen to several examples of each system evaluated.
  • 150 judgements per MOS should probably be a minimum.
  • Types of listeners: paid participants, online volunteers and expert listeners
  • paid: above numbers apply
  • online: more data and more listeners
  • experts: their preferences differ from those of the general public
  • Careful thought and experimental design are paramount.
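The numeric rules of thumb above can be captured in a trivial planning check. A sketch (the thresholds come from the slides; the function name and call values are illustrative, and the extra caveats for online volunteers and experts are not modelled):

```python
def meets_minimum(n_listeners, samples_per_listener,
                  min_listeners=30, min_judgements=150):
    """Check a planned MOS test against the talk's rules of thumb:
    at least 30 paid listeners and at least 150 judgements per MOS."""
    judgements = n_listeners * samples_per_listener
    return n_listeners >= min_listeners and judgements >= min_judgements

print(meets_minimum(30, 5))   # 30 listeners x 5 samples = 150 judgements: OK
print(meets_minimum(20, 10))  # 200 judgements, but too few listeners: not OK
```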
SLIDE 28

Conclusion

  • Take home message:
  • Think before you test!
  • Report on the design of your experiment and motivate the choices made
  • See the checklist in the paper for inspiration