
Measuring the Perceptual Effects of Speech Synthesis Modelling Assumptions

Gustav Eje Henter, Thomas Merritt, Matt Shannon, Catherine Mayo, Simon King


Summary

“Hear the perceptual effects of modelling assumptions in statistical speech synthesis”

  • 1. By manipulating repeated natural speech
  • 2. Identify which assumptions limit synthesiser naturalness


Overview

  • 1. Background
  • 2. Methodology
  • 3. Experiments
  • 4. Conclusions and outlook


Naturalness in speech synthesis

Output naturalness depends on many factors:

  • Text processing
  • Speech parameter representation (vocoder etc.)
  • Probabilistic models
  • Parameter generation method


Modelling assumptions

Acoustic models make many assumptions:

  • High-level assumptions
    • Different parameter streams are conditionally independent
    • Filter parameter trajectories are conditionally independent
  • Low-level assumptions
    • A particular decision tree partitioning of linguistic contexts
    • Leaf node distributions are Gaussian

Assumption adequacy affects output naturalness


Questions

  • 1. Which high-level assumptions hurt naturalness?
  • 2. How much might we gain if we could remove these assumptions?

→ Where should we direct our improvement efforts?


Traditional fault-finding

Investigate naturalness through trial-and-error:

  • 1. Select an assumption and modify it
  • 2. Compare output naturalness before and after

Problems:

  • Impressions are coloured by other imperfections
  • Low-level assumptions
  • Estimation errors
  • Does not compare the relative severity of different assumptions


Our insight

  • Natural speech is a sample from the true acoustic model
  • By manipulating repeated natural speech we can simulate output from
    • highly accurate models
    • only incorporating certain high-level modelling assumptions
    • no low-level assumptions at all
    • with a particular parameter representation
    • and a particular output generation method


Why is this cool?

Nobody knows what these “nearly perfect” models are, yet we can listen to their output!

  • Compare naturalness degradations due to different high-level assumptions in an otherwise perfect model
  • Identify key naturalness bottlenecks in speech synthesis


Overview

  • 1. Background
  • 2. Methodology
  • 3. Experiments
  • 4. Conclusions and outlook

Repeated speech

Even when controlling for context, the same text can be realised acoustically in many different ways:

“Rice is often served in round bowls”

REHASP 0.5 corpus

  • “REpeated HArvard Sentence Prompts”
  • Female British English talker “Lucy”
  • 30 Harvard sentence prompts
  • Each read aloud 40 times
  • Presented in random order
  • Recorded at 16 bit 96 kHz
  • Publicly available under a permissive licence
  • datashare.is.ed.ac.uk/handle/10283/561


In pictures

  • 0. Start with natural speech repetitions
  • 1. Extract parameters

[Figure: parameter tracks extracted from Repetition 1]


Speech representation

Standard parametric speech representation used for experiments:

  • 16 kHz operating point
  • Matlab STRAIGHT for parameter extraction
  • 46-dimensional parameter vector with three streams:
    • 40 MCEPs (0–39), representing filter coefficients
    • log-F0
    • 5 band aperiodicities (BAPs)
  • 5 ms frame shift
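The per-frame layout above can be sketched as a simple array stack. This is a minimal illustration with assumed array names and dummy values, not the authors' extraction code (which used Matlab STRAIGHT):

```python
import numpy as np

# Hypothetical per-frame streams for one utterance of T frames
# (5 ms frame shift), mirroring the representation on this slide.
T = 200                    # 200 frames = 1 second of speech
mcep = np.zeros((T, 40))   # 40 mel-cepstral coefficients (0-39): filter
lf0 = np.zeros((T, 1))     # log-F0: pitch
bap = np.zeros((T, 5))     # 5 band aperiodicities

# Stack the three streams into one 46-dimensional vector per frame.
params = np.hstack([mcep, lf0, bap])
assert params.shape == (T, 46)
```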


In pictures

  • 1.b. Resynthesise (baseline “V”)
  • 1. Extract parameters from the further repetitions (2 and 3)

[Figure: parameter tracks for Repetitions 1, 2, and 3]

Match timings

  • 2.a. Match frames against a reference repetition
  • 2.b. Warp timings onto the reference
  • 2.c. Resynthesise (baseline “D”)
  • 2.d. Remove the reference

We now have “LEGO pieces” of aligned repetitions

[Figure: frame matching and time-warping of repetitions against a reference]
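The frame matching in step 2.a can be sketched with a plain dynamic time warping recursion. The slides do not specify the exact alignment procedure, so the Euclidean frame distance and step pattern below are assumptions for illustration:

```python
import numpy as np

def dtw_path(a, b):
    """Align two feature sequences (frames x dims) with plain DTW
    under a Euclidean frame distance; returns the warping path as a
    list of (frame in a, frame in b) pairs."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],
                                 cost[i, j - 1],
                                 cost[i - 1, j - 1])
    # Backtrack from the end to recover the frame-to-frame matching
    # (simplified: stops once either sequence is exhausted).
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

Aligning every repetition against a common reference in this way, then warping to the reference's timings, yields the matched frames used in steps 2.b–2.d.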


Create chimeric speech

3.a. Combine parameters from independent repetitions (e.g., the filter stream of repetition 1 with the source stream of repetition 3) and resynthesise the chimeric speech (here condition “SF”):

[Figure: filter and source streams combined across repetitions]
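Once repetitions are time-aligned, the stream combination above amounts to a column swap between two aligned parameter matrices. A minimal sketch, assuming the 46-dimensional layout described earlier (40 MCEPs, log-F0, 5 BAPs); the function name is illustrative:

```python
import numpy as np

def make_chimera(filter_rep, source_rep):
    """Combine the filter stream of one aligned repetition with the
    source streams (log-F0 and band aperiodicities) of another.
    Both inputs are (T, 46): 40 MCEPs | 1 log-F0 | 5 BAPs."""
    assert filter_rep.shape == source_rep.shape
    chimera = np.empty_like(filter_rep)
    chimera[:, :40] = filter_rep[:, :40]  # filter: MCEPs from repetition A
    chimera[:, 40:] = source_rep[:, 40:]  # source: log-F0 + BAPs from rep B
    return chimera
```

Because the two repetitions are independent takes of the same sentence, the result simulates a model in which source and filter are conditionally independent but each stream is otherwise perfectly modelled.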


Create mean speech

3.b. Take the mean of all aligned repetitions and resynthesise the mean speech (condition “M”):

[Figure: frame-wise mean computed across repetitions]
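Condition “M” can be sketched as a frame-wise average over the time-aligned parameter tracks. This assumes averaging directly in the parameter domain (plausible here since pitch is already stored as log-F0), which may differ from the authors' exact procedure:

```python
import numpy as np

def mean_speech(aligned_reps):
    """Average a list of time-aligned (T, 46) parameter matrices
    frame by frame to obtain the 'mean speech' parameter track."""
    stack = np.stack(aligned_reps)  # (n_reps, T, 46)
    return stack.mean(axis=0)       # (T, 46)
```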


Interpretation

  • Repeated speech ≈ independent samples from a “perfect” acoustic model
  • Chimeric speech ≈ samples from a model making certain high-level assumptions but no low-level assumptions
  • Mean speech ≈ the mean of a probabilistic model


Overview

  • 1. Background
  • 2. Methodology
  • 3. Experiments
  • 4. Conclusions and outlook


Present investigation

  • Two model assumption classes:
    • 1. Stream independence assumptions
      • 1.1 Source and filter parameters independent
      • 1.2 Filter, pitch, and aperiodicities independent
    • 2. Independence assumptions among filter coefficients
  • Two output generation methods:
    • 1. Random sampling from the probability distribution
    • 2. Mean parameter generation

= 12 conditions (4 baselines)

  • For each of the 30 Harvard sentences


What it sounds like

Sampling-based generation:

  • Database examples: 3, 7, 26, 32
  • Baselines: N, VU, V, D
  • Stream independence: SF, SI
  • Filter coefficient independence: L1, L2, H1, H2, I

Mean-based generation:

  • Averaging: M

(Also available online at homepages.inf.ed.ac.uk/ghenter)


Naturalness test

MUSHRA (MUltiple Stimuli with Hidden Reference and Anchor) test for parallel, fine-grained naturalness assessment


Naturalness results

Box plot of 549 comparisons, rating natural speech at 100:

[Box plot: MUSHRA naturalness scores (0–100) for conditions VU, V, D, SF, SI, I, and M]


Overview

  • 1. Background
  • 2. Methodology
  • 3. Experiments
  • 4. Conclusions and outlook


Conclusions

  • When sampling from models:
    • 1. The source-filter independence assumption reduces naturalness
    • 2. Independence assumptions among filter coefficients further reduce naturalness
  • Using mean-based parameter generation:
    • 1. Better than sampling for poor models
    • 2. Less natural than sampling for accurate models


Limitations

Conclusions not applicable to:

  • Other speech representations
  • Other parameter generation methods
    • E.g., postfiltering, global variance modelling


Future work

  • Record the REHASP 1.0 corpus
  • Expanded investigation:
    • Consider additional assumptions
    • Cover the entire spectrum from natural speech to a full TTS system
    • Consider additional parameter generation methods
    • Effect of different parameter representations

The end

Thank you for listening!