Measuring the Perceptual Effects of Speech Synthesis Modelling Assumptions


  1. Measuring the Perceptual Effects of Speech Synthesis Modelling Assumptions Gustav Eje Henter, Thomas Merritt, Matt Shannon, Catherine Mayo, Simon King 1 of 29

  4. Summary “Hear the perceptual effects of modelling assumptions in statistical speech synthesis” 1. Through manipulating repeated natural speech 2. Identify which assumptions limit synthesiser naturalness 2 of 29

  5. Overview 1. Background 2. Methodology 3. Experiments 4. Conclusions and outlook 3 of 29

  6. Naturalness in speech synthesis Output naturalness depends on many factors: • Text processing • Speech parameter representation (vocoder etc.) • Probabilistic models • Parameter generation method 4 of 29

  10. Modelling assumptions Acoustic models make many assumptions: • High-level assumptions ◦ Different parameter streams are conditionally independent ◦ Filter parameter trajectories are conditionally independent • Low-level assumptions ◦ A particular decision tree partitioning of linguistic contexts ◦ Leaf node distributions are Gaussian Assumption adequacy affects output naturalness 5 of 29

  12. Questions 1. Which high-level assumptions hurt naturalness? 2. How much might we gain if we could remove these assumptions? → Where should we direct our improvement efforts? 6 of 29

  15. Traditional fault-finding Investigate naturalness through trial-and-error: 1. Select an assumption and modify it 2. Compare output naturalness before and after Problems: • Impressions are coloured by other imperfections ◦ Low-level assumptions ◦ Estimation errors • Does not compare the relative severity of different assumptions 7 of 29

  18. Our insight • Natural speech is a sample from the true acoustic model • By manipulating repeated natural speech we can simulate output from ◦ highly accurate models that incorporate only certain high-level modelling assumptions and no low-level assumptions at all ◦ with a particular parameter representation ◦ and a particular output generation method 8 of 29

  20. Why is this cool? Nobody knows what these “nearly perfect” models are, yet we can listen to their output! • Compare naturalness degradations due to different high-level assumptions in an otherwise perfect model • Identify key naturalness bottlenecks in speech synthesis 9 of 29

  21. Overview 1. Background 2. Methodology 3. Experiments 4. Conclusions and outlook 10 of 29

  23. Repeated speech Even when controlling for context, the same text can be realised acoustically in many different ways “Rice is often served in round bowls” 11 of 29

  28. REHASP 0.5 corpus • “REpeated HArvard Sentence Prompts” • Female British English talker “Lucy” • 30 Harvard sentence prompts • Each read aloud 40 times ◦ Presented in random order • Recorded at 16 bit, 96 kHz • Publicly available under a permissive license ◦ datashare.is.ed.ac.uk/handle/10283/561 12 of 29

  29. In pictures 0. Start with natural speech repetitions: 13 of 29

  30. In pictures 1. Extract parameters: [figure: Repetition 1] 13 of 29

  32. Speech representation Standard parametric speech representation used for experiments: • 16 kHz operating point • Matlab STRAIGHT for parameter extraction • 46-dimensional parameter vector with three streams: ◦ 40 MCEPs (0–39), representing filter coefficients ◦ Log-F0 ◦ 5 band aperiodicities (BAPs) • 5 ms frame shift 14 of 29
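
The stream layout on this slide can be illustrated with a small sketch. It assumes the 46 dimensions are ordered as MCEPs, then log-F0, then BAPs; the slide gives the stream sizes but not their order, and the names `STREAMS`, `split_streams` and `duration_seconds` are hypothetical.

```python
import numpy as np

# Stream sizes are from the slide: 40 MCEPs + 1 log-F0 + 5 BAPs = 46 dimensions.
# The ordering of the streams inside the vector is an assumption for illustration.
STREAMS = {
    "mcep": slice(0, 40),   # filter coefficients (MCEPs 0-39)
    "lf0": slice(40, 41),   # log-F0
    "bap": slice(41, 46),   # band aperiodicities
}
FRAME_SHIFT_S = 0.005  # 5 ms frame shift, as stated on the slide


def split_streams(frames):
    """Split a (T, 46) parameter matrix into its three streams."""
    frames = np.asarray(frames)
    if frames.ndim != 2 or frames.shape[1] != 46:
        raise ValueError("expected an array of shape (T, 46)")
    return {name: frames[:, idx] for name, idx in STREAMS.items()}


def duration_seconds(frames):
    """Duration covered by the frames at a 5 ms frame shift."""
    return len(frames) * FRAME_SHIFT_S


# Hypothetical utterance of 400 frames filled with random numbers
frames = np.random.default_rng(0).normal(size=(400, 46))
streams = split_streams(frames)
```

At a 5 ms frame shift, 400 frames correspond to roughly two seconds of speech.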

  33. In pictures 1.b. Resynthesise (baseline “V”): [figure: Repetition 1] 15 of 29

  36. In pictures 1. Extract parameters: [figure: Repetitions 1, 2 and 3] 15 of 29

  39. Match timings 2.a. Match frames: [figure: Repetitions 1, 2 and 3] 16 of 29

  43. Match timings 2.b. Warp timings: 16 of 29
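
The slides do not say which algorithm matches the frames. One common choice for this kind of alignment is dynamic time warping (DTW), so the sketch below uses it purely as an assumption; `dtw_path` and `warp_to_reference` are hypothetical helper names, and the Euclidean frame distance is likewise an illustrative choice.

```python
import numpy as np


def dtw_path(ref, other):
    """Return a minimum-cost DTW alignment path between two (T, D) sequences."""
    ref = np.asarray(ref, dtype=float)
    other = np.asarray(other, dtype=float)
    n, m = len(ref), len(other)
    # Pairwise Euclidean distances between all frame pairs, shape (n, m)
    dist = np.linalg.norm(ref[:, None, :] - other[None, :, :], axis=2)
    # Accumulated cost with a one-cell border of infinities
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]
            )
    # Backtrack from the end of both sequences to the start
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin((cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return path


def warp_to_reference(ref, other):
    """Warp `other` onto the timeline of `ref`, one frame per reference frame."""
    other = np.asarray(other, dtype=float)
    matches = {}
    for i, j in dtw_path(ref, other):
        matches[i] = j  # keep the last frame of `other` matched to reference frame i
    return other[[matches[i] for i in range(len(ref))]]
```

After warping, every repetition has the same number of frames as the reference, which is what allows parameters from different repetitions to be combined frame by frame.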

  52. Match timings 2.c. Resynthesise (baseline “D”): 16 of 29

  53. Match timings 2.d. Remove reference: 16 of 29

  56. Match timings We now have “LEGO pieces” of aligned repetitions 16 of 29

  59. Create chimeric speech 3.a. Combine parameters from independent repetitions: [figure: Filter 1 and Source 1; Filter 3 and Source 3] 17 of 29

  61. Create chimeric speech 3.a. Combine parameters from independent repetitions: [figure: Filter 1 with Source 3] 17 of 29
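
A minimal sketch of this combination step, assuming both repetitions are already time-aligned (T, 46) matrices. Treating the 40 MCEPs as the "filter" and the remaining log-F0 and BAP dimensions as the "source" is an assumption, as is the column order; `make_chimera` is a hypothetical name.

```python
import numpy as np

# Assumed column layout: 40 MCEPs first, then log-F0 and 5 BAPs.
FILTER = slice(0, 40)    # MCEPs, the "filter" part
SOURCE = slice(40, 46)   # log-F0 plus BAPs, treated here as the "source" part


def make_chimera(filter_rep, source_rep):
    """Take filter parameters from one repetition and source parameters from another."""
    filter_rep = np.asarray(filter_rep, dtype=float)
    source_rep = np.asarray(source_rep, dtype=float)
    if filter_rep.shape != source_rep.shape:
        raise ValueError("repetitions must be time-aligned and have the same shape")
    chimera = np.empty_like(filter_rep)
    chimera[:, FILTER] = filter_rep[:, FILTER]
    chimera[:, SOURCE] = source_rep[:, SOURCE]
    return chimera
```

Calling `make_chimera(rep1, rep3)` would give the "Filter 1 with Source 3" combination from the figure; the result would still need to be passed through the vocoder to be heard.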
