Deep learning for speech synthesis


  1. Deep learning for speech synthesis: the good news, the bad news, and the fake news. Scott Stevenson, scott@faculty.ai

  2. The fake news


  5. The effect of “hot mic” incidents
      ● “Hot mic” incidents can bring huge negative publicity; this has been demonstrated on multiple occasions
      ● Incidents can change voting intentions and the outcome of elections; politicians are particularly at risk

  6. “Ugh, everything! She’s just a sort of bigoted woman that said she used to be Labour. I mean, it’s just ridiculous.”

  7. The bad news

  8.  ● If adversaries can generate realistic audio, they can fabricate “hot mic” recordings
      ● Traditionally this requires domain expertise
      ● Modern deep learning makes it eminently possible
      ● We can’t stop the technology, but we can inoculate people against it

  9. The synthesis pipeline: text → Frontend → linguistic representation → Backend → audio
      (the linguistic representation shown as phones: L IH NG G W IH S T IH K . R EH P R AH Z EH N T EY SH AH N .)

  10. Frontend

  11. Frontend: tokenisation and normalisation
      “IBM was founded in 1911” → “i b m was founded in nineteen eleven”
      “Apple is valued at $1 trillion” → “apple is valued at one trillion dollars”
      “He lives on St Paul’s St.” → “he lives on saint paul’s street”
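A minimal sketch of such a normalisation pass, assuming a small hand-written rule set (the normalise and expand_year functions here are illustrative only, not the frontend of any particular system):

      import re

      ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
              "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
              "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
      TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
              "eighty", "ninety"]

      def two_digits(n):
          # Spell out 0-99 in words.
          if n < 20:
              return ONES[n]
          return TENS[n // 10] + ("" if n % 10 == 0 else " " + ONES[n % 10])

      def expand_year(match):
          # Years are read as two pairs: 1911 -> "nineteen eleven", 1905 -> "nineteen oh five".
          hi, lo = divmod(int(match.group()), 100)
          if lo == 0:
              return two_digits(hi) + " hundred"
          return two_digits(hi) + " " + ("oh " + ONES[lo] if lo < 10 else two_digits(lo))

      def normalise(text):
          text = text.lower()
          text = re.sub(r"\b(1[0-9]{3}|20[0-9]{2})\b", expand_year, text)  # spell out years
          text = re.sub(r"\bibm\b", "i b m", text)                         # spell out known acronyms
          return text

      print(normalise("IBM was founded in 1911"))   # -> "i b m was founded in nineteen eleven"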

  12. Frontend: phonetic transcription
      “apple is valued at one trillion dollars” →
      AE P AH L . IH Z . V AE L Y UW D . AE T . W AH N . T R IH L Y AH N . D AA L ER Z .
      The CMU Pronouncing Dictionary: http://www.speech.cs.cmu.edu/cgi-bin/cmudict
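The lookup itself can be scripted against the CMU dictionary, for example through NLTK's copy of it. A sketch: words missing from the dictionary would need a separate grapheme-to-phoneme model, and homographs such as “project” need context to pick the right entry.

      from nltk.corpus import cmudict   # requires nltk.download("cmudict") once

      pronunciations = cmudict.dict()   # word -> list of ARPAbet transcriptions

      def transcribe(sentence):
          phones = []
          for word in sentence.lower().split():
              arpabet = pronunciations[word][0]            # take the first listed pronunciation
              # Strip stress digits (AE1 -> AE) to match the notation above.
              phones.append(" ".join(p.rstrip("012") for p in arpabet))
          return " . ".join(phones) + " ."

      print(transcribe("apple is valued at one trillion dollars"))
      # AE P AH L . IH Z . V AE L Y UW D . AE T . W AH N . T R IH L Y AH N . D AA L ER Z .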

  13. Backend

  14. Backend: concatenative
      ● Synthesise the waveform from the linguistic representation
      ● Most commonly concatenative systems
      ● Prerecorded database of audio samples (“units”)
      ● Typically 10 ms to 1 s long
      ● Picks the best units to concatenate (see the unit-selection sketch below)
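The unit-selection sketch referenced above: a toy greedy version, assuming hypothetical target_cost and join_cost functions and a database mapping each phone to its recorded units. Real systems search the whole unit lattice, typically with a Viterbi pass, rather than this left-to-right pass.

      def select_units(target_phones, database, target_cost, join_cost):
          """Greedily pick one recorded unit per target phone.

          Each choice trades off how well the unit matches the target (target cost)
          against how smoothly it joins the previously chosen unit (join cost).
          """
          chosen = []
          for phone in target_phones:
              prev = chosen[-1] if chosen else None
              chosen.append(min(
                  database[phone],    # all recorded units available for this phone
                  key=lambda unit: target_cost(phone, unit)
                                   + (join_cost(prev, unit) if prev else 0.0),
              ))
          return chosen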

  15. Problems with concatenative systems
      ● Require a large database of high-quality recordings
      ● Can’t change speaker or emotion without a new database
      ● High intelligibility and naturalness
      ● Distinguishable by prosody (intonation, tone, stress, rhythm)
          ○ “My latest project is to learn how to project my voice”: two pronunciations of “project”
          ○ Liaison in French: a final consonant is no longer silent if the following word begins with a vowel

  16. Backend: parametric
      ● Don’t use pre-recorded units
      ● A mathematical model contains the information needed to synthesise speech
      ● Speaker and emotion are stored in the parameters
      ● Speech content is controlled by the input
      ● Model outputs are passed to a vocoder
      ● Less natural than concatenative systems because of DSP artefacts


  19. WaveNet (arXiv:1609.03499)
      ● A change of paradigm for parametric speech synthesis
      ● Don’t feed model output to a vocoder to generate the waveform
      ● Instead, sample the waveform directly from the neural network (a sampling sketch follows)
      ● Sample at ≥16 kHz to generate audio
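The sampling sketch referenced above is a hedged illustration of what “sample the waveform directly” means in practice, assuming a trained model that returns logits over 256 mu-law quantisation levels; the interface is a placeholder, not the paper's exact code.

      import torch

      @torch.no_grad()
      def sample_waveform(model, seconds=1.0, sample_rate=16_000, receptive_field=1024):
          # Start from a silent seed, then draw one sample at a time from the
          # network's distribution over the next quantised amplitude and feed it back in.
          audio = torch.zeros(1, 1, receptive_field, dtype=torch.long)
          for _ in range(int(seconds * sample_rate)):
              logits = model(audio[:, :, -receptive_field:])        # (1, 256, T)
              nxt = torch.distributions.Categorical(logits=logits[:, :, -1]).sample()
              audio = torch.cat([audio, nxt.view(1, 1, 1)], dim=-1)
          return audio[:, :, receptive_field:]                      # drop the silent seed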

  20. Causal convolutions (diagram: stacked convolutions over the time axis)

  21. Causal convolutions
      ● Problem: causal convolutions require huge depth to give a sufficiently large receptive field for good prosody
      ● Such depth is computationally infeasible to train
      ● The chosen solution is to dilate the convolutions
      ● Skip input values at an interval to increase the receptive field
      ● The receptive field grows exponentially with depth

  22. Dilated causal convolutions (diagram: dilations doubling per layer over the time axis)
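A minimal PyTorch sketch of a stack of dilated causal convolutions, assuming kernel size 2 and dilations doubling from 1 up to 512 as in the paper (the module and variable names are illustrative, not the paper's):

      import torch.nn as nn
      import torch.nn.functional as F

      class DilatedCausalConv1d(nn.Module):
          """1-D convolution that only sees past samples, with a dilation factor."""
          def __init__(self, channels, kernel_size=2, dilation=1):
              super().__init__()
              self.pad = (kernel_size - 1) * dilation   # left-pad so the output stays causal
              self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

          def forward(self, x):                         # x: (batch, channels, time)
              return self.conv(F.pad(x, (self.pad, 0)))

      # Doubling the dilation at every layer makes the receptive field grow
      # exponentially with depth: ten layers of kernel size 2 already see 1024 samples.
      layers = nn.Sequential(*[DilatedCausalConv1d(32, dilation=2 ** i) for i in range(10)])
      receptive_field = 1 + sum(2 ** i for i in range(10))   # = 1024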

  23. Activation function (arXiv:1606.05328)
      ● Use the gated activation taken from PixelCNN: z = tanh(W_f ∗ x) ⊙ σ(W_g ∗ x), where the tanh branch is the “filter” and the sigmoid branch is the “gate”
      ● Empirical choice: performs better than a ReLU activation
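A sketch of that gated unit in PyTorch, with separate filter and gate convolutions (the layer names here are illustrative):

      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      class GatedActivation(nn.Module):
          """z = tanh(W_f * x) ⊙ σ(W_g * x), the PixelCNN-style gated unit."""
          def __init__(self, channels, kernel_size=2, dilation=1):
              super().__init__()
              self.pad = (kernel_size - 1) * dilation
              self.filter = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
              self.gate = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

          def forward(self, x):
              x = F.pad(x, (self.pad, 0))               # keep the convolution causal
              return torch.tanh(self.filter(x)) * torch.sigmoid(self.gate(x))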

  24. Activation function with conditioning (arXiv:1609.03499)
      ● Need to condition locally on the input text sequence
      ● Have a second time series h (i.e. from the linguistic frontend)
      ● Learned upsampling y = f(h) to the same frequency as x
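Local conditioning extends the gated unit above: the upsampled features y = f(h) enter both the filter and the gate through 1x1 convolutions, z = tanh(W_f ∗ x + V_f ∗ y) ⊙ σ(W_g ∗ x + V_g ∗ y). A sketch, again with illustrative names, assuming y has already been upsampled to the audio rate:

      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      class ConditionedGatedActivation(nn.Module):
          def __init__(self, channels, cond_channels, kernel_size=2, dilation=1):
              super().__init__()
              self.pad = (kernel_size - 1) * dilation
              self.filter_x = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
              self.gate_x = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
              self.filter_y = nn.Conv1d(cond_channels, channels, 1)   # 1x1 projections of the
              self.gate_y = nn.Conv1d(cond_channels, channels, 1)     # conditioning features

          def forward(self, x, y):                      # y: upsampled linguistic features
              x = F.pad(x, (self.pad, 0))
              return (torch.tanh(self.filter_x(x) + self.filter_y(y))
                      * torch.sigmoid(self.gate_x(x) + self.gate_y(y)))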

  25. WaveNet limitations
      ● WaveNet can generate very human-sounding waveforms
      ● But how do we tell the WaveNet what to say?
      ● It still requires an extensive feature-engineering frontend
      ● That needs time and linguistic expertise, and is brittle
      ● How do we improve on the conventional frontend?

  26. Deep learning


  28. Tacotron 2

  29. “Our model achieves a mean opinion score (MOS) of 4.53, comparable to a MOS of 4.58 for professionally recorded speech.”


  31. The good news

  32. Classification

  33. Generative adversarial networks: via backpropagation, the Generator learns to produce better audio, while the Discriminator learns to better distinguish synthetic audio from real.
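A minimal sketch of one GAN training step, with placeholder generator and discriminator modules; the slide does not specify architectures, so the shapes and latent dimension below are assumptions.

      import torch
      import torch.nn as nn

      def gan_step(generator, discriminator, real_audio, opt_g, opt_d, latent_dim=100):
          bce = nn.BCEWithLogitsLoss()
          batch = real_audio.size(0)
          real_label = torch.ones(batch, 1)
          fake_label = torch.zeros(batch, 1)

          # Discriminator: learn to tell real recordings from synthetic ones.
          fake_audio = generator(torch.randn(batch, latent_dim)).detach()
          d_loss = (bce(discriminator(real_audio), real_label)
                    + bce(discriminator(fake_audio), fake_label))
          opt_d.zero_grad(); d_loss.backward(); opt_d.step()

          # Generator: backpropagate through the discriminator's judgement to
          # learn to produce audio that the discriminator scores as real.
          g_loss = bce(discriminator(generator(torch.randn(batch, latent_dim))), real_label)
          opt_g.zero_grad(); g_loss.backward(); opt_g.step()
          return d_loss.item(), g_loss.item()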


  35. We’re hiring! We are hiring data scientists and machine learning engineers at all levels. If you’re interested in finding out more about Faculty and our work, get in touch: scott@faculty.ai
