

SLIDE 1

Deep learning for speech synthesis

The good news, the bad news, and the fake news

Scott Stevenson

scott@faculty.ai

SLIDE 2

The fake news

SLIDE 3

SLIDE 4

SLIDE 5

The effect of “hot mic” incidents


  • “Hot mic” incidents can bring huge negative publicity
  • Incidents can change voting intentions and the outcome of elections
  • This has been demonstrated on multiple occasions
  • Politicians are particularly at risk

SLIDE 6


“Ugh, everything! She’s just a sort of bigoted woman that said she used to be Labour. I mean, it’s just ridiculous.”

SLIDE 7

The bad news


SLIDE 8
  • If adversaries can generate realistic audio, they can fabricate “hot mic” recordings
  • Traditionally this requires domain expertise
  • Modern deep learning makes it eminently possible
  • We can’t stop the technology, but we can inoculate people against it

SLIDE 9

[Diagram: text → Frontend → linguistic representation → Backend → audio. The linguistic representation is shown in ARPAbet: L IH NG G W IH S T IH K . R EH P R AH Z EH N T EY SH AH N .]

SLIDE 10


Frontend

SLIDE 11

“IBM was founded in 1911”

→ “i b m was founded in nineteen eleven”

“Apple is valued at $1 trillion”

→ “apple is valued at one trillion dollars”

“He lives on St Paul’s St.”

→ “he lives on saint paul’s street”

Frontend: tokenisation and normalisation

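For illustration, here is a minimal sketch of the rule-based normalisation such a frontend performs. The `normalise` helper and its tiny rule tables are hypothetical; a production frontend ships far larger rule sets plus classifiers to resolve ambiguous tokens.

```python
import re

# Toy rule tables; a real frontend has thousands of rules plus models to
# decide, e.g., whether "St." means "saint" or "street" in context.
NUMBER_WORDS = {"1": "one", "1911": "nineteen eleven"}
ABBREV = {"IBM": "i b m"}

def normalise(text: str) -> str:
    """Map written-form text to spoken-form words (toy version)."""
    # Currency: "$1 trillion" -> "one trillion dollars"
    text = re.sub(
        r"\$(\d+) (thousand|million|billion|trillion)",
        lambda m: f"{NUMBER_WORDS[m.group(1)]} {m.group(2)} dollars",
        text,
    )
    # Abbreviations and years by simple lookup
    words = [ABBREV.get(w, NUMBER_WORDS.get(w, w)) for w in text.split()]
    return " ".join(words).lower()

print(normalise("IBM was founded in 1911"))
# -> i b m was founded in nineteen eleven
```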

SLIDE 12

“apple is valued at one trillion dollars” → AE P AH L . IH Z . V AE L Y UW D . AE T . W AH N . T R IH L Y AH N . D AA L ER Z .

Frontend: phonetic transcriptions

The CMU Pronouncing Dictionary http://www.speech.cs.cmu.edu/cgi-bin/cmudict

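A sketch of transcription by dictionary lookup. It assumes NLTK’s packaged copy of CMUdict (the slide cites only the dictionary itself); note that CMUdict phones carry stress digits (AE1, AH0) that the slide omits.

```python
# Grapheme-to-phoneme by lookup in the CMU Pronouncing Dictionary, here via
# NLTK's packaged copy (run nltk.download("cmudict") once beforehand).
from nltk.corpus import cmudict

PRON = cmudict.dict()  # word -> list of possible ARPAbet transcriptions

def transcribe(sentence: str) -> str:
    phones = []
    for word in sentence.lower().split():
        # Take the first listed pronunciation; a real frontend disambiguates
        # homographs such as "project" using part-of-speech context.
        phones.extend(PRON[word][0] + ["."])
    return " ".join(phones)

print(transcribe("apple is valued at one trillion dollars"))
# AE1 P AH0 L . IH1 Z . ... (stress digits included, unlike the slide)
```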

SLIDE 13


Backend

SLIDE 14

Backend: concatenative

  • Synthesise waveform from linguistic representation
  • Most commonly concatenative systems
  • Prerecorded database of audio samples (“units”)
  • Typically 10 ms to 1 s long
  • Picks best units to concatenate (sketched below)
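To make “picks best units” concrete, here is a toy sketch of unit selection: each candidate unit is scored by how smoothly it joins its neighbour. The `Unit` record and the greedy search are illustrative; real systems run a Viterbi search over combined target and join costs.

```python
from dataclasses import dataclass

@dataclass
class Unit:
    phone: str          # ARPAbet label of this recorded snippet
    audio: list         # waveform samples, 10 ms to 1 s long
    start_pitch: float  # pitch at the unit's boundaries, used for join cost
    end_pitch: float

def select_units(target_phones: list[str], database: list[Unit]) -> list[Unit]:
    """Greedy unit selection: minimise pitch mismatch at each join."""
    chosen: list[Unit] = []
    for phone in target_phones:
        candidates = [u for u in database if u.phone == phone]
        prev = chosen[-1] if chosen else None
        join_cost = lambda u: abs(u.start_pitch - prev.end_pitch) if prev else 0.0
        chosen.append(min(candidates, key=join_cost))
    return chosen

# The output waveform is then just the chosen units' audio, concatenated.
```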

SLIDE 15

Problems with concatenative systems

  • Require large database of high quality recordings
  • Can’t change speaker or emotion without new database
  • High intelligibility and naturalness, yet distinguishable by prosody (intonation, tone, stress, rhythm)

    ○ “My latest project is to learn how to project my voice”: two pronunciations of “project”
    ○ Liaison in French: a final consonant is no longer silent if the following word begins with a vowel

SLIDE 16

Backend: parametric

  • Don’t use pre-recorded units
  • Mathematical model contains the information needed to synthesise speech
  • Speaker and emotion are stored in the model parameters
  • Speech content is controlled by the input
  • Model outputs are passed to a vocoder (toy example below)
  • Less natural than concatenative systems because of DSP artefacts
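A toy stand-in for the vocoder stage: per-frame acoustic parameters (just F0 and amplitude here) are rendered into a waveform by DSP. Real parametric systems use full vocoders such as WORLD or STRAIGHT, which also model spectral envelope and aperiodicity; `toy_vocoder` is purely illustrative.

```python
import numpy as np

def toy_vocoder(f0, amp, sr=16_000, frame_ms=5):
    """Render per-frame F0 (Hz) and amplitude into a phase-continuous sine."""
    n = sr * frame_ms // 1000             # samples per frame
    phase, frames = 0.0, []
    for hz, a in zip(f0, amp):
        t = np.arange(n)
        frames.append(a * np.sin(phase + 2 * np.pi * hz * t / sr))
        phase += 2 * np.pi * hz * n / sr  # carry phase across frame joins
    return np.concatenate(frames)

# One second of a 120 Hz "voice" fading out: 200 frames of 5 ms each
wave = toy_vocoder(f0=np.full(200, 120.0), amp=np.linspace(1.0, 0.0, 200))
```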

SLIDE 17

SLIDE 18

SLIDE 19

WaveNet

  • Change of paradigm for parametric speech synthesis
  • Don’t feed model output to vocoder to generate waveform
  • Instead, sample waveform directly from neural network
  • Sample at ≥16 kHz to generate audio


arXiv 1609.03499
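The paper factorises the waveform autoregressively, p(x) = ∏_t p(x_t | x_1, …, x_(t−1)), so generation draws one sample at a time. Below is a sketch of that sampling loop with a stubbed-out network; the stub and its `receptive_field` argument are placeholders, not the paper’s code.

```python
import numpy as np

rng = np.random.default_rng(0)

def model(history: np.ndarray) -> np.ndarray:
    """Stub for the trained network: a 256-way distribution over the next
    mu-law-quantised sample value, conditioned on recent history."""
    logits = rng.normal(size=256)
    return np.exp(logits) / np.exp(logits).sum()

def generate(n_samples: int, receptive_field: int = 1024) -> np.ndarray:
    audio = np.zeros(0, dtype=np.int64)
    for _ in range(n_samples):
        probs = model(audio[-receptive_field:])   # p(x_t | x_<t)
        audio = np.append(audio, rng.choice(256, p=probs))
    return audio

samples = generate(160)  # 10 ms of audio; a full second needs 16,000 steps
```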

SLIDE 20

Causal convolutions

[Diagram: a stack of causal convolution layers over time; each output depends only on current and earlier inputs]

SLIDE 21

Causal convolutions

  • Problem: causal convolutions require huge depth to give a sufficiently large receptive field for good prosody
  • Such depth is computationally infeasible to train
  • The chosen solution is to dilate the convolutions
  • Skip input values at a fixed interval to increase the receptive field
  • The receptive field then grows exponentially with depth (see the sketch below)

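A minimal sketch of the trick, in PyTorch (a framework choice of mine, not the paper’s): left-padding by (kernel_size − 1) × dilation keeps the convolution causal, and stacking dilations 1, 2, 4, 8 gives a receptive field of 16 samples from only four layers.

```python
import torch
import torch.nn.functional as F

def causal_conv(x, weight, dilation):
    # Pad on the left only, so output t never sees inputs later than t
    pad = (weight.shape[-1] - 1) * dilation
    return F.conv1d(F.pad(x, (pad, 0)), weight, dilation=dilation)

x = torch.randn(1, 1, 64)        # (batch, channels, time)
w = torch.randn(1, 1, 2)         # kernel size 2
for d in (1, 2, 4, 8):           # receptive field doubles at each layer
    x = causal_conv(x, w, dilation=d)
print(x.shape)                   # time length unchanged: torch.Size([1, 1, 64])
```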

SLIDE 22

Dilated causal convolutions

[Diagram: a stack of dilated causal convolutions over time, with the dilation doubling at each layer]

SLIDE 23

Activation function

  • Use gated activation taken from PixelCNN
  • Empirical choice: performs better than ReLU activation

z = tanh(W_f ∗ x) ⊙ σ(W_g ∗ x)   (the tanh branch is the “filter”, the sigmoid branch the “gate”)

arXiv 1606.05328

SLIDE 24

Activation function

  • Need to condition locally on the input text sequence
  • Have a second time series h (i.e. from the linguistic frontend)
  • Learned upsampling y = f(h) to the same sampling frequency as x

z = tanh(W_f ∗ x + V_f ∗ y) ⊙ σ(W_g ∗ x + V_g ∗ y)

arXiv 1609.03499
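Putting the two pieces together, a sketch of a conditioned gated block (the layer sizes and the 1×1 conditioning convolutions are my choices; the structure follows the equations above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.w_f = nn.Conv1d(channels, channels, kernel_size=2)  # filter
        self.w_g = nn.Conv1d(channels, channels, kernel_size=2)  # gate
        self.v_f = nn.Conv1d(channels, channels, kernel_size=1)  # conditioning
        self.v_g = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x, y):
        # x: audio features; y: linguistic features upsampled to x's rate
        x = F.pad(x, (1, 0))  # left-pad to keep the convolution causal
        return torch.tanh(self.w_f(x) + self.v_f(y)) * \
               torch.sigmoid(self.w_g(x) + self.v_g(y))

block = GatedBlock(channels=16)
x = torch.randn(1, 16, 100)   # waveform features
y = torch.randn(1, 16, 100)   # upsampled conditioning series
z = block(x, y)               # same time length as x: (1, 16, 100)
```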

SLIDE 25

WaveNet limitations

  • WaveNet can generate very human-sounding waveforms
  • But how do we tell the WaveNet what to say?
  • It still requires an extensive feature-engineered frontend
  • That needs time and linguistic expertise, and is brittle
  • How do we improve on the conventional frontend?

SLIDE 26


deep learning

SLIDE 27


SLIDE 28


Tacotron 2

SLIDE 29

“Our model achieves a mean opinion score (MOS) of 4.53, comparable to a MOS of 4.58 for professionally recorded speech.”


SLIDE 30


SLIDE 31

The good news


SLIDE 32


Classification

SLIDE 33

Generative adversarial networks

Via backpropagation, the Generator learns to produce better audio, while the Discriminator learns to better distinguish synthetic audio from real.
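A minimal sketch of that training loop in PyTorch (the network shapes, optimisers and 1024-sample chunking are illustrative choices, not from the talk):

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 1024))
D = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_audio):              # (batch, 1024) chunks of real audio
    batch = real_audio.shape[0]
    fake = G(torch.randn(batch, 64))     # generate audio from noise

    # Discriminator: push real towards 1, generated towards 0
    loss_d = bce(D(real_audio), torch.ones(batch, 1)) + \
             bce(D(fake.detach()), torch.zeros(batch, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator: its gradients flow through D, pushing fakes towards "real"
    loss_g = bce(D(fake), torch.ones(batch, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

train_step(torch.randn(8, 1024))         # stand-in for a real-audio batch
```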

SLIDE 34

SLIDE 35

SLIDE 36

scott@faculty.ai

Follow us:

We’re hiring!

We are hiring data scientists and machine learning engineers at all levels. If you’re interested in finding out more about Faculty and our work, get in touch!