

SLIDE 1

Deep learning for speech synthesis

The good news, the bad news, and the fake news

Scott Stevenson

scott@faculty.ai

SLIDE 2

The fake news

SLIDE 3

SLIDE 4

SLIDE 5

The effect of “hot mic” incidents


  • “Hot mic” incidents can bring huge negative publicity
  • Incidents can change voting intentions and the outcome of elections
  • This has been demonstrated on multiple occasions
  • Politicians are particularly at risk

SLIDE 6


“Ugh, everything! She’s just a sort of bigoted woman that said she used to be Labour. I mean, it’s just ridiculous.”

SLIDE 7

The bad news


SLIDE 8
  • If adversaries can generate realistic audio, they can fabricate “hot mic” recordings
  • Traditionally this requires domain expertise
  • Modern deep learning makes it eminently possible
  • We can’t stop the technology, but we can inoculate people against it

SLIDE 9

[Diagram: text → Frontend → linguistic representation → Backend → audio. The linguistic representation is shown in ARPAbet: L IH NG G W IH S T IH K . R EH P R AH Z EH N T EY SH AH N .]

SLIDE 10


Frontend

SLIDE 11

“IBM was founded in 1911”

→ “i b m was founded in nineteen eleven”

“Apple is valued at $1 trillion”

→ “apple is valued at one trillion dollars”

“He lives on St Paul’s St.”

→ “he lives on saint paul’s street”

Frontend: tokenisation and normalisation

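For illustration, here is a minimal sketch of the rule-based normalisation such a frontend performs. The `normalise` helper and its tiny rule tables are hypothetical; a production frontend ships far larger rule sets plus classifiers to resolve ambiguous tokens.

```python
import re

# Toy rule tables; a real frontend has thousands of rules plus models to
# decide, e.g., whether "St." means "saint" or "street" in context.
NUMBER_WORDS = {"1": "one", "1911": "nineteen eleven"}
ABBREV = {"IBM": "i b m"}

def normalise(text: str) -> str:
    """Map written-form text to spoken-form words (toy version)."""
    # Currency: "$1 trillion" -> "one trillion dollars"
    text = re.sub(
        r"\$(\d+) (thousand|million|billion|trillion)",
        lambda m: f"{NUMBER_WORDS[m.group(1)]} {m.group(2)} dollars",
        text,
    )
    # Abbreviations and years by simple lookup
    words = [ABBREV.get(w, NUMBER_WORDS.get(w, w)) for w in text.split()]
    return " ".join(words).lower()

print(normalise("IBM was founded in 1911"))
# -> i b m was founded in nineteen eleven
```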

SLIDE 12

“apple is valued at one trillion dollars” → AE P AH L . IH Z . V AE L Y UW D . AE T . W AH N . T R IH L Y AH N . D AA L ER Z .

Frontend: phonetic transcriptions

The CMU Pronouncing Dictionary http://www.speech.cs.cmu.edu/cgi-bin/cmudict

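A sketch of transcription by dictionary lookup. It assumes NLTK’s packaged copy of CMUdict (the slide cites only the dictionary itself); note that CMUdict phones carry stress digits (AE1, AH0) that the slide omits.

```python
# Grapheme-to-phoneme by lookup in the CMU Pronouncing Dictionary, here via
# NLTK's packaged copy (run nltk.download("cmudict") once beforehand).
from nltk.corpus import cmudict

PRON = cmudict.dict()  # word -> list of possible ARPAbet transcriptions

def transcribe(sentence: str) -> str:
    phones = []
    for word in sentence.lower().split():
        # Take the first listed pronunciation; a real frontend disambiguates
        # homographs such as "project" using part-of-speech context.
        phones.extend(PRON[word][0] + ["."])
    return " ".join(phones)

print(transcribe("apple is valued at one trillion dollars"))
# AE1 P AH0 L . IH1 Z . ... (stress digits included, unlike the slide)
```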

SLIDE 13


Backend

SLIDE 14

Backend: concatenative

  • Synthesise waveform from linguistic representation
  • Most commonly concatenative systems
  • Prerecorded database of audio samples (“units”)
  • Typically 10 ms to 1 s long
  • Picks best units to concatenate (sketched below)
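To make “picks best units” concrete, here is a toy sketch of unit selection: each candidate unit is scored by how smoothly it joins its neighbour. The `Unit` record and the greedy search are illustrative; real systems run a Viterbi search over combined target and join costs.

```python
from dataclasses import dataclass

@dataclass
class Unit:
    phone: str          # ARPAbet label of this recorded snippet
    audio: list         # waveform samples, 10 ms to 1 s long
    start_pitch: float  # pitch at the unit's boundaries, used for join cost
    end_pitch: float

def select_units(target_phones: list[str], database: list[Unit]) -> list[Unit]:
    """Greedy unit selection: minimise pitch mismatch at each join."""
    chosen: list[Unit] = []
    for phone in target_phones:
        candidates = [u for u in database if u.phone == phone]
        prev = chosen[-1] if chosen else None
        join_cost = lambda u: abs(u.start_pitch - prev.end_pitch) if prev else 0.0
        chosen.append(min(candidates, key=join_cost))
    return chosen

# The output waveform is then just the chosen units' audio, concatenated.
```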

SLIDE 15

Problems with concatenative systems

  • Require large database of high quality recordings
  • Can’t change speaker or emotion without new database
  • High intelligibility and naturalness, yet distinguishable by prosody (intonation, tone, stress, rhythm)

    ○ “My latest project is to learn how to project my voice”: two pronunciations of “project”
    ○ Liaison in French: a final consonant is no longer silent if the following word begins with a vowel

SLIDE 16

Backend: parametric

  • Don’t use pre-recorded units
  • Mathematical model contains the information needed to synthesise speech
  • Speaker and emotion are stored in the model parameters
  • Speech content is controlled by the input
  • Model outputs are passed to a vocoder (toy example below)
  • Less natural than concatenative systems because of DSP artefacts
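A toy stand-in for the vocoder stage: per-frame acoustic parameters (just F0 and amplitude here) are rendered into a waveform by DSP. Real parametric systems use full vocoders such as WORLD or STRAIGHT, which also model spectral envelope and aperiodicity; `toy_vocoder` is purely illustrative.

```python
import numpy as np

def toy_vocoder(f0, amp, sr=16_000, frame_ms=5):
    """Render per-frame F0 (Hz) and amplitude into a phase-continuous sine."""
    n = sr * frame_ms // 1000             # samples per frame
    phase, frames = 0.0, []
    for hz, a in zip(f0, amp):
        t = np.arange(n)
        frames.append(a * np.sin(phase + 2 * np.pi * hz * t / sr))
        phase += 2 * np.pi * hz * n / sr  # carry phase across frame joins
    return np.concatenate(frames)

# One second of a 120 Hz "voice" fading out: 200 frames of 5 ms each
wave = toy_vocoder(f0=np.full(200, 120.0), amp=np.linspace(1.0, 0.0, 200))
```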

SLIDE 17

SLIDE 18

SLIDE 19

WaveNet

  • Change of paradigm for parametric speech synthesis
  • Don’t feed model output to vocoder to generate waveform
  • Instead, sample waveform directly from neural network
  • Sample at ≥16 kHz to generate audio


arXiv 1609.03499
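The paper factorises the waveform autoregressively, p(x) = ∏_t p(x_t | x_1, …, x_(t−1)), so generation draws one sample at a time. Below is a sketch of that sampling loop with a stubbed-out network; the stub and its `receptive_field` argument are placeholders, not the paper’s code.

```python
import numpy as np

rng = np.random.default_rng(0)

def model(history: np.ndarray) -> np.ndarray:
    """Stub for the trained network: a 256-way distribution over the next
    mu-law-quantised sample value, conditioned on recent history."""
    logits = rng.normal(size=256)
    return np.exp(logits) / np.exp(logits).sum()

def generate(n_samples: int, receptive_field: int = 1024) -> np.ndarray:
    audio = np.zeros(0, dtype=np.int64)
    for _ in range(n_samples):
        probs = model(audio[-receptive_field:])   # p(x_t | x_<t)
        audio = np.append(audio, rng.choice(256, p=probs))
    return audio

samples = generate(160)  # 10 ms of audio; a full second needs 16,000 steps
```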

SLIDE 20

Causal convolutions

[Diagram: a stack of causal convolution layers over time; each output depends only on current and earlier inputs]

SLIDE 21

Causal convolutions

  • Problem: causal convolutions require huge depth to give a sufficiently large receptive field for good prosody
  • Such depth is computationally infeasible to train
  • The chosen solution is to dilate the convolutions
  • Skip input values at a fixed interval to increase the receptive field
  • The receptive field then grows exponentially with depth (see the sketch below)

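A minimal sketch of the trick, in PyTorch (a framework choice of mine, not the paper’s): left-padding by (kernel_size − 1) × dilation keeps the convolution causal, and stacking dilations 1, 2, 4, 8 gives a receptive field of 16 samples from only four layers.

```python
import torch
import torch.nn.functional as F

def causal_conv(x, weight, dilation):
    # Pad on the left only, so output t never sees inputs later than t
    pad = (weight.shape[-1] - 1) * dilation
    return F.conv1d(F.pad(x, (pad, 0)), weight, dilation=dilation)

x = torch.randn(1, 1, 64)        # (batch, channels, time)
w = torch.randn(1, 1, 2)         # kernel size 2
for d in (1, 2, 4, 8):           # receptive field doubles at each layer
    x = causal_conv(x, w, dilation=d)
print(x.shape)                   # time length unchanged: torch.Size([1, 1, 64])
```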

SLIDE 22

Dilated causal convolutions

[Diagram: a stack of dilated causal convolutions over time, with the dilation doubling at each layer]

SLIDE 23

Activation function

  • Use gated activation taken from PixelCNN
  • Empirical choice: performs better than ReLU activation

z = tanh(W_f ∗ x) ⊙ σ(W_g ∗ x)   (the tanh branch is the “filter”, the sigmoid branch the “gate”)

arXiv 1606.05328

SLIDE 24

Activation function

  • Need to condition locally on the input text sequence
  • Have a second time series h (i.e. from the linguistic frontend)
  • Learned upsampling y = f(h) to the same sampling frequency as x

z = tanh(W_f ∗ x + V_f ∗ y) ⊙ σ(W_g ∗ x + V_g ∗ y)

arXiv 1609.03499
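Putting the two pieces together, a sketch of a conditioned gated block (the layer sizes and the 1×1 conditioning convolutions are my choices; the structure follows the equations above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.w_f = nn.Conv1d(channels, channels, kernel_size=2)  # filter
        self.w_g = nn.Conv1d(channels, channels, kernel_size=2)  # gate
        self.v_f = nn.Conv1d(channels, channels, kernel_size=1)  # conditioning
        self.v_g = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x, y):
        # x: audio features; y: linguistic features upsampled to x's rate
        x = F.pad(x, (1, 0))  # left-pad to keep the convolution causal
        return torch.tanh(self.w_f(x) + self.v_f(y)) * \
               torch.sigmoid(self.w_g(x) + self.v_g(y))

block = GatedBlock(channels=16)
x = torch.randn(1, 16, 100)   # waveform features
y = torch.randn(1, 16, 100)   # upsampled conditioning series
z = block(x, y)               # same time length as x: (1, 16, 100)
```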

SLIDE 25

WaveNet limitations

  • WaveNet can generate very human-sounding waveforms
  • But how do we tell the WaveNet what to say?
  • It still requires an extensive feature-engineered frontend
  • That needs time and linguistic expertise, and is brittle
  • How do we improve on the conventional frontend?

SLIDE 26


deep learning

SLIDE 27


SLIDE 28


Tacotron 2

SLIDE 29

“Our model achieves a mean opinion score (MOS) of 4.53, comparable to a MOS of 4.58 for professionally recorded speech.”


SLIDE 30


SLIDE 31

The good news


SLIDE 32


Classification

SLIDE 33

Generative adversarial networks

Via backpropagation, the Generator learns to produce better audio, while the Discriminator learns to better distinguish synthetic audio from real.
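A minimal sketch of that training loop in PyTorch (the network shapes, optimisers and 1024-sample chunking are illustrative choices, not from the talk):

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 1024))
D = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_audio):              # (batch, 1024) chunks of real audio
    batch = real_audio.shape[0]
    fake = G(torch.randn(batch, 64))     # generate audio from noise

    # Discriminator: push real towards 1, generated towards 0
    loss_d = bce(D(real_audio), torch.ones(batch, 1)) + \
             bce(D(fake.detach()), torch.zeros(batch, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator: its gradients flow through D, pushing fakes towards "real"
    loss_g = bce(D(fake), torch.ones(batch, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

train_step(torch.randn(8, 1024))         # stand-in for a real-audio batch
```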

SLIDE 34

SLIDE 35

SLIDE 36

scott@faculty.ai

Follow us:

We’re hiring!

We are hiring data scientists and machine learning engineers at all levels. If you’re interested in finding out more about Faculty and our work, get in touch!