Speech Synthesis
Lecture 19, CS 753
Instructor: Preethi Jyothi
Project Preliminary Report
- Preliminary project report will contribute towards 5% of your final grade. The deadline is 27th October, 2019.
- Define the following for your project:
  1) Input-output behaviour of your system
  2) Evaluation metric
  3) At least two existing (or related) approaches to your problem
- Propose a model and an algorithm for the problem you're tackling and give detailed descriptions for both. Do not provide generic descriptions of the model; describe precisely how it applies to your problem.
- Describe how much of your algorithm has been implemented. If you are using existing APIs/libraries, clearly demarcate which parts you will be implementing and for which parts you will rely on existing implementations.
- Describe the experiments you are planning to run. If you have already run any preliminary experiments, describe them and report your initial results.
Each of the four items above is worth 5 points.
Text-To-Speech (TTS) Systems
Storied History
- Von Kempelen’s speaking machine (1791)
  - Bellows simulated the lungs
  - Rubber mouth and nose; nostrils had to be covered with two fingers for non-nasals
- Homer Dudley’s VODER (1939)
  - First device to synthesize speech sounds via electrical means
- Gunnar Fant’s OVE formant synthesizer (1960s)
  - Formant synthesizer for vowels
- Computer-aided speech synthesis (1970s)
  - Concatenative (unit selection)
  - Parametric (HMM-based and NN-based)
All images from http://www2.ling.su.se/staff/hartmut/kemplne.htm
Speech synthesis or TTS systems
- Goal of a TTS system: produce a natural-sounding, high-quality speech waveform for a given word sequence
- TTS systems are typically divided into two parts:
- A. Linguistic specification
- B. Waveform generation
Current TTS systems
- Constructed using a large amount of speech data
- Referred to as corpus-based TTS systems
- Two prominent instances of corpus-based TTS:
- 1. Unit selection and concatenation
- 2. Statistical parametric speech synthesis
Unit Selection Synthesis
Unit selection synthesis or Concatenative speech synthesis
[Figure: selecting units from all candidate segments, with target and concatenation costs]
- Synthesize new sentences by selecting sub-word units from a database of speech
- Optimal size of units? Diphones? Half-phones?
Image from Zen et al., “Statistical Parametric Speech Synthesis”, Speech Communication, 2009
Unit selection synthesis

- Target cost between a candidate unit $u_i$ and a target unit $t_i$:

  $C^{(t)}(t_i, u_i) = \sum_{j=1}^{p} w_j^{(t)} C_j^{(t)}(t_i, u_i)$

- Concatenation cost between candidate units:

  $C^{(c)}(u_{i-1}, u_i) = \sum_{k=1}^{q} w_k^{(c)} C_k^{(c)}(u_{i-1}, u_i)$

- Find the string of units that minimises the overall cost:

  $\hat{u}_{1:n} = \arg\min_{u_{1:n}} \{ C(t_{1:n}, u_{1:n}) \}$, where
  $C(t_{1:n}, u_{1:n}) = \sum_{i=1}^{n} C^{(t)}(t_i, u_i) + \sum_{i=2}^{n} C^{(c)}(u_{i-1}, u_i)$
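The minimisation above decomposes over adjacent units, so the optimal sequence can be found with a Viterbi-style dynamic programming search over the candidate units. Below is a minimal sketch of that search; `target_cost(i, j)` and `concat_cost(i, k, j)` are hypothetical stand-ins for the weighted sums $C^{(t)}$ and $C^{(c)}$ above (candidate j of target i, joined after candidate k of target i-1), and `candidates[i]` is the list of database units that could realise target $t_i$.

```python
import numpy as np

def select_units(candidates, target_cost, concat_cost):
    """Return indices of the unit sequence u_1:n minimising the overall cost."""
    n = len(candidates)
    # best[i][j]: minimal cost of any unit sequence ending in candidate j of target i
    best = [np.array([target_cost(0, j) for j in range(len(candidates[0]))])]
    back = []
    for i in range(1, n):
        cur = np.empty(len(candidates[i]))
        ptr = np.empty(len(candidates[i]), dtype=int)
        for j in range(len(candidates[i])):
            # cost of the best path ending in each previous candidate k,
            # plus the cost of concatenating candidate j after it
            trans = best[-1] + np.array(
                [concat_cost(i, k, j) for k in range(len(candidates[i - 1]))])
            ptr[j] = int(np.argmin(trans))
            cur[j] = trans[ptr[j]] + target_cost(i, j)
        back.append(ptr)
        best.append(cur)
    # backtrack the lowest-cost path
    path = [int(np.argmin(best[-1]))]
    for ptr in reversed(back):
        path.append(int(ptr[path[-1]]))
    return list(reversed(path))
```

Practical unit-selection engines perform essentially this search, usually with pruning, since the candidate sets per target unit can be very large.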
Unit selection synthesis

[Figure: clustered segments, with target and concatenation costs]

- Target cost is pre-calculated using a clustering method
Statistical Parametric Speech Synthesis
Parametric Speech Synthesis Framework
- Training
  - Estimate the acoustic model given speech utterances (O) and word sequences (W)*:

    $\hat{\lambda} = \arg\max_{\lambda} p(O \mid W, \lambda)$
[Figure: training pipeline. Speech analysis and text analysis produce O and W from the speech/text data, and the model $\hat{\lambda}$ is trained from them]
* Here W could refer to any textual features relevant to the input text
Parametric Speech Synthesis Framework
- Training
  - Estimate the acoustic model given speech utterances (O) and word sequences (W):

    $\hat{\lambda} = \arg\max_{\lambda} p(O \mid W, \lambda)$

- Synthesis
  - Find the most probable $\hat{o}$ from $\hat{\lambda}$ and a given word sequence to be synthesised, $w$:

    $\hat{o} = \arg\max_{o} p(o \mid w, \hat{\lambda})$

  - Synthesize speech from $\hat{o}$
[Figure: full pipeline (speech analysis, text analysis, train model, parameter generation, speech synthesis); the acoustic models used here are HMMs]
HMM-based speech synthesis
[Figure: HMM-based speech synthesis system. Training part: spectral and excitation parameters are extracted from the speech database and used, together with labels, to train context-dependent HMMs and duration models. Synthesis part: text analysis converts text to labels, parameters are generated from the HMMs, and excitation generation plus a synthesis filter produce the synthesized speech]
Speech parameter generation
Generate the most probable observation vectors given the HMM and w:

$\hat{o} = \arg\max_{o} p(o \mid w, \hat{\lambda}) = \arg\max_{o} \sum_{\forall q} p(o, q \mid w, \hat{\lambda}) \approx \arg\max_{o} \max_{q} p(o, q \mid w, \hat{\lambda}) = \arg\max_{o} \max_{q} p(o \mid q, \hat{\lambda}) \, p(q \mid w, \hat{\lambda})$

Determine the best state sequence and outputs sequentially:

$\hat{q} = \arg\max_{q} p(q \mid w, \hat{\lambda}), \qquad \hat{o} = \arg\max_{o} p(o \mid \hat{q}, \hat{\lambda})$

Let’s explore this first.

Determining state outputs
$\hat{o} = \arg\max_{o} p(o \mid \hat{q}, \hat{\lambda}) = \arg\max_{o} \mathcal{N}(o; \mu_{\hat{q}}, \Sigma_{\hat{q}})$

where $o = [o_1^\top, \ldots, o_T^\top]^\top$ is a state-output vector sequence to be generated, $q = \{q_1, \ldots, q_T\}$ is a state sequence, $\mu_{\hat{q}} = [\mu_{q_1}^\top, \ldots, \mu_{q_T}^\top]^\top$ is the mean vector for $\hat{q}$, and $\Sigma_{\hat{q}} = \mathrm{diag}[\Sigma_{q_1}, \ldots, \Sigma_{q_T}]$ is the covariance matrix for $\hat{q}$.
What would $\hat{o}$ look like?

[Figure: the generated trajectory, showing the per-state means and variances]
Adding dynamic features to state outputs
State output vectors contain both static ($c_t$) and dynamic ($\Delta c_t$) features:

$o_t = [c_t^\top, \Delta c_t^\top]^\top$, where the dynamic feature is calculated as $\Delta c_t = c_t - c_{t-1}$

The relationship between $o$ and $c$ can be arranged in matrix form as $o = Wc$:

$\begin{bmatrix} \vdots \\ c_{t-1} \\ \Delta c_{t-1} \\ c_t \\ \Delta c_t \\ c_{t+1} \\ \Delta c_{t+1} \\ \vdots \end{bmatrix} = \begin{bmatrix} & \vdots & \vdots & \vdots & \vdots & \\ \cdots & 0 & I & 0 & 0 & \cdots \\ \cdots & -I & I & 0 & 0 & \cdots \\ \cdots & 0 & 0 & I & 0 & \cdots \\ \cdots & 0 & -I & I & 0 & \cdots \\ \cdots & 0 & 0 & 0 & I & \cdots \\ \cdots & 0 & 0 & -I & I & \cdots \\ & \vdots & \vdots & \vdots & \vdots & \end{bmatrix} \begin{bmatrix} \vdots \\ c_{t-2} \\ c_{t-1} \\ c_t \\ c_{t+1} \\ \vdots \end{bmatrix}$
Speech parameter generation
- Introducing dynamic feature constraints:

  $\hat{o} = \arg\max_{o} p(o \mid \hat{q}, \hat{\lambda})$, where $o = Wc$

- If the output distributions are single Gaussians:

  $p(o \mid \hat{q}, \hat{\lambda}) = \mathcal{N}(o; \mu_{\hat{q}}, \Sigma_{\hat{q}})$

- Then, by setting $\partial \log \mathcal{N}(o; \mu_{\hat{q}}, \Sigma_{\hat{q}}) / \partial c = 0$, we get:

  $W^\top \Sigma_{\hat{q}}^{-1} W c = W^\top \Sigma_{\hat{q}}^{-1} \mu_{\hat{q}}$
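A minimal numerical sketch of this solution, assuming one-dimensional static features, the delta definition $\Delta c_t = c_t - c_{t-1}$ from the previous slide, and per-frame means and variances already gathered along the chosen state sequence $\hat{q}$. The `(T, 2)` layout of `mu` and `var` (static and delta statistics per frame) is an assumption made for this illustration.

```python
import numpy as np

def mlpg(mu, var):
    """Solve W^T Sigma^{-1} W c = W^T Sigma^{-1} mu for the static trajectory c.

    mu, var: (T, 2) arrays holding the [static, delta] mean/variance per frame."""
    T = mu.shape[0]
    # Build W so that o = W c, with rows alternating between a static row (c_t)
    # and a delta row (c_t - c_{t-1}); o has length 2T, c has length T.
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                 # static row: c_t
        W[2 * t + 1, t] = 1.0             # delta row: c_t ...
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0    # ... minus c_{t-1}
    mu_vec = mu.reshape(-1)               # interleaved means [c_1, dc_1, c_2, dc_2, ...]
    prec = 1.0 / var.reshape(-1)          # diagonal of Sigma^{-1}
    A = W.T @ (prec[:, None] * W)         # W^T Sigma^{-1} W
    b = W.T @ (prec * mu_vec)             # W^T Sigma^{-1} mu
    return np.linalg.solve(A, b)          # smooth static trajectory c
```

Because $W^\top \Sigma_{\hat{q}}^{-1} W$ is banded, practical implementations solve this system with a banded Cholesky factorisation rather than the dense solve used here.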
Synthesis overview
[Figure: synthesis overview. Clustered states are merged into a sentence HMM; the static and delta Gaussians of its states are used to generate the ML trajectory]
Speech parameter generation
Generate the most probable observation vectors given the HMM and w:

$\hat{o} = \arg\max_{o} p(o \mid w, \hat{\lambda}) = \arg\max_{o} \sum_{\forall q} p(o, q \mid w, \hat{\lambda}) \approx \arg\max_{o} \max_{q} p(o, q \mid w, \hat{\lambda}) = \arg\max_{o} \max_{q} p(o \mid q, \hat{\lambda}) \, p(q \mid w, \hat{\lambda})$

Determine the best state sequence and outputs sequentially:

$\hat{q} = \arg\max_{q} p(q \mid w, \hat{\lambda}), \qquad \hat{o} = \arg\max_{o} p(o \mid \hat{q}, \hat{\lambda})$

Let’s explore this next.
Duration modeling
[Figure: state duration probability $p_k(d) = a_{kk}^{d-1}(1 - a_{kk})$ plotted against state duration d (frames), for $a_{kk} = 0.6$]
- How are durations modelled within an HMM?
  - Implicitly modelled by state self-transition probabilities
  - PMFs of state durations are geometric distributions:

    $p_k(d) = a_{kk}^{d-1} \cdot (1 - a_{kk})$

- State durations are determined by maximising:

  $\log P(d \mid \lambda) = \sum_{j=1}^{N} \log p_j(d_j)$

- What would this solution look like if the PMFs of state durations are geometric distributions?
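As a quick check of the question above: the geometric PMF is largest at d = 1 and decays monotonically, so maximising $\sum_j \log p_j(d_j)$ with no further constraint would assign every state the shortest possible duration. A tiny sketch for the self-transition value in the plot:

```python
# Implicit (geometric) state-duration PMF p_k(d) = a_kk^(d-1) * (1 - a_kk)
a_kk = 0.6
pmf = [(a_kk ** (d - 1)) * (1 - a_kk) for d in range(1, 11)]
print([round(p, 3) for p in pmf])   # [0.4, 0.24, 0.144, ...]: monotonically decreasing
```

This is one motivation for the explicit duration models described next.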
Explicit modeling of state durations
- Each state duration is explicitly modelled as a single Gaussian. The mean $\xi(i)$ and variance $\sigma^2(i)$ of the duration density of state i:

  $\xi(i) = \dfrac{\sum_{t_0=1}^{T} \sum_{t_1=t_0}^{T} \chi_{t_0,t_1}(i)\,(t_1 - t_0 + 1)}{\sum_{t_0=1}^{T} \sum_{t_1=t_0}^{T} \chi_{t_0,t_1}(i)}, \qquad \sigma^2(i) = \dfrac{\sum_{t_0=1}^{T} \sum_{t_1=t_0}^{T} \chi_{t_0,t_1}(i)\,(t_1 - t_0 + 1)^2}{\sum_{t_0=1}^{T} \sum_{t_1=t_0}^{T} \chi_{t_0,t_1}(i)} - \xi^2(i)$

  where $\chi_{t_0,t_1}(i) = (1 - \gamma_{t_0-1}(i)) \cdot \prod_{t=t_0}^{t_1} \gamma_t(i) \cdot (1 - \gamma_{t_1+1}(i))$ and $\gamma_t(i)$ is the probability of being in state i at time t.
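A minimal sketch of these statistics, assuming the occupancy probabilities $\gamma_t(i)$ are already available (e.g. from the forward-backward algorithm) and taking $\gamma_0(i) = \gamma_{T+1}(i) = 0$ at the boundaries; the O(T^2) double loop simply mirrors the sums above and is for illustration only.

```python
import numpy as np

def duration_stats(gamma):
    """gamma: (T, N) array with gamma[t, i] = P(state i at frame t+1).
    Returns the mean xi and variance sigma2 of the duration density per state."""
    T, N = gamma.shape
    # pad with zeros so gamma at times 0 and T+1 is 0 (boundary assumption)
    g = np.vstack([np.zeros(N), gamma, np.zeros(N)])   # valid time indices 0 .. T+1
    xi, sigma2 = np.zeros(N), np.zeros(N)
    for i in range(N):
        num1 = num2 = den = 0.0
        for t0 in range(1, T + 1):
            prod = 1.0
            for t1 in range(t0, T + 1):
                prod *= g[t1, i]                       # product of gamma_t(i), t = t0..t1
                chi = (1 - g[t0 - 1, i]) * prod * (1 - g[t1 + 1, i])
                d = t1 - t0 + 1
                den += chi
                num1 += chi * d
                num2 += chi * d * d
        xi[i] = num1 / den
        sigma2[i] = num2 / den - xi[i] ** 2
    return xi, sigma2
```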
Determining state durations
During synthesis, for a given speech length T, the goal is to maximize:
$\log P(d \mid \lambda, T) = \sum_{k=1}^{K} \log p_k(d_k) \qquad (1)$

under the constraint that $T = \sum_{k=1}^{K} d_k$.

We saw that each duration density can be modelled as a single Gaussian: $p_k(d_k) = \mathcal{N}(d_k; \xi(k), \sigma^2(k))$.

State durations $d_k$, $k = 1, \ldots, K$, which maximise (1) are given by:

$d_k = \xi(k) + \rho \cdot \sigma^2(k), \qquad \rho = \dfrac{T - \sum_{k=1}^{K} \xi(k)}{\sum_{k=1}^{K} \sigma^2(k)}$

where $\xi(k)$ and $\sigma^2(k)$ are the mean and variance of the duration density of state k, and $\rho$ controls the speaking rate.
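A minimal sketch of this closed-form solution, with continuous-valued durations (rounding to integer frame counts is omitted) and made-up $\xi$, $\sigma^2$ values purely for illustration:

```python
import numpy as np

def state_durations(xi, sigma2, T):
    """d_k = xi(k) + rho * sigma^2(k), with rho chosen so the durations sum to T."""
    rho = (T - np.sum(xi)) / np.sum(sigma2)   # speaking-rate factor
    return xi + rho * sigma2

# e.g. stretch a 3-state model (mean durations 10, 20, 15 frames) to T = 60 frames
print(state_durations(np.array([10.0, 20.0, 15.0]), np.array([4.0, 9.0, 6.0]), 60))
```

Setting T larger or smaller than the sum of the mean durations makes $\rho$ positive or negative, which slows down or speeds up the synthesized speech.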
Synthesis using duration models
[Figure: synthesis using duration models. Text is mapped to context-dependent state duration models and context-dependent HMMs; state durations d are generated for a requested length T (or speaking-rate factor $\rho$), mel-cepstrum parameters c are generated from the sentence HMM, and an MLSA filter driven by pitch produces the synthetic speech]
Image from Yoshimura et al., “Duration modelling for HMM-based speech synthesis”, ICSLP ‘98
Recap: HMM-based speech synthesis

[Figure: training part (spectral and excitation parameter extraction from the speech database; training of context-dependent HMMs and state duration models) and synthesis part (text analysis to labels, parameter generation from HMMs, excitation generation, synthesis filter) producing synthesized speech]
DNN-based speech synthesis
[Figure: DNN-based speech synthesis. Text analysis and input feature extraction produce frame-level input features (binary and numeric) $x_t$ for frames 1..T; a DNN with several hidden layers maps each $x_t$ to output features $y_t$, the statistics (mean and variance) of the speech parameter vector sequence; parameter generation and waveform synthesis then produce speech]
Image from Zen et al., “Statistical Parametric Speech Synthesis using DNNs”, 2014
DNN-based speech synthesis
- Input features: linguistic contexts and numeric values (# of words, duration of the phoneme, etc.)
- Output features: spectral and excitation parameters and their delta values
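A minimal sketch of the frame-level mapping described above: per-frame linguistic input features are passed through a few fully-connected hidden layers to produce per-frame acoustic output features. The layer sizes and the tanh nonlinearity are illustrative assumptions, not the configuration used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [347, 512, 512, 512, 127]      # [input, 3 hidden, output] dims -- hypothetical
weights = [rng.standard_normal((m, n)) * 0.01 for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def dnn_forward(x):
    """x: (T, input_dim) linguistic features; returns (T, output_dim) acoustic features."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.tanh(h @ W + b)           # hidden layers
    return h @ weights[-1] + biases[-1]  # linear output layer

y = dnn_forward(rng.standard_normal((100, 347)))   # 100 frames of dummy input
```

The weights here are random; in the actual system they are trained to predict the acoustic features extracted from the speech database, and the per-frame outputs are then passed to the parameter generation and waveform synthesis steps shown in the figure.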
Table 1. Preference scores (%) between speech samples from the HMM and DNN-based systems. The systems which achieved significantly better preference at the p < 0.01 level are in bold font in the original table.

HMM         DNN (#layers x #units)   Neutral   p value    z value
15.8 (16)   38.5 (4 x 256)           45.7      < 10^-6    -9.9
16.1 (4)    27.2 (4 x 512)           56.8      < 10^-6    -5.1
12.7 (1)    36.6 (4 x 1024)          50.7      < 10^-6    -11.5

- Listening test results: the DNN-based systems achieved higher preference scores than the HMM-based systems in all three comparisons.
RNN-based speech synthesis
[Figure: Text -> Text Analysis -> Input Feature Extraction -> input features -> RNN -> output features -> Vocoder -> Waveform]

- Access long-range context in both forward and backward directions using biLSTMs
- Inference is expensive; these models inherently have large latency
Image from Fan et al., “TTS synthesis with BLSTM-based RNNs”, 2014
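As a rough illustration of such a model, here is a minimal PyTorch sketch of a BLSTM acoustic model: frame-level linguistic input features in, per-frame acoustic features for the vocoder out. The dimensions and the two-layer depth are assumptions for illustration, not the setup from Fan et al.

```python
import torch
import torch.nn as nn

class BLSTMAcousticModel(nn.Module):
    def __init__(self, in_dim=355, hidden=256, out_dim=127):
        super().__init__()
        self.blstm = nn.LSTM(in_dim, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, out_dim)   # forward + backward states

    def forward(self, x):                  # x: (batch, frames, in_dim)
        h, _ = self.blstm(x)               # h: (batch, frames, 2 * hidden)
        return self.proj(h)                # per-frame acoustic features

model = BLSTMAcousticModel()
y = model(torch.randn(1, 200, 355))        # one utterance of 200 frames
```

The backward recurrence is what gives access to future context, but it also means the whole utterance must be processed before outputs can be produced, which is the latency issue noted above.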