Text-to-Speech Synthesis (Bernd Möbius)

SLIDE 1

Bernd Möbius

Language Science and Technology, Saarland University
Lecture 4, June 4, 2020
Diphone Synthesis

B Möbius Concatenative synthesis

Text-to-Speech Synthesis

SLIDE 2

Concatenative synthesis: general procedure

▪ Data-based, concatenative synthesis
  ▪ offline:
    ▪ extract units from recordings of natural speech
    ▪ store one (the best) token of each unit in acoustic unit inventory (corpus)
  ▪ online:
    ▪ retrieve required units from inventory
    ▪ concatenate units sequentially and smoothly
    ▪ impose prosody (F0, duration, (amplitude))

SLIDE 3

Concatenative synthesis: basic unit

▪ Which acoustic units are appropriate?
  ▪ allophones? [Eng/Ger: 45; Hawaiian: 13; !Xóõ: 159]
  ▪ diphones? [Eng/Ger: 2,025]
  ▪ triphones? [Eng/Ger: 91,125]
  ▪ syllables? [Eng/Ger: 12,500+; Jap: 110]
▪ Default case in these slides, unless noted otherwise: diphone as basic unit
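The diphone and triphone figures quoted for English/German are the upper bounds n² and n³ for n = 45 allophones. A minimal sketch of that arithmetic follows; the function name is illustrative and not taken from the lecture.

```python
def max_unit_types(n_phones: int, unit_length: int) -> int:
    """Upper bound on distinct units made of `unit_length` consecutive phones."""
    return n_phones ** unit_length


N_ALLOPHONES = 45  # English/German allophone count quoted on the slide

diphones = max_unit_types(N_ALLOPHONES, 2)
triphones = max_unit_types(N_ALLOPHONES, 3)
print(f"diphones: {diphones:,}  triphones: {triphones:,}")
```

These are ceilings only: phonotactic constraints mean that not every combination occurs in the language, so the number of units actually required is typically smaller.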

SLIDE 4

Allophone synthesis (visited again)

SLIDE 5

Basic unit: diphone

v-ɛ ə-v ɛ-s
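A diphone spans the transition from one phone into the next, so a phone sequence maps onto overlapping pairs of neighbouring phones. The sketch below shows that mapping. The function name, the "#" silence symbol used for padding, and the example phone sequence (chosen so that it contains the three diphones labelled on this slide) are assumptions made for illustration.

```python
def to_diphones(phones, boundary="#"):
    """Turn a phone sequence into the diphone sequence that covers it.

    The sequence is padded with a boundary (silence) symbol on both sides
    so that the first and last phone also get a transition unit.
    """
    padded = [boundary, *phones, boundary]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


# Hypothetical phone sequence containing ə-v, v-ɛ and ɛ-s
print(to_diphones(["ə", "v", "ɛ", "s"]))
```

A sequence of n phones yields n + 1 diphones when padded this way.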

SLIDE 6

Acoustic inventory construction

▪ Steps involved in constructing acoustic unit inventories for concatenative speech synthesis
  ▪ inventory design: list of required units (types)
  ▪ selection or construction of text material
  ▪ speaker selection
  ▪ recordings
  ▪ selection of best candidate (token) of each unit (type)
  ▪ unit extraction ('cutting' = indexing)
    ▪ fixed or flexible cut points?

SLIDE 7

Acoustic inventory design

▪ Comprise all relevant phonemic/allophonic variants (spectral properties, individual vowel space)
▪ Cover all well-formed sound sequences of target language (phonotactics, also across word boundaries)
▪ Model the most important coarticulatory effects (devoicing, rounding, nasalization, …)
▪ Concatenate units without audible discontinuities ('cuttability', unit candidate selection)
▪ Reasonable inventory size (recording time, quality control)

SLIDE 8

Individual vowel space

[Figure: individual vowel space; axes F2 and F1]

SLIDE 9

Individual vowel space

SLIDE 10

Coarticulation (voicing)

[Figure: coarticulation of voicing; segment labels: ǝ v ɛ, k v̥ ɛ]

SLIDE 11

Coarticulation (voicing)

SLIDE 12

'Cuttability'

Hard cuts in locations of minimal spectral change
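One way to read "minimal spectral change" is sketched below. It assumes that each analysis frame is a vector of spectral parameters (here formant values in Hz) and that change is measured as the Euclidean distance between consecutive frames. Neither choice is specified on the slide, and the function name and example values are made up for illustration.

```python
import math


def min_change_frame(frames):
    """Return index i where the change from frames[i] to frames[i + 1] is smallest."""
    if len(frames) < 2:
        raise ValueError("need at least two frames to measure change")
    changes = [math.dist(a, b) for a, b in zip(frames, frames[1:])]
    return min(range(len(changes)), key=changes.__getitem__)


# Hypothetical (F1, F2) track through a vowel: the middle is nearly stationary
track = [(500, 1500), (520, 1480), (522, 1479), (600, 1300)]
cut = min_change_frame(track)
print("cut between frames", cut, "and", cut + 1)
```

A real system would probably work on a richer spectral representation and smooth the track first; the principle of cutting where neighbouring frames differ least stays the same.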

SLIDE 13

Required units

▪ Why is the prediction of required units (types) difficult?
  ▪ speaker-specific properties of spoken language
  ▪ individual vowel space
  ▪ coarticulation and context-sensitivity
  ▪ sounds from foreign languages
▪ Criteria
  ▪ language-specific phonotactic constraints
  ▪ acoustic properties of speech sounds: some diphone types may not be required
  ▪ textbook vs. phonetic reality (cf. vowel space)

SLIDE 14

Text materials

▪ Selection or construction of text material for recordings, covering required units
  ▪ "natural" sentences
    ▪ large phonetic variation
    ▪ selection by greedy algorithm
    ▪ relatively small number of sentences
  ▪ carrier sentences
    ▪ controlled segmental and prosodic context
    ▪ constructed nonsense sentences or words
    ▪ example for /I-m/: "Er hatte Timmerei gesagt." / "He said timmy again."
    ▪ relatively large number of sentences
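The greedy selection of "natural" sentences can be sketched as a set-cover loop: repeatedly pick the sentence that adds the most still-missing units. The lecture does not spell out the algorithm, so the function name, the data layout (sentence id mapped to the set of diphone types it contains) and the toy corpus are assumptions.

```python
def greedy_select(sentences, required):
    """Pick sentences until all required units are covered or nothing helps.

    sentences: dict mapping sentence id -> set of unit types it contains
    required:  set of unit types the inventory needs
    Returns (chosen sentence ids in order, units that could not be covered).
    """
    uncovered = set(required)
    chosen = []
    while uncovered and sentences:
        # Sentence contributing the largest number of still-missing units
        best = max(sentences, key=lambda s: len(sentences[s] & uncovered))
        gain = sentences[best] & uncovered
        if not gain:  # no remaining sentence covers anything new
            break
        chosen.append(best)
        uncovered -= gain
    return chosen, uncovered


# Hypothetical toy corpus
corpus = {
    "s1": {"t-i", "i-m"},
    "s2": {"i-m", "m-i", "k-i"},
    "s3": {"k-i"},
    "s4": {"t-i", "l-i"},
}
needed = {"t-i", "i-m", "m-i", "k-i", "l-i"}
picked, missing = greedy_select(corpus, needed)
print(picked, missing)
```

Greedy selection does not guarantee the smallest possible sentence set, but it tends to keep the recording script short. Any units left in the second return value would have to be covered by constructed carrier sentences.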

SLIDE 15

Speaker selection

▪ Criteria for selecting a good voice ("voice talent")
  ▪ professional or "naïve" speaker?
  ▪ longer-term availability
  ▪ Is the voice pleasant (auditive-aesthetical)?
  ▪ Is the voice robust against signal processing?
  ▪ Does the voice remain pleasant after resynthesis?

SLIDE 16

Speaker selection

▪ Formal procedure [Syrdal et al. 1997, 1998; Schweitzer et al. 2006]
  ▪ "mini" TTS
  ▪ perception test with 3 voices, 15 sentences each
  ▪ intelligibility and pleasantness judgments (5-point scale)
  ▪ comparison for several factors
    ▪ signal processing method (e.g. PSOLA, HNM)
    ▪ RMS energy in voiceless regions
    ▪ spectral balance
    ▪ F0 variability
  ▪ different results for male vs. female voices

SLIDE 17

Recordings

▪ Recording conditions and practical considerations
  ▪ anechoic booth, or at least sound-treated studio
  ▪ professional microphone and headset
  ▪ parallel recording of speech and laryngograph signals
  ▪ auditory monitoring of extraneous noises
  ▪ phonetic monitoring of target units
  ▪ automatic recording regime, parallel back-up device
  ▪ monotonous or flat speaking style (?)
  ▪ all recordings in one session (?)
  ▪ make-up sessions for bad units

SLIDE 18

Unit candidate selection

▪ Selection of best candidate (token) of each unit (type)
▪ Objectives [Olive et al. 1998; Möbius 2001]
  ▪ find optimal cut and concatenation points
  ▪ cause minimal inter-segmental discontinuities
  ▪ optimal representation of target speech sounds
▪ Problem: phonetic variability
  ▪ systematic variation (coarticulation)
  ▪ random variability

SLIDE 19

Unit candidate selection: coarticulation

▪ Effects of prevocalic consonants on vowel formants (early, mid, late in vowel)

SLIDE 20

Unit candidate selection: coarticulation

▪ Effects of postvocalic consonants on vowel formants (early, mid, late in vowel)

SLIDE 21

Unit candidate selection: procedure

▪ Selection of best candidate (token) of each unit (type)
▪ Globally optimal selection, minimizing spectral discrepancies between any two diphones that can be concatenated (e.g. /t-i/ and /i-m/)
▪ Search for ideal point in F1, F2, F3 space [Olive et al. 1998; Möbius 2001]
  ▪ exhaustive search
  ▪ iterative grid search
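The exhaustive variant of the search can be sketched as follows: try every point on a grid in (F1, F2, F3) space and keep the one that lies within tolerance of at least one token for the largest number of diphone types. This is an illustration under stated assumptions, not the procedure from the cited papers: the bandwidth-normalized distance follows the discrepancy measure given later in the lecture, while the grid, bandwidths, threshold, token values and all names are invented here.

```python
from itertools import product


def normalized_distance(point, formants, bandwidths):
    """Largest bandwidth-normalized formant difference between two points."""
    return max(abs(p - f) / b for p, f, b in zip(point, formants, bandwidths))


def exhaustive_search(tokens, grid, bandwidths, threshold):
    """Return the grid point covering the most diphone types, and those types.

    tokens: list of (diphone_type, (F1, F2, F3)) measured at a candidate cut
    grid:   three sequences of values to try for F1, F2 and F3
    """
    best_point, best_types = None, set()
    for point in product(*grid):
        covered = {
            typ
            for typ, formants in tokens
            if normalized_distance(point, formants, bandwidths) <= threshold
        }
        if len(covered) > len(best_types):  # ties keep the earlier point
            best_point, best_types = point, covered
    return best_point, best_types


# Hypothetical tokens for diphones sharing the vowel /i/
tokens = [
    ("k-i", (280, 2250, 3000)),
    ("i-t", (300, 2200, 2950)),
    ("g-i", (320, 2150, 2900)),
    ("i-m", (400, 1900, 2700)),
]
grid = (range(250, 451, 50), range(1900, 2301, 100), range(2700, 3001, 100))
point, types = exhaustive_search(tokens, grid, (50, 100, 150), threshold=1.0)
print(point, sorted(types))
```

An iterative grid search would presumably start from a coarse grid like this one and refine the spacing around the best point, trading completeness for speed.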

SLIDE 22

Optimal cut and concatenation point

region [1] covers 12 diphones (tokens), 4 types
region [2] covers all 10 diphone types (ideal point)

diph.   R[1]     R[2]
k-i     ✓✓✓✓✓    ✓✓
i-t     ✓✓✓✓✓    ✓✓
g-i     x        ✓
i-m     x        ✓
d-i     x        ✓
i-n     x        ✓
l-i     ✓        ✓
i-k     ✓        ✓
m-i     x        ✓
i-d     x        ✓

SLIDE 23

Optimal cut and concatenation point

▪ Evaluating spectral discrepancies at concatenation point:

DMAX = max( |Ti − Fi| / Bi ), i ∈ {1, 2, 3}

Ti = target formant values (data-based)
Fi = actual formant values (measured)
Bi = formant bandwidths (postulated)

▪ DMAX: maximal acceptable formant discrepancy
  ▪ here: threshold set by expert
  ▪ desired: perceptually motivated threshold
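A worked example of the criterion, reading the formula as a per-candidate discrepancy that is then compared with the expert-set threshold. All numbers (targets, measurements, bandwidths, threshold) and the function name are invented for illustration.

```python
def formant_discrepancy(target, measured, bandwidths):
    """Return (largest normalized discrepancy, list of per-formant terms)."""
    terms = [abs(t - f) / b for t, f, b in zip(target, measured, bandwidths)]
    return max(terms), terms


target = (300.0, 2200.0, 2900.0)    # Ti: hypothetical target F1, F2, F3 in Hz
measured = (320.0, 2150.0, 2960.0)  # Fi: hypothetical measured values in Hz
bandwidths = (50.0, 100.0, 150.0)   # Bi: hypothetical postulated bandwidths in Hz
THRESHOLD = 1.0                     # assumed threshold; the lecture gives no value

worst, terms = formant_discrepancy(target, measured, bandwidths)
accepted = worst <= THRESHOLD
print(terms, worst, accepted)
```

Here F2 contributes the largest term (50 Hz off with a 100 Hz bandwidth), so it alone decides whether the candidate passes. Dividing by the bandwidth makes deviations in the three formants comparable, although their absolute ranges differ.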

SLIDE 24

Unit candidate selection: problems

▪ Choice of appropriate speech representation (formants?)
▪ Choice of distance measure (perceptually motivated?)
  ▪ absolute distance vs. change of direction
▪ What to do if no suitable candidate is available?
▪ Need for diagnostic tools
▪ Criteria for selecting consonant candidates?
  ▪ e.g. amplitude profile, spectral balance
▪ Weighting of vocalic vs. consonantal features

SLIDE 25

Final selection for inventory

▪ Selection of best candidate for each required diphone
  ▪ final selection of best candidate (if more than one meets the DMAX criterion)
  ▪ final selection of cut point (if more than one meets the DMAX criterion)
    ▪ automatically (objectively best candidate/cut point)
    ▪ interactively (subjective decision by expert)
▪ build inventory
  ▪ extract speech signal intervals of selected diphones
  ▪ produce index file with diphone start and end points in corpus (preferred)
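The index-file option can be pictured as a table of pointers into the recorded corpus, with the speech samples themselves left in place. The lecture does not specify a format, so the field names, the use of sample offsets and the toy data below are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiphoneEntry:
    name: str       # diphone type, e.g. "t-i"
    recording: str  # id of the corpus recording holding the token
    start: int      # first sample of the unit (inclusive)
    end: int        # one past the last sample (exclusive)


def extract(entry, corpus):
    """Look up the samples of one diphone via its index entry."""
    return corpus[entry.recording][entry.start:entry.end]


# Hypothetical corpus: one recording with 100 dummy samples
corpus = {"rec001": list(range(100))}
index = {"t-i": DiphoneEntry("t-i", "rec001", start=10, end=25)}

samples = extract(index["t-i"], corpus)
print(len(samples), samples[0], samples[-1])
```

One practical advantage of indexing over storing cut-out signal intervals is that a cut point can be adjusted later by editing two numbers, without touching the recordings.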

SLIDE 26

Concatenative synthesis: Summary

▪ Synthesis by re-sequencing and concatenating selected units of natural speech (typically: diphones)
  + units comprise dynamic phone-to-phone transitions
  + units cover local coarticulatory effects
  − longer-range coarticulation not covered
  − signal processing at least for smoothing concatenation
  ± signal processing for prosodic modifications
  ± compromise between coverage and inventory size
▪ Standard synthesis technique in the 1990s
  ▪ suboptimal naturalness
  ▪ stable, predictable quality

SLIDE 27

Essential content: diphone synthesis

▪ What is a diphone?
▪ What is the motivation for using the diphone as the basic synthesis unit rather than phones?
▪ Which procedures can be used to ensure that the concatenation between any two diphones is maximally smooth or, in other words, that the discontinuities caused by concatenation are minimized?