  1. Text-to-Speech Synthesis
     Bernd Möbius, Language Science and Technology, Saarland University
     Lecture 4, June 4, 2020: Diphone Synthesis
     B Möbius Concatenative synthesis

  2. Concatenative synthesis: general procedure
     ▪ Data-based, concatenative synthesis
     ▪ offline:
       ▪ extract units from recordings of natural speech
       ▪ store one (the best) token of each unit in acoustic unit inventory (corpus)
     ▪ online:
       ▪ retrieve required units from inventory
       ▪ concatenate units sequentially and smoothly
       ▪ impose prosody (F0, duration, (amplitude))
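
The online steps can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation: the boundary symbol `#`, the function names, and the toy inventory of made-up sample values are all assumptions, and the sketch performs plain concatenation only (no smoothing and no prosody modification).

```python
def phones_to_diphones(phones, boundary="#"):
    """Turn a phone sequence into the names of the diphones that span it.

    The boundary symbol "#" (silence at utterance edges) is an assumption.
    """
    padded = [boundary] + list(phones) + [boundary]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


def synthesize(phones, inventory):
    """Retrieve each required unit and concatenate its samples in order."""
    samples = []
    for name in phones_to_diphones(phones):
        if name not in inventory:
            raise KeyError(f"diphone {name!r} missing from inventory")
        samples.extend(inventory[name])
    return samples


# Toy inventory: each unit is a short list of made-up sample values.
toy_inventory = {
    "#-t": [0.0, 0.1],
    "t-i": [0.2, 0.3],
    "i-m": [0.4, 0.5],
    "m-#": [0.1, 0.0],
}

units = phones_to_diphones(["t", "i", "m"])
signal = synthesize(["t", "i", "m"], toy_inventory)
```

A real system would follow the retrieval step with smoothing at the joins and with F0 and duration modification, which this sketch leaves out.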

  3. Concatenative synthesis: basic unit
     ▪ Which acoustic units are appropriate?
       ▪ allophones? [Eng/Ger: 45; Hawaiian: 13; !Xóõ: 159]
       ▪ diphones? [Eng/Ger: 2,025]
       ▪ triphones? [Eng/Ger: 91,125]
       ▪ syllables? [Eng/Ger: 12,500+; Jap: 110]
     ▪ Default case in these slides, unless noted otherwise: diphone as basic unit
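
The diphone and triphone figures for English/German follow from the allophone count by simple combinatorics. The short sketch below reproduces them; the function name `max_units` is hypothetical, and the results are upper bounds, since phonotactic constraints rule out some sequences.

```python
def max_units(n_phones, span):
    """Upper bound on the number of units spanning `span` phones."""
    return n_phones ** span


n_allophones = 45                      # Eng/Ger figure from the slide
diphones = max_units(n_allophones, 2)  # all ordered pairs of allophones
triphones = max_units(n_allophones, 3) # all ordered triples of allophones
```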

  4. Allophone synthesis (visited again)

  5. Basic unit: diphone [figure: example diphones ə-v, v-ɛ, ɛ-s]

  6. Acoustic inventory construction
     ▪ Steps involved in constructing acoustic unit inventories for concatenative speech synthesis
       ▪ inventory design: list of required units (types)
       ▪ selection or construction of text material
       ▪ speaker selection
       ▪ recordings
       ▪ selection of best candidate (token) of each unit (type)
       ▪ unit extraction ('cutting' = indexing)
       ▪ fixed or flexible cut points?

  7. Acoustic inventory design
     ▪ Comprise all relevant phonemic/allophonic variants (spectral properties, individual vowel space)
     ▪ Cover all well-formed sound sequences of target language (phonotactics, also across word boundaries)
     ▪ Model the most important coarticulatory effects (devoicing, rounding, nasalization, …)
     ▪ Concatenate units without audible discontinuities ('cuttability', unit candidate selection)
     ▪ Reasonable inventory size (recording time, quality control)

  8. Individual vowel space [figure: vowel space plot with axes F2 and F1]

  9. Individual vowel space

  10. Coarticulation (voicing) [figure: segment labels ǝ v ɛ and k v̥ ɛ]

  11. Coarticulation (voicing)

  12. 'Cuttability': hard cuts in locations of minimal spectral change
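
One way to locate a point of minimal spectral change is to measure the distance between consecutive analysis frames and cut where it is smallest. The sketch below assumes each frame is a tuple of formant values and uses Euclidean distance; both choices, the function names, and the sample values are illustrative assumptions rather than the procedure used in the lecture.

```python
import math


def spectral_change(frames):
    """Euclidean distance between each pair of consecutive feature frames."""
    return [math.dist(a, b) for a, b in zip(frames, frames[1:])]


def best_cut(frames):
    """Index i such that the change from frames[i] to frames[i + 1] is smallest.

    Requires at least two frames.
    """
    changes = spectral_change(frames)
    return min(range(len(changes)), key=changes.__getitem__)


# Made-up (F1, F2) values in Hz for four consecutive frames.
frames = [(300, 2200), (320, 2150), (322, 2148), (400, 1900)]
cut = best_cut(frames)  # the spectrum is most stable between frames 1 and 2
```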

  13. Required units
      ▪ Why is the prediction of required units (types) difficult?
        ▪ speaker-specific properties of spoken language
        ▪ individual vowel space
        ▪ coarticulation and context-sensitivity
        ▪ sounds from foreign languages
      ▪ Criteria
        ▪ language-specific phonotactic constraints
        ▪ acoustic properties of speech sounds: some diphone types may not be required
        ▪ textbook vs. phonetic reality (cf. vowel space)

  14. Text materials
      ▪ Selection or construction of text material for recordings, covering required units
      ▪ "natural" sentences
        ▪ large phonetic variation
        ▪ selection by greedy algorithm
        ▪ relatively small number of sentences
      ▪ carrier sentences
        ▪ controlled segmental and prosodic context
        ▪ constructed nonsense sentences or words, e.g. for /I-m/: "Er hatte Timmerei gesagt." "He said timmy again."
        ▪ relatively large number of sentences
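
Greedy selection of "natural" sentences can be sketched as a set-cover loop: repeatedly pick the sentence that covers the most still-missing unit types. The data layout (a dict from sentence id to the set of diphone types it contains), the function name, and the toy data are assumptions for illustration only.

```python
def greedy_select(sentences, required):
    """Pick sentences until every required unit type is covered, if possible.

    sentences: dict mapping sentence id -> set of unit types it contains.
    Returns the chosen sentence ids and the unit types that no sentence covers.
    """
    uncovered = set(required)
    chosen = []
    while uncovered and sentences:
        # Sentence contributing the most unit types that are still missing.
        best = max(sentences, key=lambda s: len(sentences[s] & uncovered))
        gain = sentences[best] & uncovered
        if not gain:
            break  # the remaining units occur in no candidate sentence
        chosen.append(best)
        uncovered -= gain
    return chosen, uncovered


candidates = {
    "s1": {"t-i", "i-m", "m-a"},
    "s2": {"t-i", "k-i"},
    "s3": {"k-i", "i-n", "n-a"},
    "s4": {"m-a"},
}
required = {"t-i", "i-m", "m-a", "k-i", "i-n", "z-u"}
chosen, missing = greedy_select(candidates, required)
```

Units left in `missing` would have to be recorded some other way, for example with constructed carrier sentences.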

  15. Speaker selection
      ▪ Criteria for selecting a good voice ("voice talent")
        ▪ professional or "naïve" speaker?
        ▪ longer-term availability
        ▪ Is the voice pleasant (auditive-aesthetical)?
        ▪ Is the voice robust against signal processing?
        ▪ Does the voice remain pleasant after resynthesis?

  16. Speaker selection
      ▪ Formal procedure [Syrdal et al. 1997, 1998; Schweitzer et al. 2006]
        ▪ "mini" TTS
        ▪ perception test with 3 voices, 15 sentences each
        ▪ intelligibility and pleasantness judgments (5-point scale)
        ▪ comparison for several factors
          ▪ signal processing method (e.g. PSOLA, HNM)
          ▪ RMS energy in voiceless regions
          ▪ spectral balance
          ▪ F0 variability
        ▪ different results for male vs. female voices

  17. Recordings
      ▪ Recording conditions and practical considerations
        ▪ anechoic booth, or at least sound-treated studio
        ▪ professional microphone and headset
        ▪ parallel recording of speech and laryngograph signals
        ▪ auditory monitoring of extraneous noises
        ▪ phonetic monitoring of target units
        ▪ automatic recording regime, parallel back-up device
        ▪ monotonous or flat speaking style (?)
        ▪ all recordings in one session (?)
        ▪ make-up sessions for bad units

  18. Unit candidate selection
      ▪ Selection of best candidate (token) of each unit (type)
      ▪ Objectives [Olive et al. 1998; Möbius 2001]
        ▪ find optimal cut and concatenation points
        ▪ cause minimal inter-segmental discontinuities
        ▪ optimal representation of target speech sounds
      ▪ Problem: phonetic variability
        ▪ systematic variation (coarticulation)
        ▪ random variability

  19. Unit candidate selection: coarticulation
      ▪ Effects of prevocalic consonants on vowel formants (early, mid, late in vowel)

  20. Unit candidate selection: coarticulation
      ▪ Effects of postvocalic consonants on vowel formants (early, mid, late in vowel)

  21. Unit candidate selection: procedure
      ▪ Selection of best candidate (token) of each unit (type)
      ▪ Globally optimal selection, minimizing spectral discrepancies between any two diphones that can be concatenated (e.g. /t-i/ and /i-m/)
      ▪ Search for ideal point in F1, F2, F3 space [Olive et al. 1998; Möbius 2001]
        ▪ exhaustive search
        ▪ iterative grid search
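
The exhaustive variant of the search can be sketched as follows, under one possible reading of the objective: find the grid point in formant space that has, for as many diphone types as possible, at least one candidate token within a per-formant tolerance. The objective, the tolerance box, the grid spacing, the function names, and the formant values are all assumptions made for illustration; the cited work may define the criterion differently.

```python
from itertools import product


def covered_types(point, tokens, tol):
    """Diphone types with at least one token within tol of point on every formant."""
    return {
        dtype
        for dtype, formants in tokens
        if all(abs(f - p) <= t for f, p, t in zip(formants, point, tol))
    }


def grid_search(tokens, grid_axes, tol):
    """Score every grid point; return the first point with maximal type coverage."""
    best_point, best_cover = None, set()
    for point in product(*grid_axes):
        cover = covered_types(point, tokens, tol)
        if len(cover) > len(best_cover):
            best_point, best_cover = point, cover
    return best_point, best_cover


# Made-up (F1, F2, F3) measurements in Hz at the vowel-side cut of each token.
tokens = [
    ("k-i", (280, 2250, 3000)),
    ("k-i", (310, 2150, 2950)),
    ("i-t", (300, 2200, 3000)),
    ("i-m", (330, 2100, 2900)),
    ("g-i", (290, 2180, 2980)),
]
grid_axes = [range(250, 351, 50), range(2000, 2301, 100), range(2900, 3101, 100)]
tol = (40, 120, 120)
point, cover = grid_search(tokens, grid_axes, tol)
```

An iterative grid search would start from a coarse grid like this one and then refine the spacing around the best point instead of scoring a fine grid everywhere.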

  22. Optimal cut and concatenation point
      diphone   R[1]     R[2]
      k-i       ✓✓✓✓✓    ✓✓
      i-t       ✓✓✓✓✓    ✓✓
      g-i       x        ✓
      i-m       x        ✓
      d-i       x        ✓
      i-n       x        ✓
      l-i       ✓        ✓
      i-k       ✓        ✓
      m-i       x        ✓
      i-d       x        ✓
      ▪ region [1] covers 12 diphones (tokens), 4 types
      ▪ region [2] covers all 10 diphone types (ideal point)

  23. Optimal cut and concatenation point
      ▪ Evaluating spectral discrepancies at concatenation point:
        $\mathrm{DMAX} = \max_{i \in \{1,2,3\}} \frac{|T_i - F_i|}{B_i}$
        where $T_i$ = target formant values (data-based), $F_i$ = actual formant values (measured), $B_i$ = formant bandwidths (postulated)
      ▪ DMAX: maximal acceptable formant discrepancy
        ▪ here: threshold set by expert
        ▪ desired: perceptually motivated threshold
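
The discrepancy measure translates directly into code. In the sketch below the formant values, the bandwidths, and the threshold of 1.0 are made-up illustration values, not figures from the lecture; the function name `dmax` is likewise hypothetical.

```python
def dmax(targets, measured, bandwidths):
    """Largest bandwidth-normalized formant discrepancy over F1, F2, F3."""
    return max(abs(t - f) / b for t, f, b in zip(targets, measured, bandwidths))


targets = (300, 2200, 3000)    # T_i: target formant values in Hz
measured = (330, 2100, 2950)   # F_i: formants measured at the candidate cut point
bandwidths = (60, 100, 120)    # B_i: postulated formant bandwidths in Hz

score = dmax(targets, measured, bandwidths)
threshold = 1.0                # hypothetical expert-set acceptance threshold
acceptable = score <= threshold
```

Dividing by the bandwidth puts the three formants on a comparable scale, so a 100 Hz deviation in F2 counts for more here than a 50 Hz deviation in F3.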

  24. Unit candidate selection: problems
      ▪ Choice of appropriate speech representation (formants?)
      ▪ Choice of distance measure (perceptually motivated?)
        ▪ absolute distance vs. change of direction
      ▪ What to do if no suitable candidate is available?
      ▪ Need for diagnostic tools
      ▪ Criteria for selecting consonant candidates?
        ▪ e.g. amplitude profile, spectral balance
      ▪ Weighting of vocalic vs. consonantal features

  25. Final selection for inventory
      ▪ Selection of best candidate for each required diphone
        ▪ final selection of best candidate (if more than one meets the DMAX criterion)
        ▪ final selection of cut point (if more than one meets the DMAX criterion)
        ▪ automatically (objectively best candidate/cut point)
        ▪ interactively (subjective decision by expert)
      ▪ build inventory
        ▪ extract speech signal intervals of selected diphones
        ▪ produce index file with diphone start and end points in corpus (preferred)
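
An index file of this kind can be as simple as one row per diphone with the recording it lives in and its start and end times. The CSV layout, the column names, and the toy corpus below are hypothetical; the slides do not specify a file format.

```python
import csv
import io


def load_index(text):
    """Parse index rows into {diphone: (recording id, start s, end s)}."""
    index = {}
    for row in csv.DictReader(io.StringIO(text)):
        index[row["diphone"]] = (row["file"], float(row["start"]), float(row["end"]))
    return index


def extract(index, name, corpus, sample_rate):
    """Slice the samples of one diphone out of its recording."""
    rec, start, end = index[name]
    return corpus[rec][round(start * sample_rate):round(end * sample_rate)]


# Hypothetical index contents and a toy corpus sampled at 10 Hz.
index_text = """diphone,file,start,end
t-i,rec01,0.2,0.5
i-m,rec01,0.5,0.9
"""
corpus = {"rec01": list(range(20))}

index = load_index(index_text)
t_i = extract(index, "t-i", corpus, sample_rate=10)
i_m = extract(index, "i-m", corpus, sample_rate=10)
```

Keeping only start and end points leaves the recordings untouched, so cut points can be revised later without re-extracting audio.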

  26. Concatenative synthesis: Summary
      ▪ Synthesis by re-sequencing and concatenating selected units of natural speech (typically: diphones)
        + units comprise dynamic phone-to-phone transitions
        + units cover local coarticulatory effects
        − longer-range coarticulation not covered
        − signal processing at least for smoothing concatenation
        • signal processing for prosodic modifications
        • compromise between coverage and inventory size
      ▪ Standard synthesis technique in the 1990s
        ▪ suboptimal naturalness
        ▪ stable, predictable quality

  27. Essential content: diphone synthesis
      ▪ What is a diphone?
      ▪ What is the motivation for using the diphone as the basic synthesis unit rather than phones?
      ▪ Which procedures can be used to ensure that the concatenation between any two diphones is maximally smooth or, in other words, that the discontinuities caused by concatenation are minimized?
