Text-to-Speech synthesis using OpenMARY: An introduction and practical tutorial (PowerPoint PPT presentation)




SLIDE 1

Text-to-Speech synthesis using OpenMARY

An introduction and practical tutorial
Marc Schröder, DFKI

marc.schroeder@dfki.de
eNTERFACE, Amsterdam, 14 July 2010

SLIDE 2

Marc Schröder, DFKI 2

Overview

• Some Text-to-Speech (TTS) basics
  • Natural Language Processing
  • Generating the sound: diphone synthesis, unit selection synthesis, HMM-based synthesis
• OpenMARY
  • the existing MARY 4.0 system
  • toolkit for adding new languages and voices
• Tutorial overview: what you will learn to do in the tutorial

SLIDE 3

What is text-to-speech synthesis?

“You have one message from Dr Johnson.”

TTS

SLIDE 4

Applications of TTS

• Text readers
  • for the blind
  • in eyes-free environments (e.g., while driving)
• Telephone-based voice portals
• Multi-modal interactive systems
  • talking heads, “embodied conversational agents” (ECAs)

SLIDE 5

A Talking Head

“Hello, nice to meet you.”

The TTS provides information on timing and mouth shapes.

SLIDE 6

Structure of a TTS system

Two stages:

• Text analysis (natural language processing techniques): turns the input, either plain text or a speech synthesis markup document (e.g. SSML), into a phonetic transcription plus prosodic parameters (intonation specification, pausing and speech timing).
• Audio generation (signal processing techniques): turns those parameters into a wave file.

Representations: TEXT / SSML → ACOUSTPARAMS → AUDIO
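The pipeline of named representations on these slides can be sketched as an ordered list of stage names. This is an illustrative helper only, not part of the MARY code base; the stage names are taken from the slides (slide 7's sequence, which goes directly from INTONATION to ACOUSTPARAMS).

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: the representation names come from these
// slides; the class itself is not MARY's actual API.
public class TtsPipeline {
    static final List<String> STAGES = Arrays.asList(
            "TEXT", "RAWMARYXML", "TOKENS", "WORDS", "PARTSOFSPEECH",
            "PHONEMES", "ALLOPHONES", "INTONATION", "ACOUSTPARAMS", "AUDIO");

    // Representation produced from the given one, or null after AUDIO.
    static String nextStage(String current) {
        int i = STAGES.indexOf(current);
        return (i >= 0 && i < STAGES.size() - 1) ? STAGES.get(i + 1) : null;
    }
}
```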

SLIDE 7

Structure of a TTS system: MARY TTS

Text analysis

Input markup parser

TEXT or SSML → RAWMARYXML

Shallow NLP

RAWMARYXML → PARTSOFSPEECH

Phonemiser

PARTSOFSPEECH → ALLOPHONES

Symbolic prosody

ALLOPHONES → INTONATION

Acoustic parameters

INTONATION → ACOUSTPARAMS

Audio generation

waveform synthesis ACOUSTPARAMS → AUDIO

SLIDE 8

System structure: Input markup parser

System-internal XML representation MaryXML => parsing speech synthesis markup is a simple XML transformation

Uses XSLT => easily adaptable to new markup languages

TEXT or SSML → RAWMARYXML

SLIDE 9

System structure: Shallow NLP

Tokeniser

RAWMARYXML → TOKENS: sentence boundaries; “tokens” = word-like units

Text normalisation

TOKENS → WORDS: expanded, pronounceable forms (see next slide)

Part-of-speech tagger

WORDS → PARTSOFSPEECH

SLIDE 10

Preprocessing / Text normalisation

• Net patterns (email, web addresses): info@dfki.de
• Date patterns: 23/07/2001
• Time patterns: 12:24 h, 12:24
• Duration patterns: 12:24 h, 12 h 24 min
• Currency patterns: 12.95 €
• Measure patterns: 123.09 km
• Telephone number patterns: +49-681-85775-5303
• Number patterns (cardinal, ordinal, roman): 3, 3rd, III.
• Abbreviations: engl.
• Special characters: &
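The pattern classes above are matched and expanded to pronounceable words. A minimal sketch for three of them, in Java; the replacement wordings are illustrative assumptions, not MARY's actual normalisation rules.

```java
// Illustrative sketch of pattern-based text normalisation; the rules
// and wordings are simplified examples, not MARY's real rule set.
public class TextNormaliser {
    static String normalise(String text) {
        return text
                // special character: "&" -> "and"
                .replace("&", "and")
                // currency pattern: "12.95 €" -> "12.95 euros"
                .replaceAll("(\\d+(?:[.,]\\d+)?)\\s*€", "$1 euros")
                // measure pattern: "123.09 km" -> "123.09 kilometres"
                .replaceAll("(\\d+(?:[.,]\\d+)?)\\s*km\\b", "$1 kilometres");
    }
}
```

A full normaliser would also spell out the digits themselves; that step is omitted here for brevity.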

SLIDE 11

System structure: Phonemisation

Phonemiser

PARTSOFSPEECH → PHONEMES

• lexicon lookup
• letter-to-sound conversion: morphological decomposition, letter-to-sound rules, syllabification, word stress assignment

Custom pronunciation

PHONEMES → ALLOPHONES

• slurring, non-standard pronunciation
• potentially trainable from annotated data of a given person
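The lookup-then-rules strategy can be sketched as follows. The tiny lexicon entry and the one-symbol-per-letter fallback are illustrative assumptions, not MARY's actual lexicon or letter-to-sound rules.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of "lexicon lookup, letter-to-sound fallback"; data and rules
// here are toy assumptions for illustration only.
public class Phonemiser {
    private final Map<String, String> lexicon = new HashMap<>();

    Phonemiser() {
        // hypothetical SAMPA-like entry
        lexicon.put("winter", "w I n t @");
    }

    String phonemise(String word) {
        String key = word.toLowerCase();
        if (lexicon.containsKey(key)) {
            return lexicon.get(key);          // known word: lexicon lookup
        }
        return letterToSound(key);            // unknown word: fall back to rules
    }

    // Crude letter-to-sound fallback: one symbol per letter.
    private String letterToSound(String word) {
        StringBuilder sb = new StringBuilder();
        for (char c : word.toCharArray()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(c);
        }
        return sb.toString();
    }
}
```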

SLIDE 12

System structure: Prosody

“Prosody”?

• intonation (accented syllables; high or low phrase boundaries)
• rhythmic effects (pauses, syllable durations)
• loudness, voice quality

Symbolic prosody prediction

ALLOPHONES → INTONATION

• assign prosody by rule, based on punctuation and part-of-speech
• modelled using “Tones and Break Indices” (ToBI): tonal targets (accents, boundary tones) and phrase breaks
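As a toy illustration of punctuation-driven rules, the sketch below maps a sentence's final punctuation to a ToBI boundary-tone label. The specific mapping is an assumption for illustration, not MARY's actual rule set.

```java
// Illustrative punctuation -> ToBI boundary tone rules; the mapping is
// a simplified assumption, not MARY's implementation.
public class SymbolicProsody {
    static String boundaryTone(String sentence) {
        String s = sentence.trim();
        if (s.endsWith("?")) return "H-H%";  // rising boundary for questions
        if (s.endsWith(",")) return "H-";    // continuation rise at a phrase break
        return "L-L%";                       // falling boundary for statements
    }
}
```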

SLIDE 13

System structure: Calculation of acoustic parameters

Duration prediction

INTONATION → DURATIONS

• segment duration predicted by rules or by decision trees

Contour generation

DURATIONS → ACOUSTPARAMS

• fundamental frequency curve predicted by rules or by decision trees

SLIDE 14

System structure: Waveform synthesis

ACOUSTPARAMS → AUDIO

several waveform generation technologies

SLIDE 15

Creating sound: Waveform synthesis technologies (1)

Formant synthesis

• acoustic model of speech
• generate acoustic structure by rule
• robotic sound

SLIDE 16

Creating sound: Waveform synthesis technologies (2)

Concatenative synthesis

Diphone synthesis
• glue pre-recorded “diphones” together
• adapt prosody through signal processing

Unit selection synthesis
• glue together units from a large corpus of speech
• prosody comes from the corpus; (nearly) no signal processing

SLIDE 17

Creating sound: Waveform synthesis technologies (3)

Statistical-parametric speech synthesis

• with Hidden Markov Models
• models trained on speech corpora
• no corpus data needed at runtime => small footprint

SLIDE 18

Examples of speech synthesis technologies

MARY TTS
• unit selection
• HMM-based
• MBROLA diphones
• expressive unit selection

Commercial
• unit selection: IVONA, Loquendo
• formant synthesis: DecTalk

SLIDE 19

Concatenative synthesis: Isolated phones don't work

target: w I n t r= d eI

acoustic unit database (units = phone segments recorded in isolation): w eI r= a I t n d T

SLIDE 20

Concatenative synthesis: Diphones

target: w I n t r= d eI
diphones: _-w w-I I-n n-t t-r= r=-d d-eI eI-_

acoustic unit database (units = diphone segments recorded in carrier words, flat intonation):
_-w (wonder), w-I (will), I-n (spin), n-t (fountain), t-r= (water), r=-d (nerdy), d-eI (date), eI-_ (away)

Diphones = sound segments from the middle of one phone to the middle of the next phone.
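Building the diphone sequence for a target phone string is mechanical: pad with silence and pair up neighbours. A minimal sketch (illustrative, not MARY code), using "_" for silence at the edges as on the slide:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: turn a target phone sequence into its diphone sequence,
// padding with "_" for leading and trailing silence.
public class Diphones {
    static List<String> toDiphones(List<String> phones) {
        List<String> padded = new ArrayList<>();
        padded.add("_");                 // leading silence
        padded.addAll(phones);
        padded.add("_");                 // trailing silence
        List<String> diphones = new ArrayList<>();
        for (int i = 0; i < padded.size() - 1; i++) {
            // each diphone spans from the middle of one phone
            // to the middle of the next
            diphones.add(padded.get(i) + "-" + padded.get(i + 1));
        }
        return diphones;
    }
}
```

Applied to the slide's target "w I n t r= d eI", this yields exactly the diphone row shown above.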

SLIDE 21

Concatenative synthesis: Diphones (2)

target: w I n t r= d eI
diphones: _-w w-I I-n n-t t-r= r=-d d-eI eI-_

PSOLA pitch manipulation

SLIDE 22

Concatenative synthesis: Unit selection

“Which of these?”

“Let's discuss the question of interchanges another day.”

target: w I n t r= d eI

acoustic unit database (units = (di-)phone segments recorded in natural sentences, with natural intonation)
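“Which of these?” is the core question of unit selection: pick one candidate unit per target position so that the summed target cost (mismatch to the wanted prosody) plus join cost (mismatch between neighbouring units) is minimal. A toy dynamic-programming sketch, where each unit is reduced to a single pitch value and both costs are absolute pitch differences (illustrative assumptions, not MARY's cost functions):

```java
// Toy unit-selection search: minimise sum of target costs plus join
// costs over one candidate per position. Representing units by a single
// pitch value is a simplifying assumption for illustration.
public class UnitSelection {
    // cand[i][j] = pitch of the j-th database candidate for target i.
    static double minTotalCost(double[] target, double[][] cand) {
        double[] best = new double[cand[0].length];
        for (int j = 0; j < best.length; j++) {
            best[j] = Math.abs(cand[0][j] - target[0]);   // target cost only
        }
        for (int i = 1; i < target.length; i++) {
            double[] next = new double[cand[i].length];
            for (int j = 0; j < cand[i].length; j++) {
                double tCost = Math.abs(cand[i][j] - target[i]);
                double bestPrev = Double.POSITIVE_INFINITY;
                for (int k = 0; k < cand[i - 1].length; k++) {
                    double jCost = Math.abs(cand[i][j] - cand[i - 1][k]);
                    bestPrev = Math.min(bestPrev, best[k] + jCost);
                }
                next[j] = tCost + bestPrev;
            }
            best = next;
        }
        double min = Double.POSITIVE_INFINITY;
        for (double b : best) min = Math.min(min, b);
        return min;
    }
}
```

Real systems use many more features per unit (duration, spectrum, context) and weighted cost terms, but the search structure is the same.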

SLIDE 23

AI Poker: The voices of Sam and Max

Sam: unit selection synthesis
• voice specifically recorded for AI Poker
• natural sound within the poker domain

Max: HMM-based synthesis
• sound quality is limited but constant with any text

SLIDE 24

Sam's voice: Unit selection synthesis

several hours of speech recordings => unit selection corpus

“Ich habe zwei Paare.” (“I have two pairs.”)

=> very good quality within the poker domain!

SLIDE 25

Sam's voice: Unit selection synthesis

several hours of speech recordings => unit selection corpus

“Ich kann auch ganz andere Sachen...” (“I can also do quite different things...”)

=> reduced quality with arbitrary text

SLIDE 26

Max's voice: HMM-based synthesis

statistical models

“Ich habe zwei Paare.” (“I have two pairs.”)

Hidden Markov Models → acoustic feature vectors → vocoder

SLIDE 27

Max's voice: HMM-based synthesis

statistical models

“Ich kann auch ganz andere Sachen...” (“I can also do quite different things...”)

Hidden Markov Models → acoustic feature vectors → vocoder

=> constant quality with arbitrary text

SLIDE 28

MARY TTS 4.0

Pure Java

Runs on any platform with Java 5

Client-server architecture

http interface – your browser is a MARY client

Multilingual, with UTF-8 support

• English (US and GB)
• German: “Willkommen”
• Turkish: “Konuşma”
• Telugu

SLIDE 29

Audio effects in MARY 4.0

Some can be applied to any voice:
• vocal tract length (longer – shorter)
• robot effect
• whisper effect
• jet pilot

More effects for HMM-based voices:
• pitch level (higher – lower)
• pitch range (wider – narrower)
• speaking rate (faster – slower)

Can be parameterised & combined to create characteristic voices.

SLIDE 30

MARY TTS: New language support workflow

Wikipedia text import
• Wikipedia XML dump → dump splitter → markup cleaner
• yields the most frequent words in the language and clean text sentences with diphone+prosody features

Script selection
• optimising coverage → selected sentences / script
• manual check, exclude unsuitable sentences
• Redstart → recorded speech db (audio files)

Synthesis components: enable conversion ALLOPHONES → AUDIO in the new voice
• Voice Import Tools → acoustic models for F0 + duration, unit selection voice files, HMM-based voice files
• speaker-specific pronunciation → allophones.xml
• Transcription GUI → pronunciation lexicon, list of function words, letter-to-sound for unknown words

Basic NLP components: enable conversion TEXT → ALLOPHONES in a new locale
• generic implementations with basic functionality: Tokeniser, rudimentary POS tagger, Phonemiser, Symbolic prosody, Feature maker

SLIDE 31

What you will learn to do in the MARY Tutorial

Installing the MARY system
• languages and voices

Interacting with MARY using the web client
• basic experimentation
• interactive test of audio effects
• interactive documentation of the http interface

Triggering TTS from your own software
• http interface
• Java client code
• selecting language, voice and effects in requests
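Triggering TTS over the http interface amounts to building a request URL. A sketch of assembling one: the parameter names (INPUT_TEXT, INPUT_TYPE, OUTPUT_TYPE, AUDIO, LOCALE, VOICE) and default port 59125 follow the MARY 4 http interface; the voice name in the usage note is only an example, so check your server's built-in documentation page for the authoritative parameter list and installed voices.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Sketch of building a MARY http request URL; verify parameter names
// against your server's interactive http documentation page.
public class MaryRequest {
    static String buildUrl(String host, int port, String text,
                           String locale, String voice) {
        try {
            return "http://" + host + ":" + port + "/process"
                    + "?INPUT_TEXT=" + URLEncoder.encode(text, "UTF-8")
                    + "&INPUT_TYPE=TEXT"
                    + "&OUTPUT_TYPE=AUDIO"
                    + "&AUDIO=WAVE"
                    + "&LOCALE=" + locale
                    + "&VOICE=" + voice;
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 must be supported", e);
        }
    }
}
```

Fetching the resulting URL, e.g. `buildUrl("localhost", 59125, "Hello world", "en_US", "cmu-slt-hsmm")`, from a browser or any HTTP client returns the synthesised WAVE audio.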

SLIDE 32

What you will learn to do in the MARY Tutorial (2)

• Using timing information: REALISED_ACOUSTPARAMS and REALISED_DURATIONS
• Performance: caching