FLST: Speech Recognition

Bernd Möbius
moebius@coli.uni-saarland.de
http://www.coli.uni-saarland.de/courses/FLST/2014/

ASR and ASU
• Automatic speech recognition (ASR)
  • recognition of words or word sequences
  • necessary basis for speech understanding and dialog systems
• Automatic speech understanding (ASU)
  • more directly connected with higher linguistic levels, such as syntax, semantics, and pragmatics

Structure of dialog systems
[Figure: block diagram of a dialog system. ASR: feature extraction, word recognition. ASU: syntactic analysis, semantic analysis, pragmatic analysis. Dialog control. NLG: answer generation, speech synthesis.]

Acoustic analysis
• Feature extraction
  • utterance is analyzed as a sequence of 10 ms frames
  • in each frame, spectral information is coded as a feature vector (MFCC, here: 12 coefficients)
    • MFCC = mel frequency cepstral coefficients
    • typically 13 static and 26 dynamic features
[Figure: speech waveform and the corresponding MFCC coefficients (1-12), both plotted over time (s)]
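
The framing and the "13 static + 26 dynamic" feature count can be illustrated with a minimal sketch. It assumes non-overlapping 10 ms frames and a simple central difference for the dynamic (delta and delta-delta) features; the function names are hypothetical, and the MFCC computation itself is not shown (the static vectors are placeholders).

```python
def frame_signal(samples, sample_rate, frame_ms=10):
    """Split samples into consecutive, non-overlapping frames of frame_ms."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def deltas(vectors):
    """Central difference along time; edge frames reuse their nearest neighbor."""
    last = len(vectors) - 1
    out = []
    for t in range(len(vectors)):
        prev, nxt = vectors[max(t - 1, 0)], vectors[min(t + 1, last)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def add_dynamic_features(static):
    """Append delta and delta-delta features to each static feature vector."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]


# One second of (silent) audio at 16 kHz gives 100 frames of 160 samples each.
frames = frame_signal([0.0] * 16000, 16000)

# Placeholder static vectors: 13 values per frame stand in for real MFCCs.
static = [[float(t)] * 13 for t in range(5)]
full = add_dynamic_features(static)  # 13 static + 13 delta + 13 delta-delta
```

Real front ends usually use overlapping windows and a regression formula over several neighboring frames for the deltas; the simplification here only serves to show where the 39 dimensions come from.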

Acoustic analysis
• Word recognition
  • acoustic model (HMM): probabilities of sequences of feature vectors, given a sequence of words
  • stochastic language model: probabilities of word sequences
  • n-best word sequences (word hypotheses graphs)
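
How the two models interact can be sketched as n-best rescoring: each hypothesis carries an acoustic log-probability, a language model adds a log-probability for the word sequence, and the hypotheses are ranked by the sum. Everything below is an illustrative assumption: the bigram table, the acoustic scores, the `lm_weight` parameter, and the function names are hypothetical and not taken from the lecture.

```python
import math


def lm_logprob(words, bigram_logprobs, unseen_logprob=math.log(1e-4)):
    """Bigram log-probability of a word sequence with sentence boundary markers."""
    score = 0.0
    for prev, cur in zip(["<s>"] + words, words + ["</s>"]):
        score += bigram_logprobs.get((prev, cur), unseen_logprob)
    return score


def rescore(nbest, bigram_logprobs, lm_weight=1.0):
    """Rank (words, acoustic_logprob) pairs by acoustic + weighted LM score."""
    scored = [
        (acoustic + lm_weight * lm_logprob(words, bigram_logprobs), words)
        for words, acoustic in nbest
    ]
    return sorted(scored, key=lambda item: item[0], reverse=True)


# Toy bigram model (hypothetical values).
bigrams = {
    ("<s>", "recognize"): math.log(0.1),
    ("recognize", "speech"): math.log(0.5),
    ("speech", "</s>"): math.log(0.5),
}

# Toy n-best list: the acoustically better hypothesis is not the most plausible one.
nbest = [
    (["wreck", "a", "nice", "beach"], -11.5),
    (["recognize", "speech"], -12.0),
]

ranked = rescore(nbest, bigrams)
best_score, best_words = ranked[0]
```

In this toy setup the language model overrides the small acoustic advantage of the first hypothesis, which is the point of combining both knowledge sources.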

Word hypotheses graph
[Figure: example word hypotheses graph, from Kompe 1997]
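
A word hypotheses graph can be represented as a directed acyclic graph whose edges carry a word and a score. The sketch below is one possible encoding, not the format used by Kompe or Verbmobil: it assumes integer node IDs that increase with time and log-scores where higher is better, and the edge list is made up for illustration.

```python
from collections import defaultdict


def best_path(edges, start, end):
    """Return (score, words) of the best-scoring path, or None if end is unreachable.

    edges: list of (from_node, to_node, word, log_score), with from_node < to_node.
    """
    outgoing = defaultdict(list)
    nodes = set()
    for src, dst, word, score in edges:
        outgoing[src].append((dst, word, score))
        nodes.update((src, dst))

    best = {start: (0.0, [])}
    # Increasing node IDs give a valid topological order under the assumption above.
    for node in sorted(nodes):
        if node not in best:
            continue
        score_here, words_here = best[node]
        for dst, word, score in outgoing[node]:
            candidate = score_here + score
            if dst not in best or candidate > best[dst][0]:
                best[dst] = (candidate, words_here + [word])
    return best.get(end)


# Hypothetical graph with competing word hypotheses between the same nodes.
whg = [
    (0, 1, "ja", -1.0),
    (0, 1, "da", -2.0),
    (1, 2, "zur", -0.5),
    (1, 2, "zu", -1.5),
    (2, 3, "not", -0.75),
    (2, 3, "boot", -2.5),
    (1, 3, "zurück", -2.0),
]

result = best_path(whg, 0, 3)
```

Keeping the whole graph rather than only the single best path is what allows later modules (syntax, prosody) to revise the choice.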

Linguistic analysis
• Syntactic analysis
  • finds optimal word sequence(s) w.r.t. word recognition scores and syntactic rules / constraints
  • determines phrase structure in word sequence
  • relies on grammar rules and syntactic parsing
• Semantic analysis
  • utterance interpretation (w/o context/domain info)
• Pragmatic analysis
  • disambiguation and anaphora resolution (context info)

Relevance of prosody
• Output of a standard ASR system: WHG
  • sequences of words without punctuation and prosody
    ja zur not geht's auch am samstag
• Alternative realizations with prosody
  (1) Ja, zur Not geht's auch am Samstag.
      'Yes, if necessary it will also be possible on Saturday.'
  (2) Ja, zur Not. Geht's auch am Samstag?
      'Yes, if absolutely necessary. Will it also be possible on Saturday?'
  (3)-(12) …
• … not only in contrived examples!

Relevance of prosody
• Prosodic structure
  • sentence mode:
    Treffen wir uns bei Ihnen? 'Do we meet at your place?'
    Treffen wir uns bei Ihnen! 'Let's meet at your place!'
  • phrase boundaries:
    Fünfter geht bei mir, nicht aber neunzehnter. 'The fifth is possible for me, but not the nineteenth.'
    Fünfter geht bei mir nicht, aber neunzehnter. 'The fifth is not possible for me, but the nineteenth is.'
  • accents:
    Ich fahre doch nach Hamburg. 'I will go to Hamburg (as you know).'
    Ich fahre DOCH nach Hamburg. 'I will go to Hamburg after all.'

Prosody in ASR
• Historical perspective
  • application domains for ASR
    • until mid/late 1990s: information retrieval dialog
    • since then also: less restricted domains, free dialog
  • a chance to demonstrate the impact of prosody!
    • dialog turn segmentation
    • information structure
    • user state and affect
  • first end-to-end dialog system using prosody: Verbmobil

Role model systems: Verbmobil
• Architecture
  • multilingual prosody module: German, English, Japanese
  • common algorithms, shared features, separate data
  • input: speech signal, word hypotheses graph (WHG)
  • output: prosodically annotated WHG (prosody by word), feeding other dialog system components (incl. MT):
    • detected boundaries → dialog act segmentation, dialog manager, deep syntactic analysis
    • detected phrase accents → semantic module
    • detected questions → semantic module, dialog manager
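
The idea of a per-word prosodic annotation feeding dialog act segmentation can be sketched as follows. This is not Verbmobil's actual data format: the `ProsodicWord` fields, the thresholds, and all probability values are hypothetical, and a flat word sequence stands in for the full WHG.

```python
from dataclasses import dataclass


@dataclass
class ProsodicWord:
    word: str
    p_boundary: float = 0.0  # probability of a phrase boundary after this word
    p_accent: float = 0.0    # probability of a phrase accent (unused by segment())
    p_question: float = 0.0  # probability that a segment ending here is a question


def segment(words, boundary_threshold=0.5, question_threshold=0.5):
    """Cut the word sequence at likely boundaries and mark each segment's mode."""
    segments, current = [], []
    for i, w in enumerate(words):
        current.append(w.word)
        is_last = i == len(words) - 1
        if w.p_boundary >= boundary_threshold or is_last:
            mark = "?" if w.p_question >= question_threshold else "."
            segments.append(" ".join(current) + mark)
            current = []
    return segments


tokens = "ja zur not geht's auch am samstag".split()

# Reading (1): no strong internal boundary, falling final contour.
reading_1 = [ProsodicWord(t) for t in tokens]

# Reading (2): strong boundary after "not", question contour on "samstag".
reading_2 = [ProsodicWord(t) for t in tokens]
reading_2[2].p_boundary = 0.9
reading_2[-1].p_question = 0.8

segments_1 = segment(reading_1)
segments_2 = segment(reading_2)
```

The same word string yields one statement under the first annotation and a statement followed by a question under the second, which mirrors the two readings of the German example.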

Role model systems: SmartKom
• Beyond Verbmobil: (emotional) user state
  • architecture: input and output as in Verbmobil
  • prosodic events: accents, boundaries, rising BTs
  • user state as a 7-/4-/2-class problem:
    • joyful (s/w), surprised, neutral, hesitant, angry (w/s)
    • joyful, neutral, hesitant, angry
    • angry vs. not angry
  • realistic user states evoked in WOZ experiments
  • large feature vector: 121 features (91 pros. + 30 POS), different subsets for events and user state

SmartKom
• Classification performance (% correct recog.) [Zeissler et al. 2006]

  Prosodic events        train    test
  prominent words        81.0     77.0
  phrase boundaries      89.8     88.6
  rising BT              72.0     66.4

  (Emotional) user state    % correct
  user state (7)            30.8 *
  user state (4)            68.3 **
  user state (2)            66.8 *

  *  leave one out
  ** multimodal

Role model systems: SRI
• Acoustic feature space of prosodic events
  • similar to VM/SK approach: features derived from F0 contour, duration (phones, pauses, rate), energy
  • feature extraction by proprietary toolkit, but claimed to be feasible with standard software (Praat, Snack)
  • standard statistical classifiers
  • all models are probabilistic and trainable to tasks
  • integration of prosodic and lexical modeling
  • language-independent: English, Mandarin, Arabic
[www.speech.sri.com/people/ees/prosody]
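
To give a feel for what "features derived from F0 contour, duration, energy" might look like, here is a minimal sketch. It is not the SRI, Verbmobil, or SmartKom feature set: the feature names are hypothetical, the input is assumed to be one F0 value (Hz, 0 for unvoiced) and one energy value per 10 ms frame, and pitch tracking itself (e.g. with Praat) is not shown.

```python
def f0_slope(f0, frame_s=0.01):
    """Least-squares slope (Hz/s) over voiced frames; 0.0 with fewer than two."""
    points = [(i * frame_s, v) for i, v in enumerate(f0) if v > 0]
    if len(points) < 2:
        return 0.0
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in points)
    den = sum((t - mean_t) ** 2 for t, _ in points)
    return num / den


def longest_pause(energy, silence_threshold, frame_s=0.01):
    """Duration (s) of the longest run of frames below the energy threshold."""
    longest = run = 0
    for e in energy:
        run = run + 1 if e < silence_threshold else 0
        longest = max(longest, run)
    return longest * frame_s


def prosodic_features(f0, energy, silence_threshold=0.01, frame_s=0.01):
    """Collect a few utterance-level prosodic features into a dictionary."""
    voiced = [v for v in f0 if v > 0]
    return {
        "f0_mean": sum(voiced) / len(voiced) if voiced else 0.0,
        "f0_range": max(voiced) - min(voiced) if voiced else 0.0,
        "f0_slope": f0_slope(f0, frame_s),
        "energy_mean": sum(energy) / len(energy) if energy else 0.0,
        "longest_pause_s": longest_pause(energy, silence_threshold, frame_s),
    }


# Hypothetical rising contour with an unvoiced stretch and a short pause.
f0 = [0.0, 100.0, 110.0, 120.0, 0.0, 130.0]
energy = [0.5, 0.5, 0.0, 0.0, 0.0, 0.5]
features = prosodic_features(f0, energy)
```

A feature dictionary of this kind would then be passed to a standard statistical classifier; a positive `f0_slope` near the end of an utterance is the sort of cue a rising-boundary-tone detector could use.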

Parameters and functions
• Analysis problem: many-to-many mapping of parameters to functions
  • parameters: F0, duration, intensity, spectral prop., speaking rate, pauses, rhythm, voice quality, phonation type
  • functions: lexical tone; lexical stress, word accent; syllabic stress; accenting; prosodic phrasing; sentence mode; information structure; discourse structure

Prosody recognition
• Some approaches to exploiting prosody for ASR
  • recognition of ToBI events [Ostendorf & Ross 1997, ToBI-Lite: Wightman et al. 2000]
  • resolving syntactic ambiguities using phrase breaks [Hunt 1997]
  • analysis-by-synthesis detection of Fujisaki model parameters [Hirose 1997; Nakai et al. 1997]
  • detection of phrase boundaries, sentence mode, and accents [Verbmobil: Hess et al. 1997]
  • detection of prosodic events to support dialog manager [Verbmobil, SmartKom: Batliner & Nöth et al. 2000-2003]

Conclusion
• Prosody is an integral part of natural speech
  • processed and used extensively by human listeners
• Few ASR/ASU systems exploit prosodic structure
• Prosody can play an important role in ASR
  • prosodic features are potentially useful on all levels of ASR/ASU systems, including affective user state

Human-machine dialog

Thanks!
