Speech Recognition - Berlin Chen - PowerPoint PPT Presentation

Speech Recognition
Berlin Chen (陳柏琳)
berlin@csie.ntnu.edu.tw
http://berlin.csie.ntnu.edu.tw

Course Contents
• Both the theoretical and practical issues of spoken language processing will be considered
• Technology for Automatic Speech Recognition (ASR) will be further emphasized
• Topics to be covered
– Statistical Modeling Paradigms
  • Spoken Language Structure
  • Hidden Markov Models
  • Speech Signal Analysis and Feature Extraction
  • Acoustic and Language Modeling
  • Search/Decoding Algorithms
– Systems and Applications
  • Keyword Spotting, Dictation, Speaker Recognition, Spoken Dialogue, Speech-based Information Retrieval, etc.
SP 2004 - Berlin Chen

Textbook and References
• Textbook
– X. Huang, A. Acero, H. Hon. Spoken Language Processing. Prentice Hall, 2001
– C. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999
• Reference books
– T. F. Quatieri. Discrete-Time Speech Signal Processing - Principles and Practice. Prentice Hall, 2002
– J. R. Deller, J. H. L. Hansen, J. G. Proakis. Discrete-Time Processing of Speech Signals. IEEE Press, 2000
– F. Jelinek. Statistical Methods for Speech Recognition. MIT Press, 1999
– S. Young et al. The HTK Book, Version 3.0, 2000. http://htk.eng.cam.ac.uk
– L. Rabiner, B. H. Juang. Fundamentals of Speech Recognition. Prentice Hall, 1993
– Prof. Hsiao-Chuan Wang (王小川). Speech Signal Processing (語音訊號處理). Chuan Hwa Book Co. (全華圖書), 2004

Textbook and References (cont.)
• Reference papers
1. L. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE, vol. 77, no. 2, February 1989
2. A. Dempster, N. Laird, and D. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. Royal Statist. Soc., Series B, vol. 39, pp. 1-38, 1977
3. Jeff A. Bilmes, "A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models," U.C. Berkeley TR-97-021
4. J. W. Picone, "Signal modeling techniques in speech recognition," Proceedings of the IEEE, September 1993, pp. 1215-1247
5. R. Rosenfeld, "Two Decades of Statistical Language Modeling: Where Do We Go from Here?," Proceedings of the IEEE, August 2000
6. H. Ney, "Progress in Dynamic Programming Search for LVCSR," Proceedings of the IEEE, August 2000
7. H. Hermansky, "Should Recognizers Have Ears?," Speech Communication, 25(1-3), 1998

Introduction
References:
1. B. H. Juang and S. Furui, "Automatic Recognition and Understanding of Spoken Language - A First Step Toward Natural Human-Machine Communication," Proceedings of the IEEE, August 2000
2. I. Marsic, A. Medl, and J. Flanagan, "Natural Communication with Information Systems," Proceedings of the IEEE, August 2000

Historical Review
Gestation of Foundations
• 1952, Isolated-Digit Recognition, Bell Labs
• 1956, Ten-Syllable Recognition, RCA
• 1959, Ten-Vowel Recognition, MIT Lincoln Lab
• 1959, Phoneme-Sequence Recognition Using Statistical Information of Context, Fry and Denes
• 1960s, Dynamic Time Warping to Compare Speech Events, Vintsyuk
• 1960s-1970s, Hidden Markov Models for Speech Recognition, Baum, Baker and Jelinek
1970s ~
• Voice-Activated Typewriter (dictation machine, speaker-dependent), IBM
• Telecommunication (keyword spotting, speaker-independent), Bell Labs
• SRI, BBN Technologies, Speech at CMU, LIMSI, MIT SLS, Cambridge HTK, JHU CLSP, Philips, Microsoft

Progress of Technology • U.S. National Institute of Standards and Technology (NIST) http://www.nist.gov/speech/

Progress of Technology (cont.) • Generic Application Areas (vocabulary vs. speaking style)

Progress of Technology (cont.) • Benchmarks of ASR performance: Overview

Progress of Technology (cont.) • Benchmarks of ASR performance: Broadcast News Speech

Progress of Technology (cont.) • Benchmarks of ASR performance: Conversational Speech

Progress of Technology (cont.) • Mandarin Conversational Speech (2003 Evaluation) – Adopted from

Determinants of Speech Communication
[Figure: parallel stages of speech generation and speech understanding, each annotated with a probabilistic factor]
• Message Formulation / Message Comprehension (application semantics, actions): $P(M)$
• Language System / Language System (phone, word, prosody): $P(W \mid M)$
• Neuromuscular Mapping / Neural Transduction (articulatory parameters, feature extraction): $P(S \mid W, M)$
• Vocal Tract System / Cochlea Motion: $P(A \mid S, W, M)$
• Speech Generation / Speech Analysis: $P(X \mid A, S, W, M)$

Statistical Modeling Paradigm
• The statistical modeling paradigm used in speech and language processing
– Training: Training Data → Feature Analysis → Feature Sequence → Training Algorithm (with Ground Truth: label or class information) → Statistical Model
– Recognition: Input Data → Feature Analysis → Feature Sequence → Recognition Search (using the Statistical Model) → Recognized Sequence
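As a minimal sketch of the training and recognition flow on this slide, assume each class is modeled by a single one-dimensional Gaussian and that all classes are equally likely. The function names (`train`, `recognize`) and the toy feature values are hypothetical and only illustrate the paradigm; real ASR systems use much richer features and models.

```python
import math
from statistics import mean, pvariance


def train(labeled_features):
    """Training: estimate a (mean, variance) pair per class from labeled features."""
    return {
        label: (mean(xs), max(pvariance(xs), 1e-6))  # floor the variance to avoid zero
        for label, xs in labeled_features.items()
    }


def log_gaussian(x, mu, var):
    """Log-density of a 1-D Gaussian with mean mu and variance var."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)


def recognize(model, x):
    """Recognition: pick the class whose model gives x the highest log-likelihood."""
    return max(model, key=lambda label: log_gaussian(x, *model[label]))


# Hypothetical training data: feature values with ground-truth labels.
model = train({"low": [1.0, 1.2, 0.8], "high": [3.0, 3.3, 2.7]})
print(recognize(model, 1.1), recognize(model, 2.9))
```

The same split applies at larger scale: the training stage needs labels, while the recognition stage only needs the feature sequence and the trained model.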

Statistical Modeling Paradigm
• Approaches based on Hidden Markov Models (HMMs) dominate the area of speech recognition
– HMMs are based on rigorous mathematical theory built on several decades of mathematical results developed in other fields
– HMMs are generated by the process of training on a large corpus of real speech data

Difficulties: Speech Variability
[Figure: sources of speech variability and associated topics]
• Sources: linguistic variability, inter-speaker variability, intra-speaker variability, variability caused by the environment, variability caused by the context
• Associated topics: pronunciation variation; speaker-independency, speaker-adaptation, speaker-dependency; context-dependent acoustic modeling; robustness enhancement

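Large Vocabulary Continuous Speech Recognition (LVCSR)
[Figure: LVCSR system architecture]
• Speech Input → Feature Extraction → Feature Vectors → Linguistic Decoding and Search Algorithm → Text Output
• Speech Corpora → Acoustic Modeling → Acoustic Models
• Text Corpora → Language Modeling → Language Models
• The search algorithm draws on the Acoustic Models, the Lexicon, and the Language Models
• Decision rule: for speech input $X$, search the lexical network for the candidate word sequence $W$ with the highest posterior probability. By Bayes' theorem,
$$\hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} \frac{P(X \mid W)\,P(W)}{P(X)} = \arg\max_{W} P(X \mid W)\,P(W)$$
where $P(X \mid W)$ is the acoustic model probability and $P(W)$ is the language model probability.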
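The decision rule on this slide, choosing the word sequence W that maximizes P(X|W) P(W), can be sketched over a small list of candidates. This is only an illustration under stated assumptions: the candidate sentences and their probabilities are made up, the `decode` function is hypothetical, and a real decoder searches a lexical network rather than enumerating hypotheses. Scores are combined in the log domain, which is the usual way to avoid numerical underflow.

```python
import math


def decode(candidates, acoustic_logprob, lm_logprob):
    """Return the candidate W maximizing log P(X|W) + log P(W)."""
    return max(candidates, key=lambda w: acoustic_logprob[w] + lm_logprob[w])


# Hypothetical scores for a single utterance X.
acoustic_logprob = {  # log P(X | W), from the acoustic model
    "recognize speech": math.log(0.020),
    "wreck a nice beach": math.log(0.025),
}
lm_logprob = {  # log P(W), from the language model
    "recognize speech": math.log(0.0010),
    "wreck a nice beach": math.log(0.0001),
}

best = decode(list(acoustic_logprob), acoustic_logprob, lm_logprob)
print(best)
```

In this toy case the acoustically better hypothesis loses because the language model makes it ten times less likely a priori. P(X) is dropped because it is the same for every candidate.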

Large Vocabulary Continuous Speech Recognition (cont.) • Transcription of Broadcast News Speech

Spoken Dialogue
• Spoken language is attractive because it is the most natural, convenient and inexpensive means of exchanging information for humans
• In mobile situations, keystrokes and mouse clicks can be impractical for rapid information access on small handheld devices such as PDAs and cellular phones

Spoken Dialogue (cont.) • Flowchart

Spoken Dialogue (cont.) • Multimodality of Input and Output

Spoken Dialogue (cont.) • Deployed Dialogue Systems

Spoken Dialogue (cont.) • Topics vs. Dialogue Terms

Speech-based Information Retrieval
• Task:
– Automatically indexing a collection of spoken documents with speech recognition techniques
– Retrieving relevant documents in response to a text/speech query

Speech-based Information Retrieval (cont.)
The information retrieval process under four different scenarios: voice queries (VQ) or text queries (TQ) are used to retrieve voice information (VI) or conventional text information (TI).


Speech-based Information Retrieval (cont.)
[Figure: retrieval system components]
• Input devices: PDA, microphone, cellular phone
• Recognition: LVCSR or syllable decoding
• Indexing terms: overlapping character bigrams, overlapping syllable bigrams
• Retrieval model: vector space model
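The indexing terms and retrieval model named on this slide can be sketched as follows. This is a minimal illustration, not the deployed system: the helper names (`overlapping_bigrams`, `cosine`, `rank`) and the romanized syllable sequences are hypothetical, and raw bigram counts stand in for whatever term weighting the actual system uses.

```python
import math
from collections import Counter


def overlapping_bigrams(units):
    """Turn a sequence of characters or syllables into overlapping bigrams."""
    return [tuple(units[i:i + 2]) for i in range(len(units) - 1)]


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query_units, docs):
    """Rank document names by similarity of their bigram vectors to the query."""
    q = Counter(overlapping_bigrams(query_units))
    scores = {
        name: cosine(q, Counter(overlapping_bigrams(units)))
        for name, units in docs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical syllable-level transcripts of two spoken documents.
docs = {
    "doc_a": ["tai", "bei", "shi", "zheng", "fu"],
    "doc_b": ["gao", "xiong", "gang", "kou"],
}
print(rank(["tai", "bei", "shi"], docs))
```

The same functions work on character sequences, for example `overlapping_bigrams(list(text))`, so queries and documents can be matched at either the character or the syllable level.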

Speech-based Information Retrieval (cont.) • PDA-based IR system for Mandarin broadcast news

Speech-based Information Retrieval (cont.)
• PDA-based IR system for digital archives
– Currently deployed at National Museum of History, Taipei

Speech-to-Speech Translation
• Multilingual interactive speech translation
– Aims at achieving a communication system for precise recognition and translation of spoken utterances across several conversational topics and environments by using human language knowledge synthetically (adopted from ATR-SLT)

