
Lecture 1
Introduction/Signal Processing, Part I
  Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen
  IBM T.J. Watson Research Center, Yorktown Heights, New York, USA
  {picheny,bhuvana,stanchen}@us.ibm.com
  10 September 2012

Part I
Introduction

What Is Speech Recognition?
  Converting speech to text (STT).
    a.k.a. automatic speech recognition (ASR).
  What it's not.
    Natural language understanding, e.g., Siri.
    Speech synthesis: converting text to speech (TTS), e.g., Watson.
    Speaker recognition: identifying who is speaking.

Why Is Speech Recognition Important?
  Demo.

Because It's Fast
  modality  method                   rate (words/min)
  sound     speech                   150–200
  sight     sign language; gestures  100–150
  touch     typing; mousing          60
  taste     covering self in food    < 1
  smell     not showering            < 1

Other Reasons
  Requires no specialized training to do fast.
  Hands-free.
  Speech-enabled devices are everywhere.
    Phones, smart or dumb.
    Access to phone > access to internet.
  Text is easier to process than audio.
    Storage/compression; indexing; human consumption.

Key Applications
  Transcription: archiving/indexing audio.
    Legal; medical; television and movies.
    Call centers.
  Whenever you interact with a computer ...
    Without sitting in front of one.
    e.g., smart or dumb phone; car; home entertainment.
  Accessibility.
    People who can't type, or type slowly.
    The hard of hearing.

Why Study Speech Recognition?
  Real-world problem.
    Potential market: ginormous.
  Hasn't been solved yet.
    Not too easy; not too hard (e.g., vision).
  Lots of data.
    One of first learning problems of this scale.
  Connections to other problems with sequence data.
    Machine translation, bioinformatics, OCR, etc.

Where Are We?
  1 Course Overview
  2 A Brief History of Speech Recognition
  3 Building a Speech Recognizer: The Basic Idea
  4 Speech Production and Perception

Who Are We?
  Michael Picheny: Sr. Manager, Speech and Language.
  Bhuvana Ramabhadran: Manager, Acoustic Modeling.
  Stanley F. Chen: Regular guy.
  IBM T.J. Watson Research Center, Yorktown Heights, NY.

Why Three Professors?
  Too much knowledge to fit in one brain.
    Signal processing.
    Probability and statistics.
    Phonetics; linguistics.
    Natural language processing.
    Machine learning; artificial intelligence.
    Automata theory.

How To Contact Us
  In e-mail, prefix subject line with "EECS E6870:"!!!
    Michael Picheny: picheny@us.ibm.com
    Bhuvana Ramabhadran: bhuvana@us.ibm.com
    Stanley F. Chen: stanchen@us.ibm.com
  Office hours: right after class.
    Before class by appointment.
  TA: Xiao-Ming Wu, xw2223@columbia.edu
  Courseworks.
    For posting questions about labs.

Course Outline
  week  topic                        assigned  due
  1     Introduction
  2     Signal processing; DTW       lab 1
  3     Gaussian mixture models
  4     Hidden Markov models         lab 2     lab 1
  5     Language modeling
  6     Pronunciation modeling       lab 3     lab 2
  7     Finite-state transducers
  8     Search                       lab 4     lab 3
  9     Robustness; adaptation
  10    Discrim. training; ROVER     project   lab 4
  11    Advanced language modeling
  12    Neural networks; DBNs
  13    Project presentations                  project

Programming Assignments
  80% of grade (√−, √, √+ grading).
  Some short written questions.
  Write key parts of basic large vocabulary continuous speech recognition system.
    Only the "fun" parts.
  C++ code infrastructure provided by us.
    Also accessible from Java (via SWIG).
  Get account on ILAB computer cluster (x86 Linux PCs).
    Complete the survey.
  Labs due Wednesdays at 6pm.

Final Project
  20% of grade.
  Option 1: Reading project (individual).
    Pick paper(s) from provided list, or propose your own.
    Give 10-minute presentation summarizing paper(s).
  Option 2: Programming/experimental project (group).
    Pick project from provided list, or propose your own.
    Give 10-minute presentation summarizing project.

Readings
  PDF versions of readings will be available on the web site.
  Recommended text:
    Speech Synthesis and Recognition, Holmes, 2nd edition (paperback, 256 pp., 2001) [Holmes].
  Reference texts:
    Theory and Applications of Digital Speech Processing, Rabiner, Schafer (hardcover, 1056 pp., 2010) [R+S].
    Speech and Language Processing, Jurafsky, Martin (2nd edition, hardcover, 1024 pp., 2000) [J+M].
    Statistical Methods for Speech Recognition, Jelinek (hardcover, 305 pp., 1998) [Jelinek].
    Spoken Language Processing, Huang, Acero, Hon (paperback, 1008 pp., 2001) [HAH].

Web Site
  www.ee.columbia.edu/~stanchen/fall12/e6870/
  Syllabus.
  Slides from lectures (PDF).
    Online by 8pm the night before each lecture.
    Hardcopy of slides distributed at each lecture?
  Lab assignments (PDF).
  Reading assignments (PDF).
    Online by lecture they are assigned.
    Username: speech, password: pythonrules.

Prerequisites
  Basic knowledge of probability and statistics.
  Fluency in C++ or Java.
  Basic knowledge of Unix or Linux.
  Knowledge of digital signal processing optional.
    Helpful for understanding signal processing lectures.
    Not needed for labs.

Help Us Help You
  Feedback questionnaire after each lecture (2 questions).
    Feedback welcome any time.
  You, the student, are partially responsible ...
    For the quality of the course.
  Please ask questions anytime!
  EEs may find CS parts challenging, and vice versa.
  Together, we can get through this.
  Let's go!

Where Are We?
  1 Course Overview
  2 A Brief History of Speech Recognition
  3 Building a Speech Recognizer: The Basic Idea
  4 Speech Production and Perception

The Early Years: 1950–1960's
  Ad hoc methods.
    Many key ideas introduced; not used all together.
    e.g., spectral analysis; statistical training; language modeling.
  Small vocabulary.
    Digits; yes/no; vowels.
  Not tested with many speakers (usually < 10).

Whither Speech Recognition?
  "Speech recognition has glamour. Funds have been available. Results have been less glamorous ...
  ... General-purpose speech recognition seems far away. Special-purpose speech recognition is severely limited. It would seem appropriate for people to ask themselves why they are working in the field and what they can expect to accomplish ...
  ... These considerations lead us to believe that a general phonetic typewriter is simply impossible unless the typewriter has an intelligence and a knowledge of language comparable to those of a native speaker of English ..."
  (John Pierce, Bell Labs, 1969)

Whither Speech Recognition?
  Killed ASR research at Bell Labs for many years.
  Partially served as impetus for first (D)ARPA program (1971–1976) funding ASR research.
    Goal: integrate speech knowledge, linguistics, and AI to make a breakthrough in ASR.
    Large vocabulary: 1000 words.
    Speed: a few times real time.

Knowledge-Driven or Data-Driven?
  Knowledge-driven.
    People know stuff about speech, language, e.g., linguistics, (acoustic) phonetics, semantics.
    Hand-derived rules.
    Use expert systems, AI to integrate knowledge.
  Data-driven.
    Ignore what we think we know.
    Build dumb systems that work well if fed lots of data.
    Train parameters statistically.
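
As a small illustration of the data-driven idea of training parameters statistically, the sketch below estimates bigram language-model probabilities by counting word pairs in a few transcripts. The corpus, the function name train_bigram, and the <s>/</s> boundary markers are assumptions made for illustration; they are not taken from the course materials.

```python
from collections import Counter


def train_bigram(sentences):
    """Maximum-likelihood bigram estimates: P(w2 | w1) = count(w1 w2) / count(w1)."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        # <s> and </s> are assumed sentence-boundary markers.
        words = ["<s>"] + sentence.split() + ["</s>"]
        for w1, w2 in zip(words, words[1:]):
            context_counts[w1] += 1
            bigram_counts[(w1, w2)] += 1
    return {pair: n / context_counts[pair[0]] for pair, n in bigram_counts.items()}


# Toy "transcribed speech"; the sentences are made up.
corpus = ["call home", "call the office", "call home now"]
probs = train_bigram(corpus)
```

No rule about English is written by hand here: every probability comes from counts, so feeding in more transcripts changes the model without changing the code.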

The ARPA Speech Understanding Project
  [Figure: bar chart of accuracy (axis 0 to 100) for the SDC, HWIM, Hearsay, and Harpy systems.]
  * Each system graded on different domain.

The Birth of Modern ASR: 1970–1980's
  "Every time I fire a linguist, the performance of the speech recognizer goes up."
  (Fred Jelinek, IBM, 1985(?))
  Ignore (almost) everything we know about phonetics, linguistics.
  View speech recognition as ...
    Finding most probable word sequence given audio.
    Train probabilities automatically w/ transcribed speech.

The Birth of Modern ASR: 1970–1980's
  Many key algorithms developed/refined.
    Expectation-maximization algorithm; n-gram models; Gaussian mixtures; Hidden Markov models; Viterbi decoding; etc.
  Computing power still catching up to algorithms.
    First real-time dictation system built in 1984 (IBM).
    Specialized hardware ≈ 60 MHz Pentium.

The Golden Years: 1990's–now
                        1984     now
  CPU speed             60 MHz   3 GHz
  training data         < 10h    10000h+
  output distributions  GMM*     GMM
  sequence modeling     HMM      HMM
  language models       n-gram   n-gram

  * Actually, 1989.

  Basic algorithms have remained the same.
  Bulk of performance gain due to more data, faster CPUs.
  Significant advances in adaptation, discriminative training.
  New technologies (e.g., Deep Belief Networks) on the cusp of adoption.
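
The view of recognition as finding the most probable word sequence given audio can be sketched in a few lines. One common way to do this, assumed here rather than stated in the slides, is Bayes' rule: P(words | audio) is proportional to P(audio | words) times P(words), so the recognizer picks the candidate with the highest combined score. The two candidate transcripts and all probabilities below are made-up values, and the names acoustic, language, and decode are hypothetical.

```python
import math

# Hypothetical scores for one utterance; all numbers are made up.
# acoustic[w] stands in for log P(audio | w), language[w] for log P(w).
acoustic = {
    "recognize speech": math.log(0.20),
    "wreck a nice beach": math.log(0.25),
}
language = {
    "recognize speech": math.log(0.010),
    "wreck a nice beach": math.log(0.001),
}


def decode(candidates):
    """Return the candidate maximizing log P(audio | w) + log P(w)."""
    return max(candidates, key=lambda w: acoustic[w] + language[w])


best = decode(list(acoustic))
```

With these toy numbers the second candidate fits the audio slightly better, but the language-model term outweighs that, so decode returns "recognize speech". A real system searches a far larger space of word sequences instead of enumerating candidates.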
