Automatic Speech Recognition (CS753) - PowerPoint PPT Presentation


  1. Automatic Speech Recognition (CS753) Lecture 1: Introduction to Statistical Speech Recognition. Instructor: Preethi Jyothi

  2. Course Specifics

  3. About the course (I) Main Topics: Introduction to statistical ASR
     • Acoustic models: Hidden Markov models, deep neural network-based models
     • Pronunciation models
     • Language models (Ngram models, RNN-LMs)
     • Decoding search problem (Viterbi algorithm, etc.)

  4. About the course (II) Course webpage: www.cse.iitb.ac.in/~pjyothi/cs753
     Reading: All mandatory reading will be freely available online. Reading material will be posted on the website.
     Attendance: Strongly advised to attend all lectures, given there's no fixed textbook and a lot of the material covered in class will not be on the slides.

  5. Evaluation — Assignments Grading: 3 assignments + 1 mid-sem exam making up 45% of the grade. Format: 1. One assignment will be almost entirely programming-based. The other two will mostly contain problems to be solved by hand. 2. Mid-sem will have some questions based on problems in assignment 1. For every problem that appears both in the assignment & exam, your score for that problem in the assignment will be replaced by averaging it with the score in the exam. Late Policy: 10% reduction in marks for every additional day past the due date. Submissions closed three days after the due date.

  6. Evaluation — Final Project Grading: Constitutes 25% of the total grade. (Exceptional projects could get extra credit. Details on website soon.) Team: 2-3 members. Individual projects are highly discouraged. Project requirements:
     • Discuss proposed project with me on or before January 30th
     • 4-5 page report about methodology & detailed experiments
     • Project demo

  7. Evaluation — Final Project On Project:
     • Could be implementation of ideas learnt in class, applied to real data (and/or to a new task)
     • Could be a new idea/algorithm (with preliminary experiments)
     • Ideal project would lead to a conference paper
     Sample project ideas:
     • Voice tweeting system
     • Sentiment classification from voice-based reviews
     • Detecting accents from speech
     • Language recognition from speech segments
     • Audio search of speeches by politicians

  8. Evaluation — Final Exam Grading: Constitutes 30% of the total grade. Syllabus: Will be tested on all the material covered in the course. Format: Closed book, written exam. [Image from LOTR-I; meme not original]

  9. Academic Integrity Policy
     • Write what you know.
     • Use your own words.
     • If you refer to *any* external material, *always* cite your sources. Follow proper citation guidelines.
     • If you're caught for plagiarism or copying, penalties are much higher than simply omitting that question.
     • In short: Just not worth it. Don't do it!
     Image credit: https://www.flickr.com/photos/kurok/22196852451

  10. Introduction to Speech Recognition

  11. Exciting time to be an AI/ML researcher! Image credit: http://www.nytimes.com/2016/12/14/magazine/the-great-ai-awakening.html

  12. Lots of new progress What is speech recognition? 
 Why is it such a hard problem?

  13. Automatic Speech Recognition (ASR) Automatic speech recognition (or speech-to-text) systems transform speech utterances into their corresponding text form, typically in the form of a word sequence

  14. Automatic Speech Recognition (ASR) Automatic speech recognition (or speech-to-text) systems transform speech utterances into their corresponding text form, typically in the form of a word sequence. Many downstream applications of ASR:
     • Speech understanding: comprehending the semantics of text
     • Audio information retrieval: searching speech databases
     • Spoken translation: translating spoken language into foreign text
     • Keyword search: searching for specific content words in speech
     Other related tasks include speaker recognition, speaker diarization, speech detection, etc.

  15. History of ASR RADIO REX (1922)

  16. History of ASR SHOEBOX (IBM, 1962) [Timeline figure, 1922-2012; earlier milestone: 1 word, frequency detector (1922)]

  17. History of ASR HARPY (CMU, 1976) [Timeline figure, 1922-2012; milestones so far: 1 word, frequency detector (1922); 16 words, isolated word recognition (1962)]

  18. History of ASR HIDDEN MARKOV MODELS (1980s) [Timeline figure, 1922-2012; milestones so far: 1 word, frequency detector (1922); 16 words, isolated word recognition (1962); 1000 words, connected speech (1976)]

  19. History of ASR DEEP NEURAL NETWORK BASED SYSTEMS (>2010): Cortana, Siri [Timeline figure, 1922-2012; milestones: 1 word, frequency detector (1922); 16 words, isolated word recognition (1962); 1000 words, connected speech (1976); 10K+ words, LVCSR systems (>2010)]

  20. Why is ASR a challenging problem? Variabilities in different dimensions:
     Style: Read speech or spontaneous (conversational) speech? Continuous natural speech or command & control?
     Speaker characteristics: Rate of speech, accent, prosody (stress, intonation), speaker age, pronunciation variability even when the same speaker speaks the same word
     Channel characteristics: Background noise, room acoustics, microphone properties, interfering speakers
     Task specifics: Vocabulary size (very large number of words to be recognized), language-specific complexity, resource limitations

  21. Noisy channel model [Diagram: S → Encoder → C → Noisy channel → O → Decoder → W. Photo: Claude Shannon, 1916-2001]

  22. Noisy channel model applied to ASR [Diagram: W → Speaker → Acoustic processor → O → Decoder → W*. Photos: Claude Shannon (1916-2001), Fred Jelinek (1932-2010)]

  23. Statistical Speech Recognition Let O represent a sequence of acoustic observations (i.e. O = {O_1, O_2, …, O_T} where O_t is a feature vector observed at time t) and W denote a word sequence. Then, the decoder chooses W* as follows:
     W* = argmax_W Pr(W | O) = argmax_W [Pr(O | W) Pr(W) / Pr(O)]
     This maximisation does not depend on Pr(O). So, we have
     W* = argmax_W Pr(O | W) Pr(W)
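     As a toy illustration of this argmax, the sketch below scores two candidate transcriptions with hard-coded log-probabilities standing in for the acoustic and language models (all numbers are made up); only relative scores matter, since Pr(O) is common to every candidate:

        # Toy sketch of W* = argmax_W Pr(O|W) Pr(W); the scores are
        # invented stand-ins for real acoustic/language model outputs.
        log_acoustic = {"one plus two": -12.0, "won plus too": -11.5}  # log Pr(O|W)
        log_language = {"one plus two": -2.0, "won plus too": -7.0}    # log Pr(W)

        def decode(candidates):
            # Work in log space; Pr(O) is dropped since it is the same for all W.
            return max(candidates, key=lambda W: log_acoustic[W] + log_language[W])

        print(decode(["one plus two", "won plus too"]))  # -> "one plus two"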

  24. Statistical Speech Recognition W* = argmax_W Pr(O | W) Pr(W)
     Pr(O | W) is referred to as the "acoustic model"
     Pr(W) is referred to as the "language model"
     [Block diagram: speech signal → Acoustic Feature Generator → O → SEARCH (uses the Acoustic Model and the Language Model) → word sequence W*]

  25. Example: Isolated word ASR task
     Vocabulary: 10 digits (zero, one, two, …), 2 operations (plus, minus)
     Data: Speech utterances corresponding to each word, sampled from multiple speakers
     Recall the acoustic model is Pr(O | W): direct estimation is impractical (why?)
     Let's parameterize Pr_α(O | W) using a Markov model with parameters α. Now, the problem reduces to estimating α.
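     To make the parameterization concrete, here is a minimal sketch of what α could contain for one word's model: a left-to-right transition matrix (matching the figure on the next slide) plus per-state emission parameters. All numbers, shapes, and names are illustrative assumptions, not values from the course:

        import numpy as np

        # Illustrative alpha for one word's HMM: 5 states, where state 0
        # (start) and state 4 (final) are non-emitting. Values are made up.
        A = np.array([
            # to:  0    1    2    3    4
            [0.0, 1.0, 0.0, 0.0, 0.0],  # state 0: a01 = 1
            [0.0, 0.6, 0.4, 0.0, 0.0],  # state 1: self-loop a11, then a12
            [0.0, 0.0, 0.7, 0.3, 0.0],  # state 2: a22, a23
            [0.0, 0.0, 0.0, 0.5, 0.5],  # state 3: a33, a34
            [0.0, 0.0, 0.0, 0.0, 1.0],  # state 4: final (absorbing)
        ])
        # Emission densities b_j(.) are typically Gaussians or Gaussian
        # mixtures over feature vectors; 13 mimics an MFCC dimension.
        means = np.zeros((3, 13))       # one mean per emitting state 1..3
        variances = np.ones((3, 13))    # diagonal covariances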

  26. Isolated word-based acoustic models [Figure: left-to-right HMM for the word "one" with states 0-4, self-loop probabilities a_11, a_22, a_33, transitions a_01, a_12, a_23, a_34, and emission densities b_1(.), b_2(.), b_3(.) generating O_1, O_2, O_3, O_4, …, O_T]
     Transition probabilities denoted by a_ij from state i to state j. Observation vectors O_t are generated from the probability density b_j(O_t).
     P. Jyothi, "Discriminative & AF-based Pron. models for ASR", Ph.D. dissertation, 2013

  27. Isolated word-based acoustic models [Figure: the same HMM for the word "one" as on the previous slide]
     For an O = {O_1, O_2, …, O_6} and a state sequence Q = {0, 1, 1, 2, 3, 4}:
     Pr(O, Q | W = 'one') = a_01 b_1(O_1) a_11 b_1(O_2) …
     Pr(O | W = 'one') = Σ_Q Pr(O, Q | W = 'one')
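     The sum over all state sequences Q can be computed efficiently with the forward algorithm rather than by enumeration. Below is a minimal sketch in log space; the layout assumptions (non-emitting start/final states, precomputed log emission scores) are mine, not from the slides:

        import numpy as np
        from scipy.special import logsumexp

        def forward_log_prob(log_A, log_b):
            """Return log Pr(O | W) = log sum_Q Pr(O, Q | W) for one word HMM.

            log_A: (S, S) log transition matrix, log_A[i, j] = log a_ij.
            log_b: (T, S) precomputed log emission scores, log_b[t, j] =
                   log b_j(O_t); columns for non-emitting states are -inf.
            Assumes state 0 is the non-emitting start and state S-1 the
            non-emitting final state, as in the figure above.
            """
            T, S = log_b.shape
            alpha = np.full(S, -np.inf)
            alpha[0] = 0.0  # start in state 0 with probability 1
            for t in range(T):
                # alpha_j <- log sum_i exp(alpha_i + log a_ij) + log b_j(O_t)
                alpha = logsumexp(alpha[:, None] + log_A, axis=0) + log_b[t]
            # one last transition into the non-emitting final state
            return logsumexp(alpha + log_A[:, -1])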

  28. Isolated word recognition [Figure: a bank of word HMMs (one, two, …, plus, minus), each scoring the acoustic features O to give Pr(O | W = w)]
     Pick argmax_w Pr(O | W = w). What are we assuming about Pr(W)?
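     Putting the pieces together, isolated word recognition is just an argmax over the per-word scores. A sketch reusing the hypothetical forward_log_prob above; assuming a uniform prior Pr(W) over the 12-word vocabulary, the language model term does not change the argmax:

        def recognize(word_log_A, word_log_b):
            """word_log_A: dict mapping word -> log transition matrix.
            word_log_b: dict mapping word -> (T, S) log emission scores
            for the observed features O under that word's HMM states.
            With uniform Pr(W): argmax_w Pr(O|W=w) Pr(W=w) = argmax_w Pr(O|W=w).
            """
            return max(word_log_A,
                       key=lambda w: forward_log_prob(word_log_A[w], word_log_b[w]))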

  29. Isolated word recognition [Figure: the same bank of per-word HMMs scoring the acoustic features O] Is this approach scalable?

  30. Architecture of an ASR system [Block diagram: speech signal → Acoustic Feature Generator → O → SEARCH → word sequence W*; the SEARCH module draws on an Acoustic Model (phones), a Pronunciation Model, and a Language Model]

  31. Evaluating an ASR system Quantitative metric: Error rates computed on an unseen test set by comparing W* (decoded output) against W_ref (reference sentence) for each test utterance
     • Sentence/Utterance error rate (trivial to compute!)
     • Word/Phone error rate
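     Word error rate comes from the Levenshtein alignment between the reference and decoded word sequences: (substitutions + deletions + insertions) / (number of reference words). A minimal sketch (function name mine):

        def word_error_rate(reference, hypothesis):
            """WER via edit distance between word sequences.
            (Real toolkits also report the individual error counts.)"""
            ref, hyp = reference.split(), hypothesis.split()
            # d[i][j] = min edits turning first i ref words into first j hyp words
            d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
            for i in range(len(ref) + 1):
                d[i][0] = i
            for j in range(len(hyp) + 1):
                d[0][j] = j
            for i in range(1, len(ref) + 1):
                for j in range(1, len(hyp) + 1):
                    sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                    d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
            return d[len(ref)][len(hyp)] / len(ref)

        # Sentence/utterance error rate is simply whether reference != hypothesis.
        print(word_error_rate("one plus two", "one plus plus two"))  # 0.333... (one insertion)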
