Automatic Speech Recognition (CS753)



slide-1
SLIDE 1

Instructor: Preethi Jyothi Lecture 1

Automatic Speech Recognition (CS753)

Lecture 1: Introduction to Statistical Speech Recognition


slide-2
SLIDE 2

Course Specifics

slide-3
SLIDE 3

About the course (I)

Main Topics:

  • Introduction to statistical ASR
  • Acoustic models
      • Hidden Markov models
      • Deep neural network-based models
  • Pronunciation models
  • Language models (Ngram models, RNN-LMs)
  • Decoding search problem (Viterbi algorithm, etc.)

slide-4
SLIDE 4

About the course (II)

Course webpage: www.cse.iitb.ac.in/~pjyothi/cs753

Reading: All mandatory reading will be freely available online. Reading material will be posted on the website.

Attendance: Strongly advised to attend all lectures, given there’s no fixed textbook and a lot of the material covered in class will not be on the slides.
slide-5
SLIDE 5

Evaluation — Assignments

Grading: 3 assignments + 1 mid-sem exam making up 45% of the grade.

Format:

  1. One assignment will be almost entirely programming-based. The other two will mostly contain problems to be solved by hand.
  2. Mid-sem will have some questions based on problems in assignment 1. For every problem that appears both in the assignment & exam, your score for that problem in the assignment will be replaced by averaging it with the score in the exam.

Late Policy: 10% reduction in marks for every additional day past the due date. Submissions closed three days after the due date.

slide-6
SLIDE 6

Evaluation — Final Project

Grading: Constitutes 25% of the total grade. (Exceptional projects could get extra credit. Details on website soon.)

Team: 2-3 members. Individual projects are highly discouraged.

Project requirements:

  • Discuss proposed project with me on or before January 30th
  • 4-5 page report about methodology & detailed experiments
  • Project demo
slide-7
SLIDE 7

Evaluation — Final Project

On Project:

  • Could be implementation of ideas learnt in class, applied to real data (and/or to a new task)

  • Could be a new idea/algorithm (with preliminary experiments)
  • Ideal project would lead to a conference paper

Sample project ideas:

  • Voice tweeting system
  • Sentiment classification from voice-based reviews
  • Detecting accents from speech
  • Language recognition from speech segments
  • Audio search of speeches by politicians
slide-8
SLIDE 8

Evaluation — Final Exam

Grading: Constitutes 30% of the total grade.

Syllabus: Will be tested on all the material covered in the course.

Format: Closed book, written exam.

Image from LOTR-I; meme not original

slide-9
SLIDE 9

Academic Integrity Policy

  • Write what you know.
  • Use your own words.
  • If you refer to *any* external material, *always* cite your sources. Follow proper citation guidelines.
  • If you’re caught for plagiarism or copying, penalties are much higher than simply omitting that question.

  • In short: Just not worth it. Don’t do it!

Image credit: https://www.flickr.com/photos/kurok/22196852451

slide-10
SLIDE 10

Introduction to Speech Recognition

slide-11
SLIDE 11

Exciting time to be an AI/ML researcher!

Image credit: http://www.nytimes.com/2016/12/14/magazine/the-great-ai-awakening.html

slide-12
SLIDE 12

Lots of new progress

What is speech recognition? 
 Why is it such a hard problem?

slide-13
SLIDE 13

Automatic Speech Recognition (ASR)

  • Automatic speech recognition (or speech-to-text) systems transform speech utterances into their corresponding text form, typically in the form of a word sequence.

slide-14
SLIDE 14

Automatic Speech Recognition (ASR)

  • Automatic speech recognition (or speech-to-text) systems transform speech utterances into their corresponding text form, typically in the form of a word sequence.
  • Many downstream applications of ASR:
      • Speech understanding: comprehending the semantics of text
      • Audio information retrieval: searching speech databases
      • Spoken translation: translating spoken language into foreign text
      • Keyword search: searching for specific content words in speech
  • Other related tasks include speaker recognition, speaker diarization, speech detection, etc.

slide-15
SLIDE 15

History of ASR

RADIO REX (1922)

slide-16
SLIDE 16

History of ASR

SHOEBOX (IBM, 1962)

[Timeline, 1922 to 2012. Entries so far: 1 word (frequency detector)]

slide-17
SLIDE 17

History of ASR

HARPY (CMU, 1976)

[Timeline, 1922 to 2012. Entries so far: 1 word (frequency detector); 16 words (isolated word recognition)]

slide-18
SLIDE 18

History of ASR

HIDDEN MARKOV MODELS (1980s)

[Timeline, 1922 to 2012. Entries so far: 1 word (frequency detector); 16 words (isolated word recognition); 1000 words (connected speech)]

slide-19
SLIDE 19

History of ASR

DEEP NEURAL NETWORK BASED SYSTEMS (>2010)

[Timeline, 1922 to 2012. Entries: 1 word (frequency detector); 16 words (isolated word recognition); 1000 words (connected speech); 10K+ words (LVCSR systems); Siri; Cortana]

slide-20
SLIDE 20

Why is ASR a challenging problem?

Variabilities in different dimensions:

Style: Read speech or spontaneous (conversational) speech? Continuous natural speech or command & control?

Speaker characteristics: Rate of speech, accent, prosody (stress, intonation), speaker age, pronunciation variability even when the same speaker speaks the same word

Channel characteristics: Background noise, room acoustics, microphone properties, interfering speakers

Task specifics: Vocabulary size (very large number of words to be recognized), language-specific complexity, resource limitations

slide-21
SLIDE 21

Noisy channel model

[Diagram: Encoder, Noisy channel, Decoder; signals labeled S, C, O, W]

Claude Shannon (1916-2001)

slide-22
SLIDE 22

Noisy channel model applied to ASR

[Diagram: Speaker, Acoustic processor, Decoder; signals labeled W, O, W*]

Claude Shannon (1916-2001), Fred Jelinek (1932-2010)

slide-23
SLIDE 23

Statistical Speech Recognition

Let O represent a sequence of acoustic observations (i.e. O = {O_1, O_2, …, O_T}, where O_t is a feature vector observed at time t) and W denote a word sequence. Then, the decoder chooses W* as follows:

$$W^* = \arg\max_W \Pr(W \mid O) = \arg\max_W \frac{\Pr(O \mid W)\,\Pr(W)}{\Pr(O)}$$

Pr(O) does not depend on W, so it does not affect the maximisation. So, we have

$$W^* = \arg\max_W \Pr(O \mid W)\,\Pr(W)$$
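The decision rule above can be sketched in a few lines of Python. This is only an illustration: the candidate transcripts and all probability values are made up, and a real decoder searches a vastly larger hypothesis space rather than scoring a short list.

```python
import math

def decode(log_acoustic, log_prior):
    """Return the candidate W maximising log Pr(O|W) + log Pr(W)."""
    return max(log_acoustic, key=lambda w: log_acoustic[w] + log_prior[w])

# Hypothetical scores for three candidate transcripts of one utterance.
log_acoustic = {  # log Pr(O|W), invented values
    "recognize speech": math.log(0.020),
    "wreck a nice beach": math.log(0.030),
    "recognise peach": math.log(0.010),
}
log_prior = {  # log Pr(W), invented values
    "recognize speech": math.log(0.60),
    "wreck a nice beach": math.log(0.10),
    "recognise peach": math.log(0.30),
}

best = decode(log_acoustic, log_prior)

# Using the acoustic score alone can pick a different candidate.
acoustic_only = max(log_acoustic, key=log_acoustic.get)

print(best)
print(acoustic_only)
```

Working in log probabilities turns the product Pr(O|W) Pr(W) into a sum, which avoids numerical underflow on long utterances. Pr(O) never appears because it is the same for every candidate.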

slide-24
SLIDE 24

Statistical Speech Recognition

$$W^* = \arg\max_W \Pr(O \mid W)\,\Pr(W)$$

Pr(O|W) is referred to as the “acoustic model”.
Pr(W) is referred to as the “language model”.

[Diagram: speech signal → Acoustic Feature Generator → O → SEARCH → word sequence W*; SEARCH draws on the Acoustic Model and the Language Model]

slide-25
SLIDE 25

Example: Isolated word ASR task

Vocabulary: 10 digits (zero, one, two, …), 2 operations (plus, minus)

Data: Speech utterances corresponding to each word, sampled from multiple speakers

Recall the acoustic model is Pr(O|W): direct estimation is impractical (why?)

Let’s parameterize Pr_α(O|W) using a Markov model with parameters α. Now, the problem reduces to estimating α.

slide-26
SLIDE 26

Isolated word-based acoustic models

[Figure: Model for word “one”: a left-to-right Markov model with states 1, 2, 3, 4; emission densities b1( ), b2( ), b3( ); transitions a01, a12, a23, a34 and self-loops a11, a22, a33; observations O1, O2, O3, O4, …, OT]

Figure source: P. Jyothi, “Discriminative & AF-based Pron. models for ASR”, Ph.D. dissertation, 2013

Transition probabilities denoted by aij from state i to state j. Observation vectors Ot are generated from the probability density bj(Ot).

slide-27
SLIDE 27

Isolated word-based acoustic models

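[Figure: Model for word “one”: states 1, 2, 3, 4; emission densities b1( ), b2( ), b3( ); transitions a01, a12, a23, a34 and self-loops a11, a22, a33; observations O1, O2, O3, O4, …, OT]

For an O={O1,O2, …, O6} and a state sequence Q={0,1,1,2,3,4}:

$$\Pr(O, Q \mid W = \text{‘one’}) = a_{01}\, b_1(O_1)\, a_{11}\, b_1(O_2) \cdots$$

$$\Pr(O \mid W = \text{‘one’}) = \sum_Q \Pr(O, Q \mid W = \text{‘one’})$$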
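A minimal sketch of these two quantities in Python, under stated assumptions: the transition values are invented, and the emission densities b_j are replaced by small lookup tables over two hypothetical symbols ("x", "y") so the example stays self-contained. A path here lists only the emitting states, one per observation; the start state 0 and end state 4 are handled implicitly. The sum over Q is computed twice, once by enumerating every path and once with the forward recursion, which gives the same value without the exponential enumeration.

```python
from itertools import product

START, END = 0, 4
EMITTING = (1, 2, 3)

# Hypothetical transition probabilities a_ij for the left-to-right topology.
A = {
    (0, 1): 1.0,
    (1, 1): 0.6, (1, 2): 0.4,
    (2, 2): 0.5, (2, 3): 0.5,
    (3, 3): 0.7, (3, 4): 0.3,
}
# Hypothetical discrete emission tables standing in for the densities b_j.
B = {
    1: {"x": 0.9, "y": 0.1},
    2: {"x": 0.2, "y": 0.8},
    3: {"x": 0.5, "y": 0.5},
}

def a(i, j):
    """Transition probability a_ij; transitions absent from A have probability 0."""
    return A.get((i, j), 0.0)

def joint(obs, path):
    """Pr(O, Q | W) for one emitting-state path (one state per observation)."""
    p = a(START, path[0])
    prev = None
    for state, symbol in zip(path, obs):
        if prev is not None:
            p *= a(prev, state)
        p *= B[state][symbol]
        prev = state
    return p * a(prev, END)

def likelihood_brute_force(obs):
    """Pr(O | W) as an explicit sum over every possible state path."""
    return sum(joint(obs, q) for q in product(EMITTING, repeat=len(obs)))

def likelihood_forward(obs):
    """Pr(O | W) via the forward recursion."""
    alpha = {j: a(START, j) * B[j][obs[0]] for j in EMITTING}
    for symbol in obs[1:]:
        alpha = {
            j: sum(alpha[i] * a(i, j) for i in EMITTING) * B[j][symbol]
            for j in EMITTING
        }
    return sum(alpha[i] * a(i, END) for i in EMITTING)

obs = ["x", "x", "y", "y"]
print(joint(obs, (1, 1, 2, 3)))
print(likelihood_brute_force(obs), likelihood_forward(obs))
```

The brute-force sum visits 3^T paths for T observations, while the forward recursion needs on the order of T times the square of the number of states.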

slide-28
SLIDE 28

Isolated word recognition

[Figure: one word model per vocabulary entry (one, two, plus, minus, …), each with the same topology: states 1, 2, 3, 4; emission densities b1( ), b2( ), b3( ); transitions a01, a12, a23, a34 and self-loops a11, a22, a33. The acoustic features O are scored by every model, giving Pr(O|W = ‘one’), Pr(O|W = ‘two’), Pr(O|W = ‘plus’), Pr(O|W = ‘minus’), …]

$$\text{Pick } \arg\max_w \Pr(O \mid W = w)$$

What are we assuming about Pr(W)?
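One way to read this rule against the general decision rule W* = arg max Pr(O|W) Pr(W): picking the word by acoustic likelihood alone gives the same answer when every word is taken to be equally likely a priori. As a worked step, assuming a uniform prior over the 12-word vocabulary (10 digits plus 2 operations):

$$\arg\max_w \Pr(O \mid W = w)\,\Pr(W = w) = \arg\max_w \Pr(O \mid W = w)\cdot\frac{1}{12} = \arg\max_w \Pr(O \mid W = w)$$

The constant factor 1/12 scales every candidate equally, so it cannot change which word attains the maximum.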

slide-29
SLIDE 29

Isolated word recognition

[Figure: one word model per vocabulary entry (one, two, plus, minus, …), each with the same topology: states 1, 2, 3, 4; emission densities b1( ), b2( ), b3( ); transitions a01, a12, a23, a34 and self-loops a11, a22, a33. The acoustic features O are scored by every model, giving Pr(O|W = ‘one’), Pr(O|W = ‘two’), Pr(O|W = ‘plus’), Pr(O|W = ‘minus’), …]

Is this approach scalable?

slide-30
SLIDE 30

Architecture of an ASR system

[Diagram: speech signal → Acoustic Feature Generator → O → SEARCH → word sequence W*; SEARCH draws on the Acoustic Model (phones), the Pronunciation Model and the Language Model]

slide-31
SLIDE 31

Evaluate an ASR system

Quantitative metric: Error rates computed on an unseen test set by comparing W* (decoded output) against Wref (reference sentence) for each test utterance

  • Sentence/Utterance error rate (trivial to compute!)
  • Word/Phone error rate
slide-32
SLIDE 32

Evaluate an ASR system

Word/Phone error rate (ER) uses the Levenshtein distance measure: What is the minimum number of edits (insertions/deletions/substitutions) required to convert W* to Wref?

On a test set with N instances:

$$\mathrm{ER} = \frac{\sum_{j=1}^{N} \left(\mathrm{Ins}_j + \mathrm{Del}_j + \mathrm{Sub}_j\right)}{\sum_{j=1}^{N} \ell_j}$$

Ins_j, Del_j, Sub_j are the number of insertions/deletions/substitutions in the jth ASR output, and ℓ_j is the total number of words/phones in the jth reference.
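The error-rate computation can be sketched as follows. This is a minimal illustration, not a standard scoring tool: the function names and the two test utterances are made up. It assumes the usual convention that an insertion is an extra word in the ASR output, a deletion is a reference word missing from the output, and all three edit types cost 1.

```python
def edit_counts(ref, hyp):
    """Return (insertions, deletions, substitutions) for a minimum-edit alignment."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    # Each cell holds (cost, ins, del, sub) for ref[:i] versus hyp[:j].
    d = [[None] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = (i, 0, i, 0)   # hypothesis empty: delete i reference words
    for j in range(cols):
        d[0][j] = (j, j, 0, 0)   # reference empty: j inserted words
    for i in range(1, rows):
        for j in range(1, cols):
            c, n_ins, n_del, n_sub = d[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (c, n_ins, n_del, n_sub)          # match, no cost
            else:
                diag = (c + 1, n_ins, n_del, n_sub + 1)  # substitution
            c, n_ins, n_del, n_sub = d[i][j - 1]
            ins = (c + 1, n_ins + 1, n_del, n_sub)       # insertion
            c, n_ins, n_del, n_sub = d[i - 1][j]
            dele = (c + 1, n_ins, n_del + 1, n_sub)      # deletion
            d[i][j] = min(diag, ins, dele, key=lambda cell: cell[0])
    return d[-1][-1][1:]

def error_rate(refs, hyps):
    """Total edits over all utterances divided by total reference length."""
    edits = sum(sum(edit_counts(r, h)) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return edits / total

def sentence_error_rate(refs, hyps):
    """Fraction of utterances whose output differs from the reference at all."""
    return sum(r != h for r, h in zip(refs, hyps)) / len(refs)

# Hypothetical two-utterance test set.
refs = ["the cat sat on the mat".split(), "hello world".split()]
hyps = ["the cat sat on mat".split(), "hello there world".split()]

print(error_rate(refs, hyps))
print(sentence_error_rate(refs, hyps))
```

Because insertions count as errors, the word error rate can exceed 100% when the output is much longer than the reference. When several alignments tie on total cost, the split between the three edit types may differ, but their sum (and hence ER) does not.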

slide-33
SLIDE 33

Course Overview

[Diagram: speech signal → Acoustic Feature Generator → O → SEARCH → word sequence W*; SEARCH draws on the Acoustic Model (phones), the Pronunciation Model and the Language Model]

Topics annotated on the diagram: Properties of speech sounds; Acoustic Signal Processing; Hidden Markov Models; Deep Neural Networks; Hybrid HMM-DNN Systems; Speaker Adaptation; Ngram/RNN LMs; G2P/feature-based models

slide-34
SLIDE 34

Course Overview

[Diagram: speech signal → Acoustic Feature Generator → O → SEARCH → word sequence W*; SEARCH draws on the Acoustic Model (phones), the Pronunciation Model and the Language Model]

Topics annotated on the diagram: Properties of speech sounds; Acoustic Signal Processing; Hidden Markov Models; Deep Neural Networks; Hybrid HMM-DNN Systems; Speaker Adaptation; Ngram/RNN LMs; G2P/feature-based models; Search algorithms

slide-35
SLIDE 35

Course Overview

[Diagram: speech signal → Acoustic Feature Generator → O → SEARCH → word sequence W*; SEARCH draws on the Acoustic Model (phones), the Pronunciation Model and the Language Model]

Topics annotated on the diagram: Properties of speech sounds; Acoustic Signal Processing; Hidden Markov Models; Deep Neural Networks; Hybrid HMM-DNN Systems; Speaker Adaptation; Ngram/RNN LMs; G2P/feature-based models; Search algorithms

Formalism: Finite State Transducers

slide-36
SLIDE 36

Next two classes: Weighted Finite State Transducers in ASR