Lecture 5: The Big Picture/Language Modeling


SLIDE 1

Lecture 5

The Big Picture/Language Modeling Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen, Markus Nussbaum-Thom

Watson Group IBM T.J. Watson Research Center Yorktown Heights, New York, USA {picheny,bhuvana,stanchen,nussbaum}@us.ibm.com

17 February 2016

SLIDE 2

Administrivia

Slides posted before lecture may not match lecture.
Lab 1: not graded yet; will be graded by next lecture? Awards ceremony for evaluation next week. Grading: what’s up with the optional exercises?
Lab 2: due nine days from now (Friday, Feb. 26) at 6pm. Start early! Avail yourself of Piazza.

SLIDE 3

Feedback

Clear (4); mostly clear (2); unclear (3). Pace: fast (3); OK (2).
Muddiest: HMM’s in general (1); Viterbi (1); FB (1).
Comments (2+ votes): want better/clearer examples (5); spend more time walking through examples (3); spend more time on high-level intuition before getting into details (3); good examples (2).

SLIDE 4

Celebrity Sighting

New York Times

SLIDE 5

Part I The HMM/GMM Framework

SLIDE 6

Where Are We?

1. Review from 10,000 Feet
2. The Model
3. Training
4. Decoding
5. Technical Details

SLIDE 7

The Raw Data

[Figure: a raw speech waveform; sample index on the horizontal axis (×10⁴), amplitude ticks at −1, −0.5, 0.5 on the vertical axis.]

What do we do with waveforms?

SLIDE 8

Front End Processing

Convert waveform to features.

SLIDE 9

What Have We Gained?

Time domain ⇒ frequency domain. Removed vocal-fold excitation. Made features independent.

SLIDE 10

ASR 1.0: Dynamic Time Warping

SLIDE 11

Computing the Distance Between Utterances

Find “best” alignment between frames. Sum distances between aligned frames. Sum penalties for “weird” alignments.

SLIDE 12

ASR 2.0: The HMM/GMM Framework

SLIDE 13

Notation

SLIDE 14

How Do We Do Recognition?

xtest = test features; Pω(x) = word model.

(answer) = ???
(answer) = arg max_{ω ∈ vocab} Pω(xtest)

Return the word whose model assigns the highest prob to the utterance.
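A minimal sketch of this decision rule (assuming a hypothetical log_prob scoring method per word model, standing in for the Forward computation of later slides):

```python
# word_models maps each vocab word to an object whose
# log_prob(features) scores the utterance under that word's model
# (hypothetical interface). The rule is a max over the vocabulary.

def recognize(x_test, word_models):
    """Return the word whose model assigns x_test the highest prob."""
    return max(word_models, key=lambda w: word_models[w].log_prob(x_test))
```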

SLIDE 15

Putting it All Together

Pω(x) = ??? How do we actually train? How do we actually decode?

It’s a puzzlement by jubgo. Some rights reserved.

SLIDE 16

Where Are We?

1. Review from 10,000 Feet
2. The Model
3. Training
4. Decoding
5. Technical Details

SLIDE 17

So What’s the Model?

Pω(x) = ??? Frequency that word ω generates features x. Has something to do with HMM’s and GMM’s.

Untitled by Daniel Oines. Some rights reserved.

SLIDE 18

A Word Is A Sequence of Sounds

e.g., the word ONE: W → AH → N.

Phoneme inventory:
AA AE AH AO AW AX AXR AY B BD CH D DD DH DX EH ER EY F G GD HH IH IX IY JH K KD L M N NG OW OY P PD R S SH T TD TH TS UH UW V W X Y Z ZH

What sounds make up TWO? What do we use to model sequences?

SLIDE 19

HMM, v1.0

Outputs on arcs, not states. What’s the problem? What are the outputs?

SLIDE 20

HMM, v2.0

What’s the problem? How many frames per phoneme?

SLIDE 21

HMM, v3.0

Are we done?

SLIDE 22

Concept: Alignment ⇔ Path

Path through HMM ⇒ sequence of arcs, one per frame. Notation: A = a_1 ··· a_T, where a_t = the arc that generated frame t.

SLIDE 23

The Game Plan

Express Pω(x), the total prob of x, in terms of Pω(x, A), the prob of a single path. How?

P(x) = Σ_{paths A} (path prob) = Σ_{paths A} P(x, A)

Sum over all paths.

SLIDE 24

How To Compute the Likelihood of a Path?

Path: A = a_1 ··· a_T.

P(x, A) = Π_{t=1}^{T} (arc prob) × (output prob) = Π_{t=1}^{T} p_{a_t} × P(x_t | a_t)

Multiply arc and output probs along the path.
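A small sketch of this product, computed in log space (illustrative names: arc_log_prob[a] stands for log p_a, output_log_prob for log P(x_t | a)):

```python
def path_log_likelihood(frames, path, arc_log_prob, output_log_prob):
    """log P(x, A): sum the log arc prob and log output prob over the
    T frames of the path (the log of the product on this slide)."""
    return sum(arc_log_prob[a] + output_log_prob(x, a)
               for x, a in zip(frames, path))
```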

SLIDE 25

What Do Output Probabilities Look Like?

Mixture of diagonal-covariance Gaussians.

P(x | a) = Σ_{comp j} (mixture wgt) Π_{dim d} (Gaussian for dim d)
         = Σ_{comp j} p_{a,j} Π_{dim d} N(x_d; µ_{a,j,d}, σ²_{a,j,d})
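A sketch of this output distribution in numpy (the array shapes are my own choice, not from the lecture), evaluated stably in log space:

```python
import numpy as np

def gmm_log_prob(x, weights, means, variances):
    """log P(x|a) for one arc's diagonal-covariance GMM.
    x: (D,); weights: (J,) mixture weights p_{a,j};
    means, variances: (J, D) per-component Gaussian parameters."""
    # log of prod_d N(x_d; mu_{j,d}, sigma^2_{j,d}) for each component j
    log_gauss = -0.5 * (np.log(2.0 * np.pi * variances)
                        + (x - means) ** 2 / variances).sum(axis=1)
    # log sum_j p_j exp(log_gauss_j), via the log-sum-exp trick
    a = np.log(weights) + log_gauss
    m = a.max()
    return m + np.log(np.exp(a - m).sum())
```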

SLIDE 26

The Full Model

P(x) = Σ_{paths A} P(x, A)
     = Σ_{paths A} Π_{t=1}^{T} p_{a_t} × P(x_t | a_t)
     = Σ_{paths A} Π_{t=1}^{T} p_{a_t} Σ_{comp j} p_{a_t,j} Π_{dim d} N(x_{t,d}; µ_{a_t,j,d}, σ²_{a_t,j,d})

p_a — transition probability for arc a.
p_{a,j} — mixture weight, jth component of GMM on arc a.
µ_{a,j,d} — mean, dth dim, jth component, GMM on arc a.
σ²_{a,j,d} — variance, dth dim, jth component, GMM on arc a.

SLIDE 27

Pop Quiz

What was the equation on the last slide?

SLIDE 28

Where Are We?

1. Review from 10,000 Feet
2. The Model
3. Training
4. Decoding
5. Technical Details

SLIDE 29

Training

How to create the model Pω(x) from examples x_{ω,1}, x_{ω,2}, . . . ?

SLIDE 30

What is the Goal of Training?

To estimate parameters . . . To maximize likelihood of training data.

Crossfit 0303 by Runar Eilertsen. Some rights reserved.

SLIDE 31

What Are the Model Parameters?

p_a — transition probability for arc a.
p_{a,j} — mixture weight, jth component of GMM on arc a.
µ_{a,j,d} — mean, dth dim, jth component, GMM on arc a.
σ²_{a,j,d} — variance, dth dim, jth component, GMM on arc a.

SLIDE 32

Warm-Up: Non-Hidden ML Estimation

e.g., Gaussian estimation, non-hidden Markov Models. How to do this? (Hint: ??? and ???.)

parameter    description    statistic
p_a          arc prob       # times arc taken
p_{a,j}      mixture wgt    # times component used
µ_{a,j,d}    mean           x_d
σ²_{a,j,d}   variance       x_d²

Count and normalize, i.e., collect a statistic; divide by a normalizer count.
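A sketch of count-and-normalize when the alignment is observed (the helper src(a) and the data layout are illustrative):

```python
import numpy as np
from collections import Counter, defaultdict

def nonhidden_ml(alignment, frames, src):
    """alignment: observed arcs a_1..a_T; frames: vectors x_1..x_T."""
    counts = Counter(alignment)           # statistic: # times arc taken
    leaving = defaultdict(int)            # normalizer: counts leaving src(a)
    for a, c in counts.items():
        leaving[src(a)] += c
    p = {a: c / leaving[src(a)] for a, c in counts.items()}

    per_arc = defaultdict(list)
    for a, x in zip(alignment, frames):
        per_arc[a].append(x)
    mu = {a: np.mean(xs, axis=0) for a, xs in per_arc.items()}
    var = {a: np.mean(np.square(xs), axis=0) - mu[a] ** 2  # E[x^2] - E[x]^2
           for a, xs in per_arc.items()}
    return p, mu, var
```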

SLIDE 33

How To Estimate Hidden Models?

The EM algorithm ⇒ FB algorithm for HMM’s. Hill-climbing maximum-likelihood estimation.

Uphill Struggle by Ewan Cross. Some rights reserved.

SLIDE 34

The EM Algorithm

Expectation step: using the current model, compute posterior counts, i.e., the prob that each event occurred at time t.
Maximization step: like non-hidden MLE, except use fractional posterior counts instead of whole counts.
Repeat.

SLIDE 35

E step: Calculating Posterior Counts

e.g., posterior count γ(a, t) of taking arc a at time t:

γ(a, t) = P(paths with arc a at time t) / P(all paths)
        = (1 / P(x)) × P(paths from start to src(a)) × P(arc a at time t) × P(paths from dst(a) to end)
        = (1 / P(x)) × α(src(a), t − 1) × p_a × P(x_t | a) × β(dst(a), t)

Do the Forward algorithm: α(S, t), P(x). Do the Backward algorithm: β(S, t). Read off posterior counts.
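Read off as code, the posterior count is one multiply-through in log space (a sketch; the table and function names are illustrative, and frames are 1-indexed to match the slide):

```python
import math

def posterior_count(a, t, x, log_alpha, log_beta, log_px,
                    arc_log_prob, output_log_prob, src, dst):
    """gamma(a, t) from Forward/Backward quantities:
    log_alpha[s, t] = log alpha(s, t), log_beta[s, t] = log beta(s, t),
    log_px = log P(x) from the Forward pass."""
    return math.exp(log_alpha[src(a), t - 1] + arc_log_prob[a]
                    + output_log_prob(x[t], a) + log_beta[dst(a), t]
                    - log_px)
```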

SLIDE 36

M step: Non-Hidden ML Estimation

Count and normalize. Same stats as non-hidden, except the normalizer is fractional.

e.g., arc prob p_a:

p_a = (count of a) / Σ_{a′: src(a′)=src(a)} (count of a′)
    = Σ_t γ(a, t) / Σ_{a′: src(a′)=src(a)} Σ_t γ(a′, t)

e.g., single Gaussian, mean µ_{a,d} for dim d:

µ_{a,d} = (mean weighted by γ(a, t)) = Σ_t γ(a, t) x_{t,d} / Σ_t γ(a, t)
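A sketch of this M step (assumed data layout: gamma[(a, t)] holds the E-step posteriors, frames[t] the feature vector x_t; src(a) is an illustrative helper):

```python
from collections import defaultdict

def m_step(gamma, frames, src):
    """Re-estimate arc probs p_a and per-arc means mu_a from
    fractional counts; assumes every arc received some posterior mass."""
    count = defaultdict(float)               # sum_t gamma(a, t)
    weighted = {}                            # sum_t gamma(a, t) * x_t
    for (a, t), g in gamma.items():
        count[a] += g
        weighted[a] = weighted.get(a, 0.0) + g * frames[t]
    leaving = defaultdict(float)             # normalizer over arcs sharing src
    for a, c in count.items():
        leaving[src(a)] += c
    p = {a: c / leaving[src(a)] for a, c in count.items()}
    mu = {a: weighted[a] / count[a] for a in count}
    return p, mu
```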

SLIDE 37

Where Are We?

1. Review from 10,000 Feet
2. The Model
3. Training
4. Decoding
5. Technical Details

SLIDE 38

What is Decoding?

(answer) = arg max_{ω ∈ vocab} Pω(xtest)

SLIDE 39

What Algorithm?

(answer) = arg max_{ω ∈ vocab} Pω(xtest)

For each word ω, how to compute Pω(xtest)? The Forward or Viterbi algorithm.

SLIDE 40

What Are We Trying To Compute?

P(x) = Σ_{paths A} P(x, A)
     = Σ_{paths A} Π_{t=1}^{T} p_{a_t} × P(x_t | a_t)
     = Σ_{paths A} Π_{t=1}^{T} (arc cost)

SLIDE 41

Dynamic Programming

Shortest path problem:

(answer) = min_{paths A} Σ_{t=1}^{T_A} (edge length)

Forward algorithm:

P(x) = Σ_{paths A} Π_{t=1}^{T} (arc cost)

Viterbi algorithm:

P(x) ≈ max_{paths A} Π_{t=1}^{T} (arc cost)

Any semiring will do.
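To make “any semiring will do” concrete, here is one dynamic program parameterized by the semiring’s “sum” (a sketch; arcs_into and log_cost are illustrative interfaces): pass combine=max for Viterbi, combine=log_add for Forward.

```python
import math

def log_add(a, b):
    """'Sum' of the (logaddexp, +) semiring; max gives the Viterbi semiring."""
    if a == float("-inf"): return b
    if b == float("-inf"): return a
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

def run_dp(states, arcs_into, log_cost, T, start, combine):
    """alpha[(s, t)] accumulated with the chosen combine operation.
    arcs_into(s) yields (s_prev, arc) pairs entering state s."""
    alpha = {(start, 0): 0.0}
    for t in range(1, T + 1):
        for s in states:
            total = float("-inf")
            for s_prev, arc in arcs_into(s):
                v = alpha.get((s_prev, t - 1), float("-inf")) + log_cost(arc, t)
                total = combine(total, v)
            alpha[(s, t)] = total
    return alpha
```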

SLIDE 42

Scaling

How does decoding time scale with vocab size?

SLIDE 43

The One Big HMM Paradigm: Before

SLIDE 44

The One Big HMM Paradigm: After

[HMM diagram: one big HMM with parallel branches for the words ONE, TWO, THREE, FOUR, FIVE, SIX, SEVEN, EIGHT, NINE, ZERO.]

How does this help us?

SLIDE 45

Pruning

What is time complexity of Forward/Viterbi? How many values α(S, t) to fill? Idea: only fill k best cells at each frame. What is time complexity? How does this scale with vocab size?
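A sketch of the pruning idea (beam/histogram pruning; the names are mine): at each frame, keep only the k best cells and treat everything else as log prob −∞.

```python
import heapq

def prune(alpha_t, k):
    """alpha_t: dict state -> log prob at the current frame.
    Keeping k cells makes per-frame work O(k) rather than
    O(#states in the big HMM), at some risk of dropping the best path."""
    return dict(heapq.nlargest(k, alpha_t.items(), key=lambda kv: kv[1]))
```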

SLIDE 46

How Does This Change Decoding?

Run Forward/Viterbi once, on one big HMM . . . Instead of once for every word model. Same algorithm; different graph!

[HMM diagram: one big HMM with parallel branches for the words ONE, TWO, THREE, FOUR, FIVE, SIX, SEVEN, EIGHT, NINE, ZERO.]

SLIDE 47

Forward or Viterbi?

What are we trying to compute? Total prob? Viterbi prob? Best word?

[HMM diagram: one big HMM with parallel branches for the words ONE, TWO, THREE, FOUR, FIVE, SIX, SEVEN, EIGHT, NINE, ZERO.]

SLIDE 48

Recovering the Word Identity

[HMM diagram: one big HMM with parallel branches for the words ONE, TWO, THREE, FOUR, FIVE, SIX, SEVEN, EIGHT, NINE, ZERO.]

SLIDE 49

Where Are We?

1. Review from 10,000 Feet
2. The Model
3. Training
4. Decoding
5. Technical Details

SLIDE 50

Hyperparameters

What is a hyperparameter? A tunable knob, i.e., something adjustable that can’t be estimated with “normal” training.
Can you name some? Number of states in each word HMM; HMM topology; number of GMM components.

SLIDE 51

“Estimating” Hyperparameters

How does one set hyperparameters? Just try different values ⇒ expensive! Testing one value ⇒ training a whole HMM/GMM system.
What criterion to optimize? Normal parameters: likelihood (smooth). Hyperparameters: word-error rate (noisy). Gradient descent unreliable; grid search instead.
Ask an old-timer. What are good hyperparameter settings for ASR?

SLIDE 52

How Many States?

Rule of thumb: three states per phoneme. A phoneme has a start, middle, and end? Example: TWO is composed of phonemes T UW. Two phonemes ⇒ six HMM states.

[HMM diagram: six states T1 T2 T3 UW1 UW2 UW3; each state has a self-loop and a forward arc, labeled g1/0.5 … g6/0.5.]

What guarantees that each state models its intended sound?

SLIDE 53

Which HMM Topology?

A standard topology. Must say sounds of word in order. Can stay at each sound indefinitely.

[HMM diagram: the six-state topology above, self-loops and forward arcs labeled g1/0.5 … g6/0.5.]

Can we skip sounds, e.g., fifth? Use skip arcs ⇔ arcs with no output. Need to modify Forward, Viterbi, etc.

[HMM diagram: the same topology with skip arcs added; regular arcs labeled g1/0.4 … g6/0.4, skip arcs labeled ε/0.2.]

SLIDE 54

The Smallest Number in the World

Demo.

SLIDE 55

Probabilities and Log Probabilities

P(x) = Σ_{paths A} Π_{t=1}^{T} p_{a_t} Σ_{comp j} p_{a_t,j} Π_{dim d} N(x_{t,d}; µ_{a_t,j,d}, σ²_{a_t,j,d})

1 sec of data ⇒ T = 100 ⇒ multiply ~4,000 likelihoods. Easy to generate values below 10⁻³⁰⁷. Cannot store in a C/C++ 64-bit double. What to do?

Solution: store log probs instead of probs. Compute log α(S, t), not α(S, t).
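A two-line demonstration of the underflow and the fix (the exact numbers are illustrative):

```python
import math

p = 1e-4                    # a typical per-factor likelihood
print(p ** 4000)            # 0.0 -- underflows (true value is 1e-16000)
print(4000 * math.log(p))   # -36841.36... -- the log prob fits in a double
```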

SLIDE 56

Viterbi Easy, Forward Tricky

Viterbi algorithm:

α̂(S, t) = max_{S′ →x_t→ S} P(S′ →x_t→ S) × α̂(S′, t − 1)
log α̂(S, t) = max_{S′ →x_t→ S} [ log P(S′ →x_t→ S) + log α̂(S′, t − 1) ]

Forward algorithm:

α(S, t) = Σ_{S′ →x_t→ S} P(S′ →x_t→ S) × α(S′, t − 1)
log α(S, t) = log Σ_{S′ →x_t→ S} exp[ log P(S′ →x_t→ S) + log α(S′, t − 1) ]

See Holmes, pp. 153–154.
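The standard fix for the Forward case is the log-sum-exp trick: factor out the max before exponentiating. A sketch matching the equations above:

```python
import math

def log_sum_exp(values):
    """log(sum_i exp(v_i)) without underflow; this is the Forward
    update in log space. The Viterbi update is just max(values)."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))

# Two tiny path probs, each exp(-1000), combine safely:
print(log_sum_exp([-1000.0, -1000.0]))   # -999.3069 = -1000 + log(2)
```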

SLIDE 57

What Have We Learned So Far?

Single-word recognition. How to recognize a single word. e.g., can handle a digit, not a digit string. What’s this good for? Old-time voice dialing. Recognizing digits in old-time phone menus. Not much else.

Old-timey cell phones by Vaguely Artistic. Some rights reserved.

SLIDE 58

Part II Continuous Word ASR

SLIDE 59

Where Are We?

1. Single Word To Continuous Word ASR
2. One More Thing
3. Discussion

SLIDE 60

Next Step: Isolated Word Recognition

It’s . . . when . . . you . . . talk . . . like . . . this.

SLIDE 61

Training Data, Test Data

What you need: silence-based segmenter. Chop up training data. Chop up test data. Reduces to single-word ASR.

SLIDE 62

Continuous Word Recognition

It’s when you talk like this.

SLIDE 63

Continuous Word Recognition

Single word: find the word model with the highest prob; V-way classification.

arg max_{ω ∈ vocab} Pω(xtest)

Continuous word: find the word sequence with the highest prob; ∞-way classification.

arg max_{ω ∈ vocab*} P(xtest | ω)

This sounds hard.

SLIDE 64

Decoding Continuous Word Data

Have single-word models Pω(x). How to decode continuous words?

SLIDE 65

One Big HMM Paradigm

How to modify HMM . . . To accept word sequences instead of single words?

[HMM diagram: one big HMM with parallel branches for the words ONE, TWO, THREE, FOUR, FIVE, SIX, SEVEN, EIGHT, NINE, ZERO.]

Loop!

SLIDE 66

What Do We Need To Change in Viterbi?

Nada.

SLIDE 67

Training on Continuous Word Data

Isolated word. Continuous word. Don’t know where words begin and end!

SLIDE 68

How Does Training Work Again?

Isolated word. Continuous word: what to do? Idea: concatenate HMM’s!

SLIDE 69

What Do We Need To Change in FB?

Nada.

SLIDE 70

Recap: Continuous Word ASR

Use “one big HMM” paradigm for decoding. Modify HMM’s for decoding and training in intuitive way. Everything just works!

SLIDE 71

Where Are We?

1. Single Word To Continuous Word ASR
2. One More Thing
3. Discussion

SLIDE 72

One More Thing

What happens if we feed isolated speech . . . Into our continuous word system?

[HMM diagram: one big HMM with parallel branches for the words ONE, TWO, THREE, FOUR, FIVE, SIX, SEVEN, EIGHT, NINE, ZERO.]

SLIDE 73

What To Do About Silence?

Treat silence as just another word (∼SIL). How to design HMM for silence?

[HMM diagram: a three-state silence HMM with self-loops and forward arcs labeled g1/0.4 … g3/0.4, plus skip arcs labeled ε/0.2.]

SLIDE 74

Silence In Decoding

Where may silence occur? How many silences can occur in a row? Rule of thumb: unnecessary freedom should be avoided.

  • cf. Patriot Act.

[Decoding-graph diagram: optional HMMsil at the start, between words, and at the end, looping around parallel word HMMs (HMMone, HMMtwo, HMMthree, …).]

SLIDE 75

Silence In Training

Usually not included in transcripts. e.g., HMM for transcript: ONE TWO

[Training-graph diagram: HMMone followed by HMMtwo, with optional HMMsil before, between, and after the words.]

Lab 2: graphs constructed for you.

SLIDE 76

Recap: Silence

Don’t forget about silence! Silence can be modeled as just another word. Generalization: noises, music, filled pauses.

Silence by Alberto Ortiz. Some rights reserved.

SLIDE 77

Where Are We?

1. Single Word To Continuous Word ASR
2. One More Thing
3. Discussion

SLIDE 78

HMM/GMM Systems Are Easy To Use

List of inputs:
Hyperparameters: HMM topology, # states; # GMM components. What else?∗
Utterances with transcripts. Automatically induces word begins/ends, silences. Period.

∗Small vocabulary only.

SLIDE 79

HMM/GMM Systems Are Flexible

Same algorithms for: Single word, isolated word, continuous word ASR. Just change how HMM is created!

SLIDE 80

HMM/GMM Systems Are Scalable

As training data and vocabulary grow:
In decoding speed: pruning ⇒ time grows slowly.∗
In model size: number of parameters grows slowly.∗

∗When using large-vocab methods described in next few lectures.

SLIDE 81

HMM/GMM’s Are The Bomb

State of the art since invented in the 1980’s. That’s 30+ years! Until a couple of years ago, basically every production system was HMM/GMM. Most probably still are.

BOMB by Apionid. Some rights reserved.

SLIDE 82

Segue: What Have We Learned So Far?

Small-vocabulary continuous speech recognition. What’s this good for? Digit strings. Not much else. What’s next: large-vocabulary CSR.

SLIDE 83

Part III Language Modeling

SLIDE 84

Where Are We?

1. The Fundamental Equation of Speech Recognition

SLIDE 85

Demo

SLIDE 86

What’s the Point?

ASR works better if you say something “expected”. Otherwise, it doesn’t do that well.

SLIDE 87

Wreck a Nice Peach?

Demo.

THIS IS OUR ROOM FOR A FOUR HOUR PERIOD .
THIS IS HOUR ROOM FOUR A FOR OUR . PERIOD
IT IS EASY TO RECOGNIZE SPEECH .
IT IS EASY TO WRECK A NICE PEACH .

Homophones; acoustically ambiguous speech. How does it get it right, even though the acoustics for each pair are the same? (What if we want the other member of the pair?) Need to model “expected” word sequences!

SLIDE 88

How Do We Do Recognition?

xtest = test features; P(x|ω) = HMM/GMM model.

(answer) = ???
(answer) = arg max_{ω ∈ vocab*} P(xtest | ω)

Return the word sequence that assigns the highest prob to the utterance.

SLIDE 89

Does This Prefer Likely Word Sequences?

e.g., P(xtest|OUR ROOM) vs. P(xtest|HOUR ROOM). If I say AA R R UW M, how do these compare? They should be about the same.

SLIDE 90

How Do We Fix This?

Want a term P(ω): a prior over word sequences that prefers likely sequences. What HMM/GMM’s give us: P(x|ω).

Old: the word sequence that maximizes the likelihood of the feats:

(answer) = arg max_ω P(x|ω)

Idea: the most likely word sequence given the feats!?

(answer) = arg max_ω P(ω|x)

SLIDE 91

Bayes’ Rule

The rule:

P(x, ω) = P(ω) P(x|ω) = P(x) P(ω|x)
P(ω|x) = P(ω) P(x|ω) / P(x)

Substituting:

(answer) = arg max_ω P(ω|x) = arg max_ω P(ω) P(x|ω) / P(x) = arg max_ω P(ω) P(x|ω)

SLIDE 92

The Fundamental Equation of ASR

Old way:

(answer) = arg max_ω P(x|ω)

New way:

(answer) = arg max_ω P(ω|x) = arg max_ω P(ω) P(x|ω)

Added term P(ω), just like we wanted.
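As a sketch (not a real decoder, which searches one big HMM rather than enumerating candidates), the new decision rule just adds the two log scores:

```python
def decode(x, candidate_word_seqs, lm_log_prob, am_log_prob):
    """Pick the word sequence maximizing log P(w) + log P(x|w).
    lm_log_prob and am_log_prob are illustrative stand-ins for the
    language model and the HMM/GMM acoustic model."""
    return max(candidate_word_seqs,
               key=lambda w: lm_log_prob(w) + am_log_prob(x, w))
```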

SLIDE 93

Remember This!

(answer) = arg max_ω (language model) × (acoustic model)
         = arg max_ω (prior prob over words) × P(feats | words)
         = arg max_ω P(ω) P(x|ω)

Forgot What I Wanted to Remember by Flood G. Some rights reserved.

SLIDE 94

Does This Fix Our Problem?

(answer) = arg max_ω (language model) × (acoustic model) = arg max_ω P(ω) P(x|ω)

What about homophones?

THIS IS OUR ROOM FOR A FOUR HOUR PERIOD .
THIS IS HOUR ROOM FOUR A FOR OUR . PERIOD

What about confusable sequences in general?

IT IS EASY TO RECOGNIZE SPEECH .
IT IS EASY TO WRECK A NICE PEACH .

SLIDE 95

Language Modeling: Goals

Describe which word sequences are likely. Eliminate nonsense; restrict choices given to AM. The fewer choices, the better you do! Save acoustic model’s ass.

SLIDE 96

Pop Quiz

What is the fundamental equation of ASR?

SLIDE 97

Part IV Epilogue

SLIDE 98

What’s Next

Language modeling: on the road to LVCSR.
Lecture 6: pronunciation modeling; acoustic modeling for LVCSR.
Lectures 7, 8: training, finite-state transducers, search; efficient training and decoding for LVCSR.

SLIDE 99

Course Feedback

1. Was this lecture mostly clear or unclear? What was the muddiest topic?
2. Comments on difficulty of Lab 1?
3. Other feedback (pace, content, atmosphere)?
