SLIDE 1
Learning a Language Model from Continuous Speech

Graham Neubig, Masato Mimura, Shinsuke Mori, Tatsuya Kawahara

School of Informatics, Kyoto University, Japan

SLIDE 2

  • 1. Outline
SLIDE 3

Training of a Speech Recognition System

[Diagram: a text corpus (e.g. "this is the song that never ends, it just goes on and on my friends, and if you started singing it not knowing what it was, you'll just keep singing it forever just because…") and speech with its transcription are used to train the Language Model and the Acoustic Model, which together drive the Decoder.]

SLIDE 6

Why Learn a Language Model from Speech?

  • A straightforward way to handle spoken language
  • Fillers, colloquial expressions, and pronunciation variants are included in the model
  • A way to learn models for resource-poor languages
  • LMs can be learned even for languages with no digitized text
  • Use with language-independent acoustic models? [Schultz & Waibel 01]
  • Semi-supervised learning
  • Learn a model from newspaper text, update it with spoken expressions or new vocabulary from speech

SLIDE 7

Our Research

  • Goal: learn a LM using no text
  • Two problems:
  • Word boundaries are not clear → use unsupervised word segmentation
  • Acoustic ambiguity → use a phoneme lattice to absorb acoustic model errors
  • Method: apply a Bayesian word segmentation method [Mochihashi+ 09] to phoneme lattices
  • Implementation using weighted finite-state transducers (WFSTs)
  • Result: an LM learned from continuous speech was able to significantly reduce the ASR phoneme error rate on test data

SLIDE 8

Previous Research

  • Learning words from speech
  • Using audio/visual data and techniques such as MMI or MDL, learn grounded words [Roy+ 02, Taguchi+ 09]
  • Find similar audio segments using dynamic time warping and acoustic similarity scores [Park+ 08]
  • Learning language models from speech
  • Use standard LM learning techniques on 1-best AM results [de Marcken 95, Gorin+ 99]
  • Multigram model from acoustic lattices [Driesen+ 08]
  • No research has learned n-gram LMs under acoustic uncertainty
  • Most work handles small vocabularies (infant-directed speech, digit recognition)

SLIDE 9

  • 2. Unsupervised word segmentation
SLIDE 10

LM-based Supervised Word Segmentation

  • Training: use a corpus W annotated with word boundaries to train a model G
  • Decoding: for a character sequence x, treat all word sequences w as possible candidates
  • The probability of a candidate is proportional to its LM probability

Example: x = "iam", scored with language model G:
  P(w="iam"; G), P(w="i am"; G), P(w="ia m"; G), P(w="i a m"; G)
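
As a concrete (toy) instance of this decoding view, the candidates for x = "iam" can be enumerated and scored under an assumed unigram LM; the vocabulary and probabilities below are invented for illustration, not from the paper:

```python
# Toy unigram LM over a tiny vocabulary (illustrative probabilities).
P_LM = {"i": 0.4, "am": 0.3, "a": 0.15, "m": 0.1, "ia": 0.04, "iam": 0.01}

def segmentations(x):
    """Enumerate every way to split x into in-vocabulary words."""
    if not x:
        yield []
        return
    for i in range(1, len(x) + 1):
        head = x[:i]
        if head in P_LM:
            for tail in segmentations(x[i:]):
                yield [head] + tail

def score(words):
    """Unigram probability of one candidate segmentation."""
    p = 1.0
    for w in words:
        p *= P_LM[w]
    return p

for cand in sorted(segmentations("iam"), key=score, reverse=True):
    print(" ".join(cand), score(cand))
# Prints all four candidates, most probable ("i am") first.
```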

SLIDE 11

LM-Based Unsupervised Word Segmentation

  • Estimate the unobserved word sequence W of an unsegmented corpus X, and train a language model G over W
  • We desire a model that is highly expressive, but simple
  • Likelihood P(W|G) prefers expressive (complex) models
  • Add a prior P(G) that prefers simple models
  • Find a model with high joint probability P(G,W) = P(G)P(W|G)

                  P(G)   P(W|G)   P(G)P(W|G)
  Simple model    high   low      low
  Complex model   low    high     low
  Ideal model     mid    mid      mid

SLIDE 12

Hierarchical Pitman-Yor Language Model (HPYLM) [Teh 06]

  • An n-gram language model based on non-parametric Bayesian statistics
  • Has a number of attractive traits
  • Language model smoothing is realized through the prior P(G)
  • Parameters can be learned using Gibbs sampling

[Diagram: a suffix tree of contexts, each drawn from a Pitman-Yor process with its parent context as base measure: Hε ~ PY(Hbase, d1, Θ1); Ha, Hb ~ PY(Hε, d2, Θ2); Hba, Hca ~ PY(Ha, d3, Θ3); Hab, Hdb ~ PY(Hb, d3, Θ3); …]
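
For reference, a sketch of this hierarchy in the notation of [Teh 06] (the formula below is from that paper, not from these slides): each context u draws its distribution from a Pitman-Yor process whose base measure is the next-shorter context π(u), which yields a recursively smoothed predictive probability:

```latex
% Each context u draws its distribution from a PY process whose base
% measure is the one-word-shorter context \pi(u):
%   H_u \sim \mathrm{PY}(d_{|u|}, \theta_{|u|}, H_{\pi(u)})
% Predictive probability (Chinese restaurant representation):
P(w \mid u) = \frac{c_{uw} - d_{|u|}\, t_{uw}}{\theta_{|u|} + c_{u\cdot}}
            + \frac{\theta_{|u|} + d_{|u|}\, t_{u\cdot}}{\theta_{|u|} + c_{u\cdot}}\; P(w \mid \pi(u))
% c_{uw}: customer count for word w in context u; t_{uw}: table count;
% the recursive second term realizes the smoothing mentioned above.
```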

SLIDE 13

Unsupervised Word Segmentation using HPYLMs [Mochihashi+ 09]

  • The model G is separated into a word-based language model LM and a character-based spelling model SM
  • Words and spellings are connected in a probabilistic framework (unknown words can be modeled)
  • It is possible to sample word boundaries using a technique called forward-filtering/backward-sampling
  • Can be used with any (non-cyclic) finite-state automaton
  • Very similar to the forward-backward algorithm for HMMs

Example: "i am in chiba now", where "chiba" is an unknown word:
  PLM(i|<s>) PLM(am|i) PLM(in|am) PLM(<unk>|in) PLM(now|<unk>) PLM(</s>|now)
with the spelling of "chiba" generated by the spelling model:
  PSM(c|<s>) PSM(h|c) PSM(i|h) PSM(b|i) PSM(a|b) PSM(</s>|a)

SLIDE 20

Forward Filtering

  • Forward filtering is identical to the forward step in the forward-backward algorithm

[Lattice: states s0…s5 with edges p(s1|s0), p(s2|s0), p(s3|s1), p(s3|s2), p(s4|s1), p(s4|s2), p(s5|s3), p(s5|s4)]

Forward filtering: add up forward probabilities in topological order
  f(s0) = 1
  f(s1) = p(s1|s0)*f(s0)
  f(s2) = p(s2|s0)*f(s0)
  f(s3) = p(s3|s1)*f(s1) + p(s3|s2)*f(s2)
  f(s4) = p(s4|s1)*f(s1) + p(s4|s2)*f(s2)
  f(s5) = p(s5|s3)*f(s3) + p(s5|s4)*f(s4)
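
A minimal sketch of this recursion, assuming the lattice is acyclic and the edge list is sorted so each source state is finished before it is used (state names follow the figure; the probability values are invented):

```python
# Lattice edges as (source, target, p(target | source)); values illustrative.
edges = [
    ("s0", "s1", 0.6), ("s0", "s2", 0.4),
    ("s1", "s3", 0.5), ("s1", "s4", 0.5),
    ("s2", "s3", 0.7), ("s2", "s4", 0.3),
    ("s3", "s5", 1.0), ("s4", "s5", 1.0),
]

def forward_filter(edges, start="s0"):
    """Forward filtering: f(s) = sum over edges s'->s of p(s|s') * f(s')."""
    f = {start: 1.0}
    for src, tgt, p in edges:  # sources in topological order, so f[src] is final
        f[tgt] = f.get(tgt, 0.0) + p * f[src]
    return f

f = forward_filter(edges)
print(f["s5"])  # total probability of all paths from s0 to s5 (= 1.0 here)
```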

SLIDE 21

Backward Sampling

  • Backward sampling samples a path, starting at the final state, using the edge and forward probabilities

[Lattice: states s0…s5 with the same edges as above]

Backward sampling: sample incoming edges starting from the final state
  e(s5→x):  p(x=s3) ∝ p(s5|s3)*f(s3)
            p(x=s4) ∝ p(s5|s4)*f(s4)

SLIDE 23

Backward Sampling (continued)

Having sampled the edge into s3, repeat the same step one state back:
  e(s3→x):  p(x=s1) ∝ p(s3|s1)*f(s1)
            p(x=s2) ∝ p(s3|s2)*f(s2)
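
Continuing the same sketch, backward sampling draws a predecessor of each state with probability proportional to p(s|s')*f(s'), exactly the e(s→x) weights above:

```python
import random

def backward_sample(edges, f, start="s0", final="s5"):
    """From the final state, repeatedly pick an incoming edge s' -> s
    with probability proportional to p(s|s') * f(s')."""
    path, state = [final], final
    while state != start:
        preds = [(src, p) for src, tgt, p in edges if tgt == state]
        weights = [p * f[src] for src, p in preds]
        state = random.choices([src for src, _ in preds], weights=weights)[0]
        path.append(state)
    return path[::-1]

print(backward_sample(edges, f))  # e.g. ['s0', 's2', 's3', 's5']
```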

SLIDE 27

  • 3. WFST Implementation and Learning from Speech

SLIDE 28

Generating Word Segmentation Candidates with WFSTs

  • We propose a simple way to generate word segmentation candidates using WFSTs
  • The WFSTs are quite similar to those used in ASR

[Diagram: composition chain Input X ∘ Dictionary L ∘ (LM + SM), with example transitions i/i, a/a, m/m in X and a/ε, m/ε, ε/am_w, m/m_c in L]

SLIDE 30

A Language Model WFST for Word Segmentation

  • Express both the language model LM and the spelling model SM as a single WFST

[Diagram: LM states ε, w1, w2 with edges w1:p(w1), w2:p(w2), w2:p(w2|w1) and fallback edges ε:p(FB), ε:p(FB|w1), ε:p(FB|w2) into the SM; SM states <s>, c1, c2 with edges c1:p(c1), c2:p(c2), c1:p(c1|<s>), fallback edges ε:p(FB|<s>), ε:p(FB|c1), ε:p(FB|c2), and word-end edges ε:p(</s>|c1), ε:p(</s>|c2) returning to the LM]

The key is the weighted edges connecting the two models

SLIDE 31

Word Segmentation Candidates as a WFST

  • Vocabulary "i, a, am", unigram model

[Diagram: the composed WFST for input "iam": word-level edges i/i_w:PL(i), a/a_w:PL(a), and a/ε then m/am_w:PL(am) for the two-character word "am"; fallback edges weighted PL(FB) into character-level edges i/i_c:PS(i), a/a_c:PS(a), m/m_c:PS(m) with word-end edges ε/</s>:PS(</s>); sentence-final edge ε/</s>:PL(</s>)]

SLIDE 35

Adaptation to Speech

  • When using WFSTs, adaptation to speech is simple
  • Replace the input X with an HMM-based acoustic model
  • Forward filtering = creation of a recognition lattice
  • However, full expansion using HMMs is impossible
  • Instead, we use a trimmed phoneme lattice with acoustic model scores

[Diagram: a text input X is the single path "i a m"; a speech input X is a phoneme lattice with weighted alternatives i/PAM(i), e/PAM(e), y/PAM(y), a/PAM(a), a/PAM(a), m/PAM(m)]
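
To make the contrast concrete, here is a hypothetical encoding of the two kinds of input for the forward-filtering sketch given earlier; the edge tuples and acoustic scores are invented:

```python
# Text input: a single certain path, one symbol per edge, weight 1.0.
text_X = [("s0", "s1", "i", 1.0), ("s1", "s2", "a", 1.0), ("s2", "s3", "m", 1.0)]

# Speech input: a trimmed phoneme lattice; parallel edges carry competing
# phonemes weighted by acoustic model scores P_AM (values made up; a real
# lattice would also distinguish arcs by time alignment).
speech_X = [
    ("s0", "s1", "i", 0.7), ("s0", "s1", "e", 0.2), ("s0", "s1", "y", 0.1),
    ("s1", "s2", "a", 0.9), ("s1", "s2", "a", 0.1),
    ("s2", "s3", "m", 1.0),
]
```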

SLIDE 36

Learning from Text, Learning from Speech

              Text                            Speech
Input         Character string                Phoneme lattice
Technique     WFST composition, sampling      WFST composition, sampling
Probability   P(W|G)P(G)                      P(X|W)P(W|G)P(G)
              (LM likelihood, prior)          (AM and LM likelihoods, prior)
Samples       Segmentation, LM                Phoneme string for each utterance, segmentation, LM

SLIDE 37

  • 4. Evaluation
SLIDE 38

Experimental Setting

  • Target: speech from meetings of the Japanese Diet
  • Fluent, large-vocabulary speech
  • Actual vocabulary size is 2858 words
  • Data preparation: triphone acoustic model
  • PER: one-best 34.2%, oracle 8.1%
  • Used syllable lattices, not phoneme lattices (due to requirements of the decoder)
  • 8-117 minutes of training data, 27 minutes of test data
  • Evaluation standard:
  • Phoneme error rate over the test data, using the language model learned from the training speech

SLIDE 39

PER Results

  • An LM learned from continuous speech reduced the PER by 8.92% (absolute, compared to the acoustic model alone)
  • 3-gram is better than 1-gram: contextual information was learned

[Chart: phoneme error rate (25.0%-28.0%) vs. size of training data (7.9-116.7 minutes) for the proposed 1-gram, 2-gram, and 3-gram models; AM only: 34.2%]

SLIDE 40

Other Training Methods

[Chart: phoneme error rate (24.0%-29.0%) vs. size of training data (0-116.7 minutes) for four methods: 1-best & syllable 3-gram; 1-best & unsupervised segmentation; lattice & unsupervised segmentation (proposed method); manual transcription and segmentation]

SLIDE 41

  • 5. Conclusion
SLIDE 42

Conclusion

  • We demonstrated that it is possible to learn a language model from continuous speech
  • Released open source: http://www.phontron.com/latticelm
  • A number of potential applications:
  • Learning language models and dictionaries for resource-poor languages
  • Elegant handling of spoken language
  • Semi-supervised learning
SLIDE 43

Thank You

SLIDE 44

Extra Slides

SLIDE 45

Vocabulary/Model Complexity

                            1-gram   2-gram   3-gram   Gold Standard 3-gram
Vocabulary                  4480     1351     708      2858
Average Word Length (Syl.)  2.03     1.37     1.18     1.73
Language Model States       4480     16150    38759    34073
Spelling Model States       9624     3869     2426     8378

SLIDE 46

Words learned

Particles:
  word      English           # (rank)
  no        possessive        1052 (1)
  ni        positional        830 (2)
  to        and               685 (5)

Subwords:
  word      English            # (rank)
  ka        particle, subword  713 (3)
  to:       subword            204 (27)
  sai       subword            94 (65)

Colloquial Expressions:
  word      English           # (rank)
  yu:       say (colloq.)     324 (19)
  e:        filler            202 (28)
  desune    discourse marker  94 (65)

Content/Function Words:
  word      English           # (rank)
  koto      thing             189 (32)
  omo       think (stem)      56 (109)
  hanashi   speak             23 (242)

rimasukeredomo, mo:shiage, yu:fu:ni jo:kyo:, kangae, chi:ki, toki, shiteki

SLIDE 47

Experimental Setup (2)

  • Training:
  • 8-117 minutes of continuous speech as training data
  • 0.5-20 second utterances
  • Flat priors on hyperparameters (little influence)
  • 20 samples of burn-in, 50 LM samples
  • Testing:
  • 27 minutes of speech, separate from the training data
  • Lattice rescoring (not speech recognition)
  • Viterbi phoneme strings for each LM sample, combined using ROVER

SLIDE 48

Interesting Pronunciation Variants

  • nippon (Japan) → nippo:n
  • Learned with a long vowel not in the transcription
  • Extra emphasis is put on the name of the country, particularly when using nippon instead of nihon
  • shiteorimasu (is doing) → shitorimasu
  • There are many places where the speakers skip vowels
  • N → nothing
  • Many word-final Ns are not recognized by the AM
  • Perhaps taking these into account would improve AM training?

SLIDE 49

Entropy Evaluation

[Chart: per-syllable entropy (4-7) vs. minutes of training data (7.9-116.7) for: 1-best & syllable 3-gram; 1-best & unsupervised segmentation; lattice & unsupervised segmentation (proposed method); manual transcription and segmentation]

  • The gain over 1-best is much lower here. Why?
  • Different pronunciations than the transcription, e.g. shiteorimasu → shitorimasu
  • Large effect on entropy, small effect on PER
SLIDE 50

Future Work: Grounding

  • The model learns a segmented phoneme string
  • For transcription, use actual text
  • Grounding with a grapheme string without pronunciations (subtitles?)
  • In semi-supervised learning, phonetic pronunciations of unknown words are often sufficient
  • For dialog, use semantic grounding
  • Use a robot with cameras, match images to words
SLIDE 51

Future Work: Integration with HMM

  • We currently work on lattices; direct integration with the HMMs should give better results (for both training and testing)

[Chart: phoneme error rate (24.0%-28.0%) vs. minutes of training data (0-116.7) for Proposed 3-gram, Oracle 3-gram, and Transcript 3-gram]

SLIDE 52

Future Work: Implementation

  • Speed
  • Expanding the FST lattice and forward filtering take a fair amount of time (0.5-1 times real time)
  • Several ways to improve:
  • Perform beam-search trimming during forward filtering
  • Parallel sampling
  • Open source
  • Will be made open source pending code clean-up
  • Goal: mid-September
SLIDE 53

Formal Modeling

  • For text word segmentation, P(X|W) = 1, but for speech this is not the case
  • Our new objective is the joint probability of the model and the acoustic features:

    P(X,W,G) = P(X|W) P(W|G) P(G)
               (acoustic model, language model, prior)

  • Use an acoustic model scaling factor λ:

    P(X,W,G) = P(X|W)^λ P(W|G) P(G)

  • Set to 0.2 (values between 0.1 and 0.2 produced similar results)
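
In log space the scaling factor is just a multiplier on the acoustic term; a minimal sketch (the function and argument names are ours, not the paper's):

```python
def scaled_joint_log_prob(log_p_am, log_p_lm, log_p_prior, lam=0.2):
    """log P(X,W,G) = lambda * log P(X|W) + log P(W|G) + log P(G)."""
    return lam * log_p_am + log_p_lm + log_p_prior
```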

SLIDE 54

Weighted Finite State Transducers (WFSTs)

  • Finite-state automata with input/output/weight
  • Define weighted relations over strings
  • If the weights are probabilities, the relations are probabilistic
  • Transducers are combined through composition

[Diagram: A has edges a/s:2 and b/t:1; B has edges s/x:0.5 and t/y:3; the composition A∘B has edges a/x:2.5 and b/y:4]
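
A sketch of composition on the single-state example above, with weights combined by addition (as in the tropical/log semiring), matching a/x:2.5 and b/y:4; the dict encoding is a deliberate simplification of real multi-state WFSTs:

```python
# Each single-state transducer maps an input symbol to (output symbol, weight).
A = {"a": ("s", 2.0), "b": ("t", 1.0)}
B = {"s": ("x", 0.5), "t": ("y", 3.0)}

def compose(A, B):
    """A∘B: match A's output symbols to B's input symbols; weights add."""
    return {a_in: (B[mid][0], w_a + B[mid][1])
            for a_in, (mid, w_a) in A.items() if mid in B}

print(compose(A, B))  # {'a': ('x', 2.5), 'b': ('y', 4.0)}
```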

SLIDE 55

Connecting Edges in Detail

  • To the SM from the base state
  • Equal to the probability of generating a symbol from the base distribution

[Diagram: SM fragment with states ε, <s>, c1, c2 and edges ε:p(FB), ε:p(</s>|c1), ε:p(</s>|c2)]

  • In the HPYLM, n-grams with an unknown word as wi-1 are equal to the base probabilities*:

    P(wi | wi-2, UNK) = P(wi | UNK) = P(wi)

  • So it is OK to make edges from the SM only to the base state

* technically not true if the same word appears twice in a single sentence

SLIDE 56

Difference from Mochihashi's Method

                        Mochihashi                      Neubig
Spelling Model          ∞-gram + Poisson distribution   Character 3-gram
                        (explicit length limit)         (no length limit)
Implementation          Algorithmic (faster?)           WFST-based (simpler?, lattices possible)
Worst-Case Complexity   O(ML^n)                         O(M^(n+1))
Expected Complexity     O(ML^n)                         O(kM + E)

M = sentence length, L = max word length, n = n-gram length,
E = number of existing word n-grams, k = spelling model n