Focusing Language Models For Automatic Speech Recognition
Daniele Falavigna, Roberto Gretter (FBK, Italy)


SLIDE 1

Focusing Language Models For Automatic Speech Recognition

Daniele Falavigna, Roberto Gretter (FBK, Italy)

The work leading to these results has received funding from the European Union under grant agreement n° 287658

www.eu-bridge.eu | 12/7/12 | Roberto Gretter – FBK

SLIDE 2

Outline

  • Problem definition
  • Auxiliary data selection
    • TFxIDF
    • Proposed method
    • Perplexity based method
    • Computational issues
    • TFxIDF vs proposed method
  • Experiments
  • Discussion
SLIDE 3

Problem definition

  • Given a general-purpose text corpus and a speech to transcribe
  • Build a LM which is focused on the particular (unknown) topic of the speech
  • No need to be instantaneous, but it should be quick
  • Approach:
    • Perform a first ASR pass
    • Use the recognition output to select text data “similar” to the context
    • Build a focused language model
    • Use the focused language model in the next ASR pass
SLIDE 4

Recognition setup

  [Diagram: two-pass recognition setup. The first ASR step uses the baseline LM (trained on the text corpus) to produce a word graph and a 1-best hypothesis; the 1-best drives automatic selection of the auxiliary corpus, from which the auxiliary LM is trained; the second step rescores the word graph with the auxiliary LM.]

  • off-line
SLIDE 5

Terminology

  • text corpus
    • composed of N rows (N documents)
    • average length of a document: Lc
  • dictionary
    • composed of D terms t_d, 1 ≤ d ≤ D
  • auxiliary corpus
    • composed of rows of the text corpus; size: K words
  • speech to recognize
    • TED talks, average length: Lt

  [Diagram: the text corpus as a document-by-term matrix over terms t1 … tD, with selected rows forming the auxiliary corpus]
SLIDE 6

Auxiliary data selection

  • rationale:
    • score each row in the text corpus against the ASR output
    • sort the rows according to score
    • select the top rows → auxiliary corpus (of size K words)
  • 3 approaches implemented and compared:
    • TFxIDF
    • Proposed method
    • Perplexity based method
  • also compared against domain-specific data (TED LM)
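The rationale above can be sketched as a generic selection loop. The scorer and the rows below are invented toy stand-ins; any of the three scoring methods would plug in as `score`:

```python
# Toy sketch of auxiliary-data selection: score rows against the first-pass
# ASR output, sort best-first, and keep rows until K words are collected.

def select_auxiliary(rows, score, K):
    """Return the best-scoring rows, stopping once K words are reached."""
    ranked = sorted(rows, key=score, reverse=True)
    selected, n_words = [], 0
    for row in ranked:
        if n_words >= K:
            break
        selected.append(row)
        n_words += len(row.split())
    return selected

# Hypothetical scorer for the sketch: word overlap with the ASR 1-best.
asr_words = set("we talk about speech recognition".split())

def overlap(row):
    return len(asr_words & set(row.split()))

rows = ["speech recognition with language models",
        "cooking recipes for pasta",
        "we talk about acoustic models"]
aux = select_auxiliary(rows, overlap, K=8)
```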
SLIDE 7

Auxiliary data selection: TFxIDF

  • for each talk i and for each word t_d compute:
    • tf_d^i = frequency of term t_d inside talk i
    • df_d = number of documents in the corpus containing t_d
  • compute the same for each row R_n in the corpus, 1 ≤ n ≤ N

  • estimate a similarity score:

$$c_i[t_d] = \left(1 + \log \mathrm{tf}_d^i\right)\,\log\frac{D}{\mathrm{df}_d}, \qquad 1 \le d \le D$$

$$s(C_i, R_n) = \frac{C_i \cdot R_n}{|C_i|\,|R_n|}$$
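A minimal sketch of this scoring on an invented toy corpus (the real system computes these statistics over millions of rows; the weighting follows the c_i[t_d] formula above):

```python
import math

def tfidf_vector(words, df, n_docs):
    # c[t] = (1 + log tf_t) * log(N / df_t) for each term t in the document
    return {t: (1 + math.log(words.count(t))) * math.log(n_docs / df[t])
            for t in set(words)}

def cosine(a, b):
    # s(C, R) = C . R / (|C| |R|)
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy corpus rows; document frequencies are derived from them.
rows = [["speech", "recognition", "models"],
        ["pasta", "recipes"],
        ["speech", "models", "training"]]
df = {}
for row in rows:
    for t in set(row):
        df[t] = df.get(t, 0) + 1

talk = ["speech", "recognition"]   # stands in for the first-pass ASR 1-best
tv = tfidf_vector(talk, df, len(rows))
scores = [cosine(tv, tfidf_vector(r, df, len(rows))) for r in rows]
```

The row sharing the most informative terms with the talk gets the highest score, while the unrelated row scores zero.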

SLIDE 8

Auxiliary data selection: Proposed method

  • sort the words in the dictionary according to frequency
  • discard the most frequent words (rank < D1 = 100)
    • they don’t carry semantic information
  • discard the most rare words (rank > D2 = 200K)
    • too rare to help; they include typos
  • replace words in the corpus by their index in the dictionary
  • sort the indices in each row to allow quick comparison
  • estimate a similarity score:

$$s'(C'_i, R'_n) = \frac{\mathrm{common}(C'_i, R'_n)}{\dim(C'_i) + \dim(R'_n)}$$
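A sketch of this preprocessing, reproducing the deck's own example sentence and index values (the frequency ranks are those shown in the example; D1 and D2 are the thresholds above):

```python
# Encode a row for the proposed method: map words to their frequency rank
# in the dictionary, discard ranks below D1 (too frequent) or above D2
# (too rare), and keep the surviving indices sorted.

D1, D2 = 100, 200_000

# Frequency ranks taken from the slides' example sentence.
RANK = {"i": 47, "would": 54, "like": 108, "your": 264, "advice": 2837,
        "about": 63, "rule": 1019, "one": 6, "hundred": 12, "forty": 65,
        "three": 24, "concerning": 4890, "inadmissibility": 166476}

def encode_row(words):
    idx = [RANK[w] for w in words if w in RANK and D1 <= RANK[w] <= D2]
    return sorted(idx)

sentence = ("i would like your advice about rule one hundred "
            "forty three concerning inadmissibility").split()
print(encode_row(sentence))   # [108, 264, 1019, 2837, 4890, 166476]
```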

SLIDE 9

Auxiliary data selection: Proposed method

  • example:
    • sentence: I would like your advice about rule one hundred forty three concerning inadmissibility
    • word indices: 47 54 108 264 2837 63 1019 6 12 65 24 4890 166476
    • after discarding too frequent and too rare words: 108 264 2837 1019 4890 166476 (like your advice rule concerning inadmissibility)
    • sorted: 108 264 1019 2837 4890 166476
SLIDE 10

Auxiliary data selection: Proposed method

  • similarity score computation:
    • scan the two sorted index lists in parallel, always advancing the pointer at the lower index
    • example:

      155 264 2222 2345 2837 166476
      108 264 1019 2837 4890 166476

      common indices: 264, 2837, 166476 → score = 3 / 12
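The comparison can be sketched as a merge over the two sorted index lists, using the example rows above:

```python
# Merge-style similarity on sorted index lists: advance the pointer at the
# lower index; on a match, count it and advance both.
# Score = common(C', R') / (dim(C') + dim(R')).

def similarity(a, b):
    i = j = common = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return common / (len(a) + len(b))

row  = [155, 264, 2222, 2345, 2837, 166476]   # a corpus row
talk = [108, 264, 1019, 2837, 4890, 166476]   # the encoded ASR 1-best
print(similarity(row, talk))                  # 3 / 12 = 0.25
```

Because both lists are sorted, each comparison takes a single pass of at most dim(C') + dim(R') steps, which is where the low run-time cost reported later comes from.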

SLIDE 11

Auxiliary data selection: Perplexity based method

  • train a 3-gram LM using the ASR output
  • estimate the perplexity of each row in the corpus
  • use perplexity as a similarity score (the lower the perplexity, the more similar the row)
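A toy sketch of this selection. The trigram LM below uses simple add-one smoothing rather than the smoothing of a production toolkit, and the training text stands in for the ASR 1-best:

```python
import math
from collections import Counter

class Trigram:
    """Minimal add-one-smoothed 3-gram LM trained on one text."""

    def __init__(self, text):
        words = text.split()
        self.vocab = set(words) | {"<unk>"}
        self.tri = Counter(zip(words, words[1:], words[2:]))
        self.bi = Counter(zip(words, words[1:]))

    def logprob(self, w1, w2, w3):
        # Add-one smoothed P(w3 | w1, w2).
        num = self.tri[(w1, w2, w3)] + 1
        den = self.bi[(w1, w2)] + len(self.vocab)
        return math.log(num / den)

    def perplexity(self, text):
        words = [w if w in self.vocab else "<unk>" for w in text.split()]
        lp = sum(self.logprob(*t) for t in zip(words, words[1:], words[2:]))
        return math.exp(-lp / max(len(words) - 2, 1))

# Train on the (toy) ASR output, then rank rows by perplexity, lowest first.
lm = Trigram("we study speech recognition we study language models")
rows = ["we study speech recognition", "pasta recipes and cooking"]
ranked = sorted(rows, key=lm.perplexity)
```

Rows whose word sequences look like the ASR output get low perplexity and end up at the top of the ranking.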
SLIDE 12

Auxiliary data selection: Run time computational complexity

  • corpus size: N (5.7M) rows, average row length L (272)
  • dictionary size: D (1.6M) (D2 = 200K)

                              TFxIDF          Proposed method
      Arithmetic operations   O(2 × N × L)    O(N × L / 2)
      Memory requirements     O(D + N × L)    ---
      Process size            650 MB          10 MB
      Time                    114 min         16 min

SLIDE 13

Training data

  • text corpus
    • Google News: 5.7 M documents, 1.6 G words
    • 272 words per document on average
  • LM for rescoring:
    • 4-gram backoff LM, modified shift smoothing
    • 1.6M unigrams, 73M bigrams, 120M 3-grams and 195M 4-grams
  • FSN for the first & second step:
    • 200K words, 37M bigrams, 34M 3-grams, 38M 4-grams
  • auxiliary corpus
    • the most similar documents, K words in total
SLIDE 14

Test data

  • TED talks (test sets of IWSLT 2011)
  • auxiliary corpus and auxiliary LM computed for each talk
  • performance is reported as a function of K, the number of words used to train the auxiliary LMs

                        dev set (19 talks)    test set (8 talks)
      #words            44505                 12431
      (min, max, mean)  (591, 4509, 2342)     (484, 2855, 1553)

SLIDE 15

Results

  • Perplexity as a function of K (K expressed in Kwords; K = 0 means no interpolation)

  [Plots: perplexity vs. K for NEW and TFIDF; dev set (PP axis 200-250) and test set (PP axis 180-230)]

  • Perplexity when interpolating the baseline LM with a domain-specific LM (trained on ted2011 text, 2 Mwords): dev set 158, test set 142

SLIDE 16

Results

  • WER as a function of K (K expressed in Kwords; K = 0 means no interpolation)

  [Plots: WER vs. K for NEW and TFIDF; dev set (axis 18.5-19.5) and test set (axis 18.0-19.4)]

  • WER when interpolating the baseline LM with a domain-specific LM (trained on ted2011 text, 2 Mwords): dev set 18.7, test set 18.4

SLIDE 17

Conclusion

  • Method for focusing LMs without using in-domain data
  • Comparison between the proposed method and TFxIDF:
    • similar performance
    • less demanding computational requirements
  • Comparable results when using in-domain data
    • in this setting…
  • Future work:
    • how to add new words (to reduce OOV?)
    • instantaneous LM focusing
SLIDE 18

Thank you for your attention

SLIDE 19

LM interpolation

  • LM probability associated to every arc of the word graph:
  • J = number of LMs to combine
  • λ_j = weights estimated to minimize the overall perplexity on a development set

The interpolation weights, λ_base^i and λ_aux^i, associated to the two LMs (LM_base and LM_aux^i) used in the second ASR decoding step, are estimated so as to minimize the overall LM perplexity on the 1-best output (the same used to build the i-th query document).

$$P[w\,|\,h] = \sum_{j=1}^{J} \lambda_j \, P_j[w\,|\,h]$$
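The weight estimation can be sketched with the standard EM updates for a linear mixture of LMs. The held-out per-word probabilities below are invented toy values standing in for LM_base and LM_aux evaluated on the 1-best output:

```python
# EM estimation of mixture weights lambda_j for
# P[w|h] = sum_j lambda_j * P_j[w|h], minimizing perplexity on held-out data.

def em_weights(streams, iters=50):
    """streams[j][k] = P_j of the k-th held-out word under LM j."""
    J, n = len(streams), len(streams[0])
    lam = [1.0 / J] * J                      # uniform initialization
    for _ in range(iters):
        counts = [0.0] * J
        for k in range(n):
            mix = sum(lam[j] * streams[j][k] for j in range(J))
            for j in range(J):
                # posterior responsibility of LM j for word k
                counts[j] += lam[j] * streams[j][k] / mix
        lam = [c / n for c in counts]        # re-normalized weights
    return lam

# Toy held-out probabilities: the auxiliary LM fits the talk better, so it
# should receive the larger weight.
p_base = [0.01, 0.02, 0.01, 0.03]
p_aux  = [0.10, 0.05, 0.20, 0.08]
lam = em_weights([p_base, p_aux])
```

Each EM iteration provably does not increase the held-out perplexity of the mixture, so a few dozen iterations suffice in practice.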