SLIDE 1 Multilingual Speech Recognition With A Single End-To-End Model
Shubham Toshniwal1, Tara N. Sainath2, Ron J. Weiss2, Bo Li2, Pedro Moreno2, Eugene Weinstein2, and Kanishka Rao2
1TTI Chicago 2Google
April 18, 2018
SLIDE 2 Why Multilingual Speech Recognition Models?
◮ Remarkable progress in speech recognition in the past few years
◮ Most of this success is restricted to high-resource languages, e.g. English
◮ Google Voice Search supports ∼120 out of 7000 languages
◮ Multilingual models:
  ◮ Utilize knowledge transfer across languages, and thus alleviate data requirements
  ◮ Successful in Neural Machine Translation (Google NMT)
  ◮ Easier to deploy and maintain
SLIDE 3 Conventional ASR Systems
◮ Traditional ASR systems are modular
◮ Require expert-curated resources

[Figure: pipeline diagram — speech → feature extraction → feature vectors → decoder (acoustic model, pronunciation dictionary, language model) → words, e.g. ‘‘recognize speech’’]
SLIDE 4 Conventional ASR Systems
◮ Traditional ASR systems are modular
◮ Require expert-curated resources

[Figure: same ASR pipeline diagram as the previous slide]

◮ Multilingual models:
  ◮ Focus on just the acoustic model (Lin, 2009; Ghoshal, 2013)
  ◮ Separate language model and pronunciation model required for each language
SLIDE 5 End-to-end ASR Models
◮ Encoder-decoder models achieved state-of-the-art results on the Google Voice Search task (Chiu et al., 2018)
◮ Encoder-decoder models are appealing because:
  ◮ Conceptually simple; they subsume the acoustic model, pronunciation model, and language model in a single model
  ◮ No need for expert-curated resources!

[Figure: encoder-decoder model — acoustic features x_1 .. x_T are encoded into states h_1 .. h_T, and the decoder emits ‘‘<GO> recognize speech <EOS>’’]
SLIDE 6 End-to-End Multilingual ASR Models
u_ti = v^T tanh(W1 h_i + W2 d_t)
α_t = softmax(u_t)
c_t = Σ_{i=1}^{T} α_ti h_i

[Figure: attention-based encoder-decoder — an attention layer over encoder states h_1 .. h_T produces weights α_t and context c_t for the decoder state d_t, which emits output y_t]
◮ We use attention-based encoder-decoder models
◮ The decoder outputs one character per time step
◮ For multilingual models, take the union over character sets
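The attention step above can be sketched in a few lines of NumPy. The dimensions and random weights below are illustrative assumptions, not the model's actual hyperparameters:

```python
# Minimal sketch of additive (Bahdanau-style) attention for one decoder step.
# All sizes and weights are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

T, enc_dim, dec_dim, attn_dim = 5, 8, 8, 16  # assumed sizes
h = rng.standard_normal((T, enc_dim))        # encoder states h_1 .. h_T
d_t = rng.standard_normal(dec_dim)           # decoder state at step t

W1 = rng.standard_normal((attn_dim, enc_dim))
W2 = rng.standard_normal((attn_dim, dec_dim))
v = rng.standard_normal(attn_dim)

# u_ti = v^T tanh(W1 h_i + W2 d_t), computed for all i at once
u = np.tanh(h @ W1.T + d_t @ W2.T) @ v       # shape (T,)

# alpha_t = softmax(u_t)
alpha = np.exp(u - u.max())
alpha /= alpha.sum()

# c_t = sum_i alpha_ti * h_i  — the context vector fed to the decoder
c_t = alpha @ h                              # shape (enc_dim,)
```

The context vector c_t summarizes the encoder states, weighted by how relevant each frame is to the current decoding step.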
SLIDE 7
Multilingual Encoder-Decoder Models
Model        Training         Inference
Joint model  No language ID   No language ID
◮ Naive model; unaware of the multilingual nature of the data
◮ Can potentially handle code-switching
SLIDE 8
Multilingual Encoder-Decoder Models
Model            Training         Inference
Joint model      No language ID   No language ID
Multitask model  Language ID      No language ID
◮ Trained to jointly recognize language ID and speech
SLIDE 9
Multilingual Encoder-Decoder Models
Model              Training         Inference
Joint model        No language ID   No language ID
Multitask model    Language ID      No language ID
Conditioned model  Language ID      Language ID
◮ A learnt embedding of the language ID is fed as input to condition the model
◮ The language ID embedding can be fed into:
  (a) the encoder, (b) the decoder, (c) both encoder & decoder
SLIDE 10
Encoder-Conditioned Model
[Figure: acoustic features x_1 .. x_T, each concatenated with the language embedding e_L, are fed to the encoder, producing states h_1 .. h_T]
Encoder of encoder-conditioned model
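The encoder conditioning shown above can be sketched as follows. The embedding table, the sizes, and the language index are all illustrative assumptions:

```python
# Sketch of encoder conditioning: a learnt language-ID embedding e_L is
# concatenated with every acoustic feature frame before the encoder runs.
# The embedding table, dimensions, and language index are assumptions.
import numpy as np

rng = np.random.default_rng(0)
num_langs, emb_dim, feat_dim, T = 9, 4, 80, 100
lang_table = rng.standard_normal((num_langs, emb_dim))  # learnt embeddings

x = rng.standard_normal((T, feat_dim))  # acoustic features x_1 .. x_T
lang_id = 2                             # hypothetical index for one language
e_L = lang_table[lang_id]

# Append e_L to every frame; the encoder then sees (T, feat_dim + emb_dim)
x_cond = np.concatenate([x, np.tile(e_L, (T, 1))], axis=1)
```

Because the same e_L is repeated at every frame, the encoder can adapt its processing of the acoustics to the identified language.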
SLIDE 11
Task
◮ Recognize 9 Indian languages with a single model
◮ Very little script overlap, except for Hindi and Marathi
◮ The union of character sets is close to 1000 characters!
◮ But the languages have large overlap in phonetic space (Lavanya et al., 2005)
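Taking the union over per-language character sets to build the output vocabulary can be sketched like this; the toy transcripts stand in for real training data:

```python
# Sketch of building a multilingual output vocabulary as the union of
# per-language character sets. The transcripts are toy examples, not the
# actual training data; in practice special tokens (<GO>, <EOS>) are added.
transcripts = {
    "hindi": ["नमस्ते"],
    "tamil": ["வணக்கம்"],
    "bengali": ["নমস্কার"],
}

charset = set()
for utts in transcripts.values():
    for utt in utts:
        charset.update(utt)          # union over all characters seen

vocab = sorted(charset)              # one softmax output per character
```

With the 9 Indian languages' scripts, the same union grows to roughly 1000 output characters.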
SLIDE 12 Experimental Setup
◮ Training data consists of dictated queries
◮ Average of 230K queries (∼170 hrs) per language
[Figure: bar chart of each language's fraction of the total training data (in %), for Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Tamil, Telugu, and Urdu; per-language query counts are 364K, 243K, 213K, 192K, 285K, 227K, 164K, 232K, and 196K]
◮ Baseline: Encoder-decoder models trained for individual
languages
SLIDE 13
Joint vs Individual
[Figure: WER (in %) per language and weighted average, joint model vs individual models]
◮ The joint model outperforms individual models on all languages!
◮ The joint model is not even language-aware at test time
◮ Overall, a 21% relative reduction in Word Error Rate (WER)
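The 21% figure is a relative reduction, (baseline − joint) / baseline. The WER values below are made-up numbers that illustrate the arithmetic, not the paper's results:

```python
# Relative WER reduction: how much of the baseline's error the new model
# removes. The example WERs are invented purely to show the arithmetic.
def relative_wer_reduction(baseline_wer: float, model_wer: float) -> float:
    return (baseline_wer - model_wer) / baseline_wer

# e.g. a baseline WER of 30.0% dropping to 23.7% is a 21% relative reduction
r = relative_wer_reduction(30.0, 23.7)
```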
SLIDE 14 Picking the Right Script
[Figure: script confusion matrix for the joint model — rows are spoken languages, columns are output scripts; diagonal entries range from 0.93 to 1.0]
Rarely confused between languages
SLIDE 15
Joint vs Multitask
[Figure: WER (in %) per language and weighted average, joint vs multitask models]
Insignificant gains from multitask training
SLIDE 16 Joint vs Conditioned Models
[Figure: WER (in %) per language and weighted average, joint vs decoder-conditioned vs encoder-conditioned models]
◮ As expected, conditioning the model on the language ID of
speech helps
◮ Encoder conditioning:
  ◮ Performs better than decoder conditioning
  ◮ Suggests acoustic model adaptation is happening
SLIDE 17 Magic of Conditioning
[Figure: script confusion matrix for the conditioned model — every diagonal entry is 1.0; each language is always transcribed in its own script]
SLIDE 18
Testing the Limits: Code Switching
◮ Can the joint model code-switch between 2 Indian languages (having been trained to recognize them separately)?
SLIDE 19 Testing the Limits: Code Switching
◮ Can the joint model code-switch between 2 Indian languages (having been trained to recognize them separately)?
◮ Artificial test set of 1000 utterances: a Tamil query followed by a Hindi query, with 50 ms of silence in between
◮ The model does not code-switch :(
◮ It picks one of the two scripts and sticks with it
◮ From manual inspection:
  ◮ Transcribes either the Hindi or the Tamil part in the corresponding script
  ◮ Transliteration in rare cases
SLIDE 20
Feeding the Wrong Language ID
◮ Does the model obey acoustics or is it faithful to language ID?
SLIDE 21
Feeding the Wrong Language ID
◮ Does the model obey acoustics or is it faithful to the language ID?
◮ Artificial dataset of 1000 Urdu queries tagged as Hindi
◮ The model transliterates the Urdu queries into Hindi's script
◮ It learns to disentangle the acoustic-phonetic content from the language identity
◮ A transliterator as a byproduct!
SLIDE 22 Conclusion
◮ Encoder-decoder models:
  ◮ An elegant and simple framework for multilingual models
  ◮ Outperform models trained for specific languages
  ◮ Rarely confused between individual languages
  ◮ Fail at code-switching
◮ Recent work along similar lines also reports promising results (Kim, 2017; Watanabe, 2017; Tong, 2018; Dalmia, 2018)
◮ Questions?
SLIDE 23
Conditioning Encoder is Enough
[Figure: WER (in %) per language and weighted average, encoder-conditioned vs encoder+decoder-conditioned models]
◮ Conditioning the decoder on top of conditioning the encoder does not buy us much
◮ Possibly because the attention mechanism feeds in
information from the encoder to the decoder