SLIDE 1 Multilingual Speech Recognition With A Single End-To-End Model
Shubham Toshniwal1, Tara N. Sainath2, Ron J. Weiss2, Bo Li2, Pedro Moreno2, Eugene Weinstein2, and Kanishka Rao2
1TTI Chicago 2Google
April 18, 2018
SLIDE 2 Why Multilingual Speech Recognition Models?
◮ Remarkable progress in speech recognition in the past few years
◮ Most of this success is restricted to high-resource languages, e.g. English
◮ Google Voice Search supports ∼120 out of 7000 languages
◮ Multilingual models:
  ◮ Utilize knowledge transfer across languages, and thus alleviate data requirements
  ◮ Successful in Neural Machine Translation (Google NMT)
  ◮ Easier to deploy and maintain
SLIDE 3 Conventional ASR Systems
◮ Traditional ASR systems are modular
◮ Require expert-curated resources

[Figure: pipeline diagram — speech → feature extraction → feature vectors → decoder (acoustic model, pronunciation dictionary, language model) → words, e.g. ‘‘recognize speech’’]
SLIDE 4 Conventional ASR Systems
◮ Traditional ASR systems are modular
◮ Require expert-curated resources

[Figure: same ASR pipeline diagram as the previous slide]

◮ Multilingual models:
  ◮ Focus on just the acoustic model (Lin, 2009; Ghoshal, 2013)
  ◮ Separate language model and pronunciation model required for each language
SLIDE 5 End-to-end ASR Models
◮ Encoder-decoder models achieved state-of-the-art results on the Google Voice Search task (Chiu et al., 2018)
◮ Encoder-decoder models are appealing because:
  ◮ Conceptually simple; they subsume the acoustic model, pronunciation model, and language model in a single model
  ◮ No need for expert-curated resources!

[Figure: encoder-decoder model — acoustic features x_1 .. x_T are encoded into states h_1 .. h_T, and the decoder emits ‘‘<GO> recognize speech <EOS>’’]
SLIDE 6 End-to-End Multilingual ASR Models
u_ti = v^T tanh(W1 h_i + W2 d_t)
α_t = softmax(u_t)
c_t = Σ_{i=1}^{T} α_ti h_i

[Figure: attention-based encoder-decoder — an attention layer over encoder states h_1 .. h_T produces weights α_t and context c_t for the decoder state d_t, which emits output y_t]
◮ We use attention-based encoder-decoder models
◮ The decoder outputs one character per time step
◮ For multilingual models, take the union over character sets
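The attention step above can be sketched in a few lines of NumPy. The dimensions and random weights below are illustrative assumptions, not the model's actual hyperparameters:

```python
# Minimal sketch of additive (Bahdanau-style) attention for one decoder step.
# All sizes and weights are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

T, enc_dim, dec_dim, attn_dim = 5, 8, 8, 16  # assumed sizes
h = rng.standard_normal((T, enc_dim))        # encoder states h_1 .. h_T
d_t = rng.standard_normal(dec_dim)           # decoder state at step t

W1 = rng.standard_normal((attn_dim, enc_dim))
W2 = rng.standard_normal((attn_dim, dec_dim))
v = rng.standard_normal(attn_dim)

# u_ti = v^T tanh(W1 h_i + W2 d_t), computed for all i at once
u = np.tanh(h @ W1.T + d_t @ W2.T) @ v       # shape (T,)

# alpha_t = softmax(u_t)
alpha = np.exp(u - u.max())
alpha /= alpha.sum()

# c_t = sum_i alpha_ti * h_i  — the context vector fed to the decoder
c_t = alpha @ h                              # shape (enc_dim,)
```

The context vector c_t summarizes the encoder states, weighted by how relevant each frame is to the current decoding step.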
SLIDE 7
Multilingual Encoder-Decoder Models
Model        Training         Inference
Joint model  No language ID   No language ID
◮ Naive model; unaware of the multilingual nature of the data
◮ Can potentially handle code-switching
SLIDE 8
Multilingual Encoder-Decoder Models
Model            Training         Inference
Joint model      No language ID   No language ID
Multitask model  Language ID      No language ID
◮ Trained to jointly recognize language ID and speech
SLIDE 9
Multilingual Encoder-Decoder Models
Model              Training         Inference
Joint model        No language ID   No language ID
Multitask model    Language ID      No language ID
Conditioned model  Language ID      Language ID
◮ A learnt embedding of the language ID is fed as input to condition the model
◮ The language ID embedding can be fed into:
  (a) the encoder, (b) the decoder, (c) both encoder & decoder
SLIDE 10
Encoder-Conditioned Model
[Figure: acoustic features x_1 .. x_T, each concatenated with the language embedding e_L, are fed to the encoder, producing states h_1 .. h_T]
Encoder of encoder-conditioned model
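The encoder conditioning shown above can be sketched as follows. The embedding table, the sizes, and the language index are all illustrative assumptions:

```python
# Sketch of encoder conditioning: a learnt language-ID embedding e_L is
# concatenated with every acoustic feature frame before the encoder runs.
# The embedding table, dimensions, and language index are assumptions.
import numpy as np

rng = np.random.default_rng(0)
num_langs, emb_dim, feat_dim, T = 9, 4, 80, 100
lang_table = rng.standard_normal((num_langs, emb_dim))  # learnt embeddings

x = rng.standard_normal((T, feat_dim))  # acoustic features x_1 .. x_T
lang_id = 2                             # hypothetical index for one language
e_L = lang_table[lang_id]

# Append e_L to every frame; the encoder then sees (T, feat_dim + emb_dim)
x_cond = np.concatenate([x, np.tile(e_L, (T, 1))], axis=1)
```

Because the same e_L is repeated at every frame, the encoder can adapt its processing of the acoustics to the identified language.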
SLIDE 11
Task
◮ Recognize 9 Indian languages with a single model
◮ Very little script overlap, except for Hindi and Marathi
◮ The union of character sets is close to 1000 characters!
◮ But the languages have large overlap in phonetic space (Lavanya et al., 2005)
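Taking the union over per-language character sets to build the output vocabulary can be sketched like this; the toy transcripts stand in for real training data:

```python
# Sketch of building a multilingual output vocabulary as the union of
# per-language character sets. The transcripts are toy examples, not the
# actual training data; in practice special tokens (<GO>, <EOS>) are added.
transcripts = {
    "hindi": ["नमस्ते"],
    "tamil": ["வணக்கம்"],
    "bengali": ["নমস্কার"],
}

charset = set()
for utts in transcripts.values():
    for utt in utts:
        charset.update(utt)          # union over all characters seen

vocab = sorted(charset)              # one softmax output per character
```

With the 9 Indian languages' scripts, the same union grows to roughly 1000 output characters.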
SLIDE 12 Experimental Setup
◮ Training data consists of dictated queries
◮ Average of 230K queries (∼170 hrs) per language
[Figure: bar chart of each language's fraction of the total training data (in %), for Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Tamil, Telugu, and Urdu; per-language query counts are 364K, 243K, 213K, 192K, 285K, 227K, 164K, 232K, and 196K]
◮ Baseline: Encoder-decoder models trained for individual
languages
SLIDE 13
Joint vs Individual
[Figure: WER (in %) per language and weighted average, joint model vs individual models]
◮ The joint model outperforms individual models on all languages!
◮ The joint model is not even language-aware at test time
◮ Overall, a 21% relative reduction in Word Error Rate (WER)
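The 21% figure is a relative reduction, (baseline − joint) / baseline. The WER values below are made-up numbers that illustrate the arithmetic, not the paper's results:

```python
# Relative WER reduction: how much of the baseline's error the new model
# removes. The example WERs are invented purely to show the arithmetic.
def relative_wer_reduction(baseline_wer: float, model_wer: float) -> float:
    return (baseline_wer - model_wer) / baseline_wer

# e.g. a baseline WER of 30.0% dropping to 23.7% is a 21% relative reduction
r = relative_wer_reduction(30.0, 23.7)
```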
SLIDE 14 Picking the Right Script
[Figure: script confusion matrix for the joint model — rows are spoken languages, columns are output scripts; diagonal entries range from 0.93 to 1.0]
Rarely confused between languages
SLIDE 15
Joint vs Multitask
[Figure: WER (in %) per language and weighted average, joint vs multitask models]
Insignificant gains from multitask training
SLIDE 16 Joint vs Conditioned Models
[Figure: WER (in %) per language and weighted average, joint vs decoder-conditioned vs encoder-conditioned models]
◮ As expected, conditioning the model on the language ID of
speech helps
◮ Encoder conditioning:
  ◮ Performs better than decoder conditioning
  ◮ Suggests acoustic model adaptation is happening
SLIDE 17 Magic of Conditioning
[Figure: script confusion matrix for the conditioned model — every diagonal entry is 1.0; each language is always transcribed in its own script]
SLIDE 18
Testing the Limits: Code Switching
◮ Can the joint model code-switch between 2 Indian languages (having been trained to recognize them separately)?
SLIDE 19 Testing the Limits: Code Switching
◮ Can the joint model code-switch between 2 Indian languages (having been trained to recognize them separately)?
◮ Artificial test set of 1000 utterances: a Tamil query followed by a Hindi query, with 50 ms of silence in between
◮ The model does not code-switch :(
◮ It picks one of the two scripts and sticks with it
◮ From manual inspection:
  ◮ Transcribes either the Hindi or the Tamil part in the corresponding script
  ◮ Transliteration in rare cases
SLIDE 20
Feeding the Wrong Language ID
◮ Does the model obey acoustics or is it faithful to language ID?
SLIDE 21
Feeding the Wrong Language ID
◮ Does the model obey acoustics or is it faithful to the language ID?
◮ Artificial dataset of 1000 Urdu queries tagged as Hindi
◮ The model transliterates the Urdu queries into Hindi's script
◮ It learns to disentangle the acoustic-phonetic content from the language identity
◮ A transliterator as a byproduct!
SLIDE 22 Conclusion
◮ Encoder-decoder models:
  ◮ An elegant and simple framework for multilingual models
  ◮ Outperform models trained for specific languages
  ◮ Rarely confused between individual languages
  ◮ Fail at code-switching
◮ Recent work along similar lines also reports promising results (Kim, 2017; Watanabe, 2017; Tong, 2018; Dalmia, 2018)
◮ Questions?
SLIDE 23
Conditioning Encoder is Enough
[Figure: WER (in %) per language and weighted average, encoder-conditioned vs encoder+decoder-conditioned models]
◮ Conditioning the decoder on top of conditioning the encoder does not buy us much
◮ Possibly because the attention mechanism feeds in
information from the encoder to the decoder