On the Choice of Modeling Unit for Sequence-to-Sequence Speech Recognition
Kazuki Irie, Rohit Prabhavalkar, Anjuli Kannan, Antoine Bruguier, David Rybach, Patrick Nguyen
Interspeech 2019, Graz, Austria


SLIDE 1

On the Choice of Modeling Unit for Sequence-to-Sequence Speech Recognition

Kazuki Irie, Rohit Prabhavalkar, Anjuli Kannan, Antoine Bruguier, David Rybach, Patrick Nguyen

Interspeech 2019, Graz, Austria September 19, 2019


SLIDE 2

Background

Sequence-to-sequence speech recognition

  • Directly outputs character-based units: graphemes, BPE, word-pieces.
  • Key to making it “end-to-end”: no need for a pronunciation lexicon.
  • Jointly learns acoustic, pronunciation, and language modeling in a single model.
  • But phonemes may be more natural units for acoustic modeling.
  • In hybrid systems, context-dependent phonemes work best.


SLIDE 3

Previous Work

Sainath et al. (ICASSP 2018) “No Need For A Lexicon? Evaluating The Value Of The Pronunciation Lexica In End-To-End Models”

  • Grapheme models outperform phoneme-based S2S models.
  • Phoneme models win on proper nouns, e.g.,

[Grapheme] Charles Lindberg vs. [Phoneme] Charles Lindbergh (correct!)

  • Very large-scale tasks: 12.5K-hour English Voice Search (and a 27.5K-hour multi-dialect task).


SLIDE 4

Unanswered Questions

  • Does the phoneme-vs.-grapheme trend depend on the amount of training data?

In hybrid systems, the amount of data matters, e.g., Sung et al. (ICASSP 2008), “Revisiting graphemes with increased amount of data”.

  • Can we make use of the pronunciation lexicon to improve S2S ASR?

Example from Sainath et al.: [Grapheme] Charles Lindberg vs. [Phoneme] Charles Lindbergh (correct!)


SLIDE 5

Systematic evaluation on a publicly available dataset

This work:

  • Evaluate phonemic models under different amounts of data on a publicly available dataset (LibriSpeech).
  • Investigate the complementarity of phoneme- and character-based units:

Separate phonemic and grapheme/word-piece models.

Model combination approach: the phoneme-vs.-grapheme decision is left to the score combination method (no decision taken by the model).

Analysis of output hypotheses.


SLIDE 6

LibriSpeech Dataset

  • Training data settings: 100h, 460h, and 960h.
  • Evaluation data: Dev (clean, other), Test (clean, other)

Pronunciation lexicon (official)

  • 200K vocabulary.
  • Pronunciations with stress (“ah0”, “ah1”): 70 phonemes.
  • Average number of pronunciations per word: 1.03
  • Average word length in terms of phonemes: 6.5 (Max: 20)
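For illustration, the lexicon statistics above could be computed from a lexicon in the common “WORD PH1 PH2 …” plain-text format. This is a minimal sketch, not the authors' tooling; the demo entries are made up.

```python
from collections import defaultdict

def lexicon_stats(lines):
    """Compute the slide's lexicon statistics from 'WORD PH1 PH2 ...' lines."""
    prons = defaultdict(list)          # word -> list of pronunciations
    phoneme_set = set()
    for line in lines:
        word, *phones = line.split()
        prons[word].append(phones)
        phoneme_set.update(phones)     # stressed variants (ah0, ah1) count separately
    n_words = len(prons)
    n_prons = sum(len(v) for v in prons.values())
    return {
        "vocab_size": n_words,
        "num_phonemes": len(phoneme_set),
        "avg_prons_per_word": n_prons / n_words,
        "avg_phonemes_per_word": sum(len(p) for v in prons.values() for p in v) / n_prons,
        "max_phonemes_per_word": max(len(p) for v in prons.values() for p in v),
    }

# Tiny made-up demo lexicon (two pronunciations for "the"):
demo = ["the dh ah0", "the dh iy0", "city s ih1 t iy0"]
stats = lexicon_stats(demo)
```

Run over the official 200K-word LibriSpeech lexicon, this kind of scan yields the numbers on the slide (70 phonemes, 1.03 pronunciations per word, etc.).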

N-gram word LMs (official)

  • When needed, we use the 3-gram LM (no pruning).
  • Trained on 800M words of extra text-only data.


SLIDE 7

Model Architecture

Standard Listen, Attend and Spell model (LAS)

Chan et al. & Zhang et al. (ICASSP 2016 / 2017)

  • Input: 80-dim log-mel features + delta + acceleration coefficients.
  • 2 CNN layers on the input (time reduction of 4).
  • Bi-directional LSTM encoder.
  • Attention and LSTM decoder.
  • Models trained with TensorFlow Lingvo:

https://github.com/tensorflow/lingvo

  • We consider 3 model sizes:

Small / Medium / Large, with LSTM sizes 256 / 512 / 1024, to find the best fit for different scenarios.
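As a back-of-the-envelope check of the input layout above: 80 log-mel dims tripled by delta and acceleration coefficients give 240 dims per frame, and two stride-2 CNN layers reduce the time axis by 4. This sketch assumes plain stride-2 convolutions; the exact padding in the Lingvo implementation may differ.

```python
import math

def frontend_shape(num_frames, num_mel=80, conv_strides=(2, 2)):
    """Output time length and feature dim of the LAS input frontend:
    static + delta + delta-delta features, then stride-2 convs in time."""
    feat_dim = num_mel * 3            # 80 log-mel + deltas + acceleration = 240
    t = num_frames
    for s in conv_strides:            # each conv layer downsamples time by its stride
        t = math.ceil(t / s)
    return t, feat_dim

t, d = frontend_shape(1000)           # e.g. a 10-second utterance at 10 ms frames
```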


SLIDE 8

Baseline Word-Piece/Graphemic Models

Model Unit       Params.  Dev-Clean  Dev-Other  Test-Clean  Test-Other
Grapheme         35 M     5.3        15.6       5.6         15.8
Grapheme         130 M    5.3        15.2       5.5         15.3
Word-Piece 16K   60 M     4.9        14.0       5.0         14.1
Word-Piece 16K   180 M    4.4        13.2       4.7         13.4
 + LSTM LM                3.3        10.3       3.6         10.3

  • Good baselines without SpecAugment (Park et al. Interspeech 2019!).

(960h training set)

SLIDE 9

Phonemic models

Convert the target transcriptions to phoneme sequences and train the sequence-to-sequence model on them.


Introduces two issues:

  • For training, we must choose a pronunciation for words with multiple pronunciation variants.
  • For recognition, homophones cannot be distinguished without an LM.

In our phonemic models:

  • Randomly choose one of the pronunciations for each word and fix this deterministic mapping before training. → Trade-off for simplicity.
  • WFST decoder: beam search constrained by a lexicon (L), combined with a language model (G) score.
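The “randomly choose one pronunciation, then fix the mapping before training” step could be sketched as follows. The function name and seeding scheme are illustrative (the paper does not publish this code); the key point is that a fixed seed makes the random choice reproducible, so the word-to-phonemes mapping is deterministic across the whole training run.

```python
import random

def build_deterministic_lexicon(lexicon, seed=0):
    """Pick one pronunciation per word at random, once, before training.
    A fixed seed plus sorted iteration makes the mapping reproducible."""
    rng = random.Random(seed)
    return {word: rng.choice(prons) for word, prons in sorted(lexicon.items())}

# Made-up lexicon fragment: "the" has two pronunciation variants.
lexicon = {
    "the": [["dh", "ah0"], ["dh", "iy0"]],
    "city": [["s", "ih1", "t", "iy0"]],
}
mapping = build_deterministic_lexicon(lexicon)
```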

SLIDE 10

Training phonemic models (cont’d)

Further specifications:

  • End-of-word token <eow> in the vocabulary.
  • Sentence-end token <eos> in the vocabulary.
  • Out-of-lexicon words: <unk> <eow>.

Further consequence: a phoneme-level LM in the decoder? A weak class-based LM?

(1) as i approached the city i heard bells ringing
(2) eze ai approached thaa citty aie her'd belhs ringing
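The target construction described above can be sketched as a short function: each word maps to its (fixed) pronunciation followed by <eow>, out-of-lexicon words become <unk> <eow>, and the whole sequence ends with <eos>. The helper name is illustrative, and the mini-lexicon below is made up.

```python
def to_phoneme_targets(words, mapping):
    """Convert a word transcript into phonemic training targets:
    phonemes + <eow> per word, <unk> <eow> for OOV words, <eos> at the end."""
    targets = []
    for w in words:
        targets += mapping.get(w, ["<unk>"])   # out-of-lexicon word -> <unk>
        targets.append("<eow>")                # end-of-word marker
    targets.append("<eos>")                    # sentence-end marker
    return targets

mapping = {"i": ["ay1"], "heard": ["hh", "er1", "d"]}
seq = to_phoneme_targets(["i", "heard", "bozzle"], mapping)   # "bozzle" is OOV
```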


SLIDE 11

WER results on 960h training set

Model            Dev-Clean  Dev-Other  Test-Clean  Test-Other
Phoneme + LG     5.6        15.8       6.2         15.8
Grapheme         5.3        15.2       5.5         15.3
Word-Piece 16K   4.4        13.2       4.7         13.4

  • Use the official word-level 3-gram LM as G (trained on the extra 800M-word data).
  • Phoneme + LG is comparable to, but slightly worse than, the grapheme model: similar observations to Sainath et al.
  • No improvement for the grapheme model when decoded with L and G.


SLIDE 12

Examples where the phonemic model wins over the word-piece model

  • bartley is unseen in the training set.
  • kirkland is in the training set.

Word-Piece                     Phoneme
when did you come partly       when did you come bartley
kerklin jumped for the jetty   kirkland jumped for the jetty
man's eyes were made fixed     man's eyes remained fixed

SLIDE 13

Does this extrapolate to other data size scenarios?

          Unique words   Unseen word rate (%)
                         Dev-Clean  Dev-Other  Test-Clean  Test-Other
Lexicon   200 K          0.3        0.6        0.4         0.5
960h      89 K           0.6        0.8        0.6         0.8
460h      66 K           0.9        1.2        1.0         1.3
100h      34 K           2.5        2.5        2.4         2.8

  • Less data: higher unseen-word rate.
  • The word-piece model is trained on the corresponding portion of the data.
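The unseen-word rates in the table can be computed with a trivial scan, assuming the rate is measured over running words in the evaluation transcripts (the vocabulary and word list below are made up for the demo):

```python
def unseen_word_rate(eval_words, train_vocab):
    """Percentage of running eval words that never occur in the training vocabulary."""
    unseen = sum(1 for w in eval_words if w not in train_vocab)
    return 100.0 * unseen / len(eval_words)

train_vocab = {"a", "b", "c"}                       # toy training vocabulary
rate = unseen_word_rate(["a", "b", "d", "a"], train_vocab)   # "d" is unseen
```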


SLIDE 14

WER results on 460h and 100h training data

  • Lower-resource conditions are not more favorable to phonemic models.
  • Large degradation with less data (compared with hybrid systems).

Train data  Model            Dev-Clean  Dev-Other  Test-Clean  Test-Other
460h        Phoneme + LG     7.6        27.3       8.5         27.8
            Grapheme         6.4        23.5       6.8         24.1
            Word-Piece 16K   5.7        21.8       6.5         22.5
100h        Phoneme + LG     13.8       38.9       14.3        40.9
            Grapheme         11.6       36.1       12.0        38.0
            Word-Piece 16K   12.7       33.9       12.9        35.5

SLIDE 15

Model combination experiments

Can we benefit from the phonemic model through model combination?

Rescoring

  • Generate an N-best list from one LAS model.
  • Rescore with another model.
  • Log-linear score combination with a weight optimized on the dev set.

Cross-rescoring and union of N-best lists

  • Generate N-best lists from both models.
  • Cross-rescore.
  • Extract the 1-best from the union.
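The two combination schemes above could be sketched as follows. Names are illustrative, scores are treated as log-probabilities, and the combination weight would be tuned on the dev set as stated on the slide:

```python
def rescore(nbest, other_scores, weight):
    """Log-linear rescoring of one model's N-best list with another model:
    combined = s1 + weight * s2; returns hypotheses sorted best-first."""
    combined = [(hyp, s1 + weight * other_scores[hyp]) for hyp, s1 in nbest]
    return sorted(combined, key=lambda x: -x[1])

def cross_rescore_union(nbest_a, nbest_b, scorer_a, scorer_b, weight):
    """Cross-rescoring: score the union of both N-best lists with both
    models and extract the single best hypothesis."""
    hyps = {h for h, _ in nbest_a} | {h for h, _ in nbest_b}
    scored = [(h, scorer_a(h) + weight * scorer_b(h)) for h in hyps]
    return max(scored, key=lambda x: x[1])[0]

# Toy example with made-up log-scores:
nbest = [("a", -1.0), ("b", -1.2)]
other = {"a": -3.0, "b": -1.0}
ranked = rescore(nbest, other, 0.5)               # "b" overtakes "a" after combination

best = cross_rescore_union(
    [("a", -1.0)], [("b", -2.0)],
    scorer_a=lambda h: {"a": -1.0, "b": -2.0}[h],
    scorer_b=lambda h: {"a": -3.0, "b": -0.5}[h],
    weight=0.5,
)
```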


SLIDE 16

8-best list rescoring results

Model          Dev-Clean   Dev-Other    Test-Clean  Test-Other
Word-Pieces    4.4 (2.4)   13.2 (9.2)   4.7 (2.6)   13.4 (9.1)
 + Phoneme     4.1         12.4         4.3         12.4
 + Grapheme    4.0         12.3         4.3         12.3
 + Both        3.9         12.2         4.3         12.2

(In parentheses: oracle WERs of the 8-best list.)

  • Improvements of up to 8% rel. WER.
  • Similar improvements with graphemic or phonemic model rescoring.

SLIDE 17

8-best list rescoring results (cont'd)

Model           Dev-Clean   Dev-Other     Test-Clean  Test-Other
Word-Pieces     4.4 (2.4)   13.2 (9.2)    4.7 (2.6)   13.4 (9.1)
 + Phoneme      4.1         12.4          4.3         12.4
 + Grapheme     4.0         12.3          4.3         12.3
 + Both         3.9         12.2          4.3         12.2
Phoneme + LG    5.6 (4.9)   15.8 (14.4)   6.2 (5.5)   15.8 (14.7)
 + Word-Piece   5.4         15.5          6.0         15.5

(In parentheses: oracle WERs of the 8-best list.)

  • Improvements of up to 8% rel. WER.
  • Similar improvements with graphemic or phonemic model rescoring.
  • Limited improvements from rescoring the phonemic hypotheses.

SLIDE 18

Examples where Word-Piece + Grapheme + Phoneme wins over Word-Piece + Grapheme

Word-Piece + Grapheme                Word-Piece + Grapheme + Phoneme
Oh bartly did you write to me        Oh bartley did you write to me
… lettuce leaf with mayonna is …     … lettuce leaf with mayonnaise …
… eyes blaze of indignation          … eyes blazed with indignation

SLIDE 19

Rescoring with an auxiliary decoder

Why not put two decoders into the same model? Shared encoder + two separate attention/decoder modules.

Model                 Params.  Dev-Clean  Dev-Other  Test-Clean  Test-Other
Word-Pieces           180 M    4.4        13.2       4.7         13.4
 + Auxiliary Phoneme  210 M    4.3        13.0       4.6         13.1
 + Separate Phoneme   310 M    4.1        12.4       4.3         12.4

  • The separate model gives better improvements (with more parameters).

SLIDE 20

Union of N-bests vs. Simple rescoring


Is it useful to decode hypotheses with the phonemic model?

  • Limited improvements.

Model        # Hyp.  Dev-Clean  Dev-Other  Test-Clean  Test-Other
Word-Piece   8       4.4        13.2       4.7         13.4
 + Phoneme   8       4.1        12.4       4.3         12.4
Union        16      4.1        12.4       4.3         12.3

SLIDE 21

Union of N-bests vs. Simple rescoring


Is it useful to decode hypotheses with the phonemic model?

  • Limited improvements.
  • Better to increase the beam size of the word-piece model and rescore.

Model        # Hyp.  Dev-Clean  Dev-Other  Test-Clean  Test-Other
Word-Piece   8       4.4        13.2       4.7         13.4
 + Phoneme   8       4.1        12.4       4.3         12.4
Union        16      4.1        12.4       4.3         12.3
Word-Piece   16      4.4        13.2       4.7         13.4
 + Phoneme   16      4.0        12.3       4.3         12.2

SLIDE 22

How do the N-best hypotheses differ?

Reference: bozzle had always waited upon him with a decent coat and a well brushed hat and clean shoes

Word-Piece 8-best beam:
basil had always waited upon him with a decent coat and a well brushed hat and clean shoes
bazil had always waited upon him with a decent coat and a well brushed hat and clean shoes
basle had always waited upon him with a decent coat and a well brushed hat and clean shoes
bosel had always waited upon him with a decent coat and a well brushed hat and clean shoes
bosal had always waited upon him with a decent coat and a well brushed hat and clean shoes
bosell had always waited upon him with a decent coat and a well brushed hat and clean shoes
bazil had always waded upon him with a decent coat and a well brushed hat and clean shoes
bossel had always waited upon him with a decent coat and a well brushed hat and clean shoes

  • Bozzle is unseen in training (and not in the lexicon).
  • Hypotheses contain multiple alternatives for this difficult word.


SLIDE 23

Reference: bozzle had always waited upon him with a decent coat and a well brushed hat and clean shoes

Phoneme + LG 8-best beam:
basil had always waited upon him with a decent coat and a well brushed hat and clean shoes
basil had always waited upon him with a decent coat and a well brushed hat and clean shews
bazil had always waited upon him with a decent coat and a well brushed hat and clean shoes
basil had always waited upon him with a decent coat and a well brushed hat and cleane shoes
basil had always waited upon him with a decent coat and a well brushed hat and clean shoos
basil had always waited upon him with a decent coat and a well brushed hat and clean shues
basil had always waited upon him with a decent coat and a well brushed hat and clean shooes
basil had always waited upon him with a decent coat and ay well brushed hat and clean shoes

  • Difficulty with homophones and out-of-lexicon words.


SLIDE 24

Conclusion

  • Revisited the choice of modeling units for S2S ASR models: even low-resource scenarios do not favor phonemes.
  • Rescoring word-piece N-best lists with phonemic or graphemic models gives good improvements.
  • Marginal benefits of phonemic hypotheses.

Still to be investigated:

  • Effect of language model performance in the phonemic system.
  • Potential improvements for homophones via better LM/search strategies?
  • Experiments with RNN-T (or other streaming S2S models).


SLIDE 25

Thank you for your attention

Thanks to

  • Tara Sainath and Yu Zhang for helpful discussions!
  • Jinxi Guo for sharing his LM setups!
