Pre-training on high-resource speech recognition improves low-resource speech-to-text translation

Sameer Bansal, Herman Kamper, Karen Livescu, Adam Lopez, Sharon Goldwater


slide-1
SLIDE 1

Pre-training on high-resource speech recognition improves low-resource speech-to-text translation

Sameer Bansal Herman Kamper Karen Livescu Adam Lopez Sharon Goldwater

slide-2
SLIDE 2

Current systems

Spanish Audio → ? → English text

slide-3
SLIDE 3

Current systems

Spanish Audio → Automatic Speech Recognition → Spanish text: “la mi nombre es hodor” → ? → English text

slide-4
SLIDE 4

Current systems

Spanish Audio → Automatic Speech Recognition → Spanish text: “la mi nombre es hodor” → Machine Translation → English text: “hi my name is hodor”

slide-5
SLIDE 5

~100 languages supported by Google Translate ...

5

slide-6
SLIDE 6

Unwritten languages

~3000 languages with no writing system
Mboshi audio → Automatic Speech Recognition → Mboshi text: not available

Mboshi: Bantu language, Republic of Congo, ~160K speakers

slide-7
SLIDE 7

Unwritten languages

~3000 languages with no writing system

  • Efforts to collect speech and translations using mobile apps
    ○ Aikuma: Bird et al. 2014; LIG-Aikuma: Blachon et al. 2016
  • Mboshi: speech paired with French translations (Godard et al. 2018)

slide-8
SLIDE 8

Haiti Earthquake, 2010

Survivors sent text messages to a helpline:
“Moun kwense nan Sakre Kè nan Pòtoprens” (People trapped in Sacred Heart Church, Port-au-Prince)

  • International rescue teams faced a language barrier
  • No automated tools were available
  • Volunteers from the global Haitian diaspora helped create parallel text corpora in a short time [Munro 2010]

slide-9
SLIDE 9

Are we better prepared in 2019?

Voice messages:
“Moun kwense nan Sakre Kè nan Pòtoprens” (People trapped in Sacred Heart Church, Port-au-Prince)

slide-10
SLIDE 10

Can we build a speech-to-text translation (ST) system, given as training data:

  • tens of hours of speech paired with text translations (source audio paired with translations)
  • no source text available

slide-11
SLIDE 11

Neural models … directly translate speech

Sequence-to-Sequence (Weiss et al. 2017)

Spanish Audio → English text: “hi my name is hodor”

slide-12
SLIDE 12

Closer to real-world conditions: Spanish speech to English text

Spanish Audio → Encoder → Attention → Decoder → English text

  • telephone speech (unscripted)
  • realistic noise conditions
  • multiple speakers and dialects
  • crowdsourced English text translations

slide-13
SLIDE 13

Spanish speech to English text (Weiss et al.)

Good performance if trained on 100+ hours
*for comparison, text-to-text MT = 58 BLEU

slide-14
SLIDE 14

But … poor performance in low-resource settings (Weiss et al.)

*for comparison, text-to-text MT = 58 BLEU

slide-15
SLIDE 15

Goal: to improve translation performance

15

slide-16
SLIDE 16

Goal: to improve translation performance … without labeling more low-resource speech

16

slide-17
SLIDE 17

Key idea: leverage monolingual data from a different high-resource language.

100s of hours of monolingual speech paired with text, e.g. English text (English audio) or French text (French audio), are available and typically used to train ASR systems.
17

slide-18
SLIDE 18

100s of hours of monolingual speech paired with text, e.g. English text (English audio) or French text (French audio), are available and typically used to train ASR systems.

Sequence-to-Sequence: Spanish Audio → ? → English text (~20 hours of Spanish-English)

slide-19
SLIDE 19

100s of hours of monolingual speech paired with text are available and typically used to train ASR systems, e.g. Spanish text (Spanish audio).

Prior work: Weiss et al. 2017; Anastasopoulos and Chiang 2018; Bérard et al. 2018; Sperber et al. 2019

Sequence-to-Sequence: Spanish Audio → English text (~20 hours of Spanish-English)

slide-20
SLIDE 20

100s of hours of monolingual speech paired with text, e.g. English text (English audio) or French text (French audio), are available and typically used to train ASR systems.

Sequence-to-Sequence: Spanish Audio → ? → English text (~20 hours of Spanish-English)

slide-21
SLIDE 21

Why Spanish-English?

21

slide-22
SLIDE 22

Why Spanish-English? To simulate low-resource settings and test our method.

22

slide-23
SLIDE 23

Why Spanish-English? To simulate low-resource settings and test our method. Later: results on a truly low-resource language pair, Mboshi to French.

23

slide-24
SLIDE 24

Method

Same model architecture for ASR and ST:
Audio → Encoder → Attention → Decoder → text

*randomly initialized parameters

slide-25
SLIDE 25

Pretrain on high-resource

English audio → Encoder → Attention → Decoder → English text

300 hours of English audio and text
*train until convergence

slide-26
SLIDE 26

Fine-tune on low-resource

Transfer parameters from English ASR (English audio → English text) to Spanish-English ST (Spanish audio → English text); both share the Encoder → Attention → Decoder architecture.

20 hours of Spanish-English

slide-27
SLIDE 27

Fine-tune on low-resource

Spanish audio → Encoder → Attention → Decoder → English text

20 hours of Spanish-English
*train until convergence
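The pretrain-then-fine-tune recipe on these slides can be sketched as follows. This is a minimal illustration, not the released implementation: the `Seq2Seq` class, its layer sizes, and the helper name `transfer_all` are assumptions.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Same encoder-attention-decoder layout for ASR and ST (sizes illustrative)."""
    def __init__(self, n_feats=13, hidden=256, vocab=1000):
        super().__init__()
        self.encoder = nn.LSTM(n_feats, hidden, num_layers=3,
                               batch_first=True, bidirectional=True)
        self.attention = nn.Linear(2 * hidden, hidden)
        self.decoder = nn.LSTM(hidden, hidden, num_layers=3, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

def transfer_all(pretrained: Seq2Seq, target: Seq2Seq) -> Seq2Seq:
    """Initialize the ST model with all parameters of the pretrained ASR model."""
    target.load_state_dict(pretrained.state_dict())
    return target

# 1) Pretrain on 300 hours of English ASR (train until convergence):
asr_model = Seq2Seq()
# ... training loop over (English audio, English text) pairs ...

# 2) Fine-tune on ~20 hours of Spanish-English ST, starting from the ASR weights:
st_model = transfer_all(asr_model, Seq2Seq())
# ... training loop over (Spanish audio, English text) pairs ...
```

This works because ASR and ST share the same input modality (speech) and the same output modality (text), so every parameter has a matching slot in both tasks.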

slide-28
SLIDE 28

Will this work?

28

slide-29
SLIDE 29

Spanish-English BLEU scores

[Bar chart: baseline BLEU; for comparison, Weiss et al. = 47.3]

slide-30
SLIDE 30

Spanish-English BLEU scores

[Bar chart: baseline vs. pretraining BLEU; for comparison, Weiss et al. = 47.3]

slide-31
SLIDE 31

Spanish-English BLEU scores

[Bar chart: baseline vs. pretraining BLEU; for comparison, Weiss et al. = 47.3]

  • pretraining: +9 BLEU
slide-32
SLIDE 32

Spanish-English BLEU scores

[Bar chart: baseline vs. pretraining BLEU; for comparison, Weiss et al. = 47.3]

  • better performance with half the data

slide-33
SLIDE 33

Further analysis

[Bar chart: baseline vs. pretraining, 20 hours Spanish-English; for comparison, Weiss et al. = 47.3]

slide-34
SLIDE 34

Faster training time

[Learning curves: baseline vs. pretraining]

slide-35
SLIDE 35

[Learning curves: pretraining ~2 hours vs. baseline ~20 hours]

  • potentially useful in time-critical scenarios
slide-36
SLIDE 36

Ablation: model parameters

Spanish to English, N = 20 hours

                           BLEU
  baseline                 10.8
  +English ASR             19.9

[Diagram: English ASR (English audio → English text) and ST (Spanish audio → English text), both Encoder → Attention → Decoder]

slide-37
SLIDE 37

Ablation: model parameters

Spanish to English, N = 20 hours

                           BLEU
  baseline                 10.8
  +English ASR             19.9
  +English ASR: decoder    10.5

[Diagram: decoder parameters from English ASR; encoder randomly initialized]

slide-38
SLIDE 38

Ablation: model parameters

Spanish to English, N = 20 hours

                           BLEU
  baseline                 10.8
  +English ASR             19.9
  +English ASR: decoder    10.5
  +English ASR: encoder    16.6

[Diagram: encoder parameters from English ASR; decoder randomly initialized]

slide-39
SLIDE 39

Ablation: model parameters

Spanish to English, N = 20 hours

                           BLEU
  baseline                 10.8
  +English ASR             19.9
  +English ASR: decoder    10.5
  +English ASR: encoder    16.6

… transferring encoder-only parameters works well!
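The encoder-only transfer in this ablation can be sketched framework-agnostically. The flat name-to-parameter dict and the `encoder.` name prefix below are assumed conventions for illustration, not the released code's layout.

```python
def transfer_by_prefix(pretrained: dict, target: dict,
                       prefixes: tuple = ("encoder.",)) -> dict:
    """Copy only parameters whose name starts with one of `prefixes`;
    all other parameters keep their (random) initialization from `target`."""
    return {
        name: pretrained[name]
        if name.startswith(prefixes) and name in pretrained
        else value
        for name, value in target.items()
    }

# Toy English-ASR checkpoint vs. a freshly initialized Spanish-English ST model:
asr_ckpt = {"encoder.w": [1.0], "attention.w": [2.0], "decoder.w": [3.0]}
st_init  = {"encoder.w": [0.1], "attention.w": [0.2], "decoder.w": [0.3]}

st_params = transfer_by_prefix(asr_ckpt, st_init)
# encoder weights come from ASR; attention/decoder stay randomly initialized
```

The same helper covers the decoder-only row of the table by passing `prefixes=("decoder.",)`.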

slide-40
SLIDE 40

Ablation: model parameters

Spanish to English, N = 20 hours

                           BLEU
  baseline                 10.8
  +English ASR             19.9
  +English ASR: decoder    10.5
  +English ASR: encoder    16.6

… we can pretrain on a language different from both the source and the target of the ST pair

slide-41
SLIDE 41

Pretraining on French

Spanish to English, N = 20 hours

                           BLEU
  baseline                 10.8
  +English ASR             19.9
  +English ASR: encoder    16.6
  +French ASR: encoder     ?

*only 20 hours of French ASR

slide-42
SLIDE 42

Pretraining on French

Spanish to English, N = 20 hours

                           BLEU
  baseline                 10.8
  +English ASR             19.9
  +English ASR: encoder    16.6
  +French ASR: encoder     12.5

French ASR helps Spanish-English ST

slide-43
SLIDE 43

Takeaways

  • Pretraining on a different language helps
  • Transferring all model parameters gives the best gains
  • Encoder parameters account for most of these gains
    … encoder-only transfer is useful when the target vocabulary differs

slide-44
SLIDE 44

… Mboshi-French ST

44

slide-45
SLIDE 45

Mboshi-French ST

  • ST data from Godard et al. 2018
    ○ ~4 hours of speech, paired with French translations
  • Mboshi
    ○ Bantu language, Republic of Congo
    ○ unwritten
    ○ ~160K speakers

slide-46
SLIDE 46

Mboshi-French: Results

Mboshi to French, N = 4 hours

             BLEU
  baseline   ?

[Diagram: Mboshi audio → Encoder → Attention → Decoder → French text]

slide-47
SLIDE 47

Mboshi-French: Results

Mboshi to French, N = 4 hours

             BLEU
  baseline   3.5

*outperformed by a naive baseline

[Diagram: Mboshi audio → Encoder → Attention → Decoder → French text]

slide-48
SLIDE 48

Pretraining on French ASR

Mboshi to French, N = 4 hours

                     BLEU
  baseline           3.5
  +French ASR: all   ?

Transfer all parameters from French ASR (French audio → French text) to Mboshi-French ST.

slide-49
SLIDE 49

Pretraining on French ASR

Mboshi to French, N = 4 hours

                     BLEU
  baseline           3.5
  +French ASR: all   5.9

French ASR helps Mboshi-French ST

slide-50
SLIDE 50

Pretraining on French ASR

Mboshi to French, N = 4 hours

                     BLEU
  baseline           3.5
  +French ASR: all   5.9

French ASR helps Mboshi-French ST

slide-51
SLIDE 51

Pretraining on English ASR

Mboshi to French, N = 4 hours

                          BLEU
  baseline                3.5
  +French ASR: all        5.9
  +English ASR: encoder   ?

Using an encoder trained on much more data (English ASR); remaining parameters randomly initialized.

slide-52
SLIDE 52

Pretraining on English ASR

Mboshi to French, N = 4 hours

                          BLEU
  baseline                3.5
  +French ASR: all        5.9
  +English ASR: encoder   5.3

English ASR helps Mboshi-French ST

slide-53
SLIDE 53

Pretraining on French ASR: can transfer all parameters … but only 20 hours of data.
Pretraining on English ASR: trained on much more data (300 hours) … but can only transfer encoder parameters.

53

slide-54
SLIDE 54

Pretraining on French ASR: can transfer all parameters … but only 20 hours of data.
Pretraining on English ASR: trained on much more data (300 hours) … but can only transfer encoder parameters.

… combine both?

slide-55
SLIDE 55

Pretraining on French and English ASR

[Diagrams: English ASR (300 hours): English audio → Encoder → Attention → Decoder → English text; French ASR (20 hours): French audio → Encoder → Attention → Decoder → French text]

slide-56
SLIDE 56

Pretraining on French and English ASR

[Diagram: English ASR (300 hours) and French ASR (20 hours) models used to initialize Mboshi-French ST (4 hours): Mboshi audio → Encoder → Attention → Decoder → French text]

slide-57
SLIDE 57

Pretraining on French and English ASR

[Diagram: English ASR (300 hours) and French ASR (20 hours) models used to initialize Mboshi-French ST (4 hours): Mboshi audio → Encoder → Attention → Decoder → French text]

57

slide-58
SLIDE 58

Pretraining on English ASR

Mboshi to French, N = 4 hours

                                                 BLEU
  baseline                                       3.5
  +French ASR: all                               5.9
  +English ASR: encoder                          5.3
  +English ASR: encoder, +French ASR: remaining  ?

[Diagram: encoder from English ASR; remaining parameters from French ASR; Mboshi audio → French text]

slide-59
SLIDE 59

Pretraining on English ASR

Mboshi to French, N = 4 hours

                                                 BLEU
  baseline                                       3.5
  +French ASR: all                               5.9
  +English ASR: encoder                          5.3
  +English ASR: encoder, +French ASR: remaining  7.1

Combining gives the best gains.

[Diagram: encoder from English ASR; remaining parameters from French ASR; Mboshi audio → French text]
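Combining the two checkpoints, encoder from the large English ASR model and everything else from the French ASR model, can be sketched the same way. The flat name-to-parameter dicts and the `encoder.` prefix are illustrative assumptions, not the released code's layout.

```python
def combine_checkpoints(english_asr: dict, french_asr: dict) -> dict:
    """Encoder from English ASR (300 hours of data); attention and decoder
    from French ASR (matching the French target vocabulary)."""
    return {
        name: english_asr[name] if name.startswith("encoder.") else value
        for name, value in french_asr.items()
    }

# Toy checkpoints standing in for the two pretrained ASR models:
en_ckpt = {"encoder.w": [1.0], "attention.w": [2.0], "decoder.w": [3.0]}
fr_ckpt = {"encoder.w": [4.0], "attention.w": [5.0], "decoder.w": [6.0]}

st_init = combine_checkpoints(en_ckpt, fr_ckpt)
# → {"encoder.w": [1.0], "attention.w": [5.0], "decoder.w": [6.0]}
```

The design choice mirrors the slide: each side contributes what it is best at, the English encoder has seen far more speech, while the French decoder already produces the target-language vocabulary.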

slide-60
SLIDE 60

Pretraining on English ASR

Mboshi to French, N = 4 hours

                                                 BLEU
  baseline                                       3.5
  +French ASR: all                               5.9
  +English ASR: encoder                          5.3
  +English ASR: encoder, +French ASR: remaining  7.1

The BLEU score is still low … but above the naive baseline.

slide-61
SLIDE 61

Conclusions

  • Pretraining on high-resource ASR improves low-resource ST
  • Potentially useful for endangered and/or unwritten languages
  • Can bootstrap ST in time-critical scenarios
  • Future work: experiments on more languages; multilingual training with a joint vocabulary

slide-62
SLIDE 62
Thanks

  • Anonymous reviewers, Edinburgh NLP members
  • Source code available at: https://github.com/0xSameer/ast

I am looking for full-time positions starting November 2019!

Related talks:
  • 4th June, 3:30-5 pm: “Fluent Translations from Disfluent Speech in End-to-End Speech Translation”, Salesky et al.
  • 5th June, 10:30-10:48 am: “Neural Machine Translation of Text from Non-Native Speakers”, Anastasopoulos et al.

slide-63
SLIDE 63

Backup

63

slide-64
SLIDE 64

Mboshi-French naive baseline

64

slide-65
SLIDE 65
Why does pretraining help?

  • Speaker invariance
    ○ ASR data contains audio from 100s of speakers
  • Learning to factor out background noise (?)

                  baseline       +English ASR
  50 speakers     7.2            17.5 (+143%)
  136 speakers    10.8 (+50%)    19.9 (+14%)

(BLEU scores; relative gains in parentheses)

slide-66
SLIDE 66

Spanish-English ST

  N hours    2.5h   5h     10h    20h    50h    160h (Weiss)
  baseline   2.1    1.8    2.1    10.8   22.7   47.3
  +ASR       5.7    9.1    14.5   20.2   28.3
  gain       +3.6   +7.3   +12.4  +9.4   +5.5

*results on the Fisher test set

slide-67
SLIDE 67

Spanish-English ST

Spanish to English, N = 20 hours

                    BLEU
  baseline          10.8
  +En ASR: 300h     16.6
  +En ASR: 20h      13.2
  +Fr ASR: 20h      12.5

… French ASR helps improve Spanish-English ST

slide-68
SLIDE 68

Spanish-English ST

68

slide-69
SLIDE 69

Neural model

Encoder: MFCCs (150 × 13, ~1.5 s) → CNN 1 (75 × 128) → CNN 2 (37 × 512) → bi-LSTM 1 → bi-LSTM 2 → bi-LSTM 3 (37 × 512)
Decoder: Embedding → LSTM 1 → LSTM 2 → LSTM 3 → FF-Softmax, with Attention over encoder states
Example: Spanish audio “yo vive en bronx” → English subword output “GO i live in br_ _ on_ _ x EOS”
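A minimal PyTorch sketch of the encoder stack named on this slide (two strided CNN layers over MFCC frames, then three bi-LSTM layers). Layer widths follow the slide; kernel sizes, strides, and padding are assumptions chosen to reproduce the 150 → 75 → 37 frame counts, and the released code may differ.

```python
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    def __init__(self, n_mfcc=13, hidden=256):
        super().__init__()
        # Each strided CNN roughly halves the time axis: 150 frames -> 75 -> 37.
        self.cnn1 = nn.Conv1d(n_mfcc, 128, kernel_size=3, stride=2, padding=1)
        self.cnn2 = nn.Conv1d(128, 512, kernel_size=3, stride=2, padding=0)
        # Three stacked bi-LSTMs; 2 * hidden = 512 output features per frame.
        self.lstm = nn.LSTM(512, hidden, num_layers=3,
                            batch_first=True, bidirectional=True)

    def forward(self, mfccs: torch.Tensor) -> torch.Tensor:
        # mfccs: (batch, time=150, n_mfcc=13)
        x = mfccs.transpose(1, 2)      # (batch, 13, 150), channels-first for Conv1d
        x = torch.relu(self.cnn1(x))   # (batch, 128, 75)
        x = torch.relu(self.cnn2(x))   # (batch, 512, 37)
        x = x.transpose(1, 2)          # (batch, 37, 512)
        states, _ = self.lstm(x)       # (batch, 37, 512)
        return states                  # encoder states the decoder attends over
```

Downsampling in the CNNs keeps the encoder output short enough for attention over long speech inputs.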

slide-70
SLIDE 70

Neural model

Encoder: MFCCs → CNN → RNN
Decoder: prediction history → Embedding → RNN → FF-Softmax → predicted text
Attention connects encoder states to the decoder.

slide-71
SLIDE 71

100s of hours of monolingual speech paired with text (English) are available … typically used to train ASR systems (Gülçehre et al., 2015; Toshniwal et al., 2018)

Sequence-to-Sequence: Spanish Audio → English text (~20 hours of Spanish-English)