

SLIDE 1

Multitask Learning with Low-Level Auxiliary Tasks for Speech Recognition

Shubham Toshniwal, Hao Tang, Liang Lu, Karen Livescu

Toyota Technological Institute at Chicago

SLIDE 2

Conventional ASR Systems

  • Traditional automatic speech recognition (ASR) systems are modular.
  • Different components of the system are trained separately.
  • Components correspond to different levels of representation: frame-level states, phones, words, etc.

[Figure: the conventional ASR pipeline: Speech → Feature Extraction → Feature Vectors → Decoder (Acoustic Model, Pronunciation Dictionary, Language Model) → Words, e.g. ‘‘recognize speech’’]

SLIDE 3

End-to-end ASR Models

  • Neural end-to-end models for ASR have become viable and popular.
  • End-to-end models are appealing because:
    • They are conceptually simple; all model parameters contribute to the same final goal.
    • They have achieved impressive results in ASR (Zweig et al. 2016) as well as other domains (Vinyals et al. 2015, Huang et al. 2016).

[Figure: an encoder-decoder model in which acoustic features x1 … xT are encoded into hidden states h1 … hT and a decoder emits ‘‘recognize speech’’ between GO and EOS tokens]

SLIDE 4

End-to-end Models: Cons

However, end-to-end models have some drawbacks as well:

  • Optimization can be challenging.
  • They ignore potentially useful domain-specific information about intermediate representations, as well as existing intermediate levels of supervision.
  • Intermediate learned representations are hard to interpret, which makes the models harder to debug.

SLIDE 5

Motivation

  • Analyses of some deep end-to-end models have found that different layers tend to specialize for different sub-tasks (Mohamed et al. 2012, Zeiler et al. 2014).
  • Lower layers focus on lower-level representations and higher ones on higher-level representations.

[Figure: visualization of learned features by depth in a face model: Layer 1 learns edges from pixels, Layer 2 learns parts of faces, Layer 3 learns faces]

SLIDE 6

Motivation

  • Can we encourage such intermediate representation learning more explicitly?
  • Multitask learning: combine the final task loss (speech recognition) with losses corresponding to lower-level tasks (such as phonetic recognition) applied at lower layers (Søgaard et al. 2016). The general form is sketched below.
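As a rough general form (this notation is mine, not the slides'; the concrete equal-weight instances appear on the following slides), the combined objective is a weighted sum of the final-task loss and auxiliary losses computed from intermediate encoder layers:

$$L = \lambda_c\, L_c + \sum_{a} \lambda_a\, L_a\big(h^{(\ell_a)}\big)$$

Here $L_c$ is the final character-level recognition loss, each $L_a$ is an auxiliary loss (e.g., phoneme recognition) computed from the hidden states $h^{(\ell_a)}$ of a lower encoder layer $\ell_a$, and the $\lambda$'s are mixing weights.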


SLIDES 7–9

Encoder-Decoder Model for Speech Recognition

[Figure: a pyramidal encoder over acoustic features x1 … xT feeding a character decoder (CharDec, loss $L_c$) that emits y1, y2, … after a GO token]

  • We use the attention-enabled encoder-decoder variant proposed by Chan et al. 2015.
  • Speech encoder: a pyramidal bidirectional LSTM that
    (i) reads in acoustic features $x = (x_1, \ldots, x_T)$, and
    (ii) outputs a sequence of high-level features (hidden states).
  • Character decoder: attends to the high-level features generated by the encoder and outputs $y = (y_1, \ldots, y_K)$.

A minimal encoder sketch is given below.
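The following is a minimal sketch of such a pyramidal bidirectional LSTM encoder, written in PyTorch (an assumed framework; the slides do not name one) with illustrative names throughout. Each layer above the first concatenates consecutive pairs of frames, halving the time resolution, and the forward pass returns every layer's outputs so that auxiliary losses can later be attached at intermediate depths.

```python
import torch
import torch.nn as nn

class PyramidalEncoder(nn.Module):
    """Pyramidal bidirectional LSTM encoder (sketch, after Chan et al. 2015)."""

    def __init__(self, input_dim, hidden_dim=256, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            # The first layer reads raw acoustic features; each later layer
            # reads concatenated pairs of bidirectional outputs (4 * hidden_dim).
            in_dim = input_dim if i == 0 else 4 * hidden_dim
            self.layers.append(nn.LSTM(in_dim, hidden_dim,
                                       batch_first=True, bidirectional=True))

    def forward(self, x):
        # x: (batch, T, input_dim) acoustic features.
        layer_outputs = []
        h = x
        for i, lstm in enumerate(self.layers):
            if i > 0:
                # Drop a trailing odd frame, then merge consecutive frame
                # pairs, halving the time resolution at this layer.
                B, T, D = h.shape
                h = h[:, : T - (T % 2), :].reshape(B, T // 2, 2 * D)
            h, _ = lstm(h)
            layer_outputs.append(h)  # keep every layer for auxiliary losses
        return layer_outputs
```

A character decoder would attend over `layer_outputs[-1]`; the intermediate entries are what the auxiliary losses on the next slides tap into.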


SLIDES 10–13

Adding Phoneme Supervision

[Figure: the same encoder-decoder, now with a phoneme decoder (PhoneDec, loss $L_p^{Dec}$) and a phoneme CTC head (PhoneCTC, loss $L_p^{CTC}$) attached to an intermediate encoder layer and predicting phonemes z1, z2, …, alongside the character decoder (CharDec, loss $L_c$)]

  • Phoneme-level supervision is obtained using a pronunciation dictionary.
  • We experiment with two types of sequence loss:
    (a) the phoneme decoder loss $L_p^{Dec}$,
    (b) the CTC loss $L_p^{CTC}$.
  • The training loss $L$ is given by $L = \frac{1}{2}(L_c + L_p)$.

A sketch of this objective with a CTC head follows below.
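Here is a minimal sketch of the PhoneCTC variant in the same assumed PyTorch setup, attaching a CTC phoneme loss to the third encoder layer (PhoneCTC-3 in the later tables) and mixing it equally with the character loss. Names such as `phone_proj` and the phoneme inventory size are illustrative, not from the slides.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_PHONES = 45  # assumed inventory size; index 0 is reserved for the CTC blank
phone_proj = nn.Linear(2 * 256, NUM_PHONES + 1)  # layer-3 states -> phone logits
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

def multitask_loss(layer_outputs, char_log_probs, char_targets,
                   phone_targets, frame_lens, phone_lens):
    # Character loss Lc: cross-entropy over the attention decoder's
    # log-probabilities, char_log_probs: (batch, K, num_chars).
    L_c = F.nll_loss(char_log_probs.reshape(-1, char_log_probs.size(-1)),
                     char_targets.reshape(-1))
    # Phoneme CTC loss Lp on layer-3 hidden states; frame_lens must reflect
    # the downsampled time resolution at that layer.
    logits = phone_proj(layer_outputs[2])               # (batch, T', phones+1)
    log_probs = logits.log_softmax(-1).transpose(0, 1)  # CTC wants (T', batch, C)
    L_p = ctc_loss(log_probs, phone_targets, frame_lens, phone_lens)
    # Equal weighting, as on the slide: L = (Lc + Lp) / 2.
    return 0.5 * (L_c + L_p)
```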

SLIDE 14

Adding Frame-level Supervision

[Figure: the multitask model with an additional frame-level state classifier (State, loss $L_s$) on a lower encoder layer predicting states s1 … sT, alongside PhoneDec ($L_p^{Dec}$), PhoneCTC ($L_p^{CTC}$), and CharDec ($L_c$)]

  • We also experiment with frame-level state supervision.
  • The training loss $L$ is then $L = \frac{1}{3}(L_c + L_p + L_s)$; a sketch of the state term follows below.
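A matching sketch of the frame-level state term (the State-2 configuration in the results tables taps the second encoder layer; `NUM_STATES` and the use of forced alignments as the label source are assumptions):

```python
import torch.nn as nn
import torch.nn.functional as F

NUM_STATES = 9000  # assumed; frame labels would come from a conventional forced alignment
state_proj = nn.Linear(2 * 256, NUM_STATES)

def state_loss(layer_outputs, state_targets):
    # layer_outputs[1]: (batch, T'', 2H) layer-2 states; state_targets must be
    # subsampled to the same time resolution, shape (batch, T'').
    logits = state_proj(layer_outputs[1])
    return F.cross_entropy(logits.reshape(-1, NUM_STATES),
                           state_targets.reshape(-1))

# Combined with the earlier terms: L = (Lc + Lp + Ls) / 3, as on the slide.
```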

SLIDE 15

Dataset & Model Details

Dataset:

  • Switchboard corpus: 300 hours of conversational speech data.
  • The standard training/development/test split is used.

Model:

  • Speech encoder: 4-layer pyramidal bidirectional LSTM.
  • Character decoder: 1-layer unidirectional LSTM.
  • Both have 256 hidden units.

SLIDES 16–20

Dev Results

Table 1: Character error rate (CER) and word error rate (WER) results on development data.

Model                               Dev CER (%)   Dev WER (%)
Enc-Dec (baseline)                      14.6          26.0
Enc-Dec + PhoneCTC-3                    14.0          25.3
Enc-Dec + PhoneDec-3                    13.8          24.9
Enc-Dec + PhoneDec-4                    14.5          25.9
Enc-Dec + State-2                       13.6          24.1
Enc-Dec + PhoneDec-3 + State-2          13.4          24.1


SLIDES 21–24

Test Results

Table 2: WER (%) on test data for different end-to-end models.

Model                                 SWB    CHE    Full
Our models
  Enc-Dec (baseline)                  25.0   42.4   33.7
  Enc-Dec + PhoneDec-3 + State-2      23.1   40.8   32.0
Lu et al. 2016
  Enc-Dec                             27.3   48.2   37.8
  Enc-Dec (word) + 3-gram             25.8   46.0   36.0
Maas et al. 2015
  CTC                                 38.0   56.1   47.1
Zweig et al. 2016
  Iterated CTC                        24.7   37.1    —


SLIDES 25–27

How Does Multitask Learning Help?

[Figure 1: Log-loss on the training data (only $L_c$) for different model variations; the plot itself is not reproduced here.]


SLIDES 28–31

Conclusion & Future Work

  • Multitask learning is great!
  • Using lower-level supervision at lower layers is the key to our gains.
  • More generally, our ASR model can be extended to incorporate higher-level supervision, such as semantic or syntactic labels.
  • The idea of incorporating different types of supervision at different levels is of broad interest (Hashimoto et al. 2016, Weiss et al. 2017, Rao et al. 2017).


SLIDE 32

Questions?
