

SLIDE 1

Beyond sequential decoding: toward parallel decoding

In the context of neural sequence modelling

Kyunghyun Cho, New York University & Facebook AI Research
Joint work with Jason Lee and Elman Mansimov

SLIDE 2

Neural sequence modeling

  • An arbitrary input X: e.g., another sequence, an image, a video, …
  • A sequence output Y = (y1, y2, ..., yT), with discrete yt ∈ V
  • e.g., a natural language sentence
  • Use a neural network to estimate a distribution pθ(Y | X) over sequences
  • Machine translation, automatic speech recognition, …

SLIDE 3

Neural autoregressive sequence modeling

  • Unlike classification, there are complex, strong dependencies among the yt's
  • More than half of residents in Korea speak ______.
  • Among millions of possible tokens, only one word (Korean) is likely above.
  • Neural autoregressive sequence modelling:

p(Y | X) = ∏_{t=1}^{T} p(y_t | y_{<t}, X)

It explicitly models the dependencies, generating token by token from <bos> to <eos>.
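To make the factorization concrete, here is a minimal sketch (mine, not from the slides) of how an autoregressive model scores a given sequence by accumulating the per-step conditionals; `model.log_prob` is a hypothetical interface standing in for any recurrent, convolutional, or transformer decoder.

```python
def score_sequence(model, X, Y, bos="<bos>"):
    """Compute log p(Y|X) = sum_t log p(y_t | y_<t, X) under an
    autoregressive model. `model.log_prob(prefix, X)` is assumed to
    return a dict mapping each vocabulary token to its log-probability."""
    prefix = [bos]
    total = 0.0
    for y_t in Y:                      # one term per output position
        total += model.log_prob(prefix, X)[y_t]   # log of the product = sum of logs
        prefix.append(y_t)             # condition on the growing prefix
    return total                       # log p(Y | X)
```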

SLIDE 4

Neural autoregressive sequence modeling

  • This has become the de facto standard in recent years
  • e.g., neural machine translation
  • Neural autoregressive sequence modelling
  • Various options: recurrent nets, convolutional nets, transformers

p(Y | X) = ∏_{t=1}^{T} p(y_t | y_{<t}, X)

[Figure: the input X feeds an autoregressive decoder that produces y1, y2, y3, y4 one after another.]

SLIDE 5

Neural autoregressive sequence modeling

  • Decoding is problematic
  • 1. Exact decoding is intractable
  • 2. Decoding is inherently sequential

[Figure: the input X and an autoregressive decoder producing y1, y2, y3, y4.]

Ŷ = argmax_Y ∏_{t=1}^{T} p(y_t | y_{<t}, X) = ?

Exhaustive search over all output sequences costs O(k^T), where k is the vocabulary size.
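To illustrate why decoding is inherently sequential, here is a minimal greedy-decoding sketch (mine, not from the slides); each position must wait for all previous positions, so the T steps cannot be parallelized. `model.next_token_probs` is a hypothetical interface.

```python
def greedy_decode(model, X, bos="<bos>", eos="<eos>", max_len=100):
    """Approximate argmax_Y p(Y|X) one token at a time.
    Each step conditions on the tokens already generated, so the loop
    is inherently sequential: T output tokens -> T dependent model calls."""
    Y = [bos]
    for _ in range(max_len):
        probs = model.next_token_probs(Y, X)       # p(y_t | y_<t, X)
        y_t = max(probs, key=probs.get)            # greedy choice at step t
        if y_t == eos:
            break
        Y.append(y_t)
    return Y[1:]                                   # drop <bos>
```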

SLIDE 6
Neural non-autoregressive sequence modeling

  • Conditional independence among the yt's
  • Exact decoding is tractable
  • Decoding is highly parallelizable

p(Y | X) = ∏_{t=1}^{T} p(y_t | X)   (non-autoregressive)

p(Y | X) = ∏_{t=1}^{T} p(y_t | y_{<t}, X)   (autoregressive, for comparison)

ŷ_t = argmax_{y_t} p(y_t | X)

[Figure: the input X feeds all of y1, y2, y3, y4 in parallel.]
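For contrast with the sequential loop above, a minimal sketch (mine, not from the slides) of non-autoregressive decoding; because the positions are conditionally independent, all T argmax operations can be computed at once. `model.position_probs` is a hypothetical interface returning one distribution per output position, and the output length T is assumed to be given (e.g., predicted separately).

```python
def parallel_decode(model, X, T):
    """Exact decoding under conditional independence:
    y_hat_t = argmax_y p(y | X), independently for every position t.
    All T distributions can be computed in a single batched call."""
    per_position = model.position_probs(X, T)      # list of T dicts: token -> prob
    return [max(probs, key=probs.get) for probs in per_position]
```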

SLIDE 7
Neural non-autoregressive sequence modeling

  • Too good to be true: dependencies must be modelled somehow
  • Introduce a set of latent variables Z [Gu et al., 2018 ICLR]

p(Y | X) = ∑_Z ∏_{t=1}^{T} p(y_t | Z, X) · p(Z | X)

[Figure: the latent variables Z and the input X feed all of y1, y2, y3, y4 in parallel.]

SLIDE 8
Neural non-autoregressive sequence modeling

  • Repetition as a latent variable [Gu et al., 2018 ICLR]
  • Each latent variable zt: the number of repetitions (fertility) of the input symbol xt
  • |Z| = |X|

[Figure: with z1 = 1 and z2 = 3, the source (x1, x2) is expanded into (x1, x2, x2, x2), which is then decoded into (y1, y2, y3, y4) in parallel.]
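A minimal sketch (mine, not from the slides) of the expansion step implied by this figure: each source token is copied zt times, which fixes the output length and gives every decoder position a concrete input to condition on.

```python
def expand_by_fertility(X, Z):
    """Expand source tokens by their fertilities, e.g.
    X = ["x1", "x2"], Z = [1, 3]  ->  ["x1", "x2", "x2", "x2"].
    The expanded sequence has length sum(Z) = T and is decoded in parallel."""
    assert len(X) == len(Z)                 # |Z| = |X|
    expanded = []
    for x_t, z_t in zip(X, Z):
        expanded.extend([x_t] * z_t)        # repeat x_t z_t times
    return expanded
```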

SLIDE 9
Neural non-autoregressive sequence modeling

  • Repetition as a latent variable [Gu et al., 2018 ICLR]
  • Each latent variable zt: the number of repetitions of the input symbol xt
  • Monte Carlo approximation with rescoring (sketched below):
  • 1. Sample Z^m ∼ p(Z | X) and decode Y^m = argmax_Y p(Y | Z^m, X)
  • 2. Pick the Y^m with the highest score under another model.

[Figure: sampled fertilities (z1, z2) expand (x1, x2) into (x1, x2, x2, x2), which is decoded into (y1, y2, y3, y4).]
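A minimal sketch (mine, not from the slides) of this Monte Carlo decoding with rescoring; the sampled candidates are independent, so their decodings can run in parallel, and only the final rescoring touches a second model. `fertility_model`, `decoder`, and `scorer` are hypothetical interfaces.

```python
def mc_rescoring_decode(fertility_model, decoder, scorer, X, num_samples=10):
    """1. Sample fertility sequences Z^m ~ p(Z|X) and decode each candidate
       Y^m = argmax_Y p(Y | Z^m, X) non-autoregressively.
    2. Return the candidate with the highest score under another model."""
    candidates = []
    for _ in range(num_samples):            # the candidate decodings are independent
        Z_m = fertility_model.sample(X)     # Z^m ~ p(Z | X)
        Y_m = decoder.decode(X, Z_m)        # parallel argmax given Z^m
        candidates.append(Y_m)
    return max(candidates, key=lambda Y: scorer.score(Y, X))
```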

SLIDE 10
Neural non-autoregressive sequence modeling

  • Repetition as a latent variable [Gu et al., 2018 ICLR]
  • Each latent variable zt: the number of repetitions of the input symbol xt
  • For training: use an auxiliary task to train p(Z | X)
  • 1. Word alignment models

[Figure: the same fertility diagram as above, with (x1, x2), (z1, z2), and (y1, ..., y4).]

SLIDE 11
Neural non-autoregressive sequence modeling

  • First convincing result!! [Gu et al., 2018 ICLR]
  • IWSLT’16 En→De

Non-autoregressive?   Decoding               BLEU    Sentence latency (ms)
No                    Greedy                 28.89   408
No                    Beam search (4)        29.70   607
Yes                   argmax                 25.20    39
Yes                   MC + rescoring (10)    27.44    79
Yes                   MC + rescoring (100)   28.16   257

SLIDE 12
Non-autoregressive modeling by iterative refinement [Lee, Mansimov & Cho, 2018]

  • Any alternative interpretation of the latent variables?
  • Latent variables share the output semantics
  • They share the same vocabulary: zt ∈ V, yt ∈ V
  • Multiple layers of latent variables
  • Shared conditional distributions

p(Y | X) = ∑_Z ∏_{t=1}^{T} p(y_t | Z, X) · p(Z | X)

p(Y | X) = ∑_{Z^1, ..., Z^L} ( ∏_{t=1}^{T} p(y_t | Z^L, X) ) · ( ∏_{t=1}^{T} p(z^L_t | Z^{L−1}, X) ) ··· ( ∏_{t=1}^{T} p(z^1_t | X) )

SLIDE 13
Non-autoregressive modeling by iterative refinement

  • Generative story: iterative refinement
  • 1. Refine*: generate an intermediate translation Y^l given the previous translation Y^{l−1} and the source sentence X
  • 2. Repeat step 1 for L iterations (or until convergence), as in the sketch below

* As the latent variables share their semantics with the output, we can use Z and Y interchangeably.

[Figure: the input X and the previous translation (y1, ..., y4) are fed to the decoder, which outputs a refined translation (y1, ..., y4).]
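A minimal sketch (mine, not from the slides) of this generative story at decoding time: an initial parallel guess is repeatedly refined, and a simple convergence check is used in the spirit of the "adaptive" decoding reported in the experiments. `decoder.initial` and `decoder.refine` are hypothetical interfaces.

```python
def iterative_refinement_decode(decoder, X, max_iters=10):
    """Y^1 = initial non-autoregressive guess given X;
    Y^l = refine(Y^{l-1}, X) for l = 2, ..., L.
    Every refinement step updates all output positions in parallel."""
    Y = decoder.initial(X)                  # Y^1: all tokens at once
    for _ in range(max_iters - 1):
        Y_next = decoder.refine(Y, X)       # Y^l given Y^{l-1} and X
        if Y_next == Y:                     # no change: adaptive stopping
            break
        Y = Y_next
    return Y
```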

SLIDE 14

Non-autoregressive modeling by iterative refinement

Input X: (image not shown)
Y^1: a yellow bus parked on parked in of parking road .
Y^2: a yellow and black on parked in a parking lot .
Y^3: a yellow and black bus parked in a parking lot .
Y^4: a yellow and black bus parked in a parking lot .

SLIDE 15
Non-autoregressive modeling by iterative refinement

  • Training 1: lower-bound maximization
  • The output of each iteration is encouraged to be the correct answer

[Figure: at every refinement iteration, each output position receives a cross-entropy (CE) loss against the reference token y_t*.]

SLIDE 16
Non-autoregressive modeling by iterative refinement

  • Training 2: conditional denoising autoencoder
  • A denoising autoencoder learns to hill climb [Alain & Bengio, 2013]:

Ŷ = F(Y, X),   p(Ŷ | X) ≥ p(Y | X)

(here F denotes one application of the denoising model to a hypothesis Y and the input X)

[Figure: the reference (y1*, ..., y4*) is corrupted by a corruption function C into (z1, ..., z4); the model is trained with a cross-entropy (CE) loss at each position to reconstruct the reference from this corrupted input.]

SLIDE 17
Non-autoregressive modeling by iterative refinement

  • Lower-bound maximization & conditional denoising: mixed training objective
  • Consider L+1 iterations.
  • At each iteration, stochastically choose one of the two objectives (see the sketch below).
  • Joint training from scratch
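A minimal sketch (mine, not from the slides) of this mixed objective for a single training example; at each of the L+1 iterations it stochastically picks either the lower-bound term (refine the model's own previous output toward the reference) or the denoising term (reconstruct the reference from a corrupted copy). `decoder`, `corrupt`, and `cross_entropy` are hypothetical interfaces, and the details (e.g., how predictions are fed back) are simplified.

```python
import random

def mixed_objective_loss(decoder, corrupt, cross_entropy, X, Y_star,
                         L=4, p_denoise=0.5):
    """Training loss over L+1 iterations for one example (X, Y_star).
    Iteration 0: predict from X alone.
    Iterations 1..L: stochastically choose the decoder input:
      - lower-bound term: the model's own previous prediction, or
      - denoising term:   a corrupted copy of the reference.
    Every iteration gets a cross-entropy loss against Y_star."""
    Y_prev = decoder.initial(X)
    loss = cross_entropy(Y_prev, Y_star)
    for _ in range(L):
        if random.random() < p_denoise:
            inp = corrupt(Y_star)          # conditional-denoising objective
        else:
            inp = Y_prev                   # lower-bound maximization objective
        Y_prev = decoder.refine(inp, X)
        loss = loss + cross_entropy(Y_prev, Y_star)
    return loss
```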

SLIDE 18
Experiments – Machine Translation

  • En↔Ro (WMT’16): low-resource machine translation
  • 91.5% of the autoregressive translation quality with up to 4x decoding speed-up (on GPU)

Non-autoregressive?   Decoding    En→Ro (BLEU)   Ro→En (BLEU)   CPU speed (tok/s)   GPU speed (tok/s)
No                    Greedy      31.93          31.55          15.7                 55.6
No                    Beam (4)    32.40          32.06           7.3                 43.3
Yes                   Iter 1      24.45          25.73          98.6                694.2
Yes                   Iter 2      27.10          28.15          62.8                332.7
Yes                   Iter 5      28.86          29.72          29.0                194.4
Yes                   Iter 10     29.32          30.19          14.8                 93.1
Yes                   Adaptive    29.66          30.30          16.5                226.6

SLIDE 19
Experiments – Machine Translation

  • En↔De (WMT’15): moderate-scale machine translation
  • 80% of the autoregressive translation quality with up to 2x decoding speed-up (on GPU)

Non-autoregressive?   Decoding    En→De (BLEU)   De→En (BLEU)   CPU speed (tok/s)   GPU speed (tok/s)
No                    Greedy      23.40          26.49          15.8                 53.6
No                    Beam (4)    24.12          27.05           6.7                 45.8
Yes                   Iter 1      12.65          14.84         101.2                536.5
Yes                   Iter 2      15.03          17.15          56.4                407.1
Yes                   Iter 5      17.53          20.02          27.1                173.4
Yes                   Iter 10     18.48          21.10          13.1                 87.8
Yes                   Adaptive    18.91          21.60          12.8                 90.9

SLIDE 20

Experiments – Machine Translation

  • Iterative refinement improves translation quality (almost) monotonically
  • The intermediate latent variables (translations) are successfully capturing dependencies.
  • Quality degradation with larger training data
  • Significant speed-up in decoding on GPU
  • Fits well with GTC!
SLIDE 21

Experiments – Machine Translation

Src: seitdem habe ich sieben Häuser in der Nachbarschaft mit den Lichtern versorgt und sie funktionieren wirklich gut .
Iter 1: and I ’ve been seven homes since in neighborhood with the lights and they ’re really functional .
Iter 4: and I ’ve been seven homes in neighborhood with the lights , and they ’re a really functional .
Iter 8: and I ’ve been providing seven homes in the neighborhood with the lights and they ’re a really functional .
Ref: since now , I ’ve set up seven homes around my community , and they ’re really working .

Src: er sah sehr glücklich aus , was damals ziemlich ungewöhnlich war , da ihn die Nachrichten meistens deprimierten .
Iter 1: he looked very happy , which was pretty unusual the , because the news was were usually depressing .
Iter 4: he looked very happy , which was pretty unusual at the , because news was mostly depressing .
Iter 8: he looked very happy , which was pretty unusual at the time because the news was mostly depressing .
Ref: there was a big smile on his face which was unusual then , because the news mostly depressed him .

SLIDE 22
Experiments – Image Caption Generation

  • MS COCO: image caption generation
  • 85% of the autoregressive caption quality with up to 5x decoding speed-up (on GPU)

Non-autoregressive?   Decoding    BLEU    Speed, first column (tok/s)   Speed, second column (tok/s)
No                    Greedy      23.47    4.3                           2.1
No                    Beam (4)    24.78    3.6                           1.0
Yes                   Iter 1      20.12   17.1                           8.9
Yes                   Iter 2      20.88   12.0                           5.7
Yes                   Iter 5      21.12    6.2                           2.8
Yes                   Iter 10     21.24    2.0                           1.2
Yes                   Adaptive    21.12   10.8                           4.8

SLIDE 23

Non-autoregressive modeling by iterative refinement

Input X: (image not shown)
Y^1: a woman standing on playing tennis on a tennis racquet .
Y^2: a woman standing on a tennis court a tennis racquet .
Y^3: a woman standing on a tennis court a a racquet .
Y^4: a woman standing on a tennis court holding a racquet .

SLIDE 24

Conclusion

  • Latent variables capture output dependencies more efficiently.
  • Different interpretation → different learning/decoding algorithms
  • Gu et al. [2018]: fertility → auxiliary supervision + noisy parallel decoding
  • Lee, Mansimov & Cho [2018]: iterative refinement → conditional denoising
  • Kaiser et al. [2018]: latent sequence → autoregressive inference
  • What else?
  • Generation quality closely tracks that of the autoregressive models.
  • Decoding is significantly faster, especially on GPU.
  • Potentially even faster decoding with specialized hardware.
SLIDE 25

Conclusion – Future Directions

  • Mix of non-autoregressive and autoregressive paradigms
  • Autoregressive modeling followed by iterative refinement? [Xia et al., 2017; Grangier & Auli, 2017]
  • Autoregressive generation of segments and non-autoregressive generation within each segment [Kaiser et al., 2018; Huang et al., 2018], or
  • Non-autoregressive generation of segments and autoregressive generation within each segment?
  • Beyond sentence-level generation
  • The efficiency of the non-autoregressive model may enable document-level generation.
  • Many exciting future directions!