SLIDE 1

Universal transformers

Matus Zilinec SZI, November 29, 2018

SLIDE 2

Motivation

What do we want? Given a sequence x of inputs (x1, x2, ..., xn), predict a sequence y of outputs (y1, y2, ..., yn′).
Why would we want that?
◮ machine translation
◮ video captioning
◮ speech recognition
◮ generating music
◮ talking robots
◮ working with symbols (math)

SLIDE 3

Motivation

How do we do it?
◮ Let’s use a neural network!
◮ It kind of works, but is it ideal?
◮ Problem: dependencies in data
Recurrent neural networks!

SLIDE 4

Recurrent neural nets

◮ RNN allows loops in the network
◮ each timestep t, read an item x(t)
◮ update internal state h(t)

h(t) = f(h(t−1), x(t), θ)
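
A minimal sketch of this update rule in Python/NumPy; the tanh cell and the weight names (W_h, W_x, b) are illustrative assumptions, not from the slides:

    import numpy as np

    def rnn_step(h_prev, x_t, W_h, W_x, b):
        # one application of h(t) = f(h(t-1), x(t), theta); tanh cell chosen for illustration
        return np.tanh(W_h @ h_prev + W_x @ x_t + b)

    def rnn(x, h0, W_h, W_x, b):
        # x: sequence of input vectors, h0: initial internal state
        h = h0
        for x_t in x:                       # read one item per timestep
            h = rnn_step(h, x_t, W_h, W_x, b)
        return h                            # final internal state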

SLIDE 5

Encoder-decoder

◮ two RNNs, encoder and decoder
◮ compress meaning of x into a "thought" vector
◮ use the vector to generate y
◮ Problems: information loss, vanishing gradient, parallelization
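
A rough sketch of the idea, assuming generic recurrent updates (encoder_step, decoder_step and emit are illustrative stand-ins for the two RNNs and the output layer):

    def encode(x, h0, encoder_step):
        # run the encoder over x, keep only the final state
        h = h0
        for x_t in x:
            h = encoder_step(h, x_t)
        return h                            # the "thought" vector summarizing x

    def decode(thought, start_symbol, decoder_step, emit, max_len):
        # generate y symbol by symbol, conditioned only on the thought vector
        s, y_t, ys = thought, start_symbol, []
        for _ in range(max_len):
            s = decoder_step(s, y_t)
            y_t = emit(s)                   # pick the next output symbol from the state
            ys.append(y_t)
        return ys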

SLIDE 6

Attention

◮ every item of y depends on different items of x
◮ try to learn dependencies and focus attention

Dot-product attention: query q, keys ki, values vi

wi = ki · q
attention(q) = Σi wi · vi
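
A minimal NumPy sketch of this weighted sum (without the softmax normalization added on the next slide; the shapes are illustrative assumptions):

    import numpy as np

    def dot_product_attention(q, K, V):
        # q: (d,) query, K: (n, d) keys, V: (n, d_v) values
        w = K @ q                # w_i = k_i . q, one weight per input item
        return w @ V             # sum_i w_i * v_i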

SLIDE 7

Multi-head self-attention

◮ multi-head: focus on multiple places at once
◮ self-: update representation of x instead of comparing with y
◮ for each head, use different features from x, thus the weights

Attention(Q, K, V) = softmax(Q K^T / √d_k) V

head_i = Attention(H W_i^Q, H W_i^K, H W_i^V)

MultiHeadSelfAttention(H) = Concat(head_1, ..., head_k) W^O
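
A compact NumPy sketch of these formulas; the per-head weight lists and their shapes are illustrative assumptions:

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def attention(Q, K, V):
        # scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
        d_k = Q.shape[-1]
        return softmax(Q @ K.T / np.sqrt(d_k)) @ V

    def multi_head_self_attention(H, W_Q, W_K, W_V, W_O):
        # H: (n, d_model); W_Q, W_K, W_V: one (d_model, d_k) matrix per head; W_O: (k*d_k, d_model)
        heads = [attention(H @ wq, H @ wk, H @ wv)
                 for wq, wk, wv in zip(W_Q, W_K, W_V)]
        return np.concatenate(heads, axis=-1) @ W_O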

SLIDE 8

Transformer

◮ encoder-decoder without recurrence, just attention
◮ generates N intermediate representations of x and y
◮ all timesteps can be processed at the same time

SLIDE 9

Universal transformer

◮ generalization of the transformer
◮ recurrent in depth, not width
◮ O(#steps) << O(input length)
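
A sketch of the depth recurrence, assuming ut_block is a single shared transformer layer (self-attention plus transition function) applied T times to all positions in parallel; the names and the timestep signal are illustrative assumptions:

    def universal_transformer_encoder(H, ut_block, timestep_signal, T):
        # H: (n, d_model) representations of every input symbol
        for t in range(T):                         # recurrence over depth, not over positions
            H = ut_block(H + timestep_signal(t))   # the same block (same weights) at every step
        return H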

SLIDE 10

Adaptive computation time (ACT)

◮ imagine simplifying equations ◮ different inputs - different difficulty ◮ dynamically adjust number of computational steps for each symbol in the input ◮ When should we stop? ◮ predict ”pondering value” - probability of stopping ◮ stop computation for symbol when (Pshouldstop > threshold) ∨ (step > N) ◮ pondering value trained jointly with the transformer
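
A simplified per-symbol halting loop following the stopping rule above; the ponder network, the block, and the default threshold are illustrative stand-ins, not the paper's exact ACT formulation:

    import numpy as np

    def act_universal_transformer(H, ut_block, ponder, threshold=0.99, N=10):
        # H: (n, d_model); ponder(H) returns one stopping probability per symbol
        halted = np.zeros(H.shape[0], dtype=bool)
        for step in range(N):
            H_new = ut_block(H)
            # halted symbols keep their previous state, the rest are updated
            H = np.where(halted[:, None], H, H_new)
            p_stop = ponder(H)                     # "pondering value" per symbol
            halted |= p_stop > threshold
            if halted.all():
                break
        return H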

SLIDE 11

Results

bAbI Question-Answering
◮ read a story and answer questions about the characters
[Results table omitted: average error and number of failed tasks (> 5% error) out of 20, in parentheses; lower is better in both cases]

SLIDE 12

Ponder time

SLIDE 13

Results

Learning to Execute
◮ just like progtest
◮ algorithmic, memorization and evaluation tasks

SLIDE 14

Results

Machine translation
◮ UT with ACT not mentioned
◮ seems to perform worse so far

SLIDE 15

Another talk about Universal Transformer: Let’s Talk ML, TH:A-1347, 11:00