SLIDE 1
Universal transformers
Matus Zilinec
SZI, November 29, 2018

SLIDE 2
Motivation
What do we want?
◮ given a sequence x of inputs (x1, x2, ..., xn), predict a sequence y of outputs (y1, y2, ..., ym)
◮ Why would we want that? e.g. machine translation, question answering

SLIDE 3
Motivation
How do we do it?
◮ Let's use a neural network!
◮ It kind of works, but is it ideal?
◮ Problem: dependencies in the data
Recurrent neural networks!
SLIDE 4
Recurrent neural nets
◮ an RNN allows loops in the network
◮ at each timestep t, read an item x(t)
◮ update the internal state h(t):
h(t) = f(h(t−1), x(t), θ)
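The recurrence above can be sketched in a few lines of numpy. The tanh nonlinearity and the weight shapes are illustrative choices, not taken from the slides:

```python
import numpy as np

def rnn_step(h_prev, x, W_h, W_x, b):
    """One recurrent step: h(t) = f(h(t-1), x(t), theta), with f = tanh here."""
    return np.tanh(W_h @ h_prev + W_x @ x + b)

rng = np.random.default_rng(0)
hidden_dim, input_dim = 4, 3
W_h = rng.normal(size=(hidden_dim, hidden_dim))
W_x = rng.normal(size=(hidden_dim, input_dim))
b = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):  # read a sequence of 5 items
    h = rnn_step(h, x_t, W_h, W_x, b)        # same weights reused each timestep

print(h.shape)  # (4,)
```

Note that the same parameters θ = (W_h, W_x, b) are reused at every timestep; this weight sharing across time is what makes the network "recurrent".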
SLIDE 5
Encoder-decoder
◮ two RNNs, an encoder and a decoder
◮ compress the meaning of x into a "thought" vector
◮ use the vector to generate y
◮ Problems: information loss, vanishing gradients, poor parallelization
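A minimal sketch of the encoder-decoder idea, with made-up weights and dimensions: the encoder reads all of x into a single "thought" vector, and the decoder unrolls from that vector alone, which is exactly where the information loss comes from.

```python
import numpy as np

def step(h, x, W_h, W_x):
    # Shared form of one RNN step (nonlinearity is an illustrative choice).
    return np.tanh(W_h @ h + W_x @ x)

rng = np.random.default_rng(1)
d = 4
W_enc_h, W_enc_x = rng.normal(size=(d, d)), rng.normal(size=(d, d))
W_dec_h, W_dec_x = rng.normal(size=(d, d)), rng.normal(size=(d, d))
W_out = rng.normal(size=(d, d))

# Encoder: fold the whole input sequence into one vector.
h = np.zeros(d)
for x_t in rng.normal(size=(6, d)):
    h = step(h, x_t, W_enc_h, W_enc_x)
thought = h  # the ONLY thing the decoder sees -> bottleneck for long x

# Decoder: generate outputs one at a time, feeding back its own prediction.
y_prev = np.zeros(d)
outputs = []
for _ in range(3):
    thought = step(thought, y_prev, W_dec_h, W_dec_x)
    y_prev = W_out @ thought
    outputs.append(y_prev)

print(len(outputs), outputs[0].shape)
```

The sequential feedback loop in the decoder is also why this architecture parallelizes poorly: step t cannot start before step t−1 finishes.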
SLIDE 6
Attention
◮ every item of y depends on different items of x
◮ try to learn the dependencies and focus attention accordingly

Dot-product attention: query q, keys k_i, values v_i
w_i = k_i · q
attention(q) = Σ_i w_i · v_i
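The formula above, sketched for a single query in numpy. The slide's weighted sum is written with raw weights w_i; the version below normalizes them with a softmax, as the multi-head formulation on the next slide does:

```python
import numpy as np

def attention(q, K, V):
    w = K @ q                    # w_i = k_i . q  (one score per key)
    w = np.exp(w - w.max())
    w /= w.sum()                 # softmax-normalize the weights
    return w @ V                 # sum_i w_i * v_i

rng = np.random.default_rng(0)
K = rng.normal(size=(5, 8))  # 5 keys of dimension 8
V = rng.normal(size=(5, 8))  # one value per key
q = rng.normal(size=8)

out = attention(q, K, V)
print(out.shape)  # (8,)
```

The output is a convex combination of the values, weighted by how well each key matches the query.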
SLIDE 7
Multi-head self-attention
◮ multi-head: focus on multiple places at once
◮ self-: update the representation of x instead of comparing it with y
◮ each head uses different features from x, and thus different weights

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
head_i = Attention(H W_i^Q, H W_i^K, H W_i^V)
MultiHeadSelfAttention(H) = Concat(head_1, ..., head_k) W^O
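The three formulas above translate almost line by line into numpy. All dimensions below are illustrative; the per-head projections W_i^Q, W_i^K, W_i^V all read from the same H, which is what makes it *self*-attention:

```python
import numpy as np

def softmax(z):
    z = np.exp(z - z.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V   # softmax(QK^T / sqrt(d_k)) V

def multi_head_self_attention(H, W_Q, W_K, W_V, W_O):
    # head_i = Attention(H W_i^Q, H W_i^K, H W_i^V): every head projects H itself.
    heads = [attention(H @ wq, H @ wk, H @ wv)
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    return np.concatenate(heads, axis=-1) @ W_O  # Concat(head_1..head_k) W^O

rng = np.random.default_rng(0)
n, d_model, n_heads = 6, 16, 4
d_k = d_model // n_heads
W_Q = rng.normal(size=(n_heads, d_model, d_k))
W_K = rng.normal(size=(n_heads, d_model, d_k))
W_V = rng.normal(size=(n_heads, d_model, d_k))
W_O = rng.normal(size=(d_model, d_model))

H = rng.normal(size=(n, d_model))          # n positions, d_model features each
out = multi_head_self_attention(H, W_Q, W_K, W_V, W_O)
print(out.shape)  # (6, 16)
```

Each head attends over all n positions at once, so unlike an RNN there is no sequential dependency between timesteps.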
SLIDE 8
Transformer
◮ an encoder-decoder without recurrence, just attention
◮ generates N intermediate representations of x and y
◮ all timesteps can be processed at the same time
SLIDE 9
Universal transformer
◮ a generalization of the transformer
◮ recurrent in depth, not in width
◮ O(#steps) ≪ O(input length)
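"Recurrent in depth" can be sketched as follows: instead of N distinct layers, one shared block is applied T times. The `block` below is a stand-in for the real self-attention + transition function (and the actual model also re-adds position and timestep embeddings at every step); the point is only the weight sharing across depth:

```python
import numpy as np

def block(H, W):
    # Stand-in for (self-attention + transition); NOT the real UT block.
    return np.tanh(H @ W)

def universal_transformer(H, W, n_steps):
    # The SAME weights W are reused at every depth step -> recurrence in depth.
    for _ in range(n_steps):
        H = block(H, W)
    return H

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 8))          # 6 input symbols, 8 features each
W = rng.normal(size=(8, 8)) * 0.1
out = universal_transformer(H, W, n_steps=4)
print(out.shape)  # (6, 8)
```

Because the recurrence runs over depth steps rather than over input positions, all symbols are still processed in parallel within each step; the sequential cost is O(#steps), not O(input length).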
SLIDE 10
Adaptive computation time (ACT)
◮ imagine simplifying equations: different inputs have different difficulty
◮ dynamically adjust the number of computational steps for each symbol in the input
◮ When should we stop?
◮ predict a "pondering value" - the probability of stopping
◮ stop computation for a symbol when (P_stop > threshold) ∨ (step > N)
◮ the pondering value is trained jointly with the transformer
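The halting rule from the slide can be sketched per symbol as below. The sigmoid halting head and the update rule are illustrative assumptions, not the paper's exact formulation; what matters is that each position checks (P_stop > threshold) ∨ (step > N) independently:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def act_forward(H, W, w_halt, threshold=0.99, max_steps=8):
    """Run the shared block until every symbol has halted (or N steps pass)."""
    n = H.shape[0]
    halted = np.zeros(n, dtype=bool)
    steps = np.zeros(n, dtype=int)
    for _ in range(max_steps):
        p_stop = sigmoid(H @ w_halt)       # per-symbol "pondering value"
        active = ~halted
        H[active] = np.tanh(H[active] @ W) # update only symbols still computing
        steps[active] += 1
        halted |= p_stop > threshold       # halt once P_stop exceeds threshold
        if halted.all():
            break
    return H, steps

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))       # 5 input symbols
W = rng.normal(size=(8, 8)) * 0.1
w_halt = rng.normal(size=8)       # hypothetical halting head

H_out, steps = act_forward(H, W, w_halt)
print(steps)  # each symbol may run a different number of steps
```

Easy symbols cross the threshold early and stop refining their representation, while hard ones keep computing up to the step limit N.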
SLIDE 11
Results
bAbI Question-Answering
◮ read a story and answer questions about the characters
Table: average error and number of failed tasks (> 5% error) out of 20, in parentheses; lower is better in both cases.
SLIDE 12
Ponder time
SLIDE 13
Results
Learning to Execute
◮ just like progtest
◮ algorithmic, memorization and program evaluation tasks
SLIDE 14
Results
Machine translation
◮ UT with ACT is not mentioned
◮ seems to perform worse so far
SLIDE 15