Universal Transformers
Matus Zilinec
SZI, November 29, 2018
Motivation: What do we want?
Given a sequence x of inputs (x_1, x_2, ..., x_n), predict a sequence y of outputs (y_1, y_2, ..., y_n′).
Why would we want that?
◮ machine translation
◮ video captioning
◮ speech recognition
◮ generating music
◮ talking robots
◮ working with symbols (math)
Motivation: How do we do it?
◮ Let's use a neural network!
◮ It kind of works, but is it ideal?
◮ Problem: dependencies in the data
Recurrent neural networks!
Recurrent neural nets
◮ an RNN allows loops in the network
◮ at each timestep t, read an item x(t)
◮ update the internal state h(t)
h(t) = f(h(t−1), x(t), θ)
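The recurrent update h(t) = f(h(t−1), x(t), θ) can be sketched in a few lines. This is a minimal illustration, assuming a plain tanh cell with hypothetical weight names W_h, W_x, b (the slide leaves f unspecified):

```python
import numpy as np

def rnn_step(h_prev, x, W_h, W_x, b):
    """One recurrent update: h(t) = tanh(W_h h(t-1) + W_x x(t) + b)."""
    return np.tanh(W_h @ h_prev + W_x @ x + b)

# Toy dimensions: hidden size 4, input size 3.
rng = np.random.default_rng(0)
W_h, W_x, b = rng.normal(size=(4, 4)), rng.normal(size=(4, 3)), np.zeros(4)

h = np.zeros(4)                       # initial state h(0)
for x in rng.normal(size=(5, 3)):     # a sequence of 5 input items
    h = rnn_step(h, x, W_h, W_x, b)   # same weights θ reused at every timestep
```

Note that the same parameters θ = (W_h, W_x, b) are applied at every timestep; only the state h carries information forward.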
Encoder-decoder
◮ two RNNs: an encoder and a decoder
◮ compress the meaning of x into a "thought" vector
◮ use the vector to generate y
◮ Problems: information loss, vanishing gradients, parallelization
Attention
◮ every item of y depends on different items of x
◮ try to learn the dependencies and focus attention
Dot-product attention: given a query q, keys k_i, and values v_i,
w_i = k_i · q
attention(q) = Σ_i w_i · v_i
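A minimal sketch of dot-product attention, following the formulas above; one assumption beyond the slide is that the weights w_i are normalized with a softmax (the usual choice) before the weighted sum:

```python
import numpy as np

def dot_product_attention(q, K, V):
    """attention(q) = sum_i w_i * v_i, with w_i = k_i . q
    normalized by a softmax so the weights sum to 1."""
    w = K @ q                  # w_i = k_i . q
    w = np.exp(w - w.max())
    w /= w.sum()               # softmax over the keys
    return w @ V               # weighted sum of the values

K = np.array([[1.0, 0.0], [0.0, 1.0]])   # two keys
V = np.array([[10.0], [20.0]])           # their values
q = np.array([1.0, 0.0])                 # query aligned with the first key

out = dot_product_attention(q, K, V)     # weighted toward the first value
```

Because q matches the first key, most of the attention weight lands on v_1, so the output lies closer to 10 than to 20.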
Multi-head self-attention
◮ multi-head: focus on multiple places at once
◮ self-: update the representation of x instead of comparing it with y
◮ each head uses different features from x, and thus different weights
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
head_i = Attention(HW_i^Q, HW_i^K, HW_i^V)
MultiHeadSelfAttention(H) = Concat(head_1, ..., head_k) W^O
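The three formulas above translate directly into code. A self-contained sketch with toy dimensions (2 heads of size 4 on a model dimension of 8; the weight shapes are illustrative assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head_self_attention(H, W_Q, W_K, W_V, W_O):
    """head_i = Attention(H W_Q[i], H W_K[i], H W_V[i]);
    output = Concat(head_1, ..., head_k) W_O."""
    heads = [attention(H @ wq, H @ wk, H @ wv)
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    return np.concatenate(heads, axis=-1) @ W_O

# Sequence of 5 symbols, model dim 8, 2 heads of dim 4.
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))
W_Q, W_K, W_V = (rng.normal(size=(2, 8, 4)) for _ in range(3))
W_O = rng.normal(size=(8, 8))

out = multi_head_self_attention(H, W_Q, W_K, W_V, W_O)
```

The "self-" part is visible in the code: queries, keys, and values are all projections of the same matrix H, so each symbol's representation is updated from all the others.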
Transformer
◮ an encoder-decoder without recurrence, just attention
◮ generates N intermediate representations of x and y
◮ all timesteps can be processed at the same time
Universal transformer
◮ a generalization of the Transformer
◮ recurrent in depth, not width
◮ O(#steps) ≪ O(input length)
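"Recurrent in depth" means one block with shared weights is applied repeatedly to refine all positions in parallel, rather than stepping along the sequence. A simplified sketch (single head, linear transition, no layer norm, hypothetical weight W):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ut_step(H, W):
    """One shared-weight refinement step: self-attention over all
    positions, followed by a (here: linear) transition function."""
    A = softmax(H @ H.T / np.sqrt(H.shape[-1])) @ H   # self-attention
    return A @ W                                       # transition

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))          # representations of 5 input symbols
W = rng.normal(size=(8, 8)) * 0.1    # the SAME weights reused every step

for _ in range(4):                   # recurrence in depth: T = 4 steps
    H = ut_step(H, W)                # every position updated in parallel
```

Contrast with the RNN: the loop here runs over refinement steps T, not over sequence positions, which is why the recurrence depth is decoupled from the input length.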
Adaptive computation time (ACT)
◮ imagine simplifying equations
◮ different inputs have different difficulty
◮ dynamically adjust the number of computational steps for each symbol in the input
When should we stop?
◮ predict a "pondering value": the probability of stopping
◮ stop computation for a symbol when (P_shouldstop > threshold) ∨ (step > N)
◮ the pondering value is trained jointly with the transformer
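The per-symbol stopping rule above can be sketched as a halting loop. This is a simplified illustration (the halting projection w_halt and the accumulation of pondering values are assumptions; full ACT also forms a weighted average of states with a remainder term, omitted here):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def act_halting(H, step_fn, w_halt, threshold=0.99, max_steps=10):
    """Keep refining each symbol until its accumulated pondering value
    exceeds the threshold, or until max_steps (the N in the slide)."""
    n = H.shape[0]
    halted = np.zeros(n, dtype=bool)
    cum_p = np.zeros(n)
    for step in range(max_steps):
        p = sigmoid(H @ w_halt)                   # pondering value per symbol
        cum_p = np.where(halted, cum_p, cum_p + p)
        halted |= cum_p > threshold               # P_shouldstop > threshold
        if halted.all():
            break
        new_H = step_fn(H)
        H = np.where(halted[:, None], H, new_H)   # halted symbols keep their state
    return H, halted

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))
w_halt = rng.normal(size=8)
H, halted = act_halting(H, lambda h: np.tanh(h), w_halt)
```

Easy symbols cross the threshold after few steps and are frozen, while hard symbols keep being refined up to the step budget.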
Results: bAbI Question-Answering
◮ read a story and answer questions about the characters
(Table: average error and number of failed tasks with > 5% error out of 20, in parentheses; lower is better in both cases.)
Ponder time
Results: Learning to Execute
◮ just like progtest
◮ algorithmic, memorization, and evaluation tasks
Results: Machine translation
◮ the UT with ACT is not mentioned
◮ seems to perform worse so far
Another talk about Universal Transformer: Let’s Talk ML, TH:A-1347, 11:00