Universal transformers



  1. Universal transformers Matus Zilinec SZI, November 29, 2018

  2. Motivation
What do we want? Given a sequence x of inputs (x_1, x_2, ..., x_n), predict a sequence y of outputs (y_1, y_2, ..., y_{n'}). Why would we want that?
◮ machine translation
◮ video captioning
◮ speech recognition
◮ generating music
◮ talking robots
◮ working with symbols (math)

  3. Motivation
How do we do it?
◮ Let's use a neural network!
◮ It kind of works, but is it ideal?
◮ Problem: dependencies in data
Recurrent neural networks!

  4. Recurrent neural nets
◮ an RNN allows loops in the network
◮ at each timestep t, read an item x(t)
◮ update the internal state h(t)
h(t) = f(h(t-1), x(t), θ)   (a minimal sketch follows below)
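A minimal NumPy sketch of this update rule, assuming a tanh cell; the matrices W_hh, W_xh, the bias b and the toy sizes are placeholders for illustration, not values from the talk:

```python
import numpy as np

def rnn_step(h_prev, x_t, W_hh, W_xh, b):
    """One recurrence step: h(t) = f(h(t-1), x(t), θ), with f = tanh here."""
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b)

# Toy run: hidden size 4, input size 2, a sequence of 3 items.
rng = np.random.default_rng(0)
W_hh, W_xh, b = rng.normal(size=(4, 4)), rng.normal(size=(4, 2)), np.zeros(4)
h = np.zeros(4)
for x_t in rng.normal(size=(3, 2)):
    h = rnn_step(h, x_t, W_hh, W_xh, b)   # the state carries information across timesteps
```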

  5. Encoder-decoder
◮ two RNNs, an encoder and a decoder
◮ compress the meaning of x into a "thought" vector
◮ use the vector to generate y (sketched below)
◮ Problems: information loss, vanishing gradients, hard to parallelize
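A toy sketch of the encoder-decoder idea under the same assumptions as the RNN snippet above: the encoder's final state serves as the "thought" vector, and the decoder unrolls from it for a fixed number of steps (real decoders also condition on the previously generated outputs, which is omitted here):

```python
import numpy as np

def encode(xs, W_hh, W_xh):
    """Run the encoder over the input sequence; the final state is the "thought" vector."""
    h = np.zeros(W_hh.shape[0])
    for x_t in xs:
        h = np.tanh(W_hh @ h + W_xh @ x_t)
    return h

def decode(thought, W_hh, W_hy, n_steps):
    """Generate outputs from the thought vector alone: everything must fit in one vector."""
    h, ys = thought, []
    for _ in range(n_steps):
        h = np.tanh(W_hh @ h)
        ys.append(W_hy @ h)
    return ys
```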

  6. Attention
◮ every item of y depends on different items of x
◮ try to learn the dependencies and focus attention
Dot-product attention: query q, keys k_i, values v_i (sketched below)
w_i = k_i · q
attention(q) = Σ_i w_i · v_i
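A small NumPy sketch of dot-product attention as written above; the only addition is a softmax to normalize the weights w_i (the slide leaves them unnormalized):

```python
import numpy as np

def dot_product_attention(q, keys, values):
    """w_i = k_i · q, then attention(q) = Σ_i w_i · v_i, with softmax-normalized weights."""
    scores = keys @ q                          # one score per key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over the keys
    return weights @ values                    # weighted sum of the values

# Example: one query attending over 3 key/value pairs of dimension 4.
rng = np.random.default_rng(0)
out = dot_product_attention(rng.normal(size=4), rng.normal(size=(3, 4)), rng.normal(size=(3, 4)))
```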

  7. Multi-head self-attention
◮ multi-head: focus on multiple places at once
◮ self-: update the representation of x instead of comparing it with y
◮ each head uses different features from x, and thus different weights (sketched below)
Attention(Q, K, V) = softmax(Q K^T / √d_k) V
head_i = Attention(H W_i^Q, H W_i^K, H W_i^V)
MultiHeadSelfAttention(H) = Concat(head_1, ..., head_k) W^O
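The same formulas in NumPy; the projection matrices are assumed to be given as lists (one W^Q, W^K, W^V per head), and residual connections and layer norm are omitted:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / √d_k) V"""
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head_self_attention(H, W_Q, W_K, W_V, W_O):
    """head_i = Attention(H W_i^Q, H W_i^K, H W_i^V); heads are concatenated and mixed by W^O."""
    heads = [attention(H @ wq, H @ wk, H @ wv) for wq, wk, wv in zip(W_Q, W_K, W_V)]
    return np.concatenate(heads, axis=-1) @ W_O
```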

  8. Transformer
◮ encoder-decoder without recurrence, just attention
◮ generates N intermediate representations of x and y
◮ all timesteps can be processed at the same time
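A schematic of the stack, just to show the structure: N distinct layers applied in sequence, each refining all positions of the sequence in parallel (the layer bodies are placeholders, e.g. the self-attention sketch above plus a feed-forward block):

```python
def transformer_encoder(H, layers):
    """Standard Transformer encoder: a fixed stack of N distinct layers.

    Each layer has its own parameters; there is no recurrence over positions,
    so every timestep of H is processed at the same time within a layer.
    """
    for layer in layers:
        H = layer(H)
    return H
```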

  9. Universal transformer
◮ generalization of the transformer
◮ recurrent in depth, not width
◮ O(#steps) << O(input length)
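By contrast, a Universal Transformer sketch reuses one shared layer and is recurrent over depth (refinement steps), so the recurrence length is the number of steps rather than the input length:

```python
def universal_transformer_encoder(H, shared_layer, n_steps):
    """Universal Transformer encoder: the SAME layer is applied repeatedly.

    One set of parameters is reused at every depth step; all positions are
    still updated in parallel within each step.
    """
    for _ in range(n_steps):
        H = shared_layer(H)
    return H
```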

  10. Adaptive computation time (ACT)
◮ imagine simplifying equations: different inputs - different difficulty
◮ dynamically adjust the number of computational steps for each symbol in the input
◮ When should we stop?
◮ predict a "pondering value" - the probability of stopping
◮ stop computation for a symbol when (P_shouldstop > threshold) ∨ (step > N)  (sketched below)
◮ the pondering value is trained jointly with the transformer
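A simplified per-symbol halting loop matching the stopping rule above; halt_prob is a hypothetical function returning one pondering probability per position, and note that the actual ACT mechanism accumulates these probabilities across steps rather than comparing them one step at a time:

```python
import numpy as np

def act_refine(H, shared_layer, halt_prob, threshold=0.9, max_steps=8):
    """Refine each position until (P_shouldstop > threshold) ∨ (step > N)."""
    running = np.ones(H.shape[0], dtype=bool)        # which symbols are still computing
    for step in range(max_steps):
        if not running.any():
            break
        updated = shared_layer(H)
        H = np.where(running[:, None], updated, H)   # halted symbols keep their state
        running &= halt_prob(H) <= threshold         # stop once the pondering value is high
    return H
```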

  11. Results: bAbI Question-Answering
◮ read a story and answer questions about the characters
Average error and number of failed tasks (> 5% error) out of 20 (in parentheses; lower is better in both cases)

  12. Ponder time

  13. Results: Learning to Execute
◮ just like progtest
◮ algorithmic, memorization and evaluation tasks

  14. Results: Machine translation
◮ UT with ACT not mentioned
◮ seems to perform worse so far

  15. Another talk about Universal Transformer: Let’s Talk ML, TH:A-1347, 11:00
