NPFL114, Lecture 12
Transformer, External Memory Networks
Milan Straka
May 20, 2019
Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics
Exams
Five questions with written preparation; then we go through your answers together (or you can leave and let me grade them by myself). Each question is worth 20 points. Additionally, up to 40 points (the surplus above 80 points; there is no distinction between regular and competition points) are transferred from the practicals, and up to 10 points are given for GitHub pull requests. To pass the exam, you need to obtain at least 60, 75 and 90 out of the 100 points for the written exam (plus up to 40 points from the practicals) to get grades 3, 2 and 1, respectively. The SIS should give you an exact time of the exam (including a gap between students) so that you do not all come at once.
In the winter semester:
Reading group of deep learning papers (in all areas). Every participant presents a paper about deep learning, learning how to read a paper, how to present it in an understandable way, and gaining deep learning knowledge from the other presentations.
In a sense a continuation of Deep Learning, but instead of supervised learning, reinforcement learning is the main method. Similar format to the Deep Learning course.
A course intended as a prequel to Deep Learning: an introduction to machine learning (regression, classification, structured prediction, clustering, hyperparameter optimization; decision trees, SVMs, maximum entropy classifiers, gradient boosting, …), with practicals in Python.
Using REINFORCE with a baseline, we can automatically design neural network architectures. We fix the overall architecture and search only for the Normal and Reduction cells.
Figure 2 of paper "Learning Transferable Architectures for Scalable Image Recognition", https://arxiv.org/abs/1707.07012.
Every block is designed by an RNN controller generating the individual operations.
Figure 3 of paper "Learning Transferable Architectures for Scalable Image Recognition", https://arxiv.org/abs/1707.07012.
Figure 4 of paper "Learning Transferable Architectures for Scalable Image Recognition", https://arxiv.org/abs/1707.07012.
Table 2 of paper "Learning Transferable Architectures for Scalable Image Recognition", https://arxiv.org/abs/1707.07012.
Figure 5 of paper "Learning Transferable Architectures for Scalable Image Recognition", https://arxiv.org/abs/1707.07012.
For some sequence processing tasks, sequential processing of the elements might be too slow; the Transformer architecture instead avoids recurrence entirely and relies on attention, processing all sequence elements in parallel.
Figure 1 of paper "Attention Is All You Need", https://arxiv.org/abs/1706.03762.
The attention module for queries $Q$, keys $K$ and values $V$ is defined as

$$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V.$$

The queries, keys and values are computed from the current word representations $V$ using a linear transformation as

$$Q = V \cdot W^Q, \quad K = V \cdot W^K, \quad V = V \cdot W^V.$$
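As a sanity check, here is a minimal numpy sketch of the formula above (a single head, no masking); the dimensions and projection matrices are illustrative only.

```python
# Scaled dot-product attention, directly following the formula above.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # [num_queries, num_keys]
    return softmax(scores, axis=-1) @ V        # weighted combination of the values

# Word representations projected to queries, keys and values.
X = np.random.randn(10, 64)                    # 10 words of dimension 64
W_Q, W_K, W_V = (np.random.randn(64, 64) for _ in range(3))
output = attention(X @ W_Q, X @ W_K, X @ W_V)  # shape (10, 64)
```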
Multi-head attention is used in practice. Instead of using one huge attention, we split the queries, keys and values into several groups (similar to how ResNeXt works), compute the attention in each of the groups separately, and then concatenate the results.
(left) Scaled Dot-Product Attention, (right) Multi-Head Attention
Figure 2 of paper "Attention Is All You Need", https://arxiv.org/abs/1706.03762.
We need to encode positional information, which was implicit in RNNs. Two possibilities are learned embeddings for every position, or sinusoids of different frequencies:

$$\mathrm{PE}_{(pos, 2i)} = \sin\big(pos / 10000^{2i/d}\big),$$
$$\mathrm{PE}_{(pos, 2i+1)} = \cos\big(pos / 10000^{2i/d}\big).$$

The latter choice of functions should allow the model to attend to relative positions, since for any fixed $k$, $\mathrm{PE}_{pos+k}$ is a linear function of $\mathrm{PE}_{pos}$.
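A short sketch computing the sinusoidal positional encodings defined above; the dimensions match the visualization that follows.

```python
# Sinusoidal positional encodings: sin on even dimensions, cos on odd ones.
import numpy as np

def positional_encoding(max_len, d):
    pos = np.arange(max_len)[:, np.newaxis]            # positions 0 .. max_len-1
    i = np.arange(0, d, 2)[np.newaxis, :]              # even dimensions 0, 2, ...
    angles = pos / np.power(10000.0, i / d)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)                       # PE_(pos, 2i)   = sin(...)
    pe[:, 1::2] = np.cos(angles)                       # PE_(pos, 2i+1) = cos(...)
    return pe

pe = positional_encoding(20, 512)   # 20 positions of dimension 512, as visualized below
```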
Positional embeddings for 20 words of dimension 512, lighter colors representing values closer to 1 and darker colors representing values closer to -1.
http://jalammar.github.io/illustrated-transformer/
The network is regularized by: dropout of the input embeddings, dropout of each sub-layer output just before it is added to the residual connection (and then normalized), and label smoothing. The default dropout rate and also the label smoothing weight is 0.1.
Training can be performed in parallel because of the masked attention: the softmax weights of the self-attention are zeroed out so that a word cannot attend to later words in the sequence. However, inference is still sequential (and no substantial improvement analogous to parallel WaveNet inference has been achieved).
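A sketch of the masking: the usual implementation sets the scores of future positions to minus infinity before the softmax, so their attention weights become exactly zero and all target positions can be trained in parallel.

```python
# Causal (masked) self-attention weights for one head.
import numpy as np

n = 5
scores = np.random.randn(n, n)                       # QK^T / sqrt(d_k) for one head
mask = np.triu(np.ones((n, n), dtype=bool), k=1)     # True above the main diagonal
scores[mask] = -np.inf                               # forbid attending to later words
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)       # rows sum to 1, future weights are 0
```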
Table 1 of paper "Attention Is All You Need", https://arxiv.org/abs/1706.03762.
Table 2 of paper "Attention Is All You Need", https://arxiv.org/abs/1706.03762.
Variations of the base Transformer (unlisted values equal the base model; results given as dev PPL / dev BLEU / params ×10^6):
base: N=6, d_model=512, d_ff=2048, h=8, d_k=d_v=64, P_drop=0.1, eps_ls=0.1, 100K steps: 4.92 / 25.8 / 65
(A) heads: h=1, d_k=d_v=512: 5.29 / 24.9; h=4, d_k=d_v=128: 5.00 / 25.5; h=16, d_k=d_v=32: 4.91 / 25.8; h=32, d_k=d_v=16: 5.01 / 25.4
(B) key size: d_k=16: 5.16 / 25.1 / 58; d_k=32: 5.01 / 25.4 / 60
(C) model size: N=2: 6.11 / 23.7 / 36; N=4: 5.19 / 25.3 / 50; N=8: 4.88 / 25.5 / 80; d_model=256, d_k=d_v=32: 5.75 / 24.5 / 28; d_model=1024, d_k=d_v=128: 4.66 / 26.0 / 168; d_ff=1024: 5.12 / 25.4 / 53; d_ff=4096: 4.75 / 26.2 / 90
(D) regularization: P_drop=0.0: 5.77 / 24.6; P_drop=0.2: 4.95 / 25.5; eps_ls=0.0: 4.67 / 25.3; eps_ls=0.2: 5.47 / 25.7
(E) learned positional embeddings instead of sinusoids: 4.92 / 25.7
big: N=6, d_model=1024, d_ff=4096, h=16, P_drop=0.3, 300K steps: 4.33 / 26.4 / 213
Table 4 of paper "Attention Is All You Need", https://arxiv.org/abs/1706.03762.
So far, all input information was stored either directly in the network weights, or in the state of a recurrent network. However, mammalian brains seem to operate with a working memory: a capacity for short-term storage of information and its rule-based manipulation. We can therefore try to introduce an external memory into a neural network. The memory will be a matrix $M$, where rows correspond to memory cells.
The network will control the memory using a controller which reads from the memory and writes to it. Although the original paper also considered a feed-forward (non-recurrent) controller, usually the controller is a recurrent LSTM network.
Figure 1 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.
To read the memory in a differentiable way, the controller at time $t$ emits a read distribution $w_t$ over the memory locations; the read vector $r_t$ is then

$$r_t = \sum_i w_t(i) \cdot M_t(i).$$

Writing is performed in two steps: an erase followed by an add. The controller at time $t$ emits a write distribution $w_t$ over the memory locations, together with an erase vector $e_t$ and an add vector $a_t$. The memory is then updated as

$$M_t(i) = M_{t-1}(i)\big[1 - w_t(i)\, e_t\big] + w_t(i)\, a_t.$$
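A sketch of the differentiable read and the erase-then-add write defined above; the memory size and vector dimensions are illustrative.

```python
# NTM memory read (convex combination of rows) and erase-then-add write.
import numpy as np

def read(M, w):
    return w @ M                       # r_t = sum_i w_t(i) * M_t(i)

def write(M, w, e, a):
    M = M * (1 - np.outer(w, e))       # erase: M_{t-1}(i) * (1 - w_t(i) e_t)
    return M + np.outer(w, a)          # add:   ... + w_t(i) a_t

memory = np.zeros((128, 20))                           # 128 cells of dimension 20
w = np.full(128, 1 / 128)                              # a (here uniform) read/write distribution
e, a = np.random.rand(20), np.random.randn(20)         # erase and add vectors
memory = write(memory, w, e, a)
r = read(memory, w)                                    # read vector of dimension 20
```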
The addressing mechanism is designed to allow both content addressing, and location addressing.
Figure 2 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.
Content addressing starts by the controller emitting a key vector $k_t$, which is compared to all memory locations $M_t(i)$, generating a distribution using a softmax with temperature $\beta_t$:

$$w_t^c(i) = \frac{\exp\big(\beta_t \cdot \operatorname{distance}(k_t, M_t(i))\big)}{\sum_j \exp\big(\beta_t \cdot \operatorname{distance}(k_t, M_t(j))\big)}.$$

The distance measure is usually the cosine similarity

$$\operatorname{distance}(a, b) = \frac{a \cdot b}{\lVert a \rVert \cdot \lVert b \rVert}.$$
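A sketch of content addressing: cosine similarity of the key to every memory cell, turned into a distribution by a softmax with temperature $\beta_t$.

```python
# Content-based addressing with cosine similarity and a temperature softmax.
import numpy as np

def content_addressing(M, k, beta):
    similarity = (M @ k) / (np.linalg.norm(M, axis=1) * np.linalg.norm(k) + 1e-8)
    scores = beta * similarity
    w = np.exp(scores - scores.max())
    return w / w.sum()

M = np.random.randn(128, 20)                  # memory with 128 cells of dimension 20
w_c = content_addressing(M, k=np.random.randn(20), beta=5.0)
```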
To allow iterative access to the memory, the controller might decide to reuse the memory location from the previous timestep. Specifically, the controller emits an interpolation gate $g_t$ and defines

$$w_t^g = g_t w_t^c + (1 - g_t) w_{t-1}.$$

Then, the current weighting may be shifted, i.e., the controller might decide to "rotate" the weights by a small integer. For a given range of shifts (the simplest case being only the shifts $\{-1, 0, 1\}$), the network emits a softmax distribution $s_t$ over the shifts, and the weights are then defined using a circular convolution

$$\tilde w_t(i) = \sum_j w_t^g(j)\, s_t(i - j).$$

Finally, in order not to lose precision over time, the controller emits a sharpening factor $\gamma_t$ and the final memory location weights are

$$w_t(i) = \frac{\tilde w_t(i)^{\gamma_t}}{\sum_j \tilde w_t(j)^{\gamma_t}}.$$
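A sketch of the three location-based addressing steps above, assuming shifts over $\{-1, 0, +1\}$ and illustrative gate values.

```python
# Interpolation gate, circular convolution with the shift distribution, sharpening.
import numpy as np

def location_addressing(w_c, w_prev, g, s, gamma):
    w_g = g * w_c + (1 - g) * w_prev                  # interpolation with previous weights
    w_shifted = sum(prob * np.roll(w_g, shift)        # circular convolution over shifts
                    for shift, prob in zip((-1, 0, 1), s))
    w = w_shifted ** gamma                            # sharpening
    return w / w.sum()

w_prev = np.eye(8)[3]                                 # previously focused on cell 3
w = location_addressing(np.full(8, 1 / 8), w_prev, g=0.5,
                        s=np.array([0.1, 0.8, 0.1]), gamma=2.0)
```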
Even if not specified in the original paper, following the DNC paper, the controller can be implemented as a (potentially deep) LSTM. Assuming $R$ read heads and one write head, the input consists of $x_t$ and the $R$ read vectors $r_{t-1}^1, \ldots, r_{t-1}^R$ from the previous time step, the outputs of the controller are the vectors $(\nu_t, \xi_t)$, and the final output is $y_t = \nu_t + W_r \big[r_t^1, \ldots, r_t^R\big]$. The $\xi_t$ is a concatenation of

$$k_t^1, \beta_t^1, g_t^1, s_t^1, \gamma_t^1,\; k_t^2, \beta_t^2, g_t^2, s_t^2, \gamma_t^2,\; \ldots,\; k_t^w, \beta_t^w, g_t^w, s_t^w, \gamma_t^w,\; e_t, a_t.$$
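An illustrative sketch of splitting $\xi_t$ into the per-head addressing parameters plus the write head's erase and add vectors; the exact layout and dimensions here are an assumption, not taken from the paper.

```python
# Splitting the flat controller output xi_t into head parameters (hypothetical layout).
import numpy as np

def split_xi(xi, R, key_dim, shifts=3):
    per_head = key_dim + 1 + 1 + shifts + 1                  # k, beta, g, s, gamma
    heads, offset = [], 0
    for _ in range(R + 1):                                   # R read heads + 1 write head
        c = xi[offset:offset + per_head]
        heads.append(dict(k=c[:key_dim], beta=c[key_dim], g=c[key_dim + 1],
                          s=c[key_dim + 2:key_dim + 2 + shifts], gamma=c[-1]))
        offset += per_head
    # In practice beta, g, s, gamma are also squashed (softplus, sigmoid, softmax, 1+softplus).
    e, a = xi[offset:offset + key_dim], xi[offset + key_dim:offset + 2 * key_dim]
    return heads, e, a

xi = np.random.randn(3 * (20 + 6) + 2 * 20)                  # R=2 read heads, key_dim=20
heads, e, a = split_xi(xi, R=2, key_dim=20)
```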
Repeat the same sequence as given on input. Trained with sequences of length up to 20.
Figure 3 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.
Figure 4: NTM Generalisation on the Copy Task. The four pairs of plots in the top row depict network outputs and corresponding copy targets for test sequences of length 10, 20, 30, and 50, respectively. The plots in the bottom row are for a length 120 sequence. The network was only trained on sequences of up to length 20. The first four sequences are reproduced with high confidence and very few mistakes. The longest one has a few more local errors and one global error: at the point indicated by the red arrow at the bottom, a single vector is duplicated, pushing all subsequent vectors one step back. Despite being subjectively close to a correct copy, this leads to a high loss.
Figure 4 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.
Figure 5: LSTM Generalisation on the Copy Task. The plots show inputs and outputs for the same sequence lengths as Figure 4. Like NTM, LSTM learns to reproduce sequences of up to length 20, but it fails to generalise to longer ones.
Also note that the length of the accurate prefix decreases as the sequence length increases, suggesting that the network has trouble retaining information for long periods.
Figure 5 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.
Figure 6 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.
In the associative recall task, a sequence consisting of items of length 3 is given on input. Then a randomly chosen item is presented again on input, and the goal is to produce the item that immediately followed it in the sequence.
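A sketch of generating one associative recall example; the number of items and the bit width are illustrative parameters, not the ones used in the paper.

```python
# Generate one associative recall example: items of 3 binary vectors each,
# a query item, and the target item that followed the query in the sequence.
import numpy as np

rng = np.random.default_rng()
num_items, item_len, bits = 4, 3, 6
items = rng.integers(0, 2, size=(num_items, item_len, bits))   # the input sequence of items
query_index = rng.integers(0, num_items - 1)                   # never the last item
query = items[query_index]                                     # item shown again as the query
target = items[query_index + 1]                                # the item the network should output
```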
Figure 11 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.
Figure 12 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.
NTM was later extended to a Differentiable Neural Computer.
Figure panels: (a) controller, (b) read and write heads (the write head emitting a write vector, an erase vector and a write key; two read heads emitting read keys and read modes and producing read vectors), (c) memory, (d) memory usage and temporal links.
Figure 1 of paper "Hybrid computing using a neural network with dynamic external memory", https://www.nature.com/articles/nature20101.
The DNC contains multiple read heads and one write head. The controller is a deep LSTM network, with the input at time $t$ being the current input $x_t$ and the $R$ read vectors $r_{t-1}^1, \ldots, r_{t-1}^R$ from the previous time step. The outputs of the controller are the vectors $(\nu_t, \xi_t)$, and the final output is $y_t = \nu_t + W_r \big[r_t^1, \ldots, r_t^R\big]$. The $\xi_t$ is a concatenation of the parameters for the read and write heads (keys, gates, sharpening parameters, …).

In the DNC, the usage of every memory location is tracked, which allows us to define an allocation weighting. Furthermore, for every memory location we track which locations were written to previously ($b_t$) and subsequently ($f_t$). The write weighting is then defined as a weighted combination of the allocation weighting and the write content weighting, and each read weighting is computed as a weighted combination of the read content weighting, the previous-write (backward) weighting, and the subsequent-write (forward) weighting.
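A sketch of how these weightings are combined; the gate and read-mode values below are illustrative stand-ins for quantities that the network produces as part of $\xi_t$.

```python
# Combining DNC weightings: allocation + content for writing,
# backward + content + forward modes for reading.
import numpy as np

def write_weighting(alloc_w, content_w, write_gate, alloc_gate):
    # Interpolate between writing to freshly allocated cells and to
    # content-addressed cells, scaled by the overall write gate.
    return write_gate * (alloc_gate * alloc_w + (1 - alloc_gate) * content_w)

def read_weighting(backward_w, content_w, forward_w, mode):
    # mode is a distribution over the three read modes: backward, content, forward.
    return mode[0] * backward_w + mode[1] * content_w + mode[2] * forward_w

N = 8
uniform = np.full(N, 1 / N)
w_write = write_weighting(uniform, uniform, write_gate=0.9, alloc_gate=0.3)
w_read = read_weighting(uniform, uniform, uniform, mode=np.array([0.2, 0.5, 0.3]))
```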
Figure panels: (a) a random graph; (b) the London Underground graph, given as 84 input edges such as (OxfordCircus, TottenhamCtRd, Central), together with a traversal question (BondSt, _, Central), (_, _, Circle), …, (_, _, Jubilee) and a shortest-path question (Moorgate, PiccadillyCircus, _), whose answers are sequences of edges such as (Moorgate, Bank, Northern), (Bank, Holborn, Central), (Holborn, LeicesterSq, Piccadilly), (LeicesterSq, PiccadillyCircus, Piccadilly); (c) a family tree, given as 54 input edges such as (Charlotte, Alan, Father), with an inference question (Freya, _, MaternalGreatUncle) answered by (Freya, Fergus, MaternalGreatUncle).
Figure 2 of paper "Hybrid computing using a neural network with dynamic external memory", https://www.nature.com/articles/nature20101.
Figure panels: (a) read and write weightings over time, (b) the read modes (backward, content, forward) of the write head and the two read heads, (c) the London Underground map used in the task, (d) the read key and the decoded memory locations (from station, to station, line) during the graph definition, query, and answer phases (time steps 5 to 30).
Figure 3 of paper "Hybrid computing using a neural network with dynamic external memory", https://www.nature.com/articles/nature20101.
Figure 1 of paper "One-shot learning with Memory-Augmented Neural Networks", https://arxiv.org/abs/1605.06065.
Page 3 of paper "One-shot learning with Memory-Augmented Neural Networks", https://arxiv.org/abs/1605.06065. Page 4 of paper "One-shot learning with Memory-Augmented Neural Networks", https://arxiv.org/abs/1605.06065.
Figure 2 of paper "One-shot learning with Memory-Augmented Neural Networks", https://arxiv.org/abs/1605.06065.