SLIDE 1

Machine Learning

Lecture 10: Recurrent Neural Networks

Nevin L. Zhang (lzhang@cse.ust.hk)
Department of Computer Science and Engineering
The Hong Kong University of Science and Technology

This set of notes is based on internet resources and references listed at the end.

SLIDE 2

Introduction

Outline

1 Introduction
2 Recurrent Neural Networks
3 Long Short-Term Memory (LSTM) RNN
4 RNN Architectures
5 Attention

SLIDE 3

Introduction

Introduction

So far, we have been talking about neural network models for labelled data:

{(xi, yi)}Ni=1 → P(y|x),

where each training example consists of one input xi and one output yi.

Next, we will talk about neural network models for sequential data:

{(xi(1), . . . , xi(τi)), (yi(1), . . . , yi(τi))}Ni=1 → P(y(t)|x(1), . . . , x(t)),

where each training example consists of a sequence of inputs (xi(1), . . . , xi(τi)) and a sequence of outputs (yi(1), . . . , yi(τi)), and the current output y(t) depends not only on the current input, but also all previous inputs.

SLIDE 4

Introduction

Introduction: Language Modeling

Data: A collection of sentences. For each sentence, create an output sequence by shifting it:

("what", "is", "the", "problem"), ("is", "the", "problem", −)

From the training pairs, we can learn a neural language model:
It is used to predict the next word: P(wk|w1, . . . , wk−1).
It also defines a probability distribution over sentences:

P(w1, w2, . . . , wn) = P(w1)P(w2|w1)P(w3|w2, w1)P(w4|w3, w2, w1) . . .
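As a concrete illustration, here is a minimal Python sketch of how such shifted input/output pairs could be constructed from a tokenized sentence; the sentence and the "-" padding token are made-up examples, not part of the slides.

```python
# Minimal sketch: build (input, target) pairs for language modeling
# by shifting each tokenized sentence one position to the left.

def make_lm_pair(tokens, pad="-"):
    """Input is the sentence; target is the sentence shifted left by one."""
    inputs = tokens
    targets = tokens[1:] + [pad]
    return inputs, targets

sentence = ["what", "is", "the", "problem"]
x, y = make_lm_pair(sentence)
print(x)  # ['what', 'is', 'the', 'problem']
print(y)  # ['is', 'the', 'problem', '-']
```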

SLIDE 5

Introduction

Introduction: Dialogue and Machine Translation

Data: A collection of matched pairs, e.g., "How are you?" ; "I am fine." We can still think of having an input and an output at each time point, except some inputs and outputs are dummies: ("How", "are", "you", −, −, −), (−, −, −, "I", "am", "fine"). From the training pairs, we can learn a neural model for dialogue or machine translation.

SLIDE 6

Recurrent Neural Networks

Outline

1 Introduction
2 Recurrent Neural Networks
3 Long Short-Term Memory (LSTM) RNN
4 RNN Architectures
5 Attention

SLIDE 7

Recurrent Neural Networks

Recurrent Neural Networks

A circuit diagram, aka recurrent graph (left), where a black square indicates a time-delayed dependence, or an unfolded computational graph, aka unrolled graph (right). The length of the unrolled graph is determined by the length of the input. In other words, the unrolled graphs for different sequences can be of different lengths.

SLIDE 8

Recurrent Neural Networks

Recurrent Neural Networks

The input tokens x(t) are represented as embedding vectors, which are determined together with other model parameters during learning. The hidden states h(t) are also vectors. The current state h(t) depends on the current input x(t) and the previous state h(t−1) as follows:

a(t) = b + Wh(t−1) + Ux(t)
h(t) = tanh(a(t))

where b, W and U are model parameters. They are independent of time t.
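A minimal NumPy sketch of this state update; the dimensions and random initialization are illustrative assumptions, not values from the slides.

```python
import numpy as np

# One recurrent state update: a(t) = b + W h(t-1) + U x(t), h(t) = tanh(a(t)).
n_hidden, n_input = 4, 3                    # arbitrary sizes for illustration
rng = np.random.default_rng(0)
W = rng.normal(size=(n_hidden, n_hidden))   # state-to-state weights
U = rng.normal(size=(n_hidden, n_input))    # input-to-state weights
b = np.zeros(n_hidden)                      # bias

def rnn_step(h_prev, x_t):
    a_t = b + W @ h_prev + U @ x_t
    return np.tanh(a_t)

h = np.zeros(n_hidden)                      # h(0)
for x_t in rng.normal(size=(5, n_input)):   # a dummy input sequence of length 5
    h = rnn_step(h, x_t)                    # same W, U, b reused at every step t
```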

SLIDE 9

Recurrent Neural Networks

Recurrent Neural Networks

The output sequence is produced as follows:

o(t) = c + Vh(t)
ŷ(t) = softmax(o(t))

where c and V are model parameters. They are independent of time t.
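Continuing the sketch above, the per-step output could be computed as follows; V, c and the vocabulary size are again illustrative assumptions.

```python
import numpy as np

n_hidden, n_vocab = 4, 10                  # illustrative sizes
rng = np.random.default_rng(1)
V = rng.normal(size=(n_vocab, n_hidden))   # state-to-output weights
c = np.zeros(n_vocab)                      # output bias

def softmax(z):
    z = z - z.max()                        # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def rnn_output(h_t):
    o_t = c + V @ h_t                      # o(t) = c + V h(t)
    return softmax(o_t)                    # y_hat(t): distribution over the vocabulary

y_hat = rnn_output(np.tanh(rng.normal(size=n_hidden)))
```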

SLIDE 10

Recurrent Neural Networks

Recurrent Neural Networks

This is the loss for one training pair:

L({x(1), . . . , x(τ)}, {y(1), . . . , y(τ)}) = − ∑τt=1 log Pmodel(y(t)|x(1), . . . , x(t)),

where log Pmodel(y(t)|x(1), . . . , x(t)) is obtained by reading the entry for y(t) from the model's output vector ŷ(t). When there are multiple input-target sequence pairs, the losses are added up. Training objective: Minimize the total loss of all training pairs w.r.t. the model parameters and embedding vectors: W, U, V, b, c, θem, where θem are the embedding vectors.
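A sketch of how this loss could be evaluated for one (input, target) sequence pair, reusing the hypothetical rnn_step and rnn_output functions from the earlier sketches; the target token indices are assumed to be integers into the vocabulary.

```python
import numpy as np

def sequence_loss(xs, ys, h0, rnn_step, rnn_output):
    """Negative log-likelihood: -sum_t log P_model(y(t) | x(1..t)).
    xs: sequence of input vectors, ys: sequence of target token indices."""
    h, loss = h0, 0.0
    for x_t, y_t in zip(xs, ys):
        h = rnn_step(h, x_t)               # h(t)
        y_hat = rnn_output(h)              # distribution over the vocabulary
        loss -= np.log(y_hat[y_t])         # read off the entry for the target y(t)
    return loss                            # losses over training pairs are then summed
```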

SLIDE 11

Recurrent Neural Networks

Training RNNs

RNNs are trained using stochastic gradient descent. We need the gradients ∇WL, ∇UL, ∇VL, ∇bL, ∇cL, ∇θemL. They are computed using Backpropagation Through Time (BPTT), which is an adaptation of Backpropagation to the unrolled computational graph. BPTT is implemented in deep learning packages such as TensorFlow.

SLIDE 12

Recurrent Neural Networks

RNN and Self-Supervised Learning

Self-supervised learning is a learning technique where the training data is automatically labelled. It is still supervised learning, but the datasets do not need to be manually labelled by a human; instead, labels can be obtained, e.g., by finding and exploiting the relations between different input signals. RNN training is self-supervised learning.

SLIDE 13

Long Short-Term Memory (LSTM) RNN

Outline

1 Introduction
2 Recurrent Neural Networks
3 Long Short-Term Memory (LSTM) RNN
4 RNN Architectures
5 Attention

SLIDE 14

Long Short-Term Memory (LSTM) RNN

Basic Idea

The Long Short-Term Memory (LSTM) unit is a widely used technique to address long-term dependencies. The key idea is to use memory cells and gates:

c(t) = ft c(t−1) + it a(t)

c(t): memory state at t; a(t): new input at t. If the forget gate ft is open (i.e., 1) and the input gate it is closed (i.e., 0), the current memory is kept. If the forget gate ft is closed (i.e., 0) and the input gate it is open (i.e., 1), the current memory is erased and replaced by the new input. If we can learn ft and it from data, then we can automatically determine how much history to remember/forget.
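A tiny numerical sketch of the two extreme gate settings described above (scalar case; the values are made up):

```python
c_prev, a_t = 0.9, -0.3          # previous memory and new input (made-up values)

# Forget gate open, input gate closed: memory is kept.
f_t, i_t = 1.0, 0.0
print(f_t * c_prev + i_t * a_t)  # 0.9  -> old memory preserved

# Forget gate closed, input gate open: memory replaced by the new input.
f_t, i_t = 0.0, 1.0
print(f_t * c_prev + i_t * a_t)  # -0.3 -> new input stored
```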

SLIDE 15

Long Short-Term Memory (LSTM) RNN

Basic Idea

In the case of vectors, c(t) = ft ⊗ c(t−1) + it ⊗ a(t), where ⊗ means pointwise product. ft is called the forget gate vector because it determines which components of the previous state, and how much of them, to remember/forget. it is called the input gate vector because it determines which components of the input from h(t−1) and x(t), and how much of them, should go into the current state. If we can learn ft and it from data, then we can automatically determine which components to remember/forget and how much of them to remember/forget.

SLIDE 16

Long Short-Term Memory (LSTM) RNN

LSTM Cell

In a standard RNN:

a(t) = b + Wh(t−1) + Ux(t), h(t) = tanh(a(t))

In an LSTM, we introduce a cell state vector ct, and set

ct = ft ⊗ ct−1 + it ⊗ a(t)
h(t) = tanh(ct)

where ft and it are vectors.

SLIDE 17

Long Short-Term Memory (LSTM) RNN

LSTM Cell: Learning the Gates

ft is determined based on the current input x(t) and the previous hidden unit h(t−1):

ft = σ(Wf x(t) + Uf h(t−1) + bf),

where Wf, Uf, bf are parameters to be learned from data. it is also determined based on the current input x(t) and the previous hidden unit h(t−1):

it = σ(Wi x(t) + Ui h(t−1) + bi),

where Wi, Ui, bi are parameters to be learned from data. Note that the sigmoid activation function is used for the gates so that their values are often close to 0 or 1. In contrast, tanh is used for the output h(t) so as to have a strong gradient signal during backprop.

SLIDE 18

Long Short-Term Memory (LSTM) RNN

LSTM Cell: Output Gate

We can also have an output gate to control which components of the state vector ct, and how much of them, should be outputted:

ot = σ(Wq x(t) + Uq h(t−1) + bq)

where Wq, Uq, bq are the learnable parameters, and set

h(t) = ot ⊗ tanh(ct)

SLIDE 19

Long Short-Term Memory (LSTM) RNN

LSTM Cell: Summary

A standard RNN cell:

a(t) = b + Wh(t−1) + Ux(t), h(t) = tanh(a(t))

An LSTM cell:

ft = σ(Wf x(t) + Uf h(t−1) + bf)  (forget gate, 0 in figure)
it = σ(Wi x(t) + Ui h(t−1) + bi)  (input gate, 1 in figure)
ot = σ(Wq x(t) + Uq h(t−1) + bq)  (output gate, 3 in figure)
ct = ft ⊗ ct−1 + it ⊗ tanh(Ux(t) + Wh(t−1) + b)  (update memory)
h(t) = ot ⊗ tanh(ct)  (next hidden unit)
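Putting the equations together, here is a minimal NumPy sketch of one LSTM step; all sizes and the random initialization are illustrative assumptions.

```python
import numpy as np

n_hidden, n_input = 4, 3
rng = np.random.default_rng(2)
def init(shape): return rng.normal(scale=0.1, size=shape)

# Gate parameters (forget, input, output) and cell-candidate parameters.
Wf, Uf, bf = init((n_hidden, n_input)), init((n_hidden, n_hidden)), np.zeros(n_hidden)
Wi, Ui, bi = init((n_hidden, n_input)), init((n_hidden, n_hidden)), np.zeros(n_hidden)
Wq, Uq, bq = init((n_hidden, n_input)), init((n_hidden, n_hidden)), np.zeros(n_hidden)
U,  W,  b  = init((n_hidden, n_input)), init((n_hidden, n_hidden)), np.zeros(n_hidden)

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, c_prev, x_t):
    f_t = sigmoid(Wf @ x_t + Uf @ h_prev + bf)                    # forget gate
    i_t = sigmoid(Wi @ x_t + Ui @ h_prev + bi)                    # input gate
    o_t = sigmoid(Wq @ x_t + Uq @ h_prev + bq)                    # output gate
    c_t = f_t * c_prev + i_t * np.tanh(U @ x_t + W @ h_prev + b)  # update memory
    h_t = o_t * np.tanh(c_t)                                      # next hidden state
    return h_t, c_t

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.normal(size=(5, n_input)):      # a dummy input sequence
    h, c = lstm_step(h, c, x_t)
```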

SLIDE 20

Long Short-Term Memory (LSTM) RNN

Gated Recurrent Unit

The Gated Recurrent Unit (GRU) is another gating mechanism that allows RNNs to efficiently learn long-range dependencies. It is similar to the LSTM, but has no separate memory cell and no output gate, and hence has fewer parameters; it is a simplified version of an LSTM unit. Its performance is also similar to the LSTM, except that it tends to do better on small datasets.
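For completeness, a NumPy sketch of one GRU step. The update-gate and reset-gate equations below follow the standard GRU formulation from the literature (they are not spelled out on the slide), and all sizes are illustrative.

```python
import numpy as np

n_hidden, n_input = 4, 3
rng = np.random.default_rng(3)
def init(shape): return rng.normal(scale=0.1, size=shape)

Wz, Uz, bz = init((n_hidden, n_input)), init((n_hidden, n_hidden)), np.zeros(n_hidden)
Wr, Ur, br = init((n_hidden, n_input)), init((n_hidden, n_hidden)), np.zeros(n_hidden)
Wh, Uh, bh = init((n_hidden, n_input)), init((n_hidden, n_hidden)), np.zeros(n_hidden)

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

def gru_step(h_prev, x_t):
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev + bz)               # update gate
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev + br)               # reset gate
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r_t * h_prev) + bh)   # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde              # no separate memory cell

h = np.zeros(n_hidden)
for x_t in rng.normal(size=(5, n_input)):
    h = gru_step(h, x_t)
```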

SLIDE 21

Long Short-Term Memory (LSTM) RNN

Gated Recurrent Unit

SLIDE 22

RNN Architectures

Outline

1 Introduction
2 Recurrent Neural Networks
3 Long Short-Term Memory (LSTM) RNN
4 RNN Architectures
5 Attention

SLIDE 23

RNN Architectures

So Far ....

So far, we have concentrated on the following architecture, which can be used to model the relationship between pairs of sequences (x(1), . . . , x(τ)) and (y(1), . . . , y(τ)) of the same length. It is used for language modeling.

SLIDE 24

RNN Architectures

Deep RNNs

Sometimes, we might want to use multiple layers of hidden units. This leads to deep recurrent neural networks. Using more layers usually leads to better performance.

SLIDE 25

RNN Architectures

Bidirectional RNNs

In a one-directional RNN, h(t) only captures information from the past (x(1), . . . , x(t)), and we use h(t) to predict y(t). Sometimes, we might need information from the future to predict y(t). In such cases, we can use bidirectional RNNs. They have been found useful in handwriting recognition, speech recognition, and bioinformatics.
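A minimal sketch of the bidirectional idea: run one RNN left-to-right and another right-to-left, then concatenate the two state sequences position by position. The step functions (with their own parameter sets) are hypothetical placeholders.

```python
import numpy as np

def bidirectional_states(xs, n_hidden, step_fwd, step_bwd):
    """Return, for each position t, the concatenation of the forward state
    (summarizing x(1..t)) and the backward state (summarizing x(t..T))."""
    T = len(xs)
    hf, hb = np.zeros(n_hidden), np.zeros(n_hidden)
    fwd, bwd = [], [None] * T
    for t in range(T):                 # forward pass over x(1), ..., x(T)
        hf = step_fwd(hf, xs[t])
        fwd.append(hf)
    for t in reversed(range(T)):       # backward pass over x(T), ..., x(1)
        hb = step_bwd(hb, xs[t])
        bwd[t] = hb
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```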

SLIDE 26

RNN Architectures

The Encoder-Decoder Architecture

The encoder-decoder or sequence-to-sequence architecture is used for learning to generate an output sequence (y(1), . . . , y(ny)) in response to an input sequence (x(1), . . . , x(nx)). The context variable C represents a semantic summary of the input sequence. The decoder defines a conditional distribution P(y(1), . . . , y(ny)|C) over sequences, from which output sequences are generated. The architecture is used in dialogue systems and machine translation.
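A high-level sketch of the encoder-decoder loop under these assumptions: the final encoder state is used as the context C, and the decoder greedily emits tokens until an end-of-sequence marker. All helper functions (encoder_step, decoder_step, output_dist, embed) are hypothetical.

```python
import numpy as np

def seq2seq_generate(xs, encoder_step, decoder_step, output_dist,
                     embed, n_hidden, eos_id, max_len=20):
    """Greedy decoding sketch for an encoder-decoder model."""
    h = np.zeros(n_hidden)
    for x_t in xs:                       # encoder: read the whole input sequence
        h = encoder_step(h, x_t)
    C = h                                # context variable: summary of the input

    s, y_prev, ys = C, eos_id, []        # decoder starts from the context
    for _ in range(max_len):
        s = decoder_step(s, embed(y_prev))
        y_t = int(np.argmax(output_dist(s)))   # greedy choice of the next token
        if y_t == eos_id:
            break
        ys.append(y_t)
        y_prev = y_t                     # feed the chosen token back in
    return ys
```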

SLIDE 27

RNN Architectures

Seq2seq for machine translation

The generated sequence can be a translation of the input sequence in machine translation. Note that this model assumes: p(s2|C, s1, y(1)) = p(s2|s1, y(1)), . . .

SLIDE 28

RNN Architectures

Seq2seq for ChatBot

The generated sequence can also be a reply to the input sequence in dialogue systems.

SLIDE 29

RNN Architectures

Seq2seq in Action

http://jalammar.github.io/images/seq2seq_6.mp4

SLIDE 30

RNN Architectures

Map a Vector to a Sequence

Maps a vector x to a sequence (y(1), . . . , y(τ)). Found useful in image captioning.

SLIDE 31

RNN Architectures

Teacher Forcing in Decoder

Suppose the ground-truth output is "Two people reading a book", but the model makes a mistake at the second position. With Teacher Forcing, we feed "people" (the ground-truth token) to the RNN for the 3rd prediction, after computing and recording the loss for the 2nd prediction.

SLIDE 32

RNN Architectures

Teacher Forcing in Decoder

Pros: Training with Teacher Forcing converges faster. At the early stages of training, the predictions of the model are very bad. If we do not use Teacher Forcing, the hidden states of the model will be updated by a sequence of wrong predictions, errors will accumulate, and it is difficult for the model to learn from that.

Cons: During inference, since there is usually no ground truth available, the RNN model will need to feed its own previous prediction back to itself for the next prediction. Therefore there is a discrepancy between training and inference, and this might lead to poor model performance and instability. This is known as Exposure Bias in the literature.
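A sketch contrasting the two decoding regimes described above: training with Teacher Forcing versus free-running inference. The helper functions (decoder_step, output_dist, embed) and the start token are the hypothetical ones from the earlier encoder-decoder sketch.

```python
import numpy as np

def decode_teacher_forcing(targets, s0, decoder_step, output_dist, embed, start_id):
    """Training: feed the ground-truth previous token and accumulate the loss."""
    s, y_prev, loss = s0, start_id, 0.0
    for y_t in targets:
        s = decoder_step(s, embed(y_prev))
        loss -= np.log(output_dist(s)[y_t])   # loss uses the model's prediction...
        y_prev = y_t                          # ...but the ground truth is fed back
    return loss

def decode_free_running(s0, decoder_step, output_dist, embed, start_id, max_len=20):
    """Inference: the model must feed back its own previous prediction."""
    s, y_prev, ys = s0, start_id, []
    for _ in range(max_len):
        s = decoder_step(s, embed(y_prev))
        y_prev = int(np.argmax(output_dist(s)))   # own prediction is fed back
        ys.append(y_prev)
    return ys
```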

SLIDE 33

Attention

Outline

1 Introduction
2 Recurrent Neural Networks
3 Long Short-Term Memory (LSTM) RNN
4 RNN Architectures
5 Attention

SLIDE 34

Attention

Motivation

In the seq2seq model, the context variable C is shared at all time steps of the decoder. So, we look at the same summary of the input sequence when generating the output at all time steps. This is not desirable.

SLIDE 35

Attention

Motivation

In fact, we might want to focus on different parts of the input at different times: focus on h1 at time step 1, on h4 at time step 2, on h6 at time step 4, . . .

SLIDE 36

Attention

Attention

So, we introduce a context variable Ct for each time step of the decoder:

Ct = ∑j αt,j hj

It is a linear combination of the hidden states from the encoder. At different time steps t, we use different values for the weights αt,j, and consequently focus on different parts of the input.

SLIDE 37

Attention

Attention Example

SLIDE 38

Attention

Attention

For a particular sequence, we have finished step t − 1 of decoding. What information do we use to determine the attention weights αt,j for step t?

ct−1: Context vector for step t − 1. It tells us what to look for next. If "held" was attended to at step t − 1, we are likely to have "talk", "meeting", etc. at step t. If "drink" was attended to at step t − 1, we are likely to have "water", "juice", etc. at step t. In general, ct−1 is known as the query.

hj: Latent state of step j of encoding. It tells us whether hj is what we need next. In general, it is known as the key.

The attention weight is to be determined by assessing how compatible the query and the key are.

SLIDE 39

Attention

Compatibility Function

αt,j = hj⊤ ct−1, assuming hj and ct−1 have equal dimensionality. (This is called dot-product attention.)

αt,j ← exp(αt,j) / ∑j exp(αt,j)
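A NumPy sketch of this computation: given the encoder states hj (used as keys and values) and the previous decoder context ct−1 (the query), compute dot-product scores, normalize them with a softmax, and form the context vector Ct as on the previous slide. The shapes are illustrative assumptions.

```python
import numpy as np

def dot_product_attention(H, c_prev):
    """H: (T, d) encoder hidden states; c_prev: (d,) query from step t-1.
    Returns the attention weights alpha_t and the context vector C_t."""
    scores = H @ c_prev                              # alpha_t,j = h_j^T c_(t-1)
    scores = scores - scores.max()                   # for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()    # softmax normalization
    C_t = alpha @ H                                  # C_t = sum_j alpha_t,j h_j
    return alpha, C_t

rng = np.random.default_rng(4)
H = rng.normal(size=(6, 4))                          # six encoder states of dimension 4
alpha, C_t = dot_product_attention(H, rng.normal(size=4))
```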

SLIDE 40

Attention

Attention

The weight αt,j is obtained using the query ct−1 and the key hj. It tells us how relevant the information at step j is. Then we retrieve the value hj at step j and compute the weighted sum to get:

Ct = ∑j αt,j hj

Note that, for different input sequences, the weights αt,j are different. So, attention means dynamic weights. (In a standard NN, weights are static and do not depend on the input.)

SLIDE 41

Attention

Alignments between source and target sentences

SLIDE 42

Attention

Attention in Action

http://jalammar.github.io/images/attention_tensor_dance.mp4

SLIDE 43

Attention

References

Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).
Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. www.deeplearningbook.org
Jay Alammar: http://jalammar.github.io/
Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems. 2017.
