Machine Learning - MT 2016 11 & 12. Neural Networks Varun - PowerPoint PPT Presentation

Machine Learning - MT 2016 11 & 12. Neural Networks Varun Kanade University of Oxford November 14 & 16, 2016

Announcements ◮ Problem Sheet 3 due this Friday by noon ◮ Practical 2 this week: Compare NBC & LR ◮ (Optional) Reading a paper 1

Outline Today, we’ll study feedforward neural networks ◮ Multi-layer perceptrons ◮ Classification or regression settings ◮ Backpropagation to compute gradients ◮ Brief introduction to tensorflow and MNIST 2

Artificial Neuron : Logistic Regression 1 Unit b x 1 w 1 y = Pr( y = 1 | x , w , b ) Σ � w 2 x 2 Linear Function Non-linearity ◮ A unit in a neural network computes a linear function of its input and is then composed with a non-linear activation function ◮ For logistic regression, the non-linear activation function is the sigmoid 1 σ ( z ) = 1 + e − z ◮ The separating surface is linear 3

Multilayer Perceptron (MLP) : Classification 1 b 2 1 Σ w 2 11 x 1 w 3 11 w 2 12 b 3 1 y = Pr( y = 1 | x , W , b ) � Σ 1 w 2 21 x 2 w 3 12 w 2 22 Σ b 2 2 1 4

Multilayer Perceptron (MLP) : Regression 1 b 2 1 Σ w 2 11 x 1 w 3 11 w 2 12 b 3 1 y = E [ y | x , W , b ] � Σ 1 w 2 21 x 2 w 3 12 w 2 22 Σ b 2 2 1 5

A Toy Example 6

Logistic Regression Fails Badly 7

Solve using MLP z 2 a 2 1 1 1 b 2 1 Σ w 2 z 3 a 3 1 1 11 x 1 w 3 11 w 2 12 y = Pr( y = 1 | x , W i , b i ) b 3 1 Σ � 1 w 2 z 2 a 2 2 2 21 w 3 x 2 12 w 2 22 Σ b 2 2 1 Let us use the notation: a 1 = z 1 = x z 2 = W 2 a 1 + b 2 a 2 = tanh( z 2 ) z 3 = W 3 a 2 + b 3 y = a 3 = σ ( z 3 ) 8

Scatterplot Comparison ( x 1 , x 2 ) vs ( a 2 1 , a 2 2 ) 9

Decision Boundary of the Neural Net 10

Feedforward Neural Networks Layer 1 Layer 2 Layer 3 Layer 4 (Input) (Hidden) (Hidden) (Output) Fully Connected Layer 11

Computing Gradients on Toy Example b 2 1 Want the derivatives z 2 1 → a 2 w 2 1 11 x 1 ∂ℓ ∂ℓ 11 , w 3 ∂w 2 ∂w 2 w 2 11 12 12 ∂ℓ ∂ℓ b 3 z 3 1 → a 3 ℓ ( y, a 3 21 , 1 ) ∂w 2 ∂w 2 1 1 22 w 2 21 ∂ℓ ∂ℓ w 3 11 , x 2 ∂w 3 ∂w 3 12 12 w 2 22 z 2 2 → a 2 ∂ℓ 1 , ∂ℓ 2 , ∂ℓ 2 ∂b 2 ∂b 2 ∂b 3 b 2 1 2 ∂ℓ ∂ℓ ∂ℓ Would suffice to compute 1 , 1 , ∂z 3 ∂z 2 ∂z 2 2 12

Computing Gradients on Toy Example Let us compute the following: a 3 1 − y ∂ℓ 1 = − y 1 − y 1. 1 + 1 = ∂a 3 a 3 1 − a 3 a 3 1 (1 − a 3 1 ) ∂a 3 1 = a 3 1 · (1 − a 3 2. 1 ) ∂z 3 ∂z 3 ∂ a 2 = [ w 3 11 , w 3 3. 1 12 ] � � 1 − tanh 2 ( z 2 1 ) 0 ∂ a 2 4. ∂ z 2 = 1 − tanh 2 ( z 2 0 2 ) Then we can calculate ∂a 3 1 = a 3 ∂ℓ ∂ℓ 1 = 1 · 1 − y 1 ∂z 3 ∂a 3 ∂z 3 � � ∂a 3 ∂z 3 ∂z 3 ∂ a 2 · ∂ a 2 ∂ a 2 · ∂ a 2 ∂ℓ ∂ℓ ∂ℓ ∂ z 2 = 1 · · ∂ z 2 = 1 · 1 1 1 ∂a 3 ∂z 3 ∂z 3 ∂ z 2 1 13

a L loss ℓ Each layer consists of a linear function layer L and non-linear activation ∂ℓ ∂ z L Layer l consists of the following: layer L − 1 z l = W l a l − 1 + b l a l = f l ( z l ) layer l where f l is the non-linear activation in layer l . ∂ℓ ∂ z l layer l − 1 If there are n l units in layer l , then W l is n l × n l − 1 layer 2 Backward pass to compute derivatives ∂ℓ input x a 1 ∂ z 2 14

loss ℓ a L layer L Forward Equations (1) a 1 = x (input) layer L − 1 (2) z l = W l a l − 1 + b l (3) a l = f l ( z l ) layer l (4) ℓ ( a L , y ) layer l − 1 layer 2 input x a 1 15

Output Layer a L z L = W L a L − 1 + b L a L = f L ( z L ) layer L ( z L → a L ) ℓ ( y, a L ) Loss: ∂ a L · ∂ a L ∂ℓ ∂ℓ ∂ℓ a L − 1 ∂ z L = ∂ z L ∂ z L ∂ℓ ∂ℓ If there are n L (output) units in layer L , then ∂ a L and ∂ z L are row vectors with n L elements and ∂ a L ∂ z L is the n L × n L Jacobian matrix:   ∂a L ∂a L ∂a L · · · 1 1 1  ∂z L ∂z L ∂z L  nL 1 2   ∂a L ∂a L ∂a L   2 2 · · · 2 ∂ a L   ∂z L ∂z L ∂z L  1 2 nL  ∂ z L =  . . .  ... . . .   . . .     ∂a L ∂a L ∂a L nL nL nL · · · ∂z L ∂z L ∂z L nL 1 2 If f L is applied element-wise, e.g., sigmoid then this matrix is diagonal 16

Back Propagation a l (the inputs into layer l + 1 ) z l +1 = W l +1 a l + b l +1 ( w l +1 j,k weight on connection from k th ∂ℓ unit in layer l to j th unit in layer l + 1 ) a l ∂ z l +1 a l = f ( z l ) ( f is a non-linearity) layer l ( z l → a l ) ∂ℓ (derivative passed from layer above) ∂ z l +1 ∂ z l +1 · ∂ z l +1 ∂ℓ ∂ℓ ∂ z l = ∂ℓ a l − 1 ∂ z l ∂ z l ∂ z l +1 · ∂ z l +1 · ∂ a l ∂ℓ = ∂ a l ∂ z l ∂ z l +1 · W l +1 · ∂ a l ∂ℓ = ∂ z l 17

Gradients with respect to parameters ∂ℓ a l ∂ z l +1 z l = W l a l − 1 + b l ( w l j,k weight on connection from k th layer l ( z l → a l ) unit in layer l - 1 to j th unit in layer l ) ∂ℓ (obtained using backpropagation) ∂ z l ∂ℓ a l − 1 ∂ z l ∂z l i · a l − 1 Consider ∂ℓ ∂ℓ ∂ℓ ij = i · ij = i ∂w l ∂z l ∂w l ∂z l j ∂ℓ ∂ℓ i = ∂b l ∂z l i � � T a l − 1 ∂ℓ ∂ℓ More succinctly, we may write: ∂ W l = ∂ z l ∂ℓ ∂ℓ ∂ b l = ∂ z l 18

Forward Equations a L loss ℓ (1) a 1 = x (input) (2) z l = W l a l − 1 + b l layer L ∂ℓ (3) a l = f l ( z l ) ∂ z L layer L − 1 (4) ℓ ( a L , y ) layer l Back-propagation Equations ∂ℓ ∂ z l ∂ a L · ∂ a L layer l − 1 ∂ℓ ∂ℓ (1) Compute ∂ z L = ∂ z L ∂ z l +1 · W l +1 · ∂ a l ∂ℓ ∂ℓ (2) ∂ z l = ∂ z l layer 2 � � T ∂ℓ a l − 1 ∂ℓ (3) ∂ W l = ∂ z l ∂ℓ ∂ℓ ∂ℓ input x a 1 (4) ∂ b l = ∂ z 2 ∂ z l 19

Computational Questions What is the running time to compute the gradient for a single data point? ◮ As many matrix multiplications as there are fully connected layers ◮ Performed twice during forward and backward pass What is the space requirement? ◮ Need to store vectors a l , z l , and ∂ℓ ∂ z l for each layer Can we process multiple examples together? ◮ Yes, if we minibatch, we perform tensor operations ◮ Make sure that all parameters fit in GPU memory 20

Training Deep Neural Networks ◮ Back-propagation gives gradient ◮ Stochastic gradient descent is the method of choice ◮ Regularisation ◮ How do we add ℓ 1 or ℓ 2 regularisation? ◮ Don’t regularise bias terms ◮ How about convergence? ◮ What did we learn in the last 10 years, that we didn’t know in the 80s? 21

Training Feedforward Deep Networks Layer 1 Layer 2 Layer 3 Layer 4 (Input) (Hidden) (Hidden) (Output) Why do we get non-convex optimisation problem? All units in a layer are symmetric, hence invariant to permutations 22

A toy example 1 z 2 a 2 1 1 b 2 1 a 2 Target is y = 1 − x Σ 1 2 w 2 1 x ∈ {− 1 , 1 } Squared Loss Function ℓ ( a 2 1 , y ) = ( a 2 1 − y ) 2 1 ∂a 2 ∂ℓ 1 = 2( a 2 1 = 2( a 2 1 − y ) σ ′ ( z 2 1 − y ) · 1 1 ) 0 . 8 ∂z 2 ∂z 2 0 . 6 If x = − 1 , w 2 1 ≈ 5 , b 2 1 ≈ 0 , then σ ′ ( z 2 1 ) ≈ 0 0 . 4 Cross-Entropy Loss Function 0 . 2 z 2 1 ℓ ( a 2 1 , y ) = − ( y log a 2 1 + (1 − y ) log(1 − a 2 0 1 )) − 8 − 6 − 4 − 2 0 2 4 6 8 a 2 ∂a 2 1 − y ∂ℓ 1 = ( a 2 1 = 1 ) · 1 − y ) 1 ∂z 2 a 2 1 (1 − a 2 ∂z 2 23

Propagating Gradients Backwards 1 1 1 b 2 b 3 b 4 1 1 1 x = a 1 w 2 w 3 w 4 a 4 Σ Σ Σ 1 1 1 1 1 ◮ Cross entropy loss: ℓ ( a 4 1 , y ) = − ( y log a 4 1 + (1 − y ) log(1 − a 4 1 )) 1 = a 4 ∂ℓ 1 − y ◮ ∂z 4 ∂z 4 ∂a 3 1 = ( a 4 1 − y ) · w 4 1 · σ ′ ( z 3 ∂ℓ ∂ℓ 1 = 1 · 1 1 · 1 1 ) ◮ ∂z 3 ∂z 4 ∂a 3 ∂z 3 ∂z 3 ∂a 2 ∂ℓ ∂ℓ 1 = ( a 4 1 − y ) · w 4 1 · σ ′ ( z 3 1 ) · w 3 1 · σ ′ ( z 2 1 = 1 · 1 1 · 1 1 ) ◮ ∂z 2 ∂z 3 ∂a 2 ∂z 3 ◮ Saturation: When the output of an artificial neuron is in the ‘flat’ part, e.g., where σ ′ ( z ) ≈ 0 for sigmoid ◮ Vanishing Gradient Problem: Multiplying several σ ′ ( z l i ) together makes the gradient ≈ 0 , when we have a large number of layers ◮ For example, when using sigmoid activation, σ ′ ( z ) ∈ [0 , 1 / 4] 24

Avoiding Saturation Rectifier Use rectified linear units 3 Rectifier non-linearity f ( z ) = max(0 , z ) 2 Rectified Linear Unit (ReLU) max(0 , a · w + b ) 1 You can also use f ( z ) = | z | Other variants 0 leaky ReLUs, parametric ReLUs − 3 − 2 − 1 0 1 2 3 25

Initialising Weights and Biases Initialising is important when minimising non-convex functions. We may get very different results depending on where we start the optimisation. Suppose we were using a sigmoid unit , how would you initialise the weights? ◮ Suppose z = � D i =1 w i a i ◮ E.g., choose w i ∈ [ − 1 1 D , D ] at random √ √ What if it were a ReLU unit? ◮ You can initialise similarly How about the biases? ◮ For sigmoid, can use 0 or a random value around 0 ◮ For ReLU, should use a small positive constant 26

Machine Learning - MT 2016 11 & 12. Neural Networks Varun - PowerPoint PPT Presentation

Machine Learning - MT 2016 11 & 12. Neural Networks Varun Kanade University of Oxford November 14 & 16, 2016 Announcements Problem Sheet 3 due this Friday by noon Practical 2 this week: Compare NBC & LR (Optional)

Learning Neural Networks Learning Neural Networks Neural Networks can represent complex Neural

Neural Networks Neural networks arise from attempts to model Neural Networks human/animal

Neural Networks and Handwriting Recognition Background Neural Networks Neural Network Steven

Neural Machine Translation Gongbo Tang 8 October 2018 Outline Neural Machine Translation 1

Machine Learning 2 DS 4420 - Spring 2020 Neural Networks & backprop Byron C Wallace Neural

CS 6316 Machine Learning Neural Networks Yangfeng Ji Department of Computer Science University

Neural Networks and their Application to Go Neural Networks Learning Blackjack Theory Training

Introduction to Neural Machine Translation Gongbo Tang 16 September 2019 Outline Why Neural

Sequential Data with Neural Networks Recurrent Neural Networks Sequential input / output Greg

Introduction to Artificial Intelligence Neural Networks - Deep Learning for NLP Janyl Jumadinova

Neural Networks for Machine Learning Lecture 2a An overview of the main types of neural network

Neural Information Retrieval Wassila Lalouani 1 Plan Neural network architectures Neural

(Very) Brief Introduction to Neural Networks IITP-03 Algorithms for NLP 1 / 31 Learning

CHAPTER VI VI CHAPTER Learning in Feedforward Feedforward Learning in Neural Networks Neural

CHAPTER II I CHAPTER I Recurrent Neural Networks Recurrent Neural Networks CHAPTER II : I :

CHAPTER II III I CHAPTER Neural Networks as Neural Networks as Associative Memory

Back-Propagation 16-385 Computer Vision (Kris Kitani) Carnegie Mellon University back to the

On Stable Marriages and Greedy Matchings Fredrik Manne University of Bergen, Norway Md. Naim,

What Matchings Can be Stable? Refutability in Matching Theory Federico Echenique California

Internet Engineering: VoiceXML Ali Kamandi Sharif University of Technology Fall 2007

CS6501: T opics in Learning and Game Theory (Fall 2019) Prediction Markets and Scoring Rules

http://cs224w.stanford.edu Main question today: Given a network with labels on some nodes, how

Methodical Approximate Hardware Design and Reuse Amir Yazdanbakhsh

Optimizing Collective Communication on Multicores Rajesh Nishtala 1 Katherine Yelick 1 1

Machine Learning - MT 2016 11 & 12. Neural Networks Varun - PowerPoint PPT Presentation

Machine Learning - MT 2016 11 & 12. Neural Networks Varun Kanade University of Oxford November 14 & 16, 2016 Announcements Problem Sheet 3 due this Friday by noon Practical 2 this week: Compare NBC & LR (Optional)

Learning Neural Networks Learning Neural Networks Neural Networks can represent complex Neural

Neural Networks Neural networks arise from attempts to model Neural Networks human/animal

Neural Networks and Handwriting Recognition Background Neural Networks Neural Network Steven

Neural Machine Translation Gongbo Tang 8 October 2018 Outline Neural Machine Translation 1

Machine Learning 2 DS 4420 - Spring 2020 Neural Networks &amp; backprop Byron C Wallace Neural

CS 6316 Machine Learning Neural Networks Yangfeng Ji Department of Computer Science University

Neural Networks and their Application to Go Neural Networks Learning Blackjack Theory Training

Introduction to Neural Machine Translation Gongbo Tang 16 September 2019 Outline Why Neural

Sequential Data with Neural Networks Recurrent Neural Networks Sequential input / output Greg

Introduction to Artificial Intelligence Neural Networks - Deep Learning for NLP Janyl Jumadinova

Neural Networks for Machine Learning Lecture 2a An overview of the main types of neural network

Neural Information Retrieval Wassila Lalouani 1 Plan Neural network architectures Neural

(Very) Brief Introduction to Neural Networks IITP-03 Algorithms for NLP 1 / 31 Learning

CHAPTER VI VI CHAPTER Learning in Feedforward Feedforward Learning in Neural Networks Neural

CHAPTER II I CHAPTER I Recurrent Neural Networks Recurrent Neural Networks CHAPTER II : I :

CHAPTER II III I CHAPTER Neural Networks as Neural Networks as Associative Memory

Back-Propagation 16-385 Computer Vision (Kris Kitani) Carnegie Mellon University back to the

On Stable Marriages and Greedy Matchings Fredrik Manne University of Bergen, Norway Md. Naim,

What Matchings Can be Stable? Refutability in Matching Theory Federico Echenique California

Internet Engineering: VoiceXML Ali Kamandi Sharif University of Technology Fall 2007

CS6501: T opics in Learning and Game Theory (Fall 2019) Prediction Markets and Scoring Rules

http://cs224w.stanford.edu Main question today: Given a network with labels on some nodes, how

Methodical Approximate Hardware Design and Reuse Amir Yazdanbakhsh

Optimizing Collective Communication on Multicores Rajesh Nishtala 1 Katherine Yelick 1 1

Machine Learning 2 DS 4420 - Spring 2020 Neural Networks & backprop Byron C Wallace Neural