Machine Learning - MT 2016
Lectures 11 & 12: Neural Networks
Varun Kanade
University of Oxford
November 14 & 16, 2016
Announcements
◮ Problem Sheet 3 due this Friday by noon
◮ Practical 2 this week: compare NBC & LR
◮ (Optional) Reading a paper
Outline
Today, we'll study feedforward neural networks
◮ Multi-layer perceptrons
◮ Classification and regression settings
◮ Backpropagation to compute gradients
◮ Brief introduction to tensorflow and MNIST
Artificial Neuron: Logistic Regression

[Figure: a single unit with inputs x_1, x_2, weights w_1, w_2 and bias b, feeding a linear function Σ followed by a non-linearity, producing y = Pr(y = 1 | x, w, b)]

◮ A unit in a neural network computes a linear function of its input, which is then composed with a non-linear activation function
◮ For logistic regression, the non-linear activation function is the sigmoid
      σ(z) = 1 / (1 + e^{−z})
◮ The separating surface is linear
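As a concrete sketch, a single sigmoid unit is a few lines of numpy; the input and weight values below are made up purely for illustration:

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: sigma(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def unit_forward(x, w, b):
    """One artificial neuron: linear function w.x + b composed with sigmoid."""
    z = np.dot(w, x) + b      # linear part
    return sigmoid(z)         # non-linear activation

# With zero weights and bias the unit is maximally uncertain: output 0.5
x = np.array([1.0, -2.0])
w = np.array([0.0, 0.0])
print(unit_forward(x, w, 0.0))   # 0.5
```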
Multilayer Perceptron (MLP): Classification

[Figure: a two-layer MLP with inputs x_1, x_2, two hidden units with weights w^2_{jk} and biases b^2_j, and an output unit with weights w^3_{11}, w^3_{12} and bias b^3_1, producing y = Pr(y = 1 | x, W, b)]
Multilayer Perceptron (MLP): Regression

[Figure: the same two-layer MLP architecture, now producing y = E[y | x, W, b]]
A Toy Example
Logistic Regression Fails Badly
Solve using MLP

[Figure: the two-layer network with hidden pre-activations z^2_1, z^2_2 and activations a^2_1, a^2_2, and output z^3_1 → a^3_1, producing y = Pr(y = 1 | x, {W^l}, {b^l})]

Let us use the notation:
    a^1 = z^1 = x
    z^2 = W^2 a^1 + b^2
    a^2 = tanh(z^2)
    z^3 = W^3 a^2 + b^3
    y = a^3 = σ(z^3)
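This forward pass translates directly into numpy; the weights below are made up for a 2-2-1 network (the weights that actually solve the toy problem are learned, not these):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W2, b2, W3, b3):
    """Forward pass in the slide's notation:
    a1 = z1 = x, z2 = W2 a1 + b2, a2 = tanh(z2),
    z3 = W3 a2 + b3, y = a3 = sigmoid(z3)."""
    a1 = x
    z2 = W2 @ a1 + b2
    a2 = np.tanh(z2)
    z3 = W3 @ a2 + b3
    return sigmoid(z3)

# hypothetical weights for a 2-2-1 network
W2 = np.array([[1.0, -1.0], [-1.0, 1.0]])
b2 = np.array([0.0, 0.0])
W3 = np.array([[2.0, 2.0]])
b3 = np.array([-1.0])
y_hat = mlp_forward(np.array([0.5, -0.5]), W2, b2, W3, b3)
print(y_hat)
```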
Scatterplot Comparison: (x_1, x_2) vs (a^2_1, a^2_2)
Decision Boundary of the Neural Net
Feedforward Neural Networks

[Figure: a fully connected feedforward network with Layer 1 (Input), Layer 2 (Hidden), Layer 3 (Hidden), Layer 4 (Output)]
Computing Gradients on Toy Example

[Figure: the toy network, with hidden units z^2_1 → a^2_1, z^2_2 → a^2_2 and output unit z^3_1 → a^3_1 feeding the loss ℓ(y, a^3_1)]

Want the derivatives:
    ∂ℓ/∂w^2_{11}, ∂ℓ/∂w^2_{12}, ∂ℓ/∂w^2_{21}, ∂ℓ/∂w^2_{22}
    ∂ℓ/∂w^3_{11}, ∂ℓ/∂w^3_{12}
    ∂ℓ/∂b^2_1, ∂ℓ/∂b^2_2, ∂ℓ/∂b^3_1

Would suffice to compute ∂ℓ/∂z^3_1, ∂ℓ/∂z^2_1, ∂ℓ/∂z^2_2
Computing Gradients on Toy Example

Let us compute the following:
1. ∂ℓ/∂a^3_1 = −y/a^3_1 + (1 − y)/(1 − a^3_1) = (a^3_1 − y) / (a^3_1 (1 − a^3_1))
2. ∂a^3_1/∂z^3_1 = a^3_1 (1 − a^3_1)
3. ∂z^3_1/∂a^2 = [w^3_{11}, w^3_{12}]
4. ∂a^2/∂z^2 = [[1 − tanh^2(z^2_1), 0], [0, 1 − tanh^2(z^2_2)]]

Then we can calculate:
    ∂ℓ/∂z^3_1 = ∂ℓ/∂a^3_1 · ∂a^3_1/∂z^3_1 = a^3_1 − y
    ∂ℓ/∂z^2 = ∂ℓ/∂z^3_1 · ∂z^3_1/∂a^2 · ∂a^2/∂z^2
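These formulas can be sanity-checked numerically. The sketch below picks an arbitrary z^3_1 and y, and compares the derived result ∂ℓ/∂z^3_1 = a^3_1 − y against a central finite difference of the cross-entropy loss:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(z3, y):
    """Cross-entropy loss viewed as a function of the pre-activation z3."""
    a3 = sigmoid(z3)
    return -(y * np.log(a3) + (1 - y) * np.log(1 - a3))

z3, y = 0.3, 1.0
analytic = sigmoid(z3) - y     # the slide's result: dl/dz3 = a3 - y

eps = 1e-6                     # central finite difference
numeric = (loss(z3 + eps, y) - loss(z3 - eps, y)) / (2 * eps)
print(abs(analytic - numeric) < 1e-8)   # True
```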
[Figure: a stack of layers from input x = a^1 through layers 2, …, l − 1, l, …, L − 1, L to the loss ℓ, with derivatives ∂ℓ/∂z^l flowing back down]

Each layer consists of a linear function and a non-linear activation. Layer l consists of the following:
    z^l = W^l a^{l−1} + b^l
    a^l = f^l(z^l)
where f^l is the non-linear activation in layer l.

If there are n_l units in layer l, then W^l is n_l × n_{l−1}.

The backward pass computes the derivatives ∂ℓ/∂z^l for each layer.
Forward Equations

[Figure: the layer stack from input x = a^1 up to a^L and the loss ℓ]

(1) a^1 = x (input)
(2) z^l = W^l a^{l−1} + b^l
(3) a^l = f^l(z^l)
(4) ℓ(a^L, y)
Output Layer

[Figure: layer L computes z^L = W^L a^{L−1} + b^L and a^L = f^L(z^L); loss ℓ(y, a^L)]

Loss: ∂ℓ/∂z^L = ∂ℓ/∂a^L · ∂a^L/∂z^L

If there are n_L (output) units in layer L, then ∂ℓ/∂a^L and ∂ℓ/∂z^L are row vectors with n_L elements, and ∂a^L/∂z^L is the n_L × n_L Jacobian matrix:

    ∂a^L/∂z^L = [ ∂a^L_i / ∂z^L_j ]  for i, j = 1, …, n_L

If f^L is applied element-wise, e.g., sigmoid, then this matrix is diagonal.
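For an element-wise sigmoid the Jacobian can be built explicitly. The small sketch below (with arbitrary z^L values) shows the diagonal structure: unit i's output depends only on z^L_i, so all off-diagonal entries are zero:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

zL = np.array([0.2, -1.0, 0.7])
aL = sigmoid(zL)

# Jacobian of an element-wise sigmoid: diagonal with sigma'(z) = a(1-a) entries
jac = np.diag(aL * (1 - aL))

print(jac[0, 1], jac[1, 0])   # 0.0 0.0
```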
Back Propagation

a^l are the inputs into layer l + 1:
    z^{l+1} = W^{l+1} a^l + b^{l+1}
(w^{l+1}_{jk} is the weight on the connection from the kth unit in layer l to the jth unit in layer l + 1)

    a^l = f(z^l)   (f is a non-linearity)

∂ℓ/∂z^{l+1} is the derivative passed down from the layer above. Then:
    ∂ℓ/∂z^l = ∂ℓ/∂z^{l+1} · ∂z^{l+1}/∂z^l
            = ∂ℓ/∂z^{l+1} · ∂z^{l+1}/∂a^l · ∂a^l/∂z^l
            = ∂ℓ/∂z^{l+1} · W^{l+1} · ∂a^l/∂z^l
Gradients with Respect to Parameters

    z^l = W^l a^{l−1} + b^l
(w^l_{jk} is the weight on the connection from the kth unit in layer l − 1 to the jth unit in layer l)

∂ℓ/∂z^l is obtained using backpropagation. Consider:
    ∂ℓ/∂w^l_{ij} = ∂ℓ/∂z^l_i · ∂z^l_i/∂w^l_{ij} = ∂ℓ/∂z^l_i · a^{l−1}_j
    ∂ℓ/∂b^l_i = ∂ℓ/∂z^l_i

More succinctly, we may write:
    ∂ℓ/∂W^l = (∂ℓ/∂z^l)^T (a^{l−1})^T
    ∂ℓ/∂b^l = ∂ℓ/∂z^l
Forward Equations

(1) a^1 = x (input)
(2) z^l = W^l a^{l−1} + b^l
(3) a^l = f^l(z^l)
(4) ℓ(a^L, y)

Back-propagation Equations

(1) Compute ∂ℓ/∂z^L = ∂ℓ/∂a^L · ∂a^L/∂z^L
(2) ∂ℓ/∂z^l = ∂ℓ/∂z^{l+1} · W^{l+1} · ∂a^l/∂z^l
(3) ∂ℓ/∂W^l = (∂ℓ/∂z^l)^T (a^{l−1})^T
(4) ∂ℓ/∂b^l = ∂ℓ/∂z^l
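The four back-propagation equations assemble into a complete sketch. The implementation below assumes tanh hidden layers and a sigmoid output with cross-entropy loss (so ∂ℓ/∂z^L = a^L − y); the demo network's sizes and weights are made up:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, Ws, bs):
    """Backpropagation following the slide's equations.
    Ws/bs list the parameters of layers 2, ..., L in order."""
    # Forward pass: store a^l and z^l for every layer
    a, zs = [x], []
    for l, (W, b) in enumerate(zip(Ws, bs)):
        z = W @ a[-1] + b
        zs.append(z)
        a.append(sigmoid(z) if l == len(Ws) - 1 else np.tanh(z))
    # (1) dl/dz^L for sigmoid output + cross-entropy: a^L - y
    delta = a[-1] - y
    dWs, dbs = [], []
    for l in range(len(Ws) - 1, -1, -1):
        # (3) dl/dW^l = (dl/dz^l)^T (a^{l-1})^T -- an outer product
        dWs.insert(0, np.outer(delta, a[l]))
        # (4) dl/db^l = dl/dz^l
        dbs.insert(0, delta.copy())
        if l > 0:
            # (2) dl/dz^l = (dl/dz^{l+1} W^{l+1}) * f'(z^l), with tanh' = 1 - tanh^2
            delta = (Ws[l].T @ delta) * (1 - np.tanh(zs[l - 1]) ** 2)
    return dWs, dbs

# hypothetical small network: 2 inputs, 2 tanh hidden units, 1 sigmoid output
Ws = [np.array([[0.5, -0.3], [0.2, 0.1]]), np.array([[0.7, -0.4]])]
bs = [np.array([0.1, -0.2]), np.array([0.05])]
x, y = np.array([0.3, -0.6]), np.array([1.0])
dWs, dbs = backprop(x, y, Ws, bs)
```

Gradient shapes match the parameter shapes, so a gradient-descent step is just `W -= eta * dW` per layer.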
Computational Questions

What is the running time to compute the gradient for a single data point?
◮ As many matrix multiplications as there are fully connected layers
◮ Performed twice, once in the forward pass and once in the backward pass

What is the space requirement?
◮ Need to store the vectors a^l, z^l, and ∂ℓ/∂z^l for each layer

Can we process multiple examples together?
◮ Yes; if we minibatch, we perform tensor operations
◮ Make sure that all parameters fit in GPU memory
Training Deep Neural Networks

◮ Back-propagation gives the gradient
◮ Stochastic gradient descent is the method of choice
◮ Regularisation
  ◮ How do we add ℓ1 or ℓ2 regularisation?
  ◮ Don't regularise the bias terms
◮ How about convergence?
◮ What did we learn in the last 10 years that we didn't know in the 80s?
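For instance, ℓ2 regularisation only changes the weight gradients, leaving the bias gradients untouched. A minimal sketch (the function name and λ value are illustrative):

```python
import numpy as np

def add_l2_penalty(dW, db, W, lam):
    """l2 regularisation adds lam * W to the weight gradient only;
    the bias gradient is returned unchanged (biases are not regularised)."""
    return dW + lam * W, db

W = np.array([[1.0, -2.0]])
dW = np.zeros_like(W)      # pretend the data gradient is zero
db = np.zeros(1)
dW_reg, db_reg = add_l2_penalty(dW, db, W, lam=0.1)
print(dW_reg)   # [[ 0.1 -0.2]]
print(db_reg)   # [0.]
```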
Training Feedforward Deep Networks

[Figure: a fully connected network with Layer 1 (Input), Layer 2 (Hidden), Layer 3 (Hidden), Layer 4 (Output)]

Why do we get a non-convex optimisation problem?
All units in a layer are symmetric, hence the objective is invariant to permutations of the units.
A Toy Example

[Figure: a single sigmoid unit with input x, weight w^2_1 and bias b^2_1, next to a plot of the sigmoid σ(z)]

Target is y = (1 − x)/2, with x ∈ {−1, 1}

Squared Loss Function
    ℓ(a^2_1, y) = (a^2_1 − y)^2
    ∂ℓ/∂z^2_1 = 2(a^2_1 − y) · ∂a^2_1/∂z^2_1 = 2(a^2_1 − y) σ'(z^2_1)
If x = −1, w^2_1 ≈ 5, b^2_1 ≈ 0, then σ'(z^2_1) ≈ 0: the gradient is tiny even though the prediction is badly wrong.

Cross-Entropy Loss Function
    ℓ(a^2_1, y) = −(y log a^2_1 + (1 − y) log(1 − a^2_1))
    ∂ℓ/∂z^2_1 = (a^2_1 − y)/(a^2_1 (1 − a^2_1)) · ∂a^2_1/∂z^2_1 = a^2_1 − y
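The difference between the two losses at a saturated unit can be checked directly; this sketch plugs in the slide's values x = −1, w ≈ 5, b ≈ 0 with target y = 1:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Saturated unit from the slide: x = -1, w ~ 5, b ~ 0, so z = -5 and a ~ 0
x, w, b, y = -1.0, 5.0, 0.0, 1.0
z = w * x + b
a = sigmoid(z)

grad_sq = 2 * (a - y) * a * (1 - a)   # squared loss: carries a sigma'(z) factor
grad_ce = a - y                       # cross-entropy: sigma'(z) cancels

print(abs(grad_sq), abs(grad_ce))     # tiny vs close to 1
```

Under squared loss the gradient nearly vanishes exactly when the unit is most wrong, so learning stalls; cross-entropy keeps the gradient proportional to the error.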
Propagating Gradients Backwards

[Figure: a chain of three sigmoid units: x = a^1 → (w^2_1, b^2_1) → (w^3_1, b^3_1) → (w^4_1, b^4_1) → a^4_1]

◮ Cross-entropy loss: ℓ(a^4_1, y) = −(y log a^4_1 + (1 − y) log(1 − a^4_1))
◮ ∂ℓ/∂z^4_1 = a^4_1 − y
◮ ∂ℓ/∂z^3_1 = ∂ℓ/∂z^4_1 · ∂z^4_1/∂a^3_1 · ∂a^3_1/∂z^3_1 = (a^4_1 − y) · w^4_1 · σ'(z^3_1)
◮ ∂ℓ/∂z^2_1 = ∂ℓ/∂z^3_1 · ∂z^3_1/∂a^2_1 · ∂a^2_1/∂z^2_1 = (a^4_1 − y) · w^4_1 · σ'(z^3_1) · w^3_1 · σ'(z^2_1)
◮ Saturation: when the output of an artificial neuron is in the 'flat' part of the activation, e.g., where σ'(z) ≈ 0 for the sigmoid
◮ Vanishing Gradient Problem: multiplying several σ'(z^l_i) factors together makes the gradient ≈ 0 when we have a large number of layers
◮ For example, when using the sigmoid activation, σ'(z) ∈ [0, 1/4]
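The shrinking product can be simulated directly; the sketch below multiplies the best-case factor w · σ'(0) = 1/4 (unit weights, σ' at its maximum) across 20 layers:

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1 - s)

# Each extra sigmoid layer multiplies the gradient by w * sigma'(z);
# sigma'(z) <= 1/4, so with moderate weights the product shrinks fast.
w, z = 1.0, 0.0       # sigma'(0) = 1/4 is the best case for the sigmoid
grad = 1.0
for layer in range(20):
    grad *= w * sigmoid_prime(z)

print(grad)           # 0.25 ** 20, about 9.1e-13
```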
Avoiding Saturation

[Figure: plot of the rectifier non-linearity f(z) = max(0, z)]

Use rectified linear units:
◮ Rectifier non-linearity f(z) = max(0, z)
◮ Rectified Linear Unit (ReLU): max(0, a · w + b)
◮ You can also use f(z) = |z|
◮ Other variants: leaky ReLUs, parametric ReLUs
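These activations are one-liners in numpy; the leaky slope α = 0.01 below is a common but arbitrary choice:

```python
import numpy as np

def relu(z):
    """Rectifier: f(z) = max(0, z)."""
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: small slope alpha for negative inputs,
    so the gradient is never exactly zero."""
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))          # [0. 0. 3.]
print(leaky_relu(z))    # [-0.02  0.  3.]
```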
Initialising Weights and Biases

Initialisation is important when minimising non-convex functions: we may get very different results depending on where we start the optimisation.

Suppose we were using a sigmoid unit; how would you initialise the weights?
◮ Suppose z = Σ_{i=1}^D w_i a_i
◮ E.g., choose w_i ∈ [−1/√D, 1/√D] at random

What if it were a ReLU unit?
◮ You can initialise similarly

How about the biases?
◮ For sigmoid, can use 0 or a random value around 0
◮ For ReLU, should use a small positive constant
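This recipe can be sketched as a small initialiser; the function name and the 0.01 ReLU bias constant are illustrative choices:

```python
import numpy as np

def init_layer(n_out, n_in, activation="sigmoid", rng=None):
    """Initialise one layer's parameters following the slide's recipe:
    weights uniform in [-1/sqrt(D), 1/sqrt(D)] with D = n_in;
    biases 0 for sigmoid, a small positive constant for ReLU."""
    if rng is None:
        rng = np.random.default_rng()
    limit = 1.0 / np.sqrt(n_in)
    W = rng.uniform(-limit, limit, size=(n_out, n_in))
    b = np.zeros(n_out) if activation == "sigmoid" else 0.01 * np.ones(n_out)
    return W, b

W, b = init_layer(3, 100, activation="relu", rng=np.random.default_rng(0))
print(W.shape, b[0])   # (3, 100) 0.01
```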