Applied Machine Learning
Gradient Computation & Automatic Differentiation
Siamak Ravanbakhsh
COMP 551 (Winter 2020)
1
Learning objectives

- using the chain rule to calculate the gradients
- automatic differentiation
  - forward mode
  - reverse mode (backpropagation)
2
Optimization landscape

Training a two-layer MLP model f(x; W, V) = g(W h(V x)) by minimizing

min_{W,V} ∑_n L(y^(n), f(x^(n); W, V))        (the loss function depends on the task)

is a non-convex optimization problem with many critical points (points where the gradient is zero).

General beliefs (supported by empirical and theoretical results in special settings):

- there are many more saddle points than local minima; saddle points are not stable, and SGD can escape them
- the number of local minima increases at lower values of the cost, therefore most local minima are close to the global optima
- there are exponentially many global optima: given one global optimum we can permute the hidden units in each layer; for symmetric activations, negate the input/output of a unit; for rectifiers, rescale the input/output of a unit

image credit: https://www.offconvex.org
3
Strategy

Use gradient descent methods (covered earlier in the course).

For f : ℝ → ℝ we have the derivative  df(w)/dw ∈ ℝ.

For f : ℝ^D → ℝ the gradient is the vector of all partial derivatives:

∇_w f(w) = [ ∂f(w)/∂w_1, …, ∂f(w)/∂w_D ]^⊤ ∈ ℝ^D

For f : ℝ^D → ℝ^M the Jacobian is the matrix of all partial derivatives:

J = [ ∂f_1(w)/∂w_1  …  ∂f_1(w)/∂w_D
      ⋮             ⋱   ⋮
      ∂f_M(w)/∂w_1  …  ∂f_M(w)/∂w_D ]  ∈ ℝ^{M×D}

(row c is ∇_w f_c(w)^⊤, and column d is ∂f(w)/∂w_d)

For all three cases we may simply write ∂f(w)/∂w, where M and D will be clear from the context. Note that we also use J for the cost function; which is meant will be clear from the context.

What if W is a matrix? We assume it is reshaped into a vector for these calculations.
4 . 1
Chain rule

Consider the composition f : x ↦ z and h : z ↦ y.

For scalars x, y, z ∈ ℝ:

dy/dx = (dy/dz)(dz/dx)

that is, the speed of change in y as we change x is the speed of change in y as we change z, times the speed of change in z as we change x.

More generally, where x ∈ ℝ^D, z ∈ ℝ^M, y ∈ ℝ^C:

∂y_c/∂x_d = ∑_{m=1}^{M} (∂y_c/∂z_m)(∂z_m/∂x_d)

We are looking at all the "paths" through which a change in x_d changes y_c, and we add their contributions.
4 . 2
In matrix form:

∂y/∂x = (∂y/∂z)(∂z/∂x)

where ∂y/∂z is the C×M Jacobian, ∂z/∂x is the M×D Jacobian, and their product ∂y/∂x is the C×D Jacobian.
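To make the matrix form concrete, here is a small numpy check of the Jacobian chain rule against finite differences (the map, shapes, and names are illustrative, not from the slides):

import numpy as np

D, M, C = 4, 3, 2
rng = np.random.default_rng(0)
A = rng.standard_normal((M, D))   # z = tanh(A x), so dz/dx = diag(1 - z^2) A  (M x D)
B = rng.standard_normal((C, M))   # y = B z,       so dy/dz = B                (C x M)

def fwd(x):
    z = np.tanh(A @ x)
    return B @ z, z

x = rng.standard_normal(D)
y, z = fwd(x)
J_chain = B @ (np.diag(1 - z**2) @ A)        # (C x M)(M x D) = C x D

eps = 1e-6                                    # finite-difference Jacobian for comparison
J_fd = np.stack([(fwd(x + eps*e)[0] - y) / eps for e in np.eye(D)], axis=1)
print(np.allclose(J_chain, J_fd, atol=1e-4))  # True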
Setup

Suppose we have D inputs x_1, …, x_D, M hidden units z_1, …, z_M, and C outputs ŷ_1, …, ŷ_C. For simplicity we drop the bias terms.

[figure: two-layer network; the inputs x_1, …, x_D feed the hidden units z_1, …, z_M through the weights V, and the hidden units feed the outputs ŷ_1, …, ŷ_C through the weights W]
5 . 1
Model:

ŷ = g(W h(V x))

Cost function we want to minimize:

J(W, V) = ∑_n L(y^(n), g(W h(V x^(n))))

We need the gradients with respect to W and V, that is ∂J/∂W and ∂J/∂V. It is simpler to write these for one instance (n), so we will calculate ∂L/∂W and ∂L/∂V and recover

∂J/∂W = ∑_{n=1}^{N} ∂/∂W L(y^(n), ŷ^(n))        ∂J/∂V = ∑_{n=1}^{N} ∂/∂V L(y^(n), ŷ^(n))
5 . 2
Using the chain rule

Define the pre-activations and activations for one instance:

q_m = ∑_{d=1}^{D} V_{m,d} x_d,    z_m = h(q_m),    u_c = ∑_{m=1}^{M} W_{c,m} z_m,    ŷ_c = g(u_c),    loss L(y, ŷ)

For the output weights W:

∂L/∂W_{c,m} = (∂L/∂ŷ_c) (∂ŷ_c/∂u_c) (∂u_c/∂W_{c,m}) = (∂L/∂ŷ_c) (∂ŷ_c/∂u_c) z_m

where ∂L/∂ŷ_c depends on the loss function, and ∂ŷ_c/∂u_c depends on the activation function.

Similarly for V:

∂L/∂V_{m,d} = ∑_c (∂L/∂ŷ_c) (∂ŷ_c/∂u_c) (∂u_c/∂z_m) (∂z_m/∂q_m) (∂q_m/∂V_{m,d})
            = ∑_c (∂L/∂ŷ_c) (∂ŷ_c/∂u_c) W_{c,m} (∂z_m/∂q_m) x_d

where the new factor ∂z_m/∂q_m depends on the middle layer activation.
Example: regression

ŷ = g(u) = u = Wz,    L(y, ŷ) = ½ ||y − ŷ||²

Substituting:

L(y, z) = ½ ||y − Wz||²

Taking the derivative:

∂L/∂W_{c,m} = (ŷ_c − y_c) z_m

We have seen this in the linear regression lecture.
5 . 3
Example: binary classification

ŷ = g(u) = (1 + e^{−u})^{−1}        (scalar output, C = 1, with u = ∑_m W_m z_m)

L(y, ŷ) = −( y log ŷ + (1 − y) log(1 − ŷ) )

Substituting u into L and simplifying (see the logistic regression lecture):

L(y, u) = y log(1 + e^{−u}) + (1 − y) log(1 + e^{u})

Taking the derivative:

∂L/∂W_m = (ŷ − y) z_m
5 . 4
Example: multiclass classification

ŷ = g(u) = softmax(u),    L(y, ŷ) = −∑_k y_k log ŷ_k        (C is the number of classes)

Substituting u into L and simplifying (see the logistic regression lecture):

L(y, u) = −y^⊤ u + log ∑_c e^{u_c}

Taking the derivative (with u_c = ∑_m W_{c,m} z_m):

∂L/∂W_{c,m} = (ŷ_c − y_c) z_m
5 . 5
Gradient with respect to V

∂L/∂V_{m,d} = ∑_c (∂L/∂ŷ_c) (∂ŷ_c/∂u_c) W_{c,m} (∂z_m/∂q_m) x_d

We already did the first part (the error at the output); the remaining factor ∂z_m/∂q_m depends on the middle layer activation h:

- logistic sigmoid:  ∂z_m/∂q_m = σ(q_m)(1 − σ(q_m))
- hyperbolic tan:    ∂z_m/∂q_m = 1 − tanh(q_m)²
- ReLU:              ∂z_m/∂q_m = 0 if q_m ≤ 0, and 1 if q_m > 0

Example (softmax output with logistic sigmoid hidden units):

∂J/∂V_{m,d} = ∑_n ∑_c (ŷ_c^(n) − y_c^(n)) W_{c,m} σ(q_m^(n)) (1 − σ(q_m^(n))) x_d^(n)
            = ∑_n ∑_c (ŷ_c^(n) − y_c^(n)) W_{c,m} z_m^(n) (1 − z_m^(n)) x_d^(n)

For biases we simply assume the corresponding input is constant: x^(n) = 1.

5 . 6
A common pattern

∂L/∂W_{c,m} = (∂L/∂u_c) · z_m        (error from above) × (input from below)
∂L/∂V_{m,d} = (∂L/∂q_m) · x_d        (error from above) × (input from below)

The gradient for a weight is always the error signal arriving from above at the unit it feeds, times the input it receives from below (a small numpy sketch follows).

5 . 7
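This pattern is easy to implement directly; a minimal numpy sketch for one instance (the names delta_u, delta_q are illustrative, standing for ∂L/∂u and ∂L/∂q):

import numpy as np

# one instance; shapes: x (D,), z (M,), delta_u = dL/du (C,), delta_q = dL/dq (M,)
def layer_grads(x, z, delta_u, delta_q):
    dW = np.outer(delta_u, z)   # C x M: error from above times input from below
    dV = np.outer(delta_q, x)   # M x D: the same pattern one layer down
    return dW, dV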
Example: Iris dataset (D = 2 features + 1 bias), M = 16 hidden units, C = 3 classes. The hidden units are logistic, z_m = σ(q_m), the output is ŷ = softmax(u), and the cost is the softmax cross-entropy:

J = ∑_{n=1}^{N} ( −y^(n)⊤ u^(n) + log ∑_c e^{u_c^(n)} )

def cost(X,  # N x D
         Y,  # N x C
         W,  # M x C
         V,  # D x M
         ):
    Q = np.dot(X, V)   # N x M  hidden pre-activations
    Z = logistic(Q)    # N x M  hidden activations
    U = np.dot(Z, W)   # N x C  output pre-activations
    Yh = softmax(U)    # N x C  predictions (not needed for the loss itself)
    nll = - np.mean(np.sum(U*Y, 1) - logsumexp(U))  # averages the per-instance loss over N
    return nll
6 . 1
Helper functions:

def logsumexp(Z,  # N x C
              ):
    Zmax = np.max(Z, axis=1)[:, None]   # N x 1, subtracted for numerical stability
    lse = Zmax + np.log(np.sum(np.exp(Z - Zmax), axis=1))[:, None]  # N x 1
    return lse[:, 0]  # N  (flattened so it broadcasts correctly against np.sum(U*Y, 1) in cost)

def softmax(u,  # N x C
            ):
    u_exp = np.exp(u - np.max(u, 1)[:, None])   # subtract the row max for stability
    return u_exp / np.sum(u_exp, axis=-1)[:, None]
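The code above also calls logistic, which is not defined on the slides; a minimal sketch consistent with the shapes used here:

import numpy as np

def logistic(q):
    # elementwise sigmoid; clipping avoids overflow in exp for large negative inputs
    return 1. / (1. + np.exp(-np.clip(q, -500, 500)))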
Using the formulas we derived, the gradients for this model are

∂L/∂W_{c,m} = (ŷ_c − y_c) z_m
∂L/∂V_{m,d} = ∑_c (ŷ_c − y_c) W_{c,m} z_m (1 − z_m) x_d

def gradients(X,  # N x D
              Y,  # N x C
              W,  # M x C
              V,  # D x M
              ):
    N, D = X.shape
    Z = logistic(np.dot(X, V))            # N x M  forward pass: hidden activations
    Yh = softmax(np.dot(Z, W))            # N x C  forward pass: predictions
    dY = Yh - Y                           # N x C  error at the output
    dW = np.dot(Z.T, dY)/N                # M x C
    dZ = np.dot(dY, W.T)                  # N x M  error backpropagated to the hidden layer
    dV = np.dot(X.T, dZ * Z * (1 - Z))/N  # D x M
    return dW, dV

Check your gradient function using a finite difference approximation that uses the cost function: scipy.optimize.check_grad.
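A sketch of how scipy.optimize.check_grad could be wired up here; check_grad expects a flat parameter vector, so the pack/unpack helpers below are illustrative additions:

import numpy as np
from scipy.optimize import check_grad

def pack(W, V):
    return np.concatenate([W.ravel(), V.ravel()])

def unpack(w, D, M, C):
    W = w[:M*C].reshape(M, C)
    V = w[M*C:].reshape(D, M)
    return W, V

D, M, C, N = 3, 16, 3, 20
rng = np.random.default_rng(0)
X = rng.standard_normal((N, D))
Y = np.eye(C)[rng.integers(0, C, N)]       # one-hot labels
w0 = rng.standard_normal(M*C + D*M) * .01

# cost averages over N and gradients divides by N, so the two scales match
f = lambda w: cost(X, Y, *unpack(w, D, M, C))
g = lambda w: pack(*gradients(X, Y, *unpack(w, D, M, C)))
print(check_grad(f, g, w0))   # should be small (around 1e-6)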
6 . 2
Using gradient descent for the optimization:

def GD(X, Y, M, lr=.1, eps=1e-9, max_iters=100000):
    N, D = X.shape
    N, C = Y.shape
    W = np.random.randn(M, C)*.01   # small random initialization
    V = np.random.randn(D, M)*.01
    dW = np.inf*np.ones_like(W)
    t = 0
    while np.linalg.norm(dW) > eps and t < max_iters:
        dW, dV = gradients(X, Y, W, V)
        W = W - lr*dW
        V = V - lr*dV
        t += 1
    return W, V

[figure: the resulting decision boundaries on the Iris dataset (D = 2 features + 1 bias, M = 16 hidden units, C = 3 classes)]
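A possible end-to-end usage on Iris (assuming scikit-learn is available for the data; the two selected features plus a bias column follow the slide's setup):

from sklearn.datasets import load_iris
import numpy as np

iris = load_iris()
X = iris.data[:, :2]                          # D = 2 features
X = np.hstack([X, np.ones((X.shape[0], 1))])  # + 1 bias column
Y = np.eye(3)[iris.target]                    # one-hot targets, C = 3 classes

W, V = GD(X, Y, M=16, lr=.1)
Yh = softmax(np.dot(logistic(np.dot(X, V)), W))
print("accuracy:", np.mean(Yh.argmax(1) == iris.target))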
6 . 3
Gradient computation is tedious and mechanical. Can we automate it?

Numerical differentiation approximates partial derivatives using finite differences:

∂f/∂w ≈ ( f(w + ε) − f(w) ) / ε

It needs multiple forward passes (one per input-output pair) and can be slow and inaccurate; still, it is useful for black-box cost functions and for checking the correctness of gradient functions.

Symbolic differentiation calculates derivatives symbolically, but it does not identify the computational procedure and the reuse of values.

Automatic / algorithmic differentiation is what we want: write code that calculates various functions (e.g., the cost function), and automatically produce the (partial) derivatives, e.g., the gradients used in learning.
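As an aside, a minimal sketch of the finite-difference idea (using the slightly more accurate central difference; all names are illustrative):

import numpy as np

def numerical_grad(f, w, eps=1e-6):
    # one pair of function evaluations per coordinate: slow for many parameters
    g = np.zeros_like(w)
    for d in range(w.size):
        e = np.zeros_like(w); e[d] = eps
        g[d] = (f(w + e) - f(w - e)) / (2*eps)
    return g

f = lambda w: 0.5*np.sum((np.array([1., 2.]) - w)**2)
print(numerical_grad(f, np.zeros(2)))   # approximately [-1., -2.]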
7 . 1
Idea

- use the chain rule + the derivatives of simple operations (∗, sin, …)
- use a computational graph as a data structure for storing the results of the computation

[figure: a computational graph with leaf nodes (the inputs, e.g., x and the constant 1) and operation nodes a_1, …, a_7 leading to the loss L]

7 . 2
Step 1: break the computation down into atomic operations, e.g., for L = ½(y − wx)²:

a_1 = w,  a_2 = x,  a_3 = y
a_4 = a_1 × a_2
a_5 = a_4 − a_3
a_6 = a_5²
a_7 = .5 × a_6

Step 2: build a graph with operations as internal nodes and input variables as leaf nodes.

Step 3: use the computational graph to calculate derivatives. There are two ways:

- forward mode: start from the leaves and propagate derivatives upward
- reverse mode: first evaluate the graph, then propagate derivatives backward from the output; this second procedure is called backpropagation when applied to neural networks

Forward mode

Example: suppose we want the derivatives with respect to w_1, where

y_1 = sin(w_1 x + w_0),    y_2 = cos(w_1 x + w_0)

Evaluation of the leaves: a_1 = w_0,  a_2 = w_1,  a_3 = x.

Writing ȧ_i = ∂a_i/∂w_1, we initialize the leaf derivatives to identify which derivative we want:

ȧ_1 = 0,  ȧ_2 = 1,  ȧ_3 = 0

We can then calculate both derivatives ∂y_1/∂w_1 and ∂y_2/∂w_1 in a single forward pass.

7 . 3
Evaluation and partial derivatives, propagated together in one forward pass:

a_4 = a_2 × a_3 = w_1 x          ȧ_4 = ȧ_2 a_3 + ȧ_3 a_2 = x
a_5 = a_4 + a_1 = w_1 x + w_0    ȧ_5 = ȧ_4 + ȧ_1 = x
a_6 = sin(a_5)                   ȧ_6 = cos(a_5) ȧ_5
a_7 = cos(a_5)                   ȧ_7 = −sin(a_5) ȧ_5

so that

y_1 = a_6 = sin(w_1 x + w_0),    ∂y_1/∂w_1 = ȧ_6 = x cos(w_1 x + w_0)
y_2 = a_7 = cos(w_1 x + w_0),    ∂y_2/∂w_1 = ȧ_7 = −x sin(w_1 x + w_0)

Note that we get the derivatives of all outputs with respect to w_1 in one forward pass.

We can represent this computation using a graph, storing each node's value a_i and derivative ȧ_i as we go. Forward mode is also memory-efficient: e.g., once a_5 and ȧ_5 are obtained, we can discard the values and partial derivatives of a_4, ȧ_4, a_1, ȧ_1.

[figure: computational graph with leaves a_1 = w_0, a_2 = w_1, a_3 = x and internal nodes a_4, …, a_7, annotated with the forward values and derivatives]

7 . 4
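A compact way to implement forward mode is with dual numbers, which carry (value, derivative) pairs through the computation; a minimal illustrative sketch, not from the slides:

import math

class Dual:
    """Carries a value and its derivative d(value)/dw through the computation."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __mul__(self, o):               # product rule
        return Dual(self.val*o.val, self.dot*o.val + self.val*o.dot)
    def __add__(self, o):               # sum rule
        return Dual(self.val + o.val, self.dot + o.dot)

def sin(a): return Dual(math.sin(a.val), math.cos(a.val)*a.dot)
def cos(a): return Dual(math.cos(a.val), -math.sin(a.val)*a.dot)

# d/dw1 of y1 = sin(w1*x + w0) and y2 = cos(w1*x + w0) at w0=1, w1=2, x=3
w0, w1, x = Dual(1.0, 0.0), Dual(2.0, 1.0), Dual(3.0, 0.0)  # seed: dw1/dw1 = 1
y1, y2 = sin(w1*x + w0), cos(w1*x + w0)
print(y1.dot, y2.dot)   # x*cos(w1*x+w0), -x*sin(w1*x+w0)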
Reverse mode

Writing ā_i = ∂y_2/∂a_i, there are two phases:

1) evaluation: first do a forward pass to evaluate all the nodes:

a_1 = w_0,  a_2 = w_1,  a_3 = x
a_4 = a_2 × a_3 = w_1 x
a_5 = a_4 + a_1 = w_1 x + w_0
y_1 = a_6 = sin(a_5) = sin(w_1 x + w_0)
y_2 = a_7 = cos(a_5) = cos(w_1 x + w_0)

2) partial derivatives: then use these values to calculate the partial derivatives in a backward pass.

Suppose we want the derivatives of y_2 = cos(w_1 x + w_0). We initialize the output derivatives to identify which derivatives we want:

ā_7 = 1,  ā_6 = 0

7 . 5
Backward pass:

ā_7 = ∂y_2/∂y_2 = 1
ā_6 = ∂y_2/∂y_1 = 0
ā_5 = ā_6 cos(a_5) − ā_7 sin(a_5) = ∂y_2/∂a_5 = −sin(w_1 x + w_0)
      (since ∂y_2/∂a_5 = (∂y_2/∂a_6)(∂a_6/∂a_5) + (∂y_2/∂a_7)(∂a_7/∂a_5))
ā_4 = ā_5 = ∂y_2/∂a_4 = −sin(w_1 x + w_0)
ā_3 = a_2 ā_4 = ∂y_2/∂x = −w_1 sin(w_1 x + w_0)
ā_2 = a_3 ā_4 = ∂y_2/∂w_1 = −x sin(w_1 x + w_0)
ā_1 = ā_5 = ∂y_2/∂w_0 = −sin(w_1 x + w_0)

We get all partial derivatives of y_2 in one backward pass. Again we can represent this computation using a graph.

[figure: the same computational graph, now annotated with the adjoints ā_1, …, ā_7 computed in the backward pass]

7 . 6
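A matching reverse-mode sketch, storing each node's parents and local derivatives during the forward pass and accumulating adjoints backward (illustrative; the simple traversal below assumes each intermediate value is used once, as in this example, while a general implementation would process nodes in reverse topological order):

import math

class Var:
    def __init__(self, val, parents=()):
        # parents: pairs of (parent Var, local derivative d(self)/d(parent))
        self.val, self.parents, self.bar = val, parents, 0.0
    def __mul__(self, o):
        return Var(self.val*o.val, [(self, o.val), (o, self.val)])
    def __add__(self, o):
        return Var(self.val + o.val, [(self, 1.0), (o, 1.0)])

def cos(a): return Var(math.cos(a.val), [(a, -math.sin(a.val))])

def backward(y):
    y.bar = 1.0                      # seed: dy/dy = 1
    stack = [y]
    while stack:                     # push adjoints back through the graph
        v = stack.pop()
        for parent, local in v.parents:
            parent.bar += v.bar * local
            stack.append(parent)

w0, w1, x = Var(1.0), Var(2.0), Var(3.0)
y2 = cos(w1*x + w0)
backward(y2)
print(w0.bar, w1.bar, x.bar)   # -sin(7), -3*sin(7), -2*sin(7)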
Forward vs. reverse mode

Forward mode is more natural, easier to implement, and requires less memory: a single forward pass calculates the derivatives of all outputs with respect to one input,

∂y_1/∂w, …, ∂y_C/∂w

However, reverse mode is more efficient for calculating the gradient

∇_w y = [ ∂y/∂w_1, …, ∂y/∂w_D ]^⊤

i.e., the derivatives of a single output with respect to all inputs. This is more efficient when we have a single output (the cost) and many variables (the weights). For this reason, reverse mode is used in training neural networks; the backward pass in reverse mode is called backpropagation. Many machine learning software packages implement autodiff: autograd (extends numpy), pytorch, tensorflow.
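For instance, with the autograd package the example from the previous slides could look like this (a small usage sketch):

import autograd.numpy as np
from autograd import grad

def f(w, x):
    return np.sin(w[1]*x + w[0])

df_dw = grad(f)                           # reverse-mode gradient wrt the first argument
print(df_dw(np.array([1.0, 2.0]), 3.0))   # [cos(7), 3*cos(7)]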
7 . 7
Improving optimization in deep learning

Initialization of parameters:
- random initialization (uniform or Gaussian) with small variance, to break the symmetry of the hidden units
- small positive values for the biases (so that the input to a ReLU is > 0)

Models that are simpler to optimize:
- using ReLU activations
- using skip-connections (see the sketch after this list):

  x^{ℓ+l} = W^{ℓ+l} ReLU(… ReLU(W^{ℓ} x^{ℓ}) …) + x^{ℓ}

  this block is fixing the residual errors of the predictions of the previous layers
- using batch normalization (next)

Pretrain a (simpler) model on a (simpler) task and fine-tune it on the more difficult target setting (this has many forms).

Continuation methods in optimization: gradually increase the difficulty of the optimization problem, so each solution is a good initialization for the next iteration. (image credit: Mobahi'16)

Curriculum learning (a similar idea): increase the number of "difficult" examples over time, similar to the way humans learn.
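A minimal numpy sketch of the skip-connection block above (W1, W2 and the single hidden layer are illustrative):

import numpy as np

def relu(a):
    return np.maximum(0, a)

def residual_block(x, W1, W2):
    # the block only has to model the residual: if W1, W2 are near 0 it is
    # the identity, which keeps gradients flowing through the skip connection
    return W2 @ relu(W1 @ x) + x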
8
Batch normalization

With gradient descent, the parameters in all layers are updated simultaneously, so the distribution of inputs to layer ℓ changes and each layer has to re-adjust; this is inefficient for very deep networks.

Idea: normalize the input to each unit m of layer ℓ:

BN(x_m^{ℓ,(n)}) = ( x_m^{ℓ,(n)} − μ_m^{ℓ} ) / σ_m^{ℓ}

where x_m^{ℓ,(n)} is the activation of unit m at layer ℓ for instance (n).

- the mean and std per unit are calculated for the minibatch during the forward pass, and we backpropagate through this normalization
- at test time, use the mean and std from the whole training set
- BN regularizes the model (e.g., no need for dropout)
- each unit is unnecessarily constrained to have zero mean and std = 1 (we only need to fix the distribution), so we introduce learnable parameters γ^{ℓ}, β^{ℓ}:

  ReLU( γ^{ℓ} BN(W^{ℓ} x^{ℓ}) + β^{ℓ} )

- alternatively, apply the batch-norm to W^{ℓ} x^{ℓ}

Recent observations: the change in the distribution of activations is not a big issue empirically; BN works so well because it makes the loss function smooth. A numpy sketch follows.
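A minimal numpy sketch of the batch-norm forward pass at training time (the eps constant and names are illustrative):

import numpy as np

def batchnorm_forward(A, gamma, beta, eps=1e-5):
    # A: N x M pre-activations for a minibatch; gamma, beta: M learnable parameters
    mu = A.mean(axis=0)                  # per-unit minibatch mean
    var = A.var(axis=0)                  # per-unit minibatch variance
    A_hat = (A - mu) / np.sqrt(var + eps)
    return gamma * A_hat + beta          # scaled and shifted; backprop goes through all of this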
9
Summary

- deep models have exponentially many local optima and saddle points, but most local minima are good
- calculate the gradients using backpropagation
- automatic differentiation simplifies gradient calculation for complex models, so gradient descent becomes simpler to use
- forward mode is useful for calculating the Jacobian of f : ℝ^Q → ℝ^P when P ≥ Q; reverse mode can be more efficient when Q > P
- backpropagation is reverse-mode autodiff
- better optimization in deep learning: better initialization; models that are easier to optimize (using skip-connections, batch-norm, ReLU); pre-training and curriculum learning
10