Applied Machine Learning
Gradient Computation & Automatic Differentiation
Siamak Ravanbakhsh
COMP 551 (Winter 2020)

Learning objectives: using the chain rule for gradient computation; automatic differentiation.


1-3. Training a two-layer network

Suppose we have D inputs $x_1, \ldots, x_D$, M hidden units $z_1, \ldots, z_M$, and C outputs $\hat{y}_1, \ldots, \hat{y}_C$.

Model: $\hat{y} = g(W\, h(V x))$.

Cost function we want to minimize: $J(W, V) = \sum_n L\big(y^{(n)}, g(W\, h(V x^{(n)}))\big)$; for simplicity we drop the bias terms.

We need the gradient with respect to W and V: $\frac{\partial J}{\partial W}$ and $\frac{\partial J}{\partial V}$.

It is simpler to write this for one instance (n): we calculate $\frac{\partial L}{\partial W}$ and $\frac{\partial L}{\partial V}$ and recover
$\frac{\partial J}{\partial W} = \sum_{n=1}^{N} \frac{\partial}{\partial W} L\big(y^{(n)}, \hat{y}^{(n)}\big)$ and $\frac{\partial J}{\partial V} = \sum_{n=1}^{N} \frac{\partial}{\partial V} L\big(y^{(n)}, \hat{y}^{(n)}\big)$.
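As a concrete reference for this model, here is a minimal NumPy sketch of the forward pass $\hat{y} = g(W h(V x))$; it assumes a logistic hidden activation h and leaves the output activation g as a parameter (the function names and shapes are illustrative, not from the slides):

    import numpy as np

    def logistic(q):
        # h(q) = 1 / (1 + exp(-q)), applied elementwise
        return 1.0 / (1.0 + np.exp(-q))

    def forward(x, W, V, g=lambda u: u):
        # x: (D,), V: (M, D), W: (C, M); bias terms dropped, as on the slides
        q = V @ x           # hidden pre-activations
        z = logistic(q)     # hidden units z = h(q)
        u = W @ z           # output pre-activations
        return g(u)         # predictions yhat = g(u)

Note that the code later in the slides works on whole batches and stores the weights transposed (V as D x M, W as M x C); the per-example shapes above simply mirror the equations.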

4-7. Gradient calculation

Write the network in terms of pre-activations:
$\hat{y}_c = g(u_c)$, with $u_c = \sum_{m=1}^{M} W_{c,m} z_m$ (output pre-activations)
$z_m = h(q_m)$, with $q_m = \sum_{d=1}^{D} V_{m,d} x_d$ (hidden pre-activations)

Using the chain rule:
$\frac{\partial L}{\partial W_{c,m}} = \frac{\partial L}{\partial \hat{y}_c} \frac{\partial \hat{y}_c}{\partial u_c} \frac{\partial u_c}{\partial W_{c,m}}$
where $\frac{\partial L}{\partial \hat{y}_c}$ depends on the loss function, $\frac{\partial \hat{y}_c}{\partial u_c}$ depends on the activation function, and $\frac{\partial u_c}{\partial W_{c,m}} = z_m$.

Similarly for V:
$\frac{\partial L}{\partial V_{m,d}} = \sum_c \frac{\partial L}{\partial \hat{y}_c} \frac{\partial \hat{y}_c}{\partial u_c} \frac{\partial u_c}{\partial z_m} \frac{\partial z_m}{\partial q_m} \frac{\partial q_m}{\partial V_{m,d}}$
where $\frac{\partial u_c}{\partial z_m} = W_{c,m}$, $\frac{\partial z_m}{\partial q_m}$ depends on the middle-layer activation, and $\frac{\partial q_m}{\partial V_{m,d}} = x_d$.

8-11. Gradient calculation: regression output

For regression, $\hat{y} = g(u) = u = W z$ and $L(y, \hat{y}) = \frac{1}{2}\lVert y - \hat{y} \rVert_2^2$.

Substituting, $L(y, z) = \frac{1}{2}\lVert y - W z \rVert_2^2$, and taking the derivative:
$\frac{\partial L}{\partial W_{c,m}} = (\hat{y}_c - y_c)\, z_m$
We have seen this in the linear regression lecture.
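A quick numerical sanity check of this formula (a sketch with arbitrary shapes and random values, not part of the slides): compare $(\hat{y}_c - y_c) z_m$ against a finite-difference estimate of the squared-error loss.

    import numpy as np

    rng = np.random.default_rng(0)
    C, M = 3, 4
    W = rng.normal(size=(C, M))
    z = rng.normal(size=M)
    y = rng.normal(size=C)

    yh = W @ z                                  # regression: yhat = u = W z
    analytic = np.outer(yh - y, z)              # dL/dW[c,m] = (yhat_c - y_c) z_m

    eps = 1e-6
    numeric = np.zeros_like(W)
    for c in range(C):
        for m in range(M):
            Wp = W.copy(); Wp[c, m] += eps
            Wm = W.copy(); Wm[c, m] -= eps
            Lp = 0.5 * np.sum((y - Wp @ z) ** 2)
            Lm = 0.5 * np.sum((y - Wm @ z) ** 2)
            numeric[c, m] = (Lp - Lm) / (2 * eps)

    print(np.max(np.abs(analytic - numeric)))   # should be tiny (~1e-9)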

12-14. Gradient calculation: binary classification output

For binary classification with a scalar output (C = 1), $\hat{y} = g(u) = (1 + e^{-u})^{-1}$ and the cross-entropy loss is
$L(y, \hat{y}) = -\big(y \log \hat{y} + (1 - y) \log(1 - \hat{y})\big)$.

Substituting and simplifying (see the logistic regression lecture):
$L(y, u) = y \log(1 + e^{-u}) + (1 - y) \log(1 + e^{u})$

With $u = \sum_m W_m z_m$, substituting u in L and taking the derivative:
$\frac{\partial L}{\partial W_m} = (\hat{y} - y)\, z_m$

15-17. Gradient calculation: multiclass classification output

For multiclass classification, $\hat{y} = g(u) = \mathrm{softmax}(u)$ and
$L(y, \hat{y}) = -\sum_k y_k \log \hat{y}_k$, where C is the number of classes.

Substituting and simplifying (see the logistic regression lecture):
$L(y, u) = -y^\top u + \log \sum_c e^{u_c}$

With $u_c = \sum_m W_{c,m} z_m$, substituting u in L and taking the derivative:
$\frac{\partial L}{\partial W_{c,m}} = (\hat{y}_c - y_c)\, z_m$
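The key step in all three cases is that the loss simplifies so that $\frac{\partial L}{\partial u} = \hat{y} - y$. A short numerical check of that identity for the softmax cross-entropy case (a sketch; the values are arbitrary):

    import numpy as np

    def softmax(u):
        e = np.exp(u - np.max(u))     # shift for numerical stability
        return e / e.sum()

    rng = np.random.default_rng(1)
    C = 4
    u = rng.normal(size=C)
    y = np.eye(C)[1]                  # one-hot target for class 1

    L = lambda u: -y @ u + np.log(np.sum(np.exp(u)))   # softmax cross-entropy in terms of u

    eps = 1e-6
    numeric = np.array([(L(u + eps*np.eye(C)[c]) - L(u - eps*np.eye(C)[c])) / (2*eps)
                        for c in range(C)])
    print(np.max(np.abs(numeric - (softmax(u) - y))))  # close to 0: dL/du = yhat - y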

18-23. Gradient calculation: gradient with respect to V

$\frac{\partial L}{\partial V_{m,d}} = \sum_c \frac{\partial L}{\partial \hat{y}_c} \frac{\partial \hat{y}_c}{\partial u_c} \frac{\partial u_c}{\partial z_m} \frac{\partial z_m}{\partial q_m} \frac{\partial q_m}{\partial V_{m,d}}$

We already did the first two factors. The remaining ones are $\frac{\partial u_c}{\partial z_m} = W_{c,m}$, $\frac{\partial q_m}{\partial V_{m,d}} = x_d$, and $\frac{\partial z_m}{\partial q_m} = h'(q_m)$, which depends on the middle-layer activation:
logistic function: $\sigma(q_m)(1 - \sigma(q_m))$
hyperbolic tangent: $1 - \tanh(q_m)^2$
ReLU: $0$ if $q_m \le 0$, $1$ if $q_m > 0$

Example with the logistic sigmoid:
$\frac{\partial J}{\partial V_{m,d}} = \sum_n \sum_c \big(\hat{y}_c^{(n)} - y_c^{(n)}\big) W_{c,m}\, \sigma(q_m^{(n)})\big(1 - \sigma(q_m^{(n)})\big)\, x_d^{(n)} = \sum_n \sum_c \big(\hat{y}_c^{(n)} - y_c^{(n)}\big) W_{c,m}\, z_m^{(n)}\big(1 - z_m^{(n)}\big)\, x_d^{(n)}$

For the biases we simply assume the corresponding input is 1: $x_0^{(n)} = 1$.
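These activation derivatives are easy to write as small vectorized helpers; a minimal sketch (the function names are illustrative):

    import numpy as np

    def logistic(q):
        return 1.0 / (1.0 + np.exp(-q))

    def dlogistic(q):
        s = logistic(q)
        return s * (1 - s)            # sigma(q)(1 - sigma(q))

    def dtanh(q):
        return 1 - np.tanh(q) ** 2    # 1 - tanh(q)^2

    def drelu(q):
        return (q > 0).astype(float)  # 0 for q <= 0, 1 for q > 0

    q = np.linspace(-3, 3, 7)
    print(dlogistic(q), dtanh(q), drelu(q), sep="\n")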

24-25. Gradient calculation: a common pattern

$\frac{\partial L}{\partial W_{c,m}} = \frac{\partial L}{\partial \hat{y}_c} \frac{\partial \hat{y}_c}{\partial u_c} \frac{\partial u_c}{\partial W_{c,m}} = \frac{\partial L}{\partial u_c} \cdot z_m$ (error from above times input from below)

$\frac{\partial L}{\partial V_{m,d}} = \sum_c \frac{\partial L}{\partial \hat{y}_c} \frac{\partial \hat{y}_c}{\partial u_c} \frac{\partial u_c}{\partial z_m} \frac{\partial z_m}{\partial q_m} \frac{\partial q_m}{\partial V_{m,d}} = \frac{\partial L}{\partial q_m} \cdot x_d$ (error from above times input from below)

In both layers the gradient of a weight is the "error from above" (the derivative of the loss with respect to the unit's pre-activation) times the "input from below."

26-32. Example: classification

Iris dataset (D = 2 features + 1 bias), M = 16 hidden units, C = 3 classes.

Model: $\hat{y} = \mathrm{softmax}(u)$, $u_c = \sum_{m=1}^{M} W_{c,m} z_m$, $z_m = \sigma(q_m)$, $q_m = \sum_{d=1}^{D} V_{m,d} x_d$.

The cost is the softmax cross-entropy:
$J = \sum_{n=1}^{N} \Big( -y^{(n)\top} u^{(n)} + \log \sum_c e^{u_c^{(n)}} \Big)$
(the code below averages over the N examples instead of summing).

Helper functions:

    def logsumexp(Z):             # Z: N x C
        Zmax = np.max(Z, axis=1)[:, None]
        lse = Zmax + np.log(np.sum(np.exp(Z - Zmax), axis=1))[:, None]
        return lse                # N x 1

    def softmax(u):               # u: N x C
        u_exp = np.exp(u - np.max(u, 1)[:, None])
        return u_exp / np.sum(u_exp, axis=-1)[:, None]

Cost function:

    def cost(X,   # N x D
             Y,   # N x C
             W,   # M x C
             V,   # D x M
             ):
        Q = np.dot(X, V)          # N x M
        Z = logistic(Q)           # N x M
        U = np.dot(Z, W)          # N x C
        Yh = softmax(U)
        nll = -np.mean(np.sum(U*Y, 1) - logsumexp(U))
        return nll
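A possible way to call this cost function on the setup above (a sketch, not from the slides: it assumes scikit-learn is available to load Iris, picks the first two features, and defines the logistic helper that the slide code uses but does not show):

    import numpy as np
    from sklearn.datasets import load_iris

    def logistic(q):
        return 1.0 / (1.0 + np.exp(-q))

    iris = load_iris()
    X = iris.data[:, :2]                          # D = 2 features
    X = np.column_stack([np.ones(len(X)), X])     # + 1 bias column
    Y = np.eye(3)[iris.target]                    # one-hot targets, C = 3

    N, D = X.shape
    M, C = 16, 3
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(M, C))
    V = rng.normal(scale=0.01, size=(D, M))

    print(cost(X, Y, W, V))                       # average negative log-likelihood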

33-39. Example: classification (gradients)

Using the formulas derived above, with the logistic hidden layer and softmax output:
$\frac{\partial L}{\partial W_{c,m}} = (\hat{y}_c - y_c)\, z_m$
$\frac{\partial L}{\partial V_{m,d}} = \sum_c (\hat{y}_c - y_c)\, W_{c,m}\, z_m (1 - z_m)\, x_d$

    def gradients(X,   # N x D
                  Y,   # N x K
                  W,   # M x K
                  V,   # D x M
                  ):
        Z = logistic(np.dot(X, V))            # N x M
        N, D = X.shape
        Yh = softmax(np.dot(Z, W))            # N x K
        dY = Yh - Y                           # N x K
        dW = np.dot(Z.T, dY)/N                # M x K
        dZ = np.dot(dY, W.T)                  # N x M
        dV = np.dot(X.T, dZ * Z * (1 - Z))/N  # D x M
        return dW, dV

Check your gradient function using a finite-difference approximation that uses the cost function, e.g. scipy.optimize.check_grad.
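One way to run that check (a sketch, not from the slides): flatten W and V into a single parameter vector so the cost and gradient have the signatures scipy.optimize.check_grad expects. It assumes X, Y, M and the cost/gradients functions above are already defined.

    import numpy as np
    from scipy.optimize import check_grad

    N, D = X.shape
    K = Y.shape[1]

    def unpack(params):
        W = params[:M*K].reshape(M, K)
        V = params[M*K:].reshape(D, M)
        return W, V

    def f(params):                         # scalar cost of one flat parameter vector
        W, V = unpack(params)
        return cost(X, Y, W, V)

    def g(params):                         # analytic gradient, flattened the same way
        W, V = unpack(params)
        dW, dV = gradients(X, Y, W, V)
        return np.concatenate([dW.ravel(), dV.ravel()])

    params0 = np.random.randn(M*K + D*M) * 0.01
    print(check_grad(f, g, params0))       # should be small (e.g. < 1e-5)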

40-41. Example: classification (optimization)

Iris dataset (D = 2 features + 1 bias), M = 16 hidden units, C = 3 classes. Using GD (gradient descent) for optimization:

    def GD(X, Y, M, lr=.1, eps=1e-9, max_iters=100000):
        N, D = X.shape
        N, K = Y.shape
        W = np.random.randn(M, K)*.01
        V = np.random.randn(D, M)*.01
        dW = np.inf*np.ones_like(W)
        t = 0
        while np.linalg.norm(dW) > eps and t < max_iters:
            dW, dV = gradients(X, Y, W, V)
            W = W - lr*dW
            V = V - lr*dV
            t += 1
        return W, V

(The slide shows the resulting decision boundaries on the two Iris features.)
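A hedged usage sketch (assuming X and Y from the Iris setup above, plus the logistic, softmax, and gradients functions): train with GD, then predict by taking the argmax of the network outputs.

    W, V = GD(X, Y, M=16)

    Z = logistic(np.dot(X, V))
    Yh = softmax(np.dot(Z, W))
    pred = np.argmax(Yh, axis=1)
    accuracy = np.mean(pred == np.argmax(Y, axis=1))
    print(accuracy)   # training accuracy on the two Iris features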

42-43. Automating gradient computation

Gradient computation is tedious and mechanical; can we automate it?

Using numerical differentiation? $\frac{\partial f}{\partial w} \approx \frac{f(w + \epsilon) - f(w)}{\epsilon}$ approximates partial derivatives using finite differences. It needs multiple forward passes (one per input-output pair), and can be slow and inaccurate, but it is useful for black-box cost functions or for checking the correctness of gradient functions.
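A minimal sketch of this idea (using central differences, which are slightly more accurate than the one-sided formula above; the test function f is just an example):

    import numpy as np

    def numerical_grad(f, w, eps=1e-6):
        # one pair of forward passes per coordinate of w: slow for high-dimensional w
        grad = np.zeros_like(w)
        for i in range(w.size):
            e = np.zeros_like(w); e[i] = eps
            grad[i] = (f(w + e) - f(w - e)) / (2 * eps)
        return grad

    f = lambda w: np.sin(w[0] * w[1]) + w[0] ** 2
    w = np.array([0.5, -1.2])
    print(numerical_grad(f, w))
    print(np.array([w[1] * np.cos(w[0] * w[1]) + 2 * w[0],   # exact gradient for comparison
                    w[0] * np.cos(w[0] * w[1])]))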

44-45. Automating gradient computation (continued)

Symbolic differentiation: symbolic calculation of the derivatives; it does not identify the computational procedure and the reuse of intermediate values.

Automatic / algorithmic differentiation is what we want: write code that calculates various functions (e.g., the cost function) and automatically produce the (partial) derivatives (e.g., the gradients used in learning).
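As an illustration of that workflow, here is a tiny example with the autograd package (the choice of library is an assumption here; it is not named on these slides): write the cost as ordinary code and ask for its derivative.

    from autograd import grad   # pip install autograd; records elementary operations and applies the chain rule

    def cost(w, x, y):
        # the same toy loss used on the next slides: L = 0.5 (y - w x)^2
        return 0.5 * (y - w * x) ** 2

    dcost_dw = grad(cost)            # derivative of cost with respect to its first argument, w
    print(dcost_dw(2.0, 3.0, 1.0))   # -(y - w x) x = -(1 - 6) * 3 = 15.0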

46-53. Automatic differentiation

Idea: use the chain rule plus the derivatives of simple operations ($\times$, $\sin$, ...), and use a computational graph as a data structure (for storing the results of the computation).

Step 1: break the function down into atomic operations. For $L = \frac{1}{2}(y - wx)^2$:
$a_1 = w$
$a_2 = x$
$a_3 = y$
$a_4 = a_1 \times a_2$
$a_5 = a_3 - a_4$
$a_6 = a_5^2$
$a_7 = 0.5 \times a_6$

Step 2: build a graph with the operations as internal nodes and the input variables ($a_1, a_2, a_3$) as leaf nodes; the root $a_7$ is L.

Step 3: there are two ways to use the computational graph to calculate derivatives:
forward mode: start from the leaves and propagate derivatives upward.
reverse mode: 1. first, in a bottom-up (forward) pass, calculate the values $a_1, \ldots, a_7$; 2. then, in a top-down (backward) pass, calculate the derivatives.

This second procedure is called backpropagation when applied to neural networks.
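A small sketch of reverse mode on this graph, written out by hand (variable names follow the slide's $a_1, \ldots, a_7$; the bar values are the derivatives of L with respect to each node):

    import numpy as np

    w, x, y = 2.0, 3.0, 1.0

    # forward (bottom-up) pass: evaluate every node
    a1, a2, a3 = w, x, y
    a4 = a1 * a2          # w * x
    a5 = a3 - a4          # y - w x
    a6 = a5 ** 2
    a7 = 0.5 * a6         # L

    # backward (top-down) pass: bar(a_i) = dL/da_i
    a7_bar = 1.0
    a6_bar = 0.5 * a7_bar           # a7 = 0.5 * a6
    a5_bar = 2 * a5 * a6_bar        # a6 = a5^2
    a4_bar = -a5_bar                # a5 = a3 - a4
    a1_bar = a2 * a4_bar            # a4 = a1 * a2, derivative w.r.t. a1 (= w)

    print(a7, a1_bar)               # L = 12.5, dL/dw = -(y - w x) x = 15.0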

54-63. Forward mode

Suppose we want the derivative $\frac{\partial y_1}{\partial w_1}$, where $y_1 = \sin(w_1 x + w_0)$ and $y_2 = \cos(w_1 x + w_0)$.

We can calculate both $y_1, y_2$ and the derivatives $\frac{\partial y_1}{\partial w_1}, \frac{\partial y_2}{\partial w_1}$ in a single forward pass. Write $\dot{\square} = \frac{\partial \square}{\partial w_1}$; we initialize the dotted values of the leaves to identify which derivative we want.

Evaluation and partial derivatives:
$a_1 = w_0$, $\dot{a}_1 = 0$
$a_2 = w_1$, $\dot{a}_2 = 1$
$a_3 = x$, $\dot{a}_3 = 0$
$a_4 = a_2 \times a_3 = w_1 x$, $\dot{a}_4 = a_3 \dot{a}_2 + a_2 \dot{a}_3 = x$
$a_5 = a_4 + a_1 = w_1 x + w_0$, $\dot{a}_5 = \dot{a}_4 + \dot{a}_1 = x$
$y_1 = a_6 = \sin(a_5) = \sin(w_1 x + w_0)$, $\dot{a}_6 = \dot{a}_5 \cos(a_5) = x \cos(w_1 x + w_0) = \frac{\partial y_1}{\partial w_1}$
$y_2 = a_7 = \cos(a_5) = \cos(w_1 x + w_0)$, $\dot{a}_7 = -\dot{a}_5 \sin(a_5) = -x \sin(w_1 x + w_0) = \frac{\partial y_2}{\partial w_1}$

Note that we get all the partial derivatives $\frac{\partial \square}{\partial w_1}$ in one forward pass.
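A minimal sketch of this tangent propagation in code: each node carries a pair (value, derivative with respect to $w_1$), and every operation updates both.

    import numpy as np

    def fwd(w0, w1, x):
        # each node is (value, tangent), where tangent = d(value)/d(w1)
        a1 = (w0, 0.0)
        a2 = (w1, 1.0)                                   # we differentiate with respect to w1
        a3 = (x,  0.0)
        a4 = (a2[0]*a3[0], a3[0]*a2[1] + a2[0]*a3[1])    # product rule
        a5 = (a4[0]+a1[0], a4[1]+a1[1])                  # sum rule
        y1 = (np.sin(a5[0]),  np.cos(a5[0])*a5[1])
        y2 = (np.cos(a5[0]), -np.sin(a5[0])*a5[1])
        return y1, y2

    (y1, dy1), (y2, dy2) = fwd(w0=0.3, w1=1.5, x=2.0)
    print(dy1, 2.0*np.cos(1.5*2.0 + 0.3))    # matches x cos(w1 x + w0)
    print(dy2, -2.0*np.sin(1.5*2.0 + 0.3))   # matches -x sin(w1 x + w0)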

64-71. Forward mode: computational graph

We can represent this computation using a graph: leaves $a_1 = w_0$, $a_2 = w_1$, $a_3 = x$; internal nodes $a_4 = a_2 \times a_3$ and $a_5 = a_4 + a_1$; outputs $a_6 = \sin(a_5) = y_1$ and $a_7 = \cos(a_5) = y_2$.

Each node stores its value and its dotted derivative:
$\dot{a}_1 = 0$, $\dot{a}_2 = 1$, $\dot{a}_3 = 0$
$\dot{a}_4 = a_3 \dot{a}_2 + a_2 \dot{a}_3$
$\dot{a}_5 = \dot{a}_4 + \dot{a}_1$
$\dot{a}_6 = \dot{a}_5 \cos(a_5) = \frac{\partial y_1}{\partial w_1}$
$\dot{a}_7 = -\dot{a}_5 \sin(a_5) = \frac{\partial y_2}{\partial w_1}$

Once the nodes upstream have calculated their values and derivatives, we may discard a node; e.g., once $a_6$ and $a_7$ (and hence $\frac{\partial y_1}{\partial w_1}$ and $\frac{\partial y_2}{\partial w_1}$) are obtained, we can discard the values and partial derivatives $a_5, \dot{a}_5, a_4, \dot{a}_4, a_1, \dot{a}_1$.

72-78. Reverse mode

Suppose we want the derivative $\frac{\partial y_2}{\partial w_1}$, where $y_2 = \cos(w_1 x + w_0)$.

1) First do a forward pass for evaluation:
$a_1 = w_0$
$a_2 = w_1$
$a_3 = x$
$a_4 = a_2 \times a_3 = w_1 x$
$a_5 = a_4 + a_1 = w_1 x + w_0$
$y_1 = a_6 = \sin(a_5) = \sin(w_1 x + w_0)$
$y_2 = a_7 = \cos(a_5) = \cos(w_1 x + w_0)$

2) Then use these stored values to calculate the partial derivatives in a backward pass. Write $\bar{\square} = \frac{\partial y_2}{\partial \square}$:
$\bar{a}_7 = \frac{\partial y_2}{\partial y_2} = 1$
$\bar{a}_6 = \frac{\partial y_2}{\partial y_1} = 0$
$\bar{a}_5 = \bar{a}_7 \frac{\partial a_7}{\partial a_5} + \bar{a}_6 \frac{\partial a_6}{\partial a_5} = -\bar{a}_7 \sin(a_5) + \bar{a}_6 \cos(a_5) = -\sin(w_1 x + w_0)$
$\bar{a}_4 = \bar{a}_5 \frac{\partial a_5}{\partial a_4} = \bar{a}_5 = -\sin(w_1 x + w_0)$
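A sketch of the full backward pass in code, continuing one step past where the slides stop in order to reach $\frac{\partial y_2}{\partial w_1}$ (which matches the forward-mode result $-x \sin(w_1 x + w_0)$ obtained above):

    import numpy as np

    w0, w1, x = 0.3, 1.5, 2.0

    # 1) forward pass: evaluate and store every node
    a1, a2, a3 = w0, w1, x
    a4 = a2 * a3            # w1 * x
    a5 = a4 + a1            # w1 * x + w0
    a6 = np.sin(a5)         # y1
    a7 = np.cos(a5)         # y2

    # 2) backward pass: bar(a_i) = d y2 / d a_i, using the stored values
    a7_bar = 1.0
    a6_bar = 0.0                                     # y2 does not depend on y1
    a5_bar = a6_bar*np.cos(a5) - a7_bar*np.sin(a5)
    a4_bar = a5_bar                                  # a5 = a4 + a1
    a2_bar = a3 * a4_bar                             # a4 = a2 * a3, derivative w.r.t. a2 (= w1)

    print(a2_bar, -x*np.sin(w1*x + w0))              # both equal dy2/dw1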
