Applied Machine Learning
Gradient Computation & Automatic Differentiation
Siamak Ravanbakhsh
COMP 551 (Winter 2020)
Learning objectives
- using the chain rule to calculate gradients
- automatic differentiation: forward mode and reverse mode (backpropagation)
Two-layer MLP model:

f(x; W, V) = g(W h(V x))

The loss function depends on the task:

min_{W,V} Σ_n L(y^(n), f(x^(n); W, V))

This is a non-convex optimization problem with many critical points (points where the gradient is zero):

- saddle points: these are not stable, and SGD can escape them
- there are exponentially many global optima: given one global optimum, we can permute the hidden units in each layer; for symmetric activations, negate the input/output of a unit; for rectifiers, rescale the input/output of a unit

General beliefs:

- there are many more saddle points than local minima
- the number of local minima increases at lower costs, so most local minima are close to global optima
- both supported by empirical and theoretical results in special settings

image credit: https://www.offconvex.org
strategy
use gradient descent methods (covered earlier in the course)
For f : R → R we have the derivative (d/dw) f(w) ∈ R.

For f : R^D → R, the gradient is the vector of all partial derivatives:

∇_w f(w) = [∂f(w)/∂w_1, …, ∂f(w)/∂w_D]^⊤ ∈ R^D

For f : R^D → R^M, the Jacobian is the matrix of all partial derivatives:

J = ⎡ ∂f_1(w)/∂w_1 … ∂f_1(w)/∂w_D ⎤
    ⎢       ⋮       ⋱       ⋮       ⎥
    ⎣ ∂f_M(w)/∂w_1 … ∂f_M(w)/∂w_D ⎦  ∈ R^{M×D}

(note that we also use J for the cost function)

For all three cases we may simply write ∂f(w)/∂w, where M and D will be clear from the context.

What if W is a matrix? We assume it is reshaped into a vector for these calculations.
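As a concrete check of these shapes, here is a small sketch using scipy's finite-difference helper (the function f below is our own toy example, not from the slides):

```python
import numpy as np
from scipy.optimize import approx_fprime

# f: R^3 -> R, so its gradient lives in R^3
f = lambda w: w[0]**2 + 3*w[1]*w[2]
w = np.array([1.0, 2.0, -1.0])

grad = approx_fprime(w, f, 1e-7)
# analytic gradient: [2*w[0], 3*w[2], 3*w[1]] = [2, -3, 6]
```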
The chain rule: for f : x ↦ z and h : z ↦ y, where x, y, z ∈ R,

dy/dx = (dy/dz)(dz/dx)

- dz/dx: speed of change in z as we change x
- dy/dz: speed of change in y as we change z
- dy/dx: speed of change in y as we change x

More generally, for x ∈ R^D, z ∈ R^M, y ∈ R^C:

∂y_c/∂x_d = Σ_{m=1}^{M} (∂y_c/∂z_m)(∂z_m/∂x_d)

We are looking at all the "paths" through which a change in x_d changes y_c, and adding up their contributions.

In matrix form:

∂y/∂x = (∂y/∂z)(∂z/∂x)

where ∂y/∂x is the C x D Jacobian, ∂y/∂z the C x M Jacobian, and ∂z/∂x the M x D Jacobian.
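The matrix form can be verified numerically: the product of the two Jacobians should match the Jacobian of the composed map. A small sketch (the maps f and h and the jacobian helper are illustrative, not from the slides):

```python
import numpy as np

def jacobian(f, x, eps=1e-6):
    # forward-difference Jacobian of f at x (rows: outputs, cols: inputs)
    f0 = f(x)
    J = np.zeros((f0.size, x.size))
    for d in range(x.size):
        xp = x.copy()
        xp[d] += eps
        J[:, d] = (f(xp) - f0) / eps
    return J

f = lambda x: np.array([x[0] + x[1], x[0]*x[1], x[1]**2])  # x -> z, R^2 -> R^3
h = lambda z: np.array([z[0]*z[2], z[1] + z[2]])           # z -> y, R^3 -> R^2

x = np.array([0.5, -1.5])
J_chain = jacobian(h, f(x)) @ jacobian(f, x)   # (2 x 3) @ (3 x 2)
J_direct = jacobian(lambda t: h(f(t)), x)      # 2 x 2
```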
Suppose we have D inputs x_1, …, x_D; M hidden units z_1, …, z_M; and C outputs ŷ_1, …, ŷ_C. For simplicity we drop the bias terms.

[figure: two-layer network — inputs x_1, …, x_D feed the hidden units z_1, …, z_M through weights V; the hidden units feed the outputs ŷ_1, …, ŷ_C through weights W]

Model:

ŷ = g(W h(V x))

Cost function we want to minimize:

J(W, V) = Σ_n L(y^(n), g(W h(V x^(n))))

We need the gradients with respect to W and V, ∂J/∂W and ∂J/∂V. It is simpler to write these for one instance (n), so we calculate ∂L/∂W and ∂L/∂V and recover

∂J/∂W = Σ_{n=1}^{N} (∂/∂W) L(y^(n), ŷ^(n))   and   ∂J/∂V = Σ_{n=1}^{N} (∂/∂V) L(y^(n), ŷ^(n))
Pre-activations: write q_m and u_c for the pre-activations of the hidden and output units,

z_m = h(q_m),   q_m = Σ_{d=1}^{D} V_{m,d} x_d
ŷ_c = g(u_c),   u_c = Σ_{m=1}^{M} W_{c,m} z_m

with loss L(y, ŷ).

Using the chain rule:

∂L/∂W_{c,m} = (∂L/∂ŷ_c)(∂ŷ_c/∂u_c)(∂u_c/∂W_{c,m}) = (∂L/∂ŷ_c)(∂ŷ_c/∂u_c) z_m

where ∂L/∂ŷ_c depends on the loss function and ∂ŷ_c/∂u_c depends on the activation function.

Similarly for V:

∂L/∂V_{m,d} = Σ_c (∂L/∂ŷ_c)(∂ŷ_c/∂u_c)(∂u_c/∂z_m)(∂z_m/∂q_m)(∂q_m/∂V_{m,d})
            = Σ_c (∂L/∂ŷ_c)(∂ŷ_c/∂u_c) W_{c,m} (∂z_m/∂q_m) x_d

where ∂z_m/∂q_m depends on the middle-layer activation.
Example: regression. With identity output activation,

ŷ = g(u) = u = Wz,   L(y, ŷ) = ½ ‖y − ŷ‖²

Substituting:

L(y, z) = ½ ‖y − Wz‖²

Taking the derivative (we have seen this in the linear regression lecture):

∂L/∂W_{c,m} = (ŷ_c − y_c) z_m
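This derivative can be checked against a finite-difference approximation of the loss; a quick sketch (the shapes and values are our own toy choices):

```python
import numpy as np

rng = np.random.RandomState(0)
W = rng.randn(3, 4)          # C x M weight matrix
z = rng.randn(4)             # hidden activations
y = rng.randn(3)             # regression targets

L = lambda W: 0.5*np.sum((y - W @ z)**2)
analytic = np.outer(W @ z - y, z)    # (yhat - y) z^T, a C x M matrix

# forward-difference gradient, one entry of W at a time
eps = 1e-6
numeric = np.zeros_like(W)
for c in range(W.shape[0]):
    for m in range(W.shape[1]):
        Wp = W.copy()
        Wp[c, m] += eps
        numeric[c, m] = (L(Wp) - L(W)) / eps
```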
Example: binary classification. With a scalar output (C = 1),

ŷ = g(u) = (1 + e^{−u})^{−1},   L(y, ŷ) = −y log ŷ − (1 − y) log(1 − ŷ)

Substituting and simplifying (see the logistic regression lecture):

L(y, u) = y log(1 + e^{−u}) + (1 − y) log(1 + e^{u})

Substituting u = Σ_m W_m z_m in L and taking the derivative:

∂L/∂W_m = (ŷ − y) z_m
Example: multiclass classification, where C is the number of classes:

ŷ = g(u) = softmax(u),   L(y, ŷ) = −Σ_k y_k log ŷ_k

Substituting and simplifying (see the logistic regression lecture):

L(y, u) = −y^⊤ u + log Σ_c e^{u_c}

Substituting u_c = Σ_m W_{c,m} z_m in L and taking the derivative:

∂L/∂W_{c,m} = (ŷ_c − y_c) z_m
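The (ŷ_c − y_c) form can be sanity-checked numerically, since the gradient of L(y, u) = −y^⊤u + log Σ_c e^{u_c} with respect to u should be softmax(u) − y. A quick sketch (the values are illustrative):

```python
import numpy as np

u = np.array([1.0, -0.5, 0.3])   # logits
y = np.array([0.0, 1.0, 0.0])    # one-hot label

softmax = lambda u: np.exp(u - u.max()) / np.exp(u - u.max()).sum()
L = lambda u: -y @ u + np.log(np.sum(np.exp(u)))   # -y^T u + log sum_c e^{u_c}

# forward-difference gradient, one logit at a time
eps = 1e-6
num_grad = np.array([(L(u + eps*np.eye(3)[c]) - L(u)) / eps for c in range(3)])
# should match softmax(u) - y, i.e. (yhat_c - y_c)
```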
Gradient with respect to V:

∂L/∂V_{m,d} = Σ_c (∂L/∂ŷ_c)(∂ŷ_c/∂u_c) W_{c,m} (∂z_m/∂q_m) x_d

We already did the (∂L/∂ŷ_c)(∂ŷ_c/∂u_c) part; the factor ∂z_m/∂q_m depends on the middle-layer activation:

- logistic function: σ(q_m)(1 − σ(q_m))
- hyperbolic tangent: 1 − tanh²(q_m)
- ReLU: 1 if q_m > 0, and 0 if q_m ≤ 0

Example: with logistic sigmoid hidden units,

∂J/∂V_{m,d} = Σ_n Σ_c (ŷ_c^(n) − y_c^(n)) W_{c,m} σ(q_m^(n))(1 − σ(q_m^(n))) x_d^(n)
            = Σ_n Σ_c (ŷ_c^(n) − y_c^(n)) W_{c,m} z_m^(n) (1 − z_m^(n)) x_d^(n)

For biases we simply assume the corresponding input is 1 (x^(n) = 1).
A common pattern:

∂L/∂W_{c,m} = (∂L/∂ŷ_c)(∂ŷ_c/∂u_c) z_m = (∂L/∂u_c) z_m
∂L/∂V_{m,d} = Σ_c (∂L/∂ŷ_c)(∂ŷ_c/∂u_c)(∂u_c/∂z_m)(∂z_m/∂q_m) x_d = (∂L/∂q_m) x_d

Each weight's gradient is (error from above) × (input from below): ∂L/∂u_c is the error from above at output unit c, with z_m its input from below; ∂L/∂q_m is the error from above at hidden unit m, with x_d its input from below.
Iris dataset (D = 2 features + 1 bias), M = 16 hidden units, C = 3 classes:

z_m = σ(q_m),   q_m = Σ_{d=1}^{D} V_{m,d} x_d
ŷ = softmax(u),   u_c = Σ_{m=1}^{M} W_{c,m} z_m

The cost is softmax cross-entropy:

J = Σ_{n=1}^{N} ( −y^(n)⊤ u^(n) + log Σ_c e^{u_c^(n)} )

def cost(X,  # N x D
         Y,  # N x C
         W,  # M x C
         V,  # D x M
         ):
    Q = np.dot(X, V)    # N x M
    Z = logistic(Q)     # N x M
    U = np.dot(Z, W)    # N x C
    Yh = softmax(U)     # predictions (not needed by the nll below)
    nll = - np.mean(np.sum(U*Y, 1) - logsumexp(U))
    return nll

Helper functions:

def logsumexp(Z,  # N x C
              ):
    Zmax = np.max(Z, axis=1)[:, None]
    lse = Zmax + np.log(np.sum(np.exp(Z - Zmax), axis=1))[:, None]
    return lse  # N x 1

def softmax(u,  # N x C
            ):
    u_exp = np.exp(u - np.max(u, 1)[:, None])
    return u_exp / np.sum(u_exp, axis=-1)[:, None]
Iris dataset (D=2 features + 1 bias) M = 16 hidden units C=3 classes
xd z =
m
σ(q )
m
= y ^ softmax(u) q =
m
V x ∑d=1
D m,d d
u =
c
W z ∑m=1
M c,m m
6 . 2
L(y, ) y ^
def gradients(X,#N x D Y,#N x K W,#M x K V,#D x M ): 1 2 3 4 5 Z = logistic(np.dot(X, V))#N x M 6 N,D = X.shape 7 Yh = softmax(np.dot(Z, W))#N x K 8 dY = Yh - Y #N x K 9 dW= np.dot(Z.T, dY)/N #M x K 10 dZ = np.dot(dY, W.T) #N x M 11 dV = np.dot(X.T, dZ * Z * (1 - Z))/N #D x M 12 return dW, dV 13
L =
∂Wm ∂
( − y ^ y)zm L =
∂Vm,d ∂
( − y ^ y)W z (1 −
m m
z )x
m d
Iris dataset (D=2 features + 1 bias) M = 16 hidden units C=3 classes
xd z =
m
σ(q )
m
= y ^ softmax(u) q =
m
V x ∑d=1
D m,d d
u =
c
W z ∑m=1
M c,m m
6 . 2
L(y, ) y ^
def gradients(X,#N x D Y,#N x K W,#M x K V,#D x M ): 1 2 3 4 5 Z = logistic(np.dot(X, V))#N x M 6 N,D = X.shape 7 Yh = softmax(np.dot(Z, W))#N x K 8 dY = Yh - Y #N x K 9 dW= np.dot(Z.T, dY)/N #M x K 10 dZ = np.dot(dY, W.T) #N x M 11 dV = np.dot(X.T, dZ * Z * (1 - Z))/N #D x M 12 return dW, dV 13 Z = logistic(np.dot(X, V))#N x M def gradients(X,#N x D 1 Y,#N x K 2 W,#M x K 3 V,#D x M 4 ): 5 6 N,D = X.shape 7 Yh = softmax(np.dot(Z, W))#N x K 8 dY = Yh - Y #N x K 9 dW= np.dot(Z.T, dY)/N #M x K 10 dZ = np.dot(dY, W.T) #N x M 11 dV = np.dot(X.T, dZ * Z * (1 - Z))/N #D x M 12 return dW, dV 13
L =
∂Wm ∂
( − y ^ y)zm L =
∂Vm,d ∂
( − y ^ y)W z (1 −
m m
z )x
m d
Iris dataset (D=2 features + 1 bias) M = 16 hidden units C=3 classes
check your gradient function using finite difference approximation that uses the cost function
scipy.optimize.check_grad 1
xd z =
m
σ(q )
m
= y ^ softmax(u) q =
m
V x ∑d=1
D m,d d
u =
c
W z ∑m=1
M c,m m
6 . 2
L(y, ) y ^
def gradients(X,#N x D Y,#N x K W,#M x K V,#D x M ): 1 2 3 4 5 Z = logistic(np.dot(X, V))#N x M 6 N,D = X.shape 7 Yh = softmax(np.dot(Z, W))#N x K 8 dY = Yh - Y #N x K 9 dW= np.dot(Z.T, dY)/N #M x K 10 dZ = np.dot(dY, W.T) #N x M 11 dV = np.dot(X.T, dZ * Z * (1 - Z))/N #D x M 12 return dW, dV 13 Z = logistic(np.dot(X, V))#N x M def gradients(X,#N x D 1 Y,#N x K 2 W,#M x K 3 V,#D x M 4 ): 5 6 N,D = X.shape 7 Yh = softmax(np.dot(Z, W))#N x K 8 dY = Yh - Y #N x K 9 dW= np.dot(Z.T, dY)/N #M x K 10 dZ = np.dot(dY, W.T) #N x M 11 dV = np.dot(X.T, dZ * Z * (1 - Z))/N #D x M 12 return dW, dV 13 Yh = softmax(np.dot(Z, W))#N x K def gradients(X,#N x D 1 Y,#N x K 2 W,#M x K 3 V,#D x M 4 ): 5 Z = logistic(np.dot(X, V))#N x M 6 N,D = X.shape 7 8 dY = Yh - Y #N x K 9 dW= np.dot(Z.T, dY)/N #M x K 10 dZ = np.dot(dY, W.T) #N x M 11 dV = np.dot(X.T, dZ * Z * (1 - Z))/N #D x M 12 return dW, dV 13
L =
∂Wm ∂
( − y ^ y)zm L =
∂Vm,d ∂
( − y ^ y)W z (1 −
m m
z )x
m d
Iris dataset (D=2 features + 1 bias) M = 16 hidden units C=3 classes
check your gradient function using finite difference approximation that uses the cost function
scipy.optimize.check_grad 1
xd z =
m
σ(q )
m
= y ^ softmax(u) q =
m
V x ∑d=1
D m,d d
u =
c
W z ∑m=1
M c,m m
6 . 2
L(y, ) y ^
def gradients(X,#N x D Y,#N x K W,#M x K V,#D x M ): 1 2 3 4 5 Z = logistic(np.dot(X, V))#N x M 6 N,D = X.shape 7 Yh = softmax(np.dot(Z, W))#N x K 8 dY = Yh - Y #N x K 9 dW= np.dot(Z.T, dY)/N #M x K 10 dZ = np.dot(dY, W.T) #N x M 11 dV = np.dot(X.T, dZ * Z * (1 - Z))/N #D x M 12 return dW, dV 13 Z = logistic(np.dot(X, V))#N x M def gradients(X,#N x D 1 Y,#N x K 2 W,#M x K 3 V,#D x M 4 ): 5 6 N,D = X.shape 7 Yh = softmax(np.dot(Z, W))#N x K 8 dY = Yh - Y #N x K 9 dW= np.dot(Z.T, dY)/N #M x K 10 dZ = np.dot(dY, W.T) #N x M 11 dV = np.dot(X.T, dZ * Z * (1 - Z))/N #D x M 12 return dW, dV 13 Yh = softmax(np.dot(Z, W))#N x K def gradients(X,#N x D 1 Y,#N x K 2 W,#M x K 3 V,#D x M 4 ): 5 Z = logistic(np.dot(X, V))#N x M 6 N,D = X.shape 7 8 dY = Yh - Y #N x K 9 dW= np.dot(Z.T, dY)/N #M x K 10 dZ = np.dot(dY, W.T) #N x M 11 dV = np.dot(X.T, dZ * Z * (1 - Z))/N #D x M 12 return dW, dV 13 dY = Yh - Y #N x K dW= np.dot(Z.T, dY)/N #M x K def gradients(X,#N x D 1 Y,#N x K 2 W,#M x K 3 V,#D x M 4 ): 5 Z = logistic(np.dot(X, V))#N x M 6 N,D = X.shape 7 Yh = softmax(np.dot(Z, W))#N x K 8 9 10 dZ = np.dot(dY, W.T) #N x M 11 dV = np.dot(X.T, dZ * Z * (1 - Z))/N #D x M 12 return dW, dV 13
L =
∂Wm ∂
( − y ^ y)zm L =
∂Vm,d ∂
( − y ^ y)W z (1 −
m m
z )x
m d
Iris dataset (D=2 features + 1 bias) M = 16 hidden units C=3 classes
check your gradient function using finite difference approximation that uses the cost function
scipy.optimize.check_grad 1
xd z =
m
σ(q )
m
= y ^ softmax(u) q =
m
V x ∑d=1
D m,d d
u =
c
W z ∑m=1
M c,m m
6 . 2
L(y, ) y ^
def gradients(X,#N x D Y,#N x K W,#M x K V,#D x M ): 1 2 3 4 5 Z = logistic(np.dot(X, V))#N x M 6 N,D = X.shape 7 Yh = softmax(np.dot(Z, W))#N x K 8 dY = Yh - Y #N x K 9 dW= np.dot(Z.T, dY)/N #M x K 10 dZ = np.dot(dY, W.T) #N x M 11 dV = np.dot(X.T, dZ * Z * (1 - Z))/N #D x M 12 return dW, dV 13 Z = logistic(np.dot(X, V))#N x M def gradients(X,#N x D 1 Y,#N x K 2 W,#M x K 3 V,#D x M 4 ): 5 6 N,D = X.shape 7 Yh = softmax(np.dot(Z, W))#N x K 8 dY = Yh - Y #N x K 9 dW= np.dot(Z.T, dY)/N #M x K 10 dZ = np.dot(dY, W.T) #N x M 11 dV = np.dot(X.T, dZ * Z * (1 - Z))/N #D x M 12 return dW, dV 13 Yh = softmax(np.dot(Z, W))#N x K def gradients(X,#N x D 1 Y,#N x K 2 W,#M x K 3 V,#D x M 4 ): 5 Z = logistic(np.dot(X, V))#N x M 6 N,D = X.shape 7 8 dY = Yh - Y #N x K 9 dW= np.dot(Z.T, dY)/N #M x K 10 dZ = np.dot(dY, W.T) #N x M 11 dV = np.dot(X.T, dZ * Z * (1 - Z))/N #D x M 12 return dW, dV 13 dY = Yh - Y #N x K dW= np.dot(Z.T, dY)/N #M x K def gradients(X,#N x D 1 Y,#N x K 2 W,#M x K 3 V,#D x M 4 ): 5 Z = logistic(np.dot(X, V))#N x M 6 N,D = X.shape 7 Yh = softmax(np.dot(Z, W))#N x K 8 9 10 dZ = np.dot(dY, W.T) #N x M 11 dV = np.dot(X.T, dZ * Z * (1 - Z))/N #D x M 12 return dW, dV 13 dZ = np.dot(dY, W.T) #N x M dV = np.dot(X.T, dZ * Z * (1 - Z))/N #D x M def gradients(X,#N x D 1 Y,#N x K 2 W,#M x K 3 V,#D x M 4 ): 5 Z = logistic(np.dot(X, V))#N x M 6 N,D = X.shape 7 Yh = softmax(np.dot(Z, W))#N x K 8 dY = Yh - Y #N x K 9 dW= np.dot(Z.T, dY)/N #M x K 10 11 12 return dW, dV 13
L =
∂Wm ∂
( − y ^ y)zm L =
∂Vm,d ∂
( − y ^ y)W z (1 −
m m
z )x
m d
Iris dataset (D=2 features + 1 bias) M = 16 hidden units C=3 classes
check your gradient function using a finite difference approximation that uses the cost function
scipy.optimize.check_grad

z_m = σ(q_m),  q_m = Σ_{d=1}^{D} V_{m,d} x_d
ŷ = softmax(u),  u_c = Σ_{m=1}^{M} W_{c,m} z_m
loss L(y, ŷ)

def gradients(X,  # N x D
              Y,  # N x K
              W,  # M x K
              V,  # D x M
              ):
    N, D = X.shape
    Z = logistic(np.dot(X, V))              # N x M
    Yh = softmax(np.dot(Z, W))              # N x K
    dY = Yh - Y                             # N x K
    dW = np.dot(Z.T, dY)/N                  # M x K
    dZ = np.dot(dY, W.T)                    # N x M
    dV = np.dot(X.T, dZ * Z * (1 - Z))/N    # D x M
    return dW, dV

∂L/∂W_m = (ŷ − y) z_m
∂L/∂V_{m,d} = (ŷ − y) W_m z_m (1 − z_m) x_d

6 . 2
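The finite-difference check can be sketched end to end. This is a minimal sketch, not the course code: `logistic`, `softmax`, and the cross-entropy `cost` are defined here for completeness, and the checker compares the analytic `dW` against a central difference of the cost (the same idea as `scipy.optimize.check_grad`, which uses a one-sided difference).

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(u):
    e = np.exp(u - u.max(axis=1, keepdims=True))  # shift for stability
    return e / e.sum(axis=1, keepdims=True)

def cost(X, Y, W, V):
    # average cross-entropy of the two-layer MLP (assumed loss for this check)
    Z = logistic(np.dot(X, V))
    Yh = softmax(np.dot(Z, W))
    return -np.mean(np.sum(Y * np.log(Yh), axis=1))

def gradients(X, Y, W, V):
    N, D = X.shape
    Z = logistic(np.dot(X, V))
    Yh = softmax(np.dot(Z, W))
    dY = Yh - Y
    dW = np.dot(Z.T, dY) / N
    dZ = np.dot(dY, W.T)
    dV = np.dot(X.T, dZ * Z * (1 - Z)) / N
    return dW, dV

def finite_diff_check(X, Y, W, V, eps=1e-6):
    # max |analytic dW - numerical dW|; the same loop works for dV
    dW, _ = gradients(X, Y, W, V)
    num = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        Wp, Wm = W.copy(), W.copy()
        Wp[idx] += eps
        Wm[idx] -= eps
        num[idx] = (cost(X, Y, Wp, V) - cost(X, Y, Wm, V)) / (2 * eps)
    return np.max(np.abs(num - dW))
```

A discrepancy much larger than ~1e-6 usually signals a bug in the analytic gradient rather than finite-difference noise.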
Iris dataset (D=2 features + 1 bias), M = 16 hidden units, C = 3 classes
using GD for optimization

def GD(X, Y, M, lr=.1, eps=1e-9, max_iters=100000):
    N, D = X.shape
    N, K = Y.shape
    W = np.random.randn(M, K)*.01
    V = np.random.randn(D, M)*.01
    dW = np.inf*np.ones_like(W)
    t = 0
    while np.linalg.norm(dW) > eps and t < max_iters:
        dW, dV = gradients(X, Y, W, V)
        W = W - lr*dW
        V = V - lr*dV
        t += 1
    return W, V

the resulting decision boundaries

Winter 2020 | Applied Machine Learning (COMP551)

6 . 3
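A self-contained run of this loop can be sketched on synthetic data standing in for Iris (three well-separated 2-d blobs plus a bias column; the helper functions are repeated so the snippet runs on its own, and the initialization scale, learning rate, and iteration budget are tuned for this toy run rather than taken from the slide):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(u):
    e = np.exp(u - u.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def gradients(X, Y, W, V):
    N, D = X.shape
    Z = logistic(np.dot(X, V))
    Yh = softmax(np.dot(Z, W))
    dY = Yh - Y
    dW = np.dot(Z.T, dY) / N
    dZ = np.dot(dY, W.T)
    dV = np.dot(X.T, dZ * Z * (1 - Z)) / N
    return dW, dV

def GD(X, Y, M, lr=.5, eps=1e-9, max_iters=10000):
    N, D = X.shape
    N, K = Y.shape
    W = np.random.randn(M, K) * .1   # slightly larger init than the slide's
    V = np.random.randn(D, M) * .1   # to speed up this toy run
    dW = np.inf * np.ones_like(W)
    t = 0
    while np.linalg.norm(dW) > eps and t < max_iters:
        dW, dV = gradients(X, Y, W, V)
        W = W - lr * dW
        V = V - lr * dV
        t += 1
    return W, V

# toy data: three 2-d Gaussian blobs, plus a bias column (D = 2 + 1)
rng = np.random.RandomState(1)
centers = np.array([[0., 0.], [4., 0.], [0., 4.]])
X2 = np.vstack([c + 0.5 * rng.randn(20, 2) for c in centers])
X = np.hstack([X2, np.ones((60, 1))])
Y = np.eye(3)[np.repeat(np.arange(3), 20)]   # one-hot labels, C = 3

np.random.seed(0)                            # make the run repeatable
W, V = GD(X, Y, M=16)
pred = softmax(logistic(X @ V) @ W).argmax(axis=1)
accuracy = (pred == Y.argmax(axis=1)).mean()
```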
gradient computation is tedious and mechanical. can we automate it?

using numerical differentiation?
∂f/∂w ≈ (f(w + ϵ) − f(w)) / ϵ
approximates partial derivatives using finite differences
needs multiple forward passes (one for each input–output pair)
can be slow and inaccurate
useful for black-box cost functions, or for checking the correctness of gradient functions

symbolic differentiation: symbolic calculation of derivatives
does not identify the computational procedure and the reuse of values

automatic / algorithmic differentiation is what we want:
write code that calculates various functions, e.g., the cost function
automatically produce (partial) derivatives, e.g., gradients used in learning

7 . 1
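The one-sided finite-difference formula above can be sketched on a toy cost (the sin example used later in this lecture; the function and its parameter values here are illustrative only). Note that it needs one extra evaluation of f per parameter, which is exactly why it is slow for models with many weights:

```python
import math

def f(w):
    # toy scalar cost with two parameters: f(w0, w1) = sin(w1*x + w0) at x = 2.0
    w0, w1 = w
    return math.sin(w1 * 2.0 + w0)

def numerical_grad(f, w, eps=1e-6):
    # one extra forward pass per parameter
    base = f(w)
    grad = []
    for i in range(len(w)):
        wp = list(w)
        wp[i] += eps
        grad.append((f(wp) - base) / eps)
    return grad

g = numerical_grad(f, [0.3, 0.7])
# exact gradient: [cos(w1*x + w0), x * cos(w1*x + w0)]
```

The approximation error here has two sources: truncation error of order ϵ and floating-point cancellation in f(w+ϵ) − f(w), which is why ϵ can be neither too large nor too small.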
idea
use the chain rule + derivatives of simple operations (∗, sin, …)
use a computational graph as a data structure (for storing the results of computation)

step 1: break down to atomic operations
L = ½(y − wx)²
a1 = w,  a2 = x,  a3 = y
a4 = a1 × a2
a5 = a4 − a3
a6 = a5²
a7 = .5 × a6

step 2: build a graph with operations as internal nodes and input variables as leaf nodes
leaves a1, a2, a3; then a1, a2 → a4; a4, a3 → a5; a5 → a6; a6 → a7 = L

step 3: there are two ways to use the computational graph to calculate derivatives
forward mode: start from the leaves (a1, …, a4, …) and propagate derivatives upward
reverse mode: start from the output and propagate derivatives back toward the leaves
this second procedure is called backpropagation when applied to neural networks

7 . 2
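Steps 1 and 2 can be sketched directly: evaluate L = ½(y − wx)² as a chain of atomic operations, keeping every intermediate value (these stored values are the nodes of the computational graph; the function name is illustrative only).

```python
def atomic_eval(w, x, y):
    # step 1: break L = 0.5 * (y - w*x)**2 into atomic operations
    # step 2: keep every intermediate value -- the graph's nodes
    a1, a2, a3 = w, x, y        # leaf nodes
    a4 = a1 * a2                # w * x
    a5 = a4 - a3                # w*x - y  (the sign disappears when squared)
    a6 = a5 ** 2                # (y - w*x)**2
    a7 = 0.5 * a6               # L
    return [a1, a2, a3, a4, a5, a6, a7]

nodes = atomic_eval(w=2.0, x=3.0, y=1.0)
L = nodes[-1]   # 0.5 * (1 - 6)**2 = 12.5
```

Storing the intermediates is the point: both forward and reverse mode reuse these values instead of re-deriving them symbolically.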
suppose we want the derivative ∂y1/∂w1 where
y1 = sin(w1 x + w0)
y2 = cos(w1 x + w0)

we can calculate both y1, y2 and the derivatives ∂y1/∂w1, ∂y2/∂w1 in a single forward pass

notation: ȧ = ∂a/∂w1 — we initialize these seeds to identify which derivative we want

evaluation                        partial derivatives
a1 = w0                           ȧ1 = 0
a2 = w1                           ȧ2 = 1
a3 = x                            ȧ3 = 0
a4 = a2 × a3 = w1 x               ȧ4 = ȧ2 × a3 + a2 × ȧ3 = x
a5 = a4 + a1 = w1 x + w0          ȧ5 = ȧ4 + ȧ1 = x
y1 = a6 = sin(a5)                 ȧ6 = cos(a5) ȧ5  ⇒  ∂y1/∂w1 = x cos(w1 x + w0)
y2 = a7 = cos(a5)                 ȧ7 = −sin(a5) ȧ5  ⇒  ∂y2/∂w1 = −x sin(w1 x + w0)

note that we get all partial derivatives ∂□/∂w1 in one forward pass

7 . 3
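The forward pass above can be implemented by overloading arithmetic on (value, derivative) pairs, sometimes called dual numbers. The `Dual` class below is a minimal sketch (only the operations this example needs), not a full implementation; seeding `dot=1` on w1 selects ∂□/∂w1, exactly like the ȧ initialization above.

```python
import math

class Dual:
    # forward-mode AD: each node carries (value, derivative w.r.t. the seed)
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)
    def __mul__(self, other):
        # product rule, as in a4_dot = a2_dot * a3 + a2 * a3_dot
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def sin(a):
    return Dual(math.sin(a.val), math.cos(a.val) * a.dot)

def cos(a):
    return Dual(math.cos(a.val), -math.sin(a.val) * a.dot)

# seed dot = 1 on w1 to compute d/dw1 (illustrative parameter values)
w0, w1, x = Dual(0.2), Dual(0.5, dot=1.0), Dual(2.0)
y1 = sin(w1 * x + w0)   # y1.dot holds x * cos(w1*x + w0)
y2 = cos(w1 * x + w0)   # y2.dot holds -x * sin(w1*x + w0)
```

One forward pass with this class yields the derivative of every output with respect to the single seeded input, matching the slide's claim.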
suppose we want the derivative ∂y1/∂w1 where y1 = sin(w1 x + w0), y2 = cos(w1 x + w0)

we can represent this computation using a graph (leaves a1, a2, a3; internal nodes a4, …, a7)

evaluation                  partial derivatives (ȧ = ∂a/∂w1)
a1 = w0                     ȧ1 = 0
a2 = w1                     ȧ2 = 1
a3 = x                      ȧ3 = 0
a4 = a2 × a3                ȧ4 = ȧ2 × a3 + a2 × ȧ3
a5 = a4 + a1                ȧ5 = ȧ4 + ȧ1
y1 = a6 = sin(a5)           ȧ6 = cos(a5) ȧ5 = ∂y1/∂w1
y2 = a7 = cos(a5)           ȧ7 = −sin(a5) ȧ5 = ∂y2/∂w1

values and derivatives can be discarded as soon as they are no longer needed:
e.g., once a5 and ȧ5 are obtained, we can discard the values and partial derivatives for a4, ȧ4, a1, ȧ1

7 . 4
suppose we want the derivative ∂y2/∂w1 where y2 = cos(w1 x + w0)

1) first do a forward pass for evaluation
a1 = w0,  a2 = w1,  a3 = x
a4 = a2 × a3 = w1 x
a5 = a4 + a1 = w1 x + w0
y1 = a6 = sin(a5) = sin(w1 x + w0)
y2 = a7 = cos(a5) = cos(w1 x + w0)

2) then use these values to calculate partial derivatives in a backward pass
notation: ā = ∂y2/∂a — initialize ā7 = 1, ā6 = 0
this means ∂y2/∂y2 = 1 and ∂y2/∂y1 = 0

ā5 = ā6 ∂a6/∂a5 + ā7 ∂a7/∂a5 = ā6 cos(a5) − ā7 sin(a5) = −sin(w1 x + w0)
ā4 = ā5 = −sin(w1 x + w0)
ā3 = ā4 a2  ⇒  ∂y2/∂x = −w1 sin(w1 x + w0)
ā2 = ā4 a3  ⇒  ∂y2/∂w1 = −x sin(w1 x + w0)
ā1 = ā5  ⇒  ∂y2/∂w0 = −sin(w1 x + w0)

we get all partial derivatives ∂y2/∂□ in one backward pass

7 . 5
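The two passes above can be written out directly for this example. This is a minimal hand-unrolled sketch (a real reverse-mode implementation records the operations on a tape and replays them backward); the parameter values in the test are illustrative.

```python
import math

def reverse_mode(w0, w1, x):
    # 1) forward pass: evaluate and store every intermediate value
    a4 = w1 * x
    a5 = a4 + w0
    y1 = math.sin(a5)
    y2 = math.cos(a5)
    # 2) backward pass for dy2/d(everything): seed a7_bar = 1, a6_bar = 0
    a7_bar, a6_bar = 1.0, 0.0
    a5_bar = a6_bar * math.cos(a5) - a7_bar * math.sin(a5)
    a4_bar = a5_bar            # a5 = a4 + a1: addition passes the bar through
    w1_bar = a4_bar * x        # a4 = a2 * a3: multiply by the other factor
    x_bar = a4_bar * w1
    w0_bar = a5_bar
    return w0_bar, w1_bar, x_bar

# the whole gradient of y2 from one backward pass
w0_bar, w1_bar, x_bar = reverse_mode(0.2, 0.5, 2.0)
```

Note that the backward pass needs the stored forward value a5, which is why reverse mode trades memory for efficiency.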
suppose we want the derivative ∂y2/∂w1 where y2 = cos(w1 x + w0)

we can represent this computation using a graph (the same nodes a1, …, a7 as before)

1) evaluation
a1 = w0,  a2 = w1,  a3 = x
a4 = a2 × a3,  a5 = a4 + a1
y1 = a6 = sin(a5),  y2 = a7 = cos(a5)

2) partial derivatives (ā = ∂y2/∂a)
ā7 = ∂y2/∂y2 = 1
ā6 = ∂y2/∂y1 = 0
ā5 = ā6 cos(a5) − ā7 sin(a5)
ā4 = ā5
ā3 = ā4 a2  = ∂y2/∂x
ā2 = ā4 a3  = ∂y2/∂w1
ā1 = ā5  = ∂y2/∂w0

7 . 6
forward mode is more natural, easier to implement, and requires less memory
a single forward pass calculates ∂y1/∂w, …, ∂yC/∂w

however, reverse mode is more efficient in calculating the gradient
∇_w y = [∂y/∂w1, …, ∂y/∂wD]ᵀ

this is more efficient if we have a single output (the cost) and many variables (the weights)
for this reason, reverse mode is used in training neural networks
the backward pass in reverse mode is called backpropagation

many machine learning software packages implement autodiff: autograd (extends numpy), pytorch, tensorflow

7 . 7
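The efficiency argument can be seen in a toy case: for one scalar cost over D weights, a single backward pass produces all D partial derivatives, whereas forward mode would need D separately seeded passes. The cost function below is purely illustrative.

```python
def grad_reverse(w, x):
    # scalar cost L(w) = 0.5 * (sum_i w_i * x_i)**2, with D = len(w) inputs
    s = sum(wi * xi for wi, xi in zip(w, x))   # forward pass (store s)
    L = 0.5 * s * s
    # backward pass: seed L_bar = 1, then s_bar = dL/ds = s,
    # and dL/dw_i = s_bar * x_i -- the whole gradient in one sweep
    s_bar = s
    return L, [s_bar * xi for xi in x]

L, grad = grad_reverse([1.0, 2.0], [3.0, 4.0])   # s = 11, L = 60.5
```

Frameworks such as autograd, pytorch, and tensorflow automate exactly this bookkeeping (tape recording plus the backward sweep) for arbitrary compositions of operations.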
Initialization of parameters:
random initialization (uniform or Gaussian) with small variance
break the symmetry of hidden units

models that are simpler to optimize:
using ReLU activations
using skip-connections
using batch normalization (next)

skip-connections: x^{ℓ+l} = W^{ℓ+l} ReLU(… ReLU(W^{ℓ} x^{ℓ}) …) + x^{ℓ}
this block is fixing the residual errors of the predictions of the previous layers

Pretrain a (simpler) model on a (simpler) task and fine-tune on a more difficult target setting (has many forms)

continuation methods in optimization:
gradually increase the difficulty of the optimization problem
each solution is a good initialization for the next iteration

curriculum learning (a similar idea):
increase the number of "difficult" examples over time
similar to the way humans learn

image credit: Mobahi'16

8
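The skip-connection formula can be sketched for a single block with one hidden nonlinearity (weight shapes and names here are assumptions for illustration):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def residual_block(x, W_in, W_out):
    # x_{l+1} = W_out @ relu(W_in @ x) + x : the layers inside the block
    # only have to model the residual; with zero weights the block is
    # exactly the identity map, which eases optimization of deep stacks
    return W_out @ relu(W_in @ x) + x
```

The identity behavior at zero weights is the key design point: a deep stack of such blocks starts out close to a shallow network and only gradually adds corrections.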
Improving optimization in deep learning

gradient descent: parameters in all layers are updated
the distribution of inputs to layer ℓ changes, so each layer has to re-adjust
inefficient for very deep networks

idea: normalize the input to each unit (m) of a layer ℓ
(x_m^{ℓ,(n)} − μ_m^{ℓ}) / σ_m^{ℓ}, where x_m^{ℓ,(n)} is the activation for instance (n) at layer ℓ
alternatively: apply the batch-norm to W^{ℓ} x^{ℓ}

each unit is unnecessarily constrained to have zero mean and std = 1 (we only need to fix the distribution)
introduce learnable parameters: ReLU(γ^{ℓ} BN(W^{ℓ} x^{ℓ}) + β^{ℓ})

the mean and std per unit are calculated for the minibatch during the forward pass
we backpropagate through this normalization
at test time, use the mean and std from the whole training set
BN regularizes the model (e.g., no need for dropout)

recent observations: the change in the distribution of activations is not a big issue; empirically, BN works so well because it makes the loss function smooth

9
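The batch-norm forward pass can be sketched in numpy (training-time statistics only; a full implementation also maintains running averages for test time, which are omitted here):

```python
import numpy as np

def batch_norm(A, gamma, beta, eps=1e-5):
    # A: N x M pre-activations (W{l} x{l}) for a minibatch of N instances;
    # normalize each unit (column) with its minibatch mean and std, then
    # the learnable gamma/beta restore the freedom the normalization removed
    mu = A.mean(axis=0)
    sigma = A.std(axis=0)
    A_hat = (A - mu) / (sigma + eps)
    return gamma * A_hat + beta
```

With gamma = 1 and beta = 0 every unit's output has (approximately) zero mean and unit std over the minibatch; during training, gradients flow through mu and sigma as well, since they are functions of the weights.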
exponentially many local optima and saddle points; most local minima are good

calculate the gradients using backpropagation
automatic differentiation simplifies gradient calculation for complex models, so gradient descent becomes simpler to use
forward mode is useful for calculating the Jacobian of f : R^Q → R^P when P ≥ Q
reverse mode can be more efficient when Q > P
backpropagation is reverse-mode autodiff

Better optimization in deep learning:
better initialization
models that are easier to optimize (using skip-connections, batch-norm, ReLU)
pre-training and curriculum learning

10