Neural Networks: Design

Shan-Hung Wu

shwu@cs.nthu.edu.tw

Department of Computer Science, National Tsing Hua University, Taiwan

Machine Learning

Outline

1. The Basics
   - Example: Learning the XOR
2. Training
   - Back Propagation
3. Neuron Design
   - Cost Function & Output Neurons
   - Hidden Neurons
4. Architecture Design
   - Architecture Tuning

Model: a Composite Function I

- A feedforward neural network, or multilayer perceptron (MLP), defines a function composition
  $$\hat{y} = f^{(L)}\big(\cdots f^{(2)}(f^{(1)}(x; \theta^{(1)}); \theta^{(2)}) \cdots; \theta^{(L)}\big)$$
  that approximates the target function $f^*$.
- The parameters $\theta^{(1)}, \dots, \theta^{(L)}$ are learned from the training set $\mathbb{X}$.
- "Feedforward" because information flows from input to output.

Model: a Composite Function II

- At each layer $k$, the function $f^{(k)}(\cdot\,; W^{(k)}, b^{(k)})$ is nonlinear and outputs the value $a^{(k)} \in \mathbb{R}^{D^{(k)}}$, where
  $$a^{(k)} = \mathrm{act}^{(k)}\big(W^{(k)\top} a^{(k-1)} + b^{(k)}\big)$$
- $\mathrm{act}^{(k)}(\cdot): \mathbb{R} \to \mathbb{R}$ is an activation function applied elementwise.
- Shorthand: $a^{(k)} = \mathrm{act}^{(k)}(W^{(k)\top} a^{(k-1)})$, where $a^{(k-1)} \in \mathbb{R}^{D^{(k-1)}+1}$ is augmented with $a^{(k-1)}_0 = 1$, and $W^{(k)} \in \mathbb{R}^{(D^{(k-1)}+1) \times D^{(k)}}$ absorbs the bias.
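To make this concrete, here is a minimal NumPy sketch of one layer (not from the slides; function and variable names are mine):

```python
import numpy as np

def dense_layer(a_prev, W, b, act=lambda z: np.maximum(0.0, z)):
    """One feedforward layer: a^(k) = act(W^(k)T a^(k-1) + b^(k)).

    a_prev: (D_prev,) activations of layer k-1
    W:      (D_prev, D_k) weights;  b: (D_k,) biases
    act:    elementwise activation (ReLU by default)
    """
    z = W.T @ a_prev + b          # pre-activation z^(k)
    return act(z)

def dense_layer_absorbed(a_prev, W_aug):
    """Shorthand form: the bias sits in the first row of W_aug."""
    a_aug = np.concatenate(([1.0], a_prev))  # prepend a_0^(k-1) = 1
    return np.maximum(0.0, W_aug.T @ a_aug)
```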

Neurons I

- Each $f^{(k)}_j = \mathrm{act}^{(k)}\big(W^{(k)\top}_{:,j}\, a^{(k-1)}\big) = \mathrm{act}^{(k)}(z^{(k)}_j)$ is a unit (or neuron):
  - E.g., the perceptron.
  - Loosely guided by neuroscience.

Neurons II

- Modern NN design is mainly guided by mathematical and engineering disciplines. Consider a binary classifier where $y \in \{0,1\}$:
  - Hidden units: $a^{(k)} = \max(0, z^{(k)})$
  - Output unit: $a^{(L)} = \hat\rho = \sigma(z^{(L)})$, assuming $\Pr(y = 1\,|\,x) \sim \mathrm{Bernoulli}(\rho)$
  - Prediction: $\hat{y} = \mathbb{1}(\hat\rho > 0.5) = \mathbb{1}(z^{(L)} > 0)$
  - The output layer is thus a logistic regressor with input $a^{(L-1)}$.

Representation Learning

- The outputs $a^{(1)}, a^{(2)}, \dots, a^{(L-1)}$ of the hidden layers $f^{(1)}, f^{(2)}, \dots, f^{(L-1)}$ are distributed representations of $x$:
  - Nonlinear in the input space, since the $f^{(k)}$'s are nonlinear.
  - Usually more abstract at deeper layers.
- $f^{(L)}$ is the actual prediction function:
  - As in nonlinear SVMs and polynomial regression, a simple linear function suffices: $z^{(L)} = W^{(L)\top} a^{(L-1)}$.
  - $\mathrm{act}^{(L)}(\cdot)$ just "normalizes" $z^{(L)}$ to give $\hat\rho \in (0,1)$.
Learning the XOR I

- Why do ReLUs learn nonlinear (and better) representations?
- Let's learn XOR ($f^*$) in a binary classification task:
  - $x \in \mathbb{R}^2$ and $y \in \{0,1\}$
  - XOR is nonlinear, so it cannot be learned by linear models.
- Consider an NN with one hidden layer:
  - $a^{(1)} = \max(0, W^{(1)\top} x)$
  - $a^{(2)} = \hat\rho = \sigma(w^{(2)\top} a^{(1)})$
  - Prediction: $\hat{y} = \mathbb{1}(\hat\rho > 0.5)$
- It learns XOR by "merging" data points first.
Learning the XOR II

With the bias absorbed into the first column of $X$ and the first row of $W^{(1)}$, one set of weights that implements XOR is:

$$X = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \end{bmatrix} \in \mathbb{R}^{N \times (1+D)}, \qquad W^{(1)} = \begin{bmatrix} 0 & -1 \\ 1 & 1 \\ 1 & 1 \end{bmatrix}, \qquad w^{(2)} = \begin{bmatrix} -1 \\ 2 \\ -4 \end{bmatrix}$$

$$\hat{y} = \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix} = \mathbb{1}\Big(\sigma\big([\,\mathbf{1} \;\; \max(0, X W^{(1)})\,]\, w^{(2)}\big) > 0.5\Big)$$
Latent Representation $A^{(1)}$

$$X W^{(1)} = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} 0 & -1 \\ 1 & 1 \\ 1 & 1 \end{bmatrix} = \begin{bmatrix} 0 & -1 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix}, \qquad A^{(1)} = [\,\mathbf{1} \;\; \max(0, X W^{(1)})\,] = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 2 & 1 \end{bmatrix}$$

Note that the inputs $(0,1)$ and $(1,0)$ both map to the same row $(1,1,0)$: this is the "merging" of data points mentioned earlier.
Output Distribution $a^{(2)}$

$$a^{(2)} = \sigma(A^{(1)} w^{(2)}) = \sigma\left(\begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 2 & 1 \end{bmatrix} \begin{bmatrix} -1 \\ 2 \\ -4 \end{bmatrix}\right) = \sigma\left(\begin{bmatrix} -1 \\ 1 \\ 1 \\ -1 \end{bmatrix}\right), \qquad \hat{y} = \mathbb{1}(a^{(2)} > 0.5) = \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix}$$

- But how do we train $W^{(1)}$ and $w^{(2)}$ from examples?
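Before turning to training, here is a quick NumPy check of this forward pass, using the weights shown above (a sketch; variable names are mine):

```python
import numpy as np

# XOR inputs with a leading bias column of 1s; the labels are y = XOR(x1, x2)
X = np.array([[1, 0, 0],
              [1, 0, 1],
              [1, 1, 0],
              [1, 1, 1]], dtype=float)
W1 = np.array([[0, -1],
               [1,  1],
               [1,  1]], dtype=float)
w2 = np.array([-1, 2, -4], dtype=float)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

A1 = np.hstack([np.ones((4, 1)), np.maximum(0.0, X @ W1)])  # latent representation
a2 = sigmoid(A1 @ w2)                                       # predicted Bernoulli parameter
print((a2 > 0.5).astype(int))                               # -> [0 1 1 0], i.e., XOR
```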


Training an NN

- Given examples $\mathbb{X} = \{(x^{(i)}, y^{(i)})\}_{i=1}^N$, how do we learn the parameters $\Theta = \{W^{(1)}, \dots, W^{(L)}\}$?
- Most NNs are trained by maximum likelihood by default (assuming i.i.d. examples):
  $$\arg\max_\Theta \log P(\mathbb{X}\,|\,\Theta) = \arg\min_\Theta -\log P(\mathbb{X}\,|\,\Theta) = \arg\min_\Theta -\sum_i \log P(x^{(i)}, y^{(i)}\,|\,\Theta)$$
  $$= \arg\min_\Theta -\sum_i \big[\log P(y^{(i)}\,|\,x^{(i)}, \Theta) + \log P(x^{(i)}\,|\,\Theta)\big] = \arg\min_\Theta \sum_i -\log P(y^{(i)}\,|\,x^{(i)}, \Theta) = \arg\min_\Theta \sum_i C^{(i)}(\Theta)$$
  (the term $\log P(x^{(i)}\,|\,\Theta)$ drops out because the marginal distribution of $x$ does not depend on $\Theta$)
- The minimizer $\hat\Theta$ is a consistent (asymptotically unbiased) estimator of the "true" $\Theta^*$, so this is good for large $N$.
Example: Binary Classification

- Assume $\Pr(y = 1\,|\,x) \sim \mathrm{Bernoulli}(\rho)$, where $x \in \mathbb{R}^D$ and $y \in \{0,1\}$; $a^{(L)} = \hat\rho = \sigma(z^{(L)})$ is the predicted distribution.
- The cost function $C^{(i)}(\Theta)$ can be written as:
  $$C^{(i)}(\Theta) = -\log P(y^{(i)}\,|\,x^{(i)}; \Theta) = -\log\big[(a^{(L)})^{y^{(i)}} (1 - a^{(L)})^{1 - y^{(i)}}\big] = -\log\big[\sigma(z^{(L)})^{y^{(i)}} (1 - \sigma(z^{(L)}))^{1 - y^{(i)}}\big]$$
  $$= -\log \sigma\big((2y^{(i)} - 1)\, z^{(L)}\big) = \zeta\big((1 - 2y^{(i)})\, z^{(L)}\big)$$
- $\zeta(u) = \log(1 + e^u)$ is the softplus function.
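A small numerical check of the softplus identity above (my sketch, not from the slides):

```python
import numpy as np

softplus = lambda u: np.log1p(np.exp(u))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for y in (0, 1):
    for z in (-2.0, 0.5, 3.0):
        rho = sigmoid(z)                                   # predicted Pr(y = 1 | x)
        nll = -np.log(rho if y == 1 else 1.0 - rho)        # -log Bernoulli likelihood
        assert np.isclose(nll, softplus((1 - 2 * y) * z))  # = zeta((1 - 2y) z)
```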

Optimization Algorithm

- Most NNs use SGD to solve $\arg\min_\Theta \sum_i C^{(i)}(\Theta)$:
  - Fast convergence in time [1]
  - Supports (GPU-based) parallelism
  - Supports online learning
  - Easy to implement
- (Mini-batched) Stochastic Gradient Descent (SGD), sketched in code below:
  - Initialize $\Theta^{(0)}$ randomly
  - Repeat until convergence:
    - Randomly partition the training set $\mathbb{X}$ into minibatches of size $M$
    - For each minibatch: $\Theta^{(t+1)} \leftarrow \Theta^{(t)} - \eta\, \nabla_\Theta \sum_{i=1}^M C^{(i)}(\Theta^{(t)})$
- How do we compute $\nabla_\Theta \sum_i C^{(i)}(\Theta^{(t)})$ efficiently? There can be a huge number of $W^{(k)}_{i,j}$'s in $\Theta$.
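A minimal skeleton of the minibatched SGD loop above (a sketch; `grad_loss` is a hypothetical stand-in for the gradient routine that backprop, introduced next, provides):

```python
import numpy as np

def sgd(params, grad_loss, X, Y, lr=0.1, batch_size=32, epochs=10):
    """params: list of weight arrays (Theta).
    grad_loss(params, Xb, Yb) -> list of gradients of sum_i C^(i), one per array."""
    N = len(X)
    for _ in range(epochs):                    # stand-in for "repeat until convergence"
        order = np.random.permutation(N)       # random partition into minibatches
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]
            grads = grad_loss(params, X[idx], Y[idx])
            for p, g in zip(params, grads):    # Theta <- Theta - eta * gradient
                p -= lr * g
    return params
```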

Back Propagation

- SGD update: $\Theta^{(t+1)} \leftarrow \Theta^{(t)} - \eta\, \nabla_\Theta \sum_{n=1}^M C^{(n)}(\Theta^{(t)})$, and $\nabla_\Theta \sum_n C^{(n)}(\Theta^{(t)}) = \sum_n \nabla_\Theta C^{(n)}(\Theta^{(t)})$.
- Let $c^{(n)} = C^{(n)}(\Theta^{(t)})$; our goal is to evaluate $\partial c^{(n)} / \partial W^{(k)}_{i,j}$ for all $i$, $j$, $k$, and $n$.
- Back propagation (or simply backprop) is an efficient way to evaluate multiple partial derivatives at once, assuming the partial derivatives share some common evaluation steps.
- By the chain rule, we have
  $$\frac{\partial c^{(n)}}{\partial W^{(k)}_{i,j}} = \frac{\partial c^{(n)}}{\partial z^{(k)}_j} \cdot \frac{\partial z^{(k)}_j}{\partial W^{(k)}_{i,j}}$$
Forward Pass

- The second term is $\partial z^{(k)}_j / \partial W^{(k)}_{i,j}$:
  - When $k = 1$: $z^{(1)}_j = \sum_i W^{(1)}_{i,j} x^{(n)}_i$, so $\partial z^{(1)}_j / \partial W^{(1)}_{i,j} = x^{(n)}_i$.
  - When $k > 1$: $z^{(k)}_j = \sum_i W^{(k)}_{i,j} a^{(k-1)}_i$, so $\partial z^{(k)}_j / \partial W^{(k)}_{i,j} = a^{(k-1)}_i$.
- We can therefore get the second terms of all the $\partial c^{(n)} / \partial W^{(k)}_{i,j}$'s starting from the shallowest layer.
Backward Pass I

- Conversely, we can get the first terms of all the $\partial c^{(n)} / \partial W^{(k)}_{i,j}$'s starting from the deepest layer.
- Define the error signal $\delta^{(k)}_j$ as the first term $\partial c^{(n)} / \partial z^{(k)}_j$.
- When $k = L$, the evaluation varies from task to task, depending on the definitions of $\mathrm{act}^{(L)}$ and $C^{(n)}$. E.g., in binary classification:
  $$\delta^{(L)} = \frac{\partial c^{(n)}}{\partial z^{(L)}} = \frac{\partial\, \zeta\big((1 - 2y^{(n)})\, z^{(L)}\big)}{\partial z^{(L)}} = (1 - 2y^{(n)})\, \sigma\big((1 - 2y^{(n)})\, z^{(L)}\big)$$
Backward Pass II

- When $k < L$, we have
  $$\delta^{(k)}_j = \frac{\partial c^{(n)}}{\partial z^{(k)}_j} = \frac{\partial c^{(n)}}{\partial a^{(k)}_j} \cdot \frac{\partial a^{(k)}_j}{\partial z^{(k)}_j} = \frac{\partial c^{(n)}}{\partial a^{(k)}_j} \cdot \mathrm{act}'(z^{(k)}_j) = \Big(\sum_s \frac{\partial c^{(n)}}{\partial z^{(k+1)}_s} \cdot \frac{\partial z^{(k+1)}_s}{\partial a^{(k)}_j}\Big)\, \mathrm{act}'(z^{(k)}_j)$$
  $$= \Big(\sum_s \delta^{(k+1)}_s \cdot \frac{\partial \sum_i W^{(k+1)}_{i,s} a^{(k)}_i}{\partial a^{(k)}_j}\Big)\, \mathrm{act}'(z^{(k)}_j) = \Big(\sum_s \delta^{(k+1)}_s \cdot W^{(k+1)}_{j,s}\Big)\, \mathrm{act}'(z^{(k)}_j)$$
- Theorem (Chain Rule): let $g: \mathbb{R} \to \mathbb{R}^d$ and $f: \mathbb{R}^d \to \mathbb{R}$; then
  $$(f \circ g)'(x) = \nabla f(g(x))^\top \begin{bmatrix} g_1'(x) \\ \vdots \\ g_d'(x) \end{bmatrix}$$
Backward Pass III

$$\delta^{(k)}_j = \Big(\sum_s \delta^{(k+1)}_s \cdot W^{(k+1)}_{j,s}\Big)\, \mathrm{act}'(z^{(k)}_j)$$

- We can evaluate all the $\delta^{(k)}_j$'s starting from the deepest layer.
- The information propagates along a new kind of feedforward network: the same topology traversed in the reverse direction.
Backprop Algorithm (Minibatch Size M = 1)

- Input: $(x^{(n)}, y^{(n)})$ and $\Theta^{(t)}$
- Forward pass:
  - $a^{(0)} \leftarrow [1 \;\; x^{(n)\top}]^\top$
  - for $k \leftarrow 1$ to $L$: $z^{(k)} \leftarrow W^{(k)\top} a^{(k-1)}$; $a^{(k)} \leftarrow \mathrm{act}(z^{(k)})$
- Backward pass:
  - Compute the error signal $\delta^{(L)}$ (e.g., $(1 - 2y^{(n)})\, \sigma((1 - 2y^{(n)})\, z^{(L)})$ in binary classification)
  - for $k \leftarrow L-1$ down to $1$: $\delta^{(k)} \leftarrow \mathrm{act}'(z^{(k)}) \odot (W^{(k+1)} \delta^{(k+1)})$
- Return $\partial c^{(n)} / \partial W^{(k)} = a^{(k-1)} \otimes \delta^{(k)}$ for all $k$
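A direct NumPy transcription of this algorithm for ReLU hidden units and a sigmoid output in binary classification (a sketch under those assumptions; names are mine, and ties at z = 0 simply get derivative 0 here):

```python
import numpy as np

def backprop_single(x, y, Ws):
    """Gradients of c = softplus((1 - 2y) z_L) w.r.t. every W^(k).

    x: (D,) input; y: 0 or 1; Ws: weight matrices W^(1)..W^(L) with the
    bias absorbed, each of shape (prev_width + 1, width).
    """
    L = len(Ws)
    a = [np.concatenate(([1.0], x))]           # a^(0) <- [1 x]
    zs = []
    for k, W in enumerate(Ws):                 # forward pass
        z = W.T @ a[-1]
        zs.append(z)
        h = np.maximum(0.0, z) if k < L - 1 else z   # ReLU hidden, raw z at output
        a.append(np.concatenate(([1.0], h)))         # (a[L] itself is never used)
    s = 1.0 - 2.0 * y
    delta = s / (1.0 + np.exp(-s * zs[-1]))    # delta^(L) = (1-2y) sigma((1-2y) z^(L))
    grads = [None] * L
    for k in reversed(range(L)):               # backward pass
        grads[k] = np.outer(a[k], delta)       # dc/dW^(k) = a^(k-1) (outer) delta^(k)
        if k > 0:                              # drop the bias row when going back
            delta = (zs[k - 1] > 0).astype(float) * (Ws[k][1:] @ delta)
    return grads
```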

Backprop Algorithm (Minibatch Size M > 1)

- Input: $\{(x^{(n)}, y^{(n)})\}_{n=1}^M$ and $\Theta^{(t)}$
- Forward pass:
  - $A^{(0)} \leftarrow [a^{(0,1)} \cdots a^{(0,M)}]^\top$
  - for $k \leftarrow 1$ to $L$: $Z^{(k)} \leftarrow A^{(k-1)} W^{(k)}$; $A^{(k)} \leftarrow \mathrm{act}(Z^{(k)})$
- Backward pass:
  - Compute the error signals $\Delta^{(L)} = [\delta^{(L,1)} \cdots \delta^{(L,M)}]^\top$
  - for $k \leftarrow L-1$ down to $1$: $\Delta^{(k)} \leftarrow \mathrm{act}'(Z^{(k)}) \odot (\Delta^{(k+1)} W^{(k+1)\top})$
- Return $\partial c / \partial W^{(k)} = \sum_{n=1}^M a^{(k-1,n)} \otimes \delta^{(k,n)}$ for all $k$
- Speed up with GPUs? The matrix-matrix products benefit from:
  - Large width $D^{(k)}$ at each layer
  - Large batch size

Neuron Design

- The design of modern neurons is largely influenced by how an NN is trained.
- Maximum likelihood principle: $\arg\max_\Theta \log P(\mathbb{X}\,|\,\Theta) = \arg\min_\Theta \sum_i -\log P(y^{(i)}\,|\,x^{(i)}, \Theta)$
  - A universal cost function
  - Different output units for different $P(y\,|\,x)$
- Gradient-based optimization: during SGD, the gradient
  $$\frac{\partial c^{(n)}}{\partial W^{(k)}_{i,j}} = \frac{\partial c^{(n)}}{\partial z^{(k)}_j} \cdot \frac{\partial z^{(k)}_j}{\partial W^{(k)}_{i,j}} = \delta^{(k)}_j\, \frac{\partial z^{(k)}_j}{\partial W^{(k)}_{i,j}}$$
  should be sufficiently large before we get a satisfactory NN.
Negative Log Likelihood and Cross Entropy

- The cost function of most NNs: $\arg\max_\Theta \log P(\mathbb{X}\,|\,\Theta) = \arg\min_\Theta \sum_i -\log P(y^{(i)}\,|\,x^{(i)}, \Theta)$
- For NNs that output an entire distribution $\hat{P}(y\,|\,x)$, the problem can be equivalently described as minimizing the cross entropy (or KL divergence) from $\hat{P}$ to the empirical distribution of the data:
  $$\arg\min_{\hat{P}}\; \mathbb{E}_{(x,y) \sim \mathrm{Empirical}(\mathbb{X})}\big[-\log \hat{P}(y\,|\,x)\big]$$
- This provides a consistent way to define output units.
Sigmoid Units for Bernoulli Output Distributions

- In binary classification, we assume $P(y = 1\,|\,x) \sim \mathrm{Bernoulli}(\rho)$, with $y \in \{0,1\}$ and $\rho \in (0,1)$.
- Sigmoid output unit: $a^{(L)} = \hat\rho = \sigma(z^{(L)}) = \dfrac{\exp(z^{(L)})}{\exp(z^{(L)}) + 1}$
- Error signal:
  $$\delta^{(L)} = \frac{\partial c^{(n)}}{\partial z^{(L)}} = -\frac{\partial \log \hat{P}(y^{(n)}\,|\,x^{(n)}; \Theta)}{\partial z^{(L)}} = (1 - 2y^{(n)})\, \sigma\big((1 - 2y^{(n)})\, z^{(L)}\big)$$
- This is close to 0 only when $y^{(n)} = 1$ and $z^{(L)}$ is very positive, or $y^{(n)} = 0$ and $z^{(L)}$ is very negative.
- In other words, the loss $c^{(n)}$ saturates (becomes flat) only when $\hat\rho$ is "correct".
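A small check that this error signal is just the familiar residual $\hat\rho - y$ (my derivation check, not a claim from the slides):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

z = np.random.default_rng(0).normal(size=1000)
for y in (0, 1):
    delta = (1 - 2 * y) * sigmoid((1 - 2 * y) * z)  # formula from the slide
    assert np.allclose(delta, sigmoid(z) - y)       # equals rho_hat - y
```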

Softmax Units for Categorical Output Distributions I

- In multiclass classification, we can assume $P(y\,|\,x) \sim \mathrm{Categorical}(\rho)$, where $y, \rho \in \mathbb{R}^K$ and $\mathbf{1}^\top \rho = 1$.
- Softmax units: $a^{(L)}_j = \hat\rho_j = \mathrm{softmax}(z^{(L)})_j = \dfrac{\exp(z^{(L)}_j)}{\sum_{i=1}^K \exp(z^{(L)}_i)}$
- Actually, to define a Categorical distribution we only need $\rho_1, \dots, \rho_{K-1}$ ($\rho_K = 1 - \sum_{i=1}^{K-1} \rho_i$ can be discarded).
- We can alternatively define $K - 1$ output units (discarding $a^{(L)}_K = \hat\rho_K$):
  $$a^{(L)}_j = \hat\rho_j = \frac{\exp(z^{(L)}_j)}{\sum_{i=1}^{K-1} \exp(z^{(L)}_i) + 1},$$
  a direct generalization of $\sigma$ in binary classification.
- In practice, the two versions make little difference.
Softmax Units for Categorical Output Distributions II

- Now we have
  $$\delta^{(L)}_j = \frac{\partial c^{(n)}}{\partial z^{(L)}_j} = -\frac{\partial \log \hat{P}(y^{(n)}\,|\,x^{(n)}; \Theta)}{\partial z^{(L)}_j} = -\frac{\partial \log \prod_i \hat\rho_i^{\,\mathbb{1}(y^{(n)} = i)}}{\partial z^{(L)}_j}$$
- If $y^{(n)} = j$, then
  $$\delta^{(L)}_j = -\frac{\partial \log \hat\rho_j}{\partial z^{(L)}_j} = -\frac{1}{\hat\rho_j}\big(\hat\rho_j - \hat\rho_j^2\big) = \hat\rho_j - 1$$
  - $\delta^{(L)}_j$ is close to 0 only when $\hat\rho_j$ is "correct" ($\approx 1$); in this case $z^{(L)}_j$ dominates among all the $z^{(L)}_i$'s.
- If $y^{(n)} = i \neq j$, then
  $$\delta^{(L)}_j = -\frac{\partial \log \hat\rho_i}{\partial z^{(L)}_j} = -\frac{1}{\hat\rho_i}\big(-\hat\rho_i \hat\rho_j\big) = \hat\rho_j$$
  - Again, close to 0 only when $\hat\rho_j$ is "correct" ($\approx 0$).
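Both cases collapse to $\delta^{(L)} = \hat\rho - \mathrm{onehot}(y^{(n)})$; a quick numerical check (my sketch):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())            # shift for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
z = rng.normal(size=5)                 # K = 5 logits
y = 2                                  # true class index

def nll(zv):                           # c = -log rho_hat_y
    return -np.log(softmax(zv)[y])

eps = 1e-6
num = np.array([(nll(z + eps * e) - nll(z - eps * e)) / (2 * eps)
                for e in np.eye(5)])   # numerical gradient w.r.t. z
assert np.allclose(num, softmax(z) - np.eye(5)[y], atol=1e-5)
```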

Linear Units for Gaussian Means

- An NN can also output just one conditional statistic of $y$ given $x$.
- For example, we can assume $P(y\,|\,x) \sim \mathcal{N}(\mu, \Sigma)$ for regression. How do we design the output neurons if we want to predict the mean $\hat\mu$?
- Linear units: $a^{(L)} = \hat\mu = z^{(L)}$
- We have $\delta^{(L)} = \partial c^{(n)} / \partial z^{(L)} = -\partial \log \mathcal{N}(y^{(n)}; \hat\mu, \Sigma)\, /\, \partial z^{(L)}$. Letting $\Sigma = I$, maximizing the log-likelihood is equivalent to minimizing the SSE/MSE: $\delta^{(L)} = \partial \|y^{(n)} - z^{(L)}\|^2 / \partial z^{(L)}$ (see linear regression).
- Linear units do not saturate, so they pose little difficulty for gradient-based optimization.
Design Considerations

- Most units differ from each other only in their activation functions: $a^{(k)} = \mathrm{act}(z^{(k)}) = \mathrm{act}(W^{(k)\top} a^{(k-1)})$
- Why use the ReLU, $\mathrm{act}(z^{(k)}) = \max(0, z^{(k)})$, as the default hidden unit?
- Why not, for example, use the sigmoid for hidden units?
Vanishing Gradient Problem

- In the backward pass of backprop: $\delta^{(k)}_j = \big(\sum_s \delta^{(k+1)}_s \cdot W^{(k+1)}_{j,s}\big)\, \mathrm{act}'(z^{(k)}_j)$
- If $\mathrm{act}'(\cdot) = \sigma'(\cdot) < 1$, then $\delta^{(k)}_j$ becomes smaller and smaller during the backward pass:
  - The surface of the cost function becomes very flat at shallow layers.
  - This slows down the learning speed of the entire network, since the weights at deeper layers depend on those in shallow ones.
  - It also causes numeric problems, e.g., underflow.
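A toy illustration of the shrinkage (a sketch with made-up weights; since $\sigma'(z) \le 1/4$, the error signal decays roughly geometrically with depth):

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width = 20, 64
sig = lambda z: 1.0 / (1.0 + np.exp(-z))

delta = rng.normal(size=width)                   # pretend top-layer error signal
for k in range(depth):
    W = 0.1 * rng.normal(size=(width, width))
    z = rng.normal(size=width)
    delta = sig(z) * (1 - sig(z)) * (W @ delta)  # delta^(k) = sigma'(z) * (W delta^(k+1))
    if (k + 1) % 5 == 0:
        print(k + 1, np.linalg.norm(delta))      # the norm keeps shrinking
```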

ReLU I

$$\mathrm{act}'(z^{(k)}) = \begin{cases} 1 & \text{if } z^{(k)} > 0 \\ 0 & \text{otherwise} \end{cases}$$

- No vanishing gradients in $\delta^{(k)}_j = \big(\sum_s \delta^{(k+1)}_s \cdot W^{(k+1)}_{j,s}\big)\, \mathrm{act}'(z^{(k)}_j)$.
- What if $z^{(k)} = 0$? In practice, we usually assign 1 or 0 randomly; floating-point numbers are not precise anyway.
ReLU II

- Why piecewise linear? To avoid vanishing gradients, we could instead modify $\sigma(\cdot)$ to make it steeper in the middle, so that $\sigma'(\cdot) > 1$ there.
- But the second derivative $\mathrm{ReLU}''(\cdot)$ is 0 everywhere (except at the kink), which eliminates second-order effects and makes gradient-based optimization more useful (than, e.g., Newton methods).
- Problem: for neurons with $\delta^{(k)}_j = 0$, their weights $W^{(k)}_{:,j}$ will not be updated:
  $$\frac{\partial c^{(n)}}{\partial W^{(k)}_{i,j}} = \delta^{(k)}_j\, \frac{\partial z^{(k)}_j}{\partial W^{(k)}_{i,j}}$$
- Improvement?
Leaky/Parametric ReLU

- $\mathrm{act}(z^{(k)}) = \max(\alpha \cdot z^{(k)},\, z^{(k)})$ for some $\alpha \in \mathbb{R}$
- Leaky ReLU: $\alpha$ is set in advance (fixed during training), usually to a small value, or domain-specific.
- Example: absolute value rectification ($\alpha = -1$, giving $\mathrm{act}(z) = |z|$):
  - Used for object recognition from images.
  - Seeks features that are invariant under a polarity reversal of the input illumination.
- Parametric ReLU (PReLU): $\alpha$ is learned automatically by gradient descent.
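The whole family in a few lines of NumPy (a sketch; the random tie-breaking at exactly 0 mentioned earlier is not implemented here):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):   # alpha fixed in advance (alpha <= 1)
    return np.maximum(alpha * z, z)

def abs_rectifier(z):            # the alpha = -1 case: max(-z, z) = |z|
    return np.maximum(-z, z)

def prelu(z, alpha):             # same formula; alpha is a learned parameter
    return np.maximum(alpha * z, z)
```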

Maxout Units I

- Maxout units generalize the ReLU variants further: $\mathrm{act}(z^{(k)})_j = \max_s z^{(k)}_{j,s}$
- $a^{(k-1)}$ is linearly mapped to multiple groups of $z^{(k)}_{j,:}$'s.
- A maxout unit learns a piecewise linear, convex activation function automatically; this covers both leaky ReLU and PReLU.
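A minimal maxout layer in NumPy (a sketch; the column-grouping convention is mine, one of several used in practice):

```python
import numpy as np

def maxout_layer(a_prev, W, num_pieces):
    """a_prev: (D_prev,); W: (D_prev, D_k * num_pieces).

    Each output unit j takes the max over its own group of
    num_pieces linear pre-activations z_{j,1..S}.
    """
    z = W.T @ a_prev                 # all pieces, shape (D_k * num_pieces,)
    z = z.reshape(-1, num_pieces)    # (D_k, S): one row of pieces per unit
    return z.max(axis=1)             # act(z)_j = max_s z_{j,s}

# Example: 3 inputs -> 4 maxout units, each the max of 2 linear pieces
rng = np.random.default_rng(0)
print(maxout_layer(rng.normal(size=3), rng.normal(size=(3, 8)), num_pieces=2))
```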

Maxout Units II

- How do we train an NN with maxout units? Given a training example $(x^{(n)}, y^{(n)})$, update the weights that correspond to the winning $z^{(k)}_{j,s}$'s for this example.
- Different examples may update different parts of the network.
- This offers some "redundancy" that helps resist the catastrophic forgetting phenomenon [2], where an NN may forget how to perform tasks it was trained on in the past.
Maxout Units III

- Cons? Each maxout unit is now parametrized by multiple weight vectors instead of just one.
- It therefore typically requires more training data; otherwise, regularization is needed.
Architecture Design

- Thin-and-deep or fat-and-shallow?
- Theorem (Universal Approximation Theorem [3, 4]): a feedforward network with at least one hidden layer can approximate any continuous function (on a closed and bounded subset of $\mathbb{R}^D$), or any function mapping from one finite-dimensional discrete space to another.
- In short, a feedforward network with a single hidden layer is sufficient to represent any function. So why go deep?
Exponential Gain in Number of Hidden Units

- Functions representable with a deep rectifier NN can require an exponential number of hidden units in a shallow NN [5].
- Deep NNs are therefore easier to learn given a fixed amount of data.
- Example: an NN with absolute-value rectification units. Each hidden unit specifies where to fold the input space in order to create mirror responses (on both sides of the absolute value). By composing these folding operations, we obtain an exponentially large number of piecewise linear regions that can capture all kinds of regular (e.g., repeating) patterns.
Encoding Prior Knowledge

- Choosing a deep model also encodes a very general belief that the function we want to learn should involve a composition of several simpler functions. If this belief is valid, deep NNs give better generalizability.
- When is it valid?
  - Representation learning point of view: the learning problem consists of discovering a set of underlying factors, which can in turn be described using other, simpler underlying factors.
  - Computer program point of view: the function to learn is a computer program consisting of multiple steps, where each step makes use of the previous step's output; intermediate outputs can be counters or pointers for internal processing.
Width & Depth

[Figure-only slide on tuning network width and depth.]

References

[1] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pages 177-186. Springer, 2010.
[2] Ian J. Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211, 2013.
[3] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359-366, 1989.
[4] Moshe Leshno, Vladimir Ya Lin, Allan Pinkus, and Shimon Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861-867, 1993.
[5] Guido F. Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems, pages 2924-2932, 2014.