
Neural Networks: Design

Shan-Hung Wu (shwu@cs.nthu.edu.tw)
Department of Computer Science, National Tsing Hua University, Taiwan
Machine Learning


Example: Output Distribution

For the XOR example, the output activations are computed as a^(2) = σ(A^(1) w^(2)), where A^(1) stacks the (bias-augmented) hidden-layer activations of the four XOR inputs and w^(2) is the output weight vector. Thresholding gives the predictions ŷ = 1(a^(2) > 0.5) = [0, 1, 1, 0]^T, i.e., the XOR labels. (The explicit matrix arithmetic shown on the slide is not reproduced here.) But how do we train W^(1) and w^(2) from examples?
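To make the forward computation concrete, here is a minimal NumPy sketch of the XOR forward pass in the same bias-augmented form. The weight values are an illustrative choice that happens to realize XOR with sigmoid units; they are not necessarily the values used on the slide.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The four XOR inputs, each augmented with a leading 1 for the bias term.
X = np.array([[1, 0, 0],
              [1, 0, 1],
              [1, 1, 0],
              [1, 1, 1]], dtype=float)

# Illustrative weights (not the slide's): the first hidden unit approximates
# OR(x1, x2), the second approximates AND(x1, x2). Rows index [bias, x1, x2],
# columns index hidden units.
W1 = np.array([[-10.0, -30.0],
               [ 20.0,  20.0],
               [ 20.0,  20.0]])

A1 = sigmoid(X @ W1)                      # hidden activations, shape (4, 2)
A1 = np.hstack([np.ones((4, 1)), A1])     # bias-augmented A^(1), shape (4, 3)

w2 = np.array([-10.0, 20.0, -20.0])       # output weights: roughly OR AND (NOT AND) = XOR
a2 = sigmoid(A1 @ w2)                     # a^(2) = sigma(A^(1) w^(2))
y_hat = (a2 > 0.5).astype(int)            # y_hat = 1(a^(2) > 0.5)

print(y_hat)                              # [0 1 1 0], the XOR labels
```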

Outline

1 The Basics (Example: Learning the XOR)
2 Training (Back Propagation)
3 Neuron Design (Cost Function & Output Neurons; Hidden Neurons)
4 Architecture Design (Architecture Tuning)

Training an NN

Given examples X = {(x^(i), y^(i))}_{i=1}^N, how do we learn the parameters Θ = {W^(1), ..., W^(L)}?

Most NNs are trained by maximum likelihood by default (assuming i.i.d. examples):

  argmax_Θ log P(X | Θ)
    = argmin_Θ −log P(X | Θ)
    = argmin_Θ Σ_i −log P(x^(i), y^(i) | Θ)
    = argmin_Θ Σ_i [−log P(y^(i) | x^(i), Θ) − log P(x^(i) | Θ)]
    = argmin_Θ Σ_i −log P(y^(i) | x^(i), Θ)      (the term −log P(x^(i) | Θ) drops out because the model parametrizes only P(y | x), so it does not depend on Θ)
    = argmin_Θ Σ_i C^(i)(Θ)

The minimizer Θ̂ is a consistent (asymptotically unbiased) estimator of the "true" Θ*, which makes maximum likelihood a good default for large N.
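As a concrete instance of argmin_Θ Σ_i C^(i)(Θ), the sketch below fits a single Bernoulli parameter by minimizing the summed per-example negative log-likelihood over a grid; for large N the minimizer lands near the true parameter. The data, the grid, and the random seed are made up for illustration.

```python
import numpy as np

# Toy i.i.d. data: y^(i) ~ Bernoulli(rho_true). Here Theta is just the scalar rho.
rng = np.random.default_rng(0)
rho_true = 0.3
y = rng.binomial(1, rho_true, size=10000)

# C^(i)(rho) = -log P(y^(i) | rho); minimize the sum over a grid of candidate rho's.
rhos = np.linspace(0.01, 0.99, 981)
total_nll = np.array([-(y * np.log(r) + (1 - y) * np.log(1 - r)).sum() for r in rhos])
rho_hat = rhos[np.argmin(total_nll)]

print(rho_hat, y.mean())   # both should be close to rho_true = 0.3 for this large N
```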

Example: Binary Classification

Assume y given x follows Bernoulli(ρ), i.e., P(y = 1 | x) = ρ, where x ∈ R^D and y ∈ {0, 1}. The output a^(L) = ρ̂ = σ(z^(L)) is the predicted distribution.

The cost function C^(i)(Θ) can be written as

  C^(i)(Θ) = −log P(y^(i) | x^(i); Θ)
    = −log[(a^(L))^{y^(i)} (1 − a^(L))^{1 − y^(i)}]
    = −log[σ(z^(L))^{y^(i)} (1 − σ(z^(L)))^{1 − y^(i)}]
    = −log σ((2y^(i) − 1) z^(L))      (using 1 − σ(u) = σ(−u))
    = ζ((1 − 2y^(i)) z^(L)),

where ζ(u) = log(1 + e^u) is the softplus function.
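A quick numerical check of the last step, as a sketch (the test values of y and z^(L) are arbitrary): the Bernoulli negative log-likelihood written with the sigmoid equals the softplus of (1 − 2y) z^(L).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softplus(u):
    return np.log1p(np.exp(u))

for y in (0, 1):
    for z in (-3.0, -0.5, 0.0, 1.2, 4.0):
        nll     = -np.log(sigmoid(z) ** y * (1 - sigmoid(z)) ** (1 - y))
        compact = softplus((1 - 2 * y) * z)
        assert np.isclose(nll, compact), (y, z, nll, compact)
print("C^(i) = -log Bernoulli likelihood = softplus((1 - 2y) z^(L)) holds numerically")
```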

Optimization Algorithm

Most NNs use SGD to solve the problem argmin_Θ Σ_i C^(i)(Θ):
- Fast convergence in time [1]
- Supports (GPU-based) parallelism
- Supports online learning
- Easy to implement

(Mini-Batched) Stochastic Gradient Descent (SGD)
  Initialize Θ^(0) randomly;
  Repeat until convergence {
    Randomly partition the training set X into minibatches of size M;
    For each minibatch: Θ^(t+1) ← Θ^(t) − η ∇_Θ Σ_{i=1}^M C^(i)(Θ^(t));
  }

How to compute ∇_Θ Σ_i C^(i)(Θ^(t)) efficiently? There could be a huge number of W^(k)_{i,j}'s in Θ.
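Below is a minimal sketch of the minibatched SGD loop above for a generic parameter vector and a user-supplied gradient function. The model (logistic regression), the learning rate, the fixed epoch count, and the helper name grad_fn are assumptions for illustration, not part of the slide.

```python
import numpy as np

def sgd(theta0, grad_fn, X, Y, M=32, eta=0.05, epochs=30, seed=0):
    """Minibatched SGD: theta <- theta - eta * grad of the summed minibatch cost."""
    rng = np.random.default_rng(seed)
    theta = theta0.copy()
    N = len(X)
    for _ in range(epochs):                    # "repeat until convergence" (fixed epoch count here)
        perm = rng.permutation(N)              # randomly partition X into minibatches of size M
        for start in range(0, N, M):
            batch = perm[start:start + M]
            theta -= eta * grad_fn(theta, X[batch], Y[batch])
    return theta

# Example: logistic regression, where the gradient of sum_i C^(i) is X^T (sigma(X theta) - y).
def grad_fn(theta, Xb, yb):
    p = 1.0 / (1.0 + np.exp(-Xb @ theta))
    return Xb.T @ (p - yb)

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-X @ true_theta))).astype(float)

print(sgd(np.zeros(3), grad_fn, X, y))         # roughly recovers true_theta
```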

Outline

1 The Basics (Example: Learning the XOR)
2 Training (Back Propagation)
3 Neuron Design (Cost Function & Output Neurons; Hidden Neurons)
4 Architecture Design (Architecture Tuning)

Back Propagation

SGD update: Θ^(t+1) ← Θ^(t) − η ∇_Θ Σ_{n=1}^M C^(n)(Θ^(t)).

We have ∇_Θ Σ_n C^(n)(Θ^(t)) = Σ_n ∇_Θ C^(n)(Θ^(t)).

Let c^(n) = C^(n)(Θ^(t)); our goal is to evaluate ∂c^(n) / ∂W^(k)_{i,j} for all i, j, k, and n.

Back propagation (or simply backprop) is an efficient way to evaluate multiple partial derivatives at once, assuming the partial derivatives share some common evaluation steps.

By the chain rule, we have

  ∂c^(n) / ∂W^(k)_{i,j} = (∂c^(n) / ∂z^(k)_j) · (∂z^(k)_j / ∂W^(k)_{i,j}).

Forward Pass

The second term, ∂z^(k)_j / ∂W^(k)_{i,j}:

When k = 1, we have z^(1)_j = Σ_i W^(1)_{i,j} x^(n)_i, so

  ∂z^(1)_j / ∂W^(1)_{i,j} = x^(n)_i.

Otherwise (k > 1), we have z^(k)_j = Σ_i W^(k)_{i,j} a^(k−1)_i, so

  ∂z^(k)_j / ∂W^(k)_{i,j} = a^(k−1)_i.

Hence we can obtain the second terms of all the ∂c^(n) / ∂W^(k)_{i,j}'s starting from the shallowest layer.
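A minimal forward-pass sketch matching the notation above. Sigmoid activations, the layer sizes, and the random weights are assumptions for illustration; bias terms are folded into the weight matrices by prepending a 1 to each activation vector, as in a^(0) = [1 x^(n)]^T.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Ws):
    """Return the pre-activations z^(k) and the activations a^(k), k = 0..L."""
    a = np.concatenate(([1.0], x))             # a^(0) = [1, x^(n)]^T (bias folded in)
    zs, activations = [], [a]
    for k, W in enumerate(Ws, start=1):
        z = W.T @ a                            # z^(k)_j = sum_i W^(k)_{i,j} a^(k-1)_i
        a = sigmoid(z)
        if k < len(Ws):
            a = np.concatenate(([1.0], a))     # keep a bias entry for hidden layers
        zs.append(z)
        activations.append(a)
    return zs, activations

rng = np.random.default_rng(0)
# 2 inputs (+ bias) -> 4 hidden units (+ bias) -> 1 output unit
Ws = [rng.normal(size=(3, 4)), rng.normal(size=(5, 1))]
zs, acts = forward(np.array([0.2, -1.3]), Ws)
print([z.shape for z in zs])                   # [(4,), (1,)]
# Note: dz^(k)_j / dW^(k)_{i,j} = a^(k-1)_i, the previous layer's activation.
```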

Backward Pass I

Conversely, we can obtain the first terms of all the ∂c^(n) / ∂W^(k)_{i,j}'s starting from the deepest layer.

Define the error signal δ^(k)_j as the first term, ∂c^(n) / ∂z^(k)_j.

When k = L, the evaluation varies from task to task, depending on the definitions of act^(L) and C^(n). E.g., in binary classification we have

  δ^(L) = ∂c^(n) / ∂z^(L) = ∂ζ((1 − 2y^(n)) z^(L)) / ∂z^(L) = σ((1 − 2y^(n)) z^(L)) · (1 − 2y^(n)).
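A quick finite-difference check of this output error signal, as a sketch (the test values of y and z^(L) are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softplus(u):
    return np.log1p(np.exp(u))

def cost(z, y):                  # c^(n) = softplus((1 - 2y) z^(L)) for binary classification
    return softplus((1 - 2 * y) * z)

def delta_L(z, y):               # closed form: (1 - 2y) * sigma((1 - 2y) z^(L))
    return (1 - 2 * y) * sigmoid((1 - 2 * y) * z)

eps = 1e-6
for y in (0, 1):
    for z in (-2.0, 0.3, 1.7):
        numeric = (cost(z + eps, y) - cost(z - eps, y)) / (2 * eps)
        assert np.isclose(numeric, delta_L(z, y), atol=1e-5)
print("delta^(L) matches the finite-difference derivative of c^(n)")
```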

Backward Pass II

When k < L, we have

  δ^(k)_j = ∂c^(n) / ∂z^(k)_j
    = (∂c^(n) / ∂a^(k)_j) · (∂a^(k)_j / ∂z^(k)_j)
    = (∂c^(n) / ∂a^(k)_j) · act'(z^(k)_j)
    = (Σ_s (∂c^(n) / ∂z^(k+1)_s) · (∂z^(k+1)_s / ∂a^(k)_j)) · act'(z^(k)_j)
    = (Σ_s δ^(k+1)_s · ∂(Σ_i W^(k+1)_{i,s} a^(k)_i) / ∂a^(k)_j) · act'(z^(k)_j)
    = (Σ_s δ^(k+1)_s · W^(k+1)_{j,s}) · act'(z^(k)_j).

Theorem (Chain Rule). Let g: R → R^d and f: R^d → R. Then (f ∘ g)'(x) = f'(g(x)) g'(x) = ∇f(g(x))^T [g'_1(x), …, g'_d(x)]^T.

Backward Pass III

  δ^(k)_j = (Σ_s δ^(k+1)_s · W^(k+1)_{j,s}) · act'(z^(k)_j)

We can evaluate all the δ^(k)_j's starting from the deepest layer. The information propagates along a new kind of feedforward network that runs in the reverse direction of the original one, from the deepest layer back to the shallowest.

Backprop Algorithm (Minibatch Size M = 1)

Input: (x^(n), y^(n)) and Θ^(t)

Forward pass:
  a^(0) ← [1  x^(n)]^T;
  for k ← 1 to L do
    z^(k) ← (W^(k))^T a^(k−1);
    a^(k) ← act(z^(k));
  end

Backward pass:
  Compute the error signal δ^(L) (e.g., (1 − 2y^(n)) σ((1 − 2y^(n)) z^(L)) in binary classification);
  for k ← L − 1 to 1 do
    δ^(k) ← act'(z^(k)) ⊙ (W^(k+1) δ^(k+1));
  end

Return ∂c^(n) / ∂W^(k) = a^(k−1) ⊗ δ^(k) for all k (⊙ is the elementwise product, ⊗ the outer product).
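Below is a minimal NumPy rendering of this single-example algorithm, with sigmoid activations, a binary-classification output, and a finite-difference check of one weight gradient. The layer sizes are arbitrary and bias terms are omitted for brevity, so this is a sketch of the procedure rather than a faithful transcription of the slide.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softplus(u):
    return np.log1p(np.exp(u))

def forward(x, Ws):
    """Forward pass; returns pre-activations z^(k) and activations a^(k) (no bias terms)."""
    a, zs, acts = x, [], [x]
    for W in Ws:
        z = W.T @ a
        a = sigmoid(z)
        zs.append(z)
        acts.append(a)
    return zs, acts

def cost(x, y, Ws):
    """Binary-classification cost c^(n) = softplus((1 - 2y) z^(L))."""
    zs, _ = forward(x, Ws)
    return float(softplus((1 - 2 * y) * zs[-1])[0])

def backprop(x, y, Ws):
    """Return dc^(n)/dW^(k) for every layer, following the single-example algorithm."""
    zs, acts = forward(x, Ws)
    delta = (1 - 2 * y) * sigmoid((1 - 2 * y) * zs[-1])        # error signal delta^(L)
    grads = [None] * len(Ws)
    for k in reversed(range(len(Ws))):                          # deepest layer first
        grads[k] = np.outer(acts[k], delta)                     # dc/dW = a^(k-1) (outer) delta^(k)
        if k > 0:
            sp = sigmoid(zs[k - 1]) * (1 - sigmoid(zs[k - 1]))  # act'(z^(k))
            delta = sp * (Ws[k] @ delta)                        # delta^(k) = act'(z^(k)) * (W^(k+1) delta^(k+1))
    return grads

# Check one entry of the gradient against a finite difference of the cost.
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(3, 4)), rng.normal(size=(4, 1))]
x, y = rng.normal(size=3), 1
grads = backprop(x, y, Ws)

eps, (i, j) = 1e-6, (2, 1)
Wp = [W.copy() for W in Ws]; Wp[0][i, j] += eps
Wm = [W.copy() for W in Ws]; Wm[0][i, j] -= eps
numeric = (cost(x, y, Wp) - cost(x, y, Wm)) / (2 * eps)
print(numeric, grads[0][i, j])            # the two numbers agree closely
```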

Backprop Algorithm (Minibatch Size M > 1)

Input: {(x^(n), y^(n))}_{n=1}^M and Θ^(t)

Forward pass:
  A^(0) ← [a^(0,1) ··· a^(0,M)]^T;
  for k ← 1 to L do
    Z^(k) ← A^(k−1) W^(k);
    A^(k) ← act(Z^(k));
  end

Backward pass:
  Compute the error signals ∆^(L) = [δ^(L,1) ··· δ^(L,M)]^T;
  for k ← L − 1 to 1 do
    ∆^(k) ← act'(Z^(k)) ⊙ (∆^(k+1) (W^(k+1))^T);
  end

Return Σ_{n=1}^M ∂c^(n) / ∂W^(k) = Σ_{n=1}^M a^(k−1,n) ⊗ δ^(k,n) for all k.

Speed up with GPUs? The matrix products above parallelize well, especially with a large width D^(k) at each layer and a large batch size.
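And a batched rendering of the same computation in matrix form, under the same assumptions as the single-example sketch above (sigmoid activations, biases omitted):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_batch(X, delta_L_fn, Ws):
    """Matrix-form backprop for a minibatch X of shape (M, D^(0)); returns summed gradients."""
    A, Zs, As = X, [], [X]
    for W in Ws:                                 # forward pass, one row per example
        Z = A @ W                                # Z^(k) = A^(k-1) W^(k)
        A = sigmoid(Z)
        Zs.append(Z)
        As.append(A)
    Delta = delta_L_fn(Zs[-1])                   # error signals for the whole batch
    grads = [None] * len(Ws)
    for k in reversed(range(len(Ws))):
        grads[k] = As[k].T @ Delta               # sums a^(k-1,n) (outer) delta^(k,n) over the batch
        if k > 0:
            sp = sigmoid(Zs[k - 1]) * (1 - sigmoid(Zs[k - 1]))
            Delta = sp * (Delta @ Ws[k].T)       # Delta^(k) = act'(Z^(k)) * (Delta^(k+1) W^(k+1)^T)
    return grads

# Usage: binary classification, where Delta^(L) = (1 - 2y) * sigma((1 - 2y) Z^(L)) row-wise.
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(3, 4)), rng.normal(size=(4, 1))]
X = rng.normal(size=(8, 3))
y = rng.integers(0, 2, size=(8, 1)).astype(float)
grads = backprop_batch(X, lambda ZL: (1 - 2 * y) * sigmoid((1 - 2 * y) * ZL), Ws)
print([g.shape for g in grads])                  # [(3, 4), (4, 1)], same shapes as the W^(k)'s
```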

Outline

1 The Basics (Example: Learning the XOR)
2 Training (Back Propagation)
3 Neuron Design (Cost Function & Output Neurons; Hidden Neurons)
4 Architecture Design (Architecture Tuning)

Neuron Design

The design of modern neurons is largely influenced by how an NN is trained.

Maximum likelihood principle:

  argmax_Θ log P(X | Θ) = argmin_Θ Σ_i −log P(y^(i) | x^(i), Θ)

- A universal cost function
- Different output units for different P(y | x)

Gradient-based optimization: during SGD, the gradient

  ∂c^(n) / ∂W^(k)_{i,j} = (∂c^(n) / ∂z^(k)_j) · (∂z^(k)_j / ∂W^(k)_{i,j}) = δ^(k)_j · (∂z^(k)_j / ∂W^(k)_{i,j})

should be sufficiently large before we get a satisfactory NN.

Outline

1 The Basics (Example: Learning the XOR)
2 Training (Back Propagation)
3 Neuron Design (Cost Function & Output Neurons; Hidden Neurons)
4 Architecture Design (Architecture Tuning)

Negative Log-Likelihood and Cross Entropy

The cost function of most NNs:

  argmax_Θ log P(X | Θ) = argmin_Θ Σ_i −log P(y^(i) | x^(i), Θ)

For NNs that output an entire distribution P̂(y | x), the problem can be equivalently described as minimizing the cross entropy (or KL divergence) from P̂ to the empirical distribution of the data:

  argmin_P̂ −E_{(x,y)~Empirical(X)} [log P̂(y | x)]

This provides a consistent way to define output units.
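To make the equivalence concrete, the sketch below checks that the average per-example negative log-likelihood equals the cross entropy −E_{(x,y)~Empirical(X)}[log P̂(y | x)] computed from empirical frequencies. The tiny dataset and the model probabilities are made up for illustration.

```python
import numpy as np
from collections import Counter

# A tiny dataset of (x, y) pairs and a made-up model P_hat(y | x).
data = [("a", 1), ("a", 0), ("a", 1), ("b", 1), ("b", 0), ("b", 0)]
p_hat = {("a", 1): 0.7, ("a", 0): 0.3, ("b", 1): 0.4, ("b", 0): 0.6}

# Average negative log-likelihood over the examples.
avg_nll = np.mean([-np.log(p_hat[(x, y)]) for x, y in data])

# Cross entropy under the empirical distribution of (x, y).
counts = Counter(data)
cross_entropy = -sum(c / len(data) * np.log(p_hat[xy]) for xy, c in counts.items())

assert np.isclose(avg_nll, cross_entropy)
print(avg_nll, cross_entropy)
```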

Sigmoid Units for Bernoulli Output Distributions

In binary classification, we assume y given x follows Bernoulli(ρ), with y ∈ {0, 1} and ρ ∈ (0, 1).

Sigmoid output unit:

  a^(L) = ρ̂ = σ(z^(L)) = exp(z^(L)) / (exp(z^(L)) + 1)

The error signal is

  δ^(L) = ∂c^(n) / ∂z^(L) = ∂[−log P̂(y^(n) | x^(n); Θ)] / ∂z^(L) = (1 − 2y^(n)) σ((1 − 2y^(n)) z^(L)),

which is close to 0 only when y^(n) = 1 and z^(L) is large and positive, or when y^(n) = 0 and z^(L) is large and negative. In other words, the loss c^(n) saturates (becomes flat) only when ρ̂ is "correct".
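A small numeric illustration of this saturation behavior (the z^(L) values are arbitrary): the error signal δ^(L) is near zero only when the sign of z^(L) agrees with the label.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

zs = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
for y in (0, 1):
    delta = (1 - 2 * y) * sigmoid((1 - 2 * y) * zs)
    print(f"y={y}:", np.round(delta, 4))
# y=0: [ 0.0025  0.1192  0.5     0.8808  0.9975]  -> near 0 only for very negative z
# y=1: [-0.9975 -0.8808 -0.5    -0.1192 -0.0025]  -> near 0 only for very positive z
```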

Softmax Units for Categorical Output Distributions I

In multiclass classification, we can assume that P(y | x) ~ Categorical(ρ), where y, ρ ∈ R^K and 1^T ρ = 1.

Softmax units:

  a^(L)_j = ρ̂_j = softmax(z^(L))_j = exp(z^(L)_j) / Σ_{i=1}^K exp(z^(L)_i)

Actually, to define a Categorical distribution we only need ρ_1, …, ρ_{K−1} (ρ_K = 1 − Σ_{i=1}^{K−1} ρ_i can be discarded). We can therefore alternatively define K − 1 output units (discarding a^(L)_K = ρ̂_K):

  a^(L)_j = ρ̂_j = exp(z^(L)_j) / (Σ_{i=1}^{K−1} exp(z^(L)_i) + 1),

which is a direct generalization of σ in binary classification. In practice, the two versions make little difference.
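A minimal sketch of the two parameterizations. The max-subtraction in the K-output version is a standard numerical-stability trick, an implementation detail not stated on the slide.

```python
import numpy as np

def softmax(z):
    """Standard K-output softmax: rho_j = exp(z_j) / sum_i exp(z_i)."""
    z = z - np.max(z)                 # stabilizer; does not change the result
    e = np.exp(z)
    return e / e.sum()

def softmax_k_minus_1(z):
    """K-1 logits: rho_j = exp(z_j) / (sum_{i<K} exp(z_i) + 1); rho_K is the remainder."""
    e = np.exp(z)
    rho_rest = e / (e.sum() + 1.0)
    return np.append(rho_rest, 1.0 - rho_rest.sum())

z = np.array([1.0, -0.5, 2.0])
print(softmax(np.append(z, 0.0)))     # K logits with z_K fixed to 0 ...
print(softmax_k_minus_1(z))           # ... give the same distribution as the K-1 version
```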

Softmax Units for Categorical Output Distributions II

Now we have

  δ^(L)_j = ∂c^(n) / ∂z^(L)_j = ∂[−log P̂(y^(n) | x^(n); Θ)] / ∂z^(L)_j = ∂[−log Π_i ρ̂_i^{1(y^(n) = i)}] / ∂z^(L)_j.

If y^(n) = j, then

  δ^(L)_j = −∂ log ρ̂_j / ∂z^(L)_j = −(1 / ρ̂_j)(ρ̂_j − ρ̂_j²) = ρ̂_j − 1.

δ^(L)_j is close to 0 only when ρ̂_j is "correct", i.e., when z^(L)_j dominates among all the z^(L)_i's.

If y^(n) = i ≠ j, then

  δ^(L)_j = −∂ log ρ̂_i / ∂z^(L)_j = −(1 / ρ̂_i)(−ρ̂_i ρ̂_j) = ρ̂_j.

Again, this is close to 0 only when ρ̂_j is "correct" (here, close to 0).
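A finite-difference check of these two cases, as a sketch (the logits and the label are arbitrary test values): for the softmax output with negative log-likelihood, δ^(L)_j = ρ̂_j − 1(y^(n) = j).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cost(z, label):                       # c^(n) = -log rho_hat_{label}
    return -np.log(softmax(z)[label])

z, label, eps = np.array([0.5, -1.0, 2.0]), 2, 1e-6
rho = softmax(z)

for j in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[j] += eps
    zm[j] -= eps
    numeric = (cost(zp, label) - cost(zm, label)) / (2 * eps)
    closed = rho[j] - (1.0 if j == label else 0.0)   # rho_hat_j - 1 if y = j, else rho_hat_j
    assert np.isclose(numeric, closed, atol=1e-5)
print("delta^(L)_j = rho_hat_j - 1(y = j) confirmed numerically")
```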

Linear Units for Gaussian Means

An NN can also output just one conditional statistic of y given x. For example, we can assume P(y | x) ~ N(µ, Σ) for regression. How do we design the output neurons if we want to predict the mean µ̂?

Linear units: a^(L) = µ̂ = z^(L).
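A short sketch of why linear output units pair naturally with maximum likelihood under a Gaussian assumption: with a fixed identity covariance (an assumption made here for illustration), the per-example negative log-likelihood of y given the predicted mean µ̂ = z^(L) is a squared error plus a constant, so minimizing the NLL is the same as minimizing squared error.

```python
import numpy as np

def gaussian_nll(y, mu_hat):
    """Negative log-likelihood of y under N(mu_hat, I)."""
    D = y.shape[0]
    return 0.5 * np.sum((y - mu_hat) ** 2) + 0.5 * D * np.log(2 * np.pi)

y      = np.array([1.0, -0.5])
mu_hat = np.array([0.8,  0.1])      # a^(L) = mu_hat = z^(L) from a linear output layer

nll = gaussian_nll(y, mu_hat)
sq  = 0.5 * np.sum((y - mu_hat) ** 2)
assert np.isclose(nll - sq, 0.5 * len(y) * np.log(2 * np.pi))
print("NLL = 0.5 * ||y - mu_hat||^2 + const, so minimizing NLL = minimizing squared error")
```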
