
Deep Feedforward Networks: Lecture slides for Chapter 6 of Deep Learning



  1. Deep Feedforward Networks Lecture slides for Chapter 6 of Deep Learning www.deeplearningbook.org Ian Goodfellow Last updated 2016-10-04

  2. Roadmap • Example: Learning XOR • Gradient-Based Learning • Hidden Units • Architecture Design • Back-Propagation (Goodfellow 2017)

  3. XOR is not linearly separable [Figure 6.1, left: the four XOR points plotted in the original x space, with axes x1 and x2; no straight line separates the class-0 points from the class-1 points] (Goodfellow 2017)

  4. Rectified Linear Activation: g(z) = max{0, z} [Figure 6.3: plot of the rectifier, zero for z < 0 and linear with slope 1 for z > 0] (Goodfellow 2017)
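A one-line NumPy version of this activation, with illustrative input values:

```python
import numpy as np

def relu(z):
    """Rectified linear activation g(z) = max{0, z}, applied elementwise."""
    return np.maximum(0.0, z)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))   # [0.  0.  0.  1.5 3. ]
```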

  5. Network Diagrams [Figure 6.2: the same one-hidden-layer network drawn two ways: with every unit shown (inputs x1, x2, hidden units h1, h2, output y) and in compact style (layers x, h, y connected by weight matrices W and w)] (Goodfellow 2017)

  6. Solving XOR: f(x; W, c, w, b) = wᵀ max{0, Wᵀx + c} + b (6.3), with W = [1 1; 1 1] (6.4), c = (0, −1)ᵀ (6.5), w = (1, −2)ᵀ (6.6), and b = 0 (Goodfellow 2017)

  7. Solving XOR [Figure 6.1: left, the original x space, where the XOR points are not linearly separable; right, the learned h space, where the hidden layer maps the points so that a single line separates the two classes] (Goodfellow 2017)
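The solution above can be checked directly. A minimal NumPy sketch (the function name xor_net is illustrative) evaluates f(x; W, c, w, b) = wᵀ max{0, Wᵀx + c} + b with the values from equations 6.4-6.6 and b = 0 on all four XOR inputs:

```python
import numpy as np

W = np.array([[1.0, 1.0],
              [1.0, 1.0]])        # eq. 6.4
c = np.array([0.0, -1.0])         # eq. 6.5
w = np.array([1.0, -2.0])         # eq. 6.6
b = 0.0

def xor_net(x):
    """f(x; W, c, w, b) = w^T max{0, W^T x + c} + b  (eq. 6.3)."""
    h = np.maximum(0.0, W.T @ x + c)   # hidden representation in the learned h space
    return w @ h + b

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, xor_net(np.array(x, dtype=float)))   # 0, 1, 1, 0: XOR
```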

  8. Roadmap • Example: Learning XOR • Gradient-Based Learning • Hidden Units • Architecture Design • Back-Propagation (Goodfellow 2017)

  9. Gradient-Based Learning • Specify a model and a cost • Design the model and cost so that the cost is smooth • Minimize the cost using gradient descent or related techniques (Goodfellow 2017)

  10. Conditional Distributions and Cross-Entropy: J(θ) = −E_{x,y∼p̂_data} log p_model(y | x) (6.12) (Goodfellow 2017)
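This cost is just the average negative log-likelihood of the training data under the model. A minimal NumPy sketch for the binary (Bernoulli/sigmoid) case; the helper names and the example logits and targets are illustrative only:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(y, p):
    """Estimate J(theta) = -E_{x,y ~ p_data} log p_model(y | x) on a batch.
    For a Bernoulli output, log p_model(y | x) = y log p + (1 - y) log(1 - p)."""
    eps = 1e-12                          # guard against log(0)
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

z = np.array([2.0, -1.0, 0.5])           # logits produced by the network
y = np.array([1.0, 0.0, 1.0])            # binary targets
print(neg_log_likelihood(y, sigmoid(z)))
```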

  11. Output Types (Output Type | Output Distribution | Output Layer | Cost Function):
      Binary | Bernoulli | Sigmoid | Binary cross-entropy
      Discrete | Multinoulli | Softmax | Discrete cross-entropy
      Continuous | Gaussian | Linear | Gaussian cross-entropy (MSE)
      Continuous | Mixture of Gaussians | Mixture Density | Cross-entropy
      Continuous | Arbitrary | Various | See Part III: GAN, VAE, FVBN
      (Goodfellow 2017)
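For the Discrete/Multinoulli/Softmax row, a small sketch of a numerically stable softmax with its cross-entropy cost (the logits and labels are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)        # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(z, y):
    """Mean of -log p_model(y | x) for integer class labels y and logits z."""
    p = softmax(z)
    return -np.mean(np.log(p[np.arange(len(y)), y]))

z = np.array([[2.0, 0.5, -1.0],
              [0.1, 0.2, 3.0]])                  # two examples, three classes
y = np.array([0, 2])                             # correct class indices
print(cross_entropy(z, y))
```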

  12. Mixture Density Outputs [Figure 6.4: samples drawn from a network with a mixture density output layer; the conditional distribution of y given x is multimodal] (Goodfellow 2017)

  13. Don’t mix and match: with a sigmoid output σ(z) and a target of 1, the cross-entropy loss keeps a useful gradient even when z is very negative, while the MSE loss saturates [plot of σ(z), the cross-entropy loss, and the MSE loss over z from −3 to 3] (Goodfellow 2017)
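The same point can be checked numerically. A small sketch (the z values are illustrative) compares the gradients of the two losses with respect to z for a sigmoid output with target 1:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_cross_entropy(z):     # d/dz of -log(sigmoid(z)), target y = 1
    return sigmoid(z) - 1.0

def grad_mse(z):               # d/dz of (sigmoid(z) - 1)^2, target y = 1
    s = sigmoid(z)
    return 2.0 * (s - 1.0) * s * (1.0 - s)

for z in (-10.0, -3.0, 0.0, 3.0):
    print(z, grad_cross_entropy(z), grad_mse(z))
# At z = -10 the cross-entropy gradient is still about -1 (learning continues),
# while the MSE gradient is nearly 0 (the unit has saturated).
```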

  14. Roadmap • Example: Learning XOR • Gradient-Based Learning • Hidden Units • Architecture Design • Back-Propagation (Goodfellow 2017)

  15. Hidden units • Use ReLUs, 90% of the time • For RNNs, see Chapter 10 • For some research projects, get creative • Many hidden units perform comparably to ReLUs. New hidden units that perform comparably are rarely interesting. (Goodfellow 2017)

  16. Roadmap • Example: Learning XOR • Gradient-Based Learning • Hidden Units • Architecture Design • Back-Propagation (Goodfellow 2017)

  17. Architecture Basics [diagram of a small network with inputs x1, x2, hidden units h1, h2, and output y; the number of layers is the depth, the number of units per layer is the width] (Goodfellow 2017)

  18. Universal Approximator Theorem • One hidden layer is enough to represent (not learn) an approximation of any function to an arbitrary degree of accuracy • So why deeper? • Shallow net may need (exponentially) more width • Shallow net may overfit more (Goodfellow 2017)

  19. Exponential Representation Advantage of Depth Figure 6.5 (Goodfellow 2017)

  20. Better Generalization with Greater Depth [Figure 6.6: test accuracy (percent) increases steadily as the number of layers grows from 3 to 11] (Goodfellow 2017)

  21. Large, Shallow Models Overfit More [Figure 6.7: test accuracy (percent) versus number of parameters (up to 1.0 × 10^8) for 3-layer convolutional, 3-layer fully connected, and 11-layer convolutional networks; the deep convolutional network generalizes best, while the shallow models gain little from extra parameters] (Goodfellow 2017)

  22. Roadmap • Example: Learning XOR • Gradient-Based Learning • Hidden Units • Architecture Design • Back-Propagation (Goodfellow 2017)

  23. Back-Propagation • Back-propagation is “just the chain rule” of calculus: dz/dx = (dz/dy)(dy/dx) (6.44), and in vector form ∇_x z = (∂y/∂x)ᵀ ∇_y z (6.46) • But it’s a particular implementation of the chain rule • Uses dynamic programming (table filling) • Avoids recomputing repeated subexpressions • Speed vs memory tradeoff (Goodfellow 2017)
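A hedged sketch of the “dynamic programming (table filling)” idea for a one-hidden-layer ReLU network with an MSE cost; all shapes, data, and variable names are illustrative, not the book’s code:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2))            # 4 examples, 2 inputs
y = rng.normal(size=(4, 1))
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)

# Forward prop: fill the table of activations.
U1 = X @ W1 + b1
H = np.maximum(0.0, U1)                # ReLU hidden layer
yhat = H @ W2 + b2
loss = np.mean((yhat - y) ** 2)        # MSE cost

# Back-prop: sweep the graph in reverse, reusing the stored activations
# instead of recomputing them (the speed vs memory tradeoff).
d_yhat = 2.0 * (yhat - y) / len(y)
d_W2 = H.T @ d_yhat
d_b2 = d_yhat.sum(axis=0)
d_H = d_yhat @ W2.T
d_U1 = d_H * (U1 > 0)                  # ReLU derivative uses the stored U1
d_W1 = X.T @ d_U1
d_b1 = d_U1.sum(axis=0)
print(loss, d_W1.shape, d_W2.shape)
```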

  24. Simple Back-Prop Example [diagram: forward prop computes the activations from x1, x2 through h1, h2 and the loss at y; back-prop then computes the derivatives in the reverse direction] (Goodfellow 2017)

  25. Computation Graphs [Figure 6.8, four example graphs: (a) multiplication z = x × y; (b) logistic regression ŷ = σ(xᵀw + b); (c) a ReLU layer H = max{0, XW + b}; (d) linear regression ŷ = xᵀw together with a weight decay penalty λΣᵢ wᵢ²] (Goodfellow 2017)

  26. Repeated Subexpressions [Figure 6.9: a chain graph w → x → y → z with x = f(w), y = f(x), z = f(y)]: ∂z/∂w (6.50) = (∂z/∂y)(∂y/∂x)(∂x/∂w) (6.51) = f′(y) f′(x) f′(w) (6.52) = f′(f(f(w))) f′(f(w)) f′(w) (6.53). Back-prop avoids computing the repeated subexpression f(w) twice (Goodfellow 2017)
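A tiny sketch of that saving, with f chosen arbitrarily as sin (the choice of f and the value of w are illustrative):

```python
import math

def f(u):  return math.sin(u)     # illustrative choice of f
def fp(u): return math.cos(u)     # its derivative f'

w = 0.7

# Expression 6.53 as written: f(w) is evaluated twice.
grad_naive = fp(f(f(w))) * fp(f(w)) * fp(w)

# Back-prop style (eq. 6.52): store the forward values once, then reuse them.
x = f(w)        # stored during forward prop
y = f(x)        # stored during forward prop
grad_stored = fp(y) * fp(x) * fp(w)

print(grad_naive, grad_stored)    # same value, fewer evaluations of f
```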

  27. Symbol-to-Symbol Differentiation [Figure 6.10: the graph computing z from w through x and y is extended with new nodes that compute dz/dy, dz/dx, and dz/dw symbolically, multiplying local derivatives f′ along the chain] (Goodfellow 2017)

  28. Neural Network Loss Function [Figure 6.11: computation graph for a one-hidden-layer MLP; H = relu(XW(1)) feeds U(2) = HW(2), which enters a cross-entropy cost J_MLE against y, and weight decay terms λ·sum(sqr(W(1))) and λ·sum(sqr(W(2))) are added to give the total cost J] (Goodfellow 2017)

  29. Hessian-vector Products: Hv = ∇_x[(∇_x f(x))ᵀ v] (6.59) (Goodfellow 2017)
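Equation 6.59 says the Hessian-vector product is the gradient of the scalar (∇_x f(x))ᵀ v. With only NumPy at hand, a hedged sketch approximates that outer gradient with a central finite difference of the inner gradient; the quadratic test function and the step size are illustrative:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])            # symmetric, so the Hessian of f is A

def f(x):       return 0.5 * x @ A @ x
def grad_f(x):  return A @ x          # gradient of the test function

def hvp(grad, x, v, eps=1e-5):
    """Approximate Hv = grad_x[(grad_x f(x))^T v]  (eq. 6.59)
    by a central difference of the gradient along the direction v."""
    return (grad(x + eps * v) - grad(x - eps * v)) / (2.0 * eps)

x = np.array([1.0, -2.0])
v = np.array([0.5, 1.0])
print(hvp(grad_f, x, v))              # close to A @ v
print(A @ v)
```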

  30. Questions (Goodfellow 2017)
