
Deep Feedforward Networks: Lecture slides for Chapter 6 of Deep Learning



  1. Deep Feedforward Networks Lecture slides for Chapter 6 of Deep Learning www.deeplearningbook.org Ian Goodfellow Last updated 2016-10-04

  2. Roadmap • Example: Learning XOR • Gradient-Based Learning • Hidden Units • Architecture Design • Back-Propagation (Goodfellow 2017)

  3. XOR is not linearly separable [Figure 6.1, left: the four XOR points plotted in the original x space, with axes x1 and x2; no straight line separates the class-0 points from the class-1 points] (Goodfellow 2017)

  4. Rectified Linear Activation: g(z) = max{0, z} [Figure 6.3: plot of the rectifier, zero for z < 0 and linear with slope 1 for z > 0] (Goodfellow 2017)
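A one-line NumPy version of this activation, with illustrative input values:

```python
import numpy as np

def relu(z):
    """Rectified linear activation g(z) = max{0, z}, applied elementwise."""
    return np.maximum(0.0, z)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))   # [0.  0.  0.  1.5 3. ]
```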

  5. Network Diagrams [Figure 6.2: the same one-hidden-layer network drawn two ways: with every unit shown (inputs x1, x2, hidden units h1, h2, output y) and in compact style (layers x, h, y connected by weight matrices W and w)] (Goodfellow 2017)

  6. Solving XOR: f(x; W, c, w, b) = wᵀ max{0, Wᵀx + c} + b (6.3), with W = [1 1; 1 1] (6.4), c = (0, −1)ᵀ (6.5), w = (1, −2)ᵀ (6.6), and b = 0 (Goodfellow 2017)

  7. Solving XOR [Figure 6.1: left, the original x space, where the XOR points are not linearly separable; right, the learned h space, where the hidden layer maps the points so that a single line separates the two classes] (Goodfellow 2017)
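The solution above can be checked directly. A minimal NumPy sketch (the function name xor_net is illustrative) evaluates f(x; W, c, w, b) = wᵀ max{0, Wᵀx + c} + b with the values from equations 6.4-6.6 and b = 0 on all four XOR inputs:

```python
import numpy as np

W = np.array([[1.0, 1.0],
              [1.0, 1.0]])        # eq. 6.4
c = np.array([0.0, -1.0])         # eq. 6.5
w = np.array([1.0, -2.0])         # eq. 6.6
b = 0.0

def xor_net(x):
    """f(x; W, c, w, b) = w^T max{0, W^T x + c} + b  (eq. 6.3)."""
    h = np.maximum(0.0, W.T @ x + c)   # hidden representation in the learned h space
    return w @ h + b

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, xor_net(np.array(x, dtype=float)))   # 0, 1, 1, 0: XOR
```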

  8. Roadmap • Example: Learning XOR • Gradient-Based Learning • Hidden Units • Architecture Design • Back-Propagation (Goodfellow 2017)

  9. Gradient-Based Learning • Specify a model and a cost • Design the model and cost so that the cost is smooth • Minimize the cost using gradient descent or related techniques (Goodfellow 2017)

  10. Conditional Distributions and Cross-Entropy: J(θ) = −E_{x,y∼p̂_data} log p_model(y | x) (6.12) (Goodfellow 2017)
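This cost is just the average negative log-likelihood of the training data under the model. A minimal NumPy sketch for the binary (Bernoulli/sigmoid) case; the helper names and the example logits and targets are illustrative only:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(y, p):
    """Estimate J(theta) = -E_{x,y ~ p_data} log p_model(y | x) on a batch.
    For a Bernoulli output, log p_model(y | x) = y log p + (1 - y) log(1 - p)."""
    eps = 1e-12                          # guard against log(0)
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

z = np.array([2.0, -1.0, 0.5])           # logits produced by the network
y = np.array([1.0, 0.0, 1.0])            # binary targets
print(neg_log_likelihood(y, sigmoid(z)))
```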

  11. Output Types (Output Type | Output Distribution | Output Layer | Cost Function):
      Binary | Bernoulli | Sigmoid | Binary cross-entropy
      Discrete | Multinoulli | Softmax | Discrete cross-entropy
      Continuous | Gaussian | Linear | Gaussian cross-entropy (MSE)
      Continuous | Mixture of Gaussians | Mixture Density | Cross-entropy
      Continuous | Arbitrary | Various | See Part III: GAN, VAE, FVBN
      (Goodfellow 2017)
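For the Discrete/Multinoulli/Softmax row, a small sketch of a numerically stable softmax with its cross-entropy cost (the logits and labels are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)        # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(z, y):
    """Mean of -log p_model(y | x) for integer class labels y and logits z."""
    p = softmax(z)
    return -np.mean(np.log(p[np.arange(len(y)), y]))

z = np.array([[2.0, 0.5, -1.0],
              [0.1, 0.2, 3.0]])                  # two examples, three classes
y = np.array([0, 2])                             # correct class indices
print(cross_entropy(z, y))
```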

  12. Mixture Density Outputs [Figure 6.4: samples drawn from a network with a mixture density output layer; the conditional distribution of y given x is multimodal] (Goodfellow 2017)

  13. Don’t mix and match: with a sigmoid output σ(z) and a target of 1, the cross-entropy loss keeps a useful gradient even when z is very negative, while the MSE loss saturates [plot of σ(z), the cross-entropy loss, and the MSE loss over z from −3 to 3] (Goodfellow 2017)
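The same point can be checked numerically. A small sketch (the z values are illustrative) compares the gradients of the two losses with respect to z for a sigmoid output with target 1:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_cross_entropy(z):     # d/dz of -log(sigmoid(z)), target y = 1
    return sigmoid(z) - 1.0

def grad_mse(z):               # d/dz of (sigmoid(z) - 1)^2, target y = 1
    s = sigmoid(z)
    return 2.0 * (s - 1.0) * s * (1.0 - s)

for z in (-10.0, -3.0, 0.0, 3.0):
    print(z, grad_cross_entropy(z), grad_mse(z))
# At z = -10 the cross-entropy gradient is still about -1 (learning continues),
# while the MSE gradient is nearly 0 (the unit has saturated).
```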

  14. Roadmap • Example: Learning XOR • Gradient-Based Learning • Hidden Units • Architecture Design • Back-Propagation (Goodfellow 2017)

  15. Hidden units • Use ReLUs, 90% of the time • For RNNs, see Chapter 10 • For some research projects, get creative • Many hidden units perform comparably to ReLUs. New hidden units that perform comparably are rarely interesting. (Goodfellow 2017)

  16. Roadmap • Example: Learning XOR • Gradient-Based Learning • Hidden Units • Architecture Design • Back-Propagation (Goodfellow 2017)

  17. Architecture Basics [diagram of a small network with inputs x1, x2, hidden units h1, h2, and output y; the number of layers is the depth, the number of units per layer is the width] (Goodfellow 2017)

  18. Universal Approximator Theorem • One hidden layer is enough to represent (not learn) an approximation of any function to an arbitrary degree of accuracy • So why deeper? • Shallow net may need (exponentially) more width • Shallow net may overfit more (Goodfellow 2017)

  19. Exponential Representation Advantage of Depth Figure 6.5 (Goodfellow 2017)

  20. Better Generalization with Greater Depth [Figure 6.6: test accuracy (percent) increases steadily as the number of layers grows from 3 to 11] (Goodfellow 2017)

  21. Large, Shallow Models Overfit More [Figure 6.7: test accuracy (percent) versus number of parameters (up to 1.0 × 10^8) for 3-layer convolutional, 3-layer fully connected, and 11-layer convolutional networks; the deep convolutional network generalizes best, while the shallow models gain little from extra parameters] (Goodfellow 2017)

  22. Roadmap • Example: Learning XOR • Gradient-Based Learning • Hidden Units • Architecture Design • Back-Propagation (Goodfellow 2017)

  23. Back-Propagation • Back-propagation is “just the chain rule” of calculus: dz/dx = (dz/dy)(dy/dx) (6.44), and in vector form ∇_x z = (∂y/∂x)ᵀ ∇_y z (6.46) • But it’s a particular implementation of the chain rule • Uses dynamic programming (table filling) • Avoids recomputing repeated subexpressions • Speed vs memory tradeoff (Goodfellow 2017)
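A hedged sketch of the “dynamic programming (table filling)” idea for a one-hidden-layer ReLU network with an MSE cost; all shapes, data, and variable names are illustrative, not the book’s code:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2))            # 4 examples, 2 inputs
y = rng.normal(size=(4, 1))
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)

# Forward prop: fill the table of activations.
U1 = X @ W1 + b1
H = np.maximum(0.0, U1)                # ReLU hidden layer
yhat = H @ W2 + b2
loss = np.mean((yhat - y) ** 2)        # MSE cost

# Back-prop: sweep the graph in reverse, reusing the stored activations
# instead of recomputing them (the speed vs memory tradeoff).
d_yhat = 2.0 * (yhat - y) / len(y)
d_W2 = H.T @ d_yhat
d_b2 = d_yhat.sum(axis=0)
d_H = d_yhat @ W2.T
d_U1 = d_H * (U1 > 0)                  # ReLU derivative uses the stored U1
d_W1 = X.T @ d_U1
d_b1 = d_U1.sum(axis=0)
print(loss, d_W1.shape, d_W2.shape)
```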

  24. Simple Back-Prop Example [diagram: forward prop computes the activations from x1, x2 through h1, h2 and the loss at y; back-prop then computes the derivatives in the reverse direction] (Goodfellow 2017)

  25. Computation Graphs [Figure 6.8, four example graphs: (a) multiplication z = x × y; (b) logistic regression ŷ = σ(xᵀw + b); (c) a ReLU layer H = max{0, XW + b}; (d) linear regression ŷ = xᵀw together with a weight decay penalty λΣᵢ wᵢ²] (Goodfellow 2017)

  26. Repeated Subexpressions [Figure 6.9: a chain graph w → x → y → z with x = f(w), y = f(x), z = f(y)]: ∂z/∂w (6.50) = (∂z/∂y)(∂y/∂x)(∂x/∂w) (6.51) = f′(y) f′(x) f′(w) (6.52) = f′(f(f(w))) f′(f(w)) f′(w) (6.53). Back-prop avoids computing the repeated subexpression f(w) twice (Goodfellow 2017)
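A tiny sketch of that saving, with f chosen arbitrarily as sin (the choice of f and the value of w are illustrative):

```python
import math

def f(u):  return math.sin(u)     # illustrative choice of f
def fp(u): return math.cos(u)     # its derivative f'

w = 0.7

# Expression 6.53 as written: f(w) is evaluated twice.
grad_naive = fp(f(f(w))) * fp(f(w)) * fp(w)

# Back-prop style (eq. 6.52): store the forward values once, then reuse them.
x = f(w)        # stored during forward prop
y = f(x)        # stored during forward prop
grad_stored = fp(y) * fp(x) * fp(w)

print(grad_naive, grad_stored)    # same value, fewer evaluations of f
```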

  27. Symbol-to-Symbol Differentiation [Figure 6.10: the graph computing z from w through x and y is extended with new nodes that compute dz/dy, dz/dx, and dz/dw symbolically, multiplying local derivatives f′ along the chain] (Goodfellow 2017)

  28. Neural Network Loss Function [Figure 6.11: computation graph for a one-hidden-layer MLP; H = relu(XW(1)) feeds U(2) = HW(2), which enters a cross-entropy cost J_MLE against y, and weight decay terms λ·sum(sqr(W(1))) and λ·sum(sqr(W(2))) are added to give the total cost J] (Goodfellow 2017)

  29. Hessian-vector Products: Hv = ∇_x[(∇_x f(x))ᵀ v] (6.59) (Goodfellow 2017)
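Equation 6.59 says the Hessian-vector product is the gradient of the scalar (∇_x f(x))ᵀ v. With only NumPy at hand, a hedged sketch approximates that outer gradient with a central finite difference of the inner gradient; the quadratic test function and the step size are illustrative:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])            # symmetric, so the Hessian of f is A

def f(x):       return 0.5 * x @ A @ x
def grad_f(x):  return A @ x          # gradient of the test function

def hvp(grad, x, v, eps=1e-5):
    """Approximate Hv = grad_x[(grad_x f(x))^T v]  (eq. 6.59)
    by a central difference of the gradient along the direction v."""
    return (grad(x + eps * v) - grad(x - eps * v)) / (2.0 * eps)

x = np.array([1.0, -2.0])
v = np.array([0.5, 1.0])
print(hvp(grad_f, x, v))              # close to A @ v
print(A @ v)
```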

  30. Questions (Goodfellow 2017)
