SLIDE 1

Administrative

  • A1 is due today (midnight). You can use up to 3 late days.
  • A2 will be up this Friday; it’s due the Wednesday after next (Feb 4).
  • The Project Proposal is due next Friday at midnight (~one paragraph, 200-400 words, sent as email).

SLIDE 2

Lecture 5: Backprop and intro to Neural Nets

SLIDE 3

Linear Classification

SVM:   Softmax:
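The loss formulas themselves are images on this slide and are not captured in this text; for reference, the standard per-example SVM (hinge) and Softmax (cross-entropy) losses are:

    L_i = \sum_{j \ne y_i} \max(0, s_j - s_{y_i} + 1)            (SVM, margin 1)

    L_i = -\log\left( \frac{e^{s_{y_i}}}{\sum_j e^{s_j}} \right)   (Softmax)

where s = f(x_i; W) are the class scores.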

SLIDE 4

Optimization Landscape

SLIDE 5

Gradient Descent

Numerical gradient: slow :(, approximate :(, easy to write :)
Analytic gradient: fast :), exact :), error-prone :(

In practice: derive the analytic gradient, then check your implementation with the numerical gradient.
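A minimal sketch of such a gradient check in numpy (the function f and its analytic gradient below are illustrative stand-ins, not from the slides):

    import numpy as np

    def numerical_gradient(f, x, h=1e-5):
        """Centered-difference estimate of the gradient of f at x (x is a float numpy array)."""
        grad = np.zeros_like(x)
        it = np.nditer(x, flags=['multi_index'])
        while not it.finished:
            i = it.multi_index
            old = x[i]
            x[i] = old + h
            fp = f(x)                      # f(x + h)
            x[i] = old - h
            fm = f(x)                      # f(x - h)
            x[i] = old                     # restore
            grad[i] = (fp - fm) / (2 * h)
            it.iternext()
        return grad

    # Illustrative check against an analytic gradient:
    f = lambda x: np.sum(x ** 2)           # toy function
    x = np.random.randn(3, 4)
    num = numerical_gradient(f, x)
    ana = 2 * x                            # analytic gradient of sum(x^2)
    print(np.max(np.abs(num - ana)))       # should be tiny (~1e-9)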

SLIDE 6

This class: Becoming a backprop ninja

SLIDES 7-8

Example: f(x, y) = xy, with x = 4, y = -3  =>  f(x, y) = -12.

Partial derivatives: ∂f/∂x = y = -3,  ∂f/∂y = x = 4.   Gradient: ∇f = [∂f/∂x, ∂f/∂y] = [-3, 4].

SLIDE 9

Example (repeated): f(x, y) = xy with x = 4, y = -3, so f(x, y) = -12; partial derivatives and gradient as above.

Question: If I increase x by h, how would the output of f change?
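A one-line worked answer (it follows directly from the partial derivative above; it is not part of the captured slide text):

    f(x + h, y) = (x + h) \cdot y = xy + hy = -12 - 3h

so increasing x by h changes the output by h times ∂f/∂x = -3h.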

SLIDE 10

Compound expressions:

SLIDE 11

Compound expressions are handled with the chain rule: if f can be written as a composition f(q(x)), then df/dx = (df/dq) · (dq/dx).

SLIDES 12-13

(Chain rule applied to the compound expression, continued.)
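A sketch of what "chaining the gradients" looks like for a simple compound expression, f(x, y, z) = (x + y) * z (the particular expression and numeric values here are illustrative):

    # Forward/backward through f(x, y, z) = (x + y) * z, done in stages.
    x, y, z = -2.0, 5.0, -4.0

    # forward pass
    q = x + y          # q = 3
    f = q * z          # f = -12

    # backward pass, applying the chain rule node by node
    df_dq = z                  # d(q*z)/dq
    df_dz = q                  # d(q*z)/dz
    df_dx = df_dq * 1.0        # dq/dx = 1, chained with df/dq
    df_dy = df_dq * 1.0        # dq/dy = 1, chained with df/dq

    print(df_dx, df_dy, df_dz)   # -4.0 -4.0 3.0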

SLIDE 14

Another example: f(w, x) = 1 / (1 + e^-(w0*x0 + w1*x1 + w2))  (a 2-input neuron with a sigmoid output).

SLIDE 15

Another example:

  • -1/(1.37^2) = -0.53   (the local gradient of 1/x is -1/x^2)
SLIDE 16

Another example:

[local gradient] x [its gradient]:
[1] x [-0.53] = -0.53

SLIDE 17

Another example:

[local gradient] x [its gradient]:
[e^(-1)] x [-0.53] = -0.20

SLIDE 18

Another example:

[local gradient] x [its gradient]:
[-1] x [-0.2] = 0.2

SLIDE 19

Another example:

[local gradient] x [its gradient]:
[1] x [0.2] = 0.2
[1] x [0.2] = 0.2 (both inputs!)

SLIDE 20

Another example:

[local gradient] x [its gradient]:
x0: [2] x [0.2] = 0.4
w0: [-1] x [0.2] = -0.2

SLIDE 21

Every gate during backprop computes, for all of its inputs:

[LOCAL GRADIENT] x [GATE GRADIENT]

The local gradient can be computed right away, even during the forward pass; the gate gradient is what the gate receives from above during backpropagation.

(a gate hanging out)
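A minimal sketch of this idea as code (the class name and interface are illustrative, not from the slides):

    class MultiplyGate:
        """x * y gate: caches its inputs on the forward pass, then multiplies
        the incoming gate gradient by the local gradients on the backward pass."""
        def forward(self, x, y):
            self.x, self.y = x, y          # remember inputs; local gradients depend on them
            return x * y

        def backward(self, dz):
            # [local gradient] x [gate gradient], for each input
            dx = self.y * dz               # d(xy)/dx = y
            dy = self.x * dz               # d(xy)/dy = x
            return dx, dy

    gate = MultiplyGate()
    z = gate.forward(3.0, -4.0)            # forward pass
    dx, dy = gate.backward(2.0)            # suppose the upstream gradient is 2.0
    print(z, dx, dy)                       # -12.0 -8.0 6.0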

SLIDE 22

sigmoid function: σ(x) = 1 / (1 + e^-x)

SLIDE 23

sigmoid function: its local gradient simplifies to σ(x)·(1 - σ(x)), so here (0.73) * (1 - 0.73) = 0.2
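For reference, the simplification behind that number (standard calculus, not spelled out in the captured text):

    \frac{d\sigma(x)}{dx} = \frac{e^{-x}}{(1 + e^{-x})^2}
                          = \left(\frac{1 + e^{-x} - 1}{1 + e^{-x}}\right)\left(\frac{1}{1 + e^{-x}}\right)
                          = (1 - \sigma(x))\,\sigma(x)

so for a sigmoid output of 0.73 the local gradient is (1 - 0.73) * 0.73 ≈ 0.2.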

SLIDES 24-25

sigmoid function (continued)

SLIDE 26

We are ready:

SLIDES 27-28

forward pass was:
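The code on these slides is not captured in this text. Below is a sketch consistent with the worked numbers above; w0 = 2 and x0 = -1 follow from the earlier slide, while the remaining values (w1, x1, w2) are assumptions chosen so the intermediate values (1.37, 0.73, 0.2) match:

    import math

    # f(w, x) = 1 / (1 + exp(-(w0*x0 + w1*x1 + w2)))
    w = [2.0, -3.0, -3.0]   # w[2] acts as the bias; w[1] and w[2] are assumed values
    x = [-1.0, -2.0]

    # forward pass
    dot = w[0]*x[0] + w[1]*x[1] + w[2]     # = 1.0
    f = 1.0 / (1.0 + math.exp(-dot))       # = 0.73   (note 1 + e^-1 = 1.37)

    # backward pass (chain rule, using the sigmoid simplification)
    ddot = (1.0 - f) * f                   # gradient on dot = 0.2
    dx = [w[0] * ddot, w[1] * ddot]        # gradients on x0, x1: 0.4, -0.6
    dw = [x[0] * ddot, x[1] * ddot, ddot]  # gradients on w0, w1, w2: -0.2, -0.4, 0.2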

SLIDES 29-35

forward pass was: (repeated on each of these slides)

SLIDE 36

Patterns in backward flow

add gate: gradient distributor
max gate: gradient router
mul gate: gradient… “switcher”?
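A compact way to see the three patterns in code (a sketch, not from the slides):

    def add_backward(x, y, dz):
        # distributor: both inputs get the upstream gradient unchanged
        return dz, dz

    def max_backward(x, y, dz):
        # router: the upstream gradient is routed to whichever input was larger
        return (dz, 0.0) if x >= y else (0.0, dz)

    def mul_backward(x, y, dz):
        # "switcher": each input's gradient is the upstream gradient
        # scaled by the *other* input's value
        return y * dz, x * dz

    print(add_backward(3.0, -1.0, 2.0))   # (2.0, 2.0)
    print(max_backward(3.0, -1.0, 2.0))   # (2.0, 0.0)
    print(mul_backward(3.0, -1.0, 2.0))   # (-2.0, 6.0)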

SLIDE 37

Gradients for vectorized code

SLIDE 38

Gradients for vectorized code

If X is [10 x 3] and dD is [5 x 3], then dW must be [5 x 10] and dX must be [10 x 3]: the gradient of a variable always has the same shape as the variable itself.
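A numpy sketch of how those shapes come about, assuming the forward operation is D = W.dot(X) (which is consistent with the shapes above):

    import numpy as np

    W = np.random.randn(5, 10)
    X = np.random.randn(10, 3)
    D = W.dot(X)                     # [5 x 3]

    dD = np.random.randn(*D.shape)   # suppose this gradient arrives from above
    dW = dD.dot(X.T)                 # [5 x 3] . [3 x 10] -> [5 x 10], same shape as W
    dX = W.T.dot(dD)                 # [10 x 5] . [5 x 3] -> [10 x 3], same shape as X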

SLIDE 39

Gradients for vectorized code

SLIDE 40

In summary

  • in practice it is rarely necessary to derive long gradient expressions by hand with pen and paper
  • structure your code in stages (layers) where you can derive the local gradients, then chain the gradients together during backprop
  • caveat: sometimes gradients simplify (e.g. for sigmoid, also softmax); group these into a single gate

SLIDE 41

NEURAL NETWORKS

SLIDES 42-45

sigmoid activation function

SLIDES 46-47

A Single Neuron can be used as a binary linear classifier

Regularization has the interpretation of “gradual forgetting”
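A sketch of the first statement (a single sigmoid neuron acting as a binary linear classifier; the weights and input below are illustrative):

    import numpy as np

    def neuron_predict(w, b, x):
        """A single neuron: sigmoid(w . x + b) gives P(class = 1 | x)."""
        return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

    w = np.array([0.5, -1.2])     # illustrative weights
    b = 0.1
    x = np.array([1.0, 2.0])
    p = neuron_predict(w, b, x)
    label = 1 if p > 0.5 else 0   # thresholding at 0.5 -> a binary linear classifier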

SLIDE 48

Be very careful with your brain analogies. Biological neurons:

  • Many different types
  • Dendrites can perform complex non-linear computations
  • Synapses are not a single weight but a complex non-linear dynamical system
  • Rate code may not be adequate

[Dendritic Computation. London and Hausser]

SLIDE 49

Activation Functions

SLIDES 50-52

Activation Functions

Sigmoid

  • Squashes numbers to the range [0, 1]
  • Historically popular since they have a nice interpretation as a saturating “firing rate” of a neuron

2 BIG problems:
  1. Saturated neurons “kill” the gradients
  2. Sigmoid outputs are not zero-centered

SLIDE 53

Consider what happens when the input to a neuron is always positive... What can we say about the gradients on w?

SLIDE 54

Answer: they are always all positive or all negative :( (this is also why you want zero-mean data!)

(Figure: a hypothetical optimal w vector, the zig-zag path taken to reach it, and the allowed gradient update directions.)

SLIDE 55

Activation Functions

tanh(x)

  • Squashes numbers to range [-1,1]
  • zero centered (nice)
  • still kills gradients when saturated :(
SLIDE 56

Activation Functions

ReLU

  • Computes f(x) = max(0,x)
  • Does not saturate
  • Very computationally efficient
  • Converges much faster than sigmoid/tanh in practice! (e.g. 6x)
  • Just one annoying problem…

hint: what is the gradient when x < 0?

SLIDE 57

(Figure: a data cloud, with an active ReLU and a dead ReLU drawn through it.)

A dead ReLU will never activate (its input is always in the x < 0 region, where the gradient is 0) => it will never update.

SLIDE 58

Activation Functions

Leaky ReLU: f(x) = max(0.01x, x)

  • Does not saturate
  • Computationally efficient
  • Converges much faster than sigmoid/tanh in practice! (e.g. 6x)
  • Will not “die”.
SLIDE 59

Maxout “Neuron”

  • Does not have the basic form of dot product -> nonlinearity; instead it computes max(w1^T x + b1, w2^T x + b2)

  • Generalizes ReLU and Leaky ReLU
  • Linear Regime! Does not saturate! Does not die!

Problem: doubles the number of parameters :(

SLIDE 60

TLDR: In practice:

  • Use ReLU. Be careful with your learning rates
  • Try out Leaky ReLU / Maxout
  • Try out tanh but don’t expect much
  • Never use sigmoid
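For reference, the activation functions compared above, written out in numpy (a sketch; the 0.01 leak slope is one common choice for Leaky ReLU):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))       # squashes to (0, 1); saturates; not zero-centered

    def tanh(x):
        return np.tanh(x)                     # squashes to (-1, 1); zero-centered; still saturates

    def relu(x):
        return np.maximum(0.0, x)             # no saturation for x > 0, but can "die" when x < 0

    def leaky_relu(x, alpha=0.01):
        return np.where(x > 0, x, alpha * x)  # small slope for x < 0, so it never fully dies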
SLIDE 61

NEURAL NETWORKS

SLIDE 62

Neural Networks: Architectures

“Fully-connected” layers

SLIDE 63

Neural Networks: Architectures

“Fully-connected” layers

“2-layer Neural Net”, or “1-hidden-layer Neural Net”
“3-layer Neural Net”, or “2-hidden-layer Neural Net”

SLIDE 64

Neural Networks: Architectures

Number of Neurons: ? Number of Weights: ? Number of Parameters: ?

SLIDE 65

Neural Networks: Architectures

The 2-layer net:  Number of Neurons: 4 + 2 = 6.  Number of Weights: [4x3 + 2x4] = 20.  Number of Parameters: 20 + 6 = 26 (biases!)

The 3-layer net:  Number of Neurons: ?  Number of Weights: ?  Number of Parameters: ?

SLIDE 66

Neural Networks: Architectures

The 2-layer net:  Number of Neurons: 4 + 2 = 6.  Number of Weights: [4x3 + 2x4] = 20.  Number of Parameters: 20 + 6 = 26 (biases!)

The 3-layer net:  Number of Neurons: 4 + 4 + 1 = 9.  Number of Weights: [4x3 + 4x4 + 1x4] = 32.  Number of Parameters: 32 + 9 = 41.
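A quick sketch of how those counts are computed (the layer sizes are the ones from the example above; the helper function itself is illustrative):

    def count_params(layer_sizes):
        """layer_sizes = [inputs, hidden..., outputs]; returns (neurons, weights, params)."""
        neurons = sum(layer_sizes[1:])                           # inputs are not counted as neurons
        weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
        return neurons, weights, weights + neurons               # one bias per neuron

    print(count_params([3, 4, 2]))      # (6, 20, 26)  -- the 2-layer net
    print(count_params([3, 4, 4, 1]))   # (9, 32, 41)  -- the 3-layer net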

SLIDE 67

Neural Networks: Architectures

Modern CNNs: ~10 million neurons
Human visual cortex: ~5 billion neurons

SLIDE 68

Example Feed-forward computation of a Neural Network

We can efficiently evaluate an entire layer of neurons.
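The slide's code is not captured here; a sketch of what an efficient layer-by-layer forward pass looks like (the shapes and the use of a sigmoid nonlinearity are illustrative):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # illustrative shapes: 3 inputs, two hidden layers of 4 neurons, 1 output
    x = np.random.randn(3, 1)
    W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)
    W2, b2 = np.random.randn(4, 4), np.random.randn(4, 1)
    W3, b3 = np.random.randn(1, 4), np.random.randn(1, 1)

    # forward pass: each layer is one matrix multiply plus a nonlinearity
    h1 = sigmoid(W1.dot(x) + b1)     # first hidden layer (all 4 neurons evaluated at once)
    h2 = sigmoid(W2.dot(h1) + b2)    # second hidden layer
    out = W3.dot(h2) + b3            # output neuron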

SLIDE 69

Example Feed-forward computation of a Neural Network

SLIDE 70

What kinds of functions can a Neural Network represent?

[http://neuralnetworksanddeeplearning.com/chap4.html]

SLIDES 71-72

Setting the number of layers and their sizes

more neurons = more capacity

SLIDE 73

(you can play with this demo over at ConvNetJS: http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html)

Do not use size of neural network as a regularizer. Use stronger regularization instead:

SLIDE 74

Summary

  • we arrange neurons into fully-connected layers
  • the abstraction of a layer has a nice property in that it allows us to use efficient vectorized code (matrix multiplies)
  • neural networks are universal function approximators, but this doesn’t mean much
  • neural networks are not neural
  • neural networks: bigger = better (but might have to regularize more strongly)

SLIDE 75

Next Lecture: More than you ever wanted to know about Neural Networks and how to train them