Deep Learning Tutorial Part I
Greg Shakhnarovich, TTI-Chicago


SLIDE 1

Deep Learning Tutorial Part I Greg Shakhnarovich TTI-Chicago

December 2016

Deep Learning Tutorial, Part I 1

SLIDE 2

Overview

Goals of the tutorial

Somewhat organized overview of basics, and some more advanced topics
Demystify jargon
Pointers for informed further learning
Aimed mostly at vision practitioners, but tools are widely applicable beyond vision
Assumes basic familiarity with machine learning
SLIDE 3

Overview

Not covered

Connections to brain
Deep learning outside of neural networks
Many recent advances
Many specialized architectures for vision tasks
SLIDE 4

Overview

Outline

Introduction (3 hours):
Review of relevant machine learning concepts
Feedforward neural networks and backpropagation
Optimization techniques and issues
Complexity and regularization in neural networks
Intro to convolutional networks
Advanced (3 hours):
Advanced techniques for learning DNNs
Convnets for tasks beyond image classification
Very deep networks
Recurrent networks
SLIDE 5

Overview

Sources

Stanford CS231N: Convolutional Neural Networks for Visual Recognition, Andrej Karpathy, Justin Johnson et al. (2016 edition), vision.stanford.edu/teaching/cs231n
Deep Learning by Ian Goodfellow, Aaron Courville and Yoshua Bengio, 2016
Chris Olah: Understanding LSTM Networks (blog post), colah.github.io/posts/2015-08-Understanding-LSTMs
Papers on arXiv and slides by the authors
SLIDE 6

Overview of ML concepts

Supervised learning: setup

Input data space X
Output (label, target) space Y
image classification: X = {natural images}, Y = {cat, dog, boat, . . .}
Unknown function f : X → Y
Scenario: given a labeled training set (xi, yi), i = 1, . . . , N, with xi ∈ X, yi ∈ Y.
Goal: for any future x, accurately predict y
in other words: learn a mapping f : X → Y
SLIDE 7

Overview of ML concepts

Loss function

A loss function ℓ : Y × Y → R maps prediction ŷ to a cost, given the true value y
Standard choices for regression:
squared loss ℓ(ŷ, y) = (ŷ − y)²
absolute loss ℓ(ŷ, y) = |ŷ − y|
Standard choice for classification: 0/1 loss
ℓ(ŷ, y) = 0 if ŷ = y, 1 if ŷ ≠ y
or a more general loss matrix L ∈ R+^{|Y|×|Y|}, where ℓ(ŷ, y) = L_{ŷ,y}
SLIDE 8

Overview of ML concepts

Risk of a predictor

Usually, consider a parametric function f(x; Θ)
E.g., linear function: f(x; w, b) = w · x + b
Fundamental assumption: example x / label y are drawn from a joint probability distribution p(x, y).
The ultimate goal is to minimize the expected loss, also known as risk:
R(Θ) = E_{(x0,y0)∼p(x,y)} [ℓ(f(x0; Θ), y0)]
SLIDE 9

Overview of ML concepts

Learning by empirical risk minimization

R(Θ) = E_{(x0,y0)∼p(x,y)} [ℓ(f(x0; Θ), y0)]
Further assumption: data are i.i.d.: same (unknown!) distribution for all pairs (x, y) in both training and test data.
Can't find argmin_Θ R(Θ), but can try to minimize the empirical risk (empirical loss) on the training set:
argmin_Θ L(Θ, X, y) = argmin_Θ (1/N) Σ_{i=1}^N ℓ(f(xi; Θ), yi)
To the extent that the training set is representative of p(x, y), the empirical loss serves as a proxy for the true risk.
Technically: estimate p(x, y) by the empirical distribution of the data.
SLIDE 10

Overview of ML concepts

Learning via empirical loss minimization

Two steps:
Select a restricted class F of hypotheses f : X → Y
E.g., linear functions parametrized by w: f(x; w) = w · x
Select a hypothesis f∗ ∈ F based on the training set (X, Y)
E.g., minimize empirical squared loss, i.e., select f(x; w∗) where
w∗ = argmin_w Σ_{i=1}^N (yi − w · xi)²
How do we find w∗?
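For the squared loss, the minimization above has a closed-form answer; a minimal numpy sketch (the toy data and variable names are ours, not from the slides):

```python
import numpy as np

# Toy data: N examples in d dimensions, labels from a hidden linear map plus noise
rng = np.random.default_rng(0)
N, d = 100, 3
X = rng.normal(size=(N, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=N)

# w* = argmin_w sum_i (y_i - w . x_i)^2, solved by linear least squares
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With little noise, w_star recovers the hidden w_true; the next slides develop the gradient-based alternative that generalizes beyond this closed form.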

SLIDE 11

Overview of ML concepts

Sources of error

Irreducible error (Bayes error): obtained even with the best possible mapping x → y
Approximation error: the model class does not contain the best possible mapping x → y
Estimation error: argmin_Θ L(Θ, X, Y) ≠ argmin_Θ R(Θ)
Optimization error: our optimization algorithm fails to find argmin_Θ L(Θ, X, Y)
SLIDE 12

Linear classification

Linear classifiers

ŷ = h(x) = sign(w · x + b)
Classifying using a linear decision boundary effectively reduces the data dimension to 1.
Need to find w (direction) and b (location) of the boundary
Want to minimize the expected zero/one loss for classifier h : X → Y, which for (x, y) is
L(h(x), y) = 0 if h(x) = y, 1 if h(x) ≠ y.
SLIDE 13

Linear classification

Surrogate loss and class scores

Minimizing 0/1 loss is not tractable
even approximation is NP-hard when the data are not linearly separable
Instead, we minimize surrogate loss functions (typically convex, differentiable upper bound on 0/1 loss)
Basic setup: the classifier outputs scores f ∈ R^C which can be converted to a prediction and used to calculate loss
Obvious conversion rule: ŷ(x; Θ) = argmax_c fc(x; Θ)
SLIDE 14

Linear classification

Hinge loss

Linear penalty for violating a fixed classification margin:
ℓ(f(x; Θ), y) = max{0, max_{c≠y} [fc + 1 − fy]}
Best known for the binary case, y ∈ {±1}, used in SVM:
ℓ(f(x; Θ), y) = max{0, 1 − y · f}
Hard to get to work with multi-class cases; often better to set up many binary tasks (such as one-vs-all)
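The multi-class margin penalty above can be computed directly; a minimal numpy sketch (function name is ours):

```python
import numpy as np

def multiclass_hinge(scores, y):
    """Margin loss max{0, max_{c != y}[f_c + 1 - f_y]} for one example.

    scores: 1-D array of per-class scores f; y: index of the true class.
    """
    margins = scores + 1.0 - scores[y]   # f_c + 1 - f_y for every class c
    margins[y] = 0.0                     # exclude the true class from the max
    return max(0.0, margins.max())

scores = np.array([2.0, 5.0, 1.0])
```

The loss is zero exactly when the true class beats every other class by at least the margin of 1.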

SLIDE 15

Linear classification

Cross-entropy loss

Associate a probability model for the posterior:
p̂(y|x; Θ) = (1/Z(x; Θ)) exp(fy(x; Θ))
Log-loss: ℓ(fy(x; Θ), y) = − log p̂(y|x; Θ)
If we represent the target labels as a distribution
q(y = c) = 1 if c = y, 0 otherwise
we can write ℓ as the cross-entropy between p̂ and q:
ℓ(fy(x; Θ), y) = − Σ_c q(y = c) log p̂(y = c|x)
SLIDE 16

Linear classification

Softmax model

p̂(y|x; Θ) = exp(fy(x; Θ)) / Σ_c exp(fc(x; Θ))
General (multi-class) form of logistic regression: model fc(x) = wc · x + bc
Over-parameterized: can set wC = 0, bC = 0
For C = 2, this is identical to logistic regression
The boundaries between classes are linear
Note: for prediction, no need to exponentiate and normalize, just take argmax_c fc(x)

SLIDE 18

Linear classification

Softmax parameterization

p(y = c | x) = exp(wc · x − a) / Σ_{k=1}^C exp(wk · x − a)
The posteriors are invariant to shifting the scores
A common problem: overflow in exp(wc · x)
Solution: subtract a = max_c wc · x
Then the max score is 0 and the rest are negative; underflow is OK (some terms may turn to zero)
Example: scores = [1000, 995, 10, 10, 1]
Naïve exponentiation: ≈ [∞, ∞, 2.2e4, 2.2e4, 2.7]
After shifting the dynamic range: ≈ [1, 0.007, 0, 0, 0]

SLIDE 19

Linear classification

Softmax gradient

fc(x) = wc · x + bc
Posterior from scores: p̂(y = c|x) = exp(fc(x)) / Σ_j exp(fj(x))
Cross-entropy loss on a single example (x, y):
− log p̂(y|x) = −fy(x) + log Σ_j exp(fj(x))
Gradient of the loss on a single example:
∇wc L(x, y) = x (−1 + exp(fy)/Σ_j exp(fj)) if c = y,  x exp(fc)/Σ_j exp(fj) if c ≠ y
∇bc L(x, y) = −1 + exp(fy)/Σ_j exp(fj) if c = y,  exp(fc)/Σ_j exp(fj) if c ≠ y
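The case analysis above collapses to "posterior minus one-hot target"; a minimal numpy sketch (function name is ours):

```python
import numpy as np

def softmax_grads(W, b, x, y):
    """Gradients of the cross-entropy loss for one example (x, y).

    Implements grad_{w_c} L = x * (p_c - [c == y]) and grad_{b_c} L = p_c - [c == y],
    which is the per-case formula from the slide written compactly.
    """
    f = W @ x + b                      # per-class scores f_c
    p = np.exp(f - f.max())
    p /= p.sum()                       # posterior p_hat(c | x), stably computed
    err = p.copy()
    err[y] -= 1.0                      # p_c - indicator(c == y)
    return np.outer(err, x), err       # dL/dW (C x d), dL/db (C,)
```

The true-class row gets p_y − 1 (negative, pushing its score up); the others get p_c.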

SLIDE 20

Linear classification Learning by gradient descent

Review: Gradient descent

Iteration counter t = 0
Initialize Θ(0) (to zero or a small random vector)
for t = 1, . . .:
compute gradient on data (X, Y): g(t)(X, Y) = ∇Θ f(X, Y; Θ(t−1))
update model: Θ(t) = Θ(t−1) − η g(t)
check for convergence (what does this mean?)
The learning rate η controls the step size
SLIDE 21

Linear classification Learning by gradient descent

Running gradient descent

An epoch: a single pass through the training set
A good idea: randomize the order of examples
Single "iteration" t is an epoch:
loop over examples (or in parallel), computing g(t)(xi, yi)
accumulate the gradient g(t)(X, Y) = (1/N) Σ_i g(t)(xi, yi)
make a single update at the end of the epoch.
Assuming N is large, g(t)(X, Y) is a good estimate of the gradient, but it costs a lot to compute.
SLIDE 22

Linear classification Learning by gradient descent

Stochastic gradient descent: intuition

Computing the gradient on all N examples is expensive and may be wasteful: many data points provide similar information
Idea: present examples one at a time, and pretend that the gradient on the entire set is the same as the gradient on one example
Formally: estimate the gradient of the loss by that on a single example:
(1/N) Σ_{i=1}^N ∇Θ L(yi, xi; Θ) ≈ ∇Θ L(yt, xt; Θ)
Mini-batch version: for some B ⊂ [N], |B| ≪ N,
(1/N) Σ_{i=1}^N ∇Θ L(yi, xi; Θ) ≈ (1/|B|) Σ_{t∈B} ∇Θ L(yt, xt; Θ)
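The mini-batch scheme above, as a minimal numpy sketch (function names, the least-squares example, and hyperparameter values are ours):

```python
import numpy as np

def sgd(grad_fn, w0, X, Y, lr=0.1, batch_size=20, epochs=50, seed=0):
    """Mini-batch SGD: each epoch, shuffle the data and step on one batch at a time."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)                 # shuffle each epoch
        for start in range(0, n, batch_size):
            b = order[start:start + batch_size]    # indices of the current batch B
            w -= lr * grad_fn(w, X[b], Y[b])       # w <- w - eta * batch gradient
    return w

# Example objective: least squares; batch gradient of (1/|B|) sum (w.x_i - y_i)^2
def lsq_grad(w, Xb, yb):
    return 2.0 / len(Xb) * Xb.T @ (Xb @ w - yb)
```

Each epoch makes N/|B| parameter updates instead of one, at roughly the cost of a single full-gradient pass.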

SLIDE 23

Linear classification Learning by gradient descent

Stochastic gradient descent

An incremental algorithm:
Present examples (xi, yi) one at a time,
Modify w slightly to increase the log-probability of the observed yi:
w := w + η ∂/∂w log p(yi | xi; w)
where the learning rate η determines how "slightly".
An epoch (full pass through the data) contains N updates instead of one
Good practice: shuffle the data each epoch
SLIDE 24

Linear classification Learning by gradient descent

Gradient check

When implementing gradient-based methods: always include a numerical gradient check (gradcheck)
Numerical approximation of the partial derivative:
∂f(x)/∂xj ≈ [f(x + δej) − f(x − δej)] / (2δ)
note: this is better than the non-centered [f(x + δej) − f(x)] / δ
Can compute this for each parameter in a model, with δ ≈ 1e−6
SLIDE 25

Linear classification Learning by gradient descent

Gradient check: tips

Make sure to use double precision
Run on a few data points, at random points in the parameter space
caveat: may be important to run around important points, e.g., during convergence
Find a way to run on a subset of parameters
but be careful how you select them: a subset of weights for each class is OK; weights for a subset of classes is not OK
SLIDE 26

Linear classification Learning by gradient descent

Gradient check evaluation

Suppose you get the gradient vector g from the (analytic) calculation in your code, and g′ from gradcheck.
A good value to look at:
max_i |gi − g′i| / max(|gi|, |g′i|)
Suggested by Andrej Karpathy, who says:
relative error > 1e-2 usually means the gradient is probably wrong
1e-2 > relative error > 1e-4 should make you feel uncomfortable
1e-4 > relative error is usually okay for objectives with kinks. But if there are no kinks [soft objective], then 1e-4 is too high.
1e-7 and less: you should be happy.
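The centered-difference check and the relative-error criterion above, as a minimal numpy sketch (function names are ours):

```python
import numpy as np

def gradcheck(f, grad_f, x, delta=1e-6):
    """Compare an analytic gradient to centered differences; return max relative error."""
    g = grad_f(x)
    g_num = np.zeros_like(x)
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = delta
        g_num[j] = (f(x + e) - f(x - e)) / (2 * delta)   # centered difference
    denom = np.maximum(np.abs(g), np.abs(g_num))
    denom[denom == 0] = 1.0            # avoid 0/0 where both gradients vanish
    return np.max(np.abs(g - g_num) / denom)

# Smooth test objective: f(x) = sum(x^2), with analytic gradient 2x
err = gradcheck(lambda x: np.sum(x ** 2), lambda x: 2 * x, np.array([1.0, -2.0, 3.0]))
```

For this kink-free objective the relative error should land well below 1e-6, per the guidance above.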

SLIDE 27

Deep learning: introduction

Feature functions

Machine learning relies almost entirely on linear predictors
But often applied to non-linear features of the data
Feature transform: φ : X → R^d
fy(x; w, b) = wy · φ(x) + by
Shallow learning: hand-crafted, non-hierarchical φ.
Basic example: polynomial regression, φj(x) = x^j, j = 0, . . . , d, ŷ = w · φ(x)
Kernel SVM: employing kernel K corresponds to (some) feature space such that K(xi, xj) = φ(xi) · φ(xj); SVM is just a linear classifier in that space.
SLIDE 28

Deep learning: introduction

Shallow learning in vision

Image classification with spatial pyramids: φ is based on (1) computing SIFT descriptors over a set of points, (2) clustering the descriptors, (3) computing cluster-assignment histograms over various regions, (4) concatenating the histograms.
Deformable parts model: φ is based on a set of filters, with a linear classifier on top. No hierarchy.
SLIDE 29

Deep learning: introduction

Deep learning: definition

A system that employs a hierarchy of features of the input, learned end-to-end jointly with the predictor:
fy(x) = FL(FL−1(FL−2(· · · F1(x) · · · )))
Learning methods that are not deep:
SVMs
nearest neighbor classifiers
decision trees
perceptron
SLIDE 30

Deep learning: introduction

Power of two layers

Theoretical result [Cybenko, 1989]: a 2-layer net with linear output (sigmoid hidden units) can approximate any continuous function over a compact domain to arbitrary accuracy (given enough hidden units!)
Example: 3 hidden units with tanh(z) = (e^{2z} − 1)/(e^{2z} + 1) activation
[figure from Bishop]
SLIDE 31

Deep learning: introduction

Intuition: advantages of depth

What can we gain from depth?
Example: parity of n-bit numbers, with AND, OR, NOT, XOR gates
Trivial shallow architecture: express parity as DNF or CNF
but need an exponential number of gates!
Deep architecture: a tree of XOR gates
SLIDE 32

Deep learning: introduction

Advantages of depth

Distributed representations through hierarchy of features [Y. Bengio]

SLIDE 33

Deep learning: introduction

History of deep learning

1950s: Perceptron (Rosenblatt)
1960s: first AI winter? Minsky and Papert
1970s-1980s: connectionist models; backprop
late 1980s: second AI winter
most of modern deep learning already discovered!
early 2000s: revival of interest (CIFAR groups)
ca. 2005: layer-wise pretraining of deep-ish nets
2010: progress in speech and vision with deep neural nets
2012: Krizhevsky et al. win ImageNet

SLIDE 35

Deep learning: introduction

Neural networks

General form of shallow linear classifiers: score is computed as fy(x; w, b) = wy · φ(x) + by
Representation as a neural network:
Weights w = [w1, . . . , wC], wc ∈ R^m; biases b = [b1, . . . , bC]
[diagram: inputs x1 . . . xd feed feature units φ1 . . . φm (plus φ0 ≡ 1), which feed output units y = 1 . . . C through weights wc,j and biases bc]
SLIDE 36

Deep learning: introduction

Two-layer network

[diagram: two-layer network; inputs x1 . . . xd connect to hidden units h through first-layer weights w(1)_{i,j} and biases b(1)_j; the hidden units connect to the output through second-layer weights w(2)_j and bias b(2)]
Idea: learn parametric features φj(x) = h(w(1)_j · x + b(1)_j) for some nonlinear function h
SLIDE 37

Deep learning: introduction

Feed-forward networks

Feedforward operation, from input x to output ŷ:
fy(x) = Σ_{j=1}^m w(2)_{j,y} h(Σ_{i=1}^d w(1)_{i,j} xi + b(1)_j) + b(2)_y
In matrix form: f(x) = W2 · h(W1 · x + b1) + b2
where h is applied elementwise; x ∈ R^d, W1 ∈ R^{m×d}, W2 ∈ R^{C×m}, b1 ∈ R^m, b2 ∈ R^C
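The matrix form above is a few lines of numpy (a minimal sketch; the function name and the choice of tanh are ours):

```python
import numpy as np

def forward(x, W1, b1, W2, b2, h=np.tanh):
    """Two-layer feedforward net: f(x) = W2 . h(W1 . x + b1) + b2."""
    a1 = W1 @ x + b1        # pre-activations of the hidden layer, shape (m,)
    z1 = h(a1)              # hidden activations, h applied elementwise
    return W2 @ z1 + b2     # linear output scores, shape (C,)

# Shapes: x in R^d, W1 (m x d), b1 (m,), W2 (C x m), b2 (C,)
```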

SLIDE 38

Deep learning: introduction

Learning a neural network

f(x) = W2 · h(W1 · x + b1) + b2
recall: p̂(y = c|x) = exp(fc(x)) / Σ_j exp(fj(x))
Softmax loss computed on f(x) vs. the true label y:
L(x, y) = − log p̂(y|x) = −fy(x) + log Σ_c exp(fc(x))
Learning the network: initialize, then run [stochastic] gradient descent, updating according to
∂L/∂W2, ∂L/∂b2, ∂L/∂W1, ∂L/∂b1

SLIDE 41

Deep learning: introduction

Chain rule review: vectors

Consider the chain (stage-wise) mapping
x ∈ R^d →(f) u ∈ R^m →(g) v ∈ R^c →(h) z ∈ R
Computing partial gradients:
∇v z = ∂z/∂v
∂z/∂ui = Σ_j (∂z/∂vj)(∂vj/∂ui)  ⇒  ∇u z = (∂v/∂u)′ ∇v z
∂z/∂xk = Σ_q (∂z/∂uq)(∂uq/∂xk)  ⇒  ∇x z = (∂u/∂x)′ ∇u z
SLIDE 42

Deep learning: introduction

Chain rule review: tensors

More generally, some of the variables are tensors:
X ∈ R^{d1×···×dx} →(f) U ∈ R^{m1×···×mu} →(g) v ∈ R^c →(h) z ∈ R
∇X z is a tensor of the same dimensions as X
Use a single index to indicate index tuples: e.g., if X is 3D, i = (i1, i2, i3), and (∇X z)i = ∂z/∂x_{i1,i2,i3}
Now, ∇X z = Σ_j (∇X Uj)(∇U z)j

SLIDE 45

Deep learning: introduction

Staged feedforward computation

To make derivations more convenient, we express the forward computation (x, y) → L in more detail:
x, W1, b1 → a1 = W1 · x + b1
a1 → z1 = h(a1)
z1, W2, b2 → a2 = W2 · z1 + b2
a2, y → L = −ey · a2 + log [1 · exp(a2)]
Now we have, e.g.,
∇z1 L = (∂a2/∂z1)′ ∇a2 L,
∇W1 L = Σ_j (∇W1 z1,j)(∇z1 L)j
What is ∇W1 z1,j like?

SLIDE 47

Backpropagation

Backpropagation: general network

General unit activation in a network (ignoring bias)
Unit t receives input from I(t) = {i1, . . . , iS}, sends to O(t) = {o1, . . . , oR}
at = Σ_{j∈I(t)} wjt zj,  zt = h(at)
The loss L depends on wjt only through at:
∂L/∂wjt = (∂L/∂at)(∂at/∂wjt) = (∂L/∂at) zj
[diagram: unit t with inputs zi1 . . . ziS (weights wi1,t . . . wiS,t) and outputs zo1 . . . zoR (weights wt,o1 . . . wt,oR), feeding scores f1 . . . fC and loss L]

SLIDE 49

Backpropagation

Backpropagation: general network

Starting with L, compute the backward (gradient) flow
Note: aj = Σ_{i∈I(j)} wij h(ai)
Notation: dt = ∂L/∂at
The backward flow comes to unit t from O(t):
dt = Σ_{o∈O(t)} (∂L/∂ao)(∂ao/∂at) = Σ_{o∈O(t)} do wt,o h′(at) = h′(at) Σ_{o∈O(t)} do wt,o
[diagram: as above]

SLIDE 50

Backpropagation

Multilayer networks

Consider a layer t with nt units:
zt = h(Wt zt−1 + bt)
where zt ∈ R^{nt}, bt ∈ R^{nt}, Wt ∈ R^{nt×nt−1}, and h is applied element-wise
Layer zero reads the input: z0 ≡ x
The last layer T produces a linear output zT = WT zT−1 + bT, which is used to predict / assess loss (a.k.a. f)

SLIDE 54

Backpropagation

Feed-forward pass

Compute:
a1 = W1 · x + b1,  z1 = h(a1)
a2 = W2 · z1 + b2,  z2 = h(a2)
. . .
aT−1 = WT−1 · zT−2 + bT−1,  zT−1 = h(aT−1)
aT = WT · zT−1 + bT,  zT = aT
Training: compute L(zT, y)
Testing: make the inference ŷ(x) = argmax_c zT,c

SLIDE 57

Backpropagation

Backward pass

The main backprop equations:
dt = h′(at) ∗ Σ_{j∈O(t)} dj wt,j,   ∂L/∂wi,t = dt zi
Compute gradient information, using the cached z and a:
dT = ∂L/∂aT
∇WT L = dT ⊗ zT−1
dT−1 = h′(aT−1) ∗ (d′T WT)
∇WT−1 L = dT−1 ⊗ zT−2
. . .
dt = h′(at) ∗ (d′t+1 Wt+1)
∇Wt L = dt ⊗ zt−1
u ⊗ v: outer product;  u ∗ v: element-wise product
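The equations above, instantiated for the two-layer softmax network of the earlier slides; a minimal numpy sketch (function name and the choice of tanh are ours):

```python
import numpy as np

def backprop(x, y, W1, b1, W2, b2):
    """Forward + backward pass for a two-layer net with tanh hidden units and softmax loss."""
    # Forward pass, caching a's and z's
    a1 = W1 @ x + b1
    z1 = np.tanh(a1)
    a2 = W2 @ z1 + b2                      # output scores f
    p = np.exp(a2 - a2.max())
    p /= p.sum()                           # softmax posterior
    loss = -np.log(p[y])
    # Backward pass: d_T = dL/da2 = p - e_y, then propagate down the layers
    d2 = p.copy(); d2[y] -= 1.0
    dW2 = np.outer(d2, z1)                 # grad_W L = d (outer) z_prev
    db2 = d2
    d1 = (1.0 - z1 ** 2) * (W2.T @ d2)     # h'(a1) * (W2' d2); tanh' = 1 - tanh^2
    dW1 = np.outer(d1, x)
    db1 = d1
    return loss, (dW1, db1, dW2, db2)
```

The gradcheck recipe from the earlier slides is the natural way to verify such an implementation.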

SLIDE 58

Backpropagation

Modularity

Basic building block of a neural network: a layer, which defines two directions of computation
Forward: pull activations from input units; compute and cache the "raw" activation a; compute and output the activation z = h(a)
Backward: collect gradient information d from output units; calculate the gradient w.r.t. a; calculate the gradients w.r.t. weights and biases
The only connection between layers is in communicating z and d
SLIDE 59

Backpropagation

Computational graph

When implementing backpropagation, existing software falls into two groups:
Numerical: the interface between layers handles numbers; computation must be fully specified
Torch, MatConvNet, Caffe
Symbolic: the interface includes derivatives (and intermediate stages) as first-class citizens; computation is specified by a computational graph
Theano, TensorFlow, Caffe2
[figure from Goodfellow et al]
SLIDE 60

Activation functions

Choice of non-linearity: 1970s-2010

sigmoid: h(a) = 1/(1 + exp(−a)), or h(a) = tanh(a)
Good: squash activations to a fixed range
Bad: gradient is nearly zero far away from the midpoint
∂L/∂a = (∂L/∂h(a)) (dh/da) ≈ 0
can make learning very, very slow
tanh (zero-centered) is preferable to sigmoid
SLIDE 61

Activation functions

2010: RELU

Intuition: make the non-linearity non-saturating, at least in part of the range
Rectified linear units: h(a) = max(0, a)
Good: non-saturating; cheap to compute; greatly speeds up convergence compared to sigmoid (by an order of magnitude)
SLIDE 62

Activation functions

RELU and dead units

Problem: if a RELU gets into a state in which all batches in an epoch have zero activation, the unit becomes stuck with zero gradient ("dies"). [A. Karpathy]
SLIDE 63

Activation functions

RELU variants

Many attempts to improve RELUs:
Leaky RELU: h(a) = max(αa, a)
Learning α: Parametric RELU
Exponential LU (ELU): h(a) = a if a ≥ 0, α(exp(a) − 1) if a < 0 [A. Karpathy]
ELUs are promising, but more expensive to compute
RELU is still the default choice; none of the variants are consistently better
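The variants above, vectorized in numpy (a minimal sketch; the α defaults are illustrative, not prescribed by the slides):

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)                          # h(a) = max(0, a)

def leaky_relu(a, alpha=0.01):
    return np.maximum(alpha * a, a)                    # small slope alpha for a < 0

def elu(a, alpha=1.0):
    return np.where(a >= 0, a, alpha * (np.exp(a) - 1.0))   # smooth negative branch

a = np.array([-2.0, 0.0, 3.0])
```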

SLIDE 64

Initialization

Random initialization

Non-convex objective; initialization is important
Can we initialize with all zeros? bad idea: all units will learn the same thing!
Can initialize weights with small random numbers, e.g., drawn from a Gaussian with zero mean, variance 0.01
Problem: the variance of the activation grows with the number of inputs [A. Karpathy]
SLIDE 65

Initialization

Xavier initialization

Idea: normalize the scale to provide roughly equal variance throughout the network
the Xavier initialization [Glorot et al]: if a unit has n inputs, draw from zero mean, variance 1/n [A. Karpathy]
SLIDE 66

Initialization

Initialization and RELU

Assumption behind Xavier: (1) linear activations, (2) zero mean activations. Breaks when using RELUs: [A. Karpathy]

SLIDE 67

Initialization

Kaiming initialization

Initialization scheme specifically for RELUs [He et al.]: zero mean, variance 2/n, where n is the number of inputs.
The Kaiming initialization is currently recommended for RELU units [A. Karpathy]
Note: still OK to initialize biases with zeros
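A minimal numpy sketch of the idea: with variance 2/n, the activation scale stays roughly constant through a stack of RELU layers (layer width and depth here are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 512                                              # units per layer (illustrative)
z = rng.normal(size=n)                               # unit-variance input

for _ in range(10):                                  # pass through 10 RELU layers
    W = rng.normal(size=(n, n)) * np.sqrt(2.0 / n)   # Kaiming: zero mean, variance 2/n
    z = np.maximum(0.0, W @ z)                       # RELU activation

# With 2/n the activation scale stays roughly constant across layers; with
# Xavier's 1/n, RELU would shrink it by roughly sqrt(2) per layer.
```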

SLIDE 68

Optimization tricks

Basic stochastic gradient descent

Hyperparameters: architecture, regularizer R, learning rate η, batch size B
Initialize weights and biases
Each epoch: shuffle the data, partition into batches, iterate over batches b:
w = w − η [∇w L(Xb, Yb) + ∇w R(w)]
We have covered initialization
Next: optimization
SLIDE 69

Optimization tricks

Learning rate

Generally, for convex functions, gradient descent will converge
Setting the learning rate η may be very important to ensure rapid convergence
[figure from LeCun et al., 1996]
SLIDE 70

Optimization tricks

Learning rate for neural networks

For deep networks, setting the right learning rate is crucial. Typical behaviors, monitoring training loss: [A. Karpathy]

SLIDE 71

Optimization tricks

Learning rate schedules

Generally, as with convex functions, we want the learning rate to decay with time
Could set up an automatic schedule, e.g., drop by a factor of α every β epochs.
Most common in practice: some degree of babysitting
start with a reasonable learning rate
monitor the training loss
drop the LR (typically by 1/10) when learning appears stuck
SLIDE 72

Optimization tricks

Monitoring training loss

Too expensive to evaluate on the entire training set frequently; instead, use a rolling average of the batch loss value
Typical behavior: the red line [Larsson et al.]
A few caveats:
wait a bit before dropping; remember that this is surrogate loss on training (monitor validation accuracy as a precaution)
better yet, drop the LR based on validation accuracy, not training loss
do a sanity check for loss values
Crashes due to NaNs etc. are often due to a high LR

SLIDE 74

Optimization tricks

Gradient Descent with Momentum

SGD (and GD) has trouble navigating ravines, i.e., areas where the surface curves much more steeply in one dimension than in another, which are common around local optima. [A. Karpathy]
SGD oscillates across the slopes of the ravine, only making hesitant progress towards the local optimum.
Momentum is a method that helps accelerate SGD (and GD) in the relevant direction and dampens oscillations:
∆wt = γ∆wt−1 + ηt ∇f(wt)
wt+1 = wt − ∆wt

SLIDE 76

Optimization tricks

Gradient Descent with Momentum

Essentially, when using momentum, we push a ball down a hill. The ball accumulates momentum as it rolls downhill, becoming faster and faster on the way (until it reaches its terminal velocity, if there is air resistance, i.e., γ < 1).
The momentum term increases for dimensions whose gradients point in the same direction, and reduces updates for dimensions whose gradients change directions.
Faster convergence, reduced oscillation [Goodfellow et al]
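The momentum update on a toy "ravine" (a quadratic much steeper in one coordinate); a minimal numpy sketch with illustrative hyperparameters:

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.01, gamma=0.9):
    """One momentum update: v <- gamma*v + lr*grad; w <- w - v."""
    v = gamma * v + lr * grad
    return w - v, v

# Ravine objective f(w) = 0.5*(100*w0^2 + w1^2); its gradient is [100*w0, w1]
w = np.array([1.0, 1.0])
v = np.zeros(2)
for _ in range(200):
    w, v = momentum_step(w, v, np.array([100.0 * w[0], w[1]]))
```

The accumulated velocity damps the oscillation along the steep w0 axis while speeding progress along the shallow w1 axis.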

SLIDE 77

Optimization tricks

AdaGrad

Intuition [Duchi et al.]: parameters (directions of the parameter space) are updated with varying frequency
Idea: reduce the learning rate in proportion to the updates
Maintain a cache si for each parameter θi; when updating:
si = si + (∂L/∂θi)²
wi = wi − η (∂L/∂θi) / (√si + ε)
Rarely used today (reduces the rate too aggressively)
SLIDE 78

Optimization tricks

RMSprop

Modified idea from AdaGrad ["published" in Hinton's Coursera slides]
The cached rate allows for "forgetting":
si = δ si + (1 − δ)(∂L/∂θi)²,  wi = wi − η (∂L/∂θi) / (√si + ε)
The decay rate δ is typically 0.9-0.99
SLIDE 79

Optimization tricks

Adam optimizer

Kind of like RMSprop with momentum [Kingma et al.]
First-order moment update for wi: mi = β1 mi + (1 − β1) ∂L/∂θi
Second-order: vi = β2 vi + (1 − β2)(∂L/∂θi)²
Parameter update: wi = wi − η m / (√v + ε)
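The updates above as a minimal numpy sketch (the bias-correction terms of the full published Adam algorithm are omitted, as on the slide; hyperparameter values are the common defaults):

```python
import numpy as np

def adam_step(w, m, v, grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style update as on the slide (bias correction omitted for brevity)."""
    m = beta1 * m + (1 - beta1) * grad           # first-order moment
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-order moment
    w = w - lr * m / (np.sqrt(v) + eps)          # per-parameter adaptive step
    return w, m, v
```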

SLIDE 80

Optimization tricks

Warm start

Suppose we want to continue training a network for more epochs
All significant platforms allow saving snapshots and resuming
Need to be careful with the learning rate: if re-initialized too high, we might lose our place in parameter space
Technical issues with momentum, Adam, etc.: need to save the relevant data in the snapshots to resume!
SLIDE 81

Regularization

Review: regularization

Main challenge in machine learning: overfitting
Bias-variance tradeoff: complex models reduce bias (approximation error), but increase variance (estimation error)
Optimization error: a source of concern when dealing with non-convex models
Bayes error: presumed low in vision tasks (?)
SLIDE 82

Regularization

Review: regularization

Regularization as a way to control the bias-variance tradeoff General form of regularized ERM for model class M:

min_{M ∈ M} Σ_i L(y_i, M(x_i)) + λ R(M)

For parametric models, the choice of M is determined by setting the value of some w ∈ R^D Most common form of the regularizer R: a norm Σ_d |w_d|^p (shrinkage)
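To make the shrinkage effect concrete, here is a hedged sketch (toy data, invented constants): ridge regression, i.e. the p = 2 penalty in closed form, where increasing λ pulls the weights toward zero.

```python
import numpy as np

# Shrinkage sketch: ridge regression (R(w) = ||w||^2) has the closed form
# w = (X^T X + lambda * I)^{-1} X^T y; larger lambda shrinks w toward zero.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.standard_normal(50)

def ridge(X, y, lam):
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

w_small = ridge(X, y, 0.01)     # near the unregularized solution
w_large = ridge(X, y, 1000.0)   # heavily shrunk toward zero
```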

Deep Learning Tutorial,Part I 67

slide-83
SLIDE 83

Regularization

Review: geometry of regularization

Can write the unconstrained optimization problem

min_w Σ_{i=1}^N − log p̂(y_i | x_i; w) + λ Σ_{j=1}^m |w_j|^p

as an equivalent constrained problem

min_w Σ_{i=1}^N − log p̂(y_i | x_i; w)   subject to   Σ_{j=1}^m |w_j|^p ≤ t

[Figure: contours of the loss ℓ around ŵ_ML; the L2 constraint region w_1^2 + w_2^2 ≤ t gives ŵ_ridge, while the L1 region |w_1| + |w_2| ≤ t gives ŵ_lasso at a corner]

p = 1 may lead to sparsity, p = 2 generally won’t

Deep Learning Tutorial,Part I 68

slide-84
SLIDE 84

Regularization

Effect of regularization

Cartoon of the effect of regularization on bias and variance:

[Figure from Bishop: (bias)^2, variance, and (bias)^2 + variance plotted against ln λ, together with test error; the test error minimum roughly tracks the minimum of (bias)^2 + variance] In practice, curves are less clean

Deep Learning Tutorial,Part I 69

slide-85
SLIDE 85

Regularization

Weight decay

In neural networks, L2 regularization is called weight decay Note: biases are normally not regularized Easy to incorporate weight decay into the gradient calculation: ∇_w λ‖w‖² = 2λw One more hyperparameter λ to tune; with large data sets it typically seems inconsequential
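The gradient form above shows why it is called "decay": with no data gradient, the update simply multiplies the weights by a factor below 1. A tiny sketch (constants invented):

```python
import numpy as np

# Weight decay sketch: the L2 penalty adds 2*lambda*w to the gradient,
# which amounts to shrinking ("decaying") the weights at every step.
lam, eta = 0.01, 0.1
w = np.array([1.0, -2.0, 3.0])
data_grad = np.zeros_like(w)          # pretend the data-loss gradient is zero

grad = data_grad + 2 * lam * w        # gradient of lambda * ||w||^2 is 2*lambda*w
w_new = w - eta * grad                # same as w * (1 - 2*eta*lam): pure decay
```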

Deep Learning Tutorial,Part I 70

slide-86
SLIDE 86

Regularization

Dropout

Part of overfitting in a neural net: unit co-adaptation Idea: prevent it by disrupting co-firing patterns Dropout [Srivastava et al.]: during each training iteration, randomly “remove” units, just for that update

Deep Learning Tutorial,Part I 71

slide-87
SLIDE 87

Regularization

Dropout as regularizer

With each particular dropout mask, we have a different network Interpretation: training an ensemble of networks with shared parameters [Goodfellow et al.]

Deep Learning Tutorial,Part I 72

slide-88
SLIDE 88

Regularization

Dropout: implementation

Dropout introduces a discrepancy between train and test Correction: if the survival rate of a unit is p, we must scale the outgoing weights of that unit (after training) by p Modern version, “inverted dropout”: scale activations by 1/p during training, do not scale at test time Typically employed in top layers; p = 0.5 is most common
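A minimal numpy sketch of inverted dropout (the function and its arguments are illustrative, not from the slides):

```python
import numpy as np

# Inverted-dropout sketch: drop units with probability 1-p during training
# and scale survivors by 1/p, so the expected activation is unchanged and
# no correction is needed at test time.
rng = np.random.default_rng(0)
p = 0.5                                   # survival probability

def dropout(h, p, train):
    if not train:
        return h                          # test time: identity, no rescaling
    mask = (rng.random(h.shape) < p) / p  # entries are 0 or 1/p
    return h * mask                       # E[output] == h

h = np.ones((1000,))
h_train = dropout(h, p, train=True)       # roughly half the units zeroed, rest = 2
h_test = dropout(h, p, train=False)       # identical to h
```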

Deep Learning Tutorial,Part I 73

slide-89
SLIDE 89

Convolutional networks

Sparse weight patterns

One way to regularize is to sparsify parameters: we can set a subset of weights to zero If the input has “spatial” semantics, we can keep only the weights for a contiguous set of inputs [Goodfellow et al.]

Deep Learning Tutorial,Part I 74

slide-90
SLIDE 90

Convolutional networks

Receptive field

Each unit in an upper layer is affected only by a subset of units in the lower layer, its receptive field [Goodfellow et al.] Conversely, each unit in a lower layer influences only a subset of units in the upper layer

Deep Learning Tutorial,Part I 75

slide-91
SLIDE 91

Convolutional networks

Receptive field growth

Very important: in locally connected networks the receptive field size w.r.t. the input grows with depth, even if it is fixed within each layer [Goodfellow et al.]
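The growth can be computed with the standard recurrence (a sketch; the function name is invented): each layer adds (k − 1) times the product of the strides below it.

```python
# Sketch: receptive field of a unit w.r.t. the input grows as we stack
# layers, even with a fixed filter size per layer.
def receptive_field(filter_sizes, strides):
    rf, jump = 1, 1                  # current RF size; input pixels per unit step
    for k, s in zip(filter_sizes, strides):
        rf += (k - 1) * jump         # each layer widens the field
        jump *= s                    # strides compound the widening
    return rf

# Three stacked 3x3 convolutions with stride 1 already see a 7x7 region:
rf3 = receptive_field([3, 3, 3], [1, 1, 1])
```

With stride 2 at each layer the same three 3x3 filters cover a 15x15 input region, which is why strided stacks grow the field so quickly.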

Deep Learning Tutorial,Part I 76

slide-92
SLIDE 92

Convolutional networks

Locally connected + parameter sharing

We can further reduce network complexity by tying the weights across all receptive fields in a layer [Goodfellow et al.] We have now introduced convolutional layers; weight sharing induces equivariance to translation!

Deep Learning Tutorial,Part I 77

slide-93
SLIDE 93

Convolutional networks

2D convolutions

Note: the filters are not flipped, so the operation is technically cross-correlation [Goodfellow et al.]

Deep Learning Tutorial,Part I 78

slide-94
SLIDE 94

Convolutional networks

Convolutional layer operation

Suppose the input to the layer (the output of the previous layer) has C channels: a W × H × C tensor We convolve only along the spatial dimensions: assuming valid convolution with a k × k × C filter (depth must match the number of channels!), we get a (W − k + 1) × (H − k + 1) × 1 activation map [A. Karpathy] If we have m filters, we get a (W − k + 1) × (H − k + 1) × m map as the output of the layer
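A direct (slow, loop-based) numpy sketch of this sizing, with invented dimensions, just to make the output shape concrete:

```python
import numpy as np

# Valid-convolution sizing sketch: a W x H x C input with m filters of
# size k x k x C yields a (W-k+1) x (H-k+1) x m activation map.
W, H, C, k, m = 32, 32, 3, 5, 16
x = np.random.randn(W, H, C)
filters = np.random.randn(m, k, k, C)

out = np.zeros((W - k + 1, H - k + 1, m))
for i in range(W - k + 1):
    for j in range(H - k + 1):
        patch = x[i:i + k, j:j + k, :]   # one k x k x C receptive field
        # dot every filter with the patch: one output value per filter
        out[i, j] = np.tensordot(filters, patch, axes=([1, 2, 3], [0, 1, 2]))
```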

Deep Learning Tutorial,Part I 79

slide-95
SLIDE 95

Convolutional networks

Implementation: efficient convolutions

Most common: convert convolution into matrix multiplication (parallel, GPU-friendly!) Suppose we have m filters k_1, . . . , k_m of size f × f, with c channels. Basic idea: pre-compute an index mapping im2col : Z ∈ R^{S×S×c} → M ∈ R^{(S−f+1)² × f²c} that maps every receptive field to a row of M Collect the filters as columns of K ∈ R^{f²c × m} Now simply compute MK + b and reshape to (S − f + 1) × (S − f + 1) × m For some cases more efficient implementations exist, e.g. FFT-based convolution (mainly for larger filters) and Winograd-style algorithms (for small filters such as 3 × 3 or 5 × 5) Most software uses 3rd-party (Nvidia, Nervana) implementations under the hood
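The im2col idea in a few lines of numpy (dimensions invented; a naive sketch of the index mapping, not an optimized implementation):

```python
import numpy as np

# im2col sketch: unroll every f x f x c receptive field into a row of M,
# stack the filters as columns of K, and convolve via one matmul.
S, c, f, m = 8, 3, 3, 4
Z = np.random.randn(S, S, c)
filters = np.random.randn(m, f, f, c)

n = S - f + 1                                    # output side (valid conv)
M = np.zeros((n * n, f * f * c))
for i in range(n):
    for j in range(n):
        M[i * n + j] = Z[i:i + f, j:j + f, :].ravel()   # one field per row

K = filters.reshape(m, -1).T                     # f*f*c x m, one filter per column
out = (M @ K).reshape(n, n, m)                   # same result as direct convolution
```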

Deep Learning Tutorial,Part I 80

slide-96
SLIDE 96

Convolutional networks

Conv layer sizing

If we simply rely on valid convolutions, the maps will quickly shrink [Goodfellow et al.] Instead, we usually pad with zeros [Goodfellow et al.] Usual padding is symmetric, with (f − 1)/2 zeros on each side: “same” convolution in Matlab speak

Deep Learning Tutorial,Part I 81

slide-97
SLIDE 97

Convolutional networks

Convolution size

Two extreme cases: Filter size equal to the output map size of the previous layer ⇒ fully connected layer (more on this later) Filter size 1 × 1 ⇒ the layer simply computes a (non-linear) projection of the features computed by the previous layer [A. Karpathy]

Deep Learning Tutorial,Part I 82

slide-98
SLIDE 98

Convolutional networks

Stride and size

Convolution with stride > 1 is a cheap way to reduce the spatial dimension of the output Could be implemented as convolution followed by downsampling (as in LeNet), but that is wasteful! [Goodfellow et al.] Modern implementations explicitly specify stride [Goodfellow et al.] Note: this matches the matrix-multiplication implementation of convolution well!
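Combining filter size, padding, and stride gives the standard sizing formula (a sketch; the function name is invented, and frameworks take the floor when the stride does not divide evenly):

```python
# Standard conv sizing: with input width W, filter size f, padding p on
# each side, and stride s, the output width is floor((W - f + 2p)/s) + 1.
def conv_output_size(W, f, p, s):
    return (W - f + 2 * p) // s + 1

valid = conv_output_size(32, 5, 0, 1)    # valid convolution: 28
same = conv_output_size(32, 5, 2, 1)     # "same" padding (f-1)/2: 32
strided = conv_output_size(224, 7, 3, 2) # e.g. a 7x7/stride-2 stem: 112
```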

Deep Learning Tutorial,Part I 83

slide-99
SLIDE 99

Convolutional networks

Pooling

Pooling applies a non-parameterized operation to a receptive field Most common operations: max and average Typically, pooling is applied with stride > 1 to reduce spatial resolution but a stride of 1 is also possible!
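A minimal max-pooling sketch in numpy (function and constants invented for illustration):

```python
import numpy as np

# 2x2 max pooling with stride 2: a non-parameterized reduction that
# halves each spatial dimension.
x = np.arange(16, dtype=float).reshape(4, 4)

def max_pool(x, k=2, s=2):
    n = (x.shape[0] - k) // s + 1
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            out[i, j] = x[i * s:i * s + k, j * s:j * s + k].max()
    return out

pooled = max_pool(x)   # 4x4 input -> 2x2 output of block maxima
```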

Deep Learning Tutorial,Part I 84

slide-100
SLIDE 100

Convolutional networks

Maxout pooling

Idea: pool over feature channels, not spatially Introduces invariance w.r.t. a family of filters [Goodfellow et al.]
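Pooling across channel groups can be sketched with a reshape (dimensions and group size invented for the demo):

```python
import numpy as np

# Maxout sketch: take the max over groups of feature channels (not
# spatially); the output is invariant to which filter in the group "won".
x = np.random.randn(8, 8, 6)           # H x W x 6 channels
g = 3                                  # group size: 6 channels -> 2 outputs

H, W, C = x.shape
maxout = x.reshape(H, W, C // g, g).max(axis=3)   # H x W x 2
```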

Deep Learning Tutorial,Part I 85

slide-101
SLIDE 101

Convolutional networks

Invariances learned by the network

Invariance to translations: due to pooling As we stack layers: invariance to small deformations Invariance to lighting: if we normalize the input Invariance to rotations? only as present in the data Invariance to scale? ditto

Deep Learning Tutorial,Part I 86

slide-102
SLIDE 102

Convolutional networks

Case study: LeNet

LeCun et al., 1998

Deep Learning Tutorial,Part I 87

slide-103
SLIDE 103

Convolutional networks

AlexNet

Krizhevsky et al., 2012

Deep Learning Tutorial,Part I 88

slide-104
SLIDE 104

Convolutional networks

VGG-16

Simonyan and Zisserman, 2014 [blog.heuritech.com]

Deep Learning Tutorial,Part I 89

slide-105
SLIDE 105

Convolutional networks

Visualization

Retrieving maximally activating receptive fields [Girshick et al.]:

Deep Learning Tutorial,Part I 90