Deep Learning Tutorial Part I Greg Shakhnarovich TTI-Chicago
December 2016
Deep Learning Tutorial,Part I 1
Overview
Goals of the tutorial: a somewhat organized overview of basics, and some more advanced topics. Demystify jargon. Pointers for informed further learning. Aimed mostly at vision practitioners, but the tools are widely applicable beyond vision. Assumes basic familiarity with machine learning.
Overview
Connections to the brain; deep learning outside of neural networks; many recent advances; many specialized architectures for vision tasks.
Overview
Introduction (3 hours): review of relevant machine learning concepts; feedforward neural networks and backpropagation; optimization techniques and issues; complexity and regularization in neural networks; intro to convolutional networks. Advanced (3 hours): advanced techniques for learning DNNs; convnets for tasks beyond image classification; very deep networks; recurrent networks.
Overview
Stanford CS231N: Convolutional Neural Networks for Visual Recognition, Andrej Karpathy, Justin Johnson et al. (2016 edition), vision.stanford.edu/teaching/cs231n. Deep Learning by Ian Goodfellow, Aaron Courville and Yoshua Bengio, 2016. Chris Olah: Understanding LSTM Networks (blog post), colah.github.io/posts/2015-08-Understanding-LSTMs. Papers on arXiv and slides by the authors.
Overview of ML concepts
Input data space X. Output (label, target) space Y. Image classification: X = {natural images}, Y = {cat, dog, boat, . . .}. Unknown function f : X → Y. Scenario: given a labeled training set (x_i, y_i), i = 1, . . . , N, with x_i ∈ X, y_i ∈ Y. Goal: for any future x, accurately predict y; in other words, learn a mapping f : X → Y.
Overview of ML concepts
A loss function ℓ : Y × Y → R maps a prediction ŷ to a cost, given the true value y.
Standard choices for regression: squared loss ℓ(ŷ, y) = (ŷ − y)^2; absolute loss ℓ(ŷ, y) = |ŷ − y|.
Standard choice for classification: 0/1 loss ℓ(ŷ, y) = 0 if ŷ = y, and 1 otherwise.
More generally, a loss matrix L can encode class-dependent costs, with ℓ(ŷ, y) = L_{ŷ,y}.
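These losses are one-liners; a minimal NumPy sketch (the 3-class loss matrix `L` at the end is an illustrative assumption, chosen so that it reproduces the 0/1 loss):

```python
import numpy as np

def squared_loss(y_hat, y):
    # l(y_hat, y) = (y_hat - y)^2
    return (y_hat - y) ** 2

def absolute_loss(y_hat, y):
    # l(y_hat, y) = |y_hat - y|
    return abs(y_hat - y)

def zero_one_loss(y_hat, y):
    # 0 if the prediction is correct, 1 otherwise
    return 0 if y_hat == y else 1

# A general loss matrix with l(y_hat, y) = L[y_hat, y];
# all-ones minus identity recovers the 0/1 loss for 3 classes.
L = 1 - np.eye(3)
```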
Overview of ML concepts
Usually, consider a parametric function f(x; Θ). E.g., a linear function: f(x; w, b) = w · x + b. Fundamental assumption: example x / label y pairs are drawn from a joint probability distribution p(x, y). The ultimate goal is to minimize the expected loss, also known as risk:
R(Θ) = E_{(x0,y0)∼p(x,y)} [ℓ(f(x0; Θ), y0)]
Overview of ML concepts
R(Θ) = E_{(x0,y0)∼p(x,y)} [ℓ(f(x0; Θ), y0)]
Further assumption: data are i.i.d.: same (unknown!) distribution for all pairs (x, y) in both training and test data. Can't find argmin_Θ R, but can try to minimize the empirical risk (empirical loss) on the training set:
argmin_Θ L(Θ, X, y) = argmin_Θ (1/N) Σ_{i=1}^N ℓ(f(x_i; Θ), y_i)
To the extent that the training set is representative of p(x, y), the empirical loss serves as a proxy for the true risk. Technically: estimate p(x, y) by the empirical distribution of the data.
Overview of ML concepts
Two steps: Select a restricted class F of hypotheses f : X → Y. E.g., linear functions parametrized by w: f(x; w) = w · x. Select a hypothesis f* ∈ F based on the training set (X, Y). E.g., minimize the empirical squared loss, i.e., select f(x; w*) where
w* = argmin_w Σ_{i=1}^N (y_i − w · x_i)^2
How do we find w*?
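For the squared loss, w* has a closed form via the normal equations, w* = (XᵀX)⁻¹Xᵀy. A small sketch on synthetic, noiseless data (the data and w_true are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 3
X = rng.normal(size=(N, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                       # noiseless targets, for illustration

# Normal equations: solve (X^T X) w = X^T y for w
w_star = np.linalg.solve(X.T @ X, X.T @ y)
```

With noiseless targets the empirical minimizer recovers w_true exactly (up to floating point).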
Overview of ML concepts
Irreducible error (Bayes error): incurred even with the best possible mapping x → y. Approximation error: the model class does not contain the best possible mapping x → y. Estimation error: argmin_Θ L(Θ, X, Y) ≠ argmin_Θ R(Θ); the empirical minimizer differs from the true risk minimizer. Optimization error: our optimization algorithm fails to find argmin_Θ L(Θ, X, Y).
Linear classification
ŷ = h(x) = sign(w · x + b)
Classifying using a linear decision boundary effectively reduces the data dimension to 1. Need to find w (direction) and b (location) of the boundary. Want to minimize the expected zero/one loss for a classifier h : X → Y, which for (x, y) is L(h(x), y) = 0 if h(x) = y, 1 if h(x) ≠ y.
Linear classification
Minimizing 0/1 loss is not tractable even approximation is NP-hard when the data are not linearly separable Instead, we minimize surrogate loss functions (typically convex, differentiable upper bound on 0/1 loss) Basic setup: the classifier outputs scores f ∈ RC which can be converted to prediction and used to calculate loss Obvious conversion rule: ˆ y(x; Θ) = argmaxc fc(x; Θ)
Linear classification
Linear penalty for violating a fixed classification margin (multi-class hinge loss):
ℓ(f(x; Θ), y) = max_{c≠y} max{0, f_c + 1 − f_y}
Binary case, with y ∈ {−1, +1} and a single score f(x):
ℓ(f(x; Θ), y) = max{0, 1 − y · f(x)}
Hard to get to work in multi-class cases; often better to set up many binary tasks (such as one-vs-all).
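Both hinge losses can be sketched in a few lines (function names are illustrative; the margin of 1 follows the formula above):

```python
import numpy as np

def multiclass_hinge(f, y):
    # max over c != y of max{0, f_c + 1 - f_y}
    margins = f - f[y] + 1.0
    margins[y] = 0.0                  # exclude the true class
    return max(0.0, margins.max())

def binary_hinge(f, y):
    # y in {-1, +1}, f a real-valued score
    return max(0.0, 1.0 - y * f)
```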
Linear classification
Associate a probability model for the posterior:
p̂(y|x; Θ) = (1/Z(x; Θ)) exp(f_y(x; Θ)), where Z(x; Θ) = Σ_c exp(f_c(x; Θ))
Log-loss: ℓ(f_y(x; Θ), y) = −log p̂(y|x; Θ)
If we represent the target label as a distribution q(y = c) = 1 if c = y, 0 otherwise, we can write ℓ as the cross-entropy between p̂ and q:
ℓ(f_y(x; Θ), y) = −Σ_c q(y = c) log p̂(y = c|x)
Linear classification
p̂(y|x; Θ) = exp(f_y(x; Θ)) / Σ_c exp(f_c(x; Θ))
General (multi-class) form of logistic regression: model f_c(x) = w_c · x + b_c. Over-parameterized: can set w_C = 0, b_C = 0. For C = 2, this is identical to standard (binary) logistic regression. The boundaries between classes are linear. Note: for prediction, no need to exponentiate and normalize; just take argmax_c f_c(x).
Linear classification
p(y = c | x) = e^{w_c·x − a} / Σ_{k=1}^C e^{w_k·x − a}
The posteriors are invariant to shifting the scores. A common problem: overflow in exp(w_c · x). Solution: subtract a = max_c w_c · x. Then the max score is 0 and the rest are negative; underflow is OK (some may turn to zero). Example: scores = [1000, 995, 10, 10, 1]. Naïve exponentiation: ≈ [∞, ∞, 2.2e4, 2.2e4, 2.7]. After shifting: ≈ [1, 0.007, 0, 0, 0].
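The max-shift trick is a one-line fix; a minimal sketch, run on the example scores from the slide:

```python
import numpy as np

def softmax(scores):
    # Shift by the max score: the posteriors are invariant, and exp()
    # no longer overflows (the largest exponent is exactly 0).
    a = scores.max()
    e = np.exp(scores - a)
    return e / e.sum()

p = softmax(np.array([1000.0, 995.0, 10.0, 10.0, 1.0]))
```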
Linear classification
f_c(x) = w_c · x + b_c
Posterior from scores: p̂(y = c|x) = exp(f_c(x)) / Σ_j exp(f_j(x))
Cross-entropy loss on a single example (x, y):
−log p̂(y|x) = −f_y(x) + log Σ_j exp(f_j(x))
Gradient of the loss on a single example:
∇_{w_c} L(x, y) = (exp(f_c)/Σ_j exp(f_j) − 1) x if c = y, (exp(f_c)/Σ_j exp(f_j)) x if c ≠ y
∇_{b_c} L(x, y) = exp(f_c)/Σ_j exp(f_j) − 1 if c = y, exp(f_c)/Σ_j exp(f_j) if c ≠ y
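These gradients reduce to "posterior minus indicator". A sketch (the function name is illustrative; the internal max-shift does not change the posteriors):

```python
import numpy as np

def softmax_loss_grads(x, y, W, b):
    # Scores f_c = w_c . x + b_c, posteriors p, loss -log p_y
    f = W @ x + b
    e = np.exp(f - f.max())
    p = e / e.sum()
    loss = -np.log(p[y])
    # dL/df_c = p_c - [c == y]; the chain rule gives W and b gradients
    df = p.copy()
    df[y] -= 1.0
    grad_W = np.outer(df, x)          # row c is df_c * x
    grad_b = df
    return loss, grad_W, grad_b
```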
Linear classification Learning by gradient descent
Iteration counter t = 0; initialize Θ(0) (to zero or a small random vector)
for t = 1, . . .:
compute the gradient on the data (X, Y): g(t)(X, Y) = ∇_Θ L(Θ(t−1), X, Y)
update the model: Θ(t) = Θ(t−1) − η g(t)
check for convergence (what does this mean?)
The learning rate η controls the step size.
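A minimal sketch of the loop above (interpreting "convergence" as the update becoming tiny, which is one of several reasonable choices; the quadratic objective is illustrative):

```python
import numpy as np

def gradient_descent(grad, theta0, eta=0.1, tol=1e-8, max_iter=10000):
    # Plain (batch) gradient descent with a fixed learning rate eta;
    # stop when the update step becomes smaller than tol.
    theta = theta0.astype(float)
    for _ in range(max_iter):
        step = eta * grad(theta)
        theta -= step
        if np.linalg.norm(step) < tol:
            break
    return theta

# Minimize f(theta) = ||theta - 3||^2, whose gradient is 2(theta - 3)
theta = gradient_descent(lambda t: 2 * (t - 3.0), np.zeros(2))
```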
Linear classification Learning by gradient descent
An epoch: a single pass through the training set. A good idea: randomize the order of examples. A single "iteration" t is an epoch: loop over examples (or in parallel) computing g(t)(x_i, y_i); accumulate the gradient g(t)(X, Y) = (1/N) Σ_i g(t)(x_i, y_i); make a single update at the end of the epoch. Assuming N is large, g(t)(X, Y) is a good estimate of the gradient, but it costs a lot to compute.
Linear classification Learning by gradient descent
Computing the gradient on all N examples is expensive and may be wasteful: many data points provide similar information. Idea: present examples one at a time, and pretend that the gradient on a single example approximates the full gradient. Formally: estimate the gradient of the loss by that on a single example:
(1/N) Σ_{i=1}^N ∇_Θ L(y_i, x_i; Θ) ≈ ∇_Θ L(y_t, x_t; Θ)
Mini-batch version: for some B ⊂ [N], |B| ≪ N,
(1/N) Σ_{i=1}^N ∇_Θ L(y_i, x_i; Θ) ≈ (1/|B|) Σ_{t∈B} ∇_Θ L(y_t, x_t; Θ)
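A quick numerical illustration of the mini-batch approximation, on an assumed toy regression problem with the squared loss:

```python
import numpy as np

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
N, d = 10000, 5
X = rng.normal(size=(N, d))
y = X @ np.ones(d)
w = np.zeros(d)

def batch_grad(Xb, yb, w):
    # Gradient of the mean squared loss (1/|B|) sum_i (w.x_i - y_i)^2
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

full_grad = batch_grad(X, y, w)               # over all N examples
B = rng.choice(N, size=256, replace=False)    # a mini-batch, |B| << N
mini_grad = batch_grad(X[B], y[B], w)
```

The mini-batch gradient is a noisy but much cheaper estimate of the full one (256 examples instead of 10000 here).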
Linear classification Learning by gradient descent
An incremental algorithm: after seeing example (x_i, y_i), slightly update the weights:
w := w + η ∂/∂w log p(y_i | x_i; w)
where the learning rate η determines how "slightly". An epoch (full pass through the data) now contains N updates instead of one. Good practice: shuffle the data each epoch.
Linear classification Learning by gradient descent
When implementing gradient-based methods: always include a numerical gradient check (gradcheck). Numerical approximation of the partial derivative:
∂f(x)/∂x_j ≈ (f(x + δe_j) − f(x − δe_j)) / (2δ)
Note: this (centered) version is better than the non-centered (f(x + δe_j) − f(x)) / δ. Can compute this for each parameter in a model, with δ ≈ 1e-6.
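A minimal gradcheck sketch, including the relative-error measure discussed two slides below (the cubic test function is an illustrative assumption):

```python
import numpy as np

def numerical_grad(f, x, delta=1e-6):
    # Centered differences: (f(x + d e_j) - f(x - d e_j)) / (2 d)
    g = np.zeros_like(x)
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = delta
        g[j] = (f(x + e) - f(x - e)) / (2 * delta)
    return g

def relative_error(g, g_num):
    # max_i |g_i - g'_i| / max(|g_i|, |g'_i|)
    denom = np.maximum(np.abs(g), np.abs(g_num))
    denom[denom == 0] = 1.0           # both zero: the error is zero
    return np.max(np.abs(g - g_num) / denom)

f = lambda x: np.sum(x ** 3)          # analytic gradient: 3 x^2
x = np.array([1.0, -2.0, 0.5])
err = relative_error(3 * x ** 2, numerical_grad(f, x))
```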
Linear classification Learning by gradient descent
Make sure to use double precision. Run on a few data points, at random points in the parameter space. Caveat: it may be important to run around important points, e.g., near convergence. Find a way to run on a subset of parameters, but be careful how you select them: a subset of weights for each class is OK; weights for a subset of classes are not OK.
Linear classification Learning by gradient descent
Suppose you get the gradient vector g from the (analytic) calculation in your code, and g′ from gradcheck. A good value to look at: the relative error
max_i |g_i − g′_i| / max(|g_i|, |g′_i|)
Suggested by Andrej Karpathy, who says: relative error > 1e-2 usually means the gradient is probably wrong; 1e-2 > relative error > 1e-4 should make you feel uncomfortable; 1e-4 > relative error is usually okay for objectives with kinks, but if there are no kinks [soft objective], then 1e-4 is too high; at 1e-7 and less you should be happy.
Deep learning: introduction
Machine learning relies almost entirely on linear predictors, but often applied to non-linear features of the data. Feature transform: φ : X → R^d, f_y(x; w, b) = w_y · φ(x) + b_y. Shallow learning: hand-crafted, non-hierarchical φ. Basic example: polynomial regression, φ_j(x) = x^j, j = 0, . . . , d, ŷ = w · φ(x). Kernel SVM: employing a kernel K corresponds to (some) feature space such that K(x_i, x_j) = φ(x_i) · φ(x_j); the SVM is just a linear classifier in that space.
Deep learning: introduction
Image classification with spatial pyramids: φ is based on (1) computing SIFT descriptors over a set of points, (2) clustering descriptors, (3) computing cluster assignment histograms over various regions, (4) concatenating the histograms. Deformable parts model: φ is based on a set of filters, and a linear classifier on top. No hierarchy.
Deep learning: introduction
A system that employs a hierarchy of features of the input, learned end-to-end jointly with the predictor. fy(x) = FL(FL−1(FL−2(· · · F1(x) · · · ))) Learning methods that are not deep: SVMs nearest neighbor classifiers decision trees perceptron
Deep learning: introduction
Theoretical result [Cybenko, 1989]: a 2-layer net with linear output (and sigmoid hidden units) can approximate any continuous function over a compact domain to arbitrary accuracy (given enough hidden units!) Example: 3 hidden units with tanh(z) = (e^{2z} − 1)/(e^{2z} + 1) activation [from Bishop]
Deep learning: introduction
What can we gain from depth? Example: parity of n-bit numbers, with AND, OR, NOT, XOR gates Trivial shallow architecture: express parity as DNF or CNF but need exponential number of gates! Deep architecture: a tree of XOR gates
Deep learning: introduction
Distributed representations through hierarchy of features [Y. Bengio]
Deep learning: introduction
1950s: Perceptron (Rosenblatt). 1960s: first AI winter? Minsky and Papert. 1970s-1980s: connectionist models; backprop. Late 1980s: second AI winter, though most of modern deep learning had already been discovered! Early 2000s: revival of interest (CIFAR groups). 2010: progress in speech and vision with deep neural nets. 2012: Krizhevsky et al. win ImageNet.
Deep learning: introduction
General form of shallow linear classifiers: the score is computed as f_y(x; w, b) = w_y · φ(x) + b_y
Representation as a neural network: [figure: inputs x_1, . . . , x_d feed fixed features φ_1, . . . , φ_m (with φ_0 ≡ 1), which feed the output units y = 1, . . . , C]
Weights w = [w_1, . . . , w_C], w_c ∈ R^m; biases b = [b_1, . . . , b_C]
Deep learning: introduction
[figure: a two-layer network, with inputs x_1, . . . , x_d connected to hidden units by weights w(1) and biases b(1), and hidden units connected to the output by weights w(2) and bias b(2)]
Idea: learn parametric features φ_j(x) = h(w(1)_j · x + b(1)_j) for some nonlinear function h
Deep learning: introduction
Feedforward operation, from input x to output ŷ:
f_y(x) = Σ_{j=1}^m w(2)_{j,y} h(Σ_{i=1}^d w(1)_{i,j} x_i + b(1)_j) + b(2)_y
In matrix form:
f(x) = W2 · h(W1 · x + b1) + b2
where h is applied elementwise; x ∈ R^d, W1 ∈ R^{m×d}, W2 ∈ R^{C×m}, b1 ∈ R^m, b2 ∈ R^C
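The matrix form is a two-line computation; a sketch with illustrative sizes and tanh as the assumed nonlinearity:

```python
import numpy as np

def forward(x, W1, b1, W2, b2, h=np.tanh):
    # f(x) = W2 . h(W1 . x + b1) + b2, with h applied elementwise
    return W2 @ h(W1 @ x + b1) + b2

d, m, C = 4, 8, 3                     # input dim, hidden units, classes
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(m, d)), np.zeros(m)
W2, b2 = rng.normal(size=(C, m)), np.zeros(C)
f = forward(rng.normal(size=d), W1, b1, W2, b2)
```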
Deep learning: introduction
f(x) = W2 · h(W1 · x + b1) + b2
Recall: p̂(y = c|x) = exp(f_c(x)) / Σ_j exp(f_j(x))
Softmax loss computed on f(x) vs. the true label y:
L(x, y) = −log p̂(y|x) = −f_y(x) + log Σ_c exp(f_c(x))
Learning the network: initialize, then run [stochastic] gradient descent, updating according to ∂L/∂W2, ∂L/∂b2, ∂L/∂W1, ∂L/∂b1
Deep learning: introduction
Consider the chain (stage-wise) mapping x ∈ R^d → u = f(x) ∈ R^m → v = g(u) ∈ R^c → z = h(v) ∈ R. Computing partial gradients:
∇_v z = ∂z/∂v
∂z/∂u_i = Σ_j (∂z/∂v_j)(∂v_j/∂u_i) ⇒ ∇_u z = (∂v/∂u)′ ∇_v z
∂z/∂x_k = Σ_q (∂z/∂u_q)(∂u_q/∂x_k) ⇒ ∇_x z = (∂u/∂x)′ ∇_u z
Deep learning: introduction
More generally, some of the variables are tensors: X ∈ R^{d1×···×dx} → U ∈ R^{m1×···×mu} → v ∈ R^c → z ∈ R. ∇_X z is a tensor of the same dimensions as X. Use a single index to indicate index tuples: e.g., if X is 3D, i = (i1, i2, i3), and
(∇_X z)_i = ∂z/∂x_{i1,i2,i3}
Now, ∇_X z = Σ_j (∇_X U_j)(∇_U z)_j
Deep learning: introduction
To make derivations more convenient, we express the forward computation (x, y) → L in more detail:
x, W1, b1 → a1 = W1 · x + b1
a1 → z1 = h(a1)
z1, W2, b2 → a2 = W2 · z1 + b2
a2, y → L = −e_y · a2 + log[1 · exp(a2)]
Now we have, e.g., ∇_{z1} L = (∂a2/∂z1)′ ∇_{a2} L, and ∇_{W1} L = Σ_j (∇_{W1} z_{1,j})(∇_{z1} L)_j. What is ∇_{W1} z_{1,j} like?
Backpropagation
General unit activation in a network (ignoring bias). Unit t receives input from I(t) = {i1, . . . , iS}, and sends output to O(t) = {o1, . . . , oR}:
a_t = Σ_{j∈I(t)} w_{jt} z_j, z_t = h(a_t)
The loss L depends on w_{jt} only through a_t:
∂L/∂w_{jt} = (∂L/∂a_t)(∂a_t/∂w_{jt}) = (∂L/∂a_t) z_j
[figure: unit t with inputs z_{i1}, . . . , z_{iS} and outgoing connections to units o_1, . . . , o_R, feeding scores f_1, . . . , f_C and the loss L]
Backpropagation
Starting with L, compute the backward (gradient) flow. Note: a_j = Σ_i w_{ij} h(a_i). Notation: d_t = ∂L/∂a_t. The backward flow comes to unit t from O(t):
d_t = Σ_{o∈O(t)} (∂L/∂a_o)(∂a_o/∂a_t) = Σ_{o∈O(t)} d_o w_{t,o} h′(a_t) = h′(a_t) Σ_{o∈O(t)} d_o w_{t,o}
Backpropagation
Consider a layer t with n_t units:
z_t = h(W_t z_{t−1} + b_t)
where z_t ∈ R^{n_t}, b_t ∈ R^{n_t}, W_t ∈ R^{n_t×n_{t−1}}, and h is applied element-wise. Layer zero reads the input: z_0 ≡ x. The last layer T produces a linear output z_T = W_T z_{T−1} + b_T, which is used to predict/assess the loss (a.k.a. f).
Backpropagation
Compute, layer by layer:
a_1 = W_1 · x + b_1, z_1 = h(a_1)
a_2 = W_2 · z_1 + b_2, z_2 = h(a_2)
. . .
a_{T−1} = W_{T−1} · z_{T−2} + b_{T−1}, z_{T−1} = h(a_{T−1})
a_T = W_T · z_{T−1} + b_T, z_T = a_T
Training: compute L(z_T, y). Testing: make the inference ŷ(x) = argmax_c z_{T,c}.
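The forward pass above can be sketched as a single loop (layer sizes and the tanh nonlinearity are illustrative assumptions):

```python
import numpy as np

def forward_pass(x, Ws, bs, h=np.tanh):
    # Layers 1..T-1 apply h; the last layer is linear (z_T = a_T)
    z = x
    for t, (W, b) in enumerate(zip(Ws, bs)):
        a = W @ z + b
        z = h(a) if t < len(Ws) - 1 else a
    return z

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]                  # n_0 = d, ..., n_T = C
Ws = [rng.normal(size=(n_out, n_in))
      for n_in, n_out in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(n) for n in sizes[1:]]
z_T = forward_pass(rng.normal(size=4), Ws, bs)
y_hat = int(np.argmax(z_T))           # inference: argmax_c z_{T,c}
```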
Backpropagation
The main backprop equations:
d_t = h′(a_t) Σ_j d_j w_{t,j}, ∂L/∂w_{i,t} = d_t z_i
Compute gradient information, using the cached z and a:
d_T = ∂L/∂a_T
∇_{W_T} L = d_T ⊗ z_{T−1}
d_{T−1} = h′(a_{T−1}) ∗ (W_T′ d_T)
. . .
d_t = h′(a_t) ∗ (W_{t+1}′ d_{t+1})
(u ⊗ v: outer product; u ∗ v: element-wise product)
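A sketch of the full backward pass for this fully connected net, with a softmax loss at the top (layer sizes and the tanh nonlinearity are illustrative assumptions):

```python
import numpy as np

def backprop(x, y, Ws, bs, h, h_prime):
    # Forward: cache the raw activations a_t and outputs z_t
    zs, acts = [x], []
    for t, (W, b) in enumerate(zip(Ws, bs)):
        a = W @ zs[-1] + b
        acts.append(a)
        zs.append(h(a) if t < len(Ws) - 1 else a)   # last layer linear
    # Softmax loss at the top: d_T = dL/da_T = p - e_y
    e = np.exp(zs[-1] - zs[-1].max())
    p = e / e.sum()
    d = p.copy()
    d[y] -= 1.0
    grads = []
    for t in reversed(range(len(Ws))):
        grads.append((np.outer(d, zs[t]), d))       # (dL/dW_t, dL/db_t)
        if t > 0:
            # d_t = h'(a_t) * (W_{t+1}' d_{t+1})
            d = h_prime(acts[t - 1]) * (Ws[t].T @ d)
    return list(reversed(grads))

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(5, 3)), rng.normal(size=(4, 5))]
bs = [np.zeros(5), np.zeros(4)]
grads = backprop(rng.normal(size=3), 2, Ws, bs,
                 np.tanh, lambda a: 1 - np.tanh(a) ** 2)
```

Note that the top-layer bias gradient p − e_y sums to zero, a handy sanity check.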
Backpropagation
Basic building block of a neural network: a layer, which defines two directions of computation. Forward: pull activations from input units; compute and cache the "raw" activation a; compute and output the activation z = h(a). Backward: collect gradient information d from output units; calculate the gradient w.r.t. a; calculate the gradients w.r.t. the weights and biases. The only connection between layers is in communicating z and d.
Backpropagation
When implementing backpropagation, existing software falls into two groups. Numerical: the interface between layers handles numbers; the computation must be fully specified (Torch, MatConvNet, Caffe). Symbolic: the interface includes derivatives (and intermediate stages) as first-class citizens; the computation is specified by a computational graph (Theano, TensorFlow, Caffe2). [Goodfellow et al.]
Activation functions
Sigmoid: h(a) = 1/(1 + exp(−a)); tanh: h(a) = tanh(a). Good: they squash activations to a fixed range. Bad: the gradient is nearly zero far away from the midpoint:
∂L/∂a = (∂L/∂h(a)) (dh/da) ≈ 0
which can make learning very, very slow. tanh (zero-centered) is preferable to sigmoid.
Activation functions
Intuition: make the non-linearity non-saturating, at least in part of the range Rectified linear units: h(a) = max(0, a) Good: non-saturating; cheap to compute; greatly speeds up convergence compared to sigmoid (order of magnitude)
Activation functions
Problem: if a RELU gets into a state in which all batches in an epoch have zero activation, the unit becomes stuck with zero gradient ("dies"). [A. Karpathy]
Activation functions
Many attempts to improve RELUs. Leaky RELU: h(a) = max(αa, a). Learning α: parametric RELU. Exponential linear units (ELU): h(a) = a if a ≥ 0, α(exp(a) − 1) if a < 0. [A. Karpathy] ELUs are promising, but more expensive to compute. RELU is still the default choice; none of the variants are consistently better.
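The three variants are one-liners; a minimal sketch (default α values are illustrative, though 0.01 for leaky RELU and 1.0 for ELU are common choices):

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def leaky_relu(a, alpha=0.01):
    # For a < 0, alpha*a > a, so this equals alpha*a on the negative side
    return np.maximum(alpha * a, a)

def elu(a, alpha=1.0):
    # a if a >= 0, alpha*(exp(a) - 1) otherwise
    return np.where(a >= 0, a, alpha * (np.exp(a) - 1.0))
```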
Initialization
Non-convex objective; initialization is important Can we initialize with all zeros? bad idea: all units will learn the same thing! Can initialize weights with small real numbers, e.g., drawn from Gaussian with zero mean, variance 0.01 Problem: variance of activation grows with number of inputs [A. Karpathy]
Initialization
Idea: normalize the scale to provide roughly equal variance throughout the network: the Xavier initialization [Glorot et al.]: if a unit has n inputs, draw its weights from zero mean, variance 1/n. [A. Karpathy]
Initialization
Assumption behind Xavier: (1) linear activations, (2) zero mean activations. Breaks when using RELUs: [A. Karpathy]
Initialization
Initialization scheme specifically for RELUs [He et al.]: zero mean, variance 2/n, where n is the number of inputs. This Kaiming initialization is currently recommended for RELU units. [A. Karpathy] Note: it is still OK to initialize biases with zeros.
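Both schemes in a sketch (layer sizes are illustrative; n here is the fan-in, the number of inputs to a unit):

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_out, n_in):
    # Zero mean, variance 1/n_in  [Glorot et al.]
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_out, n_in))

def kaiming_init(n_out, n_in):
    # Zero mean, variance 2/n_in, recommended for RELU units [He et al.]
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

W = kaiming_init(256, 512)
b = np.zeros(256)                     # biases: still OK to init with zeros
```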
Optimization tricks
Learning hyperparameters: architecture, regularizer R; optimization hyperparameters: learning rate η, batch size B. Initialize weights and biases. Each epoch: shuffle the data, partition into batches, and iterate over batches b:
w := w − η [∇_w L(X_b, Y_b) + ∇_w R(w)]
We have covered initialization. Next: optimization.
Optimization tricks
Generally, for convex functions, gradient descent will converge. Setting the learning rate η may be very important to ensure rapid convergence. [from LeCun et al., 1996]
Optimization tricks
For deep networks, setting the right learning rate is crucial. Typical behaviors, monitoring training loss: [A. Karpathy]
Optimization tricks
Generally, as with convex functions, we want the learning rate to decay with time. Could set up an automatic schedule, e.g., drop by a factor of α every β epochs. Most common in practice: some degree of babysitting: start with a reasonable learning rate, monitor the training loss, and drop the LR (typically to 1/10) when learning appears stuck.
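The automatic schedule is a one-liner; a sketch with illustrative defaults (drop to 1/10 every 30 epochs):

```python
def step_decay(eta0, epoch, drop=0.1, every=30):
    # Drop the learning rate by a factor `drop` every `every` epochs
    return eta0 * drop ** (epoch // every)
```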
Optimization tricks
Too expensive to evaluate on the entire training set frequently; instead, use a rolling average of the batch loss. Typical behavior: the red line. [Larsson et al.] A few caveats: wait a bit before dropping; remember that this is surrogate loss on training data (monitor validation accuracy as a precaution); better yet, drop the LR based on validation accuracy, not training loss; do a sanity check on loss values. Crashes due to NaNs etc. are often caused by a high LR.
Optimization tricks
SGD (and GD) has trouble navigating ravines, i.e. areas where the surface curves much more steeply in one dimension than in another, which are common around local optima. [A. Karpathy] SGD oscillates across the slopes of the ravine, making only hesitant progress towards the local optimum. Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations:
Δw_t = γ Δw_{t−1} + η_t ∇f(w_t)
w_{t+1} = w_t − Δw_t
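The two update equations in a sketch, run on an assumed toy quadratic (η, γ, and the objective are illustrative):

```python
import numpy as np

def momentum_step(w, v, grad, eta=0.01, gamma=0.9):
    # v_t = gamma * v_{t-1} + eta * grad_t;  w_{t+1} = w_t - v_t
    v = gamma * v + eta * grad
    return w - v, v

# Minimize f(w) = ||w||^2, whose gradient is 2w
w, v = np.array([5.0, 5.0]), np.zeros(2)
for _ in range(400):
    w, v = momentum_step(w, v, 2 * w)
```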
Optimization tricks
Essentially, when using momentum, we push a ball down a hill. The ball accumulates momentum as it rolls downhill, becoming faster and faster on the way (until it reaches its terminal velocity if there is air resistance, i.e. γ < 1). The momentum term increases for dimensions whose gradients point in the same direction, and reduces updates for dimensions whose gradients change direction. Faster convergence, reduced oscillation. [Goodfellow et al.]
Optimization tricks
Intuition [Duchi et al.]: parameters (directions of the parameter space) are updated with varying frequency. Idea (Adagrad): reduce the learning rate in proportion to the accumulated updates. Maintain a cache s_i for each parameter θ_i; when updating:
s_i := s_i + (∂L/∂θ_i)^2
θ_i := θ_i − η (∂L/∂θ_i) / (√s_i + ε)
Rarely used today (reduces the rate too aggressively).
Optimization tricks
RMSprop: a modified idea from Adagrad ["published" in Hinton's Coursera slides]. The cached rate allows for "forgetting":
s_i := δ s_i + (1 − δ)(∂L/∂θ_i)^2
θ_i := θ_i − η (∂L/∂θ_i) / (√s_i + ε)
The decay rate δ is typically 0.9-0.99.
Optimization tricks
Adam: kind of like RMSprop with momentum [Kingma et al.]. First-order momentum update for θ_i:
m_i := β1 m_i + (1 − β1) ∂L/∂θ_i
Second order:
v_i := β2 v_i + (1 − β2)(∂L/∂θ_i)^2
Parameter update:
θ_i := θ_i − η m_i / (√v_i + ε)
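A sketch of the update as written on the slide (the full Adam of Kingma et al. additionally bias-corrects m and v early in training, which is omitted here; the default constants are the commonly used ones):

```python
import numpy as np

def adam_step(w, m, v, grad, eta=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad          # first-order momentum
    v = b2 * v + (1 - b2) * grad ** 2     # second-order (squared) term
    w = w - eta * m / (np.sqrt(v) + eps)
    return w, m, v
```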
Optimization tricks
Suppose we want to continue training a network for more epochs. All significant platforms allow saving snapshots and resuming. Need to be careful with the learning rate: if re-initialized too high, we might lose our place in parameter space. Technical issue with momentum, Adam, etc.: the relevant state must be saved in the snapshots to resume!
Regularization
Main challenge in machine learning: overfitting Bias-variance tradeoff: complex models reduce bias (approximation error), but increase variance (estimation error) Optimization error: source of concern when dealing with non-convex models Bayes error: presumed low in vision tasks (?)
Regularization
Regularization as a way to control the bias-variance tradeoff. General form of regularized ERM for model class M:
min_{M∈M} Σ_i L(y_i, M(x_i)) + λ R(M)
For models parametrized by some w ∈ R^D, the most common form of regularizer R is a norm penalty Σ_d |w_d|^p (shrinkage).
Regularization
Can write the unconstrained optimization problem
min_w Σ_{i=1}^N −log p̂(y_i|x_i; w) + λ Σ_{j=1}^m |w_j|^p
as an equivalent constrained problem
min_w Σ_{i=1}^N −log p̂(y_i|x_i; w) subject to Σ_{j=1}^m |w_j|^p ≤ t
[figure: loss contours around ŵ_ML, with the ℓ2 ball w1^2 + w2^2 ≤ t (ridge, ŵ_ridge) and the ℓ1 ball |w1| + |w2| ≤ t (lasso, ŵ_lasso)]
p = 1 may lead to sparsity; p = 2 generally won't.
Regularization
Cartoon of the effect of regularization on bias and variance:
[figure from Bishop: (bias)^2, variance, (bias)^2 + variance, and test error as functions of ln λ]
In practice, the curves are less clean.
Regularization
In neural networks, L2 regularization is called weight decay. Note: biases are normally not regularized. Easy to incorporate weight decay into the gradient calculation:
∇_w λ‖w‖^2 = 2λw
One more hyperparameter λ to tune; with large data sets, it typically seems inconsequential.
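The incorporation into the gradient is a one-line addition (the function name is illustrative; biases would simply be passed with lam=0, since they are not regularized):

```python
import numpy as np

def regularized_grad(grad_data, w, lam):
    # Gradient of L_data(w) + lam * ||w||^2 is grad_data + 2 * lam * w
    return grad_data + 2.0 * lam * w
```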
Regularization
Part of overfitting in a neural net: unit co-adaptation. Idea: prevent it by disrupting co-firing patterns. Dropout [Srivastava et al.]: during each training iteration, randomly "remove" units, just for that update.
Regularization
With each particular dropout set, we have a different network Interpretation: training an ensemble of networks with shared parameters [Goodfellow et al.]
Regularization
Dropout introduces a discrepancy between train and test. Correction: suppose the survival rate of a unit is p; the weights from that unit must be scaled (after training) by p. Modern version: "inverted dropout": scale activations by 1/p during training, and do not scale at test time. Typically employed in the top layers; p = 0.5 is most common.
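A minimal sketch of inverted dropout (the 1/p scaling at training time keeps the expected activation unchanged, so test time needs no correction):

```python
import numpy as np

rng = np.random.default_rng(0)

def inverted_dropout(z, p=0.5, train=True):
    # Scale by 1/p during training; do nothing at test time
    if not train:
        return z
    mask = (rng.random(z.shape) < p) / p   # entries are 0 or 1/p
    return z * mask

z = np.ones(10000)
z_train = inverted_dropout(z, p=0.5)
```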
Convolutional networks
One way to regularize is to sparsify parameters We can set a subset of weights to zero If the input has “spatial” semantics: can keep only weights for a contiguous set of inputs [Goodfellow et al.]
Convolutional networks
Each unit in upper layer is affected only by a subset of units in lower layer – its receptive field [Goodfellow et al.] Conversely: each unit in lower layer only influences a subset of units in upper layer
Convolutional networks
Very important: in locally connected networks, the receptive field size w.r.t. the input grows with depth, even if it is fixed within each layer. [Goodfellow et al.]
Convolutional networks
We can further reduce network complexity by tying the weights of all receptive fields in a layer (untied vs. tied). [Goodfellow et al.] We have now introduced convolutional layers; weight sharing induces equivariance to translation!
Convolutional networks
Note: the filters are not flipped, i.e., the operation is actually cross-correlation. [Goodfellow et al.]
Convolutional networks
Suppose the input to the layer (the output of the previous layer) has C channels: a W × H × C tensor. We convolve only along the spatial dimensions: assuming valid convolution with a k × k × C filter (the depth must match the number of channels!), we get a (W − k + 1) × (H − k + 1) × 1 activation map. [A. Karpathy] If we have m filters, we get a (W − k + 1) × (H − k + 1) × m map as the output of the layer.
Convolutional networks
Most common: convert convolution into matrix multiplication (parallel, GPU-friendly!). Suppose we have m filters k_1, . . . , k_m of size f × f, with c channels. Basic idea: pre-compute an index mapping
im2col : Z ∈ R^{S×S×c} → M ∈ R^{(S−f+1)^2 × f^2 c}
that maps every receptive field to a row of M. Collect the filters as columns of K ∈ R^{f^2 c × m}. Now simply compute MK + b and reshape to (S − f + 1) × (S − f + 1) × m. Notably, for some cases (in particular larger filters) more efficient implementations use the FFT. Most software uses 3rd-party (Nvidia, Nervana) implementations under the hood.
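A direct (loop-based, unoptimized) sketch of im2col and the resulting MK + b layer; sizes are illustrative, and b is a vector of m biases, one per filter:

```python
import numpy as np

def im2col(X, f):
    # X: S x S x c input; each f x f x c receptive field -> one row of M
    S, _, c = X.shape
    out = S - f + 1
    M = np.empty((out * out, f * f * c))
    for i in range(out):
        for j in range(out):
            M[i * out + j] = X[i:i + f, j:j + f, :].ravel()
    return M

def conv_layer(X, filters, b):
    # filters: list of m kernels of shape f x f x c; valid "convolution"
    # (un-flipped, i.e. cross-correlation) via matrix multiplication
    f = filters[0].shape[0]
    K = np.stack([k.ravel() for k in filters], axis=1)   # (f*f*c) x m
    out = X.shape[0] - f + 1
    return (im2col(X, f) @ K + b).reshape(out, out, len(filters))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 8, 3))
filters = [rng.normal(size=(3, 3, 3)) for _ in range(4)]
Y = conv_layer(X, filters, np.zeros(4))
```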
Convolutional networks
If we simply rely on valid convolutions, the maps will quickly shrink. [Goodfellow et al.] Instead, we usually pad with zeros. [Goodfellow et al.] The usual padding is symmetric, with (f − 1)/2 zeros on each side: "same" convolution in Matlab speak.
Convolutional networks
Two extreme cases. Filter size equal to the output map size of the previous layer ⇒ fully connected layer (more on this later). Filter size 1 × 1 ⇒ the layer simply computes a (non-linear) projection of the features computed by the previous layer.
Convolutional networks
Convolution with stride > 1 is a cheap way to reduce the spatial dimension of the output. It can be implemented as convolution followed by downsampling (used in LeNet), but that is wasteful! [Goodfellow et al.] Modern implementations specify the stride explicitly. [Goodfellow et al.] Note: this matches the matrix-multiplication implementation of convolution well!
Convolutional networks
Pooling applies a non-parameterized operation to a receptive field. The most common operations: max and average. Typically, pooling is applied with stride > 1 to reduce spatial resolution, but a stride of 1 is also possible!
Convolutional networks
Idea: pool over feature channels, not spatially Introduces invariance w.r.t. a family of filters [Goodfellow et al.]
Convolutional networks
Invariance to translation: due to pooling. As we stack layers: invariance to deformations. Invariance to lighting: if we normalize the input. Invariance to rotation? Only as present in the data. Invariance to scale? Ditto.
Convolutional networks
LeCun et al., 1986
Convolutional networks
Krizhevsky et al., 2012
Convolutional networks
Simonyan and Zisserman, 2014 [blog.heuritech.com]
Convolutional networks
Retrieving maximally inducing receptive fields [Girshick et al]: