
Deep Learning Tutorial, Part II
Greg Shakhnarovich, TTI-Chicago

December 2016


Overview

Goals of the tutorial

- Somewhat organized overview of basics, and some more advanced topics
- Demystify jargon
- Pointers for informed further learning
- Aimed mostly at vision practitioners, but the tools are widely applicable beyond vision
- Assumes basic familiarity with machine learning


Overview

Not covered

- Connections to the brain
- Deep learning outside of neural networks
- Many recent advances
- Many specialized architectures for vision tasks


Overview

Outline

Introduction (3 hours):
- Review of relevant machine learning concepts
- Feedforward neural networks and backpropagation
- Optimization techniques and issues
- Complexity and regularization in neural networks
- Intro to convolutional networks

Advanced (3 hours):
- Advanced techniques for learning DNNs
- Very deep networks
- Convnets for tasks beyond image classification
- Recurrent networks


Overview

Sources

- Stanford CS231N: Convolutional Neural Networks for Visual Recognition, Andrej Karpathy, Justin Johnson et al. (2016 edition), vision.stanford.edu/teaching/cs231n
- Deep Learning by Ian Goodfellow, Aaron Courville and Yoshua Bengio, 2016
- Chris Olah: Understanding LSTM Networks (blog post), colah.github.io/posts/2015-08-Understanding-LSTMs
- Papers on arXiv and slides by the authors


More training tricks

Input normalization

- Standard practice: normalize the data
- In theory, a variety of normalization schemes could apply: zero-mean unit variance, "box normalization", whitening
- In practice, for images: subtract the "mean pixel" (same value for all locations); see the sketch below
- Assuming zero-mean filters, this gives zero-mean filter responses and matches zero padding

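A minimal sketch of the mean-pixel normalization described above, assuming RGB images stored as NumPy arrays of shape (N, H, W, 3); the function and variable names are illustrative, not from the slides:

```python
import numpy as np

def normalize_images(train_images, test_images):
    """Subtract a single per-channel "mean pixel" computed on the training set."""
    # train_images, test_images: float arrays of shape (N, H, W, 3)
    mean_pixel = train_images.mean(axis=(0, 1, 2))          # shape (3,)
    return train_images - mean_pixel, test_images - mean_pixel
```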

More training tricks

Batch normalization: motivation

- Problem: covariate shift (a change in the function's input distribution)
- As learning proceeds, higher layers suffer from internal covariate shift due to the changing parameters of previous layers
- Makes learning harder!
- Example: MNIST, 15/50/85th percentiles of the input to a typical sigmoid [Ioffe and Szegedy]


More training tricks

Batch normalization: algorithm

- Batch normalization [Ioffe and Szegedy, 2015]: normalize each activation using the mean and variance of the current mini-batch, then scale and shift
- Scale γ and shift β (per layer or per channel) are learned through the usual backprop! (see the sketch below)

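A minimal sketch of the batch-normalization forward pass for a fully connected layer, assuming NumPy and a mini-batch of activations x of shape (batch, features); gamma and beta are the learned scale and shift mentioned above:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale by gamma and shift by beta."""
    mu = x.mean(axis=0)                      # per-feature batch mean
    var = x.var(axis=0)                      # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalized activations
    return gamma * x_hat + beta
```

At test time, running averages of the batch statistics collected during training are used in place of the per-batch mean and variance.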

More training tricks

Batch normalization effect

- Allows for a higher learning rate and faster convergence [Ioffe and Szegedy]
- May (or may not) reduce the need for dropout
- De facto standard today in most architectures


More training tricks

Data augmentation

- Part of the invariance learned by convnets comes from the variation present in the training set
- Natural variation: different instances of objects, scene compositions, etc.
- Can get a lot more for free with synthetic variations
- Obvious: mirror flip (horizontally, but not vertically!); see the sketch below [A. Karpathy]

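A minimal sketch of random horizontal flipping, assuming NumPy images of shape (H, W, 3):

```python
import numpy as np

def random_horizontal_flip(image, p=0.5, rng=np.random.default_rng()):
    """Mirror the image left-right with probability p (never top-bottom)."""
    if rng.random() < p:
        return image[:, ::-1, :]
    return image
```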

More training tricks

Data augmentation

- Random crops (and scales)
- For image classification: assumes the object is large and central
- E.g., training ResNet on ImageNet: resize the image so the shorter side is a random number between 256 and 480, then crop a random 224×224 window (see the sketch below) [A. Karpathy]
- Must be matched to the testing regime; ResNet: multiple scales, fixed crops for each scale, take the max

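A minimal sketch of the scale-and-crop augmentation described above, assuming NumPy images of shape (H, W, 3); nearest-neighbor resizing is used only to keep the example dependency-free:

```python
import numpy as np

def resize_shorter_side(image, target, rng):
    """Rescale so the shorter side equals target (nearest-neighbor for simplicity)."""
    h, w = image.shape[:2]
    scale = target / min(h, w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    rows = np.linspace(0, h - 1, new_h).round().astype(int)
    cols = np.linspace(0, w - 1, new_w).round().astype(int)
    return image[rows][:, cols]

def random_scale_crop(image, crop=224, rng=np.random.default_rng()):
    """ResNet-style augmentation: random shorter side in [256, 480], random 224x224 crop."""
    image = resize_shorter_side(image, rng.integers(256, 481), rng)
    h, w = image.shape[:2]
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    return image[top:top + crop, left:left + crop]
```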

More training tricks

Data augmentation

- Color jitter: apply it in a structured way (e.g., using PCA on color values) rather than independently per pixel (see the sketch below) [A. Karpathy]
- Blur (see our paper on arXiv)
- Rotations? Noise?

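One common structured color jitter is the PCA-based scheme popularized by AlexNet; a minimal sketch, assuming float RGB images in [0, 1], with the jitter strength std chosen arbitrarily here:

```python
import numpy as np

def pca_color_jitter(image, std=0.1, rng=np.random.default_rng()):
    """Shift all pixels along the principal components of their RGB distribution."""
    pixels = image.reshape(-1, 3)
    cov = np.cov(pixels, rowvar=False)            # 3x3 RGB covariance
    eigvals, eigvecs = np.linalg.eigh(cov)        # principal color directions
    alphas = rng.normal(0.0, std, size=3)         # one random weight per component
    shift = eigvecs @ (alphas * eigvals)          # same shift applied to every pixel
    return np.clip(image + shift, 0.0, 1.0)
```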

Very deep networks

Quest for very deep networks

- Apparent dividends from depth (albeit diminishing) [He et al.]
- Three main challenges with depth: computational complexity (alleviated by hardware?), learning complexity, and optimization complexity


Very deep networks

Training very deep networks

- Naive attempts to increase depth: CIFAR-10, simple sequence of 3×3 conv layers with occasional stride 2 (no pooling) [He et al.]
- At a certain depth, optimization fails: even the training error goes up
- Clearly an optimization issue (not a learning/overfitting issue)!


Very deep networks

GoogleNet

- A number of ad-hoc choices: "inception blocks", auxiliary loss paths
- No fully connected layers!
- Compared to AlexNet: 12 times fewer parameters, double the FLOPs [Szegedy et al., 2014]


Very deep networks

Residual networks

- Conventional wisdom: deeper networks are harder to optimize
- Key idea: allow "shortcuts" for the loss to reach low layers ("deep supervision")
- Residual connections [He et al., 2015]: learn what to add to the previous layer rather than how to modify it (see the sketch below)

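A minimal sketch of the residual idea, assuming a generic residual function F (here a small two-layer MLP on a feature vector): the block outputs x + F(x), so it only has to learn the additive correction.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """y = relu(x + F(x)): the block learns a correction to its input, not a full mapping."""
    residual = relu(x @ W1) @ W2     # F(x); must have the same shape as x
    return relu(x + residual)        # identity shortcut plus learned residual
```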

Very deep networks

ResNet architecture

Compare to VGG-19 and to a plain (non-residual) architecture


Very deep networks

ResNet with bottleneck blocks

- Bottleneck blocks: reduce channels with a 1×1 conv, apply a 3×3 conv, then expand back with another 1×1 conv
- Can train hundreds of layers! State of the art on ImageNet/COCO is ResNets with 150-250 layers
- Similar to the Inception blocks in GoogleNet


Very deep networks

Stochastic depth

- From dropping units to dropping layers: regularization by stochastic depth at training time!
- Drop residual ("ResNet") blocks with some probability (see the sketch below) [G. Huang et al., 2016]
- Made possible by the residual trick: the identity shortcut remains when a block is dropped
- State of the art on recognition tasks

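A minimal sketch of stochastic depth at training time, assuming each block is a callable residual block (like the one sketched earlier) and survival_probs holds one keep-probability per block:

```python
import numpy as np

def stochastic_depth_forward(x, blocks, survival_probs, rng=np.random.default_rng()):
    """Skip each residual block with probability 1 - p; the identity path always remains."""
    for block, p in zip(blocks, survival_probs):
        if rng.random() < p:
            x = block(x)      # block kept: identity + residual
        # else: block dropped, x passes through unchanged
    return x
```

At test time all blocks are kept, with each block's residual scaled by its survival probability.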

Convnets for localization

Transfer learning with convnets

Advice from Andrej Karpathy:


Convnets for localization

Localization with convnets

- Take a classification net; discard the top (fully connected) layers [A. Karpathy]
- Attach a new sub-net for bounding box regression; train it
- At test time: use both the classification and the regression heads


Convnets for localization

Overfeat for detection

- Idea: reuse computation across overlapping sliding windows
- Key innovation: convert "fully connected" layers to convolutional layers (see the sketch below) [Sermanet et al.]

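A minimal sketch of the fully-connected-to-convolutional conversion, assuming a trained FC layer that expects a 7×7×512 feature window; applied to a larger feature map, the converted layer produces a spatial grid of class scores instead of a single vector (the sizes below are illustrative):

```python
import numpy as np

H, W, C, K = 13, 13, 512, 1000                 # larger-than-training feature map, class count
feat = np.random.randn(H, W, C)                # output of the conv trunk
W_fc = np.random.randn(7 * 7 * C, K)           # weights of an FC layer trained on 7x7 windows

# Reinterpret the FC weights as a 7x7 convolution kernel and slide it over the map.
W_conv = W_fc.reshape(7, 7, C, K)
scores = np.empty((H - 6, W - 6, K))
for i in range(H - 6):
    for j in range(W - 6):
        window = feat[i:i + 7, j:j + 7, :]
        scores[i, j] = np.tensordot(window, W_conv, axes=([0, 1, 2], [0, 1, 2]))
# scores[i, j] equals what the original FC layer would output for the window at (i, j)
```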

Convnets for localization

Fully convolutional networks

Overfeat [Sermanet et al.]


Convnets for localization

R-CNN


Convnets for localization

R-CNN: results


Convnets for localization

Fast RCNN


Convnets for localization

Fast RCNN


Convnets for localization

Fast RCNN: ROI pooling

- Project the region proposal onto the feature map
- Divide it into a fixed grid and max-pool within each grid cell (see the sketch below)

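A minimal sketch of ROI pooling, assuming a NumPy feature map of shape (H, W, C) and a proposal already projected to feature-map coordinates; the grid size and coordinate rounding are simplified relative to the actual Fast R-CNN implementation:

```python
import numpy as np

def roi_pool(feature_map, box, out_size=7):
    """Max-pool the proposal region into a fixed out_size x out_size grid."""
    y0, x0, y1, x1 = box                                   # proposal in feature-map coords
    region = feature_map[y0:y1, x0:x1, :]
    h, w, c = region.shape
    ys = np.linspace(0, h, out_size + 1).astype(int)       # grid-cell boundaries
    xs = np.linspace(0, w, out_size + 1).astype(int)
    out = np.zeros((out_size, out_size, c))
    for i in range(out_size):
        for j in range(out_size):
            cell = region[ys[i]:max(ys[i + 1], ys[i] + 1),
                          xs[j]:max(xs[j + 1], xs[j] + 1), :]
            out[i, j] = cell.max(axis=(0, 1))              # one value per channel
    return out
```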

Convnets for localization

Fast RCNN results


Convnets for localization

Faster RCNN


Convnets for localization

Region proposal network

- Learn to choose and refine coarse proposals
- Use a few "anchors" with different aspect ratios (and scales) at each feature-map location (see the sketch below)

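A minimal sketch of anchor generation for one feature-map location, assuming scales in pixels and width/height aspect ratios; the specific values are illustrative defaults, not necessarily those of the slides:

```python
import numpy as np

def make_anchors(center_x, center_y, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return boxes (x0, y0, x1, y1) of several scales/aspect ratios around one point."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * np.sqrt(r)       # keep area ~ s^2 while varying aspect ratio
            h = s / np.sqrt(r)
            anchors.append((center_x - w / 2, center_y - h / 2,
                            center_x + w / 2, center_y + h / 2))
    return np.array(anchors)         # shape (len(scales) * len(ratios), 4)
```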

Convnets for localization

Hypercolumns for image labeling

[Figure: the input image is passed through VGG-16 (conv1_1 ... conv5_3, conv6 (fc6), conv7 (fc7)); features from these layers are stacked into a hypercolumn, passed through h_fc1, h_fc2 and a classifier (cls) to produce a low-resolution output map, which is upsampled (↑2) to the final output map]

Hypercolumns: skip-layer connections from lower layers directly to the classification network (see the sketch below) [Mostajabi et al., 2015]

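A minimal sketch of building a hypercolumn: upsample feature maps from several layers to a common resolution and concatenate them along channels, giving one long feature vector per pixel. Nearest-neighbor upsampling is used only to keep the example dependency-free:

```python
import numpy as np

def upsample_nearest(feat, out_h, out_w):
    """Nearest-neighbor upsampling of a (h, w, c) feature map to (out_h, out_w, c)."""
    rows = np.linspace(0, feat.shape[0] - 1, out_h).round().astype(int)
    cols = np.linspace(0, feat.shape[1] - 1, out_w).round().astype(int)
    return feat[rows][:, cols]

def hypercolumn(feature_maps, out_h, out_w):
    """Concatenate upsampled feature maps along channels: one long vector per pixel."""
    upsampled = [upsample_nearest(f, out_h, out_w) for f in feature_maps]
    return np.concatenate(upsampled, axis=-1)
```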

Convnets for localization

Dilated convolution

- Problem with fully-convolutional nets: large effective stride (coarse output)
- One solution: à-trous ("with holes"), i.e., dilated, convolution: space out the filter taps instead of striding (see the sketch below) [Chen et al.]

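A minimal sketch of dilated (à-trous) convolution in 1D with NumPy; dilation d inserts d−1 gaps between the filter taps, enlarging the receptive field without striding or extra parameters:

```python
import numpy as np

def dilated_conv1d(signal, kernel, dilation=2):
    """Valid 1D convolution with the kernel taps spaced `dilation` samples apart."""
    k = len(kernel)
    span = (k - 1) * dilation + 1                # receptive field of the dilated kernel
    out = np.empty(len(signal) - span + 1)
    for i in range(len(out)):
        taps = signal[i:i + span:dilation]       # every dilation-th sample
        out[i] = np.dot(taps, kernel)
    return out
```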

Recurrent networks

Recurrent networks

- Ostensibly, RNNs depart from the DAG assumption: h(0) = F0(x; Θ0), h(t) = F(x, h(t−1); Θ)
- This is a dynamical system governed (after initialization) by F
- However, in terms of the neural network this simply means that the parameters are shared between time frames
- Vanilla RNN (with a single input): h(0) = g(Wxh x + b0), h(t) = g(Wxh x + Whh h(t−1) + bh) (see the sketch below)
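A minimal sketch of the vanilla RNN recurrence above, unrolled for T steps in NumPy, with tanh standing in for the nonlinearity g; the same input x is fed at every step, matching the single-input formulation on the slide:

```python
import numpy as np

def rnn_forward(x, T, W_xh, W_hh, b_0, b_h):
    """Unroll h(t) = tanh(W_xh x + W_hh h(t-1) + b_h) for T steps; weights are shared."""
    h = np.tanh(W_xh @ x + b_0)                  # h(0)
    states = [h]
    for _ in range(T):
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)   # same parameters at every time frame
        states.append(h)
    return states
```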

Recurrent networks

Unrolling RNNs

- h0 = g(Wxh x + b0), ht = g(Wxh x + Whh ht−1 + bh)
- For a given number of time frames (steps), we can unroll this into a "normal" deep neural network [Goodfellow et al.], with the constraint that the weights Whh are tied across layers


Recurrent networks

Many forms of RNNs

- One-to-one: "normal" neural net [A. Karpathy]
- One-to-many: image captioning (one image to a sequence of words)
- Many-to-one: sequence classification (e.g., sentiment analysis)
- Many-to-many: machine translation (sequence to another sequence), or frame classification (e.g., video), frame by frame


Recurrent networks

Training sequence labeler

- Loss is computed per frame (typically cross-entropy) [Goodfellow et al.]
- The recurrence allows for modeling structure (conditional dependencies between the ys given x)


Recurrent networks

Example: character-level language model

- Input: previous character; output: next character
- Input characters xt: one-hot encoding
- Hidden layer: ht = tanh(Whh ht−1 + Wxh xt)
- Output: ot = Why ht
- Next character yt is drawn from softmax(ot) (see the sketch below)

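A minimal sketch of one step of the character-level model above, assuming a vocabulary of V characters and one-hot input vectors; sampling from the softmax gives the next character:

```python
import numpy as np

def char_rnn_step(x_onehot, h_prev, W_xh, W_hh, W_hy, rng=np.random.default_rng()):
    """One step: update the hidden state, compute scores, sample the next character."""
    h = np.tanh(W_hh @ h_prev + W_xh @ x_onehot)     # hidden layer
    o = W_hy @ h                                     # unnormalized scores over characters
    p = np.exp(o - o.max())
    p /= p.sum()                                     # softmax
    next_char = rng.choice(len(p), p=p)              # index of the sampled next character
    return next_char, h
```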

Recurrent networks

Trained on Shakespeare’s sonnets

[A. Karpathy]


Recurrent networks

Trained on algebraic topology text

[A. Karpathy]


Recurrent networks

Training one-to-many models

- Loss is per frame [Goodfellow et al.]
- At training time the previous true label is fed in as input ("teacher forcing"; cheating?)


Recurrent networks

Example: image captioning model

- One-to-many, e.g., image captioning: i is the image feature vector, y0, y1, . . . is the word sequence
- Words are encoded as vectors xt (one-hot, or an embedding like Word2Vec), with special START and END words
- Network definition:
  h1 = g(Wih i + Wxh x(START))
  ht = g(Wxh xt + Whh ht−1)
  ot = g(Who ht)
  yt = softmax(ot)


Recurrent networks

Image captioning: training and testing

h1 = g(Wih i + Wxh x(START)), ht = g(Wxh xt + Whh ht−1), ot = g(Who ht), yt = softmax(ot)

- Training: xt is the ground-truth word (the prediction is ignored) [A. Karpathy]
- Test: xt is sampled from the softmax distribution obtained from ot
- The length of the output sequence is determined at test time by waiting for END
- The CNN extracting i is initialized from ImageNet and can be fine-tuned jointly with the rest!


Recurrent networks

Backpropagation through time

- Once we unroll the RNN, we can run backpropagation as usual: "backpropagation through time" (BPTT)
- Enforce parameter sharing: e.g., first pretend it is a normal NN, compute gradients, then, before updating, average the gradients across time frames
- All components of the model are trained end-to-end


Recurrent networks

Gradient flow in RNNs

- Vanilla RNNs are OK for short sequences
- Consider gradient flow in BPTT: we multiply by Whh many times
- Gradient explosion if the largest eigenvalue is greater than 1!
- Simple solution: gradient clipping: if ‖g‖ > τ, set g ← (τ/‖g‖) g (see the sketch below) [Goodfellow et al.]

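A minimal sketch of norm-based gradient clipping as written above, for a gradient stored as a NumPy array:

```python
import numpy as np

def clip_gradient(g, tau):
    """Rescale the gradient so its norm never exceeds tau."""
    norm = np.linalg.norm(g)
    if norm > tau:
        return (tau / norm) * g
    return g
```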

Recurrent networks

Limited memory problem

- When the largest eigenvalue is less than 1, we can have vanishing gradients
- This leads to forgetting [C. Olah]
- Consider: "I grew up in France, ... I speak fluent French", or translating between languages with long-term dependencies


Recurrent networks

Long Short Term Memory

- LSTM introduced by Hochreiter and Schmidhuber in 1997
- Each time frame is associated with a complex cell
- This and subsequent LSTM figures credit: C. Olah


Recurrent networks

Simple RNN cells


Recurrent networks

LSTM: state and gating

- The core flow through an LSTM cell is via the "cell state"
- Gates multiply a value by the output of a sigmoid neural network (possibly one layer): 0 means "block", 1 means "pass through"


Recurrent networks

LSTM: forget gate

- The forget gate determines whether to keep information in the hidden state
- Example: gender of a subject (forget it when dealing with a new subject)
- Note: gating is elementwise (per dimension of h)


Recurrent networks

LSTM: input gate

- Input gate: determines which values are influenced by the input
- A separate layer creates "candidate values" C̃t to add to the state; it uses tanh to allow for signed values (note: often called g in the literature)
- If it,j = 0 we will ignore C̃t,j (i.e., we will not "form new memories" for dimension j of the state based on the input in frame t)


Recurrent networks

LSTM: applying updates

- Apply forgetting and new memory formation according to the forget and input gate outputs and the candidate values C̃t


Recurrent networks

LSTM: output gate

- The output gate ot determines which dimensions of the state C will be incorporated in the output of the cell (i.e., in ht)
- Note: the state is passed through tanh, so the values are in [−1, 1] (see the sketch below)

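A minimal sketch of a full LSTM step combining the gates above, in NumPy; following the Olah-style presentation, each weight matrix acts on the concatenation of ht−1 and xt:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM step: forget, write new candidate values, and expose part of the state."""
    z = np.concatenate([h_prev, x])           # gates see previous output and current input
    f = sigmoid(W_f @ z + b_f)                # forget gate
    i = sigmoid(W_i @ z + b_i)                # input gate
    c_tilde = np.tanh(W_c @ z + b_c)          # candidate values
    c = f * c_prev + i * c_tilde              # updated cell state
    o = sigmoid(W_o @ z + b_o)                # output gate
    h = o * np.tanh(c)                        # cell output / hidden state
    return h, c
```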

Recurrent networks

Gated Recurrent Units

- GRU [Cho et al.] is a simpler model
- Merges the forget and input gates into a single update gate
- Merges the cell state and the hidden state
- Since its introduction in 2014 it has been becoming more popular (see the sketch below)

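A minimal sketch of a GRU step for comparison with the LSTM above; a single update gate z plays the role of both forget and input gates, and there is no separate cell state:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    """One GRU step: update gate z interpolates between the old state and a candidate."""
    zx = np.concatenate([h_prev, x])
    z = sigmoid(W_z @ zx + b_z)                                       # update gate (forget + input)
    r = sigmoid(W_r @ zx + b_r)                                       # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x]) + b_h)    # candidate state
    return (1.0 - z) * h_prev + z * h_tilde                           # merged cell/hidden state
```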

Recurrent networks

Sequence to sequence LSTM models

- Encoder-decoder: the first RNN encodes the input and captures its context in C; the second RNN decodes the output [Goodfellow et al.]
- First competitive neural MT model [Sutskever et al., 2014]


Recurrent networks

Deep RNNs

Straightforward to add layers to h (stacked recurrent layers) [A. Karpathy]
