

SLIDE 1

DATA ANALYTICS USING DEEP LEARNING

GT 8803 // FALL 2019 // JOY ARULRAJ

LECTURE #12: TRAINING NEURAL NETWORKS (PT 1)

SLIDE 2

administrivia

  • Reminders

– Integration with Eva
– Code reviews
– Each team must send Pull Requests to Eva


SLIDE 3

Where we are now...


Hardware + Software: PyTorch, TensorFlow

SLIDE 4

OVERVIEW

  • One time setup
– Activation Functions, Preprocessing, Weight Initialization, Regularization, Gradient Checking
  • Training dynamics
– Babysitting the Learning Process, Parameter updates, Hyperparameter Optimization
  • Evaluation
– Model ensembles, Test-time augmentation


SLIDE 5

TODAY’s AGENDA

  • Training Neural Networks

– Activation Functions
– Data Preprocessing
– Weight Initialization
– Batch Normalization


SLIDE 6

ACTIVATION FUNCTIONS


SLIDE 7

Activation Functions

SLIDE 8

Sigmoid, tanh, ReLU, Leaky ReLU, Maxout, ELU

Activation Functions
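For concreteness, a minimal NumPy sketch of the activation functions listed above (the maxout helper and its two weight/bias branches are illustrative, not taken from the slides):

    import numpy as np

    def sigmoid(x):                return 1.0 / (1.0 + np.exp(-x))        # squashes to (0, 1)
    def tanh(x):                   return np.tanh(x)                      # squashes to (-1, 1)
    def relu(x):                   return np.maximum(0.0, x)              # f(x) = max(0, x)
    def leaky_relu(x, alpha=0.01): return np.where(x > 0, x, alpha * x)   # small slope for x < 0
    def elu(x, alpha=1.0):         return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))
    def maxout(x, W1, b1, W2, b2): return np.maximum(x @ W1 + b1, x @ W2 + b2)  # max of two affine maps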

SLIDE 9

Activation Functions

Sigmoid

  • Squashes numbers to range [0,1]
  • Historically popular since they have a nice interpretation as a saturating “firing rate” of a neuron

SLIDE 10

Sigmoid

  • Squashes numbers to range [0,1]
  • Historically popular since they have a nice interpretation as a saturating “firing rate” of a neuron

3 problems:

1. Saturated neurons “kill” the gradients

Activation Functions

SLIDE 11

sigmoid gate

What happens when x = -10? What happens when x = 0? What happens when x = 10?
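A quick numerical check (a sketch, assuming the standard sigmoid 1/(1+e^-x)): the local gradient s*(1-s) is essentially zero at x = -10 and x = 10 and peaks at 0.25 at x = 0, so saturated inputs pass back almost no gradient.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    for x in [-10.0, 0.0, 10.0]:
        s = sigmoid(x)
        local_grad = s * (1.0 - s)     # d(sigmoid)/dx
        print(x, s, local_grad)        # ~4.5e-5 at x = -10 and x = 10, 0.25 at x = 0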

SLIDE 12

Sigmoid

  • Squashes numbers to range [0,1]
  • Historically popular since they have a nice interpretation as a saturating “firing rate” of a neuron

3 problems:

1. Saturated neurons “kill” the gradients
2. Sigmoid outputs are not zero-centered

Activation Functions

SLIDE 13

Consider what happens when the input to a neuron is always positive... What can we say about the gradients on w?

SLIDE 14

Consider what happens when the input to a neuron is always positive... What can we say about the gradients on w? Always all positive or all negative :(

(Figure: hypothetical optimal w vector; zig-zag update path; allowed gradient update directions)

SLIDE 15

Consider what happens when the input to a neuron is always positive... What can we say about the gradients on w? Always all positive or all negative :( (For a single element! Minibatches help)

(Figure: hypothetical optimal w vector; zig-zag update path; allowed gradient update directions)

SLIDE 16

Activation Functions

Sigmoid

  • Squashes numbers to range [0,1]
  • Historically popular since they have a nice interpretation as a saturating “firing rate” of a neuron

3 problems:

1. Saturated neurons “kill” the gradients
2. Sigmoid outputs are not zero-centered
3. exp() is a bit compute expensive

SLIDE 17

Activation Functions

tanh(x)

  • Squashes numbers to range [-1,1]
  • zero centered (nice)
  • still kills gradients when saturated :(

[LeCun et al., 1991]

SLIDE 18

Activation Functions

ReLU (Rectified Linear Unit)

  • Computes f(x) = max(0,x)
  • Does not saturate (in +region)
  • Very computationally efficient
  • Converges much faster than sigmoid/tanh in practice (e.g. 6x)

[Krizhevsky et al., 2012]

SLIDE 19

Activation Functions

ReLU (Rectified Linear Unit)

  • Computes f(x) = max(0,x)
  • Does not saturate (in +region)
  • Very computationally efficient
  • Converges much faster than sigmoid/tanh in practice (e.g. 6x)
  • Not zero-centered output

[Krizhevsky et al., 2012]

SLIDE 20

Activation Functions

ReLU (Rectified Linear Unit)

  • Computes f(x) = max(0,x)
  • Does not saturate (in +region)
  • Very computationally efficient
  • Converges much faster than sigmoid/tanh in practice (e.g. 6x)
  • Not zero-centered output

[Krizhevsky et al., 2012]

SLIDE 21

Activation Functions

ReLU (Rectified Linear Unit)

  • Computes f(x) = max(0,x)
  • Does not saturate (in +region)
  • Very computationally efficient
  • Converges much faster than sigmoid/tanh in practice (e.g. 6x)
  • Not zero-centered output
  • An annoyance:

hint: what is the gradient when x < 0?

SLIDE 22

ReLU gate

What happens when x = -10? What happens when x = 0? What happens when x = 10?

SLIDE 23

(Figure: data cloud with an active ReLU and a dead ReLU.) A dead ReLU will never activate => never update

SLIDE 24

(Figure: data cloud with an active ReLU and a dead ReLU.) A dead ReLU will never activate => never update => people like to initialize ReLU neurons with slightly positive biases (e.g. 0.01)
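A minimal sketch of that bias trick (the layer sizes and batch size here are just illustrative):

    import numpy as np

    D_in, D_out = 4096, 4096
    W = 0.01 * np.random.randn(D_in, D_out)   # small random weights
    b = 0.01 * np.ones(D_out)                 # slightly positive biases: ReLUs start in the active region
    x = np.random.randn(128, D_in)
    h = np.maximum(0.0, x @ W + b)            # fewer units start out "dead" (always zero)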

SLIDE 25

Activation Functions

Leaky ReLU

  • Does not saturate
  • Computationally efficient
  • Converges much faster than sigmoid/tanh in practice! (e.g. 6x)
  • Will not “die”.

[Maas et al., 2013] [He et al., 2015]

SLIDE 26

Activation Functions

Leaky ReLU

  • Does not saturate
  • Computationally efficient
  • Converges much faster than sigmoid/tanh in practice! (e.g. 6x)
  • Will not “die”.

Parametric Rectifier (PReLU)

backprop into α (parameter)

[Maas et al., 2013] [He et al., 2015]
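A sketch of the difference (the backward-pass helper is a hand-derived illustration, not library code): Leaky ReLU fixes the negative slope α, while PReLU treats α as a parameter and backprops into it.

    import numpy as np

    def leaky_relu(x, alpha=0.01):            # alpha fixed by hand
        return np.where(x > 0, x, alpha * x)

    def prelu_forward(x, alpha):              # alpha is learnable
        return np.where(x > 0, x, alpha * x)

    def prelu_backward(x, alpha, dout):       # gradients w.r.t. x and alpha
        dx = np.where(x > 0, dout, alpha * dout)
        dalpha = np.sum(dout * np.where(x > 0, 0.0, x))
        return dx, dalpha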

SLIDE 27

Activation Functions

Exponential Linear Units (ELU)

  • All benefits of ReLU
  • Closer to zero mean outputs
  • Negative saturation regime compared with Leaky ReLU adds some robustness to noise
  • Computation requires exp()

[Clevert et al., 2015]

SLIDE 28

Maxout “Neuron”

  • Does not have the basic form of dot product -> nonlinearity

  • Generalizes ReLU and Leaky ReLU
  • Linear Regime! Does not saturate! Does not die!

Problem: doubles the number of parameters/neuron :(

[Goodfellow et al., 2013]

SLIDE 29

TLDR: In practice:

  • Use ReLU. Be careful with your learning rates
  • Try out Leaky ReLU / Maxout / ELU
  • Try out tanh but don’t expect much
  • Don’t use sigmoid

SLIDE 30

DATA PREPROCESSING


SLIDE 31

DATA PREPROCESSING

(Assume X [NxD] is data matrix, each example in a row)

SLIDE 32

Remember: Consider what happens when the input to a neuron is always positive... What can we say about the gradients on w? Always all positive or all negative :( (this is also why you want zero-mean data!)

(Figure: hypothetical optimal w vector; zig-zag update path; allowed gradient update directions)

SLIDE 33

DATA PREPROCESSING

(Assume X [NxD] is data matrix, each example in a row)

SLIDE 34

DATA PREPROCESSING

(decorrelated data: diagonal covariance matrix) (whitened data: covariance matrix is the identity matrix)

In practice, you may also see PCA and Whitening of the data
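In code, these steps might look like the following sketch (run on a random stand-in matrix; the 1e-5 fudge factor in the whitening step is a common but arbitrary choice):

    import numpy as np

    X = np.random.randn(1000, 50)              # stand-in data matrix, one example per row
    X = X - np.mean(X, axis=0)                 # zero-center each dimension
    X_norm = X / np.std(X, axis=0)             # normalize to unit variance per dimension

    cov = (X.T @ X) / X.shape[0]               # covariance matrix, D x D
    U, S, _ = np.linalg.svd(cov)
    X_decorr = X @ U                           # decorrelated: diagonal covariance
    X_white = X_decorr / np.sqrt(S + 1e-5)     # whitened: covariance ~ identity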

SLIDE 35

DATA PREPROCESSING

Before normalization: classification loss very sensitive to changes in weight matrix; hard to optimize

After normalization: less sensitive to small changes in weights; easier to optimize

SLIDE 36

TLDR: In practice for Images: center only

  • Subtract the mean image (e.g. AlexNet) (mean image = [32,32,3] array)
  • Subtract per-channel mean (e.g. VGGNet) (mean along each channel = 3 numbers)
  • Subtract per-channel mean and divide by per-channel std (e.g. ResNet) (mean along each channel = 3 numbers)

e.g. consider CIFAR-10 example with [32,32,3] images

Not common to do PCA or whitening
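The three image options above, sketched on a stand-in CIFAR-10-shaped array (the dataset size and random values are assumptions for illustration):

    import numpy as np

    X = np.random.rand(50000, 32, 32, 3)           # stand-in for CIFAR-10 training images
    mean_image = X.mean(axis=0)                    # [32,32,3] array (AlexNet-style)
    channel_mean = X.mean(axis=(0, 1, 2))          # 3 numbers (VGGNet-style)
    channel_std = X.std(axis=(0, 1, 2))            # 3 numbers

    X_alexnet = X - mean_image
    X_vggnet = X - channel_mean
    X_resnet = (X - channel_mean) / channel_std    # ResNet-style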

SLIDE 37

WEIGHT INITIALIZATION


SLIDE 38

Q: what happens when W=constant init is used?

SLIDE 39

First idea: Small random numbers (gaussian with zero mean and 1e-2 standard deviation)

SLIDE 40

First idea: Small random numbers (gaussian with zero mean and 1e-2 standard deviation) Works ~okay for small networks, but problems with deeper networks.

SLIDE 41

Weight Initialization: Activation statistics

Forward pass for a 6-layer net with hidden size 4096

SLIDE 42

Weight Initialization: Activation statistics

All activations tend to zero for deeper network layers.
Q: What do the gradients dL/dW look like?

Forward pass for a 6-layer net with hidden size 4096

SLIDE 43

Weight Initialization: Activation statistics

All activations tend to zero for deeper network layers.
Q: What do the gradients dL/dW look like?
A: All zero, no learning =(

Forward pass for a 6-layer net with hidden size 4096
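The experiment behind these plots is easy to reproduce; a rough sketch (the batch size and tanh nonlinearity are assumptions matching the setup shown on the slides):

    import numpy as np

    dims = [4096] * 7                              # 6 layers with hidden size 4096
    x = np.random.randn(16, dims[0])
    for Din, Dout in zip(dims[:-1], dims[1:]):
        W = 0.01 * np.random.randn(Din, Dout)      # small random init (std = 0.01)
        x = np.tanh(x @ W)
        print(x.std())                             # std of activations shrinks toward 0 layer by layer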

SLIDE 44

Weight Initialization: Activation statistics

Increase std of initial weights from 0.01 to 0.05

SLIDE 45

Weight Initialization: Activation statistics

All activations saturate.
Q: What do the gradients look like?

Increase std of initial weights from 0.01 to 0.05

SLIDE 46

Weight Initialization: Activation statistics

All activations saturate.
Q: What do the gradients look like?
A: Local gradients all zero, no learning =(

Increase std of initial weights from 0.01 to 0.05

SLIDE 47

Weight Initialization: “XAVIER” Initialization

“Xavier” initialization: std = 1/sqrt(Din)

SLIDE 48

Weight Initialization: “XAVIER” Initialization

“Xavier” initialization: std = 1/sqrt(Din)

Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTATS 2010

“Just right”: Activations are nicely scaled for all layers!

SLIDE 49

Weight Initialization: “XAVIER” Initialization

“Xavier” initialization: std = 1/sqrt(Din)

Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTATS 2010

“Just right”: Activations are nicely scaled for all layers!

For conv layers, Din is kernel_size^2 * input_channels

SLIDE 50

Weight Initialization: “XAVIER” Initialization

“Xavier” initialization: std = 1/sqrt(Din)

“Just right”: Activations are nicely scaled for all layers!

For conv layers, Din is kernel_size^2 * input_channels

Derivation: let y = Wx and h = f(y). Then
Var(y_i) = Din * Var(x_i * w_i)                               [assume x, w are iid]
         = Din * (E[x_i^2] * E[w_i^2] - E[x_i]^2 * E[w_i]^2)  [assume x, w independent]
         = Din * Var(x_i) * Var(w_i)                          [assume x, w are zero-mean]
If Var(w_i) = 1/Din, then Var(y_i) = Var(x_i).
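Swapping the Xavier scale into the same toy experiment (a sketch, same assumed setup as before) keeps the activation statistics roughly constant across layers:

    import numpy as np

    dims = [4096] * 7
    x = np.random.randn(16, dims[0])
    for Din, Dout in zip(dims[:-1], dims[1:]):
        W = np.random.randn(Din, Dout) / np.sqrt(Din)   # "Xavier": std = 1/sqrt(Din)
        x = np.tanh(x @ W)
        print(x.std())                                  # roughly constant across layers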

SLIDE 51

Weight Initialization: WHAT ABOUT RELU?

Change from tanh to ReLU

SLIDE 52

Weight Initialization: WHAT ABOUT RELU?

Change from tanh to ReLU

Xavier assumes a zero-centered activation function.
Activations collapse to zero again, no learning =(

SLIDE 53

Weight Initialization: KAIMING/MSRA INITIALIZATION

ReLU correction: std = sqrt(2 / Din)
“Just right”: Activations are nicely scaled for all layers!

He et al, “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”, ICCV 2015
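The corresponding sketch for ReLU networks with the Kaiming/MSRA scale (same assumed toy setup as the earlier experiments):

    import numpy as np

    dims = [4096] * 7
    x = np.random.randn(16, dims[0])
    for Din, Dout in zip(dims[:-1], dims[1:]):
        W = np.random.randn(Din, Dout) * np.sqrt(2.0 / Din)   # Kaiming/MSRA: std = sqrt(2 / Din)
        x = np.maximum(0.0, x @ W)                            # ReLU
        print(x.std())                                        # activations stay reasonably scaled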

SLIDE 54

PROPER INITIALIZATION IS AN ACTIVE AREA OF RESEARCH…

  • Understanding the difficulty of training deep feedforward neural networks by Glorot and Bengio, 2010
  • Exact solutions to the nonlinear dynamics of learning in deep linear neural networks by Saxe et al, 2013
  • Random walk initialization for training very deep feedforward networks by Sussillo and Abbott, 2014
  • Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification by He et al., 2015
  • Data-dependent Initializations of Convolutional Neural Networks by Krähenbühl et al., 2015
  • All you need is a good init, Mishkin and Matas, 2015
  • Fixup Initialization: Residual Learning Without Normalization, Zhang et al, 2019
  • The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks, Frankle and Carbin, 2019

SLIDE 55

BATCH NORMALIZATION


SLIDE 56

BATCH NORMALIZATION

“you want zero-mean unit-variance activations? just make them so.”

Consider a batch of activations at some layer. To make each dimension zero-mean unit-variance, apply

x_hat = (x - E[x]) / sqrt(Var[x])

This is a vanilla differentiable function...

[Ioffe and Szegedy, 2015]

SLIDE 57

BATCH NORMALIZATION

Input: x, shape N x D
Per-channel mean, shape is D
Per-channel var, shape is D
Normalized x, shape is N x D

[Ioffe and Szegedy, 2015]

SLIDE 58

BATCH NORMALIZATION

Input: x, shape N x D
Per-channel mean, shape is D
Per-channel var, shape is D
Normalized x, shape is N x D

Problem: What if zero-mean, unit variance is too hard of a constraint?

[Ioffe and Szegedy, 2015]

SLIDE 59

BATCH NORMALIZATION

Input: x, shape N x D
Per-channel mean, shape is D
Per-channel var, shape is D
Normalized x, shape is N x D
Output, shape is N x D

Learnable scale and shift parameters: γ, β (each shape D)
Learning γ = sqrt(Var[x]), β = E[x] will recover the identity function!

[Ioffe and Szegedy, 2015]
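At training time the whole layer is a few lines; a sketch (eps is the usual small constant for numerical stability, its value chosen here for illustration):

    import numpy as np

    def batchnorm_train(x, gamma, beta, eps=1e-5):   # x: (N, D); gamma, beta: (D,)
        mu = x.mean(axis=0)                          # per-channel mean, shape (D,)
        var = x.var(axis=0)                          # per-channel variance, shape (D,)
        x_hat = (x - mu) / np.sqrt(var + eps)        # normalized x, shape (N, D)
        return gamma * x_hat + beta                  # learnable scale and shift, shape (N, D)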

SLIDE 60

BATCH NORMALIZATION: TEST TIME

Input: x, shape N x D
Per-channel mean, shape is D
Per-channel var, shape is D
Normalized x, shape is N x D
Output, shape is N x D

Learnable scale and shift parameters: γ, β
Learning γ = sqrt(Var[x]), β = E[x] will recover the identity function!

Estimates depend on minibatch; can’t do this at test-time!

SLIDE 61

BATCH NORMALIZATION: TEST TIME

Input: x, shape N x D
Per-channel mean, shape is D: (running) average of values seen during training
Per-channel var, shape is D: (running) average of values seen during training
Normalized x, shape is N x D
Output, shape is N x D

Learnable scale and shift parameters: γ, β

During testing batchnorm becomes a linear operator! Can be fused with the previous fully-connected or conv layer
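A sketch of how the train/test difference is usually handled (the momentum value is an assumption; frameworks differ in how they track the running statistics):

    import numpy as np

    def batchnorm(x, gamma, beta, running_mu, running_var, train, momentum=0.9, eps=1e-5):
        if train:
            mu, var = x.mean(axis=0), x.var(axis=0)
            running_mu = momentum * running_mu + (1 - momentum) * mu     # track statistics for test time
            running_var = momentum * running_var + (1 - momentum) * var
        else:
            mu, var = running_mu, running_var      # fixed estimates: batchnorm is now a linear operator
        x_hat = (x - mu) / np.sqrt(var + eps)
        return gamma * x_hat + beta, running_mu, running_var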

SLIDE 62

BATCH NORMALIZATION FOR CONVNETS

Batch Normalization for fully-connected networks:
  x: N × D
  μ, σ: 1 × D
  γ, β: 1 × D
  y = γ(x - μ)/σ + β      (normalize over the batch dimension N)

Batch Normalization for convolutional networks (Spatial Batchnorm, BatchNorm2D):
  x: N × C × H × W
  μ, σ: 1 × C × 1 × 1
  γ, β: 1 × C × 1 × 1
  y = γ(x - μ)/σ + β      (normalize over N, H, W for each channel)
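A sketch of the spatial version: the only change is which axes the statistics are taken over (per channel, across the batch and spatial dimensions).

    import numpy as np

    def spatial_batchnorm_train(x, gamma, beta, eps=1e-5):   # x: (N, C, H, W); gamma, beta: (1, C, 1, 1)
        mu = x.mean(axis=(0, 2, 3), keepdims=True)           # per-channel mean, shape (1, C, 1, 1)
        var = x.var(axis=(0, 2, 3), keepdims=True)           # per-channel variance, shape (1, C, 1, 1)
        return gamma * (x - mu) / np.sqrt(var + eps) + beta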

SLIDE 63

BATCH NORMALIZATION

FC -> BN -> tanh -> FC -> BN -> tanh -> ...

Usually inserted after Fully Connected or Convolutional layers, and before nonlinearity.

[Ioffe and Szegedy, 2015]

SLIDE 64

BATCH NORMALIZATION

FC -> BN -> tanh -> FC -> BN -> tanh -> ...

[Ioffe and Szegedy, 2015]

  • Makes deep networks much easier to train!
  • Improves gradient flow
  • Allows higher learning rates, faster convergence
  • Networks become more robust to initialization
  • Acts as regularization during training
  • Zero overhead at test-time: can be fused with conv!
  • Behaves differently during training and testing: this is a very common source of bugs!

SLIDE 65

LAYER NORMALIZATION

Batch Normalization for fully-connected networks:
  x: N × D
  μ, σ: 1 × D
  γ, β: 1 × D
  y = γ(x - μ)/σ + β

Layer Normalization for fully-connected networks:
  x: N × D
  μ, σ: N × 1
  γ, β: 1 × D
  y = γ(x - μ)/σ + β

Ba, Kiros, and Hinton, “Layer Normalization”, arXiv 2016

Same behavior at train and test! Can be used in recurrent networks.
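In a sketch, the only difference from batchnorm is the axis the statistics are computed over, which is why layer norm behaves the same at train and test (no dependence on the rest of the batch):

    import numpy as np

    def layernorm(x, gamma, beta, eps=1e-5):       # x: (N, D); gamma, beta: (D,)
        mu = x.mean(axis=1, keepdims=True)         # per-example mean, shape (N, 1)
        var = x.var(axis=1, keepdims=True)         # per-example variance, shape (N, 1)
        return gamma * (x - mu) / np.sqrt(var + eps) + beta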

SLIDE 66

INSTANCE NORMALIZATION

Batch Normalization for convolutional networks:
  x: N × C × H × W
  μ, σ: 1 × C × 1 × 1
  γ, β: 1 × C × 1 × 1
  y = γ(x - μ)/σ + β

Instance Normalization for convolutional networks:
  x: N × C × H × W
  μ, σ: N × C × 1 × 1
  γ, β: 1 × C × 1 × 1
  y = γ(x - μ)/σ + β

Same behavior at train / test!

Ulyanov et al, Improved Texture Networks: Maximizing Quality and Diversity in Feed-forward Stylization and Texture Synthesis, CVPR 2017

SLIDE 67

COMPARISON OF NORMALIZATION LAYERS

Wu and He, “Group Normalization”, ECCV 2018

SLIDE 68

GROUP NORMALIZATION

Wu and He, “Group Normalization”, ECCV 2018

SLIDE 69

SUMMARY

We looked in detail at:

  • Activation Functions (use ReLU)
  • Data Preprocessing (images: subtract mean)
  • Weight Initialization (use Xavier/He init)
  • Batch Normalization (use)


(TLDRs in parentheses)

SLIDE 70

NEXT LECTURE

  • Training Neural Networks (Part II)

– Parameter update schemes
– Learning rate schedules
– Gradient checking
– Regularization (Dropout etc.)
– Babysitting learning
– Hyperparameter search
– Evaluation (Ensembles etc.)
– Transfer learning / fine-tuning
