Lecture 10: Training Neural Networks (Part 1)
Justin Johnson, October 7, 2019


SLIDE 1

Lecture 10: Training Neural Networks (Part 1)

SLIDE 2

Reminder: A3

  • Due Monday, October 14 (1 week from today!)
  • Remember to run the validation script!

SLIDE 3

Midterm Exam

  • Monday, October 21 (two weeks from today!)
  • Location: Chrysler 220 (NOT HERE!)
  • Format:
    • True / False, multiple choice, short answer
    • Emphasizes concepts – you don't need to memorize AlexNet!
    • Closed-book; you can bring a 1-page "cheat sheet" of handwritten notes (standard 8.5" x 11" paper)
  • Alternate exam times: fill out this form: https://forms.gle/uiMpHdg9752p27bd7
    • Conflict with EECS 551
    • SSD accommodations
    • Conference travel for Michigan

SLIDE 4

Last Time: Hardware and Software

  • CPU, GPU, TPU
  • Static graphs vs dynamic graphs
  • PyTorch vs TensorFlow

SLIDE 5

Overview

  1. One-time setup: activation functions, data preprocessing, weight initialization, regularization
  2. Training dynamics: learning rate schedules; large-batch training; hyperparameter optimization
  3. After training: model ensembles, transfer learning

SLIDE 6

Overview

  1. One-time setup: activation functions, data preprocessing, weight initialization, regularization (today)
  2. Training dynamics: learning rate schedules; large-batch training; hyperparameter optimization (next time)
  3. After training: model ensembles, transfer learning (next time)

SLIDE 7

Activation Functions

SLIDE 8

Activation Functions

SLIDE 9

Activation Functions

  • Sigmoid, tanh, ReLU, Leaky ReLU, Maxout, ELU

SLIDE 10

Activation Functions: Sigmoid

Sigmoid: σ(x) = 1 / (1 + e^(-x))

  • Squashes numbers to the range [0,1]
  • Historically popular since they have a nice interpretation as a saturating "firing rate" of a neuron

SLIDE 11

Activation Functions: Sigmoid

Sigmoid: σ(x) = 1 / (1 + e^(-x))

  • Squashes numbers to the range [0,1]
  • Historically popular since they have a nice interpretation as a saturating "firing rate" of a neuron

3 problems:

  1. Saturated neurons "kill" the gradients

SLIDE 12

Activation Functions: Sigmoid

[Diagram: a sigmoid gate in a computational graph, with input x and the upstream gradient flowing back through it]

What happens when x = -10? What happens when x = 0? What happens when x = 10?
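To make the answer concrete, here is a small NumPy sketch (not from the slides) that evaluates the sigmoid and its local gradient σ'(x) = σ(x)(1 − σ(x)) at the three inputs asked about above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in [-10.0, 0.0, 10.0]:
    s = sigmoid(x)
    local_grad = s * (1 - s)   # d(sigmoid)/dx
    print(f"x = {x:6.1f}   sigmoid(x) = {s:.5f}   local grad = {local_grad:.5f}")

# x = -10: sigmoid ~ 0.00005, local grad ~ 0.00005 -> the upstream gradient is (almost) killed
# x =   0: sigmoid = 0.5,     local grad = 0.25    -> gradient flows (scaled by 0.25)
# x = +10: sigmoid ~ 0.99995, local grad ~ 0.00005 -> the upstream gradient is (almost) killed
```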

SLIDE 13

Activation Functions: Sigmoid

Sigmoid: σ(x) = 1 / (1 + e^(-x))

  • Squashes numbers to the range [0,1]
  • Historically popular since they have a nice interpretation as a saturating "firing rate" of a neuron

3 problems:

  1. Saturated neurons "kill" the gradients

SLIDE 14

Activation Functions: Sigmoid

Sigmoid: σ(x) = 1 / (1 + e^(-x))

  • Squashes numbers to the range [0,1]
  • Historically popular since they have a nice interpretation as a saturating "firing rate" of a neuron

3 problems:

  1. Saturated neurons "kill" the gradients
  2. Sigmoid outputs are not zero-centered

SLIDE 15

Consider what happens when the input to a neuron is always positive... What can we say about the gradients on w?

SLIDE 16

Consider what happens when the input to a neuron is always positive... What can we say about the gradients on w? Always all positive or all negative :(

[Figure: 2D weight space with a hypothetical optimal w vector; the allowed gradient update directions are restricted to two quadrants (all-positive or all-negative)]

SLIDE 17

Consider what happens when the input to a neuron is always positive... What can we say about the gradients on w? Always all positive or all negative :( (For a single element! Minibatches help.)

[Figure: hypothetical optimal w vector; allowed gradient update directions (two quadrants)]

SLIDE 18

Activation Functions: Sigmoid

Sigmoid: σ(x) = 1 / (1 + e^(-x))

  • Squashes numbers to the range [0,1]
  • Historically popular since they have a nice interpretation as a saturating "firing rate" of a neuron

3 problems:

  1. Saturated neurons "kill" the gradients
  2. Sigmoid outputs are not zero-centered

SLIDE 19

Activation Functions: Sigmoid

Sigmoid: σ(x) = 1 / (1 + e^(-x))

  • Squashes numbers to the range [0,1]
  • Historically popular since they have a nice interpretation as a saturating "firing rate" of a neuron

3 problems:

  1. Saturated neurons "kill" the gradients
  2. Sigmoid outputs are not zero-centered
  3. exp() is a bit compute-expensive

SLIDE 20

Activation Functions: Tanh

tanh(x)

  • Squashes numbers to the range [-1,1]
  • Zero-centered (nice)
  • Still kills gradients when saturated :(
SLIDE 21

Activation Functions: ReLU

ReLU (Rectified Linear Unit): f(x) = max(0, x)

  • Does not saturate (in the + region)
  • Very computationally efficient
  • Converges much faster than sigmoid/tanh in practice (e.g. 6x)

SLIDE 22

Activation Functions: ReLU

ReLU (Rectified Linear Unit): f(x) = max(0, x)

  • Does not saturate (in the + region)
  • Very computationally efficient
  • Converges much faster than sigmoid/tanh in practice (e.g. 6x)
  • Not zero-centered output
SLIDE 23

Activation Functions: ReLU

ReLU (Rectified Linear Unit): f(x) = max(0, x)

  • Does not saturate (in the + region)
  • Very computationally efficient
  • Converges much faster than sigmoid/tanh in practice (e.g. 6x)
  • Not zero-centered output
  • An annoyance (hint: what is the gradient when x < 0?)

SLIDE 24

Activation Functions: ReLU

[Diagram: a ReLU gate in a computational graph, with input x]

What happens when x = -10? What happens when x = 0? What happens when x = 10?

SLIDE 25

[Figure: a 2D data cloud; an "active ReLU" hyperplane passes through the data, while a "dead ReLU" hyperplane lies entirely outside it]

A dead ReLU will never activate => never update

SLIDE 26

[Figure: data cloud with active-ReLU and dead-ReLU regions, as on the previous slide]

A dead ReLU will never activate => never update
=> Sometimes people initialize ReLU neurons with slightly positive biases (e.g. 0.01)

SLIDE 27

Activation Functions: Leaky ReLU

Leaky ReLU: f(x) = max(0.01x, x)

  • Does not saturate
  • Computationally efficient
  • Converges much faster than sigmoid/tanh in practice! (e.g. 6x)
  • Will not "die"

Maas et al, "Rectifier Nonlinearities Improve Neural Network Acoustic Models", ICML 2013

SLIDE 28

Activation Functions: Leaky ReLU

Leaky ReLU: f(x) = max(0.01x, x)

  • Does not saturate
  • Computationally efficient
  • Converges much faster than sigmoid/tanh in practice! (e.g. 6x)
  • Will not "die"

Parametric Rectifier (PReLU): f(x) = max(αx, x), where α is a learned parameter (backprop into α)

Maas et al, "Rectifier Nonlinearities Improve Neural Network Acoustic Models", ICML 2013
He et al, "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification", ICCV 2015

SLIDE 29

Activation Functions: Exponential Linear Unit (ELU)

ELU: f(x) = x if x > 0, α(exp(x) - 1) if x ≤ 0   (default α = 1)

  • All benefits of ReLU
  • Closer to zero-mean outputs
  • Negative saturation regime (compared with Leaky ReLU) adds some robustness to noise
  • Computation requires exp()
SLIDE 30

Activation Functions: Scaled Exponential Linear Unit (SELU)

SELU: scaled version of ELU with
α = 1.6732632423543772848170429916717
λ = 1.0507009873554804934193349852946

  • Scaled version of ELU that works better for deep networks
  • "Self-Normalizing" property; can train deep SELU networks without BatchNorm

Klambauer et al, "Self-Normalizing Neural Networks", NeurIPS 2017

SLIDE 31

Activation Functions: Scaled Exponential Linear Unit (SELU)

SELU: scaled version of ELU with
α = 1.6732632423543772848170429916717
λ = 1.0507009873554804934193349852946

  • Scaled version of ELU that works better for deep networks
  • "Self-Normalizing" property; can train deep SELU networks without BatchNorm
  • Derivation takes 91 pages of math in the appendix…

Klambauer et al, "Self-Normalizing Neural Networks", NeurIPS 2017

SLIDE 32

[Chart: accuracy on CIFAR-10 for ReLU, Leaky ReLU, Parametric ReLU, Softplus, ELU, SELU, GELU, and Swish on ResNet, Wide ResNet, and DenseNet; all combinations land between roughly 93% and 95.6%, so the choice of activation makes little difference]

Ramachandran et al, "Searching for activation functions", ICLR Workshop 2018

SLIDE 33

Activation Functions: Summary

  • Don't think too hard. Just use ReLU
  • Try out Leaky ReLU / ELU / SELU / GELU if you need to squeeze out that last 0.1% (see the sketch below)
  • Don't use sigmoid or tanh
SLIDE 34

Data Preprocessing

SLIDE 35

Data Preprocessing

(Assume X [N x D] is the data matrix, each example in a row)

SLIDE 36

Remember: Consider what happens when the input to a neuron is always positive...

What can we say about the gradients on w? Always all positive or all negative :( (this is also why you want zero-mean data!)

[Figure: hypothetical optimal w vector; allowed gradient update directions (two quadrants)]

SLIDE 37

Data Preprocessing

(Assume X [N x D] is the data matrix, each example in a row)

SLIDE 38

Data Preprocessing

In practice, you may also see PCA and whitening of the data:

  • Decorrelated data (after PCA): data has a diagonal covariance matrix
  • Whitened data: covariance matrix is the identity matrix
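A minimal NumPy sketch of these preprocessing options, assuming X is the [N x D] data matrix from the slides (the random array here only stands in for real data):

```python
import numpy as np

X = np.random.randn(5000, 100)                    # fake [N x D] data, one example per row

X_zero_centered = X - X.mean(axis=0)              # subtract per-feature mean
X_normalized = X_zero_centered / X.std(axis=0)    # then divide by per-feature std

# PCA / whitening (less common in practice):
cov = X_zero_centered.T @ X_zero_centered / X.shape[0]
U, S, _ = np.linalg.svd(cov)
X_decorrelated = X_zero_centered @ U              # diagonal covariance
X_whitened = X_decorrelated / np.sqrt(S + 1e-5)   # ~identity covariance
```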

SLIDE 39

Data Preprocessing

  • Before normalization: the classification loss is very sensitive to changes in the weight matrix; hard to optimize
  • After normalization: less sensitive to small changes in the weights; easier to optimize
SLIDE 40

Data Preprocessing for Images

e.g. consider a CIFAR-10 example with [32, 32, 3] images:

  • Subtract the mean image (e.g. AlexNet) (mean image = [32, 32, 3] array)
  • Subtract the per-channel mean (e.g. VGGNet) (mean along each channel = 3 numbers)
  • Subtract the per-channel mean and divide by the per-channel std (e.g. ResNet) (mean and std along each channel = 3 numbers each)

Not common to do PCA or whitening. A sketch of the per-channel options appears below.
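A small NumPy sketch of the per-channel options, assuming a batch of [N, 32, 32, 3] CIFAR-10-style images (the random array is only illustrative):

```python
import numpy as np

images = np.random.randint(0, 256, size=(50, 32, 32, 3)).astype(np.float32)

channel_mean = images.mean(axis=(0, 1, 2))   # 3 numbers
channel_std = images.std(axis=(0, 1, 2))     # 3 numbers

vgg_style = images - channel_mean                       # subtract per-channel mean
resnet_style = (images - channel_mean) / channel_std    # also divide by per-channel std
```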

SLIDE 41

Weight Initialization

SLIDE 42

Weight Initialization

Q: What happens if we initialize all W = 0, b = 0?

SLIDE 43

Weight Initialization

Q: What happens if we initialize all W = 0, b = 0?
A: All outputs are 0, and all gradients are the same! No "symmetry breaking"

SLIDE 44

Weight Initialization

Next idea: small random numbers (Gaussian with zero mean, std = 0.01)

SLIDE 45

Weight Initialization

Next idea: small random numbers (Gaussian with zero mean, std = 0.01)

Works ~okay for small networks, but causes problems with deeper networks.

SLIDE 46

Weight Initialization: Activation Statistics

Forward pass for a 6-layer net with hidden size 4096
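A sketch of what this experiment might look like in NumPy, with weights drawn as small Gaussians (std = 0.01); the printed standard deviations shrink toward zero layer by layer:

```python
import numpy as np

dims = [4096] * 7                  # input plus 6 hidden layers of size 4096
x = np.random.randn(16, dims[0])   # a small batch of fake inputs
hs = []
for Din, Dout in zip(dims[:-1], dims[1:]):
    W = 0.01 * np.random.randn(Din, Dout)   # small random init
    x = np.tanh(x @ W)
    hs.append(x)

print([f"{h.std():.4f}" for h in hs])  # activation stds collapse toward zero
```

Changing the 0.01 to 0.05 (as on the following slides) saturates the tanh instead; swapping in the Xavier or Kaiming scales introduced below keeps the statistics well behaved.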

SLIDE 47

Weight Initialization: Activation Statistics

Forward pass for a 6-layer net with hidden size 4096

All activations tend to zero for deeper network layers.
Q: What do the gradients dL/dW look like?

SLIDE 48

Weight Initialization: Activation Statistics

Forward pass for a 6-layer net with hidden size 4096

All activations tend to zero for deeper network layers.
Q: What do the gradients dL/dW look like?
A: All zero, no learning =(

SLIDE 49

Weight Initialization: Activation Statistics

Increase the std of the initial weights from 0.01 to 0.05

SLIDE 50

Weight Initialization: Activation Statistics

Increase the std of the initial weights from 0.01 to 0.05

All activations saturate.
Q: What do the gradients look like?

SLIDE 51

Weight Initialization: Activation Statistics

Increase the std of the initial weights from 0.01 to 0.05

All activations saturate.
Q: What do the gradients look like?
A: Local gradients are all zero, no learning =(

SLIDE 52

Weight Initialization: Xavier Initialization

"Xavier" initialization: std = 1/sqrt(Din)

Glorot and Bengio, "Understanding the difficulty of training deep feedforward neural networks", AISTATS 2010
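A minimal sketch of the initialization itself (Din and Dout are hypothetical layer sizes):

```python
import numpy as np

Din, Dout = 4096, 4096
W = np.random.randn(Din, Dout) / np.sqrt(Din)   # std = 1/sqrt(Din)
```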

SLIDE 53

Weight Initialization: Xavier Initialization

"Xavier" initialization: std = 1/sqrt(Din)

"Just right": activations are nicely scaled for all layers!

Glorot and Bengio, "Understanding the difficulty of training deep feedforward neural networks", AISTATS 2010

SLIDE 54

Weight Initialization: Xavier Initialization

"Xavier" initialization: std = 1/sqrt(Din)

"Just right": activations are nicely scaled for all layers!

For conv layers, Din is kernel_size² * input_channels

Glorot and Bengio, "Understanding the difficulty of training deep feedforward neural networks", AISTATS 2010

SLIDE 55

Weight Initialization: Xavier Initialization

"Xavier" initialization: std = 1/sqrt(Din)

Derivation: want variance of output = variance of input.

y = Wx, so y_i = Σ_{j=1..Din} x_j w_j

Var(y_i) = Din · Var(x_i w_i)                          [assume x, w are iid]
         = Din · (E[x_i²] E[w_i²] − E[x_i]² E[w_i]²)   [assume x, w independent]
         = Din · Var(x_i) · Var(w_i)                   [assume x, w are zero-mean]

If Var(w_i) = 1/Din, then Var(y_i) = Var(x_i).

Glorot and Bengio, "Understanding the difficulty of training deep feedforward neural networks", AISTATS 2010


SLIDE 60

Weight Initialization: What about ReLU?

Change from tanh to ReLU

SLIDE 61

Weight Initialization: What about ReLU?

Change from tanh to ReLU

Xavier assumes a zero-centered activation function. Activations collapse to zero again, no learning =(

SLIDE 62

Weight Initialization: Kaiming / MSRA Initialization

ReLU correction: std = sqrt(2 / Din)

"Just right": activations nicely scaled for all layers

He et al, "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification", ICCV 2015
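The corresponding one-line sketch for the ReLU correction (hypothetical layer sizes):

```python
import numpy as np

Din, Dout = 4096, 4096
W = np.random.randn(Din, Dout) * np.sqrt(2.0 / Din)   # std = sqrt(2 / Din)
```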

SLIDE 63

Weight Initialization: Residual Networks

[Diagram: residual block computing F(x) + x, where F(x) = conv(relu(conv(x))) and a relu follows the addition]

If we initialize with MSRA, then Var(F(x)) = Var(x). But then Var(F(x) + x) > Var(x): variance grows with each block!

SLIDE 64

Weight Initialization: Residual Networks

[Diagram: residual block computing F(x) + x, where F(x) = conv(relu(conv(x)))]

If we initialize with MSRA, then Var(F(x)) = Var(x). But then Var(F(x) + x) > Var(x): variance grows with each block!

Solution: initialize the first conv with MSRA, and initialize the second conv to zero. Then Var(x + F(x)) = Var(x).

Zhang et al, "Fixup Initialization: Residual Learning Without Normalization", ICLR 2019
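A simplified PyTorch sketch of this idea (ignoring BatchNorm, which real ResNets use; there the analogous trick is to zero-initialize the scale of the last BatchNorm in each block):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()
        nn.init.kaiming_normal_(self.conv1.weight, nonlinearity="relu")  # MSRA init
        nn.init.zeros_(self.conv2.weight)  # second conv starts at zero: the block starts as the identity
        nn.init.zeros_(self.conv2.bias)

    def forward(self, x):
        return self.relu(x + self.conv2(self.relu(self.conv1(x))))
```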

SLIDE 65

Proper initialization is an active area of research

  • Understanding the difficulty of training deep feedforward neural networks, Glorot and Bengio, 2010
  • Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, Saxe et al, 2013
  • Random walk initialization for training very deep feedforward networks, Sussillo and Abbott, 2014
  • Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, He et al, 2015
  • Data-dependent Initializations of Convolutional Neural Networks, Krähenbühl et al, 2015
  • All you need is a good init, Mishkin and Matas, 2015
  • Fixup Initialization: Residual Learning Without Normalization, Zhang et al, 2019
  • The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks, Frankle and Carbin, 2019

SLIDE 66

Now your model is training … but it overfits!

Regularization

SLIDE 67

Regularization: Add a term to the loss

In common use:

  • L2 regularization (weight decay): R(W) = Σ W²
  • L1 regularization: R(W) = Σ |W|
  • Elastic net (L1 + L2)
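A minimal sketch of how such a term is added to the loss (the regularization strength and the elastic-net β are hyperparameters; the names here are illustrative):

```python
import numpy as np

def l2_reg(W):
    return np.sum(W * W)            # weight decay

def l1_reg(W):
    return np.sum(np.abs(W))

def elastic_net_reg(W, beta=0.5):
    return beta * np.sum(W * W) + np.sum(np.abs(W))

# total_loss = data_loss + reg_strength * l2_reg(W)
```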

SLIDE 68

Regularization: Dropout

In each forward pass, randomly set some neurons to zero.
The probability of dropping is a hyperparameter; 0.5 is common.

Srivastava et al, "Dropout: A simple way to prevent neural networks from overfitting", JMLR 2014

SLIDE 69

Regularization: Dropout

Example forward pass with a 3-layer network using dropout:
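A minimal NumPy sketch of such a forward pass (biases omitted; W1, W2, W3 are the layer weights and 0.5 is the drop probability):

```python
import numpy as np

p_drop = 0.5
keep = 1.0 - p_drop

def train_forward(X, W1, W2, W3):
    H1 = np.maximum(0, X @ W1)
    U1 = np.random.rand(*H1.shape) < keep   # first dropout mask
    H1 = H1 * U1                            # drop!
    H2 = np.maximum(0, H1 @ W2)
    U2 = np.random.rand(*H2.shape) < keep   # second dropout mask
    H2 = H2 * U2                            # drop!
    return H2 @ W3
```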

SLIDE 70

Regularization: Dropout

Forces the network to have a redundant representation; prevents co-adaptation of features.

[Figure: a "cat score" computed from features like "has an ear", "has a tail", "is furry", "has claws", "mischievous look"; dropout randomly crosses out a subset of these features on each pass]

SLIDE 71

Regularization: Dropout

Another interpretation: dropout is training a large ensemble of models (that share parameters). Each binary mask is one model.

An FC layer with 4096 units has 2^4096 ≈ 10^1233 possible masks! There are only ~10^82 atoms in the universe...

SLIDE 72

Dropout: Test Time

Dropout makes our output random! The output y (label) is a function of the input x (image) and a random mask z: y = f_W(x, z).

Want to "average out" the randomness at test time: y = f(x) = E_z[f(x, z)] = ∫ p(z) f(x, z) dz. But this integral seems hard…

SLIDE 73

Dropout: Test Time

Want to approximate the integral.

Consider a single neuron with inputs x, y, weights w1, w2, and output a.

SLIDE 74

Dropout: Test Time

Want to approximate the integral.

Consider a single neuron with inputs x, y, weights w1, w2, and output a. At test time we have:
E[a] = w1x + w2y

SLIDE 75

Dropout: Test Time

Consider a single neuron with inputs x, y, weights w1, w2, and output a. At test time we have:
E[a] = w1x + w2y

During training (dropping each input with probability 1/2) we have:
E[a] = ¼(w1x + w2y) + ¼(w1x + 0y) + ¼(0x + w2y) + ¼(0x + 0y) = ½(w1x + w2y)

SLIDE 76

Dropout: Test Time

Consider a single neuron with inputs x, y, weights w1, w2, and output a. At test time we have:
E[a] = w1x + w2y

During training (dropping each input with probability 1/2) we have:
E[a] = ¼(w1x + w2y) + ¼(w1x + 0y) + ¼(0x + w2y) + ¼(0x + 0y) = ½(w1x + w2y)

At test time, drop nothing and multiply by the dropout probability.

SLIDE 77

Dropout: Test Time

At test time all neurons are always active => we must scale the activations so that, for each neuron:
output at test time = expected output at training time
slide-78
SLIDE 78

Justin Johnson October 7, 2019

Dropout Summary

Lecture 10 - 78

drop in forward pass scale at test time

SLIDE 79

More common: "Inverted dropout"

  • Drop and scale during training
  • Test time is unchanged!
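A minimal sketch of the inverted-dropout version, where the scaling moves into training so the test-time forward pass is untouched:

```python
import numpy as np

keep = 0.5   # probability of keeping a unit

def dropout_train(H):
    U = (np.random.rand(*H.shape) < keep) / keep   # drop AND scale during training
    return H * U

def dropout_test(H):
    return H                                       # test time is unchanged
```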

SLIDE 80

Dropout architectures

[Chart: per-layer parameter counts for AlexNet vs VGG-16 (conv1 … conv5, fc6, fc7, fc8); the fully-connected layers fc6–fc8 dominate the parameter count]

Recall that AlexNet and VGG have most of their parameters in fully-connected layers; dropout is usually applied there ("Dropout here!").

SLIDE 81

Dropout architectures

[Chart: per-layer parameter counts for AlexNet vs VGG-16; the fully-connected layers fc6–fc8 dominate]

Recall that AlexNet and VGG have most of their parameters in fully-connected layers; dropout is usually applied there.

Later architectures (GoogLeNet, ResNet, etc) use global average pooling instead of fully-connected layers: they don't use dropout at all!

SLIDE 82

Regularization: A common pattern

Training: add some kind of randomness
Testing: average out the randomness (sometimes approximately)

SLIDE 83

Regularization: A common pattern

Training: add some kind of randomness
Testing: average out the randomness (sometimes approximately)

Example: Batch Normalization
Training: normalize using stats from random minibatches
Testing: use fixed stats to normalize

SLIDE 84

Regularization: A common pattern

Training: add some kind of randomness
Testing: average out the randomness (sometimes approximately)

Example: Batch Normalization
Training: normalize using stats from random minibatches
Testing: use fixed stats to normalize

For ResNet and later, often L2 and Batch Normalization are the only regularizers!

SLIDE 85

Data Augmentation

[Diagram: load an image and its label ("cat"), feed the image through a CNN, compute the loss against the label]

This image by Nikita is licensed under CC-BY 2.0

SLIDE 86

Data Augmentation

[Diagram: load an image and label, randomly transform the image, then feed it through the CNN and compute the loss]

SLIDE 87

Data Augmentation: Horizontal Flips

SLIDE 88

Data Augmentation: Random Crops and Scales

Training: sample random crops / scales

ResNet:
  1. Pick a random L in the range [256, 480]
  2. Resize the training image so its short side = L
  3. Sample a random 224 x 224 patch

SLIDE 89

Data Augmentation: Random Crops and Scales

Training: sample random crops / scales

ResNet:
  1. Pick a random L in the range [256, 480]
  2. Resize the training image so its short side = L
  3. Sample a random 224 x 224 patch

Testing: average over a fixed set of crops

ResNet:
  1. Resize the image at 5 scales: {224, 256, 384, 480, 640}
  2. For each size, use 10 224 x 224 crops: 4 corners + center, plus horizontal flips
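A sketch of the training-time sampling described above, using PIL (the helper name and exact resizing choices are illustrative):

```python
import random
from PIL import Image

def random_crop_and_scale(img: Image.Image) -> Image.Image:
    L = random.randint(256, 480)                             # 1. pick a random L in [256, 480]
    w, h = img.size
    scale = L / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)))   # 2. resize so the short side = L
    w, h = img.size
    x = random.randint(0, w - 224)                           # 3. sample a random 224 x 224 patch
    y = random.randint(0, h - 224)
    return img.crop((x, y, x + 224, y + 224))
```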

SLIDE 90

Data Augmentation: Color Jitter

Simple: randomize contrast and brightness

More complex:
  1. Apply PCA to all [R, G, B] pixels in the training set
  2. Sample a "color offset" along the principal component directions
  3. Add the offset to all pixels of a training image

(Used in AlexNet, ResNet, etc)
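A rough sketch of the PCA-based color offset (σ = 0.1 follows AlexNet's recipe; everything else here is illustrative):

```python
import numpy as np

def pca_color_jitter(image, eigvals, eigvecs, sigma=0.1):
    # image: [H, W, 3] float RGB; eigvals/eigvecs: PCA of all [R, G, B] training-set pixels
    alphas = np.random.normal(0.0, sigma, size=3)   # sample a "color offset"
    offset = eigvecs @ (alphas * eigvals)           # along the principal component directions
    return image + offset                           # add the offset to every pixel

# Computing the PCA once over the training set (sketch):
# pixels = train_images.reshape(-1, 3).astype(np.float32)
# eigvals, eigvecs = np.linalg.eigh(np.cov(pixels, rowvar=False))
```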

SLIDE 91

Data Augmentation: Get creative for your problem!

Random mixes / combinations of:

  • translation
  • rotation
  • stretching
  • shearing
  • lens distortions, … (go crazy)
SLIDE 92

Regularization: A common pattern

Training: add some randomness
Testing: marginalize over the randomness

Examples: Dropout, Batch Normalization, Data Augmentation

Wan et al, "Regularization of Neural Networks using DropConnect", ICML 2013

SLIDE 93

Regularization: DropConnect

Training: drop random connections between neurons (set weight = 0)
Testing: use all the connections

Examples so far: Dropout, Batch Normalization, Data Augmentation, DropConnect

Wan et al, "Regularization of Neural Networks using DropConnect", ICML 2013

SLIDE 94

Regularization: Fractional Pooling

Training: use randomized pooling regions
Testing: average predictions over different samples

Examples so far: Dropout, Batch Normalization, Data Augmentation, DropConnect, Fractional Max Pooling

Graham, "Fractional Max Pooling", arXiv 2014

SLIDE 95

Regularization: Stochastic Depth

Training: skip some residual blocks in ResNet
Testing: use the whole network

Examples so far: Dropout, Batch Normalization, Data Augmentation, DropConnect, Fractional Max Pooling, Stochastic Depth

Huang et al, "Deep Networks with Stochastic Depth", ECCV 2016

SLIDE 96

Regularization: Cutout

Training: set random image regions to zero
Testing: use the whole image

Works very well for small datasets like CIFAR; less common for large datasets like ImageNet.

Examples so far: Dropout, Batch Normalization, Data Augmentation, DropConnect, Fractional Max Pooling, Stochastic Depth, Cutout

DeVries and Taylor, "Improved Regularization of Convolutional Neural Networks with Cutout", arXiv 2017
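A minimal sketch of Cutout on a single [H, W, C] image (the square size is a hyperparameter; 8 here is arbitrary):

```python
import numpy as np

def cutout(image, size=8):
    H, W = image.shape[:2]
    cy, cx = np.random.randint(H), np.random.randint(W)   # random center
    y0, y1 = max(0, cy - size // 2), min(H, cy + size // 2)
    x0, x1 = max(0, cx - size // 2), min(W, cx + size // 2)
    out = image.copy()
    out[y0:y1, x0:x1] = 0                                  # zero out the region
    return out
```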

SLIDE 97

Regularization: Mixup

Training: train on random blends of images
Testing: use original images

Randomly blend the pixels of pairs of training images, e.g. 40% cat + 60% dog; the CNN is trained against the blended target label (cat: 0.4, dog: 0.6).

Sample the blend weight from a beta distribution Beta(a, b) with a = b ≈ 0, so blend weights are close to 0 or 1.

Examples so far: Dropout, Batch Normalization, Data Augmentation, DropConnect, Fractional Max Pooling, Stochastic Depth, Cutout, Mixup

Zhang et al, "mixup: Beyond Empirical Risk Minimization", ICLR 2018
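A minimal sketch of mixing one pair of examples; x1, x2 are images, y1, y2 are one-hot labels, and a = b = 0.2 is just an illustrative choice of the (near-zero) beta parameters:

```python
import numpy as np

def mixup(x1, y1, x2, y2, a=0.2, b=0.2):
    lam = np.random.beta(a, b)        # blend weight, usually close to 0 or 1 for small a, b
    x = lam * x1 + (1 - lam) * x2     # blended image
    y = lam * y1 + (1 - lam) * y2     # blended soft target, e.g. cat: 0.4, dog: 0.6
    return x, y
```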

SLIDE 98

Regularization: Mixup

Training: train on random blends of images
Testing: use original images

Randomly blend the pixels of pairs of training images, e.g. 40% cat + 60% dog; the CNN is trained against the blended target label (cat: 0.4, dog: 0.6).

Examples so far: Dropout, Batch Normalization, Data Augmentation, DropConnect, Fractional Max Pooling, Stochastic Depth, Cutout, Mixup

Zhang et al, "mixup: Beyond Empirical Risk Minimization", ICLR 2018

SLIDE 99

Regularization: Mixup

Training: train on random blends of images
Testing: use original images

Examples: Dropout, Batch Normalization, Data Augmentation, DropConnect, Fractional Max Pooling, Stochastic Depth, Cutout, Mixup

In practice:
  • Consider dropout for large fully-connected layers
  • Batch normalization and data augmentation are almost always a good idea
  • Try cutout and mixup, especially for small classification datasets

Zhang et al, "mixup: Beyond Empirical Risk Minimization", ICLR 2018

SLIDE 100

Summary

  1. One-time setup: activation functions, data preprocessing, weight initialization, regularization (today)
  2. Training dynamics: learning rate schedules; large-batch training; hyperparameter optimization (next time)
  3. After training: model ensembles, transfer learning (next time)

SLIDE 101

Next time: Training Neural Networks (Part 2)