

SLIDE 1

DATA ANALYTICS USING DEEP LEARNING

GT 8803 // FALL 2019 // JOY ARULRAJ

LECTURE #12: TRAINING NEURAL NETWORKS (PT 1)

SLIDE 2

administrivia

  • Reminders

– Integration with Eva
– Code reviews
– Each team must send Pull Requests to Eva


SLIDE 3

Where we are now...


Hardware + Software: PyTorch, TensorFlow

SLIDE 4

OVERVIEW

  • One time setup
– Activation Functions, Preprocessing, Weight Initialization, Regularization, Gradient Checking
  • Training dynamics
– Babysitting the Learning Process, Parameter updates, Hyperparameter Optimization
  • Evaluation
– Model ensembles, Test-time augmentation


SLIDE 5

TODAY’s AGENDA

  • Training Neural Networks

– Activation Functions
– Data Preprocessing
– Weight Initialization
– Batch Normalization


SLIDE 6

ACTIVATION FUNCTIONS


SLIDE 7

Activation Functions

SLIDE 8

Sigmoid, tanh, ReLU, Leaky ReLU, Maxout, ELU

Activation Functions
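For concreteness, a minimal NumPy sketch of the activation functions listed above (the maxout helper and its two weight/bias branches are illustrative, not taken from the slides):

    import numpy as np

    def sigmoid(x):                return 1.0 / (1.0 + np.exp(-x))        # squashes to (0, 1)
    def tanh(x):                   return np.tanh(x)                      # squashes to (-1, 1)
    def relu(x):                   return np.maximum(0.0, x)              # f(x) = max(0, x)
    def leaky_relu(x, alpha=0.01): return np.where(x > 0, x, alpha * x)   # small slope for x < 0
    def elu(x, alpha=1.0):         return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))
    def maxout(x, W1, b1, W2, b2): return np.maximum(x @ W1 + b1, x @ W2 + b2)  # max of two affine maps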

SLIDE 9

Activation Functions

Sigmoid

  • Squashes numbers to range [0,1]
  • Historically popular since they have a nice interpretation as a saturating “firing rate” of a neuron

SLIDE 10

Sigmoid

  • Squashes numbers to range [0,1]
  • Historically popular since they have a nice interpretation as a saturating “firing rate” of a neuron

3 problems:

1. Saturated neurons “kill” the gradients

Activation Functions

SLIDE 11

sigmoid gate

What happens when x = -10? What happens when x = 0? What happens when x = 10?
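A quick numerical check (a sketch, assuming the standard sigmoid 1/(1+e^-x)): the local gradient s*(1-s) is essentially zero at x = -10 and x = 10 and peaks at 0.25 at x = 0, so saturated inputs pass back almost no gradient.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    for x in [-10.0, 0.0, 10.0]:
        s = sigmoid(x)
        local_grad = s * (1.0 - s)     # d(sigmoid)/dx
        print(x, s, local_grad)        # ~4.5e-5 at x = -10 and x = 10, 0.25 at x = 0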

SLIDE 12

Sigmoid

  • Squashes numbers to range [0,1]
  • Historically popular since they have a nice interpretation as a saturating “firing rate” of a neuron

3 problems:

1. Saturated neurons “kill” the gradients
2. Sigmoid outputs are not zero-centered

Activation Functions

SLIDE 13

Consider what happens when the input to a neuron is always positive... What can we say about the gradients on w?

SLIDE 14

Consider what happens when the input to a neuron is always positive... What can we say about the gradients on w? Always all positive or all negative :(

(Figure: hypothetical optimal w vector; zig-zag update path; allowed gradient update directions)

SLIDE 15

Consider what happens when the input to a neuron is always positive... What can we say about the gradients on w? Always all positive or all negative :( (For a single element! Minibatches help)

(Figure: hypothetical optimal w vector; zig-zag update path; allowed gradient update directions)

SLIDE 16

Activation Functions

Sigmoid

  • Squashes numbers to range [0,1]
  • Historically popular since they have a nice interpretation as a saturating “firing rate” of a neuron

3 problems:

1. Saturated neurons “kill” the gradients
2. Sigmoid outputs are not zero-centered
3. exp() is a bit compute expensive

SLIDE 17

Activation Functions

tanh(x)

  • Squashes numbers to range [-1,1]
  • zero centered (nice)
  • still kills gradients when saturated :(

[LeCun et al., 1991]

SLIDE 18

Activation Functions

ReLU (Rectified Linear Unit)

  • Computes f(x) = max(0,x)
  • Does not saturate (in +region)
  • Very computationally efficient
  • Converges much faster than sigmoid/tanh in practice (e.g. 6x)

[Krizhevsky et al., 2012]

SLIDE 19

Activation Functions

ReLU (Rectified Linear Unit)

  • Computes f(x) = max(0,x)
  • Does not saturate (in +region)
  • Very computationally efficient
  • Converges much faster than sigmoid/tanh in practice (e.g. 6x)
  • Not zero-centered output

[Krizhevsky et al., 2012]

SLIDE 20

Activation Functions

ReLU (Rectified Linear Unit)

  • Computes f(x) = max(0,x)
  • Does not saturate (in +region)
  • Very computationally efficient
  • Converges much faster than sigmoid/tanh in practice (e.g. 6x)
  • Not zero-centered output

[Krizhevsky et al., 2012]

SLIDE 21

Activation Functions

ReLU (Rectified Linear Unit)

  • Computes f(x) = max(0,x)
  • Does not saturate (in +region)
  • Very computationally efficient
  • Converges much faster than sigmoid/tanh in practice (e.g. 6x)
  • Not zero-centered output
  • An annoyance:

hint: what is the gradient when x < 0?

SLIDE 22

ReLU gate

What happens when x = -10? What happens when x = 0? What happens when x = 10?

SLIDE 23

(Figure: data cloud with an active ReLU and a dead ReLU.) A dead ReLU will never activate => never update

SLIDE 24

(Figure: data cloud with an active ReLU and a dead ReLU.) A dead ReLU will never activate => never update => people like to initialize ReLU neurons with slightly positive biases (e.g. 0.01)
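A minimal sketch of that bias trick (the layer sizes and batch size here are just illustrative):

    import numpy as np

    D_in, D_out = 4096, 4096
    W = 0.01 * np.random.randn(D_in, D_out)   # small random weights
    b = 0.01 * np.ones(D_out)                 # slightly positive biases: ReLUs start in the active region
    x = np.random.randn(128, D_in)
    h = np.maximum(0.0, x @ W + b)            # fewer units start out "dead" (always zero)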

SLIDE 25

Activation Functions

Leaky ReLU

  • Does not saturate
  • Computationally efficient
  • Converges much faster than sigmoid/tanh in practice! (e.g. 6x)
  • Will not “die”.

[Maas et al., 2013] [He et al., 2015]

SLIDE 26

Activation Functions

Leaky ReLU

  • Does not saturate
  • Computationally efficient
  • Converges much faster than sigmoid/tanh in practice! (e.g. 6x)
  • Will not “die”.

Parametric Rectifier (PReLU)

backprop into α (parameter)

[Maas et al., 2013] [He et al., 2015]
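A sketch of the difference (the backward-pass helper is a hand-derived illustration, not library code): Leaky ReLU fixes the negative slope α, while PReLU treats α as a parameter and backprops into it.

    import numpy as np

    def leaky_relu(x, alpha=0.01):            # alpha fixed by hand
        return np.where(x > 0, x, alpha * x)

    def prelu_forward(x, alpha):              # alpha is learnable
        return np.where(x > 0, x, alpha * x)

    def prelu_backward(x, alpha, dout):       # gradients w.r.t. x and alpha
        dx = np.where(x > 0, dout, alpha * dout)
        dalpha = np.sum(dout * np.where(x > 0, 0.0, x))
        return dx, dalpha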

SLIDE 27

Activation Functions

Exponential Linear Units (ELU)

  • All benefits of ReLU
  • Closer to zero mean outputs
  • Negative saturation regime compared with Leaky ReLU adds some robustness to noise
  • Computation requires exp()

[Clevert et al., 2015]

SLIDE 28

Maxout “Neuron”

  • Does not have the basic form of dot product -> nonlinearity

  • Generalizes ReLU and Leaky ReLU
  • Linear Regime! Does not saturate! Does not die!

Problem: doubles the number of parameters/neuron :(

[Goodfellow et al., 2013]

SLIDE 29

TLDR: In practice:

  • Use ReLU. Be careful with your learning rates
  • Try out Leaky ReLU / Maxout / ELU
  • Try out tanh but don’t expect much
  • Don’t use sigmoid

SLIDE 30

DATA PREPROCESSING


SLIDE 31

DATA PREPROCESSING

(Assume X [NxD] is data matrix, each example in a row)

SLIDE 32

Remember: Consider what happens when the input to a neuron is always positive... What can we say about the gradients on w? Always all positive or all negative :( (this is also why you want zero-mean data!)

(Figure: hypothetical optimal w vector; zig-zag update path; allowed gradient update directions)

SLIDE 33

DATA PREPROCESSING

(Assume X [NxD] is data matrix, each example in a row)

SLIDE 34

DATA PREPROCESSING

(decorrelated data: diagonal covariance matrix) (whitened data: covariance matrix is the identity matrix)

In practice, you may also see PCA and Whitening of the data
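In code, these steps might look like the following sketch (run on a random stand-in matrix; the 1e-5 fudge factor in the whitening step is a common but arbitrary choice):

    import numpy as np

    X = np.random.randn(1000, 50)              # stand-in data matrix, one example per row
    X = X - np.mean(X, axis=0)                 # zero-center each dimension
    X_norm = X / np.std(X, axis=0)             # normalize to unit variance per dimension

    cov = (X.T @ X) / X.shape[0]               # covariance matrix, D x D
    U, S, _ = np.linalg.svd(cov)
    X_decorr = X @ U                           # decorrelated: diagonal covariance
    X_white = X_decorr / np.sqrt(S + 1e-5)     # whitened: covariance ~ identity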

SLIDE 35

DATA PREPROCESSING

Before normalization: classification loss very sensitive to changes in weight matrix; hard to optimize

After normalization: less sensitive to small changes in weights; easier to optimize

SLIDE 36

TLDR: In practice for Images: center only

  • Subtract the mean image (e.g. AlexNet) (mean image = [32,32,3] array)
  • Subtract per-channel mean (e.g. VGGNet) (mean along each channel = 3 numbers)
  • Subtract per-channel mean and divide by per-channel std (e.g. ResNet) (mean along each channel = 3 numbers)

e.g. consider CIFAR-10 example with [32,32,3] images

Not common to do PCA or whitening
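The three image options above, sketched on a stand-in CIFAR-10-shaped array (the dataset size and random values are assumptions for illustration):

    import numpy as np

    X = np.random.rand(50000, 32, 32, 3)           # stand-in for CIFAR-10 training images
    mean_image = X.mean(axis=0)                    # [32,32,3] array (AlexNet-style)
    channel_mean = X.mean(axis=(0, 1, 2))          # 3 numbers (VGGNet-style)
    channel_std = X.std(axis=(0, 1, 2))            # 3 numbers

    X_alexnet = X - mean_image
    X_vggnet = X - channel_mean
    X_resnet = (X - channel_mean) / channel_std    # ResNet-style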

SLIDE 37

WEIGHT INITIALIZATION


SLIDE 38

Q: what happens when W=constant init is used?

SLIDE 39

First idea: Small random numbers (gaussian with zero mean and 1e-2 standard deviation)

SLIDE 40

First idea: Small random numbers (gaussian with zero mean and 1e-2 standard deviation) Works ~okay for small networks, but problems with deeper networks.

SLIDE 41

Weight Initialization: Activation statistics

Forward pass for a 6-layer net with hidden size 4096

SLIDE 42

Weight Initialization: Activation statistics

All activations tend to zero for deeper network layers.
Q: What do the gradients dL/dW look like?

Forward pass for a 6-layer net with hidden size 4096

SLIDE 43

Weight Initialization: Activation statistics

All activations tend to zero for deeper network layers.
Q: What do the gradients dL/dW look like?
A: All zero, no learning =(

Forward pass for a 6-layer net with hidden size 4096
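The experiment behind these plots is easy to reproduce; a rough sketch (the batch size and tanh nonlinearity are assumptions matching the setup shown on the slides):

    import numpy as np

    dims = [4096] * 7                              # 6 layers with hidden size 4096
    x = np.random.randn(16, dims[0])
    for Din, Dout in zip(dims[:-1], dims[1:]):
        W = 0.01 * np.random.randn(Din, Dout)      # small random init (std = 0.01)
        x = np.tanh(x @ W)
        print(x.std())                             # std of activations shrinks toward 0 layer by layer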

SLIDE 44

Weight Initialization: Activation statistics

Increase std of initial weights from 0.01 to 0.05

SLIDE 45

Weight Initialization: Activation statistics

All activations saturate.
Q: What do the gradients look like?

Increase std of initial weights from 0.01 to 0.05

SLIDE 46

Weight Initialization: Activation statistics

All activations saturate.
Q: What do the gradients look like?
A: Local gradients all zero, no learning =(

Increase std of initial weights from 0.01 to 0.05

SLIDE 47

Weight Initialization: “XAVIER” Initialization

“Xavier” initialization: std = 1/sqrt(Din)

SLIDE 48

Weight Initialization: “XAVIER” Initialization

“Xavier” initialization: std = 1/sqrt(Din)

Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTATS 2010

“Just right”: Activations are nicely scaled for all layers!

SLIDE 49

Weight Initialization: “XAVIER” Initialization

“Xavier” initialization: std = 1/sqrt(Din)

Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTATS 2010

“Just right”: Activations are nicely scaled for all layers!

For conv layers, Din is kernel_size^2 * input_channels

SLIDE 50

Weight Initialization: “XAVIER” Initialization

“Xavier” initialization: std = 1/sqrt(Din)

“Just right”: Activations are nicely scaled for all layers!

For conv layers, Din is kernel_size^2 * input_channels

Derivation: let y = Wx and h = f(y). Then
Var(y_i) = Din * Var(x_i * w_i)                               [assume x, w are iid]
         = Din * (E[x_i^2] * E[w_i^2] - E[x_i]^2 * E[w_i]^2)  [assume x, w independent]
         = Din * Var(x_i) * Var(w_i)                          [assume x, w are zero-mean]
If Var(w_i) = 1/Din, then Var(y_i) = Var(x_i).
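Swapping the Xavier scale into the same toy experiment (a sketch, same assumed setup as before) keeps the activation statistics roughly constant across layers:

    import numpy as np

    dims = [4096] * 7
    x = np.random.randn(16, dims[0])
    for Din, Dout in zip(dims[:-1], dims[1:]):
        W = np.random.randn(Din, Dout) / np.sqrt(Din)   # "Xavier": std = 1/sqrt(Din)
        x = np.tanh(x @ W)
        print(x.std())                                  # roughly constant across layers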

SLIDE 51

Weight Initialization: WHAT ABOUT RELU?

Change from tanh to ReLU

SLIDE 52

Weight Initialization: WHAT ABOUT RELU?

Change from tanh to ReLU

Xavier assumes a zero-centered activation function.
Activations collapse to zero again, no learning =(

SLIDE 53

Weight Initialization: KAIMING/MSRA INITIALIZATION

ReLU correction: std = sqrt(2 / Din)
“Just right”: Activations are nicely scaled for all layers!

He et al, “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”, ICCV 2015
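The corresponding sketch for ReLU networks with the Kaiming/MSRA scale (same assumed toy setup as the earlier experiments):

    import numpy as np

    dims = [4096] * 7
    x = np.random.randn(16, dims[0])
    for Din, Dout in zip(dims[:-1], dims[1:]):
        W = np.random.randn(Din, Dout) * np.sqrt(2.0 / Din)   # Kaiming/MSRA: std = sqrt(2 / Din)
        x = np.maximum(0.0, x @ W)                            # ReLU
        print(x.std())                                        # activations stay reasonably scaled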

SLIDE 54

PROPER INITIALIZATION IS AN ACTIVE AREA OF RESEARCH…

  • Understanding the difficulty of training deep feedforward neural networks by Glorot and Bengio, 2010
  • Exact solutions to the nonlinear dynamics of learning in deep linear neural networks by Saxe et al, 2013
  • Random walk initialization for training very deep feedforward networks by Sussillo and Abbott, 2014
  • Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification by He et al., 2015
  • Data-dependent Initializations of Convolutional Neural Networks by Krähenbühl et al., 2015
  • All you need is a good init, Mishkin and Matas, 2015
  • Fixup Initialization: Residual Learning Without Normalization, Zhang et al, 2019
  • The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks, Frankle and Carbin, 2019

SLIDE 55

BATCH NORMALIZATION


SLIDE 56

BATCH NORMALIZATION

“you want zero-mean unit-variance activations? just make them so.”

Consider a batch of activations at some layer. To make each dimension zero-mean unit-variance, apply

x_hat = (x - E[x]) / sqrt(Var[x])

This is a vanilla differentiable function...

[Ioffe and Szegedy, 2015]

SLIDE 57

BATCH NORMALIZATION

Input: x, shape N x D
Per-channel mean, shape is D
Per-channel var, shape is D
Normalized x, shape is N x D

[Ioffe and Szegedy, 2015]

SLIDE 58

BATCH NORMALIZATION

Input: x, shape N x D
Per-channel mean, shape is D
Per-channel var, shape is D
Normalized x, shape is N x D

Problem: What if zero-mean, unit variance is too hard of a constraint?

[Ioffe and Szegedy, 2015]

SLIDE 59

BATCH NORMALIZATION

Input: x, shape N x D
Per-channel mean, shape is D
Per-channel var, shape is D
Normalized x, shape is N x D
Output, shape is N x D

Learnable scale and shift parameters: γ, β (each shape D)
Learning γ = sqrt(Var[x]), β = E[x] will recover the identity function!

[Ioffe and Szegedy, 2015]
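At training time the whole layer is a few lines; a sketch (eps is the usual small constant for numerical stability, its value chosen here for illustration):

    import numpy as np

    def batchnorm_train(x, gamma, beta, eps=1e-5):   # x: (N, D); gamma, beta: (D,)
        mu = x.mean(axis=0)                          # per-channel mean, shape (D,)
        var = x.var(axis=0)                          # per-channel variance, shape (D,)
        x_hat = (x - mu) / np.sqrt(var + eps)        # normalized x, shape (N, D)
        return gamma * x_hat + beta                  # learnable scale and shift, shape (N, D)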

SLIDE 60

BATCH NORMALIZATION: TEST TIME

Input: x, shape N x D
Per-channel mean, shape is D
Per-channel var, shape is D
Normalized x, shape is N x D
Output, shape is N x D

Learnable scale and shift parameters: γ, β
Learning γ = sqrt(Var[x]), β = E[x] will recover the identity function!

Estimates depend on minibatch; can’t do this at test-time!

SLIDE 61

BATCH NORMALIZATION: TEST TIME

Input: x, shape N x D
Per-channel mean, shape is D: (running) average of values seen during training
Per-channel var, shape is D: (running) average of values seen during training
Normalized x, shape is N x D
Output, shape is N x D

Learnable scale and shift parameters: γ, β

During testing batchnorm becomes a linear operator! Can be fused with the previous fully-connected or conv layer
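A sketch of how the train/test difference is usually handled (the momentum value is an assumption; frameworks differ in how they track the running statistics):

    import numpy as np

    def batchnorm(x, gamma, beta, running_mu, running_var, train, momentum=0.9, eps=1e-5):
        if train:
            mu, var = x.mean(axis=0), x.var(axis=0)
            running_mu = momentum * running_mu + (1 - momentum) * mu     # track statistics for test time
            running_var = momentum * running_var + (1 - momentum) * var
        else:
            mu, var = running_mu, running_var      # fixed estimates: batchnorm is now a linear operator
        x_hat = (x - mu) / np.sqrt(var + eps)
        return gamma * x_hat + beta, running_mu, running_var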

SLIDE 62

BATCH NORMALIZATION FOR CONVNETS

Batch Normalization for fully-connected networks:
  x: N × D
  μ, σ: 1 × D
  γ, β: 1 × D
  y = γ(x - μ)/σ + β      (normalize over the batch dimension N)

Batch Normalization for convolutional networks (Spatial Batchnorm, BatchNorm2D):
  x: N × C × H × W
  μ, σ: 1 × C × 1 × 1
  γ, β: 1 × C × 1 × 1
  y = γ(x - μ)/σ + β      (normalize over N, H, W for each channel)
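A sketch of the spatial version: the only change is which axes the statistics are taken over (per channel, across the batch and spatial dimensions).

    import numpy as np

    def spatial_batchnorm_train(x, gamma, beta, eps=1e-5):   # x: (N, C, H, W); gamma, beta: (1, C, 1, 1)
        mu = x.mean(axis=(0, 2, 3), keepdims=True)           # per-channel mean, shape (1, C, 1, 1)
        var = x.var(axis=(0, 2, 3), keepdims=True)           # per-channel variance, shape (1, C, 1, 1)
        return gamma * (x - mu) / np.sqrt(var + eps) + beta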

SLIDE 63

BATCH NORMALIZATION

FC -> BN -> tanh -> FC -> BN -> tanh -> ...

Usually inserted after Fully Connected or Convolutional layers, and before nonlinearity.

[Ioffe and Szegedy, 2015]

SLIDE 64

BATCH NORMALIZATION

FC -> BN -> tanh -> FC -> BN -> tanh -> ...

[Ioffe and Szegedy, 2015]

  • Makes deep networks much easier to train!
  • Improves gradient flow
  • Allows higher learning rates, faster convergence
  • Networks become more robust to initialization
  • Acts as regularization during training
  • Zero overhead at test-time: can be fused with conv!
  • Behaves differently during training and testing: this is a very common source of bugs!

SLIDE 65

LAYER NORMALIZATION

Batch Normalization for fully-connected networks:
  x: N × D
  μ, σ: 1 × D
  γ, β: 1 × D
  y = γ(x - μ)/σ + β

Layer Normalization for fully-connected networks:
  x: N × D
  μ, σ: N × 1
  γ, β: 1 × D
  y = γ(x - μ)/σ + β

Ba, Kiros, and Hinton, “Layer Normalization”, arXiv 2016

Same behavior at train and test! Can be used in recurrent networks.
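In a sketch, the only difference from batchnorm is the axis the statistics are computed over, which is why layer norm behaves the same at train and test (no dependence on the rest of the batch):

    import numpy as np

    def layernorm(x, gamma, beta, eps=1e-5):       # x: (N, D); gamma, beta: (D,)
        mu = x.mean(axis=1, keepdims=True)         # per-example mean, shape (N, 1)
        var = x.var(axis=1, keepdims=True)         # per-example variance, shape (N, 1)
        return gamma * (x - mu) / np.sqrt(var + eps) + beta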

SLIDE 66

INSTANCE NORMALIZATION

Batch Normalization for convolutional networks:
  x: N × C × H × W
  μ, σ: 1 × C × 1 × 1
  γ, β: 1 × C × 1 × 1
  y = γ(x - μ)/σ + β

Instance Normalization for convolutional networks:
  x: N × C × H × W
  μ, σ: N × C × 1 × 1
  γ, β: 1 × C × 1 × 1
  y = γ(x - μ)/σ + β

Same behavior at train / test!

Ulyanov et al, Improved Texture Networks: Maximizing Quality and Diversity in Feed-forward Stylization and Texture Synthesis, CVPR 2017

SLIDE 67

COMPARISON OF NORMALIZATION LAYERS

Wu and He, “Group Normalization”, ECCV 2018

SLIDE 68

GROUP NORMALIZATION

Wu and He, “Group Normalization”, ECCV 2018

SLIDE 69

SUMMARY

We looked in detail at:

  • Activation Functions (use ReLU)
  • Data Preprocessing (images: subtract mean)
  • Weight Initialization (use Xavier/He init)
  • Batch Normalization (use)


(TLDRs in parentheses)

SLIDE 70

NEXT LECTURE

  • Training Neural Networks (Part II)

– Parameter update schemes
– Learning rate schedules
– Gradient checking
– Regularization (Dropout etc.)
– Babysitting learning
– Hyperparameter search
– Evaluation (Ensembles etc.)
– Transfer learning / fine-tuning
