Basics of DL
- Prof. Leal-Taixé and Prof. Niessner
1
What we assume you know:
– Linear Algebra & Programming!
– Basics from the Introduction to Deep Learning lecture
– PyTorch (you can also use TensorFlow)
– You have already trained networks and know
how to debug problems, observe training curves, and prepare training/validation/test data.
2
3
4
On CIFAR-10 On ImageNet
Credit: Li/Karpathy/Johnson
– 2-layers: $f = W_2 \max(0, W_1 x)$
– 3-layers: $f = W_3 \max(0, W_2 \max(0, W_1 x))$
– 4-layers: $f = W_4 \tanh(W_3 \max(0, W_2 \max(0, W_1 x)))$
– 5-layers: $f = W_5 \,\sigma(W_4 \tanh(W_3 \max(0, W_2 \max(0, W_1 x))))$
– … up to hundreds of layers
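As a quick illustration (not from the slides), here is a minimal NumPy sketch of these nested functions; the shapes 16384 → 1000 → 10 are borrowed from the dimensions used a few slides later, and all weight values are random placeholders:

```python
import numpy as np

def relu(z):
    # element-wise max(0, z)
    return np.maximum(0, z)

rng = np.random.default_rng(0)
x  = rng.standard_normal(16384)                  # a flattened 128 x 128 input
W1 = 0.01 * rng.standard_normal((1000, 16384))   # placeholder weights, layer 1
W2 = 0.01 * rng.standard_normal((10, 1000))      # placeholder weights, layer 2
W3 = 0.01 * rng.standard_normal((10, 10))        # placeholder weights, layer 3

f_2layer = W2 @ relu(W1 @ x)                     # f = W2 max(0, W1 x)
f_3layer = W3 @ relu(W2 @ relu(W1 @ x))          # f = W3 max(0, W2 max(0, W1 x))
print(f_2layer.shape, f_3layer.shape)            # (10,) (10,)
```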
5
6
2-layer network: $f = W_2 \max(0, W_1 x)$
– input $x$: 128 × 128 = 16384 values, hidden layer $h$: 1000 units, output scores $f$: 10 (weights $W_1$, $W_2$)

1-layer network: $f = W x$
– input $x$: 128 × 128 = 16384 values, output scores $f$: 10 (weights $W$)
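To make the size difference concrete, a tiny sketch counting the weight parameters of the two networks above (biases ignored; this is just arithmetic on the slide's numbers):

```python
# Weight counts for the two networks above (biases ignored)
in_dim, hidden, out_dim = 128 * 128, 1000, 10

params_1layer = in_dim * out_dim                     # W:  16384 x 10
params_2layer = in_dim * hidden + hidden * out_dim   # W1: 16384 x 1000, W2: 1000 x 10

print(params_1layer)   # 163840
print(params_2layer)   # 16394000
```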
7
Credit: Li/Karpathy/Johnson
8
What is the shape of this function?
9
Loss (Softmax, Hinge) Prediction
10
Evaluate the ground truth score for the image
– Softmax loss: keeps optimizing until the loss is zero
– Hinge loss: saturates whenever it has learned a class “well enough”
11
12
Forward
13
Saturated neurons kill the gradient flow
14
More on zero-mean data later
15
Zero-centered, but still saturates
LeCun 1991
16
Large and consistent gradients; does not saturate; fast convergence. What happens if a ReLU outputs zero? Dead ReLU!
17
Generalization
Linear regimes; does not die; does not saturate; but increases the number of parameters
18
19
A two-layer network with inputs $x_0, x_1, x_2$, hidden units $h_0, \dots, h_3$, outputs $y_0, y_1$ and targets $t_0, t_1$:
$y_j = A\left(b_{1,j} + \sum_k h_k \, w_{1,j,k}\right)$, $\quad h_k = A\left(b_{0,k} + \sum_i x_i \, w_{0,k,i}\right)$
Per-output loss: $L_j = (y_j - t_j)^2$
Gradient w.r.t. all weights and biases: $\nabla_{w,b} f_{x,t}(w) = \left[\frac{\partial f}{\partial w_{0,0,0}}, \dots, \frac{\partial f}{\partial w_{m,n,o}}, \dots, \frac{\partial f}{\partial b_{m,n}}\right]$
Just simple: $A(z) = \max(0, z)$
$\theta^{k+1} = \theta^k - \alpha \, \nabla_\theta L(\theta^k, x_{\{1..n\}}, y_{\{1..n\}})$, $\quad \nabla_\theta L = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta L_i$
Note the terminology: iteration vs epoch
20
$k$ now refers to the $k$-th iteration; $n$ is the number of training samples in the current minibatch; $\nabla_\theta L$ is the gradient for the $k$-th minibatch.
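A minimal sketch of one such minibatch update, assuming a hypothetical grad_fn that returns per-sample gradients (names are illustrative, not from the lecture code):

```python
def sgd_step(theta, grad_fn, x_batch, y_batch, lr=1e-3):
    """One vanilla minibatch gradient descent step: theta <- theta - lr * mean gradient."""
    grads = grad_fn(theta, x_batch, y_batch)   # hypothetical: per-sample gradients, shape (n, ...)
    grad = grads.mean(axis=0)                  # average over the n samples in the minibatch
    return theta - lr * grad
```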
$v^{k+1} = \beta \cdot v^k + \nabla_\theta L(\theta^k)$
$\theta^{k+1} = \theta^k - \alpha \cdot v^{k+1}$
Exponentially-weighted average of the gradient. Important: the velocity $v^k$ is vector-valued!
21
$\nabla_\theta L(\theta^k)$: gradient of the current minibatch; $v$: velocity; $\beta$: velocity accumulation rate (‘friction’, momentum); $\alpha$: learning rate; $\theta$: model parameters
$\theta^{k+1} = \theta^k - \alpha \cdot v^{k+1}$
22
The step will be largest when a sequence of gradients all point in the same direction
Hyperparameters are $\alpha$ and $\beta$; $\beta$ is often set to 0.9
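A minimal sketch of the momentum update above (illustrative names; inputs are arrays of the same shape):

```python
def sgd_momentum_step(theta, v, grad, lr=1e-3, beta=0.9):
    """SGD with momentum: the velocity v accumulates an exponentially-weighted
    average of past gradients and has the same shape as theta."""
    v = beta * v + grad        # velocity update
    theta = theta - lr * v     # parameter update
    return theta, v
```

Up to its dampening and Nesterov options, this is the update torch.optim.SGD applies when momentum=0.9 is passed.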
$s^{k+1} = \beta \cdot s^k + (1-\beta)\left[\nabla_\theta L \circ \nabla_\theta L\right]$
$\theta^{k+1} = \theta^k - \alpha \cdot \frac{\nabla_\theta L}{\sqrt{s^{k+1}} + \epsilon}$
Hyperparameters: $\alpha$, $\beta$, $\epsilon$
23
$\epsilon$ is typically $10^{-8}$; $\beta$ is often 0.9; $\circ$ denotes element-wise multiplication. The learning rate $\alpha$ needs tuning!
24
X-direction: small gradients. Y-direction: large gradients.
$s^{k+1} = \beta \cdot s^k + (1-\beta)\left[\nabla_\theta L \circ \nabla_\theta L\right]$
$\theta^{k+1} = \theta^k - \alpha \cdot \frac{\nabla_\theta L}{\sqrt{s^{k+1}} + \epsilon}$
We are dividing by the square root of the accumulated squared gradients, i.e. the (uncentered) variance of the gradients: directions with large gradients take smaller steps, so we can increase the learning rate!
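The same update as a short sketch (illustrative variable names):

```python
import numpy as np

def rmsprop_step(theta, s, grad, lr=1e-3, beta=0.9, eps=1e-8):
    """RMSProp: scale each parameter's step by a running average of its squared gradients."""
    s = beta * s + (1.0 - beta) * grad * grad        # element-wise squared gradients
    theta = theta - lr * grad / (np.sqrt(s) + eps)   # per-parameter adaptive step
    return theta, s
```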
Combines Momentum and RMSProp:
$m^{k+1} = \beta_1 \cdot m^k + (1-\beta_1)\,\nabla_\theta L(\theta^k)$
$v^{k+1} = \beta_2 \cdot v^k + (1-\beta_2)\left[\nabla_\theta L(\theta^k) \circ \nabla_\theta L(\theta^k)\right]$
$\theta^{k+1} = \theta^k - \alpha \cdot \frac{m^{k+1}}{\sqrt{v^{k+1}} + \epsilon}$
25
First moment: mean of the gradients. Second moment: (uncentered) variance of the gradients.
Combines Momentum and RMSProp
$m^{k+1} = \beta_1 \cdot m^k + (1-\beta_1)\,\nabla_\theta L(\theta^k)$
$v^{k+1} = \beta_2 \cdot v^k + (1-\beta_2)\left[\nabla_\theta L(\theta^k) \circ \nabla_\theta L(\theta^k)\right]$
$\theta^{k+1} = \theta^k - \alpha \cdot \frac{\hat{m}^{k+1}}{\sqrt{\hat{v}^{k+1}} + \epsilon}$
26
$m^{k+1}$ and $v^{k+1}$ are initialized with zero
Typically, bias-corrected moment updates are used: $\hat{m}^{k+1} = \frac{m^{k+1}}{1-\beta_1^{k+1}}$, $\quad \hat{v}^{k+1} = \frac{v^{k+1}}{1-\beta_2^{k+1}}$
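Putting both moments together, a minimal Adam sketch with the bias correction above (illustrative; in practice you would simply use torch.optim.Adam, whose defaults are lr=1e-3, betas=(0.9, 0.999), eps=1e-8):

```python
import numpy as np

def adam_step(theta, m, v, grad, k, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step; k is the 1-based iteration index used for bias correction."""
    m = beta1 * m + (1.0 - beta1) * grad              # first moment (mean of gradients)
    v = beta2 * v + (1.0 - beta2) * grad * grad       # second moment (squared gradients)
    m_hat = m / (1.0 - beta1 ** k)                    # bias correction
    v_hat = v / (1.0 - beta2 ** k)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```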
27
28
29
30
Figure extracted from Deep Learning by Adam Gibson, Josh Patterson, O'Reilly Media Inc., 2017
31
Source: http://srdas.github.io/DLBook/ImprovingModelGeneralization.html
32
Find your hyperparameters on the validation set. Typical split: 60% train, 20% validation, 20% test.
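A minimal sketch of such a split (the 60/20/20 ratio from the slide; function name is illustrative):

```python
import numpy as np

def split_dataset(num_samples, train=0.6, val=0.2, seed=0):
    """Shuffle indices and split them into 60% train / 20% validation / 20% test."""
    idx = np.random.default_rng(seed).permutation(num_samples)
    n_train = int(train * num_samples)
    n_val = int(val * num_samples)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = split_dataset(50000)
```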
33
34
35
36
Krizhevsky 2012
Training time is also a hyperparameter
37
Overfitting
38
Training Set 1 Training Set 2 Training Set 3
39
Srivastava 2014
Forward
40
41
Credit: Li/Karpathy/Johnson
42
Classic 3 × 3 filter kernels applied to the input:
– Edge detection: [[−1, −1, −1], [−1, 8, −1], [−1, −1, −1]]
– Sharpen: [[0, −1, 0], [−1, 5, −1], [0, −1, 0]]
– Box mean: (1/9) · [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
– Gaussian blur: (1/16) · [[1, 2, 1], [2, 4, 2], [1, 2, 1]]
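To see what these kernels do, a small sketch that applies a 3 × 3 kernel to a grayscale image by brute force (illustrative only; real code would use an optimized convolution routine):

```python
import numpy as np

def filter2d(img, kernel):
    """Apply a small filter to a grayscale image ('valid' output, no padding)."""
    kh, kw = kernel.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

edge = np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]], dtype=float)
blur = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]], dtype=float) / 16.0

img = np.random.rand(32, 32)          # stand-in grayscale image
print(filter2d(img, edge).shape)      # (30, 30)
```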
43
A 32 × 32 × 3 image (pixels $x$) is convolved with a 5 × 5 × 3 filter (weights $w$), producing one 28 × 28 activation map (also called a feature map).
Convolve: slide the filter over all spatial locations and compute one output value at each; without padding, there are 28 × 28 locations.
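The 28 comes from the usual rule (input − filter)/stride + 1 = (32 − 5)/1 + 1 = 28; a quick PyTorch check (assuming torch is available):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)                                    # one 32x32 RGB image (NCHW)
conv = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=5)   # one 5x5x3 filter
print(conv(x).shape)   # torch.Size([1, 1, 28, 28])  since (32 - 5)/1 + 1 = 28
```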
44
A 32 × 32 × 3 image convolved with five filters produces five 28 × 28 activation maps.
Convolve: let's apply five filters, each with different weights! This is a convolution “layer”.
45
A ConvNet is a concatenation of convolutional layers and activations
Input image 32 × 32 × 3 → Conv + ReLU (5 filters, 5 × 5 × 3) → 28 × 28 × 5 → Conv + ReLU (8 filters, 5 × 5 × 5) → 24 × 24 × 8 → Conv + ReLU (12 filters, 5 × 5 × 8) → 20 × 20 × 12
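The same stack as a hedged PyTorch sketch (filter counts and sizes taken from the slide; stride 1, no padding):

```python
import torch
import torch.nn as nn

# The three conv layers sketched above (stride 1, no padding)
convnet = nn.Sequential(
    nn.Conv2d(3, 5, kernel_size=5), nn.ReLU(),    # 32x32x3 -> 28x28x5
    nn.Conv2d(5, 8, kernel_size=5), nn.ReLU(),    # 28x28x5 -> 24x24x8
    nn.Conv2d(8, 12, kernel_size=5), nn.ReLU(),   # 24x24x8 -> 20x20x12
)
print(convnet(torch.randn(1, 3, 32, 32)).shape)   # torch.Size([1, 12, 20, 20])
```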
46
47
A single depth slice of the input is max pooled with 2 × 2 filters and stride 2: each value of the ‘pooled’ output is the maximum over one 2 × 2 block.
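A small sketch of this pooling operation on an arbitrary 4 × 4 depth slice (the input values here are made up for illustration):

```python
import torch
import torch.nn.functional as F

# Made-up 4x4 depth slice, shaped (batch, channels, height, width)
x = torch.tensor([[1., 1., 2., 4.],
                  [5., 6., 7., 8.],
                  [3., 2., 1., 0.],
                  [1., 2., 3., 4.]]).reshape(1, 1, 4, 4)

print(F.max_pool2d(x, kernel_size=2, stride=2))
# tensor([[[[6., 8.],
#           [3., 4.]]]])
```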
48
49
60k parameters
50
[Krizhevsky et al. 2012]
51
[Simonyan and Zisserman 2014]
52
All convolutions: 3 × 3, stride 1, ‘same’ padding; all max pooling: 2 × 2, stride 2. Still very common: VGG-16
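As an illustration of that design rule, a sketch of a generic VGG-style block (not the exact VGG-16 definition):

```python
import torch.nn as nn

def vgg_block(in_channels, out_channels, num_convs=2):
    """A VGG-style block: 3x3 'same' convolutions with ReLU, then 2x2 max pooling."""
    layers = []
    for i in range(num_convs):
        layers += [nn.Conv2d(in_channels if i == 0 else out_channels, out_channels,
                             kernel_size=3, stride=1, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)
```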
53
[He et al. 2015]
Xavier/2 initialization by He et al.
54
[He et al. 2015]
performance starts to degrade
the optimizer cannot properly train the network
55
performance starts to degrade
56
57
58
[Szegedy et al. 2014]
Inception block
59
60
61
Inputs, hidden states, outputs: the hidden state will have its own internal dynamics. More expressive model!
62
Hidden state and output: the same parameters are used for each time step = generalization!
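A minimal sketch of one vanilla RNN time step, just to make the parameter sharing explicit (a common textbook formulation; symbols are illustrative, not the lecture's exact notation):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """One vanilla RNN time step; the same weight matrices are reused at every step."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)   # hidden state with its own dynamics
    y_t = W_hy @ h_t + b_y                            # output at time t
    return h_t, y_t
```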
63
64
excitation backprop
Lecture 7: GANs 1: Generative models, GANs.
for image, RNN for text) /
65
66
(needs to reach 100% because it just memorizes the input)
– It’s irrespective of input !!!
samples
– It’s now conditioned on input data
67
– 5, 10, 100, 1000… – At some point, we should see generalization
number of samples?
68
– Get precise timings! – If an iteration takes > 500 ms, things get dicey…
– Speed up data loading: smaller resolutions, compression, train from SSD – e.g., network training is good idea – Speed up backprop:
pattern? How long till convergence?
69
train for two weeks and we see where we stand.” [because we desperately need those 2%...]
divide #layers you started with by 5.
70
– Evaluation needs to be consistent! – Numbers need to be comparable
– “I’ve added 5 more layers and doubled the training set size, and now I also trained 5 days longer” – it’s better, but WHY?
71
ONLY THINK ABOUT THIS ONCE YOUR TRAINING LOSS GOES DOWN AND YOU CAN OVERFIT! Typically try this order:
72
PROCEED ONLY IF YOU GENERALIZE AND YOU ADDRESSED OVERFITTING ISSUES!
more data!
InceptionNet architectures often perform better (e.g., InceptionNet v4, XceptionNet, etc.)
higher weight)
73
very unlikely – unless you have a bug )
74
1) You didn't try to overfit a single batch first.
2) You forgot to toggle train/eval mode for the net.
3) You forgot to .zero_grad() (in PyTorch) before .backward().
4) You passed softmaxed outputs to a loss that expects raw logits.
5) You didn't use bias=False for your Linear/Conv2d layer when using BatchNorm, or conversely forgot to include it for the output layer.
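A hedged sketch of point 1), the single-batch overfitting sanity check (function and variable names are illustrative):

```python
import torch

def overfit_single_batch(model, x, y, loss_fn, steps=200, lr=1e-3):
    """Sanity check 1): a healthy model/pipeline should drive the loss on ONE batch to ~0."""
    model.train()                          # 2) remember the train/eval toggle
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()                    # 3) don't forget zero_grad before backward
        loss = loss_fn(model(x), y)        # 4) pass raw logits to the loss, not softmax outputs
        loss.backward()
        opt.step()
    return loss.item()
```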
75
Credit: A. Karpathy
– Start actively discussing -> reach out to us if you have questions!
76