Scaling Optimization - I2DL: Prof. Niessner, Prof. Leal-Taixé - PowerPoint PPT Presentation



slide-1
SLIDE 1

Scaling Optimization

1 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-2
SLIDE 2

Lecture 4 Recap

2 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-3
SLIDE 3

Neural Network

Source: http://cs231n.github.io/neural-networks-1/

3 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-4
SLIDE 4

Neural Network

Depth Width

4

Input Layer Hidden Layer 1 Hidden Layer 2 Hidden Layer 3 Output Layer

I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-5
SLIDE 5

Compute Graphs → Neural Networks

Compute-graph diagram: inputs $x_0, x_1$ (input layer), weights $w_0, w_1$ (the unknowns!), multiply and add nodes, a ReLU activation $\max(0, x)$, and an L2 loss/cost against the output/target (e.g., class label / regression target). We want to compute gradients w.r.t. all weights $\mathbf{W}$.

(btw. I'm not arguing this is the right choice here)

5 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-6
SLIDE 6

Compute Graphs → Neural Networks

Compute-graph diagram with multiple outputs: inputs $x_0, x_1$ (input layer), weighted sums with weights $w_{i,j}$, predictions $\hat{y}_0, \hat{y}_1, \hat{y}_2$ (output layer), each compared with its target $y_i$ through its own loss/cost term. We want to compute gradients w.r.t. all weights $\mathbf{W}$.

6 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-7
SLIDE 7

Compute Graphs → Neural Networks

Prediction per output (activation function $A$, bias $b_i$): $\hat{y}_i = A\bigl(b_i + \sum_k x_k\, w_{i,k}\bigr)$

$L$: sum over the loss per sample, e.g. L2 loss → simply sum up the squares: $L = \sum_i L_i$, with $L_i = (\hat{y}_i - y_i)^2$

Goal: we want to compute gradients of the loss function $L$ w.r.t. all weights $w$ AND all biases $b$:

$\dfrac{\partial L}{\partial w_{i,k}} = \dfrac{\partial L}{\partial \hat{y}_i} \cdot \dfrac{\partial \hat{y}_i}{\partial w_{i,k}}$ ⟶ use the chain rule to compute the partials

7 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-8
SLIDE 8

Summary

  • We have

– (Directional) compute graph – Structure graph into layers – Compute partial derivatives w.r.t. weights (unknowns)

  • Next

– Find weights based on gradients

Gradient step: $W' = W - \alpha\, \nabla_W f_{x,y}(W)$, with $\nabla_W f_{x,y}(W) = \Bigl(\dfrac{\partial f}{\partial w_{0,0,0}}, \;\dots\;, \dfrac{\partial f}{\partial w_{l,m,n}}, \;\dots\;, \dfrac{\partial f}{\partial b_{l,m}}\Bigr)$

8 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-9
SLIDE 9

Optimization

9 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-10
SLIDE 10

Gradient Descent

$x^* = \arg\min_x f(x)$

10

Optimum

Initialization

I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-11
SLIDE 11

Gradient Descent

Follow the slope of the derivative:

$\dfrac{df(x)}{dx} = \lim_{h \to 0} \dfrac{f(x+h) - f(x)}{h}$

11

$x^* = \arg\min_x f(x)$

Initialization

Optimum

I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-12
SLIDE 12

Gradient Descent

• From derivative to gradient: $\dfrac{df(x)}{dx} \;\rightarrow\; \nabla_x f(x)$ (direction of greatest increase of the function)
• Gradient steps in the direction of the negative gradient:

$x' = x - \alpha \nabla_x f(x)$, where $\alpha$ is the learning rate (a code sketch follows after this slide)

12 I2DL: Prof. Niessner, Prof. Leal-Taixé
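As referenced above, a minimal sketch of this update rule in Python/NumPy. The toy objective f and its analytic derivative grad_f are illustrative assumptions (they do not appear in the slides); the learning rate alpha is the only hyperparameter:

```python
import numpy as np

def f(x):
    return (x - 3.0) ** 2        # toy convex objective (illustrative assumption)

def grad_f(x):
    return 2.0 * (x - 3.0)       # its analytic derivative df/dx

x = np.float64(0.0)              # initialization
alpha = 0.1                      # learning rate: too small -> slow, too large -> divergence
for step in range(100):
    x = x - alpha * grad_f(x)    # gradient step: x' = x - alpha * df/dx
print(x)                         # approaches the optimum x* = 3
```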

slide-13
SLIDE 13

Gradient Descent

• From derivative to gradient: $\dfrac{df(x)}{dx} \;\rightarrow\; \nabla_x f(x)$ (direction of greatest increase of the function)
• Gradient steps in the direction of the negative gradient:

$x' = x - \alpha \nabla_x f(x)$

13

SMALL learning rate: $x' = x - \alpha \nabla_x f(x)$

I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-14
SLIDE 14

Gradient Descent

• From derivative to gradient: $\dfrac{df(x)}{dx} \;\rightarrow\; \nabla_x f(x)$ (direction of greatest increase of the function)
• Gradient steps in the direction of the negative gradient:

$x' = x - \alpha \nabla_x f(x)$

14

LARGE learning rate: $x' = x - \alpha \nabla_x f(x)$

I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-15
SLIDE 15

Gradient Descent

Optimum

Not guaranteed to reach the global optimum

$\mathbf{x}^* = \arg\min_{\mathbf{x}} f(\mathbf{x})$

Initialization

What is the gradient when we reach this point?

15 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-16
SLIDE 16

Convergence of Gradient Descent

  • Convex function: all local minima are global minima

If line/plane segment between any two points lies above or on the graph

Source: https://en.wikipedia.org/wiki/Convex_function#/media/File:ConvexFunction.svg

16 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-17
SLIDE 17

Convergence of Gradient Descent

  • Neural networks are non-convex

– many (different) local minima – no (practical) way to say which is globally optimal

Source: Li, Qi. (2006). Challenging Registration of Geologic Image Data

17 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-18
SLIDE 18

Convergence of Gradient Descent

Source: https://builtin.com/data-science/gradient-descent

18 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-19
SLIDE 19

Convergence of Gradient Descent

Source: A. Geron

19 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-20
SLIDE 20

Gradient Descent: Multiple Dimensions

Various ways to visualize…

Source: builtin.com/data-science/gradient-descent

20 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-21
SLIDE 21

Gradient Descent: Multiple Dimensions

Source: http://blog.datumbox.com/wp-content/uploads/2013/10/gradient-descent.png

21 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-22
SLIDE 22

Gradient Descent for Neural Networks

Two-layer network with inputs $x_0, x_1, x_2$, hidden units $h_0, \dots, h_3$, predictions $\hat{y}_0, \hat{y}_1$ and targets $y_0, y_1$:

$\hat{y}_i = A\bigl(b_{1,i} + \sum_k h_k\, w_{1,i,k}\bigr)$, $\quad h_k = A\bigl(b_{0,k} + \sum_j x_j\, w_{0,k,j}\bigr)$

Loss function: $L_i = (\hat{y}_i - y_i)^2$; just simple: $A(x) = \max(0, x)$

$\nabla_{W,b}\, f_{x,y}(W) = \Bigl(\dfrac{\partial f}{\partial w_{0,0,0}}, \;\dots\;, \dfrac{\partial f}{\partial w_{l,m,n}}, \;\dots\;, \dfrac{\partial f}{\partial b_{l,m}}\Bigr)$

22 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-23
SLIDE 23

Gradient Descent: Single Training Sample

• Given a loss function $L$ and a single training sample $\{x_i, y_i\}$
• Find the best model parameters $\theta = (W, b)$
• Cost: $L_i(\theta, x_i, y_i)$
  – $\theta = \arg\min_\theta L_i(x_i, y_i)$
• Gradient Descent:
  – Initialize $\theta_1$ with 'random' values (more on that later)
  – $\theta_{k+1} = \theta_k - \alpha \nabla_\theta L_i(\theta_k, x_i, y_i)$
  – Iterate until convergence: $\|\theta_{k+1} - \theta_k\| < \epsilon$

23 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-24
SLIDE 24

Gradient Descent: Single Training Sample

– $\theta_{k+1} = \theta_k - \alpha \nabla_\theta L_i(\theta_k, x_i, y_i)$
– $\nabla_\theta L_i(\theta_k, x_i, y_i)$ is computed via backpropagation
– Typically: $\dim\bigl(\nabla_\theta L_i(\theta_k, x_i, y_i)\bigr) = \dim(\theta) \gg 1$ million

Legend: $\theta_k$ = weights and biases at step $k$ (current model); $\theta_{k+1}$ = weights and biases after the update step; $\alpha$ = learning rate; $\nabla_\theta L_i$ = gradient of the loss function w.r.t. $\theta$ for the training sample $\{x_i, y_i\}$.

I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-25
SLIDE 25

Gradient Descent: Multiple Training Samples

• Given a loss function $L$ and multiple ($n$) training samples $\{x_i, y_i\}$
• Find the best model parameters $\theta = (W, b)$
• Cost: $L = \dfrac{1}{n} \sum_{i=1}^{n} L_i(\theta, x_i, y_i)$
  – $\theta = \arg\min_\theta L$

25 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-26
SLIDE 26

Gradient Descent: Multiple Training Samples

• Update step for multiple samples (a code sketch follows after this slide):

$\theta_{k+1} = \theta_k - \alpha \nabla_\theta L\bigl(\theta_k, x_{\{1..n\}}, y_{\{1..n\}}\bigr)$

• Gradient is the average / sum over residuals (reminder: the per-sample gradients come from backprop):

$\nabla_\theta L\bigl(\theta_k, x_{\{1..n\}}, y_{\{1..n\}}\bigr) = \dfrac{1}{n} \sum_{i=1}^{n} \nabla_\theta L_i\bigl(\theta_k, x_i, y_i\bigr)$

• Often people are lazy and just write $\nabla L = \sum_{i=1}^{n} \nabla_\theta L_i$; omitting the $\frac{1}{n}$ is not 'wrong', it just means rescaling the learning rate

I2DL: Prof. Niessner, Prof. Leal-Taixé
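The sketch referenced above, assuming a hypothetical callable per_sample_grad(theta, x_i, y_i) that returns the backprop gradient for one sample; averaging those gradients gives the full-batch gradient:

```python
import numpy as np

def batch_gradient(theta, xs, ys, per_sample_grad):
    # mean over per-sample gradients; dropping the 1/n just rescales the learning rate
    grads = [per_sample_grad(theta, x_i, y_i) for x_i, y_i in zip(xs, ys)]
    return np.mean(grads, axis=0)

def gradient_descent_step(theta, xs, ys, per_sample_grad, alpha=0.01):
    return theta - alpha * batch_gradient(theta, xs, ys, per_sample_grad)
```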

slide-27
SLIDE 27

Side Note: Optimal Learning Rate

Can compute the optimal learning rate $\alpha$ using line search (optimal for a given set):

1. Compute the gradient: $\nabla_\theta L = \dfrac{1}{n} \sum_{i=1}^{n} \nabla_\theta L_i$
2. Optimize for the optimal step size $\alpha$: $\arg\min_\alpha L(\theta_k - \alpha \nabla_\theta L)$
3. Update: $\theta_{k+1} = \theta_k - \alpha \nabla_\theta L$

Not that practical for DL since we would need to solve a huge additional problem at every step…

27 I2DL: Prof. Niessner, Prof. Leal-Taixé
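A coarse, grid-based line-search sketch under stated assumptions (the callables loss(theta) and grad(theta) are placeholders); step 2 above would ideally solve the 1D problem exactly, which is exactly the extra cost that makes this impractical for DL:

```python
import numpy as np

def line_search_step(theta, loss, grad, candidates=np.logspace(-4, 0, 20)):
    g = grad(theta)
    # pick the step size that minimizes the loss along the negative gradient direction
    best_alpha = min(candidates, key=lambda a: loss(theta - a * g))
    return theta - best_alpha * g
```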

slide-28
SLIDE 28

Gradient Descent on Train Set

• Given a large train set with $n$ training samples $\{x_i, y_i\}$
  – Let's say 1 million labeled images
  – Let's say our network has 500k parameters
• The gradient has 500k dimensions
• $n = 1$ million
→ Extremely expensive to compute

28 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-29
SLIDE 29

Stochastic Gradient Descent (SGD)

• If we have $n$ training samples, we need to compute the gradient for all of them, which is $O(n)$
• If we consider the problem as empirical risk minimization, we can express the total loss over the training data as the expectation over all samples:

$\dfrac{1}{n} \sum_{i=1}^{n} L_i(\theta, x_i, y_i) = \mathbb{E}_{i \sim \{1,\dots,n\}}\bigl[L_i(\theta, x_i, y_i)\bigr]$

I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-30
SLIDE 30

Stochastic Gradient Descent (SGD)

• The expectation can be approximated with a small subset of the data:

$\mathbb{E}_{i \sim \{1,\dots,n\}}\bigl[L_i(\theta, x_i, y_i)\bigr] \approx \dfrac{1}{|S|} \sum_{j \in S} L_j(\theta, x_j, y_j)$, with $S \subseteq \{1, \dots, n\}$

Minibatch: choose a subset of the train set of size $m \ll n$:
$B_i = \{\{x_1, y_1\}, \{x_2, y_2\}, \dots, \{x_m, y_m\}\}$, giving minibatches $\{B_1, B_2, \dots, B_{n/m}\}$

I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-31
SLIDE 31

Stochastic Gradient Descent (SGD)

• Minibatch size is a hyperparameter
  – Typically a power of 2 → 8, 16, 32, 64, 128…
  – Smaller batch size means greater variance in the gradients → noisy updates
  – Mostly limited by GPU memory (in the backward pass)
  – E.g.,
    • Train set has $n = 2^{20}$ (about 1 million) images
    • With batch size $m = 64$: $B_{1 \dots n/m} = B_{1 \dots 16{,}384}$ minibatches

(Epoch = complete pass through the training set; a minibatch sketch in code follows after this slide)

I2DL: Prof. Niessner, Prof. Leal-Taixé
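The minibatch sketch referenced above: splitting a training set (assumed here to be NumPy arrays xs, ys) into shuffled minibatches; one full pass over all of them is one epoch:

```python
import numpy as np

def minibatches(xs, ys, batch_size=64, seed=0):
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(xs))           # shuffle the sample indices
    for start in range(0, len(xs), batch_size):
        idx = order[start:start + batch_size]  # yields n/m minibatches per epoch
        yield xs[idx], ys[idx]

# one epoch: for x_batch, y_batch in minibatches(xs, ys): ... do one SGD step ...
```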

slide-32
SLIDE 32

Stochastic Gradient Descent (SGD)

$\theta_{k+1} = \theta_k - \alpha \nabla_\theta L\bigl(\theta_k, x_{\{1..m\}}, y_{\{1..m\}}\bigr)$, with $\nabla_\theta L = \dfrac{1}{m} \sum_{i=1}^{m} \nabla_\theta L_i$

Note the terminology: iteration vs. epoch. $k$ now refers to the $k$-th iteration; $m$ is the number of training samples in the current minibatch; the gradient is computed for the $k$-th minibatch.

32 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-33
SLIDE 33

Convergence of SGD

Suppose we want to minimize a function $F(\theta)$ with the stochastic approximation

$\theta_{k+1} = \theta_k - \alpha_k H(\theta_k, X)$,

where $\alpha_1, \alpha_2, \dots, \alpha_n$ is a sequence of positive step sizes and $H(\theta_k, X)$ is an unbiased estimate of $\nabla F(\theta_k)$, i.e. $\mathbb{E}\bigl[H(\theta_k, X)\bigr] = \nabla F(\theta_k)$.

Robbins, H. and Monro, S. "A Stochastic Approximation Method", 1951.
I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-34
SLIDE 34

Convergence of SGD

The update $\theta_{k+1} = \theta_k - \alpha_k H(\theta_k, X)$ converges to a local (global) minimum if the following conditions are met:

1) $\alpha_n \ge 0, \;\; \forall\, n \ge 0$
2) $\sum_{n=1}^{\infty} \alpha_n = \infty$
3) $\sum_{n=1}^{\infty} \alpha_n^2 < \infty$
4) $F(\theta)$ is strictly convex

The sequence proposed by Robbins and Monro is $\alpha_n \propto \dfrac{\alpha}{n}$, for $n > 0$.

I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-35
SLIDE 35

Problems of SGD

  • Gradient is scaled equally across all dimensions

→ i.e., cannot independently scale directions → need to have conservative min learning rate to avoid divergence → Slower than ‘necessary’

  • Finding good learning rate is an art by itself

→ More next lecture

35 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-36
SLIDE 36

Gradient Descent with Momentum

We're making many steps back and forth along one dimension; we would love to track that this is averaging out over time. Along the other dimension we would love to go faster, i.e., accumulate gradients over time.

Source: A. Ng

36 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-37
SLIDE 37

Gradient Descent with Momentum

$v_{k+1} = \beta \cdot v_k - \alpha \cdot \nabla_\theta L(\theta_k)$  (exponentially-weighted average of gradients)
$\theta_{k+1} = \theta_k + v_{k+1}$

Important: the velocity $v_k$ is vector-valued!
Legend: $\nabla_\theta L(\theta_k)$ = gradient of the current minibatch, $\beta$ = velocity accumulation rate ('friction', momentum), $\alpha$ = learning rate, $v$ = velocity, $\theta$ = weights of the model.

37

[Sutskever et al., ICML’13] On the importance of initialization and momentum in deep learning

I2DL: Prof. Niessner, Prof. Leal-Taixé
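A sketch of the momentum update above in NumPy terms; grad is a placeholder for the minibatch gradient function, and alpha/beta are the learning rate and accumulation rate:

```python
import numpy as np

def momentum_step(theta, v, grad, alpha=1e-2, beta=0.9):
    v_new = beta * v - alpha * grad(theta)   # exponentially-weighted average of gradients
    theta_new = theta + v_new                # step along the accumulated velocity
    return theta_new, v_new

# the velocity is vector-valued and starts at zero: v = np.zeros_like(theta)
```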

slide-38
SLIDE 38

Gradient Descent with Momentum

$\theta_{k+1} = \theta_k + v_{k+1}$

The step will be largest when a sequence of gradients all point in the same direction.

Source: I. Goodfellow

Hyperparameters are $\alpha$ and $\beta$; $\beta$ is often set to 0.9.

38 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-39
SLIDE 39

Gradient Descent with Momentum

  • Can it overcome local minima?

$\theta_{k+1} = \theta_k + v_{k+1}$

39 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-40
SLIDE 40

Nesterov Momentum

  • Look-ahead momentum

$\tilde{\theta}_{k+1} = \theta_k + \beta \cdot v_k$
$v_{k+1} = \beta \cdot v_k - \alpha \cdot \nabla_\theta L(\tilde{\theta}_{k+1})$
$\theta_{k+1} = \theta_k + v_{k+1}$

Nesterov, Yurii E. "A method for solving the convex programming problem with convergence rate O(1/k²)." Dokl. Akad. Nauk SSSR, Vol. 269, 1983.

40 I2DL: Prof. Niessner, Prof. Leal-Taixé
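The look-ahead variant as a sketch (same placeholder grad as in the momentum sketch): the only change is evaluating the gradient at the look-ahead point theta + beta * v rather than at theta:

```python
def nesterov_step(theta, v, grad, alpha=1e-2, beta=0.9):
    lookahead = theta + beta * v                 # tentative look-ahead position
    v_new = beta * v - alpha * grad(lookahead)   # gradient evaluated at the look-ahead point
    return theta + v_new, v_new
```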

slide-41
SLIDE 41

Nesterov Momentum

Source: G. Hinton

$\tilde{\theta}_{k+1} = \theta_k + \beta \cdot v_k$
$v_{k+1} = \beta \cdot v_k - \alpha \cdot \nabla_\theta L(\tilde{\theta}_{k+1})$
$\theta_{k+1} = \theta_k + v_{k+1}$

41 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-42
SLIDE 42

Root Mean Squared Prop (RMSProp)

  • RMSProp divides the learning rate by an

exponentially-decaying average of squared gradients.

Small gradients Large gradients

Source: Andrew. Ng Hinton et al. "Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude." COURSERA: Neural networks for machine learning 4.2 (2012): 26-31.

42 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-43
SLIDE 43

RMSProp

$s_{k+1} = \beta \cdot s_k + (1 - \beta)\,[\nabla_\theta L \circ \nabla_\theta L]$  ($\circ$ = element-wise multiplication)
$\theta_{k+1} = \theta_k - \alpha \cdot \dfrac{\nabla_\theta L}{\sqrt{s_{k+1}} + \epsilon}$

Hyperparameters: $\alpha$ (needs tuning!), $\beta$ (often 0.9), $\epsilon$ (typically $10^{-8}$)

I2DL: Prof. Niessner, Prof. Leal-Taixé
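A sketch of the RMSProp update above; s is the running average of squared gradients and rescales each coordinate of the step (grad is again a placeholder for the minibatch gradient):

```python
import numpy as np

def rmsprop_step(theta, s, grad, alpha=1e-3, beta=0.9, eps=1e-8):
    g = grad(theta)
    s_new = beta * s + (1.0 - beta) * g * g            # element-wise squared gradients
    theta_new = theta - alpha * g / (np.sqrt(s_new) + eps)
    return theta_new, s_new
```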

slide-44
SLIDE 44

RMSProp

X-direction Small gradients Y-Direction Large gradients

Source: A. Ng

$s_{k+1} = \beta \cdot s_k + (1 - \beta)\,[\nabla_\theta L \circ \nabla_\theta L]$
$\theta_{k+1} = \theta_k - \alpha \cdot \dfrac{\nabla_\theta L}{\sqrt{s_{k+1}} + \epsilon}$

We're dividing by the squared gradients:
• the division in the Y-direction (large gradients) will be large
• the division in the X-direction (small gradients) will be small

$s$ is the (uncentered) variance of the gradients → second moment. Can increase the learning rate!

44 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-45
SLIDE 45

RMSProp

  • Dampening the oscillations for high-variance

directions

  • Can use a faster learning rate because it is less likely to diverge

→ speeds up learning
→ second moment

45 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-46
SLIDE 46

Adaptive Moment Estimation (Adam)

Idea: combine Momentum and RMSProp

$m_{k+1} = \beta_1 \cdot m_k + (1 - \beta_1)\,\nabla_\theta L(\theta_k)$  (first moment: mean of gradients)
$v_{k+1} = \beta_2 \cdot v_k + (1 - \beta_2)\,[\nabla_\theta L(\theta_k) \circ \nabla_\theta L(\theta_k)]$  (second moment: variance of gradients)
$\theta_{k+1} = \theta_k - \alpha \cdot \dfrac{m_{k+1}}{\sqrt{v_{k+1}} + \epsilon}$

[Kingma et al., ICLR’15] Adam: A method for stochastic optimization

46

  • Q. What happens at $k = 0$?
  • A. We need bias correction, as $m_0 = 0$ and $v_0 = 0$.

Note: this is not the update rule of Adam.

I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-47
SLIDE 47

Adam: Bias Corrected

  • Combines Momentum and RMSProp

$m_{k+1} = \beta_1 \cdot m_k + (1 - \beta_1)\,\nabla_\theta L(\theta_k)$
$v_{k+1} = \beta_2 \cdot v_k + (1 - \beta_2)\,[\nabla_\theta L(\theta_k) \circ \nabla_\theta L(\theta_k)]$

• $m_k$ and $v_k$ are initialized with zero
→ biased towards zero
→ need bias-corrected moment updates:

$\hat{m}_{k+1} = \dfrac{m_{k+1}}{1 - \beta_1^{\,k+1}}$, $\quad \hat{v}_{k+1} = \dfrac{v_{k+1}}{1 - \beta_2^{\,k+1}}$

$\theta_{k+1} = \theta_k - \alpha \cdot \dfrac{\hat{m}_{k+1}}{\sqrt{\hat{v}_{k+1}} + \epsilon}$

This is the update rule of Adam.

I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-48
SLIDE 48

Adam

  • Exponentially-decaying mean and variance of the gradients (combines first and second moment)
  • Hyperparameters: $\alpha$, $\beta_1$, $\beta_2$, $\epsilon$

48

$m_{k+1} = \beta_1 \cdot m_k + (1 - \beta_1)\,\nabla_\theta L(\theta_k)$
$v_{k+1} = \beta_2 \cdot v_k + (1 - \beta_2)\,\bigl[\nabla_\theta L(\theta_k) \circ \nabla_\theta L(\theta_k)\bigr]$
$\hat{m}_{k+1} = \dfrac{m_{k+1}}{1 - \beta_1^{\,k+1}}$, $\quad \hat{v}_{k+1} = \dfrac{v_{k+1}}{1 - \beta_2^{\,k+1}}$
$\theta_{k+1} = \theta_k - \alpha \cdot \dfrac{\hat{m}_{k+1}}{\sqrt{\hat{v}_{k+1}} + \epsilon}$

Typical values (defaults in PyTorch): $\beta_1$ often 0.9, $\beta_2$ often 0.999, $\epsilon$ typically $10^{-8}$; $\alpha$ needs tuning!

I2DL: Prof. Niessner, Prof. Leal-Taixé
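A sketch of the bias-corrected Adam update with the default values quoted above; k counts iterations from 0 and m, v start at zero (grad is a placeholder for the minibatch gradient):

```python
import numpy as np

def adam_step(theta, m, v, k, grad, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    g = grad(theta)
    m = beta1 * m + (1.0 - beta1) * g           # first moment: mean of gradients
    v = beta2 * v + (1.0 - beta2) * g * g       # second moment: uncentered variance
    m_hat = m / (1.0 - beta1 ** (k + 1))        # bias correction (m, v start at zero)
    v_hat = v / (1.0 - beta2 ** (k + 1))
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```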

slide-49
SLIDE 49

There are a few others…

  • ‘Vanilla’ SGD
  • Momentum
  • RMSProp
  • Adagrad
  • Adadelta
  • AdaMax
  • Nadam
  • AMSGrad

Adam is mostly the method of choice for neural networks!

It’s actually fun to play around with SGD updates. It’s easy and you get pretty immediate feedback 

50 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-50
SLIDE 50

Convergence

Source: http://ruder.io/optimizing-gradient-descent/

51 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-51
SLIDE 51

Convergence

Source: http://ruder.io/optimizing-gradient-descent/

52 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-52
SLIDE 52

Convergence

Source: https://github.com/Jaewan-Yun/optimizer-visualization

53 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-53
SLIDE 53

Jacobian and Hessian

  • Derivative: $f: \mathbb{R} \to \mathbb{R}$, $\;\dfrac{df(x)}{dx}$
  • Gradient: $f: \mathbb{R}^m \to \mathbb{R}$, $\;\nabla_x f(x) = \Bigl(\dfrac{df(x)}{dx_1}, \dfrac{df(x)}{dx_2}, \dots\Bigr)$
  • Jacobian: $f: \mathbb{R}^m \to \mathbb{R}^n$, $\;\mathbf{J} \in \mathbb{R}^{n \times m}$
  • Hessian (the second derivative): $f: \mathbb{R}^m \to \mathbb{R}$, $\;\mathbf{H} \in \mathbb{R}^{m \times m}$

54 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-54
SLIDE 54

Newton’s Method

  • Approximate our function by a second-order Taylor series expansion:

$L(\theta) \approx L(\theta_0) + (\theta - \theta_0)^T \nabla_\theta L(\theta_0) + \tfrac{1}{2}\,(\theta - \theta_0)^T \mathbf{H}\,(\theta - \theta_0)$

with the first derivative $\nabla_\theta L(\theta_0)$ and the second derivative (curvature) given by the Hessian $\mathbf{H}$.

55

More info: https://en.wikipedia.org/wiki/Taylor_series

I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-55
SLIDE 55

Newton’s Method

  • Differentiate and equate to zero:

Update step: $\theta^* = \theta_0 - \mathbf{H}^{-1} \nabla_\theta L(\theta_0)$

Compare SGD: $\theta_{k+1} = \theta_k - \alpha \nabla_\theta L(\theta_k, x_i, y_i)$

We got rid of the learning rate!

I2DL: Prof. Niessner, Prof. Leal-Taixé
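A sketch of one Newton step for a small parameter vector, assuming placeholder callables grad(theta) and hessian(theta); solving the linear system is preferred over explicitly inverting H, and the next slide shows why this does not scale to millions of parameters:

```python
import numpy as np

def newton_step(theta, grad, hessian):
    g = grad(theta)          # shape (d,)
    H = hessian(theta)       # shape (d, d): d^2 entries, infeasible for d in the millions
    return theta - np.linalg.solve(H, g)   # theta* = theta_0 - H^{-1} * grad
```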

slide-56
SLIDE 56

Newton’s Method

  • Differentiate and equate to zero:

$\theta^* = \theta_0 - \mathbf{H}^{-1} \nabla_\theta L(\theta_0)$

Update step considerations: a network has millions of parameters, so the Hessian has (millions)² elements, and the computational complexity of the 'inversion' per iteration becomes prohibitive.

57 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-57
SLIDE 57

Newton’s Method

  • Gradient Descent (green)
  • Newton’s method exploits

the curvature to take a more direct route

58

Source: https://en.wikipedia.org/wiki/Newton%27s_method_in_optimization

I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-58
SLIDE 58

Newton’s Method

$J(\theta) = (\mathbf{y} - \mathbf{X}\theta)^T (\mathbf{y} - \mathbf{X}\theta)$

59

Can you apply Newton’s method for linear regression? What do you get as a result?

I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-59
SLIDE 59

BFGS and L-BFGS

  • Broyden-Fletcher-Goldfarb-Shanno algorithm
  • Belongs to the family of quasi-Newton methods
  • Have an approximation of the inverse of the Hessian
  • BFGS
  • Limited memory: L-BFGS

$\theta^* = \theta_0 - \mathbf{H}^{-1} \nabla_\theta L(\theta_0)$

60 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-60
SLIDE 60

Gauss-Newton

  • Newton step: $x_{k+1} = x_k - H_f(x_k)^{-1} \nabla f(x_k)$
  – 'true' 2nd derivatives are often hard to obtain (e.g., numerics)
  – for a sum-of-squares objective $f = \|F\|^2$: $H_f \approx 2\, J_F^T J_F$
  • Gauss-Newton (GN): $x_{k+1} = x_k - \bigl[2\, J_F(x_k)^T J_F(x_k)\bigr]^{-1} \nabla f(x_k)$
  • Solve a linear system instead (again, inverting a matrix is unstable):

$2\, J_F(x_k)^T J_F(x_k)\,\bigl(x_k - x_{k+1}\bigr) = \nabla f(x_k)$

→ solve for the delta vector $(x_k - x_{k+1})$; a code sketch follows after this slide

61 I2DL: Prof. Niessner, Prof. Leal-Taixé
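The sketch referenced above: one Gauss-Newton step for a least-squares objective, assuming placeholder callables residual(x) and jacobian(x); the Hessian is approximated by 2 J^T J and the delta is obtained by solving the linear system rather than inverting:

```python
import numpy as np

def gauss_newton_step(x, residual, jacobian):
    r = residual(x)                       # residual vector F(x), shape (m,)
    J = jacobian(x)                       # Jacobian J_F(x), shape (m, d)
    grad = 2.0 * J.T @ r                  # gradient of f(x) = ||F(x)||^2
    H_approx = 2.0 * J.T @ J              # Gauss-Newton approximation of the Hessian
    delta = np.linalg.solve(H_approx, grad)   # solve for the delta vector
    return x - delta
```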

slide-61
SLIDE 61

Levenberg

  • Levenberg

– "damped" version of Gauss-Newton:

$\bigl[J_F(x_k)^T J_F(x_k) + \lambda \cdot I\bigr]\,\bigl(x_k - x_{k+1}\bigr) = \nabla f(x_k)$

– The damping factor $\lambda$ is adjusted in each iteration, ensuring $f(x_k) > f(x_{k+1})$

  • If the condition is not fulfilled, increase $\lambda$
  • → trust region
  • → "interpolation" between Gauss-Newton (small $\lambda$) and Gradient Descent (large $\lambda$)

Tikhonov regularization

62 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-62
SLIDE 62

Levenberg-Marquardt

  • Levenberg-Marquardt (LM):

$\bigl[J_F(x_k)^T J_F(x_k) + \lambda \cdot \mathrm{diag}\bigl(J_F(x_k)^T J_F(x_k)\bigr)\bigr]\,\bigl(x_k - x_{k+1}\bigr) = \nabla f(x_k)$

– Instead of a plain Gradient Descent for large $\lambda$, scale each component of the gradient according to the curvature.

  • Avoids slow convergence in components with a small

gradient

63 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-63
SLIDE 63

Which, What, and When?

  • Standard: Adam
  • Fallback option: SGD with momentum
  • Newton, L-BFGS, GN, LM only if you can do full-batch updates (doesn't work well for minibatches!!)

This practically never happens for DL. Theoretically, it would be nice though, due to the fast convergence.

64 I2DL: Prof. Niessner, Prof. Leal-Taixé
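In practice this choice boils down to picking an optimizer from torch.optim; a minimal sketch with a placeholder model and dummy data (the optimizer constructors and the training-loop calls are standard PyTorch, everything else is assumed for illustration):

```python
import torch

model = torch.nn.Linear(10, 2)                              # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # standard choice
# fallback: torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

x = torch.randn(64, 10)                                     # dummy minibatch
y = torch.randn(64, 2)
for step in range(100):
    optimizer.zero_grad()                                   # clear old gradients
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()                                         # backprop fills .grad
    optimizer.step()                                        # one optimizer update
```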

slide-64
SLIDE 64

General Optimization

  • Linear systems (Ax = b)
    – LU, QR, Cholesky, Jacobi, Gauss-Seidel, CG, PCG, etc.
  • Non-linear (gradient-based)
    – Newton, Gauss-Newton, LM, (L-)BFGS ← second order
    – Gradient Descent, SGD ← first order
  • Others
    – Genetic algorithms, MCMC, Metropolis-Hastings, etc.
    – Constrained and convex solvers (Lagrange, ADMM, Primal-Dual, etc.)

65 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-65
SLIDE 65

Please Remember!

  • Think about your problem and the optimization at hand
  • SGD is specifically designed for minibatches
  • When you can, use a 2nd-order method → it's just faster
  • GD or SGD is not a way to solve a linear system!

66 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-66
SLIDE 66

Next Lecture

  • This week:

– Check exercises – Check office hours 

  • Next lecture

– Training Neural networks

72 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-67
SLIDE 67

See you next week 

73 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-68
SLIDE 68

Some References to SGD Updates

  • Goodfellow et al. “Deep Learning” (2016),

– Chapter 8: Optimization

  • Bishop “Pattern Recognition and Machine Learning”

(2006),

– Chapter 5.2: Network training (gradient descent) – Chapter 5.4: The Hessian Matrix (second order methods)

  • https://ruder.io/optimizing-gradient-descent/index.html
  • PyTorch Documentation (with further readings)

– https://pytorch.org/docs/stable/optim.html

74 I2DL: Prof. Niessner, Prof. Leal-Taixé