Lecture 6: Optimization (CS109B Data Science 2, Pavlos Protopapas and Mark Glickman)

SLIDE 1

CS109B Data Science 2

Pavlos Protopapas and Mark Glickman

Lecture 6: Optimization

SLIDE 2

Outline

Optimization

  • Challenges in Optimization
  • Momentum
  • Adaptive Learning Rate
  • Parameter Initialization
  • Batch Normalization


SLIDE 3

Learning vs. Optimization

Goal of learning: minimize the generalization error. In practice, we perform empirical risk minimization:

The quantity optimized differs from the quantity we actually care about.

J(θ) = E_(x,y)~p_data [ L(f(x; θ), y) ]

Ĵ(θ) = (1/m) Σ_{i=1}^{m} L(f(x^(i); θ), y^(i))

SLIDE 4

Batch vs. Stochastic Algorithms

Batch algorithms

  • Optimize empirical risk using exact gradients

Stochastic algorithms

  • Estimates gradient from a small random sample


Large mini-batch: the gradient computation is expensive. Small mini-batch: greater variance in the estimate, so more steps are needed for convergence.

∇J(θ) = E_(x,y)~p_data [ ∇L(f(x; θ), y) ]

SLIDE 5

Critical Points

Points with zero gradient. The 2nd derivative (Hessian) determines the curvature around them.


Goodfellow et al. (2016)

SLIDE 6

Stochastic Gradient Descent

Take small steps in the direction of the negative gradient. Sample m examples from the training set and compute:

g = (1/m) Σ_i ∇_θ L(f(x^(i); θ), y^(i))

Update parameters (ε_k is the learning rate at step k):

θ = θ − ε_k g

In practice: shuffle the training set once and pass through it multiple times.
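As a concrete illustration, a minimal numpy sketch of this loop on a toy linear-regression problem (the data, batch size, and constant learning rate are illustrative assumptions, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 3))               # toy design matrix
    true_w = np.array([1.0, -2.0, 0.5])
    y = X @ true_w + 0.1 * rng.normal(size=1000)

    theta = np.zeros(3)                          # parameters
    eps = 0.1                                    # learning rate (epsilon_k, kept constant here)
    m = 32                                       # mini-batch size

    for step in range(500):
        idx = rng.choice(len(X), size=m, replace=False)  # sample m examples
        err = X[idx] @ theta - y[idx]
        g = X[idx].T @ err / m                   # g = (1/m) sum_i grad L_i (squared-error loss)
        theta = theta - eps * g                  # theta <- theta - eps * g

    print(theta)                                 # close to true_w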

SLIDE 7

Outline

Optimization

  • Challenges in Optimization
  • Momentum
  • Adaptive Learning Rate
  • Parameter Initialization
  • Batch Normalization


SLIDE 8

Local Minima


Goodfellow et al. (2016)

SLIDE 9

Local Minima

Old view: local minima are a major problem in neural network training. Recent view:

  • For sufficiently large neural networks, most local minima incur low cost
  • Not important to find true global minimum


SLIDE 10

Saddle Points

Recent studies indicate that in high dimensions, saddle points are more likely than local minima. The gradient can be very small near saddle points.


A saddle point is a local minimum along some directions and a local maximum along others.

Goodfellow et al. (2016)

SLIDE 11

No Critical Points

Gradient norm increases, but validation error decreases


Convolutional nets for object detection (Goodfellow et al., 2016)

SLIDE 12

Saddle Points

SGD is often seen to escape saddle points: it moves downhill using noisy gradients. Second-order methods can get stuck: they solve for a point with zero gradient.


Goodfellow et al. (2016)

SLIDE 13

Poor Conditioning

Poorly conditioned Hessian matrix. High curvature: even small steps lead to a large increase in cost. Learning is slow despite strong gradients.


Oscillations slow down progress

SLIDE 14

No Critical Points

Some cost functions do not have critical points, in particular many of those used in classification.


SLIDE 15

Exploding and Vanishing Gradients


Linear activation

deeplearning.ai

ℎ" = 𝑋𝑦 ℎ" = 𝑋ℎ"&', 𝑗 = 2, … , 𝑜

SLIDE 16

Exploding and Vanishing Gradients


Suppose W = [ a  0 ; 0  b ]. Then

[ h_1^(1) ; h_2^(1) ] = [ a  0 ; 0  b ] [ x_1 ; x_2 ],   …,   [ h_1^(n) ; h_2^(n) ] = [ a^n  0 ; 0  b^n ] [ x_1 ; x_2 ]

SLIDE 17

Exploding and Vanishing Gradients


Suppose x = [ 1 ; 1 ].

Case 1: a = 1, b = 2:   y → [ 1 ; 2^n ],   ∇y → [ n ; n·2^(n−1) ]   (explodes!)
Case 2: a = 0.5, b = 0.9:   y → [ 0 ; 0 ],   ∇y → [ n·0.5^(n−1) ; n·0.9^(n−1) ] → [ 0 ; 0 ]   (vanishes!)
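A small numeric sketch of this example (the depth of 50 layers and the use of numpy are illustrative choices, not from the slides):

    import numpy as np

    x = np.array([1.0, 1.0])
    for a, b, label in [(1.0, 2.0, "explodes"), (0.5, 0.9, "vanishes")]:
        W = np.diag([a, b])          # shared linear-layer weights
        h = x.copy()
        for _ in range(50):          # n = 50 linear layers, no nonlinearity
            h = W @ h
        print(label, h)              # [1, 2**50] vs. values shrinking toward 0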

SLIDE 18

Exploding and Vanishing Gradients

Exploding gradients lead to cliffs in the cost surface. This can be mitigated using gradient clipping:


Goodfellow et al. (2016)

if ‖g‖ > v:   g ← g · v / ‖g‖    (v is the clipping threshold)
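A minimal sketch of norm-based clipping, assuming the rule above (the function name and threshold value are illustrative):

    import numpy as np

    def clip_by_norm(g, v=1.0):
        """Rescale gradient g so its L2 norm never exceeds v."""
        norm = np.linalg.norm(g)
        if norm > v:
            g = g * (v / norm)
        return g

    print(clip_by_norm(np.array([3.0, 4.0]), v=1.0))   # -> [0.6, 0.8], norm 1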

SLIDE 19

Outline

Optimization

  • Challenges in Optimization
  • Momentum
  • Adaptive Learning Rate
  • Parameter Initialization
  • Batch Normalization


SLIDE 20

Stochastic Gradient Descent


Oscillations because updates do not exploit curvature information

Goodfellow et al. (2016)


SLIDE 21

Momentum

SGD is slow when there is high curvature. The averaged gradient gives a faster path to the optimum: the vertical components cancel out.



SLIDE 22

Momentum

Uses past gradients for the update. Maintains a new quantity, the 'velocity': an exponentially decaying average of gradients.


v = αv − εg,   where α ∈ [0, 1) controls how quickly the effect of past gradients decays and −εg is the current gradient update

g = (1/m) Σ_i ∇_θ L(f(x^(i); θ), y^(i))

SLIDE 23

Momentum

Compute gradient estimate:   g = (1/m) Σ_i ∇_θ L(f(x^(i); θ), y^(i))
Update velocity:   v = αv − εg
Update parameters:   θ = θ + v
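A minimal numpy sketch of these three steps on a toy poorly conditioned quadratic (the objective and hyper-parameter values are illustrative assumptions):

    import numpy as np

    def grad(theta):
        # gradient of a toy poorly conditioned quadratic: J = 0.5*(theta1^2 + 10*theta2^2)
        return np.array([theta[0], 10.0 * theta[1]])

    theta = np.array([5.0, 5.0])
    v = np.zeros_like(theta)        # velocity
    alpha, eps = 0.9, 0.05          # momentum coefficient and learning rate

    for _ in range(100):
        g = grad(theta)             # gradient estimate
        v = alpha * v - eps * g     # update velocity
        theta = theta + v           # update parameters

    print(theta)                    # approaches the minimum at the origin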

SLIDE 24

Momentum


Damped oscillations: gradients in opposite directions get cancelled out

Goodfellow et al. (2016)


SLIDE 25

Nesterov Momentum

Apply an interim update:   θ̃ = θ + αv
Perform a correction based on the gradient at the interim point:   g = (1/m) Σ_i ∇_θ L(f(x^(i); θ̃), y^(i))
Update velocity and parameters:   v = αv − εg,   θ = θ + v

Momentum based on the look-ahead slope.
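The same toy loop with the Nesterov look-ahead (again an illustrative sketch, reusing the quadratic from the momentum sketch above):

    import numpy as np

    def grad(theta):
        return np.array([theta[0], 10.0 * theta[1]])   # same toy quadratic gradient

    theta = np.array([5.0, 5.0])
    v = np.zeros_like(theta)
    alpha, eps = 0.9, 0.05

    for _ in range(100):
        theta_ahead = theta + alpha * v     # interim (look-ahead) update
        g = grad(theta_ahead)               # gradient at the interim point
        v = alpha * v - eps * g
        theta = theta + v

    print(theta)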

SLIDE 26

SLIDE 27

Outline

Optimization

  • Challenges in Optimization
  • Momentum
  • Adaptive Learning Rate
  • Parameter Initialization
  • Batch Normalization


SLIDE 28

Adaptive Learning Rates

Oscillations along vertical direction

– Learning must be slower along parameter 2

Use a different learning rate for each parameter?


[Figure: contours of J(θ) over parameters θ_1 and θ_2]

SLIDE 29

AdaGrad

  • Accumulate squared gradients:   r_i = r_i + g_i^2
  • Update each parameter:   θ_i = θ_i − (ε / (δ + √r_i)) g_i
  • Greater progress along gently sloped directions

The step size is inversely proportional to the cumulative squared gradient.
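A per-parameter numpy sketch of the AdaGrad accumulation and update (the toy gradient and the constants are illustrative assumptions):

    import numpy as np

    def grad(theta):
        return np.array([theta[0], 10.0 * theta[1]])    # toy quadratic gradient

    theta = np.array([5.0, 5.0])
    r = np.zeros_like(theta)            # accumulated squared gradients
    eps, delta = 0.5, 1e-7

    for _ in range(200):
        g = grad(theta)
        r = r + g ** 2                                  # accumulate squared gradients
        theta = theta - eps * g / (delta + np.sqrt(r))  # per-parameter step

    print(theta)                        # moves toward the minimum at the origin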

SLIDE 30

RMSProp

  • For non-convex problems, AdaGrad can prematurely decrease the learning rate
  • Use an exponentially weighted average for gradient accumulation:

r_i = ρ r_i + (1 − ρ) g_i^2
θ_i = θ_i − (ε / (δ + √r_i)) g_i
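The same sketch with the exponentially weighted accumulator (ρ = 0.9 is an illustrative choice):

    import numpy as np

    def grad(theta):
        return np.array([theta[0], 10.0 * theta[1]])

    theta = np.array([5.0, 5.0])
    r = np.zeros_like(theta)
    eps, delta, rho = 0.05, 1e-7, 0.9

    for _ in range(200):
        g = grad(theta)
        r = rho * r + (1 - rho) * g ** 2                # exponentially weighted average
        theta = theta - eps * g / (delta + np.sqrt(r))

    print(theta)                        # moves close to the minimum at the origin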

SLIDE 31

Adam

  • RMSProp + Momentum
  • Estimate first moment:   v_i = ρ_1 v_i + (1 − ρ_1) g_i
  • Estimate second moment:   r_i = ρ_2 r_i + (1 − ρ_2) g_i^2
  • Update parameters:   θ_i = θ_i − (ε / (δ + √r_i)) v_i

Also applies bias correction to v and r. Works well in practice and is fairly robust to hyper-parameters.
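A compact numpy sketch of Adam including the bias correction mentioned above (the hyper-parameter values are common defaults, assumed here for illustration):

    import numpy as np

    def grad(theta):
        return np.array([theta[0], 10.0 * theta[1]])   # toy quadratic gradient

    theta = np.array([5.0, 5.0])
    v = np.zeros_like(theta)            # first-moment estimate
    r = np.zeros_like(theta)            # second-moment estimate
    eps, delta = 0.1, 1e-8
    rho1, rho2 = 0.9, 0.999

    for t in range(1, 501):
        g = grad(theta)
        v = rho1 * v + (1 - rho1) * g                   # first moment
        r = rho2 * r + (1 - rho2) * g ** 2              # second moment
        v_hat = v / (1 - rho1 ** t)                     # bias correction
        r_hat = r / (1 - rho2 ** t)
        theta = theta - eps * v_hat / (delta + np.sqrt(r_hat))

    print(theta)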

SLIDE 32

Outline

Optimization

  • Challenges in Optimization
  • Momentum
  • Adaptive Learning Rate
  • Parameter Initialization
  • Batch Normalization


SLIDE 33

Parameter Initialization

  • Goal: break symmetry between units
  • so that each unit computes a different function
  • Initialize all weights (not biases) randomly
  • Gaussian or uniform distribution
  • Scale of initialization?
  • Large → gradient explosion; small → gradient vanishing


SLIDE 34

Xavier Initialization

  • Heuristic for all outputs to have unit variance
  • For a fully-connected layer with m inputs:   W_ij ~ N(0, 1/m)
  • For ReLU units, it is recommended:   W_ij ~ N(0, 2/m)
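A short numpy sketch of the two rules (the layer sizes are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    m, n_units = 256, 128                                             # fan-in and layer width

    W_xavier = rng.normal(0.0, np.sqrt(1.0 / m), size=(m, n_units))   # W_ij ~ N(0, 1/m)
    W_relu   = rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n_units))   # W_ij ~ N(0, 2/m) for ReLU

    print(W_xavier.var(), W_relu.var())   # approximately 1/m and 2/m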

SLIDE 35

Normalized Initialization

  • Fully-connected layer with m inputs, n outputs:   W_ij ~ U(−√(6/(m+n)), +√(6/(m+n)))
  • Heuristic trades off between initializing all layers to have the same activation variance and the same gradient variance
  • Sparse variant when m is large: initialize k nonzero weights in each unit
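The corresponding uniform rule, again as an illustrative numpy sketch (the layer shape is an assumption):

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 256, 128                                 # inputs and outputs of the layer
    limit = np.sqrt(6.0 / (m + n))
    W = rng.uniform(-limit, limit, size=(m, n))     # W_ij ~ U(-sqrt(6/(m+n)), +sqrt(6/(m+n)))

    print(W.min(), W.max())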

SLIDE 36

Bias Initialization

  • Output unit bias
  • Marginal statistics of the output in the training set
  • Hidden unit bias
  • Avoid saturation at initialization
  • E.g. in ReLU, initialize bias to 0.1 instead of 0
  • Units controlling participation of other units
  • Set bias to allow participation at initialization


SLIDE 37

SLIDE 38

Outline

  • Challenges in Optimization
  • Momentum
  • Adaptive Learning Rate
  • Parameter Initialization
  • Batch Normalization


SLIDE 39

Feature Normalization

It is good practice to normalize features before applying a learning algorithm, so that all features are on the same scale: mean 0 and variance 1.

– Speeds up learning


x̃ = (x − μ) / σ

where x is the feature vector, μ is the vector of mean feature values, and σ is the vector of feature standard deviations.
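In numpy this is a one-liner per feature column (the toy data here are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(loc=5.0, scale=3.0, size=(100, 4))   # toy feature matrix

    mu = X.mean(axis=0)            # vector of mean feature values
    sigma = X.std(axis=0)          # vector of feature standard deviations
    X_norm = (X - mu) / sigma      # x_tilde = (x - mu) / sigma

    print(X_norm.mean(axis=0).round(3), X_norm.std(axis=0).round(3))   # ~0 and ~1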

SLIDE 40

Feature Normalization

[Figure: contours of J(θ) before and after normalization]

SLIDE 41

Internal Covariate Shift

Each hidden layer changes the distribution of inputs to the next layer, which slows down learning.


[Figure: normalize the inputs to each hidden layer, from layer 2 through layer n]

SLIDE 42

Batch Normalization

Training time:

– Mini-batch of activations for the layer to normalize


H = [ H_11 … H_1K ; ⋮ ; H_N1 … H_NK ]

with N rows (data points in the mini-batch) and K columns (hidden-layer activations).

SLIDE 43

Batch Normalization

Training time:

– Mini-batch of activations for the layer to normalize, where


H′ = (H − μ) / σ

μ = (1/m) Σ_i H_{i,:}   (vector of mean activations across the mini-batch)
σ = √( (1/m) Σ_i (H − μ)_i^2 + δ )   (vector of per-unit standard deviations across the mini-batch)

SLIDE 44

Batch Normalization

Training time:

– Normalization can reduce the expressive power of the network
– Instead use:   γ H′ + β, where γ and β are learnable parameters
– This allows the network to control the range of the normalization
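A training-time numpy sketch of these operations (the mini-batch shape, δ, and the initial values of γ and β are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    H = rng.normal(loc=2.0, scale=4.0, size=(32, 16))   # N=32 examples, K=16 activations

    delta = 1e-5
    gamma = np.ones(16)        # learnable scale, initialized to 1
    beta = np.zeros(16)        # learnable shift, initialized to 0

    mu = H.mean(axis=0)                                     # mean activation per unit
    sigma = np.sqrt(((H - mu) ** 2).mean(axis=0) + delta)   # per-unit std across the mini-batch
    H_norm = (H - mu) / sigma                               # H' = (H - mu) / sigma
    out = gamma * H_norm + beta                             # gamma * H' + beta

    print(out.mean(axis=0).round(3), out.std(axis=0).round(3))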

SLIDE 45

Batch Normalization


Add normalization operations for layer 1, computed on each mini-batch (Batch 1, …, Batch N):

μ^1 = (1/m) Σ_i H_{i,:}
σ^1 = √( (1/m) Σ_i (H − μ)_i^2 + δ )

SLIDE 46

Batch Normalization

Add normalization operations for layer 2, computed on each mini-batch (Batch 1, …, Batch N), and so on:

μ^2 = (1/m) Σ_i H_{i,:}
σ^2 = √( (1/m) Σ_i (H − μ)_i^2 + δ )

SLIDE 47

Batch Normalization

Differentiate the joint loss for the N mini-batches. Back-propagate through the normalization operations.

Test time:

– The model needs to be evaluated on a single example
– Replace μ and σ with running averages collected during training
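An illustrative sketch of the test-time behavior, assuming the running averages are maintained with an exponential moving average (the 0.9 decay is an assumption, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    gamma, beta, delta = np.ones(16), np.zeros(16), 1e-5
    running_mu, running_sigma = np.zeros(16), np.ones(16)

    # Training: update running statistics from each mini-batch
    for _ in range(100):
        H = rng.normal(2.0, 4.0, size=(32, 16))
        mu, sigma = H.mean(axis=0), np.sqrt(H.var(axis=0) + delta)
        running_mu = 0.9 * running_mu + 0.1 * mu
        running_sigma = 0.9 * running_sigma + 0.1 * sigma

    # Test time: normalize a single example with the running averages
    x = rng.normal(2.0, 4.0, size=16)
    y = gamma * (x - running_mu) / running_sigma + beta
    print(y.round(3))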
