Gradient Descent Michail Michailidis & Patrick Maiden - PowerPoint PPT Presentation

Gradient Descent
Michail Michailidis & Patrick Maiden

Outline
• Motivation
• Gradient Descent Algorithm
  ▫ Issues & Alternatives
• Stochastic Gradient Descent
• Parallel Gradient Descent
• HOGWILD!

Motivation
• It is good for finding the global minimum if the function is convex (or the global maximum if it is concave)
• It is good for finding local minima/maxima if the function is not convex
• It is used for optimizing many models in machine learning:
  ▫ It is used in conjunction with:
     - Neural Networks
     - Linear Regression
     - Logistic Regression
     - Back-propagation algorithm
     - Support Vector Machines

Function Example

Quickest ever review of multivariate calculus
• Derivative
• Partial Derivative
• Gradient Vector

Derivative
• Slope of the tangent line
  $f(x) = x^2$
  $f'(x) = \frac{df}{dx} = 2x$
  $f''(x) = \frac{d^2 f}{dx^2} = 2$
• Easy when a function is univariate

Partial Derivative – Multivariate Functions
For multivariate functions (e.g. two variables) we need partial derivatives, one per dimension. Examples of multivariate functions:
  $f(x, y) = x^2 + y^2$ (Convex!)
  $f(x, y) = -x^2 - y^2$ (Concave!)
  $f(x, y) = \cos^2(x) + \cos^2(y)$
  $f(x, y) = \cos^2(x) + y^2$

Partial Derivative – Cont'd
To visualize the partial derivative for each of the dimensions x and y, we can imagine a plane that "cuts" our surface along the two dimensions, and once again we get the slope of the tangent line.
  surface: $f(x, y) = 9 - x^2 - y^2$
  plane: $y = 1$
  cut: $f(x, 1) = 8 - x^2$
  slope / derivative of cut: $f'(x) = -2x$

Partial Derivative – Cont'd 2
If we partially differentiate a function with respect to x, we pretend y is constant (and vice versa):
  $f(x, y) = 9 - x^2 - y^2$
  $f(x, y) = 9 - x^2 - c^2$, so $f_x = \frac{\partial f}{\partial x} = -2x$
  $f(x, y) = 9 - c^2 - y^2$, so $f_y = \frac{\partial f}{\partial y} = -2y$

Partial Derivative – Cont'd 3
The two tangent lines that pass through a point define the tangent plane at that point.

Gradient Vector
• Is the vector that has as coordinates the partial derivatives of the function:
  $f(x, y) = 9 - x^2 - y^2$
  $\frac{\partial f}{\partial x} = -2x$, $\frac{\partial f}{\partial y} = -2y$
  $\nabla f = \frac{\partial f}{\partial x}\,\mathbf{i} + \frac{\partial f}{\partial y}\,\mathbf{j} = \left(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}\right) = (-2x, -2y)$
• Note: Gradient Vector is not parallel to tangent surface
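As a quick sanity check on the gradient above, the following minimal sketch compares the analytic gradient $(-2x, -2y)$ with a finite-difference estimate. The helper name `numeric_grad`, the step `h`, and the test point (1, 2) are illustrative assumptions, not part of the slides.

```python
def f(x, y):
    """Surface from the slides: f(x, y) = 9 - x^2 - y^2."""
    return 9 - x**2 - y**2

def grad_f(x, y):
    """Analytic gradient (df/dx, df/dy) = (-2x, -2y)."""
    return (-2 * x, -2 * y)

def numeric_grad(func, x, y, h=1e-6):
    """Central differences: vary one variable while holding the other constant."""
    dfdx = (func(x + h, y) - func(x - h, y)) / (2 * h)
    dfdy = (func(x, y + h) - func(x, y - h)) / (2 * h)
    return (dfdx, dfdy)

analytic = grad_f(1.0, 2.0)
approx = numeric_grad(f, 1.0, 2.0)
print("analytic:", analytic, "numeric:", approx)
```

Holding one variable fixed while perturbing the other is the "pretend y is constant" rule from the partial-derivative slides, expressed numerically.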

Gradient Descent Algorithm & Walkthrough
• Idea
  ▫ Start somewhere
  ▫ Take steps based on the gradient vector of the current position till convergence
• Convergence:
  ▫ happens when change between two steps < ε

Gradient Descent Code (Python)
  $f(x) = x^4 - 3x^3 + 2$
  $f'(x) = 4x^3 - 9x^2$
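The code shown on this slide did not survive in the text, so the following is only a minimal sketch of gradient descent on the function above. The starting point `x0 = 6.0`, the step size `0.01`, the tolerance `eps`, and the iteration cap are assumptions chosen for illustration.

```python
def f_prime(x):
    """Derivative of f(x) = x^4 - 3x^3 + 2."""
    return 4 * x**3 - 9 * x**2

def gradient_descent(x0, step_size=0.01, eps=1e-5, max_iters=10_000):
    """Step against the derivative until two successive iterates differ by < eps."""
    x = x0
    for _ in range(max_iters):
        x_new = x - step_size * f_prime(x)
        if abs(x_new - x) < eps:  # convergence test from the walkthrough slide
            return x_new
        x = x_new
    return x

x_min = gradient_descent(x0=6.0)
print("local minimum near x =", x_min)
```

Setting $f'(x) = x^2(4x - 9) = 0$ gives the minimizer $x = 9/4$, which the iterates should approach.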

Gradient Descent Algorithm & Walkthrough

Potential issues of gradient descent – Convexity
We need a convex function → so there is a global minimum:
  $f(x, y) = x^2 + y^2$

Potential issues of gradient descent – Convexity (2)

Potential issues of gradient descent – Step Size
• As we saw before, one parameter that needs to be set is the step size
• Bigger steps lead to faster convergence, right?
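To see why the answer is "not necessarily", here is a small sketch on the assumed toy function $f(x) = x^2$ (not taken from the slides). Each update multiplies x by $(1 - 2\alpha)$, so a step size above 1 makes the iterates grow instead of shrink. The two step sizes are illustrative.

```python
def run_gd(step_size, x0=1.0, iters=50):
    """Gradient descent on f(x) = x^2, whose derivative is 2x."""
    x = x0
    for _ in range(iters):
        x = x - step_size * 2 * x  # x is multiplied by (1 - 2 * step_size)
    return x

small = run_gd(0.1)  # factor 0.8 per step: converges toward the minimum at 0
large = run_gd(1.1)  # factor -1.2 per step: overshoots and diverges
print("step 0.1 ->", small)
print("step 1.1 ->", large)
```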

Alternative algorithms
• Newton's Method
  ▫ Approximates the function locally with a (quadratic) polynomial and jumps to the min of that polynomial
  ▫ Needs Hessian
• BFGS
  ▫ More complicated algorithm
  ▫ Commonly used in actual optimization packages
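A minimal one-dimensional sketch of Newton's method, reusing $f(x) = x^4 - 3x^3 + 2$ from the earlier code slide. In one dimension the Hessian reduces to the second derivative, so the update is $x \leftarrow x - f'(x)/f''(x)$. The starting point, tolerance, and iteration cap are assumptions.

```python
def f_prime(x):
    """First derivative of f(x) = x^4 - 3x^3 + 2."""
    return 4 * x**3 - 9 * x**2

def f_second(x):
    """Second derivative (the 1-D Hessian)."""
    return 12 * x**2 - 18 * x

def newton_minimize(x0, tol=1e-10, max_iters=50):
    """Jump to the minimum of the local quadratic model at each iteration."""
    x = x0
    for _ in range(max_iters):
        step = f_prime(x) / f_second(x)
        x -= step
        if abs(step) < tol:
            break
    return x

x_newton = newton_minimize(6.0)
print("Newton's method converged to x =", x_newton)
```

From the same starting point this typically needs far fewer iterations than plain gradient descent, at the cost of computing second derivatives.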

Stochastic Gradient Descent
• Motivation
  ▫ One way to think of gradient descent is as a minimization of a sum of functions:
     - $w = w - \alpha \nabla L(w) = w - \alpha \sum_i \nabla L_i(w)$
     - ($L_i$ is the loss function evaluated on the i-th element of the dataset)
     - On large datasets, it may be computationally expensive to iterate over the whole dataset, so pulling a subset of the data may perform better
     - Additionally, sampling the data leads to "noise" that can help avoid getting stuck in "shallow local minima." This is good for optimizing non-convex functions. (Murphy)

Stochastic Gradient Descent
• Online learning algorithm
• Instead of going through the entire dataset on each iteration, randomly sample and update the model

    Initialize w and α
    Until convergence do:
        Sample one example i from dataset   // stochastic portion
        w = w - α ∇L_i(w)
    return w
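The pseudocode above can be sketched in Python as follows. The toy dataset (points on the line y = 3x), the squared-error loss $L_i(w) = \tfrac{1}{2}(w x_i - y_i)^2$, the step size, and the fixed step count used in place of a convergence test are all assumptions made for illustration.

```python
import random

# Hypothetical toy dataset: y = 3x exactly, so the best w is 3.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0 * x for x in xs]

def grad_i(w, i):
    """Gradient of L_i(w) = 0.5 * (w * x_i - y_i)^2 with respect to w."""
    return (w * xs[i] - ys[i]) * xs[i]

def sgd(alpha=0.05, steps=200, seed=0):
    rng = random.Random(seed)
    w = 0.0                          # initialize w
    for _ in range(steps):           # fixed budget instead of "until convergence"
        i = rng.randrange(len(xs))   # sample one example (stochastic portion)
        w = w - alpha * grad_i(w, i)
    return w

w_sgd = sgd()
print("estimated w:", w_sgd)
```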

Stochastic Gradient Descent (2)
• Checking for convergence after each data example can be slow
• One can simulate stochasticity by reshuffling the dataset on each pass:

    Initialize w and α
    Until convergence do:
        shuffle dataset of n elements   // simulating stochasticity
        For each example i in n:
            w = w - α ∇L_i(w)
    return w

• This is generally faster than the classic iterative approach ("noise")
• However, you are still passing over the entire dataset each time
• An approach in the middle is to sample "batches", subsets of the entire dataset
• This can be parallelized!
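A minimal sketch of the reshuffling loop, extended with a `batch_size` parameter to illustrate the "batches" middle ground: `batch_size=1` reproduces the per-example pseudocode, and larger values average the gradient over a small subset. The toy data, loss, step size, and epoch count are assumptions.

```python
import random

# Hypothetical toy dataset: y = 3x exactly, so the best w is 3.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0 * x for x in xs]

def minibatch_sgd(batch_size=2, alpha=0.05, epochs=50, seed=0):
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(indices)  # reshuffle on each pass to simulate stochasticity
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average gradient of 0.5 * (w * x_i - y_i)^2 over the batch.
            grad = sum((w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w = w - alpha * grad
    return w

w_shuffled = minibatch_sgd(batch_size=1)  # one example per update
w_batched = minibatch_sgd(batch_size=2)   # mini-batches of two examples
print(w_shuffled, w_batched)
```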

Parallel Gradient Descent
• Training data is chunked into batches and distributed

    Initialize w and α
    Loop until convergence:
        generate randomly sampled chunk of data m
        on each worker machine v:
            ∇L_v(w) = sum(∇L_i(w))      // compute gradient on batch
        w = w - α * sum(∇L_v(w))        // update global w model
    return w
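A single-machine sketch of the loop above, with threads from Python's standard `concurrent.futures` module standing in for worker machines. The toy data, loss, batch size, worker count, and step size are assumptions, and because of CPython's global interpreter lock this shows the structure of the algorithm rather than a real speedup.

```python
import random
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy dataset: y = 3x exactly, so the best w is 3.
data = [(float(x), 3.0 * x) for x in range(1, 9)]

def chunk_gradient(w, chunk):
    """Worker step: sum of per-example gradients of 0.5 * (w*x - y)^2 on one chunk."""
    return sum((w * x - y) * x for x, y in chunk)

def parallel_gd(n_workers=2, batch_size=4, alpha=0.005, iters=200, seed=0):
    rng = random.Random(seed)
    w = 0.0
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for _ in range(iters):
            batch = rng.sample(data, k=batch_size)  # randomly sampled chunk m
            chunks = [batch[v::n_workers] for v in range(n_workers)]
            # Each worker computes the gradient on its share of the batch.
            grads = pool.map(chunk_gradient, [w] * n_workers, chunks)
            w -= alpha * sum(grads)  # update the global model once all workers finish
    return w

w_parallel = parallel_gd()
print("estimated w:", w_parallel)
```

The `sum(grads)` line is the synchronization point: no update happens until every worker has returned its partial gradient.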

HOGWILD! (Niu, et al. 2011)
• Unclear why it is called this
• Idea:
  ▫ In Parallel SGD, each batch needs to finish before starting the next pass
  ▫ In HOGWILD!, share the global model amongst all machines and update on-the-fly
     - No need to wait for all worker machines to finish before starting the next epoch
     - Assumption: component-wise addition is atomic and does not require locking
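The lock-free update pattern can be sketched with Python threads sharing one model and writing to it without any lock. This is an illustration of the idea only, not the implementation from Niu et al.: the sparse toy examples, the step size, and the thread count are assumptions, and in CPython the `-=` on a list element is not truly atomic, so occasional lost updates are possible and are simply tolerated here.

```python
import random
import threading

true_w = [1.0, -2.0, 0.5, 3.0]
# Hypothetical sparse examples (j, x, y): each touches only coordinate j, with y = true_w[j] * x.
examples = [(j, x, true_w[j] * x) for j in range(4) for x in (1.0, 2.0)]

w = [0.0] * 4  # shared global model, updated by all threads without a lock

def worker(seed, steps=2000, alpha=0.1):
    rng = random.Random(seed)  # per-thread generator
    for _ in range(steps):
        j, x, y = rng.choice(examples)
        grad_j = (w[j] * x - y) * x   # gradient of 0.5 * (w_j * x - y)^2
        w[j] -= alpha * grad_j        # unsynchronized component-wise update

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

max_err = max(abs(a - b) for a, b in zip(w, true_w))
print("model:", w, "max error:", max_err)
```

Because each example touches a single coordinate, threads rarely collide on the same component, which is the sparsity setting the slide's atomicity assumption relies on.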
