Stochastic Gradient Descent, 10701 Recitations 3, Mu Li



SLIDE 1

Stochastic Gradient Descent

10701 Recitations 3 Mu Li

Computer Science Department, Carnegie Mellon University

February 5, 2013

SLIDE 2

The problem

◮ A typical machine learning problem has a penalty/regularizer + loss form

min_w F(w) = g(w) + (1/n) ∑_{i=1}^n f(w; yi, xi),   xi, w ∈ R^p, yi ∈ R, both g and f are convex

◮ Today we only consider differentiable f, and let g = 0 for simplicity

◮ For example, let f(w; yi, xi) = −log p(yi|xi, w); then we are trying to maximize the log likelihood, which is

max_w (1/n) ∑_{i=1}^n log p(yi|xi, w)
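To make the objective concrete, here is a minimal sketch (not from the slides) that instantiates f as the logistic negative log-likelihood with g = 0, for labels yi ∈ {−1, +1}; the names `f`, `F`, and the toy data are illustrative assumptions:

```python
import math

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def f(w, y, x):
    # per-sample loss: logistic negative log-likelihood,
    # f(w; y, x) = log(1 + exp(-y * <w, x>)), labels y in {-1, +1}
    return math.log1p(math.exp(-y * dot(w, x)))

def F(w, ys, xs):
    # empirical objective with g(w) = 0: (1/n) * sum_i f(w; y_i, x_i)
    return sum(f(w, y, x) for y, x in zip(ys, xs)) / len(ys)

# tiny illustrative dataset in R^2
xs = [[1.0, 0.0], [0.0, 1.0]]
ys = [1.0, -1.0]
loss_at_zero = F([0.0, 0.0], ys, xs)  # every sample contributes log(2) at w = 0
```

Minimizing this F over w is exactly the maximum-likelihood problem on the slide.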

SLIDE 3

Gradient Descent

◮ choose initial w(0), repeat

w(t+1) = w(t) − ηt · ∇F(w(t))

until stop

◮ ηt is the learning rate, and

∇F(w(t)) = (1/n) ∑_i ∇w f(w(t); yi, xi)

◮ How to stop? ‖w(t+1) − w(t)‖ ≤ ε or ‖∇F(w(t))‖ ≤ ε

Two dimensional example: [figure omitted]
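The loop above can be sketched in code. As an assumption, squared loss 0.5·(⟨w, x⟩ − y)² stands in for the differentiable f, with a fixed learning rate and the gradient-norm stopping rule:

```python
def grad_f(w, y, x):
    # gradient of the squared loss f(w; y, x) = 0.5 * (<w, x> - y)^2
    r = sum(wj * xj for wj, xj in zip(w, x)) - y
    return [r * xj for xj in x]

def gradient_descent(ys, xs, eta=0.1, steps=500, eps=1e-10):
    w = [0.0] * len(xs[0])                     # initial w(0)
    for _ in range(steps):
        # full gradient: (1/n) * sum_i grad_w f(w; yi, xi)
        g = [0.0] * len(w)
        for y, x in zip(ys, xs):
            for j, gj in enumerate(grad_f(w, y, x)):
                g[j] += gj / len(xs)
        w = [wj - eta * gj for wj, gj in zip(w, g)]
        if sum(gj * gj for gj in g) ** 0.5 <= eps:   # stopping rule
            break
    return w

# toy data exactly fit by w* = [2, -1], so GD should approach it
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [2.0, -1.0, 1.0]
w_hat = gradient_descent(ys, xs)
```

Each iteration touches all n samples once, which is what makes full GD expensive per step.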

SLIDE 4

Learning rate matters

ηt = t: it is too big. A too-small ηt: still not converged after 100 iterations. [figures omitted]

SLIDE 5

Backtracking line search

Adaptively choose the learning rate

◮ choose a parameter 0 < β < 1

◮ start with η = 1, repeat for t = 0, 1, . . .

    ◮ while L(w(t) − η∇L(w(t))) > L(w(t)) − (η/2)‖∇L(w(t))‖², update η = βη

    ◮ w(t+1) = w(t) − η∇L(w(t))
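A sketch of the procedure above. `L` and `grad_L` are caller-supplied, and the quadratic test objective at the bottom is an assumption for illustration:

```python
def backtracking_gd(L, grad_L, w0, beta=0.8, steps=100, eps=1e-8):
    # gradient descent, choosing eta by backtracking line search each step
    w = list(w0)
    for _ in range(steps):
        g = grad_L(w)
        g_norm2 = sum(gj * gj for gj in g)
        if g_norm2 ** 0.5 <= eps:
            break
        eta = 1.0                               # start each step with eta = 1
        # shrink eta until L(w - eta*g) <= L(w) - (eta/2) * ||g||^2
        while L([wj - eta * gj for wj, gj in zip(w, g)]) > L(w) - 0.5 * eta * g_norm2:
            eta *= beta
        w = [wj - eta * gj for wj, gj in zip(w, g)]
    return w

# example objective (an assumption): L(w) = w1^2 + 10 * w2^2, minimum at the origin
L = lambda w: w[0] ** 2 + 10.0 * w[1] ** 2
grad_L = lambda w: [2.0 * w[0], 20.0 * w[1]]
w_min = backtracking_gd(L, grad_L, [1.0, 1.0])
```

The inner while loop is guaranteed to terminate for a convex L with Lipschitz gradient, since the sufficient-decrease condition holds for all small enough η.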

SLIDE 6

Backtracking line search

A typical choice is β = 0.8; converged after 13 iterations: [figure omitted]

SLIDE 7

Stochastic Gradient Descent

◮ We name (1/n) ∑_i f(w; yi, xi) the empirical loss; the thing we hope to minimize is the expected loss f(w) = E_{yi,xi} f(w; yi, xi)

◮ Suppose we receive an infinite stream of samples (yt, xt) from the distribution; one way to optimize the objective is

w(t+1) = w(t) − ηt ∇w f(w(t); yt, xt)

◮ In practice, we simulate the stream by randomly picking (yt, xt) from the samples we have

◮ Compare with the average gradient of GD, (1/n) ∑_i ∇w f(w(t); yi, xi): SGD uses the gradient of a single sample per step
SLIDE 8

More about SGD

◮ the objective does not always decrease at each step

◮ compared to GD, SGD needs more steps, but each step is cheaper

◮ mini-batch (say, pick 100 samples and average their gradients) may accelerate the convergence
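A mini-batch variant of the same sketch: each step averages the gradient over a small random batch (batch size 2 here to fit the toy data; the slide suggests around 100 in practice). The squared-loss f remains an assumption:

```python
import random

def minibatch_sgd(ys, xs, batch=2, eta=0.1, steps=1000, seed=0):
    # mini-batch SGD on squared loss: the averaged batch gradient has
    # lower variance than a single-sample gradient, at batch-times the cost
    rng = random.Random(seed)
    w = [0.0] * len(xs[0])
    for _ in range(steps):
        idx = [rng.randrange(len(xs)) for _ in range(batch)]
        g = [0.0] * len(w)
        for i in idx:
            r = sum(wj * xj for wj, xj in zip(w, xs[i])) - ys[i]
            for j in range(len(w)):
                g[j] += r * xs[i][j] / batch
        w = [wj - eta * gj for wj, gj in zip(w, g)]
    return w

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [2.0, -1.0, 1.0]   # exactly realizable by w* = [2, -1]
w_hat = minibatch_sgd(ys, xs)
```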

SLIDE 9

Relation to Perceptron

◮ Recall Perceptron: initialize w, repeat over the samples

    w = w + yi·xi   if yi⟨w, xi⟩ < 0
    w = w           otherwise

◮ Fix the learning rate η = 1 and let f(w; y, x) = max(0, −y⟨w, x⟩); then

    ∇w f(w; yi, xi) = −yi·xi   if yi⟨w, xi⟩ < 0
    ∇w f(w; yi, xi) = 0        otherwise

so the SGD update w = w − η∇w f(w; yi, xi) reproduces the update above: we derive Perceptron from SGD
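The derivation can be checked in code with a sketch like the one below. One hedge: it updates on yi⟨w, xi⟩ ≤ 0 rather than the slide's strict < 0, a common convention so that training starting from w = 0 (where every inner product is exactly 0) makes progress; the toy data is an assumption:

```python
def perceptron_step(w, y, x):
    # SGD step with eta = 1 on f(w; y, x) = max(0, -y * <w, x>):
    # the (sub)gradient is -y*x on a mistake, 0 otherwise.
    # NOTE: uses <= 0 (not the slide's strict < 0) so w = 0 still updates
    if y * sum(wj * xj for wj, xj in zip(w, x)) <= 0:
        return [wj + y * xj for wj, xj in zip(w, x)]
    return list(w)

def perceptron(ys, xs, epochs=10):
    w = [0.0] * len(xs[0])
    for _ in range(epochs):
        for y, x in zip(ys, xs):
            w = perceptron_step(w, y, x)
    return w

# linearly separable toy data: label is the sign of the first coordinate
xs = [[1.0, 0.5], [2.0, -1.0], [-1.0, 0.3], [-2.0, -0.5]]
ys = [1.0, 1.0, -1.0, -1.0]
w_hat = perceptron(ys, xs)
mistakes = sum(1 for y, x in zip(ys, xs)
               if y * sum(wj * xj for wj, xj in zip(w_hat, x)) <= 0)
```

On separable data the classic Perceptron convergence theorem guarantees a finite number of updates, so `mistakes` reaches 0.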

SLIDE 10

Questions?