CS 6316 Machine Learning: Gradient Descent. Yangfeng Ji, Department of Computer Science, University of Virginia.



SLIDE 1

CS 6316 Machine Learning

Gradient Descent

Yangfeng Ji

Department of Computer Science University of Virginia

SLIDE 2

Overview

  • 1. Gradient Descent
  • 2. Stochastic Gradient Descent
  • 3. SGD with Momentum
  • 4. Adaptive Learning Rates


SLIDE 3

Gradient Descent

SLIDE 4

Learning as Optimization

As discussed before, learning can be viewed as an optimization problem.

◮ Training set S = {(x₁, y₁), . . . , (x_m, y_m)}

◮ Empirical risk

L(h_θ, S) = (1/m) ∑_{i=1}^m R(h_θ(x_i), y_i)   (1)

where R is the risk function

◮ Learning: minimize the empirical risk

θ ← argmin_{θ′} L(h_{θ′}, S)   (2)

SLIDE 8

Learning as Optimization (II)

Some examples of risk functions:

◮ Logistic regression

R(h_θ(x_i), y_i) = −log p(y_i | x_i; θ)   (3)

◮ Linear regression

R(h_θ(x_i), y_i) = ‖h_θ(x_i) − y_i‖₂²   (4)

◮ Neural network

R(h_θ(x_i), y_i) = Cross-entropy(h_θ(x_i), y_i)   (5)

◮ Perceptron and AdaBoost can also be viewed as minimizing certain loss functions


SLIDE 10

Constrained Optimization

The dual optimization problem for SVMs in the separable case is

max_α ∑_{i=1}^m α_i − (1/2) ∑_{i,j=1}^m α_i α_j y_i y_j ⟨x_i, x_j⟩   (6)

s.t. α_i ≥ 0, ∀i ∈ [m]   (7)

∑_{i=1}^m α_i y_i = 0   (8)

◮ The Lagrange multipliers α are also called dual variables
◮ This is an optimization problem only about α
◮ The dual problem is defined on the inner products ⟨x_i, x_j⟩

SLIDE 12

Optimization via Gradient Descent

The basic form of an optimization problem is

min f(θ) s.t. θ ∈ B   (9)

where f: ℝ^d → ℝ is the objective function and B ⊆ ℝ^d is the constraint set for θ, which usually can be formulated as a set of inequalities (e.g., SVM).

In this lecture:

◮ we focus only on unconstrained optimization problems, in other words, θ ∈ ℝ^d
◮ we assume f is convex and differentiable

SLIDE 13

Review: Gradient of a 1-D Function

Consider the gradient of this 1-dimensional function

y = f(x) = x² − x − 2   (10)

SLIDE 14

Review: Gradient of a 2-D Function

Now, consider a 2-dimensional function with x = (x₁, x₂):

y = f(x) = x₁² + 10x₂²   (11)

Here is the contour plot of this function. We are going to use this as our running example.

SLIDE 16

Gradient Descent

To learn the parameter θ, the learning algorithm updates it iteratively using the following three steps:

  • 1. Choose an initial point θ(0) ∈ ℝ^d
  • 2. Update

θ(t+1) ← θ(t) − η_t · ∇f(θ)|_{θ=θ(t)}   (12)

where η_t is the learning rate at time t

  • 3. Go back to step 2 until it converges

∇f(θ) is defined as

∇f(θ) = (∂f(θ)/∂θ₁, · · · , ∂f(θ)/∂θ_d)   (13)
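The update rule (12) can be sketched in a few lines. This is a minimal illustration, not code from the lecture; the running example f(θ) = θ₁² + 10θ₂² is taken from slide 14, while the step size and iteration count are arbitrary choices that happen to be stable for this function.

```python
import numpy as np

def grad_f(theta):
    # Gradient of the running example f(theta) = theta_1^2 + 10 * theta_2^2
    return np.array([2 * theta[0], 20 * theta[1]])

def gradient_descent(theta0, eta=0.05, steps=100):
    """Plain gradient descent: theta <- theta - eta * grad f(theta)."""
    theta = np.array(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - eta * grad_f(theta)
    return theta

theta_hat = gradient_descent([3.0, 2.0])  # should approach the minimizer (0, 0)
```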

SLIDE 17

Gradient Descent Interpretation

An intuitive justification of the gradient descent algorithm is to consider the following plot. The direction of the gradient is the direction in which the function increases fastest.

SLIDE 20

Gradient Descent Interpretation (II)

Theoretical justification:

◮ First-order Taylor approximation

f(θ + Δθ) ≈ f(θ) + ⟨Δθ, ∇f|_θ⟩   (14)

◮ In gradient descent, Δθ = −η∇f|_θ

◮ Therefore, we have

f(θ + Δθ) ≈ f(θ) + ⟨Δθ, ∇f|_θ⟩ = f(θ) − η⟨∇f, ∇f⟩|_θ = f(θ) − η‖∇f‖₂²|_θ ≤ f(θ)   (15)
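The chain in (15) can be checked numerically. Here is a small sketch of my own (not from the slides) on the running example: for a convex f, one gradient step decreases f, and the actual decrease never exceeds the first-order prediction η‖∇f‖₂².

```python
import numpy as np

def f(theta):
    return theta[0] ** 2 + 10 * theta[1] ** 2   # running example

def grad_f(theta):
    return np.array([2 * theta[0], 20 * theta[1]])

theta = np.array([3.0, 2.0])
eta = 0.01
g = grad_f(theta)
theta_new = theta - eta * g

# (15) predicts a decrease of about eta * ||grad f||_2^2;
# for convex f the actual decrease is positive and no larger than that.
predicted_drop = eta * np.dot(g, g)
actual_drop = f(theta) - f(theta_new)
```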

SLIDE 24

Gradient Descent Interpretation (III)

Consider the second-order Taylor approximation of f:

f(θ′) ≈ f(θ) + ∇f(θ)ᵀ(θ′ − θ) + (1/2)(θ′ − θ)ᵀ∇²f(θ)(θ′ − θ)

◮ Replacing the Hessian ∇²f(θ) with (1/η)I gives the quadratic approximation

f(θ′) ≈ f(θ) + ∇f(θ)ᵀ(θ′ − θ) + (1/(2η))(θ′ − θ)ᵀ(θ′ − θ)

◮ Minimize f(θ′) wrt θ′:

∂f(θ′)/∂θ′ ≈ ∇f(θ) + (1/η)(θ′ − θ) = 0  ⇒  θ′ = θ − η · ∇f(θ)   (16)

◮ Gradient descent chooses the next point θ′ to minimize this approximation of the function

SLIDE 27

Step size

θ(t+1) ← θ(t) − η_t · ∂f(θ)/∂θ |_{θ=θ(t)}   (17)

If we choose a fixed step size η_t = η₀, consider the following function:

f(θ) = (10θ₁² + θ₂²)/2

[Figure: three gradient descent trajectories with step sizes that are (a) too small, (b) too large, and (c) just right]
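The three regimes in the figure can be reproduced with a short experiment. This is my own sketch on the slide's function f(θ) = (10θ₁² + θ₂²)/2; the three step-size values are arbitrary choices meant to land in the three regimes.

```python
import numpy as np

def grad_f(theta):
    # f(theta) = (10 * theta_1^2 + theta_2^2) / 2, so grad f = (10*theta_1, theta_2)
    return np.array([10 * theta[0], theta[1]])

def run(eta, steps=50, theta0=(1.0, 1.0)):
    theta = np.array(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - eta * grad_f(theta)
    return theta

too_small = run(eta=0.0001)   # barely moves in 50 steps
too_large = run(eta=0.25)     # overshoots and diverges along theta_1
just_right = run(eta=0.1)     # converges toward the minimum at (0, 0)
```

The divergence threshold here is η = 0.2, since the largest curvature of f is 10 and the per-step multiplier along θ₁ is (1 − 10η).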


SLIDE 30

Optimal Step Sizes

◮ Exact Line Search: solve the one-dimensional subproblem

η_t ← argmin_{s≥0} f(θ − s∇f(θ))   (18)

◮ Backtracking Line Search: with parameters 0 < β < 1, 0 < α ≤ 1/2, and a large initial value of η_t, while

f(θ − η_t∇f(θ)) > f(θ) − αη_t‖∇f(θ)‖₂²   (19)

shrink η_t ← βη_t

◮ Usually, this is not worth the effort, since the computational complexity may be too high (e.g., when f is a neural network)
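Backtracking line search is simple enough to sketch directly from condition (19). This is a minimal illustration, not the lecture's code; the function and the parameter values α = 0.3, β = 0.8 are my own choices within the stated ranges.

```python
import numpy as np

def f(theta):
    return (10 * theta[0] ** 2 + theta[1] ** 2) / 2

def grad_f(theta):
    return np.array([10 * theta[0], theta[1]])

def backtracking_step(theta, eta0=1.0, alpha=0.3, beta=0.8):
    """Shrink eta until the sufficient-decrease condition (19) is satisfied."""
    g = grad_f(theta)
    eta = eta0
    while f(theta - eta * g) > f(theta) - alpha * eta * np.dot(g, g):
        eta *= beta
    return eta, theta - eta * g

theta = np.array([1.0, 1.0])
eta, theta_new = backtracking_step(theta)
```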


SLIDE 33

Convergence Analysis

◮ f is convex and differentiable; additionally,

‖∇f(θ) − ∇f(θ′)‖₂ ≤ L · ‖θ − θ′‖₂   (20)

for any θ, θ′ ∈ ℝ^d, where L is a fixed positive value

◮ Theorem: Gradient descent with fixed step size η₀ ≤ 1/L satisfies

f(θ(t)) − f* ≤ ‖θ(0) − θ*‖₂² / (2η₀t)   (21)

where f* is the optimal value and θ* is the optimal parameter

◮ The same result holds for backtracking, with η₀ replaced by β/L
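The bound (21) can be checked on the quadratic from the step-size slide, where the Lipschitz constant of the gradient is L = 10 (the largest Hessian eigenvalue), the optimum is θ* = (0, 0), and f* = 0. A small sketch of my own, not from the lecture:

```python
import numpy as np

def f(theta):
    return (10 * theta[0] ** 2 + theta[1] ** 2) / 2

def grad_f(theta):
    return np.array([10 * theta[0], theta[1]])

L = 10.0          # Lipschitz constant of grad f: the largest Hessian eigenvalue
eta0 = 1.0 / L    # fixed step size allowed by the theorem
theta0 = np.array([1.0, 1.0])
theta = theta0.copy()   # optimum is theta* = (0, 0) with f* = 0

bound_holds = True
for t in range(1, 51):
    theta = theta - eta0 * grad_f(theta)
    bound = np.dot(theta0, theta0) / (2 * eta0 * t)   # right-hand side of (21)
    bound_holds = bound_holds and (f(theta) <= bound)
```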

SLIDE 34

Stochastic Gradient Descent

SLIDE 35

Gradient Descent

Given a training set {(x_i, y_i)}_{i=1}^m, the loss function is defined as

L(h_θ, S) = (1/m) ∑_{i=1}^m R(h_θ(x_i), y_i)   (22)

where R is the risk function.

Examples:

◮ Logistic regression

R(h_θ(x_i), y_i) = −log p(y_i | x_i; θ)   (23)

◮ Linear regression

R(h_θ(x_i), y_i) = ‖h_θ(x_i) − y_i‖₂²   (24)


SLIDE 37

Gradient Descent (II)

◮ Consider the gradient of the loss function:

∇L(h_θ, S) = (1/m) ∑_{i=1}^m ∇R(h_θ(x_i), y_i)   (25)

◮ To simplify the notation, let f_i(θ) = R(h_θ(x_i), y_i) and f(θ) = L(h_θ, S); then

∇f(θ) = (1/m) ∑_{i=1}^m ∇f_i(θ)   (26)

SLIDE 38

Stochastic Gradient Descent

To learn the parameter θ, we can compute the gradient with one training example (x_i, y_i) at each time step and update the parameter as

θ(t+1) ← θ(t) − η_t · ∇f_i(θ)|_{θ=θ(t)}   (27)

where

◮ t is the time step
◮ ∇f_i(θ(t)) is the gradient of the single-example loss f_i
◮ η_t is the learning rate (step size)
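Update (27) can be sketched on a small linear-regression problem with the single-example squared loss f_i(θ) = ½(⟨θ, x_i⟩ − y_i)². This is my own illustration under assumed settings (synthetic noiseless data, a fixed step size, and a fixed random seed), not code from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 200, 3
theta_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(m, d))
y = X @ theta_true                      # noiseless data, so theta_true is recoverable

def grad_fi(theta, i):
    # Gradient of the single-example squared loss f_i(theta) = (1/2)(<theta, x_i> - y_i)^2
    return (X[i] @ theta - y[i]) * X[i]

theta = np.zeros(d)
eta = 0.05
for t in range(5000):
    i = rng.integers(m)                 # pick one example uniformly at random
    theta = theta - eta * grad_fi(theta, i)   # update (27)
```

Each iteration touches only one row of X, which is the memory and per-iteration-cost advantage discussed on the next slides.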

SLIDE 39

Stochastic?

Compare gradient descent and stochastic gradient descent. Since each step of SGD uses only the gradient from one training example, it can be viewed as a gradient descent method with some randomness.

SLIDE 40

Motivation

There are at least two motivations for using SGD:

◮ SGD can yield big savings in terms of memory usage
  ◮ learning with large-scale data
  ◮ models with lots of parameters
◮ The iteration cost of SGD is independent of the sample size m

SLIDE 41

Motivation (II)

An empirical comparison between SGD and a batch optimization method (L-BFGS) on a binary classification problem with logistic regression [Bottou et al., 2018]


SLIDE 44

How to Choose an Example

◮ Cyclic Rule: choose i ∈ (1, 2, . . . , m) in order
◮ Randomized Rule: at every iteration, choose i ∈ [m] uniformly at random
  ◮ Equivalently, shuffle the training examples at the end of each training epoch
◮ In practice, the randomized rule is more common, since under it

E[∇f_i(θ)] = (1/m) ∑_{i=1}^m ∇f_i(θ) = ∇f(θ)   (28)

i.e., ∇f_i(θ) is an unbiased estimate of ∇f(θ)
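The unbiasedness in (28) is easy to confirm empirically: averaging many uniformly sampled per-example gradients should approach the full gradient. A sketch of my own with assumed synthetic data (squared loss, fixed random seed):

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 100, 2
X = rng.normal(size=(m, d))
y = rng.normal(size=m)
theta = np.array([0.5, -1.0])

def grad_fi(i):
    # Squared-loss gradient for example i at the fixed point theta
    return (X[i] @ theta - y[i]) * X[i]

full_grad = np.mean([grad_fi(i) for i in range(m)], axis=0)

# Under the randomized rule, E[grad f_i] equals the full gradient (28),
# so a large Monte-Carlo average should land close to it.
samples = [grad_fi(rng.integers(m)) for _ in range(50000)]
mc_grad = np.mean(samples, axis=0)
```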


SLIDE 46

Convergence of SGD

The convergence of SGD usually requires diminishing step sizes.

◮ The usual conditions on the learning rates are

∑_{t=1}^∞ η_t = ∞,  ∑_{t=1}^∞ η_t² < ∞   (29)

◮ The simplest schedule that satisfies these conditions is

η_t = 1/t   (30)

[Bottou, 1998]

SLIDE 47

SGD with Momentum

SLIDE 48

Review: Vector Addition

The parallelogram law of vector addition:

c = a + b   (31)

SLIDE 49

SGD with Momentum

Given the loss function f(θ) to be minimized, SGD with momentum is given by

v(t) = µv(t−1) + ∇f(θ)|_{θ=θ(t−1)}   (32)

θ(t) = θ(t−1) − η_t v(t)   (33)

where

◮ η_t is still the learning rate
◮ µ ∈ [0, 1] is the momentum coefficient. Usually, µ = 0.99 or 0.999.
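Updates (32) and (33) can be sketched directly. This is my own illustration on the running example with full (non-stochastic) gradients; the values η = 0.01 and µ = 0.9 are arbitrary choices for a quick demo, smaller than the µ values quoted on the slide.

```python
import numpy as np

def grad_f(theta):
    return np.array([2 * theta[0], 20 * theta[1]])   # f = theta_1^2 + 10 * theta_2^2

def momentum_descent(theta0, eta=0.01, mu=0.9, steps=300):
    """Updates (32)-(33): v accumulates past gradients, theta moves along -v."""
    theta = np.array(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = mu * v + grad_f(theta)
        theta = theta - eta * v
    return theta

theta_hat = momentum_descent([3.0, 2.0])
```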


SLIDE 51

Intuitive Explanation

(Note: the arrows show the opposite direction of the gradient)

(a) SGD without momentum (b) SGD with momentum

Figure: The effect of momentum in SGD: reducing the fluctuation (Credit: Genevieve B. Orr)

SLIDE 52

Another Example with Contour Plot

Consider the following problem:

y = x₁² + 10x₂²   (34)

∂y/∂x₁ = 2x₁,  ∂y/∂x₂ = 20x₂   (35)

Note: the arrows show the opposite direction of the gradient

SLIDE 53

Another Example with Contour Plot (Cont.)

Adding the previous update direction reduces the fluctuation of the stochastic gradients:

v(t) = µv(t−1) + g(t−1)   (36)

Note: the arrows show the opposite direction of the gradient

SLIDE 54

Adaptive Learning Rates

SLIDE 55

Basic Idea

The basic idea of using adaptive learning rates is to make sure that all θ_k's converge at roughly the same speed. For neural networks, the motivation of picking a different learning rate for each θ_k (the k-th component of the parameter θ) is not new [LeCun et al., 2012] (the article was originally published in 1998).


SLIDE 58

AdaGrad

The basic idea of AdaGrad [Duchi et al., 2011] is to modify the learning rate η for each θ_k by using the history of the gradients:

θ_k(t) = θ_k(t−1) − η₀ / (√(G_{k,k}(t−1)) + ǫ) · g_k(t−1)   (37)

◮ g_k(t−1) = [∇f(θ)|_{θ(t−1)}]_k is the k-th component of ∇f(θ)|_{θ(t−1)}
◮ G_{k,k}(t−1) = ∑_{i=1}^{t−1} (g_k(i))²
◮ η₀ is the initial learning rate
◮ ǫ is a smoothing parameter, usually of order 10⁻⁶
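The per-coordinate update (37) can be sketched as follows. This is my own illustration on the running example with full gradients; η₀ = 1.0 and the step count are arbitrary demo choices.

```python
import numpy as np

def grad_f(theta):
    return np.array([2 * theta[0], 20 * theta[1]])   # f = theta_1^2 + 10 * theta_2^2

def adagrad(theta0, eta0=1.0, eps=1e-6, steps=500):
    """Per-coordinate update (37): scale eta0 by the root of accumulated squared gradients."""
    theta = np.array(theta0, dtype=float)
    G = np.zeros_like(theta)    # running sum of squared gradients, one entry per coordinate
    for _ in range(steps):
        g = grad_f(theta)
        G += g ** 2
        theta = theta - eta0 / (np.sqrt(G) + eps) * g
    return theta

theta_hat = adagrad([3.0, 2.0])
```

Note how the coordinate with the larger gradients (θ₂ here) accumulates a larger G and therefore takes smaller effective steps, matching the intuition on the next slide.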


SLIDE 60

AdaGrad: Intuitive Explanation

Consider the gradient of a 2-dimensional optimization problem with θ = (θ₁, θ₂):

θ_k(t) = θ_k(t−1) − η₀ / (√(G_{k,k}(t−1)) + ǫ) · g_k(t−1)   (38)

The magnitude of the gradient along θ₂ is often larger than along θ₁. AdaGrad helps shrink the step sizes along θ₂, which allows both coordinates to converge at roughly the same speed.

SLIDE 61

RMSProp

RMSProp (Root Mean Square Propagation) uses a moving average of the past gradients:

θ_k(t) = θ_k(t−1) − η₀ / (√(r_k(t)) + ǫ) · g_k(t−1)   (39)

where

r_k(t) = ρ r_k(t−1) + (1 − ρ)[g_k(t−1)]²   (40)

and ρ ∈ (0, 1), k is the dimension index, and t is the time step [Hinton et al., 2012]
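Updates (39) and (40) can be sketched as below. A sketch of my own on the running example with assumed values η₀ = 0.05 and ρ = 0.9; with a constant step size RMSProp typically hovers in a small neighborhood of the optimum rather than converging exactly.

```python
import numpy as np

def f(theta):
    return theta[0] ** 2 + 10 * theta[1] ** 2

def grad_f(theta):
    return np.array([2 * theta[0], 20 * theta[1]])

def rmsprop(theta0, eta0=0.05, rho=0.9, eps=1e-6, steps=1000):
    """Updates (39)-(40): divide the step by a moving average of squared gradients."""
    theta = np.array(theta0, dtype=float)
    r = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_f(theta)
        r = rho * r + (1 - rho) * g ** 2
        theta = theta - eta0 / (np.sqrt(r) + eps) * g
    return theta

theta_hat = rmsprop([3.0, 2.0])
```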


SLIDE 63

Adam

The Adam algorithm [Kingma and Ba, 2014] combines the ideas of SGD with momentum and RMSProp:

v_k(t) = µ v_k(t−1) + (1 − µ) g_k(t−1)   (41)

r_k(t) = ρ r_k(t−1) + (1 − ρ)[g_k(t−1)]²   (42)

v̂_k(t) = v_k(t) / (1 − µᵗ)   (43)

r̂_k(t) = r_k(t) / (1 − ρᵗ)   (44)

θ_k(t) = θ_k(t−1) − η₀ · v̂_k(t) / (√(r̂_k(t)) + ǫ)   (45)

The default values of µ and ρ are 0.9 and 0.999, respectively.

SLIDE 65

How to Choose an Optimization Algorithm?

[Hinton et al., 2012, Lecture Notes in 2012]

SLIDE 66

References

Bottou, L. (1998). Online learning and stochastic approximations. On-line Learning in Neural Networks, 17(9):142.

Bottou, L., Curtis, F. E., and Nocedal, J. (2018). Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311.

Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.

Hinton, G., Srivastava, N., and Swersky, K. (2012). Neural networks for machine learning, Lecture 6a: Overview of mini-batch gradient descent.

Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

LeCun, Y. A., Bottou, L., Orr, G. B., and Müller, K.-R. (2012). Efficient BackProp. In Neural Networks: Tricks of the Trade, pages 9–48. Springer.