

slide-1
SLIDE 1

Stochastic Algorithms in Machine Learning

Aymeric DIEULEVEUT

EPFL, Lausanne

December 1st, 2017, Journée Algorithmes Stochastiques, Paris Dauphine

1

slide-2
SLIDE 2

Outline

  • 1. Machine learning context.
  • 2. Stochastic algorithms to minimize Empirical Risk.
  • 3. Stochastic Approximation: using stochastic gradient descent (SGD) to minimize Generalization Risk.
  • 4. Markov chain: an insightful point of view on constant step size Stochastic Approximation.

2

slide-3
SLIDE 3

Supervised Machine Learning

Goal: predict a phenomenon from "explanatory variables", given a set of observations.

Bio-informatics. Input: DNA/RNA sequence. Output: disease predisposition / drug responsiveness. n ≈ 10 to 10⁴; d (e.g., number of bases) ≈ 10⁶.

Image classification. Input: handwritten digits / images. Output: digit. n up to 10⁹; d (e.g., number of pixels) ≈ 10⁶.

"Large scale" learning framework: both the number of examples n and the number of explanatory variables d are large.

3

slide-4
SLIDE 4

Supervised Machine Learning

◮ Consider an input/output pair (X, Y) ∈ 𝒳 × 𝒴, following some unknown distribution ρ.

◮ 𝒴 = ℝ (regression) or {−1, 1} (classification).

◮ Goal: find a function θ : 𝒳 → ℝ such that θ(X) is a good prediction for Y.

◮ Prediction as a linear function ⟨θ, Φ(X)⟩ of features Φ(X) ∈ ℝᵈ.

◮ Consider a loss function ℓ : 𝒴 × ℝ → ℝ₊: squared loss, logistic loss, 0-1 loss, etc.

◮ Define the generalization risk (a.k.a. generalization error, "true risk") as R(θ) := E_ρ[ℓ(Y, ⟨θ, Φ(X)⟩)].

4

slide-5
SLIDE 5

Empirical Risk minimization (I)

◮ Data: n observations (xᵢ, yᵢ) ∈ 𝒳 × 𝒴, i = 1, …, n, i.i.d.

◮ n very large, up to 10⁹; computer vision: d = 10⁴ to 10⁶.

◮ Empirical risk (or training error): R̂(θ) = (1/n) Σᵢ₌₁ⁿ ℓ(yᵢ, ⟨θ, Φ(xᵢ)⟩).

◮ Empirical risk minimization (ERM), regularized: find θ̂ solution of

    min_{θ∈ℝᵈ} (1/n) Σᵢ₌₁ⁿ ℓ(yᵢ, ⟨θ, Φ(xᵢ)⟩) + µΩ(θ),

a convex data-fitting term plus a regularizer.

5

slide-6
SLIDE 6

Empirical Risk minimization (II)

For example, least-squares regression:

    min_{θ∈ℝᵈ} (1/2n) Σᵢ₌₁ⁿ (yᵢ − ⟨θ, Φ(xᵢ)⟩)² + µΩ(θ),

and logistic regression:

    min_{θ∈ℝᵈ} (1/n) Σᵢ₌₁ⁿ log(1 + exp(−yᵢ⟨θ, Φ(xᵢ)⟩)) + µΩ(θ).

Two fundamental questions: (1) computing θ̂; (2) analyzing θ̂.
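The two regularized objectives above can be written down in a few lines. This is an illustrative sketch: the toy data, the choice Ω(θ) = ‖θ‖²/2, and all variable names are assumptions, not part of the talk.

```python
import numpy as np

# Toy versions of the two regularized ERM objectives.
rng = np.random.default_rng(0)
n, d = 50, 5
Phi = rng.standard_normal((n, d))        # rows are feature vectors Phi(x_i)
y_reg = Phi @ rng.standard_normal(d)     # real targets for least squares
y_cls = np.where(y_reg > 0, 1.0, -1.0)   # +/-1 labels for logistic regression
mu = 0.1                                 # regularization strength (assumption)

def least_squares_objective(theta):
    # (1/2n) sum_i (y_i - <theta, Phi(x_i)>)^2 + mu * Omega(theta)
    r = y_reg - Phi @ theta
    return (r @ r) / (2 * n) + 0.5 * mu * (theta @ theta)

def logistic_objective(theta):
    # (1/n) sum_i log(1 + exp(-y_i <theta, Phi(x_i)>)) + mu * Omega(theta)
    margins = y_cls * (Phi @ theta)
    return np.mean(np.logaddexp(0.0, -margins)) + 0.5 * mu * (theta @ theta)

theta0 = np.zeros(d)
print(least_squares_objective(theta0), logistic_objective(theta0))
```

At θ = 0 the logistic objective equals log 2, a quick sanity check on the data-fitting term.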

Take home

◮ The problem is formalized as a (convex) optimization problem.

◮ In the large-scale setting, the problem is high-dimensional and has many examples.

6

slide-7
SLIDE 7

Stochastic algorithms for ERM

min_{θ∈ℝᵈ} { R̂(θ) = (1/n) Σᵢ₌₁ⁿ ℓ(yᵢ, ⟨θ, Φ(xᵢ)⟩) }.

  • 1. High dimension d ⟹ first-order algorithms. Gradient Descent (GD): θₖ = θₖ₋₁ − γₖ R̂′(θₖ₋₁). Problem: computing the gradient costs O(dn) per iteration.

  • 2. Large n ⟹ stochastic algorithms: Stochastic Gradient Descent (SGD).

7

slide-8
SLIDE 8

Stochastic Gradient descent

◮ Goal: min_{θ∈ℝᵈ} f(θ), given unbiased gradient estimates f′ₙ.

◮ θ∗ := argmin_{ℝᵈ} f(θ).

[Figure: SGD iterates θ₀, θ₁, … around θ∗]

8


slide-9
SLIDE 9

SGD for ERM: f = ˆ R

Loss for a single pair of observations, for any j ≤ n: f_j(θ) := ℓ(y_j, ⟨θ, Φ(x_j)⟩). One observation at each step ⟹ complexity O(d) per iteration.

For the empirical risk R̂(θ) = (1/n) Σᵢ₌₁ⁿ ℓ(yᵢ, ⟨θ, Φ(xᵢ)⟩):

◮ At each step k ∈ ℕ∗, sample Iₖ ∼ 𝒰{1, …, n}, and use f′_{Iₖ}(θₖ₋₁) = ℓ′(y_{Iₖ}, ⟨θₖ₋₁, Φ(x_{Iₖ})⟩).

◮ With Fₖ = σ((xᵢ, yᵢ)₁≤ᵢ≤ₙ, (Iᵢ)₁≤ᵢ≤ₖ):

    E[f′_{Iₖ}(θₖ₋₁) | Fₖ₋₁] = (1/n) Σᵢ₌₁ⁿ ℓ′(yᵢ, ⟨θₖ₋₁, Φ(xᵢ)⟩) = R̂′(θₖ₋₁).

Mathematical framework: smoothness and/or strong convexity.
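The SGD-for-ERM recipe above fits in a short loop. A minimal sketch for least squares; the data, step size, and iteration count are illustrative assumptions:

```python
import numpy as np

# SGD on the empirical risk of least squares: at each step, sample I_k uniformly
# in {1,...,n} and use the single-example gradient, an unbiased estimate of
# R_hat'(theta). Cost: O(d) per iteration instead of O(dn) for full GD.
rng = np.random.default_rng(1)
n, d = 200, 10
Phi = rng.standard_normal((n, d))
theta_true = rng.standard_normal(d)
y = Phi @ theta_true + 0.1 * rng.standard_normal(n)

def emp_risk(theta):
    return 0.5 * np.mean((y - Phi @ theta) ** 2)

theta = np.zeros(d)
gamma = 0.05                                    # constant step size (assumption)
for _ in range(5000):
    i = rng.integers(n)                         # I_k ~ Uniform{1,...,n}
    grad_i = -(y[i] - Phi[i] @ theta) * Phi[i]  # gradient of f_i at theta
    theta = theta - gamma * grad_i

print(emp_risk(np.zeros(d)), emp_risk(theta))
```

The final empirical risk is far below the risk at the initial point θ₀ = 0.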

9

slide-10
SLIDE 10

Mathematical framework: Smoothness

◮ A function g : ℝᵈ → ℝ is L-smooth if and only if it is twice differentiable and, ∀θ ∈ ℝᵈ, eigenvalues[g″(θ)] ≤ L.

For all θ, θ′ ∈ ℝᵈ:  g(θ) ≤ g(θ′) + ⟨g′(θ′), θ − θ′⟩ + (L/2)‖θ − θ′‖².

10

slide-11
SLIDE 11

Mathematical framework: Strong Convexity

◮ A twice differentiable function g : ℝᵈ → ℝ is µ-strongly convex if and only if, ∀θ ∈ ℝᵈ, eigenvalues[g″(θ)] ≥ µ.

For all θ, θ′ ∈ ℝᵈ:  g(θ) ≥ g(θ′) + ⟨g′(θ′), θ − θ′⟩ + (µ/2)‖θ − θ′‖².

11

slide-12
SLIDE 12

Application to machine learning

◮ We consider a loss that is almost surely convex in θ. Thus R̂ and R are convex.

◮ The Hessian of R̂ is approximately a covariance matrix:

    R̂″(θ) = (1/n) Σᵢ₌₁ⁿ ℓ″(yᵢ, ⟨θ, Φ(xᵢ)⟩) Φ(xᵢ)Φ(xᵢ)ᵀ  ≈  (1/n) Σᵢ₌₁ⁿ Φ(xᵢ)Φ(xᵢ)ᵀ  (≃ E[Φ(X)Φ(X)ᵀ]).

◮ If ℓ is smooth and E[‖Φ(X)‖²] ≤ r², then R is smooth.

◮ If ℓ is µ-strongly convex and the data has an invertible covariance matrix (low correlation/dimension), then R is strongly convex.

12

slide-13
SLIDE 13

Analysis: behaviour of (θₖ)ₖ≥₀

    θₖ = θₖ₋₁ − γₖ f′ₖ(θₖ₋₁)

Importance of the learning rate (or sequence of step sizes) (γₖ)ₖ≥₀. For smooth and strongly convex problems, traditional analysis (Fabian, 1968; Robbins and Siegmund, 1985) shows that θₖ → θ∗ almost surely if

    Σₖ₌₁^∞ γₖ = ∞  and  Σₖ₌₁^∞ γₖ² < ∞,

and asymptotic normality: √k (θₖ − θ∗) →d 𝒩(0, V), for γₖ = γ₀/k, γ₀ ≥ 1/µ.

◮ The limit variance scales as 1/µ².
◮ Very sensitive to ill-conditioned problems.
◮ µ is generally unknown, so the step size is hard to choose...

13

slide-14
SLIDE 14

Polyak Ruppert averaging

Introduced by Polyak and Juditsky (1992) and Ruppert (1988):

    θ̄ₖ = (1/(k+1)) Σᵢ₌₀ᵏ θᵢ.

[Figure: raw iterates θ₁, …, θₙ and averaged iterates θ̄₁, …, θ̄ₙ around θ∗]

◮ Offline averaging reduces the noise effect.

◮ Online computation: θ̄ₖ₊₁ = (1/(k+2)) θₖ₊₁ + ((k+1)/(k+2)) θ̄ₖ.

◮ One could also consider other averaging schemes (e.g., Lacoste-Julien et al. (2012)).
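The online averaging recursion keeps the running mean of the iterates in O(d) memory. A 1-D sketch; the quadratic toy objective f(t) = t²/2 with additive Gaussian gradient noise is an assumption for illustration:

```python
import numpy as np

# Online Polyak-Ruppert averaging: theta_bar_k is the running mean of
# theta_0, ..., theta_k, updated without storing past iterates.
rng = np.random.default_rng(2)
gamma = 0.1
theta = 5.0
theta_bar = theta                # average of the single iterate theta_0
iterates = [theta]               # kept only to verify the recursion below
for k in range(1, 2001):
    grad = theta + rng.normal()  # noisy gradient of f(t) = t^2/2, minimizer 0
    theta -= gamma * grad
    # running-mean recursion: theta_bar_k = theta_k/(k+1) + k/(k+1)*theta_bar_{k-1}
    theta_bar = theta / (k + 1) + k * theta_bar / (k + 1)
    iterates.append(theta)

print(theta, theta_bar)
```

The recursion reproduces the exact mean of all iterates, and the average sits much closer to θ∗ = 0 than the fluctuating final iterate typically does.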

14


slide-15
SLIDE 15

Convex stochastic approximation: convergence

◮ Known global minimax rates of convergence for non-smooth problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012):

◮ Strongly convex: O((µk)⁻¹), attained by averaged stochastic gradient descent with γₖ ∝ (µk)⁻¹.

◮ Non-strongly convex: O(1/√k), attained by averaged stochastic gradient descent with γₖ ∝ 1/√k.

Smooth strongly convex problems:

◮ Rate O(1/(µk)) for γₖ ∝ 1/√k: adapts to strong convexity.

15

slide-16
SLIDE 16

Convergence rate for f(θ̃ₖ) − f(θ∗), smooth f.

                 min R̂                                          min R
            SGD          GD           SAG                        SGD
Convex      O(1/√k)      O(1/k)       –                          O(1/√k)
Stgly-Cvx   O(1/(µk))    O(e^(−µk))   O((1 − (µ ∧ 1/n))^k)       O(1/(µk))

⊖ A gradient descent update costs n times as much as an SGD update. Can we get the best of both worlds?

16


slide-18
SLIDE 18

Methods for finite sum minimization

◮ GD: at step k, use (1/n) Σᵢ₌₁ⁿ f′ᵢ(θₖ).

◮ SGD: at step k, sample iₖ ∼ 𝒰{1, …, n}, use f′_{iₖ}(θₖ).

◮ SAG: at step k,
  ◮ keep a "full gradient" (1/n) Σᵢ₌₁ⁿ f′ᵢ(θ_{kᵢ}), with θ_{kᵢ} ∈ {θ₁, …, θₖ};
  ◮ sample iₖ ∼ 𝒰{1, …, n}, use (1/n) ( Σᵢ₌₁ⁿ f′ᵢ(θ_{kᵢ}) − f′_{iₖ}(θ_{k_{iₖ}}) + f′_{iₖ}(θₖ) ).

⊕ An update costs the same as an SGD update. ⊖ Needs to store all gradients f′ᵢ(θ_{kᵢ}) at "points in the past".
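The SAG update above can be sketched for a least-squares finite sum. The data, step size, and iteration count are illustrative assumptions, not the tuning from Schmidt et al. (2013):

```python
import numpy as np

# SAG sketch: store the last gradient seen for each index i, refresh one entry
# per step, and move along the average of the stored gradients.
rng = np.random.default_rng(3)
n, d = 100, 5
Phi = rng.standard_normal((n, d))
y = Phi @ rng.standard_normal(d)              # noiseless targets: risk can reach 0

grad_table = np.zeros((n, d))                 # f_i' at "points in the past"
grad_sum = np.zeros(d)                        # sum of table rows, kept incrementally
theta = np.zeros(d)
gamma = 0.02                                  # step size (assumption)
for _ in range(30000):
    i = rng.integers(n)
    g_new = -(y[i] - Phi[i] @ theta) * Phi[i]  # fresh gradient f_i'(theta_k)
    grad_sum += g_new - grad_table[i]          # swap the stale entry for the fresh one
    grad_table[i] = g_new
    theta -= gamma * grad_sum / n              # same O(d) cost per step as SGD

print(0.5 * np.mean((y - Phi @ theta) ** 2))
```

The incremental bookkeeping keeps `grad_sum` equal to the sum of stored gradients, and on this noiseless problem the empirical risk is driven essentially to zero, illustrating the linear rate.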

Some references:

◮ SAG: Schmidt et al. (2013); SAGA: Defazio et al. (2014a).
◮ SVRG: Johnson and Zhang (2013) (reduces the memory cost, but 2 epochs...).
◮ FINITO: Defazio et al. (2014b).
◮ S2GD: Konečný and Richtárik (2013)... and many others. See for example Niao He's lecture notes for a nice overview.

17

slide-19
SLIDE 19

Convergence rate for f(θ̃ₖ) − f(θ∗), smooth objective f.

[Same rate table as above.]

GD, SGD, SAG (Fig. from Schmidt et al. (2013)).

18

slide-20
SLIDE 20

Convergence rate for f(θ̃ₖ) − f(θ∗), smooth objective f.

[Same rate table as above, with a row of lower bounds α, β, γ added.]

α: stochastic optimization information-theoretic lower bounds, Agarwal et al. (2012); β: black-box first-order optimization, Nesterov (2004); γ: lower bounds for optimizing finite sums, Agarwal and Bottou (2014); Arjevani and Shamir (2016).

19

slide-21
SLIDE 21

Convergence rate for f(θ̃ₖ) − f(θ∗), smooth objective f.

[Same rate table as above, with accelerated gradient descent (AGD) in place of GD: convex O(1/k²), strongly convex O(e^(−√µ k)).]

Lower bounds: α: stochastic optimization information-theoretic lower bounds, Agarwal et al. (2012); β: black-box first-order optimization, Nesterov (2004); γ: lower bounds for optimizing finite sums, Agarwal and Bottou (2014).

20

slide-22
SLIDE 22

Take home

Stochastic algorithms for Empirical Risk Minimization.

◮ Several algorithms optimize the empirical risk; the most efficient ones are stochastic and rely on the finite-sum structure.

◮ These are stochastic algorithms used to optimize a deterministic function.

◮ Rates depend on the regularity of the function.

21

slide-23
SLIDE 23

What about the generalization risk?

Generalization guarantees:

◮ Uniform upper bound sup_θ |R̂(θ) − R(θ)| (empirical process theory).

◮ More precise: localized complexities (Bartlett et al., 2002), stability (Bousquet and Elisseeff, 2002).

Problems for ERM:

◮ Choosing the regularization (overfitting risk).
◮ How many iterations (i.e., passes over the data)?
◮ Generalization guarantees are generally of order O(1/√n), so there is no need to optimize precisely.

Two important insights:

  • 1. No need to optimize below the statistical error.
  • 2. The generalization risk is more important than the empirical risk.

SGD can be used to minimize the generalization risk.

22

slide-24
SLIDE 24

SGD for the generalization risk: f = R

SGD: key assumption E[f′ₙ(θₙ₋₁) | Fₙ₋₁] = f′(θₙ₋₁).

For the risk R(θ) = E_ρ[ℓ(Y, ⟨θ, Φ(X)⟩)]:

◮ At step 0 < k ≤ n, use a new point, independent of θₖ₋₁: f′ₖ(θₖ₋₁) = ℓ′(yₖ, ⟨θₖ₋₁, Φ(xₖ)⟩).

◮ For 0 ≤ k ≤ n, Fₖ = σ((xᵢ, yᵢ)₁≤ᵢ≤ₖ):

    E[f′ₖ(θₖ₋₁) | Fₖ₋₁] = E_ρ[ℓ′(yₖ, ⟨θₖ₋₁, Φ(xₖ)⟩) | Fₖ₋₁] = E_ρ[ℓ′(Y, ⟨θₖ₋₁, Φ(X)⟩)] = R′(θₖ₋₁).

◮ Single pass through the data; running time = O(nd).
◮ "Automatic" regularization.
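A single-pass sketch: each step consumes a fresh sample, so the stochastic gradient is unbiased for the generalization risk R itself. The synthetic logistic model standing in for ρ, and all constants, are assumptions:

```python
import numpy as np

# Single-pass averaged SGD for the generalization risk of logistic regression.
rng = np.random.default_rng(4)
d = 5
theta_star = np.ones(d) / np.sqrt(d)     # ground-truth parameter (assumption)

def sample():
    # one fresh draw (x, y) from the synthetic distribution rho, y in {-1, +1}
    x = rng.standard_normal(d)
    p = 1.0 / (1.0 + np.exp(-(x @ theta_star)))
    y = 1.0 if rng.random() < p else -1.0
    return x, y

theta = np.zeros(d)
theta_bar = np.zeros(d)
gamma = 0.1                              # constant step size (assumption)
n = 50_000
for k in range(1, n + 1):
    x, y = sample()                      # new point, independent of theta_{k-1}
    grad = -y * x / (1.0 + np.exp(y * (x @ theta)))  # logistic-loss gradient
    theta -= gamma * grad
    theta_bar += (theta - theta_bar) / k  # online Polyak-Ruppert average

print(np.linalg.norm(theta_bar - theta_star))
```

One pass over n samples costs O(nd) total, and the averaged iterate lands near θ∗ with no explicit regularization.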

23

slide-25
SLIDE 25

SGD for the generalization risk: f = R

ERM minimization                        | Gen. risk minimization
several passes: 0 ≤ k                   | one pass: 0 ≤ k ≤ n
(xᵢ, yᵢ) is Fₜ-measurable for any t     | (xᵢ, yᵢ) is Fₜ-measurable only for t ≥ i

24

slide-26
SLIDE 26

Convergence rate for f(θ̃ₖ) − f(θ∗), smooth objective f.

[Same rate table as above; the ERM columns run over 0 ≤ k, the generalization column over a single pass 0 ≤ k ≤ n.]

Lower bounds: α, β, γ, δ. δ: information-theoretic lower bound, statistical theory (Tsybakov, 2003). The gradient does not even exist.

25

slide-27
SLIDE 27

Convergence rate for f(θ̃ₖ) − f(θ∗), smooth objective f.

[Same rate table as above, with the single-pass SGD column expressed in n: convex O(1/√n), strongly convex O(1/(µn)).]

Lower bounds: α, β, γ, δ. δ: information-theoretic lower bound, statistical theory (Tsybakov, 2003). The gradient is unknown.

25

slide-28
SLIDE 28

Least Mean Squares: rate independent of µ

◮ Least-squares: R(θ) = ½ E[(Y − ⟨Φ(X), θ⟩)²] with θ ∈ ℝᵈ.

◮ SGD = the least-mean-squares (LMS) algorithm.

◮ Usually studied with decreasing step sizes and without averaging.

◮ New analysis for averaging and constant step size γ = 1/(4R²) (Bach and Moulines, 2013).

◮ Assume ‖Φ(xₙ)‖ ≤ R and |yₙ − ⟨Φ(xₙ), θ∗⟩| ≤ σ almost surely.

◮ No assumption regarding the lowest eigenvalues of the Hessian.

◮ Main result:

    E R(θ̄ₙ) − R(θ∗) ≤ 4σ²d/n + ‖θ₀ − θ∗‖²/(γn).

◮ Matches the statistical lower bound (Tsybakov, 2003).
◮ Optimal rate with "large" (constant) step sizes.
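A numerical sketch of the averaged constant-step LMS result, in the spirit of Bach and Moulines (2013). The synthetic data, the feature scaling so that ‖Φ(x)‖² concentrates around 1, and the use of the empirical covariance as R″ are all illustrative assumptions:

```python
import numpy as np

# Averaged LMS with constant step gamma = 1/(4 R^2), here with R^2 taken as 1.
rng = np.random.default_rng(5)
d, n = 10, 100_000
theta_star = rng.standard_normal(d)
sigma = 0.5
X = rng.standard_normal((n, d)) / np.sqrt(d)   # ||x||^2 concentrates around 1
y = X @ theta_star + sigma * rng.standard_normal(n)

gamma = 1.0 / 4.0
theta = np.zeros(d)
theta_bar = np.zeros(d)
for k in range(n):
    x = X[k]
    theta += gamma * (y[k] - x @ theta) * x    # least-mean-squares update
    theta_bar += (theta - theta_bar) / (k + 1) # online Polyak-Ruppert average

H = X.T @ X / n                                # empirical covariance ~ R''(theta)
excess = 0.5 * (theta_bar - theta_star) @ H @ (theta_bar - theta_star)
bound = 4 * sigma**2 * d / n + np.linalg.norm(theta_star) ** 2 / (gamma * n)
print(excess, bound)
```

The measured excess risk falls well inside the slide's bound 4σ²d/n + ‖θ₀ − θ∗‖²/(γn), with no dependence on the smallest Hessian eigenvalue.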

26

slide-29
SLIDE 29

Take home

◮ SGD can be used to minimize the true risk directly.
◮ A stochastic algorithm to minimize an unknown function.
◮ No regularization needed; only one pass.
◮ For least squares, with a constant step size, the optimal rate.

Stochastic approximation beyond least squares?

27


slide-31
SLIDE 31

Beyond finite dimensional Least squares

◮ Beyond parametric models: Non-parametric Stochastic Approximation with Large Step Sizes (Dieuleveut and Bach, 2015).

◮ Improved sampling: Averaged Least-Mean-Squares: Bias-Variance Trade-offs and Optimal Sampling Distributions (Défossez and Bach, 2015).

◮ Acceleration: Harder, Better, Faster, Stronger Convergence Rates for Least-Squares Regression (Dieuleveut et al., 2016).

◮ Beyond smoothness and Euclidean geometry: Stochastic Composite Least-Squares Regression with Convergence Rate O(1/n) (Flammarion and Bach, 2017).

◮ General smooth and strongly convex optimization: Bridging the Gap between Constant Step Size Stochastic Gradient Descent and Markov Chains (Dieuleveut et al., 2017).

28

slide-32
SLIDE 32

Beyond least squares. Logistic regression

min_{θ∈ℝᵈ} E[log(1 + exp(−Y ⟨θ, Φ(X)⟩))].

[Figure: log₁₀(R(θ̄ₙ) − R(θ∗)) vs. log₁₀(n).] Logistic regression. Final iterate (dashed) and averaged recursion (plain).

29

slide-33
SLIDE 33

Beyond least squares. Logistic regression, real data

[Figure: log₁₀(R(θ̄ₙ) − R(θ∗)) vs. log₁₀(n).] Logistic regression, Covertype dataset, n = 581012, d = 54. Comparison between a constant learning rate and a learning rate decaying as 1/√n.

30

slide-34
SLIDE 34

Motivation 2/2. Difference between quadratic and logistic loss

Logistic regression:        E R(θ̄ₙ) − R(θ∗) = O(γ²),  with γ = 1/(4R²).
Least-squares regression:   E R(θ̄ₙ) − R(θ∗) = O(1/n), with γ = 1/(4R²).

31

slide-35
SLIDE 35

SGD: a homogeneous Markov chain

Consider an L-smooth and µ-strongly convex function R. SGD with a constant step size γ > 0 is a homogeneous Markov chain:

    θᵞₖ₊₁ = θᵞₖ − γ (R′(θᵞₖ) + εₖ₊₁(θᵞₖ)),

◮ it satisfies the Markov property;
◮ it is homogeneous, for constant γ and (εₖ)ₖ∈ℕ i.i.d.

Also assume:

◮ R′ₖ = R′ + εₖ₊₁ is almost surely L-co-coercive.
◮ Bounded moments: E[‖εₖ(θ∗)‖⁴] < ∞.

32

slide-36
SLIDE 36

Stochastic gradient descent as a Markov Chain: Analysis framework†

◮ Existence of a limit distribution πᵞ, and linear convergence to this distribution: θᵞₖ →d πᵞ.

◮ Convergence of the second-order moments of the chain: θ̄ᵞₖ →_{L²} θ̄ᵞ := E_{πᵞ}[θ] as k → ∞.

◮ Behavior under the limit distribution (γ → 0): θ̄ᵞ = θ∗ + ? Provable convergence improvement with extrapolation tricks.

†Dieuleveut, Durmus, and Bach (2017). 33

slide-37
SLIDE 37

Existence of a limit distribution

Goal: (θᵞₖ)ₖ≥₀ →d πᵞ.

Theorem. For any γ < L⁻¹, the chain (θᵞₖ)ₖ≥₀ admits a unique stationary distribution πᵞ. In addition, for all θ₀ ∈ ℝᵈ and k ∈ ℕ:

    W₂²(θᵞₖ, πᵞ) ≤ (1 − 2µγ(1 − γL))ᵏ ∫_{ℝᵈ} ‖θ₀ − ϑ‖² dπᵞ(ϑ).

Wasserstein metric: a distance between probability measures.

34

slide-38
SLIDE 38

Behavior under the limit distribution

Ergodic theorem: θ̄ₖ → E_{πᵞ}[θ] =: θ̄ᵞ. Where is θ̄ᵞ?

If θ₀ ∼ πᵞ, then θ₁ ∼ πᵞ, with θᵞ₁ = θᵞ₀ − γ(R′(θᵞ₀) + ε₁(θᵞ₀)). Hence E_{πᵞ}[R′(θ)] = 0.

In the quadratic case (linear gradients), Σ E_{πᵞ}[θ − θ∗] = 0, so θ̄ᵞ = θ∗!

In the general case, using E_{πᵞ}[‖θ − θ∗‖⁴] ≤ Cγ², a Taylor expansion of R′, and iterating this reasoning on higher moments of the chain:

    θ̄ᵞ − θ∗ = γ R″(θ∗)⁻¹ R‴(θ∗) [ (R″(θ∗) ⊗ I + I ⊗ R″(θ∗))⁻¹ E_{πᵞ}[ε(θ)⊗²] ] + O(γ²).

Overall, θ̄ᵞ − θ∗ = γ∆ + O(γ²).

35

slide-39
SLIDE 39

Constant learning rate SGD: convergence in the quadratic case

[Figure: iterates θ₀, θ₁, …, θₙ, their averages, and θ∗]

36


slide-43
SLIDE 43

Behavior under the limit distribution (recap)

The same expansion, with the noise evaluated at θ∗:

    θ̄ᵞ − θ∗ ≃ γ R″(θ∗)⁻¹ R‴(θ∗) [ (R″(θ∗) ⊗ I + I ⊗ R″(θ∗))⁻¹ E_ε[ε(θ∗)⊗²] ].

Overall, θ̄ᵞ − θ∗ = γ∆ + O(γ²).

37

slide-44
SLIDE 44

Constant learning rate SGD: convergence in the non-quadratic case

[Figure: iterates θ₀, θ₁, …, θₙ and their averages; the averaged sequence concentrates near θ̄ᵞ, away from θ∗]

38


slide-48
SLIDE 48

Richardson extrapolation

[Figure: averaged iterates for step sizes γ and 2γ around θ∗]

    θ̄ᵞₙ − θ̄ᵞ = O_p(n^(−1/2)),  θᵞₙ − θ̄ᵞ = O_p(γ^(1/2)),  θ∗ − θ̄ᵞ = O(γ).

Recovering convergence closer to θ∗ by Richardson extrapolation: 2θ̄ᵞₙ − θ̄²ᵞₙ.

39
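The extrapolation can be checked numerically: since θ̄ᵞₙ ≈ θ∗ + γ∆ + O(γ²), the combination 2θ̄ᵞₙ − θ̄²ᵞₙ cancels the first-order bias. The 1-D toy objective with f′(t) = t + 0.5(cos t − 1) (strongly convex, θ∗ = 0, non-zero third derivative at 0) is an illustrative assumption, not the talk's experiment:

```python
import math
import numpy as np

# Averaged constant-step SGD on a non-quadratic 1-D objective, then Richardson.
def averaged_sgd(gamma, n, seed):
    noise = np.random.default_rng(seed).standard_normal(n)
    theta, theta_bar = 1.0, 0.0
    for k in range(n):
        grad = theta + 0.5 * (math.cos(theta) - 1.0) + noise[k]
        theta -= gamma * grad
        theta_bar += (theta - theta_bar) / (k + 1)  # online average
    return theta_bar

n = 500_000
bar_g = averaged_sgd(0.1, n, seed=0)
bar_2g = averaged_sgd(0.2, n, seed=1)
richardson = 2.0 * bar_g - bar_2g   # cancels the gamma * Delta bias term
print(bar_g, bar_2g, richardson)
```

The averages carry an O(γ) bias that roughly doubles from γ to 2γ, while the extrapolated point sits much closer to θ∗ = 0.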


slide-54
SLIDE 54

Experiments: smaller dimension

[Figure: log₁₀[R(θ) − R(θ∗)] vs. log₁₀(n).] Synthetic data, logistic regression, n = 8·10⁶.

40

slide-55
SLIDE 55

Experiments: Double Richardson

[Figure: log₁₀[R(θ) − R(θ∗)] vs. log₁₀(n).] Synthetic data, logistic regression, n = 8·10⁶. "Richardson 3γ": estimator built using Richardson extrapolation on 3 different sequences:

    θ̃³ₙ = (8/3) θ̄ᵞₙ − 2 θ̄²ᵞₙ + (1/3) θ̄⁴ᵞₙ.
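The weights 8/3, −2, 1/3 are exactly those that cancel both the γ and γ² bias terms; a short check (my derivation, under the assumed expansion θ̄ᵞₙ ≈ θ∗ + γ∆ + γ²∆′):

```latex
a\,\bar\theta^{\gamma}_n + b\,\bar\theta^{2\gamma}_n + c\,\bar\theta^{4\gamma}_n
\approx (a+b+c)\,\theta^* + (a+2b+4c)\,\gamma\Delta + (a+4b+16c)\,\gamma^2\Delta' .
% Keep theta* and cancel both bias terms:
a+b+c = 1,\qquad a+2b+4c = 0,\qquad a+4b+16c = 0
\;\Longrightarrow\; a = \tfrac{8}{3},\quad b = -2,\quad c = \tfrac{1}{3}.
```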

41

slide-56
SLIDE 56

Conclusion MC

Take home

◮ Asymptotics sometimes matter less than the first iterations: consider large step sizes.
◮ Constant step size SGD is a homogeneous Markov chain.
◮ The difference between least squares and general smooth losses is intuitive.

For smooth strongly convex losses:

◮ Convergence in terms of the Wasserstein distance.
◮ Decomposition into three sources of error: variance, initial conditions, and "drift".
◮ Detailed analysis of the position of the limit point: the direction does not depend on γ at first order ⟹ extrapolation tricks can help.

42

slide-57
SLIDE 57

Further references

Many stochastic algorithms are not covered in this talk (coordinate descent, online Newton, composite optimization, non-convex learning)...

◮ Good introduction: Francis's lecture notes at Orsay.
◮ Book: Convex Optimization: Algorithms and Complexity, Sébastien Bubeck.

43

slide-58
SLIDE 58

Agarwal, A., Bartlett, P. L., Ravikumar, P., and Wainwright, M. J. (2012). Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, 58(5):3235–3249.

Agarwal, A. and Bottou, L. (2014). A lower bound for the optimization of finite sums. ArXiv e-prints.

Arjevani, Y. and Shamir, O. (2016). Dimension-free iteration complexity of finite sum optimization problems. In Advances in Neural Information Processing Systems 29, pages 3540–3548. Curran Associates, Inc.

Bach, F. and Moulines, E. (2013). Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). Advances in Neural Information Processing Systems (NIPS).

Bartlett, P. L., Bousquet, O., and Mendelson, S. (2002). Localized Rademacher complexities, pages 44–58. Springer Berlin Heidelberg.

Bousquet, O. and Elisseeff, A. (2002). Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526.

Defazio, A., Bach, F., and Lacoste-Julien, S. (2014a). SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654.

Defazio, A., Domke, J., and Caetano, T. (2014b). Finito: a faster, permutable incremental gradient method for big data problems. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1125–1133.

Défossez, A. and Bach, F. (2015). Averaged least-mean-squares: bias-variance trade-offs and optimal sampling distributions. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS).

43

slide-59
SLIDE 59

Dieuleveut, A. and Bach, F. (2015). Non-parametric stochastic approximation with large step sizes. Annals of Statistics.

Dieuleveut, A., Durmus, A., and Bach, F. (2017). Bridging the gap between constant step size stochastic gradient descent and Markov chains. ArXiv e-prints.

Dieuleveut, A., Flammarion, N., and Bach, F. (2016). Harder, better, faster, stronger convergence rates for least-squares regression. ArXiv e-prints.

Fabian, V. (1968). On asymptotic normality in stochastic approximation. The Annals of Mathematical Statistics, pages 1327–1332.

Flammarion, N. and Bach, F. (2017). Stochastic composite least-squares regression with convergence rate O(1/n).

Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323.

Konečný, J. and Richtárik, P. (2013). Semi-stochastic gradient descent methods. arXiv preprint arXiv:1312.1666.

Lacoste-Julien, S., Schmidt, M., and Bach, F. (2012). A simpler approach to obtaining an O(1/t) rate for the stochastic projected subgradient method. ArXiv e-prints 1212.2002.

Nemirovsky, A. S. and Yudin, D. B. (1983). Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience, John Wiley & Sons, Inc., New York.

Nesterov, Y. (2004). Introductory Lectures on Convex Optimization: A Basic Course. Applied Optimization. Springer.

43

slide-60
SLIDE 60

Polyak, B. T. and Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4):838–855.

Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407.

Robbins, H. and Siegmund, D. (1985). A convergence theorem for non-negative almost supermartingales and some applications. In Herbert Robbins Selected Papers, pages 111–135. Springer.

Ruppert, D. (1988). Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University Operations Research and Industrial Engineering.

Schmidt, M., Le Roux, N., and Bach, F. (2013). Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112.

Tsybakov, A. B. (2003). Optimal rates of aggregation. In Proceedings of the Annual Conference on Computational Learning Theory.

43