SLIDE 1

Bridging the gap between Stochastic Approximation and Markov chains

Aymeric DIEULEVEUT

ENS Paris, INRIA

17 November 2017. Joint work with Francis Bach and Alain Durmus.

SLIDE 2

Outline

◮ Introduction to Stochastic Approximation for Machine Learning.

◮ Markov chain: a simple yet insightful point of view on constant step size Stochastic Approximation.

SLIDE 3

Supervised Machine Learning

◮ Consider an input/output pair (X, Y) ∈ X × Y, following some unknown distribution ρ.
◮ Y = R (regression) or {−1, 1} (classification).
◮ We want to find a function θ : X → R such that θ(X) is a good prediction for Y.
◮ Prediction as a linear function ⟨θ, Φ(X)⟩ of features Φ(X) ∈ R^d.
◮ Consider a loss function ℓ : Y × R → R₊: squared loss, logistic loss, 0-1 loss, etc.
◮ We define the risk (generalization error) as R(θ) := E_ρ[ℓ(Y, ⟨θ, Φ(X)⟩)].

SLIDE 4

Empirical Risk minimization (I)

◮ Data: n observations (x_i, y_i) ∈ X × Y, i = 1, . . . , n, i.i.d.
◮ n very large, up to 10^9.
◮ Computer vision: d = 10^4 to 10^6.
◮ Empirical risk (or training error): R̂(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, ⟨θ, Φ(x_i)⟩).
◮ Empirical risk minimization (regularized): find θ̂ a solution of

min_{θ∈R^d} (1/n) Σ_{i=1}^n ℓ(y_i, ⟨θ, Φ(x_i)⟩) + µΩ(θ),

a convex data-fitting term plus a regularizer.

SLIDE 5

Empirical Risk minimization (II)

◮ For example, least-squares regression:

min_{θ∈R^d} (1/(2n)) Σ_{i=1}^n (y_i − ⟨θ, Φ(x_i)⟩)^2 + µΩ(θ),

◮ and logistic regression (see the sketch below):

min_{θ∈R^d} (1/n) Σ_{i=1}^n log(1 + exp(−y_i ⟨θ, Φ(x_i)⟩)) + µΩ(θ).

◮ Two fundamental questions: (1) computing θ̂ and (2) analyzing θ̂.

Two important insights for ML (Bottou and Bousquet, 2008):
1. No need to optimize below the statistical error.
2. Testing error is more important than training error.
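To make the objective concrete, here is a minimal sketch of the regularized logistic-regression problem above on synthetic data; the dataset, the choice Ω(θ) = ‖θ‖^2/2, and all dimensions are illustrative assumptions, not the talk's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20
Phi = rng.normal(size=(n, d))            # features Φ(x_i)
theta_true = rng.normal(size=d)
y = np.where(Phi @ theta_true + 0.5 * rng.normal(size=n) > 0, 1.0, -1.0)
mu = 1e-3                                # regularization strength µ

def objective(theta):
    # (1/n) Σ log(1 + exp(-y_i <θ, Φ(x_i)>)) + µ ||θ||²/2  (assuming Ω(θ) = ||θ||²/2)
    margins = y * (Phi @ theta)
    return np.mean(np.logaddexp(0.0, -margins)) + 0.5 * mu * theta @ theta

def gradient(theta):
    # per-sample gradient of log(1 + e^{-m}) w.r.t. θ is -y / (1 + e^{m}) Φ(x)
    margins = y * (Phi @ theta)
    coeff = -y / (1.0 + np.exp(margins))
    return Phi.T @ coeff / n + mu * theta

print(objective(np.zeros(d)), np.linalg.norm(gradient(np.zeros(d))))
```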

SLIDE 6

Stochastic Approximation

◮ Goal: min_{θ∈R^d} f(θ), given unbiased gradient estimates f′_n.
◮ θ∗ := argmin_{R^d} f(θ).

[Figure: gradient iterates θ_0, θ_1, … converging to θ∗.]

SLIDE 7

Stochastic Approximation in Machine learning

Loss for a single pair of observations, for any k ≤ n: f_k(θ) = ℓ(y_k, ⟨θ, Φ(x_k)⟩).

◮ Use one observation at each step!
◮ Complexity: O(d) per iteration.
◮ Can be used for both the true risk and the empirical risk.

SLIDE 8

Stochastic Approximation in Machine learning

◮ For the empirical error R̂(θ) = (1/n) Σ_{k=1}^n ℓ(y_k, ⟨θ, Φ(x_k)⟩):
  ◮ At each step k ∈ N∗, sample I_k ∼ U{1, . . . , n}.
  ◮ F_k = σ((x_i, y_i)_{1≤i≤n}, (I_i)_{1≤i≤k}).
  ◮ At step k ∈ N∗, use f′_{I_k}(θ_{k−1}) = ℓ′(y_{I_k}, ⟨θ_{k−1}, Φ(x_{I_k})⟩), so that E[f′_{I_k}(θ_{k−1}) | F_{k−1}] = R̂′(θ_{k−1}).
◮ For the risk R(θ) = E f_k(θ) = E ℓ(y_k, ⟨θ, Φ(x_k)⟩):
  ◮ For 0 ≤ k ≤ n, F_k = σ((x_i, y_i)_{1≤i≤k}).
  ◮ At step 0 < k ≤ n, use a new point, independent of θ_{k−1}: f′_k(θ_{k−1}) = ℓ′(y_k, ⟨θ_{k−1}, Φ(x_k)⟩), so that E[f′_k(θ_{k−1}) | F_{k−1}] = R′(θ_{k−1}).
  ◮ Single pass through the data; running time O(nd).
  ◮ "Automatic" regularization.

Analysis: key assumptions are smoothness and/or strong convexity. Both sampling schemes are sketched below.
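A hedged sketch of the two sampling schemes on the logistic loss: with-replacement sampling I_k ∼ U{1, . . . , n} targets the empirical risk R̂, while a single pass over fresh points targets the true risk R. The synthetic data and the step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5000, 10
Phi = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
y = np.sign(Phi @ theta_true + rng.normal(size=n))

def stoch_grad(theta, phi, yi):
    # f'_k(θ) = ℓ'(y_k, <θ, Φ(x_k)>) Φ(x_k) for the logistic loss
    return -yi * phi / (1.0 + np.exp(yi * (phi @ theta)))

def sgd(indices, gamma=0.1):
    theta = np.zeros(d)
    for k in indices:                            # one O(d) update per observation
        theta -= gamma * stoch_grad(theta, Phi[k], y[k])
    return theta

theta_erm = sgd(rng.integers(0, n, size=n))      # empirical risk: sample with replacement
theta_risk = sgd(range(n))                       # true risk: each point used once
```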

SLIDE 9

Mathematical framework: Smoothness

◮ A function g : R^d → R is L-smooth if and only if it is twice differentiable and, for all θ ∈ R^d, the eigenvalues of g′′(θ) are bounded by L.

For all θ, θ′ ∈ R^d:

g(θ) ≤ g(θ′) + ⟨g′(θ′), θ − θ′⟩ + (L/2) ‖θ − θ′‖^2.

SLIDE 10

Mathematical framework: Strong Convexity

◮ A twice differentiable function g : R^d → R is µ-strongly convex if and only if, for all θ ∈ R^d, the eigenvalues of g′′(θ) are at least µ.

For all θ, θ′ ∈ R^d:

g(θ) ≥ g(θ′) + ⟨g′(θ′), θ − θ′⟩ + (µ/2) ‖θ − θ′‖^2.

SLIDE 11

Application to machine learning

◮ We consider a loss that is a.s. convex in θ. Thus R̂ and R are convex.
◮ Hessian of R̂ (resp. R) ≈ covariance matrix (1/n) Σ_{i=1}^n Φ(x_i)Φ(x_i)^⊤ (resp. E[Φ(X)Φ(X)^⊤]):

R′′(θ) = E[ℓ′′(⟨θ, Φ(X)⟩, Y) Φ(X)Φ(X)^⊤].

◮ If ℓ is smooth and E[‖Φ(X)‖^2] ≤ r^2, then R is smooth.
◮ If ℓ is µ-strongly convex and the data has an invertible covariance matrix (low correlation/dimension), then R is strongly convex.

SLIDE 12

Analysis: behaviour of (θn)n≥0

θ_n = θ_{n−1} − γ_n f′_n(θ_{n−1})

Importance of the learning rate (or sequence of step sizes) (γ_n)_{n≥0}. For smooth and strongly convex problems, traditional analysis (Fabian, 1968; Robbins and Siegmund, 1985) shows that θ_n → θ∗ almost surely if

Σ_{n=1}^∞ γ_n = ∞ and Σ_{n=1}^∞ γ_n^2 < ∞,

and asymptotic normality √n(θ_n − θ∗) →(d) N(0, V) for γ_n = γ_0/n, γ_0 ≥ 1/µ.

◮ The limit variance scales as 1/µ^2.
◮ Very sensitive to ill-conditioned problems.
◮ µ is generally unknown, so the step size is hard to choose...

SLIDE 13

Polyak Ruppert averaging

Introduced by Polyak and Juditsky (1992) and Ruppert (1988):

θ̄_n = (1/(n+1)) Σ_{k=0}^n θ_k.

[Figure: iterates θ_0, θ_1, …, θ_n and their averages θ̄_1, θ̄_2, …, θ̄_n around θ∗.]

◮ Off-line averaging reduces the noise effect.
◮ On-line computation (see the sketch below): θ̄_{n+1} = (1/(n+2)) θ_{n+1} + ((n+1)/(n+2)) θ̄_n.
◮ One could also consider other averaging schemes (e.g., Lacoste-Julien et al., 2012).
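A minimal sketch of on-line Polyak-Ruppert averaging: θ̄ is maintained in O(d) per step, without storing past iterates. The gradient oracle `grad` and the 1-D example are illustrative assumptions of this sketch.

```python
import numpy as np

def averaged_sgd(grad, theta0, gamma, n_steps, rng):
    theta = np.asarray(theta0, dtype=float).copy()
    theta_bar = theta.copy()                    # θ̄_0 = θ_0
    for n in range(n_steps):
        theta = theta - gamma * grad(theta, rng)
        # θ̄_{n+1} = θ̄_n + (θ_{n+1} − θ̄_n)/(n+2): running mean of θ_0, …, θ_{n+1}
        theta_bar += (theta - theta_bar) / (n + 2)
    return theta, theta_bar

# Example: 1-D least squares with additive Gaussian noise, so θ∗ = 0.
rng = np.random.default_rng(0)
last, avg = averaged_sgd(lambda t, r: t + r.normal(), np.array([5.0]), 0.1, 10000, rng)
print(last, avg)   # the average is much closer to θ∗ than the last iterate
```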

SLIDE 14

Convex stochastic approximation: convergence results

◮ Known global minimax rates of convergence for non-smooth problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012):
  ◮ Strongly convex: O((µn)^{−1}), attained by averaged stochastic gradient descent with γ_n ∝ (µn)^{−1}.
  ◮ Non-strongly convex: O(n^{−1/2}), attained by averaged stochastic gradient descent with γ_n ∝ n^{−1/2}.

Smooth strongly convex problems:

◮ All step sizes γ_n = Cn^{−α} with α ∈ (1/2, 1), with averaging, lead to O(n^{−1}):
  ◮ asymptotic normality (Polyak and Juditsky, 1992), with a variance independent of µ!
  ◮ non-asymptotic analysis (Bach and Moulines, 2011).
◮ Rate 1/(µn) for γ_n ∝ n^{−1/2}: adapts to strong convexity.

SLIDE 15

Stochastic Approximation: take home message

◮ Powerful algorithm:
  ◮ Simple to implement.
  ◮ Cheap.
  ◮ No regularization needed.
◮ Convergence guarantees:
  ◮ γ_n = 1/√n is a good choice in most situations.

Problems:

◮ Initial conditions can be forgotten slowly: could we use even larger step sizes?

SLIDE 16

Motivation 1/2: Large step sizes!

[Figure: log10(R(θ̄_n) − R(θ∗)) vs. log10(n). Logistic regression; final iterate (dashed) and averaged recursion (plain).]

SLIDE 17

Motivation 1/2: Large step sizes, real data

[Figure: log10(R(θ̄_n) − R(θ∗)) vs. log10(n). Logistic regression, Covertype dataset, n = 581012, d = 54; comparison between a constant learning rate and a learning rate decaying as 1/√n.]

SLIDE 18

Motivation 2/2: Difference between quadratic and logistic loss

◮ Logistic regression: E R(θ̄_n) − R(θ∗) = O(γ^2), with γ = 1/(2R^2).
◮ Least-squares regression: E R(θ̄_n) − R(θ∗) = O(1/n), with γ = 1/(2R^2).

SLIDE 19

Larger step sizes: Least-mean-square algorithm

◮ Least-squares: R(θ) = (1/2) E[(Y − ⟨Φ(X), θ⟩)^2], with θ ∈ R^d.
◮ SGD = least-mean-squares (LMS) algorithm.
◮ Usually studied without averaging and with decreasing step sizes.
◮ New analysis for averaging and constant step size γ = 1/(4R^2) (Bach and Moulines, 2013):
  ◮ Assume ‖Φ(x_n)‖ ≤ r and |y_n − ⟨Φ(x_n), θ∗⟩| ≤ σ almost surely.
  ◮ No assumption regarding the lowest eigenvalues of the Hessian.
◮ Main result (see the sketch below):

E R(θ̄_n) − R(θ∗) ≤ 4σ^2 d / n + ‖θ_0 − θ∗‖^2 / (γn).

◮ Matches the statistical lower bound (Tsybakov, 2003).
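A sketch of the averaged constant-step LMS recursion above, with γ estimated as 1/(4R^2), on synthetic Gaussian data (dimensions and noise level are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, sigma = 20000, 5, 0.5
Phi = rng.normal(size=(n, d))
theta_star = rng.normal(size=d)
y = Phi @ theta_star + sigma * rng.normal(size=n)

R2 = np.mean(np.sum(Phi ** 2, axis=1))   # estimate of R² = E||Φ(X)||²
gamma = 1.0 / (4.0 * R2)

theta = np.zeros(d)
theta_bar = np.zeros(d)
for k in range(n):
    theta -= gamma * (Phi[k] @ theta - y[k]) * Phi[k]   # LMS update
    theta_bar += (theta - theta_bar) / (k + 2)          # on-line average

# Both terms of the bound, 4σ²d/n and ||θ0 − θ∗||²/(γn), shrink with n,
# so θ̄ should end up close to θ∗ here.
print(np.linalg.norm(theta_bar - theta_star))
```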

SLIDE 20

Related work in Sierra

Led to numerous (non-trivial) extensions, at least in our lab!

◮ Beyond parametric models: "Non-parametric Stochastic Approximation with Large Step Sizes", Dieuleveut and Bach (2015).
◮ Improved sampling: "Averaged Least-Mean-Squares: Bias-Variance Trade-offs and Optimal Sampling Distributions", Défossez and Bach (2015).
◮ Acceleration: "Harder, Better, Faster, Stronger Convergence Rates for Least-Squares Regression", Dieuleveut et al. (2016).
◮ Beyond smoothness and Euclidean geometry: "Stochastic Composite Least-Squares Regression with Convergence Rate O(1/n)", Flammarion and Bach (2017).

SLIDE 21

SGD: an homogeneous Markov chain

Consider an L-smooth and µ-strongly convex function R. SGD with a step size γ > 0 is a homogeneous Markov chain:

θ^γ_{k+1} = θ^γ_k − γ (R′(θ^γ_k) + ε_{k+1}(θ^γ_k)),

which
◮ satisfies the Markov property;
◮ is homogeneous, for γ constant and (ε_k)_{k∈N} i.i.d.

Also assume:
◮ R′_k = R′ + ε_{k+1} is almost surely L-co-coercive.
◮ Bounded moments: E[‖ε_k(θ∗)‖^4] < ∞.

(For least squares, the noise functions ε_k can be written explicitly; see the sketch below.)
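A sketch of the least-squares instance of the noise functions, matching the decomposition used later for LMS (cf. the footnote on the corollary slide): a multiplicative part vanishing at θ∗, plus an additive part ξ_k = (⟨θ∗, Φ(x_k)⟩ − y_k)Φ(x_k). The synthetic data and Σ = I are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4
Sigma = np.eye(d)                      # assume E[Φ(X)Φ(X)ᵀ] = I for simplicity
theta_star = rng.normal(size=d)

def epsilon(theta, phi, yi):
    # ε(θ) = f'_k(θ) − R'(θ), with f'_k(θ) = (<θ, φ> − y)φ and R'(θ) = Σ(θ − θ∗)
    return (np.outer(phi, phi) - Sigma) @ (theta - theta_star) \
           + (phi @ theta_star - yi) * phi

phi = rng.normal(size=d)
yi = phi @ theta_star + 0.1 * rng.normal()
print(epsilon(theta_star, phi, yi))    # at θ∗, only the additive part remains
```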

SLIDE 22

Stochastic gradient descent as a Markov Chain: Analysis framework†

◮ Existence of a limit distribution π_γ, and linear convergence to this distribution: θ^γ_n →(d) π_γ.
◮ Convergence of second-order moments of the chain: θ̄^γ_n →(L^2) θ̄_γ := E_{π_γ}[θ] as n → ∞.
◮ Behavior under the limit distribution (γ → 0): θ̄_γ = θ∗ + ?. Provable convergence improvement with extrapolation tricks.

† Dieuleveut, Durmus, Bach (2017).

SLIDE 23

Existence of a limit distribution γ → 0

Goal: (θ^γ_n)_{n≥0} →(d) π_γ.

Theorem
For any γ < (2L)^{−1}, the chain (θ^γ_n)_{n≥0} admits a unique stationary distribution π_γ. In addition, for all θ_0 ∈ R^d, n ∈ N:

W_2^2(θ^γ_n, π_γ) ≤ (1 − µγ)^n ∫_{R^d} ‖θ_0 − ϑ‖^2 dπ_γ(ϑ).

Wasserstein metric: a distance between probability measures.

SLIDE 24

Assumptions

A1: f is a µ-strongly convex function.
A2: f is C^4 with bounded second to fourth derivatives. In particular, f is L-smooth.
A3: Filtration (F_k)_{k∈N}: for all k ∈ N and any θ ∈ R^d, ε_{k+1}(θ) is an F_{k+1}-measurable random variable and E[ε_{k+1}(θ) | F_k] = 0. We assume that the noise functions (ε_k)_{k∈N∗} are i.i.d.
A4: f′_1 is almost surely L-co-coercive. Moreover, ε_1(θ∗) admits bounded moments up to order p ≤ 4: E^{1/p}[‖ε_1(θ∗)‖^p] < ∞.

SLIDE 25

Transition kernel

Fundamental tool: the Markov kernel R_γ (for continuous spaces; ≃ a transition matrix in finite state spaces).

Definition
For any initial distribution ν_0 on B(R^d) and k ∈ N, ν_0 R_γ^k denotes the law of θ^γ_k starting at θ_0 ∼ ν_0. If θ_0 is deterministic, θ^γ_k ∼ δ_{θ_0} R_γ^k.

Definition
For any function h : R^d → R, all θ ∈ R^d and k ≥ 1:

R_γ^k h(θ) = E_{θ_0=θ}[h(θ^γ_k)] = ∫_{R^d} h(ϑ) (δ_θ R_γ^k)(dϑ).

Notation: for a measure π and a function h, π(h) = ∫ h(θ) dπ(θ). (A Monte Carlo view of this kernel is sketched below.)
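A Monte Carlo sketch of the kernel notation: (R_γ^k h)(θ) = E_{θ_0=θ}[h(θ^γ_k)] is estimated by simulating the chain k steps from θ. The 1-D model f′(θ) = θ with standard Gaussian noise is an illustrative assumption.

```python
import numpy as np

def kernel_apply(h, theta0, gamma, k, n_samples=100000, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.full(n_samples, float(theta0))
    for _ in range(k):
        theta -= gamma * (theta + rng.normal(size=n_samples))  # f'(θ) + ε
    return h(theta).mean()     # ≈ ∫ h(ϑ) (δ_θ R_γ^k)(dϑ)

# As k grows this approaches π_γ(h); for this model π_γ is Gaussian
# with variance γ/(2 − γ) ≈ 0.0526 for γ = 0.1.
print(kernel_apply(lambda t: t ** 2, theta0=2.0, gamma=0.1, k=100))
```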

SLIDE 26

Existence of a limit distribution γ → 0

Goal: (θ^γ_k)_{k≥0} →(d) π_γ, i.e. (ν_0 R_γ^k)_{k≥0} → π_γ.

Definition
Wasserstein metric: for ν and λ probability measures on R^d,

W_2(λ, ν) := inf_{ξ∈Π(λ,ν)} ( ∫ ‖x − y‖^2 ξ(dx, dy) )^{1/2},

where Π(λ, ν) is the set of probability measures ξ such that, for all A ∈ B(R^d), ξ(A × R^d) = λ(A) and ξ(R^d × A) = ν(A).

Theorem
Assume A1-A4. For γ < L^{−1}, the chain (θ^γ_k)_{k≥0} admits a unique stationary distribution π_γ, and for all θ ∈ R^d, n ∈ N:

W_2^2(δ_θ R_γ^n, π_γ) ≤ (1 − 2µγ(1 − γL))^n ∫_{R^d} ‖θ − ϑ‖^2 dπ_γ(ϑ).

SLIDE 27

Existence of a limit distribution: proof I /III

◮ Coupling: let θ_1, θ_2 be independent and distributed according to λ_1, λ_2 respectively, and let (θ^{(1)}_{k,γ})_{k≥0}, (θ^{(2)}_{k,γ})_{k≥0} be SGD iterates driven by the same noise:

θ^{(1)}_{k+1,γ} = θ^{(1)}_{k,γ} − γ (f′(θ^{(1)}_{k,γ}) + ε_{k+1}(θ^{(1)}_{k,γ})),
θ^{(2)}_{k+1,γ} = θ^{(2)}_{k,γ} − γ (f′(θ^{(2)}_{k,γ}) + ε_{k+1}(θ^{(2)}_{k,γ})).

◮ For all k ≥ 0, the distribution of (θ^{(1)}_{k,γ}, θ^{(2)}_{k,γ}) is in Π(λ_1 R_γ^k, λ_2 R_γ^k).

SLIDE 28

Existence of a limit distribution: proof II/III

W_2^2(λ_1 R_γ, λ_2 R_γ) ≤ E[‖θ^{(1)}_{1,γ} − θ^{(2)}_{1,γ}‖^2]
 = E[‖θ_1 − γ f′_1(θ_1) − (θ_2 − γ f′_1(θ_2))‖^2]
 (A3) = E[‖θ_1 − θ_2‖^2] − 2γ E[⟨f′(θ_1) − f′(θ_2), θ_1 − θ_2⟩] + γ^2 E[‖f′_1(θ_1) − f′_1(θ_2)‖^2]
 (A4) ≤ E[‖θ_1 − θ_2‖^2] − 2γ(1 − γL) E[⟨f′(θ_1) − f′(θ_2), θ_1 − θ_2⟩]
 (A1) ≤ (1 − 2µγ(1 − γL)) E[‖θ_1 − θ_2‖^2].

Define ρ := 1 − 2µγ(1 − γL).

SLIDE 29

Existence of a limit distribution: proof III/III

By induction:

W_2^2(λ_1 R_γ^n, λ_2 R_γ^n) ≤ E[‖θ^{(1)}_{n,γ} − θ^{(2)}_{n,γ}‖^2] ≤ ρ^n ∫ ‖x − y‖^2 dλ_1(x) dλ_2(y).

◮ Thus W_2^2(δ_x R_γ^n, δ_y R_γ^n) ≤ (1 − 2µγ(1 − γL))^n ‖x − y‖^2.
◮ The set of probability measures with a second-order moment is a Polish space.
◮ By the Picard fixed-point theorem, (λ_1 R_γ^n)_{n≥0} is a Cauchy sequence and converges to a limit π_γ^{λ_1}.
◮ Uniqueness, invariance, and the Theorem follow:

W_2^2(δ_θ R_γ^n, π_γ) ≤ (1 − 2µγ(1 − γL))^n ∫_{R^d} ‖θ − ϑ‖^2 dπ_γ(ϑ).

(This contraction is illustrated in the sketch below.)
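A sketch of the synchronous coupling used in the proof: two chains driven by the same noise, whose squared distance contracts at least geometrically at rate ρ = 1 − 2µγ(1 − γL) per step. The quadratic f with Hessian eigenvalues in [µ, L] is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(5)
mu, L, gamma, d, n = 0.5, 2.0, 0.2, 3, 30
H = np.diag(np.linspace(mu, L, d))       # f'(θ) = Hθ: µ-strongly convex, L-smooth
rho = 1 - 2 * mu * gamma * (1 - gamma * L)

t1 = 5.0 * rng.normal(size=d)
t2 = 5.0 * rng.normal(size=d)
d0 = np.sum((t1 - t2) ** 2)
for _ in range(n):
    eps = rng.normal(size=d)             # shared noise: the coupling
    t1 = t1 - gamma * (H @ t1 + eps)
    t2 = t2 - gamma * (H @ t2 + eps)

print(np.sum((t1 - t2) ** 2) <= rho ** n * d0)   # contraction bound holds
```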

SLIDE 30

Consequence: solutions to the Poisson equation.

In the following, we will need to introduce, for any sufficiently regular φ (say L_φ-Lipschitz), a function ψ_φ such that, for θ ∈ R^d:

ψ_φ(θ) = Σ_{k=0}^∞ (E_{θ_0=θ}[φ(θ^γ_k)] − E_{π_γ}[φ(θ)]).

As |E_{θ_0=θ}[φ(θ^γ_k)] − E_{π_γ}[φ(θ)]| ≤ L_φ W_2(δ_θ R_γ^k, π_γ), the sum converges absolutely for all θ. Moreover, ψ_φ is also Lipschitz and satisfies

(I − R_γ)ψ_φ = φ − π_γ(φ),

which is the "Poisson equation".

SLIDE 31

Behavior under limit distribution.

Ergodic theorem: θ̄_n → E_{π_γ}[θ] =: θ̄_γ. Where is θ̄_γ?

If θ_0 ∼ π_γ, then θ_1 ∼ π_γ. Taking expectations in θ^γ_1 = θ^γ_0 − γ(R′(θ^γ_0) + ε_1(θ^γ_0)) gives

E_{π_γ}[R′(θ)] = 0.

In the quadratic case (linear gradients), Σ E_{π_γ}[θ − θ∗] = 0: θ̄_γ = θ∗!

In the general case, using E_{π_γ}[‖θ − θ∗‖^4] ≤ Cγ^2, a Taylor expansion of R′ around θ∗, and iterating this reasoning on higher moments of the chain:

θ̄_γ − θ∗ = γ R′′(θ∗)^{−1} R′′′(θ∗) (R′′(θ∗) ⊗ I + I ⊗ R′′(θ∗))^{−1} E_{π_γ}[ε(θ)^{⊗2}] + O(γ^2).

Overall, θ̄_γ − θ∗ = γ∆ + O(γ^2).
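For completeness, a sketch of the Taylor step behind this expansion (constants and signs are absorbed into ∆, consistent with the display above):

```latex
0 = \mathbb{E}_{\pi_\gamma}[R'(\theta)]
  = R''(\theta_*)\,(\bar\theta_\gamma - \theta_*)
  + \tfrac{1}{2}\,R'''(\theta_*)\,\mathbb{E}_{\pi_\gamma}\!\big[(\theta - \theta_*)^{\otimes 2}\big]
  + O\!\big(\mathbb{E}_{\pi_\gamma}\|\theta - \theta_*\|^{3}\big).
```

Since E_{π_γ}[(θ − θ∗)^{⊗2}] = O(γ) and the remainder is O(γ^{3/2}), solving for θ̄_γ − θ∗ yields a term of order γ, built from R′′(θ∗)^{−1}, R′′′(θ∗) and the noise covariance: this is the γ∆ term.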

SLIDE 32

Constant learning rate SGD: convergence in the quadratic case

[Figure (animation over four slides): iterates θ_0, θ_1, …, θ_n and their averages θ̄_1, θ̄_2, …, θ̄_n converging to θ∗ in the quadratic case.]

SLIDE 36

Behavior under limit distribution.

Ergodic theorem: θ̄_n → E_{π_γ}[θ] =: θ̄_γ. Where is θ̄_γ?

If θ_0 ∼ π_γ, then θ_1 ∼ π_γ: θ^γ_1 = θ^γ_0 − γ(R′(θ^γ_0) + ε_1(θ^γ_0)), hence E_{π_γ}[R′(θ)] = 0.

In the quadratic case (linear gradients), Σ E_{π_γ}[θ − θ∗] = 0: θ̄_γ = θ∗!

In the general case, a Taylor expansion of R′ and the same reasoning on higher moments of the chain lead to

θ̄_γ − θ∗ ≃ γ R′′(θ∗)^{−1} R′′′(θ∗) (R′′(θ∗) ⊗ I + I ⊗ R′′(θ∗))^{−1} E_ε[ε(θ∗)^{⊗2}].

Overall, θ̄_γ − θ∗ = γ∆ + O(γ^2).

SLIDE 37

Constant learning rate SGD: convergence in the non-quadratic case

[Figure (animation over four slides): iterates θ_0, θ_1, …, θ_n and their averages θ̄_1, θ̄_2, … converging to θ̄_γ ≠ θ∗ in the non-quadratic case.]

SLIDE 41

Convergence of second order moments, γ > 0, n → +∞.

Non-asymptotic bound for the convergence of θ̄^γ_n − θ∗:

Proposition (Convergence of the Markov chain)
Let γ ∈ (0, 1/(2L)) and assume A1-A4. With ρ := (1 − γµ)^{1/2}:

E[θ̄^γ_k] − θ̄_γ = (1/k) ∫_{R^d} ψ_γ(θ) dν_0(θ) + O(ρ^k),

E[(θ̄^γ_k − θ̄_γ)^{⊗2}] = (1/k) ∫_{R^d} (ψ_γ(θ)ψ_γ(θ)^⊤ − (ψ_γ − ϕ)(θ)(ψ_γ − ϕ)(θ)^⊤) dπ_γ(θ)
 + (1/k^2) ∫_{R^d} (ψ_γ(θ)ψ_γ(θ)^⊤ + χ^1_γ(θ) − χ^2_γ(θ)) dν_0(θ) + O(ρ^k),

where
◮ ϕ(θ) = θ − θ∗, and ψ_γ is the Poisson solution associated with ϕ;
◮ χ^1_γ is the Poisson solution associated with ϕϕ^⊤;
◮ χ^2_γ is the Poisson solution associated with (ψ_γ − ϕ)(ψ_γ − ϕ)^⊤.

Bias-variance decomposition.

SLIDE 42

Convergence of second order moments, proof.

◮ Algebraic calculation (R_γ encodes a linear relationship between the distributions of the θ^γ_k).
◮ For the first result:

E[θ̄^γ_k − θ∗] = (1/k) Σ_{i=0}^{k−1} (R_γ^i ϕ)(θ_0)
 = π_γ(ϕ) + (1/k) (ψ_γ(θ_0) − R_γ^k ψ_γ(θ_0)),

using the invariance R_γ^i π_γ(ϕ) = π_γ(ϕ), the Poisson equation (telescoping), and R_γ^k ψ_γ(θ_0) = O(ρ^k).

SLIDE 43

Recovering Least mean squares

If f(θ) = (1/2) E_ρ[(Y − ⟨Φ(X), θ⟩)^2], then we can compute the Poisson solutions explicitly: this recovers Défossez and Bach (2015).

Corollary (Convergence in the quadratic case)
Consider LMS with γL ≤ 1/2, and denote by ξ the additive part of the noise∗. One has:

E[(θ̄^γ_k − θ∗)^{⊗2}] = (1/(k^2 γ^2)) Σ^{−1} Ω (θ_0 − θ∗)^{⊗2} Σ^{−1} + (1/k) Σ^{−1} E[ε^{⊗2}] Σ^{−1}
 − (1/(k^2 γ)) Σ^{−1} Ω (Σ ⊗ I + I ⊗ Σ − γT)^{−1} E[ξ^{⊗2}] Σ^{−1} + O(ρ^k),

with Ω := (Σ ⊗ I + I ⊗ Σ − γΣ ⊗ Σ)(Σ ⊗ I + I ⊗ Σ − γT)^{−1} and T : A ↦ E[(x^⊤ A x) xx^⊤]. In short:

E[(θ̄^γ_k − θ∗)^{⊗2}] ≃ (1/(k^2 γ^2)) Σ^{−1} (θ_0 − θ∗)^{⊗2} Σ^{−1}  [Bias]
 + (1/k) Σ^{−1} E[ε^{⊗2}] Σ^{−1}  [Variance]
 + O(ρ^k).

∗ f′_n(θ) = (Φ(x_n)Φ(x_n)^⊤ − Σ)(θ − θ∗) + (⟨θ∗, Φ(x_n)⟩ − y_n)Φ(x_n).

SLIDE 44

Take home message

◮ Convergence in distribution of the Markov chain (Wasserstein metric).
◮ Allows us to prove and analyze the convergence to 0 of the moments of the chain (can be generalized to any function).
◮ We provide a second-order development as γ → 0:

θ̄_γ = θ∗ + γ∆_1 + γ^2∆_2 + o(γ^2).

◮ Error decomposition as a sum of three terms:

f(θ̄^γ_n) − f(θ∗) ≤ Bias/(γ^2 n^2 µ) + Var/n + γ^2/µ.

◮ As a consequence, we recover the rate, for γ = 1/√n:

f(θ̄^γ_n) − f(θ∗) = O(1/(nµ)).

◮ Beyond: comparison to the continuous gradient flow for a more general approach.

SLIDE 45

Richardson extrapolation

[Figure (animation over six slides): iterates θ^γ_0, …, θ^γ_n fluctuate around θ̄_γ at scale O_p(γ^{1/2}); the average satisfies θ̄^γ_n − θ̄_γ = O_p(n^{−1/2}); the remaining bias is θ∗ − θ̄_γ = O(γ).]

Recovering convergence closer to θ∗ by Richardson extrapolation: use 2θ̄^γ_n − θ̄^{2γ}_n (see the sketch below).
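A sketch of Richardson extrapolation for constant-step SGD: run two averaged chains with steps γ and 2γ, then combine them as 2θ̄^γ_n − θ̄^{2γ}_n to cancel the first-order bias γ∆. The 1-D gradient R′(θ) = θ + 0.3θ^2 (locally strongly convex, non-quadratic, θ∗ = 0) is an illustrative assumption.

```python
import numpy as np

def averaged_chain(gamma, n, rng):
    theta, theta_bar = 0.0, 0.0
    for k in range(n):
        theta -= gamma * (theta + 0.3 * theta ** 2 + rng.normal())  # R'(θ) + ε
        theta_bar += (theta - theta_bar) / (k + 2)                  # on-line average
    return theta_bar

rng = np.random.default_rng(6)
n, gamma = 200000, 0.05
bar_g = averaged_chain(gamma, n, rng)
bar_2g = averaged_chain(2 * gamma, n, rng)
print(bar_g, bar_2g, 2 * bar_g - bar_2g)   # the extrapolation is closest to θ∗ = 0
```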

SLIDE 51

Experiments

[Figure: log10(R(θ) − R(θ∗)) vs. log10(n). Synthetic data, logistic regression, n = 8·10^6.]

SLIDE 52

Experiments: Double Richardson

[Figure: log10(R(θ) − R(θ∗)) vs. log10(n). Synthetic data, logistic regression, n = 8·10^6.]

"Richardson 3γ": estimator built using Richardson extrapolation on 3 different sequences (see the sketch below):

θ̃^3_n = (8/3) θ̄^γ_n − 2 θ̄^{2γ}_n + (1/3) θ̄^{4γ}_n.
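Continuing the earlier sketch (reusing its hypothetical averaged_chain), the "Richardson 3γ" combination would look like:

```python
# Combine three averaged chains at steps γ, 2γ, 4γ; the weights 8/3, −2, 1/3
# sum to 1 and cancel both the γ∆₁ and γ²∆₂ bias terms.
def double_richardson(gamma, n, rng):
    b1 = averaged_chain(gamma, n, rng)        # from the previous sketch
    b2 = averaged_chain(2 * gamma, n, rng)
    b4 = averaged_chain(4 * gamma, n, rng)
    return (8.0 / 3.0) * b1 - 2.0 * b2 + (1.0 / 3.0) * b4
```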

SLIDE 53

Real data

[Figure 1: log10(R(θ̄_n) − R(θ∗)) vs. log10(n). Logistic regression, Covertype dataset, n = 581012, d = 54.]

SLIDE 54

Directions

Open directions:

◮ Extending the proofs to the self-concordant setting.
◮ Does this three-term decomposition extend to decaying step sizes?
◮ Understanding the convex case more precisely.

SLIDE 55

Agarwal, A., Negahban, S., and Wainwright, M. J. (2012). Fast global convergence of gradient methods for high-dimensional statistical recovery. Annals of Statistics, 40(5):2452–2482.

Bach, F. and Moulines, E. (2011). Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems (NIPS), pages 451–459.

Bach, F. and Moulines, E. (2013). Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems (NIPS).

Bottou, L. and Bousquet, O. (2008). The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems (NIPS).

Défossez, A. and Bach, F. (2015). Averaged least-mean-squares: bias-variance trade-offs and optimal sampling distributions. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS).

Dieuleveut, A. and Bach, F. (2015). Non-parametric stochastic approximation with large step sizes. Annals of Statistics.

Dieuleveut, A., Flammarion, N., and Bach, F. (2016). Harder, better, faster, stronger convergence rates for least-squares regression. arXiv e-prints.

Fabian, V. (1968). On asymptotic normality in stochastic approximation. The Annals of Mathematical Statistics, pages 1327–1332.

Flammarion, N. and Bach, F. (2017). Stochastic composite least-squares regression with convergence rate O(1/n).

Jones, G. L. (2004). On the Markov chain central limit theorem. Probability Surveys, 1:299–320.

Lacoste-Julien, S., Schmidt, M., and Bach, F. (2012). A simpler approach to obtaining an O(1/t) rate for the stochastic projected subgradient method. arXiv e-prints 1212.2002.

Nemirovsky, A. S. and Yudin, D. B. (1983). Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience, John Wiley & Sons, New York. Translated from the Russian by E. R. Dawson.

Polyak, B. T. and Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855.

Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407.

Robbins, H. and Siegmund, D. (1985). A convergence theorem for non negative almost supermartingales and some applications. In Herbert Robbins Selected Papers, pages 111–135. Springer.

Ruppert, D. (1988). Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University Operations Research and Industrial Engineering.

Tsybakov, A. B. (2003). Optimal rates of aggregation. In Proceedings of the Annual Conference on Computational Learning Theory (COLT).