
High Order Methods for Empirical Risk Minimization Alejandro Ribeiro - PowerPoint PPT Presentation



  1. High Order Methods for Empirical Risk Minimization
     Alejandro Ribeiro, Department of Electrical and Systems Engineering, University of Pennsylvania, aribeiro@seas.upenn.edu
     Thanks to: Aryan Mokhtari, Mark Eisen, ONR, NSF
     DIMACS Workshop on Distributed Optimization, Information Processing, and Learning, August 21, 2017

  2. Introduction
     ◮ Introduction
     ◮ Incremental quasi-Newton algorithms
     ◮ Adaptive sample size algorithms
     ◮ Conclusions

  3. Large-scale empirical risk minimization
     ◮ We would like to solve statistical risk minimization ⇒ $\min_{w \in \mathbb{R}^p} \mathbb{E}_\theta [ f(w, \theta) ]$
     ◮ Distribution unknown, but have access to N independent realizations of θ
     ◮ We settle for solving the empirical risk minimization (ERM) problem
        $\min_{w \in \mathbb{R}^p} F(w) := \min_{w \in \mathbb{R}^p} \frac{1}{N} \sum_{i=1}^{N} f(w, \theta_i) = \min_{w \in \mathbb{R}^p} \frac{1}{N} \sum_{i=1}^{N} f_i(w)$
     ◮ Number of observations N is very large. Large dimension p as well
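A minimal sketch of the ERM objective on this slide, written in NumPy. The choice of a regularized logistic loss, the function names, and the data arrays X, y are illustrative assumptions, not part of the slides.

```python
import numpy as np

def empirical_risk(w, X, y, lam=1e-3):
    """F(w) = (1/N) sum_i f_i(w), with f_i an (assumed) regularized logistic loss."""
    margins = y * (X @ w)                       # y_i in {-1, +1}
    losses = np.log1p(np.exp(-margins))         # logistic loss per sample
    return losses.mean() + 0.5 * lam * (w @ w)  # average loss + strongly convex regularizer

def empirical_risk_grad(w, X, y, lam=1e-3):
    """Gradient of F(w) = (1/N) sum_i grad f_i(w) for the same illustrative loss."""
    N = X.shape[0]
    margins = y * (X @ w)
    coeff = -y / (1.0 + np.exp(margins))        # derivative of the logistic loss w.r.t. the margin
    return (X.T @ coeff) / N + lam * w
```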

  4. Distribute across time and space
     ◮ Handle large number of observations distributing samples across space and time ⇒ Thus, we want to do decentralized online optimization
     [Figure: samples θ drawn from the local functions f arrive over time at nodes 1, 2, 3, 4, which cooperate to minimize the aggregate function F]
     ◮ We’d like to design scalable decentralized online optimization algorithms
     ◮ Have scalable decentralized methods. Don’t have scalable online methods

  5. Optimization methods
     ◮ Stochastic methods: a subset of samples is used at each iteration
     ◮ SGD is the most popular; however, it is slow because of
        ⇒ Noise of stochasticity ⇒ Variance reduction (SAG, SAGA, SVRG, ...)
        ⇒ Poor curvature approximation ⇒ Stochastic QN (SGD-QN, RES, oLBFGS, ...)
     ◮ Decentralized methods: samples are distributed over multiple processors
        ⇒ Primal methods: DGD, Acc. DGD, NN, ...
        ⇒ Dual methods: DDA, DADMM, DQM, EXTRA, ESOM, ...
     ◮ Adaptive sample size methods: start with a subset of samples and increase the size of the training set at each iteration
        ⇒ Ada Newton
        ⇒ The solutions are close when the numbers of samples are close

  6. Incremental quasi-Newton algorithms
     ◮ Introduction
     ◮ Incremental quasi-Newton algorithms
     ◮ Adaptive sample size algorithms
     ◮ Conclusions

  7. Incremental Gradient Descent
     ◮ Objective function gradients ⇒ $s(w) := \nabla F(w) = \frac{1}{N} \sum_{i=1}^{N} \nabla f(w, \theta_i)$
     ◮ (Deterministic) gradient descent iteration ⇒ $w^{t+1} = w^t - \epsilon_t \, s(w^t)$
     ◮ Evaluation of (deterministic) gradients is not computationally affordable
     ◮ Incremental/Stochastic gradient ⇒ Sample average in lieu of expectations
        $\hat{s}(w, \tilde{\theta}) = \frac{1}{L} \sum_{l=1}^{L} \nabla f(w, \theta_l), \qquad \tilde{\theta} = [\theta_1; \ldots; \theta_L]$
     ◮ Functions are chosen cyclically or at random, with or without replacement
     ◮ Incremental gradient descent iteration ⇒ $w^{t+1} = w^t - \epsilon_t \, \hat{s}(w^t, \tilde{\theta}^t)$
     ◮ (Incremental) gradient descent is (very) slow. Newton is impractical
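A minimal sketch of the incremental gradient iteration above. The per-sample gradient grad_f(w, theta), the constant step size, the mini-batch size L, and the random sampling scheme are illustrative assumptions supplied by the user.

```python
import numpy as np

def incremental_gradient_descent(grad_f, thetas, w0, eps=0.01, L=1, num_iters=1000, rng=None):
    """w^{t+1} = w^t - eps_t * (1/L) sum_l grad_f(w^t, theta_l) over a random mini-batch."""
    rng = np.random.default_rng() if rng is None else rng
    w = w0.copy()
    N = len(thetas)
    for t in range(num_iters):
        batch = rng.choice(N, size=L, replace=False)           # sample L indices without replacement
        s_hat = sum(grad_f(w, thetas[i]) for i in batch) / L   # stochastic gradient estimate
        w = w - eps * s_hat                                    # a decaying step eps/(t+1) also works
    return w
```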

  8. Incremental aggregated gradient method
     ◮ Utilize memory to reduce variance of stochastic gradient approximation
     [Figure: table of stored gradients $\nabla f_1^t, \ldots, \nabla f_{i_t}^t, \ldots, \nabla f_N^t$; the entry for $f_{i_t}$ is replaced by $\nabla f_{i_t}(w^{t+1})$ to form the time t+1 table]
     ◮ Descend along incremental gradient ⇒ $w^{t+1} = w^t - \frac{\alpha}{N} \sum_{i=1}^{N} \nabla f_i^t = w^t - \alpha g^t$
     ◮ Select update index $i_t$ cyclically. Uniformly at random is similar
     ◮ Update gradient corresponding to function $f_{i_t}$ ⇒ $\nabla f_{i_t}^{t+1} = \nabla f_{i_t}(w^{t+1})$
     ◮ Sum easy to compute ⇒ $g^{t+1} = g^t - \frac{1}{N} \nabla f_{i_t}^{t} + \frac{1}{N} \nabla f_{i_t}^{t+1}$. Converges linearly
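A sketch of the aggregated-gradient idea on this slide: keep a table with the most recent gradient of every $f_i$, descend along their average, and refresh one entry per iteration. The function names, the list-based gradient table, and the cyclic index selection are illustrative choices.

```python
import numpy as np

def incremental_aggregated_gradient(grad_fs, w0, alpha=0.01, num_passes=10):
    """IAG-style iteration: w^{t+1} = w^t - alpha * g^t, g^t = average of stored gradients."""
    N = len(grad_fs)
    w = w0.copy()
    grad_table = [grad_fs[i](w) for i in range(N)]   # memory: one gradient per function f_i
    g = sum(grad_table) / N                          # running average g^t
    for t in range(num_passes * N):
        w = w - alpha * g                            # descend along the aggregated gradient
        i = t % N                                    # select update index i_t cyclically
        new_grad = grad_fs[i](w)                     # grad f_{i_t}(w^{t+1})
        g = g + (new_grad - grad_table[i]) / N       # refresh the average in O(p) work
        grad_table[i] = new_grad
    return w
```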

  9. BFGS quasi-Newton method
     ◮ Approximate function’s curvature with Hessian approximation matrix $B_t^{-1}$
        $w^{t+1} = w^t - \epsilon_t B_t^{-1} s(w^t)$
     ◮ Make $B_t$ close to $H(w^t) := \nabla^2 F(w^t)$. Broyden, DFP, BFGS
     ◮ Variable variation: $v_t = w^{t+1} - w^t$. Gradient variation: $r_t = s(w^{t+1}) - s(w^t)$
     ◮ Matrix $B_{t+1}$ satisfies secant condition $B_{t+1} v_t = r_t$. Underdetermined
     ◮ Resolve indeterminacy making $B_{t+1}$ closest to previous approximation $B_t$
     ◮ Using Gaussian relative entropy as proximity condition yields update
        $B_{t+1} = B_t + \frac{r_t r_t^T}{v_t^T r_t} - \frac{B_t v_t v_t^T B_t}{v_t^T B_t v_t}$
     ◮ Superlinear convergence ⇒ Close enough to quadratic rate of Newton
     ◮ BFGS requires gradients ⇒ Use incremental gradients
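A sketch of the BFGS curvature update written out on this slide: given the previous approximation $B_t$, the variable variation $v_t$, and the gradient variation $r_t$, it returns $B_{t+1}$ satisfying the secant condition $B_{t+1} v_t = r_t$. The skipped-update safeguard is a common practical addition, not something stated on the slide.

```python
import numpy as np

def bfgs_update(B, v, r, tol=1e-10):
    """B_{t+1} = B_t + r r^T / (v^T r) - B v v^T B / (v^T B v), so that B_{t+1} v = r."""
    rv = r @ v
    Bv = B @ v
    vBv = v @ Bv
    if rv < tol or vBv < tol:          # safeguard: skip the update if curvature is uninformative
        return B
    return B + np.outer(r, r) / rv - np.outer(Bv, Bv) / vBv
```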

  10. Incremental BFGS method
     ◮ Keep memory of variables $z_i^t$, Hessian approximations $B_i^t$, and gradients $\nabla f_i^t$
        ⇒ Functions indexed by i. Time indexed by t. Select function $f_{i_t}$ at time t
     [Figure: the memories $z_1^t, \ldots, z_N^t$, $B_1^t, \ldots, B_N^t$, and $\nabla f_1^t, \ldots, \nabla f_N^t$ feed the computation of $w^{t+1}$]
     ◮ All gradients, matrices, and variables used to update $w^{t+1}$

  11. Incremental BFGS method
     ◮ Keep memory of variables $z_i^t$, Hessian approximations $B_i^t$, and gradients $\nabla f_i^t$
        ⇒ Functions indexed by i. Time indexed by t. Select function $f_{i_t}$ at time t
     [Figure: the new iterate $w^{t+1}$ is used to evaluate $\nabla f_{i_t}(w^{t+1})$]
     ◮ Updated variable $w^{t+1}$ used to update gradient $\nabla f_{i_t}^{t+1} = \nabla f_{i_t}(w^{t+1})$

  12. Incremental BFGS method
     ◮ Keep memory of variables $z_i^t$, Hessian approximations $B_i^t$, and gradients $\nabla f_i^t$
        ⇒ Functions indexed by i. Time indexed by t. Select function $f_{i_t}$ at time t
     [Figure: $w^{t+1}$ and $\nabla f_{i_t}(w^{t+1})$ are combined to form the updated Hessian approximation $B_{i_t}^{t+1}$]
     ◮ Update $B_{i_t}^t$ to satisfy the secant condition for function $f_{i_t}$, for variable variation $w^{t+1} - z_{i_t}^t$ and gradient variation $\nabla f_{i_t}^{t+1} - \nabla f_{i_t}^t$ (more later)

  13. Incremental BFGS method
     ◮ Keep memory of variables $z_i^t$, Hessian approximations $B_i^t$, and gradients $\nabla f_i^t$
        ⇒ Functions indexed by i. Time indexed by t. Select function $f_{i_t}$ at time t
     [Figure: the memories for function $f_{i_t}$ are overwritten to form the time t+1 memories $z_i^{t+1}$, $B_i^{t+1}$, $\nabla f_i^{t+1}$]
     ◮ Update variable, Hessian approximation, and gradient memory for function $f_{i_t}$

  14. Update of Hessian approximation matrices
     ◮ Variable variation at time t for function $f_i = f_{i_t}$ ⇒ $v_i^t := z_i^{t+1} - z_i^t$
     ◮ Gradient variation at time t for function $f_i = f_{i_t}$ ⇒ $r_i^t := \nabla f_i^{t+1} - \nabla f_i^t$
     ◮ Update $B_i^t = B_{i_t}^t$ to satisfy secant condition for variations $v_i^t$ and $r_i^t$
        $B_i^{t+1} = B_i^t + \frac{r_i^t (r_i^t)^T}{(r_i^t)^T v_i^t} - \frac{B_i^t v_i^t (v_i^t)^T B_i^t}{(v_i^t)^T B_i^t v_i^t}$
     ◮ We want $B_i^t$ to approximate the Hessian of the function $f_i = f_{i_t}$
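The per-function update on this slide is the BFGS formula applied to the stored copies for $f_{i_t}$. A minimal sketch, assuming the three memories are kept as indexable containers z_mem, B_mem, grad_mem (these names and the safeguard are illustrative, not from the slides):

```python
import numpy as np

def update_memory(i, w_new, grad_new, z_mem, B_mem, grad_mem, tol=1e-10):
    """Refresh the memories of f_i after computing grad_new = grad f_i(w_new)."""
    v = w_new - z_mem[i]            # variable variation  v_i^t = z_i^{t+1} - z_i^t
    r = grad_new - grad_mem[i]      # gradient variation  r_i^t = grad f_i^{t+1} - grad f_i^t
    Bv = B_mem[i] @ v
    rv, vBv = r @ v, v @ Bv
    if rv > tol and vBv > tol:      # BFGS update of B_i^t, skipped if curvature is uninformative
        B_mem[i] = B_mem[i] + np.outer(r, r) / rv - np.outer(Bv, Bv) / vBv
    z_mem[i] = w_new.copy()         # z_i^{t+1} = w^{t+1}
    grad_mem[i] = grad_new.copy()   # store grad f_i(w^{t+1})
```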

  15. A naive (in hindsight) incremental BFGS method
     ◮ The key is in the update of $w^t$. Use memory in stochastic quantities
        $w^{t+1} = w^t - \left[ \frac{1}{N} \sum_{i=1}^{N} B_i^t \right]^{-1} \left[ \frac{1}{N} \sum_{i=1}^{N} \nabla f_i^t \right]$
     ◮ It doesn’t work ⇒ Better than incremental gradient but not superlinear
     ◮ Optimization updates are solutions of function approximations
     ◮ In this particular update we are minimizing the quadratic form
        $f(w) \approx \frac{1}{N} \sum_{i=1}^{N} \left[ f_i(z_i^t) + \nabla f_i(z_i^t)^T (w - w^t) + \frac{1}{2} (w - w^t)^T B_i^t (w - w^t) \right]$
     ◮ Gradients evaluated at $z_i^t$. Secant condition verified at $z_i^t$
     ◮ The quadratic form is centered at $w^t$. Not a reasonable Taylor series

  16. A proper Taylor series expansion
     ◮ Each individual function $f_i$ is being approximated by the quadratic
        $f_i(w) \approx f_i(z_i^t) + \nabla f_i(z_i^t)^T (w - w^t) + \frac{1}{2} (w - w^t)^T B_i^t (w - w^t)$
     ◮ To have a proper expansion we have to recenter the quadratic form at $z_i^t$
        $f_i(w) \approx f_i(z_i^t) + \nabla f_i(z_i^t)^T (w - z_i^t) + \frac{1}{2} (w - z_i^t)^T B_i^t (w - z_i^t)$
     ◮ I.e., we approximate $f(w)$ with the aggregate quadratic function
        $f(w) \approx \frac{1}{N} \sum_{i=1}^{N} \left[ f_i(z_i^t) + \nabla f_i(z_i^t)^T (w - z_i^t) + \frac{1}{2} (w - z_i^t)^T B_i^t (w - z_i^t) \right]$
     ◮ This is now a reasonable Taylor series that we use to derive an update
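As a bridge to the update on the next slide, a short derivation that the slides leave implicit: set the gradient of the aggregate quadratic above to zero and solve for w.

```latex
% First-order optimality condition of the aggregate quadratic approximation
\frac{1}{N}\sum_{i=1}^{N}\Big[\nabla f_i(z_i^t) + B_i^t\,(w - z_i^t)\Big] = 0
\;\Longrightarrow\;
\Big(\sum_{i=1}^{N} B_i^t\Big) w \;=\; \sum_{i=1}^{N} B_i^t z_i^t \;-\; \sum_{i=1}^{N}\nabla f_i(z_i^t)
\;\Longrightarrow\;
w^{t+1} = \Big(\sum_{i=1}^{N} B_i^t\Big)^{-1}\Big(\sum_{i=1}^{N} B_i^t z_i^t - \sum_{i=1}^{N}\nabla f_i(z_i^t)\Big).
```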

  17. Incremental BFGS
     ◮ Solving this quadratic program yields the update for the IQN method
        $w^{t+1} = \left[ \frac{1}{N} \sum_{i=1}^{N} B_i^t \right]^{-1} \left[ \frac{1}{N} \sum_{i=1}^{N} B_i^t z_i^t - \frac{1}{N} \sum_{i=1}^{N} \nabla f_i(z_i^t) \right]$
     ◮ Looks difficult to implement but it is more similar to BFGS than apparent
     ◮ As in BFGS, it can be implemented with $O(p^2)$ operations
        ⇒ Write as rank-2 update, use matrix inversion lemma
        ⇒ Independently of N. True incremental method
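A sketch of one full iteration built from the pieces above: form $w^{t+1} = (\sum_i B_i^t)^{-1}(\sum_i B_i^t z_i^t - \sum_i \nabla f_i(z_i^t))$, then refresh the memories of the selected function. For readability this version solves the p x p linear system directly, which costs $O(p^3)$ per step rather than the $O(p^2)$ rank-2/matrix-inversion-lemma implementation the slide refers to; all names are illustrative assumptions.

```python
import numpy as np

def iqn_step(t, fs_grad, z_mem, B_mem, grad_mem, tol=1e-10):
    """One IQN-style iteration: compute w^{t+1}, then update the memories of f_{i_t}."""
    N = len(z_mem)
    B_sum = sum(B_mem)                                          # sum_i B_i^t  (1/N factors cancel)
    rhs = sum(B_mem[i] @ z_mem[i] for i in range(N)) - sum(grad_mem)
    w_new = np.linalg.solve(B_sum, rhs)                         # (sum_i B_i^t)^{-1} (...)

    i = t % N                                                   # select f_{i_t} cyclically
    g_new = fs_grad[i](w_new)                                   # grad f_{i_t}(w^{t+1})
    v = w_new - z_mem[i]                                        # variable variation
    r = g_new - grad_mem[i]                                     # gradient variation
    Bv = B_mem[i] @ v
    rv, vBv = r @ v, v @ Bv
    if rv > tol and vBv > tol:                                  # BFGS update of B_{i_t}^t
        B_mem[i] = B_mem[i] + np.outer(r, r) / rv - np.outer(Bv, Bv) / vBv
    z_mem[i], grad_mem[i] = w_new.copy(), g_new.copy()
    return w_new
```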
