Optimization for Machine Learning: Beyond Stochastic Gradient Descent - PowerPoint PPT Presentation



  1. Optimization for Machine Learning: Beyond Stochastic Gradient Descent. Elad Hazan. References and more info: http://www.cs.princeton.edu/~ehazan/tutorial/MLSStutorial.htm. Based on: [Agarwal, Bullins, Hazan ICML ’16], [Agarwal, Allen-Zhu, Bullins, Hazan, Ma STOC ’17], [Hazan, Singh, Zhang ICML ’17], [Agarwal, Hazan COLT ’17], [Agarwal, Bullins, Chen, Hazan, Singh, Zhang, Zhang ’18]

  2. Princeton-Google Brain team: Naman Agarwal, Brian Bullins, Xinyi Chen, Karan Singh, Cyril Zhang, Yi Zhang

  3. Deep net, SVM, boosted decision stump, …: a model is a function of vectors $f_w(a)$ mapping a distribution over input vectors $\{a\} \in \mathbb{R}^d$ to a label (chair/car), with model parameters $w$.

  4. [Figure: a non-convex loss surface over the parameter space.] Minimize incorrect chair/car predictions on the training set. This talk: faster optimization via 1. second-order methods, 2. adaptive regularization.

  5. (Non-Convex) Optimization in ML: model $\{x\} \in \mathbb{R}^d$, distribution over labeled examples. Minimize the training loss: $\min_{x \in \mathbb{R}^d} f(x)$, where $f(x) = \frac{1}{m} \sum_{i=1}^{m} \ell(x, a_i, b_i)$. Training set size ($m$) and dimension of the data ($d$) are very large: days/weeks to train.
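To make the objective concrete, here is a minimal NumPy sketch of the ERM objective above, instantiated, purely as an illustrative assumption, with L2-regularized logistic loss on synthetic data; the helper names (make_data, erm_loss, erm_grad) are hypothetical.

```python
import numpy as np

def make_data(m=1000, d=20, seed=0):
    """Synthetic training set: rows of A are examples a_i, b holds +/-1 labels b_i."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(m, d))
    b = np.sign(A @ rng.normal(size=d) + 0.1 * rng.normal(size=m))
    return A, b

def erm_loss(x, A, b, lam=1e-3):
    """f(x) = (1/m) sum_i log(1 + exp(-b_i a_i^T x)) + (lam/2) ||x||^2."""
    z = b * (A @ x)
    return np.mean(np.logaddexp(0.0, -z)) + 0.5 * lam * x @ x

def erm_grad(x, A, b, lam=1e-3):
    z = np.clip(b * (A @ x), -30, 30)
    s = 1.0 / (1.0 + np.exp(z))          # sigmoid(-z)
    return -(A.T @ (b * s)) / len(b) + lam * x

A, b = make_data()
x0 = np.zeros(A.shape[1])
print(erm_loss(x0, A, b))                # log(2) at x = 0
```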

  6. Gradient Descent. Given a first-order oracle: $\nabla f(x)$, with $\|\nabla f(x)\| \le G$. Iteratively: $x_{t+1} \leftarrow x_t - \eta\, \nabla f(x_t)$. Theorem: for smooth bounded functions, with step size $\eta \sim O(1)$ (depending on the smoothness), $\frac{1}{T} \sum_t \|\nabla f(x_t)\|^2 \sim O\!\left(\frac{1}{T}\right)$.
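A minimal sketch of the update and guarantee above, assuming a simple smooth quadratic test objective and an illustrative step size.

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.05, T=500):
    """Iterate x_{t+1} = x_t - eta * grad f(x_t); track the average squared gradient norm."""
    x = x0.copy()
    sq_norms = []
    for _ in range(T):
        g = grad_f(x)
        sq_norms.append(float(g @ g))
        x = x - eta * g
    # For smooth, bounded f and eta ~ 1/smoothness, this average decays like O(1/T).
    return x, float(np.mean(sq_norms))

# Illustrative smooth objective f(x) = 0.5 * x^T Q x (smoothness = largest eigenvalue of Q).
Q = np.diag([1.0, 5.0, 10.0])
x_T, avg_sq_grad = gradient_descent(lambda x: Q @ x, x0=np.ones(3))
print(avg_sq_grad)
```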

  7. Stochastic Gradient Descent [Robbins & Monro ’51]. Given a stochastic first-order oracle: $\mathbb{E}[\widetilde{\nabla} f(x)] = \nabla f(x)$, with $\mathbb{E}\,\|\widetilde{\nabla} f(x)\|^2 \le \sigma^2$. Iteratively: $x_{t+1} \leftarrow x_t - \eta\, \widetilde{\nabla} f(x_t)$. Theorem [Ghadimi-Lan ’15]: for smooth bounded functions, with step size $\eta = O\!\big(\tfrac{1}{\sqrt{T}}\big)$, $\frac{1}{T} \sum_t \|\nabla f(x_t)\|^2 \sim O\!\left(\frac{\sigma}{\sqrt{T}}\right)$.

  8. SGD: $x_{t+1} \leftarrow x_t - \eta_t \cdot \widetilde{\nabla} f(x_t)$

  9. SGD++: $x_{t+1} \leftarrow x_t - \eta_t \cdot \widetilde{\nabla} f(x_t)$, augmented with Variance Reduction [Le Roux, Schmidt, Bach ’12], …, Momentum [Nesterov ’83], …, Adaptive Regularization [Duchi, Hazan, Singer ’10], … Are we at the limit of (stochastic) gradient methods? Woodworth-Srebro ’16: yes!
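A sketch of the "SGD++" template: the plain SGD update above combined with two of the listed add-ons, heavy-ball momentum and an AdaGrad-style diagonal adaptive step size. The noisy gradient oracle and all hyperparameters are illustrative assumptions.

```python
import numpy as np

def sgd_plus_plus(stoch_grad, x0, eta=0.1, beta=0.9, eps=1e-8, T=1000):
    """SGD with heavy-ball momentum and an AdaGrad-style per-coordinate step size."""
    x = x0.copy()
    momentum = np.zeros_like(x)
    grad_sq_sum = np.zeros_like(x)            # adaptive-regularization statistic
    for t in range(T):
        g = stoch_grad(x, t)
        grad_sq_sum += g * g
        momentum = beta * momentum + g
        x = x - eta * momentum / (np.sqrt(grad_sq_sum) + eps)
    return x

# Illustrative noisy oracle for f(x) = 0.5 * ||x||^2: true gradient plus Gaussian noise.
rng = np.random.default_rng(0)
x_hat = sgd_plus_plus(lambda x, t: x + 0.1 * rng.normal(size=x.shape), x0=np.ones(10))
print(np.linalg.norm(x_hat))
```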

  10. Rosenbrock function
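The Rosenbrock function is the classic ill-conditioned, non-convex benchmark where plain gradient descent crawls along a curved valley; a small sketch (step size, start point, and iteration count are illustrative choices):

```python
import numpy as np

def rosenbrock(p, a=1.0, b=100.0):
    x, y = p
    return (a - x) ** 2 + b * (y - x ** 2) ** 2       # global minimum at (a, a^2) = (1, 1)

def rosenbrock_grad(p, a=1.0, b=100.0):
    x, y = p
    return np.array([-2.0 * (a - x) - 4.0 * b * x * (y - x ** 2),
                     2.0 * b * (y - x ** 2)])

# Gradient descent needs a tiny step size here (curvature across the valley is far
# larger than along it), so progress toward (1, 1) is very slow.
p = np.array([-1.2, 1.0])
for _ in range(50000):
    p = p - 3e-4 * rosenbrock_grad(p)
print(p, rosenbrock(p))
```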

  11. Higher Order Optimization • Gradient Descent – Direction of Steepest Descent • Second Order Methods – Use Local Curvature

  12. Newton’s method (+ trust region): $x_{t+1} = x_t - \eta\, [\nabla^2 f(x)]^{-1} \nabla f(x)$. For a non-convex function the step can move to $\infty$. Solution: solve a quadratic approximation in a local area (trust region).
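A minimal sketch of a damped Newton step, a crude stand-in for the trust-region fix described above: shifting the Hessian by a multiple of the identity keeps the quadratic model positive definite, so the step stays finite and local even where the Hessian is indefinite. The toy saddle function and the damping rule are illustrative assumptions.

```python
import numpy as np

def f(p):
    x, y = p
    return x ** 2 - y ** 2 + 0.25 * y ** 4            # non-convex: saddle point at the origin

def grad_f(p):
    x, y = p
    return np.array([2.0 * x, -2.0 * y + y ** 3])

def hess_f(p):
    x, y = p
    return np.array([[2.0, 0.0],
                     [0.0, -2.0 + 3.0 * y ** 2]])

def damped_newton(p, lam=1.0, steps=50):
    """x_{t+1} = x_t - (H + shift*I)^{-1} grad f(x_t), with shift chosen so H + shift*I >= lam*I."""
    for _ in range(steps):
        H = hess_f(p)
        shift = max(lam, lam - float(np.linalg.eigvalsh(H).min()))
        p = p - np.linalg.solve(H + shift * np.eye(len(p)), grad_f(p))
    return p

print(damped_newton(np.array([0.5, 0.1])))            # converges to a local minimum near (0, sqrt(2))
```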

  13. Newton’s method (+ trust region): $x_{t+1} = x_t - \eta\, [\nabla^2 f(x)]^{-1} \nabla f(x)$. Obstacles: 1. $d^3$ time per iteration, infeasible for ML! 2. A stochastic difference of gradients ≠ the Hessian. … till recently :)

  14. Speed up the Newton direction computation?
  • Spielman-Teng ’04: diagonally dominant systems of equations in linear time!
  • 2015 Gödel Prize
  • Used by Daitch-Spielman for faster flow algorithms
  • Faster/simpler by Srivastava, Koutis, Miller, Peng, and others…
  • Erdogdu-Montanari ’15: low-rank approximation & inversion by Sherman-Morrison
  • Allows stochastic information
  • Still prohibitive: rank × $d^2$

  15. Our results – part 1 of the talk
  • A natural stochastic Newton method
  • Every iteration in $O(d)$ time, linear in the input sparsity
  • Coupled with matrix sampling/sketching techniques
  • Best known running time for $m \gg d$, for both convex and non-convex optimization; provably faster than first-order methods

  16. Stochastic Newton? $x_{t+1} = x_t - \eta\, [\nabla^2 f(x)]^{-1} \nabla f(x)$ (convex case for illustration)
  • ERM, rank-1 loss: $\arg\min_x\; \mathbb{E}_{i \sim [m]}\big[\ell(x^\top a_i, b_i) + \frac{\lambda}{2}\|x\|^2\big]$
  • Unbiased estimator of the Hessian: $\widetilde{\nabla}^2 = a_i a_i^\top \cdot \ell''(x^\top a_i, b_i) + \lambda I$, with $i \sim U[1, \dots, m]$
  • Clearly $\mathbb{E}[\widetilde{\nabla}^2] = \nabla^2 f$, but $\mathbb{E}[(\widetilde{\nabla}^2)^{-1}] \neq (\nabla^2 f)^{-1}$

  17. Circumvent Hessian creation and inversion! Three steps:
  • (1) Represent the Hessian inverse as an infinite series (for $0 \prec \nabla^2 f \preceq I$): $\nabla^{-2} f = \sum_{i=0}^{\infty} (I - \nabla^2 f)^i$
  • (2) Sample from the infinite series (Hessian-gradient product), ONCE: for any distribution over the naturals $i \sim D$, $\nabla^{-2} f\, \nabla f = \sum_{i=0}^{\infty} (I - \nabla^2 f)^i\, \nabla f = \mathbb{E}_{i \sim D}\!\left[\frac{(I - \nabla^2 f)^i\, \nabla f}{\Pr[i]}\right]$
  • (3) Estimate the Hessian power by sampling i.i.d. data examples (single-example Hessians, vector-vector products only): $\nabla^{-2} f\, \nabla f = \mathbb{E}_{i \sim D,\; k \sim [i]}\!\left[\frac{1}{\Pr[i]} \prod_{k=1}^{i} \big(I - \nabla^2 f_k\big)\, \nabla f\right]$
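A sketch of the three steps for L2-regularized logistic regression, where each single-example Hessian-vector product has a closed form: draw the series index from a (truncated) geometric distribution, apply that many independent single-example factors $(I - \nabla^2 f_k)$, and reweight by $1/\Pr[i]$. The geometric distribution, the truncation, and the scaling used to keep the Hessian's spectral norm at most 1 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def full_grad(x, A, b, lam):
    s = sigmoid(-b * (A @ x))
    return -(A.T @ (b * s)) / len(b) + lam * x

def example_hvp(x, v, a_k, b_k, lam):
    """Single-example Hessian-vector product: (w_k a_k a_k^T + lam I) v, vector-vector work only."""
    z = b_k * (a_k @ x)
    w = sigmoid(z) * sigmoid(-z)
    return w * (a_k @ v) * a_k + lam * v

def newton_dir_sample(x, A, b, lam, scale, p=0.1, max_j=100):
    """One sample of nabla^{-2} f * nabla f: index j ~ geometric, then apply (I - H_k/scale) j times."""
    g = full_grad(x, A, b, lam)
    j = min(int(rng.geometric(p)) - 1, max_j)      # j in {0, 1, 2, ...}, Pr[j] = p (1-p)^j
    prob_j = p * (1.0 - p) ** j
    v = g.copy()
    for _ in range(j):
        k = rng.integers(len(b))                   # fresh i.i.d. example per factor
        v = v - example_hvp(x, v, A[k], b[k], lam) / scale
    # Reweight by 1/Pr[j] and undo the Hessian scaling; note this single-term
    # estimator has high variance, which the next slide's recursive estimator addresses.
    return v / (prob_j * scale)

A = rng.normal(size=(500, 10)); b = np.sign(A @ rng.normal(size=10))
lam, x = 1e-2, np.zeros(10)
scale = 0.25 * np.max(np.sum(A * A, axis=1)) + lam  # upper bound on the Hessian's spectral norm
direction = np.mean([newton_dir_sample(x, A, b, lam, scale) for _ in range(200)], axis=0)
```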

  18. Improved estimator
  • Previously: estimate a single term of the series per sample.
  • Recursive reformulation of the series: $A^{-1} = I + (I - A)\big(I + (I - A)(I + \cdots)\big)$; estimate via $\widetilde{A}^{-1}_0 = I$, $\widetilde{A}^{-1}_j = I + (I - \widetilde{A}_j)\, \widetilde{A}^{-1}_{j-1}$, with an independent single-example estimate $\widetilde{A}_j$ of $A = \nabla^2 f$ at each step.
  • Truncate after $S$ steps; typically $S \sim \kappa$ (the condition number of $f$).
  • $\mathbb{E}\big[\widetilde{A}^{-1}_S\big] \to A^{-1}$ as $S \to \infty$.
  • Repeat and average to reduce the variance.

  19. LiSSA: Linear-time Second-order Stochastic Algorithm, for $\arg\min_x\; \mathbb{E}_{i \sim [m]}\big[\ell(x^\top a_i, b_i) + \frac{\lambda}{2}\|x\|^2\big]$
  • Compute a full (large-batch) gradient $\nabla f$.
  • Use the estimator $\widetilde{\nabla}^{-2} f\, \nabla f$ defined previously & move there.
  • $V$ is a bound on the variance of the estimator: in practice a small constant (e.g. 1); in theory $V \le \kappa^2$.
  Theorem 1: for large $t$, LiSSA returns a point $x_t$ in the parameter space s.t. $f(x_t) \le f(x^*) + \epsilon$, in total time $\tilde{O}\big(\log\frac{1}{\epsilon}\,(m + V\kappa)\, d\big)$: fastest known! (& provably faster than first-order methods, Woodworth-Srebro ’16) → (with more tricks) $\tilde{O}\big(d \log^2\frac{1}{\epsilon}\,(m + \sqrt{\kappa m})\big)$
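A rough LiSSA-style sketch for the same L2-regularized logistic-regression objective, using the recursive, truncated estimator of the previous slide ($v_0 = \nabla f$, $v_j = \nabla f + (I - \widetilde{\nabla}^2 f_j)\, v_{j-1}$, repeated and averaged). The depth, number of repetitions, step size, and data are illustrative assumptions, not the tuned choices from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def full_grad(x, A, b, lam):
    s = sigmoid(-b * (A @ x))
    return -(A.T @ (b * s)) / len(b) + lam * x

def example_hvp(x, v, a_k, b_k, lam):
    z = b_k * (a_k @ x)
    return sigmoid(z) * sigmoid(-z) * (a_k @ v) * a_k + lam * v

def lissa_direction(x, A, b, lam, scale, depth=100, reps=5):
    """Estimate nabla^{-2} f * nabla f via v_j = g + (I - H_j/scale) v_{j-1}, averaged over reps."""
    g = full_grad(x, A, b, lam)
    estimates = []
    for _ in range(reps):                            # repeat & average to reduce variance
        v = g.copy()
        for _ in range(depth):
            k = rng.integers(len(b))                 # fresh single-example Hessian each step
            v = g + v - example_hvp(x, v, A[k], b[k], lam) / scale
        estimates.append(v)
    return np.mean(estimates, axis=0) / scale        # undo the Hessian scaling

def lissa(A, b, lam=1e-2, eta=1.0, iters=20):
    x = np.zeros(A.shape[1])
    scale = 0.25 * np.max(np.sum(A * A, axis=1)) + lam   # makes the scaled Hessian <= I
    for _ in range(iters):
        x = x - eta * lissa_direction(x, A, b, lam, scale)
    return x

A = rng.normal(size=(2000, 20)); b = np.sign(A @ rng.normal(size=20) + 0.1 * rng.normal(size=2000))
x_hat = lissa(A, b)
print(np.linalg.norm(full_grad(x_hat, A, b, 1e-2)))      # gradient norm shrinks over iterations
```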

  20. Hessian-vector products for neural networks in time $O(d)$ (Pearlmutter trick)
  $\nabla^{-2} f\, \nabla f = \mathbb{E}_{i \sim D,\; k \sim [i]}\!\left[\frac{1}{\Pr[i]} \prod_{k=1}^{i} \big(I - \nabla^2 f_k\big)\, \nabla f\right]$ needs only Hessian-vector products.
  • $f(h)$ is computed via a differentiable circuit of size $O(d)$.
  • $\nabla f(h)$ is computed via a differentiable circuit of size $O(d)$ (backpropagation).
  • Define $g(h) = \nabla f(h)^\top v$; then $\nabla g(h) = \nabla^2 f(h)\, v$.
  • Hence there exists an $O(d)$ circuit computing $\nabla^2 f(h)\, v$.
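To illustrate the interface the slide relies on, the sketch below checks a closed-form Hessian-vector product for logistic regression against a central difference of gradients; no d-by-d Hessian is ever formed. For a neural network the same O(d)-time product comes from the construction above (differentiate $g(h) = \nabla f(h)^\top v$ by backpropagation); the finite-difference version here is only an illustrative stand-in for that exact circuit.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(200, 10)); b = np.sign(A @ rng.normal(size=10))
lam = 1e-2

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def grad_f(x):
    s = sigmoid(-b * (A @ x))
    return -(A.T @ (b * s)) / len(b) + lam * x

def hvp_exact(x, v):
    """H v = (1/m) A^T diag(w) A v + lam v, computed with matrix-vector products only."""
    w = sigmoid(A @ x) * sigmoid(-(A @ x))
    return A.T @ (w * (A @ v)) / len(b) + lam * v

def hvp_fd(x, v, eps=1e-5):
    """Finite-difference stand-in: H v ~ (grad(x + eps v) - grad(x - eps v)) / (2 eps)."""
    return (grad_f(x + eps * v) - grad_f(x - eps * v)) / (2.0 * eps)

x, v = rng.normal(size=10), rng.normal(size=10)
print(np.max(np.abs(hvp_exact(x, v) - hvp_fd(x, v))))    # small: both give the same product
```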

  21. LiSSA for non-convex optimization (FastCubic)

| Method | Time to reach $\lVert\nabla f(x)\rVert \le \epsilon$ | Second-order guarantee | Assumption |
|---|---|---|---|
| Gradient Descent (folklore) | $1/\epsilon^{2}$ gradient calls | none | smoothness |
| Stochastic Gradient Descent (folklore) | $1/\epsilon^{4}$ stochastic gradients | none | smoothness |
| Noisy SGD (Ge et al.) | $\mathrm{poly}(d)/\epsilon^{4}$ | $\nabla^2 f(h) \succeq -\epsilon^{1/4} I$ | smoothness |
| Cubic Regularization (Nesterov & Polyak) | $1/\epsilon^{1.5}$ iterations, each needing the full Hessian | $\nabla^2 f(h) \succeq -\sqrt{\epsilon}\, I$ | smooth & second-order Lipschitz |
| Fast Cubic | $\approx 1/\epsilon^{1.75}$ using only gradients & Hessian-vector products | $\nabla^2 f(h) \succeq -\sqrt{\epsilon}\, I$ | smooth & second-order Lipschitz |

  22. 2nd-order information: new phenomena?
  • “Computational lens for deep nets”: experiment with 2nd-order information…
  • Trust region
  • Cubic regularization, eigenvalue methods…
  • Multiple hurdles:
  ○ Global optimization is NP-hard; even deciding whether you are at a local minimum is NP-hard
  ○ Goal: local minimum, $\|\nabla f(h)\| \le \epsilon$ and $\nabla^2 f(h) \succeq -\sqrt{\epsilon}\, I$
  [Figure: loss surface from a Bengio-group experiment.]

  23. Experimental results. Convex: clear improvements. Neural networks: doesn’t improve upon SGD. What goes wrong?

  24. Adaptive Regularization Strikes Back: $(GG^\top)^{-1/2}$. Princeton-Google Brain team: Naman Agarwal, Brian Bullins, Xinyi Chen, Elad Hazan, Karan Singh, Cyril Zhang, Yi Zhang

  25. Adaptive preconditioning. Newton’s method is a special case of preconditioning: make the loss surface more isotropic, $x \mapsto A^{1/2} x$.

  26. Modern ML is SGD++: $x_{t+1} \leftarrow x_t - \eta_t \cdot \widetilde{\nabla} f(x_t)$, with Variance Reduction [Le Roux, Schmidt, Bach ’12], …, Momentum [Nesterov ’83], …, Adaptive Regularization [Duchi, Hazan, Singer ’10], …

  27. Adaptive optimizers: each coordinate $x[i]$ gets its own learning rate $\eta_t[i]$, chosen “adaptively” from the gradient history $\nabla_{1:t}[i]$.
  • AdaGrad: $\eta_t[i] := \eta \big/ \sqrt{\textstyle\sum_{s=1}^{t} \nabla_s[i]^2}$
  • RMSprop: $\eta_t[i] := \eta \big/ \sqrt{\textstyle\sum_{s=1}^{t} \beta^{\,t-s}\, \nabla_s[i]^2}$
  • Adam: $\eta_t[i] := \eta \big/ \sqrt{\tfrac{1-\beta}{1-\beta^{\,t}} \textstyle\sum_{s=1}^{t} \beta^{\,t-s}\, \nabla_s[i]^2}$ (plus momentum on the gradient itself)

  28. What about the other AdaGrad?
  • Diagonal preconditioning, $O(d)$ time per iteration: $x_{t+1} \leftarrow x_t - \eta\, \mathrm{diag}\!\big(\sum_{s \le t} \nabla_s \nabla_s^\top\big)^{-1/2} \cdot \nabla_t$
  • Full-matrix preconditioning, $\ge d^2$ time per iteration: $x_{t+1} \leftarrow x_t - \eta\, \big(\sum_{s \le t} \nabla_s \nabla_s^\top\big)^{-1/2} \cdot \nabla_t$
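A side-by-side sketch of the two updates above: diagonal AdaGrad keeps only per-coordinate statistics (O(d) per step), while the full-matrix version accumulates gradient outer products and applies the inverse square root of that matrix, here via an eigendecomposition (cubic in d as written). The toy quadratic and hyperparameters are illustrative assumptions.

```python
import numpy as np

def diag_adagrad_step(x, g, sq_sum, eta=0.1, eps=1e-8):
    """x <- x - eta * diag(sum_s g_s g_s^T)^{-1/2} g, keeping only the diagonal (O(d))."""
    sq_sum = sq_sum + g * g
    return x - eta * g / (np.sqrt(sq_sum) + eps), sq_sum

def full_adagrad_step(x, g, G_sum, eta=0.1, eps=1e-8):
    """x <- x - eta * (sum_s g_s g_s^T)^{-1/2} g, via eigendecomposition of the d x d accumulator."""
    G_sum = G_sum + np.outer(g, g)
    evals, evecs = np.linalg.eigh(G_sum)
    inv_sqrt = evecs @ np.diag(1.0 / (np.sqrt(np.maximum(evals, 0.0)) + eps)) @ evecs.T
    return x - eta * inv_sqrt @ g, G_sum

# Illustrative rotated quadratic f(x) = 0.5 x^T Q x: the full-matrix preconditioner adapts to
# the rotated geometry, the diagonal one only to axis-aligned scaling.
Q = np.array([[10.0, 3.0],
              [3.0, 1.5]])
x_d, x_f = np.array([5.0, -5.0]), np.array([5.0, -5.0])
s_d, S_f = np.zeros(2), np.zeros((2, 2))
for _ in range(300):
    x_d, s_d = diag_adagrad_step(x_d, Q @ x_d, s_d)
    x_f, S_f = full_adagrad_step(x_f, Q @ x_f, S_f)
print(np.linalg.norm(x_d), np.linalg.norm(x_f))
```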

  29. What does adaptive regularization even do?!
  • Convex, full-matrix case [Duchi-Hazan-Singer ’10]: “best regularization in hindsight”, $\sum_t \nabla_t^\top (x_t - x^*) = O\!\Big(\min_{H \succ 0,\; \mathrm{tr}(H) \le d}\ \sqrt{\textstyle\sum_t \|\nabla_t\|_{H^{-1}}^2}\Big)$
  • Diagonal version: up to a $\sqrt{d}$ improvement upon SGD (in optimization AND generalization)
  • No analysis for non-convex optimization until recently (and still no proven speedup vs. SGD)
  ○ Convergence: [Li, Orabona ’18], [Ward, Wu, Bottou ’18]

  30. The Case for Full-Matrix Adaptive Regularization
  • GGT, a new adaptive optimizer
  • Efficient full-matrix (low-rank) AdaGrad
  • Theory: “adaptive” convergence rate on convex & non-convex objectives, up to $\sqrt{d}$ faster than SGD!
  • Experiments: viable in the deep learning era
  • GPU-friendly; not much slower than SGD on deep models
  • Accelerates training in deep learning benchmarks
  • Empirical insights on anisotropic loss surfaces, real and synthetic
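A hedged sketch of the low-rank idea behind GGT as described on the slide: keep a window of the r most recent gradients as columns of $G \in \mathbb{R}^{d \times r}$ and precondition with $(GG^\top + \epsilon I)^{-1/2}$, applied through the thin SVD of $G$ so that no d-by-d matrix is ever formed. The window size, epsilon, the handling of the component orthogonal to the gradient span, and the toy objective are illustrative assumptions, not the exact algorithm from the paper.

```python
import numpy as np

def ggt_precondition(g, G, eps=1e-4):
    """Apply (G G^T + eps I)^{-1/2} to g using the thin SVD of G (d x r, r small)."""
    U, s, _ = np.linalg.svd(G, full_matrices=False)     # U: d x r, s: r singular values
    coeff = U.T @ g                                     # component of g inside span(G)
    inside = U @ (coeff / np.sqrt(s ** 2 + eps))
    outside = (g - U @ coeff) / np.sqrt(eps)            # orthogonal complement, scaled by eps^{-1/2}
    return inside + outside

def ggt_step(x, g, window, eta=0.01, r=10):
    window.append(g)
    if len(window) > r:
        window.pop(0)                                   # keep only the r most recent gradients
    return x - eta * ggt_precondition(g, np.stack(window, axis=1)), window

# Illustrative anisotropic quadratic with condition number ~1000.
rng = np.random.default_rng(4)
Q = np.diag(np.logspace(0.0, 3.0, 20))
x, window = rng.normal(size=20), []
for _ in range(300):
    x, window = ggt_step(x, Q @ x, window)
print(np.linalg.norm(x))
```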
