

  1. Accelerated Stochastic Subgradient Methods under Local Error Bound Condition. Yi Xu (yi-xu@uiowa.edu), Computer Science Department, The University of Iowa. Co-authors: Tianbao Yang, Qihang Lin. VALSE Webinar Presentation, April 18, 2018.

  2. Outline: 1. Introduction; 2. Accelerated Stochastic Subgradient Methods; 3. Applications and experiments; 4. Conclusion.

  3. Introduction. Outline repeated: 1. Introduction; 2. Accelerated Stochastic Subgradient Methods; 3. Applications and experiments; 4. Conclusion.

  4. Introduction: Example in machine learning.
     Table: house price
        #    house size (sqf)    price ($1k)
        1          500                68
        2          800               220
       ...         ...               ...
       19         1500               359
       20          820               266
     [Figure: scatter plot of price ($1k), 0-500, versus house size (sqf), 400-1600.]
     Linear model: $y = f(w) = xw$, where $y$ = price and $x$ = size.
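
A minimal sketch (not from the slides) of how this one-parameter linear model could be fit by ordinary least squares; the four (size, price) pairs are illustrative values in the spirit of the table above.

```python
import numpy as np

# Illustrative (house size in sqf, price in $1k) pairs.
x = np.array([500.0, 800.0, 820.0, 1500.0])
y = np.array([68.0, 220.0, 266.0, 359.0])

# Closed-form least-squares fit for the single-parameter model y = x * w:
# w minimizes sum_i (y_i - x_i * w)^2, so w = (x . y) / (x . x).
w = np.dot(x, y) / np.dot(x, x)

print(f"fitted slope w = {w:.4f}")              # price per sqf (in $1k)
print("predicted price for 1000 sqf:", 1000 * w)
```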

  6. Introduction.
     [Figure: scatter of the data points $(x_i, y_i)$ and the model predictions $(x_i, f(x_i))$, with a squared residual $|y_i - f(x_i)|^2$ marked; x = size (sqf), y = price ($1k).]
     Total squared error: $|y_1 - x_1 w|^2 + |y_2 - x_2 w|^2 + \cdots + |y_{20} - x_{20} w|^2$.

  7. Introduction.
     Least squares regression (smooth): $\min_{w \in \mathbb{R}} F(w) = \frac{1}{n} \sum_{i=1}^{n} (y_i - x_i w)^2$   (square loss)
     Least absolute deviations (non-smooth): $\min_{w \in \mathbb{R}} F(w) = \frac{1}{n} \sum_{i=1}^{n} |y_i - x_i w|$   (absolute loss)
     High-dimensional model: $\min_{w \in \mathbb{R}^d} F(w) = \frac{1}{n} \sum_{i=1}^{n} |y_i - x_i^\top w| + \lambda \|w\|_1 = \frac{1}{n} \|Xw - y\|_1 + \lambda \|w\|_1$   (absolute loss + regularizer)
     [Figure: plots of the square loss and the absolute loss as functions of the residual.]
     The absolute loss is more robust to outliers; the $\ell_1$ norm regularization is used for feature selection.
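
As a concrete illustration (mine, not part of the original slides), here is a minimal NumPy sketch of the three objectives above; the data `X`, `y` and the regularization weight `lam` are hypothetical.

```python
import numpy as np

def least_squares(w, X, y):
    """Smooth objective: (1/n) * sum_i (y_i - x_i^T w)^2."""
    r = X @ w - y
    return np.mean(r ** 2)

def least_absolute_deviations(w, X, y):
    """Non-smooth objective: (1/n) * sum_i |y_i - x_i^T w|."""
    return np.mean(np.abs(X @ w - y))

def lad_l1(w, X, y, lam):
    """High-dimensional model: (1/n) * ||X w - y||_1 + lam * ||w||_1."""
    return np.mean(np.abs(X @ w - y)) + lam * np.sum(np.abs(w))

# Hypothetical data: n = 100 examples, d = 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.normal(size=100)
w = np.zeros(5)
print(least_squares(w, X, y), least_absolute_deviations(w, X, y), lad_l1(w, X, y, 0.1))
```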


  10. Introduction: Machine learning problems.
      $\min_{w \in \mathbb{R}^d} F(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(w; x_i, y_i) + r(w)$   (loss function + regularizer)
      Classification: hinge loss $\ell(w; x, y) = \max(0, 1 - y\, x^\top w)$.
      Regression: absolute loss $\ell(w; x, y) = |x^\top w - y|$; square loss $\ell(w; x, y) = (x^\top w - y)^2$.
      Regularizer: $\ell_1$ norm $r(w) = \lambda \|w\|_1$; $\ell_2^2$ norm $r(w) = \frac{\lambda}{2} \|w\|_2^2$.
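
A short illustrative sketch (my own, with hypothetical function names) of these per-example losses and regularizers, and of how they compose into the objective $F(w)$.

```python
import numpy as np

def hinge_loss(w, x, y):
    """Classification, y in {-1, +1}: max(0, 1 - y * x^T w)."""
    return max(0.0, 1.0 - y * float(x @ w))

def absolute_loss(w, x, y):
    """Regression: |x^T w - y|."""
    return abs(float(x @ w) - y)

def square_loss(w, x, y):
    """Regression: (x^T w - y)^2."""
    return (float(x @ w) - y) ** 2

def l1_reg(w, lam):
    """r(w) = lam * ||w||_1."""
    return lam * np.sum(np.abs(w))

def l2_sq_reg(w, lam):
    """r(w) = (lam / 2) * ||w||_2^2."""
    return 0.5 * lam * np.dot(w, w)

def objective(w, X, y, lam):
    """F(w) = (1/n) * sum_i loss(w; x_i, y_i) + r(w), here hinge loss + L2 regularizer."""
    return np.mean([hinge_loss(w, X[i], y[i]) for i in range(len(y))]) + l2_sq_reg(w, lam)
```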

  11. Introduction: Convex optimization problem.
      Problem: $\min_{w \in \mathbb{R}^d} F(w)$, where $F(w): \mathbb{R}^d \to \mathbb{R}$ is convex.
      Optimal value: $F(w_*) = \min_{w \in \mathbb{R}^d} F(w)$; optimal solution: $w_*$.
      Goal: find a solution $\hat{w}$ with $F(\hat{w}) - F(w_*) \le \epsilon$, where $0 < \epsilon \ll 1$ (e.g., $10^{-7}$).
      Such a $\hat{w}$ is called an $\epsilon$-optimal solution.

  12. Introduction: Complexity measure.
      Most optimization algorithms are iterative: $w_{t+1} = w_t + \Delta w_t$.
      Iteration complexity: the number of iterations $T(\epsilon)$ needed so that $F(w_T) - F(w_*) \le \epsilon$, where $0 < \epsilon \ll 1$.
      Time complexity: $T(\epsilon) \times C(n, d)$, where $C(n, d)$ is the per-iteration cost.
      [Figure: objective value versus iterations; the curve drops below $\epsilon$ after $T$ iterations.]
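
To make the iteration-complexity notion concrete, here is a small sketch (my own, with a hypothetical interface) that runs a generic update until the optimality gap falls below epsilon and reports the iteration count T(epsilon).

```python
def iteration_complexity(update, w0, F, F_star, eps, max_iter=100000):
    """Run w_{t+1} = update(w_t) until F(w_t) - F_star <= eps.

    Returns (T, w_T): the number of iterations needed and the final iterate.
    `update`, `F`, and `F_star` are supplied by the caller.
    """
    w = w0
    for t in range(max_iter):
        if F(w) - F_star <= eps:
            return t, w
        w = update(w)
    return max_iter, w

# Toy smooth problem F(w) = w^2 (minimum value 0 at w = 0),
# with a gradient-descent update of step size 0.1:
T, w_final = iteration_complexity(lambda w: w - 0.1 * 2 * w, 1.0, lambda w: w * w, 0.0, 1e-7)
print(T, w_final)
```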

  13. Introduction: Gradient Descent (GD).
      Problem: $\min_{w \in \mathbb{R}} F(w)$, with $F$ smooth: $F(w) \le F(w_t) + \langle \nabla F(w_t), w - w_t \rangle + \frac{L}{2} \|w - w_t\|_2^2$.
      GD minimizes this quadratic upper bound at each step: $w_{t+1} = \arg\min_{w \in \mathbb{R}} F(w_t) + \langle \nabla F(w_t), w - w_t \rangle + \frac{L}{2} \|w - w_t\|_2^2$.
      GD: initialize $w_0 \in \mathbb{R}$; for $t = 0, 1, \ldots$: $w_{t+1} = w_t - \eta \nabla F(w_t)$, with step size $\eta = 1/L > 0$.
      Simple and easy to implement.
      [Figure: $F(w)$ with the gradient $\nabla F(w_0) > 0$ at the starting point $(w_0, F(w_0))$ and the minimizer $w_*$.]
      Theorem [Nesterov, 2004]: after $T = O(1/\epsilon)$ iterations, $F(w_T) - F(w_*) \le \epsilon$.
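
A minimal sketch of the GD update above, assuming a smooth objective with a known Lipschitz constant L; the least-squares test problem is a hypothetical stand-in, not from the slides.

```python
import numpy as np

def gradient_descent(grad_F, w0, L, num_iters):
    """GD: w_{t+1} = w_t - eta * grad_F(w_t) with step size eta = 1 / L."""
    eta = 1.0 / L
    w = np.asarray(w0, dtype=float)
    for _ in range(num_iters):
        w = w - eta * grad_F(w)
    return w

# Hypothetical smooth test problem: F(w) = 0.5 * ||A w - b||^2,
# whose gradient is A^T (A w - b) and whose smoothness constant is L = ||A^T A||_2.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 10))
b = rng.normal(size=50)
L = np.linalg.norm(A.T @ A, 2)
w_T = gradient_descent(lambda w: A.T @ (A @ w - b), np.zeros(10), L, num_iters=500)
print("objective:", 0.5 * np.linalg.norm(A @ w_T - b) ** 2)
```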


  16. Introduction: Accelerated Gradient Descent (AGD), Nesterov's momentum trick.
      AGD: initialize $w_0$, $v_1 = w_0$; for $t = 1, 2, \ldots$:
        $w_t = v_t - \eta \nabla F(v_t)$   (gradient step)
        $v_{t+1} = w_t + \beta_t (w_t - w_{t-1})$   (momentum step)
      where $\beta_t \in (0, 1)$ is the momentum parameter.
      Theorem [Beck and Teboulle, 2009]: let $\eta = 1/L$, $\theta_1 = 1$, $\theta_{t+1} = \frac{1 + \sqrt{1 + 4\theta_t^2}}{2}$, and $\beta_t = \frac{\theta_t - 1}{\theta_{t+1}}$; then after $T = O(1/\sqrt{\epsilon})$ iterations, $F(w_T) - F(w_*) \le \epsilon$.
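
A minimal sketch of this accelerated scheme with the FISTA-style parameters stated in the theorem; the least-squares test problem is again a hypothetical stand-in, not the authors' code.

```python
import numpy as np

def accelerated_gradient_descent(grad_F, w0, L, num_iters):
    """AGD: gradient step on v_t, then momentum step with beta_t = (theta_t - 1) / theta_{t+1}."""
    eta = 1.0 / L
    w_prev = np.asarray(w0, dtype=float)
    v = w_prev.copy()
    theta = 1.0
    for _ in range(num_iters):
        w = v - eta * grad_F(v)                                   # gradient step
        theta_next = (1.0 + np.sqrt(1.0 + 4.0 * theta ** 2)) / 2.0
        beta = (theta - 1.0) / theta_next
        v = w + beta * (w - w_prev)                               # momentum step
        w_prev, theta = w, theta_next
    return w_prev

# Hypothetical smooth test problem: F(w) = 0.5 * ||A w - b||^2.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 10))
b = rng.normal(size=50)
L = np.linalg.norm(A.T @ A, 2)
w_T = accelerated_gradient_descent(lambda w: A.T @ (A @ w - b), np.zeros(10), L, 200)
print("objective:", 0.5 * np.linalg.norm(A @ w_T - b) ** 2)
```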

  17. Introduction: SubGradient (SG) descent.
      Problem: $\min_{w \in \mathbb{R}} F(w)$, with $F$ non-smooth.
      SG: initialize $w_0$; for $t = 0, 1, \ldots$: $w_{t+1} = w_t - \eta\, \partial F(w_t)$, where $\partial F(w_t)$ is a subgradient; decrease $\eta$ every iteration.
      Theorem [Nesterov, 2004]: after $T = O(1/\epsilon^2)$ iterations, $F(w_T) - F(w_*) \le \epsilon$.
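
A minimal sketch of subgradient descent with a decreasing step size, applied to the non-smooth least-absolute-deviations objective from earlier; the schedule eta0 / sqrt(t + 1) and the averaging of iterates are common choices I am assuming, not necessarily the ones used in the talk.

```python
import numpy as np

def subgradient_descent(subgrad_F, w0, eta0, num_iters):
    """SG: w_{t+1} = w_t - eta_t * g_t with g_t a subgradient and eta_t decreasing."""
    w = np.asarray(w0, dtype=float)
    w_avg = np.zeros_like(w)
    for t in range(num_iters):
        eta_t = eta0 / np.sqrt(t + 1.0)      # decreasing step size
        w = w - eta_t * subgrad_F(w)
        w_avg += w
    return w_avg / num_iters                 # averaged iterate

# Non-smooth objective F(w) = (1/n) * ||X w - y||_1; one subgradient is (1/n) * X^T sign(X w - y).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)
subgrad = lambda w: X.T @ np.sign(X @ w - y) / len(y)
w_hat = subgradient_descent(subgrad, np.zeros(5), eta0=1.0, num_iters=2000)
print("LAD objective:", np.mean(np.abs(X @ w_hat - y)))
```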

  19. Introduction: Summary of time complexity for $\min_{w \in \mathbb{R}^d} F(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w; x_i, y_i)$.

      Method   Smooth   Time complexity
      GD       yes      $O(nd / \epsilon)$
      AGD      yes      $O(nd / \sqrt{\epsilon})$
      SG       no       $O(nd / \epsilon^2)$

      GD: Gradient Descent; AGD: Accelerated Gradient Descent; SG: SubGradient descent.

  20. Introduction: Challenge of deterministic methods.
      Computing the gradient is expensive:
        $\min_{w \in \mathbb{R}^d} F(w) := \frac{1}{n} \sum_{i=1}^{n} f_i(w; x_i, y_i)$,
        $\nabla F(w) := \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(w; x_i, y_i)$.
      When $n$ and/or $d$ is large (big data): computing the gradient requires a pass through all data points, and this expensive computation is needed at every update step.
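
To illustrate the per-iteration cost, here is a small sketch (my own, with a hypothetical square-loss example) contrasting the full gradient, which touches all n examples, with a single-example stochastic estimate, which motivates the stochastic methods the talk turns to next.

```python
import numpy as np

# Hypothetical per-example square loss f_i(w) = 0.5 * (x_i^T w - y_i)^2.
def full_gradient(w, X, y):
    """Exact gradient (1/n) * sum_i grad f_i(w): one pass over all n examples, O(n d) work."""
    return X.T @ (X @ w - y) / len(y)

def stochastic_gradient(w, X, y, rng):
    """Unbiased estimate from one sampled example: O(d) work per update."""
    i = rng.integers(len(y))
    return (X[i] @ w - y[i]) * X[i]

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 20))
y = rng.normal(size=10000)
w = np.zeros(20)
print(np.linalg.norm(full_gradient(w, X, y)), np.linalg.norm(stochastic_gradient(w, X, y, rng)))
```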
