SLIDE 1

Accelerated Stochastic Subgradient Methods under Local Error Bound Condition

Yi Xu (yi-xu@uiowa.edu)
Computer Science Department, The University of Iowa
Co-authors: Tianbao Yang, Qihang Lin

VALSE Webinar, April 18, 2018

SLIDE 2

Outline

1. Introduction
2. Accelerated Stochastic Subgradient Methods
3. Applications and experiments
4. Conclusion

SLIDE 3

Introduction

SLIDE 4

Introduction

Example in machine learning

Table: house price

house   size (sqf)   price ($1k)
1       68           500
2       220          800
...     ...          ...
19      359          1500
20      266          820

(Figure: scatter plot of y (price) versus x (size).)

Linear model:

y = f(w) = xw,

where y = price, x = size.


SLIDE 6

Introduction

(Figure: the fitted line f(x) = xw over the scatter plot of y (price) versus x (size); each point (xi, yi) contributes a squared residual |yi − f(xi)|².)

Total squared error: |y1 − x1w|² + |y2 − x2w|² + · · · + |y20 − x20w|²

SLIDE 7

Introduction

Least squares regression:

min_{w ∈ R} F(w) = (1/n) Σ_{i=1}^n (yi − xiw)²

square loss, smooth

Least absolute deviations:

min_{w ∈ R} F(w) = (1/n) Σ_{i=1}^n |yi − xiw|

absolute loss, non-smooth

(Figure: the square loss and the absolute loss as functions of the residual.)

High dimensional model:

min_{w ∈ R^d} F(w) = (1/n) Σ_{i=1}^n |yi − xi⊤w| + λ‖w‖₁ = (1/n)‖Xw − y‖₁ + λ‖w‖₁

with an ℓ1 regularizer

The absolute loss is more robust to outliers; ℓ1 norm regularization is used for feature selection.
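To make the three objectives concrete, here is a minimal sketch in Python/NumPy; the data matrix X, the targets y, and λ are illustrative assumptions, not values from the talk.

```python
import numpy as np

def least_squares(w, X, y):
    # square loss: smooth, but sensitive to outliers
    return np.mean((X @ w - y) ** 2)

def least_abs_dev(w, X, y):
    # absolute loss: non-smooth, more robust to outliers
    return np.mean(np.abs(X @ w - y))

def lad_l1(w, X, y, lam):
    # high dimensional model: absolute loss + l1 regularizer (feature selection)
    return np.mean(np.abs(X @ w - y)) + lam * np.sum(np.abs(w))
```

A single corrupted target inflates least_squares quadratically but least_abs_dev only linearly, which is the robustness point made on the slide.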



SLIDE 10

Introduction

Machine learning problems:

min_{w ∈ R^d} F(w) = (1/n) Σ_{i=1}^n ℓ(w; xi, yi) + r(w)

loss function + regularizer

Classification:
  hinge loss: ℓ(w; x, y) = max(0, 1 − y x⊤w)

Regression:
  absolute loss: ℓ(w; x, y) = |x⊤w − y|
  square loss: ℓ(w; x, y) = (x⊤w − y)²

Regularizer:
  ℓ1 norm: r(w) = λ‖w‖₁
  ℓ2² norm: r(w) = (λ/2)‖w‖₂²
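As a minimal sketch (assuming NumPy arrays X, y and a hypothetical λ), the composite objective above can be assembled directly from these pieces, here with the hinge loss and the ℓ1 regularizer:

```python
import numpy as np

def hinge_loss(w, x, y):
    # classification loss from the slide
    return max(0.0, 1.0 - y * (x @ w))

def l1_reg(w, lam):
    # l1 regularizer, encourages a sparse w
    return lam * np.sum(np.abs(w))

def objective(w, X, y, lam):
    # F(w) = (1/n) sum_i loss(w; x_i, y_i) + r(w)
    n = X.shape[0]
    return sum(hinge_loss(w, X[i], y[i]) for i in range(n)) / n + l1_reg(w, lam)
```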

SLIDE 11

Introduction

Convex optimization problem

Problem:

min_{w ∈ R^d} F(w)

F(w) : R^d → R is convex

Optimal value: F(w∗) = min_{w ∈ R^d} F(w)
Optimal solution: w∗

Goal: find a solution ŵ such that

F(ŵ) − F(w∗) ≤ ǫ,

where 0 < ǫ ≪ 1 (e.g. 10⁻⁷); such a ŵ is called an ǫ-optimal solution.

SLIDE 12

Introduction

Complexity measure

Most optimization algorithms are iterative:

wt+1 = wt + ∆wt   (for some update direction ∆wt)

Iteration complexity: the number of iterations T(ǫ) needed so that

F(wT) − F(w∗) ≤ ǫ,

where 0 < ǫ ≪ 1.

Time complexity: T(ǫ) × C(n, d), where C(n, d) is the per-iteration cost.

(Figure: the objective value decreasing over iterations, reaching the ǫ level after T iterations.)

SLIDE 13

Introduction

Gradient Descent (GD)

Problem: min_{w ∈ R^d} F(w), with F smooth:

F(w) ≤ F(wt) + ⟨∇F(wt), w − wt⟩ + (L/2)‖w − wt‖₂²

wt+1 = arg min_w F(wt) + ⟨∇F(wt), w − wt⟩ + (L/2)‖w − wt‖₂²

GD: initial w0 ∈ R^d, for t = 0, 1, . . .

wt+1 = wt − η∇F(wt)

with step size η = 1/L > 0. Simple & easy to implement.

(Figure: F(w), the minimizer w∗, the starting point (w0, F(w0)), and the gradient ∇F(w0) > 0.)

Theorem ([Nesterov, 2004]). After T = O(1/ǫ) iterations, F(wT) − F(w∗) ≤ ǫ.
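A minimal sketch of this update for the smooth least-squares objective F(w) = (1/n)‖Xw − y‖₂², assuming NumPy data X, y; the step size 1/L uses the smoothness constant of this particular F.

```python
import numpy as np

def gradient_descent(X, y, T):
    n, d = X.shape
    L = 2 * np.linalg.norm(X, 2) ** 2 / n    # smoothness constant of (1/n)||Xw - y||_2^2
    eta = 1.0 / L                            # constant step size eta = 1/L
    w = np.zeros(d)
    for _ in range(T):
        grad = 2 * X.T @ (X @ w - y) / n     # full gradient: one pass over all data
        w = w - eta * grad
    return w
```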



SLIDE 16

Introduction

Accelerated Gradient Descent (AGD)

Nesterov's momentum trick. AGD: initial w0, v1 = w0, for t = 1, 2, . . .:

wt = vt − η∇F(vt)
vt+1 = wt + βt(wt − wt−1)

βt ∈ (0, 1) is the momentum parameter.

(Figure: Nesterov's accelerated gradient as a gradient step followed by a momentum step.)

Theorem ([Beck and Teboulle, 2009]). Let η = 1/L, βt = (θt − 1)/θt+1 ∈ (0, 1) with θt+1 = (1 + √(1 + 4θt²))/2 and θ1 = 1; then after T = O(1/√ǫ) iterations, F(wT) − F(w∗) ≤ ǫ.
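A minimal sketch of this scheme, assuming the caller supplies a gradient oracle grad_F and the smoothness constant L (both hypothetical here):

```python
import numpy as np

def agd(grad_F, L, w0, T):
    eta = 1.0 / L
    w_prev = w0.copy()
    v = w0.copy()
    theta = 1.0                                        # theta_1 = 1
    for _ in range(T):
        w = v - eta * grad_F(v)                        # gradient step at the extrapolated point
        theta_next = (1 + np.sqrt(1 + 4 * theta ** 2)) / 2
        beta = (theta - 1) / theta_next                # momentum parameter beta_t in (0, 1)
        v = w + beta * (w - w_prev)                    # momentum step
        w_prev, theta = w, theta_next
    return w_prev
```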

SLIDE 17

Introduction

SubGradient (SG) descent

Problem: min_{w ∈ R^d} F(w), with F non-smooth. SG: initial w0, for t = 0, 1, . . .

wt+1 = wt − ηt ∂F(wt)

where the step size ηt is decreased every iteration.

(Figure: subgradients of a non-smooth function at a kink.)

Theorem ([Nesterov, 2004]). After T = O(1/ǫ²) iterations, F(wT) − F(w∗) ≤ ǫ.
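A minimal sketch on the non-smooth least-absolute-deviations objective F(w) = (1/n)‖Xw − y‖₁, with the common decreasing schedule ηt = η0/√(t+1); η0 and the data are assumptions.

```python
import numpy as np

def subgradient_descent(X, y, eta0, T):
    n, d = X.shape
    w = np.zeros(d)
    for t in range(T):
        g = X.T @ np.sign(X @ w - y) / n          # a subgradient of (1/n)||Xw - y||_1 at w
        w = w - (eta0 / np.sqrt(t + 1)) * g       # step size decreases every iteration
    return w
```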


SLIDE 19

Introduction

Summary of time complexity

min_{w ∈ R^d} F(w) = (1/n) Σ_{i=1}^n fi(w; xi, yi)

Method   Time complexity   Smooth
GD       O(nd/ǫ)           YES
AGD      O(nd/√ǫ)          YES
SG       O(nd/ǫ²)          NO

GD: Gradient Descent; AGD: Accelerated Gradient Descent; SG: SubGradient descent.

SLIDE 20

Introduction

Challenge of deterministic methods

Computing the gradient is expensive:

min_{w ∈ R^d} F(w) := (1/n) Σ_{i=1}^n fi(w; xi, yi),    ∇F(w) := (1/n) Σ_{i=1}^n ∇fi(w; xi, yi)

When n and d are large (Big Data), computing the gradient requires a pass through all data points, and this expensive computation is needed at every update step.

SLIDE 21

Introduction

Stochastic Gradient Descent (SGD)

min_{w ∈ R^d} F(w) := E_{ξ∼P}[f(w; ξ)]

SGD: initial w0, for t = 0, 1, . . .: sample one data point ξt = (xt, yt),

wt+1 = wt − ηt ∇f(wt; ξt)

where ηt is decreased every iteration. Simple & memory efficient; the drawback is the variance of the stochastic gradient, which leads to slow convergence.

(Figure: trajectories of Gradient Descent versus Stochastic Gradient Descent.)

Theorem ([Nemirovski et al., 2009]). After T = O(log(1/δ)/ǫ²) iterations, F(wT) − F(w∗) ≤ ǫ with probability 1 − δ.
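A minimal sketch of SGD on the least-squares objective: each step touches one sampled example instead of the whole data set. The 1/√t step-size schedule and the data are illustrative assumptions.

```python
import numpy as np

def sgd(X, y, eta0, T, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(T):
        i = rng.integers(n)                         # sample one data point
        grad_i = 2 * (X[i] @ w - y[i]) * X[i]       # gradient of (x_i^T w - y_i)^2 only
        w = w - (eta0 / np.sqrt(t + 1)) * grad_i    # decreasing step size
    return w
```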

SLIDE 22

Introduction

Stochastic SubGradient (SSG) descent

Problem:

min_{w ∈ R^d} F(w) = E_{ξ∼P}[f(w; ξ)]

SSG: initial w0, for t = 0, 1, . . .: sample one data point ξt,

wt+1 = wt − ηt ∂f(wt; ξt)

where ηt is decreased every iteration.

Theorem ([Hazan and Kale, 2011]). After T = O(log(1/δ)/ǫ²) iterations, F(wT) − F(w∗) ≤ ǫ with probability 1 − δ.

SLIDE 23

Introduction

Summary of time complexity

min_{w ∈ R^d} F(w) = (1/n) Σ_{i=1}^n fi(w; xi, yi)

Method   Time complexity   Smooth
SGD      O(d/ǫ²)           YES
SSG      O(d/ǫ²)           NO

SGD: Stochastic Gradient Descent; SSG: Stochastic SubGradient descent.

SGD cannot exploit the smoothness property to obtain a faster rate.

SLIDE 24

Introduction

How can we do better?

Assume strong global conditions (e.g., strong convexity, smoothness), i.e., restrict to a smaller family of problems.

Strongly convex problems:

F(x) ≥ F(y) + ∂F(y)⊤(x − y) + (λ/2)‖x − y‖₂²

λ > 0: strong convexity parameter.

SSG with ηt = 1/(λt) enjoys an O(1/(λǫ)) iteration complexity.

Strong convexity is sometimes too good to be true.
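A minimal sketch of SSG with the step size ηt = 1/(λt) mentioned above, applied to a λ-strongly-convex example, the ℓ2-regularized hinge-loss objective; the data and λ are illustrative assumptions.

```python
import numpy as np

def ssg_strongly_convex(X, y, lam, T, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(n)
        margin = y[i] * (X[i] @ w)
        g_loss = -y[i] * X[i] if margin < 1 else np.zeros(d)  # subgradient of the hinge loss
        w = w - (1.0 / (lam * t)) * (g_loss + lam * w)         # step size eta_t = 1/(lambda * t)
    return w
```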

SLIDE 25

Introduction

Non-smooth and non-strongly convex problems in ML

Robust regression:

min_w (1/n) Σ_{i=1}^n |w⊤xi − yi|^p,   p ∈ [1, 2)

Sparse classification:

min_w (1/n) Σ_{i=1}^n max(0, 1 − yi w⊤xi) + λ‖w‖₁

SLIDE 26

Accelerated Stochastic Subgradient Methods

SLIDE 27

Accelerated Stochastic Subgradient Methods

The contributions of our paper

Y. Xu, Q. Lin, and T. Yang. Stochastic convex optimization: Faster local growth implies faster global convergence. In ICML, pages 3821–3830, 2017.

A new theory of stochastic convex optimization:
A broader family of conditions: the Local Error Bound Condition
Faster global convergence under the Local Error Bound Condition
Applications in machine learning

SLIDE 28

Accelerated Stochastic Subgradient Methods

Local error bound (LEB) condition

Definition. If there exist a constant c > 0 and a local growth rate θ ∈ (0, 1] such that

‖w − w∗‖₂ ≤ c (F(w) − F(w∗))^θ,   ∀w ∈ Sǫ,   (1)

then we say F(w) satisfies a local error bound condition (also known as a local growth condition).

Sǫ = {w ∈ R^d : F(w) − F∗ ≤ ǫ}: the ǫ-sublevel set.

The LEB condition is a local sharpness measure of the function.

(Figure: F(x) = |x| (θ = 1), |x|^1.5 (θ = 2/3), and |x|² (θ = 0.5) near x = 0.)
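As a small numeric illustration (not from the talk), the one-dimensional examples in the figure satisfy (1) near the minimizer x∗ = 0 with c = 1 and θ = 1/p for F(x) = |x|^p:

```python
import numpy as np

# check |x - x*| <= c * (F(x) - F*)^theta with x* = 0, F* = 0, c = 1, on (0, 0.1]
x = np.linspace(1e-3, 0.1, 100)
for p, theta in [(1.0, 1.0), (1.5, 2.0 / 3.0), (2.0, 0.5)]:
    ratio = x / (x ** p) ** theta                        # equals x^(1 - p*theta) = 1 when theta = 1/p
    print(f"p = {p}: max ratio = {ratio.max():.6f}")     # never exceeds c = 1
```

Sharper functions (smaller p, larger θ) grow faster away from the minimizer, which is what ASSG exploits.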

SLIDE 29

Accelerated Stochastic Subgradient Methods

Sketch of accelerated algorithm

(Figure: the iterates of ASSG versus SSG.)

SLIDE 30

Accelerated Stochastic Subgradient Methods

Accelerated Stochastic SubGradient (ASSG) method

1: Set η1, D1, K and t
2: for k = 1, . . . , K do
3:   wk = SSG(wk−1, ηk, Dk, t)
4:   ηk+1 = ηk/2, Dk+1 = Dk/2
5: end for

SSG(w1, η, D, t): for τ = 1, . . . , t

wτ+1 = Proj_{‖w − w1‖₂ ≤ D}[wτ − η ∂f(wτ; zτ)]

Output: w̄ = (1/t) Σ_{τ=1}^t wτ

Theorem ([Xu et al., 2017]). After T = O(t log(1/ǫ)) iterations with t ≥ log(1/δ) G² c² / ǫ^{2(1−θ)}, F(wK) − F∗ ≤ 2ǫ with probability 1 − δ.
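A minimal sketch of ASSG for the ℓ1-regularized absolute-loss objective, with the inner routine performing projected stochastic subgradient steps inside a shrinking ball. The data, λ, and the initial η1, D1 are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def stoch_subgrad(w, X, y, lam, rng):
    # stochastic subgradient of |x_i^T w - y_i| + lam * ||w||_1 at one sampled example
    i = rng.integers(X.shape[0])
    return np.sign(X[i] @ w - y[i]) * X[i] + lam * np.sign(w)

def ssg_stage(w1, eta, D, t, X, y, lam, rng):
    # inner loop: iterates are projected back onto the ball {w : ||w - w1||_2 <= D}
    w, w_sum = w1.copy(), np.zeros_like(w1)
    for _ in range(t):
        w_sum += w                                     # the stage returns the averaged iterate
        w = w - eta * stoch_subgrad(w, X, y, lam, rng)
        dist = np.linalg.norm(w - w1)
        if dist > D:
            w = w1 + (w - w1) * (D / dist)             # projection onto the ball
    return w_sum / t

def assg(w0, eta1, D1, K, t, X, y, lam, seed=0):
    rng = np.random.default_rng(seed)
    w, eta, D = w0.copy(), eta1, D1
    for _ in range(K):                                 # K stages
        w = ssg_stage(w, eta, D, t, X, y, lam, rng)
        eta, D = eta / 2, D / 2                        # halve the step size and the radius each stage
    return w
```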

SLIDE 31

Accelerated Stochastic Subgradient Methods

Practical variant: ASSG with Restarting (RASSG)

Setting t ≥ log(1/δ) G² c² / ǫ^{2(1−θ)} requires c, which is usually unknown. A practical variant:

1: Input: D1^(1), t1, w^(0) and η1 = ǫ0/(3G²)
2: for s = 1, 2, . . . , S do
3:   Let w^(s) = ASSG(w^(s−1), K, ts, D1^(s))
4:   Let ts+1 = ts · 2^{2(1−θ)}, D1^(s+1) = D1^(s) · 2^{1−θ}
5: end for

Another level of restarting: t is increased by a factor of 2^{2(1−θ)}, and the iteration complexity remains the same.
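A minimal sketch of the restarting wrapper on top of the assg() sketch above; θ, G, ǫ0 and the initial t1, D1 are illustrative assumptions, and the handling of η1 across restarts is simplified here.

```python
def rassg(w0, D1, t1, K, S, theta, G, eps0, X, y, lam):
    # restart ASSG, growing the inner iteration budget and the initial radius each time
    w, D, t = w0.copy(), D1, float(t1)
    eta1 = eps0 / (3 * G ** 2)              # initial step size from the slide
    for _ in range(S):
        w = assg(w, eta1, D, K, int(t), X, y, lam)
        t *= 2 ** (2 * (1 - theta))         # increase t by a factor of 2^{2(1 - theta)}
        D *= 2 ** (1 - theta)               # increase the initial radius by 2^{1 - theta}
    return w
```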

SLIDE 32

Accelerated Stochastic Subgradient Methods

Summary of time complexity

min_{w ∈ R^d} F(w) = E_{ξ∼P}[f(w; ξ)]

Table: time complexities for non-smooth stochastic optimization methods (θ ∈ (0, 1])

Method   Time complexity       Condition
SSG      O(d/ǫ²)               Stochastic structure
ASSG     O(d/ǫ^{2(1−θ)})       Stochastic structure and LEB

SSG: Stochastic SubGradient descent; ASSG: Accelerated Stochastic SubGradient descent.

SLIDE 33

Applications and experiments

SLIDE 34

Applications and experiments

Piecewise linear convex optimization

θ = 1 ⟹ ASSG achieves an O(log(1/ǫ)) iteration complexity.

Examples:

Robust regression:

min_w (1/n) Σ_{i=1}^n |w⊤xi − yi|

Sparse classification:

min_w (1/n) Σ_{i=1}^n max(0, 1 − yi w⊤xi) + λ‖w‖₁

SLIDE 35

Applications and experiments

Piecewise quadratic convex optimization

θ = 1/2 ⟹ ASSG achieves an O(1/ǫ) iteration complexity.

Examples:

Least-squares regression + ℓ1 regularizer:

min_w (1/n) Σ_{i=1}^n (w⊤xi − yi)² + λ‖w‖₁

Squared hinge loss + ℓ1 regularizer:

min_w (1/n) Σ_{i=1}^n max(0, 1 − yi w⊤xi)² + λ‖w‖₁

Huber loss:

ℓ(w⊤xi, yi) = (1/2)(w⊤xi − yi)²        if |w⊤xi − yi| ≤ γ
              γ(|w⊤xi − yi| − γ/2)      if |w⊤xi − yi| > γ
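A minimal sketch of the Huber loss just defined; γ is the threshold between the quadratic and the linear regime, and the vectorized form is an assumption for convenience.

```python
import numpy as np

def huber(pred, y, gamma):
    r = np.abs(pred - y)
    return np.where(r <= gamma,
                    0.5 * r ** 2,                # quadratic where the residual is small
                    gamma * (r - 0.5 * gamma))   # linear where the residual exceeds gamma
```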

SLIDE 36

Applications and experiments

Structured composite non-smooth problems

F(w) = h(Aw) + R(w), where h(·) is strongly convex (no smoothness assumption is required) and R(w) is polyhedral.

θ = 1/2 ⟹ ASSG achieves an O(1/ǫ) iteration complexity.

Example: robust regression + ℓ1 regularizer:

min_w (1/n) Σ_{i=1}^n |w⊤xi − yi|^p + λ‖w‖₁,   p ∈ (1, 2)

SLIDE 37

Applications and experiments

Problems with intermediate θ

ℓp norm regression with an ℓ1 constraint:

min_{‖w‖₁ ≤ B} (1/n) Σ_{i=1}^n (w⊤xi − yi)^{2p},   p ∈ N+,

where θ = 1/(2p).

SLIDE 38

Applications and experiments

Experiments: SSG vs. ASSG

(Figures: log10(objective gap) versus number of iterations (×10⁷).
Left: hinge loss + ℓ1 norm on covtype (classification).
Right: Huber loss + ℓ1 norm on million songs (regression).
Methods compared: SSG, ASSG (t = 10⁶), RASSG (t1 = 10⁶).)

SLIDE 39

Applications and experiments

Experiments: ASSG vs. other baselines

(Figures: objective versus CPU time (s).
Left: squared hinge + ℓ1 norm on url.
Right: Huber loss + ℓ1 norm on E2006-log1p.
Methods compared: SSG, SAGA, SVRG++, RASSG.)

SLIDE 40

Conclusion

SLIDE 41

Conclusion

We presented our recent work, ASSG, which attains a lower iteration complexity for solving non-smooth stochastic optimization problems.

Method   Time complexity       Problem
SSG      O(d/ǫ²)               Stochastic structure
ASSG     O(d/ǫ^{2(1−θ)})       Stochastic structure + LEB

We studied examples in machine learning that satisfy the LEB condition.

Open questions: RASSG for θ = 1? Nonconvex problems?

SLIDE 42

Thank You! Questions?

SLIDE 43

References

Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2:183–202, 2009.

Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. In Proceedings of the 24th Annual Conference on Learning Theory (COLT), pages 421–436, 2011.

Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19:1574–1609, 2009.

Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Applied Optimization. Kluwer Academic Publishers, 2004. ISBN 1-4020-7553-7.

Yi Xu, Qihang Lin, and Tianbao Yang. Stochastic convex optimization: Faster local growth implies faster global convergence. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 3821–3830, 2017.