SLIDE 1

Randomized Block Cubic Newton Method

Nikita Doikov¹  Peter Richtárik²,³,⁴

¹Higher School of Economics, Russia  ²King Abdullah University of Science and Technology, Saudi Arabia  ³The University of Edinburgh, United Kingdom  ⁴Moscow Institute of Physics and Technology, Russia

International Conference on Machine Learning, Stockholm, July 12, 2018

SLIDE 2

Plan of the Talk

  • 1. Review: Gradient Descent and Cubic Newton methods
  • 2. RBCN: Randomized Block Cubic Newton
  • 3. Application: Empirical Risk Minimization

2 / 20

SLIDE 3

Plan of the Talk

  • 1. Review: Gradient Descent and Cubic Newton methods
  • 2. RBCN: Randomized Block Cubic Newton
  • 3. Application: Empirical Risk Minimization

3 / 20

SLIDE 4

Review: Classical Gradient Descent

Optimization problem: min_{x ∈ ℝ^N} F(x).

◮ Main assumption: gradient of F is Lipschitz-continuous:
  ‖∇F(x) − ∇F(y)‖ ≤ L‖x − y‖, ∀x, y ∈ ℝ^N.
◮ From which we get the global upper bound for the function:
  F(y) ≤ F(x) + ⟨∇F(x), y − x⟩ + (L/2)‖y − x‖², ∀x, y ∈ ℝ^N.
◮ The Gradient Descent:
  x⁺ = argmin_{y ∈ ℝ^N} [ F(x) + ⟨∇F(x), y − x⟩ + (L/2)‖y − x‖² ] = x − (1/L)∇F(x).

4 / 20
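A minimal NumPy sketch of the step above, on a hypothetical least-squares objective (the function names and data here are illustrative, not from the talk):

```python
import numpy as np

def gradient_descent(grad_F, x0, L, num_iters=100):
    """Gradient Descent: x+ = x - (1/L) * grad_F(x), repeated num_iters times."""
    x = x0.copy()
    for _ in range(num_iters):
        x = x - grad_F(x) / L
    return x

# Toy example: F(x) = 0.5 * ||A x - b||^2 has gradient A^T (A x - b)
# and Lipschitz constant L = ||A^T A||_2 (largest eigenvalue of A^T A).
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
L = np.linalg.norm(A.T @ A, 2)
x_min = gradient_descent(lambda x: A.T @ (A @ x - b), np.zeros(5), L)
```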

SLIDE 5

Review: Cubic Newton

Optimization problem: min_{x ∈ ℝ^N} F(x).

◮ New assumption: Hessian of F is Lipschitz-continuous:
  ‖∇²F(x) − ∇²F(y)‖ ≤ H‖x − y‖, ∀x, y ∈ ℝ^N.
◮ Corresponding global upper bound for the function:
  Q(x; y) ≡ F(x) + ⟨∇F(x), y − x⟩ + (1/2)⟨∇²F(x)(y − x), y − x⟩, then
  F(y) ≤ Q(x; y) + (H/6)‖y − x‖³, ∀x, y ∈ ℝ^N.
◮ Newton method with cubic regularization¹:
  x⁺ = argmin_{y ∈ ℝ^N} [ Q(x; y) + (H/6)‖y − x‖³ ] = x − (∇²F(x) + (H‖x⁺ − x‖/2) I)⁻¹ ∇F(x).

¹ Yurii Nesterov and Boris T. Polyak. "Cubic regularization of Newton's method and its global performance". Mathematical Programming 108.1 (2006), pp. 177–205.

5 / 20
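The cubic step is implicit in ‖x⁺ − x‖. One standard way to compute it for convex F (so ∇²F(x) ⪰ 0) is to solve a one-dimensional equation in r = ‖x⁺ − x‖; the sketch below does this with bisection and is an illustration only, not the authors' implementation:

```python
import numpy as np

def cubic_newton_step(grad, hess, H, tol=1e-10):
    """One cubically regularized Newton step, assuming hess is positive semidefinite and H > 0.

    Solves r = || (hess + (H*r/2) I)^{-1} grad || for r, then returns
    the displacement x+ - x = -(hess + (H*r/2) I)^{-1} grad.
    """
    I = np.eye(grad.shape[0])

    def step_norm(r):
        return np.linalg.norm(np.linalg.solve(hess + 0.5 * H * r * I, grad))

    lo, hi = 0.0, 1.0
    while step_norm(hi) > hi:           # bracket the unique fixed point
        hi *= 2.0
    while hi - lo > tol:                # bisection on step_norm(r) - r, which decreases in r
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if step_norm(mid) > mid else (lo, mid)
    r = 0.5 * (lo + hi)
    return -np.linalg.solve(hess + 0.5 * H * r * I, grad)
```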

SLIDE 6

Gradient Descent vs. Cubic Newton

Optimization problem: F* = min_{x ∈ ℝ^N} F(x).

◮ Goal: F(x_K) − F* ≤ ε. What is K?
◮ Let F be convex: F(y) ≥ F(x) + ⟨∇F(x), y − x⟩.
◮ Iteration complexity estimates: K = O(1/ε) for GD, and K = O(1/√ε) for CN (much better).
◮ But, cost of one iteration: O(N) for GD and O(N³) for CN. N is huge for modern applications. Even O(N) is too much!

6 / 20

SLIDE 7

Our Motivation

Recent advances in block coordinate methods:

  • 1. Paul Tseng and Sangwoon Yun. "A coordinate gradient descent method for nonsmooth separable minimization". Mathematical Programming 117.1-2 (2009), pp. 387–423.
  • 2. Peter Richtárik and Martin Takáč. "Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function". Mathematical Programming 144.1-2 (2014), pp. 1–38.
  • 3. Zheng Qu et al. "SDNA: stochastic dual Newton ascent for empirical risk minimization". International Conference on Machine Learning. 2016, pp. 1823–1832.

Computationally effective steps, with convergence guarantees as for the full methods.
Aim: to create a second-order method with global complexity guarantees and a low cost of every iteration.

7 / 20

SLIDE 8

Plan of the Talk

  • 1. Review: Gradient Descent and Cubic Newton methods
  • 2. RBCN: Randomized Block Cubic Newton
  • 3. Application: Empirical Risk Minimization

8 / 20

SLIDE 9

Problem Structure

◮ Consider the following decomposition of F : ℝ^N → ℝ:
  F(x) ≡ φ(x) [twice differentiable] + g(x) [differentiable].
◮ For a given space decomposition ℝ^N ≡ ℝ^{N_1} × · · · × ℝ^{N_n}, x ≡ (x^{(1)}, . . . , x^{(n)}), x^{(i)} ∈ ℝ^{N_i},
  assume a block-separable structure of φ:  φ(x) ≡ ∑_{i=1}^{n} φ_i(x^{(i)}).
◮ Block-separability of g : ℝ^N → ℝ is not assumed.

9 / 20
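To make the block notation concrete, a small NumPy sketch of a block-separable φ with hypothetical block sizes and per-block functions (none of this is from the talk):

```python
import numpy as np

# Hypothetical decomposition R^N = R^{N_1} x ... x R^{N_n}.
block_sizes = [3, 2, 4]                    # N_1, N_2, N_3
offsets = np.cumsum([0] + block_sizes)     # start index of each block

def get_block(x, i):
    """Return x^{(i)}, the i-th block of x."""
    return x[offsets[i]:offsets[i + 1]]

def phi(x, phi_blocks):
    """Block-separable part: phi(x) = sum_i phi_i(x^{(i)})."""
    return sum(phi_i(get_block(x, i)) for i, phi_i in enumerate(phi_blocks))

# Example: each phi_i is a smooth convex function of its own block only.
phi_blocks = [lambda v: np.sum(v ** 4),
              lambda v: np.sum(np.cosh(v)),
              lambda v: v @ v]
value = phi(np.arange(9, dtype=float), phi_blocks)
```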

SLIDE 10

Main Assumptions

Optimization problem: min_{x ∈ Q} F(x), where F(x) ≡ ∑_{i=1}^{n} φ_i(x^{(i)}) + g(x).

◮ Every φ_i, i ∈ {1, . . . , n}, is twice differentiable and convex, with Lipschitz-continuous Hessian:
  ‖∇²φ_i(x) − ∇²φ_i(y)‖ ≤ H_i‖x − y‖, ∀x, y ∈ ℝ^{N_i}.
◮ g is differentiable, and for some fixed positive semidefinite matrices A ⪰ G ⪰ 0 we have the bounds, for all x, y ∈ ℝ^N:
  ◮ g(y) ≤ g(x) + ⟨∇g(x), y − x⟩ + (1/2)⟨A(y − x), y − x⟩,
  ◮ g(y) ≥ g(x) + ⟨∇g(x), y − x⟩ + (1/2)⟨G(y − x), y − x⟩.
◮ Q ⊂ ℝ^N is a simple convex set.

10 / 20
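As a sanity check of the two quadratic bounds on g, here is a small numerical verification for a hypothetical choice g(w) = Σⱼ log(1 + exp(wⱼ)) (softplus), whose Hessian satisfies 0 ⪯ ∇²g(w) ⪯ (1/4)I, so the bounds hold with A = (1/4)I and G = 0; this example is ours, not from the talk:

```python
import numpy as np

# Softplus: g(w) = sum_j log(1 + exp(w_j)); its gradient is the elementwise sigmoid,
# and its Hessian is diag(sigma(w_j) * (1 - sigma(w_j))), bounded between 0 and (1/4) I.
g      = lambda w: np.sum(np.logaddexp(0.0, w))
grad_g = lambda w: 1.0 / (1.0 + np.exp(-w))

A_coef, G_coef = 0.25, 0.0       # A = (1/4) I, G = 0
rng = np.random.default_rng(1)
for _ in range(1000):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    lin = g(x) + grad_g(x) @ (y - x)
    sq = (y - x) @ (y - x)
    assert g(y) <= lin + 0.5 * A_coef * sq + 1e-12   # upper quadratic bound
    assert g(y) >= lin + 0.5 * G_coef * sq - 1e-12   # lower quadratic bound (convexity)
```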

SLIDE 11

Model of the Objective

Objective: F(x) ≡ ∑_{i=1}^{n} φ_i(x^{(i)}) + g(x). We want to build a model of F.

◮ Fix a subset of blocks S ⊂ {1, . . . , n}.
◮ For y ∈ ℝ^N denote by y_[S] ∈ ℝ^N the vector with the blocks i ∉ S zeroed out.
◮ M_{H,S}(x; y) ≡ F(x) + ⟨∇φ(x), y_[S]⟩ + (1/2)⟨∇²φ(x) y_[S], y_[S]⟩ + (H/6)‖y_[S]‖³ + ⟨∇g(x), y_[S]⟩ + (1/2)⟨A y_[S], y_[S]⟩.
◮ From smoothness: F(x + y_[S]) ≤ M_{H,S}(x; y), ∀x, y ∈ ℝ^N, for H ≥ ∑_{i∈S} H_i.

11 / 20
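A sketch of how the model M_{H,S}(x; y) could be evaluated for a chosen set of blocks S, with y_[S] obtained by zeroing the blocks outside S; the block layout mirrors the earlier sketch and everything here is illustrative:

```python
import numpy as np

def restrict(y, S, offsets):
    """Return y_[S]: a copy of y with every block i not in S set to zero."""
    y_S = np.zeros_like(y)
    for i in S:
        y_S[offsets[i]:offsets[i + 1]] = y[offsets[i]:offsets[i + 1]]
    return y_S

def model_M(F_x, grad_phi_x, hess_phi_x, grad_g_x, A, H, y, S, offsets):
    """M_{H,S}(x; y) = F(x) + <grad phi(x), y_S> + 0.5 <hess phi(x) y_S, y_S>
                       + (H/6) ||y_S||^3 + <grad g(x), y_S> + 0.5 <A y_S, y_S>."""
    y_S = restrict(y, S, offsets)
    return (F_x
            + grad_phi_x @ y_S
            + 0.5 * y_S @ (hess_phi_x @ y_S)
            + (H / 6.0) * np.linalg.norm(y_S) ** 3
            + grad_g_x @ y_S
            + 0.5 * y_S @ (A @ y_S))
```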

SLIDE 12

RBCN: Randomized Block Cubic Newton Method

◮ Method step: T_{H,S}(x) ≡ argmin_{y ∈ ℝ^N_[S] : x + y ∈ Q} M_{H,S}(x; y).
◮ Algorithm:
  Initialization: choose x_0 ∈ ℝ^N and a uniform random block sampling Ŝ.
  Iterations: k ≥ 0.
  1: Sample S_k ∼ Ŝ.
  2: Find H_k > 0 such that F(x_k + T_{H_k,S_k}(x_k)) ≤ M_{H_k,S_k}(x_k; T_{H_k,S_k}(x_k)).
  3: Make the step: x_{k+1} := x_k + T_{H_k,S_k}(x_k).

12 / 20
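Putting the pieces together, the loop could look like the sketch below; the search for H_k is done here by doubling until the model upper-bounds the true function value (one simple way to satisfy the condition in step 2), and `solve_cubic_subproblem` is a placeholder for a solver of the step on this slide, not an API from the paper:

```python
def rbcn(F, x0, sample_blocks, solve_cubic_subproblem, model_M, H0=1.0, num_iters=100):
    """Sketch of the Randomized Block Cubic Newton loop.

    F(x)                            -- objective value
    sample_blocks()                 -- draws a random subset S_k of blocks
    solve_cubic_subproblem(x, H, S) -- returns the step T_{H,S}(x) (placeholder)
    model_M(x, y, H, S)             -- evaluates M_{H,S}(x; y) at the increment y
    """
    x, H = x0.copy(), H0
    for _ in range(num_iters):
        S = sample_blocks()
        H = max(H / 2.0, 1e-8)                     # try a smaller H first
        while True:
            T = solve_cubic_subproblem(x, H, S)
            if F(x + T) <= model_M(x, T, H, S):    # step 2: accept H_k once the model bounds F
                break
            H *= 2.0                               # otherwise increase H and retry
        x = x + T                                  # step 3
    return x
```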

SLIDE 13

Convergence Results

We want to get: P( F(x_K) − F* ≤ ε ) ≥ 1 − ρ, where ε > 0 is the required accuracy level and ρ ∈ (0, 1) is the confidence level.

Theorem 1 (general conditions):  K = O( (1/ε) · (n/τ) · (1 + log(1/ρ)) ), where τ ≡ E[|Ŝ|].

Theorem 2 (σ ∈ [0, 1] is a condition number, σ ≥ λ_min(G)/λ_max(A) > 0):  K = O( (1/√ε) · (n/τ) · (1/σ) · (1 + log(1/ρ)) ).

Theorem 3 (strongly convex case: µ ≡ λ_min(G) > 0):  K = O( log(1/(ερ)) · (n/τ) · (1/σ) · √(max{HD/µ, 1}) ), where D ≥ ‖x_0 − x*‖.

13 / 20

SLIDE 14

Plan of the Talk

  • 1. Review: Gradient Descent and Cubic Newton methods
  • 2. RBCN: Randomized Block Cubic Newton
  • 3. Application: Empirical Risk Minimization

14 / 20

SLIDE 15

Empirical Risk Minimization

ERM problem: min_{w ∈ ℝ^d} [ P(w) ≡ ∑_{i=1}^{n} φ_i(b_i^T w)  (loss)  + g(w)  (regularizer) ]

◮ SVM: φ_i(a) = max{0, 1 − y_i a},
◮ Logistic regression: φ_i(a) = log(1 + exp(−y_i a)),
◮ Regression: φ_i(a) = (a − y_i)² or φ_i(a) = |a − y_i|,
◮ Support vector regression: φ_i(a) = max{0, |a − y_i| − ν},
◮ Generalized linear models.

15 / 20
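Written as plain Python, the listed losses are just scalar functions of a = b_i^T w; the helper names below are ours, for illustration:

```python
import numpy as np

# Per-example losses phi_i(a), where a = b_i^T w.
def hinge(a, y):               return max(0.0, 1.0 - y * a)           # SVM
def logistic(a, y):            return np.log1p(np.exp(-y * a))        # logistic regression
def squared(a, y):             return (a - y) ** 2                    # regression
def absolute(a, y):            return abs(a - y)                      # regression (robust)
def eps_insensitive(a, y, nu): return max(0.0, abs(a - y) - nu)       # support vector regression

def erm_objective(w, B, y, loss, g):
    """P(w) = sum_i phi_i(b_i^T w) + g(w), with one loss applied to every example."""
    return sum(loss(B[i] @ w, y[i]) for i in range(B.shape[0])) + g(w)
```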

SLIDE 16

Constrained Problem Reformulation

min_{w ∈ ℝ^d} P(w) = min_{w ∈ ℝ^d} [ ∑_{i=1}^{n} φ_i(b_i^T w) + g(w) ]
                   = min_{(w, µ) ∈ Q} [ ∑_{i=1}^{n} φ_i(µ_i)  (separable, twice differentiable)  + g(w)  (differentiable) ],
where Q ≡ { w ∈ ℝ^d, µ ∈ ℝ^n | b_i^T w = µ_i }.

◮ Approximate the φ_i by second-order models with cubic regularization;
◮ Treat g as a quadratic function;
◮ Project onto the simple constraint set Q.

16 / 20
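The set Q here is an affine subspace, so projecting onto it reduces to one small linear solve; the sketch below is our illustration of that projection, not code from the paper:

```python
import numpy as np

def project_onto_Q(w0, mu0, B):
    """Euclidean projection of (w0, mu0) onto Q = {(w, mu) : B w = mu}.

    With C = [B, -I] and z = (w, mu), the constraint is C z = 0 and the projection
    is z - C^T (C C^T)^{-1} C z, where C C^T = B B^T + I.
    """
    residual = B @ w0 - mu0                                  # C z for the current point
    lam = np.linalg.solve(B @ B.T + np.eye(B.shape[0]), residual)
    return w0 - B.T @ lam, mu0 + lam

# Quick check: the projected point satisfies the constraints b_i^T w = mu_i.
rng = np.random.default_rng(2)
B = rng.standard_normal((4, 6))
w, mu = project_onto_Q(rng.standard_normal(6), rng.standard_normal(4), B)
assert np.allclose(B @ w, mu)
```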

SLIDE 17

Proof of Concept: Does second-order information help?

[Figure: total computational time (s) vs. block size (10 to 8K), comparing block coordinate Gradient Descent and block coordinate Cubic Newton on the leukemia and duke breast-cancer datasets, tolerance = 1e-6.]

◮ Training Logistic Regression, d = 7129.
◮ Cubic Newton beats Gradient Descent for 10 ≤ |S| ≤ 50.
◮ Second-order information improves convergence.

17 / 20

SLIDE 18

Maximization of the Dual Problem

Initial objective: P(w) ≡ ∑_{i=1}^{n} φ_i(b_i^T w) + g(w).

We have the Primal and Dual problems: min_{w ∈ ℝ^d} P(w) ≥ max_{α ∈ ℝ^n} D(α).

Introducing the Fenchel conjugate f*(s) ≡ sup_x [ s^T x − f(x) ], we have
D(α) ≡ ∑_{i=1}^{n} −φ_i*(α_i)  (separable, twice differentiable)  − g*(−B^T α)  (differentiable).

Solve the Dual problem by our framework:
◮ Approximate the φ_i* by second-order cubic models;
◮ Treat g* as a quadratic function;
◮ Project onto Q ≡ ⋂_{i=1}^{n} dom φ_i*.

18 / 20
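As a small illustration of the Fenchel conjugate used here: for the squared loss from the primal slide, φ_i(a) = (a − y_i)², the conjugate has the closed form φ_i*(s) = s·y_i + s²/4, which the snippet below checks numerically (our example of a standard convex-analysis fact, not from the talk):

```python
import numpy as np

def conjugate_numeric(f, s, grid):
    """Fenchel conjugate f*(s) = sup_x [ s*x - f(x) ], approximated over a grid."""
    return np.max(s * grid - f(grid))

# Squared loss phi_i(a) = (a - y_i)^2: maximizing s*a - (a - y_i)^2 over a
# gives a = y_i + s/2, hence phi_i^*(s) = s*y_i + s^2/4.
y_i = 1.5
phi = lambda a: (a - y_i) ** 2
grid = np.linspace(-20.0, 20.0, 200_001)
for s in (-2.0, 0.3, 4.0):
    closed_form = s * y_i + s ** 2 / 4.0
    assert abs(conjugate_numeric(phi, s, grid) - closed_form) < 1e-6
```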

SLIDE 19

Training Poisson Regression

◮ Solving the dual of Poisson regression.

[Figure: duality gap vs. epoch on the Synthetic and Montreal bike lanes datasets, comparing Cubic, SDNA, and SDCA with block sizes 8, 32, and 256.]

SDNA: Zheng Qu et al. "SDNA: stochastic dual Newton ascent for empirical risk minimization". International Conference on Machine Learning. 2016, pp. 1823–1832.

SDCA: Shai Shalev-Shwartz and Tong Zhang. "Stochastic dual coordinate ascent methods for regularized loss minimization". Journal of Machine Learning Research 14 (2013), pp. 567–599.

19 / 20

SLIDE 20

Conclusion

New second-order algorithm for convex optimization:
◮ Based on cubic regularization.
◮ Utilizes problem structure.
◮ Performs randomized block updates (computationally cheap).
◮ Has global complexity guarantees.

New Primal-Dual method for Empirical Risk Minimization:
◮ Outperforms the state of the art in terms of the number of data accesses.

Thank you for your attention! See you at Poster #156.

20 / 20