SLIDE 1

Inexact Tensor Methods with Dynamic Accuracies

Nikita Doikov Yurii Nesterov

UCLouvain, Belgium

ICML 2020

SLIDE 2

Plan of the talk

  • 1. Introduction: Tensor Methods in Convex Optimization
  • 2. Inexact Tensor Methods
  • 3. Acceleration
  • 4. Numerical Example

SLIDE 3

Plan of the talk

  • 1. Introduction: Tensor Methods in Convex Optimization
  • 2. Inexact Tensor Methods
  • 3. Acceleration
  • 4. Numerical Example

SLIDE 4

Gradient Method

Composite optimization problem:

    min_{x ∈ dom F} F(x) := f(x) + ψ(x),

◮ f is convex and smooth;
◮ ψ : R^n → R ∪ {+∞} is convex (possibly nonsmooth, but simple).

The Gradient Method:

    x_{k+1} = argmin_y { ⟨∇f(x_k), y − x_k⟩ + (H/2)‖y − x_k‖² + ψ(y) },  k ≥ 0.

◮ The gradient of f is Lipschitz continuous: ‖∇f(y) − ∇f(x)‖ ≤ L_1‖y − x‖ ⇒ H := L_1.
◮ Global sublinear convergence: F(x_k) − F* ≤ O(1/k).
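As a concrete illustration, here is a minimal sketch of one such step in Python, assuming the common special case ψ(y) = λ‖y‖₁, for which the argmin has a closed form (soft-thresholding). The function names are ours, not from the talk.

```python
import numpy as np

def gradient_step(x, grad_f, H, lam):
    """One composite gradient step, assuming psi(y) = lam * ||y||_1.

    The argmin of <grad_f(x), y - x> + (H/2)||y - x||^2 + lam*||y||_1
    is soft-thresholding applied to the plain gradient step x - grad_f(x)/H.
    """
    z = x - grad_f(x) / H                                     # forward step
    return np.sign(z) * np.maximum(np.abs(z) - lam / H, 0.0)  # prox of l1
```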

SLIDE 5

Newton Method with Cubic Regularization

◮ The Hessian of f is Lipschitz continuous: ‖∇²f(y) − ∇²f(x)‖ ≤ L_2‖y − x‖.

Cubic Newton:

    x_{k+1} = argmin_y { ⟨∇f(x_k), y − x_k⟩ + (1/2)⟨∇²f(x_k)(y − x_k), y − x_k⟩ + (H/6)‖y − x_k‖³ + ψ(y) },  k ≥ 0.

◮ H := 0 ⇒ classical Newton.
◮ H := L_2 ⇒ global convergence: F(x_k) − F* ≤ O(1/k²). [Nesterov-Polyak, 2006]
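For intuition, a minimal sketch of solving this subproblem when ψ ≡ 0 and f is convex: the minimizer h = y − x_k satisfies (∇²f(x_k) + (H r/2) I) h = −∇f(x_k) with r = ‖h‖, so one can bisect on the scalar r. This is an illustration under stated assumptions (dense solve per trial r, our own names), not the efficient single-factorization implementation.

```python
import numpy as np

def cubic_newton_step(g, A, H, r_max=1e8, tol=1e-10):
    """Minimize <g, h> + 0.5*<A h, h> + (H/6)*||h||^3 over h (psi == 0),
    assuming A (the Hessian) is positive semidefinite, i.e. f is convex.

    Optimality: (A + (H*r/2) I) h = -g with r = ||h||; since ||h(r)||
    is nonincreasing in r, we bisect on the scalar r.
    """
    n = g.shape[0]
    def h(r):
        return np.linalg.solve(A + 0.5 * H * r * np.eye(n), -g)
    lo, hi = 0.0, r_max
    while hi - lo > tol * max(1.0, hi):
        r = 0.5 * (lo + hi)
        if np.linalg.norm(h(r)) > r:
            lo = r   # ||h(r)|| still exceeds r: the fixed point lies above
        else:
            hi = r
    return h(hi)
```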

SLIDE 6

Tensor Methods

Let x ∈ R^n be fixed; consider an arbitrary h ∈ R^n and the one-dimensional function φ(t) := f(x + th), t ∈ R. Then

    φ(0) = f(x),  φ′(0) = ⟨∇f(x), h⟩,  φ″(0) = ⟨∇²f(x)h, h⟩.

Denote D^p f(x)[h]^p := φ^(p)(0). The model:

    Ω_H(x; y) := Σ_{i=1}^{p} (1/i!) D^i f(x)[y − x]^i + (H/(p+1)!)‖y − x‖^{p+1} + ψ(y).

Tensor Method of order p ≥ 1:

    x_{k+1} = argmin_y Ω_H(x_k; y),  k ≥ 0.

◮ The p-th derivative is Lipschitz continuous: ‖D^p f(y) − D^p f(x)‖ ≤ L_p‖y − x‖.
◮ Global convergence: F(x_k) − F* ≤ O(1/k^p). [Baes, 2009]
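To make the definition D^p f(x)[h]^p = φ^(p)(0) concrete, here is a small sketch that approximates it by a p-th order central finite difference along the direction h; this numerical check is our illustration, not part of the method.

```python
from math import comb

def dpf(f, x, h, p, eps=1e-4):
    """Approximate D^p f(x)[h]^p = phi^(p)(0) for phi(t) = f(x + t*h)
    via the central finite-difference formula of order p:
    phi^(p)(0) ~ eps^(-p) * sum_i (-1)^i C(p, i) phi((p/2 - i)*eps)."""
    return sum((-1) ** i * comb(p, i) * f(x + (p / 2 - i) * eps * h)
               for i in range(p + 1)) / eps ** p
```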

SLIDE 7

Tensor Methods: Solving the Subproblem

At each iteration k ≥ 0, the subproblem is

    min_y Ω_H(x_k; y) := Σ_{i=1}^{p} (1/i!) D^i f(x_k)[y − x_k]^i + (H/(p+1)!)‖y − x_k‖^{p+1} + ψ(y).

◮ H ≥ pL_p ⇒ Ω_H(x_k; y) is convex in y. [Nesterov, 2018]
◮ For p = 3: efficient implementation, using the Gradient Method with a relative smoothness condition [Van Nguyen, 2017; Bauschke-Bolte-Teboulle, 2016; Lu-Freund-Nesterov, 2018].

The cost of minimizing Ω_H(x_k; ·) is O(n³) + Õ(n).

SLIDE 8

Some Recent Results

◮ Accelerated Tensor Methods: F(x_k) − F* ≤ O(1/k^{p+1}) [Baes, 2009; Nesterov, 2018].
◮ Optimal Tensor Methods: F(x_k) − F* ≤ O(1/k^{(3p+1)/2}) [Gasnikov et al., 2019; Kamzolov-Gasnikov-Dvurechensky, 2020]. The oracle complexity matches the lower bound (up to a logarithmic factor) from [Arjevani-Shamir-Shiff, 2017].
◮ Universal Tensor Methods: [Grapiglia-Nesterov, 2019].
◮ Stochastic Tensor Methods: [Lucchi-Kohler, 2019].
◮ . . .

SLIDE 9

Plan of the talk

  • 1. Introduction: Tensor Methods in Convex Optimization
  • 2. Inexact Tensor Methods
  • 3. Acceleration
  • 4. Numerical Example

SLIDE 10

Definition of Inexactness

Use a point T = T_{H,δ}(x_k) with a small residual in function value:

    Ω_H(x_k; T) − min_y Ω_H(x_k; y) ≤ δ.

◮ Easier to achieve by an inner method.
◮ Can be controlled in practice using the duality gap.

Set H := pL_p. Then F(T) ≤ F(x_k) + δ.
◮ The inexact step can be nonmonotone.
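Schematically, the inner solver is run until a computable upper bound on this residual (e.g., a duality gap) drops below δ. A minimal sketch, where `inner_iteration` and `gap_bound` are hypothetical callables standing in for the inner method and its gap certificate:

```python
def inexact_tensor_step(x, inner_iteration, gap_bound, delta, max_iters=10_000):
    """Run an inner method on y -> Omega_H(x; y) until the certificate
    gap_bound(y), an upper bound on Omega_H(x; y) - min_y Omega_H(x; y),
    drops below delta (sketch; both callables are assumptions)."""
    y = x.copy()
    for _ in range(max_iters):
        if gap_bound(y) <= delta:
            break               # the delta-residual condition is certified
        y = inner_iteration(y)  # one step of the inner method
    return y
```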

SLIDE 11

Monotone Inexact Tensor Methods

Initialization: choose x_0 ∈ dom F, set H := pL_p.
Iterations, k ≥ 0:
1: Pick δ_{k+1} ≥ 0.
2: Compute an inexact monotone tensor step T such that
   Ω_H(x_k; T) − min_y Ω_H(x_k; y) ≤ δ_{k+1}  and  F(T) < F(x_k).
3: x_{k+1} := T.

Theorem 1. Set δ_k := c/k^{p+1}, for c ≥ 0. Then F(x_k) − F* ≤ O(1/k^p).
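Putting the pieces together, a minimal sketch of this outer loop with the dynamic accuracies of Theorem 1; `F` and the hypothetical oracle `step_oracle(x, delta)` (e.g., a partial application of the inner-loop sketch above) are assumed to be supplied.

```python
def monotone_inexact_method(x0, F, step_oracle, c, p, num_iters):
    """Outer loop with dynamic accuracies delta_k = c / k^(p+1) and a
    monotonicity check, as in Theorem 1. step_oracle(x, delta) must
    return T with Omega_H(x; T) - min_y Omega_H(x; y) <= delta."""
    x = x0
    for k in range(1, num_iters + 1):
        delta = c / k ** (p + 1)   # dynamic inner accuracy
        T = step_oracle(x, delta)
        if F(T) < F(x):            # accept only monotone steps
            x = T
    return x
```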

SLIDE 12

Adaptive Strategy for Inner Accuracy

Let us set δ_k := c(F(x_{k−2}) − F(x_{k−1})).

Theorem 2 (general convex case). F(x_k) − F* ≤ O(1/k^p).

Theorem 3 (uniformly convex objective). Let

    F(y) ≥ F(x) + ⟨F′(x), y − x⟩ + (σ_{p+1}/(p+1))‖y − x‖^{p+1}.

Denote ω_p := max{ (p+1)² L_p / (p! σ_{p+1}), 1 }. Then we have the linear rate

    F(x_{k+1}) − F* ≤ ( 1 − p ω_p^{−1/p} / (2(p+1)) ) (F(x_k) − F*).

◮ This works for methods of any order p ≥ 1.

Theorem 4. For p ≥ 2 and a strongly convex objective, we have a local superlinear rate.
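A minimal sketch of this adaptive rule, computing δ_k from the recorded objective values; the initialization for k < 2 is our assumption, not specified on the slide.

```python
def adaptive_deltas(F_history, c=1.0, delta0=1.0):
    """Adaptive accuracies delta_k = c * (F(x_{k-2}) - F(x_{k-1})),
    read off the recorded values F_history[j] = F(x_j).
    The gaps are nonnegative whenever the method is monotone."""
    deltas = [delta0, delta0]  # hypothetical choice for k = 0, 1
    for k in range(2, len(F_history) + 1):
        deltas.append(c * (F_history[k - 2] - F_history[k - 1]))
    return deltas
```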

SLIDE 13

Plan of the talk

  • 1. Introduction: Tensor Methods in Convex Optimization
  • 2. Inexact Tensor Methods
  • 3. Acceleration
  • 4. Numerical Example

SLIDE 14

Contracting Proximal Scheme

◮ Fix a prox-function d(x). Bregman divergence: β_d(x; y) := d(y) − d(x) − ⟨∇d(x), y − x⟩.
◮ Two sequences of points {x_k}_{k≥0}, {v_k}_{k≥0}, with v_0 = x_0.
◮ A sequence of positive coefficients {a_k}, with A_k := Σ_{i=1}^{k} a_i.

Iterations, k ≥ 0:
1. Compute
   v_{k+1} = argmin_y { A_{k+1} f( (a_{k+1} y + A_k x_k)/A_{k+1} ) + a_{k+1} ψ(y) + β_d(v_k; y) }.
2. Put x_{k+1} = (a_{k+1} v_{k+1} + A_k x_k)/A_{k+1}.

The rate of convergence: F(x_k) − F* ≤ β_d(x_0; x*)/A_k. [Doikov-Nesterov, 2019]
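A minimal sketch of this scheme; `prox_subproblem` is a hypothetical oracle for step 1 (in the talk it is itself solved inexactly by tensor steps), and `coeffs` holds a_1, a_2, ….

```python
def contracting_proximal(x0, prox_subproblem, coeffs):
    """Contracting proximal scheme (sketch). prox_subproblem(A_next, a, x, v)
    must return argmin_y { A_next * f((a*y + A*x)/A_next) + a * psi(y)
    + beta_d(v; y) } for the current A = A_next - a."""
    x, v = x0.copy(), x0.copy()
    A = 0.0
    for a in coeffs:
        A_next = A + a
        v = prox_subproblem(A_next, a, x, v)  # (inexact) subproblem solve
        x = (a * v + A * x) / A_next          # contracted update of x
        A = A_next
    return x
```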

SLIDE 15

Acceleration of Tensor Steps

For the Tensor Method of order p ≥ 1:

◮ Set d(x) := (1/(p+1))‖x − x_0‖^{p+1}.
◮ A_{k+1} := (k+1)^{p+1}/L_p.

For the contracted objective with regularization

    h_{k+1}(y) := A_{k+1} f( (a_{k+1} y + A_k x_k)/A_{k+1} ) + a_{k+1} ψ(y) + β_d(v_k; y),

we compute an inexact minimizer v_{k+1}:

    h_{k+1}(v_{k+1}) − h*_{k+1} ≤ c/(k+1)^{p+2}.

◮ This requires Õ(1) inexact Tensor Steps.

Theorem. For the outer iterations, we obtain the accelerated rate:

    F(x_k) − F* ≤ O(1/k^{p+1}).
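For concreteness, a small sketch of this coefficient schedule, recovering a_k as differences of the prescribed A_k (our helper, not from the talk):

```python
def tensor_coefficients(p, Lp, num_iters):
    """Coefficient schedule from the slide: A_k = k^(p+1) / L_p,
    hence a_k = A_k - A_{k-1} for k = 1, ..., num_iters."""
    A = [k ** (p + 1) / Lp for k in range(num_iters + 1)]
    return [A[k] - A[k - 1] for k in range(1, num_iters + 1)]
```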

SLIDE 16

Plan of the talk

  • 1. Introduction: Tensor Methods in Convex Optimization
  • 2. Inexact Tensor Methods
  • 3. Acceleration
  • 4. Numerical Example

SLIDE 17

Log-sum-exp

    min_{x ∈ R^n} f(x) := μ log( Σ_{i=1}^{m} exp( (⟨a_i, x⟩ − b_i)/μ ) )   (SoftMax).

◮ a_1, …, a_m, b — given data.
◮ μ > 0 — smoothing parameter.
◮ Denote B ≡ Σ_{i=1}^{m} a_i a_iᵀ ⪰ 0, and use ‖x‖ ≡ ⟨Bx, x⟩^{1/2}.

We have L_1 ≤ 1/μ,  L_2 ≤ 2/μ²,  L_3 ≤ 4/μ³.

◮ Cubic Newton (p = 2).
◮ Each step is computed (inexactly) by the Fast Gradient Method.
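A minimal sketch of evaluating this objective stably in Python, subtracting the maximal exponent before exponentiating to avoid overflow for small μ; the function name is ours.

```python
import numpy as np

def softmax_objective(A, b, mu, x):
    """f(x) = mu * log(sum_i exp((<a_i, x> - b_i) / mu)), with rows of A
    being the a_i. Evaluated via the stable log-sum-exp trick."""
    z = (A @ x - b) / mu
    z_max = z.max()
    return mu * (z_max + np.log(np.exp(z - z_max).sum()))
```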

SLIDE 18

Log-sum-exp: Constant strategies

◮ δ_k := const.

[Figure: Log-sum-exp, μ = 0.05, constant strategies δ ∈ {10⁻², 10⁻⁴, 10⁻⁶, 10⁻⁸}: functional residual vs. iterations (left) and vs. Hessian-vector products (right).]

SLIDE 19

Log-sum-exp: Dynamic strategies

◮ δ_k := 1/k^α.

[Figure: Log-sum-exp, μ = 0.05, dynamic strategies δ_k ∈ {1/k, 1/k², 1/k³, 1/k⁴}: functional residual vs. iterations (left) and vs. Hessian-vector products (right).]

SLIDE 20

Log-sum-exp: Adaptive strategies

◮ δ_k := (F(x_{k−1}) − F(x_k))^α.

[Figure: Log-sum-exp, μ = 0.05, adaptive strategies with α ∈ {1, 1.5, 2}: functional residual vs. iterations (left) and vs. Hessian-vector products (right).]

SLIDE 21

Log-sum-exp: Cubic Newton vs. Tensor Method

[Figure: Log-sum-exp, μ = 0.1: functional residual vs. time (s), comparing Cubic Newton (p = 2), Tensor (p = 3) with exact steps, and Tensor (p = 3) with the adaptive strategy.]

◮ H is fixed.

SLIDE 22

Conclusion

Inexact Tensor Methods of degree p ≥ 1:
◮ p = 1: Gradient Method.
◮ p = 2: Newton Method with cubic regularization.
◮ p = 3: third-order Tensor Method.

We allow solving the subproblem inexactly, with δ_k the accuracy in functional residual for the subproblem.
◮ Dynamic strategy: δ_k := c/k^{p+1}.
◮ Adaptive strategy: δ_k := c(F(x_{k−2}) − F(x_{k−1})).
Global rate of convergence: F(x_k) − F* ≤ O(1/k^p).
◮ Using contracting proximal iterations we obtain the accelerated O(1/k^{p+1}) rate.

Thank you for your attention!
