

  1. An Accelerated Variance Reducing Stochastic Method with Douglas-Rachford Splitting
     Jingchang Liu
     November 12, 2018
     University of Science and Technology of China

  2. Table of Contents
     • Background
     • Moreau Envelope and Douglas-Rachford (DR) Splitting
     • Our methods
     • Theories
     • Experiments
     • Conclusions
     • Q & A

  3. Background

  4. Problem Formulation
     • Regularized ERM: min_{x ∈ R^d} f(x) + h(x) := (1/n) Σ_{i=1}^n f_i(x) + h(x).
     • f_i : R^d → R: empirical loss of the i-th sample, convex.
     • h: regularization term, convex but possibly non-smooth.
     • Examples: LASSO, sparse SVM, ℓ1-, ℓ2-Logistic Regression.
     Definitions
     • Proximal operator: prox_{γf}(x) = argmin_{y ∈ R^d} { f(y) + (1/(2γ)) ‖y − x‖² }.
     • Gradient mapping: (1/γ)(x − prox_{γf}(x)).
     • Subdifferential: ∂f(x) = { g | g^T(y − x) ≤ f(y) − f(x), ∀y ∈ dom f }.
     • µ-strongly convex: f(y) ≥ f(x) + ⟨g, y − x⟩ + (µ/2)‖y − x‖² for all g ∈ ∂f(x).
     • L-smooth: f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖y − x‖².
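To make the proximal-operator definition concrete, here is a minimal NumPy sketch (not from the slides) of prox_{γh} for the ℓ1 regularizer h(x) = λ‖x‖₁, which reduces to elementwise soft thresholding; the function name and test values are purely illustrative.

```python
import numpy as np

def prox_l1(x, gamma, lam):
    """Proximal operator of h(x) = lam * ||x||_1 with step gamma:
    argmin_y  lam*||y||_1 + ||y - x||^2 / (2*gamma)  ->  soft thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - gamma * lam, 0.0)

# Entries smaller than gamma*lam in magnitude are set exactly to zero.
x = np.array([3.0, -0.2, 0.5, -1.5])
print(prox_l1(x, gamma=1.0, lam=0.4))   # -> [ 2.6  0.   0.1 -1.1]
```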

  5. Related Works
     Existing Algorithms
     Iterate x ← prox_{γh}(x − γ·v), where v can be obtained from:
     - GD: v = ∇f(x); more computation is needed in each iteration.
     - SGD: v = ∇f_i(x); the small stepsize leads to slow convergence.
     - Variance reduction (VR): v = ∇f_i(x) − ∇f_i(x̄) + ∇f(x̄), such as SVRG, SAGA, SDCA.
     Accelerated Technique
     • Ill-conditioning: L/µ, the condition number, is large.
     • Methods: Acc-SDCA, Catalyst, Mig, Point-SAGA.
     • Drawback: more parameters need to be tuned.
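As a rough illustration of the three choices of the stochastic direction v above, a small NumPy sketch for least-squares losses f_i(x) = ½(a_iᵀx − b_i)²; the data and the snapshot point x̄ (x_bar) are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
A, b = rng.normal(size=(n, d)), rng.normal(size=n)

def grad_i(x, i):                    # gradient of f_i(x) = 0.5*(a_i^T x - b_i)^2
    return (A[i] @ x - b[i]) * A[i]

def full_grad(x):                    # gradient of f(x) = (1/n) sum_i f_i(x)
    return A.T @ (A @ x - b) / n

x, x_bar = rng.normal(size=d), np.zeros(d)
i = rng.integers(n)

v_gd  = full_grad(x)                                         # GD: exact, n gradients per step
v_sgd = grad_i(x, i)                                         # SGD: cheap but high variance
v_vr  = grad_i(x, i) - grad_i(x_bar, i) + full_grad(x_bar)   # SVRG-style VR estimator

# v_sgd and v_vr are both unbiased over the random index i; the VR estimator's
# variance vanishes as x and x_bar approach the optimum.
```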

  6. Convergence Rate
     • VR stochastic methods: O((n + L/µ) log(1/ε)).
     • Acc-SDCA, Mig, Point-SAGA: O((n + √(nL/µ)) log(1/ε)).
     • When L/µ ≫ n, the accelerated technique makes convergence much faster.
     Aim
     Design a simpler accelerated VR stochastic method that achieves the fastest convergence rate.

  7. Moreau Envelope and Douglas-Rachford (DR) Splitting

  8. Moreau Envelope
     Formulation
     f_γ(x) = inf_y { f(y) + (1/(2γ)) ‖x − y‖² }.
     Properties
     • x* minimizes f(x) iff x* minimizes f_γ(x).
     • f_γ is continuously differentiable even when f is non-differentiable:
       ∇f_γ(x) = (x − prox_{γf}(x))/γ. Moreover, f_γ is (1/γ)-smooth.
     • If f is µ-strongly convex, then f_γ is µ/(µγ + 1)-strongly convex.
     • The condition number of f_γ is (µγ + 1)/(µγ), which may be better.
     Proximal Point Algorithm (PPA)
     x^{k+1} = prox_{γf}(x^k) = x^k − γ∇f_γ(x^k).
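A tiny sketch, assuming f(x) = |x| (whose prox is soft thresholding), showing that each PPA step prox_{γf}(x^k) coincides with the gradient step x^k − γ∇f_γ(x^k) on the Moreau envelope; everything here is illustrative rather than taken from the slides.

```python
import numpy as np

gamma = 0.5

def prox_abs(x, gamma):                  # prox of f(x) = |x|: soft thresholding
    return np.sign(x) * max(abs(x) - gamma, 0.0)

def grad_moreau(x, gamma):               # gradient of the Moreau envelope f_gamma
    return (x - prox_abs(x, gamma)) / gamma

x = 3.0
for k in range(8):
    step = x - gamma * grad_moreau(x, gamma)   # gradient step on f_gamma
    x = prox_abs(x, gamma)                     # PPA step
    assert abs(step - x) < 1e-12               # the two updates coincide
print(x)                                       # approaches the minimizer x* = 0
```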

  9. Point-SAGA
     Formulation
     Used when h is absent: min_{x ∈ R^d} f(x) := (1/n) Σ_{i=1}^n f_i(x).
     Iteration
     z_j^k = x^k + γ ( g_j^k − (1/n) Σ_{i=1}^n g_i^k ),
     x^{k+1} = prox_{γ f_j}(z_j^k),
     g_j^{k+1} = (z_j^k − x^{k+1}) / γ.
     Equivalence
     x^{k+1} = x^k − γ ( g_j^{k+1} − g_j^k + (1/n) Σ_{i=1}^n g_i^k ),
     where g_j^{k+1} is the gradient mapping of f_j at z_j^k.
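Below is a minimal sketch of the Point-SAGA iteration just stated, using least-squares losses f_i(x) = ½(a_iᵀx − b_i)² so that prox_{γ f_i} has a closed form (one Sherman-Morrison step); the data, stepsize, and iteration count are arbitrary choices for this toy example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, gamma = 20, 5, 0.1
A, b = rng.normal(size=(n, d)), rng.normal(size=n)

def prox_fi(z, i, gamma):
    """Closed-form prox of gamma*f_i with f_i(x) = 0.5*(a_i^T x - b_i)^2."""
    a = A[i]
    w = z + gamma * b[i] * a
    return w - (gamma * (a @ w) / (1.0 + gamma * a @ a)) * a

x = np.zeros(d)
g = np.zeros((n, d))                         # table of gradient mappings g_i^k
g_avg = g.mean(axis=0)

for k in range(5000):
    j = rng.integers(n)
    z = x + gamma * (g[j] - g_avg)           # z_j^k
    x_new = prox_fi(z, j, gamma)             # x^{k+1} = prox_{gamma f_j}(z_j^k)
    g_new = (z - x_new) / gamma              # g_j^{k+1}: gradient mapping at z_j^k
    g_avg += (g_new - g[j]) / n              # maintain the running average
    g[j], x = g_new, x_new

print(np.linalg.norm(A.T @ (A @ x - b) / n))   # optimality residual; should be small
```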

  10. Point-SAGA: Convergence rate
      • Strongly convex and smooth: O((n + √(nL/µ)) log(1/ε)).
      • Strongly convex and non-smooth: O(1/ε).

  11. Douglas-Rachford (DR) Splitting
      Formulation
      min_{x ∈ R^d} f(x) + h(x).
      Iteration
      y^{k+1} = y^k − x^k + prox_{γf}(2x^k − y^k),
      x^{k+1} = prox_{γh}(y^{k+1}).
      Convergence
      • F(y) = y + prox_{γh}(2 prox_{γf}(y) − y) − prox_{γf}(y).
      • y is a fixed point of F if and only if x = prox_{γf}(y) satisfies 0 ∈ ∂f(x) + ∂h(x):
        y = F(y)  ⇔  0 ∈ ∂f(prox_{γf}(y)) + ∂h(prox_{γf}(y)).
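A short sketch of this DR iteration on a toy composite problem with f(x) = ½‖Ax − b‖² and h(x) = λ‖x‖₁, where both prox operators are available in closed form; the problem data, γ, and λ are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, gamma, lam = 30, 10, 1.0, 0.1
A, b = rng.normal(size=(n, d)), rng.normal(size=n)

M = np.eye(d) + gamma * A.T @ A          # for prox of f(x) = 0.5*||Ax - b||^2

def prox_f(z):
    return np.linalg.solve(M, z + gamma * A.T @ b)

def prox_h(z):                           # prox of gamma*h, h(x) = lam*||x||_1
    return np.sign(z) * np.maximum(np.abs(z) - gamma * lam, 0.0)

y = np.zeros(d)
x = prox_h(y)
for k in range(200):
    y = y - x + prox_f(2 * x - y)        # y^{k+1}
    x = prox_h(y)                        # x^{k+1}
# x now approximately minimizes 0.5*||Ax - b||^2 + lam*||x||_1
```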

  12. Our methods

  13. Algorithm (Prox2-SAGA)

  14. Iterations
      Main iterations
      y^{k+1} = x^k − γ ( g_j^{k+1} − g_j^k + (1/n) Σ_{i=1}^n g_i^k ),
      x^{k+1} = prox_{γh}(y^{k+1}),
      where
      g_j^{k+1} = (1/γ) [ (z_j^k + x^k − y^k) − prox_{γ f_j}(z_j^k + x^k − y^k) ],
      the gradient mapping of f_j at z_j^k + x^k − y^k.
      Number of parameters
      Prox2-SAGA  Point-SAGA  Katyusha  Mig  Acc-SDCA  Catalyst
      1           1           3         2    2         several
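The sketch below follows the main iteration as reconstructed on this slide, with z_j^k maintained as in the Point-SAGA reduction on the next slide (z_j^k = x^k + γ(g_j^k − (1/n)Σ_i g_i^k)); that choice, the ℓ1-regularized least-squares problem, and the stepsize are assumptions made for illustration, not the authors' reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, gamma, lam = 20, 5, 0.1, 0.05
A, b = rng.normal(size=(n, d)), rng.normal(size=n)

def prox_fi(z, i, gamma):        # closed-form prox of gamma*f_i, f_i(x) = 0.5*(a_i^T x - b_i)^2
    a = A[i]
    w = z + gamma * b[i] * a
    return w - (gamma * (a @ w) / (1.0 + gamma * a @ a)) * a

def prox_h(z, gamma):            # prox of gamma*h, h(x) = lam*||x||_1
    return np.sign(z) * np.maximum(np.abs(z) - gamma * lam, 0.0)

def objective(x):
    return 0.5 * np.mean((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

y = np.zeros(d)
x = prox_h(y, gamma)
g = np.zeros((n, d))                             # table of gradient mappings g_i^k
g_avg = g.mean(axis=0)

for k in range(5000):
    j = rng.integers(n)
    z = x + gamma * (g[j] - g_avg)               # z_j^k (assumed Point-SAGA-style definition)
    u = z + x - y                                # point where f_j's prox is evaluated
    g_new = (u - prox_fi(u, j, gamma)) / gamma   # g_j^{k+1}: gradient mapping of f_j at u
    y = x - gamma * (g_new - g[j] + g_avg)       # y^{k+1}
    x = prox_h(y, gamma)                         # x^{k+1} = prox_{gamma h}(y^{k+1})
    g_avg += (g_new - g[j]) / n
    g[j] = g_new

print(objective(x))    # should approach the minimum of f + h
```

With h = 0 the update collapses to the Point-SAGA loop above, and with n = 1 it collapses to the DR iteration, matching the reductions on the next slide.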

  15. Connections to other algorithms
      Point-SAGA
      When h = 0, we have x^k = y^k for Prox2-SAGA, and the iteration becomes
      z_j^k = x^k + γ ( g_j^k − (1/n) Σ_{i=1}^n g_i^k ),
      x^{k+1} = prox_{γ f_j}(z_j^k),
      g_j^{k+1} = (1/γ)(z_j^k − x^{k+1}).
      DR splitting
      When n = 1, since g_j^k = (1/n) Σ_{i=1}^n g_i^k in Prox2-SAGA, the iteration becomes
      y^{k+1} = y^k − x^k + prox_{γf}(2x^k − y^k),
      x^{k+1} = prox_{γh}(y^{k+1}).

  16. Theories

  17. Effectiveness
      Proposition
      Suppose that (y^∞, {g_i^∞}_{i=1,...,n}) is a fixed point of the Prox2-SAGA iteration.
      Then x^∞ = prox_{γh}(y^∞) is a minimizer of f + h.
      Proof.
      Since y^∞ = y^∞ − x^∞ + prox_{γ f_i}(z_i^∞ + x^∞ − y^∞), we have
      x^∞ = prox_{γ f_i}(z_i^∞ + x^∞ − y^∞), which implies
      (z_i^∞ − y^∞)/γ ∈ ∂f_i(x^∞),  i = 1, ..., n.   (1)
      Meanwhile, because x^∞ = prox_{γh}(y^∞), we have
      (y^∞ − x^∞)/γ ∈ ∂h(x^∞).   (2)
      Observing that
      (1/n) Σ_{i=1}^n (z_i^∞ − y^∞) + (y^∞ − x^∞) = (1/n) Σ_{i=1}^n z_i^∞ − x^∞ = 0,
      from (1) and (2) we have 0 ∈ ∂f(x^∞) + ∂h(x^∞).

  18. Convergence Rate
      Non-strongly convex case
      Suppose that each f_i is convex and L-smooth, and h is convex. Denote
      ḡ_j^k = (1/k) Σ_{t=1}^k g_j^t. Then for Prox2-SAGA with stepsize γ ≤ 1/L, at any
      time k > 0 it holds that
      E ‖ḡ_j^k − g_j^*‖² ≤ (1/k) [ Σ_{i=1}^n ‖g_i^0 − g_i^*‖² + ‖(y^0 − y^*)/γ‖² ].
      Strongly convex case
      Suppose that each f_i is µ-strongly convex and L-smooth, and h is convex. Then for
      Prox2-SAGA with stepsize γ = min{ 1/(µn), (√(9L² + 3µL) − 3L)/(2µL) }, for any
      time k > 0 it holds that
      E ‖x^k − x^*‖² ≤ (1 − µγ/(2µγ + 2))^k · (2 − µγ)/(2 − nµγ)
                        · [ Σ_{i=1}^n ‖γ(g_i^0 − g_i^*)‖² + ‖y^0 − y^*‖² ].

  19. Remarks
      - With the stepsize
        γ = min{ 1/(µn), (√(9L² + 3µL) − 3L)/(2µL) },
        O((n + L/µ) log(1/ε)) steps are required to achieve E ‖x^k − x^*‖² ≤ ε.
      - When f_i is ill-conditioned, a larger stepsize
        γ = min{ 1/(µn), (6L + √(36L² − 6(n − 2)µL)) / (2(n − 2)µL) }
        is possible, under which the required number of steps is O((n + √(nL/µ)) log(1/ε)).
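To make the two stepsize regimes concrete, a small numeric check using arbitrary values n = 1000, L = 1, µ = 10⁻⁴ (so L/µ ≫ n) and the stepsize expressions exactly as written above; it also compares the two iteration-count factors n + L/µ and n + √(nL/µ).

```python
import math

n, L, mu = 1000, 1.0, 1e-4        # illustrative ill-conditioned setting: L/mu >> n

gamma_small = min(1 / (mu * n), (math.sqrt(9 * L**2 + 3 * mu * L) - 3 * L) / (2 * mu * L))
gamma_large = min(1 / (mu * n),
                  (6 * L + math.sqrt(36 * L**2 - 6 * (n - 2) * mu * L)) / (2 * (n - 2) * mu * L))

print(gamma_small, gamma_large)                   # the second stepsize is much larger (~0.25 vs 10)
print(n + L / mu, n + math.sqrt(n * L / mu))      # iteration factors: ~11000 vs ~4162
```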

  20. Experiments

  21. Experiments
      Figure 2: Comparison of several algorithms with ℓ1, ℓ2-Logistic Regression.

  22. Experiments
      Figure 3: Comparison of several algorithms with ℓ1, ℓ2-Logistic Regression.

  23. Experiments
      Figure 4: Comparison of several algorithms with sparse SVMs (objective gap vs. epoch for
      Prox2-SAGA, Prox-SDCA, Prox-SAGA, and Prox-SGD on svmguide3, rcv1, covtype, and ijcnn1).

  24. Conclusions

  25. • Prox2-SAGA combines Point-SAGA and DR splitting.
      • Point-SAGA contributes the faster convergence rate of Prox2-SAGA.
      • DR splitting contributes the effectiveness: fixed points of the iteration are minimizers of f + h.

  26. Q & A
