Estimate Sequences for Variance-Reduced Stochastic Composite Optimization

Estimate Sequences for Variance-Reduced Stochastic Composite Optimization - PowerPoint PPT Presentation



  1. Estimate Sequences for Variance-Reduced Stochastic Composite Optimization
     Andrei Kulunchakov (andrei.kulunchakov@inria.fr), Julien Mairal (julien.mairal@inria.fr)
     International Conference on Machine Learning, 2019
     Poster event-4062 (Jun 12th, Pacific Ballroom 204)

  2. Problem statement
     Assumptions. We solve a stochastic composite optimization problem
         F(x) = f(x) + \psi(x), \quad \text{where } f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x) \text{ with } f_i(x) = \mathbb{E}_{\xi}\big[\tilde{f}_i(x, \xi)\big],
     where \psi(x) is a convex penalty and each f_i is L-smooth and \mu-strongly convex.
     Variance in gradient estimates. Stochastic realizations of the gradients are available for each i:
         \tilde{\nabla} f_i(x) = \nabla f_i(x) + \xi_i \quad \text{with } \mathbb{E}[\xi_i] = 0 \text{ and } \mathrm{Var}[\xi_i] \le \sigma^2.
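As an illustration of this setup, here is a minimal Python sketch that assumes a least-squares loss for each f_i and an l1 penalty for psi (both concrete choices are assumptions for the example, not taken from the poster); it exposes the noisy gradient oracle and the proximal operator of the penalty.

    import numpy as np

    def grad_fi(x, a_i, b_i, mu=0.0):
        # Exact gradient of f_i(x) = 0.5 * (a_i^T x - b_i)^2 + 0.5 * mu * ||x||^2.
        return a_i * (a_i @ x - b_i) + mu * x

    def noisy_grad_fi(x, a_i, b_i, sigma, mu=0.0, rng=None):
        # Stochastic oracle: exact gradient plus zero-mean noise with E||xi||^2 = sigma^2.
        rng = rng or np.random.default_rng()
        xi = rng.normal(0.0, sigma / np.sqrt(x.size), size=x.shape)
        return grad_fi(x, a_i, b_i, mu) + xi

    def prox_l1(v, step, lam):
        # Proximal operator of psi(x) = lam * ||x||_1 (soft-thresholding).
        return np.sign(v) * np.maximum(np.abs(v) - step * lam, 0.0)

With these pieces, f and psi match the composite objective above, and sigma controls the noise level of the gradient oracle.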

  3. Main contribution (I)
     Optimal incremental algorithm robust to noise
     An optimal incremental algorithm with complexity
         O\left(\left(n + \sqrt{\frac{nL}{\mu}}\right) \log\left(\frac{F(x_0) - F^\star}{\varepsilon}\right)\right) + O\left(\frac{\sigma^2}{\mu\varepsilon}\right),
     based on the SVRG gradient estimator with random sampling.
     Algorithm. Briefly, the algorithm is an incremental hybrid of the heavy-ball method with a randomly updated SVRG anchor point and two auxiliary sequences that control the extrapolation.
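To make the gradient estimator concrete, the sketch below runs a plain (non-accelerated) proximal SVRG loop with a randomly updated anchor point on a least-squares problem with an l1 penalty. This is an illustration under those assumptions only: it omits the heavy-ball term and the two auxiliary extrapolation sequences of the actual accelerated algorithm.

    import numpy as np

    def rand_svrg_prox(x0, A, b, lam, step, n_iters, anchor_prob=None, rng=None):
        # Non-accelerated sketch: SVRG estimator with a random anchor update, plus a prox step.
        rng = rng or np.random.default_rng()
        n = A.shape[0]
        anchor_prob = anchor_prob if anchor_prob is not None else 1.0 / n  # ~one refresh per pass
        soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)    # prox of t * ||.||_1
        x, x_anchor = x0.copy(), x0.copy()
        g_anchor = A.T @ (A @ x_anchor - b) / n        # full gradient at the anchor point
        for _ in range(n_iters):
            i = rng.integers(n)
            # Variance-reduced estimate: grad f_i(x) - grad f_i(anchor) + full gradient(anchor).
            g = A[i] * (A[i] @ x - b[i]) - A[i] * (A[i] @ x_anchor - b[i]) + g_anchor
            x = soft(x - step * g, step * lam)          # proximal gradient step
            if rng.random() < anchor_prob:              # anchor updated at random, not per epoch
                x_anchor = x.copy()
                g_anchor = A.T @ (A @ x_anchor - b) / n
        return x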

  4. Main contribution (II)
     Novelty
     When \sigma^2 = 0, we recover the same complexity as Katyusha [Allen-Zhu, 2017].
     • Novelty: an accelerated incremental algorithm robust to \sigma^2 > 0, with the optimal term \sigma^2/(\mu\varepsilon).
     Other contributions
     • Generic proofs for incremental methods (SVRG, SAGA, MISO, SDCA) showing their robustness to noise:
         O\left(\left(n + \frac{L}{\mu}\right) \log\left(\frac{F(x_0) - F^\star}{\varepsilon}\right)\right) + O\left(\frac{\sigma^2}{\mu\varepsilon}\right).
     • When \mu = 0, we recover optimal rates for a fixed horizon and known \sigma^2.
     • Support for non-uniform sampling.

  5. Side contributions
     Adaptivity to the strong convexity parameter \mu
     When \sigma = 0, we show adaptivity to \mu for all of the above-mentioned non-accelerated methods. This property is new for SVRG.
     Accelerated SGD
     A version of robust accelerated SGD with complexity similar to [Ghadimi and Lan, 2012, 2013]:
         O\left(\sqrt{\frac{L}{\mu}} \log\left(\frac{F(x_0) - F^\star}{\varepsilon}\right)\right) + O\left(\frac{\sigma^2 + \sigma_n^2}{\mu\varepsilon}\right),
     where \sigma_n^2 is due to sampling the data points.
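For concreteness, the following is a minimal sketch of an accelerated proximal SGD iteration in the spirit of the method above. The constant momentum beta below is the standard choice for mu-strongly convex, L-smooth problems and is an assumption, not necessarily the paper's exact schedule; grad_oracle and prox are assumed user-supplied callables.

    import numpy as np

    def acc_prox_sgd(x0, grad_oracle, prox, step, mu, L, n_iters):
        # grad_oracle(y): noisy gradient estimate at y; prox(v, step): prox operator of psi.
        q = mu / L
        beta = (1.0 - np.sqrt(q)) / (1.0 + np.sqrt(q))    # momentum for the strongly convex case
        x, x_prev = x0.copy(), x0.copy()
        for _ in range(n_iters):
            y = x + beta * (x - x_prev)                   # extrapolation (look-ahead) point
            x_prev = x
            x = prox(y - step * grad_oracle(y), step)     # noisy gradient step + proximal map
        return x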

  6. Experiments
     Three datasets are used in the experiments:
     — Pascal Large Scale Learning Challenge (n = 25 \cdot 10^4)
     — Light gene expression data for breast cancer (n = 295)
     — CIFAR-10 (images represented by features from a network), with n = 5 \cdot 10^4
     Examples with zero noise (\sigma = 0) and in the stochastic case (\sigma > 0).
     [Figure: convergence curves, \log(F/F^\star - 1) versus effective passes over the data, on CIFAR-10 and the Pascal Challenge, comparing rand-SVRG (steps 1/12L and 1/3L), acc-SVRG (step 1/3L), SGD (step 1/L), and the decreasing-step variants rand-SVRG-d, acc-SVRG-d, SGD-d, acc-SGD-d, and acc-mb-SGD-d.]
