SLIDE 1

Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure

Alberto Bietti, Julien Mairal

Inria Grenoble (Thoth)

March 21, 2017

SLIDE 2

Stochastic optimization in machine learning

Stochastic approximation: $\min_x \mathbb{E}_{\zeta \sim \mathcal{D}}[f(x, \zeta)]$

◮ Infinite datasets (expected risk, $\mathcal{D}$: data distribution), or "single pass"
◮ SGD, stochastic mirror descent, FOBOS, RDA
◮ $O(1/\epsilon)$ complexity

Incremental methods with variance reduction: $\min_x \frac{1}{n} \sum_{i=1}^n f_i(x)$

◮ Finite datasets (empirical risk): $f_i(x) = \ell(y_i, x^\top \xi_i) + (\mu/2)\|x\|^2$
◮ SAG, SDCA, SVRG, SAGA, MISO, etc.
◮ $O(\log 1/\epsilon)$ complexity

SLIDE 3

Data perturbations in machine learning

Perturbations of data are useful for regularization, stable feature selection, and privacy-aware learning.

We focus on data augmentation of a finite training set, for regularization purposes (better performance on test data), e.g.:

◮ Image data augmentation: add random transformations of each image in the training set (crop, scale, rotate, brightness, contrast, etc.)
◮ Dropout: set coordinates of feature vectors to 0 with probability $\delta$.

Figure: Data augmentation on an MNIST digit (left); Dropout on text (right).
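To make the two perturbation models concrete, here is a minimal NumPy sketch; the function names, the dropout behavior (zeroing coordinates without rescaling, exactly as stated above), and the 28×28 image size are illustrative assumptions, not details taken from the slides' experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_perturbation(xi, delta, rng):
    """Zero out each coordinate of a feature vector with probability delta."""
    mask = rng.random(xi.shape) >= delta
    return xi * mask

def random_crop(image, crop_size, rng):
    """Return a uniformly random square crop of a 2-D image."""
    h, w = image.shape
    top = rng.integers(0, h - crop_size + 1)
    left = rng.integers(0, w - crop_size + 1)
    return image[top:top + crop_size, left:left + crop_size]

xi = rng.standard_normal(100)                 # a feature vector
xi_tilde = dropout_perturbation(xi, delta=0.3, rng=rng)

image = rng.random((28, 28))                  # a toy MNIST-sized image
patch = random_crop(image, crop_size=24, rng=rng)
```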

SLIDE 4

Optimization objective with perturbations

$$\min_{x \in \mathbb{R}^p} \; F(x) = \frac{1}{n} \sum_{i=1}^n \mathbb{E}_{\rho \sim \Gamma}[\tilde f_i(x, \rho)] + h(x)$$

◮ $f_i(x) = \mathbb{E}_{\rho \sim \Gamma}[\tilde f_i(x, \rho)]$; $\rho$: perturbation
◮ $\tilde f_i(\cdot, \rho)$ is convex with $L$-Lipschitz gradients
◮ $F$ is $\mu$-strongly convex
◮ $h$: convex, possibly non-smooth penalty, e.g. the $\ell_1$ norm

SLIDE 5

Can we do better than SGD?

$$\min_{x \in \mathbb{R}^p} \; f(x) = \frac{1}{n} \sum_{i=1}^n \mathbb{E}_{\rho \sim \Gamma}[\tilde f_i(x, \rho)]$$

SGD is a natural choice:

◮ Sample index $i_t$, perturbation $\rho_t \sim \Gamma$
◮ Update: $x_t = x_{t-1} - \eta_t \nabla \tilde f_{i_t}(x_{t-1}, \rho_t)$
◮ $O(\sigma_{\mathrm{tot}}^2/\mu t)$ convergence, with $\sigma_{\mathrm{tot}}^2 := \mathbb{E}_{i,\rho}[\|\nabla \tilde f_i(x^*, \rho)\|^2]$

Key observation: the variance coming from the perturbations alone is small compared to the variance across all examples.

Contribution: improve the convergence of SGD by exploiting the finite-sum structure through variance reduction. This yields $O(\sigma^2/\mu t)$ convergence with

$$\mathbb{E}_{\rho}\left[\|\nabla \tilde f_i(x^*, \rho) - \nabla f_i(x^*)\|^2\right] \leq \sigma^2 \ll \sigma_{\mathrm{tot}}^2.$$
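For later contrast with S-MISO, a minimal sketch of this perturbed-SGD baseline; the gradient oracle `grad_tilde(i, x, rho)` for $\nabla \tilde f_i(x, \rho)$, the sampler `sample_rho`, and the callable step-size `eta(t)` are illustrative assumptions.

```python
import numpy as np

def sgd_perturbed(grad_tilde, sample_rho, x0, n, eta, T, rng):
    """Perturbed SGD baseline: sample an example index and a perturbation,
    then take a step along the stochastic gradient."""
    x = x0.copy()
    for t in range(1, T + 1):
        i = rng.integers(n)            # sample index i_t uniformly
        rho = sample_rho(rng)          # sample perturbation rho_t ~ Gamma
        x = x - eta(t) * grad_tilde(i, x, rho)
    return x
```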

SLIDE 6

Background: MISO algorithm (Mairal, 2015)

Finite-sum problem: $\min_x f(x) = \frac{1}{n} \sum_{i=1}^n f_i(x)$

Maintains a quadratic lower-bound model $d_i^t(x) = \frac{\mu}{2}\|x - z_i^t\|^2 + c_i^t$ on each $f_i$.

$d_i^t$ is updated using a strong convexity lower bound on $f_i$:

$$f_i(x) \geq f_i(x_{t-1}) + \langle \nabla f_i(x_{t-1}), x - x_{t-1} \rangle + \frac{\mu}{2}\|x - x_{t-1}\|^2 =: l_i^t(x)$$

Two steps:

◮ Select $i_t$ and update:
$$d_i^t(x) = \begin{cases} (1 - \alpha)\, d_i^{t-1}(x) + \alpha\, l_i^t(x), & \text{if } i = i_t \\ d_i^{t-1}(x), & \text{otherwise} \end{cases}$$
◮ Minimize the model: $x_t = \arg\min_x \left\{ D_t(x) = \frac{1}{n} \sum_{i=1}^n d_i^t(x) \right\}$

SLIDE 7

MISO algorithm (Mairal, 2015)

Final algorithm: at iteration $t$, choose index $i_t$ at random and update:

$$z_i^t = \begin{cases} (1 - \alpha)\, z_i^{t-1} + \alpha \left( x_{t-1} - \frac{1}{\mu} \nabla f_i(x_{t-1}) \right), & \text{if } i = i_t \\ z_i^{t-1}, & \text{otherwise} \end{cases}$$

$$x_t = \frac{1}{n} \sum_{i=1}^n z_i^t$$

Complexity $O((n + L/\mu) \log 1/\epsilon)$, typical of variance reduction.

Similar to SDCA without duality (Shalev-Shwartz, 2016).
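A minimal NumPy sketch of these two updates, assuming a gradient oracle `grad_f(i, x)` for $\nabla f_i(x)$ (a hypothetical helper, not part of the slides). Since only one $z_i$ changes per iteration, the average $x_t$ can be maintained incrementally.

```python
import numpy as np

def miso(grad_f, x0, n, mu, alpha, T, rng):
    """MISO: keep one anchor z_i per example; each iteration updates a
    single z_i and maintains x_t as the average of the z_i."""
    z = np.tile(x0, (n, 1))            # z_i^0 = x_0 for all i
    x = z.mean(axis=0)
    for t in range(1, T + 1):
        i = rng.integers(n)            # choose i_t at random
        z_new = (1 - alpha) * z[i] + alpha * (x - grad_f(i, x) / mu)
        x = x + (z_new - z[i]) / n     # update the average incrementally
        z[i] = z_new
    return x
```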

SLIDE 8

Stochastic MISO

$$\min_{x \in \mathbb{R}^p} \; f(x) = \frac{1}{n} \sum_{i=1}^n \mathbb{E}_{\rho \sim \Gamma}[\tilde f_i(x, \rho)]$$

With perturbations, we cannot compute exact strong convexity lower bounds on $f_i = \mathbb{E}_{\rho}[\tilde f_i(\cdot, \rho)]$.

Instead, use approximate lower bounds built from stochastic gradient estimates $\nabla \tilde f_{i_t}(x_{t-1}, \rho_t)$.

Allow decreasing step-sizes $\alpha_t$ in order to guarantee convergence, as in stochastic approximation.

SLIDE 9

Stochastic MISO: algorithm

Input: step-size sequence $(\alpha_t)_{t \geq 1}$
for $t = 1, \ldots$ do
    Sample $i_t$ uniformly at random, $\rho_t \sim \Gamma$, and update:
    $$z_i^t = \begin{cases} (1 - \alpha_t)\, z_i^{t-1} + \alpha_t \left( x_{t-1} - \frac{1}{\mu} \nabla \tilde f_{i_t}(x_{t-1}, \rho_t) \right), & \text{if } i = i_t \\ z_i^{t-1}, & \text{otherwise} \end{cases}$$
    $$x_t = \frac{1}{n} \sum_{i=1}^n z_i^t = x_{t-1} + \frac{1}{n} \left( z_{i_t}^t - z_{i_t}^{t-1} \right)$$
end for

Note: reduces to MISO for $\sigma^2 = 0$, $\alpha_t = \alpha$, and to SGD for $n = 1$.
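A sketch of the algorithm above, mirroring the MISO sketch but with perturbed gradient estimates and a per-iteration step-size; `grad_tilde` and `sample_rho` are the same illustrative oracles as in the SGD sketch.

```python
import numpy as np

def s_miso(grad_tilde, sample_rho, x0, n, mu, alphas, rng):
    """Stochastic MISO: the MISO update applied with perturbed gradient
    estimates and a (possibly decreasing) step-size sequence alphas."""
    z = np.tile(x0, (n, 1))
    x = z.mean(axis=0)
    for t, alpha in enumerate(alphas, start=1):
        i = rng.integers(n)                    # sample i_t uniformly
        rho = sample_rho(rng)                  # sample rho_t ~ Gamma
        g = grad_tilde(i, x, rho)              # stochastic estimate of grad f_i(x)
        z_new = (1 - alpha) * z[i] + alpha * (x - g / mu)
        x = x + (z_new - z[i]) / n             # x_t = x_{t-1} + (z_i^t - z_i^{t-1}) / n
        z[i] = z_new
    return x
```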

SLIDE 10

Stochastic MISO: convergence analysis

Define the Lyapunov function (with $z_i^* := x^* - \frac{1}{\mu} \nabla f_i(x^*)$):

$$C_t = \frac{1}{2}\|x_t - x^*\|^2 + \frac{\alpha_t}{n^2} \sum_{i=1}^n \|z_i^t - z_i^*\|^2.$$

Theorem (Recursion on $C_t$, smooth case)

If $(\alpha_t)_{t \geq 1}$ are positive, non-increasing step-sizes with
$$\alpha_1 \leq \min\left\{ \frac{1}{2}, \frac{n}{2(2\kappa - 1)} \right\}, \quad \kappa = L/\mu,$$
then $C_t$ obeys the recursion
$$\mathbb{E}[C_t] \leq \left( 1 - \frac{\alpha_t}{n} \right) \mathbb{E}[C_{t-1}] + 2 \left( \frac{\alpha_t}{n} \right)^2 \frac{\sigma^2}{\mu^2}.$$

Note: a similar recursion holds for SGD, with $\sigma_{\mathrm{tot}}^2$ in place of $\sigma^2$.

SLIDE 11

Stochastic MISO: convergence with decreasing step-sizes

Similar to SGD (Bottou et al., 2016).

Theorem (Convergence of Lyapunov function)

Let the step-size sequence $(\alpha_t)_{t \geq 1}$ be defined by $\alpha_t = \frac{2n}{\gamma + t}$ for $\gamma \geq 0$ such that
$$\alpha_1 \leq \min\left\{ \frac{1}{2}, \frac{n}{2(2\kappa - 1)} \right\}.$$
Then, for $t \geq 0$,
$$\mathbb{E}[C_t] \leq \frac{\nu}{\gamma + t + 1}, \quad \text{where } \nu := \max\left\{ \frac{8\sigma^2}{\mu^2}, (\gamma + 1)\, C_0 \right\}.$$

Q: How can we get rid of the dependence on $C_0$?

SLIDE 12

Practical step-size strategy

Following Bottou et al. (2016), we keep the step-size constant for a few epochs in order to quickly "forget" the initial condition $C_0$.

◮ With a constant step-size $\bar\alpha$, we converge linearly to within a constant error level $\bar C = \frac{2 \bar\alpha \sigma^2}{n \mu^2}$ (in practice: a few epochs).
◮ We then start decreasing the step-sizes, with $\gamma$ large enough that $\alpha_1 = 2n/(\gamma + 1) \approx \bar\alpha$: no more $C_0$ in the convergence rate! (A sketch of this schedule follows below.)

Overall, the complexity for reaching $\mathbb{E}[\|x_t - x^*\|^2] \leq \epsilon$ is
$$O\left( (n + L/\mu) \log \frac{C_0}{\bar C} \right) + O\left( \frac{\sigma^2}{\mu^2 \epsilon} \right).$$

For $\mathbb{E}[f(x_t) - f(x^*)] \leq \epsilon$, the second term becomes $O(L\sigma^2/\mu^2\epsilon)$ via smoothness. Iterate averaging brings this down to $O(\sigma^2/\mu\epsilon)$.
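A small sketch of the two-phase schedule described above, usable as the `alphas` sequence in the S-MISO sketch; the constant-phase length `t_const` is an illustrative parameter (a few epochs in practice).

```python
def step_sizes(n, alpha_bar, t_const, T):
    """Two-phase schedule: keep alpha_bar constant for t_const iterations
    to forget C_0, then decrease as alpha_t = 2n / (gamma + t), with gamma
    chosen so the decreasing phase starts near alpha_bar."""
    gamma = 2 * n / alpha_bar - 1      # makes 2n / (gamma + 1) == alpha_bar
    schedule = []
    for t in range(1, T + 1):
        if t <= t_const:
            schedule.append(alpha_bar)
        else:
            schedule.append(2 * n / (gamma + (t - t_const)))
    return schedule
```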

SLIDE 13

Extensions

Composite objectives ($h \neq 0$, e.g., an $\ell_1$ penalty)

◮ MISO extends to this case by adding $h$ to the lower-bound model (Lin et al., 2015)
◮ Different Lyapunov function ($\|x_t - x^*\|^2$ is replaced by an upper bound)
◮ Similar to Regularized Dual Averaging when $n = 1$

Non-uniform sampling

◮ The smoothness constants $L_i$ of the $\tilde f_i$ can vary a lot in heterogeneous datasets
◮ Sampling "difficult" examples more often can improve the dependence on $L$ from $L_{\max}$ to $L_{\mathrm{average}}$

The same convergence results apply (same Lyapunov recursion, decreasing step-sizes, iterate averaging).

SLIDE 14

Experiments: dropout

Dropout rate δ controls the variance of the perturbations.

[Figure: $f - f^*$ (log scale) vs. epochs, gene dropout with $\delta = 0.30$, $0.10$, $0.01$; methods: S-MISO, SGD, and N-SAGA, each with $\eta = 0.1$ and $\eta = 1.0$.]

SLIDE 15

Experiments: image data augmentation

Random image crops and scalings, encoded with an unsupervised deep convolutional network. Different conditioning regimes, controlled by $\mu$.

[Figure: $f - f^*$ (log scale) vs. epochs, STL-10 ckn with $\mu = 10^{-3}$, $10^{-4}$, $10^{-5}$; methods: S-MISO and SGD ($\eta = 0.1$, $1.0$) and N-SAGA ($\eta = 0.1$).]

SLIDE 16

Conclusion

Exploit underlying finite-sum structure in stochastic optimization problems using variance reduction.

Bring the SGD variance term down to the variance induced by the perturbations only.

Useful for data augmentation (e.g. random image transformations, Dropout).

Future work: application to stable feature selection?

C++/Eigen library with a Cython extension available: http://github.com/albietz/stochs

SLIDE 17

References

L. Bottou, F. E. Curtis, and J. Nocedal. Optimization Methods for Large-Scale Machine Learning. arXiv:1606.04838, 2016.

S. Lacoste-Julien, M. Schmidt, and F. Bach. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. arXiv:1212.2002, 2012.

H. Lin, J. Mairal, and Z. Harchaoui. A Universal Catalyst for First-Order Optimization. In Advances in Neural Information Processing Systems (NIPS), 2015.

J. Mairal. Incremental Majorization-Minimization Optimization with Application to Large-Scale Machine Learning. SIAM Journal on Optimization, 25(2):829–855, 2015.

S. Shalev-Shwartz. SDCA without Duality, Regularization, and Individual Convexity. In International Conference on Machine Learning (ICML), 2016.

SLIDE 18

Acceleration by iterate averaging

For function values, averaging brings the complexity term $O(L\sigma^2/\mu^2\epsilon)$ down to $O(\sigma^2/\mu\epsilon)$.

Similar technique to Lacoste-Julien et al. (2012), but allows small initial step-sizes.

Theorem (Convergence under iterate averaging)

Let the step-size sequence $(\alpha_t)_{t \geq 1}$ be defined by $\alpha_t = \frac{2n}{\gamma + t}$ for $\gamma \geq 1$ such that
$$\alpha_1 \leq \min\left\{ \frac{1}{2}, \frac{n}{4(2\kappa - 1)} \right\}.$$
Then
$$\mathbb{E}[f(\bar x_T) - f(x^*)] \leq \frac{2\mu\gamma(\gamma - 1)\, C_0}{T(2\gamma + T - 1)} + \frac{16\sigma^2}{\mu(2\gamma + T - 1)},$$
where $\bar x_T := \frac{2}{T(2\gamma + T - 1)} \sum_{t=0}^{T-1} (\gamma + t)\, x_t$.
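A short sketch computing the weighted average $\bar x_T$ from the theorem, assuming the iterates $x_0, \ldots, x_{T-1}$ were collected in a list; since $\sum_{t=0}^{T-1} (\gamma + t) = T(2\gamma + T - 1)/2$, the weights below sum to one after normalization.

```python
import numpy as np

def averaged_iterate(iterates, gamma):
    """Weighted average: 2 / (T (2*gamma + T - 1)) * sum_t (gamma + t) x_t."""
    T = len(iterates)
    weights = gamma + np.arange(T)             # gamma + t, for t = 0, ..., T-1
    weighted_sum = (weights[:, None] * np.asarray(iterates)).sum(axis=0)
    return 2.0 * weighted_sum / (T * (2 * gamma + T - 1))
```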

SLIDE 19

Stochastic MISO (composite, non-uniform sampling)

Input: step-sizes $(\alpha_t)_{t \geq 1}$, sampling distribution $q$
for $t = 1, \ldots$ do
    Sample an index $i_t \sim q$, a perturbation $\rho_t \sim \Gamma$, and update:
    $$z_i^t = \begin{cases} \left(1 - \frac{\alpha_t}{q_i n}\right) z_i^{t-1} + \frac{\alpha_t}{q_i n} \left( x_{t-1} - \frac{1}{\mu} \nabla \tilde f_{i_t}(x_{t-1}, \rho_t) \right), & \text{if } i = i_t \\ z_i^{t-1}, & \text{otherwise} \end{cases}$$
    $$\bar z_t = \frac{1}{n} \sum_{i=1}^n z_i^t = \bar z_{t-1} + \frac{1}{n} \left( z_{i_t}^t - z_{i_t}^{t-1} \right)$$
    $$x_t = \mathrm{prox}_{h/\mu}(\bar z_t)$$
end for

Note: similar to RDA for $n = 1$ when $\alpha_t = 1/t$.
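A sketch of this composite variant for the concrete choice $h = \lambda \|\cdot\|_1$, whose proximal operator is soft-thresholding at level $\lambda/\mu$; `grad_tilde`, `sample_rho`, the penalty weight `lam`, and the initialization $x_0 = \mathrm{prox}_{h/\mu}(\bar z_0)$ are illustrative assumptions.

```python
import numpy as np

def prox_l1(v, thresh):
    """Proximal operator of thresh * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def s_miso_composite(grad_tilde, sample_rho, x0, n, mu, lam, q, alphas, rng):
    """Composite S-MISO with non-uniform sampling q and h = lam * ||.||_1."""
    z = np.tile(x0, (n, 1))
    z_bar = z.mean(axis=0)
    x = prox_l1(z_bar, lam / mu)
    for alpha in alphas:
        i = rng.choice(n, p=q)                 # sample i_t ~ q
        rho = sample_rho(rng)
        step = alpha / (q[i] * n)              # effective step alpha_t / (q_i n)
        z_new = (1 - step) * z[i] + step * (x - grad_tilde(i, x, rho) / mu)
        z_bar = z_bar + (z_new - z[i]) / n     # incremental average of the z_i
        z[i] = z_new
        x = prox_l1(z_bar, lam / mu)           # x_t = prox_{h/mu}(z_bar_t)
    return x
```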

SLIDE 20

General S-MISO: analysis

Lyapunov function:

$$C_t^q = F(x^*) - D_t(x_t) + \frac{\mu \alpha_t}{n^2} \sum_{i=1}^n \frac{1}{q_i n} \|z_i^t - z_i^*\|^2.$$

Bound on the iterates: $\frac{\mu}{2}\, \mathbb{E}[\|x_t - x^*\|^2] \leq \mathbb{E}[F(x^*) - D_t(x_t)]$.

Recursion:

$$\mathbb{E}[C_t^q] \leq \left( 1 - \frac{\alpha_t}{n} \right) \mathbb{E}[C_{t-1}^q] + 2 \left( \frac{\alpha_t}{n} \right)^2 \frac{\sigma_q^2}{\mu}, \quad \text{with } \sigma_q^2 = \frac{1}{n} \sum_i \frac{\sigma_i^2}{q_i n}.$$