Optimization in Alibaba: Beyond Convexity


  1. Optimization in Alibaba: Beyond Convexity. System for AI, AI for System. Jian Tan. Computer Systems | Machine Intelligence Technology | Optimization, Machine Learning, Operations Research

  2. Agenda
  • Theories on non-convex optimization:
    Part 1. Parallel restarted SGD: it finds first-order stationary points (why model averaging works for deep learning)
    Part 2. Escaping saddle points in non-convex optimization (first-order stochastic algorithms to find second-order stationary points) [figure: a trajectory getting stuck at, then escaping, a saddle point]
  • System optimization: BPTune for an intelligent database (from OR/ML perspectives). A real complex system deployment; combines pairwise DNN, active learning, heavy-tailed randomness, ...
    Part 3. Stochastic (large-deviation) analysis for LRU caching

  3. Learning as Optimization
  • Stochastic (non-convex) optimization: $\min_{x} f(x) = \mathbb{E}_{\xi}[\ell(x;\xi)]$, where $x$ is the model, $\ell$ the loss function, and $\xi$ the training samples (a minimal sketch follows below)
  • $\xi$: random training sample
  • $f(x)$: has Lipschitz-continuous gradient
  [figure: a convex vs. a non-convex loss landscape]
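A minimal sketch (not from the talk) of the setup above: the objective $f(x) = \mathbb{E}_{\xi}[\ell(x;\xi)]$ is only accessed through minibatch estimates of the loss and its gradient. The synthetic data and the least-squares loss are illustrative placeholders; the talk's setting is a general non-convex loss.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(1000, 5))        # synthetic "training samples" xi = (a_i, b_i)
    b = rng.normal(size=1000)

    def minibatch_loss_and_grad(x, idx):
        """Unbiased minibatch estimate of f(x) = E[l(x; xi)] and of its gradient."""
        Ai, bi = A[idx], b[idx]
        resid = Ai @ x - bi
        return 0.5 * np.mean(resid ** 2), Ai.T @ resid / len(idx)

    x = np.zeros(5)
    idx = rng.choice(len(b), size=32, replace=False)   # a random minibatch of samples
    f_hat, g_hat = minibatch_loss_and_grad(x, idx)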

  4. Non-Convex Optimization is Challenging
  • Many local minima and saddle points (all first-order stationary). For stationary points, $\nabla f(x) = 0$:
    • $\nabla^2 f(x) \succ 0$: local minimum
    • $\nabla^2 f(x) \prec 0$: local maximum
    • $\nabla^2 f(x)$ has both +/- eigenvalues: saddle point
    • $\nabla^2 f(x)$ has 0/+ eigenvalues: degenerate case, could be either a local minimum or a saddle point
  [figure: a surface with local maxima, saddle points, local minima, and the global minimum]
  • In general, finding the global minimum of a non-convex optimization problem is NP-hard.

  5. Instead ...
  • For some applications, e.g., matrix completion, tensor decomposition, dictionary learning, and certain neural networks:
    • Good news (local minima): either all local minima are global minima, or all local minima are close to global minima.
    • Bad news (saddle points): poor function value compared with global/local minima; possibly many saddle points (even exponentially many).

  6. Finding First-Order Stationary Points (FSP)
  • Stochastic Gradient Descent (SGD): $x_{t+1} = x_t - \gamma \nabla f(x_t;\xi_t)$ (a minimal sketch follows below)
  • Complexity of SGD (Ghadimi & Lan, 2013, 2016; Ghadimi et al., 2016; Yang et al., 2016):
    • $\epsilon$-FSP, $\mathbb{E}[\|\nabla f(x)\|] \le \epsilon$: iteration complexity $O(1/\epsilon^4)$
  • Improved iteration complexity based on variance reduction:
    • SCSG (Lei et al., 2017): $O(1/\epsilon^{10/3})$
  • Workhorse of deep learning
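A minimal SGD sketch for the update above; the toy non-convex function, step size, noise model, and tolerance are illustrative assumptions, not from the talk. It runs until the true gradient norm certifies an approximate first-order stationary point.

    import numpy as np

    def grad_f(z):
        # f(x, y) = (x^2 - 1)^2 + y^2: non-convex, with a saddle at the origin
        return np.array([4.0 * z[0] * (z[0] ** 2 - 1.0), 2.0 * z[1]])

    rng = np.random.default_rng(0)
    x, gamma, eps = np.array([2.0, 2.0]), 0.01, 1e-2

    for t in range(100_000):
        g = grad_f(x) + 0.01 * rng.normal(size=2)   # stochastic gradient = true gradient + noise
        x = x - gamma * g
        if np.linalg.norm(grad_f(x)) <= eps:        # eps-FSP: gradient norm at most eps
            break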

  7. Part 1: Parallel Restarted SGD with Faster Convergence and Less Communication: Demystifying Why Model Averaging Works for Deep Learning. Hao Yu, Sen Yang, Shenghuo Zhu (AAAI 2019)
  • One server is not enough:
    • too many parameters, e.g., deep neural networks
    • huge number of training samples
    • training time is too long
  • Parallel on N servers:
    • With N machines, can we be N times faster? If yes, we have a linear speed-up (w.r.t. the number of workers).

  8. Classical Parallel Mini-Batch SGD
  • The classical parallel mini-batch SGD (PSGD) achieves $O(1/\sqrt{NT})$ convergence with N workers [Dekel et al. 2012], so PSGD can attain a linear speed-up:
    $x_{t+1} = x_t - \frac{\gamma}{N}\sum_{i=1}^{N} \nabla f(x_t;\xi_i)$
  [figure: a parameter server broadcasting $x_{t+1}$ to workers 1..N, each returning $\nabla f(x_t;\xi_i)$]
  • Each iteration aggregates gradients from every worker: the communication cost is too high!
  • Can we reduce the communication cost? Yes: model averaging. (A minimal PSGD simulation sketch follows below.)
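A minimal simulation sketch of the PSGD update above; the gradient oracle, noise model, and all names here are illustrative, not the talk's code. Every iteration the server averages one stochastic gradient from each of the N workers and takes a single step.

    import numpy as np

    def grad_f(z):
        return np.array([4.0 * z[0] * (z[0] ** 2 - 1.0), 2.0 * z[1]])   # toy non-convex gradient

    def psgd_step(x, n_workers, gamma, rng):
        # each worker i computes grad f(x_t; xi_i); here modeled as the true gradient plus noise
        grads = [grad_f(x) + 0.1 * rng.normal(size=x.shape) for _ in range(n_workers)]
        return x - gamma * np.mean(grads, axis=0)    # server averages gradients, broadcasts x_{t+1}

    rng = np.random.default_rng(1)
    x = np.array([2.0, 2.0])
    for t in range(2000):
        x = psgd_step(x, n_workers=8, gamma=0.01, rng=rng)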

  9. Model Averaging (Parallel Restarted SGD)
  Algorithm 1: Parallel Restarted SGD
  1: Input: Initialize $x_i^0 = \bar{y} \in \mathbb{R}^m$. Set learning rate $\gamma > 0$ and node synchronization interval (integer) $I > 0$.
  2: for $t = 1$ to $T$ do
  3:   Each node $i$ observes an unbiased stochastic gradient $G_i^t$ of $f_i(\cdot)$ at point $x_i^{t-1}$
  4:   if $t$ is a multiple of $I$, i.e., $t \,\%\, I = 0$, then
  5:     Calculate the node average $\bar{y} = \frac{1}{N}\sum_{i=1}^{N} x_i^{t-1}$
  6:     Each node $i$ in parallel updates its local solution $x_i^t = \bar{y} - \gamma G_i^t, \ \forall i$   (2)
  7:   else
  8:     Each node $i$ in parallel updates its local solution $x_i^t = x_i^{t-1} - \gamma G_i^t, \ \forall i$   (3)
  9:   end if
  10: end for
  (A minimal simulation sketch of this algorithm follows below.)
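A minimal simulation sketch of Algorithm 1 as transcribed above; the toy gradient oracle and noise model are illustrative assumptions, not from the paper. Every worker runs local SGD and all local models are averaged every I iterations.

    import numpy as np

    def parallel_restarted_sgd(grad_f, x0, n_workers, T, I, gamma, rng, noise=0.1):
        x = np.tile(x0, (n_workers, 1))                 # x[i] is worker i's local model
        for t in range(1, T + 1):
            G = np.stack([grad_f(x[i]) + noise * rng.normal(size=x0.shape)
                          for i in range(n_workers)])   # local stochastic gradients G_i^t
            if t % I == 0:                              # synchronization step
                y = x.mean(axis=0)                      # node average
                x = y - gamma * G                       # every worker restarts from the average
            else:
                x = x - gamma * G                       # purely local update, no communication
        return x.mean(axis=0)

    rng = np.random.default_rng(2)
    grad = lambda z: np.array([4.0 * z[0] * (z[0] ** 2 - 1.0), 2.0 * z[1]])
    x_hat = parallel_restarted_sgd(grad, np.array([2.0, 2.0]), n_workers=8, T=2000, I=4,
                                   gamma=0.01, rng=rng)

With I = 1 this reduces to the PSGD step sketched earlier; with I > 1 the workers communicate only once every I iterations.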

  10. Model Averaging
  • Each worker trains its local model and (periodically) all workers average their models.
  • One-shot averaging: [Zinkevich et al. 2010, McDonald et al. 2010] propose to average only once, at the end.
  • [Zhang et al. 2016] shows that averaging once can lead to poor solutions for non-convex optimization and suggests more frequent averaging.
  • If we average every I iterations, how large can I be?
    • One-shot averaging: I = T
    • PSGD: I = 1

  11. Why does I = 1 work?
  • If we average the models at every iteration (I = 1), this is equivalent to PSGD: each worker takes one local step $x_{t+1}^i = x_t - \gamma \nabla f(x_t;\xi_i)$, and the model average $x_{t+1} = \frac{1}{N}\sum_{i=1}^{N} x_{t+1}^i = x_t - \frac{\gamma}{N}\sum_{i=1}^{N}\nabla f(x_t;\xi_i)$ is exactly the PSGD update. (A tiny numerical check follows below.)
  [figure: PSGD vs. model averaging with I = 1, both coordinated by a parameter server over workers 1..N]
  • What if we average only periodically, after multiple iterations (I > 1)? Does it converge? At what rate? Is there still a linear speed-up?
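A tiny numerical check (illustrative, with stand-in gradients) of the I = 1 equivalence stated above: if all workers hold the same $x_t$, averaging their one-step local models equals one PSGD step with the averaged gradient.

    import numpy as np

    rng = np.random.default_rng(3)
    x_t, gamma, N = rng.normal(size=5), 0.1, 4
    grads = [rng.normal(size=5) for _ in range(N)]        # stand-ins for grad f(x_t; xi_i)

    local_then_avg = np.mean([x_t - gamma * g for g in grads], axis=0)   # model averaging, I = 1
    psgd_update = x_t - gamma * np.mean(grads, axis=0)                   # PSGD mini-batch step
    assert np.allclose(local_then_avg, psgd_update)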

  12. Empirical Work
  • There has been a long line of empirical works:
    • [Zhang et al. 2016]: CNN for MNIST
    • [Chen and Huo 2016], [Su, Chen, and Xu 2018]: DNN-GMM for speech recognition
    • [McMahan et al. 2017]: CNN for MNIST and CIFAR-10; LSTM for language modeling
    • [Kamp et al. 2018]: CNN for MNIST
    • [Lin, Stich, and Jaggi 2018]: ResNet-20 for CIFAR-10/100; ResNet-50 for ImageNet
  • These empirical works show that "model averaging" = PSGD with significantly less communication overhead!
  • Recall: PSGD = linear speed-up

  13. Model Averaging: Almost Linear Speed-Up in Practice
  • Good speed-up (measured in wall-clock time used to achieve a target accuracy)
  • I: averaging interval (I = 4 means "average every 4 iterations")
  • ResNet-20 over CIFAR-10
  • Figure 7(a) from Lin, Stich, and Jaggi 2018, "Don't use large mini-batches, use local SGD"

  14. Related Work
  • For strongly convex optimization, [Stich 2018] shows that convergence (with linear speed-up w.r.t. the number of workers) is maintained as long as the averaging interval $I \le O(\sqrt{T/N})$.
  • Why does model averaging achieve an almost linear speed-up for deep learning (non-convex) in practice with I > 1?

  15. Main Result
  • We prove that "model averaging" (communication reduction) has the same convergence rate as PSGD for non-convex optimization under certain conditions:
    If the averaging interval $I = O(T^{1/4}/N^{3/4})$, then model averaging has convergence rate $O(1/\sqrt{NT})$.
  • "Model averaging" works for deep learning: it is as fast as PSGD with significantly less communication. (An illustrative numeric reading of the condition follows below.)
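As an illustrative reading of the condition (the numbers are made up, not from the paper): with N = 16 workers and T = 10^6 iterations, $T^{1/4} \approx 31.6$ and $N^{3/4} = 8$, so the averaging interval may grow on the order of $I \approx 4$ while keeping the $O(1/\sqrt{NT})$ rate, i.e., roughly a 4x reduction in communication rounds relative to PSGD. Since the admissible I scales like $T^{1/4}$, longer training runs tolerate proportionally less frequent communication.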

  16. Controlling Bias/Variance After I Iterations
  • Focus on the average of the local solutions over all N workers: $\bar{x}^t = \frac{1}{N}\sum_{i=1}^{N} x_i^t$
  • Note that $\bar{x}^{t+1} - \bar{x}^t = -\frac{\gamma}{N}\sum_{i=1}^{N} G_i^t$: these are independent gradients sampled at different points $x_i^t$.
  • PSGD instead has i.i.d. gradients evaluated at $\bar{x}^t$, which are unavailable at the local workers without communication.

  17. Technical Analysis
  • Bound the difference between $\bar{x}^t$ and $x_i^t$: our algorithm ensures $\mathbb{E}[\|\bar{x}^t - x_i^t\|^2] \le 4\gamma^2 I^2 \sigma^2$ for all $i$ and all $t$ (with $\sigma^2$ the bound on the stochastic gradients from the slide's assumptions).
  • The rest of the argument uses smoothness (under the assumptions stated on the slide). Proof sketch: fix $t \ge 1$; by the smoothness of $f$,
    $\mathbb{E}[f(\bar{x}^t)] \le \mathbb{E}[f(\bar{x}^{t-1})] + \mathbb{E}[\langle \nabla f(\bar{x}^{t-1}), \bar{x}^t - \bar{x}^{t-1}\rangle] + \frac{L}{2}\mathbb{E}[\|\bar{x}^t - \bar{x}^{t-1}\|^2]$.
    Note that $\mathbb{E}[\|\bar{x}^t - \bar{x}^{t-1}\|^2] = \gamma^2 \mathbb{E}\big[\big\|\frac{1}{N}\sum_{i=1}^{N} G_i^t\big\|^2\big]$ ...

  18. Part 2: Escaping Saddle Points in Non-Convex Optimization
  Yi Xu*, Rong Jin, Tianbao Yang*
  "First-order Stochastic Algorithms for Escaping From Saddle Points in Almost Linear Time", NIPS 2018.
  * Xu and Yang are with Iowa State University
  [figure: a trajectory getting stuck at, then escaping, a saddle point]

  19. (First-Order) Stationary Points (FSP)
  • FSP: $\nabla f(x) = 0$.
  [figure: surfaces of a saddle point ($\lambda_{\min}(\nabla^2 f(x)) < 0$), a local minimum ($\nabla^2 f(x) \succ 0$), and a local maximum ($\nabla^2 f(x) \prec 0$)]
  • Second-Order Stationary Points (SSP): $\nabla f(x) = 0$ and $\lambda_{\min}(\nabla^2 f(x)) \ge 0$. If the saddle points are non-degenerate, an SSP is a local minimum.
  • $\nabla^2 f(x)$ has both +/- eigenvalues: saddle point, which can be bad!
  • $\nabla^2 f(x)$ has 0/+ eigenvalues: degenerate case, local minimum or saddle point.

  20. The Problem
  • Finding an approximate local minimum by using first-order methods.
  • $\epsilon$-SSP: $\|\nabla f(x)\| \le \epsilon$ and $\lambda_{\min}(\nabla^2 f(x)) \ge -\gamma$.
  • Choice of $\gamma$: small enough, e.g., $\gamma = \sqrt{\epsilon}$ (Nesterov & Polyak 2006). (A minimal numerical check of this condition follows below.)
  • Nesterov, Yurii, and Polyak, Boris T. "Cubic regularization of Newton method and its global performance." Mathematical Programming 108.1 (2006): 177-205.
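A minimal numerical sketch (illustrative; the toy function and tolerance are assumptions) of checking the approximate-SSP condition above with $\gamma = \sqrt{\epsilon}$: the gradient norm must be at most $\epsilon$ and the smallest Hessian eigenvalue at least $-\sqrt{\epsilon}$.

    import numpy as np

    def grad_f(z):
        # f(x, y) = (x^2 - 1)^2 + y^2
        return np.array([4.0 * z[0] * (z[0] ** 2 - 1.0), 2.0 * z[1]])

    def hess_f(z):
        return np.array([[12.0 * z[0] ** 2 - 4.0, 0.0],
                         [0.0,                    2.0]])

    def is_approx_ssp(z, eps):
        gamma = np.sqrt(eps)
        small_grad = np.linalg.norm(grad_f(z)) <= eps
        almost_psd = np.linalg.eigvalsh(hess_f(z)).min() >= -gamma
        return small_grad and almost_psd

    print(is_approx_ssp(np.array([0.0, 0.0]), 1e-2))   # saddle point at the origin -> False
    print(is_approx_ssp(np.array([1.0, 0.0]), 1e-2))   # local minimum -> True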
