Optimization in Alibaba: Beyond Convexity System for AI AI for - PowerPoint PPT Presentation

Optimization in Alibaba: Beyond Convexity System for AI AI for system Jian Tan Computer Systems Machine Intelligent Technology �� | �� | �� Optimization Machine Operations Learning Research

Agenda Ø Theories on non-convex optimization: Part 1 . Parallel restarted SGD: it finds first-order stationary points (why model averaging works for Deep Learning?) Part 2 . Escaping saddle points in non-convex optimization (first-order stochastic algorithms to find 1 0.8 second-order stationary points) 0.6 0.4 stuck 0.2 0 -0.2 -0.4 -0.6 Ø System optimization: BPTune for an intelligent -0.8 -1 1 database (from OR/ML perspectives) 0.5 1 0.5 0 0 escape -0.5 -0.5 A real complex system deployment -1 -1 Combine pairwise DNN, active learning, heavy-tailed randomness … Part 3 . Stochastic (large deviation) analysis for LRU caching

Learning as Optimization • Stochastic (non-convex) optimization Loss function (∈* + , # = E[0(#; !)] min Model Training samples • ! : random training sample • f (#) : has Lipschitz continuous Gradient 1 0.8 0.6 0.4 0.2 v.s. 0 -0.2 -0.4 -0.6 -0.8 -1 -6 -4 -2 0 2 4 6

Non-Convex Optimization is Challenging Many local minima & saddle points ( first-order stationary ) For stationary points !" # =0 ! $ "(#) ≻ 0 Local minimum local maximum ! $ "(#) ≺ 0 Local maximum saddle point ! $ " # has both +/- eigenvalues saddle points ! $ " # has 0/+ eigenvalues local minimum Degenerate case: could be either local minimum local minimum or saddle points global minimum In general, finding global minimum of non-convex optimization is NP-hard

Instead … • For some applications, e.g., matrix completion, tensor decomposition, dictionary learning, and certain neural networks, Good news: local minima Bad news: saddle points • Either all local minima are • Poor function value compared all global minima with global/local minima • Or all local minima are close • Possibly many saddle points to global minima (even exponential number)

Finding First-order Stationary Points (FSP) • Stochastic Gradient Descent (SGD): $ 12. = $ 1 − 5"6($ 1 ; 8 1 ) • Complexity of SGD (Ghadimi & Lan, 2013, 2016; Ghadimi et al., 2016; Yang et al., 2016) : % ] ≤ ! % : Iteration complexity ((1/! , ) • ! -FSP, E[ "# $ % • Improved Iteration complexity based on Variance Reduction: • SCSG (Lei et al.,2017): ((1/! .//0 ) • Workhorse of deep learning

Part 1: Parallel Restarted SGD with Faster Convergence and Less Communication: Demystifying Why Model Averaging works for Deep Learning Hao Yu, Sen Yang, Shenghuo Zhu (AAAI 2019) • One server is not enough: • too many parameters, e.g., deep neural networks • huge number of training samples • training time is too long • Parallel on N servers: • With N machines, can we be N times faster? If yes, we have the linear speed-up (w.r.t. # of workers)

Classical Parallel mini-batch SGD • The classical Parallel mini-batch SGD (PSGD) achieves O( # $% ) convergence with N workers [Dekel et al. 12]. PSGD can attain a linear speed-up. , % &-) = % & − 0 1 2 3 ∇"(% & ; ( 4 ) 45) PS % &-) % &-) % &-) % &-) ∇" (% & ; ( , ) ∇" (% & ; ( ) ) ∇" (% & ; ( 6 ) ∇" (% & ; ( + ) …… worker 1 worker 2 worker 3 worker N • Each iteration aggregates gradients from every workers. Communication too high! • Can we reduce the communication cost? Yes, model averaging.

Model Averaging (Parallel Restarted SGD) Algorithm 1 Parallel Restarted SGD 1: Input: Initialize x 0 i = y 2 R m . Set learning rate γ > 0 and node synchronization interval (integer) I > 0 2: for t = 1 to T do i of f i ( · ) at point x t − 1 Each node i observes an unbiased stochastic gradient G t 3: i if t is a multiple of I , i.e., t % I = 0, then 4: ∆ P N = 1 i =1 x t − 1 Calculate node average y 5: i N Each node i in parallel updates its local solution 6: x t i = y � γ G t i , 8 i (2) else 7: Each node i in parallel updates its local solution 8: x t i = x t − 1 � γ G t 8 i (3) i , i end if 9: 10: end for

Mo Model el Aver eraging • Each worker train its local model + (periodically) average on all workers • One-shot averaging : [Zindevich et al. 2010, McDonalt et al. 2010] propose to average only once at the end. • [Zhang et al. 2016] shows averaging once can leads to poor solutions for non- convex opt and suggest more frequent averaging. • If averaging every I iterations, how large is I ? • One-shot averaging: I=T • PSGT: I=1

I=1 works? Why I= • If we average models each iteration (I=1), then it is equivalent to PSGD. , % &-) = % & − 0 1 2 3 ∇"(% & ; ( 4 ) 45) PS PSGD % &-) % &-) % &-) % &-) ∇" (% & ; ( , ) ∇" (% & ; ( ) ) ∇" (% & ; ( 6 ) ∇" (% & ; ( + ) 0 ! "#$ = 1 …… worker 1 worker 2 worker 3 worker N 4 2 3 ! "#$ 45$ PS ! "#$ model average (I=1) ! "#$ ! "#$ ! "#$ 0 ! "#$ = ! " − '∇) (! " ; - 0 ) ! "#$ $ = ! " − '∇) (! " ; - $ ) ! "#$ / 6 = ! " − '∇) (! " ; - / ) ! "#$ = ! " − '∇) (! " ; - 6 ) …… worker 1 worker 2 worker 3 worker N • What if we average after multiple iterations periodically (I>1)? Converge or not? Convergence rate? Linear speed-up or not?

Empi Empirical work • There has been a long line of empirical works … • [Zhang et al. 2016]: CNN for MNIST • [Chen and Huo 2016] [Su, Chen, and Xu 2018] : DNN-GMM for speech recognition • [McMahan et al. 2017] :CNN for MNIST and Cifar10; LSTM for language modeling • [Kaamp et al. 2018] :CNN for MNIST • [Lin, Stich, and Jaggi 2018]: Res20 for Cifar10/100; Res50 for ImageNet • These empirical works show that ”model averaging” = PSGD with significantly less communication overhead! • Recall PSGD = linear speed-up

Mo Model el Aver eraging: almost linea ear r speed eed-up up in in pr prac actic tice • Good speed up (measured in wall time used to achieve target accuracy) I I I I • I: averaging intervals (I=4 means I “average every 4 iterations”) • Resnet20 over CIFAR10 • Figure 7(a) from “Tao Lin, Sebastian U. Stich, and Martin Jaggi 2018, Don’t use large mini-batches, use local SGD”

Related work • For strongly convex opt, [Stich 2018] shows the convergence (with linear speed-up w.r.t. # of workers) is maintained as long as the averaging interval I < O( #/ %) . • Why model averaging achieves almost linear speed-up for deep learning (non-convex) in practice for I>1?

Main result • Prove “model averaging ” (communication reduction) has the same convergence rate as PSGD for non-convex opt under certain conditions & * If the averaging interval ! = #(% ' /) ' ) , then model averaging has - the convergence rate O( ./ ) . • ”Model averaging” works for deep learning. It is as fast as PSGD with significantly less communication.

̅ ̅ Control bias-variance after I iterations • Focus on , " # = 1 # average of local solution over all ( workers ( ) " * *+% • Note… , " #$% − . 1 " # = ̅ # ( ) / * *+% # : independent gradients sampled at different points " * #$% / * " #$% , which are unavailable at local workers without • PSGD has i.i.d. gradients at ̅ communication

Technical analysis " # and " $ # • Bound the difference between ̅ # | ) ≤ 4, ) - ) . ) , ∀1, ∀2 " # − " $ Our Algorithm ensures %[|| ̅ • The rest part uses the smoothness and shows Assume: Proof. Fix t � 1. By the smoothness of f , we have E [ f ( x t )]  E [ f ( x t − 1 )] + E [ hr f ( x t − 1 ) , x t � x t − 1 i ] + L 2 E [ k x t � x t − 1 k 2 ] Note that N = γ 2 E [ k 1 E [ k x t � x t − 1 k 2 ] ( a ) X i k 2 ] G t N i =1 ……

Part 2: Escaping Saddle points in non-convex optimization Yi Xu*, Rong Jin, Tianbao Yang* 1 0.8 0.6 0.4 stuck First-order Stochastic Algorithms for Escaping From 0.2 0 Saddle Points in Almost Linear Time, NIPS 2018. -0.2 * Xu and Yang are with Iowa State University -0.4 -0.6 -0.8 -1 1 0.5 1 0.5 0 0 escape -0.5 -0.5 -1 -1

(First-order) Stationary Points (FSP) !1 $ " = 0 Saddle point Local minimum Local maximum 0 2 1 -0.2 1.8 0.8 -0.4 1.6 0.6 -0.6 1.4 0.4 -0.8 1.2 0.2 -1 1 0 -1.2 0.8 -0.2 -1.4 0.6 -0.4 -1.6 0.4 -0.6 -1.8 -0.8 0.2 -2 1 -1 0 1 1 0.5 1 0.5 1 0.5 0.5 1 0 0.5 0 0 0.5 0 -0.5 0 -0.5 0 -0.5 -1 -0.5 -0.5 -1 -0.5 -1 -1 -1 ! " #($) ≺ 0 -1 * +,- (! " # $ ) < 0 ! " # $ > 0 Second-order Stationary Points (SSP) SSP is Local Minimum for " = 0 , * +,- (! " # $ ) ≥ 0 !# $ non-degenerate saddle point ! " # $ has both +/- eigenvalues saddle points, which can be bad! ! " # $ has both 0/+ eigenvalues degenerate case: local minimum/saddle points

The Problem • Finding an approximate local minimum by using first-order methods ( ≤ # , * +,- % ( & ' %& ' ≥ −γ # −SSP : • Choice of γ : small enough, e.g., γ = # (Nesterov & Polyak 2006) • Nesterov, Yurii, and Polyak, Boris T. "Cubic regularization of Newton method and its global performance." Mathematical Programming 108.1 (2006): 177-205.

Optimization in Alibaba: Beyond Convexity System for AI AI for - PowerPoint PPT Presentation

Optimization in Alibaba: Beyond Convexity System for AI AI for system Jian Tan Computer Systems Machine Intelligent Technology | | Optimization Machine Operations Learning Research

Percona XtraBackup at Alibaba Cloud Bo Wang Alibaba Cloud About Me Bo Wang (Fungo Wang)

Alibaba Cloud DNS Practice ICANN64 TechDay guochuan.gc@alibaba-inc.com introduction Who we

Trends in Alibaba Zhaogang Wang zhaogang.wzg@alibaba-inc.com 1 About me Senior Specialist

Technological Innova.on at Alibaba Alan Qi Vice President of Ant Financial Service Group Outline

PolarDB Cloud Native DB @ Alibaba Lixun Peng Inaam Rana Alibaba Cloud Team Agenda

Alibaba Dragonwell JDK: Towards a Java Runtime for Cloud Computing Xiaoming Gu Alibaba JVM Team

Live Migration @Alibaba Cloud: issues settled & challenges remain Chao Zhang Email:

Self-Driving Networks Speaker Ming Zhang Alibaba Group Alibaba serves users around the globe

Beyond Convenience: Beyond Convexity Purushottam Kar MINI-SYMPOSIUM ON COMPUTATION AND

Convexity and Polyhedra Carlo Mannino (from Geir Dahl notes on convexity) University of Oslo,

Five Steps to Optimization Five Steps to Optimization Beyond Best Practices Beyond Best

Optimal covering of a straight line application to discrete convexity Jean-Marc Chassery, Isabelle

A Tightrope Walk Between Convexity and Non-convexity in Computer Vision Thomas Pock Institute

Discrete convexity and packages Gleb Koshevoy IITP(RAS) and Poncelet Center (CNRS) 12/05/2020,

Convexity and the Kalmbach monad Gejza Jena August 10, 2018 Gejza Jena Convexity and the

Computational Optimization Convexity and Unconstrained Optimization 1/29/08 and 2/1(revised)

Web Service Pa,erns 95-702 Distributed Systems October 2013

Lattice QCD on Blue Waters PI: Paul Mackenzie (Fermilab) Presenter: Steven Gottlieb (Indiana)

No more mini-languages: Autodiff in full-featured Python David Duvenaud, Dougal Maclaurin,

The role of final-state interactions in Dalitz plot studies Bastian Kubis Helmholtz-Institut f

Course Introduc-on Bryce Boe 2012/08/06 CS32, Summer 2012 B

1 Probability Recap Reasoning over Time or Space Condi+onal

Proverbs Series Lesson #002 January 6, 2013 Dean Bible Ministries www.deanbible.org Dr. Robert

A Tutorial Introduction to DANE Jan Zorz / ISOC Carlos Mar1nez /

Optimization in Alibaba: Beyond Convexity System for AI AI for - PowerPoint PPT Presentation

Optimization in Alibaba: Beyond Convexity System for AI AI for system Jian Tan Computer Systems Machine Intelligent Technology | | Optimization Machine Operations Learning Research

Percona XtraBackup at Alibaba Cloud Bo Wang Alibaba Cloud About Me Bo Wang (Fungo Wang)

Alibaba Cloud DNS Practice ICANN64 TechDay guochuan.gc@alibaba-inc.com introduction Who we

Trends in Alibaba Zhaogang Wang zhaogang.wzg@alibaba-inc.com 1 About me Senior Specialist

Technological Innova.on at Alibaba Alan Qi Vice President of Ant Financial Service Group Outline

PolarDB Cloud Native DB @ Alibaba Lixun Peng Inaam Rana Alibaba Cloud Team Agenda

Alibaba Dragonwell JDK: Towards a Java Runtime for Cloud Computing Xiaoming Gu Alibaba JVM Team

Live Migration @Alibaba Cloud: issues settled &amp; challenges remain Chao Zhang Email:

Self-Driving Networks Speaker Ming Zhang Alibaba Group Alibaba serves users around the globe

Beyond Convenience: Beyond Convexity Purushottam Kar MINI-SYMPOSIUM ON COMPUTATION AND

Convexity and Polyhedra Carlo Mannino (from Geir Dahl notes on convexity) University of Oslo,

Five Steps to Optimization Five Steps to Optimization Beyond Best Practices Beyond Best

Optimal covering of a straight line application to discrete convexity Jean-Marc Chassery, Isabelle

A Tightrope Walk Between Convexity and Non-convexity in Computer Vision Thomas Pock Institute

Discrete convexity and packages Gleb Koshevoy IITP(RAS) and Poncelet Center (CNRS) 12/05/2020,

Convexity and the Kalmbach monad Gejza Jena August 10, 2018 Gejza Jena Convexity and the

Computational Optimization Convexity and Unconstrained Optimization 1/29/08 and 2/1(revised)

Web Service Pa,erns 95-702 Distributed Systems October 2013

Lattice QCD on Blue Waters PI: Paul Mackenzie (Fermilab) Presenter: Steven Gottlieb (Indiana)

No more mini-languages: Autodiff in full-featured Python David Duvenaud, Dougal Maclaurin,

The role of final-state interactions in Dalitz plot studies Bastian Kubis Helmholtz-Institut f

Course Introduc-on Bryce Boe 2012/08/06 CS32, Summer 2012 B

1 Probability Recap Reasoning over Time or Space Condi+onal

Proverbs Series Lesson #002 January 6, 2013 Dean Bible Ministries www.deanbible.org Dr. Robert

A Tutorial Introduction to DANE Jan Zorz / ISOC Carlos Mar1nez /

Live Migration @Alibaba Cloud: issues settled & challenges remain Chao Zhang Email: