SLIDE 1

Acceleration for Compressed Gradient Descent in Distributed and Federated Optimization

Zhize Li

King Abdullah University of Science and Technology (KAUST), https://zhizeli.github.io
Joint work with Dmitry Kovalev (KAUST), Xun Qian (KAUST), and Peter Richtárik (KAUST)

ICML 2020

SLIDE 2

Overview

1. Problem
2. Related Work
3. Our Contributions
   - Single Device Setting
   - Distributed Setting
4. Experiments

SLIDE 3

Problem

Training distributed/federated learning models is typically performed by solving an optimization problem

$$\min_{x \in \mathbb{R}^d} \left\{ P(x) := \frac{1}{n} \sum_{i=1}^n f_i(x) + \psi(x) \right\},$$

where
- $f_i(x)$: loss function associated with the data stored on node/device $i$
- $\psi(x)$: regularization term (e.g., the $\ell_1$ regularizer $\|x\|_1$, the $\ell_2$ regularizer $\|x\|_2^2$, or the indicator function $I_C(x)$ of some set $C$)

SLIDE 4

Examples

$$\min_{x \in \mathbb{R}^d} \left\{ P(x) := \frac{1}{n} \sum_{i=1}^n f_i(x) + \psi(x) \right\}$$

Each node/device $i$ stores $m$ data samples $\{(a_{i,j}, b_{i,j}) \in \mathbb{R}^{d+1}\}_{j=1}^m$:

- Lasso regression: $f_i(x) = \frac{1}{m} \sum_{j=1}^m (a_{i,j}^\top x - b_{i,j})^2$, $\psi(x) = \lambda \|x\|_1$
- Logistic regression: $f_i(x) = \frac{1}{m} \sum_{j=1}^m \log\big(1 + \exp(-b_{i,j} a_{i,j}^\top x)\big)$
- SVM: $f_i(x) = \frac{1}{m} \sum_{j=1}^m \max\big(0,\, 1 - b_{i,j} a_{i,j}^\top x\big)$, $\psi(x) = \frac{\lambda}{2} \|x\|_2^2$
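
For concreteness, here is a minimal NumPy sketch (ours, not from the slides) of the three local objectives above; the helper names and array shapes are illustrative assumptions.

```python
import numpy as np

# Node i's local data: A has m rows a_{i,j}^T (shape m x d), b has m labels b_{i,j}.

def lasso_loss(x, A, b, lam):
    # f_i(x) + psi(x) = (1/m) sum_j (a_j^T x - b_j)^2 + lam * ||x||_1
    return np.mean((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

def logistic_loss(x, A, b):
    # f_i(x) = (1/m) sum_j log(1 + exp(-b_j a_j^T x)), with labels b_j in {-1, +1}
    return np.mean(np.log1p(np.exp(-b * (A @ x))))

def svm_loss(x, A, b, lam):
    # f_i(x) + psi(x) = (1/m) sum_j max(0, 1 - b_j a_j^T x) + (lam/2) ||x||_2^2
    return np.mean(np.maximum(0.0, 1.0 - b * (A @ x))) + 0.5 * lam * np.dot(x, x)
```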

SLIDES 5-6

Goal

$$\min_{x \in \mathbb{R}^d} \left\{ P(x) := \frac{1}{n} \sum_{i=1}^n f_i(x) + \psi(x) \right\}$$

Goal: find an $\epsilon$-solution (parameters) $\hat{x}$, e.g., $P(\hat{x}) - P(x^*) \le \epsilon$ or $\|\hat{x} - x^*\|_2^2 \le \epsilon$, where $x^* := \arg\min_{x \in \mathbb{R}^d} P(x)$.

For optimization methods:
- Bottleneck: communication cost.
- Common strategy: compress the communicated messages (lower communication cost in each iteration/communication round) and hope that this will not increase the total number of iterations/communication rounds.

SLIDES 7-9

Related Work

- Several recent works show that the total communication complexity can be improved via compression; see, e.g., QSGD [Alistarh et al., 2017], DIANA [Mishchenko et al., 2019], and natural compression [Horváth et al., 2019].

- However, previous work usually leads to this kind of improvement:
  Communication cost per iteration (- -), Iterations (+) ⇒ Total (-)
  ('-' denotes decrease, '+' denotes increase)

- In this work, we provide the first optimization methods provably combining the benefits of gradient compression and acceleration:
  Communication cost per iteration (- -), Iterations (- -) ⇒ Total (- - - -)

SLIDES 10-11

Single Device Setting

- First, consider the simple single device (i.e., n = 1) case:
  $$\min_{x \in \mathbb{R}^d} f(x),$$
  where $f : \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth, and convex or $\mu$-strongly convex.

- $f$ is $L$-smooth, i.e., has $L$-Lipschitz continuous gradient (for $L > 0$), if
  $$\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|, \qquad (1)$$
  and $\mu$-strongly convex (for $\mu \ge 0$) if
  $$f(x) - f(y) - \langle \nabla f(y),\, x - y \rangle \ge \frac{\mu}{2} \|x - y\|^2 \qquad (2)$$
  for all $x, y \in \mathbb{R}^d$. The case $\mu = 0$ reduces to standard convexity.
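
As a standard worked example (ours, not on the slide), a convex quadratic makes the constants in (1) and (2) concrete:

```latex
\[
f(x) = \tfrac{1}{2}\, x^\top A x, \qquad A \ \text{symmetric},\ A \succeq 0,
\]
\[
\|\nabla f(x) - \nabla f(y)\| = \|A(x - y)\| \le \lambda_{\max}(A)\,\|x - y\|,
\]
% so (1) holds with L = \lambda_{\max}(A) and (2) holds with \mu = \lambda_{\min}(A);
% the condition number appearing in the rates below is L/\mu = \lambda_{\max}(A)/\lambda_{\min}(A).
```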

SLIDES 12-13

Compressed Gradient Descent (CGD)

- Problem: $\min_{x \in \mathbb{R}^d} f(x)$
  1) Given initial point $x^0$ and step size $\eta$
  2) CGD update: $x^{k+1} = x^k - \eta\, C(\nabla f(x^k))$, for $k \ge 0$

Definition (Compression operator)
A randomized map $C : \mathbb{R}^d \to \mathbb{R}^d$ is an $\omega$-compression operator if
$$\mathbb{E}[C(x)] = x, \qquad \mathbb{E}\big[\|C(x) - x\|^2\big] \le \omega \|x\|^2, \quad \forall x \in \mathbb{R}^d. \qquad (3)$$
In particular, no compression ($C(x) \equiv x$) implies $\omega = 0$. Note that Condition (3) is satisfied by many practical compression schemes, e.g., random-$k$ sparsification and $(p, s)$-quantization.
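
A minimal runnable sketch (ours) of CGD with random-$k$ sparsification, which satisfies (3) with $\omega = d/k - 1$; the toy quadratic objective and the step size choice are illustrative assumptions, not prescriptions from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_k(x, k):
    # Random-k sparsification: keep k coordinates at random, scale by d/k.
    # Unbiased: E[C(x)] = x; variance: E||C(x) - x||^2 = (d/k - 1) ||x||^2,
    # so it satisfies (3) with omega = d/k - 1.
    d = x.size
    mask = np.zeros(d)
    mask[rng.choice(d, size=k, replace=False)] = 1.0
    return (d / k) * mask * x

# Toy L-smooth, mu-strongly convex objective f(x) = (1/2) x^T A x (illustrative).
d, k = 100, 10
A = np.diag(np.linspace(1.0, 10.0, d))   # mu = 1, L = 10
grad = lambda x: A @ x

omega = d / k - 1
eta = 1.0 / ((1 + omega) * 10.0)         # step size ~ 1/((1+omega) L), a common choice
x = rng.standard_normal(d)
for _ in range(2000):
    x = x - eta * random_k(grad(x), k)   # CGD update: x_{k+1} = x_k - eta * C(grad f(x_k))
print(np.linalg.norm(x))                 # approaches 0 = x*
```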

SLIDE 14

Accelerated Compressed Gradient Descent (ACGD)

Inspired by Nesterov's accelerated gradient descent (AGD) [Nesterov, 2004] and FISTA [Beck and Teboulle, 2009], here we propose the first accelerated compressed gradient descent (ACGD) method. Our ACGD update:
1) $x^k = \alpha_k y^k + (1 - \alpha_k) z^k$
2) $y^{k+1} = x^k - \eta_k\, C(\nabla f(x^k))$
3) $z^{k+1} = \beta_k \big( \theta_k z^k + (1 - \theta_k) x^k \big) + (1 - \beta_k) \big( \gamma_k y^{k+1} + (1 - \gamma_k) y^k \big)$
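
The slide gives the three-step structure but not the parameter schedule (the paper sets $\alpha_k, \beta_k, \gamma_k, \theta_k, \eta_k$ from $L$, $\mu$, and $\omega$). A minimal skeleton under that caveat, with all scalars passed in as placeholders:

```python
import numpy as np

def acgd(grad, compress, x0, eta, alpha, beta, gamma, theta, iters):
    # ACGD skeleton following the slide; the scalar parameters are placeholders
    # (the paper derives their schedules from L, mu and omega).
    y = x0.copy()
    z = x0.copy()
    for _ in range(iters):
        x = alpha * y + (1.0 - alpha) * z                         # 1) extrapolated point
        y_new = x - eta * compress(grad(x))                       # 2) compressed gradient step
        z = beta * (theta * z + (1.0 - theta) * x) \
            + (1.0 - beta) * (gamma * y_new + (1.0 - gamma) * y)  # 3) momentum combination
        y = y_new
    return y
```

For example, `acgd(grad, lambda g: random_k(g, k), x0, ...)` plugs in the sparsifier from the previous sketch.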

SLIDES 15-16

Convergence Results in Single Device Setting

Table: Convergence results (iterations) for the single device ($n = 1$) case $\min_{x \in \mathbb{R}^d} f(x)$

| Algorithm | $\mu$-strongly convex $f$ | convex $f$ |
|---|---|---|
| Compressed Gradient Descent (CGD [Khirirat et al., 2018]) | $O\big((1+\omega)\, \frac{L}{\mu} \log \frac{1}{\epsilon}\big)$ | $O\big((1+\omega)\, \frac{L}{\epsilon}\big)$ |
| ACGD (this paper) | $O\big((1+\omega) \sqrt{\frac{L}{\mu}} \log \frac{1}{\epsilon}\big)$ | $O\big((1+\omega) \sqrt{\frac{L}{\epsilon}}\big)$ |

- If no compression (i.e., $\omega = 0$): CGD recovers the results of vanilla (uncompressed) GD, i.e., $O\big(\frac{L}{\mu} \log \frac{1}{\epsilon}\big)$ and $O\big(\frac{L}{\epsilon}\big)$.
- If the compression parameter $\omega \le O\big(\sqrt{L/\mu}\big)$ or $O\big(\sqrt{L/\epsilon}\big)$: our ACGD enjoys the benefits of compression and acceleration, i.e., both the communication cost per iteration (compression) and the total number of iterations (acceleration) are smaller than those of GD.
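
To make the improvement concrete, a back-of-the-envelope comparison with random-$k$ sparsification; the specific numbers are our own illustration, not from the slides.

```latex
% Random-k sparsification has \omega = d/k - 1 (see the sketch above).
% Take d = 10^6, k = 10^4 (so 1 + \omega = 100) and L/\mu = 10^8:
\[
\text{GD:}\quad \tfrac{L}{\mu}\log\tfrac{1}{\epsilon} = 10^8 \log\tfrac{1}{\epsilon}
\ \text{iterations} \times 10^6 \ \text{coordinates each},
\]
\[
\text{ACGD:}\quad (1+\omega)\sqrt{\tfrac{L}{\mu}}\log\tfrac{1}{\epsilon} = 10^6 \log\tfrac{1}{\epsilon}
\ \text{iterations} \times 10^4 \ \text{coordinates each},
\]
% i.e. 100x fewer iterations and 100x less communication per iteration
% (10^4 x fewer coordinates sent in total), consistent with the condition
% \omega = 99 \le O(\sqrt{L/\mu}) = O(10^4).
```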

SLIDE 17

Recall the Discussion in Related Work

- Previous work usually leads to this kind of improvement:
  Communication cost per iteration (- -), Iterations (+) ⇒ Total (-)
  ('-' denotes decrease, '+' denotes increase)

- In this work, we provide the first optimization methods provably combining the benefits of gradient compression and acceleration:
  Communication cost per iteration (- -), Iterations (- -) ⇒ Total (- - - -)

SLIDES 18-19

Distributed Setting

Now, we consider the general distributed setting with $n$ devices/nodes:

$$\min_{x \in \mathbb{R}^d} \left\{ P(x) := \frac{1}{n} \sum_{i=1}^n f_i(x) + \psi(x) \right\}.$$

The presence of multiple nodes ($n > 1$) and of the regularizer $\psi$ poses additional challenges. We propose a distributed variant of ACGD (called ADIANA), which can be seen as an accelerated version of DIANA [Mishchenko et al., 2019].

SLIDE 20

Accelerated DIANA (ADIANA)

Main update of our ADIANA:
1) $x^k = \theta_1 z^k + \theta_2 w^k + (1 - \theta_1 - \theta_2) y^k$
2i) All devices/nodes/machines compress the shifted local gradient $C_i^k(\nabla f_i(x^k) - h_i^k)$ in parallel and send it to the server
2ii) Update the local shift: $h_i^{k+1} = h_i^k + \alpha\, C_i^k(\nabla f_i(w^k) - h_i^k)$
3) Aggregate the received compressed gradient information: $g^k = \frac{1}{n} \sum_{i=1}^n C_i^k(\nabla f_i(x^k) - h_i^k) + h^k$, where $h^k = \frac{1}{n} \sum_{i=1}^n h_i^k$
4) Perform a proximal update step: $y^{k+1} = \mathrm{prox}_{\eta \psi}(x^k - \eta g^k)$
5) $z^{k+1} = \beta z^k + (1 - \beta) x^k + \frac{\gamma}{\eta}(y^{k+1} - x^k)$
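
A sketch (ours) of one ADIANA round following the slide's five steps; every scalar parameter here is a placeholder (the paper sets them from $L$, $\mu$, $\omega$, and $n$), and `prox` stands for a generic proximal operator of $\eta\psi$.

```python
import numpy as np

def adiana_step(grads, state, compress, prox, eta, alpha, beta, gamma, theta1, theta2):
    # One ADIANA round. grads[i](x) returns grad f_i(x) on node i;
    # state holds y, z, w and the list h of local shifts h_i.
    y, z, w, h = state["y"], state["z"], state["w"], state["h"]
    n = len(grads)
    x = theta1 * z + theta2 * w + (1 - theta1 - theta2) * y      # 1)
    msgs = [compress(grads[i](x) - h[i]) for i in range(n)]      # 2i) nodes -> server
    g = np.mean(msgs, axis=0) + np.mean(h, axis=0)               # 3) aggregate, h^k = avg of h_i^k
    for i in range(n):                                           # 2ii) local shift update
        h[i] = h[i] + alpha * compress(grads[i](w) - h[i])
    y_new = prox(x - eta * g, eta)                               # 4) proximal step, prox_{eta psi}
    z_new = beta * z + (1 - beta) * x + (gamma / eta) * (y_new - x)  # 5)
    state.update(y=y_new, z=z_new)
    # In the full method w is also refreshed occasionally (roughly, w <- y with
    # some probability); we leave that bookkeeping to the caller.
    return state
```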

SLIDES 21-23

Convergence Results in Distributed Setting

Table: Convergence results (iterations) for the general distributed case with $n$ devices (the result in the case $n < \omega$ can be found in Table 2 of our paper)

| Algorithm | In the case $n \ge \omega$ (lots of devices or low compression) |
|---|---|
| Distributed CGD (DIANA [Mishchenko et al., 2019]) | $O\big(\big(\omega + \frac{L}{\mu}\big) \log \frac{1}{\epsilon}\big)$ |
| ADIANA (this paper) | $O\big(\big(\omega + \sqrt{\frac{L}{\mu}} + \sqrt[4]{\frac{\omega}{n}} \sqrt{\frac{\omega L}{\mu}}\big) \log \frac{1}{\epsilon}\big)$ |

- Note that $\omega + \frac{L}{\mu} \ge 2\sqrt{\frac{\omega L}{\mu}}$ and $\frac{\omega}{n} \le 1$.
- If the compression parameter $\omega \le O\big(\min\big\{\sqrt{L/\mu},\ n^{1/3}\big\}\big)$: our ADIANA enjoys the benefits of compression and acceleration, i.e., lower communication cost per iteration (compression) and a smaller total number of iterations (acceleration): $\sqrt{\frac{L}{\mu}} \log \frac{1}{\epsilon}$ vs. $\frac{L}{\mu} \log \frac{1}{\epsilon}$.

SLIDES 24-25

Experiments

We demonstrate the performance of our accelerated distributed method ADIANA and previous methods with different compression operators on the regularized logistic regression problem

$$\min_{x \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n \log\big(1 + \exp(-b_i a_i^\top x)\big) + \frac{\lambda}{2} \|x\|^2. \qquad (4)$$

Compression operators: we adopt three compression operators: random sparsification (see, e.g., [Stich et al., 2018]), random dithering (see, e.g., [Alistarh et al., 2017]), and natural compression (see, e.g., [Horváth et al., 2019]).
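
As an illustration of one of these compressors, here is a minimal random dithering sketch in the QSGD style; the level count $s$ and the use of $\ell_2$ normalization are our assumptions about the standard construction, not details given on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_dithering(x, s=4):
    # QSGD-style dithering: C(x) = ||x||_2 * sign(x_i) * xi_i / s, where xi_i
    # rounds s * |x_i| / ||x||_2 to one of its two neighboring integer levels,
    # with probabilities chosen so that E[C(x)] = x (unbiased).
    norm = np.linalg.norm(x)
    if norm == 0.0:
        return x.copy()
    level = s * np.abs(x) / norm                       # values in [0, s]
    lower = np.floor(level)
    xi = lower + (rng.random(x.size) < level - lower)  # unbiased stochastic rounding
    return norm * np.sign(x) * xi / s

# Usage: send random_dithering(grad) instead of the dense gradient grad.
```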

SLIDE 26

[Six plots omitted: $f(x^k) - f(x^*)$ versus communication bits on the a5a (top row) and mushrooms (bottom row) datasets, one panel per compressor, comparing DCGD, DIANA, and ADIANA.]

Figure: The communication complexity of three different methods (DCGD, DIANA, ADIANA) for three different compression operators (random sparsification, random dithering, natural compression) on the a5a (top) and mushrooms (bottom) datasets.

SLIDE 27

[Four plots omitted: $f(x^k) - f(x^*)$ versus communication bits on the a5a (top row) and mushrooms (bottom row) datasets under random dithering and natural compression, comparing DIANA and ADIANA against their uncompressed counterparts (UN_DIANA, UN_ADIANA).]

Figure: The communication complexity of DIANA and ADIANA with and without compression on the a5a (top) and mushrooms (bottom) datasets.

SLIDE 28

Conclusion

- We provide the first accelerated compressed gradient descent methods (ACGD for $n = 1$ and ADIANA for general $n > 1$), which combine the benefits of compression and acceleration.
- The experimental results validate our theoretical results and confirm the practical superiority of our accelerated methods.

SLIDE 29

Thanks!

Zhize Li

SLIDE 30

References

- Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1709-1720, 2017.
- Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.
- Samuel Horváth, Chen-Yu Ho, Ľudovít Horváth, Atal Narayan Sahu, Marco Canini, and Peter Richtárik. Natural compression for distributed deep learning. arXiv preprint arXiv:1905.10988, 2019.
- Sarit Khirirat, Hamid Reza Feyzmahdavian, and Mikael Johansson. Distributed learning with compressed gradients. arXiv preprint arXiv:1806.06573, 2018.
- Konstantin Mishchenko, Eduard Gorbunov, Martin Takáč, and Peter Richtárik. Distributed learning with compressed gradient differences. arXiv preprint arXiv:1901.09269, 2019.
- Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2004.
- Sebastian U. Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified SGD with memory. In Advances in Neural Information Processing Systems, pages 4447-4458, 2018.