MOCHA: Federated Multi-Task Learning NIPS 17 Virginia Smith - - PowerPoint PPT Presentation



SLIDE 1

MOCHA: Federated Multi-Task Learning

Virginia Smith · Stanford / CMU

Chao-Kai Chiang · USC | Maziar Sanjabi · USC | Ameet Talwalkar · CMU

NIPS '17

SLIDE 2

MACHINE LEARNING WORKFLOW

data & problem → machine learning model → optimization algorithm

min_w Σ_{i=1}^n ℓ(w, x_i) + g(w)
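The objective on this slide is standard regularized empirical risk minimization. As an illustrative sketch (not from the talk), here is a tiny NumPy gradient-descent solver for one concrete instance: squared loss with the L2 regularizer g(w) = (λ/2)‖w‖².

```python
import numpy as np

def fit_erm(X, y, lam=0.1, lr=0.01, steps=2000):
    """Minimize sum_i l(w, x_i) + g(w) with the squared loss
    l(w, x_i) = 0.5 * (w @ x_i - y_i)^2 and g(w) = 0.5 * lam * ||w||^2."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        # gradient of the summed losses plus gradient of the regularizer
        grad = X.T @ (X @ w - y) + lam * w
        w -= lr * grad
    return w

# toy data: y is an exact linear function of x, so w_hat should recover w_true
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
w_hat = fit_erm(X, y)
```

With enough data relative to the regularization strength, the recovered `w_hat` is close to `w_true` up to a small shrinkage from g(w).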

SLIDE 3

MACHINE LEARNING WORKFLOW

data & problem → machine learning model → optimization algorithm

min_w Σ_{i=1}^n ℓ(w, x_i) + g(w)

^ IN PRACTICE: the systems setting also enters the workflow

SLIDE 4

how can we perform fast distributed optimization?

SLIDE 5

BEYOND THE DATACENTER

Massively Distributed · Node Heterogeneity · Unbalanced · Non-IID · Underlying Structure

SLIDE 6

BEYOND THE DATACENTER

Systems Challenges: Massively Distributed · Node Heterogeneity
Statistical Challenges: Unbalanced · Non-IID · Underlying Structure

SLIDE 7

MACHINE LEARNING WORKFLOW

data & problem → machine learning model → optimization algorithm

min_w Σ_{i=1}^n ℓ(w, x_i) + g(w)

^ IN PRACTICE: the systems setting also enters the workflow

SLIDE 8

MACHINE LEARNING WORKFLOW

data & problem → machine learning model → optimization algorithm

min_w Σ_{i=1}^n ℓ(w, x_i) + g(w)

^ IN PRACTICE: the systems setting also enters the workflow

SLIDE 9

OUTLINE

Statistical Challenges: Unbalanced · Non-IID · Underlying Structure
Systems Challenges: Massively Distributed · Node Heterogeneity

SLIDE 10

OUTLINE

Statistical Challenges: Unbalanced · Non-IID · Underlying Structure
Systems Challenges: Massively Distributed · Node Heterogeneity

SLIDE 11

A GLOBAL APPROACH

One model W shared across all devices [MMRHA, AISTATS 16]

SLIDE 12

A LOCAL APPROACH

Separate models W1, W2, …, W12, one per device

SLIDE 13

OUR APPROACH: PERSONALIZED MODELS

Models W1, W2, …, W12, one per device

SLIDE 14

OUR APPROACH: PERSONALIZED MODELS

Models W1, W2, …, W12, one per device

SLIDE 15

MULTI-TASK LEARNING

Fit per-task models (losses ℓ_t) coupled by a task-relationship regularizer R(W, Ω), which can capture: all tasks related · outlier tasks · clusters / groups · asymmetric relationships [ZCY, SDM 2012]

min_{W,Ω} Σ_{t=1}^m Σ_{i=1}^{n_t} ℓ_t(w_t, x_t^i) + R(W, Ω)
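To make the formula concrete, here is a hedged sketch (the names and the specific regularizer are illustrative choices, not from the talk) that evaluates the multi-task objective with squared losses and one common task-relationship regularizer, R(W, Ω) = λ · tr(W Ω Wᵀ):

```python
import numpy as np

def mtl_objective(W, Omega, Xs, ys, lam=0.1):
    """Multi-task objective: sum_t sum_i l_t(w_t, x_t^i) + R(W, Omega),
    with squared losses and the common choice R(W, Omega) = lam * tr(W Omega W^T).
    W: (d, m) with one column per task; Omega: (m, m) task-relationship matrix."""
    loss = 0.0
    for t, (X, y) in enumerate(zip(Xs, ys)):
        r = X @ W[:, t] - y
        loss += 0.5 * r @ r
    reg = lam * np.trace(W @ Omega @ W.T)
    return loss + reg

# toy check: data generated exactly by W, so the loss term vanishes and only
# the regularizer contributes
rng = np.random.default_rng(1)
d, m = 4, 2
Xs = [rng.normal(size=(30, d)) for _ in range(m)]
W = rng.normal(size=(d, m))
ys = [Xs[t] @ W[:, t] for t in range(m)]
Omega = np.eye(m)  # identity: tasks penalized independently
val = mtl_objective(W, Omega, Xs, ys)
```

With Omega = I the regularizer reduces to λ‖W‖²_F; other choices of Ω encode clusters, outliers, or asymmetric task relationships.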

SLIDE 16

FEDERATED DATASETS

Human Activity · Google Glass · Land Mine · Vehicle Sensor

SLIDE 17

PREDICTION ERROR

Dataset        | Global       | Local        | MTL
Human Activity | 2.23 (0.30)  | 1.34 (0.21)  | 0.46 (0.11)
Google Glass   | 5.34 (0.26)  | 4.92 (0.26)  | 2.02 (0.15)
Land Mine      | 27.72 (1.08) | 23.43 (0.77) | 20.09 (1.04)
Vehicle Sensor | 13.4 (0.26)  | 7.81 (0.13)  | 6.59 (0.21)

SLIDE 18

OUTLINE

Statistical Challenges: Unbalanced · Non-IID · Underlying Structure
Systems Challenges: Massively Distributed · Node Heterogeneity

SLIDE 19

OUTLINE

Statistical Challenges: Unbalanced · Non-IID · Underlying Structure
Systems Challenges: Massively Distributed · Node Heterogeneity

SLIDE 20

GOAL: FEDERATED OPTIMIZATION FOR MULTI-TASK LEARNING

min_{W,Ω} Σ_{t=1}^m Σ_{i=1}^{n_t} ℓ_t(w_t^T x_t^i) + R(W, Ω)

Solve for W and Ω in an alternating fashion: Ω can be updated centrally; W needs to be solved in the federated setting.

Challenges: communication is expensive · statistical & systems heterogeneity · stragglers · fault tolerance

SLIDE 21

GOAL: FEDERATED OPTIMIZATION FOR MULTI-TASK LEARNING

min_{W,Ω} Σ_{t=1}^m Σ_{i=1}^{n_t} ℓ_t(w_t^T x_t^i) + R(W, Ω)

Solve for W and Ω in an alternating fashion: Ω can be updated centrally; W needs to be solved in the federated setting.

Challenges: communication is expensive · statistical & systems heterogeneity · stragglers · fault tolerance

Idea: modify a communication-efficient method from the data center setting to handle: ✔ multi-task learning ✔ stragglers ✔ fault tolerance
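The alternating scheme described here can be sketched in simulation. This is only an illustration under assumptions not stated on the slide: squared losses, the regularizer λ·tr(W Ω Wᵀ) for the W step, and a Zhang/Yeung-style closed form for the central Ω step; `w_step` and `omega_step` are hypothetical helper names, and in the real setting the W step would run across devices.

```python
import numpy as np

def w_step(W, Omega, Xs, ys, lam=0.1, lr=0.01, steps=200):
    """W step (federated in MOCHA; simulated centrally here): gradient descent
    on sum of squared losses + lam * tr(W Omega W^T), with Omega held fixed."""
    for _ in range(steps):
        G = np.zeros_like(W)
        for t, (X, y) in enumerate(zip(Xs, ys)):
            G[:, t] = X.T @ (X @ W[:, t] - y)   # per-device loss gradient
        G += lam * W @ (Omega + Omega.T)        # gradient of tr(W Omega W^T)
        W = W - lr * G
    return W

def omega_step(W, eps=1e-6):
    """Central Omega step: one common closed form builds Omega from the task
    covariance W^T W (a normalized matrix square root); illustrative only."""
    M = W.T @ W + eps * np.eye(W.shape[1])
    vals, vecs = np.linalg.eigh(M)              # M is symmetric PSD
    S = vecs @ np.diag(np.sqrt(np.clip(vals, 0, None))) @ vecs.T
    return S / np.trace(S)                      # normalize to trace 1

def alternating_minimization(W, Omega, Xs, ys, outer_iters=10):
    """Skeleton of the slide's scheme: W solved across devices, Omega centrally."""
    for _ in range(outer_iters):
        W = w_step(W, Omega, Xs, ys)
        Omega = omega_step(W)
    return W, Omega

# toy run: two tasks generated from one shared model
rng = np.random.default_rng(4)
d, m = 4, 2
Xs = [rng.normal(size=(30, d)) for _ in range(m)]
w_shared = rng.normal(size=d)
ys = [X @ w_shared for X in Xs]
W, Omega = alternating_minimization(np.zeros((d, m)), np.eye(m) / m, Xs, ys)
```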


SLIDE 22

COCOA: COMMUNICATION-EFFICIENT DISTRIBUTED OPTIMIZATION

mini-batch methods ⟷ one-shot communication

key idea: control the amount of communication

SLIDE 23

COCOA: PRIMAL-DUAL FRAMEWORK

PRIMAL: min_{w ∈ R^d} (1/n) Σ_{i=1}^n ℓ(w^T x_i) + g(w)

DUAL: max_{α ∈ R^n} −(1/n) Σ_{i=1}^n ℓ*(−α_i) − g*(Xα)

The dual is split across K machines into local subproblems Σ_{k=1}^K g̃*(X_[k], α_[k]); each round, machine k updates its block α_[k]^(t) → α_[k]^(t+1).

SLIDE 24

COCOA: PRIMAL-DUAL FRAMEWORK

PRIMAL: min_{w ∈ R^d} (1/n) Σ_{i=1}^n ℓ(w^T x_i) + g(w)

DUAL: max_{α ∈ R^n} −(1/n) Σ_{i=1}^n ℓ*(−α_i) − g*(Xα)

The dual is split across K machines into local subproblems Σ_{k=1}^K g̃*(X_[k], α_[k]); each round, machine k updates its block α_[k]^(t) → α_[k]^(t+1).

challenge #1: extend to the MTL setup

SLIDE 25

COCOA: COMMUNICATION PARAMETER

Θ ∈ [0, 1) ≈ amount of local computation vs. communication

Main assumption: each subproblem is solved to accuracy Θ (Θ = 0: solve exactly; Θ → 1: solve inexactly)

SLIDE 26

COCOA: COMMUNICATION PARAMETER

Θ ∈ [0, 1) ≈ amount of local computation vs. communication

Main assumption: each subproblem is solved to accuracy Θ (Θ = 0: solve exactly; Θ → 1: solve inexactly)

challenge #2: make communication more flexible

SLIDE 27

MOCHA: COMMUNICATION-EFFICIENT FEDERATED OPTIMIZATION

min_{W,Ω} Σ_{t=1}^m Σ_{i=1}^{n_t} ℓ_t(w_t^T x_t^i) + R(W, Ω)

Solve for W and Ω in an alternating fashion; modify CoCoA to solve for W in the federated setting.

Dual: min_α Σ_{t=1}^m Σ_{i=1}^{n_t} ℓ_t*(−α_t^i) + R*(Xα)

Per-device subproblem: min_{Δα_t} Σ_{i=1}^{n_t} ℓ_t*(−α_t^i − Δα_t^i) + ⟨w_t(α), X_t Δα_t⟩ + (σ′/2) ‖X_t Δα_t‖²_{M_t}

SLIDE 28

MOCHA: PER-DEVICE, PER-ITERATION APPROXIMATIONS

Stragglers (statistical heterogeneity): difficulty of solving the subproblem; size of the local dataset
Stragglers (systems heterogeneity): hardware (CPU, memory); network connection (3G, LTE, …); power (battery level)
Fault tolerance: devices going offline

New assumption: each subproblem is solved to accuracy θ_t^h ∈ [0, 1] (vs. a fixed Θ ∈ [0, 1) in CoCoA)
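One way to picture the per-device accuracy θ_t^h: a device only makes as much local progress as its compute and network budget allows that round. The sketch below is hypothetical (a plain gradient solver stands in for MOCHA's local subproblem solver) and shows the two extremes: a dropped device (θ = 1, zero local work) versus a partial solve (θ < 1).

```python
import numpy as np

def local_step(w, X, y, budget_steps, lr=0.01):
    """Hypothetical local solver: runs only as many gradient steps on the
    device's local least-squares objective as its per-round budget allows.
    budget_steps = 0 plays the role of theta = 1 (no progress: a dropped or
    busy device); larger budgets play the role of smaller theta."""
    for _ in range(budget_steps):
        w = w - lr * (X.T @ (X @ w - y))
    return w

def local_suboptimality(w, X, y):
    """Local objective 0.5 * ||Xw - y||^2, a proxy for subproblem accuracy."""
    r = X @ w - y
    return 0.5 * r @ r

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -1.0, 2.0])
w0 = np.zeros(3)

before = local_suboptimality(w0, X, y)
after_dropped = local_suboptimality(local_step(w0, X, y, budget_steps=0), X, y)
after_partial = local_suboptimality(local_step(w0, X, y, budget_steps=50), X, y)
```

The key point from the slide is that θ_t^h = 1 (zero progress) is still a valid round, which is what makes stragglers and dropped devices tolerable.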

SLIDE 29

CONVERGENCE

New assumption: each subproblem is solved to accuracy θ_t^h ∈ [0, 1], and assume P[θ_t^h := 1] < 1.

Theorem 1. Let ℓ_t be L-Lipschitz. Then a 1/ε rate holds: T ≥ (1 / (1 − Θ̄)) (8L²n²/ε + c̃)

Theorem 2. Let ℓ_t be (1/μ)-smooth. Then a linear rate holds: T ≥ (1 / (1 − Θ̄)) ((μ + n)/μ) log(n/ε)

SLIDE 30

MOCHA: COMMUNICATION-EFFICIENT FEDERATED OPTIMIZATION

Algorithm 1. MOCHA: Federated Multi-Task Learning Framework
 1: Input: data X_t stored on t = 1, …, m devices
 2: Initialize α^(0) := 0, v^(0) := 0
 3: for iterations i = 0, 1, … do
 4:   for iterations h = 0, 1, …, H_i do
 5:     for devices t ∈ {1, 2, …, m} in parallel do
 6:       call the local solver, returning a θ_t^h-approximate solution Δα_t
 7:       update local variables α_t ← α_t + Δα_t
 8:     reduce: v ← v + Σ_t X_t Δα_t
 9:   update Ω centrally using w(v) := ∇R*(v)
10: Compute w(v) := ∇R*(v)
11: return W := [w_1, …, w_m]
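The control flow of Algorithm 1 can be sketched in a runnable simulation, under simplifying assumptions not made in MOCHA itself: squared losses and the simplest regularizer R(W) = (λ/2)‖W‖²_F, so the dual separates per device, the local solver is plain dual coordinate ascent (SDCA-style), and w_t(v) = ∇R*(v) = v/λ. Devices are simulated sequentially, and `mocha_sketch` is a hypothetical name.

```python
import numpy as np

def mocha_sketch(Xs, ys, lam=1.0, outer=20, local_passes=3):
    """Simulates Algorithm 1's loop for R(W) = (lam/2)||W||_F^2 (independent
    tasks) with squared losses: each device runs coordinate ascent on its own
    dual block, maintains v_t = X_t^T alpha_t, and w_t(v_t) = v_t / lam."""
    m = len(Xs)
    alphas = [np.zeros(len(y)) for y in ys]
    vs = [np.zeros(Xs[t].shape[1]) for t in range(m)]
    for _ in range(outer):                      # communication rounds
        for t in range(m):                      # devices (parallel in MOCHA)
            X, y, a, v = Xs[t], ys[t], alphas[t], vs[t]
            for _ in range(local_passes):       # inexact local solve (theta < 1)
                for i in range(len(y)):
                    # exact coordinate maximization of the local dual in alpha_i
                    v_minus = v - a[i] * X[i]   # aggregate without sample i
                    new_ai = (y[i] - X[i] @ v_minus / lam) / (1 + X[i] @ X[i] / lam)
                    v = v_minus + new_ai * X[i]
                    a[i] = new_ai
            alphas[t], vs[t] = a, v
    # primal models from the dual aggregates: w_t = grad R*(v_t) = v_t / lam
    W = np.stack([v / lam for v in vs], axis=1)
    return W

# toy run: two devices whose data come from the same underlying model
rng = np.random.default_rng(3)
Xs = [rng.normal(size=(25, 3)) for _ in range(2)]
w_true = np.array([0.5, -1.0, 1.5])
ys = [X @ w_true for X in Xs]
W = mocha_sketch(Xs, ys)
```

For this separable regularizer each column of W converges to the device's ridge-regression solution; the point of the sketch is only the round structure (local Δα updates, reduce, central w(v) map), not MOCHA's actual coupled Ω step.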

SLIDE 31

STATISTICAL HETEROGENEITY

[Plots: primal sub-optimality vs. estimated time on Human Activity under WiFi, LTE, and 3G networks, comparing MOCHA, CoCoA, Mb-SDCA, and Mb-SGD]

MOCHA IS ROBUST TO STATISTICAL HETEROGENEITY
MOCHA & COCOA PERFORM PARTICULARLY WELL IN HIGH-COMMUNICATION SETTINGS

SLIDE 32

SYSTEMS HETEROGENEITY

[Plots: primal sub-optimality vs. estimated time on Vehicle Sensor under low and high systems heterogeneity, comparing MOCHA, CoCoA, Mb-SDCA, and Mb-SGD]

MOCHA SIGNIFICANTLY OUTPERFORMS ALL COMPETITORS [BY 2 ORDERS OF MAGNITUDE]

SLIDE 33

FAULT TOLERANCE

[Plots: primal sub-optimality vs. estimated time on Google Glass, for the W step alone and for the full method]

MOCHA IS ROBUST TO DROPPED NODES

SLIDE 34

OUTLINE

Statistical Challenges: Unbalanced · Non-IID · Underlying Structure
Systems Challenges: Massively Distributed · Node Heterogeneity

SLIDE 35

Virginia Smith

Stanford / CMU

cs.berkeley.edu/~vsmith

CODE & PAPERS: WWW.SYSML.CC