MOCHA: Federated Multi-Task Learning
Virginia Smith Stanford / CMU
Chao-Kai Chiang · USC Maziar Sanjabi · USC Ameet Talwalkar · CMU
NIPS ‘17
MACHINE LEARNING WORKFLOW
data & problem → machine learning model

$\min_{w} \sum_{i=1}^{n} \ell(w, x_i) + g(w)$
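Read as code, the workflow objective is a sum of per-example losses plus a regularizer. A minimal sketch, assuming squared loss $\ell(w, x_i) = \frac{1}{2}(w^T x_i - y_i)^2$ and $g(w) = \frac{\lambda}{2}\|w\|^2$ (these concrete choices are illustrative, not from the talk):

```python
import numpy as np

def objective(w, X, y, lam=0.1):
    """sum_i loss(w, x_i) + g(w), with squared loss and L2 regularizer."""
    losses = 0.5 * (X @ w - y) ** 2      # per-example loss l(w, x_i)
    reg = 0.5 * lam * (w @ w)            # g(w) = (lam/2) ||w||^2
    return losses.sum() + reg

def gradient(w, X, y, lam=0.1):
    """Gradient of the objective, for a simple descent step."""
    return X.T @ (X @ w - y) + lam * w
```

A gradient step w ← w − η·gradient(w) decreases this objective for a small enough step size η.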
MACHINE LEARNING WORKFLOW (IN PRACTICE)
data & problem → machine learning model → systems setting
BEYOND THE DATACENTER
Statistical challenges: unbalanced, non-IID, underlying structure
Systems challenges: massively distributed, node heterogeneity
OUTLINE
Statistical challenges: unbalanced, non-IID, underlying structure
Systems challenges: massively distributed, node heterogeneity
A GLOBAL APPROACH: learn one shared model W across all devices [MMRHA, AISTATS 16]
A LOCAL APPROACH: learn a separate model W1, …, W12 on each device, with no sharing
OUR APPROACH: PERSONALIZED MODELS
Learn separate models W1, …, W12, coupled through a task relationship regularizer.
MULTI-TASK LEARNING
Captures structure such as: all tasks related, outlier tasks, clusters / groups, asymmetric relationships [ZCY, SDM 2012]

$\min_{W, \Omega} \sum_{t=1}^{m} \sum_{i=1}^{n_t} \ell_t(w_t, x_t^i) + \mathcal{R}(W, \Omega)$

(losses + task relationship regularizer over the models)
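In code, the multi-task objective couples the per-task models $W = [w_1, \dots, w_m]$ through $\mathcal{R}(W, \Omega)$. A sketch, assuming squared loss and the illustrative choice $\mathcal{R}(W, \Omega) = \lambda\,\mathrm{tr}(W \Omega W^T)$ (the concrete form of $\mathcal{R}$ is an assumption; the slide leaves it general):

```python
import numpy as np

def mtl_objective(W, Omega, Xs, ys, lam=0.1):
    """sum_t sum_i loss_t(w_t, x_t^i) + lam * tr(W Omega W^T).

    W:     (d, m) matrix whose columns are the per-task models w_t.
    Omega: (m, m) task relationship matrix.
    Xs/ys: per-task data, Xs[t] of shape (n_t, d).
    """
    loss = sum(0.5 * np.sum((X @ W[:, t] - y) ** 2)
               for t, (X, y) in enumerate(zip(Xs, ys)))
    reg = lam * np.trace(W @ Omega @ W.T)
    return loss + reg
```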
FEDERATED DATASETS
Human Activity · Google Glass · Land Mine · Vehicle Sensor
PREDICTION ERROR
Dataset          Global        Local         MTL
Human Activity   2.23 (0.30)   1.34 (0.21)   0.46 (0.11)
Google Glass     5.34 (0.26)   4.92 (0.26)   2.02 (0.15)
Land Mine        27.72 (1.08)  23.43 (0.77)  20.09 (1.04)
Vehicle Sensor   13.4 (0.26)   7.81 (0.13)   6.59 (0.21)
OUTLINE
Statistical challenges: unbalanced, non-IID, underlying structure
Systems challenges: massively distributed, node heterogeneity
GOAL: FEDERATED OPTIMIZATION FOR MULTI-TASK LEARNING

$\min_{W, \Omega} \sum_{t=1}^{m} \sum_{i=1}^{n_t} \ell_t(w_t^T x_t^i) + \mathcal{R}(W, \Omega)$

Solve for W and Ω in an alternating fashion: Ω can be updated centrally, while W must be solved in the federated setting.

Challenges: communication is expensive; statistical & systems heterogeneity; stragglers; fault tolerance.
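The central Ω step often has a closed form. A sketch under the assumption that $\mathcal{R}$ penalizes $\mathrm{tr}(W \Omega^{-1} W^T)$, for which a standard update from the multi-task literature (an assumption here, not stated on the slide) is $\Omega := (W^T W)^{1/2} / \mathrm{tr}((W^T W)^{1/2})$:

```python
import numpy as np

def update_omega(W, eps=1e-8):
    """Central Omega update: Omega = (W^T W)^{1/2} / tr((W^T W)^{1/2}).

    Computed via an eigendecomposition of the symmetric PSD matrix W^T W;
    eps guards against tiny negative eigenvalues from round-off.
    """
    vals, vecs = np.linalg.eigh(W.T @ W)
    root = (vecs * np.sqrt(np.clip(vals, eps, None))) @ vecs.T
    return root / np.trace(root)
```

The resulting Ω is symmetric with unit trace; W is then re-solved with Ω held fixed, and the two steps alternate.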
COCOA: COMMUNICATION-EFFICIENT DISTRIBUTED OPTIMIZATION
[Figure: spectrum trading off communication against local computation; mini-batch methods sit at the high-communication end]
COCOA: PRIMAL-DUAL FRAMEWORK

PRIMAL: $\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \ell(w^T x_i) + g(w)$

DUAL: $\max_{\alpha \in \mathbb{R}^n} -\frac{1}{n} \sum_{i=1}^{n} \ell^*(-\alpha_i) - g^*(X\alpha)$

The dual decomposes across machines into data-local subproblems $\sum_{k=1}^{K} \tilde{g}^*(X_{[k]}, \alpha_{[k]})$, with local updates $\alpha_{[k]}^{(t)} \to \alpha_{[k]}^{(t+1)}$.
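For the common case $g(w) = \frac{\lambda}{2}\|w\|^2$, the dual variables map to a primal model via $w(\alpha) = \frac{1}{\lambda n} X^T \alpha$, and weak duality gives $D(\alpha) \le P(w)$ with equality at the optimum. A sketch with squared loss (the loss and the L2 regularizer are illustrative assumptions; rows of X are the examples $x_i$):

```python
import numpy as np

def w_of_alpha(alpha, X, lam):
    """Primal-dual mapping w(alpha) = X^T alpha / (lam * n)."""
    return X.T @ alpha / (lam * X.shape[0])

def primal(w, X, y, lam):
    """P(w) = (1/n) sum_i l(w^T x_i) + (lam/2)||w||^2, squared loss."""
    return 0.5 * np.mean((X @ w - y) ** 2) + 0.5 * lam * (w @ w)

def dual(alpha, X, y, lam):
    """D(alpha): negated conjugate losses minus the regularizer term."""
    w = w_of_alpha(alpha, X, lam)
    return np.mean(alpha * y - 0.5 * alpha ** 2) - 0.5 * lam * (w @ w)
```

At the ridge optimum $\alpha_i^* = y_i - w^{*T} x_i$ the duality gap $P(w(\alpha^*)) - D(\alpha^*)$ closes to zero; CoCoA runs on the dual and uses this mapping to recover w.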
COCOA: COMMUNICATION PARAMETER
Trade off the amount of local computation against communication.
Main assumption: each subproblem is solved to accuracy Θ ∈ [0, 1), spanning a spectrum from solving exactly (Θ = 0) to solving inexactly (Θ → 1).
MOCHA: COMMUNICATION-EFFICIENT FEDERATED OPTIMIZATION

$\min_{W, \Omega} \sum_{t=1}^{m} \sum_{i=1}^{n_t} \ell_t(w_t^T x_t^i) + \mathcal{R}(W, \Omega)$

Solve for W and Ω in an alternating fashion; modify CoCoA to solve for W in the federated setting.

Dual: $\min_{\alpha} \sum_{t=1}^{m} \sum_{i=1}^{n_t} \ell_t^*(-\alpha_t^i) + \mathcal{R}^*(X\alpha)$

Per-device subproblem: $\min_{\Delta\alpha_t} \sum_{i=1}^{n_t} \ell_t^*(-\alpha_t^i - \Delta\alpha_t^i) + \langle w_t(\alpha), X_t \Delta\alpha_t \rangle + \frac{\sigma'}{2} \| X_t \Delta\alpha_t \|^2_{M_t}$
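To make the per-device subproblem concrete: with squared loss, $M_t = I$, and $\sigma' = 1$ (all assumptions for illustration; columns of $X_t$ are the examples $x_t^i$), it is a quadratic in $\Delta\alpha_t$ that each device can attack with a few gradient steps, yielding exactly the inexact, θ-approximate solutions the framework allows:

```python
import numpy as np

def subproblem(dalpha, alpha, Xt, yt, wt, sigma=1.0):
    """sum_i l*(-(alpha_i + dalpha_i)) + <w_t, X_t dalpha> + (sigma/2)||X_t dalpha||^2.

    Squared-loss conjugate: l*(-a) = 0.5 a^2 - a y. Xt has shape (d, n_t).
    """
    a = alpha + dalpha
    v = Xt @ dalpha
    return np.sum(0.5 * a ** 2 - a * yt) + wt @ v + 0.5 * sigma * (v @ v)

def local_solve(alpha, Xt, yt, wt, sigma=1.0, steps=300, lr=0.1):
    """Inexact local solver: plain gradient descent on the subproblem."""
    d = np.zeros_like(alpha)
    for _ in range(steps):
        g = (alpha + d - yt) + Xt.T @ wt + sigma * (Xt.T @ (Xt @ d))
        d -= lr * g
    return d
```

Fewer local steps correspond to a looser accuracy θ, which MOCHA tolerates per device and per iteration.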
MOCHA: PER-DEVICE, PER-ITERATION APPROXIMATIONS
Stragglers (Statistical heterogeneity) Difficulty of solving subproblem Size of local dataset Stragglers (Systems heterogeneity) Hardware (CPU, memory) Network connection (3G, LTE, …) Power (battery level) Fault tolerance Devices going offline
New assumption: each subproblem is solved to accuracy
θh
t ∈ [0, 1]
θ ∈ [0, 1)
CONVERGENCE
New assumption: each subproblem is solved to accuracy $\theta_t^h \in [0, 1]$, and assume $\mathbb{P}[\theta_t^h := 1] < 1$ (no device fails forever).

Theorem 1 ($1/\epsilon$ rate). For L-Lipschitz losses $\ell_t$: $T \geq \frac{1}{1 - \bar{\Theta}} \left( \frac{8 L^2 n^2}{\epsilon} + \tilde{c} \right)$

Theorem 2 (linear rate). For $(1/\mu)$-smooth losses $\ell_t$: $T \geq \frac{1}{1 - \bar{\Theta}} \, \frac{\mu + n}{\mu} \log \frac{n}{\epsilon}$
MOCHA: COMMUNICATION-EFFICIENT FEDERATED OPTIMIZATION

Algorithm 1 MOCHA: Federated Multi-Task Learning Framework
 1: Input: data X_t stored on t = 1, …, m devices
 2: Initialize α^(0) := 0, v^(0) := 0
 3: for iterations i = 0, 1, … do
 4:   for iterations h = 0, 1, …, H_i do
 5:     for devices t ∈ {1, 2, …, m} in parallel do
 6:       call local solver, returning a θ_t^h-approximate solution Δα_t
 7:       update local variables α_t ← α_t + Δα_t
 8:     reduce: v ← v + Σ_t X_t Δα_t
 9:   update Ω centrally using w(v) := ∇R*(v)
10: compute w(v) := ∇R*(v)
11: return W := [w_1, …, w_m]
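The loop structure above can be simulated end-to-end in the decoupled special case $\mathcal{R}(W, \Omega) = \frac{\lambda}{2}\|W\|_F^2$ (Ω fixed to I, so the central Ω step is skipped), with squared loss and an inexact gradient-based local solver. All of these concrete choices are assumptions for illustration:

```python
import numpy as np

def mocha_sim(Xs, ys, lam=0.5, outer=30, inner=5, lr=0.1):
    """Sketch of Algorithm 1: per-device dual variables, inexact local solves,
    and the mapping w_t = X_t alpha_t / lam (grad R* of v_t for this R).

    Xs[t] has shape (d, n_t), with columns x_t^i.
    """
    alphas = [np.zeros(len(y)) for y in ys]
    for _ in range(outer):                            # communication rounds
        for t, (Xt, yt) in enumerate(zip(Xs, ys)):    # devices (parallel in reality)
            a = alphas[t]
            wt = Xt @ a / lam                         # w_t(alpha) from current duals
            d = np.zeros_like(a)
            for _ in range(inner):                    # inexact local subproblem solve
                g = (a + d - yt) + Xt.T @ wt + Xt.T @ (Xt @ d) / lam
                d -= lr * g
            alphas[t] = a + d
    return [Xs[t] @ alphas[t] / lam for t in range(len(Xs))]
```

With the tasks decoupled, each w_t converges to its per-task ridge solution $(X_t X_t^T + \lambda I)^{-1} X_t y_t$; the coupled case replaces the mapping $w(v) := \nabla R^*(v)$ with one that mixes tasks through Ω.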
STATISTICAL HETEROGENEITY
[Figure: primal sub-optimality vs. estimated time on Human Activity under WiFi, LTE, and 3G networks; methods: MOCHA, CoCoA, Mb-SDCA, Mb-SGD]
MOCHA IS ROBUST TO STATISTICAL HETEROGENEITY; MOCHA & COCOA PERFORM PARTICULARLY WELL IN HIGH-COMMUNICATION SETTINGS
SYSTEMS HETEROGENEITY
[Figure: primal sub-optimality vs. estimated time on Vehicle Sensor under low and high systems heterogeneity; methods: MOCHA, CoCoA, Mb-SDCA, Mb-SGD]
MOCHA SIGNIFICANTLY OUTPERFORMS ALL COMPETITORS [BY 2 ORDERS OF MAGNITUDE]
FAULT TOLERANCE
[Figure: primal sub-optimality vs. estimated time on Google Glass, with faults injected into the W step alone and into the full method]
MOCHA IS ROBUST TO DROPPED NODES
Virginia Smith · Stanford / CMU
cs.berkeley.edu/~vsmith
Code & papers: WWW.SYSML.CC