MOCHA: Federated Multi-Task Learning NIPS 17 Virginia Smith - - PowerPoint PPT Presentation



SLIDE 1

MOCHA: Federated Multi-Task Learning

Virginia Smith · Stanford / CMU

Chao-Kai Chiang · USC | Maziar Sanjabi · USC | Ameet Talwalkar · CMU

NIPS '17

SLIDE 2

MACHINE LEARNING WORKFLOW

data & problem → machine learning model → optimization algorithm

min_w Σ_{i=1}^n ℓ(w, x_i) + g(w)
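The objective on this slide is standard regularized empirical risk minimization. As an illustrative sketch (not from the talk), here is a tiny NumPy gradient-descent solver for one concrete instance: squared loss with the L2 regularizer g(w) = (λ/2)‖w‖².

```python
import numpy as np

def fit_erm(X, y, lam=0.1, lr=0.01, steps=2000):
    """Minimize sum_i l(w, x_i) + g(w) with the squared loss
    l(w, x_i) = 0.5 * (w @ x_i - y_i)^2 and g(w) = 0.5 * lam * ||w||^2."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        # gradient of the summed losses plus gradient of the regularizer
        grad = X.T @ (X @ w - y) + lam * w
        w -= lr * grad
    return w

# toy data: y is an exact linear function of x, so w_hat should recover w_true
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
w_hat = fit_erm(X, y)
```

With enough data relative to the regularization strength, the recovered `w_hat` is close to `w_true` up to a small shrinkage from g(w).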

SLIDE 3

MACHINE LEARNING WORKFLOW

data & problem → machine learning model → optimization algorithm

min_w Σ_{i=1}^n ℓ(w, x_i) + g(w)

^ IN PRACTICE: the systems setting also enters the workflow

SLIDE 4

how can we perform fast distributed optimization?

SLIDE 5

BEYOND THE DATACENTER

Massively Distributed · Node Heterogeneity · Unbalanced · Non-IID · Underlying Structure

SLIDE 6

BEYOND THE DATACENTER

Systems Challenges: Massively Distributed · Node Heterogeneity
Statistical Challenges: Unbalanced · Non-IID · Underlying Structure

SLIDE 7

MACHINE LEARNING WORKFLOW

data & problem → machine learning model → optimization algorithm

min_w Σ_{i=1}^n ℓ(w, x_i) + g(w)

^ IN PRACTICE: the systems setting also enters the workflow

SLIDE 8

MACHINE LEARNING WORKFLOW

data & problem → machine learning model → optimization algorithm

min_w Σ_{i=1}^n ℓ(w, x_i) + g(w)

^ IN PRACTICE: the systems setting also enters the workflow

SLIDE 9

OUTLINE

Statistical Challenges: Unbalanced · Non-IID · Underlying Structure
Systems Challenges: Massively Distributed · Node Heterogeneity

SLIDE 10

OUTLINE

Statistical Challenges: Unbalanced · Non-IID · Underlying Structure
Systems Challenges: Massively Distributed · Node Heterogeneity

SLIDE 11

A GLOBAL APPROACH

One model W shared across all devices [MMRHA, AISTATS 16]

SLIDE 12

A LOCAL APPROACH

Separate models W1, W2, …, W12, one per device

SLIDE 13

OUR APPROACH: PERSONALIZED MODELS

Models W1, W2, …, W12, one per device

SLIDE 14

OUR APPROACH: PERSONALIZED MODELS

Models W1, W2, …, W12, one per device

SLIDE 15

MULTI-TASK LEARNING

Fit per-task models (losses ℓ_t) coupled by a task-relationship regularizer R(W, Ω), which can capture: all tasks related · outlier tasks · clusters / groups · asymmetric relationships [ZCY, SDM 2012]

min_{W,Ω} Σ_{t=1}^m Σ_{i=1}^{n_t} ℓ_t(w_t, x_t^i) + R(W, Ω)
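To make the formula concrete, here is a hedged sketch (the names and the specific regularizer are illustrative choices, not from the talk) that evaluates the multi-task objective with squared losses and one common task-relationship regularizer, R(W, Ω) = λ · tr(W Ω Wᵀ):

```python
import numpy as np

def mtl_objective(W, Omega, Xs, ys, lam=0.1):
    """Multi-task objective: sum_t sum_i l_t(w_t, x_t^i) + R(W, Omega),
    with squared losses and the common choice R(W, Omega) = lam * tr(W Omega W^T).
    W: (d, m) with one column per task; Omega: (m, m) task-relationship matrix."""
    loss = 0.0
    for t, (X, y) in enumerate(zip(Xs, ys)):
        r = X @ W[:, t] - y
        loss += 0.5 * r @ r
    reg = lam * np.trace(W @ Omega @ W.T)
    return loss + reg

# toy check: data generated exactly by W, so the loss term vanishes and only
# the regularizer contributes
rng = np.random.default_rng(1)
d, m = 4, 2
Xs = [rng.normal(size=(30, d)) for _ in range(m)]
W = rng.normal(size=(d, m))
ys = [Xs[t] @ W[:, t] for t in range(m)]
Omega = np.eye(m)  # identity: tasks penalized independently
val = mtl_objective(W, Omega, Xs, ys)
```

With Omega = I the regularizer reduces to λ‖W‖²_F; other choices of Ω encode clusters, outliers, or asymmetric task relationships.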

SLIDE 16

FEDERATED DATASETS

Human Activity · Google Glass · Land Mine · Vehicle Sensor

SLIDE 17

PREDICTION ERROR

Dataset        | Global       | Local        | MTL
Human Activity | 2.23 (0.30)  | 1.34 (0.21)  | 0.46 (0.11)
Google Glass   | 5.34 (0.26)  | 4.92 (0.26)  | 2.02 (0.15)
Land Mine      | 27.72 (1.08) | 23.43 (0.77) | 20.09 (1.04)
Vehicle Sensor | 13.4 (0.26)  | 7.81 (0.13)  | 6.59 (0.21)

SLIDE 18

OUTLINE

Statistical Challenges: Unbalanced · Non-IID · Underlying Structure
Systems Challenges: Massively Distributed · Node Heterogeneity

SLIDE 19

OUTLINE

Statistical Challenges: Unbalanced · Non-IID · Underlying Structure
Systems Challenges: Massively Distributed · Node Heterogeneity

SLIDE 20

GOAL: FEDERATED OPTIMIZATION FOR MULTI-TASK LEARNING

min_{W,Ω} Σ_{t=1}^m Σ_{i=1}^{n_t} ℓ_t(w_t^T x_t^i) + R(W, Ω)

Solve for W and Ω in an alternating fashion: Ω can be updated centrally; W needs to be solved in the federated setting.

Challenges: communication is expensive · statistical & systems heterogeneity · stragglers · fault tolerance

SLIDE 21

GOAL: FEDERATED OPTIMIZATION FOR MULTI-TASK LEARNING

min_{W,Ω} Σ_{t=1}^m Σ_{i=1}^{n_t} ℓ_t(w_t^T x_t^i) + R(W, Ω)

Solve for W and Ω in an alternating fashion: Ω can be updated centrally; W needs to be solved in the federated setting.

Challenges: communication is expensive · statistical & systems heterogeneity · stragglers · fault tolerance

Idea: modify a communication-efficient method from the data center setting to handle: ✔ multi-task learning ✔ stragglers ✔ fault tolerance
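The alternating scheme described here can be sketched in simulation. This is only an illustration under assumptions not stated on the slide: squared losses, the regularizer λ·tr(W Ω Wᵀ) for the W step, and a Zhang/Yeung-style closed form for the central Ω step; `w_step` and `omega_step` are hypothetical helper names, and in the real setting the W step would run across devices.

```python
import numpy as np

def w_step(W, Omega, Xs, ys, lam=0.1, lr=0.01, steps=200):
    """W step (federated in MOCHA; simulated centrally here): gradient descent
    on sum of squared losses + lam * tr(W Omega W^T), with Omega held fixed."""
    for _ in range(steps):
        G = np.zeros_like(W)
        for t, (X, y) in enumerate(zip(Xs, ys)):
            G[:, t] = X.T @ (X @ W[:, t] - y)   # per-device loss gradient
        G += lam * W @ (Omega + Omega.T)        # gradient of tr(W Omega W^T)
        W = W - lr * G
    return W

def omega_step(W, eps=1e-6):
    """Central Omega step: one common closed form builds Omega from the task
    covariance W^T W (a normalized matrix square root); illustrative only."""
    M = W.T @ W + eps * np.eye(W.shape[1])
    vals, vecs = np.linalg.eigh(M)              # M is symmetric PSD
    S = vecs @ np.diag(np.sqrt(np.clip(vals, 0, None))) @ vecs.T
    return S / np.trace(S)                      # normalize to trace 1

def alternating_minimization(W, Omega, Xs, ys, outer_iters=10):
    """Skeleton of the slide's scheme: W solved across devices, Omega centrally."""
    for _ in range(outer_iters):
        W = w_step(W, Omega, Xs, ys)
        Omega = omega_step(W)
    return W, Omega

# toy run: two tasks generated from one shared model
rng = np.random.default_rng(4)
d, m = 4, 2
Xs = [rng.normal(size=(30, d)) for _ in range(m)]
w_shared = rng.normal(size=d)
ys = [X @ w_shared for X in Xs]
W, Omega = alternating_minimization(np.zeros((d, m)), np.eye(m) / m, Xs, ys)
```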


SLIDE 22

COCOA: COMMUNICATION-EFFICIENT DISTRIBUTED OPTIMIZATION

mini-batch methods ⟷ one-shot communication

key idea: control the amount of communication

SLIDE 23

COCOA: PRIMAL-DUAL FRAMEWORK

PRIMAL: min_{w ∈ R^d} (1/n) Σ_{i=1}^n ℓ(w^T x_i) + g(w)

DUAL: max_{α ∈ R^n} −(1/n) Σ_{i=1}^n ℓ*(−α_i) − g*(Xα)

The dual is split across K machines into local subproblems Σ_{k=1}^K g̃*(X_[k], α_[k]); each round, machine k updates its block α_[k]^(t) → α_[k]^(t+1).

SLIDE 24

COCOA: PRIMAL-DUAL FRAMEWORK

PRIMAL: min_{w ∈ R^d} (1/n) Σ_{i=1}^n ℓ(w^T x_i) + g(w)

DUAL: max_{α ∈ R^n} −(1/n) Σ_{i=1}^n ℓ*(−α_i) − g*(Xα)

The dual is split across K machines into local subproblems Σ_{k=1}^K g̃*(X_[k], α_[k]); each round, machine k updates its block α_[k]^(t) → α_[k]^(t+1).

challenge #1: extend to the MTL setup

SLIDE 25

COCOA: COMMUNICATION PARAMETER

Θ ∈ [0, 1) ≈ amount of local computation vs. communication

Main assumption: each subproblem is solved to accuracy Θ (Θ = 0: solve exactly; Θ → 1: solve inexactly)

SLIDE 26

COCOA: COMMUNICATION PARAMETER

Θ ∈ [0, 1) ≈ amount of local computation vs. communication

Main assumption: each subproblem is solved to accuracy Θ (Θ = 0: solve exactly; Θ → 1: solve inexactly)

challenge #2: make communication more flexible

SLIDE 27

MOCHA: COMMUNICATION-EFFICIENT FEDERATED OPTIMIZATION

min_{W,Ω} Σ_{t=1}^m Σ_{i=1}^{n_t} ℓ_t(w_t^T x_t^i) + R(W, Ω)

Solve for W and Ω in an alternating fashion; modify CoCoA to solve for W in the federated setting.

Dual: min_α Σ_{t=1}^m Σ_{i=1}^{n_t} ℓ_t*(−α_t^i) + R*(Xα)

Per-device subproblem: min_{Δα_t} Σ_{i=1}^{n_t} ℓ_t*(−α_t^i − Δα_t^i) + ⟨w_t(α), X_t Δα_t⟩ + (σ′/2) ‖X_t Δα_t‖²_{M_t}

SLIDE 28

MOCHA: PER-DEVICE, PER-ITERATION APPROXIMATIONS

Stragglers (statistical heterogeneity): difficulty of solving the subproblem; size of the local dataset
Stragglers (systems heterogeneity): hardware (CPU, memory); network connection (3G, LTE, …); power (battery level)
Fault tolerance: devices going offline

New assumption: each subproblem is solved to accuracy θ_t^h ∈ [0, 1] (vs. a fixed Θ ∈ [0, 1) in CoCoA)
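One way to picture the per-device accuracy θ_t^h: a device only makes as much local progress as its compute and network budget allows that round. The sketch below is hypothetical (a plain gradient solver stands in for MOCHA's local subproblem solver) and shows the two extremes: a dropped device (θ = 1, zero local work) versus a partial solve (θ < 1).

```python
import numpy as np

def local_step(w, X, y, budget_steps, lr=0.01):
    """Hypothetical local solver: runs only as many gradient steps on the
    device's local least-squares objective as its per-round budget allows.
    budget_steps = 0 plays the role of theta = 1 (no progress: a dropped or
    busy device); larger budgets play the role of smaller theta."""
    for _ in range(budget_steps):
        w = w - lr * (X.T @ (X @ w - y))
    return w

def local_suboptimality(w, X, y):
    """Local objective 0.5 * ||Xw - y||^2, a proxy for subproblem accuracy."""
    r = X @ w - y
    return 0.5 * r @ r

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -1.0, 2.0])
w0 = np.zeros(3)

before = local_suboptimality(w0, X, y)
after_dropped = local_suboptimality(local_step(w0, X, y, budget_steps=0), X, y)
after_partial = local_suboptimality(local_step(w0, X, y, budget_steps=50), X, y)
```

The key point from the slide is that θ_t^h = 1 (zero progress) is still a valid round, which is what makes stragglers and dropped devices tolerable.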

SLIDE 29

CONVERGENCE

New assumption: each subproblem is solved to accuracy θ_t^h ∈ [0, 1], and assume P[θ_t^h := 1] < 1.

Theorem 1. Let ℓ_t be L-Lipschitz. Then a 1/ε rate holds: T ≥ (1 / (1 − Θ̄)) (8L²n²/ε + c̃)

Theorem 2. Let ℓ_t be (1/μ)-smooth. Then a linear rate holds: T ≥ (1 / (1 − Θ̄)) ((μ + n)/μ) log(n/ε)

SLIDE 30

MOCHA: COMMUNICATION-EFFICIENT FEDERATED OPTIMIZATION

Algorithm 1. MOCHA: Federated Multi-Task Learning Framework
 1: Input: data X_t stored on t = 1, …, m devices
 2: Initialize α^(0) := 0, v^(0) := 0
 3: for iterations i = 0, 1, … do
 4:   for iterations h = 0, 1, …, H_i do
 5:     for devices t ∈ {1, 2, …, m} in parallel do
 6:       call the local solver, returning a θ_t^h-approximate solution Δα_t
 7:       update local variables α_t ← α_t + Δα_t
 8:     reduce: v ← v + Σ_t X_t Δα_t
 9:   update Ω centrally using w(v) := ∇R*(v)
10: Compute w(v) := ∇R*(v)
11: return W := [w_1, …, w_m]
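The control flow of Algorithm 1 can be sketched in a runnable simulation, under simplifying assumptions not made in MOCHA itself: squared losses and the simplest regularizer R(W) = (λ/2)‖W‖²_F, so the dual separates per device, the local solver is plain dual coordinate ascent (SDCA-style), and w_t(v) = ∇R*(v) = v/λ. Devices are simulated sequentially, and `mocha_sketch` is a hypothetical name.

```python
import numpy as np

def mocha_sketch(Xs, ys, lam=1.0, outer=20, local_passes=3):
    """Simulates Algorithm 1's loop for R(W) = (lam/2)||W||_F^2 (independent
    tasks) with squared losses: each device runs coordinate ascent on its own
    dual block, maintains v_t = X_t^T alpha_t, and w_t(v_t) = v_t / lam."""
    m = len(Xs)
    alphas = [np.zeros(len(y)) for y in ys]
    vs = [np.zeros(Xs[t].shape[1]) for t in range(m)]
    for _ in range(outer):                      # communication rounds
        for t in range(m):                      # devices (parallel in MOCHA)
            X, y, a, v = Xs[t], ys[t], alphas[t], vs[t]
            for _ in range(local_passes):       # inexact local solve (theta < 1)
                for i in range(len(y)):
                    # exact coordinate maximization of the local dual in alpha_i
                    v_minus = v - a[i] * X[i]   # aggregate without sample i
                    new_ai = (y[i] - X[i] @ v_minus / lam) / (1 + X[i] @ X[i] / lam)
                    v = v_minus + new_ai * X[i]
                    a[i] = new_ai
            alphas[t], vs[t] = a, v
    # primal models from the dual aggregates: w_t = grad R*(v_t) = v_t / lam
    W = np.stack([v / lam for v in vs], axis=1)
    return W

# toy run: two devices whose data come from the same underlying model
rng = np.random.default_rng(3)
Xs = [rng.normal(size=(25, 3)) for _ in range(2)]
w_true = np.array([0.5, -1.0, 1.5])
ys = [X @ w_true for X in Xs]
W = mocha_sketch(Xs, ys)
```

For this separable regularizer each column of W converges to the device's ridge-regression solution; the point of the sketch is only the round structure (local Δα updates, reduce, central w(v) map), not MOCHA's actual coupled Ω step.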

SLIDE 31

STATISTICAL HETEROGENEITY

[Plots: primal sub-optimality vs. estimated time on Human Activity under WiFi, LTE, and 3G networks, comparing MOCHA, CoCoA, Mb-SDCA, and Mb-SGD]

MOCHA IS ROBUST TO STATISTICAL HETEROGENEITY
MOCHA & COCOA PERFORM PARTICULARLY WELL IN HIGH-COMMUNICATION SETTINGS

SLIDE 32

SYSTEMS HETEROGENEITY

[Plots: primal sub-optimality vs. estimated time on Vehicle Sensor under low and high systems heterogeneity, comparing MOCHA, CoCoA, Mb-SDCA, and Mb-SGD]

MOCHA SIGNIFICANTLY OUTPERFORMS ALL COMPETITORS [BY 2 ORDERS OF MAGNITUDE]

SLIDE 33

FAULT TOLERANCE

[Plots: primal sub-optimality vs. estimated time on Google Glass, for the W step alone and for the full method]

MOCHA IS ROBUST TO DROPPED NODES

SLIDE 34

OUTLINE

Statistical Challenges: Unbalanced · Non-IID · Underlying Structure
Systems Challenges: Massively Distributed · Node Heterogeneity

SLIDE 35

Virginia Smith

Stanford / CMU

cs.berkeley.edu/~vsmith

CODE & PAPERS: WWW.SYSML.CC