Distributed Machine Learning and Big Data - Sourangshu Bhattacharya (PowerPoint PPT Presentation)

SLIDE 1

Distributed Machine Learning and Big Data

Sourangshu Bhattacharya

  • Dept. of Computer Science and Engineering, IIT Kharagpur. http://cse.iitkgp.ac.in/~sourangshu/

August 21, 2015

SLIDE 2

Outline

1. Machine Learning and Big Data
     • Support Vector Machines
     • Stochastic Sub-gradient Descent

2. Distributed Optimization
     • ADMM
     • Convergence
     • Distributed Loss Minimization
     • Results
     • Development of ADMM

3. Applications and Extensions
     • Weighted Parameter Averaging
     • Fully-distributed SVM

SLIDE 3

Machine Learning and Big Data

What is Big Data?

  • 6 billion web queries per day.
  • 10 billion display advertisements per day.
  • 30 billion text ads per day.
  • 150 million credit card transactions per day.
  • 100 billion emails per day.

SLIDE 4

Machine Learning and Big Data

Machine Learning on Big Data

  • Classification - spam / no spam - 100B emails.
  • Multi-label classification - image tagging - 14M images, 10K tags.
  • Regression - CTR estimation - 10B ad views.
  • Ranking - web search - 6B queries.
  • Recommendation - online shopping - 1.7B views in the US.

SLIDE 5

Machine Learning and Big Data

Classification example

  • Email spam classification.
  • Features (u_i): vector of counts of all words.
  • No. of features (d): words in the vocabulary (~ 100,000).
  • No. of non-zero features per email: ~ 100.
  • No. of emails per day: 100 M.
  • Size of training set using 30 days of data: 6 TB (assuming 20 B per value).
  • Time taken to read the data once: 41.67 hrs (at 20 MB per second).
  • Solution: use multiple computers.

SLIDE 6

Machine Learning and Big Data

Big Data Paradigm

  • 3 V's - Volume, Variety, Velocity.
  • Distributed system. Chance of failure:

      Computers                        1      10     100
      Chance of a failure in an hour   0.01   0.09   0.63

  • Communication efficiency - data locality.
  • Many systems: Hadoop, Spark, GraphLab, etc.
  • Goal: implement Machine Learning algorithms on Big Data systems.
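The failure figures above are consistent with each machine failing independently with probability about 0.01 per hour; a quick sketch of that back-of-the-envelope calculation (the 0.01 per-machine rate is an assumption read off the table, not stated on the slide):

```python
# Probability that at least one of n machines fails within an hour,
# assuming each machine fails independently with probability p per hour.
p = 0.01  # assumed per-machine failure rate, read off the table above

for n in (1, 10, 100):
    p_any_failure = 1.0 - (1.0 - p) ** n
    print(f"{n:4d} machines -> P(at least one failure per hour) = {p_any_failure:.2f}")
# Prints roughly 0.01, 0.10 and 0.63, close to the slide's 0.01 / 0.09 / 0.63.
```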

SLIDE 7

Machine Learning and Big Data

Binary Classification Problem

  • A set of labeled datapoints: S = {(u_i, v_i), i = 1, ..., n}, with u_i ∈ R^d and v_i ∈ {+1, −1}.
  • Linear predictor function: v = sign(x^T u).
  • Error function: E = Σ_{i=1}^n 1(v_i x^T u_i ≤ 0).

SLIDE 8

Machine Learning and Big Data

Logistic Regression

  • Probability of v is given by: P(v | u, x) = σ(v x^T u) = 1 / (1 + e^{−v x^T u}).
  • Learning problem: given dataset S, estimate x.
  • Maximizing the regularized log-likelihood (equivalently, minimizing the regularized negative log-likelihood):

      x* = argmin_x Σ_{i=1}^n log(1 + e^{−v_i x^T u_i}) + (λ/2) x^T x
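As a concrete reference, a minimal NumPy sketch of this regularized logistic-regression objective and its gradient (variable names follow the slides; the toy data at the bottom is made up purely for illustration):

```python
import numpy as np

def logistic_objective(x, U, v, lam):
    """sum_i log(1 + exp(-v_i x^T u_i)) + (lam/2) x^T x, with the u_i as rows of U."""
    margins = v * (U @ x)
    return np.sum(np.logaddexp(0.0, -margins)) + 0.5 * lam * (x @ x)

def logistic_gradient(x, U, v, lam):
    """Gradient of the objective above with respect to x."""
    margins = v * (U @ x)
    coeffs = -v / (1.0 + np.exp(margins))   # d/dm log(1 + e^{-m}) times v_i
    return U.T @ coeffs + lam * x

# Tiny synthetic example (illustrative only).
rng = np.random.default_rng(0)
U = rng.normal(size=(100, 5))                       # n = 100 examples, d = 5 features
v = np.sign(U @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=100))
print(logistic_objective(np.zeros(5), U, v, lam=0.1))   # = 100 * log(2) at x = 0
```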

SLIDE 9

Machine Learning and Big Data

Convex Function

  • f is a convex function if: f(t x_1 + (1 − t) x_2) ≤ t f(x_1) + (1 − t) f(x_2) for all x_1, x_2 and all t ∈ [0, 1].

SLIDE 10

Machine Learning and Big Data

Convex Optimization

  • Convex optimization problem:

      minimize_x f(x)
      subject to: g_i(x) ≤ 0, ∀ i = 1, ..., k

    where f and the g_i are convex functions.
  • For convex optimization problems, local optima are also global optima.

SLIDE 11

Machine Learning and Big Data

Optimization Algorithm: Gradient Descent
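This slide is a figure illustrating the gradient-descent iteration. As a textual stand-in, a minimal sketch of the update x_{k+1} = x_k − α ∇f(x_k), demonstrated on a simple quadratic (the fixed step size and the quadratic are illustrative choices, not from the slides):

```python
import numpy as np

def gradient_descent(grad, x0, step=0.1, iters=200):
    """Plain gradient descent: x_{k+1} = x_k - step * grad(x_k).
    A fixed step size is used for brevity; the talk later mentions line search."""
    x = x0.copy()
    for _ in range(iters):
        x = x - step * grad(x)
    return x

# f(x) = ||x - 3||^2 has gradient 2(x - 3) and minimizer 3 in every coordinate.
print(gradient_descent(lambda x: 2.0 * (x - 3.0), x0=np.zeros(2)))   # ~ [3., 3.]
```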

SLIDE 12

Machine Learning and Big Data Support Vector Machines

Classification Problem

SLIDE 13

Machine Learning and Big Data Support Vector Machines

SVM

  • Separating hyperplane: x^T u = 0.
  • Parallel hyperplanes (defining the margin): x^T u = ±1.
  • Margin (perpendicular distance between the parallel hyperplanes): 2 / ||x||.
  • Correct classification of training datapoints: v_i x^T u_i ≥ 1, ∀ i.
  • Allowing error (slack) ξ_i: v_i x^T u_i ≥ 1 − ξ_i, ∀ i.
  • Max-margin formulation:

      min_{x,ξ} (1/2) ||x||^2 + C Σ_{i=1}^n ξ_i
      subject to: v_i x^T u_i ≥ 1 − ξ_i, ξ_i ≥ 0, ∀ i = 1, ..., n

SLIDE 14

Machine Learning and Big Data Support Vector Machines

SVM: dual

  • Lagrangian:

      L = (1/2) x^T x + C Σ_{i=1}^n ξ_i + Σ_{i=1}^n α_i (1 − ξ_i − v_i x^T u_i) − Σ_{i=1}^n μ_i ξ_i,   with α_i, μ_i ≥ 0

  • Dual problem: (x*, α*, μ*) = argmax_{α,μ} min_x L(x, α, μ).
  • For a strictly convex problem, primal and dual solutions are the same (strong duality).
  • KKT conditions:

      x = Σ_{i=1}^n α_i v_i u_i
      C = α_i + μ_i

SLIDE 15

Machine Learning and Big Data Support Vector Machines

SVM: dual

  • The dual problem:

      max_α Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j v_i v_j u_i^T u_j
      subject to: 0 ≤ α_i ≤ C, ∀ i

  • The dual is a quadratic programming problem in n variables.
  • Can be solved even if only the kernel function k(u_i, u_j) = u_i^T u_j is given.
  • Dimension agnostic.
  • Many efficient algorithms exist for solving it, e.g. SMO (Platt, 1999). Worst-case complexity is O(n^3), usually O(n^2).

SLIDE 16

Machine Learning and Big Data Support Vector Machines

SVM

  • A more compact form:

      min_x Σ_{i=1}^n max(0, 1 − v_i x^T u_i) + λ ||x||_2^2

  • Or:

      min_x Σ_{i=1}^n l(x, u_i, v_i) + λ Ω(x)
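A minimal NumPy sketch of this regularized hinge-loss objective and one of its subgradients (needed later, since the hinge loss is not differentiable at the kink); U is assumed to hold the u_i as rows and v the labels, both illustrative assumptions about the data layout:

```python
import numpy as np

def svm_objective(x, U, v, lam):
    """Regularized hinge loss: sum_i max(0, 1 - v_i x^T u_i) + lam * ||x||_2^2."""
    margins = v * (U @ x)
    return np.sum(np.maximum(0.0, 1.0 - margins)) + lam * (x @ x)

def svm_subgradient(x, U, v, lam):
    """One subgradient of the objective above: examples with margin < 1 contribute
    -v_i u_i; at margin exactly 1 any value in [0, 1] * (-v_i u_i) is valid, 0 is used."""
    margins = v * (U @ x)
    active = margins < 1.0
    return -(U[active].T @ v[active]) + 2.0 * lam * x
```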

SLIDE 17

Machine Learning and Big Data Support Vector Machines

Multi-class classification

  • There are m classes: v_i ∈ {1, ..., m}.
  • Most popular prediction scheme: v = argmax_{j ∈ {1,...,m}} x_j^T u_i.
  • Given example (u_i, v_i), we want: x_{v_i}^T u_i ≥ x_j^T u_i, ∀ j ∈ {1, ..., m}.
  • Using a margin of at least 1, the loss is:

      l(u_i, v_i) = max_{j ∈ {1,...,m}, j ≠ v_i} max{0, 1 − (x_{v_i}^T u_i − x_j^T u_i)}

  • Given dataset D, solve the problem:

      min_{x_1,...,x_m} Σ_{i ∈ D} l(u_i, v_i) + λ Σ_{j=1}^m ||x_j||^2

  • This can be extended to many settings, e.g. sequence labeling, learning to rank, etc.

SLIDE 18

Machine Learning and Big Data Support Vector Machines

General Learning Problems

  • Support Vector Machines:

      min_x Σ_{i=1}^n max{0, 1 − v_i x^T u_i} + λ ||x||_2^2

  • Logistic Regression:

      min_x Σ_{i=1}^n log(1 + exp(−v_i x^T u_i)) + λ ||x||_2^2

  • General form:

      min_x Σ_{i=1}^n l(x, u_i, v_i) + λ Ω(x)

    where l is the loss function and Ω the regularizer.

SLIDE 19

Machine Learning and Big Data Stochastic Sub-gradient descent

Sub-gradient Descent

  • A sub-gradient of a (possibly non-differentiable) convex function f at a point x_0 is a vector g such that: f(x) − f(x_0) ≥ g^T (x − x_0) for all x.

SLIDE 20

Machine Learning and Big Data Stochastic Sub-gradient descent

Sub-gradient Descent

  • Randomly initialize x_0.
  • Iterate: x_k = x_{k−1} − t_k g(x_{k−1}), k = 1, 2, 3, ..., where g is a sub-gradient of f.
  • Step size: t_k = 1/√k.
  • Keep the best iterate: f(x_best^(k)) = min_{i=1,...,k} f(x_i).
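A compact sketch of this iteration with the 1/√k step size and best-iterate tracking from the slide; the objective and a subgradient are passed in as callables (for instance the hinge-loss pair sketched on Slide 16):

```python
import numpy as np

def subgradient_descent(objective, subgradient, x0, iters=1000):
    """Sub-gradient descent with step size t_k = 1/sqrt(k); returns the best iterate seen."""
    x = x0.copy()
    x_best, f_best = x.copy(), objective(x)
    for k in range(1, iters + 1):
        x = x - (1.0 / np.sqrt(k)) * subgradient(x)
        f_x = objective(x)
        if f_x < f_best:                 # keep f(x_best) = min over iterates so far
            x_best, f_best = x.copy(), f_x
    return x_best, f_best

# Hypothetical usage with the SVM sketch from Slide 16:
# x_best, f_best = subgradient_descent(lambda x: svm_objective(x, U, v, 0.1),
#                                      lambda x: svm_subgradient(x, U, v, 0.1),
#                                      np.zeros(U.shape[1]))
```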

SLIDE 21

Machine Learning and Big Data Stochastic Sub-gradient descent

Sub-gradient Descent

SLIDE 22

Machine Learning and Big Data Stochastic Sub-gradient descent

Stochastic Sub-gradient Descent

  • Convergence rate is O(1/√k).
  • Each iteration takes O(n) time.
  • Reduce time by calculating the sub-gradient using a subset of examples - stochastic sub-gradient.
  • Inherently serial.
  • Typical O(1/ε^2) behaviour (iterations needed for tolerance ε).
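A sketch of the stochastic variant: the same update, but each step uses a sub-gradient computed on a random subset of examples, so one iteration costs O(batch size) instead of O(n) (the batch size and data-access pattern are illustrative assumptions):

```python
import numpy as np

def stochastic_subgradient_descent(subgradient_on_batch, n, x0,
                                   iters=1000, batch_size=32, seed=0):
    """Each iteration draws a random subset of example indices and steps along a
    sub-gradient computed only on that subset."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for k in range(1, iters + 1):
        batch = rng.choice(n, size=min(batch_size, n), replace=False)
        x = x - (1.0 / np.sqrt(k)) * subgradient_on_batch(x, batch)
    return x

# Hypothetical usage: subgradient_on_batch(x, idx) could evaluate
# svm_subgradient(x, U[idx], v[idx], lam) from the earlier SVM sketch.
```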

SLIDE 23

Machine Learning and Big Data Stochastic Sub-gradient descent

Stochastic Sub-gradient Descent

SLIDE 24

Distributed Optimization

Distributed gradient descent

  • Divide the dataset into m parts. Each part is processed on one computer (m computers in total).
  • There is one central computer. All computers can communicate with the central computer via the network.
  • Define:

      loss(x) = Σ_{j=1}^m Σ_{i ∈ C_j} l_i(x) + λ Ω(x),   where l_i(x) = l(x, u_i, v_i)

  • The gradient (in case of differentiable loss):

      ∇loss(x) = Σ_{j=1}^m ∇( Σ_{i ∈ C_j} l_i(x) ) + λ ∇Ω(x)

  • Compute ∇l_j(x) = Σ_{i ∈ C_j} ∇l_i(x) on the j-th computer. Communicate it to the central computer.

SLIDE 25

Distributed Optimization

Distributed gradient descent

  • Compute ∇loss(x) = Σ_{j=1}^m ∇l_j(x) + λ ∇Ω(x) at the central computer.
  • The gradient descent update: x_{k+1} = x_k − α ∇loss(x_k), with α chosen by a (distributed) line search algorithm.
  • For non-differentiable loss functions, we can use a distributed sub-gradient descent algorithm.
  • Slow for most practical problems. For achieving ε tolerance:
      Gradient descent (logistic regression): O(1/ε) iterations.
      Sub-gradient descent (stochastic sub-gradient descent): O(1/ε^2) iterations.
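A sketch of this master/worker pattern with the "network" simulated in a single Python process: the per-partition gradient sums that would be computed on separate machines (e.g. as a map-reduce or Spark job, which is not shown here) are just a list comprehension:

```python
import numpy as np

def distributed_gradient_descent(partitions, grad_on_partition, grad_regularizer,
                                 x0, step=0.1, iters=100):
    """Central-coordinator gradient descent.
    partitions: list of per-machine data chunks C_j (each would live on one worker).
    grad_on_partition(x, chunk): gradient of that partition's summed loss at x.
    grad_regularizer(x): gradient of lambda * Omega(x)."""
    x = x0.copy()
    for _ in range(iters):
        # "Map": each worker computes the gradient of its local loss.
        local_grads = [grad_on_partition(x, chunk) for chunk in partitions]
        # "Reduce" at the central computer: sum and add the regularizer gradient.
        full_grad = np.sum(local_grads, axis=0) + grad_regularizer(x)
        x = x - step * full_grad     # fixed step size instead of a line search, for brevity
    return x
```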

SLIDE 26

Distributed Optimization ADMM

Alternating Direction Method of Multipliers

  • Problem:

      minimize_{x,z} f(x) + g(z)
      subject to: Ax + Bz = c

  • Algorithm - iterate till convergence:

      x_{k+1} = argmin_x f(x) + (ρ/2) ||A x + B z_k − c + u_k||_2^2
      z_{k+1} = argmin_z g(z) + (ρ/2) ||A x_{k+1} + B z − c + u_k||_2^2
      u_{k+1} = u_k + A x_{k+1} + B z_{k+1} − c

SLIDE 27

Distributed Optimization ADMM

Stopping criteria

  • Stop when the primal and dual residuals are small: ||r_k||_2 ≤ ε_pri and ||s_k||_2 ≤ ε_dual.
  • Convergence guarantees ||r_k||_2 → 0 and ||s_k||_2 → 0 as k → ∞, so this criterion is eventually met.

SLIDE 28

Distributed Optimization ADMM

Observations

  • The x-update requires solving an optimization problem of the form:

      min_x f(x) + (ρ/2) ||A x + v||_2^2,   with v = B z_k − c + u_k

  • Similarly for the z-update.
  • Sometimes these subproblems have a closed form.
  • ADMM is a meta optimization algorithm.

SLIDE 29

Distributed Optimization Convergence

Convergence of ADMM

  • Assumption 1: the functions f : R^n → R and g : R^m → R are closed, proper and convex.
      Same as assuming epi f = {(x, t) ∈ R^n × R | f(x) ≤ t} is closed and convex (and similarly for g).
  • Assumption 2: the unaugmented Lagrangian L_0(x, z, y) has a saddle point (x*, z*, y*):

      L_0(x*, z*, y) ≤ L_0(x*, z*, y*) ≤ L_0(x, z, y*)   for all x, z, y

SLIDE 30

Distributed Optimization Convergence

Convergence of ADMM

  • Primal residual: r = Ax + Bz − c.
  • Optimal objective: p* = inf_{x,z} { f(x) + g(z) | Ax + Bz = c }.
  • Convergence results:
      Primal residual convergence: r_k → 0 as k → ∞.
      Dual residual convergence: s_k → 0 as k → ∞.
      Objective convergence: f(x_k) + g(z_k) → p* as k → ∞.
      Dual variable convergence: y_k → y* as k → ∞.

SLIDE 31

Distributed Optimization Distributed Loss Minimization

Decomposition

  • If f is separable: f(x) = f_1(x_1) + ... + f_N(x_N), with x = (x_1, ..., x_N),
  • and A is conformably block separable, i.e. A^T A is block diagonal,
  • then the x-update splits into N parallel updates of the x_i.

SLIDE 32

Distributed Optimization Distributed Loss Minimization

Consensus Optimization

  • Problem:

      min_x f(x) = Σ_{i=1}^N f_i(x)

  • ADMM form:

      min_{x_i, z} Σ_{i=1}^N f_i(x_i)   s.t. x_i − z = 0, i = 1, ..., N

  • Augmented Lagrangian:

      L_ρ(x_1, ..., x_N, z, y) = Σ_{i=1}^N ( f_i(x_i) + y_i^T (x_i − z) + (ρ/2) ||x_i − z||_2^2 )

SLIDE 33

Distributed Optimization Distributed Loss Minimization

Consensus Optimization

  • ADMM algorithm:

      x_i^{k+1} = argmin_{x_i} ( f_i(x_i) + y_i^{kT} (x_i − z^k) + (ρ/2) ||x_i − z^k||_2^2 )
      z^{k+1} = (1/N) Σ_{i=1}^N ( x_i^{k+1} + (1/ρ) y_i^k )
      y_i^{k+1} = y_i^k + ρ (x_i^{k+1} − z^{k+1})

  • Final solution is z^k.
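A runnable sketch of these consensus updates, using scipy.optimize.minimize as a generic black-box solver for each local x_i-update (a concrete choice made here purely for illustration; in the distributed setting each f_i would be the loss on one data partition and each x_i-update would run on its own machine):

```python
import numpy as np
from scipy.optimize import minimize

def consensus_admm(local_objectives, dim, rho=1.0, iters=50):
    """Consensus ADMM for min_x sum_i f_i(x): N parallelizable x_i-updates plus averaging."""
    N = len(local_objectives)
    X = np.zeros((N, dim))     # x_i, one row per "machine"
    Y = np.zeros((N, dim))     # dual variables y_i
    z = np.zeros(dim)          # consensus variable
    for _ in range(iters):
        for i, f_i in enumerate(local_objectives):         # in practice: one machine per i
            aug = lambda x, i=i, f_i=f_i: (f_i(x) + Y[i] @ (x - z)
                                           + 0.5 * rho * np.sum((x - z) ** 2))
            X[i] = minimize(aug, X[i], method="BFGS").x     # x_i-update
        z = np.mean(X + Y / rho, axis=0)                    # z-update: average of x_i + y_i/rho
        Y += rho * (X - z)                                  # dual updates
    return z

# Toy check: f_i(x) = ||x - a_i||^2 has consensus solution mean(a_i).
targets = [np.array([1.0, 0.0]), np.array([3.0, 2.0]), np.array([2.0, 4.0])]
fs = [lambda x, a=a: np.sum((x - a) ** 2) for a in targets]
print(consensus_admm(fs, dim=2))   # approximately [2., 2.]
```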

SLIDE 34

Distributed Optimization Distributed Loss Minimization

Consensus Optimization

  • The z-update can be written as: z^{k+1} = x̄^{k+1} + (1/ρ) ȳ^k.
  • Averaging the y-updates: ȳ^{k+1} = ȳ^k + ρ (x̄^{k+1} − z^{k+1}).
  • Substituting the first into the second: ȳ^{k+1} = 0. Hence z^k = x̄^k.
  • Revised algorithm:

      x_i^{k+1} = argmin_{x_i} ( f_i(x_i) + y_i^{kT} (x_i − x̄^k) + (ρ/2) ||x_i − x̄^k||_2^2 )
      y_i^{k+1} = y_i^k + ρ (x_i^{k+1} − x̄^{k+1})

  • Final solution: z^k = x̄^k.

SLIDE 35

Distributed Optimization Distributed Loss Minimization

Distributed Loss minimization

  • Problem:

      min_x l(Ax − b) + r(x)

  • Partition A and b by rows:

      A = [A_1; ...; A_N],   b = [b_1; ...; b_N],   where A_i ∈ R^{m_i × m} and b_i ∈ R^{m_i}

  • ADMM formulation:

      min_{x_i, z} Σ_{i=1}^N l_i(A_i x_i − b_i) + r(z)   s.t. x_i − z = 0, i = 1, ..., N

SLIDE 36

Distributed Optimization Distributed Loss Minimization

Distributed Loss minimization

  • ADMM solution:

      x_i^{k+1} = argmin_{x_i} ( l_i(A_i x_i − b_i) + (ρ/2) ||x_i − z^k + u_i^k||_2^2 )
      z^{k+1} = argmin_z ( r(z) + (Nρ/2) ||z − x̄^{k+1} − ū^k||_2^2 )
      u_i^{k+1} = u_i^k + x_i^{k+1} − z^{k+1}

SLIDE 37

Distributed Optimization Results

ADMM Results

  • Logistic Regression using the loss minimization formulation (Boyd et al.):

      min_x Σ_{i=1}^n log(1 + exp(−v_i x^T u_i)) + λ ||x||_2^2

SLIDE 38

Distributed Optimization Results

ADMM Results

  • Logistic Regression using the loss minimization formulation (Boyd et al.):

      min_x Σ_{i=1}^n log(1 + exp(−v_i x^T u_i)) + λ ||x||_2^2

SLIDE 39

Distributed Optimization Results

Other Machine Learning Problems

  • Ridge Regression.
  • Lasso.
  • Multi-class SVM.
  • Ranking.
  • Structured output prediction.

SLIDE 40

Distributed Optimization Results

ADMM Results

Lasso Results (Boyd et al.):

SLIDE 41

Distributed Optimization Results

ADMM Results

SVM primal residual:

SLIDE 42

Distributed Optimization Results

ADMM Results

SVM Accuracy:

SLIDE 43

Distributed Optimization Results

Results

Risk and Hyperplane

SLIDE 44

Distributed Optimization Development of ADMM

Dual Ascent

  • Convex equality-constrained problem:

      min_x f(x)   subject to: Ax = b

  • Lagrangian: L(x, y) = f(x) + y^T (Ax − b)
  • Dual function: g(y) = inf_x L(x, y)
  • Dual problem: max_y g(y)
  • Final solution: x* = argmin_x L(x, y*)

SLIDE 45

Distributed Optimization Development of ADMM

Dual Ascent

  • Gradient ascent for the dual problem: y_{k+1} = y_k + α_k ∇_y g(y_k), with ∇_y g(y_k) = A x̃ − b, where x̃ = argmin_x L(x, y_k).
  • Dual ascent algorithm:

      x_{k+1} = argmin_x L(x, y_k)
      y_{k+1} = y_k + α_k (A x_{k+1} − b)

  • Assumptions:
      L(x, y_k) is strictly convex in x; else the first step can have multiple solutions.
      L(x, y_k) is bounded below in x.
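A small sketch of dual ascent for a case where the x-minimization has a closed form: with f(x) = (1/2)||x − x0||^2 the update is argmin_x L(x, y) = x0 − A^T y (the concrete projection problem below is an illustrative choice, not from the slides):

```python
import numpy as np

def dual_ascent(A, b, x0_center, alpha=0.1, iters=500):
    """Dual ascent for: min (1/2)||x - x0_center||^2  s.t.  Ax = b."""
    y = np.zeros(A.shape[0])
    for _ in range(iters):
        x = x0_center - A.T @ y          # x_{k+1} = argmin_x L(x, y_k), closed form here
        y = y + alpha * (A @ x - b)      # y_{k+1} = y_k + alpha_k (A x_{k+1} - b)
    return x, y

# Illustration: project the point (1, 2, 3) onto the plane x1 + x2 + x3 = 3.
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([3.0])
x, y = dual_ascent(A, b, x0_center=np.array([1.0, 2.0, 3.0]))
print(x)   # approximately [0., 1., 2.], which satisfies Ax = b
```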

SLIDE 46

Distributed Optimization Development of ADMM

Dual Decomposition

  • Suppose f is separable: f(x) = f_1(x_1) + ... + f_N(x_N), x = (x_1, ..., x_N).
  • Then L is separable in x: L(x, y) = L_1(x_1, y) + ... + L_N(x_N, y) − y^T b, where L_i(x_i, y) = f_i(x_i) + y^T A_i x_i.
  • The x-minimization splits into N separate problems:

      x_i^{k+1} = argmin_{x_i} L_i(x_i, y^k)

SLIDE 47

Distributed Optimization Development of ADMM

Dual Decomposition

  • Dual decomposition:

      x_i^{k+1} = argmin_{x_i} L_i(x_i, y^k),   i = 1, ..., N
      y^{k+1} = y^k + α_k ( Σ_{i=1}^N A_i x_i^{k+1} − b )

  • Distributed solution:
      Scatter y^k to the individual nodes.
      Compute x_i on the i-th node (distributed step).
      Gather A_i x_i^{k+1} from the i-th node.
  • All drawbacks of dual ascent remain.

SLIDE 48

Distributed Optimization Development of ADMM

Method of Multipliers

  • Make dual ascent work under more general conditions.
  • Use the augmented Lagrangian:

      L_ρ(x, y) = f(x) + y^T (Ax − b) + (ρ/2) ||Ax − b||_2^2

  • Method of multipliers:

      x_{k+1} = argmin_x L_ρ(x, y_k)
      y_{k+1} = y_k + ρ (A x_{k+1} − b)

SLIDE 49

Distributed Optimization Development of ADMM

Method of Multipliers

  • Optimality conditions (for differentiable f):
      Primal feasibility: A x* − b = 0
      Dual feasibility: ∇f(x*) + A^T y* = 0
  • Since x_{k+1} minimizes L_ρ(x, y_k):

      0 = ∇_x L_ρ(x_{k+1}, y_k)
        = ∇f(x_{k+1}) + A^T (y_k + ρ (A x_{k+1} − b))
        = ∇f(x_{k+1}) + A^T y_{k+1}

  • So the dual update y_{k+1} = y_k + ρ (A x_{k+1} − b) makes (x_{k+1}, y_{k+1}) dual feasible.
  • Primal feasibility is achieved in the limit: (A x_{k+1} − b) → 0.

SLIDE 50

Distributed Optimization Development of ADMM

Alternating direction method of multipliers

  • Problem with applying the standard method of multipliers to distributed optimization:
      there is no problem decomposition even if f is separable,
      due to the squared term (ρ/2) ||Ax − b||_2^2.

SLIDE 51

Distributed Optimization Development of ADMM

Alternating direction method of multipliers

  • ADMM problem:

      min_{x,z} f(x) + g(z)   subject to: Ax + Bz = c

  • Augmented Lagrangian:

      L_ρ(x, z, y) = f(x) + g(z) + y^T (Ax + Bz − c) + (ρ/2) ||Ax + Bz − c||_2^2

  • ADMM:

      x_{k+1} = argmin_x L_ρ(x, z_k, y_k)
      z_{k+1} = argmin_z L_ρ(x_{k+1}, z, y_k)
      y_{k+1} = y_k + ρ (A x_{k+1} + B z_{k+1} − c)

SLIDE 52

Distributed Optimization Development of ADMM

Alternating direction method of multipliers

  • Recall the problem with the standard method of multipliers for distributed optimization: no decomposition even if f is separable, due to the squared term (ρ/2) ||Ax − b||_2^2.
  • The above technique reduces to the method of multipliers if we do a joint minimization over x and z.
  • Since we split the joint (x, z) minimization step, the problem can be decomposed.

SLIDE 53

Distributed Optimization Development of ADMM

ADMM Optimality conditions

  • Optimality conditions (differentiable case):
      Primal feasibility: Ax + Bz − c = 0
      Dual feasibility: ∇f(x) + A^T y = 0 and ∇g(z) + B^T y = 0
  • Since z_{k+1} minimizes L_ρ(x_{k+1}, z, y_k):

      0 = ∇g(z_{k+1}) + B^T y_k + ρ B^T (A x_{k+1} + B z_{k+1} − c)
        = ∇g(z_{k+1}) + B^T y_{k+1}

  • So the dual variable update satisfies the second dual feasibility condition exactly.
  • Primal feasibility and the first dual feasibility condition are satisfied asymptotically.

SLIDE 54

Distributed Optimization Development of ADMM

ADMM Optimality conditions

  • Primal residual: r_k = A x_k + B z_k − c.
  • Since x_{k+1} minimizes L_ρ(x, z_k, y_k):

      0 = ∇f(x_{k+1}) + A^T y_k + ρ A^T (A x_{k+1} + B z_k − c)
        = ∇f(x_{k+1}) + A^T (y_k + ρ r_{k+1} + ρ B (z_k − z_{k+1}))
        = ∇f(x_{k+1}) + A^T y_{k+1} + ρ A^T B (z_k − z_{k+1})

  • Or, ρ A^T B (z_{k+1} − z_k) = ∇f(x_{k+1}) + A^T y_{k+1}.
  • Hence s_{k+1} = ρ A^T B (z_{k+1} − z_k) can be thought of as a dual residual: it measures the violation of the first dual feasibility condition.

SLIDE 55

Distributed Optimization Development of ADMM

ADMM with scaled dual variables

  • Combine the linear and quadratic terms: with r = Ax + Bz − c,

      y^T r + (ρ/2) ||r||_2^2 = (ρ/2) ||r + u||_2^2 − (ρ/2) ||u||_2^2,   where u = (1/ρ) y is the scaled dual variable.

  • This gives the scaled-form updates used earlier (Slide 26):

      x_{k+1} = argmin_x f(x) + (ρ/2) ||A x + B z_k − c + u_k||_2^2
      z_{k+1} = argmin_z g(z) + (ρ/2) ||A x_{k+1} + B z − c + u_k||_2^2
      u_{k+1} = u_k + A x_{k+1} + B z_{k+1} − c

SLIDE 56

Applications and extensions Weighted Parameter Averaging

Distributed Support Vector Machines

  • Training dataset partitioned into M partitions (S_m, m = 1, ..., M).
  • Each partition has L datapoints: S_m = {(x_ml, y_ml)}, l = 1, ..., L.
  • Each partition can be processed locally on a single computer.
  • Distributed SVM training problem [?]:

      min_{w_m, z} Σ_{m=1}^M Σ_{l=1}^L loss(w_m; (x_ml, y_ml)) + r(z)
      s.t. w_m − z = 0, m = 1, ..., M

SLIDE 57

Applications and extensions Weighted Parameter Averaging

Parameter Averaging

  • Parameter averaging, also called "mixture weights", was proposed in [?] for logistic regression. The results hold true for SVMs with a suitable sub-derivative.
  • Locally learn an SVM on each S_m:

      ŵ_m = argmin_w (1/L) Σ_{l=1}^L loss(w; x_ml, y_ml) + λ ||w||^2,   m = 1, ..., M

  • The final SVM parameter is given by:

      w_PA = (1/M) Σ_{m=1}^M ŵ_m
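A self-contained sketch of parameter averaging, with each local SVM trained by the sub-gradient method from earlier in the talk (the local solver, step sizes, and the (X, y) layout of each partition are illustrative choices; any local SVM trainer could be substituted):

```python
import numpy as np

def train_local_svm(X, y, lam=0.1, iters=2000):
    """Local SVM on one partition: min_w (1/L) sum_l max(0, 1 - y_l w^T x_l) + lam * ||w||^2,
    solved by sub-gradient descent with step size 1/sqrt(k)."""
    L, d = X.shape
    w = np.zeros(d)
    for k in range(1, iters + 1):
        margins = y * (X @ w)
        active = margins < 1.0
        subgrad = -(X[active].T @ y[active]) / L + 2.0 * lam * w
        w -= subgrad / np.sqrt(k)
    return w

def parameter_averaging(partitions, lam=0.1):
    """Train one SVM per partition (in parallel, in practice) and average the parameters."""
    w_hats = [train_local_svm(X, y, lam) for (X, y) in partitions]
    return np.mean(w_hats, axis=0)       # w_PA = (1/M) sum_m w_hat_m
```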

SLIDE 58

Applications and extensions Weighted Parameter Averaging

Problem with Parameter Averaging

PA with varying number of partitions - Toy dataset.

SLIDE 59

Applications and extensions Weighted Parameter Averaging

Weighted Parameter Averaging

  • Final hypothesis is a weighted sum of the local parameters ŵ_m:

      w = Σ_{m=1}^M β_m ŵ_m

  • Also proposed in [?]. How to get the β_m?
  • Notation: β = [β_1, ..., β_M]^T, Ŵ = [ŵ_1, ..., ŵ_M], so w = Ŵ β.

SLIDE 60

Applications and extensions Weighted Parameter Averaging

Weighted Parameter Averaging

  • Find the optimal set of weights β which attains the lowest regularized hinge loss:

      min_{β,ξ} λ ||Ŵ β||^2 + (1/ML) Σ_{m=1}^M Σ_{i=1}^L ξ_mi
      subject to: y_mi (β^T Ŵ^T x_mi) ≥ 1 − ξ_mi, ∀ i, m
                  ξ_mi ≥ 0, ∀ m = 1, ..., M, i = 1, ..., L

  • Ŵ is a pre-computed parameter.
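Since β^T (Ŵ^T x) is a linear predictor over the M-dimensional projected features Ŵ^T x, the β-problem is itself a small SVM (with regularizer λ||Ŵβ||^2) and can be solved by the same sub-gradient approach; a sketch, where W_hat is assumed to stack the pre-computed local solutions ŵ_m as columns:

```python
import numpy as np

def weighted_parameter_averaging(W_hat, partitions, lam=0.1, iters=2000):
    """Solve min_beta lam*||W_hat beta||^2
             + (1/(M*L)) sum_{m,i} max(0, 1 - y_mi beta^T W_hat^T x_mi)
    by sub-gradient descent; W_hat has shape (d, M), column m equal to w_hat_m."""
    X = np.vstack([Xm for Xm, _ in partitions])        # all x_mi stacked, shape (M*L, d)
    y = np.concatenate([ym for _, ym in partitions])   # all labels y_mi, shape (M*L,)
    Z = X @ W_hat                                      # projected features W_hat^T x_mi
    n, M = Z.shape
    beta = np.full(M, 1.0 / M)                         # start from plain parameter averaging
    for k in range(1, iters + 1):
        margins = y * (Z @ beta)
        active = margins < 1.0
        subgrad = -(Z[active].T @ y[active]) / n + 2.0 * lam * (W_hat.T @ (W_hat @ beta))
        beta -= subgrad / np.sqrt(k)
    return W_hat @ beta                                # final predictor w = W_hat beta
```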

SLIDE 61

Applications and extensions Weighted Parameter Averaging

Distributed Weighted Parameter Averaging

  • Distributed version of primal weighted parameter averaging:

      min_{γ_m, β} (1/ML) Σ_{m=1}^M Σ_{l=1}^L loss(Ŵ γ_m; x_ml, y_ml) + r(β)
      s.t. γ_m − β = 0, m = 1, ..., M

    where r(β) = λ ||Ŵ β||^2, the γ_m are the weights on the m-th computer, and β is the consensus weight vector.

SLIDE 62

Applications and extensions Weighted Parameter Averaging

Distributed Weighted Parameter Averaging

  • Distributed algorithm using ADMM:

      γ_m^{k+1} := argmin_γ ( loss(A_m γ) + (ρ/2) ||γ − β^k + u_m^k||_2^2 )
      β^{k+1} := argmin_β ( r(β) + (Mρ/2) ||β − γ̄^{k+1} − ū^k||_2^2 )
      u_m^{k+1} = u_m^k + γ_m^{k+1} − β^{k+1}

  • Here the u_m are the scaled Lagrange multipliers, γ̄ = (1/M) Σ_{m=1}^M γ_m and ū = (1/M) Σ_{m=1}^M u_m.

SLIDE 63

Applications and extensions Weighted Parameter Averaging

Toy Dataset - PA and WPA

PA (left) and WPA (right) with varying number of partitions - Toy dataset.

SLIDE 64

Applications and extensions Weighted Parameter Averaging

Toy Dataset - PA and WPA

Accuracy of PA and WPA with varying number of partitions - Toy dataset.

SLIDE 65

Applications and extensions Weighted Parameter Averaging

Real World Datasets

Epsilon (2000 features, 6000 datapoints) test set accuracy with varying number of partitions.

SLIDE 66

Applications and extensions Weighted Parameter Averaging

Real World Datasets

Gisette (5000 features, 6000 datapoints) test set accuracy with varying number of partitions.

SLIDE 67

Applications and extensions Weighted Parameter Averaging

Real World Datasets

Real-sim (20000 features, 3000 datapoints) test set accuracy with varying number of partitions.

SLIDE 68

Applications and extensions Weighted Parameter Averaging

Real World Datasets

Convergence of test accuracy with iterations (200 partitions).

SLIDE 69

Applications and extensions Weighted Parameter Averaging

Real World Datasets

Convergence of primal residual with iterations (200 partitions).

SLIDE 70

Applications and extensions Fully-distributed SVM

Distributed SVM on Arbitrary Network

  • Motivations: sensor networks, corporate networks, privacy.
  • Assumptions:
      Data is available at the nodes of the network.
      Communication is possible only along the edges of the network.

SLIDE 71

Applications and extensions Fully-distributed SVM

Distributed SVM on Arbitrary Network

  • SVM optimization problem:

      min_{w,b,ξ} (1/2) ||w||^2 + C Σ_{j=1}^J Σ_{n=1}^{N_j} ξ_jn
      s.t.: y_jn (w^T x_jn + b) ≥ 1 − ξ_jn, ∀ j ∈ J, n = 1, ..., N_j
            ξ_jn ≥ 0, ∀ j ∈ J, n = 1, ..., N_j

  • Node j has its own copy w_j, b_j. Distributed formulation:

      min_{w_j, b_j, ξ_jn} (1/2) Σ_{j=1}^J ||w_j||^2 + JC Σ_{j=1}^J Σ_{n=1}^{N_j} ξ_jn
      s.t.: y_jn (w_j^T x_jn + b_j) ≥ 1 − ξ_jn, ∀ j ∈ J, n = 1, ..., N_j
            ξ_jn ≥ 0, ∀ j ∈ J, n = 1, ..., N_j
            w_j = w_i, ∀ j and i ∈ B_j (the neighbours of node j)

SLIDE 72

Applications and extensions Fully-distributed SVM

Algorithm

  • Using v_j = [w_j^T  b_j]^T, X_j = [[x_j1, ..., x_jN_j]^T  1_j] and Y_j = diag([y_j1, ..., y_jN_j]):

      min_{v_j, ξ_jn, ω_ji} (1/2) Σ_{j=1}^J r(v_j) + JC Σ_{j=1}^J Σ_{n=1}^{N_j} ξ_jn
      s.t.: Y_j X_j v_j ≥ 1 − ξ̄_j, ∀ j ∈ J
            ξ̄_j ≥ 0, ∀ j ∈ J
            v_j = ω_ji, v_i = ω_ji, ∀ j, i ∈ B_j

  • Surrogate augmented Lagrangian:

      L({v_j}, {ξ̄_j}, {ω_ji}, {α_ijk}) = (1/2) Σ_{j=1}^J r(v_j) + JC Σ_{j=1}^J Σ_{n=1}^{N_j} ξ_jn
          + Σ_{j=1}^J Σ_{i ∈ B_j} ( α_ij1^T (v_j − ω_ji) + α_ij2^T (v_i − ω_ji) )
          + η Σ_{j=1}^J Σ_{i ∈ B_j} ( ||v_j − ω_ji||^2 + ||v_i − ω_ji||^2 )

SLIDE 73

Applications and extensions Fully-distributed SVM

Algorithm

  • ADMM-based algorithm:

      {v_j^{t+1}, ξ_jn^{t+1}} = argmin_{{v_j, ξ̄_j} ∈ W} L({v_j}, {ξ̄_j}, {ω_ji^t}, {α_ijk^t})
      {ω_ji^{t+1}} = argmin_{ω_ji} L({v_j^{t+1}}, {ξ̄_j^{t+1}}, {ω_ji}, {α_ijk^t})
      α_ji1^{t+1} = α_ji1^t + η (v_j^{t+1} − ω_ji^{t+1})
      α_ji2^{t+1} = α_ji2^t + η (ω_ji^{t+1} − v_i^{t+1})

  • From the second equation:

      ω_ji^{t+1} = (1/2η) (α_ji1^t − α_ji2^t) + (1/2) (v_j^{t+1} + v_i^{t+1})

SLIDE 74

Applications and extensions Fully-distributed SVM

Algorithm

  • Hence:

      α_ji1^{t+1} = (1/2) (α_ji1^t + α_ji2^t) + (η/2) (v_j^{t+1} − v_i^{t+1})
      α_ji2^{t+1} = (1/2) (α_ji1^t + α_ji2^t) + (η/2) (v_j^{t+1} − v_i^{t+1})

  • Substituting ω_ji^{t+1} = (1/2) (v_j^{t+1} + v_i^{t+1}) into the surrogate augmented Lagrangian, the third term becomes:

      Σ_{j=1}^J Σ_{i ∈ B_j} α_ij1^T (v_j − v_i) = Σ_{j=1}^J Σ_{i ∈ B_j} v_j^T (α_ji1^t − α_ij1^t)

  • Substitute α_j^t = Σ_{i ∈ B_j} (α_ji1^t − α_ij1^t).

SLIDE 75

Applications and extensions Fully-distributed SVM

Algorithm

  • The final algorithm:

      {v_j^{t+1}, ξ_jn^{t+1}} = argmin_{{v_j, ξ̄_j} ∈ W} L({v_j}, {ξ̄_j}, {α_j^t})
      α_j^{t+1} = α_j^t + (η/2) Σ_{i ∈ B_j} (v_j^{t+1} − v_i^{t+1})

  • Each node j only needs the current v_i of its neighbours i ∈ B_j, so the algorithm runs fully distributed over the network.

SLIDE 76

Applications and extensions Fully-distributed SVM

Thank you! Questions?
