slide-1
SLIDE 1

PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent

Cho-Jui Hsieh, Department of Computer Science, University of Texas at Austin

Joint work with H.-F. Yu and I. S. Dhillon

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 1 / 29

slide-2
SLIDE 2

Outline

- L2-regularized Empirical Risk Minimization
- Dual Coordinate Descent (Hsieh et al., 2008)
- Parallel Dual Coordinate Descent (on multi-core machines)
- Theoretical Analysis
- Experimental Results

slide-3
SLIDE 3

L2-regularized ERM

$$w^* = \arg\min_{w\in\mathbb{R}^d} P(w) := \frac{1}{2}\|w\|^2 + \sum_{i=1}^n \ell_i(w^T x_i)$$

SVM with hinge loss: ℓi(zi) = C max(1 − zi, 0)
SVM with squared hinge loss: ℓi(zi) = C max(1 − zi, 0)²
Logistic regression: ℓi(zi) = C log(1 + e^{−zi})

slide-4
SLIDE 4

Primal and Dual Formulations

Primal Problem
$$w^* = \arg\min_{w\in\mathbb{R}^d} P(w) := \frac{1}{2}\|w\|^2 + \sum_{i=1}^n \ell_i(w^T x_i)$$

Dual Problem
$$\alpha^* = \arg\min_{\alpha\in\mathbb{R}^n} D(\alpha) := \frac{1}{2}\Big\|\sum_{i=1}^n \alpha_i x_i\Big\|^2 + \sum_{i=1}^n \ell_i^*(-\alpha_i),$$
where ℓi*(·) is the conjugate of ℓi(·).

Primal-Dual Relationship between w∗ and α∗:
$$w^* = w(\alpha^*) := \sum_{i=1}^n \alpha_i^* x_i$$

slide-5
SLIDE 5

Coordinate Descent on the Dual Problem

Randomly select an i ∈ {1, . . . , n} and update αi ← αi + δ∗, where
$$\delta^* = \arg\min_{\delta}\ D(\alpha + \delta e_i)$$

slide-6
SLIDE 6

Coordinate Descent on the Dual Problem

Randomly select an i ∈ {1, . . . , n} and update αi ← αi + δ∗, where
$$\delta^* = \arg\min_{\delta}\ D(\alpha + \delta e_i) = \arg\min_{\delta}\ \frac{1}{2}\left(\delta + \frac{(\sum_{j=1}^n \alpha_j x_j)^T x_i}{\|x_i\|^2}\right)^2 + \frac{1}{\|x_i\|^2}\,\ell_i^*\big(-(\alpha_i + \delta)\big) = T_i\Big(\big(\textstyle\sum_{j=1}^n \alpha_j x_j\big)^T x_i,\ \alpha_i\Big)$$

A simple univariate problem, but O(nnz) construction time.

slide-7
SLIDE 7

Coordinate Descent on the Dual Problem

Randomly select an i ∈ {1, . . . , n} and update αi ← αi + δ∗, where
$$\delta^* = \arg\min_{\delta}\ D(\alpha + \delta e_i) = \arg\min_{\delta}\ \frac{1}{2}\left(\delta + \frac{(\sum_{j=1}^n \alpha_j x_j)^T x_i}{\|x_i\|^2}\right)^2 + \frac{1}{\|x_i\|^2}\,\ell_i^*\big(-(\alpha_i + \delta)\big) = T_i\Big(\big(\textstyle\sum_{j=1}^n \alpha_j x_j\big)^T x_i,\ \alpha_i\Big)$$

A simple univariate problem, but O(nnz) construction time ⇒ O(ni).

DCD (Hsieh et al., 2008):
- Maintain the primal variable w = Σ_{i=1}^n αi xi, so that δ∗ = Ti(wᵀxi, αi)
- O(ni) construction time, where ni = nnz of xi
- O(ni) maintenance cost: w ← w + δ∗xi

slide-8
SLIDE 8

Dual Coordinate Descent

Stochastic Dual Coordinate Descent

For t = 1, 2, . . .

1. Randomly pick an index i
2. Compute wᵀxi
3. Update αi ← αi + δ∗ where δ∗ = Ti(wᵀxi, αi)
4. Update w ← w + δ∗xi

Implemented in LIBLINEAR: Linear SVM (Hsieh et al., 2008), multi-class SVM (Keerthi et al., 2008), Logistic regression (Yu et al., 2011). Analysis: (Nesterov et al., 2012; Shalev-Shwartz et al., 2013)
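The four steps above can be sketched for the hinge-loss SVM, where Ti has the closed form of a clipped gradient step (the closed form is an assumption here; the slides only define Ti abstractly). Labels are folded into the examples, so 0 ≤ αi ≤ C:

```python
import numpy as np

def dcd_svm(X, y, C=1.0, epochs=20, seed=0):
    """Serial stochastic dual coordinate descent, sketched for the
    hinge-loss SVM dual. Labels are folded into the rows (z_i = y_i x_i),
    so the dual constraint is 0 <= alpha_i <= C and w = sum_i alpha_i z_i."""
    rng = np.random.default_rng(seed)
    Z = X * y[:, None]
    sqnorm = (Z ** 2).sum(axis=1)        # ||x_i||^2, precomputed once
    n, d = Z.shape
    alpha = np.zeros(n)
    w = np.zeros(d)                      # maintained primal variable
    for _ in range(epochs):
        for i in rng.permutation(n):     # step 1: pick an index
            if sqnorm[i] == 0.0:
                continue
            g = w @ Z[i] - 1.0           # step 2: compute w^T x_i (dual gradient)
            # step 3: delta* = T_i(w^T x_i, alpha_i), a clipped coordinate step
            delta = np.clip(alpha[i] - g / sqnorm[i], 0.0, C) - alpha[i]
            alpha[i] += delta
            w += delta * Z[i]            # step 4: maintain w = sum_i alpha_i x_i
    return w, alpha
```

With a sparse representation, the two lines touching Z[i] are exactly the O(ni) construction and maintenance costs from the previous slide; the dense NumPy version above pays O(d) instead.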


slide-9
SLIDE 9

Dual Coordinate Descent

[Animation, one frame per slide through slide 27: CPU1 (i = 3) executes a single DCD iteration as elementary register operations on shared memory holding α = (0.1, 0.2, 0.3, 0.4, 0.5, 0.6), the examples x1, . . . , x6, and w. It first computes wᵀx3 by repeatedly loading a coordinate (x3)j into R2 and wj into R1 and accumulating R3 = R3 + R1×R2; then computes δ∗ = T3(wᵀx3, α3) into a register; then updates α3 ← α3 + δ∗ and saves it back to memory; finally it updates w coordinate by coordinate, wj ← wj + δ∗(x3)j, loading, multiply-adding, and saving each wj in turn.]

slide-28
SLIDE 28

Parallel DCD in Shared-memory Multi-core System

Serial DCD updates: For t = 1, 2, . . .

1. Randomly pick an index i
2. Compute wᵀxi
3. Update αi ← αi + δ∗ where δ∗ = Ti(wᵀxi, αi)
4. Update w ← w + δ∗xi

slide-29
SLIDE 29

Parallel DCD in Shared-memory Multi-core System

Parallel DCD updates: each thread repeatedly performs the following. For t = 1, 2, . . .

1. Randomly pick an index i
2. Compute wᵀxi
3. Update αi ← αi + δ∗ where δ∗ = Ti(wᵀxi, αi)
4. Update w ← w + δ∗xi

Easy to implement using OpenMP; α and w are stored in shared memory.

Distributed dual coordinate descent instead gives each machine a local copy of α and w (Yang, 2013; Jaggi et al., 2014; Lee and Roth, 2015; Ma et al., 2015).

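A minimal threaded sketch of these shared-memory updates (Python threads stand in for the OpenMP loop; the hinge-loss closed form for Ti is an assumption, as is partitioning the indices so each thread owns its own αi). Nothing protects w, so both correctness issues discussed on the following slides can occur:

```python
import threading
import numpy as np

def parallel_dcd(Z, C=1.0, epochs=10, n_threads=2, seed=0):
    """Each thread runs the serial DCD loop on shared alpha and w with no
    synchronization. Assumes hinge loss with labels folded into the rows
    of Z, and gives each thread a disjoint slice of the indices so only
    w is contended."""
    n, d = Z.shape
    sqnorm = (Z ** 2).sum(axis=1)
    alpha = np.zeros(n)                  # shared
    w = np.zeros(d)                      # shared, updated without locks

    def worker(indices, tid):
        rng = np.random.default_rng(seed + tid)
        for _ in range(epochs):
            for i in rng.permutation(indices):
                if sqnorm[i] == 0.0:
                    continue
                g = w @ Z[i] - 1.0       # may be an inconsistent read of w
                delta = np.clip(alpha[i] - g / sqnorm[i], 0.0, C) - alpha[i]
                alpha[i] += delta
                w += delta * Z[i]        # conflicting writes can be lost

    parts = np.array_split(np.arange(n), n_threads)
    threads = [threading.Thread(target=worker, args=(p, t))
               for t, p in enumerate(parts)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w, alpha
```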

slide-32
SLIDE 32

Parallel Dual Coordinate Descent: Two Issues for Correctness

- Inconsistent Read of w
- Conflict Write of w

slide-33
SLIDE 33

Inconsistent Read

Thread 2 reads w while thread 1 writes to w. There may not exist any α such that w = Σᵢ αᵢxᵢ.

The "bounded delay" analysis in (Liu and Wright, 2014) cannot be directly applied.

slide-34
SLIDE 34

Conflict Write

Thread 1 and 2 write to w simultaneously. Updates to w can be

  • verwritten, so the

converged solution ˆ w and ˆ α may be inconsistent: ˆ w =

  • i

ˆ αixi.

CPU1: CPU2: w = w + 0.2 w = w + 0.5 OP R1 w OP R2 0.0 1.0 0.0 1 load w 1.0 1.0 load w 1.0 2 add 0.2 1.2 1.0 add 0.5 1.5 3 save w 1.2 1.2 1.5 4 1.2 1.5 save w 1.5 Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 14 / 29
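The interleaving in the table can be replayed deterministically; each register is a local variable and w is the shared value:

```python
w = 1.0                 # shared memory
r1 = w                  # step 1, CPU1: load w
r2 = w                  # step 1, CPU2: load w
r1 = r1 + 0.2           # step 2, CPU1: add 0.2
r2 = r2 + 0.5           # step 2, CPU2: add 0.5
w = r1                  # step 3, CPU1: save w  -> w == 1.2
w = r2                  # step 4, CPU2: save w  -> w == 1.5
# CPU1's +0.2 is lost: w is 1.5, not the 1.7 both updates would give.
```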

slide-35
SLIDE 35

Dual Coordinate Descent in Parallel

[Animation, one frame per slide through slide 61: CPU1 (i = 1) and CPU2 (i = 6) each run a DCD iteration concurrently on the shared α and w, again as elementary load / multiply-add / save register operations. While CPU1 is still saving its update w ← w + δ∗x1 coordinate by coordinate, CPU2 loads coordinates of w to compute wᵀx6, so some of the values it reads already contain CPU1's update and some do not: an inconsistent read of w. Later both CPUs hold the same coordinate w6 in a register at once and save their results back; one save overwrites the other, so an update is lost: a conflict write of w.]

slide-62
SLIDE 62

PASSCoDe-Lock

Each thread repeatedly performs the following updates. For t = 1, 2, . . .

1. Randomly pick an index i
2. Lock {wj | (xi)j ≠ 0}
3. Compute wᵀxi
4. Update αi ← αi + δ∗ where δ∗ = Ti(wᵀxi, αi)
5. Update w ← w + δ∗xi
6. Unlock the variables.
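A sketch of the locking discipline in steps 2 and 6, assuming one lock per coordinate of w; acquiring the locks in sorted coordinate order (an added convention, not stated on the slide) avoids deadlock between threads:

```python
import threading
import numpy as np

d = 3
w = np.zeros(d)
locks = [threading.Lock() for _ in range(d)]    # one lock per w_j

def locked_update(x_i, delta):
    """Hold every lock in N(i) = {j : (x_i)_j != 0} across the whole
    update, so both the read of w and the write-back are consistent.
    Sorted acquisition order prevents deadlock."""
    nz = sorted(np.flatnonzero(x_i))
    for j in nz:
        locks[j].acquire()
    try:
        # steps 3-5 (compute w^T x_i, delta*, and alpha_i's update) would
        # run here; this sketch only shows the protected write-back.
        for j in nz:
            w[j] += delta * x_i[j]
    finally:
        for j in reversed(nz):
            locks[j].release()
```

Holding all of N(i) across the whole update is what makes PASSCoDe-Lock safe, and also why it scales poorly, as the next slide's timings show.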

slide-63
SLIDE 63

How to Resolve the Issues

Three PASSCoDe approaches:
- lock: acquire locks for all necessary wj before the update

                | inconsistent read | conflict write
PASSCoDe-Lock   |     resolved      |    resolved

Scaling (on rcv1 with 100 epochs):

# threads | Lock
    2     | 98.03s / 0.27x
    4     | 106.11s / 0.25x
   10     | 114.43s / 0.23x

slide-64
SLIDE 64

PASSCoDe-Atomic

Each thread repeatedly performs the following updates. For t = 1, 2, . . .

1. Randomly pick an index i
2. Compute wᵀxi
3. Update αi ← αi + δ∗ where δ∗ = Ti(wᵀxi, αi)
4. For each j ∈ N(i): update wj ← wj + δ∗(xi)j atomically
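CPython has no atomic floating-point add, so in the sketch below a per-coordinate lock around the single add stands in for the atomic operation of step 4 (the actual implementation would use a hardware atomic, e.g. an OpenMP atomic pragma). Unlike PASSCoDe-Lock, nothing is held across the read of w, so inconsistent reads remain possible:

```python
import threading
import numpy as np

d = 4
w = np.zeros(d)
locks = [threading.Lock() for _ in range(d)]    # stand-ins for atomic adds

def atomic_add(j, val):
    with locks[j]:              # make the read-modify-write of w_j indivisible
        w[j] += val

def apply_update(delta, x_i):
    # step 4: for each j in N(i), update w_j atomically; reads of w
    # elsewhere are NOT protected, so inconsistent reads can still occur.
    for j in np.flatnonzero(x_i):
        atomic_add(j, delta * x_i[j])
```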


slide-65
SLIDE 65

How to Resolve the Issues

Three PASSCoDe approaches:
- lock: acquire locks for all necessary wj before the update
- atomic: apply an atomic operation for wj ← wj + δ∗(xi)j

                | inconsistent read | conflict write
PASSCoDe-Lock   |     resolved      |    resolved
PASSCoDe-Atomic |     remained      |    resolved

Scaling (on rcv1 with 100 epochs):

# threads | Lock            | Atomic
    2     | 98.03s / 0.27x  | 15.28s / 1.75x
    4     | 106.11s / 0.25x | 8.35s / 3.20x
   10     | 114.43s / 0.23x | 3.86s / 6.91x

slide-66
SLIDE 66

Analysis for PASSCoDe-Atomic

Atomic operations guarantee that all updates to w are performed eventually, so ŵ = Σ_{i=1}^n α̂ᵢxᵢ holds for the output (ŵ, α̂).

Bounded delay assumption (to handle the inconsistent read of w): all updates to w issued more than τ iterations ago have been performed.

Theorem. Under certain conditions on τ, PASSCoDe-Atomic has a global linear convergence rate in expectation:
$$E\big[D(\alpha^{j+1}) - D(\alpha^*)\big] \le \eta\, E\big[D(\alpha^{j}) - D(\alpha^*)\big]$$

Our analysis covers logistic regression and SVM with hinge loss (where the dual problem is not strictly convex).

slide-67
SLIDE 67

PASSCoDe-Wild

Each thread repeatedly performs the following updates. For t = 1, 2, . . .

1. Randomly pick an index i
2. Compute wᵀxi
3. Update αi ← αi + δ∗ where δ∗ = Ti(wᵀxi, αi)
4. Update w ← w + δ∗xi

slide-68
SLIDE 68

How to Resolve the Issues

Three PASSCoDe approaches:
- lock: acquire locks for all necessary wj before the update
- atomic: apply an atomic operation for wj ← wj + δ∗(xi)j
- wild: do nothing to resolve either issue

                | inconsistent read | conflict write
PASSCoDe-Lock   |     resolved      |    resolved
PASSCoDe-Atomic |     remained      |    resolved
PASSCoDe-Wild   |     remained      |    remained

Scaling (on rcv1 with 100 epochs):

# threads | Lock            | Atomic          | Wild
    2     | 98.03s / 0.27x  | 15.28s / 1.75x  | 14.08s / 1.90x
    4     | 106.11s / 0.25x | 8.35s / 3.20x   | 7.61s / 3.50x
   10     | 114.43s / 0.23x | 3.86s / 6.91x   | 3.59s / 7.43x

slide-69
SLIDE 69

Analysis for PASSCoDe-Wild

Some updates are missing due to memory conflicts, so for the final (ŵ, α̂): ŵ ≠ Σ_{i=1}^n α̂ᵢxᵢ.

Construct w̄ from the final α̂: w̄ = Σ_{i=1}^n α̂ᵢxᵢ. Which one should be used for prediction, ŵ or w̄?

Prediction accuracy (%):

dataset | # threads |  ŵ   |  w̄   | LIBLINEAR
news20  |     4     | 97.1 | 96.1 | 97.1
        |     8     | 97.2 | 93.3 |
covtype |     4     | 67.8 | 38.0 | 66.3
        |     8     | 67.6 | 38.0 |
rcv1    |     4     | 97.7 | 97.5 | 97.7
        |     8     | 97.7 | 97.4 |
webspam |     4     | 99.1 | 93.1 | 99.1
        |     8     | 99.1 | 88.4 |
kddb    |     4     | 88.8 | 79.7 | 88.8
        |     8     | 88.8 | 87.7 |

Question: why is ŵ better than w̄?

slide-70
SLIDE 70

Backward Analysis for PASSCoDe-Wild

Recall the primal problem:
$$w^* = \arg\min_{w} P(w) := \frac{1}{2}\|w\|^2 + \sum_{i=1}^n \ell_i(w^T x_i)$$

Theorem. Let ε be the error caused by the memory conflicts. Then
$$\hat{w} = \arg\min_{w} \hat{P}(w) := \frac{1}{2}\|w + \varepsilon\|^2 + \sum_{i=1}^n \ell_i(w^T x_i)$$
$$\bar{w} = \arg\min_{w} \bar{P}(w) := \frac{1}{2}\|w\|^2 + \sum_{i=1}^n \ell_i\big((w - \varepsilon)^T x_i\big)$$

P̂(w) is the problem with the perturbation on the regularization term; P̄(w) is the problem with the perturbation on the prediction term.

slide-71
SLIDE 71

Datasets and Experimental Settings

Datasets:

dataset |     n      |   ñ     |     d      |   d̄    | C
news20  | 16,000     | 3,996   | 1,355,191  | 455.5  | 2
rcv1    | 677,399    | 20,242  | 47,236     | 73.2   | 1
webspam | 280,000    | 70,000  | 16,609,143 | 3727.7 | 1
kddb    | 19,264,097 | 748,401 | 29,890,095 | 29.4   | 1

Compared implementations:
- LIBLINEAR: serial baseline
- PASSCoDe-Wild and PASSCoDe-Atomic: our methods
- CoCoA: a multi-core version of (Jaggi et al., 2014)
- AsySCD: (Liu and Wright, 2014)

Machine: Intel multi-core machine with 256 GB memory.

slide-72
SLIDE 72

Convergence in terms of Walltime


slide-73
SLIDE 73

Accuracy


slide-74
SLIDE 74

Speedup


slide-75
SLIDE 75

Conclusions

PASSCoDe: a simple but effective asynchronous dual coordinate descent method.

Analysis of three variants:
- PASSCoDe-Lock
- PASSCoDe-Atomic: established global linear convergence
- PASSCoDe-Wild: backward analysis

Future work: extend the analysis to L1-regularized problems:
- LASSO
- L1-regularized logistic regression