SDCA Stochastic Dual Coordinate Ascent Jingchang Liu June 29, 2017 - - PowerPoint PPT Presentation

SLIDE 1

SDCA

Stochastic Dual Coordinate Ascent

Jingchang Liu June 29, 2017

University of Science and Technology of China

SLIDE 2

Table of Contents

  • Lagrangian Duality
  • SDCA
  • Convergence Rate
  • Experiments
  • Asynchronous SDCA
  • Q & A

SLIDE 3

Lagrangian Duality

SLIDE 4

Dual Problem

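Primal Problem

$$\min_{x} \; f_0(x) \quad \text{s.t.} \quad f_i(x) \le 0, \; i = 1, 2, \ldots, m, \qquad h_i(x) = 0, \; i = 1, 2, \ldots, p$$

Lagrangian Function

$$L(x, \lambda, v) = f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{i=1}^{p} v_i h_i(x), \qquad \lambda_i \ge 0$$

Dual Function

$$g(\lambda, v) = \inf_{x \in D} L(x, \lambda, v)$$

$g(\lambda, v)$ is a concave function, since it is a pointwise infimum of functions that are affine in $(\lambda, v)$.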
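A small worked illustration of these definitions, using a toy problem chosen here for convenience (it is not taken from the slides): minimize $x^2$ subject to $1 - x \le 0$. The Lagrangian $x^2 + \lambda (1 - x)$ is minimized at $x = \lambda / 2$, so $g(\lambda) = \lambda - \lambda^2 / 4$. The sketch below evaluates $g$ on a grid and compares it with the primal optimum.

```python
def lagrangian(x, lam):
    """L(x, lam) = f0(x) + lam * f1(x) for the toy problem f0(x) = x^2, f1(x) = 1 - x."""
    return x * x + lam * (1.0 - x)


def dual_function(lam):
    """g(lam) = inf_x L(x, lam); for this toy problem the infimum is at x = lam / 2."""
    return lagrangian(lam / 2.0, lam)


# Primal optimum: the smallest feasible point is x = 1, so f0(1) = 1.
primal_optimum = 1.0

# Evaluate the dual function on a grid of multipliers lam >= 0.
grid = [k / 10.0 for k in range(0, 41)]
values = [dual_function(lam) for lam in grid]
best_lam = grid[values.index(max(values))]

print("argmax of g on the grid:", best_lam)
print("max of g on the grid:", max(values), "primal optimum:", primal_optimum)
```

Every value of $g$ stays below the primal optimum (weak duality), and for this convex toy problem the best dual value matches it.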

SLIDE 5

SDCA

SLIDE 6

Reference

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization, Shai Shalev-Shwartz & Tong Zhang, JMLR 2013

SLIDE 7

Optimization Objective

Formulation

$$\min_{w \in \mathbb{R}^d} P(w), \qquad P(w) := \frac{1}{n} \sum_{i=1}^{n} \phi_i(w^\top x_i) + \frac{\lambda}{2} \|w\|^2$$

Parameters

  • $x_1, x_2, \ldots, x_n \in \mathbb{R}^d$; $\phi_1, \phi_2, \ldots, \phi_n$: scalar convex functions.
  • SGD: $O(1/n)$

Examples

  • SVM: $\phi_i(w^\top x_i) = \max\{0, 1 - y_i w^\top x_i\}$
  • Logistic Regression: $\phi_i(w^\top x_i) = \log(1 + \exp(-y_i w^\top x_i))$
  • Ridge Regression: $\phi_i(w^\top x_i) = (w^\top x_i - y_i)^2$
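The three example losses and the objective $P(w)$ can be written down directly. The following is a minimal sketch in plain Python; the function names (`hinge`, `logistic`, `squared`, `primal_objective`) are illustrative choices, not names from the slides.

```python
import math


def hinge(a, y):
    """SVM loss: max(0, 1 - y * a), where a = w^T x."""
    return max(0.0, 1.0 - y * a)


def logistic(a, y):
    """Logistic regression loss: log(1 + exp(-y * a))."""
    return math.log(1.0 + math.exp(-y * a))


def squared(a, y):
    """Ridge regression loss: (a - y)^2."""
    return (a - y) ** 2


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def primal_objective(w, X, Y, lam, loss):
    """P(w) = (1/n) * sum_i loss(w^T x_i, y_i) + (lam / 2) * ||w||^2."""
    n = len(X)
    avg_loss = sum(loss(dot(w, x), y) for x, y in zip(X, Y)) / n
    return avg_loss + 0.5 * lam * dot(w, w)


# Tiny usage example with two points in R^2.
X = [[1.0, 0.0], [0.0, 1.0]]
Y = [1.0, -1.0]
print(primal_objective([0.0, 0.0], X, Y, 0.1, hinge))
print(primal_objective([1.0, -1.0], X, Y, 0.1, hinge))
```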

SLIDE 8

Dual Problem

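Dual Problem

$$\max_{\alpha} D(\alpha), \qquad D(\alpha) = \frac{1}{n} \sum_{i=1}^{n} -\phi_i^*(-\alpha_i) - \frac{\lambda}{2} \left\| \frac{1}{\lambda n} \sum_{i=1}^{n} \alpha_i x_i \right\|^2$$

Conjugate function: $\phi_i^*(u) = \max_z \, (z u - \phi_i(z))$

Derivation

$$P(w) = \frac{1}{n} \sum_{i=1}^{n} \phi_i(w^\top x_i) + \frac{\lambda}{2} \|w\|^2$$

is equivalent to

$$P(y, z) = \frac{1}{n} \sum_{i=1}^{n} \phi_i(z_i) + \frac{\lambda}{2} \|y\|^2 \quad \text{s.t.} \quad y^\top x_i = z_i, \; i = 1, 2, \ldots, n$$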

SLIDE 9

Derivation

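Because the constraints $y^\top x_i = z_i$ are equalities, the multipliers $\alpha_i$ are unrestricted in sign. Writing each constraint as $z_i - y^\top x_i = 0$ gives

$$L(y, z, \alpha) = P(y, z) + \frac{1}{n} \sum_{i=1}^{n} \alpha_i (z_i - y^\top x_i)$$

$$\begin{aligned}
D(\alpha) &= \inf_{y, z} L(y, z, \alpha) \\
&= \frac{1}{n} \sum_{i=1}^{n} \inf_{z_i} \{\phi_i(z_i) + \alpha_i z_i\} + \inf_{y} \left\{ \frac{\lambda}{2} \|y\|^2 - \frac{1}{n} \sum_{i=1}^{n} \alpha_i y^\top x_i \right\} \\
&= \frac{1}{n} \sum_{i=1}^{n} -\phi_i^*(-\alpha_i) - \frac{\lambda}{2} \left\| \frac{1}{\lambda n} \sum_{i=1}^{n} \alpha_i x_i \right\|^2
\end{aligned}$$

Relationship

The infimum over $y$ is attained at

$$w(\alpha) = \frac{1}{\lambda n} \sum_{i=1}^{n} \alpha_i x_i$$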
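To make the primal-dual relationship concrete, here is a minimal sketch for the ridge regression loss $\phi_i(z) = (z - y_i)^2$. Its conjugate, $\phi_i^*(u) = u y_i + u^2 / 4$, is worked out for this example and is not stated on the slides; the helper names are illustrative. The code evaluates $P(w(\alpha))$ and $D(\alpha)$ and reports the duality gap, which should be non-negative for every $\alpha$.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def w_of_alpha(alpha, X, lam):
    """w(alpha) = (1 / (lam * n)) * sum_i alpha_i * x_i."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for a, x in zip(alpha, X):
        for j in range(d):
            w[j] += a * x[j] / (lam * n)
    return w


def primal(w, X, Y, lam):
    """P(w) with the squared loss phi_i(z) = (z - y_i)^2."""
    n = len(X)
    return sum((dot(w, x) - y) ** 2 for x, y in zip(X, Y)) / n + 0.5 * lam * dot(w, w)


def dual(alpha, X, Y, lam):
    """D(alpha), using -phi_i^*(-alpha_i) = alpha_i * y_i - alpha_i^2 / 4."""
    n = len(X)
    w = w_of_alpha(alpha, X, lam)
    return sum(a * y - a * a / 4.0 for a, y in zip(alpha, Y)) / n - 0.5 * lam * dot(w, w)


def duality_gap(alpha, X, Y, lam):
    return primal(w_of_alpha(alpha, X, lam), X, Y, lam) - dual(alpha, X, Y, lam)


# One data point x = 1, y = 1 with lam = 1: the gap equals (1.5 * alpha - 1)^2.
for a in [0.0, 0.5, 2.0 / 3.0, 1.0]:
    print(a, duality_gap([a], [[1.0]], [1.0], 1.0))
```

In this one-point example the gap closes at $\alpha = 2/3$, where $w(\alpha) = 2/3$ is also the primal minimizer.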

SLIDE 10

Assumptions

L-Lipschitz continuous: $|\phi_i(a) - \phi_i(b)| \le L |a - b|$

(1/γ)-smooth: A function $\phi_i : \mathbb{R} \to \mathbb{R}$ is $(1/\gamma)$-smooth if it is differentiable and its derivative is $(1/\gamma)$-Lipschitz.

Remark: If $\phi_i$ is $(1/\gamma)$-smooth, then $\phi_i^*$ is $\gamma$-strongly convex.
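One loss that meets both assumptions is the smoothed hinge loss that appears in the experiment captions. The sketch below assumes the piecewise form used in the referenced Shalev-Shwartz and Zhang paper (zero for $a \ge 1$, linear for $a \le 1 - \gamma$, quadratic in between), with $a = y_i w^\top x_i$; the slides themselves do not spell the formula out. It checks numerically that the function is 1-Lipschitz and that its derivative is $(1/\gamma)$-Lipschitz.

```python
def smoothed_hinge(a, gamma=1.0):
    """Smoothed hinge loss as a function of the margin a (assumed form)."""
    if a >= 1.0:
        return 0.0
    if a <= 1.0 - gamma:
        return 1.0 - a - gamma / 2.0
    return (1.0 - a) ** 2 / (2.0 * gamma)


def smoothed_hinge_grad(a, gamma=1.0):
    """Derivative of the smoothed hinge loss with respect to a."""
    if a >= 1.0:
        return 0.0
    if a <= 1.0 - gamma:
        return -1.0
    return -(1.0 - a) / gamma


def max_slopes(gamma, lo=-2.0, hi=3.0, steps=250):
    """Largest observed slopes of the loss and of its derivative on a grid."""
    pts = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    loss_slope = max(abs(smoothed_hinge(a, gamma) - smoothed_hinge(b, gamma)) / (b - a)
                     for a, b in zip(pts, pts[1:]))
    grad_slope = max(abs(smoothed_hinge_grad(a, gamma) - smoothed_hinge_grad(b, gamma)) / (b - a)
                     for a, b in zip(pts, pts[1:]))
    return loss_slope, grad_slope


for gamma in (0.5, 1.0):
    print(gamma, max_slopes(gamma))
```

The observed slope of the loss stays at or below 1 and the slope of its derivative stays at or below $1/\gamma$, matching the two definitions above.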

SLIDE 11

Algorithms

Figure 1: Procedure SDCA
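Since the figure is not reproduced in this transcript, the following is a minimal sketch of one SDCA instance rather than the exact procedure in the figure. It assumes the ridge regression loss $\phi_i(z) = (z - y_i)^2$, for which the coordinate-wise dual maximization has the closed form $\Delta\alpha_i = (y_i - x_i^\top w - \alpha_i / 2) / (1/2 + \|x_i\|^2 / (\lambda n))$, and it returns the last iterate instead of an averaged or randomly chosen one. The function name and the toy data are illustrative.

```python
import random


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def sdca_squared_loss(X, Y, lam, epochs=50, seed=0):
    """SDCA sketch for P(w) = (1/n) sum_i (w^T x_i - y_i)^2 + (lam/2) ||w||^2."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    alpha = [0.0] * n          # dual variables, alpha^(0) = 0
    w = [0.0] * d              # invariant: w = (1 / (lam * n)) * sum_i alpha_i x_i
    for _ in range(epochs * n):
        i = rng.randrange(n)   # pick a dual coordinate uniformly at random
        x, y = X[i], Y[i]
        # Closed-form maximizer of the dual objective along coordinate i.
        delta = (y - dot(w, x) - 0.5 * alpha[i]) / (0.5 + dot(x, x) / (lam * n))
        alpha[i] += delta
        for j in range(d):     # keep w consistent with the updated alpha
            w[j] += delta * x[j] / (lam * n)
    return w, alpha


X = [[0.6, 0.0], [0.0, 0.8], [0.5, 0.5]]
Y = [1.0, -1.0, 0.5]
w, alpha = sdca_squared_loss(X, Y, lam=0.5, epochs=100)
print("w =", w)
```

Maintaining $w$ incrementally means each iteration costs $O(d)$ and touches only one training example.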

SLIDE 12

Theorem

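Th1: Consider Procedure SDCA with $\alpha^{(0)} = 0$. Assume that $\phi_i$ is $L$-Lipschitz for all $i$. To obtain a duality gap of $\mathbb{E}[P(\bar{w}) - D(\bar{\alpha})] \le \varepsilon$, it suffices to have a total number of iterations of

$$T \ge T_0 + n + \frac{4 L^2}{\lambda \varepsilon}$$

Th2: Consider Procedure SDCA with $\alpha^{(0)} = 0$. Assume that $\phi_i$ is $(1/\gamma)$-smooth for all $i$. To obtain a duality gap of $\mathbb{E}[P(\bar{w}) - D(\bar{\alpha})] \le \varepsilon$, it suffices to have a total number of iterations of

$$T \ge \left( n + \frac{1}{\lambda \gamma} \right) \log \left( \left( n + \frac{1}{\lambda \gamma} \right) \cdot \frac{1}{\varepsilon} \right)$$
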
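To get a feel for the smooth-case bound (Th2), the short sketch below evaluates $(n + 1/(\lambda\gamma)) \log((n + 1/(\lambda\gamma)) / \varepsilon)$ for a few accuracies. The parameter values are arbitrary illustrations, not settings from the experiments, and the Lipschitz-case bound is left out because $T_0$ is not defined on the slide.

```python
import math


def smooth_iteration_bound(n, lam, gamma, eps):
    """Sufficient number of SDCA iterations for a duality gap of eps (smooth losses)."""
    c = n + 1.0 / (lam * gamma)
    return c * math.log(c / eps)


n, lam, gamma = 1000, 1e-3, 1.0
for eps in (1e-2, 1e-4, 1e-6):
    t = smooth_iteration_bound(n, lam, gamma, eps)
    print(f"eps = {eps:g}: T >= {t:.0f} iterations, about {t / n:.1f} passes over the data")
```

The dependence on $\varepsilon$ is only logarithmic, which is the linear convergence referred to in the experiments that follow.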
SLIDE 13

Linear Convergence For Smooth Hinge-Loss

Figure 2: Experiments with the smoothed hinge-loss (γ = 1).

SLIDE 14

Convergence For Non-smooth Hinge-loss

Figure 3: Experiments with the hinge-loss (non-smooth)

SLIDE 15

Effect of Smoothness Parameter

Figure 4: Duality gap as a function of the number of rounds for different values of γ

SLIDE 16

Comparison To SGD

Figure 5: Comparing the primal sub-optimality of SDCA and SGD for the smoothed hinge-loss (γ = 1)

SLIDE 17

Asynchronous SDCA

SLIDE 18

Introduction

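Reference: PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent

Primal Problem

$$\min_{w \in \mathbb{R}^d} P(w) := \frac{1}{2} \|w\|^2 + \sum_{i=1}^{n} \ell_i(w^\top x_i)$$

Dual Problem

$$\min_{\alpha \in \mathbb{R}^n} D(\alpha) := \frac{1}{2} \left\| \sum_{i=1}^{n} \alpha_i x_i \right\|^2 + \sum_{i=1}^{n} \ell_i^*(-\alpha_i)$$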

SLIDE 19

Algorithm

Figure 6: Parallel Asynchronous Stochastic dual Co-ordinate Descent (PASSCoDe)

SLIDE 20

Operation

PASSCoDe-Lock

  • Step 1.5: lock variables in $N_i := \{w_t \mid (x_i)_t \ne 0\}$.
  • The locks are then released after step 3.
  • May be equivalent to an inconsistent read.

PASSCoDe-Atomic

  • Step 3: For each $j \in N(i)$, update $w_j \leftarrow w_j + \Delta\alpha_i (x_i)_j$ atomically.
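The locking scheme can be sketched with Python threads. This is an illustration of the protocol only, under several assumptions that are not in the slides: the loss is $\ell_i(z) = (z - y_i)^2 / 2$ (chosen because the coordinate update has the closed form $\Delta\alpha_i = (y_i - w^\top x_i - \alpha_i) / (1 + \|x_i\|^2)$), each $x_i$ is a sparse dict with at least one non-zero entry, and locks are acquired in sorted index order to avoid deadlock. Because of Python's global interpreter lock, the sketch shows correctness of the locked updates, not parallel speedup.

```python
import random
import threading


def dot_sparse(w, x):
    """w^T x for a sparse vector x stored as {index: value}."""
    return sum(w[j] * v for j, v in x.items())


def passcode_lock(X, Y, d, n_threads=4, updates_per_thread=2000, seed=0):
    """Lock-based asynchronous dual coordinate descent (illustrative sketch)."""
    n = len(X)
    alpha = [0.0] * n
    w = [0.0] * d                      # maintained as w = sum_i alpha_i x_i
    locks = [threading.Lock() for _ in range(d)]

    def worker(tid):
        rng = random.Random(seed + tid)
        for _ in range(updates_per_thread):
            i = rng.randrange(n)       # step 1: pick a dual coordinate
            x, y = X[i], Y[i]
            support = sorted(x)        # N_i: indices t with (x_i)_t != 0
            for j in support:          # step 1.5: lock w_t for t in N_i
                locks[j].acquire()
            try:
                # Step 2: exact minimizer of the dual along coordinate i.
                sq_norm = sum(v * v for v in x.values())
                delta = (y - dot_sparse(w, x) - alpha[i]) / (1.0 + sq_norm)
                alpha[i] += delta
                for j in support:      # step 3: w <- w + delta * x_i
                    w[j] += delta * x[j]
            finally:
                for j in reversed(support):
                    locks[j].release()  # locks released after step 3

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w, alpha


X = [{0: 1.0, 1: 0.5}, {1: 1.0, 2: -0.5}, {0: 0.5, 2: 1.0}, {3: 1.0}]
Y = [1.0, -1.0, 0.5, 2.0]
w, alpha = passcode_lock(X, Y, d=4)
print("w =", w)
```

Two updates whose supports $N_i$ are disjoint can proceed at the same time, while updates that share a feature are serialized by the per-coordinate locks.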

SLIDE 21

Linear Convergence Rate of PASSCoDe-Atomic

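Theorem: If

$$\frac{6 \tau (\tau + 1)^2 e M}{\sqrt{n}} \le 1$$

and

$$1 \ge \frac{2 L_{\max}}{R_{\min}^2} \left( 1 + \frac{e \tau M}{\sqrt{n}} \right) \frac{\tau^2 M^2 e^2}{n},$$

then PASSCoDe-Atomic has a global linear convergence rate in expectation, that is,

$$\mathbb{E}\left[ D(\alpha^{j+1}) \right] - D(\alpha^*) \le \eta \left( \mathbb{E}\left[ D(\alpha^{j}) \right] - D(\alpha^*) \right),$$

where $\alpha^*$ is the optimal solution and

$$\eta = 1 - \frac{\kappa}{L_{\max}} \left( 1 - \frac{2 L_{\max}}{R_{\min}^2} \left( 1 + \frac{e \tau M}{\sqrt{n}} \right) \frac{\tau^2 M^2 e^2}{n} \right)$$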

SLIDE 22

Convergence and Efficiency

Figure 7: Convergence and Efficiency for news20, covtype, rcv1 datasets

SLIDE 23

Speedup

Figure 8: Speedup for news20, covtype, rcv1 datasets

SLIDE 24

Q & A