SLIDE 1

Accelerated Stochastic Subgradient Methods under Local Error Bound Condition

Yi Xu (yi-xu@uiowa.edu)
Computer Science Department, The University of Iowa
Co-authors: Tianbao Yang, Qihang Lin

VALSE Webinar, April 18, 2018

SLIDE 2

Outline

1. Introduction
2. Accelerated Stochastic Subgradient Methods
3. Applications and experiments
4. Conclusion

SLIDE 3

Introduction

SLIDE 4

Introduction

Example in machine learning

Table: house price

house   size (sqf)   price ($1k)
1       68           500
2       220          800
...     ...          ...
19      359          1500
20      266          820

(Figure: scatter plot of y (price) versus x (size).)

Linear model:

y = f(w) = xw,

where y = price, x = size.


SLIDE 6

Introduction

(Figure: the fitted line f(x) = xw over the scatter plot of y (price) versus x (size); each point (xi, yi) contributes a squared residual |yi − f(xi)|².)

Total squared error: |y1 − x1w|² + |y2 − x2w|² + · · · + |y20 − x20w|²

SLIDE 7

Introduction

Least squares regression:

min_{w ∈ R} F(w) = (1/n) Σ_{i=1}^n (yi − xiw)²

square loss, smooth

Least absolute deviations:

min_{w ∈ R} F(w) = (1/n) Σ_{i=1}^n |yi − xiw|

absolute loss, non-smooth

(Figure: the square loss and the absolute loss as functions of the residual.)

High dimensional model:

min_{w ∈ R^d} F(w) = (1/n) Σ_{i=1}^n |yi − xi⊤w| + λ‖w‖₁ = (1/n)‖Xw − y‖₁ + λ‖w‖₁

with an ℓ1 regularizer

The absolute loss is more robust to outliers; ℓ1 norm regularization is used for feature selection.
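To make the three objectives concrete, here is a minimal sketch in Python/NumPy; the data matrix X, the targets y, and λ are illustrative assumptions, not values from the talk.

```python
import numpy as np

def least_squares(w, X, y):
    # square loss: smooth, but sensitive to outliers
    return np.mean((X @ w - y) ** 2)

def least_abs_dev(w, X, y):
    # absolute loss: non-smooth, more robust to outliers
    return np.mean(np.abs(X @ w - y))

def lad_l1(w, X, y, lam):
    # high dimensional model: absolute loss + l1 regularizer (feature selection)
    return np.mean(np.abs(X @ w - y)) + lam * np.sum(np.abs(w))
```

A single corrupted target inflates least_squares quadratically but least_abs_dev only linearly, which is the robustness point made on the slide.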



SLIDE 10

Introduction

Machine learning problems:

min_{w ∈ R^d} F(w) = (1/n) Σ_{i=1}^n ℓ(w; xi, yi) + r(w)

loss function + regularizer

Classification:
  hinge loss: ℓ(w; x, y) = max(0, 1 − y x⊤w)

Regression:
  absolute loss: ℓ(w; x, y) = |x⊤w − y|
  square loss: ℓ(w; x, y) = (x⊤w − y)²

Regularizer:
  ℓ1 norm: r(w) = λ‖w‖₁
  ℓ2² norm: r(w) = (λ/2)‖w‖₂²
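As a minimal sketch (assuming NumPy arrays X, y and a hypothetical λ), the composite objective above can be assembled directly from these pieces, here with the hinge loss and the ℓ1 regularizer:

```python
import numpy as np

def hinge_loss(w, x, y):
    # classification loss from the slide
    return max(0.0, 1.0 - y * (x @ w))

def l1_reg(w, lam):
    # l1 regularizer, encourages a sparse w
    return lam * np.sum(np.abs(w))

def objective(w, X, y, lam):
    # F(w) = (1/n) sum_i loss(w; x_i, y_i) + r(w)
    n = X.shape[0]
    return sum(hinge_loss(w, X[i], y[i]) for i in range(n)) / n + l1_reg(w, lam)
```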

SLIDE 11

Introduction

Convex optimization problem

Problem:

min_{w ∈ R^d} F(w)

F(w) : R^d → R is convex

Optimal value: F(w∗) = min_{w ∈ R^d} F(w)
Optimal solution: w∗

Goal: find a solution ŵ such that

F(ŵ) − F(w∗) ≤ ǫ,

where 0 < ǫ ≪ 1 (e.g. 10⁻⁷); such a ŵ is called an ǫ-optimal solution.

SLIDE 12

Introduction

Complexity measure

Most optimization algorithms are iterative:

wt+1 = wt + ∆wt   (for some update direction ∆wt)

Iteration complexity: the number of iterations T(ǫ) needed so that

F(wT) − F(w∗) ≤ ǫ,

where 0 < ǫ ≪ 1.

Time complexity: T(ǫ) × C(n, d), where C(n, d) is the per-iteration cost.

(Figure: the objective value decreasing over iterations, reaching the ǫ level after T iterations.)

SLIDE 13

Introduction

Gradient Descent (GD)

Problem: min_{w ∈ R^d} F(w), with F smooth:

F(w) ≤ F(wt) + ⟨∇F(wt), w − wt⟩ + (L/2)‖w − wt‖₂²

wt+1 = arg min_w F(wt) + ⟨∇F(wt), w − wt⟩ + (L/2)‖w − wt‖₂²

GD: initial w0 ∈ R^d, for t = 0, 1, . . .

wt+1 = wt − η∇F(wt)

with step size η = 1/L > 0. Simple & easy to implement.

(Figure: F(w), the minimizer w∗, the starting point (w0, F(w0)), and the gradient ∇F(w0) > 0.)

Theorem ([Nesterov, 2004]). After T = O(1/ǫ) iterations, F(wT) − F(w∗) ≤ ǫ.
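A minimal sketch of this update for the smooth least-squares objective F(w) = (1/n)‖Xw − y‖₂², assuming NumPy data X, y; the step size 1/L uses the smoothness constant of this particular F.

```python
import numpy as np

def gradient_descent(X, y, T):
    n, d = X.shape
    L = 2 * np.linalg.norm(X, 2) ** 2 / n    # smoothness constant of (1/n)||Xw - y||_2^2
    eta = 1.0 / L                            # constant step size eta = 1/L
    w = np.zeros(d)
    for _ in range(T):
        grad = 2 * X.T @ (X @ w - y) / n     # full gradient: one pass over all data
        w = w - eta * grad
    return w
```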



SLIDE 16

Introduction

Accelerated Gradient Descent (AGD)

Nesterov's momentum trick. AGD: initial w0, v1 = w0, for t = 1, 2, . . .:

wt = vt − η∇F(vt)
vt+1 = wt + βt(wt − wt−1)

βt ∈ (0, 1) is the momentum parameter.

(Figure: Nesterov's accelerated gradient as a gradient step followed by a momentum step.)

Theorem ([Beck and Teboulle, 2009]). Let η = 1/L, βt = (θt − 1)/θt+1 ∈ (0, 1) with θt+1 = (1 + √(1 + 4θt²))/2 and θ1 = 1; then after T = O(1/√ǫ) iterations, F(wT) − F(w∗) ≤ ǫ.
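A minimal sketch of this scheme, assuming the caller supplies a gradient oracle grad_F and the smoothness constant L (both hypothetical here):

```python
import numpy as np

def agd(grad_F, L, w0, T):
    eta = 1.0 / L
    w_prev = w0.copy()
    v = w0.copy()
    theta = 1.0                                        # theta_1 = 1
    for _ in range(T):
        w = v - eta * grad_F(v)                        # gradient step at the extrapolated point
        theta_next = (1 + np.sqrt(1 + 4 * theta ** 2)) / 2
        beta = (theta - 1) / theta_next                # momentum parameter beta_t in (0, 1)
        v = w + beta * (w - w_prev)                    # momentum step
        w_prev, theta = w, theta_next
    return w_prev
```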

SLIDE 17

Introduction

SubGradient (SG) descent

Problem: min_{w ∈ R^d} F(w), with F non-smooth. SG: initial w0, for t = 0, 1, . . .

wt+1 = wt − ηt ∂F(wt)

where the step size ηt is decreased every iteration.

(Figure: subgradients of a non-smooth function at a kink.)

Theorem ([Nesterov, 2004]). After T = O(1/ǫ²) iterations, F(wT) − F(w∗) ≤ ǫ.
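A minimal sketch on the non-smooth least-absolute-deviations objective F(w) = (1/n)‖Xw − y‖₁, with the common decreasing schedule ηt = η0/√(t+1); η0 and the data are assumptions.

```python
import numpy as np

def subgradient_descent(X, y, eta0, T):
    n, d = X.shape
    w = np.zeros(d)
    for t in range(T):
        g = X.T @ np.sign(X @ w - y) / n          # a subgradient of (1/n)||Xw - y||_1 at w
        w = w - (eta0 / np.sqrt(t + 1)) * g       # step size decreases every iteration
    return w
```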


SLIDE 19

Introduction

Summary of time complexity

min_{w ∈ R^d} F(w) = (1/n) Σ_{i=1}^n fi(w; xi, yi)

Method   Time complexity   Smooth
GD       O(nd/ǫ)           YES
AGD      O(nd/√ǫ)          YES
SG       O(nd/ǫ²)          NO

GD: Gradient Descent; AGD: Accelerated Gradient Descent; SG: SubGradient descent.

SLIDE 20

Introduction

Challenge of deterministic methods

Computing the gradient is expensive:

min_{w ∈ R^d} F(w) := (1/n) Σ_{i=1}^n fi(w; xi, yi),    ∇F(w) := (1/n) Σ_{i=1}^n ∇fi(w; xi, yi)

When n and d are large (Big Data), computing the gradient requires a pass through all data points, and this expensive computation is needed at every update step.

SLIDE 21

Introduction

Stochastic Gradient Descent (SGD)

min_{w ∈ R^d} F(w) := E_{ξ∼P}[f(w; ξ)]

SGD: initial w0, for t = 0, 1, . . .: sample one data point ξt = (xt, yt),

wt+1 = wt − ηt ∇f(wt; ξt)

where ηt is decreased every iteration. Simple & memory efficient; the drawback is the variance of the stochastic gradient, which leads to slow convergence.

(Figure: trajectories of Gradient Descent versus Stochastic Gradient Descent.)

Theorem ([Nemirovski et al., 2009]). After T = O(log(1/δ)/ǫ²) iterations, F(wT) − F(w∗) ≤ ǫ with probability 1 − δ.
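A minimal sketch of SGD on the least-squares objective: each step touches one sampled example instead of the whole data set. The 1/√t step-size schedule and the data are illustrative assumptions.

```python
import numpy as np

def sgd(X, y, eta0, T, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(T):
        i = rng.integers(n)                         # sample one data point
        grad_i = 2 * (X[i] @ w - y[i]) * X[i]       # gradient of (x_i^T w - y_i)^2 only
        w = w - (eta0 / np.sqrt(t + 1)) * grad_i    # decreasing step size
    return w
```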

SLIDE 22

Introduction

Stochastic SubGradient (SSG) descent

Problem:

min_{w ∈ R^d} F(w) = E_{ξ∼P}[f(w; ξ)]

SSG: initial w0, for t = 0, 1, . . .: sample one data point ξt,

wt+1 = wt − ηt ∂f(wt; ξt)

where ηt is decreased every iteration.

Theorem ([Hazan and Kale, 2011]). After T = O(log(1/δ)/ǫ²) iterations, F(wT) − F(w∗) ≤ ǫ with probability 1 − δ.

SLIDE 23

Introduction

Summary of time complexity

min_{w ∈ R^d} F(w) = (1/n) Σ_{i=1}^n fi(w; xi, yi)

Method   Time complexity   Smooth
SGD      O(d/ǫ²)           YES
SSG      O(d/ǫ²)           NO

SGD: Stochastic Gradient Descent; SSG: Stochastic SubGradient descent.

SGD cannot exploit the smoothness property to obtain a faster rate.

SLIDE 24

Introduction

How can we do better?

Assume strong global conditions (e.g., strong convexity, smoothness), i.e., restrict to a smaller family of problems.

Strongly convex problems:

F(x) ≥ F(y) + ∂F(y)⊤(x − y) + (λ/2)‖x − y‖₂²

λ > 0: strong convexity parameter.

SSG with ηt = 1/(λt) enjoys an O(1/(λǫ)) iteration complexity.

Strong convexity is sometimes too good to be true.
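A minimal sketch of SSG with the step size ηt = 1/(λt) mentioned above, applied to a λ-strongly-convex example, the ℓ2-regularized hinge-loss objective; the data and λ are illustrative assumptions.

```python
import numpy as np

def ssg_strongly_convex(X, y, lam, T, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(n)
        margin = y[i] * (X[i] @ w)
        g_loss = -y[i] * X[i] if margin < 1 else np.zeros(d)  # subgradient of the hinge loss
        w = w - (1.0 / (lam * t)) * (g_loss + lam * w)         # step size eta_t = 1/(lambda * t)
    return w
```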

SLIDE 25

Introduction

Non-smooth and non-strongly convex problems in ML

Robust regression:

min_w (1/n) Σ_{i=1}^n |w⊤xi − yi|^p,   p ∈ [1, 2)

Sparse classification:

min_w (1/n) Σ_{i=1}^n max(0, 1 − yi w⊤xi) + λ‖w‖₁

SLIDE 26

Accelerated Stochastic Subgradient Methods

SLIDE 27

Accelerated Stochastic Subgradient Methods

The contributions of our paper

Y. Xu, Q. Lin, and T. Yang. Stochastic convex optimization: Faster local growth implies faster global convergence. In ICML, pages 3821–3830, 2017.

A new theory of stochastic convex optimization:
A broader family of conditions: the Local Error Bound Condition
Faster global convergence under the Local Error Bound Condition
Applications in machine learning

SLIDE 28

Accelerated Stochastic Subgradient Methods

Local error bound (LEB) condition

Definition. If there exist a constant c > 0 and a local growth rate θ ∈ (0, 1] such that

‖w − w∗‖₂ ≤ c (F(w) − F(w∗))^θ,   ∀w ∈ Sǫ,   (1)

then we say F(w) satisfies a local error bound condition (also known as a local growth condition).

Sǫ = {w ∈ R^d : F(w) − F∗ ≤ ǫ}: the ǫ-sublevel set.

The LEB condition is a local sharpness measure of the function.

(Figure: F(x) = |x| (θ = 1), |x|^1.5 (θ = 2/3), and |x|² (θ = 0.5) near x = 0.)
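As a small numeric illustration (not from the talk), the one-dimensional examples in the figure satisfy (1) near the minimizer x∗ = 0 with c = 1 and θ = 1/p for F(x) = |x|^p:

```python
import numpy as np

# check |x - x*| <= c * (F(x) - F*)^theta with x* = 0, F* = 0, c = 1, on (0, 0.1]
x = np.linspace(1e-3, 0.1, 100)
for p, theta in [(1.0, 1.0), (1.5, 2.0 / 3.0), (2.0, 0.5)]:
    ratio = x / (x ** p) ** theta                        # equals x^(1 - p*theta) = 1 when theta = 1/p
    print(f"p = {p}: max ratio = {ratio.max():.6f}")     # never exceeds c = 1
```

Sharper functions (smaller p, larger θ) grow faster away from the minimizer, which is what ASSG exploits.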

SLIDE 29

Accelerated Stochastic Subgradient Methods

Sketch of accelerated algorithm

(Figure: the iterates of ASSG versus SSG.)

SLIDE 30

Accelerated Stochastic Subgradient Methods

Accelerated Stochastic SubGradient (ASSG) method

1: Set η1, D1, K and t
2: for k = 1, . . . , K do
3:   wk = SSG(wk−1, ηk, Dk, t)
4:   ηk+1 = ηk/2, Dk+1 = Dk/2
5: end for

SSG(w1, η, D, t): for τ = 1, . . . , t

wτ+1 = Proj_{‖w − w1‖₂ ≤ D}[wτ − η ∂f(wτ; zτ)]

Output: w̄ = (1/t) Σ_{τ=1}^t wτ

Theorem ([Xu et al., 2017]). After T = O(t log(1/ǫ)) iterations with t ≥ log(1/δ) G² c² / ǫ^{2(1−θ)}, F(wK) − F∗ ≤ 2ǫ with probability 1 − δ.
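A minimal sketch of ASSG for the ℓ1-regularized absolute-loss objective, with the inner routine performing projected stochastic subgradient steps inside a shrinking ball. The data, λ, and the initial η1, D1 are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def stoch_subgrad(w, X, y, lam, rng):
    # stochastic subgradient of |x_i^T w - y_i| + lam * ||w||_1 at one sampled example
    i = rng.integers(X.shape[0])
    return np.sign(X[i] @ w - y[i]) * X[i] + lam * np.sign(w)

def ssg_stage(w1, eta, D, t, X, y, lam, rng):
    # inner loop: iterates are projected back onto the ball {w : ||w - w1||_2 <= D}
    w, w_sum = w1.copy(), np.zeros_like(w1)
    for _ in range(t):
        w_sum += w                                     # the stage returns the averaged iterate
        w = w - eta * stoch_subgrad(w, X, y, lam, rng)
        dist = np.linalg.norm(w - w1)
        if dist > D:
            w = w1 + (w - w1) * (D / dist)             # projection onto the ball
    return w_sum / t

def assg(w0, eta1, D1, K, t, X, y, lam, seed=0):
    rng = np.random.default_rng(seed)
    w, eta, D = w0.copy(), eta1, D1
    for _ in range(K):                                 # K stages
        w = ssg_stage(w, eta, D, t, X, y, lam, rng)
        eta, D = eta / 2, D / 2                        # halve the step size and the radius each stage
    return w
```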

SLIDE 31

Accelerated Stochastic Subgradient Methods

Practical variant: ASSG with Restarting (RASSG)

Setting t ≥ log(1/δ) G² c² / ǫ^{2(1−θ)} requires c, which is usually unknown. A practical variant:

1: Input: D1^(1), t1, w^(0) and η1 = ǫ0/(3G²)
2: for s = 1, 2, . . . , S do
3:   Let w^(s) = ASSG(w^(s−1), K, ts, D1^(s))
4:   Let ts+1 = ts · 2^{2(1−θ)}, D1^(s+1) = D1^(s) · 2^{1−θ}
5: end for

Another level of restarting: t is increased by a factor of 2^{2(1−θ)}, and the iteration complexity remains the same.
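A minimal sketch of the restarting wrapper on top of the assg() sketch above; θ, G, ǫ0 and the initial t1, D1 are illustrative assumptions, and the handling of η1 across restarts is simplified here.

```python
def rassg(w0, D1, t1, K, S, theta, G, eps0, X, y, lam):
    # restart ASSG, growing the inner iteration budget and the initial radius each time
    w, D, t = w0.copy(), D1, float(t1)
    eta1 = eps0 / (3 * G ** 2)              # initial step size from the slide
    for _ in range(S):
        w = assg(w, eta1, D, K, int(t), X, y, lam)
        t *= 2 ** (2 * (1 - theta))         # increase t by a factor of 2^{2(1 - theta)}
        D *= 2 ** (1 - theta)               # increase the initial radius by 2^{1 - theta}
    return w
```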

SLIDE 32

Accelerated Stochastic Subgradient Methods

Summary of time complexity

min_{w ∈ R^d} F(w) = E_{ξ∼P}[f(w; ξ)]

Table: time complexities for non-smooth stochastic optimization methods (θ ∈ (0, 1])

Method   Time complexity       Condition
SSG      O(d/ǫ²)               Stochastic structure
ASSG     O(d/ǫ^{2(1−θ)})       Stochastic structure and LEB

SSG: Stochastic SubGradient descent; ASSG: Accelerated Stochastic SubGradient descent.

SLIDE 33

Applications and experiments

SLIDE 34

Applications and experiments

Piecewise linear convex optimization

θ = 1 ⟹ ASSG achieves an O(log(1/ǫ)) iteration complexity.

Examples:

Robust regression:

min_w (1/n) Σ_{i=1}^n |w⊤xi − yi|

Sparse classification:

min_w (1/n) Σ_{i=1}^n max(0, 1 − yi w⊤xi) + λ‖w‖₁

SLIDE 35

Applications and experiments

Piecewise quadratic convex optimization

θ = 1/2 ⟹ ASSG achieves an O(1/ǫ) iteration complexity.

Examples:

Least-squares regression + ℓ1 regularizer:

min_w (1/n) Σ_{i=1}^n (w⊤xi − yi)² + λ‖w‖₁

Squared hinge loss + ℓ1 regularizer:

min_w (1/n) Σ_{i=1}^n max(0, 1 − yi w⊤xi)² + λ‖w‖₁

Huber loss:

ℓ(w⊤xi, yi) = (1/2)(w⊤xi − yi)²        if |w⊤xi − yi| ≤ γ
              γ(|w⊤xi − yi| − γ/2)      if |w⊤xi − yi| > γ
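A minimal sketch of the Huber loss just defined; γ is the threshold between the quadratic and the linear regime, and the vectorized form is an assumption for convenience.

```python
import numpy as np

def huber(pred, y, gamma):
    r = np.abs(pred - y)
    return np.where(r <= gamma,
                    0.5 * r ** 2,                # quadratic where the residual is small
                    gamma * (r - 0.5 * gamma))   # linear where the residual exceeds gamma
```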

SLIDE 36

Applications and experiments

Structured composite non-smooth problems

F(w) = h(Aw) + R(w), where h(·) is strongly convex (no smoothness assumption is required) and R(w) is polyhedral.

θ = 1/2 ⟹ ASSG achieves an O(1/ǫ) iteration complexity.

Example: robust regression + ℓ1 regularizer:

min_w (1/n) Σ_{i=1}^n |w⊤xi − yi|^p + λ‖w‖₁,   p ∈ (1, 2)

SLIDE 37

Applications and experiments

Problems with intermediate θ

ℓp norm regression with an ℓ1 constraint:

min_{‖w‖₁ ≤ B} (1/n) Σ_{i=1}^n (w⊤xi − yi)^{2p},   p ∈ N+,

where θ = 1/(2p).

SLIDE 38

Applications and experiments

Experiments: SSG vs. ASSG

(Figures: log10(objective gap) versus number of iterations (×10⁷).
Left: hinge loss + ℓ1 norm on covtype (classification).
Right: Huber loss + ℓ1 norm on million songs (regression).
Methods compared: SSG, ASSG (t = 10⁶), RASSG (t1 = 10⁶).)

SLIDE 39

Applications and experiments

Experiments: ASSG vs. other baselines

(Figures: objective versus CPU time (s).
Left: squared hinge + ℓ1 norm on url.
Right: Huber loss + ℓ1 norm on E2006-log1p.
Methods compared: SSG, SAGA, SVRG++, RASSG.)

SLIDE 40

Conclusion

SLIDE 41

Conclusion

We presented our recent work, ASSG, which attains a lower iteration complexity for solving non-smooth stochastic optimization problems.

Method   Time complexity       Problem
SSG      O(d/ǫ²)               Stochastic structure
ASSG     O(d/ǫ^{2(1−θ)})       Stochastic structure + LEB

We studied examples in machine learning that satisfy the LEB condition.

Open questions: RASSG for θ = 1? Nonconvex problems?

SLIDE 42

Thank You! Questions?

SLIDE 43

References

Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2:183–202, 2009.

Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. In Proceedings of the 24th Annual Conference on Learning Theory (COLT), pages 421–436, 2011.

Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19:1574–1609, 2009.

Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Applied Optimization. Kluwer Academic Publishers, 2004. ISBN 1-4020-7553-7.

Yi Xu, Qihang Lin, and Tianbao Yang. Stochastic convex optimization: Faster local growth implies faster global convergence. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 3821–3830, 2017.