SLIDE 1

Stochastic optimization and sparse statistical recovery: An optimal algorithm for high dimensions

Alekh Agarwal, Microsoft Research
Joint work with Sahand Negahban and Martin Wainwright
Workshop on Optimization and Statistical Learning 2013, Les Houches, France

SLIDE 4

Introduction

Sparse optimization:

θ∗ = arg min_{θ ∈ R^d} E_P[ℓ(θ; z)] = arg min_θ L(θ), such that θ∗ is s-sparse

• Loss function ℓ is convex
• P unknown, but we can sample from it
• High-dimensional setup: n ≪ d
• Want a linear-time and statistically (near-)optimal algorithm

SLIDE 6

Example 1: Computational genomics

[Figure: binary labels y ∈ {−1, +1}^n, an n × d design matrix X of genome sequences, and an s-sparse θ∗ with support S]

• Predict disease susceptibility from the genome
• Susceptibility depends on very few genes, so θ∗ is sparse
• Sparse logistic regression: θ∗ = arg min_θ E_P[log(1 + exp(−y θ^T x))]
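As a concrete illustration (my own sketch, not from the slides), here is the per-sample logistic loss and the stochastic gradient a streaming method would use; the synthetic data generation is an assumption.

```python
import numpy as np

def logistic_loss(theta, x, y):
    """Per-sample loss log(1 + exp(-y * <theta, x>)) with label y in {-1, +1}."""
    return np.log1p(np.exp(-y * x.dot(theta)))

def logistic_grad(theta, x, y):
    """Stochastic gradient of the logistic loss at a single sample (x, y)."""
    margin = y * x.dot(theta)
    # d/dtheta log(1 + exp(-margin)) = -y * x * sigmoid(-margin)
    return -y * x / (1.0 + np.exp(margin))

# Toy usage with an s-sparse ground truth (assumed Gaussian features).
rng = np.random.default_rng(0)
d, s = 1000, 10
theta_star = np.zeros(d)
theta_star[:s] = 1.0
x = rng.standard_normal(d)
y = 1 if rng.random() < 1.0 / (1.0 + np.exp(-x.dot(theta_star))) else -1
print(logistic_loss(np.zeros(d), x, y), np.linalg.norm(logistic_grad(np.zeros(d), x, y)))
```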

SLIDE 7

Example 2: Compressed sensing

[Figure: y = Xθ∗ + w, with an n × d measurement matrix X, noise vector w, and s-sparse θ∗ with support S]

• Recover an unknown signal θ∗ from noisy measurements
• Sparse linear regression: θ∗ = arg min_θ E_P[(y − θ^T x)²]
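A minimal sketch (my own illustration; dimensions and noise level are assumed) of the observation model y = Xθ∗ + w and the per-sample squared-loss gradient a stochastic method would use.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, s, sigma = 500, 10_000, 20, 0.5       # n << d (assumed values)

# s-sparse ground-truth signal
theta_star = np.zeros(d)
support = rng.choice(d, size=s, replace=False)
theta_star[support] = rng.standard_normal(s)

# Noisy random measurements y = X @ theta* + w
X = rng.standard_normal((n, d))
w = sigma * rng.standard_normal(n)
y = X @ theta_star + w

def sq_loss_grad(theta, x_i, y_i):
    """Stochastic gradient of the squared loss (y_i - <theta, x_i>)^2 / 2 at one sample."""
    return (x_i.dot(theta) - y_i) * x_i

print(np.linalg.norm(sq_loss_grad(np.zeros(d), X[0], y[0])))
```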

SLIDE 9

Approach 1: M-estimation (batch optimization)

• Draw n i.i.d. samples z_1, …, z_n and obtain θ̂_n as the regularized M-estimator

θ̂_n = arg min_θ { (1/n) Σ_{i=1}^n ℓ(θ; z_i) + λ_n ‖θ‖_1 }

• Statistical arguments for consistency: θ̂_n → θ∗
• Convex optimization to compute θ̂_n
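The slides do not fix a particular solver; as an illustrative sketch (an assumption), here is proximal gradient descent (ISTA) on the ℓ1-regularized empirical squared loss, whose per-iteration cost is exactly the O(nd) full gradient pass discussed on the next slide.

```python
import numpy as np

def soft_threshold(v, tau):
    """Prox operator of tau * ||.||_1 (componentwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(X, y, lam, step, iters=500):
    """Proximal gradient descent on (1/(2n)) * ||y - X @ theta||^2 + lam * ||theta||_1."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / n          # one full pass over the data: O(nd)
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta

# Toy usage: recover a 5-sparse vector from 100 noisy measurements in 500 dimensions.
# The step size must stay below 1 / L, where L is the largest eigenvalue of X.T @ X / n.
rng = np.random.default_rng(0)
n, d = 100, 500
theta_star = np.zeros(d)
theta_star[:5] = 3.0
X = rng.standard_normal((n, d))
y = X @ theta_star + 0.1 * rng.standard_normal(n)
theta_hat = ista(X, y, lam=0.05, step=0.05, iters=500)
print("estimation error:", np.linalg.norm(theta_hat - theta_star))
```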

SLIDE 11

Batch optimization

• Convergence depends on properties of the sample objective (1/n) Σ_{i=1}^n ℓ(θ; z_i) + λ_n ‖θ‖_1
• Sample loss is not (globally) strongly convex for n < d
• Poor smoothness when n ≪ d

[Figure: surface plot of the sample loss]

• But: smooth and strongly convex in sparse directions
• Example: least-squares loss with random design

SLIDE 13

Fast convergence of gradient descent

• We prove (global) linear convergence of gradient descent, based on the sparse condition number of (1/n) Σ_{i=1}^n ℓ(θ; z_i)

[Figure: log ‖θ^t − θ̂‖ (rescaled) vs. iteration count for n = 2500 and p ∈ {5000, 10000, 20000}; second panel with α = 16.3069]

SLIDE 15

Computational complexity of batch optimization

• Convergence rate captures the number of iterations
• Each iteration has complexity O(nd): one pass over the data per iteration
• But we wanted a linear-time algorithm!

SLIDE 17

Approach 2: Stochastic optimization

• Directly minimize E_P[ℓ(θ; z)]
• Use samples to obtain gradient estimates: θ^{t+1} = θ^t − α_t ∇ℓ(θ^t; z_t)
• Stop after one pass over the data
• Statistically often competitive with batch, i.e. the one-pass stochastic estimate satisfies ‖θ_n − θ∗‖_2 ≈ ‖θ̂_n − θ∗‖_2
• Precise rates depend on the problem structure
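A minimal one-pass SGD sketch for the sparse-linear-regression example (my own illustration; the 1/√t step-size schedule and constants are assumptions, and plain SGD here exploits neither sparsity nor the local strong convexity).

```python
import numpy as np

def sgd_one_pass(samples, d, step0=0.01):
    """Plain SGD: one pass over the stream, theta_{t+1} = theta_t - alpha_t * grad."""
    theta = np.zeros(d)
    for t, (x, y) in enumerate(samples, start=1):
        grad = (x.dot(theta) - y) * x            # stochastic gradient of the squared loss
        theta -= step0 / np.sqrt(t) * grad       # decaying step size alpha_t = step0 / sqrt(t)
    return theta

# Toy stream of T samples from y = <x, theta*> + noise with a 10-sparse theta*.
# step0 is kept small so that alpha_t * ||x||^2 stays near or below 1 in early iterations.
rng = np.random.default_rng(2)
d, T = 100, 5000
theta_star = np.zeros(d)
theta_star[:10] = 1.0
stream = ((x, x.dot(theta_star) + 0.1 * rng.standard_normal())
          for x in (rng.standard_normal(d) for _ in range(T)))
print("error:", np.linalg.norm(sgd_one_pass(stream, d) - theta_star))
```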

SLIDE 18

Structural assumptions

• θ∗ is s-sparse
• Make additional structural assumptions on L(θ) = E_P[ℓ(θ; z)]: L is locally Lipschitz, and L is locally strongly convex (LSC)

SLIDE 19

Locally Lipschitz functions

Definition (Locally Lipschitz function). L is locally G-Lipschitz in the ℓ1-norm, meaning that |L(θ) − L(θ̃)| ≤ G ‖θ − θ̃‖_1 whenever ‖θ − θ∗‖_1 ≤ R and ‖θ̃ − θ∗‖_1 ≤ R.

[Figure: globally Lipschitz vs. locally Lipschitz functions]

SLIDE 20

Locally strongly convex functions

Definition (Locally strongly convex function). There is a constant γ > 0 such that

L(θ̃) ≥ L(θ) + ⟨∇L(θ), θ̃ − θ⟩ + (γ/2) ‖θ − θ̃‖_2²

whenever ‖θ‖_1 ≤ R and ‖θ̃‖_1 ≤ R.

[Figure: locally strongly convex vs. globally strongly convex functions]

SLIDE 21

Stochastic optimization and structural conditions

Method                             Sparsity   LSC   Convergence
SGD                                ✗          ✓     O(d / T)
Mirror descent/RDA/FOBOS/COMID     ✓          ✗     O(√(s² log d / T))
Our method                         ✓          ✓     O(s log d / T)

SLIDE 23

Some previous methods

• All methods are based on observing g_t such that E[g_t] ∈ ∂L(θ^t)
• Stochastic gradient descent: based on ℓ2 distances, exploits LSC

θ^{t+1} = arg min_θ { ⟨g_t, θ⟩ + (1/(2α_t)) ‖θ − θ^t‖_2² }

• Stochastic dual averaging: based on ℓp distances, exploits sparsity when p ≈ 1

θ^{t+1} = arg min_θ { ⟨Σ_{s=1}^t g_s, θ⟩ + (1/(2α_t)) ‖θ‖_p² }

• Need to reconcile the two geometries to exploit both structures
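To make the two geometries concrete, here is a small sketch (my own illustration; step sizes and the example vector are assumptions) of the two proximal mappings: the Euclidean SGD step is a plain gradient step, while the dual-averaging step with the (1/2)‖·‖_p² prox has a closed form through the standard p-norm link function.

```python
import numpy as np

def sgd_step(theta, g, alpha):
    """Euclidean prox: argmin_th <g, th> + (1/(2*alpha)) * ||th - theta||_2^2."""
    return theta - alpha * g

def pnorm_link(mu, alpha, p):
    """argmin_th <mu, th> + (1/(2*alpha)) * ||th||_p^2 (unconstrained closed form).

    With the dual exponent q = p / (p - 1), the minimizer is
        th_j = -alpha * sign(mu_j) * |mu_j|**(q-1) / ||mu||_q**(q-2).
    """
    q = p / (p - 1.0)
    norm_q = np.linalg.norm(mu, ord=q)
    if norm_q == 0.0:
        return np.zeros_like(mu)
    return -alpha * np.sign(mu) * np.abs(mu) ** (q - 1) / norm_q ** (q - 2)

def dual_averaging_step(grad_sum, alpha, p, radius=None):
    """Dual-averaging prox in the l_p geometry, optionally constrained to the l_p ball.

    Radial rescaling gives the exact constrained minimizer here, because the objective
    depends on theta only through <grad_sum, theta> and ||theta||_p.
    """
    theta = pnorm_link(grad_sum, alpha, p)
    if radius is not None:
        norm_p = np.linalg.norm(theta, ord=p)
        if norm_p > radius:
            theta *= radius / norm_p
    return theta

# Example: with p close to 1 the update concentrates on the largest-magnitude coordinates.
d = 10_000
p = 2.0 * np.log(d) / (2.0 * np.log(d) - 1.0)    # the choice of p used in the talk
g = 0.01 * np.ones(d)
g[:3] = [5.0, -4.0, 3.0]
print(dual_averaging_step(g, alpha=0.1, p=p, radius=1.0)[:5])
```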

SLIDE 24

RADAR algorithm: outline

• Based on Juditsky and Nesterov (2011)
• Recall the minimization problem: min_θ E[ℓ(θ; z)]
• Algorithm proceeds over K epochs
• At epoch i, solve the regularized problem

min_{θ ∈ Ω_i} E[ℓ(θ; z)] + λ_i ‖θ‖_1, where Ω_i = {θ ∈ R^d : ‖θ − y_i‖_p² ≤ R_i²}

SLIDE 25

RADAR algorithm: First epoch

• Require: R_1 such that ‖θ∗‖_1 ≤ R_1
• Perform stochastic dual averaging with p = 2 log d / (2 log d − 1) ≈ 1
• Initialize θ^1 = 0, y_1 = 0
• Observe g_t with E[g_t] ∈ ∂L(θ^t) and ν_t ∈ ∂‖θ^t‖_1
• Update

μ_{t+1} = μ_t + g_t + λ_1 ν_t
θ^{t+1} = arg min_{‖θ‖_p ≤ R_1} { ⟨θ, μ_{t+1}⟩ + (1/(2α_t)) ‖θ‖_p² }

[Figure: ℓp-ball of radius R_1 centered at y_1 = 0, containing θ∗]

SLIDE 32

Initializing next epoch

• Update y_2 = θ̄_T (the average iterate of the first epoch)
• Update R_2² = R_1²/2
• Update λ_2 = λ_1/√2
• Initialize θ^1 = y_2 for the next epoch
• Now use the updates

μ_{t+1} = μ_t + g_t + λ_2 ν_t
θ^{t+1} = arg min_{‖θ − y_2‖_p ≤ R_2} { ⟨θ, μ_{t+1}⟩ + (1/(2α_t)) ‖θ − y_2‖_p² }

• Each step is still O(d)

[Figure: smaller ℓp-ball of radius R_2 centered at y_2, still containing θ∗]
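Putting the pieces together, here is a compact sketch of the multi-epoch structure (my own illustration, not the authors' reference implementation; epoch lengths, step sizes, and λ_1 are assumed constants rather than the theoretically prescribed ones). The inner update reuses the p-norm link function from the earlier sketch.

```python
import numpy as np

def pnorm_link(mu, alpha, p):
    """Minimizer of <mu, th> + (1/(2*alpha)) * ||th||_p^2; q = p/(p-1) is the dual exponent."""
    q = p / (p - 1.0)
    norm_q = np.linalg.norm(mu, ord=q)
    if norm_q == 0.0:
        return np.zeros_like(mu)
    return -alpha * np.sign(mu) * np.abs(mu) ** (q - 1) / norm_q ** (q - 2)

def radar(sample_grad, d, R1, lam1, epochs=6, T1=200, step0=0.1):
    """Multi-epoch RADAR-style loop: shrink the ball, shrink the regularization, re-center.

    sample_grad(theta) must return an unbiased stochastic gradient of the expected loss.
    """
    p = 2.0 * np.log(d) / (2.0 * np.log(d) - 1.0)
    y, R, lam, T = np.zeros(d), R1, lam1, T1
    for _ in range(epochs):
        mu = np.zeros(d)                          # dual average, reset at each epoch
        theta, theta_sum = y.copy(), np.zeros(d)
        for t in range(1, T + 1):
            mu += sample_grad(theta) + lam * np.sign(theta)   # g_t + lam * nu_t
            delta = pnorm_link(mu, step0 / np.sqrt(t), p)     # prox step in the l_p geometry
            norm_p = np.linalg.norm(delta, ord=p)
            if norm_p > R:                        # stay inside the l_p ball around the center y
                delta *= R / norm_p
            theta = y + delta
            theta_sum += theta
        y = theta_sum / T                         # re-center at the epoch's average iterate
        R, lam, T = R / np.sqrt(2.0), lam / np.sqrt(2.0), 2 * T
    return y

# Toy run on sparse linear regression: a stream of samples y_i = <x_i, theta*> + noise.
rng = np.random.default_rng(0)
d = 500
theta_star = np.zeros(d)
theta_star[:5] = 1.0

def sample_grad(theta):
    x = rng.standard_normal(d)
    y_i = x.dot(theta_star) + 0.1 * rng.standard_normal()
    return (x.dot(theta) - y_i) * x

theta_hat = radar(sample_grad, d, R1=np.linalg.norm(theta_star, 1) + 1.0, lam1=0.5)
print("error:", np.linalg.norm(theta_hat - theta_star))
```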

SLIDE 34

Convergence rate for exact sparsity

Theorem. Suppose the expected loss is G-Lipschitz and γ-strongly convex, and that θ∗ has at most s non-zero entries. Then with probability at least 1 − 6 exp(−δ log d / 12),

‖θ̄_T − θ∗‖_2² ≤ c (G² + σ²(1 + δ)) / γ² · (s log d) / T.

• Logarithmic scaling in d
• Error decays as 1/T
• Results extend to approximately sparse problems
• Similar result for the method of Juditsky and Nesterov (2011) applied with a fixed λ

SLIDE 36

Optimality of results

• Error of O(s log d / (γ² T)) after T iterations
• Stochastic gradients are computed with one sample each, so T iterations ≡ T samples
• Information-theoretic limit: error Ω(s log d / (γ² T)) after observing T samples, for any possible method
• We obtain the best possible error in linear time

SLIDE 37

Simulation results

• Performed simulations for sparse linear regression
• Compared to classical benchmarks: RDA, SGD
• Evaluated several versions: RADAR, EDA, RADAR-CONST
• Results averaged over 5 random trials

SLIDE 38

Simulation results

[Figure: error ‖θ^t − θ∗‖_2² vs. iterations (up to 2 × 10⁴) for RADAR, SGD, and RDA; left panel d = 20000, right panel d = 40000]

SLIDE 39

Simulation results

[Figure: error ‖θ^t − θ∗‖_2² vs. iterations (up to 2 × 10⁴) for RADAR, EDA, and RADAR-CONST; left panel d = 20000, right panel d = 40000]

SLIDE 40

Intuition

• Convergence rate of 1/√t within each epoch
• Re-centering and shrinking the constraint set boosts the convergence speed at each epoch
• Error is halved after each epoch
• Epoch lengths double, so the initial epochs are negligible
• Fast convergence at later epochs due to the small constraint set
• High regularization initially, little at the end: keeps the iterates (approximately) sparse throughout (see the arithmetic below)
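To spell out the halving-and-doubling bookkeeping behind these bullets (my own summary of the slide's reasoning, with generic constants ε_0 and T_1):

```latex
% If the error is halved at each epoch while the epoch lengths double, then after K epochs
\[
  \varepsilon_K \;=\; \frac{\varepsilon_0}{2^{K}},
  \qquad
  T \;=\; \sum_{i=1}^{K} T_i \;=\; T_1\,(2^{K}-1) \;\approx\; T_1\, 2^{K},
\]
% so the overall rate is 1/T even though each epoch alone only converges at rate 1/\sqrt{T_i}:
\[
  \varepsilon_K \;\approx\; \frac{\varepsilon_0\, T_1}{T} \;=\; O\!\left(\frac{1}{T}\right).
\]
% The last epoch contains about half of all samples (T_K = T_1 2^{K-1} \approx T/2),
% which is why the initial epochs are negligible.
```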

SLIDE 41

Conclusions

• Stochastic optimization algorithm for sparse, high-dimensional problems
• Simultaneously exploits sparsity and strong convexity of the problem
• Optimal rate of convergence
• Updates computed in closed form for common problems
• Extends to group sparsity, low-rank recovery, etc.
• Similar extensions for mirror descent and accelerated methods (Hazan and Kale (2011), Ghadimi and Lan (2012))
• Possible extensions to distributed settings

SLIDE 42

More details can be found in:

• Fast global convergence of gradient methods for high-dimensional statistical recovery, Agarwal, Negahban, and Wainwright, http://arxiv.org/abs/1104.4824
• Stochastic optimization and sparse statistical recovery: An optimal algorithm for high dimensions, Agarwal, Negahban, and Wainwright, http://arxiv.org/abs/1207.4421

Thank You