SLIDE 1

Random Methods for Large-Scale Linear Problems, Variational Inequalities, and Convex Optimization

Doctoral Thesis Defense Mengdi Wang

Laboratory for Information and Decision Systems (LIDS) Massachusetts Institute of Technology

April 1st, 2013

1/38

SLIDE 2

1. A Roadmap
2. Stochastic Methods for Linear Systems
3. Stochastic Methods for Convex Optimization & Variational Inequalities
   - Motivation
   - A Unified Algorithmic Framework
   - The Coupled Convergence Process
4. Summary
5. Acknowledgement

A Roadmap 2/38

SLIDE 3

The Broader Context of Our Work: Large-Scale Problems

Linear Systems: Ax = b or E[A_v] x = E[b_v] (inverse problems, regression, statistical learning, approximate DP)

Linear & Quadratic Programming: min_{Ax ≤ b} x′Qx + c′x (approximate DP, high-performance computation)

Complementarity Problems (equilibria, projected equations)

Convex Problems & Variational Inequalities: min_{x ∈ ∩_i X_i} Σ_i f_i(x) (networks, data-driven problems, cooperative games, online decision making)

Address large-scale problems by randomization/simulation

A Roadmap 3/38

SLIDE 4

Use Stochastic Methods to Tackle Large Scale

How to obtain random samples?
  - Importance sampling
  - Adaptive sampling
  - Monte Carlo methods
  - Application/implementation-dependent methods: asynchronous, distributed, irregular, unknown random process, etc.

How to use random samples?
  - Stochastic approximation
  - Sample average approximation
  - Use Monte Carlo estimates to iterate
  - Modify deterministic methods to allow stochasticity

A Roadmap 4/38

SLIDE 5

Our work

Part 1: Large-scale linear systems Ax = b
  - Deal with the joint effect of singularity and stochastic noise
  - Stabilize divergent iterative methods

Part 2: Large-scale optimization problems with complicated constraints
  - Combine optimization and feasibility methods with randomness
  - Incremental/online structure:
      - updating based on a part of all constraint/gradient information
      - using minimal storage to deal with large data sets
      - allowing various sources of stochasticity
  - Coupled convergence: x_k → x* vs. x_k → X

A Roadmap 5/38

SLIDE 6

1. A Roadmap
2. Stochastic Methods for Linear Systems
3. Stochastic Methods for Convex Optimization & Variational Inequalities
   - Motivation
   - A Unified Algorithmic Framework
   - The Coupled Convergence Process
4. Summary
5. Acknowledgement

Stochastic Methods for Linear Systems 6/38

SLIDE 7

Solving linear systems Ax = b by stochastic sampling

Assume that A = E[A_w], b = E[b_v], and that a sequence of samples {(A_{w_k}, b_{v_k})} is available.

Stochastic Approximation (SA):
    x_{k+1} = x_k − α_k (A_{w_k} x_k − b_{v_k})
Using one sample per update is too slow!

Sample Average Approximation (SAA): obtain finite-sample estimates
    A_k = (1/k) Σ_{t=1}^k A_{w_t},   b_k = (1/k) Σ_{t=1}^k b_{v_t},
then solve A_k x = b_k.
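To make the SA-vs-SAA contrast concrete, here is a minimal sketch (not from the slides; the system, noise model, and sample counts are all made up for illustration) on a small well-conditioned linear system observed through zero-mean noise:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
S = rng.standard_normal((n, n))
A = np.eye(n) + 0.05 * (S + S.T)          # symmetric, well-conditioned "true" A
x_star = rng.standard_normal(n)
b = A @ x_star

def sample():
    # one noisy observation pair (A_wk, b_vk) with zero-mean noise
    return (A + 0.5 * rng.standard_normal((n, n)),
            b + 0.5 * rng.standard_normal(n))

# SA: one sample per update, diminishing stepsize
x_sa = np.zeros(n)
for k in range(20000):
    Ak, bk = sample()
    x_sa = x_sa - (1.0 / (k + 1)) * (Ak @ x_sa - bk)

# SAA: average all samples first, then solve A_k x = b_k once
A_sum, b_sum = np.zeros((n, n)), np.zeros(n)
for _ in range(20000):
    Ak, bk = sample()
    A_sum += Ak
    b_sum += bk
x_saa = np.linalg.solve(A_sum / 20000, b_sum / 20000)

err_sa = np.linalg.norm(x_sa - x_star)
err_saa = np.linalg.norm(x_saa - x_star)
```

Both estimates improve at the Monte Carlo rate 1/√k; the next slide asks whether the SAA estimates can be used inside a faster iteration.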

Stochastic Methods for Linear Systems 7/38

SLIDE 8

Can we do better?

Using Monte Carlo Estimates: given A_k → A and b_k → b a.s. at a rate of 1/√k, iterate as
    x_{k+1} = x_k − γG(A_k x_k − b_k)
If ρ(I − γGA) < 1 ⇛ geometric convergence!

Not working if (close to) singular! (Wang and Bertsekas, 2011)
Divergence rate: ‖x_k‖ ∼ e^{√k} and ‖Ax_k − b‖ ∼ e^{√k}, w.p.1.
Based on random samples of A, we cannot detect the (near) singularity.
We still like the nonsingular part of the system.

Stochastic Methods for Linear Systems 8/38

SLIDE 9

Deal with singularity under noise

Stabilized Iterations (Wang and Bertsekas, 2011): given A_k → A and b_k → b a.s. at a rate of 1/√k, instead of the (divergent) iteration
    x_{k+1} = x_k − γG(A_k x_k − b_k)
add a stabilization term to deal with singularity and multiplicative noise:
    x_{k+1} = (1 − δ_k) x_k − γG(A_k x_k − b_k)
where δ_k ↓ 0, Σ δ_k = ∞, and δ_k ≫ noise. Then x_k → some x* a.s.

Proximal Iteration Naturally Converges (Wang and Bertsekas, 2011):
    x_{k+1} = argmin_x ‖A_k x − b_k‖² + λ‖x − x_k‖²
Then ‖Ax_k − b‖ → 0 a.s., and we can extract a subsequence x̂_k → some x* a.s.
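A toy numerical sketch of the stabilization idea (the system, noise scale, and δ_k schedule are illustrative, not from the thesis): a singular but consistent 2×2 system observed with simulation noise decaying like 1/√k.

```python
import numpy as np

rng = np.random.default_rng(1)
# Singular but consistent system: second row and column of A are zero
A = np.array([[1.0, 0.0],
              [0.0, 0.0]])
b = np.array([1.0, 0.0])
gamma = 0.5
G = np.eye(2)

x_plain = np.zeros(2)   # x_{k+1} = x_k - gamma G (A_k x_k - b_k): can wander in the null space
x_stab = np.zeros(2)    # x_{k+1} = (1 - delta_k) x_k - gamma G (A_k x_k - b_k)
for k in range(1, 5001):
    Ak = A + rng.standard_normal((2, 2)) / np.sqrt(k)   # simulation noise ~ 1/sqrt(k)
    bk = b + rng.standard_normal(2) / np.sqrt(k)
    delta = k ** -0.4   # delta_k -> 0, sum delta_k = inf, and delta_k >> 1/sqrt(k)
    x_plain = x_plain - gamma * (G @ (Ak @ x_plain - bk))
    x_stab = (1 - delta) * x_stab - gamma * (G @ (Ak @ x_stab - bk))

res_stab = np.linalg.norm(A @ x_stab - b)   # residual of the stabilized iterate
```

The stabilized iterate keeps the residual of the nonsingular part small, while the unstabilized iterate has no mechanism pulling its null-space component back.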

Stochastic Methods for Linear Systems 9/38

SLIDE 10

1. A Roadmap
2. Stochastic Methods for Linear Systems
3. Stochastic Methods for Convex Optimization & Variational Inequalities
   - Motivation
   - A Unified Algorithmic Framework
   - The Coupled Convergence Process
4. Summary
5. Acknowledgement

Stochastic Methods for Large Scale COP & VI Motivation 10/38

SLIDE 11

The problems

Convex Optimization Problems (COP):
    min_{x ∈ X} F(x)
where F : ℜ^n → ℜ is convex and continuously differentiable.

Variational Inequalities (VI): find x* ∈ X such that
    G(x*)′(x − x*) ≥ 0,  ∀ x ∈ X
where G : ℜ^n → ℜ^n is strongly monotone, i.e., for some σ > 0,
    (y − x)′(G(y) − G(x)) ≥ σ‖x − y‖²,  ∀ x, y

VI = COP if G(x) = ∇F(x).
Equilibria / LP / Projected equations / Complementarity problems

Stochastic Methods for Large Scale COP & VI Motivation 11/38

SLIDE 12

We focus on large-scale problems with incremental structure

Linearly Additive Objectives
    COP: F(x) = Σ_{i=1}^r F_i(x)  or  F(x) = E[f(x, v)]
    VI:  G(x) = Σ_{i=1}^r G_i(x)  or  G(x) = E[g(x, v)]

Set Intersection Constraints
    X = ∩_{i=1}^m X_i,  where each X_i is closed and convex

Applications: machine learning, distributed optimization, computing Nash equilibria

Stochastic Methods for Large Scale COP & VI Motivation 12/38

SLIDE 13

Difficulty with practical large-scale problems

Operating with X = ∩_i X_i is difficult, especially for:
  - Big data-driven problems with a huge number of constraints stored on external hard drives
  - Distributed problems where each agent can only access part of all constraints
  - Stochastic process-driven problems whose constraints involve a random process only available through simulation

Question: Why not replace X with a single X_i?

Stochastic Methods for Large Scale COP & VI Motivation 13/38

SLIDE 14

Putting two ideas together

  - Gradient projection
  - Alternate projection

Stochastic Methods for Large Scale COP & VI Motivation 14/38

SLIDE 15

Related works

Incremental COP: min_{x∈X} F(x) by x_{k+1} = Π_X[x_k − α g(x_k, v_k)]
  - stochastic gradient projection (Nedić and Bertsekas 2001, etc.)
  - incremental proximal (Bertsekas 2010, etc.)
  - incremental gradient with random projection (Nedić 2011)

Feasibility Problems: finding x ∈ ∩_{i∈M} X_i by x_{k+1} = Π_{X_{w_k}} x_k
  - alternate/cyclic projection (Gubin 1967, Tseng 1990, Deutsch and Hundal 2006-2008, Lewis 2008, etc.)
  - random projection (Nedić 2010)
  - super-halfspace projection (Censor 2008, etc.)

Stochastic Methods for Large Scale COP & VI Motivation 15/38

SLIDE 16

1. A Roadmap
2. Stochastic Methods for Linear Systems
3. Stochastic Methods for Convex Optimization & Variational Inequalities
   - Motivation
   - A Unified Algorithmic Framework
   - The Coupled Convergence Process
4. Summary
5. Acknowledgement

Stochastic Methods for Large Scale COP & VI A Unified Algorithmic Framework 16/38

SLIDE 17

Existing methods

Gradient/Subgradient Projection Method for COP:
    x_{k+1} = Π_X[x_k − α_k ∇F(x_k)]

Projection Method for VI:
    x_{k+1} = Π_X[x_k − α_k G(x_k)]

Stochastic Gradient Projection Method for COP / Projection Method for Stochastic VI:
    x_{k+1} = Π_X[x_k − α_k g(x_k, v_k)]

Proximal Method for COP:
    x_{k+1} = argmin_{x∈X} { F(x) + (1/(2α_k)) ‖x − x_k‖² }

Stochastic Methods for Large Scale COP & VI A Unified Algorithmic Framework 17/38
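As a concrete (made-up) instance of the gradient projection method above: minimizing a smooth convex quadratic over a box, where the Euclidean projection is a componentwise clip.

```python
import numpy as np

# minimize F(x) = ||x - c||^2 over the box X = [0, 1]^3
c = np.array([1.5, -0.3, 0.4])

def proj(x):
    # Euclidean projection onto the box is a componentwise clip
    return np.clip(x, 0.0, 1.0)

def grad(x):
    return 2.0 * (x - c)

x = np.zeros(3)
alpha = 0.25            # constant stepsize, safely below 1/L with L = 2
for _ in range(200):
    x = proj(x - alpha * grad(x))

# fixed point is the projection of c onto the box: (1.0, 0.0, 0.4)
```

Each iteration requires a projection onto the full set X, which is cheap here but is exactly the expensive step when X = ∩_i X_i, motivating the incremental framework that follows.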

SLIDE 18

The general random incremental algorithm

A Two-Step Algorithm
  - Optimality update: z_k = x_k − α_k g(x̄_k, v_k), with x̄_k = x_k or x_{k+1}
  - Feasibility update: x_{k+1} = (1 − β_k) z_k + β_k Π_{X_{w_k}} z_k

When β_k = 1:
    x_{k+1} = Π_{X_{w_k}}[x_k − α_k g(x̄_k, v_k)],  x̄_k ∈ {x_k, x_{k+1}}

Analytical difficulty: x_k is no longer feasible!
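A runnable sketch of the two-step update with β_k = 1, the exact gradient used as the sample g, and a uniformly random halfspace constraint drawn each iteration (the problem instance is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# minimize F(x) = ||x - c||^2 over X = X1 ∩ X2, two halfspaces a_i'x <= b_i
c = np.array([2.0, 2.0])
halfspaces = [(np.array([1.0, 0.0]), 1.0),   # X1: x[0] <= 1
              (np.array([0.0, 1.0]), 0.5)]   # X2: x[1] <= 0.5

def proj_halfspace(z, a, bi):
    # Euclidean projection onto {x : a'x <= bi}
    viol = a @ z - bi
    return z - (viol / (a @ a)) * a if viol > 0 else z

x = np.zeros(2)
for k in range(1, 3001):
    alpha = 1.0 / k                          # diminishing optimality stepsize
    beta = 1.0                               # beta_k = 1: full projection step
    z = x - alpha * 2.0 * (x - c)            # optimality update, g = grad F
    a, bi = halfspaces[rng.integers(2)]      # random constraint X_{w_k}
    x = (1 - beta) * z + beta * proj_halfspace(z, a, bi)

# x should approach the constrained minimum (1.0, 0.5)
```

Note the analytical difficulty mentioned on the slide is visible here: after projecting onto only one of the two halfspaces, the iterate is generally infeasible for the other.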

Stochastic Methods for Large Scale COP & VI A Unified Algorithmic Framework 18/38

SLIDE 19

Special cases of the general algorithm

Projection algorithm using random projection and stochastic gradient:
    x_{k+1} = Π_{X_{w_k}}[x_k − α_k g(x_k, v_k)]

Proximal algorithm using random constraint and random cost function:
    x_{k+1} = argmin_{x ∈ X_{w_k}} { F(x, v_k) + (1/(2α_k)) ‖x − x_k‖² }

Variations that alternate between proximal and projection steps

Successive projection algorithm:
    x_{k+1} = Π_{X_{w_k}} x_k
Stochastic Methods for Large Scale COP & VI A Unified Algorithmic Framework 19/38

SLIDE 20

Sampling schemes for X_{w_k}

Nearly independent samples, by random sampling such that
    inf_{k≥0} P(w_k = i | F_k) > 0,  i = 1, …, m

Cyclic samples, by cyclic selection or random shuffling, such that {X_{w_k}} consists of permutations of {X_1, …, X_m}

Most distant constraint sets, by adaptively selecting X_{w_k} such that
    w_k = argmax_{i=1,…,m} ‖x_k − Π_{X_i} x_k‖

Markov samples, by generating X_{w_k} through a recurrent Markov chain with states {X_i}_{i=1}^m

Stochastic Methods for Large Scale COP & VI A Unified Algorithmic Framework 20/38

SLIDE 21

Sampling schemes for g(x_k, v_k)

Unbiased samples, by random sampling such that
    E[g(x, v_k) | F_k] = G(x),  ∀ x, k ≥ 0, w.p.1

Cyclic samples, by cyclic selection or random shuffling of component functions such that
    Avg_{k ∈ cycle} E[g(x, v_k) | F_beginning] = G(x),  ∀ x, w.p.1

Markov samples, by generating v_k through an irreducible Markov chain with invariant distribution ξ, such that
    E_{v∼ξ}[g(x, v)] = G(x),  ∀ x

Stochastic Methods for Large Scale COP & VI A Unified Algorithmic Framework 21/38

SLIDE 22

1. A Roadmap
2. Stochastic Methods for Linear Systems
3. Stochastic Methods for Convex Optimization & Variational Inequalities
   - Motivation
   - A Unified Algorithmic Framework
   - The Coupled Convergence Process
4. Summary
5. Acknowledgement

Stochastic Methods for Large Scale COP & VI The Coupled Convergence Process 22/38

SLIDE 23

Almost sure convergence

Theorem (Wang and Bertsekas, 2012 and 2013): Under suitable assumptions, let w_k, v_k be generated by any combination of the preceding sampling schemes. Then the algorithm
    z_k = x_k − α_k g(x̄_k, v_k),  with x̄_k = x_k or x_{k+1}
    x_{k+1} = (1 − β_k) z_k + β_k Π_{X_{w_k}} z_k
generates iterates such that x_k → some x* a.s.

Stochastic Methods for Large Scale COP & VI The Coupled Convergence Process 23/38

SLIDE 24

Assumptions

Stochastic Lipschitz continuity of the gradients: ∀ x, y,
    E[‖g(x, v_k) − g(y, v_k)‖² | F_k] ≤ L² ‖x − y‖²,  w.p.1

Regularity of constraints: there exists η > 0 such that
    ‖x − Πx‖² ≤ η max_{i=1,…,m} ‖x − Π_{X_i} x‖²,  ∀ x

Stepsizes:
    Σ_{k=0}^∞ α_k = ∞,   Σ_{k=0}^∞ α_k² < ∞,   Σ_{k=0}^∞ α_k² / ((2 − β_k) β_k) < ∞   (implying β_k ≫ α_k)

Stochastic Methods for Large Scale COP & VI The Coupled Convergence Process 24/38

SLIDE 25

Proof outline: two timescales

Strongly monotone VI or strongly convex optimization: let w_k, v_k be uniform and i.i.d., let β_k = 1, and let x* be arbitrary.

Feasibility Improvement Inequality (ρ < 1):
    E[d²_X(x_{k+1}) | F_k] ≤ ρ d²_X(x_k) + α_k² O(‖x_k − x*‖² + 1),  w.p.1

Optimality Improvement Inequality:
    E[‖x_{k+1} − x*‖² | F_k] ≤ (1 − O(α_k)) ‖x_k − x*‖² + O(α_k) d_X(x_k),  w.p.1

Apply a supermartingale convergence argument to these relations:
    x_k → X at a geometric rate  vs.  x_k → x* at a stochastic approximation rate

Stochastic Methods for Large Scale COP & VI The Coupled Convergence Process 25/38

SLIDE 26

Generalize the analysis

Coupled Supermartingale Convergence (Wang and Bertsekas, 2013): Let {ξ_t}, {ζ_t}, {u_t}, {ū_t}, {η_t}, {θ_t}, {ǫ_t}, {µ_t}, and {ν_t} be sequences of nonnegative random variables such that
    E[ξ_{t+1} | G_t] ≤ (1 + η_t) ξ_t − u_t + c θ_t ζ_t + µ_t,
    E[ζ_{t+1} | G_t] ≤ (1 − θ_t) ζ_t − ū_t + ǫ_t ξ_t + ν_t,
where G_t denotes the collection {ξ_k, ζ_k, u_k, ū_k, η_k, θ_k, ǫ_k, µ_k, ν_k}_{k=1}^t and c > 0. Also, let Σ_{t=0}^∞ (η_t + ǫ_t + µ_t + ν_t) < ∞ with probability 1.

Then ξ_t and ζ_t converge almost surely to nonnegative random variables, and
    Σ_{t=0}^∞ (u_t + ū_t) < ∞,  w.p.1.

Stochastic Methods for Large Scale COP & VI The Coupled Convergence Process 26/38
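A deterministic toy instance of the coupled recursions (with u_t = ū_t = 0, c = 1, θ_t ≡ 0.5, and all perturbation sequences set to 1/t², chosen only so the summability conditions of the theorem are visibly satisfied):

```python
import math

# Deterministic instance of the coupled recursions:
#   xi_{t+1}   = (1 + eta_t) xi_t + theta * zeta_t + mu_t
#   zeta_{t+1} = (1 - theta) zeta_t + eps_t * xi_t + nu_t
xi, zeta = 5.0, 3.0
theta = 0.5                              # constant coupling/contraction rate
for t in range(1, 20001):
    eta = eps = mu = nu = 1.0 / t**2     # summable perturbation sequences
    xi, zeta = ((1 + eta) * xi + theta * zeta + mu,
                (1 - theta) * zeta + eps * xi + nu)

# xi settles to a finite limit; zeta is driven to (near) zero
```

The second sequence contracts at the fast rate θ while feeding only a summable amount (ǫ_t ξ_t) back into the first, which is exactly the feasibility/optimality coupling used in the convergence proof.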

SLIDE 27

The Coupled Convergence Theorem

Theorem (Wang and Bertsekas, 2013): Under the preceding assumptions, let w_k, v_k be such that for some N, M > 0,
    E[d²_X(x_{k+M}) | F_k] ≤ (1 − O(β_k)) d²_X(x_k) + α_k² O(‖x_k − x*‖² + 1)
    E[‖x_{k+N} − x*‖² | F_k] ≤ ‖x_k − x*‖² − O(α_k)(F(x_k) − F(x*)) + O(α_k) d_X(x_k)
Then the algorithm generates iterates such that x_k → some x* a.s.

Since β_k ≫ α_k, convergence to X is faster than convergence to x*.
Modular architecture: extendable to more algorithms and sampling schemes.

Stochastic Methods for Large Scale COP & VI The Coupled Convergence Process 27/38

SLIDE 28

Error bounds & convergence rates

Theorem (strongly convex COP / strongly monotone VI with strong convexity constant σ): Under the same assumptions, there exists a random variable N > 0 such that
    min_{0≤k≤N} { ‖x_k − x*‖² − O(α_k / β_k) } ≤ ǫ,
where
    E[Σ_{k=0}^{N−1} α_k] ≤ ‖x_0 − x*‖² / (2σǫ)

If constant stepsizes α_k = α, β_k = β are used,
    lim inf_{k→∞} ‖x_k − x*‖² ≤ O(α / (σβ)),  w.p.1

When α_k ≈ O(1/√k): error bound ≈ O(1/√k), complexity bound ≈ O(1/ǫ²).

Stochastic Methods for Large Scale COP & VI The Coupled Convergence Process 28/38

SLIDE 29

Constant factor in the error bound - comparison of constraint sampling schemes

Strongly convex COP / strongly monotone VI, number of constraints = m:

    Adaptive constraint selection    O(1)
    IID uniform sampling             O(m)
    Random shuffling                 between O(m) and O(m³)
    Deterministic cyclic sampling    O(m³)
    Markov sampling                  depends on the mixing rate and invariant distribution

Stochastic Methods for Large Scale COP & VI The Coupled Convergence Process 29/38

SLIDE 30

Example: Estimate the invariant distribution ξ of a 1000-state Markov chain P

Approximate ξ = P′ξ by ξ ≈ Φx using the projected version
    Φx = Π_C P′Φx,  where C = {Φx | x ∈ ℜ²⁰, Φx ≥ 0, e′Φx = 1}

VI formulation:
    (x − x*)′Ax* ≥ 0,  ∀ x ∈ ℜ²⁰ s.t. Φx ≥ 0, e′Φx = 1
where
    A = Φ′Ξ(I − P′)Φ = Σ_{i=1}^{1000} Σ_{j=1}^{1000} (ξ_i φ_i φ_i′ − ξ_i p_{ij} φ_i φ_j′)

1000 states, 20 features, 1001 constraints, 10⁶ components.
A is not available but can be estimated by simulating the Markov chain.

Stochastic Methods for Large Scale COP & VI The Coupled Convergence Process 30/38
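Since A and ξ are only reachable through simulation, such quantities are estimated from a trajectory of the chain. A scaled-down sketch of the simulation step (a 3-state chain with made-up transition probabilities standing in for the 1000-state chain):

```python
import numpy as np

rng = np.random.default_rng(4)

# 3-state stand-in for the 1000-state chain; transition matrix is made up
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.1, 0.4, 0.5]])

# Exact invariant distribution: left eigenvector of P for eigenvalue 1
w, V = np.linalg.eig(P.T)
xi = np.real(V[:, np.argmax(np.real(w))])
xi = xi / xi.sum()

# Simulation estimate: long-run occupation frequencies of a single trajectory
state = 0
counts = np.zeros(3)
for _ in range(100000):
    counts[state] += 1
    state = rng.choice(3, p=P[state])
xi_hat = counts / counts.sum()
```

The same trajectory also yields sample terms ξ_i φ_i φ_i′ − ξ_i p_{ij} φ_i φ_j′ of A, one per observed transition, at the Monte Carlo rate.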

SLIDE 31

Estimated distribution Φx vs. underlying invariant distribution ξ

[Figure: the estimated distribution Φx* plotted against the underlying invariant distribution ξ across the 1000 states; values on the order of 10⁻³.]

Stochastic Methods for Large Scale COP & VI The Coupled Convergence Process 31/38

SLIDE 32

Compare the convergence rates

[Figure: log-log plot of ‖x_k − x*‖ versus iteration k (up to 10⁶ iterations), comparing: batch f / i.i.d. projection; i.i.d. f / i.i.d. projection; cyclic f / i.i.d. projection; i.i.d. f / cyclic projection; cyclic f / cyclic projection.]

Stochastic Methods for Large Scale COP & VI The Coupled Convergence Process 32/38

SLIDE 33

Extensions: Assumptions that can be relaxed

Extend to nonsmooth optimization:
  - "F continuously differentiable" ⇛ F bounded by a quadratic function
  - use random samples of gradients ⇛ subgradients

Alternative assumptions for {X_i}_{i=1}^m:
  - linear regularity condition ⇛ ∩_i X_i has nonempty interior

Allow X = ∩_{i=1}^∞ X_i?
  - as long as each Π_{w_k} moves x_k by a sufficient distance

Extend to arbitrary convex constraint sets by super-halfspace projection

Stochastic Methods for Large Scale COP & VI The Coupled Convergence Process 33/38

SLIDE 34

1. A Roadmap
2. Stochastic Methods for Linear Systems
3. Stochastic Methods for Convex Optimization & Variational Inequalities
   - Motivation
   - A Unified Algorithmic Framework
   - The Coupled Convergence Process
4. Summary
5. Acknowledgement

Summary 34/38

SLIDE 35

Summary

Randomization/simulation-based methods for large-scale problems:
  - A special case, linear systems Ax = b: deal with singularity under Monte Carlo noise
  - A general framework: optimization algorithms + feasibility algorithms
      - Problem: COP & VIs with additive objectives & intersection constraints
      - Algorithm: update using random samples of objectives/gradients & constraints
      - Flexibility in implementation & sampling (distributed, asynchronous, adaptive, …)
      - Optimality improvement is coupled with feasibility improvement
      - Convergence rates: adaptive constraint selection > i.i.d. uniform sampling ≈ random shuffling > deterministic cyclic sampling

Summary 35/38

SLIDE 36

1. A Roadmap
2. Stochastic Methods for Linear Systems
3. Stochastic Methods for Convex Optimization & Variational Inequalities
   - Motivation
   - A Unified Algorithmic Framework
   - The Coupled Convergence Process
4. Summary
5. Acknowledgement

Acknowledgement 36/38

SLIDE 37

Acknowledgement

Advisor: Prof. Dimitri Bertsekas

  • Prof. John Tsitsiklis, Prof. Devavrat Shah

Friends: Austin Collins, Lei Dai, Dawsen Huang, Ying-Zong Huang, Ying Liu, Rufan Luo, Yuan Luo, Beipeng Mu, Shen Shen, Wenzhe Wei, Bingxiao Wu, Ming Yang Family

Acknowledgement 37/38

SLIDE 38

The end

Thank You Very Much! Any Question is Welcome :-)

Acknowledgement 38/38