Solving composite optimization problems, with applications to phase retrieval (PowerPoint presentation)




Solving composite optimization problems, with applications to phase retrieval

John Duchi (based on joint work with Feng Ruan)

Outline

◮ Composite optimization problems
◮ Methods for composite optimization
◮ Application: robust phase retrieval
◮ Experimental evaluation
◮ Large scale composite optimization?

What I hope to accomplish today

◮ Investigate problem structures that are not quite convex but are still amenable to elegant solution approaches
◮ Show how we can leverage stochastic structure to turn hard non-convex problems into “easy” ones [Keshavan, Montanari, Oh 10; Loh & Wainwright 12]
◮ Consider large-scale versions of these problems

Composite optimization problems

The problem:

minimize_x  f(x) := h(c(x)),

where h : R^m → R is convex and c : R^n → R^m is smooth

Motivation: the exact penalty

minimize_x  f(x)  subject to  x ∈ X

is equivalent (for all large enough λ) to

minimize_x  f(x) + λ dist(x, X)



Motivation: the exact penalty

minimize_x  f(x)  subject to  c(x) = 0

is equivalent (for all large enough λ) to

minimize_x  f(x) + λ ‖c(x)‖ = h(c(x)),  where h(z) = λ ‖z‖

[Fletcher & Watson 80, 82; Burke 85]
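A tiny numerical illustration of exactness (my toy example, not from the slides): take f(x) = (x − 2)^2 with the constraint x = 0, so c(x) = x and h(z) = λ|z|. Once λ exceeds the multiplier scale |f′(0)| = 4, the penalized minimizer is exactly the constrained one.

```python
import numpy as np

# f(x) = (x - 2)^2 with constraint x = 0, i.e. c(x) = x and h(z) = lam * |z|.
f = lambda x: (x - 2.0) ** 2

xs = np.linspace(-3.0, 3.0, 6001)        # fine grid containing x = 0
for lam in (1.0, 5.0):
    xhat = xs[np.argmin(f(xs) + lam * np.abs(xs))]
    # lam = 1 < 4 is not exact (xhat near 1.5); lam = 5 > 4 pins xhat to 0
    print(lam, xhat)
```

Below the threshold the penalty merely biases the solution toward the constraint set; above it, the kink of |·| pins the minimizer to the feasible point.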


Motivation: nonlinear measurements and modeling

◮ Have true signal x⋆ ∈ R^n and measurement vectors a_i ∈ R^n
◮ Observe nonlinear measurements b_i = φ(⟨a_i, x⋆⟩) + ξ_i, i = 1, . . . , m, for φ(·) a nonlinear but smooth function

An objective: f(x) = (1/m) Σ_{i=1}^m (φ(⟨a_i, x⟩) − b_i)^2

Nonlinear least squares [Nocedal & Wright 06; Plan & Vershynin 15; Oymak & Soltanolkotabi 16]


(Robust) Phase retrieval [Candès, Li, Soltanolkotabi 15]

Observations (usually) b_i = ⟨a_i, x⋆⟩^2 yield the objective

f(x) = (1/m) Σ_{i=1}^m |⟨a_i, x⟩^2 − b_i|

Optimization methods

How do we solve optimization problems?

1. Build a “good” but simple local model of f
2. Minimize the model (perhaps regularizing)

Gradient descent: Taylor (first-order) model

f(y) ≈ f_x(y) := f(x) + ∇f(x)^T (y − x)

Newton’s method: Taylor (second-order) model

f(y) ≈ f_x(y) := f(x) + ∇f(x)^T (y − x) + (1/2)(y − x)^T ∇^2 f(x) (y − x)

Modeling composite problems

Now we make a convex model by linearizing c inside h:

f_x(y) := h(c(x) + ∇c(x)^T (y − x)),

where c(x) + ∇c(x)^T (y − x) = c(y) + O(‖x − y‖^2)  [Burke 85; Drusvyatskiy, Ioffe, Lewis 16]

Example: f(x) = |x^2 − 1|, with h(z) = |z| and c(x) = x^2 − 1
The prox-linear method [Burke, Drusvyatskiy et al.]

Iteratively (1) form a regularized convex model and (2) minimize it:

x_{k+1} = argmin_{x ∈ X} { f_{x_k}(x) + (1/(2α)) ‖x − x_k‖_2^2 }
        = argmin_{x ∈ X} { h(c(x_k) + ∇c(x_k)^T (x − x_k)) + (1/(2α)) ‖x − x_k‖_2^2 }

On the running example, successive iterates satisfy |x_k − x⋆| = .3, .024, 3·10^{−4}, 4·10^{−8}: quadratic convergence.
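On the one-dimensional example f(x) = |x^2 − 1|, the prox-linear subproblem has a closed-form solution, so the method is easy to sketch (my illustration, not the authors' code; h(z) = |z|, c(x) = x^2 − 1, and α = 1 are the assumed pieces):

```python
def prox_linear_step(x, alpha=1.0):
    """One prox-linear step for f(x) = |x^2 - 1|: minimize the model
    |a + b*t| + t**2 / (2*alpha) over t = y - x, with a = c(x), b = c'(x)."""
    a, b = x * x - 1.0, 2.0 * x
    if a - alpha * b * b > 0:        # kink lies beyond the unconstrained quadratic step
        t = -alpha * b
    elif a + alpha * b * b < 0:
        t = alpha * b
    else:                            # minimizer sits at the kink: a + b*t = 0
        t = -a / b
    return x + t

x = 1.3
for k in range(6):
    x = prox_linear_step(x)
print(abs(x - 1.0))                  # error shrinks quadratically toward x* = 1
```

Near x⋆ the kink case is active and the update reduces to Newton's method on x^2 − 1 = 0, which is where the quadratic decrease in |x_k − x⋆| above comes from.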
Robust phase retrieval problems

A nice application for these composite methods


Robust phase retrieval problems

Data model: true signal x⋆ ∈ R^n; for p_fail < 1/2, observe

b_i = ⟨a_i, x⋆⟩^2 + ξ_i,  where ξ_i = 0 w.p. ≥ 1 − p_fail and is arbitrary otherwise

Goal: solve

minimize_x  f(x) = (1/m) Σ_{i=1}^m |⟨a_i, x⟩^2 − b_i|

Composite problem: f(x) = (1/m) ‖φ(Ax) − b‖_1 = h(c(x)), where φ(·) is the elementwise square, h(z) = (1/m) ‖z‖_1, and c(x) = φ(Ax) − b
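A small numpy check (mine, not the authors') that the robust objective really is the composite f = h ∘ c with h(z) = ‖z‖_1/m and c(x) = (Ax)^2 − b:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 40
A = rng.standard_normal((m, n))          # rows are measurement vectors a_i
x_star = rng.standard_normal(n)
b = (A @ x_star) ** 2                    # noiseless phase-retrieval measurements

h = lambda z: np.abs(z).sum() / m        # convex outer function h(z) = ||z||_1 / m
c = lambda x: (A @ x) ** 2 - b           # smooth inner map c(x) = phi(Ax) - b
f = lambda x: np.mean(np.abs((A @ x) ** 2 - b))

x = rng.standard_normal(n)
print(np.isclose(h(c(x)), f(x)), f(x_star))  # f = h(c(.)), and f vanishes at x_star
```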

A convergence theorem

Three key ingredients:
(1) Stability: f(x) − f(x⋆) ≥ λ ‖x − x⋆‖_2 ‖x + x⋆‖_2
(2) Close models: |f_x(y) − f(y)| ≤ (1/m) ‖A^T A‖_op ‖x − y‖_2^2
(3) A good initialization

◮ Measurement matrix A = [a_1 ··· a_m]^T ∈ R^{m×n}, so (1/m) A^T A = (1/m) Σ_{i=1}^m a_i a_i^T
◮ Convex model f_x of f at x defined by f_x(y) = h(c(x) + ∇c(x)^T (y − x)), i.e.
f_x(y) = (1/m) Σ_{i=1}^m |⟨a_i, x⟩^2 + 2 ⟨a_i, x⟩ ⟨a_i, y − x⟩ − b_i|

Theorem (D. & Ruan 17)
Define dist(x, x⋆) = min{‖x − x⋆‖_2, ‖x + x⋆‖_2}. Let x_k be generated by the prox-linear method and L = (1/m) ‖A^T A‖_op. Then
dist(x_k, x⋆) ≤ ((2L/λ) dist(x_0, x⋆))^{2^k}.

Unpacking the convergence theorem

Theorem (D. & Ruan 17)
Define dist(x, x⋆) = min{‖x − x⋆‖_2, ‖x + x⋆‖_2}. Let x_k be generated by the prox-linear method and L = (1/m) ‖A^T A‖_op. Then
dist(x_k, x⋆) ≤ ((2L/λ) dist(x_0, x⋆))^{2^k}.

◮ Quadratic convergence: for all intents and purposes, 6 iterations
◮ Requires solving explicit convex optimization problems (quadratic programs) with no tuning parameters



Ingredients in convergence: stability

1. Stability (cf. Eldar and Mendelson 14):

f(x) − f(x⋆) ≥ λ ‖x − x⋆‖_2 ‖x + x⋆‖_2

What is necessary?

Proposition (D. & Ruan 17)
Assume the uniformity condition: for all u, v ∈ R^n and a ∼ P,
P(|u^T a a^T v| ≥ ε_0 ‖u‖_2 ‖v‖_2) ≥ c > 0.
Then f is (ε_0/2)-stable with probability at least 1 − e^{−cm}.

(Gaussians satisfy this)

Ingredients in convergence: stability

Growth condition (stability):

⟨a_i, x⟩^2 − ⟨a_i, x⋆⟩^2 = ⟨a_i, x − x⋆⟩ ⟨a_i, x + x⋆⟩,

and under random a_i with uniform enough support,

f(x) = (1/m) Σ_{i=1}^m |(x − x⋆)^T a_i a_i^T (x + x⋆)| ≳ ‖x − x⋆‖_2 ‖x + x⋆‖_2

Ingredients in convergence

2. Approximation: need (1/m) ‖A^T A‖_op = O(1)

What is necessary?

Proposition (Vershynin 11)
If the measurement vectors a_i are sub-Gaussian, then
(1/m) ‖A^T A‖_op ≤ O(1) · (n/m + t)  w.p. ≥ 1 − e^{−mt^2}.

Heavy-tailed data gets (1/m) ‖A^T A‖_op = O(1) with reasonable probability for m a bit larger


Ingredients in convergence: spectral initialization

Insight [Wang, Giannakis, Eldar 16]: most vectors a_i ∈ R^n are nearly orthogonal to x⋆, so

X_init := Σ_{i : b_i ≤ median(b)} a_i a_i^T

satisfies X_init ≈ E[a_i a_i^T] − c d⋆ d⋆^T, where d⋆ = x⋆ / ‖x⋆‖_2




Ingredients in convergence: spectral initialization

3. Initialization: we need dist(x_0, x⋆) ≤ (1/2) ‖x⋆‖_2

Estimate the direction d̂ ≈ x⋆/‖x⋆‖_2 and radius r̂ by

X_init := Σ_{i : b_i ≤ median(b)} a_i a_i^T

and

d̂ = argmin_{d ∈ S^{n−1}} d^T X_init d,   r̂ := ((1/m) Σ_{i=1}^m b_i)^{1/2} ≈ ‖x⋆‖_2

Proposition (D. & Ruan 17)
Under appropriate orthogonality conditions, x_0 = r̂ d̂ satisfies
dist(x_0, x⋆) ≲ n/m + t
with probability at least 1 − e^{−mt^2}
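The initialization recipe is a few lines of numpy (my sketch for Gaussian measurements; the median threshold and the bottom-eigenvector direction estimate follow the slide, the problem sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 10, 2000
x_star = rng.standard_normal(n)
A = rng.standard_normal((m, n))
b = (A @ x_star) ** 2

# Measurements with small b_i come from a_i nearly orthogonal to x_star, so
# x_star is close to the bottom eigenvector of the sum over the smaller half.
mask = b <= np.median(b)
X_init = A[mask].T @ A[mask]             # sum of a_i a_i^T over {i : b_i <= median(b)}
eigvals, eigvecs = np.linalg.eigh(X_init)
d_hat = eigvecs[:, 0]                    # eigenvector of the smallest eigenvalue

r_hat = np.sqrt(b.mean())                # E[b_i] = ||x_star||_2^2 for Gaussian a_i
x0 = r_hat * d_hat

dist = min(np.linalg.norm(x0 - x_star), np.linalg.norm(x0 + x_star))
print(dist / np.linalg.norm(x_star))     # small relative error for m >> n
```

The sign ambiguity of the eigenvector is exactly the ±x⋆ ambiguity in dist(·, x⋆).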

Take-home result

◮ Stability: measurements a_i are uniform enough in direction
◮ Closeness: a_i are sub-Gaussian or normalized
◮ Sufficient conditions for initialization: for v ∈ S^{n−1},
E[a_i a_i^T | ⟨a_i, v⟩^2 ≤ ‖v‖_2^2] = I_n − c vv^T + E,
where c > 0 and E is a small error
◮ Measurement failure probability p_fail ≤ 1/4

Theorem (D. & Ruan 17)
If these conditions hold and m/n ≳ 1, then the spectral initialization succeeds and the iterates x_k of the prox-linear algorithm satisfy
dist(x_k, x⋆) ≤ (O(1) · dist(x_0, x⋆))^{2^k}

Experiments

1. Random (Gaussian) measurements
2. Adversarially chosen outliers
3. Real images
Experiment 1: random Gaussian measurements

◮ Data generation: dimension n = 3000, a_i iid ∼ N(0, I_n) and b_i = ⟨a_i, x⋆⟩^2
◮ Compare to Wang, Giannakis, Eldar’s Truncated Amplitude Flow (best-performing non-convex approach)
◮ Look at success probability against m/n (note that m ≥ 2n − 1 is necessary for injectivity)

[Figure: P(success) versus m/n for the prox-linear method (Prox) and Truncated Amplitude Flow (TAF)]

[Figure: fraction of one-sided success versus m/n for Prox and TAF]

Experiment 2: corrupted measurements

◮ Data generation: dimension n = 200, a_i iid ∼ N(0, I_n), and b_i = 0 w.p. p_fail, b_i = ⟨a_i, x⋆⟩^2 otherwise (zeroed measurements most confuse our initialization method)
◮ Compare to Zhang, Chi, Liang’s Median-Truncated Wirtinger Flow (designed specially for standard Gaussian measurements)
◮ Look at success probability against m/n (note that m ≥ 2n − 1 is necessary for injectivity)

[Figure: success probability as a function of p_fail (0 to 0.3) and m/n (1.8 to 8.0)]

Experiment 3: digit recovery

◮ Data generation: handwritten 16×16 grayscale digits; sensing matrix
A = [H_n S_1; H_n S_2; H_n S_3] ∈ R^{3n×n},
where n = 256, the S_l are diagonal random sign matrices, and H_n is the Hadamard transform matrix
◮ Observe b = (Ax⋆)^2 + ξ, where ξ_i = 0 w.p. 1 − p_fail and is Cauchy otherwise
◮ Other non-convex approaches are designed for Gaussian data; unclear how to parameterize them
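The sensing matrix is cheap to build explicitly at this size (a sketch assuming scipy.linalg.hadamard for H_n; a real implementation would apply fast Hadamard transforms rather than form A):

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(0)
n = 256
H = hadamard(n)                               # n x n Hadamard transform matrix (n a power of 2)
# Stack H S_l for three diagonal random sign matrices S_l.
A = np.vstack([H @ np.diag(rng.choice([-1, 1], size=n)) for _ in range(3)])

x_star = rng.standard_normal(n)               # stand-in for a 16x16 digit image
b = (A @ x_star) ** 2                         # measurements, here with p_fail = 0
print(A.shape)                                # (768, 256)
```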

Experiment 3: digit recovery

[Figure. Left: true image. Middle: spectral initialization. Right: solution.]

[Figure: success probability and matrix-multiply count versus p_fail; performance of the composite optimization scheme versus failure probability]

Experiment 4: real images

Signal size n = 2^22, measurements m = 3 · 2^24




Composite optimization at scale

Question: what if we have composite problems with a really big sample?

◮ Typical stochastic optimization setup: f(x) = E[F(x; S)], where F(x; S) = h(c(x; S); S)
◮ Example: large-scale (robust) nonlinear regression, f(x) = (1/m) Σ_{i=1}^m |φ(⟨a_i, x⟩) − b_i|



A stochastic composite method

◮ Define the (random) convex approximation
F_x(y; s) = h(c(x; s) + ∇c(x; s)^T (y − x); s),
where c(x; s) + ∇c(x; s)^T (y − x) ≈ c(y; s)
◮ Then iterate for k = 1, 2, . . .:
S_k iid ∼ P,  x_{k+1} = argmin_{x ∈ X} { F_{x_k}(x; S_k) + (1/(2α_k)) ‖x − x_k‖_2^2 }
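For robust phase retrieval, F(x; s) = |⟨a, x⟩^2 − b| with sample s = (a, b), and the single-sample subproblem above has a closed form, so each stochastic prox-linear update is essentially free. A sketch (my derivation of the scalar prox step, not the authors' code):

```python
import numpy as np

def stochastic_prox_linear_step(x, a, b, alpha):
    """argmin_y |c + g^T (y - x)| + ||y - x||_2^2 / (2*alpha), in closed form,
    where c = <a,x>^2 - b and g = 2 <a,x> a linearize y -> <a,y>^2 - b at x."""
    c = (a @ x) ** 2 - b
    g = 2.0 * (a @ x) * a
    g2 = g @ g
    if g2 == 0.0:
        return x.copy()
    # The minimizer moves only along g: y = x + gamma * g.
    if c - alpha * g2 > 0:
        gamma = -alpha
    elif c + alpha * g2 < 0:
        gamma = alpha
    else:
        gamma = -c / g2              # step exactly to the kink of |.|
    return x + gamma * g

# One update on a tiny instance: a = e_1, x = (2, 0), b = 1 gives c = 3, g = (4, 0).
y = stochastic_prox_linear_step(np.array([2.0, 0.0]), np.array([1.0, 0.0]), 1.0, 1.0)
print(y)  # [1.25 0.  ]
```

Each update touches a single measurement, which is what makes the method plausible at scale with diminishing step sizes α_k.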

Understanding convergence behavior

Ordinary differential equations (gradient flow): ẋ = −∇f(x), i.e. (d/dt) x(t) = −∇f(x(t))

Ordinary differential inclusions (subgradient flow): ẋ ∈ −∂f(x), i.e. (d/dt) x(t) ∈ −∂f(x(t))

The differential inclusion

For the stochastic function

f(x) := E[F(x; S)] = E[h(c(x; S); S)] = ∫ h(c(x; s); s) dP(s),

the generalized subgradient (for non-convex, non-smooth f) is [D. & Ruan 17]

∂f(x) = ∫ ∇c(x; s) ∂h(c(x; s); s) dP(s)

Theorem (D. & Ruan 17)
For the stochastic composite problem, the subdifferential inclusion ẋ ∈ −∂f(x) has a unique trajectory for all time, and
f(x(t)) − f(x(0)) ≤ −∫_0^t ‖∂f(x(τ))‖^2 dτ.
The trajectory also has limit points, and they are stationary.




The limiting differential inclusion

Recall our iteration

x_{k+1} = argmin_x { F_{x_k}(x; S_k) + (1/(2α_k)) ‖x − x_k‖_2^2 }.

Optimality conditions: using F_x(y; s) = h(c(x; s) + ∇c(x; s)^T (y − x); s),

0 ∈ ∇c(x_k; s) ∂h(c(x_k; s) + ∇c(x_k; s)^T (x_{k+1} − x_k); s) + (1/α_k) [x_{k+1} − x_k],

where the argument of ∂h is c(x_k; s) ± O(‖x_k − x_{k+1}‖^2), i.e.

(1/α_k) [x_{k+1} − x_k] ∈ −∇c(x_k; s) ∂h(c(x_k; s); s) + subgradient mess + noise
                        = −∂f(x_k) + subgradient mess + noise

Graphical example

Iterate x_{k+1} = argmin_x { F_{x_k}(x; S_k) + (1/(2α_k)) ‖x − x_k‖_2^2 }

A convergence guarantee

Consider the stochastic composite optimization problem

minimize_{x ∈ X}  f(x) := E[F(x; S)],  where F(x; s) = h(c(x; s); s).

Use the iteration

x_{k+1} = argmin_{x ∈ X} { F_{x_k}(x; S_k) + (1/(2α_k)) ‖x − x_k‖_2^2 }.

Theorem (D. & Ruan 17)
Assume X is compact and Σ_{k=1}^∞ α_k = ∞, Σ_{k=1}^∞ α_k^2 < ∞. Then the sequence {x_k} satisfies
(1) f(x_k) converges
(2) all cluster points of x_k are stationary

Experiment: noiseless phase retrieval

[Figure: error (10^{−8} to 10^{−2}) versus iteration (50 to 200) for prox, sprox, and sgd]

Conclusions

1. Broadly interesting structures for non-convex problems that are still approximable
2. Statistical modeling allows solution of non-trivial, non-smooth, non-convex problems
3. Large-scale efficient methods are still important

References

◮ Solving (most) of a set of quadratic equalities: Composite optimization for robust phase retrieval, arXiv:1705.02356
◮ Stochastic Methods for Composite Optimization Problems, arXiv:1703.08570