SLIDE 1

AISTATS 2018, Playa Blanca, Lanzarote, Spain

Zeroth-order Optimization in High Dimensions

Yining Wang

Carnegie Mellon University

Joint work with Simon Du, Sivaraman Balakrishnan and Aarti Singh

SLIDE 2

BACKGROUND

❖ Optimization: $\min_{x \in \mathcal{X}} f(x)$

❖ Classical setting (first-order):

✴ f is known (e.g., a likelihood function or an NN objective)

✴ $\nabla f(x)$ can be evaluated, or unbiasedly approximated

❖ Zeroth-order setting:

✴ f is unknown, or very complicated

✴ $\nabla f(x)$ is unknown, or very difficult to evaluate.

SLIDE 3

APPLICATIONS

❖ Hyper-parameter tuning

✴ f maps hyper-parameter x to system performance f(x).

❖ Experimental design

✴ f maps experimental setting to experimental results.

❖ Communication-efficient optimization

✴ Data defining the objective is scattered across machines

✴ Communicating $\nabla f(x)$ is expensive, but communicating f(x) is ok.

SLIDE 4

FORMULATION

❖ Convexity: the objective f is convex.

❖ Noisy observation model: $y_t = f(x_t) + \xi_t$, with $\xi_t \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$.

❖ Evaluation measures (sketched in code below):

✴ Simple regret: $f(\hat{x}_{T+1}) - f^*$

✴ Cumulative regret: $\sum_{t=1}^{T} f(x_t) - f^*$
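Below is a minimal Python sketch of this observation model and of the two regret measures. The quadratic objective, the noise level sigma, and the query points are illustrative placeholders, not anything taken from the paper.

```python
import numpy as np

# Minimal sketch (not from the paper) of the observation model y_t = f(x_t) + xi_t,
# xi_t ~ N(0, sigma^2), and of the simple / cumulative regret measures.
rng = np.random.default_rng(0)
sigma = 0.1
d = 10
x_star = np.zeros(d)

def f(x):
    return 0.5 * float(np.sum((x - x_star) ** 2))   # convex toy objective (illustrative)

def noisy_oracle(x):
    return f(x) + sigma * rng.standard_normal()     # zeroth-order access: noisy values only

f_star = f(x_star)
xs = [0.1 * rng.standard_normal(d) for _ in range(100)]   # stand-in iterates x_1, ..., x_T
cumulative_regret = sum(f(x) - f_star for x in xs)         # sum_t [f(x_t) - f*]
simple_regret = f(xs[-1]) - f_star                          # f(x_hat_{T+1}) - f*, with x_hat = last iterate
print(cumulative_regret, simple_regret)
```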

SLIDE 5

METHODS

❖ Classical method: Estimating Gradient Descent (EGD)

❖ Gradient descent / Mirror descent:

✴ $x_{t+1} \leftarrow x_t - \eta_t \hat{g}_t(x_t)$

✴ $x_{t+1} \in \arg\min_{z \in \mathbb{R}^d} \{ \eta_t \langle \hat{g}_t(x_t), z \rangle + \Delta_\psi(z, x_t) \}$

❖ Estimating gradient (a single-sample code sketch follows below):

✴ $\hat{g}_t(x_t) = \frac{d}{\delta} \cdot \mathbb{E}[f(x_t + \delta v_t)\, v_t]$

✴ Gained popularity from (Nemirovski & Yudin '83, Flaxman et al. '05)
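As a concrete, hedged illustration of these two ingredients, the sketch below implements a single-sample version of the estimator followed by the plain gradient-descent update. Drawing $v_t$ uniformly from the unit sphere, and the values of eta and delta, are assumptions for illustration rather than the paper's exact choices; `noisy_oracle` is the placeholder defined in the earlier sketch.

```python
import numpy as np

# Single-sample sketch of one EGD step: a hedged illustration, not the paper's exact
# algorithm.  The slide's estimator g_hat(x_t) = (d/delta) * E[f(x_t + delta v_t) v_t]
# is approximated with one draw of v_t, assumed uniform on the unit sphere.
def sample_unit_sphere(d, rng):
    v = rng.standard_normal(d)
    return v / np.linalg.norm(v)

def egd_step(x_t, noisy_oracle, eta=0.05, delta=0.01, rng=None):
    rng = rng or np.random.default_rng()
    d = x_t.shape[0]
    v_t = sample_unit_sphere(d, rng)
    # one-point gradient estimate: (d / delta) * f(x_t + delta v_t) * v_t
    g_hat = (d / delta) * noisy_oracle(x_t + delta * v_t) * v_t
    return x_t - eta * g_hat            # x_{t+1} <- x_t - eta_t * g_hat_t(x_t)
```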

SLIDE 6

METHODS

❖ Classical method: Estimating Gradient Descent (EGD)

❖ Gradient descent / Mirror descent:

✴ $x_{t+1} \leftarrow x_t - \eta_t \hat{g}_t(x_t)$

✴ $x_{t+1} \in \arg\min_{z \in \mathbb{R}^d} \{ \eta_t \langle \hat{g}_t(x_t), z \rangle + \Delta_\psi(z, x_t) \}$

❖ Estimating gradient:

✴ $\hat{g}_t(x_t) = \frac{d}{\delta} \cdot \mathbb{E}[f(x_t + \delta v_t)\, v_t]$, so that $\hat{g}_t(x_t) \approx \nabla f(x_t)$

✴ Gained popularity from (Nemirovski & Yudin '83, Flaxman et al. '05)

SLIDE 7

METHODS

❖ Classical method: Estimating Gradient Descent (EGD)

❖ Gradient descent / Mirror descent:

✴ $x_{t+1} \leftarrow x_t - \eta_t \hat{g}_t(x_t)$

✴ $x_{t+1} \in \arg\min_{z \in \mathbb{R}^d} \{ \eta_t \langle \hat{g}_t(x_t), z \rangle + \Delta_\psi(z, x_t) \}$

❖ Estimating gradient:

✴ $\hat{g}_t(x_t) = \frac{d}{\delta} \cdot \mathbb{E}[f(x_t + \delta v_t)\, v_t]$, so that $\hat{g}_t(x_t) \approx \nabla f(x_t)$

✴ Gained popularity from (Nemirovski & Yudin '83, Flaxman et al. '05)

[Figure: the query point $x_t + \delta v_t$, a perturbation of $x_t$ by $\delta$ along direction $v_t$]

SLIDE 8

METHODS

❖ Classical method: Estimating Gradient Descent (EGD)

❖ Classical analysis

✴ Supposing $\|\nabla f\| \le H$ and $\|x^*\| \le B$ (in appropriate dual/primal norms)

✴ Stochastic GD/MD: $f(\hat{x}) - f^* \lesssim BH / \sqrt{T}$

✴ Estimating GD/MD: $f(\hat{x}) - f^* \lesssim \sqrt{d} \cdot BH / T^{1/4}$ (the two bounds are compared numerically below)

❖ Problem: cannot exploit (sparse) structure in x*
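To make the dimension dependence concrete, here is a quick numeric comparison of the two bounds above with B = H = 1; the values of d and T are illustrative only.

```python
import numpy as np

# Plugging B = H = 1 into the two bounds above (d and T are illustrative values):
# first-order stochastic GD/MD scales as 1/sqrt(T), while zeroth-order EGD pays an
# extra sqrt(d) factor and the slower T^(-1/4) rate.
for d, T in [(100, 10**6), (10**4, 10**6)]:
    first_order = 1.0 / np.sqrt(T)        # BH / sqrt(T)
    egd = np.sqrt(d) / T ** 0.25          # sqrt(d) * BH / T^(1/4)
    print(f"d={d:>6}, T={T}: first-order ~ {first_order:.1e}, EGD ~ {egd:.1e}")
```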

SLIDE 9

METHODS

❖ Classical method: Estimating Gradient Descent (EGD)

❖ Classical analysis

✴ Supposing $\|\nabla f\| \le H$ and $\|x^*\| \le B$ (in appropriate dual/primal norms)

✴ Stochastic GD/MD (first-order: $\mathbb{E}\,\hat{g}_t(x_t) = \nabla f(x_t)$): $f(\hat{x}) - f^* \lesssim BH / \sqrt{T}$

✴ Estimating GD/MD: $f(\hat{x}) - f^* \lesssim \sqrt{d} \cdot BH / T^{1/4}$

❖ Problem: cannot exploit (sparse) structure in x*

SLIDE 10

METHODS

❖ Classical method: Estimating Gradient Descent (EGD)

❖ Classical analysis

✴ Supposing $\|\nabla f\| \le H$ and $\|x^*\| \le B$ (in appropriate dual/primal norms)

✴ Stochastic GD/MD (first-order: $\mathbb{E}\,\hat{g}_t(x_t) = \nabla f(x_t)$): $f(\hat{x}) - f^* \lesssim BH / \sqrt{T}$

✴ Estimating GD/MD (zeroth-order: $\mathbb{E}\|\hat{g}_t(x_t) - \nabla f(x_t)\|_2^2$ is small, but $\mathbb{E}\,\hat{g}_t(x_t) \neq \nabla f(x_t)$): $f(\hat{x}) - f^* \lesssim \sqrt{d} \cdot BH / T^{1/4}$

❖ Problem: cannot exploit (sparse) structure in x*

SLIDE 12

ASSUMPTIONS

❖ The “function sparsity” assumption: $f(x) \equiv f(x_S)$ for some $S \subseteq [d]$, $|S| = s \ll d$ (a toy instance is sketched below)

❖ Strong theoretically, but arguably acceptable in practice:

✴ Hyper-parameter tuning: performance is not sensitive to many of the input parameters

✴ Visual stimuli optimization: most brain activities are not related to visual reactions.
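A toy instance of this assumption, sketched below: f depends only on the coordinates indexed by S. The dimension, the support S, and the quadratic form are made up for illustration.

```python
import numpy as np

# Toy instance of the function-sparsity assumption f(x) = f(x_S): although x lives in
# R^d with d = 1000, f only depends on the s = 5 coordinates in S (d, S, and the
# quadratic form are illustrative).
d, S = 1000, [3, 17, 42, 256, 999]

def f_sparse(x):
    x_S = x[S]                               # only the coordinates in S matter
    return float(np.sum(x_S ** 2) + np.sum(x_S))

x = np.random.randn(d)
x_perturbed = x.copy()
x_perturbed[0] += 10.0                       # change a coordinate outside S
assert np.isclose(f_sparse(x), f_sparse(x_perturbed))   # the value is unchanged
```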

SLIDE 13

LASSO GRADIENT ESTIMATE

❖ Local linear approximation: $f(x_t + \delta v_t) \approx f(x_t) + \delta \langle \nabla f(x_t), v_t \rangle$

❖ Lasso gradient estimate:

✴ Sample $v_1, \dots, v_n$ and observe $y_i \approx f(x_t + \delta v_i) - f(x_t)$

✴ Construct a sparse linear system: $\tilde{Y} = Y / \delta = V \nabla f(x_t) + \varepsilon$

SLIDE 14

LASSO GRADIENT ESTIMATE

❖ Local linear approximation: $f(x_t + \delta v_t) \approx f(x_t) + \delta \langle \nabla f(x_t), v_t \rangle$

❖ Lasso gradient estimate:

✴ Construct a sparse linear system: $\tilde{Y} = Y / \delta = V \nabla f(x_t) + \varepsilon$

✴ Because $\nabla f(x_t)$ is sparse, one can use the Lasso:
  $\hat{g}_t(x_t) \in \arg\min_{g \in \mathbb{R}^d} \frac{1}{n} \|\tilde{Y} - V g\|_2^2 + \lambda \|g\|_1$

SLIDE 15

LASSO GRADIENT ESTIMATE

❖ Local linear approximation: $f(x_t + \delta v_t) \approx f(x_t) + \delta \langle \nabla f(x_t), v_t \rangle$

❖ Lasso gradient estimate:

✴ Construct a sparse linear system: $\tilde{Y} = Y / \delta = V \nabla f(x_t) + \varepsilon$

✴ Because $\nabla f(x_t)$ is sparse, one can use the Lasso:
  $\hat{g}_t(x_t) \in \arg\min_{g \in \mathbb{R}^d} \frac{1}{n} \|\tilde{Y} - V g\|_2^2 + \lambda \|g\|_1$

• Certain “de-biasing” is required … see the paper (a hedged code sketch of the basic Lasso step follows below).
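The sketch below shows only the basic Lasso step, using scikit-learn's off-the-shelf solver, and omits the de-biasing mentioned above. The number of samples n, the smoothing radius delta, the regularization lambda, and the ±1/√d perturbation design are illustrative assumptions rather than the paper's exact choices; `noisy_oracle` is the placeholder from the earlier sketches.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Sketch of the Lasso gradient estimate (no de-biasing; illustrative parameter choices).
def lasso_gradient_estimate(noisy_oracle, x_t, n=200, delta=0.05, lam=0.01,
                            rng=np.random.default_rng(0)):
    d = x_t.shape[0]
    V = rng.choice([-1.0, 1.0], size=(n, d)) / np.sqrt(d)     # perturbation directions v_i
    y0 = noisy_oracle(x_t)
    Y = np.array([noisy_oracle(x_t + delta * V[i]) - y0 for i in range(n)])
    Y_tilde = Y / delta                                        # Y_tilde = V grad f(x_t) + eps
    # Lasso: argmin_g (1/n) ||Y_tilde - V g||_2^2 + lam ||g||_1 (up to sklearn's 1/(2n) scaling)
    lasso = Lasso(alpha=lam, fit_intercept=False)
    lasso.fit(V, Y_tilde)
    return lasso.coef_                                         # sparse estimate of grad f(x_t)
```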

SLIDE 16

MAIN RESULTS

• Theorem. Suppose $f(x) \equiv f(x_S)$ for some $|S| = s \ll d$, and other smoothness conditions on f hold. Then

  $\frac{1}{T} \sum_{t=1}^{T} f(x_t) - f^* \;\lesssim\; \mathrm{poly}(s, \log d) \cdot T^{-1/4}.$

• Furthermore, for smoother f the $T^{-1/4}$ rate can be improved to $T^{-1/3}$.

SLIDE 17

MAIN RESULTS

• Theorem. Suppose $f(x) \equiv f(x_S)$ for some $|S| = s \ll d$, and other smoothness conditions on f hold. Then

  $\frac{1}{T} \sum_{t=1}^{T} f(x_t) - f^* \;\lesssim\; \mathrm{poly}(s, \log d) \cdot T^{-1/4}.$

• Furthermore, for smoother f the $T^{-1/4}$ rate can be improved to $T^{-1/3}$.

• Can handle the “high-dimensional” setting $d \gg T$.

SLIDE 18

SIMULATION RESULTS

SLIDE 19

SIMULATION RESULTS

SLIDE 20

OPEN QUESTIONS

❖ Is function/gradient sparsity absolutely necessary?

✴ Recall that in the first-order case, only sparsity of the solution x* is required

✴ More specifically, one only needs $\|x^*\|_1 \le B$ and $\|\nabla f\|_\infty \le H$

✴ Conjecture: if f only satisfies the above condition, then
  $\inf_{\hat{x}_T} \sup_f \mathbb{E}[f(\hat{x}_T) - f^*] \;\gtrsim\; \mathrm{poly}(d, 1/T)$

SLIDE 21

OPEN QUESTIONS

❖ Is $T^{-1/2}$ convergence achievable in high dimensions?

✴ Challenge 1: MD is awkward in exploiting strong convexity:
  $f(x') \ge f(x) + \langle \nabla f(x), x' - x \rangle + \frac{\nu^2}{2} \Delta_\psi(x', x)$

✴ Challenge 2: the Lasso gradient estimate is less efficient. Can we design a convex body K such that
  $\hat{g}_t(x_t) = \frac{\rho(K)}{\delta} \int_{\partial K} f(x_t + \delta v)\, n(v)\, d\mu(v)$
  is a good gradient estimator in high dimensions?

SLIDE 22

OPEN QUESTIONS

❖ Is $T^{-1/2}$ convergence achievable in high dimensions?

✴ Challenge 1: MD is awkward in exploiting strong convexity:
  $f(x') \ge f(x) + \langle \nabla f(x), x' - x \rangle + \frac{\nu^2}{2} \Delta_\psi(x', x)$
  (Wish to replace $\Delta_\psi(x', x)$ with $\|x' - x\|_1^2$.)

✴ Challenge 2: the Lasso gradient estimate is less efficient. Can we design a convex body K such that
  $\hat{g}_t(x_t) = \frac{\rho(K)}{\delta} \int_{\partial K} f(x_t + \delta v)\, n(v)\, d\mu(v)$
  is a good gradient estimator in high dimensions?

SLIDE 23

Thank you! Questions?