ZEROTH-ORDER NON-CONVEX SMOOTH OPTIMIZATION: LOCAL MINIMAX RATES - PowerPoint PPT Presentation

slide-1
SLIDE 1

ZEROTH-ORDER NON-CONVEX SMOOTH OPTIMIZATION: LOCAL MINIMAX RATES

Yining Wang, CMU

joint work with Sivaraman Balakrishnan and Aarti Singh

slide-2
SLIDE 2

BACKGROUND

➤ Optimization: $\min_{x \in X} f(x)$
➤ Classical setting (first-order):

✴ $f$ is known (e.g., a likelihood function or an NN objective)
✴ $\nabla f(x)$ can be evaluated, or unbiasedly approximated.

➤ Zeroth-order setting:

✴ $f$ is unknown, or very complicated.
✴ $\nabla f(x)$ is unknown, or very difficult to evaluate.
✴ Only $f(x)$ can be evaluated, or unbiasedly approximated.

slide-3
SLIDE 3

BACKGROUND

➤ Hyper-parameter tuning

✴ $f$ maps a hyper-parameter $\theta$ to system performance
✴ $f$ is essentially unknown

➤ Experimental design

✴ $f$ maps an experimental setting (pressure, temperature, etc.) to synthesized material quality.

➤ Communication-efficient optimization

✴ Data defining the objective are scattered across machines
✴ Communicating $\nabla f(x)$ is expensive, but communicating $f(x)$ is OK.

slide-4
SLIDE 4

PROBLEM FORMULATION

➤ Compact domain: $X = [0, 1]^d$
➤ Objective function $f : X \to \mathbb{R}$

✴ $f$ belongs to the Hölder class $\Sigma^\alpha(M)$ of order $\alpha$, i.e., $\|f^{(\alpha)}\|_\infty \le M$
✴ $f$ may be non-convex

➤ Query model: adaptive queries $x_1, x_2, \cdots, x_n \in X$ with noisy responses $y_t = f(x_t) + \xi_t$, $\xi_t \overset{\text{i.i.d.}}{\sim} N(0, 1)$

➤ Goal: minimize the optimization error $L(\hat{x}_n; f) := f(\hat{x}_n) - \inf_{x \in X} f(x)$

slide-14
SLIDE 14

A SIMPLE IDEA FIRST …

➤ Uniform sampling + nonparametric reconstruction

✴ Classical nonparametric analysis: $\|\hat f_n - f\|_\infty = \tilde O_P\big(n^{-\alpha/(2\alpha+d)}\big)$
✴ Implies optimization error: $f(\hat x_n) - f^* \le 2 \|\hat f_n - f\|_\infty$

➤ Can we do better? NO!

$\inf_{\hat x_n} \sup_{f \in \Sigma^\alpha(M)} \mathbb{E}_f[L(\hat x_n; f)] \gtrsim n^{-\alpha/(2\alpha+d)}$

slide-15
SLIDE 15

A SIMPLE IDEA FIRST …

➤ Uniform sampling + nonparametric reconstruction

✴ Classical nonparametric analysis: $\|\hat f_n - f\|_\infty = \tilde O_P\big(n^{-\alpha/(2\alpha+d)}\big)$
✴ Implies optimization error: $f(\hat x_n) - f^* \le 2 \|\hat f_n - f\|_\infty$

➤ Can we do better? No! Intuitions:

✴ $h_n \sim n^{-1/(2\alpha+d)}$, $\epsilon_n \sim n^{-\alpha/(2\alpha+d)}$
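The baseline can be sketched in one dimension: query $f$ on a uniform grid, average the repeated noisy observations in each cell, and return the best cell. The budget, grid size, and toy objective below are illustrative assumptions:

```python
import numpy as np

# 1-D sketch of "uniform sampling + nonparametric reconstruction":
# sample on a uniform grid (bandwidth h_n = 1/m), average repeated noisy
# values per cell, and return the cell with the smallest estimate.

rng = np.random.default_rng(1)
f = lambda x: (x - 0.42) ** 2            # toy smooth objective on [0, 1]

n = 100_000                               # total query budget
m = 100                                   # number of grid cells
reps = n // m                             # repeated queries per grid point

grid = (np.arange(m) + 0.5) / m
estimates = np.array([np.mean(f(g) + rng.standard_normal(reps)) for g in grid])

x_hat = grid[np.argmin(estimates)]        # plug-in minimizer of the reconstruction
opt_error = f(x_hat) - f(0.42)
print(x_hat, opt_error)
```

The optimization error is bounded by twice the sup-norm reconstruction error, which is the inequality on this slide.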

slide-16
SLIDE 16

LOCAL RESULTS

➤ Characterize error for functions "near" a reference function $f_0$
➤ What is the error rate for $f$ close to $f_0$ that is …

✴ a constant function?
✴ strongly convex?
✴ has regular level sets?
✴ …

➤ Can an algorithm achieve instance-optimal error, without knowing $f_0$?

slide-17
SLIDE 17

NOTATIONS

➤ Some definitions

✴ Level set: $L_f(\epsilon) := \{x \in X : f(x) \le f^* + \epsilon\}$
✴ Distribution function: $\mu_f(\epsilon) := \mathrm{vol}(L_f(\epsilon))$

slide-18
SLIDE 18

REGULARITY CONDITIONS

➤ Some definitions

✴ Level set: $L_f(\epsilon) := \{x \in X : f(x) \le f^* + \epsilon\}$
✴ Distribution function: $\mu_f(\epsilon) := \mathrm{vol}(L_f(\epsilon))$

➤ Regularity condition (A1):

✴ # of $\delta$-radius balls needed to cover $L_f(\epsilon)$ is $\asymp 1 + \mu_f(\epsilon)/\delta^d$

(Figure: a regular level set $L_f(\epsilon)$)
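A quick numerical sanity check of (A1) for a toy quadratic in $d = 2$; the function, grid resolution, and radii are my illustrative choices:

```python
import numpy as np

# Check regularity condition (A1) numerically: for a quadratic f on [0,1]^2,
# count the delta-cells touching the level set L_f(eps) (a proxy for the
# delta-covering number) and compare against 1 + mu_f(eps) / delta^d.

d = 2
f = lambda X: np.sum((X - 0.5) ** 2, axis=-1)   # f* = 0 at the center

m = 400                                          # fine measurement grid
xs = (np.arange(m) + 0.5) / m
X = np.stack(np.meshgrid(xs, xs, indexing="ij"), axis=-1)
vals = f(X)

eps = 0.01
inside = vals <= eps                    # indicator of L_f(eps) on the grid
mu = inside.mean()                      # vol(L_f(eps)) as a fraction of [0,1]^2

delta = 0.02                            # covering radius
k = int(delta * m)                      # fine-grid cells per delta-cell
# number of delta-cells intersecting the level set
blocks = inside.reshape(m // k, k, m // k, k).any(axis=(1, 3))
covering = int(blocks.sum())

print(covering, 1 + mu / delta ** d)    # should agree up to constants
```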

slide-19
SLIDE 19

REGULARITY CONDITIONS

➤ Some definitions

✴ Level set: $L_f(\epsilon) := \{x \in X : f(x) \le f^* + \epsilon\}$
✴ Distribution function: $\mu_f(\epsilon) := \mathrm{vol}(L_f(\epsilon))$

➤ Regularity condition (A1):

✴ # of $\delta$-radius balls needed to cover $L_f(\epsilon)$ is $\asymp 1 + \mu_f(\epsilon)/\delta^d$

(Figure: an irregular level set $L_f(\epsilon)$)

slide-21
SLIDE 21

REGULARITY CONDITIONS

➤ Some definitions

✴ Level set: $L_f(\epsilon) := \{x \in X : f(x) \le f^* + \epsilon\}$
✴ Distribution function: $\mu_f(\epsilon) := \mathrm{vol}(L_f(\epsilon))$

➤ Regularity condition (A2): $\mu_f(\epsilon \log n) \le \mu_f(\epsilon) \times O(\log^\gamma n)$

(Figure: $\mu_f(\epsilon)$ grows regularly between $\epsilon$ and $\epsilon \log n$)

slide-22
SLIDE 22

REGULARITY CONDITIONS

➤ Some definitions

✴ Level set: $L_f(\epsilon) := \{x \in X : f(x) \le f^* + \epsilon\}$
✴ Distribution function: $\mu_f(\epsilon) := \mathrm{vol}(L_f(\epsilon))$

➤ Regularity condition (A2): $\mu_f(\epsilon \log n) \le \mu_f(\epsilon) \times O(\log^\gamma n)$

(Figure: $\mu_f(\epsilon)$ grows irregularly between $\epsilon$ and $\epsilon \log n$)

slide-23
SLIDE 23

LOCAL UPPER BOUND

➤ Main result on local upper bound:

THEOREM 1. Suppose the regularity conditions hold. There exists an algorithm such that for sufficiently large n,
$\sup_{f \in \Sigma^\alpha(M)} \Pr_f\big[ L(\hat{x}_n; f) \ge C \varepsilon_n(f) \log^c n \big] \le 1/4$,
where $\varepsilon_n(f) := \sup\{\epsilon > 0 : \epsilon^{-(2+d/\alpha)} \mu_f(\epsilon) \ge n\}$.

• Adaptivity: the algorithm does not know $f$.
• Instance dependent: the error rate depends on $f$.

slide-30
SLIDE 30

LOCAL UPPER BOUND

➤ Example 1: polynomial growth, $\mu_f(\epsilon) \asymp \epsilon^\beta$, $0 < \beta$

✴ Theorem 1 gives $\varepsilon_n(f) \asymp n^{-\alpha/(2\alpha+d-\alpha\beta)}$
✴ Much faster than the "baseline" rate $n^{-\alpha/(2\alpha+d)}$

slide-31
SLIDE 31

LOCAL UPPER BOUND

➤ Example 2: constant function, $f \equiv 0$

✴ Theorem 1 gives $\varepsilon_n(f) \asymp n^{-\alpha/(2\alpha+d)}$
✴ This is the worst-case function, matching the global rate $n^{-\alpha/(2\alpha+d)}$

slide-32
SLIDE 32

LOCAL UPPER BOUND

➤ Example 3: strongly convex $f$, $\mu_f(\epsilon) \asymp \epsilon^{d/2}$

✴ Theorem 1 gives $\varepsilon_n(f) \asymp n^{-1/2}$
✴ Matches the classical zeroth-order convex rate $n^{-1/2}$ up to log terms
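Solving the defining inequality for $\varepsilon_n(f)$ numerically confirms the closed-form rate in the polynomial-growth example; the search range and tolerance below are arbitrary sketch choices:

```python
import numpy as np

# Numeric check of the critical radius
#   eps_n(f) := sup{ eps > 0 : eps^{-(2 + d/alpha)} * mu_f(eps) >= n }
# for mu_f(eps) = eps^beta, whose closed form is
#   eps_n = n^{-alpha / (2*alpha + d - alpha*beta)}.

def eps_n(n, alpha, d, mu, lo=1e-12, hi=1.0, iters=200):
    # g(eps) = eps^{-(2+d/alpha)} mu(eps) is decreasing here, so eps_n is
    # where g crosses n; geometric bisection handles the wide scale range
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if mid ** (-(2 + d / alpha)) * mu(mid) >= n:
            lo = mid            # condition holds: eps_n is at least mid
        else:
            hi = mid
    return lo

alpha, d, beta = 1.0, 2.0, 1.0
mu = lambda e: e ** beta
predicted = lambda n: n ** (-alpha / (2 * alpha + d - alpha * beta))

for n in [1e4, 1e6, 1e8]:
    print(n, eps_n(n, alpha, d, mu) / predicted(n))   # ratio should be ~1
```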

slide-33
SLIDE 33

LOCAL LOWER BOUND

➤ Main result on local lower bound:

THEOREM 2. Suppose $f_0$ satisfies the regularity conditions. Then
$\inf_{\hat{x}_n} \sup_{f \in \Sigma^\alpha(M),\ \|f - f_0\|_\infty \lesssim \varepsilon_n(f_0)} \mathbb{E}_f[L(\hat{x}_n; f)] \gtrsim \varepsilon_n(f_0)$,
where $\varepsilon_n(f_0) := \sup\{\epsilon > 0 : \epsilon^{-(2+d/\alpha)} \mu_{f_0}(\epsilon) \ge n\}$.

• Full knowledge of reference: the algorithm knows $f_0$.
• Local minimaxity: estimation error for $f$ close to the reference $f_0$.

slide-40
SLIDE 40

PROOF SKETCH OF UPPER BOUND

➤ Our algorithm: "successive rejection"

✴ Step 1: uniform sampling and build confidence intervals (CIs)
✴ Step 2: remove sub-optimal points
✴ Step 3: uniformly sample the remaining points.


slide-44
SLIDE 44

PROOF SKETCH OF UPPER BOUND

➤ Our algorithm: "successive rejection"

✴ Step 1: uniform sampling and build confidence intervals (CIs)
✴ Step 2: remove sub-optimal points
✴ Step 3: uniformly sample the remaining points.

➤ Key observation between iterations: $S_\tau \subseteq L_f(\varepsilon) \Rightarrow S_{\tau+1} \subseteq L_f(\varepsilon/2)$, until $\varepsilon \sim \varepsilon_n(f) \times \log^c n$
➤ An $O(\log n)$ number of iterations suffices.
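The three steps can be sketched in one dimension; the constants, CI widths, and epoch schedule below are illustrative choices, not the paper's:

```python
import numpy as np

# 1-D sketch of "successive rejection": each epoch samples uniformly over
# the active cells, forms confidence intervals for each cell mean, and
# discards cells whose lower confidence bound exceeds the smallest upper
# confidence bound. Each epoch halves the target accuracy.

rng = np.random.default_rng(2)
f = lambda x: (x - 0.6) ** 2             # toy smooth objective on [0, 1]
m = 200                                   # grid resolution
grid = (np.arange(m) + 0.5) / m
active = np.ones(m, dtype=bool)

for epoch in range(8):
    reps = 50 * 2 ** epoch                # more samples per surviving cell
    idx = np.flatnonzero(active)
    means = np.array([np.mean(f(grid[i]) + rng.standard_normal(reps)) for i in idx])
    width = 3.0 / np.sqrt(reps)           # ~3-sigma CI half-width for the mean
    keep = means - width <= np.min(means + width)   # step 2: reject sub-optimal cells
    active[:] = False
    active[idx[keep]] = True

x_hat = grid[np.flatnonzero(active)][np.argmin(means[keep])]
print(x_hat)
```

The surviving set shrinks toward the level set of the minimum, which is the $S_\tau \subseteq L_f(\varepsilon) \Rightarrow S_{\tau+1} \subseteq L_f(\varepsilon/2)$ invariant on this slide.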

slide-45
SLIDE 45

PROOF SKETCH OF LOWER BOUND

➤ Step 1: constructing "packings" on $L_{f_0}(\epsilon_n)$

✴ $X = [0, 1]^2$; pack balls of radius $h_n \asymp \epsilon_n^{1/\alpha}$ inside $L_{f_0}(\epsilon_n)$
✴ Discrepancy within a ball: $2\epsilon_n$; the algorithm must identify the perturbed ball w.h.p.

slide-46
SLIDE 46

PROOF SKETCH OF LOWER BOUND

➤ Step 1: constructing "packings" on $L_{f_0}(\epsilon_n)$

✴ Resembles bandit pure exploration: arms $\mu_1, \mu_2, \cdots, \mu_H \in \mathbb{R}$ with $\mu_i = 2\epsilon_n$ and $\mu_{-i} = 0$
✴ Identify the non-zero arm

slide-47
SLIDE 47

PROOF SKETCH OF LOWER BOUND

➤ Step 1: constructing "packings" on $L_{f_0}(\epsilon_n)$

✴ Resembles bandit pure exploration
✴ $\mathrm{KL}(P_0 \,\|\, P_i) \lesssim n_i \cdot \epsilon_n^2$, with $n_i \lesssim n/H$
✴ $H \gtrsim \mu_{f_0}(\epsilon_n)/h_n^d$ balls packed (by regularity)
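The KL bound follows from the Gaussian identity $\mathrm{KL}(N(0,1)\,\|\,N(\mu,1)) = \mu^2/2$; my reconstruction of the chain, in the slide's notation:

```latex
\mathrm{KL}(P_0 \,\|\, P_i)
  = \sum_{t:\, x_t \in B_i} \mathrm{KL}\big(N(0,1) \,\|\, N(2\varepsilon_n, 1)\big)
  = n_i \cdot \frac{(2\varepsilon_n)^2}{2}
  = 2\, n_i \, \varepsilon_n^2 .
```

Distinguishing $P_0$ from the hardest $P_i$ needs $\mathrm{KL} = \Omega(1)$ for some $i$ with $n_i \lesssim n/H$, i.e. $n \gtrsim H/\varepsilon_n^2$; combining with $H \gtrsim \mu_{f_0}(\varepsilon_n)/h_n^d$ and $h_n \asymp \varepsilon_n^{1/\alpha}$ gives $n \gtrsim \varepsilon_n^{-(2+d/\alpha)} \mu_{f_0}(\varepsilon_n)$, matching the definition of $\varepsilon_n(f_0)$.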

slide-51
SLIDE 51

TAKE-HOME MESSAGES

➤ (Noisy) zeroth-order optimization of smooth functions is in general difficult

✴ As difficult as estimating the function in sup-norm.

➤ The optimal convergence rates exhibit significant gaps locally for different objective functions

✴ The local minimax rate is mostly dictated by level-set growth;
✴ The constant function is the hardest example;
✴ Strongly convex functions do not suffer the curse of dimensionality.

➤ A successive-rejection type algorithm is near-optimal.

slide-52
SLIDE 52

FUTURE DIRECTIONS

➤ Are the regularity conditions absolutely necessary?

✴ Can the level sets of $f$ be irregular?
✴ Can the volumes of level sets of $f$ grow heterogeneously?

➤ Are there more computationally efficient algorithms?

✴ Key challenge: avoiding constructing sup-norm CIs explicitly.

➤ Log factors: are they removable? (Conjecture: yes!)

✴ Active-query methods for nonparametric estimation / bandit pure exploration do not have log terms.

slide-53
SLIDE 53

QUESTIONS