SLIDE 1

Two Useful Arrows Darts in that Quiver

Clément Canonne FOCS Workshop – November 9, 2019

SLIDE 2

Averaging, Bucketing, and Investing arguments

SLIDE 3

Suppose you have a function a : X → [0,1] such that E[a(x)] ≥ ε. (Let’s say you already proved that.) We think of a(x) as the quality of x, and “using” x has cost cost(a(x)).

SLIDE 4


For instance: a population of coins, each with its own bias. The expected bias is ε; for any given coin, distinguishing bias 0 from bias α takes ≍ 1/α² tosses. Goal: find a biased coin.

SLIDE 5

How to convert this into a useful thing? How to find an x with small cost?

That is, can we get

  Pr_x[a(x) ≥ blah(ε)] ≥ bluh(ε)

for some “good” functions blah, bluh?

SLIDE 6

“By a standard averaging argument...”

First attempt: Markov

Lemma (Markov)
We have

  Pr_x[a(x) ≥ ε/2] ≥ ε/2.   (1)
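The lemma can be sanity-checked exhaustively on any small discrete example (the particular a-values and probabilities below are an arbitrary illustration):

```python
# Sanity check of the Markov-type bound: if E[a(x)] >= eps,
# then Pr[a(x) >= eps/2] >= eps/2.
eps = 0.2
support = [(0.0, 0.5), (0.1, 0.2), (0.5, 0.2), (1.0, 0.1)]  # (value of a, probability)

mean_a = sum(v * p for v, p in support)
assert mean_a >= eps  # the hypothesis of the lemma

pr_good = sum(p for v, p in support if v >= eps / 2)
assert pr_good >= eps / 2  # the conclusion
print(mean_a, pr_good)
```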

SLIDE 7

“By a standard averaging argument...”

First attempt: Markov

Proof.

  ε ≤ E[a(x)] ≤ (ε/2) · Pr_x[a(x) < ε/2] + 1 · Pr_x[a(x) ≥ ε/2] ≤ ε/2 + Pr_x[a(x) ≥ ε/2],

using Pr_x[a(x) < ε/2] ≤ 1; rearranging gives Pr_x[a(x) ≥ ε/2] ≥ ε/2. ∎

SLIDE 8

“By a standard averaging argument...”

First attempt: Markov

Strategy

Sample O(1/ε) x’s to find a “good” one; for each, pay cost(ε/2).

SLIDE 9

“By a standard averaging argument...”

First attempt: Markov

Yes, but...

Typically, this gives total cost at least quadratic in 1/ε, since cost(α) = Ω(1/α): we take O(1/ε) samples and pay cost(ε/2) = Ω(1/ε) for each.

SLIDE 10

“By a standard averaging argument...”

First attempt: Markov

Yes, but...

We should not have to pay the worst of both worlds.

SLIDE 11

“By a standard bucketing argument...”

Second attempt: my bucket list

Lemma (Bucketing)
There exists 1 ≤ j ≤ ⌈log(2/ε)⌉ =: L such that

  Pr_x[a(x) ≥ 2^(−j)] ≥ 2^j ε / (4L).   (2)
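The bucketing lemma can be checked by brute force on a discrete example: scan all scales j and return one satisfying the guarantee. (The helper name and the example distributions below are illustrative assumptions.)

```python
import math

def good_bucket(support, eps):
    """Return a j in [1, L] with Pr[a(x) >= 2^-j] >= 2^j * eps / (4L),
    which the bucketing lemma guarantees whenever E[a(x)] >= eps.
    `support` is a list of (value of a, probability) pairs."""
    L = math.ceil(math.log2(2 / eps))
    for j in range(1, L + 1):
        pr = sum(p for v, p in support if v >= 2.0 ** -j)
        if pr >= 2 ** j * eps / (4 * L):
            return j
    return None  # would contradict the lemma

eps = 0.1
# E[a] = 0.4 * 0.25 = 0.1 >= eps; the good scale here is not the top one.
print(good_bucket([(0.0, 0.6), (0.25, 0.4)], eps))  # j = 2
# All the mass sits at a = 1, so the top scale works immediately.
print(good_bucket([(1.0, 0.1), (0.0, 0.9)], eps))   # j = 1
```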

SLIDE 12

“By a standard bucketing argument...”

Second attempt: my bucket list

Proof.
Define buckets B_0 := {x : a(x) ≤ ε/2} and B_j := {x : 2^(−j) ≤ a(x) < 2^(−j+1)} for 1 ≤ j ≤ L. Then

  ε ≤ E[a(x)] ≤ (ε/2) · Pr[x ∈ B_0] + Σ_{j=1}^{L} 2^(−j+1) · Pr[x ∈ B_j].

Since Pr[x ∈ B_0] ≤ 1, the sum is at least ε/2, so (averaging!) there exists j* such that 2^(−j*+1) · Pr[x ∈ B_{j*}] ≥ ε/(2L), i.e., Pr[x ∈ B_{j*}] ≥ 2^(j*) ε/(4L). ∎
SLIDE 13

“By a standard bucketing argument...”

Second attempt: my bucket list

Strategy

For each j ∈ [L], in case it’s the good bucket: sample O(log(1/ε)/(2^j ε)) x’s to find a “good” one in B_j; for each such x, pay cost(2^(−j)).

SLIDE 14

“By a standard bucketing argument...”

Second attempt: my bucket list

Total cost (examples):

  Σ_{j=1}^{L} ( log(1/ε) / (2^j ε) ) · cost(2^(−j)) ≍ { log²(1/ε)/ε   if cost(α) ≍ 1/α
                                                      { log(1/ε)/ε²   if cost(α) ≍ 1/α²
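The two rates can be confirmed numerically by evaluating the sum directly and comparing it to the claimed asymptotics (the function name below is an assumption for illustration):

```python
import math

def bucket_total_cost(eps, cost):
    # The slide's total: sum_{j=1}^{L} log(1/eps) / (2^j * eps) * cost(2^-j).
    L = math.ceil(math.log2(2 / eps))
    return sum(math.log2(1 / eps) / (2 ** j * eps) * cost(2.0 ** -j)
               for j in range(1, L + 1))

eps = 0.001
linear = bucket_total_cost(eps, lambda a: 1 / a)       # cost(alpha) ~ 1/alpha
quad = bucket_total_cost(eps, lambda a: 1 / a ** 2)    # cost(alpha) ~ 1/alpha^2
# Ratios to the claimed rates stay bounded as eps shrinks:
print(linear / (math.log2(1 / eps) ** 2 / eps))   # roughly 1
print(quad / (math.log2(1 / eps) / eps ** 2))     # roughly 4
```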

SLIDE 15

“By a standard bucketing argument...”

Second attempt: my bucket list

Yes, but...

we lose log factors. Do we have to lose log factors?

SLIDE 16

“By a refined averaging argument...”

Third (and last) attempt: strategic investment

Assume that cost(α) is superlinear, e.g., cost(α) = 1/α².

Lemma (Levin’s Economical Work Investment Strategy)
There exists 1 ≤ j ≤ ⌈log(2/ε)⌉ =: L such that

  Pr_x[a(x) ≥ 2^(−j)] ≥ 2^j ε / (8(L + 1 − j)²).   (3)

SLIDE 17

“By a refined averaging argument...”

Third (and last) attempt: strategic investment

Proof.
By contradiction: if the bound failed for every j, then

  ε ≤ E[a(x)] ≤ ε/2 + Σ_{j=1}^{L} 2^(−j+1) · Pr[x ∈ B_j]
              ≤ ε/2 + Σ_{j=1}^{L} 2^(−j+1) · Pr[a(x) ≥ 2^(−j)]
              < ε/2 + Σ_{j=1}^{L} 2^(−j+1) · 2^j ε / (8(L + 1 − j)²)
              = ε/2 + (ε/4) Σ_{ℓ=1}^{L} 1/ℓ²
              < ε/2 + (ε/4) Σ_{ℓ=1}^{∞} 1/ℓ²
              < ε.

“Oops.” ∎

SLIDE 18

“By a refined averaging argument...”

Third (and last) attempt: strategic investment

Strategy

For each j ∈ [L]: sample O((L + 1 − j)²/(2^j ε)) x’s to find a “good” one in B_j; for each such x, pay cost(2^(−j)) ≍ 2^(2j).

SLIDE 19

“By a refined averaging argument...”

Third (and last) attempt: strategic investment

Total cost:

  Σ_{j=1}^{L} ( (L + 1 − j)² / (2^j ε) ) · 2^(2j) = (1/ε) Σ_{j=1}^{L} (L + 1 − j)² · 2^j
    = (2^(L+1)/ε) Σ_{ℓ=1}^{L} ℓ² · 2^(−ℓ) ≲ (4/ε²) Σ_{ℓ=1}^{∞} ℓ² · 2^(−ℓ),

where the last sum is O(1). (It’s 6.)
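Both claims — that the series is a constant (6), and that the total cost is O(1/ε²) — are easy to verify numerically (the function name below is an assumption for illustration):

```python
import math

# The series sum_{l >= 1} l^2 * 2^{-l} converges to 6.
s = sum(l ** 2 * 2.0 ** -l for l in range(1, 60))
print(s)  # approaches 6

def invest_total_cost(eps):
    # sum_{j=1}^{L} (L+1-j)^2 / (2^j * eps) * 2^{2j}, the work-investment total.
    L = math.ceil(math.log2(2 / eps))
    return sum((L + 1 - j) ** 2 / (2 ** j * eps) * 2 ** (2 * j)
               for j in range(1, L + 1))

for eps in (0.1, 0.01, 0.001):
    # Multiplying by eps^2 gives a bounded constant, i.e., the total is O(1/eps^2).
    print(eps, invest_total_cost(eps) * eps ** 2)
```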

SLIDE 20

“By a refined averaging argument...”

Third (and last) attempt: strategic investment

Yes, but...

No, actually, nothing: this works for any cost(α) ≫ 1/α^(1+δ).

SLIDE 21

“By a refined averaging argument...”

Third (and last) attempt: strategic investment

For cost(α) ≍ 1/α, not so easy, but some stuff exists.

SLIDE 22

Thomas’ Favorite Lemma

SLIDE 23

Kullback–Leibler Divergence

Recall the definition of the Kullback–Leibler divergence (a.k.a. relative entropy) between two discrete distributions p, q:

  D(p‖q) = Σ_ω p(ω) log( p(ω) / q(ω) )
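As a sketch, the definition translates directly into code for discrete distributions given as probability lists (natural-log convention assumed here; the function name is ours):

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum_w p(w) * log(p(w) / q(w)) for discrete distributions
    given as lists of probabilities (natural log; 0 log 0 terms vanish)."""
    assert all(qi > 0 or pi == 0 for pi, qi in zip(p, q)), "requires p << q"
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(kl_divergence([0.5, 0.5], [0.9, 0.1]))  # positive, about 0.511 nats
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # 0: a distribution vs. itself
```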

SLIDE 24

Kullback–Leibler Divergence

It has some issues (no symmetry, no triangle inequality), yes, but it is everywhere (for a reason). It also has many nice properties.

SLIDE 25

Kullback–Leibler Divergence

The dual characterization

Theorem (First)
For every q ≪ p,

  D(p‖q) = sup_f { E_{x∼p}[f(x)] − log E_{x∼q}[e^(f(x))] }   (4)

Theorem (Second)
For every λ,

  log E_{x∼p}[e^(λx)] = max_{q≪p} { λ E_{x∼q}[x] − D(q‖p) }   (5)

Known as: the Gibbs variational principle (1902?), Donsker–Varadhan (1975), a special case of Fenchel duality, ...
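The first (Donsker–Varadhan-style) form can be checked numerically on a small discrete example: the optimizer f = log(p/q) attains the supremum exactly, and no other f beats it. (The particular p, q, and the random search are illustrative assumptions.)

```python
import math, random

def dv_objective(f, p, q):
    # E_{x~p}[f(x)] - log E_{x~q}[e^{f(x)}], the inner expression of (4).
    return (sum(pi * fi for pi, fi in zip(p, f))
            - math.log(sum(qi * math.exp(fi) for qi, fi in zip(q, f))))

p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# f* = log(p/q) attains the supremum exactly:
f_star = [math.log(pi / qi) for pi, qi in zip(p, q)]
print(abs(dv_objective(f_star, p, q) - kl))  # 0 up to rounding

# ...and no randomly chosen f does better:
rng = random.Random(1)
best = max(dv_objective([rng.uniform(-5, 5) for _ in p], p, q)
           for _ in range(2000))
print(best <= kl + 1e-12)  # True
```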


SLIDE 27

An application

Theorem
Suppose p is subgaussian on ℝ^d. For every function a : ℝ^d → [0,1] (with α := E_{x∼p}[a(x)] > 0),

  ‖E_{x∼p}[x · a(x)]‖₂² ≤ C_p · α² · log(1/α)   (6)

(the constant C_p depends on the subgaussian parameter, not on d).

SLIDE 28

An application

The proof that follows was communicated to me by Himanshu Tyagi.

SLIDE 29

An application (and its proof, Gaussian case)

Setting z = x_i and defining q ≪ p by dq/dp(x) = a(x)/E_p[a(x)], the second theorem gives

  λ E_q[x_i] ≤ log E_p[e^(λ x_i)] + D(q_i‖p_i) = λ²/2 + D(q_i‖p_i).

Optimizing over λ, E_q[x_i] ≤ √(2 D(q_i‖p_i)), i.e., E_q[x_i]² ≤ 2 D(q_i‖p_i). Summing both sides over 1 ≤ i ≤ d,

  ‖E_q[x]‖₂² ≤ 2 Σ_{i=1}^{d} D(q_i‖p_i),

and playing with nice properties of (conditional) relative entropy (chain rule, etc.), the sum is at most

  Σ_{i=1}^{d} E_{x^(i−1)}[ D(q_{x_i | x^(i−1)} ‖ p_{x_i}) ] = D(q‖p) = E_p[a(x) log a(x)] / E_p[a(x)] + log( 1 / E_p[a(x)] ) ≤ log(1/α),

since the first term is ≤ 0 (as a(x) ∈ [0,1]). This completes the proof. ∎
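The final identity and bound for D(q‖p) hold for any base distribution, so they can be sanity-checked on a discrete toy example (the p and a-values below are arbitrary illustrations):

```python
import math

# With dq/dp = a / E_p[a], check the identity
#   D(q||p) = E_p[a log a] / E_p[a] + log(1 / E_p[a]),
# whose first term is <= 0 since a takes values in (0, 1].
p = [0.25, 0.25, 0.25, 0.25]
a = [0.9, 0.4, 0.1, 0.05]  # arbitrary quality values in (0, 1]
alpha = sum(pi * ai for pi, ai in zip(p, a))
q = [pi * ai / alpha for pi, ai in zip(p, a)]  # the tilted distribution

kl_qp = sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))
decomp = (sum(pi * ai * math.log(ai) for pi, ai in zip(p, a)) / alpha
          + math.log(1 / alpha))
print(abs(kl_qp - decomp))           # 0 up to rounding
print(kl_qp <= math.log(1 / alpha))  # True: D(q||p) <= log(1/alpha)
```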

SLIDE 30

I guess I’m done.