Central limit theorem: variants and applications (Anindya De, University of Pennsylvania)


slide-1
SLIDE 1

Central limit theorem: variants and applications

Anindya De

University of Pennsylvania

slide-2
SLIDE 2

Introduction

One of the cornerstone results in probability theory and statistics.

slide-3
SLIDE 3

Introduction

One of the cornerstone results in probability theory and statistics. Informally, a sum of independent random variables converges to a Gaussian.

slide-5
SLIDE 5

Applications in computer science

Many extensions discovered in the context of algorithmic problems.

  • 1. Invariance principle [Mossel-O’Donnell-Oleszkiewicz] – numerous applications in hardness of approximation, derandomization and social choice.

slide-6
SLIDE 6

Applications in computer science

Many extensions discovered in the context of algorithmic problems.

  • 1. Invariance principle [Mossel-O’Donnell-Oleszkiewicz] – numerous applications in hardness of approximation, derandomization and social choice.

  • 2. Multidimensional central limit theorems [Daskalakis-Papadimitriou, Daskalakis-Kamath-Tzamos, Valiant-Valiant] – many extensions and applications in algorithmic game theory and lower bounds in statistics.

slide-7
SLIDE 7

Applications in computer science

Many extensions discovered in the context of algorithmic problems.

  • 3. Moment matching theorems [P. Valiant, Gopalan-Klivans-Meka] – lower bounds in statistics and learning theory.

slide-8
SLIDE 8

Applications in computer science

Many extensions discovered in the context of algorithmic problems.

  • 3. Moment matching theorems [P. Valiant, Gopalan-Klivans-Meka] – lower bounds in statistics and learning theory.

  • 4. Central limit theorems for low-degree polynomials / polytopes [Harsha-Klivans-Meka, De-Servedio] – derandomization.

slide-9
SLIDE 9

Applications in computer science

Many extensions discovered in the context of algorithmic problems.

  • 3. Moment matching theorems [P. Valiant, Gopalan-Klivans-Meka] – lower bounds in statistics and learning theory.

  • 4. Central limit theorems for low-degree polynomials / polytopes [Harsha-Klivans-Meka, De-Servedio] – derandomization.

  • 5. Discrete central limit theorems [Chen-Goldstein-Shao] – computational learning theory.

slide-10
SLIDE 10

Why is central limit theorem useful?

Central limit theorem: Even if X1, . . . , Xn are unwieldy random variables, their sum X1 + . . . + Xn is nice.

slide-11
SLIDE 11

Why is central limit theorem useful?

Central limit theorem: Even if X1, . . . , Xn are unwieldy random variables, their sum X1 + . . . + Xn is nice. In some applications, the precise convergence to the Gaussian distribution is important.

slide-12
SLIDE 12

Why is central limit theorem useful?

Central limit theorem: Even if X1, . . . , Xn are unwieldy random variables, their sum X1 + . . . + Xn is nice. In some applications, the precise convergence to the Gaussian distribution is important. In others, the fact that a Gaussian can be parameterized by two parameters is sufficient.

slide-13
SLIDE 13

Berry–Esseen theorem

Theorem

Let X1, . . . , Xn be n independent centered random variables such that Var(Xi) = σi² and E[|Xi|³] = β3,i. Define S = Σi Xi, σ² = Var(S) and β3 = Σi β3,i. Then,

dK(S, N(0, σ²)) = O(1) · β3/σ³,

where dK(X, Y) = sup_{z ∈ R} |Pr[X ≤ z] − Pr[Y ≤ z]|.
slide-14
SLIDE 14

Corollary of the Berry–Esseen theorem

Corollary

Let X1, . . . , Xn be n independent and identically distributed centered random variables such that Var(Xi) = σ∗² and E[|Xi|³] = β3,∗ (for all 1 ≤ i ≤ n). Define S = Σi Xi. Then,

dK(S, N(0, nσ∗²)) = O(1/√n) · β3,∗/σ∗³.
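The corollary is easy to check numerically. As an illustration (my own sketch, not from the slides), the following computes the exact Kolmogorov distance between the normalized sum S = (X1 + . . . + Xn)/√n of unbiased ±1 variables and N(0, 1) via the binomial CDF; the distance visibly decays like 1/√n.

```python
import math

def kolmogorov_distance_rademacher(n):
    """Exact sup-gap between the CDF of S = (X1+...+Xn)/sqrt(n),
    with Xi uniform on {-1, +1}, and the standard normal CDF.
    Since the CDF of S is a step function, the supremum over
    thresholds is attained just before or at one of its atoms."""
    normal_cdf = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    cdf = 0.0
    gap = 0.0
    for k in range(n + 1):
        # S takes the value (2k - n)/sqrt(n) with prob. C(n, k) / 2^n.
        z = (2 * k - n) / math.sqrt(n)
        gap = max(gap, abs(cdf - normal_cdf(z)))  # just below the atom
        cdf += math.comb(n, k) / 2.0 ** n
        gap = max(gap, abs(cdf - normal_cdf(z)))  # at the atom
    return gap
```

For instance, the distance for n = 100 comes out roughly half of that for n = 25, matching the O(1/√n) rate.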

slide-15
SLIDE 15

Corollary of the Berry–Esseen theorem

Let us assume that the random variable Xi is hypercontractive – in other words, there is a constant C > 0 such that

E[|Xi|³] ≤ C · (E[|Xi|²])^(3/2).

slide-16
SLIDE 16

Corollary of the Berry–Esseen theorem

Let us assume that the random variable Xi is hypercontractive – in other words, there is a constant C > 0 such that

E[|Xi|³] ≤ C · (E[|Xi|²])^(3/2).

This implies that β3,∗/σ∗³ ≤ C. Thus, the error term in the Berry–Esseen theorem is now

dK(S, N(0, nσ∗²)) = O(C/√n).
slide-17
SLIDE 17

Berry–Esseen for non-identical random variables

Continue to assume that X1, . . . , Xn are all C-hypercontractive. Suppose maxi σi² ≤ ε² · (σ1² + . . . + σn²). Then, the error term in Berry–Esseen becomes

β3/σ³ ≤ C · (Σj σj³)/(Σj σj²)^(3/2) ≤ C · ε.

slide-18
SLIDE 18

Berry–Esseen for non-identical random variables

Continue to assume that X1, . . . , Xn are all C-hypercontractive. Suppose maxi σi² ≤ ε² · (σ1² + . . . + σn²). Then, the error term in Berry–Esseen becomes

β3/σ³ ≤ C · (Σj σj³)/(Σj σj²)^(3/2) ≤ C · ε.

Thus, as long as none of the individual variances is too large, the sum Σi Xi converges to a Gaussian.

slide-19
SLIDE 19

How do you prove Berry–Esseen?

slide-20
SLIDE 20

How do you prove Berry–Esseen?

There are many known techniques used to prove “central limit theorems”.

slide-21
SLIDE 21

How do you prove Berry–Esseen?

There are many known techniques used to prove “central limit theorems”.

  • 1. Lindeberg exchange method (hybrid method) – used by MOO in their proof of the invariance principle.

  • 2. Stein’s method – based on constructing an operator of which the Gaussian is a fixed point.

  • 3. Characteristic functions – a.k.a. Fourier analysis, the original method of Esseen.

slide-22
SLIDE 22

We will prove this (at a high level) only for i.i.d. random variables. Assume that X1, . . . , Xn are i.i.d. with common distribution X. Further, E[X] = 0, E[X²] = 1 and E[X⁴] ≤ 10. In fact, for simplicity, assume that all the moments of X exist.

Goal: Show that S = (X1 + . . . + Xn)/√n satisfies dK(S, N(0, 1)) = O(1/√n).
slide-23
SLIDE 23

Characteristic functions

For any ξ ∈ R and real-valued random variable W, we define

φW(ξ) = E[e^(iξW)].
slide-24
SLIDE 24

Characteristic functions

For any ξ ∈ R and real-valued random variable W, we define

φW(ξ) = E[e^(iξW)].

Observe that φW(0) = 1 for any W. When X1, . . . , Xn are independent and S = (X1 + . . . + Xn)/√n, then

φS(ξ) = Π_{i=1}^n φXi(ξ/√n) = (φX(ξ/√n))ⁿ.

slide-25
SLIDE 25

Characteristic functions

For any ξ ∈ R and real-valued random variable W, we define

φW(ξ) = E[e^(iξW)].

Observe that φW(0) = 1 for any W. When X1, . . . , Xn are independent and S = (X1 + . . . + Xn)/√n, then

φS(ξ) = Π_{i=1}^n φXi(ξ/√n) = (φX(ξ/√n))ⁿ.

Characteristic functions are nothing but the Fourier transform of random variables.
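For unbiased ±1 variables the characteristic function is φX(ξ) = cos(ξ), so the product formula above can be watched converging to the Gaussian characteristic function e^(−ξ²/2). A small illustrative sketch (my own, not from the slides):

```python
import math

def phi_sum(xi, n):
    """Characteristic function of S = (X1+...+Xn)/sqrt(n) for i.i.d.
    uniform ±1 variables: E[exp(i*xi*Xi/sqrt(n))] = cos(xi/sqrt(n)),
    and independence turns the CF of the sum into a product."""
    return math.cos(xi / math.sqrt(n)) ** n

def phi_gauss(xi):
    """Characteristic function of N(0, 1)."""
    return math.exp(-xi * xi / 2.0)
```

For n = 10⁴ the two functions already agree to about four decimal places at ξ = 1.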

slide-26
SLIDE 26

High-level proof idea of the Berry–Esseen theorem

  • 1. Let Z = N(0, 1).
  • 2. Goal: Show that S is close to Z in Kolmogorov distance.
slide-27
SLIDE 27

High-level proof idea of the Berry–Esseen theorem

  • 1. Let Z = N(0, 1).
  • 2. Goal: Show that S is close to Z in Kolmogorov distance.
  • 3. First show that φS(ξ) is close to φZ(ξ).

slide-28
SLIDE 28

High-level proof idea of the Berry–Esseen theorem

  • 1. Let Z = N(0, 1).
  • 2. Goal: Show that S is close to Z in Kolmogorov distance.
  • 3. First show that φS(ξ) is close to φZ(ξ).
  • 4. Perform an approximate Fourier inversion to show that S is close to Z.

slide-29
SLIDE 29

Approximate Fourier inversion

What is meant by approximate Fourier inversion?

slide-30
SLIDE 30

Approximate Fourier inversion

What is meant by approximate Fourier inversion? Exact Fourier inversion:

Pr[S ≤ x] − Pr[Z ≤ x] = lim_{T→∞} (1/2π) ∫_{−T}^{T} e^(−iξx) · (φS(ξ) − φZ(ξ))/(iξ) dξ.

slide-31
SLIDE 31

Approximate Fourier inversion

What is meant by approximate Fourier inversion? Exact Fourier inversion:

Pr[S ≤ x] − Pr[Z ≤ x] = lim_{T→∞} (1/2π) ∫_{−T}^{T} e^(−iξx) · (φS(ξ) − φZ(ξ))/(iξ) dξ.

Approximate Fourier inversion:

|Pr[S ≤ x] − Pr[Z ≤ x]| ≤ (1/2π) ∫_{−T}^{T} |φS(ξ) − φZ(ξ)|/|ξ| dξ + O(1/T).
slide-32
SLIDE 32

Approximate Fourier inversion

Strategy to prove the Berry–Esseen theorem. Approximate Fourier inversion:

|Pr[S ≤ x] − Pr[Z ≤ x]| ≤ (1/2π) ∫_{−T}^{T} |φS(ξ) − φZ(ξ)|/|ξ| dξ + O(1/T).
slide-33
SLIDE 33

Approximate Fourier inversion

Strategy to prove the Berry–Esseen theorem. Approximate Fourier inversion:

|Pr[S ≤ x] − Pr[Z ≤ x]| ≤ (1/2π) ∫_{−T}^{T} |φS(ξ) − φZ(ξ)|/|ξ| dξ + O(1/T).

Choose T ≈ √n. We will bound |φS(ξ) − φZ(ξ)| for |ξ| ≤ T.
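The right-hand side of the bound is easy to evaluate numerically. The sketch below (my own, assuming ±1 summands so that φS(ξ) = cos(ξ/√n)ⁿ, and charging the O(1/T) tail as exactly 1/T) evaluates it with a trapezoid rule:

```python
import math

def inversion_bound(n, steps=20000):
    """Numerically evaluate the approximate-inversion upper bound
        (1/2π) ∫_{-T}^{T} |φS(ξ) − φZ(ξ)| / |ξ| dξ + 1/T
    for S = (X1+...+Xn)/√n with uniform ±1 summands and T = √n/100."""
    T = math.sqrt(n) / 100.0

    def integrand(xi):
        if xi == 0.0:
            return 0.0  # |φS − φZ| vanishes like ξ^4 near ξ = 0
        diff = abs(math.cos(xi / math.sqrt(n)) ** n -
                   math.exp(-xi * xi / 2.0))
        return diff / abs(xi)

    h = 2.0 * T / steps
    total = 0.0
    for k in range(steps + 1):
        xi = -T + k * h
        weight = 0.5 if k in (0, steps) else 1.0  # trapezoid weights
        total += weight * integrand(xi)
    return total * h / (2.0 * math.pi) + 1.0 / T
```

With this crude accounting the bound is dominated by the 1/T = 100/√n tail term, and it shrinks as n grows, as the theorem predicts.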

slide-34
SLIDE 34

Showing φS(ξ) is close to φZ(ξ)

Let us start with φZ(ξ). Recall Z = N(0, 1). It easily follows that

φZ(ξ) = ∫ (1/√(2π)) · e^(−x²/2) · e^(iξx) dx = e^(−ξ²/2).

slide-35
SLIDE 35

Showing φS(ξ) is close to φZ(ξ)

Let us start with φZ(ξ). Recall Z = N(0, 1). It easily follows that

φZ(ξ) = ∫ (1/√(2π)) · e^(−x²/2) · e^(iξx) dx = e^(−ξ²/2).

On the other hand,

φS(ξ) = (φX(ξ/√n))ⁿ = (1 + Σ_{j≥1} iʲ · E[Xʲ] · ξʲ/(j! · n^(j/2)))ⁿ,

and since E[X] = 0 and E[X²] = 1,

φS(ξ) = (1 − ξ²/(2n) + o(|ξ|²/n))ⁿ.

slide-36
SLIDE 36

Showing φS(ξ) is close to φZ(ξ)

Note: the Taylor expansion of φX(ξ/√n) is valid only if |ξ| is small.

Claim: For |ξ| ≤ √n/100, we have

|φS(ξ) − φZ(ξ)| = O((1/√n) · |ξ|³ · e^(−ξ²/3)).
slide-37
SLIDE 37

Showing φS(ξ) is close to φZ(ξ)

Note: the Taylor expansion of φX(ξ/√n) is valid only if |ξ| is small.

Claim: For |ξ| ≤ √n/100, we have

|φS(ξ) − φZ(ξ)| = O((1/√n) · |ξ|³ · e^(−ξ²/3)).

Plugging this back into approximate Fourier inversion (with T = √n/100),

|Pr[S ≤ x] − Pr[Z ≤ x]| ≤ (1/2π) ∫_{−T}^{T} (|ξ|² · e^(−ξ²/3)/√n) dξ + O(1/T).
slide-39
SLIDE 39

Showing φS(ξ) is close to φZ(ξ)

Note: the Taylor expansion of φX(ξ/√n) is valid only if |ξ| is small.

Claim: For |ξ| ≤ √n/100, we have

|φS(ξ) − φZ(ξ)| = O((1/√n) · |ξ|³ · e^(−ξ²/3)).

Proof technique: Split |ξ| into the high-|ξ| regime and the low-|ξ| regime. In particular, define

Γlow = {ξ : |ξ| ≤ n^(1/6)} and Γhigh = {ξ : n^(1/6) < |ξ| ≤ n^(1/2)/100}.

slide-41
SLIDE 41

Showing φS(ξ) is close to φZ(ξ)

Note: the Taylor expansion of φX(ξ/√n) is valid only if |ξ| is small.

Claim: For |ξ| ≤ √n/100, we have

|φS(ξ) − φZ(ξ)| = O((1/√n) · |ξ|³ · e^(−ξ²/3)).

Proof technique: When ξ ∈ Γlow, we apply Taylor’s expansion – recall

φS(ξ) = (1 − ξ²/(2n) + o(|ξ|²/n))ⁿ.

slide-43
SLIDE 43

Showing φS(ξ) is close to φZ(ξ)

Note: the Taylor expansion of φX(ξ/√n) is valid only if |ξ| is small.

Claim: For |ξ| ≤ √n/100, we have

|φS(ξ) − φZ(ξ)| = O((1/√n) · |ξ|³ · e^(−ξ²/3)).

Proof technique: On the other hand, it is not difficult to show that |φS(ξ)| ≤ e^(−ξ²/3). Together with the fact that |φZ(ξ)| = e^(−ξ²/2), this gives |φS(ξ) − φZ(ξ)| ≤ 2e^(−ξ²/3); since |ξ|³/√n ≥ 1 when ξ ∈ Γhigh, this is enough.

slide-44
SLIDE 44

Finishing the proof of Berry–Esseen

For all |ξ| ≤ √n/100, we have

|φS(ξ) − φZ(ξ)| = O((1/√n) · |ξ|³ · e^(−ξ²/3)).
slide-45
SLIDE 45

Finishing the proof of Berry–Esseen

For all |ξ| ≤ √n/100, we have

|φS(ξ) − φZ(ξ)| = O((1/√n) · |ξ|³ · e^(−ξ²/3)).

Plugging this back into the approximate Fourier inversion formula,

|Pr[S ≤ x] − Pr[Z ≤ x]| ≤ (1/2π) ∫_{−T}^{T} |φS(ξ) − φZ(ξ)|/|ξ| dξ + O(1/T),

we get |Pr[S ≤ x] − Pr[Z ≤ x]| = O(n^(−1/2)).

slide-46
SLIDE 46

Application of the Berry–Esseen theorem

Stochastic knapsack: Suppose you have n items, each with a profit ci and a stochastic weight Xi, where each Xi is a positive-valued random variable.

slide-47
SLIDE 47

Application of the Berry–Esseen theorem

Stochastic knapsack: Suppose you have n items, each with a profit ci and a stochastic weight Xi, where each Xi is a positive-valued random variable.

Goal: Given a knapsack with capacity θ and error tolerance probability p, pack a subset S of items such that Pr[Σ_{j∈S} Xj ≤ θ] ≥ 1 − p, while maximizing the profit Σ_{j∈S} cj.

slide-48
SLIDE 48

Berry–Esseen theorem for stochastic knapsack

Stochastic knapsack: Suppose you have n items, each with a profit ci and a stochastic weight Xi, where each Xi is of the form

Xi = wℓ,i with probability 1/2, and Xi = wh,i with probability 1/2.

Here all wℓ,i ∈ [1, . . . , M/4] and wh,i ∈ [3M/4, . . . , M], where M = poly(n). Further, all profits ci ∈ [1, . . . , M].

slide-49
SLIDE 49

Algorithmic result for stochastic knapsack

Result: There is an algorithm which, for any error parameter ε > 0, runs in time poly(M, n^(1/ε²)) and outputs a set S∗ such that Pr[Σ_{j∈S∗} Xj ≤ θ] ≥ 1 − p − ε and Σ_{j∈S∗} cj = OPT.

slide-50
SLIDE 50

Algorithmic result for stochastic knapsack

Result: There is an algorithm which, for any error parameter ε > 0, runs in time poly(M, n^(1/ε²)) and outputs a set S∗ such that Pr[Σ_{j∈S∗} Xj ≤ θ] ≥ 1 − p − ε and Σ_{j∈S∗} cj = OPT.

Key feature: We do not relax the knapsack capacity θ.

slide-51
SLIDE 51

Algorithmic result for stochastic knapsack

Result: There is an algorithm which, for any error parameter ε > 0, runs in time poly(M, n^(1/ε²)) and outputs a set S∗ such that Pr[Σ_{j∈S∗} Xj ≤ θ] ≥ 1 − p − ε and Σ_{j∈S∗} cj = OPT.

Key feature: We do not relax the knapsack capacity θ. See the SODA 2018 paper for the most general version of the results.

slide-52
SLIDE 52

Main idea behind the algorithm

Observation I: If we “center” the random variable Xi, i.e., set Yi = Xi − E[Xi], then it satisfies E[|Yi|³] ≤ (E[|Yi|²])^(3/2).

slide-53
SLIDE 53

Main idea behind the algorithm

Observation I: If we “center” the random variable Xi, i.e., set Yi = Xi − E[Xi], then it satisfies E[|Yi|³] ≤ (E[|Yi|²])^(3/2). Thus, we can potentially apply Berry–Esseen to a sum of the Xi.

slide-54
SLIDE 54

Main idea behind the algorithm

Observation I: If we “center” the random variable Xi, i.e., set Yi = Xi − E[Xi], then it satisfies E[|Yi|³] ≤ (E[|Yi|²])^(3/2). Thus, we can potentially apply Berry–Esseen to a sum of the Xi.

Observation II: Consider any subset of items S with |S| ≥ 100/ε². Then,

maxi Var(Xi) ≤ ε² · (Σ_{j∈S} Var(Xj)).
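Observation I is mechanical to verify for the two-point weights above. A quick sketch (my own, with hypothetical example values): centering a symmetric two-point variable gives E[|Y|³] = (E[Y²])^(3/2) exactly, i.e., hypercontractivity with C = 1.

```python
def centered_moments(w_lo, w_hi):
    """Xi takes the value w_lo or w_hi, each with probability 1/2.
    Yi = Xi - E[Xi] is then +/-(w_hi - w_lo)/2 with equal probability,
    so E[|Yi|^3] equals (E[Yi^2])^(3/2) exactly."""
    half_gap = (w_hi - w_lo) / 2.0
    second = half_gap ** 2      # E[Yi^2] = Var(Xi)
    third = half_gap ** 3       # E[|Yi|^3]
    return second, third
```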

slide-55
SLIDE 55

Algorithmic idea

Step 1: Either the optimum solution Sopt satisfies |Sopt| ≤ 100/ε². In this case, we can brute-force search for Sopt. The running time is n^(Θ(1/ε²)).

slide-56
SLIDE 56

Algorithmic idea

Step 1: Either the optimum solution Sopt satisfies |Sopt| ≤ 100/ε². In this case, we can brute-force search for Sopt. The running time is n^(Θ(1/ε²)).

Step 2: Otherwise, |Sopt| > 100/ε². In this case, define µopt, σ²opt and Copt as (i) µopt = Σ_{j∈Sopt} E[Xj]; (ii) σ²opt = Σ_{j∈Sopt} Var(Xj); (iii) Copt = Σ_{j∈Sopt} cj.

slide-57
SLIDE 57

Algorithmic idea

Observe that µopt, σ²opt and Copt are all integral multiples of 1/4 bounded by M².

slide-58
SLIDE 58

Algorithmic idea

Observe that µopt, σ²opt and Copt are all integral multiples of 1/4 bounded by M². We use dynamic programming to find S∗ such that C∗ = Copt, µ∗ = µopt and σ∗² = σ²opt.

slide-59
SLIDE 59

Algorithmic idea

Observe that µopt, σ²opt and Copt are all integral multiples of 1/4 bounded by M². We use dynamic programming to find S∗ such that C∗ = Copt, µ∗ = µopt and σ∗² = σ²opt.

Consequence of the Berry–Esseen theorem:

Pr[Σ_{j∈S∗} Xj ≤ θ] ≥ Pr[Σ_{j∈Sopt} Xj ≤ θ] − ε.
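The dynamic program can be sketched as a reachability computation over discretized (profit, mean, variance) profiles; by the Berry–Esseen consequence, any witness subset with the optimal profile is as good as Sopt up to ε. This is my own toy version (the real algorithm keeps the table polynomial because profiles are multiples of 1/4 bounded by M²):

```python
def dp_profiles(items):
    """Reachability DP over (profit, mean, variance) profiles: two
    subsets with the same (C, mu, sigma^2) profile have sum
    distributions within the Berry-Esseen error of each other, so it
    suffices to remember one witness subset per reachable profile.
    `items` is a list of (profit, mean, variance) triples, assumed to
    be suitably discretized."""
    reachable = {(0, 0, 0): []}  # profile -> one witness subset (indices)
    for idx, (c, mu, var) in enumerate(items):
        new = {}
        for (cc, mm, vv), subset in reachable.items():
            key = (cc + c, mm + mu, vv + var)
            if key not in reachable and key not in new:
                new[key] = subset + [idx]
        reachable.update(new)
    return reachable
```

The answer to the optimization problem is then read off by scanning the reachable profiles for the maximum profit whose Gaussian tail estimate meets the 1 − p − ε constraint.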

slide-60
SLIDE 60

Algorithmic idea

Consequence of the Berry–Esseen theorem:

Pr[Σ_{j∈S∗} Xj ≤ θ] ≥ Pr[Σ_{j∈Sopt} Xj ≤ θ] − ε.

This is because, by Berry–Esseen, the distributions of Σ_{j∈S∗} Xj and Σ_{j∈Sopt} Xj are both essentially Gaussian (and their means and variances match).

slide-61
SLIDE 61

General algorithmic result for stochastic optimization

Suppose the item sizes {Xi}, i = 1, . . . , n, are all hypercontractive – i.e., E[|Xi|³] ≤ O(1) · (E[|Xi|²])^(3/2).

slide-62
SLIDE 62

General algorithmic result for stochastic optimization

Suppose the item sizes {Xi}, i = 1, . . . , n, are all hypercontractive – i.e., E[|Xi|³] ≤ O(1) · (E[|Xi|²])^(3/2).

Theorem: When the item sizes are hypercontractive, there is an algorithm running in time n^(O(1/ε²)) such that the output set S∗ satisfies

  • 1. Σ_{j∈S∗} cj ≥ (1 − ε) · (Σ_{j∈Sopt} cj).
  • 2. Pr[Σ_{j∈S∗} Xj ≤ θ] ≥ Pr[Σ_{j∈Sopt} Xj ≤ θ] − ε.

slide-63
SLIDE 63

General algorithmic result for stochastic optimization

Suppose the item sizes {Xi}, i = 1, . . . , n, are all hypercontractive – i.e., E[|Xi|³] ≤ O(1) · (E[|Xi|²])^(3/2).

Theorem: When the item sizes are hypercontractive, there is an algorithm running in time n^(O(1/ε²)) such that the output set S∗ satisfies

  • 1. Σ_{j∈S∗} cj ≥ (1 − ε) · (Σ_{j∈Sopt} cj).
  • 2. Pr[Σ_{j∈S∗} Xj ≤ θ] ≥ Pr[Σ_{j∈Sopt} Xj ≤ θ] − ε.

Read the SODA 2018 paper for more details.

slide-64
SLIDE 64

Central limit theorems: Citius, Altius, Fortius

slide-65
SLIDE 65

Central limit theorems: Citius, Altius, Fortius

Let’s do Altius – as in higher-degree polynomials.

slide-66
SLIDE 66

Central limit theorems: Citius, Altius, Fortius

Let’s do Altius – as in higher-degree polynomials. Berry–Esseen says that sums of independent random variables, under mild conditions, converge to a Gaussian.

slide-67
SLIDE 67

Central limit theorems: Citius, Altius, Fortius

Let’s do Altius – as in higher-degree polynomials. Berry–Esseen says that sums of independent random variables, under mild conditions, converge to a Gaussian. What if we replace the sum by a polynomial? Let us think of the easy case, when the degree is 2.


slide-69
SLIDE 69

Central limit theorem for low-degree polynomials

Consider p(x) = ((x1 + . . . + xn)/√n)². As n → ∞, with x1, . . . , xn i.i.d. copies of unbiased ±1 random variables, the distribution of p(x) goes to a χ² distribution (with one degree of freedom).

slide-70
SLIDE 70

Central limit theorem for low-degree polynomials

Consider p(x) = ((x1 + . . . + xn)/√n)². As n → ∞, with x1, . . . , xn i.i.d. copies of unbiased ±1 random variables, the distribution of p(x) goes to a χ² distribution. In fact, suppose p(x) is of degree 2 and of the form p(x) = λ · ℓ²(x) + q(x), where ℓ(x) is a linear form and λ = E[p(x) · ℓ²(x)]. If λ is large, then p(x) is very far from a Gaussian.
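A quick empirical check of the χ² limit (my own sketch; the parameters are arbitrary): sample p(x) = ((x1 + . . . + xn)/√n)² for ±1 inputs and compare the empirical CDF at a point t with Pr[χ²₁ ≤ t] = erf(√(t/2)).

```python
import math
import random

def empirical_chi2_check(n=1600, trials=20000, t=1.0, seed=1):
    """Sample p(x) = ((x1+...+xn)/sqrt(n))**2 for i.i.d. uniform ±1
    inputs and compare the empirical CDF at t against the
    chi-squared(1) CDF, Pr[chi2_1 <= t] = erf(sqrt(t/2))."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        ones = bin(rng.getrandbits(n)).count("1")  # number of +1 coords
        s = 2 * ones - n                           # x1 + ... + xn
        if (s / math.sqrt(n)) ** 2 <= t:
            hits += 1
    return hits / trials, math.erf(math.sqrt(t / 2.0))
```

At t = 1 the two values agree to within a couple of percent for these parameters.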

slide-71
SLIDE 71

Central limit theorem for quadratic polynomials

Theorem

Let p : Rⁿ → R be such that Var(p(x)) = 1 and E[p(x)] = µ. Express p(x) = xᵀAx + ⟨b, x⟩ + c, where A ∈ Rⁿˣⁿ and b ∈ Rⁿ. Suppose ‖A‖op ≤ ε and ‖b‖∞ ≤ ε, and let x ∼ {−1, 1}ⁿ. Then, dK(p(x), N(µ, 1)) = O(√ε). In other words, if p(x) is not correlated with a product of two linear forms, then it is distributed close to a Gaussian.

slide-72
SLIDE 72

Central limit theorem for higher degree polynomials

Corresponding to any multilinear polynomial p : Rⁿ → R of degree d, we have a sequence of tensors (Ad, . . . , A0), where Ai ∈ R^(nⁱ) is a tensor of order i.

slide-73
SLIDE 73

Central limit theorem for higher degree polynomials

Corresponding to any multilinear polynomial p : Rⁿ → R of degree d, we have a sequence of tensors (Ad, . . . , A0), where Ai ∈ R^(nⁱ) is a tensor of order i. For a tensor Ai (where i > 1), we use σmax(Ai) to denote the “maximum singular value” obtained by a non-trivial flattening.

slide-74
SLIDE 74

Central limit theorem for higher-degree polynomials

Corresponding to any multilinear polynomial p : Rⁿ → R of degree d, we have a sequence of tensors (Ad, . . . , A0), where Ai ∈ R^(nⁱ) is a tensor of order i. For a tensor Ai (where i > 1), we use σmax(Ai) to denote the “maximum singular value” obtained by a non-trivial flattening.

Theorem

Let p : Rⁿ → R be a degree-d polynomial with Var(p(x)) = 1 and E[p(x)] = µ, and let (Ad, . . . , A0) denote the tensors corresponding to p. Then, dK(p(x), N(µ, 1)) = Od(√ε), where x ∼ {−1, 1}ⁿ. Here ε ≥ max_{j>1} σmax(Aj) and ε ≥ ‖A1‖∞.

slide-75
SLIDE 75

Features of the central limit theorem

  • 1. Qualitatively tight: in particular, if max_{j>1} σmax(Aj) is large, then the distribution of p(x) does not look like a Gaussian.

slide-76
SLIDE 76

Features of the central limit theorem

  • 1. Qualitatively tight: in particular, if max_{j>1} σmax(Aj) is large, then the distribution of p(x) does not look like a Gaussian.

  • 2. max_{j>1} σmax(Aj) essentially captures the correlation of p(x) with products of two lower-degree polynomials.

slide-77
SLIDE 77

Features of the central limit theorem

  • 1. Qualitatively tight: in particular, if max_{j>1} σmax(Aj) is large, then the distribution of p(x) does not look like a Gaussian.

  • 2. max_{j>1} σmax(Aj) essentially captures the correlation of p(x) with products of two lower-degree polynomials.

  • 3. The condition for convergence to normal is efficiently checkable.
slide-78
SLIDE 78

Proof of the central limit theorem

  • The first step is to go from x ∼ {−1, 1}ⁿ to x ∼ N(0, 1)ⁿ. This is accomplished via the invariance principle.

slide-79
SLIDE 79

Proof of the central limit theorem

  • The first step is to go from x ∼ {−1, 1}ⁿ to x ∼ N(0, 1)ⁿ. This is accomplished via the invariance principle.

  • Once in the Gaussian domain, the question is: when does a polynomial of a Gaussian look like a Gaussian?

slide-80
SLIDE 80

Proof of the central limit theorem

  • The first step is to go from x ∼ {−1, 1}ⁿ to x ∼ N(0, 1)ⁿ. This is accomplished via the invariance principle.

  • Once in the Gaussian domain, the question is: when does a polynomial of a Gaussian look like a Gaussian?

  • Proof technique: Stein’s method + Malliavin calculus.
slide-81
SLIDE 81

Central limit theorem – application in derandomization

  • Derandomization – fertile ground both for applications and

discovery of central limit theorems (in computer science)

slide-82
SLIDE 82

Central limit theorem – application in derandomization

  • Derandomization – fertile ground both for applications and

discovery of central limit theorems (in computer science)

  • Why is central limit theorem useful for derandomization?
slide-83
SLIDE 83

Central limit theorem – application in derandomization

  • Derandomization – fertile ground both for applications and

discovery of central limit theorems (in computer science)

  • Why is central limit theorem useful for derandomization?
  • Example: Suppose we are given a halfspace f : {−1, 1}ⁿ → {−1, 1}, where f(x) = sign(Σ_{i=1}^n wi·xi − θ).

slide-84
SLIDE 84

Central limit theorem – application in derandomization

  • Derandomization – fertile ground both for applications and

discovery of central limit theorems (in computer science)

  • Why is central limit theorem useful for derandomization?
  • Example: Suppose we are given a halfspace f : {−1, 1}ⁿ → {−1, 1}, where f(x) = sign(Σ_{i=1}^n wi·xi − θ). Deterministically compute Pr_{x∈{−1,1}ⁿ}[f(x) = 1].

slide-85
SLIDE 85

Derandomizing halfspaces

  • For f(x) = sign(Σ_{i=1}^n wi·xi − θ), exactly computing Pr_{x∈{−1,1}ⁿ}[f(x) = 1] is #P-hard.

slide-86
SLIDE 86

Derandomizing halfspaces

  • For f(x) = sign(Σ_{i=1}^n wi·xi − θ), exactly computing Pr_{x∈{−1,1}ⁿ}[f(x) = 1] is #P-hard.

  • Computing Pr_{x∈{−1,1}ⁿ}[f(x) = 1] to additive error ε is trivial using randomness.

slide-87
SLIDE 87

Derandomizing halfspaces

  • For f(x) = sign(Σ_{i=1}^n wi·xi − θ), exactly computing Pr_{x∈{−1,1}ⁿ}[f(x) = 1] is #P-hard.

  • Computing Pr_{x∈{−1,1}ⁿ}[f(x) = 1] to additive error ε is trivial using randomness.

  • What can we do deterministically? Or: how are CLTs going to be useful?

slide-88
SLIDE 88

Derandomizing halfspaces

  • For f(x) = sign(Σ_{i=1}^n wi·xi − θ), exactly computing Pr_{x∈{−1,1}ⁿ}[f(x) = 1] is #P-hard.

  • Computing Pr_{x∈{−1,1}ⁿ}[f(x) = 1] to additive error ε is trivial using randomness.

  • What can we do deterministically? Or: how are CLTs going to be useful?

  • [Servedio 2007]: Suppose all the |wi| ≤ ε/100 (where ‖w‖₂ = 1).

slide-89
SLIDE 89

Berry–Esseen in action

Pr_{x∈{−1,1}ⁿ}[Σ_{i=1}^n wi·xi − θ ≥ 0] ≈_ε Pr_{g∼N(0,1)}[g − θ ≥ 0].
slide-90
SLIDE 90

Berry–Esseen in action

Pr_{x∈{−1,1}ⁿ}[Σ_{i=1}^n wi·xi − θ ≥ 0] ≈_ε Pr_{g∼N(0,1)}[g − θ ≥ 0].

  • However, Pr_{g∼N(0,1)}[g − θ ≥ 0] can be computed in O_ε(1) time.

slide-91
SLIDE 91

Berry–Esseen in action

Pr_{x∈{−1,1}ⁿ}[Σ_{i=1}^n wi·xi − θ ≥ 0] ≈_ε Pr_{g∼N(0,1)}[g − θ ≥ 0].

  • However, Pr_{g∼N(0,1)}[g − θ ≥ 0] can be computed in O_ε(1) time.

  • Thus, when all |wi| ≤ ε/100, Pr_{x∈{−1,1}ⁿ}[Σ_{i=1}^n wi·xi − θ ≥ 0] can be computed to ±ε in O_ε(1) · poly(n) time.
slide-92
SLIDE 92

What if max |wi| ≥ ε/100?

  • Suppose |w1| ≥ ε/100. We recurse on the variable x1:

f_{x1=1} = sign(Σ_{i=2}^n wi·xi − θ + w1);  f_{x1=−1} = sign(Σ_{i=2}^n wi·xi − θ − w1).

slide-93
SLIDE 93

What if max |wi| ≥ ε/100?

  • Suppose |w1| ≥ ε/100. We recurse on the variable x1:

f_{x1=1} = sign(Σ_{i=2}^n wi·xi − θ + w1);  f_{x1=−1} = sign(Σ_{i=2}^n wi·xi − θ − w1).

  • Observe that it suffices to compute

(1/2) · (Pr_{x∈{−1,1}^(n−1)}[f_{x1=1}(x) = 1] + Pr_{x∈{−1,1}^(n−1)}[f_{x1=−1}(x) = 1]).
slide-94
SLIDE 94

Berry–Esseen in recursive action

  • Either max_{j≥2} |wj| ≤ (ε/100) · (Σ_{i=2}^n wi²)^(1/2).

slide-95
SLIDE 95

Berry–Esseen in recursive action

  • Either max_{j≥2} |wj| ≤ (ε/100) · (Σ_{i=2}^n wi²)^(1/2).

  • If yes, we can apply the Berry–Esseen theorem.

slide-96
SLIDE 96

Berry–Esseen in recursive action

  • Either max_{j≥2} |wj| ≤ (ε/100) · (Σ_{i=2}^n wi²)^(1/2).

  • If yes, we can apply the Berry–Esseen theorem.

  • Else, we restrict the next large-weight variable. Note: every time we restrict a variable, we capture an ε-fraction of the remaining ℓ2 mass.

slide-97
SLIDE 97

Berry–Esseen in recursive action

  • Either max_{j≥2} |wj| ≤ (ε/100) · (Σ_{i=2}^n wi²)^(1/2).

  • If yes, we can apply the Berry–Esseen theorem.

  • Else, we restrict the next large-weight variable. Note: every time we restrict a variable, we capture an ε-fraction of the remaining ℓ2 mass.

  • Suppose the process goes on for j iterations. Either j ≤ ε^(−1) log(1/ε) or j > ε^(−1) log(1/ε).

slide-98
SLIDE 98

Berry–Esseen in recursive action

  • If j ≤ ε^(−1) log(1/ε), then this reduces the problem to exp(1/ε) subproblems – each of which can be solved using Berry–Esseen.

slide-99
SLIDE 99

Berry–Esseen in recursive action

  • If j ≤ ε^(−1) log(1/ε), then this reduces the problem to exp(1/ε) subproblems – each of which can be solved using Berry–Esseen.

  • If j > ε^(−1) log(1/ε), we simply stop at j = ε^(−1) log(1/ε).
  • The top ε^(−1) log(1/ε) weights capture most of the ℓ2 mass.
slide-100
SLIDE 100

Berry–Esseen in recursive action

  • If j ≤ ε^(−1) log(1/ε), then this reduces the problem to exp(1/ε) subproblems – each of which can be solved using Berry–Esseen.

  • If j > ε^(−1) log(1/ε), we simply stop at j = ε^(−1) log(1/ε).
  • The top ε^(−1) log(1/ε) weights capture most of the ℓ2 mass.
  • Non-trivial: Since the top ε^(−1) log(1/ε) weights capture most of the mass of the vector w, we can just consider the halfspace over these variables.

slide-101
SLIDE 101

Finishing the proof

  • Thus, if j ≥ ε^(−1) log(1/ε), then we have reduced the problem to an (ε^(−1) log(1/ε))-dimensional one.

slide-102
SLIDE 102

Finishing the proof

  • Thus, if j ≥ ε^(−1) log(1/ε), then we have reduced the problem to an (ε^(−1) log(1/ε))-dimensional one.

  • Meta idea: If the condition of the CLT is met, apply it.
slide-103
SLIDE 103

Finishing the proof

  • Thus, if j ≥ ε^(−1) log(1/ε), then we have reduced the problem to an (ε^(−1) log(1/ε))-dimensional one.

  • Meta idea: If the condition of the CLT is met, apply it.
  • If it is not met, then restrict a variable and recurse.
slide-104
SLIDE 104

Finishing the proof

  • Thus, if j ≥ ε^(−1) log(1/ε), then we have reduced the problem to an (ε^(−1) log(1/ε))-dimensional one.

  • Meta idea: If the condition of the CLT is met, apply it.
  • If it is not met, then restrict a variable and recurse.
  • This meta idea is repeated in several works in derandomization which use limit theorems.
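The meta idea above can be sketched directly (my own illustration; it omits the truncation after ε^(−1) log(1/ε) restrictions, so it can recurse exponentially in the worst case, unlike the actual algorithm):

```python
import math

def halfspace_prob(w, theta, eps=0.1):
    """Recursive deterministic estimate of Pr over x in {-1,1}^n that
    w.x >= theta, following the meta idea: if every weight is small
    relative to the remaining l2 mass, apply the Gaussian (Berry-Esseen)
    estimate; otherwise restrict the largest-weight variable and average
    the two restrictions."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm == 0.0:
        return 1.0 if 0.0 >= theta else 0.0
    big = max(range(len(w)), key=lambda i: abs(w[i]))
    if abs(w[big]) <= (eps / 100.0) * norm:
        # CLT regime: w.x is approximately N(0, norm^2).
        return 0.5 * (1.0 - math.erf(theta / (norm * math.sqrt(2.0))))
    rest = w[:big] + w[big + 1:]
    return 0.5 * (halfspace_prob(rest, theta - w[big], eps)
                  + halfspace_prob(rest, theta + w[big], eps))
```

When every weight is large the recursion bottoms out in exact enumeration, and when every weight is small it answers in one Gaussian-CDF call; the real algorithm interpolates between the two.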

slide-105
SLIDE 105

Central limit theorem: Fortius

  • So far, we have seen central limit theorems which provide convergence in Kolmogorov distance.

slide-106
SLIDE 106

Central limit theorem: Fortius

  • So far, we have seen central limit theorems which provide convergence in Kolmogorov distance.

  • In other words, we choose a threshold x and compare Pr[S ≤ x] with Pr[Z ≤ x].

slide-107
SLIDE 107

Central limit theorem: Fortius

  • So far, we have seen central limit theorems which provide convergence in Kolmogorov distance.

  • In other words, we choose a threshold x and compare Pr[S ≤ x] with Pr[Z ≤ x].

  • It is sometimes possible to get convergence in stronger metrics.

slide-108
SLIDE 108

Central limit theorem: Fortius

Theorem (Chen-Goldstein-Shao)

Let X1, X2, . . . , Xn be independent Bernoulli random variables such that S = Σi Xi has mean µ and variance σ². Then, ‖S − Ndisc(µ, σ²)‖₁ = O(σ⁻¹).

slide-109
SLIDE 109

Central limit theorem: Fortius

Theorem (Chen-Goldstein-Shao)

Let X1, X2, . . . , Xn be independent Bernoulli random variables such that S = Σi Xi has mean µ and variance σ². Then, ‖S − Ndisc(µ, σ²)‖₁ = O(σ⁻¹).

Discrete CLTs have found many applications in derandomization and learning.

slide-110
SLIDE 110

Central limit theorem: Fortius

Theorem (Chen-Goldstein-Shao)

Let X1, X2, . . . , Xn be independent Bernoulli random variables such that S = Σi Xi has mean µ and variance σ². Then, ‖S − Ndisc(µ, σ²)‖₁ = O(σ⁻¹).

Discrete CLTs have found many applications in derandomization and learning. Also check out the new discrete CLTs proven by Valiant-Valiant, Daskalakis-Kamath-Tzamos and many others.
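A numerical illustration of this discrete CLT (my own sketch): compare Binomial(n, 1/2), a sum of independent Bernoullis, against the Gaussian discretized to the integers by half-integer cutoffs.

```python
import math

def l1_binomial_vs_disc_gauss(n, p=0.5):
    """L1 distance between Binomial(n, p) and the discretized Gaussian
    that puts on integer k the mass Phi(k + 1/2) - Phi(k - 1/2), where
    Phi is the N(mu, sigma^2) CDF.  (Gaussian mass outside [0, n] is
    ignored; it is exponentially small for these parameters.)"""
    mu, sigma = n * p, math.sqrt(n * p * (1 - p))
    Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    total = 0.0
    for k in range(n + 1):
        binom = math.comb(n, k) * p ** k * (1 - p) ** (n - k)
        gauss = (Phi((k + 0.5 - mu) / sigma) -
                 Phi((k - 0.5 - mu) / sigma))
        total += abs(binom - gauss)
    return total
```

The distance shrinks as σ grows, consistent with the O(σ⁻¹) bound.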

slide-111
SLIDE 111

Central limit theorem: Citius

  • Recall: The sum of n i.i.d. random variables converges to a Gaussian at a rate of O(n^(−1/2)).

slide-112
SLIDE 112

Central limit theorem: Citius

  • Recall: The sum of n i.i.d. random variables converges to a Gaussian at a rate of O(n^(−1/2)).

  • Without more conditions, it is not possible to beat O(n^(−1/2)).
  • However, if the limiting distribution can be non-Gaussian, we can get a convergence rate better than n^(−1/2).

slide-113
SLIDE 113

Central limit theorem: Citius

  • Gaussians are parameterized by two parameters (i.e., two moments).

slide-114
SLIDE 114

Central limit theorem: Citius

  • Gaussians are parameterized by two parameters (i.e., two moments).

  • By allowing richer families, say parameterized by k moments, one can get convergence rates of n^(−k/2).
slide-115
SLIDE 115

Central limit theorem: Citius

  • Gaussians are parameterized by two parameters (i.e., two moments).

  • By allowing richer families, say parameterized by k moments, one can get convergence rates of n^(−k/2).
  • However, the conditions required are a little more delicate.
slide-116
SLIDE 116

Central limit theorem: Citius

  • Gaussians are parameterized by two parameters (i.e., two moments).

  • By allowing richer families, say parameterized by k moments, one can get convergence rates of n^(−k/2).
  • However, the conditions required are a little more delicate.
  • These are referred to as “asymptotic expansions” (see the FOCS 2015 paper for a ‘computer science’ introduction).

slide-117
SLIDE 117

Summary

  • Central limit theorem(s) can be used to summarize statistics such as sums and low-degree polynomials of independent random variables.

slide-118
SLIDE 118

Summary

  • Central limit theorem(s) can be used to summarize statistics such as sums and low-degree polynomials of independent random variables.

  • Many applications in learning theory, game theory, algorithms and complexity.

slide-119
SLIDE 119

Summary

  • Central limit theorem(s) can be used to summarize statistics such as sums and low-degree polynomials of independent random variables.

  • Many applications in learning theory, game theory, algorithms and complexity.

  • Maybe there are even CS-inspired CLTs waiting to be discovered?

slide-120
SLIDE 120

Thanks