SLIDE 1

The Central Limit Theorem: More of the Story

Steven Janke November 2015

SLIDE 2

Central Limit Theorem

Theorem (Central Limit Theorem)

Let $X_1, X_2, \ldots$ be a sequence of independent and identically distributed random variables, each with expectation $\mu$ and variance $\sigma^2$. Then the distribution of
$$Z_n = \frac{X_1 + X_2 + \cdots + X_n - n\mu}{\sigma\sqrt{n}}$$
converges to the distribution of a standard normal random variable:
$$\lim_{n\to\infty} P(Z_n \le x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-y^2/2}\, dy$$
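To make the statement concrete, here is a short simulation sketch (not from the slides); the choice of Exponential(1) summands, for which $\mu = \sigma = 1$, the seed, and the sample sizes are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

# Sketch: standardize sums of i.i.d. Exponential(1) variables (mu = sigma = 1)
# and compare the empirical P(Z_n <= x) with the standard normal CDF.
rng = np.random.default_rng(0)
mu, sigma = 1.0, 1.0
n, reps, x = 100, 50_000, 1.0

samples = rng.exponential(scale=1.0, size=(reps, n))
z = (samples.sum(axis=1) - n * mu) / (sigma * np.sqrt(n))

print("empirical P(Z_n <= 1):", np.mean(z <= x))   # roughly 0.84
print("Phi(1)               :", norm.cdf(x))       # 0.8413...
```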

SLIDE 3

Central Limit Theorem Applications

The sampling distribution of the mean is approximately normal.
The distribution of experimental errors is approximately normal.
But why the normal distribution?

SLIDE 4

Benford’s Law

In an arbitrary table of data, such as populations or lake areas,
$$P[\text{Leading digit is } d] = \log_{10}\!\left(1 + \frac{1}{d}\right)$$
Data: List of 60 Tallest Buildings

Lead Digit   Meters   Feet    Benford
    1        0.433    0.300   0.301
    2        0.117    0.133   0.176
    3        0.150    0.133   0.125
    4        0.100    0.100   0.097
    5        0.067    0.167   0.079
    6        0.017    0.083   0.067
    7        0.033    0.033   0.058
    8        0.083    0.017   0.051
    9        0.000    0.033   0.046
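A quick way to reproduce the Benford column and tabulate leading digits for a data set is sketched below; the slide's building-height list is not reproduced here, so the `heights` array is a hypothetical stand-in.

```python
import math
from collections import Counter

def leading_digit(x: float) -> int:
    """First significant digit of a nonzero number."""
    s = f"{abs(x):.15e}"          # scientific notation, e.g. '4.320...e+02'
    return int(s[0])

def benford(d: int) -> float:
    """Benford probability log10(1 + 1/d) for digit d."""
    return math.log10(1 + 1 / d)

# Hypothetical stand-in data (heights in meters).
heights = [432.0, 828.0, 599.1, 555.7, 541.3, 509.0, 452.0, 450.0]

counts = Counter(leading_digit(h) for h in heights)
for d in range(1, 10):
    print(d, round(counts[d] / len(heights), 3), round(benford(d), 3))
```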

SLIDE 5

Benford Justification

Simon Newcomb 1881; Frank Benford 1938.
"Proof" arguments:
Positional number system
Densities
Scale invariance
Scale and base unbiased (Hill 1995)

SLIDE 6

Central Limit Theorem

Theorem (Central Limit Theorem)

Let $X_1, X_2, \ldots$ be a sequence of independent and identically distributed random variables, each with expectation $\mu$ and variance $\sigma^2$. Then the distribution of
$$Z_n = \frac{X_1 + X_2 + \cdots + X_n - n\mu}{\sigma\sqrt{n}}$$
converges to the distribution of a standard normal random variable:
$$\lim_{n\to\infty} P(Z_n \le x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-y^2/2}\, dy$$

SLIDE 7

Central Limit Theorem Proof

Proof Sketch:
Let $Y_i = X_i - \mu$. The moment generating function of $Y_i$ is $M_{Y_i}(t) = E\,e^{tY_i}$.
The MGF of $Z_n$ is $M_{Z_n}(t) = \left[M_{Y_1}\!\left(\tfrac{t}{\sigma\sqrt{n}}\right)\right]^n$.
$$\lim_{n\to\infty} \ln M_{Z_n}(t) = \frac{t^2}{2}$$
The MGF of the standard normal is $e^{t^2/2}$.
Since the MGFs converge, the distributions converge (Lévy Continuity Theorem).
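The key limit can be checked numerically. A sketch under an illustrative assumption: take centered Exponential(1) summands, whose MGF is available in closed form.

```python
import numpy as np

# Sketch (not from the slides): for X ~ Exponential(1), the centered variable
# Y = X - 1 has sigma = 1 and MGF M_Y(t) = e^{-t} / (1 - t) for t < 1.
# Check that n * ln M_Y(t / (sigma * sqrt(n))) approaches t^2 / 2.
def log_mgf_Y(t):
    return -t - np.log(1 - t)

t = 0.7
for n in [10, 100, 10_000, 1_000_000]:
    print(n, n * log_mgf_Y(t / np.sqrt(n)))
print("t^2 / 2 =", t**2 / 2)
```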

SLIDE 8

Counter-Examples

Moment problem: a lognormal R.V. is not determined by its moments.
No first moment: a Cauchy R.V. has no MGF and $E|X| = \infty$, so the CLT does not hold (see the sketch below).
No second moment: $f(x) = \frac{1}{|x|^3}$ for $|x| \ge 1$.
Pairwise independence is not sufficient for the CLT.
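The Cauchy counter-example is easy to see by simulation; the following sketch (seed and sample sizes are illustrative) uses the fact that the sample mean of standard Cauchy variables is again standard Cauchy.

```python
import numpy as np

# Sketch: the sample mean of n i.i.d. standard Cauchy variables is itself
# standard Cauchy, so the means never concentrate and no normal limit appears.
rng = np.random.default_rng(1)
for n in [10, 1_000, 10_000]:
    means = rng.standard_cauchy(size=(1_000, n)).mean(axis=1)
    iqr = np.percentile(means, 75) - np.percentile(means, 25)
    print(n, "IQR of sample means:", iqr)
# The interquartile range stays near 2 (the IQR of a standard Cauchy)
# instead of shrinking like 1/sqrt(n).
```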

SLIDE 9

De Moivre's Theorem 1733

Each $X_i$ is Bernoulli (0 or 1).
$$b(k) = P[S_n = k] = \binom{n}{k}\frac{1}{2^n}$$
$$n! \approx \sqrt{2\pi}\, n^n \sqrt{n}\, e^{-n} \quad \text{(Stirling's formula)}$$
$$b\!\left(\tfrac{n}{2}\right) \approx \frac{\sqrt{2}}{\sqrt{\pi n}}$$
$$\log\!\left(\frac{b(\tfrac{n}{2}+d)}{b(\tfrac{n}{2})}\right) \approx -\frac{2d^2}{n}$$
$$b\!\left(\tfrac{n}{2}+d\right) \approx \frac{\sqrt{2}}{\sqrt{\pi n}}\, e^{-2d^2/n}$$
$$\lim_{n\to\infty} P\!\left[a \le \frac{S_n - n/2}{\sqrt{n}/2} \le b\right] = \frac{1}{\sqrt{2\pi}} \int_a^b e^{-x^2/2}\, dx$$
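A numerical sketch of De Moivre's local approximation (the value of n and the offsets d are illustrative):

```python
import math

# Sketch: compare the exact symmetric binomial probability b(n/2 + d) with
# De Moivre's approximation sqrt(2 / (pi * n)) * exp(-2 * d**2 / n).
n = 1000
for d in [0, 5, 10, 20]:
    exact = math.comb(n, n // 2 + d) / 2**n
    approx = math.sqrt(2 / (math.pi * n)) * math.exp(-2 * d**2 / n)
    print(d, round(exact, 6), round(approx, 6))
```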

SLIDE 10

Laplace 1810

Dealt with the independent and identically distributed case. Started with discrete variables:
Consider $X_i$ where $p_k = P[X_i = \tfrac{k}{m}]$ for $k = -m, -m+1, \cdots, m-1, m$.
Generating function: $T(t) = \sum_{k=-m}^{m} p_k t^k$
$q_j = P[\sum_i X_i = \tfrac{j}{m}]$ is the coefficient of $t^j$ in $T(t)^n$.
Substitute $e^{ix}$ for $t$ and recall $\frac{1}{2\pi}\int_{-\pi}^{\pi} e^{-itx} e^{isx}\, dx = \delta_{ts}$ (checked numerically below).
Then,
$$q_j = \frac{1}{2\pi}\int_{-\pi}^{\pi} e^{-ijx}\left[\sum_{k=-m}^{m} p_k e^{ikx}\right]^n dx$$
Now, expand $e^{ikx}$ in a power series around 0 and use the fact that the mean of $X_i$ is zero.
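A small sketch verifying the orthogonality relation used in the inversion step (the integer choices are arbitrary examples):

```python
import numpy as np
from scipy.integrate import quad

# Sketch: for integers t and s, (1 / (2 pi)) * integral_{-pi}^{pi} e^{i(s-t)x} dx
# equals 1 when t == s and 0 otherwise (the imaginary part vanishes by symmetry).
def orthogonality(t: int, s: int) -> float:
    real_part, _ = quad(lambda x: np.cos((s - t) * x), -np.pi, np.pi)
    return real_part / (2 * np.pi)

print(orthogonality(3, 3))   # ~1.0
print(orthogonality(3, 5))   # ~0.0
```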

SLIDE 11

Why Normal?

Normal characteristic function:
$$f(u) = \frac{1}{\sqrt{2\pi}} \int e^{iux} e^{-x^2/2}\, dx = e^{-u^2/2}$$
$$f_{S_n/\sigma\sqrt{n}}(u) = E\!\left[e^{iu(S_n/\sigma\sqrt{n})}\right] = \left(f\!\left(\tfrac{u}{\sigma\sqrt{n}}\right)\right)^n = \left(1 - \frac{\sigma^2}{2\sigma^2 n}u^2 + o\!\left(\frac{\sigma^2}{\sigma^2 n}u^2\right)\right)^n = \left(1 - \frac{u^2}{2n} + o\!\left(\frac{u^2}{n}\right)\right)^n \to e^{-u^2/2}$$

SLIDE 12

Levy Continuity

Theorem

If distribution functions $F_n$ converge to $F$, then the corresponding characteristic functions $f_n$ converge to $f$. Conversely, if $f_n$ converges to a function $g$ continuous at 0, then $F_n$ converges to the distribution $F$ whose characteristic function is $g$.
Proof Sketch: The first direction is the Helly-Bray theorem. The set $\{e^{iux}\}$ is a separating set for distribution functions. In both directions, continuity points and the mass of $F_n$ are critical.

SLIDE 13

History

Laplace never presented a general CLT statement (concerned with limiting probabilities for particular problems).
Concern over convergence led Poisson to improvements (the not-identically-distributed case).
Dirichlet and Cauchy changed the conception of analysis (epsilon/delta).
Counter-examples uncovered limitations.
Chebychev proved the CLT using convergence of moments (Markov and Liapounov were students).
First rigorous proof (Liapounov 1900): the CLT holds with independent (but not necessarily i.i.d.) $X_i$ if
$$\frac{\sum_j E|X_j|^3}{\left[\sum_j E X_j^2\right]^{3/2}} = \frac{\sum_j E|X_j|^3}{s_n^3} \to 0$$

SLIDE 14

Liapounov proof

Assume:
$$\frac{\sum_j E|X_j|^3}{\left[\sum_j E X_j^2\right]^{3/2}} = \frac{\sum_j E|X_j|^3}{s_n^3} \to 0$$
$$g_n(u) = \prod_{k=1}^{n} f_k\!\left(\frac{u}{s_n}\right) = \prod_{k=1}^{n}\left[1 + \left(f_k\!\left(\tfrac{u}{s_n}\right) - 1\right)\right]$$
$$f_k\!\left(\frac{u}{s_n}\right) = 1 - \frac{u^2}{2 s_n^2}\left(\sigma_k^2 + \delta_k\!\left(\tfrac{u}{s_n}\right)\right)$$
$$\left|f_k\!\left(\tfrac{u}{s_n}\right) - 1\right| \le 2u^2 \frac{\sigma_k^2}{s_n^2} \implies \sum_{k=1}^{n}\left|f_k\!\left(\tfrac{u}{s_n}\right) - 1\right| \le 2u^2$$
$$\left(E|X_k^2|\right)^{3/2} \le E|X_k|^3$$
$$\sup_{k\le n} \frac{\sigma_k}{s_n} \to 0 \implies \sup_k \left|f_k\!\left(\tfrac{u}{s_n}\right) - 1\right| \to 0$$
Use $\log(1+z) = z(1 + \theta z)$ where $|\theta| \le 1$ for $|z| \le \tfrac{1}{2}$.
$$\log g_n(u) = \sum_{k=1}^{n}\left(f_k\!\left(\tfrac{u}{s_n}\right) - 1\right) + \theta \sum_{k=1}^{n}\left(f_k\!\left(\tfrac{u}{s_n}\right) - 1\right)^2$$

SLIDE 15

Liapounov proof continued

$$\log g_n(u) = \sum_{k=1}^{n}\left(f_k\!\left(\tfrac{u}{s_n}\right) - 1\right) + \theta \sum_{k=1}^{n}\left(f_k\!\left(\tfrac{u}{s_n}\right) - 1\right)^2$$
$$\left|\theta \sum_{k=1}^{n}\left(f_k\!\left(\tfrac{u}{s_n}\right) - 1\right)^2\right| \le \sup_k\left|f_k\!\left(\tfrac{u}{s_n}\right) - 1\right| \cdot \sum_{k=1}^{n}\left|f_k\!\left(\tfrac{u}{s_n}\right) - 1\right| \to 0$$
$$f_k\!\left(\tfrac{u}{s_n}\right) - 1 = -\frac{u^2}{2}\frac{\sigma_k^2}{s_n^2} + \theta_k \frac{u^3}{s_n^3}\, E|X_k|^3$$
$$\sum_{k=1}^{n}\left(f_k\!\left(\tfrac{u}{s_n}\right) - 1\right) = -\frac{u^2}{2} + \theta \frac{u^3}{s_n^3} \sum_{k=1}^{n} E|X_k|^3 \to -\frac{u^2}{2}$$

SLIDE 16

Lindeberg 1922

Theorem (Central Limit Theorem)

Let the variables $X_i$ be independent with $EX_i = 0$ and $EX_i^2 = \sigma_i^2$. Let $s_n$ be the standard deviation of the sum $S_n$ and let $F$ be the distribution of $\tfrac{S_n}{s_n}$. With $\Phi(x)$ the normal distribution, if
$$\frac{1}{s_n^2} \sum_k \int_{|x| \ge \epsilon s_n} x^2\, dF_k \to 0,$$
then
$$\sup_x |F(x) - \Phi(x)| \le 5\epsilon$$

SLIDE 17

Lindeberg Proof

Pick an auxiliary function $f$. For an arbitrary distribution $V$, define $F(x) = \int f(x-t)\, dV(t)$.
With $\phi(x)$ the normal density, define $\Psi(x) = \int f(x-t)\,\phi(t)\, dt$.
A Taylor expansion of $f$ to the third power gives $|F(x) - \Psi(x)| < k \int |x|^3\, dV(x)$.
With $U_i$ the distribution of $X_i$, $F_1(x) = \int f(x-t)\, dU_1(t), \;\ldots,\; F_n(x) = \int F_{n-1}(x-t)\, dU_n(t)$.
Note $U(x) = \int \cdots \int U(x - t_1 - t_2 - \cdots - t_n)\, dU_1(t_1) \cdots dU_n(t_n)$.
By selecting $f$ carefully,
$$|U(x) - \Phi(x)| < 3\left(\sum_i \int |x|^3\, dU_i(x)\right)^{1/4}$$

SLIDE 18

Still Why Normal?

Let $X = X_1 + X_2$ where $X$ is $N(0, 1)$ and $X_1$ is independent of $X_2$.
$$f(u) = f_1(u) f_2(u) = e^{-u^2/2}$$
$e^{-u^2/2}$ is an entire, non-vanishing function with $|f_1(z)| \le e^{c|z|^2}$.
Hadamard factorization theorem $\implies \log f_1(u)$ is a polynomial in $u$ of at most degree 2.
$f_1$ is a characteristic function $\implies f_1(0) = 1$, $f_1(u) = \overline{f_1}(-u)$, and it is bounded. Hence, $\log f_1(u) = iua + bu^2$. This is the general form of the normal characteristic function.

SLIDE 19

Feller - Levy 1935

Theorem (Final Central Limit Theorem)

Let the variables $X_i$ be independent with $EX_i = 0$ and $EX_i^2 = \sigma_i^2$. Let $S_n = \sum_{1}^{n} X_i$ and $s_n^2 = \sum_{1}^{n} \sigma_k^2$. $\Phi$ is the normal distribution and $F_k$ is the distribution of $X_k$. Then as $n \to \infty$, $P[S_n/s_n \le x] \to \Phi(x)$ and $\max_{k\le n} \frac{\sigma_k}{s_n} \to 0$ if and only if for every $\epsilon > 0$
$$\frac{1}{s_n^2} \sum_k \int_{|x| \ge \epsilon s_n} x^2\, dF_k \to 0$$

SLIDE 20

Levy

Following appeared in Levy’s monograph in 1937.

Theorem (Levy’s version)

In order that a sum $S = \sum_j X_j$ of independent variables have a distribution close to Gaussian, it is necessary and sufficient that, after reducing medians to zero, the following conditions be satisfied:
Each summand that is not negligible compared to the dispersion of the entire sum has a distribution close to Gaussian.
The maximum of the absolute value of the negligible summands is itself negligible compared to the dispersion of the sum.

SLIDE 21

Normed sums and Stable Laws

Definition

X is said to have a stable law if whenever $X_1, X_2, \cdots, X_k$ are independent with the same distribution as $X$, we have
$$X_1 + X_2 + \cdots + X_k \overset{D}{=} aX + b$$
Both the Normal ($\sigma^2 < \infty$) and Cauchy ($\sigma^2 = \infty$) laws are stable (see the sketch below).

Theorem (Limit of Normed Sums)

Suppose that
$$S_n = \frac{X_1 + X_2 + \cdots + X_n}{A_n} - B_n \overset{D}{\to} X;$$
then $X$ has a stable law.
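A simulation sketch of stability for the Cauchy law (the value of k, the seed, and the sample size are illustrative): the sum of k independent standard Cauchy variables should match k times a single one, so $a = k$, $b = 0$ in the definition.

```python
import numpy as np

# Sketch: for standard Cauchy X_i, the sum X_1 + ... + X_k has the same
# distribution as k * X (stability with a = k, b = 0); compare quantiles.
rng = np.random.default_rng(2)
k, reps = 5, 200_000

sums = rng.standard_cauchy(size=(reps, k)).sum(axis=1)
scaled = k * rng.standard_cauchy(size=reps)

qs = [0.1, 0.25, 0.5, 0.75, 0.9]
print("quantiles of the sum:", np.round(np.quantile(sums, qs), 2))
print("quantiles of k * X  :", np.round(np.quantile(scaled, qs), 2))
```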

SLIDE 22

Entropy

Definition (Discrete Entropy)

Let X be a discrete random variable taking values $x_i$ with probability $p_i$. The entropy of X is
$$H(X) = -\sum_i p_i \log p_i$$
$H(X) > 0$ unless X is constant.
$H(aX + b) = H(X)$
If X is the result of flipping a fair coin, $H(X) = 1$ (see the sketch below).
If X takes n values, $H(X)$ is maximized when $p_i = \frac{1}{n}$.
Extend to joint entropy $H(X, Y)$.
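A minimal sketch of the discrete entropy; base-2 logs are assumed so that a fair coin gives exactly 1 bit.

```python
import math

def entropy(probs):
    """Discrete entropy H(X) = -sum p_i log2(p_i), in bits (base-2 logs)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # fair coin: 1.0
print(entropy([0.9, 0.1]))    # biased coin: about 0.469
print(entropy([0.25] * 4))    # uniform on 4 values: 2.0 (the maximum)
```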

SLIDE 23

Entropy Axioms

Symmetric in the $p_i$.
Continuous in the $p_i$.
Normalized: $H(X) = 1$ for a fair coin.
X and Y independent gives $H(X, Y) = H(X) + H(Y)$.
Decomposition: $H(r_1, \cdots, r_m, q_1, \cdots, q_n) = \alpha H(r_1, \cdots, r_m) + (1 - \alpha) H(q_1, \cdots, q_n)$
Axioms $\implies H(X) = -\sum_i p_i \log p_i$

SLIDE 24

Entropy

Definition (Differential Entropy)

Let X be a continuous random variable with density p. The differential entropy of X is
$$H(X) = -\int p(t) \log p(t)\, dt$$
X uniform on $[0, c]$ gives $H(X) = \log(c)$.
X is $N(0, \sigma^2)$ gives $H(X) = \frac{1}{2}\log(2\pi e\sigma^2)$ (see the sketch below).
$H(aX + b) = H(X) + \log|a|$
$H(X)$ can be negative.
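A numerical sketch of the normal case (natural logs; the value of sigma is an arbitrary choice):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Sketch: numerically integrate -p log p for a N(0, sigma^2) density and
# compare with the closed form (1/2) log(2 pi e sigma^2) (natural logs).
sigma = 2.0
p = norm(loc=0.0, scale=sigma).pdf
numeric, _ = quad(lambda x: -p(x) * np.log(p(x)), -10 * sigma, 10 * sigma)
closed_form = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
print(numeric, closed_form)   # both about 2.112
```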

SLIDE 25

Entropy

Definition (Relative Entropy)

Let p and q be densities. The relative entropy distance from p to q is
$$D(p\|q) = \int p(x) \log\!\left(\frac{p(x)}{q(x)}\right) dx$$
If $\operatorname{supp}(p) \not\subset \operatorname{supp}(q)$, then $D(p\|q) = \infty$.
D is not a metric (not symmetric and no triangle inequality).
$D(p\|q) \ge 0$ with equality if and only if $p = q$ a.e.
Convergence in D is stronger than convergence in $L^1$.
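A sketch computing $D(p\|q)$ numerically for two normal densities (the particular means and variances are illustrative) and comparing with the known Gaussian closed form:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Sketch: D(p||q) for p = N(0, 1), q = N(1, 2^2), compared with the Gaussian
# closed form log(s2/s1) + (s1^2 + (m1 - m2)^2) / (2 s2^2) - 1/2.
p = norm(0.0, 1.0).pdf
q = norm(1.0, 2.0).pdf
numeric, _ = quad(lambda x: p(x) * np.log(p(x) / q(x)), -12, 12)
closed_form = np.log(2.0 / 1.0) + (1.0 + 1.0) / (2 * 2.0**2) - 0.5
print(numeric, closed_form)   # both about 0.443
```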

SLIDE 26

Entropy

Lemma (Maximum Entropy)

Let p be the density of a random variable with variance $\sigma^2$, and let $\phi_{\sigma^2}$ be the density of a $N(0, \sigma^2)$ random variable. Then
$$H(p) \le H(\phi_{\sigma^2}) = \frac{\log(2\pi e\sigma^2)}{2}$$
with equality if and only if p is a normal density.
Proof:
$$0 \le D(p\|\phi_{\sigma^2}) = \int p(x)\left(\log p(x) + \frac{\log(2\pi\sigma^2)}{2} + \frac{x^2}{2\sigma^2}\log(e)\right) dx = -H(p) + \frac{\log(2\pi e\sigma^2)}{2} = -H(p) + H(\phi_{\sigma^2})$$

SLIDE 27

Second Law of Thermodynamics

Let $\Omega$ be the number of microstates corresponding to a particular macrostate. Then the entropy of the macrostate is $S = k \log_e \Omega$.
$\Omega$ is composed of microstates r of probability $p_r$. If there are v copies of the system, about $v p_r$ are microstates of type r.
$$\Omega = \frac{v!}{v_1!\, v_2! \cdots v_k!} \approx v^v\, v_1^{-v_1}\, v_2^{-v_2} \cdots v_k^{-v_k}$$
Then,
$$S = k \log_e \Omega \approx k\left(v \log(v) - \sum_r v_r \log(v_r)\right) = -k v \sum_r p_r \log(p_r)$$
Entropy is maximized subject to the energy constraint at Gibbs states ($\sum_r p_r = 1$ and $\sum_r p_r E_r = E$).

SLIDE 28

Fisher Information

Definition

Let Y be a random variable with density g and variance $\sigma^2$. Set $\rho = \frac{g'}{g}$ and let $\phi$ be the normal density with the same mean and variance as Y.
Fisher Information is $I(Y) = E[(\rho(Y))^2]$.
Standardized Fisher Information is $J(Y) = \sigma^2 E[(\rho(Y) - \rho_\phi(Y))^2]$.
Z with distribution $N(0, \sigma^2)$ minimizes Fisher information ($I(Z) = \frac{1}{\sigma^2}$, checked numerically below) among all distributions with variance $\sigma^2$. (Note: $J(Z) = 0$.)
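A sketch checking $I(Z) = 1/\sigma^2$ for a normal: for $g = \phi_{\sigma^2}$, $\rho(x) = g'(x)/g(x) = -x/\sigma^2$, so $E[\rho(Z)^2] = \sigma^2/\sigma^4 = 1/\sigma^2$. The Monte Carlo estimate below (seed and sigma are arbitrary choices) agrees.

```python
import numpy as np

# Sketch: for the normal density g, rho(x) = g'(x)/g(x) = -x / sigma^2, so
# I(Z) = E[rho(Z)^2] should equal 1 / sigma^2.
rng = np.random.default_rng(3)
sigma = 3.0

z = rng.normal(0.0, sigma, size=1_000_000)
rho = -z / sigma**2
print("Monte Carlo I(Z):", np.mean(rho**2))
print("1 / sigma^2     :", 1 / sigma**2)
```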

SLIDE 29

Fisher Information Properties

Lemma (de Bruijn)

If Z is a normal R.V. independent of Y with the same mean and variance, then
$$D(Y\|Z) = \int_0^1 J\!\left(\sqrt{t}\, Y + \sqrt{1-t}\, Z\right) \frac{1}{2t}\, dt$$
Proof relies on these facts:
The normal density satisfies the heat equation: $\frac{\partial \phi_\tau}{\partial \tau} = \frac{1}{2}\frac{\partial^2 \phi_\tau}{\partial x^2}$.
Hence, Y + Z also satisfies the heat equation. We can then calculate the derivative of $D(Y\|Z)$.
We also have that if $J(Y) \to 0$ then $D(Y\|Z) \to 0$.

SLIDE 30

Fisher Information Properties

Lemma

If U and V are independent, then for $\beta \in [0, 1]$
$$J(U + V) \le \beta^2 J(U) + (1 - \beta^2) J(V)$$
$$J\!\left(\sqrt{\beta}\, U + \sqrt{1 - \beta}\, V\right) \le \beta J(U) + (1 - \beta) J(V)$$
with equality if and only if U and V are normal. In particular,
$$J(X + Y) \le J(X) \qquad H(X + Y) \ge H(X)$$

SLIDE 31

Fisher Information Properties

From the Lemma, $J(\sqrt{\beta}\, U + \sqrt{1-\beta}\, V) \le \beta J(U) + (1-\beta) J(V)$.
In particular, with $S_n = \left(\sum_{1}^{n} X_i\right)/\sqrt{n}$ and $\beta = \frac{n}{n+m}$,
$$n J(S_n) + m J(S_m) \ge (m + n) J(S_{n+m})$$
If $J(S_n) < \infty$ for some n, then $J(S_n)$ converges to 0. (Take n = m and assume i.i.d. variables to see monotone convergence of a subsequence.)
SLIDE 32

Sketch of Central Limit Theorem Proof

Assume i.i.d. random variables.
$J(S_n)$ converges to 0.
Hence $D(S_n\|Z)$ converges to 0.
For densities $p_{S_n}$ and $\phi_{\sigma^2}$,
$$\left(\int |p_{S_n} - \phi_{\sigma^2}|\right)^2 \le 2\, D(S_n\|Z)$$
$S_n \to N(0, \sigma^2)$.
Can generalize to non-i.i.d. variables.

SLIDE 33

Conclusions

Assume errors are "uniformly asymptotically negligible" (no single error dominates).
Assume independent errors with distributions that have second moments.
Normalize the sum with the mean and variance. Then the limit is a standard normal distribution.
The normal distribution is stable and infinitely divisible ($X = X_1 + X_2$).
The normal distribution maximizes entropy.
Convolution (adding R.V.'s) increases entropy.
