Convergence of Random Processes
DS-GA 1002 Probability and Statistics for Data Science
http://www.cims.nyu.edu/~cfgranda/pages/DSGA1002_fall17
Carlos Fernandez-Granda, Brett Bernstein
Review Question

Let X(1), . . . , X(n) be iid random variables, each having mean µ and variance σ².

1. What is E[X(1) + · · · + X(n)]?
   Solution. nµ

2. What is Std[X(1) + · · · + X(n)] := √(Var[X(1) + · · · + X(n)])?
   Solution. σ√n
An Experiment In Coin Flipping
We repeatedly flip a fair coin and count how many heads we get. More formally, let S(i) = X(1) + · · · + X(i), where X(1), X(2), . . . are iid Bernoulli random variables with p = 1/2.
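This experiment is easy to reproduce. Here is a minimal numpy sketch (the seed and sample sizes are arbitrary choices, not from the slides) that generates 5000 flip sequences and computes the three views shown in the figures that follow: sums, averages, and standardized fluctuations.

```python
import numpy as np

rng = np.random.default_rng(0)

n_sequences, n_flips = 5000, 100
# Each row is one sequence of fair-coin flips (1 = heads), so p = 1/2.
flips = rng.integers(0, 2, size=(n_sequences, n_flips))
# S[:, i-1] holds S(i) = X(1) + ... + X(i) for every sequence.
S = np.cumsum(flips, axis=1)

i = np.arange(1, n_flips + 1)
averages = S / i                            # concentrates around mu = 1/2
fluctuations = (S - 0.5 * i) / np.sqrt(i)   # centered, rescaled: spread stays ~ sigma

# At i = 100 the spread of the averages is sigma / sqrt(i) = 0.05, while the
# standardized fluctuations keep the constant spread sigma = 1/2.
print(averages[:, -1].std(), fluctuations[:, -1].std())
```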
An Experiment In Coin Flipping

[Figures: sample paths of S(i) for i = 10, 20, 30, 50, 100]
An Experiment In Coin Flipping: 5000 flip sequences

[Figures: values of S(i) across 5000 sequences at i = 50 and i = 100; brighter color means more occurrences; bands mark µi ± kσ√i for k = 1, 2, 3, 4]
What if we take averages instead of sums?
An Experiment In Coin Flipping: 5000 flip sequences

Averages (i.e., divide by i)

[Figures: values of S(i)/i at i = 100 and i = 200; bands mark µ ± kσ/√i for k = 1, 2, 3, 4]
An Experiment In Coin Flipping: 5000 flip sequences

Averages (i.e., divide by i)

[Figures: values of S(i)/i at i = 1000 and i = 2000, concentrating around µ]
An Experiment In Coin Flipping: 5000 flip sequences

[Figures: values of S(i) across 5000 sequences at i = 50 and i = 100, with bands at µi ± kσ√i]
How do we isolate the fluctuations about the mean?
An Experiment In Coin Flipping: 5000 flip sequences

Subtract µi

[Figures: values of S(i) − µi at i = 100 and i = 200; bands mark ±kσ√i for k = 1, 2, 3, 4]
How do we normalize the scale?
An Experiment In Coin Flipping: 5000 flip sequences

Subtract µi and then divide by √i

[Figures: values of (S(i) − µi)/√i at i = 100, 200, 250, and 500; the bands at ±kσ no longer depend on i]
Aim
How do we rigorously describe the experiments we just conducted?
1. Define convergence for random processes
2. Illustrate some of the subtleties associated with different forms of convergence
3. Describe two convergence phenomena: the law of large numbers and the central limit theorem
Outline
- Types of convergence
- Law of Large Numbers
- Central Limit Theorem
- Monte Carlo simulation
Convergence of deterministic sequences
A deterministic sequence of real numbers x1, x2, . . . converges to x ∈ R, written

lim_{i→∞} xi = x,

if xi is arbitrarily close to x as i grows: for any ε > 0 there is an i0 such that for all i > i0, |xi − x| < ε.

Problem: Random sequences do not have fixed values
Convergence with probability one
Consider a discrete random process X̃ and a random variable X defined on the same probability space.

If we fix the outcome ω, X̃(ω, i) is a deterministic sequence and X(ω) is a constant, so we can determine whether

lim_{i→∞} X̃(ω, i) = X(ω)

for that particular ω
Convergence with probability one
[Diagram: sample space Ω with outcomes ω1, ω2; each outcome maps to a sequence X̃(ω, 0), X̃(ω, 1), . . . , X̃(ω, 5)]
Convergence with probability one
X̃ converges with probability one to X if

P({ω ∈ Ω | lim_{i→∞} X̃(ω, i) = X(ω)}) = 1

Deterministic convergence occurs with probability one. Also called almost sure convergence.
Puddle
Initial amount of water is uniform between 0 and 1 gallon. After a time interval i there is i times less water:

D̃(ω, i) := ω/i,  i = 1, 2, . . .
Puddle
[Figure: D̃(ω, i) = ω/i versus i = 1, . . . , 10 for ω = 0.31, 0.52, 0.89]
Puddle
If we fix ω ∈ (0, 1),

lim_{i→∞} D̃(ω, i) = lim_{i→∞} ω/i = 0,

so D̃ converges to zero with probability one
Puddle
[Figure: D̃(ω, i) for many realizations, i = 1, . . . , 50; every path decays to zero]
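A three-line check of this (a sketch; the seed is an arbitrary choice): fixing ω freezes the entire sequence, and every frozen sequence decays to zero.

```python
import numpy as np

rng = np.random.default_rng(1)

# Fix three outcomes omega ~ Uniform(0, 1); each one fixes a deterministic sequence.
i = np.array([1, 10, 100, 1000])
for omega in rng.uniform(0, 1, size=3):
    print(f"omega = {omega:.2f}:", omega / i)  # D(omega, i) = omega / i -> 0
```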
Alternative idea
Idea: Instead of fixing ω and checking deterministic convergence:

1. Measure how close X̃(i) and X are for a fixed i using a deterministic quantity
2. Check whether the quantity tends to zero
Convergence in mean square
The mean square of X − Y, E[(X − Y)²], is a measure of how close X and Y are. If E[(X − Y)²] = 0 then X = Y with probability one.

Proof: By Markov's inequality, for any ε > 0,

P((Y − X)² > ε) ≤ E[(X − Y)²]/ε = 0
Convergence in mean square
X̃ converges to X in mean square if

lim_{i→∞} E[(X − X̃(i))²] = 0
Convergence in probability
Alternative measure: the probability that |Y − X| > ε for small ε.

X̃ converges to X in probability if for any ε > 0

lim_{i→∞} P(|X − X̃(i)| > ε) = 0
Conv. in mean square implies conv. in probability:

lim_{i→∞} P(|X − X̃(i)| > ε) = lim_{i→∞} P((X − X̃(i))² > ε²)
                             ≤ lim_{i→∞} E[(X − X̃(i))²]/ε²   (by Markov's inequality)
                             = 0

Convergence with probability one also implies convergence in probability
Convergence in distribution
The distribution of X̃(i) converges to the distribution of X.

X̃ converges in distribution to X if

lim_{i→∞} F_{X̃(i)}(x) = F_X(x)

for all x at which F_X is continuous
Convergence in distribution
Convergence in distribution does not imply that X̃(i) and X are close as i → ∞! Convergence in probability does imply convergence in distribution.
Binomial tends to Poisson (Review)
- X̃(i) is binomial with parameters i and p := λ/i (for i > λ)
- X is a Poisson random variable with parameter λ
- X̃(i) converges to X in distribution:

lim_{i→∞} p_{X̃(i)}(x) = lim_{i→∞} (i choose x) p^x (1 − p)^{i−x} = λ^x e^{−λ}/x! = p_X(x)
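A quick numerical check of this limit, sketched with scipy.stats (the grid of i values is an arbitrary choice):

```python
import numpy as np
from scipy import stats

lam = 20
ks = np.arange(41)
for i in [40, 80, 400, 10_000]:
    gap = np.abs(stats.binom.pmf(ks, i, lam / i) - stats.poisson.pmf(ks, lam)).max()
    print(i, gap)  # the largest pointwise pmf discrepancy shrinks as i grows
```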
Probability mass functions

[Figures: pmf of X̃(40) (binomial, n = 40, p = 20/40), X̃(80) (n = 80, p = 20/80), and X̃(400) (n = 400, p = 20/400), together with the pmf of X (Poisson, λ = 20); the binomial pmfs approach the Poisson pmf]
Outline
- Types of convergence
- Law of Large Numbers
- Central Limit Theorem
- Monte Carlo simulation
Moving average
The moving average Ã of a discrete random process X̃ is

Ã(i) := (1/i) Σ_{j=1}^{i} X̃(j)
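In code, the moving average is a single cumulative sum; a minimal numpy sketch (the function name is our own):

```python
import numpy as np

def moving_average(x):
    """Return A(i) = (X(1) + ... + X(i)) / i for i = 1, ..., len(x)."""
    x = np.asarray(x, dtype=float)
    return np.cumsum(x) / np.arange(1, len(x) + 1)

rng = np.random.default_rng(2)
a = moving_average(rng.normal(size=5000))
print(a[[9, 99, 4999]])  # drifts toward the true mean 0 as i grows
```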
Weak law of large numbers
Let X̃ be an iid discrete random process with mean µ_X̃ := µ and finite variance σ². The moving average Ã of X̃ converges in mean square to µ
Proof

E[Ã(i)] = E[(1/i) Σ_{j=1}^{i} X̃(j)] = (1/i) Σ_{j=1}^{i} E[X̃(j)] = µ

Var[Ã(i)] = Var[(1/i) Σ_{j=1}^{i} X̃(j)] = (1/i²) Σ_{j=1}^{i} Var[X̃(j)] = σ²/i   (using independence)

lim_{i→∞} E[(Ã(i) − µ)²] = lim_{i→∞} E[(Ã(i) − E[Ã(i)])²] = lim_{i→∞} Var[Ã(i)] = lim_{i→∞} σ²/i = 0
Strong law of large numbers
Let X̃ be an iid discrete random process with mean µ_X̃ := µ. The moving average Ã of X̃ converges with probability one to µ
Our Bernoulli Experiment: Look at averages
[Figures: averages of the coin-flip sequences at i = 1000 and i = 2000, concentrating around µ]
iid standard Gaussian

[Figures: moving average and mean of the iid sequence for i up to 50, 500, and 5000; the moving average settles at the mean]
iid geometric with p = 0.4

[Figures: moving average and mean of the iid sequence for i up to 50, 500, and 5000; the moving average settles at the mean]
iid Cauchy

[Figures: moving average and median of the iid sequence for i up to 50, 500, and 5000; the moving average does not settle (the Cauchy distribution has no mean, so the SLLN does not apply)]
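A sketch contrasting the three cases plotted above (the seed and n are arbitrary; note that numpy's geometric variable counts the trials up to and including the first success, so its mean is 1/p):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
i = np.arange(1, n + 1)

gaussian_avg = np.cumsum(rng.normal(size=n)) / i             # settles at 0
geometric_avg = np.cumsum(rng.geometric(p=0.4, size=n)) / i  # settles at 1/p = 2.5
cauchy_avg = np.cumsum(rng.standard_cauchy(size=n)) / i      # no mean: never settles

print(gaussian_avg[-1], geometric_avg[-1], cauchy_avg[-1])
```

The Cauchy case is the cautionary tale: the SLLN requires a finite mean, and the Cauchy distribution has none, so its moving average keeps jumping no matter how much data we collect.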
Strong law of large numbers

Why do we care about the convergence of averages? Averaging is one of the most fundamental tools a statistician or data scientist has access to. The SLLN says that as we acquire more data, the average will always converge to the true mean. This justifies the convergence of pointwise estimators (coming soon)
Question to think about during break

1. Suppose X̃(1), X̃(2), . . . are iid with E[X̃(i)] = µ and E[X̃(i)²] = η. Construct two sequences of random variables from the X̃(i) that converge to η and µ², respectively, with probability one.

Solution.

(1/n) Σ_{i=1}^{n} X̃(i)² → η

and

((1/n) Σ_{i=1}^{n} X̃(i))² → µ²
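A numerical sanity check of this solution, using a Gaussian with µ = 1 and σ = 2 (so η = µ² + σ² = 5) as an arbitrary test case:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(loc=1.0, scale=2.0, size=100_000)  # mu = 1, eta = 5

print(np.mean(x**2))  # SLLN applied to the iid sequence X(i)^2: tends to eta = 5
print(np.mean(x)**2)  # square of the convergent average: tends to mu^2 = 1
```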
Outline
- Types of convergence
- Law of Large Numbers
- Central Limit Theorem
- Monte Carlo simulation
Central Limit Theorem
Let X̃ be an iid discrete random process with mean µ_X̃ := µ and finite variance σ². Then

√i (Ã(i) − µ)

converges in distribution to a Gaussian random variable with mean 0 and variance σ².

Consequently, the average Ã(i) is approximately Gaussian with mean µ and variance σ²/i
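A minimal simulation of the statement (iid exponential variables with µ = σ = 1 are an arbitrary test case): the empirical cdf of √i (Ã(i) − µ) should match the cdf of a centered Gaussian with variance σ².

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
i, n_seq = 1000, 5000
x = rng.exponential(scale=1.0, size=(n_seq, i))  # mu = 1, sigma = 1

z = np.sqrt(i) * (x.mean(axis=1) - 1.0)  # sqrt(i) (A(i) - mu), one value per sequence
for t in [-1.0, 0.0, 1.0]:
    # Empirical cdf of z vs. the standard Gaussian cdf (sigma = 1 here).
    print(t, (z <= t).mean(), stats.norm.cdf(t))
```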
Height data
- Example: Data from a population of 25,000 people
- We compare the histogram of the heights and the pdf of a Gaussian random variable fitted to the data

[Figure: histogram of heights in inches (roughly 60 to 76) with the fitted Gaussian pdf overlaid]
Sketch of proof
The pdf of a sum of two independent random variables is the convolution of their pdfs:

f_{X+Y}(z) = ∫_{−∞}^{∞} f_X(z − y) f_Y(y) dy

Repeated convolution of any pdf with finite variance results in a Gaussian!
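The claim is easy to see numerically: convolving a decidedly non-Gaussian pdf with itself a few times already produces a bell shape. A sketch on a grid (the grid spacing and the uniform starting pdf are arbitrary choices):

```python
import numpy as np

dx = 0.01
f = np.ones(100)  # uniform pdf on [0, 1], sampled with spacing dx

g = f.copy()
for _ in range(4):
    g = np.convolve(g, f) * dx  # pdf of the sum of one more independent copy

# g approximates the pdf of a sum of 5 uniforms: already visibly bell-shaped,
# with its mode near the mean 5 * 0.5 = 2.5.
print(g.argmax() * dx, g.max())
```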
Repeated convolutions

[Figures: a pdf convolved with itself repeatedly, i = 1, . . . , 5; the result rapidly becomes bell-shaped]
iid exponential, λ = 2

[Figures: empirical distribution of the average of i iid exponential(λ = 2) samples for i = 10², 10³, 10⁴; the distribution concentrates around the mean 0.5]
iid geometric, p = 0.4

[Figures: empirical distribution of the average of i iid geometric(p = 0.4) samples for i = 10², 10³, 10⁴; the distribution concentrates around the mean 2.5]
iid Cauchy

[Figures: empirical distribution of the average of i iid Cauchy samples for i = 10², 10³, 10⁴; the distribution does not concentrate (the Cauchy distribution has infinite variance, so the CLT does not apply)]
Gaussian approximation to the binomial
X is binomial with parameters n and p. Computing the probability that X lies in a certain interval requires summing its pmf over the interval. The central limit theorem provides a quick approximation:

X = Σ_{i=1}^{n} B_i,  E[B_i] = p,  Var[B_i] = p(1 − p)

(1/n) X is approximately Gaussian with mean p and variance p(1 − p)/n

X is approximately Gaussian with mean np and variance np(1 − p)
Gaussian approximation to the binomial

A basketball player makes each shot with probability p = 0.4 (shots are iid). What is the probability that she makes at least 420 shots out of n = 1000?

Exact answer:

P(X ≥ 420) = Σ_{x=420}^{1000} p_X(x) = Σ_{x=420}^{1000} (1000 choose x) 0.4^x 0.6^{1000−x} = 10.4 · 10⁻²

Approximation (U is standard Gaussian):

P(X ≥ 420) ≈ P(√(np(1 − p)) U + np ≥ 420) = P(U ≥ 1.29) = 1 − Φ(1.29) = 9.85 · 10⁻²
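Both numbers take one line each in scipy.stats (a sketch of the computation above, not part of the original deck):

```python
from scipy import stats

n, p = 1000, 0.4
exact = stats.binom.sf(419, n, p)  # P(X >= 420) = P(X > 419)
mu, sigma = n * p, (n * p * (1 - p)) ** 0.5
approx = stats.norm.sf(420, loc=mu, scale=sigma)

print(exact, approx)  # about 0.104 and 0.0985, matching the slide
```

A continuity correction, evaluating the Gaussian tail at 419.5 instead of 420, brings the approximation to about 0.104, essentially matching the exact value.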
CLT: Things to think about
1. The CLT allows us to model many phenomena using Gaussian distributions
2. It gives the general intuition that an average of random variables concentrates tightly around the mean, since the Gaussian distribution has very thin tails (i.e., its pdf decays very quickly)
CLT vs Chebyshev: 5000 Flip Sequences
[Figure: values of S(i) across 5000 sequences at i = 50 and i = 100, with bands at µi ± kσ√i]

Chebyshev says Pr(|X − µ| > 3σ) ≤ 1/9

The CLT approximation says Pr(|X − µ| > 3σ) ≈ 3/1000
Outline
- Types of convergence
- Law of Large Numbers
- Central Limit Theorem
- Monte Carlo simulation
Monte Carlo simulation
Simulation is a powerful tool in probability and statistics. Many models are too complex to derive closed-form solutions (life is not a homework problem!). Example: the game of solitaire
Game of solitaire
Aim: Compute the probability that you win at solitaire.

If every permutation of the cards has the same probability,

P(Win) = (number of permutations that lead to a win) / (total number of permutations)

Problem: Characterizing the permutations that lead to a win is very difficult without playing out the game, and we can't just check them all because there are 52! ≈ 8 · 10⁶⁷ permutations!

Solution: Sample many permutations and compute the fraction of wins
In the words of Stanislaw Ulam
The first thoughts and attempts I made to practice (the Monte Carlo Method) were suggested by a question which occurred to me in 1946 as I was convalescing from an illness and playing solitaires. The question was what are the chances that a Canfield solitaire laid out with 52 cards will come out successfully? After spending a lot of time trying to estimate them by pure combinatorial calculations, I wondered whether a more practical method than "abstract thinking" might not be to lay it out say one hundred times and simply observe and count the number of successful plays. This was already possible to envisage with the beginning of the new era of fast computers.
Monte Carlo approximation
Main principle: Use simulation to approximate quantities that are challenging to compute exactly.

To approximate the probability of an event E:

1. Generate n independent samples from 1_E: I_1, I_2, . . . , I_n
2. Compute the average of the n samples,

Ã(n) := (1/n) Σ_{i=1}^{n} I_i

By the law of large numbers, Ã(n) converges to P(E) as n → ∞, since E[1_E] = P(E)
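Here is a minimal sketch of this recipe, using an event with a known probability so that the estimate can be checked (the event and the seed are arbitrary choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

# Event E: a standard Gaussian sample exceeds 1, so P(E) = 1 - Phi(1) is known.
n = 100_000
indicators = rng.normal(size=n) > 1.0  # I_1, ..., I_n: iid samples of 1_E
print(indicators.mean(), stats.norm.sf(1.0))  # A(n) vs. P(E) = 0.1587...
```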
Basketball league
Basketball league with m teams. In a season every pair of teams plays once. Teams are ordered: team 1 is best, team m is worst.

Model: For 1 ≤ i < j ≤ m,

P(team j beats team i) := 1/(j − i + 1)

Games are independent
Basketball league
Aim: Compute the probability of each team's rank at the end of the season. The rank of team i is modeled as a random variable R_i, where R_i = j means team i finished in jth place. What is the pmf of R_1, R_2, . . . , R_m?
m = 3
Game outcomes (winner of each game), ranks, and probability:

1-2  1-3  2-3 | R1  R2  R3 | Probability
 1    1    2  |  1   2   3 | 1/6
 1    1    3  |  1   3   2 | 1/6
 1    3    2  |  1   1   1 | 1/12
 1    3    3  |  2   3   1 | 1/12
 2    1    2  |  2   1   3 | 1/6
 2    1    3  |  1   1   1 | 1/6
 2    3    2  |  3   1   2 | 1/12
 2    3    3  |  3   2   1 | 1/12
m = 3
Probability mass function:

Rank | R1    R2   R3
  1  | 7/12  1/2  5/12
  2  | 1/4   1/4  1/4
  3  | 1/6   1/4  1/3
Basketball league: How do we compute the PMF table?
Problem: The number of possible season outcomes is 2^{m(m−1)/2}! For m = 10 this is larger than 10¹³.

Solution: Apply Monte Carlo approximation
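A sketch of the Monte Carlo approximation for this league model (the function name and seed are our own; the tie-handling convention, tied teams share the best available rank, is read off the exact table above):

```python
import numpy as np

def simulate_ranks(m, n, seed=0):
    """Monte Carlo estimate of P(R_i = j): rows are teams, columns are ranks."""
    rng = np.random.default_rng(seed)
    counts = np.zeros((m, m))
    for _ in range(n):
        wins = np.zeros(m, dtype=int)
        for i in range(m):
            for j in range(i + 1, m):
                # P(team j beats team i) = 1 / (j - i + 1); j - i is the same
                # for 0-based labels, so the formula carries over directly.
                if rng.random() < 1.0 / (j - i + 1):
                    wins[j] += 1
                else:
                    wins[i] += 1
        for team in range(m):
            # Rank = 1 + number of teams with strictly more wins (ties share a rank).
            rank = 1 + int((wins > wins[team]).sum())
            counts[team, rank - 1] += 1
    return counts / n

print(simulate_ranks(3, 2000))  # compare with the exact and estimated pmf tables
```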
m = 3
Sampled game outcomes (winner of each game) and resulting ranks, n = 10:

1-2  1-3  2-3 | R1  R2  R3
 1    3    2  |  1   1   1
 1    1    3  |  1   3   2
 2    1    2  |  2   1   3
 2    3    2  |  3   1   2
 2    1    3  |  1   1   1
 1    1    2  |  1   2   3
 2    1    3  |  1   1   1
 2    3    2  |  3   1   2
 1    1    2  |  1   2   3
 2    3    2  |  3   1   2
m = 3
Estimated pmf (n = 10), exact values in parentheses:

Rank | R1           R2          R3
  1  | 0.6 (0.583)  0.7 (0.5)   0.3 (0.417)
  2  | 0.1 (0.25)   0.2 (0.25)  0.4 (0.25)
  3  | 0.3 (0.167)  0.1 (0.25)  0.3 (0.333)
m = 3
Estimated pmf (n = 2000), exact values in parentheses:

Rank | R1             R2            R3
  1  | 0.582 (0.583)  0.496 (0.5)   0.417 (0.417)
  2  | 0.248 (0.25)   0.261 (0.25)  0.244 (0.25)
  3  | 0.171 (0.167)  0.245 (0.25)  0.339 (0.333)
Running times
[Figure: running time in seconds (log scale, 10⁻³ to 10³) of the exact computation vs. the Monte Carlo approximation as the number of teams m grows from 2 to 20; the exact computation blows up while Monte Carlo stays fast]
Error
Average error of the Monte Carlo estimates:

m | Average error
3 | 9.28 · 10⁻³
4 | 12.7 · 10⁻³
5 | 7.95 · 10⁻³
6 | 7.12 · 10⁻³
7 | 7.12 · 10⁻³
PMF as Heat Map

[Figures: pmf of the ranks shown as a heat map for m = 5, m = 20, and m = 100]
Monte Carlo Question to Think About

1. Suppose we want to use Monte Carlo to approximate the probability p of an event E. How many steps should we use? More precisely, suppose we want an n-step Monte Carlo approximation p̂ such that Pr(|p − p̂| > ε) is small. How can we bound this probability?

Solution. Note that 1_E is Bernoulli, so Var(1_E) = p(1 − p) ≤ 1/4. Thus, by Chebyshev's inequality,

Pr(|p − p̂| > ε) ≤ Var(p̂)/ε² = Var(1_E)/(nε²) ≤ 1/(4nε²)
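Solving the bound for n turns it into a sample-size rule; a one-function sketch (the names are our own):

```python
import math

def steps_needed(eps, delta):
    """Smallest n with 1 / (4 n eps^2) <= delta, i.e. Pr(|p - p_hat| > eps) <= delta."""
    return math.ceil(1.0 / (4 * delta * eps**2))

print(steps_needed(0.01, 0.05))  # 50,000 steps give +/-0.01 accuracy w.p. at least 0.95
```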