SLIDE 1 CS70: Jean Walrand: Lecture 32.
Chernoff, Jensen, Polling, Confidence Intervals, Linear Regression
- 1. About M3
- 2. Chernoff
- 3. Jensen
- 4. Polling
- 5. Confidence Intervals
- 6. Linear Regression
SLIDE 2
About M3
Not easy. Definitely! Should I worry? Why? Me worry? Probability takes a while to get used to. The math looks trivial, but what the ... are we computing? Be patient! Patience has its own rewards.
SLIDE 3
Some Mysteries
◮ A probability space is a set Ω with probabilities assigned to
elements ...
◮ It is uniform if all the elements have the same probability ...
◮ Let Ω = {1,2,3,4} be a uniform probability space ...
◮ Say what! Never heard of that before!!!
◮ A random variable is a function X : Ω → ℜ ...
◮ Define two random variables on the uniform probability space Ω = {1,2,3,4} so that ...
◮ Let me try: “If you first get an odd number, then X = 2; if you then get an even number, then Y = −3 ...”
◮ What happened?
◮ Gee!!, these are conceptual questions! Not like the homework!! Nor the M3 review!!
◮ Meaning, “to do the homework, we did not need to understand probability spaces or random variables.” Really??
SLIDE 4
Seriously Folks!
You have time to get these ideas straight. If you knew it all already, you would not learn anything from this course. It is not that complicated. You will get to the bottom of this! A midterm de-briefing will take place next week. Time and place TBA on Piazza.
SLIDE 5
Sample Question
Question: On the uniform probability space Ω := {1,2,3,4}, define RVs X and Y such that E[XY] = E[X]E[Y] even though X and Y are not independent. Recall from the M3 review in lecture: E[XY] = E[X]E[Y] if X and Y are independent (that is, “if,” not “only if”). We have to define X : Ω → ℜ and Y : Ω → ℜ so that .... Let us try: We see that XY = 0, so E[XY] = 0 = E[X]E[Y]. Also, X, Y are not independent. Note that X = 0 and Y = ... does not work, because then X and Y are independent.
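One concrete construction that fits the slide's hints (XY = 0 everywhere) can be checked by brute force; the particular values of X and Y below are an assumption for illustration, not necessarily the lecture's choice:

```python
# Uniform probability space Omega = {1, 2, 3, 4}, each outcome with probability 1/4.
omega = [1, 2, 3, 4]
p = {w: 1 / 4 for w in omega}

# Assumed construction: X lives on {1, 2}, Y lives on {3, 4}, so XY = 0 everywhere.
X = {1: 1, 2: -1, 3: 0, 4: 0}
Y = {1: 0, 2: 0, 3: 1, 4: -1}

E = lambda f: sum(f[w] * p[w] for w in omega)   # expectation of an RV given as a dict
E_XY = sum(X[w] * Y[w] * p[w] for w in omega)

print(E_XY, E(X) * E(Y))   # both are 0, so E[XY] = E[X]E[Y]

# Not independent: Pr[X = 1, Y = 1] = 0, but Pr[X = 1] Pr[Y = 1] = (1/4)(1/4) > 0.
pr_X1 = sum(p[w] for w in omega if X[w] == 1)
pr_Y1 = sum(p[w] for w in omega if Y[w] == 1)
pr_both = sum(p[w] for w in omega if X[w] == 1 and Y[w] == 1)
print(pr_both, pr_X1 * pr_Y1)
```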
SLIDE 6
Herman Chernoff
Herman Chernoff (born July 1, 1923, New York) is an American applied mathematician, statistician and physicist, formerly a professor at MIT and currently working at Harvard University.
SLIDE 7 Chernoff Faces
Figure: This example shows Chernoff faces for lawyers’ ratings of twelve judges.
Chernoff faces, invented by Herman Chernoff, display multivariate data in the shape of a human face. The individual parts, such as eyes, ears, mouth and nose, represent values of the variables by their shape, size, placement and orientation.
SLIDE 8
Chernoff’s Inequality
Chernoff’s inequality is due to Herman Rubin, continuing the tradition started with Markov’s inequality (which is due to Chebyshev).
Theorem (Chernoff’s Inequality):
Pr[X ≥ a] ≤ min_{θ>0} E[e^{θX}] / e^{θa}.
Proof: We use Markov’s inequality with f(x) = e^{θx} for θ > 0. We find
Pr[X ≥ a] ≤ E[f(X)] / f(a) = E[e^{θX}] / e^{θa}.
Since the inequality holds for all θ > 0, this concludes the proof.
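A quick numerical sanity check of the theorem (not from the lecture): for X = B(20, 0.5) and a = 15, minimize the bound over a grid of θ values and compare with the exact tail probability. The grid search is an assumption; the lecture optimizes analytically.

```python
import math

n, p, a = 20, 0.5, 15

# Exact binomial tail Pr[X >= a] for X = B(n, p).
binom_pmf = lambda k: math.comb(n, k) * p**k * (1 - p)**(n - k)
tail = sum(binom_pmf(k) for k in range(a, n + 1))

# Chernoff: Pr[X >= a] <= E[e^{theta X}] / e^{theta a} for every theta > 0.
# For B(n, p), E[e^{theta X}] = (p e^theta + 1 - p)^n.
def chernoff(theta):
    return (p * math.exp(theta) + 1 - p)**n / math.exp(theta * a)

# Minimize over a grid of theta in (0, 5).
best = min(chernoff(t / 1000) for t in range(1, 5000))

print(tail, best)
assert tail <= best   # the bound really is an upper bound
```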
SLIDE 9 Chernoff’s Inequality and B(n,p)
Let X = B(n,p). We want a bound on Pr[X ≥ a]. Since X = X1 + ··· + Xn with Pr[Xm = 1] = p = 1 − Pr[Xm = 0], we have
E[e^{θX}] = E[e^{θ(X1+···+Xn)}] = E[e^{θX1} × ··· × e^{θXn}] = (E[e^{θX1}])^n = [pe^θ + (1−p)]^n,
by independence. Thus,
Pr[X ≥ a] ≤ [pe^θ + (1−p)]^n / e^{θa}.
We minimize the RHS over θ > 0 and we find (after some algebra ...)
Pr[X ≥ a] ≤ Pr[X = a] / Pr[Y = a], where Y = B(n, a/n).
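The "some algebra" can be verified numerically: setting the derivative of the exponent to zero gives the minimizer θ* = ln(a(1−p)/((n−a)p)) (valid when a > np), and plugging it in matches the Pr[X = a]/Pr[Y = a] form from the slide. The specific n, p, a are assumptions for illustration:

```python
import math

n, p, a = 20, 0.5, 15   # need a > np for theta* > 0

def binom_pmf(n, p, k):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# Minimizer of [p e^theta + 1 - p]^n / e^{theta a} over theta > 0.
theta = math.log(a * (1 - p) / ((n - a) * p))
bound = (p * math.exp(theta) + 1 - p)**n / math.exp(theta * a)

# The slide's closed form: Pr[X = a] / Pr[Y = a] with Y = B(n, a/n).
ratio = binom_pmf(n, p, a) / binom_pmf(n, a / n, a)

print(bound, ratio)   # the two agree up to floating-point rounding
```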
SLIDE 10
Chernoff’s Inequality and B(n,p)
Here is a picture: [figure omitted — the Chernoff bound for B(n,p)].
SLIDE 11
Chernoff’s Inequality and P(λ)
Let X = P(λ). We want a bound on Pr[X ≥ a]. We have
E[e^{θX}] = ∑_{n≥0} e^{θn} (λ^n / n!) e^{−λ} = ∑_{n≥0} ((λe^θ)^n / n!) e^{−λ} = exp{λe^θ} exp{−λ} = exp{λ(e^θ − 1)}.
Thus,
Pr[X ≥ a] ≤ E[e^{θX}] / e^{θa} = exp{λ(e^θ − 1) − θa}.
We minimize over θ > 0 and we find (after some algebra)
Pr[X ≥ a] ≤ (λ/a)^a e^{a−λ} = Pr[X = a] / Pr[Y = a], where Y = P(a).
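The Poisson closed form can be checked the same way; λ and a below are assumptions for illustration (with a > λ, the regime where the bound applies):

```python
import math

lam, a = 4.0, 9

pois_pmf = lambda mu, k: mu**k * math.exp(-mu) / math.factorial(k)

# Minimized Chernoff bound for X = P(lambda): (lambda/a)^a * e^{a - lambda}.
bound = (lam / a)**a * math.exp(a - lam)

# The slide's closed form: Pr[X = a] / Pr[Y = a] with Y = P(a).
ratio = pois_pmf(lam, a) / pois_pmf(a, a)

# Exact tail Pr[X >= a] as 1 minus the CDF at a - 1.
tail = 1 - sum(pois_pmf(lam, k) for k in range(a))

print(tail, bound, ratio)
assert tail <= bound
```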
SLIDE 12
Chernoff’s Inequality and P(λ)
Here is a picture: [figure omitted — the Chernoff bound for P(λ)].
SLIDE 13
Chernoff’s Inequality
Chernoff’s inequality is typically used to estimate
Pr[(X1 + ··· + Xn)/n ≥ a],
where X1, ..., Xn are independent and have the same distribution, n ≫ 1, and a > E[X1]. We expect the average (X1 + ··· + Xn)/n to be close to the mean, so that the desired probability is small. Chernoff’s inequality yields useful bounds. It works because
E[exp{θ(X1 + ··· + Xn)/n}] = (E[exp{θX1/n}])^n,
by independence. Thus, Chernoff’s bound is typically used for rare events. Herman Chernoff is now 92, a rare event.
SLIDE 14 Jensen’s Inequality
A function g(x) is convex if it lies above all its tangents. Consider the tangent at the point (E[X], g(E[X])); if a denotes its slope, then
g(X) ≥ g(E[X]) + a(X − E[X]).
Taking expectations, we conclude that
g(·) convex ⇒ E[g(X)] ≥ g(E[X]).
SLIDE 15
Jensen’s Inequality: Examples
◮ E[|X|] ≥ |E[X]|
◮ E[X^4] ≥ E[X]^4
◮ E[e^{θX}] ≥ e^{θE[X]}
◮ E[ln(X)] ≤ ln(E[X]) (ln is concave, so the inequality flips)
◮ E[max{X^2, 1+X}] ≥ max{E[X]^2, 1+E[X]}
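These examples are easy to verify on a small distribution; the values and probabilities below are arbitrary assumptions (chosen positive so that ln is defined):

```python
import math

# A small distribution: values of X and their probabilities.
xs = [1, 2, 3, 4]
ps = [0.1, 0.2, 0.3, 0.4]

E = lambda g: sum(g(x) * q for x, q in zip(xs, ps))   # E[g(X)]
mean = E(lambda x: x)                                  # E[X]

# Convex g: E[g(X)] >= g(E[X]).
for g in (abs, lambda x: x**4, math.exp):
    assert E(g) >= g(mean)

# Concave h (here ln): the inequality flips.
assert E(math.log) <= math.log(mean)
print("Jensen holds on this example")
```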
SLIDE 16
Polling: Problem
Here is a central question about polling. Setup: Assume people vote democrat with probability p. We poll n people. Let An be the fraction of those who vote democrat. Question: How large should n be so that Pr[|An −p| ≥ 0.1] ≤ 0.05?
SLIDE 17 Polling: Analysis
Recall the problem: Find n so that Pr[|An − p| ≥ 0.1] ≤ 0.05.
Approach: Chebyshev!
Recall Chebyshev’s inequality: Pr[|X − E[X]| ≥ a] ≤ var[X] / a^2.
Here, X = An = Y/n where Y is the number of people out of n who vote democrat. Thus, Y = B(n,p). Hence, E[Y] = np and var[Y] = np(1−p). Consequently, E[X] = p and var[X] = p(1−p)/n. This gives
Pr[|An − p| ≥ 0.1] ≤ p(1−p) / (n(0.1)^2) = 100p(1−p) / n.
However, we do not know p. What should we do? We know that p(1−p) ≤ 1/4. Hence,
Pr[|An − p| ≥ 0.1] ≤ 25/n.
Thus, if n = 500, we find Pr[|An − p| ≥ 0.1] ≤ 0.05.
SLIDE 18
Estimation
Common problem: Estimate a mean value. Examples: height, weight, lifetime, arrival rate, job duration, ....
Setup: X1, X2, ... are independent RVs with E[Xn] = µ and var[Xn] = σ^2. We observe {X1, ..., Xn} and want to estimate µ.
Approach: Let An = (X1 + ··· + Xn)/n be the average (sample mean). Then, E[An] = µ and var[An] = σ^2/n. Using Chebyshev:
Pr[|An − µ| ≥ a] ≤ σ^2 / (na^2).
Thus, Pr[|An − µ| ≥ a] → 0 as n → ∞. This is the WLLN, as we know.
SLIDE 19
Confidence Interval
How much confidence do we have in our estimate? Recall the setup: Xm independent with mean µ and variance σ^2. Chebyshev told us
Pr[|An − µ| ≥ a] ≤ σ^2 / (na^2).
This probability is at most δ if σ^2 ≤ na^2δ, i.e., if a ≥ σ/√(nδ). Thus,
Pr[|An − µ| ≥ σ/√(nδ)] ≤ δ.
Equivalently,
Pr[µ ∈ [An − σ/√(nδ), An + σ/√(nδ)]] ≥ 1 − δ.
We say that [An − σ/√(nδ), An + σ/√(nδ)] is a (1−δ)-confidence interval for µ.
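As a sketch, the interval above is a one-line computation; the sample mean, σ, n, and δ below are assumed values for illustration:

```python
import math

def conf_interval(A_n, sigma, n, delta):
    """Chebyshev-based (1 - delta)-confidence interval for mu, sigma known."""
    half = sigma / math.sqrt(n * delta)
    return (A_n - half, A_n + half)

# Assumed numbers: sample mean 10.2, sigma = 2, n = 400 samples, delta = 0.05.
lo, hi = conf_interval(A_n=10.2, sigma=2.0, n=400, delta=0.05)
print(lo, hi)
```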
SLIDE 20
Confidence Interval, continued
We just found out that [An − σ/√(nδ), An + σ/√(nδ)] is a (1−δ)-confidence interval for µ. For δ = 0.05, since 1/√0.05 ≈ 4.5, this shows that [An − 4.5σ/√n, An + 4.5σ/√n] is a 95%-confidence interval for µ. A more refined analysis, using the Central Limit Theorem, allows us to replace 4.5 by 2.
SLIDE 21
CI with Unknown Variance
If σ is not known, we replace it by the estimate sn:
sn^2 = (1/n) ∑_{m=1}^{n} (Xm − An)^2.
Thus, we expect that, for n large enough (e.g., larger than 20),
[An − 2sn/√n, An + 2sn/√n]
is a 95%-confidence interval for µ. Does this work well? The theory says we have to be careful: the error in estimating σ may throw us off. What is known is that if the Xm have a nice distribution (e.g., Gaussian), and if n is not too small (say ≥ 15), then this is fine.
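The whole recipe fits in a few lines; the Gaussian samples, true mean, and σ below are assumptions for the simulation:

```python
import math
import random

random.seed(1)
mu, sigma, n = 5.0, 2.0, 100                 # assumed true mean, std dev, sample size
xs = [random.gauss(mu, sigma) for _ in range(n)]

A_n = sum(xs) / n                             # sample mean
s_n = math.sqrt(sum((x - A_n)**2 for x in xs) / n)   # the slide's estimate of sigma

# 95%-confidence interval using the estimated s_n in place of sigma.
lo, hi = A_n - 2 * s_n / math.sqrt(n), A_n + 2 * s_n / math.sqrt(n)
print((lo, hi), "contains mu:", lo <= mu <= hi)
```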
SLIDE 22
CI for Pr[H]
CIs using the upper bound σ ≤ 0.5 or using the estimated sn.
SLIDE 23
Linear Regression
An example: Random experiment: select people at random and plot their (age, height). You get (Xn,Yn) for n = 1,...,N where Xn = age and Yn = height for person n. The linear regression is a guess a+bXn for Yn that is close to the true values, in some sense to be made precise.
SLIDE 24
Linear Regression
Another example:
SLIDE 25 LR: Two Viewpoints
Linear regression: a+bXn is a guess for Yn. There are two ways to look at the linear regression: Bayesian and non-Bayesian.
◮ Bayesian Viewpoint:
◮ We have a prior: Pr[X = x, Y = y], x = ..., y = ...;
◮ We choose (a,b) to minimize E[(Y − a − bX)^2].
◮ Non-Bayesian Viewpoint:
◮ We have no prior, but samples: {(Xn, Yn), n = 1, ..., N};
◮ We choose (a,b) to minimize ∑_{n=1}^{N} (Yn − a − bXn)^2.
◮ We hope Yk ≈ a+bXk for future samples.
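The non-Bayesian minimization has a closed form: b = cov(X,Y)/var(X) and a = mean(Y) − b·mean(X). A minimal sketch, with made-up (age, height) data for illustration:

```python
def linear_regression(X, Y):
    """Least squares: minimize sum over n of (Y_n - a - b X_n)^2."""
    N = len(X)
    mx, my = sum(X) / N, sum(Y) / N
    cov = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / N
    var = sum((x - mx)**2 for x in X) / N
    b = cov / var
    return my - b * mx, b

# Toy (age, height) data, invented for this example.
X = [4, 8, 12, 16, 20]
Y = [100, 125, 150, 165, 172]
a, b = linear_regression(X, Y)
print(a, b)   # the guess for Y is a + b*X
```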
SLIDE 26
Summary
Chernoff, Jensen, Polling, Confidence Intervals, Linear Regression
◮ Chernoff: Pr[X ≥ a] ≤ min_{θ>0} E[e^{θ(X−a)}].
◮ Jensen: E[c(X)] ≥ c(E[X]) for c(·) convex.
◮ Polling: How many people to poll?
◮ Confidence Interval: sample mean ± 2σ/√n.
◮ Linear Regression: Y ≈ a + bX. B or not B?