CS70: Jean Walrand: Lecture 32. Chernoff, Jensen, Polling, Confidence Intervals, Linear Regression (PowerPoint PPT Presentation)



SLIDE 1

CS70: Jean Walrand: Lecture 32.

Chernoff, Jensen, Polling, Confidence Intervals, Linear Regression

  • 1. About M3
  • 2. Chernoff
  • 3. Jensen
  • 4. Polling
  • 5. Confidence Intervals
  • 6. Linear Regression
SLIDE 2

About M3

Not easy. Definitely! Should I worry? Why? Me worry? Probability takes a while to get used to. The math looks trivial, but what the ... are we computing? Be patient! Patience has its own rewards.

SLIDE 3

Some Mysteries

◮ A probability space is a set Ω with probabilities assigned to its elements ...
◮ It is uniform if all the elements have the same probability ...
◮ Let Ω = {1,2,3,4} be a uniform probability space ...
◮ Say what! Never heard of that before!!!
◮ A random variable is a function X : Ω → ℜ ...
◮ Define two random variables on the uniform probability space Ω = {1,2,3,4} so that ...
◮ Let me try: "If you first get an odd number, then X = 2; if you then get an even number, then Y = −3 ...."
◮ What happened?
◮ Gee!!, these are conceptual questions! Not like the homework!! Nor the M3 review!!
◮ Meaning, "to do the homework, we did not need to understand probability spaces nor random variables." Really??

SLIDE 4

Seriously Folks!

You have time to get these ideas straight. If you knew it all already, you would not learn anything from this course. It is not that complicated. You will get to the bottom of this! A midterm de-briefing will take place next week. Time and place TBA on Piazza.

SLIDE 5

Sample Question

Question: On uniform probability space Ω := {1,2,3,4}, define the RVs X and Y such that E[XY] = E[X]E[Y] even though X and Y are not independent. Recall M3 review in lecture: E[XY] = E[X]E[Y] if X and Y are independent, not only if. We have to define X : Ω → ℜ and Y : Ω → ℜ so that .... Let us try: We see that XY = 0, so E[XY] = 0 = E[X]E[Y]. Also, X,Y not independent. Note that X = 0 and Y = ... does not work because then X and Y are independent.
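The slide leaves the actual definition of X and Y as an exercise. Here is one possible construction, my own illustration rather than the one given in lecture: make X nonzero only on {1,2} and Y nonzero only on {3,4}, each with mean 0, so that XY ≡ 0. A quick exact check with fractions:

```python
from fractions import Fraction

# Uniform probability space Omega = {1, 2, 3, 4}.
omega = [1, 2, 3, 4]
p = Fraction(1, 4)  # each outcome has probability 1/4

# Hypothetical construction (not necessarily the one from lecture):
# X lives on {1,2}, Y lives on {3,4}, each with mean 0.
X = {1: 1, 2: -1, 3: 0, 4: 0}
Y = {1: 0, 2: 0, 3: 1, 4: -1}

E_X = sum(p * X[w] for w in omega)
E_Y = sum(p * Y[w] for w in omega)
E_XY = sum(p * X[w] * Y[w] for w in omega)  # XY = 0 on every outcome

print(E_XY == E_X * E_Y)  # True: both sides equal 0

# Not independent: Pr[X=1 and Y=1] = 0, but Pr[X=1]*Pr[Y=1] = 1/16.
pr_joint = sum(p for w in omega if X[w] == 1 and Y[w] == 1)
pr_prod = sum(p for w in omega if X[w] == 1) * sum(p for w in omega if Y[w] == 1)
print(pr_joint, pr_prod)  # 0 and 1/16
```

Note that X = 0 identically would not work, exactly as the slide warns, because a constant random variable is independent of everything.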

SLIDE 6

Herman Chernoff

Herman Chernoff (born July 1, 1923, New York) is an American applied mathematician, statistician and physicist, formerly a professor at MIT and currently working at Harvard University.

SLIDE 7

Chernoff Faces

Figure: This example shows Chernoff faces for lawyers' ratings of twelve judges.

Chernoff faces, invented by Herman Chernoff, display multivariate data in the shape of a human face. The individual parts, such as the eyes, ears, mouth, and nose, represent values of the variables by their shape, size, placement, and orientation.
SLIDE 8

Chernoff’s Inequality

Chernoff's inequality is due to Herman Rubin, continuing the tradition started with Markov's inequality (which is due to Chebyshev).

Theorem (Chernoff's Inequality). Pr[X ≥ a] ≤ min_{θ>0} E[e^{θX}] / e^{θa}.

Proof: We use Markov's inequality with f(x) = e^{θx} for θ > 0. We find Pr[X ≥ a] ≤ E[f(X)] / f(a) = E[e^{θX}] / e^{θa}. Since the inequality holds for all θ > 0, this concludes the proof.
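The proof step, Markov applied to f(x) = e^{θx}, can be spot-checked numerically. The small three-point distribution below is a made-up example of mine, not from the slides; the point is that every θ > 0 gives a valid upper bound on the tail:

```python
import math

# Check: for any theta > 0, Pr[X >= a] <= E[e^{theta X}] / e^{theta a}.
# Arbitrary small distribution for the check (value -> probability).
dist = {0: 0.5, 1: 0.3, 3: 0.2}
a = 2

tail = sum(p for x, p in dist.items() if x >= a)  # Pr[X >= a] = 0.2
for theta in [0.1, 0.5, 1.0, 2.0]:
    mgf = sum(p * math.exp(theta * x) for x, p in dist.items())
    assert tail <= mgf * math.exp(-theta * a)  # Markov bound holds

print(tail)  # 0.2; every theta tried gives a valid upper bound
```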

SLIDE 9

Chernoff’s Inequality and B(n,p)

Let X = B(n,p). We want a bound on Pr[X ≥ a]. Since X = X1 + ··· + Xn with Pr[Xm = 1] = p = 1 − Pr[Xm = 0], we have

E[e^{θX}] = E[e^{θ(X1+···+Xn)}] = E[e^{θX1} × ··· × e^{θXn}] = (E[e^{θX1}])^n = [pe^θ + (1−p)]^n.

Thus, Pr[X ≥ a] ≤ [pe^θ + (1−p)]^n / e^{θa}. We minimize the RHS over θ > 0 and we find (after some algebra ...) Pr[X ≥ a] ≤ Pr[X = a] / Pr[Y = a], where Y = B(n, a/n).
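The bound can be verified numerically. In the sketch below, n, p, and a are illustrative values of mine (the slides do not fix any); we compare the exact binomial tail with the Chernoff bound minimized over a grid of θ:

```python
import math

# Check the Chernoff bound for X = B(n, p):
# Pr[X >= a] <= min_{theta>0} (p*e^theta + 1-p)^n / e^{theta*a}.
n, p, a = 100, 0.5, 70  # illustrative values, not from the slides

def binom_pmf(n, k, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

exact_tail = sum(binom_pmf(n, k, p) for k in range(a, n + 1))

# Minimize over theta on a grid (the exact minimizer solves
# p*e^theta / (p*e^theta + 1-p) = a/n, but a grid scan suffices here).
bound = min((p * math.exp(t) + 1 - p)**n * math.exp(-t * a)
            for t in [i / 1000 for i in range(1, 5000)])

print(exact_tail <= bound)  # True: the bound dominates the exact tail
```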

SLIDE 10

Chernoff’s Inequality and B(n,p)

Here is a picture:

SLIDE 11

Chernoff’s Inequality and P(λ)

Let X = P(λ). We want a bound on Pr[X ≥ a]. We have

E[e^{θX}] = ∑_{n≥0} e^{θn} (λ^n / n!) e^{−λ} = ∑_{n≥0} ((λe^θ)^n / n!) e^{−λ} = exp{λe^θ} exp{−λ} = exp{λ(e^θ − 1)}.

Thus, Pr[X ≥ a] ≤ E[e^{θX}] / e^{θa} = exp{λ(e^θ − 1) − θa}. We minimize over θ > 0 and we find (after some algebra) Pr[X ≥ a] ≤ (λ/a)^a e^{a−λ} = Pr[X = a] / Pr[Y = a], where Y = P(a).
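Both claims can be checked numerically: the closed-form bound dominates the exact Poisson tail, and it equals the ratio Pr[X = a]/Pr[Y = a] with Y = P(a). The values of λ and a below are illustrative choices of mine:

```python
import math

# Check Pr[X >= a] <= (lam/a)^a * e^{a - lam} for X = P(lam),
# and that this bound equals Pr[X = a] / Pr[Y = a] with Y = P(a).
lam, a = 5.0, 12  # illustrative values, not from the slides

def poisson_pmf(mean, k):
    return mean**k * math.exp(-mean) / math.factorial(k)

exact_tail = 1 - sum(poisson_pmf(lam, k) for k in range(a))
bound = (lam / a)**a * math.exp(a - lam)
ratio = poisson_pmf(lam, a) / poisson_pmf(a, a)  # the 1/a! cancels

print(exact_tail <= bound)         # True
print(abs(bound - ratio) < 1e-12)  # True: bound = Pr[X=a]/Pr[Y=a]
```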

SLIDE 12

Chernoff’s Inequality and P(λ)

Here is a picture:

SLIDE 13

Chernoff’s Inequality

Chernoff's inequality is typically used to estimate Pr[(X1 + ··· + Xn)/n ≥ a], where X1,...,Xn are independent with the same distribution, n ≫ 1, and a > E[X1]. We expect the average (X1 + ··· + Xn)/n to be close to the mean, so the desired probability is small. Chernoff's inequality yields useful bounds. It works because E[exp{θ(X1 + ··· + Xn)/n}] = (E[exp{θX1/n}])^n, by independence. Thus, Chernoff's bound is typically used for rare events. Herman Chernoff is now 92, a rare event.

SLIDE 14

Jensen’s Inequality

A function g(x) is convex if it lies above all its tangents. Consider the tangent at the point (E[X], g(E[X])) shown in the figure, and let a be its slope. We have

g(X) ≥ g(E[X]) + a(X − E[X]).

Taking expectations (the term a(X − E[X]) has mean zero), we conclude that g(·) convex ⇒ E[g(X)] ≥ g(E[X]).

SLIDE 15

Jensen’s Inequality: Examples

◮ E[|X|] ≥ |E[X]|
◮ E[X^4] ≥ E[X]^4
◮ E[e^{θX}] ≥ e^{θE[X]}
◮ E[ln(X)] ≤ ln(E[X])
◮ E[max{X^2, 1+X}] ≥ max{E[X]^2, 1+E[X]}
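Two of these examples can be spot-checked exactly on a small distribution. The three-point distribution below is a made-up example of mine, computed with fractions to avoid rounding:

```python
from fractions import Fraction

# Spot-check Jensen's inequality E[g(X)] >= g(E[X]) for convex g
# on a hypothetical distribution: X uniform on {-1, 0, 2}.
values = [-1, 0, 2]
p = Fraction(1, 3)

E_X = sum(p * x for x in values)         # 1/3
E_X4 = sum(p * x**4 for x in values)     # (1 + 0 + 16)/3 = 17/3
E_abs = sum(p * abs(x) for x in values)  # (1 + 0 + 2)/3 = 1

print(E_X4 >= E_X**4)     # True: 17/3 >= 1/81
print(E_abs >= abs(E_X))  # True: 1 >= 1/3
```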

SLIDE 16

Polling: Problem

Here is a central question about polling. Setup: Assume people vote democrat with probability p. We poll n people. Let An be the fraction of those who vote democrat. Question: How large should n be so that Pr[|An −p| ≥ 0.1] ≤ 0.05?

SLIDE 17

Polling: Analysis

Recall the problem: find n so that Pr[|An − p| ≥ 0.1] ≤ 0.05. Approach: Chebyshev!

Recall Chebyshev's inequality: Pr[|X − E[X]| ≥ a] ≤ var[X] / a².

Here, X = An = Y/n, where Y is the number of people out of n who vote democrat. Thus, Y = B(n,p). Hence, E[Y] = np and var[Y] = np(1−p). Consequently, E[X] = p and var[X] = p(1−p)/n. This gives Pr[|An − p| ≥ 0.1] ≤ p(1−p) / (n(0.1)²) = 100p(1−p)/n. However, we do not know p. What should we do? We know that p(1−p) ≤ 1/4. Hence, Pr[|An − p| ≥ 0.1] ≤ 25/n. Thus, if n = 500, we find Pr[|An − p| ≥ 0.1] ≤ 0.05.
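The calculation above can be sketched in a few lines: the worst-case Chebyshev bound 1/(4nε²) is at most δ exactly when n ≥ 1/(4ε²δ). Exact arithmetic with fractions reproduces the n = 500 on the slide:

```python
from fractions import Fraction
import math

# Polling bound via Chebyshev: Pr[|An - p| >= eps] <= p(1-p)/(n*eps^2)
# <= 1/(4*n*eps^2), so n >= 1/(4*eps^2*delta) samples suffice.
eps = Fraction(1, 10)    # accuracy 0.1
delta = Fraction(1, 20)  # error probability 0.05

n = math.ceil(1 / (4 * eps**2 * delta))
print(n)  # 500, matching the slide

# With n = 500, the worst-case (p = 1/2) Chebyshev bound is exactly 1/20.
worst_case = Fraction(1, 4) / (n * eps**2)
print(worst_case == delta)  # True
```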

SLIDE 18

Estimation

Common problem: estimate a mean value. Examples: height, weight, lifetime, arrival rate, job duration, .... Setup: X1, X2, ... are independent RVs with E[Xn] = µ and var[Xn] = σ². We observe {X1,...,Xn} and want to estimate µ. Approach: let An = (X1 + ··· + Xn)/n be the average (sample mean). Then, E[An] = µ and var[An] = σ²/n. Using Chebyshev: Pr[|An − µ| ≥ a] ≤ σ²/(na²).

Thus, Pr[|An − µ| ≥ a] → 0 as n → ∞. This is the WLLN, as we know.

SLIDE 19

Confidence Interval

How much confidence do we have in our estimate? Recall the setup: Xm independent with mean µ and variance σ². Chebyshev told us Pr[|An − µ| ≥ a] ≤ σ²/(na²). This probability is at most δ if σ² ≤ na²δ, i.e., a ≥ σ/√(nδ). Thus Pr[|An − µ| ≥ σ/√(nδ)] ≤ δ. Equivalently, Pr[µ ∈ [An − σ/√(nδ), An + σ/√(nδ)]] ≥ 1 − δ. We say that [An − σ/√(nδ), An + σ/√(nδ)] is a (1−δ)-confidence interval for µ.

SLIDE 20

Confidence Interval, continued

We just found that [An − σ/√(nδ), An + σ/√(nδ)] is a (1−δ)-confidence interval for µ. For δ = 0.05, since 1/√0.05 ≈ 4.5, this shows that [An − 4.5σ/√n, An + 4.5σ/√n] is a 95%-confidence interval for µ. A more refined analysis, using the Central Limit Theorem, allows us to replace 4.5 by 2.
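A short sketch comparing the two interval widths; the sample size, σ, and sample mean below are hypothetical values of mine, chosen only to make the comparison concrete:

```python
import math

# Chebyshev CI half-width is sigma/sqrt(n*delta); for delta = 0.05 the
# factor 1/sqrt(delta) is about 4.5, which the CLT refines to about 2.
delta = 0.05
chebyshev_factor = 1 / math.sqrt(delta)
print(round(chebyshev_factor, 2))  # 4.47, the "4.5" on the slide

# Hypothetical data summary: compare the two 95% intervals for mu.
n, sigma, A_n = 100, 2.0, 10.0
cheb_ci = (A_n - chebyshev_factor * sigma / math.sqrt(n),
           A_n + chebyshev_factor * sigma / math.sqrt(n))
clt_ci = (A_n - 2 * sigma / math.sqrt(n), A_n + 2 * sigma / math.sqrt(n))
print(cheb_ci)  # wider interval, roughly (9.11, 10.89)
print(clt_ci)   # narrower interval, (9.6, 10.4)
```

The Chebyshev interval is guaranteed for any distribution with variance σ², which is why it is more conservative than the CLT interval.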

SLIDE 21

CI with Unknown Variance

If σ is not known, we replace it by the estimate sn:

sn² = (1/n) ∑_{m=1}^{n} (Xm − An)².

Thus, we expect that, for n large enough (e.g., larger than 20), [An − 2sn/√n, An + 2sn/√n] is a 95%-confidence interval for µ. Does this work well? The theory says we have to be careful: the error in estimating σ may throw us off. What is known is that if the Xm have a nice distribution (e.g., Gaussian), and if n is not too small (say ≥ 15), then this is fine.

SLIDE 22

CI for Pr[H]

CIs for p = Pr[H] can be built using the upper bound σ ≤ 0.5 (since σ² = p(1−p) ≤ 1/4) or using the estimate sn.
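A minimal sketch of both variants on simulated coin flips; the true p and sample size are assumptions for the demo, not values from the slides. For 0/1 data, sn = √(An(1−An)) ≤ 0.5, so the estimated interval is never wider than the worst-case one:

```python
import math
import random

# Estimate p = Pr[H] from n flips and form two 95% CI half-widths:
# one using the worst case sigma <= 0.5, one using the estimate s_n.
random.seed(0)
n, p_true = 1000, 0.3  # hypothetical values for the simulation
flips = [1 if random.random() < p_true else 0 for _ in range(n)]

A_n = sum(flips) / n  # sample mean, the estimate of p
s_n = math.sqrt(sum((x - A_n)**2 for x in flips) / n)

half_worst = 2 * 0.5 / math.sqrt(n)  # worst-case sigma <= 0.5
half_est = 2 * s_n / math.sqrt(n)    # estimated sigma

print(half_est <= half_worst)  # True: s_n <= 0.5 for 0/1 data
```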

SLIDE 23

Linear Regression

An example: Random experiment: select people at random and plot their (age, height). You get (Xn,Yn) for n = 1,...,N where Xn = age and Yn = height for person n. The linear regression is a guess a+bXn for Yn that is close to the true values, in some sense to be made precise.

SLIDE 24

Linear Regression

Another example:

SLIDE 25

LR: Two Viewpoints

Linear regression: a + bXn is a guess for Yn. There are two ways to look at linear regression: Bayesian and non-Bayesian.

◮ Bayesian viewpoint:
  ◮ We have a prior: Pr[X = x, Y = y], x = ..., y = ...;
  ◮ We choose (a,b) to minimize E[(Y − a − bX)²].
◮ Non-Bayesian viewpoint:
  ◮ We have no prior, but samples: {(Xn,Yn), n = 1,...,N};
  ◮ We choose (a,b) to minimize ∑_{n=1}^{N} (Yn − a − bXn)².
  ◮ We hope Yk ≈ a + bXk for future samples.
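The non-Bayesian minimization has a standard closed form, not derived on this slide: b = cov(X,Y)/var(X) and a = mean(Y) − b·mean(X). A sketch on hypothetical samples:

```python
# Non-Bayesian linear regression sketch: choose (a, b) minimizing
# sum_n (Yn - a - b*Xn)^2, via the standard closed-form solution
# b = cov(X, Y) / var(X), a = mean(Y) - b * mean(X).
X = [1.0, 2.0, 3.0, 4.0]  # hypothetical samples Xn
Y = [2.1, 3.9, 6.2, 7.8]  # hypothetical samples Yn

N = len(X)
mx = sum(X) / N
my = sum(Y) / N
cov = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / N
var = sum((x - mx)**2 for x in X) / N

b = cov / var
a = my - b * mx
print(a, b)  # intercept near 0, slope near 2 for these samples

# Sanity check: at the optimum, the residuals sum to zero.
resid = [y - a - b * x for x, y in zip(X, Y)]
print(abs(sum(resid)) < 1e-9)  # True
```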

SLIDE 26

Summary

Chernoff, Jensen, Polling, Confidence Intervals, Linear Regression

◮ Chernoff: Pr[X ≥ a] ≤ min_{θ>0} E[e^{θ(X−a)}].
◮ Jensen: E[c(X)] ≥ c(E[X]) for c(·) convex.
◮ Polling: How many people to poll?
◮ Confidence Interval: sample mean ± 2σ/√n.
◮ Linear Regression: Y ≈ a + bX. B or not B?