

SLIDE 1

18.175: Lecture 8 Weak laws and moment-generating/characteristic functions

Scott Sheffield

MIT

SLIDE 2

Outline

• Moment generating functions
• Weak law of large numbers: Markov/Chebyshev approach
• Weak law of large numbers: characteristic function approach

SLIDE 4

Moment generating functions

Let X be a random variable. The moment generating function of X is defined by

M(t) = M_X(t) := E[e^{tX}].

When X is discrete, can write M(t) = Σ_x e^{tx} p_X(x). So M(t) is a weighted average of countably many exponential functions.

When X is continuous, can write M(t) = ∫_{−∞}^{∞} e^{tx} f(x) dx. So M(t) is a weighted average of a continuum of exponential functions.

We always have M(0) = 1. If b > 0 and t > 0 then E[e^{tX}] ≥ E[e^{t min{X,b}}] ≥ P{X ≥ b} e^{tb}.

If X takes both positive and negative values with positive probability, then M(t) grows at least exponentially fast in |t| as |t| → ∞.
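
To make the definition concrete, here is a minimal sketch (not from the slides) that estimates M(t) = E[e^{tX}] by Monte Carlo for a standard normal X and compares against the known closed form M(t) = e^{t²/2}:

```python
import numpy as np

# Minimal sketch (not from the slides): estimate M(t) = E[e^{tX}] by
# averaging e^{tX} over many samples of a standard normal X, and compare
# with the known closed form M(t) = e^{t^2/2}.
rng = np.random.default_rng(0)
samples = rng.standard_normal(10**6)

for t in [0.0, 0.5, 1.0]:
    estimate = np.mean(np.exp(t * samples))  # empirical weighted average
    exact = np.exp(t**2 / 2)                 # standard normal MGF
    print(f"t={t}: estimate={estimate:.4f}, exact={exact:.4f}")
```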

SLIDE 5
Moment generating functions actually generate moments

Let X be a random variable and M(t) = E[e^{tX}].

Then M'(t) = (d/dt) E[e^{tX}] = E[(d/dt) e^{tX}] = E[X e^{tX}]. In particular, M'(0) = E[X].

Also M''(t) = (d/dt) M'(t) = (d/dt) E[X e^{tX}] = E[X² e^{tX}]. So M''(0) = E[X²]. The same argument gives that the nth derivative of M at zero is E[X^n].

Interesting: knowing all of the derivatives of M at a single point tells you the moments E[X^k] for all integers k ≥ 0.

Another way to think of this: write e^{tX} = 1 + tX + t²X²/2! + t³X³/3! + ...

Taking expectations gives E[e^{tX}] = 1 + t m_1 + t² m_2/2! + t³ m_3/3! + ..., where m_k is the kth moment. The kth derivative at zero is m_k.
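
As a quick check of this (an illustrative example, not part of the slide), one can differentiate a known MGF symbolically. For an Exponential(1) random variable, M(t) = 1/(1 − t) for t < 1, and the kth derivative at zero should be the kth moment k!:

```python
import sympy as sp

# Sketch (assumed example): for Exponential(1), M(t) = 1/(1 - t) on t < 1.
# The kth derivative of M at zero should equal the kth moment E[X^k] = k!.
t = sp.symbols('t')
M = 1 / (1 - t)

for k in range(1, 5):
    mk = sp.diff(M, t, k).subs(t, 0)   # M^{(k)}(0)
    print(k, mk, sp.factorial(k))      # matching values: 1, 2, 6, 24
```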

SLIDE 6
Moment generating functions for independent sums

Let X and Y be independent random variables and Z = X + Y.

Write the moment generating functions as M_X(t) = E[e^{tX}], M_Y(t) = E[e^{tY}], and M_Z(t) = E[e^{tZ}]. If you knew M_X and M_Y, could you compute M_Z?

By independence, M_Z(t) = E[e^{t(X+Y)}] = E[e^{tX} e^{tY}] = E[e^{tX}] E[e^{tY}] = M_X(t) M_Y(t) for all t.

In other words, adding independent random variables corresponds to multiplying moment generating functions.
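
A quick numerical sanity check (an assumed example, not on the slide): for independent X ~ N(0,1) and Y ~ Uniform(0,1), the empirical M_{X+Y}(t) should match M_X(t) M_Y(t) up to Monte Carlo error:

```python
import numpy as np

# Sketch (assumed example): verify M_{X+Y}(t) ≈ M_X(t) * M_Y(t) by simulation
# for independent X ~ N(0,1) and Y ~ Uniform(0,1).
rng = np.random.default_rng(1)
X = rng.standard_normal(10**6)
Y = rng.uniform(0.0, 1.0, 10**6)  # drawn independently of X

t = 0.7
lhs = np.mean(np.exp(t * (X + Y)))                     # M_{X+Y}(t)
rhs = np.mean(np.exp(t * X)) * np.mean(np.exp(t * Y))  # M_X(t) M_Y(t)
print(lhs, rhs)  # agree up to Monte Carlo error
```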

SLIDE 7
Moment generating functions for sums of i.i.d. random variables

We showed that if Z = X + Y and X and Y are independent, then M_Z(t) = M_X(t) M_Y(t). If X_1, ..., X_n are i.i.d. copies of X and Z = X_1 + ... + X_n, then what is M_Z? Answer: M_Z = M_X^n, i.e., M_Z(t) = M_X(t)^n. This follows by repeatedly applying the formula above.

This is a big reason for studying moment generating functions. They help us understand what happens when we sum up a lot of independent copies of the same random variable.
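
For example (a standard instance, not spelled out on the slide): if X is Bernoulli(p), then M_X(t) = E[e^{tX}] = 1 − p + p e^t, so Z = X_1 + ... + X_n has M_Z(t) = (1 − p + p e^t)^n, which is exactly the moment generating function of a Binomial(n, p) random variable.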

SLIDE 8
Other observations

If Z = aX, can I use M_X to determine M_Z? Answer: Yes. M_Z(t) = E[e^{tZ}] = E[e^{taX}] = M_X(at).

If Z = X + b, can I use M_X to determine M_Z? Answer: Yes. M_Z(t) = E[e^{tZ}] = E[e^{tX + bt}] = e^{bt} M_X(t).

The latter answer is the special case of M_Z(t) = M_X(t) M_Y(t) where Y is the constant random variable b.

SLIDE 9
Existence issues

It seems that unless f_X(x) decays superexponentially as x tends to infinity, we won't have M_X(t) defined for all t. What is M_X if X is standard Cauchy, so that f_X(x) = 1/(π(1 + x²))?

Answer: M_X(0) = 1 (as is true for any X), but otherwise M_X(t) is infinite for all t ≠ 0. Informal statement: moment generating functions are not defined for distributions with fat tails.
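
One can see this numerically (an assumed illustration, not from the slides): the empirical average of e^{tX} over standard Cauchy samples never stabilizes for t > 0, because the integral defining M_X(t) diverges:

```python
import numpy as np

# Sketch (assumed illustration): for standard Cauchy X and t > 0, empirical
# averages of e^{tX} never settle down -- E[e^{tX}] is infinite, so the
# estimates grow (and may overflow to inf) as the sample size increases.
rng = np.random.default_rng(2)
t = 0.1
for n in [10**3, 10**4, 10**5, 10**6]:
    X = rng.standard_cauchy(n)
    print(n, np.mean(np.exp(t * X)))  # jumps wildly upward with n
```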

SLIDE 10

Outline

• Moment generating functions
• Weak law of large numbers: Markov/Chebyshev approach
• Weak law of large numbers: characteristic function approach

SLIDE 12
Markov's and Chebyshev's inequalities

Markov's inequality: Let X be a non-negative random variable. Fix a > 0. Then P{X ≥ a} ≤ E[X]/a.

Proof: Consider the random variable Y defined by Y = a if X ≥ a and Y = 0 if X < a. Since X ≥ Y with probability one, it follows that E[X] ≥ E[Y] = a P{X ≥ a}. Divide both sides by a to get Markov's inequality.

Chebyshev's inequality: If X has finite mean µ, variance σ², and k > 0, then P{|X − µ| ≥ k} ≤ σ²/k².

Proof: Note that (X − µ)² is a non-negative random variable and P{|X − µ| ≥ k} = P{(X − µ)² ≥ k²}. Now apply Markov's inequality with a = k².
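
A small numerical check (an assumed example, not from the slides), using an Exponential(1) variable, which has mean 1 and variance 1:

```python
import numpy as np

# Sketch (assumed example): compare empirical tail probabilities with the
# Markov and Chebyshev bounds for X ~ Exponential(1) (mu = 1, sigma^2 = 1).
rng = np.random.default_rng(3)
X = rng.exponential(1.0, 10**6)

a = 3.0
print(np.mean(X >= a), 1.0 / a)  # P{X >= a} vs Markov bound E[X]/a

k = 2.0
print(np.mean(np.abs(X - 1.0) >= k), 1.0 / k**2)  # vs Chebyshev bound sigma^2/k^2
```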

SLIDE 13
Markov and Chebyshev: rough idea

Markov's inequality: Let X be a non-negative random variable with finite mean. Fix a constant a > 0. Then P{X ≥ a} ≤ E[X]/a.

Chebyshev's inequality: If X has finite mean µ, variance σ², and k > 0, then P{|X − µ| ≥ k} ≤ σ²/k².

These inequalities allow us to deduce limited information about a distribution when we know only the mean (Markov) or the mean and variance (Chebyshev). Markov: if E[X] is small, then it is not too likely that X is large. Chebyshev: if σ² = Var[X] is small, then it is not too likely that X is far from its mean.

SLIDE 14
Statement of weak law of large numbers

Suppose the X_i are i.i.d. random variables with mean µ.

Then the value A_n := (X_1 + X_2 + ... + X_n)/n is called the empirical average of the first n trials. We'd guess that when n is large, A_n is typically close to µ. Indeed, the weak law of large numbers states that for all ε > 0 we have lim_{n→∞} P{|A_n − µ| > ε} = 0.

Example: as n tends to infinity, the probability of seeing more than .50001n heads in n fair coin tosses tends to zero.
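
A simulation sketch of the coin example (an assumed illustration, not from the slides):

```python
import numpy as np

# Sketch (assumed illustration): the empirical frequency of heads in n fair
# coin tosses concentrates around 1/2 as n grows, as the weak law predicts.
rng = np.random.default_rng(4)
for n in [10**2, 10**4, 10**6]:
    tosses = rng.integers(0, 2, n)  # 0 = tails, 1 = heads
    print(n, tosses.mean())         # empirical average A_n approaches 0.5
```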

SLIDE 15
Proof of weak law of large numbers in finite variance case

As above, let the X_i be i.i.d. random variables with mean µ and write A_n := (X_1 + X_2 + ... + X_n)/n.

By additivity of expectation, E[A_n] = µ. Similarly, Var[A_n] = nσ²/n² = σ²/n.

By Chebyshev, P{|A_n − µ| ≥ ε} ≤ Var[A_n]/ε² = σ²/(nε²).

No matter how small ε is, the RHS tends to zero as n gets large.
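
The key identity Var[A_n] = σ²/n is easy to check by simulation (an assumed example): for Uniform(0,1) samples, σ² = 1/12, so the variance of the average of n = 100 samples should be about 1/1200:

```python
import numpy as np

# Sketch (assumed check): Var[A_n] = sigma^2 / n.  For Uniform(0,1),
# sigma^2 = 1/12, so averaging n = 100 samples gives variance ~ 1/1200.
rng = np.random.default_rng(5)
n, trials = 100, 10**5
A = rng.uniform(0.0, 1.0, (trials, n)).mean(axis=1)  # one A_n per trial
print(A.var(), (1.0 / 12.0) / n)                     # both about 8.3e-4
```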

SLIDE 16
L² weak law of large numbers

Say X_i and X_j are uncorrelated if E(X_i X_j) = E(X_i) E(X_j). The Chebyshev/Markov argument works whenever the variables are uncorrelated (it does not actually require independence).

SLIDE 17
What else can you do with just variance bounds?

Having "almost uncorrelated" X_i is sometimes enough: we just need the variance of A_n to go to zero. Toss αn balls into n bins. How many bins are filled? When n is large, the number of balls in the first bin is approximately a Poisson random variable with expectation α.

The probability that the first bin contains no ball is (1 − 1/n)^{αn} ≈ e^{−α}. We can explicitly compute the variance of the number of bins with no balls. This allows us to show that the fraction of bins with no balls concentrates about its expectation, which is e^{−α}.
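
A simulation sketch (an assumed illustration, not from the slides) of the empty-bin fraction:

```python
import numpy as np

# Sketch (assumed illustration): throw alpha*n balls into n bins uniformly
# at random; the fraction of empty bins should concentrate near e^{-alpha}.
rng = np.random.default_rng(6)
n, alpha = 10**5, 2.0
bins = rng.integers(0, n, int(alpha * n))    # bin hit by each ball
empty_fraction = 1.0 - np.unique(bins).size / n
print(empty_fraction, np.exp(-alpha))        # both about 0.135
```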

SLIDE 18
How do you extend to random variables without variance?

Assume the X_n are i.i.d. non-negative instances of a random variable X with finite mean. Can one prove a law of large numbers for these?

Try truncating. Fix a large N and write A = X·1_{X>N} and B = X·1_{X≤N}, so that X = A + B. Choose N so that E[A] is very small; the law of large numbers holds for the bounded variable B.

SLIDE 19

Outline

• Moment generating functions
• Weak law of large numbers: Markov/Chebyshev approach
• Weak law of large numbers: characteristic function approach

SLIDE 21
Extent of weak law

Question: does the weak law of large numbers apply no matter what the probability distribution for X is?

Is it always the case that if we define A_n := (X_1 + X_2 + ... + X_n)/n, then A_n is typically close to some fixed value when n is large?

What if X is Cauchy? In this strange and delightful case, A_n actually has the same probability distribution as X. In particular, the A_n are not tightly concentrated around any particular value even when n is very large. But the weak law does hold as long as E[|X|] is finite, so that µ is well defined. One standard proof uses characteristic functions.
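
A simulation sketch (an assumed illustration, not from the slides) of the Cauchy phenomenon:

```python
import numpy as np

# Sketch (assumed illustration): empirical averages of standard Cauchy
# samples do not concentrate -- A_n is again standard Cauchy for every n.
rng = np.random.default_rng(7)
for n in [10**2, 10**4, 10**6]:
    A_n = rng.standard_cauchy(n).mean()
    print(n, A_n)  # stays spread out (with occasional huge values) as n grows
```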

SLIDE 22
Characteristic functions

Let X be a random variable. The characteristic function of X is defined by φ(t) = φ_X(t) := E[e^{itX}]. Like M(t), except with i thrown in. Recall that by definition e^{it} = cos(t) + i sin(t).

Characteristic functions are similar to moment generating functions in some ways. For example, φ_{X+Y} = φ_X φ_Y if X and Y are independent, just as M_{X+Y} = M_X M_Y. And φ_{aX}(t) = φ_X(at), just as M_{aX}(t) = M_X(at).

And if X has an mth moment, then E[X^m] = i^{−m} φ_X^{(m)}(0).

But characteristic functions have an advantage: they are well defined at all t for all random variables X.
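
A minimal numerical sketch (an assumed example, not from the slides): estimate φ(t) = E[e^{itX}] for a standard normal X and compare with the closed form φ(t) = e^{−t²/2}. Since |e^{itX}| = 1, the average is finite no matter how heavy the tails of X:

```python
import numpy as np

# Sketch (assumed example): Monte Carlo estimate of phi(t) = E[e^{itX}] for
# standard normal X versus the closed form phi(t) = e^{-t^2/2}.  Since
# |e^{itX}| = 1, the average is well defined for any distribution.
rng = np.random.default_rng(8)
X = rng.standard_normal(10**6)

for t in [0.5, 1.0, 2.0]:
    estimate = np.mean(np.exp(1j * t * X))  # complex-valued average
    print(t, estimate.real, np.exp(-t**2 / 2))
```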

SLIDE 23
Continuity theorems

Let X be a random variable and X_n a sequence of random variables. Say the X_n converge in distribution or converge in law to X if lim_{n→∞} F_{X_n}(x) = F_X(x) at all x ∈ R at which F_X is continuous.

The weak law of large numbers can be rephrased as the statement that A_n converges in law to µ (i.e., to the random variable that is equal to µ with probability one).

Lévy's continuity theorem (coming later): if lim_{n→∞} φ_{X_n}(t) = φ_X(t) for all t, then the X_n converge in law to X. By this theorem, we can prove the weak law of large numbers by showing lim_{n→∞} φ_{A_n}(t) = φ_µ(t) = e^{itµ} for all t. When µ = 0, this amounts to showing lim_{n→∞} φ_{A_n}(t) = 1 for all t.

Moment generating analog: if the moment generating functions M_{X_n}(t) are defined for all t and n and, for all t, lim_{n→∞} M_{X_n}(t) = M_X(t), then the X_n converge in law to X.

SLIDE 24
Proof sketch for weak law of large numbers, finite mean case

As above, let the X_i be i.i.d. instances of a random variable X with mean µ, and write A_n := (X_1 + X_2 + ... + X_n)/n. The weak law of large numbers holds for i.i.d. instances of X if and only if it holds for i.i.d. instances of X − µ. Thus it suffices to prove the weak law in the mean zero case, so assume E[X] = 0.

Consider the characteristic function φ_X(t) = E[e^{itX}].

Since E[X] = 0, we have φ_X'(0) = E[(∂/∂t) e^{itX}]|_{t=0} = i E[X] = 0.

Write g(t) = log φ_X(t), so that φ_X(t) = e^{g(t)}. Then g(0) = 0 and (by the chain rule) g'(0) = lim_{ε→0} (g(ε) − g(0))/ε = lim_{ε→0} g(ε)/ε = 0.

Now φ_{A_n}(t) = φ_X(t/n)^n = e^{n g(t/n)}. Since g(0) = g'(0) = 0, we have lim_{n→∞} n g(t/n) = lim_{n→∞} t · g(t/n)/(t/n) = 0 for fixed t.

Thus lim_{n→∞} e^{n g(t/n)} = 1 for all t. By Lévy's continuity theorem, the A_n converge in law to 0 (i.e., to the random variable that is 0 with probability one).
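
A concrete check of the last step (an assumed example, not from the slides): for X uniform on [−1, 1] (mean zero), φ_X(t) = sin(t)/t, and φ_{A_n}(t) = φ_X(t/n)^n indeed tends to 1:

```python
import numpy as np

# Sketch (assumed example): for X uniform on [-1, 1], phi_X(t) = sin(t)/t.
# Then phi_{A_n}(t) = phi_X(t/n)^n, which should approach 1 for fixed t.
t = 3.0
for n in [1, 10, 100, 1000]:
    phi_An = (np.sin(t / n) / (t / n)) ** n
    print(n, phi_An)  # tends to 1 as n grows
```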

SLIDE 25

MIT OpenCourseWare http://ocw.mit.edu

18.175 Theory of Probability

Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.