18.175: Lecture 13 More large deviations Scott Sheffield MIT


Outline
- Legendre transform
- Large deviations

Legendre transform
- Define the Legendre transform (or Legendre dual) of a function $\Lambda : \mathbb{R}^d \to \mathbb{R}$ by
  $$\Lambda^*(x) = \sup_{\lambda \in \mathbb{R}^d} \{ (\lambda, x) - \Lambda(\lambda) \}.$$
- Let's describe the Legendre dual geometrically if $d = 1$: $-\Lambda^*(x)$ is the height at which the tangent line to $\Lambda$ of slope $x$ crosses the vertical axis. We can "roll" this tangent line around the convex hull of the graph of $\Lambda$ to get all $\Lambda^*$ values.
- Is the Legendre dual always convex?
- What is the Legendre dual of $x^2$? Of the function equal to $0$ at $0$ and $\infty$ everywhere else?
- How are derivatives of $\Lambda$ and $\Lambda^*$ related?
- What is the Legendre dual of the Legendre dual of a convex function?
- What's the higher dimensional analog of rolling the tangent line?
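One way to explore the question about $x^2$ is to approximate the supremum on a grid. The sketch below is only an illustration: the helper name `legendre_dual`, the grid ranges, and the grid spacing are assumptions, and the grid must be wide enough to contain the maximizing $\lambda$. For $\Lambda(\lambda) = \lambda^2$ the supremum is attained at $\lambda = x/2$, which gives $\Lambda^*(x) = x^2/4$, and the code compares the grid approximation with that closed form.

```python
import numpy as np

def legendre_dual(f, lambdas, xs):
    """Approximate f*(x) = sup_lambda { lambda * x - f(lambda) } on a grid (d = 1)."""
    values = f(lambdas)  # f evaluated on the lambda grid
    # Row i holds lambda * xs[i] - f(lambda) for every grid lambda; maximise over lambda.
    return np.max(np.outer(xs, lambdas) - values[None, :], axis=1)

lambdas = np.linspace(-5.0, 5.0, 10001)  # hypothetical grid, spacing 0.001
xs = np.linspace(-2.0, 2.0, 9)           # slopes at which the dual is evaluated

dual = legendre_dual(lambda lam: lam**2, lambdas, xs)
max_err = np.max(np.abs(dual - xs**2 / 4))  # compare with the closed form x^2 / 4
print(max_err)
```

The grid version can only underestimate the true supremum, so it is a lower approximation that improves as the grid is refined.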

Recall: moment generating functions
- Let $X$ be a random variable.
- The moment generating function of $X$ is defined by $M(t) = M_X(t) := E[e^{tX}]$.
- When $X$ is discrete, can write $M(t) = \sum_x e^{tx} p_X(x)$. So $M(t)$ is a weighted average of countably many exponential functions.
- When $X$ is continuous with density $f$, can write $M(t) = \int_{-\infty}^{\infty} e^{tx} f(x)\,dx$. So $M(t)$ is a weighted average of a continuum of exponential functions.
- We always have $M(0) = 1$.
- If $b > 0$ and $t > 0$ then $E[e^{tX}] \ge E[e^{t \min\{X, b\}}] \ge P\{X \ge b\}\, e^{tb}$.
- If $X$ takes both positive and negative values with positive probability then $M(t)$ grows at least exponentially fast in $|t|$ as $|t| \to \infty$.
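The discrete formula and the bound $E[e^{tX}] \ge P\{X \ge b\} e^{tb}$ can be checked on a small example. The three-point distribution below is made up for illustration, and the helper names `mgf` and `tail_prob` are hypothetical.

```python
import math

# Hypothetical discrete law: P(X = -1) = 0.5, P(X = 0) = 0.3, P(X = 2) = 0.2.
values = [-1, 0, 2]
probs = [0.5, 0.3, 0.2]

def mgf(t):
    """M(t) = sum_x e^{t x} p_X(x), a weighted average of exponentials."""
    return sum(p * math.exp(t * x) for x, p in zip(values, probs))

def tail_prob(b):
    """P(X >= b) for the discrete law above."""
    return sum(p for x, p in zip(values, probs) if x >= b)

m0 = mgf(0.0)  # should equal 1 up to rounding

b = 2
# Check M(t) >= P(X >= b) e^{t b} at a few positive values of t.
bound_holds = all(mgf(t) >= tail_prob(b) * math.exp(t * b) for t in (0.1, 0.5, 1.0, 3.0))
print(m0, bound_holds)
```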

Recall: moment generating functions for i.i.d. sums
- We showed that if $Z = X + Y$ and $X$ and $Y$ are independent, then $M_Z(t) = M_X(t) M_Y(t)$.
- If $X_1, \ldots, X_n$ are i.i.d. copies of $X$ and $Z = X_1 + \ldots + X_n$ then what is $M_Z$?
- Answer: $M_X^n$.
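A quick numerical check of $M_Z = M_X^n$, as a sketch under simple assumptions: $X$ takes the values $0, 1, 2$ with made-up probabilities, so the exact law of the sum can be obtained by repeated convolution of the probability vector.

```python
import numpy as np

# Hypothetical law of X on the consecutive integers 0, 1, 2.
values = np.array([0, 1, 2])
probs = np.array([0.2, 0.5, 0.3])
n = 5

# Exact pmf of Z = X_1 + ... + X_n: convolve the pmf with itself n times.
pmf_sum = np.array([1.0])  # pmf of the empty sum (point mass at 0)
for _ in range(n):
    pmf_sum = np.convolve(pmf_sum, probs)
support_sum = np.arange(len(pmf_sum))  # Z takes values 0, 1, ..., 2n

t = 0.7
mgf_x = np.sum(probs * np.exp(t * values))         # M_X(t)
mgf_z = np.sum(pmf_sum * np.exp(t * support_sum))  # M_Z(t), from the exact law of Z
print(mgf_z, mgf_x**n)
```

The convolution trick relies on the support being consecutive integers starting at 0, so that array indices coincide with values.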

Large deviations
- Consider i.i.d. random variables $X_i$ with partial sums $S_n = X_1 + \ldots + X_n$. Can we show that $P(S_n \ge na) \to 0$ exponentially fast when $a > E[X_i]$?
- Kind of a quantitative form of the weak law of large numbers. The empirical average $A_n = S_n/n$ is very unlikely to be $\epsilon$ away from its expected value (where "very" means with probability less than some exponentially decaying function of $n$).
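To see the exponential decay concretely, here is a small sketch for fair coin flips (an assumed example, not taken from the slides) with $a = 7/10$. It computes $-\frac{1}{n} \log P(S_n \ge na)$ exactly using integer binomial coefficients; if the decay is exponential, this quantity should settle near a positive constant as $n$ grows. The function name `decay_rate` is hypothetical.

```python
import math

def decay_rate(n, a_num=7, a_den=10):
    """Return -(1/n) log P(S_n >= n*a) for S_n ~ Binomial(n, 1/2), a = a_num/a_den."""
    k_min = -(-a_num * n // a_den)  # ceil(n * a) in exact integer arithmetic
    # Number of coin sequences with at least k_min heads (exact big integer).
    favourable = sum(math.comb(n, k) for k in range(k_min, n + 1))
    log_prob = math.log(favourable) - n * math.log(2)
    return -log_prob / n

rates = {n: decay_rate(n) for n in (10, 100, 1000)}
for n, r in rates.items():
    print(n, r)
```

Working with logarithms of exact integers avoids floating-point underflow for large $n$.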

General large deviation principle
- More general framework: a large deviation principle describes limiting behavior as $n \to \infty$ of a family $\{\mu_n\}$ of measures on a measure space $(\mathcal{X}, \mathcal{B})$ in terms of a rate function $I$.
- The rate function is a lower-semicontinuous map $I : \mathcal{X} \to [0, \infty]$. (The sets $\{x : I(x) \le a\}$ are closed; the rate function is called "good" if these sets are compact.)
- DEFINITION: $\{\mu_n\}$ satisfy LDP with rate function $I$ and speed $n$ if for all $\Gamma \in \mathcal{B}$,
  $$-\inf_{x \in \Gamma^0} I(x) \le \liminf_{n \to \infty} \frac{1}{n} \log \mu_n(\Gamma) \le \limsup_{n \to \infty} \frac{1}{n} \log \mu_n(\Gamma) \le -\inf_{x \in \overline{\Gamma}} I(x),$$
  where $\Gamma^0$ is the interior of $\Gamma$ and $\overline{\Gamma}$ is its closure.
- INTUITION: when "near $x$" the probability density function for $\mu_n$ is tending to zero like $e^{-I(x) n}$, as $n \to \infty$.
- Simple case: $I$ is continuous, $\Gamma$ is closure of its interior.
- Question: How would $I$ change if we replaced the measures $\mu_n$ by weighted measures $e^{(\lambda n, \cdot)} \mu_n$? Replace $I(x)$ by $I(x) - (\lambda, x)$?
- What is $\inf_x I(x) - (\lambda, x)$?
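A worked one-dimensional example may help fix the definition. It is an illustration chosen here, not one from the slides: take $\mu_n$ to be the law of a centered Gaussian with variance $1/n$ on $\mathbb{R}$, where everything can be computed by hand.

```latex
% Illustrative example (an assumption, not from the slides): \mu_n = N(0, 1/n) on \mathbb{R}.
\mu_n(dx) = \sqrt{\tfrac{n}{2\pi}}\, e^{-n x^2/2}\, dx
\quad\Longrightarrow\quad
I(x) = \tfrac{x^2}{2},
\qquad
\lim_{n \to \infty} \tfrac{1}{n} \log \mu_n\big([a, \infty)\big)
  = -\tfrac{a^2}{2} = -\inf_{x \ge a} I(x) \quad (a > 0).

% Weighting by e^{n \lambda x} changes the exponent from I(x) to I(x) - \lambda x:
\tfrac{1}{n} \log \int e^{n \lambda x}\, \mu_n(dx)
  = \tfrac{\lambda^2}{2}
  = -\inf_{x} \{ I(x) - \lambda x \}.
```

In this example the infimum of $I(x) - \lambda x$ is attained at $x = \lambda$, so the exponential growth rate of the weighted total mass is the Legendre dual of $I$ evaluated at $\lambda$.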

Cramér's theorem
- Let $\mu_n$ be law of empirical mean $A_n = \frac{1}{n} \sum_{j=1}^n X_j$ for i.i.d. vectors $X_1, X_2, \ldots, X_n$ in $\mathbb{R}^d$ with same law as $X$.
- Define log moment generating function of $X$ by
  $$\Lambda(\lambda) = \Lambda_X(\lambda) = \log M_X(\lambda) = \log E\, e^{(\lambda, X)},$$
  where $(\cdot, \cdot)$ is inner product on $\mathbb{R}^d$.
- Define Legendre transform of $\Lambda$ by
  $$\Lambda^*(x) = \sup_{\lambda \in \mathbb{R}^d} \{ (\lambda, x) - \Lambda(\lambda) \}.$$
- CRAMÉR'S THEOREM: $\mu_n$ satisfy LDP with convex rate function $\Lambda^*$.
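For a concrete rate function, consider Bernoulli($p$) variables (an example chosen here for illustration, with $p = 0.3$ assumed). Then $\Lambda(\lambda) = \log(1 - p + p e^{\lambda})$, and for $0 < x < 1$ the Legendre transform has the closed form $\Lambda^*(x) = x \log\frac{x}{p} + (1 - x) \log\frac{1 - x}{1 - p}$. The sketch below compares a grid approximation of the supremum with this closed form; the function names and the grid are hypothetical choices.

```python
import numpy as np

p = 0.3  # hypothetical success probability

def log_mgf(lam):
    """Lambda(lambda) = log E[e^{lambda X}] for X ~ Bernoulli(p)."""
    return np.log(1 - p + p * np.exp(lam))

def rate_numeric(x, lambdas):
    """Grid approximation of Lambda*(x) = sup_lambda { lambda x - Lambda(lambda) }."""
    return np.max(lambdas * x - log_mgf(lambdas))

def rate_closed_form(x):
    """Closed form of Lambda*(x) for Bernoulli(p), valid for 0 < x < 1."""
    return x * np.log(x / p) + (1 - x) * np.log((1 - x) / (1 - p))

lambdas = np.linspace(-20.0, 20.0, 400001)  # spacing 1e-4
xs = [0.1, 0.3, 0.5, 0.9]
numeric = [rate_numeric(x, lambdas) for x in xs]
exact = [rate_closed_form(x) for x in xs]
for x, a, b in zip(xs, numeric, exact):
    print(x, a, b)
```

The rate vanishes at $x = p$, the mean, and is positive elsewhere, which matches the law of large numbers picture.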

Thinking about Cramér's theorem
- Let $\mu_n$ be law of empirical mean $A_n = \frac{1}{n} \sum_{j=1}^n X_j$.
- CRAMÉR'S THEOREM: $\mu_n$ satisfy LDP with convex rate function
  $$I(x) = \Lambda^*(x) = \sup_{\lambda \in \mathbb{R}^d} \{ (\lambda, x) - \Lambda(\lambda) \},$$
  where $\Lambda(\lambda) = \log M(\lambda) = \log E\, e^{(\lambda, X_1)}$.
- This means that for all $\Gamma \in \mathcal{B}$ we have this asymptotic lower bound on probabilities $\mu_n(\Gamma)$:
  $$-\inf_{x \in \Gamma^0} I(x) \le \liminf_{n \to \infty} \frac{1}{n} \log \mu_n(\Gamma),$$
  so (up to subexponential error) $\mu_n(\Gamma) \ge e^{-n \inf_{x \in \Gamma^0} I(x)}$, where $\Gamma^0$ is the interior of $\Gamma$,
- and this asymptotic upper bound on the probabilities $\mu_n(\Gamma)$:
  $$\limsup_{n \to \infty} \frac{1}{n} \log \mu_n(\Gamma) \le -\inf_{x \in \overline{\Gamma}} I(x),$$
  which says (up to subexponential error) $\mu_n(\Gamma) \le e^{-n \inf_{x \in \overline{\Gamma}} I(x)}$, where $\overline{\Gamma}$ is the closure of $\Gamma$.

Proving Cramér upper bound
- Recall that $I(x) = \Lambda^*(x) = \sup_{\lambda \in \mathbb{R}^d} \{ (\lambda, x) - \Lambda(\lambda) \}$.
- For simplicity, assume that $\Lambda$ is finite for all $\lambda$ (which implies that $X$ has moments of all orders and $\Lambda$ and $\Lambda^*$ are strictly convex, and the derivatives of $\Lambda$ and $\Lambda^*$ are inverses of each other). It is also enough to consider the case $X$ has mean zero, which implies that $\Lambda(0) = 0$ is a minimum of $\Lambda$, and $\Lambda^*(0) = 0$ is a minimum of $\Lambda^*$.
- We aim to show (up to subexponential error) that $\mu_n(\Gamma) \le e^{-n \inf_{x \in \overline{\Gamma}} I(x)}$, where $\overline{\Gamma}$ is the closure of $\Gamma$.
- If $\Gamma$ were singleton set $\{x\}$ we could find the $\lambda$ corresponding to $x$, so $\Lambda^*(x) = (x, \lambda) - \Lambda(\lambda)$. Note then that
  $$E\, e^{(n\lambda, A_n)} = E\, e^{(\lambda, S_n)} = M_X^n(\lambda) = e^{n \Lambda(\lambda)},$$
  and also $E\, e^{(n\lambda, A_n)} \ge e^{n(\lambda, x)} \mu_n\{x\}$. Taking logs and dividing by $n$ gives $\Lambda(\lambda) \ge \frac{1}{n} \log \mu_n\{x\} + (\lambda, x)$, so that $\frac{1}{n} \log \mu_n(\Gamma) \le -\Lambda^*(x)$, as desired.
- General $\Gamma$: cut into finitely many pieces, bound each piece?
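The same exponential Markov argument, applied in $d = 1$ to a half-line $\{A_n \ge x\}$ with $x$ above the mean, gives $P(A_n \ge x) \le e^{-n \Lambda^*(x)}$ for every $n$, not just asymptotically. The sketch below checks this for Bernoulli($p$) variables with $p = 0.3$ and $x = 0.5$; these values, and the helper names, are assumptions made for illustration.

```python
import math

p, x = 0.3, 0.5  # hypothetical parameters with x > p = E[X]

def rate(x):
    """Closed-form Legendre transform Lambda*(x) for Bernoulli(p), 0 < x < 1."""
    return x * math.log(x / p) + (1 - x) * math.log((1 - x) / (1 - p))

def tail(n):
    """Exact P(A_n >= 1/2) = P(S_n >= ceil(n/2)) for S_n ~ Binomial(n, p)."""
    k_min = (n + 1) // 2  # ceil(n / 2)
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

def chernoff_bound(n):
    """Upper bound e^{-n Lambda*(x)} from the exponential Markov inequality."""
    return math.exp(-n * rate(x))

checks = {n: (tail(n), chernoff_bound(n)) for n in (1, 10, 50, 200)}
for n, (prob, bound) in checks.items():
    print(n, prob, bound)
```

The gap between the two columns is the subexponential error mentioned on the slide: the ratio shrinks, but only polynomially in $n$.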

Proving Cramér lower bound
- Recall that $I(x) = \Lambda^*(x) = \sup_{\lambda \in \mathbb{R}^d} \{ (\lambda, x) - \Lambda(\lambda) \}$.
- We aim to show that asymptotically $\mu_n(\Gamma) \ge e^{-n \inf_{x \in \Gamma^0} I(x)}$, where $\Gamma^0$ is the interior of $\Gamma$.
- It's enough to show that for each given $x \in \Gamma^0$, we have that asymptotically $\mu_n(\Gamma) \ge e^{-n I(x)}$.
- Idea is to weight the law of $X$ by $e^{(\lambda, X)}$ for some $\lambda$ and normalize to get a new measure whose expectation is this point $x$. In this new measure, $A_n$ is "typically" in $\Gamma$ for large $n$, so the probability is of order 1.
- But by how much did we have to modify the measure to make this typical? Not by more than a factor of $e^{-n I(x)}$ (up to subexponential error); optimizing over $x \in \Gamma^0$ then gives the bound above.
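The reweighting step can be made explicit for Bernoulli($p$) variables. This is a sketch with assumed values $p = 0.3$ and target $x = 0.5$: weighting the law of $X$ by $e^{\lambda X}$ and normalizing gives another Bernoulli law, and choosing $\lambda = \log\frac{x(1-p)}{(1-x)p}$ makes its mean equal to $x$. The per-sample exponent $\lambda x - \Lambda(\lambda)$ then matches the closed-form value of $\Lambda^*(x)$ for this example.

```python
import math

p, x = 0.3, 0.5  # hypothetical original mean p and target mean x

# Tilt parameter chosen so that the tilted law has mean x.
lam = math.log(x * (1 - p) / ((1 - x) * p))

# Normalizing constant M(lambda) = E[e^{lambda X}] and its logarithm Lambda(lambda).
mgf = 1 - p + p * math.exp(lam)
log_mgf = math.log(mgf)

# Tilted law: P_lambda(X = 1) = p e^{lambda} / M(lambda); this is also its mean.
tilted_mean = p * math.exp(lam) / mgf

# Per-sample exponent paid for the change of measure.
cost = lam * x - log_mgf

# Closed-form Legendre transform of Lambda at x for Bernoulli(p).
rate = x * math.log(x / p) + (1 - x) * math.log((1 - x) / (1 - p))
print(tilted_mean, cost, rate)
```

Under the tilted law the event that $A_n$ is near $x$ is typical, and the likelihood ratio between the original and tilted laws on that event is roughly $e^{-n \cdot \text{cost}}$, which is where the factor in the lower bound comes from.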

MIT OpenCourseWare http://ocw.mit.edu 18.175 Theory of Probability Spring 2014 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

