The Interplay of Information Theory, Probability, and Statistics


  1. The Interplay of Information Theory, Probability, and Statistics Andrew Barron Yale University, Department of Statistics Presentation at Purdue University, February 26, 2007

  2. Outline
• Information Theory Quantities and Tools*: entropy, relative entropy, Shannon and Fisher information, information capacity
• Interplay with Statistics**: information capacity determines fundamental rates for parameter estimation and function estimation
• Interplay with Probability Theory: central limit theorem***; large deviation probability exponents**** for Markov chain Monte Carlo and optimization

* Cover & Thomas, Elements of Information Theory, 1990
** Hengartner & Barron 1998 Ann. Stat.; Yang & Barron 1999 Ann. Stat.
*** Barron 1986 Ann. Prob.; Johnson & Barron 2004 Ann. Prob.; Madiman & Barron 2006 ISIT
**** Csiszar 1984 Ann. Prob.

  3. Outline for Information and Probability
• Central Limit Theorem: If X_1, X_2, ..., X_n are i.i.d. with mean zero and variance 1, f_n is the density function of (X_1 + X_2 + ... + X_n)/√n, and φ is the standard normal density, then D(f_n || φ) decreases to 0 if and only if this entropy distance is ever finite
• Large Deviations and Markov Chains: If {X_t} is i.i.d. or reversible Markov and f is bounded, then there is an exponent D_ε, characterized as a relative entropy, with which
    P{ (1/n) Σ_{t=1}^n f(X_t) ≥ E[f] + ε } ≤ e^{−n D_ε}
  Markov chains based on local moves permit a differential equation which, when solved, determines the exponent D_ε. This should permit determination of which chains provide accurate Monte Carlo estimates.
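A minimal numerical sketch of the i.i.d. case of the large-deviation bound above, using the classical Chernoff/Cramér form of the exponent (which is the relative entropy D(a || p) in the Bernoulli case); the Bernoulli(0.3) source, ε = 0.1, and the sample sizes are illustrative choices, not taken from the talk.

```python
# Sketch: i.i.d. large-deviation exponent D_eps, compared with Monte Carlo
# tail frequencies.  Illustrative example: X_t ~ Bernoulli(p), f(x) = x.
import numpy as np

p, eps = 0.3, 0.1
a = p + eps                # tail threshold E[f] + eps

# For Bernoulli, the Chernoff exponent sup_lam [lam*a - log E e^{lam X}]
# equals the binary relative entropy D(a || p).
D_eps = a * np.log(a / p) + (1 - a) * np.log((1 - a) / (1 - p))

rng = np.random.default_rng(0)
for n in (25, 100, 400):
    X = rng.binomial(1, p, size=(20000, n))
    tail = np.mean(X.mean(axis=1) >= a)      # P{ (1/n) sum f(X_t) >= E[f] + eps }
    print(n, tail, np.exp(-n * D_eps))       # empirical tail vs bound e^{-n D_eps}
```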

  4. Entropy
• For a random variable Y or sequence Y = (Y_1, Y_2, ..., Y_N) with probability mass or density function p(y), the Shannon entropy is
    H(Y) = E[ log 1/p(Y) ]
• It is the shortest expected codelength for Y
• It is the exponent of the size of the smallest set that has most of the probability
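A small sketch of the definition: H(Y) in bits for an arbitrary illustrative pmf, together with the ideal codelengths log 1/p(y) behind the shortest-expected-codelength reading.

```python
# Sketch: Shannon entropy H(Y) = E[ log 1/p(Y) ] for a discrete p(y), in bits.
# The distribution below is an arbitrary illustration.
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])
H = np.sum(p * np.log2(1.0 / p))
print("entropy (bits):", H)                    # 1.75 for this p
print("ideal codelengths:", np.log2(1.0 / p))  # 1, 2, 3, 3 bits
```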

  5. Relative Entropy
• For distributions P_Y, Q_Y the relative entropy or information divergence is
    D(P_Y || Q_Y) = E_P[ log p(Y)/q(Y) ]
• It is non-negative: D(P || Q) ≥ 0, with equality iff P = Q
• It is the redundancy, the expected excess of the codelength log 1/q(Y) beyond the optimal log 1/p(Y) when Y ~ P
• It is the drop in wealth exponent when gambling according to Q on outcomes distributed according to P
• It is the exponent of the smallest Q-measure set that has most of the P probability (the exponent of the probability of error of the best test): Chernoff
• It is a standard measure of statistical loss for function estimation with normal errors and other statistical models (Kullback, Stein):
    D(θ* || θ) = D(P_{Y|θ*} || P_{Y|θ})
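A companion sketch computing D(P || Q) for two arbitrary illustrative discrete distributions, together with its reading as coding redundancy (expected excess codelength).

```python
# Sketch: relative entropy D(P || Q) = E_P[ log p(Y)/q(Y) ] for discrete P, Q,
# and its reading as redundancy: expected excess of log 1/q(Y) over log 1/p(Y)
# when Y ~ P.  The two distributions are arbitrary illustrations.
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])
q = np.array([0.25, 0.25, 0.25, 0.25])

D = np.sum(p * np.log(p / q))                          # nats; >= 0, = 0 iff p == q
redundancy = np.sum(p * (np.log(1 / q) - np.log(1 / p)))  # same quantity
print(D, redundancy)
```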

  6. Statistics Basics
• Data: Y = (Y_1, Y_2, ..., Y_n)
• Likelihood: p(Y|θ) = p(Y_1|θ) · p(Y_2|θ) · · · p(Y_n|θ)
• Maximum Likelihood Estimator (MLE): θ̂ = arg max_θ p(Y|θ)
• Same as arg min_θ log 1/p(Y|θ)
• MLE consistency (Wald 1948):
    θ̂ = arg min_θ (1/n) Σ_{i=1}^n log [ p(Y_i|θ*) / p(Y_i|θ) ] = arg min_θ D̂_n(θ* || θ)
  Now D̂_n(θ* || θ) → D(θ* || θ) as n → ∞ and D(θ* || θ̂_n) → 0
• Efficiency in smooth families: θ̂_n is asymptotically Normal(θ, (n I(θ))^{−1})
• Fisher information: I(θ) = E[ ∇ log p(Y|θ) ∇^T log p(Y|θ) ]
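A sketch of the consistency statement in the Bernoulli(θ) family, where everything is explicit: maximizing the likelihood over a grid is the same as minimizing the empirical divergence D̂_n(θ* || θ), and the MLE is approximately Normal(θ, (n I(θ))^{−1}). The true parameter, sample size, and grid are illustrative assumptions.

```python
# Sketch: for Bernoulli(theta), the MLE (the sample mean) minimizes the
# empirical divergence D_hat_n(theta* || theta)
#   = (1/n) sum_i log[ p(Y_i | theta*) / p(Y_i | theta) ],
# and is roughly Normal(theta, 1/(n I(theta))) with I(theta) = 1/(theta(1-theta)).
import numpy as np

rng = np.random.default_rng(1)
theta_star, n = 0.4, 500                  # illustrative choices
Y = rng.binomial(1, theta_star, size=n)

grid = np.linspace(0.01, 0.99, 981)
def loglik(th):                           # sum_i log p(Y_i | th)
    return np.sum(Y) * np.log(th) + (n - np.sum(Y)) * np.log(1 - th)

D_hat = (loglik(theta_star) - np.array([loglik(th) for th in grid])) / n
theta_hat = grid[np.argmin(D_hat)]        # same as arg max of the likelihood

I = 1.0 / (theta_star * (1 - theta_star))            # Fisher information
print(theta_hat, Y.mean(), np.sqrt(1.0 / (n * I)))   # MLE ~ sample mean, sd ~ 0.022
```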

  7. Statistics Basics
• Data: Y = Y^n = (Y_1, Y_2, ..., Y_n)
• Likelihood: p(Y|θ), θ ∈ Θ
• Prior: p(θ) = w(θ)
• Marginal: p(Y) = ∫ p(Y|θ) w(θ) dθ, the Bayes mixture
• Posterior: p(θ|Y) = w(θ) p(Y|θ) / p(Y)
• Parameter loss function: ℓ(θ, θ̂), for instance squared error (θ − θ̂)^2
• Bayes parameter estimator: θ̂ achieves min_θ̂ E[ℓ(θ, θ̂) | Y]; for squared error
    θ̂ = E[θ|Y] = ∫ θ p(θ|Y) dθ
• Density loss function: ℓ(P, Q), for instance D(P || Q)
• Bayes density estimator: p̂(y) = p(y|Y) achieves min_Q E[ℓ(P, Q) | Y]
    p̂(y) = ∫ p(y|θ) p(θ|Y^n) dθ
• Predictive coherence: the Bayes estimator is the predictive density p(Y_{n+1}|Y^n) evaluated at Y_{n+1} = y
• Other loss functions do not share this property
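A minimal sketch of these Bayes quantities in the conjugate Beta-Bernoulli case, where the posterior, the squared-error Bayes estimator, and the predictive density have closed forms; the Beta(1,1) prior and the data-generating parameter are illustrative assumptions, not choices made on the slides.

```python
# Sketch: Bayes basics in the conjugate Beta-Bernoulli case.
#   posterior:   p(theta | Y^n) = Beta(a + k, b + n - k), k = number of successes
#   Bayes estimator (squared error): posterior mean E[theta | Y^n]
#   Bayes predictive density: p_hat(y) = integral p(y|theta) p(theta|Y^n) dtheta
import numpy as np

rng = np.random.default_rng(2)
a, b = 1.0, 1.0                           # uniform prior w(theta), illustrative
theta_true, n = 0.3, 50                   # illustrative
Y = rng.binomial(1, theta_true, size=n)
k = Y.sum()

post_mean = (a + k) / (a + b + n)         # Bayes estimator under squared error
pred_one = (a + k) / (a + b + n)          # predictive P(Y_{n+1} = 1 | Y^n)
# For a Bernoulli observable the predictive probability of a 1 equals the
# posterior mean, so the two printed numbers coincide (predictive coherence
# in its simplest form).
print(post_mean, pred_one, 1 - pred_one)  # p_hat(1) and p_hat(0)
```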

  8. Chain Rules for Entropy and Relative Entropy
• For joint densities
    p(Y_1, Y_2, ..., Y_N) = p(Y_1) p(Y_2|Y_1) · · · p(Y_N|Y_{N−1}, ..., Y_1)
• Taking the expectation, this is
    H(Y_1, Y_2, ..., Y_N) = H(Y_1) + H(Y_2|Y_1) + ... + H(Y_N|Y_{N−1}, ..., Y_1)
• The joint entropy grows like H·N for stationary processes
• For the relative entropy between distributions for a string Y = Y^N = (Y_1, ..., Y_N) we have the chain rule
    D(P_Y || Q_Y) = Σ_n E_P D(P_{Y_{n+1}|Y^n} || Q_{Y_{n+1}|Y^n})
• Thus the total divergence is a sum of contributions in which the predictive distributions Q_{Y_{n+1}|Y^n}, based on the previous n data points, are measured for their quality of fit to P_{Y_{n+1}|Y^n} for each n less than N
• With good predictive distributions we can arrange D(P_{Y^N} || Q_{Y^N}) to grow at rates slower than N simultaneously for various P
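A quick numerical check of the entropy chain rule H(Y_1, Y_2) = H(Y_1) + H(Y_2|Y_1) on a small joint pmf; the 2x2 table is an arbitrary illustration.

```python
# Sketch: verify H(Y1, Y2) = H(Y1) + H(Y2 | Y1) on a small joint pmf.
import numpy as np

P = np.array([[0.3, 0.1],
              [0.2, 0.4]])                # P[y1, y2], sums to 1 (illustrative)

H = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))   # entropy in nats

H_joint = H(P)
p1 = P.sum(axis=1)                        # marginal of Y1
H1 = H(p1)
H2_given_1 = sum(p1[i] * H(P[i] / p1[i]) for i in range(len(p1)))
print(H_joint, H1 + H2_given_1)           # the two numbers agree
```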

  9. Tying data compression to statistical learning
• Various plug-in estimators p̂_n(y) = p(y|θ̂_n) and Bayes predictive estimators
    p̂_n(y) = q(y|Y^n) = ∫ p(y|θ) p(θ|Y^n) dθ
  achieve individual risk
    D(P_{Y|θ} || P̂_n) ~ c/n
  ideally with asymptotic constant c = d/2, where d is the parameter dimension (more on that ideal constant later)
• Successively evaluating the predictive densities q(Y_{n+1}|Y^n), these pieces fit together to give a joint density q(Y^N) with total divergence
    D(P_{Y^N|θ} || Q_{Y^N}) ~ c log N
• Conversely, from any coding distribution Q_{Y^N} with good redundancy D(P_{Y^N|θ} || Q_{Y^N}), a succession of predictive estimators can be obtained
• Similar conclusions hold for nonparametric function estimation problems
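A sketch of the c log N behavior in the simplest case d = 1: for a Bernoulli source and the uniform-prior Bayes mixture Q (Laplace's rule as the predictive), a Monte Carlo estimate of D(P_{Y^N|θ} || Q_{Y^N}) grows at the (1/2) log N rate. The parameter value, sample sizes, and replication count are illustrative.

```python
# Sketch: redundancy of the uniform-prior Bayes mixture for Bernoulli(theta),
#   D(P_{Y^N|theta} || Q_{Y^N}) = E[ log p(Y^N|theta) / q(Y^N) ],
# where q(y^N) = integral theta^k (1-theta)^(N-k) dtheta = Beta(k+1, N-k+1)
# and k is the (sufficient) success count of the sequence.
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(3)
theta = 0.3                                # illustrative

def redundancy(N, reps=2000):
    k = rng.binomial(N, theta, size=reps)                        # success counts
    log_p = k * np.log(theta) + (N - k) * np.log(1 - theta)      # log p(Y^N | theta)
    log_q = gammaln(k + 1) + gammaln(N - k + 1) - gammaln(N + 2) # log q(Y^N)
    return np.mean(log_p - log_q)

for N in (100, 1000, 10000):
    print(N, redundancy(N), 0.5 * np.log(N))  # both grow at the (1/2) log N rate
```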

  10. Local Information, Estimation, and Efficiency
• The Fisher information I(θ) = I_Fisher(θ) arises naturally in local analysis of Shannon information and related statistics problems
• In smooth families the relative entropy loss is locally a squared error:
    D(θ || θ̂) ~ (1/2)(θ − θ̂)^T I(θ)(θ − θ̂)
• Efficient estimates have asymptotic covariance not more than I(θ)^{−1}
• If it is smaller than that at some θ, the estimator is said to be superefficient
• The expectation of the asymptotic distribution of the right side above is d/(2n)
• The set of parameter values with smaller asymptotic covariance is negligible, in the sense that it has zero measure
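A sketch of the local quadratic approximation in the Bernoulli family, comparing the exact relative entropy with (1/2)(θ − θ')² I(θ); the base parameter and offsets are illustrative.

```python
# Sketch: locally, relative entropy is half a Fisher-information quadratic.
# For Bernoulli: D(theta || theta') vs (1/2)(theta - theta')^2 I(theta),
# with I(theta) = 1/(theta(1-theta)).
import numpy as np

def D(t, s):                              # binary relative entropy, nats
    return t * np.log(t / s) + (1 - t) * np.log((1 - t) / (1 - s))

theta = 0.3                               # illustrative
I = 1.0 / (theta * (1 - theta))
for delta in (0.1, 0.03, 0.01):
    s = theta + delta
    print(delta, D(theta, s), 0.5 * delta**2 * I)   # agree as delta shrinks
```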

  11. Efficiency of Estimation via Info Theory Analysis
• LeCam 1950s: efficiency of Bayes and maximum likelihood estimators; negligibility of superefficiency for bounded loss and any efficient estimator
• Hengartner and Barron 1998: negligibility of superefficiency for any parameter estimator using E D(θ || θ̂) and any density estimator using E D(P || P̂_n)
• The set of parameter values for which n E D(P_{Y|θ} || P̂_n) has limit not smaller than d/2 includes all but a negligible set of θ
• The proof does not require a Fisher information, yet it corresponds to the classical conclusion when one exists
• The efficient level comes from coarse covering properties of Euclidean space
• The core of the proof is the chain rule plus a result of Rissanen
• Rissanen 1986: no choice of joint distribution achieves D(P_{Y^N|θ} || Q_{Y^N}) better than (d/2) log N except in a negligible set of θ
• The proof works also for nonparametric problems
• Negligibility of superefficiency is determined by the sparsity of its cover

  12. Mutual Information and Information Capacity
• We shall need two additional quantities in our discussion of information theory and statistics: the Shannon mutual information I and the information capacity C

  13. Shannon Mutual Information
• For a family of distributions P_{Y|U} of a random variable Y given an input U distributed according to P_U, the Shannon mutual information is
    I(Y;U) = D(P_{U,Y} || P_U P_Y) = E_U D(P_{Y|U} || P_Y)
• In communications, it is the rate, the exponent of the number of input strings U that can be reliably communicated across a channel P_{Y|U}
• It is the error probability exponent with which a random U erroneously passes the test of being jointly distributed with a received string Y
• In data compression, I(Y;θ) is the Bayes average redundancy of the code based on the mixture P_Y when θ = U is unknown
• In a game with relative entropy loss, it is the Bayes optimal value, corresponding to the Bayes mixture P_Y being the choice of Q_Y achieving
    I(Y;θ) = min_{Q_Y} E_θ D(P_{Y|θ} || Q_Y)
• Thus it is the average divergence from the centroid P_Y
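A small sketch computing I(Y;U) via the divergence-from-the-centroid form above, for a binary symmetric channel with crossover probability 0.1 and a uniform input (both illustrative choices).

```python
# Sketch: I(Y;U) = E_U D(P_{Y|U} || P_Y) for a small discrete channel.
import numpy as np

P_U = np.array([0.5, 0.5])                 # illustrative input distribution
P_Y_given_U = np.array([[0.9, 0.1],        # row u: distribution of Y given U = u
                        [0.1, 0.9]])       # binary symmetric channel, crossover 0.1

P_Y = P_U @ P_Y_given_U                    # mixture (centroid) P_Y
D = np.sum(P_Y_given_U * np.log(P_Y_given_U / P_Y), axis=1)   # D(P_{Y|u} || P_Y)
I = np.sum(P_U * D)                        # average divergence from the centroid
print(I, np.log(2) - (-0.9 * np.log(0.9) - 0.1 * np.log(0.1)))  # equals log 2 - h(0.1)
```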

  14. Information Capacity
• For a family of distributions P_{Y|U} the Shannon information capacity is
    C = max_{P_U} I(Y;U)
• It is the communications capacity, the maximum rate that can be reliably communicated across the channel
• In the relative entropy game it is the maximin value
    C = max_{P_θ} min_{Q_Y} E_{P_θ} D(P_{Y|θ} || Q_Y)
• Accordingly, it is also the minimax value
    C = min_{Q_Y} max_θ D(P_{Y|θ} || Q_Y)
• Also known as the information radius of the family P_{Y|θ}
• In data compression, this means that C = max_{P_θ} I(Y;θ) is also the minimax redundancy for the family P_{Y|θ} (Gallager; Ryabko; Davisson)
• In recent years the information capacity has been shown to also answer questions in statistics, as we shall discuss
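To make C concrete, here is a sketch using the standard Blahut-Arimoto iteration (an algorithm not mentioned on the slides) to compute the capacity of a small discrete channel; the binary symmetric channel with crossover 0.1 is an illustrative example, whose known capacity 1 − h(0.1) ≈ 0.531 bits is recovered.

```python
# Sketch: C = max_{P_U} I(Y;U) for a discrete channel via Blahut-Arimoto.
# Assumes strictly positive channel entries W[u, y] = P(Y = y | U = u).
import numpy as np

def capacity(W, iters=500):
    """Returns (C in nats, capacity-achieving input distribution)."""
    m = W.shape[0]
    p = np.full(m, 1.0 / m)                      # start from the uniform input
    for _ in range(iters):
        q = p @ W                                # current output distribution P_Y
        D = np.sum(W * np.log(W / q), axis=1)    # D(P_{Y|u} || P_Y) for each input u
        p = p * np.exp(D)
        p /= p.sum()                             # reweight inputs toward larger divergence
    q = p @ W
    return np.sum(p * np.sum(W * np.log(W / q), axis=1)), p

W = np.array([[0.9, 0.1],                        # binary symmetric channel, crossover 0.1
              [0.1, 0.9]])
C, p_star = capacity(W)
print(C / np.log(2), p_star)                     # about 0.531 bits, uniform input
```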
