Introduction to information theory and empirical statistical theory


1. Introduction to information theory and empirical statistical theory
Nan Chen and Andrew J. Majda
Center for Atmosphere Ocean Science, Courant Institute of Mathematical Sciences, New York University
October 13, 2016

2. 6.1 Introduction
Goal of this lecture: lay down the statistical theories of geophysical flows and predictability that will be developed in the next few lectures.
◮ The objective of the statistical theories is the prediction of the most probable steady state that will develop in the flow under the constraints imposed by a few bulk properties of the flow, such as averaged conserved quantities like energy and enstrophy.
◮ Here we will adapt the information-theoretic approach developed by Shannon (1948), Shannon and Weaver (1949), and Jaynes (1957). The most probable state can be selected in a natural manner as the least biased probability measure that is consistent with the given external constraints imposed in the theory.

3. A motivating example
[Figure: two cases, each showing histograms of a truth ensemble (blue) and a prediction ensemble (red) together with the fitted density curves; the difference of the ensemble means $|\mu_1 - \mu_2|$ is larger in Case I than in Case II.]
Question: If the blue ensembles are the truth and the red ensembles are the prediction, which case has a better prediction?
Regarding the mean state, the answer is Case II. But intuitively ... ?
Information theory is needed to quantify the uncertainty and assess the least biased prediction given some statistical constraints.
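One standard way to make this comparison precise is the relative entropy between the truth and each prediction. The lecture does not give the ensemble parameters, so the Gaussian means and variances below are hypothetical (Case II is assumed to be overconfident, i.e., too narrow); the sketch only illustrates that relative entropy penalizes errors in both the mean and the spread, and the verdict depends entirely on the assumed spreads.

```python
import numpy as np

def gaussian_relative_entropy(mu_t, var_t, mu_p, var_p):
    """Relative entropy of N(mu_t, var_t) with respect to N(mu_p, var_p), in 1-D."""
    return 0.5 * (np.log(var_p / var_t) + (var_t + (mu_t - mu_p) ** 2) / var_p - 1.0)

# Hypothetical numbers in the spirit of the figure: Case I has the larger mean
# error but a spread close to the truth; Case II has the smaller mean error but
# an assumed overconfident (too narrow) spread.
truth   = dict(mu=0.0, var=1.0)
case_I  = dict(mu=2.0, var=1.0)
case_II = dict(mu=0.5, var=0.1)

for name, pred in [("Case I", case_I), ("Case II", case_II)]:
    d = gaussian_relative_entropy(truth["mu"], truth["var"], pred["mu"], pred["var"])
    print(f"{name}: relative entropy from the truth = {d:.3f}")
```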

4. Claude Elwood Shannon
◮ Claude Elwood Shannon (1916–2001) was an American mathematician, electronic engineer, and cryptographer known as "the father of information theory".
◮ The field was pioneered by Shannon in his 1948 article "A mathematical theory of communication", Bell System Technical Journal.
◮ The article was later published in 1949 as a book by Shannon and Warren Weaver, "The Mathematical Theory of Communication", Univ. of Illinois Press.
◮ Shannon's 1951 article "Prediction and Entropy of Printed English" applied information theory to natural language.

5. Shannon's other work, hobbies and inventions:
◮ Shannon's mouse
◮ Computer chess program
◮ Juggling
◮ Using information theory to calculate odds while gambling
◮ Ultimate machine
◮ ...

6. 6.2 Information theory and Shannon's entropy
Definition 6.1 (The Shannon entropy). Let $p$ be a finite, discrete probability measure on the sample space $A = \{a_1, \ldots, a_n\}$,
$$ p = \sum_{i=1}^{n} p_i \, \delta_{a_i}, \qquad p_i \ge 0, \qquad \sum_{i=1}^{n} p_i = 1. \tag{6.2} $$
The Shannon entropy $S(p)$ of the probability $p$ is defined as
$$ S(p) = S(p_1, \ldots, p_n) = - \sum_{i=1}^{n} p_i \ln p_i. \tag{6.3} $$
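As a concrete illustration (not part of the lecture), a minimal numerical sketch of Definition 6.1, using the usual convention $0 \ln 0 = 0$:

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy S(p) = -sum_i p_i ln p_i of a discrete probability vector."""
    p = np.asarray(p, dtype=float)
    assert np.all(p >= 0) and np.isclose(p.sum(), 1.0), "p must be a probability vector"
    nonzero = p > 0                      # convention: 0 * ln 0 = 0
    return -np.sum(p[nonzero] * np.log(p[nonzero]))

print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # ln 4 ~ 1.386, the maximum for n = 4
print(shannon_entropy([0.7, 0.1, 0.1, 0.1]))      # smaller: less uncertainty
print(shannon_entropy([1.0, 0.0, 0.0, 0.0]))      # 0: no uncertainty at all
```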

7. Shannon's intuition from the theory of communication
Represent a "word" in a message as a sequence of binary digits with length $n$.
◮ The set $A_{2^n}$ of all words of length $n$ has $2^n = N$ elements.
◮ The amount of information needed to characterize one element is $n = \log_2 N$.
Example: 7-digit binary words such as 1010101, 0111010, ...; here $n = 7$ and $N = 2^7 = 128$.
Following this type of reasoning, the amount of information needed to characterize an element of any set $A_N$ is $n = \log_2 N$ for general $N$.

8. ◮ Now consider a set $A = A_{N_1} \cup \cdots \cup A_{N_k}$, where the sets $A_{N_i}$ are pairwise disjoint and $A_{N_i}$ has $N_i$ elements.
◮ Set $p_i = N_i / N$, where $N = \sum_i N_i$.
Example: $A = A_{N_1} \cup A_{N_2}$ with $N_1 = 2^3 = 8$ and $N_2 = 2^4 = 16$, so $N = 24$, $p_1 = 8/24$, $p_2 = 16/24$.
◮ If we know that an element of $A$ belongs to some $A_{N_i}$, then we need $\log_2 N_i$ additional information to determine it completely.
◮ The average amount of information we need to determine an element, provided that we already know the $A_{N_i}$ to which it belongs, is
$$ \sum_i \frac{N_i}{N} \log_2 N_i = \sum_i \frac{N_i}{N} \log_2 \Big( \frac{N_i}{N} \cdot N \Big) = \sum_i p_i \log_2 p_i + \log_2 N. $$
◮ Recall that $\log_2 N$ is the information that we need to determine an element of $A$ if we do not know to which $A_{N_i}$ it belongs.
◮ Thus, the corresponding average lack of information is $- \sum_i p_i \log_2 p_i$.
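A quick numerical check of the identity above, using the slide's example $N_1 = 8$, $N_2 = 16$ (the code is only an illustration):

```python
import numpy as np

# Example from the slide: A = A_{N1} U A_{N2} with N1 = 8, N2 = 16, N = 24.
sizes = np.array([8.0, 16.0])
N = sizes.sum()
p = sizes / N

# Average information needed once we know which A_{N_i} the element belongs to ...
avg_within = np.sum(p * np.log2(sizes))
# ... equals sum_i p_i log2 p_i + log2 N, as derived on the slide.
decomposed = np.sum(p * np.log2(p)) + np.log2(N)

print(avg_within, decomposed)                                    # both ~ 3.67 bits
print("average lack of information:", -np.sum(p * np.log2(p)))   # ~ 0.92 bits
```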

9. Proposition 6.1 (Uniqueness of the Shannon entropy). Let $H_n$ be a function defined on the space of discrete probability measures $\mathcal{PM}_n(A)$ and satisfying the following properties:
1. $H_n(p_1, \ldots, p_n)$ is a continuous function.
2. $A(n) = H_n(1/n, \ldots, 1/n)$ is monotonically increasing in $n$, i.e., $H_n$ increases with increasing uncertainty,
$$ H_n(1/n, \ldots, 1/n) > H_{n'}(1/n', \ldots, 1/n'), \qquad n > n'. $$
3. Composite rule:
$$ H_n(p_1, \ldots, p_k, p_{k+1}, \ldots, p_n) = H_2(w_1, w_2) + w_1 H_k(p_1/w_1, \ldots, p_k/w_1) + w_2 H_{n-k}(p_{k+1}/w_2, \ldots, p_n/w_2), $$
where $w_1 = p_1 + \ldots + p_k$, $w_2 = p_{k+1} + \ldots + p_n$, and $p_i / w_j$ is the conditional probability.
Then $H_n$ is a positive multiple of the Shannon entropy,
$$ H_n(p_1, \ldots, p_n) = K\, S(p_1, \ldots, p_n) = -K \sum_{i=1}^{n} p_i \ln p_i, \qquad K > 0. $$
Key of the proof: the composite rule implies $A(nv) = A(n) + A(v)$, and the only such function is $A(n) = K \ln n$. See [Majda & Wang, 2006], pages 186–187, for the details of the proof.
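As a sanity check (not from the lecture), the Shannon entropy indeed satisfies the composite rule; a minimal numerical verification with a randomly drawn probability vector:

```python
import numpy as np

def shannon_entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p[p > 0] * np.log(p[p > 0]))

rng = np.random.default_rng(0)
p = rng.random(6)
p /= p.sum()                       # a random probability vector, n = 6
k = 3                              # split into the first k and the last n - k entries
w1, w2 = p[:k].sum(), p[k:].sum()

lhs = shannon_entropy(p)
rhs = (shannon_entropy([w1, w2])
       + w1 * shannon_entropy(p[:k] / w1)
       + w2 * shannon_entropy(p[k:] / w2))

print(lhs, rhs)                    # agree to machine precision
```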

10. Definition 6.2 (Empirical maximum entropy principle). With a given probability measure $p \in \mathcal{PM}(A)$, the expected value, or statistical measurement, of $f$ with respect to $p$ is given by
$$ \langle f \rangle_p = \sum_{i=1}^{n} f(a_i)\, p_i. $$
A practical issue is to look for the least biased probability distribution consistent with certain statistical constraints,
$$ \mathcal{C}_L = \big\{ p \in \mathcal{PM}(A) \;\big|\; \langle f_j \rangle_p = F_j, \ 1 \le j \le L \big\}. $$
The least biased probability distribution $p_L$ given the constraints $\mathcal{C}_L$ is the maximum entropy distribution,
$$ \max_{p \in \mathcal{C}_L} S(p) = S(p_L), \qquad p_L \in \mathcal{C}_L. $$
Remark: uncertainty decreases as more information is included¹,
$$ S(p_L) \le S(p_{L'}), \qquad \text{for } L' < L. $$
¹ In thermodynamics, entropy is commonly associated with the amount of order, disorder, or chaos in a thermodynamic system.
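For illustration only, the constrained maximization in Definition 6.2 can also be carried out numerically. The sketch below uses hypothetical constraint functions $f_1(a) = a$ and $f_2(a) = a^2$ with made-up measurements $F_1 = 4.5$, $F_2 = 22$ on a six-point sample space, and exhibits the remark $S(p_L) \le S(p_{L'})$:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical sample space a_i = 1, ..., 6 with constraint functions
# f_1(a) = a and f_2(a) = a^2, and prescribed measurements F_1 = 4.5, F_2 = 22.
a = np.arange(1, 7, dtype=float)

def neg_entropy(p):
    p = np.clip(p, 1e-12, None)          # guard the log at the boundary p_i = 0
    return np.sum(p * np.log(p))

def max_entropy(constraints):
    p0 = np.full(6, 1.0 / 6)
    res = minimize(neg_entropy, p0, bounds=[(0.0, 1.0)] * 6,
                   constraints=constraints, method="SLSQP")
    return -res.fun                      # the maximized entropy S(p_L)

norm = {"type": "eq", "fun": lambda p: p.sum() - 1.0}
c1 = {"type": "eq", "fun": lambda p: np.dot(a, p) - 4.5}
c2 = {"type": "eq", "fun": lambda p: np.dot(a**2, p) - 22.0}

S_L1 = max_entropy([norm, c1])           # one statistical constraint
S_L2 = max_entropy([norm, c1, c2])       # two statistical constraints
print(S_L1, S_L2)                        # S(p_2) <= S(p_1): uncertainty decreases
```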

11. Example 1. Find the least biased probability distribution $p$ on $A = \{a_1, \ldots, a_n\}$ with no additional constraints.
$$ \text{Maximize} \quad S(p_1, \ldots, p_n) = -\sum_{i=1}^{n} p_i \ln p_i, $$
$$ \text{subject to} \quad \sum_{i=1}^{n} p_i = 1 \quad \text{and all } p_i \ge 0. $$

12. Example 1 (continued). Construct the Lagrange function
$$ L = -\sum_{i=1}^{n} p_i \ln p_i - \lambda \Big( \sum_{i=1}^{n} p_i - 1 \Big). $$
The maximizer $p^*$ satisfies
$$ \frac{\partial L}{\partial p_i} = 0, \qquad i = 1, \ldots, n, $$
which results in
$$ \ln p_i^* + 1 + \lambda = 0, \qquad i = 1, \ldots, n. $$
This implies that all the probabilities $p_i^*$ are equal, and therefore $p_i^* = 1/n$. Uniform distribution!
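A quick numerical illustration (not part of the lecture) that no probability vector on $n$ points beats the uniform distribution, whose entropy is $\ln n$:

```python
import numpy as np

def shannon_entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p[p > 0] * np.log(p[p > 0]))

n = 5
rng = np.random.default_rng(1)

# Entropy of the uniform distribution is ln n, the claimed maximum.
print("uniform:", shannon_entropy(np.full(n, 1.0 / n)), "  ln n =", np.log(n))

# No randomly drawn probability vector on n points should exceed it.
samples = rng.dirichlet(np.ones(n), size=10_000)
print("best random sample:", max(shannon_entropy(q) for q in samples))
```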

13. Example 2. Find the least biased probability distribution $p$ on $A = \{a_1, \ldots, a_n\}$ subject to the $r + 1$ constraints ($r + 1 \le n$),
$$ F_j = \langle f_j \rangle_p = \sum_{i=1}^{n} f_j(a_i)\, p_i, \quad j = 1, \ldots, r, \qquad \sum_{i=1}^{n} p_i = 1. $$

14. Example 2 (continued). Construct the Lagrange function
$$ L = -\sum_{i=1}^{n} p_i \ln p_i - \lambda_0 \Big( \sum_{i=1}^{n} p_i - 1 \Big) - \sum_{j=1}^{r} \lambda_j \big( \langle f_j \rangle_p - F_j \big). $$
Componentwise, $\partial L / \partial p_i = 0$ yields a system of $n$ equations for the unknowns $p_i^*$,
$$ \ln p_i^* = -\sum_{j=1}^{r} \lambda_j f_j(a_i) - (\lambda_0 + 1), \qquad i = 1, \ldots, n, $$
and solving for $p_i^*$ we obtain
$$ p_i^* = \exp\Big( -\sum_{j=1}^{r} \lambda_j f_j(a_i) - (\lambda_0 + 1) \Big). \qquad \text{Exponential family!} $$
To eliminate the multiplier $\lambda_0$, we utilize the constraint that the sum of all the probabilities $p_i^*$ is 1. This simplifies the formula for $p_i^*$,
$$ p_i^* = \frac{ \exp\big( -\sum_{j=1}^{r} \lambda_j f_j(a_i) \big) }{ \sum_{i=1}^{n} \exp\big( -\sum_{j=1}^{r} \lambda_j f_j(a_i) \big) }. $$
The other Lagrange multipliers $\lambda_j$ must be determined through the constraint equations (generally a non-trivial task).
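Determining the multipliers numerically is often the practical route. The sketch below, which is only an illustration and not from the lecture, fits a one-constraint maximum entropy distribution (the classical "loaded die" setup with a prescribed mean) by solving the scalar constraint equation for $\lambda$ with SciPy:

```python
import numpy as np
from scipy.optimize import brentq

# Sample space a_1, ..., a_n and one statistical constraint <f>_p = F.
a = np.arange(1, 7)            # e.g. faces of a die
f = a.astype(float)            # f(a_i) = a_i, so the constraint fixes the mean
F = 4.5                        # prescribed mean

def max_ent_p(lam):
    """Exponential-family form p_i ~ exp(-lam * f(a_i)), normalized."""
    w = np.exp(-lam * f)
    return w / w.sum()

def constraint_residual(lam):
    return np.dot(f, max_ent_p(lam)) - F

# Solve the scalar constraint equation for the multiplier lambda.
lam = brentq(constraint_residual, -5.0, 5.0)
p_star = max_ent_p(lam)
print("lambda =", lam)
print("p* =", np.round(p_star, 4), " mean =", np.dot(f, p_star))
```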

15. 6.3 Most probable states with prior distribution
Example 3. Let $A = A_1 \cup A_2$, where $A_1$, $A_2$ represent disjoint sample spaces which are completely unrelated,
$$ A_1 = \{a_1, \ldots, a_l\}, \qquad A_2 = \{a_{l+1}, \ldots, a_n\}. $$
By applying the maximum entropy principle to each set, we get the least biased measure for each set, i.e.,
$$ A_1: \quad p_0^{(1)} = \frac{1}{l} \sum_{j=1}^{l} \delta_{a_j}, \qquad A_2: \quad p_0^{(2)} = \frac{1}{n-l} \sum_{j=1}^{n-l} \delta_{a_{l+j}}. $$
If we know that statistical measurements on each set are equally important, then the least biased probability measure for the whole set should be
$$ p_0 = \frac{1}{2} \cdot \frac{1}{l} \sum_{j=1}^{l} \delta_{a_j} + \frac{1}{2} \cdot \frac{1}{n-l} \sum_{j=1}^{n-l} \delta_{a_{l+j}}. $$
For example, if $A_1 = \{a_1, a_2, a_3\}$ and $A_2 = \{a_4, a_5\}$, then $p(a_1) = p(a_2) = p(a_3) = 1/6$ and $p(a_4) = p(a_5) = 1/4$.
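A tiny numerical illustration of the combined measure for the example just given (the construction follows the slide; the code itself is only an illustration):

```python
import numpy as np

l, n = 3, 5                          # A_1 = {a_1, a_2, a_3}, A_2 = {a_4, a_5}
p1 = np.concatenate([np.full(l, 1.0 / l), np.zeros(n - l)])        # uniform on A_1
p2 = np.concatenate([np.zeros(l), np.full(n - l, 1.0 / (n - l))])  # uniform on A_2

# Equal weight 1/2 on each subset gives the combined least biased measure.
p0 = 0.5 * p1 + 0.5 * p2
print(p0)                            # [1/6, 1/6, 1/6, 1/4, 1/4]
```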
