SLIDE 1
Introduction to Information Theory and Empirical Statistical Theory
Nan Chen and Andrew J. Majda
Center for Atmosphere Ocean Science, Courant Institute of Mathematical Sciences, New York University
October 13, 2016
SLIDE 2
6.1 Introduction
SLIDE 3
A motivating example.
[Figure: ensembles and the associated PDFs for the truth (blue) and the prediction (red). Case I: |µ1 − µ2| = 2; Case II: |µ1 − µ2| = 1.]
Question: If the blue ensembles are the truth and the red ensembles are the prediction, which case has the better prediction? Regarding the mean state, the answer is Case II. But intuitively ... ? Information theory is needed to quantify the uncertainty and assess the least biased prediction given some statistical constraints.
SLIDE 4
Claude Elwood Shannon
◮ Claude Elwood Shannon (1916 – 2001) was an American mathematician, electrical engineer, and cryptographer known as "the father of information theory".
◮ Information theory was pioneered by Shannon in his 1948 article, "A Mathematical Theory of Communication", Bell System Technical Journal.
◮ The article was later published in 1949 as a book by Shannon and Warren Weaver, "The Mathematical Theory of Communication", University of Illinois Press.
◮ Shannon's 1951 article "Prediction and Entropy of Printed English" applied information theory to natural language.
SLIDE 5
Shannon's other work, hobbies, and inventions:
◮ Shannon's mouse
◮ Computer chess program
◮ Juggling
◮ Using information theory to calculate odds while gambling
◮ The Ultimate Machine
◮ ...
SLIDE 6
6.2 Information theory and Shannon’s entropy
Definition 6.1 (The Shannon entropy)
Let $p$ be a finite, discrete probability measure on the sample space $A = \{a_1, \ldots, a_n\}$,
$$p = \sum_{i=1}^{n} p_i \delta_{a_i}, \qquad p_i \geq 0, \qquad \sum_{i=1}^{n} p_i = 1. \tag{6.2}$$
The Shannon entropy $S(p)$ of the probability $p$ is defined as
$$S(p) = S(p_1, \ldots, p_n) = -\sum_{i=1}^{n} p_i \ln p_i. \tag{6.3}$$
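As a quick numerical companion to Definition 6.1 (not part of the original slides), here is a minimal Python sketch of the Shannon entropy of a discrete distribution; the function name and the sample distributions are illustrative assumptions.

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy S(p) = -sum_i p_i ln p_i of a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    assert np.all(p >= 0) and np.isclose(p.sum(), 1.0), "p must be a probability vector"
    p = p[p > 0]  # terms with p_i = 0 contribute nothing (convention 0 ln 0 = 0)
    return -np.sum(p * np.log(p))

print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # ln(4) ~ 1.386, maximal for n = 4
print(shannon_entropy([0.70, 0.10, 0.10, 0.10]))  # smaller: less uncertainty
```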
SLIDE 7
Shannon's intuition from the theory of communication
Represent a "word" in a message as a sequence of binary digits of length $n$.
◮ The set $A_{2^n}$ of all words of length $n$ has $2^n = N$ elements.
◮ The amount of information needed to characterize one element is $n = \log_2 N$. Example: $n = 7$, $N = 128$.
Following this type of reasoning, the amount of information needed to characterize an element of any set $A_N$ is $n = \log_2 N$ for general $N$.
SLIDE 8
◮ Now consider a set $A = A_{N_1} \cup \cdots \cup A_{N_k}$, where the sets $A_{N_i}$ are pairwise disjoint and $A_{N_i}$ has $N_i$ elements.
◮ Set $p_i = N_i/N$, where $N = \sum_i N_i$.
Example: $A = A_{N_1} \cup A_{N_2}$ with $N = 24$, $N_1 = 2^3 = 8$ ($p_1 = 8/24$) and $N_2 = 2^4 = 16$ ($p_2 = 16/24$).
◮ If we know that an element of $A$ belongs to some $A_{N_i}$, then we need $\log_2 N_i$ additional information to determine it completely.
◮ The average amount of information we need to determine an element, provided that we already know the $A_{N_i}$ to which it belongs, is given by
$$\sum_i \frac{N_i}{N} \log_2 N_i = \sum_i \frac{N_i}{N} \log_2\Big(\frac{N_i}{N} \cdot N\Big) = \sum_i p_i \log_2 p_i + \log_2 N.$$
◮ Recall that $\log_2 N$ is the information that we need to determine an element of the set $A$ if we do not know to which $A_{N_i}$ a given element belongs.
◮ Thus, the corresponding average lack of information is $-\sum_i p_i \log_2 p_i$.
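The identity above is easy to check numerically. The following sketch (an illustration added here, not from the slides; the subset sizes are arbitrary) verifies that the average conditional information equals $\sum_i p_i \log_2 p_i + \log_2 N$.

```python
import numpy as np

N_i = np.array([8, 16, 40])          # sizes of the disjoint subsets (illustrative)
N = N_i.sum()
p = N_i / N

lhs = np.sum(p * np.log2(N_i))                # average info given the subset
rhs = np.sum(p * np.log2(p)) + np.log2(N)     # sum_i p_i log2 p_i + log2 N
print(np.isclose(lhs, rhs))  # True
```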
SLIDE 9
Proposition 6.1 (Uniqueness of the Shannon entropy)
Let $H_n$ be a function defined on the space of discrete probability measures $PM_n(A)$ and satisfying the following properties:
1. $H_n(p_1, \ldots, p_n)$ is a continuous function.
2. $A(n) = H_n(1/n, \ldots, 1/n)$ is monotonically increasing in $n$, i.e., $H_n$ increases with increasing uncertainty: $H_n(1/n, \ldots, 1/n) > H_{n'}(1/n', \ldots, 1/n')$ for $n > n'$.
3. Composite rule:
$$H_n(p_1, \ldots, p_k, p_{k+1}, \ldots, p_n) = H_2(w_1, w_2) + w_1 H_k(p_1/w_1, \ldots, p_k/w_1) + w_2 H_{n-k}(p_{k+1}/w_2, \ldots, p_n/w_2),$$
where $w_1 = p_1 + \cdots + p_k$, $w_2 = p_{k+1} + \cdots + p_n$, and $p_i/w_j$ are the conditional probabilities.
Then $H_n$ is a positive multiple of the Shannon entropy,
$$H_n(p_1, \ldots, p_n) = K\, S(p_1, \ldots, p_n) = -K \sum_{i=1}^{n} p_i \ln p_i, \qquad K > 0.$$
Key of the proof: the composite rule implies $A(n\nu) = A(n) + A(\nu)$, and the only function satisfying this condition is $A(n) = K \ln n$. See [Majda & Wang, 2006], pages 186-187, for the details of the proof.
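The composite rule in Proposition 6.1 can be checked numerically for the Shannon entropy. A small sketch (added for illustration; the split point k and the distribution are arbitrary choices):

```python
import numpy as np

def S(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

p = np.array([0.10, 0.20, 0.30, 0.25, 0.15])
k = 2
w1, w2 = p[:k].sum(), p[k:].sum()
lhs = S(p)
rhs = S([w1, w2]) + w1 * S(p[:k] / w1) + w2 * S(p[k:] / w2)
print(np.isclose(lhs, rhs))  # True: H_n = H_2(w1, w2) + w1 H_k(...) + w2 H_{n-k}(...)
```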
SLIDE 10
Definition 6.2 (Empirical maximum entropy principle)
With a given probability measure $p \in PM(A)$, the expected value, or statistical measurement, of $f$ with respect to $p$ is given by
$$\langle f \rangle_p = \sum_{i=1}^{n} f(a_i)\, p_i.$$
A practical issue is to look for the least biased probability distribution consistent with certain statistical constraints,
$$C_L = \big\{ p \in PM(A) \,\big|\, \langle f_j \rangle_p = F_j, \; 1 \leq j \leq L \big\}.$$
The least biased probability distribution $p_L$ given the constraints $C_L$ is the maximum entropy distribution,
$$\max_{p \in C_L} S(p) = S(p_L), \qquad p_L \in C_L.$$
Remark: Uncertainty decreases as more information is included¹:
$$S(p_L) \leq S(p_{L'}), \qquad \text{for } L' < L.$$
¹In thermodynamics, entropy is commonly associated with the amount of order, disorder, or chaos in a thermodynamic system.
SLIDE 11
Example 1. Find the least biased probability distribution $p$ on $A = \{a_1, \ldots, a_n\}$ with no additional constraints.
Maximize
$$S(p_1, \ldots, p_n) = -\sum_{i=1}^{n} p_i \ln p_i,$$
subject to
$$\sum_{i=1}^{n} p_i = 1 \quad \text{and all } p_i \geq 0.$$
SLIDE 12
Example 1. Find the least biased probability distribution $p$ on $A = \{a_1, \ldots, a_n\}$ with no additional constraints.
Maximize
$$S(p_1, \ldots, p_n) = -\sum_{i=1}^{n} p_i \ln p_i,$$
subject to
$$\sum_{i=1}^{n} p_i = 1 \quad \text{and all } p_i \geq 0.$$
Construct the Lagrange function
$$L = -\sum_{i=1}^{n} p_i \ln p_i - \lambda \sum_{i=1}^{n} p_i.$$
The maximizer $p^*$ satisfies $\partial L / \partial p_i = 0$, $i = 1, \ldots, n$, which results in
$$\ln p_i^* + 1 + \lambda = 0, \qquad i = 1, \ldots, n.$$
This implies all the probabilities $p_i^*$ are equal and therefore $p_i^* = 1/n$.
Uniform distribution!
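The analytical result can also be reproduced with a generic constrained optimizer. Below is a hedged sketch (not from the slides) that maximizes the entropy subject only to normalization using scipy; the value of n and the starting point are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize

n = 5

def neg_entropy(p):
    p = np.clip(p, 1e-12, None)      # guard against log(0)
    return np.sum(p * np.log(p))     # minimizing -S(p) maximizes the entropy

res = minimize(
    neg_entropy,
    x0=np.random.dirichlet(np.ones(n)),                 # a feasible starting point
    bounds=[(0.0, 1.0)] * n,
    constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0}],
)
print(res.x)  # ~ [0.2, 0.2, 0.2, 0.2, 0.2]: the uniform distribution
```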
SLIDE 13
Example 2. Find the least biased probability distribution $p$ on $A = \{a_1, \ldots, a_n\}$ subject to the $r + 1$ constraints ($r + 1 \leq n$),
$$F_j = \langle f_j \rangle_p = \sum_{i=1}^{n} f_j(a_i)\, p_i, \quad j = 1, \ldots, r, \qquad \sum_{i=1}^{n} p_i = 1.$$
SLIDE 14
Example 2. Find the least biased probability distribution $p$ on $A = \{a_1, \ldots, a_n\}$ subject to the $r + 1$ constraints ($r + 1 \leq n$),
$$F_j = \langle f_j \rangle_p = \sum_{i=1}^{n} f_j(a_i)\, p_i, \quad j = 1, \ldots, r, \qquad \sum_{i=1}^{n} p_i = 1.$$
Construct the Lagrange function
$$L = -\sum_{i=1}^{n} p_i \ln p_i - \lambda_0 \sum_{i=1}^{n} p_i - \sum_{j=1}^{r} \lambda_j \langle f_j \rangle_p.$$
Componentwise, $\partial L / \partial p_i = 0$ yields a system of $n$ equations for the unknowns $p_i^*$,
$$\ln p_i^* = -\sum_{j=1}^{r} \lambda_j f_j(a_i) - (\lambda_0 + 1), \qquad i = 1, \ldots, n,$$
and solving for $p_i^*$ we obtain
$$p_i^* = \exp\Big(-\sum_{j=1}^{r} \lambda_j f_j(a_i) - (\lambda_0 + 1)\Big).$$
Exponential family! To eliminate the multiplier $\lambda_0$, we utilize the constraint that the probabilities $p_i^*$ sum to one. This simplifies the formula for $p_i^*$:
$$p_i^* = \frac{\exp\big(-\sum_{j=1}^{r} \lambda_j f_j(a_i)\big)}{\sum_{i=1}^{n} \exp\big(-\sum_{j=1}^{r} \lambda_j f_j(a_i)\big)}.$$
The remaining Lagrange multipliers $\lambda_j$ must be determined through the constraint equations (generally a non-trivial task).
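For a single constraint the multiplier can be found with a one-dimensional root solve. The sketch below (an added illustration) computes the maximum entropy distribution on {1, ..., 6} with a prescribed mean; the target mean F = 4.5 is an arbitrary choice.

```python
import numpy as np
from scipy.optimize import brentq

a = np.arange(1, 7)          # sample points a_i = 1, ..., 6 with f(a_i) = a_i
F = 4.5                      # prescribed mean <f>_p = F (illustrative)

def mean_error(lam):
    w = np.exp(-lam * a)     # exponential-family weights exp(-lambda * f(a_i))
    p = w / w.sum()
    return p @ a - F

lam = brentq(mean_error, -5.0, 5.0)   # solve the constraint equation for lambda
p_star = np.exp(-lam * a)
p_star /= p_star.sum()
print(p_star, p_star @ a)    # tilted distribution with mean ~ 4.5
```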
SLIDE 15
6.3: Most probable states with prior distribution
Example 3: Let $A = A_1 \cup A_2$, where $A_1, A_2$ represent disjoint sample spaces which are completely unrelated, $A_1 = \{a_1, \ldots, a_l\}$, $A_2 = \{a_{l+1}, \ldots, a_n\}$. By applying the maximum entropy principle to each set, we get the least biased measure for each set, i.e.,
$$A_1:\; p^{(1)} = \sum_{j=1}^{l} \frac{1}{l}\, \delta_{a_j}, \qquad A_2:\; p^{(2)} = \sum_{j=1}^{n-l} \frac{1}{n-l}\, \delta_{a_{l+j}}.$$
If we know that statistical measurements on each set are equally important, then the least biased probability measure for the whole set should be
$$p_0 = \frac{1}{2}\sum_{j=1}^{l} \frac{1}{l}\, \delta_{a_j} + \frac{1}{2}\sum_{j=1}^{n-l} \frac{1}{n-l}\, \delta_{a_{l+j}}.$$
For example, if $A_1 = \{a_1, a_2, a_3\}$ and $A_2 = \{a_4, a_5\}$, then $p(a_1) = p(a_2) = p(a_3) = 1/6$ and $p(a_4) = p(a_5) = 1/4$.
SLIDE 16
The maximum entropy principle needs to be extended to accommodate the existence of additional external bias. Suppose we already have the external bias, with the weighted importance (or probability) of each sample point given by
$$p_0 = \sum_{i=1}^{n} p_i^0\, \delta_{a_i}.$$
Also assume we know $r$ additional constraints involving other measurements, i.e., $F_j = \langle f_j \rangle_p$, $j = 1, \ldots, r$. The goal is to select the least biased probability distribution $p^*$ consistent with the given measurements while retaining the external bias expressed by $p_0$. For this purpose we define the relative Shannon entropy,
$$S(p, p_0) = -\sum_{i=1}^{n} p_i \ln\Big(\frac{p_i}{p_i^0}\Big).$$
SLIDE 17
Definition 6.3 (Maximum relative entropy principle)
The least biased probability measure $p^*$, given all the constraints $C$ and the external bias, is the one that satisfies
$$\max_{p \in C} S(p, p_0) = S(p^*, p_0), \qquad \text{where } S(p, p_0) = -\sum_{i=1}^{n} p_i \ln\Big(\frac{p_i}{p_i^0}\Big).$$
SLIDE 18
Definition 6.3 (Maximum relative entropy principle)
The least biased probability measure $p^*$, given all the constraints $C$ and the external bias, is the one that satisfies
$$\max_{p \in C} S(p, p_0) = S(p^*, p_0), \qquad \text{where } S(p, p_0) = -\sum_{i=1}^{n} p_i \ln\Big(\frac{p_i}{p_i^0}\Big).$$
Proposition 1
If there is no real external bias, then the maximum relative entropy principle yields the same probability distribution $p^*$ as the maximum entropy principle. In fact, without external bias,
$$p_0 = \sum_{i=1}^{n} \frac{1}{n}\, \delta_{a_i}.$$
Then the relative entropy $S(p, p_0)$ differs from the entropy $S(p)$ by a constant,
$$S(p, p_0) = S(p) - \ln n.$$
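Proposition 1 is easy to confirm numerically. A short sketch (added for illustration; the test distribution is arbitrary) checks that S(p, p0) = S(p) − ln n when p0 is uniform:

```python
import numpy as np

def S(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def S_rel(p, p0):
    """Relative Shannon entropy S(p, p0) = -sum_i p_i ln(p_i / p0_i)."""
    p, p0 = np.asarray(p, dtype=float), np.asarray(p0, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log(p[mask] / p0[mask]))

p = np.array([0.4, 0.3, 0.2, 0.1])
p0 = np.full(4, 0.25)                              # uniform prior: no real external bias
print(np.isclose(S_rel(p, p0), S(p) - np.log(4)))  # True
```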
SLIDE 19
Definition 6.3 (Maximum relative entropy principle)
The least biased probability measure $p^*$, given all the constraints $C$ and the external bias, is the one that satisfies
$$\max_{p \in C} S(p, p_0) = S(p^*, p_0), \qquad \text{where } S(p, p_0) = -\sum_{i=1}^{n} p_i \ln\Big(\frac{p_i}{p_i^0}\Big).$$
Proposition 2
If the only constraint on the system is given by the external bias $p_0$, then the probability distribution predicted by the maximum relative entropy principle must be the external bias itself, $p^* = p_0$. In this case the only constraint is $\sum_{i=1}^{n} p_i = 1$. Utilizing the Lagrange multiplier,
$$-\nabla_p S(p, p_0)\big|_{p=p^*} + \lambda\, \nabla_p \Big(\sum_{i=1}^{n} p_i\Big)\Big|_{p=p^*} = 0,$$
and componentwise it yields
$$\ln\Big(\frac{p_i^*}{p_i^0}\Big) + (1 + \lambda) = 0, \qquad i = 1, \ldots, n.$$
This in turn implies that the ratio $p_i^*/p_i^0$ is a constant independent of $i$ and thus $p^* = p_0$.
SLIDE 20
6.4 Entropy for continuous measures on the line
A continuous probability density $\rho(\lambda)$ on the line satisfies the two requirements
$$\rho(\lambda) \geq 0, \qquad \int_{\mathbb{R}} \rho(\lambda)\, d\lambda = 1.$$
Let $\langle \cdot \rangle$ denote the expected value of a random variable over the probability space where the random variable $q$ is defined. Then
$$\langle F(q) \rangle = \int_{\mathbb{R}} F(\lambda)\, \rho(\lambda)\, d\lambda.$$
The mean $\bar q$ and variance $\sigma^2$ of a random variable are defined respectively by
$$\bar q \equiv \langle q \rangle = \int_{\mathbb{R}} \lambda\, \rho(\lambda)\, d\lambda, \qquad \sigma^2 \equiv \langle (q - \bar q)^2 \rangle = \int_{\mathbb{R}} (\lambda - \bar q)^2 \rho(\lambda)\, d\lambda.$$
Definition 6.4 (Gaussian distribution)
A Gaussian distribution with given mean $\bar\lambda$ and variance $\sigma^2$ has the probability density function
$$\rho_{\bar\lambda,\sigma}(\lambda) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(\lambda - \bar\lambda)^2}{2\sigma^2}},$$
so that
$$\bar\lambda = \int \lambda\, \rho_{\bar\lambda,\sigma}(\lambda)\, d\lambda, \qquad \sigma^2 = \int (\lambda - \bar\lambda)^2 \rho_{\bar\lambda,\sigma}(\lambda)\, d\lambda.$$
SLIDE 21
Given some set of constraints $C$ on the space of probability densities and given the probability density $\rho_0$ measuring the external bias, and analogous to the discrete case, the following principles hold for continuous measures:
Maximum entropy principle: $S(\rho^*) = \max_{\rho \in C} S(\rho)$.
Maximum relative entropy principle: $S(\rho^*, \rho_0) = \max_{\rho \in C} S(\rho, \rho_0)$.
SLIDE 22
Example: Find the least biased distribution given the constraints $C$ defined by the first and second moments,
$$\rho(\lambda) \geq 0, \qquad \int_{\mathbb{R}} \rho(\lambda)\, d\lambda = 1, \qquad \bar\lambda = \int_{\mathbb{R}} \lambda\, \rho(\lambda)\, d\lambda, \qquad \sigma^2 = \int_{\mathbb{R}} (\lambda - \bar\lambda)^2 \rho(\lambda)\, d\lambda.$$
SLIDE 23
Example: Find the least biased distribution given the constraints $C$ defined by the first and second moments,
$$\rho(\lambda) \geq 0, \qquad \int_{\mathbb{R}} \rho(\lambda)\, d\lambda = 1, \qquad \bar\lambda = \int_{\mathbb{R}} \lambda\, \rho(\lambda)\, d\lambda, \qquad \sigma^2 = \int_{\mathbb{R}} (\lambda - \bar\lambda)^2 \rho(\lambda)\, d\lambda.$$
The variational derivative of the entropy is given by
$$\frac{\delta S}{\delta \rho} = -(1 + \ln \rho).$$
The first and second moment constraints are linear functionals of the density $\rho$, so that
$$\frac{\delta \bar\lambda}{\delta \rho} = \lambda, \qquad \frac{\delta \sigma^2}{\delta \rho} = (\lambda - \bar\lambda)^2.$$
From the Lagrange multiplier principle we have, at the entropy maximum,
$$-\frac{\delta S}{\delta \rho}\Big|_{\rho = \rho^*} = -\mu_0 - \mu_1 \frac{\delta \bar\lambda}{\delta \rho}\Big|_{\rho = \rho^*} - \mu_2 \frac{\delta \sigma^2}{\delta \rho}\Big|_{\rho = \rho^*},$$
where $\mu_0, \mu_1, \mu_2$ are the Lagrange multipliers for the constraints. Therefore
$$\ln \rho^* = -(\mu_0 + 1) - \mu_1 \lambda - \mu_2 (\lambda - \bar\lambda)^2,$$
which defines $\rho^*$ as a Gaussian probability density with mean $\bar\lambda$ and variance $\sigma^2$, so that $\rho^* = \rho_{\bar\lambda,\sigma}(\lambda)$.
SLIDE 24
Example: Find the least biased distribution given the constraints $C$ defined by the first and second moments,
$$\rho(\lambda) \geq 0, \qquad \int_{\mathbb{R}} \rho(\lambda)\, d\lambda = 1, \qquad \bar\lambda = \int_{\mathbb{R}} \lambda\, \rho(\lambda)\, d\lambda, \qquad \sigma^2 = \int_{\mathbb{R}} (\lambda - \bar\lambda)^2 \rho(\lambda)\, d\lambda.$$
The variational derivative of the entropy is given by
$$\frac{\delta S}{\delta \rho} = -(1 + \ln \rho).$$
The first and second moment constraints are linear functionals of the density $\rho$, so that
$$\frac{\delta \bar\lambda}{\delta \rho} = \lambda, \qquad \frac{\delta \sigma^2}{\delta \rho} = (\lambda - \bar\lambda)^2.$$
From the Lagrange multiplier principle we have, at the entropy maximum,
$$-\frac{\delta S}{\delta \rho}\Big|_{\rho = \rho^*} = -\mu_0 - \mu_1 \frac{\delta \bar\lambda}{\delta \rho}\Big|_{\rho = \rho^*} - \mu_2 \frac{\delta \sigma^2}{\delta \rho}\Big|_{\rho = \rho^*},$$
where $\mu_0, \mu_1, \mu_2$ are the Lagrange multipliers for the constraints. Therefore
$$\ln \rho^* = -(\mu_0 + 1) - \mu_1 \lambda - \mu_2 (\lambda - \bar\lambda)^2,$$
which defines $\rho^*$ as a Gaussian probability density with mean $\bar\lambda$ and variance $\sigma^2$, so that $\rho^* = \rho_{\bar\lambda,\sigma}(\lambda)$.
This fact supports the idea well known from the central limit theorem that Gaussian densities are the most universal distributions with given first and second moments.
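A quick numerical check of this maximum entropy property (added here, not from the slides; the comparison densities are arbitrary choices with unit variance): among the three densities below, the Gaussian has the largest differential entropy.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def diff_entropy(pdf, lo, hi):
    """S(rho) = -integral of rho ln rho, by numerical quadrature."""
    integrand = lambda x: -pdf(x) * np.log(pdf(x)) if pdf(x) > 0 else 0.0
    return quad(integrand, lo, hi, limit=200)[0]

# Three densities with mean 0 and variance 1.
candidates = {
    "Gaussian": stats.norm(0.0, 1.0),
    "Uniform":  stats.uniform(-np.sqrt(3), 2 * np.sqrt(3)),   # width sqrt(12)
    "Laplace":  stats.laplace(0.0, 1.0 / np.sqrt(2)),         # scale 1/sqrt(2)
}
for name, d in candidates.items():
    print(name, diff_entropy(d.pdf, -10, 10))
# The Gaussian value, 0.5*ln(2*pi*e) ~ 1.419, is the largest of the three.
```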
SLIDE 25
Definition (Relative entropy "distance" function)
$$\mathcal{P}(p, \pi_0) = -S(p, \pi_0).$$
The name "distance" is partially justified by the following claim: $\mathcal{P}(p, \pi_0) \geq 0$, with equality only for $p = \pi_0$.
Now use the relative entropy distance function to supply a simple direct proof that the Gaussian maximizes the entropy among all densities with the given first two moments. Recall
$$\rho_{\bar\lambda,\sigma}(\lambda) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(\lambda - \bar\lambda)^2}{2\sigma^2}}.$$
For any probability density $\rho$ satisfying the constraints of the given first two moments with $\rho \neq \rho_{\bar\lambda,\sigma}$, the non-negativity guarantees
$$0 \leq \mathcal{P}(\rho, \rho_{\bar\lambda,\sigma}) = \int \rho \ln \rho - \int \rho \ln \rho_{\bar\lambda,\sigma}
= \int \rho \ln \rho + \int \rho\, \ln\big(\sqrt{2\pi}\,\sigma\big) + \int \rho\, \frac{(\lambda - \bar\lambda)^2}{2\sigma^2}$$
$$= \int \rho \ln \rho + \int \rho_{\bar\lambda,\sigma}\, \ln\big(\sqrt{2\pi}\,\sigma\big) + \int \rho_{\bar\lambda,\sigma}\, \frac{(\lambda - \bar\lambda)^2}{2\sigma^2}
= S(\rho_{\bar\lambda,\sigma}) - S(\rho),$$
so that $S(\rho_{\bar\lambda,\sigma}) > S(\rho)$, as required. Here $\rho$ and $\rho_{\bar\lambda,\sigma}$ have the same first two moments, and therefore
$$\int \rho\, (\lambda - \bar\lambda)^2 = \int \rho_{\bar\lambda,\sigma}\, (\lambda - \bar\lambda)^2.$$
SLIDE 26
Finite-moment problem for probability measures
We quantify the information contained in the moments of a probability density $p(\lambda)$,
$$\bar\lambda = M_1 = \int \lambda\, p(\lambda)\, d\lambda; \qquad M_j = \int (\lambda - \bar\lambda)^j p(\lambda)\, d\lambda, \quad 2 \leq j \leq 2L,$$
in addition to $\int p(\lambda)\, d\lambda = 1$. The probability distribution $p^*_{2L}(\lambda)$ with the least bias given these moment constraints, i.e., the one that maximizes the entropy, is
$$S(p^*_{2L}) = \max_{p \in PM_{2L}} S(p).$$
This gives
$$p^*_{2L} = \exp\Big(\sum_{m=0}^{2L} \alpha_m (\lambda - \bar\lambda)^m\Big),$$
where the $\alpha_m$ are the appropriate Lagrange multipliers. Note that the highest power must be even and $\alpha_{2L} < 0$ to guarantee a finite probability measure. Since adding more moments increases information, it follows that
$$S(p^*_{2L_2}) \leq S(p^*_{2L_1}), \qquad \text{for } L_1 \leq L_2.$$
As in the two-moment example,
$$-\frac{\delta S}{\delta \rho}\Big|_{\rho=\rho^*} = -\mu_0 - \mu_1 \frac{\delta M_1}{\delta \rho}\Big|_{\rho=\rho^*} - \mu_2 \frac{\delta M_2}{\delta \rho}\Big|_{\rho=\rho^*} - \cdots - \mu_{2L} \frac{\delta M_{2L}}{\delta \rho}\Big|_{\rho=\rho^*},$$
$$\ln \rho^* = -(\mu_0 + 1) - \mu_1 \lambda - \mu_2 (\lambda - \bar\lambda)^2 - \cdots - \mu_{2L} (\lambda - \bar\lambda)^{2L}.$$
SLIDE 27
We quantify the information loss in utilizing only the $2L$ moments of $p$:
$$0 \leq \mathcal{P}(p, p^*_{2L}) = \int p \ln p - \int p \ln p^*_{2L} = \int p \ln p - \int p^*_{2L} \ln p^*_{2L} = S(p^*_{2L}) - S(p).$$
The middle equality follows because $\ln p^*_{2L}$ is a sum of the $2L$ moment constraint functions, so that automatically
$$\int p \ln p^*_{2L} = \int p^*_{2L} \ln p^*_{2L}.$$
We immediately have, for $L_1 < L_2$,
$$\mathcal{P}(p, p^*_{2L_1}) = \mathcal{P}(p, p^*_{2L_2}) + \mathcal{P}(p^*_{2L_2}, p^*_{2L_1}). \tag{6.73}$$
We can view measuring fewer moments of $p$ as a coarse-grained measurement of $p$, and (6.73) precisely measures the information loss in the coarse-graining procedure through the term
$$\mathcal{P}(p^*_{2L_2}, p^*_{2L_1}) = S(p^*_{2L_1}) - S(p^*_{2L_2}).$$
SLIDE 28
Gaussian vs. non-Gaussian
When is a probability distribution significantly non-Gaussian? How much additional information is contained in the third and fourth moments of a probability distribution beyond the Gaussian estimate $p^*_G$?
Here, we compare $\mathcal{P}(p^*_4, p^*_G)$ for varying values of the third and fourth moments with fixed mean and variance. The third and fourth moments of $p^*_4$ are characterized by the skewness and flatness (kurtosis),
$$\mathrm{Skew} = \frac{\int (\lambda - \bar\lambda)^3\, p^*_4}{\big(\int (\lambda - \bar\lambda)^2\, p^*_4\big)^{3/2}}, \qquad \mathrm{Flat} = \frac{\int (\lambda - \bar\lambda)^4\, p^*_4}{\big(\int (\lambda - \bar\lambda)^2\, p^*_4\big)^{2}}.$$
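For concreteness (an added illustration, not from the slides), the skewness and flatness integrals can be evaluated by quadrature for a markedly non-Gaussian density; here a Gamma density is used as a stand-in, whose exact values are 2/sqrt(a) and 3 + 6/a.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

pdf = stats.gamma(a=4.0).pdf                  # a non-Gaussian test density

def central_moment(j, lo=0.0, hi=60.0):
    mean = quad(lambda x: x * pdf(x), lo, hi)[0]
    return quad(lambda x: (x - mean) ** j * pdf(x), lo, hi)[0]

m2 = central_moment(2)
skew = central_moment(3) / m2 ** 1.5    # exact: 2/sqrt(4) = 1.0
flat = central_moment(4) / m2 ** 2      # exact: 3 + 6/4 = 4.5 (Gaussian value is 3)
print(skew, flat)
```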
SLIDE 29
Multivariate Gaussian distribution
$$p_{\mathbf{x}}(x_1, \ldots, x_k) = \frac{1}{\sqrt{(2\pi)^k |\Sigma|}} \exp\Big(-\frac{1}{2}(\mathbf{x} - \boldsymbol\mu)^T \Sigma^{-1} (\mathbf{x} - \boldsymbol\mu)\Big),$$
where $\boldsymbol\mu$ is the $k$-dimensional mean and the $k \times k$ matrix $\Sigma$ is the covariance, which is always positive definite. The marginal distribution of a Gaussian over a subset of components is obtained by simply dropping the remaining components. For example, given a Gaussian vector $(x_1, x_2, x_3)^T$ with
$$\boldsymbol\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} & \Sigma_{13} \\ \Sigma_{21} & \Sigma_{22} & \Sigma_{23} \\ \Sigma_{31} & \Sigma_{32} & \Sigma_{33} \end{pmatrix},$$
the marginal density for $(x_1, x_3)$ is given by a Gaussian with
$$\tilde{\boldsymbol\mu} = \begin{pmatrix} \mu_1 \\ \mu_3 \end{pmatrix}, \qquad \tilde\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{13} \\ \Sigma_{31} & \Sigma_{33} \end{pmatrix}.$$
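Marginalizing a Gaussian amounts to selecting sub-blocks of the mean and covariance. A minimal sketch (the numerical values are illustrative):

```python
import numpy as np

mu = np.array([1.0, -0.5, 2.0])
Sigma = np.array([[2.0, 0.3, 0.5],
                  [0.3, 1.0, 0.2],
                  [0.5, 0.2, 1.5]])

idx = [0, 2]                           # keep components (x1, x3)
mu_marg = mu[idx]
Sigma_marg = Sigma[np.ix_(idx, idx)]   # pick out the corresponding 2x2 block
print(mu_marg)                         # [1.  2.]
print(Sigma_marg)                      # [[2.  0.5], [0.5 1.5]]
```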
SLIDE 30
Summary
◮ Shannon entropy
◮ Maximum entropy principle: finding the least biased distribution given statistical constraints
◮ Maximum relative entropy principle: finding the least biased distribution given both statistical constraints and a prior distribution
◮ Entropy and relative entropy for continuous measures on the line
◮ Finite-moment problem and non-Gaussian distributions
SLIDE 31
Appendix I: More material on the relative entropy
The relative entropy of $q$ compared with $p$ is given by
$$\mathcal{P}(p, q) = \int p \ln \frac{p}{q}.$$
◮ $\mathcal{P}(p, q)$ is positive unless $p = q$.
◮ $\mathcal{P}(p, q)$ is invariant under a general nonlinear change of variables. However, $\mathcal{P}(p, q)$ is not symmetric under $p \leftrightarrow q$ and does not obey the triangle inequality.
Relative entropy quantifies
◮ the model error in the imperfect model $q$ with respect to the perfect one $p$;
◮ the gain of information in the posterior distribution $p$ compared with the prior $q$.
SLIDE 32
Proof: Non-negativity of relative entropy
◮ Jensen's inequality: if $f$ is a convex function and $a$ is a real-valued random variable, then $\langle f(a) \rangle \geq f(\langle a \rangle)$.
[Figure: a convex function $f$ with the chord from $(a_1, f(a_1))$ to $(a_2, f(a_2))$ lying above the graph.]
Jensen's inequality generalizes the statement that the secant line of a convex function lies above the graph of the function,
$$\text{``}\langle f(a) \rangle\text{''} = t f(a_1) + (1 - t) f(a_2) \;\geq\; f\big(t a_1 + (1 - t) a_2\big) = \text{``}f(\langle a \rangle)\text{''}.$$
◮ Jensen's inequality leads to the fundamental result that $\mathcal{P}(p, q) \geq 0$:
$$\mathcal{P}(p, q) = \int p(a) \ln \frac{p(a)}{q(a)}\, da = \Big\langle -\ln \frac{q}{p} \Big\rangle \geq -\ln \Big\langle \frac{q}{p} \Big\rangle \quad (\text{Jensen: } -\ln \text{ is convex})$$
$$= -\ln \int \frac{q(a)}{p(a)}\, p(a)\, da = -\ln \int q(a)\, da = 0.$$
SLIDE 33
Proof: Invariance under a general nonlinear change of variables
Assume a change of variables from $x$ to $y$. Then, according to $p(x)\, dx = p(y)\, dy$,
$$\mathcal{P}\big(p(x), q(x)\big) = \int p(x) \ln \frac{p(x)}{q(x)}\, dx = \int p(y) \ln \frac{p(y)\, dy/dx}{q(y)\, dy/dx}\, dy = \int p(y) \ln \frac{p(y)}{q(y)}\, dy = \mathcal{P}\big(p(y), q(y)\big).$$
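This invariance can be verified numerically. The following sketch (added for illustration; the two Gaussians and the map y = exp(x) are arbitrary choices) computes P(p, q) by quadrature in the original and in the transformed variable.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

p = stats.norm(0.0, 1.0).pdf
q = stats.norm(0.5, 1.5).pdf

def rel_entropy(p_pdf, q_pdf, lo, hi, **kw):
    integrand = lambda t: p_pdf(t) * np.log(p_pdf(t) / q_pdf(t))
    return quad(integrand, lo, hi, **kw)[0]

# Change of variables y = exp(x): densities transform as p(y) = p(ln y) / y.
p_y = lambda y: p(np.log(y)) / y
q_y = lambda y: q(np.log(y)) / y

print(rel_entropy(p, q, -15, 15))                       # ~ 0.183 in x-coordinates
print(rel_entropy(p_y, q_y, 1e-8, 1e3,
                  points=[0.1, 1.0, 10.0], limit=200))  # same value in y-coordinates
```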
SLIDE 34
Measuring the lack of information in the imperfect model q compared with nature p
◮ Assume the truth is $p(a)$. The ignorance (lack of information) about $a$ is $I(p(a)) = -\ln(p(a))$. The average ignorance, i.e., the Shannon entropy, is
$$\langle I(p(a)) \rangle = -\int p(a) \ln(p(a))\, da.$$
◮ Assume the imperfect model is $p_M(a)$. The ignorance in the model about $a$ is $I(p_M(a)) = -\ln(p_M(a))$. The model's expected ignorance is
$$\langle I(p_M(a)) \rangle = -\int p(a) \ln(p_M(a))\, da.$$
Note that here the outcomes are actually generated by $p$.
◮ Consider now
$$\langle I(p_M(a)) \rangle - \langle I(p(a)) \rangle = \int p(a) \ln(p(a))\, da - \int p(a) \ln(p_M(a))\, da = \int p(a) \ln \frac{p(a)}{p_M(a)}\, da = \mathcal{P}(p, p_M).$$
Interpretation: $\mathcal{P}(p, p_M)$ is an objective metric for model error that measures the expected increase in ignorance, or lack of information, about the system incurred by using $p_M$ when the outcomes are actually generated by $p$.
SLIDE 35
Appendix II: Shannon entropy and the relative entropy in the Gaussian framework
◮ Recall the Shannon entropy
$$S(\rho) = -\int \rho \ln(\rho).$$
If $\rho \sim \mathcal{N}(\mu, R)$ is an $n$-dimensional Gaussian distribution, then the Shannon entropy has the explicit form
$$S(\rho) = \frac{n}{2}(1 + \ln 2\pi) + \frac{1}{2} \ln \det(R).$$
Uncertainty is reflected in the covariance.
Derivation of the 1-D case. Let
$$\rho(x) = \frac{1}{\sqrt{2\pi R}} \exp\Big(-\frac{(x - \mu)^2}{2R}\Big).$$
Then the Shannon entropy becomes
$$-\int \rho(x) \ln \rho(x)\, dx = -\int \rho(x) \Big(-\frac{1}{2}\ln(2\pi R) - \frac{(x - \mu)^2}{2R}\Big)\, dx = \frac{1}{2}\ln(2\pi R) + \frac{1}{2R}\int (x - \mu)^2 \rho(x)\, dx = \frac{1}{2}\ln(2\pi) + \frac{1}{2}\ln R + \frac{1}{2}.$$
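The 1-D formula can be checked against direct quadrature. A small sketch (the values of µ and R are arbitrary):

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

mu, R = 1.0, 2.5
rho = stats.norm(mu, np.sqrt(R)).pdf

numerical = quad(lambda x: -rho(x) * np.log(rho(x)), -30, 30)[0]
closed_form = 0.5 * np.log(2 * np.pi) + 0.5 * np.log(R) + 0.5
print(numerical, closed_form)   # both ~ 1.877
```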
SLIDE 36
Recall that the relative entropy of $q$ compared with $p$ is given by
$$\mathcal{P}(p, q) = \int p \ln \frac{p}{q}.$$
When both $p \sim \mathcal{N}(\bar m_p, R_p)$ and $q \sim \mathcal{N}(\bar m_q, R_q)$ are Gaussian, the relative entropy has an explicit formula,
$$\mathcal{P}(p, q) = \underbrace{\frac{1}{2}(\bar m_p - \bar m_q)^T R_q^{-1} (\bar m_p - \bar m_q)}_{\text{Signal}} + \underbrace{\frac{1}{2}\Big(\mathrm{tr}\big(R_p R_q^{-1}\big) - |K| - \ln \det\big(R_p R_q^{-1}\big)\Big)}_{\text{Dispersion}},$$
where $|K|$ is the dimension of both distributions.
◮ Signal measures the lack of information in the mean, weighted by the model covariance.
◮ Dispersion involves the covariance ratio.
This also implies that even in the Gaussian framework, the uncertainty reduction is not merely reflected in the variance! [Compare with the Shannon entropy.]
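The signal/dispersion decomposition is straightforward to implement. A hedged sketch (the means, covariances, and the function name below are illustrative assumptions):

```python
import numpy as np

def gaussian_relative_entropy(m_p, R_p, m_q, R_q):
    """P(p, q) for Gaussians p ~ N(m_p, R_p) and q ~ N(m_q, R_q), split into signal and dispersion."""
    k = len(m_p)
    Rq_inv = np.linalg.inv(R_q)
    dm = m_p - m_q
    signal = 0.5 * dm @ Rq_inv @ dm
    M = R_p @ Rq_inv
    dispersion = 0.5 * (np.trace(M) - k - np.log(np.linalg.det(M)))
    return signal, dispersion

m_p, R_p = np.array([0.0, 1.0]), np.array([[1.0, 0.2], [0.2, 0.5]])
m_q, R_q = np.array([0.5, 0.5]), np.array([[2.0, 0.0], [0.0, 1.0]])
sig, disp = gaussian_relative_entropy(m_p, R_p, m_q, R_q)
print(sig, disp, sig + disp)   # both terms are non-negative; the sum is P(p, q)
```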