NEU 560: Statistical Modeling and Analysis of Neural Data                    Spring 2018
Lecture 8: Information Theory and Maximum Entropy
Lecturer: Mike Morais                                                        Scribes:

8.1 Fundamentals of information theory

Information theory started with Claude Shannon's "A Mathematical Theory of Communication". The first building block was entropy, which he sought as a functional H(·) of probability densities with two desired properties:

1. Decreasing in P(X), such that if P(X_1) < P(X_2), then h(P(X_1)) > h(P(X_2)).
2. Independent variables add, such that if X and Y are independent, then H(P(X,Y)) = H(P(X)) + H(P(Y)).

These are only satisfied by h(·) = -\log(·). Think of it as a "surprise" function.

Definition 8.1 (Entropy) The entropy of a random variable is the amount of information needed to fully describe it. Alternate interpretations: the average number of yes/no questions needed to identify X, or how uncertain you are about X.

    H(X) = -\sum_X P(X) \log P(X) = -E_X[\log P(X)]                                  (8.1)

Average information, surprise, and uncertainty are all reasonable plain-English analogies for entropy.

There are a few ways to measure entropy for multiple variables; we'll use two, X and Y.

Definition 8.2 (Conditional entropy) The conditional entropy of a random variable is the entropy of one random variable conditioned on knowledge of another random variable, on average. Alternate interpretations: the average number of yes/no questions needed to identify X given knowledge of Y, or how uncertain you are about X if you know Y, on average.

    H(X|Y) = \sum_Y P(Y)\, H(P(X|Y)) = \sum_Y P(Y) \left[ -\sum_X P(X|Y) \log P(X|Y) \right]
           = -\sum_{X,Y} P(X,Y) \log P(X|Y)
           = -E_{X,Y}[\log P(X|Y)]                                                   (8.2)

Definition 8.3 (Joint entropy)

    H(X,Y) = -\sum_{X,Y} P(X,Y) \log P(X,Y) = -E_{X,Y}[\log P(X,Y)]                  (8.3)
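To make Definitions 8.1-8.3 concrete, here is a minimal numerical sketch (not part of the original notes) that computes H(X), H(Y), H(X,Y), and H(X|Y) for a small discrete joint distribution with NumPy. The joint table P_xy and the choice of bits (log base 2) are illustrative assumptions.

    import numpy as np

    # Arbitrary example joint distribution P(X, Y): rows index X, columns index Y.
    P_xy = np.array([[0.30, 0.10],
                     [0.05, 0.25],
                     [0.10, 0.20]])
    assert np.isclose(P_xy.sum(), 1.0)

    def entropy(p):
        """Shannon entropy -sum p log2 p in bits, skipping zero-probability entries."""
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    P_x = P_xy.sum(axis=1)                # marginal P(X)
    P_y = P_xy.sum(axis=0)                # marginal P(Y)

    H_x  = entropy(P_x)                   # H(X),   Eq. (8.1)
    H_y  = entropy(P_y)                   # H(Y)
    H_xy = entropy(P_xy.ravel())          # H(X,Y), Eq. (8.3)

    # H(X|Y) = -sum_{X,Y} P(X,Y) log P(X|Y), Eq. (8.2)
    mask = P_xy > 0
    P_x_given_y = P_xy / P_y              # divide each column by P(Y = y)
    H_x_given_y = -np.sum(P_xy[mask] * np.log2(P_x_given_y[mask]))

    print(f"H(X)={H_x:.3f}  H(Y)={H_y:.3f}  H(X,Y)={H_xy:.3f}  H(X|Y)={H_x_given_y:.3f} bits")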

Two useful identities relate these entropies:

• Bayes' rule for entropy:

    H(X_1 | X_2) = H(X_2 | X_1) + H(X_1) - H(X_2)                                    (8.4)

• Chain rule of entropies:

    H(X_n, X_{n-1}, \ldots, X_1) = \sum_{i=1}^{n} H(X_i | X_{i-1}, \ldots, X_1)      (8.5)

It can be useful to think about these interrelated concepts with a so-called information diagram. These aid intuition, but are somewhat of a disservice to the mathematics behind them. Think of the area of each circle as the information needed to describe that variable; any overlap implies that the "same information" describes both processes. The entropy of X is the entire blue circle. Knowledge of Y removes the green slice. The joint entropy is the union of both circles. How do we describe their intersection, the green slice?

Definition 8.4 (Mutual information) The mutual information between two random variables is the amount of information about one random variable obtained through the other (their mutual dependence). Alternate interpretations: how much your uncertainty about X is reduced by knowing Y, or how much X informs Y.

    I(X, Y) = \sum_{X,Y} P(X,Y) \log \frac{P(X,Y)}{P(X) P(Y)}
            = H(X) - H(X|Y)
            = H(Y) - H(Y|X)
            = H(X) + H(Y) - H(X,Y)                                                   (8.6)

Note that I(X, Y) = I(Y, X) ≥ 0, with equality if and only if X and Y are independent.
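Continuing the illustrative sketch above (reusing the assumed P_xy, its marginals, and the entropy helper), the equivalent expressions in Eq. (8.6) can be checked numerically:

    # Mutual information from the definition in Eq. (8.6)...
    I_direct = np.sum(P_xy[mask] * np.log2(P_xy[mask] / np.outer(P_x, P_y)[mask]))

    # ...and from the entropy identities in the same equation.
    I_from_entropies   = H_x + H_y - H_xy      # H(X) + H(Y) - H(X,Y)
    I_from_conditional = H_x - H_x_given_y     # H(X) - H(X|Y)

    assert np.isclose(I_direct, I_from_entropies)
    assert np.isclose(I_direct, I_from_conditional)
    assert I_direct >= -1e-12                  # I(X,Y) >= 0, equality iff X, Y independent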

8.1.1 KL divergence

From Bayes' rule, we can rewrite the joint distribution as P(X,Y) = P(X|Y) P(Y) and rewrite the mutual information as

    I(X, Y) = \sum_Y P(Y) \sum_X P(X|Y) \log \frac{P(X|Y)}{P(X)} = E_Y\!\left[ D_{KL}\big( P(X|Y) \,\|\, P(X) \big) \right]        (8.7)

which we introduce as the Kullback-Leibler, or KL, divergence from P(X) to P(X|Y). Definition first, then intuition.

Definition 8.5 (Relative entropy, KL divergence) The KL divergence D_{KL}(p \| q) from q to p, or the relative entropy of p with respect to q, is the information lost when approximating p with q, or conversely the information gained when updating q with p. In terms of p and q:

    D_{KL}\big( p(X) \,\|\, q(X) \big) = \sum_X p(X) \log \frac{p(X)}{q(X)}          (8.8)

In terms of a prior and a posterior:

    D_{KL}\big( p(X|Y) \,\|\, p(X) \big) = \sum_X p(X|Y) \log \frac{p(X|Y)}{p(X)}

We can think of it as the amount of extra information needed to describe p(X|Y) (the posterior) if we used p(X) (the prior) instead. Conversely, in Bayesian language, it is the information gained when updating belief from the prior p(X) to the posterior p(X|Y); that is, the information gained about X by observing Y.

Claim 8.6 Maximizing the log-likelihood of observed data X with respect to model parameters θ is equivalent to minimizing the KL divergence between the true source distribution of the data and the model likelihood.

Proof: The KL divergence D_{KL}(p_true(X) \| p(X|θ)), the relative entropy of the true (unknown) data source p_true(X) with respect to the model likelihood p(X|θ) fit to the data, is given by

    D_{KL}\big( p_{true}(X) \,\|\, p(X|\theta) \big) = \sum_X p_{true}(X) \log \frac{p_{true}(X)}{p(X|\theta)}
        = -\sum_X p_{true}(X) \log p(X|\theta) + \sum_X p_{true}(X) \log p_{true}(X)
        = \lim_{N\to\infty} \left[ -\frac{1}{N} \sum_{i=1}^{N} \log p(x_i|\theta) \right] - H\big( p_{true}(X) \big)        (8.9)

For an observed dataset {x_1, x_2, ..., x_N}, we approximate the first sum with a Monte Carlo estimate that becomes exact in the infinite limit. Other names for these sums are the cross entropy and the (average) log-likelihood (you'll see this ML/cross-entropy equivalence leveraged when optimizing parameters in deep learning). Since the entropy of the data source is fixed with respect to our model parameters, it follows that

    \hat{\theta}_{ML} = \arg\min_\theta D_{KL}\big( p_{true}(X) \,\|\, p(X|\theta) \big) = \arg\max_\theta \lim_{N\to\infty} \frac{1}{N} \sum_{i=1}^{N} \log p(x_i|\theta)        (8.10)
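As an illustration of Claim 8.6 (not part of the original notes), the sketch below fits a Bernoulli parameter θ to data drawn from a hypothetical true source with p_true = 0.3 and checks, over a grid of candidate values, that the θ maximizing the average log-likelihood is the same θ that minimizes D_KL(p_true ∥ p(·|θ)).

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical true source: Bernoulli(0.3); x is the observed dataset {x_1, ..., x_N}.
    p_true = 0.3
    x = (rng.random(10_000) < p_true).astype(float)

    def avg_log_lik(theta, x):
        """(1/N) sum_i log p(x_i | theta) for a Bernoulli model, cf. Eq. (8.10)."""
        return np.mean(x * np.log(theta) + (1 - x) * np.log(1 - theta))

    def kl_bernoulli(p, q):
        """D_KL(Bernoulli(p) || Bernoulli(q)), Eq. (8.8)."""
        return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

    thetas   = np.linspace(0.01, 0.99, 99)
    theta_ml = thetas[np.argmax([avg_log_lik(t, x) for t in thetas])]
    theta_kl = thetas[np.argmin([kl_bernoulli(p_true, t) for t in thetas])]
    print(theta_ml, theta_kl)   # both land near p_true = 0.3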

8.1.2 Data processing inequality

Garbage in, garbage out. Suppose three random variables form a Markov chain X → Y_1 → Y_2; this is a sequence of events in which each depends only on the previous one, such that

    p(X, Y_1, Y_2) = p(Y_2 | Y_1)\, p(Y_1 | X)\, p(X)                                (8.11)

The data processing inequality tells us that processing (e.g., going from Y_1 to Y_2) cannot possibly increase information, so

    I(X, Y_1) \geq I(X, Y_2)                                                         (8.12)

8.2 Principle of maximum entropy

Entropy underlies a core theory for selecting probability distributions. Thomas Jaynes argues that the maximum entropy (maxent) distribution is "uniquely determined as the one which is maximally noncommittal with regard to missing information, in that it agrees with what is known, but expresses maximum uncertainty with respect to all other matters", and is therefore the most principled choice. Many common probability distributions arise naturally as maximum entropy distributions under moment constraints.

The basic problem looks like this:

    maximize over p(X):   -\sum_X p(X) \log p(X)
    subject to:           \sum_X p(X) f_i(X) = c_i   for all constraints f_i

with solution

    p(X) = \exp\!\left( -1 + \lambda_0 + \sum_i \lambda_i f_i(X) \right)             (8.13)

One constraint is always f_0(X) = 1 and c_0 = 1; that is, we constrain p(X) to be a proper probability distribution that integrates (sums) to 1.

8.2.1 Optimization with Lagrange multipliers

We solve the constrained optimization problem by forming a Lagrangian and introducing Lagrange multipliers λ_i (recall when we derived PCA!):

    L\big( p(X), \lambda_0, \{\lambda_i\} \big) = -\sum_X p(X) \log p(X) + \lambda_0 \left( \sum_X p(X) - 1 \right) + \sum_i \lambda_i \left( \sum_X p(X) f_i(X) - c_i \right)        (8.14)

A solution, if it exists, will occur at a critical point of this Lagrangian, i.e. where its gradient ∇L(p(X), λ_0, {λ_i}) ≡ 0. Recall that the gradient is the vector of all partial derivatives of L with respect to p(X) and each of the Lagrange multipliers; it is identically zero when every partial derivative is zero. So

    \frac{\partial L}{\partial p(X)}      = 0 = -\log p(X) - 1 + \lambda_0 + \sum_i \lambda_i f_i(X)
    \frac{\partial L}{\partial \lambda_0} = 0 = \sum_X p(X) - 1
    \frac{\partial L}{\partial \lambda_i} = 0 = \sum_X p(X) f_i(X) - c_i             (8.15)
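As a worked example of Eqs. (8.13)-(8.15), not from the notes, the sketch below finds the maximum entropy distribution on the faces of a die {1, ..., 6} subject to a single moment constraint E[X] = 4.5 (the classic Brandeis dice problem). By Eq. (8.13) the solution has the form p(x) ∝ exp(λ_1 x), with λ_0 absorbed into the normalization; the remaining multiplier is found by solving the constraint equation from Eq. (8.15) with SciPy's brentq root finder.

    import numpy as np
    from scipy.optimize import brentq

    faces = np.arange(1, 7)        # support of X
    target_mean = 4.5              # constraint: sum_x p(x) x = c_1

    def maxent_dist(lam):
        """Maxent solution p(x) ∝ exp(lam * x), Eq. (8.13), normalized to sum to 1."""
        w = np.exp(lam * faces)
        return w / w.sum()

    def constraint_gap(lam):
        """Residual of the moment constraint, sum_x p(x) f_1(x) - c_1, cf. Eq. (8.15)."""
        return maxent_dist(lam) @ faces - target_mean

    lam = brentq(constraint_gap, -5.0, 5.0)   # bracket chosen so the residual changes sign
    p = maxent_dist(lam)
    print(lam, p, p @ faces)                  # p is tilted toward high faces; mean is 4.5

With no moment constraint at all (only normalization), λ_1 = 0 and the solution reduces to the uniform distribution over the six faces, as expected.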
