Lecture 5: Measures of Information for Continuous Random Variables


SLIDE 1

Lecture 5: Measures of Information for Continuous Random Variables

I-Hsiang Wang

Department of Electrical Engineering, National Taiwan University, ihwang@ntu.edu.tw

October 26, 2015

SLIDE 2

From Discrete to Continuous

So far we have focused on discrete (finite-alphabet) r.v.'s:

- Entropy and mutual information for discrete r.v.'s.
- Lossless source coding for discrete stationary sources.
- Channel coding over discrete memoryless channels.

In this lecture and the next two lectures, we extend the basic principles and fundamental theorems to continuous random sources and channels. In particular:

- Mutual information for continuous r.v.'s. (this lecture)
- Lossy source coding for continuous stationary sources. (Lecture 7)
- Gaussian channel capacity. (Lecture 6)

SLIDE 3

Outline

1. First we investigate basic information measures – entropy, mutual information, and KL divergence – when the r.v.'s are continuous. We will see that both mutual information and KL divergence are well defined, while the entropy of a continuous r.v. is not.

2. Then, we introduce differential entropy as the continuous-r.v. counterpart of Shannon entropy, and discuss its related properties.

SLIDE 4

1. Entropy and Mutual Information
2. Differential Entropy

SLIDE 5

Entropy of a Continuous Random Variable

Question: What is the entropy of a continuous real-valued random variable X? Suppose X has the probability density function (p.d.f.) $f(x)$. Let us discretize X to answer this question, as follows:

- Partition $\mathbb{R}$ into length-$\Delta$ intervals: $\mathbb{R} = \bigcup_{k=-\infty}^{\infty} [k\Delta, (k+1)\Delta)$.
- Suppose $f(x)$ is continuous. Then, by the mean value theorem, $\forall\, k \in \mathbb{Z}$, $\exists\, x_k \in [k\Delta, (k+1)\Delta)$ such that $f(x_k) = \frac{1}{\Delta}\int_{k\Delta}^{(k+1)\Delta} f(x)\,dx$.
- Set $[X]_\Delta \triangleq x_k$ if $X \in [k\Delta, (k+1)\Delta)$, with p.m.f. $p(x_k) = f(x_k)\,\Delta$.
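A minimal numerical sketch of this quantizer, assuming $X \sim N(0,1)$ (an example not from the slides): by the MVT, the interval probability equals $f(x_k)\Delta$, so we can build the p.m.f. of $[X]_\Delta$ directly from interval masses. The names `delta`, `k`, and `pmf` are illustrative.

```python
# Sketch: discretize X ~ N(0,1) into length-delta intervals [k*delta, (k+1)*delta).
import numpy as np
from scipy.stats import norm

delta = 0.1
k = np.arange(-80, 80)                                   # covers [-8, 8)
# P{X in [k*delta, (k+1)*delta)} = f(x_k) * delta by the mean value theorem
pmf = norm.cdf((k + 1) * delta) - norm.cdf(k * delta)

print(pmf.sum())                                         # ~1: a valid p.m.f.
```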

SLIDE 6

[Figure: the p.d.f. $f(x)$ of $X$, where $F_X(x) \triangleq P\{X \le x\}$ is the c.d.f. and $f(x) \triangleq \frac{dF_X(x)}{dx}$.]

SLIDE 7

[Figure: the p.d.f. $f(x)$ with the real line partitioned into length-$\Delta$ intervals.]

SLIDE 8

[Figure: the p.d.f. $f(x)$ with the mean-value points $x_1, x_3, x_5, \dots$ marked inside the length-$\Delta$ intervals.]

SLIDE 9

Observation: $\lim_{\Delta \to 0} H([X]_\Delta) = H(X)$ (intuitively), while

$$H([X]_\Delta) = -\sum_{k=-\infty}^{\infty} \big(f(x_k)\Delta\big) \log\big(f(x_k)\Delta\big) = -\Delta \sum_{k=-\infty}^{\infty} f(x_k) \log f(x_k) - \log \Delta \;\to\; -\int_{-\infty}^{\infty} f(x) \log f(x)\,dx + \infty = \infty \quad \text{as } \Delta \to 0.$$

Hence, $H(X) = \infty$ if $-\int_{-\infty}^{\infty} f(x) \log f(x)\,dx = \mathbb{E}\big[\log \frac{1}{f(X)}\big]$ exists.

It is quite intuitive that the entropy of a continuous random variable can be arbitrarily large, because it can take infinitely many possible values.
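A rough numerical check of this observation (a sketch, again assuming $X \sim N(0,1)$, entropies in nats): $H([X]_\Delta)$ should blow up like $-\log\Delta$, while $H([X]_\Delta) + \log\Delta$ should approach $h(X) = \tfrac{1}{2}\log(2\pi e) \approx 1.4189$.

```python
# Sketch: H([X]_delta) grows like -log(delta), while H([X]_delta) + log(delta)
# converges to -∫ f log f dx = 0.5*log(2*pi*e) ≈ 1.4189 nats.
import numpy as np
from scipy.stats import norm

for delta in [0.5, 0.1, 0.01, 0.001]:
    k = np.arange(int(-10 / delta), int(10 / delta))
    p = norm.cdf((k + 1) * delta) - norm.cdf(k * delta)  # p.m.f. of [X]_delta
    p = p[p > 0]
    H = -np.sum(p * np.log(p))                           # discrete entropy (nats)
    print(delta, H, H + np.log(delta))                   # last column -> 1.4189
```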

SLIDE 10

Mutual Information between Continuous Random Variables

How about the mutual information between two continuous r.v.'s X and Y, with joint p.d.f. $f_{X,Y}(x, y)$ and marginal p.d.f.'s $f_X(x)$ and $f_Y(y)$? Again, we use discretization:

- Partition the plane $\mathbb{R}^2$ into $\Delta \times \Delta$ squares: $\mathbb{R}^2 = \bigcup_{k,j=-\infty}^{\infty} I^\Delta_k \times I^\Delta_j$, where $I^\Delta_k \triangleq [k\Delta, (k+1)\Delta)$.
- Suppose $f_{X,Y}(x, y)$ is continuous. Then, by the mean value theorem (MVT), $\forall\, k, j \in \mathbb{Z}$, $\exists\, (x_k, y_j) \in I^\Delta_k \times I^\Delta_j$ such that $f_{X,Y}(x_k, y_j) = \frac{1}{\Delta^2}\iint_{I^\Delta_k \times I^\Delta_j} f_{X,Y}(x, y)\,dx\,dy$.
- Set $([X]_\Delta, [Y]_\Delta) \triangleq (x_k, y_j)$ if $(X, Y) \in I^\Delta_k \times I^\Delta_j$, with p.m.f. $p(x_k, y_j) = f_{X,Y}(x_k, y_j)\,\Delta^2$.

SLIDE 11

By the MVT, $\forall\, k, j \in \mathbb{Z}$, $\exists\, \tilde{x}_k \in I^\Delta_k$ and $\tilde{y}_j \in I^\Delta_j$ such that

$$p(x_k) = \int_{I^\Delta_k} f_X(x)\,dx = \Delta f_X(\tilde{x}_k), \qquad p(y_j) = \int_{I^\Delta_j} f_Y(y)\,dy = \Delta f_Y(\tilde{y}_j).$$

Observation: $\lim_{\Delta \to 0} I([X]_\Delta; [Y]_\Delta) = I(X; Y)$ (intuitively), while

$$I([X]_\Delta; [Y]_\Delta) = \sum_{k,j=-\infty}^{\infty} p(x_k, y_j) \log \frac{p(x_k, y_j)}{p(x_k)\,p(y_j)} = \sum_{k,j=-\infty}^{\infty} \big(f_{X,Y}(x_k, y_j)\,\Delta^2\big) \log \frac{f_{X,Y}(x_k, y_j)\,\Delta^2}{f_X(\tilde{x}_k)\,f_Y(\tilde{y}_j)\,\Delta^2}$$
$$= \Delta^2 \sum_{k,j=-\infty}^{\infty} f_{X,Y}(x_k, y_j) \log \frac{f_{X,Y}(x_k, y_j)}{f_X(\tilde{x}_k)\,f_Y(\tilde{y}_j)} \;\to\; \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} f_{X,Y}(x, y) \log \frac{f_{X,Y}(x, y)}{f_X(x)\,f_Y(y)}\,dx\,dy \quad \text{as } \Delta \to 0.$$

Hence, $I(X; Y) = \mathbb{E}\big[\log \frac{f_{X,Y}(X,Y)}{f_X(X)\,f_Y(Y)}\big]$ if the improper integral exists.
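As a sanity check (a sketch, not part of the lecture), this discretization can be carried out numerically for a standard bivariate Gaussian with correlation $\rho$, where the closed form $I(X;Y) = -\tfrac{1}{2}\log(1-\rho^2)$ is known; the grid range and $\Delta$ below are illustrative choices.

```python
# Sketch: discretize a standard bivariate Gaussian (correlation rho) on
# delta x delta squares and compare I([X]_delta; [Y]_delta) with the known
# closed form I(X;Y) = -0.5*log(1 - rho^2) (in nats).
import numpy as np

rho, delta = 0.8, 0.05
x = np.arange(-6, 6, delta) + delta / 2                  # square centers
X, Y = np.meshgrid(x, x)
f = np.exp(-(X**2 - 2*rho*X*Y + Y**2) / (2*(1 - rho**2))) \
    / (2*np.pi*np.sqrt(1 - rho**2))                      # joint p.d.f.
p = f * delta**2                                         # p(x_k, y_j)
py = p.sum(axis=1, keepdims=True)                        # marginal of Y (rows)
px = p.sum(axis=0, keepdims=True)                        # marginal of X (cols)
I = np.sum(p * np.log(p / (py * px)))
print(I, -0.5 * np.log(1 - rho**2))                      # both ≈ 0.5108
```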

SLIDE 12

Mutual Information

Unlike entropy, which is only well defined for discrete random variables, the mutual information between two real-valued random variables (not necessarily continuous or discrete) can be defined in general as follows.

Definition 1 (Mutual information). The mutual information between two random variables X and Y is defined as
$$I(X; Y) \triangleq \sup_{\mathcal{P}, \mathcal{Q}} I([X]_\mathcal{P}; [Y]_\mathcal{Q}),$$
where the supremum is taken over all pairs of partitions $\mathcal{P}$ and $\mathcal{Q}$ of $\mathbb{R}$.

Similar to mutual information, KL divergence can also be defined between two probability measures, whether the probability distributions are discrete, continuous, or neither.

Remark: Although defining information measures in such a general way is appealing, these definitions do not provide explicit ways to compute these information measures.

SLIDE 13

1. Entropy and Mutual Information
2. Differential Entropy

SLIDE 14

Differential Entropy

For continuous r.v.'s, it turns out to be useful to define the counterparts of entropy and conditional entropy, as follows:

Definition 2 (Differential entropy and conditional differential entropy). The differential entropy of a continuous r.v. X with p.d.f. $f(x)$ is defined as
$$h(X) \triangleq \mathbb{E}\Big[\log \frac{1}{f(X)}\Big]$$
if the (improper) integral exists. The conditional differential entropy of a continuous r.v. X given Y, where $(X, Y)$ has joint p.d.f. $f(x, y)$ and conditional p.d.f. $f(x|y)$, is defined as
$$h(X|Y) \triangleq \mathbb{E}\Big[\log \frac{1}{f(X|Y)}\Big]$$
if the (improper) integral exists.

We have the following theorem immediately from the previous discussion:

Theorem 1 (Mutual information between two continuous r.v.'s).
$$I(X; Y) = \mathbb{E}\Big[\log \frac{f(X,Y)}{f(X)\,f(Y)}\Big] = h(X) - h(X|Y).$$
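Since $h(X)$ is an expectation, it can be estimated by Monte Carlo: sample from $f$ and average $\log\frac{1}{f(X)}$. A minimal sketch, assuming $X \sim N(0,1)$ so that the known value $\tfrac{1}{2}\log(2\pi e) \approx 1.4189$ nats can be checked:

```python
# Sketch: Monte Carlo estimate of h(X) = E[log 1/f(X)] for X ~ N(0,1).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
h_mc = -np.mean(norm.logpdf(x))              # sample mean of log(1/f(X)), nats
print(h_mc, 0.5 * np.log(2 * np.pi * np.e))  # ≈ 1.4189 either way
```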

SLIDE 15

Kullback-Leibler Divergence

Definition 3 (KL divergence between densities). The Kullback-Leibler divergence between two probability density functions $f(x)$ and $g(x)$ is defined as
$$D(f\,\|\,g) \triangleq \mathbb{E}\Big[\log \frac{f(X)}{g(X)}\Big]$$
if the (improper) integral exists, where the expectation is taken over $X \sim f(x)$.

By Jensen's inequality, it is straightforward to see that the non-negativity of KL divergence remains:

Proposition 1 (Non-negativity of KL divergence). $D(f\,\|\,g) \ge 0$, with equality iff $f = g$ almost everywhere (i.e., except on a set of zero probability).

Note: $D(f\,\|\,g)$ is finite only if the support of $f(x)$ is contained in the support of $g(x)$.
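A numerical sketch of Definition 3 and Proposition 1 for two Gaussian densities; the reference value uses the standard Gaussian KL identity $D(N(\mu_1,\sigma_1^2)\,\|\,N(\mu_2,\sigma_2^2)) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$, which is not derived on these slides.

```python
# Sketch: Monte Carlo estimate of D(f||g) = E_f[log f(X)/g(X)] for
# f = N(0,1), g = N(1,4), versus the Gaussian closed form (nats).
import numpy as np
from scipy.stats import norm

f, g = norm(0, 1), norm(1, 2)                 # scipy uses (loc, scale = std)
x = f.rvs(size=1_000_000, random_state=0)
D_mc = np.mean(f.logpdf(x) - g.logpdf(x))     # >= 0, per Proposition 1
D_cf = np.log(2 / 1) + (1 + (0 - 1)**2) / (2 * 4) - 0.5
print(D_mc, D_cf)                             # both ≈ 0.4431
```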

SLIDE 16

Properties that Extend to Continuous R.V.’s

Proposition 2 (Chain rule).
$$h(X, Y) = h(X) + h(Y|X), \qquad h(X^n) = \sum_{i=1}^{n} h(X_i \mid X^{i-1}).$$

Proposition 3 (Conditioning reduces differential entropy).
$$h(X|Y) \le h(X), \qquad h(X|Y, Z) \le h(X|Z).$$

Proposition 4 (Non-negativity of mutual information).
$$I(X; Y) \ge 0, \qquad I(X; Y|Z) \ge 0.$$
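These propositions can be sanity-checked in closed form for a jointly Gaussian pair with unit variances and correlation $\rho$ (a sketch using Gaussian entropy formulas, assumed here and computed in the examples that follow): since $Y|X{=}x \sim N(\rho x, 1-\rho^2)$, we have $h(Y|X) = \tfrac{1}{2}\log(2\pi e(1-\rho^2))$.

```python
# Sketch: verify the chain rule and "conditioning reduces differential
# entropy" for (X, Y) jointly Gaussian, unit variances, correlation rho.
import numpy as np

rho = 0.8
c = 0.5 * np.log(2 * np.pi * np.e)          # h of a unit-variance Gaussian
h_X = h_Y = c
h_Y_given_X = c + 0.5 * np.log(1 - rho**2)  # Y|X=x ~ N(rho*x, 1 - rho^2)
h_XY = 2 * c + 0.5 * np.log(1 - rho**2)     # 0.5*log((2*pi*e)^2 * det K)
print(np.isclose(h_XY, h_X + h_Y_given_X))  # chain rule holds: True
print(h_Y_given_X <= h_Y)                   # conditioning reduces h: True
```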

SLIDE 17

Examples

Example 1 (Differential entropy of a uniform r.v.). For a r.v. $X \sim \mathrm{Unif}[a, b]$, that is, with p.d.f. $f_X(x) = \frac{1}{b-a}\mathbb{1}\{a \le x \le b\}$, the differential entropy is $h(X) = \log(b - a)$.

Example 2 (Differential entropy of $N(0, 1)$). For a r.v. $X \sim N(0, 1)$, that is, with p.d.f. $f_X(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$, the differential entropy is $h(X) = \frac{1}{2}\log(2\pi e)$.
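Both examples can be verified by numerically integrating $-f \log f$ (a sketch in nats; the endpoints $a, b$ below are arbitrary illustrative choices):

```python
# Sketch: numerically integrate -f(x) log f(x) for Examples 1 and 2 (nats).
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

a, b = 2.0, 5.0                                          # Unif[a, b]
h_unif, _ = quad(lambda x: -(1 / (b - a)) * np.log(1 / (b - a)), a, b)
print(h_unif, np.log(b - a))                             # both ≈ log 3 ≈ 1.0986

h_gauss, _ = quad(lambda x: -norm.pdf(x) * norm.logpdf(x), -np.inf, np.inf)
print(h_gauss, 0.5 * np.log(2 * np.pi * np.e))           # both ≈ 1.4189
```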

SLIDE 18

New Properties of Differential Entropy

Differential entropy can be negative. Since $b - a$ can be made arbitrarily small, $h(X) = \log(b - a)$ can be negative. Hence, the non-negativity of entropy does not extend to differential entropy.

Scaling changes the differential entropy. Consider $X \sim \mathrm{Unif}[0, 1]$. Then $2X \sim \mathrm{Unif}[0, 2]$. Hence (with log base 2),
$$h(X) = \log 1 = 0, \quad h(2X) = \log 2 = 1 \implies h(X) \ne h(2X).$$
This is in sharp contrast to entropy: $H(X) = H(g(X))$ as long as $g(\cdot)$ is an invertible function.

SLIDE 19

Scaling and Translation

Proposition 5 (Scaling and translation, scalar case). Let X be a continuous random variable with differential entropy $h(X)$.
- Translation does not change the differential entropy: for a constant c, $h(X + c) = h(X)$.
- Scaling shifts the differential entropy: for a constant $a \ne 0$, $h(aX) = h(X) + \log|a|$.

Proposition 6 (Scaling and translation, vector case). Let $\mathbf{X}$ be a continuous random vector with differential entropy $h(\mathbf{X})$.
- For a constant vector $\mathbf{c}$, $h(\mathbf{X} + \mathbf{c}) = h(\mathbf{X})$.
- For an invertible matrix $A \in \mathbb{R}^{n \times n}$, $h(A\mathbf{X}) = h(\mathbf{X}) + \log|\det A|$.

The proofs of these propositions are left as exercises (simple calculus).
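A quick Monte Carlo sketch of Proposition 5, assuming $X \sim N(0,1)$ (not from the slides): if $Y = aX + c$ then $Y \sim N(c, a^2)$, and a plug-in estimate of $h(Y) - h(X)$ should equal $\log|a|$, independent of c. The values of `a` and `c` are illustrative.

```python
# Sketch: h(aX + c) - h(X) = log|a| for X ~ N(0,1), estimated by Monte Carlo.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
a, c = 3.0, -7.0                                  # illustrative constants
x = rng.standard_normal(500_000)
y = a * x + c                                     # Y ~ N(c, a^2)
h_X = -np.mean(norm.logpdf(x))                    # plug-in estimate of h(X)
h_Y = -np.mean(norm.logpdf(y, loc=c, scale=abs(a)))
print(h_Y - h_X, np.log(abs(a)))                  # both ≈ log 3 ≈ 1.0986
```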

SLIDE 20

Differential Entropy of Gaussian Random Vectors

Example 3 (Differential entropy of Gaussian random vectors). For an n-dim random vector $\mathbf{X} \sim N(\mathbf{m}, K)$, the differential entropy is
$$h(\mathbf{X}) = \tfrac{1}{2}\log\big((2\pi e)^n \det K\big).$$

Sol: For an n-dim random vector $\mathbf{X} \sim N(\mathbf{m}, K)$, we can rewrite $\mathbf{X}$ as $\mathbf{X} = A\mathbf{W} + \mathbf{m}$, where $AA^\mathsf{T} = K$ and $\mathbf{W}$ consists of i.i.d. $W_i \sim N(0, 1)$, $i = 1, \dots, n$. Hence, by the translation and scaling properties of differential entropy:
$$h(\mathbf{X}) = h(\mathbf{W}) + \log|\det A| = \sum_{i=1}^{n} h(W_i) + \tfrac{1}{2}\log(\det K) = \tfrac{n}{2}\log(2\pi e) + \tfrac{1}{2}\log(\det K) = \tfrac{1}{2}\log\big((2\pi e)^n \det K\big),$$
where $\log|\det A| = \tfrac{1}{2}\log(\det K)$ since $\det K = \det(AA^\mathsf{T}) = (\det A)^2$.
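A check of Example 3 (a sketch; the covariance $K$ below is an arbitrary positive-definite example): scipy's `multivariate_normal` reports differential entropy in nats, which should match $\tfrac{1}{2}\log\big((2\pi e)^n \det K\big)$.

```python
# Sketch: compare Example 3's formula with scipy's Gaussian entropy (nats).
import numpy as np
from scipy.stats import multivariate_normal

K = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.3],
              [0.0, 0.3, 1.5]])                  # positive definite
n = K.shape[0]
h_formula = 0.5 * np.log((2 * np.pi * np.e) ** n * np.linalg.det(K))
h_scipy = multivariate_normal(mean=np.zeros(n), cov=K).entropy()
print(h_formula, h_scipy)                        # equal up to rounding
```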

SLIDE 21

Maximum Differential Entropy

Recall that the uniform distribution maximizes entropy for a r.v. with finite support. For differential entropy, the maximization problem needs to be associated with constraints on the distribution (otherwise, it is simple to make it infinite). The following theorem asserts that, under a second-moment constraint, the zero-mean Gaussian maximizes the differential entropy.

Theorem 2 (Maximum differential entropy under covariance constraint). Let $\mathbf{X}$ be a random vector with mean $\mathbf{m}$ and covariance matrix $\mathbb{E}\big[(\mathbf{X} - \mathbf{m})(\mathbf{X} - \mathbf{m})^\mathsf{T}\big] = K$, and let $\mathbf{X}_G$ be Gaussian with the same covariance K. Then,
$$h(\mathbf{X}) \le h(\mathbf{X}_G) = \tfrac{1}{2}\log\big((2\pi e)^n \det K\big).$$

SLIDE 22

Pf: First, we can assume WLOG that both $\mathbf{X}$ and $\mathbf{X}_G$ are zero-mean. Let the p.d.f. of $\mathbf{X}$ be $f(\mathbf{x})$ and the p.d.f. of $\mathbf{X}_G$ be $f_G(\mathbf{x})$. Hence,
$$0 \le D(f\,\|\,f_G) = \mathbb{E}_f[\log f(\mathbf{X})] - \mathbb{E}_f[\log f_G(\mathbf{X})] = -h(\mathbf{X}) - \mathbb{E}_f[\log f_G(\mathbf{X})].$$
Note that $\log f_G(\mathbf{x})$ is a quadratic function of $\mathbf{x}$, and $\mathbf{X}, \mathbf{X}_G$ have the same second moments. Hence,
$$\mathbb{E}_f[\log f_G(\mathbf{X})] = \mathbb{E}_{f_G}[\log f_G(\mathbf{X}_G)] = -h(\mathbf{X}_G)$$
$$\implies 0 \le D(f\,\|\,f_G) = -h(\mathbf{X}) + h(\mathbf{X}_G) \implies h(\mathbf{X}) \le h(\mathbf{X}_G).$$
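To illustrate Theorem 2 in the scalar case ($n = 1$, nats): among unit-variance densities, the Gaussian has the largest differential entropy. A sketch comparing it against a unit-variance Laplace density, using the standard identity $h = 1 + \log(2b)$ for $\mathrm{Laplace}(b)$ (an identity assumed here, not derived on these slides):

```python
# Sketch: a unit-variance Laplace r.v. has strictly smaller differential
# entropy than the unit-variance Gaussian, as Theorem 2 predicts (nats).
import numpy as np

h_gauss = 0.5 * np.log(2 * np.pi * np.e)  # ≈ 1.4189
b = 1 / np.sqrt(2)                        # Laplace(b): variance 2*b^2 = 1
h_laplace = 1 + np.log(2 * b)             # ≈ 1.3466
print(h_laplace < h_gauss)                # True
```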

SLIDE 23

Summary

SLIDE 24

Mutual information between two continuous r.v.'s X and Y with joint density $f_{X,Y}$:
$$I(X; Y) = \mathbb{E}\Big[\log \frac{f_{X,Y}(X,Y)}{f_X(X)\,f_Y(Y)}\Big].$$
Differential entropy and conditional differential entropy:
$$h(X) \triangleq \mathbb{E}\Big[\log \frac{1}{f_X(X)}\Big], \qquad h(X|Y) \triangleq \mathbb{E}\Big[\log \frac{1}{f_{X|Y}(X|Y)}\Big].$$
$I(X; Y) = h(X) - h(X|Y) = h(Y) - h(Y|X)$.
KL divergence between densities f and g:
$$D(f\,\|\,g) \triangleq \mathbb{E}_f\Big[\log \frac{f(X)}{g(X)}\Big].$$
The chain rule, the fact that conditioning reduces differential entropy, and the non-negativity of mutual information and KL divergence all continue to hold.
Differential entropy can be negative; in particular, $h(X) \le h(X, Y)$ need not hold, since $h(Y|X)$ can be negative.
