Chapter 2 Entropy, Relative Entropy, and Mutual Infor- mation - PowerPoint PPT Presentation

Chapter 2 Entropy, Relative Entropy, and Mutual Infor- mation Peng-Hua Wang Graduate Institute of Communication Engineering National Taipei University

Chapter Outline Chap. 2 Entropy, Relative Entropy, and Mutual Information 2.1 Entropy 2.2 Joint entropy and conditional entropy 2.3 Relative entropy and mutual information 2.4 Relationship between entropy and mutual information 2.5 Chain Rules for Entropy, Relative Entropy, and Mutual Information Peng-Hua Wang, February 19, 2012 Information Theory, Chap. 2 - p. 2/51

Chapter Outline Chap. 2 Entropy, Relative Entropy, and Mutual Information 2.6 Jensen’s inequality and its consequences 2.7 Log sum inequality and its applications 2.8 Data processing inequality 2.9 Sufficient Statistics 2.10 Fano’s Inequality Peng-Hua Wang, February 19, 2012 Information Theory, Chap. 2 - p. 3/51

2.1 Entropy Peng-Hua Wang, February 19, 2012 Information Theory, Chap. 2 - p. 4/51

Entropy Definition 1 (Entropy) The entropy H ( X ) of a discrete random variable X is defined by H ( X ) = − ∑ p ( x ) log p ( x ) x ∈X Peng-Hua Wang, February 19, 2012 Information Theory, Chap. 2 - p. 5/51

Entropy ■ X be a discrete random variable with alphabet X and pmf p ( x ) = Pr [ X = x ] , x ∈ X . ■ log 2 p ( x ) , the entropy is expressed in bits. ■ If the base is e , i.e., ln p ( x ) , the entropy is expressed in nats. ■ If the base is b , we denote the entropy as H b ( X ) . ■ 0 log 0 � lim t → 0 + t log t = 0. 1 ■ H ( X ) = E [ log p ( X ) ] = − E log p ( X ) ■ H ( X ) may not exist. Peng-Hua Wang, February 19, 2012 Information Theory, Chap. 2 - p. 6/51

Properties of entropy Lemma 1 H ( X ) ≥ 0 Lemma 2 H b ( X ) = log b ( a ) H a ( X ) Peng-Hua Wang, February 19, 2012 Information Theory, Chap. 2 - p. 7/51

Meaning of entropy ■ The amount of information (code length) required on the average to describe the random variable. ■ The minimum expected number of binary questions required to determine X lies between H ( X ) and H ( X ) + 1. ■ The amount of “information” provided by an observation of a random variable. ◆ If an event is less probable, we receive more information when it occurs. ◆ A certain event provides no information. ■ “Uncertainty” about a random variable. ■ “Randomness” of a random variable. Peng-Hua Wang, February 19, 2012 Information Theory, Chap. 2 - p. 8/51

Example 1.1.1 Consider a random variable that has a uniform distribution over 32 outcomes. To identify an outcome, we need a label that takes on 32 different values. (1) How may bit is sufficient as label? (2) Compute the entropy of the random variable. Peng-Hua Wang, February 19, 2012 Information Theory, Chap. 2 - p. 9/51

Example 1.1.2 Suppose that we have a horse race with eight horses taking part. Assume that the probabilities of winning for the eight horses are ( 1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64 ) . Suppose that we wish to send a message indicating which horse won the race. (1) How may bit is sufficient for labeling the horse? (2) Compute the entropy H ( X ) . (3) Can we label the horse in average H ( X ) bits? Peng-Hua Wang, February 19, 2012 Information Theory, Chap. 2 - p. 10/51

Example 2.1.1 Let Pr [ X = 1 ] = p and Pr [ X = 0 ] = 1 − p . The entropy H ( X ) � H ( p ) = − p log p − ( 1 − p ) log ( 1 − p ) . ■ H ( p ) is a concave function of the distribution. ■ H ( p ) = 0 if p = 0 or 1. ■ H ( p ) = 1 is maximum if p = 1/2. Peng-Hua Wang, February 19, 2012 Information Theory, Chap. 2 - p. 11/51

Example 2.1.2 Let  with probability 1 a , 2 ,      with probability 1  b , 4 , X = with probability 1 c , 8 ,      with probability 1  d , 8 . Compute H ( X ) . ■ We wish to determine the value of X with the “Yes/No” questions. ■ The minimum number of binary questions lies between H ( X ) and H ( X ) + 1. Peng-Hua Wang, February 19, 2012 Information Theory, Chap. 2 - p. 12/51

2.2 Joint entropy and conditional entropy Peng-Hua Wang, February 19, 2012 Information Theory, Chap. 2 - p. 13/51

Joint entropy Definition 2 (Joint Entropy) Let ( X , Y ) be a pair of discrete random variables with a joint distribution p ( x , y ) . The joint entropy H ( X , Y ) is defined as H ( X , Y ) = − ∑ x ∈X ∑ p ( x , y ) log p ( X , Y ) y ∈Y = − E log p ( x , y ) Peng-Hua Wang, February 19, 2012 Information Theory, Chap. 2 - p. 14/51

Example 2.2.1 Let ( X , Y ) have the following joint distribution: X = 1 X = 2 X = 3 X = 4 Y = 1 1/8 1/16 1/32 1/32 Y = 2 1/16 1/8 1/32 1/32 Y = 3 1/16 1/16 1/16 1/16 Y = 4 1/4 0 0 0 Compute H ( X ) , H ( Y ) , H ( X , Y ) , H ( Y | X ) , H ( X | Y ) . Peng-Hua Wang, February 19, 2012 Information Theory, Chap. 2 - p. 16/51

Properties of conditional entropy Theorem 1 (Chain rule) H ( X , Y ) = H ( X ) + H ( Y | X ) Proof. Take logarithm and expectation on [ p ( x , y )] − 1 = [ p ( x )] − 1 [ p ( y | x )] − 1 . � Peng-Hua Wang, February 19, 2012 Information Theory, Chap. 2 - p. 17/51

Properties of conditional entropy Corollary 1 H ( X , Y | Z ) = H ( X | Z ) + H ( Y | X , Z ) Proof. Take logarithm and expectation on [ p ( x , y | z )] − 1 = [ p ( x | z )] − 1 [ p ( y | x , z )] − 1 . � ■ H ( Y | X ) � = H ( X | Y ) . ■ H ( X ) − H ( X | Y ) = H ( Y ) − H ( Y | X ) Peng-Hua Wang, February 19, 2012 Information Theory, Chap. 2 - p. 18/51

2.3 Relative entropy and mutual information Peng-Hua Wang, February 19, 2012 Information Theory, Chap. 2 - p. 19/51

Relative entropy Definition 4 (Relative Entropy) The relative entropy between two distributions p ( x ) and q ( x ) is defined as p ( x ) log p ( x ) D ( p || q ) = ∑ q ( x ) x ∈X = E p log p ( X ) q ( X ) ■ D ( p || q ) is also called the Kullback–Leibler Distance 0 = 0 and p log p ■ We will use 0 log 0 0 = ∞ Peng-Hua Wang, February 19, 2012 Information Theory, Chap. 2 - p. 20/51

Meaning of Relative entropy ■ D ( p || q ) is a measure of the distance between two distributions. ■ D ( p || q ) is a measure of the inefficiency of assuming that the distribution is q ( x ) when the true distribution is p ( x ) . Peng-Hua Wang, February 19, 2012 Information Theory, Chap. 2 - p. 21/51

Meaning of Relative entropy ■ If we know the true distribution p ( x ) , we could construct a code with average description length 1 ∑ p ( x ) log p ( x ) = H ( p ) . x ∈X If, instead, we used the distribution q ( x ) to construct the code (wrong code), the average code length is 1 L = ∑ p ( x ) log q ( x ) . x ∈X The difference is p ( x ) log p ( x ) L − H ( p ) = ∑ q ( x ) = D ( p || q ) x ∈X Peng-Hua Wang, February 19, 2012 Information Theory, Chap. 2 - p. 22/51

Mutual information Definition 5 (Mutual Information) The mutual information I ( X ; Y ) is defined as p ( x , y ) I ( X ; Y ) = ∑ x ∈X ∑ p ( x , y ) log p ( x ) p ( y ) y ∈Y = D ( p ( x , y ) || p ( x ) p ( y )) p ( X , Y ) = E p ( x , y ) log p ( X ) p ( Y ) ■ The mutual information I ( X ; Y ) is the relative entropy between the joint distribution and the product distribution p ( x ) p ( y ) . Peng-Hua Wang, February 19, 2012 Information Theory, Chap. 2 - p. 23/51

Example 2.3.1 Consider two distributions p and q on X = { 0, 1 } . Let p ( 0 ) = 1 − r , p ( 1 ) = r , and let q ( 0 ) = 1 − s , q ( 1 ) = s . Compute D ( p || q ) and D ( q || p ) . Peng-Hua Wang, February 19, 2012 Information Theory, Chap. 2 - p. 24/51

2.4 Relationship between entropy and mutual information Peng-Hua Wang, February 19, 2012 Information Theory, Chap. 2 - p. 25/51

Mutual information and entropy Theorem 2 (Mutual information and entropy) I ( X ; Y ) = H ( X ) − H ( X | Y ) I ( X ; Y ) = H ( Y ) − H ( Y | X ) I ( X ; Y ) = H ( X ) + H ( Y ) − H ( X , Y ) I ( X ; Y ) = I ( Y ; X ) I ( X ; X ) = H ( X ) Proof. 1. Take logarithm and expectation on [ p ( x , y ) / p ( x ) p ( y )] − 1 = [ p ( x )] − 1 ÷ [ p ( x | y )] − 1 . � ■ The mutual information I ( X ; Y ) is the reduction in the uncertainty of X due to the knowledge of Y . Peng-Hua Wang, February 19, 2012 Information Theory, Chap. 2 - p. 26/51

Mutual information and entropy Relationships between mutual information and entropy Peng-Hua Wang, February 19, 2012 Information Theory, Chap. 2 - p. 27/51

2.5 Chain Rules for Entropy, Relative Entropy, and Mutual Information Peng-Hua Wang, February 19, 2012 Information Theory, Chap. 2 - p. 28/51

Chain rules Theorem 3 (Chain rule for entropy) n ∑ H ( X 1 , X 2 , . . . , X n ) = H ( X i | X i − 1 , . . . , X 1 ) i = 1 Proof. Take logarithm and expectation on [ p ( x 1 , x 2 , . . . , x n )] − 1 =[ p ( x 1 )] − 1 [ p ( x 2 | x 1 )] − 1 [ p ( x 3 | x 1 , x 2 )] − 1 · · · . � Theorem 4 (Chain rule for information) n ∑ I ( X 1 , X 2 , . . . , X n ; Y ) = I ( X i ; Y | X i − 1 , . . . , X 1 ) i = 1 Peng-Hua Wang, February 19, 2012 Information Theory, Chap. 2 - p. 29/51

Chapter 2 Entropy, Relative Entropy, and Mutual Infor- mation - PowerPoint PPT Presentation

Chapter 2 Entropy, Relative Entropy, and Mutual Infor- mation Peng-Hua Wang Graduate Institute of Communication Engineering National Taipei University Chapter Outline Chap. 2 Entropy, Relative Entropy, and Mutual Information 2.1 Entropy 2.2

Topics 11/13/2006 Chapter 11, start Chapter 12 11/20/2006 Chapter 12 11/27/2006 Chapter 13

Topics 11/13/2006 Chapter 11, start Chapter 12 11/20/2006 Chapter 12 Inheritance Concepts

Chapter 13 Chapter 13 1 What is this? Chapter 13 2 What is this? Chapter 13 3 What is

CHAPTER CHAPTER VII CHAPTER CHAPTER VII VII VII MANAGEMENT AND MANAGEMENT AND

Appendix A Chapter 9 versus Chapter 1 1 at a Glance Chapter 9 Chapter 1 1 ( I n) voluntary Cannot

CHAPTER II I CHAPTER I Recurrent Neural Networks Recurrent Neural Networks CHAPTER II : I :

Pushdown Automata Chapter 5 Chapter 5 Chapter 5 Chapter 5

Chapter 6 Programme design and development Lets Recap Chapter 2: Chapter 3: Chapter 1:

OWASP London Chapter Meeting 27th July 2017 London Chapter Chapter Leaders: Sam

Constraint Satisfaction Problem s C t i t S ti f ti P bl Reading: Chapter 6 (3 rd ed );

Chapter 3 Chapter 3 Data Description McGraw-Hill, Bluman, 7 th ed, Chapter 3 1 Ch Chapter 3

OWASP London Chapter Meeting 23rd November 2017 London Chapter Chapter Leaders: Sam

A.I.S. Class 22: Outline I Learning Objectives for Chapter 8 I Chapter 8 Quiz I New ACCESS Features

A.I.S. Class 27: Outline I Learning Objectives for Chapter 8 I Chapter 8 Quiz I New ACCESS Features

Chapters for the Final Exam Chapter 20: Electric forces and fields (Conceptual Questions) Chapter

Chapter: 9 9 9 9 Chapter: Chapter: Chapter: High-Speed Downlink High-Speed Downlink Packet

Overview and Compa risons of Long-T erm Finanial Risk Mo dels Overview and Compa

Creating a Transit Generation The Effect of the U-Pass on Lifelong Transit Use By Caitlin

Lepton-nucleus interactions within many-body approaches: from the quasi-elastic to the DIS region

Trustworthy Self-Assembly: A Use-Case for Distributed Runtime Verification John Rushby Computer

Bayesian Analysis for Algorithm Performance Comparison Is it possible to compare optimization

Build Your Own Idea Factory! IN 20 15-SECOND STEPS! by Aaron Vegh This presentation is about

LCS 11: Cognitive Science Overview of information processing in cognitive science Overview of

The Secret of Success In Iraq: Its All About The People SAME Mid-Atlantic/ Middle East