
Machine Learning Lecture 01-2: Basics of Information Theory



  1. Machine Learning Lecture 01-2: Basics of Information Theory. Nevin L. Zhang, lzhang@cse.ust.hk, Department of Computer Science and Engineering, The Hong Kong University of Science and Technology.

  2. Jensen’s Inequality: Outline. 1 Jensen’s Inequality, 2 Entropy, 3 Divergence, 4 Mutual Information.

  3. Jensen’s Inequality: Concave functions. A function $f$ is concave on an interval $I$ if for any $x, y \in I$ and any $\lambda \in [0, 1]$,
     $$\lambda f(x) + (1 - \lambda) f(y) \le f(\lambda x + (1 - \lambda) y).$$
     That is, a weighted average of function values is upper bounded by the function of the weighted average. The function is strictly concave if equality holds only when $x = y$.

  4. Jensen’s Inequality: Jensen’s Inequality. Theorem (1.1): Suppose the function $f$ is concave on an interval $I$. Then for any $p_i \in [0, 1]$ with $\sum_{i=1}^n p_i = 1$ and any $x_i \in I$,
     $$\sum_{i=1}^n p_i f(x_i) \le f\Big(\sum_{i=1}^n p_i x_i\Big).$$
     The weighted average of the function values is upper bounded by the function of the weighted average. If $f$ is strictly concave, equality holds iff $p_i p_j \ne 0$ implies $x_i = x_j$. Exercise: prove this (using induction).

  5. Jensen’s Inequality: Logarithmic function. The logarithmic function is concave on the interval $(0, \infty)$. Hence, for $x_i > 0$,
     $$\sum_{i=1}^n p_i \log(x_i) \le \log\Big(\sum_{i=1}^n p_i x_i\Big).$$
     In words, exchanging $\sum_i p_i$ with $\log$ increases the quantity. Or, swapping expectation and logarithm increases the quantity: $E[\log x] \le \log E[x]$.
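
A quick numeric check of the last inequality (a minimal sketch in NumPy; the support points and weights below are arbitrary illustrative values):

```python
import numpy as np

# Arbitrary positive points and a probability vector (illustrative values).
x = np.array([0.5, 2.0, 8.0])
p = np.array([0.2, 0.5, 0.3])          # weights sum to 1

lhs = np.sum(p * np.log(x))            # E[log x]
rhs = np.log(np.sum(p * x))            # log E[x]
print(lhs, rhs, lhs <= rhs)            # Jensen: E[log x] <= log E[x]
```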

  6. Entropy: Outline. 1 Jensen’s Inequality, 2 Entropy, 3 Divergence, 4 Mutual Information.

  7. Entropy: Entropy. The entropy of a random variable $X$ is
     $$H(X) = \sum_X P(X) \log \frac{1}{P(X)} = -E_P[\log P(X)],$$
     with the convention that $0 \log(1/0) = 0$. The base of the logarithm is 2, and the unit is the bit. $H(X)$ is sometimes also called the entropy of the distribution, $H(P)$. It measures the amount of uncertainty about $X$. For a real-valued variable, replace $\sum_X \ldots$ with $\int \ldots \, dx$.
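
As a minimal sketch, the definition can be evaluated directly for a discrete distribution given as a vector of probabilities (base-2 logarithm, with the $0 \log(1/0) = 0$ convention handled explicitly):

```python
import numpy as np

def entropy(p):
    """Entropy in bits of a discrete distribution p (a 1-D array summing to 1)."""
    p = np.asarray(p, dtype=float)
    nz = p > 0                          # convention: 0 * log(1/0) = 0
    return np.sum(p[nz] * np.log2(1.0 / p[nz]))

print(entropy([0.5, 0.5]))              # 1 bit: a fair coin
print(entropy([1.0, 0.0]))              # 0 bits: no uncertainty
```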

  8. Entropy: Entropy. Example: $X$ is the result of a coin toss, $Y$ the result of a dice throw, $Z$ the result of randomly picking a card from a deck of 54. Which one has the highest uncertainty? Entropy:
     $$H(X) = \tfrac{1}{2}\log 2 + \tfrac{1}{2}\log 2 = \log 2 = 1 \text{ bit},$$
     $$H(Y) = \tfrac{1}{6}\log 6 + \ldots + \tfrac{1}{6}\log 6 = \log 6,$$
     $$H(Z) = \tfrac{1}{54}\log 54 + \ldots + \tfrac{1}{54}\log 54 = \log 54.$$
     Indeed we have $H(X) < H(Y) < H(Z)$.
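
The three entropies above can be reproduced numerically; this small sketch just evaluates the uniform case for supports of size 2, 6 and 54:

```python
import numpy as np

for n in (2, 6, 54):                     # coin, die, 54-card deck
    p = np.full(n, 1.0 / n)              # uniform distribution over n outcomes
    H = -np.sum(p * np.log2(p))
    print(n, H, np.log2(n))              # H equals log2(n) bits in each case
```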

  9. Entropy: Entropy. Let $X$ be binary. The chart referred to on this slide plots $H(X)$ as a function of $p = P(X = 1)$: it is 0 at $p = 0$ and $p = 1$ and peaks at 1 bit when $p = 1/2$. The higher $H(X)$ is, the more uncertainty there is about the value of $X$.
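
The curve is the binary entropy function; a minimal sketch that evaluates it on a grid of $p$ values:

```python
import numpy as np

def binary_entropy(p):
    """H(X) in bits for a binary variable with P(X = 1) = p."""
    p = np.asarray(p, dtype=float)
    q = 1.0 - p
    out = np.zeros_like(p)
    mask = (p > 0) & (p < 1)             # the endpoints have zero entropy
    out[mask] = -(p[mask] * np.log2(p[mask]) + q[mask] * np.log2(q[mask]))
    return out

ps = np.linspace(0.0, 1.0, 11)
print(np.column_stack([ps, binary_entropy(ps)]))   # maximum of 1.0 at p = 0.5
```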

  10. Entropy: Entropy. Proposition (1.2): $H(X) \ge 0$, and $H(X) = 0$ iff $P(X = x) = 1$ for some $x \in \Omega_X$, i.e. iff there is no uncertainty. Moreover, $H(X) \le \log(|X|)$, with equality iff $P(X = x) = 1/|X|$: uncertainty is highest in the case of the uniform distribution. Proof: Because $\log$ is concave, by Jensen’s inequality,
     $$H(X) = \sum_X P(X) \log\frac{1}{P(X)} \le \log \sum_X P(X)\,\frac{1}{P(X)} = \log |X|.$$

  11. Entropy: Conditional entropy. The conditional entropy of $Y$ given the event $X = x$ is the entropy of the conditional distribution $P(Y \mid X = x)$:
     $$H(Y \mid X = x) = \sum_Y P(Y \mid X = x) \log \frac{1}{P(Y \mid X = x)}.$$
     It is the uncertainty that remains about $Y$ when $X$ is known to be $x$. It is possible that $H(Y \mid X = x) > H(Y)$: intuitively, $X = x$ might contradict our prior knowledge about $Y$ and increase our uncertainty about $Y$. Exercise: give an example.

  12. Entropy: Conditional Entropy. The conditional entropy of $Y$ given the variable $X$:
     $$H(Y \mid X) = \sum_x P(X = x)\, H(Y \mid X = x) = \sum_X \sum_Y P(X) P(Y \mid X) \log\frac{1}{P(Y \mid X)} = \sum_{X, Y} P(X, Y) \log \frac{1}{P(Y \mid X)} = -E[\log P(Y \mid X)].$$
     It is the average uncertainty that remains about $Y$ when $X$ is known.
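
A minimal sketch of $H(Y \mid X)$ computed from a joint distribution given as a table $P(X, Y)$ (the joint table below is an arbitrary illustrative example):

```python
import numpy as np

# Joint distribution P(X, Y): rows index x, columns index y (illustrative values).
P_xy = np.array([[0.30, 0.10],
                 [0.15, 0.45]])

P_x = P_xy.sum(axis=1)                    # marginal P(X)
H_Y_given_X = 0.0
for x in range(P_xy.shape[0]):
    P_y_given_x = P_xy[x] / P_x[x]        # conditional P(Y | X = x)
    nz = P_y_given_x > 0
    H_x = -np.sum(P_y_given_x[nz] * np.log2(P_y_given_x[nz]))   # H(Y | X = x)
    H_Y_given_X += P_x[x] * H_x           # weight by P(X = x)

print(H_Y_given_X)                        # average remaining uncertainty about Y, in bits
```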

  13. Divergence: Outline. 1 Jensen’s Inequality, 2 Entropy, 3 Divergence, 4 Mutual Information.

  14. Divergence: Kullback-Leibler divergence. Relative entropy, or Kullback-Leibler divergence, measures how much a distribution $Q(X)$ differs from a "true" probability distribution $P(X)$. The KL divergence of $Q$ from $P$ is defined as
     $$KL(P \| Q) = \sum_X P(X) \log \frac{P(X)}{Q(X)},$$
     with the conventions $0 \log\frac{0}{0} = 0$ and $p \log\frac{p}{0} = \infty$ if $p \ne 0$.
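
A minimal sketch of this definition, assuming $P$ and $Q$ are given as probability vectors over the same finite support (it returns infinity when $Q$ assigns zero probability where $P$ does not):

```python
import numpy as np

def kl_divergence(p, q):
    """KL(P || Q) in bits for discrete distributions p, q on the same support."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    if np.any((q == 0) & (p > 0)):
        return np.inf                     # p * log(p/0) = infinity for p != 0
    nz = p > 0                            # 0 * log(0/q) = 0
    return np.sum(p[nz] * np.log2(p[nz] / q[nz]))

print(kl_divergence([0.5, 0.5], [0.9, 0.1]))   # > 0
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))   # 0, since P = Q
```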

  15. Divergence: Kullback-Leibler divergence. Theorem (1.2) (Gibbs’ inequality): $KL(P \| Q) \ge 0$, with equality iff $P$ is identical to $Q$. Proof:
     $$\sum_X P(X) \log\frac{P(X)}{Q(X)} = -\sum_X P(X) \log\frac{Q(X)}{P(X)} \ge -\log \sum_X P(X)\,\frac{Q(X)}{P(X)} = -\log\sum_X Q(X) = 0,$$
     where the inequality follows from Jensen’s inequality. The KL divergence between $P$ and $Q$ is larger than 0 unless $P$ and $Q$ are identical.

  16. Divergence: Cross Entropy. Entropy: $H(P) = \sum_X P(X) \log\frac{1}{P(X)} = -E_P[\log P(X)]$. Cross entropy:
     $$H(P, Q) = \sum_X P(X) \log\frac{1}{Q(X)} = -E_P[\log Q(X)].$$
     Relationship with KL:
     $$KL(P \| Q) = \sum_X P(X)\log\frac{P(X)}{Q(X)} = E_P[\log P(X)] - E_P[\log Q(X)] = H(P, Q) - H(P).$$
     Or, $H(P, Q) = KL(P \| Q) + H(P)$.
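
The identity $H(P, Q) = KL(P \| Q) + H(P)$ can be checked numerically (a minimal sketch; the two distributions are arbitrary illustrative values):

```python
import numpy as np

def cross_entropy(p, q):
    """H(P, Q) = -E_P[log Q(X)] in bits."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(q[nz]))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])
H_p = -np.sum(p * np.log2(p))                       # entropy H(P)
kl  = np.sum(p * np.log2(p / q))                    # KL(P || Q)
print(cross_entropy(p, q), kl + H_p)                # the two numbers agree
```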

  17. Divergence: A corollary. Corollary (1.1) (Gibbs Inequality): $H(P, Q) \ge H(P)$, or
     $$\sum_X P(X) \log Q(X) \le \sum_X P(X) \log P(X).$$
     More generally, let $f(X)$ be a non-negative function. Then
     $$\sum_X f(X) \log Q(X) \le \sum_X f(X) \log P^*(X),$$
     where $P^*(X) = f(X) / \sum_X f(X)$.

  18. Divergence: Unsupervised Learning. There is an unknown true distribution $P(\mathbf{x})$:
     $$P(\mathbf{x}) \xrightarrow{\text{sampling}} \mathcal{D} = \{\mathbf{x}_i\}_{i=1}^N \xrightarrow{\text{learning}} Q(\mathbf{x}).$$
     Objective: minimize the KL divergence $KL(P \| Q)$, which is the same as minimizing the cross entropy $H(P, Q)$. Approximating the cross entropy using data:
     $$H(P, Q) = -\int P(\mathbf{x}) \log Q(\mathbf{x})\, d\mathbf{x} \approx -\frac{1}{N}\sum_{i=1}^N \log Q(\mathbf{x}_i) = -\frac{1}{N}\log Q(\mathcal{D}).$$
     This is the same as maximizing the log-likelihood $\log Q(\mathcal{D})$.
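
A minimal sketch of the approximation step, assuming for illustration that the true distribution $P$ is a standard 1-D Gaussian and the model $Q$ is a Gaussian with hand-picked parameters (the `gauss_logpdf` helper and its parameters are assumptions, not part of the slides): the sample average of $-\log Q(\mathbf{x}_i)$ is a Monte Carlo estimate of the cross entropy, so minimizing it over $Q$'s parameters is maximum likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_logpdf(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x (natural log, so the result is in nats)."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

# Unknown "true" P: N(0, 1).  Model Q: N(0.5, 1.2^2) (illustrative parameters).
x = rng.normal(loc=0.0, scale=1.0, size=100_000)       # dataset D sampled from P

avg_nll = -np.mean(gauss_logpdf(x, mu=0.5, sigma=1.2)) # (1/N) * negative log-likelihood
print(avg_nll)   # Monte Carlo estimate of the cross entropy H(P, Q), in nats
```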

  19. Divergence: Supervised Learning. There is an unknown true distribution $P(\mathbf{x}, y)$, where $y$ is the label of input $\mathbf{x}$:
     $$P(\mathbf{x}, y) \xrightarrow{\text{sampling}} \mathcal{D} = \{\mathbf{x}_i, y_i\}_{i=1}^N \xrightarrow{\text{learning}} Q(y \mid \mathbf{x}).$$
     Objective: minimize the cross (conditional) entropy
     $$H(P, Q) = -\int P(\mathbf{x}, y) \log Q(y \mid \mathbf{x})\, d\mathbf{x}\, dy \approx -\frac{1}{N}\sum_{i=1}^N \log Q(y_i \mid \mathbf{x}_i).$$
     This is the same as maximizing the log-likelihood $\sum_{i=1}^N \log Q(y_i \mid \mathbf{x}_i)$, or minimizing the negative log-likelihood (NLL) $-\sum_{i=1}^N \log Q(y_i \mid \mathbf{x}_i)$.
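
A minimal sketch of the resulting loss for classification, assuming the model outputs a probability vector over classes for each example (the predictions and labels below are arbitrary illustrative values):

```python
import numpy as np

# Q(y | x_i): each row is the model's predicted distribution over 3 classes
# for one training example (illustrative values).
Q = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])
y = np.array([0, 1, 2])                   # observed labels y_i

nll = -np.mean(np.log(Q[np.arange(len(y)), y]))   # average negative log-likelihood
print(nll)   # the quantity minimized in cross-entropy training
```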

  20. Divergence: Jensen-Shannon divergence. KL is not symmetric: $KL(P \| Q)$ is usually not equal to the reverse KL, $KL(Q \| P)$. The Jensen-Shannon divergence is one symmetrized version of KL:
     $$JS(P \| Q) = \tfrac{1}{2} KL(P \| M) + \tfrac{1}{2} KL(Q \| M), \quad \text{where } M = \frac{P + Q}{2}.$$
     Properties: $0 \le JS(P \| Q) \le \log 2$; $JS(P \| Q) = 0$ if $P = Q$; $JS(P \| Q) = \log 2$ if $P$ and $Q$ have disjoint supports.
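
A minimal sketch of $JS(P \| Q)$, reusing a KL helper like the one sketched earlier; the two calls check the extreme values:

```python
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return np.sum(p[nz] * np.log2(p[nz] / q[nz]))

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits; symmetric and bounded by log 2 = 1 bit."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)                       # the mixture M = (P + Q) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

print(js_divergence([0.5, 0.5], [0.5, 0.5]))   # 0 when P = Q
print(js_divergence([1.0, 0.0], [0.0, 1.0]))   # 1 bit (= log 2) for disjoint supports
```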

  21. Mutual Information: Outline. 1 Jensen’s Inequality, 2 Entropy, 3 Divergence, 4 Mutual Information.

  22. Mutual Information: Mutual information. The mutual information of $X$ and $Y$:
     $$I(X; Y) = H(X) - H(X \mid Y).$$
     It is the average reduction in uncertainty about $X$ from learning the value of $Y$, or the average amount of information $Y$ conveys about $X$.

  23. Mutual Information: Mutual information and KL Divergence. Note that:
     $$I(X; Y) = \sum_X P(X) \log\frac{1}{P(X)} - \sum_{X, Y} P(X, Y) \log\frac{1}{P(X \mid Y)}
              = \sum_{X, Y} P(X, Y) \log\frac{1}{P(X)} - \sum_{X, Y} P(X, Y) \log\frac{1}{P(X \mid Y)}
              = \sum_{X, Y} P(X, Y) \log\frac{P(X \mid Y)}{P(X)}
              = \sum_{X, Y} P(X, Y) \log\frac{P(X, Y)}{P(X) P(Y)} \quad \text{(equivalent definition)}
              = KL(P(X, Y) \,\|\, P(X) P(Y)).$$
     Due to the equivalent definition: $I(X; Y) = H(X) - H(X \mid Y) = I(Y; X) = H(Y) - H(Y \mid X)$.
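
Both forms can be checked on a small joint table (a minimal sketch; the joint distribution below is an arbitrary illustrative example):

```python
import numpy as np

def H(p):
    """Entropy in bits of the probabilities in p (any shape)."""
    p = np.asarray(p, dtype=float).ravel()
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

# Joint P(X, Y): rows index x, columns index y (illustrative values).
P_xy = np.array([[0.30, 0.10],
                 [0.15, 0.45]])
P_x, P_y = P_xy.sum(axis=1), P_xy.sum(axis=0)

# Form 1: I(X;Y) = H(X) - H(X|Y).
H_X_given_Y = sum(P_y[j] * H(P_xy[:, j] / P_y[j]) for j in range(P_xy.shape[1]))
mi_entropy = H(P_x) - H_X_given_Y

# Form 2: I(X;Y) = KL( P(X,Y) || P(X)P(Y) ).
mi_kl = np.sum(P_xy * np.log2(P_xy / np.outer(P_x, P_y)))

print(mi_entropy, mi_kl)                  # the two values agree
```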

  24. Mutual Information: Property of Mutual information. Theorem (1.3): $I(X; Y) \ge 0$, with equality iff $X \perp Y$. Interpretation: $X$ and $Y$ are independent iff $X$ contains no information about $Y$ and vice versa. Proof: follows from the previous slide and Theorem 1.2.

  25. Mutual Information: Conditional Entropy Revisited. Theorem (1.4): $H(X \mid Y) \le H(X)$, with equality iff $X \perp Y$. Observation reduces uncertainty on average, except in the case of independence. Proof: follows from Theorem 1.3.

  26. Mutual Information: Mutual information and Entropy. From the definition of mutual information, $I(X; Y) = H(X) - H(X \mid Y)$, and the chain rule, $H(X, Y) = H(Y) + H(X \mid Y)$, we get
     $$H(X) + H(Y) = H(X, Y) + I(X; Y), \qquad I(X; Y) = H(X) + H(Y) - H(X, Y).$$
     Consequently, $H(X, Y) \le H(X) + H(Y)$, with equality iff $X \perp Y$.
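
These identities can be verified on the same kind of joint table (a minimal sketch with an arbitrary illustrative joint distribution):

```python
import numpy as np

def H(p):
    """Entropy in bits of the probabilities in p (any shape)."""
    p = np.asarray(p, dtype=float).ravel()
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

P_xy = np.array([[0.30, 0.10],            # illustrative joint distribution P(X, Y)
                 [0.15, 0.45]])
P_x, P_y = P_xy.sum(axis=1), P_xy.sum(axis=0)

# Chain rule: H(X, Y) = H(Y) + H(X | Y).
H_X_given_Y = sum(P_y[j] * H(P_xy[:, j] / P_y[j]) for j in range(P_xy.shape[1]))
print(H(P_xy), H(P_y) + H_X_given_Y)      # equal

# Consequence: H(X, Y) <= H(X) + H(Y), with equality only under independence.
print(H(P_xy) <= H(P_x) + H(P_y))         # True
```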
