

  1. CS480/680 Machine Learning Lecture 5: January 21st, 2020. Information Theory. Zahra Sheikhbahaee. Sources: Elements of Information Theory; Information Theory, Inference, and Learning Algorithms. University of Waterloo CS480/680 Winter 2020, Zahra Sheikhbahaee

  2. Outline • Information-Theoretic Entropy • Mutual Information • Decision Tree • KL Divergence • Applications

  3. Information Theory What is information theory? A quantitative measure of the information content of a message, or a measure of how much surprise there is in an event. • What is the ultimate data compression? (entropy) • What is the ultimate transmission rate of communication? (channel capacity: the ability of a channel to transmit what a given information source produces)

  4. Information Theory • A message saying the sun rose this morning is very uninformative • A message saying there was a solar eclipse this morning is very informative • Independent events should have additive information

  5. Entropy • Definition: Entropy measures the amount of uncertainty of a random quantity. View information as a reduction in uncertainty and as surprise: observing something unexpected gains information. Shannon's entropy, the average amount of information about a random variable X, is given by the expected value H(X) = −∑_x p(x) log2 p(x) = −E[log2 p(x)]. Example: a friend lives in one of 32 apartments (4 floors with 8 apartments on each floor), each equally likely, so p(x) = 1/32 and the entropy is −∑_{x=1}^{32} (1/32) log2(1/32) = 5 bits. After a neighbour tells you that your friend lives on the top floor, only 8 apartments remain: −∑_{x=1}^{8} (1/8) log2(1/8) = 3 bits. The neighbour conveyed 2 bits of information.
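A minimal numerical check of the apartment example above (the `entropy_bits` helper is an illustrative name, not course code):

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits: H = -sum p * log2(p); terms with p = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# 4 floors x 8 apartments = 32 equally likely locations
prior = [1 / 32] * 32
print(entropy_bits(prior))       # 5.0 bits

# After the neighbour reveals the top floor, 8 equally likely apartments remain
posterior = [1 / 8] * 8
print(entropy_bits(posterior))   # 3.0 bits

# Information conveyed by the neighbour
print(entropy_bits(prior) - entropy_bits(posterior))   # 2.0 bits
```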

  6. Entropy • Definition (Conditional Entropy): Given two random variables X and Y, the conditional entropy of X given Y, written H(X|Y), is H(X|Y) = ∑_y H(X|Y=y) · P(Y=y) = E_Y[H(X|Y=y)]. In the special case that X and Y are independent, H(X|Y) = H(X), which captures that we learn nothing about X from Y. • Theorem: Let X and Y be random variables. Then H(X|Y) ≤ H(X). This means that learning another variable Y can, on average, only decrease the uncertainty about X.
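A small sketch of the definition above, computing H(X|Y) from a joint table (the joint distribution and helper names are made up for illustration):

```python
import math

def entropy_bits(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def conditional_entropy(joint):
    """H(X|Y) = sum_y P(Y=y) * H(X | Y=y), with joint[y][x] = P(X=x, Y=y)."""
    h = 0.0
    for row in joint:                              # one row per value of Y
        p_y = sum(row)
        if p_y > 0:
            h += p_y * entropy_bits([p / p_y for p in row])
    return h

# An arbitrary joint distribution P(X, Y): rows index Y, columns index X
joint = [[0.25, 0.25],
         [0.40, 0.10]]
p_x = [sum(col) for col in zip(*joint)]            # marginal P(X)
print(conditional_entropy(joint))                  # H(X|Y), about 0.86 bits
print(entropy_bits(p_x))                           # H(X),   about 0.93 bits, so H(X|Y) <= H(X)
```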

  7. Jensen's Inequality • Definition: If g is a continuous and concave function, and q_1, ..., q_n are nonnegative reals summing to 1, then for any x_1, ..., x_n: ∑_{i=1}^n q_i g(x_i) ≤ g(∑_{i=1}^n q_i x_i). If we treat (q_1, ..., q_n) as a distribution q, and g(x) is the vector obtained by applying g coordinate-wise to x, then we can write the inequality as E_q[g(x)] ≤ g(E_q[x]). If q_i = 1/n and the concave function is ln x, we have (1/n) ∑_{i=1}^n ln x_i ≤ ln((1/n) ∑_{i=1}^n x_i).
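A quick numerical illustration of the last inequality, (1/n) ∑ ln x_i ≤ ln((1/n) ∑ x_i) (the random sample is arbitrary):

```python
import math
import random

random.seed(0)
xs = [random.uniform(0.5, 5.0) for _ in range(1000)]   # any positive values will do
n = len(xs)

lhs = sum(math.log(x) for x in xs) / n   # (1/n) * sum ln x_i
rhs = math.log(sum(xs) / n)              # ln((1/n) * sum x_i)
print(lhs, rhs, lhs <= rhs)              # ln is concave, so this always prints True
```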

  8. Mutual Information H(X|Y) − H(X) = ∑_{x,y} P(Y=y) P(X=x|Y=y) log2(1 / P(X=x|Y=y)) − ∑_x P(X=x) log2(1 / P(X=x)) = ∑_{x,y} P(X=x ∩ Y=y) log2(P(X=x) / P(X=x|Y=y)) = ∑_{x,y} P(X=x ∩ Y=y) log2(P(X=x) P(Y=y) / P(X=x ∩ Y=y)) ≤ log2[∑_{x,y} P(X=x) P(Y=y)] = log2 1 = 0, where the last step uses Jensen's inequality with the concave function log2. Definition: The mutual information of two random variables X and Y, written I(X; Y), is I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X) = I(Y; X). In the case that X and Y are independent, as noted above, I(X; Y) = H(X) − H(X|Y) = 0.
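The two equivalent forms of I(X; Y) can be checked numerically on a small joint table (again, the distribution is made up for illustration):

```python
import math

def entropy_bits(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# joint[y][x] = P(X=x, Y=y)
joint = [[0.30, 0.10],
         [0.15, 0.45]]
p_x = [sum(col) for col in zip(*joint)]
p_y = [sum(row) for row in joint]

# I(X;Y) = sum_{x,y} P(x,y) * log2( P(x,y) / (P(x) P(y)) )
mi = sum(p * math.log2(p / (p_x[x] * p_y[y]))
         for y, row in enumerate(joint)
         for x, p in enumerate(row) if p > 0)

# The definition above: I(X;Y) = H(X) - H(X|Y)
h_x_given_y = sum(p_y[y] * entropy_bits([p / p_y[y] for p in row])
                  for y, row in enumerate(joint))
print(mi, entropy_bits(p_x) - h_x_given_y)   # the two values agree
```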

  9. Information Gain • Definition: the amount of information gained about a random variable or signal from observing another random variable. • We want to determine which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be learned. • Information gain tells us how important a given attribute of the feature vectors is. • We will use it to decide the ordering of attributes in the nodes of a non-linear classifier known as a decision tree.

  10. Decision Tree Each node checks one feature x_i: • Go left if x_i < threshold • Go right if x_i ≥ threshold
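A minimal sketch of this routing rule (the `Node` class and `predict` helper are illustrative, not the course's implementation):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: int = 0                  # index i of the feature x_i tested at this node
    threshold: float = 0.0
    left: Optional["Node"] = None     # taken when x[feature] < threshold
    right: Optional["Node"] = None    # taken when x[feature] >= threshold
    label: Optional[str] = None       # set only at leaves

def predict(node, x):
    """Route a feature vector x down the tree until a leaf is reached."""
    while node.label is None:
        node = node.left if x[node.feature] < node.threshold else node.right
    return node.label

# A tiny hand-built tree: split on x_0 at 2.5, then on x_1 at 1.0
tree = Node(feature=0, threshold=2.5,
            left=Node(label="A"),
            right=Node(feature=1, threshold=1.0,
                       left=Node(label="B"), right=Node(label="C")))
print(predict(tree, [3.0, 0.2]))   # -> "B"
```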

  11. Decision Tree • Every binary split of a node t generates two descendant nodes (t_L, t_R) with subsets (X_{t_L}, X_{t_R}) respectively. • The tree grows from the root node down to the leaves and generates subsets that are more class-homogeneous than the ancestor's subset X_t. • A measure that quantifies node impurity, so that we split a node in the way that most decreases the overall impurity of the descendant nodes with respect to the ancestor's impurity, is I(t) = −∑_{i=1}^M P(c_i|t) log2 P(c_i|t), where P(c_i|t) is the probability that a vector in the subset X_t associated with node t belongs to class c_i.

  12. Decision Tree • If all class probabilities are equal to 1/M, the impurity is highest. • If all data belong to a single class, I(t) = −1 · log2 1 = 0. • Information gain: measures how good a split is via the decrease in node impurity ΔI(t) = I(t) − (N_{t_L}/N_t) I(t_L) − (N_{t_R}/N_t) I(t_R), where I(t_L) is the impurity of t_L, N_t is the number of samples at node t, and N_{t_L}, N_{t_R} are the numbers of samples routed to the left and right descendants. Goal: from a set of candidate questions, adopt the split that leads to the highest decrease in impurity.
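A short sketch of the impurity I(t) and the gain ΔI(t) on a toy split (the labels and the split itself are invented for illustration):

```python
import math
from collections import Counter

def impurity(labels):
    """Entropy impurity I(t) = -sum_i P(c_i|t) * log2 P(c_i|t) over the classes present at node t."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Delta I(t) = I(t) - (N_tL/N_t) * I(t_L) - (N_tR/N_t) * I(t_R)."""
    n = len(parent)
    return (impurity(parent)
            - len(left) / n * impurity(left)
            - len(right) / n * impurity(right))

# A toy node with two classes and one candidate split
parent = ["+", "+", "+", "+", "-", "-", "-", "-"]
left, right = ["+", "+", "+", "-"], ["+", "-", "-", "-"]
print(impurity(parent))                       # 1.0 bit: maximally impure, P = 1/2 per class
print(information_gain(parent, left, right))  # about 0.19 bits of impurity removed
```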

  13. Decision Tree • Entropy = 0 if all samples are in the same class • Entropy is largest if P(1) = ⋯ = P(N), i.e. all classes are equally likely Choose the split which gives the maximal information gain
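Putting the pieces together, a sketch of picking the best threshold for one feature by maximal information gain (the `best_split` helper and the toy data are assumptions, not the lecture's code):

```python
import math
from collections import Counter

def impurity(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(points, labels, feature):
    """Try every observed value of x_feature as a threshold; keep the one with maximal gain."""
    n = len(labels)
    best_t, best_gain = None, -1.0
    for t in sorted({p[feature] for p in points}):
        left = [y for p, y in zip(points, labels) if p[feature] < t]
        right = [y for p, y in zip(points, labels) if p[feature] >= t]
        if not left or not right:
            continue
        gain = (impurity(labels)
                - len(left) / n * impurity(left)
                - len(right) / n * impurity(right))
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Toy 1-D data: the class changes around x = 2.5
points = [[0.5], [1.0], [2.0], [3.0], [3.5], [4.0]]
labels = ["A", "A", "A", "B", "B", "B"]
print(best_split(points, labels, feature=0))   # threshold 3.0, gain 1.0 bit
```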

  14. KL Divergence • Consider some unknown distribution p(x), and suppose that we have modelled it using an approximate distribution q(x). The average additional amount of information required to specify a value of x as a result of using q(x) instead of p(x) is KL(p ∥ q) = −∫ p(x) ln{q(x)/p(x)} dx. This is known as the relative entropy or Kullback-Leibler divergence. • The KL divergence is not a symmetric quantity.
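A discrete version of the formula above, just to make the asymmetry concrete (the two distributions are arbitrary):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) * ln(p(x) / q(x)), in nats; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.1, 0.4, 0.5]
q = [0.3, 0.3, 0.4]
print(kl_divergence(p, q))   # >= 0
print(kl_divergence(q, p))   # a different value in general: KL is not symmetric
```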

  15. KL Divergence • We show that KL(p ∥ q) ≥ 0, with equality if and only if p(x) = q(x). • A function g is convex if it has the property that every chord lies on or above the function: any value of x in the interval from x = a to x = b can be written in the form λa + (1 − λ)b, where 0 ≤ λ ≤ 1. • Convexity of g is given by g(λa + (1 − λ)b) ≤ λ g(a) + (1 − λ) g(b). Using a proof by induction, this extends to Jensen's inequality, g(∑_i λ_i x_i) ≤ ∑_i λ_i g(x_i) for λ_i ≥ 0 with ∑_i λ_i = 1. Then, applying Jensen's inequality with the convex function −ln, the KL divergence becomes KL(p ∥ q) = −∫ p(x) ln{q(x)/p(x)} dx ≥ −ln ∫ q(x) dx = 0.

  16. KL Divergence • We can minimize the KL divergence with respect to the parameters θ of q: θ* = arg min_θ KL(p ∥ q_θ). If p(x) is a bimodal distribution and we try to approximate p with a Gaussian distribution using this KL divergence, we get mean-seeking behaviour, because the approximate distribution q_θ must cover all the modes and regions of high probability in p.
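A small numerical sketch of this mean-seeking behaviour, assuming NumPy and SciPy are available (the bimodal target, the grid, and the helper name are illustrative choices): minimizing the forward KL over a single Gaussian yields a broad Gaussian centred between the two modes, matching the mean and variance of p.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Bimodal target p(x): an equal mixture of Gaussians at -3 and +3
xs = np.linspace(-10, 10, 2001)
dx = xs[1] - xs[0]
p = 0.5 * norm.pdf(xs, loc=-3, scale=1) + 0.5 * norm.pdf(xs, loc=3, scale=1)

def forward_kl(params):
    """KL(p || q_theta) approximated on the grid, with q_theta a single Gaussian."""
    mu, log_sigma = params
    q = norm.pdf(xs, loc=mu, scale=np.exp(log_sigma))
    return float(np.sum(p * np.log(p / q)) * dx)

result = minimize(forward_kl, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = result.x[0], float(np.exp(result.x[1]))
print(mu_hat, sigma_hat)   # mean near 0, sigma near sqrt(10) ~ 3.16: one Gaussian covering both modes
```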
