CS480/680 Machine Learning Lecture 5: January 21st, 2020 - PowerPoint PPT Presentation



SLIDE 1

CS480/680 Machine Learning Lecture 5: January 21st, 2020

Information Theory
Zahra Sheikhbahaee

Sources: Elements of Information Theory; Information Theory, Inference, and Learning Algorithms

University of Waterloo

SLIDE 2

Outline

  • Information-Theoretic Entropy
  • Mutual Information
  • Decision Tree
  • KL Divergence
  • Applications


SLIDE 3

Information Theory

What is information theory? A quantitative measure of the information content of a message:

  • Or: measuring how much surprise there is in an event.
  • What is the ultimate data compression? (entropy)
  • What is the ultimate transmission rate of communication? (channel capacity: the ability of a channel to transmit what is produced by a given information source)


SLIDE 4

Information Theory

  • A message saying the sun rose this morning is very uninformative.
  • A message saying there was a solar eclipse this morning is very informative.
  • Independent events should have additive information.

SLIDE 5

Entropy

  • Definition: Entropy measures the amount of uncertainty of a random quantity.

View information as a reduction in uncertainty and as surprise: observing something unexpected gives information. Shannon's entropy, the average amount of information about a random variable Y, is given by the expected value

I(Y) = βˆ’ βˆ‘_y Q(y) logβ‚‚ Q(y) = βˆ’E[logβ‚‚ Q(y)]

Example (4 floors, 8 apartments on each floor, 32 apartments in total): the probability that a friend lives in any one of the apartments is Q(y) = 1/32, so the entropy is

βˆ’ βˆ‘_{i=1}^{32} (1/32) logβ‚‚(1/32) = 5 bits

After a neighbor tells you that your friend lives on the top floor:

βˆ’ βˆ‘_{i=1}^{8} (1/8) logβ‚‚(1/8) = 3 bits

The neighbor conveyed 2 bits of information.
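A quick numerical check of this apartment example, as a minimal Python sketch (the helper name entropy_bits is just for illustration, not from the lecture):

```python
from math import log2

def entropy_bits(probs):
    """Shannon entropy in bits: -sum(q * log2(q)) over outcomes with q > 0."""
    return -sum(q * log2(q) for q in probs if q > 0)

# 4 floors x 8 apartments = 32 equally likely apartments
prior = [1 / 32] * 32
print(entropy_bits(prior))          # 5.0 bits

# After learning the friend lives on the top floor: 8 equally likely apartments
posterior = [1 / 8] * 8
print(entropy_bits(posterior))      # 3.0 bits

# Information conveyed by the neighbor
print(entropy_bits(prior) - entropy_bits(posterior))  # 2.0 bits
```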

SLIDE 6

Entropy

  • Definition (Conditional Entropy): Given two random variables Y and Z, the conditional entropy of Y given Z, written I(Y|Z), is

I(Y|Z) = βˆ‘_z I(Y|Z = z) Β· Q(Z = z) = E_z[I(Y|Z = z)]

In the special case that Y and Z are independent, I(Y|Z) = I(Y), which captures that we learn nothing about Y from Z.

  • Theorem: Let Y and Z be random variables. Then

I(Y|Z) ≀ I(Y)

This means that learning information about another variable Z can only decrease the uncertainty of Y.


SLIDE 7

Jensen’s Inequality

  • Definition: If g is a continuous and concave function, and q₁, ..., q_n are nonnegative reals summing to 1, then for any y = (y₁, ..., y_n):

βˆ‘_{i=1}^{n} q_i g(y_i) ≀ g(βˆ‘_{i=1}^{n} q_i y_i)

If we treat (q₁, ..., q_n) as a distribution q, and g(y) is the vector obtained by applying g coordinate-wise to y, then we can write the inequality as

E_q[g(y)] ≀ g(E_q[y])

If q_i = 1/n and the concave function is ln y, we have

(1/n) βˆ‘_{i=1}^{n} ln y_i ≀ ln((1/n) βˆ‘_{i=1}^{n} y_i)


SLIDE 8

Mutual Information

Proof of the theorem I(Y|Z) ≀ I(Y):

I(Y|Z) βˆ’ I(Y) = βˆ‘_{y,z} Q(Z = z) Q(Y = y|Z = z) logβ‚‚ [1/Q(Y = y|Z = z)] βˆ’ βˆ‘_y Q(Y = y) logβ‚‚ [1/Q(Y = y)]

Using Q(Y = y) = βˆ‘_z Q(Z = z|Y = y) Q(Y = y) = βˆ‘_z Q(Y = y ∩ Z = z):

= βˆ‘_{y,z} Q(Y = y ∩ Z = z) logβ‚‚ [Q(Y = y)/Q(Y = y|Z = z)]
= βˆ‘_{y,z} Q(Y = y ∩ Z = z) logβ‚‚ [Q(Y = y) Q(Z = z)/Q(Y = y ∩ Z = z)]
≀ logβ‚‚ [βˆ‘_{y,z} Q(Y = y ∩ Z = z) Β· Q(Y = y) Q(Z = z)/Q(Y = y ∩ Z = z)]   (Jensen's inequality, logβ‚‚ is concave)
= logβ‚‚ 1 = 0

Definition: The Mutual Information of two random variables Y and Z, written J(Y; Z), is

J(Y; Z) = I(Y) βˆ’ I(Y|Z) = I(Z) βˆ’ I(Z|Y) = J(Z; Y)

In the case that Y and Z are independent, as noted above, J(Y; Z) = I(Y) βˆ’ I(Y|Z) = 0.
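A minimal sketch computing J(Y; Z) two ways on a made-up joint distribution: the direct formula βˆ‘ Q(y,z) logβ‚‚[Q(y,z)/(Q(y)Q(z))] and the difference I(Y) βˆ’ I(Y|Z). The joint table is an illustrative assumption, not from the lecture.

```python
from math import log2

def entropy_bits(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# A made-up joint distribution Q(Y = y, Z = z), stored as {(y, z): prob}
joint = {("a", 0): 0.30, ("a", 1): 0.20,
         ("b", 0): 0.10, ("b", 1): 0.40}

q_y, q_z = {}, {}
for (y, z), p in joint.items():
    q_y[y] = q_y.get(y, 0.0) + p
    q_z[z] = q_z.get(z, 0.0) + p

# Direct form: J(Y; Z) = sum_{y,z} Q(y,z) * log2[ Q(y,z) / (Q(y) Q(z)) ]
mi_direct = sum(p * log2(p / (q_y[y] * q_z[z])) for (y, z), p in joint.items() if p > 0)

# Equivalent form: J(Y; Z) = I(Y) - I(Y|Z)
h_y = entropy_bits(q_y.values())
h_y_given_z = sum(q_z[z] * entropy_bits(joint[(y, z)] / q_z[z] for y in q_y) for z in q_z)

print(mi_direct, h_y - h_y_given_z)   # both ~0.125 bits, and mutual information is never negative
```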


SLIDE 9

Information Gain

  • Definition: the amount of information gained about a random variable or signal from observing another random variable.
  • We want to determine which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be learned.
  • Information gain tells us how important a given attribute of the feature vectors is.
  • We will use it to decide the ordering of attributes in the nodes of a non-linear classifier known as a decision tree.


SLIDE 10

Decision Tree

Each node checks one feature y_i:

  • Go left if y_i < threshold
  • Go right if y_i β‰₯ threshold
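A minimal sketch of this routing rule; the Node layout, field names, and the toy tree below are illustrative assumptions, not the lecture's code:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: int = 0                  # index i of the feature y_i checked at this node
    threshold: float = 0.0
    left: Optional["Node"] = None     # taken when y_i < threshold
    right: Optional["Node"] = None    # taken when y_i >= threshold
    label: Optional[str] = None       # set only at leaves

def predict(node: Node, y):
    """Route a feature vector y down the tree until a leaf is reached."""
    while node.label is None:
        node = node.left if y[node.feature] < node.threshold else node.right
    return node.label

# Tiny hand-built tree: split on y_0 at 2.5, then on y_1 at 1.0
tree = Node(feature=0, threshold=2.5,
            left=Node(label="class A"),
            right=Node(feature=1, threshold=1.0,
                       left=Node(label="class B"),
                       right=Node(label="class C")))
print(predict(tree, [3.0, 0.4]))   # -> "class B"
```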


SLIDE 11

Decision Tree

  • Every binary split of a node t generates two descendant nodes (t_Y, t_N) with subsets (Y_{t_Y}, Y_{t_N}) respectively.
  • The tree grows from the root node down to the leaves, generating subsets that are more class-homogeneous compared to the ancestor's subset Y_t.
  • A measure that quantifies node impurity, so that we split the node in the way that most decreases the overall impurity of the descendant nodes w.r.t. the ancestor's impurity, is

J(t) = βˆ’ βˆ‘_{i=1}^{M} Q(x_i|t) logβ‚‚ Q(x_i|t)

Q(x_i|t): the probability that a vector in the subset Y_t associated with node t belongs to class x_i (out of M classes).


SLIDE 12

Decision Tree

  • If all class probabilities are equal to 1/M, the impurity is maximal (high impurity).
  • If all data belong to a single class, J(t) = βˆ’1 Β· logβ‚‚ 1 = 0.
  • Information gain: measures how good a split is via the decrease in node impurity

Ξ”J(t) = J(t) βˆ’ (N_{t_Y}/N_t) J(t_Y) βˆ’ (N_{t_N}/N_t) J(t_N)

J(t_Y): the impurity of t_Y; N_{t_Y}, N_{t_N} and N_t are the numbers of training vectors in the corresponding subsets.

Goal: from the set of candidate questions, adopt the one that performs the split leading to the highest decrease of impurity.
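A minimal sketch of Ξ”J(t) for a threshold question "is y_i < ΞΈ?", scanning a few candidate thresholds on toy data; the data, candidate thresholds, and function names are all illustrative assumptions:

```python
from math import log2
from collections import Counter

def impurity(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values()) if n else 0.0

def impurity_decrease(feature, labels, threshold):
    """Delta J for the split 'go left if feature < threshold, else go right'."""
    left = [c for f, c in zip(feature, labels) if f < threshold]
    right = [c for f, c in zip(feature, labels) if f >= threshold]
    n = len(labels)
    return impurity(labels) - len(left) / n * impurity(left) - len(right) / n * impurity(right)

# Toy data: one feature, two classes; pick the candidate threshold with maximal gain
feature = [1.0, 1.5, 2.0, 3.0, 3.5, 4.0]
labels  = ["A", "A", "A", "B", "B", "B"]
candidates = [1.25, 1.75, 2.5, 3.25, 3.75]
best = max(candidates, key=lambda th: impurity_decrease(feature, labels, th))
print(best, impurity_decrease(feature, labels, best))   # 2.5 gives gain 1.0 (a perfect split)
```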


SLIDE 13

Decision Tree


  • Entropy = 0 if all samples are in the same class
  • Entropy is largest if Q(1) = Β· Β· Β· = Q(N), i.e. all classes are equally likely

Choose the split that gives the maximal information gain.
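As a sketch of this selection step for categorical attributes (the toy dataset, attribute names, and helper functions are made up for illustration): compute the information gain of splitting on each candidate attribute and keep the one with the largest gain.

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Gain of splitting on a categorical attribute: I(parent) - sum_v (N_v/N) I(child_v)."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(row[attribute] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attribute] == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

# Toy dataset with two candidate attributes
rows = [{"outlook": "sunny", "windy": True},  {"outlook": "sunny", "windy": False},
        {"outlook": "rainy", "windy": True},  {"outlook": "rainy", "windy": False}]
labels = ["no", "no", "yes", "yes"]
best = max(["outlook", "windy"], key=lambda a: information_gain(rows, labels, a))
print(best)   # "outlook": it separates the classes perfectly, so it has maximal gain
```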

SLIDE 14

KL Divergence

  • Consider some unknown distribution q(y), and suppose that we have modelled it using an approximate distribution r(y). The average additional amount of information required to specify the value of y as a result of using r(y) instead of q(y) is

KL(q βˆ₯ r) = βˆ’ ∫ q(y) ln{r(y)/q(y)} dy

This is known as the relative entropy or Kullback-Leibler divergence.

  • The KL divergence is not a symmetric quantity: in general KL(q βˆ₯ r) β‰  KL(r βˆ₯ q).
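A minimal discrete-case sketch showing both the computation and the asymmetry (the two distributions are chosen arbitrarily for illustration):

```python
from math import log

def kl_divergence(q, r):
    """KL(q || r) in nats for discrete distributions given as lists of probabilities."""
    return sum(qi * log(qi / ri) for qi, ri in zip(q, r) if qi > 0)

q = [0.1, 0.4, 0.5]
r = [0.8, 0.1, 0.1]
print(kl_divergence(q, r))   # ~1.15 nats
print(kl_divergence(r, q))   # ~1.36 nats: a different value, so KL is not symmetric
```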


SLIDE 15

KL Divergence

  • We show that KL(q βˆ₯ r) β‰₯ 0, with equality if and only if q(y) = r(y).
  • A function g is convex if it has the property that every chord lies on or above the function. Any value of y in the interval from y = b to y = c can be written in the form ΞΌb + (1 βˆ’ ΞΌ)c where 0 ≀ μ ≀ 1.
  • Convexity of g is then expressed as

g(ΞΌb + (1 βˆ’ ΞΌ)c) ≀ ΞΌ g(b) + (1 βˆ’ ΞΌ) g(c)

Using the induction proof technique, this extends to Jensen's inequality, g(βˆ‘_i q_i y_i) ≀ βˆ‘_i q_i g(y_i) for convex g. Applying it with the convex function βˆ’ln y, the KL divergence becomes

KL(q βˆ₯ r) = βˆ’ ∫ q(y) ln{r(y)/q(y)} dy β‰₯ βˆ’ ln ∫ q(y) [r(y)/q(y)] dy = βˆ’ ln ∫ r(y) dy = 0


SLIDE 16

KL Divergence

  • We can minimize the KL divergence with respect to the parameters ΞΈ of r:

arg min_ΞΈ KL(Q βˆ₯ r_ΞΈ)

  • If Q(y) is a bimodal distribution and we try to approximate Q with a Gaussian distribution using this KL divergence, we get mean-seeking behaviour: the approximate distribution r_ΞΈ must cover all the modes and regions of high probability in Q.
