CS480/680 Machine Learning Lecture 5: January 21st, 2020 - PowerPoint PPT Presentation



SLIDE 1

CS480/680 Machine Learning Lecture 5: January 21st, 2020

Information Theory
Zahra Sheikhbahaee

Sources: Elements of Information Theory; Information Theory, Inference, and Learning Algorithms

University of Waterloo

SLIDE 2

Outline

  • Information-Theoretic Entropy
  • Mutual Information
  • Decision Tree
  • KL Divergence
  • Applications


SLIDE 3

Information Theory

What is information theory? A quantitative measure of the information content of a message:

  • Or: measuring how much surprise there is in an event.
  • What is the ultimate data compression? (entropy)
  • What is the ultimate transmission rate of communication? (channel capacity: the ability of a channel to transmit what is produced by a given information source)


SLIDE 4

Information Theory

  • A message saying the sun rose this morning is very uninformative.
  • A message saying there was a solar eclipse this morning is very informative.
  • Independent events should have additive information.

SLIDE 5

Entropy

  • Definition: Entropy measures the amount of uncertainty of a random quantity.

View information as a reduction in uncertainty and as surprise: observing something unexpected gives information. Shannon's entropy, the average amount of information about a random variable Y, is given by the expected value

I(Y) = βˆ’ βˆ‘_y Q(y) logβ‚‚ Q(y) = βˆ’E[logβ‚‚ Q(y)]

Example (4 floors, 8 apartments on each floor, 32 apartments in total): the probability that a friend lives in any one of the apartments is Q(y) = 1/32, so the entropy is

βˆ’ βˆ‘_{i=1}^{32} (1/32) logβ‚‚(1/32) = 5 bits

After a neighbor tells you that your friend lives on the top floor:

βˆ’ βˆ‘_{i=1}^{8} (1/8) logβ‚‚(1/8) = 3 bits

The neighbor conveyed 2 bits of information.
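A quick numerical check of this apartment example, as a minimal Python sketch (the helper name entropy_bits is just for illustration, not from the lecture):

```python
from math import log2

def entropy_bits(probs):
    """Shannon entropy in bits: -sum(q * log2(q)) over outcomes with q > 0."""
    return -sum(q * log2(q) for q in probs if q > 0)

# 4 floors x 8 apartments = 32 equally likely apartments
prior = [1 / 32] * 32
print(entropy_bits(prior))          # 5.0 bits

# After learning the friend lives on the top floor: 8 equally likely apartments
posterior = [1 / 8] * 8
print(entropy_bits(posterior))      # 3.0 bits

# Information conveyed by the neighbor
print(entropy_bits(prior) - entropy_bits(posterior))  # 2.0 bits
```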

SLIDE 6

Entropy

  • Definition (Conditional Entropy): Given two random variables Y and Z, the conditional entropy of Y given Z, written I(Y|Z), is

I(Y|Z) = βˆ‘_z I(Y|Z = z) Β· Q(Z = z) = E_z[I(Y|Z = z)]

In the special case that Y and Z are independent, I(Y|Z) = I(Y), which captures that we learn nothing about Y from Z.

  • Theorem: Let Y and Z be random variables. Then

I(Y|Z) ≀ I(Y)

This means that learning information about another variable Z can only decrease the uncertainty of Y.


SLIDE 7

Jensen’s Inequality

  • Definition: If g is a continuous and concave function, and q₁, ..., q_n are nonnegative reals summing to 1, then for any y = (y₁, ..., y_n):

βˆ‘_{i=1}^{n} q_i g(y_i) ≀ g(βˆ‘_{i=1}^{n} q_i y_i)

If we treat (q₁, ..., q_n) as a distribution q, and g(y) is the vector obtained by applying g coordinate-wise to y, then we can write the inequality as

E_q[g(y)] ≀ g(E_q[y])

If q_i = 1/n and the concave function is ln y, we have

(1/n) βˆ‘_{i=1}^{n} ln y_i ≀ ln((1/n) βˆ‘_{i=1}^{n} y_i)


SLIDE 8

Mutual Information

Proof of the theorem I(Y|Z) ≀ I(Y):

I(Y|Z) βˆ’ I(Y) = βˆ‘_{y,z} Q(Z = z) Q(Y = y|Z = z) logβ‚‚ [1/Q(Y = y|Z = z)] βˆ’ βˆ‘_y Q(Y = y) logβ‚‚ [1/Q(Y = y)]

Using Q(Y = y) = βˆ‘_z Q(Z = z|Y = y) Q(Y = y) = βˆ‘_z Q(Y = y ∩ Z = z):

= βˆ‘_{y,z} Q(Y = y ∩ Z = z) logβ‚‚ [Q(Y = y)/Q(Y = y|Z = z)]
= βˆ‘_{y,z} Q(Y = y ∩ Z = z) logβ‚‚ [Q(Y = y) Q(Z = z)/Q(Y = y ∩ Z = z)]
≀ logβ‚‚ [βˆ‘_{y,z} Q(Y = y ∩ Z = z) Β· Q(Y = y) Q(Z = z)/Q(Y = y ∩ Z = z)]   (Jensen's inequality, logβ‚‚ is concave)
= logβ‚‚ 1 = 0

Definition: The Mutual Information of two random variables Y and Z, written J(Y; Z), is

J(Y; Z) = I(Y) βˆ’ I(Y|Z) = I(Z) βˆ’ I(Z|Y) = J(Z; Y)

In the case that Y and Z are independent, as noted above, J(Y; Z) = I(Y) βˆ’ I(Y|Z) = 0.
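A minimal sketch computing J(Y; Z) two ways on a made-up joint distribution: the direct formula βˆ‘ Q(y,z) logβ‚‚[Q(y,z)/(Q(y)Q(z))] and the difference I(Y) βˆ’ I(Y|Z). The joint table is an illustrative assumption, not from the lecture.

```python
from math import log2

def entropy_bits(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# A made-up joint distribution Q(Y = y, Z = z), stored as {(y, z): prob}
joint = {("a", 0): 0.30, ("a", 1): 0.20,
         ("b", 0): 0.10, ("b", 1): 0.40}

q_y, q_z = {}, {}
for (y, z), p in joint.items():
    q_y[y] = q_y.get(y, 0.0) + p
    q_z[z] = q_z.get(z, 0.0) + p

# Direct form: J(Y; Z) = sum_{y,z} Q(y,z) * log2[ Q(y,z) / (Q(y) Q(z)) ]
mi_direct = sum(p * log2(p / (q_y[y] * q_z[z])) for (y, z), p in joint.items() if p > 0)

# Equivalent form: J(Y; Z) = I(Y) - I(Y|Z)
h_y = entropy_bits(q_y.values())
h_y_given_z = sum(q_z[z] * entropy_bits(joint[(y, z)] / q_z[z] for y in q_y) for z in q_z)

print(mi_direct, h_y - h_y_given_z)   # both ~0.125 bits, and mutual information is never negative
```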


SLIDE 9

Information Gain

  • Definition: the amount of information gained about a random variable or signal from observing another random variable.
  • We want to determine which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be learned.
  • Information gain tells us how important a given attribute of the feature vectors is.
  • We will use it to decide the ordering of attributes in the nodes of a non-linear classifier known as a decision tree.


SLIDE 10

Decision Tree

Each node checks one feature y_i:

  • Go left if y_i < threshold
  • Go right if y_i β‰₯ threshold
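A minimal sketch of this routing rule; the Node layout, field names, and the toy tree below are illustrative assumptions, not the lecture's code:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: int = 0                  # index i of the feature y_i checked at this node
    threshold: float = 0.0
    left: Optional["Node"] = None     # taken when y_i < threshold
    right: Optional["Node"] = None    # taken when y_i >= threshold
    label: Optional[str] = None       # set only at leaves

def predict(node: Node, y):
    """Route a feature vector y down the tree until a leaf is reached."""
    while node.label is None:
        node = node.left if y[node.feature] < node.threshold else node.right
    return node.label

# Tiny hand-built tree: split on y_0 at 2.5, then on y_1 at 1.0
tree = Node(feature=0, threshold=2.5,
            left=Node(label="class A"),
            right=Node(feature=1, threshold=1.0,
                       left=Node(label="class B"),
                       right=Node(label="class C")))
print(predict(tree, [3.0, 0.4]))   # -> "class B"
```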


SLIDE 11

Decision Tree

  • Every binary split of a node t generates two descendant nodes (t_Y, t_N) with subsets (Y_{t_Y}, Y_{t_N}) respectively.
  • The tree grows from the root node down to the leaves, generating subsets that are more class-homogeneous compared to the ancestor's subset Y_t.
  • A measure that quantifies node impurity, so that we split the node in the way that most decreases the overall impurity of the descendant nodes w.r.t. the ancestor's impurity, is

J(t) = βˆ’ βˆ‘_{i=1}^{M} Q(x_i|t) logβ‚‚ Q(x_i|t)

Q(x_i|t): the probability that a vector in the subset Y_t associated with node t belongs to class x_i (out of M classes).


SLIDE 12

Decision Tree

  • If all class probabilities are equal to 1/M, the impurity is maximal (high impurity).
  • If all data belong to a single class, J(t) = βˆ’1 Β· logβ‚‚ 1 = 0.
  • Information gain: measures how good a split is via the decrease in node impurity

Ξ”J(t) = J(t) βˆ’ (N_{t_Y}/N_t) J(t_Y) βˆ’ (N_{t_N}/N_t) J(t_N)

J(t_Y): the impurity of t_Y; N_{t_Y}, N_{t_N} and N_t are the numbers of training vectors in the corresponding subsets.

Goal: from the set of candidate questions, adopt the one that performs the split leading to the highest decrease of impurity.
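A minimal sketch of Ξ”J(t) for a threshold question "is y_i < ΞΈ?", scanning a few candidate thresholds on toy data; the data, candidate thresholds, and function names are all illustrative assumptions:

```python
from math import log2
from collections import Counter

def impurity(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values()) if n else 0.0

def impurity_decrease(feature, labels, threshold):
    """Delta J for the split 'go left if feature < threshold, else go right'."""
    left = [c for f, c in zip(feature, labels) if f < threshold]
    right = [c for f, c in zip(feature, labels) if f >= threshold]
    n = len(labels)
    return impurity(labels) - len(left) / n * impurity(left) - len(right) / n * impurity(right)

# Toy data: one feature, two classes; pick the candidate threshold with maximal gain
feature = [1.0, 1.5, 2.0, 3.0, 3.5, 4.0]
labels  = ["A", "A", "A", "B", "B", "B"]
candidates = [1.25, 1.75, 2.5, 3.25, 3.75]
best = max(candidates, key=lambda th: impurity_decrease(feature, labels, th))
print(best, impurity_decrease(feature, labels, best))   # 2.5 gives gain 1.0 (a perfect split)
```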


SLIDE 13

Decision Tree


  • Entropy = 0 if all samples are in the same class
  • Entropy is largest if Q(1) = Β· Β· Β· = Q(N), i.e. all classes are equally likely

Choose the split that gives the maximal information gain.
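As a sketch of this selection step for categorical attributes (the toy dataset, attribute names, and helper functions are made up for illustration): compute the information gain of splitting on each candidate attribute and keep the one with the largest gain.

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Gain of splitting on a categorical attribute: I(parent) - sum_v (N_v/N) I(child_v)."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(row[attribute] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attribute] == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

# Toy dataset with two candidate attributes
rows = [{"outlook": "sunny", "windy": True},  {"outlook": "sunny", "windy": False},
        {"outlook": "rainy", "windy": True},  {"outlook": "rainy", "windy": False}]
labels = ["no", "no", "yes", "yes"]
best = max(["outlook", "windy"], key=lambda a: information_gain(rows, labels, a))
print(best)   # "outlook": it separates the classes perfectly, so it has maximal gain
```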

SLIDE 14

KL Divergence

  • Consider some unknown distribution q(y), and suppose that we have modelled it using an approximate distribution r(y). The average additional amount of information required to specify the value of y as a result of using r(y) instead of q(y) is

KL(q βˆ₯ r) = βˆ’ ∫ q(y) ln{r(y)/q(y)} dy

This is known as the relative entropy or Kullback-Leibler divergence.

  • The KL divergence is not a symmetric quantity: in general KL(q βˆ₯ r) β‰  KL(r βˆ₯ q).
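A minimal discrete-case sketch showing both the computation and the asymmetry (the two distributions are chosen arbitrarily for illustration):

```python
from math import log

def kl_divergence(q, r):
    """KL(q || r) in nats for discrete distributions given as lists of probabilities."""
    return sum(qi * log(qi / ri) for qi, ri in zip(q, r) if qi > 0)

q = [0.1, 0.4, 0.5]
r = [0.8, 0.1, 0.1]
print(kl_divergence(q, r))   # ~1.15 nats
print(kl_divergence(r, q))   # ~1.36 nats: a different value, so KL is not symmetric
```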


SLIDE 15

KL Divergence

  • We show that KL(q βˆ₯ r) β‰₯ 0, with equality if and only if q(y) = r(y).
  • A function g is convex if it has the property that every chord lies on or above the function. Any value of y in the interval from y = b to y = c can be written in the form ΞΌb + (1 βˆ’ ΞΌ)c where 0 ≀ μ ≀ 1.
  • Convexity of g is then expressed as

g(ΞΌb + (1 βˆ’ ΞΌ)c) ≀ ΞΌ g(b) + (1 βˆ’ ΞΌ) g(c)

Using the induction proof technique, this extends to Jensen's inequality, g(βˆ‘_i q_i y_i) ≀ βˆ‘_i q_i g(y_i) for convex g. Applying it with the convex function βˆ’ln y, the KL divergence becomes

KL(q βˆ₯ r) = βˆ’ ∫ q(y) ln{r(y)/q(y)} dy β‰₯ βˆ’ ln ∫ q(y) [r(y)/q(y)] dy = βˆ’ ln ∫ r(y) dy = 0


SLIDE 16

KL Divergence

  • We can minimize the KL divergence with respect to the parameters ΞΈ of r:

arg min_ΞΈ KL(Q βˆ₯ r_ΞΈ)

  • If Q(y) is a bimodal distribution and we try to approximate Q with a Gaussian distribution using this KL divergence, we get mean-seeking behaviour: the approximate distribution r_ΞΈ must cover all the modes and regions of high probability in Q.
