Information & Correlation - Jilles Vreeken - PowerPoint PPT Presentation



SLIDE 1

Information & Correlation

Jilles Vreeken

11 June 2014 (TADA)

SLIDE 2

Questions of the day

What is information? How can we measure correlation? And what do talking drums have to do with this?

SLIDE 3

What is

  • information
  • a bit
  • entropy
  • information theory
  • compression
  • …

Bits and Pieces

SLIDE 4

Branch of science concerned with measuring information. Field founded by Claude Shannon in 1948 with 'A Mathematical Theory of Communication'. Information Theory is essentially about uncertainty in communication: not what you say, but what you could say.

Information Theory

SLIDE 5

Communication is a series of discrete messages. Each message reduces the uncertainty of the recipient about a) the series and b) that message.

By how much? That is the amount of information.

The Big Insight

SLIDE 6

Shannon showed that uncertainty can be quantified, linking physical entropy to messages. Shannon defined the entropy of a discrete random variable Y as

H(Y) = -\sum_j P(y_j) \log P(y_j)

Uncertainty
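
Not part of the slides: a minimal Python sketch of this definition, with the function name and the base-2 logarithm as my own choices.

    import math

    def entropy(probs):
        """Shannon entropy, in bits, of a discrete distribution given as a list of probabilities."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # A fair coin carries 1 bit per toss; a biased coin carries less.
    print(entropy([0.5, 0.5]))   # 1.0
    print(entropy([0.9, 0.1]))   # ~0.47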

SLIDE 7

Shannon showed that uncertainty can be quantified, linking physical entropy to messages. A side-result of Shannon entropy is that

-\log_2 P(y_j)

gives the length in bits of the optimal prefix code for a message y_j.

Optimal prefix-codes
SLIDE 8

Prefix(-free) code:

a code D where no code word d ∈ D is the prefix of another e ∈ D with d ≠ e. Essentially, a prefix code defines a tree, where each code word corresponds to a path from the root to a leaf.

What is a prefix code?
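
Not from the slides: a tiny sketch (names mine) that checks the prefix-free property for a small binary code.

    def is_prefix_free(code_words):
        """True if no code word is a proper prefix of another code word."""
        return not any(a != b and b.startswith(a) for a in code_words for b in code_words)

    print(is_prefix_free(["0", "10", "110", "111"]))  # True: a valid prefix code
    print(is_prefix_free(["0", "01", "11"]))          # False: "0" is a prefix of "01"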

SLIDE 9

Binary digit

  • smallest and most fundamental piece of information
  • yes or no
  • invented by Claude Shannon in 1948
  • named by John Tukey

Bits have been in use for a long, long time, though:

  • punch cards (1725, 1804)
  • Morse code (1844)
  • African 'talking drums'

What's a bit?

SLIDE 10

Morse code

SLIDE 11

Punishes 'bad' redundancy:

  • often-used words are shorter

Rewards useful redundancy:

  • cotxent alolws mishaireng/raeding

African talking drums have used this for efficient, fast, long-distance communication:

  • mimic vocalized sounds: tonal language
  • very reliable means of communication

Natural language

SLIDE 12

How much information does a given string carry? How many bits?

Say we have a binary string of 10000 'messages':

1) 00010001000100010001…000100010001000100010001000100010001
2) 01110100110100100110…101011101011101100010110001011011100
3) 00011000001010100000…001000100001000000100011000000100110
4) 0000000000000000000000000000100000000000000000000…0000000

Obviously, they are 10000 bits long. But are they worth those 10000 bits?

Measuring bits

SLIDE 13

So, how many bits?

Depends on the encoding! What is the best encoding?

  • one that takes the entropy of the data into account
  • things that occur often should get a short code
  • things that occur seldom should get a long code

An encoding matching Shannon entropy is optimal.

SLIDE 14

Tell us! How many bits? Please?

In our simplest example we have

P(1) = 1/100000
P(0) = 99999/100000

|code(1)| = -\log_2(1/100000) = 16.61 bits
|code(0)| = -\log_2(99999/100000) = 0.0000144 bits

So, knowing P, our string contains 1 * 16.61 + 99999 * 0.0000144 ≈ 18.05 bits of information.
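
Not from the slides: a short sketch that reproduces this calculation; the variable names are mine.

    import math

    n, ones = 100000, 1              # string length and number of 1s, as in the example
    p1, p0 = ones / n, (n - ones) / n

    bits_per_one  = -math.log2(p1)   # ~16.61 bits
    bits_per_zero = -math.log2(p0)   # ~0.0000144 bits

    total = ones * bits_per_one + (n - ones) * bits_per_zero
    print(total)                     # ~18.05 bits, far below the raw 100000 bits
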
SLIDE 15

Shannon lets us calculate optimal code lengths

  • what about actual codes? 0.0000144 bits?
  • Shannon and Fano invented a near-optimal encoding in 1948: within one bit of the optimum, but not of lowest expected length

Fano gave students an option: regular exam, or invent a better encoding

  • David Huffman didn't like exams; invented Huffman codes (1952)
  • optimal for symbol-by-symbol encoding with fixed probabilities

(arithmetic coding is overall optimal, Rissanen 1976)

Optimal…
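
The slide only names Huffman codes; below is a minimal sketch of the standard construction (using Python's heapq; all names are mine), not the presenter's own implementation.

    import heapq

    def huffman_code(freqs):
        """Build a prefix code by repeatedly merging the two least frequent subtrees."""
        heap = [[f, i, {s: ""}] for i, (s, f) in enumerate(freqs.items())]
        heapq.heapify(heap)
        while len(heap) > 1:
            lo = heapq.heappop(heap)
            hi = heapq.heappop(heap)
            # codes in the lighter subtree get a leading 0, in the heavier a leading 1
            merged = {s: "0" + c for s, c in lo[2].items()}
            merged.update({s: "1" + c for s, c in hi[2].items()})
            heapq.heappush(heap, [lo[0] + hi[0], lo[1], merged])
        return heap[0][2]

    # frequent symbols end up with short code words, rare ones with long code words
    print(huffman_code({"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}))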

SLIDE 16

To encode optimally, we need optimal probabilities. What happens if we don't?

Kullback-Leibler divergence, KL(q || r), measures the bits we 'waste' by encoding with r while q is the 'true' distribution:

KL(q \| r) = \sum_j q(j) \log \frac{q(j)}{r(j)}

Optimality
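
Not from the slides: a small sketch of this definition (names mine); the divergence is zero exactly when the two distributions coincide.

    import math

    def kl_divergence(q, r):
        """Extra bits paid per message when encoding with r while q is the true distribution."""
        return sum(qi * math.log2(qi / ri) for qi, ri in zip(q, r) if qi > 0)

    print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # 0.0: no waste
    print(kl_divergence([0.9, 0.1], [0.5, 0.5]))  # ~0.53 bits wasted per message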

SLIDE 17

So far we've been thinking about a single sequence of messages. How does entropy work for multivariate data? Simple!

Multivariate Entropy

SLIDE 18

Entropy, for when we, like, know stuff:

H(Y | Z) = \sum_{z \in dom(Z)} P(z) \, H(Y | Z = z)

When is this useful?

Conditional Entropy
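
Not from the slides: a minimal sketch (names mine) that computes conditional entropy from a joint probability table with rows indexed by y and columns by z.

    import math

    def H(probs):
        """Shannon entropy in bits."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    def conditional_entropy(joint):
        """H(Y | Z) from a joint table joint[y][z] = P(Y = y, Z = z)."""
        h = 0.0
        for z in range(len(joint[0])):
            p_z = sum(row[z] for row in joint)                 # marginal P(Z = z)
            if p_z > 0:
                h += p_z * H([row[z] / p_z for row in joint])  # P(Z = z) * H(Y | Z = z)
        return h

    # Y is fully determined by Z, so conditioning removes all uncertainty.
    print(conditional_entropy([[0.5, 0.0],
                               [0.0, 0.5]]))                   # 0.0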

SLIDE 19

Mutual Information

the amount of information shared between two variables Y and Z

I(Y; Z) = H(Y) - H(Y | Z) = H(Z) - H(Z | Y)
        = \sum_{y \in Y} \sum_{z \in Z} P(y, z) \log \frac{P(y, z)}{P(y) P(z)}

high ↔ correlation
low ↔ independence

Mutual Information and Correlation
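
Not from the slides: the double-sum form translates directly into code; this sketch (names mine) uses the same joint-table convention as above.

    import math

    def mutual_information(joint):
        """I(Y; Z) in bits from a joint table joint[y][z] = P(Y = y, Z = z)."""
        p_y = [sum(row) for row in joint]
        p_z = [sum(col) for col in zip(*joint)]
        return sum(p * math.log2(p / (p_y[y] * p_z[z]))
                   for y, row in enumerate(joint)
                   for z, p in enumerate(row) if p > 0)

    print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # 0.0: independent
    print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))      # 1.0: fully dependent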

SLIDE 20

(small aside)

Entropy and KL are used in decision trees. What is the best split in a tree?

  • one that results in label distributions in the sub-nodes that are as homogeneous as possible: minimal entropy

How do we compare over multiple options?

IG(U, b) = H(U) - H(U | b)

Information Gain
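
Not from the slides: a small sketch of information gain for one candidate split (counts and names are hypothetical).

    import math

    def H(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    def information_gain(parent_counts, child_counts):
        """H(labels before the split) minus the weighted entropy of each child node."""
        n = sum(parent_counts)
        h_parent = H([c / n for c in parent_counts])
        h_children = sum(sum(child) / n * H([c / sum(child) for c in child])
                         for child in child_counts)
        return h_parent - h_children

    # 10 positive / 10 negative labels, split into two fairly pure children.
    print(information_gain([10, 10], [[9, 1], [1, 9]]))  # ~0.53 bits gained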

SLIDE 21

Theory of Computation | Probability Theory 1 | count
No                    | No                   | 1887
Yes                   | No                   |  156
No                    | Yes                  |  143
Yes                   | Yes                  |  219

Low-Entropy Sets

(Heikinheimo et al. 2007)
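
Not from the slides: these counts make a concrete entropy calculation; the sketch below shows that this attribute pair is indeed 'low entropy' compared to the 2 bits a uniform distribution over the four cells would give.

    import math

    counts = [1887, 156, 143, 219]            # the four cells of the table above
    n = sum(counts)
    H = -sum(c / n * math.log2(c / n) for c in counts)
    print(H)                                  # ~1.09 bits, well below the 2-bit maximum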

SLIDE 22

Maturity Test | Software Engineering | Theory of Computation | count
No            | No                   | No                    | 1570
Yes           | No                   | No                    |   79
No            | Yes                  | No                    |   99
Yes           | Yes                  | No                    |  282
No            | No                   | Yes                   |   28
Yes           | No                   | Yes                   |  164
No            | Yes                  | Yes                   |   13
Yes           | Yes                  | Yes                   |  170

Low-Entropy Sets

(Heikinheimo et al. 2007)

SLIDE 23

Low-Entropy Trees

Tree over the courses: Scientific Writing, Maturity Test, Software Engineering Project, Theory of Computation, Probability Theory 1

(Heikinheimo et al. 2007)

SLIDE 24

So far we only considered discrete-valued data. Lots of data is continuous-valued (or is it?). What does this mean for entropy?

Entropy for Continuous-valued Data

SLIDE 25

h(Y) = -\int_{dom(Y)} f(y) \log f(y) \, dy

Differential Entropy

(Shannon, 1948)

SLIDE 26

How about… the entropy of Uniform(0, 1/2)?

h(Y) = -\int_0^{1/2} 2 \log 2 \, dy = -\log 2

Hm, negative?

Differential Entropy

SLIDE 27

In discrete data the step size 'dx' is trivial. What is its effect here?

h(Y) = -\int_{dom(Y)} f(y) \log f(y) \, dy

Differential Entropy

(Shannon, 1948)

SLIDE 28

Cumulative Distributions

SLIDE 29

We can define entropy for cumulative distribution functions!

h_{CE}(Y) = -\int_{dom(Y)} P(Y \le y) \log P(Y \le y) \, dy

As 0 \le P(Y \le y) \le 1, we obtain h_{CE}(Y) \ge 0 (!)

Cumulative Entropy

(Rao et al, 2004, 2005)

SLIDE 30

How do we compute it in practice? Easy. Let Y_1 \le \cdots \le Y_n be i.i.d. random samples of the continuous random variable Y.

h_{CE}(Y) = -\sum_{j=1}^{n-1} (Y_{j+1} - Y_j) \, \frac{j}{n} \log \frac{j}{n}

Cumulative Entropy

(Rao et al, 2004, 2005)
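
Not from the slides: a minimal sketch of this empirical estimator (names mine; natural logarithm assumed, as the slide does not fix the base).

    import math, random

    def cumulative_entropy(samples):
        """Empirical cumulative entropy from i.i.d. samples of a continuous variable."""
        y = sorted(samples)
        n = len(y)
        return -sum((y[j + 1] - y[j]) * (j + 1) / n * math.log((j + 1) / n)
                    for j in range(n - 1))

    # For Uniform(0,1) the analytic value is -∫ y ln y dy = 0.25.
    random.seed(0)
    print(cumulative_entropy([random.uniform(0, 1) for _ in range(10000)]))  # ~0.25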

SLIDE 31

Tricky. Very tricky. Too tricky for now.

Multivariate Cumulative Entropy?

(Nguyen et al, 2013, 2014)

SLIDE 32

Given continuous-valued data over a set of attributes Y, we want to identify Z ⊂ Y such that Z has high mutual information. Can we do this with cumulative entropy?

Cumulative Mutual Information

SLIDE 33

Identifying Interacting Subspaces

SLIDE 34

First things first. We need

h_{CE}(Y | Z) = \int h_{CE}(Y | z) \, p(z) \, dz

which, in practice, means

h_{CE}(Y | Z) = \sum_{z \in Z} h_{CE}(Y | z) \, p(z)

with z just some (groups of) data points, and p(z) = |z| / n.

How do we choose z? Such that h_{CE}(Y | Z) is minimal.

Multivariate Cumulative Entropy

SLIDE 35

We cannot (realistically) calculate h_{CE}(Y_1, \ldots, Y_d) in one go, but… mutual information has this nice factorization property… so, what we can do is

\sum_{j=2}^{d} h_{CE}(Y_j) - \sum_{j=2}^{d} h_{CE}(Y_j | Y_1, \ldots, Y_{j-1})

Entrez, CMI

SLIDE 36

Super simple: Apriori-style

The CMI algorithm

SLIDE 37

CMI in action

SLIDE 38

Information is about uncertainty of what you could say. Entropy is a core aspect of information theory:

  • lots of nice properties
  • optimal prefix-code lengths, mutual information, etc.

Entropy for continuous data is… more tricky:

  • differential entropy is a bit problematic
  • cumulative distributions provide a way out, but are mostly uncharted territory

Conclusions

SLIDE 39

Conclusions

Thank you!