Entropy & Information, Jilles Vreeken, 29 May 2015 (PowerPoint presentation)



SLIDE 1

Entropy & Information

Jilles Vreeken

29 May 2015

SLIDE 2

Question of the day

What is information?

(and what do talking drums have to do with it?)

SLIDE 3

Bits and Pieces

What are

 information
 a bit
 entropy
 mutual information
 divergence
 information theory
 …

SLIDE 4

Information Theory

Field founded by Claude Shannon in 1948 with ‘A Mathematical Theory of Communication’; a branch of statistics that is essentially about uncertainty in communication: not what you say, but what you could say

SLIDE 5

The Big Insight

Communication is a series of discrete messages

Each message reduces the uncertainty of the recipient about a) the series and b) that message; by how much it does so is the amount of information

SLIDE 6

Uncertainty

Shannon showed that uncertainty can be quantified, linking physical entropy to messages and defined the entropy of a discrete random variable 𝑌 as

𝐼(𝑌) = − ∑𝑗 𝑄(𝑦𝑗) log 𝑄(𝑦𝑗)

SLIDE 7

Optimal prefix-codes

Shannon showed that uncertainty can be quantified, linking physical entropy to messages. A (key) result of Shannon entropy is that

−log2 𝑄(𝑦𝑗)

gives the length in bits of the optimal prefix code for a message 𝑦𝑗

SLIDE 8

Codes and Lengths

A code 𝐷 maps a set of messages 𝑌 to a set of code words 𝑍. 𝑀𝐷(⋅) is a code length function for 𝐷, with 𝑀𝐷(𝑦) = |𝐷(𝑦)| the length in bits of the code word 𝐷(𝑦) ∈ 𝑍 that 𝐷 assigns to symbol 𝑦 ∈ 𝑌.

SLIDE 9

Efficiency

Not all codes are created equal. Let 𝐷1 and 𝐷2 be two codes for a set of messages 𝑌

1. We call 𝐷1 more efficient than 𝐷2 if for all 𝑦 ∈ 𝑌, 𝑀1(𝑦) ≤ 𝑀2(𝑦), while for at least one 𝑦 ∈ 𝑌, 𝑀1(𝑦) < 𝑀2(𝑦)

2. We call a code 𝐷 for set 𝑌 complete if there does not exist a code 𝐷′ that is more efficient than 𝐷

A code is complete when it does not waste any bits

SLIDE 10

The Most Important Slide

We only care about code lengths

SLIDE 11

The Most Important Slide

Actual code words are of no interest to us whatsoever.

SLIDE 12

The Most Important Slide

Our goal is to measure complexity, not to instantiate an actual compressor

SLIDE 13

My First Code

Let us consider a sequence 𝑇 over a discrete alphabet 𝑌 = {𝑦1, 𝑦2, … , 𝑦𝑛}.

As code 𝐷 for 𝑇 we can instantiate a block code, identifying the value of 𝑡𝑗 ∈ 𝑇 by an index over 𝑌, which requires a constant number of log2|𝑌| bits per message in 𝑇, i.e., 𝑀(𝑦𝑗) = log2|𝑌|. We can always instantiate a prefix-free code with code words of lengths 𝑀(𝑦𝑗) = log2|𝑌|.
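As a concrete illustration (not from the slides; the alphabet below is made up), a block code simply spends the same number of bits on every message, no matter how frequent it is:

from math import ceil, log2

Y = ["a", "b", "c", "d", "e"]                  # hypothetical alphabet, |Y| = 5
width = ceil(log2(len(Y)))                     # 3 bits for every message
block_code = {y: format(i, f"0{width}b") for i, y in enumerate(Y)}
print(block_code)   # {'a': '000', 'b': '001', 'c': '010', 'd': '011', 'e': '100'}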

SLIDE 14

Codes in a Tree

[figure: a binary code tree, root at the top, leaves labeled with the code words 00, 01, 10, 11]

SLIDE 15

Beyond Uniform

What if we know the distribution 𝑄(𝑦𝑗 ∈ 𝑌) over 𝑇 and it is not uniform? We do not want to waste any bits, so using block codes is a bad idea. We do not want to introduce any undue bias, so we want an efficient code that is uniquely decodable without having to use arbitrary-length stop-words. We want an optimal prefix-code.

SLIDE 16

Prefix Codes

A code 𝐷 is a prefix code iff there is no code word 𝐷(𝑦) that is an extension of another code word 𝐷(𝑦′). Or, in other words, 𝐷 defines a binary tree with the leaves as the code words. How do we find the optimal tree?

[figure: binary code tree, root at the top, code words at the leaves]

SLIDE 17

Shannon Entropy

Let 𝑄(𝑦𝑗) be the probability of 𝑦𝑗 ∈ 𝑌 in 𝑇, then

𝐼(𝑇) = − ∑𝑦𝑗∈𝑌 𝑄(𝑦𝑗) log 𝑄(𝑦𝑗)

is the Shannon entropy of 𝑇 (wrt 𝑌)

(see Shannon 1948)

𝑄(𝑦𝑗) is the ‘weight’: how often we see 𝑦𝑗
−log 𝑄(𝑦𝑗) is the number of bits needed to identify 𝑦𝑗 under 𝑄
𝐼(𝑇) is the average number of bits needed per message 𝑡𝑗 ∈ 𝑇
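A minimal sketch of this definition in Python (my own example, not part of the slides): compute the empirical distribution 𝑄 over a sequence 𝑇 and sum −𝑄(𝑦) log2 𝑄(𝑦):

from collections import Counter
from math import log2

def shannon_entropy(T):
    """Entropy of sequence T, in bits per message, under its empirical distribution."""
    n = len(T)
    return -sum((c / n) * log2(c / n) for c in Counter(T).values())

print(shannon_entropy("abcdabcdabcd"))   # 2.0 bits: uniform over 4 symbols
print(shannon_entropy("aaaaaaab"))       # ~0.544 bits: skewed, much cheaper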

SLIDE 18

Optimal Prefix Code Lengths

What if the distribution of 𝑌 in 𝑇 is not uniform? Let 𝑄(𝑦𝑗) be the probability of 𝑦𝑗 in 𝑇, then 𝑀(𝑦𝑗) = −log 𝑄(𝑦𝑗) is the length of the optimal prefix code for message 𝑦𝑗, knowing distribution 𝑄

(see Shannon 1948)

SLIDE 19

Kraft's Inequality

For any code 𝐷 for finite alphabet 𝑌 = {𝑦1, … , 𝑦𝑛}, the code word lengths 𝑀𝐷(⋅) must satisfy the inequality

∑𝑦𝑗∈𝑌 2^(−𝑀(𝑦𝑗)) ≤ 1.

a) when a set of code word lengths satisfies the inequality, there exists a prefix code with these code word lengths,
b) when it holds with strict equality, the code is complete: it does not waste any part of the coding space,
c) when it does not hold, the code is not uniquely decodable
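A quick way to check the three cases numerically (a sketch with made-up code word lengths, not an example from the slides):

def kraft_sum(lengths):
    """Sum of 2^(-M(y)) over all code word lengths."""
    return sum(2 ** -m for m in lengths)

print(kraft_sum([1, 2, 3, 3]))   # 1.0  -> a prefix code exists and is complete
print(kraft_sum([2, 2, 3, 3]))   # 0.75 -> a prefix code exists, but wastes coding space
print(kraft_sum([1, 1, 2]))      # 1.25 -> violates the inequality: not uniquely decodable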

SLIDE 20

What's a bit?

Binary digit

 smallest and most fundamental piece of information
 yes or no
 invented by Claude Shannon in 1948
 named by John Tukey

Bits have been in use for a long, long time, though

 Punch cards (1725, 1804)
 Morse code (1844)
 African ‘talking drums’

SLIDE 21

Morse code

SLIDE 22

Natural language

Punishes ‘bad’ redundancy: often-used words are shorter

Rewards useful redundancy: cotxent alolws mishaireng/raeding (context allows mishearing/misreading)

African Talking Drums have used this for efficient, fast, long-distance communication

 mimic vocalized sounds: tonal language
 very reliable means of communication

SLIDE 23

Measuring bits

How much information does a given string carry? How many bits?

Say we have a binary string of 10000 ‘messages’

1) 00010001000100010001…000100010001000100010001000100010001
2) 01110100110100100110…101011101011101100010110001011011100
3) 00011000001010100000…001000100001000000100011000000100110
4) 0000000000000000000000000000100000000000000000000…0000000

Obviously, all four are 10000 bits long.

But, are they worth those 10000 bits?

SLIDE 24

So, how many bits?

Depends on the encoding! What is the best encoding?

 one that takes the entropy of the data into account
 things that occur often should get a short code
 things that occur seldom should get a long code

An encoding matching Shannon Entropy is optimal

SLIDE 25

Tell us! How many bits? Please?

In our simplest example we have

𝑄(1) = 1/100000
𝑄(0) = 99999/100000

𝑀(1) = −log(1/100000) = 16.61
𝑀(0) = −log(99999/100000) = 0.0000144

So, knowing 𝑄, our string contains 1 ∗ 16.61 + 99999 ∗ 0.0000144 = 18.049 bits of information
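The same calculation as a small script (a sketch; base-2 logarithms are assumed, as in the slides):

from math import log2

n, ones = 100000, 1
q1, q0 = ones / n, (n - ones) / n
print(-log2(q1))                                    # ~16.61 bits to encode a '1'
print(-log2(q0))                                    # ~0.0000144 bits to encode a '0'
print(ones * -log2(q1) + (n - ones) * -log2(q0))    # ~18.05 bits in total
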
SLIDE 26

Optimal…

Shannon lets us calculate optimal code lengths

 what about actual codes? 0.0000144 bits?
 Shannon and Fano invented a near-optimal encoding in 1948, within one bit of the optimum, but not of lowest expected length

Fano gave his students an option: regular exam, or invent a better encoding

 David Huffman didn’t like exams; he invented Huffman-codes (1952)
 optimal for symbol-by-symbol encoding with fixed probabilities

(arithmetic coding is overall optimal, Rissanen 1976)
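A minimal sketch of Huffman's idea (my own illustration, not the code from the lecture): repeatedly merge the two least probable partial trees, prepending '0' to one side and '1' to the other:

import heapq
from itertools import count

def huffman_code(probs):
    """Map each symbol to a prefix-free code word, given symbol probabilities."""
    tick = count()                      # tie-breaker so the heap never compares dicts
    heap = [(q, next(tick), {y: ""}) for y, q in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        q1, _, c1 = heapq.heappop(heap)
        q2, _, c2 = heapq.heappop(heap)
        merged = {y: "0" + w for y, w in c1.items()}
        merged.update({y: "1" + w for y, w in c2.items()})
        heapq.heappush(heap, (q1 + q2, next(tick), merged))
    return heap[0][2]

print(huffman_code({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}))
# code word lengths 1, 2, 3, 3 -- exactly -log2 Q(y) for these probabilities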

SLIDE 27

Optimality

To encode optimally, we need the optimal probabilities. What happens if we don't have them?

SLIDE 28

Measuring Divergence

Kullback-Leibler divergence from 𝑅 to 𝑄, denoted by 𝐸(𝑄 ‖ 𝑅), measures the number of bits we ‘waste’ when we use 𝑅 while 𝑄 is the ‘true’ distribution

𝐸(𝑄 ‖ 𝑅) = ∑𝑗 𝑄(𝑗) log ( 𝑄(𝑗) / 𝑅(𝑗) )
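A small sketch of the same quantity (the distributions below are invented for illustration; base-2 logs, so the result is in bits):

from math import log2

def kl_divergence(Q, R):
    """Bits wasted per message by coding with R when Q is the true distribution."""
    return sum(q * log2(q / R[j]) for j, q in Q.items() if q > 0)

Q = {"a": 0.5, "b": 0.5}
R = {"a": 0.9, "b": 0.1}
print(kl_divergence(Q, R))   # ~0.74 bits wasted per message
print(kl_divergence(Q, Q))   # 0.0 -- no waste when the model is exact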

SLIDE 29

Multivariate Entropy

So far we’ve been thinking about a single sequence of messages. How does entropy work for multivariate data? Simple!

SLIDE 30

Towards Mutual Information

Conditional Entropy is defined as

𝐼(𝑌 ∣ 𝑍) = ∑𝑧∈𝑍 𝑄(𝑧) 𝐼(𝑌 ∣ 𝑍 = 𝑧)

‘average number of bits needed for message 𝑦 ∈ 𝑌, knowing 𝑍’

Not symmetric

SLIDE 31

Mutual Information

the amount of information shared between two variables 𝑌 and 𝑍

𝐽(𝑌, 𝑍) = 𝐼(𝑌) − 𝐼(𝑌 ∣ 𝑍) = 𝐼(𝑍) − 𝐼(𝑍 ∣ 𝑌)
        = ∑𝑦∈𝑌 ∑𝑧∈𝑍 𝑄(𝑦, 𝑧) log ( 𝑄(𝑦, 𝑧) / (𝑄(𝑦) 𝑄(𝑧)) )

high 𝐽(𝑌, 𝑍) implies correlation, low 𝐽(𝑌, 𝑍) implies independence

Information is symmetric!
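A sketch of the double-sum form above, computed from a joint distribution 𝑄(𝑦, 𝑧) (the toy joints are my own):

from math import log2

def mutual_information(joint):
    """J(Y, Z) in bits, given a dict {(y, z): Q(y, z)}."""
    Qy, Qz = {}, {}
    for (y, z), q in joint.items():
        Qy[y] = Qy.get(y, 0.0) + q
        Qz[z] = Qz.get(z, 0.0) + q
    return sum(q * log2(q / (Qy[y] * Qz[z]))
               for (y, z), q in joint.items() if q > 0)

print(mutual_information({(0, 0): 0.5, (1, 1): 0.5}))    # 1.0: fully dependent
print(mutual_information({(0, 0): 0.25, (0, 1): 0.25,
                          (1, 0): 0.25, (1, 1): 0.25}))  # 0.0: independent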

SLIDE 32

Information Gain

(small aside)

Entropy and KL are used in decision trees. What is the best split in a tree?

One that results in label distributions in the sub-nodes that are as homogeneous as possible: minimal entropy

How do we compare over multiple options?

𝐽(𝑈, 𝑏) = 𝐼(𝑈) − 𝐼(𝑈 ∣ 𝑏)
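A sketch of how this is used to score a split (the labels and splits below are made up for illustration):

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    """I(U) - I(U|b): entropy of the node minus the weighted entropy of its sub-nodes."""
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups)

labels = ["+", "+", "+", "-", "-", "-"]
print(information_gain(labels, [["+", "+", "+"], ["-", "-", "-"]]))  # 1.0: perfect split
print(information_gain(labels, [["+", "-", "+"], ["-", "+", "-"]]))  # ~0.08: poor split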

SLIDE 33

Low-Entropy Sets

Goal: find sets of attributes that interact strongly
Task: mine all sets of attributes such that the entropy over their value instantiations is ≤ 𝜏

Theory of Computation   Probability Theory 1   count
No                      No                     1887
Yes                     No                      156
No                      Yes                     143
Yes                     Yes                     219

entropy: 1.087 bits

(Heikinheimo et al. 2007)
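The 1.087 bits above can be checked directly from the counts in the table (a sketch, assuming the entropy is taken over the four value combinations):

from math import log2

counts = [1887, 156, 143, 219]        # No/No, Yes/No, No/Yes, Yes/Yes
n = sum(counts)
print(-sum(c / n * log2(c / n) for c in counts))   # ~1.087 bits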

SLIDE 34

Low-Entropy Sets

Maturity Test   Software Engineering   Theory of Computation   count
No              No                     No                      1570
Yes             No                     No                        79
No              Yes                    No                        99
Yes             Yes                    No                       282
No              No                     Yes                       28
Yes             No                     Yes                      164
No              Yes                    Yes                       13
Yes             Yes                    Yes                      170

(Heikinheimo et al. 2007)

SLIDE 35

Low-Entropy Trees

[figure: example low-entropy tree over the attributes Scientific Writing, Maturity Test, Software Engineering Project, Theory of Computation, Probability Theory 1]

(Heikinheimo et al. 2007)

Define the entropy of a tree 𝑈 = (𝐵, 𝑈1, … , 𝑈𝑙) as

𝐼𝑉(𝑈) = 𝐼(𝐵 ∣ 𝐴1, … , 𝐴𝑘) + ∑𝑘 𝐼𝑉(𝑈𝑘)

The tree 𝑈 for an itemset 𝐵 minimizing 𝐼𝑉(𝑈) identifies directional explanations!

𝐼(𝐵) ≤ 𝐼(𝑇𝑇 ∣ 𝑁𝑈, 𝑇𝑇𝑄, 𝑈𝐷, 𝑄𝑈) + 𝐼(𝑁𝑈 ∣ 𝑇𝑇𝑄, 𝑈𝐷, 𝑄𝑄) + 𝐼(𝑇𝑇𝑄) + 𝐼(𝑈𝐷 ∣ 𝑄𝑈) + 𝐼(𝑄𝑈)

SLIDE 36

Entropy for Continuous Values

So far we only considered discrete-valued data. Lots of data is continuous-valued (or is it?). What does this mean for entropy?

SLIDE 37

Differential Entropy

ℎ(𝑌) = − ∫𝑌 𝑔(𝑦) log 𝑔(𝑦) 𝑑𝑦

(Shannon, 1948)

SLIDE 38

Differential Entropy

How about… the entropy of Uniform(0, 1/2)?

ℎ(𝑌) = − ∫[0, 1/2] 2 log 2 𝑑𝑦 = − log 2

Hm, negative?
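A numeric check of the calculation above (a sketch: a simple Riemann sum over the density 𝑔(𝑦) = 2 of Uniform(0, 1/2), with base-2 logs):

from math import log2

steps = 100000
dy = 0.5 / steps
# g(y) = 2 on [0, 1/2], so the integrand -g(y) * log2(g(y)) is constant
h = sum(-2 * log2(2) * dy for _ in range(steps))
print(h)   # -1.0 bit: differential entropy can indeed be negative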

SLIDE 39

Differential Entropy

In discrete data the step size ‘dx’ is trivial. What is its effect here?

ℎ(𝑌) = − ∫𝑌 𝑔(𝑦) log 𝑔(𝑦) 𝑑𝑦

(Shannon, 1948)

SLIDE 40

Impossibru?

No. But you’ll have to wait till next week for the answer.

SLIDE 41

Conclusions

Information is related to the reduction in uncertainty of what you could say

Entropy is a core aspect of information theory

 lots of nice properties
 optimal prefix-code lengths, mutual information, etc.

Entropy for continuous data is… more tricky

 differential entropy is a bit problematic

SLIDE 42

Thank you!

Information is related to the reduction in uncertainty of what you could say

Entropy is a core aspect of information theory

 lots of nice properties
 optimal prefix-code lengths, mutual information, etc.

Entropy for continuous data is… more tricky

 differential entropy is a bit problematic