Entropy & Information
Jilles Vreeken
29 May 2015
Question of the day: What is information? (and what do talking drums have to do with it?)

Bits and Pieces: What are … information, a bit, entropy, mutual information, divergence, information theory, …
Field founded by Claude Shannon in 1948 with ‘A Mathematical Theory of Communication’. A branch of statistics that is essentially about uncertainty in communication: not what you say, but what you could say.
Shannon showed that uncertainty can be quantified, linking physical entropy to messages, and defined the entropy of a discrete random variable X as

H(X) = − ∑_i P(x_i) log P(x_i)
A (key) result of Shannon entropy is that

− log₂ P(x_i)

gives the length in bits of the optimal prefix code for a message x_i.
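To make these two quantities concrete, here is a minimal Python sketch (the function names and the example distribution are mine, not from the slides): it computes the per-message code lengths −log₂ P(x) and the entropy as their P-weighted average.

```python
import math

def optimal_code_lengths(p):
    """Optimal prefix-code length -log2 P(x) for each message x, in bits."""
    return {x: -math.log2(px) for x, px in p.items() if px > 0}

def entropy(p):
    """Shannon entropy H(X) = -sum_x P(x) log2 P(x), in bits."""
    return sum(-px * math.log2(px) for px in p.values() if px > 0)

# example distribution (assumed for illustration)
P = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}
print(optimal_code_lengths(P))  # {'a': 1.0, 'b': 2.0, 'c': 3.0, 'd': 3.0}
print(entropy(P))               # 1.75 bits per message on average
```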
A code C maps a set of messages X to a set of code words Y. L_C(⋅) is the code length function for C, with L_C(x ∈ X) = |C(x) ∈ Y|, the length in bits of the code word y ∈ Y that C assigns to symbol x ∈ X.
Not all codes are created equal. Let C1 and C2 be two codes for a set of messages X.

1. We call C1 more efficient than C2 if for all x ∈ X, L1(x) ≤ L2(x), while for at least one x ∈ X, L1(x) < L2(x).
2. We call a code C for set X complete if there does not exist a code C′ that is more efficient than C. A code is complete when it does not waste any bits.
Let us consider a sequence S. As code C for S we can instantiate a block code, identifying the value of s_i ∈ S by an index over X, which requires a constant number of log₂|X| bits per message in S, i.e., L(x_i) = log₂|X|. We can always instantiate a prefix-free code with code words of lengths L(x_i) = log₂|X|.
What if we know the distribution P(x_i ∈ X) over S and it is not uniform? We do not want to waste any bits, so using block codes is a bad idea. We do not want to introduce any undue bias, so we want an efficient code that is uniquely decodable without having to use arbitrary-length stop-words. We want an optimal prefix-code.
A code C is a prefix code iff there is no code word C(x) that is an extension of another code word C(x′). Or, in other words, C defines a binary tree with the code words as the leaves. How do we find the optimal tree?
Let P(x_i) be the probability of x_i ∈ X in S; then

H(S) = − ∑_{x_i ∈ X} P(x_i) log P(x_i)

is the Shannon entropy of S (wrt X). (see Shannon 1948)

Here P(x_i) is the ‘weight’: how often x_i occurs; −log P(x_i) is the number of bits needed to identify x_i under P; and H(S) is the average number of bits needed per message s_i ∈ S.
What if the distribution of X in S is not uniform? Let P(x_i) be the probability of x_i in S; then L(x_i) = − log P(x_i) is the length of the optimal prefix code for message x_i knowing distribution P.

(see Shannon 1948)
Kraft’s inequality: for any uniquely decodable code C for a finite alphabet X = {x_1, …, x_n}, the code word lengths L_C(⋅) must satisfy

∑_{x_i ∈ X} 2^{−L(x_i)} ≤ 1.

a) When a set of code word lengths satisfies the inequality, there exists a prefix code with these code word lengths.
b) When it holds with equality, the code is complete: it does not waste any part of the coding space.
c) When it does not hold, the code is not uniquely decodable.
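As a quick illustration, a small sketch (the function name and the example lengths are mine) that checks a set of integer code word lengths against Kraft’s inequality.

```python
def kraft_sum(lengths):
    """Sum of 2^(-L(x)) over all code word lengths; a prefix code with these lengths exists iff it is <= 1."""
    return sum(2.0 ** -l for l in lengths)

print(kraft_sum([1, 2, 3, 3]))  # 1.0   -> complete, e.g. the codes 0, 10, 110, 111
print(kraft_sum([2, 2, 3]))     # 0.625 -> a prefix code exists, but coding space is wasted
print(kraft_sum([1, 1, 2]))     # 1.25  -> no uniquely decodable code with these lengths
```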
Bit: binary digit, the smallest and most fundamental piece of information: yes or no. Invented by Claude Shannon in 1948; the name was coined by John Tukey.
Bits have been in use for a long, long time, though: punch cards (1725, 1804), Morse code (1844), African ‘talking drums’.
Punishes ‘bad’ redundancy. Rewards useful redundancy: cotxent alolws mishaireng/raeding (the deliberately scrambled words stay readable because context adds redundancy). African talking drums have used this for efficient, fast, long-distance communication: they mimic vocalized sounds (a tonal language), a very reliable means of communication.
Say we have a binary string of 100,000 ‘messages’:
1) 00010001000100010001…000100010001000100010001000100010001
2) 01110100110100100110…101011101011101100010110001011011100
3) 00011000001010100000…001000100001000000100011000000100110
4) 0000000000000000000000000000100000000000000000000…0000000
How much information do they contain?
Depends on the encoding! What is the best encoding? One that takes the entropy of the data into account: things that occur often should get short codes, things that occur seldom should get long codes. An encoding matching the Shannon entropy is optimal.
In our simplest example we have P(1) = 1/100000 and P(0) = 99999/100000, so
|code(1)| = −log₂(1/100000) = 16.61 bits
|code(0)| = −log₂(99999/100000) = 0.0000144 bits
So, knowing P, our string contains 1 × 16.61 + 99999 × 0.0000144 = 18.049 bits.
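The arithmetic above is easy to check; a minimal sketch, assuming the string consists of 100,000 symbols of which exactly one is a 1 (variable names are mine).

```python
import math

n_ones, n_zeros = 1, 99_999
n = n_ones + n_zeros
p1, p0 = n_ones / n, n_zeros / n

bits_per_one  = -math.log2(p1)   # ~16.61 bits
bits_per_zero = -math.log2(p0)   # ~0.0000144 bits
total = n_ones * bits_per_one + n_zeros * bits_per_zero
print(bits_per_one, bits_per_zero, total)   # ~16.61, ~1.44e-05, ~18.05 bits in total
```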
Shannon lets us calculate optimal code lengths; what about actual codes? (0.0000144 bits?) Shannon and Fano invented a near-optimal encoding in 1948: within one bit of the optimum, but not of the lowest expected length.
Fano gave his students an option: take the regular exam, or invent a better encoding. David Huffman didn’t like exams, and invented Huffman codes (1952), which are optimal for symbol-by-symbol encoding with fixed probabilities.
(arithmetic coding is overall optimal, Rissanen 1976)
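A minimal sketch of Huffman’s construction (function names and the example are mine, not from the slides): it repeatedly merges the two least-frequent subtrees and returns only the resulting code word lengths.

```python
import heapq
from collections import Counter

def huffman_lengths(freqs):
    """Huffman code word lengths (in bits) for a symbol -> frequency mapping."""
    # heap items: (total frequency, tie-breaker, {symbol: depth so far})
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    if len(heap) == 1:          # degenerate case: a single symbol still needs one bit
        return {s: 1 for s in freqs}
    tie = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

print(huffman_lengths(Counter("abracadabra")))
# 'a' gets length 1; 'b', 'r', 'c', 'd' get length 3 (ties may be broken differently)
```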
To encode optimally, we need the optimal probabilities. What happens if we don’t have them?
The Kullback–Leibler divergence from Q to P, denoted D(P ‖ Q), measures the number of bits we ‘waste’ when we use Q while P is the ‘true’ distribution:

D(P ‖ Q) = ∑_i P(i) log ( P(i) / Q(i) )
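A small sketch (names and example distributions are mine): computing D(P ‖ Q) in bits for two distributions over the same alphabet. Note that the divergence is not symmetric.

```python
import math

def kl_divergence(p, q):
    """D(P || Q) = sum_x P(x) log2( P(x) / Q(x) ), in bits."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

P = {'a': 0.5, 'b': 0.5}    # the 'true' distribution
Q = {'a': 0.75, 'b': 0.25}  # the distribution we (wrongly) encode with
print(kl_divergence(P, Q))  # ~0.208 bits wasted per message
print(kl_divergence(Q, P))  # ~0.189 bits: KL divergence is not symmetric
```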
Conditional entropy is defined as

H(X | Y) = ∑_{y ∈ Y} P(y) H(X | Y = y)

the average number of bits needed for a message x ∈ X when we already know Y.
Mutual information: the amount of information shared between two variables X and Y,

I(X, Y) = H(X) − H(X | Y) = H(Y) − H(Y | X)
        = ∑_{x ∈ X} ∑_{y ∈ Y} P(x, y) log ( P(x, y) / (P(x) P(y)) )

High I(X, Y) implies correlation; low I(X, Y) implies independence. Information is symmetric!
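A small sketch tying the last two definitions together (the joint distribution and all names are assumed examples): given P(x, y) as a dict, it computes H(X), H(X | Y), and I(X, Y) = H(X) − H(X | Y).

```python
import math
from collections import defaultdict

def entropy(p):
    return sum(-q * math.log2(q) for q in p.values() if q > 0)

def marginal(joint, axis):
    m = defaultdict(float)
    for xy, p in joint.items():
        m[xy[axis]] += p
    return m

def conditional_entropy(joint):
    """H(X | Y) = -sum_{x,y} P(x, y) log2 P(x | y), in bits."""
    py = marginal(joint, 1)
    return sum(-p * math.log2(p / py[y]) for (x, y), p in joint.items() if p > 0)

# joint distribution P(x, y) over two binary variables (assumed example)
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
h_x  = entropy(marginal(joint, 0))
h_xy = conditional_entropy(joint)
print(h_x, h_xy, h_x - h_xy)   # 1.0, ~0.72, ~0.28 bits of mutual information
```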
(small aside)
Entropy and KL are used in decision trees. What is the best split in a tree? One that leaves as little uncertainty in the sub-nodes as possible: minimal entropy. How do we compare over multiple options? By information gain,

IG(T, a) = H(T) − H(T | a)

where T is the data at the node and a is a candidate split attribute.
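A minimal sketch of scoring one candidate split by information gain (the data and names are invented): the attribute with the highest IG, i.e. the lowest conditional entropy, wins.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return sum(-c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """IG = H(labels) - H(labels | attribute): entropy reduction due to the split."""
    n = len(labels)
    groups = defaultdict(list)
    for label, value in zip(labels, attribute_values):
        groups[value].append(label)
    h_cond = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - h_cond

labels    = ['yes', 'yes', 'no', 'no', 'yes', 'no']
attribute = ['hot', 'hot', 'cold', 'cold', 'hot', 'hot']
print(information_gain(labels, attribute))   # ~0.46 bits gained by this split
```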
Goal: find sets of attributes that interact strongly. Task: mine all sets of attributes such that the entropy over their value instantiations is at most τ.
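As a rough illustration of this task, a brute-force sketch under my own naming (not the actual algorithm of Heikinheimo et al. 2007): enumerate attribute sets and keep those whose joint value distribution has entropy at most τ. The attribute labels in the toy data are invented shorthand.

```python
import math
from itertools import combinations
from collections import Counter

def set_entropy(rows, attrs):
    """Entropy (bits) of the joint value distribution of `attrs` over the data rows."""
    n = len(rows)
    counts = Counter(tuple(r[a] for a in attrs) for r in rows)
    return sum(-c / n * math.log2(c / n) for c in counts.values())

def low_entropy_sets(rows, attributes, tau, max_size=3):
    """Brute force: all attribute sets (up to max_size) with entropy <= tau."""
    result = []
    for k in range(1, max_size + 1):
        for attrs in combinations(attributes, k):
            h = set_entropy(rows, attrs)
            if h <= tau:
                result.append((attrs, h))
    return result

# toy binary data over three courses (values invented for illustration)
rows = [{'TC': 0, 'PT1': 0, 'SE': 1}, {'TC': 0, 'PT1': 0, 'SE': 0},
        {'TC': 1, 'PT1': 1, 'SE': 0}, {'TC': 0, 'PT1': 0, 'SE': 1}]
print(low_entropy_sets(rows, ['TC', 'PT1', 'SE'], tau=1.0))
```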
Example: the set {Theory of Computation, Probability Theory 1} has entropy 1.087 bits.

Theory of Computation   Probability Theory 1   count
No                      No                     1887
Yes                     No                      156
No                      Yes                     143
Yes                     Yes                     219

(Heikinheimo et al. 2007)
Maturity Test   Software Engineering   Theory of Computation   count
No              No                     No                      1570
Yes             No                     No                        79
No              Yes                    No                        99
Yes             Yes                    No                       282
No              No                     Yes                       28
Yes             No                     Yes                      164
No              Yes                    Yes                       13
Yes             Yes                    Yes                      170

(Heikinheimo et al. 2007)
A larger example set: Scientific Writing, Maturity Test, Software Engineering Project, Theory of Computation, Probability Theory 1.

(Heikinheimo et al. 2007)
Define the entropy of a tree T = (A, T_1, …, T_k) as

H_tree(T) = H(A ∣ A_1, …, A_k) + ∑_k H_tree(T_k)

where A is the attribute at the root and A_1, …, A_k are the attributes at the roots of the subtrees T_1, …, T_k. The tree T for an itemset B minimizing H_tree(T) identifies directional explanations!
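A hedged sketch of how this tree entropy could be evaluated on a small binary table: the tree is a pair (root attribute, list of subtrees). The data, attribute names, and function names are mine, and this follows my reading of the definition above rather than the published algorithm.

```python
import math
from collections import Counter

def cond_entropy(rows, target, given):
    """H(target | given) in bits, estimated from a list of dict-records."""
    n = len(rows)
    joint = Counter((tuple(r[a] for a in given), r[target]) for r in rows)
    marg  = Counter(tuple(r[a] for a in given) for r in rows)
    return sum(-c / n * math.log2(c / marg[g]) for (g, _), c in joint.items())

def tree_entropy(rows, tree):
    """H_tree(T) = H(A | A_1..A_k) + sum_k H_tree(T_k), for tree = (attr, [subtrees])."""
    root, children = tree
    h = cond_entropy(rows, root, [c[0] for c in children])
    return h + sum(tree_entropy(rows, c) for c in children)

# toy tree: SE explained by TC and PT1, which are leaves (structure chosen for illustration)
rows = [{'SE': 1, 'TC': 1, 'PT1': 1}, {'SE': 0, 'TC': 0, 'PT1': 0},
        {'SE': 1, 'TC': 1, 'PT1': 0}, {'SE': 0, 'TC': 0, 'PT1': 1}]
tree = ('SE', [('TC', []), ('PT1', [])])
print(tree_entropy(rows, tree))   # 0 bits at the root + 1 bit per leaf = 2.0
```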
For the five-course example above:

H(B) ≤ H(SW ∣ MT, SEP, TC, PT1) + H(MT ∣ SEP, TC, PT1) + H(SEP) + H(TC ∣ PT1) + H(PT1)

(SW = Scientific Writing, MT = Maturity Test, SEP = Software Engineering Project, TC = Theory of Computation, PT1 = Probability Theory 1)
For continuous data, the differential entropy is

h(X) = − ∫_X f(x) log f(x) dx

(Shannon, 1948)
How about… the entropy of Uniform(0, 1/2)?

h(X) = − ∫_0^{1/2} 2 log(2) dx = − log 2

Hm, negative?
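For a quick numeric check (a small sketch, assuming scipy is available; not part of the slides): scipy reports differential entropy in nats, and converting to bits confirms the negative value.

```python
import math
from scipy.stats import uniform

h_nats = uniform(loc=0, scale=0.5).entropy()  # differential entropy of Uniform(0, 1/2), in nats
print(h_nats / math.log(2))                   # -1.0 bits: indeed negative
```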
Information is related to the reduction in uncertainty
Entropy is a core aspect of information theory
lots of nice properties: optimal prefix-code lengths, mutual information, etc.
Entropy for continuous data is… more tricky
differential entropy is a bit problematic