Entropy & Information
Jilles Vreeken
29 May 2015
Question of the day: What is information? (and what do talking drums have to do with it?)

Bits and Pieces: What are … information, a bit, entropy, mutual information, divergence, information theory, …
Field founded by Claude Shannon in 1948 with ‘A Mathematical Theory of Communication’. A branch of statistics that is essentially about uncertainty in communication: not what you say, but what you could say.
Shannon showed that uncertainty can be quantified, linking physical entropy to messages, and defined the entropy of a discrete random variable X as

H(X) = − ∑_i P(x_i) log P(x_i)
A (key) result of Shannon entropy is that

− log₂ P(x_i)

gives the length in bits of the optimal prefix code for a message x_i.
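To make these two quantities concrete, here is a minimal Python sketch (the function names and the example distribution are mine, not from the slides): it computes the per-message code lengths −log₂ P(x) and the entropy as their P-weighted average.

```python
import math

def optimal_code_lengths(p):
    """Optimal prefix-code length -log2 P(x) for each message x, in bits."""
    return {x: -math.log2(px) for x, px in p.items() if px > 0}

def entropy(p):
    """Shannon entropy H(X) = -sum_x P(x) log2 P(x), in bits."""
    return sum(-px * math.log2(px) for px in p.values() if px > 0)

# example distribution (assumed for illustration)
P = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}
print(optimal_code_lengths(P))  # {'a': 1.0, 'b': 2.0, 'c': 3.0, 'd': 3.0}
print(entropy(P))               # 1.75 bits per message on average
```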
A code C maps a set of messages X to a set of code words Y. L_C(⋅) is the code length function for C, with L_C(x ∈ X) = |C(x) ∈ Y|, the length in bits of the code word y ∈ Y that C assigns to symbol x ∈ X.
Not all codes are created equal. Let C1 and C2 be two codes for a set of messages X.

1. We call C1 more efficient than C2 if for all x ∈ X, L1(x) ≤ L2(x), while for at least one x ∈ X, L1(x) < L2(x).
2. We call a code C for set X complete if there does not exist a code C′ that is more efficient than C. A code is complete when it does not waste any bits.
Let us consider a sequence S. As code C for S we can instantiate a block code, identifying the value of s_i ∈ S by an index over X, which requires a constant number of log₂|X| bits per message in S, i.e., L(x_i) = log₂|X|. We can always instantiate a prefix-free code with code words of lengths L(x_i) = log₂|X|.
What if we know the distribution P(x_i ∈ X) over S and it is not uniform? We do not want to waste any bits, so using block codes is a bad idea. We do not want to introduce any undue bias, so we want an efficient code that is uniquely decodable without having to use arbitrary-length stop-words. We want an optimal prefix-code.
A code C is a prefix code iff there is no code word C(x) that is an extension of another code word C(x′). Or, in other words, C defines a binary tree with the code words as the leaves. How do we find the optimal tree?
Let P(x_i) be the probability of x_i ∈ X in S; then

H(S) = − ∑_{x_i ∈ X} P(x_i) log P(x_i)

is the Shannon entropy of S (wrt X). (see Shannon 1948)

Here P(x_i) is the ‘weight’: how often x_i occurs; −log P(x_i) is the number of bits needed to identify x_i under P; and H(S) is the average number of bits needed per message s_i ∈ S.
What if the distribution of X in S is not uniform? Let P(x_i) be the probability of x_i in S; then L(x_i) = − log P(x_i) is the length of the optimal prefix code for message x_i knowing distribution P.

(see Shannon 1948)
Kraft’s inequality: for any uniquely decodable code C for a finite alphabet X = {x_1, …, x_n}, the code word lengths L_C(⋅) must satisfy

∑_{x_i ∈ X} 2^{−L(x_i)} ≤ 1.

a) When a set of code word lengths satisfies the inequality, there exists a prefix code with these code word lengths.
b) When it holds with equality, the code is complete: it does not waste any part of the coding space.
c) When it does not hold, the code is not uniquely decodable.
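As a quick illustration, a small sketch (the function name and the example lengths are mine) that checks a set of integer code word lengths against Kraft’s inequality.

```python
def kraft_sum(lengths):
    """Sum of 2^(-L(x)) over all code word lengths; a prefix code with these lengths exists iff it is <= 1."""
    return sum(2.0 ** -l for l in lengths)

print(kraft_sum([1, 2, 3, 3]))  # 1.0   -> complete, e.g. the codes 0, 10, 110, 111
print(kraft_sum([2, 2, 3]))     # 0.625 -> a prefix code exists, but coding space is wasted
print(kraft_sum([1, 1, 2]))     # 1.25  -> no uniquely decodable code with these lengths
```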
Bit: binary digit, the smallest and most fundamental piece of information: yes or no. Invented by Claude Shannon in 1948; the name was coined by John Tukey.
Bits have been in use for a long, long time, though: punch cards (1725, 1804), Morse code (1844), African ‘talking drums’.
Punishes ‘bad’ redundancy. Rewards useful redundancy: cotxent alolws mishaireng/raeding (the deliberately scrambled words stay readable because context adds redundancy). African talking drums have used this for efficient, fast, long-distance communication: they mimic vocalized sounds (a tonal language), a very reliable means of communication.
Say we have a binary string of 100,000 ‘messages’:
1) 00010001000100010001…000100010001000100010001000100010001
2) 01110100110100100110…101011101011101100010110001011011100
3) 00011000001010100000…001000100001000000100011000000100110
4) 0000000000000000000000000000100000000000000000000…0000000
How much information do they contain?
Depends on the encoding! What is the best encoding? One that takes the entropy of the data into account: things that occur often should get short codes, things that occur seldom should get long codes. An encoding matching the Shannon entropy is optimal.
In our simplest example we have P(1) = 1/100000 and P(0) = 99999/100000, so
|code(1)| = −log₂(1/100000) = 16.61 bits
|code(0)| = −log₂(99999/100000) = 0.0000144 bits
So, knowing P, our string contains 1 × 16.61 + 99999 × 0.0000144 = 18.049 bits.
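The arithmetic above is easy to check; a minimal sketch, assuming the string consists of 100,000 symbols of which exactly one is a 1 (variable names are mine).

```python
import math

n_ones, n_zeros = 1, 99_999
n = n_ones + n_zeros
p1, p0 = n_ones / n, n_zeros / n

bits_per_one  = -math.log2(p1)   # ~16.61 bits
bits_per_zero = -math.log2(p0)   # ~0.0000144 bits
total = n_ones * bits_per_one + n_zeros * bits_per_zero
print(bits_per_one, bits_per_zero, total)   # ~16.61, ~1.44e-05, ~18.05 bits in total
```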
Shannon lets us calculate optimal code lengths; what about actual codes? (0.0000144 bits?) Shannon and Fano invented a near-optimal encoding in 1948: within one bit of the optimum, but not of the lowest expected length.
Fano gave his students an option: take the regular exam, or invent a better encoding. David Huffman didn’t like exams, and invented Huffman codes (1952), which are optimal for symbol-by-symbol encoding with fixed probabilities.
(arithmetic coding is overall optimal, Rissanen 1976)
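A minimal sketch of Huffman’s construction (function names and the example are mine, not from the slides): it repeatedly merges the two least-frequent subtrees and returns only the resulting code word lengths.

```python
import heapq
from collections import Counter

def huffman_lengths(freqs):
    """Huffman code word lengths (in bits) for a symbol -> frequency mapping."""
    # heap items: (total frequency, tie-breaker, {symbol: depth so far})
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    if len(heap) == 1:          # degenerate case: a single symbol still needs one bit
        return {s: 1 for s in freqs}
    tie = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

print(huffman_lengths(Counter("abracadabra")))
# 'a' gets length 1; 'b', 'r', 'c', 'd' get length 3 (ties may be broken differently)
```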
To encode optimally, we need the optimal probabilities. What happens if we don’t have them?
The Kullback–Leibler divergence from Q to P, denoted D(P ‖ Q), measures the number of bits we ‘waste’ when we use Q while P is the ‘true’ distribution:

D(P ‖ Q) = ∑_i P(i) log ( P(i) / Q(i) )
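A small sketch (names and example distributions are mine): computing D(P ‖ Q) in bits for two distributions over the same alphabet. Note that the divergence is not symmetric.

```python
import math

def kl_divergence(p, q):
    """D(P || Q) = sum_x P(x) log2( P(x) / Q(x) ), in bits."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

P = {'a': 0.5, 'b': 0.5}    # the 'true' distribution
Q = {'a': 0.75, 'b': 0.25}  # the distribution we (wrongly) encode with
print(kl_divergence(P, Q))  # ~0.208 bits wasted per message
print(kl_divergence(Q, P))  # ~0.189 bits: KL divergence is not symmetric
```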
Conditional entropy is defined as

H(X | Y) = ∑_{y ∈ Y} P(y) H(X | Y = y)

the average number of bits needed for a message x ∈ X when we already know Y.
Mutual information: the amount of information shared between two variables X and Y,

I(X, Y) = H(X) − H(X | Y) = H(Y) − H(Y | X)
        = ∑_{x ∈ X} ∑_{y ∈ Y} P(x, y) log ( P(x, y) / (P(x) P(y)) )

High I(X, Y) implies correlation; low I(X, Y) implies independence. Information is symmetric!
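A small sketch tying the last two definitions together (the joint distribution and all names are assumed examples): given P(x, y) as a dict, it computes H(X), H(X | Y), and I(X, Y) = H(X) − H(X | Y).

```python
import math
from collections import defaultdict

def entropy(p):
    return sum(-q * math.log2(q) for q in p.values() if q > 0)

def marginal(joint, axis):
    m = defaultdict(float)
    for xy, p in joint.items():
        m[xy[axis]] += p
    return m

def conditional_entropy(joint):
    """H(X | Y) = -sum_{x,y} P(x, y) log2 P(x | y), in bits."""
    py = marginal(joint, 1)
    return sum(-p * math.log2(p / py[y]) for (x, y), p in joint.items() if p > 0)

# joint distribution P(x, y) over two binary variables (assumed example)
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
h_x  = entropy(marginal(joint, 0))
h_xy = conditional_entropy(joint)
print(h_x, h_xy, h_x - h_xy)   # 1.0, ~0.72, ~0.28 bits of mutual information
```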
(small aside)
Entropy and KL are used in decision trees. What is the best split in a tree? One that leaves as little uncertainty in the sub-nodes as possible: minimal entropy. How do we compare over multiple options? By information gain,

IG(T, a) = H(T) − H(T | a)

where T is the data at the node and a is a candidate split attribute.
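A minimal sketch of scoring one candidate split by information gain (the data and names are invented): the attribute with the highest IG, i.e. the lowest conditional entropy, wins.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return sum(-c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """IG = H(labels) - H(labels | attribute): entropy reduction due to the split."""
    n = len(labels)
    groups = defaultdict(list)
    for label, value in zip(labels, attribute_values):
        groups[value].append(label)
    h_cond = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - h_cond

labels    = ['yes', 'yes', 'no', 'no', 'yes', 'no']
attribute = ['hot', 'hot', 'cold', 'cold', 'hot', 'hot']
print(information_gain(labels, attribute))   # ~0.46 bits gained by this split
```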
Goal: find sets of attributes that interact strongly. Task: mine all sets of attributes such that the entropy over their value instantiations is at most τ.
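As a rough illustration of this task, a brute-force sketch under my own naming (not the actual algorithm of Heikinheimo et al. 2007): enumerate attribute sets and keep those whose joint value distribution has entropy at most τ. The attribute labels in the toy data are invented shorthand.

```python
import math
from itertools import combinations
from collections import Counter

def set_entropy(rows, attrs):
    """Entropy (bits) of the joint value distribution of `attrs` over the data rows."""
    n = len(rows)
    counts = Counter(tuple(r[a] for a in attrs) for r in rows)
    return sum(-c / n * math.log2(c / n) for c in counts.values())

def low_entropy_sets(rows, attributes, tau, max_size=3):
    """Brute force: all attribute sets (up to max_size) with entropy <= tau."""
    result = []
    for k in range(1, max_size + 1):
        for attrs in combinations(attributes, k):
            h = set_entropy(rows, attrs)
            if h <= tau:
                result.append((attrs, h))
    return result

# toy binary data over three courses (values invented for illustration)
rows = [{'TC': 0, 'PT1': 0, 'SE': 1}, {'TC': 0, 'PT1': 0, 'SE': 0},
        {'TC': 1, 'PT1': 1, 'SE': 0}, {'TC': 0, 'PT1': 0, 'SE': 1}]
print(low_entropy_sets(rows, ['TC', 'PT1', 'SE'], tau=1.0))
```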
Example: the set {Theory of Computation, Probability Theory 1} has entropy 1.087 bits.

Theory of Computation   Probability Theory 1   count
No                      No                     1887
Yes                     No                      156
No                      Yes                     143
Yes                     Yes                     219

(Heikinheimo et al. 2007)
Maturity Test   Software Engineering   Theory of Computation   count
No              No                     No                      1570
Yes             No                     No                        79
No              Yes                    No                        99
Yes             Yes                    No                       282
No              No                     Yes                       28
Yes             No                     Yes                      164
No              Yes                    Yes                       13
Yes             Yes                    Yes                      170

(Heikinheimo et al. 2007)
A larger example set: Scientific Writing, Maturity Test, Software Engineering Project, Theory of Computation, Probability Theory 1.

(Heikinheimo et al. 2007)
Define the entropy of a tree T = (A, T_1, …, T_k) as

H_tree(T) = H(A ∣ A_1, …, A_k) + ∑_k H_tree(T_k)

where A is the attribute at the root and A_1, …, A_k are the attributes at the roots of the subtrees T_1, …, T_k. The tree T for an itemset B minimizing H_tree(T) identifies directional explanations!
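A hedged sketch of how this tree entropy could be evaluated on a small binary table: the tree is a pair (root attribute, list of subtrees). The data, attribute names, and function names are mine, and this follows my reading of the definition above rather than the published algorithm.

```python
import math
from collections import Counter

def cond_entropy(rows, target, given):
    """H(target | given) in bits, estimated from a list of dict-records."""
    n = len(rows)
    joint = Counter((tuple(r[a] for a in given), r[target]) for r in rows)
    marg  = Counter(tuple(r[a] for a in given) for r in rows)
    return sum(-c / n * math.log2(c / marg[g]) for (g, _), c in joint.items())

def tree_entropy(rows, tree):
    """H_tree(T) = H(A | A_1..A_k) + sum_k H_tree(T_k), for tree = (attr, [subtrees])."""
    root, children = tree
    h = cond_entropy(rows, root, [c[0] for c in children])
    return h + sum(tree_entropy(rows, c) for c in children)

# toy tree: SE explained by TC and PT1, which are leaves (structure chosen for illustration)
rows = [{'SE': 1, 'TC': 1, 'PT1': 1}, {'SE': 0, 'TC': 0, 'PT1': 0},
        {'SE': 1, 'TC': 1, 'PT1': 0}, {'SE': 0, 'TC': 0, 'PT1': 1}]
tree = ('SE', [('TC', []), ('PT1', [])])
print(tree_entropy(rows, tree))   # 0 bits at the root + 1 bit per leaf = 2.0
```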
For the five-course example above:

H(B) ≤ H(SW ∣ MT, SEP, TC, PT1) + H(MT ∣ SEP, TC, PT1) + H(SEP) + H(TC ∣ PT1) + H(PT1)

(SW = Scientific Writing, MT = Maturity Test, SEP = Software Engineering Project, TC = Theory of Computation, PT1 = Probability Theory 1)
For continuous data, the differential entropy is

h(X) = − ∫_X f(x) log f(x) dx

(Shannon, 1948)
How about… the entropy of Uniform(0, 1/2)?

h(X) = − ∫_0^{1/2} 2 log(2) dx = − log 2

Hm, negative?
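For a quick numeric check (a small sketch, assuming scipy is available; not part of the slides): scipy reports differential entropy in nats, and converting to bits confirms the negative value.

```python
import math
from scipy.stats import uniform

h_nats = uniform(loc=0, scale=0.5).entropy()  # differential entropy of Uniform(0, 1/2), in nats
print(h_nats / math.log(2))                   # -1.0 bits: indeed negative
```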
Information is related to the reduction in uncertainty
Entropy is a core aspect of information theory
lots of nice properties: optimal prefix-code lengths, mutual information, etc.
Entropy for continuous data is… more tricky
differential entropy is a bit problematic