Introduction to Information Retrieval
Entropy: a basic introduction


  1. Entropy: a basic introduction
     1) Entropy = measure of randomness
     2) Entropy = measure of compressibility
     More random = less compressible. High entropy = high randomness / low compressibility; low entropy = low randomness / high compressibility. Entropy is a key notion applied in information retrieval and in data compression in general.

  2. Entropy application
     Entropy enables one to compute the compressibility of data without actually needing to compress the data first! We will illustrate this with a well-known file compression method: the Huffman algorithm.

  3. Huffman encoding example
     We compute the Huffman code and measure the compression of the file. This is compared to the "entropy", a measure of file compressibility obtained directly from the file (without the need to actually compress it).

  4. Probabilities and randomness
     6-sided fair dice: pi = Probability[outcome = i] = 1/6, where i is any number from 1 to 6.
     6-sided biased dice: p6 = 3/12 = 1/4 (6 is more likely), p1 = 1/12 (1 is less likely, a piece of lead in the "dot"), p2 = p3 = p4 = p5 = 2/12 = 1/6.
     The sum of the probabilities of all possible outcomes is 1: p1 + p2 + p3 + p4 + p5 + p6 = 1.
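
A minimal Python sketch of these two distributions (not part of the original slides; the names fair and biased are just illustrative):

```python
from fractions import Fraction

# Fair 6-sided die: every outcome has probability 1/6.
fair = {i: Fraction(1, 6) for i in range(1, 7)}

# Biased die from the slide: 6 is more likely, 1 is less likely.
biased = {1: Fraction(1, 12), 2: Fraction(1, 6), 3: Fraction(1, 6),
          4: Fraction(1, 6), 5: Fraction(1, 6), 6: Fraction(1, 4)}

# In both cases the probabilities of all possible outcomes sum to 1.
assert sum(fair.values()) == 1
assert sum(biased.values()) == 1
```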

  5. Probabilities (general case)
     For the general case, instead of 6 outcomes (dice) we allow n outcomes. For instance, consider a file with 100,000 characters, and say the character "a" occurs 45,000 times in this file. What is the probability of encountering "a" if we pick a character in the file at random? Answer: 45/100 = 0.45, i.e. nearly half of the characters in the file are "a"s.
     Say there are n distinct characters in the file. Each character i has a probability pi, and the probabilities sum to 1: p1 + ... + pn = 100/100 = 1.
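
A small illustrative sketch of turning counts into probabilities, using the character frequencies given later in the Huffman exercise (variable names are ours):

```python
# Character counts in thousands, for the example file of 100,000 characters.
counts = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
total = sum(counts.values())  # 100, i.e. 100,000 characters

# Probability of encountering each character when picking one at random.
probs = {ch: n / total for ch, n in counts.items()}
print(probs["a"])           # 0.45
print(sum(probs.values()))  # 1.0
```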

  6. Entropy
     Given a "probability distribution", i.e. probabilities p1, ..., pn with sum 1, we define the entropy H of this distribution as:
     H(p1, ..., pn) = -p1·log(p1) - p2·log(p2) - ... - pn·log(pn)
     Note: log has base 2 in this notation, and log2(k) = ln(k)/ln(2) (where "ln" is the logarithm in base e).
     Exercise: compute the entropy for a) the probability distribution of the fair dice, b) the probability distribution of the biased dice.
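
A minimal Python sketch of this definition (the function name entropy is just illustrative; it assumes all probabilities are positive):

```python
import math

def entropy(probs):
    """H(p1, ..., pn) = -p1*log2(p1) - ... - pn*log2(pn), for probabilities summing to 1."""
    return -sum(p * math.log2(p) for p in probs)

print(entropy([0.5, 0.5]))  # a fair coin has entropy 1 bit
```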

  7. Comment on logs
     -p·log(p) = p·log(1/p), since log(1/p) = log(1) - log(p) = 0 - log(p) = -log(p).
     IMPORTANT: p·log(1/p) measures a very intuitive concept:
     * p is the probability of an event
     * 1/p is the number of equally likely outcomes that probability corresponds to (the event occurs roughly once every 1/p trials)
     * log(k) measures how many bits are needed to represent k outcomes
     We check this on the fair dice.
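
A quick numerical check of the identity (illustrative):

```python
import math

p = 1 / 6                    # probability of one outcome of the fair die
lhs = -p * math.log2(p)      # -p*log(p)
rhs = p * math.log2(1 / p)   # p*log(1/p)
assert math.isclose(lhs, rhs)
print(lhs)  # ~0.431: one term of the entropy sum
```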

  8. Fair dice example
     p = probability of an outcome = 1/6
     1/p = 1/(1/6) = 6 = number of outcomes
     log(1/p) = log(6) = 2.59 = "number" of bits needed to represent the 1/p = 6 outcomes

  9. Rounding up
     Note: in general the "number" of bits, i.e. log(1/p), is not an integer, e.g. 2.59. In practice we can take the smallest integer greater than or equal to log(1/p), which here is 3.
     Note that 3 bits suffice to represent the 6 outcomes: there are 8 binary numbers of length 3, so pick six of them to represent the outcomes of the dice, e.g. 000, 001, 010, 011, 100 and 101.
     Comment: in the following we don't round up (we will see why).
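
In code, rounding up is just a ceiling on log2 (a small illustrative sketch):

```python
import math

outcomes = 6
bits_exact = math.log2(outcomes)      # ~2.585
bits_rounded = math.ceil(bits_exact)  # 3

# Six 3-bit codes for the six outcomes (two of the eight 3-bit codes stay unused).
codes = [format(i, "03b") for i in range(outcomes)]
print(bits_exact, bits_rounded, codes)
# 2.584962500721156 3 ['000', '001', '010', '011', '100', '101']
```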

  10. A nice interpretation of entropy: H(p1, ..., pn) = average encoding length
     We have shown that -p·log(p) = p·log(1/p), so the entropy can be written as:
     H(p1, ..., pn) = p1·log(1/p1) + p2·log(1/p2) + ... + pn·log(1/pn)
     where each term pi·log(1/pi) = probability of occurrence x encoding length.
     Thus H(p1, ..., pn) = average encoding length.

  11. Binary representation
     To encode n distinct numbers in binary notation we need binary numbers of length log(n). Note that from here on "log" is the logarithm in base 2, since we are interested in binary compression only.
     To encode 6 numbers, we need binary numbers of length log(6) (in fact, we need to round up to the nearest integer above this value, i.e. 3). Binary numbers of length 2 will not suffice: there are only 4 of them, which is not enough to encode 6 numbers. We keep matters as an approximation and talk about binary numbers of "length" log(6), even though this is not an integer value.
     The binary number length needed to encode 8 numbers is log(8) = 3.

  12. Exercise: solution
     a) Fair dice: p1 = p2 = ... = p6 = 1/6, so
     H(p1, ..., p6) = -(1/6)·log(1/6) x 6 = -log(1/6) = log(6) = 2.59
     Interpretation: entropy measures the amount of randomness. In the case of a fair dice, the randomness is maximal: all 6 outcomes are equally likely. This means that to represent the outcomes we will "roughly" need log(6) = 2.59 bits per outcome in binary form (the form compression will take).
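
Checking a) numerically (illustrative sketch):

```python
import math

fair = [1 / 6] * 6
H_fair = -sum(p * math.log2(p) for p in fair)
print(H_fair)        # ~2.585 bits
print(math.log2(6))  # the same value: maximum entropy for 6 equally likely outcomes
```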

  13. Solution continued
     b) The entropy for the biased dice (p1 = 1/12, p2 = ... = p5 = 1/6, p6 = 1/4) is:
     -(1/12)·log(1/12) - (1/6)·log(1/6) x 4 - (1/4)·log(1/4)
     = (1/12)·log(12) + (4/6)·log(6) + (1/4)·log(4)
     = 0.30 + 1.72 + 0.50
     = 2.52 (lower than our previous result of 2.59!)
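
The same check for b), using the biased distribution as defined on slide 4 (illustrative):

```python
import math

# Biased die from slide 4: p1 = 1/12, p2 = ... = p5 = 1/6, p6 = 1/4.
biased = [1 / 12, 1 / 6, 1 / 6, 1 / 6, 1 / 6, 1 / 4]
H_biased = -sum(p * math.log2(p) for p in biased)
print(H_biased)  # ~2.52 bits, lower than log2(6) ~ 2.585
```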

  14. Exercise continued
     Try the same for an 8-sided dice (a Dungeons and Dragons dice) which is:
     a) fair
     b) totally biased, with prob(8) = 1 and thus prob(1) = ... = prob(7) = 0
     Answers:
     a) Entropy is log(8) = 3; we need 3 bits to represent the 8 outcomes (maximum randomness).
     b) Entropy is -1·log(1) = 0; we need 0 bits to represent the outcome. Justify! (Note: a code of length 1 has 2 values. How many values does a code of length 0 have?)
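
A sketch for the 8-sided case. It uses the usual convention that terms with p = 0 contribute nothing to the sum (the limit of p·log(p) as p goes to 0 is 0), which the filter below implements (function and variable names are illustrative):

```python
import math

def entropy(probs):
    # Terms with p == 0 contribute 0, since p*log2(p) -> 0 as p -> 0.
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1 / 8] * 8))               # fair 8-sided die: 3.0 bits
print(entropy([0, 0, 0, 0, 0, 0, 0, 1]))  # totally biased die: 0.0 bits
```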

  15. Compression
     Revisit the previous example of the 8-sided dice.
     Compression for the outcomes of the fair dice: no compression; we still need 3 bits per outcome (maximum randomness). Ratio of entropy to uncompressed length: 3/3 = 1.
     Compression for the outcomes of the totally biased dice: total compression; we need 0 bits per outcome ("no" randomness). Ratio of entropy to uncompressed length: 0/3 = 0.
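
A small sketch of this compression ratio; the definition used here, entropy divided by the uncompressed bits per outcome, is one reasonable reading of the slide:

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

uncompressed_bits = math.log2(8)                 # 3 bits per outcome
print(entropy([1 / 8] * 8) / uncompressed_bits)  # 1.0: no compression possible
print(entropy([0] * 7 + [1]) / uncompressed_bits)  # 0.0: total compression
```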

  16. Exercise: Huffman code
     Consider a file with the following properties:
     Characters in the file: a, b, c, d, e and f
     Number of characters: 100,000
     Frequencies of the characters (in multiples of 1,000): freq(a) = 45, freq(b) = 13, freq(c) = 12, freq(d) = 16, freq(e) = 9, freq(f) = 5
     So "a" occurs 45,000 times, and similarly for the others.

  17. Exercise continued
     a) Compute the Huffman encoding
     b) Compute the cost of the encoding
     c) Compute the average length of the encoding
     d) Express the probability of encountering each character in the file
     e) Compute the entropy
     f) Compare the entropy to the compression percentage
     What is your conclusion?

  18. Solution
     We assume familiarity with the Huffman coding algorithm.
     a) (Prefix) codes for the characters: a: 0, b: 101, c: 100, d: 111, e: 1101, f: 1100
     b) Cost of encoding = number of bits in the encoding = 45 x 1 + 13 x 3 + 12 x 3 + 16 x 3 + 9 x 4 + 5 x 4 = 224 (in thousands) = 224,000 bits
     c) Average encoding length = 224,000 / 100,000 = 2.24 bits per character
     d) Prob(char = a) = 45/100, ..., Prob(char = f) = 5/100. Check: the probabilities sum to 100/100 = 1.
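
A compact Python sketch of Huffman coding for this example, using the standard greedy min-heap construction. The 0/1 assignment to children may differ from the exact codes on the slide, but the code lengths and the total cost come out the same (names such as huffman_codes are illustrative):

```python
import heapq
from itertools import count

# Character frequencies in thousands, as in the exercise.
freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}

def huffman_codes(freq):
    """Build a Huffman code by repeatedly merging the two least frequent subtrees."""
    tiebreak = count()  # keeps heap entries comparable when frequencies tie
    heap = [(f, next(tiebreak), {ch: ""}) for ch, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, codes1 = heapq.heappop(heap)  # two least frequent subtrees
        f2, _, codes2 = heapq.heappop(heap)
        merged = {ch: "0" + c for ch, c in codes1.items()}
        merged.update({ch: "1" + c for ch, c in codes2.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

codes = huffman_codes(freq)
cost = sum(freq[ch] * len(code) for ch, code in codes.items())  # in thousands of bits
print(codes)                 # code lengths: a=1, b=c=d=3, e=f=4
print(cost * 1000)           # 224000 bits
print(cost * 1000 / 100000)  # 2.24 bits per character on average
```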

  19. Solution (continued)
     e) Entropy = H(45/100, 13/100, 12/100, 16/100, 9/100, 5/100)
     = -45/100·log(45/100) - 13/100·log(13/100) - 12/100·log(12/100) - 16/100·log(16/100) - 9/100·log(9/100) - 5/100·log(5/100)
     ≈ 2.22
     f) Conclusion: entropy is an excellent prediction of the average binary encoding length (up to minor round-off). It predicted the average code length to be about 2.22 bits, very close to the Huffman average of 2.24. It also predicts the total size of the compressed file: 2.22 x 100,000 = 222,000 bits, very close to the actual compressed size of 224,000 bits.
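
A final check in Python (illustrative), computing the entropy of the file's character distribution and comparing it with the Huffman result from the previous slide:

```python
import math

freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}  # in thousands
total = sum(freq.values())

# Entropy of the character distribution, in bits per character.
H = -sum((f / total) * math.log2(f / total) for f in freq.values())
print(H)            # ~2.22 bits per character
print(H * 100_000)  # ~222,000 bits: predicted compressed size
# Compare with the Huffman encoding: 2.24 bits per character, 224,000 bits in total.
```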
