
Information Theory, Statistics, and Decision Trees – Léon Bottou



  1. Information Theory, Statistics, and Decision Trees. Léon Bottou. COS 424 – 4/6/2010.

  2. Summary: 1. Basic information theory. 2. Decision trees. 3. Information theory and statistics.

  3. I. Basic information theory

  4. Why do we care? Information theory: – Invented by Claude Shannon in 1948: "A Mathematical Theory of Communication," Bell System Technical Journal, October 1948. – The "quantity of information," measured in "bits." – The "capacity of a transmission channel." – Data coding and data compression. Information gain: – A derived concept. – Quantifies how much information we acquire about a phenomenon. – A justification for the Kullback-Leibler divergence.

  5. The coding paradigm. Intuition: the quantity of information of a message is the length of the smallest code that can represent the message. Paradigm: – Assume there are $n$ possible messages $i = 1 \dots n$. – We want a signal that indicates the occurrence of one of them. – We can transmit an alphabet of $r$ symbols. For instance, a wire could carry $r = 2$ electrical levels. – The code for message $i$ is a sequence of $l_i$ symbols. Properties: – Codes should be uniquely decodable. – Average code length for a message: $\sum_{i=1}^{n} p_i l_i$.
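  A minimal sketch of the average code length computation, with an assumed toy distribution and assumed code lengths (not from the slides):

      # Average code length sum_i p_i * l_i for a toy message distribution
      # and a hypothetical set of code lengths (both assumed for illustration).
      probs = [0.5, 0.25, 0.125, 0.125]   # p_i
      lengths = [1, 2, 3, 3]              # l_i, symbols per codeword
      avg_length = sum(p * l for p, l in zip(probs, lengths))
      print(avg_length)                   # 1.75 symbols per message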

  6. Prefix codes. [Figure: a code tree assigning codewords to six messages.] – Messages 1 and 2 have codes one symbol long ($l_i = 1$). – Messages 3 and 4 have codes two symbols long ($l_i = 2$). – Messages 5 and 6 have codes three symbols long ($l_i = 3$). – There is an unused three-symbol code. That's inefficient. Properties: – Prefix codes are uniquely decodable. – There are trickier kinds of uniquely decodable codes, e.g. a ↦ 0, b ↦ 01, c ↦ 011 versus the prefix code a ↦ 0, b ↦ 10, c ↦ 110.
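  A minimal sketch (assumed helper, not from the slides) checking the prefix property of the two example codes; the first is not prefix-free yet is still uniquely decodable, the second is prefix-free:

      # Return True when no codeword is a prefix of another codeword.
      def is_prefix_free(codewords):
          return not any(u != v and v.startswith(u)
                         for u in codewords for v in codewords)

      print(is_prefix_free(["0", "01", "011"]))   # False (but uniquely decodable)
      print(is_prefix_free(["0", "10", "110"]))   # True  (prefix code)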

  7. Kraft inequality. Uniquely decodable codes satisfy $\sum_{i=1}^{n} \left(\frac{1}{r}\right)^{l_i} \le 1$. – All uniquely decodable codes satisfy this inequality. – If integer code lengths $l_i$ satisfy this inequality, there exists a prefix code with such code lengths. Consequences: – If some messages have short codes, others must have long codes. – To minimize the average code length: give short codes to high-probability messages; give long codes to low-probability messages. – Equiprobable messages should have similar code lengths.
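  A minimal sketch (assumed helper, not from the slides) of the Kraft sum $\sum_i r^{-l_i}$: a value at most 1 means the lengths are achievable by a prefix code, and a value of exactly 1 means no codeword is wasted:

      # Kraft sum for a list of code lengths over an r-symbol alphabet.
      def kraft_sum(lengths, r=2):
          return sum(r ** (-l) for l in lengths)

      print(kraft_sum([1, 2, 3, 3]))              # 1.0    achievable and tight
      print(kraft_sum([1, 1, 2]))                 # 1.25   no uniquely decodable code
      print(kraft_sum([1, 1, 2, 2, 3, 3], r=3))   # ~0.963 six messages, 3 symbols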

  8. Kraft inequality for prefix codes. Prefix codes satisfy the Kraft inequality: let $l = \max_i l_i$; in the full $r$-ary tree of depth $l$, each codeword of length $l_i$ claims $r^{l - l_i}$ disjoint leaves, hence $\sum_i r^{l - l_i} \le r^l \iff \sum_i \left(\frac{1}{r}\right)^{l_i} \le 1$. All uniquely decodable codes satisfy the Kraft inequality: – The proof must deal with infinite sequences of messages. Given integer code lengths $l_i$: – Build a balanced $r$-ary tree of depth $l = \max_i l_i$. – For each message, prune one subtree at depth $l_i$. – The Kraft inequality ensures that there will be enough branches left to define a code for each message.
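  A minimal sketch (an assumption, not Bottou's code) of the construction described on this slide: given integer lengths that satisfy the Kraft inequality, hand out codewords in a full $r$-ary tree, shortest lengths first, pruning the subtree under each assigned codeword:

      # Build a prefix code from integer code lengths (shortest-first,
      # "canonical" assignment in a full r-ary tree).
      def prefix_code_from_lengths(lengths, r=2):
          assert sum(r ** (-l) for l in lengths) <= 1 + 1e-12, "Kraft violated"

          def digits(value, width):
              # write `value` in base r using exactly `width` digits
              out = []
              for _ in range(width):
                  out.append(str(value % r))
                  value //= r
              return "".join(reversed(out))

          order = sorted(range(len(lengths)), key=lambda i: lengths[i])
          codes, value, prev_len = {}, 0, lengths[order[0]]
          for i in order:
              value *= r ** (lengths[i] - prev_len)   # move down to depth l_i
              codes[i] = digits(value, lengths[i])
              value += 1                              # prune this subtree
              prev_len = lengths[i]
          return codes

      print(prefix_code_from_lengths([1, 1, 2, 2, 3, 3], r=3))
      # {0: '0', 1: '1', 2: '20', 3: '21', 4: '220', 5: '221'}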

  9. Redundant codes. Assume $\sum_i r^{-l_i} < 1$: – There are leftover branches in the tree. – There are codes that are not used, or there are multiple codes for each message. For best compression, $\sum_i r^{-l_i} = 1$: – This is not always possible with integer code lengths $l_i$. – But we can use it to compute a lower bound.

  10. Lower bound for the average code length. Choose code lengths that solve $\min_{l_1 \dots l_n} \sum_i p_i l_i$ subject to $\sum_i r^{-l_i} = 1$ and $l_i > 0$. – Define $s_i = r^{-l_i}$, that is, $l_i = -\log_r(s_i)$. – Maximize $C = \sum_i p_i \log_r(s_i)$ subject to $\sum_i s_i = 1$. – We get $\frac{\partial C}{\partial s_i} = \frac{p_i}{s_i \log(r)} = \text{constant}$, that is, $s_i \propto p_i$. – Replacing in the constraint gives $s_i = p_i$. Therefore $l_i = -\log_r(p_i)$ and $\sum_i p_i l_i = -\sum_i p_i \log_r(p_i)$. Fractional code lengths: what does it mean to code a message on 0.5 symbols?
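  A minimal numerical sketch (assumed distribution, not from the slides) comparing the fractional optimum $l_i = -\log_r(p_i)$ and its lower bound with rounded-up integer lengths:

      # Optimal fractional lengths, the resulting lower bound, and the
      # average length after rounding lengths up to integers.
      import math

      probs = [0.4, 0.3, 0.2, 0.1]   # assumed distribution, r = 2 symbols
      r = 2

      optimal = [-math.log(p, r) for p in probs]
      lower_bound = sum(p * l for p, l in zip(probs, optimal))
      rounded = [math.ceil(l) for l in optimal]
      rounded_avg = sum(p * l for p, l in zip(probs, rounded))

      print(lower_bound)   # ~1.846 bits per message
      print(rounded)       # [2, 2, 3, 4] still satisfies the Kraft inequality
      print(rounded_avg)   # 2.4 bits per message with integer lengths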

  11. Arithmetic coding. – An infinite sequence of messages $i_1, i_2, \dots$ can be viewed as a number $x = 0.i_1 i_2 i_3 \dots$ in base $n$. – An infinite sequence of symbols $c_1, c_2, \dots$ can be viewed as a number $y = 0.c_1 c_2 c_3 \dots$ in base $r$. [Figure: nested subdivisions of the unit interval relating message sequences to symbol sequences.]

  12. Arithmetic coding. [Figure: the interval associated with a message sequence shrinks in proportion to the message probabilities.] To encode a sequence of $L$ messages $i_1, \dots, i_L$: – The code $y$ must belong to an interval of size $\prod_{k=1}^{L} p_{i_k}$. – It is sufficient to specify $l(i_1 i_2 \dots i_L) = \left\lceil -\log_r \Big( \prod_{k=1}^{L} p_{i_k} \Big) \right\rceil$ digits of $y$.
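  A minimal toy sketch (assumed probabilities and sequence, not from the slides) that tracks the interval for a message sequence and evaluates the digit count $\lceil -\log_r(\text{width}) \rceil$ from this slide:

      # Track the interval [low, low + width) for a message sequence, then
      # count the base-2 digits needed according to the slide's formula.
      import math

      def arithmetic_interval(sequence, probs):
          low, width = 0.0, 1.0
          cumulative = [0.0]
          for p in probs:
              cumulative.append(cumulative[-1] + p)
          for i in sequence:
              low += width * cumulative[i]
              width *= probs[i]
          return low, width

      probs = [0.5, 0.3, 0.2]        # assumed message probabilities
      seq = [0, 2, 1, 0, 0]          # i_1 ... i_L
      low, width = arithmetic_interval(seq, probs)
      print(low, width, math.ceil(-math.log2(width)))   # width 0.0075 -> 8 bits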

  13. Arithmetic coding. To encode a sequence of $L$ messages $i_1, \dots, i_L$: – It is sufficient to specify $l(i_1 i_2 \dots i_L) = \left\lceil -\sum_{k=1}^{L} \log_r(p_{i_k}) \right\rceil$ digits of $y$. – The average code length per message is $\frac{1}{L} \sum_{i_1 i_2 \dots i_L} p_{i_1} \cdots p_{i_L} \left\lceil -\sum_{k=1}^{L} \log_r(p_{i_k}) \right\rceil \xrightarrow[L \to \infty]{} -\frac{1}{L} \sum_{i_1 i_2 \dots i_L} p_{i_1} \cdots p_{i_L} \sum_{k=1}^{L} \log_r(p_{i_k}) = -\frac{1}{L} \sum_{k=1}^{L} \sum_{i_k} \Big( \sum_{i_1 \dots i_L \setminus i_k} \prod_{h \neq k} p_{i_h} \Big)\, p_{i_k} \log_r(p_{i_k}) = -\sum_i p_i \log_r(p_i)$. Arithmetic coding reaches the lower bound when $L \to \infty$.
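  A minimal numerical check (assumed two-message distribution, not from the slides) that the expected per-message length $\frac{1}{L}\lceil -\log_r \prod_k p_{i_k} \rceil$ approaches the entropy as $L$ grows:

      # Exhaustively average ceil(-log2 P(sequence)) / L over all sequences.
      import itertools, math

      probs = [0.7, 0.3]   # assumed distribution, r = 2
      entropy = -sum(p * math.log2(p) for p in probs)
      print(entropy)       # ~0.881 bits

      for L in (1, 2, 4, 8, 16):
          avg = 0.0
          for seq in itertools.product(range(len(probs)), repeat=L):
              p_seq = math.prod(probs[i] for i in seq)
              avg += p_seq * math.ceil(-math.log2(p_seq))
          print(L, avg / L)   # approaches the entropy from above; gap at most 1/L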

  14. Quantity of information. Optimal code length: $l_i = -\log_r(p_i)$. Optimal expected code length: $\sum_i p_i l_i = -\sum_i p_i \log_r(p_i)$. Receiving a message $x$ with probability $p_x$: – The acquired information is $h(x) = -\log_2(p_x)$ bits. – An informative message is a surprising message! Expecting a message $X$ with distribution $p_1 \dots p_n$: – The expected information is $H(X) = -\sum_{x \in \mathcal{X}} p_x \log_2(p_x)$ bits. – This is also called entropy. These are two distinct definitions! Note how we switched to logarithms in base two. This is a multiplicative factor: $\log_2(p) = \log_r(p) \log_2(r)$. Choosing base 2 defines a unit of information: the bit.
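  A minimal sketch (assumed examples, not from the slides) of the two definitions, surprisal of one message and entropy of a distribution, both in bits:

      # h(x) = -log2(p_x) for a single message; H(X) = -sum_x p_x log2(p_x).
      import math

      def surprisal(p):
          return -math.log2(p)

      def entropy(probs):
          return -sum(p * math.log2(p) for p in probs if p > 0)

      print(surprisal(0.5))                      # 1.0 bit
      print(surprisal(0.01))                     # ~6.64 bits: surprising = informative
      print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits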

  15. Mutual information. [Table: an example joint distribution of two discrete variables.] – Expected information: $H(X) = -\sum_i P(X=i) \log P(X=i)$. – Joint information: $H(X, Y) = -\sum_{i,j} P(X=i, Y=j) \log P(X=i, Y=j)$. – Mutual information: $I(X, Y) = H(X) + H(Y) - H(X, Y)$.
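  A minimal sketch (assumed joint table, not from the slides) computing $I(X,Y) = H(X) + H(Y) - H(X,Y)$ from a joint distribution:

      # Mutual information from a 2x2 joint probability table.
      import math

      def entropy(probs):
          return -sum(p * math.log2(p) for p in probs if p > 0)

      joint = [[0.30, 0.10],    # P(X=i, Y=j), assumed values
               [0.10, 0.50]]

      p_x = [sum(row) for row in joint]            # marginal of X
      p_y = [sum(col) for col in zip(*joint)]      # marginal of Y
      p_xy = [p for row in joint for p in row]     # flattened joint

      mi = entropy(p_x) + entropy(p_y) - entropy(p_xy)
      print(mi)   # ~0.26 bits shared between X and Y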

  16. II. Decision trees
