

SLIDE 1

Information Theory, Statistics, and Decision Trees

Léon Bottou, COS 424, 4/6/2010

SLIDE 2

Summary

  • 1. Basic information theory.
  • 2. Decision trees.
  • 3. Information theory and statistics.

SLIDE 3
  • I. Basic information theory

SLIDE 4

Why do we care?

Information theory
– Invented by Claude Shannon in 1948: A Mathematical Theory of Communication, Bell System Technical Journal, October 1948.
– The "quantity of information" measured in "bits".
– The "capacity of a transmission channel".
– Data coding and data compression.

Information gain
– A derived concept.
– Quantify how much information we acquire about a phenomenon.
– A justification for the Kullback-Leibler divergence.

SLIDE 5

The coding paradigm

Intuition
– The quantity of information of a message is the length of the smallest code that can represent the message.

Paradigm
– Assume there are $n$ possible messages $i = 1 \dots n$.
– We want a signal that indicates the occurrence of one of them.
– We can transmit an alphabet of $r$ symbols. For instance a wire could carry $r = 2$ electrical levels.
– The code for message $i$ is a sequence of $l_i$ symbols.

Properties
– Codes should be uniquely decodable.
– Average code length for a message: $\sum_{i=1}^{n} p_i\, l_i$.
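As a quick illustration (mine, not from the lecture), the snippet below evaluates this average code length for made-up probabilities and lengths:

```python
p = [0.5, 0.25, 0.125, 0.125]    # made-up message probabilities p_i
l = [1, 2, 3, 3]                 # code lengths l_i, in symbols

avg_length = sum(pi * li for pi, li in zip(p, l))
print(avg_length)                # 1.75 symbols per message on average
```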

SLIDE 6

Prefix codes

– Messages 1 and 2 have codes one symbol long ($l_i = 1$).
– Messages 3 and 4 have codes two symbols long ($l_i = 2$).
– Messages 5 and 6 have codes three symbols long ($l_i = 3$).
– There is an unused three-symbol code. That is inefficient.

Properties
– Prefix codes are uniquely decodable.
– There are trickier kinds of uniquely decodable codes, e.g. a → 0, b → 01, c → 011 versus a → 0, b → 10, c → 110.

SLIDE 7

Kraft inequality

Uniquely decodable codes satisfy

$$\sum_{i=1}^{n} \frac{1}{r^{l_i}} \le 1$$

– All uniquely decodable codes satisfy this inequality.
– If integer code lengths $l_i$ satisfy this inequality, there exists a prefix code with such code lengths.

Consequences
– If some messages have short codes, others must have long codes.
– To minimize the average code length:

  • give short codes to high probability messages.
  • give long codes to low probability messages.

– Equiprobable messages should have similar code lengths.

SLIDE 8

Kraft inequality for prefix codes

Prefix codes satisfy the Kraft inequality

$$\sum_i r^{\,l - l_i} \le r^{\,l} \iff \sum_i \frac{1}{r^{\,l_i}} \le 1 \qquad \text{where } l = \max_i l_i$$

All uniquely decodable codes satisfy the Kraft inequality
– The proof must deal with infinite sequences of messages.

Given integer code lengths $l_i$ that satisfy the inequality:
– Build a balanced $r$-ary tree of depth $l = \max_i l_i$.
– For each message, prune one subtree at depth $l_i$.
– The Kraft inequality ensures that there will be enough branches left to define a code for each message.
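A minimal Python sketch of this constructive argument (my own code, with made-up example lengths): it checks the Kraft inequality and assigns prefix codewords by reserving disjoint intervals of size $r^{-l_i}$, which mirrors the subtree-pruning construction described above.

```python
from fractions import Fraction

def kraft_sum(lengths, r=2):
    """Compute sum_i r^{-l_i} exactly."""
    return sum(Fraction(1, r ** l) for l in lengths)

def prefix_code(lengths, r=2):
    """Assign a prefix codeword of each requested length (shortest first)."""
    assert kraft_sum(lengths, r) <= 1, "Kraft inequality violated"
    codes, acc = {}, Fraction(0)
    for i, l in sorted(enumerate(lengths), key=lambda t: t[1]):
        digits, x = [], acc
        for _ in range(l):              # write acc with l digits in base r
            x *= r
            digits.append(str(int(x)))
            x -= int(x)
        codes[i] = "".join(digits)
        acc += Fraction(1, r ** l)      # reserve an interval of size r^{-l}
    return codes

print(prefix_code([1, 2, 3, 3]))        # {0: '0', 1: '10', 2: '110', 3: '111'}
```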

SLIDE 9

Redundant codes

Assume

$$\sum_i r^{-l_i} < 1$$

– There are leftover branches in the tree.
– There are codes that are not used, or there are multiple codes for some messages.

For best compression,

$$\sum_i r^{-l_i} = 1$$

– This is not always possible with integer code lengths $l_i$.
– But we can use this to compute a lower bound.

SLIDE 10

Lower bound for the average code length

Choose code lengths $l_i$ such that

$$\min_{l_1 \dots l_n} \sum_i p_i\, l_i \quad \text{subject to} \quad \sum_i r^{-l_i} = 1,\ l_i > 0$$

– Define $s_i = r^{-l_i}$, that is, $l_i = -\log_r(s_i)$.
– Maximize $C = \sum_i p_i \log_r(s_i)$ subject to $\sum_i s_i = 1$.
– We get $\dfrac{\partial C}{\partial s_i} = \dfrac{p_i}{s_i \log(r)} = \text{Constant}$, that is, $s_i \propto p_i$.
– Replacing in the constraint gives $s_i = p_i$. Therefore

$$l_i = -\log_r(p_i) \qquad \text{and} \qquad \sum_i p_i\, l_i = -\sum_i p_i \log_r(p_i)$$

Fractional code lengths
– What does it mean to code a message on 0.5 symbols?
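A small numeric check (my own, not from the slides): compare the entropy lower bound with the average length obtained from the integer Shannon code lengths $\lceil -\log_2 p_i \rceil$, which satisfy the Kraft inequality but generally stay above the bound.

```python
import math

def entropy_bound(p):
    """Lower bound -sum_i p_i log2 p_i on the average code length (r = 2)."""
    return -sum(pi * math.log2(pi) for pi in p)

def shannon_avg_length(p):
    """Average length with the integer lengths l_i = ceil(-log2 p_i)."""
    lengths = [math.ceil(-math.log2(pi)) for pi in p]
    return sum(pi * li for pi, li in zip(p, lengths)), lengths

p = [0.4, 0.3, 0.2, 0.1]                 # made-up message probabilities
print(entropy_bound(p))                  # about 1.846 bits
print(shannon_avg_length(p))             # (2.4, [2, 2, 3, 4]): above the bound
```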

SLIDE 11

Arithmetic coding

– An infinite sequence of messages $i_1, i_2, \dots$ can be viewed as a number $x = 0.i_1 i_2 i_3 \dots$ written in base $n$.
– An infinite sequence of symbols $c_1, c_2, \dots$ can be viewed as a number $y = 0.c_1 c_2 c_3 \dots$ written in base $r$.

SLIDE 12

Arithmetic coding

To encode a sequence of $L$ messages $i_1, \dots, i_L$:
– The code $y$ must belong to an interval of size $\prod_{k=1}^{L} p_{i_k}$.
– It is sufficient to specify $l(i_1 i_2 \dots i_L) = \left\lceil -\sum_{k=1}^{L} \log_r(p_{i_k}) \right\rceil$ digits of $y$.
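A toy sketch of the interval idea (my own code; the probabilities and the message sequence are made up): each message narrows the interval $[\text{low}, \text{high})$ proportionally to its probability, and the final width is $\prod_k p_{i_k}$.

```python
import math

def arithmetic_interval(messages, probs):
    """Return the interval [low, high) whose points all encode `messages`."""
    # cumulative starts: message -> (start of its sub-interval, probability)
    cum, start = {}, 0.0
    for m, p in probs.items():
        cum[m] = (start, p)
        start += p
    low, high = 0.0, 1.0
    for m in messages:                       # each message narrows the interval
        c, p = cum[m]
        width = high - low
        low, high = low + c * width, low + (c + p) * width
    return low, high

probs = {'a': 0.5, 'b': 0.25, 'c': 0.25}     # made-up message probabilities
low, high = arithmetic_interval("abac", probs)
print(high - low)                            # 0.5 * 0.25 * 0.5 * 0.25 = 0.015625
print(math.ceil(-math.log2(high - low)))     # ~ -sum_k log2 p_ik digits suffice
```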

SLIDE 13

Arithmetic coding

To encode a sequence of $L$ messages $i_1, \dots, i_L$:
– It is sufficient to specify $l(i_1 i_2 \dots i_L) = \left\lceil -\sum_{k=1}^{L} \log_r(p_{i_k}) \right\rceil$ digits of $y$.
– The average code length per message is

$$\frac{1}{L} \sum_{i_1 i_2 \dots i_L} p_{i_1} \cdots p_{i_L} \left\lceil \sum_{k=1}^{L} -\log_r(p_{i_k}) \right\rceil
\;\xrightarrow[L \to \infty]{}\;
\sum_{i_1 i_2 \dots i_L} p_{i_1} \cdots p_{i_L} \sum_{k=1}^{L} \frac{-\log_r(p_{i_k})}{L}
= \frac{1}{L} \sum_{k=1}^{L} \Bigg( \sum_{i_1 \dots i_L \setminus i_k} \prod_{h \neq k} p_{i_h} \Bigg) \sum_{i_k=1}^{n} \big( -p_{i_k} \log_r p_{i_k} \big)
= -\sum_i p_i \log_r p_i$$

Arithmetic coding reaches the lower bound when $L \to \infty$.

SLIDE 14

Quantity of information

Optimal code length: $l_i = -\log_r(p_i)$.
Optimal expected code length: $\sum_i p_i\, l_i = -\sum_i p_i \log_r(p_i)$.

Receiving a message $x$ with probability $p_x$:
– The acquired information is $h(x) = -\log_2(p_x)$ bits.
– An informative message is a surprising message!

Expecting a message $X$ with distribution $p_1 \dots p_n$:
– The expected information is $H(X) = -\sum_{x \in X} p_x \log_2(p_x)$ bits.
– This is also called entropy.

These are two distinct definitions!

Note how we switched to logarithms in base two. This is a multiplicative factor: $\log_2(p) = \log_r(p) \log_2(r)$. Choosing base 2 defines a unit of information: the bit.
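A direct transcription of the two definitions into Python (my own snippet, with made-up probabilities):

```python
import math

def information(p_x):
    """Information h(x) = -log2 p_x, in bits, acquired when message x arrives."""
    return -math.log2(p_x)

def entropy(p):
    """Expected information H(X) = -sum_x p_x log2 p_x, in bits."""
    return -sum(px * math.log2(px) for px in p if px > 0)

print(information(0.5))              # 1 bit: a fair coin flip
print(information(0.01))             # ~6.64 bits: surprising messages are informative
print(entropy([0.5, 0.25, 0.25]))    # 1.5 bits
```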

SLIDE 15

Mutual information

– Expected information: $H(X) = -\sum_i P(X = i) \log P(X = i)$
– Joint information: $H(X, Y) = -\sum_{i,j} P(X = i, Y = j) \log P(X = i, Y = j)$
– Mutual information: $I(X, Y) = H(X) + H(Y) - H(X, Y)$
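A small check of the identity $I(X, Y) = H(X) + H(Y) - H(X, Y)$ on a made-up joint distribution (my own snippet):

```python
import math

def entropy(ps):
    return -sum(p * math.log2(p) for p in ps if p > 0)

# made-up joint distribution P(X = i, Y = j): rows index X, columns index Y
joint = [[0.30, 0.10],
         [0.05, 0.55]]
p_x = [sum(row) for row in joint]                 # marginal of X
p_y = [sum(col) for col in zip(*joint)]           # marginal of Y
h_xy = entropy(p for row in joint for p in row)   # joint entropy H(X, Y)

print(entropy(p_x) + entropy(p_y) - h_xy)         # I(X, Y) in bits
```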

SLIDE 16
  • II. Decision trees

SLIDE 17

Car mileage

Predict which cars have better mileage than 19 mpg.

mpg   cyl  disp   hp     weight  accel  year  name
15.0  8    350.0  165.0  3693    11.5   70    buick skylark 320
18.0  8    318.0  150.0  3436    11.0   70    plymouth satellite
15.0  8    429.0  198.0  4341    10.0   70    ford galaxie 500
14.0  8    454.0  220.0  4354    9.0    70    chevrolet impala
15.0  8    390.0  190.0  3850    8.5    70    amc ambassador dpl
14.0  8    340.0  160.0  3609    8.0    70    plymouth cuda 340
18.0  4    121.0  112.0  2933    14.5   72    volvo 145e
22.0  4    121.0  76.00  2511    18.0   72    volkswagen 411
21.0  4    120.0  87.00  2979    19.5   72    peugeot 504
26.0  4    96.0   69.00  2189    18.0   72    renault 12
22.0  4    122.0  86.00  2310    16.0   72    ford pinto
28.0  4    97.0   92.00  2288    17.0   72    datsun 510
13.0  8    440.0  215.0  4735    11.0   73    chrysler new yorker
. . .

SLIDE 18

Questions

Many questions can distinguish cars
– How many cylinders? (3, 4, 5, 8)
– Displacement greater than 200 cu in? (yes, no)
– Displacement greater than x cu in? (yes, no)
– Weight greater than x lbs? (yes, no)
– Model name longer than x characters? (yes, no)
– etc.

Which question brings the most information about the task?
– Build the contingency table.
– Compare the mutual informations I(Question, Mpg > 19).

Possible answers:
           ansA  ansB  ansC  ansD
mpg > 19     12    23    65     5
mpg ≤ 19     18    12     4     4

SLIDE 19

Mutual information

Consider a contingency table $x_{ij}$.
– $1 \le j \le p$ refers to the question answers $X$.
– $1 \le i \le n$ refers to the target values $Y$.

           ansA  ansB  ansC  ansD
mpg > 19     12    23    65     5
mpg ≤ 19     18    12     4     4

Let $x_{i\bullet} = \sum_{j=1}^{p} x_{ij}$, $\ x_{\bullet j} = \sum_{i=1}^{n} x_{ij}$, and $\ x_{\bullet\bullet} = \sum_{i=1}^{n} \sum_{j=1}^{p} x_{ij}$.

Mutual information:

$$I(X, Y) = -H(X, Y) + H(X) + H(Y)
= \sum_{ij} \frac{x_{ij}}{x_{\bullet\bullet}} \log \frac{x_{ij}}{x_{\bullet\bullet}}
- \sum_{j} \frac{x_{\bullet j}}{x_{\bullet\bullet}} \log \frac{x_{\bullet j}}{x_{\bullet\bullet}}
- \sum_{i} \frac{x_{i\bullet}}{x_{\bullet\bullet}} \log \frac{x_{i\bullet}}{x_{\bullet\bullet}}$$
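A numeric sketch (mine) that applies this formula to the contingency table shown above, using natural logarithms; any other base only rescales the result:

```python
import math

def mutual_information(table):
    """I(X, Y) from a contingency table of counts x_ij (natural log)."""
    xdd = sum(sum(row) for row in table)               # x..
    xi_ = [sum(row) for row in table]                  # row marginals  x_i.
    x_j = [sum(col) for col in zip(*table)]            # column marginals x_.j
    plogp = lambda v: (v / xdd) * math.log(v / xdd) if v > 0 else 0.0
    return (sum(plogp(x) for row in table for x in row)
            - sum(plogp(x) for x in x_j)
            - sum(plogp(x) for x in xi_))

table = [[12, 23, 65, 5],    # mpg > 19
         [18, 12,  4, 4]]    # mpg <= 19
print(mutual_information(table))   # larger values = more informative question
```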

SLIDE 20

Decision stump

– The question generates a partition of the examples.
– Now we can repeat the process for each node:
  – build the contingency tables,
  – pick the most informative question (see the sketch below).
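A compact sketch of this greedy procedure (my own illustration, not the lecture's code). The helpers `mutual_information` and `contingency` are assumptions: for instance, the mutual information function sketched after slide 19, and a routine that builds the contingency table of a question against the target.

```python
from collections import Counter

def grow_tree(examples, questions, target, mutual_information, contingency):
    """Grow a tree greedily: split on the most informative question until
    every leaf is pure (or no questions remain), then label the leaf."""
    counts = Counter(target(e) for e in examples)
    if len(counts) <= 1 or not questions:
        return {"leaf": counts.most_common(1)[0][0]}   # dominant class of the node
    best = max(questions,
               key=lambda q: mutual_information(contingency(examples, q, target)))
    children = {}
    for answer in {best(e) for e in examples}:         # partition the examples
        subset = [e for e in examples if best(e) == answer]
        remaining = [q for q in questions if q is not best]
        children[answer] = grow_tree(subset, remaining, target,
                                     mutual_information, contingency)
    return {"question": best, "children": children}
```

Here each question is a callable mapping an example to its answer, and `target` maps an example to its class (e.g. whether mpg > 19).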

SLIDE 21

Decision trees


Until all leaves contain a single car.

SLIDE 22

Decision trees


Then label each leaf with class MPG > 19 or MPG ≤ 19. We can now tell whether a car does more than 19 mpg by asking a few questions. But that is learning by heart!

SLIDE 23

Pruning the decision tree

We can label each node with its dominant class MPG > 19 or MPG ≤ 19.

  • The usual picture.

Should we use a validation set? Which stopping criterion?
– the node depth?
– the node population?

SLIDE 24

The χ2 independence test

We met this test when studying correspondence analysis (lecture 10).

$$x_{i\bullet} = \sum_{j=1}^{p} x_{ij}, \qquad x_{\bullet j} = \sum_{i=1}^{n} x_{ij}, \qquad x_{\bullet\bullet} = \sum_{i=1}^{n} \sum_{j=1}^{p} x_{ij}, \qquad E_{ij} = \frac{x_{i\bullet}\, x_{\bullet j}}{x_{\bullet\bullet}}$$

If the row and column variables were independent,

$$X^2 = \sum_{ij} \frac{(x_{ij} - E_{ij})^2}{E_{ij}}$$

would asymptotically follow a $\chi^2$ distribution with $(n - 1)(p - 1)$ degrees of freedom.
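A worked example (mine, assuming NumPy is available) that computes $E_{ij}$ and $X^2$ for the contingency table of slide 18:

```python
import numpy as np

x = np.array([[12, 23, 65, 5],
              [18, 12,  4, 4]], dtype=float)
xi_ = x.sum(axis=1, keepdims=True)      # row sums    x_i.
x_j = x.sum(axis=0, keepdims=True)      # column sums x_.j
xdd = x.sum()                           # grand total x_..

E = xi_ @ x_j / xdd                     # expected counts under independence
X2 = ((x - E) ** 2 / E).sum()
dof = (x.shape[0] - 1) * (x.shape[1] - 1)
print(X2, dof)                          # statistic and its degrees of freedom
```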

SLIDE 25

Pruning a decision tree with the χ2 test

We want to prune nodes when the contingency table suggests that there is no dependence between the question and the target class.
– Compute $X^2 = \sum_{ij} \frac{(x_{ij} - E_{ij})^2}{E_{ij}}$ for each node.
– Prune if $1 - F_{\chi^2}(X^2) > p$.

Parameter $p$ could be picked by cross-validation, but choosing $p = 0.05$ often works well enough.
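A sketch of the pruning rule (my own code, assuming SciPy is available for the $\chi^2$ CDF); `X2` and `dof` would come from the previous snippet:

```python
from scipy.stats import chi2

def should_prune(X2, dof, p=0.05):
    """Prune when the test cannot reject independence at level p."""
    p_value = 1.0 - chi2.cdf(X2, dof)
    return p_value > p

print(should_prune(1.2, dof=3))    # True: weak evidence of dependence, prune
print(should_prune(45.0, dof=3))   # False: strong dependence, keep the split
```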

SLIDE 26

Conclusion

Good points
– Decision trees run quickly.
– Decision trees can handle all kinds of input variables.
– Decision trees can be interpreted relatively easily.
– Decision trees can handle lots of irrelevant features.

Bad points
– Decision trees are moderately accurate.
– Small changes in the training set can lead to very different trees (were we speaking about interpretability. . . ).

Notes
– Other names for decision trees: ID3, C4.5, CART.
– Regression tree when the target is continuous.

SLIDE 27
  • III. Information theory and statistics

SLIDE 28

Revisiting decision trees: likelihoods

The tree as a model of $P(Y|X)$
– Estimate $P(Y|X)$ by the target frequencies in the leaf for $X$.
– We can compute the likelihood of the data in this model.

Likelihood gain when splitting a node
– Let $x_{ij}$ be the contingency table for a node and a question.
– Splitting the node with a question increases the likelihood:

$$\log L_{\text{after}} - \log L_{\text{before}}
= \sum_{ij} x_{ij} \log \frac{x_{ij}}{x_{\bullet j}} - \sum_{i} x_{i\bullet} \log \frac{x_{i\bullet}}{x_{\bullet\bullet}}
= \sum_{ij} x_{ij} \log \left( \frac{x_{ij}}{x_{\bullet\bullet}} \cdot \frac{x_{\bullet\bullet}}{x_{\bullet j}} \right) - \sum_{i} x_{i\bullet} \log \frac{x_{i\bullet}}{x_{\bullet\bullet}}
= \sum_{ij} x_{ij} \log \frac{x_{ij}}{x_{\bullet\bullet}} - \sum_{j} x_{\bullet j} \log \frac{x_{\bullet j}}{x_{\bullet\bullet}} - \sum_{i} x_{i\bullet} \log \frac{x_{i\bullet}}{x_{\bullet\bullet}}$$

Compare with slide 19: this gain is exactly $x_{\bullet\bullet}$ times the mutual information of the split.
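A quick numeric check of this remark (my own snippet), again on the contingency table of slide 18:

```python
import math

table = [[12, 23, 65, 5],
         [18, 12,  4, 4]]
xdd = sum(sum(row) for row in table)
x_j = [sum(col) for col in zip(*table)]              # column marginals
xi_ = [sum(row) for row in table]                    # row marginals

log_l_after = sum(x * math.log(x / x_j[j])
                  for row in table for j, x in enumerate(row))
log_l_before = sum(x * math.log(x / xdd) for x in xi_)

mi = (sum((x / xdd) * math.log(x / xdd) for row in table for x in row)
      - sum((x / xdd) * math.log(x / xdd) for x in x_j)
      - sum((x / xdd) * math.log(x / xdd) for x in xi_))

print(log_l_after - log_l_before, xdd * mi)          # the two numbers agree
```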

SLIDE 29

Revisiting decision trees: log loss

The tree as a discriminant function
– Define $f(X) = \log \dfrac{p_X}{1 - p_X}$, where $p_X$ is the frequency of positive examples in the leaf corresponding to $X$.
– The log loss of an example with label $y \in \{-1, +1\}$ is then

$$\log\!\left(1 + e^{-y f(X)}\right) =
\begin{cases}
\log\!\left(1 + \dfrac{1 - p_X}{p_X}\right) = -\log(p_X) & \text{if } y = +1,\\[6pt]
\log\!\left(1 + \dfrac{p_X}{1 - p_X}\right) = -\log(1 - p_X) & \text{if } y = -1.
\end{cases}$$

Log loss reduction when splitting a node
– Let $x_{ij}$ be the contingency table for a node and a question.

$$R_{\text{before}} - R_{\text{after}}
= -\sum_{i} x_{i\bullet} \log \frac{x_{i\bullet}}{x_{\bullet\bullet}} + \sum_{j} \sum_{i} x_{ij} \log \frac{x_{ij}}{x_{\bullet j}}
= \sum_{ij} x_{ij} \log \frac{x_{ij}}{x_{\bullet\bullet}} - \sum_{j} x_{\bullet j} \log \frac{x_{\bullet j}}{x_{\bullet\bullet}} - \sum_{i} x_{i\bullet} \log \frac{x_{i\bullet}}{x_{\bullet\bullet}}$$

Compare with slides 19 and 28. Note: regression trees use the mean squared loss instead.

SLIDE 30

Kullback-Leibler divergence

Definition
– KL divergence between a "true distribution" $P(X)$ and an "estimated distribution" $P_\theta(X)$:

$$D(P \,\|\, P_\theta) = \int \log \frac{P(x)}{P_\theta(x)}\, dP(x) = \sum_x P(x) \log \frac{P(x)}{P_\theta(x)}
= \underbrace{-\sum_x P(x) \log P_\theta(x)}_{H_{\text{approx}}} \;-\; \underbrace{\Big( -\sum_x P(x) \log P(x) \Big)}_{H_{\text{opt}}}$$

$H_{\text{opt}}$: optimal coding length for $X$.
$H_{\text{approx}}$: expected code length for $X$ when the code is designed for distribution $P_\theta$ instead of the true distribution $P$.
– The KL divergence measures the excess coding bits when the code is optimized for the estimated distribution instead of the true distribution.
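A small illustration of the excess-bits reading (my own snippet, with made-up distributions), working in base 2 so the answer is in bits:

```python
import math

def kl_divergence(p, q):
    """D(P||Q) = sum_x P(x) log2(P(x)/Q(x)), in bits."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.25, 0.25]    # "true" distribution (made up)
q = [1/3, 1/3, 1/3]      # estimated distribution used to design the code

h_opt = -sum(px * math.log2(px) for px in p if px > 0)       # optimal code length
h_approx = -sum(px * math.log2(qx) for px, qx in zip(p, q))  # code designed for q
print(h_approx - h_opt, kl_divergence(p, q))                 # same excess-bits number
```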

SLIDE 31

Maximum Likelihood

Minimize the KL divergence

$$\min_\theta D(P \,\|\, P_\theta) = \int \log \frac{P(x)}{P_\theta(x)}\, dP(x) \iff \max_\theta \int \log P_\theta(x)\, dP(x)$$

Maximize the log likelihood

$$\max_\theta \frac{1}{n} \sum_{i=1}^{n} \log P_\theta(x_i)$$

The log likelihood estimates $\text{Constant} - D(P \,\|\, P_\theta)$ using the training set.
– Maximizing the likelihood minimizes an estimate of the excess coding bits obtained by coding the training set.
– One hopes to achieve a good coding performance on future data.

The Vapnik-Chervonenkis theory gives confidence intervals for the deviation between

$$\int \log P_{\theta^*}(x)\, dP(x) \qquad \text{and} \qquad \frac{1}{n} \sum_{i=1}^{n} \log P_{\theta^*}(x_i).$$
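As a tiny illustration of the coding view of maximum likelihood (my own example, not from the slides): for a categorical model the ML estimate is the empirical frequency, and the average negative log likelihood is the number of bits per message of a code designed from that estimate.

```python
from collections import Counter
import math

train = list("aababcaabbacaa")                    # made-up training messages
counts = Counter(train)
n = len(train)
p_theta = {m: c / n for m, c in counts.items()}   # maximum likelihood estimate

avg_nll = -sum(math.log2(p_theta[m]) for m in train) / n
print(p_theta)
print(avg_nll)    # average bits per message for a code designed from p_theta
```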
