SLIDE 1

Tree Models

Weinan Zhang
Shanghai Jiao Tong University
http://wnzhang.net
2019 CS420, Machine Learning, Lecture 5

http://wnzhang.net/teaching/cs420/index.html

SLIDE 2

ML Task: Function Approximation

  • Problem setting
  • Instance feature space $\mathcal{X}$
  • Instance label space $\mathcal{Y}$
  • Unknown underlying target function $f: \mathcal{X} \mapsto \mathcal{Y}$
  • Set of function hypotheses $H = \{h \mid h: \mathcal{X} \mapsto \mathcal{Y}\}$
  • Input: training data generated from the unknown target, $\{(x^{(i)}, y^{(i)})\} = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\}$
  • Output: a hypothesis $h \in H$ that best approximates $f$
  • Optimize in functional space, not just parameter space

SLIDE 3

Optimize in Functional Space

  • Tree models
  • Intermediate node for splitting data
  • Leaf node for label prediction
  • Continuous data example

[Figure: a decision tree on two continuous features. The root node splits on $x_1 < a_1$ (Yes / No); its children are intermediate nodes splitting on $x_2 < a_2$ and $x_2 < a_3$; the leaf nodes predict $y = 1$ or $y = -1$. The corresponding $(x_1, x_2)$ feature space is partitioned by $a_1$, $a_2$, $a_3$ into axis-parallel rectangles labeled Class 1 and Class 2.]

SLIDE 4

Optimize in Functional Space

  • Tree models
  • Intermediate node for splitting data
  • Leaf node for label prediction
  • Discrete/categorical data example

[Figure: a decision tree on categorical weather features. The root node splits on Outlook (Sunny / Overcast / Rain); the Sunny branch splits on Humidity (High / Normal) and the Rain branch splits on Wind (Strong / Weak); the leaf nodes predict $y = 1$ or $y = -1$.]

SLIDE 5

Decision Tree Learning

  • Problem setting
  • Instance feature space $\mathcal{X}$
  • Instance label space $\mathcal{Y}$
  • Unknown underlying target function $f: \mathcal{X} \mapsto \mathcal{Y}$
  • Set of function hypotheses $H = \{h \mid h: \mathcal{X} \mapsto \mathcal{Y}\}$
  • Input: training data generated from the unknown target, $\{(x^{(i)}, y^{(i)})\} = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\}$
  • Output: a hypothesis $h \in H$ that best approximates $f$
  • Here each hypothesis $h \in H$ is a decision tree

SLIDE 6

Decision Tree – Decision Boundary

  • Decision trees divide the feature space into axis-parallel (hyper-)rectangles
  • Each rectangular region is labeled with one label, or with a probability distribution over labels

Slide credit: Eric Eaton

SLIDE 7

History of Decision-Tree Research

  • Hunt and colleagues used exhaustive-search decision-tree methods (CLS) to model human concept learning in the 1960s.
  • In the late 1970s, Quinlan developed ID3 with the information gain heuristic to learn expert systems from examples.
  • Simultaneously, Breiman, Friedman and colleagues developed CART (Classification and Regression Trees), similar to ID3.
  • In the 1980s a variety of improvements were introduced to handle noise, continuous features, and missing features, along with improved splitting criteria. Various expert-system development tools resulted.
  • Quinlan's updated decision-tree package (C4.5) was released in 1993.
  • Sklearn (Python) and Weka (Java) now include ID3 and C4.5.

Slide credit: Raymond J. Mooney

SLIDE 8

Decision Trees

  • Tree models
  • Intermediate node for splitting data
  • Leaf node for label prediction
  • Key questions for decision trees
  • How to select node splitting conditions?
  • How to make prediction?
  • How to decide the tree structure?
SLIDE 9

Node Splitting

  • Which node splitting condition should we choose?
  • Choose the feature with the higher classification capacity
  • Quantitatively, the one with the higher information gain

[Figure: two candidate root splits on the weather data: Outlook (Sunny / Overcast / Rain) and Temperature (Hot / Mild / Cool).]

SLIDE 10

Fundamentals of Information Theory

  • Entropy (more specifically, Shannon entropy) is the expected value (average) of the information contained in each message.
  • Suppose $X$ is a random variable with $n$ discrete values, $P(X = x_i) = p_i$
  • Then its entropy $H(X)$ is

$H(X) = -\sum_{i=1}^{n} p_i \log p_i$

  • It is easy to verify

$H(X) = -\sum_{i=1}^{n} p_i \log p_i \le -\sum_{i=1}^{n} \frac{1}{n} \log \frac{1}{n} = \log n$
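
As a small aside (not part of the original slides), the entropy of an empirical distribution can be computed directly from the definition above; the helper below is a minimal Python sketch using base-2 logarithms.

    import math
    from collections import Counter

    def entropy(labels):
        """Shannon entropy H(X) = -sum_i p_i log2 p_i of a list of outcomes."""
        n = len(labels)
        counts = Counter(labels)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    # A uniform distribution over n values attains the maximum log2(n):
    print(entropy(["a", "b", "c", "d"]))  # 2.0 = log2(4)
    print(entropy(["a", "a", "a", "a"]))  # 0.0, no uncertainty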

SLIDE 11

Illustration of Entropy

  • Entropy of binary distribution

$H(X) = -p_1 \log p_1 - (1 - p_1) \log(1 - p_1)$

SLIDE 12

Cross Entropy

  • Cross entropy is used to measure the difference between the distributions of two random variables

$H(X, Y) = -\sum_{i=1}^{n} P(X = i) \log P(Y = i)$

  • Continuous formulation

$H(p, q) = -\int p(x) \log q(x)\,dx$

  • Compared to KL divergence

$D_{KL}(p \| q) = \int p(x) \log \frac{p(x)}{q(x)}\,dx = H(p, q) - H(p)$
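
A quick numerical check (added here as an aside, not from the slides): for two discrete distributions p and q given as probability vectors, the sketch below computes H(p, q) and verifies that D_KL(p || q) = H(p, q) - H(p).

    import math

    def cross_entropy(p, q):
        # H(p, q) = -sum_i p_i * log2(q_i)
        return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

    def kl_divergence(p, q):
        # D_KL(p || q) = sum_i p_i * log2(p_i / q_i)
        return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    p = [0.5, 0.25, 0.25]
    q = [0.25, 0.5, 0.25]
    h_p = cross_entropy(p, p)          # entropy H(p)
    print(kl_divergence(p, q))         # 0.25
    print(cross_entropy(p, q) - h_p)   # same value: H(p, q) - H(p)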

SLIDE 13

KL-Divergence

$D_{KL}(p \| q) = \int p(x) \log \frac{p(x)}{q(x)}\,dx = H(p, q) - H(p)$

Kullback–Leibler divergence (also called relative entropy) is a measure of how one probability distribution diverges from a second, expected probability distribution

SLIDE 14

Cross Entropy in Logistic Regression

  • Logistic regression is a binary classification model

$p_\theta(y = 1 \mid x) = \sigma(\theta^\top x) = \frac{1}{1 + e^{-\theta^\top x}} \qquad p_\theta(y = 0 \mid x) = \frac{e^{-\theta^\top x}}{1 + e^{-\theta^\top x}}$

  • Cross entropy loss function

$L(y, x; p_\theta) = -y \log \sigma(\theta^\top x) - (1 - y) \log(1 - \sigma(\theta^\top x))$

  • Gradient, using $\frac{\partial \sigma(z)}{\partial z} = \sigma(z)(1 - \sigma(z))$ with $z = \theta^\top x$

$\frac{\partial L(y, x; p_\theta)}{\partial \theta} = -y \frac{1}{\sigma(\theta^\top x)} \sigma(z)(1 - \sigma(z)) x - (1 - y) \frac{-1}{1 - \sigma(\theta^\top x)} \sigma(z)(1 - \sigma(z)) x = (\sigma(\theta^\top x) - y) x$

  • Update rule

$\theta \leftarrow \theta + (y - \sigma(\theta^\top x)) x$

(Review)
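
To make the update rule concrete, here is an illustrative sketch (not from the original slides; the learning rate lr and the toy data point are added assumptions) of one stochastic gradient step on the cross entropy loss:

    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def sgd_step(theta, x, y, lr=0.1):
        """One gradient step on the cross entropy loss for a single (x, y)."""
        pred = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        # the gradient of the loss w.r.t. theta is (sigma(theta^T x) - y) * x,
        # so we move theta in the opposite direction
        return [t + lr * (y - pred) * xi for t, xi in zip(theta, x)]

    theta = [0.0, 0.0]
    theta = sgd_step(theta, x=[1.0, 2.0], y=1)
    print(theta)  # parameters nudged toward predicting y = 1 for this x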

SLIDE 15

Conditional Entropy

  • Entropy

$H(X) = -\sum_{i=1}^{n} P(X = i) \log P(X = i)$

  • Specific conditional entropy of $X$ given $Y = v$

$H(X \mid Y = v) = -\sum_{i=1}^{n} P(X = i \mid Y = v) \log P(X = i \mid Y = v)$

  • Conditional entropy of $X$ given $Y$

$H(X \mid Y) = \sum_{v \in \mathrm{values}(Y)} P(Y = v)\, H(X \mid Y = v)$

  • Information gain of $X$ given $Y$

$I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y)$

  • Here $H(X, Y)$ is the entropy of the joint variable $(X, Y)$, not the cross entropy
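
As an added illustration (not in the original slides), these quantities translate directly into code. The sketch below computes H(X | Y) and I(X; Y) from paired observations; the small dataset at the end is made up for demonstration.

    import math
    from collections import Counter, defaultdict

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(x_labels, y_values):
        """I(X; Y) = H(X) - H(X | Y) for paired observations of X and Y."""
        groups = defaultdict(list)
        for x, y in zip(x_labels, y_values):
            groups[y].append(x)
        n = len(x_labels)
        h_x_given_y = sum(len(g) / n * entropy(g) for g in groups.values())
        return entropy(x_labels) - h_x_given_y

    # X: play (+1) or not (-1); Y: a candidate splitting feature
    play = [+1, +1, -1, -1, +1, -1]
    wind = ["Weak", "Weak", "Strong", "Strong", "Weak", "Weak"]
    print(information_gain(play, wind))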

SLIDE 16

Information Gain

  • Information gain of $X$ given $Y$

$I(X; Y) = H(X) - H(X \mid Y)$
$= -\sum_{v} P(X = v) \log P(X = v) + \sum_{u} P(Y = u) \sum_{v} P(X = v \mid Y = u) \log P(X = v \mid Y = u)$
$= -\sum_{v} P(X = v) \log P(X = v) + \sum_{u} \sum_{v} P(X = v, Y = u) \log P(X = v \mid Y = u)$
$= -\sum_{v} P(X = v) \log P(X = v) + \sum_{u} \sum_{v} P(X = v, Y = u) [\log P(X = v, Y = u) - \log P(Y = u)]$
$= -\sum_{v} P(X = v) \log P(X = v) - \sum_{u} P(Y = u) \log P(Y = u) + \sum_{u,v} P(X = v, Y = u) \log P(X = v, Y = u)$
$= H(X) + H(Y) - H(X, Y)$

  • Here $H(X, Y)$ is the entropy of the joint variable $(X, Y)$, not the cross entropy:

$H(X, Y) = -\sum_{u,v} P(X = v, Y = u) \log P(X = v, Y = u)$

SLIDE 17

Node Splitting

  • Information gain

[Figure: two candidate root splits on the weather data: Outlook (Sunny / Overcast / Rain) and Temperature (Hot / Mild / Cool).]

$H(X \mid Y = v) = -\sum_{i=1}^{n} P(X = i \mid Y = v) \log P(X = i \mid Y = v) \qquad H(X \mid Y) = \sum_{v \in \mathrm{values}(Y)} P(Y = v)\, H(X \mid Y = v)$

  • Splitting on Outlook ($Y$):

$H(X \mid Y = \mathrm{Sunny}) = -\tfrac{3}{5}\log\tfrac{3}{5} - \tfrac{2}{5}\log\tfrac{2}{5} = 0.9710$
$H(X \mid Y = \mathrm{Overcast}) = -\tfrac{4}{4}\log\tfrac{4}{4} = 0$
$H(X \mid Y = \mathrm{Rain}) = -\tfrac{4}{5}\log\tfrac{4}{5} - \tfrac{1}{5}\log\tfrac{1}{5} = 0.7219$
$H(X \mid Y) = \tfrac{5}{14} \times 0.9710 + \tfrac{4}{14} \times 0 + \tfrac{5}{14} \times 0.7219 = 0.6046$
$I(X; Y) = H(X) - H(X \mid Y) = 1 - 0.6046 = 0.3954$

  • Splitting on Temperature ($Y$):

$H(X \mid Y = \mathrm{Hot}) = -\tfrac{2}{4}\log\tfrac{2}{4} - \tfrac{2}{4}\log\tfrac{2}{4} = 1$
$H(X \mid Y = \mathrm{Mild}) = -\tfrac{1}{4}\log\tfrac{1}{4} - \tfrac{3}{4}\log\tfrac{3}{4} = 0.8113$
$H(X \mid Y = \mathrm{Cool}) = -\tfrac{4}{6}\log\tfrac{4}{6} - \tfrac{2}{6}\log\tfrac{2}{6} = 0.9183$
$H(X \mid Y) = \tfrac{4}{14} \times 1 + \tfrac{4}{14} \times 0.8113 + \tfrac{6}{14} \times 0.9183 = 0.9111$
$I(X; Y) = H(X) - H(X \mid Y) = 1 - 0.9111 = 0.0889$

SLIDE 18

Information Gain Ratio

  • The ratio between the information gain and the entropy of the split induced by $Y$

$I_R(X; Y) = \frac{I(X; Y)}{H_Y(X)} = \frac{H(X) - H(X \mid Y)}{H_Y(X)}$

  • where the entropy (of $Y$) is

$H_Y(X) = -\sum_{v \in \mathrm{values}(Y)} \frac{|X_{y=v}|}{|X|} \log \frac{|X_{y=v}|}{|X|}$

  • where $|X_{y=v}|$ is the number of observations with feature value $y = v$
  • NOTE: $H_Y(X)$ measures how finely the variable $Y$ could partition the data by itself.
  • Normally we don't want a $Y$ that yields a good information gain on $X$ just because $Y$ itself performs a very fine-grained partition of the data.
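
Continuing the illustrative sketches above (added, not part of the slides), the gain ratio simply normalizes the information gain by the split entropy H_Y(X); the toy data is made up.

    import math
    from collections import Counter, defaultdict

    def _entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def gain_ratio(x_labels, y_values):
        """I_R(X; Y) = (H(X) - H(X | Y)) / H_Y(X)."""
        n = len(x_labels)
        groups = defaultdict(list)
        for x, y in zip(x_labels, y_values):
            groups[y].append(x)
        h_x_given_y = sum(len(g) / n * _entropy(g) for g in groups.values())
        info_gain = _entropy(x_labels) - h_x_given_y
        h_y = _entropy(y_values)  # entropy of the partition induced by Y
        return info_gain / h_y

    play = [+1, +1, -1, -1, +1, -1]
    wind = ["Weak", "Weak", "Strong", "Strong", "Weak", "Weak"]
    print(gain_ratio(play, wind))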

SLIDE 19

Node Splitting

  • Information gain ratio

[Figure: the two candidate root splits again: Outlook (Sunny / Overcast / Rain) and Temperature (Hot / Mild / Cool).]

$I_R(X; Y) = \frac{I(X; Y)}{H_Y(X)} = \frac{H(X) - H(X \mid Y)}{H_Y(X)}$

  • Splitting on Outlook:

$I(X; Y) = H(X) - H(X \mid Y) = 1 - 0.6046 = 0.3954$
$H_Y(X) = -\tfrac{5}{14}\log\tfrac{5}{14} - \tfrac{4}{14}\log\tfrac{4}{14} - \tfrac{5}{14}\log\tfrac{5}{14} = 1.5774$
$I_R(X; Y) = \frac{0.3954}{1.5774} = 0.2507$

  • Splitting on Temperature:

$I(X; Y) = H(X) - H(X \mid Y) = 1 - 0.9111 = 0.0889$
$H_Y(X) = -\tfrac{4}{14}\log\tfrac{4}{14} - \tfrac{4}{14}\log\tfrac{4}{14} - \tfrac{6}{14}\log\tfrac{6}{14} = 1.5567$
$I_R(X; Y) = \frac{0.0889}{1.5567} = 0.0571$

SLIDE 20

Decision Tree Building: ID3 Algorithm

  • ID3 (Iterative Dichotomiser 3) is an algorithm invented by Ross Quinlan
  • ID3 is the precursor to the C4.5 algorithm
  • Algorithm framework (see the sketch after this list)
  • Start from the root node with all data
  • For each node, calculate the information gain of all possible features
  • Choose the feature with the highest information gain
  • Split the data of the node according to the feature
  • Do the above recursively for each leaf node, until
  • There is no information gain for the leaf node
  • Or there is no feature left to select
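
The following is a compact, illustrative Python sketch of the ID3 loop described above (my own sketch, not the course code; it assumes categorical features stored in dicts, a binary label, base-2 logarithms, and made-up toy data at the end).

    import math
    from collections import Counter, defaultdict

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def info_gain(rows, labels, feature):
        n = len(rows)
        groups = defaultdict(list)
        for row, y in zip(rows, labels):
            groups[row[feature]].append(y)
        return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())

    def id3(rows, labels, features):
        # stop: pure node or no feature left -> predict the majority label
        if len(set(labels)) == 1 or not features:
            return Counter(labels).most_common(1)[0][0]
        gains = {f: info_gain(rows, labels, f) for f in features}
        best = max(gains, key=gains.get)
        if gains[best] <= 0:  # no information gain for this node
            return Counter(labels).most_common(1)[0][0]
        tree = {best: {}}
        for value in set(row[best] for row in rows):
            idx = [i for i, row in enumerate(rows) if row[best] == value]
            tree[best][value] = id3([rows[i] for i in idx],
                                    [labels[i] for i in idx],
                                    [f for f in features if f != best])
        return tree

    rows = [{"Outlook": "Sunny", "Wind": "Weak"}, {"Outlook": "Sunny", "Wind": "Strong"},
            {"Outlook": "Overcast", "Wind": "Weak"}, {"Outlook": "Rain", "Wind": "Strong"}]
    labels = [-1, -1, +1, +1]
    print(id3(rows, labels, ["Outlook", "Wind"]))
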
SLIDE 21

Decision Tree Building: ID3 Algorithm

  • An example decision tree from ID3

[Figure: an example ID3 tree splitting on Outlook (Sunny / Overcast / Rain) and, in one branch, on Temperature (Hot / Mild / Cool).]

  • Each path involves each feature at most once

SLIDE 22

Decision Tree Building: ID3 Algorithm

  • An example decision tree from ID3

[Figure: a deeper example tree splitting on Outlook (Sunny / Overcast / Rain), Temperature (Hot / Mild / Cool) and Wind (Strong / Weak).]

  • How about this tree, which yields a perfect partition of the training data?

SLIDE 23

Overfitting

  • A tree model can fit any finite dataset exactly by simply growing a leaf node for each instance

[Figure: the same deep tree splitting on Outlook, Temperature and Wind, which perfectly partitions the training data.]

SLIDE 24

Decision Tree Training Objective

  • Cost function of a tree $T$ over the training data

$C(T) = \sum_{t=1}^{|T|} N_t H_t(T)$

where, for each leaf node $t$,

  • $H_t(T)$ is the empirical entropy of the leaf

$H_t(T) = -\sum_{k} \frac{N_{tk}}{N_t} \log \frac{N_{tk}}{N_t}$

  • $N_t$ is the number of instances in the leaf, and $N_{tk}$ is the number of instances of class $k$ in the leaf
  • Training objective: find a tree that minimizes the cost

$\min_{T} C(T) = \min_{T} \sum_{t=1}^{|T|} N_t H_t(T)$

SLIDE 25

Decision Tree Regularization

  • Cost function over the training data

$C(T) = \sum_{t=1}^{|T|} N_t H_t(T) + \lambda |T|$

where

  • $|T|$ is the number of leaf nodes of the tree $T$
  • $\lambda$ is the regularization hyperparameter
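
For concreteness (an added sketch, not from the slides), the regularized cost can be computed from the per-leaf class counts; the example counts are made up.

    import math

    def tree_cost(leaf_class_counts, lam=1.0):
        """C(T) = sum_t N_t * H_t(T) + lam * |T|, given one dict of class counts per leaf."""
        cost = 0.0
        for counts in leaf_class_counts:
            n_t = sum(counts.values())
            h_t = -sum((c / n_t) * math.log2(c / n_t) for c in counts.values() if c > 0)
            cost += n_t * h_t
        return cost + lam * len(leaf_class_counts)

    # two leaves: one pure, one evenly mixed
    print(tree_cost([{"yes": 4}, {"yes": 2, "no": 2}], lam=0.5))  # 0 + 4*1 + 0.5*2 = 5.0
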
SLIDE 26

Decision Tree Building: ID3 Algorithm

  • An example decision tree from ID3

[Figure: the example tree splitting on Outlook (Sunny / Overcast / Rain), Temperature (Hot / Mild / Cool) and Wind (Strong / Weak); the question is whether to perform the lowest split.]

  • Calculate the cost function difference

$C(T) = \sum_{t=1}^{|T|} N_t H_t(T) + \lambda |T|$

  • Whether to split this node? Split only if the regularized cost decreases.

SLIDE 27

Summary of ID3

  • A classic and straightforward algorithm for training decision trees
  • Works on discrete/categorical data
  • One branch for each value/category of the feature
  • The C4.5 algorithm is similar to, and more advanced than, ID3
  • It splits nodes according to the information gain ratio
  • The number of branches of a split depends on the number of distinct categorical values of the feature
  • This might lead to a very broad tree

SLIDE 28

CART Algorithm

  • Classification and Regression Tree (CART)
  • Proposed by Leo Breiman et al. in 1984
  • Binary splitting (yes or no for the splitting condition)
  • Can work on continuous/numeric features
  • Can repeatedly use the same feature (with different splitting thresholds)

[Figure: a binary CART tree: Condition 1 (Yes / No), then Condition 2 (Yes / No), with leaves Prediction 1, Prediction 2 and Prediction 3.]

SLIDE 29

CART Algorithm

  • Classification tree: outputs the predicted class
  • For example: predict whether the user likes a movie

[Figure: Age > 20? If yes, test Gender = Male? (like / dislike); if no, predict dislike.]

  • Regression tree: outputs the predicted value
  • For example: predict the user's rating of a movie

[Figure: Age > 20? If yes, test Gender = Male? (4.8 / 4.1); if no, predict 2.8.]

SLIDE 30

Regression Tree

  • Let the training dataset with continuous targets $y$ be

$D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$

  • Suppose a regression tree has divided the space into $M$ regions $R_1, R_2, \ldots, R_M$, with $c_m$ as the prediction for region $R_m$

$f(x) = \sum_{m=1}^{M} c_m I(x \in R_m)$

  • Loss function for $(x_i, y_i)$

$\frac{1}{2}(y_i - f(x_i))^2$

  • It is easy to see that the optimal prediction for region $m$ is the average target of the instances falling in it

$\hat{c}_m = \mathrm{avg}(y_i \mid x_i \in R_m)$

SLIDE 31

Regression Tree

  • How do we find the optimal splitting regions?
  • How do we find the optimal splitting conditions?
  • A condition is defined by a threshold value $s$ on variable $j$
  • It leads to two regions

$R_1(j, s) = \{x \mid x^{(j)} \le s\} \qquad R_2(j, s) = \{x \mid x^{(j)} > s\}$

  • The optimal split minimizes the total squared error

$\min_{j,s} \Big[ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \Big]$

  • Training based on the current splitting

$\hat{c}_m = \mathrm{avg}(y_i \mid x_i \in R_m)$

SLIDE 32

Regression Tree Algorithm

  • INPUT: training data $D$
  • OUTPUT: regression tree $f(x)$
  • Repeat until a stop condition is satisfied:
  • Find the optimal splitting $(j, s)$

$\min_{j,s} \Big[ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \Big]$

  • Calculate the prediction value of the new regions $R_1$, $R_2$

$\hat{c}_m = \mathrm{avg}(y_i \mid x_i \in R_m)$

  • Return the regression tree

$f(x) = \sum_{m=1}^{M} \hat{c}_m I(x \in R_m)$
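
Below is a brief illustrative implementation of this procedure (my own sketch, not the course code). It uses a brute-force search over candidate thresholds and a fixed maximum depth as the stop condition; both choices, and the toy data, are assumptions for the example.

    def best_split(X, y):
        """Return (j, s, loss) minimizing the two-region squared error."""
        best = None
        n_features = len(X[0])
        for j in range(n_features):
            for s in sorted(set(row[j] for row in X)):
                left = [yi for row, yi in zip(X, y) if row[j] <= s]
                right = [yi for row, yi in zip(X, y) if row[j] > s]
                if not left or not right:
                    continue
                c1, c2 = sum(left) / len(left), sum(right) / len(right)
                loss = sum((v - c1) ** 2 for v in left) + sum((v - c2) ** 2 for v in right)
                if best is None or loss < best[2]:
                    best = (j, s, loss)
        return best

    def fit_tree(X, y, depth=0, max_depth=2):
        """Recursively split; leaves store the region average avg(y_i | x_i in R_m)."""
        split = best_split(X, y) if depth < max_depth else None
        if split is None:
            return sum(y) / len(y)  # leaf prediction c_m
        j, s, _ = split
        left = [(row, yi) for row, yi in zip(X, y) if row[j] <= s]
        right = [(row, yi) for row, yi in zip(X, y) if row[j] > s]
        return {"j": j, "s": s,
                "left": fit_tree([r for r, _ in left], [v for _, v in left], depth + 1, max_depth),
                "right": fit_tree([r for r, _ in right], [v for _, v in right], depth + 1, max_depth)}

    def predict(tree, x):
        while isinstance(tree, dict):
            tree = tree["left"] if x[tree["j"]] <= tree["s"] else tree["right"]
        return tree

    X = [[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]
    y = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
    tree = fit_tree(X, y)
    print(predict(tree, [2.5]), predict(tree, [11.0]))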

SLIDE 33

Regression Tree Algorithm

  • How do we efficiently find the optimal splitting $(j, s)$?

$\min_{j,s} \Big[ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \Big]$

  • Sort the data in ascending order of the feature $j$ value

[Figure: twelve sorted targets $y_1, \ldots, y_{12}$ along feature $j$, with the splitting threshold $s$ placed between $y_6$ and $y_7$.]

  • For that threshold, the loss simplifies to

$\mathrm{loss} = \sum_{i=1}^{6} (y_i - c_1)^2 + \sum_{i=7}^{12} (y_i - c_2)^2$
$= \sum_{i=1}^{6} y_i^2 - \frac{1}{6}\Big(\sum_{i=1}^{6} y_i\Big)^2 + \sum_{i=7}^{12} y_i^2 - \frac{1}{6}\Big(\sum_{i=7}^{12} y_i\Big)^2$
$= -\frac{1}{6}\Big(\sum_{i=1}^{6} y_i\Big)^2 - \frac{1}{6}\Big(\sum_{i=7}^{12} y_i\Big)^2 + C$

  • where $C = \sum_{i=1}^{12} y_i^2$ does not depend on the split, so only the two region sums need to be updated online as the threshold moves

SLIDE 34

Regression Tree Algorithm

  • How do we efficiently find the optimal splitting $(j, s)$?

$\min_{j,s} \Big[ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \Big]$

  • Sort the data in ascending order of the feature $j$ value

[Figure: the twelve sorted targets again, with the splitting threshold $s$ between $y_6$ and $y_7$.]

  • Denote the loss of splitting between the 6th and 7th points as

$\mathrm{loss}_{6,7} = -\frac{1}{6}\Big(\sum_{i=1}^{6} y_i\Big)^2 - \frac{1}{6}\Big(\sum_{i=7}^{12} y_i\Big)^2 + C$

SLIDE 35

Regression Tree Algorithm

  • How do we efficiently find the optimal splitting $(j, s)$?

$\min_{j,s} \Big[ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \Big]$

  • Sort the data in ascending order of the feature $j$ value

[Figure: the twelve sorted targets, with the splitting threshold $s$ now moved between $y_7$ and $y_8$.]

$\mathrm{loss}_{6,7} = -\frac{1}{6}\Big(\sum_{i=1}^{6} y_i\Big)^2 - \frac{1}{6}\Big(\sum_{i=7}^{12} y_i\Big)^2 + C \qquad \mathrm{loss}_{7,8} = -\frac{1}{7}\Big(\sum_{i=1}^{7} y_i\Big)^2 - \frac{1}{5}\Big(\sum_{i=8}^{12} y_i\Big)^2 + C$

  • Maintain the two region sums and update them online in $O(1)$ time as the threshold moves

$\mathrm{Sum}(R_1) = \sum_{i=1}^{k} y_i \qquad \mathrm{Sum}(R_2) = \sum_{i=k+1}^{n} y_i$

  • $O(n)$ in total for checking one feature
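
A short illustrative sketch of this O(n) scan (added here, not from the slides): after sorting by the feature, two running sums give every candidate threshold's loss up to the constant C. Placing the threshold at the midpoint between neighboring values is an implementation choice of this sketch, and the toy data is made up.

    def best_threshold(values, targets):
        """Scan all splits of one feature in O(n) after an O(n log n) sort.

        Returns (threshold, loss_minus_C), where the constant C = sum(y^2) is omitted."""
        order = sorted(range(len(values)), key=lambda i: values[i])
        ys = [targets[i] for i in order]
        xs = [values[i] for i in order]
        total = sum(ys)
        left_sum, best = 0.0, None
        for k in range(1, len(ys)):           # split between position k-1 and k
            left_sum += ys[k - 1]             # O(1) online update of Sum(R1)
            right_sum = total - left_sum      # Sum(R2) follows for free
            if xs[k - 1] == xs[k]:            # cannot split between equal feature values
                continue
            loss = -left_sum ** 2 / k - right_sum ** 2 / (len(ys) - k)
            if best is None or loss < best[1]:
                best = ((xs[k - 1] + xs[k]) / 2, loss)
        return best

    x = [3.0, 1.0, 2.0, 10.0, 12.0, 11.0]
    y = [0.8, 1.0, 1.2, 5.0, 4.8, 5.2]
    print(best_threshold(x, y))  # threshold 6.5, separating the two clusters
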
SLIDE 36

Classification Tree

  • The training dataset with categorical targets $y$

$D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$

  • Suppose a classification tree has divided the space into $M$ regions $R_1, R_2, \ldots, R_M$, with $c_m$ as the prediction for region $R_m$

$f(x) = \sum_{m=1}^{M} c_m I(x \in R_m)$

  • $c_m$ is obtained by counting categories, where $C_m^k$ is the number of instances in leaf $m$ with category $k$ and $C_m$ is the number of instances in leaf $m$

$P(y_k \mid x_i \in R_m) = \frac{C_m^k}{C_m}$

  • Here the leaf node prediction $c_m$ is the category distribution

$\hat{c}_m = \{P(y_k \mid x_i \in R_m)\}_{k=1 \ldots K}$

SLIDE 37

Classification Tree

  • How do we find the optimal splitting regions?
  • How do we find the optimal splitting conditions?
  • For a continuous feature $j$, the condition is defined by a threshold value $s$, yielding two regions

$R_1(j, s) = \{x \mid x^{(j)} \le s\} \qquad R_2(j, s) = \{x \mid x^{(j)} > s\}$

  • For a categorical feature $j$, select a category $a$, yielding two regions

$R_1(j, a) = \{x \mid x^{(j)} = a\} \qquad R_2(j, a) = \{x \mid x^{(j)} \ne a\}$

  • How to select? Choose the split that minimizes the Gini impurity.

SLIDE 38

Gini Impurity

  • In a classification problem
  • suppose there are $K$ classes
  • let $p_k$ be the probability that an instance belongs to class $k$
  • the Gini impurity index is

$\mathrm{Gini}(p) = \sum_{k=1}^{K} p_k(1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2$

  • Given a training dataset $D$, with $|D_k|$ the number of instances in $D$ of category $k$ and $|D|$ the total number of instances, the Gini impurity is

$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} \Big(\frac{|D_k|}{|D|}\Big)^2$
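
As a small added sketch (not from the slides), the empirical Gini impurity of a set of labels follows directly from the formula above:

    from collections import Counter

    def gini(labels):
        """Gini(D) = 1 - sum_k (|D_k| / |D|)^2."""
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    print(gini(["yes", "yes", "no", "no"]))    # 0.5, the maximum for two classes
    print(gini(["yes", "yes", "yes", "yes"]))  # 0.0, a pure node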

SLIDE 39

Gini Impurity

  • For a binary classification problem
  • let $p$ be the probability that an instance belongs to class 1
  • the Gini impurity is

$\mathrm{Gini}(p) = 2p(1 - p)$

  • the entropy is

$H(p) = -p \log p - (1 - p) \log(1 - p)$

  • Gini impurity and entropy are quite similar in representing the classification error rate.

SLIDE 40

Gini Impurity

  • Consider a categorical feature $j$ and one of its categories $a$
  • The two split regions $R_1$, $R_2$ are

$R_1(j, a) = \{x \mid x^{(j)} = a\} \qquad R_2(j, a) = \{x \mid x^{(j)} \ne a\}$

  • The Gini impurity of feature $j$ with the selected category $a$ is the size-weighted Gini impurity of the two resulting subsets

$\mathrm{Gini}(D_j; j = a) = \frac{|D_j^1|}{|D_j|}\mathrm{Gini}(D_j^1) + \frac{|D_j^2|}{|D_j|}\mathrm{Gini}(D_j^2)$

$D_j^1 = \{(x, y) \mid x^{(j)} = a\} \qquad D_j^2 = \{(x, y) \mid x^{(j)} \ne a\}$
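
Continuing the earlier sketches (added, not from the slides), the split quality for a categorical feature and a chosen category a is the weighted Gini of the two subsets; the weather-style toy data below is made up.

    from collections import Counter

    def gini(labels):
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def split_gini(feature_values, labels, a):
        """Gini(D_j; j = a): weighted Gini impurity of the split x_j == a vs x_j != a."""
        left = [y for v, y in zip(feature_values, labels) if v == a]
        right = [y for v, y in zip(feature_values, labels) if v != a]
        n = len(labels)
        return len(left) / n * gini(left) + len(right) / n * gini(right)

    outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain"]
    play = ["no", "no", "yes", "yes", "no"]
    # CART would evaluate every (feature, category) pair and keep the smallest value
    best_a = min(set(outlook), key=lambda a: split_gini(outlook, play, a))
    print(best_a, split_gini(outlook, play, best_a))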

SLIDE 41

Classification Tree Algorithm

  • INPUT: training data $D$
  • OUTPUT: classification tree $f(x)$
  • Repeat until a stop condition is satisfied:
  • Find the optimal splitting $(j, a)$

$\min_{j,a} \mathrm{Gini}(D_j; j = a)$

  • Calculate the prediction distribution of the new regions $R_1$, $R_2$

$\hat{c}_m = \{P(y_k \mid x_i \in R_m)\}_{k=1 \ldots K}$

  • Return the classification tree

$f(x) = \sum_{m=1}^{M} \hat{c}_m I(x \in R_m)$

  • Typical stop conditions: (1) the node instance number is small; (2) the Gini impurity is small; (3) there are no more features to split on.

SLIDE 42

Classification Tree Output

  • Class label output
  • Output the class with the highest conditional probability

$f(x) = \arg\max_{y_k} \sum_{m=1}^{M} I(x \in R_m) P(y_k \mid x_i \in R_m)$

  • Probabilistic distribution output

$f(x) = \sum_{m=1}^{M} \hat{c}_m I(x \in R_m) \qquad \hat{c}_m = \{P(y_k \mid x_i \in R_m)\}_{k=1 \ldots K}$

SLIDE 43

Converting a Tree to Rules

[Figure: a regression tree predicting a user's rating of a movie: Age > 20? If yes, test Gender = Male? (4.8 / 4.1); if no, predict 2.8.]

The same tree written as rules:

    IF Age > 20:
        IF Gender == Male:
            return 4.8
        ELSE:
            return 4.1
    ELSE:
        return 2.8

Decision tree models are easy to visualize, explain and debug.
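
As a practical aside (not part of the slides), scikit-learn can produce such a rule listing automatically via sklearn.tree.export_text; a minimal sketch with made-up data that mimics the example tree:

    from sklearn.tree import DecisionTreeRegressor, export_text

    # toy data: [age, is_male]; targets are made-up movie ratings
    X = [[25, 1], [30, 1], [22, 0], [35, 0], [15, 0], [18, 1]]
    y = [4.8, 4.8, 4.1, 4.1, 2.8, 2.8]

    tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
    print(export_text(tree, feature_names=["age", "is_male"]))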

SLIDE 44

Learning Model Comparison

[Table 10.3 from Hastie et al., The Elements of Statistical Learning, 2nd Edition]