
Tree Models - Weinan Zhang, Shanghai Jiao Tong University



  1. 2019 CS420, Machine Learning, Lecture 5
Tree Models
Weinan Zhang, Shanghai Jiao Tong University
http://wnzhang.net
http://wnzhang.net/teaching/cs420/index.html

  2. ML Task: Function Approximation
• Problem setting
  • Instance feature space $X$
  • Instance label space $Y$
  • Unknown underlying function (target) $f : X \mapsto Y$
  • Set of function hypotheses $H = \{ h \mid h : X \mapsto Y \}$
• Input: training data generated from the unknown $f$: $\{ (x^{(i)}, y^{(i)}) \} = \{ (x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)}) \}$
• Output: a hypothesis $h \in H$ that best approximates $f$
• Optimize in functional space, not just parameter space

  3. Optimize in Functional Space
• Tree models
  • Intermediate node for splitting data
  • Leaf node for label prediction
• Continuous data example (a code sketch follows): the root node tests $x_1 < a_1$; its left child tests $x_2 < a_2$ and its right child tests $x_2 < a_3$; the four leaves predict $y = -1$, $y = 1$, $y = 1$, $y = -1$, splitting the $(x_1, x_2)$ feature space into regions labeled Class 1 and Class 2.
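
A minimal sketch of this example tree as a plain function (the default threshold values for $a_1$, $a_2$, $a_3$ are made-up placeholders; the slide only names them symbolically):

```python
def predict_continuous(x1, x2, a1=0.5, a2=0.3, a3=0.7):
    """Evaluate the example tree: the root tests x1, both children test x2."""
    if x1 < a1:                        # root node: x1 < a1 ?
        return -1 if x2 < a2 else 1    # left child: x2 < a2 ?  leaves: -1 / +1
    else:
        return 1 if x2 < a3 else -1    # right child: x2 < a3 ? leaves: +1 / -1

print(predict_continuous(0.2, 0.9))    # x1 < a1 and x2 >= a2, so the prediction is 1
```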

  4. Optimize in Functional Space
• Tree models
  • Intermediate node for splitting data
  • Leaf node for label prediction
• Discrete/categorical data example (a code sketch follows): the root node tests Outlook; Sunny leads to a Humidity test (High → $y = -1$, Normal → $y = 1$), Overcast predicts $y = 1$ directly, and Rain leads to a Wind test (Strong → $y = -1$, Weak → $y = 1$).
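
The categorical tree can be written the same way (a sketch; the attribute and value names follow the figure on the slide):

```python
def predict_categorical(outlook, humidity, wind):
    """Evaluate the categorical example tree: Outlook at the root, then Humidity or Wind."""
    if outlook == "Sunny":
        return -1 if humidity == "High" else 1   # Humidity: High -> -1, Normal -> +1
    elif outlook == "Overcast":
        return 1                                 # Overcast is a leaf predicting +1
    else:                                        # Rain
        return -1 if wind == "Strong" else 1     # Wind: Strong -> -1, Weak -> +1

print(predict_categorical("Rain", "Normal", "Weak"))   # prints 1
```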

  5. Decision Tree Learning
• Problem setting
  • Instance feature space $X$
  • Instance label space $Y$
  • Unknown underlying function (target) $f : X \mapsto Y$
  • Set of function hypotheses $H = \{ h \mid h : X \mapsto Y \}$
• Input: training data generated from the unknown $f$: $\{ (x^{(i)}, y^{(i)}) \} = \{ (x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)}) \}$
• Output: a hypothesis $h \in H$ that best approximates $f$
• Here each hypothesis $h$ is a decision tree

  6. Decision Tree – Decision Boundary
• Decision trees divide the feature space into axis-parallel (hyper-)rectangles
• Each rectangular region is labeled with one label, or with a probability distribution over labels
Slide credit: Eric Eaton

  7. History of Decision-Tree Research
• Hunt and colleagues used exhaustive-search decision-tree methods (CLS) to model human concept learning in the 1960s.
• In the late 70s, Quinlan developed ID3 with the information gain heuristic to learn expert systems from examples.
• Simultaneously, Breiman, Friedman, and colleagues developed CART (Classification and Regression Trees), similar to ID3.
• In the 1980s a variety of improvements were introduced to handle noise, continuous features, and missing features, along with improved splitting criteria. Various expert-system development tools resulted.
• Quinlan's updated decision-tree package (C4.5) was released in 1993.
• Decision-tree learners are now standard in libraries such as scikit-learn (Python) and Weka (Java), which include implementations in the ID3/C4.5/CART family.
Slide credit: Raymond J. Mooney

  8. Decision Trees
• Tree models
  • Intermediate node for splitting data
  • Leaf node for label prediction
• Key questions for decision trees
  • How to select node splitting conditions?
  • How to make predictions?
  • How to decide the tree structure?

  9. Node Splitting
• Which node splitting condition should we choose? For example, splitting on Outlook (Sunny / Overcast / Rain) versus splitting on Temperature (Hot / Mild / Cool).
• Choose the feature with higher classification capacity
• Quantitatively, the feature with higher information gain

  10. Fundamentals of Information Theory
• Entropy (more specifically, Shannon entropy) is the expected value (average) of the information contained in each message.
• Suppose $X$ is a random variable with $n$ discrete values, $P(X = x_i) = p_i$; then its entropy $H(X)$ is
$$H(X) = -\sum_{i=1}^{n} p_i \log p_i$$
• It is easy to verify that
$$H(X) = -\sum_{i=1}^{n} p_i \log p_i \le -\sum_{i=1}^{n} p_i \log \frac{1}{n} = \log n$$
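
A small sketch of this computation (log base 2 is assumed here so entropy is measured in bits; the slide leaves the base unspecified):

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy of a discrete distribution p = (p_1, ..., p_n)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                          # treat 0 * log 0 as 0
    return -np.sum(p * np.log(p)) / np.log(base)

print(entropy([0.5, 0.5]))                # 1.0 bit
print(entropy([0.9, 0.1]))                # ~0.47 bits: less uncertainty
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits = log2(4), the maximum for n = 4
```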

  11. Illustration of Entropy
• Entropy of a binary distribution:
$$H(X) = -p_1 \log p_1 - (1 - p_1) \log (1 - p_1)$$
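
A quick numeric sketch of this curve (again assuming log base 2):

```python
import numpy as np

def binary_entropy(p1):
    """H(X) in bits for a binary variable with P(X = 1) = p1."""
    if p1 in (0.0, 1.0):
        return 0.0
    return -(p1 * np.log2(p1) + (1 - p1) * np.log2(1 - p1))

for p1 in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(p1, round(binary_entropy(p1), 3))  # symmetric in p1, maximal (1.0) at p1 = 0.5
```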

  12. Cross Entropy
• Cross entropy is used to measure the difference between two random variable distributions:
$$H(X, Y) = -\sum_{i=1}^{n} P(X = i) \log P(Y = i)$$
• Continuous formulation:
$$H(p, q) = -\int p(x) \log q(x) \, dx$$
• Compared to KL divergence:
$$D_{\mathrm{KL}}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx = H(p, q) - H(p)$$
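
A discrete sketch of both quantities (natural log here; the choice of base only rescales the values):

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log q_i for discrete distributions p and q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i) = H(p, q) - H(p)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p, q = [0.5, 0.5], [0.9, 0.1]
print(cross_entropy(p, q))   # ~1.204 nats, larger than H(p) = 0.693 nats
print(kl_divergence(p, q))   # ~0.511 nats; always >= 0, and 0 only when p == q
```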

  13. KL-Divergence
• Kullback–Leibler divergence (also called relative entropy) is a measure of how one probability distribution diverges from a second, expected probability distribution:
$$D_{\mathrm{KL}}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx = H(p, q) - H(p)$$

  14. Review: Cross Entropy in Logistic Regression
• Logistic regression is a binary classification model:
$$p_\theta(y = 1 \mid x) = \sigma(\theta^\top x) = \frac{1}{1 + e^{-\theta^\top x}}, \qquad p_\theta(y = 0 \mid x) = \frac{e^{-\theta^\top x}}{1 + e^{-\theta^\top x}}$$
• Cross entropy loss function:
$$L(y, x; p_\theta) = -y \log \sigma(\theta^\top x) - (1 - y) \log (1 - \sigma(\theta^\top x))$$
• Gradient, with $z = \theta^\top x$ and $\frac{\partial \sigma(z)}{\partial z} = \sigma(z)(1 - \sigma(z))$:
$$\frac{\partial L(y, x; p_\theta)}{\partial \theta} = -y \frac{1}{\sigma(\theta^\top x)} \sigma(z)(1 - \sigma(z))\, x - (1 - y) \frac{-1}{1 - \sigma(\theta^\top x)} \sigma(z)(1 - \sigma(z))\, x = (\sigma(\theta^\top x) - y)\, x$$
• Update rule: $\theta \leftarrow \theta + (y - \sigma(\theta^\top x))\, x$
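
A minimal sketch of this gradient update (the learning rate lr and the toy data point are assumptions added for illustration; the slide writes the update with step size 1):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(theta, x, y, lr=0.1):
    """One stochastic gradient step on the cross entropy loss.

    The gradient is (sigmoid(theta^T x) - y) * x, so theta moves by
    lr * (y - sigmoid(theta^T x)) * x.
    """
    return theta + lr * (y - sigmoid(theta @ x)) * x

theta = np.zeros(3)
x, y = np.array([1.0, 0.5, -1.2]), 1   # made-up example with label y = 1
theta = sgd_step(theta, x, y)
print(theta)                            # theta has moved toward classifying x as positive
```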

  15. Conditional Entropy
• Entropy:
$$H(X) = -\sum_{i=1}^{n} P(X = i) \log P(X = i)$$
• Specific conditional entropy of $X$ given $Y = v$:
$$H(X \mid Y = v) = -\sum_{i=1}^{n} P(X = i \mid Y = v) \log P(X = i \mid Y = v)$$
• Conditional entropy of $X$ given $Y$:
$$H(X \mid Y) = \sum_{v \in \text{values}(Y)} P(Y = v)\, H(X \mid Y = v)$$
• Information gain of $X$ given $Y$:
$$I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y)$$
where $H(X, Y)$ is the joint entropy of $(X, Y)$, not the cross entropy.
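
A sketch of these quantities computed from a joint count table (the 2x2 table is a made-up example; rows index values of X, columns index values of Y, and log base 2 is assumed):

```python
import numpy as np

def entropy_bits(counts):
    """Entropy (bits) of the distribution given by a vector of counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def information_gain(joint):
    """I(X; Y) = H(X) - H(X | Y) for a joint count table (rows = X, columns = Y)."""
    joint = np.asarray(joint, dtype=float)
    n = joint.sum()
    h_x = entropy_bits(joint.sum(axis=1))                 # H(X) from the row marginal
    h_x_given_y = sum((joint[:, v].sum() / n) * entropy_bits(joint[:, v])
                      for v in range(joint.shape[1]))     # sum_v P(Y=v) * H(X | Y=v)
    return h_x - h_x_given_y

print(information_gain([[8, 2],
                        [2, 8]]))   # ~0.278 bits
```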

  16. Information Gain
• Information gain of $X$ given $Y$:
$$\begin{aligned}
I(X; Y) &= H(X) - H(X \mid Y) \\
&= -\sum_{v} P(X = v) \log P(X = v) + \sum_{u} P(Y = u) \sum_{v} P(X = v \mid Y = u) \log P(X = v \mid Y = u) \\
&= -\sum_{v} P(X = v) \log P(X = v) + \sum_{u} \sum_{v} P(X = v, Y = u) \log P(X = v \mid Y = u) \\
&= -\sum_{v} P(X = v) \log P(X = v) + \sum_{u} \sum_{v} P(X = v, Y = u) \left[ \log P(X = v, Y = u) - \log P(Y = u) \right] \\
&= -\sum_{v} P(X = v) \log P(X = v) - \sum_{u} P(Y = u) \log P(Y = u) + \sum_{u, v} P(X = v, Y = u) \log P(X = v, Y = u) \\
&= H(X) + H(Y) - H(X, Y)
\end{aligned}$$
where the joint entropy of $(X, Y)$ (not the cross entropy) is
$$H(X, Y) = -\sum_{u, v} P(X = v, Y = u) \log P(X = v, Y = u)$$
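
To check that the last line agrees with the definition $I(X; Y) = H(X) - H(X \mid Y)$, a quick numeric sketch on a made-up joint distribution:

```python
import numpy as np

def H(p):
    """Entropy (bits) of a probability vector or flattened joint table."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])               # hypothetical P(X = v, Y = u)
px, py = joint.sum(axis=1), joint.sum(axis=0)

h_x_given_y = sum(py[u] * H(joint[:, u] / py[u]) for u in range(len(py)))
print(H(px) - h_x_given_y)                   # I(X;Y) via H(X) - H(X|Y)        -> ~0.278
print(H(px) + H(py) - H(joint))              # I(X;Y) via H(X) + H(Y) - H(X,Y) -> same value
```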
