CS446 Introduction to Machine Learning (Fall 2013) University of Illinois at Urbana-Champaign
http://courses.engr.illinois.edu/cs446
Prof. Julia Hockenmaier
juliahmr@illinois.edu
LECTURE 3: DECISION TREES
CS446 Machine Learning
Announcements:
– Office hours start this week. Arun’s office hours are now Tuesdays 10 am – 12 noon in 4407.
– HW0 (ungraded) is on the class website (http://courses.engr.illinois.edu/cs446/syllabus.html)
Supervised Learning:
– What is our instance space? What features do we use to represent instances?
– What is our label space? Classification: discrete labels
– What is our hypothesis space?
– What learning algorithm do we use?
– Decision trees for (binary) classification: non-linear classifiers
– Learning decision trees (ID3 algorithm): greedy heuristic (based on information gain); originally developed for discrete features
– Overfitting
Data (features: Drink?, Milk?; class: Sugar?):

 #   Drink   Milk?   Sugar?
 1   Coffee  No      Yes
 2   Coffee  Yes     No
 3   Tea     Yes     Yes
 4   Tea     No      No
Decision tree:

Drink?
├─ Coffee → Milk?
│   ├─ Yes → No Sugar
│   └─ No  → Sugar
└─ Tea → Milk?
    ├─ Yes → Sugar
    └─ No  → No Sugar
if Drink = Coffee:
    if Milk = Yes: Sugar := No
    else if Milk = No: Sugar := Yes
else if Drink = Tea:
    if Milk = Yes: Sugar := Yes
    else if Milk = No: Sugar := No

switch (Drink):
    case Coffee:
        switch (Milk):
            case Yes: Sugar := No
            case No:  Sugar := Yes
    case Tea:
        switch (Milk):
            case Yes: Sugar := Yes
            case No:  Sugar := No
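The tree can also be written as a plain function that reproduces the four training rows; a minimal Python sketch (the function name is my own):

```python
def predict_sugar(drink: str, milk: str) -> str:
    """Decision tree for the toy data: test Drink at the root, then Milk."""
    if drink == "Coffee":
        return "No" if milk == "Yes" else "Yes"
    elif drink == "Tea":
        return "Yes" if milk == "Yes" else "No"
    raise ValueError(f"unexpected Drink value: {drink}")
```

Each path from the root to a leaf corresponds to one branch of the nested conditionals.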
Non-leaf nodes test the value of one feature:
– Tests: yes/no questions; switch statements
– Each child = a different value of that feature
Leaf nodes assign a class label.
Hypothesis spaces for binary classification:
Each hypothesis h ∈ H assigns true to one subset of the instance space X.
Decision trees do not restrict H: there is a decision tree for every hypothesis, because any subset of X can be identified via yes/no questions.
[Figure: the target hypothesis over two binary features x1, x2 — the labels y=0/y=1 form a checkerboard (XOR-like) pattern]

The target hypothesis is equivalent to a tree that splits on Milk first:

Milk?
├─ Yes → Drink? (Coffee → No Sugar, Tea → Sugar)
└─ No  → Drink? (Coffee → Sugar, Tea → No Sugar)
[Figure: all 16 Boolean functions of two binary features x1, x2, each shown as a 2×2 truth table — every one of them can be represented by a decision tree]
We want the smallest tree that is consistent with the training data
(i.e. that assigns the correct labels to training items)
But we can’t enumerate all possible trees.
|H| is exponential in the number of features
We use a heuristic: greedy top-down search
This is guaranteed to find a consistent tree, and is biased towards finding smaller trees
Each node is associated with a subset of the training data:
– The root has all items in the training data
– Add new levels to the tree until each leaf has only items with the same class label
[Figure: the complete training data (a mix of + and − items) at the root is split until each leaf node contains only items of a single class]
The node N is associated with a subset S of the training data:
– If all items in S have the same class label, N is a leaf node
– Else, pick a feature F with values VF = {v1, …, vK} and split on it: for each vk ∈ VF, add a new child Ck to N; Ck is associated with Sk, the subset of items in S where F takes the value vk
We add children to a parent node in order to be more certain about which class label to assign to the examples at the child nodes.
Reducing uncertainty = reducing entropy: we want to reduce the entropy of the label distribution P(Y).
The class label Y is a binary random variable:
– It takes on value 1 with probability p
– It takes on value 0 with probability 1 − p
The entropy of Y, H(Y), is defined as:
H(Y) = −p log2 p − (1 − p) log2(1 − p)
The class label Y is a discrete random variable:
– It can take on K different values
– It takes on value k with probability P(Y = k) = pk
The entropy of Y, H(Y), is defined as:
H(Y) = − Σ_{k=1}^{K} pk log2 pk
P(Y=a) = 0.5, P(Y=b) = 0.25, P(Y=c) = 0.25
H(Y) = −0.5 log2(0.5) − 0.25 log2(0.25) − 0.25 log2(0.25)
     = −0.5·(−1) − 0.25·(−2) − 0.25·(−2)
     = 0.5 + 0.5 + 0.5 = 1.5
Entropy of Y = the average number of bits required to specify Y.
Bit encoding for Y: a = 1, b = 01, c = 00
P(Y=a) = 0.5, P(Y=b) = 0.25, P(Y=c) = 0.25 → H(Y) = 1.5
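The definition and the worked example can be checked numerically; a minimal sketch (the `entropy` helper is my own):

```python
import math

def entropy(probs):
    """H(Y) = -sum_k p_k * log2(p_k), treating 0 * log2(0) as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# The three-valued example above: a = 0.5, b = 0.25, c = 0.25
print(entropy([0.5, 0.25, 0.25]))  # -> 1.5
```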
Entropy as a measure of uncertainty:
– H(Y) is maximized when p = 0.5 (uniform distribution)
– H(Y) is minimized when p = 0 or p = 1
Entropy of a sample (data set) S = {(x, y)} with N = N+ + N− items.
Use the sample to estimate P(Y):
– p = N+/N, where N+ = number of positive items (Y = 1)
– n = N−/N, where N− = number of negative items (Y = 0)
This gives H(S) = −p log2 p − n log2 n.
H(S) measures the impurity of S.
At each step, we want to reduce H(Y):
– H(Y) = entropy of the distribution of class labels P(Y); we don’t care about the entropy of the features X
– Reduction in entropy = gain in information
– Define H(S) = the label entropy H(Y) for the sample S
– The parent S has entropy H(S) and size |S|
– Splitting S on feature Xi with values 1, …, K yields K children S1, …, SK, with entropies H(Sk) and sizes |Sk|
– After splitting S on Xi, the expected entropy is
  Σ_k (|Sk| / |S|) · H(Sk)
– When we split S on feature Xi, the information gain is the reduction in entropy:
  Gain(S, Xi) = H(S) − Σ_k (|Sk| / |S|) · H(Sk)
[Figure: a parent sample Sb with entropy H(Sb) is split into three children S1, S2, S3 with entropies H(S1), H(S2), H(S3)]
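Expected entropy and information gain can be computed with two short helpers; a sketch, assuming each example is a (feature-dict, label) pair (the data layout and names are my own):

```python
import math
from collections import Counter

def label_entropy(examples):
    """H(S): entropy of the class-label distribution in the sample S."""
    n = len(examples)
    counts = Counter(label for _, label in examples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(examples, feature):
    """Gain(S, Xi) = H(S) - sum_k |Sk|/|S| * H(Sk)."""
    subsets = {}
    for x, y in examples:
        subsets.setdefault(x[feature], []).append((x, y))
    expected = sum(len(sk) / len(examples) * label_entropy(sk)
                   for sk in subsets.values())
    return label_entropy(examples) - expected

# A feature that splits the sample into pure subsets gets the maximal gain:
S = [({"f": 0}, "+"), ({"f": 0}, "+"), ({"f": 1}, "-"), ({"f": 1}, "-")]
print(information_gain(S, "f"))  # -> 1.0
```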
Features:
– Outlook: {Sunny, Overcast, Rainy}
– Temperature: {Hot, Mild, Cool}
– Humidity: {High, Normal, Low}
– Wind: {Strong, Weak}
Labels:
– Binary classification task: Y = {+, -}
 #   O  T  H  W   Play?
 1   S  H  H  W    -
 2   S  H  H  S    -
 3   O  H  H  W    +
 4   R  M  H  W    +
 5   R  C  N  W    +
 6   R  C  N  S    -
 7   O  C  N  S    +
 8   S  M  H  W    -
 9   S  C  N  W    +
10   R  M  N  W    +
11   S  M  N  S    +
12   O  M  H  S    +
13   O  H  N  W    +
14   R  M  H  S    -

Outlook: S(unny), O(vercast), R(ainy); Temperature: H(ot), M(edium), C(ool); Humidity: H(igh), N(ormal), L(ow); Wind: S(trong), W(eak)
Current entropy:
p = 9/14, n = 5/14
H(Y) = −(9/14) log2(9/14) − (5/14) log2(5/14) ≈ 0.94
Splitting on Outlook:
– Outlook = sunny: p = 2/5, n = 3/5 → HS = 0.971
– Outlook = overcast: p = 4/4, n = 0 → HO = 0
– Outlook = rainy: p = 3/5, n = 2/5 → HR = 0.971
Expected entropy: (5/14)·0.971 + (4/14)·0 + (5/14)·0.971 = 0.694
Information gain: 0.940 − 0.694 = 0.246
Information gain for each feature:
– Outlook: 0.246
– Humidity: 0.151
– Wind: 0.048
– Temperature: 0.029
→ Split on Outlook
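These numbers can be reproduced from the 14 weather examples; a sketch (the data tuples are transcribed from the table above, helper names are my own, and the printed gains match the slide values up to rounding):

```python
import math
from collections import Counter

# Rows: (Outlook, Temperature, Humidity, Wind, Play?)
DATA = [
    ("S", "H", "H", "W", "-"), ("S", "H", "H", "S", "-"),
    ("O", "H", "H", "W", "+"), ("R", "M", "H", "W", "+"),
    ("R", "C", "N", "W", "+"), ("R", "C", "N", "S", "-"),
    ("O", "C", "N", "S", "+"), ("S", "M", "H", "W", "-"),
    ("S", "C", "N", "W", "+"), ("R", "M", "N", "W", "+"),
    ("S", "M", "N", "S", "+"), ("O", "M", "H", "S", "+"),
    ("O", "H", "N", "W", "+"), ("R", "M", "H", "S", "-"),
]
FEATURES = ["Outlook", "Temperature", "Humidity", "Wind"]

def H(rows):
    """Label entropy of a set of rows (the label is the last column)."""
    n = len(rows)
    counts = Counter(r[-1] for r in rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def gain(rows, i):
    """Information gain of splitting on the i-th feature column."""
    groups = {}
    for r in rows:
        groups.setdefault(r[i], []).append(r)
    return H(rows) - sum(len(g) / len(rows) * H(g) for g in groups.values())

for i, name in enumerate(FEATURES):
    print(f"{name}: {gain(DATA, i):.3f}")
```

Outlook has the highest gain, so it is chosen as the root split.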
induceDecisionTree(S):
    if all s ∈ S have the same label y: return a leaf with label y
    choose Xi = argmax_i Gain(S, Xi)
    for k in Values(Xi):
        Sk = {s ∈ S | xi = k}
        addChild(S, Sk)
        induceDecisionTree(Sk)
    return S
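A runnable version of the ID3 sketch on the toy drink data; a hedged sketch, not the course’s reference implementation (representation is my own: a leaf is a label string, an internal node a (feature, children) tuple):

```python
import math
from collections import Counter

def entropy(rows):
    n = len(rows)
    counts = Counter(y for _, y in rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def gain(rows, f):
    parts = {}
    for x, y in rows:
        parts.setdefault(x[f], []).append((x, y))
    return entropy(rows) - sum(len(p) / len(rows) * entropy(p)
                               for p in parts.values())

def id3(rows, features):
    labels = [y for _, y in rows]
    # Leaf: all labels agree, or no features left (fall back to majority).
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: gain(rows, f))  # greedy choice
    children = {}
    for x, y in rows:
        children.setdefault(x[best], []).append((x, y))
    rest = [f for f in features if f != best]
    return (best, {v: id3(sub, rest) for v, sub in children.items()})

def predict(node, x):
    while not isinstance(node, str):
        f, children = node
        node = children[x[f]]
    return node

# Toy data: predict Sugar? from Drink and Milk.
data = [({"Drink": "Coffee", "Milk": "No"},  "Yes"),
        ({"Drink": "Coffee", "Milk": "Yes"}, "No"),
        ({"Drink": "Tea",    "Milk": "Yes"}, "Yes"),
        ({"Drink": "Tea",    "Milk": "No"},  "No")]
tree = id3(data, ["Drink", "Milk"])
```

On this XOR-like data both single features have zero gain on their own, yet the greedy recursion still reaches a tree that is consistent with all four training items after two levels of splits.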
What if a value of Xi does not occur in the subset S (|Sk| = 0)?
– Training: |Sk| = 0, so the k-th value of Xi contributes 0 to Gain(S, Xi)
– Testing: if a test item that reaches S has Xi = k, assign the most common class label in S
What if an item s is missing the value of feature Xi?
Compute the probability of each value at S: P(Xi = k) = |Sk| / |S|
Two possibilities:
– Assign the most likely value of Xi to s: argmax_k P(Xi = k)
– Assign fractional counts P(Xi = k) for each value of Xi to s
The accuracy on the training data will increase as we add more levels to the tree
[Figure: accuracy as a function of tree size — accuracy on the training data keeps increasing, while accuracy on test data eventually decreases]
A decision tree overfits the training data when its accuracy on the training data goes up but its accuracy on unseen data goes down
Too much variance in the training data
– Training data is not a representative sample
– We split on features that are actually irrelevant
Too much noise in the training data
– Noise = some feature values or class labels are incorrect
– We learn to predict the noise
Various heuristics are commonly used to avoid overfitting:
– Limit the depth of the tree
– Require a minimum number of examples per node used to select a split
– Learn a complete tree and then prune it, using validation (held-out) data
Pruning = remove leaves and assign the majority label of the parent to all items.
Prune the children of a node S if:
– all children of S are leaves, and
– the accuracy on the validation set does not decrease if we assign the most frequent class label to all items at S.
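This criterion can be sketched in code; a hedged sketch with my own representation (a leaf is a label string; an internal node is a (feature, majority_label, children) tuple, where majority_label is the most frequent label among the training items that reached the node):

```python
def predict(node, x):
    """Classify x; unseen feature values fall back to the node's majority label."""
    while not isinstance(node, str):
        f, majority, children = node
        node = children.get(x[f], majority)
    return node

def prune(node, val_rows):
    """Reduced-error pruning: if all children of a node are leaves, replace the
    node by its majority label unless that lowers accuracy on the validation
    items that reach it."""
    if isinstance(node, str):
        return node
    f, majority, children = node
    # Route validation items to the matching child, then prune bottom-up.
    children = {v: prune(c, [(x, y) for x, y in val_rows if x.get(f) == v])
                for v, c in children.items()}
    if all(isinstance(c, str) for c in children.values()):
        kept = sum(predict((f, majority, children), x) == y for x, y in val_rows)
        collapsed = sum(majority == y for _, y in val_rows)
        if collapsed >= kept:
            return majority
    return (f, majority, children)

# A one-split tree whose split does not help on the validation data is collapsed:
tree = ("Milk", "+", {"Yes": "+", "No": "-"})
val = [({"Milk": "Yes"}, "+"), ({"Milk": "No"}, "+")]
print(prune(tree, val))  # -> +
```

When the validation labels do support the split, the subtree is kept unchanged.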
– Decision trees for (binary) classification: non-linear classifiers
– Learning decision trees (ID3 algorithm): greedy heuristic (based on information gain); originally developed for discrete features
– Overfitting: what is it? How do we deal with it?