Lecture 8: Regression Trees. Instructor: Saravanan Thirumuruganathan.



SLIDE 1

Lecture 8: Regression Trees

Instructor: Saravanan Thirumuruganathan

CSE 5334 Saravanan Thirumuruganathan

SLIDE 2

Outline

1. Regression
2. Linear Regression
3. Regression Trees

SLIDE 3

Regression and Linear Regression

SLIDE 4

Supervised Learning

Dataset:

Training (labeled) data: D = {(x_i, y_i)}, x_i ∈ R^d
Test (unlabeled) data: x_0 ∈ R^d

Tasks:

Classification: y_i ∈ {1, 2, ..., C}
Regression: y_i ∈ R

Objective: given x_0, predict y_0. This is supervised learning because the labels y_i were provided during training.
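The setup above can be sketched in a few lines of Python. The datasets and the type-based check below are illustrative assumptions, not from the slides; they only show how the two tasks differ in the label space.

```python
# A minimal sketch of the supervised-learning setup (illustrative data).
# Training data D = {(x_i, y_i)}: feature vectors x_i paired with labels y_i.

# Regression: labels are real numbers (e.g. a price).
D_regression = [([1.0, 2.0], 3.5), ([2.0, 0.5], 1.2)]

# Classification: labels come from a finite set {1, ..., C}.
D_classification = [([1.0, 2.0], 1), ([2.0, 0.5], 3)]

def is_regression(dataset):
    """Heuristic check: real-valued labels indicate a regression task."""
    return all(isinstance(y, float) for _, y in dataset)

print(is_regression(D_regression))      # True
print(is_regression(D_classification))  # False
```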

SLIDE 5

Regression

Predict the cost of a house from its details
Predict job salary from a job description
Predict SAT or GRE scores
Predict the future price of petrol from past prices
Predict the future GDP of a country, or the valuation of a company

SLIDE 6

Linear Regression: One-dimensional Case

SLIDE 7

Linear Regression: One-dimensional Case

SLIDE 8

Linear Regression: One-dimensional Case

SLIDE 9

Linear Regression: Poverty vs HS Graduation Rate

SLIDE 10

Linear Regression: Poverty vs HS Graduation Rate

SLIDE 11

Residuals
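The slide's figure is not reproduced here. As a minimal numeric illustration: a residual e_i = y_i − ŷ_i is the vertical distance between an observed value and the fitted line's prediction. The line and data points below are hypothetical numbers chosen for the example.

```python
# Residual of observation i: e_i = y_i - y_hat_i, the vertical gap
# between the observed value and the fitted line's prediction.
# Hypothetical fitted line y_hat = 2 + 0.5 * x (illustrative numbers).

def predict(x, b0=2.0, b1=0.5):
    return b0 + b1 * x

xs = [1.0, 2.0, 4.0]
ys = [2.7, 2.8, 4.2]

residuals = [y - predict(x) for x, y in zip(xs, ys)]
print(residuals)  # roughly [0.2, -0.2, 0.2]
```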

SLIDE 12

Residuals

SLIDE 13

A measure for the best line

SLIDE 14

Least Squares Line
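The least-squares line for one predictor has the standard closed form: slope b1 = Sxy / Sxx and intercept b0 = ȳ − b1·x̄, which together minimize the sum of squared residuals. A small sketch (the data below is made up for illustration):

```python
# Closed-form least-squares line for one predictor:
#   b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2),  b0 = ybar - b1 * xbar

def least_squares_line(xs, ys):
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

b0, b1 = least_squares_line([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(b0, b1)  # the line y = 2x fits exactly: 0.0 2.0
```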

SLIDE 15

Prediction

SLIDE 16

Linear Regression in Higher Dimensions
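In d dimensions the least-squares weights solve the normal equations (XᵀX)w = Xᵀy. A minimal sketch for d = 2 without an intercept, solved by Cramer's rule; the data is an illustrative assumption generated from y = 2·x1 + 3·x2:

```python
# Normal equations (X^T X) w = X^T y for two features, solved with
# Cramer's rule. Illustrative data generated from y = 2*x1 + 3*x2.

X = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
y = [2.0, 3.0, 5.0]

# Build the 2x2 system A w = b, where A = X^T X and b = X^T y.
a11 = sum(x1 * x1 for x1, _ in X)
a12 = sum(x1 * x2 for x1, x2 in X)
a22 = sum(x2 * x2 for _, x2 in X)
b1 = sum(x1 * yi for (x1, _), yi in zip(X, y))
b2 = sum(x2 * yi for (_, x2), yi in zip(X, y))

det = a11 * a22 - a12 * a12
w1 = (b1 * a22 - a12 * b2) / det
w2 = (a11 * b2 - b1 * a12) / det
print(w1, w2)  # recovers the generating weights: 2.0 3.0
```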

SLIDE 17

Linear Regression in Higher Dimensions

SLIDE 18

Linear Regression in Higher Dimensions

SLIDE 19

Linear Regression: Objective Function

SLIDE 20

Linear Regression: Gradient Descent based Solution
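Gradient descent minimizes the objective J(b0, b1) = Σ (y_i − b0 − b1·x_i)² by repeatedly stepping against the gradient. A minimal sketch; the step size, iteration count, and data are illustrative choices, not values from the slides:

```python
# Gradient descent for 1-D linear regression on
#   J(b0, b1) = sum_i (y_i - b0 - b1*x_i)^2

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # generated exactly from y = 1 + 2x

b0, b1 = 0.0, 0.0
alpha = 0.02                 # learning rate (arbitrary small choice)
for _ in range(5000):
    # Partial derivatives of J with respect to b0 and b1.
    g0 = sum(-2 * (y - b0 - b1 * x) for x, y in zip(xs, ys))
    g1 = sum(-2 * x * (y - b0 - b1 * x) for x, y in zip(xs, ys))
    b0 -= alpha * g0
    b1 -= alpha * g1

print(round(b0, 3), round(b1, 3))  # converges toward b0 = 1, b1 = 2
```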

SLIDE 21

Regression Trees

SLIDE 22

Predicting Baseball salary data

Salary is color-coded from low (blue, green) to high (yellow, red).

SLIDE 23

Decision tree for Baseball Salary Prediction

SLIDE 24

Decision tree for Baseball Salary Prediction

SLIDE 25

Interpreting the Decision Tree

SLIDE 26

Interpreting the Decision Tree

Years is the most important factor in determining Salary: players with less experience earn lower salaries than more experienced players. Given that a player is less experienced, the number of Hits he made in the previous year seems to play little role in his Salary. But among players who have been in the major leagues for five or more years, the number of Hits made in the previous year does affect Salary: players who made more Hits last year tend to have higher salaries. This is surely an over-simplification, but compared to a regression model, the tree is easy to display, interpret, and explain.
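A fitted regression tree can be read directly as nested if/else rules. The thresholds and leaf values below are hypothetical stand-ins (the slide's actual numbers are in its figure, which is not reproduced here); they only show the two-split structure the text describes.

```python
# Hypothetical two-split regression tree for baseball (log) salary.
# All thresholds and leaf values are illustrative, not from the slides.

def predict_salary(years, hits):
    if years < 4.5:            # less experienced players: one flat prediction
        return 5.11
    if hits < 117.5:           # experienced, but fewer hits last year
        return 6.00
    return 6.74                # experienced, many hits last year

print(predict_salary(2, 150))   # 5.11
print(predict_salary(7, 150))   # 6.74
```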

SLIDE 27

High Level Idea

Classification Tree: quality of a split is measured by a general “impurity measure”
Regression Tree: quality of a split is measured by “squared error”

SLIDE 28

High Level Idea

We divide the feature space into J distinct and non-overlapping regions R_1, R_2, ..., R_J. For every observation that falls into region R_i, we make the same prediction: the mean of the response values of the training observations in R_i. Objective: find boxes R_1, R_2, ..., R_J that minimize the Residual Sum of Squares (RSS),

RSS = Σ_{i=1}^{J} Σ_{j ∈ R_i} (y_j − ȳ_{R_i})²

where ȳ_{R_i} is the mean response of the training observations in the i-th box.
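The RSS of a partition can be computed directly from this definition: within each region, predict the mean response and sum the squared deviations from it. A small sketch with illustrative regions:

```python
# RSS for a set of regions: predict the region mean, then sum
# squared deviations of each response from that mean.

def rss(regions):
    total = 0.0
    for ys in regions:                    # ys: responses falling in one region
        mean = sum(ys) / len(ys)          # the prediction for the whole region
        total += sum((y - mean) ** 2 for y in ys)
    return total

regions = [[1.0, 1.0, 1.0], [4.0, 6.0]]  # one pure region, one spread region
print(rss(regions))  # 0 + ((4-5)^2 + (6-5)^2) = 2.0
```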

SLIDE 29

Building Regression Trees

We first select the feature Xi and the cutpoint s such that splitting the feature space into the regions {X|Xi < s} and {X|Xi ≥ s} leads to the greatest possible reduction in RSS. Next, we repeat the process, looking for the best attribute and best cutpoint in order to split the data further so as to minimize the RSS within each of the resulting regions. The process continues until a stopping criterion is reached; for instance, we may continue until no region contains more than five observations.
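The first step above, finding the cutpoint s that most reduces RSS, can be sketched for a single feature by trying each midpoint between consecutive sorted values. The data is an illustrative assumption:

```python
# Greedy search for the best first split on one feature: try each
# midpoint between consecutive sorted x-values as cutpoint s and keep
# the s minimizing the combined RSS of the two resulting regions.

def region_rss(ys):
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_split(xs, ys):
    pairs = sorted(zip(xs, ys))
    best_s, best_rss = None, float("inf")
    for i in range(1, len(pairs)):
        s = (pairs[i - 1][0] + pairs[i][0]) / 2          # candidate cutpoint
        left = [y for x, y in pairs if x < s]
        right = [y for x, y in pairs if x >= s]
        total = region_rss(left) + region_rss(right)
        if total < best_rss:
            best_s, best_rss = s, total
    return best_s, best_rss

# The responses jump between x = 2 and x = 3, so the best cutpoint is 2.5.
s, r = best_split([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 9.0, 9.0])
print(s, r)  # 2.5 0.0
```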

SLIDE 30

Summary

Major Concepts:

Geometric interpretation of classification decision trees

SLIDE 31

Slide Material References

Slides from the ISLR book
Slides by Piyush Rai
Slides from the OpenIntro Statistics book (http://www.webpages.uidaho.edu/~stevel/251/slides/os2_slides_07.pdf)
See also the footnotes
