SLIDE 1

BUS41100 Applied Regression Analysis

Week 8: Classification & Model Building

Classification for Binary Outcomes, Variable Selection, BIC, AIC, LASSO

Max H. Farrell
The University of Chicago Booth School of Business

SLIDE 2

Classification

A common goal with logistic regression is to classify the inputs depending on their predicted response probabilities.

Example: evaluating the credit quality of (potential) debtors.
◮ Take a list of borrower characteristics.
◮ Build a prediction rule for their credit.
◮ Use this rule to automatically evaluate applicants (and track your risk profile).

You can do all this with logistic regression, and then use the predicted probabilities to build a classification rule.
◮ A simple classification rule would be that anyone with P̂(good | x) > 0.5 can get a loan, and the rest cannot.

—————— (Classification is a huge field, we’re only scratching the surface here.)

SLIDE 3

We have data on 1000 loan applicants at German community banks, and judgment of the loan outcomes (good or bad).

The data has 20 borrower characteristics, including
◮ credit history (5 categories),
◮ housing (rent, own, or free),
◮ the loan purpose and duration,
◮ and installment rate as a percent of income.

Unfortunately, many of the columns in the data file are coded categorically in a very opaque way. (Most are factors in R.)

SLIDE 4

Logistic regression yields P̂[good | x] = P̂[Y = 1 | x]:

> full <- glm(GoodCredit~., family=binomial, data=credit)
> predfull <- predict(full, type="response")

Need to compare to binary Y ∈ {0, 1}.
◮ Convert: Ŷ = 1{ P̂[Y = 1 | x] > 0.5 }
◮ Classification error: Yi − Ŷi ∈ {−1, 0, 1}.

> errorfull <- credit[,1] - (predfull >= .5)
> table(errorfull)

errorfull
 -1   0   1
 74 786 140
> mean(abs(errorfull)) ## add weights if you want
[1] 0.214
> mean(errorfull^2)
[1] 0.214
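The same information can be laid out as a confusion table. A minimal sketch, assuming credit and predfull are as above (GoodCredit coded 0/1); the off-diagonal share should reproduce the 0.214 misclassification rate:

> yhat <- as.numeric(predfull >= .5)                  ## predicted class at the 0.5 cut-off
> conf <- table(actual = credit$GoodCredit, predicted = yhat)
> conf                                                ## the two off-diagonal cells are the two error types
> sum(conf[row(conf) != col(conf)]) / sum(conf)       ## overall misclassification rate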

SLIDE 5

ROC & PR curves

Is one type of mistake worse than the other?
◮ You'll want to have P̂ > 0.5 of a "positive" outcome before taking a risky action.
◮ You decide which error is worse, and by how much.

> table(credit[,1] - (predfull >= .6))
 -1   0   1
 40 789 171
> mean((credit[,1] - (predfull >= .6))^2)
[1] 0.211

What happens as the cut-off varies? What’s the “best” cut-off? To answer we can use two curves:

  • 1. ROC: Receiver Operating Characteristic
  • 2. PR: Precision-Recall
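Before looking at those two curves, a rough feel for the trade-off comes from simply sweeping the cut-off; a minimal sketch, reusing credit and predfull from above (this tracks only the overall error, not the two error types separately):

> cutoffs <- seq(0.1, 0.9, by = 0.1)
> miss <- sapply(cutoffs, function(k) mean((credit[,1] - (predfull >= k))^2))
> round(rbind(cutoff = cutoffs, misclass = miss), 3)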

SLIDE 6

> library("pROC")
> roc.full <- roc(credit[,1] ~ predfull)
> coords(roc.full, x=0.5)
  threshold specificity sensitivity
  0.5000000   0.8942857   0.5333333
> coords(roc.full, "best")
  threshold specificity sensitivity
  0.3102978   0.7614286   0.7700000

[Figure: ROC curve for the full model, plotting sensitivity against specificity, with the 0.5 cut-off and the "best" cut-off marked.]

Sensitivity: true positive rate
Specificity: true negative rate

—————— Many related names: hit rate, fall-out, false discovery rate, . . .
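The coords() numbers can be checked against the definitions directly; a minimal sketch, assuming credit and predfull as before and treating 1 as the "positive" class (pROC's tie and direction handling may differ slightly at the exact threshold):

> y <- credit[,1]
> yhat <- as.numeric(predfull >= .5)
> sum(yhat == 1 & y == 1) / sum(y == 1)   ## sensitivity: true positive rate
> sum(yhat == 0 & y == 0) / sum(y == 0)   ## specificity: true negative rate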

SLIDE 7

> library("PRROC")
> pr.full <- pr.curve(scores.class0=predfull,
+                     weights.class0=credit[,1], curve=TRUE)

[Figure: precision-recall curve for the full model.]

Recall: true positive rate (same as sensitivity)
Precision: positive predictive value

—————— Many related names: hit rate, fall-out, false discovery rate, . . .
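Precision and recall at a given cut-off can likewise be computed by hand; a short sketch, reusing y and yhat from the ROC sketch above:

> sum(yhat == 1 & y == 1) / sum(yhat == 1)   ## precision: positive predictive value
> sum(yhat == 1 & y == 1) / sum(y == 1)      ## recall: same as sensitivity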

SLIDE 8

Now we know how to evaluate a classification model, so we can compare models. But which models should we compare?

> empty <- glm(GoodCredit~1, family=binomial, data=credit)
> history <- glm(GoodCredit~history3, family=binomial, data=credit)

Misclassification rates:

> c(empty=mean(abs(errorempty)),
+   history=mean(abs(errorhistory)),
+   full=mean(abs(errorfull)) )
  empty history    full
  0.300   0.283   0.214

Why is this both obvious and not helpful?

SLIDE 9

A word of caution

Why not just throw everything in there?

> too.good <- glm(GoodCredit~. + .^2, family=binomial,
+                 data=credit)
Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: fitted probabilities numerically 0 or 1 occurred

This warning means you have the logistic version of our "connect the dots" model.
◮ Just as useless as before!

> c(empty=mean(abs(errorempty)),
+   history=mean(abs(errorhistory)),
+   full=mean(abs(errorfull)) ,
+   too.good=mean(abs(errortoo.good)) )
   empty  history     full too.good
   0.300    0.283    0.214    0.000

SLIDE 10

Model Selection

Our job now is to pick which X variables belong in our model.
◮ A good prediction model summarizes the data but does not overfit.

What if the goal isn't just prediction?
◮ A good model answers the question at issue.
◮ Better predictions don't matter if the model doesn't answer the question.
◮ A good regression model obeys our assumptions.
◮ Especially important when the goal is inference/relationships.
◮ A causal model is only good when it meets even more assumptions.

SLIDE 11

What is the goal?

  • 1. Relationship-type questions and inference?

◮ Are women paid differently than men on average?
  > lm(log.WR ~ sex)
◮ Does age/experience affect men and women differently?
  > lm(log.WR ~ age*sex - sex)
◮ No other models matter

  • 2. Data summarization?

◮ In time series we matched the dynamics/trends, and stopped there.

  • 3. Prediction?

◮ Need a fair, objective criterion that matches the idea of predicting the future. Avoid overfitting.

SLIDE 12

Overfitting

We have already seen overfitting twice:

  • 1. Week 5: R2 ↑ as more variables went into MLR

> c(summary(trucklm1)$r.square, summary(trucklm3)$r.square,
+   summary(trucklm6)$r.square)
[1] 0.021 0.511 0.693

  • 2. Just a minute ago: Classification error ↓ as more variables went into logit

  empty history    full
  0.300   0.283   0.214

Fitting the data at hand better and better . . . but getting worse at predicting the next observation. How can we use the data to pick the model without relying on the data too much?

SLIDE 13

Out-of-sample prediction

How do we evaluate a forecasting model?
◮ Make predictions!
◮ Out-of-sample prediction error is the Gold Standard for comparing models.

(If what you care about is prediction.)

Basic Idea: We want to use the model to forecast outcomes for observations we have not seen before.
◮ Use the data to create a prediction problem.
◮ See how our candidate models perform.

We'll use most of the data for training the model, and the left-over part for validating/testing it.

SLIDE 14

In a validation scheme, you
◮ fit a bunch of models to most of the data (training set)
◮ choose the one performing best on the rest (testing set).

For each model:
◮ Obtain b0, . . . , bd on the training data.
◮ Use the model to obtain fitted values for the ntest testing data points:
  Ŷj = xj′ b   or   Ŷj = 1{ P̂[Y = 1 | xj] > 0.5 }
◮ Calculate the Mean Square Error for these predictions:

  MSE = (1/ntest) Σj=1..ntest (Yj − Ŷj)²

SLIDE 15

Out of sample validation steps:

1) Split the data into testing/training samples.

> set.seed(2)
> train.samples <- sample.int(nrow(credit), 0.95*nrow(credit))
> train <- credit[train.samples,]
> test <- credit[-train.samples,]

2) Fit models on the training data

> full <- glm(GoodCredit~., family=binomial, data=train)
> history <- glm(GoodCredit~history3, family=binomial, data=train)

3) Predict on the test data

> predfull <- predict(full, type="response", newdata=test)
> errorfull <- test[,"GoodCredit"] - (predfull >= .5)

4) Compute MSE/MAE

> c(empty=mean(errorempty^2), history=mean(errorhistory^2),
+   full=mean(errorfull^2) , too.good=mean(errortoo.good^2) )
   empty  history     full too.good
    0.24     0.20     0.32     0.42
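The other three error vectors are built exactly like errorfull in step 3; a sketch, assuming empty, history, and too.good were refit on train in step 2 (too.good may again throw convergence warnings):

> predempty   <- predict(empty,   type="response", newdata=test)
> predhistory <- predict(history, type="response", newdata=test)
> errorempty   <- test[,"GoodCredit"] - (predempty   >= .5)
> errorhistory <- test[,"GoodCredit"] - (predhistory >= .5)
## ... and the same two lines for too.good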

SLIDE 16

The missing piece is in 2) Fit models on the training data. Which models?

The rest of these slides are about tools to help with choosing models. We'll do linear and logistic examples.
◮ Once we have tools for step 2, it's easy to compute out-of-sample MSE.

There are two pieces to the puzzle:
◮ Select the "universe of variables"
◮ Choose the best model(s)

The computer helps only with the 2nd!

SLIDE 17

The universe of variables is HUGE!
◮ includes all possible covariates that you think might have a linear effect on the response
◮ . . . and all squared terms . . . and all interactions . . .

You decide on this universe through your experience and discipline-based knowledge (and data availability).
◮ Consult subject matter research and experts.
◮ Consider carefully what variables have explanatory power, and how they should be transformed.
◮ If you can avoid it, don't just throw everything in.

This step is very important! And also difficult. . . . and sadly, not much we can do today.

SLIDE 18

Today’s linear model example: Census data on wages

[Figure: two panels, "Male Income Curve" and "Female Income Curve", each plotting log wage rate against age.]

We look at people earning >$5000, working >500 hrs, and <60 years old.

(Sound familiar?)

SLIDE 19

There is a discrepancy between mean log(WR) for men and women.
◮ Female wages flatten at about age 30, while men's keep rising.

> men <- sex=="M"
> malemean <- tapply(log.WR[men], age[men], mean)
> femalemean <- tapply(log.WR[!men], age[!men], mean)

[Figure: mean log wage rate by age, separate curves for men (M) and women (F).]

SLIDE 20

How should we model E[ log.WR | age, sex, race, marital, edu ]?

◮ Homework 4 used age + age^2
◮ Data visualization suggests sex*(age + age^2)

But what else? Polynomials? More interactions? Everything?

Four models to use as examples:

> wagereg1 <- lm(log.WR ~ age*sex + age2*sex + ., data=train)
> wagereg2 <- lm(log.WR ~ age*sex + age2*sex + marital +
+                (hs+assoc+coll+grad)*age + race*age , data=train)
> wagereg3 <- lm(log.WR ~ race*age*sex + age2*sex + marital +
+                (hs+assoc+coll+grad)*age, data=train)
> wagereg4 <- lm(log.WR ~ race*age*sex - race + age2*sex +
+                marital + (hs+assoc+coll+grad)*age, data=train)

SLIDE 21

Variable Selection

More variables always means higher R2, but . . .
◮ we don't want the full model
◮ we can't use hypothesis testing
◮ we need to be rigorous/transparent

We will study a few variable selection methods and talk about the general framework of Penalized Regression.

Usual disclaimer:
◮ there's lots more out there, remember those other classes?

SLIDE 22

Penalized Regression

A systematic way to choose variables is through penalization. This leads to a family of methods (that we will only sample).

Remember that we choose b0, b1, b2, . . . , bd to

  min (1/n) Σ (Yi − Ŷi)²   ⇔   max R2

We want to maximize fit but minimize complexity. Add a penalty that increases with the complexity of the model:

  min (1/n) Σ (Yi − Ŷi)² + penalty(dim)

◮ Different penalties give different models.
◮ Replace SSE with other losses, e.g. the logit.

SLIDE 23

Information criteria

Information criteria penalties attempt to quantify how well our model would have predicted the data, regardless of what you’ve estimated for the βj’s. The best of these is the BIC: Bayes information criterion, which is based on a “Bayesian” philosophy of statistics.

BIC = n log(SSE/n) + p log(n)

p = # variables, n = sample size, but what sample?
◮ Choose the model that minimizes BIC.
——————

Remember: SSE = Σ (Yi − Ŷi)², min SSE ⇔ min n log(SSE/n).
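The formula is easy to verify by hand for any fitted lm; a minimal sketch for a generic fit called reg (a hypothetical name), assuming no aliased coefficients:

> n   <- length(resid(reg))
> SSE <- sum(resid(reg)^2)
> p   <- length(coef(reg))        ## counts every estimated coefficient, intercept included
> n*log(SSE/n) + p*log(n)         ## BIC "by hand"
> extractAIC(reg, k=log(n))[2]    ## should agree (extractAIC is introduced on the next slide)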

SLIDE 24

Another popular metric is the Akaike information criterion:

  AIC = n log(SSE/n) + 2p

A general form for these criteria is n log(SSE/n) + kp, where k = 2 for AIC and k = log(n) for BIC.

In R, we can use the extractAIC() function to get either one.
◮ extractAIC(reg) ⇒ AIC
◮ extractAIC(reg, k=log(n)) ⇒ BIC

AIC prefers more complicated models than BIC, and it is not as easily interpretable.

SLIDE 25

Back to the Census wage data . . .

AIC

> extractAIC(wagereg1)
[1]     18.00 -24360.83
> extractAIC(wagereg2)
[1]     26.0  -24403.9
> extractAIC(wagereg3)
[1]     34.00 -24455.15
> extractAIC(wagereg4)
[1]     30.00 -24462.91

BIC

> extractAIC(wagereg1, k=log(n))
[1]     18.00 -24219.45
> extractAIC(wagereg2, k=log(n))
[1]     26.00 -24199.67
> extractAIC(wagereg3, k=log(n))
[1]     34.00 -24188.09
> extractAIC(wagereg4, k=log(n))
[1]     30.00 -24227.26

(Remember n is the training sample size.)

SLIDE 26

Model probabilities

One (very!) nice thing about the BIC is that you can interpret it in terms of model probabilities. Given a list (what list?) of possible models {M1, M2, . . . , MR}, the probability that model i is correct is

  P(Mi) ≈ exp( −½ BIC(Mi) ) / Σr=1..R exp( −½ BIC(Mr) )
        = exp( −½ [BIC(Mi) − BICmin] ) / Σr=1..R exp( −½ [BIC(Mr) − BICmin] )

Subtract BICmin = min{BIC(M1), . . . , BIC(MR)} for numerical stability.

SLIDE 27

> eBIC <- exp(-0.5*(BIC-min(BIC)))
> eBIC
    wagereg1     wagereg2     wagereg3     wagereg4
2.011842e-02 1.023583e-06 3.120305e-09 1.000000e+00
> probs <- eBIC/sum(eBIC)
> round(probs, 5)
wagereg1 wagereg2 wagereg3 wagereg4
 0.01972  0.00000  0.00000  0.98028

BIC indicates that we are 98% sure wagereg4 is best. (of these 4!).
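The BIC vector used above is not shown on the slide; presumably it collects the extractAIC() values from Slide 25, along the lines of this sketch (with n the training sample size):

> BIC <- c(wagereg1 = extractAIC(wagereg1, k=log(n))[2],
+          wagereg2 = extractAIC(wagereg2, k=log(n))[2],
+          wagereg3 = extractAIC(wagereg3, k=log(n))[2],
+          wagereg4 = extractAIC(wagereg4, k=log(n))[2])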

SLIDE 28

Another Example: NBA regressions from last class.

Our "efficient Vegas" model:

> extractAIC(glm(favwin ~ spread-1, family=binomial), k=log(553))
[1]   1.000 534.287

A model that includes a non-zero intercept:

> extractAIC(glm(favwin ~ spread, family=binomial), k=log(553))
[1]   2.0000 540.4333

What if we throw in home-court advantage?

> extractAIC(glm(favwin ~ spread + favhome, family=binomial), k=log(553))
[1]   3.0000 545.637

The simplest/efficient model is best

(The model probabilities are 0.953, 0.044, and 0.003.)

SLIDE 29

Thus BIC is an alternative to testing for comparing models.
◮ It is easy to calculate.
◮ You are able to evaluate model probabilities.
◮ There are no "multiple testing" type worries.
◮ It generally leads to simpler models than F-tests, and the models need not be nested.

But which models should we compare?
◮ 10 X variables means 1,024 models.
◮ 20 variables means 1,048,576!

As with testing, you need to narrow down your options before comparing models. Use your knowledge and/or the data.

SLIDE 30

Forward stepwise regression

One approach is to build your regression model step-by-step, adding one variable at a time:
◮ Run Y ∼ Xj for each covariate, then choose the one leading to the smallest BIC to include in your model.
◮ Given you chose covariate X⋆, now run Y ∼ X⋆ + Xj for each j and again select the model with smallest BIC.
◮ Repeat this process until none of the expanded models lead to smaller BIC than the previous model.

This is called "forward stepwise regression"; a rough hand-rolled version is sketched below.
◮ There is a backwards version, but it is often less useful.
  ֒→ Not always! see week8-Rcode.R
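To make the recipe concrete, here is a rough hand-rolled version of the forward pass; a sketch only (R's step() function on the next slides does this properly), using the wage-example names train and log.WR:

> vars <- setdiff(names(train), "log.WR")      ## candidate covariates
> current <- "log.WR ~ 1"                      ## start from the intercept-only model
> repeat {
+   if (length(vars) == 0) break
+   bics <- sapply(vars, function(v)
+     extractAIC(lm(as.formula(paste(current, "+", v)), data=train), k=log(nrow(train)))[2])
+   now <- extractAIC(lm(as.formula(current), data=train), k=log(nrow(train)))[2]
+   if (min(bics) >= now) break                ## no 1-step expansion lowers BIC: stop
+   best <- names(which.min(bics))
+   current <- paste(current, "+", best)       ## add the winner ...
+   vars <- setdiff(vars, best)                ## ... and drop it from the candidate list
+ }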

SLIDE 31

R has a step() function to do forward stepwise regression.
◮ This is nice, since doing the steps is time intensive
◮ and would be a bear to code by hand.

The way to use this function is to first run base and full regressions. For example:

base <- lm(log.WR ~ 1, data=train)
full <- lm(log.WR ~ . + .^2, data=train)

◮ "~ . + .^2" says "everything, and all interactions".

This is one reason that making a data.frame is a good idea.

SLIDE 32

Given base (most simple) and full (most complicated) models, a search is instantiated as follows:

fwd <- step(base, scope=formula(full), direction="forward", k=log(n))

◮ scope is the largest possible model that we will consider.
◮ scope=formula(full) makes this our "full" model
◮ k=log(n) uses the BIC metric, with n = ntrain

SLIDE 33

Example: again, look at our wage regression.

The base model has age, age2, and their interaction with sex . . . i.e. our final descriptive model. Our scope is all other variables and their possible interactions.

> base <- lm(log.WR ~ age*sex + age2*sex, data=train)
> full <- lm(log.WR ~ . + .^2, data=train)

And then set it running ...

> fwdBIC <- step(base, scope=formula(full),
+                direction="forward", k=log(n))

It prints a ton . . .

SLIDE 34

The algorithm stopped because none of the 1-step expanded models led to a lower BIC.
◮ You can't be absolutely sure you've found the best model.
◮ Forward stepwise regression is going to miss groups of covariates that are only influential together.
◮ Use out-of-sample prediction to evaluate the model.

It's not perfect, but it is pretty handy.

SLIDE 35

How did we do?

BIC:

> BIC <- c(BIC, fwdBIC = extractAIC(fwdBIC, k=log(n))[2])
 wagereg1  wagereg2  wagereg3  wagereg4    fwdBIC
-24219.45 -24199.67 -24188.09 -24227.26 -24279.26

Model probabilities:

> round(probs <- eBIC/sum(eBIC), 5)
wagereg1 wagereg2 wagereg3 wagereg4   fwdBIC
       0        0        0        0        1

What about out of sample predictions?

> c(error1=mean(error1^2), error2=mean(error2^2),
+   error3=mean(error3^2), error4=mean(error4^2),
+   errorBIC=mean(errorBIC^2))
   error1    error2    error3    error4  errorBIC
0.2982959 0.2972645 0.2975347 0.2974996 0.2971517

SLIDE 36

BUS41100 Applied Regression Analysis

Week 9: Model Building II

LASSO, Finish Data Mining, Inference After Selection

Max H. Farrell
The University of Chicago Booth School of Business

SLIDE 37

Where were we?

SLIDE 38

Model Selection

Our job now is to pick which X variables belong in our model.
◮ A good prediction model summarizes the data but does not overfit.

What if the goal isn't just prediction?
◮ A good model answers the question at issue.
◮ Better predictions don't matter if the model doesn't answer the question.
◮ A good regression model obeys our assumptions.
◮ Especially important when the goal is inference/relationships.
◮ A causal model is only good when it meets even more assumptions.

SLIDE 39

What is the goal?

  • 1. Relationship-type questions and inference?

◮ Are women paid differently than men on average?
  > lm(log.WR ~ sex)
◮ Does age/experience affect men and women differently?
  > lm(log.WR ~ age*sex - sex)
◮ No other models matter

  • 2. Data summarization?

◮ In time series we matched the dynamics/trends, and stopped there.

  • 3. Prediction?

◮ Need a fair, objective criterion that matches the idea of predicting the future. Avoid overfitting.

SLIDE 40

Overfitting

We have already seen overfitting twice:

  • 1. Week 5: R2 ↑ as more variables went into MLR

> c(summary(trucklm1)$r.square, summary(trucklm3)$r.square,
+   summary(trucklm6)$r.square)
[1] 0.021 0.511 0.693

  • 2. Just a minute ago: Classification error ↓ as more variables went into logit

  empty history    full
  0.300   0.283   0.214

Fitting the data at hand better and better . . . but getting worse at predicting the next observation. How can we use the data to pick the model without relying on the data too much?

SLIDE 41

Out-of-sample prediction

How do we evaluate a forecasting model?
◮ Make predictions!
◮ Out-of-sample prediction error is the Gold Standard for comparing models.

(If what you care about is prediction.)

Basic Idea: We want to use the model to forecast outcomes for observations we have not seen before.
◮ Use the data to create a prediction problem.
◮ See how our candidate models perform.

We'll use most of the data for training the model, and the left-over part for validating/testing it.

SLIDE 42

In a validation scheme, you
◮ fit a bunch of models to most of the data (training set)
◮ choose the one performing best on the rest (testing set).

For each model:
◮ Obtain b0, . . . , bd on the training data.
◮ Use the model to obtain fitted values for the ntest testing data points:
  Ŷj = xj′ b   or   Ŷj = 1{ P̂[Y = 1 | xj] > 0.5 }
◮ Calculate the Mean Square Error for these predictions:

  MSE = (1/ntest) Σj=1..ntest (Yj − Ŷj)²

SLIDE 43

Out of sample validation steps:

1) Split the data into testing/training samples.

> set.seed(2)
> train.samples <- sample.int(nrow(credit), 0.95*nrow(credit))
> train <- credit[train.samples,]
> test <- credit[-train.samples,]

2) Fit models on the training data

> full <- glm(GoodCredit~., family=binomial, data=train)
> history <- glm(GoodCredit~history3, family=binomial, data=train)

3) Predict on the test data

> predfull <- predict(full, type="response", newdata=test)
> errorfull <- test[,"GoodCredit"] - (predfull >= .5)

4) Compute MSE/MAE

> c(empty=mean(errorempty^2), history=mean(errorhistory^2),
+   full=mean(errorfull^2) , too.good=mean(errortoo.good^2) )
   empty  history     full too.good
    0.24     0.20     0.32     0.42

SLIDE 44

The missing piece is in 2) Fit models on the training data. Which models?

The rest of these slides are about tools to help with choosing models. We'll do linear and logistic examples.
◮ Once we have tools for step 2, it's easy to compute out-of-sample MSE.

There are two pieces to the puzzle:
◮ Select the "universe of variables"
◮ Choose the best model(s)

The computer helps only with the 2nd!

SLIDE 45

The universe of variables is HUGE!
◮ includes all possible covariates that you think might have a linear effect on the response
◮ . . . and all squared terms . . . and all interactions . . .

You decide on this universe through your experience and discipline-based knowledge (and data availability).
◮ Consult subject matter research and experts.
◮ Consider carefully what variables have explanatory power, and how they should be transformed.
◮ If you can avoid it, don't just throw everything in.

This step is very important! And also difficult. . . . and sadly, not much we can do today.

SLIDE 46

Penalized Regression

A systematic way to choose variables is through penalization. This leads to a family of methods (that we will only sample).

Remember that we choose b0, b1, b2, . . . , bd to

  min (1/n) Σ (Yi − Ŷi)²   ⇔   max R2

We want to maximize fit but minimize complexity. Add a penalty that increases with the complexity of the model:

  min (1/n) Σ (Yi − Ŷi)² + penalty(dim)

◮ Different penalties give different models.
◮ Replace SSE with other losses, e.g. the logit.

SLIDE 47

Information criteria

Information criteria penalties attempt to quantify how well our model would have predicted the data, regardless of what you’ve estimated for the βj’s. The best of these is the BIC: Bayes information criterion, which is based on a “Bayesian” philosophy of statistics.

BIC = n log(SSE/n) + p log(n)

p = # variables, n = sample size, but what sample?
◮ Choose the model that minimizes BIC.
——————

Remember: SSE = Σ (Yi − Ŷi)², min SSE ⇔ min n log(SSE/n).

SLIDE 48

Forward stepwise regression

One approach is to build your regression model step-by-step, adding one variable at a time:
◮ Run Y ∼ Xj for each covariate, then choose the one leading to the smallest BIC to include in your model.
◮ Given you chose covariate X⋆, now run Y ∼ X⋆ + Xj for each j and again select the model with smallest BIC.
◮ Repeat this process until none of the expanded models lead to smaller BIC than the previous model.

This is called "forward stepwise regression".
◮ There is a backwards version, but it is often less useful.
  ֒→ Not always! see week8-Rcode.R

SLIDE 49

Okay, now we’re caught up.

On to another, more complicated penalty.

SLIDE 50

LASSO

The LASSO does selection and comparison simultaneously. We're going to skip most details here. The short version is:

  min (1/n) Σi (Yi − Xi′β)² + λ Σj=1..p |βj|

The penalty has two pieces:

  • 1. Σj=1..p |βj| measures the model's complexity
       ◮ Idea: minimizing the penalty ⇒ lots of βj = 0, i.e. excluded
       ◮ The final model is the set of variables with βj ≠ 0

  • 2. λ determines how important the penalty is
       ◮ Choose it by cross-validation (R does all the work)

SLIDE 51

The LASSO (and its variants) is very popular right now, both in academia and industry. Why?

Suppose we want to model Y | X1, X2, . . . , X10 but we have no idea what X variables to use, or even how many.

◮ There are (10 choose 1) = 10 one-variable models, (10 choose 2) = 45 two-variable models, . . . , and Σk=0..10 (10 choose k) = 1,024 in total!
◮ For 20 X variables: over 1 million models.
◮ For 100 variables:
  1,267,650,600,228,229,401,496,703,205,376

  Drops of water on Earth:
  26,640,000,000,000,000,000,000,000
  (Thank you Wolfram Alpha)
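These counts are just sums of binomial coefficients, easy to check in R:

> sum(choose(10, 0:10))   ## 1,024 models from 10 candidate variables
[1] 1024
> sum(choose(20, 0:20))   ## over a million from 20
[1] 1048576
> 2^100                   ## and with 100 variables ...
[1] 1.267651e+30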

Stepwise is a specific path through these models, but LASSO “searches” all combinations at once.

◮ Easy to run: it's fast, scalable, reliable, . . .
◮ Theoretical guarantees.

SLIDE 52

A little clumsy in R:

  • 1. Set up the X’s

> library(glmnet)
> X <- model.matrix(~(age + age2 + sex + race + marital)
+                    *(sex + race + marital + hs + assoc + coll + grad), Wages)
> X <- X[,-1]
> ncol(X)
[1] 101

  • 2. LASSO command:

> lasso.fit <- cv.glmnet(x = X[training.samples,],
+                        y = train$log.WR, family="gaussian",
+                        alpha=1, standardize=FALSE)

  • 3. Always refit!

(LASSO introduces bias)

> betas <- coef(lasso.fit, s = "lambda.1se")
> model <- which(betas[2:length(betas)]!=0)
> post.lasso <- lm(train$log.WR ~ X[training.samples,model])

SLIDE 53

Out-of-sample Prediction

Out-of-sample prediction error is the Gold Standard for comparing models.

(If what you care about is prediction.)

We’ve used the training data to select models via ◮ F-testing ◮ Stepwise selection ◮ LASSO Now we have to see how they perform on the test data

> errorBIC <- predict(fwdBIC, newdata=test) - test$log.WR
> mean(errorBIC^2)
> ...
> errorLASSO <- ...  ## a little more work, see R code (and the sketch below)
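The "little more work" for the LASSO is that predict() needs the test rows of the same design matrix X; a sketch, assuming X, lasso.fit, training.samples, and test are as defined earlier and that training.samples indexes the same train/test split (this predicts from the penalized fit at lambda.1se; predicting from the refitted post.lasso model would use the selected columns instead):

> predLASSO <- predict(lasso.fit, newx = X[-training.samples,], s = "lambda.1se")
> errorLASSO <- as.numeric(predLASSO) - test$log.WR
> mean(errorLASSO^2)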

SLIDE 54

> round(MSE,4)
wagereg1 wagereg2 wagereg3 wagereg4      BIC      AIC    lasso
  0.2983   0.2973   0.2975   0.2975   0.2972   0.2967   0.2980

So AIC wins this round. But remember train/test samples are random!
◮ Different data sets, different results.
◮ Try set.seed(5) and see what happens

◮ Challenge: find a seed such that LASSO wins

SLIDE 55

We can use all the same ideas with logistic regression. Example: German lending data from last lecture. Select borrower characteristics that predict default.

> credit <- read.csv("germancredit.csv")
> train <- sample.int(nrow(credit), 0.75*nrow(credit))
> base <- glm(GoodCredit~history3, family=binomial,
+             data=credit[train,])
> full <- glm(GoodCredit~., family=binomial,
+             data=credit[train,])
> fwdBIC <- step(base, scope=formula(full),
+                direction="forward", k=log(length(train)))

The null model has credit history as a variable, since I'd include this regardless, and we've left out 250 points for validation.

SLIDE 56

Or we can use LASSO to find a model

> cvfit <- cv.glmnet(x=X[train,], y=credit$GoodCredit[train],
+                    family="binomial", alpha=1, standardize=TRUE)
> betas <- coef(cvfit, s = "lambda.1se")
> model.1se <- which(betas[2:length(betas)]!=0)

Which variables were selected?

> colnames(X[,model.1se])
 [1] "checkingstatus1A13" "checkingstatus1A14" "duration2"
 [4] "history3A31"        "history3A34"        "purpose4A41"
 [7] "purpose4A43"        "purpose4A46"        "amount5"
[10] "savings6A64"        "savings6A65"        "employ7A74"
[13] "installment8"       "status9A93"         "others10A103"
[16] "property12A124"     "age13"              "otherplans14A143"
[19] "housing15A152"      "foreign20A202"

SLIDE 57

Comparing with the validation set:

> predBIC <- predict(fwdBIC, newdata=credit[-train,],
+                    type="response")
> errorBIC <- credit[-train,1] - (predBIC >= .5)

(LASSO takes a few extra lines)

Misclassification rates

> c(full=mean(abs(errorfull)), BIC=mean(abs(errorBIC)),
+   lasso=mean(abs(errorLasso.1se)) )
 full   BIC lasso
0.292 0.296 0.280

◮ Our LASSO model classifies 72% of borrowers correctly
◮ BIC and full are slightly worse

SLIDE 58

We can also use ROC curves to select a model.

[Figure: ROC curves on the validation set, with the 0.5 cut-off and the "best" cut-off marked.]

Sensitivity: true positive rate
Specificity: true negative rate

—————— Many related names: recall, precision, positive predictive value, . . .

SLIDE 59

What else can we do?

> rbind( coords(roc.full, "best"),
+        coords(roc.BIC, "best"), coords(roc.lasso, "best") )
     threshold specificity sensitivity
[1,] 0.3210536   0.7558140   0.6538462
[2,] 0.4065372   0.8313953   0.6153846
[3,] 0.2030214   0.6279070   0.7820513

Is misclassification improved?

> errorBIC <- credit[-train,1] - (predBIC >= xy.BIC.best[1])
> errorfull <- credit[-train,1] - (predfull >= xy.full.best[1])
> errorLasso.1se <- credit[-train,1] - (predLasso.1se >= xy.lasso.best[1])
> c(full=mean(abs(errorfull)), BIC=mean(abs(errorBIC)),
+   lasso=mean(abs(errorLasso.1se)))
 full   BIC lasso
0.276 0.236 0.324

Different goals, different results.

SLIDE 60

> A bunch of code ...
> see week8-Rcode.R

[Figure: ROC curves for the Full, BIC, Lasso, and History models, with the 0.5 and "best" cut-offs marked on each.]

SLIDE 61

Area Under the Curve (AUC)

> c(auc(roc.base), auc(roc.full), auc(roc.BIC), auc(roc.lasso))
[1] 0.6077962 0.7498548 0.7233595 0.7521051

Not surprising, given the picture. Formal testing is possible, but never really useful:

> roc.test(roc.full,roc.BIC)

        DeLong's test for two correlated ROC curves

data:  roc.full and roc.BIC
Z = -0.72882, p-value = 0.4661
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
  0.7492546   0.7700507

SLIDE 62

Putting it all together ...

Regardless: Remember your discipline-based knowledge and don't get lost in fancy regression techniques.

A strategy for building regression models:

  • 1. Decide on the universe of variables.

◮ Think about appropriate transformations!
◮ Limitation: None of our methods learn nonlinearities automatically. (cf. random forests, deep nets)

  • 2. Reduce the set of X variables with BIC/LASSO/Other.

◮ Any variables you need to keep?

  • 3. Plot residual diagnostics and take remedies (transformations, etc.).

  • 4. Evaluate your model predictions.

SLIDE 63

Inference After Model Selection

Up until now, we have used the same model for the two different regression questions: prediction and inference. Is the model we select for prediction good for inference? Not necessarily!

Then how should we choose variables for "correct" inference?
◮ We have few answers. This is still an active research area.

One thing we can do: inference on a single coefficient with LASSO selection:

  Y = β0 + β1X1 + β2X2 + β3X3 + · · · + βp−1Xp−1 + βpXp

◮ Which variables to "control" for?

SLIDE 64

Inference question: do the returns to age diminish at the same rate for men and women? (just for an example.) Remember what “X” was . . .

> X <- model.matrix(~(age + age2 + sex + race + marital)
+                    *(sex + race + marital + hs + assoc + coll + grad), Wages)
> colnames(X)
 [1] "age"
 ...
[29] "age2:sexM"
 ...

So we want inference for β29, “controlling” for all the rest. Turn to the hdm package, made right here at

SLIDE 65

◮ This is a relationship question, so we use the full data now
◮ Set the "index" argument to the variable of interest

> library(hdm)
> hdm.ci <- rlassoEffects(x = X, y = Wages$log.WR, index=c(29))
> summary(hdm.ci)
[1] "Estimates and significance testing of the effect of target variables"
           Estimate. Std. Error t value Pr(>|t|)
age2:sexM -1.269e-04  6.196e-05  -2.047   0.0406 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

SLIDE 66

As mentioned before, this is a very hard problem:
◮ Since very few variables are influential, testing is useless.
◮ You cannot consider all transformations and interactions.
◮ It is easy to overfit, which leads to bad predictions.

For industrial mining, more powerful tools are needed. There are two full classes in this area: 41201 & 41204.

SLIDE 67

Model selection – final words

You have many new tools at your disposal, but don't forget the fundamentals.
◮ Easy to do, hard to do well
◮ Never forget your discipline-based knowledge
◮ You think! Not the computer
◮ You can never consider everything
◮ Always try for the simple model
◮ Prediction is a great equalizer!

But what about inference?
◮ Causal inference?
◮ Testing several βj? Prediction intervals?

SLIDE 68

The End & Recap

Whew! We made it!

SLIDE 69

What we did

Generic Analysis Outline, getting from A to Z:

  • 1. State the problem (Box 3)
  • 2. Data collection (Box 1)
  • 3. Model fitting & estimation (Box 2 & this class)

3.1 Model specification (linear? logistic?)
3.2 Select potentially relevant variables
3.3 Model estimation
3.4 Model validation and criticism
3.5 Back to 3.1? Back to 2? Really give up, go back to 1?

  • 4. Answering the posed questions

But that oversimplifies a bit;
◮ it is more iterative, and can be more art than science.

SLIDE 70

What we did

W1: Introduction, The simple linear regression (SLR) model
W2: Inference for SLR
W3: Inference continued, Multiple linear regression (MLR)
W4: Causal inference, more on MLR
W5: MLR issues, Clustering, Panels
W6: Time series models and autoregression
W7: Logistic regression & classification
W8: Model building & data mining
W9: Model building & data mining continued

SLIDE 71

The End

And that's it, we're done!

Thanks for all your hard work. I know it’s been tough. I hope everyone got something (a lot?) out of it. ◮ Keep working on projects & presentations. ◮ Good luck!
