
Data Mining: Overview (Department of Statistics, Wharton)



  1. Slide 1: Data Mining
     Bob Stine
     Department of Statistics, Wharton
     www-stat.wharton.upenn.edu/~bob

     Slide 2: Overview
     - Applications
       - Marketing: Direct mail advertising (Zahavi example)
       - Biomedical: finding predictive risk factors
       - Financial: predicting returns and bankruptcy
     - Role of management
       - Setting goals
       - Coordinating players
     - Critical stages of modeling process
       - Picking the model <-- My research interest
       - Validation

     Slide 3: Predicting Health Risk
     - Who is at risk for a disease?
       - Costs
         • False positive: treat a healthy person
         • False negative: miss a person with the disease
       - Example: detect osteoporosis without need for x-ray
     - What sort of predictors, at what cost?
       - Very expensive: Laboratory measurements, “genetic”
       - Expensive: Doctor reported clinical observations
       - Cheap: Self-reported behavior
     - Missing data
       - Always present
       - Are records with missing data like those that are not missing?

     Slide 4: Predicting Stock Market Returns
     - Predicting returns on the S&P 500 index
       - Extrapolate recent history
       - Exogenous factors
     - What would distinguish a good model?
       - Highly statistically significant predictors
       - Reproduces pattern in observed history
       - Extrapolate better than guessing, hunches
     - Validation
       - Test of the model yields sobering insight

  2. Slide 5: Predicting the Market
     - Build a regression model
       - Response is return on the value-weighted S&P
       - Use standard forward/backward stepwise
       - Battery of 12 predictors
     - Train the model during 1992-1996
       - Model captures most of variation in 5 years of returns
       - Retain only the most significant features (Bonferroni)
     - Predict what happens in 1997
     - Another version in Foster, Stine & Waterman

     Slide 6: Historical patterns?
     [Figure: vwReturn (about -0.06 to 0.08) by Year, 1992-1998, with a "?" marking the period to be predicted]

     Slide 7: Fitted model predicts...
     Slide 8: What happened?
     [Figures: two time-series plots by Year, 1992-1998. Recoverable labels: “Exceptional Feb return?”, “Pred Error”, “Training Period”]
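The "retain only the most significant features (Bonferroni)" step on the Predicting the Market slide can be made concrete with a small calculation. This is only a sketch, not the author's code: it assumes an overall two-sided error rate of 0.05 (the slide gives none), a normal approximation to the t-statistic, and that the number of candidates is the 12 predictors in the battery. The function name is hypothetical.

```python
from statistics import NormalDist


def bonferroni_cutoff(n_candidates, alpha=0.05):
    """Return (per-test level, two-sided z cutoff) under a Bonferroni correction.

    alpha is an assumed overall error rate; the slide does not state one.
    """
    per_test = alpha / n_candidates
    # Two-sided test: split the per-test level between the two tails.
    z_cutoff = NormalDist().inv_cdf(1 - per_test / 2)
    return per_test, z_cutoff


per_test, z_cutoff = bonferroni_cutoff(12)
print(f"per-test level: {per_test:.4f}, |z| cutoff: {z_cutoff:.2f}")
```

A stepwise search that also tries transformations or interactions examines more than 12 candidates, so the count passed in should reflect everything the search was allowed to consider.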

  3. Slide 9: Claimed versus Actual Error
     [Figure: Squared Prediction Error (0 to 120) against Complexity of Model (0 to 100), with two curves labeled “Actual” and “Claimed”]

     Slide 10: Over-confidence?
     - Over-fitting
       - DM model fits the training data too well, better than it can predict when extrapolated to future.
       - Greedy model-fitting procedure: “Optimization capitalizes on chance”
     - Some intuition for the phenomenon
       - Coincidences
         • Cancer clusters, the “birthday problem”
       - Illustration with an auction
         • What is the value of the coins in this jar?

     Slide 11: Auctions and Over-fitting
     - Auction jar of coins to a class of students
     - Histogram shows the bids of 30 students
     - Some were suspicious, but a few were not!
     - Actual value is $3.85
     - Known as “Winner’s Curse”
     - Similar to over-fitting: best model like high bidder
     [Figure: histogram of the students' bids]

     Slide 12: Roles of Management
     Management determines whether a project succeeds…
     - Whose data is it?
       - Ownership and shared obligations/rewards
     - Irrational expectations
       - Budgeting credit: “How could you miss?”
     - Moving targets
       - Energy policy: “You’ve got the old model.”
     - Lack of honest verification
       - Stock example… Given time, can always find a good fit.
       - Rx marketing: “They did well on this question.”
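The winner's curse analogy from the Auctions and Over-fitting slide can be illustrated with a short simulation. The jar value ($3.85) and the class size (30) come from the slide; everything else is an assumption made for illustration: bids are modeled as unbiased normal guesses around the true value, and the spread NOISE_SD is a hypothetical number. Even when every bidder is right on average, the highest bid is systematically too high, just as the best-fitting of many candidate models looks better than it really is.

```python
import random
from statistics import mean

TRUE_VALUE = 3.85   # actual value of the jar, from the slide
N_BIDDERS = 30      # class size, from the slide
NOISE_SD = 1.00     # assumed spread of individual guesses (hypothetical)


def simulate_auctions(n_auctions=2000, seed=0):
    """Return (average bid, average winning bid) over many simulated auctions."""
    rng = random.Random(seed)
    mean_bids, top_bids = [], []
    for _ in range(n_auctions):
        # Each bidder guesses the value with independent, unbiased noise.
        bids = [rng.gauss(TRUE_VALUE, NOISE_SD) for _ in range(N_BIDDERS)]
        mean_bids.append(mean(bids))
        top_bids.append(max(bids))
    return mean(mean_bids), mean(top_bids)


avg_bid, avg_winning_bid = simulate_auctions()
print(f"typical bid: {avg_bid:.2f}, typical winning bid: {avg_winning_bid:.2f}")
```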

  4. Slide 13: What are the costs?
     - Symmetry of mistakes?
       - Is over-predicting as costly as under-predicting?
       - Managing inventories and sales
       - Visible costs versus hidden costs
     - Does a false positive = a false negative?
       - Classification
         • Credit modeling, flagging “risky” customers
       - Differential costs for different types of errors
         • False positive: call a good customer “bad”
         • False negative: fail to identify a “bad”

     Slide 14: Back to a real application…
     How can we avoid some of these problems?
     I’ll focus on
     * statistical modeling aspects (my research interest), and also
     * reinforce the business environment.

     Slide 15: Predicting Bankruptcy
     - “Needle in a haystack”
       - 3,000,000 months of credit-card activity
       - 2244 bankruptcies
       - Best customers resemble worst customers
     - What factors anticipate bankruptcy?
       - Spending patterns? Payment history?
       - Demographics? Missing data?
       - Combinations of factors?
         • Cash Advance + Las Vegas = Problem
     - We consider more than 100,000 predictors!

     Slide 16: Stages in Modeling
     - Having framed the problem, gotten relevant data…
     - Build the model
       Identify patterns that predict future observations.
     - Evaluate the model
       When can you tell if it's going to succeed…
       - During the model construction phase
         • Only incorporate meaningful features
       - After the model is built
         • Validate by predicting new observations
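The question "Does a false positive = a false negative?" from the What are the costs? slide has a simple decision-theoretic reading, sketched below. If flagging a good customer costs c_FP and missing a bad one costs c_FN, flagging has the lower expected cost whenever the predicted probability of "bad" exceeds c_FP / (c_FP + c_FN). The function names and the 1-to-9 cost ratio are hypothetical; the slides give no cost figures.

```python
def flag_threshold(cost_false_positive, cost_false_negative):
    """Probability of 'bad' above which flagging has the lower expected cost."""
    return cost_false_positive / (cost_false_positive + cost_false_negative)


def expected_cost(p_bad, flag, cost_false_positive, cost_false_negative):
    """Expected cost of one decision for a customer who is 'bad' with probability p_bad."""
    if flag:
        # Flagging is a mistake only when the customer is actually good.
        return (1 - p_bad) * cost_false_positive
    # Not flagging is a mistake only when the customer is actually bad.
    return p_bad * cost_false_negative


# Hypothetical costs: missing a bad customer is 9 times worse than a false alarm.
threshold = flag_threshold(1.0, 9.0)
print(f"flag when P(bad) > {threshold:.2f}")
```

With symmetric costs the threshold is 0.5; the more expensive a missed "bad" customer, the lower the probability at which flagging pays off.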

  5. Slide 17: Building a Predictive Model
     So many choices…
     - Structure: What type of model?
       • Neural net (projection pursuit)
       • CART, classification tree
       • Additive model or regression spline (MARS)
     - Identification: Which features to use?
       • Time lags, “natural” transformations
       • Combinations of other features
     - Search: How does one find these features?
       • Brute force has become cheap.

     Slide 18: My Choices
     - Simple structure
       - Linear regression with nonlinearity via interactions
       - All 2-way and many 3-way, 4-way interactions
     - Rigorous identification
       - Conservative standard error
       - Comparison of conservative t-ratio to adaptive threshold
     - Greedy search
       - Forward stepwise regression
       - Coming: Dynamically changing list of features
         • Good choice affects where you search next.

     Slide 19: Bankruptcy Model: Fitting
     - Context
       - Identify current customers who might declare bankruptcy
     - Split data to allow validation, comparison
       - Training data
         • 600,000 months with 450 bankruptcies
       - Validation data
         • 2,400,000 months with 1786 bankruptcies
     - Selection via adaptive thresholding
       - Analogy: Compare sequence of t-stats to Sqrt(2 log p/q)
       - Dynamic expansion of feature space

     Slide 20: Bankruptcy Model: Construction
     - Where should the fitting process be stopped?
     [Figure: Residual Sum of Squares (SS, 400 to 470) against Number of Predictors (0 to 150)]
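The adaptive-threshold analogy on the Bankruptcy Model: Fitting slide can be sketched numerically. The code below assumes that "Sqrt(2 log p/q)" means sqrt(2 log(p/q)), with p the number of candidate predictors and q the rank of the t-statistic being tested (q = 1 for the largest). That reading, the function names, and the stopping rule (stop at the first t-statistic that fails its threshold) are assumptions for illustration, not a statement of the author's procedure.

```python
import math


def adaptive_threshold(p, q):
    """Threshold for the q-th largest |t| among p candidates: sqrt(2 log(p/q))."""
    if not 1 <= q <= p:
        raise ValueError("need 1 <= q <= p")
    return math.sqrt(2 * math.log(p / q))


def count_selected(t_stats, p):
    """Count how many of the largest |t| values clear their thresholds in sequence."""
    ordered = sorted((abs(t) for t in t_stats), reverse=True)
    selected = 0
    for q, t in enumerate(ordered, start=1):
        # Stop at the first statistic that fails its (shrinking) threshold.
        if q > p or t <= adaptive_threshold(p, q):
            break
        selected += 1
    return selected


# With 100,000 candidates the first predictor faces the hardest test;
# the bar drops as more predictors are accepted.
for q in (1, 10, 39):
    print(q, round(adaptive_threshold(100_000, q), 2))
```

For q = 1 this reduces to the Bonferroni-style bound sqrt(2 log p); lowering the bar as q grows is what makes the rule adaptive.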

  6. Slide 21: Bankruptcy Model: Fitting
     - Our adaptive selection procedure stops at a model with 39 predictors.
     [Figure: Residual Sum of Squares (SS, 400 to 470) against Number of Predictors (0 to 150)]

     Slide 22: Bankruptcy Model: Validation
     - The validation indicates that the fit gets better while the model expands. Avoids over-fitting.
     [Figure: Validation Sum of Squares (SS, 1640 to 1760) against Number of Predictors (0 to 150)]

     Slide 23: Lift Chart
     - Measures how well model classifies sought-for group
       Lift = (% bankrupt in DM selection) / (% bankrupt in all data)
     - Depends on rule used to label customers
       - Very high probability of bankruptcy: Lots of lift, but few bankrupt customers are found.
       - Lower rule: Lift drops, but finds more bankrupt customers.
     - Tie to the economics of the problem
       - Slope gives you the trade-off point

     Slide 24: Example: Lift Chart
     [Figure: %Responders (0.0 to 1.0) against % Chosen (0 to 100), with curves labeled “Model” and “Random”]
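The lift definition on the Lift Chart slide (% bankrupt in the selection divided by % bankrupt in all data) can be computed directly. The sketch below selects the top-scoring fraction of customers as the "DM selection"; the function name and the ten-customer toy data are hypothetical and only illustrate the trade-off the slide describes.

```python
def lift(scores, is_bankrupt, fraction):
    """Lift of the top `fraction` of customers when ranked by model score."""
    n = len(scores)
    n_selected = max(1, round(fraction * n))
    # Rank customers from highest to lowest predicted risk.
    ranked = sorted(zip(scores, is_bankrupt), key=lambda pair: pair[0], reverse=True)
    rate_selected = sum(label for _, label in ranked[:n_selected]) / n_selected
    rate_overall = sum(is_bankrupt) / n
    return rate_selected / rate_overall


# Hypothetical toy data: 10 customers, 3 of whom went bankrupt.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
is_bankrupt = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

print(lift(scores, is_bankrupt, 0.2))  # strict rule: high lift, finds 2 of 3
print(lift(scores, is_bankrupt, 0.5))  # looser rule: lower lift, finds all 3
```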

  7. Slide 25: Bankruptcy Model: Lift
     - Much better than diagonal!
     [Figure: % Found (0 to 100) against % Contacted (0 to 100)]

     Slide 26: Calibration
     - Classifier assigns Prob(“BR”) rating to a customer.
     - Weather forecast
     - Among those classified as 2/10 chance of “BR”, how many are BR?
     - Closer to diagonal is better.
     [Figure: Actual (0 to 100) against claimed probability (10 to 90)]

     Slide 27: Bankruptcy Model: Calibration
     - Over-predicts risk near claimed probability 0.3.
     [Figure: Calibration Chart, Actual (0 to 1.2) against Claim (0 to 0.8)]

     Slide 28: Modeling Bankruptcy
     - Automatic, adaptive selection
       - Finds patterns that predict new observations
       - Predictive, but not easy to explain
     - Dynamic feature set
       - Current research
       - Information theory allows changing search space
       - Finds more structure than direct search could find
     - Validation
       - Remains essential only for judging fit, reserve more for modeling
       - Comparison to rival technology (we compared to C4.5)
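The calibration question from the Calibration slide ("Among those classified as 2/10 chance of BR, how many are BR?") amounts to grouping customers by claimed probability and comparing each group's average claim with its observed rate. The following is a minimal sketch of that check; the function name, the equal-width bins, and the toy data are assumptions for illustration.

```python
def calibration_table(claimed, actual, n_bins=10):
    """Group cases by claimed probability; return (mean claim, actual rate, count) per bin."""
    bins = {}
    for p, y in zip(claimed, actual):
        # Equal-width bins on [0, 1]; a claim of exactly 1.0 goes in the last bin.
        index = min(int(p * n_bins), n_bins - 1)
        bins.setdefault(index, []).append((p, y))
    table = []
    for index in sorted(bins):
        cases = bins[index]
        mean_claim = sum(p for p, _ in cases) / len(cases)
        actual_rate = sum(y for _, y in cases) / len(cases)
        table.append((mean_claim, actual_rate, len(cases)))
    return table


# Hypothetical toy data: ten customers rated 0.2 (two go bankrupt)
# and ten rated 0.3 (only one goes bankrupt).
claimed = [0.2] * 10 + [0.3] * 10
actual = [1, 1] + [0] * 8 + [1] + [0] * 9

for mean_claim, actual_rate, count in calibration_table(claimed, actual):
    print(f"claimed {mean_claim:.2f}  actual {actual_rate:.2f}  n={count}")
```

In this toy data the 0.2 group is well calibrated, while the 0.3 group shows the kind of over-prediction the Bankruptcy Model: Calibration slide reports near a claimed probability of 0.3.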

  8. Slide 29: Wrap-Up Data Mining
     - Data, data, data
       - Often most time consuming steps
         • Cleaning and merging data
       - Without relevant, timely data, no chance for success.
     - Clear objective
       - Identified in advance
       - Checked along the way, with “honest” methods
     - Rewards
       - Who benefits from success?
       - Who suffers if it fails?
