Gradient boosted trees with XGBoost

  1. Gradient boosted trees with XGBoost
CREDIT RISK MODELING IN PYTHON
Michael Crabtree, Data Scientist, Ford Motor Company

  2. Decision trees
- Create predictions similar to logistic regression
- Not structured like a regression

  3. Decision trees for loan status
- A simple decision tree can predict loan_status (probability of default)
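The slide shows the tree as a figure only; as a minimal sketch (not from the course), a single decision tree like this could be built with scikit-learn's DecisionTreeClassifier, assuming the X_train / y_train / X_test splits used throughout the course:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# A shallow tree: max_depth = 2 keeps it simple and readable
clf_tree = DecisionTreeClassifier(max_depth = 2)
clf_tree.fit(X_train, np.ravel(y_train))

# Probability of default (class 1) for each test loan
tree_probs = clf_tree.predict_proba(X_test)[:, 1]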

  4. Decision tree impact

Loan | True loan status | Pred. loan status | Loan payoff value | Selling value | Gain/Loss
1    | 0                | 1                 | $1,500            | $250          | -$1,250
2    | 0                | 1                 | $1,200            | $250          | -$950

  5. A forest of trees
- XGBoost uses many simplistic trees (an ensemble)
- Each tree will be slightly better than a coin toss
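The slide has no code, but the size of the ensemble is exposed as a hyperparameter; a one-line sketch, assuming xgboost is imported as xgb:

# n_estimators sets how many boosted trees make up the ensemble;
# each individual tree is kept weak (shallow) via max_depth
clf_gbt = xgb.XGBClassifier(n_estimators = 100, max_depth = 2)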

  6. Creating and training trees
- Part of the xgboost Python package, imported here as xgb
- Trains with .fit() just like the logistic regression model

# Create a logistic regression model
clf_logistic = LogisticRegression()
# Train the logistic regression
clf_logistic.fit(X_train, np.ravel(y_train))

# Create a gradient boosted tree model
clf_gbt = xgb.XGBClassifier()
# Train the gradient boosted tree
clf_gbt.fit(X_train, np.ravel(y_train))

  7. Default predictions with XGBoost
- Predicts with both .predict() and .predict_proba()
- .predict_proba() produces a value between 0 and 1 for each class
- .predict() produces a 1 or 0 for loan_status

# Predict probabilities of default
gbt_preds_prob = clf_gbt.predict_proba(X_test)

# Predict loan_status as a 1 or 0
gbt_preds = clf_gbt.predict(X_test)

# gbt_preds_prob (each row sums to 1: P(non-default), P(default))
array([[0.059, 0.941],
       [0.121, 0.879]])

# gbt_preds
array([1, 1, 0...])

  8. Hyperparameters of gradient boosted trees
- Hyperparameters: model parameters (settings) that cannot be learned from data
- Some common hyperparameters for gradient boosted trees:
  - learning_rate: smaller values make each boosting step more conservative
  - max_depth: sets how deep each tree can go; larger means a more complex model

xgb.XGBClassifier(learning_rate = 0.2, max_depth = 4)
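To make the effect concrete, here is a sketch (not from the slides) that trains two models with different hyperparameter settings and compares their test accuracy with the scikit-learn .score() method, assuming the same train/test splits used above:

# Compare a conservative and an aggressive hyperparameter setting
gbt_conservative = xgb.XGBClassifier(learning_rate = 0.05, max_depth = 2)
gbt_aggressive = xgb.XGBClassifier(learning_rate = 0.5, max_depth = 10)

gbt_conservative.fit(X_train, np.ravel(y_train))
gbt_aggressive.fit(X_train, np.ravel(y_train))

# .score() returns accuracy for classifiers
print(gbt_conservative.score(X_test, np.ravel(y_test)))
print(gbt_aggressive.score(X_test, np.ravel(y_test)))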

  9. Let's practice!

  10. Column selection for credit risk
CREDIT RISK MODELING IN PYTHON
Michael Crabtree, Data Scientist, Ford Motor Company

  11. Choosing specific columns
- We've been using all columns for predictions

# Selects a few specific columns
X_multi = cr_loan_prep[['loan_int_rate','person_emp_length']]

# Selects all data except loan_status
X = cr_loan_prep.drop('loan_status', axis = 1)

- How you can tell how important each column is:
  - Logistic regression: column coefficients
  - Gradient boosted trees: ?

  12. Column importances
- Use the .get_booster() and .get_score() methods
- Weight: the number of times the column appears in all trees

# Train the model
clf_gbt.fit(X_train, np.ravel(y_train))

# Print the feature importances
clf_gbt.get_booster().get_score(importance_type = 'weight')

{'person_home_ownership_RENT': 1, 'person_home_ownership_OWN': 2}

  13. Column importance interpretation

# Column importances from importance_type = 'weight'
{'person_home_ownership_RENT': 1, 'person_home_ownership_OWN': 2}

- With 'weight', each value is a count of tree splits: person_home_ownership_OWN was used in two splits, person_home_ownership_RENT in one
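A small usage sketch (not in the slides) for ranking these importances, assuming the fitted clf_gbt from the previous slide:

# Rank columns by weight importance, most-used first
scores = clf_gbt.get_booster().get_score(importance_type = 'weight')
ranked = sorted(scores.items(), key = lambda kv: kv[1], reverse = True)
print(ranked)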

  14. Plotting column importances
- Use the plot_importance() function

import matplotlib.pyplot as plt
xgb.plot_importance(clf_gbt, importance_type = 'weight')
plt.show()

{'person_income': 315, 'loan_int_rate': 195, 'loan_percent_income': 146}

  15. Choosing training columns
- Column importance is sometimes used to decide which columns to use for training
- Different column sets affect the performance of the models

Model Columns                                         | Importances | Model Accuracy | Default Recall
loan_int_rate, person_emp_length                      | (100, 100)  | 0.81           | 0.67
loan_int_rate, person_emp_length, loan_percent_income | (98, 70, 5) | 0.84           | 0.52

  16. F1 scoring for models
- Thinking about accuracy and recall for different column groups is time consuming
- The F1 score is a single metric that balances precision and recall
- Shows up as part of the classification_report()
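A minimal sketch of reading the F1 score from scikit-learn's classification_report, assuming the test labels and the gbt_preds predictions from earlier:

from sklearn.metrics import classification_report

# The report includes precision, recall, and f1-score per class
print(classification_report(np.ravel(y_test), gbt_preds,
                            target_names = ['Non-Default', 'Default']))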

  17. Let's practice!

  18. Cross validation for credit models
CREDIT RISK MODELING IN PYTHON
Michael Crabtree, Data Scientist, Ford Motor Company

  19. Cross validation basics
- Used to train and test the model in a way that simulates using the model on new data
- Segments training data into different pieces to estimate future performance
- Uses DMatrix, an internal data structure optimized for XGBoost
- Early stopping tells cross validation to stop after a scoring metric has not improved for a number of iterations

  20. How cross validation works
- Processes parts of the training data (called folds) and tests against the unused part
- Final testing is done against the actual test set

https://scikit-learn.org/stable/modules/cross_validation.html

  21. Setting up cross validation within XGBoost

# Set the number of folds
n_folds = 2

# Set early stopping number
early_stop = 5

# Set any specific parameters for cross validation
params = {'objective': 'binary:logistic',
          'seed': 99,
          'eval_metric': 'auc'}

- 'binary:logistic' specifies classification for loan_status
- 'eval_metric': 'auc' tells XGBoost to score the model's performance on AUC

  22. Using cross validation within XGBoost

# Restructure the train data for xgboost
DTrain = xgb.DMatrix(X_train, label = y_train)

# Perform cross validation
xgb.cv(params, DTrain, num_boost_round = 5, nfold = n_folds,
       early_stopping_rounds = early_stop)

- DMatrix() creates a special object for xgboost, optimized for training

  23. The results of cross validation
- xgb.cv() returns a data frame of the values from the cross validation
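The slide's results table isn't reproduced here; as an illustration, with 'eval_metric': 'auc' the returned data frame holds per-round train and test AUC columns, roughly like this sketch:

cv_results = xgb.cv(params, DTrain, num_boost_round = 5,
                    nfold = n_folds, early_stopping_rounds = early_stop)

# One row per boosting round; column names follow the eval metric
print(cv_results.columns)
# Index(['train-auc-mean', 'train-auc-std', 'test-auc-mean', 'test-auc-std'], ...)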

  24. Cross validation scoring
- Use cross validation and scoring metrics with the cross_val_score() function in scikit-learn

# Import the module
from sklearn.model_selection import cross_val_score

# Create a gbt model
gbt = xgb.XGBClassifier(learning_rate = 0.4, max_depth = 10)

# Use cross validation and accuracy scores 5 consecutive times
cross_val_score(gbt, X_train, y_train, cv = 5)

array([0.92748092, 0.92575308, 0.93975392, 0.93378608, 0.93336163])
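As a follow-up sketch (not shown in the slides), the usual way to summarize these five scores is a mean and spread:

# Summarize cross validation accuracy as mean +/- two standard deviations
cv_scores = cross_val_score(gbt, X_train, y_train, cv = 5)
print("Accuracy: %0.3f (+/- %0.3f)" % (cv_scores.mean(), cv_scores.std() * 2))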

  25. Let's practice!

  26. Class imbalance in loan data
CREDIT RISK MODELING IN PYTHON
Michael Crabtree, Data Scientist, Ford Motor Company

  27. Not enough defaults in the data
- The values of loan_status are the classes:
  - Non-default: 0
  - Default: 1

y_train['loan_status'].value_counts()

loan_status | Training Data Count | Percentage of Total
0           | 13,798              | 78%
1           | 3,877               | 22%
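The percentage column can be produced directly; a small sketch using pandas' normalize option on the same labels:

# Fraction of each class in the training labels
y_train['loan_status'].value_counts(normalize = True)
# 0    0.78
# 1    0.22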

  28. Model loss function
- Gradient boosted trees in xgboost use a log-loss function
- The goal is to minimize this value

True loan status | Predicted probability | Log loss
1                | 0.1                   | 2.3
0                | 0.9                   | 2.3

- An inaccurately predicted default has more negative financial impact
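To see where 2.3 comes from, here is a worked sketch of the binary log-loss formula, -(y * log(p) + (1 - y) * log(1 - p)):

import numpy as np

def log_loss_single(y_true, p_pred):
    # Binary log-loss for one prediction
    return -(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

print(log_loss_single(1, 0.1))  # 2.30: a true default predicted at only 0.1
print(log_loss_single(0, 0.9))  # 2.30: a true non-default predicted at 0.9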

  29. The cost of imbalance
- A false negative (default predicted as non-default) is much more costly

Person | Loan Amount | Potential Profit | Predicted Status | Actual Status | Losses
A      | $1,000      | $10              | Default          | Non-Default   | -$10
B      | $1,000      | $10              | Non-Default      | Default       | -$1,000

- The log-loss for the model is the same for both, but our actual losses are not

  30. Causes of imbalance
- Data problems:
  - Credit data was not sampled correctly
  - Data storage problems
- Business processes:
  - Measures already in place to not accept probable defaults
  - Probable defaults are quickly sold to other firms
- Behavioral factors:
  - Normally, people do not default on their loans
  - The less often they default, the higher their credit rating

  31. Dealing with class imbalance
- Several ways to deal with class imbalance in data:

Method                  | Pros                          | Cons
Gather more data        | Increases number of defaults  | Percentage of defaults may not change
Penalize models         | Increases recall for defaults | Model requires more tuning and maintenance
Sample data differently | Least technical adjustment    | Fewer defaults in data

  32. Undersampling strategy
- Combine a smaller random sample of non-defaults with the defaults

  33. Combining the split data sets
- The split training data (X and y) must be put back together
- Create two new sets based on actual loan_status

# Concat the training sets
X_y_train = pd.concat([X_train.reset_index(drop = True),
                       y_train.reset_index(drop = True)], axis = 1)

# Get the counts of defaults and non-defaults
count_nondefault, count_default = X_y_train['loan_status'].value_counts()

# Separate non-defaults and defaults
nondefaults = X_y_train[X_y_train['loan_status'] == 0]
defaults = X_y_train[X_y_train['loan_status'] == 1]
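The slide stops before the sampling step; a hedged sketch of how the undersample would presumably be completed, using the counts and frames defined above:

# Randomly undersample the non-defaults to match the number of defaults
nondefaults_under = nondefaults.sample(count_default, random_state = 99)

# Recombine into a balanced training set
X_y_train_under = pd.concat([nondefaults_under.reset_index(drop = True),
                             defaults.reset_index(drop = True)], axis = 0)

# Verify the classes are now balanced
print(X_y_train_under['loan_status'].value_counts())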
