Gradient boosted trees with XGBoost
CREDIT RISK MODELING IN PYTHON
Michael Crabtree
Data Scientist, Ford Motor Company
Decision trees
Creates predictions similar to logistic regression
Not structured like a regression
Simple decision tree for predicting loan_status probability of default
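The slides show this tree as a diagram only; as a hedged illustration, a single shallow tree could be fit with scikit-learn's DecisionTreeClassifier (the max_depth value and the reuse of the course's X_train, y_train, and X_test are assumptions):

# A minimal sketch, not from the original slides: fit one shallow decision tree
import numpy as np
from sklearn.tree import DecisionTreeClassifier

clf_tree = DecisionTreeClassifier(max_depth = 2)  # assumed depth for illustration
clf_tree.fit(X_train, np.ravel(y_train))

# Probability of default for each test loan, comparable to logistic regression output
tree_preds = clf_tree.predict_proba(X_test)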
Loan | True loan status | Pred. loan status | Loan payoff value | Selling value | Gain/Loss
1 | 1 |  | $1,500 | $250 |
2 | 1 |  | $1,200 | $250 |
XGBoost uses many simple trees (an ensemble)
Each tree will be slightly better than a coin toss
Part of the xgboost Python package, imported as xgb here
Trains with .fit() just like the logistic regression model
# Create a logistic regression model
clf_logistic = LogisticRegression()

# Train the logistic regression
clf_logistic.fit(X_train, np.ravel(y_train))

# Create a gradient boosted tree model
clf_gbt = xgb.XGBClassifier()

# Train the gradient boosted tree
clf_gbt.fit(X_train, np.ravel(y_train))
Predicts with both .predict() and .predict_proba()
.predict_proba() produces a value between 0 and 1
.predict() produces a 1 or 0 for loan_status
# Predict probabilities of default
gbt_preds_prob = clf_gbt.predict_proba(X_test)

# Predict loan_status as a 1 or 0
gbt_preds = clf_gbt.predict(X_test)

# gbt_preds_prob
array([[0.059, 0.940],
       [0.121, 0.989]])

# gbt_preds
array([1, 1, 0...])
Hyperparameters: model parameters (settings) that cannot be learned from data
Some common hyperparameters for gradient boosted trees:
learning_rate: smaller values make each step more conservative
max_depth: sets how deep each tree can go; larger means more complex
xgb.XGBClassifier(learning_rate = 0.2, max_depth = 4)
Column selection for credit risk
CREDIT RISK MODELING IN PYTHON
Michael Crabtree
Data Scientist, Ford Motor Company
We've been using all columns for predictions
# Selects a few specific columns
X_multi = cr_loan_prep[['loan_int_rate', 'person_emp_length']]

# Selects all data except loan_status
X = cr_loan_prep.drop('loan_status', axis = 1)
How can you tell how important each column is?
Logistic regression: column coefficients
Gradient boosted trees: ?
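For the logistic regression side, the coefficients can be read directly from the fitted model; a small sketch assuming the clf_logistic trained earlier:

# Column coefficients of the trained logistic regression
print(clf_logistic.coef_)

# Pair each column name with its coefficient for readability
print(dict(zip(X_train.columns, clf_logistic.coef_[0])))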
Use the .get_booster() and .get_score() methods
Weight: the number of times the column appears in all trees
# Train the model
clf_gbt.fit(X_train, np.ravel(y_train))

# Print the feature importances
clf_gbt.get_booster().get_score(importance_type = 'weight')

{'person_home_ownership_RENT': 1, 'person_home_ownership_OWN': 2}
# Column importances from importance_type = 'weight'
{'person_home_ownership_RENT': 1, 'person_home_ownership_OWN': 2}
Use the plot_importance() function
xgb.plot_importance(clf_gbt, importance_type = 'weight')

{'person_income': 315, 'loan_int_rate': 195, 'loan_percent_income': 146}
Column importance is sometimes used to decide which columns to use for training
Different column sets affect the performance of the models, as the sketch after this table illustrates

Columns | Importances | Model Accuracy | Model Default Recall
loan_int_rate, person_emp_length | (100, 100) | 0.81 | 0.67
loan_int_rate, person_emp_length, loan_percent_income | (98, 70, 5) | 0.84 | 0.52
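A minimal sketch of how numbers like these could be produced, assuming the earlier train/test split; the column list mirrors the first row of the table:

from sklearn.metrics import accuracy_score, recall_score

# Train on one specific column set
cols = ['loan_int_rate', 'person_emp_length']
clf_gbt_sub = xgb.XGBClassifier()
clf_gbt_sub.fit(X_train[cols], np.ravel(y_train))

# Accuracy over all test loans, and recall on defaults (the positive class)
preds = clf_gbt_sub.predict(X_test[cols])
print(accuracy_score(y_test, preds))
print(recall_score(y_test, preds))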
Thinking about accuracy and recall for different column groups is time consuming
The F1 score is a single metric that combines precision and recall
Shows up as part of the classification_report(); a minimal sketch follows
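A minimal sketch of printing that report, assuming y_test and the gbt_preds computed earlier; the target_names labels are illustrative:

from sklearn.metrics import classification_report

# Precision, recall, and F1 for each loan_status class
print(classification_report(y_test, gbt_preds,
                            target_names = ['Non-Default', 'Default']))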
Cross validation for credit models
CREDIT RISK MODELING IN PYTHON
Michael Crabtree
Data Scientist, Ford Motor Company
Used to train and test the model in a way that simulates using the model on new data
Segments training data into different pieces to estimate future performance
Uses DMatrix, an internal structure optimized for XGBoost
Early stopping tells cross validation to stop after a scoring metric has not improved for a number of iterations
Processes parts of the training data (called folds) and tests against the unused part
Final testing is done against the actual test set
https://scikit-learn.org/stable/modules/cross_validation.html
# Set the number of folds
n_folds = 2

# Set early stopping number
early_stop = 5

# Set any specific parameters for cross validation
params = {'objective': 'binary:logistic',
          'seed': 99,
          'eval_metric': 'auc'}
'binary:logistic' is used to specify classification for loan_status
'eval_metric': 'auc' tells XGBoost to score the model's performance on AUC
# Restructure the train data for xgboost
DTrain = xgb.DMatrix(X_train, label = y_train)

# Perform cross validation
xgb.cv(params, DTrain, num_boost_round = 5, nfold = n_folds,
       early_stopping_rounds = early_stop)
DMatrix() creates a special object for xgboost optimized for training
xgb.cv() creates a data frame of the values from the cross validation
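A minimal sketch of capturing and inspecting that data frame, reusing params, DTrain, and the fold settings from above:

# Store the cross validation results as a pandas DataFrame
cv_df = xgb.cv(params, DTrain, num_boost_round = 5, nfold = n_folds,
               early_stopping_rounds = early_stop)

# With 'eval_metric': 'auc', columns are train-auc-mean, train-auc-std,
# test-auc-mean, and test-auc-std, one row per boosting round
print(cv_df.head())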
Uses cross validation and scoring metrics with the cross_val_score() function in scikit-learn
# Import the module
from sklearn.model_selection import cross_val_score

# Create a gbt model
gbt = xgb.XGBClassifier(learning_rate = 0.4, max_depth = 10)

# Use cross validation and accuracy scores 5 consecutive times
cross_val_score(gbt, X_train, y_train, cv = 5)

array([0.92748092, 0.92575308, 0.93975392, 0.93378608, 0.93336163])
Class imbalance in loan data
CREDIT RISK MODELING IN PYTHON
Michael Crabtree
Data Scientist, Ford Motor Company
The values of loan_status are the classes
Non-default: 0
Default: 1
y_train['loan_status'].value_counts()
loan_status | Training Data Count | Percentage of Total
0 | 13,798 | 78%
1 | 3,877 | 22%
Gradient boosted trees in xgboost use log-loss as their loss function
The goal is to minimize this value

True loan status | Predicted probability | Log Loss
1 | 0.1 | 2.3
0 | 0.9 | 2.3

An inaccurately predicted default has a more negative financial impact
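For one observation with true label y and predicted default probability p, log-loss is -(y * log(p) + (1 - y) * log(1 - p)); a quick check reproduces the 2.3 values in the table (the helper function name is hypothetical):

import numpy as np

# Log-loss for a single observation
def log_loss_single(y, p):
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(log_loss_single(1, 0.1))  # true default predicted at 0.1 -> ~2.3
print(log_loss_single(0, 0.9))  # true non-default predicted at 0.9 -> ~2.3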
A false negative (default predicted as non-default) is much more costly

Person | Loan Amount | Potential Profit | Predicted Status | Actual Status | Losses
A | $1,000 | $10 | Default | Non-Default | -$10
B | $1,000 | $10 | Non-Default | Default | -$1,000
The log-loss for the model is the same for both, but our actual losses are not
Data problems:
Credit data was not sampled correctly
Data storage problems

Business processes:
Measures already in place to not accept probable defaults
Probable defaults are quickly sold to other firms

Behavioral factors:
Normally, people do not default on their loans
The less often they default, the higher their credit rating
Several ways to deal with class imbalance in data; a sketch of the penalize-models row follows the table

Method | Pros | Cons
Gather more data | Increases number of defaults | Percentage of defaults may not change
Penalize models | Increases recall for defaults | Model requires more tuning and maintenance
Sample data differently | Least technical adjustment | Fewer defaults in data
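One way to implement the penalize-models row, not shown in the original material and offered here as an assumption, is XGBoost's scale_pos_weight parameter, which raises the cost of errors on the positive (default) class:

# Ratio of non-defaults to defaults in the training labels (~78/22 here)
counts = y_train['loan_status'].value_counts()
ratio = counts[0] / counts[1]

# Weight default errors more heavily during training
clf_gbt_weighted = xgb.XGBClassifier(scale_pos_weight = ratio)
clf_gbt_weighted.fit(X_train, np.ravel(y_train))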
Combine smaller random sample of non-defaults with defaults
X_train and y_train must be put back together
Create two new sets based on the actual loan_status
# Concat the training sets
X_y_train = pd.concat([X_train.reset_index(drop = True),
                       y_train.reset_index(drop = True)], axis = 1)

# Get the counts of defaults and non-defaults
count_nondefault, count_default = X_y_train['loan_status'].value_counts()

# Separate non-defaults and defaults
nondefaults = X_y_train[X_y_train['loan_status'] == 0]
defaults = X_y_train[X_y_train['loan_status'] == 1]
Randomly sample the data set of non-defaults
Concatenate with the data set of defaults
# Undersample the non-defaults using sample() in pandas
nondefaults_under = nondefaults.sample(count_default)

# Concat the undersampled non-defaults with the defaults
X_y_train_under = pd.concat([nondefaults_under.reset_index(drop = True),
                             defaults.reset_index(drop = True)], axis = 0)
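To retrain on the rebalanced set, it would be split back into features and labels first; a minimal sketch (the _under variable names are assumptions):

# Split the rebalanced frame back into X and y
X_train_under = X_y_train_under.drop('loan_status', axis = 1)
y_train_under = X_y_train_under['loan_status']

# Retrain the gradient boosted tree on the balanced training data
clf_gbt_under = xgb.XGBClassifier()
clf_gbt_under.fit(X_train_under, np.ravel(y_train_under))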