Variable selection: Introduction to Predictive Analytics in Python



slide-1
SLIDE 1

Variable selection

INTRODUCTION TO PREDICTIVE ANALYTICS IN PYTHON

Nele Verbiest, Ph.D

Data Scientist @PythonPredictions

slide-2
SLIDE 2

INTRODUCTION TO PREDICTIVE ANALYTICS IN PYTHON

Candidate predictors

age
max_gift
income_low
min_gift, mean_gift, median_gift
country_USA, country_India, country_UK
number_gift_min50, number_gift_min100, number_gift_min150

slide-3
SLIDE 3

INTRODUCTION TO PREDICTIVE ANALYTICS IN PYTHON

Variable selection: motivation

Drawbacks of models with many variables:
Over-fitting
Hard to maintain or implement
Hard to interpret, multi-collinearity

slide-4
SLIDE 4

INTRODUCTION TO PREDICTIVE ANALYTICS IN PYTHON

Model evaluation: AUC

import numpy as np
from sklearn.metrics import roc_auc_score

roc_auc_score(true_target, prob_target)
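To make the AUC number easier to read, here is a minimal pure-Python sketch of what it measures: the share of (positive, negative) pairs in which the positive case receives the higher predicted probability, with ties counted as half. The function name pairwise_auc and the example values are illustrative assumptions, not part of the course material, and this is not how scikit-learn computes the score internally.

```python
def pairwise_auc(true_target, prob_target):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [p for t, p in zip(true_target, prob_target) if t == 1]
    negatives = [p for t, p in zip(true_target, prob_target) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up targets and predicted probabilities, for illustration only
true_target = [0, 0, 1, 1]
prob_target = [0.1, 0.4, 0.35, 0.8]
example_auc = pairwise_auc(true_target, prob_target)
print(example_auc)  # 0.75: three of the four pairs are ranked correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of targets above non-targets.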

slide-5
SLIDE 5

Let's practice!

INTRODUCTION TO PREDICTIVE ANALYTICS IN PYTHON

slide-6
SLIDE 6

Forward stepwise variable selection

INTRODUCTION TO PREDICTIVE ANALYTICS IN PYTHON

Nele Verbiest, Ph.D

Data Scientist @PythonPredictions

slide-7
SLIDE 7

INTRODUCTION TO PREDICTIVE ANALYTICS IN PYTHON

The forward stepwise variable selection procedure

Empty set
Find best variable v1
Find best variable v2 in combination with v1
Find best variable v3 in combination with v1, v2
...
(Until all variables are added or until a predefined number of variables is added)
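The greedy steps on this slide can be sketched independently of any model. In the sketch below, forward_stepwise, toy_gain and toy_score are hypothetical names, and toy_score is a made-up stand-in for "AUC of a model fitted on these variables"; it includes a penalty when max_gift and mean_gift are used together to mimic two correlated predictors.

```python
def forward_stepwise(candidates, score, max_variables):
    """Greedy forward selection: repeatedly add the variable that gives
    the highest score in combination with the ones already selected."""
    selected = []
    remaining = list(candidates)  # copy, so the caller's list is untouched
    while remaining and len(selected) < max_variables:
        best = max(remaining, key=lambda v: score(selected + [v]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Made-up individual contributions (illustration only, not real results)
toy_gain = {"max_gift": 0.10, "age": 0.05, "mean_gift": 0.08, "country_USA": 0.01}


def toy_score(variables):
    """Hypothetical stand-in for the AUC of a model using `variables`."""
    total = 0.5 + sum(toy_gain[v] for v in variables)
    # Mimic redundancy: two correlated predictors add less together
    if "max_gift" in variables and "mean_gift" in variables:
        total -= 0.06
    return total


order = forward_stepwise(list(toy_gain), toy_score, max_variables=3)
print(order)  # ['max_gift', 'age', 'mean_gift']
```

Note that age is picked before mean_gift even though mean_gift is the stronger variable on its own: each step evaluates a candidate in combination with the variables already chosen.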

slide-8
SLIDE 8

INTRODUCTION TO PREDICTIVE ANALYTICS IN PYTHON

Functions in Python

def function_sum(a, b):
    s = a + b
    return s

print(function_sum(1, 2))
3

slide-9
SLIDE 9

INTRODUCTION TO PREDICTIVE ANALYTICS IN PYTHON

Implementation of the forward stepwise procedure

Function auc that calculates the AUC given a certain set of variables
Function next_best that returns the next best variable in combination with the current variables
Loop until the desired number of variables is reached

slide-10
SLIDE 10

INTRODUCTION TO PREDICTIVE ANALYTICS IN PYTHON

Implementation of the AUC function

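from sklearn import linear_model
from sklearn.metrics import roc_auc_score

def auc(variables, target, basetable):
    # Fit a logistic regression on the given variables
    X = basetable[variables]
    y = basetable[target]
    logreg = linear_model.LogisticRegression()
    logreg.fit(X, y)
    # Probability of the positive class, scored on the same data
    predictions = logreg.predict_proba(X)[:, 1]
    return roc_auc_score(y, predictions)

# Store the result under a different name so the function auc is not overwritten
auc_value = auc(["age", "gender_F"], ["target"], basetable)
print(round(auc_value, 2))
0.54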

slide-11
SLIDE 11

INTRODUCTION TO PREDICTIVE ANALYTICS IN PYTHON

Calculating the next best variable

def next_best(current_variables, candidate_variables, target, basetable):
    best_auc = -1
    best_variable = None
    # Try each candidate in combination with the current variables
    for v in candidate_variables:
        auc_v = auc(current_variables + [v], target, basetable)
        if auc_v >= best_auc:
            best_auc = auc_v
            best_variable = v
    return best_variable

current_variables = ["age", "gender_F"]
candidate_variables = ["min_gift", "max_gift", "mean_gift"]
# The target argument must be passed as well
next_variable = next_best(current_variables, candidate_variables, ["target"], basetable)
print(next_variable)
min_gift

slide-12
SLIDE 12

INTRODUCTION TO PREDICTIVE ANALYTICS IN PYTHON

The forward stepwise variable selection procedure

candidate_variables = ["mean_gift", "min_gift", "max_gift",
                       "age", "gender_F", "country_USA", "income_low"]
current_variables = []
target = ["target"]
max_number_variables = 5
number_iterations = min(max_number_variables, len(candidate_variables))
for i in range(0, number_iterations):
    # Pick the best remaining variable, add it, and remove it from the candidates
    next_variable = next_best(current_variables, candidate_variables, target, basetable)
    current_variables = current_variables + [next_variable]
    candidate_variables.remove(next_variable)
print(current_variables)
['max_gift', 'mean_gift', 'min_gift', 'age', 'gender_F']

slide-13
SLIDE 13

Let's practice!

INTRODUCTION TO PREDICTIVE ANALYTICS IN PYTHON

slide-14
SLIDE 14

Deciding on the number of variables

INTRODUCTION TO PREDICTIVE ANALYTICS IN PYTHON

Nele Verbiest, Ph.D

Data Scientist @PythonPredictions

slide-15
SLIDE 15

INTRODUCTION TO PREDICTIVE ANALYTICS IN PYTHON

Evaluating the AUC

auc_values = []
variables_evaluate = []
# variables_forward holds the variables in the order chosen by forward selection
for v in variables_forward:
    variables_evaluate.append(v)
    auc_value = auc(variables_evaluate, ["target"], basetable)
    auc_values.append(auc_value)

slide-16
SLIDE 16

INTRODUCTION TO PREDICTIVE ANALYTICS IN PYTHON

Evaluating the AUC

slide-17
SLIDE 17

INTRODUCTION TO PREDICTIVE ANALYTICS IN PYTHON

Over-fitting

slide-18
SLIDE 18

INTRODUCTION TO PREDICTIVE ANALYTICS IN PYTHON

Detecting over-fitting

slide-19
SLIDE 19

INTRODUCTION TO PREDICTIVE ANALYTICS IN PYTHON

Partitioning

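import pandas as pd
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split

X = basetable.drop("target", axis=1)
y = basetable["target"]
# stratify=y keeps the target incidence the same in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, stratify=y)
train = pd.concat([X_train, y_train], axis=1)
test = pd.concat([X_test, y_test], axis=1)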
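To illustrate what stratifying on the target achieves, here is a small pure-Python sketch that splits each class separately so that train and test end up with the same target incidence. The helper stratified_split and the example labels are hypothetical and only for illustration; this is not scikit-learn's implementation.

```python
import random


def stratified_split(labels, test_size, seed=0):
    """Return (train_idx, test_idx) with roughly the same class share in both."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    train_idx, test_idx = [], []
    # Split every class on its own, then merge the pieces
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Made-up target with 10% positives
labels = [1] * 10 + [0] * 90
train_idx, test_idx = stratified_split(labels, test_size=0.4)
train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), train_rate, test_rate)
```

With 10 positives among 100 rows and a test size of 0.4, both partitions keep a 10% target incidence (6 of 60 in train, 4 of 40 in test), which matters when targets are rare.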

slide-20
SLIDE 20

INTRODUCTION TO PREDICTIVE ANALYTICS IN PYTHON

Deciding the cut-off

High test AUC
Low number of variables
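One possible way to balance these two goals is to take the smallest number of variables whose test AUC is within a small tolerance of the best test AUC. This rule, the function name choose_cutoff, the tolerance value and the AUC numbers below are all assumptions made for illustration; the slides do not prescribe a specific rule.

```python
def choose_cutoff(test_aucs, tolerance=0.005):
    """Smallest number of variables whose test AUC is within `tolerance`
    of the best test AUC. test_aucs[i] is the AUC using the first i+1 variables."""
    best = max(test_aucs)
    for n_variables, value in enumerate(test_aucs, start=1):
        if value >= best - tolerance:
            return n_variables


# Made-up test AUC values after adding 1, 2, ... variables (illustration only)
test_aucs = [0.62, 0.68, 0.71, 0.712, 0.713, 0.709]
cutoff = choose_cutoff(test_aucs)
print(cutoff)  # 3
```

Here the test AUC barely improves after the third variable, so three variables give nearly the best test performance with the simplest model.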

slide-21
SLIDE 21

INTRODUCTION TO PREDICTIVE ANALYTICS IN PYTHON

Deciding the cut-off

slide-22
SLIDE 22

Let's practice!

INTRODUCTION TO PREDICTIVE ANALYTICS IN PYTHON