Counterfactual evaluation of machine learning models
Michael Manapat @mlmanapat
O N L I N E F R A U D

Rules engine (block if amount > $100 and more than 3 charges from IP)
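A rules engine like the one described is essentially a list of hand-written predicates; a minimal sketch (the function and field names are ours, not Stripe's):

```python
# Minimal sketch of a hand-written fraud rule like the one described
# above; the function and its parameters are illustrative only.
def should_block(amount_usd, charges_from_ip_today):
    # Block if the amount exceeds $100 AND the IP has made
    # more than 3 charges already
    return amount_usd > 100 and charges_from_ip_today > 3
```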
R A D A R
M A C H I N E L E A R N I N G F O R F R A U D
M O D E L B U I L D I N G
It's December 31st, 2013
Train a model to predict disputes on data from Jan 1st to Sep 30th
Validate on charges through Oct 31st (need to wait ~60 days for dispute labels to mature)
Policy for actioning scores: block the charge if score > 50
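The date-based split above can be sketched as follows, assuming a pandas DataFrame of charges (the column names and sample rows are ours, not from the talk):

```python
import pandas as pd

# Hypothetical charge log; column names and rows are illustrative.
charges = pd.DataFrame({
    "created": pd.to_datetime(["2013-02-01", "2013-05-15", "2013-10-10"]),
    "amount": [120, 40, 500],
    "disputed": [True, False, True],
})

# On December 31st, dispute labels are mature (~60 days) for charges
# created through October 31st, so:
train = charges[charges["created"] <= "2013-09-30"]       # fit the model
valid = charges[(charges["created"] > "2013-09-30")
                & (charges["created"] <= "2013-10-31")]   # measure it
```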
Q U E S T I O N S
N E X T I T E R A T I O N
December 31st, 2014: we repeat the exercise from a year earlier
Train a new model on data from Jan 1st to Sep 30th
Validate on charges through Oct 31st (need to wait ~60 days for labels)
F U N D A M E N T A L P R O B L E M

Once a model is blocking charges, we never observe the outcomes of the charges it blocked, so the next model is trained and evaluated only on the biased sample the old model allowed
F I R S T A T T E M P T
if score > 50:
    if random.random() < 0.05:
        allow()
    else:
        block()
R E C A L L
1,000,000 charges

             Score < 50    Score > 50
Total           900,000       100,000
No outcome            -        95,000
Legitimate      890,000         1,000
Fraudulent       10,000         4,000
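One way to read the table: the randomly reversed 5% of high-score charges are an unbiased sample of everything the policy blocked, so scaling their observed outcomes by 1/0.05 estimates the full blocked population (the arithmetic below is ours):

```python
# Estimate precision/recall of "block if score > 50" from the table above.
reversal_rate = 0.05          # fraction of blocks randomly allowed

fraud_observed_above = 4_000  # fraud among reversed charges, score > 50
legit_observed_above = 1_000  # legitimate among reversed charges, score > 50
fraud_below = 10_000          # fraud among charges with score < 50

# Scale the reversed sample up to the whole blocked region.
est_fraud_above = fraud_observed_above / reversal_rate  # ~80,000
est_legit_above = legit_observed_above / reversal_rate  # ~20,000

precision = est_fraud_above / (est_fraud_above + est_legit_above)  # ~0.80
recall = est_fraud_above / (est_fraud_above + fraud_below)         # ~0.89
```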
T R A I N I N G
Train the next model on all charges with observed outcomes, weighting by 1/P(allow): weight 20 (= 1/0.05) for charges that would have been blocked if not for the random reversal, weight 1 for the rest
from sklearn.ensemble import \
    RandomForestRegressor
...
r = RandomForestRegressor(n_estimators=100)
r.fit(X, Y, sample_weight=weights)
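The `weights` array passed to `fit` isn't defined on the slide; under the 5% reversal scheme it could be built like this (a sketch, with illustrative array contents):

```python
import numpy as np

# p_allow: probability each training charge was allowed -- 1.0 for charges
# the policy allowed outright, 0.05 for blocks that were randomly reversed.
p_allow = np.array([1.0, 1.0, 0.05, 0.05])

# Inverse-propensity weights: each reversed charge stands in for the ~95%
# of blocked charges we never observed, so it counts ~20x.
weights = 1.0 / p_allow
```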
T R A I N I N G
from sklearn import cross_validation  # sklearn < 0.18; sklearn.model_selection in newer versions

X_train, X_test, y_train, y_test = \
    cross_validation.train_test_split(
        data, target, test_size=0.2)

r = RandomForestRegressor(...)
...
r.score(X_test, y_test, sample_weight=weights)
P O L I C Y C U R V E
[Policy curve plot: charges with very high scores are very likely to be fraud; charges near the threshold could go either way]
B E T T E R A P P R O A C H
Map classifier scores to P(Allow): the higher the score, the lower the probability we let the charge through
Concentrate reversals on the score range we want to improve on
Spend little of the reversal "budget" on "obvious" fraud (save the budget for evaluation where it matters)
B E T T E R A P P R O A C H
def propensity(score):
    # Piecewise linear/sigmoidal
    ...

ps = propensity(score)  # P(allow) for this charge
original_block = score > 50
selected_block = random.random() >= ps
if selected_block:
    block()
else:
    allow()
log_record(
    id, score, ps, original_block, selected_block)
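One possible shape for the elided propensity function: P(allow) = 1 below the block threshold, a linear ramp down through the "could go either way" region, and a tiny-but-nonzero floor for "obvious" fraud. The breakpoints here are illustrative (ours), chosen to reproduce the P(Allow) values in the example log:

```python
def propensity(score):
    """Return P(allow) for a charge with the given fraud score.

    Illustrative piecewise-linear sketch, not the talk's actual function.
    """
    if score <= 50:
        return 1.0                                     # policy allows anyway
    if score <= 80:
        return max(0.30 - (score - 55) * 0.01, 0.05)   # linear decay
    return 0.0005                                      # still occasionally observed
```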
ID   Score   P(Allow)   Original Action   Selected Action   Outcome
1       10     1.0       Allow             Allow             OK
2       45     1.0       Allow             Allow             Fraud
3       55     0.30      Block             Block             -
4       65     0.20      Block             Allow             Fraud
5      100     0.0005    Block             Block             -
6       60     0.25      Block             Allow             OK
A N A L Y S I S
ID   Score   P(Allow)   Weight   Original Action   Selected Action   Outcome
1       10     1.0          1     Allow             Allow             OK
2       45     1.0          1     Allow             Allow             Fraud
4       65     0.20         5     Block             Allow             Fraud
6       60     0.25         4     Block             Allow             OK
Evaluating the "block if score > 50" policy
Evaluating the "block if score > 40” policy
Evaluating the "block if score > 62” policy
A N A L Y S I S O F P R O D U C T I O N D A T A
T R A I N I N G N E W M O D E L S
T E C H N I C A L I T I E S
C O N C L U S I O N
Don't put a model into production without a counterfactual evaluation plan
Random reversals turn production traffic into counterfactual evaluation data
Propensity functions concentrate the "reversal budget" where it matters most
Arbitrarily many models and policies can be evaluated offline in this framework, from the same logged data
T H A N K S
Michael Manapat @mlmanapat mlm@stripe.com