16.06.2020 | Fachbereich FB20 | SE4AI - A/B Testing | Sadik, Md Emdadul & Sarker, Md Enamul Huq |
A/B Testing
Md Emdadul Sadik Md Enamul Huq Sarker Summer 2020
Overview
1. A/B testing (what is it, and why is it used?)
2. Multivariate testing
3. A/B Testing of ML Models
single variable)
purchase by 20 percent.
x̄ = 25 minutes.
P(x̄ ≥ 25 minutes | H0 is true)
the data or (+) something else that is equal (probability) or (+) something rarer (less probability)
              | Fail to reject H0  | Reject H0
H0 is true    | Correct conclusion | Type I error
H0 is false   | Type II error      | Correct conclusion
lead to best decision.
achieve A, instead of adding a button ‘X’ focus on Y.
file.
affecting business operations.
deployed.
(M) (V) (T) (P)
Imagine we have some clinical data that helps decide whether a patient has heart disease.
We deploy Random Forest (model A) and K-Nearest Neighbors (model B) to find out. TP looks good for model A.
Model A - RF Model B - KNN
We deploy Random Forest (model A) and K-Nearest Neighbors (model B) to find out. TN also looks good for model A.
We deploy Random Forest (model A) and K-Nearest Neighbors (model B) to find out. Model A wins!
Model Quantification - MQ
Sensitivity = TPR = TP / (TP + FN) = TP / Actual Positives
Effect size: the difference between the two models' performance metrics.
Statistical significance: α = 1 − CL
Confidence level, CL: the probability of correctly retaining H0 when it is true, e.g. 95%
Ha
Accuracy = Total Correct Predictions / Total Data Set Size
Specificity = TNR = TN / (TN + FP) = TN / Actual Negatives
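To make effect size and statistical significance concrete, here is a minimal Python sketch that compares two models' accuracy with a two-sided two-proportion z-test. The helper name `compare_models` and the counts (830 and 870 correct predictions out of 1000 each) are illustrative assumptions, not figures from the slides, and the z-test is only one of several tests that could be used here.

```python
from math import sqrt
from statistics import NormalDist

def compare_models(correct_a, n_a, correct_b, n_b, confidence_level=0.95):
    """Two-sided two-proportion z-test on two models' accuracy (hypothetical helper)."""
    alpha = 1 - confidence_level                 # significance level: alpha = 1 - CL
    acc_a, acc_b = correct_a / n_a, correct_b / n_b
    effect_size = acc_b - acc_a                  # difference between the two metrics
    pooled = (correct_a + correct_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = effect_size / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return effect_size, p_value, p_value < alpha

# Hypothetical counts: model A got 830 of 1000 right, model B got 870 of 1000.
effect, p_value, significant = compare_models(830, 1000, 870, 1000)
print(f"effect size = {effect:.3f}, p-value = {p_value:.4f}, reject H0: {significant}")
```

With these assumed counts the observed effect size is 0.04, and the test rejects H0 at α = 0.05.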
Model B - Random Forest Model A - Logistic Regression
Again we have confusion matrices from the clinical data we saw. This time we apply Logistic Regression (model A) and Random Forest (model B) and measure the models' performance with sensitivity and specificity.
Src: StatQuest
Sensitivity(LR) = 139 / (139 + 32) = 0.81
Sensitivity(RF) = 142 / (142 + 29) = 0.83
Sensitivity = TPR = TP / (TP + FN) = TP / Actual Positives
Specificity(RF) = 110 / (110 + 22) = 0.83
Specificity(LR) = 112 / (112 + 20) = 0.85
Specificity = TNR = TN / (TN + FP) = TN / Actual Negatives
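The sensitivity and specificity figures above can be reproduced with a short Python sketch. The counts are the ones on the slides. The function names and the dictionary layout are illustrative choices.

```python
def sensitivity(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)

# Counts taken from the sensitivity and specificity calculations on the slides.
models = {
    "LR": {"tp": 139, "fn": 32, "tn": 112, "fp": 20},
    "RF": {"tp": 142, "fn": 29, "tn": 110, "fp": 22},
}

results = {}
for name, m in models.items():
    results[name] = (sensitivity(m["tp"], m["fn"]), specificity(m["tn"], m["fp"]))
    print(f"{name}: sensitivity = {results[name][0]:.2f}, "
          f"specificity = {results[name][1]:.2f}")
# LR: sensitivity = 0.81, specificity = 0.85
# RF: sensitivity = 0.83, specificity = 0.83
```

Neither model dominates: RF is slightly better at finding patients with heart disease, while LR is slightly better at clearing patients without it.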
MQ: Accuracy
Image Courtesy : Minsuk Heo
Accuracy = Total Correct Predictions / Total Data Set Size
The picture shows two models deployed to classify multiple classes (A-D). By comparing the accuracies one could decide that Model 1 wins.
For balanced data, accuracy alone can identify the better model. But real data is not always balanced!
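A small Python sketch shows why accuracy can mislead on imbalanced data. The labels are hypothetical (10 positives among 1000 samples), and the "model" simply predicts the majority class every time.

```python
# Hypothetical, heavily imbalanced labels: 10 positives among 1000 samples.
y_true = [1] * 10 + [0] * 990
# A useless "model" that always predicts the majority (negative) class.
y_pred = [0] * 1000

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
sensitivity = tp / (tp + fn)

print(f"accuracy = {accuracy:.2f}, sensitivity = {sensitivity:.2f}")
# accuracy = 0.99, sensitivity = 0.00
```

The model scores 99% accuracy without detecting a single positive case, which is why sensitivity and specificity are reported alongside accuracy.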
Model Deployment - MD
The picture shows an A/B test of two models. If we add more models C, D, ..., N in the same way, the test becomes an A/B/n or multivariate test.
Oracle White Paper on Model Testing
A trivial model deployment example using a Python Flask HTTP endpoint.
mlinproduction.com
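The core of such an endpoint is the decision of which model serves a given request. The sketch below shows only that routing step in plain Python, so it can be called from any request handler, for example inside a Flask view function. The names `assign_variant` and `predict`, the salt string, and the two stand-in models are hypothetical and not taken from the referenced example.

```python
import hashlib

def assign_variant(user_id: str, share_b: float = 0.5, salt: str = "ab-test-1") -> str:
    """Deterministically map a user to model 'A' or 'B' (hypothetical helper)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000   # value in [0, 1)
    return "B" if bucket < share_b else "A"

def predict(user_id: str, features: list, models: dict) -> dict:
    """Route the request to the assigned model and record which variant served it."""
    variant = assign_variant(user_id)
    return {"variant": variant, "prediction": models[variant](features)}

# Stand-in "models": any callable that maps features to a prediction would do.
models = {
    "A": lambda x: sum(x) > 1.0,
    "B": lambda x: max(x) > 0.7,
}

print(predict("user-42", [0.2, 0.9], models))
```

Hashing the user ID rather than drawing a random number per request keeps each user on the same model for the whole test, so that user's outcomes are attributed to a single variant.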
MD - Post Deployment Discrepancies
Discrepancies are revealed after deployment.
Oracle White Paper on Model Testing
MD - Deploy an A/B Test
Designing a Model A/B Test
At a high level, designing an A/B test for models involves the following steps:
as the one used during the model training phase (e.g., F1, AUC, RMSE, etc.)
selected minimum effect size, significance level, power, and computed/estimated sample variance.
Effect size: the difference between the two models' performance metrics.
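The step that derives the sample size from the minimum effect size, significance level, power and sample variance can be sketched as follows. This assumes a two-sided z-test comparing the means of two equally sized groups with a common variance, which is one standard approximation and not necessarily the one used in the white paper. The function name and the example inputs are illustrative.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(min_effect, variance, alpha=0.05, power=0.80):
    """Units needed per model for a two-sided z-test on a difference in means.

    Uses n = 2 * variance * (z_{1-alpha/2} + z_{power})^2 / min_effect^2,
    assuming both groups share the same variance and have equal size.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * variance * (z_alpha + z_power) ** 2 / min_effect ** 2)

# Assumed inputs: detect a 2-point difference in a rate metric,
# using 0.25 as the worst-case variance of a 0/1 outcome.
n = sample_size_per_group(min_effect=0.02, variance=0.25)
print(f"collect at least {n} units per model before evaluating the test")
```

Halving the minimum effect size roughly quadruples the required N, so the choice of effect size largely determines how long the test has to run.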
MD - Mistake of Early Declaration
(Sig) in one of 20 independent tests for a fixed and identical N (N=1000).
spurious false positives.
It’s a mistake, don’t you pull the plug!
Declaring a model a resounding success before collecting N units of
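The effect of declaring a winner early can be illustrated with a small simulation. The sketch below runs A/A tests, in which both groups have the same true success rate, so every significant result is a false positive. It compares testing once at N = 1000 with checking every 100 units and counting the test as significant if any check crosses α. The checking interval, success rate, number of simulated experiments and function names are assumptions made for illustration.

```python
import random
from statistics import NormalDist

def p_value(successes_a, successes_b, n):
    """Two-sided two-proportion z-test p-value for two groups of equal size n."""
    pooled = (successes_a + successes_b) / (2 * n)
    std_err = (2 * pooled * (1 - pooled) / n) ** 0.5
    if std_err == 0:
        return 1.0  # no variation observed, so nothing can be concluded
    z = (successes_a - successes_b) / n / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))

def false_positive_rates(experiments=300, n=1000, peek_every=100,
                         rate=0.5, alpha=0.05, seed=0):
    """Simulate A/A tests (no real difference) with and without peeking."""
    rng = random.Random(seed)
    fixed_hits = peeking_hits = 0
    for _ in range(experiments):
        a = b = 0
        significant_at_some_peek = False
        significant_at_end = False
        for i in range(1, n + 1):
            a += rng.random() < rate
            b += rng.random() < rate
            if i % peek_every == 0 or i == n:
                significant = p_value(a, b, i) < alpha
                significant_at_some_peek = significant_at_some_peek or significant
                if i == n:
                    significant_at_end = significant
        fixed_hits += significant_at_end
        peeking_hits += significant_at_some_peek
    return fixed_hits / experiments, peeking_hits / experiments

fixed_rate, peeking_rate = false_positive_rates()
print(f"test once at N: {fixed_rate:.3f}, "
      f"declare at first significant peek: {peeking_rate:.3f}")
```

Testing once at N keeps the false positive rate near α, that is, about one significant result in 20 tests. The peeking rate can never be lower, because the final check is one of the peeks, and in practice it is usually several times higher.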
MD - Holy Grails of Model A/B Testing
Often, there is a novelty effect in the first few days of model deployment and a higher risk of false positives.
done before randomisation is needed.
new model instead.)
metrics.
A/B Testing: The Most Powerful Way to Turn Clicks Into Customers. John Wiley & Sons. ISBN 978-1-118-65920-5.
Designing and Deploying Online Field Experiments. By Eytan Bakshy, Dean Eckles, Michael S. Bernstein.
When A/B Testing Isn't Worth It.
Oracle Whitepaper - Testing Predictive Models in Production. By Ruslana Dalinina, Jean-René Gauthier, and Pramit Choudhary.
A/B Testing Machine Learning Models (Deployment Series: Guide 08). ML in Production.
Khan Academy - Unit: Significance tests (hypothesis testing).
Confusion Matrix - StatQuest on YouTube.
Sensitivity and Specificity - StatQuest on YouTube.
Statistical Significance in A/B Testing – a Complete Guide.