Validity
How to make sure your testing investment yields reliable data (interactive panel with experts)
Daniel Burstein, Aaron Gray, Bob Kemper, Eric Miller
Daniel Burstein, Director of Editorial Content, MECLABS @DanielBurstein
Eric Miller, Director of Product Management, Monetate @dynamiller @monetate
Bob Kemper, Director of Sciences, MECLABS
Aaron Gray, Head of Day 2 @agray
*…except when it’s not
Incremental Revenue: -$35k
New Customer Acquisition: 14.79% lift at p95
AOV: -8.81% lift at p99
Incremental Revenue: $43k
AOV: 13.15% lift at p99
RPS: 13.47% lift at p90
Conversion: 4.30% lift at p80
segments
and won with new customer acquisition, and is now shown only
to that segment.
customers with significant lift in AOV over time.
take advantage of learnings
n=2: “Well, you’re alive today even though you didn’t have one of those fancy car seats.” – My Mom
n=7,813: “Compared with seat belts, child restraints…were associated with a 28% reduction in risk for death.” – Michael R. Elliott, PhD; Michael J. Kallan, MS; Dennis R. Durbin, MD, MSCE; Flaura K. Winston, MD, PhD
Number of test subjects needed to get “statistically significant” results
Factors in determining Sample Size
P f diff b t i ti
distribution of time is a factor
Be realistic about what kind of test your site can support
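To make "what kind of test your site can support" concrete, here is a rough sketch of a per-treatment sample size estimate. It assumes a two-sided, two-proportion z-test with 80% power. The slides do not state which formula the panel uses, so the function name, the power setting and the formula itself are illustrative assumptions.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_arm(p_control, p_treatment, confidence=0.95, power=0.80):
    """Approximate visitors needed in EACH arm to detect the given difference.

    Assumption: two-sided two-proportion z-test (normal approximation).
    This is a common textbook formula, not necessarily the one on the slide.
    """
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_control + p_treatment) / 2
    spread_null = sqrt(2 * p_bar * (1 - p_bar))
    spread_alt = sqrt(p_control * (1 - p_control)
                      + p_treatment * (1 - p_treatment))
    numerator = (z_alpha * spread_null + z_beta * spread_alt) ** 2
    return ceil(numerator / (p_control - p_treatment) ** 2)


# Smaller expected differences need far more traffic than larger ones.
small_gap = sample_size_per_arm(0.05, 0.06)
large_gap = sample_size_per_arm(0.05, 0.10)
```

Dividing the result by your daily traffic per treatment gives a rough test duration, which is one way to judge whether a planned test is realistic.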
“Piled Higher and Deeper” by Jorge Cham www.phdcomics.com
What is it?
Statistical Level of Confidence – The statistical probability that there really is a performance difference between the control and experimental treatments based upon the data collected to date.
(“unofficial” but useful)
How (or where) do I get it?
The math – Determine the Mean difference, the standard deviation and the sample size and use the formula...
Confidence Interval Limits
Or… get it from your metrics / testing tool.
The big (inferential) statistics question… What are the chances that what I just saw could have happened ‘just by chance’, and that these two (pages) are really no different at all?
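If your tool does not report a level of confidence, one common way to approximate it is a pooled two-proportion z-test. The sketch below is an assumption about "the formula", which the slide elides. The function name and the visitor counts in the example are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def level_of_confidence(conv_a, n_a, conv_b, n_b):
    """Return 1 - p_value for a two-sided pooled two-proportion z-test.

    This matches the "unofficial but useful" reading of confidence level;
    it is an illustrative choice, not necessarily the panel's formula.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 0.0  # no variation at all, so nothing can be concluded
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return 1 - p_value


# Hypothetical counts: 50 of 1,000 visitors convert on A, 80 of 1,000 on B.
confidence = level_of_confidence(50, 1000, 80, 1000)
```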
What does it MEAN?
Imagine an experiment…
Proportional to #‐times it comes out with that many Heads
The math – 5 times out of every 100 that I do the coin-flip experiment, I expect to get a difference between my two samples that's AT LEAST as big as this one, even though there is NO ACTUAL difference...
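The coin-flip experiment can be simulated. The sketch below flips two identical fair coins many times and counts how often a 95% test "finds" a difference that is not there. The number of experiments, flips per sample and the seed are arbitrary illustrative choices.

```python
import random
from math import sqrt


def count_heads(rng, flips):
    # Each bit of a random integer stands in for one fair coin flip.
    return bin(rng.getrandbits(flips)).count("1")


def false_alarm_rate(experiments=2000, flips=1000, z_critical=1.96, seed=0):
    """Share of experiments where two fair coins look 'different' at 95%."""
    rng = random.Random(seed)
    alarms = 0
    for _ in range(experiments):
        heads_a = count_heads(rng, flips)
        heads_b = count_heads(rng, flips)
        pooled = (heads_a + heads_b) / (2 * flips)
        std_err = sqrt(pooled * (1 - pooled) * 2 / flips)
        z = (heads_a - heads_b) / flips / std_err
        if abs(z) > z_critical:
            alarms += 1
    return alarms / experiments


rate = false_alarm_rate()  # should land close to 0.05
```

The rate comes out near 5 in 100 even though both "pages" are the same coin, which is what a 95% level of confidence allows for.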
How do I decide on the right level?
Mainly a matter of convention or agreement.
Confidence Interval Limits
Experiment – Background
Experiment ID: (Protected)
Location: MarketingExperiments Research Library
Research Notes:
Background: Consumer company that offers online brokerage services
Goal: To increase the volume of accounts created online
Primary research question: Which page design will generate the highest rate
Test Design: A/B/C/D multi-factor split test
Experiment – Control Treatment
imagery and messages Control
ROTATING BANNER
Experiment – Exp. Treatment 1
page are unchanged, only
Treatment 1 been optimized
ROTATING BANNER
key value proposition points
removed
has been added
Experiment – Exp. Treatment 2
same, but we removed footer
ROTATING BANNER Treatment 2
elements
in right‐hand column
similar to Treatment 1
Experiment – Exp. Treatment 3
except left‐hand column ROTATING BANNER
Treatment 3
width reduced even further
ROTATING BANNER navigational role
single call-to-action design
Experiment – All Treatments Summary Control Treatment 1 Treatment 2 Treatment 3
Experiment – Results No Significant Difference
None of the treatment designs performed with conclusive results

Test Designs    Conversion Rate    Relative Diff%
Control         5.95%              -
Treatment 1     6.99%              17.42%
Treatment 2     6.51%              9.38%
Treatment 3     6.77%              13.70%

What you need to understand: According to the testing platform we were using, the aggregate results came up inconclusive. None of the treatments outperformed the control with any significant difference.
Validity Threat
Experiment
shift in the control and treatments towards the end
email sent that skewed the sampling distribution.
[Chart: Conversion Rate (3.00% to 19.00%) over Test Duration (Day 1 to Day 11), Control vs. Treatment 3. Annotations: "Treatment consistently is beating the control"; "Control beats the treatment".]
Experiment – Results
31% Increase in Conversions
The highest performing treatment outperformed the control by 31%

Test Designs    Conversion Rate    Relative Diff%
Control         5.35%              -
Treatment 1     6.67%              25%
Treatment 2     6.13%              15%
Treatment 3     7.03%              31%

the email had been sent out, each of the treatments substantially
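The Relative Diff% figures in the results above can be checked from the conversion rates: it is the treatment's rate minus the control's, divided by the control's. A small sketch using the rates from this slide (the function name is illustrative):

```python
def relative_difference(control_rate, treatment_rate):
    """Relative lift of a treatment over the control, as a fraction."""
    return (treatment_rate - control_rate) / control_rate


control = 5.35  # conversion rates in percent, taken from the slide
treatments = {"Treatment 1": 6.67, "Treatment 2": 6.13, "Treatment 3": 7.03}

lifts = {
    name: round(100 * relative_difference(control, rate))
    for name, rate in treatments.items()
}
# lifts == {"Treatment 1": 25, "Treatment 2": 15, "Treatment 3": 31}
```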
Validity Threats: The reason you can’t blindly trust your tools
caused by failing to collect a sufficient number of
Validity Threats: History effect
When a test variable is affected by an extraneous variable associated with the passage of time
Examples
experiment)
temporarily or permanently (e.g., 9/11 attack)
Validity Threats: History effect
Identification:
Mitigation:
Iterate, iterate, iterate, target, test
Validity Threats: Instrumentation effect
When a test variable is affected by a change in the measurement instrument
Examples
Validity Threats: Instrumentation effect
Identification:
e.g. conversion falls to 0; number of visitors to one variation spikes or dips
Mitigation:
caused them
code that does not disrupt control code
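The identification hints for the instrumentation effect (conversion falls to 0; visitors to one variation spike or dip) can be turned into a simple daily check. This is only a sketch: the record layout, the expected 50/50 split and the 10-point tolerance are all hypothetical choices, not values from the presentation.

```python
def flag_instrumentation_issues(daily, expected_share=0.5, tolerance=0.10):
    """Return (day, reason) pairs for days that look like tracking problems.

    `daily` is a list of dicts with hypothetical keys: day, visitors_a,
    visitors_b, conversions_a, conversions_b.
    """
    flags = []
    for row in daily:
        total = row["visitors_a"] + row["visitors_b"]
        if total == 0:
            flags.append((row["day"], "no traffic recorded"))
            continue
        for arm in ("a", "b"):
            if row[f"visitors_{arm}"] > 0 and row[f"conversions_{arm}"] == 0:
                flags.append((row["day"], f"variation {arm}: conversion fell to 0"))
        share_a = row["visitors_a"] / total
        if abs(share_a - expected_share) > tolerance:
            flags.append((row["day"], "traffic split spiked or dipped"))
    return flags


# Hypothetical data: day 2 loses conversion tracking on B, day 3 is skewed.
sample_days = [
    {"day": 1, "visitors_a": 500, "visitors_b": 510, "conversions_a": 30, "conversions_b": 33},
    {"day": 2, "visitors_a": 490, "visitors_b": 505, "conversions_a": 29, "conversions_b": 0},
    {"day": 3, "visitors_a": 800, "visitors_b": 200, "conversions_a": 48, "conversions_b": 12},
]
issues = flag_instrumentation_issues(sample_days)
```

A flagged day is a prompt to investigate the tracking setup, not proof of a problem.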
Validity Threats: Selection effect
When a test variable is affected by different types of subjects not being evenly distributed among experimental treatments
Examples
Identification:
(not matching target environment)
user selects Option A; Treatment 2 user selects Option B
Mitigation:
Confirm that Control & recent historical performance are consistent.
visitor will see.
behavioral characteristics (of visitors or researchers).
certification course – MarketingExperiments.com/Validity
Supplemental material to support potential questions during Q&A
Shrinking sample size by a factor of 10 yields identical Conversion rates, but expands the Confidence Interval “error bars,” causing them to overlap… meaning that the test would not validate at the same level of confidence with only 10% of the traffic.
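A quick way to see this effect is to compute the error bars at full traffic and again at 10% of it. The sketch below assumes the simple normal-approximation interval for a proportion. The conversion rates and visitor counts are hypothetical, chosen only to illustrate the overlap.

```python
from math import sqrt
from statistics import NormalDist


def confidence_interval(rate, visitors, confidence=0.95):
    """Normal-approximation interval (low, high) for a conversion rate."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(rate * (1 - rate) / visitors)
    return rate - margin, rate + margin


def intervals_overlap(first, second):
    return first[0] <= second[1] and second[0] <= first[1]


# Hypothetical test: 5% vs 6% conversion, 20,000 visitors per page...
full = intervals_overlap(confidence_interval(0.05, 20000),
                         confidence_interval(0.06, 20000))
# ...and the same rates with only 10% of the traffic.
tenth = intervals_overlap(confidence_interval(0.05, 2000),
                          confidence_interval(0.06, 2000))
```

With these numbers the bars are separate at full traffic and overlap at one tenth of it, although the conversion rates never changed.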
[Figure: Confidence Interval Limits, showing 95% Confidence Level “error bars” labeled “Conclusive” and “Likely not conclusive”. Sample size requirement for Proportions.]
Infinite number of Normal Distribution Curves
There is one Standard Normal curve (μ = 0, σ = 1.0)
[Figure: Probability Density function for Standard Normal distribution. Y-axis: Probability Density (p), peak at 0.399.]
Difference in Sample Means
Poisson Distribution for different values of λ (mean #-events/sample) (special case of the binomial distribution that applies when the phenomenon under study occurs as rare, discrete events).
The Normal distribution is the limiting case of a discrete binomial distribution as the sample size (N) becomes large, in which case P_p(n|N) is normal with mean μ = N p and variance σ² = N p q, with q defined as 1 - p.
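The limiting-case statement can be checked numerically by comparing the exact binomial probabilities with a normal curve that has μ = N p and σ² = N p q. The sketch below is illustrative; the function names and the sample sizes compared are arbitrary choices.

```python
from math import comb, sqrt
from statistics import NormalDist


def binomial_pmf(n, N, p):
    """Exact probability of n successes in N trials."""
    return comb(N, n) * p ** n * (1 - p) ** (N - n)


def normal_approx(n, N, p):
    """Normal density with mean N*p and variance N*p*q, evaluated at n."""
    q = 1 - p
    return NormalDist(mu=N * p, sigma=sqrt(N * p * q)).pdf(n)


def max_abs_error(N, p):
    """Largest gap between the exact and approximate values over all n."""
    return max(abs(binomial_pmf(n, N, p) - normal_approx(n, N, p))
               for n in range(N + 1))


small_sample_error = max_abs_error(10, 0.5)
large_sample_error = max_abs_error(1000, 0.5)
```

The gap shrinks as N grows, which is why the normal-curve math works for conversion tests with enough traffic.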