

SLIDE 1

Validity

How to make sure your testing investment yields reliable data (interactive panel with experts)

Daniel Burstein, Aaron Gray, Bob Kemper, Eric Miller

SLIDE 2

Interactive panel with experts

  • Daniel Burstein, Director of Editorial Content, MECLABS @DanielBurstein
  • Eric Miller, Director of Product Management, Monetate @dynamiller @monetate
  • Bob Kemper, Director of Sciences, MECLABS
  • Aaron Gray, Head of Day 2 @agray

SLIDE 3

How marketers validate test results

SLIDE 4

Three important (and often misunderstood) elements of statistical significance

  • Significant difference
  • Sample size
  • Level of confidence
SLIDE 5

Significant difference

11% is more than 10%*

*…except when it’s not

SLIDE 6

Which Treatment Won?

[Screenshots: Treatment A and Treatment B]

SLIDE 7

Initial Monetate Campaign

SLIDE 8

Test A ‐ Warm Fleece

SLIDE 9

Test B ‐ Layers

SLIDE 10

Monetate Reporting: A Vs. Control

  • Incremental Revenue: ‐$35k
  • New Customer Acquisition: 14.79% lift at p95
  • AOV: ‐8.81% lift at p99

SLIDE 11

Monetate Reporting: B Vs. Control

  • Incremental Revenue: $43k
  • AOV: 13.15% lift at p99
  • RPS: 13.47% lift at p90

SLIDE 12

A Vs. B

Conversion: 4.30% lift at p80

SLIDE 13

Final Results

  • They both won, for different segments.
  • “A‐Fleece” was the overall winner and won with new customer acquisition, and is now shown only to that segment.
  • “B‐Layers” won with existing customers, with significant lift in AOV over time.
  • New campaigns were iterated to take advantage of learnings.

SLIDE 14

Resulting Campaign

SLIDE 15

Sample size

“Well, you’re alive today even though you didn’t have one of those fancy car seats.” (n=2) – My Mom

“Compared with seat belts, child restraints…were associated with a 28% reduction in risk for death.” (n=7,813) – Michael R. Elliott, PhD; Michael J. Kallan, MS; Dennis R. Durbin, MD, MSCE; Flaura K. Winston, MD, PhD

SLIDE 16

Sample size

Number of test subjects needed to get “statistically significant” results

  • Achieving that number is a function of visitor volume and time

Factors in determining sample size

  • Test complexity (number of versions being tested)
  • Conversion rate
  • Performance difference between variations
  • Confidence level
  • But – too short a test may not be as valid as it looks, especially if distribution of time is a factor

Be realistic about what kind of test your site can support (see the sketch below).
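To make these factors concrete, here is a minimal sketch (ours, not the deck's tool) of the standard two‐proportion sample‐size calculation; the function name and the example rates are illustrative assumptions.

# Minimal sketch (not the deck's tool): visitors needed in EACH arm of an
# A/B test, from confidence level, power, and the difference to detect.
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_arm(p_control, p_treatment, confidence=0.95, power=0.80):
    """Standard two-proportion formula: higher confidence, higher power,
    or a smaller performance difference all demand more visitors."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - (1 - confidence) / 2)  # two-sided test
    z_beta = z.inv_cdf(power)
    p_bar = (p_control + p_treatment) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p_control * (1 - p_control)
                                 + p_treatment * (1 - p_treatment))) ** 2
    return ceil(numerator / (p_treatment - p_control) ** 2)

# Detecting a lift from 5.95% to 6.77% (rates seen later in the deck)
print(sample_size_per_arm(0.0595, 0.0677))  # ~14,000 visitors per arm

Dividing that requirement by daily visitor volume gives the expected test duration, which is why achieving the number is a function of visitor volume and time.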

SLIDE 17

Level of confidence

“Piled Higher and Deeper” by Jorge Cham, www.phdcomics.com

SLIDE 18

Level of confidence

What is it?

Statistical Level of Confidence – The statistical probability that there really is a performance difference between the control and experimental treatments, based upon the data collected to date. (An “unofficial” but useful definition.)

The big (inferential) statistics question: What are the chances that what I just saw could have happened ‘just by chance’, and that these two pages are really no different at all?

How (or where) do I get it?

The math – Determine the mean difference, the standard deviation, and the sample size, and use the formula for the Confidence Interval Limits (sketched below).

Or… get it from your metrics / testing tool.
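As a companion to the definition above, a small sketch (our illustration; the visitor counts are hypothetical) of computing Confidence Interval Limits for the difference between two conversion rates.

# Sketch: confidence interval limits for the difference between two
# observed conversion rates; if the interval excludes 0, the difference
# is significant at the chosen level of confidence.
from math import sqrt
from statistics import NormalDist

def diff_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)  # std. error
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# Hypothetical counts: 600/10,000 control vs. 650/10,000 treatment
lo, hi = diff_confidence_interval(600, 10_000, 650, 10_000)
print(f"95% CI for the lift: [{lo:+.4f}, {hi:+.4f}]")  # spans 0 -> inconclusive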

SLIDE 19

Level of confidence

What does it MEAN?

Imagine an experiment…

  • Take one FAIR coin (i.e., if flipped ∞ times, it would come out heads 50% of the time).
  • Flip the coin ‘n’ (many) times and record # Heads (e.g., say 60 times).
  • Then do it over and over again, with the same # of flips.

[Chart: histogram whose bar heights are proportional to the # of times the experiment comes out with that many Heads]

The math – 5 times out of every 100 that I do the coin‐flip experiment, I expect to get a difference between my two samples that’s AT LEAST as big as this one – even though there is NO ACTUAL difference.
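A quick simulation of that experiment (our sketch; the gap threshold of 11 heads is chosen to land near the 5% chance level for n = 60):

# Sketch: two samples from the SAME fair coin still differ by chance.
# With n = 60 flips per sample, a gap of 11+ heads occurs ~5% of the time,
# which is exactly what "95% confidence" is guarding against.
import random

def coin_flip_experiment(n_flips=60, n_experiments=20_000, gap=11):
    extreme = 0
    for _ in range(n_experiments):
        heads_a = sum(random.random() < 0.5 for _ in range(n_flips))
        heads_b = sum(random.random() < 0.5 for _ in range(n_flips))
        if abs(heads_a - heads_b) >= gap:
            extreme += 1
    return extreme / n_experiments

print(coin_flip_experiment())  # ~0.05, despite NO ACTUAL difference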

SLIDE 20

Level of confidence

How do I decide on the right level?

  • Most common is 95% (i.e., a 5% chance you’ll think they’re different when they’re really not)
  • There is no ‘magic’ to the 95% LoC.
  • Mainly a matter of ‘convention’ or agreement.
  • The onus for picking the ‘right’ level for your test is on YOU.
  • Sometimes the tools limit you.
  • 95% is seldom a “bad” choice.
  • Higher = longer test; a bigger difference is needed for validity (see the sketch below).
  • Decide based on the level of risk of being wrong vs. the cost of prolonging the test.
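The “Higher = longer test” trade‐off in miniature (our sketch): the z multiplier behind the Confidence Interval Limits grows with the chosen level, so the interval widens and more data is needed to shrink it back.

# Sketch: the z multiplier grows with the level of confidence, so error
# bars widen and more traffic is needed before a test validates.
from statistics import NormalDist

for loc in (0.80, 0.90, 0.95, 0.99):
    z = NormalDist().inv_cdf(1 - (1 - loc) / 2)
    print(f"{loc:.0%} confidence -> z = {z:.2f}")
# 80% -> 1.28, 90% -> 1.64, 95% -> 1.96, 99% -> 2.58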
SLIDE 21

The iPod of validity tools

SLIDE 22

How marketers validate test results

SLIDE 23

Experiment – Background

Experiment ID: (Protected)
Location: MarketingExperiments Research Library

Research Notes:
Background: Consumer company that offers online brokerage services
Goal: To increase the volume of accounts created online
Primary research question: Which page design will generate the highest rate of conversion?
Test Design: A/B/C/D multi‐factor split test

SLIDE 24

Experiment – Control Treatment

  • Heavily competing imagery and messages
  • Multiple calls‐to‐action

[Screenshot: Control page with rotating banner]

SLIDE 25

Experiment – Exp. Treatment 1

  • Most of the elements on the page are unchanged; only one block of information has been optimized
  • Headline has been added
  • Bulleted copy highlighted key value proposition points
  • “Chat With a Live Agent” CTA removed
  • Large, clear call‐to‐action has been added

[Screenshot: Treatment 1 page with rotating banner]

SLIDE 26

Experiment – Exp. Treatment 2

  • Left column remained the same, but we removed footer elements
  • Long copy, vertical flow
  • Added awards and testimonials in right‐hand column
  • Large, clear call‐to‐action similar to Treatment 1

[Screenshot: Treatment 2 page with rotating banner]

SLIDE 27

Experiment – Exp. Treatment 3

  • Similar to Treatment 2, except left‐hand column width reduced even further
  • Left‐hand column has a more navigational role
  • Still a long‐copy, vertical‐flow, single call‐to‐action design

[Screenshot: Treatment 3 page with rotating banner]

SLIDE 28

Experiment – All Treatments Summary

[Screenshots: Control, Treatment 1, Treatment 2, Treatment 3]

SLIDE 29

Experiment – Results: No Significant Difference

None of the treatment designs performed with conclusive results.

Test Design    Conversion Rate   Relative Diff%
Control        5.95%             ‐
Treatment 1    6.99%             17.42%
Treatment 2    6.51%             9.38%
Treatment 3    6.77%             13.70%

What you need to understand: According to the testing platform we were using, the aggregate results came up inconclusive. None of the treatments outperformed the control with any significant difference.
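For reference, a sketch of the two‐proportion z‐test a platform runs under the hood to reach a verdict like this (our illustration; the deck does not report visitor counts, so the 2,000 per arm below is hypothetical).

# Sketch: the significance test behind an "inconclusive" verdict.
from math import sqrt
from statistics import NormalDist

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z statistic and two-sided p-value for rate_b vs. rate_a."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return z, 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical 2,000 visitors per arm at the observed rates (5.95% vs. 6.99%)
z, p = two_proportion_z(119, 2_000, 140, 2_000)
print(f"z = {z:.2f}, p = {p:.3f}")  # p > 0.05 -> inconclusive at 95% LoC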

SLIDE 30

Validity Threat

Experiment

  • However, we noticed an interesting performance shift in the control and treatments towards the end of the test.
  • We discovered that during the test, there was an email sent that skewed the sampling distribution.

[Chart: Conversion rate for Control vs. Treatment 3 over the test duration (Day 1 – Day 11). The treatment consistently beats the control until late in the test, when the control beats the treatment.]

SLIDE 31

Results

Experiment

31% Increase in Conversions

The highest performing treatment outperformed the control by 31%.

Test Design    Conversion Rate   Relative Diff%
Control        5.35%             ‐
Treatment 1    6.67%             25%
Treatment 2    6.13%             15%
Treatment 3    7.03%             31%

What you need to understand: After excluding the data collected after the email had been sent out, each of the treatments substantially outperformed the control with conclusive validity.
SLIDE 32

Validity Threats: The reason you can’t blindly trust your tools

  • Sample Distortion Effect – the effect on a test outcome caused by failing to collect a sufficient number of observations
  • History Effect
  • Instrumentation Effect
  • Selection Effect
SLIDE 33

Validity Threats: History effect

When a test variable is affected by an extraneous variable associated with the passage of time.

Examples

  • An email send that skews conversion for one treatment (as in the previous experiment)
  • Newsworthy event that changes the nature of arriving subjects—whether temporarily or permanently (e.g., the 9/11 attack)

SLIDE 34

Validity Threats: History effect

SLIDE 35

Validity Threats: History effect

SLIDE 36

Validity Threats: History effect

Identification:

  • Sniff Test – but to a point
  • Did anything happen? REALLY HARD

Mitigation:

  • Segmented reporting (see the sketch below)
  • Test with longer time horizons, but to a point
  • Iterate, iterate, iterate, target, test
  • Balance the cost of being wrong
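One way to operationalize “segmented reporting” (our sketch; the daily rates are hypothetical): compute the lift per time segment and look for sign flips like the one the email send caused above.

# Sketch: lift per day; a sudden sign flip flags a possible history effect.
def lift_by_day(daily_results):
    """daily_results: list of (control_rate, treatment_rate), one per day."""
    for day, (control, treatment) in enumerate(daily_results, start=1):
        lift = (treatment - control) / control
        print(f"Day {day:2d}: lift = {lift:+.1%}")

# Hypothetical rates: the flip after Day 8 mirrors the email-send skew above
lift_by_day([(0.059, 0.070)] * 8 + [(0.110, 0.080)] * 3)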
SLIDE 37

Validity Threats: Instrumentation effect

When a test variable is affected by a change in the measurement instrument.

Examples

  • Short‐duration response time slowdowns
  • E.g., due to server‐load, page‐weight, page‐code problems
  • Splitter malfunction
  • Inconsistent URLs
  • Server downtime
SLIDE 38

Validity Threats: Instrumentation effect

Identification:

  • Sudden change in performance or traffic distribution to one or more variations
  • e.g., conversion falls to 0; number of visitors to one variation spikes or dips
  • Improper or inconsistent placement of test control code across variations
  • e.g., script in header on some pages, in body on others
  • Temporary change to or disablement of control code
  • e.g., disruptive code deploy happens during test run

Mitigation:

  • Watch for sudden, drastic changes and investigate for instrumentation errors that may have caused them (see the sketch below)
  • Ensure that control code placement follows best practice, and is consistent across all pages
  • Don’t do significant deploys during testing; ensure that you have a method of deploying code that does not disrupt control code
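A sketch of the first mitigation (ours; the alpha and visitor counts are illustrative): before trusting results, test whether the observed traffic split is consistent with the intended 50/50.

# Sketch: flag a possible splitter malfunction when the A/B traffic split
# deviates from the intended 50/50 by far more than chance allows.
from math import sqrt
from statistics import NormalDist

def split_looks_broken(visitors_a, visitors_b, alpha=0.001):
    """alpha is deliberately strict: we only want sudden, drastic changes."""
    n = visitors_a + visitors_b
    se = sqrt(0.25 * n)          # std. dev. of #A under a fair 50/50 split
    z = (visitors_a - n / 2) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return p < alpha

print(split_looks_broken(5_000, 5_100))  # False: normal variation
print(split_looks_broken(5_000, 6_500))  # True: investigate the splitter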

SLIDE 39

Validity Threats: Selection effect

When a test variable is affected by different types of subjects not being evenly distributed among experimental treatments.

Examples

  • Channel profile does not match customer profiles
  • Uneven distribution of traffic from sources among treatments
  • Self‐selection (bias)
SLIDE 40

Validity Threats: Selection effect

Identification:

  • Dramatic differences between test Control and recent historical performance (not matching target environment)
  • Sudden changes in performance or traffic balance among treatments (uneven dist.)
  • Esp. after email sends or other channel‐specific events or actions
  • Test design that separates treatments according to a visitor decision (self‐selection)
  • e.g., Treatment 1 = user selects Option A; Treatment 2 = user selects Option B

Mitigation:

  • Confirm that Control & recent historical performance are consistent.
  • Monitor and watch for sudden changes within & across treatments during the collection period.
  • Ensure unbiased selection and random distribution of subjects among treatments (see the assignment sketch below).
  • Ideally, there should be no possible way for anyone to predict which treatment a given incoming visitor will see.
  • At the very least, there should be no ‘systematic’ treatment‐assignment based upon technical or behavioral characteristics (of visitors or researchers).

  • Note that ‘round robin’ meets the latter requirement, but not the former.
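A sketch of one common way to satisfy both requirements (our illustration; the salt and visitor IDs are hypothetical): assign treatments by hashing a visitor ID with a secret salt, which is not predictable from outside, avoids characteristic‐based assignment, and still gives each visitor a consistent experience.

# Sketch: salted-hash treatment assignment - effectively random across
# visitors, not predictable externally, and stable per visitor.
import hashlib

SALT = "exp-42-secret"  # hypothetical per-experiment secret

def assign_treatment(visitor_id: str, treatments=("control", "T1", "T2")):
    digest = hashlib.sha256(f"{SALT}:{visitor_id}".encode()).hexdigest()
    return treatments[int(digest, 16) % len(treatments)]

print(assign_treatment("visitor-1001"))  # same visitor -> same arm, always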
SLIDE 41

More from the panelists

  • MarketingExperiments.com/blog
  • MarketingSherpa.com/blog
  • Monetate.com/blog
  • Monetate.com/resources
  • GetReadyforDay2.com
SLIDE 42

Resources

  • MECLABS Basic Validity Tool – on thumb drive at MECLABS booth
  • MarketingExperiments “Fundamentals of Online Testing” on‐demand certification course – MarketingExperiments.com/Validity

SLIDE 43

Question‐and‐Answer Session

SLIDE 44

APPENDIX

Supplemental material to support potential questions during Q&A

SLIDE 45

Appendix

Shrinking sample size by a factor of 10 yields identical conversion rates, but expands the Confidence Interval “error bars,” causing them to overlap… meaning that the test would not validate at the same level of confidence with only 10% of the traffic.

[Chart: 95% Confidence Level “error bars” – conclusive at the full sample size; likely not conclusive at the reduced sample size, where the Confidence Interval Limits overlap]
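The appendix point in miniature (our sketch, with an illustrative 6% conversion rate): cutting the sample by a factor of 10 widens the error bars by √10 ≈ 3.2×.

# Sketch: identical conversion rate, 1/10 the sample -> error bars
# sqrt(10) = 3.2x wider, so previously separate intervals can overlap.
from math import sqrt
from statistics import NormalDist

def error_bar(rate, n, confidence=0.95):
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sqrt(rate * (1 - rate) / n)

for n in (100_000, 10_000):
    print(f"n = {n:>7,}: 6.00% +/- {error_bar(0.06, n):.4%}")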

SLIDE 46

Appendix

[Figure: Probability Density function for the Standard Normal distribution, with peak density p = 0.399. There are an infinite number of Normal distribution curves, but only one Standard Normal curve (μ = 0, σ = 1.0).]

SLIDE 47

Appendix

Difference in Sample Means

SLIDE 48

Appendix

Poisson Distribution for different values of λ (mean # of events per sample) – a special case of the binomial distribution that applies when the phenomenon under study occurs as rare, discrete events. The Normal distribution is the limiting case of a discrete binomial distribution as the sample size N becomes large, in which case P_p(n | N) is approximately normal with mean μ = Np and variance σ² = Npq, where q = 1 − p.
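A sketch of the limiting behavior described above (ours; N and p are illustrative): for large N, the binomial pmf is closely matched by a normal density with μ = Np and σ² = Npq.

# Sketch: the normal approximation to the binomial improves as N grows.
from math import comb, exp, pi, sqrt

def binomial_pmf(n, N, p):
    return comb(N, n) * p**n * (1 - p)**(N - n)

def normal_pdf(x, mu, var):
    return exp(-(x - mu)**2 / (2 * var)) / sqrt(2 * pi * var)

N, p = 1_000, 0.06                # e.g., 1,000 visitors, 6% conversion
mu, var = N * p, N * p * (1 - p)  # mu = Np, sigma^2 = Npq, q = 1 - p
for n in (50, 60, 70):
    print(f"n={n}: binomial={binomial_pmf(n, N, p):.5f}  "
          f"normal~{normal_pdf(n, mu, var):.5f}")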