

SLIDE 1

Validity

How to make sure your testing investment yields reliable data (interactive panel with experts)

Daniel Burstein, Aaron Gray, Bob Kemper, Eric Miller

SLIDE 2

Interactive panel with experts

  • Daniel Burstein, Director of Editorial Content, MECLABS @DanielBurstein
  • Eric Miller, Director of Product Management, Monetate @dynamiller @monetate
  • Bob Kemper, Director of Sciences, MECLABS
  • Aaron Gray, Head of Day 2 @agray

SLIDE 3

How marketers validate test results

SLIDE 4

Three important (and often misunderstood) elements of statistical significance

  • Significant difference
  • Sample size
  • Level of confidence
SLIDE 5

Significant difference

11% is more than 10%*

*…except when it’s not

SLIDE 6

Which Treatment Won?

[Screenshots: Treatment A and Treatment B]

SLIDE 7

Initial Monetate Campaign

SLIDE 8

Test A ‐ Warm Fleece

SLIDE 9

Test B ‐ Layers

SLIDE 10

Monetate Reporting: A Vs. Control

  • Incremental Revenue: ‐$35k
  • New Customer Acquisition: 14.79% lift at p95
  • AOV: ‐8.81% lift at p99

SLIDE 11

Monetate Reporting: B Vs. Control

  • Incremental Revenue: $43k
  • AOV: 13.15% lift at p99
  • RPS: 13.47% lift at p90

SLIDE 12

A Vs. B

Conversion: 4.30% lift at p80

SLIDE 13

Final Results

  • They both won, for different segments.
  • “A‐Fleece” was the overall winner and won with new customer acquisition, and is now shown only to that segment.
  • “B‐Layers” won with existing customers, with significant lift in AOV over time.
  • New campaigns were iterated to take advantage of learnings.

SLIDE 14

Resulting Campaign

SLIDE 15

Sample size

“Well, you’re alive today even though you didn’t have one of those fancy car seats.” (n=2) – My Mom

“Compared with seat belts, child restraints…were associated with a 28% reduction in risk for death.” (n=7,813) – Michael R. Elliott, PhD; Michael J. Kallan, MS; Dennis R. Durbin, MD, MSCE; Flaura K. Winston, MD, PhD

SLIDE 16

Sample size

Number of test subjects needed to get “statistically significant” results

  • Achieving that number is a function of visitor volume and time

Factors in determining sample size

  • Test complexity (number of versions being tested)
  • Conversion rate
  • Performance difference between variations
  • Confidence level
  • But – too short a test may not be as valid as it looks, especially if distribution of time is a factor

Be realistic about what kind of test your site can support (see the sketch below).
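To make these factors concrete, here is a minimal sketch (ours, not the deck's tool) of the standard two‐proportion sample‐size calculation; the function name and the example rates are illustrative assumptions.

# Minimal sketch (not the deck's tool): visitors needed in EACH arm of an
# A/B test, from confidence level, power, and the difference to detect.
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_arm(p_control, p_treatment, confidence=0.95, power=0.80):
    """Standard two-proportion formula: higher confidence, higher power,
    or a smaller performance difference all demand more visitors."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - (1 - confidence) / 2)  # two-sided test
    z_beta = z.inv_cdf(power)
    p_bar = (p_control + p_treatment) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p_control * (1 - p_control)
                                 + p_treatment * (1 - p_treatment))) ** 2
    return ceil(numerator / (p_treatment - p_control) ** 2)

# Detecting a lift from 5.95% to 6.77% (rates seen later in the deck)
print(sample_size_per_arm(0.0595, 0.0677))  # ~14,000 visitors per arm

Dividing that requirement by daily visitor volume gives the expected test duration, which is why achieving the number is a function of visitor volume and time.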

SLIDE 17

Level of confidence

“Piled Higher and Deeper” by Jorge Cham, www.phdcomics.com

SLIDE 18

Level of confidence

What is it?

Statistical Level of Confidence – The statistical probability that there really is a performance difference between the control and experimental treatments, based upon the data collected to date. (An “unofficial” but useful definition.)

The big (inferential) statistics question: What are the chances that what I just saw could have happened ‘just by chance’, and that these two pages are really no different at all?

How (or where) do I get it?

The math – Determine the mean difference, the standard deviation, and the sample size, and use the formula for the Confidence Interval Limits (sketched below).

Or… get it from your metrics / testing tool.
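As a companion to the definition above, a small sketch (our illustration; the visitor counts are hypothetical) of computing Confidence Interval Limits for the difference between two conversion rates.

# Sketch: confidence interval limits for the difference between two
# observed conversion rates; if the interval excludes 0, the difference
# is significant at the chosen level of confidence.
from math import sqrt
from statistics import NormalDist

def diff_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)  # std. error
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# Hypothetical counts: 600/10,000 control vs. 650/10,000 treatment
lo, hi = diff_confidence_interval(600, 10_000, 650, 10_000)
print(f"95% CI for the lift: [{lo:+.4f}, {hi:+.4f}]")  # spans 0 -> inconclusive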

SLIDE 19

Level of confidence

What does it MEAN?

Imagine an experiment…

  • Take one FAIR coin (i.e., if flipped ∞ times, it would come out heads 50% of the time).
  • Flip the coin ‘n’ (many) times and record # Heads (e.g., say 60 times).
  • Then do it over and over again, with the same # of flips.

[Chart: histogram whose bar heights are proportional to the # of times the experiment comes out with that many Heads]

The math – 5 times out of every 100 that I do the coin‐flip experiment, I expect to get a difference between my two samples that’s AT LEAST as big as this one – even though there is NO ACTUAL difference.
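A quick simulation of that experiment (our sketch; the gap threshold of 11 heads is chosen to land near the 5% chance level for n = 60):

# Sketch: two samples from the SAME fair coin still differ by chance.
# With n = 60 flips per sample, a gap of 11+ heads occurs ~5% of the time,
# which is exactly what "95% confidence" is guarding against.
import random

def coin_flip_experiment(n_flips=60, n_experiments=20_000, gap=11):
    extreme = 0
    for _ in range(n_experiments):
        heads_a = sum(random.random() < 0.5 for _ in range(n_flips))
        heads_b = sum(random.random() < 0.5 for _ in range(n_flips))
        if abs(heads_a - heads_b) >= gap:
            extreme += 1
    return extreme / n_experiments

print(coin_flip_experiment())  # ~0.05, despite NO ACTUAL difference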

SLIDE 20

Level of confidence

How do I decide on the right level?

  • Most common is 95% (i.e., a 5% chance you’ll think they’re different when they’re really not)
  • There is no ‘magic’ to the 95% LoC.
  • Mainly a matter of ‘convention’ or agreement.
  • The onus for picking the ‘right’ level for your test is on YOU.
  • Sometimes the tools limit you.
  • 95% is seldom a “bad” choice.
  • Higher = longer test; a bigger difference is needed for validity (see the sketch below).
  • Decide based on the level of risk of being wrong vs. the cost of prolonging the test.
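The “Higher = longer test” trade‐off in miniature (our sketch): the z multiplier behind the Confidence Interval Limits grows with the chosen level, so the interval widens and more data is needed to shrink it back.

# Sketch: the z multiplier grows with the level of confidence, so error
# bars widen and more traffic is needed before a test validates.
from statistics import NormalDist

for loc in (0.80, 0.90, 0.95, 0.99):
    z = NormalDist().inv_cdf(1 - (1 - loc) / 2)
    print(f"{loc:.0%} confidence -> z = {z:.2f}")
# 80% -> 1.28, 90% -> 1.64, 95% -> 1.96, 99% -> 2.58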
SLIDE 21

The iPod of validity tools

SLIDE 22

How marketers validate test results

SLIDE 23

Experiment – Background

Experiment ID: (Protected)
Location: MarketingExperiments Research Library

Research Notes:
Background: Consumer company that offers online brokerage services
Goal: To increase the volume of accounts created online
Primary research question: Which page design will generate the highest rate of conversion?
Test Design: A/B/C/D multi‐factor split test

SLIDE 24

Experiment – Control Treatment

  • Heavily competing imagery and messages
  • Multiple calls‐to‐action

[Screenshot: Control page with rotating banner]

SLIDE 25

Experiment – Exp. Treatment 1

  • Most of the elements on the page are unchanged; only one block of information has been optimized
  • Headline has been added
  • Bulleted copy highlighted key value proposition points
  • “Chat With a Live Agent” CTA removed
  • Large, clear call‐to‐action has been added

[Screenshot: Treatment 1 page with rotating banner]

SLIDE 26

Experiment – Exp. Treatment 2

  • Left column remained the same, but we removed footer elements
  • Long copy, vertical flow
  • Added awards and testimonials in right‐hand column
  • Large, clear call‐to‐action similar to Treatment 1

[Screenshot: Treatment 2 page with rotating banner]

SLIDE 27

Experiment – Exp. Treatment 3

  • Similar to Treatment 2, except left‐hand column width reduced even further
  • Left‐hand column has a more navigational role
  • Still a long‐copy, vertical‐flow, single call‐to‐action design

[Screenshot: Treatment 3 page with rotating banner]

SLIDE 28

Experiment – All Treatments Summary

[Screenshots: Control, Treatment 1, Treatment 2, Treatment 3]

SLIDE 29

Experiment – Results: No Significant Difference

None of the treatment designs performed with conclusive results.

Test Design    Conversion Rate   Relative Diff%
Control        5.95%             ‐
Treatment 1    6.99%             17.42%
Treatment 2    6.51%             9.38%
Treatment 3    6.77%             13.70%

What you need to understand: According to the testing platform we were using, the aggregate results came up inconclusive. None of the treatments outperformed the control with any significant difference.
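For reference, a sketch of the two‐proportion z‐test a platform runs under the hood to reach a verdict like this (our illustration; the deck does not report visitor counts, so the 2,000 per arm below is hypothetical).

# Sketch: the significance test behind an "inconclusive" verdict.
from math import sqrt
from statistics import NormalDist

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z statistic and two-sided p-value for rate_b vs. rate_a."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return z, 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical 2,000 visitors per arm at the observed rates (5.95% vs. 6.99%)
z, p = two_proportion_z(119, 2_000, 140, 2_000)
print(f"z = {z:.2f}, p = {p:.3f}")  # p > 0.05 -> inconclusive at 95% LoC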

SLIDE 30

Validity Threat

Experiment

  • However, we noticed an interesting performance shift in the control and treatments towards the end of the test.
  • We discovered that during the test, there was an email sent that skewed the sampling distribution.

[Chart: Conversion rate for Control vs. Treatment 3 over the test duration (Day 1 – Day 11). The treatment consistently beats the control until late in the test, when the control beats the treatment.]

SLIDE 31

Results

Experiment

31% Increase in Conversions

The highest performing treatment outperformed the control by 31%.

Test Design    Conversion Rate   Relative Diff%
Control        5.35%             ‐
Treatment 1    6.67%             25%
Treatment 2    6.13%             15%
Treatment 3    7.03%             31%

What you need to understand: After excluding the data collected after the email had been sent out, each of the treatments substantially outperformed the control with conclusive validity.
SLIDE 32

Validity Threats: The reason you can’t blindly trust your tools

  • Sample Distortion Effect – the effect on a test outcome caused by failing to collect a sufficient number of observations
  • History Effect
  • Instrumentation Effect
  • Selection Effect
SLIDE 33

Validity Threats: History effect

When a test variable is affected by an extraneous variable associated with the passage of time.

Examples

  • An email send that skews conversion for one treatment (as in the previous experiment)
  • Newsworthy event that changes the nature of arriving subjects—whether temporarily or permanently (e.g., the 9/11 attack)

SLIDE 34

Validity Threats: History effect

SLIDE 35

Validity Threats: History effect

SLIDE 36

Validity Threats: History effect

Identification:

  • Sniff Test – but to a point
  • Did anything happen? REALLY HARD

Mitigation:

  • Segmented reporting (see the sketch below)
  • Test with longer time horizons, but to a point
  • Iterate, iterate, iterate, target, test
  • Balance the cost of being wrong
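One way to operationalize “segmented reporting” (our sketch; the daily rates are hypothetical): compute the lift per time segment and look for sign flips like the one the email send caused above.

# Sketch: lift per day; a sudden sign flip flags a possible history effect.
def lift_by_day(daily_results):
    """daily_results: list of (control_rate, treatment_rate), one per day."""
    for day, (control, treatment) in enumerate(daily_results, start=1):
        lift = (treatment - control) / control
        print(f"Day {day:2d}: lift = {lift:+.1%}")

# Hypothetical rates: the flip after Day 8 mirrors the email-send skew above
lift_by_day([(0.059, 0.070)] * 8 + [(0.110, 0.080)] * 3)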
SLIDE 37

Validity Threats: Instrumentation effect

When a test variable is affected by a change in the measurement instrument.

Examples

  • Short‐duration response time slowdowns
  • E.g., due to server‐load, page‐weight, page‐code problems
  • Splitter malfunction
  • Inconsistent URLs
  • Server downtime
SLIDE 38

Validity Threats: Instrumentation effect

Identification:

  • Sudden change in performance or traffic distribution to one or more variations
  • e.g., conversion falls to 0; number of visitors to one variation spikes or dips
  • Improper or inconsistent placement of test control code across variations
  • e.g., script in header on some pages, in body on others
  • Temporary change to or disablement of control code
  • e.g., disruptive code deploy happens during test run

Mitigation:

  • Watch for sudden, drastic changes and investigate for instrumentation errors that may have caused them (see the sketch below)
  • Ensure that control code placement follows best practice, and is consistent across all pages
  • Don’t do significant deploys during testing; ensure that you have a method of deploying code that does not disrupt control code
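A sketch of the first mitigation (ours; the alpha and visitor counts are illustrative): before trusting results, test whether the observed traffic split is consistent with the intended 50/50.

# Sketch: flag a possible splitter malfunction when the A/B traffic split
# deviates from the intended 50/50 by far more than chance allows.
from math import sqrt
from statistics import NormalDist

def split_looks_broken(visitors_a, visitors_b, alpha=0.001):
    """alpha is deliberately strict: we only want sudden, drastic changes."""
    n = visitors_a + visitors_b
    se = sqrt(0.25 * n)          # std. dev. of #A under a fair 50/50 split
    z = (visitors_a - n / 2) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return p < alpha

print(split_looks_broken(5_000, 5_100))  # False: normal variation
print(split_looks_broken(5_000, 6_500))  # True: investigate the splitter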

SLIDE 39

Validity Threats: Selection effect

When a test variable is affected by different types of subjects not being evenly distributed among experimental treatments.

Examples

  • Channel profile does not match customer profiles
  • Uneven distribution of traffic from sources among treatments
  • Self‐selection (bias)
SLIDE 40

Validity Threats: Selection effect

Identification:

  • Dramatic differences between test Control and recent historical performance (not matching target environment)
  • Sudden changes in performance or traffic balance among treatments (uneven dist.)
  • Esp. after email sends or other channel‐specific events or actions
  • Test design that separates treatments according to a visitor decision (self‐selection)
  • e.g., Treatment 1 = user selects Option A; Treatment 2 = user selects Option B

Mitigation:

  • Confirm that Control & recent historical performance are consistent.
  • Monitor and watch for sudden changes within & across treatments during the collection period.
  • Ensure unbiased selection and random distribution of subjects among treatments (see the assignment sketch below).
  • Ideally, there should be no possible way for anyone to predict which treatment a given incoming visitor will see.
  • At the very least, there should be no ‘systematic’ treatment‐assignment based upon technical or behavioral characteristics (of visitors or researchers).

  • Note that ‘round robin’ meets the latter requirement, but not the former.
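A sketch of one common way to satisfy both requirements (our illustration; the salt and visitor IDs are hypothetical): assign treatments by hashing a visitor ID with a secret salt, which is not predictable from outside, avoids characteristic‐based assignment, and still gives each visitor a consistent experience.

# Sketch: salted-hash treatment assignment - effectively random across
# visitors, not predictable externally, and stable per visitor.
import hashlib

SALT = "exp-42-secret"  # hypothetical per-experiment secret

def assign_treatment(visitor_id: str, treatments=("control", "T1", "T2")):
    digest = hashlib.sha256(f"{SALT}:{visitor_id}".encode()).hexdigest()
    return treatments[int(digest, 16) % len(treatments)]

print(assign_treatment("visitor-1001"))  # same visitor -> same arm, always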
SLIDE 41

More from the panelists

  • MarketingExperiments.com/blog
  • MarketingSherpa.com/blog
  • Monetate.com/blog
  • Monetate.com/resources
  • GetReadyforDay2.com
SLIDE 42

Resources

  • MECLABS Basic Validity Tool – on thumb drive at MECLABS booth
  • MarketingExperiments “Fundamentals of Online Testing” on‐demand certification course – MarketingExperiments.com/Validity

SLIDE 43

Question‐and‐Answer Session

SLIDE 44

APPENDIX

Supplemental material to support potential questions during Q&A

SLIDE 45

Appendix

Shrinking sample size by a factor of 10 yields identical conversion rates, but expands the Confidence Interval “error bars,” causing them to overlap… meaning that the test would not validate at the same level of confidence with only 10% of the traffic.

[Chart: 95% Confidence Level “error bars” – conclusive at the full sample size; likely not conclusive at the reduced sample size, where the Confidence Interval Limits overlap]
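The appendix point in miniature (our sketch, with an illustrative 6% conversion rate): cutting the sample by a factor of 10 widens the error bars by √10 ≈ 3.2×.

# Sketch: identical conversion rate, 1/10 the sample -> error bars
# sqrt(10) = 3.2x wider, so previously separate intervals can overlap.
from math import sqrt
from statistics import NormalDist

def error_bar(rate, n, confidence=0.95):
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sqrt(rate * (1 - rate) / n)

for n in (100_000, 10_000):
    print(f"n = {n:>7,}: 6.00% +/- {error_bar(0.06, n):.4%}")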

SLIDE 46

Appendix

[Figure: Probability Density function for the Standard Normal distribution, with peak density p = 0.399. There are an infinite number of Normal distribution curves, but only one Standard Normal curve (μ = 0, σ = 1.0).]

SLIDE 47

Appendix

Difference in Sample Means

SLIDE 48

Appendix

Poisson Distribution for different values of λ (mean # of events per sample) – a special case of the binomial distribution that applies when the phenomenon under study occurs as rare, discrete events. The Normal distribution is the limiting case of a discrete binomial distribution as the sample size N becomes large, in which case P_p(n | N) is approximately normal with mean μ = Np and variance σ² = Npq, where q = 1 − p.
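A sketch of the limiting behavior described above (ours; N and p are illustrative): for large N, the binomial pmf is closely matched by a normal density with μ = Np and σ² = Npq.

# Sketch: the normal approximation to the binomial improves as N grows.
from math import comb, exp, pi, sqrt

def binomial_pmf(n, N, p):
    return comb(N, n) * p**n * (1 - p)**(N - n)

def normal_pdf(x, mu, var):
    return exp(-(x - mu)**2 / (2 * var)) / sqrt(2 * pi * var)

N, p = 1_000, 0.06                # e.g., 1,000 visitors, 6% conversion
mu, var = N * p, N * p * (1 - p)  # mu = Np, sigma^2 = Npq, q = 1 - p
for n in (50, 60, 70):
    print(f"n={n}: binomial={binomial_pmf(n, N, p):.5f}  "
          f"normal~{normal_pdf(n, mu, var):.5f}")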