SLIDE 1 A/B Testing in the Wild
Emily Robinson @robinson_es
SLIDE 2
Disclaimer: This talk represents my own views, not those of Etsy
SLIDE 3 Overview
INTRODUCTION CHALLENGES & LESSONS
Statistical Business Etsy A/B Testing
SLIDE 4
Etsy
SLIDE 5 Etsy is a global creative commerce platform. We build markets, services and economic opportunity for creative entrepreneurs.
SLIDE 6
Our Items
SLIDE 7 By The Numbers
1.8M
active sellers
AS OF MARCH 31, 2017
29.7M
active buyers
AS OF MARCH 31, 2017
$2.84B
annual GMS
IN 2016
45+M
items for sale
AS OF MARCH 31, 2017
Photo by Kirsty-Lyn Jameson
SLIDE 8
A/B Testing
SLIDE 9
What is A/B Testing?
SLIDE 10
Old Experience
SLIDE 11
New Feature
SLIDE 12
A/B Testing: It’s Everywhere
SLIDE 13
Highly Researched
SLIDE 14 My Perspective
- Millions of visitors daily
- Data Engineering
- Pipeline Set-Up
SLIDE 15
Generating numbers is easy; generating numbers you should trust is hard!
SLIDE 16 Why Statistics Anyway?
- “Election surveys are done with a few thousand people” [1]
- Targeting small effects
- A 0.5% relative change in conversion rate (e.g. 6% to 6.03%) on a high-traffic page can be worth millions of dollars annually
[1] Online Experimentation at Microsoft
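To see why such small effects matter, here is a back-of-the-envelope calculation. The traffic and order-value numbers below are hypothetical, chosen only to illustrate the arithmetic; they are not Etsy figures:

```python
# Back-of-the-envelope: value of a small relative lift in conversion.
# All numbers are hypothetical, for illustration only.
annual_visits = 500_000_000      # assumed yearly visits to a high-traffic page
conversion_rate = 0.06           # baseline: 6% of visits convert
avg_order_value = 30.0           # assumed average order value in dollars

baseline_revenue = annual_visits * conversion_rate * avg_order_value
lifted_revenue = annual_visits * (conversion_rate * 1.005) * avg_order_value

print(f"Incremental revenue: ${lifted_revenue - baseline_revenue:,.0f}")
# → Incremental revenue: $4,500,000
```

A 0.5% relative lift that is invisible to the eye is worth millions here, which is why experiments must be powered to detect effects that small.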
SLIDE 17
Example Experiment
SLIDE 18
Listing Card Experiment
SLIDE 19
Result
👏
SLIDE 20
Listing Card Experiment: Redux
🎊 👏 💰 👎
SLIDE 21
Statistical Challenges
SLIDE 22 Level of Analysis
Visit:
activity by browser over a defined time period (30 minutes)
Browser:
cookie or device ID (for apps)
User:
Signed-in user ID
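The 30-minute visit definition above amounts to a simple sessionization rule: consecutive events from one browser belong to the same visit until there is a gap of 30 minutes or more. A minimal sketch of that rule (illustrative only; Etsy's actual sessionization pipeline may differ):

```python
from datetime import datetime, timedelta

# "Visit" unit: activity by a browser until a 30-minute gap appears.
# Illustrative sketch, not Etsy's actual pipeline code.
VISIT_TIMEOUT = timedelta(minutes=30)

def sessionize(event_times):
    """Assign a visit number to each (sorted) event timestamp."""
    visits = []
    visit_id = 0
    for i, t in enumerate(event_times):
        if i > 0 and t - event_times[i - 1] >= VISIT_TIMEOUT:
            visit_id += 1  # gap of 30+ minutes starts a new visit
        visits.append(visit_id)
    return visits

events = [datetime(2017, 3, 31, 9, 0),
          datetime(2017, 3, 31, 9, 10),   # 10-minute gap -> same visit
          datetime(2017, 3, 31, 10, 0),   # 50-minute gap -> new visit
          datetime(2017, 3, 31, 10, 5)]
print(sessionize(events))  # → [0, 0, 1, 1]
```

One browser therefore maps to many visits, which is exactly why visit-level analysis can violate the independence assumption discussed below.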
SLIDE 23 Browser vs Visit: An Example
I really want my
SLIDE 24
SLIDE 25
Next Day
SLIDE 26 Pros and Cons
Visit:
Pro: Tighter attribution
Cons: Independence assumption violation, Cannibalization potential
Browser:
Pro: Captures relevant later behavior
Cons: Introduces noise, Misses multiple events for proportion metrics
Our conclusion: offer both, browser generally better
SLIDE 27 GMS per User
- Generally this is the key metric
- But it’s a very badly behaved distribution
- Highly skewed and strictly non-negative: can’t use t-test
- Many zeros: can’t log-transform
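A quick simulation shows how badly behaved this distribution is. The zero-inflated lognormal below is a toy model with made-up parameters, not real GMS data:

```python
import random

random.seed(42)

# Toy model of GMS per user: most users spend nothing, a few spend a lot.
# Zero-inflated with a heavy right tail; all parameters are hypothetical.
def simulate_gms(n_users, p_buyer=0.05):
    return [random.lognormvariate(3, 1.5) if random.random() < p_buyer else 0.0
            for _ in range(n_users)]

gms = simulate_gms(100_000)
zeros = sum(1 for g in gms if g == 0)
mean = sum(gms) / len(gms)

print(f"share of zeros: {zeros / len(gms):.1%}")   # log(0) is undefined
print(f"max / mean ratio: {max(gms) / mean:.0f}")  # extreme right skew
```

With roughly 95% zeros a log transform is impossible, and the extreme skew means the t-test's normality approximation is unreliable at realistic sample sizes.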
SLIDE 28
ACBV/ACVV
SLIDE 29 Definitions
- Power: the probability that, if there is an effect of a certain magnitude, we will detect it
- Bootstrap: random sampling with replacement
- Simulation: modeling random events
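The bootstrap definition above, as a minimal sketch: resample the observed data with replacement many times, and read a confidence interval off the percentiles of the resampled statistic. No normality assumption is needed:

```python
import random

random.seed(0)

# Bootstrap: random sampling with replacement, used here to put a
# confidence interval on the mean of skewed, zero-heavy data.
def bootstrap_ci(data, n_resamples=2000, alpha=0.05):
    means = []
    for _ in range(n_resamples):
        resample = random.choices(data, k=len(data))  # with replacement
        means.append(sum(resample) / len(resample))
    means.sort()
    lo = means[int(n_resamples * alpha / 2)]          # 2.5th percentile
    hi = means[int(n_resamples * (1 - alpha / 2))]    # 97.5th percentile
    return lo, hi

data = [0, 0, 0, 0, 5, 0, 0, 120, 0, 3, 0, 0, 40, 0, 0]  # skewed, many zeros
print(bootstrap_ci(data))
```

The resulting interval is asymmetric around the sample mean, reflecting the skew that a t-interval would ignore.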
SLIDE 30 Test Selection Process
Take Real Experiment → Estimate Power of Different Tests
SLIDE 31
Estimating Power
SLIDE 32 Test Selection Process
Take Real Experiment → Estimate Power of Different Tests → Estimate Power for Different Effect Sizes → Find Best Simulation Method
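Estimating power by simulation can be sketched as: inject a known lift into simulated data, run the test many times, and count how often it rejects at p < 0.05. The zero-inflated data model and the large-sample z-test below are illustrative stand-ins, not the actual data or tests compared in the talk:

```python
import math
import random

random.seed(1)

# Hypothetical zero-inflated spend data: 5% of visitors buy, spend is
# exponential; "lift" scales buyers' spend in the treatment group.
def zero_inflated_sample(n, p_buyer=0.05, lift=1.0):
    return [random.expovariate(1 / 30) * lift if random.random() < p_buyer else 0.0
            for _ in range(n)]

def z_test_reject(a, b):
    # Two-sample z-test on means (large-sample normal approximation).
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    z = (mb - ma) / math.sqrt(va / len(a) + vb / len(b))
    return abs(z) > 1.96  # two-sided alpha = 0.05

def estimate_power(lift, n=2000, n_sims=200):
    rejections = sum(z_test_reject(zero_inflated_sample(n),
                                   zero_inflated_sample(n, lift=lift))
                     for _ in range(n_sims))
    return rejections / n_sims

print(f"estimated power at a 20% lift: {estimate_power(1.2):.0%}")
```

Running the same loop with `lift=1.0` recovers the false positive rate (about 5%), a useful sanity check before comparing tests; swapping in a different test function lets you compare their power on the same simulated data.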
SLIDE 33
Simulation Method Comparison
SLIDE 34
Estimating Power
SLIDE 35
Power at 1% Increase in ACBV
SLIDE 36
Business Challenges
SLIDE 37
Working with Teams
SLIDE 38 Proactive Communication
Early involvement:
No post-mortems
Demonstrate value:
Prioritization, feasibility, sequencing
Develop relationship:
Understand teammates
SLIDE 39 Dealing with Adhoc Questions
Question:
What’s the conversion rate of visitors in Estonia on Saturday looking in the wedding category?
First Response:
What decision are you using this for?
SLIDE 40
Helps Avoid This
SLIDE 41
Checks Translation
SLIDE 42 We often joke that our job … is to tell our clients that their new baby is ugly
Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained
SLIDE 43 Business Partners & Experiments
- Financial and emotional investment
- Inaccurate expectations:
- Features are built because team believes they’re useful
- But the experiment success rate across the industry is less than (sometimes far less than) 50%
SLIDE 44 Peeking
Question: “What do the results mean?” Answer: “It’s been up for 15 minutes…”
SLIDE 45
SLIDE 46 Daily Experiment Updates
*This is a Made-up Example
Offers Interpretation
Shows You’re Monitoring
SLIDE 47
Want Fast Decision Making
SLIDE 48
Cost of Peeking: 5% False Positive Rate Inflates to 20%!
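The inflation from 5% to roughly 20% is easy to reproduce by simulation: run A/A tests (no true effect at all), peek at ten interim checkpoints, and act on the first nominally significant result. This is an illustrative sketch with normal data, not Etsy's real metrics:

```python
import math
import random

random.seed(7)

def significant(a, b):
    # Two-sample z-test on means; "significant" means two-sided p < 0.05.
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    z = (mb - ma) / math.sqrt(va / len(a) + vb / len(b))
    return abs(z) > 1.96

def false_positive(checkpoints, n_per_check=100):
    # A/A test: both arms draw from the SAME distribution (no true effect).
    a, b = [], []
    for _ in range(checkpoints):
        a += [random.gauss(0, 1) for _ in range(n_per_check)]
        b += [random.gauss(0, 1) for _ in range(n_per_check)]
        if significant(a, b):  # the peek: stop at the first "significant" look
            return True
    return False

n_sims = 400
fpr_peek = sum(false_positive(10) for _ in range(n_sims)) / n_sims
fpr_once = sum(false_positive(1) for _ in range(n_sims)) / n_sims
print(f"peeking 10 times: {fpr_peek:.1%}, testing once: {fpr_once:.1%}")
```

With ten peeks the simulated false positive rate lands near 20%, versus about 5% when testing once at the planned end, matching the slide. The three solutions that follow are different ways of paying down this cost.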
SLIDE 49 Solution 1: Adjust P-Value Threshold
Con: Not Rigorous
Pro: Easy to Interpret
SLIDE 50 Solution 2: “Outlaw” Peeking
Con: Miss Bugs
Pro: Correct Way
SLIDE 51 Solution 3: Continuous Monitoring
Con: Complicated to Implement & Explain
Pro: Peek and Stay Rigorous
SLIDE 52 And at the End of the Day …
From Julia Evans, @b0rk “How to be a Wizard Programmer”
SLIDE 53 Resources
- Controlled Experiments on the Web: Survey and Practical Guide
- Overlapping Experiment Infrastructure: More, Better, Faster Experimentation
- From Infrastructure to Culture: A/B Testing Challenges in Large Scale Social Networks
- What works in e-commerce – a meta-analysis of 6700 online experiments
- Online Controlled Experiments at Large-Scale
- Online Experimentation at Microsoft
- Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained
SLIDE 54 Acknowledgments
- Evan D’Agostini (for ACBV development & slides)
- Jack Perkins & Anastasia Erbe (former & fellow search analysts)
- Michael Berkowitz, Callie McRee, David Robinson, Bill Ulammandakh, & Dana Levin-Robinson (for presentation feedback)
- Etsy Analytics team
- Etsy Search UI & Search Ranking teams
SLIDE 55 Thank You
tiny.cc/abslides robinsones.github.io @robinson_es