SLIDE 1
The one weird trick for analyzing big data
Eyeball the data early and often!
John Lamping
A testimonial: Long-Read Assembly Achieves Optimal Repeat Resolution
Acknowledgements: GMK would also like to thank John Lamping of Human Longevity Inc.,
SLIDE 2
SLIDE 3
"Looking at data? How boring! That's not my job!"
SLIDE 4
Data is a window onto your domain
SLIDE 5
"Look at queries!"
SLIDE 6
Eyeballing queries at Google
[Australia]
Old Google
- Australia - Wikipedia
- Australia travel guide - Wikitravel
- Tourism Australia
- Latest Australia news | The Guardian
- Australia.gov.au
New Google
- Tourism Australia
- Austria - Lonely Planet
- Australia - Wikipedia
- Australia.gov.au
- Austria - Wikipedia
SLIDE 7
Eyeball the data early and often!
SLIDE 8
User session data
09:32:10 query [australia]
    Australia - Wikipedia
    Australia travel guide - Wikitravel
    Tourism Australia
    Latest Australia news | The Guardian
    Australia.gov.au
09:32:44 click position 3 Australia travel guide - Wikitravel
09:35:12 query [brisbane]
SLIDE 9
Eyeballing user sessions
[australia] (34)
    3: Australia travel guide - Wikitravel (2:28)
[brisbane] (20)
    4: Brisbane, Australia, Attractions - Tourism Australia (6:20)
[morton island] (20)
[moreton island] (40)
    3: Moreton Island - Visit Brisbane (3:12)
    8: Moreton island - Lonely Planet (50)
    7: Moreton Island National Park and Recreation Area (Department of ... (54:02)
[ayers rock] (30)
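The compact view above can be derived from the raw timestamped log on the previous slide. The following is a minimal sketch, assuming each event is a hypothetical `(time, kind, text)` tuple and that the number in parentheses is the time until the next event (seconds under a minute, otherwise m:ss). The function names and the tuple layout are illustrative assumptions, not the format used at Google.

```python
from datetime import datetime


def fmt_gap(seconds):
    """Format a gap as plain seconds under a minute, else m:ss."""
    seconds = int(seconds)
    if seconds < 60:
        return str(seconds)
    return f"{seconds // 60}:{seconds % 60:02d}"


def compact_session(events):
    """Turn (HH:MM:SS, kind, text) events into indented display lines.

    kind is 'query' or 'click' (hypothetical labels); for a click,
    text is 'position: title'.
    """
    times = [datetime.strptime(t, "%H:%M:%S") for t, _, _ in events]
    lines = []
    for i, (_, kind, text) in enumerate(events):
        # Time until the next event; unknown for the last event.
        if i + 1 < len(events):
            gap = f" ({fmt_gap((times[i + 1] - times[i]).total_seconds())})"
        else:
            gap = ""
        if kind == "query":
            lines.append(f"[{text}]{gap}")
        else:
            lines.append(f"    {text}{gap}")
    return lines


events = [
    ("09:32:10", "query", "australia"),
    ("09:32:44", "click", "3: Australia travel guide - Wikitravel"),
    ("09:35:12", "query", "brisbane"),
]
session_lines = compact_session(events)
print("\n".join(session_lines))
```

Putting one event per line and indenting clicks under their query lets you read a whole session at a glance instead of subtracting timestamps by hand.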
SLIDE 10
Which data items to eyeball?
Sample from the ones that matter.
SLIDE 11
Which data items to eyeball?
Sample proportionally to how much they matter.
- by difference
- by volume
- by value
- ...
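Sampling proportionally to importance can be sketched with the standard library. This is a minimal illustration, assuming a hypothetical list of `(query, volume)` pairs where volume is the weight; the data and the `weighted_sample` helper are made up for the example.

```python
import random


def weighted_sample(items, weight, k, seed=None):
    """Draw k items with replacement, each with probability
    proportional to weight(item)."""
    rng = random.Random(seed)
    weights = [weight(item) for item in items]
    return rng.choices(items, weights=weights, k=k)


# Hypothetical query log summary: (query, daily volume).
queries = [
    ("australia", 9000),
    ("brisbane", 900),
    ("moreton island", 90),
    ("morton island", 10),
]
sample = weighted_sample(queries, weight=lambda q: q[1], k=5, seed=0)
for query, volume in sample:
    print(query, volume)
```

Swapping the `weight` function (difference, volume, value) changes what "matters" without changing the sampling code.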
SLIDE 12
Selecting A/B testing differences
A A A A A A A A A A A A A A A B B B B B B B B B B B B B B B
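One way to select A/B differences for eyeballing is to keep only the items where the two arms disagree and then sample from those. The sketch below assumes hypothetical dictionaries mapping each query to its result list in arm A and arm B; the names and data are illustrative only.

```python
import random


def differing_queries(results_a, results_b):
    """Return the queries whose result lists differ between arm A and arm B."""
    shared = results_a.keys() & results_b.keys()
    return sorted(q for q in shared if results_a[q] != results_b[q])


# Hypothetical top results per query from each arm.
results_a = {
    "australia": ["Australia - Wikipedia", "Tourism Australia"],
    "brisbane": ["Brisbane - Wikipedia"],
}
results_b = {
    "australia": ["Tourism Australia", "Austria - Wikipedia"],
    "brisbane": ["Brisbane - Wikipedia"],
}

changed = differing_queries(results_a, results_b)
rng = random.Random(0)
# Eyeball a few of the changed queries side by side.
for query in rng.sample(changed, k=min(3, len(changed))):
    print(f"[{query}]")
    print("  A:", results_a[query])
    print("  B:", results_b[query])
```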
SLIDE 13
Eyeballing a weighted sample
Workers   Company
55,000    Rio Tinto
15,000    Telstra
45,000    Commonwealth Bank
22        The Gantry restaurant
Each sample is equally important.
Emily Jake Nick Olivia
SLIDE 14
Eyeballing a weighted sample
Workers   Company
55,000    Rio Tinto
45        The Paddington
122       Peppers Gallery Hotel
5         Melbourne Dry Cleaners
Each sample is equally important.
SLIDE 15
What do we do when we should be eyeballing?
SLIDE 16
What does a new Google engineer do?
Tweak parameters to optimize metrics.
- Revenue
- Clicks
SLIDE 17
Revenue?
SLIDE 18
Revenue?
Google Search: Australia
- Australia - Wikipedia
- Tourism Australia
- Australia travel guide
- Australia - The World Factbook
- Australia.gov.au
Yahoo! Search: Australia
- Australia - Wikipedia
- Tourism Australia
- Australia travel guide
Junky ads
- Buy Australia on Ebay
- The Vegemite store
SLIDE 19
Clicks?
The one weird trick for analyzing big data
SLIDE 20
The lure of metrics
Metrics capture only a part of the picture.
20 trees
SLIDE 21
Watch out for accidental patterns.
4 4 4 6 9 3 2 1 6 3 9 8 8 4 1 1 6 8 4 1 3 5 5 7 7 3 0 3 0 8
SLIDE 22
A sad tale of not eyeballing data often enough
SLIDE 23
I was good
I eyeballed the documents.
I eyeballed the hierarchy.
SLIDE 24
Document hierarchy data
❏ attendance faith leader prayer finances
❏ attendance faith priest church bible
❏ faith bible leader torah synagogue
SLIDE 25
Eyeballing a document hierarchy
word significance = frequency * (difference from parent node frequency)
                  = frequency * log(frequency / frequency in parent)
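The significance score can be sketched in a few lines. This assumes the slide's formula reads as frequency * log(frequency / frequency in parent), with "difference from parent node frequency" taken as a description of the log term; that reading, the function names, and the frequencies below are assumptions for illustration.

```python
import math


def word_significance(node_freq, parent_freq):
    """Score each word in a node as freq * log(freq / parent freq).

    Words with zero frequency in the node or the parent are skipped,
    since the log ratio is undefined for them.
    """
    scores = {}
    for word, freq in node_freq.items():
        parent = parent_freq.get(word, 0.0)
        if freq > 0 and parent > 0:
            scores[word] = freq * math.log(freq / parent)
    return scores


def top_words(node_freq, parent_freq, n=5):
    """Return the n highest-scoring words to use as the node's label."""
    scores = word_significance(node_freq, parent_freq)
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Hypothetical relative word frequencies for a parent node and one child.
parent = {"faith": 0.04, "attendance": 0.05, "church": 0.02, "torah": 0.005}
child = {"faith": 0.03, "attendance": 0.05, "church": 0.06, "torah": 0.001}

print(top_words(child, parent, n=2))
```

Words that are merely frequent everywhere (like "attendance" here) score zero, while words that are both frequent and more common than in the parent rise to the top of the label.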
SLIDE 26
Eyeballing a document hierarchy
❏ faith prayer minister church priest
❏ bible minister church pope jesus
❏ synagogue muslim torah temple kosher
SLIDE 27
I was (mostly) good
I eyeballed the documents.
I eyeballed the hierarchy.
I wrote a large-scale quality test metric.
Whenever a change reduced the quality metric, I fixed it.
But I didn't eyeball the differences in the hierarchy.
SLIDE 28
Eyeball the data early and often!
When something changes, look at the data again.
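Looking at the data again after a change is easier if only the differences are shown. A minimal sketch, assuming each version of the hierarchy is a hypothetical dictionary from node id to its list of label words; the node ids and labels are invented for the example.

```python
def label_changes(before, after):
    """Compare label words per node between two versions of a hierarchy.

    Returns only the nodes whose labels changed, with the words that
    were dropped and the words that were added.
    """
    changes = {}
    for node in sorted(before.keys() | after.keys()):
        old = before.get(node, [])
        new = after.get(node, [])
        if old != new:
            changes[node] = {
                "dropped": [w for w in old if w not in new],
                "added": [w for w in new if w not in old],
            }
    return changes


# Hypothetical labels before and after a code change.
before = {
    "root": ["faith", "prayer", "minister"],
    "root/1": ["synagogue", "torah", "temple"],
}
after = {
    "root": ["faith", "prayer", "finances"],
    "root/1": ["synagogue", "torah", "temple"],
}

for node, diff in label_changes(before, after).items():
    print(node, "dropped:", diff["dropped"], "added:", diff["added"])
```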
SLIDE 29
Two plausible alternatives
Eyeball the data early and often.
Catholic Protestant Other Other Christian
SLIDE 30
SLIDE 31
DNA sequencing
SLIDE 32
DNA sequencing
T G G A A G G T C C C A T T T G A C G G T T G G G G T T G G C A A G G T C C C A T T T G
SLIDE 33
DNA sequencing
G G T T G G C A A G G T C C C A T T T G
G G T T G C C A A G G A C C C A T T T G
G G T T G C A A G G G T C C C A T T T G
G G T T G G C A A G G G A T A A C G T A
SLIDE 34
Our code's task:
T T G G C A A G G
T T G G A A A G G
SLIDE 35
C A A C A T T G G A A G G T C C A C A
Sequences show variants
T T T T T T T T G G C G G G G G C C A A A A A A A A C A G G G G G G T T C C C A T A A C C A A A C
SLIDE 36
Not working
We ran it against data with known variants. It found most of them. But it missed many of them.
SLIDE 37
What to do?
Print statements? Tracing? Tweak some parameters!
SLIDE 38
Eyeball the data early and often!
When your analysis sees something, eyeball data for a few examples of it.
SLIDE 39
Some data
Sequence 4205
Position 462 T
Position 463 T
Position 464 G
Position 465 G
Position 466 C
Position 467 A
Position 468 A
Position 469 G
Position 470 G
SLIDE 40
More data
Sequence 2602
Position 144 T
Position 145 T
Position 146 A
Position 147 G
Position 148 C
Position 149 A
Position 150 A
Position 151 G
Position 152 G
SLIDE 41
Eyeball the data early and often!
Would you see the problem?
SLIDE 42
A little formatting to support visualization
Sequence 4205 T T G G C A A G G
Sequence 2602 T T A G C A A G G
Sequence 4403 T T G G C A A A G
Sequence 0605 T T G G C A A G G
Sequence 3878 T T G G C A G G A
Sequence 4138 T T G G C A A G G
Sequence 4942 A T T G C A A G G
Sequence 1319 T T G G T A A G G
Sequence 2251 T T G G C A A G G
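A row-per-sequence layout like this takes only a few lines of code. The sketch below goes one step beyond the slide, as an assumption of what might help: it computes a per-column consensus and prints matching bases as ".", so only the differences stand out. The helper names are hypothetical and the reads are the first five from the slide.

```python
from collections import Counter


def consensus(seqs):
    """Most common base in each column of equal-length sequences."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def format_alignment(named_seqs):
    """One row per sequence; bases that match the consensus become '.'."""
    cons = consensus([seq for _, seq in named_seqs])
    rows = [f"{'consensus':<14}{' '.join(cons)}"]
    for name, seq in named_seqs:
        marked = [base if base != c else "." for base, c in zip(seq, cons)]
        rows.append(f"{name:<14}{' '.join(marked)}")
    return rows


reads = [
    ("Sequence 4205", "TTGGCAAGG"),
    ("Sequence 2602", "TTAGCAAGG"),
    ("Sequence 4403", "TTGGCAAAG"),
    ("Sequence 0605", "TTGGCAAGG"),
    ("Sequence 3878", "TTGGCAGGA"),
]
rows = format_alignment(reads)
print("\n".join(rows))
```

The same idea applies to any columnar data: line the items up and suppress what is expected, so the eye goes straight to what is not.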
SLIDE 43
An expert's visualization
SLIDE 44
Eyeballing data reveals ...
SLIDE 45
Eyeballing data reveals ... the best bug of my career.
SLIDE 46
Eyeball the data early and often! Eyeball the data early and often! Eyeball the data early and often!
SLIDE 47