The one weird trick for analyzing big data: Eyeball the data early and often!



SLIDE 1

The one weird trick for analyzing big data

John Lamping Eyeball the data early and often!

SLIDE 2

A testimonial:

Long-Read Assembly Achieves Optimal Repeat Resolution

Acknowledgements: GMK would also like to thank John Lamping of Human Longevity Inc., chatting with whom drove him to take a data-driven approach to this project.

SLIDE 3

"Looking at data? How boring! That's not my job!"

SLIDE 4

Data is a window onto your domain

SLIDE 5

"Look at queries!"

SLIDE 6

Eyeballing queries at Google

Query: [Australia]

Old Google

  • Australia - Wikipedia
  • Australia travel guide - Wikitravel
  • Tourism Australia
  • Latest Australia news | The Guardian
  • Australia.gov.au

New Google

  • Tourism Australia
  • Austria - Lonely Planet
  • Australia - Wikipedia
  • Australia.gov.au
  • Austria - Wikipedia

SLIDE 7

Eyeball the data early and often!

SLIDE 8

User session data

09:32:10 query [australia]
    Australia - Wikipedia
    Australia travel guide - Wikitravel
    Tourism Australia
    Latest Australia news | The Guardian
    Australia.gov.au
09:32:44 click position 3 Australia travel guide - Wikitravel
09:35:12 query [brisbane]

SLIDE 9

Eyeballing user sessions

[australia] (34)
    3: Australia travel guide - Wikitravel (2:28)
[brisbane] (20)
    4: Brisbane, Australia, Attractions - Tourism Australia (6:20)
[morton island] (20)
[moreton island] (40)
    3: Moreton Island - Visit Brisbane (3:12)
    8: Moreton island - Lonely Planet (50)
    7: Moreton Island National Park and Recreation Area (Department of ... (54:02)
[ayers rock] (30)
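The compact view above can be produced mechanically from the raw log on the previous slide: each query or click is annotated with the time until the next event (34 seconds from the [australia] query to the click, 2:28 from the click to the next query). A minimal sketch, assuming a hypothetical `(timestamp, kind, text)` event format; `compact_session` and `fmt_gap` are illustrative names, not from the talk.

```python
from datetime import datetime

def fmt_gap(seconds):
    """Format a gap as plain seconds, or m:ss once it reaches a minute."""
    seconds = int(seconds)
    if seconds < 60:
        return str(seconds)
    return f"{seconds // 60}:{seconds % 60:02d}"

def compact_session(events):
    """Render (timestamp, kind, text) events one per line with dwell times.

    kind is "query" or "click" (assumed format). The last event has no
    successor, so it is printed without a duration.
    """
    lines = []
    for i, (ts, kind, text) in enumerate(events):
        label = f"[{text}]" if kind == "query" else f"    {text}"
        if i + 1 < len(events):
            t0 = datetime.strptime(ts, "%H:%M:%S")
            t1 = datetime.strptime(events[i + 1][0], "%H:%M:%S")
            label += f" ({fmt_gap((t1 - t0).total_seconds())})"
        lines.append(label)
    return lines

# The three events from the raw session log.
events = [
    ("09:32:10", "query", "australia"),
    ("09:32:44", "click", "3: Australia travel guide - Wikitravel"),
    ("09:35:12", "query", "brisbane"),
]
session_view = compact_session(events)
print("\n".join(session_view))
```

Dropping the unclicked results and replacing timestamps with durations is what makes a whole session fit on one screen.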

SLIDE 10

Which data items to eyeball?

Sample from the ones that matter.

SLIDE 11

Which data items to eyeball?

Sample proportionally to how much they matter.

  • by difference
  • by volume
  • by value
  • ...
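Sampling proportionally to importance can be sketched with the standard library. The example below weights companies by worker count, using figures that appear later in the talk; the helper name `weighted_sample` and the choice to sample with replacement are assumptions made for illustration.

```python
import random

def weighted_sample(items, weight, k, seed=0):
    """Draw k items with replacement, with probability proportional to weight(item)."""
    rng = random.Random(seed)
    weights = [weight(item) for item in items]
    return rng.choices(items, weights=weights, k=k)

companies = [
    ("Rio Tinto", 55_000),
    ("Telstra", 15_000),
    ("Commonwealth Bank", 45_000),
    ("The Gantry restaurant", 22),
]
sample = weighted_sample(companies, weight=lambda c: c[1], k=8)
for name, workers in sample:
    print(f"{workers:>7,}  {name}")
```

Swapping the weight function (difference, volume, value) changes what "matters" without changing the sampling code.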
SLIDE 12

Selecting A/B testing differences

A A A A A A A A A A A A A A A B B B B B B B B B B B B B B B
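For an A/B test, sampling "by difference" might mean eyeballing only the items where the two arms disagree. A rough sketch, assuming results are stored as a hypothetical dict from query to ranked list; the [australia] lists are abbreviated from the Old/New Google slide, and the [brisbane] entry is made up to show an unchanged query being skipped.

```python
import random

def differing_queries(results_a, results_b):
    """Return queries whose ranked result lists differ between arm A and arm B."""
    return sorted(q for q in results_a if results_a[q] != results_b.get(q))

def sample_differences(results_a, results_b, k, seed=0):
    """Sample up to k differing queries to eyeball side by side."""
    diffs = differing_queries(results_a, results_b)
    return random.Random(seed).sample(diffs, min(k, len(diffs)))

arm_a = {
    "australia": ["Australia - Wikipedia", "Australia travel guide - Wikitravel", "Tourism Australia"],
    "brisbane": ["Brisbane - Wikipedia"],  # hypothetical entry, identical in both arms
}
arm_b = {
    "australia": ["Tourism Australia", "Austria - Lonely Planet", "Australia - Wikipedia"],
    "brisbane": ["Brisbane - Wikipedia"],  # hypothetical entry, identical in both arms
}
picked = sample_differences(arm_a, arm_b, k=5)
for query in picked:
    print(f"[{query}]")
    for a, b in zip(arm_a[query], arm_b[query]):
        print(f"  A: {a:<40} B: {b}")
```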

SLIDE 13

Eyeballing a weighted sample

Workers   Company
55,000    Rio Tinto
15,000    Telstra
45,000    Commonwealth Bank
22        The Gantry restaurant

each sample is equally important.

Emily Jake Nick Olivia Emily Jake Nick Olivia

SLIDE 14

Eyeballing a weighted sample

Workers   Company
55,000    Rio Tinto
45        The Paddington
122       Peppers Gallery Hotel
5         Melbourne Dry Cleaners

each sample is equally important.

SLIDE 15

What do we do when we should be eyeballing?

SLIDE 16

What does a new Google engineer do?

Tweak parameters to optimize metrics.

  • Revenue
  • Clicks
SLIDE 17

Revenue?

SLIDE 18

Revenue?

Query: [Australia]

Google Search

  • Australia - Wikipedia
  • Tourism Australia
  • Australia travel guide
  • Australia - The World Factbook
  • Australia.gov.au

Yahoo! Search

  • Australia - Wikipedia
  • Tourism Australia
  • Australia travel guide

Junky ads

  • Buy Australia on Ebay
  • The Vegemite store

SLIDE 19

Clicks?

The one weird trick for analyzing big data

SLIDE 20

The lure of metrics

Metrics capture only a part of the picture.

20 trees

SLIDE 21

Watch out for accidental patterns.

4 4 4 6 9 3 2 1 6 3 9 8 8 4 1 1 6 8 4 1 3 5 5 7 7 3 0 3 0 8
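One way to calibrate against accidental patterns is to check how often a "striking" pattern shows up in pure noise. The sketch below assumes the digits on the slide are meant as random draws and estimates, by simulation, how often 30 random digits contain three equal digits in a row (as the opening 4 4 4 does). The function names and the choice of pattern are illustrative.

```python
import random

def has_run(digits, length=3):
    """True if the sequence contains `length` equal digits in a row."""
    run = 1
    for prev, cur in zip(digits, digits[1:]):
        run = run + 1 if cur == prev else 1
        if run >= length:
            return True
    return False

def run_probability(n_digits=30, length=3, trials=20_000, seed=0):
    """Monte Carlo estimate of the chance that random digits contain such a run."""
    rng = random.Random(seed)
    hits = sum(
        has_run([rng.randrange(10) for _ in range(n_digits)], length)
        for _ in range(trials)
    )
    return hits / trials

slide_digits = [int(d) for d in "444693216398841168413557730308"]
print(has_run(slide_digits))
estimate = run_probability()
print(f"{estimate:.2f}")
```

If the estimate is far from zero, a run like that is not evidence of anything by itself.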

SLIDE 22

A sad tale of not eyeballing data often enough

SLIDE 23

I was good

I eyeballed the documents. I eyeballed the hierarchy.

SLIDE 24

Document hierarchy data

❏ attendance faith leader prayer finances
❏ attendance faith priest church bible
❏ faith bible leader torah synagogue

SLIDE 25

Eyeballing a document hierarchy

word significance = frequency * difference from parent node frequency
                  = frequency * log(frequency / frequency in parent)
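One reading of the formula, taken here as an assumption, is that the "difference from parent node frequency" is the log ratio, giving significance = f * log(f / f_parent) with f the word's relative frequency in the node. A minimal sketch of that reading; the token lists are made up for illustration and the function names are not from the talk.

```python
import math
from collections import Counter

def relative_freq(words):
    """Relative frequency of each word in a list of tokens."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def word_significance(node_words, parent_words):
    """Score each node word by f * log(f / f_parent) (assumed reading of the slide)."""
    node = relative_freq(node_words)
    parent = relative_freq(parent_words)
    return {w: f * math.log(f / parent[w]) for w, f in node.items() if w in parent}

def top_words(node_words, parent_words, n=5):
    """The n highest-scoring words, ties broken alphabetically."""
    scores = word_significance(node_words, parent_words)
    return sorted(scores, key=lambda w: (-scores[w], w))[:n]

# Hypothetical tokens: the parent contains the child's documents plus others.
child = ["faith"] * 2 + ["torah"] * 3 + ["synagogue"] * 3
parent = child + ["faith"] * 6 + ["church"] * 4 + ["bible"] * 2
scores = word_significance(child, parent)
print(top_words(child, parent, n=2))
```

Under this reading, a word like "faith" that is common everywhere scores low or negative, while words concentrated in the node rise to the top.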

SLIDE 26

Eyeballing a document hierarchy

❏ faith prayer minister church priest
❏ bible minister church pope jesus
❏ synagogue muslim torah temple kosher

SLIDE 27

I was good, mostly

I eyeballed the documents.
I eyeballed the hierarchy.
I wrote a large-scale quality test metric.
Whenever a change reduced the quality metric, I fixed it.
But I didn't eyeball the difference to the hierarchy.

SLIDE 28

Eyeball the data early and often!

When something changes, look at the data again.
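Looking again after a change is easier with a small diff of what each node shows before and after. A sketch, assuming each snapshot is a hypothetical dict from node id to its top words; the node ids and the changed word are invented for illustration.

```python
def diff_top_words(before, after):
    """Map each changed node to (words dropped, words added) between two snapshots."""
    changes = {}
    for node in sorted(set(before) | set(after)):
        old, new = before.get(node, []), after.get(node, [])
        dropped = [w for w in old if w not in new]
        added = [w for w in new if w not in old]
        if dropped or added:
            changes[node] = (dropped, added)
    return changes

before = {
    "node-1": ["faith", "prayer", "minister", "church", "priest"],
    "node-2": ["bible", "minister", "church", "pope", "jesus"],
}
after = {
    "node-1": ["faith", "prayer", "minister", "church", "priest"],
    "node-2": ["bible", "minister", "church", "finances", "jesus"],  # hypothetical change
}
changes = diff_top_words(before, after)
for node, (dropped, added) in changes.items():
    print(f"{node}: -{dropped} +{added}")
```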

SLIDE 29

Two plausible alternatives

Eyeball the data early and often.

Catholic Protestant Other Other Christian

SLIDE 30
SLIDE 31

DNA sequencing

SLIDE 32

DNA sequencing

T G G A A G G T C C C A T T T G A C G G T T G G G G T T G G C A A G G T C C C A T T T G

SLIDE 33

DNA sequencing

G G T T G G C A A G G T C C C A T T T G
G G T T G C C A A G G A C C C A T T T G
G G T T G C A A G GGT C C C A T T T G
G G T T G G C A A G G G A T A A C G T A

SLIDE 34

Our code's task:

T T G G C A A G G
T T G G A A A G G

SLIDE 35

C A A C A T T G G A A G G T C C A C A

Sequences show variants

T T T T T T T T G G C G G G G G C C A A A A A A A A C A G G G G G G T T C C C A T A A C C A A A C

SLIDE 36

Not working

We ran it against data with known variants. It found most of them. But it missed many of them.
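A check like this naturally yields a list of misses, and those are the examples worth eyeballing first. A minimal sketch that splits calls into found, missed, and unexpected; the `(sequence id, position)` pairs are hypothetical and `split_calls` is an illustrative name.

```python
def split_calls(known, called):
    """Partition variant positions into found, missed, and unexpected lists."""
    known, called = set(known), set(called)
    return {
        "found": sorted(known & called),
        "missed": sorted(known - called),
        "unexpected": sorted(called - known),
    }

# Hypothetical (sequence id, position) pairs, for illustration only.
known = [(4205, 466), (2602, 146), (4403, 469), (3878, 468)]
called = [(4205, 466), (4403, 469), (1319, 466)]
report = split_calls(known, called)
for kind, items in report.items():
    print(f"{kind}: {items}")
```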

SLIDE 37

What to do?

Print statements? Tracing? Tweak some parameters!

SLIDE 38

Eyeball the data early and often!

When your analysis sees something, eyeball data for a few examples of it.

SLIDE 39

Some data

Sequence 4205
Position 462 T
Position 463 T
Position 464 G
Position 465 G
Position 466 C
Position 467 A
Position 468 A
Position 469 G
Position 470 G

SLIDE 40

More data

Sequence 2602
Position 144 T
Position 145 T
Position 146 A
Position 147 G
Position 148 C
Position 149 A
Position 150 A
Position 151 G
Position 152 G

SLIDE 41

Eyeball the data early and often!

Would you see the problem?

SLIDE 42

A little formatting to support visualization

Sequence 4205 T T G G C A A G G
Sequence 2602 T T A G C A A G G
Sequence 4403 T T G G C A A A G
Sequence 0605 T T G G C A A G G
Sequence 3878 T T G G C A G G A
Sequence 4138 T T G G C A A G G
Sequence 4942 A T T G C A A G G
Sequence 1319 T T G G T A A G G
Sequence 2251 T T G G C A A G G
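The row-per-sequence layout can be pushed one step further by hiding every base that agrees with the column majority, so only the differences stand out. A sketch using the nine sequences from this slide; the dot-for-match convention and the function names are assumptions, not necessarily what the talk's tooling did.

```python
from collections import Counter

def consensus(rows):
    """Most common base in each column of equal-length rows."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*rows.values())]

def render(rows):
    """One line per sequence; bases matching the column consensus shown as '.'."""
    cons = consensus(rows)
    lines = [f"{'consensus':<14}" + " ".join(cons)]
    for name, bases in rows.items():
        label = f"Sequence {name}"
        marked = [b if b != c else "." for b, c in zip(bases, cons)]
        lines.append(f"{label:<14}" + " ".join(marked))
    return lines

rows = {
    "4205": "TTGGCAAGG",
    "2602": "TTAGCAAGG",
    "4403": "TTGGCAAAG",
    "0605": "TTGGCAAGG",
    "3878": "TTGGCAGGA",
    "4138": "TTGGCAAGG",
    "4942": "ATTGCAAGG",
    "1319": "TTGGTAAGG",
    "2251": "TTGGCAAGG",
}
lines = render(rows)
print("\n".join(lines))
```

With matches suppressed, a sequence such as 4942 shows two adjacent mismatches, which is the kind of detail that is invisible in the one-position-per-line dump.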

SLIDE 43

An expert's visualization

SLIDE 44

Eyeballing data reveals ...

SLIDE 45

Eyeballing data reveals ... the best bug of my career.

SLIDE 46

Eyeball the data early and often! Eyeball the data early and often! Eyeball the data early and often!

SLIDE 47

Have fun doing it!