NOT EXACTLY! APPROXIMATE ALGORITHMS FOR BIG DATA

slide-1
SLIDE 1

NOT EXACTLY! APPROXIMATE ALGORITHMS FOR BIG DATA

FANGJIN YANG · DRUID COMMITTER · METAMARKETS
NELSON RAY · QUANTITATIVE ANALYST · GOOGLE

slide-2
SLIDE 2

OVERVIEW

THE PROBLEM: MANAGE DATA COST EFFICIENTLY
THE DATA: DEALING WITH EVENT STREAMS
SIMPLIFYING STORAGE: DATA SUMMARIZATION
FINDING UNIQUES: HYPERLOGLOG
ESTIMATING DISTRIBUTION: APPROXIMATE HISTOGRAMS

slide-3
SLIDE 3

THE PROBLEM

slide-4
SLIDE 4

Fangjin Yang & Nelson Ray 2014

Real-time Bidding

slide-5
SLIDE 5

PROBLEMS

  • Storing/processing billions of rows is expensive
  • Reduce storage, improve performance
  • Reduce storage by throwing away information
  • Throwing away information reduces accuracy
slide-6
SLIDE 6

THE DATA

slide-7
SLIDE 7

THE DATA

Timestamp               Bid Price
2013-10-28T02:13:43Z    1.19
2013-10-28T02:14:21Z    0.05
2013-10-28T02:55:32Z    1.04
2013-10-28T03:07:28Z    0.16
2013-10-28T03:13:43Z    1.03
2013-10-28T04:18:19Z    0.15
2013-10-28T05:36:34Z    0.01
2013-10-28T05:37:59Z    1.03

slide-8
SLIDE 8

DATA SUMMARIZATION

Timestamp               Bid Price
2013-10-28T02:13:43Z    1.19
2013-10-28T02:14:21Z    0.05
2013-10-28T02:55:32Z    1.04
2013-10-28T03:07:28Z    0.16
2013-10-28T03:13:43Z    1.03
2013-10-28T04:18:19Z    0.15
2013-10-28T05:36:34Z    0.01
2013-10-28T05:37:59Z    1.03

Timestamp        Revenue    Number of Prices
2013-10-28T02    2.28       3
2013-10-28T03    1.19       2
2013-10-28T04    0.15       1
2013-10-28T05    1.04       2
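The hourly roll-up on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not how Druid implements it; the function name `rollup_hourly` and the choice to truncate timestamps by string slicing are assumptions made for the example.

```python
from collections import defaultdict

# Raw events from the slide: (ISO-8601 timestamp, bid price).
events = [
    ("2013-10-28T02:13:43Z", 1.19),
    ("2013-10-28T02:14:21Z", 0.05),
    ("2013-10-28T02:55:32Z", 1.04),
    ("2013-10-28T03:07:28Z", 0.16),
    ("2013-10-28T03:13:43Z", 1.03),
    ("2013-10-28T04:18:19Z", 0.15),
    ("2013-10-28T05:36:34Z", 0.01),
    ("2013-10-28T05:37:59Z", 1.03),
]

def rollup_hourly(events):
    """Group events by hour, keeping only the price sum and the event count."""
    summary = defaultdict(lambda: [0.0, 0])
    for timestamp, price in events:
        hour = timestamp[:13]          # hypothetical truncation, e.g. "2013-10-28T02"
        summary[hour][0] += price
        summary[hour][1] += 1
    # Round to cents so float noise does not leak into the summary.
    return {hour: (round(total, 2), count)
            for hour, (total, count) in sorted(summary.items())}

hourly = rollup_hourly(events)
for hour, (revenue, count) in hourly.items():
    print(hour, revenue, count)
```

Eight raw rows collapse to four summarized rows; the individual prices are no longer recoverable.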

slide-9
SLIDE 9

COMBINING SUMMARIZATIONS

Timestamp        Revenue    Number of Prices
2013-10-28T02    2.28       3
2013-10-28T03    1.19       2
2013-10-28T04    0.15       1
2013-10-28T05    1.04       2

Timestamp        Revenue    Number of Prices
2013-10-28       4.66       8
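Sums and counts combine without touching the raw events, which is why the hourly rows can be folded into a daily row. A minimal sketch, assuming the hourly summaries are held in a plain dictionary (the name `combine_daily` is hypothetical):

```python
# Hourly summaries from the slide: hour -> (revenue, number of prices).
hourly = {
    "2013-10-28T02": (2.28, 3),
    "2013-10-28T03": (1.19, 2),
    "2013-10-28T04": (0.15, 1),
    "2013-10-28T05": (1.04, 2),
}

def combine_daily(hourly):
    """Fold hourly (revenue, count) pairs into one pair per day."""
    daily = {}
    for hour, (revenue, count) in hourly.items():
        day = hour[:10]                      # e.g. "2013-10-28"
        day_revenue, day_count = daily.get(day, (0.0, 0))
        daily[day] = (day_revenue + revenue, day_count + count)
    # Round to cents so float noise does not leak into the result.
    return {day: (round(r, 2), c) for day, (r, c) in daily.items()}

daily = combine_daily(hourly)
print(daily)
```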

slide-10
SLIDE 10

slide-11
SLIDE 11

SUMMARIZATION SUMMARY

  • Throw away information about individual events
  • Drastically reduce storage and improve query speed
  • On average, a 40x reduction in storage with our own data
  • We’ve lost info about individual prices
  • Data summarization is not always trivial

slide-12
SLIDE 12

CASE STUDY 1

slide-13
SLIDE 13

CASE STUDY 1

  • Problem: determine the number of unique elements in a set
  • Use case: measuring the number of unique users

slide-14
SLIDE 14

EXACT SOLUTION

  • Store every single username (in a Java HashSet)
  • No loss of information, no accuracy tradeoff

slide-15
SLIDE 15

HASHSET

Timestamp               Username
2013-10-28T02:13:43Z    user1
2013-10-28T02:14:21Z    user2
2013-10-28T02:55:32Z    user1
2013-10-28T03:07:28Z    user4
2013-10-28T03:13:43Z    user97
2013-10-28T04:18:19Z    user2
2013-10-28T05:36:34Z    user9834
2013-10-28T05:37:59Z    user97

Timestamp        Usernames
2013-10-28T02    {user1, user2}
2013-10-28T03    {user4, user97}
2013-10-28T04    {user2}
2013-10-28T05    {user9834, user97}

slide-16
SLIDE 16

HASHSET

Timestamp        Usernames
2013-10-28T02    {user1, user2}
2013-10-28T03    {user4, user97}
2013-10-28T04    {user2}
2013-10-28T05    {user9834, user97}

Timestamp        Usernames
2013-10-28       {user1, user2, user4, user97, user9834}
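The exact approach on these two slides amounts to keeping one set per hour and taking the union for the day. A small sketch using Python sets in place of the Java HashSet mentioned earlier (the helper name `hourly_user_sets` is made up for this example):

```python
from collections import defaultdict

# Events from the slide: (ISO-8601 timestamp, username).
events = [
    ("2013-10-28T02:13:43Z", "user1"),
    ("2013-10-28T02:14:21Z", "user2"),
    ("2013-10-28T02:55:32Z", "user1"),
    ("2013-10-28T03:07:28Z", "user4"),
    ("2013-10-28T03:13:43Z", "user97"),
    ("2013-10-28T04:18:19Z", "user2"),
    ("2013-10-28T05:36:34Z", "user9834"),
    ("2013-10-28T05:37:59Z", "user97"),
]

def hourly_user_sets(events):
    """Keep every distinct username seen in each hour."""
    sets = defaultdict(set)
    for timestamp, user in events:
        sets[timestamp[:13]].add(user)
    return dict(sets)

hourly = hourly_user_sets(events)
daily_users = set().union(*hourly.values())   # exact union, duplicates collapse
print(len(daily_users), sorted(daily_users))
```

The answer is exact, but every username has to be stored, so memory grows with the number of uniques.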

slide-17
SLIDE 17

EXACT SOLUTION

  • Storage/Computation: O(# uniques)
  • We’re not throwing away any information about usernames
  • Accuracy: 100%

slide-18
SLIDE 18

INFEASIBLE STORAGE

  • High cardinality user dimensions == infeasible storage
  • Storage cost for 10^9 unique elements == ~48GB of storage

slide-19
SLIDE 19

CARDINALITY ESTIMATION

  • Plenty of literature
  • Linear Counting
  • Count-Min Sketch
  • LogLog

slide-20
SLIDE 20

HYPERLOGLOG

  • Storage: 1.5 KB (for cardinalities 10^9 and above)
  • 99.999997% decrease in storage size
  • Computation: O(1) (for cardinalities < ~10^10)
  • Accuracy: 97%
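The storage percentage can be checked with back-of-envelope arithmetic against the ~48GB exact-solution figure quoted earlier in the deck. The sketch below assumes decimal units (1 KB = 1,000 bytes, 1 GB = 10^9 bytes), which the slides do not state.

```python
# Figures quoted in the deck; decimal units are an assumption of this sketch.
exact_bytes = 48e9      # ~48 GB to store 10^9 unique elements exactly
uniques = 1e9
hll_bytes = 1.5e3       # 1.5 KB for a HyperLogLog object

bytes_per_element = exact_bytes / uniques
decrease_percent = (1 - hll_bytes / exact_bytes) * 100

print(f"{bytes_per_element:.0f} bytes per element")
print(f"{decrease_percent:.6f}% decrease in storage")
```

Under these assumptions the exact solution costs about 48 bytes per element, and the decrease rounds to the 99.999997% shown on the slide.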

slide-21
SLIDE 21

HASH FUNCTIONS

  • Maps a value in one space (generally larger) to a value in another space (generally smaller)

String -> HashFn -> 0001

slide-22
SLIDE 22

WHAT MAKES A GOOD HASH FUNCTION?

  • Each bit of the output value is independent and equally likely to be 0 or 1 (50% each)

String -> HashFn -> 0xxx (50% probability)
String -> HashFn -> 1xxx (50% probability)
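The 50% claim is easy to check empirically. The sketch below uses SHA-256 from Python's standard library as a stand-in for "HashFn"; the deck does not say which hash function is used, so that choice, the 32-bit truncation, and the `user{i}` inputs are all assumptions.

```python
import hashlib

def hash_bits(value):
    """Return the first 32 bits of SHA-256(value) as a string of '0'/'1'."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return format(int.from_bytes(digest[:4], "big"), "032b")

samples = [hash_bits(f"user{i}") for i in range(10_000)]

# Fraction of hashes whose first bit is 1, and whose first two bits are 00.
starts_with_1 = sum(bits[0] == "1" for bits in samples) / len(samples)
starts_with_00 = sum(bits.startswith("00") for bits in samples) / len(samples)

print(f"1xxx: {starts_with_1:.3f}  00xx: {starts_with_00:.3f}")
```

With a well-behaved hash the two fractions should land near 0.5 and 0.25 respectively.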

slide-23
SLIDE 23

HASHING TWO STRINGS

user1 -> HashFn -> 0xxx
user2 -> HashFn -> 1xxx

slide-24
SLIDE 24

THE NEXT BIT

String -> HashFn -> 00xx (25% probability)
String -> HashFn -> 10xx (25% probability)
String -> HashFn -> 01xx (25% probability)
String -> HashFn -> 11xx (25% probability)

slide-25
SLIDE 25

HASHING 4 STRINGS

user1 -> HashFn -> 00xx
user2 -> HashFn -> 10xx
user3 -> HashFn -> 01xx
user4 -> HashFn -> 11xx

slide-26
SLIDE 26

HYPERLOGLOG

  • What about 001x?
  • If we hashed one string, there is a 12.5% chance this could occur
  • If we hashed 8 strings, we would expect about one of them to have this prefix
  • What about 000001…x?
  • Extremely unlikely to occur if we only hashed one string

slide-27
SLIDE 27

HYPERLOGLOG

  • Looks at distribution of bits of hashed values
  • Cares about the position of the leftmost ‘1’ bit
  • 1000 -> position == 1
  • 0100 -> position == 2
  • 0011 -> position == 3
slide-28
SLIDE 28

HYPERLOGLOG

  • Stores the max position of the leftmost ‘1’ bit of hashed values
  • User1 -> hash -> 1000 (position == 1)
  • User2 -> hash -> 0100 (position == 2)
  • User3 -> hash -> 0011 (position == 3)
  • HLL will store position == 3
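The bookkeeping on this slide can be sketched directly. A prefix of k-1 zeros followed by a '1' shows up with probability 2^-k, so a maximum position of k hints at roughly 2^k distinct values. That crude single-register estimate is the usual intuition rather than something the slide states, and the helper name below is hypothetical.

```python
def leftmost_one_position(hashed, width):
    """1-based position of the leftmost '1' bit in a width-bit value.

    Returns width + 1 when every bit is zero.
    """
    return width - hashed.bit_length() + 1

# The three 4-bit hash values shown on the slide.
hashes = {"User1": 0b1000, "User2": 0b0100, "User3": 0b0011}

positions = {user: leftmost_one_position(h, width=4) for user, h in hashes.items()}
max_position = max(positions.values())      # the only thing that gets stored
crude_estimate = 2 ** max_position          # very noisy with a single register

print(positions, max_position, crude_estimate)
```

A single stored maximum is extremely sensitive to one lucky hash, which is why a single register on its own gives a high-variance estimate.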
slide-29
SLIDE 29

HYPERLOGLOG

slide-30
SLIDE 30

HYPERLOGLOG ACCURACY

String -> HashFn -> 00xx
String -> HashFn -> 10xx
String -> HashFn -> 01xx
String -> HashFn -> 11xx (25% probability)

slide-31
SLIDE 31

HYPERLOGLOG

  • If we fed the stream through a second hash function, we’d have a second independent estimate
  • Adding more hash functions gives us more independent estimates that we can combine together for a lower variance estimate
  • This is expensive because we have to hash the same data n times
slide-32
SLIDE 32

HYPERLOGLOG

  • Instead we can split the stream
  • Estimate the cardinality of each sub-stream
  • For each sub-stream, store the maximum over the positions of the leftmost '1' bit for hashed values of the sub-stream
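One common way to split the stream with a single hash function is to let the first few bits of each hash pick the sub-stream (bucket) and use the remaining bits for the leftmost-'1' position. The slides do not spell out how the split is done, so treat the details below (SHA-256, 32-bit hashes, 2 bucket bits, 0 meaning "empty bucket") as assumptions of this sketch.

```python
import hashlib

HASH_BITS = 32
BUCKET_BITS = 2                      # 2 bits -> 4 buckets, as in the four-bucket slides
REST_BITS = HASH_BITS - BUCKET_BITS

def hash32(value):
    """First 32 bits of SHA-256(value) as an integer (stand-in hash function)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def add(buckets, value):
    """Route value to a bucket and keep that bucket's maximum position."""
    hashed = hash32(value)
    index = hashed >> REST_BITS                  # leading bits choose the sub-stream
    rest = hashed & ((1 << REST_BITS) - 1)       # remaining bits feed the estimate
    position = REST_BITS - rest.bit_length() + 1
    buckets[index] = max(buckets[index], position)

buckets = [0] * (1 << BUCKET_BITS)
for i in range(1000):
    add(buckets, f"user{i}")
print(buckets)
```

Each value is hashed once, and re-adding a value that has already been seen never changes the buckets.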

slide-33
SLIDE 33

HYPERLOGLOG

Buckets: [-INF, -INF, -INF, -INF]
slide-34
SLIDE 34

HYPERLOGLOG

user1 -> HashFn -> 01xxx...x

Buckets: [2, -INF, -INF, -INF]
slide-35
SLIDE 35

HYPERLOGLOG

user1 -> HashFn -> 01xxx...x
user4 -> HashFn -> 01xxx...x
user12 -> HashFn -> 01xxx...x
user7 -> HashFn -> 1xxxx...x

Buckets: [2, 2, 2, 1]

slide-36
SLIDE 36

HYPERLOGLOG

user6 -> HashFn -> 001xx...x

Buckets: [2 -> 3, 2, 2, 1]

slide-37
SLIDE 37

DETERMINING FINAL CARDINALITY

Buckets: [3, 2, 2, 1] -> MATH -> 11.00
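The "MATH" step is not spelled out on the slide. In the published HyperLogLog algorithm the raw estimate is alpha_m * m^2 / sum(2^-M_j), a bias-corrected harmonic mean over the m buckets. The sketch below shows that formula with the correction constant left as a parameter, since the deck does not give the constant it used; it does not attempt to reproduce the 11.00 on the slide.

```python
def raw_estimate(buckets, alpha=1.0):
    """HyperLogLog-style raw estimate: alpha * m^2 / sum(2^-M_j).

    alpha is the bias-correction constant; it is left at 1.0 here because
    the constant used on the slide is unknown (assumption of this sketch).
    """
    m = len(buckets)
    harmonic_sum = sum(2.0 ** -b for b in buckets)
    return alpha * m * m / harmonic_sum

uncorrected = raw_estimate([3, 2, 2, 1])   # 16 / 1.125
print(round(uncorrected, 2))               # 14.22 before bias correction
```

Using the harmonic mean rather than the arithmetic mean damps the effect of a single unusually large bucket.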

slide-38
SLIDE 38

HYPERLOGLOG

Timestamp        Buckets
2013-10-28T02    [3, 2, 2, 1]
2013-10-28T03    [1, 2, 1, 2]
2013-10-28T04    [2, 1, 4, 1]
2013-10-28T05    [2, 2, 3, 1]

slide-39
SLIDE 39

HYPERLOGLOG

Timestamp        HLL Object
2013-10-28       [3, 2, 4, 2]
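The daily object follows from the hourly bucket arrays by taking the element-wise maximum, which is what makes these sketches combinable in the same way sums and counts are. A minimal illustration (the function name `merge` is chosen for this example):

```python
def merge(*bucket_lists):
    """Combine HLL bucket arrays by taking the maximum in each position."""
    return [max(column) for column in zip(*bucket_lists)]

# Hourly bucket arrays from the previous slide.
hourly = {
    "2013-10-28T02": [3, 2, 2, 1],
    "2013-10-28T03": [1, 2, 1, 2],
    "2013-10-28T04": [2, 1, 4, 1],
    "2013-10-28T05": [2, 2, 3, 1],
}

daily = merge(*hourly.values())
print(daily)   # [3, 2, 4, 2]
```

Because max is associative and commutative, the hours can be merged in any order or grouping with the same result.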

slide-40
SLIDE 40

slide-41
SLIDE 41

RESULTS

slide-42
SLIDE 42

CASE STUDY 2

slide-43
SLIDE 43

CASE STUDY 2

  • Problem: determine distribution of values
  • Use case: quantiles and histograms
  • Hourly truncation

slide-44
SLIDE 44

THE DATA

Timestamp               Bid Price
2013-10-28T02:13:43Z    1.19
2013-10-28T02:14:21Z    0.05
2013-10-28T02:55:32Z    1.04
2013-10-28T03:07:28Z    0.16
2013-10-28T03:13:43Z    1.03
2013-10-28T04:18:19Z    0.15
2013-10-28T05:36:34Z    0.01
2013-10-28T05:37:59Z    1.03

slide-45
SLIDE 45

EXACT SOLUTION

Timestamp               Bid Price
2013-10-28T02:13:43Z    1.19
2013-10-28T02:14:21Z    0.05
2013-10-28T02:55:32Z    1.04
2013-10-28T03:07:28Z    0.16
2013-10-28T03:13:43Z    1.03
2013-10-28T04:18:19Z    0.15
2013-10-28T05:36:34Z    0.01
2013-10-28T05:37:59Z    1.03

Timestamp        Bid Prices
2013-10-28T02    [1.19, 0.05, 1.04]
2013-10-28T03    [0.16, 1.03]
2013-10-28T04    [0.15]
2013-10-28T05    [0.01, 1.03]

slide-46
SLIDE 46

EXACT SOLUTION

Timestamp        Bid Prices
2013-10-28T02    [1.19, 0.05, 1.04]
2013-10-28T03    [0.16, 1.03]
2013-10-28T04    [0.15]
2013-10-28T05    [0.01, 1.03]

Timestamp        Bid Prices
2013-10-28       [1.19, 0.05, 1.04, 0.16, 1.03, 0.15, 0.01, 1.03]

slide-47
SLIDE 47

EXACT SOLUTION

  • Arrays of values
  • Storage: Linear
  • Computation: Linear
  • Accuracy: 100%
  • Problem: Storing raw values can often be more expensive than storing the rest of the row.
  • Solution: Store an approximate representation!

slide-48
SLIDE 48

APPROXIMATE HISTOGRAMS

  • “A Streaming Parallel Decision Tree Algorithm” (Yael Ben-Haim & Elad Tom-Tov)
  • Storage: Sublinear/Linear
  • Computation: Sublinear/Linear
  • Accuracy: pretty good

slide-49
SLIDE 49

RAW DATA

  • 40 Prices: 3.46, 5.37, 5.62, 5.87, 6.21, 6.79, 7.11, 7.36, 7.55, 7.64, 7.89, 7.9, 8.07, 8.44, 8.62, 8.78, 8.87, 9.03, 9.24, 9.36, 9.58, 9.59, 9.81, 10.31, 10.35, 10.39, 10.47, 10.77, 10.93, 11.04, 11.1, 13.1, 13.27, 13.29, 13.87, 14.29, 14.51, 14.9, 15.75, 17.07

slide-50
SLIDE 50

RAW DATA

slide-51
SLIDE 51

SUMMARIZE WITH (COUNT, MEAN)

slide-52
SLIDE 52

SUMMARIZE WITH (COUNT, MEAN)

slide-53
SLIDE 53

SUMMARIZE WITH (COUNT, MEAN)
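The charts for these slides did not survive extraction. As a rough sketch of the (count, mean) summary in the style of the Ben-Haim & Tom-Tov streaming histogram: each value enters as its own bin, and whenever the bin budget is exceeded the two closest bins are replaced by their weighted mean. The bin budget of 5 and the function names are assumptions made for illustration; this simplified version also does not special-case a value that exactly equals an existing bin mean.

```python
import bisect

def merge_closest(bins):
    """Replace the two adjacent bins with the closest means by their weighted mean."""
    gaps = [bins[i + 1][0] - bins[i][0] for i in range(len(bins) - 1)]
    i = gaps.index(min(gaps))
    (p1, c1), (p2, c2) = bins[i], bins[i + 1]
    bins[i:i + 2] = [((p1 * c1 + p2 * c2) / (c1 + c2), c1 + c2)]

def add(bins, value, max_bins):
    """Insert one value into a sorted list of (mean, count) bins."""
    bisect.insort(bins, (value, 1))
    if len(bins) > max_bins:
        merge_closest(bins)

# The 40 prices from the RAW DATA slide.
prices = [3.46, 5.37, 5.62, 5.87, 6.21, 6.79, 7.11, 7.36, 7.55, 7.64,
          7.89, 7.9, 8.07, 8.44, 8.62, 8.78, 8.87, 9.03, 9.24, 9.36,
          9.58, 9.59, 9.81, 10.31, 10.35, 10.39, 10.47, 10.77, 10.93, 11.04,
          11.1, 13.1, 13.27, 13.29, 13.87, 14.29, 14.51, 14.9, 15.75, 17.07]

bins = []
for price in prices:
    add(bins, price, max_bins=5)     # hypothetical budget of 5 bins

for mean, count in bins:
    print(f"mean={mean:.2f} count={count}")
```

The total count and the overall mean are preserved exactly; only the shape of the distribution is approximated.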

slide-54
SLIDE 54

COMBINING HISTOGRAMS

slide-55
SLIDE 55

COMBINING HISTOGRAMS
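Two (mean, count) histograms can be combined by pooling their bins and merging the closest pairs until the bin budget is met again, which mirrors the merge step of the Ben-Haim & Tom-Tov algorithm. The example histograms and the name `merge_histograms` are invented for this sketch.

```python
def merge_histograms(a, b, max_bins):
    """Pool two lists of (mean, count) bins and shrink back to max_bins."""
    bins = sorted(a + b)
    while len(bins) > max_bins:
        # Find the adjacent pair with the smallest gap between means.
        gaps = [bins[i + 1][0] - bins[i][0] for i in range(len(bins) - 1)]
        i = gaps.index(min(gaps))
        (p1, c1), (p2, c2) = bins[i], bins[i + 1]
        # Replace the pair with a single bin at their weighted mean.
        bins[i:i + 2] = [((p1 * c1 + p2 * c2) / (c1 + c2), c1 + c2)]
    return bins

# Two small made-up histograms, each a sorted list of (mean, count).
first = [(1.0, 2), (5.0, 1)]
second = [(1.2, 1), (9.0, 3)]

combined = merge_histograms(first, second, max_bins=3)
print(combined)
```

Here the bins at 1.0 and 1.2 are the closest pair, so they collapse into one bin holding three values.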

slide-56
SLIDE 56

slide-57
SLIDE 57

COUNT # <= X
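Counting the values at or below x from a (mean, count) histogram is typically done by interpolation. The sketch below follows the trapezoid-style "sum" procedure described in the Ben-Haim & Tom-Tov paper: all bins fully to the left of x count in full, the bin just left of x contributes half its count, and the region between the two surrounding bins is interpolated. Handling of x outside the range of bin means is a simplification chosen here, not part of the deck.

```python
def count_le(bins, x):
    """Estimate how many values are <= x from sorted (mean, count) bins."""
    if x < bins[0][0]:
        return 0.0                                   # simplification at the left edge
    if x >= bins[-1][0]:
        return float(sum(count for _, count in bins))  # simplification at the right edge
    # Locate i with p_i <= x < p_{i+1}.
    i = max(j for j in range(len(bins) - 1) if bins[j][0] <= x)
    (p_i, m_i), (p_next, m_next) = bins[i], bins[i + 1]
    fraction = (x - p_i) / (p_next - p_i)
    m_x = m_i + (m_next - m_i) * fraction            # interpolated height at x
    inside = (m_i + m_x) / 2 * fraction              # trapezoid area between p_i and x
    before = sum(count for _, count in bins[:i])     # bins entirely left of p_i
    return before + m_i / 2 + inside

# Made-up histogram: two values around 1.0 and two around 3.0.
bins = [(1.0, 2), (3.0, 2)]
print(count_le(bins, 2.0))   # 2.0
```

Quantiles can be obtained from the same histogram by searching for the x at which this count reaches the desired fraction of the total.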

slide-58
SLIDE 58

slide-59
SLIDE 59

ACCURACY

slide-60
SLIDE 60

DRUID

  • Open source
  • Designed to power interactive applications at scale
  • Optimized for business intelligence (OLAP) queries
  • Arbitrary slice-n-dice and drill into data
  • Supports streaming and batch data ingestion
  • Exact and approximate calculations (HyperLogLog, approximate histograms)

slide-61
SLIDE 61

BENCHMARKS

  • 100 cc2.8xlarge (1600 cores, 6TB RAM) Druid cluster
  • 27B summarized rows/s scan rate
  • Add 16B summarized (~640B raw) rows/s
  • Combine 4B HyperLogLog objects/s
  • Combine 1.5B ApproximateHistogram objects/s

slide-62
SLIDE 62

CONCLUSIONS

  • Summarization for sums: substantially (e.g. ~40x for us) faster/less storage
  • 100% accuracy
  • Sketches for cardinality/distribution: 1-2 orders of magnitude faster/less storage than raw
  • 97% accuracy
  • 40x lower cost is make or break
  • Interactive queries that are accurate enough

slide-63
SLIDE 63

DRUID IS OPEN SOURCE
WWW.DRUID.IO

twitter: @druidio
irc: irc.freenode.net #druid-dev

slide-64
SLIDE 64

THANK YOU