Sketching Streams Chris Taylor DoD Overview What-Why Sketch? - - PowerPoint PPT Presentation



SLIDE 1

Sketching Streams

Chris Taylor DoD

SLIDE 2

Overview

  • What-Why Sketch?
  • Sketches
      – HyperLogLog Sketch
      – Frequency (“Heavy Hitter”) Sketch
      – Quantile Sketch
      – Theta Sketch

SLIDE 3

What-Why Sketch?

SLIDE 4

What-Why Sketch?

  • Data sets exceed traditional commodity compute capabilities
      – Static and streaming data
      – Data set is “noisy” (biology, physics)
  • Approximate results have value

SLIDE 5

What-Why Sketch?

  • Compute dynamic “summaries” of a dataset according to a predefined set of computational constraints
      – Storage size
      – Accuracy, precision, and other user-provided tolerances
  • Sketches are “monoidal” in nature: they satisfy a suite of set operations (union, difference, etc.)
      – Functional programming concepts
      – Parallel prefix summarization
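
The “monoidal” property can be illustrated with a toy summary. This is a minimal sketch under stated assumptions, not one of the sketches from the talk: the `Summary` type (count, min, max), the `EMPTY` identity, and the helper names are hypothetical. It only shows why an associative merge with an identity element allows chunks of a dataset to be summarized independently and then combined, including as running prefixes.

```python
from dataclasses import dataclass
from functools import reduce
from itertools import accumulate


@dataclass(frozen=True)
class Summary:
    """Toy mergeable summary: item count plus observed min and max."""
    count: int = 0
    lo: float = float("inf")
    hi: float = float("-inf")

    @staticmethod
    def of(x):
        # Summary of a single value.
        return Summary(1, x, x)

    def merge(self, other):
        # Associative combine; Summary() acts as the identity element.
        return Summary(self.count + other.count,
                       min(self.lo, other.lo),
                       max(self.hi, other.hi))


EMPTY = Summary()


def summarize(values):
    # Fold a chunk of values into one summary.
    return reduce(Summary.merge, (Summary.of(v) for v in values), EMPTY)


def prefix_summaries(parts):
    # Running (prefix) merge over per-chunk summaries.
    return list(accumulate(parts, Summary.merge))
```

Because `merge` is associative, summarizing three chunks separately and merging the results gives the same answer as summarizing the whole list in one pass, which is what makes parallel evaluation possible.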

SLIDE 6

What-Why Sketch?

  • “Data analytic” platforms adopting sketches
      – Yahoo's “Data Sketching” library
      – Druid integration with Yahoo's library**
      – Redis support
      – Several open-source projects for Spark/Hadoop

** Traditional Database, “Columnar” Stores, “Big Table” Database

SLIDE 7

What-Why Sketch?

  • Measuring Performance
      – Using Chapel 1.15!
      – Measured sketch update performance
      – Each algorithm receives a randomly filled array of 100K integers
      – Each algorithm is given 5 minutes to 'add' or 'update' a sketch (serial loop) over sets of the 100K integers
  • Results are the total number of 100K-integer block updates completed in ~5 minutes
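
The measurement procedure above can be sketched as a small time-budgeted loop. This is a hedged reconstruction, not the harness used for the talk: the function name `count_block_updates`, its parameters, and the choice to reuse one random block for every pass are assumptions.

```python
import random
import time


def count_block_updates(update, block_size=100_000, budget_seconds=300.0, seed=0):
    """Count how many full blocks of integers `update` consumes within a time budget.

    `update` is any callable taking one integer, e.g. a sketch's add method.
    Assumption: the same randomly filled block is replayed on every pass.
    """
    rng = random.Random(seed)
    block = [rng.getrandbits(32) for _ in range(block_size)]
    completed = 0
    deadline = time.perf_counter() + budget_seconds
    while time.perf_counter() < deadline:
        for value in block:      # serial loop over the block
            update(value)
        completed += 1           # one more 100K-integer block finished
    return completed
```

A block that is in progress when the deadline passes is allowed to finish, so the wall-clock time can slightly exceed the budget, which matches the "~5 minutes" wording.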

SLIDE 8

HyperLogLog

SLIDE 9

HyperLogLog

  • Philippe Flajolet
  • Analyzes a stream of hashed values (bit-pattern observables)
      – Split each hashed value into m sets
      – Collect “runs” of zeros for each m set
  • Provides a stochastic average using collected bit-pattern information
      – Compute a harmonic mean of each m bit set (for each new value)

SLIDE 10

HyperLogLog

  • Hashed Value:

000011000111

  • Split hash into bit-pattern sets of 3 bits each (four sets):

[ [000], [011], [000], [111] ]

  • Compute running harmonic average over existing bit-pattern sets
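
The HyperLogLog idea described on these slides can be made concrete with a minimal sketch. It follows the commonly published register layout, which is an assumption here and not the speaker's Chapel or Python code: the first p bits of a 64-bit hash select one of m = 2^p registers, each register keeps the longest run of leading zeros (plus one) seen in the remaining bits, and the estimate is a bias-corrected harmonic mean over the registers. The class name, the SHA-1 based hash, and the default p = 10 are illustrative choices.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch (illustrative only)."""

    def __init__(self, p=10):
        self.p = p                     # number of hash bits used as register index
        self.m = 1 << p                # number of registers
        self.registers = [0] * self.m

    @staticmethod
    def _hash64(item):
        # Assumption: SHA-1 truncated to 64 bits stands in for a fast uniform hash.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, item):
        x = self._hash64(item)
        width = 64 - self.p
        idx = x >> width                       # leading p bits pick the register
        rest = x & ((1 << width) - 1)          # remaining bits
        rank = width - rest.bit_length() + 1   # leading zeros in `rest`, plus one
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Union of two sketches: register-wise maximum.
        assert self.p == other.p
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)       # bias constant, intended for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)     # small-range correction (linear counting)
        return raw
```

With m = 1024 registers the expected relative error is roughly 1.04 / sqrt(m), about 3%, while the sketch stores only one small integer per register.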

SLIDE 11

HyperLogLog

[Chart: HyperLogLog results for chpl and python, Run 1 through Run 5; axis ticks from 2000 to 12000]

SLIDE 12

HyperLogLog

[Chart: HyperLogLog results for chpl-fast, chpl, and python, Run 1 through Run 5; axis ticks from 50000 to 300000]

SLIDE 13

Frequency Sketch

SLIDE 14

Frequency Sketch

  • Implementation of the Misra-Gries algorithm
  • Stores k-1 (item, counter) pairs as a set
  • If a new item is already in the set
      – Increment its counter
      – Else find an empty counter, add the item, and set its counter to one
  • Otherwise, if all counters have been allocated, decrement every counter
  • Over time, low-frequency elements are removed, making space for higher-frequency items.
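
The update rule above can be sketched as follows. This is a minimal illustration of the textbook Misra-Gries summary, not the implementation benchmarked in the talk; the function name `misra_gries` and the use of a dictionary for the k-1 counters are assumptions.

```python
def misra_gries(stream, k):
    """Return at most k-1 (item -> counter) pairs summarizing `stream`.

    Any item occurring more than n/k times in a stream of length n is
    guaranteed to survive, and each counter undercounts by at most n/k.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1            # item already tracked
        elif len(counters) < k - 1:
            counters[item] = 1             # a free counter is available
        else:
            # All k-1 counters are allocated: decrement every one and
            # drop those that reach zero, freeing space for later items.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

The surviving counters are candidates for heavy hitters; a second pass over the data (when available) can confirm their exact frequencies.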

SLIDE 15

Frequency Sketch

[Chart: Frequency Sketch results for chpl and python, Run 1 through Run 5; axis ticks from 2000 to 12000]

SLIDE 16

Frequency Sketch

[Chart: Frequency Sketch results for chpl-fast, chpl, and python, Run 1 through Run 5; axis ticks from 50000 to 300000]

SLIDE 17

Quantile Sketch

SLIDE 18

Quantile Sketch

  • “Low Discrepancy Mergeable Quantiles Sketch” (Agarwal, Cormode, Huang, Phillips, Wei, Yi)
  • Non-deterministic!
  • Select elements (upper/lower bounds) from the stream under a rank constraint:

normalized rank: i|S|/k for 1 <= i <= k, with k ~= 1/ε

  • Using the selected elements, or summary, compute quantile information.
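
A simplified randomized summary in the spirit of the cited work can be sketched as follows. This is not the algorithm from the paper or the talk's code, only an illustration under stated assumptions: full buffers of k sorted items are kept per level, two buffers at the same level are merged and halved by keeping every other item from a random offset (the non-deterministic step), and each survivor at level L stands for 2^L original items. The class name `MergeableQuantiles` and its parameters are hypothetical.

```python
import random


class MergeableQuantiles:
    """Toy mergeable quantile summary using random-offset buffer halving."""

    def __init__(self, k=128, seed=None):
        self.k = k
        self.rng = random.Random(seed)
        self.n = 0           # total items represented
        self.base = []       # partial buffer, each item has weight 1
        self.levels = {}     # level -> sorted buffer of k items, weight 2**level

    def add(self, x):
        self.n += 1
        self.base.append(x)
        if len(self.base) == self.k:
            full, self.base = sorted(self.base), []
            self._carry(full, 0)

    def _carry(self, buf, level):
        # Like binary addition: two buffers at one level collapse into one
        # buffer at the next level, keeping every other item.
        while level in self.levels:
            merged = sorted(buf + self.levels.pop(level))
            buf = merged[self.rng.randrange(2)::2]
            level += 1
        self.levels[level] = buf

    def merge(self, other):
        assert self.k == other.k
        self.n += other.n - len(other.base)   # base items are counted by add()
        for level in sorted(other.levels):
            self._carry(list(other.levels[level]), level)
        for x in other.base:
            self.add(x)

    def quantile(self, q):
        if self.n == 0:
            raise ValueError("empty sketch")
        weighted = [(x, 1) for x in self.base]
        for level, buf in self.levels.items():
            weighted.extend((x, 1 << level) for x in buf)
        weighted.sort()
        target = q * self.n
        running = 0
        for x, w in weighted:
            running += w       # estimated rank of x
            if running >= target:
                return x
        return weighted[-1][0]
```

Each halving preserves total weight, so the weights always sum to n, and the rank error introduced at a level is bounded by that level's weight. Larger k lowers the error at the cost of more storage.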

SLIDE 19

Quantile Sketch

[Chart: Quantile Sketch results for chpl and python, Run 1 through Run 5; axis ticks from 50 to 450]

** Chapel has to perform several domain resizes; this could be optimized

SLIDE 20

Quantile Sketch

[Chart: Quantile Sketch results for chpl-fast, chpl, and python, Run 1 through Run 5; axis ticks from 200 to 2000]

SLIDE 21

Theta Sketch

SLIDE 22

Theta Sketch

  • Kth Minimum Value (KMV) sketch
  • Maintains a threshold theta and a set of unique hashed items less than theta
  • Algorithm assumes the hash function provides a uniform distribution (over the hash space).
  • The assumption gives information about the average spacing between hashed elements of the stream.
  • Knowing the kth smallest value (theta), and hence the spacing, one can infer the total number of distinct values observed
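
The spacing argument above can be sketched with a small K-Minimum-Values structure. This is an illustration under stated assumptions, not the talk's code or any library's Theta sketch API: hashes are mapped to [0, 1), the k smallest distinct hashes are retained, theta is the largest retained hash, and the distinct count is estimated as (k - 1) / theta. The class name `KMVSketch` and the SHA-1 based hash are hypothetical choices.

```python
import hashlib
import heapq


class KMVSketch:
    """Toy K-Minimum-Values sketch for distinct counting."""

    def __init__(self, k=256):
        self.k = k
        self._heap = []         # max-heap (negated) of the k smallest hashes
        self._members = set()   # same hashes, for duplicate detection

    @staticmethod
    def _hash01(item):
        # Assumption: truncated SHA-1 approximates a uniform hash on [0, 1).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2.0 ** 64

    def add(self, item):
        h = self._hash01(item)
        if h in self._members:
            return                                   # already retained
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash falls below theta: evict the current largest.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    @property
    def theta(self):
        # Threshold: kth smallest hash once k values are held, else 1.0.
        return -self._heap[0] if len(self._heap) == self.k else 1.0

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))            # exact below k distinct items
        return (self.k - 1) / self.theta
```

If k uniformly spread hashes fit below theta, roughly k / theta distinct values would fill the whole hash space; the k - 1 numerator is the usual unbiased variant. The expected relative error is about 1 / sqrt(k).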

SLIDE 23

Theta Sketch

[Chart: Theta Sketch results for chpl and python, Column 1 through Column 3; axis ticks from 10000 to 80000]

SLIDE 24

Theta Sketch

[Chart: Theta Sketch results for chpl-fast, chpl, and python, Run 1 through Run 5; axis ticks from 100000 to 900000]

SLIDE 25
  • Images provided by Library of Congress
      – All photos have “no known restrictions on publication”
  • Code to be posted on GitHub!
      – Check the email listserv for details