
Sketching Streams (Chris Taylor, DoD)



  1. Sketching Streams (Chris Taylor, DoD)

  2. Overview
     ● What-Why Sketch?
     ● Sketches
       – HyperLogLog Sketch
       – Frequency “Heavy Hitter” Sketch
       – Quantile Sketch
       – Theta Sketch

  3. What-Why Sketch?

  4. What-Why Sketch?
     ● Data sets exceed traditional commodity compute capabilities
       – Static and streaming data
       – Data set is “noisy” (biology, physics)
     ● Approximate results have value

  5. What-Why Sketch?
     ● Compute dynamic “summaries” of a dataset according to a predefined set of computational constraints
       – Storage size
       – Accuracy and precision (user-provided tolerances)
     ● Sketches are “monoidal” in nature: they support a suite of set operations (union, difference, etc.), so partial sketches can be merged
       – Functional programming concepts
       – Parallel prefix summarization
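
The “monoidal” property can be illustrated with a toy sketch. The example below is only a sketch under stated assumptions: it uses a fixed-size bitmap (not one of the sketches from the talk), and the names `BITS`, `bit_for`, `sketch`, and `merge` are hypothetical. The point is that merging is associative and has an identity element, so per-partition sketches can be combined in any grouping, including as a running prefix.

```python
import hashlib
from functools import reduce
from itertools import accumulate

BITS = 1024   # hypothetical bitmap size, chosen only for illustration
EMPTY = 0     # identity element: merging with it changes nothing


def bit_for(item):
    """Map an item to one bit position using a stable hash."""
    digest = hashlib.sha256(repr(item).encode()).digest()
    return int.from_bytes(digest[:8], "big") % BITS


def sketch(items):
    """Summarize a collection as a bitmap stored in a Python int."""
    bitmap = EMPTY
    for item in items:
        bitmap |= 1 << bit_for(item)
    return bitmap


def merge(a, b):
    """Union of two sketches: associative, commutative, identity EMPTY."""
    return a | b


# Three overlapping partitions of one logical stream.
partitions = [range(0, 100), range(50, 150), range(120, 200)]
partials = [sketch(p) for p in partitions]

merged = reduce(merge, partials, EMPTY)       # any grouping gives the same result
whole = sketch(range(0, 200))                 # sketch built in a single pass
prefixes = list(accumulate(partials, merge))  # running (prefix) summaries
```

Because `merge` is associative, the `reduce` step could be split across workers and the partial results combined afterwards without changing the outcome.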

  6. What-Why Sketch?
     ● “Data analytic” platforms adopting sketches
       – Yahoo's “Data Sketching” library
       – Druid integration with Yahoo's library **
       – Redis support
       – Several open-source projects for Spark/Hadoop
     ** Traditional database, “columnar” stores, “Big Table” database

  7. What-Why Sketch?
     ● Measuring performance
       – Using Chapel 1.15!
       – Measured sketch update performance
       – Each algorithm receives a randomly filled array of 100K integers
       – Each algorithm is given 5 minutes to 'add' or 'update' a sketch (serial loop) over sets of the 100K integers
     ● Results are the total number of 100K-integer block updates completed in ~5 minutes
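
The measurement loop described above might look roughly like the following. This is a guess at the harness, not the code used for the talk: `make_block` and `count_block_updates` are hypothetical names, and the time budget is a parameter so the example can run in a fraction of a second instead of 5 minutes.

```python
import random
import time


def make_block(size=100_000, seed=None):
    """Build one randomly filled block of integers (100K in the talk)."""
    rng = random.Random(seed)
    return [rng.randrange(2**31) for _ in range(size)]


def count_block_updates(update, block, budget_seconds):
    """Serially feed the block to `update` until the time budget expires.

    Returns the number of whole blocks that were started within the budget.
    """
    deadline = time.monotonic() + budget_seconds
    completed = 0
    while time.monotonic() < deadline:
        for value in block:
            update(value)
        completed += 1
    return completed
```

For example, `count_block_updates(some_sketch.update, make_block(), 300)` would approximate the 5-minute runs, with `some_sketch` standing in for any object that exposes an update method.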

  8. HyperLogLog

  9. HyperLogLog
     ● Philippe Flajolet
     ● Analyzes a stream of hashed values (bit-pattern observables)
       – Split each hashed value into sets of m bits
       – Collect “runs” of zeros for each set
     ● Provides a stochastic average using the collected bit-pattern information
       – Compute a harmonic mean over the bit sets (updated for each new value)

  10. HyperLogLog
      ● Hashed value: 000011000111
      ● Split hash into bit-pattern sets (m = 3): [ [000], [011], [000], [111] ]
      ● Compute running harmonic average over existing bit-pattern sets
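
For comparison, here is a minimal sketch of the commonly published register-based HyperLogLog formulation, which differs in detail from the grouping shown on the slide: the leading bits of each hash pick a register, the register keeps the longest run of leading zeros (plus one) seen in the remaining bits, and the estimate is a bias-corrected harmonic mean over the registers. The class name, the `index_bits` parameter, and the use of SHA-256 as the hash are assumptions made for this example, not details from the talk.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, index_bits=12):
        self.p = index_bits
        self.num_registers = 1 << index_bits
        self.registers = [0] * self.num_registers
        # Bias-correction constant from the HyperLogLog paper (valid for >= 128 registers).
        self.alpha = 0.7213 / (1 + 1.079 / self.num_registers)

    def add(self, item):
        digest = hashlib.sha256(repr(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big")          # 64-bit hash value
        index = h >> (64 - self.p)                     # leading bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining bits
        # Position of the leftmost 1 bit in `rest` (run of zeros + 1).
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def estimate(self):
        m = self.num_registers
        harmonic = sum(2.0 ** -r for r in self.registers)
        raw = self.alpha * m * m / harmonic
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction: fall back to linear counting.
            return m * math.log(m / zeros)
        return raw
```

With 4096 registers the expected relative error is roughly 1.04 / sqrt(4096), or about 1.6%, independent of how many distinct items are added.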

  11. HyperLogLog [Bar chart: Run 1 to Run 5 for chpl and python; y-axis 0 to 12,000]

  12. HyperLogLog [Bar chart: Run 1 to Run 5 for chpl-fast, chpl, and python; y-axis 0 to 300,000]

  13. Frequency Sketch

  14. Frequency Sketch
      ● Implementation of the Misra-Gries algorithm
      ● Stores k-1 (item, counter) pairs as a set
      ● If a new item is already in the set
        – Increment its counter
        – Else find an empty counter, add the item, and set its counter to one
      ● Decrement all counters if every counter has already been allocated
      ● Over time, low-frequency elements are removed, making space for higher-frequency items.
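
The steps above can be sketched in a few lines of Python. This follows the textbook Misra-Gries procedure with a dictionary standing in for the k-1 counters; the function name `misra_gries` is illustrative, and this is not the Chapel code from the talk.

```python
def misra_gries(stream, k):
    """Return at most k-1 (item -> counter) pairs summarizing `stream`.

    Any item occurring more than len(stream) / k times is guaranteed to
    survive, and its counter undercounts by at most len(stream) / k.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1            # item already tracked
        elif len(counters) < k - 1:
            counters[item] = 1             # a free counter is available
        else:
            # Every counter is allocated: decrement all, dropping zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

The surviving counters are lower bounds on the true frequencies, so a second pass over the data (when available) is the usual way to confirm which candidates really are heavy hitters.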

  15. Frequency Sketch [Bar chart: Run 1 to Run 5 for chpl and python; y-axis 0 to 12,000]

  16. Frequency Sketch [Bar chart: Run 1 to Run 5 for chpl-fast, chpl, and python; y-axis 0 to 300,000]

  17. Quantile Sketch

  18. Quantile Sketch
      ● “Low Discrepancy Mergeable Quantiles Sketch” (Agarwal, Cormode, Huang, Phillips, Wei, Yi)
      ● Non-deterministic!
      ● Select elements (upper/lower bounds) from the stream under a rank constraint: normalized rank i|S|/k for 1 <= i <= k, where k ~= 1/ε
      ● Using the selected elements (the summary), compute quartile information.
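
One way to see where the non-determinism comes from is a simplified compaction-based quantile summary, written in the spirit of mergeable quantile sketches rather than as a faithful copy of the cited algorithm. In this sketch (class and parameter names are assumptions), each level holds items of weight 2**h; when a level fills with 2k items it is sorted and a randomly chosen half (the even or the odd positions) is promoted with doubled weight.

```python
import random


class QuantileSketch:
    def __init__(self, k=128, seed=None):
        self.k = k
        self.levels = [[]]               # levels[h] holds items of weight 2**h
        self.n = 0                       # number of items seen so far
        self.rng = random.Random(seed)

    def update(self, value):
        self.levels[0].append(value)
        self.n += 1
        h = 0
        while len(self.levels[h]) >= 2 * self.k:
            self._compact(h)
            h += 1

    def _compact(self, h):
        """Sort a full level and promote every other item with doubled weight."""
        buf = sorted(self.levels[h])
        offset = self.rng.randint(0, 1)  # the random choice makes runs differ
        if h + 1 == len(self.levels):
            self.levels.append([])
        self.levels[h + 1].extend(buf[offset::2])
        self.levels[h] = []

    def quantile(self, fraction):
        """Return a retained value whose estimated rank is about fraction * n."""
        if self.n == 0:
            raise ValueError("empty sketch")
        weighted = sorted(
            (value, 1 << h) for h, buf in enumerate(self.levels) for value in buf
        )
        target = fraction * self.n
        running = 0
        for value, weight in weighted:
            running += weight
            if running >= target:
                return value
        return weighted[-1][0]
```

Each compaction keeps total weight unchanged and shifts the estimated rank of any query by at most the weight of one item at that level, which is why a larger `k` (fewer compactions) gives tighter rank error.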

  19. Quantile Sketch [Bar chart: Run 1 to Run 5 for chpl and python; y-axis 0 to 450]
      ** Chapel has to perform several domain resizes, could use optimization

  20. Quantile Sketch [Bar chart: Run 1 to Run 5 for chpl-fast, chpl, and python; y-axis 0 to 2,000]

  21. Theta Sketch

  22. Theta Sketch
      ● Kth Minimum Value sketch
      ● Maintains a threshold theta and a set of unique hashed items less than theta
      ● Assumes the hash function produces values uniformly distributed over the hash space
      ● That assumption gives information about the average spacing between the hashed elements of the stream
      ● Knowing the k-th smallest value, and hence the spacing, one can infer the total number of distinct values observed
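
The spacing argument can be made concrete with a small K-Minimum-Values sketch. This is an illustrative version under assumptions (class name `KMVSketch`, SHA-256 mapped to a fraction in [0, 1), the standard (k - 1) / theta estimator), not the implementation benchmarked in the talk: if k distinct hashes fall below theta and hashes are uniform, about (k - 1) / theta distinct items must have been seen.

```python
import hashlib
import heapq


class KMVSketch:
    def __init__(self, k=1024):
        self.k = k
        self.heap = []         # negated hashes: heap[0] is the largest retained hash
        self.members = set()   # retained hashes, for duplicate detection

    @staticmethod
    def _hash01(item):
        """Hash an item to a float in [0, 1), assumed uniformly distributed."""
        digest = hashlib.sha256(repr(item).encode()).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    @property
    def theta(self):
        """Current threshold: the k-th smallest hash, or 1.0 until k are retained."""
        return -self.heap[0] if len(self.heap) == self.k else 1.0

    def update(self, item):
        h = self._hash01(item)
        if h in self.members or h >= self.theta:
            return                                   # duplicate or above threshold
        if len(self.heap) == self.k:
            evicted = -heapq.heappushpop(self.heap, -h)
            self.members.discard(evicted)            # old k-th minimum drops out
        else:
            heapq.heappush(self.heap, -h)
        self.members.add(h)

    def estimate(self):
        if len(self.heap) < self.k:
            return float(len(self.heap))             # fewer than k distinct: exact
        return (self.k - 1) / self.theta
```

With k = 1024 the relative standard error is roughly 1 / sqrt(k), or about 3%, and the sketch never stores more than k hashes regardless of stream length.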

  23. Theta Sketch [Bar chart: series labeled Column 1 to Column 3 for chpl and python; y-axis 0 to 80,000]

  24. Theta Sketch [Bar chart: Run 1 to Run 5 for chpl-fast, chpl, and python; y-axis 0 to 900,000]

  25. ● Images provided by the Library of Congress
        – All photos have “no known restrictions on publication”
      ● Code to be posted on GitHub!
        – Check the email listserv for details
