prioritizing attention in fast data


  1. Prioritizing Attention in Fast Data: Principles and Promise. Peter Bailis, Edward Gan, Kexin Rong, Sahaana Suri. CIDR 2017

  2. Edward Kexin Sahaana

  3. Edward Kexin Sahaana Deepak Firas Matei John Tony

  4. Edward Kexin Sahaana Deepak Firas Matei John Ihab Sam Xu Lei Tony

  5. abundant data, scarce attention

  6. abundant data, scarce attention: data is increasingly too big for manual inspection

  7. abundant data, scarce attention: data is increasingly too big for manual inspection; Twitter, LinkedIn, Facebook, Google: log 12M+ events/s; projected 40% year-over-year growth (e.g., via IoT)

  8. abundant data, scarce attention: data is increasingly too big for manual inspection; Twitter, LinkedIn, Facebook, Google: log 12M+ events/s; projected 40% year-over-year growth (e.g., via IoT); today’s operators say: < 6% of this data is ever accessed after ingest

  9. abundant data, scarce attention: data is increasingly too big for manual inspection; Twitter, LinkedIn, Facebook, Google: log 12M+ events/s; projected 40% year-over-year growth (e.g., via IoT); today’s operators say: < 6% of this data is ever accessed after ingest; call this trend “fast data”

  10. 6%

  11. abundant data, scarce attention: e.g., telemetry and metrics from 100k-MM devices; is the application behaving as expected?

  12. idea: use a classifier to filter data

  13. idea: use a classifier to filter data; too much data? filter the stream for “interesting”/useful data

  14. basic classifier: static rules

  16. basic classifier: static rules; example: power drain > 2W?

  17. basic classifier: static rules; example: power drain > 2W?; pros: scalable, simple; cons: highly brittle, may miss events

  18. better classifier: use ML & statistics; example: compute statistical likelihood of power activity given user population

  19. better classifier: use ML & statistics; example: compute statistical likelihood of power activity given user population [figure: mean µ; region more than k standard deviations from µ]

  20. better classifier: use ML & statistics; example: compute statistical likelihood of power activity given user population [figure: mean µ; region more than k standard deviations from µ]; pros: can model dynamic & complex events; cons: often slow!
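The contrast between the static rule (power drain > 2W) and the statistical rule (more than k standard deviations from µ) can be sketched as follows. This is only an illustration, not MacroBase's implementation: the function names, the sample readings, and the choice of k = 2.5 are all assumptions made for the example.

```python
from statistics import mean, pstdev


def static_rule(power_watts, limit=2.0):
    """Static rule from the slides: flag readings with power drain above 2 W."""
    return [x > limit for x in power_watts]


def zscore_rule(power_watts, k=2.5):
    """Flag readings more than k standard deviations from the population mean."""
    mu = mean(power_watts)
    sigma = pstdev(power_watts)
    if sigma == 0:
        # All readings are identical, so nothing deviates from the mean.
        return [False] * len(power_watts)
    return [abs(x - mu) > k * sigma for x in power_watts]


# Hypothetical power readings in watts; the last device drains far more
# than its peers but stays just under the static 2 W limit.
readings = [0.5, 0.6, 0.55, 0.52, 0.58, 0.61, 0.49, 0.57, 0.53, 1.9]

static_flags = static_rule(readings)
zscore_flags = zscore_rule(readings)

print("static rule flags:", static_flags)
print("z-score rule flags:", zscore_flags)
```

On this made-up data the static rule flags nothing, while the population-relative rule flags the last reading, which is the brittleness the slides point at. The statistical rule has to scan the whole population before it can classify anything, which hints at why such models cost more to run.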

  21. models are expensive to run: e.g., state-of-the-art CNN: 30fps requires $1200 GPU

  22. models are expensive to run: e.g., state-of-the-art CNN: 30fps requires $1200 GPU; anecdote: speed vs. quality: engineers at major online service monitoring per-device QoS: off-the-shelf stats packages too slow, not scalable; solution: manually tune thresholds per-user, per-device!

  23. models are expensive to run: e.g., state-of-the-art CNN: 30fps requires $1200 GPU; anecdote: speed vs. quality: engineers at major online service monitoring per-device QoS: off-the-shelf stats packages too slow, not scalable; solution: manually tune thresholds per-user, per-device!; result: brittle, reactive, false negatives; wanted: accurate, scalable classifiers

  24. raw data is still too much: even filtered data is problematic at scale; high volume still overwhelms human attention; high-dimensional attributes can obscure trends

  25. android device types by popularity

  26. explanations aggregate results (classify, then explain): highlight commonalities and trends

  27. explanations aggregate results (classify, then explain): highlight commonalities and trends; e.g., Android Galaxy S7 devices running app version 2.4.4 are 51x more likely than usual to have extreme power drain; return aggregates and representative events instead of returning raw data
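One way to read a statement such as "51x more likely than usual" is as a relative risk: the outlier rate among events that carry an attribute value, divided by the outlier rate among events that do not. The sketch below works under that assumption. The function name `risk_ratios`, the record layout, and the counts are hypothetical and are not taken from the talk.

```python
from collections import Counter


def risk_ratios(records):
    """Compute a relative risk per attribute value.

    records: iterable of (attribute_value, is_outlier) pairs.
    Returns {value: outlier rate with value / outlier rate without value}.
    """
    out_counts, in_counts = Counter(), Counter()
    for value, is_outlier in records:
        (out_counts if is_outlier else in_counts)[value] += 1
    total_out = sum(out_counts.values())
    total_in = sum(in_counts.values())

    ratios = {}
    for value in set(out_counts) | set(in_counts):
        exposed = out_counts[value] + in_counts[value]
        unexposed = (total_out + total_in) - exposed
        outliers_without = total_out - out_counts[value]
        if unexposed == 0:
            ratios[value] = float("nan")  # every record has this value
        elif outliers_without == 0:
            ratios[value] = float("inf")  # outliers occur only with this value
        else:
            ratios[value] = (out_counts[value] / exposed) / (outliers_without / unexposed)
    return ratios


# Made-up counts: (device, is_outlier) pairs.
records = (
    [("Galaxy S7", True)] * 50 + [("Galaxy S7", False)] * 50
    + [("other", True)] * 10 + [("other", False)] * 630
)
ratios = risk_ratios(records)
print(ratios["Galaxy S7"])  # 32.0
```

Returning one ratio per attribute value rather than the raw events is the aggregation step the slide describes: 740 records reduce to a single statement about one device type.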

  28. the key to fast data: combine classify and explain

  29. the key to fast data: combine classify and explain; how should we do it?

  30. dataflow (alone) is not enough dataflow: a substrate, not a complete solution

  32. dataflow (alone) is not enough: dataflow: a substrate, not a complete solution; missing: scalable, modular operators for prioritizing attention via classification and explanation

  33. macrobase: a fast data system (classify, explain): a system providing fast, reusable, modular operators for classification and explanation; prioritizing attention in fast data

  34. MacroBase default workflow: input: data attributes, key performance metrics; output: attributes that explain deviations in metrics

  35. correlated attributes

  36. correlated attributes key metric

  37. “MacroBase discovered a rare issue with the CMT application and a device-specific battery problem. Consultation and investigation with the CMT team confirmed these issues as previously unknown…”

  38. classify + explain; key: make this combo fast

  39. example: end-to-end optimization

  40. example: end-to-end optimization: streaming explanation [figure: stream of items labeled A, B, C, D, E]

  41. example: end-to-end optimization: streaming explanation; standard solution: find correlations w/in each class

  42. example: end-to-end optimization: streaming explanation; standard solution: find correlations w/in each class (A: 80%, B: 20%)

  43. example: end-to-end optimization: streaming explanation; standard solution: find correlations w/in each class (A: 80%, B: 20%) and (A: 0.1%, C: 31.9%, B: 46%, D: 22%)

  44. example: end-to-end optimization: streaming explanation; standard solution: find correlations w/in each class (A: 80%, B: 20%) and (A: 0.1%, C: 31.9%, B: 46%, D: 22%); better idea: exploit cardinality imbalance; correlate “outliers”, probe “inliers”

  45. example: end-to-end optimization: streaming explanation; standard solution: find correlations w/in each class (A: 80%, B: 20%) and (A: 0.1%, C: 31.9%, B: 46%, D: 22%); better idea: exploit cardinality imbalance; correlate “outliers”, probe “inliers” (A: 80%)

  46. example: end-to-end optimization: streaming explanation; standard solution: find correlations w/in each class (A: 80%, B: 20%) and (A: 0.1%, C: 31.9%, B: 46%, D: 22%); better idea: exploit cardinality imbalance; correlate “outliers”, probe “inliers” (A: 80%, then A: 0.1%)
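The "correlate outliers, probe inliers" idea can be sketched as a two-pass count. Because outliers are rare, counting attribute values over them is cheap, and only the values that are common among outliers need to be looked up in the much larger inlier set. The function name, the thresholds, and the batch (non-streaming) setting below are assumptions for illustration. The data is constructed to reproduce the percentages on the slide, on the assumption that the smaller class (A: 80%, B: 20%) is the outliers.

```python
from collections import Counter


def explain_outliers_first(outliers, inliers, min_support=0.5, min_ratio=3.0):
    """Find attribute values over-represented among outliers.

    Pass 1 counts values in the small outlier set only.
    Pass 2 probes the large inlier set, but only for surviving candidates.
    """
    out_counts = Counter(outliers)
    candidates = {
        v for v, c in out_counts.items() if c / len(outliers) >= min_support
    }
    # Values that are rare among outliers (B here) are never counted in inliers.
    in_counts = Counter(v for v in inliers if v in candidates)

    results = {}
    for v in candidates:
        if in_counts[v] == 0:
            ratio = float("inf")
        else:
            # Ratio of occurrence rates, computed with integer products
            # to avoid floating-point rounding in the example.
            ratio = (out_counts[v] * len(inliers)) / (in_counts[v] * len(outliers))
        if ratio >= min_ratio:
            results[v] = ratio
    return results


# Outliers: A 80%, B 20%. Inliers: A 0.1%, C 31.9%, B 46%, D 22%.
outliers = ["A"] * 8 + ["B"] * 2
inliers = ["A"] * 1 + ["C"] * 319 + ["B"] * 460 + ["D"] * 220

explanation = explain_outliers_first(outliers, inliers)
print(explanation)  # {'A': 800.0}
```

Compared with building full frequency tables for both classes, this version keeps counters only for the candidate values while scanning the inliers, which is where most of the data is.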

  47. classify + explain; key: make this combo fast

  48. classify + explain; key: make this combo fast; surprise: this combo enables new optimizations

  49. one weird trick for 2017 systems research: 1.) read a textbook on statistics/ML; 2.) implement the thing that should work; 3.) observe it’s really slow; 4.) make it fast using systems techniques; needed: classic systems techniques: indexing, caching, predicate pushdown, sketching

  50. is this system just a bunch of one-off hacks?

  51. is this system just a bunch of one-off hacks? no! only need a small # of core operators (classify, explain), coupled with domain-specific features (featurize)

  52. is this system just a bunch of one-off hacks? no! only need a small # of core operators (classify, explain), coupled with domain-specific features (featurize) [figure labels: groupby(video) + CV xform; optical flow; mean; MAD]

  53. classify, explain, featurize: a range of interfaces empowers a range of users: domain experts: point and click UI; scripters: custom dataflow pipelines; ML and systems ninjas: custom operators

  54. users inform design: automotive: monitoring fleet QoS; online services & datacenters (DevOps / monitoring): identifying slow containers, exception telemetry; industrial manufacturing: key sources of process variance in product; geophysics: Lunar water ice detection, seismic activity detection

  55. classify, explain, featurize. fast data: • overabundant data, scarce human attention • a major opportunity for systems, w/ real use cases. macrobase: • an open source search engine for fast data • modular, efficient classification and explanation
