Sink or Swim: How not to drown in (colossal) streams of data?
Nitin Agrawal
ThoughtSpot
Colossal streams of data
▹ 4 TB /car /day x 100s of thousands of cars
▹ 10 TB /data center /day x 10s of data centers
▹ 20 GB /home /day x 100s of thousands of homes
▹ 10 MB /device /day x millions of devices
Analyses ▹ Forecast ▹ Recommend ▹ Detect outliers ▹ Telemetry ▹ Route planning
IoT Applications ▹ Occupancy sensing ▹ Energy monitoring ▹ Safety and care ▹ Surveillance ▹ Industrial automation
In-memory analytics systems
▹ Interactive latency, but $$$$ ▹ Need secondary system for persistence
Conventional (storage) systems
▹ High latency ▹ Still quite resource intensive
Disk performance has improved dramatically over the years. But storage is still a bottleneck…
Disk read performance (spec)

                      HDD          SSD
Random IOPS (4K)      61           400K
Sequential (MBps)     250          3,400
Price                 $0.035/GB    $0.50/GB
Query performance (spec)

                        1 GB        1 TB
Random       HDD        1 hr        48 days
             SSD        0.6 secs    11 mins
Sequential   HDD        4 secs      1 hr
             SSD        0.3 secs    5 mins
Continuous data generation is on a significant rise
▹ From sensors, smart devices, servers, vehicles, … ▹ Analyses require timely responses ▹ Overwhelms ingest and processing capability
Conventional storage systems can’t cope with data growth
▹ Designed for general-purpose querying, not analyses ▹ Store all data for posterity; required capacity grows linearly ▹ Administered storage is expensive relative to the disks themselves
Democratizing storage
▹ No one size fits all; store what the application needs.

Democratizing discovery
▹ Intuitive interfaces for end-users to engage with data.
Revisiting design assumptions around data
▹ Data streams unlike tax returns, family photos, documents ▹ Consumed by analytics not human readers ▹ Embracing approximate storage - not all data equally valuable for analyses
Applications can be designed with uncertainty and incompleteness in mind
▹ Many care about answer “quality” and timeliness, not solely precision
Could store all data and lazily approximate at query time
▹ Slow: ingest and post-processing takes time ▹ Expensive: system needs to be provisioned for all ingested data
Human-centric interfaces to data
▹ End users are not always experts in query formulation. ▹ Embracing natural language querying and searching.
Custom data-centric applications without significant effort
▹ End users do not necessarily have deep programming expertise. ▹ Empower users to write new applications with little or no software development.
Proactively summarize data in persistent storage
▹ Fast: queries need to run on only a fraction of the data; summaries provide additional speedup
▹ Cheap: system provisioned only for the approximated data; capacity grows sub-linearly or logarithmically with data
▹ Maximize utilization of administered storage and compute
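The sub-linear capacity claim can be sanity-checked with a small sketch: if window timespans double with age, the number of fixed-footprint summary windows needed to cover N values grows roughly as log2 N. The doubling layout here is a simplification for illustration, not SummaryStore's actual window policy.

```python
def windows_needed(n_values):
    """Windows required to cover n_values when window timespans double
    with age (1, 2, 4, 8, ...) and each window has a fixed footprint.
    Simplified layout, not SummaryStore's actual window policy."""
    count, covered, length = 0, 0, 1
    while covered < n_values:
        covered += length
        length *= 2
        count += 1
    return count

# Storage grows ~log2(N) windows, not linearly with N:
for n in (10**3, 10**6, 10**9):
    print(n, "values ->", windows_needed(n), "windows")   # 10, 20, 30 windows
```

A thousand-fold increase in data costs only ten more windows, which is the sense in which provisioned capacity grows logarithmically.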
Caveats and limitations of approximate storage
▹ Effectiveness depends on target analyses ▹ Interesting research questions!
SummaryStore: an approximate store for “colossal” time-series data

Key observation: in time-series analyses
▹ Newer data is typically more important than older ▹ Can get away with approximating older data more
In real applications (forecasting, outlier analysis, ...) and microbenchmarks:
▹ Scale: 1 PB on a single node (compacted 100x)
▹ Latency: < 1 s at the 95th percentile
▹ Error: < 10% at the 95th percentile
▹ 10x compaction ⇒ < 0.1% error
Ensuring answer quality
▹ Provide high quality answers under aggressive approx. ▹ Quantify answer quality and errors
Ensuring query generality
▹ Enable analyses to perform acceptably given approx. scheme ▹ Handle workloads at odds with approx. (e.g., outliers)
Reducing developer burden
▹ App developers are not statisticians; need abstractions to incorporate imprecision ▹ Counter design assumptions across layers of the storage stack
In-memory analytics systems
▹ Interactive latency, but $$$$ ▹ Need secondary system for persistence
Conventional time-series stores
▹ High latency, still quite expensive
Approximate data stores?
▹ Promising reduction in cost & latency ▹ Current approx storage systems not viable for data streams
We make the following observation: analyses favor newer data. Examples:
▹ Spotify, SoundCloud: time-decayed weights in song recommender
▹ Facebook EdgeRank: time-decayed weights in newsfeed recommender
▹ Twitter Observability: archive data past an age threshold at lower resolution
▹ Smart-home apps: decaying weights in, e.g., HVAC control and energy monitoring
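The examples above all weight newer data more heavily. A minimal sketch of one common scheme, exponential time decay, where an item's influence halves every half-life; the function names and half-life value are illustrative, and each system above uses its own scheme.

```python
def decayed_weight(age_days, half_life_days=30.0):
    """Exponential time decay: influence halves every half_life_days.
    (Illustrative; each system uses its own decay scheme.)"""
    return 0.5 ** (age_days / half_life_days)

def decayed_score(events, half_life_days=30.0):
    """Score a list of (age_days, value) events with time-decayed weights."""
    return sum(v * decayed_weight(a, half_life_days) for a, v in events)

# A play today counts fully; a play 30 days ago counts half as much:
print(decayed_weight(0))    # 1.0
print(decayed_weight(30))   # 0.5
```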
[Figure: # bits allocated decreasing with datum age]
Allocates fewer bits to older data than to new: each datum decays over time. Approximates data by leveraging the observation that analyses favor newer data.
*Low-Latency Analytics on Colossal Data Streams with SummaryStore, Nitin Agrawal, Ashish Vulimiri. SOSP ’17.
[Figure: a 32-bit value arrives at the head of the stream; the bits allocated to it decay with age: 32, 16, 8, 4, 2, 1, ½, ¼]
Group values in windows. Discard raw data, keep only window summaries
▹ e.g. Sum, Count, Histogram, Bloom filter, ... ▹ Each window is given same storage footprint
To achieve decay, use longer timespan windows over older data
[Figure: windows of increasing timespan toward older data, each summarized as (Sum, Count) in 64 bits; e.g., 64 bits over 2 values = 32 bits/value for newer data vs. 64 bits over 16 values = 4 bits/value for older data]
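A minimal sketch of the windowing idea, under a simplified assumption that window lengths double with age (not SummaryStore's actual ingest path): group an oldest-first stream into windows, keep only (sum, count) per window, and give the oldest data the longest window.

```python
def summarize(stream):
    """Group a stream (oldest-first) into windows whose length doubles
    with age, keeping only (sum, count) per window -- a fixed footprint
    regardless of how many raw values a window covers.
    Illustrative sketch, not SummaryStore's actual layout."""
    remaining = list(stream)
    # Choose doubling lengths 1, 2, 4, ... until the stream is covered.
    lengths, covered = [], 0
    while covered < len(remaining):
        lengths.append(2 ** len(lengths))
        covered += lengths[-1]
    windows = []  # oldest-first list of (sum, count)
    for length in reversed(lengths):  # oldest window is the longest
        chunk, remaining = remaining[:length], remaining[length:]
        if chunk:
            windows.append((sum(chunk), len(chunk)))
    return windows

# 7 values -> windows of 4, 2, and 1 values; raw data is discarded.
print(summarize(range(7)))   # [(6, 4), (9, 2), (6, 1)]
```

Any aggregate over a range of windows (e.g., an average) is then computed from the stored (sum, count) pairs alone.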
Configuration: window lengths 1, 2, 4, 8, ...; each window has a Bloom filter.

[Figure: values v1...v7 spread across windows of lengths 4, 2, and 1; the newest window has room for one more value, and ingesting a new value shifts which windows v4 and v6 belong to]

We don’t have raw values, only window summaries (Bloom filters). How do we “move” v4, v6 between windows?
▹ Ingest new values into new windows
▹ Periodically compact data by merging consecutive windows, i.e., merge all their summary data structures (e.g., windows v1...v8 and v9...v12 merge into one window v1...v12)
▹ Merge operations: Bloom filter: bitwise OR; Count: add; Histogram: combine & rebin (the bitwise OR of two 1000-bit Bloom filters is itself a 1000-bit Bloom filter)
† E. Cohen, M. Strauss, “Maintaining time-decaying stream aggregates”, J. Algorithms, 2006
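The compaction step can be sketched as follows. The Bloom filter here is a toy (illustrative size and hash choice), but the merge operations match the slide: counts add, and Bloom filters of the same size combine by bitwise OR of their bit arrays.

```python
import hashlib

class Bloom:
    """Tiny Bloom filter; merging two same-size filters is a bitwise OR
    of their bit arrays. Illustrative sizes and hashing only."""
    def __init__(self, nbits=1000, nhashes=3, bits=0):
        self.nbits, self.nhashes, self.bits = nbits, nhashes, bits

    def _positions(self, item):
        for i in range(self.nhashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.nbits

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        return all(self.bits >> p & 1 for p in self._positions(item))

    def merge(self, other):
        assert self.nbits == other.nbits
        return Bloom(self.nbits, self.nhashes, self.bits | other.bits)

def merge_windows(w1, w2):
    """Merge two consecutive windows: counts add, Bloom filters OR."""
    return {"count": w1["count"] + w2["count"],
            "bloom": w1["bloom"].merge(w2["bloom"])}
```

After the merge, membership queries against the combined window answer for all values ingested into either original window, with the usual Bloom-filter false-positive caveat.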
Examples
▹ What was average energy usage in Sep 2015? ▹ Fetch a random (time-decayed) sample over the last 1 year
query a summary over the time-range [T1, T2]
Time-ranges are allowed to be arbitrary, need not be window-aligned
what was count in the time-range [T1, T2]?
▹ For a window only partially covered by the range, we don’t know the precise count in its sub-intervals
▹ This lack of window alignment introduces error
▹ We use novel low-overhead statistical techniques to estimate the answer & a confidence interval
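A simplified stand-in for such an estimator (not the paper's actual statistical technique): a fully covered window contributes its exact count, and a partially covered window contributes a count proportional to the overlap, under the assumption that arrivals are uniform within a window.

```python
def estimate_count(windows, t1, t2):
    """Estimate the count over [t1, t2] from window summaries.
    Each window is (start, end, count). Partially covered windows
    contribute proportionally to the overlap, assuming arrivals are
    uniform within a window (a simpler stand-in for SummaryStore's
    estimators, which also produce confidence intervals)."""
    total = 0.0
    for start, end, count in windows:
        overlap = min(end, t2) - max(start, t1)
        if overlap <= 0:
            continue  # window lies entirely outside the query range
        total += count * overlap / (end - start)
    return total

windows = [(0, 10, 100), (10, 20, 40), (20, 30, 10)]
print(estimate_count(windows, 5, 20))   # half of 100 + all of 40 = 90.0
```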
Age = how far back in time query goes
▹ Lower age ⇒ more recent data, so better accuracy
Length = time-span query covers
▹ Longer length ⇒ more windows spanned, so better accuracy
Not suited for large age + small length
▹ e.g. query over the time range [10 years ago, 10 years ago + 3 seconds]
[Figure: answer quality as a function of query age and length: good (✓) for most (age, length) combinations, degrading (⎯) and then failing (✕) as age grows and length shrinks]
▹ Forecasting ▹ Outlier analysis ▹ Analyzing network traffic and data backup logs
Prophet: open-source forecasting library from Facebook

Tested three datasets
▹ WIKI: visit counts for Wikipedia pages ▹ NOAA: global surface temperature readings ▹ ECON: log of US economic indicators
On each time-series in each dataset, compared forecast accuracy of
▹ Model trained on all data ▹ Model trained on time-decayed sample of data
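One way to draw the time-decayed sample used in the comparison can be sketched as follows: a point of age a survives with probability 0.5^(a / half_life), so older data is exponentially less likely to be kept. This is a hypothetical sketch; the evaluation's exact sampling scheme may differ.

```python
import random

def time_decayed_sample(points, half_life, rng=None):
    """Keep each (age, value) point with probability 0.5 ** (age / half_life),
    so older points are exponentially less likely to survive.
    Hypothetical sketch of a time-decayed sample."""
    rng = rng or random.Random(0)
    return [(age, v) for age, v in points
            if rng.random() < 0.5 ** (age / half_life)]

# Ages 0..999; recent points survive at much higher rates than old ones.
points = [(age, age % 7) for age in range(1000)]
sample = time_decayed_sample(points, half_life=100)
```

A forecasting model trained on such a sample sees almost all recent history but only a sparse sketch of the distant past.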
10x compaction < 0.1% error
[Charts: forecast accuracy on ECON, WIKI, and NOAA; substantial improvement overall, with the difference less stark on a more predictable dataset]
Mechanism for protecting specific values from decay

Values declared as landmarks are
▹ Always stored at full resolution ▹ Seamlessly combined with decayed data when answering queries
Example application: outlier analysis
[Figure: timeline from oldest to newest, with landmark values preserved at full resolution alongside decayed data]
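A minimal sketch of the landmark idea, with hypothetical names and a running (sum, count) standing in for decayed windows: values matching a predicate are stored raw and never decayed, while everything else is summarized; queries combine both.

```python
class LandmarkStore:
    """Sketch: values matching is_landmark are kept at full resolution;
    everything else is folded into a running (sum, count) summary that
    stands in for decayed windows. Hypothetical, illustrative only."""
    def __init__(self, is_landmark):
        self.is_landmark = is_landmark
        self.landmarks = []          # raw (time, value) pairs, never decayed
        self.sum = 0.0
        self.count = 0

    def ingest(self, t, value):
        if self.is_landmark(value):
            self.landmarks.append((t, value))
        else:
            self.sum += value
            self.count += 1

    def mean(self):
        """Seamlessly combine decayed summary with exact landmark values."""
        total = self.sum + sum(v for _, v in self.landmarks)
        n = self.count + len(self.landmarks)
        return total / n if n else 0.0

# Protect outliers (e.g., readings above 100) from decay:
store = LandmarkStore(lambda v: v > 100)
for t, v in enumerate([1, 2, 500, 3]):
    store.ingest(t, v)
print(store.landmarks)   # [(2, 500)]
```

This is why outlier analysis works under aggressive decay: the interesting values are exempt from it.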
Choice of summaries needs to be defined a priori at stream creation; criteria for “landmarks” are also defined a priori
▹ Scope of high-level analytics limited by the selection
Configuring rate of decay left to application
▹ Hard to estimate impact on individual query errors ▹ How aggressively can an application compact?
New summary operators can be added but require some effort
▹ Need to specify union function & model for error estimation
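The union-function-plus-error-model requirement can be sketched as a small operator protocol; the interface names here are hypothetical, not SummaryStore's actual API, and the error model is a crude proportional bound.

```python
from dataclasses import dataclass

@dataclass
class CountOp:
    """Sketch of the two pieces a new summary operator must supply:
    a union (merge) function and an error model for partial overlap.
    Hypothetical interface, not SummaryStore's actual API."""
    value: float = 0.0

    def insert(self, _item):
        self.value += 1

    @staticmethod
    def union(a, b):
        """Merging two counts is addition."""
        return CountOp(a.value + b.value)

    def estimate(self, overlap_fraction):
        """Proportional estimate plus a crude +/- bound for the
        partially covered portion (stand-in error model)."""
        est = self.value * overlap_fraction
        err = self.value * min(overlap_fraction, 1 - overlap_fraction)
        return est, err

a, b = CountOp(), CountOp()
for x in range(3): a.insert(x)
for x in range(5): b.insert(x)
merged = CountOp.union(a, b)
print(merged.value)          # 8.0
print(merged.estimate(0.5))  # (4.0, 4.0)
```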
Contributions
▹ Abstraction: time-decayed summaries + landmarks ▹ Data ingest mechanism ▹ Low-overhead statistical techniques bounding query error
Works well in real applications and microbenchmarks:
▹ 10-100x compaction, warm-cache latency < 1s, low error ▹ 1 PB on a single node (summarized to 10 TB)
Project details and papers at https://bit.do/summarystore
Data streams everywhere, and growing
▹ Variety of analytics and learning apps require timely answers

Storage systems need orders-of-magnitude scaling to handle data growth
▹ Conventional approaches to scale up and scale out are insufficient
▹ Conventional access paradigms increasingly insufficient
Broader research agenda around approximate computing
▹ Programming languages, architecture, user interaction, developer tools
New paradigms for data discovery and application development
▹ Human-centric interfaces to data siloed in storage systems