RUBIK: Efficient Threshold Queries on Massive Time Series

Eleni Tzirita Zacharatou ‡, Thomas Heinis *, Farhan Tauheed §, Anastasia Ailamaki ‡
‡ École Polytechnique Fédérale de Lausanne
* Imperial College London
§ Oracle Labs, Zurich

Scaling up Brain Simulations
[Figure: a 3D neuron model (Model Resolution) and voltage-over-time traces (Temporal Resolution)]
Time Series Analysis: key to neuroscientific discovery

Neuron firing: which and when
• Exploration
• Hypothesis Testing
Identify subsets of interest: time series where voltage > -40 and time step ∈ [300,400] (Threshold Query)
[Figure: voltage over time]
Threshold queries fuel efficient data analysis
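The threshold query on this slide can be illustrated with a naive scan, with no index at all. This is only a sketch: the name `threshold_query`, the dictionary layout, and the sample voltages are assumptions for illustration. It also assumes a series qualifies if it exceeds the threshold at any time step inside the range.

```python
from typing import Dict, List

def threshold_query(series: Dict[str, List[float]], threshold: float,
                    t_start: int, t_end: int) -> List[str]:
    """Return ids of series exceeding `threshold` at some step in [t_start, t_end]."""
    hits = []
    for series_id, values in series.items():
        window = values[t_start:t_end + 1]  # inclusive time-step range
        if any(v > threshold for v in window):
            hits.append(series_id)
    return hits

# Hypothetical voltages for three neurons over four time steps.
series = {
    "n0": [-65.0, -50.0, -30.0, -60.0],  # crosses -40 at step 2
    "n1": [-70.0, -68.0, -66.0, -64.0],  # never crosses -40
    "n2": [-20.0, -65.0, -65.0, -65.0],  # crosses -40 only at step 0
}
result = threshold_query(series, threshold=-40.0, t_start=1, t_end=3)
print(result)
```

A scan like this touches every value in the range for every series, which is the cost an index is meant to avoid.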

Time Series Correlation…
[Figure: voltage per time step and time series id, showing trends and correlation]
Opportunity to scale with:
• Across time: increased simulation duration, increase in temporal resolution
• Across time series: increasingly detailed models, increase in spatial resolution
…enables efficient time series-specific compression

Time Series Data Discretization

Binning: Partition the values into bins
  Bin 3: [15-20)   17
  Bin 2: [10-15)
  Bin 1: [5-10)    9, 5
  Bin 0: [0-5)     2
⇒ Increased similarity across time series

Range encoding: Set bin to ‘1’ if condition satisfied, ‘0’ otherwise (one column per timestep)
  Value ≥ 20:   0 0 0 0
  Value ≥ 15:   0 0 1 0
  Value ≥ 10:   0 0 1 0
  Value ≥ 5:    1 1 1 0
⇒ Precomputed answers stored as a bitmap
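Binning and range encoding can be sketched in a few lines. The function names and the value order 9, 5, 17, 2 across timesteps are assumptions; the slide only shows which bin each value falls into. The bin width of 5 and the conditions ≥ 5, ≥ 10, ≥ 15, ≥ 20 follow the slide.

```python
from typing import Dict, List

def bin_index(value: float, bin_width: int = 5) -> int:
    """Map a value to its bin number, e.g. [5-10) -> bin 1."""
    return int(value // bin_width)

def range_encode(values: List[float], conditions: List[int]) -> Dict[int, List[int]]:
    """One bitvector per condition: bit t is 1 if values[t] >= condition."""
    return {c: [1 if v >= c else 0 for v in values] for c in conditions}

values = [9, 5, 17, 2]  # assumed order of the slide's values over four timesteps
bins = [bin_index(v) for v in values]
bitmap = range_encode(values, [20, 15, 10, 5])

for condition, bits in bitmap.items():
    print(f">= {condition:2d}: {bits}")
```

With this encoding, a query such as "value ≥ 10" reads a single precomputed bitvector instead of scanning the raw values.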

Bitmap Compression Today
• Run-Length-Encoding compresses each bitvector
  – Word-Aligned Hybrid Code (WAH) [SSDBM ’02]

Bitvector per bin (one bit per timestep) and its run-length encoding:
  0 0 0 0   →   4 × ’0’
  0 0 1 0   →   2 × ’0’, 1 × ’1’, 1 × ’0’
  0 0 1 0   →   2 × ’0’, 1 × ’1’, 1 × ’0’
  1 1 1 0   →   3 × ’1’, 1 × ’0’

• Compression prevents direct access
  – Timesteps don’t correspond to bit positions
⇒ Values filtered independently of timesteps
⇒ Similarities across time series are not exploited
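The run-length encoding of the four bitvectors can be reproduced with a short sketch. This is plain run-length encoding, not WAH itself (WAH works on machine words), and the helper names are assumptions. The `bit_at` helper illustrates the direct-access problem: finding the bit for one timestep means walking the runs.

```python
from itertools import groupby
from typing import List, Tuple

def run_length_encode(bits: List[int]) -> List[Tuple[int, int]]:
    """Collapse a bitvector into (bit, run length) pairs."""
    return [(bit, len(list(group))) for bit, group in groupby(bits)]

def bit_at(runs: List[Tuple[int, int]], timestep: int) -> int:
    """Look up one timestep by scanning runs; there is no direct indexing."""
    position = 0
    for bit, length in runs:
        position += length
        if timestep < position:
            return bit
    raise IndexError("timestep beyond the encoded bitvector")

rows = [[0, 0, 0, 0], [0, 0, 1, 0], [0, 0, 1, 0], [1, 1, 1, 0]]
encoded = [run_length_encode(row) for row in rows]
for row, runs in zip(rows, encoded):
    print(row, "->", runs)
```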

Our Approach: RUBIK
[Figure: bitmaps of several time series are built, stacked, and decomposed]
• Bitmap index creation
• Bitmap stacking: exploit similarities
• Quadtree-based bitmap decomposition: access specific timesteps

Quadtree-based 3D Bitmap Decomposition
[Figure: a 3D bitmap with axes Timestep, Bins, and Time series is split recursively]
Start: Mix
First Split: All 0, All 1, All 1, Mix
Second Split: Mix, All 0, All 1, All 0
Remaining Mix node: Apply WAH
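The recursive splitting can be sketched on a single 2D bitmap. This is a simplified illustration, not the RUBIK implementation: it assumes a square bitmap whose side is a power of two, splits down to single bits instead of handing mixed leaves to WAH, and uses a hypothetical node representation (0 for "All 0", 1 for "All 1", a list of four children for "Mix").

```python
from typing import List, Union

Node = Union[int, list]  # 0 = "All 0", 1 = "All 1", [NW, NE, SW, SE] = "Mix"

def build(bitmap: List[List[int]], r0: int, c0: int, size: int) -> Node:
    """Decompose the size x size square with top-left corner (r0, c0)."""
    cells = [bitmap[r][c]
             for r in range(r0, r0 + size)
             for c in range(c0, c0 + size)]
    if all(v == 0 for v in cells):
        return 0                      # uniform square of zeros
    if all(v == 1 for v in cells):
        return 1                      # uniform square of ones
    half = size // 2                  # mixed: split into four quadrants
    return [build(bitmap, r0, c0, half),
            build(bitmap, r0, c0 + half, half),
            build(bitmap, r0 + half, c0, half),
            build(bitmap, r0 + half, c0 + half, half)]

# Hypothetical 4 x 4 bitmap, chosen so the first split matches the slide.
bitmap = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 1],
    [1, 1, 0, 0],
]
tree = build(bitmap, 0, 0, 4)
print(tree)
```

Here the first split yields All 0, All 1, All 1, Mix, and only the mixed quadrant is split further; large uniform squares are stored as a single node.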

Query Execution
Query: voltage > 11 in time steps 1 and 2
[Figure: the quadtree (Mix; All 0, All 1, All 1, Mix; Mix, All 0, All 1, All 0) is traversed to extract the bits for Bin 1 and Timestep 1]
• Transformation into a 2D bitmap problem
• One tree traversal to retrieve multiple bitmaps
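A single traversal that retrieves the bits of several timesteps at once can be sketched as a rectangle query on a quadtree. This is an illustrative sketch under assumptions: the tree is a hypothetical 4 x 4 example in nested-list form (0 = "All 0", 1 = "All 1", a list of four children = "Mix"), rows stand for time series and columns for timesteps, and ranges are half-open.

```python
from typing import List, Tuple, Union

Node = Union[int, list]  # 0 = "All 0", 1 = "All 1", [NW, NE, SW, SE] = "Mix"

def query(node: Node, r0: int, c0: int, size: int,
          rows: Tuple[int, int], cols: Tuple[int, int],
          out: List[Tuple[int, int]]) -> None:
    """Append (row, col) of every 1-bit inside rows x cols (half-open ranges)."""
    # Prune subtrees that do not overlap the query rectangle.
    if (r0 >= rows[1] or r0 + size <= rows[0]
            or c0 >= cols[1] or c0 + size <= cols[0]):
        return
    if isinstance(node, list):        # Mix: descend into the four quadrants
        half = size // 2
        query(node[0], r0, c0, half, rows, cols, out)
        query(node[1], r0, c0 + half, half, rows, cols, out)
        query(node[2], r0 + half, c0, half, rows, cols, out)
        query(node[3], r0 + half, c0 + half, half, rows, cols, out)
    elif node == 1:                   # All 1: report the overlapping cells
        for r in range(max(r0, rows[0]), min(r0 + size, rows[1])):
            for c in range(max(c0, cols[0]), min(c0 + size, cols[1])):
                out.append((r, c))
    # All 0: nothing to report

# Hypothetical tree for the bitmap rows 0011 / 0011 / 1101 / 1100.
tree = [0, 1, 1, [0, 1, 0, 0]]
hits: List[Tuple[int, int]] = []
query(tree, 0, 0, 4, rows=(0, 4), cols=(1, 3), out=hits)  # timesteps 1 and 2
print(hits)
```

Uniform nodes are answered without touching individual bits, and subtrees outside the requested timesteps are skipped entirely.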

Stacking Time Series Bitmaps
Goal: Maximize size and number of common squares
[Figure: bitmaps 1, 2, and 3 with their quadtree nodes (Mix, All 0, All 1), grouped into cluster 1 and cluster 2]
⇒ Maximize compression across time series

Scaling with Data Volume
In-memory indexes: FastBit (WAH-compressed bitmap index) and RUBIK
Configuration: 128 bins, #queries: 60
Hardware: AMD Opteron CPU @ 2.7GHz, 32GB RAM
Time series data: 1000 time steps, 1.2GB – 4.8GB
[Figure: total execution time (s) and index size (MB) of FastBit and RUBIK for 312K, 624K, and 1.25M time series]
• 9X to 23X speedup
• RUBIK index size scales sublinearly

RUBIK Sensitivity Analysis
Configuration: 128 bins
Datasets: 500K – 2M time series, 1024 time steps, 2.1GB – 8.4GB
Benchmark: 60 threshold queries, random thresholds, up to 15% selectivity
Hardware: AMD Opteron, 2.7GHz, 32GB RAM
[Figure: query execution time (s), split into 2D range query and filtering, and index size vs. dataset size (GB), for the small, medium (2X), and large (4X) datasets; bar labels 5.8X, 6.7X, 7.5X]
• ~80% of the time is spent on filtering
• Increased similarity ⇒ Increased compression

Threshold Queries on Time Series
• Subsets of interest in neuroscience simulations
• RUBIK outperforms state-of-the-art by using:
  – Quadtree decomposition ⇒ Transformation into a 2D bitmap problem
  – Time series clustering ⇒ Similarities across time series are exploited
• RUBIK scales particularly well with time series from increasingly detailed simulation models
Thank you!
