Just-In-Time Data Virtualization: Lightweight Data Management with ViDa



slide-1
SLIDE 1

Just-In-Time Data Virtualization: Lightweight Data Management with ViDa

Manos Karpathiotakis*, Ioannis Alagiannis*, Thomas Heinis*‡, Miguel Branco*, Anastasia Ailamaki*


slide-2
SLIDE 2
Current data analysis does not scale

  • Growing data
  • Growing heterogeneity
  • Data movement regulations

“Most firms estimate that they are only analyzing 12% of the data that they already have” [Forrester 2014]

Available data blocks business & scientific analytics

slide-3
SLIDE 3

Discovering disease signatures

  • Move data
  • Copy data
  • Transform data

slide-4
SLIDE 4

Challenge: Physical integration & diverse queries

Clinical+Genetic+Imaging Data Signature

Patients (CSV):

id   Protein: AACT   Age   Phenotype          …
1    1.4             45    Trauma             …
2    2               55    Chronic Symptoms   …
3    0.2             56    …                  …

Brain_GrayMatter (Binary): matrix with columns 1 … n and rows 1 … m; sample values as extracted: 0.45 0.75 … 0.1, 0.33 0.3 … 0.38, …, 0.12 … 0.47

BrainRegions (JSON):

[{"id": 1, "amygdala": {"X": 15, "Y": 20, "Vol": 0.5}, "hippocampus": {"X": 17, "Y": 10, "Vol": 0.2}}, {"id": 2, ...}, {"id": 3, ...}]

Signature: age > 50 AND amygdala.Vol > 0.3 AND AACT < 1
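To make the integration problem concrete, here is a minimal Python sketch that evaluates the signature over in-memory stand-ins for the CSV and JSON sources. It is only an illustration, not how ViDa does it: the `Vol` values for patients 2 and 3 are hypothetical (the slide elides them), the function name `matching_patients` is made up, and the gray-matter matrix is left out.

```python
import csv
import io
import json

# Patients (CSV): the three rows shown on the slide (phenotype of id 3 is elided there).
PATIENTS_CSV = """id,AACT,age,phenotype
1,1.4,45,Trauma
2,2,55,Chronic Symptoms
3,0.2,56,
"""

# BrainRegions (JSON): id 1 comes from the slide; ids 2 and 3 are HYPOTHETICAL values.
BRAIN_REGIONS_JSON = """[
  {"id": 1, "amygdala": {"X": 15, "Y": 20, "Vol": 0.5}},
  {"id": 2, "amygdala": {"X": 14, "Y": 21, "Vol": 0.1}},
  {"id": 3, "amygdala": {"X": 16, "Y": 19, "Vol": 0.4}}
]"""


def matching_patients(patients_csv, regions_json):
    """Return ids satisfying: age > 50 AND amygdala.Vol > 0.3 AND AACT < 1."""
    # Index the JSON side by patient id so each CSV row needs one lookup.
    volume_by_id = {r["id"]: r["amygdala"]["Vol"] for r in json.loads(regions_json)}
    hits = []
    for row in csv.DictReader(io.StringIO(patients_csv)):
        pid = int(row["id"])
        if (int(row["age"]) > 50
                and volume_by_id.get(pid, 0.0) > 0.3
                and float(row["AACT"]) < 1):
            hits.append(pid)
    return hits


print(matching_patients(PATIENTS_CSV, BRAIN_REGIONS_JSON))  # [3]
```

Even this toy version has to parse two formats and convert types by hand, which is the work the slides argue should not require moving, copying, and transforming the data up front.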

slide-5
SLIDE 5

Diverse applications over diverse datasets

[Figure: External Data Sources, Operational Databases, Search Engine, Spreadsheet Application, Reporting Server, MapReduce Engine, Relational DBMS]

No Static Decisions!

Key: Data Virtualization

(Raw) Data:

  1. “Golden” repository
  2. Manipulate it freely
  3. Adapt to it & to queries
slide-6
SLIDE 6

ViDa Architecture

[Architecture diagram. Components: Source Descriptions, ViDa Query Language, ViDa Optimizer, Just-In-Time Query Executor, Just-In-Time Access Paths, Auxiliary Structures. Query languages: SQL, XQuery, ... Data sources: DBMS, JSON, XML, CSV, ...]

slide-7
SLIDE 7

Queries over heterogeneous datasets

[ViDa architecture diagram, as on slide 6]

slide-8
SLIDE 8

Queries translated to monoid comprehensions

Monoids:

  • Abstraction for “aggregates” computation

Monoid Comprehensions*:

  • Operations between monoids

Support multiple data models as input & output

for { p <- Patients, r <- BrainRegions, p.id = r.id, r.amygdala.Vol > 0.2 } yield list p.age

The output monoid after yield (here list) can be Sum/Bag/Set/Top-K/…

*Fegaras [TODS 2000]
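As a rough illustration of the idea, a monoid comprehension can be read as "fold the head expression over all qualifying bindings with the chosen output monoid". The Python sketch below follows that reading; the `Monoid` class, the `comprehend` helper, and the `Vol` values for ids 2 and 3 are assumptions made for the example, not ViDa's implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable


@dataclass(frozen=True)
class Monoid:
    zero: Callable[[], Any]            # identity element
    unit: Callable[[Any], Any]         # lift one value into the monoid
    merge: Callable[[Any, Any], Any]   # associative combine


LIST = Monoid(lambda: [], lambda x: [x], lambda a, b: a + b)
SET = Monoid(lambda: frozenset(), lambda x: frozenset([x]), lambda a, b: a | b)
SUM = Monoid(lambda: 0, lambda x: x, lambda a, b: a + b)

# Ages come from the slides; Vol for ids 2 and 3 is HYPOTHETICAL.
PATIENTS = [{"id": 1, "age": 45}, {"id": 2, "age": 55}, {"id": 3, "age": 56}]
BRAIN_REGIONS = [
    {"id": 1, "amygdala": {"Vol": 0.5}},
    {"id": 2, "amygdala": {"Vol": 0.1}},
    {"id": 3, "amygdala": {"Vol": 0.4}},
]


def bindings() -> Iterable[tuple]:
    """Qualifiers: p <- Patients, r <- BrainRegions, p.id = r.id, r.amygdala.Vol > 0.2."""
    return ((p, r) for p in PATIENTS for r in BRAIN_REGIONS
            if p["id"] == r["id"] and r["amygdala"]["Vol"] > 0.2)


def age(binding):
    """Head expression: p.age."""
    patient, _region = binding
    return patient["age"]


def comprehend(monoid: Monoid, head, rows: Iterable) -> Any:
    """Fold head(row) over every binding using the output monoid."""
    acc = monoid.zero()
    for row in rows:
        acc = monoid.merge(acc, monoid.unit(head(row)))
    return acc


print(comprehend(LIST, age, bindings()))  # [45, 56]
print(comprehend(SUM, age, bindings()))   # 101
```

Only the monoid argument changes between the two calls, which mirrors how the same comprehension body can yield a list, a set, or an aggregate.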

slide-9
SLIDE 9

“SQL++” → Comprehensions → Algebra

SELECT p.age FROM Patients p JOIN BrainRegions r ON (p.id = r.id) WHERE r.amygdala.Vol > 0.2

for { p <- Patients, r <- BrainRegions, p.id = r.id, r.amygdala.Vol > 0.2 } yield list p.age

[Figure: algebraic plan with Δ list at the root, then σ, over inputs Pat and Brain]

Comprehensions: Internal Calculus. Algebra: Optimizable Algebra.

Calculus constructs: if-else, record construction, function application, (nested) comprehension, …
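The last translation step can be pictured as rewriting the comprehension into a small tree of operators. The following Python sketch is one hedged reading of the plan in the figure: a reduce (Δ) over a selection (σ) over a join of the two inputs. The join operator is an assumption, since its symbol did not survive in the slide text, and the helper names and the `Vol` values for ids 2 and 3 are likewise made up.

```python
# Ages come from the slides; Vol for ids 2 and 3 is HYPOTHETICAL.
PAT = [{"id": 1, "age": 45}, {"id": 2, "age": 55}, {"id": 3, "age": 56}]
BRAIN = [
    {"id": 1, "amygdala": {"Vol": 0.5}},
    {"id": 2, "amygdala": {"Vol": 0.1}},
    {"id": 3, "amygdala": {"Vol": 0.4}},
]


def join(left, right, on):
    """Nested-loop join: yield (l, r) pairs for which `on` holds."""
    return ((l, r) for l in left for r in right if on(l, r))


def select(rows, predicate):
    """Selection (sigma): keep rows satisfying the predicate."""
    return (row for row in rows if predicate(row))


def reduce_to(rows, head, zero, merge):
    """Reduce (Delta): fold head(row) into an accumulator."""
    acc = zero
    for row in rows:
        acc = merge(acc, head(row))
    return acc


# Plan: Delta_list(p.age) over sigma(r.amygdala.Vol > 0.2) over join(Pat, Brain, p.id = r.id)
plan = reduce_to(
    select(
        join(PAT, BRAIN, on=lambda p, r: p["id"] == r["id"]),
        predicate=lambda pr: pr[1]["amygdala"]["Vol"] > 0.2,
    ),
    head=lambda pr: pr[0]["age"],
    zero=[],
    merge=lambda acc, value: acc + [value],
)
print(plan)  # [45, 56]
```

Once the query is in this operator form, an optimizer is free to reorder or fuse the operators, for example by pushing the selection below the join.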

slide-10
SLIDE 10

Query execution in ViDa

[ViDa architecture diagram, as on slide 6]

slide-11
SLIDE 11

Creating a query executor just-in-time

[Figure components: Plugin Catalog; Input Plugins (one per source, ...); Input Bindings; Operator Logic for the plan Δ list, σ over Pat and Brain; Output Plugin]

Adapt to data and queries just-in-time
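One way to picture the plugin idea is an interface that hides each file format behind `scan` and `read`, so the operator logic never touches CSV or JSON details. The Python sketch below is an interpreted stand-in under that assumption; ViDa generates code just-in-time instead, and every class and function name here, as well as the `Vol` values for ids 2 and 3, is hypothetical.

```python
import csv
import io
import json


class CsvInput:
    """Input plugin for CSV text: records are dicts of strings."""

    def __init__(self, text):
        self.text = text

    def scan(self):
        return csv.DictReader(io.StringIO(self.text))

    def read(self, record, path, cast):
        return cast(record[path])


class JsonInput:
    """Input plugin for a JSON array; `path` may be dotted, e.g. 'amygdala.Vol'."""

    def __init__(self, text):
        self.text = text

    def scan(self):
        return iter(json.loads(self.text))

    def read(self, record, path, cast):
        for key in path.split("."):
            record = record[key]
        return cast(record)


PLUGIN_CATALOG = {"csv": CsvInput, "json": JsonInput}


def build_executor(patients, brain_regions, output):
    """Bind one input plugin per source, then return the query as a callable."""
    pat = PLUGIN_CATALOG[patients[0]](patients[1])
    brain = PLUGIN_CATALOG[brain_regions[0]](brain_regions[1])

    def run():
        # Operator logic: hash join on id, filter on Vol, collect p.age.
        vol_by_id = {brain.read(r, "id", int): brain.read(r, "amygdala.Vol", float)
                     for r in brain.scan()}
        ages = [pat.read(p, "age", int) for p in pat.scan()
                if vol_by_id.get(pat.read(p, "id", int), 0.0) > 0.2]
        return output(ages)

    return run


PATIENTS_CSV = "id,age\n1,45\n2,55\n3,56\n"
# Vol for ids 2 and 3 is HYPOTHETICAL.
BRAIN_JSON = ('[{"id": 1, "amygdala": {"Vol": 0.5}}, {"id": 2, "amygdala": {"Vol": 0.1}}, '
              '{"id": 3, "amygdala": {"Vol": 0.4}}]')

executor = build_executor(("csv", PATIENTS_CSV), ("json", BRAIN_JSON), output=json.dumps)
print(executor())  # [45, 56]
```

Swapping the output plugin (here `json.dumps`) or registering another input format in the catalog leaves the operator logic unchanged.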

slide-12
SLIDE 12

ViDa access paths

  • Access paths generated Just-in-time*
  • Adapting to schema of data
  • File-format-specific opportunities
  • Position caches for textual formats ‡
  • Data caches

General-purpose scan: ∀col: if col needed: if col isInt ...

Generated just-in-time: readInt(); skipField(); readInt(); skipRest();

Columns: id, Protein: AACT, age, …

*RAW [VLDB 2014]

‡ NoDB [SIGMOD 2012]

Reduce access costs by adapting to underlying data
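The contrast between a generic scan loop and a generated access path can be sketched in Python by emitting the source of a reader specialized to one schema and one set of needed columns, then compiling it with `exec`. This is an assumption-laden toy (the function name `generate_csv_reader`, the schema encoding, and the use of `str.split` are all made up for the example); the slides do not describe ViDa's generator at this level.

```python
def generate_csv_reader(schema, needed):
    """Emit and compile a row reader specialized to `schema` and `needed` columns.

    schema: list of (column name, type name) pairs in file order.
    needed: non-empty set of column names the query touches.
    """
    casts = {"int": "int", "float": "float", "str": "str"}
    last = max(i for i, (name, _) in enumerate(schema) if name in needed)
    exprs = [f"{casts[typ]}(fields[{i}])"
             for i, (name, typ) in enumerate(schema[: last + 1]) if name in needed]
    source = "\n".join([
        "def read_row(line):",
        # Split only as far as the last needed column; the rest is skipped unparsed.
        f"    fields = line.rstrip('\\n').split(',', {last + 1})",
        f"    return ({', '.join(exprs)},)",
    ])
    namespace = {}
    exec(source, namespace)  # compile the specialized reader
    return namespace["read_row"], source


SCHEMA = [("id", "int"), ("AACT", "float"), ("age", "int"), ("phenotype", "str")]
read_row, source = generate_csv_reader(SCHEMA, needed={"id", "age"})

print(source)
print(read_row("2,2,55,Chronic Symptoms\n"))  # (2, 55)
```

The generated function contains no per-column branching: the checks for "is this column needed" and "what type is it" ran once at generation time, which is the point of the readInt(); skipField(); readInt(); skipRest(); sequence on the slide.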

slide-13
SLIDE 13

Just-in-time operators

[Figure: plan Δ list / p.age, σ, over Pat and Brain, annotated with the tuple layouts flowing between operators (Int; JSON text; Int, Int, JSON text)]

  • Query operators generated Just-in-time
  • “Hard-coded”, fine-grained operators
  • Adapting data layout of caches to

– query requirements
– data format, model

Candidate cache layouts: (Int, Int, JSON text), (Int, Int, BSON), (Int, Int, start pos., end pos.)

outputBindings(format);

Reduce processing costs by adapting to queries
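Two of the cache layouts named on this slide, JSON text versus start/end positions, can be compared with a small Python sketch. It assumes the raw input is a JSON array held in memory, uses `json.JSONDecoder.raw_decode` to find object boundaries, and leaves out the BSON layout because the standard library has no BSON codec. The function names and the `Vol` value for id 2 are hypothetical.

```python
import json

# id 1 comes from the slides; the Vol value for id 2 is HYPOTHETICAL.
RAW = '[{"id": 1, "amygdala": {"Vol": 0.5}}, {"id": 2, "amygdala": {"Vol": 0.1}}]'


def build_caches(raw):
    """Build two cache layouts over a raw JSON array of objects.

    text cache:     (id, JSON text)         copies the bytes, no file access later
    position cache: (id, start pos, end pos) points back into the raw data
    """
    decoder = json.JSONDecoder()
    text_cache, pos_cache = [], []
    idx = raw.index("[") + 1
    while True:
        while raw[idx] in " ,\n":
            idx += 1
        if raw[idx] == "]":
            break
        obj, end = decoder.raw_decode(raw, idx)  # parse one object, get its end offset
        text_cache.append((obj["id"], raw[idx:end]))
        pos_cache.append((obj["id"], idx, end))
        idx = end
    return text_cache, pos_cache


def vol_from_positions(raw, entry):
    """Resolve amygdala.Vol lazily from a position-cache entry."""
    _id, start, end = entry
    return json.loads(raw[start:end])["amygdala"]["Vol"]


text_cache, pos_cache = build_caches(RAW)
print(text_cache[0])
print(pos_cache)
print(vol_from_positions(RAW, pos_cache[0]))  # 0.5
```

The position layout is smaller but needs the raw data on every access, while the text layout trades memory for independence from the file, which is the kind of choice the slide says should follow query requirements and data format.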

slide-14
SLIDE 14

Query optimization in ViDa

[ViDa architecture diagram, as on slide 6]

slide-15
SLIDE 15

Optimizing a just-in-time database

  • Choosing appropriate layout
  • Lazy vs. Speculative Execution
  • Fixing “wrong” decisions at runtime

[Figure: plan with Δ list and σ over inputs Pat, Brain, Gen]

slide-16
SLIDE 16

Experimental Setup

  • Intel(R) Xeon(R) CPU E5-2660 @ 2.20GHz
  • 128 GB RAM
  • 7500 RPM SATA

Relation name   Tuples   Attributes   Size     Type
Patients        41718    156          29 MB    CSV
Genetics        51858    17832        1.8 GB   CSV
BrainRegions    17000    20446        5.3 GB   JSON

SELECT val1, ..., valN FROM Patients p JOIN Genetics g ON (p.id = g.id) JOIN BrainRegions b ON (g.id = b.id) WHERE pred1 AND ... AND predN

for { p <- Patients, g <- Genetics, b <- BrainRegions, p.id = g.id, g.id = b.id, pred1, ..., predN } yield val1, …, valN

slide-17
SLIDE 17

ViDa vs State-of-the-art

150 analytics queries on CSV & JSON data

[Figure: stacked bar chart of Execution Time (sec), axis ticks 200 to 1200, for ViDa, Col.Store, RowStore, Col.Store + Doc.Store, and RowStore + Doc.Store. Stacked components: Flattening, Loading - DBMS, Loading - Doc. Store, q1-q150]

ViDa: Competitive without loading/transforming

slide-18
SLIDE 18

ViDa enables lightweight data management

  • Decouple query language used from data layout
  • Adapt to datasets and queries just-in-time
  • Flexible and competitive with state of the art
