READY: Completeness is in the Eye of the Beholder B. Chandramouli, - - PowerPoint PPT Presentation

ready completeness is in the eye of the beholder
SMART_READER_LITE
LIVE PREVIEW

READY: Completeness is in the Eye of the Beholder B. Chandramouli, - - PowerPoint PPT Presentation

READY: Completeness is in the Eye of the Beholder B. Chandramouli, J. Gehrke, J. Goldstein, M. Hofmann, D. Kossmann , J. Levandoski, R. Marroquin, W. Xin ETH Zurich, Facebook, Microsoft Observations Observation 1: We produce data in silos


slide-1
SLIDE 1

READY: Completeness is in the Eye of the Beholder

  • B. Chandramouli, J. Gehrke, J. Goldstein, M. Hofmann,
  • D. Kossmann, J. Levandoski, R. Marroquin, W. Xin

ETH Zurich, Facebook, Microsoft

slide-2
SLIDE 2

Observations

  • Observation 1: We produce data in silos (OLTP databases)
  • rich update functionality (SQL)
  • transactions (concurrency & durability)
  • integrity constraints in silo
  • principle: maximum control and isolation for high data quality
  • Observation 2: We consume data across silos (analytics in data lake)
  • rich query functionality (SQL)
  • snapshots across silos
  • integrity constraints across silos
  • principle: the more data, the merrier
slide-3
SLIDE 3

ETL: Implement Producer / Consumer Pipeline

DB DB Files Data Lake ERP CRM

Data Producers (OLTP) Data Consumers (OLAP)

PowerBI

ETL

slide-4
SLIDE 4

ETL: Implement Producer / Consumer Pipeline

DB DB Files Data Lake ERP CRM

Data Producers (OLTP) Data Consumers (OLAP)

PowerBI

ETL

SQL, concurrency, integrity SQL snapshots, integrity

slide-5
SLIDE 5

What if we produce data in the data lake…?

  • One (logical) copy of data
  • lower cost to move data
  • higher freshness
  • One (logical) system
  • higher agility & productivity
  • lower cost (optimization)

DB DB

Files Files

DB

ERP CRM PowerBI

slide-6
SLIDE 6

What if we produce data in the data lake…?

We need to add integrity constraints to data lake!

DB DB

Files Files

DB

ERP CRM PowerBI

SQL, concurrency, integrity

slide-7
SLIDE 7

Agenda

  • READY: Basic Concepts
  • READY 1.0: Implementation (see paper)
  • Experiments & Results
slide-8
SLIDE 8

READY Goals and „CAP Theorem“

  • Custom Integrity
  • every app defines its own set
  • f integrity constraints
  • Sharing
  • one (logical) copy of data
  • Decoupling
  • No application blocked by

constraints of another app

  • Can only have two of three!

DB DB

Files Files

DB

ERP CRM PowerBI

slide-9
SLIDE 9

READY Goals

  • Custom Integrity
  • every producer and every consumer defines own set of constraints
  • Sharing
  • there is only one logical copy of the data
  • Decoupling
  • producers are not blocked by consumers
  • consumers are not blocked by other consumers
  • „CAP Theorem“: Easy to achieve two of these three goals
  • DW + ETL achieves „custom integrity“ and „decoupling“ goals. But, not „sharing“.
  • Data Lake achieves „sharing“ and „decoupling“. But, not „custom integrity“.
slide-10
SLIDE 10

Example: Why is it difficult?

  • USA Analyst

○ queries on USA orders ○ report only when all USA

  • rders have shipped
  • Toys Analyst

○ queries on Toys orders ○ report only when all Toys

  • rders have shipped

Timeline of Order Process.

1. Enter(1, car, USA) 2. Enter(2, ball, Germany) 3. Ship(1) 4. Enter(3, ball, Germany) 5. Enter(4, car, USA) 6. Ship(2, 3) 7. Enter(5, ball, Germany) 8. Ship(4)

slide-11
SLIDE 11
  • USA Analyst

○ queries on USA orders ○ report only when all USA

  • rders have shipped
  • Toys Analyst

○ queries on Toys orders ○ report only when all Toys

  • rders have shipped

Example: Why is it difficult?

Timeline of Order Proc.

1. Enter(1, car, USA) 2. Enter(2, ball, Germany) 3. Ship(1) 4. Enter(3, ball, Germany) 5. Enter(4, car, USA) 6. Ship(2, 3) 7. Enter(5, ball, Germany) 8. Ship(4)

Database states that meet USA Analyst‘s constraint: 3, 4, 8.

slide-12
SLIDE 12

Example: Why is it difficult?

  • USA Analyst
  • queries on USA orders
  • report only when all USA
  • rders have shipped
  • Toys Analyst
  • queries on Toys orders
  • report only when all Toys
  • rders have shipped

Timeline of Order Proc.

1. Enter(1, car, USA) 2. Enter(2, ball, Germany) 3. Ship(1) 4. Enter(3, ball, Germany) 5. Enter(4, car, USA) 6. Ship(2, 3) 7. Enter(5, ball, Germany) 8. Ship(4)

Database States that meet Toys Analyst’s constraints: 1, 6

slide-13
SLIDE 13

READY Approach

  • Each update creates a new version of the data lake
  • efficient implementation of update batches via „delta materialization“ (see paper)
  • All applications run in a sandbox
  • sandbox defines a set of integrity constraints
  • (sandbox also determines concurrency control policy)
  • Consumers: sandbox controls which versions are visible
  • non-compliant versions are not visible to consumer, but possibly to other consumers
  • query annotation determines which visible version to use (next, last, continuous)
  • Producers: sandbox controls which versions are legal
  • non-compliance results in abort of transactions, just as in regular RDBMS
  • In READY 1.0, there is only one producer sandbox
slide-14
SLIDE 14

READY Approach: Temporal Data Lake

Producer V1 V2 V3 V4 V5

all versions of data lake

  • Producer generates new versions independent of consumers
  • Only requirement: each version meets producers integrity constraints
slide-15
SLIDE 15

READY Approach: Visibility of Versions

Producer V1 V2 V3 V4 V5

all versions of data lake

Consumer1

all versions visible to C1 (versions meet C1’s constraints)

  • Consumer1 only sees those versions that meets its constraints
  • does not block producer if producer creates version that is not compliant
slide-16
SLIDE 16

READY Approach: Query Annotations

Producer V1 V2 V3 V4 V5

all versions of data lake

Consumer1

all versions visible to C1 (versions meet C1’s constraints)

  • Consumer1 only sees those versions that meets its constraints
  • Last: use latest visible version to process query

query (last)

slide-17
SLIDE 17

READY Approach: Query Annotations

Producer V1 V2 V3 V4 V5

all versions of data lake

Consumer1

all versions visible to C1 (versions meet C1’s constraints)

  • Consumer1 only sees those versions that meets its constraints
  • Last: use latest visible version to process query
  • Next: wait for next visible version to process query

query (next)

slide-18
SLIDE 18

READY Approach: Decoupling

Producer V1 V2 V3 V4 V5

all versions of data lake

Consumer1

all versions visible to C1 (versions meet C1’s constraints)

  • Consumer2 only sees those versions that meet its constraints
  • may or may not overlap with C1 or any other consumer

Consumer2

all versions visible to C2 (versions meet C2’s constraints)

slide-19
SLIDE 19

Related Work and Concepts

  • Views
  • Pro: sandbox is like a view that filters the right version of a record
  • Con: sandboxes are updateable (producers run in sandboxes)
  • Con: simpler view definition
  • Materialized Views, Incremental Maintenance & Streaming
  • a great way to implement sandboxes
  • DataHub, Version Control Systems (git), Temporal Databases
  • right way to think about data lake and visibility of versions
slide-20
SLIDE 20

Agenda

  • READY: Basic Concepts
  • READY 1.0: Implementation
  • Experiments & Results
slide-21
SLIDE 21

READY 1.0

  • Sandboxes

○ Each transaction / query runs in a sandbox ○ Sandboxes Define: ■ which snapshots of data lake are visible ■ which business objects are visible

21

slide-22
SLIDE 22

READY 1.0: Sandbox Syntax

CREATE SANDBOX sandboxName ( argname argtype )* [ FOR UPDATES ] [ WHEN predicate ] [ WITH ( relationName: predicate )* ];

slide-23
SLIDE 23

READY 1.0: Sandbox Example

CREATE SANDBOX noOpenOrderSandbox() WHEN NOT EXISTS (SELECT * FROM Order o WHERE o.o_status = "Open")

slide-24
SLIDE 24

READY 1.0: Parameterized Sandbox

CREATE SANDBOX completeByNation(:nationId INT) WHEN FORALL (SELECT o.status as s FROM Order o, Customer c WHERE o.o_custkey == c.c_custkey AND c.c_nationkey == :nationId) SATISFY s = "Verified"

slide-25
SLIDE 25

BEGIN USING completeByNation("Germany") NEXT; SELECT c.name, count(*) FROM Order o, Customer c WHERE o.o_custkey = c.c_custkey AND c.c_nationkey = "Germany" GROUP BY c.name COMMIT;

READY 1.0: Sandbox Usage

slide-26
SLIDE 26

READY 1.0: System Overview

slide-27
SLIDE 27

READY 1.0: Version Management

slide-28
SLIDE 28

READY 1.0: Delta Materialization

slide-29
SLIDE 29

READY 1.0: Version Selection (integrity checks)

  • Batch processing whenever needed
  • Incremental processing with every new version of data lake
  • Constraint checks can be expressed as tuple counting

Exists: count(S) > 0 FORALL: count(S) = count(p(S))

  • Transform constraint checking into:

count(post(S)) = count(pre(S)) + count(delta(S))

slide-30
SLIDE 30

Agenda

  • READY: Basic Concepts
  • READY 1.0: Implementation (see paper)
  • Experiments & Results
slide-31
SLIDE 31

READY Prototype (Runtime System)

slide-32
SLIDE 32

TPC-H on READY

TPC-H Update Functions No constraints TPC-H Queries Synthetic Integrity Constraints

slide-33
SLIDE 33

READY 1.0: Version Management

slide-34
SLIDE 34

READY 1.0: Delta Materialization

slide-35
SLIDE 35

Experimental Set-Up and Goals

  • Experiment 1: Measure Cost
  • study „sharing“ goal
  • vary number of applications (sandboxes) with synthetic integrity constraints
  • Experiment 2: Measure Data Freshness
  • study „decoupling“ goal
  • vary number of applications (sandboxes) with synthetic integrity constraints
  • Baselines for all experiments
  • Global: data warehouse in which all consumers run on single data mart
  • (conjunction of all sandboxes)
  • Personal: one data mart for each consumer
slide-36
SLIDE 36

Baseline 1: Global Data Warehouse

DB Producer

ETL

DB Consumer1 Consumer2

Sandbox1 AND Sandbox2

slide-37
SLIDE 37

Baseline 2: Personal Data Warehouse

DB Producer

ETL2

DB Consumer1

Sandbox1

DB Consumer1

Sandbox1

ETL1

slide-38
SLIDE 38

Exp 1: Cost of TPC-H Update Functions

  • Cost of Delta Materialization
  • 8 Sandboxes, vary TPC-H Scaling Factor
slide-39
SLIDE 39

Exp 2: Data Freshness (vary SF, sandboxes)

SF1 SF10

slide-40
SLIDE 40

Conclusion & Future Work

  • Thought experiment: What if DBMS supports multiple sets of IC?
  • semantics make use of all classic DB concepts: snapshots, views, temporal
  • nice implementation on top of Spark possible
  • Future Work
  • Generalize READY model: multiple producer sandboxes (think git & branching)
  • need a way to „merge“ branches