Wildfire: Evolving Databases for New-Gen Big Data Applications

R. Barber, C. Garcia-Arellano, R. Grosman, R. Mueller, V. Raman, R. Sidle, M. Spilchen, A. Storm, Y. Tian, P. Tozun, D. Zilio, M. Huras, G. Lohman, C. Mohan, F. Ozcan, H. Pirahesh
What are these New-Gen Big Data Applications?

- The world has changed a lot since the 70s
- From automating business processes to AI everywhere
- But databases are still hot

[Timeline figure: database technology evolving from IMS to SQL, XML, noSQL, "notyetSQL", and SQL + json + AP + HTAP + ..; applications evolving from transactions (xsacs) to OLAP, ad hoc BI, streaming, ML/DNN, and ML + streaming + transactions]
And the apps want even more from the database!

- Higher ingest and update rates
- Versioning, time travel
- Ingest and update anywhere, anytime (an "AP" system)
- More real-time analytics (HTAP)
- Tons of analytics

==> the database cannot hold data in a proprietary store

But they still want the traditional database goodies:

- Updates
- Transactions (not eventual consistency)
- Point queries / indexes
- Complex queries (joins, an optimizer, ..)
Example: Health Care
Convergence of Prevention/Monitoring (sensors on healthy people) and Cure (healthcare setting)
- High ingest rates
- Want analytics on the latest readings
- Looking for outliers => cannot drop data, need durability
- AP: cannot wait for the mothership to be reachable
- Lots of point queries
- Complex queries, joins, ..
- Eventual consistency is a pain:
```scala
val v1 = replica1.lookup(k1) // read key k1 from one replica
val v2 = replica2.lookup(k1) // read the same key from another replica
// Under eventual consistency, v1 may find a match while v2 does not.
// If so, how do you write a deterministic test for this app?
```
Wildfire Goals

HTAP: transactions & queries on the same data
- Analytics over the latest transactional data
- Analytics over a 1-sec-old snapshot
- Analytics over a 10-min-old snapshot

Open format
- All data and indexes in Parquet format on shared storage
- No LOAD
- Directly accessible by platforms like Spark (see the sketch below)

Leapfrog transaction speed, with ACID
- Millions of inserts, updates / sec / node
- Multi-statement transactions
- With async quorum replication (sync option)
- Full primary and secondary indexing
- Millions of gets / sec / node

Multi-master and AP
- Disconnected operation
- Snapshot isolation, with versioning and time travel
- Conflict resolution based on timestamp
Challenge: getting all of these simultaneously
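Because the open-format goal puts all data and indexes in Parquet on a shared file system, a plain Spark job can query Wildfire data with no special connector and no LOAD step. A minimal sketch, assuming a hypothetical path layout and column names (the health-care readings from the earlier example); this is not Wildfire's actual directory scheme:

```scala
import org.apache.spark.sql.SparkSession

object ReadOrganizedZone {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("wildfire-open-format-demo")
      .getOrCreate()

    // Plain Parquet read: no proprietary store stands between Spark and the
    // organized zone. Path and schema are assumptions for illustration.
    val readings = spark.read.parquet("/shared/wildfire/organized/readings")

    // Standard SparkSQL analytics over the (slightly stale) optimized snapshot.
    readings.createOrReplaceTempView("readings")
    spark.sql(
      """SELECT deviceId, avg(heartRate)
        |FROM readings
        |GROUP BY deviceId""".stripMargin).show()

    spark.stop()
  }
}
```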
[Architecture figure: applications drive high-volume transactions and analytics, some of which can tolerate slightly stale data while others require the most recent data; each node runs a Wildfire engine beside Spark executors, over local SSD/NVM and a shared file system]
Wildfire architecture

[Data lifecycle figure: over TIME, data flows from the LIVE zone (~1 sec) through groom to the GROOMED zone (~10 mins) and through postgroom to the ORGANIZED zone (PBs of data)]

- Grooming: take consistent snapshots, resolve conflicts
- Postgrooming: make data efficient for queries
- OLTP nodes see the latest data (HTAP, snapshot isolation); analytics nodes (Spark) serve bulk load, lookups, BI, ML, etc. over the 1-sec-old snapshot and the optimized snapshot (10 mins stale)
Live Zone

[Figure: per-xsac logs (uncommitted) feed a common committed log, which is replicated across nodes]

What happens at commit (sketched below):
1. Append the xsac's deltas (Ins/Del/Upd) to the common log; replicated in the background
2. Flush to local SSD
3. Status-check whether the changes are quorum-visible (via heartbeats); can time out

AP: commit does not wait for other nodes; conflicts are resolved after commit (a syncwrite option gives higher durability).

Read monotonicity: queries always read quorum-visible state
- Hence, later queries see a superset of what prior queries saw
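A hedged sketch of the three-step commit path above. Delta, CommitLog, and all method bodies are hypothetical stand-ins, not Wildfire's real interfaces; what it shows is the AP shape of commit, namely that it returns once locally durable and only status-checks quorum visibility rather than blocking on remote nodes:

```scala
import scala.concurrent.duration._

final case class Delta(key: String, values: Seq[Any], kind: String) // Ins/Del/Upd

class CommitLog {
  // 1. Append the xsac's deltas to the common log (returns a log sequence
  //    number); replication to peer nodes proceeds in the background.
  def append(deltas: Seq[Delta]): Long = 0L // stub

  // 2. Force everything up to this LSN onto the local SSD.
  def flushToLocalSsd(upToLsn: Long): Unit = () // stub

  // 3. Heartbeats report how far each replica has persisted the log;
  //    quorum-visible means a quorum of replicas holds the record.
  def quorumVisible(lsn: Long): Boolean = false // stub
}

// AP behavior: commit never waits on other nodes. Conflicts, if any, are
// resolved later, at groom time.
def commit(log: CommitLog, deltas: Seq[Delta], timeout: FiniteDuration): Boolean = {
  val lsn = log.append(deltas)
  log.flushToLocalSsd(lsn)
  val deadline = timeout.fromNow
  while (deadline.hasTimeLeft()) {
    if (log.quorumVisible(lsn)) return true // changes are quorum-visible
    Thread.sleep(10) // wait for the next heartbeat round (illustrative)
  }
  false // timed out: committed locally, quorum visibility not yet confirmed
}
```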
Grooming data (Live -> Groomed zone)

[Figure: per-xsac logs (uncommitted) and the committed log are groomed into data blocks on the shared file system]

- Grooming is when conflicts are resolved
  - Take quorum-visible deltas, form data blocks, and publish them to the shared file system
  - The groomed zone is always a consistent snapshot
- All deltas (insert/delete/update) are upserts: key, (values)*, beginTime
  - beginTime is initialized at commit as (localTime | nodeID)
- No assumption about clock synchronization or speed of replication
  - Yet, we get read monotonicity
  - Idea: groom sets beginTime <- groomTime | localTime | nodeID (sketched below)
- Conflict resolution: versioning, based on beginTime
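Since the beginTime scheme is bit concatenation, a small sketch can make it concrete. The field widths below are assumptions for illustration (the slides do not specify them); the point is that prefixing the groomer's clock imposes one total order across nodes without synchronized clocks, which is what makes read monotonicity and timestamp-based conflict resolution work:

```scala
object BeginTime {
  val NodeBits  = 10 // up to 1024 nodes (assumed width)
  val LocalBits = 22 // low bits of the node-local clock (assumed width)

  // At commit: beginTime = localTime | nodeID (purely node-local).
  def atCommit(localTime: Long, nodeId: Int): Long =
    (localTime << NodeBits) | nodeId

  // At groom: beginTime <- groomTime | localTime | nodeID. The groomer's
  // clock becomes the high-order bits, so groomed versions order
  // consistently across nodes.
  def atGroom(groomTime: Long, commitStamp: Long): Long = {
    val lowMask = (1L << (LocalBits + NodeBits)) - 1
    (groomTime << (LocalBits + NodeBits)) | (commitStamp & lowMask)
  }

  // Conflict resolution: concurrent upserts to the same key become versions
  // ordered by beginTime; the largest stamp is the current version.
  def currentVersion(stamps: Seq[Long]): Long = stamps.max
}
```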
Postgrooming

[Figure: over TIME, groomed blocks (key, vals*, beginTime) are split into LATEST (key, vals*, beginTime, prevRID) and PRIORS (key, vals*, beginTime, endTime, prevRID) partitions, alongside other partitions]

Queries should run fast (BI and point):
- Partition (along multiple dimensions)
- Build primary and secondary indexes
- And deal with an immutable storage system!

Want ready access to the latest version (for the simple readers):
- Separate latest and priors
- Compute endTime and prevRID (sketched below)
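A minimal sketch of the latest/priors split, assuming the record shapes shown in the figure. Because the store is immutable, postgroom writes fresh LATEST and PRIORS blocks rather than updating in place; treating a RID as a position in the output is an assumption for illustration:

```scala
final case class Groomed(key: String, vals: Seq[Any], beginTime: Long)
final case class Latest(key: String, vals: Seq[Any], beginTime: Long, prevRid: Long)
final case class Prior(key: String, vals: Seq[Any], beginTime: Long,
                       endTime: Long, prevRid: Long)

def split(groomed: Seq[Groomed]): (Seq[Latest], Seq[Prior]) = {
  val latest = scala.collection.mutable.ArrayBuffer.empty[Latest]
  val priors = scala.collection.mutable.ArrayBuffer.empty[Prior]
  for ((_, versions) <- groomed.groupBy(_.key)) {
    val sorted = versions.sortBy(_.beginTime) // oldest .. newest
    var prevRid = -1L                         // -1: no older version exists
    for (i <- sorted.indices.dropRight(1)) {
      val v = sorted(i)
      // A version's endTime is the next version's beginTime; prevRID chains
      // back to the next-older version for time-travel readers.
      priors += Prior(v.key, v.vals, v.beginTime, sorted(i + 1).beginTime, prevRid)
      prevRid = priors.size - 1 // RID of the prior just written
    }
    // Newest version goes to LATEST, so simple readers never touch PRIORS.
    val n = sorted.last
    latest += Latest(n.key, n.vals, n.beginTime, prevRid)
  }
  (latest.toSeq, priors.toSeq)
}
```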
OLAP queries via SparkSQL

- Extensions to both the Catalyst optimizer and the Data Source API
- A new Spark context for SQL
- Catalyst optimizer
  - Query HCatalog for table schemas
  - Identify the plan fragment to send to Wildfire
  - Compose a compensation plan (if needed)
- Data Source API
  - SparkSQL logical plan -> Wildfire plan
  - Plan submission to Wildfire & result passing
  - Compensation plan (if needed) executed in SparkSQL (sketched below)
- Paper has details about pushdown analysis
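A hedged sketch of how the pushdown/compensation split can look through Spark's public Data Source API (BaseRelation with PrunedFilteredScan and unhandledFilters are real Spark interfaces). Wildfire's actual integration extends Catalyst itself and pushes whole plan fragments, not just filters; WildfireClient below is a hypothetical stand-in:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources._
import org.apache.spark.sql.types.StructType

class WildfireRelation(val sqlContext: SQLContext, schema0: StructType)
    extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType = schema0

  // Filters returned here are NOT handled by the source, so SparkSQL keeps
  // them in a compensation plan and re-applies them on top of our rows.
  override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot(supported)

  private def supported(f: Filter): Boolean = f match {
    case _: EqualTo | _: GreaterThan | _: LessThan => true // pushable
    case _ => false // e.g. UDF-based predicates stay in Spark
  }

  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] = {
    val pushed = filters.filter(supported)
    // Hypothetical: translate the pushable fragment and submit it to Wildfire.
    WildfireClient.scan(requiredColumns, pushed, sqlContext.sparkContext)
  }
}

object WildfireClient {
  def scan(cols: Array[String], filters: Array[Filter],
           sc: org.apache.spark.SparkContext): RDD[Row] =
    sc.emptyRDD[Row] // placeholder: would stream results back from Wildfire
}
```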
POST-TRUTH

- Big data needs updates, indexes, complex queries, transactions
- AP is the reality
- PB-scale databases will not live in proprietary storage
- It is possible to do ACID with AP
- A DBMS can adopt open data formats and immutable stores, while still being fast

POST-ER-TRUTH

- Multi-shard transactions
- Serializability with AP