The Myria Big Data Management and Analytics System and Cloud Service



slide-1
SLIDE 1

Jingjing Wang, Tobin Baker, Magdalena Balazinska, Daniel Halperin, Brandon Haynes, Bill Howe, Dylan Hutchison, Shrainik Jain, Ryan Maas, Parmita Mehta, Dominik Moritz, Brandon Myers, Jennifer Ortiz, Dan Suciu, Andrew Whitaker, Shengliang Xu

Department of Computer Science & Engineering, University of Washington

http://myria.cs.washington.edu

The Myria Big Data Management and Analytics System and Cloud Service

slide-2
SLIDE 2

Acknowledgments

The Myria Team!

Our science collaborators!!

  • Andrew Connolly, Tom Quinn, Sarah Loebman, Ariel Rokem, Ginger Armbrust, Yejin Choi

Our sponsors!!!

  • National Science Foundation, Moore & Sloan Foundations, Washington Research Foundation, eScience Institute, ISTC Big Data, Petrobras, EMC, Amazon, and Facebook

2 Magdalena Balazinska - University of Washington

slide-3
SLIDE 3

Big Data

[Diagram labels: Management, Analytics, Efficient, Easy, Science Apps]

slide-4
SLIDE 4

Goals of the Myria stack

  • Advance state-of-the-art in big data systems
  • Focus on efficiency and productivity
  • Test on real applications and support real users

Deliverables:

  • Built a new big data mgmt & analytics system
  • Deployed and operate Myria as a service
  • Source code and demo service:

http://myria.cs.washington.edu


slide-5
SLIDE 5


Myria has been developed and is operated by

  • Database Group in the Computer Science & Engineering Department at UW

  • UW eScience Institute

Co-PIs: Dan Suciu and Bill Howe

slide-6
SLIDE 6

Myria Demo

slide-7
SLIDE 7

Myria Cloud Service


Service available through project website

slide-8
SLIDE 8

Analysis in the Browser with Myria


Declarative-imperative analysis with MyriaL and Python

slide-9
SLIDE 9

Myria Operates Directly on Data in S3

For efficient processing, Myria caches query results internally in the cluster

slide-10
SLIDE 10

MyriaL is Imperative+Declarative with Iterations


slide-11
SLIDE 11

Myria Provides Details of Query Execution

slide-12
SLIDE 12

Myria Service includes Jupyter Notebook


Jupyter notebook available directly with Myria service

slide-13
SLIDE 13

Myria Supports Python User-Defined Functions

Data from the Human Connectome project

MRI data analysis

Python UDFs enable running legacy code and complex analytics beyond SQL/MyriaL

slide-14
SLIDE 14

Users Can Deploy Own Service

pip install myria-cluster

myria-cluster create [OPTIONS] CLUSTER_NAME
myria-cluster stop/start/destroy […]

slide-15
SLIDE 15

Example Myria Applications

  • Neuroscience
  • Astronomy
  • Natural Language Processing
  • Oceanography
  • Bibliometrics

Image credits: picture from Leila Zilles; MyMergerTree screenshot; data from the Human Connectome project.

[Oceanography figure labels: RED fluorescence, FSC; populations: Picoplankton, Nanoplankton, IS, Ultraplankton, Prochlorococcus]

slide-16
SLIDE 16

Myria Internals

slide-17
SLIDE 17

Myria Polystore Stack

[Stack diagram, top to bottom]

  • Clients: Browser, Python/Jupyter, Specialized Services (MyMergerTree)
  • RACO: Query Translation, Optimization, and Orchestration
  • Back ends: MyriaX (Parallel, Iterative, and Elastic Query Execution), MPI, SciDB, Graphs, NoSQL

slide-18
SLIDE 18

Myria’s Data Model and Query Interface

  • Relational Algebra Compiler (RACO)
    – Myria’s query optimizer and federator
  • RACO core: relational algebra extended with
    – Iterations for multi-pass algorithms
    – Flatmap to explode non-1NF attribute values into many tuples
    – Stateful apply for windowed and neighborhood functions
  • Query language: MyriaL (Imperative+Declarative)
    – Each statement is declarative (SQL, comprehensions, function calls)
    – Statements are combined with imperative constructs
        • Variable assignment
        • Iteration
  • Python UDFs/UDAs
    – Minimize barriers to adoption and run legacy code
  • Python API
    – Fluent API with Python lambda functions
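The two less familiar RACO extensions, flatmap and stateful apply, can be illustrated in plain Python. This is only a sketch of the operator semantics named on the slide, not Myria or RACO code: the function names `flatmap`, `stateful_apply`, and `window_mean`, and the sample data, are hypothetical.

```python
from typing import Callable, Iterable, Iterator


def flatmap(rows: Iterable[tuple],
            explode: Callable[[tuple], Iterable[tuple]]) -> Iterator[tuple]:
    """Emit zero or more output tuples for every input tuple."""
    for row in rows:
        yield from explode(row)


def stateful_apply(rows: Iterable[tuple], init, step) -> Iterator[tuple]:
    """Call step(state, row) -> (state, value) in order, carrying state along."""
    state = init
    for row in rows:
        state, value = step(state, row)
        yield row + (value,)


# Non-1NF input: the second attribute holds a list of readings.
nested = [("a", [1, 2, 3]), ("b", [10])]
flat = list(flatmap(nested, lambda r: ((r[0], v) for v in r[1])))


def window_mean(prev, row):
    """Windowed function: mean of the current and the previous reading."""
    value = row[1]
    mean = value if prev is None else (prev + value) / 2
    return value, mean


smoothed = list(stateful_apply(flat, None, window_mean))
print(flat)
print(smoothed)
```

In this sketch the window runs over the whole stream in arrival order, so it crosses key boundaries; a per-key window would keep one state entry per key instead.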

slide-19
SLIDE 19

Polystore Optimization

  • Rule-based optimization with three types of rules
    – Optimize logical Myria algebra plans
    – Translate logical plans into back-end specific physical plans
    – Optimize back-end specific physical plans
  • To add a new back-end, developer must specify
    – Tree representation of query language
    – Rules that translate Myria algebra into this representation
    – Administrative functions including one to submit queries
  • Data model independence
    – Myria hides the existence of various back-ends
    – Users write MyriaL scripts assuming relational model
    – Back-ends include select array, graph, and key-value systems
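The three rule types can be sketched as three passes over a plan tree. The following is a toy illustration under stated assumptions, not RACO's actual rule set: plans are nested tuples, and the operator names (`select`, `scan`, `sql_filter`, `sql_scan`, `sql_query`) and the SQL-like back end are hypothetical.

```python
def rewrite(plan, rules):
    """Apply rules bottom-up until no rule changes the plan."""
    if not isinstance(plan, tuple):
        return plan  # leaf value such as an operator name or a predicate
    plan = tuple(rewrite(child, rules) for child in plan)
    for rule in rules:
        new_plan = rule(plan)
        if new_plan != plan:
            return rewrite(new_plan, rules)
    return plan


def drop_true_select(plan):
    """Logical rule: a selection on a constant-true predicate is a no-op."""
    if plan[0] == "select" and plan[1] == "true":
        return plan[2]
    return plan


def to_sql_backend(plan):
    """Translation rule: map logical operators to back-end operators."""
    if plan[0] == "scan":
        return ("sql_scan", plan[1])
    if plan[0] == "select":
        return ("sql_filter", plan[1], plan[2])
    return plan


def fuse_filter_scan(plan):
    """Physical rule: push a filter into the scan as a single query."""
    if plan[0] == "sql_filter" and plan[2][0] == "sql_scan":
        return ("sql_query", f"SELECT * FROM {plan[2][1]} WHERE {plan[1]}")
    return plan


def optimize(plan):
    """Run the three rule sets in the order listed on the slide."""
    for rules in ([drop_true_select], [to_sql_backend], [fuse_filter_scan]):
        plan = rewrite(plan, rules)
    return plan


logical = ("select", "true", ("select", "x > 3", ("scan", "points")))
physical = optimize(logical)
print(physical)
```

Under this layout, adding a back end would amount to supplying another translation rule set and, optionally, back-end specific physical rules, which mirrors the requirements listed above.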

slide-20
SLIDE 20

Federated Query Execution

Federated plans require fast data movement

Source DBMS (Worker 1 … Worker n), script submitted by the user:

    t = scan(data)
    x = distances(t,t)
    export(x,'db://Target')

Target DBMS (Worker 1 … Worker m), script submitted by the user or optimizer:

    x = import('db://Source')
    u = cluster(x)

Worker Directory:

    source.w1 → target.wm
    source.wn → target.w1

[Diagram step markers: [1] [2] [3] [4]]

slide-21
SLIDE 21

Data Movement with PipeGen

PipeGen: Data Pipe Generator for Hybrid Analytics. Brandon Haynes, Alvin Cheung, and Magdalena Balazinska. SOCC 2016.

[Diagram: DBMS bytecode and unit tests go into PipeGen, which produces a PipeGen-enabled DBMS (DBMS with optimized data pipe)]

PipeGen components:

  • IORedirect: I/O Redirector
  • FormOpt: Format Optimizer
  • PipeVerify: Verification

Steps shown in the diagram: Instrument Unit Tests, Identify File Open Expressions, Inject Conditional Redirection, Instrument Unit Tests, Data Flow Analysis, Type Substitution

Other diagram labels: Data Pipe Type, Augmented Types
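The idea behind "Inject Conditional Redirection" can be sketched in Python: a file-open call is wrapped so that a reserved name is routed to a data pipe instead of the file system. This is a minimal sketch and not PipeGen's implementation, which works on DBMS bytecode. The `Pipe` class, the `PIPES` registry, and the helper names are hypothetical, and treating the `db://` prefix (borrowed from the federated query example) as the reserved-name test is an assumption.

```python
class Pipe:
    """In-memory stand-in for a network data pipe between two systems."""

    def __init__(self):
        self.lines = []

    def write(self, text):
        self.lines.append(text)

    def __iter__(self):
        return iter(self.lines)

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        return False  # closing an endpoint keeps buffered data available


# Hypothetical registry of pipes, keyed by reserved name.
PIPES = {}


def open_or_pipe(path, mode="r"):
    """Open a regular file unless the path uses the reserved pipe scheme."""
    if path.startswith("db://"):
        # Conditional redirection: return a pipe instead of a file handle.
        return PIPES.setdefault(path, Pipe())
    return open(path, mode)


def export_csv(rows, path):
    """Unmodified export logic: it only sees a writable, file-like object."""
    with open_or_pipe(path, "w") as out:
        for row in rows:
            out.write(",".join(str(v) for v in row) + "\n")


def import_csv(path):
    """Unmodified import logic: it only sees an iterable of text lines."""
    with open_or_pipe(path) as src:
        return [tuple(line.rstrip("\n").split(",")) for line in src]


export_csv([(1, 2.5), (2, 3.5)], "db://Target")
received = import_csv("db://Target")
print(received)
```

The export and import code paths are unchanged apart from the open call, which is the property that makes this kind of redirection attractive for existing systems. The values still travel as text here; replacing that text encoding is the concern of the format optimizer step.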

slide-22
SLIDE 22

PipeGen’s Performance

16-node cluster with 16 workers/tasks

Transfer 10^9 tuples with 4 ints and 3 doubles
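For a sense of scale, the raw binary size of this workload can be estimated. The slide does not state the field widths, so the calculation below assumes 4-byte integers and 8-byte doubles and ignores any per-tuple or protocol overhead.

```python
TUPLES = 10**9
WORKERS = 16
INT_BYTES = 4      # assumption: 32-bit integers
DOUBLE_BYTES = 8   # assumption: 64-bit IEEE 754 doubles

bytes_per_tuple = 4 * INT_BYTES + 3 * DOUBLE_BYTES
total_bytes = TUPLES * bytes_per_tuple
per_worker_bytes = total_bytes // WORKERS

print(bytes_per_tuple)           # 40 bytes per tuple
print(total_bytes / 10**9)       # 40.0 GB in total (decimal gigabytes)
print(per_worker_bytes / 10**9)  # 2.5 GB per worker
```

A text encoding such as CSV would typically be larger than this binary lower bound.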

slide-23
SLIDE 23

Myria Polystore Stack

[Stack diagram, top to bottom]

  • Clients: Browser, Python/Jupyter, Specialized Services (MyMergerTree)
  • RACO: Query Translation, Optimization, and Orchestration
  • Back ends: MyriaX (Parallel, Iterative, and Elastic Query Execution), MPI, SciDB, Graphs, NoSQL

slide-24
SLIDE 24

MyriaX Engine and Cloud Deployment

[Deployment diagram labels: JSON query plans & API calls; Coordinator with REST Interface; Workers, each in a YARN container with a local RDBMS; Amazon EC2 instances; HDFS; Amazon EBS volumes and/or local storage; Amazon S3]

slide-25
SLIDE 25

MyriaX Overview

  • Data storage
    – Read data from S3, HDFS, local files
    – Parse CSV, TSV, and various scientific file formats
    – Store data in local relational DBMS instances
        • Fast storage with physical tuning (indexing, hash-partitioning)
  • Query execution
    – Fundamentally a parallel DBMS
        • Fast, pipelined query execution
    – But scheduling more flexible to support elasticity
    – Novel features: Multiway joins and iterations
  • Resource management
    – Executes on top of the YARN resource manager
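Hash-partitioning, mentioned above as one of the physical tuning options, can be sketched as follows. This illustrates the general technique rather than MyriaX's partitioning code; the function names and the choice of CRC32 as the hash are assumptions.

```python
import zlib


def worker_for(key, num_workers):
    """Map a key to a worker using a hash that is stable across processes."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_workers


def hash_partition(rows, key_index, num_workers):
    """Split rows so that all rows sharing a key land on the same worker."""
    partitions = [[] for _ in range(num_workers)]
    for row in rows:
        partitions[worker_for(row[key_index], num_workers)].append(row)
    return partitions


rows = [(1, "a"), (2, "b"), (1, "c"), (3, "d")]
parts = hash_partition(rows, key_index=0, num_workers=4)
for worker_id, part in enumerate(parts):
    print(worker_id, part)
```

When two relations are partitioned on their join key with the same function, matching tuples are already on the same worker, so an equijoin on that key needs no further data shuffling.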

slide-26
SLIDE 26

Efficient Iterative Processing

  • User specifies query declaratively
    – Subset of Datalog with aggregation
  • Generate efficient, shared-nothing query plan
    – Small extensions to existing shared-nothing systems
  • Plan amenable to runtime optimizations
    – Synchronous vs asynchronous
    – Different processing priorities
  • Optimizations significantly affect performance

Asynchronous and Fault-Tolerant Recursive Datalog Evaluation in Shared-Nothing Engines. Jingjing Wang, Magdalena Balazinska, and Daniel Halperin. PVLDB 8(12): 1542-1553 (2015).

slide-27
SLIDE 27

Myria’s Optimized Iterations Example

Declarative Query

    E = scan(jwang:cc:graph);
    V = select distinct E.$0 from E;
    do
      CC := [$0, MIN($1)] <-
        [from V emit V.$0 as x, V.$0 as y] +
        [from E, CC where E.$0 = CC.$0 emit E.$1, CC.$1];
    until convergence;
    store(CC, CC);

    // Can have multiple relations with recursive dep.

Compiled to a Distributed Query Plan

[Plan diagram operators: IDBController(CC), Scan(Edges), Join, Scan(Edges)]
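The semantics of the connected-components query can be mirrored in a few lines of single-machine Python. This sketch follows the MyriaL statement step by step using naive, synchronous evaluation; it does not reflect Myria's distributed or asynchronous execution, and the function name and sample graph are hypothetical.

```python
def connected_components(edges):
    """Label each vertex with the smallest vertex id that reaches it."""
    # V = select distinct E.$0 from E
    vertices = {src for src, _ in edges}
    # Base case: every vertex starts labeled with its own id.
    cc = {v: v for v in vertices}
    while True:
        # Recompute CC as the MIN over base facts and edge-derived facts.
        new_cc = {v: v for v in vertices}
        for src, dst in edges:
            if src in cc:
                # from E, CC where E.$0 = CC.$0 emit E.$1, CC.$1
                new_cc[dst] = min(new_cc.get(dst, cc[src]), cc[src])
        if new_cc == cc:  # until convergence
            return cc
        cc = new_cc


# Labels only flow from E.$0 to E.$1, so an undirected graph
# is given with both directions of every edge.
edges = [(1, 2), (2, 1), (2, 3), (3, 2), (4, 5), (5, 4)]
labels = connected_components(edges)
print(labels)
```

Each pass corresponds to one synchronous iteration. Labels can only decrease and are bounded below, so the loop terminates.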

slide-28
SLIDE 28

Performance Comparison with Spark

Declarative Query (subset of Datalog with agg.), Shared-Nothing Query Plan, In-Memory Processing

Runtime options: Synchronous, Asynchronous, Prioritize New Data, Prioritize Base Data

[Chart: Query Time (Seconds) by # of Workers (8, 16, 32, 64); series: Spark (GraphX), Myria Sync, Myria Async]

Connected Components – Twitter subgraph, 221 million edges and 5 million vertices

slide-29
SLIDE 29

Myria Polystore Stack

[Stack diagram, top to bottom]

  • Clients: Browser, Python/Jupyter, Specialized Services (MyMergerTree)
  • RACO: Query Translation, Optimization, and Orchestration
  • Back ends: MyriaX (Parallel, Iterative, and Elastic Query Execution), MPI, SciDB, Graphs, NoSQL

slide-30
SLIDE 30


Cloud Operation in Myria

Or point to data in Amazon S3

slide-31
SLIDE 31

Myria’s Personalized Service Level Agreements

Changing the Face of Database Cloud Services with Personalized Service Level Agreements. Jennifer Ortiz, Victor T. Almeida, and Magdalena Balazinska. CIDR 2015.

Myria’s SLA generation

[Diagram labels: Schema, Workload Generation, Runtime Prediction, Workload Compression into PSLA (Query Clustering, Template Generation, Cross-Tier Pruning), PSLA]

slide-32
SLIDE 32

Myria’s PerfEnforce Subsystem

PerfEnforce Demonstration: Data Analytics with Performance Guarantees. Jennifer Ortiz, Brendan Lee, and Magdalena Balazinska. SIGMOD 2016.

slide-33
SLIDE 33

Myria’s PerfEnforce Subsystem

Cluster size changes during query session

slide-34
SLIDE 34

Myria’s Innovations Summary

  • Areas: Myria Polystore, Efficient Processing & Complex Analytics with MyriaX, Myria Cloud Operation
  • Innovations: Federated Analytics, Automatic Data Pipes, Image Processing, Elastic Memory, Efficient Multi-Join, Iterative Queries, Data Summaries, Perf. Debugging, Cloud PSLAs, Performance Guarantees

slide-35
SLIDE 35

Conclusion

  • Highly expressive
    – MyriaL (RA+iterations) & Python
  • Polystore with hybrid analytics
  • High performance on a variety of queries
  • Available as a service
    – Focus on low barrier to entry
    – And turning users into self-sufficient experts
    – Also focus on the service provider: Operate Myria
  • Source code and more info (includes videos): http://myria.cs.washington.edu/