SLIDE 1
MapReduce and Parallel DBMSs: Friends or Foes?
Presented by Guozhang Wang DB Lunch, May 3rd, 2010
SLIDE 2 Papers to Be Covered in This Talk
CACM’10
- MapReduce and Parallel DBMSs: Friends or Foes?
VLDB’09
- HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads
SIGMOD’08 (Pig), VLDB’08 (SCOPE), VLDB’09 (Hive)
SLIDE 3 Outline
Architectural differences between MR and PDBMS (CACM’10)
- Workload differences
- System requirements
- Performance benchmark results
Integrating MR and PDBMS (VLDB’09)
- Pig, SCOPE, Hive
- HadoopDB
Conclusions
SLIDE 4 Workload Differences
Parallel DBMSs were introduced when
- Structured data dominated
- Regular aggregations, joins
- Terabyte scale (today petabytes, ~1,000 nodes)
MapReduce was introduced when
- Unstructured data is common
- Complex text mining, clustering, etc.
- Exabyte scale (~100,000 nodes)
SLIDE 5 System Requirements: From the Order of 1,000 Nodes to 100,000
Finer-granularity runtime fault tolerance
- Mean Time To Failure (MTTF)
- Checkpointing
Heterogeneity support over the cloud
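Why finer-grained fault tolerance becomes mandatory at this scale: a back-of-the-envelope Python sketch (the 3-year per-node MTTF is an assumed illustrative figure, not a number from the talk):

```python
# Assumes independent node failures and a per-node MTTF of 3 years
# (illustrative assumption, not a number from the talk).
HOURS_PER_YEAR = 24 * 365
node_mttf_hours = 3 * HOURS_PER_YEAR

for nodes in (1_000, 100_000):
    cluster_mttf_minutes = node_mttf_hours / nodes * 60
    print(f"{nodes:>7,} nodes: one failure roughly every "
          f"{cluster_mttf_minutes:,.0f} minutes")
# ~1,577 minutes (about a day) at 1,000 nodes; ~16 minutes at 100,000.
```

At 100,000 nodes, restarting a whole query on every failure is untenable, which is why MapReduce recovers at the granularity of individual tasks.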
SLIDE 6 Architectural Differences
Parallel DBMSs | MapReduce
Transactional-level fault tolerance | Checkpointing intermediate results
SLIDE 7 Architectural Differences
Parallel DBMSs | MapReduce
Jobs often need to restart because of failures | Cannot pipeline queries
SLIDE 8 Architectural Differences
Parallel DBMSs | MapReduce
Jobs often need to restart because of failures | Cannot pipeline queries
Hash/range/round-robin partitioning | Runtime scheduling based on blocks
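A minimal sketch of the three partitioning schemes in the PDBMS column (illustrative Python, not any engine's actual code); MapReduce instead decides placement at runtime, scheduling map tasks onto whichever file blocks need processing:

```python
import bisect
from itertools import count

def hash_partition(key, n_nodes):
    # Same key always lands on the same node: enables co-located joins.
    return hash(key) % n_nodes

def range_partition(key, boundaries):
    # boundaries: sorted upper bounds, e.g. ["g", "p"] splits keys into 3 ranges.
    return bisect.bisect_left(boundaries, key)

_next = count()
def round_robin_partition(n_nodes):
    # Ignores the key entirely: perfect load balance, no locality.
    return next(_next) % n_nodes

# Route a few tuples across 3 nodes under each scheme.
for key in ["apple", "kiwi", "zebra"]:
    print(key, hash_partition(key, 3),
          range_partition(key, ["g", "p"]),
          round_robin_partition(3))
```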
SLIDE 9 Architectural Differences
Parallel DBMSs | MapReduce
Jobs often need to restart because of failures | Cannot pipeline queries
Execution time determined by slowest node | Cannot globally optimize execution plans
SLIDE 10 Architectural Differences
Parallel DBMSs | MapReduce
Jobs often need to restart because of failures | Cannot pipeline queries
Execution time determined by slowest node | Cannot globally optimize execution plans
Loading to tables before querying | External distributed file systems
SLIDE 11 Architectural Differences
Parallel DBMSs | MapReduce
Jobs often need to restart because of failures | Cannot pipeline queries
Execution time determined by slowest node | Cannot globally optimize execution plans
Awkward for semi-structured data | Cannot do indexing, compression, etc.
SLIDE 12 Architectural Differences
Parallel DBMSs | MapReduce
Jobs often need to restart because of failures | Cannot pipeline queries
Execution time determined by slowest node | Cannot globally optimize execution plans
Awkward for semi-structured data | Cannot do indexing, compression, etc.
SQL language | Dataflow programming models
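The last row is the programming-model gap. The same aggregation written both ways, as a sketch (the table and column names, visits/url/revenue, are invented for illustration):

```python
# Declarative: the optimizer chooses the plan.
SQL = "SELECT url, SUM(revenue) FROM visits GROUP BY url"

# Dataflow: the programmer supplies the operators.
def map_fn(record):
    # Emit one (url, revenue) pair per input record.
    yield record["url"], record["revenue"]

def reduce_fn(url, revenues):
    # The framework groups map output by key before calling reduce.
    yield url, sum(revenues)
```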
SLIDE 13 Architectural Differences
Parallel DBMSs | MapReduce
Jobs often need to restart because of failures | Cannot pipeline queries
Execution time determined by slowest node | Cannot globally optimize execution plans
Awkward for semi-structured data | Cannot do indexing, compression, etc.
Not suitable for unstructured data analysis | Too low-level, not reusable, not good for joins
SLIDE 14 Last But Not Least ..
Parallel DBMS
- Expensive, no open-source option
MapReduce
- Hadoop
- Attractive for modest budgets and requirements
SLIDE 15 Benchmark Study
Tested Systems:
- Hadoop (MapReduce)
- Vertica (Column-store DBMS)
- DBMS-X (Row-store DBMS)
100-node cluster at Wisconsin
Tasks:
- Original MR Grep Task from the OSDI’04 paper
- Web Log Aggregation
- Table Join with Aggregation
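For scale, the Grep task's user code is tiny; a sketch of what the map/reduce pair might look like (framework plumbing omitted; the OSDI'04 task scans 100-byte records for a rare three-character pattern, and the pattern value here is illustrative):

```python
import re

# Grep task: emit every record containing the search pattern.
PATTERN = re.compile(b"XYZ")  # illustrative stand-in for the rare pattern

def map_fn(offset, record):
    # record: one 100-byte line from the input split.
    if PATTERN.search(record):
        yield record, 1

def reduce_fn(record, counts):
    # Essentially an identity reduce: matches just pass through.
    yield record, sum(counts)
```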
SLIDE 16 Benchmark Results Summary
(results chart: 2X)
SLIDE 17 Benchmark Results Summary
(results chart: 4X)
SLIDE 18 Benchmark Results Summary
(results chart: 36X)
SLIDE 19 Benchmark Results Summary
MR: parsing at runtime; no compression or pipelining, etc.
PDBMS: parsing while loading, compression, query plan optimization
SLIDE 20 Outline
Architectural differences between MR and PDBMS (CACM’10)
- Workload differences
- System requirements
- Performance benchmark results
Integrating MR and PDBMS (VLDB’09)
- Pig, SCOPE, Hive
- HadoopDB
Conclusions
SLIDE 21 We Want Features from Both Sides:
Data Storage
- From MR: semi-structured data loading/parsing
- From DBMS: compression, indexing, etc
Query Execution
- From MR: load balancing, fault-tolerance
- From DBMS: query plan optimization
Query Interface
- From MR: procedural
- From DBMS: declarative
SLIDE 22 Pig
Data Storage: MR
- Runs Pig Latin queries over external files, given user-defined parsing functions
Query Execution: MR
- Compiled to a MapReduce plan and executed on Hadoop
Query Interface: MR+DBMS
- Declarative spirit of SQL + procedural operators
SLIDE 23 SCOPE
Data Storage: DBMS+MR
- Loads into the Cosmos Storage System, which is append-only, distributed, and replicated
Query Execution: MR
- Compiled to a Dryad dataflow plan (DAG), executed by the runtime job manager
Query Interface: DBMS+MR
- Resembles SQL, with embedded C# expressions
SLIDE 24 Hive
Data Storage: DBMS+MR
- One HDFS directory stores one “table”, associated with a built-in serialization format via the Hive Metastore
Query Execution: MR
- Compiled to a DAG of map-reduce jobs, executed on Hadoop
Query Interface: DBMS
- SQL-like declarative HiveQL
SLIDE 25 So Far..
                | Pig (SIGMOD’08)            | SCOPE (VLDB’08) | Hive (VLDB’09)
Query Interface | Procedural, higher than MR | SQL-like + C#   | HiveQL
Data Storage    | External files             | Cosmos Storage  | HDFS w/ Metastore
Query Execution | Hadoop                     | Dryad           | Hadoop
SLIDE 26 HadoopDB
                | Pig (SIGMOD’08)            | SCOPE (VLDB’08) | Hive (VLDB’09)    | HadoopDB (VLDB’09)
Query Interface | Procedural, higher than MR | SQL-like + C#   | HiveQL            | SQL
Data Storage    | External files             | Cosmos Storage  | HDFS w/ Metastore | HDFS + DBMS
Query Execution | Hadoop                     | Dryad           | Hadoop            | Hadoop, as much DBMS as possible
SLIDE 27 Basic Idea
Multiple, independent single-node databases coordinated by Hadoop
SQL queries are first compiled to MapReduce; a subsequence of the map-reduce jobs is then converted back to SQL and pushed into the databases (see the sketch below)
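A toy sketch of that coordination, with sqlite3 standing in for HadoopDB's per-node database instances (the query and table names are invented for illustration):

```python
import sqlite3

# The map task does not scan raw files; it pushes a SQL fragment into
# the node-local database and emits the partial results.
LOCAL_FRAGMENT = "SELECT url, SUM(revenue) FROM visits GROUP BY url"

def map_fn(local_db_path):
    conn = sqlite3.connect(local_db_path)
    for url, partial in conn.execute(LOCAL_FRAGMENT):
        yield url, partial          # shuffled across nodes by Hadoop

def reduce_fn(url, partials):
    # Hadoop merges the per-node partial aggregates into global ones.
    yield url, sum(partials)
```

The win is that scanning, parsing, and local aggregation happen inside the DBMS, while Hadoop still provides the scheduling and fault tolerance across nodes.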
SLIDE 28 Architecture
SLIDE 29 SQL – MR – SQL (SMS)
SLIDE 30 SQL – MR – SQL (SMS)
(query plan figure: data partitioned by Year)
SLIDE 31 SQL – MR – SQL (SMS)
(query plan figures: partitioned by Year vs. not partitioned by Year)
SLIDE 32 Evaluation Setup
Tasks: same as the CACM’10 paper
Amazon EC2 “large” instances
For fault tolerance: terminate a node at 50% completion
For fluctuation tolerance: slow down a node by running an I/O-intensive job
SLIDE 33 Performance: join task
SLIDE 34 Scalability: aggregation task
SLIDE 35 Conclusions
Sacrificing some performance is necessary for fault tolerance and heterogeneity in this setting
MapReduce and Parallel DBMSs complement each other for large-scale analytical workloads
SLIDE 36 Conclusions
Sacrificing some performance is necessary for fault tolerance and heterogeneity in this setting
MapReduce and Parallel DBMSs complement each other for large-scale analytical workloads
Questions?
SLIDE 37 Other MR+DBMS Work
(parts of this slide from Andrew Pavlo)
Commercial MR Integrations
- Vertica
- Greenplum
- AsterData
- Sybase IQ
Research
- MRi (Wisconsin)
- Osprey (MIT)
SLIDE 38 Benchmark Results Summary
MR: record parsing at runtime
PDBMS: records parsed/compressed when loaded
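A minimal Python sketch of this difference, assuming a two-column CSV of (key, numeric value) records; the formats and helper names are illustrative, not what either system actually does internally:

```python
import csv, gzip, pickle

def scan_mr(path):
    # MR style: every query re-parses the raw text.
    with open(path, newline="") as f:
        for row in csv.reader(f):
            yield row[0], float(row[1])

def load_pdbms(src, dst):
    # PDBMS style: parse and compress once, at load time.
    with open(src, newline="") as f, gzip.open(dst, "wb") as out:
        pickle.dump([(r[0], float(r[1])) for r in csv.reader(f)], out)

def scan_pdbms(path):
    # Later queries read the pre-parsed, compressed binary directly.
    with gzip.open(path, "rb") as f:
        yield from pickle.load(f)
```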
SLIDE 39 Benchmark Results Summary
MR: writes intermediate results to disk
PDBMS: pipelining
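In miniature (Python lists and generators standing in for disk materialization and operator pipelines; not either system's actual machinery):

```python
def mr_style(records):
    mapped = [r * 2 for r in records]   # whole stage materialized first
    return sum(mapped)                  # next operator starts afterwards

def pdbms_style(records):
    mapped = (r * 2 for r in records)   # lazy: tuples flow one at a time
    return sum(mapped)                  # map and sum overlap
```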
SLIDE 40 Benchmark Results Summary
MR: cannot handle joins very efficiently
PDBMS: optimization for joins
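The standard MR workaround is a repartition join: tag each row with its source table, shuffle both tables by join key, and pair them in the reducer. Shipping both whole tables through the shuffle is why MR joins are costly. A sketch (the "L"/"R" tags and tuple rows are illustrative):

```python
def map_fn(table_tag, row, key_index):
    # Emit (join_key, (tag, row)) so both tables co-locate by key.
    yield row[key_index], (table_tag, row)

def reduce_fn(key, tagged_rows):
    rows = list(tagged_rows)            # buffer: both sides are needed
    left = [r for tag, r in rows if tag == "L"]
    right = [r for tag, r in rows if tag == "R"]
    for l in left:
        for r in right:
            yield key, l + r            # concatenate matching rows
```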
SLIDE 41 Benchmark Results Summary
Summary:
- MR trades performance for runtime scheduling and checkpointing
- MR trades execution time for reduced load time at the storage layer
SLIDE 42 Architectural Differences
Parallel DBMSs | MapReduce
Transactional-level fault tolerance | Checkpointing intermediate results
Hash/range/round-robin partitioning | Runtime scheduling based on blocks
Loading to tables before querying | External distributed file systems
SQL language | Dataflow programming models