Towards Efficient MapReduce Using MPI

Torsten Hoefler¹, Andrew Lumsdaine¹, Jack Dongarra²

¹Open Systems Lab, Indiana University Bloomington
²Dept. of Computer Science, University of Tennessee Knoxville

09/09/09, EuroPVM/MPI 2009, Helsinki, Finland


Motivation

• MapReduce as an emerging programming framework
  - original implementation on COTS clusters
  - other architectures are being explored (Cell, GPUs, …)
  - traditional HPC platforms?
• Can MapReduce work over MPI?
  - yes, but … we want it fast!
• What is MapReduce?
  - similar to functional programming
  - Map = map (std::transform())
  - Reduce = fold (std::accumulate())


MapReduce in Detail

• The user defines two functions
  - map: takes an input key-value pair (k1, v1) and outputs a list of intermediate key-value pairs, list(k2, v2)
  - reduce: takes a key and a list of values, (k2, list(v2)), and outputs a single value per key
• The framework accepts a list of input pairs and outputs the result pairs


Parallelization

• Map and Reduce are pure functions
  - no internal state and no side effects
  - can therefore be applied in arbitrary order!
• Scheduling is done by the framework
  - can schedule map and reduce tasks freely
  - can restart failed map and reduce tasks (fault tolerance)
• No synchronization within a phase
  - only an implicit barrier between Map and Reduce


MapReduce Applications

• Works well for several applications
  - sorting, counting, grep, graph transposition
  - Bellman-Ford and PageRank (iterative MapReduce)
• MapReduce hides complex requirements
  - the user expresses algorithms as Map and Reduce tasks (similar to functional programming)
  - and can ignore: scheduling and synchronization, data distribution, fault tolerance, monitoring


Communication Requirements

• Two computation phases, three communication phases:
  a) read input for map: read N input pairs
  b) build input lists for reduce: order intermediate pairs by key and transfer them to the reduce tasks
  c) output data of reduce: usually negligible
• Two critical phases: a) and b)


All in one view


Parallelism limits

• map is massively parallel (only limited by N)
  - data is usually divided into chunks (e.g., 64 MiB)
  - either read from a shared FS (e.g., GFS, S3, …) or available on the master process
• reduce needs all input values for a specific key
  - tasks can be mapped close to the data
  - the worst case is an irregular all-to-all exchange
• we assume the worst case: input only on the master and keys evenly distributed


An MPI implementation

• Straightforward with point-to-point messages
  - not the focus of this work
• MPI offers mechanisms to optimize:
  1) collective operations: optimized communication schemes
  2) overlapping communication and computation: requires a good MPI library and network


An HPC-centric approach

• Example: word count
  - Map accepts text and produces a vector of strings
  - Reduce accepts a string and a count
• Rank 0 as master, P-1 workers
  - MPI_Scatter() to distribute the input data
  - Map as in standard MapReduce
  - MPI_Reduce() to perform the reduction, with Reduce as a user-defined operation
• HPC-centric, orthogonal to the simple implementation


Reduction in the MPI library

• Built-in or user-defined ops as a reduction operator ⊕
  - ⊕ must be associative (MPI's built-in ops are)
• The number of keys must be known by all processes
  - values can be reduced locally first (cf. combiner), e.g., with MPI_Reduce_local
• Keys must have a fixed size
• ⊕ needs an identity element
  - in case not all processes have values for all keys
• This obviously limits the possible reductions
  - no variable-size reductions!


Optimizations

• Optimized collective implementations
  - hardware optimization, e.g., BG/P
  - communication optimization, e.g., MPICH2, Open MPI
• Computation/communication overlap
  - pipelining with nonblocking collectives (NBC)
  - accepted for the next-generation MPI standard (2.x or 3.0)
  - offered in LibNBC (portable, OFED-optimized)


Synchronization in MapReduce


Performance Results

• MapReduce application simulator
  - Map tasks receive the specified data and simulate computation
  - Reduce performs a reduction over all keys
• System: Odin at Indiana University
  - 128 4-core nodes with 4 GiB memory each
  - InfiniBand interconnect
  - LibNBC (OFED-optimized, threaded)


Static Workload

• Fixed workload: 1 s per packet
• Communication/synchronization overhead reduced by 27%


Dynamic Workload

• Dynamic workload: 1 ms–10 s
• Execution time reduced by 25%


What does MPI need?

• Fault tolerance
  - MPI offers basic inter-communicator FT
  - no support for fault-tolerant collective communication
  - checking whether a collective succeeded is hard
  - collectives might never return (dead-/livelock)
• Variable-size reductions
  - MPI reductions are fixed-size
  - MapReduce needs reductions over growing/shrinking data
  - also useful for higher-level languages like C++, C#, or Python


Conclusions

• We proposed an unconventional way to implement MapReduce
  - efficiently uses collective communication
  - limited by the MPI interface
  - allows efficient use of nonblocking collectives
• The implementation can be chosen based on the properties of Map and Reduce
  - MPI-optimized implementation if possible
  - point-to-point based implementation otherwise


Questions?