Lecture 36: MapReduce Frameworks [Adapted from slides by John - PDF document

Lecture 36: MapReduce Frameworks [Adapted from slides by John DeNero and MapReduce is a framework for batch processing of Big Data: http://research.google.com/archive/mapreduce-osdi04-slides] • Framework: A system used by programmers to build applications • Batch processing: All the data is available at the outset and results aren’t consumed until processing completes • Big Data: A buzzword used to describe datasets so large that they reveal facts about the world via statistical analysis The big ideas that underly MapReduce: • Datasets are too big to be stored or analyzed on one machine. • When using multiple machines, systems issues abound. • Pure functions enable an abstraction barrier between data processing logic and distributed system administration. Last modified: Wed Apr 18 15:59:28 2012 CS61A: Lecture #36 1 Last modified: Wed Apr 18 15:59:28 2012 CS61A: Lecture #36 2 Systems Unix Pipes as a Framework Systems research enables the development of applications by defining (Review) Unix embeds a framework for composing processes: and implementing abstractions: • Each process has a standard input stream (of characters) and a • Operating systems provide a stable, consistent interface to unreli- standard output stream (plus a standard error stream “on the side.”) able, inconsistent hardware • Programming languages provide functions to read and write to these • Networks provide a simple, robust data transfer interface to con- streams, just as for ordinary files. stantly evolving communications infrastructure • In Python, these streams are called sys.stdin , sys.stdout , and • Databases provide a declarative interface to software that stores sys.stderr . and retrieves information efficiently • They are objects whose interface provides read and write opera- • Distributed systems provide a single-entity-level interface to a clus- tions and iterators. ter of multiple machines • The OS allows one to string together ( compose ) sequences of pro- Unifying property of effective systems: grams into a pipeline, enabled by the common stream interface. E.g., (from lecture 9): Hide complexity, but retain flexibility tr -c -s ’[:alpha:]’ ’[\n*]’ < FILE | sort | uniq -c | \ sort -n -r -k 1,1 | sed 20q prints the 20 most frequent words in FILE , with their counts. Last modified: Wed Apr 18 15:59:28 2012 CS61A: Lecture #36 3 Last modified: Wed Apr 18 15:59:28 2012 CS61A: Lecture #36 4 MapReduce Idea Example I: Counting Occurrences A MapReduce job takes a mapper program and a reducer program from Counting occurrences of words in a large collection of documents: a user and applies them to a set of data. • Input to map operation: pairs (name of document, text of document) . • Map phase: Apply a mapper function to inputs, emitting a set of • Output from map operation: pairs (word, 1) (the 1 represents a intermediate key-value pairs. count). • Reduce phase: For each distinct intermediate key, apply a reducer • Input to reduce operation: (word, iterator over all counts for function to accumulate all the values with that key. Return a list of that word) accumulated values for the key. • Output from reduce operation: sum of all counts for one word. • (Could make things more efficient by having the map operation do some counting and return just one count for each distinct word). Last modified: Wed Apr 18 15:59:28 2012 CS61A: Lecture #36 5 Last modified: Wed Apr 18 15:59:28 2012 CS61A: Lecture #36 6

Abstract Execution Model Parallel Execution Execution http://research.google.com/archive/mapreduce-osdi04-slides/index-auto-... Parallel Execution http://research.google.com/archive/mapreduce-osdi04-slides/index-auto-... H om e P rev N ext 7 Home Prev Next 8 Execution Parallel Execution Last modified: Wed Apr 18 15:59:28 2012 CS61A: Lecture #36 7 Last modified: Wed Apr 18 15:59:28 2012 CS61A: Lecture #36 8 Example II: Distributed Grep Example III: Reverse Web-Link Graph • Input to map: Pairs (name of document, text of document). • Input to map: Pairs (source URL, webpage content of URL) • Output from map: pairs (name of document, line matching target • Output from map: Pairs (target URL, source URL) for each hy- pattern) perlink target on the input webpage. • Output from reduce: the list of matching lines from each document. • Output from reduce: (target URL, list of source URLs) . • (Reduce is trivial here; we’re just using map). • The work here is mostly in gathering up and sorting the results of map. Last modified: Wed Apr 18 15:59:28 2012 CS61A: Lecture #36 9 Last modified: Wed Apr 18 15:59:28 2012 CS61A: Lecture #36 10 Inverted Index Scale • Input to map: Pairs (document name, document contents) . Way back in August 2004, MapReduce at Google processed this much data in using MapReduce: • Output from map: Pairs (word from document, document name) Number of jobs 29,423 • Output from reduce: for each word, list of all documents it came Average job completion time 634 secs from. Machine days used 79,186 days Input data read 3,288 TB Intermediate data produced 758 TB Output data written 193 TB Average worker machines per job 157 Average worker deaths per job 1.2 Average map tasks per job 3,351 Average reduce tasks per job 55 Unique map implementations 395 Unique reduce implementations 269 Unique map/reduce combinations 426 Last modified: Wed Apr 18 15:59:28 2012 CS61A: Lecture #36 11 Last modified: Wed Apr 18 15:59:28 2012 CS61A: Lecture #36 12

What the Framework Provides • Fault tolerance: A machine or hard drive might crash. – The MapReduce framework automatically re-runs failed tasks. • Speed: Some machine might be slow because it’s overloaded or fail- ing. – The framework can run multiple copies of a task and keep the result of the one that finishes first. • Network locality: Data transfer is expensive. – The framework tries to schedule map tasks on the machines that hold the data to be processed. • Monitoring: Will my job finish before dinner?!? – The framework provides a web-based interface describing jobs. Last modified: Wed Apr 18 15:59:28 2012 CS61A: Lecture #36 13

Lecture 36: MapReduce Frameworks [Adapted from slides by John - PDF document

Lecture 36: MapReduce Frameworks [Adapted from slides by John DeNero and MapReduce is a framework for batch processing of Big Data: http://research.google.com/archive/mapreduce-osdi04-slides] Framework: A system used by programmers to build

Cutting MapReduce Cost with Spot Market Huan Liu Accenture Technology Labs Why spot market? 2

MapReduce Andrew Crotty Alex Galakatos What is MapReduce? MapReduce is a framework for:

Mrs: MapReduce for Scientific Computing in Python Andrew McNabb, Jeff Lund , and Kevin Seppi

Lecture 16: Overview of MapReduce MapReduce is a parallel, distributed programming model and

Hadoop Map Reduce 1 MapReduce 2-in-1 A programming paradigm A query execution engine A kind

MapReduce 320302 Databases & Web Services (P. Baumann) 1 Why MapReduce? Motivation: Large

MapReduce 340151 Big Data & Cloud Services (P. Baumann) 1 Overview MapReduce : the

COMP9313: Big Data Management MapReduce Data Structure in MapReduce Key-value pairs are the

Laboratory Session: MapReduce Algorithm Design in MapReduce Pietro Michiardi Eurecom Pietro

Large-Scale Data Engineering Frameworks Beyond MapReduce event.cwi.nl/lsde THE HADOOP ECOSYSTEM

732A54 Big Data Analytics Lecture 10: Machine Learning with MapReduce Jose M. Pe na IDA,

RESTORE: REUSING RESULTS OF MAPREDUCE JOBS Junjie Hu 1 Introduction Current practice

Flow Analysis Using MapReduce Strengths and Limitations Markus De Shon Sr. Security Engineer

Design Patterns for Efficient Graph Algorithms in MapReduce Algorithms in MapReduce Jimmy Lin and

Counting Triangles and Modeling MapReduce Siddharth Suri Yahoo! Research Outline 2 Modeling

Data Analytics Dan Ports, CSEP 552 Today MapReduce is it a major step backwards?

Interface Documents David Christian 11/20/17 Interface between CE and DAQ Interface

Principles of Programming Languages

Charlie Garrod Michael Hilton School of Computer Science 15-214 1 Administrivia Homework 5

Computing Resources for ProtoDUNE A. Norman, H. Schellman Software and Computing Questions

Delay and laziness no not ea eager er ... When are expressions

RFNoC: Evolving SDR Toolkits to the FPGA platgorm Martjn Braun 31.1.2016 USRP: A White Box?

Streaming Data And Concurrency In R Rory Winston rory@theresearchkitchen.com About Me

Evolution of an Apache Spark Nick Afshartous Architecture for Processing WB Analytics Game Data

Lecture 36: MapReduce Frameworks [Adapted from slides by John - PDF document

Lecture 36: MapReduce Frameworks [Adapted from slides by John DeNero and MapReduce is a framework for batch processing of Big Data: http://research.google.com/archive/mapreduce-osdi04-slides] Framework: A system used by programmers to build

Cutting MapReduce Cost with Spot Market Huan Liu Accenture Technology Labs Why spot market? 2

MapReduce Andrew Crotty Alex Galakatos What is MapReduce? MapReduce is a framework for:

Mrs: MapReduce for Scientific Computing in Python Andrew McNabb, Jeff Lund , and Kevin Seppi

Lecture 16: Overview of MapReduce MapReduce is a parallel, distributed programming model and

Hadoop Map Reduce 1 MapReduce 2-in-1 A programming paradigm A query execution engine A kind

MapReduce 320302 Databases &amp; Web Services (P. Baumann) 1 Why MapReduce? Motivation: Large

MapReduce 340151 Big Data &amp; Cloud Services (P. Baumann) 1 Overview MapReduce : the

COMP9313: Big Data Management MapReduce Data Structure in MapReduce Key-value pairs are the

Laboratory Session: MapReduce Algorithm Design in MapReduce Pietro Michiardi Eurecom Pietro

Large-Scale Data Engineering Frameworks Beyond MapReduce event.cwi.nl/lsde THE HADOOP ECOSYSTEM

732A54 Big Data Analytics Lecture 10: Machine Learning with MapReduce Jose M. Pe na IDA,

RESTORE: REUSING RESULTS OF MAPREDUCE JOBS Junjie Hu 1 Introduction Current practice

Flow Analysis Using MapReduce Strengths and Limitations Markus De Shon Sr. Security Engineer

Design Patterns for Efficient Graph Algorithms in MapReduce Algorithms in MapReduce Jimmy Lin and

Counting Triangles and Modeling MapReduce Siddharth Suri Yahoo! Research Outline 2 Modeling

Data Analytics Dan Ports, CSEP 552 Today MapReduce is it a major step backwards?

Interface Documents David Christian 11/20/17 Interface between CE and DAQ Interface

Principles of Programming Languages

Charlie Garrod Michael Hilton School of Computer Science 15-214 1 Administrivia Homework 5

Computing Resources for ProtoDUNE A. Norman, H. Schellman Software and Computing Questions

Delay and laziness no not ea eager er ... When are expressions

RFNoC: Evolving SDR Toolkits to the FPGA platgorm Martjn Braun 31.1.2016 USRP: A White Box?

Streaming Data And Concurrency In R Rory Winston rory@theresearchkitchen.com About Me

Evolution of an Apache Spark Nick Afshartous Architecture for Processing WB Analytics Game Data

MapReduce 320302 Databases & Web Services (P. Baumann) 1 Why MapReduce? Motivation: Large

MapReduce 340151 Big Data & Cloud Services (P. Baumann) 1 Overview MapReduce : the