MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
Presented by: Chaochao Yan, 04/25/2018
MapReduce
A programming model and an associated implementation for processing and generating large data sets.
Motivation: Large Scale Data Processing
❏ Want to process lots of data (> 1 PB)
❏ Want to use hundreds or thousands of CPUs
❏ Want to make this easy
MapReduce
❏ Automatic parallelization & distribution
❏ Fault-tolerant
❏ Provides status and monitoring tools
❏ Clean abstraction for programmers
Programming model
❏ Input & Output: each a set of key/value pairs
❏ Follows a divide-and-conquer pattern
❏ Programmer specifies two functions:
map(in_key, in_value) -> list(out_key, intermediate_value)
reduce(out_key, list(intermediate_value)) -> list(out_value)
WordCount Pseudo-code
map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
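The pseudocode above can be made concrete with a short, single-process Python sketch. The function names (`map_fn`, `reduce_fn`, `run_wordcount`) are illustrative, not from the paper; the shuffle phase is simulated in-memory with a dictionary of lists:

```python
from collections import defaultdict

def map_fn(document_name, document_contents):
    # Emit an intermediate ("word", "1") pair for every word.
    for word in document_contents.split():
        yield (word, "1")

def reduce_fn(word, intermediate_values):
    # Sum the counts emitted for this word.
    return str(sum(int(v) for v in intermediate_values))

def run_wordcount(documents):
    # Shuffle phase: group intermediate values by key.
    groups = defaultdict(list)
    for name, contents in documents.items():
        for key, value in map_fn(name, contents):
            groups[key].append(value)
    # Reduce phase: one call per distinct output key.
    return {word: reduce_fn(word, values) for word, values in groups.items()}

docs = {"doc1": "see spot run", "doc2": "run spot run"}
print(run_wordcount(docs))  # {'see': '1', 'spot': '2', 'run': '3'}
```

In the real system the grouping step is distributed: mappers partition their output by key, and each reducer fetches and merges its partition from every mapper.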
Illustrated WordCount
Picture from http://ranger.uta.edu/~sjiang/CSE6350-spring-18/lecture-8.pdf
Example from the figure: the intermediate values for “see” are grouped as [“1”, “1”].
Distributed and Parallel Computing
❏ map() invocations run in parallel across distributed machines, producing intermediate values from different partitions of the input data
❏ reduce() invocations also run in parallel, each working on a different set of output keys
❏ All values are processed independently
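Reducers can work independently because every mapper sends a given key to the same reduce task. A minimal sketch of the default partitioning scheme described in the paper, hash(key) mod R (the value of `R` here is illustrative; a stable hash keeps the assignment reproducible across processes):

```python
import hashlib

R = 4  # number of reduce tasks (illustrative)

def partition(key, num_reducers=R):
    # Default partitioning function: hash(key) mod R.
    # md5 is used only as a stable hash, not for security.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_reducers

# Every mapper computes the same partition for a given key:
for key in ["see", "spot", "run"]:
    print(key, "->", partition(key))
```

Users can also supply a custom partitioning function, e.g. to keep all URLs from one host in the same output file.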
Implementation Overview
❏ 100s/1000s of 2-CPU x86 machines with 2-4 GB of memory
❏ Commodity networking hardware
❏ Storage on local IDE disks
❏ GFS: a distributed file system manages the data
❏ Job scheduling system: jobs are made up of tasks; the scheduler assigns tasks to machines
High-level MapReduce Pipeline
Picture from http://mapreduce-tutorial.blogspot.com/2011/04/mapreduce-data-flow.html
High-level MapReduce Pipeline
Picture from https://research.google.com/archive/mapreduce-osdi04-slides/index-auto-0008.html
Question 1
Use Figure 1 to explain a MR program’s execution.
Picture 1 from Google MapReduce Paper, OSDI04
Question 2
Describe how MR handles worker and master failures
Fault Tolerance
❏ Detect failure via periodic heartbeats
❏ Worker Failure
❏ In-progress map and reduce tasks are rescheduled on other workers
❏ Completed map tasks are also re-executed (their output resides on the failed worker's local disk)
❏ Completed reduce tasks need not be re-executed (their output is stored on GFS)
❏ Master Failure
❏ The computation is aborted; clients can check for this condition and retry
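The worker-failure rules above can be sketched as the master's bookkeeping step. The task data model (dicts with `kind`, `state`, `worker`) is an illustrative assumption, not the paper's actual structures:

```python
def handle_worker_failure(tasks, failed_worker):
    # Reset tasks when a worker misses its heartbeats, following
    # the paper's rules for map vs. reduce tasks.
    for task in tasks:
        if task["worker"] != failed_worker:
            continue
        if task["kind"] == "map":
            # Map output lives on the failed worker's local disk,
            # so even completed map tasks must be redone.
            task["state"] = "idle"
            task["worker"] = None
        elif task["state"] == "in_progress":
            # Completed reduce output is already on GFS; only
            # in-progress reduce tasks are rescheduled.
            task["state"] = "idle"
            task["worker"] = None
    return tasks
```

Idle tasks are then eligible for scheduling on healthy workers, exactly as in the initial assignment.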
Question 3
Compared with traditional parallel programming models, such as multithreading and MPI, what are the major advantages of MapReduce?
Answer: ease of use, scalability, and reliability.
Comparison with Traditional Models
Picture from http://ranger.uta.edu/~sjiang/CSE6350-spring-18/lecture-8.pdf
Locality
❏ The master divides up tasks based on the location of the data: it tries to schedule a map() task on the same machine as the physical data, or as “near” to it as possible.
❏ Map task inputs are divided into 16-64 MB blocks; the Google File System chunk size is 64 MB.
Task Granularity And Pipelining
Fine-granularity tasks: many more map tasks than machines
❏ Minimizes time for fault recovery
❏ Can pipeline shuffling with map execution
❏ Better dynamic load balancing
Task Granularity And Pipelining
Picture from https://research.google.com/archive/mapreduce-osdi04-slides/index-auto-0009.html
Question 4
The implementation of MapReduce enforces a barrier between the Map and Reduce phases, i.e., no reducers can proceed until all mappers have completed their assigned workload. For higher efficiency, is it possible for a reducer to start its execution earlier, and why? (clue: think of availability of inputs to reducers)
Backup Tasks
Slow workers significantly delay completion time
❏ Other jobs consuming resources on the machine
❏ Bad disks with soft errors transfer data slowly
❏ Weird things: processor caches disabled (!!)
Solution: near the end of a phase, schedule backup tasks
❏ Whichever copy finishes first “wins”
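The "first copy wins" idea can be sketched with Python threads (the function names are illustrative; the real system runs the copies on separate machines and the master simply marks the task complete when either finishes):

```python
import concurrent.futures

def run_with_backup(task, n_copies=2):
    # Launch duplicate copies of the same task and accept whichever
    # copy finishes first; the slower copy is simply abandoned.
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=n_copies)
    futures = [pool.submit(task) for _ in range(n_copies)]
    done, _ = concurrent.futures.wait(
        futures, return_when=concurrent.futures.FIRST_COMPLETED)
    result = next(iter(done)).result()
    pool.shutdown(wait=False)  # do not block on the straggler
    return result

print(run_with_backup(lambda: "done"))  # prints "done"
```

This only pays off when duplicates are cheap relative to the phase, which is why the paper schedules backups only near the end of a phase; the measured cost was a few percent extra resources for a large reduction in completion time.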
Sort Performance
❏ 10^10 100-byte records (1 TB of data, 1,800 nodes)
Refinement
❏ Sorting guarantees
❏ within each reduce partition
❏ Combiner
❏ Performs a partial reduce on each mapper's output before it is sent over the network
❏ Useful for saving network bandwidth
❏ User-defined counters
❏ Useful for debugging
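The combiner refinement works whenever the reduce function is commutative and associative, as word count's sum is. A minimal sketch of the bandwidth saving (`map_with_combiner` is an illustrative name; it applies the reduce logic locally to one mapper's output):

```python
from collections import Counter

def map_with_combiner(document_contents):
    # Collapse repeated keys on the mapper before shipping them:
    # ("run", "1") emitted three times becomes one ("run", "3") pair.
    return {word: str(count)
            for word, count in Counter(document_contents.split()).items()}

text = "run spot run run spot"
pairs_without = len(text.split())            # 5 pairs cross the network
pairs_with = len(map_with_combiner(text))    # 2 pairs cross the network
print(pairs_without, pairs_with)  # prints "5 2"
```

The reducer's output is unchanged; only the number of intermediate pairs shipped across the network shrinks, which matters most for skewed keys like common English words.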