MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
Presented by: Chaochao Yan, 04/25/2018
MapReduce
A programming model and an associated implementation for processing and generating large data sets.
Motivation: Large Scale Data Processing
❏ Want to process lots of data (> 1 PB)
❏ Want to use hundreds or thousands of CPUs
❏ Want to make this easy
MapReduce
❏ Automatic parallelization & distribution
❏ Fault-tolerant
❏ Provides status and monitoring tools
❏ Clean abstraction for programmers
Programming model
❏ Input & Output: each a set of key/value pairs
❏ Follows a divide-and-conquer pattern
❏ Programmer specifies two functions:
map(in_key, in_value) -> list(out_key, intermediate_value)
reduce(out_key, list(intermediate_value)) -> list(out_value)
WordCount Pseudo-code
map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
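The pseudocode above can be made concrete with a short, single-process Python sketch. The function names (`map_fn`, `reduce_fn`, `run_wordcount`) are illustrative, not from the paper; the shuffle phase is simulated in-memory with a dictionary of lists:

```python
from collections import defaultdict

def map_fn(document_name, document_contents):
    # Emit an intermediate ("word", "1") pair for every word.
    for word in document_contents.split():
        yield (word, "1")

def reduce_fn(word, intermediate_values):
    # Sum the counts emitted for this word.
    return str(sum(int(v) for v in intermediate_values))

def run_wordcount(documents):
    # Shuffle phase: group intermediate values by key.
    groups = defaultdict(list)
    for name, contents in documents.items():
        for key, value in map_fn(name, contents):
            groups[key].append(value)
    # Reduce phase: one call per distinct output key.
    return {word: reduce_fn(word, values) for word, values in groups.items()}

docs = {"doc1": "see spot run", "doc2": "run spot run"}
print(run_wordcount(docs))  # {'see': '1', 'spot': '2', 'run': '3'}
```

In the real system the grouping step is distributed: mappers partition their output by key, and each reducer fetches and merges its partition from every mapper.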
Illustrated WordCount
Picture from http://ranger.uta.edu/~sjiang/CSE6350-spring-18/lecture-8.pdf
Example from the figure: the intermediate values for “see” are grouped as [“1”, “1”].
Distributed and Parallel Computing
❏ map() invocations run in parallel across distributed machines, producing intermediate values from different partitions of the input data
❏ reduce() invocations also run in parallel, each working on a different set of output keys
❏ All values are processed independently
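Reducers can work independently because every mapper sends a given key to the same reduce task. A minimal sketch of the default partitioning scheme described in the paper, hash(key) mod R (the value of `R` here is illustrative; a stable hash keeps the assignment reproducible across processes):

```python
import hashlib

R = 4  # number of reduce tasks (illustrative)

def partition(key, num_reducers=R):
    # Default partitioning function: hash(key) mod R.
    # md5 is used only as a stable hash, not for security.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_reducers

# Every mapper computes the same partition for a given key:
for key in ["see", "spot", "run"]:
    print(key, "->", partition(key))
```

Users can also supply a custom partitioning function, e.g. to keep all URLs from one host in the same output file.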
Implementation Overview
❏ 100s/1000s of 2-CPU x86 machines with 2-4 GB of memory
❏ Commodity networking hardware
❏ Storage on local IDE disks
❏ GFS: a distributed file system manages the data
❏ Job scheduling system: jobs are made up of tasks; the scheduler assigns tasks to machines
High-level MapReduce Pipeline
Picture from http://mapreduce-tutorial.blogspot.com/2011/04/mapreduce-data-flow.html
High-level MapReduce Pipeline
Picture from https://research.google.com/archive/mapreduce-osdi04-slides/index-auto-0008.html
Question 1
Use Figure 1 to explain a MR program’s execution.
Picture 1 from Google MapReduce Paper, OSDI04
Question 2
Describe how MR handles worker and master failures
Fault Tolerance
❏ Detect failure via periodic heartbeats
❏ Worker Failure
❏ In-progress map and reduce tasks are rescheduled on other workers
❏ Completed map tasks are also re-executed (their output resides on the failed worker's local disk)
❏ Completed reduce tasks need not be re-executed (their output is stored on GFS)
❏ Master Failure
❏ The computation is aborted; clients can check for this condition and retry
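The worker-failure rules above can be sketched as the master's bookkeeping step. The task data model (dicts with `kind`, `state`, `worker`) is an illustrative assumption, not the paper's actual structures:

```python
def handle_worker_failure(tasks, failed_worker):
    # Reset tasks when a worker misses its heartbeats, following
    # the paper's rules for map vs. reduce tasks.
    for task in tasks:
        if task["worker"] != failed_worker:
            continue
        if task["kind"] == "map":
            # Map output lives on the failed worker's local disk,
            # so even completed map tasks must be redone.
            task["state"] = "idle"
            task["worker"] = None
        elif task["state"] == "in_progress":
            # Completed reduce output is already on GFS; only
            # in-progress reduce tasks are rescheduled.
            task["state"] = "idle"
            task["worker"] = None
    return tasks
```

Idle tasks are then eligible for scheduling on healthy workers, exactly as in the initial assignment.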
Question 3
Compared with traditional parallel programming models, such as multithreading and MPI, what are the major advantages of MapReduce?
Answer: ease of use, scalability, and reliability.
Comparison with Traditional Models
Picture from http://ranger.uta.edu/~sjiang/CSE6350-spring-18/lecture-8.pdf
Locality
❏ The master divides up tasks based on the location of the data: it tries to schedule a map() task on the same machine as the physical data, or as “near” to it as possible.
❏ Map task inputs are divided into 16-64 MB blocks; the Google File System chunk size is 64 MB.
Task Granularity And Pipelining
Fine-granularity tasks: many more map tasks than machines
❏ Minimizes time for fault recovery
❏ Can pipeline shuffling with map execution
❏ Better dynamic load balancing
Task Granularity And Pipelining
Picture from https://research.google.com/archive/mapreduce-osdi04-slides/index-auto-0009.html
Question 4
The implementation of MapReduce enforces a barrier between the Map and Reduce phases, i.e., no reducers can proceed until all mappers have completed their assigned workload. For higher efficiency, is it possible for a reducer to start its execution earlier, and why? (clue: think of availability of inputs to reducers)
Backup Tasks
Slow workers significantly delay completion time
❏ Other jobs consuming resources on the machine
❏ Bad disks with soft errors transfer data slowly
❏ Weird things: processor caches disabled (!!)
Solution: near the end of a phase, schedule backup tasks
❏ Whichever copy finishes first “wins”
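The "first copy wins" idea can be sketched with Python threads (the function names are illustrative; the real system runs the copies on separate machines and the master simply marks the task complete when either finishes):

```python
import concurrent.futures

def run_with_backup(task, n_copies=2):
    # Launch duplicate copies of the same task and accept whichever
    # copy finishes first; the slower copy is simply abandoned.
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=n_copies)
    futures = [pool.submit(task) for _ in range(n_copies)]
    done, _ = concurrent.futures.wait(
        futures, return_when=concurrent.futures.FIRST_COMPLETED)
    result = next(iter(done)).result()
    pool.shutdown(wait=False)  # do not block on the straggler
    return result

print(run_with_backup(lambda: "done"))  # prints "done"
```

This only pays off when duplicates are cheap relative to the phase, which is why the paper schedules backups only near the end of a phase; the measured cost was a few percent extra resources for a large reduction in completion time.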
Sort Performance
❏ 10^10 100-byte records (1 TB of data, 1,800 nodes)
Refinement
❏ Sorting guarantees
❏ within each reduce partition
❏ Combiner
❏ Performs a partial reduce on each mapper's output before it is sent over the network
❏ Useful for saving network bandwidth
❏ User-defined counters
❏ Useful for debugging
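The combiner refinement works whenever the reduce function is commutative and associative, as word count's sum is. A minimal sketch of the bandwidth saving (`map_with_combiner` is an illustrative name; it applies the reduce logic locally to one mapper's output):

```python
from collections import Counter

def map_with_combiner(document_contents):
    # Collapse repeated keys on the mapper before shipping them:
    # ("run", "1") emitted three times becomes one ("run", "3") pair.
    return {word: str(count)
            for word, count in Counter(document_contents.split()).items()}

text = "run spot run run spot"
pairs_without = len(text.split())            # 5 pairs cross the network
pairs_with = len(map_with_combiner(text))    # 2 pairs cross the network
print(pairs_without, pairs_with)  # prints "5 2"
```

The reducer's output is unchanged; only the number of intermediate pairs shipped across the network shrinks, which matters most for skewed keys like common English words.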