CS555: Distributed Systems [Fall 2019]
Dept. of Computer Science, Colorado State University


SLIDES CREATED BY: SHRIDEEP PALLICKARA

CS 555: DISTRIBUTED SYSTEMS

[MAPREDUCE]

Shrideep Pallickara Computer Science Colorado State University

September 26, 2019

Frequently asked questions from the previous class survey

Topics covered in this lecture

• MapReduce

MAPREDUCE

MapReduce: Topics that we will cover

• Why?
• What it is and what it is not
• The core framework and the original Google paper
• Development of simple programs using Hadoop
  ◦ The dominant MapReduce implementation

MapReduce

• It’s a framework for processing data residing on a large number of computers
• Very powerful framework
  ◦ Excellent for some problems
  ◦ Challenging or not applicable in other classes of problems

What is MapReduce?

• More a framework than a tool
• You are required to fit (some folks shoehorn it) your solution into the MapReduce framework
• MapReduce is not a feature, but rather a constraint

What does this constraint mean?

• It makes problem solving both easier and harder
• Clear boundaries for what you can and cannot do
  ◦ You actually need to consider fewer options than you are used to
• But solving problems with constraints requires planning and a change in your thinking

But what does this get us?

• Tradeoff of being confined to the MapReduce framework?
  ◦ Ability to process data on a large number of computers
  ◦ But, more importantly, without having to worry about concurrency, scale, fault tolerance, and robustness

A challenge in writing MapReduce programs

• Design!
  ◦ Good programmers can produce bad software due to poor design
  ◦ Good programmers can produce bad MapReduce algorithms
• Only in this case your mistakes will be amplified
  ◦ Your job may be distributed on 100s or 1000s of machines and operating on a petabyte of data

MapReduce: Origins of the design

• Process crawled data and logs of web requests
• Several computations work on this raw data to compute derived data
  ◦ Inverted indices
  ◦ Representation of graph structure of web documents
  ◦ Pages crawled per host
  ◦ Most frequent queries in a day …

Most computations are conceptually straightforward

• But data is large
• Computations must be scalable
  ◦ Distributed across thousands of machines
  ◦ To complete in a reasonable amount of time

Complexity of managing distributed computations can …

• Obscure simplicity of original computation
• Contributing factors:
  ① How to parallelize computation
  ② Distribute the data
  ③ Handle failures

MapReduce was developed to cope with this complexity

• Express simple computations
• Hide messy details of
  ◦ Parallelization
  ◦ Data distribution
  ◦ Fault tolerance
  ◦ Load balancing

MapReduce

• Programming model
• Associated implementation for
  ◦ Processing & generating large data sets

Programming model

• Computation takes a set of input key/value pairs
• Produces a set of output key/value pairs
• Express the computation as two functions:
  ◦ Map
  ◦ Reduce

Map

• Takes an input pair
• Produces a set of intermediate key/value pairs

MapReduce library

• Groups all intermediate values with the same intermediate key
• Passes them to the Reduce function

Reduce function

• Accepts intermediate key I and
  ◦ Set of values for that key
• Merges these values together to get
  ◦ Smaller set of values

Counting the number of occurrences of each word in a large collection of documents

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

Counting the number of occurrences of each word in a large collection of documents

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));

Sums together all counts emitted for a particular word
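The map and reduce pseudocode for word counting can be tried out locally. The following is a minimal single-process sketch in Python; `run_word_count` is a hypothetical driver that stands in for the MapReduce library, and it assumes all intermediate pairs fit in memory. Values stay as strings, as in the pseudocode.

```python
from collections import defaultdict

def map_fn(key, value):
    # key: document name, value: document contents
    for word in value.split():
        yield word, "1"

def reduce_fn(key, values):
    # key: a word, values: an iterable of counts (as strings)
    result = 0
    for v in values:
        result += int(v)
    yield str(result)

def run_word_count(documents):
    """Hypothetical driver: map every document, group by key, then reduce."""
    grouped = defaultdict(list)
    for name, contents in documents.items():
        for word, count in map_fn(name, contents):
            grouped[word].append(count)
    return {word: next(reduce_fn(word, counts)) for word, counts in grouped.items()}

counts = run_word_count({"a.txt": "to be or not to be", "b.txt": "to do"})
```

Here `counts["to"]` is the string `"3"`: two occurrences from `a.txt` and one from `b.txt`.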

MapReduce specification object contains

• Names of
  ◦ Input
  ◦ Output
• Tuning parameters

Map and reduce functions have associated types drawn from different domains

map: (k1, v1) → list(k2, v2)
reduce: (k2, list(v2)) → list(v2)
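One way to read these signatures is as generic function types. The sketch below writes them as Python type aliases; `MapFn`, `ReduceFn`, `length_map`, and `distinct_reduce` are illustrative names, not part of any MapReduce API. Note that the input pair and the intermediate pair come from different domains, while the output values share the domain of the intermediate values.

```python
from typing import Callable, Iterable, Tuple, TypeVar

K1 = TypeVar("K1")
V1 = TypeVar("V1")
K2 = TypeVar("K2")
V2 = TypeVar("V2")

# map: (k1, v1) -> list(k2, v2)
MapFn = Callable[[K1, V1], Iterable[Tuple[K2, V2]]]
# reduce: (k2, list(v2)) -> list(v2)
ReduceFn = Callable[[K2, Iterable[V2]], Iterable[V2]]

def length_map(doc_name: str, contents: str) -> Iterable[Tuple[int, str]]:
    # Input domain: (document name, contents); intermediate domain: (word length, word)
    return [(len(word), word) for word in contents.split()]

def distinct_reduce(length: int, words: Iterable[str]) -> Iterable[str]:
    # Output values are drawn from the same domain as the intermediate values
    return sorted(set(words))

pairs = length_map("a.txt", "map reduce map")
```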

What’s passed to-and-from user-defined functions

• Strings
• User code converts between
  ◦ String
  ◦ Appropriate types

Programs expressed as MapReduce computations: Distributed Grep

• Map
  ◦ Emit line if it matches specified pattern
• Reduce
  ◦ Just copy intermediate data to the output
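Distributed grep can be sketched as a filtering map followed by an identity reduce. This is a local illustration only; `grep_map` and `grep_reduce` are hypothetical names, and passing the pattern as an extra argument is an assumption made to keep the example self-contained.

```python
import re

def grep_map(filename, line, pattern):
    # Emit the line only if it matches the specified pattern
    if re.search(pattern, line):
        yield line, ""

def grep_reduce(key, values):
    # Identity reduce: copy the intermediate data to the output
    yield key

lines = ["error: disk full", "ok", "error: timeout"]
matches = [k for line in lines for k, _ in grep_map("log.txt", line, r"^error")]
output = [out for m in matches for out in grep_reduce(m, [""])]
```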

Term-Vector per Host

• Summarizes important terms that occur in a set of documents as <word, frequency> pairs
• Map
  ◦ Emit <hostname, term vector>
  ◦ For each input document
• Reduce function
  ◦ Has all per-document vectors for a given host
  ◦ Add term vectors; discard infrequent terms
    ▪ Emit the final <hostname, term vector>
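A small sketch of the term-vector computation, using `collections.Counter` as the term vector. The function names and the `min_count` threshold for "infrequent" are assumptions for illustration; the slides do not specify how infrequent terms are identified.

```python
from collections import Counter
from urllib.parse import urlparse

def tv_map(url, contents):
    # Emit <hostname, term vector> for one input document
    hostname = urlparse(url).hostname
    yield hostname, Counter(contents.lower().split())

def tv_reduce(hostname, vectors, min_count=2):
    # Add the per-document vectors, then drop terms seen fewer than min_count times
    total = Counter()
    for vec in vectors:
        total.update(vec)
    frequent = {term: n for term, n in total.items() if n >= min_count}
    yield hostname, frequent

docs = {
    "http://example.com/a": "map reduce map",
    "http://example.com/b": "reduce shuffle",
}
vectors = [vec for url, text in docs.items() for _, vec in tv_map(url, text)]
host, term_vector = next(tv_reduce("example.com", vectors))
```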

IMPLEMENTATION OF THE RUNTIME

Implementation

• Machines are commodity machines
• GFS is used to manage the data stored on the disks

Execution Overview – Part I

• Maps distributed across multiple machines
• Automatic partitioning of data into M splits
• Splits processed concurrently on different machines

Execution Overview – Part II

• Partition intermediate key space into R pieces
• E.g., hash(key) mod R
• User-specified parameters
  ◦ Partitioning function
  ◦ Number of partitions (R)
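The default partitioning rule, hash(key) mod R, might look like the sketch below. CRC32 is used here as a stand-in hash function (an assumption, chosen because Python's built-in `hash` for strings varies between processes, which would send the same key to different partitions on different machines).

```python
import zlib

def partition(key: str, R: int) -> int:
    # hash(key) mod R, with CRC32 standing in for the hash function
    return zlib.crc32(key.encode("utf-8")) % R

R = 4
assignments = {word: partition(word, R) for word in ["the", "cat", "sat"]}
```

Every occurrence of a given key lands in the same partition, so a single reduce task sees all values for that key.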

Execution Overview

[Figure: Execution overview. Input splits 0-4, the user program, the master, five workers, and output files 0 and 1.]

Execution Overview: Step I The MapReduce library

• Splits input files into M pieces
  ◦ 16-64 MB per piece
• Starts up copies of the program on a cluster of machines
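To get a feel for how M follows from the split size, here is a rough sketch that cuts an input of a given size into fixed-size byte ranges. It is a simplification: it ignores record boundaries, and `num_splits` and `split_ranges` are hypothetical helper names.

```python
MB = 1024 * 1024

def num_splits(total_bytes: int, split_bytes: int = 64 * MB) -> int:
    # Number of pieces M, rounding up so the tail of the input gets its own split
    return (total_bytes + split_bytes - 1) // split_bytes

def split_ranges(total_bytes: int, split_bytes: int = 64 * MB):
    # (offset, length) for each split; the last split may be shorter
    return [(off, min(split_bytes, total_bytes - off))
            for off in range(0, total_bytes, split_bytes)]

one_tib = 1024 ** 4
M = num_splits(one_tib)
```

With 64 MB splits, 1 TiB of input yields M = 16384 map tasks.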

Execution Overview: Step II Program copies

• One of the copies is a Master
• There are M map tasks and R reduce tasks to assign
• Master
  ◦ Picks idle workers
  ◦ Assigns each worker a map or reduce task
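The master's assignment step can be pictured as handing pending tasks to idle workers. The sketch below is deliberately simplified: it assumes a single queue with map tasks ahead of reduce tasks, and the function and worker names are made up for illustration. A real master also tracks task state and reacts to worker failures.

```python
from collections import deque

def assign_tasks(M, R, idle_workers):
    """Give one pending task to each idle worker; return (assignments, still pending)."""
    pending = deque([("map", i) for i in range(M)] + [("reduce", j) for j in range(R)])
    assignments = {}
    for worker in idle_workers:
        if not pending:
            break
        assignments[worker] = pending.popleft()
    return assignments, list(pending)

assignments, pending = assign_tasks(M=3, R=2, idle_workers=["w0", "w1"])
```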

Execution Overview: Step III Workers that are assigned a map task

• Read contents of their input split
• Parse <key, value> pairs out of input data
• Pass each pair to user-defined Map function
• Intermediate <key, value> pairs from Maps
  ◦ Buffered in memory

Execution Overview: Step IV Writing to disk

• Periodically, buffered pairs are written to local disk
• These writes are partitioned
  ◦ By the partitioning function
• Locations of buffered pairs on local disk
  ◦ Reported back to Master
  ◦ Master forwards these locations to reduce workers

Execution Overview: Step V Reading Intermediate data

• Master notifies Reduce worker about locations
• Reduce worker reads buffered data from the local disks of the map workers
• Read all intermediate data; sort by intermediate key
  ◦ All occurrences of same key grouped together
  ◦ Many different keys map to the same Reduce task
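Sorting is what brings all occurrences of a key together. A minimal in-memory sketch follows, assuming the fetched pairs fit in memory; `sort_and_group` is a hypothetical helper name.

```python
from itertools import groupby
from operator import itemgetter

def sort_and_group(fetched_pairs):
    """Sort by intermediate key so equal keys are adjacent, then group their values."""
    ordered = sorted(fetched_pairs, key=itemgetter(0))
    return [(key, [v for _, v in group])
            for key, group in groupby(ordered, key=itemgetter(0))]

fetched = [("the", "1"), ("cat", "1"), ("the", "1"), ("sat", "1")]
grouped = sort_and_group(fetched)
```

Three different keys reach this one reduce task, and the two `"the"` pairs end up in a single group.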

Execution Overview: Step VI Processing data at the Reduce worker

• Iterate over sorted intermediate data
• For each unique key, pass
  ◦ Key + set of intermediate values to Reduce function
• Output of Reduce function is appended
  ◦ To output file of reduce partition
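The reduce worker's loop can be sketched as one pass over already-sorted pairs. In this illustration `io.StringIO` stands in for the output file of the reduce partition, and `run_reduce_task`, `sum_reduce`, and the tab-separated output format are assumptions.

```python
import io
from itertools import groupby
from operator import itemgetter

def run_reduce_task(sorted_pairs, reduce_fn, out):
    """Call reduce_fn once per unique key and append its output to `out`."""
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        values = (v for _, v in group)
        for result in reduce_fn(key, values):
            out.write(f"{key}\t{result}\n")

def sum_reduce(key, values):
    # Example user-defined Reduce function: sum the counts for one key
    yield str(sum(int(v) for v in values))

sorted_pairs = [("cat", "1"), ("the", "1"), ("the", "1")]
out = io.StringIO()  # stands in for this partition's output file
run_reduce_task(sorted_pairs, sum_reduce, out)
contents = out.getvalue()
```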

Execution Overview: Step VII Waking up the user

• After all Map & Reduce tasks have been completed
• Control returns to the user code

TASK GRANULARITY

Task Granularity

• Subdivide map phase into M pieces
• Subdivide reduce phase into R pieces
• M, R >> number of worker machines
• Each worker performing many different tasks
  ◦ Improves dynamic load balancing
  ◦ Speeds up recovery during failures
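A toy simulation suggests why many small tasks help load balancing. It assumes each idle worker simply pulls the next task from a queue; the durations are invented for illustration, and `makespan` is a hypothetical helper.

```python
import heapq

def makespan(task_durations, num_workers):
    """Finish time when each idle worker pulls the next task from the queue."""
    finish_times = [0] * num_workers
    heapq.heapify(finish_times)
    for duration in task_durations:
        earliest = heapq.heappop(finish_times)  # the worker that becomes idle first
        heapq.heappush(finish_times, earliest + duration)
    return max(finish_times)

# The same 12 units of work on 2 workers, cut two different ways
coarse = makespan([9, 3], num_workers=2)      # two uneven tasks
fine = makespan([1] * 12, num_workers=2)      # twelve small tasks
```

With two uneven tasks, one worker sits idle while the other finishes (9 time units in total); with twelve small tasks, the work evens out to 6.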

Master Data Structures

• For each Map and Reduce task
  ◦ State: {idle, in-progress, completed}
  ◦ Worker machine identity
• For each completed Map task store
  ◦ Location and sizes of R intermediate file regions
• Information pushed incrementally to in-progress Reduce tasks
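The per-task bookkeeping could be modeled along these lines. This is a sketch only: the class and field names, and the example worker name and paths, are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional, Tuple

class TaskState(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"

@dataclass
class Task:
    kind: str                        # "map" or "reduce"
    state: TaskState = TaskState.IDLE
    worker: Optional[str] = None     # identity of the worker machine
    # For completed map tasks: reduce partition -> (location, size in bytes)
    regions: Dict[int, Tuple[str, int]] = field(default_factory=dict)

task = Task(kind="map")
task.state, task.worker = TaskState.IN_PROGRESS, "worker-7"
# On completion the worker reports its R intermediate file regions (here R = 2)
task.regions = {0: ("worker-7:/tmp/m0-r0", 1024), 1: ("worker-7:/tmp/m0-r1", 2048)}
task.state = TaskState.COMPLETED
```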

Practical bounds on how large M and R can be

• Master must make O(M + R) scheduling decisions
• Keep O(MR) state in memory
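A quick back-of-the-envelope check of these bounds, using figures reported in the Dean and Ghemawat paper: computations often run with M = 200,000 and R = 5,000 on 2,000 worker machines, and the O(MR) state takes roughly one byte per map task/reduce task pair.

```python
M, R = 200_000, 5_000
bytes_per_pair = 1                      # approximate figure reported in the paper

scheduling_decisions = M + R            # O(M + R)
state_bytes = M * R * bytes_per_pair    # O(MR)
state_gb = state_bytes / 1e9
```

That is 205,000 scheduling decisions and about 1 GB of master state, which is why M and R cannot grow without limit.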

The contents of this slide-set are based on the following references

• Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004: 137-150.
• Donald Miner and Adam Shook. MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems. 1st Edition. O'Reilly Media. ISBN: 978-1449327170. [Chapter 1]
