Data-Intensive Distributed Computing
Part 1: MapReduce Algorithm Design (1/3)
431/451/631/651 (Fall 2020), Ali Abedi
These slides are available at https://www.student.cs.uwaterloo.ca/~cs451/
Agenda for today
[Diagram: an abstraction layer over a cluster of computers, providing storage/computing]
How can we process a large file on a distributed system?
File.txt: 10 TB
Sequential read: 100 MB/s
10 TB / (100 MB/s) = 10^13 bytes / (10^8 bytes/s) = 10^5 s ≈ 28 hours
It takes 28 hours just to read the file (ignoring computation).
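A quick sanity check of this figure as a minimal Scala sketch (the constants are the ones from the slide, interpreted as decimal TB and MB):

// Back-of-the-envelope check of the 28-hour figure.
object ReadTime {
  def main(args: Array[String]): Unit = {
    val fileBytes      = 10e12   // 10 TB
    val bytesPerSecond = 100e6   // 100 MB/s sequential read
    val hours = fileBytes / bytesPerSecond / 3600
    println(f"Sequential read time: $hours%.1f hours")   // prints ~27.8 hours
  }
}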
Can we speed up this process by using more resources? How can we solve this problem using 20 servers instead? For simplicity, assume that all 20 servers have a copy of the 10 TB file.
[Diagram: servers S1, S2, S3, …, S19, S20, each holding a copy of the 10 TB File.txt]
This is the logical view of how MapReduce works in our simple count-Waterloo example. Each of the 20 servers is responsible for a chunk of the 10 TB file. Each server counts the number of times Waterloo appears in the text assigned to it. Then, all servers send these partial results to another server (which can be one of the 20 servers). This server adds up all of the partial results to find the total number of times Waterloo appears in the 10 TB file. Physical-view details, such as how each server gets the chunk it should process and how intermediate results are moved to the reducer, should be ignored for now.
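A minimal Scala sketch of this logical view, simulating the servers with local collections (chunking, data movement, and fault tolerance are abstracted away; the two sample chunks are made up for illustration):

// Each "server" counts "waterloo" in its chunk of text (the map phase),
// then a single server sums the partial counts (the reduce phase).
object CountWaterloo {
  def countInChunk(chunk: String): Int =
    chunk.toLowerCase.split("\\W+").count(_ == "waterloo")

  def main(args: Array[String]): Unit = {
    val chunks = Seq(                              // stand-ins for the 20 file chunks
      "Waterloo is a city in Ontario",
      "the University of Waterloo is in Waterloo")
    val partialCounts = chunks.map(countInChunk)   // e.g. List(1, 2)
    val total = partialCounts.sum                  // 3
    println(s"partial = $partialCounts, total = $total")
  }
}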
Count "Waterloo"
[Diagram: servers S1–S20 produce partial counts (5, 2, 8, …, 21) that are summed into the total, 36]
In our simple example, one reducer was enough because it only had to add up a handful of numbers (one per mapper). But in general we might have a huge number of partial results from the map phase. Let's see another example.
Word        Count
Waterloo    36
Kitchener   27
City        512
Is          12450
The         16700
University  123
…
For each word in the input file, count how many times it appears in the file.
All mappers send a list of (key, value) pairs to the reducer, where the key is a word and the value is its count. The reducer adds up all intermediate results. But it can now become a bottleneck. Can we have multiple reducers, just as we have multiple mappers?
[Diagram: each server emits (word, count) pairs, e.g. (waterloo, 5), (kitchener, 2), (city, 10), … and (university, 4), (waterloo, 21), (city, 4), …; a single reducer combines them into totals such as (waterloo, 36), (city, 500), …]
[Diagram: the same mapper outputs now have to be divided among multiple reducers]
If pairs with the same key end up on different reducers, we have partial results again!
How can mapper x know to which reducer mapper y will send key k?
Each mapper can independently hash any key k to find out which reducer it should go to.
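A minimal sketch of such a partitioning function in Scala (the masking trick just keeps the hash non-negative; this mirrors the idea behind Hadoop's default hash partitioner, not its exact code):

// Every mapper applies the same deterministic function, so pairs with the
// same key always land on the same reducer without any coordination.
object Partitioner {
  def partitionFor(key: String, numReducers: Int): Int =
    (key.hashCode & Int.MaxValue) % numReducers

  def main(args: Array[String]): Unit =
    // e.g. all mappers send their ("waterloo", n) pairs to the same reducer
    println(partitionFor("waterloo", 4))
}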
[Diagram: mapper outputs are routed by key to multiple reducers, which produce totals such as (waterloo, 36), (university, 500), (city, 1800), (kitchener, 500), …]
The process of moving intermediate results from mappers to reducers is called shuffling.
S1: (waterloo, 5), (kitchener, 2), (city, 10), …
What if this list is too long?
Unfortunately, if we want to accumulate all counts in a dictionary, it may need too much memory. Even if we could bound the number of English words, no assumption can be made about an arbitrary input.
S1's chunk of text:
Waterloo is a city in Ontario,
three cities in the Regional Municipality of Waterloo …
We need a data structure like a dictionary to count all words, but how much memory do we need?
Buffering is dangerous
For every word we read, emit (word, 1) to the reducer! This way, the memory we need is almost zero.
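A minimal sketch of this per-word emission in Scala (emit here is a hypothetical callback standing in for the framework's output collector; the sample sentence is from the slide):

// The mapper emits (word, 1) as soon as it sees a word, instead of building
// an in-memory dictionary, so its memory footprint stays (nearly) constant.
object PerWordMapper {
  def map(line: String, emit: (String, Int) => Unit): Unit =
    for (word <- line.toLowerCase.split("\\W+") if word.nonEmpty)
      emit(word, 1)

  def main(args: Array[String]): Unit =
    map("Waterloo is a city in Ontario", (w, c) => println(s"($w, $c)"))
}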
Before: S1 emits (waterloo, 5), (kitchener, 2), (city, 10), … only after processing its whole chunk.
After: S1 emits (waterloo, 1), (is, 1), (a, 1), (city, 1), … as it reads each word of the same chunk.
No change is needed in the reduce phase: reducers still add up all the numbers for each key.
[Diagram: each server now emits (word, 1) pairs, e.g. (waterloo, 1), (is, 1), (a, 1), (city, 1), … and (university, 1), (of, 1), (waterloo, 1), …; the reducers still produce totals such as (waterloo, 36), (university, 500), (city, 1800), (kitchener, 500), …]
Mapper: simply process the input line by line; for every word in a line, emit (word, 1). Reducer: for every word, add up all of the 1s.
def map(key: Long, value: String) = {
  for (word <- tokenize(value)) {
    emit(word, 1)
  }
}

def reduce(key: String, values: Iterable[Int]) = {
  var sum = 0
  for (value <- values) {
    sum += value
  }
  emit(key, sum)
}
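To see how a framework would drive these two functions, here is a hedged, single-machine simulation in Scala (tokenize, the grouping step, and the driver loop are simplified stand-ins for what Hadoop actually does):

// Single-machine simulation of the word-count job above: map every line,
// group the intermediate pairs by key (the "shuffle"), then reduce each key.
object WordCountSimulation {
  def tokenize(line: String): Seq[String] =
    line.toLowerCase.split("\\W+").toSeq.filter(_.nonEmpty)

  def main(args: Array[String]): Unit = {
    val lines   = Seq("Waterloo is a city", "the University of Waterloo")
    val mapped  = lines.flatMap(line => tokenize(line).map(word => (word, 1)))
    val grouped = mapped.groupBy(_._1)            // shuffle: group values by key
    val reduced = grouped.map { case (word, ones) => (word, ones.map(_._2).sum) }
    reduced.foreach { case (word, count) => println(s"$word $count") }
  }
}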
Apache Hadoop is the most famous open-source implementation of MapReduce.
Google has a proprietary implementation in C++
Bindings in Java, Python
Hadoop provides an open-source implementation in Java
Development begun by Yahoo, later an Apache project
Used in production at Facebook, Twitter, LinkedIn, Netflix, …
Large and expanding software ecosystem
Potential point of confusion: Hadoop is more than MapReduce today
Lots of custom research implementations
[Diagram: input (k, v) pairs flow through map tasks, values are grouped by key, and reduce tasks produce the output pairs]
Programmer specifies two functions:
map (k1, v1) → List[(k2, v2)]
reduce (k2, List[v2]) → List[(k3, v3)]
All values with the same key are sent to the same reducer
The execution framework handles everything else… What's "everything else"?
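As an aside before unpacking "everything else": the two-function contract can be written down as Scala types, roughly as sketched below (the type-parameter names follow the slide's k1/v1, k2/v2, k3/v3 notation, not Hadoop's actual interfaces):

// map turns one input record into a list of intermediate pairs; reduce turns
// one intermediate key and all of its values into a list of output pairs.
trait MapReduceJob[K1, V1, K2, V2, K3, V3] {
  def map(key: K1, value: V1): List[(K2, V2)]
  def reduce(key: K2, values: List[V2]): List[(K3, V3)]
}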
Handles scheduling
Assigns workers to map and reduce tasks
Handles “data distribution”
Moves processes to data
Handles synchronization
Groups intermediate data
Handles errors and faults
Detects worker failures and restarts their tasks
Everything happens on top of a distributed FS
The map function is called for every line of text:
map("Waterloo is a small city.") → (waterloo, 1), (is, 1), (a, 1), …
The intermediate pairs are grouped by key: (waterloo, {1, 1, 1, 1, 1}), (city, {1, 1}), (university, {1, 1, 1}), …
The reduce function is called for every key:
reduce(waterloo, {1, 1, 1, 1, 1}) → (waterloo, 5)
Programmer specifies two functions:
map (k1, v1) → List[(k2, v2)] reduce (k2, List[v2]) → List[(k3, v3)]
All values with the same key are sent to the same reducer
The execution framework handles everything else… Not quite…
[Diagram: the same map → group values by key → reduce data flow as before]
What's the most complex and slowest operation here?
The slowest operation is shuffling intermediate results from mappers to reducers.
Programmer specifies two functions:
map (k1, v1) → List[(k2, v2)]
reduce (k2, List[v2]) → List[(k3, v3)]
All values with the same key are sent to the same reducer
partition (k2, p) → 0 … p-1
Often a simple hash of the key, e.g., hash(k2) mod p
Divides up the key space for parallel reduce operations
combine (k2, List[v2]) → List[(k2, v2)]
Mini-reducers that run in memory after the map phase
Used as an optimization to reduce network traffic
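A hedged Scala sketch of a combiner for the word-count job (here it is the same operation as the reducer, a sum, which is only safe because addition is associative and commutative):

// Runs on the mapper's machine before the shuffle: collapses many (word, 1)
// pairs into one (word, n) pair per word, so less data crosses the network.
object WordCountCombiner {
  def combine(key: String, values: Iterable[Int]): (String, Int) =
    (key, values.sum)

  def main(args: Array[String]): Unit =
    println(combine("waterloo", Seq(1, 1, 1, 1, 1)))   // prints (waterloo,5)
}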
[Diagram: on each mapper, the map outputs pass through a combiner and a partitioner before the values are grouped by key and handed to the reducers; e.g. (c, 3) and (c, 6) from one mapper are combined into (c, 9) before the shuffle]
Important detail: reducers process keys in sorted order.
Partition is not a component that the data goes through, but rather a policy that determines to which reducer the output of mappers should go.
Logical View
What happens behind the scenes
[Diagram: the user program submits the job to the master, which schedules map and reduce tasks on workers; map workers read the input splits, write intermediate files to local disk, and reduce workers remotely read those files and write the output files. Adapted from (Dean and Ghemawat, OSDI 2004)]
Physical View
Map side:
Map outputs are buffered in memory in a circular buffer
When the buffer reaches a threshold, contents are "spilled" to disk
Spills are merged into a single, partitioned file (sorted within each partition)
Combiner runs during the merges
Reduce side:
First, map outputs are copied over to the reducer machine
"Sort" is a multi-pass merge of map outputs (happens in memory and on disk)
Combiner runs during the merges
Final merge pass goes directly into the reducer
[Diagram: on the mapper, a circular buffer in memory spills to disk, the spills are merged into intermediate files on disk, and the reducer copies and merges them; the combiner runs during the merges on both sides]
Barrier between map and reduce phases
But runtime can begin copying intermediate data earlier
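A much-simplified Scala sketch of the map-side buffering and spilling described above (the threshold, the in-memory map, and the list standing in for spill files on disk are all simplifications of the real circular byte buffer and background spill threads):

// Buffers (word, count) pairs in memory; when the buffer exceeds a threshold,
// its contents are sorted by key and "spilled", then the buffer is cleared.
import scala.collection.mutable

class SpillingBuffer(threshold: Int) {
  private val buffer = mutable.Map.empty[String, Int]
  val spills = mutable.ListBuffer.empty[Seq[(String, Int)]]  // stand-in for spill files on disk

  def emit(word: String, count: Int): Unit = {
    buffer(word) = buffer.getOrElse(word, 0) + count
    if (buffer.size >= threshold) spill()
  }

  def spill(): Unit = if (buffer.nonEmpty) {
    spills += buffer.toSeq.sortBy(_._1)   // sorted within the spill, as on the map side
    buffer.clear()
  }
}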
MapReduce hides the complexities of the physical view so that the programmer can focus on "what" rather than "how" it's done.
[Diagram: MapReduce is the abstraction layer over the cluster of computers, providing storage/computing]
With this approach, the datacenter, with all of its complexities, behaves like a single computer.