DATA MINING LECTURE 15
The Map-Reduce Computational Paradigm
Most of the slides are taken from: Mining of Massive Datasets, Jure Leskovec, Anand Rajaraman, Jeff Ullman, Stanford University, http://www.mmds.org
[Figure: single-node architecture with CPU, memory, and disk. Labels: Machine Learning, Statistics; “Classical” Data Mining]
[Figure: cluster architecture. Each node has its own CPU, memory, and disk; the nodes of a rack share a switch, and the rack switches connect to a backbone switch.]
Each rack contains 16-64 nodes
1 Gbps between any pair of nodes in a rack
2-10 Gbps backbone between racks
In 2011 it was guesstimated that Google had 1M machines, http://bit.ly/Shh0RO
programs?
[Figure: file chunks (C0, C1, C2, C3, C5, D0, D1) replicated across chunk servers 1, 2, 3, ..., N]
Bring computation directly to the data!
Chunk servers also serve as compute servers
in memory
line
Outline stays the same, Map and Reduce change to fit the problem
[Figure: the Map step. map turns input data elements (key-value pairs) into intermediate key-value pairs.]
Important: Different shapes correspond to different types of keys and values!
[Figure: the Reduce step. Intermediate key-value pairs are grouped by key into key-value groups; reduce turns each group into output key-value pairs.]
pairs
appearance of the word in the input line
computed from the set of values associated with k’
The crew of the space shuttle Endeavor recently returned to Earth as ambassadors, harbingers of a new era
space
NASA are saying that the recent assembly of the Dextre bot is the first step in a long-term space-based man/mache partnership. '"The work we're doing now
need ……………………..
Big document
Sequentially read the data: only sequential reads

MAP (provided by the programmer): Read input and produce a set of key-value pairs
(key, value): (The, 1) (crew, 1) (of, 1) (the, 1) (space, 1) (shuttle, 1) (Endeavor, 1) (recently, 1) ….

Group by key: Collect all pairs with the same key
(key, value): (crew, 1) (crew, 1) (space, 1) (the, 1) (the, 1) (the, 1) (shuttle, 1) (recently, 1) …

Reduce (provided by the programmer): Collect all values belonging to the key and output
(key, value): (crew, 2) (space, 1) (the, 3) (shuttle, 1) (recently, 1) …
map(key, value):
    // key: document name; value: text of the document
    for each word w in words(value):
        emit(w, 1)

reduce(key, values):
    // key: a word; values: an iterator over counts
    result = 0
    for each count v in values:
        result += v
    emit(key, result)
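The word-count pseudocode can be tried out on a single machine. The following is a minimal sketch, not the framework itself: `run_map_reduce` is a hypothetical in-process driver that stands in for the distributed map, group-by-key, and reduce phases, and lowercasing the words is an assumption made so that "The" and "the" are counted together.

```python
from collections import defaultdict
import re


def map_fn(key, value):
    """key: document name; value: text of the document."""
    # Assumption: words are runs of letters, compared case-insensitively.
    for word in re.findall(r"[A-Za-z']+", value):
        yield word.lower(), 1


def reduce_fn(key, values):
    """key: a word; values: an iterable of counts."""
    yield key, sum(values)


def run_map_reduce(inputs, map_fn, reduce_fn):
    """Hypothetical single-process driver: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for k2, v2 in map_fn(key, value):
            groups[k2].append(v2)  # group by intermediate key
    output = []
    for k2 in sorted(groups):  # sorted only to make the output deterministic
        output.extend(reduce_fn(k2, groups[k2]))
    return output


docs = [("doc1", "The crew of the space shuttle"), ("doc2", "the crew")]
counts = dict(run_map_reduce(docs, map_fn, reduce_fn))
print(counts)
```

Only `map_fn` and `reduce_fn` are specific to word count; the driver stays the same for other problems.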
Big document
MAP: Read input and produce a set of key-value pairs
Group by key: Collect all pairs with the same key (Hash merge, Shuffle, Sort, Partition)
Reduce: Collect all values belonging to the key and output
All phases are distributed with many tasks doing the work
new set of (k’,v’)-pairs
to the same reduce
grouped by key into new (k’,v’’)-pairs
[Figure: data flow. Input 0, Input 1, and Input 2 feed Map 0, Map 1, and Map 2; the shuffle routes the map output to Reduce 0 and Reduce 1, which write Out 0 and Out 1.]
physical storage location of input data
location and sizes of its R intermediate files, one for each reducer
worker are reset to idle
cluster
recovery from worker failures
time:
as the reduce function
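A combiner pre-aggregates the output of one mapper before anything is sent over the network, and for word count it can reuse the summing logic of the reduce function. The sketch below is only an illustration under that assumption; `map_words` and `combine` are hypothetical names, not part of any framework API.

```python
from collections import Counter


def map_words(text):
    """Emit (word, 1) for every word, as in the word-count map function."""
    for word in text.lower().split():
        yield word, 1


def combine(pairs):
    """Local pre-aggregation on a single mapper: same summing logic as reduce."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return sorted(local.items())


raw = list(map_words("the crew the space the"))
combined = combine(raw)
print(len(raw), "pairs before combining,", len(combined), "after")
```

Summation is associative and commutative, which is what makes it safe to apply the reduce logic early on partial data.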
mapper (single machine):
input file
intermediate key end up at the same worker
from a host end up in the same output file
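The routing described here can be sketched as a partition function that maps each intermediate key to one of R reduce tasks. This is an illustrative sketch under assumptions: the keys are taken to be URLs, `default_partition` and `host_partition` are hypothetical names, and CRC32 stands in for whatever hash function a real system uses.

```python
import zlib
from urllib.parse import urlparse


def default_partition(key: str, num_reducers: int) -> int:
    """hash(key) mod R: equal keys always go to the same reduce task."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def host_partition(url: str, num_reducers: int) -> int:
    """hash(hostname(url)) mod R: all URLs of one host share a reduce task."""
    host = urlparse(url).hostname or ""
    return zlib.crc32(host.encode("utf-8")) % num_reducers


R = 4  # assumed number of reduce tasks
p1 = host_partition("http://www.mmds.org/a", R)
p2 = host_partition("http://www.mmds.org/b?x=1", R)
print(p1, p2)
```

A deterministic hash is used on purpose: Python's built-in `hash` for strings is randomized per process, so it would not route a key consistently across workers.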
multiplication)
R:
A   B
a1  b1
a2  b1
a3  b2
a4  b3

S:
B   C
b2  c1
b2  c2
b3  c3

Join of R and S on B:
A   B   C
a3  b2  c1
a3  b2  c2
a4  b3  c3
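The join of the two example relations can be reproduced with a map function that keys every tuple by its B value and a reduce function that pairs up the R and S tuples in each group. This is a sketch of one common way to do it; the function names and the "R"/"S" tags are assumptions made for illustration.

```python
from collections import defaultdict

R = [("a1", "b1"), ("a2", "b1"), ("a3", "b2"), ("a4", "b3")]
S = [("b2", "c1"), ("b2", "c2"), ("b3", "c3")]


def map_join(relation_name, tuples):
    """Key each tuple by its B value and tag it with its relation of origin."""
    for t in tuples:
        if relation_name == "R":
            a, b = t
            yield b, ("R", a)
        else:
            b, c = t
            yield b, ("S", c)


def reduce_join(b, tagged_values):
    """Pair every R tuple with every S tuple that shares the same B value."""
    a_values = [v for tag, v in tagged_values if tag == "R"]
    c_values = [v for tag, v in tagged_values if tag == "S"]
    for a in a_values:
        for c in c_values:
            yield (a, b, c)


# Group by key, standing in for the shuffle phase.
groups = defaultdict(list)
for name, relation in (("R", R), ("S", S)):
    for key, value in map_join(name, relation):
        groups[key].append(value)

joined = sorted(t for b, vals in groups.items() for t in reduce_join(b, vals))
print(joined)
```

The group for b1 holds only R tuples, so it produces no output, which matches the result table.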
$(Nw)_j = \sum_k n_{jk} w_k$
different tasks
sparse matrix
The matrix and vectors are stored in a sparse form:
each mapper.
$\sum_k n_{jk} w_k$ for entry $j$ of the output vector.
$(j, k, n_{jk})$ it outputs the key-value pair $(j, n_{jk} w_k)$
row $j$.
where the vector can fit into memory
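The matrix-vector scheme can be sketched as follows, assuming the vector fits in memory and is available to every mapper. The matrix values, the triple layout, and the function names are illustrative assumptions; the symbols follow the text, with a matrix entry stored as (j, k, n_jk) and the vector as entries w_k.

```python
from collections import defaultdict

# Sparse matrix as (j, k, n_jk) triples; sparse vector as {k: w_k}.
N = [(0, 0, 2.0), (0, 2, 1.0), (1, 1, 3.0), (2, 0, 4.0), (2, 2, 5.0)]
w = {0: 1.0, 1: 2.0, 2: 3.0}


def map_entry(j, k, n_jk, vector):
    """For matrix entry (j, k, n_jk), emit the pair (j, n_jk * w_k)."""
    yield j, n_jk * vector.get(k, 0.0)  # missing vector entries count as 0


def reduce_row(j, products):
    """Sum all products that belong to row j."""
    yield j, sum(products)


# Group by row index, standing in for the shuffle phase.
groups = defaultdict(list)
for j, k, n_jk in N:
    for key, value in map_entry(j, k, n_jk, w):
        groups[key].append(value)

result = dict(pair for j in sorted(groups) for pair in reduce_row(j, groups[j]))
print(result)
```

Each reduce call sees exactly the products of one row, so its sum is one entry of the output vector.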
sent out.
supersteps.
computations on graphs.
in the input graph and connects with its neighbors
to all other nodes
$(b, c, x_{bc})$ to its neighbors.
are pairs $(d, x_{bd})$ and $(e, x_{be})$ stored locally
MapReduce
structured data
framework
algorithms
Pregel
HDFS and Amazon S3.