

SLIDE 1

Large-scale Data Mining: MapReduce and Beyond

Part 2: Algorithms

Spiros Papadimitriou, IBM Research Jimeng Sun, IBM Research Rong Yan, Facebook

SLIDE 2

Part 2: Mining using MapReduce

• Mining algorithms using MapReduce
  • Information retrieval
  • Graph algorithms: PageRank
  • Clustering: Canopy clustering, KMeans
  • Classification: kNN, Naïve Bayes
• MapReduce Mining Summary

SLIDE 3

MapReduce Interface and Data Flow

• Map: (K1, V1) → list(K2, V2)
• Combine: (K2, list(V2)) → list(K2, V2)
• Partition: (K2, V2) → reducer_id
• Reduce: (K2, list(V2)) → list(K3, V3)

[Figure: Map and Combine tasks run on Hosts 1-3 over document splits, turning (id, doc) records into deduplicated list(unique_w, id) pairs; the partitioner routes each word to a reducer on Host 4, e.g. w1 and w2, which merge the id lists into (w, list(unique_id)).]
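To make these signatures concrete, here is a minimal single-process sketch of the data flow in Python. It is illustrative only and entirely our own (not Hadoop's API): there is no distribution or fault tolerance, and while a real combiner runs on each mapper's local output before the shuffle, this toy version applies it after grouping.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn, combine_fn=None):
    """Toy simulation of Map -> Combine -> shuffle/partition -> Reduce."""
    # Map: (K1, V1) -> list((K2, V2))
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(map_fn(k1, v1))
    # Shuffle/partition: group values by intermediate key
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # Combine: (K2, list(V2)) -> list((K2, V2)), a local pre-aggregation
    if combine_fn is not None:
        combined = defaultdict(list)
        for k2, values in groups.items():
            for k, v in combine_fn(k2, values):
                combined[k].append(v)
        groups = combined
    # Reduce: (K2, list(V2)) -> list((K3, V3))
    output = []
    for k2, values in groups.items():
        output.extend(reduce_fn(k2, values))
    return output
```

The algorithms on the following slides can all be run through this helper by plugging in the appropriate map_fn, reduce_fn, and (optionally) combine_fn.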

SLIDE 4


Information retrieval using MapReduce

SLIDE 5

IR: Distributed Grep

• Find the doc_id and line# of a matching pattern
• Map: (id, doc) → list(id, line#)
• Reduce: none

[Figure: grep "data mining" over six doc splits on Map1-Map3; matching lines emitted as (doc_id, line#) pairs: <1, 123>, <3, 717>, <5, 1231>, <6, 1012>.]
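A sketch of the grep job using the run_mapreduce helper from SLIDE 3. The pattern and the tiny docs dict are made up for illustration, and since the slide specifies no reducer, we pass an identity reduce.

```python
import re

PATTERN = re.compile(r"data mining")  # hypothetical search pattern

def grep_map(doc_id, doc):
    # Map: (id, doc) -> list((id, line#)) for every matching line
    return [(doc_id, lineno)
            for lineno, line in enumerate(doc.splitlines(), start=1)
            if PATTERN.search(line)]

docs = {1: "intro\nwe survey data mining", 2: "nothing relevant here"}
matches = run_mapreduce(docs.items(), grep_map,
                        reduce_fn=lambda k, vs: [(k, v) for v in vs])
# matches == [(1, 2)]: doc 1 matches on line 2
```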

SLIDE 6

IR: URL Access Frequency

• Map: (null, log) → (URL, 1)
• Reduce: (URL, list(1)) → (URL, total_count)

[Figure: Map1-Map3 scan log splits and emit <u1,1>, <u2,1>, <u3,1>, <u3,1>, <u1,1>; the reducer sums per URL into <u1,2>, <u2,1>, <u3,2>. Also described in Part 1.]
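This is the same counting pattern as word count. A sketch, assuming each log record is already just a URL string (a real log line would need parsing); because addition is associative, the reducer doubles as a combiner.

```python
def url_map(_, url):
    # Map: (null, log record) -> (URL, 1)
    return [(url, 1)]

def url_count_reduce(url, counts):
    # Reduce (and Combine): (URL, list(counts)) -> (URL, total_count)
    return [(url, sum(counts))]

logs = ["u1", "u2", "u3", "u3", "u1"]
totals = run_mapreduce(enumerate(logs), url_map,
                       url_count_reduce, combine_fn=url_count_reduce)
# totals: [("u1", 2), ("u2", 1), ("u3", 2)]
```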

SLIDE 7

IR: Reverse Web-Link Graph

• Map: (null, page) → (target, source)
• Reduce: (target, list(source)) → (target, list(source))

[Figure: Map1-Map3 emit <t1,s2>, <t2,s3>, <t2,s5>, <t3,s5>; the reducer groups sources per target into <t1,[s2]>, <t2,[s3,s5]>, <t3,[s5]>. This is the same as a matrix transpose.]
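A sketch of the transpose job, assuming each input record has already been parsed into a (source, [targets]) pair from a crawled page.

```python
def reverse_map(source, targets):
    # Map: (null, page) -> (target, source) for every out-link
    return [(target, source) for target in targets]

def reverse_reduce(target, sources):
    # Reduce: (target, list(source)) -> (target, list(source))
    return [(target, sorted(sources))]

pages = [("s2", ["t1"]), ("s3", ["t2"]), ("s5", ["t2", "t3"])]
inlinks = run_mapreduce(pages, reverse_map, reverse_reduce)
# inlinks: [("t1", ["s2"]), ("t2", ["s3", "s5"]), ("t3", ["s5"])]
```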

SLIDE 8

IR: Inverted Index

• Map: (id, doc) → list(word, id)
• Reduce: (word, list(id)) → (word, list(id))

[Figure: Map1-Map3 emit <w1,1>, <w2,2>, <w3,3>, <w1,5>; the reducer produces posting lists <w1,[1,5]>, <w2,[2]>, <w3,[3]>.]
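A sketch of the inverted-index job. Emitting each word once per document via set() matches the unique_w deduplication that the combiner performs in the SLIDE 3 data-flow figure.

```python
def index_map(doc_id, doc):
    # Map: (id, doc) -> list((word, id)); one emit per unique word
    return [(word, doc_id) for word in set(doc.lower().split())]

def index_reduce(word, doc_ids):
    # Reduce: (word, list(id)) -> (word, sorted posting list)
    return [(word, sorted(set(doc_ids)))]

docs = {1: "big data", 2: "mining", 5: "big graphs"}
postings = run_mapreduce(docs.items(), index_map, index_reduce)
# e.g. ("big", [1, 5]) appears in the output
```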

SLIDE 9


Graph mining using MapReduce

SLIDE 10

PageRank

• The PageRank vector q is defined as
  $q = c A^T q + \frac{1-c}{N} e$
  where
  • A is the source-by-destination adjacency matrix,
  • e is the all-ones vector,
  • N is the number of nodes,
  • c is a weight between 0 and 1 (e.g. 0.85).
• PageRank indicates the importance of a page.
• Algorithm: iterative powering to find the first eigenvector.

[Figure: an example 4-node graph and its adjacency matrix A; the $c A^T q$ term models browsing along links, the $\frac{1-c}{N} e$ term models teleporting.]

SLIDE 11

MapReduce: PageRank


PageRank Map()
  • Input: key = page x, value = (PageRank qx, links [y1…ym])
  • Output: key = page x, value = partialx
  1. Emit(x, 0) // guarantees all pages will be emitted
  2. For each outgoing link yi: Emit(yi, qx/m)

PageRank Reduce()
  • Input: key = page x, value = the list of [partialx]
  • Output: key = page x, value = PageRank qx
  1. qx = 0
  2. For each partial value d in the list: qx += d
  3. qx = c·qx + (1-c)/N
  4. Emit(x, qx)

[Figure: the 4-node example; Map distributes each PageRank qi along out-links, Reduce combines the partials into the new q1…q4. See Kang et al., ICDM'09.]
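One PageRank iteration written against the run_mapreduce helper, following the slide's pseudocode. The four-node graph below is made up for illustration, and we assume every page has at least one out-link; a production job would also re-emit each page's link list so the next iteration can read it, which we omit here.

```python
def pagerank_map(page, state):
    # state = (current rank q_x, out-links [y1..ym])
    q, links = state
    out = [(page, 0.0)]                       # ensures page x reaches the reducer
    out += [(y, q / len(links)) for y in links]  # distribute q_x over m out-links
    return out

def pagerank_reduce(page, partials, c=0.85, N=4):
    # q_x = c * sum(partials) + (1 - c) / N
    return [(page, c * sum(partials) + (1.0 - c) / N)]

# hypothetical graph: page -> (initial rank, out-links)
graph = {1: (0.25, [2]), 2: (0.25, [3, 4]), 3: (0.25, [1]), 4: (0.25, [1, 3])}
ranks = run_mapreduce(graph.items(), pagerank_map, pagerank_reduce)
```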

SLIDE 12


Clustering using MapReduce

SLIDE 13

Canopy: single-pass clustering

Canopy creation
• Construct overlapping clusters (canopies)
• Make sure no two canopies overlap too much
• Key: no two canopy centers are too close to each other

McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD'00

[Figure: overlapping canopies C1-C4; two thresholds T1 > T2 control how much overlap is allowed.]

SLIDE 14

Canopy creation

McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD'00

Input: 1) points, 2) thresholds T1, T2 where T1 > T2
Output: cluster centroids
• Put all points into a queue Q
• While Q is not empty:
  – p = dequeue(Q); strongBound = false
  – For each canopy c:
    • if dist(p,c) < T1: c.add(p)
    • if dist(p,c) < T2: strongBound = true
  – If not strongBound: create a new canopy at p
• For each canopy c:
  – Set its centroid to the mean of all points in c

SLIDES 15-18

Canopy creation (animation)

(These four slides animate the canopy-creation algorithm of SLIDE 14 on example data.) [Figures: canopies C1 and C2 drawn with radii T1 and T2; points within T2 of a center are strongly marked and never seed new canopies; the final frames show each canopy center together with the other points of its cluster.]

McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD'00

SLIDE 19

MapReduce - Canopy Map()

McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD'00

Canopy creation Map()
  • Input: a set of points P; thresholds T1, T2
  • Output: key = null; value = a list of local canopies (total, count)
  • For each p in P:
    • strongBound = false
    • For each canopy c:
      • if dist(p,c) < T1 then c.total += p; c.count++
      • if dist(p,c) < T2 then strongBound = true
    • If not strongBound then create a new canopy at p

Close()
  • For each canopy c: Emit(null, (c.total, c.count))

[Figure: two mappers, Map1 and Map2, each building canopies over their local points.]

SLIDE 20

MapReduce - Canopy Reduce()

McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD'00

Reduce() (for simplicity we assume only one reducer)
  • Input: key = null; values = the mappers' (total, count) pairs
  • Output: key = null; value = cluster centroids
  • For each intermediate value (total, count):
    • p = total/count; strongBound = false
    • For each canopy c:
      • if dist(p,c) < T1 then c.total += p; c.count++
      • if dist(p,c) < T2 then strongBound = true
    • If not strongBound then create a new canopy at p

Close()
  • For each canopy c: Emit(null, c.total/c.count)

[Figure: the (total, count) pairs from Map1 and Map2 arriving at the single reducer.]
SLIDE 21

MapReduce - Canopy Reduce() (cont.)

McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD'00

(Same Reduce() as SLIDE 20.) [Figure: the reducer's results, i.e. the merged canopy centroids.]

Remark: this assumes only one reducer.
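A compact sketch of the whole canopy pipeline in Python, assuming 2-D points (our simplification): each mapper builds local canopies in one pass and emits (total, count) sums, and the single reducer reruns the same single-pass logic on the mapper centroids.

```python
import math

def local_canopies(points, t1, t2):
    """Single-pass canopy creation (the Map() body above); requires t1 > t2."""
    canopies = []   # each canopy: [center, sum_x, sum_y, count]
    for (x, y) in points:
        strong = False
        for c in canopies:
            d = math.dist((x, y), c[0])
            if d < t1:                       # loosely bound: add to canopy
                c[1] += x; c[2] += y; c[3] += 1
            if d < t2:                       # strongly bound: never a new center
                strong = True
        if not strong:
            canopies.append([(x, y), x, y, 1])
    # Close(): emit ((sum_x, sum_y), count) per canopy
    return [((c[1], c[2]), c[3]) for c in canopies]

def canopy_reduce(pairs, t1, t2):
    """Single Reduce(): merge mapper outputs into final centroids."""
    means = [(sx / n, sy / n) for (sx, sy), n in pairs]
    return [(sx / n, sy / n) for (sx, sy), n in local_canopies(means, t1, t2)]
```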

SLIDE 22

Clustering Assignment

McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD'00

Clustering assignment
• For each point p: assign p to the closest canopy center

SLIDE 23

MapReduce: Cluster Assignment

McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD'00

Cluster assignment Map()
  • Input: point p; cluster centroids
  • Output: key = cluster id; value = point id
  • currentDist = inf
  • For each cluster centroid c:
    • If dist(p,c) < currentDist then bestCluster = c; currentDist = dist(p,c)
  • Emit(bestCluster, p)

Results can be written directly back to HDFS without a reducer, or an identity reducer can be applied so the output is sorted on cluster id.

[Figure: each point assigned to its nearest canopy center.]

SLIDE 24

KMeans: Multi-pass clustering

Kmeans()
  • While not converged:
    • AssignCluster()
    • UpdateCentroids()

AssignCluster()
  • For each point p: assign p to the closest centroid c

UpdateCentroids()
  • For each cluster: update the cluster center to the mean of its points

SLIDE 25

MapReduce – KMeans

KmeansIter()

Map(p) // assign cluster
  • For each centroid c in clusters:
    • If dist(p,c) < minDist then minC = c; minDist = dist(p,c)
  • Emit(minC.id, (p, 1))

Reduce() // update centroids
  • For all values (p, c):
    • total += p; count += c
  • Emit(key, (total, count))

[Figure: Map1 and Map2 with the initial centroids.]

SLIDE 26

MapReduce – KMeans (cont.)

(Same KmeansIter() as SLIDE 25.) [Figure: one iteration over four points; Map assigns each point to its closest centroid, and Reduce1/Reduce2 update each centroid to its new location from the (total, count) sums.]
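One KMeans iteration with the run_mapreduce helper, again assuming 2-D points and made-up toy data. One deliberate difference from the slide: the reducer here finishes the division total/count and directly emits the new centroid; keeping raw (total, count) pairs as on the slide is what lets the same function also act as a combiner.

```python
import math

def make_kmeans_map(centroids):
    def kmeans_map(_, p):
        # Assign p to its nearest centroid; emit (centroid_id, (p, 1))
        cid = min(range(len(centroids)),
                  key=lambda i: math.dist(p, centroids[i]))
        return [(cid, (p, 1))]
    return kmeans_map

def kmeans_reduce(cid, values):
    # Sum the partial (point, count) pairs; new centroid = total / count
    sx = sy = n = 0
    for (x, y), c in values:
        sx += x; sy += y; n += c
    return [(cid, (sx / n, sy / n))]

points = [(0, 0), (0, 1), (9, 9), (8, 9)]   # toy data (our assumption)
centroids = [(0, 0), (9, 9)]
for _ in range(5):                          # fixed iterations stand in for convergence
    new = run_mapreduce(enumerate(points), make_kmeans_map(centroids),
                        kmeans_reduce)
    centroids = [c for _, c in sorted(new)]
```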
SLIDE 27


Classification using MapReduce

SLIDE 28

MapReduce kNN

Map()
  • Input: all points; query point p
  • Output: k nearest neighbors (local)
  • Emit the k closest points to p

Reduce()
  • Input: key = null; values = the mappers' local neighbors; query point p
  • Output: k nearest neighbors (global)
  • Emit the k closest points to p among all local neighbors

[Figure: k = 3; Map1 and Map2 each emit their 3 nearest points to the query point, and the single reducer keeps the global 3 nearest.]
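A sketch of the kNN job: each mapper emits its k local nearest candidates, and the single reducer keeps the global k. heapq.nsmallest does the selection; the splits and query point are illustrative.

```python
import heapq, math

def knn_map(points_split, query, k):
    # Map(): emit the k nearest local points to the query
    return heapq.nsmallest(k, points_split, key=lambda p: math.dist(p, query))

def knn_reduce(local_neighbors, query, k):
    # Reduce(): global k nearest among all mappers' candidates
    return heapq.nsmallest(k, local_neighbors, key=lambda p: math.dist(p, query))

splits = [[(0, 0), (2, 2), (5, 5)], [(1, 1), (9, 9)]]
query, k = (0, 1), 3
candidates = [p for s in splits for p in knn_map(s, query, k)]
print(knn_reduce(candidates, query, k))   # [(0, 0), (1, 1), (2, 2)]
```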

SLIDE 29

Naïve Bayes

• Formulation: $P(c|d) \propto P(c) \prod_{w \in d} P(w|c)$
  (c is a class label, d is a doc, w is a word)
• Parameter estimation
  • Class prior: $\hat{P}(c) = N_c / N$, where $N_c$ is the number of docs in c and N is the total number of docs
  • Conditional probability: $\hat{P}(w|c) = T_{cw} / \sum_{w'} T_{cw'}$, where $T_{cw}$ is the number of occurrences of w in class c

Goals:
  1. total number of docs: N
  2. number of docs in c: N_c
  3. word-count histogram in c: T_cw
  4. total word count in c: Σ_w' T_cw'

[Figure: seven example docs d1…d7 with their term vectors; class c1 holds N1 = 3 docs and class c2 holds N2 = 4; T1w and T2w are the per-class word counts.]

SLIDE 30

MapReduce: Naïve Bayes

• Naïve Bayes can be implemented using MapReduce jobs that compute histograms.

ClassPrior()
Map(doc): Emit(class_id, (doc_id, doc_length))
Combine()/Reduce()
  • Nc = 0; sTcw = 0
  • For each doc_id: Nc++; sTcw += doc_length
  • Emit(c, (Nc, sTcw))

ConditionalProbability()
Map(doc):
  • For each word w in doc: Emit(pair(c, w), 1)
Combine()/Reduce()
  • Tcw = 0
  • For each value v: Tcw += v
  • Emit(pair(c, w), Tcw)

(Goals 1-4 as listed on SLIDE 29.)
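A sketch of the ConditionalProbability() job with the run_mapreduce helper; the two training pairs are made up. The combiner and reducer are the same summation, and the resulting counts T_cw are then normalized into the conditional probabilities.

```python
from collections import defaultdict

def nb_map(label, doc):
    # Map: emit ((class, word), 1) per word occurrence
    return [((label, w), 1) for w in doc.lower().split()]

def sum_reduce(key, ones):
    # Combine()/Reduce(): T_cw = sum of the 1s
    return [(key, sum(ones))]

train = [("c1", "big data mining"), ("c2", "graph mining")]
tcw = dict(run_mapreduce(train, nb_map, sum_reduce, combine_fn=sum_reduce))

# Normalize into P(w|c) = T_cw / sum_w' T_cw'
totals = defaultdict(int)
for (c, w), n in tcw.items():
    totals[c] += n
p_wc = {(c, w): n / totals[c] for (c, w), n in tcw.items()}
```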
SLIDE 31


MapReduce Mining Summary

SLIDE 32

Taxonomy of MapReduce algorithms

                       One iteration       Multiple iterations          Not good for MapReduce
Clustering             Canopy              KMeans
Classification         Naïve Bayes, kNN    Gaussian Mixture             SVM
Graphs                                     PageRank
Information retrieval  Inverted index      Topic modeling (PLSI, LDA)

• One-iteration algorithms are perfect fits.
• Multiple-iteration algorithms are OK fits, but a small amount of shared state has to be synchronized across iterations (typically through the filesystem).
• Some algorithms are not a good fit for the MapReduce framework: they typically require a large amount of shared state with frequent synchronization. Traditional parallel frameworks like MPI are better suited for those.

SLIDE 33

MapReduce for machine learning algorithms

• The key is to convert the algorithm into summation form (Statistical Query model [Kearns'94]):
  y = Σ f(x), where f(x) corresponds to map() and Σ corresponds to reduce()
• Naïve Bayes
  • one MR job for P(c)
  • one MR job for P(w|c)
• Kmeans
  • MR job: split the data into subgroups and compute partial sums in Map(), then sum them up in Reduce()


Map-Reduce for Machine Learning on Multicore [NIPS’06]
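The essence of the summation form, as a sketch (names are ours): any statistic of the form y = Σᵢ f(xᵢ) parallelizes trivially, because map() evaluates f on each record independently and reduce() adds the pieces.

```python
def summation_job(splits, f):
    # map(): each split computes its partial sum of f(x) independently
    partials = [sum(f(x) for x in split) for split in splits]
    # reduce(): one global addition
    return sum(partials)

# e.g. a mean: y = (1/n) * sum(x)
splits = [[1.0, 2.0], [3.0, 4.0, 5.0]]
n = sum(len(s) for s in splits)
mean = summation_job(splits, lambda x: x) / n   # 3.0
```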

SLIDE 34

Machine learning algorithms using MapReduce

• Linear Regression: fit $\hat{\theta} = (X^T X)^{-1} X^T y$, where $X \in \mathbb{R}^{m \times n}$ and $y \in \mathbb{R}^m$
  • MR job 1: $A = X^T X = \sum_i x_i x_i^T$
  • MR job 2: $b = X^T y = \sum_i x_i y^{(i)}$
  • Finally, solve $A \hat{\theta} = b$
• Locally Weighted Linear Regression:
  • MR job 1: $A = \sum_i w_i x_i x_i^T$
  • MR job 2: $b = \sum_i w_i x_i y^{(i)}$

Map-Reduce for Machine Learning on Multicore [NIPS'06]
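A sketch of the two MR jobs for ordinary linear regression using NumPy; the random data is illustrative, each tuple in splits stands in for one mapper's data split, and in practice both partial sums are computed in a single pass.

```python
import numpy as np

def linreg_map(X_split, y_split):
    # One mapper's partial sums: sum_i x_i x_i^T and sum_i x_i y^(i)
    return X_split.T @ X_split, X_split.T @ y_split

def linreg_reduce(partials):
    # Add the partial sums, then solve A theta = b
    A = sum(p[0] for p in partials)
    b = sum(p[1] for p in partials)
    return np.linalg.solve(A, b)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)
splits = [(X[:50], y[:50]), (X[50:], y[50:])]
theta = linreg_reduce([linreg_map(Xs, ys) for Xs, ys in splits])
# theta matches the closed form (X^T X)^-1 X^T y
```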

SLIDE 35

Machine learning algorithms using MapReduce (cont.)

• Logistic Regression
• Neural Networks
• PCA
• ICA
• EM for Gaussian Mixture Model

Map-Reduce for Machine Learning on Multicore [NIPS’06]

SLIDE 36


MapReduce Mining Resources

SLIDE 37

Mahout: Hadoop data mining library

Mahout: http://lucene.apache.org/mahout/
• Scalable data mining libraries, mostly implemented on Hadoop

Data structures for vectors and matrices
• Vectors
  • Dense vectors as double[]
  • Sparse vectors as HashMap<Integer, Double>
  • Operations: assign, cardinality, copy, divide, dot, get, haveSharedCells, like, minus, normalize, plus, set, size, times, toArray, viewPart, zSum and cross
• Matrices
  • Dense matrix as double[][]
  • SparseRowMatrix or SparseColumnMatrix as Vector[], holding the rows or columns of the matrix as SparseVectors
  • SparseMatrix as HashMap<Integer, Vector>
  • Operations: assign, assignColumn, assignRow, cardinality, copy, divide, get, haveSharedCells, like, minus, plus, set, size, times, transpose, toArray, viewPart and zSum

SLIDE 38

MapReduce Mining Papers

[Chu et al. NIPS'06] Map-Reduce for Machine Learning on Multicore
  • General framework under MapReduce

[Papadimitriou et al. ICDM'08] DisCo: Distributed Co-clustering with Map-Reduce
  • Co-clustering

[Kang et al. ICDM'09] PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations
  • Graph algorithms

[Das et al. WWW'07] Google news personalization: scalable online collaborative filtering
  • PLSI EM

[Grossman and Gu KDD'08] Data Mining Using High Performance Data Clouds: Experimental Studies Using Sector and Sphere
  • An alternative to Hadoop that supports wide-area data collection and distribution

SLIDE 39

Summary: algorithms

• Best for MapReduce: single pass, with keys uniformly distributed
• OK for MapReduce: multiple passes, with small intermediate state
• Bad for MapReduce: skewed key distribution, or fine-grained synchronization required

SLIDE 40

Large-scale Data Mining: MapReduce and Beyond

Part 2: Algorithms

Spiros Papadimitriou, IBM Research Jimeng Sun, IBM Research Rong Yan, Facebook