Large-scale Data Mining: MapReduce and Beyond
Part 2: Algorithms
Spiros Papadimitriou, IBM Research
Jimeng Sun, IBM Research
Rong Yan, Facebook
Part 2: Mining using MapReduce
Mining algorithms using MapReduce
Information retrieval
Graph algorithms: PageRank
Clustering: Canopy clustering, KMeans
Classification: kNN, Naïve Bayes
MapReduce Mining Summary
MapReduce Interface and Data Flow
Map: (K1, V1) → list(K2, V2)
Combine: (K2, list(V2)) → list(K2, V2)
Partition: (K2, V2) → reducer_id
Reduce: (K2, list(V2)) → list(K3, V3)
[Figure: MapReduce data flow. Each host (Hosts 1-3) runs Map followed by a local Combine; the partitioned output is shuffled to the Reduce hosts (e.g., Host 4). Example annotations for an inverted index: (id, doc) → list(w, id) → list(unique_w, id) on the map side; (w, list(id)) → (w, list(unique_id)) on the reduce side.]
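To make the interface and data flow concrete, here is a minimal single-machine sketch in Python (the harness `run_mapreduce` and the word-count example are ours for illustration, not a Hadoop API):

```python
from collections import defaultdict

def map_fn(key, value):            # Map: (K1, V1) -> list(K2, V2)
    return [(w, 1) for w in value.split()]

def combine_fn(key, values):       # Combine: (K2, list(V2)) -> list(K2, V2)
    return [(key, sum(values))]    # local pre-aggregation

def partition_fn(key, n):          # Partition: (K2, V2) -> reducer_id
    return hash(key) % n

def reduce_fn(key, values):        # Reduce: (K2, list(V2)) -> list(K3, V3)
    return [(key, sum(values))]

def run_mapreduce(records, n_reducers=2):
    # Map + Combine per "host", then shuffle by partition id
    shuffled = [defaultdict(list) for _ in range(n_reducers)]
    for k1, v1 in records:
        local = defaultdict(list)
        for k2, v2 in map_fn(k1, v1):
            local[k2].append(v2)
        for k2, vs in local.items():
            for k2c, v2c in combine_fn(k2, vs):
                shuffled[partition_fn(k2c, n_reducers)][k2c].append(v2c)
    # Reduce each partition independently
    out = []
    for part in shuffled:
        for k2, vs in sorted(part.items()):
            out.extend(reduce_fn(k2, vs))
    return out

print(run_mapreduce([(1, "data mining"), (2, "mining with mapreduce")]))
# e.g. [('data', 1), ('mining', 2), ('mapreduce', 1), ('with', 1)]
```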
Information retrieval using MapReduce
IR: Distributed Grep
Find the doc_id and line# of a matching pattern
Map: (id, doc) → list(id, line#)
Reduce: none
[Figure: grep "data mining" over docs 1-6 split across Map1-Map3; output pairs <1, 123>, <3, 717>, <5, 1231>, <6, 1012> give the doc_id and line# of each match.]
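A runnable sketch of the map-only grep job, simulated locally (the example docs are made up; a real job would read splits from the distributed filesystem):

```python
import re

def grep_map(doc_id, doc, pattern):
    # Map: (id, doc) -> list(id, line#) for every line matching the pattern
    return [(doc_id, line_no)
            for line_no, line in enumerate(doc.splitlines(), start=1)
            if re.search(pattern, line)]

docs = {1: "intro\nlarge-scale data mining", 3: "data mining with mapreduce"}
for doc_id, doc in docs.items():
    for match in grep_map(doc_id, doc, r"data mining"):
        print(match)   # (doc_id, line#) pairs, written straight to output
```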
IR: URL Access Frequency
Map: (null, log) → (URL, 1)
Reduce: (URL, list(1)) → (URL, total_count)
[Figure: Map1-Map3 emit <u1,1>, <u2,1>, <u3,1>, <u3,1>, <u1,1>; Reduce outputs <u1,2>, <u2,1>, <u3,2>.]
Also described in Part 1.
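A local simulation of the URL-access-frequency job; since the reducer just sums counts, the same function could also serve as the combiner:

```python
from collections import Counter

def url_map(log_line):
    # Map: (null, log) -> (URL, 1); assume the URL is the first log field
    return (log_line.split()[0], 1)

def url_reduce(url, counts):
    # Reduce (and Combine): (URL, list(counts)) -> (URL, total_count)
    return (url, sum(counts))

logs = ["u1 ...", "u3 ...", "u3 ...", "u2 ...", "u1 ..."]
grouped = Counter()
for line in logs:
    url, one = url_map(line)
    grouped[url] += one          # shuffle + sum collapsed into a Counter
print([url_reduce(u, [c]) for u, c in sorted(grouped.items())])
# [('u1', 2), ('u2', 1), ('u3', 2)]
```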
IR: Reverse Web-Link Graph
Map: (null, page) → list(target, source)
Reduce: (target, list(source)) → (target, list(source))
[Figure: Map1-Map3 emit <t1,s2>, <t2,s3>, <t2,s5>, <t3,s5>; Reduce outputs <t1,[s2]>, <t2,[s3,s5]>, <t3,[s5]>.]
This is the same operation as a matrix transpose.
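A local sketch of the reverse web-link graph job; grouping the map output by target is exactly the transpose of the (source, target) edge list:

```python
from collections import defaultdict

def reverse_map(source, targets):
    # Map: (null, page) -> list(target, source)
    return [(t, source) for t in targets]

pages = {"s2": ["t1"], "s3": ["t2"], "s5": ["t2", "t3"]}
grouped = defaultdict(list)
for src, tgts in pages.items():
    for target, source in reverse_map(src, tgts):
        grouped[target].append(source)       # shuffle by target

# Reduce: (target, list(source)) -> (target, list(source)) is the identity
print(sorted(grouped.items()))
# [('t1', ['s2']), ('t2', ['s3', 's5']), ('t3', ['s5'])]
```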
IR: Inverted Index
Map: (id, doc) → list(word, id)
Reduce: (word, list(id)) → (word, list(id))
[Figure: Map1-Map3 emit <w1,1>, <w2,2>, <w3,3>, <w1,5>; Reduce outputs <w1,[1,5]>, <w2,[2]>, <w3,[3]>.]
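A local sketch of the inverted-index job, with map-side de-duplication of (word, id) pairs playing the role of the combiner:

```python
from collections import defaultdict

def index_map(doc_id, doc):
    # Map: (id, doc) -> list(word, id), one pair per unique word
    return [(w, doc_id) for w in set(doc.split())]

docs = {1: "w1", 2: "w2", 3: "w3", 5: "w1"}
postings = defaultdict(list)
for doc_id, doc in docs.items():
    for word, did in index_map(doc_id, doc):
        postings[word].append(did)           # shuffle by word

# Reduce: (word, list(id)) -> (word, sorted posting list)
print({w: sorted(ids) for w, ids in postings.items()})
# {'w1': [1, 5], 'w2': [2], 'w3': [3]}
```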
Graph mining using MapReduce
PageRank
PageRank vector q is defined as

q = c A^T q + ((1 - c) / N) e

where
- A is the source-by-destination adjacency matrix,
- e is the all-ones vector,
- N is the number of nodes,
- c is a weight between 0 and 1 (e.g., 0.85).

The first term models browsing (following out-links); the second models teleporting (random jumps).

PageRank indicates the importance of a page.
Algorithm: iterative powering for finding the first eigenvector.

[Figure: example 4-node graph with its adjacency matrix A.]
MapReduce: PageRank
PageRank Map()
- Input: key = page x, value = (PageRank qx, out-links [y1…ym])
- Output: key = page, value = partial PageRank contribution
1. Emit(x, 0)  // guarantee all pages will be emitted
2. For each outgoing link yi: Emit(yi, qx / m)

PageRank Reduce()
- Input: key = page x, value = the list of partial contributions [partialx]
- Output: key = page x, value = PageRank qx
1. qx = 0
2. For each partial value d in the list: qx += d
3. qx = c·qx + (1 - c) / N
4. Emit(x, qx)
[Figure: 4-node example. Map distributes each PageRank qi along out-links; Reduce updates the new PageRank of each page. See also Kang et al., ICDM'09 (PEGASUS).]
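A minimal local sketch of iterated PageRank in map/reduce form, assuming uniform weights on out-links (the 4-node graph and the iteration count are illustrative):

```python
from collections import defaultdict

c, graph = 0.85, {1: [2, 3], 2: [3], 3: [1, 4], 4: [1]}
N = len(graph)
q = {x: 1.0 / N for x in graph}              # initial PageRank

def pagerank_map(x, qx, links):
    out = [(x, 0.0)]                         # guarantee page x is emitted
    out += [(y, qx / len(links)) for y in links]
    return out

def pagerank_reduce(x, partials):
    return x, c * sum(partials) + (1 - c) / N

for _ in range(30):                          # iterate until (roughly) converged
    partials = defaultdict(list)
    for x, links in graph.items():
        for y, d in pagerank_map(x, q[x], links):
            partials[y].append(d)            # shuffle by destination page
    q = dict(pagerank_reduce(x, ds) for x, ds in partials.items())

print({x: round(v, 3) for x, v in sorted(q.items())})
```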
Clustering using MapReduce
Canopy: single-pass clustering
Canopy creation
Construct overlapping clusters (canopies), making sure that no two canopies overlap too much.
Key idea: no two canopy centers are too close to each other.
McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD'00
[Figure: overlapping canopies C1-C4; two thresholds T1 > T2 control how much overlap is allowed.]
Canopy creation
Input: 1) points; 2) thresholds T1, T2 where T1 > T2
Output: cluster centroids

Put all points into a queue Q
While Q is not empty:
- p = dequeue(Q)
- For each canopy c:
    if dist(p, c) < T1: c.add(p)
    if dist(p, c) < T2: strongBound = true
- If not strongBound: create a canopy at p
For each canopy c:
- Set the centroid to the mean of all points in c
[Figure: step-by-step canopy creation, showing canopy centers, strongly marked points (within T2), and the other points in each cluster.]
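The pass above is simple enough to run sequentially; here is a runnable 1-D sketch (the absolute-difference distance and the threshold values are illustrative choices):

```python
class Canopy:
    def __init__(self, p):
        # a new canopy centered at p, containing p itself
        self.center, self.total, self.count = p, p, 1

def canopy_create(points, T1, T2):
    # T1 > T2: T1 controls canopy membership, T2 marks strongly bound points
    canopies = []
    for p in points:
        strong = False
        for c in canopies:
            d = abs(p - c.center)                  # illustrative 1-D distance
            if d < T1:                             # c.add(p)
                c.total += p
                c.count += 1
            if d < T2:
                strong = True
        if not strong:
            canopies.append(Canopy(p))             # create canopy at p
    return [c.total / c.count for c in canopies]   # centroids

print(canopy_create([1.0, 1.2, 5.0, 5.3, 9.7], T1=2.0, T2=1.0))
# [1.1, 5.15, 9.7]
```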
MapReduce - Canopy Map()
Canopy creation Map()
- Input: a set of points P; thresholds T1, T2
- Output: key = null; value = a list of local canopies (total, count)

For each p in P:
- For each canopy c:
    if dist(p, c) < T1: c.total += p; c.count++
    if dist(p, c) < T2: strongBound = true
- If not strongBound: create a canopy at p

Close()
- For each canopy c: Emit(null, (c.total, c.count))
MapReduce - Canopy Reduce()
For simplicity, we assume only one reducer.
Reduce()
- Input: key = null; values = local canopies (total, count)
- Output: key = null; value = cluster centroids

For each intermediate value (total, count):
- p = total / count
- For each canopy c:
    if dist(p, c) < T1: c.total += p; c.count++
    if dist(p, c) < T2: strongBound = true
- If not strongBound: create a canopy at p

Close()
- For each canopy c: Emit(null, c.total / c.count)
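A local simulation of the MapReduce version under the same 1-D assumptions: each mapper builds canopies over its split and emits (total, count) pairs, and the single reducer treats each local mean as a point and re-runs the same canopy logic:

```python
def local_canopies(points, T1, T2):
    # shared canopy logic; each canopy keeps (center, total, count)
    canopies = []
    for p in points:
        strong = False
        for c in canopies:
            d = abs(p - c["center"])
            if d < T1:
                c["total"] += p
                c["count"] += 1
            if d < T2:
                strong = True
        if not strong:
            canopies.append({"center": p, "total": p, "count": 1})
    return canopies

T1, T2 = 2.0, 1.0
splits = [[1.0, 1.2, 5.0], [5.3, 9.7, 1.1]]        # one list per mapper

# Map + Close(): each mapper emits its local (total, count) pairs
map_out = [(c["total"], c["count"])
           for s in splits for c in local_canopies(s, T1, T2)]

# Single reducer: turn each local canopy into its mean, re-run the same logic
means = [total / count for total, count in map_out]
print([c["total"] / c["count"] for c in local_canopies(means, T1, T2)])
# [1.1, 5.15, 9.7]
```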
Cluster Assignment
For each point p, assign p to the closest canopy center.
MapReduce: Cluster Assignment
Cluster assignment Map()
- Input: point p; cluster centroids
- Output: key = cluster id; value = point id

currentDist = inf
For each cluster centroid c:
- If dist(p, c) < currentDist: bestCluster = c; currentDist = dist(p, c)
Emit(bestCluster, p)
Results can be written directly back to HDFS without a reducer, or an identity reducer can be applied to get the output sorted by cluster id.
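A sketch of the assignment step as a map-only job (again 1-D points with absolute distance; the output pairs could be written straight to the filesystem):

```python
import math

def assign_map(p, centroids):
    # Map: point -> (best cluster id, point); centroids is {id: center}
    best, best_dist = None, math.inf
    for cid, c in centroids.items():
        d = abs(p - c)
        if d < best_dist:
            best, best_dist = cid, d
    return (best, p)                         # Emit(bestCluster, p)

centroids = {"c1": 1.1, "c2": 5.2, "c3": 9.7}
points = [0.9, 1.3, 4.8, 5.5, 9.0]
print([assign_map(p, centroids) for p in points])
# [('c1', 0.9), ('c1', 1.3), ('c2', 4.8), ('c2', 5.5), ('c3', 9.0)]
```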
KMeans: Multi-pass clustering
KMeans()
- While not converged:
  - AssignCluster()
  - UpdateCentroids()

Traditional AssignCluster():
- For each point p: assign p to the closest centroid c

Traditional UpdateCentroids():
- For each cluster: update the cluster center
MapReduce – KMeans
KMeansIter()
Map(p)  // assign cluster
- For each c in clusters:
  - If dist(p, c) < minDist: minC = c; minDist = dist(p, c)
- Emit(minC.id, (p, 1))

Reduce()  // update centroids
- For all values (p, c):
  - total += p; count += c
- Emit(key, (total, count))
[Figure: one KMeans iteration on a small example. Map assigns each point to the closest of the initial centroids; Reduce updates each centroid to its new location from (total, count).]
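One KMeans iteration as a map/reduce pair, simulated locally on 1-D points; an outer driver loops it until the centroids stop moving:

```python
from collections import defaultdict

def kmeans_iter(points, centroids):
    # Map: assign each point to the closest centroid, emit (id, (p, 1))
    groups = defaultdict(lambda: [0.0, 0])
    for p in points:
        min_id = min(centroids, key=lambda i: abs(p - centroids[i]))
        groups[min_id][0] += p               # total += p
        groups[min_id][1] += 1               # count += 1
    # Reduce: new centroid = total / count
    return {i: total / count for i, (total, count) in groups.items()}

points = [1.0, 1.2, 5.0, 5.3, 9.7]
centroids = {0: 0.0, 1: 6.0, 2: 10.0}
for _ in range(10):                          # driver loop across iterations
    centroids = kmeans_iter(points, centroids)
print(centroids)                             # {0: 1.1, 1: 5.15, 2: 9.7}
```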
Classification using MapReduce
MapReduce: kNN
Map()
- Input: all points; query point p
- Output: k nearest neighbors (local)
- Emit the k closest points to p

Reduce()
- Input: key = null; values = local neighbors; query point p
- Output: k nearest neighbors (global)
- Emit the k closest points to p among all local neighbors

[Figure: Map1 and Map2 each find the k = 3 local nearest neighbors of the query point; Reduce selects the global k nearest.]
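A local sketch of the two-stage selection: each mapper emits its k closest local points, and a single reducer picks the global k among the candidates:

```python
import heapq

def knn_map(points, query, k):
    # Map: emit the k closest local points to the query
    return heapq.nsmallest(k, points, key=lambda p: abs(p - query))

def knn_reduce(local_neighbors, query, k):
    # Reduce: global k nearest among all local candidates
    return heapq.nsmallest(k, local_neighbors, key=lambda p: abs(p - query))

query, k = 5.0, 3
splits = [[1.0, 4.8, 9.7], [5.3, 2.0, 5.1, 8.0]]     # one list per mapper
candidates = [p for s in splits for p in knn_map(s, query, k)]
print(knn_reduce(candidates, query, k))               # [5.1, 4.8, 5.3]
```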
Naïve Bayes
Formulation: parameter estimation (c is a class label, d is a doc, w is a word)

P(c|d) ∝ P(c) · Π_{w∈d} P(w|c)

Class prior estimate: P(c) = Nc / N, where Nc is the number of docs in class c and N is the total number of docs.
Conditional probability estimate: P(w|c) = Tcw / Σ_{w'} Tcw', where Tcw is the number of occurrences of word w in class c.

[Figure: docs d1-d7 as term vectors; class c1 has N1 = 3 docs, class c2 has N2 = 4 docs.]

Goals:
1. total number of docs: N
2. number of docs in class c: Nc
3. word count histogram in class c: Tcw
4. total word count in class c: Σ_{w'} Tcw'
MapReduce: Naïve Bayes
Naïve Bayes can be implemented using MapReduce jobs of histogram computation.

ClassPrior()
Map(doc): Emit(class_id, (doc_id, doc_length))
Combine()/Reduce():
- Nc = 0; sTcw = 0
- For each doc_id: Nc++; sTcw += doc_length
- Emit(c, (Nc, sTcw))

ConditionalProbability()
Map(doc):
- For each word w in doc: Emit(pair(c, w), 1)
Combine()/Reduce():
- Tcw = 0
- For each value v: Tcw += v
- Emit(pair(c, w), Tcw)

Together these jobs compute the goals above: N, Nc, the word count histogram Tcw, and the total word count per class.
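A local sketch of the two counting jobs on a toy corpus; both are plain histogram computations, after which the estimates P(c) = Nc/N and P(w|c) = Tcw/Σ_{w'} Tcw' follow directly:

```python
from collections import Counter

docs = [("c1", "w1 w2"), ("c1", "w1"), ("c2", "w2 w3"), ("c2", "w3")]

# ClassPrior(): Map emits (class, doc); Reduce counts docs per class
N_c = Counter(c for c, _ in docs)
N = sum(N_c.values())
prior = {c: n / N for c, n in N_c.items()}            # P(c) = Nc / N

# ConditionalProbability(): Map emits ((class, word), 1); Reduce sums
T_cw = Counter((c, w) for c, doc in docs for w in doc.split())
T_c = Counter()                                       # total word count per class
for (c, _), n in T_cw.items():
    T_c[c] += n
cond = {(c, w): n / T_c[c] for (c, w), n in T_cw.items()}   # P(w|c)

print(prior)   # {'c1': 0.5, 'c2': 0.5}
print(cond)    # e.g. P(w1|c1) = 2/3
```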
MapReduce Mining Summary
Taxonomy of MapReduce algorithms
Category               One Iteration      Multiple Iterations   Not good for MapReduce
Clustering             Canopy             KMeans                -
Classification         Naïve Bayes, kNN   Gaussian Mixture      SVM
Graphs                 -                  PageRank              -
Information Retrieval  Inverted Index     -                     Topic modeling (PLSI, LDA)
One-iteration algorithms are perfect fits.
Multiple-iteration algorithms are OK fits, but small shared info has to be synchronized across iterations (typically through the filesystem).
Some algorithms are not a good fit for the MapReduce framework. They typically require large shared info with a lot of synchronization; traditional parallel frameworks like MPI are better suited for them.
MapReduce for machine learning algorithms
The key is to convert the algorithm into summation form (Statistical Query model [Kearns'94]):
y = Σ_i f(x_i), where computing f(x_i) corresponds to Map() and the summation corresponds to Reduce().
- Naïve Bayes: one MR job for P(c), one MR job for P(w|c)
- KMeans: split data into subgroups and compute partial sums in Map(), then sum up in Reduce()
Map-Reduce for Machine Learning on Multicore [NIPS'06]
Machine learning algorithms using MapReduce
Linear Regression: θ = (X^T X)^{-1} X^T y, where X ∈ R^{m×n} and y ∈ R^m
- MR job 1: A = X^T X = Σ_i x_i x_i^T
- MR job 2: b = X^T y = Σ_i x_i y(i)
- Finally, solve A θ = b for the coefficient vector θ
Locally Weighted Linear Regression:
- MR job 1: A = Σ_i w_i x_i x_i^T
- MR job 2: b = Σ_i w_i x_i y(i)
Map-Reduce for Machine Learning on Multicore [NIPS'06]
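A sketch of linear regression in summation form: each mapper computes partial sums of A = X^T X and b = X^T y over its split, the reducer adds them, and the driver solves the normal equations (NumPy assumed for the small dense solve; the data is synthetic):

```python
import numpy as np

def lr_map(X_split, y_split):
    # Map: partial sums A_j = sum_i x_i x_i^T and b_j = sum_i x_i y(i)
    return X_split.T @ X_split, X_split.T @ y_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=100)

# Two "mappers", each over half the rows
parts = [lr_map(X[:50], y[:50]), lr_map(X[50:], y[50:])]

# Reduce: element-wise sums of the partial A's and b's
A = sum(p[0] for p in parts)
b = sum(p[1] for p in parts)

theta = np.linalg.solve(A, b)     # finally, solve A theta = b
print(theta.round(2))             # approx [ 1. -2.  0.5]
```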
Machine learning algorithms using MapReduce (cont.)
Logistic Regression, Neural Networks, PCA, ICA, EM for Gaussian Mixture Models
Map-Reduce for Machine Learning on Multicore [NIPS’06]
MapReduce Mining Resources
Mahout: Hadoop data mining library
- http://lucene.apache.org/mahout/
- Scalable data mining libraries, mostly implemented on Hadoop
Data structures for vectors and matrices
Vectors:
- Dense vectors as double[]
- Sparse vectors as HashMap<Integer, Double>
- Operations: assign, cardinality, copy, divide, dot, get, haveSharedCells, like, minus, normalize, plus, set, size, times, toArray, viewPart, zSum and cross
Matrices:
- Dense matrix as a double[][]
- SparseRowMatrix or SparseColumnMatrix as a Vector[], holding the rows or columns of the matrix as SparseVectors
- SparseMatrix as a HashMap<Integer, Vector>
- Operations: assign, assignColumn, assignRow, cardinality, copy, divide, get, haveSharedCells, like, minus, plus, set, size, times, transpose, toArray, viewPart and zSum
MapReduce Mining Papers
[Chu et al., NIPS'06] Map-Reduce for Machine Learning on Multicore: a general framework under MapReduce
[Papadimitriou et al., ICDM'08] DisCo: Distributed Co-clustering with Map-Reduce: co-clustering
[Kang et al., ICDM'09] PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations: graph algorithms
[Das et al., WWW'07] Google news personalization: scalable online collaborative filtering: PLSI EM
[Grossman and Gu, KDD'08] Data Mining Using High Performance Data Clouds: Experimental Studies Using Sector and Sphere: an alternative to Hadoop that supports wide-area data collection and distribution
Summary: algorithms
Best for MapReduce:
- Single pass; keys are uniformly distributed
OK for MapReduce:
- Multiple passes; intermediate state is small
Bad for MapReduce:
- Key distribution is skewed
- Fine-grained synchronization is required