Covered Topics!
v Big Graph Data Mining
§ Sampling
§ Ranking
v Big Data Management
§ Indexing
v Big Data Preprocessing/Cleaning
v Big Data Acquisition/Measurement
- J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
Time: 6:00pm–8:50pm Thu Location: AK 233 Spring 2018
v 2 questions on clustering.
v 15 minutes
v At the beginning of the class.
v Count 5% towards the final
v The written part (30%) includes the quizzes.
v Try to provide intermediate results in the quiz
v Given a cloud of data points we want
v Given a set of points, with a notion of
v Usually:
v Hierarchical clustering
v Point assignment
v Clustering on big data
[Figure: scatter plot of data points, with one group labeled "Cluster" and one point labeled "Outlier"]
v Clustering in two dimensions looks easy
v Clustering small amounts of data looks easy
v And in most cases, looks are not deceiving
v Many applications involve not 2, but 10 or
v High-dimensional spaces look
v Almost all pairs of points are at about the
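The bullet above is cut off in this transcript. Assuming it refers to the well-known effect that pairwise distances become nearly equal as the dimension grows, the following minimal sketch illustrates it on uniformly random points. The function name `relative_spread`, the point count, and the seed are illustrative choices, not part of the slides.

```python
import math
import random


def relative_spread(dim, n_points=60, seed=0):
    """Return (max - min) / mean over all pairwise Euclidean distances."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [math.dist(p, q) for i, p in enumerate(pts) for q in pts[i + 1:]]
    mean = sum(dists) / len(dists)
    return (max(dists) - min(dists)) / mean


# The spread shrinks as the dimension grows: distances concentrate
# around their mean, so "near" and "far" become hard to tell apart.
for dim in (2, 10, 1000):
    print(dim, round(relative_spread(dim), 3))
```

A small spread means the nearest and the farthest pair are almost equally far apart, which is what makes clustering in high dimensions hard.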
v Intuitively: Music divides into categories,
v Represent a CD by a set of customers who
v Similar CDs have similar sets of customers,
v For each customer
v For Amazon, the dimension is tens of millions
v Task: Find clusters of similar CDs
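As a rough illustration of the CD example, the sketch below represents each CD by the set of customers who bought it and, equivalently, as a 0/1 vector with one dimension per customer. The slides do not say which similarity measure is intended; Jaccard similarity is assumed here, and the purchase data and all names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def to_vector(customers, all_customers):
    """0/1 vector with one dimension per customer."""
    return [1 if c in customers else 0 for c in all_customers]


# Hypothetical purchase data: CD -> set of customers who bought it.
purchases = {
    "cd_jazz_1": {"alice", "bob", "carol"},
    "cd_jazz_2": {"alice", "bob", "dave"},
    "cd_metal_1": {"erin", "frank"},
}

all_customers = sorted(set().union(*purchases.values()))
vec = to_vector(purchases["cd_jazz_1"], all_customers)

print(jaccard(purchases["cd_jazz_1"], purchases["cd_jazz_2"]))   # 0.5
print(jaccard(purchases["cd_jazz_1"], purchases["cd_metal_1"]))  # 0.0
print(vec)  # [1, 1, 1, 0, 0, 0]
```

With tens of millions of customers the vectors are extremely sparse, so the set representation is the practical one.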
v As with CDs we have a choice when
v Hierarchical:
v Point assignment:
v Key operation:
v Three important questions:
[Figure: hierarchical clustering example with its dendrogram. Labeled coordinates: (5,3); centroids (x): (1.5,1.5), (4.5,0.5), (1,1), (4.7,1.3)]
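A minimal sketch of centroid-based agglomerative clustering follows. It assumes Euclidean distance and that the key operation is repeatedly merging the two clusters whose centroids are nearest. Only the point (5,3) and the centroids survive in the figure text above; the other five data points are an assumption, chosen so that the merges reproduce those centroids.

```python
import math
from itertools import combinations


def centroid(points):
    """Coordinate-wise average of a list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def agglomerate(points, k=1):
    """Merge the two clusters with the nearest centroids until k remain.

    Returns the final clusters and the centroid created by each merge.
    """
    clusters = [[p] for p in points]
    merge_centroids = []
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: math.dist(centroid(clusters[ij[0]]),
                                     centroid(clusters[ij[1]])),
        )
        clusters[i] = clusters[i] + clusters[j]  # i < j, so j is still valid
        del clusters[j]
        merge_centroids.append(centroid(clusters[i]))
    return clusters, merge_centroids


# Assumed data points (only (5, 3) is legible in the figure text).
data = [(0, 0), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3)]
clusters, merges = agglomerate(data, k=2)
print([tuple(round(x, 1) for x in c) for c in merges])
# [(1.5, 1.5), (4.5, 0.5), (1.0, 1.0), (4.7, 1.3)]
```

Recomputing every centroid distance in each round keeps the sketch short but is the naive approach; it is only meant for small inputs.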
v (1) How to represent a cluster of many
v Possible meanings of “closest”:
$\min_{c} \sum_{x \in C} d(x, c)^2$
The centroid is the average of all data points in the cluster, which means the centroid is an “artificial” point. The clustroid is an existing data point that is “closest” to all other points in the cluster.
[Figure: a cluster on 3 datapoints, with the centroid, the clustroid, and the datapoints marked]
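The difference between the two representatives can be sketched in a few lines. Here "closest" is assumed to mean the smallest sum of squared distances to the other points, which is only one of the possible meanings; the three data points are made up for illustration.

```python
import math


def centroid(points):
    """Average of the points; usually not one of the data points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def clustroid(points, dist=math.dist):
    """Existing point with the smallest sum of squared distances to the rest."""
    return min(points, key=lambda c: sum(dist(x, c) ** 2 for x in points))


# Hypothetical cluster on 3 datapoints.
cluster = [(0.0, 0.0), (4.0, 0.0), (1.0, 3.0)]
print(centroid(cluster))   # an "artificial" point near (1.67, 1.0)
print(clustroid(cluster))  # (0.0, 0.0), an actual data point
```

Because `clustroid` only needs a distance function, it also works in non-Euclidean spaces where an average of points is not defined.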
v (2) How do you determine the
v Approach 2.1: Use the diameter of the
v Approach 2.2: Use the average
v Approach 2.3: Use a density-based
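The three approaches above are cut off in this transcript, so the following is a sketch under stated assumptions: the diameter is taken as the maximum distance between any two points of the cluster, the average is taken over all pairs of points, and the density-based measure is assumed to be the diameter divided by the number of points. All function names are illustrative.

```python
import math
from itertools import combinations


def diameter(points):
    """Maximum distance between any two points of the cluster."""
    return max((math.dist(p, q) for p, q in combinations(points, 2)),
               default=0.0)


def average_distance(points):
    """Average distance over all pairs of points in the cluster."""
    pairs = list(combinations(points, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)


def diameter_per_point(points):
    """ASSUMED density-style measure: diameter divided by number of points."""
    return diameter(points) / len(points)


# Hypothetical candidate cluster formed by merging two smaller ones.
merged = [(0, 0), (3, 4), (0, 4)]
print(diameter(merged))          # 5.0
print(average_distance(merged))  # 4.0
print(diameter_per_point(merged))
```

Any of these scores could be computed for each candidate merge, with the merge that yields the lowest score chosen next.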
v Naïve implementation of hierarchical
v Assumes Euclidean space/distance
v Start by picking k, the number of clusters
v Initialize clusters by picking one point per
v 1) For each point, place it in the cluster whose
v 2) After all points are assigned, update the
v 3) Reassign all points to their closest centroid
v Repeat 2 and 3 until convergence
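The steps above can be sketched as follows. The bullets are cut off in this transcript, so the sketch assumes the usual reading: points go to the cluster with the nearest centroid, and each centroid is then moved to the average of its points. The initialization (`farthest_first`) is an assumption made to keep the example deterministic; the slides only say that one point is picked per cluster.

```python
import math


def centroid(points):
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def farthest_first(points, k):
    """ASSUMED init: take the first point, then repeatedly add the point
    that is farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, max_rounds=100):
    centers = farthest_first(points, k)
    assignment = None
    for _ in range(max_rounds):
        # Steps 1 and 3: place each point in the cluster with the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:  # no point moved: converged
            break
        assignment = new_assignment
        # Step 2: move each centroid to the average of its assigned points.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centers[j] = centroid(members)
    return centers, assignment


# Hypothetical data: two well-separated groups.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(pts, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Convergence is detected here when no point changes cluster between rounds, which is equivalent to the centroids no longer moving.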
[Figures: data points and centroids, showing the clusters after round 1, after round 2, and at the end]
v Try different k, looking at the change in the
v Average falls rapidly until right k, then
[Figure: average distance to centroid plotted against k, with the best value of k marked]
[Figures: the same data points clustered with three different values of k]
Too few; many long distances to centroid.
Just right; distances rather short.
Too many; little improvement in average distance.
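The way the average distance to the centroid behaves as k grows can be reproduced with a small experiment. This is a sketch only: the compact k-means helper, its farthest-point initialization, and the data (three well-separated groups) are all assumptions made for illustration.

```python
import math


def kmeans_centers(points, k, rounds=50):
    """Compact k-means, used only to produce centroids for the curve below."""
    centers = [points[0]]
    while len(centers) < k:  # ASSUMED init: repeatedly add the farthest point
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    for _ in range(rounds):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            groups[nearest].append(p)
        centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
    return centers


def average_distance_to_centroid(points, centers):
    return sum(min(math.dist(p, c) for c in centers) for p in points) / len(points)


# Hypothetical data with three well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 0), (10, 1), (11, 0),
          (5, 9), (5, 10), (6, 9)]
curve = {k: average_distance_to_centroid(points, kmeans_centers(points, k))
         for k in range(1, 6)}
for k, avg in curve.items():
    print(k, round(avg, 3))
```

On this data the average falls sharply up to k = 3 and changes little afterwards, which matches the "too few / just right / too many" captions.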
v BFR [Bradley-Fayyad-Reina] is a
v Assumes that clusters are normally distributed
v Efficient way to summarize clusters
v Points are read from disk one main-memory-
v Most points from previous memory
v To begin, from the initial load we select the
v Discard set (DS): Close enough to a centroid to be summarized
v Compression set (CS): Summarized, but not assigned to a cluster
v Retained set (RS): Isolated points
[Figure: a cluster with its centroid, whose points are in the DS; compressed sets, whose points are in the CS; and points in the RS]
v The number of points, N
v The vector SUM, whose ith component is
v The vector SUMSQ: ith component = sum
[Figure: a cluster with its centroid; all its points are in the DS]
v 2d + 1 values represent any size cluster
v Average in each dimension (the centroid)
v Variance of a cluster’s discard set in dimension
v Next step: Actual clustering
Note: Dropping the “axis-aligned” clusters assumption would require storing full covariance matrix to summarize the cluster. So, instead of SUMSQ being a d-dim vector, it would be a d x d matrix, which is too big!
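A minimal sketch of the N, SUM, SUMSQ summary follows. It relies on the standard identities centroid = SUM / N and variance = SUMSQ / N - (SUM / N)^2 in each dimension; the class and method names are illustrative and not taken from the slides.

```python
class ClusterSummary:
    """N, SUM and SUMSQ for one cluster: 2d + 1 numbers in d dimensions."""

    def __init__(self, dim):
        self.n = 0
        self.sum = [0.0] * dim
        self.sumsq = [0.0] * dim

    def add(self, point):
        """Fold a point into the summary; the point itself can be discarded."""
        self.n += 1
        for i, x in enumerate(point):
            self.sum[i] += x
            self.sumsq[i] += x * x

    def centroid(self):
        return [s / self.n for s in self.sum]

    def variance(self):
        # Per dimension: E[x^2] - (E[x])^2
        return [sq / self.n - (s / self.n) ** 2
                for s, sq in zip(self.sum, self.sumsq)]


summary = ClusterSummary(dim=2)
for p in [(1, 2), (3, 2), (5, 8)]:
    summary.add(p)
print(summary.n, summary.sum, summary.sumsq)  # 3 [9.0, 12.0] [35.0, 72.0]
print(summary.centroid())                     # [3.0, 4.0]
print(summary.variance())
```

Since `add` only increments counters, the summary stays the same size no matter how many points the cluster absorbs.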
v 1) Find those points that are “sufficiently
v 2) Use any main-memory clustering algorithm to
v 3) DS set: Adjust statistics of the clusters to
v 4) Consider merging compressed sets in the CS
v 5) If this is the last round, merge all compressed
v Q1) How do we decide if a point is
v Q2) How do we decide whether two
v Q1) We need a way to decide whether to
v BFR suggests two ways:
σi … standard deviation of points in the cluster in the ith dimension
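The formula that used σi did not survive in this transcript. One common way to use it, and the one assumed here, is a normalized (Mahalanobis-style) distance for axis-aligned clusters: the offset from the centroid in each dimension is divided by that dimension's standard deviation. The function names and the threshold parameter are hypothetical.

```python
import math


def normalized_distance(point, centroid, std):
    """sqrt(sum_i ((x_i - c_i) / sigma_i)^2) for an axis-aligned cluster."""
    return math.sqrt(
        sum(((x - c) / s) ** 2 for x, c, s in zip(point, centroid, std)))


def is_close_enough(point, centroid, std, threshold):
    """Accept the point for the cluster if it lies within `threshold`
    (a hypothetical tuning parameter) in normalized distance."""
    return normalized_distance(point, centroid, std) < threshold


# Offsets of (2, 3) with standard deviations (2, 3): one sigma per dimension.
d = normalized_distance((3, 4), (1, 1), (2, 3))
print(d)  # sqrt(2), about 1.414
print(is_close_enough((3, 4), (1, 1), (2, 3), threshold=2.0))  # True
```

Dividing by σi means that a given offset counts for less along a dimension in which the cluster is widely spread.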
v Compute the variance of the combined
v Combine if the combined variance is
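Because N, SUM and SUMSQ are plain sums, the summary of a combined subcluster is obtained by adding the two summaries component-wise, and its variance follows directly. The sketch below assumes summaries stored as `(N, SUM, SUMSQ)` tuples and a per-dimension variance threshold `max_variance`; both the threshold and the names are hypothetical, since the bullets above are cut off.

```python
def merge(a, b):
    """Add two (N, SUM, SUMSQ) summaries component-wise."""
    n_a, sum_a, sumsq_a = a
    n_b, sum_b, sumsq_b = b
    return (n_a + n_b,
            [x + y for x, y in zip(sum_a, sum_b)],
            [x + y for x, y in zip(sumsq_a, sumsq_b)])


def variance(summary):
    """Per-dimension variance: SUMSQ / N - (SUM / N)^2."""
    n, s, sq = summary
    return [q / n - (t / n) ** 2 for t, q in zip(s, sq)]


def should_merge(a, b, max_variance):
    """Combine if every dimension of the merged variance stays below
    max_variance (a hypothetical threshold)."""
    return all(v < max_variance for v in variance(merge(a, b)))


# Summaries of hypothetical subclusters:
a = (2, [2, 0], [4, 0])              # points (0, 0), (2, 0)
b = (2, [4, 2], [10, 2])             # points (1, 1), (3, 1)
far = (2, [202, 200], [20404, 20000])  # points (100, 100), (102, 100)

print(variance(merge(a, b)))       # [1.25, 0.25]
print(should_merge(a, b, 2.0))     # True
print(should_merge(a, far, 2.0))   # False
```

No original points are needed for the decision, so it can be made even after the points themselves have been discarded.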
v Clustering: Given a set of points, with a
v Algorithms: