DS504/CS586: Big Data Analytics Big Data Clustering
- Prof. Yanhua Li
Welcome to
Time: 6:00pm –8:50pm Thu Location: AK 232 Fall 2016
High Dimensional Data
v Given a cloud of data points we want to understand its structure
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
v Given a set of points, with a notion of
v Usually:
[Figure: scatter plot of data points, with a cluster and an outlier labeled]
v Clustering in two dimensions looks easy
v Clustering small amounts of data looks easy
v And in most cases, looks are not deceiving
v Many applications involve not 2, but 10 or
v High-dimensional spaces look
v Almost all pairs of points are at about the
v Intuitively: Music divides into categories,
v Represent a CD by a set of customers who
v Similar CDs have similar sets of customers,
v For each customer
v For Amazon, the dimension is tens of millions
v Task: Find clusters of similar CDs
v Represent a document by a vector
v Documents with similar sets of words
v As with CDs we have a choice when
v Hierarchical:
v Point assignment:
v Key operation:
v Three important questions:
v Key operation: Repeatedly combine two
v (1) How to represent a cluster of many
v (2) How to determine “nearness” of
[Figure: hierarchical clustering example on 2-D data, including the point (5,3), with centroids (marked x) at (1,1), (1.5,1.5), (4.5,0.5) and (4.7,1.3), and the corresponding dendrogram]
v (1) How to represent a cluster of many
v Possible meanings of “closest”:
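For distance metric d, the clustroid c of cluster C minimizes the sum of squared distances to the other points: $\min_{c \in C} \sum_{x \in C} d(x, c)^2$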
The centroid is the average of all data points in the cluster, so it is an “artificial” point that need not be one of the data points. The clustroid is an existing data point that is “closest” to all other points in the cluster.
[Figure: cluster on 3 data points, marking the centroid, the clustroid, and the data points]
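The difference between the two representatives can be sketched in a few lines of Python. This is only an illustration: the 3-point cluster is made up, the function names are hypothetical, and "closest" is taken to mean the smallest sum of squared Euclidean distances, which is one of several possible meanings.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Coordinate-wise average; usually not one of the data points."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[i] for p in points) / n for i in range(dims))

def clustroid(points):
    """Existing point with the smallest sum of squared distances to all points."""
    return min(points, key=lambda c: sum(sq_dist(c, x) for x in points))

# Hypothetical cluster of 3 data points (an assumption for illustration).
cluster = [(0.0, 0.0), (4.0, 0.0), (1.0, 3.0)]
c = centroid(cluster)    # an "artificial" point
m = clustroid(cluster)   # one of the three data points
print("centroid:", c, "clustroid:", m)
```

In a non-Euclidean space only the clustroid is available, since averaging points may not be meaningful there.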
v (2) How do you determine the
v Approach 2.1: Use the diameter of the
v Approach 2.2: Use the average
v Approach 2.3: Use a density-based
v Naïve implementation of hierarchical
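A naive agglomerative implementation can be sketched as follows. The sketch assumes Euclidean points, represents each cluster by its centroid, measures "nearness" as centroid-to-centroid distance, and stops when a target number of clusters k remains; the function names and the six sample points are assumptions made for illustration. Each step scans every pair of clusters, which is what makes this version expensive on large inputs.

```python
from itertools import combinations

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Coordinate-wise average of a list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def agglomerate(points, k):
    """Repeatedly merge the two clusters with the nearest centroids."""
    clusters = [[p] for p in points]      # start: every point is its own cluster
    while len(clusters) > k:
        # Examine every pair of clusters and pick the closest one.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: sq_dist(centroid(clusters[ij[0]]),
                                   centroid(clusters[ij[1]])),
        )
        clusters[i].extend(clusters[j])   # i < j, so deleting j keeps i valid
        del clusters[j]
    return clusters

# Hypothetical 2-D data set (an assumption, not taken from the slides).
data = [(0, 0), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3)]
result = agglomerate(data, k=2)
print(result)
```

Recording the order of the merges instead of stopping at k would give the information needed to draw a dendrogram.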
v Assumes Euclidean space/distance
v Start by picking k, the number of clusters
v Initialize clusters by picking one point per cluster
v 1) For each point, place it in the cluster whose current centroid it is nearest
v 2) After all points are assigned, update the locations of the centroids of the k clusters
v 3) Reassign all points to their closest centroid
v Repeat 2 and 3 until convergence
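The steps above can be sketched in plain Python. This is a minimal illustration, not a tuned implementation: it assumes Euclidean points, takes the initial centroids from the caller (one starting point per cluster), and treats "convergence" as the assignment no longer changing. The function names and sample data are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise average of a list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def kmeans(points, init_centroids, max_rounds=100):
    centroids = list(init_centroids)
    assignment = None
    for _ in range(max_rounds):
        # Steps 1 and 3: put each point in the cluster with the nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break                          # no point moved: converged
        assignment = new_assignment
        # Step 2: move each centroid to the average of its assigned points.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:                    # an empty cluster keeps its old centroid
                centroids[j] = mean(members)
    return centroids, assignment

# Hypothetical data: two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, assignment = kmeans(points, init_centroids=[points[0], points[3]])
print(centroids, assignment)
```

The result depends on the starting points, so in practice the initial centroids are usually chosen to be far apart from one another.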
[Figures: k-means example showing data points and centroids, with the clusters after round 1, after round 2, and at the end]
v Try different k, looking at the change in the average distance to centroid as k increases
v Average falls rapidly until right k, then changes little
[Figure: average distance to centroid plotted against k, with the best value of k marked]
[Figure: three scatter plots of the same data clustered with different k]
Too few; many long distances to centroid.
Just right; distances rather short.
Too many; little improvement in average distance.
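The effect of k on the average distance to centroid can be reproduced with a small experiment. The sketch below is only an illustration under several assumptions: a compact k-means with a deterministic "farthest-first" initialization (one point per cluster, each as far as possible from those already chosen), made-up data with three obvious groups, and hypothetical function names.

```python
import math

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def farthest_first(points, k):
    """Pick the first point, then repeatedly the point farthest from those chosen."""
    chosen = [points[0]]
    while len(chosen) < k:
        chosen.append(max(points, key=lambda p: min(sq_dist(p, c) for c in chosen)))
    return chosen

def kmeans(points, k, max_rounds=100):
    centroids = farthest_first(points, k)
    assignment = None
    for _ in range(max_rounds):
        new = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        if new == assignment:
            break                          # assignment stable: converged
        assignment = new
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = mean(members)
    return centroids, assignment

def avg_dist_to_centroid(points, k):
    centroids, assignment = kmeans(points, k)
    total = sum(math.sqrt(sq_dist(p, centroids[a])) for p, a in zip(points, assignment))
    return total / len(points)

# Hypothetical data with three natural groups.
points = [(0, 0), (0, 1), (1, 0),
          (10, 0), (10, 1), (11, 0),
          (5, 9), (5, 10), (6, 9)]
curve = {k: avg_dist_to_centroid(points, k) for k in range(1, 6)}
for k, avg in curve.items():
    print(k, round(avg, 3))
```

On this data the average drops sharply up to k = 3 and only slightly afterwards, which is the pattern the plot describes.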
v BFR [Bradley-Fayyad-Reina] is a
v Assumes that clusters are normally distributed
v Efficient way to summarize clusters
v Points are read from disk one main-memory-
v Most points from previous memory
v To begin, from the initial load we select the
v Discard set (DS): Points close enough to a centroid to be summarized
v Compression set (CS): Groups of points that are summarized, but not assigned to a cluster
v Retained set (RS): Isolated points
[Figure: a cluster and its centroid, with the cluster's points in the DS; compressed sets, whose points are in the CS; and isolated points in the RS]
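v The number of points, N
v The vector SUM, whose ith component is the sum of the coordinates of the points in the ith dimension
v The vector SUMSQ: ith component = sum of squares of coordinates in the ith dimension
[Figure: a cluster, all of whose points are in the DS, and its centroid]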
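v 2d + 1 values represent any size cluster (d = number of dimensions)
v Average in each dimension (the centroid) can be calculated as $\mathrm{SUM}_i / N$
v Variance of a cluster’s discard set in dimension i is $(\mathrm{SUMSQ}_i / N) - (\mathrm{SUM}_i / N)^2$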
v Next step: Actual clustering
Note: Dropping the “axis-aligned” clusters assumption would require storing the full covariance matrix to summarize the cluster. Instead of SUMSQ being a d-dimensional vector, it would be a d × d matrix, which is too big!
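A small Python sketch shows how N, SUM and SUMSQ are enough to recover the centroid and the per-dimension variance without keeping the points themselves. The class and method names are hypothetical, and the sample points are made up for illustration.

```python
class ClusterSummary:
    """Summary of a set of d-dimensional points using 2d + 1 numbers."""

    def __init__(self, d):
        self.n = 0                  # N: number of points
        self.sum = [0.0] * d        # SUM: per-dimension sum of coordinates
        self.sumsq = [0.0] * d      # SUMSQ: per-dimension sum of squared coordinates

    def add(self, point):
        """Fold a point into the summary; the point can then be discarded."""
        self.n += 1
        for i, x in enumerate(point):
            self.sum[i] += x
            self.sumsq[i] += x * x

    def centroid(self):
        """Average in each dimension: SUM_i / N."""
        return [s / self.n for s in self.sum]

    def variance(self):
        """Variance in each dimension: SUMSQ_i / N - (SUM_i / N)^2."""
        return [sq / self.n - (s / self.n) ** 2
                for s, sq in zip(self.sum, self.sumsq)]

# Hypothetical points assigned to one cluster.
summary = ClusterSummary(d=2)
for p in [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]:
    summary.add(p)
print(summary.n, summary.centroid(), summary.variance())
```

Adding a point or combining two summaries only requires component-wise additions, which is why this representation suits data read from disk in chunks.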
v 1) Find those points that are “sufficiently
v 2) Use any main-memory clustering algorithm to
v 3) DS set: Adjust statistics of the clusters to
v 4) Consider merging compressed sets in the CS
v 5) If this is the last round, merge all compressed
v Q1) How do we decide if a point is
v Q2) How do we decide whether two
v Q1) We need a way to decide whether to
v BFR suggests two ways:
$\sigma_i$: standard deviation of points in the cluster in the ith dimension
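The per-dimension standard deviation suggests normalizing each coordinate's offset from the centroid by the cluster's spread in that dimension, which for axis-aligned clusters is the Mahalanobis distance. The sketch below assumes that this is the intended distance; the function names and the threshold value are illustrative and not taken from the slides.

```python
import math

def mahalanobis(point, centroid, sigma):
    """Normalized Euclidean distance from a point to a cluster centroid.

    Assumes axis-aligned clusters and sigma[i] > 0 in every dimension.
    """
    return math.sqrt(sum(((x - c) / s) ** 2
                         for x, c, s in zip(point, centroid, sigma)))

def sufficiently_close(point, centroid, sigma, threshold):
    """Accept the point for the cluster if its distance is below a threshold.

    The threshold is a tunable assumption, for example a few standard deviations.
    """
    return mahalanobis(point, centroid, sigma) < threshold

# Hypothetical cluster statistics and candidate points.
centroid = (1.0, 1.0)
sigma = (2.0, 3.0)
near = (3.0, 4.0)      # one standard deviation away in each dimension
far = (11.0, 16.0)     # five standard deviations away in each dimension
print(mahalanobis(near, centroid, sigma), mahalanobis(far, centroid, sigma))
```

Because each offset is divided by the spread in its own dimension, a point is judged relative to the shape of the cluster rather than by raw distance.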
v Compute the variance of the combined
v Combine if the combined variance is
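Because each compressed set is summarized by N, SUM and SUMSQ, the variance of a combined set can be computed from the two summaries alone by adding them component-wise. The sketch below represents a summary as a tuple (N, SUM, SUMSQ); the threshold and the rule that every dimension must pass it are assumptions made for illustration.

```python
def combine(a, b):
    """Merge two (N, SUM, SUMSQ) summaries by component-wise addition."""
    n_a, sum_a, sumsq_a = a
    n_b, sum_b, sumsq_b = b
    return (n_a + n_b,
            [x + y for x, y in zip(sum_a, sum_b)],
            [x + y for x, y in zip(sumsq_a, sumsq_b)])

def variance(summary):
    """Per-dimension variance: SUMSQ_i / N - (SUM_i / N)^2."""
    n, total, total_sq = summary
    return [sq / n - (s / n) ** 2 for s, sq in zip(total, total_sq)]

def should_merge(a, b, threshold):
    """Merge if the combined variance stays below the threshold in every dimension."""
    return all(v < threshold for v in variance(combine(a, b)))

# Hypothetical summaries, each built from two 2-D points.
a = (2, [2.0, 0.0], [4.0, 0.0])         # points (0,0) and (2,0)
b = (2, [2.0, 0.0], [2.0, 2.0])         # points (1,1) and (1,-1)
c = (2, [22.0, 20.0], [244.0, 200.0])   # points (10,10) and (12,10)
print(should_merge(a, b, threshold=1.0), should_merge(a, c, threshold=1.0))
```

Here a and b overlap, so their combined variance is small, while a and c are far apart and the combined variance is large.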
v Clustering: Given a set of points, with a
v Algorithms: