SLIDE 6 CS535 Big Data 03/02/2020 Week 7-A Sangmi Lee Pallickara http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University, page 6
Scalable K-Means++ (a.k.a. K-Means||) Algorithm
- ℓ: number of points sampled in each iteration (the oversampling factor)
- Total number of points in C is O(ℓ · log ψ) (> k)
- To reduce the number of centers:
- Step 8 assigns weights to the points in C
- Step 9 reclusters these weighted points to obtain k centers
1: C ← sample a point uniformly at random from X
2: ψ ← φ_X(C)
3: for O(log ψ) times do
4:   C′ ← sample each point x ∈ X independently with probability p_x = ℓ·d²(x, C)/φ_X(C)
5:   C ← C ∪ C′
6:   update ψ and continue the iteration
7: end for
8: For x ∈ C, set w_x to be the number of points in X closer to x than to any other point in C
9: Recluster the weighted points in C into k clusters
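As an illustration of lines 1-7 above, here is a minimal pure-Python sketch of the oversampling loop (1-D points and squared Euclidean distance for brevity; the function and variable names are illustrative, not from the paper):

```python
import math
import random

def cost(X, C):
    """phi_X(C): total squared distance from each point in X to its nearest center in C."""
    return sum(min((x - c) ** 2 for c in C) for x in X)

def kmeans_parallel_oversample(X, l, seed=0):
    """Sketch of the oversampling phase: grow a center set C of size > k."""
    rng = random.Random(seed)
    C = [rng.choice(X)]                          # line 1: one uniform seed center
    psi = cost(X, C)                             # line 2: initial cost psi
    for _ in range(max(1, round(math.log(psi + 1)))):   # line 3: O(log psi) rounds
        phi = cost(X, C) or 1.0                  # current cost phi_X(C)
        # line 4: keep each point independently with prob l * d^2(x, C) / phi
        C_prime = [x for x in X
                   if rng.random() < min(1.0, l * min((x - c) ** 2 for c in C) / phi)]
        C = C + C_prime                          # line 5: C = C U C'
        psi = cost(X, C)                         # line 6: update psi
    return C
```

Adding centers can only shrink each point's distance to its nearest center, so the cost of the final C never exceeds the cost of the single seed center.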
CS535 Big Data | Computer Science | Colorado State University
K-Means|| Algorithm: A Parallel Implementation
- Line 2: calculate the cost function
- How will you design RDDs for this?
- Line 4: each mapper (RDD partition) can sample independently
- How can you generate random numbers for sampling?
- Line 5: identical to line 2
1: C ← sample a point uniformly at random from X
2: ψ ← φ_X(C)
3: for O(log ψ) times do
4:   C′ ← sample each point x ∈ X independently with probability p_x = ℓ·d²(x, C)/φ_X(C)
5:   C ← C ∪ C′
6:   update ψ and continue the iteration
7: end for
8: For x ∈ C, set w_x to be the number of points in X closer to x than to any other point in C
9: Recluster the weighted points in C into k clusters
K-Means|| Algorithm: A Parallel Implementation
- Line 2: calculate the cost function
- RDD calc = all possible combinations between the RDD of points and the RDD of centers
- val calc = p.cartesian(c)
- Create an aggregator function (keep the minimum squared distance per point, then sum)
- Line 4: each mapper (RDD partition) can sample independently
- Generate a random number r per point; if r < the sampling probability, keep the point, otherwise drop it
- Line 5: identical to line 2
1: C ← sample a point uniformly at random from X
2: ψ ← φ_X(C)
3: for O(log ψ) times do
4:   C′ ← sample each point x ∈ X independently with probability p_x = ℓ·d²(x, C)/φ_X(C)
5:   C ← C ∪ C′
6:   update ψ and continue the iteration
7: end for
8: For x ∈ C, set w_x to be the number of points in X closer to x than to any other point in C
9: Recluster the weighted points in C into k clusters
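The cartesian-plus-aggregator idea for line 2 can be sketched in plain Python, with lists standing in for the points and centers RDDs and `itertools.product` playing the role of `p.cartesian(c)` (a sketch of the pattern, not Spark code):

```python
from itertools import product

def phi(points, centers):
    """Cost function via the cartesian-product pattern: pair every point
    with every center (like p.cartesian(c)), keep the minimum squared
    distance per point, then sum the minima (the aggregator function)."""
    calc = product(enumerate(points), centers)   # all (point, center) pairs
    best = {}                                    # point index -> min squared distance so far
    for (i, p), c in calc:
        d2 = (p - c) ** 2
        if i not in best or d2 < best[i]:
            best[i] = d2
    return sum(best.values())
```

In Spark the same shape would be a `cartesian` followed by a key-wise min and a sum, e.g. a `reduceByKey(min)` on the point key and then a global reduce.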
K-Means|| Algorithm: Properties
- Requires O(log ψ) iterations
- A constant number of iterations is usually enough in practice
- Creates an O(log k) approximation to the global optimum
- Uses results from probability and sequence theory
- Each iteration reduces the error by a constant factor plus a term proportional to the global error
- The expected value of the cost function approaches a multiple of the global optimum
- Performance (with KDDCup 1999 data)
- Paper: Bahmani, B., Moseley, B., Vattani, A., Kumar, R. and Vassilvitskii, S., 2012. Scalable k-means++. arXiv preprint arXiv:1203.6402
- Data: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
Note: all values scaled down by 10^10
Other options for clustering with Apache Spark MLlib
- Gaussian mixture
- Composite distribution: points are drawn from one of k Gaussian sub-distributions, each with its own probability
- spark.mllib implements the expectation-maximization (EM) algorithm to induce the maximum-likelihood model
- k is the number of desired clusters
- convergenceTol is the maximum change in log-likelihood at which convergence is considered achieved
- maxIterations is the maximum number of iterations to perform without reaching convergence
- initialModel is an optional starting point from which to start the EM algorithm
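To illustrate how these parameters interact, here is a toy 1-D EM loop whose parameter names mirror the spark.mllib options above; the implementation itself is a didactic sketch, not Spark's:

```python
import math

def gmm_em(xs, k=2, convergenceTol=1e-3, maxIterations=100):
    """Toy 1-D EM for a k-component Gaussian mixture. Parameter names
    mirror the spark.mllib options; everything else is illustrative."""
    mus = list(xs[:k])              # deterministic init (a stand-in for initialModel)
    sigmas = [1.0] * k
    weights = [1.0 / k] * k
    prev_ll = -float("inf")
    for _ in range(maxIterations):  # maxIterations bounds the EM loop
        # E-step: responsibility of each component for each point
        resp, ll = [], 0.0
        for x in xs:
            ps = [w * math.exp(-((x - m) ** 2) / (2 * s * s)) / (s * math.sqrt(2 * math.pi))
                  for w, m, s in zip(weights, mus, sigmas)]
            total = sum(ps) or 1e-300
            resp.append([p / total for p in ps])
            ll += math.log(total)
        # Stop once the change in log-likelihood drops below convergenceTol
        if abs(ll - prev_ll) < convergenceTol:
            break
        prev_ll = ll
        # M-step: re-estimate mixture weights, means, and variances
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            weights[j] = nj / len(xs)
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sigmas[j] = math.sqrt(max(var, 1e-6))
    return mus, sigmas, weights
```

On well-separated data the log-likelihood stabilizes within a few iterations, so convergenceTol, not maxIterations, is what typically ends the loop.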
Other options for clustering with Apache Spark MLlib
- Power iteration clustering (PIC)
- Clusters vertices of a graph given pairwise similarities as edge properties
- Computes a pseudo-eigenvector of the normalized affinity matrix of the graph
- k: number of clusters
- maxIterations: maximum number of power iterations
- initializationMode: initialization mode. Either "random" (the default), which uses a random vector as vertex properties, or "degree", which uses the normalized sum of similarities
Paper: Lin, F. and Cohen, W.W., 2010. Power iteration clustering. ICML 2010.
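The PIC idea above can be sketched in plain Python: row-normalize the affinity matrix, run a (lazy) power iteration to get a pseudo-eigenvector, then split its sorted entries at the largest gaps. This is a toy illustration of the technique, not Spark's implementation; the final gap-based split stands in for the 1-D k-means step, and the lazy (half-step) update is an added damping trick, not part of the original algorithm:

```python
def power_iteration_clustering(A, k=2, maxIterations=10):
    """Toy PIC sketch on a dense affinity matrix A (list of lists).
    Parameter names mirror the spark.mllib options described above."""
    n = len(A)
    # W = D^-1 A: the row-normalized (degree-normalized) affinity matrix
    W = [[A[i][j] / (sum(A[i]) or 1.0) for j in range(n)] for i in range(n)]
    # Degree-based start (cf. initializationMode="degree"); a uniform
    # vector would be an exact fixed point of W and reveal nothing
    deg = [sum(row) for row in A]
    total = sum(deg) or 1.0
    v = [d / total for d in deg]
    for _ in range(maxIterations):
        Wv = [sum(W[i][j] * v[j] for j in range(n)) for i in range(n)]
        v = [0.5 * vi + 0.5 * wi for vi, wi in zip(v, Wv)]  # lazy step damps oscillation
        norm = sum(abs(x) for x in v) or 1.0
        v = [x / norm for x in v]
    # Split the sorted pseudo-eigenvector at the k-1 largest gaps
    order = sorted(range(n), key=lambda i: v[i])
    gap_positions = sorted(range(n - 1),
                           key=lambda p: v[order[p + 1]] - v[order[p]],
                           reverse=True)[:k - 1]
    cuts = set(gap_positions)
    labels, cluster = [0] * n, 0
    for pos, i in enumerate(order):
        labels[i] = cluster
        if pos in cuts:
            cluster += 1
    return labels
```

Stopping after a modest number of iterations matters: the iteration eventually converges to a constant vector, and the cluster structure lives in the intermediate, not the limiting, vector.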