SLIDE 16 CS535 Big Data 03/02/2020 Week 7-A Sangmi Lee Pallickara http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University, page 16
Scalable K-Means++ (a.k.a. K-Means||) Algorithm
- l: oversampling factor, the expected number of points chosen in each iteration
- Expected total number of points in C is l log ψ (typically > k)
- To reduce the number of centers
- Step 8 assigns weights to the points in C
- Step 9 reclusters these weighted points to obtain k centers
1: C ← sample a point uniformly at random from X
2: ψ ← φX(C)
3: for O(log ψ) times do
4:     C′ ← sample each point x ∈ X independently with probability l · d²(x, C) / φX(C)
5:     C ← C ∪ C′
6:     ψ ← φX(C) (ψ is updated and the iteration continues)
7: end for
8: For x ∈ C, set wx to be the number of points in X closer to x than to any other point in C
9: Recluster the weighted points in C into k clusters
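The pseudocode above can be sketched in plain Python as follows. This is a minimal single-machine illustration, not the course's reference implementation: the function names (`sq_dist`, `d2`, `cost`, `kmeans_parallel_candidates`, `center_weights`) and the demo data are hypothetical, the number of rounds is assumed to be ceil(ln ψ) as one concrete reading of "O(log ψ)", and the sampling probability is capped at 1. Line 9 is left out because any weighted clustering routine could be applied to the candidates and their weights.

```python
import math
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def d2(x, centers):
    """d^2(x, C): squared distance from x to its nearest center."""
    return min(sq_dist(x, c) for c in centers)

def cost(points, centers):
    """phi_X(C): sum of d^2(x, C) over all points in X."""
    return sum(d2(x, centers) for x in points)

def kmeans_parallel_candidates(points, l, rng):
    """Lines 1-7: oversample a candidate set C, usually larger than k."""
    centers = [rng.choice(points)]                          # line 1
    psi = cost(points, centers)                             # line 2
    # Assumption: ceil(ln psi) rounds, with at least one round when psi > 0.
    rounds = max(1, math.ceil(math.log(psi))) if psi > 0 else 0
    for _ in range(rounds):                                 # line 3
        if psi == 0:
            break  # every point already coincides with a center
        # Line 4: one independent draw per point; a point that is already
        # a center has d^2 = 0 and therefore is never drawn again.
        new = [x for x in points
               if rng.random() < min(1.0, l * d2(x, centers) / psi)]
        centers.extend(new)                                 # line 5
        psi = cost(points, centers)                         # line 6
    return centers

def center_weights(points, centers):
    """Line 8: w_x = number of points whose nearest candidate is x."""
    weights = [0] * len(centers)
    for x in points:
        nearest = min(range(len(centers)),
                      key=lambda i: sq_dist(x, centers[i]))
        weights[nearest] += 1  # ties go to the lowest index
    return weights

# Tiny demo on two well-separated groups of 2-D points (made-up data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
candidates = kmeans_parallel_candidates(points, l=2, rng=random.Random(0))
weights = center_weights(points, candidates)
print(candidates, weights)
```

The weights always sum to the number of input points, so the weighted candidate set is a compressed summary of X that line 9 can cluster on a single machine.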
CS535 Big Data | Computer Science | Colorado State University
K-Means|| Algorithm: A Parallel Implementation
- Line 2: calculate the cost function
- How will you design RDDs for this?
- Line 4: each mapper (one per RDD partition) can sample independently
- How can you sample numbers?
- Lines 5-6: updating the cost φX(C) after the union is the same computation as line 2
1: C ← sample a point uniformly at random from X
2: ψ ← φX(C)
3: for O(log ψ) times do
4:     C′ ← sample each point x ∈ X independently with probability l · d²(x, C) / φX(C)
5:     C ← C ∪ C′
6:     ψ ← φX(C) (ψ is updated and the iteration continues)
7: end for
8: For x ∈ C, set wx to be the number of points in X closer to x than to any other point in C
9: Recluster the weighted points in C into k clusters
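One possible answer to the two design questions above is sketched below in plain Python, with a list of lists standing in for the partitions of an RDD. Everything here is an assumption made for illustration: the helper names (`partition_cost`, `total_cost`, `partition_sample`), the seeding scheme, and the toy data are hypothetical. In Spark, the per-partition functions would plausibly be passed to operations such as `mapPartitionsWithIndex` and combined with `reduce`, with the current centers broadcast to the workers, but that mapping is not shown here.

```python
import random
from functools import reduce

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def d2(x, centers):
    """d^2(x, C): squared distance from x to its nearest center."""
    return min(sq_dist(x, c) for c in centers)

def partition_cost(partition, centers):
    """Map side of line 2: partial cost contributed by one partition."""
    return sum(d2(x, centers) for x in partition)

def total_cost(partitions, centers):
    """Reduce side of line 2: add up the partial costs."""
    partial = (partition_cost(p, centers) for p in partitions)
    return reduce(lambda a, b: a + b, partial, 0.0)

def partition_sample(index, partition, centers, psi, l, seed):
    """Line 4 on one partition: an independent draw for every point.

    Each partition derives its own random stream from (seed, index), so
    no coordination between partitions is needed. Assumes psi > 0.
    """
    rng = random.Random(seed * 1_000_003 + index)
    return [x for x in partition
            if rng.random() < min(1.0, l * d2(x, centers) / psi)]

# Toy data: two partitions and one current center (made-up values).
partitions = [[(0, 0), (0, 1)], [(10, 10), (10, 11)]]
centers = [(0, 0)]
psi = total_cost(partitions, centers)
new_centers = [x
               for i, part in enumerate(partitions)
               for x in partition_sample(i, part, centers, psi, l=2, seed=42)]
print(psi, new_centers)
```

Because the cost is a plain sum over points, it decomposes into per-partition partial sums, which is why recomputing ψ inside the loop reuses the same map and reduce steps as line 2.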