SLIDE 1

CS535 Big Data 03/02/2020 Week 7-A Sangmi Lee Pallickara http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University, page 1

CS535 BIG DATA

PART B. GEAR SESSIONS

SESSION 2: MACHINE LEARNING FOR BIG DATA

Sangmi Lee Pallickara Computer Science, Colorado State University http://www.cs.colostate.edu/~cs535

FAQs

  • Quiz #3
    • Scores will be available by 3/6
  • Programming Assignment #2
    • March 10
    • Piazza discussion board
  • Critical Review

CS535 Big Data | Computer Science | Colorado State University

SLIDE 2

Topics of Today's Class

  • GEAR Session 2. Machine Learning for Big Data
    • Lecture 1. Clustering Algorithms


GEAR Session 2. Machine Learning for Big Data

Lecture 1. Distributed Clustering Models


SLIDE 3

Clustering: Core concept

  • Set of N-dimensional vectors
    • Can be on the order of millions
  • Group (or cluster) them based on their proximity (or similarity) to each other in an N-dimensional space
  • Vectors or objects in a cluster (or group) are more similar to each other than to those in any other group

Clustering: Applications

  • Anomaly detection
  • Fraud detection
  • Recommendation systems
  • Medical imaging
  • Market research
  • Human genetic clustering


SLIDE 4

GEAR Session 2. Machine Learning for Big Data

Lecture 1. Distributed Clustering Introduction


This material is based on:

  • Arthur, D. and Vassilvitskii, S., 2007. k-means++: The advantages of careful seeding. Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, pp. 1027–1035.
  • Bahmani, B., Moseley, B., Vattani, A., Kumar, R. and Vassilvitskii, S., 2012. Scalable k-means++. arXiv preprint arXiv:1203.6402.
  • Apache Spark MLlib: Clustering. https://spark.apache.org/docs/latest/ml-clustering.html

SLIDE 5

K-Means Clustering

  • A set of unlabeled points
  • Assumes that they form k clusters
  • Find a set of cluster centers that minimizes the distance from each point to its nearest center
  • Finding the global optimum is NP-hard: O(n^(dk+1))
    • Many approximate algorithms are available
    • D. Aloise, A. Deshpande, P. Hansen, and P. Popat. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 75(2):245–248, 2009.

Concept: k-Means Clustering (1/4)

[Figure: scatter plot of points with two cluster centers marked ×; x-axis -10…6, y-axis -4…4]

SLIDE 6

Concept: k-Means Clustering (2/4)

[Figure: scatter plot of points with two cluster centers marked ×; x-axis -10…6, y-axis -4…4]

Concept: k-Means Clustering (3/4)

[Figure: scatter plot of points with two cluster centers marked ×; x-axis -10…6, y-axis -4…4]

SLIDE 7

Concept: k-Means Clustering (4/4)

[Figure: scatter plot of points with two cluster centers marked ×; x-axis -10…6, y-axis -4…4]

k-Means algorithm- Lloyd’s Algorithm (1/2)

  • Input
  • k (number of clusters)
  • Training set {x(1), x(2), x(3), …, x(m)}, where x(i) ∈ R^n (drop the x0 = 1 convention)

SLIDE 8

k-Means algorithm- Lloyd’s Algorithm (2/2)

  • Randomly initialize K cluster centroids μ1, μ2, …, μK ∈ R^n

repeat {
    for i = 1 to m:
        c(i) := index (from 1 to K) of the cluster centroid closest to x(i)
    for k = 1 to K:
        μk := average (mean) of the points assigned to cluster k
}
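The update loop above can be sketched in plain Python (an illustrative toy, not a reference implementation; `math.dist` needs Python 3.8+, and a cluster that empties simply keeps its old centroid):

```python
import math
import random

def lloyd_kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm on a list of equal-length coordinate tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # random initialization
    for _ in range(iters):
        # Assignment step: c(i) = index of the closest centroid
        assign = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: each centroid becomes the mean of its assigned points
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:                        # keep old centroid if cluster empties
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return centroids, assign
```

On four points forming two obvious groups, e.g. `[(0,0), (0,1), (10,10), (10,11)]` with k = 2, the loop settles on centroids (0, 0.5) and (10, 10.5).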

Cost function

  • The objective is to find:

      argmin_S Σ_{i=1..k} Σ_{x ∈ S_i} || x − μ_i ||²

  • where μ_i is the mean of the points in S_i
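The objective can be evaluated directly on a tiny hypothetical example (two clusters with their means):

```python
import math

def kmeans_cost(clusters, centroids):
    """Sum of squared Euclidean distances from each point to its cluster mean."""
    return sum(math.dist(x, mu) ** 2
               for S, mu in zip(clusters, centroids)
               for x in S)

# Each point sits 0.5 away from its cluster mean, so J = 4 * 0.5^2 = 1.0
S1, mu1 = [(0.0, 0.0), (0.0, 1.0)], (0.0, 0.5)
S2, mu2 = [(4.0, 4.0), (4.0, 5.0)], (4.0, 4.5)
print(kmeans_cost([S1, S2], [mu1, mu2]))  # prints 1.0
```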

SLIDE 9

k-Means algorithm- Lloyd’s Algorithm: Step-By-Step Instruction

  • 1. Initialization Step
    • Select k random centers
    • Using a uniform random distribution
  • 2. Assignment Step
    • Assign each observation to the cluster with the nearest centroid
    • Euclidean distance
  • 3. Update Step
    • Recompute each centroid as the mean of the observations assigned to its cluster
    • Update the centroids
  • 4. Termination Step
    • Stop when the centroids do not change for two consecutive steps.

k-Means for non-separated clusters

[Figure: two scatter plots; left: separated clusters, right: non-separated clusters]

SLIDE 10

How to choose the number of clusters

  • Value k in the algorithm

[Figure: scatter plot of unlabeled points; x-axis -10…6, y-axis -4…4]

Choosing the value K (1/2)

Elbow Method

[Figure: cost function J plotted against K (number of clusters); one curve shows a clear "elbow", the other does not]
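One way to automate reading the elbow off such a curve is to pick the k where the marginal cost reduction drops fastest, i.e., the maximum second difference. A small sketch with a hypothetical cost curve (illustrative heuristic, not part of the lecture's method):

```python
def pick_elbow(costs):
    """Pick k at the 'elbow': the maximum second difference of the cost curve.
    costs[i] is the cost function J for k = i + 1 clusters."""
    second_diff = [costs[i - 1] - 2 * costs[i] + costs[i + 1]
                   for i in range(1, len(costs) - 1)]
    return second_diff.index(max(second_diff)) + 2  # middle index -> k value

# Hypothetical J(k) for k = 1..6: big drops until k = 3, then nearly flat
costs = [100.0, 60.0, 20.0, 17.0, 15.0, 14.0]
print(pick_elbow(costs))  # prints 3
```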

SLIDE 11

Choosing the value K (2/2)

[Figure: t-shirt sizing example; Sleeve Length vs. Waist scatter plots segmented as Small/Medium/Large (K = 3) and as Extra Small through Extra Large (K = 5)]

Distance Measures

  • Euclidean Distance
  • Manhattan Distance
  • Cosine Distance
  • Hamming Distance
  • Jaccard Dissimilarity
  • Edit Distance
  • Smith Waterman Similarity
  • Image Distance
  • Etc.
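A few of these measures sketched in plain Python (illustrative only; libraries such as SciPy provide tested implementations):

```python
import math

def euclidean(a, b):
    return math.dist(a, b)                      # straight-line distance

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):                      # 1 - cosine similarity
    dot = sum(x * y for x, y in zip(a, b))
    return 1 - dot / (math.hypot(*a) * math.hypot(*b))

def hamming(a, b):                              # equal-length sequences
    return sum(x != y for x, y in zip(a, b))

def jaccard_dissimilarity(a, b):                # a and b are sets
    return 1 - len(a & b) / len(a | b)

print(euclidean((0, 0), (3, 4)))                     # prints 5.0
print(manhattan((0, 0), (3, 4)))                     # prints 7
print(hamming("karolin", "kathrin"))                 # prints 3
print(jaccard_dissimilarity({1, 2, 3}, {2, 3, 4}))   # prints 0.5
```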


SLIDE 12

GEAR Session 2. Machine Learning for Big Data

Lecture 1. Distributed Clustering Scalable k-means


k-Means algorithm- Lloyd’s Algorithm: Strengths

  • Embarrassingly parallel
  • Converges to a local minimum
  • O(nkdi) runtime (n points, k clusters, d dimensions, i iterations)

SLIDE 13

k-Means algorithm- Lloyd’s Algorithm: Weaknesses

  • O(nkdi) runtime
    • Worst case: the number of iterations i can grow superpolynomially, 2^Ω(√n)
  • Large number of local minima
    • Many local minima are poor
  • k is unknown

The K-Means++ Algorithm

  • Avoiding a cold start improves results
    • Reduces the total number of iterations
  • Initialize cluster centers sequentially
    • Only the first center is selected uniformly at random
    • Each subsequent center is selected probabilistically to be far from the existing centers
  • The result is an O(log k) approximation to the global optimum

SLIDE 14

The K-Means++ Algorithm: Step-by-Step description

Step 1. Choose one center uniformly at random from among the data points.
Step 2. For each data point x, compute D(x), the shortest distance between x and the nearest center that has already been chosen.
Step 3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)². (D² weighting)

  • D² weighting: take a new center, choosing x with probability D(x)² / Σ_{x′ ∈ X} D(x′)²

Step 4. Repeat Steps 2 and 3 until k centers have been chosen.
Step 5. Now that the initial centers have been chosen, proceed using standard k-means.

More info: http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf
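Steps 1 through 4 can be sketched as follows (a toy pure-Python version; it assumes at least k distinct points, since an already-chosen center gets weight 0 and cannot be re-picked):

```python
import math
import random

def kmeanspp_init(points, k, seed=0):
    """k-means++ seeding with D^2 weighting."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]          # Step 1: uniform pick
    while len(centers) < k:                 # Step 4: repeat until k centers
        # Step 2: D(x)^2 to the nearest already-chosen center
        d2 = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        # Step 3: next center with probability proportional to D(x)^2
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0), (9.1, 0.0)]
centers = kmeanspp_init(pts, 3, seed=1)
```

After seeding, Step 5 hands the chosen `centers` to a standard k-means loop.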


The K-Means++ Algorithm: How can we improve this?

  • GOAL 1: Can we reduce the number of iterations needed to initialize the centroids?
    • Can we select multiple centroids at a time?
  • GOAL 2: Can we handle non-uniformly distributed datasets?
    • Non-uniform selection?
  • GOAL 3: Can we keep reasonable approximation guarantees?

SLIDE 15

Scalable K-Means++(A.K.A. K-Means|| ) Algorithm

  • Parallel version for initializing the centers
  • Oversampling factor ℓ = Ω(k)
  • Select an initial center (uniformly at random)
  • Computes the initial cost of the clustering after this selection, ψ
  • Proceeds in log ψ iterations
  • Samples each x with probability ℓ · d²(x, C) / φ_X(C)

Scalable K-Means++(A.K.A. K-Means|| ) Algorithm

  • Let X = {x1, x2, …, xn} be a set of points in the d-dimensional Euclidean space
  • Let k be a positive integer specifying the number of clusters
  • Let || a − b || denote the Euclidean distance between a and b
  • For a point x and a subset Y ⊆ X of points, the distance is defined as
      d(x, Y) = min_{y ∈ Y} || x − y ||
  • For a subset Y ⊆ X, its centroid is given by
      centroid(Y) = (1/|Y|) Σ_{y ∈ Y} y
  • Let C = {c1, c2, …, ck} be a set of points and let Y ⊆ X
  • The cost of Y with respect to C is
      φ_Y(C) = Σ_{y ∈ Y} d²(y, C) = Σ_{y ∈ Y} min_{i=1..k} || y − c_i ||²

SLIDE 16

Scalable K-Means++(A.K.A. K-Means|| ) Algorithm

  • ℓ: number of points chosen in each iteration
  • The total number of points in C is ℓ · log ψ (> k)
  • To reduce the number of centers:
    • Step 8 assigns weights to the points in C
    • Step 9 reclusters these weighted points to obtain k centers

1: C ← sample a point uniformly at random from X
2: ψ ← φ_X(C)
3: for O(log ψ) times do
4:   C′ ← sample each point x ∈ X independently with probability ℓ · d²(x, C) / φ_X(C)
5:   C ← C ∪ C′
6:   update ψ and continue the iteration
7: end for
8: for x ∈ C, set w_x to be the number of points in X closer to x than to any other point in C
9: recluster the weighted points in C into k clusters
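A toy sequential sketch of this initialization (in Spark each sampling pass would be a distributed map; `rounds` stands in for the O(log ψ) iterations, and the final reclustering of the weighted candidates into k centers is left out):

```python
import math
import random

def kmeans_parallel_init(X, ell, rounds=3, seed=1):
    """k-means|| oversampling: returns candidate centers C and their weights."""
    rng = random.Random(seed)
    C = [rng.choice(X)]                                    # line 1: uniform pick
    for _ in range(rounds):                                # line 3
        cost = sum(min(math.dist(x, c) ** 2 for c in C) for x in X)  # phi_X(C)
        if cost == 0:                                      # every point is a center
            break
        # line 4: keep x independently with probability ell * d^2(x, C) / phi_X(C)
        C_prime = [x for x in X
                   if rng.random() < ell * min(math.dist(x, c) ** 2 for c in C) / cost]
        C.extend(C_prime)                                  # line 5
    # line 8: weight each candidate by how many points it is closest to
    w = [0] * len(C)
    for x in X:
        w[min(range(len(C)), key=lambda j: math.dist(x, C[j]))] += 1
    return C, w    # line 9 would recluster (C, w) into k centers, e.g. with k-means++
```

Since a point already in C has d²(x, C) = 0, it is never re-sampled, so the candidates stay distinct.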


K-Means|| Algorithm: A Parallel Implementation

  • Line 2: calculate the cost function
    • How will you design RDDs for this?
  • Line 4: each mapper (RDD) can sample independently
    • How can you sample the numbers?
  • Line 5: identical to line 2

SLIDE 17

K-Means|| Algorithm: A Parallel Implementation

  • Line 2: calculate the cost function
    • RDD calc = all possible combinations between RDD points (for the points) and RDD centers (for the centers)
    • val calc = p.cartesian(c)
    • Create an aggregator function
  • Line 4: each mapper (RDD) can sample independently
    • Generate a random number r per point; if r < probability, sample the point, otherwise drop it
  • Line 5: identical to line 2

K-Means|| Algorithm: Properties

  • Requires O(log ψ) iterations
    • A constant number of iterations is usually enough in practice
  • Creates an O(log k) approximation to the global optimum
    • Uses results from probability and sequence theory
    • Each iteration reduces the error by a constant factor plus a term proportional to the global error
    • The expected value of the cost function approaches a multiple of the global optimum
  • Performance (with KDD Cup 1999 data)
    • Paper: Bahmani, B., Moseley, B., Vattani, A., Kumar, R. and Vassilvitskii, S., 2012. Scalable k-means++. arXiv preprint arXiv:1203.6402.
    • Data: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
    • Note: all values scaled down by 10^10

SLIDE 18

Other options for clustering with Apache Spark MLlib

  • Gaussian mixture
    • Composite distribution: points are drawn from one of k Gaussian sub-distributions, each with its own probability
    • spark.mllib implements the expectation-maximization algorithm to induce the maximum-likelihood model
    • k is the number of desired clusters
    • convergenceTol is the maximum change in log-likelihood at which convergence is considered achieved
    • maxIterations is the maximum number of iterations to perform without reaching convergence
    • initialModel is an optional starting point from which to start the EM algorithm
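The EM loop behind this can be illustrated with a toy one-dimensional, two-component version (pure Python with deterministic initialization; spark.mllib's GaussianMixture does the multivariate analogue at scale):

```python
import math

def em_gmm_1d(xs, iters=20):
    """Toy EM for a two-component 1-D Gaussian mixture."""
    mu = [min(xs), max(xs)]        # deterministic starting means
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E step: responsibility of each component for each point
        resp = []
        for x in xs:
            dens = [pi[j] * math.exp(-(x - mu[j]) ** 2 / (2 * var[j]))
                    / math.sqrt(2 * math.pi * var[j]) for j in range(2)]
            s = dens[0] + dens[1]
            resp.append([d / s for d in dens])
        # M step: re-estimate weights, means, and variances
        for j in range(2):
            nj = sum(r[j] for r in resp)
            pi[j] = nj / len(xs)
            mu[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var[j] = max(1e-6, sum(r[j] * (x - mu[j]) ** 2
                                   for r, x in zip(resp, xs)) / nj)
    return pi, mu, var

pi, mu, var = em_gmm_1d([0.0, 0.5, 1.0, 9.0, 9.5, 10.0])
```

On these six points the means converge to roughly 0.5 and 9.5, with equal mixture weights.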


Other options for clustering with Apache Spark MLlib

  • Power iteration clustering (PIC)
    • Clusters the vertices of a graph given pairwise similarities as edge properties
    • Computes a pseudo-eigenvector of the normalized affinity matrix of the graph
    • k: number of clusters
    • maxIterations: maximum number of power iterations
    • initializationMode: initialization model. This can be either "random" (the default), which uses a random vector as vertex properties, or "degree", which uses the normalized sum of similarities

Lin, F. and Cohen, W.W., 2010. Power iteration clustering.
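The core of PIC, truncated power iteration on the row-normalized affinity matrix, can be sketched in plain Python (toy version; Spark distributes this over graph vertices). With a block-structured affinity matrix, entries of the resulting vector are nearly identical within a cluster but differ across clusters:

```python
def pic_vector(A, iters=10):
    """Truncated power iteration on W = D^-1 A (row-normalized affinities)."""
    n = len(A)
    W = [[a / sum(row) for a in row] for row in A]
    v = [1.0 / n + 0.001 * i for i in range(n)]   # perturbed uniform start
    for _ in range(iters):
        v = [sum(W[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(abs(x) for x in v)
        v = [x / norm for x in v]                 # keep the vector normalized
    return v

# Two tight pairs of vertices (0,1) and (2,3) with weak cross-links
A = [[1.0, 1.0, 0.01, 0.01],
     [1.0, 1.0, 0.01, 0.01],
     [0.01, 0.01, 1.0, 1.0],
     [0.01, 0.01, 1.0, 1.0]]
v = pic_vector(A)
```

Stopping after a bounded number of iterations matters: run to convergence, the vector flattens to the trivial constant eigenvector, while the intermediate pseudo-eigenvector still separates the clusters.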


SLIDE 19

Other options for clustering with Apache Spark MLlib

  • Bisecting k-means
    • Often much faster than regular k-means
    • It will generally produce a different clustering
  • Hierarchical clustering
    • Builds a hierarchy of clusters
    • Agglomerative
      • "Bottom-up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy
    • Divisive
      • "Top-down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy
    • The bisecting k-means algorithm is a kind of divisive algorithm
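A toy sketch of the divisive idea: repeatedly apply 2-means to the largest cluster until k clusters remain. Seeding the split from the two mutually farthest points is an illustration choice here, not what MLlib does, and a degenerate split can leave an empty cluster:

```python
import math

def two_means(points, iters=10):
    """Split one cluster into two with 2-means."""
    a = max(points, key=lambda p: math.dist(p, points[0]))   # far from first point
    b = max(points, key=lambda p: math.dist(p, a))           # far from a
    cents = [a, b]
    for _ in range(iters):
        groups = [[], []]
        for p in points:
            near = 0 if math.dist(p, cents[0]) <= math.dist(p, cents[1]) else 1
            groups[near].append(p)
        cents = [tuple(sum(c) / len(g) for c in zip(*g)) if g else cents[i]
                 for i, g in enumerate(groups)]
    return groups

def bisecting_kmeans(points, k):
    """Divisive clustering: keep bisecting the largest cluster until k remain."""
    clusters = [list(points)]
    while len(clusters) < k:
        big = max(clusters, key=len)      # pick the largest cluster to split
        clusters.remove(big)
        clusters.extend(two_means(big))
    return clusters
```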


Other options for clustering with Apache Spark MLlib

  • Streaming k-means
    • Estimates clusters dynamically
    • Updates them as new data arrive
    • Parameters control the decay (or "forgetfulness") of the estimates
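The forgetfulness idea can be sketched as a decayed centroid update, loosely following the rule in Spark's streaming k-means documentation (shown here for one cluster in one dimension; the exact bookkeeping of the point count is in the Spark docs). With α = 1 all history is kept; with α = 0 only the newest batch counts:

```python
def streaming_update(c, n, batch_mean, m, alpha):
    """Update a centroid c (supported by n past points) with a new batch
    of m points whose mean is batch_mean, decaying old data by alpha."""
    new_c = (c * n * alpha + batch_mean * m) / (n * alpha + m)
    new_n = n * alpha + m
    return new_c, new_n

print(streaming_update(0.0, 10, 10.0, 10, 1.0))  # prints (5.0, 20.0): balanced average
print(streaming_update(0.0, 10, 10.0, 10, 0.0))  # prints (10.0, 10.0): history forgotten
```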


SLIDE 20

Questions?
