SLIDE 1

CS535 Big Data 03/02/2020 Week 7-A Sangmi Lee Pallickara http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University, page 1

CS535 BIG DATA

PART B. GEAR SESSIONS

SESSION 2: MACHINE LEARNING FOR BIG DATA

Sangmi Lee Pallickara Computer Science, Colorado State University http://www.cs.colostate.edu/~cs535

FAQs

  • Quiz #3
    • Scores will be available by 3/6
  • Programming Assignment #2
    • March 10
    • Piazza discussion board
  • Critical Review

CS535 Big Data | Computer Science | Colorado State University

SLIDE 2

Topics of Today's Class

  • GEAR Session 2. Machine Learning for Big Data
    • Lecture 1. Clustering Algorithms


GEAR Session 2. Machine Learning for Big Data

Lecture 1. Distributed Clustering Models


SLIDE 3

Clustering: Core concept

  • Set of N-dimensional vectors
    • Can be on the order of millions
  • Group (or cluster) them based on their proximity (or similarity) to each other in an N-dimensional space
  • Vectors or objects in a cluster (or group) are more similar to each other than to those in any other group

Clustering: Applications

  • Anomaly detection
  • Fraud detection
  • Recommendation systems
  • Medical imaging
  • Market research
  • Human genetic clustering


SLIDE 4

GEAR Session 2. Machine Learning for Big Data

Lecture 1. Distributed Clustering Introduction


This material is based on:

  • Arthur, D. and Vassilvitskii, S., 2007. k-means++: The advantages of careful seeding. Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, pp. 1027–1035.
  • Bahmani, B., Moseley, B., Vattani, A., Kumar, R. and Vassilvitskii, S., 2012. Scalable k-means++. arXiv preprint arXiv:1203.6402.
  • Apache Spark MLlib: Clustering. https://spark.apache.org/docs/latest/ml-clustering.html

SLIDE 5

K-Means Clustering

  • A set of unlabeled points
  • Assumes that they form k clusters
  • Find a set of cluster centers that minimizes the distance from each point to its nearest center
  • Finding the global optimum is NP-hard: O(n^(dk+1))
    • Many approximate algorithms are available
    • D. Aloise, A. Deshpande, P. Hansen, and P. Popat. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 75(2):245–248, 2009.

Concept: k-Means Clustering (1/4)

[Figure: scatter plot of points with two cluster centers marked ×; x-axis -10…6, y-axis -4…4]

SLIDE 6

Concept: k-Means Clustering (2/4)

[Figure: scatter plot of points with two cluster centers marked ×; x-axis -10…6, y-axis -4…4]

Concept: k-Means Clustering (3/4)

[Figure: scatter plot of points with two cluster centers marked ×; x-axis -10…6, y-axis -4…4]

SLIDE 7

Concept: k-Means Clustering (4/4)

[Figure: scatter plot of points with two cluster centers marked ×; x-axis -10…6, y-axis -4…4]

k-Means algorithm- Lloyd’s Algorithm (1/2)

  • Input
  • k (number of clusters)
  • Training set {x(1), x(2), x(3), …, x(m)}, where x(i) ∈ R^n (drop the x0 = 1 convention)

SLIDE 8

k-Means algorithm- Lloyd’s Algorithm (2/2)

  • Randomly initialize K cluster centroids μ1, μ2, …, μK ∈ R^n

repeat {
    for i = 1 to m:
        c(i) := index (from 1 to K) of the cluster centroid closest to x(i)
    for k = 1 to K:
        μk := average (mean) of the points assigned to cluster k
}
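The update loop above can be sketched in plain Python (an illustrative toy, not a reference implementation; `math.dist` needs Python 3.8+, and a cluster that empties simply keeps its old centroid):

```python
import math
import random

def lloyd_kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm on a list of equal-length coordinate tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # random initialization
    for _ in range(iters):
        # Assignment step: c(i) = index of the closest centroid
        assign = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: each centroid becomes the mean of its assigned points
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:                        # keep old centroid if cluster empties
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return centroids, assign
```

On four points forming two obvious groups, e.g. `[(0,0), (0,1), (10,10), (10,11)]` with k = 2, the loop settles on centroids (0, 0.5) and (10, 10.5).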

Cost function

  • The objective is to find:

      argmin_S Σ_{i=1..k} Σ_{x ∈ S_i} || x − μ_i ||²

  • where μ_i is the mean of the points in S_i
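The objective can be evaluated directly on a tiny hypothetical example (two clusters with their means):

```python
import math

def kmeans_cost(clusters, centroids):
    """Sum of squared Euclidean distances from each point to its cluster mean."""
    return sum(math.dist(x, mu) ** 2
               for S, mu in zip(clusters, centroids)
               for x in S)

# Each point sits 0.5 away from its cluster mean, so J = 4 * 0.5^2 = 1.0
S1, mu1 = [(0.0, 0.0), (0.0, 1.0)], (0.0, 0.5)
S2, mu2 = [(4.0, 4.0), (4.0, 5.0)], (4.0, 4.5)
print(kmeans_cost([S1, S2], [mu1, mu2]))  # prints 1.0
```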

SLIDE 9

k-Means algorithm- Lloyd’s Algorithm: Step-By-Step Instruction

  • 1. Initialization Step
    • Select k random centers
    • Using a uniform random distribution
  • 2. Assignment Step
    • Assign each observation to the cluster with the nearest centroid
    • Euclidean distance
  • 3. Update Step
    • Recompute each centroid as the mean of the observations assigned to its cluster
    • Update the centroids
  • 4. Termination Step
    • Stop when the centroids do not change for two consecutive steps.

k-Means for non-separated clusters

[Figure: two scatter plots; left: separated clusters, right: non-separated clusters]

SLIDE 10

How to choose the number of clusters

  • Value k in the algorithm

[Figure: scatter plot of unlabeled points; x-axis -10…6, y-axis -4…4]

Choosing the value K (1/2)

Elbow Method

[Figure: cost function J plotted against K (number of clusters); one curve shows a clear "elbow", the other does not]
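One way to automate reading the elbow off such a curve is to pick the k where the marginal cost reduction drops fastest, i.e., the maximum second difference. A small sketch with a hypothetical cost curve (illustrative heuristic, not part of the lecture's method):

```python
def pick_elbow(costs):
    """Pick k at the 'elbow': the maximum second difference of the cost curve.
    costs[i] is the cost function J for k = i + 1 clusters."""
    second_diff = [costs[i - 1] - 2 * costs[i] + costs[i + 1]
                   for i in range(1, len(costs) - 1)]
    return second_diff.index(max(second_diff)) + 2  # middle index -> k value

# Hypothetical J(k) for k = 1..6: big drops until k = 3, then nearly flat
costs = [100.0, 60.0, 20.0, 17.0, 15.0, 14.0]
print(pick_elbow(costs))  # prints 3
```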

SLIDE 11

Choosing the value K (2/2)

[Figure: t-shirt sizing example; Sleeve Length vs. Waist scatter plots segmented as Small/Medium/Large (K = 3) and as Extra Small through Extra Large (K = 5)]

Distance Measures

  • Euclidean Distance
  • Manhattan Distance
  • Cosine Distance
  • Hamming Distance
  • Jaccard Dissimilarity
  • Edit Distance
  • Smith Waterman Similarity
  • Image Distance
  • Etc.
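A few of these measures sketched in plain Python (illustrative only; libraries such as SciPy provide tested implementations):

```python
import math

def euclidean(a, b):
    return math.dist(a, b)                      # straight-line distance

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):                      # 1 - cosine similarity
    dot = sum(x * y for x, y in zip(a, b))
    return 1 - dot / (math.hypot(*a) * math.hypot(*b))

def hamming(a, b):                              # equal-length sequences
    return sum(x != y for x, y in zip(a, b))

def jaccard_dissimilarity(a, b):                # a and b are sets
    return 1 - len(a & b) / len(a | b)

print(euclidean((0, 0), (3, 4)))                     # prints 5.0
print(manhattan((0, 0), (3, 4)))                     # prints 7
print(hamming("karolin", "kathrin"))                 # prints 3
print(jaccard_dissimilarity({1, 2, 3}, {2, 3, 4}))   # prints 0.5
```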


SLIDE 12

GEAR Session 2. Machine Learning for Big Data

Lecture 1. Distributed Clustering Scalable k-means


k-Means algorithm- Lloyd’s Algorithm: Strengths

  • Embarrassingly parallel
  • Converges to a local minimum
  • O(nkdi) runtime (n points, k clusters, d dimensions, i iterations)

SLIDE 13

k-Means algorithm- Lloyd’s Algorithm: Weaknesses

  • O(nkdi) runtime
    • Worst case: the number of iterations i can grow superpolynomially, 2^Ω(√n)
  • Large number of local minima
    • Many local minima are poor
  • k is unknown

The K-Means++ Algorithm

  • Avoiding a cold start improves results
    • Reduces the total number of iterations
  • Initialize cluster centers sequentially
    • Only the first center is selected uniformly at random
    • Each subsequent center is selected probabilistically to be far from the existing centers
  • The result is an O(log k) approximation to the global optimum

SLIDE 14

The K-Means++ Algorithm: Step-by-Step description

Step 1. Choose one center uniformly at random from among the data points.
Step 2. For each data point x, compute D(x), the shortest distance between x and the nearest center that has already been chosen.
Step 3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)². (D² weighting)

  • D² weighting: take a new center, choosing x with probability D(x)² / Σ_{x′ ∈ X} D(x′)²

Step 4. Repeat Steps 2 and 3 until k centers have been chosen.
Step 5. Now that the initial centers have been chosen, proceed using standard k-means.

More info: http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf
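Steps 1 through 4 can be sketched as follows (a toy pure-Python version; it assumes at least k distinct points, since an already-chosen center gets weight 0 and cannot be re-picked):

```python
import math
import random

def kmeanspp_init(points, k, seed=0):
    """k-means++ seeding with D^2 weighting."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]          # Step 1: uniform pick
    while len(centers) < k:                 # Step 4: repeat until k centers
        # Step 2: D(x)^2 to the nearest already-chosen center
        d2 = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        # Step 3: next center with probability proportional to D(x)^2
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0), (9.1, 0.0)]
centers = kmeanspp_init(pts, 3, seed=1)
```

After seeding, Step 5 hands the chosen `centers` to a standard k-means loop.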


The K-Means++ Algorithm: How can we improve this?

  • GOAL 1: Can we reduce the number of iterations needed to initialize the centroids?
    • Can we select multiple centroids at a time?
  • GOAL 2: Can we handle non-uniformly distributed datasets?
    • Non-uniform selection?
  • GOAL 3: Can we keep reasonable approximation guarantees?

SLIDE 15

Scalable K-Means++(A.K.A. K-Means|| ) Algorithm

  • Parallel version for initializing the centers
  • Oversampling factor ℓ = Ω(k)
  • Select an initial center (uniformly at random)
  • Computes the initial cost of the clustering after this selection, ψ
  • Proceeds in log ψ iterations
  • Samples each x with probability ℓ · d²(x, C) / φ_X(C)

Scalable K-Means++(A.K.A. K-Means|| ) Algorithm

  • Let X = {x1, x2, …, xn} be a set of points in the d-dimensional Euclidean space
  • Let k be a positive integer specifying the number of clusters
  • Let || a − b || denote the Euclidean distance between a and b
  • For a point x and a subset Y ⊆ X of points, the distance is defined as
      d(x, Y) = min_{y ∈ Y} || x − y ||
  • For a subset Y ⊆ X, its centroid is given by
      centroid(Y) = (1/|Y|) Σ_{y ∈ Y} y
  • Let C = {c1, c2, …, ck} be a set of points and let Y ⊆ X
  • The cost of Y with respect to C is
      φ_Y(C) = Σ_{y ∈ Y} d²(y, C) = Σ_{y ∈ Y} min_{i=1..k} || y − c_i ||²

SLIDE 16

Scalable K-Means++(A.K.A. K-Means|| ) Algorithm

  • ℓ: number of points chosen in each iteration
  • The total number of points in C is ℓ · log ψ (> k)
  • To reduce the number of centers:
    • Step 8 assigns weights to the points in C
    • Step 9 reclusters these weighted points to obtain k centers

1: C ← sample a point uniformly at random from X
2: ψ ← φ_X(C)
3: for O(log ψ) times do
4:   C′ ← sample each point x ∈ X independently with probability ℓ · d²(x, C) / φ_X(C)
5:   C ← C ∪ C′
6:   update ψ and continue the iteration
7: end for
8: for x ∈ C, set w_x to be the number of points in X closer to x than to any other point in C
9: recluster the weighted points in C into k clusters
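A toy sequential sketch of this initialization (in Spark each sampling pass would be a distributed map; `rounds` stands in for the O(log ψ) iterations, and the final reclustering of the weighted candidates into k centers is left out):

```python
import math
import random

def kmeans_parallel_init(X, ell, rounds=3, seed=1):
    """k-means|| oversampling: returns candidate centers C and their weights."""
    rng = random.Random(seed)
    C = [rng.choice(X)]                                    # line 1: uniform pick
    for _ in range(rounds):                                # line 3
        cost = sum(min(math.dist(x, c) ** 2 for c in C) for x in X)  # phi_X(C)
        if cost == 0:                                      # every point is a center
            break
        # line 4: keep x independently with probability ell * d^2(x, C) / phi_X(C)
        C_prime = [x for x in X
                   if rng.random() < ell * min(math.dist(x, c) ** 2 for c in C) / cost]
        C.extend(C_prime)                                  # line 5
    # line 8: weight each candidate by how many points it is closest to
    w = [0] * len(C)
    for x in X:
        w[min(range(len(C)), key=lambda j: math.dist(x, C[j]))] += 1
    return C, w    # line 9 would recluster (C, w) into k centers, e.g. with k-means++
```

Since a point already in C has d²(x, C) = 0, it is never re-sampled, so the candidates stay distinct.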


K-Means|| Algorithm: A Parallel Implementation

  • Line 2: calculate the cost function
    • How will you design RDDs for this?
  • Line 4: each mapper (RDD) can sample independently
    • How can you sample the numbers?
  • Line 5: identical to line 2

SLIDE 17

K-Means|| Algorithm: A Parallel Implementation

  • Line 2: calculate the cost function
    • RDD calc = all possible combinations between RDD points (for the points) and RDD centers (for the centers)
    • val calc = p.cartesian(c)
    • Create an aggregator function
  • Line 4: each mapper (RDD) can sample independently
    • Generate a random number r per point; if r < probability, sample the point, otherwise drop it
  • Line 5: identical to line 2

K-Means|| Algorithm: Properties

  • Requires O(log ψ) iterations
    • A constant number of iterations is usually enough in practice
  • Creates an O(log k) approximation to the global optimum
    • Uses results from probability and sequence theory
    • Each iteration reduces the error by a constant factor plus a term proportional to the global error
    • The expected value of the cost function approaches a multiple of the global optimum
  • Performance (with KDD Cup 1999 data)
    • Paper: Bahmani, B., Moseley, B., Vattani, A., Kumar, R. and Vassilvitskii, S., 2012. Scalable k-means++. arXiv preprint arXiv:1203.6402.
    • Data: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
    • Note: all values scaled down by 10^10

SLIDE 18

Other options for clustering with Apache Spark MLlib

  • Gaussian mixture
    • Composite distribution: points are drawn from one of k Gaussian sub-distributions, each with its own probability
    • spark.mllib implements the expectation-maximization algorithm to induce the maximum-likelihood model
    • k is the number of desired clusters
    • convergenceTol is the maximum change in log-likelihood at which convergence is considered achieved
    • maxIterations is the maximum number of iterations to perform without reaching convergence
    • initialModel is an optional starting point from which to start the EM algorithm
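The EM loop behind this can be illustrated with a toy one-dimensional, two-component version (pure Python with deterministic initialization; spark.mllib's GaussianMixture does the multivariate analogue at scale):

```python
import math

def em_gmm_1d(xs, iters=20):
    """Toy EM for a two-component 1-D Gaussian mixture."""
    mu = [min(xs), max(xs)]        # deterministic starting means
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E step: responsibility of each component for each point
        resp = []
        for x in xs:
            dens = [pi[j] * math.exp(-(x - mu[j]) ** 2 / (2 * var[j]))
                    / math.sqrt(2 * math.pi * var[j]) for j in range(2)]
            s = dens[0] + dens[1]
            resp.append([d / s for d in dens])
        # M step: re-estimate weights, means, and variances
        for j in range(2):
            nj = sum(r[j] for r in resp)
            pi[j] = nj / len(xs)
            mu[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var[j] = max(1e-6, sum(r[j] * (x - mu[j]) ** 2
                                   for r, x in zip(resp, xs)) / nj)
    return pi, mu, var

pi, mu, var = em_gmm_1d([0.0, 0.5, 1.0, 9.0, 9.5, 10.0])
```

On these six points the means converge to roughly 0.5 and 9.5, with equal mixture weights.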


Other options for clustering with Apache Spark MLlib

  • Power iteration clustering (PIC)
    • Clusters the vertices of a graph given pairwise similarities as edge properties
    • Computes a pseudo-eigenvector of the normalized affinity matrix of the graph
    • k: number of clusters
    • maxIterations: maximum number of power iterations
    • initializationMode: initialization model. This can be either "random" (the default), which uses a random vector as vertex properties, or "degree", which uses the normalized sum of similarities

Lin, F. and Cohen, W.W., 2010. Power iteration clustering.
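The core of PIC, truncated power iteration on the row-normalized affinity matrix, can be sketched in plain Python (toy version; Spark distributes this over graph vertices). With a block-structured affinity matrix, entries of the resulting vector are nearly identical within a cluster but differ across clusters:

```python
def pic_vector(A, iters=10):
    """Truncated power iteration on W = D^-1 A (row-normalized affinities)."""
    n = len(A)
    W = [[a / sum(row) for a in row] for row in A]
    v = [1.0 / n + 0.001 * i for i in range(n)]   # perturbed uniform start
    for _ in range(iters):
        v = [sum(W[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(abs(x) for x in v)
        v = [x / norm for x in v]                 # keep the vector normalized
    return v

# Two tight pairs of vertices (0,1) and (2,3) with weak cross-links
A = [[1.0, 1.0, 0.01, 0.01],
     [1.0, 1.0, 0.01, 0.01],
     [0.01, 0.01, 1.0, 1.0],
     [0.01, 0.01, 1.0, 1.0]]
v = pic_vector(A)
```

Stopping after a bounded number of iterations matters: run to convergence, the vector flattens to the trivial constant eigenvector, while the intermediate pseudo-eigenvector still separates the clusters.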


SLIDE 19

Other options for clustering with Apache Spark MLlib

  • Bisecting k-means
    • Often much faster than regular k-means
    • It will generally produce a different clustering
  • Hierarchical clustering
    • Builds a hierarchy of clusters
    • Agglomerative
      • "Bottom-up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy
    • Divisive
      • "Top-down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy
    • The bisecting k-means algorithm is a kind of divisive algorithm
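A toy sketch of the divisive idea: repeatedly apply 2-means to the largest cluster until k clusters remain. Seeding the split from the two mutually farthest points is an illustration choice here, not what MLlib does, and a degenerate split can leave an empty cluster:

```python
import math

def two_means(points, iters=10):
    """Split one cluster into two with 2-means."""
    a = max(points, key=lambda p: math.dist(p, points[0]))   # far from first point
    b = max(points, key=lambda p: math.dist(p, a))           # far from a
    cents = [a, b]
    for _ in range(iters):
        groups = [[], []]
        for p in points:
            near = 0 if math.dist(p, cents[0]) <= math.dist(p, cents[1]) else 1
            groups[near].append(p)
        cents = [tuple(sum(c) / len(g) for c in zip(*g)) if g else cents[i]
                 for i, g in enumerate(groups)]
    return groups

def bisecting_kmeans(points, k):
    """Divisive clustering: keep bisecting the largest cluster until k remain."""
    clusters = [list(points)]
    while len(clusters) < k:
        big = max(clusters, key=len)      # pick the largest cluster to split
        clusters.remove(big)
        clusters.extend(two_means(big))
    return clusters
```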


Other options for clustering with Apache Spark MLlib

  • Streaming k-means
    • Estimates clusters dynamically
    • Updates them as new data arrive
    • Parameters control the decay (or "forgetfulness") of the estimates
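The forgetfulness idea can be sketched as a decayed centroid update, loosely following the rule in Spark's streaming k-means documentation (shown here for one cluster in one dimension; the exact bookkeeping of the point count is in the Spark docs). With α = 1 all history is kept; with α = 0 only the newest batch counts:

```python
def streaming_update(c, n, batch_mean, m, alpha):
    """Update a centroid c (supported by n past points) with a new batch
    of m points whose mean is batch_mean, decaying old data by alpha."""
    new_c = (c * n * alpha + batch_mean * m) / (n * alpha + m)
    new_n = n * alpha + m
    return new_c, new_n

print(streaming_update(0.0, 10, 10.0, 10, 1.0))  # prints (5.0, 20.0): balanced average
print(streaming_update(0.0, 10, 10.0, 10, 0.0))  # prints (10.0, 10.0): history forgotten
```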


SLIDE 20

Questions?
