 
CS535 Big Data | Computer Science | Colorado State University
03/02/2020, Week 7-A
Sangmi Lee Pallickara, Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs535
Spring 2020

CS535 BIG DATA
PART B. GEAR SESSIONS
SESSION 2: MACHINE LEARNING FOR BIG DATA

FAQs
• Quiz #3
  • Scores will be available by 3/6
• Programming Assignment #2
  • March 10
• Piazza discussion board
• Critical Review

Topics of Today's Class
• GEAR Session 2. Machine Learning for Big Data
  • Lecture 1.
    • Clustering Algorithms

GEAR Session 2. Machine Learning for Big Data
Lecture 1. Distributed Clustering Models

Clustering: Core concept
• Set of N-dimensional vectors
  • Can be in the order of millions
• Group (or cluster) them based on their proximity (or similarity) to each other in an N-dimensional space
• Vectors or objects in a cluster (or group) are more similar to each other than to those in any other group

Clustering: Applications
• Anomaly detection
• Fraud detection
• Recommendation systems
• Medical imaging
• Market research
• Human genetic clustering

This material is built based on:
• Arthur, D. and Vassilvitskii, S., 2007. k-means++: the advantages of careful seeding. Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA. pp. 1027–1035.
• Bahmani, B., Moseley, B., Vattani, A., Kumar, R. and Vassilvitskii, S., 2012. Scalable k-means++. arXiv preprint arXiv:1203.6402.
• Apache Spark MLlib: Clustering
  • https://spark.apache.org/docs/latest/ml-clustering.html

GEAR Session 2. Machine Learning for Big Data
Lecture 1. Distributed Clustering
Introduction

K-Means Clustering
• A set of unlabeled points
• Assumes that they form k clusters
• Find a set of cluster centers that minimizes the distance from each point to its nearest center
• Finding a global optimum is NP-hard; for fixed k and d, an optimal solution can be found in $O(n^{dk+1})$ time
• Many approximate algorithms are available
D. Aloise, A. Deshpande, P. Hansen, and P. Popat. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 75(2):245–248, 2009.

Concept: k-Means Clustering (1/4)
[Figure: scatter plot of unlabeled points with cluster centers marked "x"]

Concept: k-Means Clustering (2/4)
[Figure: scatter plot of unlabeled points with cluster centers marked "x"]

Concept: k-Means Clustering (3/4)
[Figure: scatter plot of unlabeled points with cluster centers marked "x"]

Concept: k-Means Clustering (4/4)
[Figure: scatter plot of unlabeled points with cluster centers marked "x"]

k-Means algorithm: Lloyd's Algorithm (1/2)
• Input
  • K (number of clusters)
  • Training set $\{x^{(1)}, x^{(2)}, x^{(3)}, \ldots, x^{(m)}\}$
  • $x^{(i)} \in \mathbb{R}^n$ (drop the $x_0 = 1$ convention)

k-Means algorithm: Lloyd's Algorithm (2/2)
• Randomly initialize K cluster centroids $\mu_1, \mu_2, \ldots, \mu_K \in \mathbb{R}^n$

  repeat {
    for i = 1 to m
      c^(i) := index (from 1 to K) of the cluster centroid closest to x^(i)
    for k = 1 to K
      μ_k := average (mean) of the points assigned to cluster k
  }

Cost function
• The objective is to find the partition $S = \{S_1, \ldots, S_K\}$ given by:
  $$\underset{S}{\arg\min} \sum_{i=1}^{K} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$
• Where $\mu_i$ is the mean of the points in $S_i$

k-Means for non-separated clusters
[Figure: two scatter plots, titled "Separated clusters" and "Non-Separated clusters"]

k-Means algorithm: Lloyd's Algorithm: Step-By-Step Instruction
1. Initialization Step
  • Select k random centers
  • Using a random uniform distribution
2. Assignment Step
  • Assign each observation to the cluster with the nearest center
  • Euclidean distance
3. Update Step
  • Calculate the new mean of the observations assigned to each cluster
  • Update centroids
4. Termination Step
  • Stop when the centroids do not change for two consecutive steps.
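
The four steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the course's reference implementation. The function names (`sq_dist`, `lloyd_kmeans`, `kmeans_cost`) and the `init`, `seed`, and `max_iters` parameters are hypothetical, and choosing the starting centers by sampling k input points uniformly is one assumed reading of "random uniform distribution".

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def closest(point, centroids):
    """Index of the centroid nearest to point (ties go to the lowest index)."""
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))


def lloyd_kmeans(points, k, init=None, seed=0, max_iters=100):
    """Run Lloyd's algorithm; returns (centroids, labels)."""
    # 1. Initialization: use the given centers, or sample k input points uniformly.
    centroids = list(init) if init is not None else random.Random(seed).sample(points, k)
    for _ in range(max_iters):
        # 2. Assignment: each point goes to its nearest centroid.
        labels = [closest(p, centroids) for p in points]
        # 3. Update: each centroid becomes the mean of its assigned points.
        new_centroids = []
        for j in range(k):
            members = [p for p, c in zip(points, labels) if c == j]
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
            else:
                mean = centroids[j]  # assumption: keep an empty cluster's old center
            new_centroids.append(mean)
        # 4. Termination: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, [closest(p, centroids) for p in points]


def kmeans_cost(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(sq_dist(p, centroids[c]) for p, c in zip(points, labels))


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
    centers, labels = lloyd_kmeans(data, k=2, init=[(0, 0), (1, 1)])
    print(centers, labels, kmeans_cost(data, centers, labels))
```

Because the result depends on the starting centers, different seeds can end in different local minima; passing `init` explicitly makes a run reproducible.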

Choosing the value K (1/2)
• Value K in the algorithm
[Figure: scatter plot of unlabeled points]

How to choose the number of clusters
• Elbow Method
[Figure: two plots of cost function J against K (no. of clusters); one curve has a marked "Elbow"]

Choosing the value K (2/2)
[Figure: two scatter plots with axes Waist and Sleeve Length; one is partitioned into Small, Medium, and Large, the other into Extra Small, Small, Medium, Large, and Extra Large]

Distance Measures
• Euclidean Distance
• Manhattan Distance
• Cosine Distance
• Hamming Distance
• Jaccard Dissimilarity
• Edit Distance
• Smith-Waterman Similarity
• Image Distance
• Etc.

k-Means algorithm: Lloyd's Algorithm: Strengths
• Embarrassingly parallel
• Converges to a local minimum
• O(nkdi) runtime (n points, k clusters, d dimensions, i iterations)

GEAR Session 2. Machine Learning for Big Data
Lecture 1. Distributed Clustering
Scalable k-means
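
To see why one iteration of Lloyd's algorithm is called embarrassingly parallel, note that each data partition can compute per-cluster coordinate sums and counts on its own, and only those small summaries need to be combined. The sketch below simulates this sequentially in plain Python. It is an assumed illustration, not Spark MLlib code and not the scalable k-means algorithm from the cited paper; `partial_stats`, `merge_stats`, and `distributed_iteration` are hypothetical names.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def partial_stats(partition, centroids):
    """Map step: per-cluster coordinate sums and counts for one partition."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for p in partition:
        # Assignment needs only the point and the current centroids.
        j = min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts


def merge_stats(stats_list, centroids):
    """Reduce step: add up the partial results and compute the new centroids."""
    k, dim = len(centroids), len(centroids[0])
    total_sums = [[0.0] * dim for _ in range(k)]
    total_counts = [0] * k
    for sums, counts in stats_list:
        for j in range(k):
            total_counts[j] += counts[j]
            for d in range(dim):
                total_sums[j][d] += sums[j][d]
    return [
        tuple(s / total_counts[j] for s in total_sums[j])
        if total_counts[j] else centroids[j]  # assumption: keep empty clusters in place
        for j in range(k)
    ]


def distributed_iteration(partitions, centroids):
    """One assignment-plus-update iteration over partitioned data."""
    # Each partial_stats call is independent, so a cluster could run them
    # on separate workers; here they run one after another.
    stats = [partial_stats(part, centroids) for part in partitions]
    return merge_stats(stats, centroids)


if __name__ == "__main__":
    parts = [
        [(0, 0), (0, 1), (10, 10)],
        [(1, 0), (1, 1), (10, 11)],
        [(11, 10), (11, 11)],
    ]
    print(distributed_iteration(parts, [(0, 0), (10, 10)]))
```

Each partition sends back only k sums and k counts regardless of how many points it holds, which is what keeps the combine step cheap.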