
CS535 Big Data 03/02/2020 Week 7-A Sangmi Lee Pallickara



  1. CS535 Big Data | Computer Science | Colorado State University
     03/02/2020, Week 7-A, Spring 2020
     Sangmi Lee Pallickara
     Computer Science, Colorado State University
     http://www.cs.colostate.edu/~cs535

     CS535 BIG DATA
     PART B. GEAR SESSIONS
     SESSION 2: MACHINE LEARNING FOR BIG DATA

     FAQs
     • Quiz #3
       • Scores will be available by 3/6
     • Programming Assignment #2
       • March 10
     • Piazza discussion board
     • Critical Review

     Topics of Today's Class
     • GEAR Session 2. Machine Learning for Big Data
       • Lecture 1.
         • Clustering Algorithms

     GEAR Session 2. Machine Learning for Big Data
     Lecture 1. Distributed Clustering Models

     Clustering: Core concept
     • Set of N-dimensional vectors
       • Can be in the order of millions
     • Group (or cluster) them based on their proximity (or similarity) to each other in an N-dimensional space
     • Vectors or objects in a cluster (or group) are more similar to each other than to those in any other group

     Clustering: Applications
     • Anomaly detection
     • Fraud detection
     • Recommendation systems
     • Medical imaging
     • Market research
     • Human genetic clustering

  2. This material is built based on:
     • Arthur, D.; Vassilvitskii, S. (2007). "k-means++: the advantages of careful seeding". Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA. pp. 1027–1035.
     • Bahmani, B., Moseley, B., Vattani, A., Kumar, R. and Vassilvitskii, S. (2012). Scalable k-means++. arXiv preprint arXiv:1203.6402.
     • Apache Spark MLlib: Clustering
       • https://spark.apache.org/docs/latest/ml-clustering.html

     GEAR Session 2. Machine Learning for Big Data
     Lecture 1. Distributed Clustering
     Introduction

     Concept: k-Means Clustering (1/4)
     [Figure: scatter plot of unlabeled points, with x markers]

     K-Means Clustering
     • A set of unlabeled points
     • Assumes that they form k clusters
     • Find a set of cluster centers that minimize the distance to the nearest center
     • Finding a global optimum is NP-hard: $O(n^{dk+1})$
     • Many approximate algorithms are available
     D. Aloise, A. Deshpande, P. Hansen, and P. Popat. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 75(2):245–248, 2009.

     Concept: k-Means Clustering (2/4)
     [Figure: scatter plot, with x markers]

     Concept: k-Means Clustering (3/4)
     [Figure: scatter plot, with x markers]
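The goal stated on this page, finding cluster centers that minimize the distance to the nearest center, can be made concrete with a small scoring function. The sketch below is illustrative only: the function name `nearest_center_cost` and the toy data are hypothetical, and it assumes squared Euclidean distance, as in the usual k-means objective.

```python
def nearest_center_cost(points, centers):
    """Sum, over all points, of the squared Euclidean distance
    to whichever center is nearest to that point."""
    total = 0.0
    for p in points:
        # Squared distance from p to every candidate center; keep the smallest.
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
    return total


# Toy data (hypothetical): two tight groups of two points each.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

# One center per group versus two centers placed between the groups.
good_centers = [(0.0, 0.5), (4.0, 0.5)]
bad_centers = [(2.0, 0.0), (2.0, 1.0)]

good_cost = nearest_center_cost(points, good_centers)  # 4 * 0.25 = 1.0
bad_cost = nearest_center_cost(points, bad_centers)    # 4 * 4.0 = 16.0
```

A lower cost means the centers sit closer to the points they serve. Comparing candidate center sets this way is cheap; the hard part is searching over all possible center sets.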

  3. Concept: k-Means Clustering (4/4)
     [Figure: scatter plot, with x markers]

     k-Means algorithm - Lloyd's Algorithm (1/2)
     • Input
       • k (number of clusters)
       • Training set $\{x^{(1)}, x^{(2)}, x^{(3)}, \ldots, x^{(m)}\}$
       • $x^{(i)} \in \mathbb{R}^n$ (drop $x_0 = 1$ convention)

     k-Means algorithm - Lloyd's Algorithm (2/2)
     • Randomly initialize K cluster centroids $\mu_1, \mu_2, \ldots, \mu_K \in \mathbb{R}^n$
       repeat {
         for i = 1 to m
           $c^{(i)}$ := index (from 1 to K) of cluster centroid closest to $x^{(i)}$
         for k = 1 to K
           $\mu_k$ := average (mean) of points assigned to cluster k
       }

     Cost function
     • The objective is to find:
       $$\underset{S}{\operatorname{argmin}} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$
     • Where $\mu_i$ is the mean of points in $S_i$

     k-Means for non-separated clusters
     [Figure: two scatter plots, "Separated clusters" and "Non-Separated clusters"]

     k-Means algorithm - Lloyd's Algorithm: Step-By-Step Instruction
     1. Initialization Step
        • Select random k centers
        • Using a random uniform distribution
     2. Assignment Step
        • Assign each observation to the nearest cluster
        • Euclidean distance
     3. Update Step
        • Calculate the new mean of the observations assigned to each cluster
        • Update centroids
     4. Termination Step
        • Stop when the centroids do not change for two consecutive steps.
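The four steps of Lloyd's algorithm described on this page can be sketched in plain Python. This is a minimal illustration rather than the course's reference implementation. The names `lloyd_kmeans` and `squared_distance`, the `max_iters` cap, and the toy data are hypothetical. It also makes two assumptions not stated on the slides: initial centers are drawn uniformly at random from the observations, and a cluster that ends up empty keeps its previous centroid.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd_kmeans(points, k, max_iters=100, seed=0):
    """Return (centroids, assignment) after running Lloyd's algorithm."""
    rng = random.Random(seed)
    # 1. Initialization: choose k distinct observations uniformly at random
    #    (assumption: centers are sampled from the data itself).
    centroids = rng.sample(points, k)
    assignment = []
    for _ in range(max_iters):
        # 2. Assignment: each observation goes to its nearest centroid.
        assignment = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # 3. Update: each centroid becomes the mean of its assigned points.
        new_centroids = []
        for j in range(k):
            members = [p for p, c in zip(points, assignment) if c == j]
            if members:
                dim = len(members[0])
                new_centroids.append(
                    tuple(sum(m[d] for m in members) / len(members)
                          for d in range(dim))
                )
            else:
                # Assumption: an empty cluster keeps its old centroid.
                new_centroids.append(centroids[j])
        # 4. Termination: stop once an update leaves the centroids unchanged.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignment


# Toy data (hypothetical): two well-separated groups of three points.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = lloyd_kmeans(data, k=2)
```

On data this well separated, every possible initialization leads to the same two clusters. On non-separated data the result generally depends on the starting centers, because the algorithm only converges to a local minimum of the cost function.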

  4. Choosing the value K (1/2)
     • Value k in the algorithm
     [Figure: scatter plot of points]

     How to choose the number of clusters
     • Elbow Method
     [Figure: two plots of cost function J against K (no. of clusters); one curve has a marked "Elbow"]

     Choosing the value K (2/2)
     [Figure: two scatter plots with axes Waist and Sleeve Length; one grouped into Small, Medium, Large, the other into Extra Small, Small, Medium, Large, Extra Large]

     Distance Measures
     • Euclidean Distance
     • Manhattan Distance
     • Cosine Distance
     • Hamming Distance
     • Jaccard Dissimilarity
     • Edit Distance
     • Smith Waterman Similarity
     • Image Distance
     • Etc.

     k-Means algorithm - Lloyd's Algorithm: Strengths
     • Embarrassingly parallel
     • Converges to a local minimum
     • O(nkdi) runtime (n points, k clusters, d dimensions, i iterations)

     GEAR Session 2. Machine Learning for Big Data
     Lecture 1. Distributed Clustering
     Scalable k-means
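Several of the distance measures listed on this page are short enough to sketch directly. The functions below are illustrative, with hypothetical names, and rest on common conventions that the slides do not spell out: cosine distance is taken as one minus cosine similarity, Hamming distance is defined only for equal-length sequences, and Jaccard dissimilarity is one minus the ratio of intersection size to union size.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def cosine_distance(p, q):
    """1 - cosine similarity (assumed convention); vectors must be non-zero."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / (norm_p * norm_q)


def hamming(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(a != b for a, b in zip(s, t))


def jaccard_dissimilarity(a, b):
    """1 - |A intersect B| / |A union B|; two empty sets count as identical."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Small worked examples.
d_euclid = euclidean((0, 0), (3, 4))                      # 5.0
d_manhattan = manhattan((0, 0), (3, 4))                   # 7
d_cosine = cosine_distance((1, 0), (0, 1))                # 1.0 (orthogonal)
d_hamming = hamming("karolin", "kathrin")                 # 3
d_jaccard = jaccard_dissimilarity({1, 2, 3}, {2, 3, 4})   # 0.5
```

Which measure fits depends on the data: Euclidean and Manhattan suit numeric vectors, cosine distance ignores vector magnitude, and Hamming and Jaccard apply to strings and sets respectively.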
