SLIDE 6 CS535 Big Data 03/02/2020 Week 7-A Sangmi Lee Pallickara http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University, page 6
Scalable K-Means++ (a.k.a. K-Means||) Algorithm
- ℓ: number of points sampled in each iteration (the oversampling factor)
- Total number of points in C is O(ℓ · log ψ) (> k)
- To reduce the number of centers:
- Step 8 assigns weights to the points in C
- Step 9 reclusters these weighted points to obtain k centers
1: C ← sample a point uniformly at random from X
2: ψ ← φ_X(C)
3: for O(log ψ) times do
4:   C′ ← sample each point x ∈ X independently with probability p_x = ℓ·d²(x, C)/φ_X(C)
5:   C ← C ∪ C′
6:   update ψ and continue the iteration
7: end for
8: For x ∈ C, set w_x to be the number of points in X closer to x than to any other point in C
9: Recluster the weighted points in C into k clusters
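As an illustration of lines 1-7 above, here is a minimal pure-Python sketch of the oversampling loop (1-D points and squared Euclidean distance for brevity; the function and variable names are illustrative, not from the paper):

```python
import math
import random

def cost(X, C):
    """phi_X(C): total squared distance from each point in X to its nearest center in C."""
    return sum(min((x - c) ** 2 for c in C) for x in X)

def kmeans_parallel_oversample(X, l, seed=0):
    """Sketch of the oversampling phase: grow a center set C of size > k."""
    rng = random.Random(seed)
    C = [rng.choice(X)]                          # line 1: one uniform seed center
    psi = cost(X, C)                             # line 2: initial cost psi
    for _ in range(max(1, round(math.log(psi + 1)))):   # line 3: O(log psi) rounds
        phi = cost(X, C) or 1.0                  # current cost phi_X(C)
        # line 4: keep each point independently with prob l * d^2(x, C) / phi
        C_prime = [x for x in X
                   if rng.random() < min(1.0, l * min((x - c) ** 2 for c in C) / phi)]
        C = C + C_prime                          # line 5: C = C U C'
        psi = cost(X, C)                         # line 6: update psi
    return C
```

Adding centers can only shrink each point's distance to its nearest center, so the cost of the final C never exceeds the cost of the single seed center.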
CS535 Big Data | Computer Science | Colorado State University
K-Means|| Algorithm: A Parallel Implementation
- Line 2: calculate the cost function
- How will you design RDDs for this?
- Line 4: each mapper (RDD partition) can sample independently
- How can you generate random numbers for sampling?
- Line 5: identical to line 2
1: C ← sample a point uniformly at random from X
2: ψ ← φ_X(C)
3: for O(log ψ) times do
4:   C′ ← sample each point x ∈ X independently with probability p_x = ℓ·d²(x, C)/φ_X(C)
5:   C ← C ∪ C′
6:   update ψ and continue the iteration
7: end for
8: For x ∈ C, set w_x to be the number of points in X closer to x than to any other point in C
9: Recluster the weighted points in C into k clusters
K-Means|| Algorithm: A Parallel Implementation
- Line 2: calculate the cost function
- RDD calc = all possible combinations between the RDD of points and the RDD of centers
- val calc = p.cartesian(c)
- Create an aggregator function (keep the minimum squared distance per point, then sum)
- Line 4: each mapper (RDD partition) can sample independently
- Generate a random number r per point; if r < the sampling probability, keep the point, otherwise drop it
- Line 5: identical to line 2
1: C ← sample a point uniformly at random from X
2: ψ ← φ_X(C)
3: for O(log ψ) times do
4:   C′ ← sample each point x ∈ X independently with probability p_x = ℓ·d²(x, C)/φ_X(C)
5:   C ← C ∪ C′
6:   update ψ and continue the iteration
7: end for
8: For x ∈ C, set w_x to be the number of points in X closer to x than to any other point in C
9: Recluster the weighted points in C into k clusters
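The cartesian-plus-aggregator idea for line 2 can be sketched in plain Python, with lists standing in for the points and centers RDDs and `itertools.product` playing the role of `p.cartesian(c)` (a sketch of the pattern, not Spark code):

```python
from itertools import product

def phi(points, centers):
    """Cost function via the cartesian-product pattern: pair every point
    with every center (like p.cartesian(c)), keep the minimum squared
    distance per point, then sum the minima (the aggregator function)."""
    calc = product(enumerate(points), centers)   # all (point, center) pairs
    best = {}                                    # point index -> min squared distance so far
    for (i, p), c in calc:
        d2 = (p - c) ** 2
        if i not in best or d2 < best[i]:
            best[i] = d2
    return sum(best.values())
```

In Spark the same shape would be a `cartesian` followed by a key-wise min and a sum, e.g. a `reduceByKey(min)` on the point key and then a global reduce.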
K-Means|| Algorithm: Properties
- Requires O(log ψ) iterations
- A constant number of iterations is usually enough in practice
- Creates an O(log k) approximation to the global optimum
- Uses results from probability and sequence theory
- Each iteration reduces the error by a constant factor plus a term proportional to the global error
- The expected value of the cost function approaches a multiple of the global optimum
- Performance (with KDDCup 1999 data)
- Paper: Bahmani, B., Moseley, B., Vattani, A., Kumar, R. and Vassilvitskii, S., 2012. Scalable k-means++. arXiv preprint arXiv:1203.6402
- Data: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
Note: all values scaled down by 10^10
Other options for clustering with Apache Spark MLlib
- Gaussian mixture
- Composite distribution: points are drawn from one of k Gaussian sub-distributions, each with its own probability
- spark.mllib implements the expectation-maximization (EM) algorithm to induce the maximum-likelihood model
- k is the number of desired clusters
- convergenceTol is the maximum change in log-likelihood at which convergence is considered achieved
- maxIterations is the maximum number of iterations to perform without reaching convergence
- initialModel is an optional starting point from which to start the EM algorithm
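To illustrate how these parameters interact, here is a toy 1-D EM loop whose parameter names mirror the spark.mllib options above; the implementation itself is a didactic sketch, not Spark's:

```python
import math

def gmm_em(xs, k=2, convergenceTol=1e-3, maxIterations=100):
    """Toy 1-D EM for a k-component Gaussian mixture. Parameter names
    mirror the spark.mllib options; everything else is illustrative."""
    mus = list(xs[:k])              # deterministic init (a stand-in for initialModel)
    sigmas = [1.0] * k
    weights = [1.0 / k] * k
    prev_ll = -float("inf")
    for _ in range(maxIterations):  # maxIterations bounds the EM loop
        # E-step: responsibility of each component for each point
        resp, ll = [], 0.0
        for x in xs:
            ps = [w * math.exp(-((x - m) ** 2) / (2 * s * s)) / (s * math.sqrt(2 * math.pi))
                  for w, m, s in zip(weights, mus, sigmas)]
            total = sum(ps) or 1e-300
            resp.append([p / total for p in ps])
            ll += math.log(total)
        # Stop once the change in log-likelihood drops below convergenceTol
        if abs(ll - prev_ll) < convergenceTol:
            break
        prev_ll = ll
        # M-step: re-estimate mixture weights, means, and variances
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            weights[j] = nj / len(xs)
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sigmas[j] = math.sqrt(max(var, 1e-6))
    return mus, sigmas, weights
```

On well-separated data the log-likelihood stabilizes within a few iterations, so convergenceTol, not maxIterations, is what typically ends the loop.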
Other options for clustering with Apache Spark MLlib
- Power iteration clustering (PIC)
- Clusters vertices of a graph given pairwise similarities as edge properties
- Computes a pseudo-eigenvector of the normalized affinity matrix of the graph
- k: number of clusters
- maxIterations: maximum number of power iterations
- initializationMode: initialization mode. Either "random" (the default), which uses a random vector as vertex properties, or "degree", which uses the normalized sum of similarities
Paper: Lin, F. and Cohen, W.W., 2010. Power iteration clustering. ICML 2010.
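The PIC idea above can be sketched in plain Python: row-normalize the affinity matrix, run a (lazy) power iteration to get a pseudo-eigenvector, then split its sorted entries at the largest gaps. This is a toy illustration of the technique, not Spark's implementation; the final gap-based split stands in for the 1-D k-means step, and the lazy (half-step) update is an added damping trick, not part of the original algorithm:

```python
def power_iteration_clustering(A, k=2, maxIterations=10):
    """Toy PIC sketch on a dense affinity matrix A (list of lists).
    Parameter names mirror the spark.mllib options described above."""
    n = len(A)
    # W = D^-1 A: the row-normalized (degree-normalized) affinity matrix
    W = [[A[i][j] / (sum(A[i]) or 1.0) for j in range(n)] for i in range(n)]
    # Degree-based start (cf. initializationMode="degree"); a uniform
    # vector would be an exact fixed point of W and reveal nothing
    deg = [sum(row) for row in A]
    total = sum(deg) or 1.0
    v = [d / total for d in deg]
    for _ in range(maxIterations):
        Wv = [sum(W[i][j] * v[j] for j in range(n)) for i in range(n)]
        v = [0.5 * vi + 0.5 * wi for vi, wi in zip(v, Wv)]  # lazy step damps oscillation
        norm = sum(abs(x) for x in v) or 1.0
        v = [x / norm for x in v]
    # Split the sorted pseudo-eigenvector at the k-1 largest gaps
    order = sorted(range(n), key=lambda i: v[i])
    gap_positions = sorted(range(n - 1),
                           key=lambda p: v[order[p + 1]] - v[order[p]],
                           reverse=True)[:k - 1]
    cuts = set(gap_positions)
    labels, cluster = [0] * n, 0
    for pos, i in enumerate(order):
        labels[i] = cluster
        if pos in cuts:
            cluster += 1
    return labels
```

Stopping after a modest number of iterations matters: the iteration eventually converges to a constant vector, and the cluster structure lives in the intermediate, not the limiting, vector.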