K-Means

Class: Algorithmic Methods of Data Mining
Program: M. Sc. Data Science
University: Sapienza University of Rome
Semester: Fall 2018
Slides by Carlos Castillo, http://chato.cl/

Sources:
● Mohammed J. Zaki, Wagner Meira, Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press, May 2014. Example 13.1.
● Evimaria Terzi: Data Mining course at Boston University, http://www.cs.bu.edu/~evimaria/cs565-13.html

The k-means problem
• consider a set X={x_1,...,x_n} of n points in R^d
• assume that the number k is given
• problem:
  • find k points c_1,...,c_k (named centers or means) so that the cost Σ_i min_j ||x_i − c_j||² (the sum, over all points, of the squared distance to the nearest center) is minimized
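The cost formula on this slide did not survive as text, so the sketch below assumes the standard k-means objective: the sum, over all points, of the squared Euclidean distance to the nearest center. The function names and the toy data are hypothetical, chosen only for illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_cost(points, centers):
    """Sum over all points of the squared distance to the nearest center."""
    return sum(min(sq_dist(x, c) for c in centers) for x in points)

# Hypothetical toy data: two tight pairs of points in the plane.
X = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = [(0.0, 1.0), (10.0, 1.0)]  # one center in the middle of each pair
bad = [(5.0, 0.0), (5.0, 2.0)]    # centers placed between the two pairs

print(kmeans_cost(X, good))  # 4.0
print(kmeans_cost(X, bad))   # 100.0
```

The k-means problem asks for the set of k centers that makes this number as small as possible.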

The k-means problem
• k=1 and k=n are easy special cases (why?)
• an NP-hard problem if the dimension of the data is at least 2 (d≥2)
• in practice, a simple iterative algorithm works quite well

The k-means algorithm
• voted among the top-10 algorithms in data mining
• one way of solving the k-means problem

K-means algorithm

The k-means algorithm
1. randomly (or with another method) pick k cluster centers {c_1,...,c_k}
2. for each j, set the cluster X_j to be the set of points in X that are closest to center c_j
3. for each j, let c_j be the center of cluster X_j (mean of the vectors in X_j)
4. repeat (go to step 2) until convergence
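The four steps above can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function names are hypothetical, "convergence" is taken to mean that no center moves, and an empty cluster is assumed to keep its previous center (the slides do not say how to handle that case).

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, centers=None, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: pick k initial centers (here: k distinct input points at random).
    if centers is None:
        centers = rng.sample(points, k)
    centers = list(centers)
    for _ in range(max_iter):
        # Step 2: assign each point to its closest center.
        clusters = [[] for _ in range(k)]
        for x in points:
            j = min(range(k), key=lambda i: sq_dist(x, centers[i]))
            clusters[j].append(x)
        # Step 3: move each center to the mean of its cluster.
        # Assumption: an empty cluster keeps its previous center.
        new_centers = [mean(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        # Step 4: repeat until no center moves.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Hypothetical toy data: two tight pairs of points in the plane.
X = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers, clusters = kmeans(X, 2, centers=[(0.0, 0.0), (10.0, 0.0)])
print(centers)  # [(0.0, 1.0), (10.0, 1.0)]
```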

Sample execution (figure)

1-dimensional clustering exercise
Exercise:
● For the data in the figure
● Run k-means with k=2 and initial centroids u1=2, u2=4 (Verify: last centroids are 18 units apart)
● Try with k=3 and initialization 2,3,30
Source: Exercise 13.1, http://www.dataminingbook.info/pmwiki.php/Main/BookDownload
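The exercise can be checked with a few lines of Python. The figure's data is not reproduced in this text, so the nine values below are an assumption: they are the 1-dimensional points recalled from Example 13.1 of Zaki and Meira, and they do reproduce the "18 units apart" check stated in the exercise. The function name is hypothetical, and ties are assumed to go to the lower-numbered centroid.

```python
def kmeans_1d(data, centers, max_iter=100):
    """k-means on 1-D data, starting from the given initial centroids."""
    centers = list(centers)
    for _ in range(max_iter):
        # Assign every point to its nearest centroid (ties: lower index wins).
        clusters = [[] for _ in centers]
        for x in data:
            j = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            clusters[j].append(x)
        # Recompute each centroid as the mean of its cluster.
        # Assumption: an empty cluster keeps its previous centroid.
        new_centers = [sum(c) / len(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# ASSUMED data (the figure is missing from this text).
data = [2, 3, 4, 10, 11, 12, 20, 25, 30]

centers, clusters = kmeans_1d(data, [2, 4])
print(centers)                  # [7.0, 25.0]
print(centers[1] - centers[0])  # 18.0
```

The second part of the exercise corresponds to calling `kmeans_1d(data, [2, 3, 30])`.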

Limitations of k-means
● Clusters of different size
● Clusters of different density
● Clusters of non-globular shape
● Sensitive to initialization

Limitations of k-means: different sizes (figure)

Limitations of k-means: different density (figure)

Limitations of k-means: non-spherical shapes (figure)

Effects of bad initialization (figure)

k-means algorithm
• finds a local optimum
• often converges quickly, but not always
• the choice of initial points can have a large influence on the result
• tends to find spherical clusters
• outliers can cause a problem
• different densities may cause a problem

Advanced: k-means initialization

Initialization
• random initialization
• random, but repeat many times and take the best solution
  • helps, but the solution can still be bad
• pick points that are distant from each other
• k-means++
  • provable guarantees
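The "repeat many times and take the best solution" strategy can be sketched as below, on 1-dimensional data for brevity. All names and the toy data are hypothetical. "Best" is assumed to mean lowest k-means cost (sum of squared distances to the nearest center), and each restart starts from k distinct data points chosen at random.

```python
import random

def lloyd_1d(data, centers, max_iter=100):
    """Plain k-means iterations on 1-D data from the given initial centers."""
    centers = list(centers)
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for x in data:
            j = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            clusters[j].append(x)
        # Assumption: an empty cluster keeps its previous center.
        new = [sum(c) / len(c) if c else centers[j]
               for j, c in enumerate(clusters)]
        if new == centers:
            break
        centers = new
    return centers

def cost_1d(data, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min((x - c) ** 2 for c in centers) for x in data)

def best_of_restarts(data, k, restarts=20, seed=0):
    """Run k-means from several random initializations; keep the cheapest."""
    rng = random.Random(seed)
    best_centers, best_cost = None, float("inf")
    for _ in range(restarts):
        centers = lloyd_1d(data, rng.sample(data, k))
        c = cost_1d(data, centers)
        if c < best_cost:
            best_centers, best_cost = centers, c
    return best_centers, best_cost

# Hypothetical toy data: three well-separated groups.
data = [0, 1, 2, 10, 11, 12, 20, 21, 22]
print(best_of_restarts(data, 3))
```

A single unlucky run can still get stuck: starting from the centers 0, 1, 2 on this data, the iterations stop with two centers inside the first group, which is the kind of bad local optimum the restarts are meant to avoid.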

k-means++
David Arthur and Sergei Vassilvitskii, k-means++: The advantages of careful seeding, SODA 2007

k-means algorithm: random initialization (two figures)

k-means algorithm: initialization with furthest-first traversal (two figures; the first marks points 1 to 4)
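A minimal sketch of furthest-first traversal as an initialization: each new center is the point farthest from the centers chosen so far. The function names, the optional `first` argument, and the toy data are hypothetical; the first center is assumed to be picked uniformly at random when `first` is not given.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def furthest_first(points, k, first=None, seed=0):
    rng = random.Random(seed)
    # Assumption: the first center is a uniformly random data point.
    centers = [first if first is not None else rng.choice(points)]
    while len(centers) < k:
        # Pick the point whose distance to its nearest chosen center is largest.
        nxt = max(points, key=lambda x: min(sq_dist(x, c) for c in centers))
        centers.append(nxt)
    return centers

# Hypothetical toy data.
pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 0.0)]
print(furthest_first(pts, 3, first=(0.0, 0.0)))
# [(0.0, 0.0), (10.0, 10.0), (10.0, 0.0)]
```

Because the rule always jumps to the most distant point, a single far-away outlier is selected as a center as soon as it is the farthest point.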

but... sensitive to outliers (two figures; the first marks points 1 to 3)

Here random may work well (figure)

k-means++ algorithm
• interpolate between the two methods
• let D(x) be the distance between x and the nearest center selected so far
• choose next center with probability proportional to (D(x))^a = D^a(x)
  ✦ a = 0: random initialization
  ✦ a = ∞: furthest-first traversal
  ✦ a = 2: k-means++
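The interpolation can be made concrete with one seeding routine parameterized by a. This is a sketch under stated assumptions: the function name is hypothetical, the first center is drawn uniformly at random, and a = ∞ is implemented as its limiting case (always take the farthest point) rather than by raising distances to an infinite power.

```python
import math
import random

def seed_centers(points, k, a, seed=0):
    """Choose k centers; each new one is drawn with probability ∝ D(x)**a."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]  # assumption: first center uniform at random
    while len(centers) < k:
        # D(x): distance from x to the nearest center selected so far.
        d = [min(math.dist(x, c) for c in centers) for x in points]
        if math.isinf(a):
            # Limit a -> infinity: all probability mass on the farthest point.
            nxt = points[max(range(len(points)), key=d.__getitem__)]
        else:
            # a = 0 gives uniform weights (a chosen point may be drawn again);
            # a = 2 gives the k-means++ rule.
            weights = [di ** a for di in d]
            nxt = rng.choices(points, weights=weights, k=1)[0]
        centers.append(nxt)
    return centers

# Hypothetical toy data: three nearby points and one distant point.
pts = [(0.0,), (1.0,), (2.0,), (10.0,)]
print(seed_centers(pts, 2, 0))         # uniform random seeding
print(seed_centers(pts, 2, 2))         # k-means++ seeding
print(seed_centers(pts, 2, math.inf))  # furthest-first traversal
```

For a > 0, already-selected points have D(x) = 0 and are never drawn again. The sketch does not handle the degenerate case where every remaining point coincides with a chosen center, in which all weights are zero.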

k-means++ algorithm
• initialization phase:
  • choose the first center uniformly at random
  • choose next center with probability proportional to D^2(x)
• iteration phase:
  • iterate as in the k-means algorithm until convergence
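The two phases can be sketched together as follows. Function names and the toy data are hypothetical; convergence is taken to mean that no center moves, and an empty cluster is assumed to keep its previous center.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeanspp_init(points, k, rng):
    """Initialization phase: D^2-weighted sampling of the centers."""
    centers = [rng.choice(points)]  # first center uniformly at random
    while len(centers) < k:
        # D^2(x): squared distance from x to the nearest chosen center.
        d2 = [min(sq_dist(x, c) for c in centers) for x in points]
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

def lloyd(points, centers, max_iter=100):
    """Iteration phase: the usual k-means assignment and update steps."""
    k = len(centers)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for x in points:
            j = min(range(k), key=lambda i: sq_dist(x, centers[i]))
            clusters[j].append(x)
        # Assumption: an empty cluster keeps its previous center.
        new = [tuple(sum(v) / len(c) for v in zip(*c)) if c else centers[j]
               for j, c in enumerate(clusters)]
        if new == centers:
            break
        centers = new
    return centers

def kmeanspp(points, k, seed=0):
    rng = random.Random(seed)
    return lloyd(points, kmeanspp_init(points, k, rng))

# Hypothetical toy data: two well-separated pairs on a line.
X = [(0.0,), (1.0,), (100.0,), (101.0,)]
print(sorted(kmeanspp(X, 2)))  # [(0.5,), (100.5,)]
```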

k-means++ initialization (figure; points marked 1 to 3)

k-means++ result (figure)

k-means++ provable guarantee
• the approximation guarantee comes just from the first iteration (initialization)
• subsequent iterations can only improve the cost (they never increase it)
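For reference, the guarantee proved by Arthur and Vassilvitskii (SODA 2007) can be stated as below. The symbols are chosen here for illustration: \phi is the k-means cost of the centers produced by the D^2 seeding, \phi_{\mathrm{OPT}} is the cost of an optimal set of k centers, and the expectation is over the random choices of the seeding.

```latex
\mathbb{E}[\phi] \;\le\; 8\,(\ln k + 2)\,\phi_{\mathrm{OPT}}
```

Reassigning points to their nearest center and moving each center to the mean of its cluster both leave the cost unchanged or lower it, so the same bound holds for the cost after the iteration phase.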

Lesson learned
• no reason to use k-means and not k-means++
• k-means++:
  • easy to implement
  • provable guarantee
  • works well in practice

k-means--
● Algorithm 4.1 in [Chawla & Gionis SDM 2013]
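The slide only cites the algorithm, so the following is a rough sketch of its general idea as commonly described, not a transcription of Algorithm 4.1: in every iteration the l points farthest from their nearest center are set aside as outliers, and the centers are recomputed from the remaining points only. The function names, the parameter name `l`, and the toy data are assumptions.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_minus_minus(points, centers, l, max_iter=100):
    """k-means that ignores the l farthest points when updating centers."""
    centers = list(centers)
    k = len(centers)
    for _ in range(max_iter):
        # Nearest center and squared distance to it, for every point.
        nearest = [min(range(k), key=lambda i: sq_dist(x, centers[i]))
                   for x in points]
        d = [sq_dist(x, centers[j]) for x, j in zip(points, nearest)]
        # The l points farthest from their center are treated as outliers.
        order = sorted(range(len(points)), key=lambda i: d[i], reverse=True)
        outliers = set(order[:l])
        # Update each center from its non-outlier points only.
        clusters = [[] for _ in range(k)]
        for i, (x, j) in enumerate(zip(points, nearest)):
            if i not in outliers:
                clusters[j].append(x)
        # Assumption: an empty cluster keeps its previous center.
        new = [tuple(sum(v) / len(c) for v in zip(*c)) if c else centers[j]
               for j, c in enumerate(clusters)]
        if new == centers:
            break
        centers = new
    return centers, [points[i] for i in sorted(outliers)]

# Hypothetical toy data: two groups plus one far-away point.
X = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,), (100.0,)]
centers, outliers = kmeans_minus_minus(X, [(0.0,), (10.0,)], l=1)
print(centers)   # [(1.0,), (11.0,)]
print(outliers)  # [(100.0,)]
```

With l=0 the loop reduces to the ordinary k-means iterations.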
