SLIDE 3 K-means Clustering
The key parameter you have to select for k-means is k, the number of clusters. You would typically choose k based on the number of clusters you expect in the data; for example, you might expect about 10 clusters corresponding to the places where you typically stay during a day. Given k, the k-means algorithm is an iterative procedure with the following steps:
1. Select k initial centroids at random from the points.
2. Repeat:
   a. Assign each object to the cluster with the nearest centroid (in terms of distance).
   b. Re-compute each centroid as the mean of the objects assigned to it.
3. Until the centroids do not change.
While this algorithm can converge to a suboptimal clustering, depending on the initial centroids, it is fast and really easy to implement. Let's look at a simple SQL implementation (from Joni Salonen [2]). Suppose we have some location data, already geocoded into latitude-longitude pairs, and we want to find clusters of locations that lie close to each other. We'll use two tables: gps_data to store the data and the cluster assigned to each point, and gps_clusters for the cluster centers:
create table gps_data (
  id int primary key,
  cluster_id int,
  lat double,
  lng double
);
create table gps_clusters (
  id int auto_increment primary key,
  lat double,
  lng double
);
The K-means algorithm can now be implemented with the following procedure.
DELIMITER //
CREATE PROCEDURE kmeans(v_K int)
BEGIN
  TRUNCATE gps_clusters;
  -- initialize cluster centers
  -- (LIMIT without ORDER BY takes whichever v_K rows come back first;
  --  add ORDER BY RAND() before LIMIT for a random selection)
  INSERT INTO gps_clusters (lat, lng)
    SELECT lat, lng FROM gps_data LIMIT v_K;
  REPEAT
    -- assign clusters to data points
    UPDATE gps_data d SET cluster_id = (
      SELECT id FROM gps_clusters c
      ORDER BY POW(d.lat-c.lat,2)+POW(d.lng-c.lng,2) ASC
      LIMIT 1);
    -- calculate new cluster center
    UPDATE gps_clusters C,
      (SELECT cluster_id, AVG(lat) AS lat, AVG(lng) AS lng
       FROM gps_data GROUP BY cluster_id) D
    SET C.lat=D.lat, C.lng=D.lng
    WHERE C.id=D.cluster_id;
  -- stop once the last UPDATE changed no cluster center
  UNTIL ROW_COUNT() = 0 END REPEAT;
END//
DELIMITER ;
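To try the procedure out, something like the following sketch could be used. It assumes the gps_data and gps_clusters tables and the kmeans procedure have been created as above. The six coordinates are made-up sample points (two visibly separate groups), not real data, and the within_cluster_ss column name is only illustrative.

```sql
-- Hypothetical sample data: two well-separated groups of three points each.
INSERT INTO gps_data (id, lat, lng) VALUES
  (1, 60.17, 24.94),
  (2, 60.18, 24.95),
  (3, 60.16, 24.93),
  (4, 48.85,  2.35),
  (5, 48.86,  2.34),
  (6, 48.87,  2.36);

-- Run k-means with k = 2.
CALL kmeans(2);

-- Inspect the cluster centers the procedure converged to.
SELECT id, lat, lng FROM gps_clusters;

-- Count how many points were assigned to each cluster.
SELECT cluster_id, COUNT(*) AS n_points
FROM gps_data
GROUP BY cluster_id;

-- Sum of squared distances from each point to its assigned center.
-- Lower is better when comparing runs with the same k.
SELECT SUM(POW(d.lat - c.lat, 2) + POW(d.lng - c.lng, 2)) AS within_cluster_ss
FROM gps_data d
JOIN gps_clusters c ON c.id = d.cluster_id;
```

Because the result depends on the initial centers, comparing the last query's value across runs with different starting points is one simple way to spot a suboptimal clustering.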
The above code should be quite useful for your assignment as well…