Pattern Analysis and Machine Intelligence
Lecture Notes on Clustering (I) 2012-2013
Davide Eynard
davide.eynard@usi.ch
Department of Electronics and Information Politecnico di Milano
– p. 1/23
Some Info
http://davide.eynard.it davide.eynard@usi.ch
Learning: Data Mining, Inference, and Prediction"
– p. 2/23
Course Schedule [Tentative]
Date        Topic
06/05/2012  Clustering I: Introduction, K-means
07/05/2012  Clustering II: K-M alternatives, Hierarchical, SOM
13/05/2012  Clustering III: Mixture of Gaussians, DBSCAN, J-P
14/05/2012  Clustering IV: Spectral Clustering (+Text?)
20/05/2012  Clustering V: Evaluation Measures
– p. 3/23
Today’s Outline
– p. 4/23
Clustering: a definition
"The process of organizing objects into groups whose members are similar in some way" J.A. Hartigan, 1975 "An algorithm by which objects are grouped in classes, so that intra-class similarity is maximized and inter-class similarity is minimized"
"... grouping or segmenting a collection of objects into subsets or clusters, such that those within each cluster are more closely related to one another than objects assigned to different clusters"
– p. 5/23
Clustering: a definition
can be used for reasoning or prediction"
– p. 6/23
(Some) Applications of Clustering
advertising
– p. 7/23
Example: Clustering (CDs/Movies/Books/...)
what are categories actually?
dimension may be 0 or 1 only)
user liked it
– p. 8/23
Requirements
parameters
– p. 9/23
Question
What if we had a dataset like this?
– p. 10/23
Problems
There are a number of problems with clustering. Among them:
adequately (and concurrently);
items can be problematic because of time complexity;
(for distance-based clustering);
(which is not always easy, especially in multi-dimensional spaces);
arbitrary itself) can be interpreted in different ways (see Boyd, Crawford: "Six Provocations for Big Data").
– p. 11/23
Clustering Algorithms Classification
– p. 12/23
Distance Measures
– p. 13/23
Distances vs Similarities
between two data objects...
calculated)
– p. 14/23
Similarity through distance
– p. 15/23
Distances for numeric attributes
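\[
d_{ij} = \left( \sum_{k=1}^{p} |x_{ik} - x_{jk}|^{q} \right)^{1/q}
\]
where x_i = (x_{i1}, ..., x_{ip}) and x_j = (x_{j1}, ..., x_{jp}) are two p-dimensional data objects, and q is a positive integer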
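A minimal sketch of a Minkowski-type distance of order q for numeric attributes, assuming data objects are plain Python sequences of equal length. The function name minkowski_distance is illustrative and not taken from the lecture.

```python
def minkowski_distance(x, y, q):
    """Minkowski distance of order q between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of dimensions")
    if q < 1:
        raise ValueError("q must be a positive integer")
    # Sum the q-th powers of the per-dimension absolute differences,
    # then take the q-th root of the total.
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


# q = 1 gives the Manhattan distance, q = 2 the Euclidean distance.
manhattan = minkowski_distance([0, 0], [3, 4], 1)
euclidean = minkowski_distance([0, 0], [3, 4], 2)
```

With the two points (0, 0) and (3, 4), the q = 1 case adds the coordinate differences, while the q = 2 case is the usual straight-line distance.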
– p. 16/23
K-Means Algorithm
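A minimal sketch of the standard K-means iteration (Lloyd's algorithm), assuming squared Euclidean distance and initial centroids drawn at random from the data. The names k_means, squared_distance, max_iter and seed are illustrative assumptions, not the lecture's own notation.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def k_means(points, k, max_iter=100, seed=0):
    """Return (centroids, assignment) after iterating until assignments stop changing."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster: converged
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignment


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = k_means(points, k=2)
```

On this toy data set the first three points and the last three points end up in separate clusters. In general the result depends on the initial centroids, so different seeds can give different partitions.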
– p. 17/23
K-Means: A numerical example
– p. 18/23
K-Means: still alive?
Time for some demos!
– p. 19/23
K-Means: Summary
(k, t ≪ n)
– p. 20/23
K-Means application: Vector Quantization
cluster centroid (called codeword);
Expected size of the compressed data, as a fraction of the original: log2(K)/(4 · 8).
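A small sketch of the arithmetic behind this ratio. It reads the 4 · 8 in the denominator as blocks of 4 values stored at 8 bits each, which is an assumption since the slide does not spell it out. Each block is replaced by the index of its codeword, which needs log2(K) bits, and the storage for the codebook itself is ignored. The function and parameter names are hypothetical.

```python
import math


def compressed_fraction(num_codewords, values_per_block=4, bits_per_value=8):
    """Fraction of the original size needed when each block is replaced by a codeword index."""
    # Assumption: an uncompressed block takes values_per_block * bits_per_value bits.
    original_bits = values_per_block * bits_per_value
    # A codeword index among K codewords takes log2(K) bits.
    compressed_bits = math.log2(num_codewords)
    return compressed_bits / original_bits


fraction_k4 = compressed_fraction(4)
fraction_k256 = compressed_fraction(256)
```

Under these assumptions, K = 4 codewords need 2 bits per block instead of 32, and K = 256 codewords need 8 bits per block.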
– p. 21/23
Bibliography
Tutorial slides by P.L. Lanzi
Tutorial slides by J.D. Ullman
(Manning, 2009)
Data Mining, Inference, and Prediction"
– p. 22/23