

SLIDE 1

Clustering algorithms

Machine Learning Hamid Beigy

Sharif University of Technology

Fall 1393

Hamid Beigy (Sharif University of Technology) Clustering algorithms Fall 1393 1 / 22

SLIDE 2

Table of contents

1. Supervised & unsupervised learning
2. Clustering
3. Hierarchical clustering
4. Non-hierarchical clustering


SLIDE 3

Supervised & unsupervised learning

The learning methods covered in class up to this point have focused on classification and regression.

An example consists of a pair of variables (x, t), where x is a feature vector and t is the label/value. Such learning problems are called supervised, since the system is given both the feature vector and the correct answer.

We will investigate methods that operate on unlabeled data.

Given a collection of feature vectors X = {x1, x2, . . . , xN} without labels/values ti, these methods attempt to build a model that captures the structure of the data. These methods are called unsupervised since they are not provided with the correct answer.

Although unsupervised learning methods may appear to have limited capabilities, there are several reasons that make them useful:

- Labeling large data sets can be a costly procedure, but raw data is cheap.
- Class labels may not be known beforehand.
- Large datasets can be compressed by finding a small set of prototypes.
- One can train with a large amount of unlabeled data, and then use supervision to label the groupings found.
- Unsupervised methods can be used for feature extraction.
- Exploratory data analysis can provide insight into the nature or structure of the data.


SLIDE 4

Unsupervised Learning

Unsupervised learning algorithms

- Non-parametric methods: make no assumption about the underlying densities; instead we seek a partition of the data into clusters.
- Parametric methods: model the underlying class-conditional densities with a mixture of parametric densities; the objective is to find the model parameters:

p(x|θ) = Σ_i p(x|ω_i, θ_i) P(ω_i)

Examples of unsupervised learning:

- Dimensionality reduction
- Latent variable learning
- Clustering

A cluster is a number of similar objects collected or grouped together. A clustering algorithm partitions examples into groups when no labels are available. Clusters are connected regions of a multidimensional space containing a relatively high density of points, separated from other such regions by regions containing a relatively low density of points. Sample applications: novelty detection and outlier detection.


SLIDE 5

Applications of Clustering

- Clustering retrieved documents to present more organized and understandable results to the user ("diversified retrieval")
- Detecting near-duplicates, e.g. entity resolution
- Exploratory data analysis
- Automated (or semi-automated) creation of taxonomies
- Comparison


SLIDE 6

Why do Unsupervised Learning?

Clustering is a very difficult problem because data can reveal clusters with different shapes and sizes. How many clusters do you see in the above figure?


SLIDE 7

Why do Unsupervised Learning? (cont.)

How many clusters do you see in the figure?


SLIDE 8

Why do Unsupervised Learning? (cont.)

How many clusters do you see in the figure?


SLIDE 9

Why do Unsupervised Learning? (cont.)

How many clusters do you see in the figure?


SLIDE 10

Why do Unsupervised Learning? (cont.)

How many clusters do you see in the figure?


SLIDE 11

Clustering

Clustering algorithms can be divided into several groups

- Exclusive (each pattern belongs to only one cluster) vs. non-exclusive (each pattern can be assigned to several clusters).
- Hierarchical (a nested sequence of partitions) vs. partitional (a single partition).

Clustering algorithms

- Hierarchical clustering
- Centroid-based clustering
- Distribution-based clustering
- Density-based clustering
- Grid-based clustering
- Constraint clustering


SLIDE 12

Clustering

Challenges in clustering

- Selection of an appropriate measure of similarity to define clusters; this is often both data-dependent (cluster shape) and context-dependent.
- Choice of the criterion function to be optimized.
- Evaluation function.
- Optimization method.

Similarity/distance measures

- Euclidean distance (L2 norm): L2(x, y) = sqrt( Σ_{i=1}^{N} (x_i − y_i)^2 )
- L1 norm: L1(x, y) = Σ_{i=1}^{N} |x_i − y_i|
- Cosine similarity: cosine(x, y) = (x · y) / (||x|| ||y||)

Evaluation function: assigns a (usually real-valued) score to a clustering; this score is typically a function of within-cluster similarity and between-cluster dissimilarity. Optimization method: find a clustering that maximizes the criterion. This can be done by global optimization methods (often intractable), greedy search methods, or approximation algorithms.
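As a concrete illustration, the three measures can be written directly in Python (a minimal sketch; the function names are ours, not from the slides):

```python
import math

def l2(x, y):
    # Euclidean distance: sqrt of the sum of squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def l1(x, y):
    # L1 (Manhattan) distance: sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine(x, y):
    # Cosine similarity: dot product normalized by the two vector lengths
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) *
                  math.sqrt(sum(b * b for b in y)))
```

For example, l2((0, 0), (3, 4)) gives 5.0, and parallel vectors have cosine similarity 1.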


SLIDE 13

Hierarchical clustering

Organizes the clusters in a hierarchical way and produces a rooted tree (dendrogram).

(Example dendrogram: Animal → Vertebrate {Fish, Reptile, Amphibian, Mammal} and Invertebrate {Worm, Insect, Crustacean}.)

Recursive application of a standard clustering algorithm can produce a hierarchical clustering


SLIDE 14

Hierarchical clustering (cont.)

Types of hierarchical clustering

- Agglomerative (bottom-up): methods start with each example in its own cluster and iteratively combine them to form larger and larger clusters.
- Divisive (top-down): methods separate all examples recursively into smaller clusters.


SLIDE 15

Agglomerative (bottom up)

Assumes a similarity function for determining the similarity of two clusters. Starts with each instance in a separate cluster and repeatedly joins the two clusters that are most similar, until there is only one cluster. The history of merging forms a binary tree (hierarchy). Basic algorithm:

Start with all instances in their own clusters.
Until there is only one cluster:
  Among the current clusters, determine the two clusters c_i and c_j that are most similar.
  Replace c_i and c_j with a single cluster c_i ∪ c_j.

Cluster similarity: how do we compute the similarity of two clusters, each possibly containing multiple instances?

- Single linkage: similarity of the two most similar members.
- Complete linkage: similarity of the two least similar members.
- Group average: average similarity between members. This method uses the average similarity across all pairs within the merged cluster to measure the similarity of two clusters, and is a compromise between single and complete linkage.
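The agglomerative loop above can be sketched in Python with single linkage (a naive illustration, not the slides' code; `single_link_hac` and the `num_clusters` stopping parameter are our additions — the slides merge all the way down to one cluster):

```python
import math

def single_link_hac(points, num_clusters):
    # Start with each point in its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        # Find the pair of clusters with the smallest single-link
        # distance (distance between their two closest members).
        best = (None, None, float("inf"))
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(x, y)
                        for x in clusters[i] for y in clusters[j])
                if d < best[2]:
                    best = (i, j, d)
        i, j, _ = best
        # Replace c_i and c_j with the single cluster c_i ∪ c_j.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Each merge scans all cluster pairs, so this naive version is far from the heap-based complexity discussed later; it is meant only to make the loop structure concrete.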


SLIDE 16

Single-Link (bottom-up)

sim(c_i, c_j) = max_{x∈c_i, y∈c_j} sim(x, y)


SLIDE 17

Complete-Link (bottom-up)

sim(c_i, c_j) = min_{x∈c_i, y∈c_j} sim(x, y)


SLIDE 18

Computational Complexity of HAC

In the first iteration, all HAC methods need to compute the similarity of all pairs of the n individual instances, which is O(n^2). In each of the subsequent O(n) merging iterations, we must find the smallest-distance pair of clusters; maintaining a heap gives O(n^2 log n) overall. In each of those iterations we must also compute the distance between the most recently created cluster and all other existing clusters. Can this be done in constant time, so that the overall cost stays O(n^2 log n)?


SLIDE 19

Centroid-Based Clustering

Assumes instances are real-valued vectors. Clusters are represented by centroids (for example, the average of the points in a cluster):

µ(c) = (1/|c|) Σ_{x∈c} x

Reassignment of instances to clusters is based on distance to the current cluster centroids.

K-means algorithm:

Input: k = number of clusters, distance measure d.
Select k random instances s_1, s_2, ..., s_k as seeds.
Until the clustering converges or another stopping criterion is met:
  For each instance x_i: assign x_i to the cluster c_j such that d(x_i, s_j) is minimal.
  For each cluster c_j: update its centroid, s_j = µ(c_j).
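The two alternating steps can be sketched as follows (an illustrative implementation, not the slides' code; the names and the `max_iters` safeguard are ours):

```python
import math
import random

def k_means(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    # Select k random instances as initial seeds.
    centroids = rng.sample(points, k)
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for x in points:
            nearest = min(range(k), key=lambda j: math.dist(x, centroids[j]))
            clusters[nearest].append(x)
        # Update step: each centroid becomes the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: assignments are stable
            break
        centroids = new_centroids
    return centroids, clusters
```

On four points forming two tight pairs, k_means([(0, 0), (0, 1), (10, 10), (10, 11)], 2) converges to the centroids (0, 0.5) and (10, 10.5).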


SLIDE 20

Time Complexity

Assume computing the distance between two instances is O(D), where D is the dimensionality of the vectors.

- Reassigning clusters for N points: O(kN) distance computations, i.e. O(kND).
- Computing centroids: each instance is added once to some centroid: O(ND).
- Assuming these two steps are each done once in each of m iterations: O(mkND).

Problems with K-means:

- Results can vary based on random seed selection, especially for high-dimensional data.
- Some seeds can result in a poor convergence rate, or in convergence to sub-optimal clusterings.
- Sensitive to outliers.

Idea: combine HAC and K-means clustering. Convergence of K-means.


SLIDE 21

Gaussian mixture model

A mixture model is a linear combination of K densities:

p(x|θ) = Σ_{k=1}^{K} π_k N(x|µ_k, Σ_k)

The set of parameters is θ = {{π_k}, {µ_k}, {Σ_k}}. The mixing coefficients π form a discrete distribution, i.e. 0 ≤ π_k ≤ 1 and Σ_{k=1}^{K} π_k = 1.

Each component is a multivariate Gaussian:

N(x|µ_k, Σ_k) = 1 / ((2π)^{D/2} |Σ_k|^{1/2}) · exp( −(1/2) (x − µ_k)^T Σ_k^{−1} (x − µ_k) )

To generate a sample x from the mixture model: (1) sample a mixture component z ∼ π, (2) sample x ∈ R^D from the z-th component, x ∼ N(µ_z, Σ_z).

An alternative viewpoint: z is a 1-of-K binary vector, and

p(x) = Σ_z p(x|z) p(z) = Σ_{k=1}^{K} π_k N(x|µ_k, Σ_k)

The posterior distribution is

p(z_k|x) = p(x|z_k) p(z_k) / p(x) = π_k N(x|µ_k, Σ_k) / Σ_{j=1}^{K} π_j N(x|µ_j, Σ_j)
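The two-step sampling procedure can be illustrated in Python for the one-dimensional case (our simplification: scalar means and standard deviations in place of the D-dimensional µ_k and Σ_k; the function name is ours):

```python
import random

def sample_gmm(pis, mus, sigmas, rng=random):
    # 1-D sketch: pis sums to 1, mus/sigmas are scalars per component.
    # (1) sample the mixture component z ~ pi
    z = rng.choices(range(len(pis)), weights=pis)[0]
    # (2) sample x from the z-th Gaussian component, x ~ N(mu_z, sigma_z)
    return rng.gauss(mus[z], sigmas[z])
```

Passing an explicit random.Random instance as rng makes draws reproducible.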


SLIDE 22

Gaussian Mixtures and EM

Initialize π, µ, and Σ. Repeat:

E-step: evaluate the posterior probabilities

p(z_k|x_n) = π_k N(x_n|µ_k, Σ_k) / Σ_{j=1}^{K} π_j N(x_n|µ_j, Σ_j)

M-step: update the parameter values

N_k = Σ_{n=1}^{N} p(z_k|x_n)
µ_k = (1/N_k) Σ_{n=1}^{N} p(z_k|x_n) x_n
Σ_k = (1/N_k) Σ_{n=1}^{N} p(z_k|x_n) (x_n − µ_k)(x_n − µ_k)^T
π_k = N_k / N

Until convergence.
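The E- and M-steps above can be sketched for the one-dimensional case (our simplifications: scalar variances in place of covariance matrices, initial means spread over the sorted data, and a fixed iteration count instead of a convergence test):

```python
import math

def em_gmm_1d(xs, k, iters=50):
    # 1-D EM sketch for a k-component Gaussian mixture.
    n = len(xs)
    s = sorted(xs)
    pis = [1.0 / k] * k
    mus = [s[i * (n - 1) // max(k - 1, 1)] for i in range(k)]
    vars_ = [1.0] * k  # variances sigma_k^2

    def normal(x, mu, var):
        # 1-D Gaussian density N(x | mu, var)
        return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

    for _ in range(iters):
        # E-step: responsibilities p(z_k | x_n)
        gamma = []
        for x in xs:
            w = [pis[j] * normal(x, mus[j], vars_[j]) for j in range(k)]
            total = sum(w)
            gamma.append([wj / total for wj in w])
        # M-step: re-estimate mu_k, sigma_k^2, and pi_k = N_k / N
        for j in range(k):
            nk = sum(g[j] for g in gamma)
            mus[j] = sum(g[j] * x for g, x in zip(gamma, xs)) / nk
            vars_[j] = sum(g[j] * (x - mus[j]) ** 2
                           for g, x in zip(gamma, xs)) / nk
            pis[j] = nk / n
    return pis, mus, vars_
```

On two well-separated groups of points the estimated means settle near the two group centers, with the mixing coefficients summing to 1.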
