APPLIED MACHINE LEARNING (MACHINE LEARNING - MSc Course)
Methods for Clustering: K-means, Soft K-means, DBSCAN


SLIDE 1

APPLIED MACHINE LEARNING
Methods for Clustering: K-means, Soft K-means, DBSCAN

SLIDE 2

Objectives

Learn basic techniques for data clustering

  • K-means and soft K-means, GMM (next lecture)
  • DBSCAN

Understand the issues and major challenges in clustering

  • Choice of metric
  • Choice of number of clusters
SLIDE 3

What is clustering?

Clustering is a type of multivariate statistical analysis also known as cluster analysis, unsupervised classification analysis, or numerical taxonomy. Clustering is the process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters.

Cluster: a collection of data objects that are “similar” to one another and thus can be treated collectively as one group.

SLIDE 4

Classification versus Clustering

Supervised Classification = Classification: we know the class labels and the number of classes.

Unsupervised Classification = Clustering: we do not know the class labels and may not know the number of classes.

SLIDE 5

Classification versus Clustering

Unsupervised Classification = Clustering: a hard problem when no pair of objects has exactly the same features. We need to determine how similar two or more objects are to one another.

SLIDE 6

Which clusters can you create?

Which two subgroups of pictures are similar and why?


SLIDE 8

What is Good Clustering?

A good clustering method produces high-quality clusters when:

  • The intra-class (that is, intra-cluster) similarity is high.
  • The inter-class similarity is low.

Note that the quality measure of a cluster depends on the similarity measure used!

SLIDE 9

Exercise:

Intra-class similarity is the highest when:
  a) you choose to classify images with and without glasses
  b) you choose to classify images of person1 against person2

(Legend: Person1 with glasses; Person1 without glasses; Person2 without glasses; Person2 with glasses.)

SLIDE 10

Exercise:

Projection onto first two principal components after PCA

(Legend: Person1 with glasses; Person1 without glasses; Person2 without glasses; Person2 with glasses.)

Intra-class similarity is the highest when:
  a) you choose to classify images with and without glasses
  b) you choose to classify images of person1 against person2

SLIDE 11

Exercise:

The eigenvector e1 is composed of a mix of the main characteristics of the two faces and is hence explanatory of both. However, since the two faces have little in common, the two groups have different coordinates on e1 but quasi-identical coordinates for the glasses within each subgroup. Projecting onto e1 hence offers a means to compute a metric of similarity across the two persons.

Projection onto e1 against e2
(Legend: Person1 with glasses; Person1 without glasses; Person2 without glasses; Person2 with glasses.)

SLIDE 12

Exercise:

When projecting onto e1 and e3, we can separate the images of person1 with and without glasses, as the eigenvector e3 embeds features distinctive primarily of person1.

Projection onto e1 against e3
(Legend: Person1 with glasses; Person1 without glasses; Person2 without glasses; Person2 with glasses.)

SLIDE 13

Exercise:

How would you design a method to find the groups when you no longer have the class labels?

Projection onto first two principal components after PCA

SLIDE 14

Sensitivity to Prior Knowledge

Priors:
  • Data cluster within a circle
  • There are 2 clusters

(Figure: axes x1, x2, x3; outliers (noise) versus relevant data.)

SLIDE 15

Sensitivity to Prior Knowledge

Priors:
  • Data follow a complex distribution
  • There are 3 clusters

(Figure: axes x1, x2, x3.)

SLIDE 16

Clusters' Types

Globular clusters versus non-globular clusters: K-means produces globular clusters; DBSCAN produces non-globular clusters.

SLIDE 17

What is Good Clustering?

Requirements for good clustering:

  • Discovery of clusters with arbitrary shape
  • Ability to deal with noise and outliers
  • Insensitivity to the ordering of input records
  • Scalability
  • Ability to handle high dimensionality
  • Interpretability and reusability

SLIDE 18

How to cluster?

What choice of model (circle, ellipse) for the cluster? How many models?

SLIDE 19

What choice of model (circle, ellipse) for the cluster? How many models? Circle; fixed number: K = 2. Where to place them for optimal clustering?

K-means Clustering

K-means clustering generates a number K of disjoint clusters so as to minimize:

J(μ_1, …, μ_K) = Σ_{k=1}^{K} Σ_{x_i ∈ c_k} ||x_i − μ_k||²

where x_i is the ith data point, μ_k the geometric centroid of cluster k, and c_k the cluster with label (or number) k.
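As a quick illustration, the objective J can be computed in a few lines of numpy (a minimal sketch; the function name and argument layout are my own, not from the lecture):

```python
import numpy as np

def kmeans_objective(X, centroids, labels):
    """Distortion J: total squared distance of each point to its assigned centroid."""
    return sum(np.sum((X[labels == k] - c) ** 2) for k, c in enumerate(centroids))
```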

SLIDE 20

K-means Clustering

Initialization: initialize at random the positions of the centers of the clusters

In mldemos, centroids are initialized on one datapoint with no overlap across centroids.

SLIDE 21

K-means Clustering

Assignment Step:

  • Calculate the distance from each data point to each centroid.
  • Assign the responsibility of each data point to its “closest” centroid.

If a tie happens (i.e., two centroids are equidistant to a data point), one assigns the data point to the smallest winning centroid.

k_i = argmin_k d(x_i, μ_k)

Responsibility of cluster k for point x_i:
r_k^i = 1 if k = k_i, 0 otherwise

where x_i is the ith data point and μ_k the geometric centroid of cluster k.
SLIDE 22

Update step (M-Step): Recompute the position of centroid based on the assignment of the points

K-means Clustering

μ_k = Σ_i r_k^i x_i / Σ_i r_k^i

with k_i = argmin_k d(x_i, μ_k) and the responsibility of cluster k for point x_i: r_k^i = 1 if k = k_i, 0 otherwise.

SLIDE 23

K-means Clustering

k_i = argmin_k d(x_i, μ_k)

Responsibility of cluster k for point x_i: r_k^i = 1 if k = k_i, 0 otherwise

Assignment Step:

  • Calculate the distance from each data point to each centroid.
  • Assign the responsibility of each data point to its "closest" centroid.

If a tie happens (i.e., two centroids are equidistant to a data point), one assigns the data point to the smallest winning centroid.

μ_k = Σ_i r_k^i x_i / Σ_i r_k^i

SLIDE 24

K-means Clustering

Update step (M-Step): Recompute the position of each centroid based on the assignment of the points.

Stopping Criterion: Go back to step 2 and repeat the process until the clusters are stable.

SLIDE 25

K-means Clustering

K-means creates a hard partitioning of the dataset.

(Figure: cluster boundaries and their intersection points.)

SLIDE 26

Effect of the distance metric on K-means

(Panels: L1-Norm, L2-Norm, L3-Norm, L8-Norm.)
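To see concretely what switching the norm does, here is a small illustrative helper (not from the lecture) computing the Minkowski L_p distance that K-means would minimize under each choice; as p grows the distance approaches the max norm, which reshapes the cluster boundaries accordingly:

```python
import numpy as np

def lp_distance(a, b, p):
    """Minkowski L_p distance: p=1 Manhattan, p=2 Euclidean, large p approaches the max norm."""
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

a, b = np.array([0.0, 0.0]), np.array([1.0, 2.0])
for p in (1, 2, 3, 8):
    print(f"L{p}-norm distance: {lp_distance(a, b, p):.3f}")
```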

SLIDE 27

K-means Clustering: Algorithm

  1. Initialization: Pick K arbitrary centroids and set their geometric means to random values (in mldemos, centroids are initialized on one datapoint with no overlap across centroids).
  2. Calculate the distance from each data point to each centroid.
  3. Assignment Step (E-step): Assign the responsibility of each data point to its "closest" centroid, k_i = argmin_k d(x_i, μ_k), i.e. r_k^i = 1 if k = k_i, 0 otherwise. If a tie happens (i.e., two centroids are equidistant to a data point), one assigns the data point to the smallest winning centroid.
  4. Update Step (M-step): Adjust the centroids to be the means of all data points assigned to them: μ_k = Σ_i r_k^i x_i / Σ_i r_k^i.
  5. Go back to step 2 and repeat the process until the clusters are stable.
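Putting steps 1-5 together, the following is a minimal numpy sketch of the full loop (an illustration under my own simplifications, e.g. leaving an empty cluster's centroid in place; it initializes on K distinct datapoints, as mldemos does, and stops once the assignments are stable):

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=None):
    """Hard K-means following steps 1-5 above; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize the centroids on K distinct datapoints.
    centroids = X[rng.choice(len(X), size=K, replace=False)].copy()
    labels = None
    for _ in range(max_iter):
        # Steps 2-3 (E-step): assign every point to its closest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # Step 5: assignments are stable, so the clusters have converged.
        labels = new_labels
        # Step 4 (M-step): move each centroid to the mean of its assigned points.
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return centroids, labels
```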

SLIDE 28

K-means Clustering

The K-means algorithm is a simple version of Expectation-Maximization applied to a model composed of isotropic Gaussian functions (see the next lecture).

SLIDE 29

K-means Clustering: Properties

  • There are always K clusters.
  • The clusters do not overlap (soft K-means relaxes this assumption, see the next slides).
  • Each member of a cluster is closer to its cluster than to any other cluster.

The algorithm is guaranteed to converge in a finite number of iterations, but it converges to a local optimum! It is hence very sensitive to the initialization of the centroids.

SLIDE 30

Soft K-means Clustering

Assignment Step (E-step):

  • Calculate the distance from each data point to each centroid.
  • Assign the responsibility of each data point to its “closest” centroid.

Each data point is given a soft "degree of assignment" to each of the means μ_k:

r_k^i = exp(−β d(x_i, μ_k)) / Σ_{k'} exp(−β d(x_i, μ_{k'}))

r_k^i ∈ [0, 1] is the responsibility of cluster k for point x_i, normalized over the clusters: Σ_k r_k^i = 1.

SLIDE 31

Soft K-means Clustering

Update step (M-Step): Recompute the position of each centroid based on the assignment of the points.

The model parameters, i.e. the means, are adjusted to match the weighted sample means of the data points that they are responsible for:

μ_k = Σ_i r_k^i x_i / Σ_i r_k^i

The update algorithm of soft K-means is identical to that of hard K-means, aside from the fact that the responsibilities to a particular cluster are now real numbers varying between 0 and 1.

SLIDE 32

Soft K-means Clustering

β is the stiffness; 1/β (~ σ) measures the disparity across clusters:
  • small β ~ large σ
  • large β ~ small σ

r_k^i = exp(−β d(x_i, μ_k)) / Σ_{k'} exp(−β d(x_i, μ_{k'})),   r_k^i ∈ [0, 1],   Σ_k r_k^i = 1
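A sketch of one soft K-means iteration with stiffness β, following the two formulas above (the helper and its return convention are mine): the responsibilities are a softmax over negative scaled distances, and the M-step uses them as weights.

```python
import numpy as np

def soft_kmeans_step(X, centroids, beta):
    """One E/M iteration of soft K-means; returns (updated centroids, responsibilities)."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)  # (M, K) distances
    r = np.exp(-beta * d)
    r /= r.sum(axis=1, keepdims=True)  # normalize so each point's responsibilities sum to 1
    # M-step: each mean becomes the responsibility-weighted mean of all data points.
    return (r.T @ X) / r.sum(axis=0)[:, None], r
```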

SLIDE 33

Soft K-means Clustering

Soft K-means algorithm with a small (left), medium (center) and large (right) stiffness β (β ∈ {1, 5, 10}).

SLIDE 34

Soft K-means Clustering

Iterations of the soft K-means algorithm from the random initialization (left) to convergence (right). Computed with β = 10.

SLIDE 35

(soft) K-means Clustering: Properties

Advantages:

  • Computationally faster than other clustering techniques.
  • Produces tighter clusters, especially if the clusters are globular.
  • Guaranteed to converge.

Drawbacks:

  • Does not work well with non-globular clusters.
  • Sensitive to the choice of initial partitions: different initial partitions can result in different final clusters.
  • Assumes a fixed number K of clusters.

It is, therefore, good practice to run the algorithm several times using different K values, to determine the optimal number of clusters.


SLIDE 37

K-means Clustering: Weaknesses

  • Unbalanced clusters: K-means takes into account only the distance between the means and the data points; it has no representation of the variance of the data within each cluster.
  • Elongated clusters: K-means imposes a fixed shape (a sphere) for each cluster.

SLIDE 38

K-means Clustering: Weaknesses

Very sensitive to the choice of the number of clusters K and to the initialization (see the mldemos example).

SLIDE 39

K-means: Limitations

K-means would not be able to reject outliers.

(Figure: axes x1, x2, x3; outliers (noise) versus relevant data.)

SLIDE 40

K-means: Limitations

K-means would not be able to reject outliers: it assigns all datapoints to a cluster, so outliers get assigned to the closest cluster.

DBSCAN can determine outliers and can generate non-globular clusters.

SLIDE 41

Density Based Spatial Clustering of Applications with Noise (DBSCAN)

(Figure: axes x1, x2, x3; a neighborhood of radius ε around a candidate outlier (noise).)

1. Pick a datapoint at random.
2. Compute the number of datapoints within ε.
3. If this number is < m_data, mark the datapoint as an outlier.
4. Go back to 1.

SLIDE 42

Density Based Spatial Clustering of Applications with Noise (DBSCAN)

(Figure: axes x1, x2, x3; outliers (noise) and Cluster 1.)

1. Pick a datapoint at random.
2. Compute the number of datapoints within ε.
3. Assign each datapoint found to the same cluster.
4. Go back to 1.

SLIDE 43

Density Based Spatial Clustering of Applications with Noise (DBSCAN)

(Figure: axes x1, x2, x3; outliers (noise); Cluster 1 and Cluster 2 merge into Cluster 1.)

1. Pick a datapoint at random.
2. Compute the number of datapoints within ε.
3. Assign each datapoint found to the same cluster.
4. Merge two clusters if the distance between them is < ε.

SLIDE 44

Density Based Spatial Clustering of Applications with Noise (DBSCAN)

(Figure: axes x1, x2, x3; outliers (noise); Cluster 1 and Cluster 2.)

Hyperparameters:

  • ε: size of the neighborhood
  • m_data: minimum number of datapoints
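For reference, scikit-learn's DBSCAN exposes these two hyperparameters as eps and min_samples; a brief usage sketch on made-up data (the values chosen here are arbitrary):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).normal(size=(200, 2))  # hypothetical 2-D dataset
db = DBSCAN(eps=0.3, min_samples=5).fit(X)          # eps ~ epsilon, min_samples ~ m_data
# DBSCAN labels outliers (noise) with -1; all other labels are cluster indices.
print("clusters:", sorted(set(db.labels_) - {-1}), "| outliers:", int(np.sum(db.labels_ == -1)))
```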

SLIDE 45

Comparison: K-means / DBSCAN

                      K-means                  DBSCAN
Hyperparameters       K: number of clusters    ε: neighborhood size; m_data: min. number of datapoints
Computational cost    O(K·M)                   O(M·log(M)), with M the number of datapoints
Type of cluster       Globular                 Non-globular (arbitrary shapes, non-linear boundaries)
Robustness to noise   Not robust               Robust to outliers within ε

K-means is computationally cheap. However, it is not robust to noise and produces only globular clusters. DBSCAN is computationally more intensive, but it can automatically detect noise and produces clusters of arbitrary shape. Both K-means and DBSCAN depend on a good choice of the hyperparameters. → To determine the hyperparameters, use evaluation methods for clustering (next).

SLIDE 46

Evaluation of Clustering Methods

Clustering methods rely on hyperparameters:
  • number of clusters, elements in the cluster, distance metric
→ We need to determine the goodness of these choices.

Clustering is unsupervised classification:
→ We do not know the real number of clusters nor the data labels.
→ It is difficult to evaluate these choices without ground truth.

SLIDE 47

Evaluation of Clustering Methods

Two types of measures: internal versus external measures.

Internal measures rely on measures of similarity:

  • (low) intra-cluster distance versus (high) inter-cluster distance
  • Internal measures are problematic, as the metric of similarity is often already optimized by the clustering algorithm.

External measures rely on ground truth (class labels):

  • Given a (sub)set of known class labels, compute the similarity of the clusters to the class labels.
  • In real-world data, it is hard or infeasible to gather ground truth.

SLIDE 48

Internal Measure: RSS

The Residual Sum of Squares (RSS) is an internal measure (available in mldemos). It computes the distance (in norm-2) of each datapoint from its centroid, summed over all clusters:

RSS = Σ_{k=1}^{K} Σ_{x ∈ C_k} ||x − μ_k||²
SLIDE 49

RSS for K-Means

The goal of K-means is to find cluster centers μ_k which minimize the distortion:

RSS = Σ_{k=1}^{K} Σ_{x ∈ C_k} ||x − μ_k||²

(Figure: RSS as a measure of distortion; with K = M clusters, RSS = 0; M = 100 datapoints, N = 2 dimensions.)

By increasing K we decrease RSS. What is the optimal K such that RSS → 0?

  • RSS = 0 when K = M: one has as many clusters as datapoints!
  • However, RSS can still be used to determine an 'optimal' K by monitoring the slope of the decrease of the measure as K increases.
SLIDE 50

K-means Clustering: Examples

Procedure: run K-means, increasing the number of clusters monotonically; for each number of clusters, run K-means with several initializations and take the best run; use the RSS measure to quantify the improvement in clustering → determine a plateau.

The optimal K is at the 'elbow' of the curve.

(Figure: M = 100 datapoints, N = 2 dimensions; K = 4 clusters.)
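A sketch of this procedure, reusing the hypothetical kmeans() and kmeans_objective() helpers from the earlier slides: for each K it keeps the best of several random restarts, and the resulting curve is then inspected for an elbow or plateau.

```python
import numpy as np

def rss_curve(X, k_max, n_init=10):
    """Best-of-n_init RSS for K = 1..k_max; look for the elbow where the decrease flattens."""
    return np.array([
        min(kmeans_objective(X, *kmeans(X, K, seed=s)) for s in range(n_init))
        for K in range(1, k_max + 1)
    ])
```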

SLIDE 51

K-means with RSS: Examples

Cluster Analysis of Hedge Funds (fonds speculatifs)

[N. Das, 9th Int. Conf. on Computing in Economics and Finance, 2011]

There is no legal definition of hedge funds; they consist of a wide category of investment funds with high risk & high returns and a variety of strategies for guiding the investment.
Research question: classify the type of hedge fund based on the information provided to the client.
Data dimensions (features): asset class, size of the hedge fund, incentive fee, risk level, and liquidity of the hedge fund.

SLIDE 52

K-means with RSS: Examples

Cluster Analysis of Hedge Funds (fonds speculatifs)

[N. Das, 9th Int. Conf. on Computing in Economics and Finance, 2011]

Procedure: run K-means, increasing the number of clusters monotonically; run K-means with several initializations and take the best run; use the RSS measure to quantify the improvement in clustering → determine a plateau (the cutoff on the curve of RSS against the number of clusters K).

Optimal results are found with 7 clusters.

SLIDE 53

K-means Clustering: Examples

Which one is the 'optimal' K?

The 'elbow' or 'plateau' method for choosing the optimal K from the RSS curve can be unreliable for certain datasets (here M = 100 datapoints, N = 3 dimensions): the curve suggests both K = 2 and K = 11.

We don't know! We need an additional penalty or criterion!

SLIDE 54

Other Metrics to Evaluate Clustering Methods

AIC and BIC determine how well the model fits the dataset in a probabilistic sense (a maximum-likelihood measure). The measure is balanced by how many parameters are needed to get a good fit:

  • Akaike Information Criterion: AIC = −2 ln(L) + 2B
  • Bayesian Information Criterion: BIC = −2 ln(L) + ln(M)·B

with L: maximum likelihood of the model; B: number of free parameters; M: number of datapoints. The second term penalizes the increase in computational cost due to the number of parameters and (for BIC) the number of datapoints.

As the number of datapoints (observations) increases, BIC assigns more weight to simpler models than AIC. A low BIC implies either fewer explanatory variables, a better fit, or both.

Choosing AIC versus BIC depends on the application: is the purpose of the analysis to make predictions, or to decide which model best represents reality? AIC may have better predictive ability than BIC, but BIC finds a computationally more efficient solution.

SLIDE 55

AIC for K-Means

For the particular case of K-means, we do not have a maximum-likelihood estimate of the model:

AIC = −2 ln(L) + 2B,   with L the likelihood of the model and B the number of free parameters.

However, we can formulate a metric based on the RSS that penalizes for model complexity (the number of clusters K), conceptually following AIC:

AIC_RSS = RSS + B,   RSS = Σ_{k=1}^{K} Σ_{x ∈ C_k} ||x − μ_k||²

where B = K·N is the number of free parameters (K: number of clusters, N: number of dimensions) and acts as the weighting factor.
SLIDE 56

BIC for K-Means

For the particular case of K-means, we do not have a maximum-likelihood estimate of the model:

BIC = −2 ln(L) + ln(M)·B

However, we can formulate a metric based on the RSS that penalizes for model complexity (the number of clusters K and of datapoints M), conceptually following BIC:

BIC_RSS = RSS + ln(M)·B,   RSS = Σ_{k=1}^{K} Σ_{x ∈ C_k} ||x − μ_k||²

where B = K·N is the number of free parameters (K: number of clusters, N: number of dimensions); the weighting factor ln(M) penalizes with respect to the number of datapoints (i.e., computational complexity).
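Taking the slides' RSS-based formulation at face value (B = K·N free parameters, M datapoints), both surrogates are one-liners; the helper below is my own sketch:

```python
import numpy as np

def aic_bic_rss(rss, K, N, M):
    """RSS-based surrogates as above: AIC_RSS = RSS + B, BIC_RSS = RSS + ln(M)*B, B = K*N."""
    B = K * N
    return rss + B, rss + np.log(M) * B
```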

SLIDE 57

K-means Clustering: Examples

Procedure: run K-means, increasing the number of clusters monotonically; run K-means with several initializations and take the best run;
  • use the AIC/BIC curves to find the optimal K, which is at min(AIC) or min(BIC).

Both min(AIC) and min(BIC) give K = 2 clusters.

(Figure: M = 100 datapoints, N = 3 dimensions.)

SLIDE 58

BIC for K-Means

BIC_RSS = RSS + ln(M)·(K·N)

(Figure: M = 100 datapoints, N = 2 dimensions; K = 14 clusters.)

SLIDE 59

BIC for K-Means

BIC_RSS = RSS + ln(M)·(K·N)

(Figure: M = 100 datapoints, N = 2 dimensions; K = 4 clusters.)

SLIDE 60

AIC / BIC for DBSCAN

Compute the centroid of each cluster and apply the AIC/BIC of K-means.

        DBSCAN large ε   DBSCAN medium ε   DBSCAN small ε
RSS     43               26                0.5
BIC     42               34                78
AIC     69               51                24

SLIDE 61

AIC / BIC for DBSCAN

Compute the centroid of each cluster and apply the AIC/BIC of K-means.

        K-means   DBSCAN large ε   DBSCAN medium ε   DBSCAN small ε
RSS     51        95               59                0.6
BIC     65        118              88                331
AIC     55        102              67                93

SLIDE 62

Evaluation of Clustering Methods

Two types of measures: internal versus external measures.

External measures assume that a subset of the datapoints has class labels → semi-supervised learning. They measure how well these labeled datapoints are clustered.
→ One needs an idea of the number of existing classes and must have labeled some datapoints.
→ Interesting mainly when labeling is highly time-consuming and the data is very large (e.g., in speech recognition).

SLIDE 63

Semi-Supervised Learning

Clustering F1-Measure (careful: similar to, but not the same as, the F-measure we will see for classification!)

Tradeoff between clustering all datapoints of the same class into the same cluster and making sure that each cluster contains points of only one class:

F(C, K) = Σ_{c_i ∈ C} (|c_i| / M) · max_k F(c_i, k)
F(c_i, k) = 2 R(c_i, k) P(c_i, k) / (R(c_i, k) + P(c_i, k))
R(c_i, k) = n_ik / |c_i|,   P(c_i, k) = n_ik / |k|

with M: number of labeled datapoints; C: the set of classes c_i; K: number of clusters; n_ik: number of members of class c_i in cluster k.
SLIDE 64

Recall: the proportion of datapoints of a class correctly clustered together, R(c_i, k) = n_ik / |c_i|.
Precision: the proportion of datapoints of the same class within a cluster, P(c_i, k) = n_ik / |k|.

(Figure: Class 1 and Class 2, with labeled and unlabeled datapoints.)

R(c_1, k_1) = 2/2 = 1,   R(c_2, k_2) = 4/4 = 1
P(c_1, k_1) = 2/6,       P(c_2, k_2) = 4/6

SLIDE 65

The measure penalizes by the fraction of labeled points in each class and picks, for each class, the cluster with the maximal F1 value:

F(C, K) = (2/6)·F(c_1, k_1) + (4/6)·F(c_2, k_2) ≈ 0.7

(Figure: Class 1 and Class 2, with labeled and unlabeled datapoints.)

SLIDE 66

Summary of F1-Measure

Clustering F1-Measure (careful: similar to, but not the same as, the F-measure we will see for classification!): a tradeoff between clustering all datapoints of the same class into the same cluster and making sure that each cluster contains points of only one class.

The measure picks, for each class, the cluster with the maximal F1 value, weighted by the fraction of labeled points in each class.
Recall: the proportion of datapoints of a class correctly clustered together.
Precision: the proportion of datapoints of the same class within a cluster.

F(C, K) = Σ_{c_i ∈ C} (|c_i| / M) · max_k F(c_i, k)
F(c_i, k) = 2 R(c_i, k) P(c_i, k) / (R(c_i, k) + P(c_i, k))
R(c_i, k) = n_ik / |c_i|,   P(c_i, k) = n_ik / |k|

with M: number of labeled datapoints; C: the set of classes c_i; K: number of clusters; n_ik: number of members of class c_i in cluster k.
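A sketch of this measure over the labeled subset (my own helper, following the definitions above; it assumes labels_true and labels_pred are given for the labeled datapoints only):

```python
import numpy as np

def clustering_f1(labels_true, labels_pred):
    """F(C, K): for each class take its best-matching cluster, weighted by class size."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    M, total = len(labels_true), 0.0
    for c in np.unique(labels_true):
        in_class = labels_pred[labels_true == c]   # cluster labels of the class-c points
        best = 0.0
        for k in np.unique(in_class):
            n_ck = np.sum(in_class == k)
            R = n_ck / len(in_class)               # recall of cluster k for class c
            P = n_ck / np.sum(labels_pred == k)    # precision of cluster k for class c
            best = max(best, 2 * R * P / (R + P))
        total += len(in_class) / M * best
    return total

# For instance, clustering_f1([1, 1, 2, 2, 2, 2], [0, 0, 1, 1, 1, 1]) returns 1.0,
# since each class maps perfectly onto one cluster.
```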

SLIDE 67

Summary of Lecture

Introduced two clustering techniques: K-means and DBSCAN. Discussed their pros and cons in terms of computational time and power of representation (globular/non-globular clusters).

Introduced metrics to evaluate clustering and to help choose the hyperparameters:

  • Internal measures (RSS, AIC, BIC)
  • External measures: F1-measure (also called F-measure for clustering)

Next week: Practical on Clustering. You will compare the performance of K-means and DBSCAN on your datasets and use the internal and external measures to assess this performance and to choose the hyperparameters.

SLIDE 68

Robotic Application of Clustering Method

There is a variety of hand postures when grasping objects. How can we generate the correct hand posture on robots?

El-Khoury, S., Li, M. and Billard, A. (2013) On the Generation of a Variety of Grasps. Robotics and Autonomous Systems Journal.

SLIDE 69

Robotic Application of Clustering Method

4-DOF industrial hand (Barrett Technology); 9-DOF humanoid hand (iCub robot).
Problem: choose the points of contact and generate a feasible posture for the fingers to touch the object at the correct points and with the desired force.
Difficulty: high degrees of freedom (a large number of possible points of contact, a large number of DOFs to control).

SLIDE 70

Formulate the problem as constraint-based optimization:

Minimize the generated torques at the fingertips under constraints:

  • Force closure
  • Kinematic feasibility
  • Collision avoidance

Non-convex optimization → yields several local / feasible solutions. From 1890 trials, the optimization converged to 791 feasible solutions in one setup (~2.65 s per solution) and to 612 feasible solutions in the other (~12.14 s per solution). This took too long for a realistic application.

SLIDE 71

Apply K-means on all solutions and group them into clusters

(Figures: solutions grouped into 11 clusters and into 20 clusters.)

SLIDE 72

  • A. Shukla and A. Billard, NIPS 2012