Integrating Constraints and Metric Learning in Semi-Supervised Clustering

M. Bilenko, S. Basu, R. J. Mooney

Presenter: Lei Tang, Arizona State University

Machine Learning Seminar


Outline

1. Introduction
2. Formulation: K-means; Integrating Constraints and Metric Learning
3. MPCK-Means Algorithm: Initialization, E-step, M-step
4. Experiment Results


Semi-supervised Clustering

1. Constraint-based methods: Seeded KMeans and Constrained KMeans use partial label information; COP-KMeans uses pairwise constraints (must-link, cannot-link).

2. Metric-based methods: learn a metric that satisfies the constraints, so that points in the same cluster move closer together while points in different clusters move farther apart.

Limitations: previous metric-learning approaches exclude unlabeled data during metric training, and a single distance metric is shared by all clusters, forcing every cluster to have the same shape.


Constraint-based method

K-means clustering minimizes

$$\sum_{x_i \in X} \|x_i - \mu_{l_i}\|^2$$

Semi-supervised clustering with constraints minimizes

$$\underbrace{\sum_{x_i \in X} \|x_i - \mu_{l_i}\|^2}_{\text{typical k-means}} \;+\; \underbrace{\sum_{(x_i, x_j) \in M} w_{ij}\,\mathbb{1}[l_i \neq l_j]}_{\text{must-link}} \;+\; \underbrace{\sum_{(x_i, x_j) \in C} \bar{w}_{ij}\,\mathbb{1}[l_i = l_j]}_{\text{cannot-link}}$$
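
As a concrete reading of this objective, here is a minimal NumPy sketch (all names are hypothetical, not from the paper):

```python
import numpy as np

def pck_objective(X, labels, mu, must, cannot, w, w_bar):
    """Constrained k-means objective: distortion plus constraint penalties.

    X: (n, d) data; labels: (n,) cluster index per point; mu: (k, d) centroids;
    must, cannot: lists of (i, j) index pairs; w, w_bar: dicts of pair -> weight.
    """
    # Typical k-means distortion term.
    distortion = np.sum((X - mu[labels]) ** 2)
    # A must-link pair pays w_ij when its points land in different clusters.
    ml = sum(w[(i, j)] for i, j in must if labels[i] != labels[j])
    # A cannot-link pair pays w_bar_ij when its points land in the same cluster.
    cl = sum(w_bar[(i, j)] for i, j in cannot if labels[i] == labels[j])
    return distortion + ml + cl
```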

Metric-based Method

Euclidean distance:

$$\|x_i - x_j\| = \sqrt{(x_i - x_j)^T (x_i - x_j)}$$

Mahalanobis distance:

$$\|x_i - x_j\|_A = \sqrt{(x_i - x_j)^T A (x_i - x_j)}$$

where $A \succeq 0$ is a positive semi-definite matrix. If $A$ is used to compute distances, each cluster is modeled as a multivariate Gaussian with covariance $A^{-1}$.
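
In NumPy, the Mahalanobis distance is a one-liner (a sketch; `A` is assumed positive semi-definite):

```python
def mahalanobis_sq(xi, xj, A):
    """Squared Mahalanobis distance ||xi - xj||_A^2."""
    d = xi - xj
    return d @ A @ d

# With A equal to the identity matrix this reduces to the
# squared Euclidean distance.
```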


Clustering with different shapes

What if the clusters have different shapes? Use a different $A_h$ for each cluster, i.e., assign each cluster its own covariance. Maximizing the likelihood then boils down to minimizing

$$\sum_{x_i \in X} \left( \|x_i - \mu_{l_i}\|^2_{A_{l_i}} - \log(\det A_{l_i}) \right)$$
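
To see where the $\log(\det A_{l_i})$ term comes from, this is the standard Gaussian negative log-likelihood computation, sketched here:

$$p(x_i \mid l_i) = \frac{\sqrt{\det A_{l_i}}}{(2\pi)^{d/2}} \exp\!\left(-\tfrac{1}{2}\,\|x_i - \mu_{l_i}\|^2_{A_{l_i}}\right) \;\Longrightarrow\; -\log p(x_i \mid l_i) = \tfrac{1}{2}\left(\|x_i - \mu_{l_i}\|^2_{A_{l_i}} - \log(\det A_{l_i})\right) + \text{const}$$

Summing the negative log-likelihood over all points and dropping the constants recovers the objective above, up to a factor of $\tfrac{1}{2}$.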


Combine Constraints and Metric Learning

Minimize

$$\underbrace{\sum_{x_i \in X} \left[ \|x_i - \mu_{l_i}\|^2_{A_{l_i}} - \log(\det A_{l_i}) \right]}_{\text{metric learning}} \;+\; \underbrace{\sum_{(x_i,x_j) \in M} w_{ij}\,\mathbb{1}[l_i \neq l_j] \;+\; \sum_{(x_i,x_j) \in C} \bar{w}_{ij}\,\mathbb{1}[l_i = l_j]}_{\text{constraints}}$$

Intuitively, the penalties $w_{ij}$ and $\bar{w}_{ij}$ should depend on the distance between the two points. Minimize

$$\sum_{x_i \in X} \left[ \|x_i - \mu_{l_i}\|^2_{A_{l_i}} - \log(\det A_{l_i}) \right] \;+\; \sum_{(x_i,x_j) \in M} f_M(x_i, x_j)\,\mathbb{1}[l_i \neq l_j] \;+\; \sum_{(x_i,x_j) \in C} f_C(x_i, x_j)\,\mathbb{1}[l_i = l_j]$$


Penalty based on distance

Must-link: a violation means the two points belong to different clusters.

$$f_M(x_i, x_j) = \frac{1}{2}\left( \|x_i - x_j\|^2_{A_{l_i}} + \|x_i - x_j\|^2_{A_{l_j}} \right)$$

(the average under the two cluster metrics). The farther apart the two points are, the larger the penalty.

Cannot-link: a violation means the two points belong to the same cluster.

$$f_C(x_i, x_j) = \|x'_{l_i} - x''_{l_i}\|^2_{A_{l_i}} - \|x_i - x_j\|^2_{A_{l_i}}$$

where $(x'_{l_i}, x''_{l_i})$ is the maximally separated pair of points under metric $A_{l_i}$. The closer the two points are, the larger the penalty.
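
A direct transcription of the two penalty functions (a sketch; the maximally separated pair for each cluster is assumed precomputed):

```python
def f_must(xi, xj, A_li, A_lj):
    """Must-link violation cost: squared distance averaged over both metrics."""
    d = xi - xj
    return 0.5 * (d @ A_li @ d + d @ A_lj @ d)

def f_cannot(xi, xj, A_li, far_pair):
    """Cannot-link violation cost: gap between the maximally separated
    pair and the offending pair, both measured under A_li."""
    x1, x2 = far_pair
    dmax = x1 - x2
    d = xi - xj
    return dmax @ A_li @ dmax - d @ A_li @ d
```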


Metric Pairwise Constrained K-Means (MPCK-Means)

General framework of the MPCK-Means algorithm, based on EM:

Initialize the clusters. Then repeat until convergence: assign each point to a cluster so as to minimize the objective (E-step); re-estimate the means and update the metrics (M-step).

Differences from k-means: cluster assignment takes the constraints into consideration, and the metric is updated in each round. The overall loop is sketched below.
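
In outline, the loop looks like this; `initialize`, `assign_clusters`, `update_means`, `update_metrics`, and `objective` are placeholders for the steps described on the following slides, not actual library calls:

```python
import numpy as np

def mpck_means(X, k, must, cannot, max_iter=100, tol=1e-6):
    """Skeleton of MPCK-Means; the helpers stand for the steps detailed below."""
    labels, mu, A = initialize(X, k, must, cannot)  # seeds from constraint closure
    prev = np.inf
    for _ in range(max_iter):
        # E-step: (re)assign points to minimize the joint objective J.
        labels = assign_clusters(X, labels, mu, A, must, cannot)
        # M-step: re-estimate centroids, then the per-cluster metrics.
        mu = update_means(X, labels, k)
        A = update_metrics(X, labels, mu, must, cannot)
        J = objective(X, labels, mu, A, must, cannot)
        if prev - J < tol:  # stop once the objective stops decreasing
            break
        prev = J
    return labels, mu, A
```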


Initialization

Basic idea: construct the transitive closure of the must-link constraints, choose the mean of each resulting component as a cluster seed, and extend the must-link and cannot-link sets accordingly. For example, given must-links {AB, BC, DE} and cannot-link {BE}: the closure yields components {A, B, C} and {D, E} (adding the inferred must-link AC), and the cannot-link BE then implies a cannot-link between every point of {A, B, C} and every point of {D, E}. A sketch of the closure computation follows.
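
A sketch with a small union-find, run on the example constraint sets from this slide:

```python
def transitive_closure(points, must, cannot):
    """Group points into must-link components and infer extra cannot-links."""
    parent = {p: p for p in points}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]   # path compression
            p = parent[p]
        return p

    for a, b in must:
        parent[find(a)] = find(b)           # union the two components

    comps = {}
    for p in points:
        comps.setdefault(find(p), set()).add(p)

    # A cannot-link between two components separates every cross pair.
    inferred = set()
    for a, b in cannot:
        for p in comps[find(a)]:
            for q in comps[find(b)]:
                inferred.add((p, q))
    return list(comps.values()), inferred

comps, extra_cl = transitive_closure(
    "ABCDE", [("A", "B"), ("B", "C"), ("D", "E")], [("B", "E")])
# comps -> [{'A','B','C'}, {'D','E'}] (order may vary);
# extra_cl holds all six cross pairs.
```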


Cluster Assignment

1. Randomly re-order the data points.
2. Assign each data point to the cluster that minimizes the objective function:

$$J = \sum_{x_i \in X} \left[ \|x_i - \mu_{l_i}\|^2_{A_{l_i}} - \log(\det A_{l_i}) \right] + \sum_{(x_i,x_j) \in M} f_M(x_i, x_j)\,\mathbb{1}[l_i \neq l_j] + \sum_{(x_i,x_j) \in C} f_C(x_i, x_j)\,\mathbb{1}[l_i = l_j]$$
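
A sketch of the assignment step for a single point, reusing the `f_must`/`f_cannot` sketches from earlier (all names hypothetical):

```python
import numpy as np

def assign_point(i, X, labels, mu, A, must_of, cannot_of, far_pair):
    """Pick the cluster for point i that adds the least to the objective J.

    must_of[i] / cannot_of[i]: indices constrained with i;
    far_pair[h]: maximally separated pair of points under metric A[h].
    """
    best, best_cost = 0, np.inf
    for h in range(len(mu)):
        d = X[i] - mu[h]
        cost = d @ A[h] @ d - np.log(np.linalg.det(A[h]))  # metric term
        for j in must_of[i]:              # violated must-links cost f_M
            if labels[j] != h:
                cost += f_must(X[i], X[j], A[h], A[labels[j]])
        for j in cannot_of[i]:            # violated cannot-links cost f_C
            if labels[j] == h:
                cost += f_cannot(X[i], X[j], A[h], far_pair[h])
        if cost < best_cost:
            best, best_cost = h, cost
    return best
```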


Update the metric

1. Update the centroid of each cluster.
2. Update the distance metric of each cluster: take the derivative of the objective with respect to $A_h$ and set it to zero, which gives the new metric

$$A_h = |X_h| \left( \sum_{x_i \in X_h} (x_i - \mu_h)(x_i - \mu_h)^T + \sum_{(x_i,x_j) \in M_h} \tfrac{1}{2}\, w_{ij}\, (x_i - x_j)(x_i - x_j)^T\, \mathbb{1}[l_i \neq l_j] + \sum_{(x_i,x_j) \in C_h} \bar{w}_{ij} \left[ (x'_h - x''_h)(x'_h - x''_h)^T - (x_i - x_j)(x_i - x_j)^T \right] \mathbb{1}[l_i = l_j] \right)^{-1}$$
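
The same update as a NumPy sketch for one cluster `h` (the weights and the maximally separated pair are assumed given; names hypothetical):

```python
import numpy as np

def update_metric(h, X, labels, mu, must, cannot, w, w_bar, far_pair):
    """Recompute A_h as |X_h| times the inverse of the weighted scatter."""
    idx = np.where(labels == h)[0]
    D = X[idx] - mu[h]
    S = D.T @ D                                   # within-cluster scatter
    for i, j in must:                             # violated must-links touching h
        if labels[i] != labels[j] and h in (labels[i], labels[j]):
            d = X[i] - X[j]
            S += 0.5 * w[(i, j)] * np.outer(d, d)
    x1, x2 = far_pair
    dmax = x1 - x2
    for i, j in cannot:                           # violated cannot-links inside h
        if labels[i] == labels[j] == h:
            d = X[i] - X[j]
            S += w_bar[(i, j)] * (np.outer(dmax, dmax) - np.outer(d, d))
    return len(idx) * np.linalg.inv(S)
```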


Some issues

1. Singularity: if the sum is singular, set $A_h^{-1} = A_h^{-1} + \epsilon\,\mathrm{tr}(A_h^{-1})\,I$ to ensure nonsingularity.

2. Positive semi-definiteness: if $A_h$ has negative eigenvalues, project it onto the set $C = \{A : A \succeq 0\}$ by setting the negative eigenvalues to 0.

3. Computational cost: use diagonal matrices, or use the same distance metric for all clusters.

4. Convergence: in theory each step reduces the objective, but when the singularity and positive semi-definiteness fixes are involved, convergence is no longer guaranteed. In practice the algorithm works fine. Both fixes are sketched below.
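
Both numerical fixes as NumPy sketches, applied respectively to the scatter sum before inversion and to the resulting metric (the `eps` default is illustrative, not from the paper):

```python
import numpy as np

def regularize_if_singular(S, eps=0.01):
    """Issue 1: if the scatter sum (A_h^{-1}) is singular, add eps * tr(S) * I."""
    if np.linalg.matrix_rank(S) < S.shape[0]:
        S = S + eps * np.trace(S) * np.eye(S.shape[0])
    return S

def project_psd(A):
    """Issue 2: project A onto {A : A >= 0} by zeroing negative eigenvalues."""
    vals, vecs = np.linalg.eigh(A)      # symmetric eigendecomposition
    return (vecs * np.maximum(vals, 0.0)) @ vecs.T
```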


Experiment Results (1)

A single diagonal matrix is used.


Experiment Results (2)

A single diagonal matrix is compared with multiple full matrices. Some observations: using a different matrix for each cluster, and using full matrices, clearly increases performance. When the constraints are few, RCA seems to work better than MPCK-Means. Why?


Conclusions

By integrating metric learning and constraints during clustering, the combined approach outperforms either approach used on its own.

Questions? Thank you!
