Some Clustering Methods on Dissimilarity or Similarity Matrices: Uncovering Clusters in WEB Content, Structure and Usage - PowerPoint PPT Presentation


slide-1
SLIDE 1
  • Some Clustering Methods on Dissimilarity or Similarity Matrices: Uncovering Clusters in WEB Content, Structure and Usage

Yves Lechevallier

INRIA Paris-Rocquencourt, AxIS Project

Yves.Lechevallier@inria.fr

Workshop Franco-Brasileiro sobre Mineração de Dados / Workshop Franco-Brésilien sur la fouille de données (French-Brazilian Workshop on Data Mining), Recife, 5-7 May 2009

slide-2
SLIDE 2
  • Two types of Data Tables

Classical Data Table: each object is described by a vector of measures.

Dissimilarity or Similarity Table: the relation between two objects is measured by a positive value.
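For concreteness, a tiny sketch (hypothetical values, NumPy assumed) building both kinds of table for the same objects:

```python
import numpy as np

# Classical data table: one row of measurements per object (hypothetical values).
X = np.array([[1.0, 2.0],
              [1.5, 1.8],
              [5.0, 8.0]])

# Dissimilarity table derived from it: D[i, j] is the Euclidean distance
# between objects i and j -- a positive, symmetric value with a zero diagonal.
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
```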

slide-3
SLIDE 3
  • Clustering Process

[Diagram: a Data Table is transformed into Dissimilarity or Similarity Tables, from which inter-cluster structures are built: a partition (objects e1,...,e5 grouped into classes) or a hierarchy.]

slide-4
SLIDE 4
  • To formulate a clustering problem you must specify the following components:
  • Ω : the set of objects (units) to be clustered.
  • The set of variables (attributes) to be used in describing the objects.
  • A principle for grouping objects into clusters (based on a measure of similarity or dissimilarity between two objects).
  • The inter-cluster structure which defines the desired relationship among clusters (clusters should be disjoint or hierarchically organised).

Components of a Clustering Problem

slide-5
SLIDE 5
  • Partitioning Methods

The selected inter-cluster structure is the partition. By defining a function of homogeneity or a quality criterion on a partition, the problem of clustering becomes a problem perfectly defined in discrete optimization:

to find, among the set of all possible partitions, a partition where a fixed a priori criterion is optimized.
slide-6
SLIDE 6
  • Optimisation problem

A criterion $W : \wp_K(\Omega) \to \Re^{+}$, where $\wp_K(\Omega)$ is the set of all partitions of Ω in K nonempty classes, so that the optimization problem is:

$$W(P) = \min_{Q \in \wp_K(\Omega)} W(Q), \qquad W(Q) = \sum_{k=1}^{K} w(Q_k)$$

where w(Qk) is the homogeneity measure of the class Qk and K is the number of classes.

slide-7
SLIDE 7
  • Iterative Optimization Algorithm

We start from a realizable solution $Q^{(0)} \in \wp_K(\Omega)$.

At step t+1, given the realizable solution $Q^{(t)}$, we seek a realizable solution $Q^{(t+1)} = g(Q^{(t)})$ checking

$$W(Q^{(t+1)}) \le W(Q^{(t)})$$

The algorithm is stopped when $Q^{(t+1)} = Q^{(t)}$.


slide-8
SLIDE 8
  • Neighborhood algorithm

One of the strategies used to build the function g is:

  • to associate with any realizable solution Q a finite set of realizable solutions V(Q), called the neighborhood of Q,
  • then to select the optimal solution for the criterion W in this neighborhood, which is usually called a local optimal solution.

For example, we can take as neighborhood of Q all partitions obtained from the partition Q by changing the class of only one element. Two well-known examples of this algorithm are the « ping pong » algorithm and the k-means algorithm.

slide-9
SLIDE 9
  • k-means algorithm

With the neighborhood algorithm, it is not necessary to systematically take the best solution to obtain the decrease of the criterion; it is sufficient to find in this neighborhood a solution better than the current one. In the k-means algorithm it is sufficient to determine z such that

$$z = \arg\min_{j=1,\ldots,K} d^2(w_i, z_j)$$

The decrease of the intraclass inertia criterion W is ensured thanks to Huygens' theorem.
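A minimal sketch of the resulting algorithm, assuming NumPy and squared Euclidean distance (the function name `kmeans` and the initialization by random sampling are illustrative choices, not the slide's prescription):

```python
import numpy as np

def kmeans(X, K, n_iter=20, seed=0):
    """Minimal k-means sketch: alternate allocation (nearest center under
    squared Euclidean distance) and representation (cluster means)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Allocation: labels[i] = argmin_j d^2(x_i, z_j)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Representation: each center becomes its cluster's mean
        for k in range(K):
            if (labels == k).any():
                centers[k] = X[labels == k].mean(axis=0)
    return labels, centers
```

Each allocation pass is exactly the local move of the neighborhood algorithm: it only needs to improve on the current partition, not to be globally best.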

slide-10
SLIDE 10
  • Iterative two-step relocation process

This algorithm involves two steps at each iteration:

  • 1. The first step is the representation step. The goal is to select a prototype for each cluster by optimizing an a priori criterion.
  • 2. The second step is the allocation step. The goal is to find a new assignment of each object of Ω based on the prototypes defined in the previous step.

slide-11
SLIDE 11
  • Dynamical clustering algorithms are iterative two-step relocation algorithms involving at each iteration the identification of a prototype for each cluster by optimizing an adequacy criterion. The k-means algorithm is the special case where the adequacy criterion is the variance criterion and the class prototypes are the clusters' centers of gravity.

Dynamic Clustering Method

slide-12
SLIDE 12
  • In dynamical clustering, the optimization problem is :

Let Ω be a set of n objects described by p variables and Λ a set of class prototypes. Each object i is described by a vector xi. The problem is to find simultaneously the partition P=(C1,...,CK) of Ω in K clusters and the system L=(L1,...,LK) of class prototypes of Λ which optimize the partitioning criterion W(P,L):

$$W(P, L) = \sum_{k=1}^{K} \sum_{i \in C_k} D(\mathbf{x}_i, L_k), \qquad L_k \in \Lambda$$

Optimization problem

slide-13
SLIDE 13
  • Algorithm

(a) Initialization: choose K distinct class prototypes L1,...,LK of Λ.

(b) Allocation step: for each object i of Ω define the cluster index l which verifies

$$l = \arg\min_{k=1,\ldots,K} D(\mathbf{x}_i, L_k)$$

(c) Representation step: for each cluster k find the class prototype Lk of Λ which minimizes

$$w(C_k, L_k) = \sum_{s \in C_k} D(\mathbf{x}_s, L_k)$$

Repeat (b) and (c) until the stationarity of the criterion.
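Steps (a)-(c) can be sketched generically, with the dissimilarity D and the prototype rule passed in by the caller (function names and the random initialization are illustrative; NumPy assumed):

```python
import numpy as np

def dynamic_clustering(X, K, D, prototype, n_iter=20, seed=0):
    """Two-step relocation loop: (b) allocate each object to the nearest
    prototype under D, (c) recompute each cluster's prototype.
    D(x, L) and prototype(points) are supplied by the caller."""
    rng = np.random.default_rng(seed)
    # (a) Initialization: K distinct objects serve as starting prototypes
    protos = [X[i] for i in rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # (b) Allocation step
        labels = np.array([min(range(K), key=lambda k: D(x, protos[k]))
                           for x in X])
        # (c) Representation step (keep the old prototype if a class empties)
        protos = [prototype(X[labels == k]) if (labels == k).any() else protos[k]
                  for k in range(K)]
    return labels, protos
```

With D(x, L) = ‖x − L‖² and the mean as prototype rule this reduces to k-means; other choices of D and prototype give other instances of the method.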

slide-14
SLIDE 14
  • Convergence
  • The dynamical clustering algorithm converges
  • The partitioning criterion decreases at each iteration

In order to get the convergence it is necessary to define the class prototype Lk which minimizes the adequacy criterion w(Ck,Lk), measuring the proximity between the prototype Lk and the corresponding cluster Ck.

How to define D ?

slide-15
SLIDE 15
  • The optimization problem for class prototype

For each cluster C we search the vector L of ℜp which minimizes the following adequacy criterion:

$$w(C, L) = \sum_{s \in C} D(\mathbf{x}_s, L) = \sum_{s \in C} d^2(\mathbf{x}_s, L) = \sum_{s \in C} \sum_{j=1}^{p} (x_{sj} - L_j)^2, \qquad L_j \in \Re$$

For each variable j the problem is to find the element Lj of ℜ which minimizes $\sum_{s \in C} (x_{sj} - L_j)^2$. The solution is evident:

$$L_j = \frac{1}{|C|} \sum_{s \in C} x_{sj}$$

slide-16
SLIDE 16
  • Two Classical Criteria

With D = d2, where d is the Euclidean distance and Λ is ℜp:

$$W(P, L) = \sum_{k=1}^{K} \sum_{s \in C_k} d^2(\mathbf{x}_s, L_k), \qquad L_{kj} = \frac{1}{|C_k|} \sum_{s \in C_k} x_{sj}$$

The prototype is the mean vector; it is unique.

With D = d:

$$W(P, L) = \sum_{k=1}^{K} \sum_{s \in C_k} d(\mathbf{x}_s, L_k), \qquad L_{kj} = \mathrm{median}\{x_{sj},\ s \in C_k\}$$

The prototype is the median vector; it is not unique.
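The two minimizers can be checked numerically on a toy one-variable cluster (values are hypothetical; NumPy assumed):

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0, 3.0, 10.0])  # one cluster, one variable

def sse(L):  # per-variable criterion for D = d^2
    return ((x - L) ** 2).sum()

def sad(L):  # per-variable criterion for D = d
    return np.abs(x - L).sum()

# The mean is the unique minimizer of the squared criterion...
grid = np.linspace(0, 11, 2201)
best_sse = grid[np.argmin([sse(L) for L in grid])]

# ...while the median minimizes the absolute criterion (not necessarily uniquely).
best_sad = min(sad(L) for L in grid)
```

Note how the outlier 10 drags the mean (3.6) away from the bulk of the data while the median (2.0) stays put, which is why the D = d criterion is more robust.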

slide-17
SLIDE 17
  • How to classify the Complex Data

slide-18
SLIDE 18
  • The optimization problem for the distance table

For each cluster Ck we search the object sCk of E which minimizes the following adequacy criterion:

$$w(C_k, s) = \sum_{s' \in C_k} d^2(s, s')$$

The solution is simple:

$$s_{C_k} = \arg\min_{s \in E} \sum_{s' \in C_k} d^2(s, s')$$
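A sketch of this representation step on a distance table (NumPy assumed; `medoid` and its argument names are illustrative):

```python
import numpy as np

def medoid(D, members, candidates=None):
    """Representation step on a distance table: among the candidate objects
    (all of E by default), pick the one minimizing the sum of squared
    dissimilarities to the members of the cluster. Only the table D is
    used -- no coordinates are needed."""
    if candidates is None:
        candidates = range(D.shape[0])
    candidates = list(candidates)
    cost = [sum(D[s, t] ** 2 for t in members) for s in candidates]
    return candidates[int(np.argmin(cost))]
```

This is what makes the approach applicable to web-usage data: the objects only need a pairwise dissimilarity, not a vector representation.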

slide-19
SLIDE 19

Clickstream data

slide-20
SLIDE 20
  • The welcome page of the site
slide-21
SLIDE 21
  • Motivation


slide-22
SLIDE 22
  • Semantic structure of the site

Great density of links
slide-23
SLIDE 23
  • Navigation or Visit

slide-24
SLIDE 24
  • Two complex representations of navigations

slide-25
SLIDE 25
  • Choice of the dissimilarity function

  • Jaccard (on binary data)
  • Cosine (on counting data)
  • Tf × idf (on counting data)
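Minimal sketches of two of these dissimilarities on page-visit vectors (NumPy assumed; the tf×idf variant additionally reweights the counts and is omitted here):

```python
import numpy as np

def jaccard_dissim(a, b):
    """Jaccard dissimilarity on binary visited/not-visited indicators:
    1 minus the ratio of shared pages to pages visited in either navigation."""
    a, b = a.astype(bool), b.astype(bool)
    union = (a | b).sum()
    return 1.0 - (a & b).sum() / union if union else 0.0

def cosine_dissim(a, b):
    """Cosine dissimilarity on page-visit counts: 1 minus the cosine of the
    angle between the two count vectors."""
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
```

Jaccard only sees which pages were visited; cosine also accounts for how often, so two users visiting the same pages with different intensities are closer under Jaccard than under cosine.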

slide-26
SLIDE 26

Expert or a priori partition

Classification of pages into semantic categories performed by an expert.

slide-27
SLIDE 27
  • Results on the distance table

The dynamic clustering : Hierarchical clustering :


slide-28
SLIDE 28
  • Rand / MND on distance table

[Plot: Rand index against the number of clusters (2 to 14) for the TfxIdf, Jaccard and Cosine dissimilarities.]

slide-29
SLIDE 29
  • Our method searches a pair (P, L) and a vector d of DK which optimizes the criterion W2(P,L,d):

$$W_2(P, L, d) = \sum_{k=1}^{K} \Delta(C_k, \mathbf{y}^k, d_k) = \sum_{k=1}^{K} \sum_{i \in C_k} d_k(\mathbf{x}_i, \mathbf{y}^k)$$

d = (d1,…,dK) is a vector of K distances. The distance dk is associated to the cluster Ck and belongs to the set of the distance family D; dk is the local allocation distance of the cluster Ck.

yk is the prototype of the cluster Ck.

A dynamical cluster method with adaptive distances (G. Govaert, 1975)

slide-30
SLIDE 30
  • The optimization problem for the

distance table

L=(c1,...,cK) is a vector of objects of Ω and λ=(λ1,...,λK) is a weight vector, where λk is the weight associated to the cluster Ck. Our method searches a pair (P, L) and a weight vector λ where the criterion W2(P,L,λ) is optimized:

$$W_2(P, L, \lambda) = \sum_{k=1}^{K} \Delta(C_k, c_k, \lambda_k) = \sum_{k=1}^{K} \sum_{s \in C_k} \lambda_k\, d^2(s, c_k)$$

subject to

$$\prod_{k=1}^{K} \lambda_k = 1 \quad \text{and} \quad \lambda_k > 0 \ \text{for } k = 1, \ldots, K$$

slide-31
SLIDE 31
  • The optimization problem of the

representative step

For each cluster Ck, the problem is to find an object ck that minimizes the adequacy criterion. The optimization problem of the representative step is divided in two steps.

Step 1: The class Ck and weight λk are fixed.

$$c_k = \arg\min_{c \in \Omega} \Delta(C_k, c, \lambda_k) = \arg\min_{c \in \Omega} \sum_{s \in C_k} \lambda_k\, d^2(s, c)$$

slide-32
SLIDE 32
  • The optimization problem of the

representative step

Step 2: The partition P and the vector of objects L=(c1,...,cK) are fixed.

The problem is to find the weight vector λ that minimizes the adequacy criterion

$$W_2(P, L, \lambda) = \sum_{k=1}^{K} \sum_{s \in C_k} \lambda_k\, d^2(s, c_k) = \sum_{k=1}^{K} \lambda_k \Phi_k, \qquad \Phi_k = \sum_{s \in C_k} d^2(s, c_k)$$

The solution is given by the Lagrange multiplier method.
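Assuming the usual product constraint on the weights (Π λk = 1, λk > 0), the Lagrange multiplier method yields a standard closed form: each λk is the geometric mean of all the Φh divided by its own Φk. A sketch (NumPy assumed; `adaptive_weights` is an illustrative name):

```python
import numpy as np

def adaptive_weights(phi):
    """Weights minimizing sum(lambda_k * Phi_k) subject to prod(lambda_k) = 1
    and lambda_k > 0: lambda_k = (geometric mean of the Phi) / Phi_k."""
    phi = np.asarray(phi, dtype=float)
    geo = np.exp(np.log(phi).mean())  # geometric mean of the Phi_k
    return geo / phi
```

Note the effect: compact clusters (small Φk) receive larger weights, stretching the metric locally around them.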

slide-33
SLIDE 33
  • Conclusion
  • The adaptation of the class prototype approach to classify a distance table is easy.
  • The prototype is replaced by a medoid.
  • This approach can be used when the distance is not a Euclidean distance.

slide-34
SLIDE 34

References

  • M. Chavent, F. A. T. De Carvalho, Y. Lechevallier and R. Verde. New clustering methods for interval data. Computational Statistics, 21(23):211-230, 2006.
  • A. Da Silva, Y. Lechevallier, F. Rossi and F. A. T. De Carvalho. Clustering Dynamic Web Usage Data. In "Innovative Applications in Data Mining", edited by Nadia Nedjah, Luiza de Macedo Mourelle and Janusz Kacprzyk. Springer, 2009.
  • F. A. T. de Carvalho and Y. Lechevallier. Partitional Clustering Algorithms for Symbolic Interval Data based on Single Adaptive Distances. Pattern Recognition, 2009.
  • T. Despeyroux, Y. Lechevallier, B. Trousse and A.-M. Vercoustre. Experiments in Clustering Homogeneous XML Documents to Validate an Existing Typology. Journal of Universal Computer Science, 2006.
  • F. Rossi, F. A. T. De Carvalho, Y. Lechevallier and A. Da Silva. Dissimilarities for Web Usage Mining. In V. Batagelj, H.-H. Bock, A. Ferligoj and A. Žiberna, editors, Data Science and Classification (Proceedings of IFCS 2006), pages 39-46, Springer, 2006.
  • N. Villa and F. Rossi. A comparison between dissimilarity SOM and kernel SOM for clustering the vertices of a graph. Workshop on Self-Organizing Maps (WSOM 07), 2007.