Geodesic Distance based Fuzzy Clustering

SLIDE 1

Geodesic Distance based Fuzzy Clustering

János Abonyi and Balázs Feil
University of Pannonia
abonyij@fmt.uni-pannon.hu
www.fmt.uni-pannon.hu/softcomp

SLIDE 2

Administrative details

  • Webpage: www.fmt.uni-pannon.hu/softcomp
  • You can download:
    • Transparencies
    • Related papers in PDF
    • Demonstration programs
  • Please send an e-mail to: abonyij@fmt.uni-pannon.hu
  • Book: Fuzzy Clustering for Data Mining and System Identification (coming soon!)
  • MATLAB Toolbox
    • File exchange: more than 7,000 users worldwide

SLIDE 3

Location

[Figure: Veszprém; buildings of the University of Pannonia; DPE offices; DPE Laboratory]

SLIDE 4

Industrial Process Development

  • Process engineering covers all the necessary knowledge required for defining, designing, implementing and optimizing any process.
  • Nowadays, the task of process engineers is to design, construct and operate complete systems.

[Figure (labels translated from Hungarian): cascade control scheme with process, model, state estimation, identification and structure selection blocks; master controllers C2, C4 and slave controllers C1, C3; feed and cooling streams (valve V1 closed, V2 open); measured temperatures T, TJ, TW; controller parameters τ1..τ4, K1..K4; estimated Vc]

SLIDE 5

Department of Process Engineering at the University of Veszprém

  • dr. Nagy Lajos
  • dr. Chován Tibor
  • dr. Szeifert Ferenc
  • dr. Abonyi János
  • dr. Árva Péter
  • dr. Németh Sándor
  • dr. Moser Károly
  • dr. Lakatos Béla
  • dr. Janos Madar, Ferenc P. Pach, Balazs Feil, Balázs Balaskó

Research areas: Industrial Process Automation, Tailored Process Simulators, Batch Process Design, Process Data Mining, Hierarchical Process Modeling, Modeling and Control of Crystallizers

The Optimization of Operating Processes Project: www.fmt.vein.hu/softcomp/procopt
The EAsy MATLAB Toolbox: www.fmt.vein.hu/softcomp/EAsy
The GP MATLAB Toolbox: www.fmt.vein.hu/softcomp/gp

SLIDE 6

Advanced Model Based Process Engineering Tools

Computational Intelligence (CI) in modeling and control; CI in Data Mining

www.fmt.uni-pannon.hu/softcomp

SLIDE 7

Optimization of operating technologies

Problem description (e.g.):

  • HDPE-I plant of TVK Ltd
  • Operating multi-product technology
  • Large volume (60,000 t/y), high value added products (more than ten)
  • Using the same process equipment (process transitions)
  • Goal: optimization of the technological parameters

SLIDE 8

Know-how: Process Data Warehouse

  • The mountains of data that computer-controlled plants generate must be used by the operator support systems
    • to distinguish normal from abnormal operating conditions
    • to optimize the technology and the product
    • to plan and schedule sequences of operating steps.
  • The aim of the data warehouse is to organize and store the data taken from different information sources by different sampling times and to allow the users to query, analyze and group these data.

[Figure: three stages of the Enterprise Information System: 1. partially overlapping, non-consistent, mainly paper based reports; 2. adequate reports based on electronically stored data; 3. consistent information sources, regulated access]

SLIDE 9

Know-how: Interactive Process Data Mining

Data mining (knowledge discovery in databases): extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases.

[Figure: knowledge discovery pipeline: Databases → Data Cleaning and Data Integration → Data Warehouse → Task-relevant Data Selection → Data Mining → Pattern Evaluation; interaction between the Technology Database, Data Analysis and prior knowledge/goals]

Steps of the knowledge discovery process:
  • Learning the application domain: relevant prior knowledge and goals of the application
  • Creating a target data set: data selection
  • Data cleaning and preprocessing (may take 60% of the effort!)
  • Data reduction and transformation: find useful features
  • Choosing the functions of data mining: summarization, classification, regression, association, clustering
  • Choosing the mining algorithm(s)
  • Data mining: search for patterns of interest
  • Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
  • Use of the discovered knowledge

SLIDE 10

Why Mine Data? Commercial Viewpoint

  • Lots of data is being collected and warehoused
    • UC Berkeley 2003 estimate: 5 exabytes (5 million terabytes) of new data was created in 2002
    • Relational databases; huge data warehouses are under construction
    • WWW: a huge, hyper-linked, dynamic, global information system
    • e-commerce
    • Text (documents, e-mails) and multimedia databases
  • Competitive pressure is strong
    • Provide better, customized services for an edge (e.g. in Customer Relationship Management)
  • Computers have become cheaper and more powerful

SLIDE 11

Why Mine Data? Scientific Viewpoint

  • Data collected and stored at enormous speeds (GB/hour)
    • time-series data
    • microarrays generating gene expression data
    • scientific simulations generating terabytes of data
  • Traditional techniques are infeasible for raw data
  • Data mining may help scientists
    • in classifying and segmenting data
    • in hypothesis formation

SLIDE 12

Process Monitoring based on Data Mining (cont.)

  • Problem: how to simultaneously monitor 10-100 process variables?
  • Solution: reduce the dimensionality of the correlated process data by projecting them down onto a lower dimensional latent variable space
  • Tools: Principal Component Analysis (PCA) and/or Self-Organizing Maps (SOM)
  • Beside process performance monitoring, these tools can be used for system identification, to estimate the product quality, and for product design.

[Figure: SOM U-matrix and component planes for C2, C6, H2, Slurry, T, PE, C6in, C2in, H2in, IBin and MFI]
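As a hedged illustration of this latent-variable idea (not the department's actual tooling), PCA-based projection could be sketched as follows; the data, dimensions and variable names are made up for the example.

```python
# Sketch: project correlated process variables onto a low-dimensional latent space with PCA.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical process data: 1000 samples of 30 correlated variables driven by 3 hidden factors.
latent = rng.normal(size=(1000, 3))
mixing = rng.normal(size=(3, 30))
X = latent @ mixing + 0.05 * rng.normal(size=(1000, 30))

pca = PCA(n_components=3)
scores = pca.fit_transform(X)                     # latent-variable scores for monitoring
print("explained variance ratio:", pca.explained_variance_ratio_)
# Monitoring idea: track the scores and the residuals X - pca.inverse_transform(scores)
# to distinguish normal from abnormal operating conditions.
```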

SLIDE 13

Major Data Mining Tasks

  • Classification: predicting an item class
  • Clustering: finding clusters in data
  • Associations: e.g. A & B & C occur frequently
  • Visualization: to facilitate human discovery
  • Summarization: describing a group
  • Deviation detection: finding changes
  • Estimation: predicting a continuous value
  • Link analysis: finding relationships

SLIDE 14

What is Cluster Analysis?

  • Cluster: a collection of data objects
    • similar to one another within the same cluster
    • dissimilar to the objects in other clusters
  • Cluster analysis: grouping a set of data objects into clusters
  • Clustering is unsupervised classification: no predefined classes
  • Typical applications:
    • as a stand-alone tool to get insight into the data distribution
    • as a preprocessing step for other algorithms

SLIDE 15

Terminology

[Figure: Top speed [km/h] vs. Weight [kg] scatter plot with the clusters Sports cars, Medium market cars and Lorries; illustrates the terms object (data point), feature, feature space, cluster and feature label]

Vehicle  Top speed [km/h]  Colour  Air resistance  Weight [kg]
V1       220               red     0.30            1300
V2       230               black   0.32            1400
V3       260               red     0.29            1500
V4       140               gray    0.35             800
V5       155               blue    0.33             950
V6       130               white   0.40             600
V7       100               black   0.50            3000
V8       105               red     0.60            2500
V9       110               gray    0.55            3500

SLIDE 16

Classification vs. Clustering

Classification (supervised learning): learns a method for predicting the instance class from pre-labeled (classified) instances.

SLIDE 17

General Applications of Clustering

  • Pattern recognition
  • Spatial data analysis: detect spatial clusters and explain them in spatial data mining
  • Image processing
  • Economic science (especially market research)
  • WWW
    • Document classification
    • Cluster Weblog data to discover groups of similar access patterns

SLIDE 18

Input Data for Clustering

  • A set of N points in an M dimensional space, OR
  • A proximity matrix that gives the pairwise distance or similarity between points
    • can be viewed as a weighted graph
  • Types of data in clustering analysis: interval-scaled variables; binary variables; nominal, ordinal, and ratio variables; variables of mixed types

Example similarity matrix:
     I1    I2    I3    I4    I5    I6
I1   1.00  0.70  0.80  0.00  0.00  0.00
I2   0.70  1.00  0.65  0.25  0.00  0.00
I3   0.80  0.65  1.00  0.00  0.00  0.00
I4   0.00  0.25  0.00  1.00  0.90  0.85
I5   0.00  0.00  0.00  0.90  1.00  0.95
I6   0.00  0.00  0.00  0.85  0.95  1.00

SLIDE 19

Measures of Similarity

  • The first step in clustering raw data is to define some measure of similarity between two data items
  • That is, we need to know when two data items are close enough to be considered members of the same cluster
  • Different measures may produce entirely different clusters, so the measure selected must reflect the nature of the data

SLIDE 20

What does 'similarity' mean?

  • It is hard to define, but we know it when we 'look at it'
  • Emotions, feelings...
  • There is a need to apply more practical definitions

SLIDE 21

Similarity and Dissimilarity Between Objects

  • Distances are normally used to measure the similarity or dissimilarity between two data objects
  • Some popular ones include the Minkowski distance:

      d(i, j) = ( |x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q )^(1/q)

    where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two p-dimensional data objects and q is a positive integer
  • If q = 1, d is the Manhattan distance:

      d(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|
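A minimal NumPy sketch of the formula above (illustrative only); q = 1 and q = 2 reduce to the Manhattan and Euclidean distances.

```python
# Sketch: Minkowski distance between two p-dimensional data objects.
import numpy as np

def minkowski(x, y, q=2):
    """d(x, y) = (sum_k |x_k - y_k|^q)^(1/q); q = 1 is Manhattan, q = 2 is Euclidean."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

a, b = [220, 1300], [230, 1400]   # e.g. two vehicles described by (top speed, weight)
print(minkowski(a, b, q=1))       # 110.0  (Manhattan)
print(minkowski(a, b, q=2))       # ~100.5 (Euclidean)
```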

SLIDE 22

Types of Clustering: Partitional and Hierarchical

  • Partitional clustering (e.g. K-means) finds a one-level partitioning of the data into K disjoint groups.
  • Hierarchical clustering finds a hierarchy of nested clusters (dendrogram).
    • May proceed either bottom-up (agglomerative) or top-down (divisive).
    • Uses a proximity matrix; can be viewed as operating on a proximity graph.

SLIDE 23

Simple Clustering: K-means

Works with numeric data only.
1) Pick a number (K) of cluster centers (at random)
2) Assign every item to its nearest cluster center (e.g. using Euclidean distance)
3) Move each cluster center to the mean of its assigned items
4) Repeat steps 2-3 until convergence (change in cluster assignments less than a threshold)
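The four steps above map directly onto a short NumPy loop; the following is an illustrative sketch, not the course's MATLAB toolbox code.

```python
# Sketch of the K-means procedure described above (illustrative only).
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]   # 1) random initial centers
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # 2) assign every item to its nearest center (Euclidean distance)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                                             # 4) assignments stopped changing
        labels = new_labels
        # 3) move each center to the mean of its assigned items
        for k in range(K):
            members = X[labels == k]
            if len(members):
                centers[k] = members.mean(axis=0)
    return centers, labels

# Toy data: three blobs in the plane.
X = np.vstack([np.random.randn(50, 2) + off for off in ([0, 0], [5, 5], [0, 5])])
centers, labels = kmeans(X, K=3)
```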

SLIDE 24

K-means example, step 1

[Figure: pick 3 initial cluster centers k1, k2, k3 at random in the X-Y plane]

SLIDE 25

K-means example, step 2

[Figure: assign each point to the closest cluster center]

SLIDE 26

K-means example, step 3

[Figure: move each cluster center to the mean of its cluster]

SLIDE 27

K-means example, step 4

[Figure: reassign points closest to a different new cluster center. Q: which points are reassigned?]

SLIDE 28

K-means example, step 4b

[Figure: re-compute cluster means]

SLIDE 29

K-means example, step 5

[Figure: move cluster centers to cluster means]

SLIDE 30

K-clustering and Voronoi Diagrams

[Figure: cluster centers C1-C6 and the Voronoi cells they induce]

SLIDE 31

K-means Discussion

  • The result can vary significantly depending on the initial choice of seeds
  • Can get trapped in a local minimum
    • Example: [Figure: instances and initial cluster centers that lead to a poor local minimum]
  • To increase the chance of finding the global optimum: restart with different random seeds

SLIDE 32

K-means clustering summary

Advantages:
  • Simple, understandable
  • Items automatically assigned to clusters

Disadvantages:
  • Must pick the number of clusters beforehand
  • All items are forced into a cluster
  • Too sensitive to outliers and to clusters with complex shapes

SLIDE 33

Clustering Methods

  • Many different methods and algorithms:
    • For numeric and/or symbolic data
    • Deterministic vs. probabilistic
    • Exclusive vs. overlapping
    • Hierarchical vs. flat
    • Top-down vs. bottom-up

[Figure: the same points a-k partitioned by a non-overlapping and by an overlapping clustering]
slide-34
SLIDE 34

34 34/ /47 47

Theory Theory of

  • f k

k-

  • means

means

⎪ ⎩ ⎪ ⎨ ⎧ − ≤ − =

  • therwise

if m

j k i k ik

1

2 2

c u c u

data point k cluster centre i distance cluster centre j

K c i all for U C Ø j i all for Ø C C U C

i j i c i i

≤ ≤ ⊂ ⊂ ≠ = ∩ =

=

2

1

U

All clusters C together fills the whole universe U Clusters do not

  • verlap

A cluster C is never empty and it is smaller than the whole universe U There must be at least 2 clusters in a c-partition and at most as many as the number of data points K

∑ ∑ ∑

= ∈ =

⎟ ⎟ ⎠ ⎞ ⎜ ⎜ ⎝ ⎛ − = =

c i C k i k c i i

i k

J J

1 2 , 1 u

c u

Minimise the total sum

  • f all distances
SLIDE 35

Fuzzy Logic

  • Philosophical approach
    • Ontological commitment based on "degree of truth"
    • Is not a method for reasoning under uncertainty
  • Crisp facts: distinct boundaries
  • Fuzzy facts: imprecise boundaries
  • Probability: incomplete facts
  • Example
    • The temperature is 26 °C (crisp)
    • The temperature is HOT (fuzzy)

SLIDE 36

Example Fuzzy Variable

Each membership function tells us how much we consider a temperature to be in the set if it has a particular value, or how much truth to attribute to the statement "The temperature is HOT".

[Figure: membership (degree of truth) vs. temperature with Cold, Medium and Hot membership functions over roughly 10-40 °C]

  • Classical set theory
    • An object is either in or not in the set
  • Sets with smooth boundaries
    • Not completely in or out; somebody 6 ft tall is 80% tall
  • Fuzzy set theory
    • An object is in a set by a matter of degree
      • 1.0 => in the set
      • 0.0 => not in the set
      • 0.0 < membership < 1.0 => partially in the set
    • Provides a way to write symbolic rules with terms like "medium" but evaluate them in a quantified way
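A tiny illustrative sketch of such membership functions; the trapezoid breakpoints below are assumed values, not read off the slide's figure.

```python
# Sketch: fuzzy membership functions for a "temperature" variable (assumed breakpoints).
def trapezoid(x, a, b, c, d):
    """1 on [b, c], linear ramps on [a, b] and [c, d], 0 outside [a, d]."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

cold   = lambda t: trapezoid(t, -40, -40, 10, 20)
medium = lambda t: trapezoid(t, 10, 20, 25, 35)
hot    = lambda t: trapezoid(t, 25, 35, 60, 60)

t = 26.0
print(f"T = {t} C: cold={cold(t):.2f}, medium={medium(t):.2f}, hot={hot(t):.2f}")
```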

SLIDE 37

Fuzzy c-means

Membership of point k in cluster i, where d_ik = ||c_i - x_k|| is the distance from point k to the current cluster centre i, d_jk the distance to the other cluster centres j, and q > 1 the fuzziness exponent:

  u_ik = 1 / Σ_{j=1..c} (d_ik / d_jk)^(2/(q-1))

[Figure: memberships of a test point for cluster centres along a line of data points, plotted with q = 1.1 (•) and with q = 2 (*)]
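A compact sketch of the fuzzy c-means iteration built from the membership formula above (illustrative, not the lecture's MATLAB toolbox); the centre update uses the standard weighted mean with weights u_ik^q.

```python
# Sketch: fuzzy c-means with the membership update u_ik = 1 / sum_j (d_ik/d_jk)^(2/(q-1)).
import numpy as np

def fcm(X, c, q=2.0, max_iter=100, tol=1e-5, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)                                   # memberships sum to 1 per point
    for _ in range(max_iter):
        # Centre update: weighted means with weights u_ik^q.
        W = U ** q
        centers = (W @ X) / W.sum(axis=1, keepdims=True)
        # Membership update from the formula above.
        d = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2) + 1e-12
        U_new = 1.0 / np.sum((d[:, None, :] / d[None, :, :]) ** (2.0 / (q - 1.0)), axis=1)
        if np.max(np.abs(U_new - U)) < tol:
            U = U_new
            break
        U = U_new
    return centers, U

# Toy usage: two Gaussian blobs.
X = np.vstack([np.random.randn(60, 2) + off for off in ([0, 0], [4, 4])])
centers, U = fcm(X, c=2)
```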

SLIDE 38

Fuzzy c-means (cont.)

Fuzzy c-partition constraints (compare with the hard c-partition):

  ∪_{i=1..c} C_i = U       (all clusters C_i together fill the whole universe U)
  Ø ⊂ C_i ⊂ U for all i    (a cluster is never empty and is smaller than the whole universe U)
  2 ≤ c ≤ K                (at least 2 clusters in a c-partition, at most as many as the number of data points K)
  C_i ∩ C_j = Ø is no longer valid: clusters do overlap

Remark: the sum of memberships for a data point is 1, and the total for all points is K.

SLIDE 39

Cluster prototype

  • c-means: Euclidean norm, spherical clusters
  • Gustafson-Kessel: ellipsoids with equal volumes
  • Gath-Geva: clusters with different shapes and volumes

SLIDE 40

How to reveal the hidden structure of data?

SLIDE 41

Aims and Tools

Data can form groups and can also lie on a low dimensional (smooth) manifold of the feature space (relationships among the variables, input-output data, ...).

The proposed approaches use a special distance, the geodesic distance, which reflects the true embedded manifold. In graph theory, the distance between two vertices in a weighted graph is the sum of the weights of the edges in a shortest path connecting them; this shortest-path distance is used as an approximation of the geodesic distance.
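A minimal sketch of this graph-based approximation, assuming a k-nearest-neighbour graph weighted with Euclidean edge lengths; the `n_neighbors` value is an arbitrary choice for the example.

```python
# Sketch: approximate geodesic distances as shortest paths on a weighted kNN graph.
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def geodesic_distances(X, n_neighbors=8):
    # Weighted kNN graph: edge weights are Euclidean distances between neighbours.
    G = kneighbors_graph(X, n_neighbors=n_neighbors, mode="distance")
    # Sum of edge weights along the shortest path approximates the geodesic distance.
    return shortest_path(G, method="D", directed=False)

# Example: points on a noisy arc; geodesic distances follow the arc, not the chord.
t = np.linspace(0, 1.5 * np.pi, 200)
X = np.c_[np.cos(t), np.sin(t)] + 0.01 * np.random.randn(200, 2)
D = geodesic_distances(X)
print(D.shape)        # (200, 200) matrix of approximated geodesic distances
```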

SLIDE 42

Geodesic Distance based Clustering Algorithms

Algorithm I: Isomap based Clustering

Isomap seeks to preserve the intrinsic geometry of the data, as captured in the geodesic manifold distances between all pairs of data points. It uses the (approximated) geodesic distances between the data, and it is able to discover nonlinear manifolds and project them into a lower dimensional space. Algorithm I does the clustering on the data projected by Isomap: after the embedding, any clustering algorithm can be applied to the projected data. In this phase of the project, classical fuzzy c-means has been used.
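A sketch of Algorithm I under stated assumptions: scikit-learn's Isomap for the geodesic-preserving projection and a compact fuzzy c-means (the update rules from slide 37) for the clustering step; the S-curve data set and all parameter values are chosen only for illustration.

```python
# Sketch of Algorithm I: Isomap embedding followed by fuzzy c-means on the embedding.
import numpy as np
from sklearn.datasets import make_s_curve
from sklearn.manifold import Isomap

def fcm(X, c, q=2.0, max_iter=100, seed=0):
    # Compact fuzzy c-means (same update rules as on slide 37).
    rng = np.random.default_rng(seed)
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)
    for _ in range(max_iter):
        W = U ** q
        centers = (W @ X) / W.sum(axis=1, keepdims=True)
        d = np.linalg.norm(X[None] - centers[:, None], axis=2) + 1e-12
        U = 1.0 / np.sum((d[:, None, :] / d[None, :, :]) ** (2.0 / (q - 1.0)), axis=1)
    return centers, U

X, _ = make_s_curve(n_samples=1000, random_state=0)          # 3-D S-curve benchmark
Z = Isomap(n_neighbors=8, n_components=2).fit_transform(X)   # geodesic-preserving projection
centers, U = fcm(Z, c=5)                                      # cluster the projected data
labels = U.argmax(axis=0)
```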

SLIDE 43

Geodesic Distance based Clustering Algorithms (cont'd)

Algorithm II: Geodesic Distance based c-Medoid Clustering

It does the clustering in the original feature space. The aim is to explore the hidden structure of the data and find groups of similar data points. If the data lie on a (low dimensional) embedded manifold, the classical clustering methods cannot be used, mainly because of their distance measure. The crucial question is how to measure the distances between data points to calculate their similarity measures. To reflect the manifold containing the samples, the distances have to be measured on the manifold; hence, geodesic distances have to be used.

SLIDE 44

Geodesic Distance based c-Medoid Clustering (cont'd)

Step 1: Calculate the (approximated) geodesic distances between all pairs of data points.
Step 2: Use the fuzzy c-medoid algorithm:
  (a) Arbitrarily choose c objects as the initial medoids.
  (b) Use the calculated geodesic distances to determine how far the data points are from the medoids (the cluster centers).
  (c) Calculate fuzzy membership degrees as usual in fuzzy partitional clustering methods.
  (d) Calculate the objective function terms with the determined membership degrees for all data points as potential medoids, and choose as new medoids the data points that minimize the objective function.
  (e) If there are changes, jump to Step 2(b).
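A sketch of these steps under assumptions: geodesic distances approximated by shortest paths on a kNN graph (as on slide 41), memberships computed with the standard fuzzy formula, and the new medoid of each cluster chosen as the point minimizing the membership-weighted sum of geodesic distances. This illustrates the described procedure rather than reproducing the authors' implementation.

```python
# Sketch of Algorithm II: fuzzy c-medoid clustering on approximated geodesic distances.
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def geodesic_c_medoid(X, c, q=2.0, n_neighbors=8, max_iter=50, seed=0):
    # Step 1: approximate geodesic distances (shortest paths on a weighted kNN graph).
    G = kneighbors_graph(X, n_neighbors=n_neighbors, mode="distance")
    D = shortest_path(G, directed=False)                     # D[i, j] ~ geodesic distance

    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(X), size=c, replace=False)      # Step 2(a): initial medoids
    for _ in range(max_iter):
        d = D[medoids] + 1e-12                               # Step 2(b): distances to medoids, shape (c, N)
        # Step 2(c): fuzzy memberships, u_ik = 1 / sum_j (d_ik / d_jk)^(2/(q-1))
        U = 1.0 / np.sum((d[:, None, :] / d[None, :, :]) ** (2.0 / (q - 1.0)), axis=1)
        # Step 2(d): per cluster, pick the point minimizing sum_k u_ik^q * D[point, k]
        new_medoids = np.array([np.argmin(D @ (U[i] ** q)) for i in range(c)])
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)):
            break                                            # Step 2(e): medoids stopped changing
        medoids = new_medoids
    return medoids, U
```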

SLIDE 45

Examples: S-Curve Data Set

[Figure: 2-dimensional Isomap projection of the S-curve data set (dots) and the cluster centers found by fuzzy c-means on the projected data (diamonds); the data points in the feature space closest to the fuzzy c-means cluster centers (Algorithm I); and the centers of the geodesic distance based clustering in the feature space (Algorithm II)]

Both algorithms give similarly excellent results.

SLIDE 46

Examples: Spiral Data Set

[Figure: Isomap projection of the spiral data set (dots) and the cluster centers found by fuzzy c-means on the projected data (diamonds); the data points in the feature space closest to those centers (Algorithm I); the centers of classical fuzzy c-means clustering in the feature space; and the centers of the geodesic distance based clustering in the feature space (Algorithm II)]

On this data set, Algorithm II performs much better than Algorithm I.

SLIDE 47

Conclusions

Aim: to discover the hidden structure of complex multivariate datasets. Tool: clustering. Problem: the classical clustering techniques may fail to explore the (nonlinear) manifolds the data lie on. Proposed solution: use of a geodesic distance that is based on the exploration of the manifold.

Algorithm I is based on clustering the Isomap projection, i.e. the Isomap algorithm is used to explore the hidden (nonlinear) structure of the data, and the projected data are clustered. Algorithm II is based on the geodesic distances directly, and can be considered a modification of fuzzy c-medoid clustering.

The examples show the advantages of the proposed methods on benchmark data sets for (manifold) clustering.