Clustering Genetic Algorithm (Petra Kudová, ETID 2007)


SLIDE 1

ETID’2007

Introduction Clustering Genetic Algorithm Experimental results Conclusion

Clustering Genetic Algorithm

Petra Kudová

Department of Theoretical Computer Science Institute of Computer Science Academy of Sciences of the Czech Republic

ETID 2007

SLIDE 2

Outline

  • Introduction
  • Clustering Genetic Algorithm
  • Experimental results
  • Conclusion

SLIDE 3

Motivation

Goals

  • study the applicability of GAs to clustering
  • design genetic operators suitable for clustering
  • apply them to tasks with an unknown number of clusters
  • compare with standard techniques

Clustering

  • partitioning of a data set into subsets (clusters) so that the data in each subset share some common trait
  • often based on some similarity or distance measure
  • the notion of similarity is always problem-dependent
  • wide range of algorithms (k-means, SOMs, etc.)

SLIDE 4

Clustering

Definition of cluster

  • Basic idea: a cluster groups together similar objects
  • More formally: clusters are connected regions of a multi-dimensional space containing a relatively high density of points, separated from other such regions by regions with a low density of points

Applications

  • Marketing: find groups of customers with similar behaviour
  • Biology: classification of plants/animals given their features
  • WWW: document classification, clustering weblog data to discover groups of similar access patterns

SLIDE 5

Genetic algorithms

Genetic algorithms

  • stochastic optimization technique applicable to a wide range of problems
  • work with a population of solutions (individuals)
  • new populations are produced by genetic operators

Genetic operators

  • selection: the better the solution, the higher its probability of being selected for reproduction
  • crossover: creates new individuals by combining old ones
  • mutation: random changes
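The loop described by these bullets can be sketched in a few lines of Python. This is a generic illustration, not the implementation used in the talk: binary tournament selection, elitism, and the parameter names (`pop_size`, `generations`, `seed`) are all assumptions, and the toy problem at the end simply maximises -(x - 3)^2.

```python
import random

def genetic_algorithm(init, fitness, crossover, mutate,
                      pop_size=20, generations=50, seed=0):
    """Generic GA loop: selection, crossover, mutation."""
    rng = random.Random(seed)
    population = [init(rng) for _ in range(pop_size)]

    def select(scored):
        # Binary tournament (an assumption): the fitter of two random
        # individuals wins, so better solutions reproduce more often.
        a, b = rng.sample(scored, 2)
        return a[1] if a[0] >= b[0] else b[1]

    for _ in range(generations):
        scored = [(fitness(ind), ind) for ind in population]
        best = max(scored, key=lambda s: s[0])[1]
        children = [best]  # elitism (an assumption): keep the best unchanged
        while len(children) < pop_size:
            child = crossover(select(scored), select(scored), rng)
            children.append(mutate(child, rng))
        population = children
    return max(population, key=fitness)

# Toy usage: individuals are plain floats, the optimum is x = 3.
best = genetic_algorithm(
    init=lambda rng: rng.uniform(-10, 10),
    fitness=lambda x: -(x - 3.0) ** 2,
    crossover=lambda a, b, rng: (a + b) / 2,
    mutate=lambda x, rng: x + rng.gauss(0, 0.1),
)
```

Because the best individual is copied into every new generation, the best fitness in this sketch never decreases from one generation to the next.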

SLIDE 6

Clustering Genetic Algorithm (CGA)

Representation of the individual

  • Approach 1 (Hruschka, Campelo, Castro)
      • for each data point, store its cluster ID
      • long individuals (high space requirements)

  • Approach 2 (Maulik, Bandyopadhyay)
      • store the centres of the clusters
      • data points must be assigned to clusters before each fitness evaluation
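To make the difference between the two representations concrete, here is a small Python sketch on made-up data. The four points, the variable names, and the helper `assign` are illustrative assumptions; only the two encodings themselves come from the slide.

```python
# Toy data: four 2-D points forming two obvious groups (illustrative only).
data = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.9, 1.0)]

# Approach 1: one cluster ID per data point; length grows with the data set.
labels_individual = [0, 0, 1, 1]

# Approach 2: one centre per cluster; length is K times the dimension.
centres_individual = [(0.05, 0.0), (0.95, 1.0)]

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(data, centres):
    """Decode approach 2: index of the nearest centre for every point."""
    return [min(range(len(centres)), key=lambda k: sq_dist(x, centres[k]))
            for x in data]

decoded = assign(data, centres_individual)
```

Here `decoded` reproduces the label vector of approach 1, which is the extra assignment step that approach 2 pays for before every fitness evaluation.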

SLIDE 7

Fitness

Normalization

  • partition the data set into clusters using the given individual
  • move the centres to the actual gravity centres

Fitness evaluation

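  • clustering error: $\mathrm{fit}(I) = -E_{VQ}$, where

    $E_{VQ} = \sum_{i=1}^{N} \| \vec{x}_i - \vec{c}_{f(\vec{x}_i)} \|^2, \qquad f(\vec{x}_i) = \arg\min_{k} \| \vec{x}_i - \vec{c}_k \|^2$

  • silhouette function: $\mathrm{fit}(I) = \sum_{i=1}^{N} s(\vec{x}_i)$, where

    $s(\vec{x}) = \dfrac{b(\vec{x}) - a(\vec{x})}{\max\{ b(\vec{x}), a(\vec{x}) \}}$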
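The normalization step and both fitness functions can be sketched in plain Python as follows. The slide does not define a(x) and b(x); the sketch assumes the usual silhouette convention (a is the mean distance from x to the other points of its own cluster, b the smallest mean distance to the points of any other cluster). The handling of empty and singleton clusters is also an assumption, as are the toy data.

```python
import math

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(data, centres):
    """f(x_i): index of the nearest centre for every data point."""
    return [min(range(len(centres)), key=lambda k: sq_dist(x, centres[k]))
            for x in data]

def normalize(data, centres):
    """Move every centre to the mean of the points assigned to it."""
    labels = assign(data, centres)
    moved = []
    for k, centre in enumerate(centres):
        members = [x for x, lab in zip(data, labels) if lab == k]
        if not members:
            moved.append(centre)  # assumption: an empty cluster keeps its centre
            continue
        moved.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return moved

def fitness_evq(data, centres):
    """fit(I) = -E_VQ: negated sum of squared distances to the nearest centre."""
    labels = assign(data, centres)
    return -sum(sq_dist(x, centres[lab]) for x, lab in zip(data, labels))

def fitness_silhouette(data, centres):
    """fit(I) = sum over all points of s(x) = (b - a) / max(a, b)."""
    labels = assign(data, centres)
    total = 0.0
    for i, x in enumerate(data):
        mean_dist = {}  # cluster index -> mean distance from x to its other points
        for k in set(labels):
            dists = [math.dist(x, y) for j, y in enumerate(data)
                     if labels[j] == k and j != i]
            if dists:
                mean_dist[k] = sum(dists) / len(dists)
        a = mean_dist.pop(labels[i], None)
        if a is None or not mean_dist:
            continue  # assumption: s(x) = 0 for a singleton or a single cluster
        b = min(mean_dist.values())
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total

# Toy usage on four 2-D points (illustrative data).
data = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.9, 1.0)]
centres = normalize(data, [(0.0, 0.0), (1.0, 1.0)])
evq_fit = fitness_evq(data, centres)
sil_fit = fitness_silhouette(data, centres)
```

Both functions are written so that larger values are better, which is why the clustering error enters with a minus sign.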

SLIDE 8

Crossover

One-point Crossover

exchange the whole blocks (i.e. centres)

Combining Crossover

match the centres and combine them
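A possible reading of the two crossover operators, for individuals stored as lists of centres, is sketched below in Python. The slide gives no details of how centres are matched or combined, so nearest-centre matching and a random convex combination are assumptions of this sketch, not the talk's definition.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def one_point_crossover(parent_a, parent_b, rng):
    """Swap tails at a cut that falls between centres, never inside one."""
    shortest = min(len(parent_a), len(parent_b))
    if shortest < 2:
        return list(parent_a), list(parent_b)  # nothing to cut
    cut = rng.randint(1, shortest - 1)
    return (parent_a[:cut] + parent_b[cut:],
            parent_b[:cut] + parent_a[cut:])

def combining_crossover(parent_a, parent_b, rng):
    """Match each centre of A to its nearest centre in B and blend the pair."""
    child = []
    for centre in parent_a:
        partner = min(parent_b, key=lambda other: sq_dist(centre, other))
        w = rng.random()  # assumption: random convex combination of the pair
        child.append(tuple(w * a + (1 - w) * b
                           for a, b in zip(centre, partner)))
    return child

# Toy usage with two parents of three 2-D centres each (illustrative data).
rng = random.Random(0)
parent_a = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
parent_b = [(0.1, 0.1), (1.1, 1.1), (2.1, 2.1)]
child_1, child_2 = one_point_crossover(parent_a, parent_b, rng)
blended = combining_crossover(parent_a, parent_b, rng)
```

Cutting only at block boundaries keeps every centre intact, whereas the combining variant produces new centres that lie between the matched parents.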

SLIDE 9

Mutation

One-point mutation, Biased one-point mutation

  • One-point mutation: $\vec{c}_{new} = \vec{x}_i$, where $i \leftarrow \mathrm{random}(1, N)$

  • Biased one-point mutation: $\vec{c}_{new} = \vec{c}_{old} + \Delta$, where $\Delta$ is a small random vector

K-means mutation

several steps of k-means clustering

Cluster addition, Cluster removal

  • Cluster addition: adds one centre
  • Cluster removal: removes a randomly selected centre
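The mutation operators listed on this slide can be sketched in Python for individuals stored as lists of centres. Several details are assumptions, since the slide does not specify them: the Gaussian perturbation and its scale `sigma`, the number of k-means `steps`, placing an added centre on a random data point, and never removing the last remaining centre.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def one_point_mutation(centres, data, rng):
    """c_new = x_i: replace a random centre with a randomly chosen data point."""
    out = list(centres)
    out[rng.randrange(len(out))] = rng.choice(data)
    return out

def biased_one_point_mutation(centres, rng, sigma=0.05):
    """c_new = c_old + delta: nudge a random centre by a small Gaussian vector."""
    out = list(centres)
    j = rng.randrange(len(out))
    out[j] = tuple(v + rng.gauss(0.0, sigma) for v in out[j])
    return out

def kmeans_mutation(centres, data, steps=3):
    """Run a few k-means iterations starting from the individual's centres."""
    centres = list(centres)
    for _ in range(steps):
        groups = [[] for _ in centres]
        for x in data:
            nearest = min(range(len(centres)),
                          key=lambda k: sq_dist(x, centres[k]))
            groups[nearest].append(x)
        # Empty clusters keep their old centre (an assumption of this sketch).
        centres = [tuple(sum(col) / len(g) for col in zip(*g)) if g else c
                   for g, c in zip(groups, centres)]
    return centres

def cluster_addition(centres, data, rng):
    """Add one centre; placing it on a random data point is an assumption."""
    return list(centres) + [rng.choice(data)]

def cluster_removal(centres, rng):
    """Remove a randomly selected centre, keeping at least one."""
    out = list(centres)
    if len(out) > 1:
        del out[rng.randrange(len(out))]
    return out
```

Cluster addition and removal are the only operators here that change the length of an individual, which is what lets the algorithm search over the number of clusters.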

SLIDE 10

Experiments

Goals

  • demonstrate the performance of CGA
  • compare variants of genetic operators

Data Sets

  • 25 centres

    [Figure: plot of the data set; both axes range from 0.1 to 0.9; legend: data]

  • vowels (UCI machine learning repository): 11 kinds of vowels, dimension 9, 990 examples

  • mushrooms (UCI machine learning repository): 23 kinds of mushrooms, dimension 125, 8124 examples

SLIDE 11

Operators Comparison

Mutation

Operator                  25 clusters   Vowels
1-point                   0.20          927.7
Biased 1-point            0.25          927.3
K-means                   0.26          940.7
1-point + Biased 1-pt     0.21          927.3
1-point + K-means         0.21          927.6
All                       0.22          927.3

Crossover

Operator                  25 clusters   Vowels
1-point                   0.201         927.7
Combining                 0.222         927.4
Both                      0.202         927.4

SLIDE 12

Convergence Rate – Mutation

[Figure: Mutation Comparison. Fitness (axis ticks 900 to 1250) against iteration (ticks 5 to 25) for 1-point, biased 1-point, k-means, and 1-point + k-means mutation]

SLIDE 13

Convergence Rate – Crossover

[Figure: Crossover Comparison. Fitness (axis ticks 900 to 1250) against iteration (ticks 2 to 14) for 1-point, combining, and both crossovers]

SLIDE 14

Comparison to other clustering algorithms

Mushroom data set

method     accuracy
k-means    95.8%
CLARA      96.8%
CGA        97.3%
HCA        99.2%

25 centers

[Figure: two panels, CGA and k-means, each showing data and centers; both axes range from 0.1 to 0.9]

SLIDE 15

Estimating the number of clusters

Initial population: 2 to 15 centres

[Figure: # centers (ticks 12 to 26) plotted over a horizontal axis with ticks 10 to 80]

SLIDE 16

Estimating the number of clusters

Initial population: 10 to 30 centres

[Figure: # centers (ticks 21 to 25.5) plotted over a horizontal axis with ticks 2 to 14]
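An initial population whose individuals differ in their number of centres, as in the two experiments above (2 to 15 and 10 to 30 centres), could be built along the following lines. This is only a sketch: the function name `init_population`, its parameters, and the choice to start centres on distinct random data points are assumptions, and the data are random placeholders.

```python
import random

def init_population(data, pop_size, k_min, k_max, rng):
    """Give every individual its own number of centres from [k_min, k_max]."""
    population = []
    for _ in range(pop_size):
        k = rng.randint(k_min, k_max)
        # Assumption: centres start on k distinct data points.
        population.append(rng.sample(data, k))
    return population

# Toy usage: 100 random 2-D points, ranges taken from the two experiments.
rng = random.Random(0)
data = [(rng.random(), rng.random()) for _ in range(100)]
small_start = init_population(data, pop_size=10, k_min=2, k_max=15, rng=rng)
large_start = init_population(data, pop_size=10, k_min=10, k_max=30, rng=rng)
```

From such a start, only operators that change the length of an individual can move the population towards a different number of centres.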

SLIDE 17

Conclusion

Summary

  • Clustering Genetic Algorithm proposed
  • several genetic operators proposed and compared
  • CGA compared to available clustering algorithms
  • estimating the number of clusters tested

Future work

  • application of CGA to large data sets
  • reducing time requirements, lazy evaluations, etc.
  • applications

SLIDE 18

Thank you. Any questions?