Clustering Genetic Algorithm
Petra Kudová
Department of Theoretical Computer Science, Institute of Computer Science, Academy of Sciences of the Czech Republic
ETID 2007

Outline
- Introduction
- Clustering Genetic Algorithm
- Experimental results
- Conclusion

Motivation

Goals
- study the applicability of GAs to clustering
- design genetic operators suitable for clustering
- apply CGA to tasks with an unknown number of clusters
- compare with standard techniques

Clustering
- partitioning of a data set into subsets (clusters), so that the data in each subset share some common trait
- often based on some similarity or distance measure
- the notion of similarity is always problem-dependent
- wide range of algorithms (k-means, SOMs, etc.)

Clustering

Definition of cluster
- Basic idea: a cluster groups together similar objects
- More formally: clusters are connected regions of a multi-dimensional space containing a relatively high density of points, separated from other such regions by regions with a low density of points

Applications
- Marketing: find groups of customers with similar behaviour
- Biology: classification of plants/animals given their features
- WWW: document classification, clustering weblog data to discover groups of similar access patterns

Genetic algorithms

Genetic algorithms
- stochastic optimization technique
- applicable to a wide range of problems
- work with a population of solutions (individuals)
- new populations are produced by genetic operators

Genetic operators
- selection: the better the solution, the higher its probability of being selected for reproduction
- crossover: creates new individuals by combining old ones
- mutation: random changes
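The loop described on this slide can be sketched in a few lines. This is a minimal illustration, not the implementation used in the talk: binary tournament selection, elitism, the parameter defaults, and the toy problem are all assumptions chosen to keep the example short.

```python
import random

def genetic_algorithm(init, fitness, crossover, mutate,
                      pop_size=20, generations=50, seed=0):
    """Generic GA loop: selection, crossover, mutation (illustrative sketch)."""
    rng = random.Random(seed)
    population = [init(rng) for _ in range(pop_size)]
    for _ in range(generations):
        scored = [(fitness(ind), ind) for ind in population]

        def select():
            # Binary tournament (an assumption): the fitter of two random
            # individuals wins, so better solutions reproduce more often.
            a, b = rng.choice(scored), rng.choice(scored)
            return a[1] if a[0] >= b[0] else b[1]

        # Elitism (an assumption): carry the best individual over unchanged.
        best = max(scored, key=lambda s: s[0])[1]
        next_pop = [best]
        while len(next_pop) < pop_size:
            child = crossover(select(), select(), rng)
            next_pop.append(mutate(child, rng))
        population = next_pop
    return max(population, key=fitness)

# Toy problem (hypothetical): maximise -(x - 3)^2 over real numbers.
def toy_fitness(x):
    return -(x - 3.0) ** 2

best = genetic_algorithm(
    init=lambda rng: rng.uniform(-10.0, 10.0),
    fitness=toy_fitness,
    crossover=lambda a, b, rng: (a + b) / 2.0,   # blend the two parents
    mutate=lambda x, rng: x + rng.gauss(0.0, 0.1),  # small random change
)
```

Because the best individual is always kept, the best fitness in the population never decreases from one generation to the next.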

Clustering Genetic Algorithm (CGA)

Representation of the individual

First approach (Hruschka, Campelo, Castro)
- for each data point, store its cluster ID
- long individuals (high space requirements)

Second approach (Maulik, Bandyopadhyay)
- store the centres of the clusters
- data points must be assigned to clusters before each fitness evaluation
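The difference between the two encodings is easiest to see on a tiny example. The sketch below uses made-up data and hypothetical helper names (`nearest_centre`, `decode`); it shows that a centre-based individual is much shorter than a label-based one, at the cost of a nearest-centre assignment step before the fitness can be computed.

```python
def nearest_centre(point, centres):
    """Index of the centre closest to `point` (squared Euclidean distance)."""
    def sq_dist(c):
        return sum((p - q) ** 2 for p, q in zip(point, c))
    return min(range(len(centres)), key=lambda k: sq_dist(centres[k]))

def decode(centres, data):
    """Turn a centre-based individual into one cluster ID per data point."""
    return [nearest_centre(x, centres) for x in data]

# Made-up data: four points in two obvious groups.
data = [(0.0, 0.0), (0.1, 0.2), (0.9, 1.0), (1.0, 0.8)]

# First approach: one cluster ID per data point (length N).
label_individual = [0, 0, 1, 1]

# Second approach: one vector per cluster centre (length K).
centre_individual = [(0.05, 0.1), (0.95, 0.9)]

# Decoding the centre-based individual recovers the same partition.
labels = decode(centre_individual, data)
```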

Fitness

Normalization
- partition the data set into clusters using the given individual
- move the centres to the actual gravity centres

Fitness evaluation
- clustering error: $fit(I) = -E_{VQ}$, where $E_{VQ} = \sum_{i=1}^{N} \| \vec{x}_i - \vec{c}_{f(\vec{x}_i)} \|^2$ and $f(\vec{x}_i) = \arg\min_k \| \vec{x}_i - \vec{c}_k \|^2$
- silhouette function: $fit(I) = \sum_{i=1}^{N} s(\vec{x}_i)$, where $s(\vec{x}) = \frac{b(\vec{x}) - a(\vec{x})}{\max\{ b(\vec{x}), a(\vec{x}) \}}$
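A rough sketch of the normalization step and both fitness functions follows. The slide does not define a(x) and b(x); the code assumes the usual silhouette definitions (a is the mean distance to the other points of the same cluster, b is the smallest mean distance to the points of any other cluster). The handling of empty and single-point clusters is also an assumption, and all function names are hypothetical.

```python
import math

def assign(data, centres):
    """f(x_i): index of the nearest centre for every data point."""
    def sq(x, c):
        return sum((p - q) ** 2 for p, q in zip(x, c))
    return [min(range(len(centres)), key=lambda k: sq(x, centres[k]))
            for x in data]

def normalise(data, centres):
    """Move each centre to the mean (gravity centre) of its assigned points."""
    labels = assign(data, centres)
    result = []
    for k, c in enumerate(centres):
        members = [x for x, l in zip(data, labels) if l == k]
        if members:
            result.append(tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            result.append(tuple(c))  # empty cluster: keep the centre (assumption)
    return result

def fit_error(data, centres):
    """fit(I) = -E_VQ, the negated sum of squared distances to nearest centres."""
    labels = assign(data, centres)
    return -sum(sum((p - q) ** 2 for p, q in zip(x, centres[l]))
                for x, l in zip(data, labels))

def fit_silhouette(data, centres):
    """fit(I) = sum of s(x_i), with s(x) = (b - a) / max(a, b)."""
    labels = assign(data, centres)
    clusters = {}
    for x, l in zip(data, labels):
        clusters.setdefault(l, []).append(x)
    total = 0.0
    for x, l in zip(data, labels):
        own = clusters[l]
        if len(own) < 2 or len(clusters) < 2:
            continue  # s(x) taken as 0 here (assumption)
        # x is in `own` with distance 0, so divide by len(own) - 1.
        a = sum(math.dist(x, y) for y in own) / (len(own) - 1)
        b = min(sum(math.dist(x, y) for y in pts) / len(pts)
                for k, pts in clusters.items() if k != l)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total

# Made-up data: two well-separated pairs of points.
data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centres = [(1.0, 1.0), (9.0, 1.0)]
normalised = normalise(data, centres)
```

On this data, normalization moves the centres onto the cluster means, which raises the clustering-error fitness from -8 to -4.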

Crossover

One-point Crossover
- exchange whole blocks (i.e. centres)

Combining Crossover
- match the centres and combine them
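Both operators can be sketched on centre-based individuals (lists of coordinate tuples). The slide gives no detail on how centres are matched or combined, so the greedy nearest-centre matching and the averaging below are assumptions, as are the function names. The one-point variant assumes each parent has at least two centres.

```python
import random

def one_point_crossover(p1, p2, rng):
    """Swap tails at a cut between centres, so no centre is ever split."""
    cut = rng.randint(1, min(len(p1), len(p2)) - 1)
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def combining_crossover(p1, p2):
    """Pair each centre of p1 with its nearest unused centre of p2 and
    average the pair (matching and averaging are assumptions)."""
    unused = list(p2)
    child = []
    for c in p1:
        if not unused:
            child.append(c)  # no partner left: copy the centre unchanged
            continue
        mate = min(unused,
                   key=lambda d: sum((a - b) ** 2 for a, b in zip(c, d)))
        unused.remove(mate)
        child.append(tuple((a + b) / 2 for a, b in zip(c, mate)))
    return child

rng = random.Random(0)
parent_a = [(0.0, 0.0), (4.0, 4.0), (8.0, 0.0)]
parent_b = [(4.0, 2.0), (0.0, 2.0), (8.0, 2.0)]

child_1, child_2 = one_point_crossover(parent_a, parent_b, rng)
combined = combining_crossover(parent_a, parent_b)
```

Matching before combining matters because the order of centres in an individual is arbitrary: averaging centres position by position could blend two unrelated clusters.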

Mutation

One-point mutation, Biased one-point mutation
- One-point Mutation: $\vec{c}_{new} = \vec{x}_i$, where $i \leftarrow random(1, N)$
- Biased one-point Mutation: $\vec{c}_{new} = \vec{c}_{old} + \vec{\Delta}$, where $\vec{\Delta}$ is a random small vector

K-means mutation
- several steps of k-means clustering

Cluster addition, Cluster removal
- Cluster Addition: adds one centre
- Cluster Removal: removes a randomly selected centre
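The five mutation operators can be sketched as below. Several details are not stated on the slide and are assumptions here: which centre is mutated (one chosen at random), the size of the random vector (`scale`), the number of k-means steps, and how the added centre is chosen (a random data point). The function names are hypothetical.

```python
import random

def one_point_mutation(centres, data, rng):
    """c_new = x_i: replace one random centre with a random data point."""
    out = list(centres)
    out[rng.randrange(len(out))] = rng.choice(data)
    return out

def biased_one_point_mutation(centres, rng, scale=0.05):
    """c_new = c_old + delta, with delta a small random vector."""
    out = list(centres)
    j = rng.randrange(len(out))
    out[j] = tuple(v + rng.uniform(-scale, scale) for v in out[j])
    return out

def kmeans_mutation(centres, data, steps=3):
    """Run a few k-means iterations starting from the individual's centres."""
    out = [tuple(c) for c in centres]
    for _ in range(steps):
        groups = [[] for _ in out]
        for x in data:
            k = min(range(len(out)),
                    key=lambda j: sum((p - q) ** 2 for p, q in zip(x, out[j])))
            groups[k].append(x)
        # Empty clusters keep their old centre (assumption).
        out = [tuple(sum(col) / len(g) for col in zip(*g)) if g else c
               for g, c in zip(groups, out)]
    return out

def cluster_addition(centres, data, rng):
    """Add one centre, here a randomly chosen data point (assumption)."""
    return list(centres) + [rng.choice(data)]

def cluster_removal(centres, rng):
    """Remove one randomly selected centre, keeping at least one."""
    out = list(centres)
    if len(out) > 1:
        out.pop(rng.randrange(len(out)))
    return out

rng = random.Random(1)
data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centres = [(1.0, 1.0), (9.0, 1.0)]
```

Cluster addition and removal change the length of the individual, which is what lets the algorithm search over the number of clusters as well as their positions.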

Experiments

Goals
- demonstrate the performance of CGA
- compare variants of genetic operators

Data Sets
- 25 centres [Figure: scatter plot of the data, both axes from 0 to 0.9]
- vowels (UCI machine learning repository): 11 kinds of vowels, dimension 9, 990 examples
- mushrooms (UCI machine learning repository): 23 kinds of mushrooms, dimension 125, 8124 examples

Operators Comparison

Mutation                  25clusters   Vowels
1-point                   0.20         927.7
Biased 1-point            0.25         927.3
K-means                   0.26         940.7
1-point + Biased 1-pt     0.21         927.3
1-point + K-means         0.21         927.6
All                       0.22         927.3

Crossover                 25clusters   Vowels
1-point                   0.201        927.7
Combining                 0.222        927.4
Both                      0.202        927.4

Convergence Rate – Mutation
[Figure: Mutation Comparison. Fitness (from -1250 to -900) against iteration (0 to 25) for 1-point, biased 1-point, k-means, and 1-point + k-means mutation.]

Convergence Rate – Crossover
[Figure: Crossover Comparison. Fitness (from -1250 to -900) against iteration (0 to 14) for 1-point crossover, combining crossover, and both.]

Comparison to other clustering algorithms

Mushroom data set

method     accuracy
k-means    95.8%
CLARA      96.8%
CGA        97.3%
HCA        99.2%

25 centers
[Figure: two scatter plots, CGA and k-means, each showing the data and the centers found; both axes from 0 to 0.9.]

Estimating the number of clusters
Initial population: 2 to 15 centres
[Figure: number of centers (vertical axis, 12 to 26) plotted over a horizontal axis running from 0 to 80.]

Estimating the number of clusters
Initial population: 10 to 30 centres
[Figure: number of centers (vertical axis, 21 to 25.5) plotted over a horizontal axis running from 2 to 14.]

Conclusion

Summary
- Clustering Genetic Algorithm proposed
- several genetic operators proposed and compared
- CGA compared to available clustering algorithms
- estimating the number of clusters tested

Future work
- application of CGA to large data sets
- reducing time requirements, lazy evaluations, etc.
- applications

Thank you. Any questions?
