SI0506 - Data Clustering Using Flocking 1
Sathyanarayan Anand & Debasree Banerjee
Swarm Intelligence 2005-06
09.02.2006
SI0506 - Data Clustering Using Flocking 2
What is Data Clustering?
- Given a set of elements and a similarity measure between pairs of elements, find a grouping of the elements into clusters such that similar elements end up in the same cluster.
- Data element = Point in some high-dimensional space.
- Applications: Geographic information systems, pattern
recognition, medical imaging, marketing analysis, weather forecasting, etc.
SI0506 - Data Clustering Using Flocking 3
Related Work
- Hierarchical Algorithms: Break large clusters into smaller ones until the desired granularity is reached.
– Chameleon: Model-based splitting of clusters.
- Partitioning Algorithms: Move data between partitions to optimize some quality measure.
– K-means clustering
– Fuzzy c-means clustering
- Density-Based Algorithms.
– DBSCAN
- Swarm-Based Algorithms.
– Lumer-Faieta: Ant-colony based clustering.
SI0506 - Data Clustering Using Flocking 4
Flocking Rules
(As given by Craig Reynolds)
- Separation: steer to avoid crowding local flock mates. No two agents land on the same data point.
- Alignment: steer towards the average heading of local flock mates.
- Cohesion: steer to move toward the average position of local flock mates.
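The three rules can be made concrete with a minimal sketch of one flock update in 2-D. Everything specific here is an assumption for illustration and not taken from the slides: the function name `flocking_step`, the neighbourhood `radius`, the separation distance `min_dist`, the rule weights, and the speed cap `max_speed`.

```python
import math

def flocking_step(positions, velocities, radius=2.0, min_dist=0.5,
                  w_sep=1.5, w_ali=1.0, w_coh=1.0, max_speed=0.25):
    """Advance a 2-D flock by one synchronous step (illustrative weights)."""
    new_velocities = []
    for i, (p, v) in enumerate(zip(positions, velocities)):
        sep, ali, coh, n = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], 0
        for j, (q, u) in enumerate(zip(positions, velocities)):
            d = math.dist(p, q)
            if i == j or d > radius:
                continue  # only local flock mates influence agent i
            n += 1
            ali[0] += u[0]; ali[1] += u[1]      # accumulate headings
            coh[0] += q[0]; coh[1] += q[1]      # accumulate positions
            if d < min_dist:                    # too close: push away
                sep[0] += p[0] - q[0]; sep[1] += p[1] - q[1]
        if n:
            # Alignment: difference to the average heading of the neighbours.
            ali = [ali[0] / n - v[0], ali[1] / n - v[1]]
            # Cohesion: vector toward the average position of the neighbours.
            coh = [coh[0] / n - p[0], coh[1] / n - p[1]]
        vx = v[0] + w_sep * sep[0] + w_ali * ali[0] + w_coh * coh[0]
        vy = v[1] + w_sep * sep[1] + w_ali * ali[1] + w_coh * coh[1]
        speed = math.hypot(vx, vy)
        if speed > max_speed:                   # cap the flock speed
            vx, vy = vx / speed * max_speed, vy / speed * max_speed
        new_velocities.append((vx, vy))
    new_positions = [(p[0] + v[0], p[1] + v[1])
                     for p, v in zip(positions, new_velocities)]
    return new_positions, new_velocities
```

With these weights, two distant neighbours drift together (cohesion), while two agents closer than `min_dist` move apart (separation outweighs cohesion).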
SI0506 - Data Clustering Using Flocking 5
The Algorithm – Similarity Measures
- Used to determine whether two data points, a and b, belong to the same cluster.
– Euclidean distance: sqrt(sum_i (a_i – b_i)^2)
– Vector dot product: a.b = sum_i a_i b_i
– Penalized difference: abs(a – b).p, where p is a vector that denotes the importance of each attribute.
– Pearson's coefficient: cov(a, b) / (std(a) std(b))
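The similarity measures named on this slide could be sketched as follows. The function names are illustrative, and the Euclidean, dot-product, and Pearson definitions are the standard textbook ones, assumed here because the slide's own formulas did not survive extraction.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two attribute vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dot_product(a, b):
    """Vector dot product; larger means more similar for normalized data."""
    return sum(x * y for x, y in zip(a, b))

def penalized_difference(a, b, p):
    """abs(a - b) . p, where p weights the importance of each attribute."""
    return sum(abs(x - y) * w for x, y, w in zip(a, b, p))

def pearson_coefficient(a, b):
    """Pearson correlation between the attribute values of a and b."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("Pearson coefficient is undefined for constant vectors")
    return cov / math.sqrt(var_a * var_b)
```

Note that the first and third measures are distances (smaller means more similar), while the dot product and Pearson's coefficient are similarities (larger means more similar), so any threshold test has to flip direction accordingly.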
SI0506 - Data Clustering Using Flocking 6
The Algorithm - Procedure
- Initialize the flock randomly on the dataset.
- Repeat:
– Each agent performs local density-based clustering: if the density of points around a given point exceeds a given threshold, every point in the cluster takes the label of the point with the minimum label.
– Merge clusters belonging to different agents.
– The flock migrates to a new location; the migration is controlled by a defined flock speed.
- Flock memory: a location is not revisited until all other locations have been visited.
- Local clustering leads to the emergence of a global cluster pattern.
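A rough sketch of this procedure is given below, under simplifying assumptions that are not from the slides: the flock "migrates" by jumping to randomly chosen unvisited points instead of following the flocking rules at a given speed, density is a plain neighbour count within `radius`, and clusters from different agents are merged with a union-find structure. The names `flock_cluster`, `n_agents`, `radius`, and `min_pts` are hypothetical.

```python
import math
import random

def flock_cluster(points, n_agents=3, radius=1.5, min_pts=3, seed=0):
    """Label each point with the minimum point index in its cluster."""
    rng = random.Random(seed)
    parent = list(range(len(points)))  # every point starts with its own label

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            # The smaller index stays root, so the root is the minimum label.
            parent[max(ri, rj)] = min(ri, rj)

    # Flock memory: a location leaves this list once an agent has visited it.
    unvisited = list(range(len(points)))
    rng.shuffle(unvisited)
    while unvisited:
        # The flock settles on up to n_agents locations not yet visited.
        flock, unvisited = unvisited[:n_agents], unvisited[n_agents:]
        for agent in flock:
            neighbours = [j for j, q in enumerate(points)
                          if math.dist(points[agent], q) <= radius]
            if len(neighbours) >= min_pts:  # local density test
                for j in neighbours:
                    # Merging is implicit: clusters sharing a point are joined.
                    union(agent, j)
    return [find(i) for i in range(len(points))]
```

The sketch performs a single sweep over the dataset. Because each agent only ever looks at its own neighbourhood, the global labelling arises purely from overlapping local clusters.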
SI0506 - Data Clustering Using Flocking 7
The Algorithm – Proof of Convergence
- Markov process with state = centroid of flock.
– Centroid = data point that minimizes cumulative distance to all other points.
– Next state (centroid) depends only on current state.
- Irreducibility = any point can be reached from any other point.
- Ergodicity = the time taken to revisit a state is finite and the process is aperiodic. Ensured through flock memory.
- In the limit of infinite time, an irreducible and ergodic Markov process converges to a stationary distribution.
– Clustering becomes independent of the initial state.
– A similar proof is used in spectral clustering techniques.
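The independence from the initial state can be illustrated on a toy chain. The 3-state transition matrix `P` below is a hypothetical example, chosen to be irreducible and aperiodic. It is not the flock's actual centroid process, which the slides do not specify numerically.

```python
def step(dist, P):
    """One step of the chain: multiply the row vector dist by matrix P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

def run_chain(dist, P, iters=200):
    """Propagate an initial distribution through many transitions."""
    for _ in range(iters):
        dist = step(dist, P)
    return dist

# Hypothetical transition matrix; each row sums to 1.
P = [[0.50, 0.50, 0.00],
     [0.25, 0.50, 0.25],
     [0.00, 0.50, 0.50]]

from_first = run_chain([1.0, 0.0, 0.0], P)  # start surely in state 0
from_last = run_chain([0.0, 0.0, 1.0], P)   # start surely in state 2
```

Both runs approach the same distribution (0.25, 0.5, 0.25), which satisfies pi = pi P for this matrix, regardless of where the chain started.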
SI0506 - Data Clustering Using Flocking 8
The Algorithm - Limitations
- Density-based clustering is highly sensitive to the radius and density threshold parameters.
- The computational cost of creating an efficient data structure is exponential. It can be reduced using certain techniques.
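The sensitivity to the radius parameter can be seen on a tiny example. This is a sketch only: `count_clusters` is a hypothetical helper that uses a brute-force neighbour search and a simple neighbour-count density test, not the presentation's implementation.

```python
import math

def count_clusters(points, radius, min_pts):
    """Count clusters formed by joining every dense neighbourhood."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    clustered = set()
    for i in range(n):
        neighbours = [j for j in range(n)
                      if math.dist(points[i], points[j]) <= radius]
        if len(neighbours) >= min_pts:     # density threshold reached
            for j in neighbours:
                parent[find(j)] = find(i)  # join the neighbourhood
                clustered.add(j)
    return len({find(i) for i in clustered})

line = [(0, 0), (1, 0), (2, 0), (5, 0), (6, 0), (7, 0)]
results = [count_clusters(line, r, min_pts=2) for r in (0.5, 1.0, 3.0)]
```

On these six points, a radius of 0.5 finds no clusters, 1.0 finds two, and 3.0 merges everything into one, so the same data gives three different answers depending on one parameter.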
SI0506 - Data Clustering Using Flocking 9
Results – Synthetic Dataset
SI0506 - Data Clustering Using Flocking 10
Results – Zoo Dataset
SI0506 - Data Clustering Using Flocking 11
Results – Chameleon Dataset 1
SI0506 - Data Clustering Using Flocking 12
Results – Chameleon Dataset 2
SI0506 - Data Clustering Using Flocking 13
Results – Performance Comparison
Dataset coverage with respect to the number of agents in the flock.
SI0506 - Data Clustering Using Flocking 14
References
1. Zaiane O.R., Lee C.H. Clustering Spatial Data in the Presence of Obstacles: A Density-Based Approach. In International Database Engineering and Applications Symposium, IEEE, 17-19 July 2002, pp. 214-223.
2. Lumer E., Faieta B. Diversity and Adaptation in Populations of Clustering Ants. In Proceedings of the 3rd International Conference on Simulation of Adaptive Behavior: From Animals to Animats 3, pp. 501-508, 1994.
3. Ester M., Kriegel H.P., Sander J., Xu X. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In 2nd International Conference on Knowledge Discovery and Data Mining (KDD'96), Portland, Oregon. AAAI Press, 1996.
4. Karypis G., Han S., Kumar V. CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling. IEEE Computer, Special Issue on Data Analysis and Mining, 32(8), pp. 68-75, 1999.
5. Bradley P.S., Fayyad U., Reina C. Scaling Clustering Algorithms for Large Databases. In 4th International Conference on Knowledge Discovery and Data Mining (KDD'98), New York City. AAAI Press, 1998.
6. Höppner F. Speeding up Fuzzy c-Means: Using a Hierarchical Data Organization to Control the Precision of Membership Calculation. Fuzzy Sets and Systems, 128(3), pp. 365-378, 2002.
7. Newman D.J., Hettich S., Blake C.L., Merz C.J. UCI Repository of Machine Learning Databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science, 1998.