SLIDE 1

GRAVITATIONAL CLUSTERING OF THE SELF-ORGANIZING MAP

Nejc Ilc Andrej Dobnikar

University of Ljubljana Faculty of Computer and Information Science

ICANNGA 2011, Ljubljana

SLIDE 2

INTRODUCTION

  • Tools needed to deal with
      • data/web mining
      • huge (social) networks
      • gene expression data
      • image segmentation

2 ICANNGA, April 2011

Visualization of the Internet

Credits: Opte Project

Connections between neurons in human brain

Credits: Van J. Wedeen, M.D., MGH/Harvard U.

Heat map of gene expression profile

Credits: Manfred Gessler

Image segmentation

Credits: T. Riklin-Raviv, N. Sochen and N. Kiryati

SLIDE 11

CLUSTERING

  • unsupervised process of organizing data into "natural" groups
  • approaches
      • information theory
      • graphs
      • fuzzy logic
      • artificial neural networks

SLIDE 12

CLUSTERING WITH SOM

  • Self-Organizing Map [Kohonen, 1982]
  • Advantages
      • visualization of high-dimensional data
      • preserves topology and density of input data
  • Problem
      • SOM is not a "true" clustering method
      • more neurons than the expected number of clusters
  • How to group neurons into clusters?

SLIDE 13

CLUSTERING OF SOM

  • K-means, hierarchical [Vesanto & Alhoniemi, 2000]
  • Emergent SOM [Ultsch, 2007]
      • watershed algorithm
      • neurons > 1000
  • Surface flooding [Brugger et al., 2008]
      • automatically finds number of clusters

SLIDE 14

GSOM – THE IDEA

SLIDE 15

GSOM – LEVEL ONE

  • train SOM on input data
  • identify winning neurons
  • remove interpolating neurons

$n_j = [n_{j1}, n_{j2}, \ldots, n_{jE}]$
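The three level-one steps can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names `train_som` and `winning_neurons`, the Gaussian neighbourhood, and the linear decay schedules are all assumptions. "Interpolating neurons" is read here as neurons that are never the best-matching unit (BMU) for any input vector.

```python
import numpy as np

def train_som(data, rows, cols, n_epochs=20, lr0=0.5, seed=0):
    """Train a small rectangular SOM (hypothetical helper, illustrative defaults)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # Grid coordinates of every neuron, used by the neighbourhood function.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Initialise each prototype vector n_j from a randomly chosen input.
    weights = data[rng.integers(0, len(data), rows * cols)].copy()
    sigma0 = max(rows, cols) / 2.0
    n_steps = n_epochs * len(data)
    step = 0
    for _ in range(n_epochs):
        for x in data[rng.permutation(len(data))]:
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                    # assumed linear decay
            sigma = max(sigma0 * (1.0 - frac), 0.5)    # assumed linear shrink
            bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
            grid_d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
            h = np.exp(-grid_d2 / (2.0 * sigma ** 2))  # Gaussian neighbourhood
            weights += lr * h[:, None] * (x - weights)
            step += 1
    return weights

def winning_neurons(data, weights):
    """Keep only neurons that are the BMU of at least one input vector."""
    data = np.asarray(data, dtype=float)
    dists = np.linalg.norm(data[:, None, :] - weights[None, :, :], axis=2)
    bmu_of_sample = dists.argmin(axis=1)   # index of the BMU for every sample
    winners = np.unique(bmu_of_sample)     # neurons that won at least once
    return weights[winners], bmu_of_sample
```

The prototypes returned by `winning_neurons` are the points that the second level would then cluster.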

SLIDE 16

GSOM – LEVEL TWO

  • Gravitational clustering [Wright, 1977; Gomez et al., 2003]
  • BMU → mass point (m = 1)
  • "Move & merge" steps
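A "move & merge" loop in the spirit of gravitational clustering might look like the sketch below. It is not the update rule from the cited papers or from GSOM itself. The names `g`, `g_decay`, `merge_dist`, and `n_iter`, their defaults, the additive masses after merging, and the midpoint cap on each pairwise pull are all assumptions chosen to keep the example short and stable.

```python
import numpy as np

def _merge_close(pos, mass, labels, merge_dist):
    """Fuse the closest pair of particles until all are farther apart than merge_dist."""
    while len(pos) > 1:
        dist = np.linalg.norm(pos[None, :, :] - pos[:, None, :], axis=2)
        np.fill_diagonal(dist, np.inf)
        i, j = sorted(np.unravel_index(np.argmin(dist), dist.shape))
        if dist[i, j] > merge_dist:
            break
        # The merged particle sits at the centre of mass and carries both masses.
        pos[i] = (mass[i] * pos[i] + mass[j] * pos[j]) / (mass[i] + mass[j])
        mass[i] += mass[j]
        pos, mass = np.delete(pos, j, axis=0), np.delete(mass, j)
        labels[labels == j] = i      # points of particle j now belong to i
        labels[labels > j] -= 1      # close the gap left by the deleted row
    return pos, mass, labels

def gravitational_clustering(points, g=1e-4, g_decay=0.05, merge_dist=0.05, n_iter=100):
    """Return one cluster label per input point (illustrative parameters)."""
    pos = np.array(points, dtype=float)   # every point starts as a unit mass
    mass = np.ones(len(pos))
    labels = np.arange(len(pos))
    pos, mass, labels = _merge_close(pos, mass, labels, merge_dist)
    for _ in range(n_iter):
        if len(pos) == 1:
            break
        # Move: particle i is pulled towards every j with strength g * m_j / d^2.
        diff = pos[None, :, :] - pos[:, None, :]   # diff[i, j] = pos[j] - pos[i]
        dist = np.linalg.norm(diff, axis=2)
        np.fill_diagonal(dist, 1.0)                # avoids 0/0; diff is zero there
        pull = np.minimum(g * mass[None, :] / dist ** 2, dist / 2)  # cap at midpoint
        pos = pos + (pull[:, :, None] * diff / dist[:, :, None]).sum(axis=1)
        g *= 1.0 - g_decay                         # weaken gravity over time
        # Merge: particles that came close enough become one.
        pos, mass, labels = _merge_close(pos, mass, labels, merge_dist)
    return labels
```

Because gravity decays geometrically, the total distance that well-separated groups can travel is bounded, so the number of surviving particles serves as the detected number of clusters.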

SLIDE 17

EXPERIMENT

  • GSOM compared to
      • EM GMM [Dempster et al., 1977]
      • CS [Jenssen et al., 2003]
      • SOMkM [Vesanto & Alhoniemi, 2000]
  • datasets
      • 6 artificial (2D with complex shapes)
      • 3 real from UCI (Iris, Wine, LettersABC)
  • 100 runs of each algorithm; we measure:
      • Clustering Error (CE): minimal, average
      • elapsed time
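The slides do not define Clustering Error. A common reading, assumed here, is the fraction of points that remain misassigned under the best one-to-one matching between predicted and true cluster labels. The helper name `clustering_error` is hypothetical, and the brute-force search over permutations is only practical for a handful of clusters, as in these datasets.

```python
from itertools import permutations

def clustering_error(true_labels, pred_labels):
    """Fraction of points left unmatched by the best one-to-one label mapping."""
    true_ids = sorted(set(true_labels))
    pred_ids = sorted(set(pred_labels))
    # overlap[t][p] = number of points with true label t and predicted label p
    overlap = {t: {p: 0 for p in pred_ids} for t in true_ids}
    for t, p in zip(true_labels, pred_labels):
        overlap[t][p] += 1
    # Map the smaller label set injectively into the larger one.
    swap = len(true_ids) > len(pred_ids)
    small, large = (pred_ids, true_ids) if swap else (true_ids, pred_ids)
    best = 0
    for perm in permutations(large, len(small)):
        pairs = zip(perm, small) if swap else zip(small, perm)  # (true, pred) pairs
        best = max(best, sum(overlap[t][p] for t, p in pairs))
    return 1.0 - best / len(true_labels)

# A relabelled but otherwise perfect clustering has zero error.
print(clustering_error([0, 0, 1, 1], [1, 1, 0, 0]))
```

Under this definition, extra or missing clusters are penalised because their points cannot be matched to any true cluster.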

SLIDE 18

RESULTS – GIANT

  • EM GMM: CE = 0.0
  • GSOM: CE = 0.0
  • SOMkM: CE = 0.352
  • CS: CE = 0.219

SLIDE 19

RESULTS – WAVE

  • EM GMM: CE = 0.280
  • GSOM: CE = 0.0
  • SOMkM: CE = 0.126
  • CS: CE = 0.130

SLIDE 20

RESULTS – RANKS

Mean Rank

  • minimal CE
  • average CE
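As an illustration of how a mean rank is obtained, the sketch below ranks the four algorithms on the two datasets whose CE values appear on the Giant and Wave slides, then averages the ranks. The slides do not say whether those values are minimal or average CE. The mean ranks in the original figure cover all nine datasets and are not reproduced here. The helper `average_ranks` and the tie handling (tied values share their average rank) are assumptions.

```python
def average_ranks(scores):
    """Rank scores ascending (1 = best); tied values share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        tied_rank = (pos + end) / 2 + 1   # mean of the 1-based positions pos+1..end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = tied_rank
        pos = end + 1
    return ranks

algorithms = ["EM GMM", "GSOM", "SOMkM", "CS"]
# CE values as shown on the Giant and Wave result slides.
ce = {
    "Giant": [0.0, 0.0, 0.352, 0.219],
    "Wave": [0.280, 0.0, 0.126, 0.130],
}
per_dataset = [average_ranks(values) for values in ce.values()]
mean_rank = {
    name: sum(ranks[i] for ranks in per_dataset) / len(per_dataset)
    for i, name in enumerate(algorithms)
}
print(mean_rank)
```

On these two datasets alone GSOM has the lowest mean rank, since it ties for first on Giant and is first on Wave.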
SLIDE 21

RESULTS – ELAPSED TIME

  • Hepta (N = 212)
  • LettersABC (N = 1719)

SLIDE 22

RESULTS – NUMBER OF CLUSTERS

  • number of detected clusters

dataset       true number   GSOM
Giant         2             2
Hepta         7             7
Ring          2             4
Wave          2             2
Moon          4             4
Flag          3             3
Iris          3             3
Wine          3             3
LettersABC    3             7

SLIDE 25

GSOM - SUMMARY

+ finds clusters of complex shapes, including linearly non-separable ones
+ insensitive to unbalanced density of clusters
+ number of clusters detected automatically
+ uses topology relations (neighbourhood)
+ less computationally intensive
+ intuitive
– 8 parameters to adjust
– sometimes unstable behaviour

SLIDE 26

FUTURE WORK

  • implementing heuristics for setting parameters automatically
  • study of clustering ensembles based on GSOM
      • could the non-deterministic nature of GSOM be an advantage?
  • application of GSOM to clustering of gene expression data

SLIDE 27

DATASETS PROPERTIES

dataset       number of points   number of dimensions   number of clusters
Giant         862                2                      2
Hepta         212                2                      7
Ring          800                2                      2
Wave          293                2                      2
Moon          514                2                      4
Flag          640                2                      3
Iris          150                4                      3
Wine          178                13                     3
LettersABC    1719               16                     3

SLIDE 28

GSOM PARAMETERS SETTING

dataset       SOM size   SOM grid   𝐇        𝚬𝐇      α      p
Giant         13 x 11    rect.      0.0008   0.045   0.01   0.1
Hepta         9 x 8      rect.      0.0008   0.060   0.01   0.1
Ring          11 x 10    rect.      0.0008   0.045   0.01   0.1
Wave          14 x 12    rect.      0.0008   0.045   0.01   0.1
Moon          20 x 10    rect.      0.0008   0.045   0.01   0.0
Flag          14 x 9     rect.      0.0008   0.045   0.01   0.1
Iris          12 x 5     rect.      0.0008   0.045   0.01   0.1
Wine          7 x 5      rect.      0.0008   0.030   0.01   0.1
LettersABC    12 x 9     rect.      0.0010   0.030   0.01   0.1