SLIDE 1

An Algorithm for Sample and Data Dimensionality Reduction Using Fast Simulated Annealing

Szymon Łukasik, Piotr Kulczycki

Department of Automatic Control and IT, Cracow University of Technology; Systems Research Institute, Polish Academy of Sciences

7th International Conference on Advanced Data Mining and Applications

SLIDE 2

Motivation

  • It is estimated ("How Much Information?" project, Univ. of California, Berkeley) that 1 million terabytes of data is generated annually worldwide, with 99.997% of it available only in digital form.

  • It is commonly agreed that our ability to analyze new data is growing at a much lower pace than the capacity to collect and store it.

  • When examining huge data samples one faces both technical difficulties and the methodological obstacles of high-dimensional data analysis (the so-called "curse of dimensionality").


SLIDE 3

Curse of dimensionality - example

Source: K. Beyer et al., "When Is 'Nearest Neighbor' Meaningful?", in: Proc. ICDT, 1999.
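The figure originally shown here comes from the cited paper; its central observation is that for many data distributions the contrast between the nearest and the farthest neighbor of a query point vanishes as dimensionality grows. A minimal sketch reproducing this effect (assuming i.i.d. uniform data, one of the settings studied by Beyer et al., not necessarily the exact one plotted on the slide):

```python
import numpy as np

# Demonstrates distance concentration: with growing dimensionality, the
# relative contrast (d_max - d_min) / d_min between the farthest and the
# nearest neighbor of a query point shrinks toward zero.
rng = np.random.default_rng(0)
for dim in (2, 10, 100, 1000):
    data = rng.random((500, dim))                 # 500 i.i.d. uniform points
    query = rng.random(dim)
    dists = np.linalg.norm(data - query, axis=1)  # Euclidean distances to query
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"dim={dim:5d}  relative contrast={contrast:.3f}")
```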


SLIDE 4

Scope of our research

  • We have developed a universal unsupervised data dimensionality reduction technique, in some aspects similar to Principal Component Analysis (it is linear) and Multidimensional Scaling (it is distance-preserving). What is more, we try to reduce the data sample length at the same time.

  • Establishing the exact form of the transformation matrix is treated as a continuous optimization problem and solved by Parallel Fast Simulated Annealing.

  • The algorithm is intended to be used in conjunction with various data mining procedures, e.g. outlier detection, cluster analysis, and classification.


SLIDE 5

General description of the algorithm

  • Data dimensionality reduction is realized via the linear transformation

    W = A U,

    where U denotes the initial data set (an n × m matrix), A the transformation matrix (N × n), and W the transformed data matrix (N × m).

  • The transformation matrix is obtained using Parallel FSA. The cost function g(A) which is minimized is given by the raw Stress

    g(A) = Σ_{j=1}^{m} Σ_{k=j+1}^{m} ( ‖w_j(A) − w_k(A)‖_{ℝ^N} − ‖u_j − u_k‖_{ℝ^n} )²,

    with A being a solution of the optimization problem, and u_j, u_k, w_j(A), w_k(A) representing data instances in the initial and reduced feature spaces.
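The following is a minimal sketch (not the authors' code) of this cost function, using the conventions above: U holds the initial data set column-wise, A is a candidate transformation matrix, and W = A U its image.

```python
import numpy as np

def raw_stress(A, U):
    """Raw Stress g(A) between the initial and the reduced feature space."""
    W = A @ U                                     # reduced data, one column per instance
    # Pairwise Euclidean distances between columns in both spaces
    dU = np.linalg.norm(U[:, :, None] - U[:, None, :], axis=0)
    dW = np.linalg.norm(W[:, :, None] - W[:, None, :], axis=0)
    # Sum of squared distance deformations over unordered pairs j < k
    j, k = np.triu_indices(U.shape[1], k=1)
    return np.sum((dW[j, k] - dU[j, k]) ** 2)

# Example: reduce 5-dimensional data (m = 100 instances) to N = 2
rng = np.random.default_rng(0)
U = rng.normal(size=(5, 100))
A = rng.normal(size=(2, 5))
print(raw_stress(A, U))
```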


SLIDE 6

FSA neighbor generation strategy

[Figure: two side-by-side panels illustrating FSA neighbor generation in the (a1, a2) plane, with both axes spanning roughly −20 to 20.]

SLIDE 7

FSA temperature and termination criterion

  • The initial temperature T(0) is determined through a set of pilot runs consisting of k_P positive transitions from the starting solution. It is supposed to guarantee a predetermined initial level P(0) of the worse-solution acceptance probability resulting from the Metropolis rule.

  • The initial solution is obtained using the feature selection algorithm introduced by Pal & Mitra in 2004. It is based on feature space clustering, with similar features forming distinctive clusters. The maximal information compression index is used as the similarity measure. The partition itself is performed using the k-nearest neighbor rule (here with k = n − N).

  • The termination criterion is either executing an assumed number of iterations or fulfilling a customized condition based on the estimator of the global minimum employing order statistics, proposed recently for a class of stochastic random search algorithms by Bartkute and Sakalauskas (2009).
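A hedged sketch of one standard way to realize the temperature calibration described in the first bullet (the slides do not give the exact formula): record the cost increases of the pilot transitions and invert the Metropolis acceptance probability exp(−Δg/T) to hit the target level P(0).

```python
import numpy as np

def initial_temperature(pilot_deltas, P0=0.9):
    """Choose T(0) so that a worse solution with the average pilot cost
    increase is accepted with probability P0 under the Metropolis rule."""
    return -np.mean(pilot_deltas) / np.log(P0)

def metropolis_accept(delta_g, T, rng):
    """Always accept improvements; accept deteriorations with exp(-dg/T)."""
    return delta_g <= 0 or rng.random() < np.exp(-delta_g / T)
```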


SLIDE 8

FSA parallelization

[Diagram: the current global solution is passed to n_cores parallel FSA instances, each generating its own neighbor (Neighbor 1, Neighbor 2, …, Neighbor n_cores) and updating its own current solution; the new global solution is then made either the best improving or a random non-improving solution.]
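A sketch of the step depicted above, reusing the helpers from the earlier sketches (raw_stress, fsa_neighbor, metropolis_accept); the authors' actual implementation is not shown on the slides.

```python
def parallel_fsa_step(A, U, T, n_cores, rng):
    """One parallel FSA iteration: each worker proposes and Metropolis-tests
    a neighbor; pick the best improving candidate if any exists, otherwise
    a random accepted non-improving one."""
    g_current = raw_stress(A, U)
    accepted = []
    for _ in range(n_cores):                      # stands in for n_cores threads
        A_new = fsa_neighbor(A, T, rng)
        delta_g = raw_stress(A_new, U) - g_current
        if metropolis_accept(delta_g, T, rng):
            accepted.append((delta_g, A_new))
    improving = [c for c in accepted if c[0] < 0]
    if improving:
        return min(improving, key=lambda c: c[0])[1]          # best improving
    if accepted:
        return accepted[rng.integers(len(accepted))][1]       # random non-improving
    return A                                                  # no move accepted
```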

SLIDE 9

Sample size reduction

  • For each sample element u_i a positive weight p_i is assigned. It incorporates information about the relative deformation of the element's distances to other sample points. Data elements with higher weights can then be treated as more adequate. Weights are normalized to fulfill Σ_{j=1}^{m} p_j = 1.

  • Consequently, the weights can be used to improve the performance of data mining procedures, e.g. by introducing them into the definitions of classic data mining algorithms (e.g. k-means or k-nearest neighbor).

  • Alternatively, one can use the weights to eliminate some data elements from the sample. This is performed by removing the data elements whose associated weights fulfill the condition p_i < P, where P ∈ [0, 1], and then renormalizing the remaining weights. In this way one achieves simultaneous dimensionality and sample length reduction, with P serving as a data compression ratio.
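A hedged sketch of this weighting scheme: the slide states only that the weight p_i encodes the relative deformation of u_i's distances, so the concrete formula below (an inverse of the accumulated deformation) is an illustrative assumption, not the authors' definition.

```python
import numpy as np

def deformation_weights(U, W):
    """Assign each sample element a weight from its distance deformation."""
    dU = np.linalg.norm(U[:, :, None] - U[:, None, :], axis=0)
    dW = np.linalg.norm(W[:, :, None] - W[:, None, :], axis=0)
    deformation = np.abs(dW - dU).sum(axis=0)     # per-element total deformation
    p = 1.0 / (1.0 + deformation)                 # assumed: less deformed => heavier
    return p / p.sum()                            # normalize so sum_j p_j = 1

def reduce_sample(U, W, P):
    """Drop elements with p_i < P and renormalize the remaining weights."""
    p = deformation_weights(U, W)
    keep = p >= P
    return U[:, keep], p[keep] / p[keep].sum()
```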


SLIDE 10

Experimental evaluation

  • We have examined the performance of the algorithm by measuring the accuracy of outlier detection I_o (for artificially generated datasets), clustering I_c, and classification I_k (for selected benchmark instances from the UCI ML repository).

  • Outlier detection was performed using nonparametric statistical kernel density estimation. Using randomly generated datasets gave us the possibility to designate the actual outliers.

  • Clustering was implemented via the classic k-means algorithm; its accuracy was measured by the Rand index (in reference to class labels).

  • Classification accuracy (for the nearest-neighbor classifier) was measured by the average classification correctness obtained during a 5-fold cross-validation procedure.

  • Each test consisted of 30 runs; we report the mean and standard deviation of the above-mentioned indices. We compared our approach to PCA and Evolutionary Algorithm-based Feature Selection (by Saxena et al.).
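A hedged sketch of this evaluation protocol using common scikit-learn equivalents (the authors' own implementation is not shown): X holds the reduced data row-wise and y the class labels.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import rand_score
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def evaluate(X, y, n_clusters):
    """Clustering and classification accuracy on a (reduced) dataset."""
    # I_c: Rand index of k-means labels against the class labels
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    I_c = rand_score(y, labels)
    # I_k: mean correctness of a 1-NN classifier under 5-fold cross-validation
    I_k = cross_val_score(KNeighborsClassifier(n_neighbors=1), X, y, cv=5).mean()
    return I_c, I_k
```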

SLIDE 11

Example: seeds dataset (7D→2D)

[Figure: the seeds dataset projected from 7D to 2D — two scatter plots, one for our approach and one for PCA.]

SLIDE 12

More details – classification

|  | glass 9D→4D | wine 13D→5D | WBC 9D→4D | vehicle 18D→10D | seeds 7D→2D |
|---|---|---|---|---|---|
| Initial data: I_k,INIT ± σ(I_k,INIT) | 71.90 ±8.10 | 74.57 ±5.29 | 95.88 ±1.35 | 63.37 ±3.34 | 90.23 ±2.85 |
| Our approach (P=0.1): I_k,RED ± σ(I_k,RED) | 70.48 ±7.02 | 78.00 ±4.86 | 95.95 ±1.43 | 63.96 ±2.66 | 89.76 ±3.18 |
| PCA: I_k,RED ± σ(I_k,RED) | 58.33 ±6.37 | 72.00 ±7.22 | 95.29 ±2.06 | 62.24 ±3.84 | 83.09 ±7.31 |
| EA-based Feature Selection: I_k,RED ± σ(I_k,RED) | 64.80 ±4.43 | 72.82 ±1.02 | 95.10 ±0.80 | 60.86 ±1.51 | not tested |

SLIDE 13

More details – cluster analysis

|  | glass 9D→4D | wine 13D→5D | WBC 9D→4D | vehicle 18D→10D | seeds 7D→2D |
|---|---|---|---|---|---|
| Initial data: I_c,INIT | 68.23 | 93.48 | 66.23 | 64.18 | 91.06 |
| Our approach (P=0.2): I_c,RED ± σ(I_c,RED) | 68.43 ±0.62 | 92.81 ±0.76 | 66.29 ±0.62 | 64.62 ±0.24 | 89.59 ±1.57 |
| PCA: I_c,RED | 67.71 | 92.64 | 66.16 | 64.16 | 88.95 |

SLIDE 14

Conclusion

  • The algorithm was tested on numerous instances of outlier detection, cluster analysis, and classification problems and was found to offer promising performance. It results in accurate distance preservation, with the possibility of out-of-sample extension at the same time.

  • Drawbacks? It is not designed for huge datasets (due to the significant computational cost of evaluating the cost function) and shouldn't be used in situations where only a single data analysis task needs to be performed.

  • What can be done in the future? We observed that taking into account the topological deformation of the dataset in the reduced feature space (via the proposed weighting scheme) brings positive results in various data mining procedures. It can easily be extended to other DR techniques! The proposed approach could make algorithms that are very prone to the 'curse of dimensionality' practically usable (we have examined this in the case of KDE).


SLIDE 15

Thank you for your attention!

SLIDE 16

Short bibliography

1. H. Szu, R. Hartley: "Fast simulated annealing", Physics Letters A, vol. 122/3-4, 1987.
2. L. Ingber: "Adaptive simulated annealing (ASA): Lessons learned", Control and Cybernetics, vol. 25/1, 1996.
3. D. Nam, J.-S. Lee, C.H. Park: "N-dimensional Cauchy neighbor generation for the fast simulated annealing", IEICE Trans. Information and Systems, vol. E87-D/11, 2004.
4. S.K. Pal, P. Mitra: "Pattern Recognition Algorithms for Data Mining", Chapman and Hall, 2004.
5. V. Bartkute, L. Sakalauskas: "Statistical Inferences for Termination of Markov Type Random Search Algorithms", Journal of Optimization Theory and Applications, vol. 141/3, 2009.
6. P. Kulczycki: "Kernel Estimators in Industrial Applications", in: Soft Computing Applications in Industry, B. Prasad (ed.), Springer-Verlag, 2008.
7. A. Saxena, N.R. Pal, M. Vora: "Evolutionary methods for unsupervised feature selection using Sammon's stress function", Fuzzy Information and Engineering, vol. 2, 2010.