  1. Second Workshop on Software Challenges to Exascale Computing (13th and 14th December 2018, New Delhi). A presentation on "A review of dimensionality reduction in high-dimensional data using multi-core and many-core architecture" by Mr. Siddheshwar Vilas Patil, Ph.D. Research Scholar (QIP, AICTE Scheme), under the guidance of Prof. Dr. D. B. Kulkarni, Registrar & Professor in Information Technology, Walchand College of Engineering, Sangli, MH, India (A Government Aided Autonomous Institute)

  2. Outline
  • Introduction
  • Dimensionality Reduction
  • Literature Review
  • Challenges
  • Parallel Computing Approaches
  • Conclusion
  • References

  3. Introduction
  • Massive amounts of high-dimensional data are being generated.
  • Big Data: exponential growth and availability of data, characterized by the 3Vs (volume, velocity, variety).
  • This list was later extended with "Big Dimensionality" in Big Data.
  • The "Curse of Big Dimensionality" is driven by the explosion of features (thousands or even millions of features).
  • Early on, data scientists focused on the huge number of instances while paying less attention to the feature dimension.

  4. Big Dimensionality
  • Millions of dimensions

  5. Example: the libSVM database
  • In the 1990s, the maximum dimensionality was about 62,000.
  • In the 2000s it reached 16 million.
  • In the 2010s it reached 29 million.
  • In this new scenario it is common to deal with millions of features, so the existing learning methods need to be adapted.

  6. Summary of high-dimensional datasets

  7. Scalability
  • Scalability is defined as the effect that an increase in the size of the training set has on the computational performance of an algorithm: accuracy, training time, and allocated memory.

  8. Methods to perform DR
  • Missing values: drop features whose values are largely missing.
  • Low variance: consider a constant variable (all observations have the same value) in the data set; it cannot improve the power of the model because it has zero variance, so it can be removed.
  • High correlation: it is not good to keep multiple variables carrying similar information; a Pearson correlation matrix can identify the variables with high correlation (see the sketch below).
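A minimal sketch of the low-variance and high-correlation filters above, using pandas and scikit-learn on synthetic data; the variance and correlation thresholds are illustrative assumptions, not values from the presentation:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "constant": np.ones(100),          # zero variance: carries no information
    "noisy": rng.normal(size=100),
})
X["dup"] = X["noisy"] + rng.normal(scale=0.01, size=100)  # nearly duplicates "noisy"

# 1) Drop (near-)zero-variance features.
vt = VarianceThreshold(threshold=1e-6)                    # illustrative threshold
X_var = X.loc[:, vt.fit(X).get_support()]

# 2) Drop one feature from every highly correlated pair (Pearson |r| > 0.95).
corr = X_var.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
X_reduced = X_var.drop(columns=to_drop)

print(X_reduced.columns.tolist())                         # e.g. ['noisy']
```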

  9. Dimensionality Reduction: Feature Extraction
  • Feature extraction transforms the original features into a set of new features that are more compact and have stronger discriminating power.
  • Applications: image analysis, signal processing, and information retrieval.

  10. Dimensionality Reduction: Feature Selection
  • Feature selection removes irrelevant and redundant features.
  • Two features are redundant to each other if their values are completely correlated.
  • A feature is irrelevant if it contains no information useful for the data mining task at hand.
  • A feature is relevant if it contains some information about the target; removing it will decrease the accuracy of the classifier.

  11. Dimensionality Reduction Methods
  • Linear methods: Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Multidimensional Scaling (MDS), Non-negative Matrix Factorization (NMF), Lasso
  • Non-linear methods: Locally Linear Embedding (LLE), Isometric Feature Mapping (Isomap), Hilbert-Schmidt Independence Criterion (HSIC), Minimum Redundancy Maximum Relevancy (mRMR)
  • Autoencoders (linear as well as non-linear)
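For orientation, a brief sketch contrasting one linear method (PCA) and one non-linear method (Isomap) from the list above; scikit-learn, the digits dataset, and the target dimensionality of 10 are illustrative choices, not part of the presentation:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

X, _ = load_digits(return_X_y=True)       # 1797 samples, 64 features

# Linear DR: project onto the top 10 principal components.
X_pca = PCA(n_components=10).fit_transform(X)

# Non-linear DR: isometric feature mapping onto a 10-D manifold embedding.
X_iso = Isomap(n_components=10, n_neighbors=10).fit_transform(X)

print(X.shape, X_pca.shape, X_iso.shape)  # (1797, 64) (1797, 10) (1797, 10)
```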

  12. Feature selection methods
  • Individual evaluation, also known as feature ranking, assesses individual features by assigning them weights according to their degree of relevance.
  • Subset evaluation produces candidate feature subsets based on a certain search strategy; each candidate is compared with the previous best one with respect to the evaluation measure.
  • Individual evaluation is incapable of removing redundant features, because redundant features are likely to have similar rankings; subset evaluation can handle feature redundancy together with feature relevance.

  13. Feature Selection Steps
  • Feature selection is an optimization problem.
  • Step 1: search the space of possible feature subsets.
  • Step 2: pick the subset that is optimal or near-optimal with respect to some criterion.

  14. Feature Selection Steps (cont'd)
  • Search strategies: exhaustive, heuristic
  • Evaluation criteria: filter methods, wrapper methods

  15. Search Strategies
  • Assuming d features, an exhaustive search would require:
  – examining all possible subsets of size m;
  – selecting the subset that performs best according to the criterion.
  • Exhaustive search is usually impractical (see the count below).
  • In practice, heuristics are used to speed up the search.
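To see why exhaustive search is impractical, a quick count of the candidate subsets; the sizes d = 100 and m = 10 are made-up values for illustration:

```python
from math import comb

d, m = 100, 10                                   # illustrative problem size

# Subsets of exactly m features out of d: "d choose m".
print(comb(d, m))                                # 17_310_309_456_440 candidates

# If the subset size is not fixed, there are 2^d - 1 non-empty subsets.
print(sum(comb(d, k) for k in range(1, d + 1)))  # about 1.27e30
```

Every one of those candidates would have to be scored by the evaluation criterion, which is why heuristic strategies such as sequential selection (slide 19) are used instead.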

  16. Evaluation Strategies: Filter Methods
  • Evaluation is independent of the classification method.
  • The criterion evaluates feature subsets based on their class-discrimination ability (feature relevance), e.g. mutual information or correlation between the feature values and the class labels (see the sketch below).
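A minimal filter-method sketch: rank features by mutual information with the class labels, without involving any classifier. scikit-learn and the synthetic dataset are illustrative choices, not part of the presentation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=500, n_features=50,
                           n_informative=5, random_state=0)

# Filter: score every feature by mutual information with y, keep the top 10.
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_filtered = selector.fit_transform(X, y)

print(X_filtered.shape)                          # (500, 10)
print(np.argsort(selector.scores_)[::-1][:10])   # indices of the top-ranked features
```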

  17. Evaluation Strategies: Wrapper Methods
  • Evaluation uses criteria related to the classification algorithm.
  • To compute the objective function, a classifier is built for each tested feature subset and its generalization accuracy is estimated, e.g. by cross-validation (see the sketch below).
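A minimal wrapper-method sketch: the objective for a candidate subset is the cross-validated accuracy of a classifier trained on that subset alone. The dataset, the classifier, and the deliberately tiny search space are illustrative assumptions:

```python
from itertools import combinations

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

def subset_score(features):
    """Wrapper objective: 5-fold CV accuracy using only the given feature columns."""
    return cross_val_score(clf, X[:, list(features)], y, cv=5).mean()

# Score every pair drawn from the first 6 features (kept tiny on purpose).
best = max(combinations(range(6), 2), key=subset_score)
print(best, round(subset_score(best), 3))
```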

  18. Evaluation Strategies
  • Filter-based: Chi-squared, Information Gain, Correlation-based Feature Selection (CFS)
  • Wrapper methods: recursive feature elimination, sequential feature selection algorithms, genetic algorithms (an RFE sketch follows below)
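As one concrete wrapper example from the list above, recursive feature elimination (RFE) repeatedly fits a model and discards the weakest features until the desired number remains. A sketch with scikit-learn; the estimator, the scaling step, and the target of 10 features are illustrative choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)   # scale so coefficient magnitudes are comparable

# RFE: fit, drop the feature with the smallest |coefficient|, refit, repeat.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=10)
rfe.fit(X, y)

print(rfe.support_)    # boolean mask of the 10 selected features
print(rfe.ranking_)    # 1 = selected; larger ranks were eliminated earlier
```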

  19. Feature Ranking and Sequential Selection
  • Feature ranking: evaluate all d features individually using the criterion and select the top m features from this list.
  • Sequential forward selection (SFS), a heuristic search:
  – First, the best single feature is selected.
  – Then pairs are formed using one of the remaining features together with this best feature, and the best pair is selected.
  – Next, triplets are formed using one of the remaining features together with these two best features, and the best triplet is selected.
  – The procedure continues until a predefined number of features has been selected.
  • Wrapper methods (e.g. decision trees, linear classifiers) or filter methods (e.g. mRMR) can be used to score candidates.
  • Sequential backward selection (SBS) works in the opposite direction, starting from the full feature set and removing features one at a time (see the sketch below).
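A minimal SFS/SBS sketch using scikit-learn's SequentialFeatureSelector; the dataset, the k-NN scorer, and the target of 5 features are illustrative assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# SFS: start from the empty set and repeatedly add the single feature whose
# addition gives the best cross-validated score, until 5 features are chosen.
sfs = SequentialFeatureSelector(KNeighborsClassifier(), n_features_to_select=5,
                                direction="forward", cv=5)
sfs.fit(X, y)
print(sfs.get_support(indices=True))   # indices of the selected features

# direction="backward" gives sequential backward selection (SBS) instead.
```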

  20. Advantages of Dimensionality Reduction
  • Helps in data compression, and hence reduces storage space.
  • Reduces computation time.
  • Removes redundant and irrelevant features, if any.
  • Improves classification accuracy.

  21. Literature Review
  • Implementation of the Principal Component Analysis onto High-Performance Computer Facilities for Hyperspectral Dimensionality Reduction: Results and Comparisons
  • An Information Theory-Based Feature Selection Framework for Big Data Under Apache Spark
  • Ultra High-Dimensional Nonlinear Feature Selection for Big Biological Data

  22. Literature review: parallel dimensionality reduction implementations (1/2)
  • M. Yamada et al. [7]: Hilbert-Schmidt independence criterion lasso with least angle regression; parallel programming model: MapReduce framework (Hadoop and Apache Spark); hardware: Intel Xeon 2.4 GHz, 24 GB RAM (16 cores); datasets: P53, Enzyme
  • Z. Wu et al. [12]: principal component analysis; parallel programming model: MapReduce framework (Hadoop and Apache Spark), MPI; hardware: cloud-computing cluster (Intel Xeon E5630 CPUs, 8 cores, 2.53 GHz, 5 GB RAM, 292 GB SAS HDD) with 8 slave nodes (Intel Xeon E7-4807 CPUs, 12 cores, 1.86 GHz); datasets: AVIRIS Cuprite hyperspectral datasets
  • S. Ramirez-Gallego et al. [2]: minimum redundancy maximum relevance (mRMR); parallel programming model: MapReduce on Apache Spark, CUDA on GPGPU; hardware: cluster (18 computing nodes, 1 master node), each computing node with Intel Xeon E5-2620, 6 cores/processor, 64 GB RAM; datasets: Epsilon, URL, Kddb

  23. Literature review: parallel dimensionality reduction implementations (2/2)
  • E. Martel et al. [4]: principal component analysis; parallel programming model: CUDA on GPGPU; hardware: Intel Core i7-4790, 32 GB memory, NVIDIA GeForce GTX 680 GPU; datasets: hyperspectral data
  • J. Zubova et al. [13]: random projection; parallel programming model: MPI cluster; hardware: not reported; datasets: URL, Kddb
  • L. Zhao et al. [5]: distributed subtractive clustering; parallel programming model: cluster platforms; hardware: not reported; datasets: economic data (China)
  • S. Cuomo et al. [8]: singular value decomposition; parallel programming model: CUDA on GPGPU; hardware: Intel Core i7, 2.8 GHz, 8 GB RAM, NVIDIA Quadro K5000 GPU with 1536 CUDA cores; datasets: not reported
  • W. Li et al. [9]: isometric mapping (ISOMAP); parallel programming model: CUDA on GPGPU; hardware: Intel Core i7-4790, 3.6 GHz, 8 cores, 32 GB RAM, NVIDIA GTX 1080 GPU with 2560 CUDA cores and 8 GB RAM; datasets: hyperspectral image (HSI) datasets Indian Pines, Salinas, Pavia

  24. Challenges
  • Exponential growth in both dimensionality and sample size.
  • Existing algorithms do not always respond adequately when dealing with these new, extremely high dimensions.

  25. Challenges (cont'd)
  • Reducing data complexity is therefore crucial for data analysis tasks, knowledge inference using machine learning (ML) algorithms, and data visualization.
  • Example: feature selection for analyzing DNA microarrays, where there are many thousands of features and only a few tens to hundreds of samples.
