Ph.D. thesis: New approaches for improving data mining feature selection techniques. Supervised by: Pr. Y. Slimani, Pr. G. Goncalves. Elaborated by: M. A. Esseghir.


SLIDE 1

New approaches for improving data mining feature selection techniques

Ph.D. thesis

Supervised by:

  • Pr. Y. Slimani
  • Pr. G. Goncalves
  • M. T. Hsu

Elaborated by:

  • M. A. Esseghir

21/09/06

SLIDE 2

Outline

  • Introduction
  • Feature selection problem
  • Existing approaches
  • The proposed approaches
  • Research topics and perspectives
  • Conclusion

SLIDE 3

Introduction

  • The ability of machines to store ever-increasing volumes of data outpaces their ability to analyze them.
  • Applied data mining techniques:
      • Computational cost: depends on the number of features
      • Classification accuracy: affected by high dimensionality
  • Identification of representative features to build classification models.

SLIDE 4

Feature Selection (FS) problem

Definition: "Feature selection studies how to select a subset or list of attributes or variables that are used to construct models describing data." (Huan Liu, IEEE senior member)

Definition 2: A process that chooses an optimal subset of features according to a certain criterion.

Objectives:

  • Identify salient features.
  • Discard irrelevant, redundant, and noisy data.
  • Enhance the comprehensibility of the models.
  • Avoid model overfitting.
  • Improve classification and response-time (time and complexity) capabilities.
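The second definition can be read as an optimization problem. As a sketch, assume F is the full feature set with n features and J is the chosen evaluation criterion (both symbols are introduced here for illustration and do not appear on the slide):

```latex
S^{*} = \arg\max_{S \subseteq F} J(S), \qquad |F| = n
```

Under this reading the search space contains 2^n candidate subsets, so the choice of search strategy matters as much as the choice of J.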

SLIDE 5

Existing Approaches

Wrappers and Filters

  • Filters: select subsets using their general characteristics (intrinsic properties).
      • Search: forward and backward search based on one criterion.
      • Dependency measures
      • Information measures
      • Consistency measures
  • Wrappers: apply a learning algorithm to evaluate selected subsets.
      • Search: exhaustive, random, heuristic (GA, SA, HC, GrS).
      • Evaluation: ANN, ID3, C4.5, NB, SVM.
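The contrast between the two families can be illustrated with a minimal sketch. It assumes a dependency measure (absolute Pearson correlation with the label) for the filter, and a greedy forward search scored by leave-one-out 1-nearest-neighbour accuracy for the wrapper. The function names, the toy data, and the choice of classifier are all illustrative assumptions, not the methods used in the thesis.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return 0.0 if vx == 0 or vy == 0 else cov / sqrt(vx * vy)

def filter_select(X, y, k):
    """Filter: rank each feature alone by |correlation with the label|, keep the top k."""
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])

def loo_accuracy(X, y, subset):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier restricted to `subset`."""
    if not subset:
        return 0.0
    hits = 0
    for i, row in enumerate(X):
        nearest = min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: sum((row[f] - X[j][f]) ** 2 for f in subset),
        )
        hits += y[nearest] == y[i]
    return hits / len(X)

def wrapper_forward_select(X, y):
    """Wrapper: greedy forward search, adding the feature that most improves accuracy."""
    remaining = set(range(len(X[0])))
    selected, best = [], 0.0
    while remaining:
        scored = [(loo_accuracy(X, y, selected + [f]), f) for f in sorted(remaining)]
        acc, f = max(scored, key=lambda t: t[0])  # lowest index wins ties
        if acc <= best:  # stop when no candidate improves the score
            break
        selected.append(f)
        remaining.remove(f)
        best = acc
    return sorted(selected), best

# Toy data: feature 0 is informative, feature 1 is noise, feature 2 duplicates feature 0.
X = [[0.1, 5, 0.1], [0.2, 1, 0.2], [0.15, 3, 0.15],
     [0.9, 4, 0.9], [0.8, 2, 0.8], [0.95, 6, 0.95]]
y = [0, 0, 0, 1, 1, 1]

print(filter_select(X, y, 2))        # the filter keeps the redundant copy
print(wrapper_forward_select(X, y))  # the wrapper stops after one feature
```

On this toy data the filter scores features one at a time and therefore keeps both copies of the informative feature, while the wrapper evaluates subsets and stops once a second feature no longer improves accuracy.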

SLIDE 6

FS process

[Figure: FS process diagram with exploration and validation stages. Source: Huan Liu]

SLIDE 7

Existing Approaches (2)

Filters, advantages:

  • Simple to implement
  • Low search cost, O(N^2)

Filters, drawbacks:

  • Lower performance
  • Independent criterion
  • One feature at a time

Wrappers, advantages:

  • High-quality subsets
  • Improves classification
  • All features are considered

Wrappers, drawbacks:

  • Exponential search space (2^n)
  • High evaluation cost
  • Not suited to large data sets
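The gap between the two cost figures can be made concrete by counting subset evaluations. The sketch below assumes the O(N^2) figure refers to a sequential search that adds one feature per step, and that the 2^n figure refers to an exhaustive search over all non-empty subsets. Both function names are hypothetical.

```python
def sequential_evaluations(n):
    """Subsets scored by a greedy forward pass that adds one feature per step."""
    # Step k scores each of the (n - k) features not yet selected.
    return sum(n - k for k in range(n))  # n + (n - 1) + ... + 1 = n(n + 1) / 2

def exhaustive_evaluations(n):
    """Non-empty subsets an exhaustive search would have to score."""
    return 2 ** n - 1

for n in (10, 20, 40):
    print(n, sequential_evaluations(n), exhaustive_evaluations(n))
```

For 40 features the sequential pass scores 820 subsets, whereas the exhaustive search would need more than 10^12 evaluations, each of which trains a classifier in the wrapper setting.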

SLIDE 8

The proposed approaches

  • Genetic Algorithm (GA)
      • Standard
      • Memetic algorithms: hybrid global + local search
  • Parallel FS for high-dimensional data
      • Island model
      • Multi-agent system
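The memetic idea (a global genetic search whose offspring are refined by a local search) can be sketched as follows. Everything specific here is an assumption made for illustration: the bit-mask encoding, the toy `fitness` function standing in for a wrapper evaluation, the `RELEVANT` set, truncation selection, and the parameter values. None of it is taken from the thesis.

```python
import random

RELEVANT = {0, 3, 5}  # hypothetical "truly useful" features (toy assumption)
N_FEATURES = 8

def fitness(mask):
    """Toy stand-in for a wrapper evaluation: reward relevant features, penalise size."""
    hits = sum(1 for i, bit in enumerate(mask) if bit and i in RELEVANT)
    return hits - 0.1 * sum(mask)

def local_search(mask):
    """Hill climbing: apply improving single-bit flips until none is left."""
    mask = list(mask)
    improved = True
    while improved:
        improved = False
        for i in range(len(mask)):
            candidate = mask[:i] + [1 - mask[i]] + mask[i + 1:]
            if fitness(candidate) > fitness(mask):
                mask, improved = candidate, True
    return mask

def memetic_ga(pop_size=10, generations=15, p_mut=0.1, seed=0):
    """Genetic search over feature masks with local refinement of each child."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(N_FEATURES)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]          # truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, N_FEATURES)  # one-point crossover
            child = a[:cut] + b[cut:]
            child = [1 - g if rng.random() < p_mut else g for g in child]  # mutation
            children.append(local_search(child))  # memetic step
        pop = parents + children
    return max(pop, key=fitness)

best = memetic_ga()
print(best, fitness(best))
```

The toy criterion is separable, so the local search alone already reaches the best mask; a real wrapper criterion has feature interactions, which is where the combination of global and local search is expected to pay off. In an island model, several such populations would evolve in parallel and exchange individuals periodically.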

SLIDE 9

The proposed approaches (2)

  • Ant Colony Optimizer (ACO):
      • AS and ACS adaptation:
          • Complete graph
          • Nodes correspond to attributes
          • Polarized edges
  • Hybrid search:
      • Combining wrappers and filters
      • Correlation-guided search
      • Discarding redundant features.
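The filter half of such a hybrid can be sketched as a redundancy check that runs before any wrapper evaluation. The sketch assumes Pearson correlation as the redundancy measure and a threshold of 0.95; the slide names neither, and `drop_redundant` is a hypothetical helper.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return 0.0 if vx == 0 or vy == 0 else cov / sqrt(vx * vy)

def drop_redundant(columns, threshold=0.95):
    """Keep a feature only if |r| with every already-kept feature is below the threshold."""
    kept = []
    for j, col in enumerate(columns):
        if all(abs(pearson(col, columns[k])) < threshold for k in kept):
            kept.append(j)
    return kept

# Toy feature columns: column 1 is column 0 scaled by 2, column 2 is weakly correlated.
columns = [
    [1, 2, 3, 4, 5],
    [2, 4, 6, 8, 10],
    [5, 1, 4, 2, 3],
]
print(drop_redundant(columns))  # indices of the features passed on to the wrapper
```

A wrapper search would then explore only the kept features, which shrinks the 2^n search space before any classifier is trained.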

SLIDE 10

Research topics and perspectives

  • New feature selection search strategies, based on metaheuristic adaptations, such as:
      • Multi-agent genetic algorithms
      • Ant colony optimization (ACO)
      • Particle swarm optimizer (PSO)
      • Cultural algorithms.
  • Improving evaluation quality: multi-objective optimization.
  • Parallelization, distribution, load balancing, integration into a common framework (DM grid service)

SLIDE 11

Research topics (2)

  • Hybridization of wrapper and filter approaches.
  • New feature selection approaches for unsupervised classification.

SLIDE 12

Conclusion

  • FS is a multidisciplinary research topic:
      • Statistics; optimization; data mining
  • FS is an essential KDD step to face new data mining challenges.
      • High dimensionality, biological data, streaming data mining
  • FS poses new challenges to the data mining community.
      • New efficient search strategies, hybrid strategies.
