Feature Selection for Predictive Modelling
A Needle in a Haystack Problem
Munshi Imran Hossain Sudipta Basu
Introduction
Suppose someone is studying heart diseases. They want to find out the possible factors that may cause heart diseases.
common sense …
6/4/18 Cytel Inc. 2
Features
Needle in a Haystack
Finding a relevant set of solutions is akin to finding a needle in a haystack!
banking, insurance, telecoms, manufacturing, healthcare, etc.
Predictive Modelling
predict output on observations that it has not encountered.
the model.
variables.
Feature Selection.
Question
Can we find a needle in a haystack in a time- and cost-effective manner?
Answer
The answer is:
problem
Genetic Algorithms
Genetic Algorithms are algorithms inspired by ideas from Natural Selection and Evolutionary Biology.
it can be used to solve optimization problems
Application Areas
Applicable to a wide variety of problems. Some areas are:
A Schematic Genetic Algorithm
Initialization
the problem at hand
Initialization Selection Crossover Mutation Evaluation Convergence Termination
Evaluation
Fitness = |E(Guess Solution) - Actual Values|
[Figure: Evolution Environment, Evaluation, Fitness Value]
Selection
using a stochastic process, say “Roulette Wheel Sampling”.
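
Roulette-wheel sampling can be sketched in a few lines. This is only an illustration, assuming that higher fitness is better and that all fitness values are non-negative; the function name `roulette_wheel_select` and its parameters are hypothetical and do not come from the slides.

```python
import bisect
import random
from itertools import accumulate


def roulette_wheel_select(population, fitness, k, rng=random):
    """Draw k chromosomes with probability proportional to fitness.

    Assumes every fitness value is non-negative and at least one is positive.
    """
    cumulative = list(accumulate(fitness))  # running totals mark the wheel slots
    total = cumulative[-1]
    if total <= 0:
        raise ValueError("at least one fitness value must be positive")
    selected = []
    for _ in range(k):
        spin = rng.random() * total  # position where the wheel stops
        # index of the first slot whose upper boundary lies beyond the spin
        idx = min(bisect.bisect_right(cumulative, spin), len(population) - 1)
        selected.append(population[idx])
    return selected
```

Sampling is done with replacement, so a fit chromosome can be selected more than once while an unfit one may not be selected at all.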
Selection
[Figure: First Generation and Selected Chromosomes, with chromosomes labelled C1 to C6]
Crossover (Recombination)
between two parent chromosomes to generate new chromosomes.
binary numbers.
a common location.
till the new generation has N chromosomes.
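
A minimal sketch of crossover at a common location, assuming chromosomes are equal-length lists of 0s and 1s with at least two bits. The helper names `single_point_crossover` and `next_generation`, and the random pairing of parents, are assumptions made for illustration.

```python
import random


def single_point_crossover(parent_a, parent_b, rng=random):
    """Cut both parents at one common location and swap the tails."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    # cut strictly inside the chromosome so each offspring mixes both parents
    point = rng.randint(1, len(parent_a) - 1)
    offspring_1 = parent_a[:point] + parent_b[point:]
    offspring_2 = parent_b[:point] + parent_a[point:]
    return offspring_1, offspring_2


def next_generation(selected, n, rng=random):
    """Cross randomly paired parents until the new generation has n chromosomes."""
    offspring = []
    while len(offspring) < n:
        parent_a, parent_b = rng.sample(selected, 2)
        offspring.extend(single_point_crossover(parent_a, parent_b, rng))
    return offspring[:n]
```

Each pair of parents yields two offspring, so the last pair is truncated when N is odd.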
Crossover (Recombination)
[Figure: crossover of Parent A and Parent B producing Offspring 1 and Offspring 2]
Mutation
mutation with a low probability.
to a 0 and vice versa.
diversity in the population for generating new solutions in future generations.
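
Bit-flip mutation can be sketched as below, assuming a chromosome is a list of 0s and 1s and each bit mutates independently. The function name `mutate` and the `rate` parameter are hypothetical.

```python
import random


def mutate(chromosome, rate, rng=random):
    """Flip each bit (1 to 0, 0 to 1) independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in chromosome]
```

With `rate = 1 / len(chromosome)`, one bit per chromosome flips on average, which keeps mutation rare while still introducing new bit patterns.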
Convergence
Generation: The processes of selection, crossover and mutation result in new chromosomes that belong to the Second Generation.
Compare the Second Generation with the First Generation for convergence.
Termination
Repeat the steps from Evaluation to Convergence by creating subsequent generations until termination.
The highest fitness values of two subsequent generations remain the same (within a certain tolerance).
reached.
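
The generation loop and its stopping rules might look like the sketch below. It assumes higher fitness is better. The first criterion follows the tolerance rule on this slide; the second, a cap on the number of generations, is an assumption because the slide text for it is incomplete. The names `run_ga`, `fitness_fn` and `evolve_fn` are hypothetical.

```python
def run_ga(population, fitness_fn, evolve_fn, max_generations=100, tol=1e-6):
    """Create new generations until the best fitness stalls or the cap is hit.

    fitness_fn(chromosome) -> float, higher is better (assumption)
    evolve_fn(population)  -> next population (selection, crossover, mutation)
    """
    best_prev = max(fitness_fn(c) for c in population)
    for generation in range(1, max_generations + 1):
        population = evolve_fn(population)
        best = max(fitness_fn(c) for c in population)
        # criterion 1: best fitness of two subsequent generations is the same
        if abs(best - best_prev) <= tol:
            return population, best, generation, "converged"
        best_prev = best
    # criterion 2 (assumed): maximum number of generations reached
    return population, best_prev, max_generations, "max generations reached"
```

Stopping after a single stalled generation can end the search early on a noisy fitness function; requiring several stalled generations in a row is a common variation.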
Example: Problem Statement
monitor breathing.
the set of features that will be used to build a linear discriminant analysis (LDA) model.
a subject is normal or labored.
Example: Simulation Parameters
health parameters of subjects. Number of subjects is a little over 1500.
80:20. 10,000 such splits were made randomly.
Example: Area Under ROC Curve
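
For reference, the area under the ROC curve equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it by direct pairwise comparison. It assumes labels are coded 1 (positive) and 0 (negative); the slides do not state how the authors computed AUC, and the function name `auc` is hypothetical.

```python
def auc(labels, scores):
    """Area under the ROC curve by pairwise comparison (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half a win
    return wins / (len(positives) * len(negatives))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation. The pairwise loop is quadratic in the number of cases, so it is suited to illustration rather than large data sets.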
Example: GA Operators
GA Parameters            Values
Population size          100 chromosomes
Number of generations    100
Evaluation               Median AUC over 10,000 simulations
Selection                Roulette-Wheel Sampling
Crossover                Single Point Crossover
Mutation                 1 / 27
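
One way to picture the setup is a population of bit masks over the candidate features. The sketch below assumes 27 candidate features, inferred from the 1 / 27 mutation rate and not stated in the slides, with one bit per feature. The names `random_population` and `selected_features` are hypothetical.

```python
import random

N_FEATURES = 27        # assumption: inferred from the 1 / 27 mutation rate
POPULATION_SIZE = 100  # population size from the GA parameter table


def random_population(size, n_features, rng=random):
    """Create `size` random bit masks; bit i == 1 means feature i is used."""
    return [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(size)]


def selected_features(chromosome):
    """Indices of the features switched on in a chromosome."""
    return [i for i, bit in enumerate(chromosome) if bit == 1]
```

Under this encoding, evaluating a chromosome means fitting the model on the columns returned by `selected_features` and scoring it.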
Example: Performance
Advantages
Independent of Calculus:
condition is difficult to satisfy for objective functions for many problems.
not suitable, GA can be useful for optimization.
Advantages
Flexibility:
heuristics may be employed to make the calculation faster or more robust.
The speciation heuristic penalizes crossover between candidate solutions that are too similar. This encourages population diversity and helps prevent premature convergence to a suboptimal solution.
Limitations and Solutions
Repeated Fitness Function Evaluation:
prohibitive.
single function evaluation may require several hours to several days
Solution:
approximated fitness.
promising approaches to convincingly use GA to solve complex real-life problems.
Limitations and Solutions
Scalability:
search space size.
engine, a house or plane.
Solution:
instead of detailed construction plans, and airfoils instead of whole aircraft designs.
Conclusion
methods are sometimes not agile enough to solve the problem.
infrastructures, genetic algorithms can be applied for solving problems in reasonable time.
Any Questions?