SLIDE 1

Feature Selection for Predictive Modelling

A Needle in a Haystack Problem

Munshi Imran Hossain Sudipta Basu

SLIDE 2

Introduction

  • Suppose someone is studying heart diseases.
  • They want to find out which factors may cause heart diseases.
  • Let’s try to imagine some of these factors from common sense …

6/4/18 Cytel Inc. 2

SLIDE 3

Features

  • Weight
  • Smoking Habits
  • Food Habits
  • Drinking Habits
  • Genetic Traits
  • Bad Managers
  • Extra Marital Affairs
  • Irregular Sleep Patterns
  • Stock Market Fluctuation
  • Daughter’s Boyfriend
  • In Laws
  • Unhappy Married Life
  • Unsafe neighborhood
  • Unemployment

SLIDE 4

An explosion of information

SLIDE 5

Needle in a Haystack

  • Find a relevant set of solutions.
  • The solution space contains well over a trillion combinations.

Finding a relevant set of solutions is akin to finding a needle in a haystack!

  • In an era of Big Data, this is a common problem in any field: banking, insurance, telecoms, manufacturing, healthcare, etc.
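
To give a feel for how quickly the number of combinations grows, here is a minimal sketch that counts the non-empty subsets of n candidate features, which is 2^n - 1. The function name and the choice of 14 and 40 are illustrative assumptions (14 is the length of the list on the Features slide, 100 comes from the example later in the deck); the slides do not say how the "trillion" figure was obtained.

```python
def count_feature_subsets(n_features: int) -> int:
    """Number of non-empty subsets that can be formed from n_features features."""
    return 2 ** n_features - 1


# Print the size of the search space for a few feature counts.
for n in (14, 40, 100):
    print(n, count_feature_subsets(n))
```

Under this counting, 40 candidate features are already enough to exceed a trillion subsets.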

SLIDE 6

Predictive Modelling

  • Model Fitting: Training data is used to fit a model. The model is then used to predict the output for observations that it has not encountered.
  • Model Accuracy: Test data is used to compute the accuracy of the model.
  • Features of the model: Each such model has some independent variables, called features.
  • Feature Selection: Choosing which of these variables to include in the model is called the problem of Feature Selection.

SLIDE 7

Question

The question is:

Can we find a needle in a haystack in a time- and cost-effective manner?

SLIDE 8

Answer

The answer is:

YES!

  • There are lots of algorithms available which solve this problem.
  • One such group of algorithms: Genetic Algorithms

SLIDE 9

Genetic Algorithms

  • Genetic algorithms are numerical optimization algorithms that are inspired by ideas from Natural Selection and Evolutionary Biology.
  • The method is quite generic, which means that it can be used to solve optimization problems of a wide range.

SLIDE 10

Application Areas

Applicable to a wide variety of problems. Some areas are:

  1. Prediction of three-dimensional protein structure
  2. Automatic evolution of computer software
  3. Training and designing artificial neural networks
  4. Image processing
  5. Job shop scheduling

SLIDE 11

A Schematic Genetic Algorithm

SLIDE 12

Initialization

  • The algorithm begins with a population of N solutions.
  • These solutions are also called chromosomes.
  • First Generation: the first round of solutions to the problem at hand.
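
As a minimal sketch of this step, the first generation can be drawn at random. The encoding below (a list of 0/1 genes, which for feature selection could mean "feature excluded/included"), the function name, and the sizes are assumptions for illustration; the slides only say that chromosomes are binary strings in the schematic.

```python
import random


def init_population(pop_size, n_genes, rng):
    """Create pop_size random binary chromosomes, each with n_genes bits."""
    return [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]


rng = random.Random(0)  # seeded only so that the sketch is repeatable
population = init_population(pop_size=6, n_genes=10, rng=rng)
for chromosome in population:
    print(chromosome)
```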


Initialization Selection Crossover Mutation Evaluation Convergence Termination

SLIDE 13

Evaluation

  • Each solution is applied to the problem.
  • A fitness value is evaluated for the solution.
  • In finding a root of an equation E:

Fitness = |E(Guess Solution) - Actual Value|
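
The root-finding example can be sketched as follows. The equation E(x) = x^2 - 2 is a hypothetical choice, and the target value 0 is what "actual value" means for a root. Note that the quantity on the slide is a distance, so smaller is better; the `fitness` transform below is an assumed way to turn it into a score where larger is better, which is not stated in the slides.

```python
def E(x):
    """Hypothetical example equation E(x) = x**2 - 2, with a root at sqrt(2)."""
    return x * x - 2.0


def root_error(guess, target=0.0):
    """|E(guess) - target|: zero exactly when guess is a root of E."""
    return abs(E(guess) - target)


def fitness(guess):
    """Assumed transform: maps an error of 0 to 1.0 and larger errors toward 0."""
    return 1.0 / (1.0 + root_error(guess))


for guess in (1.0, 1.4, 3.0):
    print(guess, root_error(guess), fitness(guess))
```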


[Figure labels: Evolution Environment; Evaluation; Fitness Value]

SLIDE 14

Selection

  • Select the n best chromosomes for the next stage using a stochastic process, say “Roulette Wheel Sampling”.
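
A minimal sketch of roulette wheel sampling: each chromosome is drawn with probability proportional to its fitness, with replacement, so fitter chromosomes tend to appear more than once. The chromosome labels and fitness values are made up for illustration, and drawing with replacement is an assumption.

```python
import random


def roulette_wheel_select(population, fitnesses, n, rng):
    """Draw n chromosomes with replacement, with probability proportional to fitness."""
    if any(f < 0 for f in fitnesses):
        raise ValueError("roulette wheel sampling needs non-negative fitness values")
    return rng.choices(population, weights=fitnesses, k=n)


rng = random.Random(1)
population = ["C1", "C2", "C3", "C4", "C5", "C6"]
fitnesses = [4.0, 2.0, 0.0, 3.0, 0.5, 5.0]  # made-up values; C3 can never be drawn
selected = roulette_wheel_select(population, fitnesses, n=6, rng=rng)
print(selected)
```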

SLIDE 15

Selection


[Figure: chromosomes C1 to C6 of the First Generation and the Selected Chromosomes drawn from them]

SLIDE 16

Crossover (Recombination)

  • Selected chromosomes are used for crossover.
  • It is a process in which information is exchanged between two parent chromosomes to generate new chromosomes.
  • In our schematic, chromosomes are strings of binary digits.
  • Create offspring by cleaving the chromosomes at a common location.
  • Continue till the new generation has N chromosomes.
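
Cleaving two parents at a common location and swapping the tails can be sketched like this. The function name, the parent bit patterns, and the rule that the cut never falls at either end are assumptions for illustration.

```python
import random


def single_point_crossover(parent_a, parent_b, rng):
    """Cut both parents at the same random point and swap the tails."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    point = rng.randint(1, len(parent_a) - 1)  # both pieces stay non-empty
    child_1 = parent_a[:point] + parent_b[point:]
    child_2 = parent_b[:point] + parent_a[point:]
    return child_1, child_2


rng = random.Random(2)
parent_a = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
parent_b = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
child_1, child_2 = single_point_crossover(parent_a, parent_b, rng)
print(child_1)
print(child_2)
```

Each position of the two children holds the same pair of bits as the two parents, so no information is created or lost, only recombined.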

SLIDE 17

Crossover (Recombination)


[Figure: crossover of binary chromosomes Parent A and Parent B producing Offspring 1 and Offspring 2]

SLIDE 18

Mutation

  • New offspring chromosomes are subjected to mutation with a low probability.
  • Flip a 1 at a particular location in a chromosome to a 0 and vice versa.
  • Maintain Genetic Diversity: keep sufficient diversity in the population for generating new solutions in future generations.
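
A minimal sketch of bit-flip mutation. Flipping every bit independently with the same probability is an assumption; the slides only say that mutation happens with a low probability. The rate of 0.1 is an arbitrary illustrative value.

```python
import random


def mutate(chromosome, rate, rng):
    """Return a copy of chromosome in which each bit is flipped with probability rate."""
    return [1 - gene if rng.random() < rate else gene for gene in chromosome]


rng = random.Random(3)
offspring = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
print(mutate(offspring, rate=0.1, rng=rng))
```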

SLIDE 19

Convergence

  • Second Generation: The processes of selection, crossover and mutation result in new chromosomes that belong to the Second Generation.
  • Compare the highest fitness value of the Second Generation with that of the First Generation for Convergence.

SLIDE 20

Termination

  • Repeat the steps from Evaluation to Convergence by creating subsequent generations until termination.
  • Some common termination conditions are:
      • Convergence: Highest fitness values of two subsequent generations remain the same (within a certain tolerance).
      • Fixed number of generations reached.
      • Allocated budget (computation time/money) reached.
      • Combinations of the above.
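
Putting the steps together, the loop below is a minimal sketch of a genetic algorithm that stops on either convergence or a fixed number of generations. Everything specific in it is an assumption for illustration: the toy fitness `one_max` (count of 1 bits), the parameter values, and the `patience` argument, which generalizes the slide's rule (`patience=1` compares two subsequent generations).

```python
import random


def one_max(chromosome):
    """Toy fitness, assumed for illustration: the number of 1 bits."""
    return sum(chromosome)


def run_ga(fitness, n_genes, pop_size=20, max_generations=50,
           tolerance=0.0, patience=5, mutation_rate=0.05, seed=0):
    rng = random.Random(seed)
    # Initialization: random binary chromosomes.
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    best_history = []
    for _ in range(max_generations):  # termination: fixed number of generations
        scores = [fitness(c) for c in population]  # Evaluation
        best_history.append(max(scores))
        # Termination: best fitness unchanged (within tolerance) over `patience` generations.
        if (len(best_history) > patience and
                abs(best_history[-1] - best_history[-1 - patience]) <= tolerance):
            break
        weights = [s + 1e-9 for s in scores]  # keeps the weights from being all zero
        next_population = []
        while len(next_population) < pop_size:
            a, b = rng.choices(population, weights=weights, k=2)  # Selection
            point = rng.randint(1, n_genes - 1)                   # Crossover
            for child in (a[:point] + b[point:], b[:point] + a[point:]):
                # Mutation: flip each bit with a low probability.
                next_population.append(
                    [1 - g if rng.random() < mutation_rate else g for g in child])
        population = next_population[:pop_size]
    scores = [fitness(c) for c in population]
    best_index = max(range(pop_size), key=scores.__getitem__)
    return population[best_index], scores[best_index], len(best_history)


best, best_score, generations_run = run_ga(one_max, n_genes=12)
print(best, best_score, generations_run)
```

A budget condition (wall-clock time or cost) could be added as one more check inside the loop.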

SLIDE 21

Example: Problem Statement

  • Data: We generated data from a device that is used to monitor breathing.
  • Problem Statement: This is a classification problem. Find the set of features that will be used to build a linear discriminant analysis (LDA) model.
  • Objective: Determine whether the breathing action of a subject is normal or labored.

SLIDE 22

Example: Simulation Parameters

  • The data consists of measurements of more than 100 health parameters of subjects. The number of subjects is a little over 1500.
  • Data was split into training and test sets in a ratio of 80:20. 10,000 such splits were made randomly.
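
A minimal sketch of repeated random 80:20 splitting, working on row indices only. The sample count of 1500 is a stand-in (the slide says "a little over 1500"), and only 5 splits are drawn here to keep the sketch quick, where the slides use 10,000.

```python
import random


def train_test_split_indices(n_samples, train_fraction, rng):
    """Shuffle the row indices and cut them into a training and a test part."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_train = int(round(train_fraction * n_samples))
    return indices[:n_train], indices[n_train:]


rng = random.Random(0)
n_samples = 1500  # assumed stand-in for "a little over 1500" subjects
splits = [train_test_split_indices(n_samples, 0.8, rng) for _ in range(5)]
train, test = splits[0]
print(len(train), len(test))
```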

SLIDE 23

Example: Area Under ROC Curve

  • The LDA model was then used to predict the outcomes on the test data, and the AUC was computed.
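
For reference, the AUC can be computed directly from test labels and model scores as the fraction of (positive, negative) pairs that the model ranks in the right order. The sketch below does not fit an LDA model; the labels and scores are made-up numbers, and the function is a plain pairwise implementation rather than the authors' code.

```python
def auc(labels, scores):
    """Area under the ROC curve: share of (positive, negative) pairs ranked
    correctly, with ties counted as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]            # made-up test outcomes (1 = labored, say)
scores = [0.1, 0.4, 0.35, 0.8]   # made-up model scores
print(auc(labels, scores))
```

With many random splits, one AUC per split can be collected in a list and summarized, for example with `statistics.median`.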

SLIDE 24

Example: GA Operators

GA Parameters            Values
Population size          100 chromosomes
Number of generations    100
Evaluation               Median AUC over 10,000 simulations
Selection                Roulette-Wheel Sampling
Crossover                Single Point Crossover
Mutation                 1 / 27

SLIDE 25

Example: Performance

SLIDE 26

Advantages

Independent of Calculus:

  • Conventional methods of optimization are based on calculus.
  • They can get trapped in local optima.
  • They also rely on the existence of derivatives. This condition is difficult to satisfy for the objective functions of many problems.
  • In problems where calculus-based optimization methods are not suitable, GA can be useful for optimization.

SLIDE 27

Advantages

Flexibility:

  • In addition to the main operators above, other heuristics may be employed to make the calculation faster or more robust.
  • The Speciation heuristic penalizes crossover between candidate solutions that are too similar. This encourages population diversity and helps prevent premature convergence to a less optimal solution.
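
One simple way to approximate this idea is sketched below: when choosing a mate, only candidates that differ from the parent in at least `min_distance` bits are eligible. This is a hard threshold rather than a graded penalty, and the function names, the Hamming distance measure, and the fallback rule are all assumptions for illustration, not the heuristic as used by the authors.

```python
import random


def hamming_distance(a, b):
    """Number of positions at which two equal-length chromosomes differ."""
    return sum(x != y for x, y in zip(a, b))


def pick_dissimilar_mate(parent, candidates, min_distance, rng):
    """Pick a random mate at least min_distance bits away from parent.
    If no candidate qualifies, fall back to the most distant one."""
    eligible = [c for c in candidates
                if hamming_distance(parent, c) >= min_distance]
    if eligible:
        return rng.choice(eligible)
    return max(candidates, key=lambda c: hamming_distance(parent, c))


rng = random.Random(4)
parent = [1, 1, 1, 1]
candidates = [[1, 1, 1, 1], [1, 1, 1, 0], [0, 0, 1, 1]]
print(pick_dissimilar_mate(parent, candidates, min_distance=2, rng=rng))
```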

SLIDE 28

Limitations and Solutions

Repeated Fitness Function Evaluation:

  • Repeated fitness function evaluation for complex problems can be prohibitively expensive.
  • In real-world problems such as structural optimization problems, a single function evaluation may require several hours to several days of complete simulation.

Solution:

  • Forgo an exact evaluation and use a computationally efficient approximated fitness.
  • Amalgamation of approximate models may be one of the most promising approaches to convincingly use GA to solve complex real-life problems.

SLIDE 29

Limitations and Solutions

Scalability:

  • GAs do not scale well with complexity.
  • If the number of features is large, there is an exponential increase in search space size.
  • It is extremely difficult to use GA on problems such as designing an engine, a house or a plane.

Solution:

  • Break the problem down into the simplest representation possible.
  • Encode designs for fan blades instead of engines, building shapes instead of detailed construction plans, and airfoils instead of whole aircraft designs.

SLIDE 30

Conclusion

  • In the era of big data, conventional optimization methods are sometimes not agile enough to solve the problem.
  • With higher computing power from cloud computing infrastructures, genetic algorithms can be applied to solve problems in reasonable time.

SLIDE 31

Found It!

SLIDE 32

Any Questions?
