SLIDE 1

Data Warehousing and Machine Learning

Feature Selection
Thomas D. Nielsen

Aalborg University Department of Computer Science

Spring 2008

DWML Spring 2008 1 / 16

SLIDE 2

Feature Selection

When features don’t help
Data generated by a process described by the following Bayesian network:

[Figure: Bayesian network with class node Class and attributes A1, A2, A3; Class is a parent of A1 and A3, and A1 is the parent of A2.]

P(Class):          ⊕ = 0.5, ⊖ = 0.5

P(A1 | Class):     A1 = 0   A1 = 1
    Class = ⊕:      0.4      0.6
    Class = ⊖:      0.5      0.5

P(A2 | A1):        A2 = 0   A2 = 1
    A1 = 0:         1.0      0.0
    A1 = 1:         0.0      1.0

P(A3 | Class):     A3 = 0   A3 = 1
    Class = ⊕:      0.5      0.5
    Class = ⊖:      0.7      0.3

Attribute A2 is just a duplicate of A1. Conditional class probability, for example:
P(⊕ | A1 = 1, A2 = 1, A3 = 0) = 0.461
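Reading the tables above with the attribute states ordered (0, 1), the quoted probability can be reproduced directly; A2 contributes no extra factor because it equals A1 with probability 1:

P(⊕ | A1 = 1, A2 = 1, A3 = 0)
= P(⊕) P(A1 = 1 | ⊕) P(A3 = 0 | ⊕) / [P(⊕) P(A1 = 1 | ⊕) P(A3 = 0 | ⊕) + P(⊖) P(A1 = 1 | ⊖) P(A3 = 0 | ⊖)]
= (0.5 · 0.6 · 0.5) / (0.5 · 0.6 · 0.5 + 0.5 · 0.5 · 0.7) = 0.15 / 0.325 ≈ 0.461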

DWML Spring 2008 2 / 16

SLIDE 3

Feature Selection

The Naive Bayes model learned from data:

[Figure: Naive Bayes structure; Class is the parent of A1, A2 and A3.]

P(Class):          ⊕ = 0.5, ⊖ = 0.5

P(A1 | Class):     A1 = 0   A1 = 1
    Class = ⊕:      0.4      0.6
    Class = ⊖:      0.5      0.5

P(A2 | Class):     A2 = 0   A2 = 1
    Class = ⊕:      0.4      0.6
    Class = ⊖:      0.5      0.5

P(A3 | Class):     A3 = 0   A3 = 1
    Class = ⊕:      0.5      0.5
    Class = ⊖:      0.7      0.3

In the Naive Bayes model: P(⊕ | A1 = 1, A2 = 1, A3 = 0) = 0.507
Intuitively: the NB model double counts the information provided by A1, A2.
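Under the Naive Bayes factorisation every attribute contributes its own factor, so the evidence from A1 enters a second time through its copy A2 (again reading the tables with states ordered (0, 1)):

P(⊕ | A1 = 1, A2 = 1, A3 = 0) ∝ P(⊕) P(A1 = 1 | ⊕) P(A2 = 1 | ⊕) P(A3 = 0 | ⊕) = 0.5 · 0.6 · 0.6 · 0.5 = 0.0900
P(⊖ | A1 = 1, A2 = 1, A3 = 0) ∝ 0.5 · 0.5 · 0.5 · 0.7 = 0.0875
⇒ P(⊕ | A1 = 1, A2 = 1, A3 = 0) = 0.0900 / (0.0900 + 0.0875) ≈ 0.507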

DWML Spring 2008 3 / 16

SLIDE 4

Feature Selection

The Naive Bayes model with selected features A1 and A3:

[Figure: Naive Bayes structure over the selected features; Class is the parent of A1 and A3.]

P(Class):          ⊕ = 0.5, ⊖ = 0.5

P(A1 | Class):     A1 = 0   A1 = 1
    Class = ⊕:      0.4      0.6
    Class = ⊖:      0.5      0.5

P(A3 | Class):     A3 = 0   A3 = 1
    Class = ⊕:      0.5      0.5
    Class = ⊖:      0.7      0.3

In this Naive Bayes model: P(⊕ | A1 = 1, A3 = 0) = 0.461
(and all other posterior class probabilities are also the same as for the true model).

DWML Spring 2008 4 / 16

SLIDE 5

Feature Selection

Decision Tree
Decision trees learned from the same data:

[Figure: two decision trees learned from the data; the internal nodes test A1, A2 and A3, and the leaf class distributions ⊕/⊖ are 0.66/0.33, 0.36/0.64, 0.57/0.43 and 0.46/0.54.]

Decision tree does not test two equivalent variables twice on one branch (but might pick one or the other on different branches).

DWML Spring 2008 5 / 16

SLIDE 7

Feature Selection

Problems

  • Correlated features can skew prediction
  • Irrelevant features (not correlated to class variable) cause unnecessary blowup of model space (search space)
  • Irrelevant features can drown the information provided by informative features in noise (e.g. distance function dominated by random values of many uninformative features)
  • Irrelevant features in a model reduce its explanatory value (also when predictive accuracy is not reduced).

Methods of feature selection

  • Define relevance of features, and filter out irrelevant features before learning (relevance independent of the model used).
  • Filter features based on model-specific criteria (e.g. eliminate highly correlated features for Naive Bayes); a minimal sketch of this follows below.
  • Wrapper approach: evaluate feature subsets by model performance.
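As a minimal illustration of the model-specific filtering idea (second item above), the sketch below drops a feature whenever it is almost perfectly correlated with one already kept, the kind of redundancy that hurts Naive Bayes. The function name and the 0.95 threshold are illustrative assumptions, not from the slides.

```python
import numpy as np

def drop_redundant_features(X, threshold=0.95):
    """Keep a column only if its absolute correlation with every already-kept column
    stays below the threshold (a simple pairwise redundancy filter)."""
    kept = []
    for j in range(X.shape[1]):
        redundant = any(abs(np.corrcoef(X[:, j], X[:, i])[0, 1]) >= threshold for i in kept)
        if not redundant:
            kept.append(j)
    return kept   # indices of the retained feature columns
```

On the duplicated-attribute example above, A2 would be filtered out because it is perfectly correlated with A1.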

DWML Spring 2008 6 / 16

SLIDE 9

Feature Selection

Relevance

A possible definition: A feature Ai is irrelevant if for all a ∈ states(Ai) and c ∈ states(C):
P(C = c | Ai = a) = P(C = c).

Limitations of relevance-based filtering:

  • Even if Ai is irrelevant, it may become relevant in the presence of another feature, i.e. for some Aj and a′ ∈ states(Aj): P(C = c | Ai = a, Aj = a′) ≠ P(C = c | Aj = a′).
  • For Naive Bayes: irrelevant features neither help nor hurt
  • Irrelevance does not capture redundancy

Generally: it is difficult to say in a data- and method-independent way which features are useful.
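A minimal sketch of the definition applied to data: estimate P(C = c | Ai = a) and P(C = c) from counts and call Ai (marginally) irrelevant when they agree within a tolerance. As the first bullet warns, this test can miss features that only become relevant jointly with others; the function name and tolerance are illustrative assumptions.

```python
import numpy as np

def is_marginally_irrelevant(a, c, tol=0.01):
    """True if the empirical P(C = c | Ai = a) matches P(C = c) for every value a and class c."""
    classes = np.unique(c)
    p_c = np.array([np.mean(c == cls) for cls in classes])                    # P(C = c)
    for value in np.unique(a):
        subset = c[a == value]
        p_c_given_a = np.array([np.mean(subset == cls) for cls in classes])   # P(C = c | Ai = a)
        if np.max(np.abs(p_c_given_a - p_c)) > tol:
            return False
    return True
```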

DWML Spring 2008 7 / 16

SLIDE 10

Feature Selection

The Wrapper Approach [Kohavi,John 97]

  • Search over possible feature subsets
  • Candidate feature subsets v are evaluated as follows (a sketch of the score f(v) appears below):
      • Construct a model from the features in v using the given learning method (= induction algorithm).
      • Evaluate the performance of the model using cross-validation.
      • Assign score f(v): average predictive accuracy in cross-validation.
  • The best feature subset found is used to learn the final model.
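A minimal sketch of the wrapper score f(v), assuming scikit-learn is available and using a decision tree as the induction algorithm; the function name and the majority-class fallback for the empty subset are illustrative choices, not from [Kohavi,John 97].

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def wrapper_score(X, y, v, cv=10):
    """f(v): average predictive accuracy of the model induced from feature subset v,
    estimated by cv-fold cross-validation."""
    v = sorted(v)
    if not v:                                   # empty subset: score of majority-class prediction
        return max(np.mean(y == cls) for cls in np.unique(y))
    model = DecisionTreeClassifier(random_state=0)
    return cross_val_score(model, X[:, v], y, cv=cv, scoring="accuracy").mean()
```

The subset with the best f(v) found by the search is then used to learn the final model on the full training data.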

DWML Spring 2008 8 / 16

SLIDE 11

Feature Selection

Feature Selection Search

The feature subset lattice for 4 attributes: e.g. (1, 0, 1, 0) represents the feature subset {A1, A3}.
The search space (2^n subsets for n attributes) is too big for exhaustive search!
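A tiny illustration of the bit-vector encoding of lattice nodes and of why exhaustive search does not scale (assumed example code, not from the slides):

```python
from itertools import product

n = 4                                                 # number of attributes
nodes = list(product((0, 1), repeat=n))               # each bit vector is one feature subset
subset = {f"A{i+1}" for i, bit in enumerate((1, 0, 1, 0)) if bit}
print(len(nodes), subset)                             # 16 nodes; (1, 0, 1, 0) -> {'A1', 'A3'}
```

With 2^n nodes, the lattice doubles with every additional attribute, so exhaustive evaluation quickly becomes infeasible.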

DWML Spring 2008 9 / 16

SLIDE 12

Feature Selection

Greedy Search

[Figure: greedy search on the subset lattice; f(v) values shown: 0.5]

DWML Spring 2008 10 / 16

SLIDE 13

Feature Selection

Greedy Search

[Figure: greedy search on the subset lattice; f(v) values shown: 0.5, 0.6, 0.7, 0.6, 0.5]

DWML Spring 2008 10 / 16

SLIDE 14

Feature Selection

Greedy Search

[Figure: greedy search on the subset lattice; f(v) values shown: 0.5, 0.7, 0.63, 0.72, 0.65]

DWML Spring 2008 10 / 16

SLIDE 15

Feature Selection

Greedy Search

[Figure: greedy search on the subset lattice; f(v) values shown: 0.5, 0.7, 0.72, 0.7, 0.69]

DWML Spring 2008 10 / 16

SLIDE 16

Feature Selection

Greedy Search

[Figure: greedy search on the subset lattice; f(v) values shown: 0.5, 0.7, 0.72]

Search terminates when no score improvement is obtained by expansion.
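A sketch of the greedy (hill-climbing) search pictured in the preceding figures, assuming forward selection from the empty subset and the `wrapper_score` sketch from the wrapper slide; that the search runs forward from the empty set is an assumption, since the slides only show the score progression.

```python
def greedy_forward_selection(X, y, score):
    """Repeatedly add the single feature whose inclusion improves f(v) the most;
    stop when no expansion improves the current score."""
    n_features = X.shape[1]
    current = frozenset()
    best_score = score(X, y, current)
    while True:
        expansions = [current | {i} for i in range(n_features) if i not in current]
        if not expansions:
            break                                   # all features already selected
        scores = [(score(X, y, v), v) for v in expansions]
        top_score, top_v = max(scores, key=lambda t: t[0])
        if top_score <= best_score:                 # no score improvement by expansion: terminate
            break
        current, best_score = top_v, top_score
    return current, best_score
```

Usage, continuing the earlier sketch: `selected, acc = greedy_forward_selection(X, y, wrapper_score)`.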

DWML Spring 2008 10 / 16

SLIDE 17

Feature Selection

Best First Search

[Figure: best first search on the subset lattice, with open, closed and best lists; f(v) values shown: 0.5]

DWML Spring 2008 11 / 16

SLIDE 18

Feature Selection

Best First Search

[Figure: best first search on the subset lattice, with open, closed and best lists; f(v) values shown: 0.5, 0.7, 0.6, 0.6, 0.5]

DWML Spring 2008 11 / 16

SLIDE 19

Feature Selection

Best First Search

[Figure: best first search on the subset lattice, with open, closed and best lists; f(v) values shown: 0.5, 0.7, 0.63, 0.72, 0.65, 0.6, 0.6, 0.5]

DWML Spring 2008 11 / 16

SLIDE 20

Feature Selection

Best First Search

[Figure: best first search on the subset lattice, with open, closed and best lists; f(v) values shown: 0.5, 0.7, 0.63, 0.72, 0.65, 0.6, 0.6, 0.5, 0.69, 0.7]

DWML Spring 2008 11 / 16

SLIDE 21

Feature Selection

Best First Search

[Figure: best first search on the subset lattice, with open, closed and best lists; f(v) values shown: 0.5, 0.7, 0.63, 0.72, 0.65, 0.6, 0.6, 0.5, 0.69, 0.7, 0.75]

DWML Spring 2008 11 / 16

SLIDE 24

Feature Selection

Best First Search

[Figure: best first search on the subset lattice, with open, closed and best lists; f(v) values shown: 0.5, 0.7, 0.63, 0.72, 0.65, 0.6, 0.6, 0.5, 0.82, 0.69, 0.7, 0.75]

DWML Spring 2008 11 / 16

SLIDE 26

Feature Selection

Best First Search

Search continues until k consecutive expansions have not generated any improvement in the best feature-subset score. Can also be used as an anytime algorithm: the search continues indefinitely, and the current 'best' subset is always available.
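A sketch of best first search over feature subsets, again assuming the `wrapper_score` sketch from earlier; `open_list` and `closed` mirror the lists in the figures, and the loop stops after k consecutive expansions without an improvement of the best score. The details (priority queue, tie-breaking counter, default k) are illustrative, not from the slides.

```python
import heapq
from itertools import count

def best_first_selection(X, y, score, k=5):
    """Best first search: always expand the highest-scoring open node; stop after k
    consecutive expansions that do not improve the best score seen so far."""
    n_features = X.shape[1]
    tie = count()                                   # tie-breaker so the heap never compares sets
    start = frozenset()
    best_v = start
    best_score = score(X, y, start)
    open_list = [(-best_score, next(tie), start)]   # max-heap via negated scores
    closed = set()
    stale = 0                                       # expansions since the last improvement
    while open_list and stale < k:
        _, _, v = heapq.heappop(open_list)
        if v in closed:
            continue                                # already expanded
        closed.add(v)
        improved = False
        for i in range(n_features):
            if i in v:
                continue
            child = v | {i}
            if child in closed:
                continue
            s = score(X, y, child)
            heapq.heappush(open_list, (-s, next(tie), child))
            if s > best_score:
                best_v, best_score, improved = child, s, True
        stale = 0 if improved else stale + 1
    return best_v, best_score                       # 'best' is available at any time (anytime behaviour)
```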

DWML Spring 2008 11 / 16

SLIDE 27

Feature Selection

Experimental Results: Accuracy

Results from [Kohavi,John 97]. The table gives accuracy in 10-fold cross-validation, comparing each algorithm using all features with the same algorithm using selected features (-FSS), here with greedy search.

DWML Spring 2008 12 / 16

SLIDE 28

Feature Selection

Experimental Results: Number of Features

Results from [Kohavi,John 97]. Average number of features selected in 10-fold cross-validation.

DWML Spring 2008 13 / 16

SLIDE 29

Feature Selection

Experimental Results: Accuracy

Results from [Kohavi,John 97]. The table gives accuracy in 10-fold cross-validation, comparing feature subset selection using hill climbing and best first search.

DWML Spring 2008 14 / 16

SLIDE 30

Feature Selection

Experimental Results: Number of Features

Results from [Kohavi,John 97]. Average number of features selected in 10-fold cross-validation.

DWML Spring 2008 15 / 16

SLIDE 31

Feature Generation

Building new features (see the sketch below):

  • Discretization of continuous attributes
  • Value grouping: e.g. reduce date of sale to month of sale
  • Synthesize new features: e.g. from A1, A2 (continuous) compute Anew := A1/A2
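A minimal sketch of the three constructions using pandas; the column names (income, expenses, date_of_sale) and the bin/label choices are assumed for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "income":       [21000, 58000, 43000, 120000],
    "expenses":     [18000, 30000, 45000,  60000],
    "date_of_sale": pd.to_datetime(["2008-01-15", "2008-03-02", "2008-03-28", "2008-07-09"]),
})

# Discretization of a continuous attribute into three equal-width bins
df["income_band"] = pd.cut(df["income"], bins=3, labels=["low", "medium", "high"])

# Value grouping: reduce date of sale to month of sale
df["month_of_sale"] = df["date_of_sale"].dt.month

# Synthesize a new feature from two continuous ones: Anew := A1 / A2
df["income_to_expenses"] = df["income"] / df["expenses"]
```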

DWML Spring 2008 16 / 16