

SLIDE 1

Mutual Information: an Adequate Tool for Feature Selection?

Benoît Frénay
November 15, 2013

SLIDE 2

Introduction

- What is Feature Selection?
- Overview of the Presentation

SLIDE 3

Example of Feature Selection: Diabetes Progression

Goal: predict the diabetes progression one year after baseline. 442 diabetes patients were measured on 10 baseline variables.

Available patient characteristics (features):
1. age
2. sex
3. body mass index (BMI)
4. blood pressure (BP)
5. serum measurement #1
...
10. serum measurement #6

SLIDE 4

Example of Feature Selection: Diabetes Progression

What are the best features?

Figure reproduced from Efron, B., Hastie, T., Johnstone, I., Tibshirani, R. Least Angle Regression. Annals of Statistics 32 (2), 407–499, 2004.

SLIDE 5

Example of Feature Selection: Diabetes Progression

What is the single best feature?
3. body mass index (BMI)

SLIDE 6

Example of Feature Selection: Diabetes Progression

What are the 2 best features?
3. body mass index (BMI)
9. serum measurement #5

SLIDE 7

Example of Feature Selection: Diabetes Progression

What are the 3 best features?
3. body mass index (BMI)
9. serum measurement #5
4. blood pressure (BP)

SLIDE 8

Example of Feature Selection: Diabetes Progression

What are the 10 best features?
3. body mass index (BMI)
9. serum measurement #5
4. blood pressure (BP)
7. serum measurement #3
2. sex
10. serum measurement #6
5. serum measurement #1
8. serum measurement #4
6. serum measurement #2
1. age


SLIDE 10

What is Feature Selection?

Problems with high-dimensional data:
- interpretability of data
- curse of dimensionality
- concentration of distances

Feature selection consists in using only a subset of the features:
- selecting features yields easy-to-interpret models
- information may be discarded if necessary

Question: how can one select relevant features?

SLIDE 11

Mutual Information: Optimal Solution or Heuristic?

Mutual information (MI) assesses the quality of feature subsets:
- rigorous definition (information theory)
- interpretation in terms of uncertainty reduction

What kind of guarantees do we have?

Outline of this presentation:
- feature selection with mutual information
- adequacy of mutual information in classification
- adequacy of mutual information in regression

SLIDE 12

Mutual Information in a Nutshell


SLIDE 15

Measuring (Statistical) Dependency: Mutual Information

Uncertainty on the value of the output Y:

    H(Y) = E_Y{ -log p_Y(Y) }

Uncertainty on Y once X is known:

    H(Y|X) = E_{X,Y}{ -log p_{Y|X}(Y|X) }

Mutual information (MI):

    I(X;Y) = H(Y) - H(Y|X) = E_{X,Y}{ log [ p_{X,Y}(X,Y) / (p_X(X) p_Y(Y)) ] }

MI is the reduction of uncertainty about the value of Y once X is known.

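For discrete variables, these definitions translate directly into plug-in estimates from sample counts. A minimal sketch (base-2 logarithms, so results are in bits; it uses the equivalent identity I(X;Y) = H(X) + H(Y) - H(X,Y); function names are illustrative):

```python
import math
from collections import Counter

def entropy(values):
    """Plug-in estimate of the entropy in bits from a list of symbols."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

# Perfectly dependent samples give I = H(Y); independent ones give I = 0.
assert abs(mutual_information([0, 0, 1, 1], [0, 0, 1, 1]) - 1.0) < 1e-9
assert abs(mutual_information([0, 0, 1, 1], [0, 1, 0, 1])) < 1e-9
```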

SLIDE 16

Feature Selection with Mutual Information

Natural interpretation in feature selection: the reduction of uncertainty about the class label (Y) once a subset of features (X) is known.
- one selects the subset of features which maximises MI

MI can detect linear as well as non-linear relationships between variables.
- not true for the correlation coefficient

MI can be defined for multi-dimensional variables (subsets of features).
- useful to detect mutually relevant or redundant features

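The point about non-linear relationships can be illustrated with a toy example: for Y = X^2 on a symmetric support, the covariance with X vanishes while the MI does not (the values below are illustrative):

```python
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

# Y = X^2 on a symmetric support: linear correlation vanishes, MI does not.
xs = [-2, -1, 1, 2]
ys = [x * x for x in xs]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)

mi = entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))
assert cov == 0.0            # zero covariance, hence zero correlation
assert abs(mi - 1.0) < 1e-9  # yet Y is fully determined by X: I = H(Y) = 1 bit
```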

SLIDE 17

Greedy Procedures for Feature Selection

The number of possible feature subsets is exponential w.r.t. d.
- exhaustive search is usually intractable (d > 10)

Standard solution: use greedy procedures. Optimality is not guaranteed, but very good results are obtained.

SLIDE 18

The Forward Search Algorithm

Input: set of features 1...d
Output: subsets of feature indices {S_i}, i in 1...d

S_0 <- {}
U <- {1, ..., d}
for each number of features i in 1...d do
    for each remaining feature index j in U do
        compute the mutual information Î_j = Î(X_{S_{i-1} ∪ {j}}; Y)
    end for
    j* <- arg max_j Î_j
    S_i <- S_{i-1} ∪ {j*}
    U <- U \ {j*}
end for

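The forward search can be sketched for discrete data with a plug-in MI estimator (samples are assumed to be feature-value tuples; all names are illustrative):

```python
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mi(joint_xs, ys):
    """I(X_S; Y), where joint_xs is a list of feature-value tuples."""
    return entropy(joint_xs) + entropy(ys) - entropy(list(zip(joint_xs, ys)))

def forward_search(rows, ys):
    """Greedy forward selection; rows is a list of feature vectors."""
    d = len(rows[0])
    selected, remaining, subsets = [], set(range(d)), []
    for _ in range(d):
        def score(j):
            cols = selected + [j]
            return mi([tuple(r[c] for c in cols) for r in rows], ys)
        best = max(remaining, key=score)  # arg max_j of the estimated MI
        selected.append(best)
        remaining.remove(best)
        subsets.append(list(selected))
    return subsets

# Tiny demo: feature 0 copies the label, feature 1 is uninformative.
ranking = forward_search([(0, 0), (0, 1), (1, 0), (1, 1)], [0, 0, 1, 1])
assert ranking[0] == [0]
```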

SLIDE 19

The Backward Search Algorithm

Input: set of features 1...d
Output: subsets of feature indices {S_i}, i in 1...d

S_d <- {1, ..., d}
for each number of features i in d-1 ... 1 do
    for each remaining feature index j in S_{i+1} do
        compute the mutual information Î_j = Î(X_{S_{i+1} \ {j}}; Y)
    end for
    j* <- arg max_j Î_j
    S_i <- S_{i+1} \ {j*}
end for

SLIDE 20

Should You Use Mutual Information?

Is MI optimal? Do we have guarantees? What does optimality mean?
- in classification: maximises accuracy
- in regression: minimises MSE/MAE

MI allows strategies like minimum-redundancy-maximum-relevance (mRMR):

    arg max_{X_k} [ I(X_k; Y) - (1/d) Σ_{i=1}^{d} I(X_k; X_i) ]

MI is supported by a large literature of successful applications.

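An mRMR-style score can be sketched as follows. Note the formula above sums over all d features; a common incremental variant instead averages redundancy over the already-selected features, and that is the view this sketch takes (function names are illustrative):

```python
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(xs, ys):
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def mrmr_score(candidate, selected, ys):
    """Relevance I(Xk;Y) minus mean redundancy with the selected features."""
    relevance = mutual_information(candidate, ys)
    if not selected:
        return relevance
    redundancy = sum(mutual_information(candidate, f)
                     for f in selected) / len(selected)
    return relevance - redundancy
```

A candidate identical to an already-selected feature scores 0: its relevance is exactly cancelled by its redundancy.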

SLIDE 21

Mutual Information in Classification

- Theoretical Considerations
- Experimental Assessment

SLIDE 22

Classification and Risk Minimisation

Goal in classification: minimise the number of misclassifications.

Take an optimal classifier with probability of misclassification Pe:
- optimal feature subsets correspond to minimal Pe

Question: how optimal is feature selection with MI? Or: what is the relationship between MI / H(Y|X) and Pe?

Remember: I(X;Y) = H(Y) - H(Y|X), where H(Y) is constant.

SLIDE 23

Bounds on the Risk: the Hellman-Raviv Inequality

An upper bound on Pe is given by the Hellman-Raviv inequality:

    Pe ≤ (1/2) H(Y|X)

where Pe is the probability of misclassification for an optimal classifier.

SLIDE 24

Bounds on the Risk: the Fano Inequalities

The weak and strong Fano bounds are

    H(Y|X) ≤ 1 + Pe log2(nY - 1)
    H(Y|X) ≤ H(Pe) + Pe log2(nY - 1)

where nY is the number of classes and H(Pe) is the binary entropy of Pe.

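Both bounds can be checked numerically on a small two-class example (the distribution below is illustrative, not taken from the slides):

```python
import math

def h2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Hypothetical two-class setting: P(X) and the rows of P(Y|X).
p_x = [0.5, 0.5]
p_y_given_x = [[0.9, 0.1], [0.2, 0.8]]  # rows: x; columns: y = a, b

h_cond = sum(px * h2(row[0]) for px, row in zip(p_x, p_y_given_x))  # H(Y|X)
pe = sum(px * (1 - max(row)) for px, row in zip(p_x, p_y_given_x))  # Bayes risk

n_y = 2
assert pe <= 0.5 * h_cond                          # Hellman-Raviv
assert h_cond <= 1 + pe * math.log2(n_y - 1)       # weak Fano
assert h_cond <= h2(pe) + pe * math.log2(n_y - 1)  # strong Fano
```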

SLIDE 29

No Deterministic Relationship between MI and Risk

Hellman-Raviv and Fano inequalities: upper/lower bounds on the risk.
- classical (and simplistic) justification for MI: Pe → 0 if H(Y|X) → 0, or equivalently I(X;Y) → H(Y)

Question: is it possible to increase the risk while decreasing H(Y|X)?

The answer is yes!

SLIDE 30

Simple Example of MI Failure (1)

Disease diagnosis, two classes with priors P(Y = a) = 0.32 and P(Y = b) = 0.68.

For each new patient:
- two medical tests are available: X1 ∈ {0, 1} and X2 ∈ {0, 1}
- but the practitioner can only perform either X1 or X2

In terms of feature selection, the practitioner has to select the best feature (X1 or X2).


SLIDE 33

Simple Example of MI Failure (2)

Through experimentation, the practitioner estimates P(X1|Y) and P(X2|Y). Using Bayes' theorem:

    P(Y|X1):        X1 = 0    X1 = 1
        Y = a        0.65      0.23
        Y = b        0.35      0.77

    P(Y|X2):        X2 = 0    X2 = 1
        Y = a        0.49      0.01
        Y = b        0.51      0.99


SLIDE 39

Warning: Change of Point of View!

    I(X;Y) = H(Y) - H(Y|X), where H(Y) is constant.

SLIDE 40

Simple Example of MI Failure (3)

Solution 1: using feature X1
- Pe = 0.26 and I(X1; Y) = 0.09

Solution 2: using feature X2
- Pe = 0.32 and I(X2; Y) = 0.24

Observations:
- the MI is much larger using X2
- Pe is also much larger using X2

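These values can be reproduced from the priors and posteriors on the preceding slides: P(X) is recovered from P(Y=a) = Σ_x P(Y=a|x) P(x), the Bayes risk from the posterior, and MI as H(Y) - H(Y|X). A sketch (function names are illustrative):

```python
import math

def h2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

P_A = 0.32  # prior P(Y = a) from the slides

def analyse(post_a):
    """post_a = [P(Y=a|X=0), P(Y=a|X=1)]; returns (Pe, I(X;Y))."""
    # Recover P(X=0) from P(Y=a) = post_a[0]*p + post_a[1]*(1-p).
    p0 = (P_A - post_a[1]) / (post_a[0] - post_a[1])
    p_x = [p0, 1.0 - p0]
    pe = sum(px * min(pa, 1 - pa) for px, pa in zip(p_x, post_a))   # Bayes risk
    h_cond = sum(px * h2(pa) for px, pa in zip(p_x, post_a))        # H(Y|X)
    return pe, h2(P_A) - h_cond

pe1, mi1 = analyse([0.65, 0.23])  # test X1
pe2, mi2 = analyse([0.49, 0.01])  # test X2
# X2 has the larger mutual information but also the larger error probability.
assert mi2 > mi1 and pe2 > pe1
```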


SLIDE 44

Simple Example of MI Failure (4)

Conclusion: selecting X2 based on MI increases the probability of error.

SLIDE 45

More MI Failures

Experiment:
- constant class priors
- binary tests Xi ∈ {0, 1} with random probabilities
- counting MI failures for pairs of tests

Increasing mutual information increases Pe for about 20% of the pairs.

SLIDE 46

Empirical Results in Classification

Experiments corresponding to various classification problem difficulties.
- Conditional probability of failure: average probability of mutual information failure
- Misclassification probability loss ΔPe: percentage of errors due to incorrect subset choice

SLIDE 47

Empirical Results in Classification (Artificial)

Artificial three-class balanced classification problems:
- for each class, each feature has a Gaussian conditional distribution
- Gaussian widths and centers are randomly drawn
- features are compared pair-wise

SLIDE 48

Empirical Results in Classification (Artificial)

Observations:
- the average probability of mutual information failure is small
- misclassification probability losses remain small
- failures are more likely to occur for moderately difficult problems

SLIDE 49

Empirical Results in Classification (Digits)

Ten-class digit recognition dataset with 10992 instances / 16 features:
- 10 features randomly chosen among the available features
- forward search finds feature subsets of increasing sizes
- a step fails if it does not minimise the misclassification probability

Figure reproduced from F. Alimoglu (1996) Combining Multiple Classifiers for Pen-Based Handwritten Digit Recognition, MSc Thesis, Bogazici University.

SLIDE 50

Empirical Results in Classification (Digits)

Observations:
- the average probability of mutual information failure is small
- misclassification probability losses remain small
- failures are more likely to occur for moderately difficult problems

SLIDE 51

Conclusion

The published results show that mutual information is not necessarily optimal:
- it is possible to design examples where MI fails
- the impact and the frequency of MI failures seem to be limited

Good news: MI remains a valuable criterion for feature selection.

Frénay, B., Doquire, G., Verleysen, M. On the potential inadequacy of mutual information for feature selection. In Proc. ESANN 2012, 501–506.
Frénay, B., Doquire, G., Verleysen, M. Theoretical and empirical study on the potential inadequacy of mutual information for feature selection in classification. Neurocomputing, 112, 64–78, 2013.
Doquire, G., Frénay, B., Verleysen, M. Risk estimation and feature selection. In Proc. ESANN 2013, 161–166.

SLIDE 52

Mutual Information in Regression

SLIDE 53

Preliminary Change of Perspective

Mutual information (MI): I(X;Y) = H(Y) - H(Y|X) = -H(Y|X) + const.

As far as maximising I(X;Y) is concerned:

    arg max_X I(X;Y) = arg min_X H(Y|X)

Let us discuss H(Y|X)...

SLIDE 54

Feature Selection in Regression

Goal: reduce the average estimation error ε = Y - f(X).

Question: is there a link between H(Y|X) = H(ε|X) and
- mean square error (MSE): E_{X,Y}{(Y - f(X))²} = E_{X,Y}{ε²}
- mean absolute error (MAE): E_{X,Y}{|Y - f(X)|} = E_{X,Y}{|ε|}

Figure: f(x) = sin(x) polluted by uniform, Laplacian or Gaussian target noise.


SLIDE 56

Theoretical Results: Uniform/Laplacian/Gaussian Case

Uniform estimation error:
    MSE = (1/12) exp(2H(Y|X))        MAE = (1/4) exp(H(Y|X))

Laplacian estimation error:
    MSE = (1/(2e²)) exp(2H(Y|X))     MAE = (1/(2e)) exp(H(Y|X))

Gaussian estimation error:
    MSE = (1/(2πe)) exp(2H(Y|X))     MAE = (1/(π√e)) exp(H(Y|X))

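Each MSE identity can be checked against the closed-form differential entropy of the corresponding noise family (entropies in nats; the parameter values below are arbitrary):

```python
import math

# Check MSE = c * exp(2 H(Y|X)) for each family, using the closed-form
# differential entropies.
w = 3.0        # uniform width:  H = ln(w),                    MSE = w^2 / 12
b = 0.7        # Laplace scale:  H = ln(2*b*e),                MSE = 2 * b^2
sigma = 1.3    # Gaussian std:   H = (1/2) ln(2*pi*e*sigma^2), MSE = sigma^2

h_unif = math.log(w)
h_lap = math.log(2 * b * math.e)
h_gauss = 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

assert math.isclose(math.exp(2 * h_unif) / 12, w ** 2 / 12)
assert math.isclose(math.exp(2 * h_lap) / (2 * math.e ** 2), 2 * b ** 2)
assert math.isclose(math.exp(2 * h_gauss) / (2 * math.pi * math.e), sigma ** 2)
```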

SLIDE 57

Theoretical Results: Illustration

Figure: MSE and MAE in terms of H(Y|X) for identically distributed uniform (dotted line), Laplacian (dashed line) and Gaussian (plain line) estimation error.

SLIDE 58

Theoretical Results: Student Case

Uniform/Laplacian/Gaussian estimation error:
    MSE ∝ exp(2H(Y|X))    MAE ∝ exp(H(Y|X))

Student estimation error with number of degrees of freedom ν:

    MSE = 1 / [ (ν - 2) B(1/2, ν/2)² ] × exp( 2H(Y|X) - (ν + 1) [Ψ((ν+1)/2) - Ψ(ν/2)] )

Figure panels: (a) ν = 2.3, (b) ν = 5, (c) ν = 30.

SLIDE 59

Mutual Information Failure in Regression (1)

Figure: MSE in terms of the conditional target entropy H(Y|X) for Student estimation error with different numbers of degrees of freedom: ν = 2.3 (dotted line), ν = 5 (dashed line) or ν = 30 (plain line).

SLIDE 60

Mutual Information Failure in Regression (2)

Figure: Example of mutual information failure for Student estimation error with respect to the MSE. The candidate feature subsets correspond to different numbers of degrees of freedom: ν = 2.3 (dotted line) and ν = 5 (dashed line).

SLIDE 61

Conclusion

In some cases, there exists no deterministic relationship between the conditional entropy H(Y|X) and the MSE or MAE. The optimality of MI depends on the estimation error distribution. For the examples considered here, failures have limited impact.

See: Frénay, B., Doquire, G., Verleysen, M. Is mutual information adequate for feature selection in regression? Neural Networks, 48, 1–7, 2013.

SLIDE 62

Conclusion

SLIDE 63

Take-Home Messages

Theoretical and experimental study:
- mutual information-based feature selection is legitimate
- counterexamples can be obtained in classification/regression
- such failures remain uncommon and of limited impact

Our recent works enhance the legitimacy of mutual information-based feature selection, while clearing up some common misunderstandings.

Summary: Frénay, B., Doquire, G., Verleysen, M. Mutual Information: an Adequate Tool for Feature Selection. In Proc. of BENELEARN 2013, 89.

SLIDE 64

Thank you for your attention! I hope it was interesting for everyone. Any questions?