SLIDE 1

Redundant Feature Elimination for Multi-Class Problems

Annalisa Appice, Michelangelo Ceci

Dipartimento di Informatica, Università degli Studi di Bari, Italy

Simon Rawles, Peter Flach

Department of Computer Science, University of Bristol, UK

SLIDE 2

Redundant feature reduction

  • REFER: an efficient, scalable, logic-based method for eliminating Boolean features which are redundant for multi-class classifier learning.

– Why? Size of the hypothesis space, predictive performance, and model comprehensibility.
– Distinct from feature selection.

SLIDE 3

Overview of this talk

  • Redundant feature reduction

– What is feature redundancy?
– Doing multi-class reduction

  • Related approaches
  • Theoretical and experimental results
  • Summary
  • Current and future work
SLIDE 4

Example: Redundancy of features

     f1  f2  f3  class
e1    1   1   0    a
e2    0   1   0    a
e3    0   0   0    a
e4    0   0   0    b
e5    1   0   0    b

Each example has a fixed number of Boolean features and one of several class labels (‘multi-class’).

SLIDE 5

Discriminating a against b

     f1  f2  f3  class
e1    1   1   0    a
e2    0   1   0    a
e3    0   0   0    a
e4    0   0   0    b
e5    1   0   0    b

True values in examples of class a make the feature better for distinguishing a from b in a classification rule.

SLIDE 6

Discriminating a against b

     f1  f2  f3  class
e1    1   1   0    a
e2    0   1   0    a
e3    0   0   0    a
e4    0   0   0    b
e5    1   0   0    b

False values in examples of class b make the feature better for distinguishing a from b in a rule.

SLIDE 7

Discriminating a against b

     f1  f2  f3  class
e1    1   1   0    a
e2    0   1   0    a
e3    0   0   0    a
e4    0   0   0    b
e5    1   0   0    b

f2 covers f1 and f3 is useless. f1 and f3 are redundant. Negated features are not automatically considered.

SLIDE 8

More formally...

For discriminating class a examples from class b,

  • f covers g if Ta(g) ⊆ Ta(f) and Fb(g) ⊆ Fb(f).
  • A feature is redundant if another feature covers it.

     f1  f2  class
e1    1   1    a
e2    0   1    a
e3    0   0    a
e4    0   0    b
e5    1   0    b

Ta(f2) = {e1, e2}. Ta(f1) = {e1}. Fb(f2) = {e4, e5}. Fb(f1) = {e4}. Here a is the ‘positive class’.
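The covering test is easy to state in code. Below is a minimal sketch (our illustration, not the authors' implementation), assuming each feature is a 0/1 column over the examples; the names Ta, Fb and covers mirror the definitions above.

```python
# Minimal sketch of the covering test above (illustration, not the
# authors' code). Features are 0/1 columns; `labels` is the class column.

def Ta(col, labels, a):
    # class-a examples where the feature is true
    return {i for i, (v, y) in enumerate(zip(col, labels)) if y == a and v == 1}

def Fb(col, labels, b):
    # class-b examples where the feature is false
    return {i for i, (v, y) in enumerate(zip(col, labels)) if y == b and v == 0}

def covers(f, g, labels, a, b):
    # f covers g iff Ta(g) ⊆ Ta(f) and Fb(g) ⊆ Fb(f)
    return Ta(g, labels, a) <= Ta(f, labels, a) and Fb(g, labels, b) <= Fb(f, labels, b)

# The table above: f2 covers f1, so f1 is redundant.
labels = ["a", "a", "a", "b", "b"]
f1 = [1, 0, 0, 0, 1]
f2 = [1, 1, 0, 0, 0]
assert covers(f2, f1, labels, "a", "b")
```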

SLIDE 9

Neighbourhoods of examples

  • A way to upgrade to multi-class data.
  • Each class is partitioned into subsets of similar examples.

– REFER-N finds the non-redundant features between each neighbourhood pair in turn (see the sketch below).
– It builds up the list of non-redundant features pair by pair.

  • Efficient, logic-based, and achieves more reduction.
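A schematic of the REFER-N loop just described (our reconstruction, not the authors' code), assuming neighbourhoods are (example-index set, class label) pairs and reduce_pair is the two-class elimination, sketched with the REDUCE slide below:

```python
from itertools import combinations

def refer_n(neighbourhoods, features, reduce_pair):
    # Accumulate the features found non-redundant between each pair of
    # neighbourhoods with differing class labels, preferring features
    # already kept (reconstruction of the scheme described above).
    kept = set()
    for (idx_a, ya), (idx_b, yb) in combinations(neighbourhoods, 2):
        if ya == yb:  # only pairs of differing class are compared
            continue
        kept |= reduce_pair(idx_a, idx_b, features, prefer=kept)
    return kept
```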
SLIDES 10–25

Neighbourhood construction

[Animated figure: neighbourhoods 1–5 are constructed one at a time; groups of similar examples with the same class label.]
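The slides do not spell out the construction procedure, but a greedy seed-and-grow scheme matching the animation is easy to sketch (an assumed scheme, not necessarily the paper's exact procedure; similar is a user-supplied similarity predicate):

```python
def build_neighbourhoods(X, y, similar):
    # Greedy sketch: pick an unassigned example as a seed, group it with
    # the similar unassigned examples of the same class, and repeat
    # until every example belongs to some neighbourhood.
    unassigned = set(range(len(X)))
    neighbourhoods = []
    while unassigned:
        seed = min(unassigned)
        group = {i for i in unassigned
                 if y[i] == y[seed] and similar(X[seed], X[i])}
        group.add(seed)  # the seed always joins its own group
        unassigned -= group
        neighbourhoods.append((group, y[seed]))
    return neighbourhoods
```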

SLIDES 26–28

Neighbourhood comparison

[Figure: neighbourhoods 1–5; all pairs of neighbourhoods of differing class are compared.]

SLIDE 29

Ancestry of REFER

  • REDUCE (Lavrač et al. 1999)

– Feature reduction for propositionalised ILP datasets
– Preserves learnability of a complete and consistent hypothesis

  • REFER uses a variant of REDUCE

– Redundant features are found between the examples in each neighbourhood pair
– Prefers features already found non-redundant (see the sketch below)
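A sketch of this REDUCE variant for one neighbourhood pair, reconstructed from the covering definition given earlier (not the authors' code); the preference rule is implemented by examining previously surviving features first:

```python
def reduce_pair(idx_a, idx_b, features, prefer=frozenset()):
    # Two-class elimination for one neighbourhood pair (sketch).
    # `features` maps feature name -> 0/1 column; idx_a / idx_b are the
    # example indices of the two neighbourhoods.
    def t_a(col):  # the pair's class-a examples where the feature is true
        return {i for i in idx_a if col[i] == 1}
    def f_b(col):  # the pair's class-b examples where the feature is false
        return {i for i in idx_b if col[i] == 0}
    kept = []
    # Features already found non-redundant elsewhere are examined first,
    # so mutual (tie) coverage is resolved in their favour.
    for name in sorted(features, key=lambda n: n not in prefer):
        col = features[name]
        if not any(t_a(col) <= t_a(features[k]) and f_b(col) <= f_b(features[k])
                   for k in kept):
            kept.append(name)
    return set(kept)
```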

SLIDE 30

Related multiclass filters

  • FOCUS for noise-free Boolean data (Almuallim & Dietterich 1991)

– Exhaustive evaluation of all feature subsets
– A time complexity of O(np)

  • SCRAP relevance filter (Raman 2003)

– Also uses a neighbourhood approach
– No guarantee that selected features (still) discriminate among all classes.

SLIDE 31

Theoretical results

  • REFER preserves the learnability of a complete and consistent theory.

– If a complete and consistent (C&C) rule set could be found in the original data, it can still be found in the reduced data.

  • REFER is efficient. Its time complexity is

– … linear in the number of examples
– … quadratic in the number of features

(For n examples and p features this gives O(n·p²): each of the O(p²) pairwise covering tests takes time linear in n.)

SLIDE 32

Experimental results

[Plot: number of reduced features vs. number of original features, for REFER and REDUCE.]

  • Mutagenesis data from SINUS

– Feature set greatly reduced (13118 → 44)
– Accuracy still competitive (approx. 85%)

SLIDE 33

Experimental results

  • Thirteen UCI benchmark datasets

– Compared with LVF, CFS and Relief on discrete/discretised data
– Generally conservative
– Faster: faster on 8 of the 13 datasets, very close on 3 more
– Competitive predictive accuracy with several classifiers:

             JRIP   NB   C4.5   SVM
Winner          6    7      3     6
Within 1%       3    2      4

SLIDE 34

Experimental results

  • Reuters-21578: large-scale, high-dimensionality, sparse data

– 16,582 preprocessed features were reduced to 1,450.
– REFER supports parallel execution well.

  • REFER runs in parallel on subsets of the feature set and again on the combination (sketched below).
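A sketch of that parallel scheme (a hypothetical driver, not the authors' code): reduce disjoint blocks of features independently, then run once more over the union of the survivors.

```python
from concurrent.futures import ProcessPoolExecutor

def parallel_refer(feature_blocks, refer):
    # `refer` reduces one collection of features and returns the
    # survivors; it must be a picklable top-level function.
    with ProcessPoolExecutor() as pool:
        surviving = list(pool.map(refer, feature_blocks))
    combined = [f for block in surviving for f in block]
    return refer(combined)  # final pass on the combination
```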

SLIDE 35

Summary

  • A method for eliminating redundant Boolean features for multi-class classification tasks.

  • Uses logical coverage of examples
  • Efficient and scalable

– requiring less time than the three feature selection algorithms we used

  • Amenable to parallel execution
SLIDE 36

Current and future investigations

  • Interaction between feature selection and feature reduction

– Benefits of combination

  • Noise handling using non-pure neighbourhoods (‘relaxed REFER’)

– Overcoming sensitivity to noise

  • REFER for example reduction
SLIDE 37

Questions

SLIDE 39

Average reduction on UCI data

[Plot: average number of reduced features vs. number of original features on the UCI datasets, for LVF, CFS, RELIEFF and REFER.]

SLIDE 40

Effect of choice of starting point

[Plot: number of reduced features vs. number of neighbourhoods constructed, one curve per dataset (Aud, Car, Brid, F1M, F1C, Post, F3M, Nur, Mus, F3C, Yea, Pim, Tic).]

SLIDE 41

Comparison of running times

Time (s)

Dataset          # instances  # features    LVF    CFS  RELIEFF    REFER
Audiology                398         184   3.37   0.80     3.84     0.72
Bridge                   108          83   0.89   0.38     0.67     0.22
Car                     1728          21   1.94   0.44    15.92     0.50
Flare1066/C             1066          40   2.62   0.48    11.51     0.61
Flare1066/M             1066          42   0.82   0.51    11.63     0.20
Flare323/C               323          37   0.72   0.38     1.19     0.12
Flare323/M               323          36   0.80   0.39     1.25     0.21
Mushroom                8124         116  29.48   5.30  1838.36     1.66
Nursery                12960          27  34.24   1.64  1038.31    20.38
Post-operative            90          23   0.33   0.30     0.32     0.08
Tic-tac-toe              950          27   1.03   0.37     5.49     0.20
Pima                     768         120  12.2    1       14.1    2.6537
Yeast                   1484         120  55     19.1     57.1   26.7132

Machine spec: Pentium IV 1.4GHz PC running Windows XP

SLIDE 42

Full accuracy results

JRIP
Dataset        LVF   CFS  RELIEF  REFER   none
Audiology     72.0  66.6    74.9   72.2   74.9
Bridge        58.7  58.0    58.8   59.8   62.5
Car           94.9  70.6    93.6   93.5   94.1
Flare1066 C   83.0  82.7    83.0   82.8   82.7
Flare1066 M   96.6  96.6    96.4   96.5   96.4
Flare323 C    87.5  88.0    88.1   88.1   88.1
Flare323 M    89.1  89.7    89.1   89.1   89.1
Mushroom     100.0  93.0   100.0  100.0  100.0
Nursery       97.3  36.3    91.1   98.7   98.7
Post-Op       71.1  68.9    71.1   71.1   71.1
Tic-tac-toe   95.4  69.9    85.3   98.1   97.7
Pima          69.8  72.6    73.5   72.6   71.1
Yeast         50.9  56.1    50.4   49.9   50.6

NB
Dataset        LVF   CFS  RELIEF  REFER   none
Audiology     72.1  69.2    71.8   74.9   71.8
Bridge        62.7  55.9    63.5   68.2   63.4
Car           82.4  77.8    85.8   86.9   86.9
Flare1066 C   77.7  80.8    72.7   74.8   74.2
Flare1066 M   96.3  95.2    88.3   88.6   88.2
Flare323 C    87.0  86.2    82.0   81.4   81.4
Flare323 M    85.8  86.9    83.7   83.7   83.7
Mushroom      93.6  93.0    94.3   94.3   94.3
Nursery       86.2  66.2    89.4   92.2   92.9
Post-Op       66.7  66.7    58.9   57.8   58.9
Tic-tac-toe   68.8  69.9    71.4   68.4   68.4
Pima          71.2  74.7    74.0   74.7   74.9
Yeast         54.7  50.0    54.7   56.0   55.7

C4.5
Dataset        LVF   CFS  RELIEF  REFER   none
Audiology     75.1  70.6    77.9   74.3   74.4
Bridge        57.6  70.5    64.2   65.1   63.2
Car           96.8  77.8    92.5   94.0   94.0
Flare1066 C   80.9  81.5    80.4   80.6   80.6
Flare1066 M   96.5  96.4    96.0   96.1   96.1
Flare323 C    87.2  87.7    85.9   85.9   85.9
Flare323 M    86.3  88.5    87.2   87.2   87.2
Mushroom     100.0  93.0   100.0  100.0  100.0
Nursery       99.5  66.2    91.1   98.1   98.3
Post-Op       62.2  66.7    65.6   62.2   62.2
Tic-tac-toe   91.6  69.9    83.5   98.7   98.7
Pima          63.9  72.0    65.1   67.9   67.2
Yeast         44.3  48.0    43.6   45.8   45.8

SVM
Dataset        LVF   CFS  RELIEF  REFER   none
Audiology     75.2  69.2    81.6   79.7   81.6
Bridge        61.8  61.8    66.0   68.1   66.0
Car           93.1  77.8    93.3   93.6   93.6
Flare1066 C   82.7  82.9    82.7   82.7   82.7
Flare1066 M   96.5  96.4    96.4   96.4   96.4
Flare323 C    88.4  88.4    87.7   87.7   87.7
Flare323 M    90.0  89.4    89.1   89.1   89.1
Mushroom      99.9  92.6   100.0  100.0  100.0
Nursery       93.2  66.2    93.1   93.1   93.1
Post-Op       68.9  68.9    67.8   67.8   67.8
Tic-tac-toe   75.7  69.9    98.3   98.3   98.3
Pima          72.6  73.5    72.1   74.1   74.6
Yeast         56.1  50.4    54.6   57.6   57.8

SLIDE 43

REFER for propositionalisation

Setting                      M1        M2        M3        M4
Instances produced           1692      1692      1692      1692
Features produced            1016      2114      3986      13118
SINUS parameters (L, V, T)   3, 3, 20  3, 3, 20  3, 3, 20  4, 4, 20
inda and ind1                yes       yes       yes       yes
bonds                        yes       yes       yes       yes
atom element and type        yes       yes       yes       yes
atom charge                  no        yes       yes       yes
lumo and logp                no        yes       yes       yes
2D molecular structures      yes       no        yes       yes

SLIDE 44

REFER for propositionalisation

Setting   # features   # features after reduction   # neighbourhoods   Running time (s)
M1        1016         25.9                         16                 1.10
M2        2114         32.1                         17                 9.33
M3        3986         40.9                         26                 40.30
M4        13118        44.4                         27.1               608.17

SLIDE 45

REFER for propositionalisation

Setting   System   JRIP    NB      KNN     C4.5    SVM
M1        REFER    85.58   87.22   82.98   84.53   86.14
M1        REDUCE   86.14   83.53   81.46   84.01   85.64
M2        REFER    87.28   87.25   87.77   89.91   89.88
M2        REDUCE   85.70   84.06   87.25   89.91   88.33
M3        REFER    85.08   84.06   84.53   84.56   86.66
M3        REDUCE   85.08   86.69   84.53   84.56   84.03
M4        REFER    80.98   82.19   82.09   83.30   86.19
M4        REDUCE   82.98   84.15   80.13   83.30   86.90

System     PROGOL   FOIL   TILDE   MRDTL   1BC    MR-SBC
Accuracy   86       83     85      88      87.2   89.9

SLIDE 46

Neighbourhoods of examples

[Figure: a) an R² analogy of neighbourhood construction: the example set E, with class labels c1, c2, c3, is partitioned into neighbourhoods E1–E5 of same-class examples; b) comparison between neighbourhood pairs of differing class.]

SLIDE 47

Another simple example

     f1  f2  ¬f1  ¬f2  class
e1    1   0    0    1    a
e2    0   0    1    1    a
e3    0   0    1    1    a
e4    0   1    1    0    b
e5    1   1    0    0    b

f2 is a useless feature: any feature can cover it.

SLIDE 48

Introducing negated features

     f1  f2  ¬f1  ¬f2  class
e1    1   0    0    1    a
e2    0   0    1    1    a
e3    0   0    1    1    a
e4    0   1    1    0    b
e5    1   1    0    0    b

… but its negation is a perfectly non-redundant feature. REFER assumes that the user will provide negated features if the language for rules requires it.
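Supplying the negated features is a one-liner per column. A small sketch (a hypothetical helper, not part of REFER itself):

```python
def add_negations(features):
    # Extend a {name: 0/1 column} table with a negated copy of each
    # feature, as the user is expected to do when rules may negate.
    out = dict(features)
    out.update({"not_" + name: [1 - v for v in col]
                for name, col in features.items()})
    return out
```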

SLIDE 49

Introducing negated features

     f1  f2  ¬f1  ¬f2  class
e1    1   0    0    1    a
e2    0   0    1    1    a
e3    0   0    1    1    a
e4    0   1    1    0    b
e5    1   1    0    0    b

If positive and negated features are all considered together, only the perfect feature ¬f2 is chosen ...

SLIDE 50

Introducing negated features

     f1  f2  ¬f1  ¬f2  class
e1    1   0    0    1    a
e2    0   0    1    1    a
e3    0   0    1    1    a
e4    0   1    1    0    b
e5    1   1    0    0    b

… but REFER considers positive against positive and negative against negative only.
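In code, the polarity rule just means running the reduction separately on the two sub-tables (a sketch; reduce_set stands for any single-table reduction, such as the pairwise elimination sketched earlier):

```python
def reduce_respecting_polarity(pos_features, neg_features, reduce_set):
    # Positive features compete only with positive ones and negated
    # features only with negated ones, so a positive feature can
    # survive even though a perfect negated feature would cover it.
    return reduce_set(pos_features) | reduce_set(neg_features)
```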