

SLIDE 1

High Performance Computing and Networking Institute

National Research Council, Italy

The Data Reference Model:

A constructive approach to incremental learning

Mario Rosario Guarracino

Workshop on Data Mining and Mathematical Programming

October 12, 2006

SLIDE 2

Acknowledgements

  • prof. Franco Giannessi – U. of Pisa
  • prof. Panos Pardalos – CAO UFL
  • Onur Seref – CAO UFL
  • Claudio Cifarelli – U. of Rome La Sapienza

SLIDE 3

Agenda

  • Generalized eigenvalue classification
  • Purpose of incremental learning
  • Subset selection algorithm
  • Initial points selection
  • Accuracy results
  • Conclusion and future work

SLIDE 4

Introduction

Supervised learning refers to the capability of a system to learn from examples (the training set).

The trained system is able to provide an answer (output) for each new question (input).

Supervised means that the desired output for the training set is provided by an external teacher.

Binary classification is among the most successful methods for supervised learning.

SLIDE 5

Applications

Many applications in biology and medicine:

  • Tissues that are prone to cancer can be detected with high accuracy.
  • New DNA sequences or proteins can be tracked down to their origins.
  • Identification of new genes or isoforms of gene expression in large datasets.
  • Analysis and reduction of data dimensionality and principal characteristics for drug design.

SLIDE 6

Peculiarity of the problem

Data produced in biomedical applications will increase exponentially in the coming years.

In genomic/proteomic applications, data are often updated, which poses problems for the training step.

Publicly available datasets contain gene expression data with tens of thousands of characteristics.

Current classification methods can over-fit the problem, providing models that do not generalize well.

SLIDE 7

Linear discriminant planes

[Figure: two linearly separable sets of points, A and B]

Consider a binary classification task with points in two linearly separable sets.

  – There exists a plane that classifies all points in the two sets.

There are infinitely many planes that correctly classify the training data.

SLIDE 8

Best plane

To construct the plane “furthest” from both classes, we examine the convex hull of each set.

The best plane bisects the closest points in the convex hulls.

[Figure: point sets A and B with the closest convex-hull points c and d]

SLIDE 9

SVM classification

A different approach, yielding the same solution, is to maximize the margin between support planes.

  – Support planes leave all points of a class on one side.

Support planes are pushed apart until they “bump” into a small set of data points (support vectors).

[Figure: point sets A and B with support planes and the separating plane]

SLIDE 10

SVM classification

Support Vector Machines are the state of the art among existing classification methods.

Their robustness is due to the strong foundations of statistical learning theory.

Training relies on the optimization of a convex quadratic cost function, for which many methods are available.

  – Available software includes SVMlight and LIBSVM.

These techniques can be extended to nonlinear discrimination by embedding the data in a nonlinear space using kernel functions.
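For illustration only (this is not part of the original slides), a minimal nonlinear SVM run with scikit-learn on a synthetic two-class dataset:

```python
# Minimal illustration of nonlinear SVM classification (not from the original slides).
# Assumes scikit-learn is available; the dataset here is synthetic.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The RBF (Gaussian) kernel embeds the data nonlinearly, as described above.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```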

SLIDE 11

A different religion

Mangasarian and Wild (2004) showed that the binary classification problem can be formulated as a generalized eigenvalue problem (GEPSVM).

Find the plane x'w1 = γ1 that is closest to A and farthest from B:
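The equation itself did not survive extraction; the GEPSVM objective from the cited report, reconstructed here as a sketch (e denotes a vector of ones), is:

```latex
% GEPSVM objective (reconstruction of the missing slide equation).
\min_{(w,\gamma)\neq 0} \; \frac{\|A w - e\gamma\|^{2}}{\|B w - e\gamma\|^{2}}
```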

[Figure: point sets A and B with their proximal planes]

  • O. L. Mangasarian and E. W. Wild, Multisurface Proximal Support Vector Classification via Generalized Eigenvalues. Data Mining Institute Tech. Rep. 04-03, June 2004.

SLIDE 12

GEP technique

With the substitution sketched below, the previous equation becomes a Rayleigh quotient, whose stationary points solve the generalized eigenvalue problem

Gx = λHx.
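A sketch of the standard change of variables behind this slide (the slide's own definitions were lost in extraction):

```latex
% Standard GEPSVM change of variables (reconstruction).
G = [A \;\; -e]^{\top}[A \;\; -e], \qquad
H = [B \;\; -e]^{\top}[B \;\; -e], \qquad
x = \begin{bmatrix} w \\ \gamma \end{bmatrix},
% so the objective becomes the Rayleigh quotient
\min_{x \neq 0}\; \frac{x^{\top} G x}{x^{\top} H x},
% whose stationary points satisfy  G x = \lambda H x.
```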

SLIDE 13

GEP technique

Conversely, to find the plane closest to B and farthest from A, we solve the reciprocal problem, which has the same eigenvectors as the previous problem and reciprocal eigenvalues. We therefore only need to evaluate the eigenvectors related to the minimum and maximum eigenvalues of Gx = λHx.
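Written out (a reconstruction, since the slide's equation is missing), the reciprocal problem is:

```latex
% Reciprocal problem (reconstruction): swap numerator and denominator.
\min_{x \neq 0}\; \frac{x^{\top} H x}{x^{\top} G x}
\quad\Longrightarrow\quad
H x = \mu G x, \qquad \mu = 1/\lambda .
```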

SLIDE 14

GEP technique

Let [w1 γ1] and [wm γm] be the eigenvectors associated with the minimum and maximum eigenvalues of Gx = λHx. Then:

  a) A is closer to x'w1 − γ1 = 0 than to x'wm − γm = 0,
  b) B is closer to x'wm − γm = 0 than to x'w1 − γ1 = 0.
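The assignment rule this implies, written out as a sketch (not a verbatim slide formula):

```latex
% Each new point is assigned to the class of the nearer proximal plane (reconstruction).
\mathrm{class}(x) \;=\;
\operatorname*{arg\,min}_{i \in \{1,\,m\}} \; \frac{|x^{\top} w_i - \gamma_i|}{\|w_i\|}
```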

SLIDE 15

Regularization

A and B can be rank-deficient. G and H are always rank-deficient: each is the product of matrices of dimension (n+1) × n, so its rank is at most n and 0 is an eigenvalue.

Do we need to regularize the problem to obtain a well-posed problem?

SLIDE 16

A useful theorem

Consider the GEP Gx = λHx and the transformed problem G*x = λ*H*x, defined for each choice of scalars τ1, τ2, δ1 and δ2 such that the associated 2 × 2 matrix is nonsingular (see the sketch below). Then G*x = λ*H*x and Gx = λHx have the same eigenvectors.
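One consistent way to write the transformation (the slide's own definition did not survive extraction, so the placement and signs of the scalars here are an assumption):

```latex
% A consistent reconstruction of the transformed pencil.
G^{*} = \tau_1 G + \delta_1 H, \qquad
H^{*} = \tau_2 H + \delta_2 G, \qquad
\Omega = \begin{bmatrix} \tau_1 & \delta_1 \\ \delta_2 & \tau_2 \end{bmatrix} \ \text{nonsingular}.
% If Gx = \lambda Hx, then G^{*}x = (\tau_1\lambda+\delta_1)Hx and H^{*}x = (\delta_2\lambda+\tau_2)Hx,
% so x is also an eigenvector of G^{*}x = \lambda^{*}H^{*}x with
% \lambda^{*} = (\tau_1\lambda+\delta_1)/(\delta_2\lambda+\tau_2);
% a nonsingular \Omega keeps this map between the two spectra invertible.
```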

SLIDE 17

Linear case

In the linear case, the theorem can be applied. For τ1 = τ2 = 1 and δ1 = δ2 = δ, the transformed problem is sketched below.

As long as δ ≠ 1, the matrix Ω is nonsingular.

In practice, each class of the training set must contain a number of linearly independent points equal to the number of features.

  – prob( Ker(G) ∩ Ker(H) ≠ 0 ) = 0
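Under the sign convention assumed above, with τ1 = τ2 = 1 and δ1 = δ2 = δ the transformed problem would read (a sketch, not the slide's verbatim formula):

```latex
% Linear-case instance of the transformation (assumed signs).
(G + \delta H)\, x = \lambda\, (H + \delta G)\, x
```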

SLIDE 18

Classification accuracy: linear kernel

Dataset          train   dim   SVM     GEPSVM   ReGEC
GalaxyBright     2462    14    98.30   98.60    98.24
PimaIndians      768     8     75.70   73.60    74.91
ClevelandHeart   297     13    83.60   81.80    86.05
NDC              300     7     89.00   86.70    87.60

Accuracy results have been obtained using ten-fold cross validation.

SLIDE 19

Nonlinear case

A standard technique to obtain greater separability between sets is to embed the points into a nonlinear space via kernel functions, such as the Gaussian kernel. Each element of the kernel matrix is sketched below.
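The standard Gaussian kernel entry, reconstructed here (the exact width convention σ used on the slide is an assumption):

```latex
% Gaussian (RBF) kernel entry between rows A_i and C_j (reconstruction).
K(A, C)_{ij} = \exp\!\left( -\,\frac{\|A_i - C_j\|^{2}}{\sigma} \right)
```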

SLIDE 20

Nonlinear case

Using a Gaussian kernel, the problem becomes the one sketched below, producing the proximal surfaces. The associated GEP involves matrices of the order of the training set and rank at most the number of features.
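A sketch of the kernelized problem and of the resulting proximal surfaces, following the nonlinear GEPSVM formulation (C denotes the full training set; this is a reconstruction, not the slide's verbatim equations):

```latex
% Kernelized GEPSVM (reconstruction).
\min_{(u,\gamma)\neq 0} \;
\frac{\|K(A,C)\,u - e\gamma\|^{2}}{\|K(B,C)\,u - e\gamma\|^{2}},
% producing the proximal surfaces
K(x^{\top}\!,C)\,u_1 - \gamma_1 = 0, \qquad K(x^{\top}\!,C)\,u_m - \gamma_m = 0 .
```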

SLIDE 21

ReGEC

The matrices are deeply rank-deficient and the problem is ill-posed.

We propose to generate the two proximal surfaces by solving the regularized problem sketched below, where K̃A and K̃B are the main diagonals of K(A,C) and K(B,C).
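A hedged reconstruction of the regularized ratio: the placement of the diagonal terms mirrors the linear-case transformation and may differ from the original slide.

```latex
% ReGEC regularization (reconstruction; \tilde K_A and \tilde K_B are diagonal matrices
% holding the main diagonals of K(A,C) and K(B,C), and \delta > 0 is a parameter).
\min_{(u,\gamma)\neq 0} \;
\frac{\|K(A,C)\,u - e\gamma\|^{2} + \delta\,\|\tilde K_B\,u - e\gamma\|^{2}}
     {\|K(B,C)\,u - e\gamma\|^{2} + \delta\,\|\tilde K_A\,u - e\gamma\|^{2}}
```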

SLIDE 22

Classification accuracy: gaussian kernel

Dataset         train   test   m    SVM     GEPSVM   ReGEC
Banana          400     4900   2    89.15   85.53    84.44
Titanic         150     2051   3    77.36   75.77    75.29
Flare-solar     666     400    9    65.80   59.63    58.23
Waveform        400     4600   21   90.21   87.70    88.56
Heart           170     100    13   83.05   81.43    82.06
Thyroid         140     75     5    95.20   92.71    92.76
German          700     300    20   75.66   69.36    70.26
Diabetis        468     300    8    76.21   74.75    74.56
Breast-cancer   200     77     9    73.49   71.73    73.40

Accuracy with ten random splits provided by the IDA repository.

SLIDE 23

Methods generalization

The classification surfaces are very tangled. Such models are good on the original data, but do not generalize well to new data (over-fitting).

SLIDE 24

How to solve the problem?

SLIDE 25

Incremental classification

A possible solution is to find a small and robust subset of the training set that provides comparable accuracy results.

A smaller set of points reduces the probability of over-fitting the problem.

A kernel built from a smaller subset is computationally more efficient in predicting new points, compared to kernels that use the entire training set.

As new points become available, the cost of retraining the algorithm decreases if the influence of the new points is only evaluated against the small subset.
SLIDE 26

Incremental learning algorithm

1:  Γ0 = C \ C0
2:  {M0, Acc0} = Classify(C; C0)
3:  k = 1
4:  while |Γk| > 0 do
5:      xk = arg max { dist(x, Pclass(x)) : x ∈ Mk-1 ∩ Γk-1 }
6:      {Mk, Acck} = Classify(C; Ck-1 ∪ {xk})
7:      if Acck > Acck-1 then
8:          Ck = Ck-1 ∪ {xk}
9:          k = k + 1
10:     end if
11:     Γk = Γk-1 \ {xk}
12: end while
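A minimal Python sketch of this selection loop, for illustration only: it is not the authors' implementation. Here classify is a hypothetical callback that trains the classifier on a subset and returns the misclassified points (M), the accuracy on C (Acc), and a function giving each point's distance from the proximal surface of its own class; points are assumed hashable (e.g. tuples).

```python
# Illustrative sketch of the incremental subset-selection loop (not the authors' code).
# classify(C, subset) is a hypothetical callback: it trains on `subset` and returns
# (misclassified, accuracy, dist), where dist(x) is the distance of point x from the
# proximal surface of its own class.

def incremental_selection(C, C0, classify):
    gamma = [x for x in C if x not in C0]        # candidate points, Γ0 = C \ C0
    misclassified, acc, dist = classify(C, C0)   # initial model on the seed subset
    subset = list(C0)
    while gamma:
        # pick the remaining misclassified point farthest from its class surface
        candidates = [x for x in gamma if x in misclassified]
        if not candidates:
            break                                # nothing left to drive the selection
        xk = max(candidates, key=dist)
        new_mis, new_acc, new_dist = classify(C, subset + [xk])
        if new_acc > acc:                        # keep xk only if accuracy improves
            subset.append(xk)
            misclassified, acc, dist = new_mis, new_acc, new_dist
        gamma.remove(xk)                         # xk is never considered again
    return subset
```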

SLIDE 27

I-ReGEC: Incremental ReGEC

[Figure: left: ReGEC, accuracy = 84.44; right: I-ReGEC, accuracy = 85.49]

When the ReGEC algorithm is trained on all points, the surfaces are affected by noisy points (left).

I-ReGEC achieves clearly defined boundaries while preserving accuracy (right). Less than 5% of the points are needed for training!

SLIDE 28

Initial points selection

Unsupervised clustering techniques can be adapted to select the initial points.

We compare the classification obtained with k randomly selected starting points for each class against k points per class determined by the k-means method (see the sketch after this slide).

Results show higher classification accuracy and a more consistent representation of the training set when the k-means method is used instead of random selection.
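An illustrative way to pick k seed points per class with k-means (assumes scikit-learn and NumPy; mapping each cluster center back to the nearest actual training point is one reasonable reading of the slide, not necessarily the authors' exact procedure):

```python
# Illustrative k-means seeding of the initial set C0 (one point per cluster, per class).
# This is a sketch, not the authors' implementation.
import numpy as np
from sklearn.cluster import KMeans

def kmeans_seed(X, y, k, random_state=0):
    """Return indices of k training points per class, nearest to the k-means centers."""
    seed_idx = []
    for label in np.unique(y):
        idx = np.where(y == label)[0]
        km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X[idx])
        for center in km.cluster_centers_:
            # nearest real training point to each cluster center
            nearest = idx[np.argmin(np.linalg.norm(X[idx] - center, axis=1))]
            seed_idx.append(nearest)
    return np.array(seed_idx)
```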

SLIDE 29

Initial points selection

Starting points Ci are chosen randomly (top) or by k-means (bottom). For each kernel produced by Ci, a set of evenly distributed points x is classified.

The procedure is repeated 100 times. Let yi ∈ {1, −1} be the classification based on Ci. Then y = |Σi yi| / 100 estimates the probability that x is classified in one class.

[Figure: random: acc = 84.5, std = 0.05; k-means: acc = 85.5, std = 0.01]

SLIDE 30

Initial points selection

Starting points Ci are chosen randomly (top) or by k-means (bottom). For each kernel produced by Ci, a set of evenly distributed points x is classified.

The procedure is repeated 100 times. Let yi ∈ {1, −1} be the classification based on Ci. Then y = |Σi yi| / 100 estimates the probability that x is classified in one class.

[Figure: random: acc = 72.1, std = 1.45; k-means: acc = 97.6, std = 0.04]

SLIDE 31

[Figure: classification accuracy (0.5–1.0) vs. total number of initial points (10–100)]

Initial point selection

Effect of increasing the number of initial points k, chosen with k-means, on the Chessboard dataset.

The graph shows the classification accuracy versus the total number of initial points 2k from both classes.

This result empirically shows that there is a minimum k with which high accuracy results are reached.

SLIDE 32

Initial point selection

The bottom figure shows k versus the number of additional points included in the incremental dataset.

[Figure: top: accuracy (0.5–1.0) vs. number of initial points (10–100); bottom: number of additional points (2–12) vs. number of initial points (10–100)]

SLIDE 33

Dataset reduction

Dataset       I-ReGEC chunk   % of train
Flare-solar   9.67            1.45
Thyroid       12.40           8.85
WPBC          4.215           4.25
Votes         25.90           6.62
Bupa          15.28           4.92
Haberman      7.59            2.76
Diabetis      16.63           3.55
German        29.09           4.15
Banana        15.70           3.92

SLIDE 34

Accuracy results

Dataset       train   SVM acc   ReGEC acc   I-ReGEC acc   k    chunk
Flare-solar   666     65.80     58.23       65.11         3    9.67
Thyroid       140     95.20     92.76       94.01         5    12.40
WPBC          99      63.60     58.36       60.27         2    42.15
Votes         391     95.60     95.09       93.41         10   25.90
Bupa          310     69.90     59.03       63.94         4    15.28
Haberman      275     71.70     73.26       73.45         2    7.59
Diabetis      468     76.21     74.56       74.13         5    16.63
German        700     75.66     70.26       73.50         8    29.09
Banana        400     89.15     84.44       85.49         5    15.70

SLIDE 35

Positive results

Incremental learning, in conjunction with ReGEC, reduces the dimension of training sets.

Accuracy results do not deteriorate when fewer training points are selected.

Classification surfaces can be generalized.

SLIDE 36

Positive results

Incremental classification can be applied to different algorithms and still enhances accuracy results.

Dataset       I-T.r.a.c.e. acc   T.r.a.c.e. acc
Flare-Solar   65.81 (4.20)       60.23 (68.06)
Thyroid       94.55 (13.41)      94.77 (21.57)
WPBC          69.78 (23.56)      66.00 (129.35)
Votes         93.25 (15.12)      92.70 (60.69)
Bupa          66.21 (11.79)      65.80 (153.80)
Haberman      72.82 (11.14)      63.85 (129.22)
Diabetis      72.55 (9.85)       67.83 (185.60)
German        72.15 (34.11)      69.50 (268.04)
Banana        87.26 (23.56)      85.06 (129.35)

courtesy of Claudio Cifarelli

SLIDE 37

Not so positive results

There are points in the training set that are not chosen by the method but increase accuracy.

Block selection does not give any improvement.

SLIDE 38

Work in progress

Incremental classification with feature selection for microarray datasets.

Dataset      size          chunk   % of train   features   % of features
Golub        72 x 7129     7.25    11.15        95.39      1.34
Alon         62 x 2000     5.43    9.70         32.43      1.62
Iizuka       60 x 7129     20.14   37.30        122.63     1.72
Vantveer     98 x 24188    8.10    9.31         474.35     1.96
Nutt         50 x 12625    8.29    18.42        211.66     1.68
Singh        136 x 12600   6.87    5.63         288.23     2.29
H-Sporadic   22 x 3226     6.80    34.00        57.15      1.77
H-BRCA2      22 x 3226     4.28    21.40        56.48      1.75
H-BRCA1      22 x 3226     6.11    30.55        49.85      1.55

SLIDE 39

Work in progress

Dataset      size          I-ReGEC   K-U PCA FDA   K-S PCA FDA   L-S PCA FDA   L-U PCA FDA   S-PCA FDA   U-PCA FDA   K-LS SVM   L-LS SVM
Golub        72 x 7129     96.86     88.10         92.06         90.08         94.44         93.25       93.25       93.65      96.83
Alon         62 x 2000     83.50     81.75         90.87         84.52         90.08         89.68       90.08       82.14      91.27
Iizuka       60 x 7129     69.00     n.a.          n.a.          61.90         66.67         n.a.        n.a.        61.90      67.10
Vantveer     98 x 24188    68.00     n.a.          n.a.          64.57         65.33         n.a.        n.a.        66.86      66.86
Nutt         50 x 12625    76.60     n.a.          n.a.          67.46         67.46         n.a.        n.a.        74.60      72.22
Singh        136 x 12600   77.86     n.a.          n.a.          84.85         88.74         n.a.        n.a.        90.48      91.20
H-Sporadic   22 x 3226     77.00     69.05         69.05         79.76         70.24         75.00       69.05       78.57      73.81
H-BRCA2      22 x 3226     85.00     63.10         64.29         72.62         69.05         79.76       72.62      77.38       84.52
H-BRCA1      22 x 3226     80.00     52.38         66.67         69.05         76.19         75.00       77.38      72.62       75.00

L = linear, K = RBF, U = unsupervised, S = supervised.  http://www.esat.kuleuven.be/MACBETH/

SLIDE 40

Conclusions

Generalized eigenvalue classification is a competitive classification method.

Incremental learning reduces redundancy in training sets and can help to avoid over-fitting.

The subset selection algorithm provides a constructive way to reduce complexity in kernel-based classification algorithms.

The initial points selection strategy can help in finding regions where knowledge is missing.

I-ReGEC can be a starting point to explore very large problems.

SLIDE 41

Questions?

High Performance Computing and Networking Institute

National Research Council, Italy

The Data Reference Model:

A constructive approach to incremental learning

Mario.Guarracino@icar.cnr.it

October 12, 2006