SLIDE 1

High Performance Computing and Networking Institute

National Research Council, Italy

The Data Reference Model:

Incremental Classification with Generalized Eigenvalues

Mario Rosario Guarracino

September 17, 2007

SLIDE 2

October 12, 2006 -- Department of Parallel Computing, PJWSTK, and Computer Architecture Group, IPIPAN

People@ICAR

Researchers: Mario Guarracino, Pasqua D'Ambra, Ivan De Falco, Ernesto Tarantino

Associates: Daniela di Serafino (SUN), Francesca Perla (UniParth), Gerardo Toraldo (UniNa)

Fellows: Davide Feminiano, Salvatore Cuciniello

Collaborators: Franco Giannessi (UniPi), Claudio Cifarelli (HP), Panos Pardalos and Onur Seref (UFL), Oleg Prokopyev (U. Pittsburgh), Giuseppe Trautteur (UniNa), Francesca Del Vecchio Blanco (SUN), Antonio Della Cioppa (UniSa)

Students: Danilo Abbate, Francesco Antropoli, Giovanni Attratto, Tony De Vivo, Alessandra Vocca

SLIDE 3

Agenda

– Generalized eigenvalue classification
– Purpose of incremental learning
– Subset selection algorithm
– Initial points selection
– Accuracy results
– More examples
– Conclusion and future work

SLIDE 4

Introduction

Supervised learning refers to the capability of a system to learn from examples (the training set).

The trained system is able to provide an answer (output) for each new question (input).

Supervised means the desired output for the training set is provided by an external teacher.

Binary classification is among the most successful methods for supervised learning.

SLIDE 5

Applications

Data produced in biomedical applications will increase exponentially in the coming years.

In genomic/proteomic applications, data are often updated, which poses problems for the training step.

Publicly available datasets contain gene expression data for tens of thousands of characteristics.

Current classification methods can over-fit the problem, providing models that do not generalize well.

SLIDE 6

Linear discriminant planes

[Figure: two linearly separable point sets A and B]

Consider a binary classification task with points in two linearly separable sets.

– There exists a plane that classifies all points in the two sets.

There are infinitely many planes that correctly classify the training data.

SLIDE 7

Support vector machines formulation

To construct the plane furthest from both sets, we examine the convex hull of each set.

The best plane bisects the closest points (support vectors) in the convex hulls.

[Figure: convex hulls of sets A and B, with closest points c and d]

SLIDE 8

Support vector machines dual formulation

The dual formulation, yielding the same solution, is to maximize the margin between support planes.

– Support planes leave all points of a class on one side.

Support planes are pushed apart until they "bump" into a small set of data points (support vectors).

[Figure: support planes for sets A and B]

SLIDE 9

Support Vector Machine features

Support Vector Machines are the state of the art among existing classification methods.

Their robustness is due to the strong foundations of statistical learning theory.

Training relies on the optimization of a quadratic convex cost function, for which many methods are available.

– Available software includes SVMlight and LIBSVM.

These techniques do not scale well with the size of the training set.

– Training on 50,000 examples yields a Hessian matrix with 2.5 billion elements, about 20 GB of RAM.
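The ~20 GB figure can be checked with one line of arithmetic (a sketch; dense float64 storage is assumed):

```python
# Rough check of the memory claim above: a dense double-precision
# Hessian for 50,000 training examples.
n = 50_000
elements = n * n              # 2.5 billion entries
gb = elements * 8 / 1e9       # 8 bytes per float64, in decimal GB
print(elements, gb)           # 2500000000 20.0
```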

SLIDE 10

A different approach

The problem can be restated as: find two hyperplanes, each the closest to one set and the furthest from the other.

The binary classification problem can then be solved as a generalized eigenvalue computation (GEC).

[Figure: sets A and B with their proximal planes]

O. L. Mangasarian and E. W. Wild, Multisurface Proximal Support Vector Classification via Generalized Eigenvalues, Data Mining Institute Tech. Rep. 04-03, June 2004.

SLIDE 11

GEC method

With G = [A -e]'[A -e] and H = [B -e]'[B -e], the previous equation becomes the Rayleigh quotient of the generalized eigenvalue problem:

Gx = λHx.
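A minimal numerical sketch of this step, on invented toy data rather than the paper's, using SciPy's generalized eigensolver; G and H are built as on the example slide, G = [A -e]'[A -e] and H = [B -e]'[B -e]:

```python
import numpy as np
from scipy.linalg import eig

# Toy two-class data (illustrative only); e is the vector of ones.
rng = np.random.default_rng(0)
A = rng.normal(0.0, 0.3, size=(20, 2))        # class A around the origin
B = rng.normal(2.0, 0.3, size=(20, 2))        # class B around (2, 2)

Ga = np.hstack([A, -np.ones((len(A), 1))])    # [A -e]
Hb = np.hstack([B, -np.ones((len(B), 1))])    # [B -e]
G, H = Ga.T @ Ga, Hb.T @ Hb

# Solve G x = lambda H x; each eigenvector x = [w; gamma] encodes a plane w'x = gamma.
vals, vecs = eig(G, H)
vals = vals.real
x_min = vecs[:, np.argmin(vals)].real   # plane closest to A, furthest from B
x_max = vecs[:, np.argmax(vals)].real   # plane closest to B, furthest from A
```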

SLIDE 12

GEC method

Conversely, the plane closest to B and furthest from A leads to the same eigenvectors as the previous problem, with reciprocal eigenvalues.

We only need to evaluate the eigenvectors related to the minimum and maximum eigenvalues of Gx = λHx.

SLIDE 13

GEC method

Let [w1 γ1] and [w2 γ2] be the eigenvectors associated with the minimum and maximum eigenvalues of Gx = λHx. Then:

– each a in A is closer to x'w1 - γ1 = 0 than to x'w2 - γ2 = 0,
– each b in B is closer to x'w2 - γ2 = 0 than to x'w1 - γ1 = 0.
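As an illustration of this decision rule, a small sketch that assigns points to the class whose plane is nearer; the two planes are taken from the worked example later in the deck, while the test points are invented:

```python
import numpy as np

# Assign each point to the class whose plane x'w - gamma = 0 is nearer.
def classify(X, w1, g1, w2, g2):
    d1 = np.abs(X @ w1 - g1) / np.linalg.norm(w1)   # distance to plane 1
    d2 = np.abs(X @ w2 - g2) / np.linalg.norm(w2)   # distance to plane 2
    return np.where(d1 <= d2, 1, -1)                # 1 -> class A, -1 -> class B

# Planes from the deck's worked example:
# x - 2 = 0 (w1 = [1, 0], gamma1 = 2) and x - y = 0 (w2 = [1, -1], gamma2 = 0).
X = np.array([[2.0, 0.0], [3.0, 3.1]])
labels = classify(X, np.array([1.0, 0.0]), 2.0, np.array([1.0, -1.0]), 0.0)
print(labels)   # [ 1 -1]
```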

SLIDE 14

Example

Set G = [A -e]'[A -e] and H = [B -e]'[B -e]. The minimum and maximum eigenvalues of Gx = λHx are λ1 = 0 and λ3, with corresponding eigenvectors x1 = [1 0 2] and x3 = [1 -1 0]. The resulting planes are x - 2 = 0 and x - y = 0.

SLIDE 15

Classification accuracy: linear kernel

Dataset          dim   train   SVM     GEPSVM   ReGEC
GalaxyBright      14   2462    98.30   98.60    98.24
PimaIndians        8    768    75.70   73.60    74.91
ClevelandHeart    13    297    83.60   81.80    86.05
NDC                7    300    89.00   86.70    87.60

Accuracy results using ten-fold cross validation.

SLIDE 16

Nonlinear case

When sets are not linearly separable, nonlinear discrimination is needed.

Data are nonlinearly transformed into another space to increase separability, and a linear discrimination is found in that space.

SLIDE 17

Nonlinear case

A standard technique is to transform points into a nonlinear space via kernel functions, such as the Gaussian kernel. Each element of the kernel matrix is

K(A, B)_ij = exp(-||A_i - B_j||^2 / σ),

where A_i and B_j are rows of A and B.

K. Bennett and O. Mangasarian, Robust Linear Programming Discrimination of Two Linearly Inseparable Sets, Optimization Methods and Software, 1, 23-34, 1992.
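The Gaussian kernel matrix can be sketched in a few lines (the division by σ follows the convention commonly used with GEPSVM; other texts put 2σ² in the denominator):

```python
import numpy as np

# Gaussian kernel matrix: K(A, B)_ij = exp(-||A_i - B_j||^2 / sigma).
def gaussian_kernel(A, B, sigma):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)  # pairwise squared distances
    return np.exp(-sq / sigma)

A = np.array([[0.0, 0.0], [1.0, 0.0]])
K = gaussian_kernel(A, A, sigma=1.0)
# Diagonal entries are exp(0) = 1; K[0, 1] = exp(-1).
```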

SLIDE 18

Nonlinear case

Using the Gaussian kernel, the GEC problem can be reformulated in order to evaluate the proximal surfaces; the associated GEC, however, is ill-posed.

SLIDE 19

ReGEC method

To regularize the problem, generate the two proximal surfaces by solving a perturbed problem, where K̃A and K̃B are the main diagonals of K(A,C) and K(B,C).

M. R. Guarracino, C. Cifarelli, O. Seref, P. M. Pardalos, A Classification Method based on Generalized Eigenvalue Problems, Optimization Methods and Software, 2007.

SLIDE 20

ReGEC algorithm

% Let A ∈ R^(m×s) and B ∈ R^(n×s) be the training points in each class.
% Choose appropriate δ and σ ∈ R.
C = [A; B];
% Build G and H matrices
g = [K(A, C, σ), -ones(m, 1)];
h = [K(B, C, σ), -ones(n, 1)];
G = g' * g;  H = h' * h;
% Regularize the problem
G* = G + δ diag(H);  H* = H + δ diag(G);
% Compute the hyperplanes V(:,1) and V(:,2)
[V, D] = eig(G*, H*);
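The pseudocode above can be transliterated to Python roughly as follows. This is a sketch, not the authors' implementation: diag(H) is read as the diagonal matrix of H's main diagonal, and the Gaussian kernel convention is an assumption.

```python
import numpy as np
from scipy.linalg import eig

def gaussian_kernel(X, C, sigma):
    sq = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / sigma)

def regec(A, B, delta=1e-4, sigma=1.0):
    C = np.vstack([A, B])                                   # C = [A; B]
    g = np.hstack([gaussian_kernel(A, C, sigma), -np.ones((len(A), 1))])
    h = np.hstack([gaussian_kernel(B, C, sigma), -np.ones((len(B), 1))])
    G, H = g.T @ g, h.T @ h                                 # G = g'g, H = h'h
    Gs = G + delta * np.diag(np.diag(H))                    # G* = G + delta diag(H)
    Hs = H + delta * np.diag(np.diag(G))                    # H* = H + delta diag(G)
    vals, vecs = eig(Gs, Hs)                                # G* x = lambda H* x
    vals = vals.real
    # The two hyperplanes: eigenvectors of the min and max eigenvalues.
    return vecs[:, np.argmin(vals)].real, vecs[:, np.argmax(vals)].real

rng = np.random.default_rng(1)
w_min, w_max = regec(rng.normal(0, 0.3, (10, 2)), rng.normal(2, 0.3, (10, 2)))
```

The regularization makes H* strictly positive definite, which is what removes the ill-posedness mentioned on the previous slide.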

SLIDE 21

Classification accuracy: gaussian kernel

Dataset        m    test   train   SVM     GEPSVM   ReGEC
Banana          2   4900    400    89.15   85.53    84.44
Titanic         3   2051    150    77.36   75.77    75.29
Flare-solar     9    400    666    65.80   59.63    58.23
Waveform       21   4600    400    90.21   87.70    88.56
Heart          13    100    170    83.05   81.43    82.06
Thyroid         5     75    140    95.20   92.71    92.76
German         20    300    700    75.66   69.36    70.26
Diabetis        8    300    468    76.21   74.75    74.56
Breast-cancer   9     77    200    73.49   71.73    73.40

Accuracy with ten random splits provided by the IDA repository.

SLIDE 22

Generalizability of the methods

The classification surfaces can be very tangled. Such models fit the original data well, but do not generalize to new data (over-fitting).

SLIDE 23

How to solve the problem?

SLIDE 24

Incremental classification

A possible solution is to find a small and robust subset of the training set that provides comparable accuracy results.

A smaller set of points reduces the probability of over-fitting the problem.

A kernel built from a smaller subset is computationally more efficient in predicting new points than kernels that use the entire training set.

As new points become available, the cost of retraining decreases if the influence of the new points is only evaluated with respect to the small subset.
SLIDE 25

I-ReGEC: Incremental learning

1: Γ0 = C \ C0
2: {M0, Acc0} = Classify(C; C0)
3: k = 1
4: while |Γk| > 0 do
5:   xk = argmax over x ∈ {Mk ∩ Γk-1} of dist(x, Pclass(x))
6:   {Mk, Acck} = Classify(C; Ck-1 ∪ {xk})
7:   if Acck > Acck-1 then
8:     Ck = Ck-1 ∪ {xk}
9:     k = k + 1
10:  end if
11:  Γk = Γk-1 \ {xk}
12: end while
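In Python, the loop can be sketched as below, with the ReGEC classifier and the distance to a point's own class plane abstracted into callables. All names here are illustrative, not from the authors' code, and the toy classifier at the end exists only to exercise the loop.

```python
# Schematic I-ReGEC loop.  `classify(C, Ck)` must return the misclassified
# points M and the accuracy of a classifier trained on subset Ck;
# `dist` stands in for dist(x, P_class(x)) in step 5.
def i_regec(C, C0, classify, dist):
    Ck = list(C0)
    Gamma = [x for x in C if x not in Ck]          # Gamma_0 = C \ C0
    M, acc = classify(C, Ck)
    while Gamma:
        cand = [x for x in M if x in Gamma] or Gamma
        xk = max(cand, key=dist)                   # furthest misclassified point
        M_new, acc_new = classify(C, Ck + [xk])
        if acc_new > acc:                          # keep xk only if accuracy improves
            Ck, M, acc = Ck + [xk], M_new, acc_new
        Gamma.remove(xk)                           # Gamma_k = Gamma_{k-1} \ {xk}
    return Ck, acc

# Toy run: a fake classifier whose accuracy grows with the subset size.
fake = lambda C, Ck: ([], min(1.0, 0.5 + 0.1 * len(Ck)))
subset, acc = i_regec(list(range(1, 7)), [1], fake, abs)
```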

SLIDE 29

I-ReGEC: Incremental ReGEC

ReGEC accuracy = 84.44; I-ReGEC accuracy = 85.49.

When the ReGEC algorithm is trained on all points, the surfaces are affected by noisy points (left).

I-ReGEC achieves clearly defined boundaries, preserving accuracy (right). Less than 5% of the points are needed for training!

SLIDE 30

Initial points selection

Unsupervised clustering techniques can be adapted to select the initial points.

We compare the classification obtained with k randomly selected starting points for each class against k points determined by the k-means method.

Results show higher classification accuracy and a more consistent representation of the training set when k-means is used instead of random selection.
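One way to realize this selection is to snap the k-means centroids of each class to their nearest training points. This is a sketch under assumptions: the slides do not prescribe a k-means variant, and `kmeans2` is just one convenient implementation choice.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Pick k initial points for one class: run k-means, then take the
# training point nearest to each centroid (so C0 consists of real points).
def initial_points(X, k, seed=0):
    centroids, _ = kmeans2(X, k, minit='++', seed=seed)
    idx = [int(np.argmin(((X - c) ** 2).sum(axis=1))) for c in centroids]
    return X[idx]

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
C0 = initial_points(X, k=5)   # five representative points from this class
```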

SLIDE 31

Initial points selection

Starting points Ci are chosen randomly (top) or by k-means (bottom). For each kernel produced by Ci, a set of evenly distributed points x is classified.

The procedure is repeated 100 times. Let yi ∈ {1, -1} be the classification based on Ci. Then ȳ = |Σ yi| estimates the probability that x is classified in one class.

random: acc = 84.5, std = 0.05; k-means: acc = 85.5, std = 0.01

SLIDE 32

Initial points selection

The same procedure on a second dataset:

random: acc = 72.1, std = 1.45; k-means: acc = 97.6, std = 0.04

SLIDE 33

Initial point selection

[Figure: classification accuracy vs. number of initial points]

Effect on classification accuracy of increasing the number of initial points chosen with k-means on the Chessboard dataset (higher is better).

The graph shows the classification accuracy versus the total number of initial points 2k from both classes.

This result empirically shows that there is a minimum k with which we reach high accuracy results.

SLIDE 34

Initial point selection

The bottom figure shows k versus the number of additional points included in the incremental dataset (lower is better).

[Figure: accuracy (top) and number of added points (bottom) vs. k]

SLIDE 35

Dataset reduction

Dataset       train   chunk   % of train
Flare-solar    666     9.67      1.45
Thyroid        140    12.40      8.85
WPBC            99    42.15      4.25
Votes          391    25.90      6.62
Bupa           310    15.28      4.92
Haberman       275     7.59      2.76
Diabetis       468    16.63      3.55
German         700    29.09      4.15
Banana         400    15.70      3.92

Experiments on real and synthetic datasets confirm the reduction of training data (I-ReGEC vs. ReGEC).

SLIDE 36

Accuracy results

Dataset       train    k   chunk   ReGEC acc   I-ReGEC acc   SVM acc
Flare-solar    666     3    9.67     58.23        65.11       65.80
Thyroid        140     5   12.40     92.76        94.01       95.20
WPBC            99     2   42.15     58.36        60.27       63.60
Votes          391    10   25.90     95.09        93.41       95.60
Bupa           310     4   15.28     59.03        63.94       69.90
Haberman       275     2    7.59     73.26        73.45       71.70
Diabetis       468     5   16.63     74.56        74.13       76.21
German         700     8   29.09     70.26        73.50       75.66
Banana         400     5   15.70     84.44        85.49       89.15

Classification accuracy with the incremental technique compares well with standard methods.
SLIDE 37

Positive results

Incremental learning, in conjunction with ReGEC, reduces the dimension of training sets.

Accuracy results do not deteriorate when fewer training points are selected.

Classification surfaces can be generalized.

SLIDE 38

Positive results

Incremental classification can enhance the accuracy results of different algorithms.

Dataset       T.r.a.c.e. acc   I-T.r.a.c.e. acc
Flare-Solar   60.23 (68.06)    65.81 (4.20)
Thyroid       94.77 (21.57)    94.55 (13.41)
WPBC          66.00 (129.35)   69.78 (23.56)
Votes         92.70 (60.69)    93.25 (15.12)
Bupa          65.80 (153.80)   66.21 (11.79)
Haberman      63.85 (129.22)   72.82 (11.14)
Diabetis      67.83 (185.60)   72.55 (9.85)
German        69.50 (268.04)   72.15 (34.11)
Banana        85.06 (129.35)   87.26 (23.56)

C. Cifarelli, L. Nieddu, O. Seref, P. M. Pardalos, K-T.R.A.C.E: A kernel k-means procedure for classification, COR 2007.

SLIDE 39

Ongoing research

Microarray technology can scan the expression levels of tens of thousands of genes to classify patients into different groups.

For example, it is possible to classify types of cancers with respect to the patterns of gene activity in the tumor cells.

Standard methods fail to derive the grouping of genes responsible for the classification.

SLIDE 40

Examples of microarray analysis

– Breast cancer: BRCA1 vs. BRCA2 and sporadic mutations, I. Hedenfalk et al., NEJM, 2001 (22 patients, 3226 genes)
– Prostate cancer: prediction of patient outcome after prostatectomy, D. Singh et al., Cancer Cell, 2002 (136 patients, 12600 genes)
– Malignant glioma survival: gene expression vs. histological classification, C. Nutt et al., Cancer Res., 2003 (50 patients, 12625 genes)
– Clinical outcome of breast cancer, L. van't Veer et al., Nature, 2002 (98 patients, 24188 genes)
– Recurrence of hepatocellular carcinoma after curative resection, N. Iizuka et al., Lancet, 2003 (60 patients, 7129 genes)
– Tumor vs. normal colon tissues, U. Alon et al., PNAS, 1999 (62 patients, 2000 genes)
– Acute myeloid vs. lymphoblastic leukemia, T. Golub et al., Science, 1999 (72 patients, 7129 genes)

SLIDE 41

Feature selection techniques

Standard methods (PCA, SVD, ICA, ...) need long and memory-intensive computations.

Statistical techniques (FDA, LDA, ...) are much faster, but can produce low accuracy results.

There is a need for hybrid techniques that can take advantage of both approaches.

SLIDE 42

ILDC-ReGEC

Simultaneous incremental learning and decremental characterization permit acquiring knowledge about gene grouping during the classification process.

This technique relies on standard statistical indexes (mean μ and standard deviation σ).

SLIDE 43

ILDC-ReGEC: Golub dataset

About 100 genes out of 7129 are responsible for discriminating

– Acute Myeloid Leukemia (AML), and
– Acute Lymphoblastic Leukemia (ALL).

The selected genes are in agreement with previous studies.

Less than 10 patients, out of 72, are needed for training.

– Classification accuracy: 96.86%

[Figure: ALL vs. AML expression patterns]

SLIDE 44

ILDC-ReGEC: Golub dataset

Different techniques agree on the misclassified patient!

[Figure: the misclassified patient]

SLIDE 45

Gene expression analysis

ILDC-ReGEC: incremental classification with feature selection for microarray datasets. Few patients and genes are selected as important for discrimination.

Dataset      patients x genes   chunk   % of train   genes    % of genes
Golub          72 x 7129         7.25     11.15       95.39      1.34
Alon           62 x 2000         5.43      9.70       32.43      1.62
Iizuka         60 x 7129        20.14     37.30      122.63      1.72
Vantveer       98 x 24188        8.10      9.31      474.35      1.96
Nutt           50 x 12625        8.29     18.42      211.66      1.68
Singh         136 x 12600        6.87      5.63      288.23      2.29
H-Sporadic     22 x 3226         6.80     34.00       57.15      1.77
H-BRCA2        22 x 3226         4.28     21.40       56.48      1.75
H-BRCA1        22 x 3226         6.11     30.55       49.85      1.55

SLIDE 46

ILDC-ReGEC: gene expression analysis

Accuracy per dataset. Method labels as given in the source (left to right): ILDC-ReGEC, KUPCA FDA, KUPCA FDA, LSPCA FDA, LUPCA FDA, SPCA FDA, UPCA FDA, KLS SVM, LLS SVM.

Golub (72 x 7129):       96.86 96.86 88.10 92.06 90.08 94.44 93.25 93.25 93.65 96.83
Alon (62 x 2000):        83.50 81.75 90.87 84.52 90.08 89.68 90.08 82.14 91.27 91.27
Iizuka (60 x 7129):      69.00 69.00 n.a. n.a. 61.90 66.67 n.a. n.a. 61.90 67.10
Vantveer (98 x 24188):   68.00 68.00 n.a. n.a. 64.57 65.33 n.a. n.a. 66.86 66.86
Nutt (50 x 12625):       76.60 76.60 n.a. n.a. 67.46 67.46 n.a. n.a. 74.60 72.22
Singh (136 x 12600):     77.86 n.a. n.a. 84.85 88.74 n.a. n.a. 90.48 91.20 91.20
H-Sporadic (22 x 3226):  77.00 69.05 69.05 79.76 79.76 70.24 75.00 69.05 78.57 73.81
H-BRCA2 (22 x 3226):     85.00 85.00 63.10 64.29 72.62 69.05 79.76 72.62 77.38 84.52
H-BRCA1 (22 x 3226):     80.00 80.00 52.38 66.67 69.05 76.19 75.00 77.38 72.62 75.00

SLIDE 47

Research directions

Is it possible to find an optimal strategy for subset selection?

– How far (in accuracy and computational complexity) is it from the proposed incremental one?

Is it possible to provide prior knowledge, in generalized eigenvalue classification, analytically rather than with training points?

Can linear algebra algorithms for large sparse matrices enhance the algorithm's performance?

SLIDE 48

Conclusions

Generalized eigenvalue classification is a competitive method.

Incremental learning reduces redundancy in training sets and can help to avoid over-fitting.

The subset selection algorithm provides a constructive way to reduce complexity in kernel-based classification algorithms.

The initial points selection strategy can help in finding regions where knowledge is missing.

I-ReGEC can be a starting point to explore very large problems.

SLIDE 49

Incremental Classification with Generalized Eigenvalues

Mario Rosario Guarracino