SLIDE 1

Classification Complexity Measures

and Their Relationship to Feature Selection

  • J. L. Solka and D. A. Johannsen

solkajl@nswc.navy.mil;johannsen@nswc.navy.mil

NSWCDD

SLIDE 2

Agenda

Minimal spanning tree complexity measures. An artificial nose data set. A gene expression data set. Wrap-up and conclusions.

SLIDE 3

Acknowledgments

This work was supported by the Mathematical, Computer, and Information Sciences Division of the Office of Naval Research. We wish to acknowledge helpful discussions with Dr. David Marchette of NSWCDD and Dr. Carey Priebe of JHU.

SLIDE 4

Classifier Complexity Measures

Single nearest neighbor cross-validated classification performance. Graph theoretic measures. Minimal spanning tree (MST) measures. Class cover catch digraph measures.
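
For reference, a minimal sketch of the cross-validated single-nearest-neighbor measure (assuming scikit-learn is available; the function name and defaults are illustrative, not from the talk):

```python
# Hypothetical sketch: leave-one-out cross-validated 1-NN accuracy.
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def loo_1nn_accuracy(X, y, p=2):
    """Leave-one-out accuracy of a single-nearest-neighbor classifier under a Minkowski-p metric."""
    clf = KNeighborsClassifier(n_neighbors=1, metric="minkowski", p=p)
    return cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()
```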

SLIDE 5

MST Methodology

The cost of an edge of the graph is the Euclidean distance between its two vertices. A spanning tree is a tree that covers all vertices of the graph. A minimal spanning tree is a spanning tree of the graph such that the sum of all of the edge costs is minimal.

SLIDE 6

MST Classifier Complexity Algorithm

Compute the MST of all of the observations. Count the number of edges whose endpoints belong to different classes.
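
A minimal sketch of this algorithm using SciPy's minimum spanning tree (normalizing by the number of MST edges is an assumption; function names are illustrative):

```python
# Hypothetical sketch: MST-based classifier complexity measure.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def mst_complexity(X, y):
    """Fraction of MST edges whose endpoints belong to different classes."""
    y = np.asarray(y)
    D = squareform(pdist(X))                  # Euclidean edge costs between all observations
    mst = minimum_spanning_tree(D).tocoo()    # MST over the pooled sample
    cross = np.sum(y[mst.row] != y[mst.col])  # count inter-class edges
    return cross / mst.nnz                    # normalize by the number of MST edges
```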

SLIDE 7

MST Example 1

(Figure: Minimum spanning tree inter-class edges for two bivariate normal samples.)

SLIDE 8

MST Example 2

(Figure: Minimum spanning tree inter-class edges for two bivariate normal samples.)

SLIDE 9

Treatise

Is there a correspondence between nearest neighbor cross-validated performance and the MST complexity measure? Can one use the MST complexity measure as a surrogate for nearest neighbor classifier performance during classifier optimization? What is the effect of the Minkowski p parameter choice on classifier performance? Can the Minkowski p parameter and feature selection be simultaneously optimized based on some measure of classifier performance?

SLIDE 10

Artificial Nose Data Set

19 fibers x 2 wavelengths x 60 samples per time period = 2280-dimensional data. The application of interest was detection of the ground water contaminant trichloroethylene among various confusers.

SLIDE 11

Artificial Nose Minkowski Pseudo-Metric

The dissimilarity between two observations $x$ and $y$ is given by

$d_p(x, y) = \left( \sum_{k=1}^{d} |x(k) - y(k)|^p \right)^{1/p}$   (1)
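
A minimal sketch of a Minkowski-p computation with optional per-feature weights (the weights are an assumption, added to match the weight optimization discussed later):

```python
# Hypothetical sketch: (optionally weighted) Minkowski-p pseudo-metric.
import numpy as np

def minkowski_p(x, y, p=2.0, w=None):
    """Minkowski-p distance; w is an optional vector of per-feature weights."""
    diff = np.abs(np.asarray(x, float) - np.asarray(y, float))
    if w is not None:
        diff = np.asarray(w, float) * diff
    return float(np.sum(diff ** p) ** (1.0 / p))
```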

SLIDE 12

Scatter Overview

Appears in conjunction with Fisher's linear discriminant. Select dimensions in which the classes are well separated: the class means are well separated and each class has small within-class variance.

SLIDE 13

Scatter Computation

Two classes, $C_1$ and $C_2$, with $n_1$ and $n_2$ members (respectively).

Class means: $\mu_i = \frac{1}{n_i} \sum_{x \in C_i} x$

Class scatter matrices: $S_i = \sum_{x \in C_i} (x - \mu_i)(x - \mu_i)^T$

SLIDE 14

Scatter Computation

Within-class scatter: $S_W = S_1 + S_2$

Between-class scatter: $S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$
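
A minimal sketch of these scatter computations (NumPy; names are illustrative):

```python
# Hypothetical sketch: within-class and between-class scatter for two classes.
import numpy as np

def scatter_matrices(X, y):
    """Return (S_W, S_B) for a two-class data set X with labels y in {0, 1}."""
    X, y = np.asarray(X, float), np.asarray(y)
    mus, S_w = [], np.zeros((X.shape[1], X.shape[1]))
    for c in (0, 1):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        mus.append(mu)
        S_w += (Xc - mu).T @ (Xc - mu)   # class scatter matrix, summed into S_W
    d = (mus[0] - mus[1])[:, None]
    return S_w, d @ d.T                  # S_B is the outer product of the mean difference
```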

SLIDE 15

Scatter-based Feature Selection

Univariate case: $S_B$ and $S_W$ are scalars. "Good" dimensions are those with a large value of $S_B / S_W$.

Multivariate case (say, $d$-dimensional): $S_B$ and $S_W$ are $d \times d$ matrices, so $S_B / S_W$ is no longer appropriate. We chose $\mathrm{tr}(S_B) / \mathrm{tr}(S_W)$.
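
A minimal sketch of ranking individual features by the univariate scatter ratio (an assumed reading of the selection rule above):

```python
# Hypothetical sketch: rank individual features by the univariate ratio S_B / S_W.
import numpy as np

def scatter_feature_ranking(X, y):
    """Return feature indices sorted best-to-worst by between/within scatter ratio."""
    X, y = np.asarray(X, float), np.asarray(y)
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    s_b = (mu0 - mu1) ** 2
    s_w = ((X0 - mu0) ** 2).sum(axis=0) + ((X1 - mu1) ** 2).sum(axis=0)
    return np.argsort(-(s_b / s_w))

# e.g. keep the 21 best fibers: selected = scatter_feature_ranking(X, y)[:21]
```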

SLIDE 16

Performance and Complexity for the Nonsmoothed Nose Data

(Figure: Performance and complexity as a function of the Minkowski p parameter for the (nonsmoothed) nose data: average complexity (over cross-validation) and average performance (over cross-validation) vs. p; p = 5 is marked.)

SLIDE 17

Performance as a Function of Scatter Selected Fibers for the Nonsmoothed Nose Data at p=5

(Figure: Performance as a function of scatter-selected fibers at p = 5: average performance over the cross-validation vs. number of fibers; marked point (21, 0.78125).)

SLIDE 18

Performance and Complexity for the Smoothed Nose Data

(Figure: Performance and complexity as a function of the Minkowski p parameter for the smoothed nose data: average complexity (over cross-validation) and average performance (over cross-validation) vs. p; marked point (29, 0.85625).)

SLIDE 19

Performance as a Function of Scatter Selected Fibers for the Smoothed Nose Data at p=29

(Figure: Average performance over the cross-validation vs. number of fibers; marked point (37, 0.85).)

SLIDE 20

The Golub Gene Data

72 patients (observations) in 7129 dimensions (genes). ALL and AML leukemia patients; ALL has T-cell and B-cell variants.

SLIDE 21

Performance and Complexity for the Golub Gene Data

(Figure: Performance and complexity as a function of the Minkowski p value: average complexity (over cross-validation) and average performance (over cross-validation) vs. p; optimal performance at p = 4.)

SLIDE 22

Performance as a Function of the Number of Genes for the Golub Gene Data at p=4

(Figure: Performance as a function of scatter-selected genes at p = 4: average performance over the cross-validation vs. number of genes; marked point (372, 0.84722).)

SLIDE 23

The Normalized Golub Data

Only retain those genes whose expression level is 20 or greater across all patients. Consider the resulting genes-by-patients data matrix: divide each column by its mean, then subject each row to a standard normalizing transformation. This reduces the dimensionality to roughly 1753 genes.
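
A minimal sketch of this preprocessing (the exact thresholding and normalization details are assumptions based on the steps above):

```python
# Hypothetical sketch: prune and normalize a genes-by-patients expression matrix.
import numpy as np

def normalize_expression(expr):
    """expr: genes x patients array; returns the pruned, normalized matrix."""
    expr = np.asarray(expr, float)
    keep = (expr >= 20).all(axis=1)        # genes expressed at 20 or more in every patient
    X = expr[keep]
    X = X / X.mean(axis=0, keepdims=True)  # divide each patient column by its mean
    X = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)  # standardize each gene row
    return X
```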

SLIDE 24

Performance and Complexity for the Reduced Golub Gene Data

(Figure: Complexity and performance vs. Minkowski p for the pruned leukemia data: average complexity (over cross-validation) and average performance (over cross-validation) vs. p; marked point (4, 0.8472).)

SLIDE 25

Performance as a Function of the Number of Genes for the Reduced Golub Gene Data at p=4

(Figure: Performance as a function of scatter-selected genes at p = 4: average performance over the cross-validation vs. number of genes; marked point (1698, 0.86111).)

SLIDE 26

Simultaneous Optimization of Parameters

Huge dimensionality prevents a classical optimization approach. Subsampling is used in order to reduce the computational complexity. Stochastic optimization through the simultaneous perturbation method of Spall. Sensitivities to the formulation of an appropriate optimization criterion.

SLIDE 27

The Simultaneous Perturbation Stochastic Approximation (SPSA) Algorithm of Spall

Find $\theta$ that minimizes a loss function $L(\theta)$, subject to $\theta$ satisfying relevant constraints. So we seek $\theta^*$ such that $g(\theta^*) = \left.\partial L / \partial \theta\right|_{\theta^*} = 0$. The SPSA iteration takes the form

$\hat{\theta}_{k+1} = \hat{\theta}_k - a_k\, \hat{g}_k(\hat{\theta}_k),$

where $\{a_k\}$ is a nonnegative gain sequence.

SLIDE 28

Computation of $\hat{g}_k(\hat{\theta}_k)$

Let $\Delta_k = (\Delta_{k1}, \Delta_{k2}, \ldots, \Delta_{kp})^T$ be a vector of $p$ independent random variables at the $k$th iteration. Let $\{c_k\}$ be a sequence of positive scalars. For iteration $k$, take measurements at the design levels $\hat{\theta}_k \pm c_k \Delta_k$:

$y_k^{(+)} = L(\hat{\theta}_k + c_k \Delta_k) + \epsilon_k^{(+)}, \qquad y_k^{(-)} = L(\hat{\theta}_k - c_k \Delta_k) + \epsilon_k^{(-)}$

SLIDE 29

The Standard SP Formulation for $\hat{g}_k(\hat{\theta}_k)$

$\hat{g}_k(\hat{\theta}_k) = \left[ \frac{y_k^{(+)} - y_k^{(-)}}{2 c_k \Delta_{k1}}, \; \ldots, \; \frac{y_k^{(+)} - y_k^{(-)}}{2 c_k \Delta_{kp}} \right]^T$

We note that $\hat{g}_k(\cdot)$ only requires two measurements of $L(\cdot)$, independent of $p$.
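
A minimal sketch of this two-measurement gradient estimate (the Bernoulli perturbation follows the flowchart on a later slide; names are illustrative):

```python
# Hypothetical sketch: SPSA two-measurement gradient estimate.
import numpy as np

def spsa_gradient(loss, theta, c_k, rng):
    """Estimate the gradient of `loss` at `theta` from two evaluations."""
    delta = rng.choice([-1.0, 1.0], size=theta.shape)  # Bernoulli +/- 1 perturbation
    y_plus = loss(theta + c_k * delta)
    y_minus = loss(theta - c_k * delta)
    return (y_plus - y_minus) / (2.0 * c_k * delta)    # componentwise estimate
```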

SLIDE 30

Gain Selection in SPSA

The choice of the gains $a_k$ and $c_k$ is crucial to the performance of SPSA. For $k = 0, 1, 2, \ldots$, consider

$a_k = \frac{a}{(k + 1 + A)^{\alpha}}, \qquad c_k = \frac{c}{(k + 1)^{\gamma}},$

where $a, c, A, \alpha, \gamma > 0$. Spall recommends picking $\alpha = 0.602$ and $\gamma = 0.101$. We also set $A = 0$, and the choice of $c$ is based on an estimate of the standard deviation of the associated noise in the loss function.
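
A minimal sketch of these gain sequences (the constants a and c are placeholders to be tuned as described above):

```python
# Hypothetical sketch: SPSA gain sequences with Spall's recommended exponents.
def spsa_gains(k, a=0.1, c=0.1, A=0.0, alpha=0.602, gamma=0.101):
    """Return (a_k, c_k) for iteration k = 0, 1, 2, ..."""
    return a / (k + 1 + A) ** alpha, c / (k + 1) ** gamma
```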

SLIDE 31

Penalty Function

The penalty on the weight vector $w$ is

$P(w) = \lambda \sum_i w_i^2 (w_i - 1)^2,$

where $\lambda$ is chosen so that the penalty function has roughly the same order of magnitude as the objective function.

(Figure: Surface plot of $x^2(x-1)^2 + y^2(y-1)^2$ over the unit square.)
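
A minimal sketch of a penalty of this form (the squared-well form follows the function plotted above; lambda is an assumed tuning constant):

```python
# Hypothetical sketch: penalty that pushes each feature weight toward 0 or 1.
import numpy as np

def weight_penalty(w, lam=1.0):
    """Penalty with minima at w_i = 0 and w_i = 1; lam sets its relative magnitude."""
    w = np.asarray(w, float)
    return lam * float(np.sum(w ** 2 * (w - 1.0) ** 2))
```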

SLIDE 32

Our SPSA Flowchart

  • 1. Initialize the weights and the Minkowski p; together these make up $\hat{\theta}_1$.
  • 2. Draw a Bernoulli sample $\Delta_k$.
  • 3. Evaluate the objective function at the two perturbed points.
  • 4. Adjust the weights based on the gradient estimate. Weights are constrained to lie in [0,1].
  • 5. Iterate.
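
Putting the pieces together, a minimal end-to-end sketch of this loop (the objective, the p >= 1 constraint, and the stopping tolerance are illustrative assumptions):

```python
# Hypothetical sketch: SPSA loop over feature weights (clipped to [0, 1]) and Minkowski p.
import numpy as np

def spsa_optimize(objective, theta0, n_iter=200, a=0.1, c=0.1,
                  alpha=0.602, gamma=0.101, tol=1e-4, seed=0):
    """Minimize objective(theta), where theta = (w_1, ..., w_d, p)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, float)
    prev = objective(theta)
    for k in range(n_iter):
        a_k = a / (k + 1) ** alpha
        c_k = c / (k + 1) ** gamma
        delta = rng.choice([-1.0, 1.0], size=theta.shape)  # step 2: Bernoulli sample
        y_plus = objective(theta + c_k * delta)             # step 3: two evaluations
        y_minus = objective(theta - c_k * delta)
        g_hat = (y_plus - y_minus) / (2.0 * c_k * delta)
        theta = theta - a_k * g_hat                          # step 4: adjust parameters
        theta[:-1] = np.clip(theta[:-1], 0.0, 1.0)           # weights constrained to [0, 1]
        theta[-1] = max(theta[-1], 1.0)                      # assumed constraint: p >= 1
        cur = objective(theta)
        if abs(cur - prev) / max(abs(cur), 1e-12) < tol:     # stopping: small relative change
            break
        prev = cur
    return theta
```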

SLIDE 33

Stopping Criteria

Small change in $\|\hat{\theta}_{k+1} - \hat{\theta}_k\|$. Small relative change in the objective function: $\bigl| L(\hat{\theta}_{k+1}) - L(\hat{\theta}_k) \bigr| / \bigl| L(\hat{\theta}_{k+1}) \bigr|$.

SLIDE 34

Pruned Genes Convergence Curves

(Figure: Two runs, each showing Minkowski p vs. iteration and classifier performance vs. iteration.)

SLIDE 35

Pruned Genes Convergence Curves

(Figure: Two further runs, each showing Minkowski p vs. iteration and classifier performance vs. iteration.)

SLIDE 36

Relationship to the Hypergeometric Distribution

Let $N$ be the total number of pruned genes, $n_1$ the number of genes selected in experiment 1, $n_2$ the number of genes selected in experiment 2, and $k$ the number of genes in the intersection of the two selections. Under chance selection, $k$ follows a hypergeometric distribution:

$P(K = k) = \frac{\binom{n_1}{k} \binom{N - n_1}{n_2 - k}}{\binom{N}{n_2}}$
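
A minimal sketch of assessing such an overlap with SciPy (the counts shown are placeholders, not results from the talk):

```python
# Hypothetical sketch: chance probability of the observed gene-selection overlap.
from scipy.stats import hypergeom

N, n1, n2, k = 1753, 300, 300, 50           # placeholder counts, not results from the talk
p_value = hypergeom.sf(k - 1, N, n1, n2)    # P(K >= k) under random selection
print(p_value)
```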

SLIDE 37

Toy Problem Results

dimension:   1   2   3   4   5   6   7   8   9  10
count:      73  59  53  54  59  43  47  43  48  48

500 observations: N(0,1) on 9 dims and N(2,1) on 1 dim. 500 observations: N(0,1) on 9 dims and N(-2,1) on 1 dim. 100 trials of 200 iterations each.

(Figure: Histogram of the number of dimensions selected.)
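
A minimal sketch of generating this toy problem (the informative dimension index and seed are arbitrary):

```python
# Hypothetical sketch: two-class toy data with one informative dimension out of ten.
import numpy as np

def make_toy_data(n=500, d=10, informative=0, shift=2.0, seed=0):
    rng = np.random.default_rng(seed)
    X0 = rng.normal(0.0, 1.0, (n, d))   # class 0: N(0,1) on all dims ...
    X1 = rng.normal(0.0, 1.0, (n, d))
    X0[:, informative] += shift         # ... except N(+2, 1) on the informative dim
    X1[:, informative] -= shift         # class 1: N(-2, 1) on the informative dim
    return np.vstack([X0, X1]), np.repeat([0, 1], n)
```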

SLIDE 38

Pruned Gene Results

SLIDE 39

Smoothed Nose Results


SLIDE 40

Conclusions

Presented preliminary results on the benefits of simultaneously optimizing (w, p, s). Presented results illustrating the use of the MST as a classification complexity measure. Our next step is to perform simultaneous (w, p, s) optimization using surrogate classifier complexity measures, such as the MST criterion, in conjunction with appropriate objective functions and optimization methodologies.
