Classification Complexity Measures and Their Relationship to Feature Selection J. L. Solka and D. A. Johannsen solkajl@nswc.navy.mil;johannsen@nswc.navy.mil NSWCDD Interface04 – p.1/40

Agenda
Minimal spanning tree complexity measures
An artificial nose data set
A gene expression data set
Wrap-up and conclusions

Acknowledgments
This work was supported by the Mathematical, Computer, and Information Sciences Division of the Office of Naval Research.
We wish to acknowledge helpful discussions with Dr. David Marchette of NSWCDD and Dr. Carey Priebe of JHU.

Classifier Complexity Measures
Single nearest neighbor cross-validated classification performance
Graph theoretic measures
  Minimal spanning tree (MST) measures
  Class cover catch digraph measures

MST Methodology
The cost of an edge of a graph is the Euclidean distance between its two vertices
A spanning tree is a tree that covers all vertices of the graph
A minimal spanning tree is a spanning tree of a graph such that the sum of all of the edge costs is minimal

MST Classifier Complexity Algorithm
Compute the MST of all the observations
Count the number of MST edges that connect observations from different classes
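The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes a complete graph on the observations with Euclidean edge costs, uses Prim's algorithm to build the tree, and the function names (`minimal_spanning_tree`, `mst_complexity`) are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points given as sequences of floats."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def minimal_spanning_tree(points, dist=euclidean):
    """Prim's algorithm on the complete graph; returns a list of (i, j) edges."""
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best_cost = [math.inf] * n   # cheapest known edge linking each vertex to the tree
    best_from = [-1] * n         # tree vertex at the other end of that edge
    best_cost[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the cheapest vertex that is not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best_cost[i])
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.append((best_from[u], u))
        # Relax the edges from the newly added vertex.
        for v in range(n):
            if not in_tree[v]:
                d = dist(points[u], points[v])
                if d < best_cost[v]:
                    best_cost[v] = d
                    best_from[v] = u
    return edges


def mst_complexity(points, labels, dist=euclidean):
    """Number of MST edges whose endpoints carry different class labels."""
    return sum(1 for i, j in minimal_spanning_tree(points, dist)
               if labels[i] != labels[j])


# Two well separated clusters: only one MST edge has to cross between classes.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0), (11.0, 10.0), (10.0, 11.0)]
labels = [0, 0, 0, 1, 1, 1]
print(mst_complexity(points, labels))
```

Dividing the count by the number of MST edges (n - 1) would give a fraction between 0 and 1; whether the slides normalize this way is not stated, so treat that as an assumption.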

MST Example 1
[Figure: Minimum Spanning Tree Inter-Class Edges for Two Bivariate Normal Samples]

MST Example 2
[Figure: Minimum Spanning Tree Inter-Class Edges for Two Bivariate Normal Samples]

Treatise
Is there a correspondence between nearest neighbor cross-validated performance and the MST complexity measure?
Can one use the MST complexity measure as a surrogate for nearest neighbor classifier performance during classifier optimization?
What is the effect of the choice of the Minkowski p parameter on classifier performance?
Can the Minkowski p parameter and the feature selection be simultaneously optimized based on some measure of classifier performance?

Artificial Nose Data Set
19 fibers x 2 wavelengths x 60 samples per time period = 2280 dimensional data
Application of interest was detection of the ground water contaminant trichloroethylene among various confusers.

Artificial Nose Minkowski Pseudo-Metric
The pseudo-metric, with Minkowski parameter p, is given by equation (1).
[Equation (1): not recoverable from the source]
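Since the slide's own formula is unreadable, the sketch below uses the textbook Minkowski distance, d_p(x, y) = (sum_i |x_i - y_i|^p)^(1/p), which may differ from the deck's equation (1). It pairs that distance with a leave-one-out single nearest neighbor accuracy; leave-one-out is an assumption about the cross-validation scheme, and the function names are hypothetical.

```python
def minkowski(a, b, p):
    """Textbook Minkowski distance of order p between two equal-length sequences.

    The largest coordinate difference is factored out so that large p values
    do not overflow when the differences are raised to the p-th power.
    """
    diffs = [abs(x - y) for x, y in zip(a, b)]
    largest = max(diffs, default=0.0)
    if largest == 0.0:
        return 0.0
    return largest * sum((d / largest) ** p for d in diffs) ** (1.0 / p)


def loo_1nn_accuracy(points, labels, p):
    """Leave-one-out accuracy of a single nearest neighbor classifier."""
    n = len(points)
    correct = 0
    for i in range(n):
        # Nearest neighbor of point i among all the other points.
        nearest = min((j for j in range(n) if j != i),
                      key=lambda j: minkowski(points[i], points[j], p))
        if labels[nearest] == labels[i]:
            correct += 1
    return correct / n


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0), (11.0, 10.0), (10.0, 11.0)]
labels = [0, 0, 0, 1, 1, 1]
for p in (1, 2, 5, 29):
    print(p, loo_1nn_accuracy(points, labels, p))
```

Sweeping p over a grid and recording the accuracy at each value is one way to produce a performance-versus-p curve of the kind shown on the later slides.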

Scatter Overview
Appears in conjunction with Fisher’s Linear Discriminant
Select dimensions in which classes are well separated:
  class means are well separated
  each class has small within-class variance

Scatter Computation
Two classes, $C_1$ and $C_2$, with $n_1$ and $n_2$ members (respectively)
Class means: $m_i = \frac{1}{n_i} \sum_{x \in C_i} x$
Class scatter matrices: $S_i = \sum_{x \in C_i} (x - m_i)(x - m_i)^T$

Scatter Computation
Within class scatter: $S_W = S_1 + S_2$
Between class scatter: $S_B = (m_1 - m_2)(m_1 - m_2)^T$

Scatter-based Feature Selection
Univariate case: $S_W$ and $S_B$ are scalars
  “Good” dimensions are those with a large value of $S_B / S_W$
Multivariate case (say, $d$-dim’l): $S_W$ and $S_B$ are matrices
  $S_B / S_W$ no longer appropriate
  Choose $\mathrm{tr}(S_B) / \mathrm{tr}(S_W)$
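A short sketch of how scatter-based selection could be computed for two classes. It assumes the standard Fisher definitions (within class scatter as the sum of squared deviations from each class mean, between class scatter as the squared difference of the class means) and reads the multivariate criterion as the ratio of traces, tr(S_B) / tr(S_W). Both the reading and the function names are assumptions, not taken from the slides.

```python
def scatter_score(values_a, values_b):
    """Univariate ratio S_B / S_W for one feature observed in two classes."""
    mean_a = sum(values_a) / len(values_a)
    mean_b = sum(values_b) / len(values_b)
    s_w = (sum((v - mean_a) ** 2 for v in values_a)
           + sum((v - mean_b) ** 2 for v in values_b))
    s_b = (mean_a - mean_b) ** 2
    if s_w == 0.0:
        # Degenerate feature: no within class spread at all.
        return float("inf") if s_b > 0.0 else 0.0
    return s_b / s_w


def trace_ratio(class_a, class_b, dims):
    """Multivariate score tr(S_B) / tr(S_W) over the feature indices in dims.

    The trace of each scatter matrix is the sum of its per-dimension scalars,
    so no matrix has to be formed explicitly.
    """
    tr_b = 0.0
    tr_w = 0.0
    for k in dims:
        col_a = [row[k] for row in class_a]
        col_b = [row[k] for row in class_b]
        mean_a = sum(col_a) / len(col_a)
        mean_b = sum(col_b) / len(col_b)
        tr_b += (mean_a - mean_b) ** 2
        tr_w += (sum((v - mean_a) ** 2 for v in col_a)
                 + sum((v - mean_b) ** 2 for v in col_b))
    return tr_b / tr_w if tr_w > 0.0 else float("inf")


def rank_features(class_a, class_b):
    """Feature indices sorted from best to worst univariate score, plus the scores."""
    n_features = len(class_a[0])
    scores = [scatter_score([row[k] for row in class_a],
                            [row[k] for row in class_b])
              for k in range(n_features)]
    order = sorted(range(n_features), key=lambda k: scores[k], reverse=True)
    return order, scores


# Feature 0 separates the classes; feature 1 has identical class means.
class_a = [(0.0, 5.0), (1.0, 3.0), (2.0, 4.0)]
class_b = [(10.0, 4.0), (11.0, 5.0), (12.0, 3.0)]
print(rank_features(class_a, class_b))
print(trace_ratio(class_a, class_b, dims=[0, 1]))
```

For data organized in blocks, such as all measurements from one fiber, `trace_ratio` can score a whole block at once by passing that block's feature indices as `dims`.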

Performance and Complexity for the Nonsmoothed Nose Data
[Figure: Performance and Complexity as a Function of Minkowski p Parameter for the (Nonsmoothed) Nose Data. Curves: average performance and average complexity (over cross validation). Horizontal axis: p, from 0 to 100. Annotation: p=5.]

Performance as a Function of Scatter Selected Fibers for the Nonsmoothed Nose Data at p=5
[Figure: Performance as a Function of Scatter Selected Fibers at p = 5. Vertical axis: average performance over the cross-validation. Horizontal axis: number of fibers, from 0 to 40. Annotated point: (21, 0.78125).]

Performance and Complexity for the Smoothed Nose Data
[Figure: Performance and Complexity as a Function of Minkowski p Parameter for the Smoothed Nose Data. Curves: average performance and average complexity (over cross validation). Horizontal axis: p, from 0 to 100. Annotated point: (29, 0.85625).]

Performance as a Function of Scatter Selected Fibers for the Smoothed Nose Data at p=29
[Figure: Vertical axis: average performance over the cross validation. Horizontal axis: number of fibers, from 0 to 40. Annotated point: (37, 0.85).]

The Golub Gene Data
72 patients (observations) in 7129 dimensions (genes)
ALL and AML leukemia patients
ALL T-cell and B-cell variants

Performance and Complexity for the Golub Gene Data
[Figure: Performance and Complexity as a Function of Minkowski p Value. Curves: average performance and average complexity (over cross validation). Horizontal axis: p, from 0 to 100. Annotation: optimal performance at p=4.]

Performance as a Function of the Number of Genes for the Golub Gene Data at p=4
[Figure: Performance as a Function of Scatter Selected Genes at p = 4. Vertical axis: average performance over the cross-validation. Horizontal axis: number of genes, from 0 to 8000. Annotated point: (372, 0.84722).]

The Normalized Golub Data
Only retain those genes whose expression level is 20 or greater across all patients
Consider the genes by patients data matrix
  Divide each column by its mean
  Subject each row to a standard normalizing transformation
Reduces the dimensionality to roughly 1753 genes
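The normalization steps listed on this slide could be sketched as follows. Several details are assumptions that the slide does not settle: the column means are computed after the low-expression genes have been removed, the "standard normalizing transformation" is read as centering each row to mean 0 and scaling it to unit (population) standard deviation, and the function name is hypothetical.

```python
import statistics


def normalize_expression(matrix, threshold=20.0):
    """Filter and normalize a genes-by-patients matrix (a list of gene rows).

    1. Keep genes whose expression is >= threshold for every patient.
    2. Divide each column (patient) by its mean over the kept genes.
    3. Standardize each row (gene) to mean 0 and standard deviation 1.
    """
    kept = [list(row) for row in matrix if all(v >= threshold for v in row)]
    if not kept:
        return []
    n_patients = len(kept[0])
    col_means = [sum(row[j] for row in kept) / len(kept) for j in range(n_patients)]
    scaled = [[row[j] / col_means[j] for j in range(n_patients)] for row in kept]
    normalized = []
    for row in scaled:
        mu = statistics.mean(row)
        sd = statistics.pstdev(row)
        # A constant row has no spread; map it to zeros rather than divide by 0.
        normalized.append([(v - mu) / sd if sd > 0.0 else 0.0 for v in row])
    return normalized


# Toy matrix: 4 genes by 3 patients; the second gene falls below the threshold.
toy = [
    [20.0, 40.0, 60.0],
    [10.0, 50.0, 70.0],
    [30.0, 30.0, 30.0],
    [100.0, 200.0, 600.0],
]
for row in normalize_expression(toy):
    print([round(v, 3) for v in row])
```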

Performance and Complexity for the Reduced Golub Gene Data
[Figure: Complexity and Performance vs. Minkowski p for Pruned Leukemia Data. Curves: average performance and average complexity (over cross validation). Horizontal axis: p, from 0 to 100. Annotated point: (4, 0.8472).]

Performance as a Function of the Number of Genes for the Reduced Golub Gene Data at p=4
[Figure: Performance as a Function of Scatter Selected Genes at p = 4. Vertical axis: average performance over the cross-validation. Horizontal axis: number of genes, from 0 to 1800. Annotated point: (1698, 0.86111).]

Simultaneous Optimization of Parameters
Huge dimensionality prevents a classical optimization approach
Subsampling used in order to reduce computational complexity
Stochastic optimization through the simultaneous perturbation method of Spall
Sensitivities to the formulation of an appropriate optimization criterion

The Simultaneous Perturbation Stochastic Approximation (SPSA) Algorithm of Spall
Find $\theta^*$ that minimizes a loss function $L(\theta)$ subject to satisfying relevant constraints
So we seek $\theta^*$ such that $g(\theta^*) = \left. \partial L / \partial \theta \right|_{\theta = \theta^*} = 0$
The SPSA takes the form $\hat{\theta}_{k+1} = \hat{\theta}_k - a_k \hat{g}_k(\hat{\theta}_k)$, where $a_k$ is a nonnegative gain sequence

Computation of $\hat{g}_k(\hat{\theta}_k)$
Let $\Delta_k$ be a vector of independent random variables at the $k$th iteration
Let $\{c_k\}$ be a sequence of positive scalars
For iteration $k$, take measurements $y(\hat{\theta}_k + c_k \Delta_k)$ and $y(\hat{\theta}_k - c_k \Delta_k)$ of $L$ at the design levels $\hat{\theta}_k \pm c_k \Delta_k$

The Standard SP Formulation for $\hat{g}_k(\hat{\theta}_k)$
For each component $i$ of $\theta$: $\hat{g}_{ki}(\hat{\theta}_k) = \dfrac{y(\hat{\theta}_k + c_k \Delta_k) - y(\hat{\theta}_k - c_k \Delta_k)}{2 c_k \Delta_{ki}}$
We note that $\hat{g}_k$ only requires two measurements of $L$, independent of the dimension of $\theta$
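A minimal sketch of the basic SPSA recursion, for illustration only. It assumes symmetric Bernoulli (plus or minus 1) perturbations and the commonly used gain forms a_k = a / (k + 1 + A)^alpha and c_k = c / (k + 1)^gamma with alpha = 0.602 and gamma = 0.101. The constants a, A and c, the toy quadratic loss and the function name are all hypothetical choices, not values from the slides.

```python
import random


def spsa_minimize(loss, theta0, iterations=500, a=0.2, A=10.0, c=0.1,
                  alpha=0.602, gamma=0.101, seed=0):
    """Basic SPSA: each iteration uses two loss measurements, whatever the dimension."""
    rng = random.Random(seed)
    theta = list(theta0)
    for k in range(iterations):
        a_k = a / (k + 1 + A) ** alpha    # step size gain sequence
        c_k = c / (k + 1) ** gamma        # perturbation size sequence
        # Simultaneous perturbation: every coordinate is moved at once by +/- c_k.
        delta = [rng.choice((-1.0, 1.0)) for _ in theta]
        plus = [t + c_k * d for t, d in zip(theta, delta)]
        minus = [t - c_k * d for t, d in zip(theta, delta)]
        diff = loss(plus) - loss(minus)
        # The same difference is reused for every component of the gradient estimate.
        grad = [diff / (2.0 * c_k * d) for d in delta]
        theta = [t - a_k * g for t, g in zip(theta, grad)]
    return theta


def toy_loss(theta):
    """Noise-free quadratic with its minimum at (1, -2)."""
    target = (1.0, -2.0)
    return sum((t - m) ** 2 for t, m in zip(theta, target))


print(spsa_minimize(toy_loss, [0.0, 0.0]))
```

In the setting of these slides the loss would be a classifier criterion such as cross-validated error or MST complexity, evaluated as a function of the Minkowski p parameter and the feature weights; a criterion of that kind is noisy and far less well behaved than this toy quadratic, so the gains would need tuning.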
