Classifier Inspired Scaling for Training Set Selection
Walter Bennette DISTRIBUTION A: Approved for public release: distribution unlimited: 16 May 2016. Case #88ABW-2016-2511
Outline
· Instance-based classification
· Training set selection
  · ENN
  · DROP3
  · CHC
· Scaling approaches
  · Stratified
  · Classifier inspired
· Experimental results
What are instance-based classifiers used for?
· Classification of gene expression data
· Content-based image retrieval
· Text categorization
· A load forecasting assistant for a power company
What if there is a large amount of data?
What if there is a huge amount of data?
What if there is a serious amount of data?
Training set selection (TSS)
· Instead of maintaining all of the training data, keep only the necessary data points
ENN (Edited Nearest Neighbor)
Formulation:
· An instance is removed from the training data if its class does not agree with the majority of its k nearest neighbors (see the sketch below)
Effect:
· Makes decision boundaries smoother
· Doesn't remove much data
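A minimal Python sketch of the ENN rule, assuming a NumPy feature matrix X and an integer label vector y; the function name and the default k are illustrative:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def enn_filter(X, y, k=3):
        """Edited Nearest Neighbor: drop instances whose class disagrees
        with the majority vote of their k nearest neighbors."""
        # Ask for k + 1 neighbors because each point is its own nearest neighbor.
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
        _, idx = nn.kneighbors(X)
        keep = []
        for i, row in enumerate(idx):
            votes = y[row[1:]]                      # exclude the point itself
            majority = np.bincount(votes).argmax()  # assumes integer labels
            if y[i] == majority:
                keep.append(i)
        return X[keep], y[keep]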
DROP3
Formulation:

DROP3(Training set TR):
  Let S = TR after applying ENN.
  For each instance Xi in S:
    Find the k+1 nearest neighbors of Xi in S.
    Add Xi to each of its neighbors' lists of associates.
  For each instance Xi in S:
    Let with = # of associates of Xi classified correctly with Xi as a neighbor.
    Let without = # of associates of Xi classified correctly without Xi.
    If without ≥ with:
      Remove Xi from S.
      For each associate a of Xi:
        Remove Xi from a's list of neighbors.
        Find a new nearest neighbor for a.
        Add a to the new neighbor's list of associates.
  Return S.
Formulation:
· An iterative procedure that compares how well an instance's associates are classified with and without that instance (see the sketch below)
Effect:
· Removes much more data than ENN
· Maintains acceptable accuracy
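To make the pseudocode concrete, a simplified Python sketch of DROP3's core with/without test, reusing the enn_filter sketch above. The candidate ordering (by distance to nearest enemy) and the incremental neighbor/associate bookkeeping of the published algorithm are deliberately omitted, so this illustrates the decision rule rather than a faithful implementation:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def majority(labels):
        # Majority vote, assuming non-negative integer class labels.
        return np.bincount(labels).argmax()

    def drop3(X, y, k=3):
        """Simplified DROP3 sketch: after an ENN noise pass, remove an
        instance when at least as many of its associates are classified
        correctly without it as with it. Neighbors are recomputed for every
        candidate for brevity."""
        X, y = enn_filter(X, y, k)              # ENN sketch from above
        kept = np.ones(len(y), dtype=bool)
        for cand in range(len(y)):
            pool = np.flatnonzero(kept)         # instances still in S
            if len(pool) <= k + 2:
                break
            nn = NearestNeighbors(n_neighbors=k + 2).fit(X[pool])
            _, idx = nn.kneighbors(X[pool])               # idx[i][0] is i itself
            cpos = int(np.searchsorted(pool, cand))       # cand's slot in pool
            with_hits = without_hits = 0
            for i, row in enumerate(idx):
                neigh = list(row[1:])           # the k+1 nearest neighbors of i
                if cpos not in neigh:
                    continue                    # i is not an associate of cand
                with_n = neigh[:k]              # k nearest, possibly incl. cand
                without_n = [j for j in neigh if j != cpos][:k]
                with_hits += majority(y[pool[with_n]]) == y[pool[i]]
                without_hits += majority(y[pool[without_n]]) == y[pool[i]]
            if without_hits >= with_hits:
                kept[cand] = False              # associates do fine without cand
        keep = np.flatnonzero(kept)
        return X[keep], y[keep]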
CHC (a genetic algorithm)
Formulation:
· A chromosome is a subset of the training data
· A binary gene represents each instance
· Fitness = α ∗ Accuracy + (1 − α) ∗ Reduction (see the sketch below)
Effect:
· Removes a large amount of data
· Achieves acceptable accuracy
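A minimal sketch of the chromosome encoding and fitness evaluation. The weight α, the 1-NN accuracy estimate, and the function names are illustrative assumptions, not the paper's exact choices:

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    def fitness(chromosome, X, y, alpha=0.5, k=1):
        """Fitness = alpha * Accuracy + (1 - alpha) * Reduction.
        The chromosome is a boolean mask: gene i == True keeps instance i."""
        selected = np.flatnonzero(chromosome)
        if len(selected) == 0:
            return 0.0                                 # empty subsets score zero
        knn = KNeighborsClassifier(n_neighbors=min(k, len(selected)))
        knn.fit(X[selected], y[selected])
        # Accuracy of the reduced set on the full training data; real CHC
        # implementations typically use a leave-one-out estimate here.
        accuracy = knn.score(X, y)
        reduction = 1.0 - len(selected) / len(y)       # fraction of data removed
        return alpha * accuracy + (1 - alpha) * reduction

    # Example: score a random chromosome that keeps roughly half the instances.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 4))
    y = rng.integers(0, 3, size=100)
    print(fitness(rng.random(100) < 0.5, X, y))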
Scaling TSS
· As datasets grow, TSS becomes more and more expensive
· The cost may be prohibitive
· The vast majority of scaling approaches rely on a stratified approach (see the sketch below)
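A sketch of the stratified pattern, assuming the common formulation: partition the training data into disjoint strata, run TSS on each stratum separately, and pool the selected instances. It accepts any TSS routine with the (X, y) -> (X_selected, y_selected) signature used in the sketches above:

    import numpy as np

    def stratified_tss(X, y, tss, n_strata=5, seed=0):
        """Partition the training data into disjoint strata, run the given
        training set selection routine on each stratum, and pool the results."""
        rng = np.random.default_rng(seed)
        order = rng.permutation(len(y))
        X_parts, y_parts = [], []
        for stratum in np.array_split(order, n_strata):
            Xs, ys = tss(X[stratum], y[stratum])   # TSS on one small stratum
            X_parts.append(Xs)
            y_parts.append(ys)
        return np.vstack(X_parts), np.concatenate(y_parts)

    # Example: run the ENN sketch on 5 strata instead of the full dataset.
    # X_sel, y_sel = stratified_tss(X, y, enn_filter, n_strata=5)

Each stratum is much smaller than the full dataset, so an expensive TSS routine (CHC in particular) becomes tractable at the cost of selecting without a global view of the data.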
· Lin et al. (2015)
· Used for support vector machines and did not consider data reduction
This approach
· Based heavily on ReDD
· Used for kNN and monitors data reduction
The "Balance"" dataset Determine scale positions Attributes · Balanced Leaning right Leaning left
Left weight Left distance Right weight Right distance
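For intuition, a toy labeling function for this dataset, assuming the standard balance-scale physics (the scale tips toward the side with the greater weight × distance torque):

    def balance_label(left_weight, left_distance, right_weight, right_distance):
        """Label a balance-scale instance by comparing the torque on each side."""
        left = left_weight * left_distance
        right = right_weight * right_distance
        if left == right:
            return "Balanced"
        return "Leaning left" if left > right else "Leaning right"

    print(balance_label(2, 3, 3, 2))  # equal torque -> "Balanced"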
Parameters:
· Learn a Random Forest for the filter
· Split the data into 1/3 and 2/3
Design:
· Perform ENN, CHC, and DROP3 with 3-NN
· Compare no scaling, stratified, and classifier inspired scaling
· Calculate reduction, accuracy, and computation time with 10-fold CV (see the sketch below)
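A sketch of how one such comparison could be scored, reusing the TSS sketches above. The fold handling and metric bookkeeping here are illustrative, and the Random Forest filter wiring of the classifier inspired approach is not shown:

    import time
    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.neighbors import KNeighborsClassifier

    def evaluate(X, y, methods, n_splits=10, k=3):
        """Score each TSS method by reduction, accuracy, and selection time
        under stratified 10-fold cross-validation."""
        results = {name: [] for name in methods}
        for train, test in StratifiedKFold(n_splits=n_splits).split(X, y):
            for name, tss in methods.items():
                start = time.perf_counter()
                Xs, ys = tss(X[train], y[train])          # run the selection
                elapsed = time.perf_counter() - start
                knn = KNeighborsClassifier(n_neighbors=k).fit(Xs, ys)
                results[name].append({
                    "reduction": 1 - len(ys) / len(train),
                    "accuracy": knn.score(X[test], y[test]),
                    "time": elapsed,
                })
        return results

    # Example: evaluate(X, y, {"ENN": enn_filter,
    #                          "stratified ENN": lambda X, y: stratified_tss(X, y, enn_filter)})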
· 10 experimental datasets from KEEL
Results
· Maintains accuracy (mostly)
· Maintains data reduction
· Slower than the stratified approach, but may improve for larger datasets
Future work
· Perform the experiments on many more datasets
· Apply the approach to very large datasets
· Investigate whether damage can be spotted a priori
Conclusion
· Classifier inspired scaling is a promising candidate for scaling Training Set Selection to large datasets
Walter Bennette walter.bennette.1@us.af.mil 315-330-4957