  1. Classifier Inspired Scaling for Training Set Selection. Walter Bennette. DISTRIBUTION A: Approved for public release; distribution unlimited. 16 May 2016. Case #88ABW-2016-2511

  2. Outline · Instance-based classification · Training set selection - ENN - DROP3 - CHC · Scaling approaches - Stratified - Classifier inspired · Experimental results

  3. Instance-based classification

  4.–13. Instance-based classification (figure-only slides walking through the idea step by step)

  14. Instance-based classification What are instance-based classifiers used for? · Classification of gene expression data · Content-based image retrieval · Text categorization · Load forecasting assistant for a power company
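
A minimal sketch of what "instance-based" means in practice: the model simply stores the training instances and classifies new points by a majority vote of their nearest stored neighbors. scikit-learn is shown here as one common implementation; the dataset and parameter values are illustrative, not from the talk.

    # k-nearest-neighbor classification: the stored instances ARE the model.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = KNeighborsClassifier(n_neighbors=3)   # 3-NN, as used later in the talk
    clf.fit(X_train, y_train)                   # "training" just stores the data
    print(clf.score(X_test, y_test))            # accuracy on held-out instances

Because every prediction scans the stored instances, memory and query time grow with the training set, which is exactly why the data-size questions on the next slides matter.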

  15. Instance-based classification What if there is a large amount of data?

  16. Instance-based classification What if there is a huge amount of data?

  17. Instance-based classification What if there is a serious amount of data?

  18. Training set selection (TSS)

  19. Training set selection (TSS) · Instead of maintaining all of the training data · Keep only the necessary data points

  20. Edited Nearest Neighbors (ENN) Formulation: · An instance is removed from the training data if it does not agree with the majority of its k nearest neighbors Effect: · Makes decision boundaries smoother · Doesn't remove much data
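
A minimal sketch of ENN, assuming X is a numeric numpy array, y holds non-negative integer labels, and Euclidean distance is the metric (all illustrative choices; function and variable names are not from the talk):

    import numpy as np

    def enn(X, y, k=3):
        """Keep only instances that agree with the majority of their k nearest neighbors."""
        keep = []
        for i in range(len(X)):
            d = np.linalg.norm(X - X[i], axis=1)   # distances to every instance
            d[i] = np.inf                          # never count the point itself
            nn = np.argsort(d)[:k]                 # its k nearest neighbors
            if np.bincount(y[nn]).argmax() == y[i]:
                keep.append(i)                     # majority label agrees: keep it
        return np.array(keep)                      # indices of retained instances

Returning indices (rather than the rows themselves) makes it easy to reuse this as the noise-filtering pass inside DROP3 below.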

  21. Edited Nearest Neighbors (ENN)

  22. DROP3 Formulation:

    DROP3(Training set TR):                # returns the selection set S
      Let S = TR after applying ENN.       # noise-filtering pass
      For each instance Xi in S:
        Find the k+1 nearest neighbors of Xi in S.
        Add Xi to each of its neighbors' lists of associates.
      For each instance Xi in S:
        Let with = # of associates of Xi classified correctly with Xi as a neighbor.
        Let without = # of associates of Xi classified correctly without Xi.
        If without ≥ with:
          Remove Xi from S.
          For each associate a of Xi:
            Remove Xi from a's list of nearest neighbors.
            Find a new nearest neighbor for a.
            Add a to its new neighbor's list of associates.
      Return S.

  23. DROP3 Formulation: · An iterative procedure that checks whether each instance's associates are classified as accurately without it as with it Effect: · Removes much more data than ENN · Maintains acceptable accuracy
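
A naive, self-contained Python sketch of the same procedure, following the data conventions of the ENN sketch above (numeric X, non-negative integer y, Euclidean distance). It is roughly O(n³) and meant only to mirror the pseudocode; the original DROP3 also orders removal attempts by each instance's distance to its nearest enemy, which is omitted here for brevity:

    import numpy as np

    def _knn(X, pool, x, k):
        """Indices (drawn from pool) of the k nearest neighbors of x."""
        pool = np.asarray(pool)
        d = np.linalg.norm(X[pool] - x, axis=1)
        return pool[np.argsort(d)[:k]]

    def _predict(X, y, pool, x, k):
        """Majority vote among the k nearest neighbors of x within pool."""
        return np.bincount(y[_knn(X, pool, x, k)]).argmax()

    def _enn(X, y, k):
        """ENN pass: keep instances that agree with their k nearest neighbors."""
        return [i for i in range(len(X))
                if _predict(X, y, [j for j in range(len(X)) if j != i], X[i], k) == y[i]]

    def drop3(X, y, k=3):
        S = _enn(X, y, k)                           # 1) filter noise with ENN first
        for i in list(S):
            rest = [j for j in S if j != i]
            # associates of i: points whose k+1 nearest neighbors within S include i
            assoc = [a for a in rest
                     if i in _knn(X, [m for m in S if m != a], X[a], k + 1)]
            # associates classified correctly with i available vs. without it
            with_i = sum(_predict(X, y, [m for m in S if m != a], X[a], k) == y[a]
                         for a in assoc)
            without_i = sum(_predict(X, y, [m for m in rest if m != a], X[a], k) == y[a]
                            for a in assoc)
            if without_i >= with_i:                 # removal does no harm: drop i
                S = rest
        return np.array(S)                          # indices of the retained subset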

  24. DROP3

  25. Genetic algorithm (CHC) Formulation: · A chromosome is a subset of the training data · A binary gene represents each instance · Fitness = α · Accuracy + (1 − α) · Reduction Effectiveness: · Removes a large amount of data · Achieves acceptable accuracy
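
A sketch of that fitness function, assuming a chromosome is a 0/1 mask over the training instances; alpha and the accuracy evaluator are placeholders rather than the talk's exact settings:

    import numpy as np

    def fitness(chromosome, X, y, evaluate_accuracy, alpha=0.5):
        """alpha * Accuracy + (1 - alpha) * Reduction for one candidate subset."""
        subset = np.flatnonzero(chromosome)                 # indices of retained instances
        accuracy = evaluate_accuracy(X[subset], y[subset])  # e.g. leave-one-out 3-NN
        reduction = 1.0 - len(subset) / len(X)              # fraction of data removed
        return alpha * accuracy + (1.0 - alpha) * reduction

The reduction term is what pushes CHC toward small subsets; with alpha near 1 the fitness degenerates into pure accuracy maximization and keeps almost everything.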

  26. Genetic algorithm (CHC)

  27. Scaling

  28. Scaling · As datasets grow, TSS becomes more and more expensive · It may become prohibitively so · The vast majority of scaling approaches rely on a stratified approach

  29. No scaling

  30. Stratified scaling
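
A sketch of the stratified idea, assuming a tss(X, y) routine (ENN, DROP3, or CHC) that returns the indices it keeps, as in the sketches above. This version uses a simple random partition; stratified methods in the literature typically also preserve class proportions within each stratum:

    import numpy as np

    def stratified_tss(X, y, tss, n_strata=10, seed=0):
        """Run tss independently on each stratum and pool the survivors."""
        rng = np.random.default_rng(seed)
        order = rng.permutation(len(X))                # shuffle, then cut into strata
        selected = []
        for stratum in np.array_split(order, n_strata):
            kept_local = tss(X[stratum], y[stratum])   # indices kept inside the stratum
            selected.extend(stratum[kept_local])       # map back to global indices
        return np.array(selected)

Each tss call sees only a fraction of the data, so the quadratic-or-worse cost of the selection method is paid on small pieces rather than on the whole set.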

  31. Representative Data Detection (ReDD) · Lin et al. 2015 · Applied to support vector machines; did not consider data reduction

  32. Our approach

  33. Classifier inspired approach · Based heavily on ReDD · Applied here to kNN, with data reduction monitored

  34. The filter The "Balance" dataset · Task: determine the scale's position - Balanced - Leaning right - Leaning left · Attributes - Left weight - Left distance - Right weight - Right distance

  35.–37. The filter (figure-only slides)

  38. Experimentation Parameters: · Learn a Random Forest for the filter · Split the data into one third and two thirds Design: · Perform TSS with ENN, CHC, and DROP3 using 3-NN · Compare no scaling, stratified, and classifier inspired · Measure reduction, accuracy, and computation time with 10-fold CV
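
A sketch of how these pieces could fit together, under the assumption (suggested by this slide and the ReDD-style design) that the expensive TSS method runs on the one-third split, a Random Forest filter is trained to predict keep/discard, and that filter is then applied to the remaining two thirds. All function names are illustrative; tss returns kept indices as in the earlier sketches:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    def classifier_inspired_tss(X, y, tss, seed=0):
        # 1) run the expensive TSS method on one third of the data only
        idx_small, idx_rest = train_test_split(
            np.arange(len(X)), train_size=1/3, random_state=seed, stratify=y)
        kept_local = tss(X[idx_small], y[idx_small])
        kept = np.zeros(len(idx_small), dtype=int)
        kept[kept_local] = 1                        # 1 = selected, 0 = discarded

        # 2) learn a filter that predicts keep/discard from the features
        filt = RandomForestClassifier(random_state=seed)
        filt.fit(X[idx_small], kept)

        # 3) apply the cheap filter to the remaining two thirds
        keep_rest = idx_rest[filt.predict(X[idx_rest]) == 1]
        return np.concatenate([idx_small[kept_local], keep_rest])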

  39. Datasets · 10 experimental datasets from KEEL

  40.–42. Results figures: reduction, accuracy, and computation time

  43. Results · Maintains accuracy (mostly) · Maintains data reduction · Slower than the stratified approach, but may improve for larger datasets

  44. Future work · Evaluate on many more datasets · Apply to very large datasets · Investigate whether damage can be spotted a priori

  45. Conclusion A promising candidate for scaling training set selection (TSS) to large datasets

  46. Questions Walter Bennette walter.bennette.1@us.af.mil 315-330-4957
