
SLIDE 1

Classifier Inspired Scaling for Training Set Selection

Walter Bennette DISTRIBUTION A: Approved for public release: distribution unlimited: 16 May 2016. Case #88ABW-2016- 2511

SLIDE 2

Outline

• Instance-based classification
• Training set selection
  • ENN
  • DROP3
  • CHC
• Scaling approaches
  • Stratified
  • Classifier inspired
• Experimental results

SLIDE 3

Instance-based classification

SLIDES 4-13

Instance-based classification (step-by-step worked example; figures only in the original slides)
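The slides above step through how an instance-based classifier labels a new point by majority vote over stored training instances. A minimal k-NN sketch of that idea (Euclidean distance and the function name are choices made here, not taken from the slides):

```python
import math
from collections import Counter

def knn_classify(X_train, y_train, x, k=3):
    """Instance-based (lazy) classification: no model is built up front;
    the label is the majority vote of the k nearest stored instances."""
    nearest = sorted(range(len(X_train)),
                     key=lambda i: math.dist(X_train[i], x))[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]
```

Every query rescans the entire training set, which is why large training sets make instance-based methods expensive and motivate the training set selection methods that follow.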

SLIDE 14

Instance-based classification

What are they used for?
• Classification of gene expression
• Content-based image retrieval
• Text categorization
• Load forecasting assistant for a power company

SLIDE 15

Instance-based classification

What if there is a large amount of data?

SLIDE 16

Instance-based classification

What if there is a huge amount of data?

SLIDE 17

Instance-based classification

What if there is a serious amount of data?

SLIDE 18

Training set selection (TSS)

SLIDE 19

Training set selection (TSS)

• Instead of maintaining all of the training data
• Keep only certain necessary data points

SLIDE 20

Edited Nearest Neighbors (ENN)

Formulation:
• An instance is removed from the training data if it does not agree with the majority of its k nearest neighbors

Effect:
• Makes decision boundaries smoother
• Doesn't remove much data
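The ENN rule above fits in a few lines; Euclidean distance and the name `enn_filter` are choices made here for illustration:

```python
import math
from collections import Counter

def enn_filter(X, y, k=3):
    """Edited Nearest Neighbors: remove every instance whose label
    disagrees with the majority label of its k nearest neighbors."""
    def neighbor_vote(i):
        others = sorted((j for j in range(len(X)) if j != i),
                        key=lambda j: math.dist(X[j], X[i]))[:k]
        return Counter(y[j] for j in others).most_common(1)[0][0]
    keep = [i for i in range(len(X)) if neighbor_vote(i) == y[i]]
    return [X[i] for i in keep], [y[i] for i in keep]
```

A noisy point deep inside the other class's region is outvoted by its neighbors and dropped, smoothing the boundary; points in homogeneous regions all survive, which is why ENN alone removes little data.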

SLIDE 21

Edited Nearest Neighbors (ENN)

SLIDE 22

DROP3

Formulation:

DROP3(Training set TR) → Selection set S:
  Let S = TR after applying ENN.
  For each instance Xi in S:
    Find the k+1 nearest neighbors of Xi in S.
    Add Xi to each of those neighbors' lists of associates.
  For each instance Xi in S:
    Let with = # of associates of Xi classified correctly with Xi as a neighbor.
    Let without = # of associates of Xi classified correctly without Xi.
    If without ≥ with:
      Remove Xi from S.
      For each associate a of Xi:
        Remove Xi from a's list of neighbors.
        Find a new nearest neighbor for a.
        Add a to the new neighbor's list of associates.
  Return S.
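A runnable sketch of the procedure. To stay short it recomputes neighbor and associate sets from scratch at each step instead of maintaining them incrementally as the pseudocode does, so it is far less efficient, but the with/without test is the same:

```python
import math
from collections import Counter

def knn_vote(X, y, pool, x, k=3):
    # majority label of x's k nearest neighbors among the indices in pool
    nn = sorted(pool, key=lambda j: math.dist(X[j], x))[:k]
    return Counter(y[j] for j in nn).most_common(1)[0][0]

def drop3(X, y, k=3):
    """Simplified DROP3 sketch: ENN noise pass, then drop each instance
    whose associates are classified at least as well without it."""
    # ENN pass: remove instances outvoted by their k nearest neighbors
    S = {i for i in range(len(X))
         if knn_vote(X, y, [j for j in range(len(X)) if j != i], X[i], k) == y[i]}

    def correct(a, pool):
        return knn_vote(X, y, [p for p in pool if p != a], X[a], k) == y[a]

    for i in sorted(S):
        # associates of i: members whose k+1 nearest neighbors include i
        assoc = [a for a in sorted(S - {i})
                 if i in sorted(S - {a},
                                key=lambda j: math.dist(X[j], X[a]))[:k + 1]]
        with_i = sum(correct(a, S) for a in assoc)
        without_i = sum(correct(a, S - {i}) for a in assoc)
        if without_i >= with_i:
            S.remove(i)
    kept = sorted(S)
    return [X[i] for i in kept], [y[i] for i in kept]
```

On two well-separated clusters this shrinks each cluster to a handful of representatives while still classifying the original points correctly, which matches the slide's claim that DROP3 removes far more data than ENN at acceptable accuracy.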

SLIDE 23

DROP3

Formulation:
• Iterative procedure that compares the accuracy of each instance's associates with and without that instance

Effect:
• Removes much more data than ENN
• Maintains acceptable accuracy

SLIDE 24

DROP3

SLIDE 25

Genetic algorithm (CHC)

Formulation:
• A chromosome is a subset of the training data
• A binary gene represents each instance
• Fitness = α ∗ Accuracy + (1 − α) ∗ Reduction

Effectiveness:
• Removes a large amount of data
• Achieves acceptable accuracy
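The encoding and fitness on this slide are easy to make concrete. The sketch below pairs them with a plain generational GA rather than the actual CHC algorithm (which adds HUX crossover, incest prevention, and cataclysmic restarts), so treat it as an illustration of the chromosome and fitness only:

```python
import math
import random
from collections import Counter

def knn_acc(train_idx, X, y, k=3):
    """Accuracy on the full data of k-NN restricted to train_idx."""
    correct = 0
    for x, t in zip(X, y):
        nn = sorted(train_idx, key=lambda j: math.dist(X[j], x))[:k]
        correct += Counter(y[j] for j in nn).most_common(1)[0][0] == t
    return correct / len(X)

def fitness(mask, X, y, alpha=0.5, k=3):
    # Fitness = alpha * Accuracy + (1 - alpha) * Reduction (from the slide)
    train_idx = [i for i, bit in enumerate(mask) if bit]
    if not train_idx:
        return 0.0
    reduction = 1 - len(train_idx) / len(mask)
    return alpha * knn_acc(train_idx, X, y, k) + (1 - alpha) * reduction

def select_subset(X, y, gens=10, pop=12, alpha=0.5, seed=1):
    """Generational GA over binary chromosomes, one gene per instance."""
    rng = random.Random(seed)
    n = len(X)
    P = [[rng.random() < 0.5 for _ in range(n)] for _ in range(pop)]
    for _ in range(gens):
        P.sort(key=lambda m: fitness(m, X, y, alpha), reverse=True)
        children = []
        for _ in range(pop // 2):
            a, b = rng.sample(P[:pop // 2], 2)        # parents from the fitter half
            cut = rng.randrange(1, n)
            child = a[:cut] + b[cut:]                 # one-point crossover
            child = [bit ^ (rng.random() < 0.02) for bit in child]  # light mutation
            children.append(child)
        P = P[:pop - len(children)] + children        # elitism: keep the best half
    return max(P, key=lambda m: fitness(m, X, y, alpha))
```

The reduction term rewards chromosomes that switch genes off, so the search drifts toward small subsets as long as accuracy holds, which is exactly the trade-off α controls.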

SLIDE 26

Genetic algorithm (CHC)

SLIDE 27

Scaling

SLIDE 28

Scaling

• As datasets grow, TSS becomes more and more expensive
• May be prohibitive
• The vast majority of scaling approaches rely on a stratified approach

SLIDE 29

No scaling

SLIDE 30

Stratified scaling
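The stratified approach can be sketched as: shuffle, split the training data into disjoint strata, run the TSS method independently on each stratum, and union the survivors. ENN stands in for the TSS method here; the stratum count and function names are choices made for this sketch:

```python
import math
import random
from collections import Counter

def enn_keep(X, y, k=3):
    """Local indices kept by Edited Nearest Neighbors within one stratum."""
    keep = []
    for i in range(len(X)):
        nn = sorted((j for j in range(len(X)) if j != i),
                    key=lambda j: math.dist(X[j], X[i]))[:k]
        if Counter(y[j] for j in nn).most_common(1)[0][0] == y[i]:
            keep.append(i)
    return keep

def stratified_tss(X, y, n_strata=2, k=3, seed=0):
    """Stratified scaling: each TSS run sees only n/n_strata instances,
    which is where the speedup over running TSS on all n comes from."""
    rng = random.Random(seed)
    order = list(range(len(X)))
    rng.shuffle(order)
    size = math.ceil(len(X) / n_strata)
    selected = []
    for s in range(0, len(order), size):
        stratum = order[s:s + size]
        Xs = [X[j] for j in stratum]
        ys = [y[j] for j in stratum]
        selected += [stratum[i] for i in enn_keep(Xs, ys, k)]
    idx = sorted(selected)
    return [X[i] for i in idx], [y[i] for i in idx]
```

The trade-off is that each stratum only sees a slice of the data, so an instance can be kept or dropped based on neighbors that are unrepresentative of the full dataset.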

SLIDE 31

Representative Data Detection (ReDD)

• Lin et al. 2015
• Used for support vector machines; did not consider data reduction

SLIDE 32

Our approach

SLIDE 33

Classifier inspired approach

• Based heavily on ReDD
• Used for kNN, and data reduction is monitored

SLIDE 34

The filter

• The "Balance" dataset: determine scale positions
  • Balanced
  • Leaning right
  • Leaning left
• Attributes
  • Left weight
  • Left distance
  • Right weight
  • Right distance
SLIDES 35-37

The filter (figures only in the original slides)

SLIDE 38

Experimentation

Parameters:
• Learn a Random Forest for the filter
• Split data into 1/3rd, 2/3rd

Design:
• Perform for ENN, CHC, and DROP3 with 3-NN
• Compare no scaling, stratified, and classifier inspired
• Calculate reduction, accuracy, and computation time with 10-fold CV
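One plausible reading of this design, sketched below: run the expensive TSS method on the 1/3 split, train a filter on its keep/discard decisions, and let the filter decide for the remaining 2/3. The deck trains a Random Forest filter; to keep this sketch dependency-free a k-NN filter stands in, and the filter features (instance coordinates plus class label) are an assumption of this sketch, not stated on the slides:

```python
import math
import random
from collections import Counter

def knn_vote(F_train, t_train, f, k=3):
    nn = sorted(range(len(F_train)), key=lambda i: math.dist(F_train[i], f))[:k]
    return Counter(t_train[i] for i in nn).most_common(1)[0][0]

def enn_mask(X, y, k=3):
    """ENN keep/discard decisions; these become the filter's training labels."""
    mask = []
    for i in range(len(X)):
        nn = sorted((j for j in range(len(X)) if j != i),
                    key=lambda j: math.dist(X[j], X[i]))[:k]
        mask.append(Counter(y[j] for j in nn).most_common(1)[0][0] == y[i])
    return mask

def classifier_inspired_tss(X, y, k=3, seed=0):
    """Run the expensive TSS method (ENN here) on 1/3 of the data, train a
    filter on its keep/discard decisions, and let the filter decide for
    the remaining 2/3 (k-NN filter as a stand-in for Random Forest)."""
    rng = random.Random(seed)
    order = list(range(len(X)))
    rng.shuffle(order)
    third = len(X) // 3
    small, rest = order[:third], order[third:]
    keep_small = enn_mask([X[i] for i in small], [y[i] for i in small], k)
    F_train = [tuple(X[i]) + (y[i],) for i in small]   # assumed filter features
    kept = [i for i, bit in zip(small, keep_small) if bit]
    kept += [j for j in rest
             if knn_vote(F_train, keep_small, tuple(X[j]) + (y[j],), k)]
    return sorted(kept)
```

The point of the design is that the TSS method's cost is paid only on the 1/3 sample; the filter's predictions on the 2/3 remainder are cheap, which is where the claimed scaling comes from.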

SLIDE 39

Datasets

• 10 experimental datasets from KEEL

SLIDE 40

Reduction

SLIDE 41

Accuracy

SLIDE 42

Time

SLIDE 43

Results

• Maintains accuracy (mostly)
• Maintains data reduction
• Slower than the stratified approach, but may improve for larger datasets

SLIDE 44

Future work

• Perform for many more datasets
• Apply to very large datasets
• Investigate if damage can be spotted a priori

SLIDE 45

Conclusion

Promising candidate for scaling Training Set Selection to large datasets

SLIDE 46

Questions

Walter Bennette
walter.bennette.1@us.af.mil
315-330-4957