ADVANCED MACHINE LEARNING: Caveats and Techniques to Deal with Imbalanced Datasets



SLIDE 1

ADVANCED MACHINE LEARNING Caveats and Techniques to Deal with Imbalanced Datasets

(Adapted from H. He and E. A. Garcia, “Learning from Imbalanced Data,” IEEE Trans. Knowledge and Data Engineering, vol. 21, issue 9, pp. 1263–1284, 2009)

SLIDE 2

ML is now everywhere

❖ Increase of storage capacity → easy to build large datasets (e.g. companies can store the activities of clients with a variety of attributes)
❖ Widely available code for ML techniques → easy to use by laymen
➢ ML is used widely in a variety of domains
❖ One hopes that ML will solve problems with which companies struggle (vast amounts of data, very noisy, high-dimensional)

SLIDE 3

ML techniques assume balanced datasets

ML algorithms assume that the data is balanced
➢ In classification: a comparable number of instances of each class
What would be the result of running an SVM on the imbalanced dataset?

Balanced dataset vs. imbalanced dataset (826 data points: 790 in the positive class, 36 in the negative class)
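The 790/36 split above already shows why plain accuracy is a trap. A minimal sketch (hypothetical numbers taken from the slide): a degenerate classifier that always predicts the majority class never learns any boundary, yet its accuracy looks excellent.

```python
# Why accuracy misleads on the 826-point dataset from the slide
# (790 positive, 36 negative): always predicting the majority class
# gets every positive right and every negative wrong.
n_pos, n_neg = 790, 36
majority_accuracy = n_pos / (n_pos + n_neg)
print(f"'always positive' accuracy: {majority_accuracy:.3f}")  # ~0.956
```

Over 95% accuracy while misclassifying 100% of the class we actually care about.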

SLIDE 4

Imbalanced Data: Why and When?

SLIDE 5

Imbalanced Data: Example

Finance: number of clients who closed their account vs. number of clients who did not: 1%
Robotics: number of points where the two arms intersect vs. number of points with no intersection: 4–5%
One usually has far fewer datapoints from the adverse class. This is unfortunate, as we care a lot about avoiding misclassifying elements of this class.

SLIDE 6

Imbalanced Data: Example

What would be the effect on SVR?

SLIDE 7

Imbalanced Data: Example

Poor interpolation with missing data

SLIDE 8

Types of Imbalance

A dataset can be balanced between classes and yet not be representative within a class: more data in some regions, few data locally (within-class imbalance)

SLIDE 9

SLIDE 10

Imbalance and curse of dimensionality

SLIDE 11

Approaches to learning with imbalanced datasets

SLIDE 12

Learning with imbalanced datasets

Two main approaches:
❖ Sampling methods: act on the data
❖ Cost-sensitive methods: act on the cost function

SLIDE 13

SLIDE 14

Sampling Methods

Compensate for the lack of data by:
❖ Increasing the dataset: generate new datapoints for the smallest class
❖ Decreasing the dataset: remove redundant datapoints from the largest class

SLIDE 15

Undersampling

Remove redundant datapoints. This loses statistical information: it is a good choice only if the undersampled class still has enough datapoints, and only for low-dimensional datasets.
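A minimal sketch of random undersampling in plain Python. The function name and API are illustrative; real toolboxes such as imbalanced-learn implement smarter variants (e.g. removing only redundant points rather than random ones).

```python
import random

def undersample(X, y, majority_label, seed=0):
    """Randomly drop majority-class points until both classes have the
    same size. Minimal sketch of random undersampling; illustrative
    API, not from a real library."""
    rng = random.Random(seed)
    minority = [(x, l) for x, l in zip(X, y) if l != majority_label]
    majority = [(x, l) for x, l in zip(X, y) if l == majority_label]
    kept = rng.sample(majority, k=len(minority))  # keep as many as the minority
    pairs = minority + kept
    rng.shuffle(pairs)
    Xb = [x for x, _ in pairs]
    yb = [l for _, l in pairs]
    return Xb, yb

# 90 majority points (label 1) and 10 minority points (label 0)
X = list(range(100))
y = [1] * 90 + [0] * 10
Xb, yb = undersample(X, y, majority_label=1)
print(yb.count(0), yb.count(1))  # 10 10
```

Note how 80 majority points are simply discarded: the "lost statistics" the slide warns about.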

SLIDE 16

Oversampling

Pick a neighbour and create a new datapoint. This risks overfitting, especially if it is done for points that are noise.
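The simplest oversampling variant duplicates minority points at random; a minimal sketch (illustrative names, not a real library API):

```python
import random

def random_oversample(X, y, minority_label, seed=0):
    """Duplicate randomly chosen minority points until both classes have
    the same size. Minimal sketch; duplicating points that are noise is
    exactly how the overfitting risk mentioned above arises."""
    rng = random.Random(seed)
    minority = [x for x, l in zip(X, y) if l == minority_label]
    n_majority = sum(1 for l in y if l != minority_label)
    extra = [rng.choice(minority) for _ in range(n_majority - len(minority))]
    X_new = list(X) + extra
    y_new = list(y) + [minority_label] * len(extra)
    return X_new, y_new

# 90 majority points (label 1) and 10 minority points (label 0)
X = list(range(100))
y = [1] * 90 + [0] * 10
X_new, y_new = random_oversample(X, y, minority_label=0)
print(y_new.count(0), y_new.count(1))  # 90 90
```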

SLIDE 17

SMOTE: Synthetic minority oversampling technique

❖ No neighbours of the same class → noise
❖ Several neighbours of the same class:
  ➢ Surrounded by the other class → in danger
  ➢ Surrounded only on one side by the other class → safe
Generate new samples in between existing datapoints based on their local density and their borders with the other class. Cleaning techniques (undersampling) can be applied at the end to remove redundancy.
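The core generation step can be sketched as interpolation between a minority point and one of its k nearest minority neighbours. This is a simplified 2-D sketch of SMOTE's sampling idea only, without the noise/danger/safe bookkeeping or the cleaning step; all names are illustrative.

```python
import math
import random

def smote_sketch(minority, k=2, n_new=5, seed=0):
    """Generate synthetic minority samples by interpolating between a
    randomly picked minority point and one of its k nearest minority
    neighbours. Simplified sketch of SMOTE's core step."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        p = rng.choice(minority)
        # k nearest minority neighbours of p (excluding p itself)
        neighbours = sorted((q for q in minority if q != p),
                            key=lambda q: math.dist(p, q))[:k]
        q = rng.choice(neighbours)
        t = rng.random()  # random position along the segment p -> q
        synthetic.append(tuple(pi + t * (qi - pi) for pi, qi in zip(p, q)))
    return synthetic

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(minority, k=2, n_new=5, seed=1)
# every synthetic point lies on a segment between two existing minority points
```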

SLIDE 18

SLIDE 19

SLIDE 20

SLIDE 21

How to treat Imbalanced Datasets with SVM

❖ SVM with asymmetric misclassification cost
  ➢ Vary the penalty placed on each datapoint depending on its class
❖ SVM class boundary adjustment
  ➢ Changing b changes the boundary between the two classes

$$\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \ \frac{1}{2}\|\mathbf{w}\|^{2} + C \sum_{i=1}^{M} \xi_{i}$$

$$y(\mathbf{x}) = \operatorname{sgn}\!\left( \sum_{i=1}^{M} \alpha_{i}\, y_{i}\, k(\mathbf{x}_{i}, \mathbf{x}) + b \right)$$

With asymmetric misclassification costs, the single penalty $C$ is replaced by a class-dependent penalty: $C^{+}$ for positive-class points and $C^{-}$ for negative-class points.
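A runnable sketch of the first idea, a class-dependent penalty, implemented here as subgradient descent on a class-weighted hinge loss in plain Python. This is illustrative only (all names are made up); in practice one would use a library solver such as scikit-learn's SVC with its class_weight parameter.

```python
import random

def train_weighted_linear_svm(X, y, c_pos=1.0, c_neg=1.0,
                              lr=0.01, epochs=200, lam=0.01, seed=0):
    """Linear SVM trained by subgradient descent on a class-weighted
    hinge loss:  lam/2 * ||w||^2 + C_{y_i} * max(0, 1 - y_i (w.x_i + b)).
    Labels must be +1 or -1. Minimal sketch of asymmetric costs."""
    rng = random.Random(seed)
    d = len(X[0])
    w = [0.0] * d
    b = 0.0
    idx = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(idx)
        for i in idx:
            c = c_pos if y[i] > 0 else c_neg           # class-dependent penalty
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            w = [wj * (1.0 - lr * lam) for wj in w]    # regularizer step
            if margin < 1.0:                           # hinge active: push boundary
                w = [wj + lr * c * y[i] * xj for wj, xj in zip(w, X[i])]
                b += lr * c * y[i]
    return w, b

# Toy 1-D data: positives at +2/+3, negatives at -2/-3
X = [(2.0,), (3.0,), (-2.0,), (-3.0,)]
y = [1, 1, -1, -1]
w, b = train_weighted_linear_svm(X, y)
# prediction: sign(w . x + b)
```

Raising c_neg relative to c_pos makes mistakes on the negative (minority) class more expensive, which pushes the learned boundary toward the positive class: exactly the asymmetric-cost idea above.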

SLIDE 22

How to treat Imbalanced Datasets with SVM

SLIDE 23

Taking into account imbalanced datasets in the assessment of performance

Traditional accuracy measures are sensitive to the data distribution. The F-measure is better adapted, as it evaluates performance on one class.
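A minimal sketch of the F-measure for the class of interest (the function name is illustrative; sklearn.metrics.f1_score computes the same quantity):

```python
def f_measure(y_true, y_pred, positive):
    """F1 for one class of interest: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The 'always predict majority' classifier on the 826-point example
# (790 positive, 36 negative) from earlier slides:
y_true = [1] * 790 + [0] * 36
y_pred = [1] * 826
print(f_measure(y_true, y_pred, positive=0))  # 0.0 on the minority class
```

Despite roughly 96% accuracy, the F-measure on the minority class is 0, which is exactly the distribution-sensitivity problem the slide describes.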

SLIDE 24

TOOLBOX: https://github.com/scikit-learn-contrib/imbalanced-learn