Imbalanced Domain Learning - Fraud Detection Course 2019/2020 - Nuno Moniz (PowerPoint Presentation)


SLIDE 1

SLIDE 2

Imbalanced Domain Learning

Fraud Detection Course - 2019/2020

Nuno Moniz nuno.moniz@fc.up.pt

SLIDE 3

Today

  • 1. Beyond Standard ML
  • 2. Imbalanced Domain Learning
      • 2.1 Problem Formulation
      • 2.2 Evaluation/Learning
  • 3. Strategies for Imbalanced Domain Learning
  • 4. Practical Examples

Fraud Detection Course 2019/2020 - Nuno Moniz

SLIDE 4

Beyond Standard Machine Learning

SLIDE 5

Hey Model 1, all apples are red, yellow or green.

SLIDE 6

Hey Model 1, what's the colour of this apple?

SLIDE 7

Famous ML Mistakes #1

SLIDE 8

Famous ML Mistakes #2

SLIDE 9

Machine Learning, Predictive Modelling

The goal of predictive modelling is to obtain a good approximation for an unknown function:

Y = f(x1, x2, ⋯)

What you need (the most basic):

  • A dataset
  • A target variable
  • Learning algorithm(s)
  • Evaluation metric(s)

SLIDE 10

Modelling Wally

  • Your task is to find Wally among 100 people
  • You train two models using the state-of-the-art in Deep Learning
      • Model 1 obtains 99% Accuracy
      • Model 2 obtains 1% Accuracy
  • Confusion Matrix?
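The confusion-matrix question can be made concrete with a short sketch (plain Python, illustrative only, not from the course materials; the course itself uses R): a model that never predicts "Wally" reaches 99% accuracy on 100 people, yet its confusion matrix shows it never finds him.

```python
from collections import Counter

def confusion_counts(y_true, y_pred, positive="wally"):
    """Return (tp, fp, fn, tn) for a binary task with the given positive label."""
    pairs = Counter(zip(y_true, y_pred))
    tp = pairs[(positive, positive)]
    fn = sum(v for (t, p), v in pairs.items() if t == positive and p != positive)
    fp = sum(v for (t, p), v in pairs.items() if t != positive and p == positive)
    tn = sum(v for (t, p), v in pairs.items() if t != positive and p != positive)
    return tp, fp, fn, tn

# one Wally hidden among 99 other people
y_true = ["wally"] + ["other"] * 99

# a "Model 1" that never predicts Wally
y_pred = ["other"] * 100

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
print(accuracy, tp)  # 0.99 accuracy, yet 0 Wallys found
```

The 99% figure hides the one number that matters for this task: the true-positive count.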

SLIDE 11

Predicting Popular News

  • Your task is to anticipate the most popular news of the day
  • You train two models using state-of-the-art Ensemble Learning techniques
      • Model 1 obtains 0.1 NMSE
      • Model 2 obtains 0.5 NMSE
  • Normalized Mean Squared Error:

    NMSE = (1/n) · Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² / yᵢ²
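Reading the NMSE as the mean of each case's squared error scaled by yᵢ², it can be computed in a few lines; a minimal sketch in plain Python (illustrative only, the course examples themselves use R):

```python
def nmse(y_true, y_pred):
    """Normalized Mean Squared Error: mean of (y_i - yhat_i)^2 / y_i^2."""
    assert len(y_true) == len(y_pred) and all(y != 0 for y in y_true)
    return sum((y - p) ** 2 / y ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

# perfect predictions give 0; lower is better
print(nmse([10.0, 20.0], [10.0, 20.0]))  # 0.0
print(nmse([10.0, 20.0], [9.0, 22.0]))
```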

SLIDE 12

Imbalanced Domain Learning

SLIDE 13

Imbalanced Domain Learning?

It's still predictive modelling, and as such...

"The goal of predictive modelling is to obtain a good approximation for an unknown function"

Y = f(x1, x2, ⋯)

Standard predictive modelling has some assumptions:

  • The distribution of the target variable is balanced
  • Users have uniform preferences: all errors were born equal

Assumptions in Imbalanced Domain Learning:

  • The distribution of the target variable is imbalanced
  • Users have non-uniform preferences: some cases are more important
  • The more important/relevant cases are those which are rare or extreme

SLIDE 14

Imbalanced Domain Learning - Nominal Target

[Figure: a balanced class distribution vs. an imbalanced class distribution]

SLIDE 15

Imbalanced Domain Learning - Numerical Target

[Figure: a balanced target distribution vs. an imbalanced target distribution]

SLIDE 16


Problems with Imbalanced Domains

There are two main problems when learning with imbalanced domains

  • 1. How to learn?
  • 2. How to evaluate?

How to learn?

  • Models are optimized to accurately represent the maximum of information
  • When there is information imbalance, this means they will more likely represent the majority type of information to the detriment of the minority (rare/extreme cases)

How to evaluate?

  • Most of the most well-known evaluation metrics are focused on assessing the average behaviour of models
  • However, there are a lot of cases where the evaluation objective is to understand if a model is capable of predicting a certain class or subset of values, i.e. imbalanced domain learning

SLIDE 17

The Problem of Evaluation

  • This problem is different for classification and regression tasks
  • Imbalanced learning has been explored in classification for over 20 years, but in regression problems it is very recent
  • The main idea when evaluating this type of problem:
      • Remember that not all cases are equal
      • You're focused on the ability of models to predict rare cases
      • Missing a prediction on a rare case is worse than missing a normal case

SLIDE 18

The Problem of Evaluation

                    Standard Evaluation          Non-Standard Evaluation
  Classification    Accuracy                     F-Score
                    Error Rate (complement)      G-Mean
                                                 ROC Curves / AUC
  Regression        MSE                          MSEϕ
                    RMSE                         RMSEϕ
                                                 Utility-Based Metrics (UBL R Package)
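The non-standard classification metrics above combine how well the rare class and the majority class are handled. A quick sketch of F-score and G-Mean computed from confusion-matrix counts (plain Python, illustrative only; the fraud numbers are invented for the example):

```python
import math

def f_score(tp, fp, fn, beta=1.0):
    """F-beta score: weighted harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def g_mean(tp, fp, fn, tn):
    """Geometric mean of sensitivity (recall on the rare class) and specificity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)

# hypothetical detector: finds 8 of 10 fraud cases, with 20 false alarms among 9,990 normals
print(f_score(tp=8, fp=20, fn=2))          # ~0.42, despite ~99.8% accuracy
print(g_mean(tp=8, fp=20, fn=2, tn=9970))  # ~0.89
```

Unlike accuracy, both metrics collapse towards 0 when the rare class is ignored, which is why they appear in the non-standard column.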

SLIDE 19

The Problem of Learning

Imagine the following scenario:

  • You have a dataset of 10,000 cases of credit transactions classified as Fraud or Normal
  • This dataset has 9,990 cases classified as Normal, and only 10 cases classified as Fraud

However, learning algorithms make choices - they have assumptions. The most hazardous for imbalanced domain learning are:

  • 1. Assuming that all cases are equal
  • 2. Internal optimization/decisions based on standard metrics

Learning algorithms are not human beings: they are programmed to operate in a pre-determined way. This usually means that the problem they want to solve is: how can we accurately represent the data?
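A back-of-the-envelope check (plain Python, not course code) shows why "accurately represent the data" backfires on this dataset: a model that always outputs the majority class looks almost perfect under a standard metric.

```python
def always_normal_accuracy(n_normal, n_fraud):
    """Accuracy of a model that predicts 'Normal' for every transaction."""
    return n_normal / (n_normal + n_fraud)

acc = always_normal_accuracy(9990, 10)
print(acc)  # 0.999 -- yet every one of the 10 fraud cases is missed
```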

SLIDE 20

The Problem of Learning

Instead of "It's all about the bass", it's in fact all about the mean/mode. Remember this?

SLIDE 21

Until now

  • 1. There's more to machine learning than standard tasks
  • 2. Learning algorithms are biased, and
  • 3. Algorithms are focused on reducing the average error/representing the majority cases
  • 4. Beware of standard evaluation metrics if your task is imbalanced domain learning

SLIDE 22

Strategies for Imbalanced Domain Learning

SLIDE 23

Strategies for Imbalanced Domain Learning

SLIDE 24

Data Pre-Processing

  • Goal: change the example distribution before applying any learning algorithm
  • Advantages: any standard learning algorithm can then be used
  • Disadvantages:
      • difficult to decide the optimal distribution (a perfect balance does not always provide the optimal results)
      • the strategies applied may severely increase/decrease the total number of examples

SLIDE 25

Special-purpose Learning Methods

  • Goal: change existing algorithms to provide a better fit to the imbalanced distribution
  • Advantages:
      • very effective in the contexts for which they were designed
      • more comprehensible to the user
  • Disadvantages:
      • a difficult task, because it requires deep knowledge of both the algorithm and the domain
      • difficulty of using an already adapted method in a different learning system

SLIDE 26

Prediction Post-Processing

  • Goal: change the predictions after applying any learning algorithm
  • Advantages: any standard learning algorithm can be used
  • Disadvantages: potential loss of model interpretability

SLIDE 27

Practical Examples

SLIDE 28

Practical Examples - Data Pre-Processing

  • Data Pre-Processing strategies are also known as Resampling Strategies
  • These are the most common strategies for tackling imbalanced domain learning tasks
  • We will look at practical examples for both classification and regression using:
      • Random Undersampling
      • Random Oversampling
      • SMOTE

SLIDE 29

Preliminaries in R

  • Install the package UBL from CRAN

install.packages("UBL")

  • Alternatively, install UBL from GitHub

library(devtools)
# stable release
install_github("paobranco/UBL", ref = "master")
# development release
install_github("paobranco/UBL", ref = "develop")

  • After installation, the package can be loaded like any other R package

library(UBL)

SLIDE 30

Data for Practical Examples - Classification Tasks

  • We will use the well-known iris dataset. The iris flower data set, or Fisher's Iris data set, is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in 1936
  • The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor)
  • For the purpose of the practical examples, we will consider the class setosa as the rare class, and the other classes as the normal class

library(UBL)

# generating an artificially imbalanced data set
data(iris)
data <- iris[, c(1, 2, 5)]
data$Species <- factor(ifelse(data$Species == "setosa", "rare", "common"))

# checking the class distribution of this artificial data set
table(data$Species)
##
## common   rare
##    100     50

SLIDE 31

Random Undersampling

  • To force the models to focus on the most important and least represented class(es), this technique randomly removes examples from the most represented, and therefore less important, class(es)
  • As such, the modified data set obtained is smaller than the original one
  • The user must always be aware that, to obtain a more balanced data set, this strategy may discard useful data
  • Therefore, this strategy should be applied with caution, especially in smaller datasets

SLIDE 32

Random Undersampling

# using a percentage provided by the user to perform undersampling
datU <- RandUnderClassif(Species ~ ., data, C.perc = list(common = 0.4))
table(datU$Species)
##
## common   rare
##     40     50

# automatically balancing the data distribution
datB <- RandUnderClassif(Species ~ ., data, C.perc = "balance")
table(datB$Species)
##
## common   rare
##     50     50

SLIDE 33

Random Undersampling

SLIDE 34

Random Oversampling

  • This strategy introduces replicas of randomly selected examples from relevant classes of the data set, e.g. replicating fraud cases
  • This allows obtaining a more balanced data set without discarding any important examples
  • However, this method is highly prone to over-fitting (think of a data set with only red apples)
  • This method also has a strong impact on the number of examples in the new data set, which can be a difficulty for the algorithm

SLIDE 35

Random Oversampling

# using a percentage provided by the user to perform oversampling
datO <- RandOverClassif(Species ~ ., data, C.perc = list(rare = 3))
table(datO$Species)
##
## common   rare
##    100    150

# automatically balancing the data distribution
datB <- RandOverClassif(Species ~ ., data, C.perc = "balance")
table(datB$Species)
##
## common   rare
##    100    100

SLIDE 36

Random Oversampling

SLIDE 37

SMOTE

  • Synthetic Minority Over-sampling Technique
  • The SMOTE algorithm is a strategy that performs oversampling via the generation of synthetic examples
  • A synthetic case of the minority class is generated via the interpolation of two minority class cases
  • A new minority case is obtained from a seed example of that class and one of its k randomly selected nearest neighbours
  • With these two examples, a new synthetic case is obtained by interpolating the example features
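The interpolation step can be sketched in a few lines (plain Python, numeric features only, illustrative; the actual SmoteClassif in UBL also handles nominal features and performs the k-nearest-neighbour search):

```python
import random

def smote_synthetic(seed, neighbour, rng):
    """Create one synthetic case on the segment between a minority-class seed
    and one of its nearest neighbours: new = seed + u * (neighbour - seed)."""
    u = rng.random()  # u drawn uniformly from [0, 1)
    return [s + u * (n - s) for s, n in zip(seed, neighbour)]

rng = random.Random(0)
new = smote_synthetic([5.1, 3.5], [4.9, 3.0], rng)
print(new)  # each feature lies between the two original cases
```

Because the synthetic case sits between two real minority cases, SMOTE fills in the minority region instead of replicating points exactly, which is why it over-fits less than random oversampling.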

SLIDE 38

SMOTE

# using SMOTE just to oversample the rare class
datSM1 <- SmoteClassif(Species ~ ., data, C.perc = list(common = 1, rare = 6))
table(datSM1$Species)
##
## common   rare
##    100    300

# user-defined percentages for both undersampling and oversampling
datSM2 <- SmoteClassif(Species ~ ., data, C.perc = list(common = 0.2, rare = 2))
table(datSM2$Species)
##
## common   rare
##     20    100

SLIDE 39

SMOTE

SLIDE 40

SMOTE

SLIDE 41

Data for Practical Examples - Regression Tasks

  • We will use the algae data set from the package DMwR
  • This data set contains observations on 11 variables, as well as the concentration levels of 7 harmful algae. Values were measured in several European rivers
  • We will focus on the target variable a7

# loading the algae data set
library("DMwR")
data(algae)
algae <- knnImputation(algae)

# checking the density distribution of the target variable
plot(density(algae$a7))

SLIDE 42

Extreme Values?

  • In classification tasks, it is easy to know if a distribution is imbalanced: class frequency
  • How can we do it in regression tasks?
  • Specifically, how can we determine the importance (or relevance) of a given value in a domain?
      • Which values should be considered extreme?

SLIDE 43

Relevance Functions

Ribeiro, R. Utility-based Regression. PhD thesis, Dep. Computer Science, Faculty of Sciences - University of Porto, 2011

  • When dealing with imbalanced regression tasks, the user should specify which are the important values
  • However, this can be a very difficult task: continuous domains are potentially infinite
  • Professor Rita Ribeiro (Ribeiro, 2011) proposed a framework for defining relevance functions of a given continuous target variable. This framework:
      • Includes an automatic method that obtains a relevance function from the target value distribution. It is based on boxplot rules (look up: adjusted boxplot), and assumes that the most important values for users are those considered extreme by such rules
      • Also allows users to manually specify which values are considered relevant and irrelevant using a matrix (the user inserts a given set of points, and the remaining ones are interpolated)
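As an illustration of the manual variant, here is a hypothetical piecewise-linear relevance function in plain Python (the control points are invented for the example; Ribeiro's framework, implemented in the UBL package, uses piecewise cubic interpolation rather than the linear interpolation sketched here):

```python
def relevance(y, points):
    """Piecewise-linear relevance in [0, 1]. `points` is a sorted list of
    (target value, relevance) control points; target values outside the
    range take the relevance of the nearest boundary point."""
    if y <= points[0][0]:
        return points[0][1]
    if y >= points[-1][0]:
        return points[-1][1]
    for (x0, r0), (x1, r1) in zip(points, points[1:]):
        if x0 <= y <= x1:
            return r0 + (y - x0) * (r1 - r0) / (x1 - x0)

# hypothetical control points: common low concentrations are irrelevant,
# rare high concentrations are fully relevant
ctrl = [(0.0, 0.0), (15.0, 0.5), (40.0, 1.0)]
print(relevance(0.0, ctrl), relevance(20.0, ctrl), relevance(45.0, ctrl))
```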

SLIDE 44

Relevance Threshold

  • After obtaining a relevance function for a given target variable, we still have to define which range of relevance values should be considered as identifying the most important values

SLIDE 45

Random Undersampling

  • The Random Undersampling approach is simply based on the random removal of examples from the original data set
  • The examples removed are randomly selected from the subset of examples with less important ranges of the target variable: those with a relevance below the user-defined threshold
  • You need to define a relevance function, a relevance threshold and the percentage of undersampling to perform

# using the automatic method for defining the relevance function
# and the default threshold (0.5)
Alg.U <- RandUnderRegress(a7 ~ ., algae, C.perc = list(0.5))  # 50% undersampling

# automatically balancing the data distribution
Alg.UBal <- RandUnderRegress(a7 ~ ., algae, C.perc = "balance")

SLIDE 46

Random Undersampling

SLIDE 47

Random Oversampling

  • The Random Oversampling approach is simply based on the introduction of random copies of examples from the training data set
  • These replicas are only introduced in the most important ranges of the target variable, i.e. for cases with a relevance value above the user-defined threshold
  • It is necessary to define a relevance function, a relevance threshold and the percentage of oversampling to perform

# using the automatic method for defining the relevance function
# and the default threshold (0.5)
Alg.O <- RandOverRegress(a7 ~ ., algae, C.perc = list(4.5))

# automatically balancing the data distribution
Alg.OBal <- RandOverRegress(a7 ~ ., algae, C.perc = "balance")

SLIDE 48

Random Oversampling

SLIDE 49

SMOTE

  • The relevance function and the relevance threshold defined determine which are the relevant and the non-relevant (normal) cases
  • This algorithm combines an oversampling strategy by interpolation of important examples with a random undersampling approach
  • The procedure is in all respects similar to the generation of synthetic examples in SMOTE for classification tasks

# we have two bumps: the first must be undersampled and the second oversampled
# thus, we can choose the following percentages:
thr.rel <- 0.8
C.perc <- list(0.2, 4)

# using these percentages and a relevance threshold of 0.8,
# with all other parameters at their default values
Alg.SM <- SmoteRegress(a7 ~ ., algae, thr.rel = thr.rel, C.perc = C.perc,
                       dist = "HEOM")

# using the automatic method for obtaining a balanced data set
Alg.SMBal <- SmoteRegress(a7 ~ ., algae, thr.rel = thr.rel, C.perc = "balance",
                          dist = "HEOM")

SLIDE 50

SMOTE

SLIDE 51

Wrap-up

SLIDE 52

Summary

  • 1. Machine Learning has a lot of faces, and some of them are not pretty
  • 2. Imbalanced Domain Learning is considered one of the most important topics for Machine Learning and Data Mining
  • 3. There are a lot of strategies to tackle this type of task, but all of them have their advantages and disadvantages
  • 4. Solutions are domain-dependent
  • 5. Remember: before you begin tackling any ML problem, investigate the domain and your objective.

SLIDE 53

Three Challenges for the Future

  • 1. Auto-Machine Learning and Imbalanced Domain Learning
  • 2. Targeted Resampling: reduce the variance of outcomes in strategies
  • 3. How to "force" a model to account for small concepts without sampling
