Imbalanced Domain Learning - Fraud Detection Course 2019/2020 - Nuno Moniz (PowerPoint Presentation)


SLIDE 1

SLIDE 2

Imbalanced Domain Learning

Fraud Detection Course - 2019/2020

Nuno Moniz nuno.moniz@fc.up.pt

SLIDE 3

Today

  • 1. Beyond Standard ML
  • 2. Imbalanced Domain Learning
      • 2.1 Problem Formulation
      • 2.2 Evaluation/Learning
  • 3. Strategies for Imbalanced Domain Learning
  • 4. Practical Examples

Fraud Detection Course 2019/2020 - Nuno Moniz

SLIDE 4

Beyond Standard Machine Learning

SLIDE 5

Hey Model 1, all apples are red, yellow or green.

SLIDE 6

Hey Model 1, what's the colour of this apple?

SLIDE 7

Famous ML Mistakes #1

SLIDE 8

Famous ML Mistakes #2

SLIDE 9

Machine Learning, Predictive Modelling

The goal of predictive modelling is to obtain a good approximation for an unknown function:

Y = f(x1, x2, ⋯)

What you need (the most basic):

  • A dataset
  • A target variable
  • Learning algorithm(s)
  • Evaluation metric(s)

SLIDE 10

Modelling Wally

  • Your task is to find Wally among 100 people
  • You train two models using the state-of-the-art in Deep Learning
      • Model 1 obtains 99% Accuracy
      • Model 2 obtains 1% Accuracy
  • Confusion Matrix?
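The confusion-matrix question can be made concrete with a short sketch (plain Python, illustrative only, not from the course materials; the course itself uses R): a model that never predicts "Wally" reaches 99% accuracy on 100 people, yet its confusion matrix shows it never finds him.

```python
from collections import Counter

def confusion_counts(y_true, y_pred, positive="wally"):
    """Return (tp, fp, fn, tn) for a binary task with the given positive label."""
    pairs = Counter(zip(y_true, y_pred))
    tp = pairs[(positive, positive)]
    fn = sum(v for (t, p), v in pairs.items() if t == positive and p != positive)
    fp = sum(v for (t, p), v in pairs.items() if t != positive and p == positive)
    tn = sum(v for (t, p), v in pairs.items() if t != positive and p != positive)
    return tp, fp, fn, tn

# one Wally hidden among 99 other people
y_true = ["wally"] + ["other"] * 99

# a "Model 1" that never predicts Wally
y_pred = ["other"] * 100

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
print(accuracy, tp)  # 0.99 accuracy, yet 0 Wallys found
```

The 99% figure hides the one number that matters for this task: the true-positive count.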

SLIDE 11

Predicting Popular News

  • Your task is to anticipate the most popular news of the day
  • You train two models using state-of-the-art Ensemble Learning techniques
      • Model 1 obtains 0.1 NMSE
      • Model 2 obtains 0.5 NMSE
  • Normalized Mean Squared Error:

    NMSE = (1/n) · Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² / yᵢ²
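Reading the NMSE as the mean of each case's squared error scaled by yᵢ², it can be computed in a few lines; a minimal sketch in plain Python (illustrative only, the course examples themselves use R):

```python
def nmse(y_true, y_pred):
    """Normalized Mean Squared Error: mean of (y_i - yhat_i)^2 / y_i^2."""
    assert len(y_true) == len(y_pred) and all(y != 0 for y in y_true)
    return sum((y - p) ** 2 / y ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

# perfect predictions give 0; lower is better
print(nmse([10.0, 20.0], [10.0, 20.0]))  # 0.0
print(nmse([10.0, 20.0], [9.0, 22.0]))
```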

SLIDE 12

Imbalanced Domain Learning

SLIDE 13

Imbalanced Domain Learning?

It's still predictive modelling, and as such...

"The goal of predictive modelling is to obtain a good approximation for an unknown function"

Y = f(x1, x2, ⋯)

Standard predictive modelling has some assumptions:

  • The distribution of the target variable is balanced
  • Users have uniform preferences: all errors were born equal

Assumptions in Imbalanced Domain Learning:

  • The distribution of the target variable is imbalanced
  • Users have non-uniform preferences: some cases are more important
  • The more important/relevant cases are those which are rare or extreme

SLIDE 14

Imbalanced Domain Learning - Nominal Target

[Figure: a balanced class distribution vs. an imbalanced class distribution]

SLIDE 15

Imbalanced Domain Learning - Numerical Target

[Figure: a balanced target distribution vs. an imbalanced target distribution]

SLIDE 16


Problems with Imbalanced Domains

There are two main problems when learning with imbalanced domains

  • 1. How to learn?
  • 2. How to evaluate?

How to learn?

  • Models are optimized to accurately represent the maximum of information
  • When there is information imbalance, this means they will more likely represent the majority type of information to the detriment of the minority (rare/extreme cases)

How to evaluate?

  • Most of the most well-known evaluation metrics are focused on assessing the average behaviour of models
  • However, there are a lot of cases where the evaluation objective is to understand if a model is capable of predicting a certain class or subset of values, i.e. imbalanced domain learning

SLIDE 17

The Problem of Evaluation

  • This problem is different for classification and regression tasks
  • Imbalanced learning has been explored in classification for over 20 years, but in regression problems it is very recent
  • The main idea when evaluating this type of problem:
      • Remember that not all cases are equal
      • You're focused on the ability of models to predict rare cases
      • Missing a prediction on a rare case is worse than missing a normal case

SLIDE 18

The Problem of Evaluation

                    Standard Evaluation          Non-Standard Evaluation
  Classification    Accuracy                     F-Score
                    Error Rate (complement)      G-Mean
                                                 ROC Curves / AUC
  Regression        MSE                          MSEϕ
                    RMSE                         RMSEϕ
                                                 Utility-Based Metrics (UBL R Package)
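The non-standard classification metrics above combine how well the rare class and the majority class are handled. A quick sketch of F-score and G-Mean computed from confusion-matrix counts (plain Python, illustrative only; the fraud numbers are invented for the example):

```python
import math

def f_score(tp, fp, fn, beta=1.0):
    """F-beta score: weighted harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def g_mean(tp, fp, fn, tn):
    """Geometric mean of sensitivity (recall on the rare class) and specificity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)

# hypothetical detector: finds 8 of 10 fraud cases, with 20 false alarms among 9,990 normals
print(f_score(tp=8, fp=20, fn=2))          # ~0.42, despite ~99.8% accuracy
print(g_mean(tp=8, fp=20, fn=2, tn=9970))  # ~0.89
```

Unlike accuracy, both metrics collapse towards 0 when the rare class is ignored, which is why they appear in the non-standard column.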

SLIDE 19

The Problem of Learning

Imagine the following scenario:

  • You have a dataset of 10,000 cases of credit transactions classified as Fraud or Normal
  • This dataset has 9,990 cases classified as Normal, and only 10 cases classified as Fraud

However, learning algorithms make choices - they have assumptions. The most hazardous for imbalanced domain learning are:

  • 1. Assuming that all cases are equal
  • 2. Internal optimization/decisions based on standard metrics

Learning algorithms are not human beings: they are programmed to operate in a pre-determined way. This usually means that the problem they want to solve is: how can we accurately represent the data?
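A back-of-the-envelope check (plain Python, not course code) shows why "accurately represent the data" backfires on this dataset: a model that always outputs the majority class looks almost perfect under a standard metric.

```python
def always_normal_accuracy(n_normal, n_fraud):
    """Accuracy of a model that predicts 'Normal' for every transaction."""
    return n_normal / (n_normal + n_fraud)

acc = always_normal_accuracy(9990, 10)
print(acc)  # 0.999 -- yet every one of the 10 fraud cases is missed
```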

SLIDE 20

The Problem of Learning

Instead of "It's all about the bass", it's in fact all about the mean/mode. Remember this?

SLIDE 21

Until now

  • 1. There's more to machine learning than standard tasks
  • 2. Learning algorithms are biased, and
  • 3. Algorithms are focused on reducing the average error/representing the majority cases
  • 4. Beware of standard evaluation metrics if your task is imbalanced domain learning

SLIDE 22

Strategies for Imbalanced Domain Learning

SLIDE 23

Strategies for Imbalanced Domain Learning

SLIDE 24

Data Pre-Processing

  • Goal: change the example distribution before applying any learning algorithm
  • Advantages: any standard learning algorithm can then be used
  • Disadvantages:
      • difficult to decide the optimal distribution (a perfect balance does not always provide the optimal results)
      • the strategies applied may severely increase/decrease the total number of examples

SLIDE 25

Special-purpose Learning Methods

  • Goal: change existing algorithms to provide a better fit to the imbalanced distribution
  • Advantages:
      • very effective in the contexts for which they were designed
      • more comprehensible to the user
  • Disadvantages:
      • a difficult task, because it requires deep knowledge of both the algorithm and the domain
      • difficulty of using an already adapted method in a different learning system

SLIDE 26

Prediction Post-Processing

  • Goal: change the predictions after applying any learning algorithm
  • Advantages: any standard learning algorithm can be used
  • Disadvantages: potential loss of model interpretability

SLIDE 27

Practical Examples

SLIDE 28

Practical Examples - Data Pre-Processing

  • Data Pre-Processing strategies are also known as Resampling Strategies
  • These are the most common strategies for tackling imbalanced domain learning tasks
  • We will look at practical examples for both classification and regression using:
      • Random Undersampling
      • Random Oversampling
      • SMOTE

SLIDE 29

Preliminaries in R

  • Install the package UBL from CRAN

install.packages("UBL")

  • Alternatively, install UBL from GitHub

library(devtools)
# stable release
install_github("paobranco/UBL", ref = "master")
# development release
install_github("paobranco/UBL", ref = "develop")

  • After installation, the package can be loaded like any other R package

library(UBL)

SLIDE 30

Data for Practical Examples - Classification Tasks

  • We will use the well-known iris dataset. The iris flower data set, or Fisher's Iris data set, is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in 1936
  • The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor)
  • For the purpose of the practical examples, we will consider the class setosa as the rare class, and the other classes as the normal class

library(UBL)

# generating an artificially imbalanced data set
data(iris)
data <- iris[, c(1, 2, 5)]
data$Species <- factor(ifelse(data$Species == "setosa", "rare", "common"))

# checking the class distribution of this artificial data set
table(data$Species)
##
## common   rare
##    100     50

SLIDE 31

Random Undersampling

  • To force the models to focus on the most important and least represented class(es), this technique randomly removes examples from the most represented, and therefore less important, class(es)
  • As such, the modified data set obtained is smaller than the original one
  • The user must always be aware that, to obtain a more balanced data set, this strategy may discard useful data
  • Therefore, this strategy should be applied with caution, especially in smaller datasets

SLIDE 32

Random Undersampling

# using a percentage provided by the user to perform undersampling
datU <- RandUnderClassif(Species ~ ., data, C.perc = list(common = 0.4))
table(datU$Species)
##
## common   rare
##     40     50

# automatically balancing the data distribution
datB <- RandUnderClassif(Species ~ ., data, C.perc = "balance")
table(datB$Species)
##
## common   rare
##     50     50

SLIDE 33

Random Undersampling

SLIDE 34

Random Oversampling

  • This strategy introduces replicas of randomly selected examples from relevant classes of the data set, e.g. replicating fraud cases
  • This allows obtaining a more balanced data set without discarding any important examples
  • However, this method is highly prone to over-fitting (think of a data set with only red apples)
  • This method also has a strong impact on the number of examples in the new data set, which can be a difficulty for the algorithm

SLIDE 35

Random Oversampling

# using a percentage provided by the user to perform oversampling
datO <- RandOverClassif(Species ~ ., data, C.perc = list(rare = 3))
table(datO$Species)
##
## common   rare
##    100    150

# automatically balancing the data distribution
datB <- RandOverClassif(Species ~ ., data, C.perc = "balance")
table(datB$Species)
##
## common   rare
##    100    100

SLIDE 36

Random Oversampling

SLIDE 37

SMOTE

  • Synthetic Minority Over-sampling Technique
  • The SMOTE algorithm is a strategy that performs oversampling via the generation of synthetic examples
  • A synthetic case of the minority class is generated via the interpolation of two minority class cases
  • A new minority case is obtained from a seed example of that class and one of its k randomly selected nearest neighbours
  • With these two examples, a new synthetic case is obtained by interpolating the example features
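The interpolation step can be sketched in a few lines (plain Python, numeric features only, illustrative; the actual SmoteClassif in UBL also handles nominal features and performs the k-nearest-neighbour search):

```python
import random

def smote_synthetic(seed, neighbour, rng):
    """Create one synthetic case on the segment between a minority-class seed
    and one of its nearest neighbours: new = seed + u * (neighbour - seed)."""
    u = rng.random()  # u drawn uniformly from [0, 1)
    return [s + u * (n - s) for s, n in zip(seed, neighbour)]

rng = random.Random(0)
new = smote_synthetic([5.1, 3.5], [4.9, 3.0], rng)
print(new)  # each feature lies between the two original cases
```

Because the synthetic case sits between two real minority cases, SMOTE fills in the minority region instead of replicating points exactly, which is why it over-fits less than random oversampling.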

SLIDE 38

SMOTE

# using SMOTE just to oversample the rare class
datSM1 <- SmoteClassif(Species ~ ., data, C.perc = list(common = 1, rare = 6))
table(datSM1$Species)
##
## common   rare
##    100    300

# user-defined percentages for both undersampling and oversampling
datSM2 <- SmoteClassif(Species ~ ., data, C.perc = list(common = 0.2, rare = 2))
table(datSM2$Species)
##
## common   rare
##     20    100

SLIDE 39

SMOTE

SLIDE 40

SMOTE

SLIDE 41

Data for Practical Examples - Regression Tasks

  • We will use the algae data set from the package DMwR
  • This data set contains observations on 11 variables, as well as the concentration levels of 7 harmful algae. Values were measured in several European rivers
  • We will focus on the target variable a7

# loading the algae data set
library("DMwR")
data(algae)
algae <- knnImputation(algae)

# checking the density distribution of the target variable
plot(density(algae$a7))

SLIDE 42

Extreme Values?

  • In classification tasks, it is easy to know if a distribution is imbalanced: class frequency
  • How can we do it in regression tasks?
  • Specifically, how can we determine the importance (or relevance) of a given value in a domain?
      • Which values should be considered extreme?

SLIDE 43

Relevance Functions

Ribeiro, R. Utility-based Regression. PhD thesis, Dep. Computer Science, Faculty of Sciences - University of Porto, 2011

  • When dealing with imbalanced regression tasks, the user should specify which are the important values
  • However, this can be a very difficult task: continuous domains are potentially infinite
  • Professor Rita Ribeiro (Ribeiro, 2011) proposed a framework for defining relevance functions of a given continuous target variable. This framework:
      • Includes an automatic method that obtains a relevance function from the target value distribution. It is based on boxplot rules (look up: adjusted boxplot), and assumes that the most important values for users are those considered extreme by such rules
      • Also allows users to manually specify which values are considered relevant and irrelevant using a matrix (the user inserts a given set of points, and the remaining ones are interpolated)
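As an illustration of the manual variant, here is a hypothetical piecewise-linear relevance function in plain Python (the control points are invented for the example; Ribeiro's framework, implemented in the UBL package, uses piecewise cubic interpolation rather than the linear interpolation sketched here):

```python
def relevance(y, points):
    """Piecewise-linear relevance in [0, 1]. `points` is a sorted list of
    (target value, relevance) control points; target values outside the
    range take the relevance of the nearest boundary point."""
    if y <= points[0][0]:
        return points[0][1]
    if y >= points[-1][0]:
        return points[-1][1]
    for (x0, r0), (x1, r1) in zip(points, points[1:]):
        if x0 <= y <= x1:
            return r0 + (y - x0) * (r1 - r0) / (x1 - x0)

# hypothetical control points: common low concentrations are irrelevant,
# rare high concentrations are fully relevant
ctrl = [(0.0, 0.0), (15.0, 0.5), (40.0, 1.0)]
print(relevance(0.0, ctrl), relevance(20.0, ctrl), relevance(45.0, ctrl))
```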

SLIDE 44

Relevance Threshold

  • After obtaining a relevance function for a given target variable, we still have to define which range of relevance values should be considered as identifying the most important values

SLIDE 45

Random Undersampling

  • The Random Undersampling approach is simply based on the random removal of examples from the original data set
  • The examples removed are randomly selected from the subset of examples with less important ranges of the target variable: those with a relevance below the user-defined threshold
  • You need to define a relevance function, a relevance threshold and the percentage of undersampling to perform

# using the automatic method for defining the relevance function
# and the default threshold (0.5)
Alg.U <- RandUnderRegress(a7 ~ ., algae, C.perc = list(0.5))  # 50% undersampling

# automatically balancing the data distribution
Alg.UBal <- RandUnderRegress(a7 ~ ., algae, C.perc = "balance")

SLIDE 46

Random Undersampling

SLIDE 47

Random Oversampling

  • The Random Oversampling approach is simply based on the introduction of random copies of examples from the training data set
  • These replicas are only introduced in the most important ranges of the target variable, i.e. for cases with a relevance value above the user-defined threshold
  • It is necessary to define a relevance function, a relevance threshold and the percentage of oversampling to perform

# using the automatic method for defining the relevance function
# and the default threshold (0.5)
Alg.O <- RandOverRegress(a7 ~ ., algae, C.perc = list(4.5))

# automatically balancing the data distribution
Alg.OBal <- RandOverRegress(a7 ~ ., algae, C.perc = "balance")

SLIDE 48

Random Oversampling

SLIDE 49

SMOTE

  • The relevance function and the relevance threshold defined determine which are the relevant and the non-relevant (normal) cases
  • This algorithm combines an oversampling strategy by interpolation of important examples with a random undersampling approach
  • The procedure is in all respects similar to the generation of synthetic examples in SMOTE for classification tasks

# we have two bumps: the first must be undersampled and the second oversampled
# thus, we can choose the following percentages:
thr.rel <- 0.8
C.perc <- list(0.2, 4)

# using these percentages and a relevance threshold of 0.8,
# with all other parameters at their default values
Alg.SM <- SmoteRegress(a7 ~ ., algae, thr.rel = thr.rel, C.perc = C.perc,
                       dist = "HEOM")

# using the automatic method for obtaining a balanced data set
Alg.SMBal <- SmoteRegress(a7 ~ ., algae, thr.rel = thr.rel, C.perc = "balance",
                          dist = "HEOM")

SLIDE 50

SMOTE

SLIDE 51

Wrap-up

SLIDE 52

Summary

  • 1. Machine Learning has a lot of faces, and some of them are not pretty
  • 2. Imbalanced Domain Learning is considered one of the most important topics for Machine Learning and Data Mining
  • 3. There are a lot of strategies to tackle this type of task, but all of them have their advantages and disadvantages
  • 4. Solutions are domain-dependent
  • 5. Remember: before you begin tackling any ML problem, investigate the domain and your objective.

SLIDE 53

Three Challenges for the Future

  • 1. Auto-Machine Learning and Imbalanced Domain Learning
  • 2. Targeted Resampling: reduce the variance of outcomes in strategies
  • 3. How to "force" a model to account for small concepts without sampling
