Improving Electric Fraud Detection Using Class Imbalance Strategies



SLIDE 1

Outline: Problem description · Data imbalance problem · Strategy proposed · Results and conclusions

Improving Electric fraud detection using class imbalance strategies

  • Eng. Federico Decia
  • Eng. Matías Di Martino
  • Eng. Juan Molinelli
  • Prof. Alicia Fernández

Instituto de Ingeniería Eléctrica, Facultad de Ingeniería Universidad de la República Montevideo, Uruguay.

SLIDE 2

Introduction

SLIDE 3

Problem description

Nontechnical losses represent a very high cost to power supply companies.

Background: research has been carried out in several countries to tackle this problem:

  • Ramos et al., 2010 ← Brazil
  • Nagi and Mohamad, 2010 ← Malaysia
  • Muniz et al., 2009 ← Rio de Janeiro, Brazil

Uruguay: the national electric power company (henceforth called UTE) faces the problem by manually monitoring a group of customers.

SLIDE 4

Problem description

Difficulties

  • Large number of customers (the capital city alone has 500,000).
  • Wide variety of scams and ways to tamper with consumption meters.

SLIDE 5

Problem description

Other factors:

  • Fraud history
  • Building address and dimensions
  • Counter type

SLIDE 6

Problem description

Objective: develop an automatic tool that, based on manually labeled data, detects new suspect customers.

SLIDE 7

Problem description

Data Description

SLIDE 8

Data

DATASET 1

  • 1504 industrial profiles (October 2004 – September 2009)
  • Each profile is represented by the customer's monthly consumption.
  • Labels for each customer are provided by UTE.
  • Used for training and theoretical evaluation.

DATASET 2

  • 3300 industrial profiles (January 2008 – January 2011)
  • Each profile is represented by the customer's monthly consumption.
  • Used for on-field evaluation.

SLIDE 9

Data

SLIDE 10

Class imbalance problem

The class imbalance problem
When working in the fraud detection field, one cannot assume that the number of people who commit fraud is the same as the number who do not; there are usually far fewer samples in the fraud class.


SLIDE 13

Class imbalance problem

Strategies:

  • Under-sampling
  • One-Class SVM and Cost-Sensitive SVM
  • Recall, Precision and F-value
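As an illustration, the under-sampling strategy can be sketched in a few lines: keep every minority-class (fraud) sample and draw a random subset of the majority class. The function name, labels and `ratio` parameter are illustrative, not taken from the slides.

```python
import random

def undersample(X, y, ratio=1.0, seed=0):
    """Keep all minority (label 1) samples and a random subset of
    majority (label 0) samples of size ratio * |minority|."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]  # fraud
    neg = [i for i, label in enumerate(y) if label == 0]  # normal
    keep = rng.sample(neg, min(len(neg), int(ratio * len(pos))))
    idx = sorted(pos + keep)
    return [X[i] for i in idx], [y[i] for i in idx]

# 10 fraud profiles among 100 customers -> balanced 10/10 training set
X = [[float(i)] for i in range(100)]
y = [1] * 10 + [0] * 90
Xb, yb = undersample(X, y)
```

Discarding data is the price of this strategy, which is why the deck also considers cost-sensitive and one-class formulations that use all samples.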

SLIDE 14

Class imbalance problem

Recall_p = TP / (TP + FN)
Recall_n = TN / (TN + FP)
Precision = TP / (TP + FP)
F-value = (1 + β²) · Recall_p · Precision / (β² · Recall_p + Precision)

                    Labeled as Positive    Labeled as Negative
Class Positive      TP (True Positive)     FN (False Negative)
Class Negative      FP (False Positive)    TN (True Negative)

Table: Confusion matrix
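These metrics follow directly from the confusion matrix; a minimal sketch, with the F-value reducing to the usual F1 score when β = 1:

```python
def recall_p(tp, fn):
    # recall on the positive (fraud) class
    return tp / (tp + fn)

def recall_n(tn, fp):
    # recall on the negative class
    return tn / (tn + fp)

def precision(tp, fp):
    return tp / (tp + fp)

def f_value(tp, fp, fn, beta=1.0):
    # weighted harmonic mean of positive recall and precision
    rp, pr = recall_p(tp, fn), precision(tp, fp)
    return (1 + beta ** 2) * rp * pr / (beta ** 2 * rp + pr)

# e.g. tp=50, fp=150, fn=50: recall 0.5, precision 0.25, F1 = 1/3
score = f_value(50, 150, 50)
```

On imbalanced data the F-value is a more informative target than accuracy, since a classifier that labels everything negative gets high accuracy but an undefined or zero F-value.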

SLIDE 15

Class imbalance problem

Strategy proposed

SLIDE 16

Block diagram

Block Diagram

Figure: Block Diagram

The system input corresponds to the last three years of each customer's monthly consumption curve.

SLIDE 17

Block diagram

Features

SLIDE 18

Features

Consumption ratios for the last 3, 6 and 12 months relative to the average consumption.
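One possible reading of this feature, sketched under the assumption that a profile is a list of monthly consumption values (the helper name and `windows` parameter are illustrative):

```python
def consumption_ratios(monthly, windows=(3, 6, 12)):
    """Mean consumption over the last w months divided by the
    overall mean consumption of the profile."""
    overall = sum(monthly) / len(monthly)
    return [(sum(monthly[-w:]) / w) / overall for w in windows]

# A flat consumer yields ratios of 1.0; a recent drop (possible fraud)
# pushes the short-window ratios below 1.0.
flat = consumption_ratios([5.0] * 36)
```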

SLIDE 19

Features

Norm of the difference between the expected consumption and the actual consumption.

Figure: Consumptions are estimated from the consumption of the same month of the previous year, multiplied by the ratio between the mean consumptions of the two years.
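Following the caption's estimation rule, a sketch of this feature might look like the following (assumes at least 24 months of data; names are illustrative):

```python
import math

def deviation_norm(monthly):
    """Norm of (actual - expected) over the last 12 months, where each
    expected value is the same month of the previous year scaled by the
    ratio between the two yearly means."""
    actual, previous = monthly[-12:], monthly[-24:-12]
    scale = (sum(actual) / 12.0) / (sum(previous) / 12.0)
    expected = [c * scale for c in previous]
    return math.sqrt(sum((a - e) ** 2 for a, e in zip(actual, expected)))

# A customer whose curve simply doubled keeps its shape: deviation 0.
d = deviation_norm(list(range(1, 13)) + [2 * v for v in range(1, 13)])
```

Because the expected curve is rescaled to the current yearly mean, this feature reacts to changes in the *shape* of the seasonal pattern rather than to uniform growth or shrinkage.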

SLIDE 20

Features

Difference between the Fourier and wavelet coefficients of the last year and those of the previous years.
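A sketch of the Fourier half of this feature using NumPy; the wavelet counterpart is omitted since it would need an extra library such as PyWavelets, and the number of coefficients compared is an assumption:

```python
import numpy as np

def fourier_feature(monthly, n_coef=5):
    """Distance between the leading FFT coefficient magnitudes of the
    last year and of the previous year."""
    last = np.abs(np.fft.rfft(monthly[-12:]))[:n_coef]
    prev = np.abs(np.fft.rfft(monthly[-24:-12]))[:n_coef]
    return float(np.linalg.norm(last - prev))

# Two identical years -> identical spectra -> feature value 0.
year = [1, 3, 2, 5, 4, 6, 2, 1, 3, 4, 5, 2]
same = fourier_feature(year * 2)
```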

SLIDE 21

Features

Difference between the coefficients of the best-fit polynomial

Figure: Difference in the coefficients of the polynomial that best fits the consumption curve.

SLIDE 22

Features

Distance to the mean customer

Figure: We compare each consumption curve with the mean curve.
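A minimal sketch of this distance, assuming all profiles have the same length:

```python
import numpy as np

def distance_to_mean(profiles):
    """Euclidean distance from each consumption curve to the mean
    curve taken over all customers."""
    P = np.asarray(profiles, dtype=float)
    return np.linalg.norm(P - P.mean(axis=0), axis=1)

# Two symmetric profiles are equally far from their mean.
dists = distance_to_mean([[0.0, 0.0], [2.0, 2.0]])
```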

SLIDE 23

Features

Variance

Figure: Changes in the variance value.
Figure: Comparison with the normal variance value.

SLIDE 24

Features

Global Characteristic

Figure: Modulus of the first five Fourier coefficients.
Figure: Slope of the straight line that best fits the consumption curve.

SLIDE 25

Feature selection

Feature selection

Evaluation methods used:
  • Filter → CfsSubsetEval (Weka)
  • Wrapper, aiming to improve the F-value of the different classifiers considered

Search methods used:
  • Exhaustive search (only for CfsSubsetEval)
  • Best First (for all other methods)

SLIDE 26

Feature selection

Classifiers

SLIDE 27

Classifiers

Classifiers used to tackle this problem:

  • One-Class Support Vector Machine (O-SVM)
  • Cost-Sensitive Support Vector Machine (CS-SVM)
  • Optimum Path Forest (OPF)
  • C4.5 decision tree

SLIDE 28

Classifiers

SVM parameters

  • Compromise parameter: O-SVM → ν; CS-SVM → C
  • Kernel: Gaussian → γ

Optimal parameters were found using 10-fold cross-validation; performance was measured by the F-value.
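A rough scikit-learn analogue of this parameter search, offered only as a sketch: `SVC` with `class_weight` stands in for the cost-sensitive penalty, synthetic data stands in for the real consumption features, and `scoring="f1"` is the F-value with β = 1.

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

# Synthetic imbalanced data standing in for the consumption features.
X, y = make_classification(n_samples=300, weights=[0.9, 0.1],
                           random_state=0)

# Gaussian (RBF) kernel; grid over C and gamma, 10-fold CV on F-value.
search = GridSearchCV(
    SVC(kernel="rbf", class_weight="balanced"),
    param_grid={"C": [0.1, 1.0, 10.0], "gamma": ["scale", 0.1]},
    scoring="f1",
    cv=10,
)
search.fit(X, y)
```

The one-class variant would instead tune ν on `sklearn.svm.OneClassSVM`, which is trained on a single class.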

SLIDE 29

Classifiers

Optimum Path Forest

  • Euclidean distance was used as the distance function.
  • Raw input vectors were used (instead of the features proposed here).
  • The majority class is under-sampled during the training step of the algorithm to improve the final performance (in the F-value sense).

SLIDE 30

Classifiers

C4.5
The fourth classifier used is the decision tree proposed by Ross Quinlan: C4.5. As with OPF, we under-sample the majority class during the training step of the algorithm to improve the final performance (in the F-value sense).

SLIDE 31

Classifiers

Combination

SLIDE 32

Classifiers

Combination

Why?
  • Improve final performance
  • More robust and general solution

Decision rule:

g_p(x) = λp_os·dp_os + λp_cs·dp_cs + λp_OPF·dp_OPF + λp_Tree·dp_Tree   (1)
g_n(x) = λn_os·dn_os + λn_cs·dn_cs + λn_OPF·dn_OPF + λn_Tree·dn_Tree   (2)

where di_j(x) = 1 if classifier j labels the sample as class i, and 0 otherwise.

g_p(x) > g_n(x) → x labeled as positive
g_p(x) < g_n(x) → x labeled as negative
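The decision rule of equations (1)–(2) can be sketched as follows; the weight values below are illustrative (the slides do not give numeric weights), and ties are resolved to negative as an assumption since the slides only define the strict cases:

```python
def combine(labels, lam_p, lam_n):
    """Weighted vote: labels maps classifier name -> 'p' or 'n';
    d_i_j = 1 exactly when classifier j voted class i, so each sum
    collects the weights of the classifiers voting that class."""
    g_p = sum(lam_p[j] for j, lab in labels.items() if lab == 'p')
    g_n = sum(lam_n[j] for j, lab in labels.items() if lab == 'n')
    return 'p' if g_p > g_n else 'n'  # tie -> 'n' (assumption)

votes = {'os': 'p', 'cs': 'n', 'OPF': 'p', 'Tree': 'n'}
lam_p = {'os': 0.9, 'cs': 0.8, 'OPF': 0.7, 'Tree': 0.6}
lam_n = {'os': 0.3, 'cs': 0.3, 'OPF': 0.3, 'Tree': 0.3}
# g_p = 0.9 + 0.7 = 1.6 > g_n = 0.3 + 0.3 = 0.6 -> positive
decision = combine(votes, lam_p, lam_n)
```

Separate positive and negative weights let the rule trade Recall against Precision per classifier, which a single accuracy weight cannot do.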

SLIDE 33

Classifiers

Weights

Traditional case (Kuncheva, 2004), weighted majority vote rule:
  • Hypothesis: independence
  • Objective: maximize overall accuracy

λ_j = log( Accuracy_j / (1 − Accuracy_j) )

where Accuracy_j is the ratio of correctly classified samples for the j-th classifier.

SLIDE 34

Classifiers

Example: first and second classifiers

First/Second        Labeled as Positive   Labeled as Negative
Class Positive      0                     100
Class Negative      0                     9900

Third classifier

Third               Labeled as Positive   Labeled as Negative
Class Positive      50                    50
Class Negative      150                   9750

The first two classifiers never label a sample as positive, yet their accuracy (99%) is higher than that of the third (98%), so accuracy-based weights favour the useless classifiers.

SLIDE 35

Classifiers

Weights

With this in mind, but taking into account that we want a solution with a good balance between Recall and Precision, the following weights λ^{p,n}_j were proposed:

  • λ^i_j = log( (1 + Recall^p_j) / (1 − Recall^p_j) )
  • λ^i_j = log( (1 + Fvalue_j) / (1 − Fvalue_j) )
  • λ^i_j = log( Accuracy_j / (1 − Accuracy_j) )
  • λ^p_j = Recall^n_j and λ^n_j = Recall^p_j
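A sketch of two of these weightings: the accuracy-based log-ratio and the recall-swap pair (the Recall- and F-value-based log-ratio forms follow the same pattern). Function names are illustrative.

```python
import math

def weight_accuracy(acc):
    """Accuracy-based weight (Kuncheva): log-odds of being correct.
    Zero at 50% accuracy, growing without bound as accuracy -> 1."""
    return math.log(acc / (1.0 - acc))

def weights_recall_swap(recall_p, recall_n):
    """Last proposed pair: the positive weight is the negative-class
    recall and vice versa, so a classifier that rarely cries wolf
    (high Recall_n) gets a strong say in positive votes."""
    return recall_n, recall_p
```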

SLIDE 36

Classifiers

Optimal weights

  • λ_i ∈ [0 : 0.05 : 1]
  • Exhaustive search
  • F-value maximization (10-fold cross-validation)

Results: every alternative for combining the classifiers' outputs improved on the individual classifiers' performance.

SLIDE 37

Classifiers

Conclusions

SLIDE 38

Results

Labeling results on DATASET 1

Description     Acc. (%)   Rec. (%)   Pre. (%)   F-val. (%)
O-SVM           84.9       54.9       50.8       52.8
CS-SVM          84.5       62.8       49.7       55.5
OPF             80.1       62.2       40.5       49.0
Tree (C4.5)     79.0       64.6       39.0       48.6
Combination     86.2       64.0       54.4       58.8

Table: Data Set 1 labeling results

SLIDE 39

Results

On-field results
Tests were done in the following way:

1. Train the classification algorithm using DATASET 1.
2. Classify samples from DATASET 2. Let's call DATASET 2P the samples of DATASET 2 labeled as positive (associated with abnormal consumption behaviour).
3. Inspect the customers in DATASET 2P.

SLIDE 40

Results

On-field results

  • 340 samples of DATASET 2P inspected (340/560)
  • 11 fraudulent activities detected
  • 4 suspect activities detected (currently being analyzed)


SLIDE 42

Results

Results analysis: the automatic framework achieved a hit rate between 3.3% and 4.4%. Manual fraud detection performed by UTE's experts during 2010 had a hit rate of about 4%.

SLIDE 43

Conclusions

Conclusions

Results are promising, especially taking into account that manual detection considers more information than just the consumption curve, such as fraud history, building dimensions and contracted power, among others. Our software could complement and improve the experts' knowledge.

SLIDE 44

Future work

Future work

Improve the final performance and monitor larger customer sets, aiming to reach all customers in Uruguay. For example, by adding more features to our learning algorithm, such as:

  • Counter type (digital or analog)
  • Customer type (dwelling or industrial)
  • Contracted power

SLIDE 45

Future work

Thank you