Introduction & Motivation, Bart Baesens, Professor Data Science

SLIDE 1

DataCamp Fraud Detection in R

Introduction & Motivation

FRAUD DETECTION IN R

Bart Baesens

Professor Data Science at KU Leuven

SLIDE 2

Instructors

SLIDE 5

What is fraud?

Fraud is an uncommon, well-considered, imperceptibly concealed, time-evolving and often carefully organized crime which appears in many types and forms.

SLIDE 6

Impact of fraud

Fraud is very rare, but the cost of not detecting it can be huge! Examples:
- Organizations lose 5% of their yearly revenues to fraud
- Money lost by businesses to fraud > $3.5 trillion each year
- Credit card companies lose approximately 7 cents per $100 of transactions due to fraud
- Fraud takes up 5-10% of the claim amounts paid for non-life insurance

SLIDE 7

Types of fraud

- Anti-money laundering
- Check fraud
- (Credit) card fraud
- Click fraud
- Customs fraud
- Counterfeit
- Identity theft
- Insurance fraud
- Mortgage fraud
- Non-delivery fraud
- Online fraud
- Product warranty fraud
- Tax evasion
- Telecommunication fraud
- Theft of inventory
- Threat
- Ticket fraud
- Transit fraud
- Wire fraud
- Workers compensation fraud

SLIDE 12

Key characteristics of successful fraud analytics models

- Statistical accuracy
- Interpretability
- Regulatory compliance
- Economical cost
- Complement expert-based approaches with data-driven techniques

SLIDE 15

Challenges of a fraud detection model

- Imbalance: e.g. in credit card fraud, typically < 0.5% of cases are frauds
- Operational efficiency: e.g. in credit card fraud, < 8 seconds decision time
- Avoid harassing good customers

SLIDE 16

Imbalanced data

- After a major storm, an insurance company received many claims
- Fraudulent claims are labeled with 1 and legitimate claims with 0
- The percentage of fraud cases in the data can be determined with the functions table() and prop.table()

Use prop.table(table()) to determine the percentage of fraud:

> prop.table(table(fraud_label))
     0      1
0.9911 0.0089
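
As a minimal sketch of the same idea, the snippet below builds a small hypothetical label vector and computes the class proportions with table() and prop.table(). The name toy_label and its 97/3 split are illustrative assumptions, not the storm-claims data.

```r
# Hypothetical labels: 1 = fraudulent claim, 0 = legitimate claim
toy_label <- c(rep(0, 97), rep(1, 3))

# Absolute number of claims per class
class_counts <- table(toy_label)

# Relative frequency per class (the proportions sum to 1)
class_props <- prop.table(class_counts)
print(class_props)

# Percentage of fraud cases in the toy data
fraud_pct <- 100 * as.numeric(class_props["1"])
print(fraud_pct)
```

A fraud percentage far below 50% is what the slides call imbalance.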

SLIDE 17

Imbalanced data

Visualize the imbalance with a pie chart:

> labels <- c("no fraud", "fraud")
> labels <- paste(labels, round(100*prop.table(table(fraud_label)), 2))
> labels <- paste0(labels, "%")
> pie(table(fraud_label), labels, col = c("blue", "red"),
      main = "Pie chart of storm claims")

SLIDE 18

Evaluation of supervised method: confusion matrix

SLIDE 19

Confusion matrix: claims example

Suppose no detection model is used, so all claims are considered legitimate. Use the function confusionMatrix() from the package caret:

> predictions <- rep.int(0, nrow(claims))
> predictions <- factor(predictions, levels = c(0, 1))  # same levels as fraud_label
> library(caret)
> confusionMatrix(data = predictions, reference = fraud_label)
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 614  14
         1   0   0

               Accuracy : 0.9777
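
The accuracy reported above can be checked by hand. The sketch below rebuilds the confusion matrix from the four printed counts in base R (no caret needed) and also computes the share of frauds that were caught; the object names are assumptions chosen for illustration.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference; matrix() fills column by column)
conf_mat <- matrix(c(614, 0, 14, 0), nrow = 2,
                   dimnames = list(Prediction = c("0", "1"),
                                   Reference  = c("0", "1")))

# Accuracy: correctly classified claims divided by all claims
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Share of actual frauds that were predicted as fraud
recall_fraud <- conf_mat["1", "1"] / sum(conf_mat[, "1"])

print(round(accuracy, 4))
print(recall_fraud)
```

The accuracy is high even though not a single fraud is detected, which is why accuracy alone is a poor measure on imbalanced data.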

SLIDE 20

Total cost of not detecting fraud: claims example

- Total cost of fraud is defined as the sum of the fraudulent amounts
- Total cost if no fraud is detected:

> total_cost <- sum(claim_amount[fraud_label == 1])  # fraudulent claims are labeled 1
> print(total_cost)
[1] 2301508

SLIDE 21

Let's practice!

FRAUD DETECTION IN R

SLIDE 22

Time features

FRAUD DETECTION IN R

Bart Baesens

Professor Data Science at KU Leuven

SLIDE 23

Analyzing time

- Certain events are expected to occur at similar moments in time
- Example: a customer making transactions at similar hours
- Aim: capture information about the time aspect with meaningful features
- Dealing with time can be tricky:
  - 00:00 = 24:00
  - no natural ordering: is 23:00 < 01:00 or 23:00 > 01:00?

SLIDE 24

Mean of timestamps

- Do not use the arithmetic mean to compute an average timestamp!
- Example: transactions made at 01:00, 02:00, 21:00 and 22:00. The arithmetic mean is 11:30, but no transfer was made close to that time!
- Convert digital timestamps to decimal format in hours:

> data(timestamps)
> head(timestamps)
[1] "20:27:28" "21:08:41" "01:30:16" "00:57:04" "23:12:14" "22:54:16"
> library(lubridate)
> ts <- as.numeric(hms(timestamps)) / 3600
> head(ts)
[1] 20.4577778 21.1447222  1.5044444  0.9511111 23.2038889 22.9044444
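
To make the problem with the arithmetic mean concrete, here is a small base R sketch for the four example times above. It uses the standard circular mean (the direction of the average unit vector) as an assumed stand-in for a periodic mean; the variable names are illustrative.

```r
# Transaction times from the example, in decimal hours
hours <- c(1, 2, 21, 22)

# The arithmetic mean ignores that the clock wraps around at 24:00
arithmetic_mean <- mean(hours)

# Map the hours to angles on a 24-hour clock (in radians)
angles <- 2 * pi * hours / 24

# Circular mean: direction of the average of the unit vectors
mean_angle <- atan2(mean(sin(angles)), mean(cos(angles)))

# Convert the angle back to hours in [0, 24)
circular_mean <- (mean_angle * 24 / (2 * pi)) %% 24

print(arithmetic_mean)  # 11.5, i.e. 11:30
print(circular_mean)    # 23.5, i.e. 23:30
```

The circular mean falls at 23:30, in the middle of the late-evening and early-morning transactions, whereas 11:30 is far from all of them.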

SLIDE 25

Circular histogram

> library(ggplot2)
> clock <- ggplot(data.frame(ts), aes(x = ts)) +
    geom_histogram(breaks = seq(0, 24), colour = "blue", fill = "lightblue") +
    coord_polar()
> arithmetic_mean <- mean(ts)
> clock + geom_vline(xintercept = arithmetic_mean,
                     linetype = 2, color = "red", size = 2)

SLIDE 26

Circular histogram with arithmetic mean

SLIDE 27

von Mises distribution

- Model time as a periodic variable using the von Mises probability distribution (Correa Bahnsen et al., 2016)
- Periodic normal distribution = normal distribution wrapped around a circle
- von Mises distribution of a set of timestamps $D = \{t_1, t_2, \ldots, t_n\}$: $D \sim \text{vonMises}(\mu, \kappa)$
- $\mu$: periodic mean, a measure of location; the distribution is clustered around $\mu$
- $1/\kappa$: periodic variance; $\kappa$ is a measure of concentration
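
For reference, the von Mises density of an angle θ is exp(κ cos(θ − μ)) / (2π I0(κ)), where I0 is the modified Bessel function of order 0. The base R sketch below writes this out and checks numerically that it integrates to 1 over the circle; the helper name dvm and the parameter values are assumptions for illustration, and angles are in radians rather than hours.

```r
# von Mises density for angle theta (radians), mean direction mu,
# and concentration kappa
dvm <- function(theta, mu, kappa) {
  exp(kappa * cos(theta - mu)) / (2 * pi * besselI(kappa, nu = 0))
}

mu <- 0      # illustrative periodic mean
kappa <- 2   # illustrative concentration

# The density integrates to 1 over one full turn of the circle
total_mass <- integrate(dvm, lower = -pi, upper = pi,
                        mu = mu, kappa = kappa)$value

# The density is highest at mu and lowest on the opposite side
peak   <- dvm(mu, mu, kappa)
trough <- dvm(mu + pi, mu, kappa)

print(total_mass)
print(c(peak = peak, trough = trough))
```

A timestamp in hours can be mapped to an angle with 2 * pi * hours / 24 before calling such a function.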

SLIDE 28

Estimating parameters μ and κ

# Convert the decimal timestamps to class "circular"
> library(circular)
> ts <- circular(ts, units = "hours", template = "clock24")
> head(ts)
Circular Data:
[1] 20.457889 21.144607 1.504422 0.950982 23.203917 4.904397
> estimates <- mle.vonmises(ts)
> p_mean <- estimates$mu %% 24
> concentration <- estimates$kappa

SLIDE 29

Circular histogram with periodic mean

SLIDE 30

Confidence interval

- Extract new features: confidence interval for the time of a transaction
- $S = \{x_i^{time} \mid i = 1, \ldots, n\}$: set of transactions made by the same customer
- (1) Estimate $\mu(S)$ and $\kappa(S)$ based on $S$
- (2) Calculate the density (= likelihood) of the timestamps for the estimated von Mises distribution:

> estimates <- mle.vonmises(ts)
> p_mean <- estimates$mu %% 24
> concentration <- estimates$kappa
> densities <- dvonmises(ts, mu = p_mean, kappa = concentration)

SLIDE 31

Feature extraction

- Binary feature that indicates whether a new transaction time is within the confidence interval (CI) with probability α (e.g. 0.90, 0.95)
- A timestamp is within the 90% CI if its density is larger than the cutoff value
- Binary time feature: TRUE if the timestamp lies inside the CI, FALSE otherwise

> alpha <- 0.90
> quantile <- qvonmises((1 - alpha)/2, mu = p_mean, kappa = concentration) %% 24
> cutoff <- dvonmises(quantile, mu = p_mean, kappa = concentration)
> time_feature <- densities >= cutoff
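
The density cutoff can also be illustrated without the circular package. The sketch below is an assumption-laden stand-in, not the slides' implementation: it defines a von Mises density by hand, finds the symmetric arc around the periodic mean that holds probability α by numerical root finding, and uses the density at the edge of that arc as the cutoff. The helper names (dvm, arc_mass) and the parameter values are hypothetical.

```r
# Hand-written von Mises density (angles in radians)
dvm <- function(theta, mu, kappa) {
  exp(kappa * cos(theta - mu)) / (2 * pi * besselI(kappa, nu = 0))
}

mu    <- 2 * pi * 22 / 24  # assumed periodic mean at 22:00, as an angle
kappa <- 3                 # assumed concentration
alpha <- 0.90              # confidence level

# Probability mass of the symmetric arc [mu - delta, mu + delta]
arc_mass <- function(delta) {
  integrate(dvm, lower = mu - delta, upper = mu + delta,
            mu = mu, kappa = kappa)$value
}

# Half-width of the arc that holds probability alpha
delta <- uniroot(function(d) arc_mass(d) - alpha,
                 interval = c(1e-6, pi), tol = 1e-8)$root

# Density at the edge of the arc is the cutoff value
cutoff <- dvm(mu + delta, mu, kappa)

# Binary time feature for three hypothetical new transaction times (hours)
new_hours    <- c(21.5, 23, 10)
new_angles   <- 2 * pi * new_hours / 24
time_feature <- dvm(new_angles, mu, kappa) >= cutoff
print(time_feature)
```

Because the density is symmetric and unimodal around the mean, comparing densities with the cutoff is equivalent to checking whether the time lies inside the arc.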

SLIDE 32

Confidence interval

SLIDE 33

Confidence interval

SLIDE 34

Example

SLIDE 35

Confidence interval with moving time window

> print(ts)
[1] 18.42 20.45 20.88 0.75 19.20 23.65 6.08
> time_feature <- c(NA, NA)
> for (i in 3:length(ts)) {
    # Previous timestamps
    ts_history <- ts[1:(i-1)]
    # Estimate mu and kappa on historic timestamps
    estimates <- mle.vonmises(ts_history)
    p_mean <- estimates$mu %% 24
    concentration <- estimates$kappa
    # Estimate density of current timestamp
    dens_i <- dvonmises(ts[i], mu = p_mean, kappa = concentration)
    # Check if density is larger than cutoff with confidence level 90%
    alpha <- 0.90
    quantile <- qvonmises((1-alpha)/2, mu = p_mean, kappa = concentration) %% 24
    cutoff <- dvonmises(quantile, mu = p_mean, kappa = concentration)
    time_feature[i] <- dens_i >= cutoff
  }
> print(time_feature)
[1] NA NA TRUE FALSE TRUE TRUE FALSE

SLIDE 36

Let's practice!

FRAUD DETECTION IN R

SLIDE 37

Frequency features

FRAUD DETECTION IN R

Tim Verdonck

Professor Data Science at KU Leuven

SLIDE 38

Need for additional features

Transfers made by Alice & Bob:

> trans %>% select(fraud_flag, account_name, benef_country,
                   authentication_cd, channel_cd, amount)
    fraud_flag account_name benef_country authentication_cd channel_cd amount
1            0          Bob         ISO03              AU02       CH07    549
2            0        Alice         ISO03              AU03       CH04     37
3            0          Bob         ISO03              AU04       CH07     25
4            0          Bob         ISO03              AU02       CH06     25
5            0        Alice         ISO03              AU01       CH07     13
6            0          Bob         ISO03              AU02       CH06    785
7            0        Alice         ISO03              AU03       CH04     49
8            0          Bob         ISO03              AU02       CH07     35
...
36           0        Alice         ISO03              AU05       CH04    126
37           0          Bob         ISO03              AU02       CH06     22
38           0        Alice         ISO03              AU03       CH04     41
39           1          Bob         ISO03              AU03       CH05   3779
40           1        Alice         ISO03              AU04       CH05   1531

SLIDE 40

Alice's & Bob's profile

Authentication methods used by Alice:

                 fraud_flag
authentication_cd 0 1
             AU01 6 0
             AU02 0 0
             AU03 7 0
             AU04 0 1
             AU05 9 0

Authentication methods used by Bob:

                 fraud_flag
authentication_cd 0 1
             AU01 1 0
             AU02 8 0
             AU03 0 1
             AU04 7 0
             AU05 0 0

SLIDE 43

Frequency feature for one account

Arrange the data according to time and select Alice's data:

> library(dplyr)
> trans <- trans %>% arrange(timestamp)
> trans_Alice <- trans %>% filter(account_name == "Alice")

Alice's first transaction:

steps authentication_cd freq_auth
                   AU03         0

SLIDE 47

Frequency feature for one account (step 1)

Step 1: create the function frequency_fun. It counts the number of previous transfers with the same authentication method as the current one:

> frequency_fun <- function(steps, auth_method) {
    n <- length(steps)
    frequency <- sum(auth_method[1:n] == auth_method[n + 1])
    return(frequency)
  }

steps authentication_cd freq_auth
                   AU03         0
    1              AU03         1
    2              AU03         2
    3              AU01         0
    4              AU01         1
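
To see what frequency_fun returns at each step, the sketch below calls it in a plain for loop on Alice's first five authentication codes from the table above. The loop is an illustrative assumption that stands in for a rolling-window helper; steps is simply the set of previous transfer positions.

```r
# Function from the slide: count previous transfers with the same
# authentication method as the current one
frequency_fun <- function(steps, auth_method) {
  n <- length(steps)
  frequency <- sum(auth_method[1:n] == auth_method[n + 1])
  return(frequency)
}

# Alice's first five authentication codes, ordered by time
auth <- c("AU03", "AU03", "AU03", "AU01", "AU01")

# The first transfer has no history, so its frequency is 0
freq_auth <- 0
for (i in 2:length(auth)) {
  # steps = positions of the i - 1 previous transfers
  freq_auth[i] <- frequency_fun(steps = seq_len(i - 1), auth_method = auth)
}

print(data.frame(authentication_cd = auth, freq_auth = freq_auth))
```

The resulting column (0, 1, 2, 0, 1) matches the freq_auth values shown in the table.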

SLIDE 49

Frequency feature for one account (step 2 & 3)

Step 2: use rollapply from the package zoo
Step 3: the frequency feature starts with a zero

> library(zoo)
> freq_auth <- rollapply(trans_Alice$transfer_id,
                         width = list(-1:-length(trans_Alice$transfer_id)),
                         partial = TRUE,
                         FUN = frequency_fun,
                         trans_Alice$authentication_cd)
> freq_auth <- c(0, freq_auth)

SLIDE 50

Result!

   authentication_cd freq_auth fraud_flag
1               AU03         0          0
2               AU03         1          0
3               AU03         2          0
4               AU01         0          0
5               AU01         1          0
6               AU05         0          0
7               AU05         1          0
8               AU05         2          0
9               AU01         2          0
10              AU05         3          0
11              AU05         4          0
12              AU05         5          0
13              AU03         3          0
14              AU05         6          0
15              AU01         3          0
16              AU05         7          0
17              AU03         4          0
18              AU01         4          0
19              AU01         5          0
20              AU03         5          0
21              AU05         8          0
22              AU03         6          0
23              AU04         0          1

SLIDE 51

For multiple accounts

Step 1: group the data by account_name
Step 2: use group_by() and mutate() from the dplyr package

> trans %>% group_by(account_name)
> trans <- trans %>%
    group_by(account_name) %>%
    mutate(freq_auth = c(0, rollapplyr(transfer_id,
                                       width = list(-1:-length(transfer_id)),
                                       partial = TRUE,
                                       FUN = frequency_fun,
                                       authentication_cd)))
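
As a cross-check of the grouped computation, the sketch below reproduces the frequency feature in base R with ave(), which is a swapped-in alternative to the dplyr and zoo pipeline, not the slides' method. The data frame toy is hypothetical; its eight rows copy the first eight rows of the multi-account result table in this deck.

```r
# Hypothetical data: first eight transfers, ordered by time
toy <- data.frame(
  account_name      = c("Bob", "Alice", "Bob", "Bob",
                        "Alice", "Bob", "Alice", "Bob"),
  authentication_cd = c("AU02", "AU03", "AU04", "AU02",
                        "AU01", "AU02", "AU03", "AU02"),
  stringsAsFactors  = FALSE
)

# Within each (account, authentication method) pair, the k-th occurrence
# has k - 1 previous transfers with the same method
toy$freq_auth <- ave(
  seq_len(nrow(toy)),
  toy$account_name, toy$authentication_cd,
  FUN = function(idx) seq_along(idx) - 1L
)

print(toy)
```

This shortcut works because the rows are already sorted by time, so the running count within each pair equals the number of earlier transfers with the same method.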

SLIDE 52

Result for multiple accounts

   account_name authentication_cd freq_auth fraud_flag
1           Bob              AU02         0          0
2         Alice              AU03         0          0
3           Bob              AU04         0          0
4           Bob              AU02         1          0
5         Alice              AU01         0          0
6           Bob              AU02         2          0
7         Alice              AU03         1          0
8           Bob              AU02         3          0
9         Alice              AU01         1          0
10          Bob              AU04         1          0
11          Bob              AU02         4          0
12        Alice              AU01         2          0
13        Alice              AU05         0          0
14        Alice              AU05         1          0
15        Alice              AU05         2          0
16          Bob              AU02         5          0
17          Bob              AU04         2          0
18          Bob              AU02         6          0
...
37          Bob              AU02         7          0
38        Alice              AU03         5          0
39          Bob              AU03         0          1
40        Alice              AU04         0          1

SLIDE 53

Let's practice!

FRAUD DETECTION IN R

SLIDE 54

Recency features

FRAUD DETECTION IN R

Tim Verdonck

Professor Data Science at KU Leuven

SLIDE 55

Authentication method vs time

SLIDE 56

Large time interval

SLIDE 57

Small time interval

SLIDE 58

Zero recency

SLIDE 59

Anomalous behavior

SLIDE 60

Definition

$\text{recency} = \exp(-\gamma \cdot t) = e^{-\gamma t}$

- $t$ = time interval between two consecutive events of the same type
- $\gamma$ = tuning parameter, typically close to 0 (e.g. 0.01, 0.02, 0.05)
- $0 \leq \text{recency} \leq 1$

SLIDE 61

Recency vs time

SLIDE 62

How to choose parameter γ?

(1) Choose how small the recency should be (e.g. 0.01) after a certain amount of time t
(2) Calculate γ = −log(recency)/t

Example: set γ such that recency = 0.01 after t = 180 days

> gamma <- -log(0.01)/180
> gamma
[1] 0.02558428
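
The formula for γ follows from solving recency = exp(−γ t) for γ. The short sketch below wraps both directions in two helper functions and checks that the chosen γ gives a recency of 0.01 after 180 days; the helper names recency and gamma_for are assumptions for illustration.

```r
# Recency as a function of the time interval t and tuning parameter gamma
recency <- function(t, gamma) exp(-gamma * t)

# Solve recency = exp(-gamma * t) for gamma
gamma_for <- function(target_recency, t) -log(target_recency) / t

# Recency should have dropped to 0.01 after 180 days
gamma <- gamma_for(target_recency = 0.01, t = 180)
print(gamma)

# Recency decays from 1 (t = 0) towards 0 as the interval grows
decay <- recency(c(0, 30, 90, 180), gamma)
print(round(decay, 3))
```

A larger γ makes the recency fade faster, so older events of the same type count for less.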

SLIDE 63

Recency feature in R (step 1)

recency_fun <- function(t, gamma, auth_cd, freq_auth) {
  n_t <- length(t)
  if (freq_auth[n_t] == 0) {
    recency <- 0 # recency = 0 when frequency = 0
  } else {
    # time-interval = current timestamp
    #   - timestamp of previous transfer with same auth_cd
    time_diff <- t[1] - max(t[2:n_t][auth_cd[(n_t-1):1] == auth_cd[n_t]])
    recency <- exp(-gamma * time_diff)
  }
  return(recency)
}
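
The logic of recency_fun can be traced with a plain loop. The sketch below is a simplified base R re-implementation under stated assumptions, not the slides' rolling-window code: for each transfer it looks up the most recent earlier transfer with the same authentication method. The timestamps and codes are Bob's first seven transfers as listed in the recency result table of this deck.

```r
gamma <- -log(0.01) / 180

# Bob's first seven transfers, ordered by time
timestamp <- c(44.25, 57.45, 64.29, 64.29, 70.25, 74.08, 74.08)
auth_cd   <- c("AU02", "AU04", "AU02", "AU02", "AU02", "AU04", "AU02")

rec_auth <- numeric(length(timestamp))
for (i in seq_along(timestamp)) {
  # Earlier transfers that used the same authentication method
  prev <- which(auth_cd[seq_len(i - 1)] == auth_cd[i])
  if (length(prev) == 0) {
    # No earlier transfer of this type: recency is 0 by definition
    rec_auth[i] <- 0
  } else {
    # Time since the most recent transfer with the same method
    time_diff <- timestamp[i] - max(timestamp[prev])
    rec_auth[i] <- exp(-gamma * time_diff)
  }
}

print(round(rec_auth, 3))
```

The rounded values agree with Bob's rec_auth entries in the result table (0, 0, 0.599, 1, 0.859, 0.653, 0.907).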

SLIDE 64

Recency feature in R (step 2)

(1) Choose a value for γ
(2) Use rollapply(), group_by(), and mutate()

> gamma <- -log(0.01)/180 # = 0.0256
> library(dplyr) # needed for group_by() and mutate()
> library(zoo)   # needed for rollapply()
> trans <- trans %>%
    group_by(account_name) %>%
    mutate(rec_auth = rollapply(timestamp,
                                width = list(0:-length(transfer_id)),
                                partial = TRUE,
                                FUN = recency_fun,
                                gamma, authentication_cd, freq_auth))

SLIDE 65

Result!

   account_name timestamp authentication_cd rec_auth fraud_flag
1           Bob     44.25              AU02    0.000          0
2         Alice     54.12              AU03    0.000          0
3           Bob     57.45              AU04    0.000          0
4           Bob     64.29              AU02    0.599          0
5         Alice     64.29              AU03    0.771          0
6           Bob     64.29              AU02    1.000          0
7         Alice     70.25              AU03    0.859          0
8           Bob     70.25              AU02    0.859          0
9         Alice     74.08              AU01    0.000          0
10          Bob     74.08              AU04    0.653          0
11          Bob     74.08              AU02    0.907          0
12        Alice     83.93              AU01    0.777          0
13        Alice     96.21              AU05    0.000          0
14        Alice     96.21              AU05    1.000          0
15        Alice     98.25              AU05    0.949          0
16          Bob    109.27              AU02    0.406          0
17          Bob    123.89              AU04    0.280          0
18          Bob    155.95              AU02    0.303          0
...
37          Bob    407.17              AU02    0.002          0
38        Alice    420.17              AU03    0.717          0
39          Bob    441.34              AU03    0.000          1
40        Alice    443.24              AU04    0.000          1

SLIDE 66

Features based on time, frequency and recency

SLIDE 67

Let's practice!

FRAUD DETECTION IN R