SLIDE 1

Clustering Large Credit Client Data Sets for Classification with SVM

Ralf Stecking

University of Oldenburg Department of Economics

Klaus B. Schebesch

University “Vasile Goldiş” Arad, Faculty of Economics

Credit Scoring and Credit Control XI Conference, University of Edinburgh

26.08.2009

Stecking and Schebesch (CRC 2009) Clustering Large Credit Data Sets 26.08.2009 1 / 30

SLIDE 2

Overview

  • Motivation
  • Kernels and clustering
  • Preliminary evaluation of cluster based SVM
  • Multiple validation of cluster based SVM
  • Credit scoring and data clustering
  • Symbolic representation of credit client clusters
  • Symbolic SVM model building and evaluation
  • Conclusions and outlook

SLIDE 3

Motivation

In past work we used a medium sized empirical credit scoring data set with N = 658 credit clients, all having m = 40 input features, in order to analyze different aspects of model building with regard to out-of-sample classification performance. As base models we used SVM and other statistical learning methods like LDA, CART, and LogReg. Thereupon, model combinations of outputs of the base models were also investigated.

Gaining access to a N ≈ 140,000 data set with m = 23 features per credit client and with extremely asymmetric class distributions precludes case-by-case training of models like SVM. Further increasing N and “fusing” credit client information from different sources will worsen the situation.

SLIDE 4

Kernels and relations between pairs of cases (1)

A training set {yi | xi}i=1,...,N may contain labeled credit clients (e.g. yi ∈ {−1, 1}) or unlabeled ones (yi = 0 for all i, say). A kernel function kij(xi, xj) ≥ 0 describes a metric relation (inverted distance, etc.) between any two training feature vectors xi, xj with i, j ∈ {1, ..., N}. The implied numerical matrix Kij is usually meant to be symmetric. Individualized parameters for pairs of clients ij can impose conditions which may act

  • classwise (e.g. correct for asymmetric costs),
  • casewise (e.g. correct for case importance), or
  • interaction-wise (i.e. 2-interactions).

SLIDE 5

Kernels and relations between pairs of cases (2)

An instantiation of kij(xi, xj) would be the adaptation of the very powerful RBF kernel for nonlinear SVM: kij(xi, xj) = rij exp(−sij ||xi − xj||²), with rij ∈ {0, 1} a grouping relation and sij ≥ 0 the interaction sensitivity. By identically permuting the index sets of i and j, i.e. via ik and jk, the matrix (Kik jk) resulting from the kernel may be block-diagonal, indicating a cluster structure, which in turn may or may not be dependent on the labeling {yik}. The most popular RBF SVM simply uses rij = 1 and sij = s > 0, a constant, for all i, j.
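As a numerical illustration of the kernel above (a toy sketch; the data points and group assignments are invented for the example), restricting rij to within-group pairs produces exactly the block-diagonal kernel matrix described:

```python
import numpy as np

def adapted_rbf_kernel(X, r, s):
    """K_ij = r_ij * exp(-s_ij * ||x_i - x_j||^2).

    r: (N, N) 0/1 grouping relation, s: (N, N) interaction sensitivities.
    The standard RBF kernel is the special case r_ij = 1, s_ij = s constant."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return r * np.exp(-s * sq)

# Two groups of toy clients; r allows interactions only within a group,
# so the kernel matrix is block-diagonal (a cluster structure).
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
groups = np.array([0, 0, 1, 1])
r = (groups[:, None] == groups[None, :]).astype(float)
s = np.full((4, 4), 1.0)
K = adapted_rbf_kernel(X, r, s)
# K[0, 2] is exactly 0 (different groups); K[0, 1] is close to 1 (nearby pair).
```

Such a precomputed matrix could, for instance, be handed to an SVM solver that accepts user-supplied kernels.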

SLIDE 6

Kernels and relations between pairs of cases (3)

Being able to set (at least part of) the rij and sij parameters would convey domain knowledge into the problem formulation. Do we have such knowledge? Treating rij and sij as “slow” variables which are to be optimized along with the “fast” variables, i.e. the SVM duals and slacks, is hardly tractable (for empirically reasonable N). In order to approximate this task, we split the associated simultaneous problem into two consecutive tasks which are routinely tractable:

1 Cluster the data set by a fast standard method (with and without label pre-partitioning).

2 Apply the SVM kernel to the resulting cluster prototypes.
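The two consecutive tasks can be sketched roughly as follows (a simplified illustration with a plain Lloyd's k-means and invented toy data; in practice the labeled prototypes would then be passed to an SVM trainer, e.g. sklearn.svm.SVC, which is not shown here):

```python
import numpy as np

def kmeans(X, c, iters=50, seed=0):
    """Plain Lloyd's k-means; returns c prototype (cluster center) vectors."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), c, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for k in range(c):
            if np.any(labels == k):          # guard against empty clusters
                centers[k] = X[labels == k].mean(0)
    return centers

def classwise_prototypes(X, y, c):
    """Task 1 with label pre-partitioning: cluster each class separately
    and return 2c labeled prototypes, the reduced SVM training set."""
    protos, labels = [], []
    for cls in (-1, 1):
        protos.append(kmeans(X[y == cls], c))
        labels.append(np.full(c, cls))
    return np.vstack(protos), np.concatenate(labels)

# Toy data: 2c = 6 labeled prototypes stand in for 200 clients.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
y = np.concatenate([np.full(100, -1), np.full(100, 1)])
P, py = classwise_prototypes(X, y, c=3)
# P has shape (6, 2); task 2 would train an SVM on (P, py).
```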

SLIDE 7

What we do for this presentation

[Diagram: the data (y|X), an N × m matrix of credit clients, is reduced to a much smaller set of cluster representatives of dimension m + d (m original plus d derived variables); the SVM kernel k[X(i), X(j)] is then applied to these representatives.]

SLIDE 8

Some issues concerning cluster formation

Is there any cluster structure in the data? A clustering algorithm will issue “clusters” for every 1 ≤ c ≤ N! Supposing there is some (empirical) clustering: are the cluster shapes compact (spheroidal), elongated, or of mixed shape in high dimensions, and are they well separated? The clustering method can be one of the following:

  • completely unsupervised, or
  • constrained to some degree (“soft” penalty terms, using balancing and correlation arguments, constraints of “must-be” or “cannot-be” type, ..., can cluster members of predefined classes only, ...).

How to cluster labeled credit client data? Cluster the entire training data, or cluster the members of predefined classes only?

SLIDE 9

Different models trained on cluster prototypes

[Plot: misclassification rates (in percent, roughly 22–36%) of 200 restart clusterings, sorted by misclassification rate, comparing cluster-based and case-by-case trained SVM with different kernels: cluster RBF SVM, linear SVM, 2-polynomial SVM, RBF SVM, Coulomb SVM, and FTE RBF SVM.]

SLIDE 10

ROC curves of SVM with different kernels ...

[Plot: ROC curves (true positive rate vs. false positive rate) of the SVM models.]

Kernel | AUC | MXE
linear (cyan) | 0.833 | 0.494
polynomial (blue) | 0.846 | 0.505
rbf (green) | 0.856 | 0.497
coulomb (red) | 0.858 | 0.527
fte > rbf (black) | 0.875 | 0.478

SLIDE 11

... and ROC curves of cluster based RBF SVM

[Plot: ROC curves (true positive rate vs. false positive rate) of the SVM models, with the cluster based RBF SVM curves overlaid.]

Kernel | AUC | MXE
linear (black) | 0.833 | 0.494
polynomial (blue) | 0.846 | 0.505
rbf (green) | 0.856 | 0.497
coulomb (red) | 0.858 | 0.527
fte > rbf (dash) | 0.875 | 0.478

SLIDE 12

Validation of cluster based SVM on a large credit client data set

It is not sufficient to validate the SVM models trained on a given set of cluster representatives! Improved validation includes the clustering itself, the outcome of which may differ when using different holdout sets:

1 Step A: Divide the training set into positive and negative cases, T = P ∪ N.

2 Step B: Subdivide both P and N of a large training set (of > 100000 cases, say) into n (approximately equally sized) non-overlapping segments [P1, P2, ..., Pi, ..., Pn] and [N1, N2, ..., Ni, ..., Nn], with the smallest segment containing at least 30 cases, say.

3 Step C(i): Cluster both sets [P1, ..., Pi−1, Pi+1, ..., Pn] and [N1, ..., Ni−1, Ni+1, ..., Nn], obtaining 2c cluster representatives. Train an SVM on these 2c labeled points.

4 Step D(i): Validate the i-th SVM just on the segment [Pi, Ni].
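Steps A–D amount to block leave-out index bookkeeping, which can be sketched as follows (a toy illustration with invented index sets and segment counts; the clustering and SVM training per fold are only indicated in comments):

```python
import numpy as np

def block_leaveout_splits(P, N, n):
    """Split each class into n segments; for fold i, the segments != i of
    both classes form the training pool (to be clustered), segment i of
    both classes is the holdout."""
    P_seg = np.array_split(P, n)
    N_seg = np.array_split(N, n)
    for i in range(n):
        train_P = np.concatenate([s for j, s in enumerate(P_seg) if j != i])
        train_N = np.concatenate([s for j, s in enumerate(N_seg) if j != i])
        holdout = np.concatenate([P_seg[i], N_seg[i]])
        yield train_P, train_N, holdout

# Toy index sets standing in for client ids of positive/negative cases.
P_idx = np.arange(0, 90)      # positives
N_idx = np.arange(90, 120)    # negatives (asymmetric class sizes)
folds = list(block_leaveout_splits(P_idx, N_idx, n=3))
# Per fold: cluster train_P and train_N -> 2c prototypes -> train SVM,
# then validate (e.g. AUC, MXE) on the holdout segment only.
```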

SLIDE 13

ROC curves for SVM with 50, 100, 200 and 400 clusters

SLIDE 14

Sorted AUC for SVM with 50, 100, 200 and 400 clusters

[Four panels: area under ROC (AUC, sorted) and MXE over 100 block-leaveout computations. SVM trained on 50 pos / 50 neg cluster centers: mean AUC = 0.645, mean MXE = 0.600. On 100 pos / 100 neg: mean AUC = 0.674, mean MXE = 0.627. The remaining panels show 200 pos / 200 neg and 400 pos / 400 neg cluster centers.]

SLIDE 15

Sliced ROC for SVM with 50, 100, 200 and 400 clusters

SLIDE 16

Validation of SVM on 15 different cluster numbers

[Plot: area under ROC (AUC, roughly 0.62–0.67) against the number of cluster centres (50–400). Validation of SVM trained on cluster centers from two k-means start series, plus output combination computed on the holdout sets of both series.]

SLIDE 17

Example of ROC Validation and Output combinations

[Plot: ROC curves (true positive rate vs. false positive rate) for the validation of SVM on two clustering runs, [nee115_100 + r1_nee115_100], and their output combination.]

AUC(1) = 0.662, AUC(2) = 0.647, AUC(1+2) = 0.672

SLIDE 18

Credit scoring problem and data

  • 139951 clients for a building and loan credit: 3692 defaulting and 136259 non-defaulting
  • For each credit client: 12 variables, resulting in a 23-dimensional input pattern (e.g. loan-to-value ratio, repayment rate, amount of credit, house type, etc.)
  • Binary target variable y: state of credit
  • Goal: generate (supervised or semi-supervised) binary classification functions in order to assign credit applicants to good and bad risk classes

In previous work (Baesens et al. 2003, Bellotti and Crook 2008, Huang et al. 2003, Li et al. 2006, Stecking and Schebesch 2002 and 2004), at least slightly superior performance of SVM for credit scoring compared to more traditional methods could be observed.

SLIDE 19

Encoding scheme for classical variables

Model Variable | Scale | No. of Categories | Coding | Input
#1 | nominal | 3 | binary 0/1 | x1 to x2
#2 | ordinal | 2 | binary 0/1 | x3
#3 | ordinal | 4 | binary 0/1 | x4 to x6
#4 | nominal | 2 | binary 0/1 | x7
#5 | nominal | 5 | binary 0/1 | x8 to x11
#6 | ordinal | 5 | binary 0/1 | x12 to x15
#7 | nominal | 3 | binary 0/1 | x16 to x17
#8 | nominal | 3 | binary 0/1 | x18 to x19
#9 to #12 | quantitative | – | (xi − x̄)/sx | x20 to x23
Target | nominal | 2 | binary −1/1 | y
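The coding scheme above (k−1 binary 0/1 dummies for a k-category variable, z-standardization for the quantitative ones) can be illustrated with a small sketch; the category names used here are hypothetical, not from the data set:

```python
import numpy as np

def reference_dummies(values, categories):
    """k-1 binary 0/1 dummies for a k-category variable, first category as
    reference -- matching e.g. model variable #1: 3 categories -> x1, x2."""
    return np.array([[1.0 if v == c else 0.0 for c in categories[1:]]
                     for v in values])

def standardize(x):
    """(x - mean) / std, the coding used for quantitative variables #9-#12."""
    x = np.asarray(x, float)
    return (x - x.mean()) / x.std()

# Hypothetical 3-category nominal variable (e.g. housing status):
codes = reference_dummies(["rent", "own", "other"], ["rent", "own", "other"])
# "rent" -> [0, 0], "own" -> [1, 0], "other" -> [0, 1]
z = standardize([10.0, 20.0, 30.0])   # symmetric values -> [-1.22, 0.0, 1.22]
```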

SLIDE 20

Clustering large data sets

What we have before ...

  • Large data set (N ≈ 140,000), hardly feasible for advanced methods like SVM
  • Relatively small set of variables
  • Unequal class sizes (default rate 2.6%)
  • Asymmetric misclassification costs
  • Mixed variable types (nominal / ordinal / quantitative)
  • Quantitative variables need standardization / outlier treatment

SLIDE 21

Clustering large data sets

... and after clustering

  • Small data set (N = number of clusters)
  • Equal class sizes (or any other desired)
  • Frequency distribution, histogram, intervals, sample statistics (mean, median, standard deviation, etc.) for each variable per credit client cluster
  • Symbolic description of cluster objects possible → Symbolic Data Analysis!

SLIDE 22

Clustering large data sets

What we still don’t know

  • Which cluster procedure to use? Supervised / unsupervised? K-means classwise? Constrained clustering?
  • Number of clusters?
  • Cluster representation?
  • Cluster label (is it a good or a bad credit client cluster?)
  • How to pass cluster information to a classification method?
  • How to predict credit worthiness of single credit applicants?

SLIDE 23

Cluster representation

Let credit client i belong to cluster Cu, where the number of clusters is m and Cu ∈ {C1, . . . , Cm}. Variable Xj for credit client i takes (categorical or quantitative) values Xij → “classical data”.

What type of characteristics Xj can we observe for cluster Cu?

  • Single valued
  • Multi valued
  • Interval valued
  • Modal multi valued
  • Modal interval valued (histogram)

→ “symbolic data”

SLIDE 24

Clusters are symbolic objects

Clusters are described by modal variables: categories or intervals appear with a given probability. Variable X is represented by Xu = {ηu1, pu1; . . . ; ηus, pus}, where

  • object (cluster) u relates to variable X with k = 1, . . . , s outcomes,
  • puk is a probability, i.e. ∑_{k=1}^{s} puk = 1 and puk ≥ 0,
  • outcome ηuk may be a category or a (non-overlapping) interval.

For example, X1 = {Male, 0.65; Female, 0.35} might be the symbolic description of the variable “gender” for cluster 1.
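A modal variable of this kind is simply the vector of within-cluster relative frequencies; the following sketch reproduces the shape of the gender example above with an invented cluster of 20 clients:

```python
from collections import Counter

def modal_variable(values):
    """Symbolic description X_u = {eta_1, p_1; ...; eta_s, p_s}: each
    observed category with its within-cluster relative frequency."""
    counts = Counter(values)
    n = len(values)
    return {eta: counts[eta] / n for eta in counts}

# Hypothetical cluster of 20 clients, 13 male and 7 female:
gender = ["Male"] * 13 + ["Female"] * 7
X1 = modal_variable(gender)   # {'Male': 0.65, 'Female': 0.35}
```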

SLIDE 25

Encoding scheme for symbolic variables

Model Variable | Scale | No. of Outcomes | Original η | Input
#1 | nominal | 3 | category | p1 to p3
#2 | ordinal | 2 | category | p4 to p5
#3 | ordinal | 4 | category | p6 to p9
#4 | nominal | 2 | category | p10 to p11
#5 | nominal | 5 | category | p12 to p16
#6 | ordinal | 5 | category | p17 to p21
#7 | nominal | 3 | category | p22 to p24
#8 | nominal | 3 | category | p25 to p27
#9 to #12 | quantitative | 4 (×4) | intervals | p28 to p43
Target | nominal | 2 | binary −1/1 | y

SLIDE 26

Symbolic SVM model building

Empirical approach

1 Divide the large data set into “good” (N = 136259) and “bad” (N = 3692) credit clients.

2 K-means cluster analysis: extract 50 clusters from the “good” and “bad” classes respectively and preserve the labels.

3 Symbolic description for each cluster: modal multi valued for categorical variables, modal interval valued (with quartiles as interval borders) for quantitative variables → a 43-dimensional input pattern consisting of probabilities is given to the SVM.

4 Train SVM with (a) linear and (b) RBF kernel.

5 Cluster description: display region information → extract rules!
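Step 3's modal interval-valued coding for one quantitative variable can be sketched as follows. This assumes one plausible reading of the scheme, namely that the quartile borders are computed on the full data and each cluster contributes the fraction of its members falling into each interval; the numbers are toy data:

```python
import numpy as np

def quartile_interval_probs(cluster_values, borders):
    """Modal interval-valued coding: fraction of cluster members falling
    into each of the 4 intervals delimited by the 3 quartile borders."""
    idx = np.searchsorted(borders, cluster_values, side="right")
    counts = np.bincount(idx, minlength=len(borders) + 1)
    return counts / counts.sum()

# Quartiles of one quantitative variable, here computed on toy data
# standing in for the full training set:
all_values = np.arange(100.0)
q1, q2, q3 = np.percentile(all_values, [25, 50, 75])

# A hypothetical cluster of 4 clients -> 4 interval probabilities,
# i.e. 4 of the p28..p43 inputs of the symbolic encoding scheme.
cluster = np.array([5.0, 10.0, 30.0, 80.0])
p = quartile_interval_probs(cluster, [q1, q2, q3])  # -> [0.5, 0.25, 0, 0.25]
```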

SLIDE 27

SVM classification function

[Figure: schematic SVM classification function in the (X1, X2) plane separating the good (y = +1) and bad (y = −1) risk classes.]

SLIDE 28

SVM classification function

[Figure: the critical region of the SVM classification function in the (X1, X2) plane, between the y = +1 and y = −1 regions.]

SLIDE 29

SVM classification function

[Figure: typical and critical regions of the SVM classification function in the (X1, X2) plane, with y = +1 and y = −1 classes.]

SLIDE 30

Classification with symbolic SVM

Empirical approach

1 Divide the credit client data set into a training (N = 93391) and a validation (N = 46560) set.

2 Cluster the training set → build the SVM.

3 Relative frequency coding for the validation set: each credit client is assigned to a category / an interval with a probability of either 0 or 1.

4 Use the SVM classification function from step 2 to predict credit client behaviour on the validation set.

5 Calculate the ROC curve and the area under the ROC curve (AUC).

6 Compare to benchmark models: (i) traditional SVM and (ii) LDA and logistic regression, trained on a small (n = 658) equal class sized hold-out data sample.
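Step 3's relative frequency coding amounts to treating a single client as a degenerate symbolic object, so the cluster-trained classification function applies to individuals. A small sketch (the 3-category variable is hypothetical):

```python
import numpy as np

def client_as_degenerate_symbolic(category_index, n_categories):
    """A single client as a degenerate symbolic object: the observed
    category / interval gets probability 1, all others 0."""
    p = np.zeros(n_categories)
    p[category_index] = 1.0
    return p

# One nominal variable with 3 categories; the client shows category 1.
p = client_as_degenerate_symbolic(1, 3)   # -> [0., 1., 0.]
# Concatenating such vectors over all variables yields the same
# 43-dimensional probability input as the cluster objects, so the
# symbolic SVM can score individual credit applicants directly.
```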

SLIDE 31

Classification results

 | Cluster based SVM | | Sample based SVM | | Traditional Methods |
 | Linear | RBF | Linear | RBF | LDA | LogReg
Training:
No. of training cases | 100 | 100 | 658 | 658 | 658 | 658
No. of SVs | 25 | 51 | 357 | 431 | – | –
Training error | 0% | 0% | 22.64% | 10.94% | 23.86% | 22.49%
L-1-o error | 16% | 12% | 27.20% | 25.08% | 27.66% | 27.51%
Forecasting:
AUC (Full Set, N = 139951) | 0.656 | 0.675 | 0.583 | 0.576 | 0.582 | 0.584
[95% CI] | ±0.09 | ±0.09 | ±0.11 | ±0.11 | ±0.10 | ±0.10
AUC (Validation, N = 46560) | 0.660 | 0.686 | – | – | – | –
[95% CI] | ±0.15 | ±0.15 | – | – | – | –

SLIDE 32

Conclusions and outlook

  • For credit scoring data with vastly different numbers of clients and with different client features, SVM trained on few cluster centers of client clusters can replicate and surpass the expected (cross-validated) performance in terms of AUC.
  • Clusters resulting from k-means are rather unstable on our data; this is due to overlapping or non-spheroidal clusterings. However, validated performance peaks (in terms of the number of clusters) can be found, around 100 clusters per class.
  • Cluster based SVM can be used for describing and predicting credit clients. Symbolic coding enables a more complex representation of cluster information.
  • Further work: on validation procedures / on more adapted symbolic cluster encodings / on data fusion.
