SLIDE 1

Advances in Credit Scoring: combining performance and interpretation in Kernel Discriminant Analysis.

Caterina Liberati

DEMS Università degli Studi di Milano-Bicocca, Milan, Italy

caterina.liberati@unimib.it

SLIDE 2

Outline

1. Motivation
2. Kernel-Induced Feature Space
3. Our Proposal
4. Examples

SLIDE 3

Motivation

Credit Scoring: Performance vs Interpretation

Learning Task with Standard Techniques

The objective of quantitative Credit Scoring (CS) is to develop accurate models that can distinguish between good and bad applicants (Baesens et al, 2003). CS is therefore a supervised classification problem, traditionally tackled with linear discriminant analysis (Mays 2004; Duda et al. 2000), logistic regression and their variations (Wiginton 1980; Hosmer and Lemeshow 1989; Back et al. 1996).

Modeling CS with Machine Learning Algorithms

A variety of techniques have been applied to CS modeling: Neural Networks (Malhotra and Malhotra, 2003; West, 2000), Decision Trees (Huang et al, 2006), k-Nearest Neighbor classifiers (Henley and Hand, 1996; Piramuthu, 1999). Comparisons with standard data mining tools have highlighted the superiority of these algorithms with respect to the standard classification techniques.

SLIDE 4

Motivation

Credit Scoring: Performance vs Interpretation

Kernel-based Discriminants

Significant theoretical advances in Machine Learning produced a new category of algorithms based on the work of Vapnik (1995-1998). He points out that learning can be simpler if one uses low-complexity classifiers in a high-dimensional space (F). The use of kernel mappings makes it possible to project data implicitly into the Feature Space (F) through the inner product operator. Thanks to their flexibility and remarkably good performance, the popularity of such algorithms grew quickly.

Performance vs Interpretation

Kernel-based classifiers are able to capture non-linearities in the data; at the same time, they are unable to provide an explanation, or a comprehensible justification, for the solutions they reach (Barakat and Bradley 2010).

SLIDE 5

Kernel-Induced Feature Space

Complex Classification Tasks

Figure: Examples of complex data structures.

SLIDE 6

Kernel-Induced Feature Space

Do we need Kernels?

The complexity of the target function to be learned depends on the way it is represented, and the difficulty of the learning task can vary accordingly (figure from Schölkopf and Smola (2002)).

φ : R² → R³, (x1, x2) ↦ (z1, z2, z3) = (x1², √2·x1x2, x2²)

(φ(x) · φ(z)) = (x1², √2·x1x2, x2²)(z1², √2·z1z2, z2²)′ = ((x1, x2)(z1, z2)′)² = (x · z)²
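This identity is easy to check numerically. Below is a small sketch (not part of the slides), assuming the explicit map φ(x1, x2) = (x1², √2·x1x2, x2²): the dot product of the mapped vectors equals the degree-2 polynomial kernel (x · z)².

```python
import numpy as np

def phi(v):
    """Explicit feature map R^2 -> R^3 used in the example above."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x = np.array([1.0, -2.0])
z = np.array([0.5, 3.0])

lhs = phi(x) @ phi(z)   # inner product computed in the feature space R^3
rhs = (x @ z) ** 2      # degree-2 polynomial kernel computed directly in R^2
print(lhs, rhs)         # both print 30.25
```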

SLIDE 7

Kernel-Induced Feature Space

Making Kernels

A kernel converts a non-linear problem into a linear one by projecting the data onto a high-dimensional Feature Space F without knowing the mapping function explicitly. A kernel is a function k : X² → R which, for all pattern sets {x1, x2, ..., xn} ⊂ X with X ⊂ R^p, gives rise to positive semi-definite matrices K_ij = k(xi, xj). If Mercer's theorem is satisfied (Mercer, 1909), the kernel k corresponds to mapping the data into a possibly high-dimensional dot-product space F by a (usually non-linear) map φ : R^p → F and taking the dot product there (Vapnik, 1995), i.e.

k(x, z) = (φ(x) · φ(z))   (1)

When Mercer's theorem is satisfied, the kernel K_ij = k(xi, xj) induces a Reproducing Kernel Hilbert Space (RKHS).
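As a quick numeric illustration (a sketch, not from the slides), one can verify the Mercer condition empirically: the Gram matrix built from a Gaussian kernel on an arbitrary point set is symmetric and, up to round-off, positive semi-definite.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))                    # 30 patterns in R^4
c = 1.5                                         # kernel width

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / (2 * c ** 2))            # Gram matrix K_ij = k(x_i, x_j)

eigvals = np.linalg.eigvalsh(K)                 # K is symmetric, use eigvalsh
print("smallest eigenvalue:", eigvals.min())    # >= 0 up to round-off
```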

SLIDE 8

Kernel-Induced Feature Space

Advantages of Learning with Kernels

Among others, an RKHS has the nice property:

k(x, z)² ≤ k(x, x) · k(z, z)   ∀ x, z ∈ X   (2)

The Cauchy-Schwarz inequality allows us to view K as a measure of similarity between inputs: if x, z ∈ X are similar then K(x, z) will be closer to 1, while if x, z ∈ X are dissimilar then K(x, z) will be closer to 0. The kernel matrix K therefore encodes the similarities among instances. The freedom to choose the mapping k enables us to design a large variety of learning algorithms: if the map is chosen suitably, complex relations can be simplified and easily detected.

SLIDE 9

Kernel-Induced Feature Space

Kernel Discriminant Analysis

Assume that we are given the input data set I_XY = {(x1, y1), ..., (xn, yn)} of training vectors xi ∈ X with the corresponding class labels yi ∈ Y = {1, 2}. The class separability in the direction of the weights ω ∈ F is obtained by maximizing the Rayleigh coefficient (Baudat and Anouar, 2000):

J(ω) = (ω′ S_B^Φ ω) / (ω′ S_W^Φ ω)   (3)

From the theory of reproducing kernels, the solution ω ∈ F must lie in the span of all the training samples in F. We can therefore write ω as a linear expansion of the training samples:

ω = Σ_{i=1}^{n} α_i φ(x_i)   (4)

SLIDE 10

Kernel-Induced Feature Space

Kernel Discriminant Analysis

As already shown by Mika et al (2003), S_B^Φ and S_W^Φ can easily be written as

ω′ S_B^Φ ω = α′ M α   (5)

where M = (m1 − m2)(m1 − m2)′ and (m_g)_i = (1/n_g) Σ_{k=1}^{n_g} k(x_i, x_k^g), for g = 1, 2 and i = 1, ..., n;

ω′ S_W^Φ ω = α′ N α   (6)

where N = Σ_{g=1}^{2} K_g (I − L_g) K_g′, with
K_g the kernel matrix with generic element (i, k) equal to k(x_i, x_k^g),
I the identity matrix,
L_g the matrix with all entries equal to 1/n_g.

SLIDE 11

Kernel-Induced Feature Space

Kernel Discriminant Analysis

These results allow the optimization problem of eq. (3) to be reduced to finding the class-separability directions α that maximize the criterion:

J(α) = (α′ M α) / (α′ N α)   (7)

This problem can be solved by finding the leading eigenvectors of N⁻¹M. Since this setting is ill-posed (N is at most of rank n − 1), we employed a regularization method. The classifier is:

f(x) = Σ_{i=1}^{n} α_i k(x_i, x) + b   (8)

b = α′ (m1 + m2)/2   (9)
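A minimal sketch of the two-class KDA in eqs. (3)-(9), assuming a Gaussian kernel and a simple ridge regularization of N; function names (rbf_kernel, fit_kda, kda_score) are illustrative and not from the original paper. For two classes, the leading eigenvector of N⁻¹M is proportional to N⁻¹(m1 − m2), which is used directly; the bias is chosen so that the score vanishes at the midpoint of the projected class means (the sign convention of eq. (9) may differ).

```python
import numpy as np

def rbf_kernel(A, B, c=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * c ** 2))

def fit_kda(X, y, c=1.0, reg=1e-3):
    """Two-class kernel discriminant analysis; labels y coded as 1 or 2."""
    n = X.shape[0]
    K = rbf_kernel(X, X, c)                       # n x n kernel matrix
    means, N = [], np.zeros((n, n))
    for g in (1, 2):
        idx = np.where(y == g)[0]
        Kg = K[:, idx]                            # n x n_g block
        means.append(Kg.mean(axis=1))             # kernel class mean m_g, eq. (5)
        Lg = np.full((len(idx), len(idx)), 1.0 / len(idx))
        N += Kg @ (np.eye(len(idx)) - Lg) @ Kg.T  # within-class matrix N, eq. (6)
    m1, m2 = means
    N += reg * np.eye(n)                          # regularization: N is singular
    alpha = np.linalg.solve(N, m1 - m2)           # leading eigenvector of N^-1 M (M is rank one)
    b = -0.5 * alpha @ (m1 + m2)                  # zero score at the midpoint of projected means
    return alpha, b

def kda_score(X_train, alpha, b, X_new, c=1.0):
    return rbf_kernel(X_new, X_train, c) @ alpha + b   # f(x), eq. (8)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(2.5, 1.0, (50, 2))])
    y = np.r_[np.ones(50, dtype=int), np.full(50, 2, dtype=int)]
    alpha, b = fit_kda(X, y, c=1.5)
    pred = np.where(kda_score(X, alpha, b, X, c=1.5) > 0, 1, 2)
    print("training accuracy:", (pred == y).mean())
```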

SLIDE 12

Kernel-Induced Feature Space

KDA into SVM formulation

The linear classifier can be recast in the SVM framework as an LS-SVM. Consider a binary classification model in the Reproducing Kernel Hilbert Space:

f(x) = ω′ φ(x) + b   (10)

where ω is the weight vector in the RKHS and b ∈ R is the bias term. The discriminant function of LS-SVM classifiers (Suykens and Vandewalle 1999) is constructed by minimizing the following problem:

min J(ω, e) = ½ ω′ω + ½ γ Σ_{i=1}^{n} e_i²   (11)

subject to: y_i = ω′ φ(x_i) + b + e_i,  i = 1, 2, ..., n

SLIDE 13

Kernel-Induced Feature Space

KDA into SVM formulation

The Lagrangian of problem (11) is expressed by:

L(ω, b, e; α) = J(ω, e) − Σ_{i=1}^{n} α_i (ω′φ(x_i) + b + e_i − y_i)   (12)

where the α_i ∈ R are the Lagrange multipliers, which can be positive or negative in this formulation. The conditions for optimality yield:

∂L/∂ω = 0  ⇒  ω = Σ_{i=1}^{n} α_i φ(x_i)
∂L/∂b = 0  ⇒  Σ_{i=1}^{n} α_i = 0
∂L/∂e_i = 0  ⇒  α_i = γ e_i
∂L/∂α_i = 0  ⇒  ω′φ(x_i) + b + e_i − y_i = 0,  ∀ i = 1, 2, ..., n   (13)

The solution is found by solving the system of linear equations in (13) (Kuhn and Tucker 1951). The fitting function, namely the output of the LS-SVM, is:

f(x) = Σ_{i=1}^{n} α_i k(x_i, x) + b   (14)
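Eliminating ω and e from the conditions in (13) leads to the well-known LS-SVM linear system in (b, α) (Suykens and Vandewalle, 1999). The sketch below solves that system for the function-estimation form of the constraint used in (11), with an RBF kernel; helper names are illustrative.

```python
# Linear system:  [ 0    1'        ] [ b     ]   [ 0 ]
#                 [ 1    K + I/gamma ] [ alpha ] = [ y ]
import numpy as np

def rbf_kernel(A, B, c=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * c ** 2))

def fit_ls_svm(X, y, gamma=10.0, c=1.0):
    """Solve the LS-SVM dual system for (b, alpha); y coded as +1 / -1."""
    n = X.shape[0]
    K = rbf_kernel(X, X, c)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], y.astype(float)))
    sol = np.linalg.solve(A, rhs)
    return sol[1:], sol[0]                              # alpha, b

def ls_svm_score(X_train, alpha, b, X_new, c=1.0):
    return rbf_kernel(X_new, X_train, c) @ alpha + b    # f(x), eq. (14)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(-1, 1, (40, 2)), rng.normal(1, 1, (40, 2))])
    y = np.r_[-np.ones(40), np.ones(40)]
    alpha, b = fit_ls_svm(X, y, gamma=10.0, c=1.0)
    pred = np.sign(ls_svm_score(X, alpha, b, X, c=1.0))
    print("training accuracy:", (pred == y).mean())
```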

SLIDE 14

Kernel-Induced Feature Space

LS-SVM vs SVM

LS-SVM vs SVM

The major drawback of the SVM lies in its estimation procedure, which is based on constrained optimization programming (Wang and Hu 2015); the computational burden therefore becomes particularly heavy for large-scale problems. In such cases the LS-SVM is preferred, because its solution is obtained by solving a linear set of equations (Suykens and Vandewalle 1999).

KDA vs SVM

SVMs do not deal with multi-class problems directly when the data present more than 2 groups, unless one resorts to one-against-all (OAA) or one-against-one (OAO) classification schemes.

SLIDE 15

Kernel-Induced Feature Space

Kernel Settings

The most common kernel mappings:

Kernel                Mapping k(x, z)
Cauchy                1 / (1 + ||x − z||² / c)
Laplace               exp(−||x − z|| / c²)
Multi-quadric         √(||x − z||² + c²)
Polynomial degree 2   (x · z)²
Gaussian (RBF)        exp(−||x − z||² / (2c²))
Sigmoidal (SIG)       tanh[c(x · z) + 1]

The tuning parameter c is set through a grid-search algorithm.
Regularization methods are used to overcome the singularity of S_W^Φ (Friedman 1989; Mika 1999) [REG].
Model selection criteria are used to choose the best kernel function (error rate, AUC, information criteria) [SEL].
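A hedged sketch of these kernel settings: the candidate kernel maps from the table plus a simple grid search over the tuning parameter c, scored here by AUC on a validation split. A plain kernel ridge scorer stands in for the KDA/LS-SVM fit, and all helper names are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def dist(A, B):
    return np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

KERNELS = {   # formulas as in the table above
    "cauchy":       lambda A, B, c: 1.0 / (1.0 + dist(A, B) ** 2 / c),
    "laplace":      lambda A, B, c: np.exp(-dist(A, B) / c ** 2),
    "multiquadric": lambda A, B, c: np.sqrt(dist(A, B) ** 2 + c ** 2),
    "poly2":        lambda A, B, c: (A @ B.T) ** 2,
    "rbf":          lambda A, B, c: np.exp(-dist(A, B) ** 2 / (2 * c ** 2)),
    "sigmoid":      lambda A, B, c: np.tanh(c * (A @ B.T) + 1.0),
}

def kernel_score(Ktr, y, Kva, reg=1e-2):
    """Kernel ridge scorer standing in for the kernel discriminant fit."""
    alpha = np.linalg.solve(Ktr + reg * np.eye(len(y)), y)
    return Kva @ alpha

def grid_search(Xtr, ytr, Xva, yva, cs=(0.5, 1.0, 2.0, 5.0)):
    """Return (AUC, kernel name, c) of the best kernel/parameter pair."""
    best = None
    for name, k in KERNELS.items():
        for c in cs:
            auc = roc_auc_score(yva, kernel_score(k(Xtr, Xtr, c), ytr, k(Xva, Xtr, c)))
            if best is None or auc > best[0]:
                best = (auc, name, c)
    return best

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 4))
    y = (X[:, 0] ** 2 + X[:, 1] > 1.0).astype(float)    # toy "good"/"bad" labels
    print(grid_search(X[:200], y[:200], X[200:], y[200:]))
```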

SLIDES 16–18

Our Proposal

Our Proposal

An operative strategy

Our goal is NOT to derive a new classification model that discriminates better than others previously published. The strategy proceeds in three steps (a code sketch of step II follows the list):

I. Selection of the best Kernel Discriminant function
(a) Computing the kernel matrix using the original variables as inputs
(b) Performing Kernel Discriminant Analysis (KDA) with different kernel maps
(c) Selecting the best kernel discriminant f(x) via minimum misclassification error rate or maximum AUC

II. Reconstruction of the Kernel Discriminant function through a linear regression
(a) Performing a linear regression where f(x) is the target and the original variables are the predictors
(b) Studying the goodness of fit of the linear reconstruction
(c) If II.(b) is satisfactory, using the regression estimate f̂(x) as the new classifier

III. Application of the rule to a test set
(a) Applying the regression parameters from II.(a) to the test set
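A minimal sketch of step II (and its use in step III), assuming the kernel discriminant scores on the training set are already available from step I; the linear reconstruction is an ordinary least-squares regression of f(x) on the original variables, and all names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def reconstruct_score(X_train, f_train, X_test):
    """Step II: regress the kernel score f(x) on the original input variables."""
    reg = LinearRegression().fit(X_train, f_train)
    r2 = reg.score(X_train, f_train)        # II.(b): goodness of fit of the reconstruction
    f_hat_test = reg.predict(X_test)        # III: apply the linear rule to the test set
    return reg, r2, f_hat_test

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_train, X_test = rng.normal(size=(200, 5)), rng.normal(size=(80, 5))
    f_train = X_train[:, 0] + 0.5 * X_train[:, 1]        # stand-in for a kernel score
    reg, r2, f_hat = reconstruct_score(X_train, f_train, X_test)
    print("R^2 of the linear reconstruction:", round(r2, 3))
    print("reconstructed coefficients:", np.round(reg.coef_, 3))   # interpretable weights
```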

SLIDE 19

Our Proposal

Our Proposal

An operative strategy

One may be surprised that the linear approximation of the kernel rule by the input variables does not coincide with the direct regression of the response variable on the input variables.

Theorem. Let A and B be two pre-Hilbertian subspaces such that B ⊂ A, and let P_A and P_B be the orthogonal projectors onto A and B; then for any y, P_B · P_A(y) = P_B(y).

Since the vector space linearly spanned by the input variables is embedded in the Feature Space, there should be no gain in approximating y by the kernel classifier (14) and then approximating f(x) by a linear combination of the x_i, instead of projecting directly onto the x_i. This paradox disappears once we notice that the LS-SVM classifier (14) does not correspond to the orthogonal projection onto the Feature Space, or, in other words, to the least-squares approximation of the binary response.
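A small numeric illustration of this remark (toy data, illustrative only): the linear approximation of a kernel score by the inputs generally differs from the direct least-squares regression of y on the inputs, precisely because the kernel classifier is not the orthogonal projection of the binary response.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 3))
y = np.where(X[:, 0] ** 2 + X[:, 1] > 0.5, 1.0, -1.0)      # nonlinear labelling rule

# An RBF kernel ridge score standing in for the LS-SVM/KDA classifier f(x)
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)
f = K @ np.linalg.solve(K + 0.1 * np.eye(len(y)), y)

X1 = np.column_stack([np.ones(len(y)), X])                  # add an intercept column
beta_f = np.linalg.lstsq(X1, f, rcond=None)[0]              # regress f(x) on the inputs
beta_y = np.linalg.lstsq(X1, y, rcond=None)[0]              # regress y directly on the inputs
print("linear approx of kernel score:", np.round(beta_f, 3))
print("direct regression of y:      ", np.round(beta_y, 3))
```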

SLIDE 20

Examples

Example 1: psycho credit scoring

The total sample is composed of 7699 self-employed customers (entrepreneurs, artisans, freelancers) of an Italian bank, distributed into two classes (6160 "good", 1539 "bad"). Training set → 4619 instances, test set → 3080 instances. Our database is composed of 4 sets of quantitative variables and 1 target variable:

1. y: target variable, which identifies the trespassing of the credit limit by the clients.
2. CREDIT: set of 1 scale variable, provided by the credit bureau, which measures the solvency of the subjects.
3. MANAGE: set of 25 dichotomous variables (yes/no) related to the customers' usage of the banking services (e.g. bank account, credit card, payment of utilities, accrual of salary, etc.), synthesized via MCA.
4. ECO: set of 7 scale variables related to the cash flow and the economic returns of the financial activities operated by the users (e.g. monthly revenue produced by the customers, financial assets held by the customers, etc.).
5. SEMIO: set of 5 scale variables that synthesize the psychological traits of the subjects.

SLIDE 21

Examples

Example 1: data preparation

Sémiométrie (Lebart et al, 2003) is a list of 210 graphical forms (nouns, adjectives, or verbs) rated by respondents in terms of sensation (pleasant = +3 to unpleasant = -3). Sample: 16,582 individuals aged 18 and over, interviewed between 1990 and 2002. The ratings were synthesized via Principal Component Analysis; according to the results, only the first 6 factorial axes were interpreted.

1. Pc1 - named axis of Participation. It is not a psychological trait.
2. Pc2 - named Duty (-)/Pleasure (+)
3. Pc3 - named Attachment (-)/Detachment (+)
4. Pc4 - named Sublimation (-)/Materialism (+)
5. Pc5 - named Idealization (-)/Pragmatism (+)
6. Pc6 - named Humility (-)/Sovereignty (+)

SEMIO is obtained by a supplementary projection of the points onto the subspace spanned by the 5 semiometric factors (Pc2-Pc6):

f_sup = X⁺ U   (15)

where X⁺ is our standardized data matrix (supplementary observations) and U are the original eigenvectors obtained by the spectral decomposition of the Sémiométrie data.
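A hedged sketch of the supplementary projection in eq. (15): new observations are standardized with the reference statistics and projected onto the eigenvectors U of the reference correlation matrix (function and variable names are illustrative).

```python
import numpy as np

def supplementary_scores(X_ref, X_sup, n_axes=5):
    """Project supplementary rows of X_sup onto the principal axes of X_ref."""
    mu, sd = X_ref.mean(axis=0), X_ref.std(axis=0, ddof=1)
    Z_ref = (X_ref - mu) / sd
    eigval, U = np.linalg.eigh(np.corrcoef(Z_ref, rowvar=False))   # spectral decomposition
    U = U[:, np.argsort(eigval)[::-1][:n_axes]]                    # leading eigenvectors
    return ((X_sup - mu) / sd) @ U                                 # f_sup = X+ U, eq. (15)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_ref, X_sup = rng.normal(size=(500, 10)), rng.normal(size=(7, 10))
    print(supplementary_scores(X_ref, X_sup).shape)   # (7, 5): one row per new subject
```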

SLIDE 22

Examples

Example 1: classification results

Table: Average classification performance statistics on test sets (50 runs)

Classifier  Parameter  Variables set            Error Rate  AUC    Accuracy (good)  Accuracy (bad)
CAU         3.6786     CREDIT+ECO+MANAGE+SEMIO  0.186       0.850  0.863            0.619
LAP         3.6786     CREDIT+ECO+MANAGE+SEMIO  0.199       0.852  0.831            0.678
MUL         5.8893     CREDIT+ECO+MANAGE+SEMIO  0.220       0.873  0.769            0.826
RBF         3.6786     CREDIT+ECO+MANAGE+SEMIO  0.210       0.856  0.801            0.748
POLY        2          CREDIT+ECO+MANAGE+SEMIO  0.333       0.566  0.733            0.398
LDA         -          CREDIT+ECO+MANAGE+SEMIO  0.368       0.522  0.713            0.300
LR          -          CREDIT+ECO+MANAGE+SEMIO  0.159       0.522  0.936            0.458

SLIDE 23

Examples

Example 1: variables importance

Figure: Score values of the discriminant function for the Bad and Good groups.

SLIDE 24

Examples

Example 1: variables importance

Score Variable                        ΔR²    rRW      b        p-value
MUL (R²=0.986 on the training sample with CREDIT+ECO+MANAGE+SEMIO sets)
Pc2                                   0.177  18.40%   0.924    0.000
Pc3                                   0.701  71.50%   1.836    0.000
Bureau                                0.080  8.90%    7.426    0.000
RBF (R²=0.869 on the training sample with CREDIT+ECO+MANAGE+SEMIO sets)
Pc2                                   0.160  18.90%   0.020    0.000
Pc3                                   0.614  70.90%   0.040    0.000
Bureau                                0.069  8.70%    0.160    0.000
POLY (R²=0.682 on the training sample with CREDIT+ECO+MANAGE+SEMIO sets)
Interests on financial assets (F.)    0.018  3.30%    0.766    0.000
Total financial assets managed        0.059  11.20%   0.001    0.000
Factor 3                              0.040  7.80%    62.885   0.000
Factor 4                              0.009  15.90%   100.860  0.000
Factor 13                             0.009  14.00%   43.619   0.000
LDA (R²=1 on the training sample with CREDIT+ECO+MANAGE+SEMIO sets)
Pc2                                   0.176  18.10%   0.095    0.000
Pc3                                   0.712  71.60%   0.190    0.000
Bureau                                0.082  9.00%    0.774    0.000
LR (R²=0.394 on the training sample with CREDIT+ECO+MANAGE+SEMIO sets)
Pc2                                   0.093  16.60%   1.031    0.000
Pc3                                   0.298  63.10%   2.006    0.000
Bureau                                0.039  6.80%    7.888    0.000

SLIDE 25

Examples

Example 1: roc of reconstructed multiquadric score

AUC = 0.893
Reconstructed score = 0.924·Pc2 + 1.836·Pc3 + 0.088·Bureau
Pc2 = Duty/Pleasure; Pc3 = Attachment/Detachment; Bureau = measure of solvency

Figure: ROC curve of the reconstructed score (True positive rate / Sensitivity vs False positive rate / 1-Specificity).

SLIDE 26

Examples

Example 2: SMEs Data

A real dataset provided by an Italian bank; y → default probability over the next 12 months. Our database is composed of 10 qualitative variables:

1. 4 variables collected by a questionnaire administered to the corporate customers. DOM1-DOM4 investigate some aspects of the corporate clients: the seniority of the company, the skills present in the company, the past experience in the market, the personal assets of the owners.
2. 4 default indicators provided by the Central Bureau of risk: CB1az and CB2az measure the risk related to the companies; CB1coll and CB2coll measure the risk related to the natural persons (collaborators) involved in the enterprises' ownership.
3. 2 further variables: CERI is a proxy of the non-standard behavior of firms estimated by the central risk of the bank; ANAG indicates the length of the relationship between the business and the bank.

SLIDE 27

Examples

Example 2: preprocessing

We randomly selected a large sample composed of 8700 instances. The group distribution was: y=1 "bad" (29% of the total sample) and y=2 "good" (71% of the total sample). We split the sample into a training set (4703 instances) and a test set (3997 instances). Data were synthesized via Multiple Correspondence Analysis into 37 factorial axes. The allocation rule of the units to one of the two groups is based on a k-Nearest Neighbor rule with a window width δ = 10.

SLIDE 28

Examples

Example 2: classification results

Figure: Confusion matrices of different classifiers on training set

SLIDE 29

Examples

Example 2: classification results

Table: Area Under the Curve (AUC) on training sets

Discriminant  AUC
CAUCHY        0.956
LAPLACE       0.915
RBF           0.890
LOGISTIC      0.842
LINEAR        0.713

Figure: ROC curves (Sensitivity vs 1-Specificity) of the Cauchy, Laplace, RBF, FLDA and Logistic discriminants.

SLIDE 30

Examples

Example 2: Assessing the reconstruction process

The figure below shows the relationship between the Cauchy kernel discriminant function and its linear reconstruction.

SLIDE 31

Examples

Example 2: classification accuracy

Table: Correct Classification Rates for different methods on the test data.

Discriminant Rule      Class 1  Class 2  Overall
CAUCHY                 71.48    74.78    73.82
LOGISTIC REGRESSION    54.91    89.88    79.90
FLDA                   61.88    56.57    58.08
RECONSTRUCTED          73.80    74.51    74.30

Results highlight the very good performance of the Cauchy kernel discriminant with respect to the other classifiers. Logistic regression is the best in terms of overall accuracy, but if we compare the two rules in terms of correct predictions in both classes, the reconstructed Cauchy kernel discriminant is more balanced and effective.

SLIDE 32

Examples

Example 2: Characterization of the test partition

The characterization of the test partition has been carried out by ranking all the characterizing variables of a group by means of a probabilistic criterion, the value-test:

V ~ Hyp(n, n_ν, n_q)

where n = sample size, n_q = number of instances (sampled without replacement) belonging to the q-th group, and n_ν = number of instances with the ν-th category.

Table: Categories characterizing the group of the bad instances classified as bad.

Characteristic  % of category          % of category        % of group              V-Test  P-value
category        in group (n_νq/n_q)    in sample (n_ν/n)    in category (n_νq/n_ν)
CB2_az=1        76.08                  28.52                59.37                   38.42   0.000
CB1_az=1        55.31                  23.83                51.64                   26.39   0.000
CERI=1          40.02                  17.31                51.45                   21.09   0.000
CB2_coll=1      18.62                  8.89                 46.62                   11.92   0.000
DOM4=1          46.04                  31.60                32.43                   11.47   0.000
DOM3=1          13.22                  5.70                 51.58                   11.13   0.000
DOM2=2          23.56                  13.91                37.70                   9.98    0.000
DOM1=1          30.67                  20.23                33.73                   9.44    0.000
CB1_coll=1      14.30                  7.96                 39.95                   8.26    0.000
ANAG=1          35.34                  28.08                28.01                   5.98    0.000
DOM2=1          4.41                   2.22                 44.14                   5.09    0.000
ANAG=2          32.01                  26.24                27.15                   4.86    0.000
DOM1=2          29.59                  24.96                26.38                   3.96    0.000
DOM2=3          26.71                  22.99                25.85                   3.26    0.001

n_νq = number of instances with the ν-th category in group q.
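A hedged sketch of the value-test used above: the p-value of observing at least n_νq instances of a category inside a group of size n_q comes from the hypergeometric distribution Hyp(n, n_ν, n_q) and is converted to a signed normal quantile, as in Lebart et al.; the numbers in the example call are illustrative only.

```python
from scipy.stats import hypergeom, norm

def value_test(n, n_v, n_q, n_vq):
    """Return (v_test, p_value) for over-representation of a category in a group."""
    # probability of observing at least n_vq instances of the category in the group
    p_over = hypergeom.sf(n_vq - 1, n, n_v, n_q)
    v = norm.isf(p_over)          # signed normal quantile; positive = over-represented
    return v, p_over

if __name__ == "__main__":
    print(value_test(n=2000, n_v=570, n_q=800, n_vq=400))   # illustrative counts
```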

SLIDE 33

Examples

Table: Categories characterizing the group of the good instances classified as good.

Characteristic  % of category  % of category  % of group   V-Test  P-value
category        in group       in set         in category
CB2_az=4        26.09          16.49          89.08        22.19   0.000
CB2_az=3        27.91          18.41          85.33        20.66   0.000
ANAG=4          28.87          21.15          76.82        15.53   0.000
DOM2=4          70.21          60.88          64.92        15.33   0.000
CERI=5          56.45          47.13          67.43        15.03   0.000
CB1_az=5        28.47          21.59          74.24        13.67   0.000
CERI=4          14.50          9.97           81.93        12.67   0.000
CB2_coll=4      8.53           5.48           87.59        11.45   0.000
CB1_az=4        22.11          16.91          73.61        11.34   0.000
DOM3=5          21.54          16.77          72.32        10.41   0.000
DOM4=5          21.01          16.35          72.34        10.27   0.000
CB2_az=5        22.64          17.85          71.41        10.18   0.000
DOM1=4          28.94          23.81          68.40        9.72    0.000
CB2_coll=3      8.67           6.14           79.48        8.73    0.000
CB1_coll=4      7.68           5.48           78.83        7.97    0.000
DOM4=4          17.60          14.23          69.62        7.81    0.000
DOM1=3          35.30          31.00          64.11        7.47    0.000
CB1_az=3        18.81          15.59          67.91        7.16    0.000
CB2_coll=2      7.93           6.54           68.20        4.49    0.000
DOM3=4          4.98           3.96           70.71        4.17    0.000
CB1_coll=2      7.15           5.96           67.45        3.99    0.000
DOM4=3          26.20          24.13          61.11        3.85    0.000
CB1_coll=3      6.40           5.40           66.67        3.51    0.000
CB2_az=2        20.33          18.73          61.11        3.27    0.001
ANAG=3          25.84          24.53          59.30        2.41    0.008

SLIDE 34

Examples

References

Akaike H (1973) Information theory and an extension of the maximum likelihood principle. In: Petrov BN, Csaki F (eds) Information Theory: Proceedings of the 2nd International Symposium, Akademiai Kiado, Budapest, pp 267-281
Baesens B, Van Gestel T, Viaene S, Stepanova M, Suykens J, Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society 54:627-635
Barakat N, Bradley AP (2010) Rule extraction from support vector machines: a review. Neurocomputing 74:178-190
Baudat G, Anouar F (2000) Generalized discriminant analysis using a kernel approach. Neural Computation 12:2385-2404
Bozdogan H, Sclove LS (1984) Multi-sample cluster analysis using Akaike's Information Criterion. Annals of the Institute of Statistical Mathematics 36(1):163-180
Haff LR (1980) Empirical Bayes estimation of the multivariate normal covariance matrix. The Annals of Statistics 8(3):586-597
Huang YM, Hung C, Jiau HC (2006) Evaluation of neural networks and data mining methods on a credit assessment task for class imbalance problem. Nonlinear Analysis: Real World Applications 7:720-747
James W, Stein C (1961) Estimation with quadratic loss. In: Proceedings of the 4th Berkeley Symposium on Mathematical Statistics and Probability, vol 1, pp 361-379, Berkeley, Calif, USA
Johnson RM (1966) The minimal transformation to orthonormality. Psychometrika 31:61-66
Johnson J (2000) A heuristic method for estimating the relative weight of predictor variables in multiple regression. Multivariate Behavioral Research 35(1):1-19
Lebart L, Piron M, Steiner JF (2003) La Sémiométrie. Dunod
Ledoit O, Wolf M (2004) A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis 88(2):365-411
Liberati C, Camillo F, Saporta G (2017) Advances in credit scoring: combining performance and interpretation in kernel discriminant analysis. Advances in Data Analysis and Classification 11(1):121-138

SLIDE 35

Examples

Malhotra R, Malhotra DK (2003) Evaluating consumer loans using neural networks. Omega 31:83-96
Mercer J (1909) Functions of positive and negative type and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London 209:415-446
Mika S, Rätsch G, Weston J, Schölkopf B, Müller KR (2003) Constructing descriptive and discriminative nonlinear features: Rayleigh coefficients in kernel feature spaces. IEEE Transactions on Pattern Analysis and Machine Intelligence 5:623-628
Schölkopf B, Smola AJ (2002) Learning with Kernels. MIT Press, Cambridge, MA
Suykens J, Vandewalle J (1999) Least squares support vector machine classifiers. Neural Processing Letters 9(3):293-300
Shurygin A (1983) The linear combination of the simplest discriminator and Fisher's one. In: Applied Statistics, Nauka, Moscow, pp 144-158 (in Russian)
Stein C (1975) Estimation of a covariance matrix, Rietz Lecture. In: Proceedings of the 39th Annual Meeting of the IMS, Atlanta, GA, USA
Vapnik V (1995) The Nature of Statistical Learning Theory. Springer, New York

SLIDE 36

Examples

Choice of ridge parameter

Naïve Ridge Estimator of the covariance matrix: Σ_R = Σ̂_MLE + γ·I

Smoothed Covariance Estimators:

Σ_S = Σ̂(1 − ρ) + ρD   (16)

with 0 < ρ < 1 and D = (tr(Σ̂)/p)·I_p. The structure minimizes the mean squared error (MSE) ||Σ̂ − Σ||²_F.

Maximum Likelihood/Empirical Bayes (MLE/EB) (Haff, 1980): Σ̂_MLE/EB = Σ̂_MLE + [(p − 1)/(n·tr(Σ̂_MLE))]·I_p

Stipulated Ridge (Shurygin, 1983): Σ̂_SRE = Σ̂_MLE + p(p − 1)[2n·tr(Σ̂_MLE)]⁻¹·I_p

Convex Sum (Ledoit and Wolf, 2004): Σ̂_CSE = [n/(n+m)]·Σ̂ + [1 − n/(n+m)]·D̂
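A minimal sketch of the listed regularizers, assuming S is the (MLE) covariance estimate computed from n observations of p variables; the functions mirror the formulas above and names are illustrative.

```python
import numpy as np

def naive_ridge(S, gamma):
    return S + gamma * np.eye(S.shape[0])

def smoothed(S, rho):                   # eq. (16)
    p = S.shape[0]
    D = (np.trace(S) / p) * np.eye(p)
    return (1.0 - rho) * S + rho * D

def mle_eb(S, n):                       # Haff (1980)
    p = S.shape[0]
    return S + (p - 1) / (n * np.trace(S)) * np.eye(p)

def stipulated_ridge(S, n):             # Shurygin (1983)
    p = S.shape[0]
    return S + p * (p - 1) / (2.0 * n * np.trace(S)) * np.eye(p)

def convex_sum(S, n, m):                # Ledoit and Wolf (2004), as stated above
    p = S.shape[0]
    D = (np.trace(S) / p) * np.eye(p)
    w = n / (n + m)
    return w * S + (1.0 - w) * D
```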

SLIDE 37

Examples

Model Selection Criterion

The best kernel function, selected among all the competing models, is the one that:

- minimizes the total error rate,
- maximizes the Area Under the ROC Curve,
- minimizes the Akaike criterion (in case a probabilistic discriminant is used).

The information criteria (Akaike 1973; Bozdogan and Sclove 1984) are computed under the normality assumption for each group, with n sample instances, p variables and X_g ~ N_p(µ_g, Σ) for g = 1, 2:

AIC = np·log(2π) + n·log|n⁻¹Σ_W| + np + 2(2p + p(p+1)/2)   (17)
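A short sketch of eq. (17); it assumes the AIC penalty counts the two mean vectors plus the common covariance matrix (2p + p(p+1)/2 parameters) and that Σ_W is the pooled within-group scatter matrix.

```python
import numpy as np

def aic_two_groups(W, n, p):
    """AIC of the two-group normal model, eq. (17); W is the within-group scatter matrix."""
    _, logdet = np.linalg.slogdet(W / n)            # log |n^-1 Sigma_W|
    n_params = 2 * p + p * (p + 1) / 2              # two mean vectors + common covariance
    return n * p * np.log(2 * np.pi) + n * logdet + n * p + 2 * n_params
```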

SLIDE 38

Examples

Relative Weight of Predictor Variables

Assume X is an n × p matrix of predictors and y an n × 1 vector of scores. Consider the singular value decomposition of X:

X = PΔQ′

Johnson (1966) showed that the best-fitting orthogonal approximation of X is Z = PQ′, so that X = ZΛ with Λ = QΔQ′, and the regression of y on Z gives β = QP′y. The ε measure (Johnson, 2000) quantifies the relative importance of the variables:

ε = Λ²β²

The rescaled Relative Weights (rRW) represent the proportion of predictable variance in y explained by each variable:

rRW = ε/R²

where R² is the R-squared of the estimated regression.

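A hedged sketch of Johnson's relative weights in the form above, using the equivalent eigendecomposition of the predictor correlation matrix (Λ = R_xx^{1/2}); names are illustrative.

```python
import numpy as np

def relative_weights(X, y):
    """Johnson (2000) rescaled relative weights of the columns of X for predicting y."""
    n = X.shape[0]
    Zx = (X - X.mean(0)) / X.std(0, ddof=1)
    zy = (y - y.mean()) / y.std(ddof=1)
    Rxx = (Zx.T @ Zx) / (n - 1)                    # predictor correlation matrix
    rxy = (Zx.T @ zy) / (n - 1)                    # predictor-criterion correlations
    evals, evecs = np.linalg.eigh(Rxx)
    Lam = evecs @ np.diag(np.sqrt(np.clip(evals, 0, None))) @ evecs.T   # Lambda = Q Delta Q'
    beta = np.linalg.solve(Lam, rxy)               # coefficients of y on the orthogonal Z
    eps = (Lam ** 2) @ (beta ** 2)                 # raw relative weights, eps = Lambda^2 beta^2
    r2 = float(eps.sum())                          # equals the regression R^2
    return eps / r2, r2                            # rRW = eps / R^2

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 4))
    y = 2 * X[:, 0] + X[:, 1] + rng.normal(size=300)
    rrw, r2 = relative_weights(X, y)
    print("rRW:", np.round(rrw, 3), "R^2:", round(r2, 3))
```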