Naive Bayes Classifier

SLIDE 1

Ricco RAKOTOMALALA
Tutoriels Tanagra - http://data-mining-tutorials.blogspot.fr/

SLIDE 2

Maximum a posteriori rule

Calculating the posterior probability (Bayes theorem):

$$P(Y = y_k / X) = \frac{P(Y = y_k) \cdot P(X / Y = y_k)}{\sum_{l=1}^{K} P(Y = y_l) \cdot P(X / Y = y_l)}$$

MAP – Maximum a posteriori rule:

$$y^* = \arg\max_k P(Y = y_k / X) = \arg\max_k P(Y = y_k) \cdot P(X / Y = y_k)$$

The prior probability of class k, P(Y = y_k), is estimated by the empirical frequency n_k / n.

How to estimate P(X / Y = y_k)? Assumptions are introduced in order to obtain a convenient calculation of this likelihood.
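As a small illustration (a sketch with made-up prior and likelihood values, not taken from the slides), the MAP rule fits in a few lines of Python:

```python
# A minimal sketch of the MAP rule (illustrative values, not from the slides).
import numpy as np

prior = np.array([0.5, 0.5])         # P(Y = y_k), estimated by n_k / n
likelihood = np.array([0.03, 0.12])  # P(X / Y = y_k) for the observed X

joint = prior * likelihood       # numerators of Bayes theorem
posterior = joint / joint.sum()  # P(Y = y_k / X)
y_star = int(np.argmax(posterior))
print(posterior, y_star)  # [0.2 0.8] 1
```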

SLIDE 3

SLIDE 4

Conditional independence for the calculation of the likelihood

The attributes are all conditionally independent of one another given the value of Y:

$$P(X / Y = y_k) = \prod_{j=1}^{J} P(X_j / Y = y_k)$$

For a categorical attribute X, the conditional probability for the value x_l is computed as follows:

$$P(X = x_l / Y = y_k) = \frac{P(X = x_l \wedge Y = y_k)}{P(Y = y_k)}$$

The probability is estimated using the conditional relative frequency:

$$\hat{P}(X = x_l / Y = y_k) = \frac{\#\left[(X = x_l) \wedge (Y = y_k)\right]}{\#(Y = y_k)} = \frac{n_{kl}}{n_k}$$

The Laplace rule of succession is often used to estimate the conditional probability:

$$\hat{P}(X = x_l / Y = y_k) = p_{l/k} = \frac{n_{kl} + 1}{n_k + L}$$

where L is the number of levels of X. This is a kind of smoothing; it also overcomes the (n_kl = 0) problem.
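A minimal sketch of this estimator (the data below are illustrative, borrowing the Marié / Maladie labels of the upcoming toy example):

```python
# Sketch of the Laplace-smoothed estimate (n_kl + 1) / (n_k + L); the data
# below are illustrative (levels of Marié among the Maladie = Absent cases).
from collections import Counter

def laplace_conditional(x, y, x_levels, y_level):
    """P^(X = x_l / Y = y_level) for every level x_l of X."""
    values = [xi for xi, yi in zip(x, y) if yi == y_level]
    n_k, L = len(values), len(x_levels)
    counts = Counter(values)
    return {xl: (counts[xl] + 1) / (n_k + L) for xl in x_levels}

x = ["Non", "Non", "Oui", "Oui", "Oui"]
y = ["Absent"] * 5
print(laplace_conditional(x, y, ["Non", "Oui"], "Absent"))
# {'Non': 0.4286, 'Oui': 0.5714}  i.e. 3/7 and 4/7
```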

SLIDE 5

An example using a toy dataset. The contingency tables:

Maladie   | Count
Absent    |   5
Présent   |   5
Total     |  10

Maladie \ Marié     | Non | Oui | Total
Absent              |  2  |  3  |   5
Présent             |  4  |  1  |   5
Total               |  6  |  4  |  10

Maladie \ Etud.Sup  | Non | Oui | Total
Absent              |  4  |  1  |   5
Présent             |  1  |  4  |   5
Total               |  5  |  5  |  10

For an instance with Marié = Oui and Etu.Sup = Oui, under the conditional independence assumption (with Laplace smoothing):

$$\hat{P}(Maladie = Absent / Marié = Oui, Etu = Oui) \propto \hat{P}(Absent) \times \hat{P}(Marié = Oui / Abs.) \times \hat{P}(Etu = Oui / Abs.) = \frac{5+1}{10+2} \times \frac{3+1}{5+2} \times \frac{1+1}{5+2} = 0.082$$

$$\hat{P}(Maladie = Présent / Marié = Oui, Etu = Oui) \propto \hat{P}(Présent) \times \hat{P}(Marié = Oui / Prés.) \times \hat{P}(Etu = Oui / Prés.) = \frac{5+1}{10+2} \times \frac{1+1}{5+2} \times \frac{4+1}{5+2} = 0.102$$

→ If Etu = Oui and Marié = Oui then Maladie = Présent

The toy dataset (10 instances):

Maladie  | Marié | Etud.Sup
Présent  | Non   | Oui
Présent  | Non   | Oui
Absent   | Non   | Non
Absent   | Oui   | Oui
Présent  | Non   | Oui
Absent   | Non   | Non
Absent   | Oui   | Non
Présent  | Non   | Oui
Absent   | Oui   | Non
Présent  | Oui   | Non

Direct estimation of the posterior probability: only one instance of the dataset has Marié = Oui and Etu.Sup = Oui (its class is Absent), hence

$$\hat{P}(Maladie = Absent / Marié = Oui, Etu = Oui) = \frac{1}{1} = 1 \qquad \hat{P}(Maladie = Présent / Marié = Oui, Etu = Oui) = \frac{0}{1} = 0$$

→ If Etu = Oui and Marié = Oui then Maladie = Absent

Direct estimation: (+) no assumptions, (−) small number of covered examples. Naive Bayes: (−) questionable assumption, (+) more reliable estimation of the probabilities.
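The toy computation above can be checked with a short script (a sketch, hard-coding the 10 instances and the Laplace-smoothed frequencies):

```python
# Sketch reproducing the naive Bayes side of the toy example (Laplace smoothing).
data = [  # (Maladie, Marié, Etud.Sup)
    ("Présent", "Non", "Oui"), ("Présent", "Non", "Oui"), ("Absent", "Non", "Non"),
    ("Absent", "Oui", "Oui"), ("Présent", "Non", "Oui"), ("Absent", "Non", "Non"),
    ("Absent", "Oui", "Non"), ("Présent", "Non", "Oui"), ("Absent", "Oui", "Non"),
    ("Présent", "Oui", "Non"),
]
n, K, L = len(data), 2, 2  # 2 classes, 2 levels per attribute

def nb_score(c, marie, etud):
    n_k = sum(1 for r in data if r[0] == c)
    p_prior = (n_k + 1) / (n + K)
    p_marie = (sum(1 for r in data if r[0] == c and r[1] == marie) + 1) / (n_k + L)
    p_etud = (sum(1 for r in data if r[0] == c and r[2] == etud) + 1) / (n_k + L)
    return p_prior * p_marie * p_etud

print(round(nb_score("Absent", "Oui", "Oui"), 3))   # 0.082
print(round(nb_score("Présent", "Oui", "Oui"), 3))  # 0.102 -> Maladie = Présent
```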

SLIDE 6

>> Simplicity, quickness, ability to handle very large datasets, no risk of failure during the calculations
>> This is a linear classifier → similar classification performance (see the numerous experiments described in scientific papers)
>> Incrementality (we only store the contingency tables)
>> Statistically robust (even if the assumption is very questionable)
>> No indication about the relevance of the attributes (really?)
>> Very high number of rules (in practice, the logical rules are not computed; the contingency tables used for the calculation of the conditional frequencies are deployed, e.g. in PMML format)
>> No explicit model (really?) → not used in the marketing domain, etc.

We often see these conclusions in the literature… Is it possible to go beyond that?

Advantages and shortcomings (end of the course?)

SLIDE 7

Logarithmic transformation:

$$y^* = \arg\max_k \left[ P(Y = y_k) \prod_{j=1}^{J} P(X_j / Y = y_k) \right] \Leftrightarrow y^* = \arg\max_k \left[ \ln P(Y = y_k) + \sum_{j=1}^{J} \ln P(X_j / Y = y_k) \right]$$
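A sketch of the same decision computed in log space, reusing the smoothed frequencies of the toy example:

```python
# Sketch: the decision computed in log space, which avoids the numerical
# underflow of a long product of probabilities (values from the toy example).
import numpy as np

log_prior = np.log([6/12, 6/12])
log_lik = np.log([[4/7, 2/7],   # P(Marié=Oui / y_k), P(Etu=Oui / y_k) for Absent
                  [2/7, 5/7]])  # ... for Présent

d = log_prior + log_lik.sum(axis=1)  # ln P(y_k) + sum_j ln P(X_j / y_k)
print(d, int(np.argmax(d)))          # argmax = 1 -> Présent, as before
```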

SLIDE 8

Model using one predictive attribute. For a discrete attribute X with L levels:

$$d(y_k, X) = \ln P(Y = y_k) + \ln P(X / Y = y_k)$$

From X, we can create L dummy variables (I_l = 1 if X = x_l, 0 otherwise):

$$d(y_k, X) = \ln P(Y = y_k) + \sum_{l=1}^{L} \ln P(X = x_l / Y = y_k) \cdot I_l = a_{k,0} + \sum_{l=1}^{L} a_{k,l} \cdot I_l$$

We obtain a linear combination of the dummy variables, i.e. an explicit model which is easy to deploy → K linear classification functions (as in linear discriminant analysis).

SLIDE 9

Maladie   | Count
Absent    |   5
Présent   |   5
Total     |  10

Maladie \ Etud.Sup  | Non | Oui | Total
Absent              |  4  |  1  |   5
Présent             |  1  |  4  |   5
Total               |  5  |  5  |  10

An example (Y: Maladie; X: Etud.Sup). With Laplace smoothing:

$$d(absent, X) = \ln\frac{5+1}{10+2} + \ln\frac{4+1}{5+2} \cdot I_{X=non} + \ln\frac{1+1}{5+2} \cdot I_{X=oui} = -0.6931 - 0.3365 \cdot I_{X=non} - 1.2528 \cdot I_{X=oui}$$

$$d(present, X) = -0.6931 - 1.2528 \cdot I_{X=non} - 0.3365 \cdot I_{X=oui}$$

For an instance with Etu.Sup = Non:

d(absent, X) = -0.6931 - 0.3365 = -1.0296
d(present, X) = -0.6931 - 1.2528 = -1.9459

Prediction: Maladie = Absent
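A sketch reproducing these two classification functions and the prediction:

```python
# Sketch reproducing the two classification functions of this slide
# (one attribute, Laplace smoothing, full dummy coding).
import math

ln = math.log
d_absent = {"const": ln(6/12), "non": ln(5/7), "oui": ln(2/7)}
d_present = {"const": ln(6/12), "non": ln(2/7), "oui": ln(5/7)}

# Instance with Etud.Sup = Non:
print(d_absent["const"] + d_absent["non"])    # -1.0296
print(d_present["const"] + d_present["non"])  # -1.9459 -> Maladie = Absent
```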

SLIDE 10

Implemented solution in TANAGRA: using (L−1) dummy variables for an attribute X with L levels. One level, x_L, becomes the reference level; this dummy coding is the most commonly used coding scheme. Since

$$I_L = 1 - I_1 - I_2 - \cdots - I_{L-1}$$

the classification function can be rewritten as

$$d(y_k, X) = \ln P(Y = y_k) + \sum_{l=1}^{L} \ln P(X = x_l / Y = y_k) \cdot I_l = \left[ \ln P(Y = y_k) + \ln P(X = x_L / Y = y_k) \right] + \sum_{l=1}^{L-1} \ln\left[\frac{P(X = x_l / Y = y_k)}{P(X = x_L / Y = y_k)}\right] \cdot I_l = b_{k,0} + \sum_{l=1}^{L-1} b_{k,l} \cdot I_l$$

SLIDE 11

Linear classification functions using the indicator variables. Extension to J predictive attributes with the dummy coding scheme: each X_j with L_j levels → (L_j − 1) dummy variables. (The computation uses the same toy dataset as above.)

SLIDE 12

The particular case of binary classification (K = 2): construction of the SCORE function. The class attribute has 2 levels, Y = {+, −}.

$$d(+, X) = a_{+,0} + a_{+,1} X_1 + a_{+,2} X_2 + \cdots + a_{+,J} X_J$$
$$d(-, X) = a_{-,0} + a_{-,1} X_1 + a_{-,2} X_2 + \cdots + a_{-,J} X_J$$
$$D(X) = d(+, X) - d(-, X) = c_0 + c_1 X_1 + c_2 X_2 + \cdots + c_J X_J$$

Decision rule: D(X) > 0 → Y = +

Interpretation:
>> D(X) is the SCORE function: it assigns to each instance a score which increases with the estimated probability of the positive class.
>> The signs of the coefficients allow us to interpret the influence of the descriptors.

Our example:

Descriptors     | Présent   | Absent    | SCORE D(X)
Marié = Non     |  0.916291 | -0.287682 |  1.203973
Etud.Sup = Oui  |  0.916291 | -0.916291 |  1.832582
constant        | -3.198673 | -1.589235 | -1.609438

Not being married "makes you sick"… studying "makes you sick"…
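A sketch deriving the SCORE column from the smoothed conditional frequencies (the coefficient and constant values match the table above):

```python
# Sketch: the SCORE function D(X) as the difference of the two classification
# functions, with (L-1) dummy coding (references: Marié = Oui, Etud.Sup = Non).
import math

ln = math.log
coef = {
    "Présent": {"Marié=Non": ln((5/7) / (2/7)), "Etud=Oui": ln((5/7) / (2/7)),
                "const": ln(6/12) + ln(2/7) + ln(2/7)},
    "Absent": {"Marié=Non": ln((3/7) / (4/7)), "Etud=Oui": ln((2/7) / (5/7)),
               "const": ln(6/12) + ln(4/7) + ln(5/7)},
}
D = {k: coef["Présent"][k] - coef["Absent"][k] for k in coef["Présent"]}
print(D)  # {'Marié=Non': 1.2040, 'Etud=Oui': 1.8326, 'const': -1.6094}
```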

SLIDE 13

Reading the coefficients of the classification functions (the naive Bayes classifier as an explicit representation). The coefficients are estimates of the conditional probabilities; here they are computed from the row profiles, without smoothing.

Row profiles (Maladie × Marié):

Maladie  | Non | Oui | Total
Présent  | 0.8 | 0.2 | 1.0
Absent   | 0.4 | 0.6 | 1.0
Total    | 0.6 | 0.4 | 1.0

Classification functions:

Descriptors  | Présent  | Absent
Marié = Non  |  1.38629 | -0.4055
constant     | -2.3026  | -1.204

The coefficient of the classification function corresponds to the logarithm of the odds:

$$Odds(M = Non ; présent) = \frac{P(M = Non / Y = présent)}{P(M = Oui / Y = présent)} = \frac{0.8}{0.2} = 4, \qquad \ln(4) = 1.386294$$

The sick individuals (Maladie = présent) are 4 times more likely to be unmarried than married.

$$Odds(M = Non ; absent) = \frac{0.4}{0.6} = 0.667, \qquad \ln(0.667) = -0.4055$$

The non-sick individuals are (1 / 0.667) = 1.5 times more likely to be married than unmarried.
SLIDE 14

Reading the coefficients of the SCORE function (binary problem). Using the same row profiles (Maladie × Marié) as on the previous slide:

$$Odds\text{-}ratio(M = Non) = \frac{Odds(M = Non ; Y = présent)}{Odds(M = Non ; Y = absent)} = \frac{4}{0.667} = 6$$

Descriptors  | Présent  | Absent   | SCORE
Marié = Non  |  1.38629 | -0.40547 |  1.79176
constant     | -2.30259 | -1.20397 | -1.09861

The sick individuals are 6 times more likely to be unmarried than the non-sick individuals: ln(6) = 1.79176.

Comments:
• The coefficient of the score function corresponds to the logarithm of the odds-ratio.
• The reading of the odds-ratio is inverted compared with logistic regression.
• This interpretation is relevant only if the association between X and Y is significant.
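A quick numeric check (a sketch using the unsmoothed row profiles) that the coefficients are log-odds and the SCORE coefficient a log odds-ratio:

```python
# Sketch: the coefficients above are log-odds, and the SCORE coefficient is a
# log odds-ratio (unsmoothed row profiles of the Maladie x Marié table).
import math

odds_present = 0.8 / 0.2  # Odds(Marié = Non ; présent) = 4
odds_absent = 0.4 / 0.6   # Odds(Marié = Non ; absent) = 0.667
print(math.log(odds_present))                # 1.38629  (coefficient, Présent)
print(math.log(odds_absent))                 # -0.40546 (coefficient, Absent)
print(math.log(odds_present / odds_absent))  # 1.79176  (SCORE coefficient, ln 6)
```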

SLIDE 15

Checking the relevance of the variables; removing the irrelevant variables; removing the redundancy between the variables.

SLIDE 16

Amazing consequence of the conditional independence assumption

By nature, the coefficients associated with a variable are estimated independently of the other predictive variables → adding or removing one predictive variable does not modify the coefficients related to the other variables. There is no need to recalculate the other coefficients when we add or remove a variable.

Classifier with 1 variable:

Descriptors  | Présent   | Absent
Marié = Non  |  0.916291 | -0.287682
constant     | -1.94591  | -1.252763

Classifier with 2 variables ("Etud.Sup" is added) — only the constant changes:

Descriptors     | Présent   | Absent
Marié = Non     |  0.916291 | -0.287682
Etud.Sup = Oui  |  0.916291 | -0.916291
constant        | -3.198673 | -1.589235
SLIDE 17

Relevance of an attribute (1)

A variable is influential if it increases the differences between the classification functions d(y_k, X) across the classes y_k → i.e. if the conditional distributions P(X / y_k) differ across the y_k → i.e. if the conditional distributions P(X / y_k) differ from the marginal distribution P(X).

Row profiles (Maladie × Marié):

Maladie  | Non | Oui | Total
Absent   | 0.4 | 0.6 | 1.0
Présent  | 0.8 | 0.2 | 1.0
Total    | 0.6 | 0.4 | 1.0

Row profiles (Maladie × Etud.Sup):

Maladie  | Non | Oui | Total
Absent   | 0.8 | 0.2 | 1.0
Présent  | 0.2 | 0.8 | 1.0
Total    | 0.5 | 0.5 | 1.0

Entropy (~ total variance) and conditional entropy (~ within variance):

$$H(X) = -\sum_{l=1}^{L} p_{.l} \log_2 p_{.l} \qquad H(X / Y) = -\sum_{k=1}^{K} p_{k.} \sum_{l=1}^{L} p_{l/k} \log_2 p_{l/k}$$

Mutual information (~ between variance, i.e. explained variance):

$$I(Y, X) = H(X) - H(X / Y) = \sum_{l=1}^{L} \sum_{k=1}^{K} p_{kl} \log_2 \frac{p_{kl}}{p_{k.} \, p_{.l}}$$
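A minimal sketch computing I(Y, X) from a joint probability table (it reproduces the values used on the next slide):

```python
# Sketch: I(Y, X) from a joint probability table (rows = classes y_k,
# columns = levels x_l), with base-2 logarithms.
import numpy as np

def mutual_information(p):
    p = np.asarray(p, dtype=float)
    pk, pl = p.sum(axis=1), p.sum(axis=0)  # marginals p_k. and p_.l
    nz = p > 0                             # convention: 0 * log(0) = 0
    return float((p[nz] * np.log2(p[nz] / np.outer(pk, pl)[nz])).sum())

print(round(mutual_information([[0.4, 0.1], [0.1, 0.4]]), 4))  # Etud.Sup: 0.2781
print(round(mutual_information([[0.2, 0.3], [0.4, 0.1]]), 4))  # Marié:    0.1245
```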

SLIDE 18

Nombre de Marié Etud.Sup Maladie Non Oui Total général Absent 0.4 0.1 0.5 Présent 0.1 0.4 0.5 Total général 0.5 0.5 1.0 Nombre de Marié Marié Maladie Non Oui Total général Absent 0.2 0.3 0.5 Présent 0.4 0.1 0.5 Total général 0.6 0.4 1.0

2781 . ) , (  ES Y I

1245 . ) , (  M Y I

Statistical test (H0 : the variables are independent)

     

1 1 ~ ) , ( 2 ln 2

2

       L K X Y I n G 

We can even determine the statistical significance of the association

0496 . . 85 . 3 ) (    value p ES G 1889 . . 73 . 1 ) (    value p M G

We can establish a hierarchy between the predictive variables

The association between Y and ES is significant The association between Y and M is not significant

Relevance of an attribute (2)
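A sketch of the test using scipy's chi-square distribution:

```python
# Sketch: the G statistic and its chi-square p-value (scipy), reproducing the
# figures above.
import math
from scipy.stats import chi2

def g_test(I, n, K, L):
    G = 2 * n * math.log(2) * I  # G = 2 n ln(2) I(Y, X)
    return G, chi2.sf(G, (K - 1) * (L - 1))

print(g_test(0.2781, 10, 2, 2))  # ES: G = 3.86, p = 0.0496
print(g_test(0.1245, 10, 2, 2))  # M:  G = 1.73, p = 0.1889
```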

SLIDE 19

Ranking using the symmetrical uncertainty measure, which is defined on [0; 1]:

$$s_{Y,X} = \frac{2 \cdot I(Y, X)}{H(Y) + H(X)}$$

RANKING:
1. Calculate s for each predictive variable.
2. Sort the variables in decreasing order of s.
3. Retain only the variables significantly related to Y.

E.g. on the « kr-vs-kp » dataset, 19 variables are selected for alpha = 0.001.

Shortcomings:
• Choosing the right significance level "alpha" is difficult.
• All the associations become significant as the database size n increases. Possible solution: the "elbow rule".

Unacceptable shortcoming: this solution does not take the redundancy between the variables into account.

SLIDE 20

Feature selection which handles the redundancy – the CFS approach

The MERIT of a subset of p attributes is defined as follows:

$$merit_p = \frac{p \cdot \bar{s}_{Y,X}}{\sqrt{p + p(p-1) \cdot \bar{s}_{X,X}}}$$

Numerator: average association of the predictive attributes with the target variable (relevance). Denominator: average association between the predictive attributes (redundancy). → The aim is to obtain a subset of attributes which are strongly related to the target attribute and weakly related to each other.

« FORWARD » strategy (see the sketch below):
• Begin with 0 variables.
• Sequentially add the variable which maximizes the increase of the MERIT.
• Etc.

Stopping rule: stop when the additional variable does not increase the MERIT. E.g. on « kr-vs-kp », only 3 variables are selected.
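A sketch of this FORWARD search (the symmetrical-uncertainty dictionary `su` and its key convention are assumptions of the sketch, not part of the slides):

```python
# Sketch of the FORWARD search on the CFS merit. It assumes a precomputed
# dictionary `su` of symmetrical uncertainties: su[("Y", a)] for the
# attribute-target associations and su[(a, b)] (a < b) for attribute pairs.
from itertools import combinations
from math import sqrt

def merit(subset, su):
    p = len(subset)
    s_yx = sum(su[("Y", a)] for a in subset) / p
    pairs = list(combinations(sorted(subset), 2))
    s_xx = sum(su[c] for c in pairs) / len(pairs) if pairs else 0.0
    return p * s_yx / sqrt(p + p * (p - 1) * s_xx)

def cfs_forward(attributes, su):
    selected, best = [], 0.0
    while len(selected) < len(attributes):
        m, a = max((merit(selected + [x], su), x)
                   for x in attributes if x not in selected)
        if m <= best:  # stop when the merit no longer increases
            break
        selected.append(a)
        best = m
    return selected  # usage: cfs_forward(["a", "b", "c"], su)
```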
SLIDE 21

Is the selection justified and appropriate?

Training set: 1500 instances; test set: 1696 instances.
All 34 descriptors → test error rate = 14.80%. The 3 selected descriptors → test error rate = 9.67%.

Variable selection reduces the number of variables while maintaining the performance level. Sometimes it even improves the prediction performance (as here, but this is rare). Sometimes it is harmful (when we remove too many variables).

SLIDE 22

We can get back to the previous situation (discrete variables) by discretizing the continuous variables, e.g. with a supervised approach such as MDLPC (Fayyad & Irani, 1993). → Empirical studies show that this is a good solution. → It is the best solution when we have a mix of continuous and discrete predictive attributes.

SLIDE 23

Discretization of continuous attributes using a specific supervised algorithm. The well-known unsupervised approaches (e.g. equal width, equal frequency) do not take the target attribute into account; they are not suited to the supervised learning context.

[Figure: four examples of conditional density plots, x1 to x4.]

Why are supervised algorithms (MDLPC, Fayyad & Irani, 1993; Chi-merge, Kerber, 1992) more convenient?
• They detect the intervals where one of the classes is overrepresented.
• They automatically detect the right number of intervals.

SLIDE 24

Discretization of continuous attributes using a decision tree learning algorithm: the variable to discretize is the only predictive variable used in the decision tree learning. The variable is used, with different cut points, in the various splits.
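A sketch of this idea with scikit-learn (the helper name and the synthetic data are mine; the slide does not prescribe an implementation):

```python
# Sketch: supervised discretization of one variable with a depth-limited
# decision tree (scikit-learn); the thresholds of the internal nodes are the
# cut points. The data here are synthetic.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tree_cut_points(x, y, max_depth=2):
    tree = DecisionTreeClassifier(max_depth=max_depth).fit(x.reshape(-1, 1), y)
    t = tree.tree_
    return sorted(t.threshold[t.children_left != -1])  # internal nodes only

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 50), rng.normal(3, 1, 50)])
y = np.array([0] * 50 + [1] * 50)
print(tree_cut_points(x, y))  # e.g. one main cut point near 1.5
```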

SLIDE 25

Parametric approach: making assumptions about the conditional distributions.

SLIDE 26

Assumption 1 – Gaussian conditional distribution

Normal distribution for X_j conditionally on y_k:

$$P(X_j / Y = y_k) \;\rightarrow\; f(x_j) = \frac{1}{\sigma_{k,j}\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{x_j - \mu_{k,j}}{\sigma_{k,j}}\right)^2\right]$$

[Figure: two examples of conditional densities. Left: compatible with the Gaussian assumption. Right: not compatible with the Gaussian assumption → a possible solution is discretization.]

Note: this is a particular case of discriminant analysis in which the values outside the main diagonal of the covariance matrix are considered to be zero (see Linear Discriminant Analysis).
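A minimal sketch of the estimation under this assumption (synthetic data; maximum-likelihood mean and standard deviation per class):

```python
# Sketch: estimating the class-conditional Gaussian parameters (mu_kj, sigma_kj)
# and evaluating the density, for one variable; synthetic data.
import numpy as np

def gaussian_params(x, y):
    return {c: (x[y == c].mean(), x[y == c].std()) for c in np.unique(y)}

def density(v, mu, sigma):
    return np.exp(-0.5 * ((v - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(2, 1, 100)])
y = np.array([0] * 100 + [1] * 100)

params = gaussian_params(x, y)
print({c: float(density(1.0, mu, s)) for c, (mu, s) in params.items()})
```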
SLIDE 27

Consequences of the Gaussian assumption → quadratic classifier

The decision rule is not modified, i.e.

$$\hat{y}(\omega) = y_{k^*} \Leftrightarrow y_{k^*} = \arg\max_k d(y_k, \omega)$$

Taking the logarithm of the prior times the product of the Gaussian densities yields a quadratic function of each x_j:

$$d(y_k, X) = \ln p_k - \sum_j \left[\frac{1}{2}\ln(2\pi\sigma_{k,j}^2) + \frac{(x_j - \mu_{k,j})^2}{2\sigma_{k,j}^2}\right] = \ln p_k + \sum_j \left(a_{k,j} x_j^2 + b_{k,j} x_j + c_{k,j}\right)$$

IRIS dataset (2 predictive variables): the interpretation is not easy.

SLIDE 28

Assumption 2 – Homoscedasticity

The conditional variances are the same over the classes:

$$\sigma_{k,j} = \sigma_j, \;\; \forall k$$

$$P(X_j / Y = y_k) \;\rightarrow\; f(x_j) = \frac{1}{\sigma_j\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{x_j - \mu_{k,j}}{\sigma_j}\right)^2\right]$$

The common variance is estimated with the within-class variance.

[Figure: two examples of conditional densities. Left: not compatible with the assumption → but the approach is robust. Right: compatible with the homoscedasticity assumption.]

SLIDE 29

Consequences of the homoscedasticity assumption → linear classifier

The quadratic terms x_j^2 now have the same coefficient for every class, so they cancel in the comparison and the classification function becomes linear:

$$d(y_k, X) = \ln p_k - \sum_j \frac{(x_j - \mu_{k,j})^2}{2\sigma_j^2} \;\Rightarrow\; d(y_k, X) = a_{k,0} + a_{k,1} x_1 + a_{k,2} x_2 + \cdots + a_{k,J} x_J$$

The decision rule is not modified, i.e.

$$\hat{y}(\omega) = y_{k^*} \Leftrightarrow y_{k^*} = \arg\max_k d(y_k, \omega)$$

IRIS dataset (2 variables). If K = 2 (binary problem), we can calculate the SCORE function D(X). The interpretation is easier:

PET.LENGTH low → Setosa; PET.LENGTH middle → Versicolor; PET.LENGTH high → Virginica

SLIDE 30

Evaluating the relevance of the variables; removing the irrelevant variables; removing the redundancies.

SLIDE 31

Variable importance – one-way ANOVA scheme

[Figure: conditional densities of x for two classes with different means.]

Comparison of the conditional means:

$$H_0: \mu_{1,j} = \mu_{2,j} = \cdots = \mu_{K,j}$$

Test statistic F = Between variance / Within variance:

$$F = \frac{\sum_k n_k (\hat{\mu}_k - \hat{\mu})^2 / (K - 1)}{\sum_k n_k \hat{\sigma}_k^2 / (n - K)}$$

Under H0, F follows a Fisher distribution with (K−1, n−K) degrees of freedom.
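A sketch computing F from this definition and checking it against scipy.stats.f_oneway (the three groups are illustrative):

```python
# Sketch: the F statistic computed from its definition and checked against
# scipy.stats.f_oneway; the three groups are illustrative.
import numpy as np
from scipy.stats import f, f_oneway

groups = [np.array([4.9, 5.1, 5.0]),
          np.array([6.0, 6.2, 5.8]),
          np.array([6.9, 7.1, 7.0])]
K, n = len(groups), sum(len(g) for g in groups)
grand = np.concatenate(groups).mean()

between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups) / (K - 1)
within = sum(((g - g.mean()) ** 2).sum() for g in groups) / (n - K)
F = between / within
print(F, f.sf(F, K - 1, n - K))  # p-value under H0
print(f_oneway(*groups))         # same statistic and p-value
```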

SLIDE 32

Variable selection – RANKING

RANKING approach:
1. Calculate F for all the variables.
2. Sort them in decreasing order of F.
3. Retain only the variables with a significant association.

E.g. IRIS + 1 ALEA (a randomly generated variable). Same problems as for the discrete attributes → choosing the significance level "alpha" → dealing with the redundancy.

SLIDE 33

Variable selection – how to deal with redundancy?

Extension of the CFS approach to continuous predictors, with the same MERIT criterion:

$$merit_p = \frac{p \cdot \bar{s}_{Y,X}}{\sqrt{p + p(p-1) \cdot \bar{s}_{X,X}}}$$

• Measure 1: measuring the association between Y (discrete) and X (continuous).
• Measure 2: measuring the association between X_j (continuous) and X_j' (continuous).
Problem → Measure 1 and Measure 2 must be comparable!

Other approaches:
→ The STEPDISC algorithm of linear discriminant analysis (multivariate analysis of variance – MANOVA). But the calculations are costly.
→ Using the embedded approach of another learning algorithm (e.g. decision trees). But the variables relevant for one method are not necessarily the relevant ones for the naive Bayes classifier.
→ Discretizing the predictive variables, then using the selection approaches for discrete attributes.

SLIDE 34

>> Strong advantages (incrementality, ability to handle very large databases)
>> We can extract an explicit model (this is largely unknown)
>> Very often used in the research domain (text mining, etc.)
>> Not used in some domains (e.g. marketing)… because the users do not know that we can extract an explicit model which can be deployed easily

SLIDE 35

References

Tanagra tutorial, "Naive Bayes Classifier for discrete predictors", http://data-mining-tutorials.blogspot.fr/2010/07/naive-bayes-classifier-for-discrete.html
Tanagra tutorial, "Naive Bayes Classifier for continuous predictors", http://data-mining-tutorials.blogspot.fr/2010/11/naive-bayes-classifier-for-continuous.html
Wikipedia, "Naive Bayes Classifier", http://en.wikipedia.org/wiki/Naive_Bayes_classifier
StatSoft e-books, "Naive Bayes Classifier" (see other distribution assumptions), http://www.statsoft.com/textbook/naive-bayes-classifier/