
Ricco RAKOTOMALALA - Tutoriels Tanagra



  1. Ricco RAKOTOMALALA. Tutoriels Tanagra - http://data-mining-tutorials.blogspot.fr/

  2. Maximum a posteriori rule

     Calculating the posterior probability (Bayes theorem):

     $$P(Y = y_k / X) = \frac{P(Y = y_k) \times P(X / Y = y_k)}{P(X)} = \frac{P(Y = y_k) \times P(X / Y = y_k)}{\sum_{l=1}^{K} P(Y = y_l) \times P(X / Y = y_l)}$$

     MAP (maximum a posteriori) rule:

     $$y_{k^*} = \arg\max_k P(Y = y_k / X) \;\Leftrightarrow\; y_{k^*} = \arg\max_k P(Y = y_k) \times P(X / Y = y_k)$$

     Prior probability of class $k$, $P(Y = y_k)$: estimated by the empirical frequency $n_k / n$.

     How to estimate $P(X / Y = y_k)$? Assumptions are introduced in order to obtain a convenient calculation of this likelihood.
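The MAP rule above can be illustrated with a minimal Python sketch. The class names `y1`, `y2` and the numeric values are hypothetical, chosen only for illustration; the helper `posterior` is not part of the slides. It shows that the normalized posterior and the unnormalized product of prior and likelihood select the same class.

```python
def posterior(priors, likelihoods):
    """Bayes theorem: return (P(Y = y_k / X) for every k, unnormalized products)."""
    joint = {k: priors[k] * likelihoods[k] for k in priors}
    evidence = sum(joint.values())          # P(X), the denominator
    return {k: joint[k] / evidence for k in joint}, joint

# Hypothetical values for a two-class problem (assumptions, not from the slides).
priors = {"y1": 0.7, "y2": 0.3}             # P(Y = y_k), e.g. n_k / n
likelihoods = {"y1": 0.1, "y2": 0.5}        # P(X / Y = y_k) for one instance

post, joint = posterior(priors, likelihoods)
map_from_posterior = max(post, key=post.get)
map_from_joint = max(joint, key=joint.get)
print(post)
print(map_from_posterior, map_from_joint)   # the same class both times
```

Because the denominator is identical for every class, it can be skipped when only the predicted class is needed.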


  4. Conditional independence assumption

     Conditional independence for the calculation of the likelihood:

     $$P(X / Y = y_k) = \prod_{j=1}^{J} P(X_j / Y = y_k)$$

     The attributes are all conditionally independent of one another given the value of $Y$.

     For a categorical attribute $X$, the conditional probability of the value $x_l$ is computed as follows:

     $$P(X = x_l / Y = y_k) = \frac{P(X = x_l \wedge Y = y_k)}{P(Y = y_k)}$$

     The probability is estimated using the conditional relative frequency, read from the contingency table of $Y$ and $X$:

     $$\hat{P}(X = x_l / Y = y_k) = \frac{\#\{\omega : X(\omega) = x_l \wedge Y(\omega) = y_k\}}{\#\{\omega : Y(\omega) = y_k\}} = \frac{n_{kl}}{n_k}$$

     | Y \ X | $x_l$    | Total |
     |-------|----------|-------|
     | $y_k$ | $n_{kl}$ | $n_k$ |
     | Total |          | $n$   |

     The Laplace rule of succession is often used to estimate the conditional probability, where $L$ is the number of distinct values of $X$ (so that the estimates sum to 1 over $l$):

     $$\hat{P}(X = x_l / Y = y_k) = p_{l/k} = \frac{n_{kl} + 1}{n_k + L}$$

     This is a kind of smoothing; it also overcomes the $n_{kl} = 0$ problem.
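The estimation step above can be sketched in a few lines of Python. The function name `conditional_probs` and the small dataset are hypothetical, introduced only to show how the Laplace correction removes a zero frequency; the denominator adds the number of distinct values of X, which is an assumption about the intended formula.

```python
from collections import Counter

def conditional_probs(x_values, y_values, laplace=True):
    """Estimate P(X = x_l / Y = y_k) from two parallel lists of labels."""
    levels = sorted(set(x_values))            # the distinct values of X
    n_k = Counter(y_values)                   # class counts n_k
    n_kl = Counter(zip(y_values, x_values))   # cell counts n_kl
    add = 1 if laplace else 0
    return {
        (k, lvl): (n_kl[(k, lvl)] + add) / (n_k[k] + add * len(levels))
        for k in n_k
        for lvl in levels
    }

# Hypothetical data: the cell (y = "b", x = "v") is empty.
y = ["a", "a", "a", "b", "b"]
x = ["u", "v", "u", "u", "u"]

raw = conditional_probs(x, y, laplace=False)
smoothed = conditional_probs(x, y)
print(raw[("b", "v")])       # 0.0, the n_kl = 0 problem
print(smoothed[("b", "v")])  # (0 + 1) / (2 + 2) = 0.25
```

A zero estimate would cancel the whole product of conditional probabilities for that class, which is why the smoothed version is preferred in practice.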

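  5. An example using a toy dataset

     Variables: Maladie (disease: Présent = present, Absent = absent), Marié (married: Oui = yes, Non = no), Etud.Sup (higher education: Oui / Non).

     | Maladie | Marié | Etud.Sup |
     |---------|-------|----------|
     | Présent | Non   | Oui      |
     | Présent | Non   | Oui      |
     | Absent  | Non   | Non      |
     | Absent  | Oui   | Oui      |
     | Présent | Non   | Oui      |
     | Absent  | Non   | Non      |
     | Absent  | Oui   | Non      |
     | Présent | Non   | Oui      |
     | Absent  | Oui   | Non      |
     | Présent | Oui   | Non      |

     Direct estimation of the posterior probability:

     $$\hat{P}(\text{Maladie} = \text{Absent} / \text{Marié} = \text{Oui}, \text{Etud.Sup} = \text{Oui}) = \frac{1}{1} = 1$$

     $$\hat{P}(\text{Maladie} = \text{Présent} / \text{Marié} = \text{Oui}, \text{Etud.Sup} = \text{Oui}) = \frac{0}{1} = 0$$

     → If Etud.Sup = Oui and Marié = Oui then Maladie = Absent

     (+) No assumptions, (-) small number of covered examples

     Conditional independence assumption. The counts come from the following tables:

     | Maladie     | Count |
     |-------------|-------|
     | Absent      | 5     |
     | Présent     | 5     |
     | Grand total | 10    |

     | Maladie \ Marié | Non | Oui | Grand total |
     |-----------------|-----|-----|-------------|
     | Absent          | 2   | 3   | 5           |
     | Présent         | 4   | 1   | 5           |
     | Grand total     | 6   | 4   | 10          |

     | Maladie \ Etud.Sup | Non | Oui | Grand total |
     |--------------------|-----|-----|-------------|
     | Absent             | 4   | 1   | 5           |
     | Présent            | 1   | 4   | 5           |
     | Grand total        | 5   | 5   | 10          |

     With the Laplace correction, and up to the normalizing constant common to both classes:

     $$\hat{P}(\text{Absent} / \text{Marié} = \text{Oui}, \text{Etud.Sup} = \text{Oui}) \propto \hat{P}(\text{Absent}) \times \hat{P}(\text{Marié} = \text{Oui} / \text{Absent}) \times \hat{P}(\text{Etud.Sup} = \text{Oui} / \text{Absent}) = \frac{5+1}{10+2} \times \frac{3+1}{5+2} \times \frac{1+1}{5+2} = 0.082$$

     $$\hat{P}(\text{Présent} / \text{Marié} = \text{Oui}, \text{Etud.Sup} = \text{Oui}) \propto \hat{P}(\text{Présent}) \times \hat{P}(\text{Marié} = \text{Oui} / \text{Présent}) \times \hat{P}(\text{Etud.Sup} = \text{Oui} / \text{Présent}) = \frac{5+1}{10+2} \times \frac{1+1}{5+2} \times \frac{4+1}{5+2} = 0.102$$

     → If Etud.Sup = Oui and Marié = Oui then Maladie = Présent

     (-) Questionable assumption, (+) more reliable estimation of probabilities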
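The two estimates of the toy example can be reproduced with a short Python sketch. The variable names and the helper `nb_score` are illustrative; the data are the ten rows of the slide, and the smoothing (+1 in the numerator, plus the number of classes or levels in the denominator) follows the numbers shown on the slide.

```python
# (Maladie, Marié, Etud.Sup) rows of the toy dataset
data = [
    ("Présent", "Non", "Oui"), ("Présent", "Non", "Oui"),
    ("Absent", "Non", "Non"), ("Absent", "Oui", "Oui"),
    ("Présent", "Non", "Oui"), ("Absent", "Non", "Non"),
    ("Absent", "Oui", "Non"), ("Présent", "Non", "Oui"),
    ("Absent", "Oui", "Non"), ("Présent", "Oui", "Non"),
]
classes = ["Absent", "Présent"]
query = ("Oui", "Oui")  # Marié = Oui, Etud.Sup = Oui

# Direct estimation: relative frequencies among the rows matching the query.
covered = [row[0] for row in data if row[1:] == query]
direct = {k: covered.count(k) / len(covered) for k in classes}

def nb_score(k):
    """Smoothed prior times smoothed conditional frequencies (unnormalized)."""
    rows_k = [row for row in data if row[0] == k]
    score = (len(rows_k) + 1) / (len(data) + len(classes))
    for j, value in enumerate(query, start=1):
        n_kl = sum(1 for row in rows_k if row[j] == value)
        n_levels = len({row[j] for row in data})
        score *= (n_kl + 1) / (len(rows_k) + n_levels)
    return score

scores = {k: nb_score(k) for k in classes}
print(direct)                                        # one covered row only
print({k: round(s, 3) for k, s in scores.items()})   # 0.082 and 0.102
print(max(scores, key=scores.get))                   # Présent
```

The direct estimate rests on a single row, whereas every factor of the naive Bayes score is computed on the five rows of a class.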

  6. Advantages and shortcomings (end of the course?)

     - Simplicity, speed, ability to handle very large datasets, no possible crash during the calculations
     - Incrementality (we store only the contingency tables)
     - Statistically robust (even if the assumption is very questionable)
     - This is a linear classifier → similar classification performance (see the numerous experiments described in scientific papers)
     - No indication about the relevance of the attributes (really?)
     - Very high number of rules (in practice, the logical rules are not computed; the contingency tables used to calculate the conditional frequencies are deployed, e.g. in PMML format)
     - Not an explicit model (really?) → not used in the marketing domain, etc.

     We often see these conclusions in the literature… Is it possible to go beyond that?

  7. Logarithmic transformation

     $$y_{k^*} = \arg\max_k P(Y = y_k) \prod_{j=1}^{J} P(X_j / Y = y_k)$$

     $$\Leftrightarrow\; y_{k^*} = \arg\max_k \left\{ \ln P(Y = y_k) + \sum_{j=1}^{J} \ln P(X_j / Y = y_k) \right\}$$

  8. Model using one predictive attribute

     A discrete attribute $X$ with $L$ levels:

     $$d(y_k, X) = \ln P(Y = y_k) + \ln P(X / Y = y_k)$$

     From $X$, we can create $L$ dummy variables $I_1, \dots, I_L$:

     $$d(y_k, X) = \ln P(Y = y_k) + \sum_{l=1}^{L} \ln P(X = x_l / Y = y_k) \times I_l = a_{0,k} + \sum_{l=1}^{L} a_{l,k} \times I_l$$

     We obtain a linear combination of the dummy variables, i.e. an explicit model which is easy to deploy → $K$ linear classification functions (as in linear discriminant analysis).

  9. An example (Y: Maladie; X: Etud.Sup)

     | Maladie     | Count |
     |-------------|-------|
     | Absent      | 5     |
     | Présent     | 5     |
     | Grand total | 10    |

     | Maladie \ Etud.Sup | Non | Oui | Grand total |
     |--------------------|-----|-----|-------------|
     | Absent             | 4   | 1   | 5           |
     | Présent            | 1   | 4   | 5           |
     | Grand total        | 5   | 5   | 10          |

     $$d(\text{Absent}, X) = \ln\frac{5+1}{10+2} + \ln\frac{4+1}{5+2} \times (X = \text{Non}) + \ln\frac{1+1}{5+2} \times (X = \text{Oui})$$

     $$d(\text{Absent}, X) = -0.6931 - 0.3365 \times (X = \text{Non}) - 1.2528 \times (X = \text{Oui})$$

     $$d(\text{Présent}, X) = -0.6931 - 1.2528 \times (X = \text{Non}) - 0.3365 \times (X = \text{Oui})$$

     For an instance with Etud.Sup = Non:

     $$d(\text{Absent}, X) = -0.6931 - 0.3365 = -1.0296$$

     $$d(\text{Présent}, X) = -0.6931 - 1.2528 = -1.9459$$

     Prediction: Maladie = Absent, since $d(\text{Absent}, X) > d(\text{Présent}, X)$.
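As a check on the arithmetic of this example, here is a minimal Python sketch that rebuilds the two classification functions from the contingency table. The names `counts`, `coefficients` and `d` are illustrative; the smoothing constants are taken from the fractions shown on the slide.

```python
from math import log

# Contingency table Maladie x Etud.Sup from the slide: counts[class][level]
counts = {"Absent": {"Non": 4, "Oui": 1}, "Présent": {"Non": 1, "Oui": 4}}
n = sum(sum(row.values()) for row in counts.values())   # 10 instances
K = len(counts)                                          # 2 classes

def coefficients(k):
    """Return (a_0k, {level: a_lk}) computed from the smoothed estimates."""
    n_k = sum(counts[k].values())
    n_levels = len(counts[k])
    a0 = log((n_k + 1) / (n + K))
    a = {lvl: log((c + 1) / (n_k + n_levels)) for lvl, c in counts[k].items()}
    return a0, a

def d(k, x):
    """Classification function: intercept plus the coefficient of the active dummy."""
    a0, a = coefficients(k)
    return a0 + a[x]

for k in counts:
    print(k, round(d(k, "Non"), 4))   # Absent -1.0296, Présent -1.9459
```

For Etud.Sup = Non the larger value is obtained for Absent, which is the predicted class.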

  10. Solution implemented in TANAGRA (using $L - 1$ dummy variables for an attribute $X$ with $L$ levels)

     Since $I_1 + I_2 + \dots + I_L = 1$:

     $$d(y_k, X) = \ln P(Y = y_k) + \sum_{l=1}^{L} \ln P(X = x_l / Y = y_k) \times I_l$$

     $$= \ln P(Y = y_k) + \ln P(X = x_L / Y = y_k) + \sum_{l=1}^{L-1} \ln \frac{P(X = x_l / Y = y_k)}{P(X = x_L / Y = y_k)} \times I_l$$

     $$= b_{0,k} + \sum_{l=1}^{L-1} b_{l,k} \times I_l$$

     One level, $x_L$, becomes the reference level. Dummy coding is the most commonly used coding scheme.
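The equivalence between the L-dummy and the (L - 1)-dummy forms can be verified numerically. The sketch below uses the smoothed estimates of the class Absent for Etud.Sup; choosing Non as the reference level is an assumption made here so that the coefficient matches the one reported for Etud.Sup = Oui in the deck's final table. The function names are illustrative and this is not TANAGRA's code.

```python
from math import log, isclose

# Smoothed estimates for the class "Absent" in the Etud.Sup example
log_prior = log(6 / 12)
log_cond = {"Non": log(5 / 7), "Oui": log(2 / 7)}
reference = "Non"   # assumption: Non plays the role of the reference level x_L

def d_full(x):
    """L dummy variables: one coefficient per level."""
    return log_prior + sum(log_cond[lvl] * (x == lvl) for lvl in log_cond)

# L - 1 dummy variables: the intercept absorbs the reference level.
b0 = log_prior + log_cond[reference]
b = {lvl: log_cond[lvl] - log_cond[reference]
     for lvl in log_cond if lvl != reference}

def d_reduced(x):
    """Same function written with the reference-level coding."""
    return b0 + sum(b[lvl] * (x == lvl) for lvl in b)

for x in ("Non", "Oui"):
    print(x, round(d_full(x), 4), round(d_reduced(x), 4))
print(round(b["Oui"], 6))   # -0.916291
```

Both codings give the same value for every level, so dropping one dummy variable loses no information.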

  11. Extension to J predictive attributes

     | Maladie | Marié | Etud.Sup |
     |---------|-------|----------|
     | Présent | Non   | Oui      |
     | Présent | Non   | Oui      |
     | Absent  | Non   | Non      |
     | Absent  | Oui   | Oui      |
     | Présent | Non   | Oui      |
     | Absent  | Non   | Non      |
     | Absent  | Oui   | Non      |
     | Présent | Non   | Oui      |
     | Absent  | Oui   | Non      |
     | Présent | Oui   | Non      |

     Dummy coding scheme: $X_j$ with $L_j$ levels → $(L_j - 1)$ dummy variables.

     Linear classification functions using the indicator variables.

  12. The particular case of binary classification (K = 2)

     Construction of the SCORE function. The class attribute has 2 levels: Y = {+, -}.

     $$d(+, X) = a_{+,0} + a_{+,1} X_1 + a_{+,2} X_2 + \dots + a_{+,J} X_J$$

     $$d(-, X) = a_{-,0} + a_{-,1} X_1 + a_{-,2} X_2 + \dots + a_{-,J} X_J$$

     $$D(X) = d(+, X) - d(-, X) = c_0 + c_1 X_1 + c_2 X_2 + \dots + c_J X_J, \qquad c_j = a_{+,j} - a_{-,j}$$

     Decision rule: $D(X) > 0 \Rightarrow Y = +$

     Interpretation:

     - $D(X)$ is the SCORE function. It assigns to each instance a score that increases with the estimated probability of the positive class (it is the logarithm of the estimated posterior odds of + against -).
     - The sign of the coefficients lets us interpret the influence of the descriptors.

     Our example (positive class: Présent):

     | Descriptors    | Présent   | Absent    | D(X)      | Reading                            |
     |----------------|-----------|-----------|-----------|------------------------------------|
     | Marié = Non    | 0.916291  | -0.287682 | 1.203973  | Not being married makes you sick…  |
     | Etud.Sup = Oui | 0.916291  | -0.916291 | 1.832582  | Studying makes you sick…           |
     | constant       | -3.198673 | -1.589235 | -1.609438 |                                    |
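The coefficients of this last table can be recomputed from the toy dataset. The following Python sketch assumes the Laplace-corrected estimates used in the earlier slides, Présent as the positive class, and Marié = Oui and Etud.Sup = Non as reference levels; the helper names (`smoothed`, `functions`, `D`) are illustrative.

```python
from math import log

# (Maladie, Marié, Etud.Sup) rows of the toy dataset
data = [
    ("Présent", "Non", "Oui"), ("Présent", "Non", "Oui"),
    ("Absent", "Non", "Non"), ("Absent", "Oui", "Oui"),
    ("Présent", "Non", "Oui"), ("Absent", "Non", "Non"),
    ("Absent", "Oui", "Non"), ("Présent", "Non", "Oui"),
    ("Absent", "Oui", "Non"), ("Présent", "Oui", "Non"),
]
# (column index, coded level, reference level) for each dummy variable
dummies = [(1, "Non", "Oui"), (2, "Oui", "Non")]

def smoothed(k, col, level):
    """Laplace estimate of P(X_col = level / Y = k); both attributes have 2 levels."""
    rows_k = [r for r in data if r[0] == k]
    n_kl = sum(1 for r in rows_k if r[col] == level)
    return (n_kl + 1) / (len(rows_k) + 2)

def functions(k):
    """Return [constant, coef(Marié = Non), coef(Etud.Sup = Oui)] for class k."""
    n_k = sum(1 for r in data if r[0] == k)
    const = log((n_k + 1) / (len(data) + 2))      # smoothed prior, 2 classes
    coefs = []
    for col, level, ref in dummies:
        const += log(smoothed(k, col, ref))        # reference level goes to the constant
        coefs.append(log(smoothed(k, col, level) / smoothed(k, col, ref)))
    return [const] + coefs

pos, neg = functions("Présent"), functions("Absent")
score = [p - q for p, q in zip(pos, neg)]          # c_0, c_1, c_2

def D(marie, etud):
    """SCORE function: positive values predict Présent."""
    return score[0] + score[1] * (marie == "Non") + score[2] * (etud == "Oui")

print([round(v, 4) for v in pos])
print([round(v, 4) for v in neg])
print(D("Oui", "Oui") > 0)   # True: Présent, as in the naive Bayes toy example
```

Only the three values of `score` are needed at deployment time, which is what makes the model explicit.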
