Association rules and compositional data analysis: implications to big data



slide-1
SLIDE 1

Association rules and compositional data analysis: implications to big data

  • R. S. Kenett¹, J.A. Martín-Fernández², S. Thió-Henestrosa² and M. Vives-Mestres²

¹ KPA Group, Israel; University of Turin, Italy; and Neaman Institute, Technion, Israel
² Universitat de Girona, Spain

slide-2
SLIDE 2

CoDaWork 2017 2

slide-3
SLIDE 3

This is work in progress. The long-term goal is to introduce CoDa to text (semantic data) analysis and to scale it to big data.

slide-4
SLIDE 4

Association Rules (AR)

Transaction (document, itemset)

LHS (A): antecedent
RHS (B): consequent

Basket Analysis

Terms, items, tokens, words

slide-5
SLIDE 5

AR: Support, Confidence, Lift and Odds Ratio

         RHS    ^RHS   Total
LHS      x1     x2     g
^LHS     x3     x4     1-g
Total    f      1-f    1

A = LHS, B = RHS, with $\sum_{i=1}^{4} x_i = 1$, $0 \le x_i$, $i = 1, \dots, 4$.

support {A=>B} = x1
Proportion of transactions in which an item set appears

confidence {A=>B} = x1/g
Strength of implication, or predictive power

lift {A=>B} = confidence{A=>B} / support{B} = support{A=>B} / (support{A} support{B})
Lift < 1: A and B repel each other
Lift > 1: A and B have affinity to each other

OR {A=>B} = (x1*x4)/(x2*x3)
OR < 1: A and B repel each other
OR > 1: A and B have affinity to each other
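The four measures on this slide can be computed directly from the cell proportions of the 2x2 table. The sketch below is only an illustration: the function name `ar_measures` and the example proportions are hypothetical and do not come from the presentation.

```python
def ar_measures(x1, x2, x3, x4):
    """Support, confidence, lift and odds ratio of a rule A=>B.

    x1..x4 are the cell proportions of the 2x2 table (assumed to sum to 1,
    with x2 and x3 strictly positive so that the odds ratio is defined).
    """
    g = x1 + x2                     # support{A}: row total of LHS
    f = x1 + x3                     # support{B}: column total of RHS
    support = x1                    # support{A=>B}
    confidence = x1 / g             # confidence{A=>B}
    lift = confidence / f           # equals x1 / (g * f)
    odds_ratio = (x1 * x4) / (x2 * x3)
    return {
        "support": support,
        "confidence": confidence,
        "lift": lift,
        "odds_ratio": odds_ratio,
    }


# Hypothetical table, for illustration only.
measures = ar_measures(0.4, 0.1, 0.2, 0.3)
print(measures)
```

For this made-up table both lift and the odds ratio exceed 1, so A and B would be read as having affinity to each other.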

slide-6
SLIDE 6

The Simplex

$\sum_{i=1}^{4} x_i = 1$, $0 \le x_i$, $i = 1, \dots, 4$.

         RHS    ^RHS
LHS      x1     x2
^LHS     x3     x4

A = LHS, B = RHS

slide-7
SLIDE 7

Relative Linkage Disequilibrium

$D = x_1 x_4 - x_2 x_3$

$RLD = \dfrac{D}{D + D_M}$

Independence: lift {A=>B} = 1, OR {A=>B} = 1

$X = f \otimes g + D\, e \otimes e$
(first term: independence; second term: dependence)

where $X = (x_1, x_2, x_3, x_4)$, $f = (f, 1-f)$, $g = (g, 1-g)$, $e = (1, -1)$.

         RHS    ^RHS   Total
LHS      x1     x2     g
^LHS     x3     x4     1-g
Total    f      1-f    1

Kenett, R.S. (1983). On an Exploratory Analysis of Contingency Tables. J R Stat Soc Series D, 32, 395-403.

Kenett, R.S. (2014). Frequency vectors and contingency tables: a nonparametric and graphical analysis. Girona Seminar, 27/11/14.
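The decomposition of the table into an independence part and a dependence part can be checked numerically. The sketch below assumes the table is stored as a 2x2 array with rows LHS, ^LHS and columns RHS, ^RHS, so that the independence term is the outer product of the row margins (g, 1-g) and the column margins (f, 1-f). That layout, the function name `decompose` and the example values are assumptions made for illustration.

```python
import numpy as np


def decompose(x1, x2, x3, x4):
    """Split a 2x2 table of proportions into independence + dependence parts."""
    table = np.array([[x1, x2], [x3, x4]], dtype=float)
    g = table.sum(axis=1)            # row margins (g, 1-g)
    f = table.sum(axis=0)            # column margins (f, 1-f)
    e = np.array([1.0, -1.0])
    d = x1 * x4 - x2 * x3            # D = x1*x4 - x2*x3
    independence = np.outer(g, f)    # table expected under independence
    dependence = d * np.outer(e, e)  # D * [[1, -1], [-1, 1]]
    return table, independence, dependence, d


# Hypothetical table, for illustration only.
table, independence, dependence, d = decompose(0.4, 0.1, 0.2, 0.3)
print(independence + dependence)     # reproduces the original table
```

When D = 0 the dependence part vanishes and the table equals the product of its margins, which is the independence case with lift = 1 and OR = 1.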

slide-8
SLIDE 8

CoDa Analysis and Principles

  • Multiplicative tools applied to CoDa are equivalent to classical additive (Euclidean) tools applied to log-ratio values
  • Transform CoDa, e.g. isometric log-ratio coordinates: ilr(x)
  • Scale invariance: vectors P = [p1,…,pD] and P’ = αP, α > 0, give the same information
  • Subcompositional coherence
slide-9
SLIDE 9
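[Figure: the simplex with parts x1, x2, x3 (raw data) and the corresponding real plane with axes ilr1 and ilr2]

Simplex: raw data (%)

$\mathbf{x} = (x_1, x_2, \dots, x_D)$

Real space: log-ratio coordinates (alr, clr, or ilr)

$\mathrm{clr}(\mathbf{x}) = \left( \log\frac{x_1}{g(\mathbf{x})}, \log\frac{x_2}{g(\mathbf{x})}, \dots, \log\frac{x_D}{g(\mathbf{x})} \right)$, where $g(\mathbf{x})$ is the geometric mean of the parts of $\mathbf{x}$ (not the margin g of the 2x2 table)

$\mathrm{ilr}(\mathbf{x}) = (\mathrm{ilr}_1(\mathbf{x}), \dots, \mathrm{ilr}_{D-1}(\mathbf{x}))$

Logratio (multiplicative) approach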
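A small sketch of the clr transformation, which also illustrates the scale invariance principle: rescaling a vector by any α > 0 leaves its clr coordinates unchanged. The helper names `closure` and `clr` and the example vector are illustrative assumptions.

```python
import numpy as np


def closure(x):
    """Rescale a vector of positive parts so that it sums to 1."""
    x = np.asarray(x, dtype=float)
    return x / x.sum()


def clr(x):
    """Centred log-ratio: log of each part over the geometric mean."""
    logx = np.log(np.asarray(x, dtype=float))
    return logx - logx.mean()        # log(x_i / g(x)) = log x_i - mean(log x)


# Hypothetical composition given in raw units, for illustration only.
p = np.array([10.0, 30.0, 60.0])
print(clr(p))
print(clr(5 * p))                    # same coordinates: scale invariance
print(clr(closure(p)))               # same coordinates after closure
```

The clr coordinates always sum to zero, which is why D parts need only D-1 ilr coordinates.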

slide-10
SLIDE 10

CoDa Analysis of 2×2 tables

$\mathrm{ilr}(T) = \left( \tfrac{1}{2} \ln\frac{x_1 x_4}{x_2 x_3},\ \tfrac{\sqrt{2}}{2} \ln\frac{x_1}{x_4},\ \tfrac{\sqrt{2}}{2} \ln\frac{x_2}{x_3} \right)$

Sequential Binary Partition (SBP), Pawlowsky-Glahn and Buccianti, 2011, Chapter 2.

Table T:

         RHS    ^RHS   Total
LHS      x1     x2     g
^LHS     x3     x4     1-g
Total    f      1-f    1

A = LHS, B = RHS, with $\sum_{i=1}^{4} x_i = 1$, $0 \le x_i$, $i = 1, \dots, 4$.

slide-11
SLIDE 11

CoDa Analysis of 2×2 tables

ilr-coordinates   ilr1                      ilr2                 ilr3
T                 (1/2) ln(x1x4/(x2x3))     (√2/2) ln(x1/x4)     (√2/2) ln(x2/x3)
Tind              0                         (√2/2) ln(x1/x4)     (√2/2) ln(x2/x3)
Tint              (1/2) ln(x1x4/(x2x3))     0                    0

Tind: independence table. Tint: interaction table, obtained by the perturbation operation subtracting table Tind from T.

ilr1(T) < 0: negative effect between itemsets (A true, B less likely true)
ilr1(T) = 0: independence
ilr1(T) > 0: positive effect (A true, B more likely true)
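The split of T into Tind and Tint can be sketched with an explicit orthonormal basis for the three ilr coordinates. The basis rows below are the clr coefficients implied by the ilr formulas of the slides; the names `BASIS`, `ilr`, `ilr_inv` and `split_table`, and the example table, are hypothetical and only for illustration. All cells are assumed strictly positive.

```python
import numpy as np

# Rows: clr coefficients of ilr1, ilr2, ilr3 for the parts (x1, x2, x3, x4).
BASIS = np.array([
    [0.5, -0.5, -0.5, 0.5],                     # ilr1: {x1, x4} vs {x2, x3}
    [np.sqrt(0.5), 0.0, 0.0, -np.sqrt(0.5)],    # ilr2: x1 vs x4
    [0.0, np.sqrt(0.5), -np.sqrt(0.5), 0.0],    # ilr3: x2 vs x3
])


def ilr(x):
    """ilr coordinates of a 2x2 table given as (x1, x2, x3, x4)."""
    return BASIS @ np.log(np.asarray(x, dtype=float))


def ilr_inv(coords):
    """Table of proportions (summing to 1) with the given ilr coordinates."""
    y = np.exp(np.asarray(coords, dtype=float) @ BASIS)
    return y / y.sum()


def split_table(x):
    """Return (Tind, Tint): keep ilr2, ilr3 for Tind and ilr1 for Tint."""
    c = ilr(x)
    t_ind = ilr_inv([0.0, c[1], c[2]])
    t_int = ilr_inv([c[0], 0.0, 0.0])
    return t_ind, t_int


# Hypothetical table, for illustration only.
x = np.array([0.4, 0.1, 0.2, 0.3])
t_ind, t_int = split_table(x)
print(ilr(x), ilr(t_ind), ilr(t_int))
```

In this sketch Tind has an odds ratio of 1, and dividing T by Tind cell by cell and rescaling to sum 1 (the perturbation difference) gives Tint.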

slide-12
SLIDE 12

CoDa Analysis of 2×2 tables

$X = f \otimes g + D\, e \otimes e$
(first term: independence; second term: interaction)

ilr(T) = ilr(Tind) + ilr(Tint).

Let $\|T\|_a = \|\mathrm{ilr}(T)\|$ be the Aitchison norm of a table T. Then $\|T\|_a^2 = \|T_{ind}\|_a^2 + \|T_{int}\|_a^2$, that is, one has a decomposition of the squared Aitchison norm of table T.

slide-13
SLIDE 13


slide-14
SLIDE 14

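CoDa Simplicial Deviance (SD)

$SD(T) = \|T_{int}\|_a^2 = \tfrac{1}{4} \ln^2\frac{x_1 x_4}{x_2 x_3} = \mathrm{ilr}_1^2(T)$

$\frac{x_1 x_4}{x_2 x_3} = 1 \iff \ln\frac{x_1 x_4}{x_2 x_3} = 0 \iff \mathrm{ilr}_1(T) = 0 \iff SD = 0$

$X = f \otimes g + D\, e \otimes e$
(first term: independence; second term: interaction)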

slide-15
SLIDE 15

CoDa Relative Simplicial Deviance (RSD)

$RSD(T) = \dfrac{SD(T)}{\|T\|_a^2} = \dfrac{\mathrm{ilr}_1^2(T)}{\|\mathrm{ilr}(T)\|^2}$

$RLD = \dfrac{D}{D + D_M}$

If D > 0 then
    if x3 < x2 then RLD = D/(D + x3) else RLD = D/(D + x2)
else
    if x1 < x4 then RLD = D/(D - x1) else RLD = D/(D - x4)

RSD takes values in the interval [0,1]
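SD, RSD and RLD can be computed side by side for a single table. The sketch below follows the formulas of the slides, with two assumptions that are not in the presentation: RSD is set to 0 for the uniform table (where the norm is 0), and RLD is set to 0 when D = 0. Function names and example values are hypothetical, and all cells are assumed strictly positive.

```python
import math


def sd_and_rsd(x1, x2, x3, x4):
    """Simplicial deviance SD = ilr1^2 and relative deviance RSD = SD / ||ilr(T)||^2."""
    ilr1 = 0.5 * math.log((x1 * x4) / (x2 * x3))
    ilr2 = math.sqrt(0.5) * math.log(x1 / x4)
    ilr3 = math.sqrt(0.5) * math.log(x2 / x3)
    sd = ilr1 ** 2
    norm2 = ilr1 ** 2 + ilr2 ** 2 + ilr3 ** 2
    rsd = sd / norm2 if norm2 > 0 else 0.0   # assumption: RSD := 0 when the norm is 0
    return sd, rsd


def rld(x1, x2, x3, x4):
    """Relative linkage disequilibrium, following the if/else rule of the slide."""
    d = x1 * x4 - x2 * x3
    if d == 0:
        return 0.0                           # assumption: RLD := 0 at independence
    if d > 0:
        return d / (d + min(x2, x3))
    return d / (d - min(x1, x4))


# Hypothetical tables, for illustration only.
positive = (0.4, 0.1, 0.2, 0.3)      # D > 0
negative = (0.1, 0.4, 0.3, 0.2)      # D < 0
for t in (positive, negative):
    print(sd_and_rsd(*t), rld(*t))
```

Both relative measures stay within [0, 1] in these examples, while SD itself is unbounded.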

slide-16
SLIDE 16

Bootstrap Algorithm

Egozcue et al. (2015) introduce a bootstrap algorithm consisting of the following steps:
i) Calculate Tind, Tint, SD and RSD.
ii) Simulate 10000 multinomial samples T(k), assuming the independence hypothesis H0: T = Tind is true. For each table T(k), calculate Tind(k), Tint(k), SD(k) and RSD(k).
iii) Compare the values of SD and RSD with the distributions of the 10000 values of SD(k) and RSD(k), respectively, to obtain the percentile p-value (left tail). Calculate the 0.05 significance critical points (5th percentile) in the left tail of each distribution.
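The three steps can be sketched as follows. This is a rough illustration under stated assumptions, not the implementation of Egozcue et al. (2015): the sample size `n` of each simulated table is not given on the slide and must be supplied by the user, a pseudo-count of 0.5 is added to every simulated cell to avoid logarithms of zero, and all function names are hypothetical.

```python
import numpy as np


def sd_rsd(x):
    """SD and RSD for one table (4,) or a batch of tables (k, 4)."""
    logx = np.log(np.atleast_2d(np.asarray(x, dtype=float)))
    ilr1 = 0.5 * (logx[:, 0] + logx[:, 3] - logx[:, 1] - logx[:, 2])
    ilr2 = np.sqrt(0.5) * (logx[:, 0] - logx[:, 3])
    ilr3 = np.sqrt(0.5) * (logx[:, 1] - logx[:, 2])
    sd = ilr1 ** 2
    norm2 = ilr1 ** 2 + ilr2 ** 2 + ilr3 ** 2
    # Assumption: RSD := 0 when the norm is 0 (uniform table).
    rsd = np.divide(sd, norm2, out=np.zeros_like(sd), where=norm2 > 0)
    return sd, rsd


def independent_part(x):
    """Tind: the table with ilr1 = 0 and the same ilr2, ilr3 as x."""
    x1, x2, x3, x4 = x
    y = np.sqrt(np.array([x1 / x4, x2 / x3, x3 / x2, x4 / x1]))
    return y / y.sum()


def bootstrap(x, n, n_sim=10000, seed=0):
    """Steps i) to iii); n is the assumed number of transactions per table."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    sd_obs, rsd_obs = (v[0] for v in sd_rsd(x))          # step i)
    t_ind = independent_part(x)
    counts = rng.multinomial(n, t_ind, size=n_sim)        # step ii), under H0
    shifted = counts + 0.5                                # assumption: pseudo-count
    sims = shifted / shifted.sum(axis=1, keepdims=True)
    sd_k, rsd_k = sd_rsd(sims)
    return {                                              # step iii), left tail
        "SD": sd_obs,
        "RSD": rsd_obs,
        "p_SD": float(np.mean(sd_k <= sd_obs)),
        "p_RSD": float(np.mean(rsd_k <= rsd_obs)),
        "crit_SD": float(np.quantile(sd_k, 0.05)),
        "crit_RSD": float(np.quantile(rsd_k, 0.05)),
    }


# Hypothetical table and sample size, for illustration only.
result = bootstrap([0.4, 0.1, 0.1, 0.4], n=1000)
print(result)
```

The left-tail proportions follow the wording of the slide. With this convention a strongly dependent table yields a proportion close to 1, so the reader should check the original paper for how the percentile is turned into a decision.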

slide-17
SLIDE 17

CoDa Measures for Association Rules

$\mathrm{lift}(AR) = \dfrac{x_1}{(x_1 + x_2)(x_1 + x_3)}$

$D(AR) = x_1 x_4 - x_2 x_3$

$\mathrm{lift}(AR) = 1 + \dfrac{D(AR)}{(x_1 + x_2)(x_1 + x_3)}$
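The link between lift and D on this slide follows from the constraint that the four cells x1, x2, x3, x4 sum to 1. A short derivation, written with x4 = 1 - x1 - x2 - x3:

```latex
\begin{align*}
x_1 - (x_1 + x_2)(x_1 + x_3)
  &= x_1 (1 - x_1 - x_2 - x_3) - x_2 x_3 \\
  &= x_1 x_4 - x_2 x_3 = D(AR), \\[4pt]
\mathrm{lift}(AR) = \frac{x_1}{(x_1 + x_2)(x_1 + x_3)}
  &= 1 + \frac{D(AR)}{(x_1 + x_2)(x_1 + x_3)}.
\end{align*}
```

So lift exceeds 1 exactly when D is positive, and equals 1 exactly when D = 0.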

slide-18
SLIDE 18

CoDa Measures for Association Rules

$OR^*(AR) = \text{Yule's } Q(AR) = \dfrac{x_1 x_4 - x_2 x_3}{x_1 x_4 + x_2 x_3}$

OR(AR) = odds(B | A) / odds(B | not A) = (x1x4)/(x2x3).

Lift(AR) = 1 ⟺ D(AR) = 0 ⟺ OR(AR) = 1

OR is defined in [0, +∞), OR* is defined in [-1,1]

slide-19
SLIDE 19

CoDa Measures for Association Rules

$C(AR) = \mathrm{ilr}_1(T)$

$C^*(AR) = \tanh(C(AR)) = OR^*(AR) = \text{Yule's } Q(AR)$

C is defined in (-∞, +∞), C* is defined in [-1,1]
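The equality between tanh of the first ilr coordinate and Yule's Q can be verified in a few lines, using ilr1(T) = (1/2) ln OR with OR = x1x4/(x2x3):

```latex
\begin{align*}
C(AR) &= \mathrm{ilr}_1(T) = \tfrac{1}{2} \ln \mathrm{OR}, \\
\tanh\bigl(C(AR)\bigr)
  &= \frac{e^{2C} - 1}{e^{2C} + 1}
   = \frac{\mathrm{OR} - 1}{\mathrm{OR} + 1}
   = \frac{x_1 x_4 - x_2 x_3}{x_1 x_4 + x_2 x_3}
   = \text{Yule's } Q(AR).
\end{align*}
```

The hyperbolic tangent therefore maps the unbounded coordinate C onto the bounded scale of Yule's Q, with 0 corresponding to independence on both scales.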

slide-20
SLIDE 20

CoDa Measures for Association Rules - C(AR)

slide-21
SLIDE 21

A Case Study


slide-22
SLIDE 22


https://treato.com

slide-23
SLIDE 23


https://treato.com/Nicardipine/?a=s

slide-24
SLIDE 24


slide-25
SLIDE 25


slide-26
SLIDE 26


Document Term Matrix (DTM)

slide-27
SLIDE 27


Association Rules Analysis

slide-28
SLIDE 28


Association Rules by consequent (lowest Lift)

Vasoconstriction

slide-29
SLIDE 29


Top 10 Association Rules by Lift (in red)

slide-30
SLIDE 30


Top 10 Association Rules by Lift (in red)

slide-31
SLIDE 31

CoDa AR Visualization: ilr plot by consequent item

[Figure: ilr.1 (interaction) by consequent item; axes: ilr.1 and frequency]
slide-32
SLIDE 32

CoDa AR Visualization: ilr plot

interaction

slide-33
SLIDE 33

CoDa AR Visualization: clr plot by consequent item

interaction

slide-34
SLIDE 34

Conclusions

  • Compositional measures of independence, SD and RSD, are coherent with the simplicial geometry of the simplex, the sample space of contingency tables of AR.
  • The relation between CoDa-AR measures and other common measures facilitates the interpretation of negative and positive effects between itemsets.
  • The CoDa geometry provides visualization techniques for the measures when all the significant AR of a large database are analyzed.
  • The principles of coherence and scalability, which are fundamental to CoDa, are relevant to big data text analysis.
  • More research in this area is needed.

slide-35
SLIDE 35

Acknowledgements


https://www.kaggle.com/c/instacart-market-basket-analysis

slide-36
SLIDE 36

Thank you for your attention