

Slide 1

Chapter 4: Frequent Itemsets and Association Rules
Jilles Vreeken
5 Nov 2015

Revision 1, November 9th: notation clarified, chi-square clarified
Revision 2, November 10th: details added to derivability example
Revision 3, November 12th: typo fixed in Pearson correlation

Slide 2

Recall the Question of the Week

How can we mine interesting patterns and useful rules from data?

Slide 3

IRDM Chapter 4, today

1. Definitions
2. Algorithms for Frequent Itemset Mining
3. Association Rules and Interestingness
4. Summarising Collections of Itemsets

You’ll find this covered in Aggarwal, Chapter 4, 5.2; Zaki & Meira, Ch. 10, 11

Slide 4

Chapter 4.3: Association Rules

Slide 5

IRDM Chapter 4.3

1. Generating Association Rules
2. Measures of Interestingness
3. Properties of Measures
4. Simpson’s Paradox

You’ll find this covered in Aggarwal, Chapter 4; Zaki & Meira, Ch. 10

Slide 6

Generating Association Rules

We can generate association rules from frequent itemsets

 if $Z$ is a frequent itemset and $X \subset Z$ is a proper subset, we have the rule $X \to Y$, where $Y = Z \setminus X$

These rules are frequent, because $supp(X \to Y) = supp(X \cup Y) = supp(Z)$

 we still need to compute the confidence, as $conf(X \to Y) = \frac{supp(Z)}{supp(X)}$

This means that if rule $X \to Z \setminus X$ is not confident, no rule $W \to Z \setminus W$ with $W \subseteq X$ is confident

 we can use this to prune the search space

Slide 7

Pseudo-code

(Algorithm 8.6 in Zaki & Meira)
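The algorithm itself appears only as an image in the original deck. As a stand-in, here is a minimal Python sketch of level-wise rule generation with confidence-based pruning in the spirit of Algorithm 8.6; the `supports` dictionary (mapping each frequent itemset to its support) is an assumed input, e.g. the output of Apriori.

```python
from itertools import combinations

def generate_rules(supports, minconf):
    """Generate all confident rules X -> Z\\X from frequent itemsets.

    supports: dict mapping frozenset -> support, containing every
    frequent itemset; minconf: minimum confidence threshold.
    """
    rules = []
    for Z in supports:
        if len(Z) < 2:
            continue
        # level-wise over antecedents, largest first
        frontier = {frozenset(X) for X in combinations(Z, len(Z) - 1)}
        while frontier:
            next_frontier = set()
            for X in frontier:
                conf = supports[Z] / supports[X]
                if conf >= minconf:
                    rules.append((set(X), set(Z - X), conf))
                    # prune: if X -> Z\X is not confident, no rule
                    # W -> Z\W with W ⊆ X can be confident either
                    if len(X) > 1:
                        next_frontier.update(
                            frozenset(W) for W in combinations(X, len(X) - 1))
            frontier = next_frontier
    return rules

# toy supports (illustrative numbers, not from the deck)
supports = {frozenset('A'): 4, frozenset('B'): 6, frozenset('AB'): 4}
print(generate_rules(supports, minconf=0.8))  # [({'A'}, {'B'}, 1.0)]
```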

Slide 8

Measures of Interestingness

Consider the following example: rule {Tea} → {Coffee} has 15% support and 75% confidence

 reasonably good numbers

Is this a good rule?

 the overall fraction of coffee drinkers is 80%, so drinking tea reduces the probability of drinking coffee!

          Coffee   Not Coffee      ∑
 Tea         150           50    200
 Not Tea     650          150    800
 ∑           800          200   1000

Slide 9

Problems with confidence

The support-confidence framework does not take the support of the consequent into account

 rules with relatively small support for the antecedent and high support for the consequent often have high confidence

To fix this, many other measures have been proposed. Most measures are easy to express using contingency tables. We’ll use $s_{jk}$ as shorthand for support: $s_{11} = supp(AB)$, $s_{01} = supp(\neg A\,B)$, … Analogously, we write $f_{jk}$ for frequency: $f_{11} = fr(AB)$, $f_{01} = fr(\neg A\,B)$, …

        B     ¬B      ∑
 A     s11   s10   s1+
 ¬A    s01   s00   s0+
 ∑     s+1   s+0     N

(revised on Nov 9th, now using $s_{jk}$ notation to more clearly indicate support)

Slide 10

Statistical Coefficient of Correlation

A natural statistical measure between a pair of items is the Pearson correlation coefficient

$$\rho = \frac{E[XY] - E[X]\,E[Y]}{\sigma_X\,\sigma_Y} = \frac{E[XY] - E[X]\,E[Y]}{\sqrt{E[X^2] - E[X]^2}\,\sqrt{E[Y^2] - E[Y]^2}}$$

Slide 11

Pearson Correlation of Items

For items $A$ and $B$ it reduces to

$$\rho_{AB} = \frac{f_{11} - f_{1+}\,f_{+1}}{\sqrt{f_{1+}\,f_{+1}\,(1 - f_{1+})(1 - f_{+1})}}$$

It is +1 when the data is perfectly positively correlated, −1 when perfectly negatively correlated, and 0 when uncorrelated.

(revised on November 12th; typo fixed, as $f_{11}$ should be inside the numerator)
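As a quick sanity check (the helper below is mine, not from the deck), we can plug the tea/coffee table from slide 8 into this formula:

```python
import math

def pearson_items(s11, s1p, sp1, n):
    """Pearson correlation of two items from a 2x2 contingency table,
    using frequencies f = s / N."""
    f11, f1p, fp1 = s11 / n, s1p / n, sp1 / n
    return (f11 - f1p * fp1) / math.sqrt(f1p * fp1 * (1 - f1p) * (1 - fp1))

# Tea/Coffee: s11 = 150, s1+ = 200, s+1 = 800, N = 1000
print(pearson_items(150, 200, 800, 1000))  # -0.0625: slight negative correlation
```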

Slide 12

Chi-square

$\chi^2$ is another natural statistical measure of significance for itemsets. For a set of $k$ items, it compares the observed frequencies against the expected frequencies of all $2^k$ possible states:

$$\chi^2(X) = \sum_{Z \in \mathcal{P}(X)} \frac{\left(fr(Z) - E_X[fr(Z)]\right)^2}{E_X[fr(Z)]}$$

where $\mathcal{P}(X)$ is the powerset of $X$ and $E_X[fr(Z)]$ is the expected frequency of state $Z$ over itemset $X$.

For example, for $X = \{\text{beer}, \text{diapers}\}$ it considers the states $\{\text{beer}, \text{diapers}\}$, $\{\neg\text{beer}, \text{diapers}\}$, $\{\text{beer}, \neg\text{diapers}\}$, and $\{\neg\text{beer}, \neg\text{diapers}\}$.

(Brin et al. 1998, 1.6k+ cites)
(revised on Nov 9th, now using $E_X[fr(Z)]$ to more clearly indicate the expectation is of state $Z$ over itemset $X$)

Slide 13

Chi-square (2)

To compute $\chi^2(X)$ we need to define $E_X[fr(Z)]$. The standard way is to assume independence between the items of $X$; that is, the probability of a state $Z$ is the product of its individual item frequencies:

$$E_X[fr(Z)] = \prod_{A \in Z} fr(A) \prod_{A \in X \setminus Z} \left(1 - fr(A)\right)$$

The first product is over the items that are present in $Z$ (the 1s); for these, the empirical probability is simply $fr(\cdot)$. The second product considers the 0s of $Z$, in other words the items of $X$ not in $Z$; the empirical probability of not seeing an item $A$ is $(1 - fr(A))$.

Note! Independence between items is a very strong assumption, and hence we will find that many itemsets are ‘significantly’ correlated.

(revised on Nov 9th, now using $E_X[fr(Z)]$ notation, added more explanation)
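To make the definition concrete, here is a small illustrative Python sketch (the transaction representation and helper names are assumptions of mine, not from the deck):

```python
from itertools import combinations

def chi_square(itemset, transactions):
    """Chi-square of an itemset against the item-independence model.

    transactions: list of sets of items; frequencies are empirical,
    following the slide's definition (sum over all 2^k states).
    """
    n = len(transactions)
    fr = lambda item: sum(1 for t in transactions if item in t) / n

    items = list(itemset)
    score = 0.0
    for r in range(len(items) + 1):
        for present in combinations(items, r):
            absent = [a for a in items if a not in present]
            # observed frequency of the state: 1s present, 0s absent
            observed = sum(1 for t in transactions
                           if all(a in t for a in present)
                           and not any(a in t for a in absent)) / n
            # expected frequency assuming independent items
            expected = 1.0
            for a in present:
                expected *= fr(a)
            for a in absent:
                expected *= 1 - fr(a)
            if expected > 0:
                score += (observed - expected) ** 2 / expected
    return score

transactions = [{'beer', 'diapers'}, {'beer'}, {'diapers'}, {'beer', 'diapers'}]
print(chi_square({'beer', 'diapers'}, transactions))
```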

Slide 14

Chi-square (3)

$$\chi^2(X) = \sum_{Z \in \mathcal{P}(X)} \frac{\left(fr(Z) - E_X[fr(Z)]\right)^2}{E_X[fr(Z)]}$$

Chi-square scores close to 0 indicate statistical independence, while larger values indicate stronger dependencies.

 no differentiation between positive and negative correlation
 it is computationally costly, at $O(2^{|X|})$
 but as it is upward closed, we can mine interesting sets efficiently

Always be thoughtful of how you define your expected frequency!

(revised on Nov 9th, now using $E_X[fr(Z)]$ notation, added more explanation)

Slide 15

Interest Ratio

The interest ratio $I$ of rule $A \to B$ is

$$I(A, B) = \frac{N \times supp(AB)}{supp(A) \times supp(B)} = \frac{N\,s_{11}}{s_{1+}\,s_{+1}}$$

 it is equivalent to $\text{lift} = \frac{conf(A \to B)}{fr(B)}$

The interest ratio compares the frequencies against the assumption that $A$ and $B$ are independent

 if $A$ and $B$ are independent, $s_{11} = \frac{s_{1+}\,s_{+1}}{N}$

Interpreting interest ratios

 $I(A, B) = 1$ if $A$ and $B$ are independent
 $I(A, B) > 1$ if $A$ and $B$ are positively correlated
 $I(A, B) < 1$ if $A$ and $B$ are negatively correlated

($f_{jk}$ changed into $s_{jk}$ in revision 1)

Slide 16

The Cosine Measure

The cosine, or $IS$, measure of rule $A \to B$ is defined as

$$cosine(A, B) = \sqrt{I(A, B) \times supp(AB)/N} = \frac{s_{11}}{\sqrt{s_{1+} \times s_{+1}}}$$

which is the regular cosine if we think of $A$ and $B$ as binary vectors.

It is also the geometric mean of the confidences of $A \to B$ and $B \to A$, as

$$cosine(A, B) = \sqrt{\frac{supp(AB)}{supp(A)} \times \frac{supp(AB)}{supp(B)}} = \sqrt{conf(A \to B) \times conf(B \to A)}$$

($f_{jk}$ changed into $s_{jk}$ in revision 1)

Slide 17

Examples (1)

The interest ratio of {Tea} → {Coffee} is

$$\frac{1000 \times 150}{800 \times 200} = 0.9375$$

 almost 1, so not very interesting; below 1, so (slight) negative correlation

The $cosine$ of this rule, however, is 0.375

 quite far from 0, so it is interesting.

          Coffee   Not Coffee      ∑
 Tea         150           50    200
 Not Tea     650          150    800
 ∑           800          200   1000
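The two numbers on this slide are easy to reproduce; a small sketch (helper names are mine):

```python
import math

def interest(s11, s1p, sp1, n):
    """Interest ratio I(A,B) = N*s11 / (s1+ * s+1)."""
    return n * s11 / (s1p * sp1)

def cosine(s11, s1p, sp1):
    """Cosine measure s11 / sqrt(s1+ * s+1)."""
    return s11 / math.sqrt(s1p * sp1)

# Tea -> Coffee: s11 = 150, s1+ = 200, s+1 = 800, N = 1000
print(interest(150, 200, 800, 1000))  # 0.9375
print(cosine(150, 200, 800))          # 0.375
```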

Slide 18

Examples (2)

$I(p, q) = 1.02$ and $I(r, t) = 4.08$

 $p$ and $q$ are close to independent
 $r$ and $t$ have the highest interest factor

Now $conf(p \to q) = 0.946$ and $conf(r \to t) = 0.286$

        p    ¬p     ∑
 q     880    50   930
 ¬q     50    20    70
 ∑     930    70  1000

        r    ¬r     ∑
 t      20    50    70
 ¬t     50   880   930
 ∑      70   930  1000

But $p$ and $q$ appear together in 88% of cases, while $r$ and $t$ appear together only seldom

(revised on Nov 9th, now using $t$ instead of $s$ to avoid confusion with support notation)

Slide 19

Examples (2), continued

Bottom line: Lunch is not free. There is no single measure that works well all the time.

Slide 20

Measures for pairs of itemsets

(after Tan, Steinbach, Kumar, Table 6.12)
(revised on Nov 9th, now using $s_{jk}$ notation to more clearly indicate support)

 Measure (symbol)            Definition
 Correlation ($\phi$)        $(N s_{11} - s_{1+} s_{+1}) \,/\, \sqrt{s_{1+} s_{+1} s_{0+} s_{+0}}$
 Odds ratio ($\alpha$)       $(s_{11} s_{00}) \,/\, (s_{10} s_{01})$
 Kappa ($\kappa$)            $(N s_{11} + N s_{00} - s_{1+} s_{+1} - s_{0+} s_{+0}) \,/\, (N^2 - s_{1+} s_{+1} - s_{0+} s_{+0})$
 Interest ($I$)              $(N s_{11}) \,/\, (s_{1+} s_{+1})$
 Cosine ($IS$)               $s_{11} \,/\, \sqrt{s_{1+} s_{+1}}$
 Piatetsky-Shapiro ($PS$)    $s_{11}/N - (s_{1+} s_{+1})/N^2$
 Collective strength ($S$)   $\frac{s_{11} + s_{00}}{s_{1+} s_{+1} + s_{0+} s_{+0}} \times \frac{N - s_{1+} s_{+1} - s_{0+} s_{+0}}{N - s_{11} - s_{00}}$
 Jaccard ($\zeta$)           $s_{11} \,/\, (s_{1+} + s_{+1} - s_{11})$
 All-confidence ($h$)        $\min\!\left(s_{11}/s_{1+},\; s_{11}/s_{+1}\right)$
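Most of these definitions are one-liners; a sketch (function and key names are mine) that computes a few of them from the four cells of a 2×2 contingency table:

```python
import math

def pair_measures(s11, s10, s01, s00):
    """A handful of the pair measures, from a 2x2 contingency table."""
    n = s11 + s10 + s01 + s00
    s1p, s0p = s11 + s10, s01 + s00   # row sums
    sp1, sp0 = s11 + s01, s10 + s00   # column sums
    return {
        'correlation':       (n * s11 - s1p * sp1) / math.sqrt(s1p * sp1 * s0p * sp0),
        'odds_ratio':        (s11 * s00) / (s10 * s01),
        'interest':          n * s11 / (s1p * sp1),
        'cosine':            s11 / math.sqrt(s1p * sp1),
        'piatetsky_shapiro': s11 / n - s1p * sp1 / n ** 2,
        'jaccard':           s11 / (s1p + sp1 - s11),
        'all_confidence':    min(s11 / s1p, s11 / sp1),
    }

# Tea/Coffee table: s11=150, s10=50, s01=650, s00=150
print(pair_measures(150, 50, 650, 150))
```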

Slide 21

Measures for association rules

(after Tan, Steinbach, Kumar, Table 6.12)
(revised on Nov 9th, now using $s_{jk}$ notation to more clearly indicate support)

 Measure (symbol)             Definition
 Goodman-Kruskal ($\lambda$)  $\left(\sum_j \max_k s_{jk} - \max_k s_{+k}\right) / \left(N - \max_k s_{+k}\right)$
 Mutual information ($M$)     $\left(\sum_i \sum_j \frac{s_{ij}}{N} \log \frac{N s_{ij}}{s_{i+} s_{+j}}\right) / \left(-\sum_i \frac{s_{i+}}{N} \log \frac{s_{i+}}{N}\right)$
 J-measure ($J$)              $\frac{s_{11}}{N} \log \frac{N s_{11}}{s_{1+} s_{+1}} + \frac{s_{10}}{N} \log \frac{N s_{10}}{s_{1+} s_{+0}}$
 Gini index ($G$)             $\frac{s_{1+}}{N} \left[\left(\frac{s_{11}}{s_{1+}}\right)^2 + \left(\frac{s_{10}}{s_{1+}}\right)^2\right] - \left(\frac{s_{+1}}{N}\right)^2 + \frac{s_{0+}}{N} \left[\left(\frac{s_{01}}{s_{0+}}\right)^2 + \left(\frac{s_{00}}{s_{0+}}\right)^2\right] - \left(\frac{s_{+0}}{N}\right)^2$
 Laplace ($L$)                $(s_{11} + 1) \,/\, (s_{1+} + 2)$
 Conviction ($V$)             $(s_{1+} s_{+0}) \,/\, (N s_{10})$
 Certainty factor ($F$)       $\left(\frac{s_{11}}{s_{1+}} - \frac{s_{+1}}{N}\right) / \left(1 - \frac{s_{+1}}{N}\right)$
 Added value ($AV$)           $s_{11}/s_{1+} - s_{+1}/N$

Slide 22

Properties of Measures

Most measures do not agree on how they rank itemset pairs or rules. To understand how they behave, we need to study their properties

 measures that share some properties behave similarly under that property’s conditions

Slide 23

Three properties

A measure has the inversion property if its value stays the same when we exchange $s_{11}$ with $s_{00}$ and $s_{01}$ with $s_{10}$

 the measure is invariant to flipping bits – it is bit-symmetric

A measure has the null addition property if it is not affected by increasing $s_{00}$ while the other values stay constant

 the measure is invariant to adding new transactions that have an empty intersection with the itemset

A measure has the scaling invariance property if it is not affected by replacing the values $s_{11}, s_{10}, s_{01}, s_{00}$ with $k_1 k_3 s_{11},\ k_2 k_3 s_{10},\ k_1 k_4 s_{01},\ k_2 k_4 s_{00}$

 where all $k_i$ are positive constants

(revised on Nov 9th, now using $s_{jk}$ notation to more clearly indicate support)
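These properties are easy to probe numerically. A small sketch using the tea/coffee table: cosine never touches $s_{00}$ and thus has the null addition property, while the interest ratio does not.

```python
import math

def cosine(s11, s10, s01, s00):
    return s11 / math.sqrt((s11 + s10) * (s11 + s01))

def interest(s11, s10, s01, s00):
    n = s11 + s10 + s01 + s00
    return n * s11 / ((s11 + s10) * (s11 + s01))

base   = (150, 50, 650, 150)          # tea/coffee table
nulled = (150, 50, 650, 150 + 9000)   # null addition: only s00 grows

print(cosine(*base), cosine(*nulled))      # 0.375 0.375 -> unaffected
print(interest(*base), interest(*nulled))  # 0.9375 9.375 -> affected
```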

Slide 24

Which properties hold?

(Tan, Steinbach, Kumar, Table 6.17)

Slide 25

Simpson’s Paradox

Consider this data on sales of HDTVs and exercise machines

{HDTV} → {Exercise machine} has confidence 0.55
{¬HDTV} → {Exercise machine} has confidence 0.45

Customers who buy HDTVs are more likely to also buy an exercise machine than those who don’t buy HDTVs

            Exercise Machine   No Exercise Machine     ∑
 HDTV              99                   81            180
 No HDTV           54                   66            120
 ∑                153                  147            300

Slide 26

Deeper Analysis

For college students

 $conf(\text{HDTV} \to \text{Exerc. mach.}) = 0.10$
 $conf(\neg\text{HDTV} \to \text{Exerc. mach.}) = 0.118$

For working adults

 $conf(\text{HDTV} \to \text{Exerc. mach.}) = 0.577$
 $conf(\neg\text{HDTV} \to \text{Exerc. mach.}) = 0.581$

                    Exerc. mach.
 Group     HDTV     Yes    No     ∑
 College   Yes        1     9    10
           No         4    30    34
 Working   Yes       98    72   170
           No        50    36    86

Within each group, buying an exercise machine is not made more likely by buying an HDTV!
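A short sketch that reproduces the reversal (the data layout is mine; the numbers are the slide's):

```python
def conf(with_consequent, antecedent_total):
    """Confidence of X -> Y: supp(XY) / supp(X)."""
    return with_consequent / antecedent_total

# per group: (HDTV & ex.mach., all HDTV, no-HDTV & ex.mach., all no-HDTV)
groups = {'college': (1, 10, 4, 34), 'working': (98, 170, 50, 86)}

# within each stratum, HDTV buyers are LESS likely to buy an exercise machine
for name, (a, at, b, bt) in groups.items():
    print(name, round(conf(a, at), 3), '<', round(conf(b, bt), 3))

# pooled over both strata, the inequality flips: 0.55 > 0.45
a  = sum(g[0] for g in groups.values())
at = sum(g[1] for g in groups.values())
b  = sum(g[2] for g in groups.values())
bt = sum(g[3] for g in groups.values())
print('pooled', conf(a, at), '>', conf(b, bt))
```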

Slide 27

The paradox, and why it happens

In the combined data, HDTVs and exercise machines correlate positively. In the stratified data, they correlate negatively.

 this is Simpson’s paradox

The explanation

 most customers were working adults
 they also bought most HDTVs and exercise machines
 in the combined data this increased the correlation between HDTVs and exercise machines

Moral of the story: Stratify your data properly!

Slide 28

Chapter 4.4: Summarising Collections of Itemsets

Slide 29

IRDM Chapter 4.4

1. The Pattern Explosion
2. Maximal and closed frequent itemsets
3. Non-derivable frequent itemsets

You’ll find this covered in Aggarwal, Chapter 5.2; Zaki & Meira, Ch. 11 (non-derivable only here)

Slide 30

The Pattern Flood

Consider the following table (items A–H):

 tid 1:  ✔ ✔ ✔ ✔ ✔
 tid 2:  ✔ ✔ ✔ ✔ ✔ ✔
 tid 3:  ✔ ✔ ✔ ✔ ✔ ✔
 tid 4:  ✔ ✔ ✔ ✔ ✔ ✔
 tid 5:  ✔ ✔ ✔ ✔ ✔
 tid 6:  ✔ ✔ ✔ ✔ ✔
 tid 7:  ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔

How many itemsets with minimum frequency of 1/7?

 255 (!) – transaction 7 contains all eight items, so every one of the $2^8 - 1 = 255$ non-empty itemsets occurs at least once

How many with minimum frequency of 1/2?

 31 (!)

“The goal of data mining is … to summarize the data”

 Hardly a summary!

Slide 31

The Pattern Explosion

This phenomenon is called the pattern explosion. For high thresholds you find only few patterns

 they only describe common knowledge

For lower thresholds you find enormously many patterns

 all potentially interesting
 many represent noise, and many will be highly redundant
 orders of magnitude more patterns than there are rows in the data

Slide 32

Curbing the Explosion

There exist two main approaches

 frequent pattern summarisation
  summarise the complete set of frequent patterns
  impose a stricter local criterion for individual patterns that removes locally redundant patterns, e.g. closed frequent, maximal frequent
  mine all patterns that satisfy this criterion

 pattern set mining
  improve by imposing a global criterion on the complete result, e.g. shortest description of the data, minimal overlap, maximal entropy
  mine that set of patterns that is optimal with regard to this criterion
  this way we can globally control noise and redundancy

Slide 33

Maximal frequent itemsets

Let $\mathcal{F}$ be the collection of all frequent itemsets for data $D$. Itemset $X \in \mathcal{F}$ is maximal if it has no frequent supersets

 i.e. for all $Z \supset X$: $fr(Z) < minfreq$

With the set of all maximal frequent itemsets we can reconstruct all elements of $\mathcal{F}$

 $X$ is frequent if and only if there exists a maximal frequent itemset $M$ such that $X \subseteq M$
 this is a lossy representation: it does not tell us what the frequency of $X$ is

(Bayardo, 1998, 1.7k cites)

Slide 34

Example of maximal frequent itemsets

[Figure: itemset lattice with the maximal frequent itemsets highlighted; one frequent itemset is marked as not maximal because one of its supersets is frequent]

Slide 35

Closed frequent itemsets

Let $\mathcal{F}$ be the collection of all frequent itemsets for data $D$. Itemset $X \in \mathcal{F}$ is closed if all its supersets are less frequent

 i.e. for all $Z \supset X$: $fr(Z) < fr(X)$
 all maximal itemsets are also closed itemsets

Given the set of all frequent closed itemsets, we can reconstruct all elements of $\mathcal{F}$, including their frequency

 $X$ is frequent if it is a subset of a frequent closed itemset
 $supp(X) = \max\{supp(Z) : X \subseteq Z,\ Z \text{ is frequent and closed}\}$

(Pasquier et al. 1999, 1.5k cites)

Slide 36

Why “closed”?

Consider the following functions

 $t(X)$ returns all transactions that contain itemset $X$
 $i(T)$ returns all items that are contained in every transaction in $T$

The closure function $c(X)$ maps itemsets to itemsets by

$$c(X) = i \circ t\,(X) = i(t(X))$$

The closure function satisfies the following properties

 extensive: $X \subseteq c(X)$
 monotonic: if $X \subseteq Y$, then $c(X) \subseteq c(Y)$
 idempotent: $c(c(X)) = c(X)$

Itemset $X$ is closed if and only if $X = c(X)$
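The two functions and the closure take only a few lines; a sketch on toy data of my own:

```python
def t(X, transactions):
    """Ids of all transactions that contain every item of X."""
    return {tid for tid, items in transactions.items() if X <= items}

def i(tids, transactions):
    """Items contained in every one of the given transactions."""
    sets = [transactions[tid] for tid in tids]
    return set.intersection(*sets) if sets else set()

def closure(X, transactions):
    """c(X) = i(t(X))."""
    return i(t(X, transactions), transactions)

transactions = {1: {'a', 'b', 'c'}, 2: {'a', 'b', 'c'}, 3: {'b', 'c'}}
print(closure({'a', 'b'}, transactions))  # {'a','b','c'}: {a,b} is not closed
print(closure({'b', 'c'}, transactions))  # {'b','c'}: closed, c(X) = X
```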

Slide 37

Example of closed frequent itemsets

[Figure: itemset lattice over example transactions; itemset {a, b} is contained in transactions 1 and 2; some itemsets are marked “closed and maximal”, others “closed, but not maximal”]

Slide 38

Itemset taxonomy

[Figure: nested collections, with maximal frequent itemsets inside closed frequent itemsets inside all frequent itemsets]

Slide 39

Mining maximal and closed itemsets

Frequent maximal and closed itemsets can be found by post-processing the set of all frequent itemsets.

To find the maximal itemsets:

 start with an empty set of candidate maximal itemsets $\mathcal{M}$
 for each frequent itemset $X \in \mathcal{F}$
  if a superset of $X$ is in $\mathcal{M}$, continue
  else insert $X$ into $\mathcal{M}$ and remove all subsets of $X$ from $\mathcal{M}$
 return set $\mathcal{M}$
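This post-processing loop translates directly to code; a sketch (the function name is mine):

```python
def maximal_itemsets(frequent):
    """Filter a collection of frequent itemsets down to the maximal ones."""
    maximal = set()
    for X in map(frozenset, frequent):
        if any(X < M for M in maximal):
            continue                                  # a superset is already kept
        maximal = {M for M in maximal if not M < X}   # drop subsets of X
        maximal.add(X)
    return maximal

frequent = [{'a'}, {'b'}, {'c'}, {'a', 'b'}, {'b', 'c'}, {'a', 'b', 'c'}, {'d'}]
print(maximal_itemsets(frequent))
# {frozenset({'a', 'b', 'c'}), frozenset({'d'})}
```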

Slide 40

Mining maximal and closed itemsets

Closed itemsets can be found from the frequent itemsets by computing their closures

 this can be very time consuming

The Charm algorithm avoids testing all frequent itemsets by using the following properties

 if $t(X) = t(Y)$, then $c(X) = c(Y) = c(X \cup Y)$
  we can replace $X$ with $X \cup Y$ and prune $Y$
 if $t(X) \subset t(Y)$, then $c(X) \neq c(Y)$, but $c(X) = c(X \cup Y)$
  we can replace $X$ with $X \cup Y$, but not prune $Y$
 if $t(X) \neq t(Y)$, then $c(X) \neq c(Y) \neq c(X \cup Y)$
  we cannot prune anything

(Zaki et al, 1999, 194 cites)

Slide 41

Non-derivable frequent itemsets

Let $\mathcal{F}$ be the set of all frequent itemsets. Itemset $X \in \mathcal{F}$ is non-derivable if we cannot derive its support from its subsets

 we can derive the support of $X$ if, knowing the supports of all subsets of $X$, we can compute the support of $X$ exactly

If $X$ is derivable, it does not add any new information

 knowing just the non-derivable frequent itemsets, we can reconstruct every frequent itemset, including its frequency
 we only return itemsets that add new information on top of what we already knew

(Calders & Goethals, 2004, 121 citations)

Slide 42

Support of a generalised itemset

A generalised itemset is an itemset of the form $X\bar{Y}$

 all items in $X$ and none of the items in $Y$

The support of a generalised itemset $X\bar{Y}$ is the number of transactions that contain all items in $X$ but no items in $Y$.

To compute the support of the generalised itemset $A\bar{B}\bar{C}$ we

 take the support of $A$
 remove the supports of $AB$ and $AC$
 add the support of $ABC$, which was removed twice
 $supp(A\bar{B}\bar{C}) = supp(A) - supp(AB) - supp(AC) + supp(ABC)$

Slide 43

Generalised Itemsets

[Figure: Venn diagram over items A, B, C, showing the eight states $ABC$, $AB\bar{C}$, $A\bar{B}C$, $A\bar{B}\bar{C}$, $\bar{A}BC$, $\bar{A}B\bar{C}$, $\bar{A}\bar{B}C$, $\bar{A}\bar{B}\bar{C}$]

Slide 44

The Inclusion-Exclusion Principle

Let $X\bar{Y}$ be a generalised itemset and let $I = X \cup Y$. Then $supp(X\bar{Y})$ can be expressed as a combination of the supports of itemsets $J \supseteq X$ with $J \subseteq I$, using the inclusion-exclusion principle:

$$supp(X\bar{Y}) = \sum_{X \subseteq J \subseteq I} (-1)^{|J \setminus X|}\, supp(J)$$

For example,

$$supp(\bar{A}\bar{B}\bar{C}) = supp(\emptyset) - supp(A) - supp(B) - supp(C) + supp(AB) + supp(AC) + supp(BC) - supp(ABC)$$
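The formula is easy to check mechanically; a sketch (toy data and helper names are mine) that compares it against a direct count:

```python
from itertools import combinations

def supp(items, transactions):
    """Plain support: number of transactions containing all of `items`."""
    return sum(1 for t in transactions if set(items) <= t)

def generalised_supp(X, Y, transactions):
    """supp(X Ybar) via inclusion-exclusion over subsets of Y."""
    total = 0
    for r in range(len(Y) + 1):
        for extra in combinations(sorted(Y), r):
            total += (-1) ** r * supp(set(X) | set(extra), transactions)
    return total

transactions = [{'a', 'b'}, {'a'}, {'a', 'c'}, {'b', 'c'}, {'a', 'b', 'c'}]
X, Y = {'a'}, {'b', 'c'}
direct = sum(1 for t in transactions if X <= t and not (Y & t))
print(generalised_supp(X, Y, transactions), direct)  # 1 1
```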

Slide 45

Support Bounds

The inclusion-exclusion formula gives us bounds for the support of itemsets in $X \cup Y$ that are supersets of $X$

 all supports are non-negative!
 $supp(A\bar{B}\bar{C}) = supp(A) - supp(AB) - supp(AC) + supp(ABC) \geq 0$ implies $supp(ABC) \geq -supp(A) + supp(AB) + supp(AC)$
 this is a lower bound, but we can also get upper bounds

In general, the bounds for itemset $I$ w.r.t. $X \subseteq I$ are

 if $|I \setminus X|$ is odd: $\; supp(I) \leq \sum_{X \subseteq J \subset I} (-1)^{|I \setminus J| + 1}\, supp(J)$
 if $|I \setminus X|$ is even: $\; supp(I) \geq \sum_{X \subseteq J \subset I} (-1)^{|I \setminus J| + 1}\, supp(J)$

Slide 46

Deriving the Support

Given the formula for the bounds, we can define

 the least upper bound $ub(I)$, and
 the greatest lower bound $lb(I)$ for itemset $I$

We know that $supp(I) \in [lb(I), ub(I)]$. If $lb(I) = ub(I)$, then we can compute $supp(I)$ knowing only the supports of the subsets of $I$

 we say $I$ is derivable
 otherwise, $I$ is non-derivable

Slide 47

Example deriving support – blackboard

Question: is itemset $ABC$ derivable?

 tid   A  B  C  D  E
  1    1  1  1  1  ·
  2    ·  1  1  1  ·
  3    1  1  ·  1  1
  4    1  1  ·  1  1
  5    1  1  1  1  1
  6    ·  1  1  ·  1

Slide 48

Example deriving support – blackboard

Lower bounds:

$$supp(ABC) \geq 0$$
$$supp(ABC) \geq s_{AB} + s_{AC} - s_{A} = 4 + 2 - 4 = 2$$
$$supp(ABC) \geq s_{AC} + s_{BC} - s_{C} = 2 + 4 - 4 = 2$$
$$supp(ABC) \geq s_{AB} + s_{BC} - s_{B} = 4 + 4 - 6 = 2$$
$$lb(ABC) = \max\{2, 2, 2, 0\} = 2$$

Upper bounds:

$$supp(ABC) \leq s_{AB} = 4, \quad supp(ABC) \leq s_{AC} = 2, \quad supp(ABC) \leq s_{BC} = 4$$
$$supp(ABC) \leq s_{AB} + s_{AC} + s_{BC} - s_{A} - s_{B} - s_{C} + s_{\emptyset} = 4 + 2 + 4 - 4 - 6 - 4 + 6 = 2$$
$$ub(ABC) = \min\{4, 2, 4, 2\} = 2$$

$lb(ABC) = ub(ABC) = 2$, and hence $ABC$ is derivable.
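The whole derivation can be automated; a sketch (the `bounds` helper is mine) that reproduces $lb = ub = 2$ from the subset supports used above:

```python
from itertools import combinations

def bounds(I, supp):
    """Lower and upper bounds on supp(I) from its proper subsets,
    via the inclusion-exclusion bounds (X ranges over subsets of I)."""
    I = frozenset(I)
    lower, upper = [0], []
    for r in range(len(I)):
        for X in map(frozenset, combinations(sorted(I), r)):
            # sum over X ⊆ J ⊂ I of (-1)^{|I\J|+1} supp(J)
            total = 0
            for k in range(len(X), len(I)):
                for J in map(frozenset, combinations(sorted(I), k)):
                    if X <= J:
                        total += (-1) ** (len(I - J) + 1) * supp[J]
            # odd |I \ X| gives an upper bound, even a lower bound
            (upper if len(I - X) % 2 == 1 else lower).append(total)
    return max(lower), min(upper)

supp = {frozenset(): 6, frozenset('A'): 4, frozenset('B'): 6, frozenset('C'): 4,
        frozenset('AB'): 4, frozenset('AC'): 2, frozenset('BC'): 4}
print(bounds('ABC', supp))  # (2, 2): lb = ub, so ABC is derivable with supp 2
```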

Slide 49

Conclusions

Association rules tell us which items we will probably see, given that we’ve seen some other items

 many business and scientific applications

Frequent itemsets tell us which items appear together

 mining these is the first step for mining many other things
 many different algorithms exist for efficient frequent itemset mining

The number of frequent itemsets is usually too large

 exponential output space
 maximal, closed, and non-derivable itemsets provide a summarisation of a collection of frequent itemsets

Slide 50

Thank you!