  1. Chapter VII.3: Association Rules
     1. Generating the Association Rules
     2. Measures of Interestingness
        2.1. Problems with confidence
        2.2. Some other measures
     3. Properties of Measures
     4. Simpson's Paradox
     Zaki & Meira, Chapter 10; Tan, Steinbach & Kumar, Chapter 6

  2. Generating association rules
     • We can generate the association rules from the frequent itemsets
       – If Z is a frequent itemset and X ⊂ Z is a proper subset of it, we have the rule X → Y, where Y = Z \ X
     • These rules are frequent, because supp(X → Y) = supp(X ∪ Y) = supp(Z)
       – We still need to compute the confidence as supp(Z) / supp(X)
     • If the rule X → Z \ X is not confident, then no rule of the type W → Z \ W, with W ⊆ X, is confident
       – We can use this to prune the search space

  3. Pseudo-code for generating association rules

     Algorithm 8.6 of Zaki & Meira: AssociationRules(F, minconf)
       foreach Z ∈ F such that |Z| ≥ 2 do
           A ← { X | X ⊂ Z, X ≠ ∅ }
           while A ≠ ∅ do
               X ← a maximal element of A
               A ← A \ {X}                          // remove X from A
               c ← sup(Z) / sup(X)
               if c ≥ minconf then
                   print X → Z \ X, sup(Z), c
               else
                   A ← A \ { W | W ⊆ X }            // remove all subsets of X from A
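     As a concrete illustration, here is a minimal Python sketch of the same procedure, assuming the frequent itemsets are given as a dict that maps each itemset (a frozenset) to its support count; the names support and association_rules are illustrative, not from the slides.

         from itertools import combinations

         def association_rules(support, minconf):
             """Generate confident rules X -> Z \\ X from frequent itemsets.

             support: dict mapping frozenset -> support count; assumed to
                      contain every frequent itemset and all of its subsets.
             """
             rules = []
             for Z in support:
                 if len(Z) < 2:
                     continue
                 # all non-empty proper subsets of Z, largest first
                 A = [frozenset(s) for r in range(len(Z) - 1, 0, -1)
                      for s in combinations(Z, r)]
                 pruned = set()
                 for X in A:
                     if X in pruned:
                         continue
                     c = support[Z] / support[X]
                     if c >= minconf:
                         rules.append((X, Z - X, support[Z], c))
                     else:
                         # no rule W -> Z \ W with W a subset of X can be confident
                         pruned.update(frozenset(s) for r in range(1, len(X))
                                       for s in combinations(X, r))
             return rules

         # tiny usage example with made-up support counts
         support = {frozenset("A"): 4, frozenset("B"): 5, frozenset("AB"): 3}
         print(association_rules(support, minconf=0.6))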

  4. Measures of Interestingness
     • Consider the following example:

                      Coffee   Not Coffee      ∑
         Tea             150           50    200
         Not Tea         650          150    800
         ∑               800          200   1000

     • The rule {Tea} → {Coffee} has 15% support and 75% confidence
       – Reasonably good numbers
     • Is this a good rule?
     • The overall fraction of coffee drinkers is 80%
       ⇒ Drinking tea reduces the probability of drinking coffee!
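     The numbers are easy to check directly; a minimal Python sketch (the variable names are illustrative, the counts are taken from the table above):

         # cell counts of the 2x2 table above
         tea_coffee, tea_no_coffee = 150, 50
         no_tea_coffee, no_tea_no_coffee = 650, 150
         N = tea_coffee + tea_no_coffee + no_tea_coffee + no_tea_no_coffee  # 1000

         support = tea_coffee / N                                  # 0.15
         confidence = tea_coffee / (tea_coffee + tea_no_coffee)    # 0.75
         coffee_overall = (tea_coffee + no_tea_coffee) / N         # 0.80
         print(support, confidence, coffee_overall)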

  5. Problems with Confidence
     • The support–confidence framework doesn't take into account the support of the consequent (tail)
       – Rules with relatively small support for the antecedent and high support for the consequent often have high confidence
     • To fix this, many other measures have been proposed
     • Most measures are easy to express using contingency tables:

                     B      ¬B      ∑
         A         f11     f10    f1+
         ¬A        f01     f00    f0+
         ∑         f+1     f+0      N

  6. Interest Factor
     • The interest factor I of the rule A → B is defined as
         I(A, B) = N·supp(AB) / (supp(A)·supp(B)) = N·f11 / (f1+·f+1)
       – It is equivalent to the lift, conf(A → B) / supp(B)
     • The interest factor compares the observed frequencies against the assumption that A and B are independent
       – If A and B are independent, f11 = f1+·f+1 / N
     • Interpreting the interest factor:
       – I(A, B) = 1 if A and B are independent
       – I(A, B) > 1 if A and B are positively correlated
       – I(A, B) < 1 if A and B are negatively correlated
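     A minimal sketch of the interest factor in count form (the function name is illustrative); on the tea/coffee table it reproduces the value worked out on the Examples slide below:

         def interest_factor(f11, f10, f01, f00):
             """I(A, B) = N * f11 / (f1+ * f+1)."""
             N = f11 + f10 + f01 + f00
             return N * f11 / ((f11 + f10) * (f11 + f01))

         # tea/coffee table: 1000*150 / (200*800) = 0.9375
         print(interest_factor(150, 50, 650, 150))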

  7. The IS measure
     • The IS measure of the rule A → B is defined as
         IS(A, B) = √( I(A, B) × supp(AB)/N ) = f11 / √(f1+·f+1)
     • If we think of A and B as binary vectors, IS is their cosine similarity
     • IS is also the geometric mean of the confidences of A → B and B → A:
         IS(A, B) = √( (supp(AB)/supp(A)) × (supp(AB)/supp(B)) ) = √( conf(A → B) × conf(B → A) )
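     A correspondingly small sketch of the IS measure (again with an illustrative function name); it gives 0.375 on the tea/coffee table, as on the next slide:

         import math

         def is_measure(f11, f10, f01, f00):
             """IS(A, B) = f11 / sqrt(f1+ * f+1), the cosine of A and B as binary vectors."""
             return f11 / math.sqrt((f11 + f10) * (f11 + f01))

         # tea/coffee table: 150 / sqrt(200 * 800) = 0.375
         print(is_measure(150, 50, 650, 150))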

  8. Examples (1)

                      Coffee   Not Coffee      ∑
         Tea             150           50    200
         Not Tea         650          150    800
         ∑               800          200   1000

     • The interest factor of {Tea} → {Coffee} is (1000 × 150)/(200 × 800) = 0.9375
       – Slight negative correlation
     • The IS of the rule is 0.375

  9. Examples (2)

               p    ¬p      ∑                 r    ¬r      ∑
         q   880    50    930           s    20    50     70
         ¬q   50    20     70           ¬s   50   880    930
         ∑   930    70   1000           ∑    70   930   1000

     • I(p, q) = 1.02 and I(r, s) = 4.08
       – p and q are close to independent, yet they appear together in 88% of all cases
       – r and s have the higher interest factor, yet they seldom appear together
     • Now conf(p → q) = 0.946 and conf(r → s) = 0.286

  10. Measures for pairs of itemsets

         Measure (Symbol)           Definition
         Correlation (φ)            (N·f11 − f1+·f+1) / √(f1+·f+1·f0+·f+0)
         Odds ratio (α)             (f11·f00) / (f10·f01)
         Kappa (κ)                  (N·f11 + N·f00 − f1+·f+1 − f0+·f+0) / (N² − f1+·f+1 − f0+·f+0)
         Interest (I)               N·f11 / (f1+·f+1)
         Cosine (IS)                f11 / √(f1+·f+1)
         Piatetsky-Shapiro (PS)     f11/N − f1+·f+1/N²
         Collective strength (S)    ((f11 + f00) / (f1+·f+1/N + f0+·f+0/N)) × ((N − f1+·f+1/N − f0+·f+0/N) / (N − f11 − f00))
         Jaccard (ζ)                f11 / (f1+ + f+1 − f11)
         All-confidence (h)         min( f11/f1+ , f11/f+1 )

     Tan, Steinbach & Kumar, Table 6.11
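     All of these are simple functions of the four cell counts; a minimal Python sketch of a few of them (the function name is illustrative):

         import math

         def pair_measures(f11, f10, f01, f00):
             """A few symmetric measures computed from a 2x2 contingency table."""
             N = f11 + f10 + f01 + f00
             f1p, f0p = f11 + f10, f01 + f00    # row sums
             fp1, fp0 = f11 + f01, f10 + f00    # column sums
             return {
                 "phi": (N * f11 - f1p * fp1) / math.sqrt(f1p * fp1 * f0p * fp0),
                 "odds_ratio": (f11 * f00) / (f10 * f01),
                 "interest": N * f11 / (f1p * fp1),
                 "cosine_IS": f11 / math.sqrt(f1p * fp1),
                 "piatetsky_shapiro": f11 / N - (f1p * fp1) / N**2,
                 "jaccard": f11 / (f1p + fp1 - f11),
                 "all_confidence": min(f11 / f1p, f11 / fp1),
             }

         print(pair_measures(150, 50, 650, 150))   # the tea/coffee table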

  11. Measures for association rules A → B

         Measure (Symbol)           Definition
         Goodman-Kruskal (λ)        ( Σj maxk fjk − maxk f+k ) / ( N − maxk f+k )
         Mutual Information (M)     ( Σi Σj (fij/N)·log( N·fij / (fi+·f+j) ) ) / ( −Σi (fi+/N)·log(fi+/N) )
         J-Measure (J)              (f11/N)·log( N·f11 / (f1+·f+1) ) + (f10/N)·log( N·f10 / (f1+·f+0) )
         Gini index (G)             (f1+/N)·[ (f11/f1+)² + (f10/f1+)² ] − (f+1/N)² + (f0+/N)·[ (f01/f0+)² + (f00/f0+)² ] − (f+0/N)²
         Laplace (L)                (f11 + 1) / (f1+ + 2)
         Conviction (V)             f1+·f+0 / (N·f10)
         Certainty factor (F)       ( f11/f1+ − f+1/N ) / ( 1 − f+1/N )
         Added Value (AV)           f11/f1+ − f+1/N

     Tan, Steinbach & Kumar, Table 6.12
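     The asymmetric rule measures are computed from the same four counts; a minimal sketch of four of the simpler ones (the function name is illustrative):

         def rule_measures(f11, f10, f01, f00):
             """A few asymmetric measures for the rule A -> B from a 2x2 contingency table."""
             N = f11 + f10 + f01 + f00
             f1p = f11 + f10                  # supp(A)
             fp1, fp0 = f11 + f01, f10 + f00  # supp(B), supp(not B)
             conf = f11 / f1p
             return {
                 "laplace": (f11 + 1) / (f1p + 2),
                 "conviction": (f1p * fp0) / (N * f10),
                 "certainty_factor": (conf - fp1 / N) / (1 - fp1 / N),
                 "added_value": conf - fp1 / N,
             }

         print(rule_measures(150, 50, 650, 150))   # {Tea} -> {Coffee}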

  12. Properties of Measures
     • The measures do not agree on how they rank itemset pairs or rules
     • To understand how they behave, we need to study their properties
       – Measures that share some property behave similarly under that property's conditions

  13. Three properties
     • A measure has the inversion property if its value stays the same when we exchange f11 with f00 and f10 with f01
       – The measure is invariant under flipping the bits
     • A measure has the null addition property if it is not affected by increasing f00 while the other values stay constant
       – The measure is invariant under adding new transactions that don't contain any of the items in the itemsets
     • A measure has the scaling invariance property if it is not affected by replacing the values f11, f10, f01, and f00 with k1·k3·f11, k2·k3·f10, k1·k4·f01, and k2·k4·f00
       – The k's are positive constants (k1, k2 scale the columns; k3, k4 scale the rows)

  14. Which properties hold?

         Symbol   Measure                Inversion   Null Addition   Scaling
         φ        φ-coefficient          Yes         No              No
         α        Odds ratio             Yes         No              Yes
         κ        Cohen's kappa          Yes         No              No
         I        Interest               No          No              No
         IS       Cosine                 No          Yes             No
         PS       Piatetsky-Shapiro's    Yes         No              No
         S        Collective strength    Yes         No              No
         ζ        Jaccard                No          Yes             No
         h        All-confidence         No          No              No
         s        Support                No          No              No

     Tan, Steinbach & Kumar, Table 6.17
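     Two rows of this table are easy to probe numerically; the following minimal Python sketch checks a measure on one example contingency table (the cell counts and function names are illustrative, so this is a spot check, not a proof):

         import math

         def odds_ratio(f11, f10, f01, f00):
             return (f11 * f00) / (f10 * f01)

         def interest(f11, f10, f01, f00):
             N = f11 + f10 + f01 + f00
             return N * f11 / ((f11 + f10) * (f11 + f01))

         def check_properties(measure, f11=30, f10=10, f01=5, f00=55):
             base = measure(f11, f10, f01, f00)
             inv = measure(f00, f01, f10, f11)         # inversion: f11<->f00, f10<->f01
             nul = measure(f11, f10, f01, f00 + 1000)  # null addition: grow f00 only
             k1, k2, k3, k4 = 2, 3, 5, 7               # column and row scaling factors
             scl = measure(k1*k3*f11, k2*k3*f10, k1*k4*f01, k2*k4*f00)
             close = lambda a, b: math.isclose(a, b, rel_tol=1e-9)
             return {"inversion": close(base, inv),
                     "null addition": close(base, nul),
                     "scaling": close(base, scl)}

         print(check_properties(odds_ratio))  # expect Yes / No / Yes, as in the table
         print(check_properties(interest))    # expect No / No / No, as in the table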

  15. Simpson's Paradox
     • Consider the following data on who bought HDTVs and exercise machines:

                      Exercise Machine   No Exercise Machine     ∑
         HDTV                       99                    81   180
         No HDTV                    54                    66   120
         ∑                         153                   147   300

     • {HDTV} → {Exercise machine} has confidence 0.55
     • {¬HDTV} → {Exercise machine} has confidence 0.45
       ⇒ Customers who buy HDTVs are more likely to buy exercise machines than those who don't buy HDTVs

  16. Deeper analysis

                                     Exercise machine
         Group       HDTV          Yes        No        ∑
         College     Yes             1         9       10
         College     No              4        30       34
         Working     Yes            98        72      170
         Working     No             50        36       86

     • For college students
       – conf(HDTV → Exercise machine) = 0.10
       – conf(¬HDTV → Exercise machine) = 0.118
     • For working adults
       – conf(HDTV → Exercise machine) = 0.577
       – conf(¬HDTV → Exercise machine) = 0.581
       ⇒ In both groups, not buying an HDTV makes buying an exercise machine more likely!

  17. The paradox and why it happens
     • In the combined data, HDTVs and exercise machines correlate positively
     • In the stratified data, they correlate negatively
       – This is Simpson's paradox
     • The explanation:
       – Most customers were working adults
         • They also bought most of the HDTVs and exercise machines
       – In the combined data this increased the correlation between HDTVs and exercise machines
     • Moral of the story: stratify your data properly!
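     The reversal is easy to reproduce; this minimal Python sketch recomputes the confidences from the counts above, first on the combined data and then within each stratum (the variable and function names are illustrative):

         # (group, bought HDTV, bought exercise machine) -> number of customers
         counts = {
             ("college", True,  True):  1, ("college", True,  False):  9,
             ("college", False, True):  4, ("college", False, False): 30,
             ("working", True,  True): 98, ("working", True,  False): 72,
             ("working", False, True): 50, ("working", False, False): 36,
         }

         def conf(hdtv, group=None):
             """Confidence of {HDTV = hdtv} -> {Exercise machine}, optionally per group."""
             rows = [(e, n) for (g, h, e), n in counts.items()
                     if h == hdtv and (group is None or g == group)]
             return sum(n for e, n in rows if e) / sum(n for _, n in rows)

         print(conf(True), conf(False))                        # combined: 0.55 vs 0.45
         print(conf(True, "college"), conf(False, "college"))  # 0.10 vs ~0.118
         print(conf(True, "working"), conf(False, "working"))  # ~0.577 vs ~0.581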

  18. Chapter VII.4: Summarizing Itemsets
     1. The flood of itemsets
     2. Maximal and closed frequent itemsets
        2.1. Definitions
        2.2. Algorithms
     3. Non-derivable itemsets
        3.1. Inclusion-exclusion principle
        3.2. Non-derivability
     Zaki & Meira, Chapter 11; Tan, Steinbach & Kumar, Chapter 6

  19. The Flood of Itemsets
     • Consider an example dataset of 7 transactions over the items A–H (the transaction table itself is omitted here)
     • How many itemsets with a minimum frequency of 1/7 does it have?
       – 255! Every non-empty subset of the 8 items occurs in at least one transaction, and there are 2^8 − 1 = 255 of them
     • Still 31 frequent itemsets with 50% minfreq
     • "Data mining is … to summarize the data"
       – Hardly a summarization!
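     The counting is easy to reproduce by brute force; the following minimal Python sketch uses a made-up 7-transaction dataset over A–H (not the slide's table, just chosen so that the same counts come out) and enumerates all frequent itemsets directly:

         from itertools import combinations

         # hypothetical transactions; one of them contains every item
         transactions = [set("ABCDEFGH"), set("ABCDE"), set("ABCDE"), set("ABCDE"),
                         set("FGH"), set("FG"), set("H")]
         items = sorted(set().union(*transactions))

         def count_frequent(minfreq):
             """Brute-force count of itemsets with relative frequency >= minfreq."""
             n = len(transactions)
             count = 0
             for r in range(1, len(items) + 1):
                 for cand in combinations(items, r):
                     support = sum(set(cand) <= t for t in transactions)
                     if support / n >= minfreq:
                         count += 1
             return count

         print(count_frequent(1/7))   # 255: every itemset occurring at least once
         print(count_frequent(0.5))   # 31: only the subsets of {A,B,C,D,E}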
