CISC 4631 Data Mining
Lecture 10: Association Rule Mining
These slides are based on the slides by
- Tan, Steinbach and Kumar (textbook authors)
- Prof. F. Provost (Stern, NYU)
- Prof. B. Liu, UIC
What Is Association Mining?
– Finding frequent patterns, associations, or correlations among sets of items in transaction databases or other data repositories
– Applications: market basket analysis, cross-marketing, catalog design, etc.
– Rule form: “Body → Head [support, confidence]”
– Example: buys(x, “diapers”) → buys(x, “beers”) [0.5%, 60%]
Market-Basket transactions

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} → {Beer}, {Milk, Bread} → {Eggs, Coke}, {Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
[Venn diagram: customers who buy diapers, customers who buy beer, and customers who buy both]
– actionable vs. trivial vs. inexplicable
– For a rule “if C then R”: support ≈ p(R & C)
– confidence ≈ p(R | C)
– Support = probability that a transaction in the database contains {X, Y}
– Confidence = conditional probability that a transaction containing X also contains Y
In general, the confidence of a rule LHS => RHS is the support of the combined itemset divided by the support of the left-hand side:
Confidence(LHS => RHS) = Support(LHS ∪ RHS) / Support(LHS)
Itemset
– A collection of one or more items
– k-itemset: an itemset that contains k items, e.g. {Milk, Bread, Diaper} is a 3-itemset
Support count (σ)
– Frequency of occurrence of an itemset, e.g. σ({Milk, Bread, Diaper}) = 2
Support (s)
– Fraction of transactions that contain an itemset, e.g. s({Milk, Bread, Diaper}) = 2/5
Frequent itemset
– An itemset whose support is greater than or equal to a minsup threshold
TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke
Example: {Milk, Diaper} → {Beer}
s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67
Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}
Rule Evaluation Metrics
– Support (s)
Fraction of transactions that contain
both X and Y
– Confidence (c)
Measures how often items in Y
appear in transactions that contain X
TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke
Transaction ID   Items Bought
1001             A, B, C
1002             A, C
1003             A, D
1004             B, E, F
1005             A, D, F

Itemset {A, C} has a support of 2/5 = 40%
Rule {A} ==> {C} has confidence of 50%
Rule {C} ==> {A} has confidence of 100%
Support for {A, C, E}? Support for {A, D, F}?
Confidence for {A, D} ==> {F}? Confidence for {A} ==> {D, F}?
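The questions above can be checked directly. Below is a minimal sketch (helper names are ours, not from the slides) that computes support and confidence over these five transactions:

```python
# The five transactions from the slide (IDs 1001-1005).
transactions = [
    {"A", "B", "C"},   # 1001
    {"A", "C"},        # 1002
    {"A", "D"},        # 1003
    {"B", "E", "F"},   # 1004
    {"A", "D", "F"},   # 1005
]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """support(LHS ∪ RHS) / support(LHS)."""
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support({"A", "C"}))            # 0.4
print(confidence({"C"}, {"A"}))       # 1.0
print(support({"A", "D", "F"}))       # 0.2  (only transaction 1005)
print(confidence({"A", "D"}, {"F"}))  # 0.5
print(confidence({"A"}, {"D", "F"}))  # 0.25
```

Note that {A, C, E} never occurs, so its support is 0 and any rule built from it is undefined.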
Goal: Find all rules that satisfy the user-specified minimum support (minsup) and minimum confidence (minconf).
CS583, Bing Liu, UIC 16
minsup = 30%, minconf = 80%

t1: Beef, Chicken, Milk
t2: Beef, Cheese
t3: Cheese, Boots
t4: Beef, Chicken, Cheese
t5: Beef, Chicken, Clothes, Cheese, Milk
t6: Chicken, Clothes, Milk
t7: Chicken, Milk, Clothes

An example frequent itemset: {Chicken, Clothes, Milk} [sup = 3/7]
Association rules from it, e.g.:
Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]
…
Clothes, Chicken → Milk [sup = 3/7, conf = 3/3]
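As a quick check of the t1–t7 example, the counts can be verified with a short sketch (function names are ours):

```python
# Transactions t1-t7 from the slide.
transactions = [
    {"Beef", "Chicken", "Milk"},                        # t1
    {"Beef", "Cheese"},                                 # t2
    {"Cheese", "Boots"},                                # t3
    {"Beef", "Chicken", "Cheese"},                      # t4
    {"Beef", "Chicken", "Clothes", "Cheese", "Milk"},   # t5
    {"Chicken", "Clothes", "Milk"},                     # t6
    {"Chicken", "Milk", "Clothes"},                     # t7
]

def sup_count(itemset):
    """Number of transactions containing the itemset."""
    return sum(set(itemset) <= t for t in transactions)

# {Chicken, Clothes, Milk} occurs in t5, t6, t7 -> sup = 3/7 >= 30% minsup.
print(sup_count({"Chicken", "Clothes", "Milk"}))  # 3

# Rule Clothes -> Milk, Chicken: conf = 3/3 = 1.0, above minconf = 80%.
conf = sup_count({"Clothes", "Milk", "Chicken"}) / sup_count({"Clothes"})
print(conf)  # 1.0
```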
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)

All of these rules are binary partitions of the same itemset {Milk, Diaper, Beer}: they have identical support but can have different confidence.
          Coffee   No Coffee   Total
Tea         15         5         20
No Tea      75         5         80
Total       90        10        100

Association Rule: Tea → Coffee
Confidence = P(Coffee | Tea) = 15/20 = 0.75, but P(Coffee) = 0.9
Although confidence is high, the rule is misleading: P(Coffee | No Tea) = 75/80 = 0.9375
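The contingency-table arithmetic behind this example is just a few divisions; a minimal sketch:

```python
# Counts from the Tea/Coffee contingency table above.
tea_coffee, tea_nocoffee = 15, 5
notea_coffee, notea_nocoffee = 75, 5
total = 100

conf = tea_coffee / (tea_coffee + tea_nocoffee)          # P(Coffee | Tea)
p_coffee = (tea_coffee + notea_coffee) / total           # P(Coffee)
p_coffee_given_notea = notea_coffee / (notea_coffee + notea_nocoffee)

print(conf)                  # 0.75
print(p_coffee)              # 0.9
print(p_coffee_given_notea)  # 0.9375
# Tea drinkers are actually LESS likely to buy coffee than non-tea drinkers,
# even though the rule Tea -> Coffee has 75% confidence.
```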
– Generate all itemsets whose support ≥ minsup
– Generate high confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
– Given a transaction data set T, a minimum support, and a minimum confidence, the set of association rules existing in T is uniquely determined.
[Itemset lattice over {A, B, C, D}: A, B, C, D; AB, AC, AD, BC, BD, CD; ABC, ABD, ACD, BCD; ABCD]
– Frequent itemsets are the sets of items that have at least minimum support
– Support is “downward closed”: every subset of a frequent itemset must also be a frequent itemset
– Conversely, if an itemset does not satisfy minimum support, none of its supersets will either (this is essential for pruning the search space)
– Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
[Itemset lattice over {A, B, C, D, E}, from the null set up to ABCDE]
Given d items, there are 2^d possible candidate itemsets
– A subset of a frequent itemset must also be a frequent itemset
– Conversely, no superset of an infrequent itemset can be a frequent itemset. Why? Make sure you can explain this.
– Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
– This step is more straightforward and requires less computation so we focus on the first step
[Lattice illustration: an itemset found to be infrequent has all of its supersets pruned from the search space]
– Ck is the set of candidate k-itemsets
– Lk is the set of frequent k-itemsets
– Every (k-1)-subset of a frequent k-itemset must itself be frequent. In practice we use this the other way around: we prune a candidate k-itemset if any of its (k-1)-subsets is not in our list of frequent (k-1)-itemsets
– Start from the frequent 1-itemsets, and then you work your way up from there!
– The description below looks complicated, but all we do is splice two sets together so that only one new item is added (see example)

forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
        if (s is not in Lk-1) then
            delete c from Ck
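The join-and-prune idea above can be sketched in a few lines of Python (helper names are ours, not from the slides): the join step merges two frequent (k-1)-itemsets that agree on their first k-2 items, and the prune step applies exactly the subset test in the pseudocode.

```python
from itertools import combinations

def apriori_gen(L_prev):
    """Generate candidate k-itemsets from the frequent (k-1)-itemsets.

    L_prev: set of frozensets, each of size k-1.
    """
    k_prev = len(next(iter(L_prev)))
    sorted_sets = [tuple(sorted(s)) for s in L_prev]
    candidates = set()
    # Join step: splice two sets that share their first k-2 items,
    # so exactly one new item is added.
    for a in sorted_sets:
        for b in sorted_sets:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                candidates.add(frozenset(a) | {b[-1]})
    # Prune step: keep c only if every (k-1)-subset of c is frequent.
    return {c for c in candidates
            if all(frozenset(s) in L_prev for s in combinations(c, k_prev))}

L2 = {frozenset(p) for p in [(1, 3), (2, 3), (2, 5), (3, 5)]}
print(apriori_gen(L2))  # {frozenset({2, 3, 5})}
```

Here {2, 3} and {2, 5} join to give {2, 3, 5}, which survives pruning because {2, 3}, {2, 5}, and {3, 5} are all frequent.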
TID    Items
T100   1, 3, 4
T200   2, 3, 5
T300   1, 2, 3, 5
T400   2, 5

C2: {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}

itemset   sup
{1, 2}     1
{1, 3}     2
{1, 5}     1
{2, 3}     2
{2, 5}     3
{3, 5}     2
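A minimal sketch of this pass (our own variable names, assuming a minimum support count of 2) reproduces which C2 candidates survive into L2:

```python
# Transactions T100-T400 from the example above; minsup count = 2.
transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
minsup = 2

def frequent(candidates):
    """Keep only the candidates whose support count reaches minsup."""
    counts = {c: sum(c <= t for t in transactions) for c in candidates}
    return {c for c, n in counts.items() if n >= minsup}

items = {i for t in transactions for i in t}
L1 = frequent({frozenset({i}) for i in items})     # item 4 drops out (count 1)
C2 = {a | b for a in L1 for b in L1 if a != b}     # all pairs of frequent items
L2 = frequent(C2)                                  # {1,2} and {1,5} drop out
print(sorted(sorted(s) for s in L2))  # [[1, 3], [2, 3], [2, 5], [3, 5]]
```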
For a rule A → B, with X = A ∪ B:
support(A → B) = support(X)
confidence(A → B) = support(X) / support(A)
– Proper nonempty subsets of the frequent itemset {2, 3, 4}: {2,3}, {2,4}, {3,4}, {2}, {3}, {4}, with sup = 50%, 50%, 75%, 75%, 75%, 75% respectively
– These generate the following association rules:
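The rule confidences follow directly from these supports. A small sketch, assuming (from the subset list) that the frequent itemset is {2, 3, 4} with support 50%:

```python
# Supports listed above; sup({2,3,4}) = 0.50 is assumed from context.
sup = {
    frozenset({2, 3, 4}): 0.50,
    frozenset({2, 3}): 0.50, frozenset({2, 4}): 0.50, frozenset({3, 4}): 0.75,
    frozenset({2}): 0.75, frozenset({3}): 0.75, frozenset({4}): 0.75,
}
whole = frozenset({2, 3, 4})
for lhs in sorted((s for s in sup if s != whole), key=sorted):
    rhs = whole - lhs
    # confidence(LHS -> RHS) = sup(whole itemset) / sup(LHS)
    print(f"{sorted(lhs)} -> {sorted(rhs)}: conf = {sup[whole] / sup[lhs]:.2f}")
```

For example, {2,3} → {4} has confidence 50/50 = 100%, while {3,4} → {2} has only 50/75 ≈ 67%.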
– diet coke? – coke product? – soft drink? – beverage?
Be careful
– Items at the lower level are expected to have lower support – Rules regarding itemsets at appropriate levels could be quite useful – Transaction database can be encoded based on dimensions and levels
– First find high-level strong rules:
– Then find their lower-level “weaker” rules:
– With one threshold set for all levels: if support is set too high, we may miss meaningful associations at low levels; if it is set too low, we may generate many uninteresting rules
(e.g., decrease min-support at lower levels)
– Level-crossed association rules:
– Association rules with multiple, alternative hierarchies:
– Rules of form {X, Y} → Z
– Among 5000 students, 3000 play basketball, 3750 eat cereal, and 2000 both play basketball and eat cereal
– play basketball → eat cereal [40%, 66.7%] is misleading because the overall fraction of students who eat cereal is 75%, which is higher than 66.7%
– play basketball → not eat cereal [20%, 33.3%] is far more interesting, although it has lower support and confidence
Lift of A => B = P(B|A) / P(B)
A rule is interesting if its lift is not near 1.0. What is the lift of play basketball => not eat cereal?
Lift = (1/3) / (1250/5000) = 1.33
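A quick sketch of the lift arithmetic, with the counts taken from the standard version of this example (5000 students, 3000 play basketball, 3750 eat cereal, 2000 do both):

```python
# Counts assumed from the usual basketball/cereal example.
n, basketball, cereal, both = 5000, 3000, 3750, 2000

conf = both / basketball                     # P(cereal | basketball) = 66.7%
lift = conf / (cereal / n)                   # 0.667 / 0.75
print(round(lift, 2))                        # 0.89 -> below 1: negatively correlated

conf_not = (basketball - both) / basketball  # P(no cereal | basketball) = 1/3
lift_not = conf_not / ((n - cereal) / n)     # (1/3) / (1250/5000)
print(round(lift_not, 2))                    # 1.33
```

The high-confidence rule has lift below 1, while the "not eat cereal" rule, despite lower support and confidence, has lift well above 1.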
– merchandising
– recommendations
– fraud detection
– simply understanding my business
– can quickly mine patterns describing business/customers/etc. without major effort in problem formulation – virtual items allow much flexibility – unparalleled tool for hypothesis generation
– unfocused
– can produce many, many rules!