Association Rule Mining
What Is Association Rule Mining?
- Association rule mining is finding frequent patterns or associations among sets of items or objects, usually in transactional data
- Applications include Market Basket analysis, cross-marketing,
catalog design, etc.
Association Mining
- Examples:
Rule form: "Body → Head [support, confidence]"
buys(x, "diapers") → buys(x, "beers") [0.5%, 60%]
buys(x, "bread") → buys(x, "milk") [0.6%, 65%]
major(x, "CS") ∧ takes(x, "DB") → grade(x, "A") [1%, 75%]
age(X, "30-45") ∧ income(X, "50K-75K") → buys(X, "SUV car")
equivalently: age = "30-45", income = "50K-75K" → car = "SUV"
Market-basket Analysis & Finding Associations
- Do items occur together?
- Proposed by Agrawal et al. in 1993.
- It is an important data mining model studied extensively
by the database and data mining community.
- Assumes all data are categorical.
- Initially used for market basket analysis to find how items purchased by customers are related, e.g., Bread → Milk [sup = 5%, conf = 100%]
Association Rule: Basic Concepts
- Given: (1) a database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit)
- Find: all rules that correlate the presence of one set of items
with that of another set of items
E.g., 98% of people who purchase tires and auto accessories
also get automotive services done
- Applications
* ⇒ Maintenance Agreement (what should the store do to boost Maintenance Agreement sales?)
Home Electronics ⇒ * (what other products should the store stock up on?)
Detecting "ping-pong"ing of patients, faulty "collisions"
Association Rule Mining
- Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction
Market-Basket transactions
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules
{Diaper} → {Beer}, {Milk, Bread} → {Eggs, Coke}, {Beer, Bread} → {Milk}
Implication means co-occurrence, not causality!
An itemset is simply a set of items
Examples from a Supermarket
- Can you think of association rules from a
supermarket?
- Let’s say you identify association rules from a
supermarket, how might you exploit them?
That is, if you are the store manager, how might
you make money?
Assume you have a rule of the form X → Y
Supermarket examples
- If you have a rule X → Y, you could:
Run a sale on X if you want to increase sales of Y
Locate the two items near each other
Locate the two items far from each other to make the shopper walk through the store
Print out a coupon on checkout for Y if the shopper bought X but not Y
Association “rules”–standard format
Rule format: If {set of items} Then {set of items}; the Condition implies the Results (a set can consist of just a single item)
Example: If {Diapers, Baby Food} (Condition) Then {Beer, Chips} (Results)
The right side is very often a single item
Rules do not imply causality
What is an Interesting Association?
- Requires domain-knowledge validation
Actionable, non-trivial, understandable
- Algorithms provide a first pass based on statistics on how "unexpected" an association is
- Some standard statistics used, for a rule C → R:
support ≈ p(R ∧ C)
percent of "baskets" where the rule holds
confidence ≈ p(R | C)
percent of times R holds when C holds
Support and Confidence
- Find all the rules X → Y with minimum confidence and support
Support = probability that a transaction contains {X, Y}
i.e., the ratio of transactions in which X and Y occur together to all transactions in the DB
Confidence = conditional probability that a transaction having X also contains Y
i.e., the ratio of transactions in which X and Y occur together to those in which X occurs
The confidence of a rule LHS ⇒ RHS can be computed as the support of the whole itemset divided by the support of the LHS:
Confidence(LHS ⇒ RHS) = Support(LHS ∪ RHS) / Support(LHS)
Definition: Frequent Itemset
- Itemset
A collection of one or more items. Example: {Milk, Bread, Diaper}
k-itemset: an itemset with k items
- Support count (σ)
Frequency count of occurrence of the itemset
E.g., σ({Milk, Bread, Diaper}) = 2
- Support (s)
Fraction of transactions containing the itemset
E.g., s({Milk, Bread, Diaper}) = 2/5
- Frequent Itemset
An itemset whose support is greater than or equal to a minsup threshold
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Support and Confidence Calculations
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Given Association Rule
– {Milk, Diaper} → {Beer}
Rule Evaluation Metrics
– Support (s)
Fraction of transactions that contain both X and Y
– Confidence (c)
Measures how often items in Y appear in transactions that contain X
Now compute these two metrics:
s({Milk, Diaper} → {Beer}) = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c({Milk, Diaper} → {Beer}) = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67
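As a sanity check, here is a minimal Python sketch (my own illustration, not from the slides) that computes both metrics for this rule directly from the five transactions:

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset, db):
    # Fraction of transactions containing every item in `itemset`
    return sum(set(itemset) <= t for t in db) / len(db)

def confidence(lhs, rhs, db):
    # support(LHS union RHS) / support(LHS)
    return support(set(lhs) | set(rhs), db) / support(lhs, db)

print(support({"Milk", "Diaper", "Beer"}, transactions))       # 0.4
print(confidence({"Milk", "Diaper"}, {"Beer"}, transactions))  # 0.666...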
Support and Confidence – 2nd Example
Transaction ID  Items Bought
1001            A, B, C
1002            A, C
1003            A, D
1004            B, E, F
1005            A, D, F
Itemset {A, C} has a support of 2/5 = 40%
Rule {A} ⇒ {C} has confidence of 50%
Rule {C} ⇒ {A} has confidence of 100%
Support for {A, C, E}?
Support for {A, D, F}?
Confidence for {A, D} ⇒ {F}?
Confidence for {A} ⇒ {D, F}?
Goal: Find all rules that satisfy the user-specified minimum support (minsup) and minimum confidence (minconf).
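A quick way to check your answers to the questions above (same definitions, sketched in Python; the helper name is invented):

transactions = [{"A","B","C"}, {"A","C"}, {"A","D"}, {"B","E","F"}, {"A","D","F"}]

def support(items, db):
    # Fraction of transactions containing every item in `items`
    return sum(set(items) <= t for t in db) / len(db)

print(support("ACE", transactions))                                # 0.0  -> 0%
print(support("ADF", transactions))                                # 0.2  -> 20%
print(support("ADF", transactions) / support("AD", transactions))  # 0.5  -> 50%
print(support("ADF", transactions) / support("A", transactions))   # 0.25 -> 25%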
Example
- Transaction data
- Assume:
minsup = 30%
minconf = 80%
- An example frequent itemset:
{Chicken, Clothes, Milk}
[sup = 3/7]
- Rules from the itemset are partitions of the items
- Association rules from above itemset:
Clothes → Milk, Chicken
[sup = 3/7, conf = 3/3]
…
…
Clothes, Chicken → Milk
[sup = 3/7, conf = 3/3]
t1: Beef, Chicken, Milk
t2: Beef, Cheese
t3: Cheese, Boots
t4: Beef, Chicken, Cheese
t5: Beef, Chicken, Clothes, Cheese, Milk
t6: Chicken, Clothes, Milk
t7: Chicken, Milk, Clothes
Mining Association Rules
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Example of Rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)
Observations:
- All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
- Rules originating from the same itemset have identical support (by
definition) but may have different confidence values
Drawback of Confidence
         Coffee  Not Coffee  Total
Tea      15      5           20
Not Tea  75      5           80
Total    90      10          100

Association Rule: Tea → Coffee
Confidence = P(Coffee | Tea) = 15/20 = 0.75, but P(Coffee) = 0.9
Although confidence is high, the rule is misleading: P(Coffee | Not Tea) = 75/80 = 0.9375
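The numbers are easy to verify from the table (a small Python sketch; the variable names are mine):

n = 100
tea, tea_and_coffee = 20, 15
not_tea, not_tea_and_coffee = 80, 75
coffee = 90

print(tea_and_coffee / tea)          # P(Coffee | Tea)     = 0.75
print(coffee / n)                    # P(Coffee)           = 0.90
print(not_tea_and_coffee / not_tea)  # P(Coffee | Not Tea) = 0.9375
# Tea drinkers are actually *less* likely than average to buy coffee.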
Mining Association Rules
- Two-step approach:
1.
Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2.
Rule Generation
– Generate high confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
- Frequent itemset generation is still
computationally expensive
Transaction data representation
- A simplistic view of “shopping baskets”
- Some important information not considered:
the quantity of each item purchased
the price paid
Many mining algorithms
- There are a large number of them
- They use different strategies and data structures.
- Their resulting sets of rules are all the same.
Given a transaction data set T, a minimum support, and a minimum confidence, the set of association rules existing in T is uniquely determined.
- Any algorithm should find the same set of rules
although their computational efficiencies and memory requirements may be different.
- We study only one: the Apriori Algorithm
The Apriori algorithm
- The best known algorithm
- Two steps:
Find all itemsets that have minimum support (frequent
itemsets, also called large itemsets).
Use frequent itemsets to generate rules.
- E.g., a frequent itemset
{Chicken, Clothes, Milk} [sup = 3/7]
and one rule from the frequent itemset
Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]
Step 1: Mining all Frequent Itemsets
- A frequent itemset is an itemset whose support is
≥ minsup.
- Key idea: the Apriori property (downward closure property): any subset of a frequent itemset is also a frequent itemset
(Diagram: itemset lattice over items A, B, C, D, from the single items up through ABC, ABD, ACD, BCD.)
Steps in Association Rule Discovery
- Find frequent itemsets
Itemsets with at least minimum support
Support is "downward closed", so a subset of a frequent itemset must be frequent:
if {A, B} is a frequent itemset, both {A} and {B} are frequent itemsets
If an itemset does not satisfy minimum support, none of its supersets will either (this is the key point that allows pruning of the search space; see the sketch after this list)
Iteratively find frequent itemsets with cardinality from 1 to k
(k-itemsets)
- Use the frequent itemsets to generate assoc. rules
Generate all binary partitions, though they may have to fit a template, e.g., only one item on the right side or only two items on the left side
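The pruning idea above can be captured in a few lines (a sketch; has_infrequent_subset is an invented name, not a library function): a candidate k-itemset is discarded if any of its (k-1)-subsets is not frequent.

from itertools import combinations

def has_infrequent_subset(candidate, frequent_prev):
    # True if some (k-1)-subset of `candidate` is not frequent
    k = len(candidate)
    return any(frozenset(s) not in frequent_prev
               for s in combinations(candidate, k - 1))

F2 = {frozenset(p) for p in [("A","B"), ("A","C"), ("B","C"), ("B","D")]}
print(has_infrequent_subset(("A","B","C"), F2))  # False: keep the candidate
print(has_infrequent_subset(("A","B","D"), F2))  # True: {A,D} not frequent, prune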
Frequent Itemset Generation
(Diagram: itemset lattice over items A, B, C, D, E, from the null set through all subsets up to ABCDE.)
Given d items, there are 2^d possible candidate itemsets
Mining Association Rules—An Example
For rule A → C:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%
The Apriori principle: any subset of a frequent itemset must be frequent
Transaction ID  Items Bought
2000            A, B, C
1000            A, C
4000            A, D
5000            B, E, F

Frequent Itemset  Support
{A}               75%
{B}               50%
{C}               50%
{A, C}            50%
- Min. support 50%
- Min. confidence 50%
User specifies these
Illustrating the Apriori Principle
(Diagram: the itemset lattice over A–E, shown twice; an itemset found to be infrequent has all of its supersets pruned from the search.)
The Apriori Algorithm
- Terminology:
Ck is the set of candidate k-itemsets
Lk is the set of frequent k-itemsets
- Join Step: Ck is generated by joining two elements from Lk-1
There must be a lot of overlap for the join to only increase length by 1
- Prune Step: Any (k-1)-itemset that is not frequent cannot
be a subset of a frequent k-itemset
This is a bit confusing since we want to use it the other way:
we prune a candidate k-itemset if any of its (k-1)-subsets is not in our list of frequent (k-1)-itemsets
- To utilize this you simply start with k = 1, which is single-item itemsets, and then you work your way up from there!
The Algorithm
- Iterative algo. (also called level-wise search): Find
all 1-item frequent itemsets; then all 2-item frequent itemsets, and so on.
In each iteration k, only consider itemsets that contain some frequent (k-1)-itemset.
- Find frequent itemsets of size 1: F1
- From k = 2 onward:
Ck = candidates of size k: those itemsets of size k that could be frequent, given Fk-1
Fk = those itemsets in Ck that are actually frequent, Fk ⊆ Ck (needs one scan of the database; see the sketch below)
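A compact Python sketch of this level-wise loop (my own rendering, not the course's reference code; for simplicity the join step takes any two frequent (k-1)-itemsets whose union has k items, which after pruning yields the same candidate set as the prefix-based self-join described later):

from itertools import combinations

def apriori_frequent(transactions, minsup):
    n = len(transactions)
    support = lambda s: sum(s <= t for t in transactions) / n
    # F1: frequent 1-itemsets
    items = {i for t in transactions for i in t}
    Fk = {frozenset([i]) for i in items if support(frozenset([i])) >= minsup}
    frequent = set(Fk)
    while Fk:
        k = len(next(iter(Fk))) + 1
        # Join: unions of two frequent (k-1)-itemsets that have size k
        Ck = {a | b for a in Fk for b in Fk if len(a | b) == k}
        # Prune: drop candidates with an infrequent (k-1)-subset
        Ck = {c for c in Ck
              if all(frozenset(s) in Fk for s in combinations(c, k - 1))}
        # One database scan per level keeps the truly frequent candidates
        Fk = {c for c in Ck if support(c) >= minsup}
        frequent |= Fk
    return frequent

D = [frozenset(t) for t in [{1,3,4}, {2,3,5}, {1,2,3,5}, {2,5}]]
print(sorted(map(sorted, apriori_frequent(D, 0.3))))
# [[1], [1, 3], [2], [2, 3], [2, 3, 5], [2, 5], [3], [3, 5], [5]]

(The demo database D here is the same one used in the worked example a few slides below.)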
Apriori Candidate Generation
- The candidate-gen function takes Lk-1 and
returns a superset (called the candidates) of the set of all frequent k-itemsets.
- There are two steps:
join step: Generate all possible candidate itemsets
Ck of length k
prune step: Remove those candidates in Ck that
cannot be frequent.
How to Generate Candidates?
- Suppose the items in Lk-1 are listed in an order
- Step 1: self-joining Lk-1
The description below is a bit confusing: all we do is splice two sets together so that only one new item is added (see next slide)
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
- Step 2: pruning
forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
        if (s is not in Lk-1) then delete c from Ck
Self-Joining Step
- All items in the itemsets to be self-joined are kept in a consistent order (any order will do)
Such as lexicographic (alphabetical) order
- Two items in the itemset can be joined only if
they differ in the last position
- Then when you join them the size of the
itemset goes up by one
- See example on next slide
Example of Generating Candidates (1)
- L3={abc, abd, acd, ace, bcd}
- Self-joining: L3*L3
abc and abd yield abcd
acd and ace yield acde
We do not join abd and acd:
Even though it would give abcd, which is a candidate
If the product were a candidate, it would have already been generated given the ordering
This may not be obvious at first glance
Example of Generating Candidates (2)
- Note that for abcd to be frequent, by the Apriori property abc, abd, acd, and bcd must all be frequent
- abc and abd are alphabetically before bcd
- So if we see abc and bcd we do not need to
generate abcd because if abd were there it would have already been generated
If it is not there then it would be pruned later
Example of Generating Candidates (3)
- Given the candidates abcd and acde, we go to the pruning phase
acde is removed because ade is not in L3
Merge step does not ensure all subsets are frequent
- C4={abcd}
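The same join-then-prune round, sketched in Python on this L3 (itemsets as sorted tuples; my own rendering):

from itertools import combinations

L3 = [tuple("abc"), tuple("abd"), tuple("acd"), tuple("ace"), tuple("bcd")]

# Join: two sorted 3-itemsets combine only if they differ in the last item
C4 = [p + (q[-1],) for p in L3 for q in L3
      if p[:-1] == q[:-1] and p[-1] < q[-1]]
print(C4)  # [('a','b','c','d'), ('a','c','d','e')]

# Prune: drop any candidate having a 3-subset that is not in L3
C4 = [c for c in C4 if all(s in L3 for s in combinations(c, 3))]
print(C4)  # [('a','b','c','d')] -- acde is gone because ade is not in L3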
The Apriori Algorithm — Example (minsup = 30%)
Database D
TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5

Scan D → C1
itemset  sup.
{1}      2
{2}      3
{3}      3
{4}      1
{5}      3

L1 (sup ≥ 30% of 4 transactions, i.e., count ≥ 2)
itemset  sup.
{1}      2
{2}      3
{3}      3
{5}      3

C2 (candidates from L1)
{1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D → C2 with counts
itemset  sup
{1 2}    1
{1 3}    2
{1 5}    1
{2 3}    2
{2 5}    3
{3 5}    2

L2
itemset  sup
{1 3}    2
{2 3}    2
{2 5}    3
{3 5}    2

C3 (candidates from L2)
{2 3 5}

Scan D → L3
itemset  sup
{2 3 5}  2
Warning: Do Not Forget Pruning
- Candidates get pruned in two ways:
The Apriori property is violated
If Apriori is not violated, you still must scan the database, and if minsup is not exceeded, then prune
The Apriori property is necessary but not sufficient to keep a candidate
If you forget to prune via the Apriori property, you will get the same results, since the scan will catch it
But I will take off points on an exam. Make it clear when you prune using the Apriori property (do not fill in the count when crossing the candidate off).
- The Apriori property cannot be violated until k = 3.
It begins to get trickier at k = 4, since there are more subsets to check.
Step 2: Rules from Frequent Itemsets
- Frequent itemsets → association rules
- One more step is needed
- For each frequent itemset X,
For each proper nonempty subset A of X:
Let B = X − A
A → B is an association rule if Confidence(A → B) ≥ minconf
support(A → B) = support(A ∪ B) = support(X)
confidence(A → B) = support(A ∪ B) / support(A)
Generating Rules: an Example
- Suppose {2,3,4} is frequent, with sup=50%
Proper nonempty subsets: {2,3}, {2,4}, {3,4}, {2}, {3}, {4}, with sup=50%,
50%, 75%, 75%, 75%, 75% respectively
These generate these association rules:
2,3 → 4, confidence = 100%
2,4 → 3, confidence = 100%
3,4 → 2, confidence = 67%
2 → 3,4, confidence = 67%
3 → 2,4, confidence = 67%
4 → 2,3, confidence = 67%
All rules have support = 50%
Then apply the confidence threshold to identify strong rules, i.e., rules that meet the support and confidence requirements
If the confidence threshold is 80%, we are left with 2 strong rules
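The six rules and the two survivors can be reproduced mechanically (a sketch; the support table is just the figures quoted above):

from itertools import combinations

sup = {frozenset(s): v for s, v in [
    ((2,3,4), .50), ((2,3), .50), ((2,4), .50), ((3,4), .75),
    ((2,), .75), ((3,), .75), ((4,), .75)]}
X, minconf = frozenset({2, 3, 4}), 0.8

for r in range(1, len(X)):                # proper nonempty subsets A of X
    for A in map(frozenset, combinations(X, r)):
        B = X - A
        conf = sup[X] / sup[A]            # support(X) / support(A)
        tag = "  <-- strong" if conf >= minconf else ""
        print(sorted(A), "->", sorted(B), f"conf = {conf:.0%}{tag}")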
Generating Rules: Summary
- To recap, in order to obtain A → B, we need to have support(A ∪ B) and support(A)
- All the required information for confidence
computation has already been recorded in itemset generation. No need to see the data T any more.
- This step is not as time-consuming as frequent
itemset generation
Hint: I almost always ask this on the exam
On Apriori Algorithm
Seems to be very expensive
- Level-wise search
- K = the size of the largest frequent itemset
- It makes at most K passes over the data
- In practice, K is bounded (often around 10)
- The algorithm is very fast. Under some conditions, all rules can be found in linear time.
- Scales up to large data sets
Granularity of items
- One exception to the “ease” of applying association rules
is selecting the granularity of the items.
- Should you choose:
diet coke?
coke product?
soft drink?
beverage?
- Should you include more than one level of granularity?
Some association finding techniques allow you to represent
hierarchies explicitly
Multiple-Level Association Rules
- Items often form a hierarchy
Items at the lower level are expected to have lower support
Rules regarding itemsets at the appropriate levels could be quite useful
A transaction database can be encoded based on dimensions and levels
(Diagram: hierarchy with Food at the top; Food splits into Milk and Bread; Milk into Skim and 2%; Bread into Wheat and White.)
Mining Multi-Level Associations
- A top-down, progressive deepening
approach
First find high-level strong rules:
milk → bread [20%, 60%]
Then find their lower-level “weaker” rules:
2% milk → wheat bread [6%, 50%]
Usually requires different thresholds at
different levels to find meaningful rules
lower support at lower levels
Interestingness Measurements
- Objective measures
Two popular measurements:
Support
Confidence
- Subjective measures (Silberschatz & Tuzhilin, KDD95)
A rule (pattern) is interesting if:
it is unexpected (surprising to the user); and/or
actionable (the user can do something with it)
Criticism to Support and Confidence
- Example 1:
Among 5000 students
3000 play basketball
3750 eat cereal
2000 both play basketball and eat cereal
play basketball → eat cereal [40%, 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%
play basketball → not eat cereal [20%, 33.3%] is far more interesting, although it has lower support and confidence
            basketball  not basketball  sum(row)
cereal      2000        1750            3750
not cereal  1000        250             1250
sum(col.)   3000        2000            5000
Lift of A ⇒ B = P(B|A) / P(B)
A rule is interesting if its lift is not near 1.0. What is the lift of the rule play basketball ⇒ not eat cereal?
(1/3) / (1250/5000) = 1.33
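A quick check of both rules' lifts from the table (sketch; counts taken from the figures above):

p_nc_given_b = 1000 / 3000    # P(not cereal | basketball) = 1/3
p_nc = 1250 / 5000            # P(not cereal) = 0.25
print(p_nc_given_b / p_nc)    # lift = 1.333 -> positively correlated

# The misleading rule, for contrast:
print((2000 / 3000) / (3750 / 5000))   # lift = 0.889 < 1 -> negative correlation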
Customer Number vs. Transaction ID
- In the homework you may have a problem where
there is a customer id for each transaction
You can be asked to do association analysis based on
the customer id
If this is so, you need to aggregate the transactions to the customer level
If a customer has 3 transactions, then you just create an itemset containing all of the items in the union of the 3 transactions (see the sketch below)
Note we will ignore the frequency of purchase
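One way to do this aggregation (a sketch; customer ids and items are invented):

from collections import defaultdict

visits = [
    ("cust1", {"bread", "milk"}),
    ("cust1", {"beer"}),
    ("cust2", {"diapers"}),
    ("cust1", {"milk", "eggs"}),
]
baskets = defaultdict(set)
for cust, items in visits:
    baskets[cust] |= items   # union across the customer's visits
print(dict(baskets))
# {'cust1': {'bread', 'milk', 'beer', 'eggs'}, 'cust2': {'diapers'}}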
Virtual items
- If you're interested in including other possible variables, you can create "virtual items"
- e.g., gift-wrap, used-coupon, new-store, winter-holidays, bought-nothing, …
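For example, one might encode such attributes like this (a sketch; the order fields and virtual-item names are invented):

orders = [
    {"items": {"bread", "milk"}, "coupon": True,  "gift_wrap": False},
    {"items": {"beer", "chips"}, "coupon": False, "gift_wrap": True},
]
baskets = []
for o in orders:
    b = set(o["items"])
    if o["coupon"]:
        b.add("used-coupon")   # virtual item
    if o["gift_wrap"]:
        b.add("gift-wrap")     # virtual item
    baskets.append(b)
print(baskets)   # mine these augmented baskets exactly like ordinary ones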
Associations: Pros and Cons
- Pros
can quickly mine patterns describing business/customers/etc. without major effort in problem formulation
virtual items allow much flexibility
unparalleled tool for hypothesis generation
- Cons
unfocused
not clear exactly how to apply mined "knowledge"; only hypothesis generation
can produce many, many rules!
there may be only a few nuggets among them (or none)
Association Rules
- Association rule types:
Actionable Rules – contain high-quality, actionable
information
Trivial Rules – information already well-known by
those familiar with the business
Inexplicable Rules – no explanation and do not
suggest action
- Trivial and inexplicable rules occur most often