

SLIDE 1


CISC 4631 Data Mining

Lecture 10: Association Rule Mining

These slides are based on the slides by

  • Tan, Steinbach and Kumar (textbook authors)
  • Prof. F. Provost (Stern, NYU)
  • Prof. B. Liu, UIC
SLIDE 2

What Is Association Mining?

  • Association rule mining:
– Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
  • Applications:
– Market basket analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.

SLIDE 3

Association Mining?

  • Examples.

– Rule form: "Body → Head [support, confidence]"
– buys(x, "diapers") → buys(x, "beers") [0.5%, 60%]
– buys(x, "bread") → buys(x, "milk") [0.6%, 65%]
– major(x, "CS") ∧ takes(x, "DB") → grade(x, "A") [1%, 75%]
– age(X, 30-45) ∧ income(X, 50K-75K) → buys(X, SUVcar)
– age = "30-45", income = "50K-75K" → car = "SUV"

SLIDE 4

Market-basket analysis and finding associations

  • Do items occur together? (more than I might expect)
  • Proposed by Agrawal et al. in 1993.
  • It is an important data mining model studied extensively by the database and data mining community.
  • Assume all data are categorical.
  • No good algorithm for numeric data.
  • Initially used for Market Basket Analysis to find how items purchased by customers are related. E.g., Bread → Milk [sup = 5%, conf = 100%]

SLIDE 5

Association Rule: Basic Concepts

  • Given: (1) a database of transactions; (2) each transaction is a list of items (purchased by a customer in a visit)
  • Find: all rules that correlate the presence of one set of items with that of another set of items
– E.g., 98% of people who purchase tires and auto accessories also get automotive services done
  • Applications
– * → Maintenance Agreement (what should the store do to boost Maintenance Agreement sales?)
– Home Electronics → * (what other products should the store stock up on?)
– Detecting "ping-pong"ing of patients, faulty "collisions"

SLIDE 6

Association Rule Mining

  • Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction


Market-Basket transactions

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules:

{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!

An itemset is simply a set of items

SLIDE 7

Association Rule Mining

– We are interested in rules that are

  • non-trivial (and possibly unexpected)
  • actionable
  • easily explainable

SLIDE 8

Examples from a Supermarket

  • Can you think of association rules from a supermarket?
  • Let's say you identify association rules from a supermarket; how might you exploit them?
– That is, if you are the store manager, how might you make money?
  • Assume you have a rule of the form X → Y

SLIDE 9

Supermarket examples

  • If you have a rule X → Y, you could:
– Run a sale on X if you want to increase sales of Y
– Locate the two items near each other
– Locate the two items far from each other to make the shopper walk through the store
– Print out a coupon at checkout for Y if the shopper bought X but not Y

SLIDE 10

Association “rules” – standard format

Rule format (a set can consist of just a single item):

If {set of items} Then {set of items}
   (Condition)        (Results)
The Condition implies the Results.

Example: If {Diapers, Baby Food} Then {Beer, Chips}

SLIDE 11

What is an interesting association?

  • Requires domain-knowledge validation
– actionable vs. trivial vs. inexplicable
  • Algorithms provide a first pass based on statistics on how "unexpected" an association is
  • Some standard statistics used, for a rule C → R:
– support ≈ p(R & C)
  • percent of "baskets" where the rule holds
– confidence ≈ p(R | C)
  • percent of times R holds when C holds
SLIDE 12

Support and Confidence

  • Find all the rules X → Y with minimum confidence and support
– Support = probability that a transaction contains {X, Y}
  • i.e., the ratio of transactions in which X and Y occur together to all transactions in the database.
– Confidence = conditional probability that a transaction having X also contains Y
  • i.e., the ratio of transactions in which X and Y occur together to those in which X occurs.

In general, the confidence of a rule LHS ⇒ RHS can be computed as the support of the whole itemset divided by the support of the LHS:

Confidence(LHS ⇒ RHS) = Support(LHS ∪ RHS) / Support(LHS)

[Venn diagram: customers buying diapers, customers buying beer, and the overlap buying both]
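As a concrete illustration (a minimal sketch, not from the slides), both measures can be computed directly from a transaction list; the transactions below mirror the market-basket table from the earlier slide:

# Sketch: support and confidence over a list of transactions (sets of items).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item of the itemset.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs, transactions):
    # Support(LHS u RHS) / Support(LHS), exactly as in the formula above.
    return support(lhs | rhs, transactions) / support(lhs, transactions)

print(support({"Milk", "Diaper", "Beer"}, transactions))       # 0.4
print(confidence({"Milk", "Diaper"}, {"Beer"}, transactions))  # 0.666...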

SLIDE 13

Definition: Frequent Itemset

  • Itemset
– A collection of one or more items
  • Example: {Milk, Bread, Diaper}
– k-itemset
  • An itemset that contains k items
  • Support count (σ)
– Frequency of occurrence of an itemset
– E.g., σ({Milk, Bread, Diaper}) = 2
  • Support
– Fraction of transactions that contain an itemset
– E.g., s({Milk, Bread, Diaper}) = 2/5
  • Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

SLIDE 14

Definition: Association Rule

  • Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}

  • Rule Evaluation Metrics
– Support (s): the fraction of transactions that contain both X and Y
– Confidence (c): measures how often items in Y appear in transactions that contain X

Example: {Milk, Diaper} → {Beer}

s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 = 0.67

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

SLIDE 15

Support and Confidence - Example

Transaction ID  Items Bought
1001            A, B, C
1002            A, C
1003            A, D
1004            B, E, F
1005            A, D, F

Itemset {A, C} has a support of 2/5 = 40%
Rule {A} ==> {C} has a confidence of 50%
Rule {C} ==> {A} has a confidence of 100%
Support for {A, C, E}?
Support for {A, D, F}?
Confidence for {A, D} ==> {F}?
Confidence for {A} ==> {D, F}?
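A quick way to check these questions (a sketch, reusing the same counting idea as the earlier snippet):

# Sketch: compute the quantities asked above from the transaction table.
T = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}, {"A", "D", "F"}]

def sup(itemset):
    return sum(1 for t in T if itemset <= t) / len(T)

print(sup({"A", "C", "E"}))                    # support of {A, C, E}
print(sup({"A", "D", "F"}))                    # support of {A, D, F}
print(sup({"A", "D", "F"}) / sup({"A", "D"}))  # confidence of {A, D} ==> {F}
print(sup({"A", "D", "F"}) / sup({"A"}))       # confidence of {A} ==> {D, F}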

Goal: Find all rules that satisfy the user-specified minimum support (minsup) and minimum confidence (minconf).

SLIDE 16


Example

  • Transaction data:

t1: Beef, Chicken, Milk
t2: Beef, Cheese
t3: Cheese, Boots
t4: Beef, Chicken, Cheese
t5: Beef, Chicken, Clothes, Cheese, Milk
t6: Chicken, Clothes, Milk
t7: Chicken, Milk, Clothes

  • Assume: minsup = 30%, minconf = 80%
  • An example frequent itemset: {Chicken, Clothes, Milk} [sup = 3/7]
  • Association rules from the itemset:

Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]
Clothes, Chicken → Milk [sup = 3/7, conf = 3/3]
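These numbers can be verified directly (a sketch; the list below mirrors t1-t7):

# Sketch: verify sup = 3/7 and conf = 3/3 for the two rules above.
T = [{"Beef", "Chicken", "Milk"},
     {"Beef", "Cheese"},
     {"Cheese", "Boots"},
     {"Beef", "Chicken", "Cheese"},
     {"Beef", "Chicken", "Clothes", "Cheese", "Milk"},
     {"Chicken", "Clothes", "Milk"},
     {"Chicken", "Milk", "Clothes"}]

def count(itemset):
    return sum(1 for t in T if itemset <= t)

print(count({"Chicken", "Clothes", "Milk"}), "of", len(T))             # sup: 3 of 7
print(count({"Chicken", "Clothes", "Milk"}), "/", count({"Clothes"}))  # conf of rule 1: 3 / 3
print(count({"Chicken", "Clothes", "Milk"}), "/", count({"Chicken", "Clothes"}))  # conf of rule 2: 3 / 3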

SLIDE 17

Mining Association Rules

Example of Rules:

{Milk, Diaper} → {Beer} (s = 0.4, c = 0.67)
{Milk, Beer} → {Diaper} (s = 0.4, c = 1.0)
{Diaper, Beer} → {Milk} (s = 0.4, c = 0.67)
{Beer} → {Milk, Diaper} (s = 0.4, c = 0.67)
{Diaper} → {Milk, Beer} (s = 0.4, c = 0.5)
{Milk} → {Diaper, Beer} (s = 0.4, c = 0.5)

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Observations:

  • All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
  • Rules originating from the same itemset have identical support but can have different confidence
  • Thus, we may decouple the support and confidence requirements
SLIDE 18

Drawback of Confidence

         Coffee  No Coffee  Total
Tea        15        5        20
No Tea     75        5        80
Total      90       10       100

Association Rule: Tea → Coffee

Confidence = P(Coffee | Tea) = 15/20 = 0.75, but P(Coffee) = 0.9.
Although the confidence is high, the rule is misleading: P(Coffee | no Tea) = 75/80 = 0.9375.
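A small sketch of this comparison, using the counts from the table:

# Sketch: confidence of Tea -> Coffee vs. the overall rate of Coffee.
tea_and_coffee, tea_total = 15, 20
coffee_total, total = 90, 100

confidence = tea_and_coffee / tea_total  # P(Coffee | Tea) = 0.75
baseline = coffee_total / total          # P(Coffee) = 0.90
no_tea = (coffee_total - tea_and_coffee) / (total - tea_total)  # P(Coffee | no Tea) = 0.9375
print(confidence, baseline, no_tea)      # drinking tea actually lowers the chance of coffee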

SLIDE 19

Mining Association Rules

  • Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
  • Frequent itemset generation is still computationally expensive

SLIDE 20


Transaction data representation

  • A simplistic view of shopping baskets
  • Some important information is not considered, e.g.:
– the quantity of each item purchased, and
– the price paid.

SLIDE 21


Many mining algorithms

  • There are a large number of them!
  • They use different strategies and data structures.
  • Their resulting sets of rules are all the same.
– Given a transaction data set T, a minimum support and a minimum confidence, the set of association rules existing in T is uniquely determined.
  • Any algorithm should find the same set of rules, although their computational efficiencies and memory requirements may differ.
  • We study only one: the Apriori Algorithm
SLIDE 22


The Apriori algorithm

  • The best-known algorithm
  • Two steps:
– Find all itemsets that have minimum support (frequent itemsets, also called large itemsets).
– Use frequent itemsets to generate rules.
  • E.g., from the frequent itemset
{Chicken, Clothes, Milk} [sup = 3/7]
one rule is
Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]

SLIDE 23


Step 1: Mining all frequent itemsets

  • A frequent itemset is an itemset whose support is ≥ minsup.
  • Key idea: the apriori property (downward closure property): any subset of a frequent itemset is also a frequent itemset.

[Itemset lattice over {A, B, C, D}: the singletons A, B, C, D; the pairs AB, AC, AD, BC, BD, CD; and the triples ABC, ABD, ACD, BCD]

SLIDE 24

Steps in Association Rule Discovery

  • Find the frequent itemsets
– Frequent itemsets are the sets of items that have minimum support
– Support is "downward closed": a subset of a frequent itemset must also be a frequent itemset
  • if {A, B} is a frequent itemset, both {A} and {B} are frequent itemsets
  • this also means that if an itemset doesn't satisfy minimum support, none of its supersets will either (this is essential for pruning the search space)
– Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
  • Use the frequent itemsets to generate association rules
SLIDE 25

Frequent Itemset Generation

[Itemset lattice over {A, B, C, D, E}, from the null set at the top down to ABCDE]

Given d items, there are 2^d possible candidate itemsets.

SLIDE 26

Mining Association Rules—An Example

For rule A → C:
  support = support({A, C}) = 50%
  confidence = support({A, C}) / support({A}) = 66.6%
The Apriori principle: any subset of a frequent itemset must be frequent.


Transaction ID  Items Bought
2000            A, B, C
1000            A, C
4000            A, D
5000            B, E, F

Frequent Itemset  Support
{A}               75%
{B}               50%
{C}               50%
{A, C}            50%

  • Min. support: 50%
  • Min. confidence: 50%

(The user specifies these.)

SLIDE 27

Mining Frequent Itemsets: the Key Step

  • Find the frequent itemsets: the sets of items that have minimum support
– A subset of a frequent itemset must also be a frequent itemset
  • i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets. Why? Make sure you can explain this.
– Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
  • Use the frequent itemsets to generate association rules
– This step is more straightforward and requires less computation, so we focus on the first step

SLIDE 28


Illustrating Apriori Principle

[Two copies of the itemset lattice over {A, B, C, D, E}: once an itemset is found to be infrequent, all of its supersets are pruned from the search]

SLIDE 29

The Apriori Algorithm

  • Terminology:
– Ck is the set of candidate k-itemsets
– Lk is the set of frequent k-itemsets
  • Join Step: Ck is generated by joining the set Lk-1 with itself
  • Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
– This is a bit confusing since we want to use it the other way: we prune a candidate k-itemset if any of its (k-1)-subsets is not in our list of frequent (k-1)-itemsets
  • To utilize this you simply start with k = 1 (single-item itemsets) and then work your way up from there!

SLIDE 30


The Algorithm

  • Iterative algorithm (also called level-wise search): find all 1-item frequent itemsets, then all 2-item frequent itemsets, and so on.
– In each iteration k, only consider itemsets that contain some frequent (k-1)-itemset.
  • Find frequent itemsets of size 1: F1
  • For k = 2, 3, …:
– Ck = candidates of size k: those itemsets of size k that could be frequent, given Fk-1
– Fk = those candidates that are actually frequent, Fk ⊆ Ck (this needs one scan of the database).
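A compact Python sketch of this level-wise loop (an illustration under the definitions above, not the course's reference code); the demo data is the market-basket table from the earlier slides:

from itertools import combinations

def apriori(transactions, minsup):
    # Level-wise search: F1, then F2 from F1, and so on.
    n = len(transactions)

    def keep_frequent(candidates):
        # One scan of the database per level: keep candidates with support >= minsup.
        return {c for c in candidates
                if sum(1 for t in transactions if c <= t) / n >= minsup}

    Fk = keep_frequent({frozenset([item]) for t in transactions for item in t})
    frequent = set(Fk)
    k = 2
    while Fk:
        # Join: combine frequent (k-1)-itemsets that differ in a single item.
        Ck = {a | b for a in Fk for b in Fk if len(a | b) == k}
        # Prune: a candidate survives only if all its (k-1)-subsets are frequent.
        Ck = {c for c in Ck
              if all(frozenset(s) in Fk for s in combinations(c, k - 1))}
        Fk = keep_frequent(Ck)
        frequent |= Fk
        k += 1
    return frequent

# Demo on the market-basket table from the earlier slides, minsup = 40%.
T = [{"Bread", "Milk"},
     {"Bread", "Diaper", "Beer", "Eggs"},
     {"Milk", "Diaper", "Beer", "Coke"},
     {"Bread", "Milk", "Diaper", "Beer"},
     {"Bread", "Milk", "Diaper", "Coke"}]
for itemset in sorted(apriori(T, 0.4), key=len):
    print(set(itemset))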

SLIDE 31


Apriori candidate generation

  • The candidate-gen function takes Lk-1 and returns a superset (called the candidates) of the set of all frequent k-itemsets. It has two steps:
– join step: generate all possible candidate itemsets Ck of length k
– prune step: remove those candidates in Ck that cannot be frequent.

SLIDE 32

How to Generate Candidates?

  • Suppose the items in Lk-1 are listed in an order
  • Step 1: self-joining Lk-1
– The description below is a bit confusing; all we do is splice two sets together so that only one new item is added (see example)

insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

  • Step 2: pruning

forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
        if (s is not in Lk-1) then delete c from Ck

SLIDE 33

Example of Generating Candidates

  • L3 = {abc, abd, acd, ace, bcd}
  • Self-joining: L3 * L3
– abcd from abc and abd
– acde from acd and ace
  • Pruning:
– acde is removed because ade is not in L3
  • C4 = {abcd}
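A direct Python rendering of this join-and-prune step (a sketch; itemsets are kept as sorted tuples so the "first k-2 items match" join condition is easy to state):

from itertools import combinations

def candidate_gen(Lk_1, k):
    # Join step: merge two (k-1)-itemsets that share their first k-2 items.
    Ck = set()
    for p in Lk_1:
        for q in Lk_1:
            if p[:k - 2] == q[:k - 2] and p[k - 2] < q[k - 2]:
                Ck.add(p + (q[k - 2],))
    # Prune step: delete candidates with a (k-1)-subset not in Lk-1.
    return {c for c in Ck if all(s in Lk_1 for s in combinations(c, k - 1))}

L3 = {("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")}
print(candidate_gen(L3, 4))  # {('a', 'b', 'c', 'd')}; acde is pruned since ade is not in L3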

SLIDE 34


Example – Finding frequent itemsets

Dataset T (minsup = 0.5):

TID   Items
T100  1, 3, 4
T200  2, 3, 5
T300  1, 2, 3, 5
T400  2, 5

(notation: itemset : count)

1. Scan T → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
   → F1: {1}:2, {2}:3, {3}:3, {5}:3
   → C2: {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}
2. Scan T → C2: {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2
   → F2: {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2
   → C3: {2,3,5}
3. Scan T → C3: {2,3,5}:2 → F3: {2,3,5}

SLIDE 35

The Apriori Algorithm — Example (minsup = 30%)

Database D:

TID  Items
100  1, 3, 4
200  2, 3, 5
300  1, 2, 3, 5
400  2, 5

Scan D → C1:
itemset  sup.
{1}      2
{2}      3
{3}      3
{4}      1
{5}      3

L1 (frequent):
itemset  sup.
{1}      2
{2}      3
{3}      3
{5}      3

C2 (join L1 with itself): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D → C2 counts:
itemset  sup
{1 2}    1
{1 3}    2
{1 5}    1
{2 3}    2
{2 5}    3
{3 5}    2

L2 (frequent):
itemset  sup
{1 3}    2
{2 3}    2
{2 5}    3
{3 5}    2

C3 (join L2 with itself): {2 3 5}

Scan D → C3 counts → L3:
itemset  sup
{2 3 5}  2

SLIDE 36


Step 2: Generating rules from frequent itemsets

  • Frequent itemsets → association rules
  • One more step is needed to generate association rules
  • For each frequent itemset X, for each proper nonempty subset A of X:
– Let B = X - A
– A → B is an association rule if Confidence(A → B) ≥ minconf, where

support(A → B) = support(A ∪ B) = support(X)
confidence(A → B) = support(A ∪ B) / support(A)
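A sketch of this step in Python; supports stands in for the itemset-support table recorded during frequent itemset generation, filled here with the numbers from the worked example on the next slide:

from itertools import combinations

def gen_rules(X, supports, minconf):
    # For each proper nonempty subset A of frequent itemset X, emit A -> X - A
    # whenever support(X) / support(A) >= minconf.
    rules = []
    for size in range(1, len(X)):
        for A in combinations(sorted(X), size):
            conf = supports[frozenset(X)] / supports[frozenset(A)]
            if conf >= minconf:
                rules.append((set(A), set(X) - set(A), conf))
    return rules

# Supports recorded during itemset generation (numbers from the next slide).
supports = {frozenset({2, 3, 4}): 0.50,
            frozenset({2, 3}): 0.50, frozenset({2, 4}): 0.50,
            frozenset({3, 4}): 0.75, frozenset({2}): 0.75,
            frozenset({3}): 0.75, frozenset({4}): 0.75}
for A, B, conf in gen_rules({2, 3, 4}, supports, minconf=0.6):
    print(A, "->", B, f"(confidence = {conf:.0%})")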

SLIDE 37


Generating rules: an example

  • Suppose {2,3,4} is frequent, with sup = 50%
– Proper nonempty subsets: {2,3}, {2,4}, {3,4}, {2}, {3}, {4}, with sup = 50%, 50%, 75%, 75%, 75%, 75% respectively
– These generate the following association rules:
  • 2,3 → 4, confidence = 100%
  • 2,4 → 3, confidence = 100%
  • 3,4 → 2, confidence = 67%
  • 2 → 3,4, confidence = 67%
  • 3 → 2,4, confidence = 67%
  • 4 → 2,3, confidence = 67%
  • All rules have support = 50%
SLIDE 38


Generating rules: summary

  • To recap: in order to obtain A → B, we need support(A ∪ B) and support(A)
  • All the information required for confidence computation has already been recorded during itemset generation, so there is no need to scan the data T again.
  • This step is not as time-consuming as frequent itemset generation.

SLIDE 39


On Apriori Algorithm

Seems to be very expensive, but:

  • Level-wise search
  • K = the size of the largest itemset
  • It makes at most K passes over the data
  • In practice, K is bounded (around 10).
  • The algorithm is very fast. Under some conditions, all rules can be found in linear time.
  • It scales up to large data sets.
SLIDE 40

Granularity of items

  • One exception to the "ease" of applying association rules is selecting the granularity of the items.
  • Should you choose:
– diet coke?
– coke product?
– soft drink?
– beverage?
  • Should you include more than one level of granularity? Be careful.
  • (Some association-finding techniques allow you to represent hierarchies explicitly.)

SLIDE 41

Multiple-Level Association Rules

  • Items often form a hierarchy
– Items at the lower level are expected to have lower support
– Rules regarding itemsets at the appropriate levels could be quite useful
– A transaction database can be encoded based on dimensions and levels

[Hierarchy: Food → {Milk, Bread}; Milk → {Skim, 2%}; Bread → {Wheat, White}]

SLIDE 42

Mining Multi-Level Associations

  • A top_down, progressive deepening approach

– First find high-level strong rules:

  • milk bread [20%, 60%]

– Then find their lower-level “weaker” rules:

  • 2% milk wheat bread [6%, 50%]

– When one threshold set for all levels; if support too high then it is possible to miss meaningful associations at low level; if support too low then possible generation of uninteresting rules

  • different minimum support thresholds across multi-levels lead to different algorithms

(e.g., decrease min-support at lower levels)

  • Variations at mining multiple-level association rules

– Level-crossed association rules:

  • milk wonder wheat bread

– Association rules with multiple, alternative hierarchies:

  • 2% milk wonder bread
SLIDE 43

Rule Generation

  • Now that you have the frequent itemsets, you can generate association rules
– Split the frequent itemsets in all possible ways and prune a rule if its confidence is below the min_confidence threshold
  • The rules that are left are called strong rules
– You may be given a rule template that constrains the rules, e.g.:
  • rules with only one item on the right side
  • rules with two items on the left and one on the right, i.e., rules of the form {X, Y} → Z (see the sketch below)
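Applying such a template is just a filter on the generated rules; a minimal sketch (the rules list reuses numbers from the {Milk, Diaper, Beer} slide):

# Sketch: keep only rules with two items on the left and one on the right.
rules = [({"Milk", "Diaper"}, {"Beer"}, 0.67),
         ({"Milk"}, {"Diaper", "Beer"}, 0.50),
         ({"Diaper", "Beer"}, {"Milk"}, 0.67)]
template_rules = [(lhs, rhs, conf) for lhs, rhs, conf in rules
                  if len(lhs) == 2 and len(rhs) == 1]
print(template_rules)  # only the {X, Y} -> Z rules survive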

SLIDE 44

Rules from Previous Example

  • What are the strong rules of the form {X,Y} Z if the

confidence threshold is 75%?

– We start with {2,3,5}

  • {2,3}  5 (confidence = 2/2 = 100%): STRONG
  • {3,5}  2 (confidence = 2/2 = 100%): STRONG
  • {2,5}  3 (confidence = 2/3 = 66%): PRUNE!
  • Note that in general you don’t just look at the

frequent itemsets of maximum length. If we wanted strong rules of form X  Y we would look at C2

SLIDE 45

Interestingness Measurements

  • Objective measures
– Two popular measures:
  • Support
  • Confidence
  • Subjective measures (Silberschatz & Tuzhilin, KDD '95): a rule (pattern) is interesting if
– it is unexpected (surprising to the user), and/or
– actionable (the user can do something with it)

SLIDE 46

Criticism to Support and Confidence

  • Example 1:
– Among 5000 students:
  • 3000 play basketball
  • 3750 eat cereal
  • 2000 both play basketball and eat cereal
– play basketball → eat cereal [40%, 66.7%] is misleading because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
– play basketball → not eat cereal [20%, 33.3%] is far more interesting, although it has lower support and confidence.


            basketball  not basketball  sum (row)
cereal         2000          1750          3750
not cereal     1000           250          1250
sum (col.)     3000          2000          5000

Lift of A ⇒ B = P(B|A) / P(B); a rule is interesting if its lift is not near 1.0.
What is the lift of the rule "play basketball → not eat cereal"?
(1/3) / (1250/5000) = 1.33
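A sketch of that lift computation, using the counts from the table:

# Sketch: lift of "play basketball -> not eat cereal" from the table above.
p_not_cereal_given_basketball = 1000 / 3000  # P(not cereal | basketball) = 1/3
p_not_cereal = 1250 / 5000                   # P(not cereal) = 0.25
print(p_not_cereal_given_basketball / p_not_cereal)  # lift = 1.33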

SLIDE 47

Customer Number vs. Transaction ID

  • In the homework you may have a problem where there is a customer id for each transaction
– You may be asked to do association analysis based on the customer id
  • If this is so, you need to aggregate the transactions to the customer level, as in the sketch below
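One simple way to do that aggregation (a sketch; the customer ids and items are made up for illustration):

# Sketch: union each customer's transactions into one customer-level basket.
transactions = [("c1", {"Bread", "Milk"}),
                ("c2", {"Beer"}),
                ("c1", {"Diaper"})]
baskets = {}
for customer, items in transactions:
    baskets.setdefault(customer, set()).update(items)
print(baskets)  # {'c1': {'Bread', 'Milk', 'Diaper'}, 'c2': {'Beer'}}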

SLIDE 48

Market-basket analysis and finding associations

  • Do items occur together? (more than I might expect)
  • Why might I care?

– merchandising

  • e.g., placing products in a retail space (physical or electronic), catalog design
  • packaging optional services

– recommendations

  • cross-selling and up-selling opportunities
  • mining credit-card data
  • developing customer loyalty and self-investment

– fraud detection

  • e.g., in insurance data, a doctor very often works on the cases of a particular lawyer

– simply understanding my business

  • are there “investment profiles” of my clients?
  • customer segmentation based on buying behavior
  • is anything strange going on?
SLIDE 49

Virtual items

  • If you're interested in including other possible variables, you can create "virtual items":
gift-wrap, used-coupon, new-store, winter-holidays, bought-nothing, …

SLIDE 50

Associations: Pros and Cons

  • Pros

– can quickly mine patterns describing business/customers/etc. without major effort in problem formulation
– virtual items allow much flexibility
– an unparalleled tool for hypothesis generation

  • Cons

– unfocused

  • not clear exactly how to apply mined “knowledge”
  • only hypothesis generation

– can produce many, many rules!

  • may only be a few nuggets among them (or none)
SLIDE 51

Association Rules

  • Association rule types:

– Actionable Rules – contain high-quality, actionable information
– Trivial Rules – information already well known by those familiar with the business
– Inexplicable Rules – have no explanation and do not suggest action

  • Trivial and Inexplicable Rules occur most often