

Slide 1

Chapter 4: Frequent Itemsets and Association Rules
Jilles Vreeken
5 Nov 2015

Revision 1, November 9th: notation clarified, chi-square clarified
Revision 2, November 10th: details added to derivability example
Revision 3, November 12th: typo fixed in Pearson correlation

Slide 2

Recall the Question of the Week

How can we mine interesting patterns and useful rules from data?

Slide 3

IRDM Chapter 4, today

1. Definitions
2. Algorithms for Frequent Itemset Mining
3. Association Rules and Interestingness
4. Summarising Collections of Itemsets

You’ll find this covered in Aggarwal, Chapter 4, 5.2; Zaki & Meira, Ch. 10, 11

Slide 4

Chapter 4.3: Association Rules

Slide 5

IRDM Chapter 4.3

1. Generating Association Rules
2. Measures of Interestingness
3. Properties of Measures
4. Simpson’s Paradox

You’ll find this covered in Aggarwal, Chapter 4; Zaki & Meira, Ch. 10

Slide 6

Generating Association Rules

We can generate association rules from frequent itemsets

 if $Z$ is a frequent itemset and $X \subset Z$ is a proper subset, we have the rule $X \to Y$, where $Y = Z \setminus X$

These rules are frequent, because $supp(X \to Y) = supp(X \cup Y) = supp(Z)$

 we still need to compute the confidence, as $conf(X \to Y) = \frac{supp(Z)}{supp(X)}$

This means that if rule $X \to Z \setminus X$ is not confident, no rule $W \to Z \setminus W$ with $W \subseteq X$ is confident

 we can use this to prune the search space

Slide 7

Pseudo-code

(Algorithm 8.6 in Zaki & Meira)
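The algorithm itself appears only as an image in the original deck. As a stand-in, here is a minimal Python sketch of level-wise rule generation with confidence-based pruning in the spirit of Algorithm 8.6; the `supports` dictionary (mapping each frequent itemset to its support) is an assumed input, e.g. the output of Apriori.

```python
from itertools import combinations

def generate_rules(supports, minconf):
    """Generate all confident rules X -> Z\\X from frequent itemsets.

    supports: dict mapping frozenset -> support, containing every
    frequent itemset; minconf: minimum confidence threshold.
    """
    rules = []
    for Z in supports:
        if len(Z) < 2:
            continue
        # level-wise over antecedents, largest first
        frontier = {frozenset(X) for X in combinations(Z, len(Z) - 1)}
        while frontier:
            next_frontier = set()
            for X in frontier:
                conf = supports[Z] / supports[X]
                if conf >= minconf:
                    rules.append((set(X), set(Z - X), conf))
                    # prune: if X -> Z\X is not confident, no rule
                    # W -> Z\W with W ⊆ X can be confident either
                    if len(X) > 1:
                        next_frontier.update(
                            frozenset(W) for W in combinations(X, len(X) - 1))
            frontier = next_frontier
    return rules

# toy supports (illustrative numbers, not from the deck)
supports = {frozenset('A'): 4, frozenset('B'): 6, frozenset('AB'): 4}
print(generate_rules(supports, minconf=0.8))  # [({'A'}, {'B'}, 1.0)]
```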

Slide 8

Measures of Interestingness

Consider the following example: rule {Tea} → {Coffee} has 15% support and 75% confidence

 reasonably good numbers

Is this a good rule?

 the overall fraction of coffee drinkers is 80%, so drinking tea reduces the probability of drinking coffee!

          Coffee   Not Coffee      ∑
 Tea         150           50    200
 Not Tea     650          150    800
 ∑           800          200   1000

Slide 9

Problems with confidence

The support-confidence framework does not take the support of the consequent into account

 rules with relatively small support for the antecedent and high support for the consequent often have high confidence

To fix this, many other measures have been proposed. Most measures are easy to express using contingency tables. We’ll use $s_{jk}$ as shorthand for support: $s_{11} = supp(AB)$, $s_{01} = supp(\neg A\,B)$, … Analogously, we write $f_{jk}$ for frequency: $f_{11} = fr(AB)$, $f_{01} = fr(\neg A\,B)$, …

        B     ¬B      ∑
 A     s11   s10   s1+
 ¬A    s01   s00   s0+
 ∑     s+1   s+0     N

(revised on Nov 9th, now using $s_{jk}$ notation to more clearly indicate support)

Slide 10

Statistical Coefficient of Correlation

A natural statistical measure between a pair of items is the Pearson correlation coefficient

$$\rho = \frac{E[XY] - E[X]\,E[Y]}{\sigma_X\,\sigma_Y} = \frac{E[XY] - E[X]\,E[Y]}{\sqrt{E[X^2] - E[X]^2}\,\sqrt{E[Y^2] - E[Y]^2}}$$

Slide 11

Pearson Correlation of Items

For items $A$ and $B$ it reduces to

$$\rho_{AB} = \frac{f_{11} - f_{1+}\,f_{+1}}{\sqrt{f_{1+}\,f_{+1}\,(1 - f_{1+})(1 - f_{+1})}}$$

It is +1 when the data is perfectly positively correlated, −1 when perfectly negatively correlated, and 0 when uncorrelated.

(revised on November 12th; typo fixed, as $f_{11}$ should be inside the numerator)
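As a quick sanity check (the helper below is mine, not from the deck), we can plug the tea/coffee table from slide 8 into this formula:

```python
import math

def pearson_items(s11, s1p, sp1, n):
    """Pearson correlation of two items from a 2x2 contingency table,
    using frequencies f = s / N."""
    f11, f1p, fp1 = s11 / n, s1p / n, sp1 / n
    return (f11 - f1p * fp1) / math.sqrt(f1p * fp1 * (1 - f1p) * (1 - fp1))

# Tea/Coffee: s11 = 150, s1+ = 200, s+1 = 800, N = 1000
print(pearson_items(150, 200, 800, 1000))  # -0.0625: slight negative correlation
```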

Slide 12

Chi-square

$\chi^2$ is another natural statistical measure of significance for itemsets. For a set of $k$ items, it compares the observed frequencies against the expected frequencies of all $2^k$ possible states:

$$\chi^2(X) = \sum_{Z \in \mathcal{P}(X)} \frac{\left(fr(Z) - E_X[fr(Z)]\right)^2}{E_X[fr(Z)]}$$

where $\mathcal{P}(X)$ is the powerset of $X$ and $E_X[fr(Z)]$ is the expected frequency of state $Z$ over itemset $X$.

For example, for $X = \{\text{beer}, \text{diapers}\}$ it considers the states $\{\text{beer}, \text{diapers}\}$, $\{\neg\text{beer}, \text{diapers}\}$, $\{\text{beer}, \neg\text{diapers}\}$, and $\{\neg\text{beer}, \neg\text{diapers}\}$.

(Brin et al. 1998, 1.6k+ cites)
(revised on Nov 9th, now using $E_X[fr(Z)]$ to more clearly indicate the expectation is of state $Z$ over itemset $X$)

Slide 13

Chi-square (2)

To compute $\chi^2(X)$ we need to define $E_X[fr(Z)]$. The standard way is to assume independence between the items of $X$; that is, the probability of a state $Z$ is the product of its individual item frequencies:

$$E_X[fr(Z)] = \prod_{A \in Z} fr(A) \prod_{A \in X \setminus Z} \left(1 - fr(A)\right)$$

The first product is over the items that are present in $Z$ (the 1s); for these, the empirical probability is simply $fr(\cdot)$. The second product considers the 0s of $Z$, in other words the items of $X$ not in $Z$; the empirical probability of not seeing an item $A$ is $(1 - fr(A))$.

Note! Independence between items is a very strong assumption, and hence we will find that many itemsets are ‘significantly’ correlated.

(revised on Nov 9th, now using $E_X[fr(Z)]$ notation, added more explanation)
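To make the definition concrete, here is a small illustrative Python sketch (the transaction representation and helper names are assumptions of mine, not from the deck):

```python
from itertools import combinations

def chi_square(itemset, transactions):
    """Chi-square of an itemset against the item-independence model.

    transactions: list of sets of items; frequencies are empirical,
    following the slide's definition (sum over all 2^k states).
    """
    n = len(transactions)
    fr = lambda item: sum(1 for t in transactions if item in t) / n

    items = list(itemset)
    score = 0.0
    for r in range(len(items) + 1):
        for present in combinations(items, r):
            absent = [a for a in items if a not in present]
            # observed frequency of the state: 1s present, 0s absent
            observed = sum(1 for t in transactions
                           if all(a in t for a in present)
                           and not any(a in t for a in absent)) / n
            # expected frequency assuming independent items
            expected = 1.0
            for a in present:
                expected *= fr(a)
            for a in absent:
                expected *= 1 - fr(a)
            if expected > 0:
                score += (observed - expected) ** 2 / expected
    return score

transactions = [{'beer', 'diapers'}, {'beer'}, {'diapers'}, {'beer', 'diapers'}]
print(chi_square({'beer', 'diapers'}, transactions))
```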

Slide 14

Chi-square (3)

$$\chi^2(X) = \sum_{Z \in \mathcal{P}(X)} \frac{\left(fr(Z) - E_X[fr(Z)]\right)^2}{E_X[fr(Z)]}$$

Chi-square scores close to 0 indicate statistical independence, while larger values indicate stronger dependencies.

 no differentiation between positive and negative correlation
 it is computationally costly, at $O(2^{|X|})$
 but as it is upward closed, we can mine interesting sets efficiently

Always be thoughtful of how you define your expected frequency!

(revised on Nov 9th, now using $E_X[fr(Z)]$ notation, added more explanation)

Slide 15

Interest Ratio

The interest ratio $I$ of rule $A \to B$ is

$$I(A, B) = \frac{N \times supp(AB)}{supp(A) \times supp(B)} = \frac{N\,s_{11}}{s_{1+}\,s_{+1}}$$

 it is equivalent to $\text{lift} = \frac{conf(A \to B)}{fr(B)}$

The interest ratio compares the frequencies against the assumption that $A$ and $B$ are independent

 if $A$ and $B$ are independent, $s_{11} = \frac{s_{1+}\,s_{+1}}{N}$

Interpreting interest ratios

 $I(A, B) = 1$ if $A$ and $B$ are independent
 $I(A, B) > 1$ if $A$ and $B$ are positively correlated
 $I(A, B) < 1$ if $A$ and $B$ are negatively correlated

($f_{jk}$ changed into $s_{jk}$ in revision 1)

Slide 16

The Cosine Measure

The cosine, or $IS$, measure of rule $A \to B$ is defined as

$$cosine(A, B) = \sqrt{I(A, B) \times supp(AB)/N} = \frac{s_{11}}{\sqrt{s_{1+} \times s_{+1}}}$$

which is the regular cosine if we think of $A$ and $B$ as binary vectors.

It is also the geometric mean of the confidences of $A \to B$ and $B \to A$, as

$$cosine(A, B) = \sqrt{\frac{supp(AB)}{supp(A)} \times \frac{supp(AB)}{supp(B)}} = \sqrt{conf(A \to B) \times conf(B \to A)}$$

($f_{jk}$ changed into $s_{jk}$ in revision 1)

Slide 17

Examples (1)

The interest ratio of {Tea} → {Coffee} is

$$\frac{1000 \times 150}{800 \times 200} = 0.9375$$

 almost 1, so not very interesting; below 1, so (slight) negative correlation

The $cosine$ of this rule, however, is 0.375

 quite far from 0, so it is interesting.

          Coffee   Not Coffee      ∑
 Tea         150           50    200
 Not Tea     650          150    800
 ∑           800          200   1000
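The two numbers on this slide are easy to reproduce; a small sketch (helper names are mine):

```python
import math

def interest(s11, s1p, sp1, n):
    """Interest ratio I(A,B) = N*s11 / (s1+ * s+1)."""
    return n * s11 / (s1p * sp1)

def cosine(s11, s1p, sp1):
    """Cosine measure s11 / sqrt(s1+ * s+1)."""
    return s11 / math.sqrt(s1p * sp1)

# Tea -> Coffee: s11 = 150, s1+ = 200, s+1 = 800, N = 1000
print(interest(150, 200, 800, 1000))  # 0.9375
print(cosine(150, 200, 800))          # 0.375
```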

Slide 18

Examples (2)

$I(p, q) = 1.02$ and $I(r, t) = 4.08$

 $p$ and $q$ are close to independent
 $r$ and $t$ have the highest interest factor

Now $conf(p \to q) = 0.946$ and $conf(r \to t) = 0.286$

        p    ¬p     ∑
 q     880    50   930
 ¬q     50    20    70
 ∑     930    70  1000

        r    ¬r     ∑
 t      20    50    70
 ¬t     50   880   930
 ∑      70   930  1000

But $p$ and $q$ appear together in 88% of cases, while $r$ and $t$ appear together only seldom

(revised on Nov 9th, now using $t$ instead of $s$ to avoid confusion with support notation)

Slide 19

Examples (2), continued

Bottom line: Lunch is not free. There is no single measure that works well all the time.

Slide 20

Measures for pairs of itemsets

(after Tan, Steinbach, Kumar, Table 6.12)
(revised on Nov 9th, now using $s_{jk}$ notation to more clearly indicate support)

 Measure (symbol)            Definition
 Correlation ($\phi$)        $(N s_{11} - s_{1+} s_{+1}) \,/\, \sqrt{s_{1+} s_{+1} s_{0+} s_{+0}}$
 Odds ratio ($\alpha$)       $(s_{11} s_{00}) \,/\, (s_{10} s_{01})$
 Kappa ($\kappa$)            $(N s_{11} + N s_{00} - s_{1+} s_{+1} - s_{0+} s_{+0}) \,/\, (N^2 - s_{1+} s_{+1} - s_{0+} s_{+0})$
 Interest ($I$)              $(N s_{11}) \,/\, (s_{1+} s_{+1})$
 Cosine ($IS$)               $s_{11} \,/\, \sqrt{s_{1+} s_{+1}}$
 Piatetsky-Shapiro ($PS$)    $s_{11}/N - (s_{1+} s_{+1})/N^2$
 Collective strength ($S$)   $\frac{s_{11} + s_{00}}{s_{1+} s_{+1} + s_{0+} s_{+0}} \times \frac{N - s_{1+} s_{+1} - s_{0+} s_{+0}}{N - s_{11} - s_{00}}$
 Jaccard ($\zeta$)           $s_{11} \,/\, (s_{1+} + s_{+1} - s_{11})$
 All-confidence ($h$)        $\min\!\left(s_{11}/s_{1+},\; s_{11}/s_{+1}\right)$
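Most of these definitions are one-liners; a sketch (function and key names are mine) that computes a few of them from the four cells of a 2×2 contingency table:

```python
import math

def pair_measures(s11, s10, s01, s00):
    """A handful of the pair measures, from a 2x2 contingency table."""
    n = s11 + s10 + s01 + s00
    s1p, s0p = s11 + s10, s01 + s00   # row sums
    sp1, sp0 = s11 + s01, s10 + s00   # column sums
    return {
        'correlation':       (n * s11 - s1p * sp1) / math.sqrt(s1p * sp1 * s0p * sp0),
        'odds_ratio':        (s11 * s00) / (s10 * s01),
        'interest':          n * s11 / (s1p * sp1),
        'cosine':            s11 / math.sqrt(s1p * sp1),
        'piatetsky_shapiro': s11 / n - s1p * sp1 / n ** 2,
        'jaccard':           s11 / (s1p + sp1 - s11),
        'all_confidence':    min(s11 / s1p, s11 / sp1),
    }

# Tea/Coffee table: s11=150, s10=50, s01=650, s00=150
print(pair_measures(150, 50, 650, 150))
```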

Slide 21

Measures for association rules

(after Tan, Steinbach, Kumar, Table 6.12)
(revised on Nov 9th, now using $s_{jk}$ notation to more clearly indicate support)

 Measure (symbol)             Definition
 Goodman-Kruskal ($\lambda$)  $\left(\sum_j \max_k s_{jk} - \max_k s_{+k}\right) / \left(N - \max_k s_{+k}\right)$
 Mutual information ($M$)     $\left(\sum_i \sum_j \frac{s_{ij}}{N} \log \frac{N s_{ij}}{s_{i+} s_{+j}}\right) / \left(-\sum_i \frac{s_{i+}}{N} \log \frac{s_{i+}}{N}\right)$
 J-measure ($J$)              $\frac{s_{11}}{N} \log \frac{N s_{11}}{s_{1+} s_{+1}} + \frac{s_{10}}{N} \log \frac{N s_{10}}{s_{1+} s_{+0}}$
 Gini index ($G$)             $\frac{s_{1+}}{N} \left[\left(\frac{s_{11}}{s_{1+}}\right)^2 + \left(\frac{s_{10}}{s_{1+}}\right)^2\right] - \left(\frac{s_{+1}}{N}\right)^2 + \frac{s_{0+}}{N} \left[\left(\frac{s_{01}}{s_{0+}}\right)^2 + \left(\frac{s_{00}}{s_{0+}}\right)^2\right] - \left(\frac{s_{+0}}{N}\right)^2$
 Laplace ($L$)                $(s_{11} + 1) \,/\, (s_{1+} + 2)$
 Conviction ($V$)             $(s_{1+} s_{+0}) \,/\, (N s_{10})$
 Certainty factor ($F$)       $\left(\frac{s_{11}}{s_{1+}} - \frac{s_{+1}}{N}\right) / \left(1 - \frac{s_{+1}}{N}\right)$
 Added value ($AV$)           $s_{11}/s_{1+} - s_{+1}/N$

Slide 22

Properties of Measures

Most measures do not agree on how they rank itemset pairs or rules. To understand how they behave, we need to study their properties

 measures that share some properties behave similarly under that property’s conditions

Slide 23

Three properties

A measure has the inversion property if its value stays the same when we exchange $s_{11}$ with $s_{00}$ and $s_{01}$ with $s_{10}$

 the measure is invariant to flipping bits – it is bit-symmetric

A measure has the null addition property if it is not affected by increasing $s_{00}$ while the other values stay constant

 the measure is invariant to adding new transactions that have an empty intersection with the itemset

A measure has the scaling invariance property if it is not affected by replacing the values $s_{11}, s_{10}, s_{01}, s_{00}$ with $k_1 k_3 s_{11},\ k_2 k_3 s_{10},\ k_1 k_4 s_{01},\ k_2 k_4 s_{00}$

 where all $k_i$ are positive constants

(revised on Nov 9th, now using $s_{jk}$ notation to more clearly indicate support)
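These properties are easy to probe numerically. A small sketch using the tea/coffee table: cosine never touches $s_{00}$ and thus has the null addition property, while the interest ratio does not.

```python
import math

def cosine(s11, s10, s01, s00):
    return s11 / math.sqrt((s11 + s10) * (s11 + s01))

def interest(s11, s10, s01, s00):
    n = s11 + s10 + s01 + s00
    return n * s11 / ((s11 + s10) * (s11 + s01))

base   = (150, 50, 650, 150)          # tea/coffee table
nulled = (150, 50, 650, 150 + 9000)   # null addition: only s00 grows

print(cosine(*base), cosine(*nulled))      # 0.375 0.375 -> unaffected
print(interest(*base), interest(*nulled))  # 0.9375 9.375 -> affected
```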

Slide 24

Which properties hold?

(Tan, Steinbach, Kumar, Table 6.17)

Slide 25

Simpson’s Paradox

Consider this data on sales of HDTVs and exercise machines

{HDTV} → {Exercise machine} has confidence 0.55
{¬HDTV} → {Exercise machine} has confidence 0.45

Customers who buy HDTVs are more likely to also buy an exercise machine than those who don’t buy HDTVs

            Exercise Machine   No Exercise Machine     ∑
 HDTV              99                   81            180
 No HDTV           54                   66            120
 ∑                153                  147            300

Slide 26

Deeper Analysis

For college students

 $conf(\text{HDTV} \to \text{Exerc. mach.}) = 0.10$
 $conf(\neg\text{HDTV} \to \text{Exerc. mach.}) = 0.118$

For working adults

 $conf(\text{HDTV} \to \text{Exerc. mach.}) = 0.577$
 $conf(\neg\text{HDTV} \to \text{Exerc. mach.}) = 0.581$

                    Exerc. mach.
 Group     HDTV     Yes    No     ∑
 College   Yes        1     9    10
           No         4    30    34
 Working   Yes       98    72   170
           No        50    36    86

Within each group, buying an exercise machine is not made more likely by buying an HDTV!
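A short sketch that reproduces the reversal (the data layout is mine; the numbers are the slide's):

```python
def conf(with_consequent, antecedent_total):
    """Confidence of X -> Y: supp(XY) / supp(X)."""
    return with_consequent / antecedent_total

# per group: (HDTV & ex.mach., all HDTV, no-HDTV & ex.mach., all no-HDTV)
groups = {'college': (1, 10, 4, 34), 'working': (98, 170, 50, 86)}

# within each stratum, HDTV buyers are LESS likely to buy an exercise machine
for name, (a, at, b, bt) in groups.items():
    print(name, round(conf(a, at), 3), '<', round(conf(b, bt), 3))

# pooled over both strata, the inequality flips: 0.55 > 0.45
a  = sum(g[0] for g in groups.values())
at = sum(g[1] for g in groups.values())
b  = sum(g[2] for g in groups.values())
bt = sum(g[3] for g in groups.values())
print('pooled', conf(a, at), '>', conf(b, bt))
```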

Slide 27

The paradox, and why it happens

In the combined data, HDTVs and exercise machines correlate positively. In the stratified data, they correlate negatively.

 this is Simpson’s paradox

The explanation

 most customers were working adults
 they also bought most HDTVs and exercise machines
 in the combined data this increased the correlation between HDTVs and exercise machines

Moral of the story: Stratify your data properly!

Slide 28

Chapter 4.4: Summarising Collections of Itemsets

Slide 29

IRDM Chapter 4.4

1. The Pattern Explosion
2. Maximal and closed frequent itemsets
3. Non-derivable frequent itemsets

You’ll find this covered in Aggarwal, Chapter 5.2; Zaki & Meira, Ch. 11 (non-derivable only here)

Slide 30

The Pattern Flood

Consider the following table (items A–H):

 tid 1:  ✔ ✔ ✔ ✔ ✔
 tid 2:  ✔ ✔ ✔ ✔ ✔ ✔
 tid 3:  ✔ ✔ ✔ ✔ ✔ ✔
 tid 4:  ✔ ✔ ✔ ✔ ✔ ✔
 tid 5:  ✔ ✔ ✔ ✔ ✔
 tid 6:  ✔ ✔ ✔ ✔ ✔
 tid 7:  ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔

How many itemsets with minimum frequency of 1/7?

 255 (!) – transaction 7 contains all eight items, so every one of the $2^8 - 1 = 255$ non-empty itemsets occurs at least once

How many with minimum frequency of 1/2?

 31 (!)

“The goal of data mining is … to summarize the data”

 Hardly a summary!

Slide 31

The Pattern Explosion

This phenomenon is called the pattern explosion. For high thresholds you find only few patterns

 they only describe common knowledge

For lower thresholds you find enormously many patterns

 all potentially interesting
 many represent noise, and many will be highly redundant
 orders of magnitude more patterns than there are rows in the data

Slide 32

Curbing the Explosion

There exist two main approaches

 frequent pattern summarisation
  summarise the complete set of frequent patterns
  impose a stricter local criterion for individual patterns that removes locally redundant patterns, e.g. closed frequent, maximal frequent
  mine all patterns that satisfy this criterion

 pattern set mining
  improve by imposing a global criterion on the complete result, e.g. shortest description of the data, minimal overlap, maximal entropy
  mine that set of patterns that is optimal with regard to this criterion
  this way we can globally control noise and redundancy

Slide 33

Maximal frequent itemsets

Let $\mathcal{F}$ be the collection of all frequent itemsets for data $D$. Itemset $X \in \mathcal{F}$ is maximal if it has no frequent supersets

 i.e. for all $Z \supset X$: $fr(Z) < minfreq$

With the set of all maximal frequent itemsets we can reconstruct all elements of $\mathcal{F}$

 $X$ is frequent if and only if there exists a maximal frequent itemset $M$ such that $X \subseteq M$
 this is a lossy representation: it does not tell us what the frequency of $X$ is

(Bayardo, 1998, 1.7k cites)

Slide 34

Example of maximal frequent itemsets

[Figure: itemset lattice with the maximal frequent itemsets highlighted; one frequent itemset is marked as not maximal because one of its supersets is frequent]

Slide 35

Closed frequent itemsets

Let $\mathcal{F}$ be the collection of all frequent itemsets for data $D$. Itemset $X \in \mathcal{F}$ is closed if all its supersets are less frequent

 i.e. for all $Z \supset X$: $fr(Z) < fr(X)$
 all maximal itemsets are also closed itemsets

Given the set of all frequent closed itemsets, we can reconstruct all elements of $\mathcal{F}$, including their frequency

 $X$ is frequent if it is a subset of a frequent closed itemset
 $supp(X) = \max\{supp(Z) : X \subseteq Z,\ Z \text{ is frequent and closed}\}$

(Pasquier et al. 1999, 1.5k cites)

Slide 36

Why “closed”?

Consider the following functions

 $t(X)$ returns all transactions that contain itemset $X$
 $i(T)$ returns all items that are contained in every transaction in $T$

The closure function $c(X)$ maps itemsets to itemsets by

$$c(X) = i \circ t\,(X) = i(t(X))$$

The closure function satisfies the following properties

 extensive: $X \subseteq c(X)$
 monotonic: if $X \subseteq Y$, then $c(X) \subseteq c(Y)$
 idempotent: $c(c(X)) = c(X)$

Itemset $X$ is closed if and only if $X = c(X)$
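The two functions and the closure take only a few lines; a sketch on toy data of my own:

```python
def t(X, transactions):
    """Ids of all transactions that contain every item of X."""
    return {tid for tid, items in transactions.items() if X <= items}

def i(tids, transactions):
    """Items contained in every one of the given transactions."""
    sets = [transactions[tid] for tid in tids]
    return set.intersection(*sets) if sets else set()

def closure(X, transactions):
    """c(X) = i(t(X))."""
    return i(t(X, transactions), transactions)

transactions = {1: {'a', 'b', 'c'}, 2: {'a', 'b', 'c'}, 3: {'b', 'c'}}
print(closure({'a', 'b'}, transactions))  # {'a','b','c'}: {a,b} is not closed
print(closure({'b', 'c'}, transactions))  # {'b','c'}: closed, c(X) = X
```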

Slide 37

Example of closed frequent itemsets

[Figure: itemset lattice over example transactions; itemset {a, b} is contained in transactions 1 and 2; some itemsets are marked “closed and maximal”, others “closed, but not maximal”]

Slide 38

Itemset taxonomy

[Figure: nested collections, with maximal frequent itemsets inside closed frequent itemsets inside all frequent itemsets]

Slide 39

Mining maximal and closed itemsets

Frequent maximal and closed itemsets can be found by post-processing the set of all frequent itemsets.

To find the maximal itemsets:

 start with an empty set of candidate maximal itemsets $\mathcal{M}$
 for each frequent itemset $X \in \mathcal{F}$
  if a superset of $X$ is in $\mathcal{M}$, continue
  else insert $X$ into $\mathcal{M}$ and remove all subsets of $X$ from $\mathcal{M}$
 return set $\mathcal{M}$
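This post-processing loop translates directly to code; a sketch (the function name is mine):

```python
def maximal_itemsets(frequent):
    """Filter a collection of frequent itemsets down to the maximal ones."""
    maximal = set()
    for X in map(frozenset, frequent):
        if any(X < M for M in maximal):
            continue                                  # a superset is already kept
        maximal = {M for M in maximal if not M < X}   # drop subsets of X
        maximal.add(X)
    return maximal

frequent = [{'a'}, {'b'}, {'c'}, {'a', 'b'}, {'b', 'c'}, {'a', 'b', 'c'}, {'d'}]
print(maximal_itemsets(frequent))
# {frozenset({'a', 'b', 'c'}), frozenset({'d'})}
```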

Slide 40

Mining maximal and closed itemsets

Closed itemsets can be found from the frequent itemsets by computing their closures

 this can be very time consuming

The Charm algorithm avoids testing all frequent itemsets by using the following properties

 if $t(X) = t(Y)$, then $c(X) = c(Y) = c(X \cup Y)$
  we can replace $X$ with $X \cup Y$ and prune $Y$
 if $t(X) \subset t(Y)$, then $c(X) \neq c(Y)$, but $c(X) = c(X \cup Y)$
  we can replace $X$ with $X \cup Y$, but not prune $Y$
 if $t(X) \neq t(Y)$, then $c(X) \neq c(Y) \neq c(X \cup Y)$
  we cannot prune anything

(Zaki et al, 1999, 194 cites)

Slide 41

Non-derivable frequent itemsets

Let $\mathcal{F}$ be the set of all frequent itemsets. Itemset $X \in \mathcal{F}$ is non-derivable if we cannot derive its support from its subsets

 we can derive the support of $X$ if, knowing the supports of all subsets of $X$, we can compute the support of $X$ exactly

If $X$ is derivable, it does not add any new information

 knowing just the non-derivable frequent itemsets, we can reconstruct every frequent itemset, including its frequency
 we only return itemsets that add new information on top of what we already knew

(Calders & Goethals, 2004, 121 citations)

Slide 42

Support of a generalised itemset

A generalised itemset is an itemset of the form $X\bar{Y}$

 all items in $X$ and none of the items in $Y$

The support of a generalised itemset $X\bar{Y}$ is the number of transactions that contain all items in $X$ but no items in $Y$.

To compute the support of the generalised itemset $A\bar{B}\bar{C}$ we

 take the support of $A$
 remove the supports of $AB$ and $AC$
 add the support of $ABC$, which was removed twice
 $supp(A\bar{B}\bar{C}) = supp(A) - supp(AB) - supp(AC) + supp(ABC)$

Slide 43

Generalised Itemsets

[Figure: Venn diagram over items A, B, C, showing the eight states $ABC$, $AB\bar{C}$, $A\bar{B}C$, $A\bar{B}\bar{C}$, $\bar{A}BC$, $\bar{A}B\bar{C}$, $\bar{A}\bar{B}C$, $\bar{A}\bar{B}\bar{C}$]

Slide 44

The Inclusion-Exclusion Principle

Let $X\bar{Y}$ be a generalised itemset and let $I = X \cup Y$. Then $supp(X\bar{Y})$ can be expressed as a combination of the supports of itemsets $J \supseteq X$ with $J \subseteq I$, using the inclusion-exclusion principle:

$$supp(X\bar{Y}) = \sum_{X \subseteq J \subseteq I} (-1)^{|J \setminus X|}\, supp(J)$$

For example,

$$supp(\bar{A}\bar{B}\bar{C}) = supp(\emptyset) - supp(A) - supp(B) - supp(C) + supp(AB) + supp(AC) + supp(BC) - supp(ABC)$$
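The formula is easy to check mechanically; a sketch (toy data and helper names are mine) that compares it against a direct count:

```python
from itertools import combinations

def supp(items, transactions):
    """Plain support: number of transactions containing all of `items`."""
    return sum(1 for t in transactions if set(items) <= t)

def generalised_supp(X, Y, transactions):
    """supp(X Ybar) via inclusion-exclusion over subsets of Y."""
    total = 0
    for r in range(len(Y) + 1):
        for extra in combinations(sorted(Y), r):
            total += (-1) ** r * supp(set(X) | set(extra), transactions)
    return total

transactions = [{'a', 'b'}, {'a'}, {'a', 'c'}, {'b', 'c'}, {'a', 'b', 'c'}]
X, Y = {'a'}, {'b', 'c'}
direct = sum(1 for t in transactions if X <= t and not (Y & t))
print(generalised_supp(X, Y, transactions), direct)  # 1 1
```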

Slide 45

Support Bounds

The inclusion-exclusion formula gives us bounds for the support of itemsets in $X \cup Y$ that are supersets of $X$

 all supports are non-negative!
 $supp(A\bar{B}\bar{C}) = supp(A) - supp(AB) - supp(AC) + supp(ABC) \geq 0$ implies $supp(ABC) \geq -supp(A) + supp(AB) + supp(AC)$
 this is a lower bound, but we can also get upper bounds

In general, the bounds for itemset $I$ w.r.t. $X \subseteq I$ are

 if $|I \setminus X|$ is odd: $\; supp(I) \leq \sum_{X \subseteq J \subset I} (-1)^{|I \setminus J| + 1}\, supp(J)$
 if $|I \setminus X|$ is even: $\; supp(I) \geq \sum_{X \subseteq J \subset I} (-1)^{|I \setminus J| + 1}\, supp(J)$

Slide 46

Deriving the Support

Given the formula for the bounds, we can define

 the least upper bound $ub(I)$, and
 the greatest lower bound $lb(I)$ for itemset $I$

We know that $supp(I) \in [lb(I), ub(I)]$. If $lb(I) = ub(I)$, then we can compute $supp(I)$ knowing only the supports of the subsets of $I$

 we say $I$ is derivable
 otherwise, $I$ is non-derivable

Slide 47

Example deriving support – blackboard

Question: is itemset $ABC$ derivable?

 tid   A  B  C  D  E
  1    1  1  1  1  ·
  2    ·  1  1  1  ·
  3    1  1  ·  1  1
  4    1  1  ·  1  1
  5    1  1  1  1  1
  6    ·  1  1  ·  1

Slide 48

Example deriving support – blackboard

Lower bounds:

$$supp(ABC) \geq 0$$
$$supp(ABC) \geq s_{AB} + s_{AC} - s_{A} = 4 + 2 - 4 = 2$$
$$supp(ABC) \geq s_{AC} + s_{BC} - s_{C} = 2 + 4 - 4 = 2$$
$$supp(ABC) \geq s_{AB} + s_{BC} - s_{B} = 4 + 4 - 6 = 2$$
$$lb(ABC) = \max\{2, 2, 2, 0\} = 2$$

Upper bounds:

$$supp(ABC) \leq s_{AB} = 4, \quad supp(ABC) \leq s_{AC} = 2, \quad supp(ABC) \leq s_{BC} = 4$$
$$supp(ABC) \leq s_{AB} + s_{AC} + s_{BC} - s_{A} - s_{B} - s_{C} + s_{\emptyset} = 4 + 2 + 4 - 4 - 6 - 4 + 6 = 2$$
$$ub(ABC) = \min\{4, 2, 4, 2\} = 2$$

$lb(ABC) = ub(ABC) = 2$, and hence $ABC$ is derivable.
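The whole derivation can be automated; a sketch (the `bounds` helper is mine) that reproduces $lb = ub = 2$ from the subset supports used above:

```python
from itertools import combinations

def bounds(I, supp):
    """Lower and upper bounds on supp(I) from its proper subsets,
    via the inclusion-exclusion bounds (X ranges over subsets of I)."""
    I = frozenset(I)
    lower, upper = [0], []
    for r in range(len(I)):
        for X in map(frozenset, combinations(sorted(I), r)):
            # sum over X ⊆ J ⊂ I of (-1)^{|I\J|+1} supp(J)
            total = 0
            for k in range(len(X), len(I)):
                for J in map(frozenset, combinations(sorted(I), k)):
                    if X <= J:
                        total += (-1) ** (len(I - J) + 1) * supp[J]
            # odd |I \ X| gives an upper bound, even a lower bound
            (upper if len(I - X) % 2 == 1 else lower).append(total)
    return max(lower), min(upper)

supp = {frozenset(): 6, frozenset('A'): 4, frozenset('B'): 6, frozenset('C'): 4,
        frozenset('AB'): 4, frozenset('AC'): 2, frozenset('BC'): 4}
print(bounds('ABC', supp))  # (2, 2): lb = ub, so ABC is derivable with supp 2
```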

Slide 49

Conclusions

Association rules tell us which items we will probably see, given that we’ve seen some other items

 many business and scientific applications

Frequent itemsets tell us which items appear together

 mining these is the first step for mining many other things
 many different algorithms exist for efficient frequent itemset mining

The number of frequent itemsets is usually too large

 exponential output space
 maximal, closed, and non-derivable itemsets provide a summarisation of a collection of frequent itemsets

Slide 50

Thank you!