SLIDE 1

A New Framework for Itemset Generation

Charu C. Aggarwal
Philip S. Yu
IBM T. J. Watson Research Center
August 10, 1998

SLIDE 2

Association Rules

(1) Identify the presence of one set of items implying the presence of another set of items in a transaction, e.g. diaper ⇒ beer.

(2) Applications
  - Market basket analysis
  - Attached mailing in direct marketing
  - Department store floor/shelf planning
  - Internet surfing patterns

SLIDE 3

Generation of Association Rules

(1) The support of a rule X ⇒ Y is the fraction of the transactions which contain both the sets of items X and Y.

(2) The confidence of the rule X ⇒ Y is the fraction of the transactions containing X which also contain Y.

(3) The traditional approach to association rule mining:
  - first finding all the large itemsets which have sufficient support, using large itemset generation algorithms,
  - then using them to generate all the rules with sufficient confidence.

(4) The Apriori method works by
  - generating all potential large (k+1)-itemsets from large k-itemsets using joins on the large k-itemsets, and
  - then validating them against the database.

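To make these definitions and the Apriori join step concrete, here is a minimal Python sketch (the data and helper names are illustrative, not from the talk):

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions containing every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Fraction of transactions containing `lhs` that also contain `rhs`."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

def apriori_join(large_k, k):
    """Candidate (k+1)-itemsets from large k-itemsets: join the large
    k-itemsets, keeping only candidates whose k-subsets are all large."""
    out = set()
    for a in large_k:
        for b in large_k:
            c = a | b
            if len(c) == k + 1 and all(frozenset(s) in large_k
                                       for s in combinations(c, k)):
                out.add(c)
    return out

transactions = [frozenset(t) for t in
                ({"diaper", "beer"}, {"diaper", "beer", "milk"},
                 {"milk"}, {"beer"})]
print(support({"diaper", "beer"}, transactions))       # 0.5
print(confidence({"diaper"}, {"beer"}, transactions))  # 1.0
```
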
SLIDE 4

Weaknesses of the large itemset method

(1) The large itemset model works very well when the data is sparse.

(2) When the data loses its sparse property, the large itemset method breaks down.

(3) The method does not address the significance of a rule (relative to the assumption of statistical independence).
  - Generalizing Association Rules to Correlations (SIGMOD 97), Brin, Motwani and Silverstein

SLIDE 5

Example

(1) Consider the following example: A retailer of breakfast cereal surveys 5000 students on the activities in which they engage in the morning. The data shows that
  - 3000 students play basketball,
  - 3750 eat cereal, and
  - 2000 students both play basketball and eat cereal.

(2) Consider the following rule at 40% support and 60% confidence:

  play basketball ⇒ eat cereal

(3) This association rule is misleading, because the overall percentage of students eating cereal is 75%, which is even larger than 60%.

(4) The rule play basketball ⇒ (not) eat cereal has both lower confidence and lower support than the rule implying positive association.

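The quoted figures follow directly from the survey counts; a short sketch (illustrative) of the arithmetic:

```python
n = 5000            # students surveyed
basketball = 3000   # play basketball
cereal = 3750       # eat cereal
both = 2000         # play basketball and eat cereal

# play basketball => eat cereal
support_pos = both / n              # 0.40: meets the 40% support threshold
confidence_pos = both / basketball  # ~0.67: meets the 60% confidence threshold
overall_cereal = cereal / n         # 0.75: already higher than the rule's confidence

# play basketball => (not) eat cereal
support_neg = (basketball - both) / n              # 0.20
confidence_neg = (basketball - both) / basketball  # ~0.33
```
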
SLIDE 6

Another example

(1) Consider the following example:

       T1 T2 T3 T4 T5 T6 T7 T8
  X     1  1  1  1  0  0  0  0
  Y     1  1  0  0  0  0  0  0
  Z     0  1  1  1  1  1  1  1

  Table 1: The base data

  Rule     Support   Confidence
  X ⇒ Y    25%       50%
  X ⇒ Z    37.5%     75%

  Table 2: Corresponding support and confidence

• The coefficient of correlation between the items X and Y is 0.577, while the coefficient of correlation between X and Z is -0.378.

SLIDE 7

The basic problems

• Spuriousness in itemset generation, as illustrated by the last few examples.

• Need to deal with dense data sets: how to set the support level.

• Inability to find negative association rules: too much bias in favor of the absence of items as opposed to the presence of items. We need to treat the presence or absence of an item in a symmetric way.

• Data in which the different attributes have widely varying densities.

SLIDE 8

Interest Measure

• The use of the interest measure is an attempt to remove itemsets which do not deviate from statistical independence.

• An itemset is said to be R-interesting if its presence is R times the expected presence based on the assumption of statistical independence.

SLIDE 9

Use of interest measures

• The use of interest measures (which were proposed by Srikant et al.) is useful in pruning away those rules which are rendered uninteresting.

• As the basketball-cereal example illustrates, so long as interest is used as a postprocessing operator, either the user has to set the support value low enough so as not to lose any interesting rules in the output, or risk losing useful rules. The former may not always be computationally feasible.

• The interest measure does not normalize uniformly with respect to dense or sparse data.

• For two items with perfect positive correlation and a base density of 0.9 each, the interest level is 0.9/(0.9)^2 = 1.11, while for two items with perfect positive statistical correlation and a base density of 0.1 each, the interest level is 10.

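The two interest values above come from the ratio of the observed joint density to the density expected under independence; a tiny sketch (names are illustrative):

```python
def interest(p_both, p_a, p_b):
    """Interest of {a, b}: observed joint density over the joint
    density expected under statistical independence."""
    return p_both / (p_a * p_b)

# With perfect positive correlation the items always occur together,
# so the joint density equals the common base density p.
print(interest(0.9, 0.9, 0.9))  # ~1.11 for dense items (p = 0.9)
print(interest(0.1, 0.1, 0.1))  # 10.0  for sparse items (p = 0.1)
```
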
SLIDE 10

The notion of collective strength

• Let I be an itemset.

• An itemset I is said to be violated if some items take on the value of 0, while others take on the value of 1, in a transaction.

• Let v(I) be the fraction of violations. We have

  E[v(I)] = 1 - ∏_{i∈I} p_i - ∏_{i∈I} (1 - p_i),

  where p_i is the density (probability of presence) of item i.

• Let A(I) be the fraction of agreements. A(I) = 1 - v(I). Also we have E[A(I)] = 1 - E[v(I)].

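A minimal sketch (illustrative; items are column indices into 0/1 transaction tuples) of the violation rate and its expectation under independence:

```python
def violation_rate(items, transactions):
    """v(I): fraction of transactions in which some items of I are
    present (1) while others are absent (0)."""
    return sum({t[i] for i in items} == {0, 1}
               for t in transactions) / len(transactions)

def expected_violation_rate(items, transactions):
    """E[v(I)] = 1 - prod(p_i) - prod(1 - p_i), with p_i the density of item i."""
    n = len(transactions)
    prod_p = prod_q = 1.0
    for i in items:
        p = sum(t[i] for t in transactions) / n
        prod_p, prod_q = prod_p * p, prod_q * (1 - p)
    return 1 - prod_p - prod_q

# Columns of Table 1, one tuple (X, Y, Z) per transaction.
rows = list(zip([1, 1, 1, 1, 0, 0, 0, 0],
                [1, 1, 0, 0, 0, 0, 0, 0],
                [0, 1, 1, 1, 1, 1, 1, 1]))
print(violation_rate([0, 1], rows))           # 0.25 for {X, Y}
print(expected_violation_rate([0, 1], rows))  # 0.5
```
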
SLIDE 11

Collective Strength

• The collective strength of an itemset is equal to the agreement ratio divided by the violation ratio:

  C(I) = [(1 - v(I)) / (1 - E[v(I)])] × [E[v(I)] / v(I)]    (1)

• Another way of looking at collective strength:

  C(I) = [Good Events / E[Good Events]] × [E[Bad Events] / Bad Events]    (2)

• When there is perfect negative correlation among the items, the collective strength is 0; when there is perfect positive correlation, the collective strength is ∞.

• A collective strength of 1 is the break-even point.

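Combining the observed and expected violation rates gives the collective strength; a self-contained sketch (illustrative) that reproduces the values quoted on the next slide for Table 1:

```python
def collective_strength(items, transactions):
    """C(I) = [(1 - v) / (1 - E[v])] * [E[v] / v]."""
    n = len(transactions)
    v = sum({t[i] for i in items} == {0, 1} for t in transactions) / n
    prod_p = prod_q = 1.0
    for i in items:
        p = sum(t[i] for t in transactions) / n
        prod_p, prod_q = prod_p * p, prod_q * (1 - p)
    ev = 1 - prod_p - prod_q            # E[v(I)] under independence
    return ((1 - v) / (1 - ev)) * (ev / v)

rows = list(zip([1, 1, 1, 1, 0, 0, 0, 0],   # X
                [1, 1, 0, 0, 0, 0, 0, 0],   # Y
                [0, 1, 1, 1, 1, 1, 1, 1]))  # Z
print(round(collective_strength([0, 1], rows), 2))  # 3.0 for {X, Y}
print(round(collective_strength([0, 2], rows), 2))  # 0.6 for {X, Z}
```
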
SLIDE 12

Application to previous examples

• Basketball-cereal example: 5000 people, 3000 play basketball, 3750 eat cereal, 2000 both play basketball and eat cereal.

  Itemset                              Support   Collective Strength
  Play basketball, eat cereal          40%       0.67
  Play basketball, (not) eat cereal    20%       1/0.67 = 1.49

       T1 T2 T3 T4 T5 T6 T7 T8
  X     1  1  1  1  0  0  0  0
  Y     1  1  0  0  0  0  0  0
  Z     0  1  1  1  1  1  1  1

  Table 3: The base data

  Itemset   Support   Statistical Correlation   Collective Strength
  X, Y      25%        0.577                    3
  X, Z      37.5%     -0.378                    0.6
  Y, Z      12.5%     -0.655                    0.31

SLIDE 13

Closure Property

• Suppose that the items {Milk, Bread} are closely correlated, and similarly for the items {Diaper, Beer}.

• This will result in {Milk, Bread, Diaper, Beer} having high collective strength:
  - {Milk, Bread} and {Diaper, Beer} are independent
  - Items within each set are perfectly correlated (support 10%)
  - Collective strength: [(0.1^2 + 0.9^2) / (0.1^4 + 0.9^4)] × [(1 - (0.1^4 + 0.9^4)) / (1 - (0.1^2 + 0.9^2))]

• The closure property forces all subsets to be closely correlated.

• An itemset I is said to be strongly collective at level K if it satisfies the following properties:
  - The collective strength C(I) of the itemset I is at least K.
  - Closure Property: The collective strength of every subset J of I is at least K.

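A quick numeric check (illustrative) of why the closure property is needed here: the 4-itemset looks strong by itself, but a cross-pair such as {Milk, Diaper} sits exactly at the break-even value of 1 and therefore fails the test at any level K > 1.

```python
p = 0.1  # each item appears in 10% of the transactions

# Collective strength of {Milk, Bread, Diaper, Beer}, evaluating the
# closed-form expression from the slide.
a  = p**2 + (1 - p)**2   # observed agreement rate
ea = p**4 + (1 - p)**4   # expected agreement rate
c4 = (a / ea) * ((1 - ea) / (1 - a))
print(round(c4, 2))      # ~2.39: the 4-itemset looks highly correlated

# Cross-pair such as {Milk, Diaper}: the two items are independent.
v  = 2 * p * (1 - p)                  # observed violation rate
ev = 1 - p * p - (1 - p) * (1 - p)    # expected violation rate
c2 = ((1 - v) / (1 - ev)) * (ev / v)
print(c2)                             # 1.0: break-even, so not strongly collective
```
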
SLIDE 14

Generating the strongly collective baskets

• Let k be a number which is larger than 1. Consider an itemset B of size n ≥ 2. Suppose that all 2-subsets of B have collective strength larger than k. Then the itemset B is highly likely to have collective strength larger than k.

• The following results can be proved for the 2-to-3 case:
  - Let I = {i1, i2, i3} be a 3-itemset. Suppose that for every 2-subset of I the violation ratio is at most some threshold less than 1. Then it must also be the case that the violation ratio of the itemset I is at most that threshold.
  - A similar result can be proved for the agreement ratio.

• When the above two results are used in conjunction, the corresponding result for collective strength may be inferred.

SLIDE 15

Algorithm for finding itemsets with collective strength

• Find all two-itemsets with the appropriate collective strength. Let us call this P_2.

• Perform joins to find P_{k+1} from P_k.

• Remove all those (k+1)-itemsets from P_{k+1} such that some k-subset of the itemset is not included in P_k.

• Continue the process for increasing k, until P_k is empty.

• Perform a pass over the transaction database in order to remove any false itemsets in P_k, for each k.

• Validating against the database is efficient because of the property discussed earlier.

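A compact sketch of this level-wise procedure (illustrative; `strength` stands for any scoring function, e.g. the collective-strength helper sketched earlier, and the threshold handling is an assumption):

```python
from itertools import combinations

def strongly_collective_itemsets(items, strength, k_min):
    """Level-wise search sketch.  P_2 holds the 2-itemsets with strength
    >= k_min; each P_{k+1} is built by joining P_k with itself and pruning
    candidates that have some k-subset outside P_k."""
    p_k = {frozenset(pair) for pair in combinations(items, 2)
           if strength(frozenset(pair)) >= k_min}
    levels, k = [], 2
    while p_k:
        levels.append(p_k)
        # Join P_k with itself to form candidate (k+1)-itemsets ...
        candidates = {a | b for a in p_k for b in p_k if len(a | b) == k + 1}
        # ... and drop candidates with some k-subset missing from P_k.
        p_k = {c for c in candidates
               if all(frozenset(s) in p_k for s in combinations(c, k))}
        k += 1
    # Final pass over the transaction database: remove "false" itemsets
    # whose measured strength is actually below the threshold.
    return [{c for c in level if strength(c) >= k_min} for level in levels]
```
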
SLIDE 16

Empirical Results

• The synthetic data sets were generated in a method similar to that discussed by Rakesh Agrawal for generating large itemsets.

• The first step is to generate L = 2000 maximal "potentially large" itemsets.

• A transaction is generated as a combination of these maximal itemsets (after throwing away some of the items from each itemset).

SLIDE 17

Empirical Results

• An extra set of K corrupt items is added to each transaction. Each of these K corrupted items may occur independently in a transaction with a probability of p_c. This is called the corruption probability. This addition of uncorrelated corrupt items will be used to test how each of the methods (the large itemset method and the collective strength method) handles the data.

• We define an itemset to be impure if it contains at least one corrupt item.

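A rough sketch of the data generator described above (all parameter names and the retention probability are assumptions for illustration):

```python
import random

def make_transaction(maximal_itemsets, corrupt_items, p_keep=0.75, p_c=0.1):
    """Build one synthetic transaction: take a maximal 'potentially large'
    itemset (the talk combines several; one suffices for the sketch), throw
    away some of its items, then add each corrupt item independently with
    the corruption probability p_c."""
    base = random.choice(maximal_itemsets)
    transaction = {item for item in base if random.random() < p_keep}
    transaction |= {item for item in corrupt_items if random.random() < p_c}
    return transaction

def is_impure(itemset, corrupt_items):
    """An itemset is impure if it contains at least one corrupt item."""
    return any(item in corrupt_items for item in itemset)
```
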
SLIDE 18

Impurity level with corruption parameter (large itemset approach)

[Figure: fraction of impure itemsets vs. corruption parameter]

SLIDE 19

Impurity level with number of itemsets

[Figure: fraction of impure itemsets vs. number of itemsets]

SLIDE 20

Summary

• New approach to generating large itemsets based on collective strength
  - Greater robustness and accuracy
  - Reduced number of passes over the data
  - Provides negative association rules

• Preliminary results indicate that the technique works faster than the Apriori method.