
A New Framework for Itemset Generation, Charu C. Aggarwal



  1. A New Framework for Itemset Generation
     Charu C. Aggarwal, Philip S. Yu
     IBM T. J. Watson Research Center
     August 10, 1998

  2. Association Rules
     (1) Identify the presence of one set of items implying the presence of another set of items in a transaction, e.g. diaper $\Rightarrow$ beer.
     (2) Applications:
         - Market basket analysis
         - Attached mailing in direct marketing
         - Department store floor/shelf planning
         - Internet surfing patterns

  3. Generation of Association Rules $X \Rightarrow Y$
     (1) The support of a rule $X \Rightarrow Y$ is the fraction of the transactions which contain both the set of items $X$ and the set of items $Y$.
     (2) The confidence of the rule $X \Rightarrow Y$ is the fraction of the transactions containing $X$ which also contain $Y$.
     (3) The traditional approach to association rule mining consists of:
         - first finding all the large itemsets which have sufficient support, using large itemset generation algorithms;
         - then using them to generate all the rules with sufficient confidence.
     (4) The Apriori method works by:
         - generating all potential large $(k+1)$-itemsets from large $k$-itemsets using joins on the large $k$-itemsets, and then
         - validating them against the database.
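
The two definitions above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function names `support` and `confidence` and the toy `baskets` data are assumptions made for the example.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(transactions: List[Transaction], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    wanted = frozenset(itemset)
    hits = sum(1 for t in transactions if wanted <= t)
    return hits / len(transactions)


def confidence(transactions: List[Transaction],
               antecedent: Iterable[str],
               consequent: Iterable[str]) -> float:
    """Fraction of transactions containing X that also contain Y."""
    x = frozenset(antecedent)
    y = frozenset(consequent)
    return support(transactions, x | y) / support(transactions, x)


# Hypothetical toy data, not taken from the slides.
baskets = [
    frozenset({"diaper", "beer", "milk"}),
    frozenset({"diaper", "beer"}),
    frozenset({"diaper", "bread"}),
    frozenset({"milk", "bread"}),
]

print(support(baskets, {"diaper", "beer"}))       # 2 of 4 baskets
print(confidence(baskets, {"diaper"}, {"beer"}))  # 2 of the 3 diaper baskets
```

Note that `confidence` divides by the support of the antecedent, so it is undefined when no transaction contains $X$.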

  4. Weaknesses of the large itemset method
     (1) The large itemset model works very well when the data is sparse.
     (2) When the data loses its sparse property, the large itemset method breaks down.
     (3) The method does not address the significance of a rule (relative to the assumption of statistical independence).
         - Generalizing Association Rules to Correlations (SIGMOD 97), Brin, Motwani and Silverstein

  5. Example
     (1) Consider the following example: A retailer of breakfast cereal surveys 5000 students on the activities that they engage in during the morning. The data shows that
         - 3000 students play basketball,
         - 3750 eat cereal, and
         - 2000 students both play basketball and eat cereal.
     (2) Consider the following rule at 40% support and 60% confidence: play basketball $\Rightarrow$ eat cereal.
     (3) This association rule is misleading, because the overall percentage of students eating cereal is 75%, which is even larger than 60%.
     (4) The rule play basketball $\Rightarrow$ (not) eat cereal has both lower confidence and lower support than the rule implying positive association.
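
The arithmetic behind this example can be checked directly. The sketch below uses only the survey counts from the slide; the variable names are illustrative, and reading the 40% and 60% figures as minimum thresholds that the rule clears is an assumption about the slide's intent.

```python
# Survey counts from the slide.
total = 5000
basketball = 3000
cereal = 3750
both = 2000

# Rule: play basketball => eat cereal
support_pos = both / total          # 0.40
confidence_pos = both / basketball  # about 0.667
baseline_cereal = cereal / total    # 0.75, cereal eaters among all students

# Rule: play basketball => (not) eat cereal
basketball_no_cereal = basketball - both           # 1000 students
support_neg = basketball_no_cereal / total         # 0.20
confidence_neg = basketball_no_cereal / basketball  # about 0.333

# Assumed reading: 40% / 60% are minimum support / confidence thresholds.
passes_thresholds = support_pos >= 0.40 and confidence_pos >= 0.60

print(passes_thresholds)
print(confidence_pos < baseline_cereal)  # basketball players eat cereal less often than average
```

The positive rule clears both thresholds even though playing basketball lowers the chance of eating cereal relative to the 75% baseline, which is the sense in which the rule is misleading.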

  6. Another example
     (1) Consider the following example:

     | Item | Values in transactions 1 to 8 |
     |------|-------------------------------|
     | X    | 1 1 1 1 0 0 0 0               |
     | Y    | 1 1 0 0 0 0 0 0               |
     | Z    | 0 1 1 1 1 1 1 1               |

     Table 1: The base data

     | Rule                | Support | Confidence |
     |---------------------|---------|------------|
     | $X \Rightarrow Y$   | 25%     | 50%        |
     | $X \Rightarrow Z$   | 37.5%   | 75%        |

     Table 2: Corresponding support and confidence

     • The coefficient of correlation between the items $X$ and $Y$ is $0.577$, while the coefficient of correlation between $X$ and $Z$ is $-0.378$.
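
The figures quoted for this example can be reproduced from the base data. The following is a small sketch in plain Python; the helper names are assumptions, and `correlation` is the ordinary Pearson coefficient specialised to 0/1 data.

```python
from math import sqrt

# Base data: one entry per transaction.
X = [1, 1, 1, 1, 0, 0, 0, 0]
Y = [1, 1, 0, 0, 0, 0, 0, 0]
Z = [0, 1, 1, 1, 1, 1, 1, 1]


def support(a, b):
    """Fraction of transactions in which both items are present."""
    return sum(x & y for x, y in zip(a, b)) / len(a)


def confidence(a, b):
    """Fraction of the transactions containing `a` that also contain `b`."""
    return sum(x & y for x, y in zip(a, b)) / sum(a)


def correlation(a, b):
    """Pearson correlation coefficient of two 0/1 vectors."""
    n = len(a)
    pa, pb = sum(a) / n, sum(b) / n
    covariance = support(a, b) - pa * pb
    return covariance / sqrt(pa * (1 - pa) * pb * (1 - pb))


for name, other in (("Y", Y), ("Z", Z)):
    print(name, support(X, other), confidence(X, other),
          round(correlation(X, other), 3))
```

The rule $X \Rightarrow Z$ has the higher support and confidence of the two, yet $X$ and $Z$ are negatively correlated, while $X$ and $Y$ are positively correlated.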

  7. The basic problems
     • Spuriousness in itemset generation, as illustrated by the last few examples.
     • Need to deal with dense data sets: how to set the support level?
     • Inability to find negative association rules: too much bias in favor of the absence of items as opposed to the presence of items. We need to treat the presence or absence of an item in a symmetric way.
     • Data in which the different attributes have widely varying densities.

  8. Interest Measure
     • The use of an interest measure is an attempt to remove itemsets which do not deviate from statistical independence.
     • An itemset is said to be $R$-interesting if its presence is $R$ times the expected presence based on the assumption of statistical independence.

  9. Use of interest measures
     • The use of interest measures (which were proposed by Srikant et al.) is useful in pruning away those rules which are rendered uninteresting.
     • As the basketball-cereal example illustrates, so long as interest is used as a postprocessing operator, the user either has to set the support value low enough so as not to lose any interesting rules in the output, or risk losing useful rules. The former may not always be computationally feasible.
     • The interest measure does not normalize uniformly with respect to dense or sparse data.
     • For two items with perfect positive correlation and a base density of 0.9 each, the interest level is $0.9 / (0.9)^2 = 1.11$, while for two items with perfect positive statistical correlation and a base density of 0.1 each, the interest level is 10.
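
The last bullet can be verified with a one-line formula: observed joint support divided by the product of the individual supports. The function name `interest` is an assumption for this sketch.

```python
from math import prod


def interest(joint_support, item_supports):
    """Observed support divided by the support expected under independence."""
    return joint_support / prod(item_supports)


# Two perfectly correlated items always occur together,
# so their joint support equals the density of either item.
dense = interest(0.9, [0.9, 0.9])
sparse = interest(0.1, [0.1, 0.1])

print(round(dense, 2), round(sparse, 2))
```

Both pairs are perfectly correlated, but the sparse pair scores about nine times higher, which is the non-uniform normalization the slide refers to.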

  10. The notion of collective strength
      • Let $I$ be an itemset.
      • An itemset $I$ is said to be violated in a transaction if some of its items take on the value 0 while others take on the value 1 in that transaction.
      • Let $v(I)$ be the fraction of violations. With $p_i$ denoting the fraction of transactions containing item $i$, under statistical independence we have $E[v(I)] = 1 - \prod_{i \in I} p_i - \prod_{i \in I} (1 - p_i)$.
      • Let $A(I)$ be the fraction of agreements: $A(I) = 1 - v(I)$. We also have $E[A(I)] = 1 - E[v(I)]$.

  11. Collective Strength
      • The collective strength of an itemset is equal to the agreement ratio divided by the violation ratio:
        $$C(I) = \frac{1 - v(I)}{1 - E[v(I)]} \cdot \frac{E[v(I)]}{v(I)} \qquad (1)$$
      • Another way of looking at collective strength:
        $$C(I) = \frac{\text{Good Events}}{E[\text{Good Events}]} \cdot \frac{E[\text{Bad Events}]}{\text{Bad Events}} \qquad (2)$$
      • When there is perfect negative correlation among the items, the collective strength is 0; when there is perfect positive correlation, $v(I) = 0$ and the collective strength is $\infty$.
      • A collective strength of 1 is the break-even point.
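
Equation (1) translates into code fairly directly. The sketch below is an illustration under stated assumptions: the function name `collective_strength` and the row-based input format are choices made here, and degenerate inputs (for instance an item that never occurs) are not handled.

```python
from math import inf, prod


def collective_strength(rows):
    """Collective strength of an itemset.

    `rows` holds one tuple of 0/1 values per transaction,
    restricted to the items of the itemset.
    """
    n = len(rows)
    size = len(rows[0])
    # A transaction violates the itemset when it has some, but not all, items.
    v = sum(1 for r in rows if 0 < sum(r) < size) / n
    # Item densities p_i and the expected violation rate under independence.
    p = [sum(r[j] for r in rows) / n for j in range(size)]
    expected_v = 1 - prod(p) - prod(1 - q for q in p)
    if v == 0:
        return inf  # no violations at all: perfect positive correlation
    return ((1 - v) / (1 - expected_v)) * (expected_v / v)


# The X, Y, Z base data from the earlier example in this deck.
X = [1, 1, 1, 1, 0, 0, 0, 0]
Y = [1, 1, 0, 0, 0, 0, 0, 0]
Z = [0, 1, 1, 1, 1, 1, 1, 1]

c_xy = collective_strength(list(zip(X, Y)))
c_xz = collective_strength(list(zip(X, Z)))
c_yz = collective_strength(list(zip(Y, Z)))
print(round(c_xy, 2), round(c_xz, 2), round(c_yz, 2))
```

The three values agree with the collective strengths of 3, 0.6 and 0.31 that the deck reports for this data.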

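  12. Application to previous examples
      • Basketball-cereal example: 5000 people, 3000 play basketball, 3750 eat cereal, 2000 both play basketball and eat cereal.

      | Itemset                          | Support | Collective Strength  |
      |----------------------------------|---------|----------------------|
      | Play basketball, eat cereal      | 40%     | 0.67                 |
      | Play basketball, (not)eat cereal | 20%     | $1/0.67 = 1.49$      |

      | Item | Values in transactions 1 to 8 |
      |------|-------------------------------|
      | X    | 1 1 1 1 0 0 0 0               |
      | Y    | 1 1 0 0 0 0 0 0               |
      | Z    | 0 1 1 1 1 1 1 1               |

      Table 3: The base data

      | Itemset | Support | Statistical Correlation | Collective Strength |
      |---------|---------|-------------------------|---------------------|
      | X, Y    | 25%     | $0.577$                 | 3                   |
      | X, Z    | 37.5%   | $-0.378$                | 0.6                 |
      | Y, Z    | 12.5%   | $-0.655$                | 0.31                |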
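
The basketball-cereal row can be recomputed from the four survey counts alone. This sketch assumes a hypothetical helper, `pair_strength`, that takes marginal and joint counts for a pair of items. It also shows why the entry for the negated item is the reciprocal: replacing an item by its negation swaps agreements and violations.

```python
def pair_strength(n, count_a, count_b, count_both):
    """Collective strength of a 2-itemset from marginal and joint counts."""
    pa, pb = count_a / n, count_b / n
    both = count_both / n
    neither = 1 - pa - pb + both
    agreement = both + neither                        # A(I)
    expected_agreement = pa * pb + (1 - pa) * (1 - pb)  # E[A(I)]
    violation = 1 - agreement                         # v(I)
    expected_violation = 1 - expected_agreement       # E[v(I)]
    return (agreement / expected_agreement) * (expected_violation / violation)


# {play basketball, eat cereal}
positive = pair_strength(5000, 3000, 3750, 2000)

# {play basketball, (not) eat cereal}: 1250 students skip cereal,
# and 1000 of the basketball players are among them.
negative = pair_strength(5000, 3000, 1250, 1000)

print(round(positive, 2), round(negative, 2))
```

Since `positive * negative` equals 1 up to rounding, a value below the break-even point for a pair implies a value above it for the pair with one item negated.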

  13. Closure Property
      • Suppose that the items {Milk, Bread} are closely correlated, and similarly for the items {Diaper, Beer}.
      • This will result in {Milk, Bread, Diaper, Beer} having high collective strength. For example:
          - {Milk, Bread} and {Diaper, Beer} are independent
          - Items in a set perfectly correlated (support 10%)
          - Collective strength:
            $$\frac{0.1^2 + 0.9^2}{0.1^4 + 0.9^4} \cdot \frac{1 - (0.1^4 + 0.9^4)}{1 - (0.1^2 + 0.9^2)}$$
      • The closure property forces all subsets to be closely correlated.
      • An itemset $I$ is said to be strongly collective at level $K$ if it satisfies the following properties:
          - The collective strength $C(I)$ of the itemset $I$ is at least $K$.
          - Closure Property: The collective strength of every subset $J$ of $I$ is at least $K$.
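
Evaluating the expression in this example makes the motivation for the closure property concrete. The sketch below only plugs in the numbers from the slide (support 10% for each perfectly correlated pair, the two pairs independent of each other); the variable names are illustrative.

```python
p = 0.1  # support of each item, and of each perfectly correlated pair


def strength(agreement, expected_agreement):
    """Agreement ratio divided by violation ratio."""
    return (agreement / expected_agreement) * (
        (1 - expected_agreement) / (1 - agreement))


# {Milk, Bread, Diaper, Beer}: all four present when both pairs are present,
# all four absent when both pairs are absent (the pairs are independent).
four_agree = p ** 2 + (1 - p) ** 2
four_expected = p ** 4 + (1 - p) ** 4
four_strength = strength(four_agree, four_expected)

# A cross pair such as {Milk, Diaper}: the two items are independent,
# so observed and expected agreement coincide.
cross_agree = p * p + (1 - p) * (1 - p)
cross_expected = p * p + (1 - p) * (1 - p)
cross_strength = strength(cross_agree, cross_expected)

print(round(four_strength, 2), round(cross_strength, 2))
```

The 4-itemset scores well above the break-even point of 1 even though a cross pair sits exactly at it, so for any level $K > 1$ the closure property rejects the 4-itemset.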

  14. Generating the strongly collective baskets
      • Let $k_0$ be a number which is larger than 1. Consider an itemset $B$ of size $n \geq 2$. Suppose that all 2-subsets of $B$ have collective strength larger than $k_0$. Then the itemset $B$ is highly likely to have collective strength larger than $k_0$.
      • The following results can be proved for the 2 to 3 case:
          - Let $I = \{i_1, i_2, i_3\}$ be a 3-itemset. Suppose that for every 2-subset of $I$ the violation ratio is at most some bound less than 1. Then the violation ratio of the itemset $I$ must also be at most that bound.
          - A similar result can be proved for the agreement ratio.
      • When the above two results are used in conjunction, the results for collective strength may be inferred.

  15. Algorithm for finding itemsets with collective strength
      • Find all 2-itemsets with the appropriate collective strength. Let us call this $P_2$.
      • Perform joins to find $P_{k+1}$ from $P_k$.
      • Remove all those $(k+1)$-itemsets from $P_{k+1}$ such that some $k$-subset of it is not included in $P_k$.
      • Continue the process for increasing $k$, until $P_k$ is empty.
      • Perform a pass over the transaction database in order to remove any false itemsets in $P_k$ for each $k$.
      • Validating against the database is efficient because of the property discussed earlier.
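
The steps above can be sketched as a level-wise search. This is a rough illustration under stated assumptions, not the authors' implementation: the names `collective_strength` and `strongly_collective` are invented here, the join is a naive pairwise union, and the final pass re-applies the subset check bottom-up, which is one possible reading of "remove any false itemsets".

```python
from itertools import combinations
from math import inf, prod


def collective_strength(transactions, itemset):
    """Agreement ratio divided by violation ratio for `itemset`."""
    n = len(transactions)
    overlaps = [len(itemset & t) for t in transactions]
    agreement = sum(1 for c in overlaps if c in (0, len(itemset))) / n
    p = [sum(1 for t in transactions if i in t) / n for i in itemset]
    expected = prod(p) + prod(1 - q for q in p)
    if agreement == 1:
        return inf  # never violated
    return (agreement / expected) * ((1 - expected) / (1 - agreement))


def strongly_collective(transactions, level_k):
    """Return {size: itemsets} that are strongly collective at `level_k`."""
    items = sorted(set().union(*transactions))
    # Step 1: P_2, all pairs with sufficient collective strength.
    current = {frozenset(pair) for pair in combinations(items, 2)
               if collective_strength(transactions, frozenset(pair)) >= level_k}
    candidates = {2: current}
    k = 2
    while current:
        # Step 2: join P_k with itself to build (k+1)-itemsets.
        joined = {a | b for a in current for b in current if len(a | b) == k + 1}
        # Step 3: drop candidates that have a k-subset outside P_k.
        current = {c for c in joined
                   if all(frozenset(s) in candidates[k]
                          for s in combinations(c, k))}
        k += 1
        if current:
            candidates[k] = current
    # Final pass over the data: keep candidates that really reach the level
    # and whose immediate subsets also survived (assumed bottom-up check).
    valid = {}
    for size in sorted(candidates):
        valid[size] = {
            c for c in candidates[size]
            if collective_strength(transactions, c) >= level_k
            and (size == 2 or all(frozenset(s) in valid[size - 1]
                                  for s in combinations(c, size - 1)))
        }
    return valid


# Hypothetical toy data: a, b, c always occur together; d is unrelated.
demo = [frozenset("abc"), frozenset("abc"), frozenset("d"), frozenset()]
result = strongly_collective(demo, 2.0)
print(result)
```

Only the first and last steps touch the transaction data; the join and the subset pruning work on the candidate sets alone, which is what keeps the number of passes over the database small.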
