SLIDE 1

A New Framework for Itemset Generation

Charu C. Aggarwal
Philip S. Yu
IBM T. J. Watson Research Center
August 10, 1998

SLIDE 2

Association Rules

(1) Identify the presence of one set of items implying the presence of another set of items in a transaction, e.g. diaper ⇒ beer.

(2) Applications
  - Market basket analysis
  - Attached mailing in direct marketing
  - Department store floor/shelf planning
  - Internet surfing patterns

SLIDE 3

Generation of Association Rules

(1) The support of a rule X ⇒ Y is the fraction of the transactions which contain both the sets of items X and Y.

(2) The confidence of the rule X ⇒ Y is the fraction of the transactions containing X which also contain Y.

(3) The traditional approach to association rule mining:
  - first finding all the large itemsets which have sufficient support, using large itemset generation algorithms,
  - then using them to generate all the rules with sufficient confidence.

(4) The Apriori method works by
  - generating all potential large (k+1)-itemsets from large k-itemsets using joins on the large k-itemsets, and
  - then validating them against the database.

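To make these definitions and the Apriori join step concrete, here is a minimal Python sketch (the data and helper names are illustrative, not from the talk):

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions containing every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Fraction of transactions containing `lhs` that also contain `rhs`."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

def apriori_join(large_k, k):
    """Candidate (k+1)-itemsets from large k-itemsets: join the large
    k-itemsets, keeping only candidates whose k-subsets are all large."""
    out = set()
    for a in large_k:
        for b in large_k:
            c = a | b
            if len(c) == k + 1 and all(frozenset(s) in large_k
                                       for s in combinations(c, k)):
                out.add(c)
    return out

transactions = [frozenset(t) for t in
                ({"diaper", "beer"}, {"diaper", "beer", "milk"},
                 {"milk"}, {"beer"})]
print(support({"diaper", "beer"}, transactions))       # 0.5
print(confidence({"diaper"}, {"beer"}, transactions))  # 1.0
```
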
SLIDE 4

Weaknesses of the large itemset method

(1) The large itemset model works very well when the data is sparse.

(2) When the data loses its sparse property, the large itemset method breaks down.

(3) The method does not address the significance of a rule (relative to the assumption of statistical independence).
  - Generalizing Association Rules to Correlations (SIGMOD 97), Brin, Motwani and Silverstein

SLIDE 5

Example

(1) Consider the following example: A retailer of breakfast cereal surveys 5000 students on the activities in which they engage in the morning. The data shows that
  - 3000 students play basketball,
  - 3750 eat cereal, and
  - 2000 students both play basketball and eat cereal.

(2) Consider the following rule at 40% support and 60% confidence:

  play basketball ⇒ eat cereal

(3) This association rule is misleading, because the overall percentage of students eating cereal is 75%, which is even larger than 60%.

(4) The rule play basketball ⇒ (not) eat cereal has both lower confidence and lower support than the rule implying positive association.

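The quoted figures follow directly from the survey counts; a short sketch (illustrative) of the arithmetic:

```python
n = 5000            # students surveyed
basketball = 3000   # play basketball
cereal = 3750       # eat cereal
both = 2000         # play basketball and eat cereal

# play basketball => eat cereal
support_pos = both / n              # 0.40: meets the 40% support threshold
confidence_pos = both / basketball  # ~0.67: meets the 60% confidence threshold
overall_cereal = cereal / n         # 0.75: already higher than the rule's confidence

# play basketball => (not) eat cereal
support_neg = (basketball - both) / n              # 0.20
confidence_neg = (basketball - both) / basketball  # ~0.33
```
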
SLIDE 6

Another example

(1) Consider the following example:

       T1 T2 T3 T4 T5 T6 T7 T8
  X     1  1  1  1  0  0  0  0
  Y     1  1  0  0  0  0  0  0
  Z     0  1  1  1  1  1  1  1

  Table 1: The base data

  Rule     Support   Confidence
  X ⇒ Y    25%       50%
  X ⇒ Z    37.5%     75%

  Table 2: Corresponding support and confidence

• The coefficient of correlation between the items X and Y is 0.577, while the coefficient of correlation between X and Z is -0.378.

SLIDE 7

The basic problems

• Spuriousness in itemset generation, as illustrated by the last few examples.

• Need to deal with dense data sets: how to set the support level.

• Inability to find negative association rules: too much bias in favor of the absence of items as opposed to the presence of items. We need to treat the presence or absence of an item in a symmetric way.

• Data in which the different attributes have widely varying densities.

SLIDE 8

Interest Measure

• The use of the interest measure is an attempt to remove itemsets which do not deviate from statistical independence.

• An itemset is said to be R-interesting if its presence is R times the expected presence based on the assumption of statistical independence.

SLIDE 9

Use of interest measures

• The use of interest measures (which were proposed by Srikant et al.) is useful in pruning away those rules which are rendered uninteresting.

• As the basketball-cereal example illustrates, so long as interest is used as a postprocessing operator, either the user has to set the support value low enough so as not to lose any interesting rules in the output, or risk losing useful rules. The former may not always be computationally feasible.

• The interest measure does not normalize uniformly with respect to dense or sparse data.

• For two items with perfect positive correlation and a base density of 0.9 each, the interest level is 0.9/(0.9)^2 = 1.11, while for two items with perfect positive statistical correlation and a base density of 0.1 each, the interest level is 10.

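The two interest values above come from the ratio of the observed joint density to the density expected under independence; a tiny sketch (names are illustrative):

```python
def interest(p_both, p_a, p_b):
    """Interest of {a, b}: observed joint density over the joint
    density expected under statistical independence."""
    return p_both / (p_a * p_b)

# With perfect positive correlation the items always occur together,
# so the joint density equals the common base density p.
print(interest(0.9, 0.9, 0.9))  # ~1.11 for dense items (p = 0.9)
print(interest(0.1, 0.1, 0.1))  # 10.0  for sparse items (p = 0.1)
```
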
SLIDE 10

The notion of collective strength

• Let I be an itemset.

• An itemset I is said to be violated if some items take on the value of 0, while others take on the value of 1, in a transaction.

• Let v(I) be the fraction of violations. We have

  E[v(I)] = 1 - ∏_{i∈I} p_i - ∏_{i∈I} (1 - p_i),

  where p_i is the density (probability of presence) of item i.

• Let A(I) be the fraction of agreements. A(I) = 1 - v(I). Also we have E[A(I)] = 1 - E[v(I)].

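A minimal sketch (illustrative; items are column indices into 0/1 transaction tuples) of the violation rate and its expectation under independence:

```python
def violation_rate(items, transactions):
    """v(I): fraction of transactions in which some items of I are
    present (1) while others are absent (0)."""
    return sum({t[i] for i in items} == {0, 1}
               for t in transactions) / len(transactions)

def expected_violation_rate(items, transactions):
    """E[v(I)] = 1 - prod(p_i) - prod(1 - p_i), with p_i the density of item i."""
    n = len(transactions)
    prod_p = prod_q = 1.0
    for i in items:
        p = sum(t[i] for t in transactions) / n
        prod_p, prod_q = prod_p * p, prod_q * (1 - p)
    return 1 - prod_p - prod_q

# Columns of Table 1, one tuple (X, Y, Z) per transaction.
rows = list(zip([1, 1, 1, 1, 0, 0, 0, 0],
                [1, 1, 0, 0, 0, 0, 0, 0],
                [0, 1, 1, 1, 1, 1, 1, 1]))
print(violation_rate([0, 1], rows))           # 0.25 for {X, Y}
print(expected_violation_rate([0, 1], rows))  # 0.5
```
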
SLIDE 11

Collective Strength

• The collective strength of an itemset is equal to the agreement ratio divided by the violation ratio:

  C(I) = [(1 - v(I)) / (1 - E[v(I)])] × [E[v(I)] / v(I)]    (1)

• Another way of looking at collective strength:

  C(I) = [Good Events / E[Good Events]] × [E[Bad Events] / Bad Events]    (2)

• When there is perfect negative correlation among the items, the collective strength is 0; when there is perfect positive correlation, the collective strength is ∞.

• A collective strength of 1 is the break-even point.

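Combining the observed and expected violation rates gives the collective strength; a self-contained sketch (illustrative) that reproduces the values quoted on the next slide for Table 1:

```python
def collective_strength(items, transactions):
    """C(I) = [(1 - v) / (1 - E[v])] * [E[v] / v]."""
    n = len(transactions)
    v = sum({t[i] for i in items} == {0, 1} for t in transactions) / n
    prod_p = prod_q = 1.0
    for i in items:
        p = sum(t[i] for t in transactions) / n
        prod_p, prod_q = prod_p * p, prod_q * (1 - p)
    ev = 1 - prod_p - prod_q            # E[v(I)] under independence
    return ((1 - v) / (1 - ev)) * (ev / v)

rows = list(zip([1, 1, 1, 1, 0, 0, 0, 0],   # X
                [1, 1, 0, 0, 0, 0, 0, 0],   # Y
                [0, 1, 1, 1, 1, 1, 1, 1]))  # Z
print(round(collective_strength([0, 1], rows), 2))  # 3.0 for {X, Y}
print(round(collective_strength([0, 2], rows), 2))  # 0.6 for {X, Z}
```
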
SLIDE 12

Application to previous examples

• Basketball-cereal example: 5000 people, 3000 play basketball, 3750 eat cereal, 2000 both play basketball and eat cereal.

  Itemset                              Support   Collective Strength
  Play basketball, eat cereal          40%       0.67
  Play basketball, (not) eat cereal    20%       1/0.67 = 1.49

       T1 T2 T3 T4 T5 T6 T7 T8
  X     1  1  1  1  0  0  0  0
  Y     1  1  0  0  0  0  0  0
  Z     0  1  1  1  1  1  1  1

  Table 3: The base data

  Itemset   Support   Statistical Correlation   Collective Strength
  X, Y      25%        0.577                    3
  X, Z      37.5%     -0.378                    0.6
  Y, Z      12.5%     -0.655                    0.31

SLIDE 13

Closure Property

• Suppose that the items {Milk, Bread} are closely correlated, and similarly for the items {Diaper, Beer}.

• This will result in {Milk, Bread, Diaper, Beer} having high collective strength:
  - {Milk, Bread} and {Diaper, Beer} are independent
  - Items within each set are perfectly correlated (support 10%)
  - Collective strength: [(0.1^2 + 0.9^2) / (0.1^4 + 0.9^4)] × [(1 - (0.1^4 + 0.9^4)) / (1 - (0.1^2 + 0.9^2))]

• The closure property forces all subsets to be closely correlated.

• An itemset I is said to be strongly collective at level K if it satisfies the following properties:
  - The collective strength C(I) of the itemset I is at least K.
  - Closure Property: The collective strength of every subset J of I is at least K.

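A quick numeric check (illustrative) of why the closure property is needed here: the 4-itemset looks strong by itself, but a cross-pair such as {Milk, Diaper} sits exactly at the break-even value of 1 and therefore fails the test at any level K > 1.

```python
p = 0.1  # each item appears in 10% of the transactions

# Collective strength of {Milk, Bread, Diaper, Beer}, evaluating the
# closed-form expression from the slide.
a  = p**2 + (1 - p)**2   # observed agreement rate
ea = p**4 + (1 - p)**4   # expected agreement rate
c4 = (a / ea) * ((1 - ea) / (1 - a))
print(round(c4, 2))      # ~2.39: the 4-itemset looks highly correlated

# Cross-pair such as {Milk, Diaper}: the two items are independent.
v  = 2 * p * (1 - p)                  # observed violation rate
ev = 1 - p * p - (1 - p) * (1 - p)    # expected violation rate
c2 = ((1 - v) / (1 - ev)) * (ev / v)
print(c2)                             # 1.0: break-even, so not strongly collective
```
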
SLIDE 14

Generating the strongly collective baskets

• Let k be a number which is larger than 1. Consider an itemset B of size n ≥ 2. Suppose that all 2-subsets of B have collective strength larger than k. Then the itemset B is highly likely to have collective strength larger than k.

• The following results can be proved for the 2-to-3 case:
  - Let I = {i1, i2, i3} be a 3-itemset. Suppose that for every 2-subset of I the violation ratio is at most some threshold less than 1. Then it must also be the case that the violation ratio of the itemset I is at most that threshold.
  - A similar result can be proved for the agreement ratio.

• When the above two results are used in conjunction, the corresponding result for collective strength may be inferred.

SLIDE 15

Algorithm for finding itemsets with collective strength

• Find all two-itemsets with the appropriate collective strength. Let us call this P_2.

• Perform joins to find P_{k+1} from P_k.

• Remove all those (k+1)-itemsets from P_{k+1} such that some k-subset of the itemset is not included in P_k.

• Continue the process for increasing k, until P_k is empty.

• Perform a pass over the transaction database in order to remove any false itemsets in P_k, for each k.

• Validating against the database is efficient because of the property discussed earlier.

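A compact sketch of this level-wise procedure (illustrative; `strength` stands for any scoring function, e.g. the collective-strength helper sketched earlier, and the threshold handling is an assumption):

```python
from itertools import combinations

def strongly_collective_itemsets(items, strength, k_min):
    """Level-wise search sketch.  P_2 holds the 2-itemsets with strength
    >= k_min; each P_{k+1} is built by joining P_k with itself and pruning
    candidates that have some k-subset outside P_k."""
    p_k = {frozenset(pair) for pair in combinations(items, 2)
           if strength(frozenset(pair)) >= k_min}
    levels, k = [], 2
    while p_k:
        levels.append(p_k)
        # Join P_k with itself to form candidate (k+1)-itemsets ...
        candidates = {a | b for a in p_k for b in p_k if len(a | b) == k + 1}
        # ... and drop candidates with some k-subset missing from P_k.
        p_k = {c for c in candidates
               if all(frozenset(s) in p_k for s in combinations(c, k))}
        k += 1
    # Final pass over the transaction database: remove "false" itemsets
    # whose measured strength is actually below the threshold.
    return [{c for c in level if strength(c) >= k_min} for level in levels]
```
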
SLIDE 16

Empirical Results

• The synthetic data sets were generated in a method similar to that discussed by Rakesh Agrawal for generating large itemsets.

• The first step is to generate L = 2000 maximal "potentially large" itemsets.

• A transaction is generated as a combination of these maximal itemsets (after throwing away some of the items from each itemset).

SLIDE 17

Empirical Results

• An extra set of K corrupt items is added to each transaction. Each of these K corrupted items may occur independently in a transaction with a probability of p_c. This is called the corruption probability. This addition of uncorrelated corrupt items will be used to test how each of the methods (the large itemset method and the collective strength method) handles the data.

• We define an itemset to be impure if it contains at least one corrupt item.

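A rough sketch of the data generator described above (all parameter names and the retention probability are assumptions for illustration):

```python
import random

def make_transaction(maximal_itemsets, corrupt_items, p_keep=0.75, p_c=0.1):
    """Build one synthetic transaction: take a maximal 'potentially large'
    itemset (the talk combines several; one suffices for the sketch), throw
    away some of its items, then add each corrupt item independently with
    the corruption probability p_c."""
    base = random.choice(maximal_itemsets)
    transaction = {item for item in base if random.random() < p_keep}
    transaction |= {item for item in corrupt_items if random.random() < p_c}
    return transaction

def is_impure(itemset, corrupt_items):
    """An itemset is impure if it contains at least one corrupt item."""
    return any(item in corrupt_items for item in itemset)
```
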
SLIDE 18

Impurity level with corruption parameter (large itemset approach)

[Figure: fraction of impure itemsets vs. corruption parameter]

SLIDE 19

Impurity level with number of itemsets

[Figure: fraction of impure itemsets vs. number of itemsets]

SLIDE 20

Summary

• New approach to generating large itemsets based on collective strength
  - Greater robustness and accuracy
  - Reduced number of passes over the data
  - Provides negative association rules

• Preliminary results indicate that the technique works faster than the Apriori method.