
CHARM: An Efficient Algorithm for Closed Itemset Mining - PowerPoint PPT Presentation

CHARM: An Efficient Algorithm for Closed Itemset Mining. Authors: Mohammed J. Zaki and Ching-Jui Hsiao. Presenter: Junfeng Wu. Outline: Introductions, Itemset-Tidset tree, CHARM algorithm, Performance study, Conclusion, Comments.


  1. CHARM: An Efficient Algorithm for Closed Itemset Mining
     Authors: Mohammed J. Zaki and Ching-Jui Hsiao
     Presenter: Junfeng Wu
     28/10/2004

     Outline
     - Introductions
     - Itemset-Tidset tree
     - CHARM algorithm
     - Performance study
     - Conclusion
     - Comments

     Introductions
     When we mine association rules in a database, a huge number of frequent patterns (itemsets) is generated.
     - Database: {(1,2,3,4),(1,2,3,4,5,6)}
     - Minimum support = 50%
     - 63 frequent itemsets: {(1),(2),(3),(4),(5),(6),(1,2),(1,3),…,(1,2,3,4,5,6)}
     Closed frequent itemsets are a non-redundant representation of all frequent itemsets. Mining association rules on closed frequent itemsets is a much easier task.
     In the database above, there are only 2 closed frequent itemsets: (1,2,3,4) and (1,2,3,4,5,6).
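The counts on the introduction slides can be checked by brute force. The following is a minimal sketch, not the CHARM algorithm itself; the names `database`, `support`, `frequent` and `closed` are illustrative and do not come from the slides.

```python
from itertools import combinations

# Toy database from the slide: two transactions over items 1..6.
database = [{1, 2, 3, 4}, {1, 2, 3, 4, 5, 6}]
min_support = 0.5  # fraction of transactions


def support(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for transaction in db if itemset <= transaction)


items = sorted(set().union(*database))
min_count = min_support * len(database)

# Brute-force enumeration of every non-empty frequent itemset.
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        count = support(set(combo), database)
        if count >= min_count:
            frequent[frozenset(combo)] = count

# An itemset is closed if no proper superset has the same support.
closed = [
    x for x, count in frequent.items()
    if not any(x < y and count == other for y, other in frequent.items())
]

print(len(frequent))                      # 63
print(sorted(sorted(x) for x in closed))  # [[1, 2, 3, 4], [1, 2, 3, 4, 5, 6]]
```

Enumerating all subsets is exponential in the number of items, which is exactly the blow-up that closed itemset mining tries to avoid.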

  2. Closed frequent itemsets
     A frequent itemset X is closed if and only if there is no itemset Y such that
     - Y subsumes X (Y is a proper superset of X), and
     - every transaction that contains X also contains Y.
     Database: {(1,2,3,4),(1,2,3,4,5,6)}
     Itemset (1,2) is not a closed itemset.
     Itemset (1,2,3,4) is a closed itemset.

     Example Database
     Distinct database items:
       A = Jane Austen, C = Agatha Christie, D = Sir Arthur Conan Doyle, T = Mark Twain, W = P.G. Wodehouse
     Database:
       Transaction  Items
       1            A,C,T,W
       2            C,D,W
       3            A,C,T,W
       4            A,C,D,W
       5            A,C,D,T,W
       6            C,D,T
     All frequent itemsets (minimum support = 50%):
       Support   Itemsets
       100% (6)  C
       83% (5)   W, CW
       67% (4)   A, D, T, AC, AW, CD, CT, ACW
       50% (3)   AT, DW, TW, ACT, ATW, CDW, CTW, ACTW

     Horizontal/Vertical format database
     - Horizontal format database
       - Each record is a set of items.
       - Each record is assigned a distinct number called the transaction id (tid).
     - Vertical format database
       - Each record is the set of transaction ids for one item.
       - The item occurs in exactly these transactions.

     Vertical format database
       Item  Tids
       A     1,3,4,5
       C     1,2,3,4,5,6
       D     2,4,5,6
       T     1,3,5,6
       W     1,2,3,4,5
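The conversion from horizontal to vertical format is a simple inversion of the transaction table. The sketch below uses the example database from the slide; the function name `to_vertical` is an assumption made for illustration.

```python
from collections import defaultdict

# Horizontal format from the slide: transaction id -> items.
horizontal = {
    1: "ACTW", 2: "CDW", 3: "ACTW",
    4: "ACDW", 5: "ACDTW", 6: "CDT",
}


def to_vertical(db):
    """Invert tid -> items into item -> sorted list of tids."""
    vertical = defaultdict(list)
    for tid in sorted(db):
        for item in db[tid]:
            vertical[item].append(tid)
    return dict(vertical)


vertical = to_vertical(horizontal)
for item in sorted(vertical):
    print(item, vertical[item])
```

The support of a single item is then just the length of its tid list, and the support of a larger itemset is the size of the intersection of its items' tid lists.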

  3. Notations
     - Given an itemset X, t(X) is the set of all tids whose transactions contain X.
       For example: t(ACW) = 1345
     - Given a tidset Y, i(Y) is the set of items common to all the tids in Y.
       For example: i(12) = CW
     - Given an itemset X, c(X) is the smallest closed set that contains X.
       For example: c(A) = i(t(A)) = i(1345) = ACW (whereas c(C) = C and c(W) = CW)

     Itemset-Tidset Search Tree (IT-tree)
     - Each node in the IT-tree is an itemset-tidset pair, X×t(X).
       For example: AT×135
     - All the children of node X share the same prefix X and belong to an equivalence class.

     Example of IT-tree
       Level 0: {}×123456
       Level 1: A×1345, C×123456, D×2456, T×1356, W×12345
       Level 2: AC×1345, AD×45, AT×135, AW×1345, CD×2456, CT×1356, CW×12345, DT×56, DW×245, TW×135
       Level 3: ACD×45, ACT×135, ACW×1345, ADT×5, ADW×45, ATW×135, CDT×56, CDW×245, CTW×135, DTW×5
       Level 4: ACDT×5, ACDW×45, ACTW×135, ADTW×5, CDTW×5
       Level 5: ACDTW×5

     Theorem 1
     Let X_i×t(X_i) and X_j×t(X_j) be any two members of a class [P], with X_i ≤_f X_j, where f is a total order. The following four properties hold:
     1. If t(X_i) = t(X_j), then c(X_i) = c(X_j) = c(X_i ∪ X_j)
     2. If t(X_i) ⊂ t(X_j), then c(X_i) ≠ c(X_j), but c(X_i) = c(X_i ∪ X_j)
     3. If t(X_i) ⊃ t(X_j), then c(X_i) ≠ c(X_j), but c(X_j) = c(X_i ∪ X_j)
     4. If t(X_i) ≠ t(X_j) and neither tidset contains the other, then c(X_i) ≠ c(X_j) ≠ c(X_i ∪ X_j)
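The three operators can be written directly from their definitions. This is a minimal sketch over the example database; the helper `show` is hypothetical and only formats sets the way the slides print them, and `i` assumes a non-empty tidset.

```python
# Example database from the slides (tid -> set of items).
db = {1: set("ACTW"), 2: set("CDW"), 3: set("ACTW"),
      4: set("ACDW"), 5: set("ACDTW"), 6: set("CDT")}


def t(itemset):
    """Tidset of an itemset: tids of all transactions containing it."""
    return frozenset(tid for tid, items in db.items() if set(itemset) <= items)


def i(tidset):
    """Items common to every transaction in a non-empty tidset."""
    return frozenset(set.intersection(*(db[tid] for tid in tidset)))


def c(itemset):
    """Closure: the smallest closed itemset containing `itemset`."""
    return i(t(itemset))


def show(values):
    """Format a set as a compact string, as on the slides."""
    return "".join(str(v) for v in sorted(values))


print(show(t("ACW")))   # 1345
print(show(i({1, 2})))  # CW
print(show(c("A")))     # ACW

# Theorem 1, property 2 on the pair A, W: t(A) is a proper subset of t(W),
# so c(A) differs from c(W) but equals c(AW).
assert t("A") < t("W")
assert c("A") != c("W") and c("A") == c("AW")
```

Computing the closure as i(t(X)) scans the whole database each time; CHARM avoids this by using the tidset comparisons of Theorem 1 instead.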

  4. CHARM algorithm

     How does CHARM work?
     Node labels in the figure (a node is relabeled as it absorbs further items, e.g. D×2456 becomes DC×2456):
       {}
       D×2456, T×1356, A×1345, W×12345, C×123456
       DC×2456, TC×1356, AW×1345, WC×12345
       AWC×1345
       DT×56, DA×45, DW×245, TA×135, TW×135
       DWC×245, TAC×135, TWC×135
       TAWC×135

     Subsumption Checking
     Before adding a set X to the current set of closed sets, we need to check whether X is subsumed by some closed set.
     - Comparing X with all closed sets is expensive.
     Solution: use a hash function to retrieve only the relevant closed sets.

     Hash function
       h(X) = Σ_{T ∈ t(X)} T
     - The hash key is the sum of the tids in the tidset of the itemset.
     - Assumption: unrelated itemsets with the same hash key have different supports.
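The search and the hash-based subsumption check can be sketched together as follows. This is a simplified illustration that applies the four tidset cases of Theorem 1, not the authors' implementation; the names `charm`, `extend`, `is_subsumed` and `closed_by_hash` are assumptions, and ties in support are broken by input order rather than by the order drawn in the figure.

```python
def charm(vertical, min_count):
    """Return {closed itemset: support} for an item -> tids mapping."""
    closed_by_hash = {}  # sum of tids -> list of (itemset, tidset)

    def is_subsumed(itemset, tids):
        # Only closed sets with the same hash key can share this tidset;
        # equal support plus itemset containment implies equal tidsets.
        return any(
            len(other_tids) == len(tids) and itemset <= other
            for other, other_tids in closed_by_hash.get(sum(tids), [])
        )

    def extend(nodes):
        nodes = sorted(nodes, key=lambda n: len(n[1]))  # increasing support
        removed = set()
        for a, (xi, ti) in enumerate(nodes):
            if a in removed:
                continue
            children = []
            for b in range(a + 1, len(nodes)):
                if b in removed:
                    continue
                xj, tj = nodes[b]
                y = ti & tj
                if len(y) < min_count:
                    continue
                if ti == tj:      # property 1: merge X_j into X_i, drop X_j
                    removed.add(b)
                    xi = xi | xj
                elif ti < tj:     # property 2: X_i absorbs X_j
                    xi = xi | xj
                else:             # properties 3 and 4: new child node
                    if ti > tj:
                        removed.add(b)
                    children.append((xj, y))
            if children:
                # Children inherit every item absorbed by X_i.
                extend([(xi | xj, y) for xj, y in children])
            if not is_subsumed(xi, ti):
                closed_by_hash.setdefault(sum(ti), []).append((xi, ti))

    extend([(frozenset([item]), frozenset(tids))
            for item, tids in vertical.items() if len(tids) >= min_count])
    return {x: len(tids)
            for group in closed_by_hash.values() for x, tids in group}


vertical = {"A": {1, 3, 4, 5}, "C": {1, 2, 3, 4, 5, 6}, "D": {2, 4, 5, 6},
            "T": {1, 3, 5, 6}, "W": {1, 2, 3, 4, 5}}
result = charm(vertical, min_count=3)
for itemset, count in sorted(result.items(),
                             key=lambda kv: (-kv[1], sorted(kv[0]))):
    print("".join(sorted(itemset)), count)
```

On the example database with minimum support 50% (3 of 6 transactions), the sketch reports seven closed itemsets. The candidate TWC×135 is the one rejected by the subsumption check, because ACTW×135 has the same hash key (1+3+5 = 9), the same support, and contains it.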

  5. Complexity issues
     - Comparing the tidsets of two itemsets becomes time-consuming when the tidsets get very large.
     - Keeping all tids of all itemsets in memory needs a lot of space.
     Solution: use diffsets.

     Diffsets
     [Figure: diagram relating the tidsets t(P), t(X), t(Y), t(PX), t(PXY) and the diffsets d(PX), d(PY), d(PXY)]

     Diffset and Tidset
     Let m(X_i) and m(X_j) denote the number of mismatches in the diffsets d(X_i) and d(X_j).
     For example: X_i = D, X_j = T, then t(X_i) = 2456, t(X_j) = 1356, so d(X_i) = 13, d(X_j) = 24,
     m(X_i) = |(13)| = 2, m(X_j) = |(24)| = 2
     - If m(X_i) = 0 and m(X_j) = 0, then d(X_i) = d(X_j), or equivalently t(X_i) = t(X_j)
     - If m(X_i) > 0 and m(X_j) = 0, then d(X_i) ⊃ d(X_j), or equivalently t(X_i) ⊂ t(X_j)
     - If m(X_i) = 0 and m(X_j) > 0, then d(X_i) ⊂ d(X_j), or equivalently t(X_i) ⊃ t(X_j)
     - If m(X_i) > 0 and m(X_j) > 0, then d(X_i) ≠ d(X_j), or equivalently t(X_i) ≠ t(X_j)

     CHARM using diffsets
     Node labels in the figure (top-level nodes carry tidsets; the nodes below them carry diffsets):
       {}
       D×2456, T×1356, A×1345, W×12345, C×123456
       DC×2456, TC×1356, AW×1345, WC×12345
       AWC×1345
       DT×24, DA×26, DW×6, TA×6, TW×6
       DWC×6, TAC×6, TWC×6
       TAWC×6
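The diffset bookkeeping can be sketched as below. The recurrences d(PXY) = d(PY) - d(PX) and support(PXY) = support(PX) - |d(PXY)| are the standard diffset relations and are not spelled out in the slide text; the names `diffsets`, `mismatches` and `extend` are illustrative.

```python
# Tidsets of the single items in the example database (tids 1..6).
all_tids = frozenset(range(1, 7))
tidsets = {
    "A": frozenset({1, 3, 4, 5}),
    "C": frozenset({1, 2, 3, 4, 5, 6}),
    "D": frozenset({2, 4, 5, 6}),
    "T": frozenset({1, 3, 5, 6}),
    "W": frozenset({1, 2, 3, 4, 5}),
}

# Diffset of a single item relative to the empty prefix: d(X) = t({}) - t(X).
diffsets = {item: all_tids - tids for item, tids in tidsets.items()}


def mismatches(d_i, d_j):
    """Return (m(X_i), m(X_j)): tids in one diffset but not in the other."""
    return len(d_i - d_j), len(d_j - d_i)


def extend(d_px, support_px, d_py):
    """Diffset and support of PXY, given PX and PY under the same prefix P.

    d(PXY) = d(PY) - d(PX) and support(PXY) = support(PX) - |d(PXY)|.
    """
    d_pxy = d_py - d_px
    return d_pxy, support_px - len(d_pxy)


print(sorted(diffsets["D"]), sorted(diffsets["T"]))  # [1, 3] [2, 4]
print(mismatches(diffsets["D"], diffsets["T"]))      # (2, 2)

d_dw, support_dw = extend(diffsets["D"], len(tidsets["D"]), diffsets["W"])
print(sorted(d_dw), support_dw)                      # [6] 3
```

The result for DW matches the node DW×6 in the figure: its diffset holds one tid, while its tidset 245 holds three, and deeper nodes tend to shrink further.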

  6. Performance study
     - Datasets
     [Figures on four slides; their content is not recoverable from the extracted text]

  7. Scalability
     The running time increases linearly with the number of transactions at a given support.

     Memory usage
     The memory usage is 50 times smaller using diffsets than using tidsets.
     [Figure: Memory usage (using diffsets)]

     Conclusion
     - Advantages of CHARM
       - Faster than other algorithms at low support thresholds
       - Faster than other algorithms on databases with very long closed patterns
     - Disadvantage of CHARM
       - Slower than Closet when most of the closed sets are 2-itemsets

     Comments
     - Strengths
       - The ideas in the paper are intuitive.
       - The authors were the first to introduce an efficient data structure (IT-tree) for closed itemset mining.
       - The authors demonstrated the algorithm on various datasets.
       - The experimental studies are convincing.
     - Weakness
       - The algorithm requires converting the database from horizontal format to vertical format.
     - Follow-up
       - Closet+ (Wang et al, 2003) beats CHARM one year later.

  8. THANK YOU! Questions or comments?
