frequent itemsets
play

Frequent Itemsets Itemset: a set of items E.g., acm = {a, c, m} - PowerPoint PPT Presentation

Frequent Itemsets Itemset: a set of items E.g., acm = {a, c, m} Transaction database TDB Support of itemsets TID Items bought Sup(acm) = 3 100 f, a, c, d, g, I, m, p Given min_sup = 3, acm 200 a, b, c, f, l, m, o is a


  1. Frequent Itemsets • Itemset: a set of items – E.g., acm = {a, c, m} Transaction database TDB • Support of itemsets TID Items bought – Sup(acm) = 3 100 f, a, c, d, g, I, m, p • Given min_sup = 3, acm 200 a, b, c, f, l, m, o is a frequent pattern 300 b, f, h, j, o • Frequent pattern mining: 400 b, c, k, s, p finding all frequent 500 a, f, c, e, l, p, m, n patterns in a database Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 1

  2. Candidate Generation & Test • Any subset of a frequent itemset must also be frequent – an anti-monotonic property – A transaction containing {beer, diaper, nuts} also contains {beer, diaper} – {beer, diaper, nuts} is frequent à {beer, diaper} must also be frequent • In other words, any superset of an infrequent itemset must also be infrequent – No superset of any infrequent itemset should be generated or tested – Many item combinations can be pruned! Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 2

  3. Apriori-Based Mining • Generate length (k+1) candidate itemsets from length k frequent itemsets, and • Test the candidates against DB Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 3

  4. The Apriori Algorithm [AgSr94] Data base D 1-candidates Freq 1-itemsets 2-candidates TID Items Itemset Sup Itemset Sup Itemset 10 a, c, d a 2 a 2 ab Scan D 20 b, c, e b 3 b 3 ac 30 a, b, c, e c 3 c 3 ae 40 b, e d 1 bc e 3 Min_sup=2 e 3 be ce Counting 3-candidates Freq 2-itemsets Scan D Itemset Sup Itemset Itemset Sup ab 1 bce ac 2 Scan D ac 2 bc 2 ae 1 be 3 Freq 3-itemsets bc 2 ce 2 Itemset Sup be 3 bce 2 ce 2 Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 4

  5. Challenges of Freq Pat Mining • Multiple scans of transaction database • Huge number of candidates • Tedious workload of support counting for candidates Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 5

  6. Improving Apriori: Ideas • Reducing the number of transaction database scans • Shrinking the number of candidates • Facilitating support counting of candidates Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 6

  7. DIC: Reducing Number of Scans • Once both A and D are determined ABCD frequent, the counting of AD can begin ABC ABD ACD BCD • Once all length-2 subsets of BCD are determined frequent, the counting of BCD can begin AB AC BC AD BD CD Transactions 1-itemsets B C D A 2-itemsets Apriori … {} 1-itemsets Itemset lattice 2-items S. Brin R. Motwani, J. Ullman, DIC 3-items and S. Tsur, SIGMOD ’ 97. DIC: Dynamic Itemset Counting Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 7

  8. DHP: Reducing # of Candidates • A hashing bucket count < min_sup à every candidate in the bucket is infrequent – Candidates: a, b, c, d, e – Hash entries: {ab, ad, ae} {bd, be, de} … – Large 1-itemset: a, b, d, e – The sum of counts of {ab, ad, ae} < min_sup à ab should not be a candidate 2-itemset • J. Park, M. Chen, and P. Yu, SIGMOD ’ 95 – DHP: Direct Hashing and Pruning Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 8

  9. A 2-Scan Method by Partitioning • Partition the database into n partitions, such that each partition can be held into main memory • Itemset X is frequent à X must be frequent in at least one partition – Scan 1: partition database and find local frequent patterns – Scan 2: consolidate global frequent patterns • All local frequent itemsets can be held in main memory? A sometimes too strong assumption • A. Savasere, E. Omiecinski, and S. Navathe, VLDB ’ 95 Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 9

  10. Sampling for Frequent Patterns • Select a sample of the original database, mine frequent patterns in the sample using Apriori • Scan database once more to verify frequent itemsets found in the sample, only borders of closure of frequent patterns are checked – Example: check abcd instead of ab, ac, … , etc. • Scan database again to find missed frequent patterns • H. Toivonen, VLDB ’ 96 Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 10

  11. Eclat/MaxEclat and VIPER • Tid-list: the list of transaction-ids containing an itemset – Vertical Data Format • Major operation: intersections of tid-lists • Compression of tid-lists – Itemset A: t1, t2 t3, sup(A)=3 – Itemset B: t2, t3, t4, sup(B)=3 – Itemset AB: t2, t3, sup(AB)=2 • M. Zaki et al., 1997 • P. Shenoy et al., 2000 Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 11

  12. Bottleneck of Freq Pattern Mining • Multiple database scans are costly • Mining long patterns needs many scans and generates many candidates – To find frequent itemset i 1 i 2 … i 100 • # of scans: 100 100 100 100 ⎛ ⎞ ⎛ ⎞ ⎛ ⎞ • # of Candidates: 100 30 � 2 1 1 . 27 10 ⎜ ⎟ + ⎜ ⎟ + + ⎜ ⎟ = − ≈ × ⎜ ⎟ ⎜ ⎟ ⎜ ⎟ 1 2 100 ⎝ ⎠ ⎝ ⎠ ⎝ ⎠ – Bottleneck: candidate-generation-and-test • Can we avoid candidate generation? Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 12

  13. Search Space of Freq. Pat. Mining • Itemsets form a lattice ABCD ABC ABD ACD BCD AB AC BC AD BD CD A B C D {} Itemset lattice Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 13

  14. Set Enumeration Tree • Use an order on items, enumerate itemsets in lexicographic order – a, ab, abc, abcd, ac, acd, ad, b, bc, bcd, bd, c, dc, d ∅ • Reduce a lattice to a tree a b c d Set enumeration tree ab ac ad bc bd cd abc abd acd bcd abcd Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 14

  15. Borders of Frequent Itemsets • Frequent itemsets are connected – ∅ is trivially frequent – X on the border à every subset of X is frequent ∅ a b c d ab ac ad bc bd cd abc abd acd bcd abcd Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 15

  16. Projected Databases • To test whether Xy is frequent, we can use the X-projected database – The sub-database of transactions containing X – Check whether item y is frequent in X-projected database ∅ a b c d ab ac ad bc bd cd abc abd acd bcd abcd Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 16

  17. Compress Database by FP-tree root Header table item • The 1st scan: find f:4 c:1 f frequent items c:3 b:1 c b:1 a – Only record frequent a:3 p:1 b items in FP-tree m b:1 m:2 p – F-list: f-c-a-b-m-p p:2 m:1 • The 2nd scan: construct tree (ordered) TID Items bought freq items – Order frequent items in 100 f, a, c, d, g, I, m, p f, c, a, m, p each transaction w.r.t. f- 200 a, b, c, f, l,m, o f, c, a, b, m list 300 b, f, h, j, o f, b – Explore sharing among 400 b, c, k, s, p c, b, p transactions 500 a, f, c, e, l, p, m, n f, c, a, m, p Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 17

  18. Benefits of FP-tree • Completeness – Never break a long pattern in any transaction – Preserve complete information for freq pattern mining • Not scan database anymore • Compactness – Reduce irrelevant info — infrequent items are removed – Items in frequency descending order (f-list): the more frequently occurring, the more likely to be shared – Never be larger than the original database (not counting node-links and the count fields) Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 18

  19. Partitioning Frequent Patterns • Frequent patterns can be partitioned into subsets according to f-list: f-c-a-b-m-p – Patterns containing p – Patterns having m but no p – … – Patterns having c but no a nor b, m, or p – Pattern f • Depth-first search of a set enumeration tree – The partitioning is complete and does not have any overlap Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 19

  20. Find Patterns Having Item “ p ” • Only transactions containing p are needed • Form p -projected database – Starting at entry p of the header table – Follow the side-link of frequent item p – Accumulate all transformed prefix paths of p root Header p -projected database TDB| p table item f:4 fcam : 2 c:1 f cb : 1 c:3 b:1 c b:1 a Local frequent item: c :3 a:3 p:1 b Frequent patterns containing p m b:1 m:2 p : 3, pc : 3 p p:2 m:1 Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 20

  21. Find Pat Having Item m But No p • Form m-projected database TDB|m – Item p is excluded (why?) – Contain fca:2, fcab:1 – Local frequent items: f, c, a Header table • Build FP-tree for TDB|m root item Header root f:4 f c:1 table c c:3 item b:1 f:3 b:1 a f a:3 c:3 b p:1 c m a b:1 m:2 a:3 p m-projected FP-tree p:2 m:1 Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 21

  22. Recursive Mining • Patterns having m but no p can be mined recursively • Optimization: enumerate patterns from a single-branch FP-tree Header – Enumerate all combination table root item – Support = that of the last item f f:3 • m, fm, cm, am c c:3 • fcm, fam, cam a a:3 • fcam m -projected FP-tree Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 22

  23. Enumerate Patterns From Single Prefix of FP-tree • A (projected) FP-tree has a single prefix – Reduce the single prefix into one node – Join the mining results of the two parts root root r 1 a 1 :n 1 a 1 :n 1 a 2 :n 2 b 1 :m 1 c 1 :k 1 r = + Ú Ú a 2 :n 2 a 3 :n 3 a 3 :n 3 b 1 :m 1 c 1 :k 1 c 2 :k 2 c 3 :k 3 c 2 :k 2 c 3 :k 3 Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 23

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend