
Frequent Pattern Mining



  1. Frequent Pattern Mining

  2. How Many Words Is a Picture Worth? E. Aiden and J-B Michel: Uncharted. Riverhead Books, 2013. Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

  3. Burnt or Burned? E. Aiden and J-B Michel: Uncharted. Riverhead Books, 2013.

  4. Store Layout Design
     http://buildipedia.com/images/masterformat/Channels/In_Studio/Todays_Grocery_Store/Todays_Grocery_Store_Layout-Figure_B.jpg

  5. Transaction Data
     • Alphabet: a set of items
       – Example: all products sold in a store
     • A transaction: a set of items involved in an activity
       – Example: the items purchased by a customer in a visit
     • Other information is often associated
       – Timestamp, price, salesperson, customer-id, store-id, …

  6. Examples of Transaction Data [examples not preserved in the transcript]

  7. How to Store Transaction Data?
     • Relational storage: one (Tid, Item) row per item

       Tid    Item
       t123   a
       t123   b
       t123   c
       t236   b
       t236   d
       …      …

     • Transaction-based storage: a transaction-id followed by its items
       – (t123, a, b, c)
       – (t236, b, d)
       – …
     • Item-based (vertical) storage
       – Item a: …, t123, …
       – Item b: …, t123, …, t236, …
       – …
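The three layouts can be sketched in Python as follows. This is only an illustration built from the two transactions on the slide; the variable names (`transactions`, `relational`, `vertical`) are assumptions, not part of the lecture.

```python
from collections import defaultdict

# Transaction-based (horizontal) storage: one record per transaction id.
transactions = {
    "t123": {"a", "b", "c"},
    "t236": {"b", "d"},
}

# Relational storage: one (tid, item) row per purchased item.
relational = sorted(
    (tid, item) for tid, items in transactions.items() for item in items
)

# Item-based (vertical) storage: each item maps to the ids of the
# transactions that contain it.
vertical = defaultdict(set)
for tid, item in relational:
    vertical[item].add(tid)
```

The vertical layout makes the support of a single item a simple set size, at the cost of a pass over the data to build it.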

  8. Transaction Data Analysis
     • Transactions: customers' purchases of commodities
       – {bread, milk, cheese} if they are bought together
     • Frequent patterns: product combinations that are frequently purchased together by customers
     • Frequent patterns: patterns (set of items, sequence, etc.) that occur frequently in a database [AIS93]

  9. Why Frequent Patterns?
     • What products were often purchased together?
     • What are the frequent subsequent purchases after buying an iPod?
     • What kinds of genes are sensitive to this new drug?
     • What key-word combinations are frequently associated with web pages about game-evaluation?

  10. Why Frequent Pattern Mining?
      • Foundation for many data mining tasks
        – Association rules, correlation, causality, sequential patterns, spatial and multimedia patterns, associative classification, cluster analysis, iceberg cube, …
      • Broad applications
        – Basket data analysis, cross-marketing, catalog design, sale campaign analysis, web log (click stream) analysis, …

  11. Frequent Itemsets
      • Itemset: a set of items
        – E.g., acm = {a, c, m}
      • Support of an itemset: the number of transactions that contain it
        – Sup(acm) = 3
      • Given min_sup = 3, acm is a frequent pattern
      • Frequent pattern mining: finding all frequent patterns in a database

      Transaction database TDB

      TID   Items bought
      100   f, a, c, d, g, i, m, p
      200   a, b, c, f, l, m, o
      300   b, f, h, j, o
      400   b, c, k, s, p
      500   a, f, c, e, l, p, m, n
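As a minimal sketch, support can be computed by a direct scan of the database. The function names `support` and `is_frequent` are assumptions made for illustration, and item names are written in lowercase.

```python
# The transaction database TDB from the slide, one set of items per TID.
TDB = {
    100: set("facdgimp"),
    200: set("abcflmo"),
    300: set("bfhjo"),
    400: set("bcksp"),
    500: set("afcelpmn"),
}

def support(itemset, db):
    """Number of transactions in db that contain every item of itemset."""
    items = set(itemset)
    return sum(1 for t in db.values() if items <= t)

def is_frequent(itemset, db, min_sup):
    """True if the itemset's support reaches the min_sup threshold."""
    return support(itemset, db) >= min_sup
```

Here `support("acm", TDB)` counts transactions 100, 200 and 500, matching Sup(acm) = 3.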

  12. A Naïve Attempt
      • Generate all possible itemsets, test their supports against the database
      • How to hold a large number of itemsets in main memory?
        – 100 items → 2^100 − 1 possible itemsets
      • How to test the supports of a huge number of itemsets against a large database, say containing 100 million transactions?
        – A transaction of length 20 needs to update the support of 2^20 − 1 = 1,048,575 itemsets
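The cost of the naive approach can be seen in a small sketch that lets every transaction update all of its non-empty subsets. The function names are assumptions, and the data is the five-transaction TDB from the Frequent Itemsets slide; this only works because those transactions are short.

```python
from collections import Counter
from itertools import combinations

def naive_frequent_itemsets(db, min_sup):
    """Count every non-empty subset of every transaction, then filter."""
    counts = Counter()
    for t in db:
        items = sorted(t)
        for k in range(1, len(items) + 1):
            for subset in combinations(items, k):
                counts[subset] += 1
    return {s: c for s, c in counts.items() if c >= min_sup}

def subsets_per_transaction(length):
    """Number of non-empty subsets a transaction of this length updates."""
    return 2 ** length - 1

TDB = [set("facdgimp"), set("abcflmo"), set("bfhjo"),
       set("bcksp"), set("afcelpmn")]
frequent = naive_frequent_itemsets(TDB, min_sup=3)
```

Each 8-item transaction already touches 255 counters; `subsets_per_transaction(20)` gives the 1,048,575 updates quoted above.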

  13. Transactions in Real Applications
      • A large department store often carries more than 100 thousand different kinds of items
        – Amazon.com carries more than 17,000 books relevant to data mining
      • Walmart has more than 20 million transactions per day, AT&T produces more than 275 million calls per day
      • Mining large transaction databases of many items is a real demand

  14. How to Get an Efficient Method?
      • Reducing the number of itemsets that need to be checked
      • Checking the supports of selected itemsets efficiently

  15. Candidate Generation & Test
      • Any subset of a frequent itemset must also be frequent – an anti-monotonic property
        – A transaction containing {beer, diaper, nuts} also contains {beer, diaper}
        – {beer, diaper, nuts} is frequent → {beer, diaper} must also be frequent
      • In other words, any superset of an infrequent itemset must also be infrequent
        – No superset of any infrequent itemset should be generated or tested
        – Many item combinations can be pruned!
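The anti-monotonic property can be checked on a toy example. The four baskets below are hypothetical data invented for this sketch, as is the helper name `support`.

```python
# Hypothetical basket data, invented only to illustrate the property.
baskets = [
    {"beer", "diaper", "nuts"},
    {"beer", "diaper"},
    {"beer", "diaper", "nuts", "milk"},
    {"milk"},
]

def support(itemset, db):
    """Number of baskets that contain every item of itemset."""
    items = set(itemset)
    return sum(1 for t in db if items <= t)

sup_small = support({"beer", "diaper"}, baskets)
sup_large = support({"beer", "diaper", "nuts"}, baskets)

# Every basket that contains the larger set also contains the smaller one,
# so the support can only stay equal or drop as items are added.
assert sup_large <= sup_small
```

With min_sup = 3, {beer, diaper} is frequent but {beer, diaper, nuts} is not, so no superset of {beer, diaper, nuts} needs to be generated.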

  16. Apriori-Based Mining
      • Generate length (k+1) candidate itemsets from length k frequent itemsets, and
      • Test the candidates against DB

  17. The Apriori Algorithm [AgSr94]
      Example with Min_sup = 2

      Database D
      TID   Items
      10    a, c, d
      20    b, c, e
      30    a, b, c, e
      40    b, e

      Scan D: 1-candidates          Freq 1-itemsets
      Itemset   Sup                 Itemset   Sup
      a         2                   a         2
      b         3                   b         3
      c         3                   c         3
      d         1                   e         3
      e         3

      2-candidates: ab, ac, ae, bc, be, ce

      Scan D: counting              Freq 2-itemsets
      Itemset   Sup                 Itemset   Sup
      ab        1                   ac        2
      ac        2                   bc        2
      ae        1                   be        3
      bc        2                   ce        2
      be        3
      ce        2

      3-candidates: bce

      Scan D: Freq 3-itemsets
      Itemset   Sup
      bce       2

  18. The Apriori Algorithm
      Level-wise, candidate generation and test
      • C_k: candidate itemsets of size k
      • L_k: frequent itemsets of size k

      L_1 = {frequent items};
      for (k = 1; L_k != ∅; k++) do
          C_{k+1} = candidates generated from L_k;      (candidate generation)
          for each transaction t in database do         (test)
              increment the count of all candidates in C_{k+1} that are contained in t
          L_{k+1} = candidates in C_{k+1} with min_support
      return ∪_k L_k;
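The level-wise loop can be sketched in Python as below. This is one possible reading of the pseudocode, not the original implementation: candidate generation here simply unions pairs of frequent k-itemsets and keeps the (k+1)-itemsets whose k-subsets are all frequent, and the function name `apriori` is an assumption. The data is database D from the worked example with Min_sup = 2.

```python
from itertools import combinations

def apriori(db, min_sup):
    """Return {frozenset: support} for all frequent itemsets in db."""
    transactions = [frozenset(t) for t in db]

    # L_1: count single items in one scan.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(level)

    k = 1
    while level:
        # Candidate generation: (k+1)-itemsets all of whose k-subsets
        # are frequent (simplified join plus pruning).
        candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(s) in level for s in combinations(union, k)
            ):
                candidates.add(union)

        # Test: one scan of the database to count the candidates.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        level = {s: c for s, c in counts.items() if c >= min_sup}
        frequent.update(level)
        k += 1
    return frequent

D = [set("acd"), set("bce"), set("abce"), set("be")]
result = apriori(D, min_sup=2)
```

Testing every candidate against every transaction is the simplest possible counting step; it is used here only to keep the sketch short.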

  19. Important Steps in Apriori
      • How to find frequent 1- and 2-itemsets?
      • How to generate candidates?
        – Step 1: self-joining L_k
        – Step 2: pruning
      • How to count supports of candidates?

  20. Finding Frequent 1- & 2-itemsets
      • Finding frequent 1-itemsets (i.e., frequent items) using a one-dimensional array
        – Initialize c[item] = 0 for each item
        – For each transaction T, for each item in T, c[item]++;
        – If c[item] >= min_sup, item is frequent
      • Finding frequent 2-itemsets using a 2-dimensional triangle matrix
        – For items i, j (i < j), c[i, j] is the count of itemset ij
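A minimal sketch of both counters follows, assuming items are encoded as integers 1..n. For simplicity it allocates a full (n+1) × (n+1) matrix and only uses the cells with i < j; the function name and the encoding a=1, b=2, c=3, d=4, e=5 of database D from the Apriori example are assumptions.

```python
from itertools import combinations

def count_items_and_pairs(db, n):
    """Count 1-itemsets in c1[item] and 2-itemsets in c2[i][j] for i < j."""
    c1 = [0] * (n + 1)                          # index 0 is unused
    c2 = [[0] * (n + 1) for _ in range(n + 1)]  # only cells with i < j are used
    for t in db:
        items = sorted(set(t))
        for item in items:
            c1[item] += 1
        for i, j in combinations(items, 2):     # sorted input guarantees i < j
            c2[i][j] += 1
    return c1, c2

# Database D with a=1, b=2, c=3, d=4, e=5.
D = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
c1, c2 = count_items_and_pairs(D, n=5)
frequent_items = [i for i in range(1, 6) if c1[i] >= 2]
```

A single scan fills both structures, which is why the 1- and 2-itemset levels are usually handled specially.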

  21. Counting Array
      • A 2-dimensional triangle matrix can be implemented using a 1-dimensional array
      • There are n items. For items i, j (i < j): c[i, j] = c[(i-1)(2n-i)/2 + j - i]
      • Example (n = 5): c[3, 5] = c[(3-1)*(2*5-3)/2 + 5 - 3] = c[9]

      Triangle matrix for n = 5 (each entry is the position in the 1-dimensional array):

      i \ j   1   2   3   4   5
        1         1   2   3   4
        2             5   6   7
        3                 8   9
        4                    10
        5

      1-dimensional array positions: 1 2 3 4 5 6 7 8 9 10
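The index formula can be checked with a short sketch. The function name `pair_index` is an assumption; positions are 1-based, as on the slide.

```python
def pair_index(i, j, n):
    """1-based position of the pair (i, j), i < j, in the flattened triangle."""
    if not 1 <= i < j <= n:
        raise ValueError("requires 1 <= i < j <= n")
    # (i - 1) * (2n - i) is always even, so integer division is exact.
    return (i - 1) * (2 * n - i) // 2 + j - i

n = 5
positions = [pair_index(i, j, n)
             for i in range(1, n + 1)
             for j in range(i + 1, n + 1)]
```

Enumerating all pairs in row order yields each position from 1 to n(n-1)/2 exactly once, so the array wastes no space.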

  22. Example of Candidate-generation
      • L_3 = {abc, abd, acd, ace, bcd}
      • Self-joining: L_3 * L_3
        – abcd ← abc * abd
        – acde ← acd * ace
      • Pruning:
        – acde is removed because ade is not in L_3
      • C_4 = {abcd}

  23. How to Generate Candidates?
      • Suppose the items in L_{k-1} are listed in an order
      • Step 1: self-join L_{k-1}
          INSERT INTO C_k
          SELECT p.item_1, p.item_2, …, p.item_{k-1}, q.item_{k-1}
          FROM L_{k-1} p, L_{k-1} q
          WHERE p.item_1 = q.item_1, …, p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1}
      • Step 2: pruning
          For each itemset c in C_k do
              For each (k-1)-subset s of c do
                  if (s is not in L_{k-1}) then delete c from C_k
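The join and prune steps can be sketched in Python, with itemsets represented as sorted tuples. The function name `apriori_gen` and the tuple representation are assumptions; the input is the L_3 from the candidate-generation example.

```python
from itertools import combinations

def apriori_gen(prev_level):
    """Build C_k from L_{k-1}, given as sorted tuples of length k-1."""
    prev = sorted(set(prev_level))
    prev_set = set(prev)
    candidates = []
    for p, q in combinations(prev, 2):
        # Self-join: p and q agree on their first k-2 items. Because prev is
        # sorted and p != q, the last item of p is smaller than that of q.
        if p[:-1] == q[:-1]:
            c = p + (q[-1],)
            # Pruning: drop c if any (k-1)-subset is not in L_{k-1}.
            if all(s in prev_set for s in combinations(c, len(c) - 1)):
                candidates.append(c)
    return candidates

L3 = [tuple("abc"), tuple("abd"), tuple("acd"), tuple("ace"), tuple("bcd")]
C4 = apriori_gen(L3)
```

The join produces abcd and acde; acde is pruned because ade is missing from L_3, leaving only abcd.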

  24. How to Count Supports?
      • Why is counting supports of candidates a problem?
        – The total number of candidates can be huge
        – One transaction may contain many candidates
      • Method
        – Candidate itemsets are stored in a hash-tree
        – A leaf node of the hash-tree contains a list of itemsets and counts
        – An interior node contains a hash table
        – Subset function: finds all the candidates contained in a transaction

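  25. Example: Counting Supports
      [Figure: hash tree of candidate 3-itemsets, with the subset function applied to one transaction]
      • Transaction: 1 2 3 5 6
      • Hash function at each interior node: three branches, for items 1,4,7; 2,5,8; and 3,6,9
      • Splits of the transaction shown in the figure: 1 + 2 3 5 6, 1 2 + 3 5 6, 1 3 + 5 6
      • Candidate itemsets stored in the leaves: 2 3 4, 5 6 7, 3 6 7, 1 4 5, 3 5 6, 3 4 5, 1 3 6, 3 6 8, 3 5 7, 6 8 9, 1 2 4, 1 2 5, 1 5 9, 4 5 7, 4 5 8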
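The sketch below reproduces the result of the subset function for this transaction, but it is not a hash tree: as a simpler stand-in, it enumerates the k-subsets of each transaction and looks them up in a Python dictionary. The function name `count_supports` is an assumption.

```python
from itertools import combinations

def count_supports(candidates, db, k):
    """Count candidate k-itemsets by enumerating each transaction's k-subsets."""
    counts = {tuple(sorted(c)): 0 for c in candidates}
    for t in db:
        for s in combinations(sorted(t), k):
            if s in counts:          # dictionary lookup replaces the hash tree
                counts[s] += 1
    return counts

candidates = [
    (2, 3, 4), (5, 6, 7), (3, 6, 7), (1, 4, 5), (3, 5, 6),
    (3, 4, 5), (1, 3, 6), (3, 6, 8), (3, 5, 7), (6, 8, 9),
    (1, 2, 4), (1, 2, 5), (1, 5, 9), (4, 5, 7), (4, 5, 8),
]
counts = count_supports(candidates, [[1, 2, 3, 5, 6]], k=3)
matched = sorted(c for c, n in counts.items() if n > 0)
```

This shortcut is only reasonable for short transactions, since a transaction of length m has C(m, k) subsets; the hash tree avoids enumerating subsets that cannot match any stored candidate.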

  26. Association Rules
      • Rule c → am
      • Support: 3 (i.e., the support of acm)
      • Confidence: 75% (i.e., sup(acm) / sup(c))
      • Given a minimum support threshold and a minimum confidence threshold, find all association rules whose support and confidence pass the thresholds

      Transaction database TDB

      TID   Items bought
      100   f, a, c, d, g, i, m, p
      200   a, b, c, f, l, m, o
      300   b, f, h, j, o
      400   b, c, k, s, p
      500   a, f, c, e, l, p, m, n
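The support and confidence of the rule c → am can be reproduced with a short sketch. The helper names `support` and `confidence` are assumptions, and item names are written in lowercase.

```python
TDB = [set("facdgimp"), set("abcflmo"), set("bfhjo"),
       set("bcksp"), set("afcelpmn")]

def support(itemset, db):
    """Number of transactions that contain every item of itemset."""
    items = set(itemset)
    return sum(1 for t in db if items <= t)

def confidence(lhs, rhs, db):
    """sup(lhs ∪ rhs) / sup(lhs); assumes lhs occurs at least once in db."""
    return support(set(lhs) | set(rhs), db) / support(lhs, db)

rule_support = support("cam", TDB)
rule_confidence = confidence("c", "am", TDB)
```

Item c appears in transactions 100, 200, 400 and 500, and acm in three of them, which gives 3 / 4 = 75%.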

  27. To-Do List
      • Read Sections 6.1, 6.2.1 and 6.2.2 in the textbook
      • Understand the concept of frequent itemsets and association rules
      • Understand algorithm Apriori
      • Figure out how to use Weka to mine frequent itemsets
