Mining Frequent Patterns, Associations and Correlations (Week 3)


  1. Mining Frequent Patterns, Associations and Correlations (Week 3)

  2. Team Homework Assignment #2
      • Read pp. 285 – 300 of the textbook.
      • Do Example 6.1. Prepare to present the results of the homework assignment.
      • Due date – beginning of the lecture on Friday, February 18th.

  3. Team Homework Assignment #3
      • Prepare a one-page description of your group project topic
      • Prepare slides for the presentation
      • Due date – beginning of the lecture on Friday, February 11th.

  4. [Images: http://www.lucyluvs.com/images/fittedXLpooh.JPG and http://www.mondobirra.org/sfondi/BudLight.sized.jpg]

  5. cell_cycle -> [+]Exp1, [+]Exp2, [+]Exp3, [+]Exp4, support = 52.94% (9 genes)
      apoptosis -> [+]Exp6, [+]Exp7, [+]Exp8, support = 76.47% (13 genes)
      http://www.cnb.uam.es/~pcarmona/assocrules/imag4.JPG

  6. Table 8.3 The substitution matrix of amino acids. Figure 8.8 Scoring two potential pairwise alignments, (a) and (b), of amino acids.

  7. Figure 9.1 A sample graph data set. Figure 9.2 Frequent graph.

  8. Figure 9.14 A chemical database.

  9. What Is Frequent Pattern Analysis?
      • Frequent pattern: a pattern (itemsets, subsequences, substructures, etc.) that occurs frequently in a data set
      • First proposed by Agrawal, Imielinski, and Swami in 1993, in the context of frequent itemsets and association rule mining

  10. Why Is Frequent Pattern Mining Important?
      • Discloses an intrinsic and important property of data sets
      • Forms the foundation for many essential data mining tasks and applications
        – What products were often purchased together? Beer and diapers?
        – What are the subsequent purchases after buying a PC?
        – What kinds of DNA are sensitive to this new drug?
        – Can we automatically classify web documents?

  11. Topics of Frequent Pattern Mining (1)
      • Based on the kinds of patterns to be mined
        – Frequent itemset mining
        – Sequential pattern mining
        – Structured pattern mining

  12. Topics of Frequent Pattern Mining (2)
      • Based on the levels of abstraction involved in the rule set
        – Single-level association rules
        – Multi-level association rules

  13. Topics of Frequent Pattern Mining (3)
      • Based on the number of data dimensions involved in the rule
        – Single-dimensional association rules
        – Multi-dimensional association rules

  14. Association Rule Mining Process
      • Find all frequent itemsets
        – Join steps
        – Prune steps
      • Generate "strong" association rules from the frequent itemsets

  15. Basic Concepts of Frequent Itemsets
      • Let I = {I1, I2, …, Im} be a set of items
      • Let D, the task-relevant data, be a set of database transactions where each transaction T is a set of items such that T ⊆ I
      • Each transaction is associated with an identifier, called TID
      • Let A be a set of items
      • A transaction T is said to contain A if and only if A ⊆ T
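These definitions map directly onto set operations. A minimal Python sketch (hypothetical, not part of the slides; it borrows two of the Table 5.1 transactions shown later in the deck):

```python
# The item set I and a tiny task-relevant database D (TID -> transaction T, T ⊆ I).
I = {"I1", "I2", "I3", "I4", "I5"}
D = {
    "T100": {"I1", "I2", "I5"},
    "T200": {"I2", "I4"},
}

# A transaction T contains an itemset A if and only if A ⊆ T.
A = {"I1", "I2"}
containing = [tid for tid, T in D.items() if A <= T]  # '<=' is the subset test
print(containing)  # ['T100']; T200 does not contain A
```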

  16. How to Generate Frequent Itemsets?
      • Suppose the items in Lk-1 are listed in an order
      • The join step: To find Lk, a set of candidate k-itemsets, Ck, is generated by joining Lk-1 with itself. Let l1 and l2 be itemsets in Lk-1. The resulting itemset formed by joining l1 and l2 is l1[1], l1[2], …, l1[k-2], l1[k-1], l2[k-1]
      • The prune step: Scan data set D and compare the support count of each candidate in Ck with the minimum support count. Remove candidate itemsets whose support count is less than the minimum support count, resulting in Lk.
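A hedged Python sketch of the two steps as described on this slide (itemsets as sorted tuples, transactions as sets; the function and variable names are mine):

```python
def join_step(L_prev, k):
    """Join L_{k-1} with itself to form the candidate k-itemsets C_k.
    Two (k-1)-itemsets are joinable when their first k-2 items agree."""
    prev = sorted(tuple(sorted(l)) for l in L_prev)
    C_k = set()
    for i, l1 in enumerate(prev):
        for l2 in prev[i + 1:]:
            if l1[:k - 2] == l2[:k - 2]:          # l1[1..k-2] == l2[1..k-2]
                C_k.add(tuple(sorted(set(l1) | set(l2))))
    return C_k

def prune_step(C_k, D, min_sup_count):
    """Scan data set D, count the support of each candidate in C_k, and keep
    only those meeting the minimum support count, giving L_k."""
    counts = {c: 0 for c in C_k}
    for T in D.values():
        for c in C_k:
            if set(c) <= T:                        # T contains the candidate
                counts[c] += 1
    return {c for c, n in counts.items() if n >= min_sup_count}
```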

  17. Apriori Algorithm
      • Initially, scan DB once to get the frequent 1-itemsets
      • Generate length-(k+1) candidate itemsets from length-k frequent itemsets
      • Prune length-(k+1) candidate itemsets with the Apriori property
        – Apriori property: all nonempty subsets of a frequent itemset must also be frequent
      • Test the candidates against DB
      • Terminate when no frequent or candidate set can be generated
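Putting the pieces together, a compact Python version of this loop (an assumed sketch, not the book's exact pseudocode):

```python
from collections import defaultdict
from itertools import combinations

def apriori(D, min_sup_count):
    """Return a dict mapping each frequent itemset (sorted tuple) to its
    support count. D maps TID -> set of items."""
    # Initial scan of DB: frequent 1-itemsets
    counts = defaultdict(int)
    for T in D.values():
        for item in T:
            counts[(item,)] += 1
    L = {c: n for c, n in counts.items() if n >= min_sup_count}
    frequent = dict(L)

    k = 2
    while L:
        # Generate length-k candidates from length-(k-1) frequent itemsets
        prev = sorted(L)
        C = set()
        for i, l1 in enumerate(prev):
            for l2 in prev[i + 1:]:
                if l1[:k - 2] == l2[:k - 2]:
                    cand = tuple(sorted(set(l1) | set(l2)))
                    # Apriori property: every (k-1)-subset of a frequent
                    # k-itemset must itself be frequent
                    if all(sub in L for sub in combinations(cand, k - 1)):
                        C.add(cand)
        # Test the candidates against DB
        counts = defaultdict(int)
        for T in D.values():
            for c in C:
                if set(c) <= T:
                    counts[c] += 1
        L = {c: n for c, n in counts.items() if n >= min_sup_count}
        frequent.update(L)
        k += 1                  # terminates when no candidates survive
    return frequent
```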

  18. Figure 5.4 The Apriori algorithm for discovering frequent itemsets for mining Boolean association rules.

  19. Transactional Database
      TID    List of item_IDs
      T100   I1, I2, I5
      T200   I2, I4
      T300   I2, I3
      T400   I1, I2, I4
      T500   I1, I3
      T600   I2, I3
      T700   I1, I3
      T800   I1, I2, I3, I5
      T900   I1, I2, I3
      Table 5.1 Transactional data for an AllElectronics branch.

  20. Minimum support count = 2. Figure 5.2 Generation of candidate itemsets and frequent itemsets, where the minimum support count is 2.
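The same run can be reproduced with the `apriori()` sketch given after slide 17, feeding it the Table 5.1 transactions and a minimum support count of 2 (a usage illustration, not the textbook's trace):

```python
D = {
    "T100": {"I1", "I2", "I5"},
    "T200": {"I2", "I4"},
    "T300": {"I2", "I3"},
    "T400": {"I1", "I2", "I4"},
    "T500": {"I1", "I3"},
    "T600": {"I2", "I3"},
    "T700": {"I1", "I3"},
    "T800": {"I1", "I2", "I3", "I5"},
    "T900": {"I1", "I2", "I3"},
}

frequent = apriori(D, min_sup_count=2)   # apriori() from the slide-17 sketch
for itemset in sorted(frequent, key=lambda s: (len(s), s)):
    print(itemset, frequent[itemset])
# The largest frequent itemsets should match Figure 5.2:
# ('I1', 'I2', 'I3') and ('I1', 'I2', 'I5'), each with support count 2.
```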

  21. Generating Strong Association Rules
      • From the frequent itemsets
      • For each frequent itemset l, generate all nonempty subsets of l
      • For every nonempty subset s of l, output the rule "s ⇒ (l – s)" if support_count(l) / support_count(s) ≥ min_conf, where min_conf is the minimum confidence threshold
      • Rules that satisfy both a minimum support threshold and a minimum confidence threshold are called strong
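A sketch of this rule-generation step (assuming `frequent` maps every frequent itemset, as a sorted tuple, to its support count, e.g. the output of the `apriori()` sketch above):

```python
from itertools import chain, combinations

def strong_rules(frequent, min_conf):
    """For each frequent itemset l and each nonempty proper subset s of l,
    output s => (l - s) when support_count(l) / support_count(s) >= min_conf."""
    rules = []
    for l, sup_l in frequent.items():
        if len(l) < 2:
            continue
        # every nonempty proper subset s of l; subsets of a frequent itemset
        # are themselves frequent, so their counts are present in `frequent`
        for s in chain.from_iterable(combinations(l, r) for r in range(1, len(l))):
            conf = sup_l / frequent[s]
            if conf >= min_conf:
                rules.append((s, tuple(sorted(set(l) - set(s))), conf))
    return rules
```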

  22. Support
      • The rule A ⇒ B holds in the transaction set D with support s
      • support, s: the probability that a transaction contains both A and B
      • support(A ⇒ B) = P(A ∪ B)

  23. Confidence
      • The rule A ⇒ B has confidence c in the transaction set D
      • confidence, c: the conditional probability that a transaction having A also contains B
      • confidence(A ⇒ B) = P(B | A) = support(A ∪ B) / support(A) = support_count(A ∪ B) / support_count(A)
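As a worked illustration (mine, not from the slide) using the Table 5.1 data from slide 19: I1 appears in 6 of the 9 transactions and {I1, I2} in 4 of them, so support(I1 ⇒ I2) = P(I1 ∪ I2) = 4/9 ≈ 44%, while confidence(I1 ⇒ I2) = support_count(I1 ∪ I2) / support_count(I1) = 4/6 ≈ 67%.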

  24. Generating Association Rules from Frequent Itemsets
      • Example 5.4: Suppose the data contain the frequent itemset l = {I1, I2, I5}. What are the association rules that can be generated from l? If the minimum confidence threshold is 70%, which rules are strong?
        – I1 ^ I2 -> I5, confidence = 2/4 = 50%
        – I1 ^ I5 -> I2, confidence = 2/2 = 100%
        – I2 ^ I5 -> I1, confidence = 2/2 = 100%
        – I1 -> I2 ^ I5, confidence = 2/6 = 33%
        – I2 -> I1 ^ I5, confidence = 2/7 = 29%
        – I5 -> I1 ^ I2, confidence = 2/2 = 100%
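These confidences can be checked mechanically. A small self-contained sketch (the support counts below are read off Table 5.1; the names are mine):

```python
from itertools import chain, combinations

sup = {                                    # support counts from Table 5.1
    ("I1",): 6, ("I2",): 7, ("I5",): 2,
    ("I1", "I2"): 4, ("I1", "I5"): 2, ("I2", "I5"): 2,
    ("I1", "I2", "I5"): 2,
}
l = ("I1", "I2", "I5")
for s in chain.from_iterable(combinations(l, r) for r in range(1, len(l))):
    rhs = tuple(sorted(set(l) - set(s)))
    conf = sup[l] / sup[s]
    verdict = "strong" if conf >= 0.70 else "not strong"
    print(f"{' ^ '.join(s)} -> {' ^ '.join(rhs)}: confidence = {conf:.0%} ({verdict})")
```

Only the three 100%-confidence rules clear the 70% threshold, matching the list above.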

  25. Exercise 5.3 A database has five transactions. Let min_sup = 60% and min_conf = 80%.
      TID    Items_bought
      T100   {M, O, N, K, E, Y}
      T200   {D, O, N, K, E, Y}
      T300   {M, A, K, E}
      T400   {M, U, C, K, Y}
      T500   {C, O, O, K, I, E}
      (a) Find all frequent itemsets.
      (b) List all of the strong association rules (with support s and confidence c) matching the following meta-rule, where X is a variable representing customers and item_i denotes variables representing items (e.g., "A", "B", etc.):
          ∀X ∈ transaction, buys(X, item_1) ∧ buys(X, item_2) ⇒ buys(X, item_3) [s, c]

  26. Challenges of Frequent Pattern Mining
      • Challenges
        – Multiple scans of the transaction database
        – Huge number of candidates
        – Tedious workload of support counting for candidates
      • Improving Apriori
        – Reduce passes of transaction database scans
        – Shrink the number of candidates
        – Facilitate support counting of candidates

  27. Advanced Methods for Mining Frequent Itemsets
      • Mining frequent itemsets without candidate generation
        – Frequent-pattern growth (FP-growth; Han, Pei & Yin @SIGMOD'00)
      • Mining frequent itemsets using the vertical data format
        – Vertical data format approach (ECLAT; Zaki @IEEE-TKDE'00)

  28. Mining Various Kinds of Association Rules
      • Mining multilevel association rules
      • Mining multidimensional association rules
