

  1. Data Mining Techniques, CS 6220 - Section 2 - Spring 2017, Lecture 2. Jan-Willem van de Meent (credit: Tan et al., Leskovec et al.)

  2. Frequent Itemsets & Association Rules (a.k.a. counting co-occurrences)

  3. The Market-Basket Model

     Input (transactions):
     TID | Items
     1   | Bread, Coke, Milk
     2   | Beer, Bread
     3   | Beer, Coke, Diaper, Milk
     4   | Beer, Bread, Diaper, Milk
     5   | Coke, Diaper, Milk

     Output (rules discovered):
     {Milk} → {Coke}
     {Diaper, Milk} → {Beer}

     • Baskets = sets of purchases; Items = products
     • Brick and mortar: track purchasing habits
     • Chain stores have TBs of transaction data
     • Tie-in “tricks”, e.g., a sale on diapers plus a raised price on beer
     • The rule needs to occur frequently, or there are no $$’s
     • Online: people who bought X also bought Y
     adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

  4. Examples: Plagiarism, Side-Effects
     • Baskets = sentences; Items = documents containing those sentences
     • Items that appear together too often could represent plagiarism
     • Notice that items do not have to be “in” baskets
     • Baskets = patients; Items = drugs & side-effects
     • Has been used to detect combinations of drugs that result in particular side-effects
     • Requires an extension: the absence of an item needs to be observed as well as its presence
     adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

  5. Example: Voting Records

     Association Rule                                                                      Confidence
     { budget resolution = no, MX-missile = no, aid to El Salvador = yes } → { Republican }    91.0%
     { budget resolution = yes, MX-missile = yes, aid to El Salvador = no } → { Democrat }     97.5%
     { crime = yes, right-to-sue = yes, physician fee freeze = yes } → { Republican }          93.5%
     { crime = no, right-to-sue = no, physician fee freeze = no } → { Democrat }               100%

     • Baskets = politicians; Items = party & votes
     • Can extract the set of votes most associated with each party (or faction within a party)
     adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

  6. Frequent Itemsets
     • Simplest question: find sets of items that appear together “frequently” in baskets
     • Support σ(X) for itemset X: the number of baskets containing all items in X
     • (Often expressed as a fraction of the total number of baskets)
     • Given a support threshold σ_min, the sets of items X with σ(X) ≥ σ_min are called frequent itemsets (see the sketch below)
     adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
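     A minimal Python sketch of this support count (the helper name support() and the basket encoding are mine; the baskets are taken from the example on the next slide):

```python
# Support sigma(X): the number of baskets that contain every item in X.
def support(itemset, baskets):
    x = set(itemset)
    return sum(1 for basket in baskets if x <= set(basket))

# The eight example baskets from the next slide.
baskets = [{"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
           {"m", "c", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"}]

print(support({"m", "c"}, baskets))                  # 3 (absolute count)
print(support({"m", "c"}, baskets) / len(baskets))   # 0.375 (as a fraction)
```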

  7. Example: Frequent Itemsets
     • Items = {milk, coke, pepsi, beer, juice}
     • Baskets:
       B1 = {m, c, b}    B2 = {m, p, j}      B3 = {m, b}       B4 = {c, j}
       B5 = {m, c, b}    B6 = {m, c, b, j}   B7 = {c, b, j}    B8 = {b, c}
     • Frequent itemsets (σ(X) ≥ 3):
       {m}:5, {c}:6, {b}:6, {j}:4, {m,c}:3, {m,b}:4, {c,b}:5, {c,j}:3, {m,c,b}:3

  8. Association Rules
     • If-then rules about the contents of baskets
     • {a1, a2, …, ak} → b means: “if a basket contains all of a1, …, ak then it is likely to contain b”
     • In practice there are many rules; we want to find the significant/interesting ones!
     • The confidence of an association rule is the probability of B = {b} given A = {a1, …, ak}:

       Support:     s(X → Y) = σ(X ∪ Y) / N
       Confidence:  c(X → Y) = σ(X ∪ Y) / σ(X)

     adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
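     A direct transcription of the two formulas, assuming the support() helper and baskets from the earlier sketch (the function names are mine):

```python
# s(X -> Y) = sigma(X u Y) / N  and  c(X -> Y) = sigma(X u Y) / sigma(X).
def rule_support(lhs, rhs, baskets):
    return support(set(lhs) | set(rhs), baskets) / len(baskets)

def rule_confidence(lhs, rhs, baskets):
    return support(set(lhs) | set(rhs), baskets) / support(lhs, baskets)
```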

  9. Interest of Association Rules
     • Not all high-confidence rules are interesting
     • The rule A → milk may have high confidence simply because milk is purchased very often (independently of A)
     • Interest Factor (or Lift) of a rule A → B:

       I(A, B) = s(A, B) / (s(A) × s(B)) = c(A → B) / s(B)

     adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

  10. Confidence and Interest
      B1 = {m, c, b}    B2 = {m, p, j}      B3 = {m, b}       B4 = {c, j}
      B5 = {m, c, b}    B6 = {m, c, b, j}   B7 = {c, b, j}    B8 = {b, c}
      • Association rule: {m} → b
      • Confidence = 4/5
      • Item b appears in 6/8 of the baskets, so
        Interest Factor = (4/5) / (6/8) = 16/15 ≈ 1.07
      • A lift this close to 1 means m and b are nearly independent: the rule is not very interesting! (checked in the sketch below)
      adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
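      Checking the numbers on this slide with the helpers from the earlier sketches:

```python
# Rule {m} -> {b} over the eight example baskets.
conf = rule_confidence({"m"}, {"b"}, baskets)           # 4/5 = 0.8
lift = conf / (support({"b"}, baskets) / len(baskets))  # (4/5) / (6/8) = 16/15
print(conf, lift)   # 0.8 1.0666... ; lift is close to 1, so not very interesting
```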

  11. Many Measures of Interest

      Contingency counts for a rule A → B over N transactions:
                  B      B̄
      A           f11    f10    f1+
      Ā           f01    f00    f0+
                  f+1    f+0    N

      Measure (Symbol)          Definition
      Goodman-Kruskal (λ)       (Σ_j max_k f_jk − max_k f_+k) / (N − max_k f_+k)
      Mutual Information (M)    [Σ_i Σ_j (f_ij / N) log(N f_ij / (f_i+ f_+j))] / [−Σ_i (f_i+ / N) log(f_i+ / N)]
      J-Measure (J)             (f11 / N) log(N f11 / (f1+ f+1)) + (f10 / N) log(N f10 / (f1+ f+0))
      Gini index (G)            (f1+ / N) [(f11 / f1+)² + (f10 / f1+)²] − (f+1 / N)² + (f0+ / N) [(f01 / f0+)² + (f00 / f0+)²] − (f+0 / N)²
      Laplace (L)               (f11 + 1) / (f1+ + 2)
      Conviction (V)            (f1+ f+0) / (N f10)
      Certainty factor (F)      (f11 / f1+ − f+1 / N) / (1 − f+1 / N)
      Added Value (AV)          f11 / f1+ − f+1 / N
      adapted from: Tan, Steinbach & Kumar, “Introduction to Data Mining”, http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf

  12. Mining Association Rules
      • Problem: find all association rules with support ≥ s and confidence ≥ c
      • Note: the support of an association rule is the support of the set of items on the left side
      • Hard part: finding the frequent itemsets!
      • If {i1, i2, …, ik} → j has high support and confidence, then both {i1, i2, …, ik} and {i1, i2, …, ik, j} will be “frequent”
      adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

  13. Finding Frequent Itemsets
      Given k products, how many possible itemsets are there?
      [Figure: the itemset lattice over five items {a, b, c, d, e}, from the null set at the top down to {a, b, c, d, e} at the bottom]
      adapted from: Tan, Steinbach & Kumar, “Introduction to Data Mining”, http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf

  14. Finding Frequent Itemsets
      Answer: 2^k − 1 (each item is either in or out of the set, minus the empty set), so we cannot enumerate all possible sets
      [Figure: the same itemset lattice over {a, b, c, d, e}]
      adapted from: Tan, Steinbach & Kumar, “Introduction to Data Mining”, http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf

  15. Observation: A-priori Principle
      [Figure: itemset lattice with a frequent itemset highlighted together with all of its subsets]
      Subsets of a frequent itemset are also frequent.

  16. Corollary: Pruning of Candidates
      [Figure: itemset lattice with an infrequent itemset highlighted; all of its supersets are pruned]
      If we know that a subset is not frequent, then we can ignore all its supersets.

  17. A-priori Algorithm

      Algorithm 6.1: Frequent itemset generation of the Apriori algorithm
       1: k = 1.
       2: F_k = { i | i ∈ I ∧ σ({i}) ≥ N × minsup }.   { Find all frequent 1-itemsets }
       3: repeat
       4:   k = k + 1.
       5:   C_k = apriori-gen(F_{k−1}).                { Generate candidate itemsets }
       6:   for each transaction t ∈ T do
       7:     C_t = subset(C_k, t).                    { Identify all candidates that belong to t }
       8:     for each candidate itemset c ∈ C_t do
       9:       σ(c) = σ(c) + 1.                       { Increment support count }
      10:     end for
      11:   end for
      12:   F_k = { c | c ∈ C_k ∧ σ(c) ≥ N × minsup }. { Extract the frequent k-itemsets }
      13: until F_k = ∅
      14: Result = ∪_k F_k.
      adapted from: Tan, Steinbach & Kumar, “Introduction to Data Mining”, http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf
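      A compact, runnable Python sketch of Algorithm 6.1 (my own transcription, not the book's code; candidate generation folds in the self-join and pruning steps described on the next slide, and minsup is taken as an absolute count, i.e. N × minsup in the book's notation):

```python
from itertools import combinations

def apriori(baskets, minsup):
    """Return {itemset: support count} for all frequent itemsets."""
    baskets = [frozenset(b) for b in baskets]
    items = {i for b in baskets for i in b}

    def count(itemset):
        return sum(1 for b in baskets if itemset <= b)

    # F_1: frequent 1-itemsets
    freq = {s: c for s in (frozenset([i]) for i in items)
            if (c := count(s)) >= minsup}
    result = dict(freq)
    while freq:
        prev = list(freq)
        # apriori-gen, step 1 (self-join): merge pairs of frequent
        # (k-1)-itemsets that differ by exactly one element
        cands = {a | b for a, b in combinations(prev, 2)
                 if len(a | b) == len(a) + 1}
        # apriori-gen, step 2 (prune): drop candidates that have an
        # infrequent (k-1)-subset
        cands = {c for c in cands if all(c - {i} in freq for i in c)}
        # count support of the surviving candidates in one pass
        freq = {c: n for c in cands if (n := count(c)) >= minsup}
        result.update(freq)
    return result
```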

  18. Generating Candidates C_k
      1. Self-joining: find pairs of sets in F_{k−1} that differ by one element
      2. Pruning: remove all candidates with an infrequent subset

  19. Example: Generating Candidates C_k
      B1 = {m, c, b}    B2 = {m, p, j}      B3 = {m, b}       B4 = {c, j}
      B5 = {m, c, b}    B6 = {m, c, b, j}   B7 = {c, b, j}    B8 = {b, c}
      • Frequent itemsets of size 2: {m,b}:4, {m,c}:3, {c,b}:5, {c,j}:3
      • Self-joining: {m,b,c}, {b,c,j}
      • Pruning: remove {b,c,j} since {b,j} is not frequent (see the run below)
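      Running the sketch from slide 17 on these baskets reproduces this slide (helper names from the earlier sketches):

```python
freq = apriori(baskets, minsup=3)
triples = {tuple(sorted(s)): n for s, n in freq.items() if len(s) == 3}
print(triples)   # {('b', 'c', 'm'): 3} -- {b,c,j} never reaches the counting
                 # step because its subset {b,j} is infrequent (count 2 < 3)
```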

  20. Compacting the Output
      • To reduce the number of rules, we can post-process them and only output:
      • Maximal frequent itemsets: no immediate superset is frequent
        (gives more pruning)
      • Closed itemsets: no immediate superset has the same count (> 0)
        (stores not only which itemsets are frequent, but their exact counts)
      J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
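      One possible post-processing pass over the apriori() output above (a sketch under the two definitions on this slide; restricting the superset check to frequent itemsets is safe, since a superset with the same count as a frequent itemset must itself be frequent):

```python
def maximal_and_closed(freq):
    """freq: {frozenset: count}. Returns (maximal, closed) itemset sets."""
    maximal, closed = set(), set()
    for s, n in freq.items():
        supersets = [t for t in freq if s < t]    # frequent proper supersets
        if not supersets:                         # no frequent superset at all
            maximal.add(s)
        if all(freq[t] != n for t in supersets):  # none with an equal count
            closed.add(s)
    return maximal, closed
```

      On the example on the next slide, apriori(baskets, minsup=3) gives two maximal itemsets, {c,j} and {m,c,b}, and every frequent itemset except {m,c} is closed.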

  21. Example: Maximal vs Closed
      B1 = {m, c, b}    B2 = {m, p, j}      B3 = {m, b}       B4 = {c, j}
      B5 = {m, c, b}    B6 = {m, c, b, j}   B7 = {c, b, j}    B8 = {b, c}

      Frequent itemsets (σ_min = 3):
      Itemset   Count   Closed?                           Maximal?
      {m}       5       yes                               no
      {c}       6       yes                               no
      {b}       6       yes                               no
      {j}       4       yes                               no
      {m,c}     3       no ({m,c,b} has the same count)   no
      {m,b}     4       yes                               no
      {c,b}     5       yes                               no
      {c,j}     3       yes                               yes
      {m,c,b}   3       yes                               yes

  22. Example: Maximal vs Closed
      [Figure: Venn diagram showing Maximal Frequent Itemsets ⊆ Closed Frequent Itemsets ⊆ Frequent Itemsets]

  23. Subset Matching
      Given a transaction t = {1, 2, 3, 5, 6} (items are sorted), what are the possible subsets of size 3?
      [Figure: prefix tree that enumerates all ten size-3 subsets level by level: Level 1 fixes the first item (1, 2, or 3), Level 2 the second, Level 3 the third, yielding 1 2 3, 1 2 5, 1 2 6, 1 3 5, …, 3 5 6]
      adapted from: Tan, Steinbach & Kumar, “Introduction to Data Mining”, http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf
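      For a sorted transaction, this level-wise enumeration is exactly the order that Python's itertools.combinations produces; a one-liner sketch:

```python
from itertools import combinations

t = [1, 2, 3, 5, 6]                 # the transaction from the figure, sorted
for subset in combinations(t, 3):   # (1,2,3), (1,2,5), ..., (3,5,6): 10 subsets
    print(subset)
```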
