
Frequent Pattern Mining Overview: Basic Concepts and Challenges



  1. Data Mining Techniques: Frequent Patterns in Sets and Sequences. Mirek Riedewald. Some slides based on presentations by Han/Kamber and Tan/Steinbach/Kumar.

     Frequent Pattern Mining Overview
     • Basic Concepts and Challenges
     • Efficient and Scalable Methods for Frequent Itemsets and Association Rules
     • Pattern Interestingness Measures
     • Sequence Mining

  2. What Is Frequent Pattern Analysis?
     • Find patterns (itemsets, sequences, structures, etc.) that occur frequently in a data set
     • First proposed for frequent itemset and association rule mining
     • Motivation: find inherent regularities in data
       – What products were often purchased together?
       – What are the subsequent purchases after buying a PC?
       – What kinds of DNA are sensitive to a new drug?
     • Applications: market basket analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, DNA sequence analysis

     Association Rule Mining
     • Given a set of transactions, find rules that predict the occurrence of an item based on the occurrences of other items in the transaction.

     Market-basket transactions:

       TID | Items
       ----|--------------------------
       1   | Bread, Milk
       2   | Bread, Diaper, Beer, Eggs
       3   | Milk, Diaper, Beer, Coke
       4   | Bread, Milk, Diaper, Beer
       5   | Bread, Milk, Diaper, Coke

     Example association rules: {Diaper} → {Beer}, {Milk, Bread} → {Eggs, Coke}, {Beer, Bread} → {Milk}. Implication here means co-occurrence, not causality!

  3. Definition: Frequent Itemset
     • Itemset: a collection of one or more items, e.g., {Milk, Bread, Diaper}
       – k-itemset: an itemset that contains k items
     • Support count σ(X): frequency of occurrence of an itemset, e.g., σ({Milk, Bread, Diaper}) = 2 in the table above
     • Support s(X): fraction of transactions that contain the itemset, e.g., s({Milk, Bread, Diaper}) = 2/5
     • Frequent itemset: an itemset whose support is greater than or equal to a minsup threshold

     Definition: Association Rule
     • An association rule is an implication expression of the form X → Y, where X and Y are itemsets, e.g., {Milk, Diaper} → {Beer}
     • Rule evaluation metrics, shown for the example rule {Milk, Diaper} → {Beer}:
       – Support: s = P(X ∪ Y), estimated by the fraction of transactions that contain both X and Y:

         s = σ({Milk, Diaper, Beer}) / |D| = 2/5

       – Confidence: c = P(Y | X), estimated by the fraction of transactions containing X that also contain Y:

         c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3
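
These definitions are easy to make concrete in code. Below is a minimal Python sketch that checks support and confidence against the market-basket table above; the transaction contents come from the slides, while the function names (support_count, support, confidence) are illustrative choices, not part of the original material.

```python
# Minimal sketch: support and confidence over the example market-basket data.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, db):
    """sigma(X): number of transactions containing every item of X."""
    return sum(1 for t in db if itemset <= t)

def support(itemset, db):
    """s(X): fraction of transactions that contain X."""
    return support_count(itemset, db) / len(db)

def confidence(lhs, rhs, db):
    """c(X -> Y) = sigma(X u Y) / sigma(X)."""
    return support_count(lhs | rhs, db) / support_count(lhs, db)

print(support({"Milk", "Bread", "Diaper"}, transactions))      # 0.4 (= 2/5)
print(confidence({"Milk", "Diaper"}, {"Beer"}, transactions))  # 0.666... (= 2/3)
```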

  4. Association Rule Mining Task
     • Given a transaction database DB, find all rules having support ≥ minsup and confidence ≥ minconf
     • Brute-force approach:
       – List all possible association rules
       – Compute support and confidence for each rule
       – Remove rules that fail the minsup or minconf thresholds
       – Computationally prohibitive!

     Mining Association Rules
     • Example rules derived from the table above:
       – {Milk, Diaper} → {Beer} (s=0.4, c=0.67)
       – {Milk, Beer} → {Diaper} (s=0.4, c=1.0)
       – {Diaper, Beer} → {Milk} (s=0.4, c=0.67)
       – {Beer} → {Milk, Diaper} (s=0.4, c=0.67)
       – {Diaper} → {Milk, Beer} (s=0.4, c=0.5)
       – {Milk} → {Diaper, Beer} (s=0.4, c=0.5)
     • Observations:
       – All of the above rules are binary partitions of the same itemset {Milk, Diaper, Beer}
       – Rules originating from the same itemset have identical support but can have different confidence
       – Thus, we may decouple the support and confidence requirements (see the sketch below)
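
The observation can be verified mechanically. The sketch below, continuing from the previous snippet (it reuses transactions, support, and confidence; rules_from is an illustrative name), enumerates every binary partition of {Milk, Diaper, Beer} and recomputes the six rules above.

```python
from itertools import combinations

def rules_from(itemset, db):
    """Yield every rule X -> Y with X u Y = itemset and X, Y nonempty."""
    items = sorted(itemset)
    for r in range(1, len(items)):
        for lhs in combinations(items, r):
            X = set(lhs)
            Y = itemset - X
            yield X, Y, support(itemset, db), confidence(X, Y, db)

for X, Y, s, c in rules_from({"Milk", "Diaper", "Beer"}, transactions):
    print(f"{X} -> {Y}: s={s:.1f}, c={c:.2f}")  # s is 0.4 for all six rules
```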

  5. Mining Association Rules
     • Two-step approach:
       1. Frequent itemset generation: generate all itemsets that have support ≥ minsup
       2. Rule generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of the frequent itemset
     • Frequent itemset generation is still computationally expensive

     Frequent Itemset Generation
     • The candidate itemsets form a lattice, from the empty set (null) at the top, through the 1-itemsets A, B, C, D, E and all larger combinations, down to ABCDE at the bottom
     • Given d items, there are 2^d possible candidate itemsets (enumerated directly in the sketch below)
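
The 2^d blow-up is easiest to appreciate by writing the brute-force enumerator down. A minimal sketch, assuming the transactions list and support-counting logic from the earlier snippets; brute_force_frequent is an illustrative name:

```python
from itertools import chain, combinations

def brute_force_frequent(db, minsup_count):
    """Enumerate all 2^d - 1 nonempty candidate itemsets; keep the frequent ones."""
    items = sorted(set().union(*db))
    candidates = chain.from_iterable(
        combinations(items, k) for k in range(1, len(items) + 1))
    frequent = {}
    for c in candidates:                       # M = 2^d - 1 candidates...
        n = sum(1 for t in db if set(c) <= t)  # ...each matched against all N transactions
        if n >= minsup_count:
            frequent[c] = n
    return frequent

# The 6 distinct items of the table above already yield 2**6 - 1 = 63 candidates.
print(brute_force_frequent(transactions, 3))
```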

  6. Frequent Itemset Generation
     • Brute-force approach:
       – Each itemset in the lattice is a candidate frequent itemset
       – Count the support of each candidate by scanning the database: match each of the N transactions (of maximum width w) against every one of the M candidates
       – Complexity is O(N·M·w), which is expensive since M = 2^d

     Computational Complexity
     • Given d unique items, the total number of itemsets is 2^d
     • Total number of possible association rules (choose k items for the left-hand side, then j of the remaining d−k items for the right-hand side):

       R = Σ_{k=1}^{d−1} [ C(d,k) · Σ_{j=1}^{d−k} C(d−k,j) ] = 3^d − 2^{d+1} + 1

     • If d = 6, R = 602 possible rules
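
The closed form 3^d − 2^{d+1} + 1 can be sanity-checked by summing the double summation directly; the short sketch below does exactly that (total_rules is an illustrative name).

```python
from math import comb

def total_rules(d):
    """R = sum_{k=1}^{d-1} C(d,k) * sum_{j=1}^{d-k} C(d-k,j)."""
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

for d in range(2, 10):
    assert total_rules(d) == 3**d - 2**(d + 1) + 1  # closed form from the slide
print(total_rules(6))  # 602
```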

  7. Frequent Pattern Mining Overview
     • Basic Concepts and Challenges
     • Efficient and Scalable Methods for Frequent Itemsets and Association Rules
     • Pattern Interestingness Measures
     • Sequence Mining

     Frequent Itemset Generation Strategies
     • Reduce the number of candidates (M)
       – Complete search: M = 2^d
       – Use pruning techniques to reduce M
     • Reduce the number of transactions (N)
       – Skip short transactions as the size of the itemsets increases
     • Reduce the number of comparisons (N·M)
       – Use efficient data structures to store the candidates or transactions
       – No need to match every candidate against every transaction

  8. Reducing the Number of Candidates
     • Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent
     • The Apriori principle holds due to the following property of the support measure:

       ∀ X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)

     • The support of an itemset never exceeds the support of its subsets
     • This is known as the anti-monotone property of support (checked concretely in the sketch below)

     Illustrating the Apriori Principle
     • In the itemset lattice, once an itemset (e.g., AB) is found to be infrequent, all of its supersets (ABC, ABD, ..., up to ABCDE) are pruned without ever being counted
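
A concrete check of the anti-monotone property on the running example, reusing the transactions list and support helper from the earlier sketches:

```python
# Anti-monotonicity: enlarging an itemset can only shrink its set of
# supporting transactions, so s(X) >= s(Y) whenever X is a subset of Y.
X = {"Milk", "Diaper"}
Y = {"Milk", "Diaper", "Beer"}
assert X <= Y and support(X, transactions) >= support(Y, transactions)

# Contrapositive used for pruning: {Coke} is infrequent at minsup = 3/5,
# hence every superset of {Coke} is infrequent and never needs counting.
assert support({"Coke"}, transactions) < 3 / 5
```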

  9. Illustrating the Apriori Principle (minimum support count = 3)
     • Items (1-itemsets):

       Item   | Count
       -------|------
       Bread  | 4
       Coke   | 2
       Milk   | 4
       Beer   | 3
       Diaper | 4
       Eggs   | 1

     • Pairs (2-itemsets) — no need to generate candidates involving Coke or Eggs:

       Itemset         | Count
       ----------------|------
       {Bread, Milk}   | 3
       {Bread, Beer}   | 2
       {Bread, Diaper} | 3
       {Milk, Beer}    | 2
       {Milk, Diaper}  | 3
       {Beer, Diaper}  | 3

     • Triplets (3-itemsets):

       Itemset               | Count
       ----------------------|------
       {Bread, Milk, Diaper} | 3

     • If every subset up to size 3 were considered: C(6,1) + C(6,2) + C(6,3) = 41 candidates; with support-based pruning: 6 + 6 + 1 = 13

     Apriori Algorithm
     • Generate L_1 = frequent itemsets of length k = 1
     • Repeat until no new frequent itemsets are found:
       – Generate C_{k+1}, the length-(k+1) candidate itemsets, from L_k
       – Prune candidate itemsets in C_{k+1} containing subsets of length k that are not in L_k (and hence infrequent)
       – Count the support of each remaining candidate by scanning the DB; eliminate infrequent ones from C_{k+1}
       – L_{k+1} = C_{k+1}; k = k + 1
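
The pseudocode above maps almost line for line onto the following self-contained Python sketch; the function name apriori, the sorted-tuple representation of itemsets, and the inline join/prune steps are illustrative choices (the join and prune are spelled out in more detail after the next slide).

```python
from itertools import combinations

def apriori(db, minsup_count):
    """Level-wise mining of frequent itemsets, following the slide's pseudocode.
    Itemsets are sorted tuples; db is a list of sets."""
    items = sorted(set().union(*db))
    # L1: frequent 1-itemsets
    Lk = {(i,) for i in items if sum(1 for t in db if i in t) >= minsup_count}
    frequent = set(Lk)
    while Lk:
        k = len(next(iter(Lk)))
        # Generate C_{k+1}: join pairs of k-itemsets agreeing on their first k-1 items.
        Ck = {p[:k - 1] + (p[k - 1], q[k - 1])
              for p in Lk for q in Lk
              if p[:k - 1] == q[:k - 1] and p[k - 1] < q[k - 1]}
        # Prune candidates that have an infrequent k-subset.
        Ck = {c for c in Ck if all(s in Lk for s in combinations(c, k))}
        # Count supports of the survivors with one scan of the database.
        Lk = {c for c in Ck if sum(1 for t in db if set(c) <= t) >= minsup_count}
        frequent |= Lk
    return frequent
```

On the five-transaction table, apriori(transactions, 3) returns nine frequent itemsets: the four frequent items, the four frequent pairs, and ('Bread', 'Diaper', 'Milk'), matching the counts on the slide above.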

  10. Important Details of Apriori
     • How to generate candidates?
       – Step 1: self-joining L_k
       – Step 2: pruning
     • How to count the support of candidates?
     • Example of candidate generation for L_3 = { {a,b,c}, {a,b,d}, {a,c,d}, {a,c,e}, {b,c,d} }:
       – Self-joining L_3:
         • {a,b,c,d} from {a,b,c} and {a,b,d}
         • {a,c,d,e} from {a,c,d} and {a,c,e}
       – Pruning: {a,c,d,e} is removed because {a,d,e} is not in L_3
       – Result: C_4 = { {a,b,c,d} }

     How to Generate Candidates?
     • Step 1: self-joining L_{k-1} (items within an itemset are kept in sorted order):

       insert into C_k
       select p.item_1, p.item_2, ..., p.item_{k-1}, q.item_{k-1}
       from L_{k-1} p, L_{k-1} q
       where p.item_1 = q.item_1 AND ... AND p.item_{k-2} = q.item_{k-2}
             AND p.item_{k-1} < q.item_{k-1}

     • Step 2: pruning:

       forall itemsets c in C_k do
           forall (k-1)-subsets s of c do
               if (s is not in L_{k-1}) then delete c from C_k
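
In code, the self-join and prune translate to the following self-contained sketch (apriori_gen is an illustrative name; itemsets are kept as sorted tuples so the condition p.item_{k-1} < q.item_{k-1} becomes a simple comparison), reproducing the L_3 example above:

```python
from itertools import combinations

def apriori_gen(Lk):
    """Generate C_{k+1} from L_k: self-join on the first k-1 items, then prune."""
    k = len(next(iter(Lk)))
    joined = {p[:k - 1] + (p[k - 1], q[k - 1])
              for p in Lk for q in Lk
              if p[:k - 1] == q[:k - 1] and p[k - 1] < q[k - 1]}
    # Prune: a candidate survives only if all of its k-subsets are in L_k.
    return {c for c in joined if all(s in Lk for s in combinations(c, k))}

L3 = {("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"), ("a", "c", "e"), ("b", "c", "d")}
print(apriori_gen(L3))  # {('a','b','c','d')}; ('a','c','d','e') is pruned via ('a','d','e')
```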

  11. How to Count Supports of Candidates?
     • Why is counting the supports of candidates a problem?
       – The total number of candidates can be very large
       – One transaction may contain many candidates
     • Method:
       – Candidate itemsets are stored in a hash tree
       – Each leaf node contains a list of itemsets
       – Each interior node contains a hash table
       – A subset function finds all candidates contained in a given transaction

     Generate Hash Tree
     • Suppose we have 15 candidate itemsets of length 3: {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
     • We need:
       – A hash function (on the slide, items 1,4,7, items 2,5,8, and items 3,6,9 each hash to one of three branches)
       – A max leaf size: the maximum number of itemsets stored in a leaf node (if the number of candidate itemsets exceeds the max leaf size, split the node)
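
Such a hash tree can be sketched compactly. The code below is a minimal, assumption-laden version: the class name, the hash function h(i) = (i − 1) mod 3 (which sends 1,4,7 / 2,5,8 / 3,6,9 to the three branches drawn on the slide), and the leaf-splitting policy are illustrative, and the subset-matching function is omitted.

```python
class HashTreeNode:
    """Interior nodes hash on one item position; leaves hold candidate itemsets."""
    def __init__(self, depth=0, max_leaf_size=3):
        self.depth, self.max_leaf_size = depth, max_leaf_size
        self.children = None          # None while the node is still a leaf
        self.itemsets = []

    @staticmethod
    def h(item):
        return (item - 1) % 3         # 1,4,7 -> 0; 2,5,8 -> 1; 3,6,9 -> 2

    def insert(self, itemset):
        if self.children is None:
            self.itemsets.append(itemset)
            # Split an overfull leaf, re-hashing its itemsets one level deeper.
            if len(self.itemsets) > self.max_leaf_size and self.depth < len(itemset):
                old, self.itemsets = self.itemsets, []
                self.children = [HashTreeNode(self.depth + 1, self.max_leaf_size)
                                 for _ in range(3)]
                for s in old:
                    self.children[self.h(s[self.depth])].insert(s)
        else:
            self.children[self.h(itemset[self.depth])].insert(itemset)

candidates = [(1,4,5), (1,2,4), (4,5,7), (1,2,5), (4,5,8), (1,5,9), (1,3,6),
              (2,3,4), (5,6,7), (3,4,5), (3,5,6), (3,5,7), (6,8,9), (3,6,7), (3,6,8)]
root = HashTreeNode()
for c in candidates:
    root.insert(c)
```

During counting, the subset function would walk this tree once per transaction, hashing on successive items at each interior node, so a transaction is matched only against the few leaves it can reach rather than against all M candidates.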
