CS570 Data Mining Frequent Pattern Mining and Association Analysis - PowerPoint PPT Presentation

CS570 Data Mining Frequent Pattern Mining and Association Analysis 2 Cengiz Gunay Slide credits: Li Xiong, Jiawei Han and Micheline Kamber George Kollios 1

Mining Frequent Patterns and Association Analysis  Basic concepts  Efficient and scalable frequent itemset mining methods  Apriori (Agrawal & Srikant@VLDB’94) and variations  Frequent pattern growth (FPgrowth—Han, Pei & Yin @SIGMOD’00)  Algorithms using vertical format  Closed and maximal patterns and their mining method  Mining various kinds of association rules  From association mining to correlation analysis  Constraint-based association mining 2

Mining Frequent Patterns Without Candidate Generation  Basic idea: grow long patterns from short ones using local frequent items  “abc” is a frequent pattern  Get all transactions having “abc”: DB|abc  “d” is a local frequent item in DB|abc → abcd is a frequent pattern  FP-Growth  Construct FP-tree  Divide compressed database into a set of conditional databases and mine them separately 3

Construct FP-tree from a Transaction Database TID Items bought (ordered) frequent items 100 { f, a, c, d, g, i, m, p } { f, c, a, m, p } 200 { a, b, c, f, l, m, o } { f, c, a, b, m } 300 { b, f, h, j, o, w } { f, b } min_support = 3 400 { b, c, k, s, p } { c, b, p } 500 { a, f, c, e, l, p, m, n } { f, c, a, m, p } {} Header Table 1. Scan DB once, find frequent 1-itemsets f:4 c:1 Item frequency head (single item pattern) f 4 c 4 c:3 b:1 b:1 2. Sort frequent items in a 3 descending frequency b 3 a:3 p:1 order (f-list) m 3 p 3 3. Scan DB again, m:2 b:1 construct FP-tree F-list=f-c-a-b-m-p p:2 m:1 4

Benefits of the FP-tree Structure  Completeness  Preserve complete information for frequent pattern mining  Never break a long pattern of any transaction  Compactness  Reduce irrelevant info—infrequent items are gone  Items in frequency descending order: the more frequently occurring, the more likely to be shared  Never larger than the original database (not counting node-links and the count field)  For a Connect-4 Dataset, compression ratio could be over 100 5

Mining Frequent Patterns With FP-trees  Idea: Frequent pattern growth  Recursively grow frequent patterns by pattern and database partition  Method  For each frequent item, construct its conditional pattern-base, and then its conditional FP-tree  Repeat the process on each newly created conditional FP-tree  Until the resulting FP-tree is empty, or it contains only one path—single path will generate all the combinations of its sub-paths, each of which is a frequent pattern 6

Partition Patterns and Databases  Frequent patterns can be partitioned into subsets according to f-list: f-c-a-b-m-p  Patterns containing p  Patterns having m but no p  …  Patterns having c but no a nor b, m, p  Pattern f  Completeness and non-redundancy 7

Set Enumeration Tree of the Patterns  Depth-first recursive search  Pruning while building conditional patterns Φ (fcabmp) p (fcabm) b (fca) … m (fcab) mp (fcab) bp (fca) … bm (fca)… … fmp (cab) … … … … 8

Find Patterns Having p From p -conditional Database Start at the frequent item header table in the FP-tree  Traverse the FP-tree by following the link of each frequent item p  Accumulate all of transformed prefix paths of item p to form p’ s  conditional pattern base {} Header Table Conditional pattern bases f:4 c:1 Item frequency head itemcond. pattern base f 4 c:3 b:1 b:1 c 4 c f:3 a 3 a fc:3 b 3 a:3 p:1 b fca:1, f:1, c:1 m 3 p 3 m fca:2, fcab:1 m:2 b:1 p fcam:2, cb:1 p:2 m:1 9

From Conditional Pattern-bases to Conditional FP-trees Accumulate the count for each item in the base  Construct the FP-tree for the frequent items of the pattern base  Repeat the process on each newly created conditional FP-tree  until the resulting FP-tree is empty, or only one path p-conditional pattern base: fcam:2, cb:1 {} p-conditional FP-tree Header Table (min-support =3) Item frequency head f:4 c:1 All frequent patterns f 4 {} containing p c 4 c:3 b:1 b:1 p, → → a 3 cp c:3 a:3 p:1 b 3 m 3 m:2 b:1 p 3 p:2 m:1 10

Finding Patterns Having m  Construct m-conditional pattern-base, and then its conditional FP-tree  Repeat the process on each newly created conditional FP-tree until the resulting FP-tree is empty, or only one m-conditional pattern base: path fca:2, fcab:1 {} m-conditional FP-tree Header Table (min-support =3) All frequent Item frequency head f:4 c:1 patterns relate to m f 4 {} m, c 4 c:3 b:1 b:1 → fm, cm, am, → a 3 f:3 a:3 p:1 fcm, fam, cam, b 3 m 3 fcam c:3 m:2 b:1 p 3 a:3 p:2 m:1 11

FP-Growth vs. Apriori: Scalability With the Support Threshold Data set T25I20D10K 100 90 D1 FP-grow th runtime D1 Apriori runtime 80 70 Run tim e (se c 60 50 40 30 20 10 0 0 0.5 1 1.5 2 2.5 3 Support threshold(%) 12

Why Is FP-Growth the Winner?  Decompose both mining task and DB and leads to focused search of smaller databases  Use least frequent items as suffix (offering good selectivity) and find shorter patterns recursively and concatenate with suffix 13

Scalable Methods for Mining Frequent Patterns  Scalable mining methods for frequent patterns  Apriori (Agrawal & Srikant@VLDB’94) and variations  Frequent pattern growth (FPgrowth—Han, Pei & Yin @SIGMOD’00)  Algorithms using vertical format (ECLAT)  Closed and maximal patterns and their mining methods  FIMI Workshop and implementation repository 9/12/13 Data Mining: Concepts and Techniques 14 14

ECLAT  M. J. Zaki. Scalable algorithms for association mining. IEEE TKDE, 12, 2000.  For each item, store a list of transaction ids (tids) Horizontal Data Layout Vertical Data Layout A B C D E TID Items 1 1 2 2 1 1 A,B,E 4 2 3 4 3 2 B,C,D 5 5 4 5 6 3 C,E 6 7 8 9 4 A,C,D 7 8 9 5 A,B,C,D 8 10 6 A,E 9 7 A,B 8 A,B,C 9 A,C,D TID-list 10 B 15

ECLAT  Determine support of any k-itemset by intersecting tid-lists of two of its (k-1) subsets. AB A B 1 1 1 5 2 4 ∧ → 5 7 5 7 6 8 8 7 10 8 9  3 traversal approaches:  top-down, bottom-up and hybrid  Advantage: very fast support counting  Disadvantage: intermediate tid-lists may become too large for memory 16

Scalable Methods for Mining Frequent Patterns  Scalable mining methods for frequent patterns  Apriori (Agrawal & Srikant@VLDB’94) and variations  Frequent pattern growth (FPgrowth—Han, Pei & Yin @SIGMOD’00)  Algorithms using vertical data format (ECLAT)  Closed and maximal patterns and their mining methods  Concepts  Max-patterns: MaxMiner, MAFIA  Closed patterns: CLOSET, CLOSET+, CARPENTER  FIMI Workshop 17

Closed Patterns and Max-Patterns  A long pattern contains a combinatorial number of sub- patterns, e.g., {a1, …, a100} contains 2 100 -1 sub-patterns!  Solution: Mine “boundary” patterns  A frequent itemset X is: – closed if there exists no super-pattern Y כ X, with the same support as X (Pasquier, et al. @ ICDT’99) – a max-pattern if there exists no frequent super-pattern Y כ X (Bayardo @ SIGMOD’98)  Closed pattern is a lossless compression of freq. patterns and support counts 18

Max-patterns  Frequent patterns without frequent super patterns  BCDE, ACD are max-patterns  E.g. BCD, AD, CD is not a max-pattern Tid Items 10 A,B,C,D,E 20 B,C,D,E, 30 A,C,D,F Min_sup=2 19

Max-Patterns Illustration An itemset is maximal frequent if none of its immediate supersets is frequent Maximal Itemsets Infrequent Itemsets Border 20

Closed Patterns  An itemset is closed if none of its immediate supersets has the same support as the itemset Itemset Support {A,B,C} 2 {A,B,D} 3 {A,C,D} 2 {B,C,D} 3 {A,B,C,D} 2  Closed patterns: B: 5, {A,B}: 4, {B,D}: 4, {A,B,D}:3, {B,C,D}: 3, {A,B,C,D}: 2 21

Maximal vs Closed Itemsets 22

Example: Closed Patterns and Max-Patterns  DB = {<a1, …, a100>, < a1, …, a50>} Min_sup = 1  What is the set of closed itemsets? <a1, …, a100>: 1 < a1, …, a50>: 2  What is the set of max-patterns? <a1, …, a100>: 1  What is the set of all patterns?  !! 23

Scalable Methods for Mining Frequent Patterns  Scalable mining methods for frequent patterns  Apriori (Agrawal & Srikant@VLDB’94) and variations  Frequent pattern growth (FPgrowth—Han, Pei & Yin @SIGMOD’00)  Algorithms using vertical data format (ECLAT)  Closed and maximal patterns and their mining methods  Concepts  Max-pattern mining: MaxMiner, MAFIA  Closed pattern mining: CLOSET, CLOSET+, CARPENTER  FIMI Workshop 9/12/13 Data Mining: Concepts and Techniques 24 24

MaxMiner: Mining Max-patterns  R. Bayardo. Efficiently mining long patterns from databases. In SIGMOD’98  Idea: generate the complete set-enumeration tree one level at a time (breadth-first search), while pruning if applicable. Φ (ABCD) A (BCD) B (CD) C (D) D () AB (CD) AC (D) AD () BC (D) BD () CD () ABC (C) ABD () ACD () BCD () ABCD () 25

CS570 Data Mining Frequent Pattern Mining and Association Analysis - PowerPoint PPT Presentation

CS570 Data Mining Frequent Pattern Mining and Association Analysis 2 Cengiz Gunay Slide credits: Li Xiong, Jiawei Han and Micheline Kamber George Kollios 1 Mining Frequent Patterns and Association Analysis Basic concepts Efficient and

CS570 Introduction to Data Mining Frequent Pattern Mining and Association Analysis Cengiz Gunay

Web Mining Web Mining Web Mining Web Mining Web mining is the use of data mining techniques

CS570 Introduction to Data Mining Department of Mathematics and Computer Science Li Xiong Data

Introduction What is data mining? to Data Mining: On what kind of data? Data Mining

CS570 Introduction to Data Mining Department of Mathematics and Computer Science Li Xiong Today

CS570 Introduction to Data Mining Classification and Prediction Partial slide credits: Han and

CS570 Data Mining Classification: Ensemble Methods Cengiz Gnay Dept. Math & CS, Emory

Web Mining Web Mining Web mining is the use of data mining techniques to automatically

Introduction What is data mining? to Data mining functionalities Data Mining Major

Data mining Machine Intelligence Thomas D. Nielsen September 2008 Data mining September 2008

DATA MINING LECTURE 2 What is data? The data mining pipeline What is Data Mining? Data

Data Mining 2020 Frequent Pattern Mining (2) Ad Feelders Universiteit Utrecht October 2, 2020

Web MINING Web MINING Overview Overview Dr Ahmed Rafea Rafea Dr Ahmed 1 Web Mining Outline

LECTURE 1: INTRODUCTION TO DATA MINING Dr. Dhaval Patel CSE, IIT-Roorkee What is data mining?

Data Mining Based Detection Methods Data Mining in Intrusion detection Feng Pan Outline

DATA MINING LECTURE 1 Introduction What is data mining? After years of data mining there is

DSE 210: Probability and statistics Overview The kinds of questions well study I Design a spam

C(I)S 330: Applied Database Systems A Break: A Mini-Introduction to Data Mining (Some slides

CS70: Jean Walrand: Lecture 25. Balls and Coupons & Random Variables Coupons Random

Apriori How to generate candidates? Step 1: self-joining L k Step 2: pruning

1 On the right hand side of the screen you will see the webinar navigation bar. The red arrow

Associations and Frequent Item Analysis 1 Outline Transactions Frequent itemsets

1977: When Modern US Antitrust Began William E. Kovacic Kings College London Thursday Night

Not Every Pattern Is Interesting! Trivial patterns Pregnant Female 100% confidence

CS570 Data Mining Frequent Pattern Mining and Association Analysis - PowerPoint PPT Presentation

CS570 Data Mining Frequent Pattern Mining and Association Analysis 2 Cengiz Gunay Slide credits: Li Xiong, Jiawei Han and Micheline Kamber George Kollios 1 Mining Frequent Patterns and Association Analysis Basic concepts Efficient and

CS570 Introduction to Data Mining Frequent Pattern Mining and Association Analysis Cengiz Gunay

Web Mining Web Mining Web Mining Web Mining Web mining is the use of data mining techniques

CS570 Introduction to Data Mining Department of Mathematics and Computer Science Li Xiong Data

Introduction What is data mining? to Data Mining: On what kind of data? Data Mining

CS570 Introduction to Data Mining Department of Mathematics and Computer Science Li Xiong Today

CS570 Introduction to Data Mining Classification and Prediction Partial slide credits: Han and

CS570 Data Mining Classification: Ensemble Methods Cengiz Gnay Dept. Math &amp; CS, Emory

Web Mining Web Mining Web mining is the use of data mining techniques to automatically

Introduction What is data mining? to Data mining functionalities Data Mining Major

Data mining Machine Intelligence Thomas D. Nielsen September 2008 Data mining September 2008

DATA MINING LECTURE 2 What is data? The data mining pipeline What is Data Mining? Data

Data Mining 2020 Frequent Pattern Mining (2) Ad Feelders Universiteit Utrecht October 2, 2020

Web MINING Web MINING Overview Overview Dr Ahmed Rafea Rafea Dr Ahmed 1 Web Mining Outline

LECTURE 1: INTRODUCTION TO DATA MINING Dr. Dhaval Patel CSE, IIT-Roorkee What is data mining?

Data Mining Based Detection Methods Data Mining in Intrusion detection Feng Pan Outline

DATA MINING LECTURE 1 Introduction What is data mining? After years of data mining there is

DSE 210: Probability and statistics Overview The kinds of questions well study I Design a spam

C(I)S 330: Applied Database Systems A Break: A Mini-Introduction to Data Mining (Some slides

CS70: Jean Walrand: Lecture 25. Balls and Coupons &amp; Random Variables Coupons Random

Apriori How to generate candidates? Step 1: self-joining L k Step 2: pruning

1 On the right hand side of the screen you will see the webinar navigation bar. The red arrow

Associations and Frequent Item Analysis 1 Outline Transactions Frequent itemsets

1977: When Modern US Antitrust Began William E. Kovacic Kings College London Thursday Night

Not Every Pattern Is Interesting! Trivial patterns Pregnant Female 100% confidence

CS570 Data Mining Classification: Ensemble Methods Cengiz Gnay Dept. Math & CS, Emory

CS70: Jean Walrand: Lecture 25. Balls and Coupons & Random Variables Coupons Random