CS6220: DATA MINING TECHNIQUES Set Data: Frequent Pattern Mining

CS6220: DATA MINING TECHNIQUES Set Data: Frequent Pattern Mining Instructor: Yizhou Sun yzsun@ccs.neu.edu October 26, 2014

Reminder
• Midterm
  • Next Monday (Nov. 3), 2 hours (6-8pm), in class
  • Closed-book exam; one A4-size cheat sheet is allowed
  • Bring a calculator (NO cell phones)
  • Covers material up to and including today's lecture

Quiz of Last Week
1. What are the advantages and disadvantages of k-medoids over k-means?
2. Suppose under a parameter setting for DBSCAN, we get the following clustering results. How shall we change the two parameters (eps and minpts) if we want to get two clusters?
[Figure: scatter plot of the DBSCAN clustering result]
Answer: Increase eps or reduce minpts!

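Methods to Learn
Data types: Matrix Data; Set Data; Sequence Data; Time Series; Graph & Network
• Classification
  • Matrix Data: Decision Tree; Naïve Bayes; Logistic Regression; SVM; kNN
  • Sequence Data: HMM
  • Graph & Network: Label Propagation
• Clustering
  • Matrix Data: K-means; hierarchical clustering; DBSCAN; Mixture Models; kernel k-means
  • Graph & Network: SCAN; Spectral Clustering
• Frequent Pattern Mining
  • Set Data: Apriori; FP-growth
  • Sequence Data: GSP; PrefixSpan
• Prediction
  • Matrix Data: Linear Regression
  • Time Series: Autoregression
• Similarity Search
  • Time Series: DTW
  • Graph & Network: P-PageRank
• Ranking
  • Graph & Network: PageRank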

Mining Frequent Patterns, Association and Correlations
• Basic Concepts
• Frequent Itemset Mining Methods
• Pattern Evaluation Methods
• Summary

Set Data
• A data point corresponds to a set of items

Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

What Is Frequent Pattern Analysis?
• Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
• First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
• Motivation: Finding inherent regularities in data
  • What products were often purchased together? Beer and diapers?!
  • What are the subsequent purchases after buying a PC?
  • What kinds of DNA are sensitive to this new drug?

Why Is Freq. Pattern Mining Important?
• Freq. pattern: An intrinsic and important property of datasets
• Foundation for many essential data mining tasks
  • Association, correlation, and causality analysis
  • Sequential, structural (e.g., sub-graph) patterns
  • Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
  • Classification: discriminative, frequent pattern analysis
  • Cluster analysis: frequent pattern-based clustering
• Broad applications

Basic Concepts: Frequent Patterns

Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

• itemset: A set of one or more items
• k-itemset: $X = \{x_1, \ldots, x_k\}$
• (absolute) support, or support count, of X: the frequency (number of occurrences) of the itemset X
• (relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
• An itemset X is frequent if X's support is no less than a minsup threshold
[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]
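The two notions of support can be checked directly on the five transactions in the table. The following is a minimal Python sketch; the function names `support_count`, `relative_support`, and `is_frequent` are illustrative choices for this note, not part of the lecture.

```python
# Transactions from the slide, keyed by Tid.
transactions = {
    10: {"Beer", "Nuts", "Diaper"},
    20: {"Beer", "Coffee", "Diaper"},
    30: {"Beer", "Diaper", "Eggs"},
    40: {"Nuts", "Eggs", "Milk"},
    50: {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
}

def support_count(itemset, db):
    """Absolute support: number of transactions containing every item of itemset."""
    items = set(itemset)
    return sum(1 for t in db.values() if items <= t)

def relative_support(itemset, db):
    """Relative support: fraction of transactions containing the itemset."""
    return support_count(itemset, db) / len(db)

def is_frequent(itemset, db, minsup):
    """An itemset is frequent if its relative support is no less than minsup."""
    return relative_support(itemset, db) >= minsup

print(support_count({"Beer", "Diaper"}, transactions))     # 3
print(relative_support({"Beer", "Diaper"}, transactions))  # 0.6
print(is_frequent({"Milk"}, transactions, 0.5))            # False
```

Storing each transaction as a Python `set` makes the containment test a single subset comparison (`items <= t`).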

Basic Concepts: Association Rules

Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

• Find all the rules X → Y with minimum support and confidence
  • support, s: probability that a transaction contains X ∪ Y
  • confidence, c: conditional probability that a transaction having X also contains Y
• Let minsup = 50%, minconf = 50%
  • Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
• Strong association rules
  • Beer → Diaper (60%, 100%)
  • Diaper → Beer (60%, 75%)
[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]
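The support and confidence of the two rules above can be reproduced with a few lines of Python. This is a minimal sketch; `rule_metrics` is a hypothetical helper name, and it assumes the left-hand side of the rule occurs in at least one transaction.

```python
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def count(itemset, db):
    """Number of transactions that contain every item of itemset."""
    items = set(itemset)
    return sum(1 for t in db if items <= t)

def rule_metrics(lhs, rhs, db):
    """Return (support, confidence) of the rule lhs -> rhs.

    support    = P(lhs and rhs together) = count(lhs | rhs) / |db|
    confidence = P(rhs | lhs)            = count(lhs | rhs) / count(lhs)
    Assumes lhs occurs in at least one transaction.
    """
    both = count(set(lhs) | set(rhs), db)
    return both / len(db), both / count(lhs, db)

print(rule_metrics({"Beer"}, {"Diaper"}, transactions))   # (0.6, 1.0)
print(rule_metrics({"Diaper"}, {"Beer"}, transactions))   # (0.6, 0.75)
```

Both rules share the same support because support depends only on X ∪ Y; only the confidence depends on the direction of the rule.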

Closed Patterns and Max-Patterns
• A long pattern contains a combinatorial number of sub-patterns, e.g., $\{a_1, \ldots, a_{100}\}$ contains $2^{100} - 1 \approx 1.27 \times 10^{30}$ sub-patterns!
• Solution: Mine closed patterns and max-patterns instead
• An itemset X is closed if X is frequent and there exists no super-pattern $Y \supset X$ with the same support as X (proposed by Pasquier, et al. @ ICDT'99)
• An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern $Y \supset X$ (proposed by Bayardo @ SIGMOD'98)
• Closed pattern is a lossless compression of freq. patterns
  • Reducing the # of patterns and rules
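To make the two definitions concrete, here is a small brute-force sketch in Python that applies them to the five shopping transactions used earlier in the lecture. The threshold of 3 transactions and the function names are assumptions made for this illustration; the brute-force enumeration is only suitable for toy data.

```python
from itertools import combinations

transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def frequent_itemsets(db, min_count):
    """Brute force: count every nonempty subset of every transaction."""
    counts = {}
    for t in db:
        for k in range(1, len(t) + 1):
            for combo in combinations(sorted(t), k):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    return {s: c for s, c in counts.items() if c >= min_count}

def closed_patterns(freq):
    """Frequent itemsets with no proper superset of the same support.

    A superset with the same support is itself frequent, so it is
    enough to search among the frequent itemsets.
    """
    return {x: c for x, c in freq.items()
            if not any(x < y and c == cy for y, cy in freq.items())}

def max_patterns(freq):
    """Frequent itemsets with no frequent proper superset."""
    return {x: c for x, c in freq.items() if not any(x < y for y in freq)}

freq = frequent_itemsets(transactions, min_count=3)
print(sorted(sorted(s) for s in closed_patterns(freq)))
print(sorted(sorted(s) for s in max_patterns(freq)))
```

On this data, {Beer} is frequent but not closed, because {Beer, Diaper} has the same support of 3; {Diaper} is closed (support 4) but not a max-pattern, because {Beer, Diaper} is also frequent.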

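Closed Patterns and Max-Patterns
• Exercise. DB = $\{\langle a_1, \ldots, a_{100} \rangle, \langle a_1, \ldots, a_{50} \rangle\}$
  • Min_sup = 1.
• What is the set of closed pattern(s)?
  • $\langle a_1, \ldots, a_{100} \rangle$: 1
  • $\langle a_1, \ldots, a_{50} \rangle$: 2
• What is the set of max-pattern(s)?
  • $\langle a_1, \ldots, a_{100} \rangle$: 1
• What is the set of all patterns?
  • !! (every nonempty subset of $\{a_1, \ldots, a_{100}\}$, i.e., $2^{100} - 1$ patterns)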

Computational Complexity of Frequent Itemset Mining
• How many itemsets may potentially be generated in the worst case?
  • The number of frequent itemsets to be generated is sensitive to the minsup threshold
  • When minsup is low, there exist potentially an exponential number of frequent itemsets
  • The worst case: $M^N$, where M: # distinct items, and N: max length of transactions
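For a feel of the numbers, the worst-case bound can be compared with an exact count on concrete values. The sketch below assumes we simply count all distinct nonempty itemsets of size at most N over M items, which is one way to read the bound on the slide; the helper name `count_possible_itemsets` and the values M = 100, N = 10 are made up for this note.

```python
from math import comb

def count_possible_itemsets(num_items, max_len):
    """Number of distinct nonempty itemsets of size <= max_len over num_items items."""
    return sum(comb(num_items, k) for k in range(1, max_len + 1))

M, N = 100, 10  # hypothetical: 100 distinct items, transactions of at most 10 items
exact = count_possible_itemsets(M, N)
bound = M ** N

print(exact)           # exact number of candidate itemsets
print(bound)           # the looser M^N bound
print(exact <= bound)  # True

# With no length limit, the count is the 2^M - 1 figure quoted for a 100-item pattern.
print(count_possible_itemsets(100, 100) == 2 ** 100 - 1)  # True
```

Either way the count grows exponentially with the pattern length, which is why lowering minsup can make mining intractable.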

Mining Frequent Patterns, Association and Correlations
• Basic Concepts
• Frequent Itemset Mining Methods
• Pattern Evaluation Methods
• Summary

Scalable Frequent Itemset Mining Methods
• Apriori: A Candidate Generation-and-Test Approach
• Improving the Efficiency of Apriori
• FPGrowth: A Frequent Pattern-Growth Approach
• ECLAT: Frequent Pattern Mining with Vertical Data Format
• Generating Association Rules

The Apriori Property and Scalable Mining Methods
• The Apriori property of frequent patterns
  • Any nonempty subset of a frequent itemset must be frequent
  • If {beer, diaper, nuts} is frequent, so is {beer, diaper}
  • i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
• Scalable mining methods: Three major approaches
  • Apriori (Agrawal & Srikant @ VLDB'94)
  • Freq. pattern growth (FPgrowth: Han, Pei & Yin @ SIGMOD'00)
  • Vertical data format approach (Eclat)
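The Apriori property follows from the fact that support can only stay the same or shrink as items are added. A minimal Python sketch, using the shopping transactions from the earlier slides and helper names invented for this note, checks this for {Beer, Diaper, Nuts}:

```python
from itertools import combinations

transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def support_count(itemset, db):
    """Number of transactions that contain every item of itemset."""
    items = set(itemset)
    return sum(1 for t in db if items <= t)

def nonempty_proper_subsets(itemset):
    """Yield every nonempty proper subset of itemset as a frozenset."""
    items = sorted(itemset)
    for k in range(1, len(items)):
        for combo in combinations(items, k):
            yield frozenset(combo)

pattern = {"Beer", "Diaper", "Nuts"}
pattern_support = support_count(pattern, transactions)

# Every subset is contained in at least the transactions that contain the pattern.
for sub in nonempty_proper_subsets(pattern):
    sub_support = support_count(sub, transactions)
    print(sorted(sub), sub_support, sub_support >= pattern_support)
```

Read in the other direction, the same inequality gives the pruning rule: if a subset is already infrequent, no superset of it can be frequent.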

Apriori: A Candidate Generation & Test Approach
• Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested! (Agrawal & Srikant @ VLDB'94, Mannila, et al. @ KDD'94)
• Method:
  • Initially, scan DB once to get frequent 1-itemsets
  • Generate length (k+1) candidate itemsets from length k frequent itemsets
  • Test the candidates against DB
  • Terminate when no frequent or candidate set can be generated

From Frequent (k-1)-Itemset To Frequent k-Itemset
$C_k$: candidate itemsets of size k
$L_k$: frequent itemsets of size k
• From $L_{k-1}$ to $C_k$ (Candidates Generation)
  • The join step
  • The prune step
• From $C_k$ to $L_k$
  • Test candidates by scanning database

The Apriori Algorithm: An Example
Sup_min = 2

Database TDB
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

• 1st scan, $C_1$ with counts: {A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3
• $L_1$: {A}: 2, {B}: 3, {C}: 3, {E}: 3
• $C_2$: {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}
• 2nd scan, $C_2$ with counts: {A, B}: 1, {A, C}: 2, {A, E}: 1, {B, C}: 2, {B, E}: 3, {C, E}: 2
• $L_2$: {A, C}: 2, {B, C}: 2, {B, E}: 3, {C, E}: 2
• $C_3$: {B, C, E}
• 3rd scan, $L_3$: {B, C, E}: 2

The Apriori Algorithm (Pseudo-Code)
$C_k$: candidate itemsets of size k
$L_k$: frequent itemsets of size k

    L_1 = {frequent items};
    for (k = 2; L_{k-1} != ∅; k++) do begin
        C_k = candidates generated from L_{k-1};
        for each transaction t in database do
            increment the count of all candidates in C_k that are contained in t
        L_k = candidates in C_k with min_support
    end
    return ∪_k L_k;
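The pseudo-code can be turned into a short runnable Python sketch. This is one possible rendering under simplifying assumptions, not the reference implementation: candidates are built by taking unions of pairs from the previous level (a simpler stand-in for the ordered self-join used in the lecture), and every candidate is tested against every transaction without any indexing. The function name `apriori` and its signature are choices made for this note.

```python
from itertools import combinations

def apriori(db, min_count):
    """Return {frozenset: support count} for all frequent itemsets in db."""
    # First scan: count single items to obtain L_1.
    counts = {}
    for t in db:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(level)

    k = 2
    while level:
        # Generate C_k: unions of two (k-1)-itemsets that have exactly k items,
        # kept only if all (k-1)-subsets are frequent (Apriori pruning).
        candidates = set()
        for a, b in combinations(list(level), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in level for sub in combinations(union, k - 1)
            ):
                candidates.add(union)

        # Scan the database once to count the surviving candidates.
        counts = {c: 0 for c in candidates}
        for t in db:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        # L_k: candidates that meet the minimum support.
        level = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(level)
        k += 1
    return frequent

# The TDB example from the lecture, with Sup_min = 2.
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = apriori(tdb, min_count=2)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
```

On the TDB example this yields the four frequent 1-itemsets, the four frequent 2-itemsets, and {B, C, E} with support 2, matching the worked example.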

Candidates Generation
Assume a pre-specified order of items
• How to generate candidates $C_k$?
• Step 1: self-joining $L_{k-1}$
  • Two length-(k-1) itemsets $l_1$ and $l_2$ can join only if their first k-2 items are the same and, for the last item, $l_1[k-1] < l_2[k-1]$ (why?)
• Step 2: pruning
  • Why do we need pruning for candidates?
  • How?
    • Again, use the Apriori property
    • A candidate itemset can be safely pruned if it contains an infrequent subset

• Example of candidate generation from $L_3$ to $C_4$
  • $L_3$ = {abc, abd, acd, ace, bcd}
  • Self-joining: $L_3 * L_3$
    • abcd from abc and abd
    • acde from acd and ace
  • Pruning:
    • acde is removed because ade is not in $L_3$
  • $C_4$ = {abcd}
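The join and prune steps in this example can be sketched in Python as follows. Itemsets are represented as sorted tuples so that "the first k-2 items" is just a tuple prefix; the function name `generate_candidates` is an assumption made for this illustration.

```python
from itertools import combinations

def generate_candidates(prev_level):
    """Build C_k from L_{k-1}, given as sorted tuples of equal length k-1."""
    prev = sorted(prev_level)
    prev_set = set(prev)
    candidates = []
    for i, l1 in enumerate(prev):
        for l2 in prev[i + 1:]:
            # Join step: same first k-2 items, and last item of l1 < last item of l2.
            if l1[:-1] == l2[:-1] and l1[-1] < l2[-1]:
                cand = l1 + (l2[-1],)
                # Prune step: every (k-1)-subset of the candidate must be in L_{k-1}.
                if all(sub in prev_set for sub in combinations(cand, len(cand) - 1)):
                    candidates.append(cand)
    return candidates

L3 = [tuple(s) for s in ("abc", "abd", "acd", "ace", "bcd")]
print(generate_candidates(L3))  # [('a', 'b', 'c', 'd')]
```

The ordering condition on the last item ensures each candidate is produced from exactly one pair, which is one answer to the "(why?)" on the previous slide: without it, abcd would be generated twice, once from (abc, abd) and once from (abd, abc).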

The Apriori Algorithm: Example Review
Sup_min = 2

Database TDB
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

• 1st scan, $C_1$ with counts: {A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3
• $L_1$: {A}: 2, {B}: 3, {C}: 3, {E}: 3
• $C_2$: {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}
• 2nd scan, $C_2$ with counts: {A, B}: 1, {A, C}: 2, {A, E}: 1, {B, C}: 2, {B, E}: 3, {C, E}: 2
• $L_2$: {A, C}: 2, {B, C}: 2, {B, E}: 3, {C, E}: 2
• $C_3$: {B, C, E}
• 3rd scan, $L_3$: {B, C, E}: 2

Questions
• How many scans on DB are needed for the Apriori algorithm?
• When (k = ?) does the Apriori algorithm generate the most candidate itemsets?
• Is support counting for candidates expensive?
