Introduction to Data Mining: Frequent Pattern Mining and Association Analysis (PowerPoint PPT Presentation)



SLIDE 1

Introduction to Data Mining

Frequent Pattern Mining and Association Analysis

Li Xiong

Slide credits: Jiawei Han, Micheline Kamber, and George Kollios

SLIDE 2

Mining Frequent Patterns, Association and Correlations

- Basic concepts
- Frequent itemset mining methods
- Mining association rules
- Association mining to correlation analysis
- Constraint-based association mining

SLIDE 3

What Is Frequent Pattern Analysis?

Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set

- Frequent sequential patterns
- Frequent structured patterns

Motivation: Finding inherent regularities in data

- What products were often purchased together? Beer and diapers?!
- What are the subsequent purchases after buying a PC?
- What kinds of DNA are sensitive to this new drug?

Applications

- Basket data analysis, cross-marketing, catalog design, sales campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.

SLIDE 4

Frequent Itemset Mining

- Frequent itemset mining: mining frequent sets of items in a transaction data set.
- Agrawal, Imielinski, and Swami, SIGMOD 1993
  - SIGMOD Test of Time Award 2003

“This paper started a field of research. In addition to containing an innovative algorithm, its subject matter brought data mining to the attention of the database community … even led several years ago to an IBM commercial, featuring supermodels, that touted the importance of work such as that contained in this paper.”

R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In SIGMOD '93.
Apriori: R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In VLDB '94.

SLIDE 5

Basic Concepts: Transaction dataset

Transaction-id  Items bought
10  A, B, D
20  A, C, D
30  A, D, E
40  B, E, F
50  B, C, D, E, F

SLIDE 6

Record Data

Data that consists of a collection of records, each of which consists of a fixed set of attributes

Points in a multi-dimensional space, where each dimension represents a distinct attribute

Represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

SLIDE 7

Transaction Data

A special type of record data, where:
- each record (transaction) involves a set of items.
- For example, the set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

SLIDE 8

Document Data

Each document becomes a 'term' vector:
- each term is a component (attribute) of the vector,
- the value of each component is the number of times the corresponding term occurs in the document.

[Table: document-term matrix; rows Document 1-3, columns for terms such as season, timeout, lost, win, game, score, ball, play, coach, team; entries are term counts per document]

SLIDE 9

Basic Concepts: Transaction dataset

Transaction-id  Items bought
10  A, B, D
20  A, C, D
30  A, D, E
40  B, E, F
50  B, C, D, E, F

SLIDE 10

Basic Concepts: Frequent Patterns and Association Rules

Itemset: X = {x1, …, xk} (k-itemset)

Transaction-id  Items bought
10  A, B, D
20  A, C, D
30  A, D, E
40  B, E, F
50  B, C, D, E, F

SLIDE 11

Basic Concepts: Frequent Patterns and Association Rules

Itemset: X = {x1, …, xk} (k-itemset)

Frequent itemset: X with minimum support count

Support count (absolute support): count of transactions containing X

Transaction-id  Items bought
10  A, B, D
20  A, C, D
30  A, D, E
40  B, E, F
50  B, C, D, E, F

SLIDE 12

Basic Concepts: Frequent Patterns and Association Rules

Itemset: X = {x1, …, xk} (k-itemset)

Frequent itemset: X with minimum support count

Support count (absolute support): count of transactions containing X

Association rule: A ⇒ B with minimum support and confidence

Support: probability that a transaction contains A ∪ B: s = P(A ∪ B)

Confidence: conditional probability that a transaction having A also contains B: c = P(B | A)

[Figure: Venn diagram of customers buying beer, customers buying diapers, and customers buying both]

Transaction-id  Items bought
10  A, B, D
20  A, C, D
30  A, D, E
40  B, E, F
50  B, C, D, E, F

SLIDE 13

Illustration of Frequent Itemsets and Association Rules

Transaction-id  Items bought
10  A, B, D
20  A, C, D
30  A, D, E
40  B, E, F
50  B, C, D, E, F

- Frequent itemsets (minimum support count = 3)?
- Association rules (minimum support = 50%, minimum confidence = 50%)?

SLIDE 14

Illustration of Frequent Itemsets and Association Rules

Transaction-id  Items bought
10  A, B, D
20  A, C, D
30  A, D, E
40  B, E, F
50  B, C, D, E, F

- Frequent itemsets (minimum support count = 3)? {A}:3, {B}:3, {D}:4, {E}:3, {A,D}:3
- Association rules (minimum support = 50%, minimum confidence = 50%)? A ⇒ D (60%, 100%) and D ⇒ A (60%, 75%)
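This worked answer can be checked with a short brute-force sketch (Python; the function and variable names are mine, not from the slides):

```python
from itertools import combinations

# Toy transaction DB from the slide
db = {10: {'A', 'B', 'D'}, 20: {'A', 'C', 'D'}, 30: {'A', 'D', 'E'},
      40: {'B', 'E', 'F'}, 50: {'B', 'C', 'D', 'E', 'F'}}

def support_count(itemset, db):
    """Absolute support: number of transactions containing itemset."""
    return sum(itemset <= t for t in db.values())

items = sorted(set().union(*db.values()))
# Enumerate all nonempty itemsets (fine for 6 items; 2^n - 1 in general)
all_itemsets = [frozenset(c) for k in range(1, len(items) + 1)
                for c in combinations(items, k)]

min_count = 3
frequent = {s: support_count(s, db) for s in all_itemsets
            if support_count(s, db) >= min_count}
# -> {A}:3, {B}:3, {D}:4, {E}:3, {A,D}:3

# Rules A => B with support >= 50% and confidence >= 50%
n = len(db)
rules = []
for s, cnt in frequent.items():
    for k in range(1, len(s)):
        for a in map(frozenset, combinations(sorted(s), k)):
            conf = cnt / support_count(a, db)
            if cnt / n >= 0.5 and conf >= 0.5:
                rules.append((set(a), set(s - a), cnt / n, conf))
# -> A => D (support 60%, confidence 100%) and D => A (60%, 75%)
```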

SLIDE 15

Mining Frequent Patterns, Association and Correlations

- Basic concepts
- Frequent itemset mining methods
- Mining association rules
- Association mining to correlation analysis
- Constraint-based association mining

SLIDE 16

Scalable Methods for Mining Frequent Patterns

- Frequent itemset mining methods
  - Apriori (Agrawal & Srikant @ VLDB'94) and variations
  - Frequent pattern growth (FP-growth: Han, Pei & Yin @ SIGMOD'00)
- Closed and maximal patterns and their mining methods
- FIMI Workshop and implementation repository

SLIDE 17

Frequent itemset mining

- Brute force approach

Transaction-id  Items bought
10  A, B, D
20  A, C, D
30  A, D, E
40  B, E, F
50  B, C, D, E, F

Itemset: X = {x1, …, xk} (k-itemset)
Frequent itemset: X with minimum support count

SLIDE 18

Frequent itemset mining

- Brute force approach
- Set enumeration tree for all possible itemsets
- Tree search

Transaction-id  Items bought
10  A, B, D
20  A, C, D
30  A, D, E
40  B, E, F
50  B, C, D, E, F

SLIDE 19

Apriori – Apriori Property

The Apriori property of frequent patterns:
- Any nonempty subset of a frequent itemset must be frequent.
- If {beer, diaper, nuts} is frequent, so is {beer, diaper}.

Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated or tested!

SLIDE 20

Apriori: Level-Wise Search Method

Level-wise search method (BFS):
- Initially, scan the DB once to get the frequent 1-itemsets.
- Generate length-(k+1) candidate itemsets from the length-k frequent itemsets.
- Test the candidates against the DB.
- Terminate when no frequent or candidate set can be generated.

SLIDE 21

The Apriori Algorithm

Pseudo-code:

    Ck: candidate k-itemsets; Lk: frequent k-itemsets
    L1 = frequent 1-itemsets;
    for (k = 2; Lk-1 != ∅; k++) {
        Ck = generate candidate set from Lk-1;
        for each transaction t in database:
            find all candidates in Ck that are subsets of t
            and increment their counts;
        Lk = candidates in Ck with min_support;
    }
    return ∪k Lk;
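A runnable sketch of this pseudo-code (Python; names are mine, and the data structures are the simplest ones that work, not the hash-tree optimizations discussed later):

```python
from itertools import combinations

def apriori(db, min_count):
    """Level-wise frequent itemset mining; db is a list of sets of items."""
    # L1: frequent 1-itemsets
    counts = {}
    for t in db:
        for i in t:
            counts[frozenset([i])] = counts.get(frozenset([i]), 0) + 1
    Lk = {s for s, c in counts.items() if c >= min_count}
    frequent = {s: counts[s] for s in Lk}
    k = 2
    while Lk:
        # Candidate generation: self-join L(k-1), prune by the Apriori property
        Ck = set()
        for a in Lk:
            for b in Lk:
                u = a | b
                if len(u) == k and all(frozenset(s) in Lk
                                       for s in combinations(u, k - 1)):
                    Ck.add(u)
        # Support counting: one scan of the DB per level
        counts = {c: 0 for c in Ck}
        for t in db:
            for c in Ck:
                if c <= t:
                    counts[c] += 1
        Lk = {s for s, c in counts.items() if c >= min_count}
        frequent.update((s, counts[s]) for s in Lk)
        k += 1
    return frequent

# The example DB from the next slide, with min support count 2:
db = [{'A', 'C', 'D'}, {'B', 'C', 'E'}, {'A', 'B', 'C', 'E'}, {'B', 'E'}]
print(apriori(db, 2))   # includes {B,C,E}: 2
```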

SLIDE 22

The Apriori Algorithm—An Example

22

Supmin = 2

Transaction DB:
Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1: {A}:2, {B}:3, {C}:3, {E}:3

C2: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → C2 counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3: {B,C,E}
3rd scan → {B,C,E}:2
L3: {B,C,E}:2

SLIDE 23

Important Details of Apriori

- How to generate candidate sets?
- How to count supports for candidate sets?

SLIDE 24

Candidate Set Generation

Step 1: self-joining Lk-1: assuming items and itemsets are sorted in order, two (k-1)-itemsets are joinable only if their first k-2 items are in common.

Step 2: pruning: prune a candidate if it has an infrequent subset.

Example: generate C4 from L3 = {abc, abd, acd, ace, bcd}

Step 1: self-joining L3 * L3:
- abcd from abc and abd; acde from acd and ace

Step 2: pruning:
- acde is removed because ade is not in L3

C4 = {abcd}
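The two steps traced in code (a sketch; the function name is mine):

```python
from itertools import combinations

def gen_candidates(Lk_minus_1, k):
    """Self-join (k-1)-itemsets sharing their first k-2 items, then prune."""
    L = sorted(tuple(sorted(s)) for s in Lk_minus_1)
    Lset = set(L)
    Ck = set()
    for i in range(len(L)):
        for j in range(i + 1, len(L)):
            a, b = L[i], L[j]
            if a[:k - 2] == b[:k - 2]:          # joinable: first k-2 items agree
                cand = tuple(sorted(set(a) | set(b)))
                # prune: every (k-1)-subset of the candidate must be frequent
                if all(s in Lset for s in combinations(cand, k - 1)):
                    Ck.add(cand)
    return Ck

L3 = [('a', 'b', 'c'), ('a', 'b', 'd'), ('a', 'c', 'd'),
      ('a', 'c', 'e'), ('b', 'c', 'd')]
print(gen_candidates(L3, 4))   # {('a', 'b', 'c', 'd')}: acde pruned via ade
```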

Ck = generate candidate set from Lk-1;

SLIDE 25

How to Count Supports of Candidates?

for each transaction t in database:
    find all candidates in Ck that are subsets of t and increment their counts;

Why is counting supports of candidates a problem?
- The total number of candidates can be very large.
- One transaction may contain many candidates.

Approach: for each subset s of t, check whether s is in Ck.

SLIDE 26

How to Count Supports of Candidates?

for each transaction t in database:
    find all candidates in Ck that are subsets of t and increment their counts;

For each subset s of t, check whether s is in Ck:
- Linear search
- Prefix tree
- Hash-tree (prefix tree with a hash function at interior nodes)
- Hash-table

SLIDE 27

Example: Hash-tree

[Figure: hash-tree for candidate 3-itemsets, with hash function 1,4,7 / 2,5,8 / 3,6,9 at interior nodes and candidates stored at the leaves; transaction 2 3 5 6 7 is matched against the tree to find the candidates it contains]

SLIDE 28

DHP: Direct Hashing and Pruning

DHP (direct hashing and pruning): hash k-itemsets into buckets; a k-itemset whose bucket count is below the threshold cannot be frequent.

Especially useful for 2-itemsets:
- Generate a hash table of 2-itemsets during the scan for frequent 1-itemsets.
- If the count of a bucket is below the minimum support count, the itemsets in that bucket should not be included in the candidate 2-itemsets.

J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD '95.
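A minimal sketch of the bucket-count filter (the hash function and the table size are illustrative choices, not from the paper):

```python
from itertools import combinations

db = [{'A', 'C', 'D'}, {'B', 'C', 'E'}, {'A', 'B', 'C', 'E'}, {'B', 'E'}]
min_count = 2
NUM_BUCKETS = 7                       # illustrative table size

def bucket(pair):
    # Illustrative hash; any function of the unordered pair works
    return hash(frozenset(pair)) % NUM_BUCKETS

# During the scan for frequent 1-itemsets, also hash every 2-itemset
# of every transaction into the bucket table
bucket_counts = [0] * NUM_BUCKETS
for t in db:
    for pair in combinations(sorted(t), 2):
        bucket_counts[bucket(pair)] += 1

def may_be_frequent(pair):
    """A 2-itemset whose bucket total is below min_count cannot itself
    reach min_count, so it can be excluded from C2. There are no false
    negatives: collisions only inflate a bucket's count."""
    return bucket_counts[bucket(pair)] >= min_count
```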

SLIDE 29

DHP: Reducing number of candidates

SLIDE 30

DHP: Reducing number of candidates

SLIDE 31

Directly counting all 1-itemsets and 2-itemsets

SLIDE 32

Assignment 1

- Implementation and evaluation of Apriori
- Performance competition with prizes!

SLIDE 33

Improving Efficiency of Apriori

Bottlenecks:
- Huge number of candidates
- Multiple scans of the transaction database
- Support counting for candidates

Improving Apriori: general ideas:
- Shrink the number of candidates
- Reduce the number of transaction-database scans
- Reduce the number of transactions

SLIDE 34

DHP: Reducing size and number of transactions

Discard infrequent items:
- If an item occurs in a frequent (k+1)-itemset, it must occur in at least k candidate k-itemsets (necessary, not sufficient).
- Discard an item from a transaction if it does not occur in at least k candidate k-itemsets during support counting.
- Discard a transaction if all of its items are discarded.

J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD '95.

SLIDE 35

DIC: Reduce Number of Scans

DIC (dynamic itemset counting): partition the DB into blocks and add new candidate itemsets at partition points.

- Once both A and D are determined frequent, the counting of AD begins.
- Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins.

[Figure: itemset lattice from {} through A, B, C, D up to ABCD; DIC starts counting 2- and 3-itemsets partway through a scan, while Apriori counts one level per full scan]

S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD '97.
SLIDE 36

Partitioning: Reduce Number of Scans

- Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
- Scan 1: partition the database into n partitions and find local frequent patterns (minimum support count?).
- Scan 2: determine global frequent patterns from the collection of all local frequent patterns.

A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association in large databases. In VLDB '95.
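A runnable sketch of the two-scan scheme (all names are mine, and the brute-force local miner is only a stand-in for whatever algorithm is run per partition; the local threshold is the same relative support applied to the partition's size, which is one answer to the "minimum support count?" question above):

```python
import math
from itertools import combinations

def brute_miner(part, min_count):
    """Placeholder local miner: brute-force enumeration (fine for tiny partitions)."""
    items = sorted(set().union(*part)) if part else []
    out = {}
    for k in range(1, len(items) + 1):
        for c in combinations(items, k):
            s = frozenset(c)
            cnt = sum(s <= t for t in part)
            if cnt >= min_count:
                out[s] = cnt
    return out

def mine_partitioned(db, min_support, n_parts=2):
    """Scan 1: local frequent patterns per partition; scan 2: global counts."""
    n = len(db)
    size = math.ceil(n / n_parts)
    candidates = set()
    for p in range(0, n, size):
        part = db[p:p + size]
        local_min = max(1, math.ceil(min_support * len(part)))
        candidates |= set(brute_miner(part, local_min))
    # Any globally frequent itemset is locally frequent in some partition,
    # so the union of local winners is a superset of the answer
    result = {c: sum(c <= t for t in db) for c in candidates}
    return {c: v for c, v in result.items() if v >= min_support * n}

db = [{'A', 'C', 'D'}, {'B', 'C', 'E'}, {'A', 'B', 'C', 'E'}, {'B', 'E'}]
print(mine_partitioned(db, 0.5))   # the 9 itemsets with support >= 50%
```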

SLIDE 37

Sampling for Frequent Patterns

- Select a sample of the original database and mine frequent patterns within the sample using Apriori.
- Scan the database once to verify the frequent itemsets found in the sample.
- Use a lower support threshold than the minimum support.
- Tradeoff: accuracy against efficiency.

H. Toivonen. Sampling large databases for association rules. In VLDB '96.

SLIDE 38

September 12, 2017 Data Mining: Concepts and Techniques 39

Scalable Methods for Mining Frequent Patterns

Scalable mining methods for frequent patterns:
- Apriori (Agrawal & Srikant @ VLDB'94) and variations
- Frequent pattern growth (FP-growth: Han, Pei & Yin @ SIGMOD'00)
- Algorithms using vertical format
- Closed and maximal patterns and their mining methods
- FIMI Workshop and implementation repository

SLIDE 39

Mining Frequent Patterns Without Candidate Generation

- Apriori: breadth-first search in the set enumeration tree.
- FP-Growth: depth-first search in the set enumeration tree.
- Basic idea: find (grow) long patterns from short ones in conditional DBs, recursively:
  - "abc" is a frequent pattern.
  - All transactions having "abc" form DB|abc (the conditional DB).
  - If "d" is a local frequent item in DB|abc, then abcd is a frequent pattern.
- Details:
  - How to represent the DB so conditional DBs can be found quickly? An FP-tree (prefix tree).
  - How to represent all patterns? A set enumeration tree with ordering.
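The grow-from-conditional-DB idea can be sketched without an actual FP-tree by materializing each conditional DB explicitly (illustrative only; the FP-tree exists precisely to make these conditional DBs cheap to obtain):

```python
def pattern_growth(db, min_count, suffix=frozenset()):
    """Recursively mine frequent patterns; db is a list of sets of items."""
    counts = {}
    for t in db:
        for i in t:
            counts[i] = counts.get(i, 0) + 1
    patterns = {}
    for item, c in counts.items():
        if c < min_count:
            continue
        new_pat = suffix | {item}
        patterns[new_pat] = c
        # Conditional DB for `item`: transactions containing it, keeping only
        # items that sort before it (the ordering avoids duplicate patterns)
        cond_db = [{j for j in t if j < item} for t in db if item in t]
        patterns.update(pattern_growth(cond_db, min_count, new_pat))
    return patterns

db = [{'A', 'C', 'D'}, {'B', 'C', 'E'}, {'A', 'B', 'C', 'E'}, {'B', 'E'}]
print(pattern_growth(db, 2))   # 9 patterns, including {B,C,E}: 2
```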

SLIDE 40

Representing DB as FP-tree and Finding Conditional DB

min_support = 3

TID  Items bought              (Ordered) frequent items
100  {f, a, c, d, g, i, m, p}  {f, c, a, m, p}
200  {a, b, c, f, l, m, o}     {f, c, a, b, m}
300  {b, f, h, j, o, w}        {f, b}
400  {b, c, k, s, p}           {c, b, p}
500  {a, f, c, e, l, p, m, n}  {f, c, a, m, p}

1. Scan the DB once and find the frequent 1-itemsets (single-item patterns).
2. Sort frequent items in frequency-descending order, giving the f-list: F-list = f-c-a-b-m-p.
3. Scan the DB again and construct the FP-tree (prefix tree).

[Figure: FP-tree with header table (f:4, c:4, a:3, b:3, m:3, p:3) linking into prefix paths such as f:4 → c:3 → a:3 → m:2 → p:2]

SLIDE 41

Finding Conditional FP-Tree

- Start at the frequent-item header table of the FP-tree.
- Traverse the FP-tree by following the links of each frequent item p.
- Accumulate all transformed prefix paths of item p to form p's conditional DB (conditional pattern base).

Conditional pattern bases:
item  conditional pattern base
c     f:3
a     fc:3
b     fca:1, f:1, c:1
m     fca:2, fcab:1
p     fcam:2, cb:1

SLIDE 42

Representing Patterns

Frequent patterns can be partitioned into subsets according to the f-list f-c-a-b-m-p (assuming items are sorted): essentially a set enumeration tree.

- Patterns containing p (patterns ending with p)
- Patterns having m but no p (patterns ending with m)
- …
- Patterns having c but none of a, b, m, p (patterns ending with c)
- Patterns having f but none of c, a, b, m, p (patterns ending with f)

Completeness and non-redundancy.

Question: how to order the items?

SLIDE 43

Representing Patterns

Frequent patterns can be partitioned into subsets according to the f-list f-c-a-b-m-p (assuming items are sorted): essentially a set enumeration tree.

- Patterns containing p (patterns ending with p)
- Patterns having m but no p (patterns ending with m)
- …
- Patterns having c but none of a, b, m, p (patterns ending with c)
- Patterns having f but none of c, a, b, m, p (patterns ending with f)

Completeness and non-redundancy.

Question: how to order the items?

Intuition: start from the least frequent items and move toward the most frequent; this offers better selectivity and pruning.

SLIDE 44

Mining Frequent Patterns With FP-trees

Idea: frequent pattern growth. Recursively grow frequent patterns by pattern and database partition.

Method:
- For each frequent item, construct its conditional DB and conditional FP-tree.
- Repeat the process recursively on each new conditional FP-tree.
- Stop when the resulting FP-tree is empty or contains only a single path; a single path generates all combinations of its sub-paths, each of which is a frequent pattern.

SLIDE 45

Finding Patterns Ending with p

- Construct the p-conditional DB: fcam:2, cb:1.
- Construct the FP-tree for the frequent items of the p-conditional DB: the p-conditional FP-tree (min-support = 3) is the single node c:3.
- Repeat the process recursively on the new conditional FP-tree until the resulting FP-tree is empty or has only one path.

All frequent patterns containing p: p, cp.

SLIDE 46

Finding Patterns Ending with m

- Construct the m-conditional DB, then its conditional FP-tree.
- m-conditional pattern base: fca:2, fcab:1.
- m-conditional FP-tree (min-support = 3): the single path {} → f:3 → c:3 → a:3.
- Repeat the process recursively on the new conditional FP-tree.

All frequent patterns ending with m: m, fm, cm, am, fcm, fam, cam, fcam.

SLIDE 47

FP-Growth vs. Apriori

[Figure: run time (sec.) vs. support threshold (%) on data set T25I20D10K; the D1 FP-growth runtime stays low across thresholds, while the D1 Apriori runtime grows steeply as the support threshold drops toward 0.5%]

SLIDE 48

Why Is FP-Growth the Winner?

Divide-and-conquer:
- Decompose both the mining task and the DB, leading to focused search of smaller databases.
- Use the least frequent items as suffixes (offering good selectivity); find shorter patterns recursively and concatenate them with the suffix.

Other factors:
- No candidate generation, no candidate tests.
- Compressed database: the FP-tree structure.
- No repeated scans of the entire database.
- Basic operations are counting local frequent items and building sub-FP-trees: no pattern search and matching.

SLIDE 49


Scalable Methods for Mining Frequent Patterns

Scalable mining methods for frequent patterns:
- Apriori (Agrawal & Srikant @ VLDB'94) and variations
- Frequent pattern growth (FP-growth: Han, Pei & Yin @ SIGMOD'00)
- Algorithms using vertical format (ECLAT)
- Closed and maximal patterns and their mining methods
- FIMI Workshop and implementation repository

SLIDE 50

ECLAT

M. J. Zaki. Scalable algorithms for association mining. IEEE TKDE, 12, 2000.

For each item, store a list of transaction ids (tids).

Horizontal data layout:
TID  Items
1    A, B, E
2    B, C, D
3    C, E
4    A, C, D
5    A, B, C, D
6    A, E
7    A, B
8    A, B, C
9    A, C, D
10   B

Vertical data layout (tid-lists):
A: 1, 4, 5, 6, 7, 8, 9
B: 1, 2, 5, 7, 8, 10
C: 2, 3, 4, 5, 8, 9
D: 2, 4, 5, 9
E: 1, 3, 6

SLIDE 51

ECLAT

- Determine the support of any k-itemset by intersecting the tid-lists of two of its (k-1)-subsets.
- Advantage: very fast support counting.
- Disadvantage: intermediate tid-lists may become too large for memory.

Example:
A: 1, 4, 5, 6, 7, 8, 9
B: 1, 2, 5, 7, 8, 10
A ∩ B → AB: 1, 5, 7, 8
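In code, the intersection step is just that (a sketch; Python sets stand in for the sorted tid-lists a real implementation would merge):

```python
# Vertical layout from the slide: item -> set of transaction ids
tidlists = {
    'A': {1, 4, 5, 6, 7, 8, 9},
    'B': {1, 2, 5, 7, 8, 10},
}

# Support of {A, B} is the size of the intersection of the two tid-lists
ab = tidlists['A'] & tidlists['B']
print(sorted(ab), len(ab))   # [1, 5, 7, 8] 4
```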

SLIDE 52

Scalable Methods for Mining Frequent Patterns

Scalable mining methods for frequent patterns:
- Apriori (Agrawal & Srikant @ VLDB'94) and variations
- Frequent pattern growth (FP-growth: Han, Pei & Yin @ SIGMOD'00)
- Algorithms using vertical data format (ECLAT)
- Closed and maximal patterns and their mining methods
  - Concepts
  - Max-patterns: MaxMiner, MAFIA
  - Closed patterns: CLOSET, CLOSET+, CARPENTER
- FIMI Workshop

SLIDE 53

Closed Patterns and Max-Patterns

- A long pattern contains a combinatorial number of sub-patterns; e.g., {a1, …, a100} contains ____ sub-patterns!
- Solution: mine "boundary" patterns.

SLIDE 54

Closed Patterns and Max-Patterns

- An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (Pasquier et al. @ ICDT'99).
- An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (Bayardo @ SIGMOD'98).
- A closed pattern is a lossless compression of the frequent patterns and their support counts.

SLIDE 55

Max-patterns

Frequent patterns without frequent super-patterns:
- BCDE (2) and ACD (2) are max-patterns.
- BCD (2) is not a max-pattern.

Min_sup = 2

Tid  Items
10   A, B, C, D, E
20   B, C, D, E
30   A, C, D, F

SLIDE 56

Max-Patterns Illustration

[Figure: itemset lattice from null to ABCDE, with a border separating frequent from infrequent itemsets; the maximal itemsets lie on the frequent side of the border]

An itemset is maximal frequent if none of its immediate supersets is frequent.

SLIDE 57

Closed Patterns

An itemset is closed if none of its immediate supersets has the same support as the itemset (min_sup = 2).

TID  Items
1    {A, B}
2    {B, C, D}
3    {A, B, C, D}
4    {A, B, D}
5    {A, B, C, D}

Itemset  Support    Itemset    Support
{A}      4          {A,B,C}    2
{B}      5          {A,B,D}    3
{C}      3          {A,C,D}    2
{D}      4          {B,C,D}    3
{A,B}    4          {A,B,C,D}  2
{A,C}    2
{A,D}    3
{B,C}    3
{B,D}    4
{C,D}    3
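Both definitions can be checked against this table with a small sketch (Python; helper names are mine):

```python
from itertools import combinations

db = [{'A', 'B'}, {'B', 'C', 'D'}, {'A', 'B', 'C', 'D'},
      {'A', 'B', 'D'}, {'A', 'B', 'C', 'D'}]
min_count = 2
items = {'A', 'B', 'C', 'D'}

def support(s):
    return sum(set(s) <= t for t in db)

frequent = {frozenset(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k) if support(c) >= min_count}

# Closed: no immediate superset has the same support
closed = {s for s in frequent
          if all(support(s | {i}) < support(s) for i in items - s)}
# Maximal: no immediate superset is frequent
maximal = {s for s in frequent
           if all(support(s | {i}) < min_count for i in items - s)}

print(len(frequent), len(closed), len(maximal))   # 15 6 1
```

Here every nonempty itemset is frequent (15 of them), but only 6 are closed ({B}, {A,B}, {B,D}, {A,B,D}, {B,C,D}, {A,B,C,D}) and only {A,B,C,D} is maximal.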

SLIDE 58

Maximal vs Closed Itemsets

[Figure: nested sets: maximal frequent itemsets ⊂ closed frequent itemsets ⊂ frequent itemsets]

SLIDE 59

Exercise: Closed Patterns and Max-Patterns

DB = {<a1, …, a100>, <a1, …, a50>}, min_sup = 1.

- What is the set of closed itemsets?
- What is the set of max-patterns?
- What is the set of all patterns?

SLIDE 60

Exercise: Closed Patterns and Max-Patterns

DB = {<a1, …, a100>, <a1, …, a50>}, min_sup = 1.

- The set of closed itemsets: {<a1, …, a100>: 1, <a1, …, a50>: 2}
- The set of max-patterns: {<a1, …, a100>: 1}
- The set of all patterns: every nonempty subset of {a1, …, a100}; far too many to enumerate!

SLIDE 61

Scalable Methods for Mining Frequent Patterns

Scalable mining methods for frequent patterns:
- Apriori (Agrawal & Srikant @ VLDB'94) and variations
- Frequent pattern growth (FP-growth: Han, Pei & Yin @ SIGMOD'00)
- Algorithms using vertical data format (ECLAT)
- Closed and maximal patterns and their mining methods
  - Concepts
  - Max-pattern mining: MaxMiner, MAFIA
  - Closed pattern mining: CLOSET, CLOSET+, CARPENTER

SLIDE 62

MaxMiner: Mining Max-patterns

R. Bayardo. Efficiently mining long patterns from databases. In SIGMOD '98.

Idea: generate the complete set-enumeration tree one level at a time, pruning where possible.

[Figure: set-enumeration tree ∅(ABCD) with children A(BCD), B(CD), C(D), D(); then AB(CD), AC(D), AD(), BC(D), BD(), CD(); then ABC(D), ABD(), ACD(), BCD(); then ABCD()]

slide-63
SLIDE 63

Algorithm MaxMiner

- Initially, generate one node N = ∅(ABCD), where the head h(N) = ∅ and the tail t(N) = {A, B, C, D}.
- Recursively expand N.
- Local pruning:
  - If h(N) ∪ t(N) (the leaf node of N's subtree) is frequent, do not expand N: prune the entire subtree (bottom-up pruning; its subsets cannot be max-patterns).
  - If for some i ∈ t(N), h(N) ∪ {i} (an immediate child node) is NOT frequent, remove i from t(N) before expanding N: prune sub-branch i (top-down pruning, same as Apriori).
- Global pruning.

SLIDE 64

Local Pruning Techniques (e.g. at node A)

- Check the frequency of ABCD and of AB, AC, AD.
- If ABCD is frequent, prune the whole sub-tree.
- If AC is NOT frequent, remove C from the parenthesis before expanding (prune the AC branch).

SLIDE 65

Global Pruning Technique (across sub-trees)

When a max-pattern is identified (e.g., BCD), prune every node N (e.g., C, D) where h(N) ∪ t(N) is a subset of it (e.g., of BCD).

SLIDE 66

Example

Min_sup = 2

Tid  Items
10   A, B, C, D, E
20   B, C, D, E
30   A, C, D, F

At the root ∅(ABCDEF), count ABCDEF and each item:
Item  Frequency
A     2
B     2
C     3
D     3
E     2
F     1

F is infrequent, so it is removed from the tail; expand to A(BCDE), B(CDE), C(DE), D(E), E().

SLIDE 67

Example

Min_sup = 2. Expanding node A(BCDE): count ABCDE and A's immediate children:
Itemset  Frequency
ABCDE    1
AB       1
AC       2
AD       2
AE       1

AB and AE are infrequent, so B and E are removed from A's tail; expand to AC(D) and AD().

SLIDE 68

Example

Min_sup = 2. Expanding node B(CDE): count BCDE (and BC, BD, BE if needed):
Itemset  Frequency
BCDE     2

BCDE, the leaf of B's subtree, is frequent, so the entire subtree under B is pruned.
Max patterns so far: BCDE.

SLIDE 69

Example

Min_sup = 2. Expanding node AC(D): count ACD:
Itemset  Frequency
ACD      2

ACD is frequent, so it is recorded.
Max patterns: BCDE, ACD.

SLIDE 70

Mining Frequent Patterns, Association and Correlations

- Basic concepts and a road map
- Frequent itemset mining methods
- Frequent sequence mining and graph mining
  - Apriori-based: GSP (EDBT '96)
  - FP-Growth-based: PrefixSpan (ICDE '01)
- Mining various kinds of association rules
- From association mining to correlation analysis
- Summary

SLIDE 71

GSP Algorithm (Generalized Sequential Pattern)

Database D:
ID   Record
100  a→c→d
200  b→c→d
300  a→b→c→e→d
400  d→b
500  a→d→c→d

Scan D → C1 (candidate 1-seqs): {a}:3, {b}:3, {c}:4, {d}:4, {e}:1
F1 (frequent 1-seqs): {a}:3, {b}:3, {c}:4, {d}:4

C2 (candidate 2-seqs): all 16 pairs {a→a}, {a→b}, …, {d→d}
Scan D → F2 (frequent 2-seqs): {a→c}:3, {a→d}:3, {c→d}:4

C3 (candidate 3-seqs): {a→c→d}
Scan D → F3 (frequent 3-seqs): {a→c→d}:3

SLIDE 72

Mining Frequent Patterns, Association and Correlations

- Basic concepts and a road map
- Frequent itemset mining methods
- Frequent sequence mining and graph mining
- Mining various kinds of association rules
- Summary

SLIDE 73

Basic Concepts: Frequent Patterns and Association Rules

Itemset: X = {x1, …, xk} (k-itemset)

Frequent itemset: X with minimum support count

Support count (absolute support): count of transactions containing X

Association rule: A ⇒ B with minimum support and confidence

Support: probability that a transaction contains A ∪ B: s = P(A ∪ B)

Confidence: conditional probability that a transaction having A also contains B: c = P(B | A)

Association rule mining process:
1. Find all frequent patterns (the more costly step).
2. Generate strong association rules.

[Figure: Venn diagram of customers buying beer, customers buying diapers, and customers buying both]

Transaction-id  Items bought
10  A, B, D
20  A, C, D
30  A, D, E
40  B, E, F
50  B, C, D, E, F

SLIDE 74

Mining Frequent Patterns, Association and Correlations

- Basic concepts and a road map
- Efficient and scalable frequent itemset mining methods
- Mining various kinds of association rules
- From association mining to correlation analysis
- Summary