

slide-1
SLIDE 1

Data Mining

Associative pattern mining Hamid Beigy

Sharif University of Technology

Fall 1396

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 1 / 70

slide-2
SLIDE 2

Outline

1

Introduction

2

Frequent pattern mining model

3

Frequent itemset mining algorithms Brute force Frequent itemset mining algorithm Apriori algorithm Frequent pattern growth (FP-growth) Mining frequent itemsets using vertical data format

4

Summarizing itemsets Mining maximal itemsets Mining closed itemsets

5

Sequence mining

6

Graph mining

7

Pattern and rule assessment

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 2 / 70

slide-3
SLIDE 3

Table of contents

1

Introduction

2

Frequent pattern mining model

3

Frequent itemset mining algorithms Brute force Frequent itemset mining algorithm Apriori algorithm Frequent pattern growth (FP-growth) Mining frequent itemsets using vertical data format

4

Summarizing itemsets Mining maximal itemsets Mining closed itemsets

5

Sequence mining

6

Graph mining

7

Pattern and rule assessment

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 3 / 70

slide-4
SLIDE 4

Introduction

The classical problem of associative pattern mining is defined in the context of supermarket data (items bought by customers). The sets of items bought together by customers are referred to as transactions. The goal is to determine associations between groups of items bought by customers. The most popular model for associative pattern mining uses the frequencies of sets of items as the quantification of the level of association. The discovered sets of items are referred to as large itemsets, frequent itemsets, or frequent patterns.

Which items are frequently purchased together by customers?
(Figure: shopping baskets of several customers, e.g. {milk, cereal, bread}, {milk, bread, butter}, {milk, bread, sugar, eggs}, analyzed by a market analyst.)
Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 3 / 70

slide-5
SLIDE 5

Applications of associative pattern mining

Associative pattern mining has a wide variety of applications.
Supermarket data: The supermarket application was the original motivating scenario in which the frequent pattern mining problem was proposed. The goal is to mine the sets of items that are frequently bought together at a supermarket by analyzing customer shopping transactions.
Text mining: Text data is often represented in the bag-of-words model; frequent pattern mining can help in identifying co-occurring terms and keywords. Such co-occurring terms have numerous text-mining applications.
Web mining: A Web site logs all incoming traffic in the form of records containing the source and destination pages requested by a user, the time, and the return code. We are interested in finding whether there are sets of Web pages that many users tend to browse whenever they visit the Web site.
Generalization to dependency-oriented data types: The original frequent pattern mining model has been generalized to many dependency-oriented data types, such as time-series data, sequential data, spatial data, and graph data, with a few modifications. Such models are useful in applications such as Web log analysis, software bug detection, and spatiotemporal event detection.
Other major data mining problems: Frequent pattern mining can be used as a subroutine to provide effective solutions to many data mining problems such as clustering, classification, and outlier analysis.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 4 / 70

slide-6
SLIDE 6

Association rules

Frequent itemsets can be used to generate association rules of the form X ⇒ Y, where X and Y are sets of items. For example, if the supermarket owner discovers the rule {Eggs, Milk} ⇒ {Yogurt}, she/he can promote Yogurt to customers who often buy Eggs and Milk. The frequency-based model for associative pattern mining is very popular due to its simplicity. However, the raw frequency of a pattern is not the same as the statistical significance of the underlying correlations. Therefore, several models based on statistical significance have been proposed; the patterns they discover are referred to as interesting patterns.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 5 / 70

slide-7
SLIDE 7

Table of contents

1

Introduction

2

Frequent pattern mining model

3

Frequent itemset mining algorithms Brute force Frequent itemset mining algorithm Apriori algorithm Frequent pattern growth (FP-growth) Mining frequent itemsets using vertical data format

4

Summarizing itemsets Mining maximal itemsets Mining closed itemsets

5

Sequence mining

6

Graph mining

7

Pattern and rule assessment

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 6 / 70

slide-8
SLIDE 8

Frequent pattern mining model

Assume that the database T contains n transactions T1, T2, . . . , Tn. Each transaction has a unique identifier, referred to as transaction identifier or tid. Each transaction Ti is drawn on the universe of items U.

tid   Set of Items                   Binary Representation
1     {Bread, Butter, Milk}          110010
2     {Eggs, Milk, Yogurt}           000111
3     {Bread, Cheese, Eggs, Milk}    101110
4     {Eggs, Milk, Yogurt}           000111
5     {Cheese, Milk, Yogurt}         001011

An itemset is a set of items. A k-itemset is an itemset that contains exactly k items. The fraction of transactions in T = {T1, T2, . . . , Tn} in which an itemset occurs as a subset is known as the support of the itemset. Definition (Support): The support of an itemset I, denoted sup(I), is defined as the fraction of the transactions in the database T = {T1, T2, . . . , Tn} that contain I as a subset. Items that are correlated will have high support.
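As a small illustration of the definition, a minimal Python sketch (the transaction literals mirror the example table; the function name is mine):

```python
# Minimal sketch: computing the support of an itemset, following the definition above.
transactions = [
    {"Bread", "Butter", "Milk"},
    {"Eggs", "Milk", "Yogurt"},
    {"Bread", "Cheese", "Eggs", "Milk"},
    {"Eggs", "Milk", "Yogurt"},
    {"Cheese", "Milk", "Yogurt"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain `itemset` as a subset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

print(support({"Bread", "Milk"}, transactions))      # 0.4
print(support({"Cheese", "Yogurt"}, transactions))   # 0.2
```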

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 6 / 70

slide-9
SLIDE 9

Frequent pattern mining model (cont.)

The goal of the frequent pattern mining model is to determine all itemsets with at least a minimum level of support, denoted by minsup. Definition (Frequent itemset mining): Given a set of transactions T = {T1, T2, . . . , Tn}, where each transaction Ti is a subset of items from U, determine all itemsets I that occur as a subset of at least a predefined fraction minsup of the transactions in T. Consider the following database

tid   Set of Items
1     {Bread, Butter, Milk}
2     {Eggs, Milk, Yogurt}
3     {Bread, Cheese, Eggs, Milk}
4     {Eggs, Milk, Yogurt}
5     {Cheese, Milk, Yogurt}

The universe of items is U = {Bread, Butter, Cheese, Eggs, Milk, Yogurt}.
sup({Bread, Milk}) = 2/5 = 0.4.
sup({Cheese, Yogurt}) = 1/5 = 0.2.

The number of frequent itemsets is generally very sensitive to the value of minsup. Therefore, an appropriate choice of minsup is crucial for discovering a set of frequent patterns with meaningful size.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 7 / 70

slide-10
SLIDE 10

Frequent pattern mining model (cont.)

When an itemset I is contained in a transaction, all of its subsets are also contained in that transaction. Therefore, the support of any subset J of I will always be at least equal to that of I. This is referred to as the support monotonicity property. Property (Support monotonicity): The support of every subset J of I is at least equal to the support of the itemset I: sup(J) ≥ sup(I) ∀J ⊆ I. This implies that every subset of a frequent itemset is also frequent, which is referred to as the downward closure property. Property (Downward closure): Every subset of a frequent itemset is also frequent. The downward closure property of frequent patterns is algorithmically very convenient because it provides an important constraint on the inherent structure of frequent patterns.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 8 / 70

slide-11
SLIDE 11

Frequent pattern mining model (cont.)

The downward closure property can be used to create concise representations of frequent patterns, wherein only the maximal frequent itemsets are retained. Definition (Maximal frequent itemsets): A frequent itemset is maximal at a given minimum support level minsup if it is frequent and no superset of it is frequent. Consider the following database

tid   Set of Items
1     {Bread, Butter, Milk}
2     {Eggs, Milk, Yogurt}
3     {Bread, Cheese, Eggs, Milk}
4     {Eggs, Milk, Yogurt}
5     {Cheese, Milk, Yogurt}

The itemset {Eggs, Milk, Yogurt} is a maximal frequent itemset at minsup = 0.3. The itemset {Eggs, Milk} is not maximal, because it has a superset that is also frequent. All frequent itemsets can be derived from the maximal patterns by enumerating the subsets of the maximal frequent patterns. The maximal patterns can therefore be considered a condensed representation of the frequent patterns.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 9 / 70

slide-12
SLIDE 12

Frequent pattern mining model (cont.)

The maximal patterns can be considered a condensed representation of the frequent patterns. This representation does not retain information about the support values of the subsets. For example, sup({Eggs, Milk, Yogurt}) = 0.4 does not determine sup({Milk, Yogurt}) = 0.6. A different representation, called closed frequent itemsets, is able to retain the support information of the subsets (discussed later). An interesting property of itemsets is that they can be conceptually arranged in the form of a lattice of itemsets.

This lattice contains one node for each subset, and neighboring nodes differ by exactly one item. All frequent pattern mining algorithms, implicitly or explicitly, traverse this search space to determine frequent patterns.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 10 / 70

slide-13
SLIDE 13

Frequent pattern mining model (cont.)

This lattice contains one node for each subset and neighboring nodes differ by exactly one item.

(Figure: the itemset lattice separated by a border into frequent itemsets above and infrequent itemsets below.)

This lattice is separated into frequent and infrequent itemsets by a border. All itemsets above this border are frequent, and all itemsets below it are infrequent. All maximal frequent itemsets are adjacent to this border.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 11 / 70

slide-14
SLIDE 14

Association rule generation

Frequent itemsets can be used to generate association rules using the confidence measure. The confidence of a rule X ⇒ Y is the conditional probability that a transaction contains the itemset Y given that it contains the itemset X. Definition (Confidence): Let X and Y be two sets of items. The confidence conf(X ⇒ Y) of the rule X ⇒ Y is the conditional probability of X ∪ Y occurring in a transaction, given that the transaction contains X:

conf(X ⇒ Y) = sup(X ∪ Y) / sup(X)

Example: sup({Eggs, Milk}) = 0.6 and sup({Eggs, Milk, Yogurt}) = 0.4, so conf({Eggs, Milk} ⇒ {Yogurt}) = 0.4 / 0.6 = 2/3.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 12 / 70

slide-15
SLIDE 15

Association rule generation (cont.)

Association rules are defined using both support and confidence criteria. Definition (Association Rules): Let X and Y be two sets of items. Then the rule X ⇒ Y is said to be an association rule at a minimum support of minsup and a minimum confidence of minconf if it satisfies both of the following criteria:

1

The support of the itemset X ∪ Y is at least minsup.

2

The confidence of the rule X ⇒ Y is at least minconf.

The first criterion ensures that a sufficient number of transactions are relevant to the rule. The second criterion ensures that the rule has sufficient strength in terms of conditional probabilities. The association rules are generated in two phases.

1

All the frequent itemsets are generated at the minimum support of minsup.

2

The association rules are generated from the frequent itemsets at the minimum confidence level of minconf .

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 13 / 70

slide-16
SLIDE 16

Association rule generation (cont.)

Assume that a set of frequent itemsets F is provided. For each I ∈ F, we partition I into all possible combinations of sets X and Y = I − X (Y ≠ ∅, X ≠ ∅) such that I = X ∪ Y. Then the rule X ⇒ Y is generated and its confidence is calculated (a sketch follows below). Association rules satisfy a confidence monotonicity property (derive it). Property (Confidence Monotonicity): Let X1, X2, and I be itemsets such that X1 ⊂ X2 ⊂ I. Then the confidence of X2 ⇒ I − X2 is at least that of X1 ⇒ I − X1: conf(X2 ⇒ I − X2) ≥ conf(X1 ⇒ I − X1). This property follows directly from the definition of confidence and the support monotonicity property.
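A sketch of this rule-generation phase in Python, assuming the frequent itemsets and their supports are already available (the support dictionary below is hand-filled from the running example; all names are mine):

```python
from itertools import combinations

# Hypothetical output of the frequent-itemset mining phase: itemset -> support.
sup = {
    frozenset({"Eggs"}): 0.6,
    frozenset({"Milk"}): 1.0,
    frozenset({"Yogurt"}): 0.6,
    frozenset({"Eggs", "Milk"}): 0.6,
    frozenset({"Eggs", "Yogurt"}): 0.4,
    frozenset({"Milk", "Yogurt"}): 0.6,
    frozenset({"Eggs", "Milk", "Yogurt"}): 0.4,
}

def rules(frequent, minconf):
    """Generate all rules X => Y with X ∪ Y = I, X and Y nonempty, conf >= minconf."""
    out = []
    for I in frequent:
        if len(I) < 2:
            continue
        for k in range(1, len(I)):
            for X in map(frozenset, combinations(I, k)):
                Y = I - X
                conf = frequent[I] / frequent[X]   # sup(X ∪ Y) / sup(X)
                if conf >= minconf:
                    out.append((set(X), set(Y), conf))
    return out

for X, Y, c in rules(sup, minconf=0.6):
    print(X, "=>", Y, round(c, 2))
```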

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 14 / 70

slide-17
SLIDE 17

Table of contents

1

Introduction

2

Frequent pattern mining model

3

Frequent itemset mining algorithms Brute force Frequent itemset mining algorithm Apriori algorithm Frequent pattern growth (FP-growth) Mining frequent itemsets using vertical data format

4

Summarizing itemsets Mining maximal itemsets Mining closed itemsets

5

Sequence mining

6

Graph mining

7

Pattern and rule assessment

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 15 / 70

slide-18
SLIDE 18

Outline

1

Introduction

2

Frequent pattern mining model

3

Frequent itemset mining algorithms Brute force Frequent itemset mining algorithm Apriori algorithm Frequent pattern growth (FP-growth) Mining frequent itemsets using vertical data format

4

Summarizing itemsets Mining maximal itemsets Mining closed itemsets

5

Sequence mining

6

Graph mining

7

Pattern and rule assessment

slide-19
SLIDE 19

Brute force Frequent itemset mining algorithm

For a set of items U, there are a total of 2^|U| − 1 distinct subsets, excluding the empty set. The simplest method is to generate all these candidate itemsets and then count their support from the database T. To count the support of an itemset, we must check whether each candidate itemset I is a subset of each transaction Ti ∈ T. This exhaustive approach is likely impractical when |U| is large. A faster approach follows from observing that no (k + 1)-patterns can be frequent if no k-patterns are frequent; this observation follows directly from the downward closure property. Hence, we can enumerate and count the support of patterns of increasing length (a brute-force sketch follows the list below). Better improvements can be obtained by using one or more of the following approaches.

1

Reducing the size of the explored search space by pruning candidate itemsets using tricks, such as the downward closed property.

2

Counting the support of each candidate more efficiently by pruning transactions that are known to be irrelevant for counting a candidate itemset.

3

Using compact data structures to represent either candidates or transaction databases that support efficient counting.
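A brute-force sketch in Python along the lines just described (names are mine; it enumerates itemsets level by level and stops once a level yields no frequent itemsets):

```python
from itertools import combinations

def brute_force(transactions, minsup):
    """Enumerate all itemsets level by level and count supports (transactions are sets)."""
    universe = sorted(set().union(*transactions))
    n = len(transactions)
    frequent = {}
    for k in range(1, len(universe) + 1):
        found = False
        for cand in combinations(universe, k):
            cand = frozenset(cand)
            sup = sum(cand <= t for t in transactions) / n
            if sup >= minsup:
                frequent[cand] = sup
                found = True
        if not found:   # no frequent k-itemsets => no frequent (k+1)-itemsets (downward closure)
            break
    return frequent
```

This is practical only for a very small |U|; the improvements listed above attack exactly the two bottlenecks it exposes (candidate explosion and repeated subset checks).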

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 15 / 70

slide-20
SLIDE 20

Outline

1

Introduction

2

Frequent pattern mining model

3

Frequent itemset mining algorithms Brute force Frequent itemset mining algorithm Apriori algorithm Frequent pattern growth (FP-growth) Mining frequent itemsets using vertical data format

4

Summarizing itemsets Mining maximal itemsets Mining closed itemsets

5

Sequence mining

6

Graph mining

7

Pattern and rule assessment

slide-21
SLIDE 21

Apriori algorithm

Apriori employs an iterative approach known as level-wise search, where frequent k-itemsets are used to explore (k + 1)-itemsets. The Apriori algorithm uses the downward closure property to prune the candidate search space. Apriori works as follows.

1

The set of candidate 1-itemsets, called C1, is found by scanning the database and counting the support of each item.

2

The supports of the items in C1 are compared with minsup, and items with support smaller than minsup are pruned. The resulting set is denoted by L1.

3

L1 is used to construct C2. This step is called join step.

4

C2 is pruned using minsup to construct L2. This step is called prune step.

5

L2 is used to construct C3.

6

C3 is pruned using minsup to construct L3.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 16 / 70

slide-22
SLIDE 22

Apriori algorithm (join and prune steps)

Join step

1

To find Ck, a set of k−itemsets is generated by joining Lk−1 with itself.

2

Let l1 and l2 be two itemsets in Lk−1. Let li[j] be the jth item in itemset i.

3

Apriori assumes that items in a itemsets are sorted in lexicographic order.

4

The join Lk−1 ▷◁ Lk−1 is performed, where two members of Lk−1 are joinable if their first (k − 2) items are in common.

5

Join Lk−1 ▷ ◁ Lk−1 is performed if (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ . . . ∧ (l1[k − 2] = l2[k − 2]) ∧ (l1[k − 1] < l2[k − 1])

6

Condition (l1[k − 1] < l2[k − 1]) ensures that no duplicates are generated.

7

The join produces the candidate itemset {l1[1], l1[2], . . . , l1[k − 2], l1[k − 1], l2[k − 1]}; the set of all such candidates forms Ck.

Prune step

1

Ck is a superset of Lk.

2

A database scan is done to count support of each itemset.

3

All itemsets with support less than minsup are pruned.
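Putting the join and prune steps together, a minimal Apriori sketch in Python (an absolute support threshold is assumed; names are mine):

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Apriori sketch; transactions are sets, minsup_count is an absolute threshold."""
    # L1: frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    Lk = {i for i, c in counts.items() if c >= minsup_count}
    frequent = set(Lk)
    k = 2
    while Lk:
        # Join step: merge pairs of (k-1)-itemsets sharing their first k-2 items.
        prev = sorted(tuple(sorted(s)) for s in Lk)
        Ck = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                if prev[i][:k - 2] == prev[j][:k - 2]:
                    cand = frozenset(prev[i]) | frozenset(prev[j])
                    # Downward-closure pruning: every (k-1)-subset must be frequent.
                    if len(cand) == k and all(
                        frozenset(sub) in Lk for sub in combinations(cand, k - 1)
                    ):
                        Ck.add(cand)
        # Prune step: one database scan to count supports of the candidates.
        counts = {c: sum(c <= t for t in transactions) for c in Ck}
        Lk = {c for c, n in counts.items() if n >= minsup_count}
        frequent |= Lk
        k += 1
    return frequent
```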

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 17 / 70

slide-23
SLIDE 23

Apriori algorithm (Example)

Consider the following database

TID    List of item IDs
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 18 / 70

slide-24
SLIDE 24

Apriori algorithm (Example)

Assume that minsup = 2. Apriori generates frequent patterns in the following way.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 19 / 70

slide-25
SLIDE 25

Improving the efficiency of Apriori

How can we further improve the efficiency of Apriori-based mining?

1

Hash-based technique: This technique can be used to reduce the size of Ck (for k > 1). For example, while scanning each transaction to generate the frequent 1-itemsets L1, we can also generate all 2-itemsets in the transaction, hash them into buckets of a hash table, and increment the corresponding bucket counts.

2

Transaction reduction: A transaction that does not contain any frequent k-itemsets cannot contain any frequent (k + 1)-itemsets. Therefore, such a transaction can be marked or removed from further consideration.

3

Partitioning : Partitioning the data to find the candidate itemsets.

(Figure: the partitioning scheme.) Phase I: divide D into n partitions and find the frequent itemsets local to each partition (one scan). Phase II: combine all local frequent itemsets to form the candidate itemsets, then find the global frequent itemsets among the candidates (one scan).

4

Sampling : Mining on a subset of the given data. The idea is to pick a random sample S of the given data T , and then search for frequent itemsets in S instead of T .

5

Dynamic itemset counting : Adding candidate itemsets at different points during the scan.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 20 / 70

slide-26
SLIDE 26

Outline

1

Introduction

2

Frequent pattern mining model

3

Frequent itemset mining algorithms Brute force Frequent itemset mining algorithm Apriori algorithm Frequent pattern growth (FP-growth) Mining frequent itemsets using vertical data format

4

Summarizing itemsets Mining maximal itemsets Mining closed itemsets

5

Sequence mining

6

Graph mining

7

Pattern and rule assessment

slide-27
SLIDE 27

Frequent pattern growth (FP-growth)

Apriori uses a candidate generate-and-test method, which greatly reduces the size of the candidate sets. It can still suffer from two costs:

It may need to generate a huge number of candidate sets. It may need to repeatedly scan the whole database and check a large set of candidates by pattern matching.

The FP-growth method adopts the following divide-and-conquer strategy.

It compresses the database representing the frequent items into a frequent pattern tree (FP-tree). It divides the compressed database into a set of conditional databases, each associated with one frequent item, and then mines each such database separately.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 21 / 70

slide-28
SLIDE 28

Frequent pattern growth (cont.)

FP-growth scans the database and generates the 1-itemsets and their supports. The set of frequent items is sorted in decreasing order of support count into a list L. The FP-tree is then constructed as follows.

1

Create the root of the tree labeled with null.

2

Scan the database for the second time.

3

The items in each transaction are processed in L order and a branch is created for each transaction. Branches with a common prefix share that prefix, and the counts of the shared prefix nodes are incremented.

4

To facilitate tree traversal, an item header table is built over the list L, so that each item points to its occurrences in the tree via a chain of node-links.

The FP-tree for the following database is

TID    List of item IDs
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

(Figure: the resulting FP-tree rooted at null, with header table I2:7, I1:6, I3:6, I4:2, I5:2 and node-links into the tree.)
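A minimal Python sketch of the two-scan construction just described (class and variable names are mine; the database literal mirrors the example above):

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 0, {}

def build_fptree(transactions, minsup_count):
    """Sketch of FP-tree construction: first scan builds L, second scan inserts branches."""
    freq = defaultdict(int)
    for t in transactions:
        for item in t:
            freq[item] += 1
    L = [i for i in sorted(freq, key=freq.get, reverse=True) if freq[i] >= minsup_count]
    root = FPNode(None, None)
    header = defaultdict(list)              # item -> list of nodes (node-links)
    for t in transactions:
        node = root
        for item in [i for i in L if i in t]:   # process items in L order
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += 1                 # shared prefixes increment counts
    return root, header, L

db = [{"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"}, {"I1","I2","I4"}, {"I1","I3"},
      {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"}, {"I1","I2","I3"}]
root, header, L = build_fptree(db, minsup_count=2)
print(L)   # e.g. ['I2', 'I1', 'I3', 'I4', 'I5'] (ties may be ordered differently)
```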

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 22 / 70

slide-29
SLIDE 29

Frequent pattern growth (FP-tree mining)

The FP-tree is mined as follows.

1

Start from each frequent length-1 pattern, taken from the end of L, as an initial suffix pattern.

2

Construct its conditional pattern base (a sub-database), which consists of the set of prefix paths in the FP-tree co-occurring with the suffix pattern.

3

Construct the corresponding conditional FP-tree.

4

Perform mining recursively on the tree.

5

The pattern growth is achieved by the concatenation of the suffix pattern with the frequent pattern generated from a conditional FP-tree.

(Figure: the FP-tree of the previous slide, with header table I2:7, I1:6, I3:6, I4:2, I5:2.)

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 23 / 70

slide-30
SLIDE 30

Frequent pattern growth (Example)

Consider the following FP-tree.

(Figure: the FP-tree of the previous slides, with header table I2:7, I1:6, I3:6, I4:2, I5:2.)

1

Start from I5. It occurs in two FP-tree branches.

2

The paths formed by these branches are ⟨I2, I1, I5: 1⟩ and ⟨I2, I1, I3, I5: 1⟩.

3

Considering I5 as suffix, its prefix paths are (I2, I1:1), (I2, I1, I3:1)

4

Using this conditional pattern base as a transaction database, we build an I5-conditional FP-tree containing a single path (I2, I1:2).

5

In this conditional pattern base, I3 has support 1 < minsup, so I3 is removed.

6

This path generates all the frequent patterns containing I5: {I2, I5: 2}, {I1, I5: 2}, {I2, I1, I5: 2}.

Mining the FP-tree by creating conditional (sub-)pattern bases:

Item   Conditional Pattern Base            Conditional FP-tree        Frequent Patterns Generated
I5     {{I2, I1: 1}, {I2, I1, I3: 1}}      ⟨I2: 2, I1: 2⟩             {I2, I5: 2}, {I1, I5: 2}, {I2, I1, I5: 2}
I4     {{I2, I1: 1}, {I2: 1}}              ⟨I2: 2⟩                    {I2, I4: 2}
I3     {{I2, I1: 2}, {I2: 2}, {I1: 2}}     ⟨I2: 4, I1: 2⟩, ⟨I1: 2⟩    {I2, I3: 4}, {I1, I3: 4}, {I2, I1, I3: 2}
I1     {{I2: 4}}                           ⟨I2: 4⟩                    {I2, I1: 4}

(Figure: the I3-conditional FP-tree, with header table I2:4, I1:4.)
Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 24 / 70

slide-31
SLIDE 31

Outline

1

Introduction

2

Frequent pattern mining model

3

Frequent itemset mining algorithms Brute force Frequent itemset mining algorithm Apriori algorithm Frequent pattern growth (FP-growth) Mining frequent itemsets using vertical data format

4

Summarizing itemsets Mining maximal itemsets Mining closed itemsets

5

Sequence mining

6

Graph mining

7

Pattern and rule assessment

slide-32
SLIDE 32

Mining frequent itemsets using vertical data format

Both the Apriori and FP-growth methods mine frequent patterns from a set of transactions in the TID-itemset format; this is known as the horizontal data format. Alternatively, data can be presented in the item-TIDset format (i.e., {x, t(x)}), known as the vertical data format. The Eclat (Equivalence Class Transformation) algorithm was proposed to mine frequent patterns using the vertical data format. It works as follows.

1

We transform the horizontally formatted data into the vertical format by scanning the data set once.

2

The support count of an itemset is simply the length of the TID set of the itemset.

3

Starting with k = 1, the frequent k−itemsets can be used to construct the candidate (k + 1)−itemsets based on the Apriori property.

4

The computation is done by intersection of the TID sets of the frequent k−itemsets to compute the TID sets of the corresponding (k + 1)−itemsets.

5

This process repeats, with k incremented by 1 each time, until no frequent itemsets or candidate itemsets can be found.

When generating the candidate (k + 1)-itemsets from the frequent k-itemsets, there is no need to scan the database to find their supports: the TID set of each k-itemset carries the complete information required for counting. However, the TID sets can be quite long, taking substantial memory space as well as computation time for intersecting the long sets.
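A minimal Eclat-style sketch in Python under the description above (function and variable names are mine; candidates are formed by unioning pairs of frequent k-itemsets and intersecting their TID sets):

```python
from collections import defaultdict

def eclat(transactions, minsup_count):
    """Vertical-format mining sketch: supports are TID-set lengths."""
    # Transform the horizontal database into the vertical format (one scan).
    tidsets = defaultdict(set)
    for tid, t in enumerate(transactions):
        for item in t:
            tidsets[frozenset([item])].add(tid)
    current = {i: tids for i, tids in tidsets.items() if len(tids) >= minsup_count}
    frequent = dict(current)
    while current:
        nxt = {}
        items = sorted(current, key=sorted)
        for i in range(len(items)):
            for j in range(i + 1, len(items)):
                cand = items[i] | items[j]
                if len(cand) == len(items[i]) + 1:          # join k-itemsets into (k+1)-itemsets
                    tids = current[items[i]] & current[items[j]]   # TID-set intersection
                    if len(tids) >= minsup_count:
                        nxt[cand] = tids
        frequent.update(nxt)
        current = nxt
    return frequent
```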

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 25 / 70

slide-33
SLIDE 33

Mining frequent itemsets using vertical data format (Example)

Consider the following database in horizontal and vertical data format.

Horizontal format:
TID    List of item IDs
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

Vertical format:
itemset   TID set
I1        {T100, T400, T500, T700, T800, T900}
I2        {T100, T200, T300, T400, T600, T800, T900}
I3        {T300, T500, T600, T700, T800, T900}
I4        {T200, T400}
I5        {T100, T800}

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 26 / 70

slide-34
SLIDE 34

Mining frequent itemsets using vertical data format (Example)

2-itemsets and 3-itemsets in vertical data format using minsup=2.

2-itemsets in vertical data format:
itemset    TID set
{I1, I2}   {T100, T400, T800, T900}
{I1, I3}   {T500, T700, T800, T900}
{I1, I4}   {T400}
{I1, I5}   {T100, T800}
{I2, I3}   {T300, T600, T800, T900}
{I2, I4}   {T200, T400}
{I2, I5}   {T100, T800}
{I3, I5}   {T800}

3-itemsets in vertical data format:
itemset        TID set
{I1, I2, I3}   {T800, T900}
{I1, I2, I5}   {T100, T800}

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 27 / 70

slide-35
SLIDE 35

Table of contents

1

Introduction

2

Frequent pattern mining model

3

Frequent itemset mining algorithms Brute force Frequent itemset mining algorithm Apriori algorithm Frequent pattern growth (FP-growth) Mining frequent itemsets using vertical data format

4

Summarizing itemsets Mining maximal itemsets Mining closed itemsets

5

Sequence mining

6

Graph mining

7

Pattern and rule assessment

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 28 / 70

slide-36
SLIDE 36

Summarizing itemsets

The search space for frequent itemsets is usually very large, and it grows exponentially with the number of items. Small values of minsup may result in an intractable number of frequent itemsets. An alternative approach is to determine condensed representations of the frequent itemsets that summarize their essential characteristics. The use of condensed representations not only reduces the computational and storage requirements but also makes it easier to analyze the mined patterns. We consider the following representations.

1

Maximal frequent itemsets

2

closed frequent itemsets

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 28 / 70

slide-37
SLIDE 37

Outline

1

Introduction

2

Frequent pattern mining model

3

Frequent itemset mining algorithms Brute force Frequent itemset mining algorithm Apriori algorithm Frequent pattern growth (FP-growth) Mining frequent itemsets using vertical data format

4

Summarizing itemsets Mining maximal itemsets Mining closed itemsets

5

Sequence mining

6

Graph mining

7

Pattern and rule assessment

slide-38
SLIDE 38

Mining maximal itemsets

A frequent itemset is called maximal if it has no frequent supersets. Let M be the set of maximal frequent itemsets. Using M, we can determine whether any itemset x is frequent: if there exists a maximal itemset Z ∈ M such that x ⊆ Z, then x must be frequent; otherwise x cannot be frequent.

Consider the following database

t   i(t)
1   ABDE
2   BCE
3   ABDE
4   ABCE
5   ABCDE
6   BCD

The frequent itemsets for minsup = 3 (absolute support) are:

sup   Itemsets
6     B
5     E, BE
4     A, C, D, AB, AE, BC, BD, ABE
3     AD, CE, DE, ABD, ADE, BCE, BDE, ABDE

The maximal frequent itemsets are ABDE and BCE.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 29 / 70

slide-39
SLIDE 39

Mining maximal itemsets (cont.)

Mining maximal itemsets requires steps beyond simply determining the frequent itemsets. Initially M = ∅; whenever we generate a new frequent itemset X, we perform the following maximality checks.

1

Subset check: ∄Y ∈ M such that X ⊂ Y. If such a Y exists, then X is not maximal. Otherwise, we add X to M as a potentially maximal itemset.

2

Superset check: ∄Y ∈ M such that Y ⊂ X. If such a Y exists, then Y is not maximal and has to be removed from M.

These two maximality checks take O(|M|) time, which can become expensive as M grows. Any frequent itemset mining algorithm can be extended to mine maximal frequent itemsets by adding these maximality checking steps.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 30 / 70

slide-40
SLIDE 40

Mining maximal itemsets (GenMax)

GenMax is based on the tidset intersection approach of Eclat. GenMax never inserts a non-maximal itemset into M; it thus eliminates the superset check and requires only the subset check to determine maximality. GenMax first determines the set of frequent items together with their tidsets (i, t(i)). If the union of the items in the current branch is already contained in some maximal pattern Z ∈ M, no maximal itemset can be generated from that branch, and it is pruned. If the branch is not pruned, (Xi, t(Xi)) and (Xj, t(Xj)) for j > i are intersected and new candidates Xij are generated. If sup(Xij) ≥ minsup, then (Xij, t(Xij)) is added to Pi (the patterns in branch i). If Pi ≠ ∅, GenMax is called recursively on Pi. If Pi = ∅, then Xi cannot be extended and it is potentially maximal; we add Xi to M provided that Xi is not contained in any previously added maximal set Z ∈ M.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 31 / 70

slide-41
SLIDE 41

Mining maximal itemsets (GenMax algorithm)

ALGORITHM 9.1: GenMax
// Initial Call: M ← ∅, P ← {⟨i, t(i)⟩ | i ∈ I, sup(i) ≥ minsup}
GENMAX (P, minsup, M):
1   Y ← union of all Xi with ⟨Xi, t(Xi)⟩ ∈ P
2   if ∃Z ∈ M such that Y ⊆ Z then
3       return                                  // prune entire branch
4   foreach ⟨Xi, t(Xi)⟩ ∈ P do
5       Pi ← ∅
6       foreach ⟨Xj, t(Xj)⟩ ∈ P, with j > i do
7           Xij ← Xi ∪ Xj
8           t(Xij) ← t(Xi) ∩ t(Xj)
9           if sup(Xij) ≥ minsup then Pi ← Pi ∪ {⟨Xij, t(Xij)⟩}
10      if Pi ≠ ∅ then GENMAX (Pi, minsup, M)
11      else if ∄Z ∈ M such that Xi ⊆ Z then
12          M ← M ∪ {Xi}                        // add Xi to maximal set

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 32 / 70

slide-42
SLIDE 42

Mining maximal itemsets

Execution of GenMax using the following database

t   i(t)
1   ABDE
2   BCE
3   ABDE
4   ABCE
5   ABCDE
6   BCD

(Figure: the GenMax search tree. Initial tidsets: A → 1345, B → 123456, C → 2456, D → 1356, E → 12345; e.g. branch PA contains AB (1345), AD (135), AE (1345), which extend to ABD (135), ABE (1345), ABDE (135), and ADE (135).)

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 33 / 70

slide-43
SLIDE 43

Outline

1

Introduction

2

Frequent pattern mining model

3

Frequent itemset mining algorithms Brute force Frequent itemset mining algorithm Apriori algorithm Frequent pattern growth (FP-growth) Mining frequent itemsets using vertical data format

4

Summarizing itemsets Mining maximal itemsets Mining closed itemsets

5

Sequence mining

6

Graph mining

7

Pattern and rule assessment

slide-44
SLIDE 44

Mining closed itemsets

Although all itemsets can be derived from the maximal itemsets with the subsetting approach, their support values cannot be derived. Therefore, maximal itemsets are lossy because they do not retain information about the support values. To provide a lossless representation in terms of the support values, the notion of closed itemset mining is used. Definition (Closed itemset): A frequent itemset X ∈ F is closed if it has no frequent superset with the same frequency. The set of all closed frequent itemsets C is a condensed representation, as we can determine whether an itemset X is frequent, as well as the exact sup(X), using C. Show that the following relation holds: M ⊆ C ⊆ F. Define the closure operator c : 2^U → 2^U as c(X) = i(t(X)), where t(X) = {t ∈ T | t contains X} is the set of transactions containing X, and i(T) = {x ∈ U | ∀t ∈ T, t contains x} is the set of items common to all transactions in T.
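A small Python sketch of the operators t(·), i(·), and the closure c(X) = i(t(X)) (names mirror the definitions above; the database literal is the running example with transactions indexed from 0):

```python
def t(X, db):
    """Tidset: the set of transactions (by index) that contain itemset X."""
    return frozenset(idx for idx, trans in enumerate(db) if set(X) <= trans)

def i(T, db):
    """Itemset common to all transactions in tidset T."""
    return frozenset(set.intersection(*(db[j] for j in T))) if T else frozenset()

def closure(X, db):
    """c(X) = i(t(X)); X is closed iff closure(X, db) == frozenset(X)."""
    return i(t(X, db), db)

db = [set("ABDE"), set("BCE"), set("ABDE"), set("ABCE"), set("ABCDE"), set("BCD")]
print(closure({"A"}, db))   # {'A', 'B', 'E'} -> {A} is not closed, its closure is ABE
print(closure({"B"}, db))   # {'B'}           -> {B} is closed
```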

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 34 / 70

slide-45
SLIDE 45

Mining closed itemsets (CHARM)

An itemset X is closed if c(X) = X (show this). If c(X) ≠ X, then X is not closed, but the set c(X) is called its closure. From the properties of the closure operator, both X and c(X) have the same tidset. The set of all closed frequent itemsets is defined as C = {X | X ∈ F and ∄Y ⊃ X such that sup(X) = sup(Y)}. Mining closed frequent itemsets requires performing closure checks, that is, X = c(X). Direct closure checking can be very expensive. CHARM uses a vertical, tidset-intersection-based method. Given IT pairs (Xi, t(Xi)) and (Xj, t(Xj)), the following properties hold (show them).

1

Property 1: If t(Xi) = t(Xj), then c(Xi) = c(Xj) = c(Xi ∪ Xj). This implies that we can replace every occurrence of Xi with Xi ∪ Xj and prune the branch under Xj, because it has the same closure.

2

Property 2: If t(Xi) ⊂ t(Xj), then c(Xi) ≠ c(Xj) but c(Xi) = c(Xi ∪ Xj). This means that we can replace every occurrence of Xi with Xi ∪ Xj, but we cannot prune the branch under Xj because it generates a different closure.

3

Property 3: If t(Xi) ≠ t(Xj), then c(Xi) ≠ c(Xj) ≠ c(Xi ∪ Xj). In this case we cannot remove either Xi or Xj.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 35 / 70

slide-46
SLIDE 46

Mining closed itemsets (CHARM)

CHARM takes the set of frequent single items along with their tidsets as input (i.e., the pairs (Xi, t(Xi))). Initially the set of all closed itemsets C is empty. Given any IT-pair set P = {(Xi, t(Xi))}, CHARM first sorts the pairs in increasing order of their support counts. For each itemset Xi, we try to extend it with all other items Xj in the sorted order and then apply properties 1 and 2 to prune branches.

1

We make sure that Xij = Xi ∪ Xj is frequent.

2

If Xij is frequent, then we check properties 1 and 2.

Only when property 3 holds do we add the new extension Xij to the set Pi (initially Pi = ∅). If Pi ≠ ∅, then CHARM is called recursively on Pi. If Xi is not a subset of any closed set Z ∈ C with the same tidset, we can safely add it to C.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 36 / 70

slide-47
SLIDE 47

Mining closed itemsets (CHARM algorithm)

ALGORITHM 9.2: CHARM
// Initial Call: C ← ∅, P ← {⟨i, t(i)⟩ | i ∈ I, sup(i) ≥ minsup}
CHARM (P, minsup, C):
1   Sort P in increasing order of support (i.e., by increasing |t(Xi)|)
2   foreach ⟨Xi, t(Xi)⟩ ∈ P do
3       Pi ← ∅
4       foreach ⟨Xj, t(Xj)⟩ ∈ P, with j > i do
5           Xij ← Xi ∪ Xj
6           t(Xij) ← t(Xi) ∩ t(Xj)
7           if sup(Xij) ≥ minsup then
8               if t(Xi) = t(Xj) then                // Property 1
9                   Replace Xi with Xij in P and Pi
10                  Remove ⟨Xj, t(Xj)⟩ from P
11              else if t(Xi) ⊂ t(Xj) then           // Property 2
12                  Replace Xi with Xij in P and Pi
13              else                                 // Property 3
14                  Pi ← Pi ∪ {⟨Xij, t(Xij)⟩}
15      if Pi ≠ ∅ then CHARM (Pi, minsup, C)
16      if ∄Z ∈ C such that Xi ⊆ Z and t(Xi) = t(Z) then
17          C ← C ∪ {Xi}                             // add Xi to closed set

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 37 / 70

slide-48
SLIDE 48

Mining closed itemsets (CHARM example)

Execution of CHARM using the following database

t   i(t)
1   ABDE
2   BCE
3   ABDE
4   ABCE
5   ABCDE
6   BCD

(Figure: execution of the CHARM algorithm on this database; e.g., A with tidset 1345 is replaced by AE and then AEB, and its extension AD with tidset 135 grows to ADE and ADEB while processing PA; the remaining branches C (CE, CEB: 245), D (DE, DEB: 135), E (EB: 12345), and B (123456) are processed similarly.)

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 38 / 70

slide-49
SLIDE 49

Table of contents

1

Introduction

2

Frequent pattern mining model

3

Frequent itemset mining algorithms Brute force Frequent itemset mining algorithm Apriori algorithm Frequent pattern growth (FP-growth) Mining frequent itemsets using vertical data format

4

Summarizing itemsets Mining maximal itemsets Mining closed itemsets

5

Sequence mining

6

Graph mining

7

Pattern and rule assessment

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 39 / 70

slide-50
SLIDE 50

Sequence mining

Many real-world applications, such as bioinformatics, web mining, and text mining, have to deal with sequential and temporal data. Sequence mining helps discover patterns across time or positions in a given dataset. Let Σ denote an alphabet, defined as a finite set of characters or symbols, and let |Σ| denote its cardinality. A sequence or a string is defined as an ordered list of symbols, written as s = s1s2 . . . sk, where si ∈ Σ is the symbol at position i, also denoted as s[i]. The length of the sequence s is denoted by k = |s|; a sequence of length k is also called a k-sequence. A substring s[i : j] = si si+1 . . . sj−1 sj is a sequence of consecutive symbols in positions i through j (j ≥ i). A prefix of a sequence s is a substring of the form s[1 : i] = s1s2 . . . si, for i ∈ [0, k]. A suffix of a sequence s is a substring of the form s[i : k] = si si+1 . . . sk, for i ∈ [1, k + 1]. The string s[1 : 0] is the empty prefix and the string s[k + 1 : k] is the empty suffix.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 39 / 70

slide-51
SLIDE 51

Sequence mining (cont.)

Let Σ* be the set of all possible sequences that can be constructed using the symbols in Σ, including the empty sequence (which has length zero). For two sequences s = s1s2 . . . sn and r = r1r2 . . . rm, we say that r is a subsequence of s, denoted r ⊆ s, if there exists a one-to-one mapping φ : [1, m] → [1, n] such that r[i] = s[φ(i)] and, for any two positions i, j in r, i < j implies φ(i) < φ(j). The sequence r is called a consecutive subsequence or substring of s provided r[1 : m] = s[j : j + m − 1] for some 1 ≤ j ≤ n − m + 1. Example: Assume Σ = {A, C, G, T} and s = ACTGAACG. Then r1 = CGAAG is a subsequence of s, r2 = CTGA is a substring of s, r3 = ACT is a prefix of s, and r4 = GAACG is a suffix of s.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 40 / 70

slide-52
SLIDE 52

Sequence mining (cont.)

Given a database D = {s1, s2, . . . , sN} of N sequences, and given some sequence r, the support of r in the database D is defined as the total number of sequences in D that contain r: sup(r) = |{si ∈ D | r ⊆ si}|. Given minsup, a sequence r is frequent if sup(r) ≥ minsup. A frequent sequence is maximal if it is not a subsequence of any other frequent sequence. A frequent sequence is closed if it is not a subsequence of any other frequent sequence with the same support.
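A minimal Python sketch of subsequence containment and sequence support under these definitions (names are mine; the database literal is the example used on the following slides):

```python
def is_subsequence(r, s):
    """True if r is a (not necessarily consecutive) subsequence of s."""
    pos = 0
    for symbol in s:
        if pos < len(r) and symbol == r[pos]:
            pos += 1
    return pos == len(r)

def seq_support(r, database):
    """Number of sequences in the database that contain r as a subsequence."""
    return sum(is_subsequence(r, s) for s in database)

D = ["CAGAAGT", "TGACAG", "GAAGT"]
print(seq_support("GAAG", D))   # 3 -> frequent at minsup = 3
print(seq_support("AT", D))     # 2 -> infrequent at minsup = 3
```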

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 41 / 70

slide-53
SLIDE 53

Mining frequent sequences

For sequence mining, the order of the symbols matters, and thus we have to consider all possible permutations of the symbols as possible frequent candidates. The sequence search space can be organized in a prefix search tree. The root of the tree (level 0) contains the empty sequence, with each symbol x ∈ Σ as one of its children.

A node labeled with the sequence s = s1s2 . . . sk at level k has children of the form s′ = s1s2 . . . sksk+1 at level k + 1. s′ is called an extension of s. Consider the following database

Id   Sequence
s1   CAGAAGT
s2   TGACAG
s3   GAAGT

1

Σ = {A, C, G, T}

2

Sequence A has three extensions AA, AG, and AT.

3

If minsup = 3, AA and AG are frequent but AT is infrequent.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 42 / 70

slide-54
SLIDE 54

Mining frequent sequences (cont.)

The search space for all subsequences is conceptually infinite because it comprises all sequences (of length zero or more) in Σ*. In practice, the database D consists of sequences of bounded length. Let l denote the length of the longest sequence in the database. In the worst case, we must consider all candidate sequences of length up to l, so the size of the search space equals

|Σ|^1 + |Σ|^2 + . . . + |Σ|^l = O(|Σ|^l)

where at level k there are |Σ|^k possible subsequences of length k.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 43 / 70

slide-55
SLIDE 55

Generalized sequential pattern mining (GSP)

GSP searches the sequence prefix tree using a level-wise (breadth-first) search. Given the set of frequent sequences at level k, GSP generates all possible sequence extensions (candidates) at level k + 1. Then GSP computes the support of each candidate and prunes the infrequent ones. GSP stops when no more frequent extensions are possible. GSP uses the monotonicity property of support to prune candidate patterns. The computational complexity of GSP is O(|Σ|^l), and its I/O complexity is O(l · |D|), since it performs one database scan per level.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 44 / 70

slide-56
SLIDE 56

Generalized sequential pattern mining (GSP)

Consider the following database

Id   Sequence
s1   CAGAAGT
s2   TGACAG
s3   GAAGT

The sequence search space for minsup = 3, with supports in parentheses:

(Figure: the prefix search tree.) ∅(3); A(3): AA(3) [AAA(1), AAG(3) → AAGG], AG(3) [AGA(1), AGG(1)], AT(2); C(2); G(3): GA(3) [GAA(3) → GAAA, GAAG(3); GAG(3) → GAGA, GAGG], GG(3) [GGA(0), GGG(0)], GT(2); T(3): TA(1), TG(1), TT(0).

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 45 / 70

slide-57
SLIDE 57

Vertical sequences mining (SPADE)

The SPADE algorithm uses a vertical database representation for sequence mining. The idea is to record two pieces of information for each occurrence of a symbol:

1

The sequence identifier

2

The position where it occurs (the position of last symbol of the subsequence)

Let L(s) be the set of such sequence-position tuples for symbol s, referred to as poslist. Consider the following database

Id   Sequence
s1   CAGAAGT
s2   TGACAG
s3   GAAGT

1

A occurs in s1 at positions 2, 4, and 5.

2

A occurs in s2 at positions 3 and 5.

3

A occurs in s3 at positions 2 and 3.

4

L(A) = {⟨1, {2, 4, 5}⟩, ⟨2, {3, 5}⟩, ⟨3, {2, 3}⟩}
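A small Python sketch that builds the poslists described above (names are mine; positions are 1-based as in the example):

```python
from collections import defaultdict

def build_poslists(database):
    """Vertical representation: symbol -> {sequence id -> list of positions}."""
    L = defaultdict(lambda: defaultdict(list))
    for sid, seq in enumerate(database, start=1):
        for pos, symbol in enumerate(seq, start=1):   # 1-based positions
            L[symbol][sid].append(pos)
    return L

D = ["CAGAAGT", "TGACAG", "GAAGT"]
L = build_poslists(D)
print(dict(L["A"]))   # {1: [2, 4, 5], 2: [3, 5], 3: [2, 3]}
print(len(L["A"]))    # support of A = number of distinct sequences = 3
```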

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 46 / 70

slide-58
SLIDE 58

Vertical sequences mining (SPADE)

Let L(s) be the set of such sequence-position tuples for symbol s, referred to as its poslist. The poslists L(s) for each s ∈ Σ form the vertical representation of the database. Given a k-sequence r, L(r) maintains the list of positions of the occurrences of the last symbol r[k] in each database sequence si, provided r ⊆ si. The support of r is the number of distinct sequences in which r occurs, sup(r) = |L(r)|. SPADE uses a sequential join operation: given L(ra) and L(rb) for two k-sequences ra and rb, the sequences are joinable if they share the same (k − 1)-length prefix, and the poslist of the joined sequence again tracks the positions of its last symbol. The main advantage of the vertical approach is that it enables different search strategies over the sequence search space, including breadth-first and depth-first search.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 47 / 70

slide-59
SLIDE 59

Vertical sequences mining (SPADE)

Consider the following database

Id   Sequence
s1   CAGAAGT
s2   TGACAG
s3   GAAGT

The sequences mined via SPADE (minsup = 3), shown with their poslists:

(Figure: the SPADE search space; each node lists, per sequence id, the positions of the pattern's last symbol, e.g. L(A) = {1: 2,4,5; 2: 3,5; 3: 2,3}, L(AA) = {1: 4,5; 2: 5; 3: 3}, L(AAG) = {1: 6; 2: 6; 3: 4}, and similarly for the other patterns up to the 3-sequences AAG, GAA, and GAG.)

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 48 / 70

slide-60
SLIDE 60

Table of contents

1

Introduction

2

Frequent pattern mining model

3

Frequent itemset mining algorithms Brute force Frequent itemset mining algorithm Apriori algorithm Frequent pattern growth (FP-growth) Mining frequent itemsets using vertical data format

4

Summarizing itemsets Mining maximal itemsets Mining closed itemsets

5

Sequence mining

6

Graph mining

7

Pattern and rule assessment

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 49 / 70

slide-61
SLIDE 61

Graph mining (Introduction)

Graph data is becoming increasingly ubiquitous in today's networked world. Examples include social networks, cell phone networks, blogs, the Internet, the hyperlinked structure of the WWW, bioinformatics, and the Semantic Web. The goal of graph mining is to extract interesting subgraphs from a single large graph (such as a social network), or from a database of many graphs. In different applications we may be interested in different kinds of subgraph patterns, such as subtrees, complete graphs or cliques, bipartite cliques, and dense subgraphs. These subgraphs may represent communities in a social network, hub and authority pages on the WWW, or clusters of proteins involved in similar biochemical functions.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 49 / 70

slide-62
SLIDE 62

Graphs

A graph is a pair G = (V, E), where V is a set of vertices and E ⊆ V × V is a set of edges. We assume that edges are unordered, so that the graph is undirected.

If (u, v) is an edge, we say that u and v are adjacent and that v is a neighbor of u, and vice versa. The set of all neighbors of u in G is given as N(u) = {v ∈ V | (u, v) ∈ E}. A labeled graph has labels associated with its vertices as well as its edges. We use L(u) to denote the label of the vertex u, and L(u, v) to denote the label of the edge (u, v).

(Figure: (a) an unlabeled graph on vertices v1-v8; (b) the same graph with vertex labels a, c, b, a, d, c, b, c.)

Given an edge (u, v) ∈ G, the tuple ⟨u, v, L(u), L(v), L(u, v)⟩ that augments the edge with the node and edge labels is called an extended edge.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 50 / 70

slide-63
SLIDE 63

Subgraphs

A graph G′ = (V′, E′) is said to be a subgraph of G if V′ ⊆ V and E′ ⊆ E. Note that this definition allows for disconnected subgraphs.

(Figure: (a) a disconnected subgraph and (b) a connected subgraph of the labeled graph from the previous slide.)

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 51 / 70

slide-64
SLIDE 64

Graph and subgraph isomorphism

A graph G ′ = (V ′, E ′) is said to be isomorphic to graph G = (V , E) if there exists a function ϕ : V ′ → V such that

1

(u, v) ∈ E′ ⟺ (φ(u), φ(v)) ∈ E

2

∀u ∈ V ′, L(u) = L(ϕ(u))

3

∀(u, v) ∈ E ′, L(u, v) = L(ϕ(u), ϕ(v))

(Figure 11.3: example graphs G1 to G4 over vertex labels a and b, used to illustrate graph and subgraph isomorphism.)

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 52 / 70

slide-65
SLIDE 65

Subgraph support

Given a database of graphs D = {G1, G2, . . . , Gn} and given some graph G, the support of G in D is defined as sup(G) = |{Gi ∈ D | G ⊆ Gi}|. The support is simply the number of graphs in the database that contain G. Given a minsup threshold, the goal of graph mining is to mine all frequent connected subgraphs with sup(G) ≥ minsup. To mine all the frequent subgraphs, one has to search over the space of all possible graph patterns, which is exponential in size (count the number of different subgraphs). There are two main challenges in frequent subgraph mining.

1

The first challenge is to systematically generate candidate subgraphs. We use edge-growth as the basic mechanism for extending the candidates.

2

The second challenge is to count the support of a graph in the database. This involves subgraph isomorphism checking, as we have to find the set of graphs that contain a given candidate.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 53 / 70

slide-66
SLIDE 66

Candidate generation

An effective strategy to enumerate subgraph patterns is called rightmost path extension. The algorithm performs a DFS over the vertices and creates a DFS tree. Edges of the DFS tree are called forward edges, and all other edges are called backward edges. The rightmost path is the path from the root to the rightmost leaf (the leaf with the highest index in the DFS order).

(Figure: a labeled graph and a DFS spanning tree over it.) The DFS spanning tree shown is obtained by starting at v1 and then choosing the vertex with the smallest index at each step.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 54 / 70

slide-67
SLIDE 67

Candidate generation (cont.)

For systematic candidate generation, the following steps are done. First, we try all backward extensions from the rightmost vertex. Backward extensions closer to the root are considered before those farther away from the root along the rightmost path. Then, we try forward extensions from vertices on the rightmost path. The vertices farther from the root are extended before those closer to the root.

(Figure: rightmost path extensions of the example DFS tree; backward extensions #1 and #2 from the rightmost vertex are tried first, followed by forward extensions #3 through #6 from the vertices on the rightmost path, deepest first.)

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 55 / 70

slide-68
SLIDE 68

Canonical Code

When generating candidates using rightmost path extensions, it is possible that duplicate, that is, isomorphic, graphs are generated via different extensions. Among the isomorphic candidates, we need to keep only one for further extension, whereas the others can be pruned to avoid redundant computation. The idea is to rank the isomorphic graphs and then pick the canonical representative. The DFS code of G, DFScode(G), used for the canonical representation is the sequence of extended edge tuples of the form ⟨vi, vj, L(vi), L(vj), L(vi, vj)⟩ listed in DFS edge order.

(Figure: three isomorphic graphs G1, G2, G3 on four vertices with vertex labels a, a, a, b and edge labels q, r.)

DFScode(G1): t11 = ⟨v1, v2, a, a, q⟩, t12 = ⟨v2, v3, a, a, r⟩, t13 = ⟨v3, v1, a, a, r⟩, t14 = ⟨v2, v4, a, b, r⟩
DFScode(G2): t21 = ⟨v1, v2, a, a, q⟩, t22 = ⟨v2, v3, a, b, r⟩, t23 = ⟨v2, v4, a, a, r⟩, t24 = ⟨v4, v1, a, a, r⟩
DFScode(G3): t31 = ⟨v1, v2, a, a, q⟩, t32 = ⟨v2, v3, a, a, r⟩, t33 = ⟨v3, v1, a, a, r⟩, t34 = ⟨v1, v4, a, b, r⟩

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 56 / 70

slide-69
SLIDE 69

Canonical Code (cont.)

A subgraph is canonical if it has the smallest DFS code among all possible isomorphic graphs. Let t1 and t2 be any two DFS code tuples: t1 = ⟨vi, vj, L(vi), L(vj), L(vi, vj)⟩ t2 = ⟨vx, vy, L(vx), L(vy), L(vx, vy)⟩ We say that t1 is smaller than t2, written t1 < t2, iff

1

when (vi, vj) = (vx, vy) and ⟨L(vi), L(vj), L(vi, vj)⟩ <l ⟨L(vx), L(vy), L(vx, vy)⟩ or

2

when (vi, vj) <e (vx, vy). Let eij = (vi, vj) and exy = (vx, vy) be any two edges. We say that eij <e exy iff

1

If eij and exy are both forward edges, then (a) j < y, or (b) j = y and i > x.

2

If eij and exy are both backward edges, then (a) i < x, or (b) i = x and j < y.

3

If eij is a forward and exy is a backward edge, then j ≤ x.

4

If eij is a backward and exy is a forward edge, then i < y.

3

The edge order is derived from the rules for rightmost path extension

1

All backward extensions of a node must be considered before any forward edge from that node

2

Deep DFS trees are preferred over bushy DFS trees.

4

For example consider rule 1(b): If both the forward edges point to a node with the same DFS node order, then the forward extension from a node deeper in the tree is smaller.
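A Python sketch of the tuple ordering just defined (a direct transcription of the rules; vertices are assumed to be given as integer DFS indices, with forward edges encoded as i < j and backward edges as i > j, as in standard DFS codes; all names are mine):

```python
def edge_less(e1, e2):
    """e1 <e e2 for edges e1 = (i, j) and e2 = (x, y) under the rules above."""
    (i, j), (x, y) = e1, e2
    fwd1, fwd2 = i < j, x < y              # forward edge: points to a higher DFS index
    if fwd1 and fwd2:                      # rule 1: both forward
        return j < y or (j == y and i > x)
    if not fwd1 and not fwd2:              # rule 2: both backward
        return i < x or (i == x and j < y)
    if fwd1:                               # rule 3: forward vs backward
        return j <= x
    return i < y                           # rule 4: backward vs forward

def tuple_less(t1, t2):
    """t1 < t2 for extended edge tuples (i, j, L_i, L_j, L_ij)."""
    e1, e2 = (t1[0], t1[1]), (t2[0], t2[1])
    if e1 == e2:
        return t1[2:] < t2[2:]             # lexicographic comparison on the labels
    return edge_less(e1, e2)

# Rule 1(b): a forward extension from deeper in the tree is smaller.
print(tuple_less((2, 4, "a", "b", "r"), (1, 4, "a", "b", "r")))   # True
```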

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 57 / 70

slide-70
SLIDE 70

Canonical Code (cont.)

(Figure: the graphs G1, G2, G3 from the previous slide, each shown with its DFS code.)

DFScode(G1): ⟨v1, v2, a, a, q⟩, ⟨v2, v3, a, a, r⟩, ⟨v3, v1, a, a, r⟩, ⟨v2, v4, a, b, r⟩
DFScode(G2): ⟨v1, v2, a, a, q⟩, ⟨v2, v3, a, b, r⟩, ⟨v2, v4, a, a, r⟩, ⟨v4, v1, a, a, r⟩
DFScode(G3): ⟨v1, v2, a, a, q⟩, ⟨v2, v3, a, a, r⟩, ⟨v3, v1, a, a, r⟩, ⟨v1, v4, a, b, r⟩

By the ordering rules above, DFScode(G1) is the smallest of the three, so G1 is the canonical representative.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 58 / 70

slide-71
SLIDE 71

gSpan algorithm

Given a database D = {G1, G2, . . . , Gn} comprising n graphs, and given a threshold minsup, the goal is to enumerate all (connected) subgraphs G that are frequent. In gSpan, each graph is represented by its canonical DFS code, so that the task of enumerating frequent subgraphs is equivalent to the task of generating all canonical DFS codes of frequent subgraphs. gSpan enumerates patterns in a depth-first manner, starting with the empty code. Given a canonical and frequent code C, gSpan computes its rightmost path extensions together with their supports and recursively processes each extension that is frequent and canonical (see the pseudocode on the next slide).

(Figure: two example graphs G1 and G2 with vertex labels a and b.)

Let minsup = 2; that is, assume that we are interested in mining subgraphs that appear in at least two of the graphs.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 59 / 70

slide-72
SLIDE 72

gSpan algorithm (cont.)

// Initial Call: C ← ∅
GSPAN (C, D, minsup):
1   E ← RIGHTMOSTPATH-EXTENSIONS(C, D)   // extensions and supports
2   foreach (t, sup(t)) ∈ E do
3       C′ ← C ∪ t                        // extend the code with extended edge tuple t
4       sup(C′) ← sup(t)                  // record the support of the new extension
5       // recursively call gSpan if the code is frequent and canonical
        if sup(C′) ≥ minsup and ISCANONICAL(C′) then
6           GSPAN (C′, D, minsup)

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 60 / 70

slide-73
SLIDE 73

gSpan algorithm (cont.)

(Figure: part of the gSpan candidate search tree on G1 and G2, starting from the empty code C0 = ∅, its one-edge extensions C1 = ⟨0,1,a,a⟩, C2 = ⟨0,1,a,b⟩, C3 = ⟨0,1,b,a⟩, C4 = ⟨0,1,b,b⟩, and their rightmost path extensions such as C5 = ⟨0,1,a,a⟩⟨1,2,a,b⟩, up to three-edge codes like C7 = ⟨0,1,a,a⟩⟨1,2,a,b⟩⟨2,0,b,a⟩ and C25 = ⟨0,1,a,b⟩⟨0,2,a,b⟩⟨0,3,a,a⟩.)

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 61 / 70

slide-74
SLIDE 74

Table of contents

1

Introduction

2

Frequent pattern mining model

3

Frequent itemset mining algorithms Brute force Frequent itemset mining algorithm Apriori algorithm Frequent pattern growth (FP-growth) Mining frequent itemsets using vertical data format

4

Summarizing itemsets Mining maximal itemsets Mining closed itemsets

5

Sequence mining

6

Graph mining

7

Pattern and rule assessment

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 62 / 70

slide-75
SLIDE 75

Which patterns are interesting?

Most association rule mining algorithms employ a support-confidence framework. Although minsup and minconf thresholds help to exclude the exploration of a good number of uninteresting rules, many of the rules generated are still not interesting to the users. Example (A misleading strong association rule)

1

Out of the 10,000 transactions analyzed, the data show that 6,000 of the customer transactions included computer games, while 7,500 included videos, and 4,000 included both computer games and videos.

2

Suppose that a data mining program for discovering association rules is run on the data, using minsup = 0.30 and minconf = 60%. The following association rule is discovered: buys(X, computer games) ⇒ buys(X, videos) [sup = 4000/10000 = 0.4, conf = 4000/6000 = 0.66].

3

However, this rule is misleading because the probability of purchasing videos is 0.75, which is even larger than 0.66.

In fact, computer games and videos are negatively associated because the purchase of one of these items actually decreases the likelihood of purchasing the other.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 62 / 70

slide-76
SLIDE 76

Which patterns are interesting?

Support and confidence measures are insufficient at filtering out uninteresting association rules. To tackle this weakness, a correlation measure can be used to augment the support-confidence framework for association rules. This leads to correlation rules of the form A ⇒ B [support, confidence, correlation]. That is, a correlation rule is measured not only by its support and confidence but also by the correlation between the itemsets A and B. There are many different correlation measures from which to choose.

1

Lift measure

2

χ2 measure

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 63 / 70

slide-77
SLIDE 77

Pattern evaluation measures (Lift)

Lift is a simple correlation measure, given by lift(A, B) = P(A ∪ B) / (P(A) P(B)).

1

If lift(A, B) < 1, then the A and B are negatively correlated.

2

If lift(A, B) > 1, then A and B are positively correlated.

3

If lift(A, B) = 1, then A and B are independent.

Example (Correlation analysis using lift)

1

Consider the following contingency table summarizing transactions with respect to game and video.

          game    ¬game   Σrow
video     4000    3500    7500
¬video    2000    500     2500
Σcol      6000    4000    10,000

2

Then lift(game, video) = P(game ∪ video) / (P(game) P(video)) = 0.4 / (0.6 × 0.75) = 0.89.

3

Because this value is less than 1, there is a negative correlation between the occurrence of {game} and {video}.
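A one-line computation of lift from the supports in the table above (Python; the function name is mine):

```python
def lift(sup_a, sup_b, sup_ab):
    """lift(A, B) = P(A ∪ B) / (P(A) P(B)), with supports given as fractions."""
    return sup_ab / (sup_a * sup_b)

# Values from the contingency table above (10,000 transactions).
print(round(lift(6000 / 10000, 7500 / 10000, 4000 / 10000), 2))   # 0.89
```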

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 64 / 70

slide-78
SLIDE 78

Pattern evaluation measures ( χ2)

The χ² statistic is computed as

χ² = Σ (oij − eij)² / eij

where the sum runs over all r × c cells of the contingency table, oij is the observed frequency of event (Ai, Bj), and eij is the expected frequency of (Ai, Bj), computed as eij = count(A = ai) × count(B = bj) / N. Example (Correlation analysis using χ²)

1

Consider the following contingency table summarizing transactions with respect to game and video, with expected values shown in parentheses.

          game          ¬game         Σrow
video     4000 (4500)   3500 (3000)   7500
¬video    2000 (1500)   500 (1000)    2500
Σcol      6000          4000          10,000

2

Then χ2 = 555.6.

3

Because the χ² value is greater than 1, and the observed value of the slot (game, video) = 4000 is less than the expected value of 4500, buying game and buying video are negatively correlated.
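A small Python sketch computing χ² from an observed contingency table, reproducing the value above (names are mine):

```python
def chi_square(observed):
    """χ² for an r×c contingency table given as a list of rows of observed counts."""
    n = sum(sum(row) for row in observed)
    row_tot = [sum(row) for row in observed]
    col_tot = [sum(col) for col in zip(*observed)]
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_tot[i] * col_tot[j] / n      # expected frequency
            chi2 += (o - e) ** 2 / e
    return chi2

# Rows: video / not-video, columns: game / not-game (table above).
print(round(chi_square([[4000, 3500], [2000, 500]]), 1))   # 555.6
```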

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 65 / 70

slide-79
SLIDE 79

Pattern evaluation measures(cont.)

Considering also the following measures

1

Given two itemsets A and B, the all-confidence measure of A and B is defined as all_confidence(A, B) = min[P(A|B), P(B|A)].

2

Given two itemsets, A and B, max-confidence measure of A and B is defined as max − confidence(A, B) = max[P(A|B), P(B|A)]

3

Given two itemsets A and B, the Kulczynski measure of A and B is defined as Kulczynski(A, B) = (1/2)[P(A|B) + P(B|A)].

4

Given two itemsets A and B, the cosine measure of A and B is defined as cosine(A, B) = √(P(A|B) × P(B|A)).

Values of these measures are only influenced by the supports of A, B, and A ∪ B, but not by the total number of transactions. These measures range from 0 to 1, and the higher the value, the closer the relationship between A and B.
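A Python sketch of the four measures, checked against data set D5 of the comparison table on a later slide (names are mine; only the three supports are needed, since null-transactions cancel out):

```python
def measures(sup_a, sup_b, sup_ab):
    """all-confidence, max-confidence, Kulczynski, and cosine from itemset supports."""
    p_a_given_b = sup_ab / sup_b      # P(A|B)
    p_b_given_a = sup_ab / sup_a      # P(B|A)
    return {
        "all_confidence": min(p_a_given_b, p_b_given_a),
        "max_confidence": max(p_a_given_b, p_b_given_a),
        "kulczynski": 0.5 * (p_a_given_b + p_b_given_a),
        "cosine": (p_a_given_b * p_b_given_a) ** 0.5,
    }

# Data set D5: the two item supports are 1,100 and 11,000, and sup of their union is 1,000.
print(measures(sup_a=1100, sup_b=11000, sup_ab=1000))
# -> all_conf ≈ 0.09, max_conf ≈ 0.91, Kulczynski = 0.5, cosine ≈ 0.29
```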

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 66 / 70

slide-80
SLIDE 80

Pattern evaluation measures (cont.)

Which is the best in assessing the discovered pattern relationships? Consider the following contingency table for two items coffee and milk.

          milk   ¬milk   Σrow
coffee    mc     m̄c      c
¬coffee   mc̄     m̄c̄      c̄
Σcol      m      m̄       Σ

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 67 / 70

slide-81
SLIDE 81

Pattern evaluation measures (cont.)

Comparison of six pattern evaluation measures on contingency tables for a variety of data sets:

Data Set   mc       m̄c    mc̄       m̄c̄        χ²      lift    all_conf.   max_conf.   Kulc.   cosine
D1         10,000   1000   1000     100,000   90557   9.26    0.91        0.91        0.91    0.91
D2         10,000   1000   1000     100       0       1       0.91        0.91        0.91    0.91
D3         100      1000   1000     100,000   670     8.44    0.09        0.09        0.09    0.09
D4         1000     1000   1000     100,000   24740   25.75   0.5         0.5         0.5     0.5
D5         1000     100    10,000   100,000   8173    9.18    0.09        0.91        0.5     0.29
D6         1000     10     100,000  100,000   965     1.97    0.01        0.99        0.5     0.10

1

The results of new measures show that m and c are strongly positively associated in D1 and D2.

2

However, lift and χ2 generate dramatically different measure values for D1 and D2 due to their sensitivity to mc.

3

In D3, the four new measures correctly show that m and c are strongly negatively associated.

4

In D4, lift and χ² indicate a highly positive association between m and c, whereas the other measures indicate a neutral association, because the ratio of mc to m̄c equals the ratio of mc to mc̄, which is 1. This means that if a customer buys coffee (or milk), the probability that he or she will also purchase milk (or coffee) is exactly 50%.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 68 / 70

slide-82
SLIDE 82

Pattern evaluation measures (cont.)

A null-transaction is a transaction that does not contain any of the itemsets being examined. A measure is null-invariant if its value is free from the influence of null-transactions. Null-invariance is an important property for measuring association patterns in large transaction databases. Among the six measures discussed in this subsection, only lift and χ² are not null-invariant. Among the all-confidence, max-confidence, Kulczynski, and cosine measures, which is best at indicating interesting pattern relationships? The imbalance ratio (IR) assesses the imbalance of two itemsets A and B in rule implications: IR(A, B) = |sup(A) − sup(B)| / (sup(A) + sup(B) − sup(A ∪ B)). If the two directional implications between A and B are the same, then IR(A, B) will be zero; otherwise, the larger the difference between the two, the larger the imbalance ratio.

The imbalance ratio is independent of the number of null-transactions and independent of the total number of transactions.
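A small Python sketch of IR, checked against data sets D4 and D6 from the earlier comparison table (names are mine):

```python
def imbalance_ratio(sup_a, sup_b, sup_ab):
    """IR(A, B) = |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(A ∪ B))."""
    return abs(sup_a - sup_b) / (sup_a + sup_b - sup_ab)

# D4 (balanced) vs D6 (very imbalanced), item supports derived from the table counts.
print(round(imbalance_ratio(2000, 2000, 1000), 2))       # D4: 0.0
print(round(imbalance_ratio(101000, 1010, 1000), 2))     # D6: 0.99
```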

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 69 / 70

slide-83
SLIDE 83

Pattern evaluation measures (cont.)

Consider again the contingency tables for the two items coffee and milk. Kulczynski and the imbalance ratio (IR) together present a clear picture for the three data sets D4 through D6:

1

D4 is balanced and neutral

2

D5 is imbalanced and neutral

3

D6 is very imbalanced and neutral

Among the four null–invariant measures studied here, namely all confidence, max confidence, Kulc, and cosine, we recommend using Kulc in conjunction with the imbalance ratio.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 70 / 70