SLIDE 1

Data Mining and Machine Learning: Fundamental Concepts and Algorithms

dataminingbook.info

Mohammed J. Zaki¹  Wagner Meira Jr.²

¹Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA

²Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

Chapter 9: Summarizing Itemsets

SLIDE 2

Maximal Frequent Itemsets

Given a binary database D ⊆ T × I, over the tids T and items I, let F denote the set of all frequent itemsets, that is,

F = {X | X ⊆ I and sup(X) ≥ minsup}

A frequent itemset X ∈ F is called maximal if it has no frequent supersets. Let M be the set of all maximal frequent itemsets, given as

M = {X | X ∈ F and ∄Y ⊃ X such that Y ∈ F}

The set M is a condensed representation of the set of all frequent itemsets F, because we can determine whether any itemset X is frequent using M alone. If there exists a maximal itemset Z such that X ⊆ Z, then X must be frequent; otherwise X cannot be frequent. A small sketch of this check follows below.
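A minimal sketch (not from the slides) of the frequency check via M, assuming itemsets are represented as Python frozensets; M here holds the maximal itemsets of the example database on the next slide (minsup = 3).

def is_frequent(X, M):
    # X is frequent iff it is contained in some maximal itemset Z in M
    X = frozenset(X)
    return any(X <= Z for Z in M)

M = [frozenset("ABDE"), frozenset("BCE")]
print(is_frequent("ADE", M))  # True:  ADE is a subset of ABDE
print(is_frequent("ACD", M))  # False: ACD is in no maximal itemset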

SLIDE 3

An Example Database

Transaction database:

Tid   Itemset
1     ABDE
2     BCE
3     ABDE
4     ABCE
5     ABCDE
6     BCD

Frequent itemsets (minsup = 3):

sup   Itemsets
6     B
5     E, BE
4     A, C, D, AB, AE, BC, BD, ABE
3     AD, CE, DE, ABD, ADE, BCE, BDE, ABDE
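A brute-force sketch (not from the slides) that reproduces the table above by enumerating every subset of the items and counting its support; fine for this toy database, infeasible in general.

from itertools import combinations

db = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}
items = sorted(set().union(*db.values()))
minsup = 3

frequent = {}
for k in range(1, len(items) + 1):
    for X in combinations(items, k):
        sup = sum(1 for t in db.values() if set(X) <= set(t))
        if sup >= minsup:
            frequent[frozenset(X)] = sup

# e.g. sup(B) = 6, sup(BE) = 5, sup(ABDE) = 3
for X, s in sorted(frequent.items(), key=lambda kv: (-kv[1], len(kv[0]))):
    print("".join(sorted(X)), s)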

SLIDE 4

Closed Frequent Itemsets

Given a set of tids T and an itemset X ⊆ I, define

t(X) = {t ∈ T | t contains X}
i(T) = {x ∈ I | ∀t ∈ T, t contains x}
c(X) = i ◦ t(X) = i(t(X))

The function c is a closure operator, and an itemset X is called closed if c(X) = X. It follows that t(c(X)) = t(X). The set of all closed frequent itemsets is thus defined as

C = {X | X ∈ F and ∄Y ⊃ X such that sup(X) = sup(Y)}

That is, X is closed if all supersets of X have strictly less support: sup(X) > sup(Y) for all Y ⊃ X. The set of all closed frequent itemsets C is a condensed representation, as we can determine both whether an itemset X is frequent and the exact support of X using C alone.
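A minimal sketch (not from the slides) of t, i, and the closure c = i ◦ t on the example database; the helper names t, i, c simply mirror the definitions above.

db = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}

def t(X):
    # tidset: all tids whose transaction contains every item of X
    return frozenset(tid for tid, its in db.items() if set(X) <= set(its))

def i(T):
    # itemset: all items common to every transaction in T
    return frozenset.intersection(*(frozenset(db[tid]) for tid in T))

def c(X):
    # closure operator c(X) = i(t(X))
    return i(t(X))

print("".join(sorted(c("A"))))   # ABE -> A is not closed, c(A) = ABE
print("".join(sorted(c("BE"))))  # BE  -> BE is closed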

SLIDE 5

Minimal Generators

A frequent itemset X is a minimal generator if it has no subsets with the same support:

G = {X | X ∈ F and ∄Y ⊂ X such that sup(X) = sup(Y)}

In other words, all proper subsets of X have strictly higher support, that is, sup(X) < sup(Y) for all Y ⊂ X. Given an equivalence class of itemsets that have the same tidset, the closed itemset is the unique maximum element of the class, whereas the minimal generators are the minimal elements of the class.
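A small sketch (not from the slides) of the minimal-generator test. Since support is anti-monotone, it suffices to compare X against the subsets obtained by dropping a single item.

db = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}

def sup(X):
    return sum(1 for its in db.values() if set(X) <= set(its))

def is_minimal_generator(X):
    X = set(X)
    # every one-item-smaller subset must have strictly higher support
    return all(sup(X - {x}) > sup(X) for x in X)

print(is_minimal_generator("AD"))  # True:  sup(AD)=3 < sup(A)=sup(D)=4
print(is_minimal_generator("AB"))  # False: sup(AB)=4 = sup(A)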

SLIDE 6

Frequent Itemsets: Closed, Minimal Generators and Maximal

Itemset lattice with tidsets (rendered as a figure in the original slides):

A 1345, B 123456, C 2456, D 1356, E 12345
AB 1345, AD 135, AE 1345, BC 2456, BD 1356, BE 12345, CE 245, DE 135
ABD 135, ABE 1345, ADE 135, BCE 245, BDE 135
ABDE 135

In the original figure, shaded itemsets are closed, double-boxed itemsets are maximal, and single-boxed itemsets are minimal generators.

SLIDE 7

Mining Maximal Frequent Itemsets: GenMax Algorithm

Mining maximal itemsets requires additional steps beyond simply determining the frequent itemsets. Assuming that the set of maximal frequent itemsets is initially empty, that is, M = ∅, each time we generate a new frequent itemset X, we have to perform the following maximality checks (a sketch follows below):

Subset Check: ∄Y ∈ M such that X ⊂ Y. If such a Y exists, then clearly X is not maximal. Otherwise, we add X to M as a potentially maximal itemset.

Superset Check: ∄Y ∈ M such that Y ⊂ X. If such a Y exists, then Y cannot be maximal, and we have to remove it from M.
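A minimal sketch (not the authors' code) of the two checks, maintaining M as a Python list of frozensets.

def update_maximal(X, M):
    X = frozenset(X)
    # subset check: if X is contained in some Y in M, X is not maximal
    if any(X <= Y for Y in M):
        return
    # superset check: drop any Y in M that X strictly contains
    M[:] = [Y for Y in M if not Y < X]
    M.append(X)

M = []
for X in ["AB", "ABE", "BCE", "ABDE"]:
    update_maximal(X, M)
print(["".join(sorted(Y)) for Y in M])  # ['BCE', 'ABDE']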

SLIDE 8

GenMax Algorithm: Maximal Itemsets

GenMax is based on dEclat, i.e., it uses diffset intersections for support computation. The initial call takes as input the set of frequent items along with their tidsets, ⟨i, t(i)⟩, and the initially empty set of maximal itemsets, M. Given a set of itemset–tidset pairs (IT-pairs) of the form ⟨Xi, t(Xi)⟩, the recursive GenMax method works as follows. If the union of all the itemsets, Y = ⋃ Xi, is already subsumed by (contained in) some maximal pattern Z ∈ M, then no maximal itemset can be generated from the current branch, and it is pruned. Otherwise, we intersect each IT-pair ⟨Xi, t(Xi)⟩ with all the other IT-pairs ⟨Xj, t(Xj)⟩, with j > i, to generate new candidates Xij, which are added to the IT-pair set Pi. If Pi is not empty, a recursive call to GenMax is made to find other potentially frequent extensions of Xi. On the other hand, if Pi is empty, it means that Xi cannot be extended, and it is potentially maximal. In this case, we add Xi to the set M, provided that Xi is not contained in any previously added maximal set Z ∈ M.

SLIDE 9

GenMax Algorithm

// Initial Call: M ← ∅, P ← {⟨i, t(i)⟩ | i ∈ I, sup(i) ≥ minsup}

GenMax (P, minsup, M):
    Y ← ⋃ Xi  // union of all itemsets in P
    if ∃Z ∈ M such that Y ⊆ Z then
        return  // prune entire branch
    foreach ⟨Xi, t(Xi)⟩ ∈ P do
        Pi ← ∅
        foreach ⟨Xj, t(Xj)⟩ ∈ P, with j > i do
            Xij ← Xi ∪ Xj
            t(Xij) ← t(Xi) ∩ t(Xj)
            if sup(Xij) ≥ minsup then Pi ← Pi ∪ {⟨Xij, t(Xij)⟩}
        if Pi ≠ ∅ then
            GenMax (Pi, minsup, M)
        else if ∄Z ∈ M such that Xi ⊆ Z then
            M ← M ∪ {Xi}  // add Xi to maximal set
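A compact Python sketch (not the authors' implementation) of GenMax, using plain tidset intersections rather than dEclat's diffsets.

def genmax(P, minsup, M):
    # P: list of (itemset, tidset) pairs (frozensets); M: list of maximal itemsets
    Y = frozenset().union(*(X for X, _ in P))
    if any(Y <= Z for Z in M):
        return  # prune entire branch
    for i, (Xi, Ti) in enumerate(P):
        Pi = []
        for Xj, Tj in P[i + 1:]:
            Tij = Ti & Tj
            if len(Tij) >= minsup:
                Pi.append((Xi | Xj, Tij))
        if Pi:
            genmax(Pi, minsup, M)
        elif not any(Xi <= Z for Z in M):
            M.append(Xi)  # Xi cannot be extended and is not subsumed

db = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}
P = [(frozenset(x), frozenset(t for t, its in db.items() if x in its))
     for x in sorted(set().union(*db.values()))]
M = []
genmax([(X, T) for X, T in P if len(T) >= 3], 3, M)
print(sorted("".join(sorted(X)) for X in M))  # ['ABDE', 'BCE']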

SLIDE 10

Mining Maximal Frequent Itemsets

[Figure: GenMax search tree on the example database. Root IT-pairs: A (1345), B (123456), C (2456), D (1356), E (12345). PA: AB (1345), AD (135), AE (1345); PAB: ABD (135), ABE (1345); PABD: ABDE (135); PAD: ADE (135). PB: BC (2456), BD (1356), BE (12345); PBC: BCE (245); PBD: BDE (135). PC: CE (245). PD: DE (135).]

SLIDE 11

Mining Closed Frequent Itemsets: Charm Algorithm

Mining closed frequent itemsets requires that we perform closure checks, that is, whether X = c(X). Direct closure checking can be very expensive. Given a collection of IT-pairs {Xi,t(Xi)}, Charm uses the following three properties:

Property (1)

If t(Xi) = t(Xj), then c(Xi) = c(Xj) = c(Xi ∪ Xj), which implies that we can replace every occurrence of Xi with Xi ∪ Xj and prune the branch under Xj because its closure is identical to the closure of Xi ∪ Xj.

Property (2)

If t(Xi) ⊂ t(Xj), then c(Xi) ≠ c(Xj) but c(Xi) = c(Xi ∪ Xj), which means that we can replace every occurrence of Xi with Xi ∪ Xj, but we cannot prune Xj because it generates a different closure. Note that if t(Xi) ⊃ t(Xj) then we simply interchange the roles of Xi and Xj.

Property (3)

If t(Xi) ≠ t(Xj) (with neither tidset containing the other), then c(Xi) ≠ c(Xj) ≠ c(Xi ∪ Xj). In this case we cannot remove either Xi or Xj, as each of them generates a different closure.
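A tiny sketch (not the authors' code) showing how the tidset relationship between two IT-pairs selects one of the three properties; charm_case is a hypothetical helper name.

def charm_case(Ti, Tj):
    if Ti == Tj:
        return "Property 1: replace Xi with Xi∪Xj and prune Xj"
    if Ti < Tj:
        return "Property 2: replace Xi with Xi∪Xj, keep Xj"
    if Ti > Tj:
        return "Property 2 with the roles of Xi and Xj swapped"
    return "Property 3: keep both, add Xi∪Xj to Pi"

# On the example database: t(A)=1345 is a subset of t(E)=12345, so A becomes AE
print(charm_case(frozenset({1, 3, 4, 5}), frozenset({1, 2, 3, 4, 5})))
# t(A)=1345 and t(C)=2456 are incomparable: Property 3
print(charm_case(frozenset({1, 3, 4, 5}), frozenset({2, 4, 5, 6})))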

SLIDE 12

Charm Algorithm: Closed Itemsets

// Initial Call: C ← ∅, P ← {⟨i, t(i)⟩ | i ∈ I, sup(i) ≥ minsup}

Charm (P, minsup, C):
    Sort P in increasing order of support (i.e., by increasing |t(Xi)|)
    foreach ⟨Xi, t(Xi)⟩ ∈ P do
        Pi ← ∅
        foreach ⟨Xj, t(Xj)⟩ ∈ P, with j > i do
            Xij ← Xi ∪ Xj
            t(Xij) ← t(Xi) ∩ t(Xj)
            if sup(Xij) ≥ minsup then
                if t(Xi) = t(Xj) then  // Property 1
                    Replace Xi with Xij in P and Pi
                    Remove ⟨Xj, t(Xj)⟩ from P
                else if t(Xi) ⊂ t(Xj) then  // Property 2
                    Replace Xi with Xij in P and Pi
                else  // Property 3
                    Pi ← Pi ∪ {⟨Xij, t(Xij)⟩}
        if Pi ≠ ∅ then Charm (Pi, minsup, C)
        if ∄Z ∈ C such that Xi ⊆ Z and t(Xi) = t(Z) then
            C ← C ∪ {Xi}  // add Xi to closed set
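A simplified Python sketch (not the authors' implementation) of Charm with plain tidsets; C maps each closed itemset to its tidset. Because P is sorted by increasing support, the case t(Xi) ⊃ t(Xj) cannot occur for j > i.

def charm(P, minsup, C):
    P = sorted(P, key=lambda p: len(p[1]))  # increasing support
    i = 0
    while i < len(P):
        Xi, Ti = P[i]
        Pi = []
        j = i + 1
        while j < len(P):
            Xj, Tj = P[j]
            Tij = Ti & Tj
            if len(Tij) >= minsup:
                if Ti == Tj:                    # Property 1: absorb Xj
                    Xi = Xi | Xj
                    P[i] = (Xi, Ti)
                    Pi = [(X | Xj, T) for X, T in Pi]
                    del P[j]
                    continue                    # do not advance j
                if Ti < Tj:                     # Property 2: extend Xi
                    Xi = Xi | Xj
                    P[i] = (Xi, Ti)
                    Pi = [(X | Xj, T) for X, T in Pi]
                else:                           # Property 3: new candidate
                    Pi.append((Xi | Xj, Tij))
            j += 1
        if Pi:
            charm(Pi, minsup, C)
        if not any(Xi <= Z and Ti == T for Z, T in C.items()):
            C[Xi] = Ti                          # Xi is closed
        i += 1

db = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}
P = [(frozenset(x), frozenset(t for t, its in db.items() if x in its))
     for x in sorted(set().union(*db.values()))]
C = {}
charm([(X, T) for X, T in P if len(T) >= 3], 3, C)
for X, T in sorted(C.items(), key=lambda kv: -len(kv[1])):
    print("".join(sorted(X)), len(T))  # B 6; BE 5; ABE, BC, BD 4; ABDE, BCE 3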

SLIDE 13

Mining Frequent Closed Itemsets: Charm

Processing the branch of item A:

[Figure: Charm relabels A (tidset 1345) to AE and then AEB (tidset unchanged, by Property 2); in PA, the pair AD is correspondingly relabeled ADE and then ADEB (tidset 135). The remaining root pairs C (2456), D (1356), E (12345), B (123456) are still unprocessed.]

SLIDE 14

Mining Frequent Closed Itemsets: Charm

[Figure: the complete Charm search tree. Root pairs after replacements: A→AE→AEB (1345), C→CB (2456), D→DB (1356), E→EB (12345), B (123456); PA holds AD→ADE→ADEB (135), PC holds CE→CEB (245), PD holds DE→DEB (135).]

SLIDE 15

Nonderivable Itemsets

An itemset is called nonderivable if its support cannot be deduced from the supports of its subsets. The set of all frequent nonderivable itemsets is a summary or condensed representation of the set of all frequent itemsets. Further, it is lossless with respect to support, that is, the exact support of all other frequent itemsets can be deduced from it.

Generalized Itemsets: Let X be a k-itemset, that is, X = {x1, x2, ..., xk}. The k tidsets t(xi) for each item xi ∈ X induce a partitioning of the set of all tids into 2^k regions, where each partition contains the tids for some subset of items Y ⊆ X, but for none of the remaining items Z = X \ Y. Each partition is therefore the tidset of a generalized itemset Y Z̄, where Y consists of regular items and Z̄ consists of negated items. Define the support of a generalized itemset Y Z̄ as the number of transactions that contain all items in Y but no item in Z:

sup(Y Z̄) = |{t ∈ T | Y ⊆ i(t) and Z ∩ i(t) = ∅}|
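A small sketch (not from the slides) of generalized support on the example database: count the transactions containing every item of Y and no item of Z.

db = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}

def gen_sup(Y, Z):
    return sum(1 for its in db.values()
               if set(Y) <= set(its) and not set(Z) & set(its))

print(gen_sup("C", "AD"))  # sup(C ĀD̄) = 1: only tid 2 (BCE)
print(gen_sup("AC", "D"))  # sup(AC D̄) = 1: only tid 4 (ABCE)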
SLIDE 16

Inclusion–Exclusion Principle: Support Bounds

The inclusion–exclusion principle allows one to directly compute the support of Y Z̄:

sup(Y Z̄) = Σ_{Y ⊆ W ⊆ X} (−1)^|W \ Y| · sup(W)

From the 2^k possible subsets Y ⊆ X, we derive 2^(k−1) lower bounds and 2^(k−1) upper bounds for sup(X), obtained after setting sup(Y Z̄) ≥ 0:

Upper bounds (|X \ Y| is odd):  sup(X) ≤ Σ_{Y ⊆ W ⊂ X} (−1)^(|X \ W| + 1) sup(W)

Lower bounds (|X \ Y| is even): sup(X) ≥ Σ_{Y ⊆ W ⊂ X} (−1)^(|X \ W| + 1) sup(W)
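A sketch (not from the slides) that evaluates IE(Y) for every Y ⊆ X and collects the resulting bounds; ie_bounds is a hypothetical helper name, with supports taken from the example database.

from itertools import combinations

db = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}

def sup(W):
    return sum(1 for its in db.values() if set(W) <= set(its))

def ie_bounds(X):
    # returns (max of lower bounds, min of upper bounds) for sup(X)
    X = frozenset(X)
    lower, upper = [], []
    for k in range(len(X) + 1):
        for Y in map(frozenset, combinations(sorted(X), k)):
            val = sum((-1) ** (len(X - W) + 1) * sup(W)
                      for m in range(len(Y), len(X))
                      for W in map(frozenset, combinations(sorted(X), m))
                      if Y <= W)
            (upper if len(X - Y) % 2 else lower).append(val)
    return max(lower), min(upper)

print(ie_bounds("ACD"))  # (1, 1): the bounds coincide, so ACD is derivable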

SLIDE 17

Tidset Partitioning Induced by t(A),t(C), and t(D)

Tid   Itemset
1     ABDE
2     BCE
3     ABDE
4     ABCE
5     ABCDE
6     BCD

The eight regions of the Venn diagram of t(A), t(C), and t(D):

t(A C̄ D̄) = ∅        t(Ā C D̄) = {2}
t(Ā C̄ D) = ∅        t(A C D̄) = {4}
t(A C̄ D) = {1,3}    t(Ā C D) = {6}
t(A C D) = {5}      t(Ā C̄ D̄) = ∅

SLIDE 18

Inclusion–Exclusion for Support

Consider the generalized itemset C ĀD̄, where Y = C, Z = AD, and X = Y ∪ Z = ACD. In the Venn diagram, we start with all the tids in t(C) and remove the tids contained in t(AC) and t(CD). However, in terms of support this removes sup(ACD) twice, so we need to add it back. In other words, the support of C ĀD̄ is given as

sup(C ĀD̄) = sup(C) − sup(AC) − sup(CD) + sup(ACD) = 4 − 2 − 2 + 1 = 1

But this is precisely what the inclusion–exclusion formula gives:

sup(C ĀD̄) = (−1)^0 sup(C)      (W = C,   |W \ Y| = 0)
          + (−1)^1 sup(AC)     (W = AC,  |W \ Y| = 1)
          + (−1)^1 sup(CD)     (W = CD,  |W \ Y| = 1)
          + (−1)^2 sup(ACD)    (W = ACD, |W \ Y| = 2)
          = sup(C) − sup(AC) − sup(CD) + sup(ACD)
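A quick self-contained check (not from the slides) that the direct count and the inclusion–exclusion expansion agree on the example database.

db = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}

def sup(W):
    return sum(1 for its in db.values() if set(W) <= set(its))

direct = sum(1 for its in db.values()
             if "C" in its and not set("AD") & set(its))
via_ie = sup("C") - sup("AC") - sup("CD") + sup("ACD")
print(direct, via_ie)  # 1 1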

SLIDE 19

Support Bounds

From each of the partitions we get one bound; out of the eight possible regions, exactly four give upper bounds and the other four give lower bounds for the support of ACD:

sup(ACD) ≥ 0                                  when Y = ACD
sup(ACD) ≤ sup(AC)                            when Y = AC
sup(ACD) ≤ sup(AD)                            when Y = AD
sup(ACD) ≤ sup(CD)                            when Y = CD
sup(ACD) ≥ sup(AC) + sup(AD) − sup(A)         when Y = A
sup(ACD) ≥ sup(AC) + sup(CD) − sup(C)         when Y = C
sup(ACD) ≥ sup(AD) + sup(CD) − sup(D)         when Y = D
sup(ACD) ≤ sup(AC) + sup(AD) + sup(CD) − sup(A) − sup(C) − sup(D) + sup(∅)   when Y = ∅

SLIDE 20

Support Bounds for Subsets

Subset lattice of ACD (figure), with the sign of each level's terms in IE(Y) and the direction of the resulting bound:

Level   Subsets of ACD   Sign   Inequality
1       AC, AD, CD       +1     ≤ (upper)
2       A, C, D          −1     ≥ (lower)
3       ∅                +1     ≤ (upper)

SLIDE 21

Nonderivable Itemsets

Given an itemset X and Y ⊆ X, let IE(Y) denote the summation

IE(Y) = Σ_{Y ⊆ W ⊂ X} (−1)^(|X \ W| + 1) · sup(W)

Then the sets of all upper and lower bounds for sup(X) are given as

UB(X) = {IE(Y) | Y ⊆ X, |X \ Y| is odd}
LB(X) = {IE(Y) | Y ⊆ X, |X \ Y| is even}

An itemset X is called nonderivable if max{LB(X)} ≠ min{UB(X)}, which implies that the support of X cannot be derived from the support values of its subsets; we know only the range of possible values, that is,

sup(X) ∈ [max{LB(X)}, min{UB(X)}]

On the other hand, X is derivable if sup(X) = max{LB(X)} = min{UB(X)}, because in this case sup(X) can be derived exactly using the supports of its subsets. Thus, the set of all frequent nonderivable itemsets is given as

N = {X ∈ F | max{LB(X)} ≠ min{UB(X)}}

where F is the set of all frequent itemsets.
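A sketch (not from the slides) of the nonderivability test, reusing the ie_bounds helper from the inclusion–exclusion sketch above.

def is_nonderivable(X):
    lo, hi = ie_bounds(X)  # max{LB(X)}, min{UB(X)}
    return lo != hi

print(is_nonderivable("ACD"))  # False: LB = UB = 1, so ACD is derivable
print(is_nonderivable("AD"))   # True:  sup(AD) ∈ [2, 4] is not determined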

SLIDE 22

Nonderivable Itemsets: Example

Consider the support bound formulas for sup(ACD). The lower bounds are

sup(ACD) ≥ 0
sup(ACD) ≥ sup(AC) + sup(AD) − sup(A) = 2 + 3 − 4 = 1
sup(ACD) ≥ sup(AC) + sup(CD) − sup(C) = 2 + 2 − 4 = 0
sup(ACD) ≥ sup(AD) + sup(CD) − sup(D) = 3 + 2 − 4 = 1

and the upper bounds are

sup(ACD) ≤ sup(AC) = 2
sup(ACD) ≤ sup(AD) = 3
sup(ACD) ≤ sup(CD) = 2
sup(ACD) ≤ sup(AC) + sup(AD) + sup(CD) − sup(A) − sup(C) − sup(D) + sup(∅) = 2 + 3 + 2 − 4 − 4 − 4 + 6 = 1

Thus, we have

LB(ACD) = {0, 1}     max{LB(ACD)} = 1
UB(ACD) = {1, 2, 3}  min{UB(ACD)} = 1

Because max{LB(ACD)} = min{UB(ACD)}, we conclude that ACD is derivable.

SLIDE 23

Data Mining and Machine Learning: Fundamental Concepts and Algorithms

dataminingbook.info

Mohammed J. Zaki¹  Wagner Meira Jr.²

¹Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA

²Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

Chapter 9: Summarizing Itemsets
