
28/10/2004 1

CHARM: An Efficient Algorithm for Closed Itemset Mining

Authors: Mohammed J. Zaki and Ching-Jui Hsiao
Presenter: Junfeng Wu


Outline

  • Introduction
  • Itemset-Tidset tree
  • CHARM algorithm
  • Performance study
  • Conclusion
  • Comments


Introduction

Mining association rules in a database can generate a huge number of frequent patterns (itemsets).

  • Database: {(1,2,3,4),(1,2,3,4,5,6)}
  • Minimum support = 50%
  • 63 frequent itemsets

({(1),(2),(3),(4),(5),(6),(1,2),(1,3),…,(1,2,3,4,5,6)})
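The count of 63 can be checked by brute force. The sketch below is only an illustration and is not part of CHARM; the helper name `frequent_itemsets` and the use of an absolute count for `minsup` are assumptions made for this example.

```python
from itertools import combinations


def frequent_itemsets(db, minsup):
    """Brute-force enumeration: return {itemset tuple: support count}."""
    items = sorted(set().union(*db))
    result = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            # Support = number of transactions that contain every item.
            support = sum(1 for transaction in db if set(candidate) <= transaction)
            if support >= minsup:
                result[candidate] = support
    return result


db = [{1, 2, 3, 4}, {1, 2, 3, 4, 5, 6}]
freq = frequent_itemsets(db, minsup=1)  # 50% of 2 transactions = 1 transaction
print(len(freq))  # 63
```

Every non-empty subset of the six items occurs in the second transaction, which is why all 2^6 - 1 = 63 itemsets are frequent.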


Introduction

Closed frequent itemsets are a non-redundant representation of all frequent itemsets. Mining association rules from the closed frequent itemsets is a much easier task.

In the previous database, there are only 2 closed frequent itemsets: (1,2,3,4) and (1,2,3,4,5,6).


Closed frequent itemsets

A frequent itemset X is closed if and only if there is no itemset Y such that

  • Y subsumes X (Y is a proper superset of X), and
  • every transaction that contains X also contains Y.

Database: {(1,2,3,4),(1,2,3,4,5,6)}

  • Itemset (1,2) is not a closed itemset: (1,2,3,4) subsumes it and occurs in every transaction that contains (1,2).
  • Itemset (1,2,3,4) is a closed itemset.
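The definition can be tested directly on the two-transaction database. This is a minimal brute-force sketch, not the CHARM method; the function names `support`, `is_closed` and `closed_frequent_itemsets` are hypothetical. It relies on the fact that if any proper superset of X has the same support, then so does some one-item extension of X.

```python
from itertools import combinations


def support(db, itemset):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for transaction in db if itemset <= transaction)


def is_closed(db, itemset):
    """True if no one-item extension of itemset keeps the same support."""
    all_items = set().union(*db)
    sup = support(db, itemset)
    return all(support(db, itemset | {x}) < sup for x in all_items - itemset)


def closed_frequent_itemsets(db, minsup):
    """Brute force over all itemsets; minsup is an absolute count."""
    items = sorted(set().union(*db))
    closed = []
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            itemset = frozenset(candidate)
            if support(db, itemset) >= minsup and is_closed(db, itemset):
                closed.append(itemset)
    return closed


db = [{1, 2, 3, 4}, {1, 2, 3, 4, 5, 6}]
print(is_closed(db, frozenset({1, 2})))        # False
print(is_closed(db, frozenset({1, 2, 3, 4})))  # True
print(len(closed_frequent_itemsets(db, 1)))    # 2
```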


Example Database

DISTINCT DATABASE ITEMS

  A = Jane Austen
  C = Agatha Christie
  D = Sir Arthur Conan Doyle
  T = Mark Twain
  W = P.G. Wodehouse

DATABASE

  Transaction   Items
  1             A,C,T,W
  2             C,D,W
  3             A,C,T,W
  4             A,C,D,W
  5             A,C,D,T,W
  6             C,D,T

ALL FREQUENT ITEMSETS (MINIMUM SUPPORT = 50%)

  Support     Itemsets
  100% (6)    C
  83% (5)     W, CW
  67% (4)     A, D, T, AC, AW, CD, CT, ACW
  50% (3)     AT, DW, TW, ACT, ATW, CDW, CTW, ACTW


Horizontal/Vertical format database

Horizontal format database

Each record is a set of items.
Each record is assigned a distinct number called the transaction id (tid).

Vertical format database

Each record is the set of transaction ids (the tidset) of one item: the transactions in which that item occurs.


Vertical format database

  Item   Tidset
  A      1 3 4 5
  C      1 2 3 4 5 6
  D      2 4 5 6
  T      1 3 5 6
  W      1 2 3 4 5
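Converting the example database from horizontal to vertical format takes one pass over the transactions. The sketch below assumes the database is held in a Python dict keyed by tid; the name `to_vertical` is made up for this example.

```python
def to_vertical(horizontal):
    """Turn {tid: items} into {item: sorted list of tids}."""
    vertical = {}
    for tid in sorted(horizontal):
        for item in horizontal[tid]:
            vertical.setdefault(item, []).append(tid)
    return vertical


# The example database: each string lists the items of one transaction.
horizontal = {1: "ACTW", 2: "CDW", 3: "ACTW", 4: "ACDW", 5: "ACDTW", 6: "CDT"}
vertical = to_vertical(horizontal)
for item in sorted(vertical):
    print(item, vertical[item])
```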


Notations

Given an itemset X, t(X) is the set of the tids of all transactions that contain X.

For example: t(ACW) = 1345

Given a tidset Y, i(Y) is the set of items common to all the tids in Y.

For example: i(12) = CW

Given an itemset X, c(X) = i(t(X)) is the smallest closed set that contains X.

For example: c(A) = c(AC) = c(AW) = ACW, whereas c(C) = C and c(W) = CW
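The three operators are easy to try out on the example database. The sketch below is a direct, unoptimized reading of the definitions; it assumes the closure is computed as i(t(X)), and the global name `DB` is only a convenience for this example.

```python
DB = {1: "ACTW", 2: "CDW", 3: "ACTW", 4: "ACDW", 5: "ACDTW", 6: "CDT"}
ALL_ITEMS = frozenset("".join(DB.values()))


def t(itemset):
    """Tidset of an itemset: tids of the transactions containing all its items."""
    return frozenset(tid for tid, items in DB.items() if set(itemset) <= set(items))


def i(tidset):
    """Itemset of a tidset: items common to all the listed transactions."""
    common = set(ALL_ITEMS)  # by convention, an empty tidset yields all items
    for tid in tidset:
        common &= set(DB[tid])
    return frozenset(common)


def c(itemset):
    """Closure of an itemset, computed as i(t(X))."""
    return i(t(itemset))


print(sorted(t("ACW")))   # [1, 3, 4, 5]
print(sorted(i({1, 2})))  # ['C', 'W']
print(sorted(c("A")))     # ['A', 'C', 'W']
```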


Itemset-Tidset Search Tree (IT-tree)

Each node in the IT-tree is an itemset-tidset pair, X×t(X).

For example: AT×135

All the children of a node X share the same prefix X and belong to one equivalence class.


Example of IT-tree

{} × 123456
  A × 1345
    AC × 1345
      ACD × 45
        ACDT × 5
          ACDTW × 5
        ACDW × 45
      ACT × 135
        ACTW × 135
      ACW × 1345
    AD × 45
      ADT × 5
        ADTW × 5
      ADW × 45
    AT × 135
      ATW × 135
    AW × 1345
  C × 123456
    CD × 2456
      CDT × 56
        CDTW × 5
      CDW × 245
    CT × 1356
      CTW × 135
    CW × 12345
  D × 2456
    DT × 56
      DTW × 5
    DW × 245
  T × 1356
    TW × 135
  W × 12345
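A complete IT-tree like this one can be generated by extending each prefix with the items that follow it and intersecting tidsets. The sketch below is an illustration only; `build_it_tree` and its `minsup` parameter are hypothetical names, and the nodes are kept in a flat dict rather than a linked tree.

```python
def build_it_tree(vertical, minsup=1):
    """Return {itemset string: tidset} for every IT-tree node ('' is the root)."""
    nodes = {"": frozenset().union(*vertical.values())}

    def extend(prefix, tidset, later_items):
        # Children of a node share its prefix and add one later item each.
        for k, item in enumerate(later_items):
            child_tids = tidset & vertical[item]
            if len(child_tids) >= minsup:
                nodes[prefix + item] = child_tids
                extend(prefix + item, child_tids, later_items[k + 1:])

    extend("", nodes[""], sorted(vertical))
    return nodes


vertical = {
    "A": frozenset({1, 3, 4, 5}),
    "C": frozenset({1, 2, 3, 4, 5, 6}),
    "D": frozenset({2, 4, 5, 6}),
    "T": frozenset({1, 3, 5, 6}),
    "W": frozenset({1, 2, 3, 4, 5}),
}
tree = build_it_tree(vertical)
print(len(tree))           # 32 nodes, including the root
print(sorted(tree["AT"]))  # [1, 3, 5]
```

Raising `minsup` prunes a node together with its whole subtree, since a child's tidset is never larger than its parent's.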

Theorem 1

  • Let $X_i \times t(X_i)$ and $X_j \times t(X_j)$ be any two members of a class $[P]$, with $X_i \le_f X_j$, where $f$ is a total order. The following four properties hold:

  • 1. If $t(X_i) = t(X_j)$, then $c(X_i) = c(X_j) = c(X_i \cup X_j)$
  • 2. If $t(X_i) \subset t(X_j)$, then $c(X_i) \ne c(X_j)$, but $c(X_i) = c(X_i \cup X_j)$
  • 3. If $t(X_i) \supset t(X_j)$, then $c(X_i) \ne c(X_j)$, but $c(X_j) = c(X_i \cup X_j)$
  • 4. If $t(X_i) \ne t(X_j)$, then $c(X_i) \ne c(X_j) \ne c(X_i \cup X_j)$
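The four properties can be checked numerically on the example database. The sketch below classifies a pair of itemsets by comparing their tidsets; it assumes that case 4 means neither tidset contains the other, and that the closure is computed as i(t(X)). The function names are made up for this example.

```python
DB = {1: "ACTW", 2: "CDW", 3: "ACTW", 4: "ACDW", 5: "ACDTW", 6: "CDT"}


def t(itemset):
    """Tidset of an itemset."""
    return frozenset(tid for tid, items in DB.items() if set(itemset) <= set(items))


def c(itemset):
    """Closure i(t(X)); assumes X occurs in at least one transaction."""
    return frozenset.intersection(*(frozenset(DB[tid]) for tid in t(itemset)))


def theorem1_case(x_i, x_j):
    """Return which of the four properties applies to the pair (x_i, x_j)."""
    t_i, t_j = t(x_i), t(x_j)
    if t_i == t_j:
        return 1
    if t_i < t_j:  # proper subset
        return 2
    if t_i > t_j:  # proper superset
        return 3
    return 4       # neither tidset contains the other


# Property 2: t(A) = 1345 is a proper subset of t(W) = 12345.
print(theorem1_case("A", "W"))       # 2
print(c("A") == c("AW"))             # True
print(c("A") != c("W"))              # True
# Property 1: AT and TW have the same tidset 135.
print(theorem1_case("AT", "TW"))     # 1
print(c("AT") == c("TW") == c("ATW"))  # True
```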


CHARM algorithm


How does CHARM work?

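Items are processed in order of increasing support (minimum support = 3 transactions). An arrow marks a node that is replaced by its extension.

{}
  D×2456 → DC×2456
    DT×56 (infrequent)
    DA×45 (infrequent)
    DW×245 → DWC×245
  T×1356 → TC×1356
    TA×135 → TAC×135 → TAWC×135
    TW×135 → TWC×135 (same tidset as TAC, merged into TAWC)
  A×1345 → AW×1345 → AWC×1345
  W×12345 → WC×12345
  C×123456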
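The search can be sketched in a few lines of Python. This is a simplified illustration built on the four properties of Theorem 1, not the authors' implementation: it uses plain tidsets, breaks support ties alphabetically (so the item order differs slightly from the slide), and uses a dict keyed by tidset in place of the hash table for the subsumption check. The names `charm` and `_extend` are hypothetical.

```python
def charm(db, minsup):
    """Return {tidset: closed itemset}; minsup is an absolute count."""
    vertical = {}
    for tid, items in db.items():
        for item in items:
            vertical.setdefault(item, set()).add(tid)
    nodes = [(frozenset([item]), frozenset(tids))
             for item, tids in vertical.items() if len(tids) >= minsup]
    nodes.sort(key=lambda n: (len(n[1]), sorted(n[0])))  # increasing support
    closed = {}
    _extend(nodes, closed, minsup)
    return closed


def _extend(nodes, closed, minsup):
    i = 0
    while i < len(nodes):
        x_i, t_i = nodes[i]
        children = []
        j = i + 1
        while j < len(nodes):
            x_j, t_j = nodes[j]
            y = t_i & t_j
            if len(y) < minsup:  # infrequent combination: skip it
                j += 1
                continue
            if t_i == t_j or t_i < t_j:
                # Properties 1 and 2: X_j occurs wherever X_i does, so X_i
                # (and every child built so far) is replaced by its extension.
                x_i = x_i | x_j
                children = [(x | x_j, tids) for x, tids in children]
            else:
                # Properties 3 and 4: X_i and X_j together start a new child.
                children.append((x_i | x_j, y))
            if t_i == t_j or t_i > t_j:
                # Properties 1 and 3: X_j is covered through X_i; remove it.
                del nodes[j]
            else:
                j += 1
        if children:
            children.sort(key=lambda n: (len(n[1]), sorted(n[0])))
            _extend(children, closed, minsup)
        # Subsumption check: skip X_i if a known closed set with the same
        # tidset already contains it.
        known = closed.get(t_i)
        if known is None or not x_i <= known:
            closed[t_i] = x_i
        i += 1


db = {1: "ACTW", 2: "CDW", 3: "ACTW", 4: "ACDW", 5: "ACDTW", 6: "CDT"}
result = charm(db, minsup=3)
for tids, itemset in sorted(result.items(), key=lambda kv: -len(kv[0])):
    print("".join(sorted(itemset)), len(tids))
```

On the example database this yields the seven closed sets at the leaves of the search shown above: C, CW, ACW, CD, CT, CDW and ACTW.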


Subsumption Checking

Before adding a set X to the current set of closed sets, we need to check whether X is subsumed by an existing closed set.

Comparing X with every closed set is expensive.

Solution: use a hash function to retrieve only the relevant closed sets.


Hash function

$h(X) = \sum_{T \in t(X)} T$

The hash key of an itemset is the sum of the tids in its tidset.

Assumption: itemsets with the same hash key have different supports.
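A hash-based subsumption check along these lines might look as follows. This is a sketch under stated assumptions: the class name `ClosedSetIndex` and its methods are invented for illustration, and each bucket stores the itemset together with its support so that colliding entries can be told apart.

```python
from collections import defaultdict


class ClosedSetIndex:
    """Closed sets bucketed by h(X) = sum of the tids in t(X)."""

    def __init__(self):
        self.buckets = defaultdict(list)  # hash key -> [(itemset, support)]

    @staticmethod
    def h(tidset):
        return sum(tidset)

    def is_subsumed(self, itemset, tidset):
        """True if a stored closed set contains itemset with equal support."""
        itemset = frozenset(itemset)
        candidates = self.buckets.get(self.h(tidset), [])
        return any(sup == len(tidset) and itemset <= closed
                   for closed, sup in candidates)

    def add(self, itemset, tidset):
        self.buckets[self.h(tidset)].append((frozenset(itemset), len(tidset)))


index = ClosedSetIndex()
index.add("ACTW", {1, 3, 5})       # h = 9
index.add("CW", {1, 2, 3, 4, 5})   # h = 15

# CTW has tidset 135, so it lands in the ACTW bucket and is subsumed.
print(index.is_subsumed("CTW", {1, 3, 5}))    # True
# CT has tidset 1356, which also sums to 15, but its support (4) differs
# from that of CW (5), so the collision is filtered out.
print(index.is_subsumed("CT", {1, 3, 5, 6}))  # False
```

Only the closed sets in one bucket are compared, and the support test rejects most colliding entries before any subset test is made.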


Complexity issues

Comparing the tidsets of two itemsets becomes time-consuming when the tidsets get very large.

Keeping all the tids of every itemset in memory needs a lot of space.

Solution: use diffsets.


Diffsets

[Figure: diagram relating the tidsets t(P), t(X), t(Y), t(PX), t(PXY) and the diffsets d(PX), d(PY), d(PXY)]


Diffset and Tidset

  • 1. If $m(X_i) = 0$ and $m(X_j) = 0$, then $d(X_i) = d(X_j)$, or $t(X_i) = t(X_j)$
  • 2. If $m(X_i) > 0$ and $m(X_j) = 0$, then $d(X_i) \supset d(X_j)$, or $t(X_i) \subset t(X_j)$
  • 3. If $m(X_i) = 0$ and $m(X_j) > 0$, then $d(X_i) \subset d(X_j)$, or $t(X_i) \supset t(X_j)$
  • 4. If $m(X_i) > 0$ and $m(X_j) > 0$, then $d(X_i) \ne d(X_j)$, or $t(X_i) \ne t(X_j)$

Let m(Xi) and m(Xj) denote the number of mismatches in the diffsets d(Xi) and d(Xj): m(Xi) counts the tids in d(Xi) that are missing from d(Xj), and vice versa.

For example: Xi=D, Xj=T, then d(Xi)=2456, d(Xj)=1356, m(Xi)=|(24)|=2, m(Xj)=|(13)|=2
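The diffset arithmetic can be tried on the example database. The sketch below assumes the usual diffset definitions, which the slides do not spell out: d(PX) = t(P) − t(X), d(PXY) = d(PY) − d(PX), and support(PXY) = support(PX) − |d(PXY)|. The function names are made up for this example.

```python
TIDSETS = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}


def diffset_from_tidsets(p, x):
    """d(PX) = t(P) - t(X) for single items P and X (assumed definition)."""
    return TIDSETS[p] - TIDSETS[x]


def diffset_from_diffsets(d_px, d_py):
    """d(PXY) = d(PY) - d(PX) (assumed definition)."""
    return d_py - d_px


def mismatches(d_i, d_j):
    """(m(Xi), m(Xj)): tids of each diffset that are missing from the other."""
    return len(d_i - d_j), len(d_j - d_i)


d_dt = diffset_from_tidsets("D", "T")       # {2, 4}
d_dw = diffset_from_tidsets("D", "W")       # {6}
d_dtw = diffset_from_diffsets(d_dt, d_dw)   # {6}

support_dt = len(TIDSETS["D"]) - len(d_dt)  # 4 - 2 = 2
support_dtw = support_dt - len(d_dtw)       # 2 - 1 = 1
print(sorted(d_dt), sorted(d_dw), sorted(d_dtw), support_dtw)

# The slide's example: both counts are positive, so case 4 applies.
print(mismatches({2, 4, 5, 6}, {1, 3, 5, 6}))  # (2, 2)
# d(TA) = d(TW) = {6}: no mismatches, so case 1 applies and t(TA) = t(TW).
print(mismatches(diffset_from_tidsets("T", "A"), diffset_from_tidsets("T", "W")))  # (0, 0)
```

The support of DTW computed from diffsets (1) matches the tidset view, since only transaction 5 contains D, T and W together.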


CHARM using diffsets

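First-level nodes keep their tidsets; deeper nodes store diffsets relative to their parent. An arrow marks a node that is replaced by its extension.

{}
  D×2456 → DC×2456
    DT×24
    DA×26
    DW×6 → DWC×6
  T×1356 → TC×1356
    TA×6 → TAC×6 → TAWC×6
    TW×6 → TWC×6
  A×1345 → AW×1345 → AWC×1345
  W×12345 → WC×12345
  C×123456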


Performance study

Datasets


Scalability

Running time increases linearly with the number of transactions at a given support.


Memory usage

Memory usage is 50 times smaller with diffsets than with tidsets.

Memory usage (using diffsets)


Conclusion

Advantages of CHARM

  • Faster than other algorithms at low support thresholds
  • Faster than other algorithms on databases with very long closed patterns

Disadvantage of CHARM

  • Slower than Closet when most of the closed sets are 2-itemsets


Comments

Strengths

  • The ideas in the paper are intuitive.
  • The authors first introduced an efficient data structure (IT-tree) for closed itemset mining.
  • The authors demonstrated the algorithm on various datasets.
  • The experimental studies are convincing.

Weakness

  • The algorithm requires converting the database from horizontal format to vertical format.

Follow-up

  • Closet+ (Wang et al., 2003) beat CHARM one year later.


THANK YOU!

Questions or comments?