SLIDE 30 Reminder (Apriori): Transactions as a Prefix Tree
transaction database     lexicographically sorted
a, d, e                  a, c, d
b, c, d                  a, c, d, e
a, c, e                  a, c, d, e
a, c, d, e               a, c, e
a, e                     a, d, e
a, c, d                  a, d, e
b, c                     a, e
a, c, d, e               b, c
b, c, e                  b, c, d
a, d, e                  b, c, e

prefix tree representation (item: number of transactions sharing this prefix)
a: 7
    c: 4
        d: 3
            e: 2
        e: 1
    d: 2
        e: 2
    e: 1
b: 3
    c: 3
        d: 1
        e: 1
- Items in transactions are sorted w.r.t. some arbitrary order,
transactions are sorted lexicographically, then a prefix tree is constructed.
- Advantage: identical transaction prefixes are processed only once.
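The construction above can be sketched in a few lines of Python. This is a minimal illustration, not the data structure of any particular Apriori implementation: the nested-dict node layout and the names `build_prefix_tree` and `show` are assumptions made for this example.

```python
def build_prefix_tree(transactions):
    """Return nested dicts mapping item -> [count, children]."""
    root = {}
    # Sort the items in each transaction, then sort the transactions
    # lexicographically (as on the slide; the dict-based insert below
    # would also work on unsorted transactions).
    for t in sorted(tuple(sorted(t)) for t in transactions):
        children = root
        for item in t:
            node = children.setdefault(item, [0, {}])
            node[0] += 1          # one more transaction shares this prefix
            children = node[1]
    return root

def show(children, depth=0):
    """Print the tree with one indentation level per prefix item."""
    for item in sorted(children):
        count, sub = children[item]
        print("    " * depth + f"{item}: {count}")
        show(sub, depth + 1)

transactions = ["ade", "bcd", "ace", "acde", "ae",
                "acd", "bc", "acde", "bce", "ade"]
tree = build_prefix_tree(transactions)
show(tree)
```

Each counter is incremented once per transaction that passes through the node, so the count at a node is the number of transactions sharing that prefix.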
Christian Borgelt Frequent Pattern Mining 117
Eclat: Transaction Ranges
item frequencies: a: 7, b: 3, c: 7, d: 6, e: 7

transaction database     sorted by frequency     lexicographically sorted
a, d, e                  a, e, d                  1: a, c, e
b, c, d                  c, d, b                  2: a, c, e, d
a, c, e                  a, c, e                  3: a, c, e, d
a, c, d, e               a, c, e, d               4: a, c, d
a, e                     a, e                     5: a, e
a, c, d                  a, c, d                  6: a, e, d
b, c                     c, b                     7: a, e, d
a, c, d, e               a, c, e, d               8: c, e, b
b, c, e                  c, e, b                  9: c, d, b
a, d, e                  a, e, d                 10: c, b

transaction ranges per item
a:  1 ... 7
c:  1 ... 4    8 ... 10
e:  1 ... 3    5 ... 7    8 ... 8
d:  2 ... 3    4 ... 4    6 ... 7    9 ... 9
b:  8 ... 8    9 ... 9   10 ... 10
- The transaction lists can be compressed by combining
consecutive transaction identifiers into ranges.
- Exploit item frequencies and ensure subset relations between ranges
from lower to higher frequencies, so that intersecting the lists is easy.
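A small Python sketch of both points, under stated assumptions: ties in the item frequencies are broken alphabetically (which reproduces the item order a, c, e, d, b), a new range is started whenever the prefix up to the item changes, and the helper names `item_ranges`, `intersect` and `support` are made up for this illustration.

```python
from collections import Counter, defaultdict

transactions = ["ade", "bcd", "ace", "acde", "ae",
                "acd", "bc", "acde", "bce", "ade"]

# Item order: descending frequency, ties broken alphabetically (assumption).
freq = Counter(item for t in transactions for item in t)
order = sorted(freq, key=lambda i: (-freq[i], i))
rank = {item: r for r, item in enumerate(order)}

# Recode items by rank, sort items within each transaction,
# then sort the transactions lexicographically.
sorted_db = sorted(sorted(rank[i] for i in t) for t in transactions)

def item_ranges(db):
    """Map each item to its list of (first, last) transaction id ranges."""
    ranges = defaultdict(list)
    open_prefix = {}                      # item rank -> prefix of its open range
    for tid, t in enumerate(db, start=1):
        for k, r in enumerate(t):
            prefix = tuple(t[:k + 1])
            if open_prefix.get(r) == prefix:
                ranges[r][-1][1] = tid    # same prefix: extend the range
            else:
                ranges[r].append([tid, tid])
                open_prefix[r] = prefix
    return {order[r]: [tuple(x) for x in v] for r, v in ranges.items()}

def intersect(xs, ys):
    """Intersect two sorted range lists with a two-pointer sweep."""
    out, i, j = [], 0, 0
    while i < len(xs) and j < len(ys):
        lo, hi = max(xs[i][0], ys[j][0]), min(xs[i][1], ys[j][1])
        if lo <= hi:
            out.append((lo, hi))
        if xs[i][1] < ys[j][1]:
            i += 1
        else:
            j += 1
    return out

def support(ranges):
    return sum(hi - lo + 1 for lo, hi in ranges)

ranges = item_ranges(sorted_db)
print(ranges["d"])                        # [(2, 3), (4, 4), (6, 7), (9, 9)]
ed = intersect(ranges["e"], ranges["d"])
print(ed, support(ed))                    # [(2, 3), (6, 7)] 4
```

Intersecting the lists for e and d yields the ranges of the item set {e, d}, whose support of 4 agrees with a direct count in the transaction database.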
Eclat: Transaction Ranges / Prefix Tree
transaction database     sorted by frequency     lexicographically sorted
a, d, e                  a, e, d                  1: a, c, e
b, c, d                  c, d, b                  2: a, c, e, d
a, c, e                  a, c, e                  3: a, c, e, d
a, c, d, e               a, c, e, d               4: a, c, d
a, e                     a, e                     5: a, e
a, c, d                  a, c, d                  6: a, e, d
b, c                     c, b                     7: a, e, d
a, c, d, e               a, c, e, d               8: c, e, b
b, c, e                  c, e, b                  9: c, d, b
a, d, e                  a, e, d                 10: c, b

prefix tree representation (item: number of transactions sharing this prefix)
a: 7
    c: 4
        e: 3
            d: 2
        d: 1
    e: 3
        d: 2
c: 3
    e: 1
        b: 1
    d: 1
        b: 1
    b: 1
- Items in transactions are sorted by frequency,
transactions are sorted lexicographically, then a prefix tree is constructed.
- The transaction ranges reflect the structure of this prefix tree.
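To make this correspondence concrete, the following Python sketch builds the frequency-ordered prefix tree and reads one transaction range off each node. It is an illustration under assumptions: ties in the item frequencies are broken alphabetically, transactions ending at a node are numbered before those continuing into its children, and the dict-based node layout and the names `build_tree` and `node_ranges` are invented for this example.

```python
from collections import Counter, defaultdict

transactions = ["ade", "bcd", "ace", "acde", "ae",
                "acd", "bc", "acde", "bce", "ade"]

freq = Counter(item for t in transactions for item in t)
order = sorted(freq, key=lambda i: (-freq[i], i))   # a, c, e, d, b
rank = {item: r for r, item in enumerate(order)}

def new_node():
    # count: transactions passing through; end: transactions ending here
    return {"count": 0, "end": 0, "children": {}}

def build_tree(db):
    root = new_node()
    for t in db:
        node = root
        for item in sorted(t, key=rank.get):
            node = node["children"].setdefault(item, new_node())
            node["count"] += 1
        node["end"] += 1
    return root

def node_ranges(root):
    """Assign each node the range of sorted transaction ids it covers."""
    ranges = defaultdict(list)

    def visit(node, first):
        nxt = first + node["end"]         # shorter transactions come first
        for item in order:                # children in item order
            child = node["children"].get(item)
            if child is not None:
                ranges[item].append((nxt, nxt + child["count"] - 1))
                visit(child, nxt)
                nxt += child["count"]

    visit(root, 1)
    return dict(ranges)

ranges = node_ranges(build_tree(transactions))
for item in order:
    print(item, ranges[item])
```

Every node of the tree contributes exactly one range to the list of its item, and the length of the range equals the node's counter.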
Eclat: Difference sets (Diffsets)
- In a conditional database, all transaction lists are “filtered” by the prefix:
Only transactions contained in the transaction identifier list for the prefix can be in the transaction identifier lists of the conditional database.
- This suggests the idea to use diffsets to represent conditional databases:
$\forall I: \forall a \notin I: \quad D_T(a \mid I) = K_T(I) - K_T(I \cup \{a\})$
$D_T(a \mid I)$ contains the identifiers of the transactions that contain $I$ but not $a$.
- The support of direct supersets of I can now be computed as
$\forall I: \forall a \notin I: \quad s_T(I \cup \{a\}) = s_T(I) - |D_T(a \mid I)|$
- The diffsets for the next level can be computed by
$\forall I: \forall a, b \notin I, a \neq b: \quad D_T(b \mid I \cup \{a\}) = D_T(b \mid I) - D_T(a \mid I)$
- For some transaction databases, using diffsets speeds up the search considerably.
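The diffset formulas can be checked on the example database with a short Python sketch. Plain sets of transaction identifiers stand in for the lists, and the helper `cover` (computing K_T by a scan) is an assumption made only for this illustration; an actual Eclat implementation would not rescan the database.

```python
transactions = ["ade", "bcd", "ace", "acde", "ae",
                "acd", "bc", "acde", "bce", "ade"]

def cover(itemset):
    """K_T(I): identifiers of the transactions that contain all items of I."""
    return {tid for tid, t in enumerate(transactions, start=1)
            if set(itemset) <= set(t)}

# Prefix I = {a}: diffsets of the extension items c and d.
K_a = cover("a")
D_c = K_a - cover("ac")            # D_T(c | {a})
D_d = K_a - cover("ad")            # D_T(d | {a})

# Support of direct supersets: s(I u {a}) = s(I) - |D(a | I)|.
s_a = len(K_a)
s_ac = s_a - len(D_c)
s_ad = s_a - len(D_d)

# Next level: D(d | {a, c}) = D(d | {a}) - D(c | {a}).
D_d_ac = D_d - D_c
s_acd = s_ac - len(D_d_ac)

print(sorted(D_c), sorted(D_d), sorted(D_d_ac))   # [1, 5, 10] [3, 5] [3]
print(s_ac, s_ad, s_acd)                          # 4 5 3

# Cross-check against the definitions via covers.
assert D_d_ac == cover("ac") - cover("acd")
assert s_acd == len(cover("acd"))
```

Here the diffsets (sizes 3, 2 and 1) are smaller than the corresponding transaction identifier lists (sizes 4, 5 and 3), which is the situation in which diffsets pay off.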