OPTIMISING ASSOCIATION RULE ALGORITHMS USING ITEMSET ORDERING
ES2001 Peterhouse College, Cambridge
Frans Coenen, Paul Leng and Graham Goulbourne
The Department of Computer Science The University of Liverpool
Introduction
Which items tend to occur together in shopping baskets?
– Examine database of purchase transactions – look for associations
Find Association Rules:
PQ -> X
When P and Q occur together, X is also likely to be present
The support for a rule A -> B is the number (proportion) of records containing both A and B
The confidence for a rule is the ratio of support for the rule to support for its antecedent
The problem: find all rules for which support and confidence exceed some threshold (the frequent sets)
Support is the difficult part (confidence follows)
[Figure: the lattice of subsets of {A, B, C, D} — A B C D, AB AC AD BC BD CD, ABC ABD ACD BCD, ABCD]
– on each iteration k, examine a Candidate Set Ck
– Count the support for all members of Ck (one pass of the database, requiring all k-subsets of each record to be examined)
– Find the set Lk of sets with required support
– Use this to determine Ck+1, the set of sets of size k+1 all of whose subsets are in Lk
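The level-wise loop above can be sketched as follows — a minimal Python illustration (my own, not the authors' code), assuming transactions are frozensets of items and `min_support` is an absolute count:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise mining: on each pass k, count candidates Ck in one
    database scan, keep the frequent sets Lk, and build Ck+1 from
    unions of Lk members all of whose k-subsets are themselves in Lk."""
    items = {i for t in transactions for i in t}
    candidates = [frozenset([i]) for i in items]  # C1: single items
    k, frequent = 1, {}
    while candidates:
        # One pass of the database: count support of each member of Ck
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        lk = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(lk)
        # Ck+1: sets of size k+1 all of whose k-subsets are in Lk
        candidates = list({
            a | b
            for a in lk for b in lk
            if len(a | b) == k + 1
            and all(frozenset(s) in lk for s in combinations(a | b, k))
        })
        k += 1
    return frequent
```

For example, with four baskets AB, ABC, AC, BC and threshold 2, all pairs survive but ABC (support 1) does not.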
– Candidate sets Ck can become very large (especially if the database is dense)
– Examining each record to determine which members of Ck are present is time-consuming, especially for densely-packed records
– Count only the sets actually present as records (not their subsets): this gives us m' partial support-counts (m' < m, the database size)
– Support for subsets is then computed from these partial counts
– The partial counts are stored in a set-enumeration tree (the P-tree), organising the data for efficient computation
– find the set i on the tree
– increment support-count for all sets on path to i
– if set not present on tree, create a node for it
– The tree needs at most one node per distinct record present (far fewer than 2^n)
– Each node i also accumulates counts deriving from successor-supersets (leading to interim support-count Qi)
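A much-simplified sketch of the first pass (my own flattened rendering, not the authors' tree structure): one partial support-count is kept per distinct record-itemset actually present, giving the m' counts referred to above; the real P-tree additionally organises these counts so that shared prefixes accumulate along tree paths.

```python
from collections import Counter

def partial_supports(transactions):
    """First database pass: one partial support-count per distinct
    record-itemset actually present (m' counts, m' <= number of
    records), rather than one count per candidate subset."""
    return Counter(frozenset(t) for t in transactions)
```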
[Figure: worked example — P-tree nodes and interim support-counts Qi for a small database over {A, B, C, D}]
iTS = iPS + sum(PS of predecessor-supersets of i); e.g. BTS = BPS + ABPS
DTS = DPS + CDPS + BDPS + BCDPS + ADPS + ACDPS + ABDPS + ABCDPS
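The summation above can be sketched directly against a flat dictionary of partial counts (a sketch under that assumption; in the actual P-tree each interim count Qi already includes successor-supersets, so only predecessor-supersets need adding):

```python
def total_support(itemset, partial):
    """Total support of `itemset` = sum of partial counts over all
    recorded sets that contain it, itemset included -- e.g.
    DTS = DPS + CDPS + BDPS + BCDPS + ADPS + ACDPS + ABDPS + ABCDPS."""
    s = frozenset(itemset)
    return sum(count for rec, count in partial.items() if s <= rec)
```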
– Support is not equally distributed throughout the set of candidates
– Ordering items so that the most frequent come first means more of the support counting is complete after the first pass, and thus reduces the effort required for total support counting
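The reordering itself is straightforward; a sketch (my own, under the assumption that items are simple hashable labels), relabelling each record so that the most frequent items come first before the P-tree is built:

```python
from collections import Counter

def reorder_by_frequency(transactions):
    """Sort the items of each record by descending global frequency,
    so frequent items cluster near the root of the P-tree and most
    partial support lands where total-support summation is cheapest."""
    freq = Counter(i for t in transactions for i in t)
    order = {item: rank for rank, (item, _) in
             enumerate(freq.most_common())}
    return [sorted(t, key=order.__getitem__) for t in transactions]
```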
[Figure: example P-trees built under different item orderings, with their partial support-counts]
– Total support for set i is obtained by adding the partial supports of each predecessor-superset j of i
– The Target set T is stored in a second set-enumeration tree (the T-tree), ordered: A B C D AB AC AD ABC ABCD ABD ACD BC BD CD BCD
– for all sets i in Target set T: add counts from the P-tree (e.g. P-tree nodes whose sets include AB have their counts added to AB, including ABC)
– So use an Apriori-type algorithm
– i is the attribute not in the parent node
– starting at node i of the T-tree: walk the tree, adding support to all subsets of j at the required level
– prune unsupported sets after each pass
Pass 1: C not supported, so do not add AC, BC, CD to the tree
Pass 2: (e.g.) item ABD from the P-tree is added to AD and BD (the tree is walked from D to BD)
– For a record of r attributes, Apriori counts r(r-1)/2 subset-pairs; our method only r-1
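To make that comparison concrete, a tiny illustration (hypothetical helper, not from the talk) of the per-record pass-2 work for both figures:

```python
def pass2_work(r):
    """Per-record work in pass 2 for a record of r attributes:
    Apriori examines all r(r-1)/2 subset-pairs; the P-tree-based
    method examines only r-1 (figures quoted on the slide)."""
    return {"apriori": r * (r - 1) // 2, "p_tree": r - 1}
```

For a 25-attribute record this is 300 pairs versus 24.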
– The FP-tree, developed contemporaneously, has similar properties, but:
– FP-tree stores a single item only at each node (so more nodes)
– FP-tree builds in more links to implement the FP-growth algorithm
– Conversely, the P-tree is generic: Apriori-TFP is only one of the algorithms that can make use of it
– almost independent of N (number of attributes)
– scale linearly with M (number of records)
– seems to scale linearly as database density increases
– less than for FP-tree (because of more nodes and links in the latter)
[Figure: experimental results for the synthetic dataset T25.I10.N1K.D10K]
– The pattern-growth approach (used in FP-growth) could be applied to the P-tree
– Different strategies could be used for subtrees
– (exhaustive methods may be effective for small very densely-populated subtrees)