Implications of Probabilistic Data Modeling for Rule Mining
Michael Hahsler, Kurt Hornik and Thomas Reutterer
Wirtschaftsuniversität Wien
29th Annual Conference of the German Classification Society (GfKl 2005)
Magdeburg, March 9-11, 2005
Motivation
- Mining association rules is an important technique for discovering meaningful patterns in transaction databases.
  – Example: diapers ⇒ beer
  – Applications: product assortment decisions, adapting promotional activities, personalized product recommendations, adaptive user interfaces
- Current literature focuses on the properties of algorithms.
- We will discuss, from a probabilistic point of view, properties of
  – transaction data sets and
  – interest measures.
- M. Hahsler, K. Hornik and T. Reutterer
2 Magdeburg, March 9-11, 2005
Outline
- 1. Association rules
- 2. Probabilistic model for transaction data
- 3. Simulation with R
- 4. Implications for confidence and lift
- 5. New measure: hyperlift
- 6. Conclusion
Association Rules
An association rule is a rule of the form X ⇒ Y, where X and Y are two disjoint sets of items (itemsets).
Rules are selected using thresholds on interest measures:
- Support: fraction of transactions containing an itemset
- Confidence: probability of seeing Y under the condition that the transactions also contain X
Found rules are often ranked by:
- Lift: how many times more often X and Y occur together than expected if they were statistically independent
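The three measures can be illustrated with a minimal R sketch. The toy incidence matrix and the helper name `supp` are assumptions made up for this illustration and are not part of the original slides.

```r
# Toy incidence matrix (hypothetical data): rows are transactions, columns are items.
Tr_toy <- matrix(c(1, 1, 0,
                   1, 1, 1,
                   1, 0, 0,
                   0, 1, 1,
                   0, 0, 1),
                 ncol = 3, byrow = TRUE,
                 dimnames = list(NULL, c("diapers", "beer", "milk")))

# Support: fraction of transactions that contain every item of the itemset.
supp <- function(items) {
  mean(rowSums(Tr_toy[, items, drop = FALSE]) == length(items))
}

supp_xy   <- supp(c("diapers", "beer"))  # support of X and Y together
conf_rule <- supp_xy / supp("diapers")   # confidence of diapers => beer
lift_rule <- conf_rule / supp("beer")    # lift of diapers => beer
```

In this toy data the rule diapers ⇒ beer has a lift slightly above 1, meaning the two items co-occur a little more often than independence would predict.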
A simple probabilistic framework for transaction data
Transactions occur following a Poisson process
[Figure: time axis with transactions Tr1, Tr2, Tr3, Tr4, Tr5, ..., Trm-2, Trm-1, Trm up to time t]
We analyze transactions which are recorded in a fixed time interval of length t. The number of transactions m in the time interval is then Poisson distributed with parameter θt:
$$P(M = m) = \frac{e^{-\theta t} (\theta t)^m}{m!} \qquad (1)$$
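As a quick sanity check of equation (1), the following sketch evaluates the formula by hand and compares it with R's built-in Poisson density. The variable names (`t_len`, `m_val`) are assumptions for this illustration; the values of θ and t are the ones used later in the simulation.

```r
theta <- 300   # transactions per day (value used later in the slides)
t_len <- 30    # length of the observation window in days
m_val <- 8885  # an example number of transactions

# Equation (1) on the log scale, to avoid overflow in (theta * t)^m and m!.
log_p_manual <- -theta * t_len + m_val * log(theta * t_len) - lgamma(m_val + 1)

# The same quantity from R's built-in Poisson density.
log_p_builtin <- dpois(m_val, lambda = theta * t_len, log = TRUE)
```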
A simple probabilistic framework (cont’d)
- n independent items L = {l1, l2, ..., ln},
- each having a fixed probability of occurring in a transaction, given by the vector p = (p1, p2, ..., pn).
Following the framework, ci, the observed number of transactions that contain item li, can be interpreted as a realization of a random variable Ci.
Under the condition of a fixed number of transactions m, this random variable has a binomial distribution:
$$P(C_i = c_i \mid M = m) = \binom{m}{c_i} p_i^{c_i} (1 - p_i)^{m - c_i} \qquad (2)$$
A simple probabilistic framework (cont’d)
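Since for a fixed time interval t the number of transactions m is not fixed, the unconditional distribution is:
$$
\begin{aligned}
P(C_i = c_i) &= \sum_{m=c_i}^{\infty} P(C_i = c_i \mid M = m) \cdot P(M = m) \\
&= \sum_{m=c_i}^{\infty} \binom{m}{c_i} p_i^{c_i} (1 - p_i)^{m - c_i} \, \frac{e^{-\theta t} (\theta t)^m}{m!} \\
&= \frac{e^{-\theta t} (p_i \theta t)^{c_i}}{c_i!} \sum_{m=c_i}^{\infty} \frac{((1 - p_i) \theta t)^{m - c_i}}{(m - c_i)!} \\
&= \frac{e^{-p_i \theta t} (p_i \theta t)^{c_i}}{c_i!} \qquad (3)
\end{aligned}
$$
The last step uses the fact that the remaining sum is the series expansion of e^{(1 − pi)θt}. The result is a Poisson distribution with parameter λi = piθt.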
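The identity in (3) can also be checked numerically. The sketch below is an illustration under assumed values: `p_i` and `c_i` are hypothetical, and the infinite sum is truncated at a point where the Poisson weights are numerically negligible.

```r
theta <- 300; t_len <- 30   # transaction intensity and window length from the slides
p_i   <- 0.01               # hypothetical item probability
c_i   <- 85                 # hypothetical item count

# Left-hand side of (3): sum over m of binomial(c_i | m, p_i) * Poisson(m | theta * t).
# The infinite sum is truncated far beyond the Poisson mean of theta * t = 9000.
m_grid <- c_i:20000
lhs <- sum(dbinom(c_i, size = m_grid, prob = p_i) * dpois(m_grid, theta * t_len))

# Right-hand side of (3): Poisson density with parameter lambda_i = p_i * theta * t.
rhs <- dpois(c_i, lambda = p_i * theta * t_len)
```

Both sides agree up to floating-point error, which is the thinning property of the Poisson process that the derivation establishes.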
A simple probabilistic framework (cont’d)
Representation of transaction data as a binary incidence matrix (rows: transactions, columns: items), with the item counts c and item probabilities p below:

         l1      l2      l3       ...   ln
Tr1      0       1       0        ...   1
Tr2      0       1       0        ...   1
Tr3      0       1       0        ...   0
Tr4      0       0       0        ...   0
...      ...     ...     ...      ...   ...
Trm-1    1       0       0        ...   1
Trm      0       0       1        ...   1

c        99      201     7        ...   411
p        0.005   0.01    0.0003   ...   0.025
Simulation
For simplicity, we assume for the following simulation that the parameters in λ are drawn from a single gamma distribution with parameters k = 0.75 and a = 250. We simulate the counts ci for n = 200 different items over a period of t = 30 days with a transaction intensity of θ = 300 transactions per day.
> n <- 200; k <- 0.75; a <- 250; theta <- 300; t <- 30
> m <- rpois(1, theta * t)
> m
[1] 8885
> p <- sort(rgamma(n, shape = k, scale = a)/m,
+     decreasing = TRUE)
Now we can simulate the transactions in the database by m Bernoulli trials for each of the n items and calculate the count vector c.
> Tr <- matrix(rbinom(m * n, 1, p), ncol = n, byrow = TRUE)
> c <- (apply(Tr, 2, sum))
Simulation (cont’d)
We can directly calculate the support of each item from the transaction counts.
> supp1 <- c/m
> plot(supp1, type = "h", xlab = "items",
+     ylab = "support")
Simulation (cont’d)
Next, we extend the framework to the occurrences of 2-itemsets with a symmetric n × n count matrix c2 and a support matrix (supp2):
> c2 <- sapply(1:n, function(i) {
+     apply(Tr[, i] & Tr[, 1:n], 2, sum)})
> diag(c2) <- NA
> supp2 <- c2/m
> persp(supp2, expand = 0.5, ticktype = "detailed",
+     border = 0, shade = 1, zlab = "support",
+     xlab = "items", ylab = "items")
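The pairwise counts can alternatively be obtained with a single matrix product. The following sketch, on a small made-up incidence matrix (`Tr_small` is hypothetical), compares the loop-based approach from the slide with `crossprod()`; it is offered as an equivalent formulation, not as the authors' method.

```r
# Small hypothetical incidence matrix (4 transactions, 3 items).
Tr_small <- matrix(c(1, 1, 0,
                     1, 1, 1,
                     0, 1, 1,
                     1, 0, 0),
                   ncol = 3, byrow = TRUE)
n_small <- ncol(Tr_small)

# Approach from the slides: column-wise logical AND followed by column sums.
c2_loop <- sapply(1:n_small, function(i) {
  apply(Tr_small[, i] & Tr_small[, 1:n_small], 2, sum)
})

# Equivalent matrix product: entry (i, j) counts transactions containing both items.
c2_prod <- crossprod(Tr_small)
```

The diagonal of `c2_prod` holds the single-item counts, which is why the slides overwrite the diagonal with NA before computing pairwise supports.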
Implications for confidence
Confidence is defined by
$$\mathrm{conf}(X \Rightarrow Y) = \frac{\mathrm{supp}(X + Y)}{\mathrm{supp}(X)} \qquad (4)$$
From our 2-itemsets we can generate rules of the form li ⇒ lj, where i, j = 1, 2, ..., n and i ≠ j. We calculate confidence for the n(n − 1) possible rules in the data set.
> conf2 <- supp2/supp1
> persp(conf2, expand = 0.5, ticktype = "detailed",
+     border = 0, shade = 1, zlab = "confidence",
+     xlab = "items", ylab = "items")
Implications for confidence (cont’d)
- Confidence values are generally very low, which reflects the fact that there are no associations in the data.
- Some rules have a confidence of one. However, their left-hand sides (X) have low support.
- Confidence increases as the item in the right-hand side Y of the rule gets more frequent.
The fact that confidence systematically favors some rules makes the measure problematic when it comes to ranking rules.
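The dependence on the right-hand side follows from independence: if X and Y are independent, supp(X + Y) is close to supp(X) · supp(Y), so conf(X ⇒ Y) is close to supp(Y). A small simulation sketch illustrates this; the sample size, the probabilities and all variable names are assumptions chosen for this illustration.

```r
set.seed(1)                  # fixed seed so the sketch is reproducible
m_sim <- 20000               # hypothetical number of transactions
p_sim <- c(0.05, 0.2, 0.5)   # hypothetical item probabilities

# Independent Bernoulli trials per item, as in the simulation on the slides.
Tr_sim <- matrix(rbinom(m_sim * length(p_sim), 1, p_sim),
                 ncol = length(p_sim), byrow = TRUE)

supp_item <- colMeans(Tr_sim)
supp_pair <- crossprod(Tr_sim) / m_sim

# conf_sim[i, j] is the confidence of l_i => l_j; row i is divided by supp(l_i).
conf_sim <- supp_pair / supp_item
```

For the rare item l1 on the left-hand side, `conf_sim[1, 2]` and `conf_sim[1, 3]` should land near 0.2 and 0.5, the probabilities of the right-hand-side items, even though the items are unrelated.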
Implications for lift
Typically, rules mined using minimum support (and confidence) are filtered or ordered using their lift value. The measure lift is defined as:
$$\mathrm{lift}(X \Rightarrow Y) = \frac{\mathrm{conf}(X \Rightarrow Y)}{\mathrm{supp}(Y)} \qquad (5)$$
A lift value close to 1 indicates that the items co-occur in the database as expected under independence.
> lift <- conf2/matrix(supp1, ncol = n, nrow = n,
+     byrow = TRUE)
> persp(lift, expand = 0.5, ticktype = "detailed",
+     border = 0, shade = 1, zlab = "lift",
+     xlab = "items", ylab = "items")
> length(which(lift > 2))
[1] 3424
Implications for lift (cont’d)
To counter the problem with extremely high lift values, we discard all 2-itemsets which do not satisfy a minimum support of 0.1%.
> min_supp <- 0.001
> length(lift[supp2 >= min_supp])
[1] 7096
> lift[supp2 < min_supp] <- 1
> persp(lift, expand = 0.5, ticktype = "detailed",
+     border = 0, shade = 1, zlab = "lift",
+     xlab = "items", ylab = "items")
> length(which(lift > 2))
[1] 130
Implications for lift (cont’d)
- Lift performs poorly at filtering random noise in transaction data, especially for relatively rare items.
- Lift has a tendency to produce higher values for rules with items close to minimum support.
This makes using lift problematic for ranking discovered rules.
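A back-of-the-envelope calculation shows why rare items produce extreme lift values. The sketch below assumes two hypothetical items that each occur in only 9 of the 8885 simulated transactions (support just above 0.1%) and share a single transaction by chance.

```r
m_db <- 8885   # number of transactions in the simulated database
c_i  <- 9      # hypothetical count of item l_i (support just above 0.1%)
c_j  <- 9      # hypothetical count of item l_j

# Expected number of co-occurrences under independence.
expected_cooc <- c_i * c_j / m_db

# Lift if the two items happen to share a single transaction by chance.
lift_one_cooc <- 1 / expected_cooc
```

Because the expected co-occurrence count is far below one, a single chance co-occurrence already yields a lift of roughly 110, even though the items are unrelated.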
New measure: hyperlift
- The n × n co-occurrence matrix can be modeled by n² random variables Ci,j.
- The framework results in hypergeometric distributions for the Ci,j (urn model).
- Using the expected value of Ci,j, lift can be rewritten as:
$$\mathrm{lift}(l_i \Rightarrow l_j) = \frac{P(l_i + l_j)}{P(l_i) P(l_j)} = \frac{c_{i,j}}{E[C_{i,j}]} \qquad (6)$$
- As a more conservative approach, we use the quantile Qδ[Ci,j] instead of the expected value:
$$\mathrm{hyperlift}(l_i \Rightarrow l_j) = \frac{c_{i,j}}{Q_{\delta}[C_{i,j}]} \qquad (7)$$
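Equations (6) and (7) can be illustrated for a single pair of items. In the sketch below, the counts `c_i`, `c_j` and `c_ij` are hypothetical; the urn model is expressed with R's hypergeometric functions, whose parameters `m`, `n` and `k` denote the number of white balls, black balls and draws.

```r
m_db <- 8885   # number of transactions (value from the simulation)
c_i  <- 100    # hypothetical count of item l_i
c_j  <- 200    # hypothetical count of item l_j
c_ij <- 8      # hypothetical observed co-occurrence count

# Urn model: c_j of the m_db transactions contain l_j; the c_i transactions
# containing l_i are treated as draws without replacement.
x <- 0:min(c_i, c_j)
mean_hyper <- sum(x * dhyper(x, m = c_j, n = m_db - c_j, k = c_i))
expected   <- c_i * c_j / m_db   # closed form of E[C_ij]

# Quantile Q_0.99[C_ij] of the same hypergeometric distribution.
q99 <- qhyper(0.99, m = c_j, n = m_db - c_j, k = c_i)

lift_ij      <- c_ij / expected   # equation (6)
hyperlift_ij <- c_ij / q99        # equation (7)
```

Since the 0.99 quantile lies above the expected value here, hyperlift is smaller than lift for the same observed count, which is what makes it the more conservative measure.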
New measure: hyperlift (cont’d)
Calculating hyperlift for δ = 0.99:
> calc_hyperbase <- function(ci, cj) {
+     qhyper(0.99, m = cj, n = m - cj, k = ci)}
> hyperlift <- c2/outer(c, c, FUN = calc_hyperbase)
> hyperlift[is.infinite(hyperlift)] <- NA
> persp(hyperlift, shade = 1, ticktype = "detailed",
+     border = 0, expand = 0.5, zlab = "hyperlift",
+     xlab = "items", ylab = "items")
> length(which(hyperlift > 2))
[1] 2
New measure: hyperlift (cont’d)
- Hyperlift values are generally smaller than 1 and more evenly distributed than lift values. This indicates that hyperlift filters random co-occurrences better than lift.
- Hyperlift shows a weak systematic tendency to favor rules with more frequent items.
Comparing lift and hyperlift on a grocery database
- 1 month of real-world point-of-sale transaction data from a local grocery outlet with
  – m = 9835 transactions and
  – n = 169 categories.
- Support, confidence and lift distributions look almost identical to the simulated data.
Lift for 2-itemsets for items with support of 0.1% in the grocery database
Hyperlift for 2-itemsets for items in the grocery database
Comparing lift and hyperlift (cont’d)
Top 10 rules (ordered by lift, minimum support = 0.001)

    l_i                     l_j             supp      lift
20  mayonnaise              mustard         0.001423  12.965
8   Instant food products   hamburger meat  0.003050  11.421
15  softener                detergent       0.001118  10.600
16  liquor                  red/blush wine  0.002135  10.025
6   flour                   sugar           0.004982   8.463
4   popcorn                 salty snack     0.002237   8.192
11  processed cheese        ham             0.003050   7.071
9   sauces                  hamburger meat  0.001220   6.684
3   meat spreads            cream cheese    0.001118   6.605
14  house keeping products  detergent       0.001017   6.346
Comparing lift and hyperlift (cont’d)
Top 10 rules (ordered by hyperlift, no minimum support)

      l_i                    l_j             supp    hyperlift  lift
  11  Instant food products  hamburger meat  0.0030  4.286      11.421
  9   flour                  sugar           0.0049  4.083       8.463
  15  liquor                 red/blush wine  0.0021  3.500      10.025
* 17  cooking chocolate      baking powder   0.0007  3.500      15.826
  18  mayonnaise             mustard         0.0014  3.500      12.965
  6   processed cheese       white bread     0.0041  3.154       5.975
  7   popcorn                salty snack     0.0022  3.143       8.192
  13  processed cheese       ham             0.0030  3.000       7.071
  3   liquor                 bottled beer    0.0046  2.875       5.241
  14  softener               detergent       0.0011  2.750      10.600
  8   baking powder          sugar           0.0032  2.667       5.432
Comparing lift and hyperlift (cont’d)
- All rules for lift (with minimum support) and hyperlift make intuitive sense.
- Rules with high hyperlift tend to also have high lift.
- Hyperlift selects rules with support varying from very rare to relatively frequent (the tendency of hyperlift to favor rules with more frequent items does not seem too strong).
- Hyperlift is also able to deal with very infrequent rules.
Conclusion
- Interest measures are systematically influenced by the frequencies of items in the corresponding itemsets or rules.
- Lift performs poorly at filtering random noise.
- The presented framework provides many possibilities for further research:
  – Adapt hyperlift to finding substitutes (instead of complements).
  – Analyze the systematic influence of the occurrence frequency of items on the hyperlift measure.
  – Use p-values instead of hyperlift.
  – Expand the model to itemsets of size > 2.
  – Model dependencies between items.