Interesting Patterns
Jilles Vreeken
15 May 2015
Questions of the Day
What is interestingness? What is a pattern? And how can we mine interesting patterns?

What is a pattern?
Data → Pattern, e.g. y = x - 1
A pattern is recurring structure in the data.
For a database D,
▪ a pattern language 𝒫 and a set of constraints C,
the goal is to find the set of patterns S ⊆ 𝒫 such that
▪ each p ∈ S satisfies each c ∈ C on D, and S is maximal
That is, find all patterns that satisfy the constraints.
Suppose a supermarket,
▪ which sells items, I, and
▪ logs every transaction t ⊆ I in a database D.
An interesting question to ask is,
"What products are often sold together?"
▪ pattern language: all possible sets of items, 𝒫 = 2^I
▪ pattern: an itemset, X ⊆ I, X ∈ 𝒫
[Slide: rows of the Iris dataset (sepal length, sepal width, petal length, petal width, class) for Iris-setosa and Iris-versicolor.]
An example pattern: Petal length >= 2.0 and Petal width <= 0.5
The task is to find all frequent patterns.
"how often is X sold" → sup_D(X) = |{t ∈ D | X ⊆ t}|
▪ the number of transactions in D that "support" the pattern
"often enough" → sup_D(X) ≥ minsup
▪ have a support no lower than the minimal-support threshold
So, the problem is to find all X ⊆ I with sup_D(X) ≥ minsup.
▪ how can we do this?
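Before answering that, here is what support counting itself looks like; a minimal Python sketch, with hypothetical supermarket transactions:

```python
# a minimal sketch; the transactions are hypothetical
db = [{'bread', 'milk'}, {'bread', 'butter'},
      {'bread', 'milk', 'butter'}, {'milk'}, {'bread', 'milk'}]

def sup(X):
    """sup_D(X): the number of transactions that contain every item of X."""
    return sum(1 for t in db if X <= t)

print(sup({'bread', 'milk'}))  # 3
```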
The number of possible patterns is exponential, and hence exhaustive search is not a feasible option. However, in 1994 it was discovered that support exhibits monotonicity:
X ⊆ Y ⇒ sup(X) ≥ sup(Y)
This is known as the A Priori property. It allows efficient search for frequent itemsets over the lattice of all itemsets.
data
  a b c d
  1 1 1 1
  1 1 0 1
  1 1 1 0
  1 1 0 0
  1 0 0 1
  0 0 1 0

itemset lattice, with supports:
abcd (1)
abc (2), abd (2), acd (1), bcd (1)
ab (4), ac (2), ad (3), bc (2), bd (2), cd (1)
a (5), b (4), c (3), d (3)
∅ (6)
[The same data and lattice, now with the frequent itemsets highlighted: for minsup = 2, every itemset except abcd, acd, bcd, and cd is frequent.]
The A Priori algorithm:
1. F₁ = {i ∈ I | sup(i) ≥ minsup}
2. while Fₖ not empty
3.   Cₖ₊₁ = {X ⊆ I : |X| = k+1, and Y ∈ Fₖ for every Y ⊂ X with |Y| = k}
4.   Fₖ₊₁ = {X ∈ Cₖ₊₁ | sup(X) ≥ minsup}
5. return F₁ ∪ F₂ ∪ …
The A Priori algorithm can be applied to mine patterns from any enumerable pattern language 𝒫, under any monotonic constraint. Many algorithms exist that are more efficient, but none so versatile.
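To make the levelwise search concrete, here is a minimal Python sketch of A Priori (an illustration, not an optimised implementation), run on the toy database of the lattice example:

```python
from itertools import combinations

def apriori(db, minsup):
    """Levelwise A Priori over a list of transactions (sets of items)."""
    def sup(X):
        return sum(1 for t in db if X <= t)

    items = set().union(*db)
    # level 1: frequent single items
    F = [{frozenset([i]) for i in items if sup(frozenset([i])) >= minsup}]
    while F[-1]:
        # generate (k+1)-candidates, prune those with an infrequent k-subset
        cand = {X | Y for X in F[-1] for Y in F[-1] if len(X | Y) == len(X) + 1}
        cand = {X for X in cand if all(X - {i} in F[-1] for i in X)}
        F.append({X for X in cand if sup(X) >= minsup})
    return {X: sup(X) for level in F for X in level}

# the toy database from the lattice example
db = [set('abcd'), set('abd'), set('abc'), set('ab'), set('ad'), set('c')]
print({''.join(sorted(X)): s for X, s in apriori(db, 2).items()})
```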
The pattern explosion
▪ high thresholds: few, but well-known patterns
▪ low thresholds: a gazillion patterns
Many patterns are redundant. The results are also unstable:
▪ a small data change, yet different results
▪ even when the distribution did not really change
(Example: the Wine dataset, with just 178 rows and 14 columns.)
Why not just report only patterns for which there is no extension that is frequent? These patterns are called maximally frequent.
[The example lattice, with the maximal frequent itemsets highlighted: abc and abd.]
(Bayardo, 1998)
Why throw away so much information? If we keep all X that cannot be extended without sup(X) dropping, all frequent itemsets and their frequencies can be reconstructed without loss! These are called closed frequent itemsets.
[The example lattice, with the closed frequent itemsets highlighted: a, c, ab, ad, abc, and abd.]
(Pasquier, 1999)
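Continuing the sketch from before: given the `apriori` output and `db` from above, the maximal and closed frequent itemsets can be filtered out directly. A minimal illustration:

```python
freq = apriori(db, minsup=2)   # itemset -> support, via the sketch above

# maximal frequent: no frequent proper superset
maximal = {X for X in freq if not any(X < Y for Y in freq)}

# closed frequent: no proper superset with the same support
closed = {X for X in freq if not any(X < Y and freq[Y] == freq[X] for Y in freq)}

print(sorted(''.join(sorted(X)) for X in maximal))  # ['abc', 'abd']
print(sorted(''.join(sorted(X)) for X in closed))   # ['a', 'ab', 'abc', 'abd', 'ad', 'c']
```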
Through inclusion/exclusion, we can often derive the support of an itemset exactly. For example, as sup(ab) = sup(b) = 4, we know that a and b always co-occur: every transaction containing b also contains a. Then, knowing that sup(bd) = 2, we can derive sup(abd) = 2.
[The example lattice, with the non-derivable itemsets highlighted.]
(Calders & Goethals, 2003)
Itemsets whose support can be derived exactly from their subsets are redundant; we only need to report the non-derivable ones.
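As an illustration of the idea, the following sketch computes the inclusion-exclusion bounds on an itemset's support from the supports of its proper subsets; the `sup` table holds the toy-data supports, and the bound rule follows Calders & Goethals:

```python
from itertools import combinations

# supports of the proper subsets of abd in the toy data (n = 6)
sup = {'': 6, 'a': 5, 'b': 4, 'd': 3, 'ab': 4, 'ad': 3, 'bd': 2}

def ie_bounds(X):
    """Inclusion-exclusion bounds on sup(X) from its proper subsets."""
    lo, hi = 0, sup['']
    for k in range(len(X)):                       # every proper subset Y of X
        for Y in combinations(X, k):
            s = sum((-1) ** (len(X) - r + 1) * sup[''.join(Z)]
                    for r in range(k, len(X))     # all Z with Y <= Z < X
                    for Z in combinations(X, r) if set(Y) <= set(Z))
            if (len(X) - k) % 2:                  # |X \ Y| odd: upper bound
                hi = min(hi, s)
            else:                                 # |X \ Y| even: lower bound
                lo = max(lo, s)
    return lo, hi

print(ie_bounds('abd'))  # (2, 2): sup(abd) is fully derivable
```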
Who cares that we can reconstruct all frequencies exactly? Why not allow a little bit of slack and zap more patterns? That is the main idea of margin-closed frequent itemsets.
[The example lattice, with the margin-closed frequent itemsets highlighted.]
(Moerchen et al, 2011)
Why is a frequent pattern X interesting? Because it identifies an association.
Many people buy both of these products. What's going on? Many patients have active genes A, B and C. What's going on? Many molecules share this substructure. What's going on? Okay... but does higher frequency mean more interesting?
Frequency alone is deceiving, and leads to redundant results. Say that very many people buy one staple product. Then all 'real' patterns can be extended with that product, and we will likely find that the extension is frequent as well.
Not unless its support deviates strongly from our expectation.
What do we expect? How do we model this? How can we measure whether expectation and reality are different enough? Let's start simple: let's assume all items are independent.
Under the assumption that all items i ∈ I are independent, the expected frequency of an itemset X is simply
ind(X) = ∏_{y ∈ X} fr(y)
where we write fr(y) = sup(y) / |D| for the frequency, i.e. the relative support, of an item y ∈ I in our database. Item frequencies can easily be extracted from data, and can reasonably be expected to be known by your domain expert.
We want to identify patterns whose frequency in the data deviates strongly from our expectation. One way to measure this deviation is lift:
lift(X) = fr(X) / ind(X)
Patterns with a lift higher than 1 are more frequent than expected under independence.
In our data/lattice example, lift(AB) = 1.2, and also lift(ABD) ≈ 1.2:
(IBM, 1996)
fr(AB) = 4/6 ≈ 0.67
ind(AB) = 5/6 · 4/6 ≈ 0.56
lift(AB) = 0.67 / 0.56 = 1.2
fr(ABD) = 2/6 ≈ 0.33
ind(ABD) = 5/6 · 4/6 · 3/6 ≈ 0.28
lift(ABD) = 0.33 / 0.28 ≈ 1.2
That is, according to lift, the extension ABD is just as interesting as AB.
[The same toy 0/1 data, with the items now labelled A, B, C, D.]
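A minimal sketch of lift under the independence model, reusing the toy transactions (with upper-case item labels):

```python
db = [set('ABCD'), set('ABD'), set('ABC'), set('AB'), set('AD'), set('C')]
n = len(db)

def fr(X):
    """Relative support of itemset X."""
    return sum(1 for t in db if set(X) <= t) / n

def lift(X):
    """Observed frequency over expected frequency under item independence."""
    ind = 1.0
    for item in X:
        ind *= fr(item)
    return fr(X) / ind

print(lift('AB'), lift('ABD'))  # ~1.2 and ~1.2: the extension scores just as high
```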
Lift is ad hoc. It strongly over-estimates, or under-estimates, how surprising the frequency of a pattern is. It is a bad interestingness measure.
Somewhat more formally: lift is ad hoc because it compares scores directly; it does not consider how likely those scores are, and does not use a proper statistical test to determine how significant the deviation is.
The probability of a random transaction
Assume our dataset contains n transactions, and let S be a random variable that states how many transactions support X. Then P(S = s) is the probability that the support of X is s, and is given by the binomial distribution with p = ind(X):
P(S = s) = C(n, s) p^s (1 - p)^(n - s)
We can now calculate how likely it is to observe a support of sup(X) or higher, and decide whether the p-value P(S ≥ sup(X)) is significant (e.g. < 0.05).
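A minimal sketch of this tail probability, using the toy numbers for ABD (support 2 out of n = 6, with p = ind(ABD)):

```python
from math import comb

def pvalue(n, s, p):
    """P(S >= s) for S ~ Binomial(n, p): chance of seeing support s or higher."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(s, n + 1))

# toy example: sup(ABD) = 2 out of n = 6, expected p = ind(ABD) = 5/6 * 4/6 * 3/6
print(pvalue(6, 2, 5/6 * 4/6 * 3/6))  # ~0.53: far from significant
```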
While we're in the business of unrealistic assumptions, say we have a good way to calculate a p-value for X. What can we do with it? There are two main approaches:
1) mine all patterns up to a certain threshold
2) mine the top-k most surprising patterns
Under the independence assumption, we compare p(ABDC) against p(A) p(B) p(D) p(C), and hence any deviation from total independence is gauged as interesting: ABD gets a high lift, but so will any extension of it! In other words, all supersets of a pattern are also scored highly. Why? How can we avoid this? Which ones should we report?
Webb proposed that we should report only those patterns whose frequency is surprising with regard to all its partitions:
p(ABD) vs p(A) p(B) p(D), p(AB) p(D), p(AD) p(B), and p(A) p(BD)
Sounds like a good idea! But how many partitions are there? And how do we test for surprisingness? When we consider a 2-partition, we can use Fisher's exact test. Webb tests against the partition X = Y ∪ Z s.t. Y ∩ Z = ∅ and fr(Y) · fr(Z) is closest to fr(X).
(Webb, 2010β¦)
Let's test fr(ABD) against fr(AB) · fr(D). Fisher's exact test gives
p = ((a+b)! (c+d)! (a+c)! (b+d)!) / (n! a! b! c! d!)
We get p(AB, D) = 0.6, meaning that ABD is not interesting. Yay!
(Fisher 1922; Webb 2010; Hämäläinen 2012; Webb & Vreeken 2014)
        AB   ¬AB
 D       2    1    3
 ¬D      2    1    3
         4    2    6

        X₁   ¬X₁
 X₂      a    b    a+b
 ¬X₂     c    d    c+d
        a+c  b+d   n = a+b+c+d
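A minimal sketch of the table probability behind Fisher's exact test, applied to the AB-versus-D table above; note that a full test also sums the probabilities of all tables at least as extreme:

```python
from math import comb

def table_prob(a, b, c, d):
    """Probability of one 2x2 table with fixed margins (hypergeometric)."""
    return comb(a + b, a) * comb(c + d, c) / comb(a + b + c + d, a + c)

# AB vs D in the toy data: a = AB&D, b = !AB&D, c = AB&!D, d = !AB&!D
print(table_prob(2, 1, 2, 1))  # 0.6
```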
Although the math and stats get more and more complicated, the models and tests we have seen so far are really quite straightforward. How can we infuse more background knowledge? We can test against
▪ Bayesian networks (Jaroszewicz et al. 2004),
▪ Maximum Entropy models (Wang et al. 2006; Mampaey et al. 2012; …)
This (probably) goes too deep for today; let's reconsider it in a few weeks.
Only considering how often a pattern occurs biases us towards patterns of low cardinality. Instead, we can consider how much of the data a pattern covers. That is, a pattern T, a tile, now consists of a row-set and a column-set, and is regarded as more interesting the larger its area,
area(T) = |rows(T)| × |cols(T)|
(Geerts et al 2004)
[Example: a tile in a genes × conditions expression matrix.]
Sadly, area is not (anti-)monotonic: extending the column-set of a given tile T may result in either an increase or a decrease of area. How can we mine tiles efficiently? Through depth-first search, using branch-and-bound: if you keep the row-set maximal, you can keep track of the conditional support of every not-yet-included item; assuming maximal correlation then yields an upper bound.
A big pile of tiles is as bad as a big pile of frequent itemsets: way too many, way too redundant. Instead, we can also ask to find a set of tiles that together cover as many of the 1s in the data as possible. This means we are doing set cover, which is well known to be NP-hard, but for which the greedy algorithm is the best possible polynomial-time approximation algorithm (see the sketch below).
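A minimal sketch of that greedy set-cover step; the candidate tiles and the toy matrix are illustrative only, as a real miner would generate candidates from the data:

```python
# toy 0/1 matrix (rows: transactions, columns: items a, b, c, d)
M = [[1,1,1,1],
     [1,1,0,1],
     [1,1,1,0],
     [1,1,0,0],
     [1,0,0,1],
     [0,0,1,0]]

def ones(tile):
    """The cells of the tile that are 1 in the data."""
    rows, cols = tile
    return {(i, j) for i in rows for j in cols if M[i][j] == 1}

def greedy_tiling(candidates, k):
    """Greedily pick up to k tiles covering the most uncovered 1s."""
    covered, chosen = set(), []
    for _ in range(k):
        best = max(candidates, key=lambda t: len(ones(t) - covered))
        if not ones(best) - covered:     # no tile adds new 1s
            break
        covered |= ones(best)
        chosen.append(best)
    return chosen

tiles = [((0,1,2,3), (0,1)), ((0,1), (0,1,3)), ((0,2), (2,)), ((0,4), (0,3))]
print(greedy_tiling(tiles, 3))
```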
Let fr(T) for a tile T be the relative number of 1s in the tile,
fr(T) = ( Σ_{i ∈ rows(T)} Σ_{j ∈ cols(T)} D(i, j) ) / ( |rows(T)| × |cols(T)| )
For fr(T) = 1 or 0 we say the tile is exact. Otherwise, it is noisy.
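A minimal sketch computing area and fr for a tile of the toy matrix; `tile_stats` is a hypothetical helper, not from the slides:

```python
import numpy as np

# the toy 0/1 matrix (rows: transactions, columns: a, b, c, d)
D = np.array([[1,1,1,1],
              [1,1,0,1],
              [1,1,1,0],
              [1,1,0,0],
              [1,0,0,1],
              [0,0,1,0]])

def tile_stats(rows, cols):
    """Area and relative number of 1s of the tile spanned by rows x cols."""
    sub = D[np.ix_(rows, cols)]
    return sub.size, sub.mean()   # fr == 1.0 means an exact tile

print(tile_stats([0, 1, 2, 3], [0, 1]))  # (8, 1.0): the tile 'ab' is exact
```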
(Tatti & Vreeken 2011)
Boolean Matrix Factorisation (BMF) aims to find a low-rank decomposition of a given binary matrix A into binary row and column factor matrices B and C, such that A ≈ B ∘ C, where ∘ is the Boolean matrix product, i.e. using the Boolean sum 0+0 = 0, 0+1 = 1, 1+1 = 1, and we want to minimise the error between A and B ∘ C. When restricted to exact tiles (factors), BMF and tiling coincide.
(Miettinen et al 2006, β¦)
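A minimal sketch of the Boolean product and its reconstruction error, with hypothetical rank-2 factors that correspond to two exact tiles of the toy matrix:

```python
import numpy as np

A = np.array([[1,1,1,1],     # the toy 0/1 matrix again
              [1,1,0,1],
              [1,1,1,0],
              [1,1,0,0],
              [1,0,0,1],
              [0,0,1,0]])

# hypothetical rank-2 factors: each column of B paired with the matching
# row of C is one exact tile
B = np.array([[1,1],[1,0],[1,1],[1,0],[0,0],[0,1]])
C = np.array([[1,1,0,0],     # tile over items a, b
              [0,0,1,0]])    # tile over item c

A_hat = (B @ C) > 0          # Boolean product: 1 + 1 = 1
print(int(np.sum(A_hat != A)))  # reconstruction error: 4 mismatching cells
```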
Patterns are a powerful concept that can give a lot of insight into how your data is locally distributed. Monotonic constraints allow for efficient mining
▪ levelwise search always works; more elegant algorithms exist
▪ works equally well for other pattern types: itemsets, sequences, trees, streams, low-entropy sets
Measuring interestingness is inherently difficult
▪ frequency alone is a bad measure
▪ independence models are too weak
▪ stronger models are computationally expensive