

SLIDE 1

Interesting Patterns

Jilles Vreeken

15 May 2015

SLIDE 2

Questions of the Day

What is interestingness? What is a pattern? And how can we mine interesting patterns?

SLIDE 3

What is a pattern?

[Figure: a set of data points, and the pattern fit to them: y = x βˆ’ 1]

SLIDE 4

What is a pattern?

Recurring structure: Data β†’ Pattern

SLIDE 5

Pattern mining, formally

For a database db,
β€’ a pattern language L, and a set of constraints C,
the goal is to find the set of patterns F βŠ† L such that
β€’ each p ∈ F satisfies each c ∈ C on db, and F is maximal.

That is, find all patterns that satisfy the constraints.

SLIDE 6

Frequent Pattern Mining

Suppose a supermarket,
β€’ which sells items I, and
β€’ logs every transaction t βŠ† I in a database db.

An interesting question to ask is,
β€˜What products are often sold together?’

β€’ pattern language: all possible sets of items, L = P(I)
β€’ pattern: an itemset, X βŠ† I, X ∈ L

SLIDE 7

Frequent Itemsets

[Pictured: an example itemset of supermarket products, with supp(Β·) = 3]

SLIDE 8

Frequent Conjunctive Formulas

[Excerpt of the Iris dataset: sepal length, sepal width, petal length, petal width, class – rows of Iris-setosa and Iris-versicolor]

Petal length <= 2.0 and Petal width <= 0.5

SLIDE 9

Frequent Subgraphs

SLIDE 10

The Frequent Pattern Problem

The task is to find all frequent patterns.

β€˜how often is X sold’ ↔ supp_db(X) = |{ t ∈ db | X βŠ† t }|
β€’ the number of transactions in db that β€˜support’ the pattern

β€˜often enough’ ↔ supp_db(X) β‰₯ minsup
β€’ a support of at least the minimal-support threshold

So, the problem is to find all X ∈ L with supp_db(X) β‰₯ minsup.

β€’ how can we do this?

SLIDE 11

Monotonicity

The number of possible patterns is exponential, and hence exhaustive search is not a feasible option. However, in 1994 it was discovered that support exhibits monotonicity. That is, for two itemsets X and Z, we know

X βŠ‚ Z β†’ supp(X) β‰₯ supp(Z)

This is known as the A Priori property. It allows efficient search for frequent itemsets over the lattice of all itemsets.

SLIDE 12

The Itemset Lattice

a b c d
1 1 1 1
1 1 1 Β·
1 1 Β· 1
1 1 Β· 1
1 Β· Β· Β·
Β· Β· 1 Β·

data

abcd (1)
abc (2)   abd (3)   acd (1)   bcd (1)
ab (4)   ac (2)   ad (3)   bc (2)   bd (3)   cd (1)
a (5)   b (4)   c (3)   d (3)
βˆ… (6)

itemset lattice

SLIDE 13

The Itemset Lattice

[data and itemset lattice as on slide 12, with the frequent itemsets (supp β‰₯ minsup) highlighted]

SLIDE 14

Levelwise search

1. F_1 = { i ∈ I | supp(i) β‰₯ minsup }
2. while F_l is not empty
3.   C_{l+1} = { X ∈ L : |X| = l+1, and Z ∈ F_l for every Z βŠ‚ X with |Z| = l }
4.   F_{l+1} = { X ∈ C_{l+1} : supp(X) β‰₯ minsup }
5. return F_1 βˆͺ F_2 βˆͺ β‹―

The A Priori algorithm can be applied to mine patterns for any enumerable pattern language L and any monotonic constraint c. Many algorithms exist that are more efficient, but none so versatile.
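For concreteness, a runnable Python sketch of the levelwise algorithm above – a plain re-implementation for illustration, not the original A Priori code:

from itertools import combinations

def apriori(db, minsup):
    # levelwise (A Priori) search; returns {frequent itemset: absolute support}
    def supp(X):
        return sum(1 for t in db if X <= t)
    items = {i for t in db for i in t}
    F = {frozenset([i]) for i in items if supp(frozenset([i])) >= minsup}
    result = {X: supp(X) for X in F}
    l = 1
    while F:
        # candidate generation: join level-l sets, keep only (l+1)-sets
        # all of whose l-subsets are frequent (the A Priori property)
        candidates = {X | Y for X in F for Y in F if len(X | Y) == l + 1}
        candidates = {Z for Z in candidates
                      if all(frozenset(S) in F for S in combinations(Z, l))}
        F = {Z for Z in candidates if supp(Z) >= minsup}  # support pruning
        result.update({Z: supp(Z) for Z in F})
        l += 1
    return result

db = [set('abcd'), set('abc'), set('abd'), set('abd'), set('a'), set('c')]
print(apriori(db, minsup=2))

On the toy database this recovers exactly the frequent part of the lattice: cd (and everything containing it) is pruned without ever being counted.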

SLIDE 15

Problems in pattern paradise

The pattern explosion
β€’ high thresholds: few, but well-known patterns
β€’ low thresholds: a gazillion patterns

Many patterns are redundant

Unstable
β€’ small data change, yet different results
β€’ even when the distribution did not really change

SLIDE 16

The Wine Explosion

The Wine dataset has 178 rows and 14 columns.

SLIDE 17

To the Max!

Why not just report only those patterns for which no extension is frequent? These patterns are called maximally frequent.

[itemset lattice as on slide 12, with the maximal frequent itemsets highlighted]

(Bayardo, 1998)

SLIDE 18

Closure!

Why throw away so much information? If we keep all X that cannot be extended without supp(X) dropping, all frequent itemsets and their frequencies can be reconstructed without loss! These are called closed frequent itemsets.

[itemset lattice as on slide 12, with the closed frequent itemsets highlighted]

(Pasquier et al., 1999)
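Both maximal and closed itemsets are easy to extract from the full collection of frequent itemsets; a small sketch that filters the output of the apriori sketch above:

def closed_and_maximal(freq):
    # freq: {itemset: support}, e.g. as returned by apriori above
    closed, maximal = set(), set()
    for X, s in freq.items():
        supersets = [Y for Y in freq if X < Y]
        if all(freq[Y] < s for Y in supersets):
            closed.add(X)      # no frequent superset has the same support
        if not supersets:
            maximal.add(X)     # no superset is frequent at all
    return closed, maximal

Note that every maximal itemset is also closed, which the sketch makes easy to verify on the toy data.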

SLIDE 19

Non-Derivable Patterns

Through inclusion/exclusion, we can derive bounds on the support of an itemset from the supports of its subsets; when the bounds coincide, the itemset is derivable. For example, if supp(cd) = supp(d) = 2, we know c and d always co-occur. Then, knowing that supp(bd) = 2, we can derive supp(bcd) = 2.

[itemset lattice as on slide 12, with the non-derivable itemsets highlighted]

(Calders & Goethals, 2003)

SLIDE 20

Margin-Closed

Who cares that we can reconstruct all frequencies exactly? Why not allow a little bit of slack and zap even more patterns? That is the main idea of margin-closed frequent itemsets.

[itemset lattice as on slide 12, with the margin-closed frequent itemsets highlighted]

(Moerchen et al, 2011)

SLIDE 21

Associations

Why is a frequent pattern X interesting? Because it identifies associations between the elements of X.

Many people buy both [pictured product] and [pictured product]. What’s going on?
Many patients have active genes A, B and C. What’s going on?
Many molecules share this structure. What’s going on?

Okay… but does higher frequency mean more interesting?

SLIDE 22

Expectation

Frequency alone is deceiving, and leads to redundant results. Say that many, many people buy [a very popular product]. Then every β€˜real’ pattern can be extended with it, and we will likely find that the extended pattern is also frequent. Do we want it to be reported?

Not unless its support deviates strongly from our expectation.

SLIDE 23

What did you expect?

What do we expect? How do we model this? How can we measure whether expectation and reality differ enough? Let’s start simple. Let’s assume all items are independent.

SLIDE 24

Independence!

Under the assumption that all items i ∈ I are independent, the expected frequency of an itemset X is simply

ind(X) = ∏_{x ∈ X} fr(x)

where we write

fr(x) = supp(x) / |db|

for the frequency – the relative support – of an item x ∈ I in our database. Item frequencies can easily be extracted from the data, and can also reasonably be expected to be known by your domain expert.
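A direct Python transcription of these two quantities, assuming the list-of-sets database representation used in the earlier sketches:

from math import prod

def fr(X, db):
    # relative support (frequency) of itemset X in db
    return sum(1 for t in db if X <= t) / len(db)

def ind(X, db):
    # expected frequency of X if all items were independent
    return prod(fr({x}, db) for x in X)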

SLIDE 25

Bro, do you even lift?

We want to identify patterns whose frequency in the data deviates strongly from our expectation. One way to measure this deviation is lift,

lift(X) = fr(X) / ind(X)

Patterns with a lift higher than 1 are more frequent than expected; those with a lift lower than 1 are less frequent.

In our data/lattice example, lift(AB) = 1.2 and lift(ABD) = 1.8.

(IBM, 1996)

SLIDE 26

Example: Lift

fr(AB) = 4/6 = 0.66
ind(AB) = 5/6 Γ— 4/6 = 0.55
lift(AB) = 0.66 / 0.55 = 1.2

fr(ABD) = 3/6 = 0.50
ind(ABD) = 5/6 Γ— 4/6 Γ— 3/6 = 0.28
lift(ABD) = 0.50 / 0.28 = 1.8

That is, according to lift, ABD is more interesting than AB.

A B C D
1 1 1 1
1 1 1 Β·
1 1 Β· 1
1 1 Β· 1
1 Β· Β· Β·
Β· Β· 1 Β·
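Reusing fr and ind from the sketch above, lift reproduces the two values just computed:

def lift(X, db):
    # ratio of observed to expected frequency
    return fr(X, db) / ind(X, db)

db = [set('abcd'), set('abc'), set('abd'), set('abd'), set('a'), set('c')]
print(round(lift({'a', 'b'}, db), 2))        # 1.2
print(round(lift({'a', 'b', 'd'}, db), 2))   # 1.8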

SLIDE 27

Lift

Lift is ad hoc.

Lift strongly over-estimates, or under-estimates, how surprising the frequency of a pattern is. It is a bad interestingness measure.

Somewhat more formally: lift is ad hoc because it compares scores directly, it does not consider how likely scores are, and it does not use a proper statistical test to determine how significant the deviation is.

SLIDE 28

Better Lift

The probability of a random transaction to support X is ind(X).

Assume our dataset contains n transactions, and let S be a random variable stating how many transactions support X. Then P(S = k) is the probability that the support of X is k, and is given by the binomial distribution, with p = ind(X),

P(S = k) = C(n, k) p^k (1 βˆ’ p)^(nβˆ’k)

We can now calculate how likely it is to observe a support of supp(X) or higher, and decide whether the p-value P(S β‰₯ supp(X)) is significant (e.g. < 0.05).
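A sketch of this binomial test, assuming SciPy is available; the independence probability is computed inline as on slide 24:

from math import prod
from scipy.stats import binom

def binomial_pvalue(X, db):
    # P(support of X >= the observed support) under the independence model
    n = len(db)
    observed = sum(1 for t in db if X <= t)
    p = prod(sum(1 for t in db if x in t) / n for x in X)   # ind(X)
    # binom.sf(k, n, p) = P(S > k), so sf(observed - 1) = P(S >= observed)
    return binom.sf(observed - 1, n, p)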

SLIDE 29

Aside: using Surprisingness

While we’re in the business of unrealistic assumptions, say we have a good way to calculate a p-value for X. What can we do with it? There are two main approaches:

1) mine all patterns up to a certain threshold
2) mine the top-k most surprising patterns

SLIDE 30

Too much of a good thing

Under the independence assumption, we compare

P(ABCD) ↔ P(A) P(B) P(C) P(D)

and hence any deviation from total independence is gauged as interesting. Say that ABD is the true pattern; then it will have a high lift, but so will any extension of it! In other words, all supersets of a pattern are also scored highly.

Why? How can we avoid this? Which ones should we report?

SLIDE 31

Partitions

Webb proposed that we should report only those patterns whose frequency is surprising with regard to all their partitions,

P(ABD) ↔ P(A) P(B) P(D) ↔ P(AB) P(D) ↔ P(AD) P(B) ↔ P(A) P(BD)

Sounds like a good idea! But how many partitions are there? And how do we test for surprisingness?

For a 2-partition we can use Fisher’s exact test. Webb tests against the partition X = Z βˆͺ W, with Z ∩ W = βˆ…, for which fr(Z) fr(W) is closest to fr(X).

(Webb, 2010…)

SLIDE 32

Applying Fisher’s Exact Test

Let’s test fr(ABD) against fr(AB) fr(D). For a 2Γ—2 contingency table with cells a, b, c, d and n = a + b + c + d, the probability of the observed table given its margins is

p = ( (a+b)! (c+d)! (a+c)! (b+d)! ) / ( a! b! c! d! n! )

         AB   ¬AB
  D       2     1   |  3
 ¬D       2     1   |  3
          4     2   |  6

         X₁   ¬X₁
  Xβ‚‚      a     b   |  a+b
 ¬Xβ‚‚      c     d   |  c+d
        a+c   b+d   |  n

We get p(AB, D) = 0.6, meaning that ABD is not interesting. Yay!

(Fisher 1922; Webb 2010; HΓ€mΓ€lΓ€inen 2012; Webb & Vreeken 2014)
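A SciPy sketch of this computation. Note that hypergeom.pmf reproduces the point probability the slide’s formula computes (0.6), whereas the standard one-sided Fisher exact test returns the tail probability, which is larger:

from scipy.stats import fisher_exact, hypergeom

# contingency table from the slide: rows D / not-D, columns AB / not-AB
table = [[2, 1],
         [2, 1]]

# probability of exactly this table given its margins (the slide's formula):
# population of 6 transactions, 4 contain AB, we draw the 3 that contain D
print(hypergeom.pmf(2, 6, 4, 3))                        # 0.6

# one-sided Fisher exact test p-value (tail probability):
print(fisher_exact(table, alternative='greater')[1])    # 0.8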

SLIDE 33

More Elaborate Models

Although the math and stats get more and more complicated, the models and tests we saw so far really are straightforward. How can we infuse more background knowledge? We can test against

 Bayesian Networks (Jaroszwicz et al. 2004),  Maximum Entropy models (Wang et al 2006, Mampaey et al 2012, …)

Goes (probably) too deep for today. Let’s re-consider this in a few weeks.

SLIDE 34

Tiles

Only considering how often something occurs biases us towards patterns with low cardinality. Instead, we can consider how much of the data a pattern covers. That is, a pattern X now consists of a row-set and a column-set, and is regarded as more interesting the larger its area,

area(X) = |rows(X)| Γ— |cols(X)|

(Geerts et al. 2004)

SLIDE 35

Patterns: Large Tiles

[Figure: large tiles in a gene-expression matrix; rows are genes, columns are conditions]

SLIDE 36

Example: Tiles

SLIDE 37

Mining Large Tiles

Sadly, area is not (anti-)monotonic: extending the column-set of a given tile X may result in either an increase or a decrease of its area. How can we mine large tiles efficiently? Through depth-first search, using branch-and-bound. If you keep the row-set maximal, you can keep track of the conditional support of every not-yet-included item; assuming maximal correlation, this yields an upper bound on the attainable area.

SLIDE 38

Mining Tilings

A big pile of tiles is as bad as a big pile of frequent itemsets: way too many, way too redundant. Instead, we can ask for a set of tiles that together cover as many of the 1s in the data as possible. This means we are doing set cover, which is well known to be NP-hard, but for which the greedy algorithm is the best possible polynomial-time approximation algorithm.
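A minimal greedy sketch, under the assumption that candidate tiles are given as (row-set, column-set) pairs and that the data is represented as the set of coordinates of its 1s:

def greedy_tiling(tiles, ones, k):
    # pick up to k tiles, each time the one covering the most uncovered 1s
    covered, chosen = set(), []
    for _ in range(k):
        def gain(tile):
            rows, cols = tile
            return len({(r, c) for r in rows for c in cols} & ones - covered)
        best = max(tiles, key=gain)
        if gain(best) == 0:
            break
        rows, cols = best
        covered |= {(r, c) for r in rows for c in cols} & ones
        chosen.append(best)
    return chosen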

SLIDE 39

Exact and Noisy Tiles

Let 𝑔𝑑(π‘Œ) for a tile π‘Œ be the relative number of 1s in the tile, 𝑑𝑛𝑏𝑑 π‘Œ =

  • 𝐽(𝐡𝑗,π‘˜ = 1)

π‘˜βˆˆπ‘ π‘ π‘ π‘‘(π‘Œ) π‘—βˆˆπ‘‘π‘ π‘‘π‘‘(π‘Œ)

𝑔𝑑 π‘Œ = 𝑑𝑛𝑏𝑑(π‘Œ) π‘‘π‘‘π‘šπ‘‘ π‘Œ Γ— 𝑑𝑑𝑠𝑑 π‘Œ For 𝑔𝑑 π‘Œ = 1 or 0 we say the tile is exact ct. . Otherwise, it is noisy. y.

(Tatti & Vreeken 2011)
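A direct transcription of tile density; the matrix and tile below are made-up toy values:

def tile_fr(rows, cols, A):
    # relative number of 1s inside the tile (rows x cols) of binary matrix A
    ones = sum(A[i][j] for i in rows for j in cols)
    return ones / (len(rows) * len(cols))

A = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]
print(tile_fr([0, 1], [0, 1], A))        # 1.0  -> an exact tile
print(tile_fr([0, 1, 2], [0, 1, 2], A))  # ~0.78 -> a noisy tile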

SLIDE 40

Noisy Tiles

Boolean Matrix Factorisation (BMF) aims to find a low-rank decomposition of a given binary matrix A into row and column factor matrices B and C, such that

A β‰ˆ B ∘ C

where ∘ is the Boolean matrix product, i.e. 0+0 = 0, 0+1 = 1, 1+1 = 1, and we want to minimise the error between A and B ∘ C.

When restricted to exact tiles (factors), BMF and Tiling are equivalent. BMF, however, is more general, as it allows for errors.

(Miettinen et al 2006, …)
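A minimal sketch of the Boolean product and the resulting reconstruction error – a naive triple-loop implementation, for illustration only:

def boolean_product(B, C):
    # (B ∘ C)[i][j] = OR over k of (B[i][k] AND C[k][j])
    return [[int(any(B[i][k] and C[k][j] for k in range(len(C))))
             for j in range(len(C[0]))] for i in range(len(B))]

def bmf_error(A, B, C):
    # number of cells in which A and B ∘ C disagree
    P = boolean_product(B, C)
    return sum(A[i][j] != P[i][j]
               for i in range(len(A)) for j in range(len(A[0])))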

SLIDE 41

Conclusions

Patterns are a powerful concept that can give a lot of insight into how your data is locally distributed.

Monotonic constraints allow for efficient mining
β€’ levelwise search always works – more elegant algorithms exist
β€’ works equally well for other pattern types: itemsets, sequences, trees, streams, low-entropy sets

Measuring interestingness is inherently difficult
β€’ frequency alone is a bad measure
β€’ independence models are too weak
β€’ stronger models are computationally expensive

SLIDE 42


Thank you!