Interesting Patterns
Jilles Vreeken
15 May 2015
Questions of the Day
What is interestingness? What is a pattern? And how can we mine interesting patterns?

What is a pattern?
Data → Pattern, e.g. y = x - 1
A pattern is recurring structure in the data.
For a database D,
▪ a pattern language 𝒫 and a set of constraints C,
the goal is to find the set of patterns S ⊆ 𝒫 such that
▪ each p ∈ S satisfies each c ∈ C on D, and S is maximal
That is, find all patterns that satisfy the constraints.
Suppose a supermarket,
▪ which sells items, I, and
▪ logs every transaction t ⊆ I in a database D.
An interesting question to ask is,
"What products are often sold together?"
▪ pattern language: all possible sets of items, 𝒫 = 2^I
▪ pattern: an itemset, X ⊆ I, X ∈ 𝒫
[Slide: rows of the Iris dataset (sepal length, sepal width, petal length, petal width, class) for Iris-setosa and Iris-versicolor.]
An example pattern: Petal length >= 2.0 and Petal width <= 0.5
The task is to find all frequent patterns.
"how often is X sold" → sup_D(X) = |{t ∈ D | X ⊆ t}|
▪ the number of transactions in D that "support" the pattern
"often enough" → sup_D(X) ≥ minsup
▪ have a support no lower than the minimal-support threshold
So, the problem is to find all X ⊆ I with sup_D(X) ≥ minsup.
▪ how can we do this?
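Before answering that, here is what support counting itself looks like; a minimal Python sketch, with hypothetical supermarket transactions:

```python
# a minimal sketch; the transactions are hypothetical
db = [{'bread', 'milk'}, {'bread', 'butter'},
      {'bread', 'milk', 'butter'}, {'milk'}, {'bread', 'milk'}]

def sup(X):
    """sup_D(X): the number of transactions that contain every item of X."""
    return sum(1 for t in db if X <= t)

print(sup({'bread', 'milk'}))  # 3
```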
The number of possible patterns is exponential, and hence exhaustive search is not a feasible option. However, in 1994 it was discovered that support exhibits monotonicity:
X ⊆ Y ⇒ sup(X) ≥ sup(Y)
This is known as the A Priori property. It allows efficient search for frequent itemsets over the lattice of all itemsets.
data
  a b c d
  1 1 1 1
  1 1 0 1
  1 1 1 0
  1 1 0 0
  1 0 0 1
  0 0 1 0

itemset lattice, with supports:
abcd (1)
abc (2), abd (2), acd (1), bcd (1)
ab (4), ac (2), ad (3), bc (2), bd (2), cd (1)
a (5), b (4), c (3), d (3)
∅ (6)
[The same data and lattice, now with the frequent itemsets highlighted: for minsup = 2, every itemset except abcd, acd, bcd, and cd is frequent.]
The A Priori algorithm:
1. F₁ = {i ∈ I | sup(i) ≥ minsup}
2. while Fₖ not empty
3.   Cₖ₊₁ = {X ⊆ I : |X| = k+1, and Y ∈ Fₖ for every Y ⊂ X with |Y| = k}
4.   Fₖ₊₁ = {X ∈ Cₖ₊₁ | sup(X) ≥ minsup}
5. return F₁ ∪ F₂ ∪ …
The A Priori algorithm can be applied to mine patterns from any enumerable pattern language 𝒫, under any monotonic constraint. Many algorithms exist that are more efficient, but none so versatile.
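To make the levelwise search concrete, here is a minimal Python sketch of A Priori (an illustration, not an optimised implementation), run on the toy database of the lattice example:

```python
from itertools import combinations

def apriori(db, minsup):
    """Levelwise A Priori over a list of transactions (sets of items)."""
    def sup(X):
        return sum(1 for t in db if X <= t)

    items = set().union(*db)
    # level 1: frequent single items
    F = [{frozenset([i]) for i in items if sup(frozenset([i])) >= minsup}]
    while F[-1]:
        # generate (k+1)-candidates, prune those with an infrequent k-subset
        cand = {X | Y for X in F[-1] for Y in F[-1] if len(X | Y) == len(X) + 1}
        cand = {X for X in cand if all(X - {i} in F[-1] for i in X)}
        F.append({X for X in cand if sup(X) >= minsup})
    return {X: sup(X) for level in F for X in level}

# the toy database from the lattice example
db = [set('abcd'), set('abd'), set('abc'), set('ab'), set('ad'), set('c')]
print({''.join(sorted(X)): s for X, s in apriori(db, 2).items()})
```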
The pattern explosion
▪ high thresholds: few, but well-known patterns
▪ low thresholds: a gazillion patterns
Many patterns are redundant. The results are also unstable:
▪ a small data change, yet different results
▪ even when the distribution did not really change
(Example: the Wine dataset, with just 178 rows and 14 columns.)
Why not just report only patterns for which there is no extension that is frequent? These patterns are called maximally frequent.
[The example lattice, with the maximal frequent itemsets highlighted: abc and abd.]
(Bayardo, 1998)
Why throw away so much information? If we keep all X that cannot be extended without sup(X) dropping, all frequent itemsets and their frequencies can be reconstructed without loss! These are called closed frequent itemsets.
[The example lattice, with the closed frequent itemsets highlighted: a, c, ab, ad, abc, and abd.]
(Pasquier, 1999)
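Continuing the sketch from before: given the `apriori` output and `db` from above, the maximal and closed frequent itemsets can be filtered out directly. A minimal illustration:

```python
freq = apriori(db, minsup=2)   # itemset -> support, via the sketch above

# maximal frequent: no frequent proper superset
maximal = {X for X in freq if not any(X < Y for Y in freq)}

# closed frequent: no proper superset with the same support
closed = {X for X in freq if not any(X < Y and freq[Y] == freq[X] for Y in freq)}

print(sorted(''.join(sorted(X)) for X in maximal))  # ['abc', 'abd']
print(sorted(''.join(sorted(X)) for X in closed))   # ['a', 'ab', 'abc', 'abd', 'ad', 'c']
```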
Through inclusion/exclusion, we can often derive the support of an itemset exactly. For example, as sup(ab) = sup(b) = 4, we know that a and b always co-occur: every transaction containing b also contains a. Then, knowing that sup(bd) = 2, we can derive sup(abd) = 2.
[The example lattice, with the non-derivable itemsets highlighted.]
(Calders & Goethals, 2003)
Itemsets whose support can be derived exactly from their subsets are redundant; we only need to report the non-derivable ones.
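As an illustration of the idea, the following sketch computes the inclusion-exclusion bounds on an itemset's support from the supports of its proper subsets; the `sup` table holds the toy-data supports, and the bound rule follows Calders & Goethals:

```python
from itertools import combinations

# supports of the proper subsets of abd in the toy data (n = 6)
sup = {'': 6, 'a': 5, 'b': 4, 'd': 3, 'ab': 4, 'ad': 3, 'bd': 2}

def ie_bounds(X):
    """Inclusion-exclusion bounds on sup(X) from its proper subsets."""
    lo, hi = 0, sup['']
    for k in range(len(X)):                       # every proper subset Y of X
        for Y in combinations(X, k):
            s = sum((-1) ** (len(X) - r + 1) * sup[''.join(Z)]
                    for r in range(k, len(X))     # all Z with Y <= Z < X
                    for Z in combinations(X, r) if set(Y) <= set(Z))
            if (len(X) - k) % 2:                  # |X \ Y| odd: upper bound
                hi = min(hi, s)
            else:                                 # |X \ Y| even: lower bound
                lo = max(lo, s)
    return lo, hi

print(ie_bounds('abd'))  # (2, 2): sup(abd) is fully derivable
```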
Who cares that we can reconstruct all frequencies exactly? Why not allow a little bit of slack and zap more patterns? That is the main idea of margin-closed frequent itemsets.
[The example lattice, with the margin-closed frequent itemsets highlighted.]
(Moerchen et al, 2011)
Why is a frequent pattern X interesting? Because it identifies an association.
Many people buy both of these products. What's going on? Many patients have active genes A, B and C. What's going on? Many molecules share this substructure. What's going on? Okay... but does higher frequency mean more interesting?
Frequency alone is deceiving, and leads to redundant results. Say that very many people buy one staple product. Then all 'real' patterns can be extended with that product, and we will likely find that the extension is frequent as well.
Not unless its support deviates strongly from our expectation.
What do we expect? How do we model this? How can we measure whether expectation and reality are different enough? Let's start simple: let's assume all items are independent.
Under the assumption that all items i ∈ I are independent, the expected frequency of an itemset X is simply
ind(X) = ∏_{y ∈ X} fr(y)
where we write fr(y) = sup(y) / |D| for the frequency, i.e. the relative support, of an item y ∈ I in our database. Item frequencies can easily be extracted from data, and can reasonably be expected to be known by your domain expert.
We want to identify patterns whose frequency in the data deviates strongly from our expectation. One way to measure this deviation is lift:
lift(X) = fr(X) / ind(X)
Patterns with a lift higher than 1 are more frequent than expected under independence.
In our data/lattice example, lift(AB) = 1.2, and also lift(ABD) ≈ 1.2:
(IBM, 1996)
fr(AB) = 4/6 ≈ 0.67
ind(AB) = 5/6 · 4/6 ≈ 0.56
lift(AB) = 0.67 / 0.56 = 1.2
fr(ABD) = 2/6 ≈ 0.33
ind(ABD) = 5/6 · 4/6 · 3/6 ≈ 0.28
lift(ABD) = 0.33 / 0.28 ≈ 1.2
That is, according to lift, the extension ABD is just as interesting as AB.
[The same toy 0/1 data, with the items now labelled A, B, C, D.]
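A minimal sketch of lift under the independence model, reusing the toy transactions (with upper-case item labels):

```python
db = [set('ABCD'), set('ABD'), set('ABC'), set('AB'), set('AD'), set('C')]
n = len(db)

def fr(X):
    """Relative support of itemset X."""
    return sum(1 for t in db if set(X) <= t) / n

def lift(X):
    """Observed frequency over expected frequency under item independence."""
    ind = 1.0
    for item in X:
        ind *= fr(item)
    return fr(X) / ind

print(lift('AB'), lift('ABD'))  # ~1.2 and ~1.2: the extension scores just as high
```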
Lift is ad hoc. It strongly over-estimates, or under-estimates, how surprising the frequency of a pattern is. It is a bad interestingness measure.
Somewhat more formally: lift is ad hoc because it compares scores directly; it does not consider how likely those scores are, and does not use a proper statistical test to determine how significant the deviation is.
The probability of a random transaction
Assume our dataset contains n transactions, and let S be a random variable that states how many transactions support X. Then P(S = s) is the probability that the support of X is s, and is given by the binomial distribution with p = ind(X):
P(S = s) = C(n, s) p^s (1 - p)^(n - s)
We can now calculate how likely it is to observe a support of sup(X) or higher, and decide whether the p-value P(S ≥ sup(X)) is significant (e.g. < 0.05).
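A minimal sketch of this tail probability, using the toy numbers for ABD (support 2 out of n = 6, with p = ind(ABD)):

```python
from math import comb

def pvalue(n, s, p):
    """P(S >= s) for S ~ Binomial(n, p): chance of seeing support s or higher."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(s, n + 1))

# toy example: sup(ABD) = 2 out of n = 6, expected p = ind(ABD) = 5/6 * 4/6 * 3/6
print(pvalue(6, 2, 5/6 * 4/6 * 3/6))  # ~0.53: far from significant
```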
While we're in the business of unrealistic assumptions, say we have a good way to calculate a p-value for X. What can we do with it? There are two main approaches:
1) mine all patterns up to a certain threshold
2) mine the top-k most surprising patterns
Under the independence assumption, we compare p(ABDC) against p(A) p(B) p(D) p(C), and hence any deviation from total independence is gauged as interesting: ABD gets a high lift, but so will any extension of it! In other words, all supersets of a pattern are also scored highly. Why? How can we avoid this? Which ones should we report?
Webb proposed that we should report only those patterns whose frequency is surprising with regard to all its partitions:
p(ABD) vs p(A) p(B) p(D), p(AB) p(D), p(AD) p(B), and p(A) p(BD)
Sounds like a good idea! But how many partitions are there? And how do we test for surprisingness? When we consider a 2-partition, we can use Fisher's exact test. Webb tests against the partition X = Y ∪ Z s.t. Y ∩ Z = ∅ and fr(Y) · fr(Z) is closest to fr(X).
(Webb, 2010β¦)
Let's test fr(ABD) against fr(AB) · fr(D). Fisher's exact test gives
p = ((a+b)! (c+d)! (a+c)! (b+d)!) / (n! a! b! c! d!)
We get p(AB, D) = 0.6, meaning that ABD is not interesting. Yay!
(Fisher 1922; Webb 2010; Hämäläinen 2012; Webb & Vreeken 2014)
        AB   ¬AB
 D       2    1    3
 ¬D      2    1    3
         4    2    6

        X₁   ¬X₁
 X₂      a    b    a+b
 ¬X₂     c    d    c+d
        a+c  b+d   n = a+b+c+d
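A minimal sketch of the table probability behind Fisher's exact test, applied to the AB-versus-D table above; note that a full test also sums the probabilities of all tables at least as extreme:

```python
from math import comb

def table_prob(a, b, c, d):
    """Probability of one 2x2 table with fixed margins (hypergeometric)."""
    return comb(a + b, a) * comb(c + d, c) / comb(a + b + c + d, a + c)

# AB vs D in the toy data: a = AB&D, b = !AB&D, c = AB&!D, d = !AB&!D
print(table_prob(2, 1, 2, 1))  # 0.6
```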
Although the math and stats get more and more complicated, the models and tests we have seen so far are really quite straightforward. How can we infuse more background knowledge? We can test against
▪ Bayesian networks (Jaroszewicz et al. 2004),
▪ Maximum Entropy models (Wang et al. 2006; Mampaey et al. 2012; …)
This (probably) goes too deep for today; let's reconsider it in a few weeks.
Only considering how often a pattern occurs biases us towards patterns of low cardinality. Instead, we can consider how much of the data a pattern covers. That is, a pattern T, a tile, now consists of a row-set and a column-set, and is regarded as more interesting the larger its area,
area(T) = |rows(T)| × |cols(T)|
(Geerts et al 2004)
[Example: a tile in a genes × conditions expression matrix.]
Sadly, area is not (anti-)monotonic: extending the column-set of a given tile T may result in either an increase or a decrease of area. How can we mine tiles efficiently? Through depth-first search, using branch-and-bound: if you keep the row-set maximal, you can keep track of the conditional support of every not-yet-included item; assuming maximal correlation then yields an upper bound.
A big pile of tiles is as bad as a big pile of frequent itemsets: way too many, way too redundant. Instead, we can also ask to find a set of tiles that together cover as many of the 1s in the data as possible. This means we are doing set cover, which is well known to be NP-hard, but for which the greedy algorithm is the best possible polynomial-time approximation algorithm (see the sketch below).
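A minimal sketch of that greedy set-cover step; the candidate tiles and the toy matrix are illustrative only, as a real miner would generate candidates from the data:

```python
# toy 0/1 matrix (rows: transactions, columns: items a, b, c, d)
M = [[1,1,1,1],
     [1,1,0,1],
     [1,1,1,0],
     [1,1,0,0],
     [1,0,0,1],
     [0,0,1,0]]

def ones(tile):
    """The cells of the tile that are 1 in the data."""
    rows, cols = tile
    return {(i, j) for i in rows for j in cols if M[i][j] == 1}

def greedy_tiling(candidates, k):
    """Greedily pick up to k tiles covering the most uncovered 1s."""
    covered, chosen = set(), []
    for _ in range(k):
        best = max(candidates, key=lambda t: len(ones(t) - covered))
        if not ones(best) - covered:     # no tile adds new 1s
            break
        covered |= ones(best)
        chosen.append(best)
    return chosen

tiles = [((0,1,2,3), (0,1)), ((0,1), (0,1,3)), ((0,2), (2,)), ((0,4), (0,3))]
print(greedy_tiling(tiles, 3))
```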
Let fr(T) for a tile T be the relative number of 1s in the tile,
fr(T) = ( Σ_{i ∈ rows(T)} Σ_{j ∈ cols(T)} D(i, j) ) / ( |rows(T)| × |cols(T)| )
For fr(T) = 1 or 0 we say the tile is exact. Otherwise, it is noisy.
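A minimal sketch computing area and fr for a tile of the toy matrix; `tile_stats` is a hypothetical helper, not from the slides:

```python
import numpy as np

# the toy 0/1 matrix (rows: transactions, columns: a, b, c, d)
D = np.array([[1,1,1,1],
              [1,1,0,1],
              [1,1,1,0],
              [1,1,0,0],
              [1,0,0,1],
              [0,0,1,0]])

def tile_stats(rows, cols):
    """Area and relative number of 1s of the tile spanned by rows x cols."""
    sub = D[np.ix_(rows, cols)]
    return sub.size, sub.mean()   # fr == 1.0 means an exact tile

print(tile_stats([0, 1, 2, 3], [0, 1]))  # (8, 1.0): the tile 'ab' is exact
```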
(Tatti & Vreeken 2011)
Boolean Matrix Factorisation (BMF) aims to find a low-rank decomposition of a given binary matrix A into binary row and column factor matrices B and C, such that A ≈ B ∘ C, where ∘ is the Boolean matrix product, i.e. using the Boolean sum 0+0 = 0, 0+1 = 1, 1+1 = 1, and we want to minimise the error between A and B ∘ C. When restricted to exact tiles (factors), BMF and tiling coincide.
(Miettinen et al 2006, β¦)
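A minimal sketch of the Boolean product and its reconstruction error, with hypothetical rank-2 factors that correspond to two exact tiles of the toy matrix:

```python
import numpy as np

A = np.array([[1,1,1,1],     # the toy 0/1 matrix again
              [1,1,0,1],
              [1,1,1,0],
              [1,1,0,0],
              [1,0,0,1],
              [0,0,1,0]])

# hypothetical rank-2 factors: each column of B paired with the matching
# row of C is one exact tile
B = np.array([[1,1],[1,0],[1,1],[1,0],[0,0],[0,1]])
C = np.array([[1,1,0,0],     # tile over items a, b
              [0,0,1,0]])    # tile over item c

A_hat = (B @ C) > 0          # Boolean product: 1 + 1 = 1
print(int(np.sum(A_hat != A)))  # reconstruction error: 4 mismatching cells
```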
Patterns are a powerful concept that can give a lot of insight into how your data is locally distributed. Monotonic constraints allow for efficient mining
▪ levelwise search always works; more elegant algorithms exist
▪ works equally well for other pattern types: itemsets, sequences, trees, streams, low-entropy sets
Measuring interestingness is inherently difficult
▪ frequency alone is a bad measure
▪ independence models are too weak
▪ stronger models are computationally expensive