Leiden University
Efficient Frequent Query Discovery in FARMER
Siegfried Nijssen and Joost N. Kok
ECML/PKDD-2003, Cavtat

Introduction
• Frequent structure mining: given a set of complex structures (molecules, access logs, graphs, (free) trees, ...), find substructures that occur frequently
• Frequent structure mining approaches:
  – Specialized: efficient algorithms for sequences, trees (Freqt, uFreqT) and graphs (gSpan, FSG)
  – General: ILP algorithms (WARMR), biased graph mining algorithms (B-AGM)
September 25, 2003, Cavtat ECML/PKDD-2003

Introduction
• [Yan, SIGKDD’2003] Comparison between gSpan and WARMR on confirmed active AIDS molecules:
  – WARMR: 6400 s
  – gSpan: 2 s
• Our goal: to build an efficient WARMR-like algorithm

Overview
• Problem description
• Optimizations:
  – Use a bias for tight problem specifications
  – Perform a depth-first search
  – Use efficient data structures in a new complete enumeration strategy which combines pruning with candidate generation
  – Speed up evaluation by storing intermediate evaluation results and constructing low-cost queries
• Experiments & conclusions

Problem description
• The task of the algorithm is: given a database of Datalog facts, find the set of queries that occur frequently

Database of Facts
[Figure: two edge-labelled graphs, g1 (nodes n1 to n5) and g2 (nodes n6, n7)]
• { e(g1,n1,n2,a), e(g1,n2,n1,a), e(g1,n2,n3,a),
    e(g1,n3,n1,b), e(g1,n3,n4,b), e(g1,n3,n5,c),
    e(g2,n6,n7,b) }

Queries
[Figure: the query drawn as a graph over nodes N1 to N5, with edges labelled a and b]
• k(G) ← e(G,N1,N2,a), e(G,N2,N3,a), e(G,N1,N4,a), e(G,N4,N5,b)

Queries - Bias
• For a fixed set of predicates, many kinds of queries are possible:
  – k(G) ← e(G,N1,N2,a), e(G,N2,N3,a), e(G,N1,N4,a), e(G,N4,N5,b)
  – k(G) ← e(G,N1,N2,L), e(G,N2,N3,L), e(G,N1,N4,L), e(G,N4,N5,L)
• Our algorithm requires the user to specify a mode bias with types, primary keys, atom variable constraints, ...

Occurrence of Queries
• Database D:
  { e(g1,n1,n2,a), e(g1,n2,n1,a), e(g1,n2,n3,a),
    e(g1,n3,n1,b), e(g1,n3,n4,b), e(g1,n3,n5,a),
    e(g1,n4,n2,a), e(g1,n4,n5,b), e(g2,n6,n7,b) }
• Query Q: k(G) ← e(G,N1,N2,a), e(G,N2,N3,a), e(G,N1,N4,a), e(G,N4,N5,b)
• θ = { G/g1, N1/n2, N2/n1, N3/n2, N4/n3, N5/n1 }
• (WARMR) θ-subsumption: D θ-subsumes Q iff there is a substitution θ such that (Qθ) ⊆ D
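The θ on this slide can be checked mechanically. The following is a minimal Python sketch added for illustration, not part of the original slides. The tuple encoding of atoms and the names `apply_substitution` and `is_witness` are assumptions made here. It confirms that (Qθ) ⊆ D and also tests whether θ is injective.

```python
# Facts and query copied from the slide; an atom is a tuple (predicate, arg1, ...).
D = {
    ("e", "g1", "n1", "n2", "a"), ("e", "g1", "n2", "n1", "a"),
    ("e", "g1", "n2", "n3", "a"), ("e", "g1", "n3", "n1", "b"),
    ("e", "g1", "n3", "n4", "b"), ("e", "g1", "n3", "n5", "a"),
    ("e", "g1", "n4", "n2", "a"), ("e", "g1", "n4", "n5", "b"),
    ("e", "g2", "n6", "n7", "b"),
}
Q = [
    ("e", "G", "N1", "N2", "a"), ("e", "G", "N2", "N3", "a"),
    ("e", "G", "N1", "N4", "a"), ("e", "G", "N4", "N5", "b"),
]
theta = {"G": "g1", "N1": "n2", "N2": "n1", "N3": "n2", "N4": "n3", "N5": "n1"}


def apply_substitution(atom, theta):
    """Replace every term that theta binds; leave all other terms unchanged."""
    return tuple(theta.get(term, term) for term in atom)


def is_witness(query, theta, database):
    """True if every atom of (query theta) is a fact of the database."""
    return all(apply_substitution(atom, theta) in database for atom in query)


witness = is_witness(Q, theta, D)                     # (Q theta) is a subset of D
injective = len(set(theta.values())) == len(theta)    # distinct variables, distinct images?
print(witness, injective)
```

Here N1 and N3 are both sent to n2, and N2 and N5 are both sent to n1, so θ is a valid θ-subsumption witness even though it folds distinct query nodes onto the same database node.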

Occurrence of Queries
[Figure: graphs g1 and g2, with the query graph (nodes N1 to N5) matched onto g1]
Counterintuitive!

Occurrence of Queries
Equivalent:
• k(G) ← e(G,N1,N2,b), e(G,N2,N3,a), e(G,N3,N2,a), e(G,N3,N4,a)
• k(G) ← e(G,N1,N2,b), e(G,N2,N3,a), e(G,N3,N2,a)
[Figure: the two queries drawn as graphs with edges labelled a and b]
Counterintuitive!
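Why the two queries count as equivalent can be seen by searching for substitutions in both directions. The sketch below is an illustration under assumptions, not the authors' code: the helper name `maps_into` is hypothetical, the variables of the target query are treated as constants, and the search is plain brute force over all (not necessarily injective) substitutions.

```python
from itertools import product

long_query = [
    ("e", "G", "N1", "N2", "b"), ("e", "G", "N2", "N3", "a"),
    ("e", "G", "N3", "N2", "a"), ("e", "G", "N3", "N4", "a"),
]
short_query = long_query[:3]


def maps_into(source, target):
    """True if some substitution theta gives (source theta) as a subset of target.

    Terms starting with an upper-case letter are variables of `source`;
    every term of `target` is treated as a constant.
    """
    variables = sorted({t for atom in source for t in atom[1:] if t[0].isupper()})
    terms = sorted({t for atom in target for t in atom[1:]})
    target_atoms = set(target)
    for image in product(terms, repeat=len(variables)):
        theta = dict(zip(variables, image))
        if all(tuple(theta.get(t, t) for t in atom) in target_atoms for atom in source):
            return True
    return False


# Each query maps into the other, so they are equivalent under theta-subsumption.
equivalent = maps_into(long_query, short_query) and maps_into(short_query, long_query)
print(equivalent)
```

The longer query maps into the shorter one by sending N4 to N2, which collapses its last atom onto e(G,N3,N2,a).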

Occurrence of Queries
• (FARMER here) OI-subsumption: D OI-subsumes Q iff there is a substitution θ such that (Qθ) ⊆ D and:
  – θ is injective
  – θ does not map to constants in Q
• Advantages over θ-subsumption:
  – in many situations (e.g. graphs) more intuitive
  – if queries are equivalent, they are alphabetic variants; mode refinement is easier (proper)
• Disadvantages?
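The two extra conditions can be illustrated with a brute-force check. This is a sketch under assumptions, not FARMER's actual procedure: the function name `oi_subsumes` and the tuple encoding are made up here, and injectivity is enforced simply by enumerating permutations. The database and query are the ones from the earlier θ-subsumption slide.

```python
from itertools import permutations

D = [
    ("e", "g1", "n1", "n2", "a"), ("e", "g1", "n2", "n1", "a"),
    ("e", "g1", "n2", "n3", "a"), ("e", "g1", "n3", "n1", "b"),
    ("e", "g1", "n3", "n4", "b"), ("e", "g1", "n3", "n5", "a"),
    ("e", "g1", "n4", "n2", "a"), ("e", "g1", "n4", "n5", "b"),
    ("e", "g2", "n6", "n7", "b"),
]
Q = [
    ("e", "G", "N1", "N2", "a"), ("e", "G", "N2", "N3", "a"),
    ("e", "G", "N1", "N4", "a"), ("e", "G", "N4", "N5", "b"),
]


def oi_subsumes(database, query):
    """Return an OI substitution mapping the query into the database, or None."""
    facts = set(database)
    variables = sorted({t for atom in query for t in atom[1:] if t[0].isupper()})
    query_constants = {t for atom in query for t in atom[1:] if not t[0].isupper()}
    # Allowed images: database terms that are not constants of the query.
    domain = sorted({t for fact in database for t in fact[1:]} - query_constants)
    for image in permutations(domain, len(variables)):  # injective by construction
        theta = dict(zip(variables, image))
        if all(tuple(theta.get(t, t) for t in atom) in facts for atom in query):
            return theta
    return None


full = oi_subsumes(D, Q)        # the four-atom query from the slide
path = oi_subsumes(D, Q[:2])    # only the two-edge path N1 -> N2 -> N3
print(full, path)
```

On this data the four-atom query has no OI substitution, although it is θ-subsumed: N1 must be n2 (the only node with two outgoing a-edges), which forces N2 = n1 and N4 = n3, and the only a-edge leaving n1 leads back to n2. The two-edge path, by contrast, does occur injectively.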

Frequency
• Database D:
  { e(g1,n1,n2,a), e(g1,n2,n1,a), e(g1,n2,n3,a),
    e(g1,n3,n1,b), e(g1,n3,n4,b), e(g1,n3,n5,a),
    e(g1,n4,n2,a), e(g1,n4,n5,b), e(g2,n6,n7,b) }
• Query Q: k(G) ← e(G,N1,N2,a)
• Frequency freq(Q): the number of different values for G for which the body is subsumed by the database.
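As a worked example of this definition, the sketch below counts the values of G for which an OI substitution exists. It is an illustration under assumptions: the names `extend`, `search` and `freq` are hypothetical, the key variable is assumed to be called G and to sit in the first argument of every fact, and the backtracking is deliberately naive.

```python
D = [
    ("e", "g1", "n1", "n2", "a"), ("e", "g1", "n2", "n1", "a"),
    ("e", "g1", "n2", "n3", "a"), ("e", "g1", "n3", "n1", "b"),
    ("e", "g1", "n3", "n4", "b"), ("e", "g1", "n3", "n5", "a"),
    ("e", "g1", "n4", "n2", "a"), ("e", "g1", "n4", "n5", "b"),
    ("e", "g2", "n6", "n7", "b"),
]


def is_var(term):
    return term[0].isupper()


def extend(atom, fact, theta, forbidden):
    """Match one query atom against one fact; return the extended theta or None."""
    if len(atom) != len(fact):
        return None
    new = dict(theta)
    for q, d in zip(atom, fact):
        if not is_var(q):
            if q != d:
                return None
        elif q in new:
            if new[q] != d:
                return None
        elif d in new.values() or d in forbidden:
            return None  # would break injectivity or map onto a query constant
        else:
            new[q] = d
    return new


def search(atoms, database, theta, forbidden):
    """Backtracking search for an OI substitution covering all atoms."""
    if not atoms:
        return theta
    for fact in database:
        new = extend(atoms[0], fact, theta, forbidden)
        if new is not None:
            result = search(atoms[1:], database, new, forbidden)
            if result is not None:
                return result
    return None


def freq(query, database, key="G"):
    """Number of distinct key values for which the query body is subsumed."""
    forbidden = {t for atom in query for t in atom[1:] if not is_var(t)}
    keys = {fact[1] for fact in database}
    return sum(search(query, database, {key: k}, forbidden) is not None for k in keys)


freq_a = freq([("e", "G", "N1", "N2", "a")], D)
freq_b = freq([("e", "G", "N1", "N2", "b")], D)
print(freq_a, freq_b)
```

Only g1 contains an edge labelled a, while both g1 and g2 contain an edge labelled b, so the two single-atom queries have frequencies 1 and 2.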

Monotonicity
• Frequent: frequency ≥ minsup, for a predefined threshold value minsup
• Monotonicity: if Q2 OI-subsumes Q1, then freq(Q1) ≥ freq(Q2)
  ⇒ if a query is infrequent, it should not be refined
  ⇒ if a query is subsumed by an infrequent query, it should not be considered

FARMER
FARMER(Query Q)::
  1. determine refinements of Q
  2. compute frequency of refinements
  3. sort refinements
  4. for each frequent refinement Q’ do FARMER(Q’)
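The control flow of this recursion can be sketched in a few lines. The Python below is only an illustration of the depth-first scheme, not the authors' implementation: `refine` and `freq` are passed in as hypothetical callables, the sort order is an arbitrary fixed one, and the demo replaces Datalog queries with a toy stand-in in which each graph is reduced to its set of edge labels and a "query" is an increasing tuple of labels. No equivalence checking between queries is attempted.

```python
def farmer(query, refine, freq, minsup, found):
    """Depth-first search following the four steps on the slide."""
    refinements = refine(query)                      # determine refinements of Q
    counted = [(q, freq(q)) for q in refinements]    # compute their frequencies
    counted.sort()                                   # sort (any fixed sibling order)
    for q, f in counted:
        if f >= minsup:                              # infrequent refinements are not refined
            found[q] = f
            farmer(q, refine, freq, minsup, found)
    return found


# Toy stand-in data: the edge labels present in each graph of the first example database.
graphs = {"g1": {"a", "b", "c"}, "g2": {"b"}}
labels = ["a", "b", "c"]


def refine(query):
    """Append one label larger than the last, so each label set is generated once."""
    return [query + (label,) for label in labels if not query or label > query[-1]]


def freq(query):
    """Number of graphs containing every label of the query."""
    return sum(set(query) <= present for present in graphs.values())


frequent = farmer((), refine, freq, 2, {})
everything = farmer((), refine, freq, 1, {})
print(frequent, len(everything))
```

With minsup = 2 the search stops after the single pattern ("b",), because its only refinement occurs in one graph and is therefore never expanded.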

Determine Refinements
• Only one variant of each query should be counted and output
• Main problem: query equivalence under OI has graph isomorphism complexity
• Our approach:
  – use ordered tree-based heuristics
  – use efficient data structures to determine equivalence
  – also perform other pruning during the exponential search

Determine Refinements
• [IJCAI’01]
[Figure: query tree with node atoms e(G,N1,N2,a), e(G,N1,N2,b), e(G,N3,N4,b), e(G,N1,N3,a), e(G,N2,N3,a), e(G,N2,N3,b), e(G,N1,N3,b), e(G,N3,N4,a)]

Determine Refinements
[Figure: query tree with node atoms e(G,N1,N2,a), e(G,N1,N2,b), e(G,N1,N3,a), e(G,N2,N3,a), e(G,N2,N3,b), e(G,N1,N3,b), e(G,N3,N4,a)]

Determine Refinements
• (In the paper) we prove that
  – Refinement with this strategy is complete: of every frequent query defined by the bias, at least one variant is found
  – The order of siblings does not matter for completeness (but they must have some order)

Determine Refinements
• Incrementally generate variants
• Search for the variant (under construction) in the existing part of the query tree
• To optimize this search, siblings are stored in a tree-like hash structure
• If a query is found that is infrequent ⇒ query Q is pruned (monotonicity constraint!)

Frequency Computation
• Main problem: the complexity of finding an OI substitution is the same as subgraph isomorphism, and is therefore NP-complete
• Our approach: avoid, as far as possible, performing the same (exponential) computation twice

Frequency Computation
• D = { e(g1,n1,n2,a), e(g1,n2,n1,a), e(g1,n2,n3,a),
        e(g1,n3,n1,b), e(g1,n3,n4,b), e(g1,n3,n5,a),
        e(g1,n4,n2,a), e(g1,n4,n5,b), e(g2,n6,n7,b) }
  (highlighted on the slide: e(g1,n3,n1,b) and e(g2,n6,n7,b))
• Q = k(G) ← e(G,N1,N2,b)
• For each value of G for which the database subsumes the query, the 'first' substitution is stored

Frequency Computation
• Once a query is refined, for each refinement the first subsuming substitution has to be determined
• This computation is performed in one backtracking procedure for all refinements together (like query packs)
• This search starts from the substitution of the original query
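The idea of storing a first substitution and reusing it for refinements can be sketched as follows. This is a simplified illustration, not the procedure from the paper: it handles one refinement at a time rather than all refinements in a single backtracking pass, and when the stored substitution cannot be extended it falls back to a fresh search instead of resuming the old one. The names `substitutions`, `first_substitutions` and `refine_firsts` are hypothetical, and the key variable is assumed to be G in the first argument position.

```python
D = [
    ("e", "g1", "n1", "n2", "a"), ("e", "g1", "n2", "n1", "a"),
    ("e", "g1", "n2", "n3", "a"), ("e", "g1", "n3", "n1", "b"),
    ("e", "g1", "n3", "n4", "b"), ("e", "g1", "n3", "n5", "a"),
    ("e", "g1", "n4", "n2", "a"), ("e", "g1", "n4", "n5", "b"),
    ("e", "g2", "n6", "n7", "b"),
]


def is_var(term):
    return term[0].isupper()


def substitutions(atoms, database, theta):
    """Yield, in database order, every injective extension of theta matching all atoms.

    The "no mapping onto query constants" condition is omitted because edge
    labels and node names never coincide in this data.
    """
    if not atoms:
        yield theta
        return
    for fact in database:
        new = dict(theta)
        for q, d in zip(atoms[0], fact):
            if not is_var(q):
                if q != d:
                    break
            elif q in new:
                if new[q] != d:
                    break
            elif d in new.values():
                break
            else:
                new[q] = d
        else:  # no break: the atom matched this fact
            yield from substitutions(atoms[1:], database, new)


def first_substitutions(query, database, key="G"):
    """Store, per value of the key variable, the first substitution found."""
    firsts = {}
    for k in sorted({fact[1] for fact in database}):
        theta = next(substitutions(query, database, {key: k}), None)
        if theta is not None:
            firsts[k] = theta
    return firsts


def refine_firsts(query, new_atom, firsts, database, key="G"):
    """First substitutions for query + [new_atom], reusing the parent's stored ones."""
    result = {}
    for k, stored in firsts.items():  # key values that failed for the parent are skipped
        theta = next(substitutions([new_atom], database, stored), None)
        if theta is None:  # stored substitution not extendable: search again
            theta = next(substitutions(query + [new_atom], database, {key: k}), None)
        if theta is not None:
            result[k] = theta
    return result


Q = [("e", "G", "N1", "N2", "b")]
firsts = first_substitutions(Q, D)
refined = refine_firsts(Q, ("e", "G", "N2", "N3", "a"), firsts, D)
print(firsts)
print(refined)
```

For Q the stored substitutions correspond to the facts e(g1,n3,n1,b) and e(g2,n6,n7,b). For the refinement with e(G,N2,N3,a), the stored substitution for g1 extends directly with N3 = n2, while g2 drops out.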

Frequency Computation
• D = { e(g1,n1,n2,a), e(g1,n2,n1,a), e(g1,n2,n3,a),
        e(g1,n3,n1,b), e(g1,n3,n4,b), e(g1,n3,n5,a),
        e(g1,n4,n2,a), e(g1,n4,n5,b), e(g2,n6,n7,b) }
• Q = k(G) ← e(G,N1,N2,b)
[Figure: atoms attached to Q as refinements: e(G,N2,N3,a), e(G,N2,N3,b), e(G,N1,N3,b), e(G,N3,N4,b)]
