
slide-1
SLIDE 1

Leiden University Efficient Frequent Query Discovery in FARMER

Siegfried Nijssen and Joost N. Kok ECML/PKDD-2003, Cavtat

slide-2
SLIDE 2

September 25, 2003, Cavtat ECML/PKDD-2003

Introduction

  • Frequent structure mining: given a set of complex structures (molecules, access logs, graphs, (free) trees, ...), find substructures that occur frequently

  • Frequent structure mining approaches:

– Specialized: efficient algorithms for sequences, trees (Freqt, uFreqT) and graphs (gSpan, FSG)
– General: ILP algorithms (Warmr), biased graph mining algorithms (B-AGM)

slide-3
SLIDE 3

Introduction

  • [Yan, SIGKDD’2003] Comparison between gSpan and WARMR on confirmed active AIDS molecules:

– WARMR: 6400s
– gSpan: 2s

  • Our goal: to build an efficient WARMR-like algorithm

slide-4
SLIDE 4

Overview

  • Problem description
  • Optimizations:

– Use a bias for tight problem specifications
– Perform a depth-first search
– Use efficient data structures in a new complete enumeration strategy which combines pruning with candidate generation
– Speed up evaluation by storing intermediate evaluation results and constructing low-cost queries

  • Experiments & conclusions
slide-5
SLIDE 5

Problem description

  • The task of the algorithm is: given a database of Datalog facts, find the queries that occur frequently

slide-6
SLIDE 6

Database of Facts

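  • {e(g1,n1,n2,a), e(g1,n2,n1,a), e(g1,n2,n3,a), e(g1,n3,n1,b), e(g1,n3,n4,b), e(g1,n3,n5,c), e(g2,n6,n7,b)}

[Figure: the facts drawn as two labeled directed graphs; g1 has nodes n1 to n5, g2 has nodes n6 and n7, and the edges carry the labels a, b, c]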
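As a minimal sketch of how such a fact database could be held in memory, each fact e(G,From,To,Label) is stored below as a plain 4-tuple and grouped per graph. The names `FACTS` and `edges_by_graph` are illustrative assumptions, not part of FARMER.

```python
from collections import defaultdict

# Each Datalog fact e(G, From, To, Label) from the slide becomes a 4-tuple.
FACTS = {
    ("g1", "n1", "n2", "a"), ("g1", "n2", "n1", "a"), ("g1", "n2", "n3", "a"),
    ("g1", "n3", "n1", "b"), ("g1", "n3", "n4", "b"), ("g1", "n3", "n5", "c"),
    ("g2", "n6", "n7", "b"),
}


def edges_by_graph(facts):
    """Group (from, to, label) edges under their graph identifier."""
    graphs = defaultdict(list)
    for g, src, dst, label in facts:
        graphs[g].append((src, dst, label))
    # Sort the edges so the result does not depend on set iteration order.
    return {g: sorted(edges) for g, edges in graphs.items()}


graphs = edges_by_graph(FACTS)
```

Grouping by the first argument mirrors the drawing on the slide: one labeled directed graph per value of G.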

slide-7
SLIDE 7

Queries

  • k(G) ← e(G,N1,N2,a), e(G,N2,N3,a), e(G,N1,N4,a), e(G,N4,N5,b)

[Figure: the query drawn as a graph over the variables N1 to N5, with edge labels a, a, a, b]

slide-8
SLIDE 8

Queries - Bias

  • For a fixed set of predicates many kinds of queries possible:

– k(G) ← e(G,N1,N2,a), e(G,N2,N3,a), e(G,N1,N4,a), e(G,N4,N5,b)
– k(G) ← e(G,N1,N2,L), e(G,N2,N3,L), e(G,N1,N4,L), e(G,N4,N5,L)

  • Our algorithm requires the user to specify a mode bias with types, primary keys, atom variable constraints, ...

slide-9
SLIDE 9

Occurrence of Queries

  • Database D:

{e(g1,n1,n2,a), e(g1,n2,n1,a), e(g1,n2,n3,a), e(g1,n3,n1,b), e(g1,n3,n4,b), e(g1,n3,n5,a), e(g1,n4,n2,a), e(g1,n4,n5,b), e(g2,n6,n7,b)}

  • Query Q:

k(G) ← e(G,N1,N2,a), e(G,N2,N3,a), e(G,N1,N4,a), e(G,N4,N5,b)

  • (WARMR) θ-subsumption: D subsumes Q iff there is a substitution θ such that (Qθ) ⊆ D

θ = {G/g1, N1/n2, N2/n1, N3/n2, N4/n3, N5/n1}
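The θ-subsumption test above can be sketched as a small backtracking search. This is only an illustration under assumptions: facts and atoms are 4-tuples, variables are recognised by an upper-case first letter, and the names `theta_subsumes`, `apply_subst` and `SLIDE_THETA` are hypothetical.

```python
# Database D and query body from the slide; each fact or atom is e(G, From, To, Label).
D = {
    ("g1", "n1", "n2", "a"), ("g1", "n2", "n1", "a"), ("g1", "n2", "n3", "a"),
    ("g1", "n3", "n1", "b"), ("g1", "n3", "n4", "b"), ("g1", "n3", "n5", "a"),
    ("g1", "n4", "n2", "a"), ("g1", "n4", "n5", "b"), ("g2", "n6", "n7", "b"),
}
Q = [("G", "N1", "N2", "a"), ("G", "N2", "N3", "a"),
     ("G", "N1", "N4", "a"), ("G", "N4", "N5", "b")]


def is_var(term):
    # Assumed convention: variables start with an upper-case letter.
    return term[0].isupper()


def apply_subst(theta, atom):
    """Apply substitution theta to one atom; constants stay unchanged."""
    return tuple(theta.get(t, t) for t in atom)


def theta_subsumes(db, body, theta=None):
    """Return some theta with every atom of body mapped into db, else None."""
    theta = dict(theta or {})
    if not body:
        return theta
    first, rest = body[0], body[1:]
    for fact in sorted(db):
        trial = dict(theta)
        for q, f in zip(first, fact):
            if is_var(q):
                # Bind the variable, or fail if it is already bound differently.
                if trial.setdefault(q, f) != f:
                    break
            elif q != f:
                break
        else:
            found = theta_subsumes(db, rest, trial)
            if found is not None:
                return found
    return None


# The substitution given on the slide; note that N1 and N3 share node n2.
SLIDE_THETA = {"G": "g1", "N1": "n2", "N2": "n1", "N3": "n2", "N4": "n3", "N5": "n1"}
```

Checking `apply_subst(SLIDE_THETA, atom) in D` for every atom of `Q` confirms the slide's substitution, and it also shows that θ-subsumption lets two different variables land on the same node.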

slide-10
SLIDE 10

Occurrence of Queries

[Figure: query Q overlaid on database graph g1 according to θ; distinct query variables are mapped to the same node]

Counterintuitive!

slide-11
SLIDE 11

Occurrence of Queries

  • Equivalent:

– k(G) ← e(G,N1,N2,b), e(G,N2,N3,a), e(G,N3,N2,a), e(G,N3,N4,a)
– k(G) ← e(G,N1,N2,b), e(G,N2,N3,a), e(G,N3,N2,a)

  • Counterintuitive!

slide-12
SLIDE 12

Occurrence of Queries

  • (FARMER here) OI-subsumption: D subsumes Q iff there is a substitution θ such that (Qθ) ⊆ D and:

– θ is injective
– θ does not map to constants in Q

  • Advantages over θ-subsumption:

– in many situations (e.g. graphs) more intuitive
– if queries are equivalent, they are alphabetic variants; mode refinement is easier (proper)

  • Disadvantages?
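A hedged sketch of the OI test, reusing the database D and query Q from the earlier "Occurrence of Queries" slide: the search is the same backtracking as for θ-subsumption, plus the two extra conditions. The function name `oi_subsumes` and the tuple encoding are assumptions made for illustration.

```python
D = {
    ("g1", "n1", "n2", "a"), ("g1", "n2", "n1", "a"), ("g1", "n2", "n3", "a"),
    ("g1", "n3", "n1", "b"), ("g1", "n3", "n4", "b"), ("g1", "n3", "n5", "a"),
    ("g1", "n4", "n2", "a"), ("g1", "n4", "n5", "b"), ("g2", "n6", "n7", "b"),
}
Q = [("G", "N1", "N2", "a"), ("G", "N2", "N3", "a"),
     ("G", "N1", "N4", "a"), ("G", "N4", "N5", "b")]


def is_var(term):
    # Assumed convention: variables start with an upper-case letter.
    return term[0].isupper()


def oi_subsumes(db, body, theta=None, consts=None):
    """Return an injective theta mapping body into db, or None if there is none."""
    if consts is None:
        # Constants of the whole query; variables may not be mapped onto them.
        consts = {t for atom in body for t in atom if not is_var(t)}
    theta = dict(theta or {})
    if not body:
        return theta
    first, rest = body[0], body[1:]
    for fact in sorted(db):
        trial = dict(theta)
        for q, f in zip(first, fact):
            if not is_var(q):
                if q != f:
                    break
            elif q in trial:
                if trial[q] != f:
                    break
            elif f in trial.values() or f in consts:
                break  # would violate injectivity or hit a query constant
            else:
                trial[q] = f
        else:
            found = oi_subsumes(db, rest, trial, consts)
            if found is not None:
                return found
    return None


# Q without its second atom e(G,N2,N3,a).
SMALLER = [Q[0], Q[2], Q[3]]
```

On this data `oi_subsumes(D, Q)` finds nothing, because the only way to match Q forces N1 and N3 onto the same node, while `SMALLER` still has an injective match.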
slide-13
SLIDE 13

Frequency

  • Database D:

{e(g1,n1,n2,a), e(g1,n2,n1,a), e(g1,n2,n3,a), e(g1,n3,n1,b), e(g1,n3,n4,b), e(g1,n3,n5,a), e(g1,n4,n2,a), e(g1,n4,n5,b), e(g2,n6,n7,b)}

  • Query Q:

k(G) ← e(G,N1,N2,a)

  • Frequency freq(Q): the number of different values for G for which the body is subsumed by the database.
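The frequency definition can be sketched as follows, assuming OI-subsumption and assuming that the key variable G is always the first argument of every fact. The helper names `oi_match` and `freq` are hypothetical.

```python
D = {
    ("g1", "n1", "n2", "a"), ("g1", "n2", "n1", "a"), ("g1", "n2", "n3", "a"),
    ("g1", "n3", "n1", "b"), ("g1", "n3", "n4", "b"), ("g1", "n3", "n5", "a"),
    ("g1", "n4", "n2", "a"), ("g1", "n4", "n5", "b"), ("g2", "n6", "n7", "b"),
}


def is_var(term):
    # Assumed convention: variables start with an upper-case letter.
    return term[0].isupper()


def oi_match(db, body, theta, consts):
    """Yield every injective substitution that extends theta and maps body into db."""
    if not body:
        yield theta
        return
    first, rest = body[0], body[1:]
    for fact in sorted(db):
        trial = dict(theta)
        for q, f in zip(first, fact):
            if not is_var(q):
                if q != f:
                    break
            elif q in trial:
                if trial[q] != f:
                    break
            elif f in trial.values() or f in consts:
                break
            else:
                trial[q] = f
        else:
            yield from oi_match(db, rest, trial, consts)


def freq(db, body):
    """Count the values of G for which at least one OI substitution exists."""
    consts = {t for atom in body for t in atom if not is_var(t)}
    keys = {fact[0] for fact in db}  # assumption: G is the first argument
    return sum(1 for g in keys
               if next(oi_match(db, body, {"G": g}, consts), None) is not None)
```

For the slide's query, `freq(D, [("G", "N1", "N2", "a")])` counts only g1, since g2 has no edge labeled a.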

slide-14
SLIDE 14

Monotonicity

  • Frequent: frequency ≥ minsup, for a predefined threshold value minsup
  • Monotonicity: if Q2 OI-subsumes Q1, then freq(Q1) ≥ freq(Q2)

⇒ if a query is infrequent, it should not be refined
⇒ if a query is subsumed by an infrequent query, it should not be considered

slide-15
SLIDE 15

FARMER

FARMER(Query Q)::
  1. determine refinements of Q
  2. compute frequency of refinements
  3. sort refinements
  for each frequent refinement Q’ do FARMER(Q’)
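The recursion above can be sketched in a few lines. Everything concrete here is an assumption made for illustration: the toy refinement operator `refine` (add one edge atom over existing variables or one fresh variable), the lexicographic sort standing in for the real sorting order, and the `max_atoms` cap. The sketch does no equivalence pruning, so unlike FARMER it may output several variants of one query.

```python
D = {
    ("g1", "n1", "n2", "a"), ("g1", "n2", "n1", "a"), ("g1", "n2", "n3", "a"),
    ("g1", "n3", "n1", "b"), ("g1", "n3", "n4", "b"), ("g1", "n3", "n5", "a"),
    ("g1", "n4", "n2", "a"), ("g1", "n4", "n5", "b"), ("g2", "n6", "n7", "b"),
}


def is_var(term):
    return term[0].isupper()  # assumed convention for variables


def oi_match(db, body, theta, consts):
    """Yield injective substitutions extending theta that map body into db."""
    if not body:
        yield theta
        return
    for fact in sorted(db):
        trial = dict(theta)
        for q, f in zip(body[0], fact):
            if not is_var(q):
                if q != f:
                    break
            elif q in trial:
                if trial[q] != f:
                    break
            elif f in trial.values() or f in consts:
                break
            else:
                trial[q] = f
        else:
            yield from oi_match(db, body[1:], trial, consts)


def freq(db, body):
    consts = {t for atom in body for t in atom if not is_var(t)}
    keys = {fact[0] for fact in db}
    return sum(1 for g in keys
               if next(oi_match(db, body, {"G": g}, consts), None) is not None)


def refine(query, labels):
    """Hypothetical operator: extend the query (a tuple of atoms) by one atom."""
    if not query:
        return [(("G", "N1", "N2", lab),) for lab in labels]
    nodes = sorted({t for atom in query for t in atom[1:3]})
    terms = nodes + [f"N{len(nodes) + 1}"]  # existing variables plus one fresh one
    return [query + (("G", x, y, lab),)
            for x in terms for y in terms if x != y
            for lab in labels if ("G", x, y, lab) not in query]


def farmer(db, query, labels, minsup, max_atoms, out):
    if len(query) >= max_atoms:
        return
    refinements = refine(query, labels)                # 1. determine refinements
    counted = [(q, freq(db, q)) for q in refinements]  # 2. compute frequencies
    counted.sort()                                     # 3. sort (placeholder order)
    for q, f in counted:
        if f >= minsup:                                # infrequent queries are not refined
            out.append((q, f))
            farmer(db, q, labels, minsup, max_atoms, out)


found = []
farmer(D, (), ["a", "b"], minsup=2, max_atoms=2, out=found)
```

With minsup=2 only the single-atom query with label b survives, because g2 contains just one edge and OI forbids mapping two atoms onto it.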

slide-16
SLIDE 16

Determine Refinements

  • Only one variant of each query should be counted and output
  • Main problem: query equivalence under OI has graph isomorphism complexity
  • Our approach:

– use ordered tree-based heuristics
– use efficient data structures to determine equivalence
– also perform other pruning during the exponential search

slide-17
SLIDE 17

Determine Refinements

  • [IJCAI’01]

[Figure: query refinement tree whose nodes are the atoms e(G,N1,N2,a), e(G,N1,N2,b), e(G,N3,N4,b), e(G,N2,N3,a), e(G,N1,N3,a), e(G,N2,N3,b), e(G,N1,N3,b), e(G,N3,N4,a)]

slide-18
SLIDE 18

Determine Refinements

[Figure: query refinement tree over the atoms e(G,N1,N2,a), e(G,N1,N2,b), e(G,N2,N3,a), e(G,N2,N3,b), e(G,N1,N3,a), e(G,N1,N3,b), e(G,N3,N4,a), with the atom e(G,N2,N3,a) highlighted]

slide-19
SLIDE 19

Determine Refinements

  • (In the paper) we prove that:

– Refinement with this strategy is complete: of every frequent query defined by the bias, at least one variant is found
– The order of siblings does not matter for completeness (but they must have some order)

slide-20
SLIDE 20

Determine Refinements

  • Incrementally generate variants
  • Search for the variant (under construction) in the existing part of the query tree
  • To optimize this search, siblings are stored in a tree-like hash structure
  • If a query is found that is infrequent ⇒ query Q is pruned (monotonicity constraint!)

slide-21
SLIDE 21

Frequency Computation

  • Main problem: finding an OI substitution is as hard as subgraph isomorphism, and is therefore NP-complete
  • Our approach: avoid, as far as possible, performing the same (exponential) computation twice

slide-22
SLIDE 22

Frequency Computation

  • D = {e(g1,n1,n2,a), e(g1,n2,n1,a), e(g1,n2,n3,a), e(g1,n3,n1,b), e(g1,n3,n4,b), e(g1,n3,n5,a), e(g1,n4,n2,a), e(g1,n4,n5,b), e(g2,n6,n7,b)}
  • Q = k(G) ← e(G,N1,N2,b)
  • For each value of G for which the database subsumes the query, the ‘first’ substitution is stored

– Highlighted in D: e(g1,n3,n1,b) and e(g2,n6,n7,b)

slide-23
SLIDE 23

Frequency Computation

  • Once a query is refined, for each refinement the first subsuming substitution has to be determined
  • This computation is performed in one backtracking procedure for all refinements together (like query packs)
  • This search starts from the substitution of the original query
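A simplified sketch of the idea of reusing stored substitutions: for each graph, the stored witness of the original query is first extended with the new atom, and only if that fails is a fresh search run for that graph. This is a deliberate simplification and an assumption of this sketch. The procedure described on the slides resumes the backtracking from the stored substitution and handles all refinements in one pass, which is not reproduced here. The names `first_subst` and `evaluate` are hypothetical.

```python
# Sorted list, so that the 'first' substitution is well defined.
D = sorted([
    ("g1", "n1", "n2", "a"), ("g1", "n2", "n1", "a"), ("g1", "n2", "n3", "a"),
    ("g1", "n3", "n1", "b"), ("g1", "n3", "n4", "b"), ("g1", "n3", "n5", "a"),
    ("g1", "n4", "n2", "a"), ("g1", "n4", "n5", "b"), ("g2", "n6", "n7", "b"),
])
Q = [("G", "N1", "N2", "b")]


def is_var(term):
    return term[0].isupper()  # assumed convention for variables


def oi_match(db, body, theta, consts):
    """Yield injective substitutions extending theta that map body into db."""
    if not body:
        yield theta
        return
    for fact in db:
        trial = dict(theta)
        for q, f in zip(body[0], fact):
            if not is_var(q):
                if q != f:
                    break
            elif q in trial:
                if trial[q] != f:
                    break
            elif f in trial.values() or f in consts:
                break
            else:
                trial[q] = f
        else:
            yield from oi_match(db, body[1:], trial, consts)


def first_subst(db, body, theta, consts):
    return next(oi_match(db, body, theta, consts), None)


def evaluate(db, query, stored, atom):
    """Return {graph: first substitution} for the refinement query + [atom]."""
    consts = {t for a in query + [atom] for t in a if not is_var(t)}
    result = {}
    # By monotonicity only graphs that matched the original query can match.
    for g, theta in stored.items():
        found = first_subst(db, [atom], theta, consts)      # cheap: extend the witness
        if found is None:
            found = first_subst(db, query + [atom], {"G": g}, consts)  # fallback search
        if found is not None:
            result[g] = found
    return result


STORED = {g: first_subst(D, Q, {"G": g}, {"b"}) for g in ("g1", "g2")}
with_a = evaluate(D, Q, STORED, ("G", "N2", "N3", "a"))
with_b = evaluate(D, Q, STORED, ("G", "N2", "N3", "b"))
```

For the refinement with e(G,N2,N3,a) the stored witness extends directly. For e(G,N2,N3,b) it does not, and the fallback search finds the match through e(g1,n3,n4,b) and e(g1,n4,n5,b).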
slide-24
SLIDE 24

Frequency Computation

  • D = {e(g1,n1,n2,a), e(g1,n2,n1,a), e(g1,n2,n3,a), e(g1,n3,n1,b), e(g1,n3,n4,b), e(g1,n3,n5,a), e(g1,n4,n2,a), e(g1,n4,n5,b), e(g2,n6,n7,b)}
  • Q = k(G) ← e(G,N1,N2,b)
  • Refinements evaluated together: e(G,N2,N3,a), e(G,N2,N3,b), e(G,N1,N3,b), e(G,N3,N4,b)

[Animation: the backtracking search steps through D and highlights the matching facts in turn:
– Q alone: e(g1,n3,n1,b)
– refinement e(G,N2,N3,b): e(g1,n3,n4,b), e(g1,n4,n5,b)
– refinement e(G,N2,N3,a): e(g1,n3,n1,b), e(g1,n1,n2,a)]

slide-25
SLIDE 25

Sorting Order

  • D = {e(g1,n1,n2,a), e(g1,n2,n1,a), e(g1,n2,n3,a), e(g1,n3,n1,b), e(g1,n3,n4,b), e(g1,n3,n5,a), e(g1,n4,n2,a), e(g1,n4,n5,b), e(g2,n6,n7,b)}
  • Q1 = k(G) ← e(G,N1,N2,b), e(G,N2,N3,a), e(G,N2,N3,b)
  • Q2 = k(G) ← e(G,N1,N2,b), e(G,N2,N3,b), e(G,N2,N3,a)

[Animation: for each query the atoms are evaluated in the order written, and the facts of D matched so far are highlighted step by step, among them e(g1,n3,n1,b), e(g1,n1,n2,a), e(g1,n3,n4,b), e(g1,n4,n2,a) and e(g1,n4,n5,b)]

slide-26
SLIDE 26

Experimental Results

  • Bongard dataset, 392 examples
  • Warmr emulates OI
  • minsup=5%, 1s
  • 192MB, 350MHz
slide-27
SLIDE 27

Experimental Results

  • Predictive Toxicology dataset

Machine | Algorithm | Runtime at 6% / 7%
Pentium III 500MHz, 448MB | gSpan | 5s
Dual Athlon MP1800+, 2GB | FSG IP | 11s / 7s
Athlon XP1600+, 256MB | Farmer | 72s / 48s
Pentium II 350MHz, 192MB | Farmer | 224s / 148s
Pentium III 500MHz, 448MB | FSG | 248s
Dual Athlon MP1800+, 2GB | FSG II | 675s / 23s
Pentium III 350MHz, 192MB | Warmr | >1h / >1h

slide-28
SLIDE 28

Conclusions

  • We decreased the performance gap between specialized algorithms and ILP algorithms significantly
  • We did so by:

– using (weak) object identity
– using a new complete enumeration strategy
– choosing query evaluation strategies with low costs (which, however, require much memory!)

  • Future: provide better comparisons