SLIDE 1

Finding Top-k Min-Cost Connected Trees in Databases

Bolin Ding (1), Jeffrey Xu Yu (1), Shan Wang (2), Lu Qin (1), Xiao Zhang (2), Xuemin Lin (3)

(1) Department of System Engineering and Engineering Management, The Chinese University of Hong Kong
(2) School of Information, Renmin University of China
(3) School of Computer Science and Engineering, The University of New South Wales

IEEE 23rd International Conference on Data Engineering

Ding, Yu, Wang, Qin, Zhang, Lin Finding Top-k Min-Cost Connected Trees in Databases

SLIDE 2

Outline

1 Keyword Search in Relational Databases
    Database Graph, Query, and Answer
    The Hardness of This Problem

2 Our New Parameterized Solutions
    Finding Top-1 Answer
    Finding Top-k Answers

3 Existing Solutions
    Other Graph-Based Solutions

4 Experimental Studies
    Some Representative Experimental Results

SLIDE 7

Weighted Database Graph G(V, E, W)

Node set V: nodes are tuples in the database, |V| = n.
Edge set E: edges are foreign key references between tuples, |E| = m.
Edge weight W: $w_e((v, u)) = \log_2(1 + \max\{d_v, d_u\})$, where $d_x$ is the degree of node x.
    The lower the weight, the tighter the connection.
    Intuition: the relationship between one node and the others is distributed, so an edge touching a high-degree node represents a weaker tie.
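The edge-weight definition above can be sketched in a few lines of Python. This is only an illustration: the function name `edge_weights` and the toy edge list are assumptions, not part of the original slides.

```python
import math
from collections import defaultdict

def edge_weights(edges):
    """Map each undirected edge (v, u) to log2(1 + max(deg(v), deg(u)))."""
    degree = defaultdict(int)
    for v, u in edges:
        degree[v] += 1
        degree[u] += 1
    return {(v, u): math.log2(1 + max(degree[v], degree[u])) for v, u in edges}

# Hypothetical toy graph: node t2 has degree 3, node c1 has degree 2.
toy_edges = [("t1", "c1"), ("c1", "t2"), ("t2", "c2"), ("t2", "w1")]
weights = edge_weights(toy_edges)
for edge, w in weights.items():
    print(edge, round(w, 1))
```

Under this formula, maximum degrees of 2, 3, and 5 give weights of about 1.6, 2, and 2.6, which is consistent with the edge labels that appear in the example graph later in the deck.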

SLIDE 10

Query and Answer

Query: l keywords $p_1, p_2, \dots, p_l$,
    or equivalently l subsets $V_1, V_2, \dots, V_l \subseteq V$ ($V_i$ is the set of nodes containing keyword $p_i$).
Answer: a connected subtree T of G containing the l keywords,
    or equivalently a group Steiner tree T such that $V(T) \cap V_i \neq \emptyset$ for $i = 1, \dots, l$.
Objective: the cost of answer T is $s(T) = \sum_{(u,v) \in E(T)} w_e((u, v))$
    (in general, a linear combination of node/edge weights).
Output: answers $T_1, \dots, T_k$ with the top-k minimum costs.

This is the (top-k) minimum group Steiner tree problem.
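As a small illustration of the definitions above, the following Python sketch checks the group condition and evaluates s(T) for a candidate tree. The helper names `is_answer` and `tree_cost` and the toy data are assumptions made for this example; the sketch does not verify that the edges actually form a connected tree.

```python
def is_answer(tree_nodes, groups):
    """True if the tree touches every keyword group V_i (V(T) ∩ V_i is non-empty)."""
    return all(tree_nodes & group for group in groups)

def tree_cost(tree_edges, weight):
    """s(T): the sum of the edge weights over the edges of T."""
    return sum(weight[e] for e in tree_edges)

# Hypothetical data: two keyword groups and a two-edge candidate tree.
groups = [{"a", "b"}, {"c"}]
weight = {("a", "x"): 1.6, ("x", "c"): 2.0}
tree_edges = [("a", "x"), ("x", "c")]
tree_nodes = {"a", "x", "c"}

print(is_answer(tree_nodes, groups), tree_cost(tree_edges, weight))
```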

SLIDE 12

Example

Database

[Figure: an example bibliographic database with tables Paper, Author, Paper-Author, and Citation (columns include PID, Title, AID, Name, Cite, and Cited). It holds seven papers, t1 to t7, and two authors, Jim and Robin. The paper titles shown are "Steiner Problem in DB", "Efficient IR-Query over DB", "Online Cluster Problems", "Keyword Query over Web", "Query Optimization on DB", "Parameterized Complexity", and "Keyword Search on RDBMS".]

Database Graph

[Figure: the corresponding weighted database graph over nodes t1 to t7, a1, a2, w1 to w7, and c1 to c4, with edge weights of 1.6, 2, and 2.6.]

Query: Keyword (p1), Query (p2), DB (p3), and Jim (p4)

SLIDE 18

Example

Database Graph

[Figure: the database graph with keyword labels attached to nodes: {p1,p3}, {p1,p2}, {p2}, {p2,p3}, {p3}, and {p4}.]

Node sets containing keywords: V1 = {t1, t5}, V2 = {t3, t5, t6}, V3 = {t1, t2, t6}, and V4 = {a1}.

Answer T1: Jim (a1) writes a paper, t2, which is cited by two papers, t1 and t3. Cost: s(T1) = 10.8.

Answer T2: Jim (a1) writes a paper, t2, which is cited by t3; the author of t3, Robin (a2), writes another paper, t5. Cost: s(T2) = 15.6.

SLIDE 22

Hardness Results

The Hardness of the (Top-k) Minimum Group Steiner Tree Problem

- NP-complete
- Harder than the undirected minimum Steiner tree problem
- Equivalent to the directed minimum Steiner tree problem
- NP-hard to find a (1 + ε)-approximation for any fixed ε > 0
- NP-hard to find a (1 − ε) log l-approximation for any fixed ε > 0

SLIDE 23

Summary of Our Solution

Two Steps

1 Optimal top-1 answer with the minimum cost
2 Approximate top-k answers

Parameterized Complexity

- Time complexity: $O(3^l n + 2^l((l + \log n)n + m))$
- Space complexity: $O(2^l n)$
- For fixed l, time complexity: $O(n \log n + m)$
- For fixed l, space complexity: $O(n)$

Efficient because n (the number of tuples) is large, m (the number of foreign key references) is large, and l (the number of keywords) is small.
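The step from the general bound to the fixed-l bound can be spelled out as follows. This is a short worked expansion added for clarity, treating $3^l$, $2^l$, and $l$ as constants once l is fixed (for example, l = 4 gives $3^l = 81$ and $2^l = 16$).

```latex
\begin{align*}
O\bigl(3^l n + 2^l((l + \log n)\,n + m)\bigr)
  &= O\bigl(3^l n + 2^l l\,n + 2^l n \log n + 2^l m\bigr) \\
  &= O(n \log n + m) \qquad \text{for fixed } l, \\
O\bigl(2^l n\bigr) &= O(n) \qquad \text{for fixed } l.
\end{align*}
```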

SLIDE 26

Naive Dynamic Programming

Optimal Substructure T(v, p, h)

Subtree T(v, p, h): (i) rooted at node v ∈ V; (ii) with height ≤ h; (iii) containing a set of keywords p; (iv) with the minimum cost.

Tree Grow Case

[Figure: T(v, p, h) is formed by attaching the root v, through edge (v, u), to a subtree T(u, p, h − 1) rooted at a neighbor u.]

Tree Merge Case

[Figure: T(v, p, h) with p = p1 ∪ p2 is formed by merging two subtrees T(v, p1, h) and T(v, p2, h) that share the root v.]

Dynamic Programming Equation

Tree Grow: $T_g(v, p, h) = \min_{u \in N(v)} \{(v, u) \oplus T(u, p, h - 1)\}$

Tree Merge: $T_m(v, p_1 \cup p_2, h) = \min_{p_1 \cap p_2 = \emptyset} \{T(v, p_1, h) \oplus T(v, p_2, h)\}$

($\oplus$: merge two trees into a new tree)

SLIDE 30

Naive Dynamic Programming

Dynamic Programming Equation

$T(v, p, h) = \min\{T_g(v, p, h), T_m(v, p, h), T(v, p, h - 1)\}$
$T(v, p, 0) = 0$, for v containing keywords p

Naive Dynamic Programming Algorithm

- Compute T(v, p, h) in the ascending order of h
- Optimal top-1 answer: $\min_{v \in V} \{T(v, \{p_1, \dots, p_l\}, n)\}$
- Time complexity: $O(3^l n^2 + 2^l n m)$
- Space complexity: $O(2^l n^2)$
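A minimal cost-only sketch of this naive recurrence is given below. Several things here are assumptions made for illustration rather than details from the slides: nodes are numbered 0 to n − 1, keyword sets are encoded as bitmasks, the function name is hypothetical, only two height layers are kept in memory, and the loop stops early once a layer no longer changes. It returns the cost of the top-1 answer, not the tree itself.

```python
import math

def naive_group_steiner_cost(n, edges, node_keywords, num_keywords):
    """Cost of the cheapest tree covering all keywords, via the height-indexed DP."""
    adj = [[] for _ in range(n)]
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    full = (1 << num_keywords) - 1

    # Height 0: a single node covers any non-empty subset of its own keywords.
    prev = [[math.inf] * (full + 1) for _ in range(n)]
    for v in range(n):
        for p in range(1, full + 1):
            if p & node_keywords[v] == p:
                prev[v][p] = 0.0

    for _ in range(1, n):                      # heights 1 .. n-1
        cur = [row[:] for row in prev]         # the T(v, p, h-1) term
        for p in range(1, full + 1):           # proper submasks of p are smaller than p
            for v in range(n):
                best = cur[v][p]
                for u, w in adj[v]:            # tree grow: (v, u) + T(u, p, h-1)
                    best = min(best, w + prev[u][p])
                p1 = (p - 1) & p               # tree merge: T(v, p1, h) + T(v, p2, h)
                while p1:
                    best = min(best, cur[v][p1] + cur[v][p ^ p1])
                    p1 = (p1 - 1) & p
                cur[v][p] = best
        if cur == prev:                        # fixpoint reached, later layers are identical
            break
        prev = cur
    return min(prev[v][full] for v in range(n))

# Hypothetical path graph 0 - 1 - 2; node 0 holds keyword 0 and node 2 holds keyword 1.
path_edges = [(0, 1, 1.0), (1, 2, 2.0)]
print(naive_group_steiner_cost(3, path_edges, [0b01, 0b00, 0b10], 2))
```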

SLIDE 31

Speedup of Dynamic Programming

Main Idea

1 Reduce the search space of dynamic programming
2 Speed up the computation of the dynamic programming equation

Complexity

- Time complexity: $O(3^l n + 2^l((l + \log n)n + m))$
- Space complexity: $O(2^l n)$

SLIDE 32

Reducing the Search Space: T(v, p, h) → T(v, p)

Original Dynamic Programming Equation

$T(v, p, 0) = 0$ for v containing keywords p,
$T(v, p, h) = \min\{T_g(v, p, h), T_m(v, p, h), T(v, p, h - 1)\}$, where
$T_g(v, p, h) = \min_{u \in N(v)} \{(v, u) \oplus T(u, p, h - 1)\}$,
$T_m(v, p_1 \cup p_2, h) = \min_{p_1 \cap p_2 = \emptyset} \{T(v, p_1, h) \oplus T(v, p_2, h)\}$.

Presence of Height h

- Guarantees correctness
- Size of the search space: $O(n^2 2^l)$
- Serves as an order in which to compute T(v, p, h)

SLIDE 35

Reducing the Search Space: T(v, p, h) → T(v, p)

Simplified Dynamic Programming Equation

$T(v, p) = 0$ for v containing keywords p,
$T(v, p) = \min\{T_g(v, p), T_m(v, p)\}$, where
$T_g(v, p) = \min_{u \in N(v)} \{(v, u) \oplus T(u, p)\}$,
$T_m(v, p_1 \cup p_2) = \min_{p_1 \cap p_2 = \emptyset} \{T(v, p_1) \oplus T(v, p_2)\}$.

Absence of Height h

- Still guarantees correctness
- Size of the search space: $O(n 2^l)$
- Need to find another order in which to compute T(v, p)

SLIDE 36

The Order to Compute T(v, p)

Requirement of the Order to Compute T(v, p)

If T(v′, p′) is a subtree of T(v, p), then T(v′, p′) must be computed earlier than T(v, p).

Two Possible Orders

1 The ascending order of |p| (size of the keyword set p): an unpublished work by Benny Kimelfeld and Yehoshua Sagiv (www.cs.huji.ac.il/~bennyk/papers/steiner06.pdf)
2 The ascending order of s(T(v, p)) (cost of tree T(v, p)): our solution
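To make the first ordering concrete, here is a hedged sketch that fills in T(v, p) in ascending order of |p|: for each keyword set it first applies the merge rule, then resolves the grow rule with a Dijkstra pass over the nodes. It is not the cited authors' algorithm; the function name, the bitmask encoding, and the 0-based node numbering are assumptions for illustration, and only costs are computed.

```python
import heapq
import math

def steiner_cost_by_subset_size(n, edges, node_keywords, num_keywords):
    """Compute min over v of s(T(v, all keywords)), visiting keyword sets by size."""
    adj = [[] for _ in range(n)]
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    full = (1 << num_keywords) - 1
    T = [[math.inf] * (full + 1) for _ in range(n)]

    for p in sorted(range(1, full + 1), key=lambda s: bin(s).count("1")):
        heap = []
        for v in range(n):
            if node_keywords[v] & p == p:
                T[v][p] = 0.0                  # v itself contains every keyword in p
            else:
                q = (p - 1) & p                # merge two smaller, already computed sets
                while q:
                    T[v][p] = min(T[v][p], T[v][q] + T[v][p ^ q])
                    q = (q - 1) & p
            if T[v][p] < math.inf:
                heapq.heappush(heap, (T[v][p], v))
        while heap:                            # grow rule, resolved by Dijkstra for fixed p
            d, v = heapq.heappop(heap)
            if d > T[v][p]:
                continue                       # stale heap entry
            for u, w in adj[v]:
                if d + w < T[u][p]:
                    T[u][p] = d + w
                    heapq.heappush(heap, (d + w, u))
    return min(T[v][full] for v in range(n))

# Hypothetical path graph 0 - 1 - 2 with one keyword at each end.
print(steiner_cost_by_subset_size(3, [(0, 1, 1.0), (1, 2, 2.0)], [0b01, 0b00, 0b10], 2))
```

Note that this ordering fills the table for every keyword subset before the full set is reached, which is the behavior the slides contrast with the cost-ordered search.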

SLIDE 37

The Order to Compute T(v, p)

[Figure: the search space {T(v, p)} drawn against two axes, |p| and cost s(T(v, p)), with the top-1 answer marked. Animated panels compare a traversal in ascending order of |p| with a traversal in ascending order of s(T(v, p)).]

SLIDE 48

Speedup of Dynamic Programming

Comparing the Two Orders

- Both work in the same search space {T(v, p)}
- Ascending order of |p|: visits nearly the whole search space
- Ascending order of s(T(v, p)): follows a shortcut to the top-1 answer

Our Second Dynamic Programming Algorithm

- Reduce the search space from {T(v, p, h)} to {T(v, p)}
- Follow the ascending order of s(T(v, p)) to compute the dynamic programming equation of T(v, p)
- Visit only the necessary portion of the search space
- Achieve low time / space complexity
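The cost-ordered search can be sketched with a priority queue over states (v, p), in the style of Dijkstra's algorithm. This is a simplified illustration under stated assumptions, not the authors' implementation: edge weights are non-negative, keyword sets are bitmasks, nodes are numbered from 0, the function name is hypothetical, and only the cost of the top-1 answer is returned.

```python
import heapq
import math

def dpbf_top1_cost(n, edges, node_keywords, num_keywords):
    """Best-first DP over states (v, p); returns the minimum cost covering all keywords."""
    adj = [[] for _ in range(n)]
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    full = (1 << num_keywords) - 1
    best = {}                                  # (v, p) -> cheapest cost seen so far
    settled = [dict() for _ in range(n)]       # settled[v][p] = final s(T(v, p))
    heap = []

    def relax(v, p, cost):
        if cost < best.get((v, p), math.inf):
            best[(v, p)] = cost
            heapq.heappush(heap, (cost, v, p))

    # Base case: a node covers each of its own keywords at cost 0.
    for v in range(n):
        for i in range(num_keywords):
            if node_keywords[v] >> i & 1:
                relax(v, 1 << i, 0.0)

    while heap:
        cost, v, p = heapq.heappop(heap)
        if p in settled[v]:
            continue                           # stale entry, state already finalized
        if p == full:
            return cost                        # first full state popped is the top-1 cost
        settled[v][p] = cost
        for u, w in adj[v]:                    # tree grow
            relax(u, p, cost + w)
        for q, q_cost in settled[v].items():   # tree merge with disjoint settled sets at v
            if q & p == 0:
                relax(v, p | q, cost + q_cost)
    return math.inf                            # no tree covers every keyword

# Hypothetical path graph 0 - 1 - 2 with one keyword at each end.
print(dpbf_top1_cost(3, [(0, 1, 1.0), (1, 2, 2.0)], [0b01, 0b00, 0b10], 2))
```

Because states leave the queue in ascending cost, the search stops as soon as the first state covering all keywords is popped, without filling in the rest of the table.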

SLIDE 51

Finding Top-k Answers

A Progressive Method

- Find the T(v, {p_1, ..., p_l})'s with the top-k minimum costs as the top-k answers
- Advantage I: time / space complexity is unchanged
- Advantage II: no sorting is needed
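One way to read the progressive method is to keep a cost-ordered search running and report each state T(v, {p_1, ..., p_l}) as it is finalized, so answers arrive already sorted by cost. The sketch below follows that reading under several assumptions: it reports one answer per root v (so the result approximates the true top-k, as the deck's summary states), it returns (cost, root) pairs rather than trees, and the function name and bitmask encoding are hypothetical.

```python
import heapq
import math

def dpbf_topk(n, edges, node_keywords, num_keywords, k):
    """Return up to k (cost, root) pairs for full-keyword states, in ascending cost."""
    adj = [[] for _ in range(n)]
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    full = (1 << num_keywords) - 1
    best = {}
    settled = [dict() for _ in range(n)]
    heap = []

    def relax(v, p, cost):
        if cost < best.get((v, p), math.inf):
            best[(v, p)] = cost
            heapq.heappush(heap, (cost, v, p))

    for v in range(n):
        for i in range(num_keywords):
            if node_keywords[v] >> i & 1:
                relax(v, 1 << i, 0.0)

    answers = []
    while heap and len(answers) < k:
        cost, v, p = heapq.heappop(heap)
        if p in settled[v]:
            continue                           # stale entry
        settled[v][p] = cost
        if p == full:
            answers.append((cost, v))          # emitted in ascending cost, no sort needed
        for u, w in adj[v]:                    # keep growing so other roots get finalized
            relax(u, p, cost + w)
        for q, q_cost in settled[v].items():
            if q & p == 0:
                relax(v, p | q, cost + q_cost)
    return answers

# Hypothetical path graph 0 - 1 - 2 with one keyword at each end.
print(dpbf_topk(3, [(0, 1, 1.0), (1, 2, 2.0)], [0b01, 0b00, 0b10], 2, 3))
```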

SLIDE 53

Other Graph-Based Solutions

1-Star Tree

- Combines l shortest paths from leaves (containing keywords) to a root
- O(l)-approximation for linear cost functions
- BANKS I: Gaurav Bhalotia et al., ICDE 2002
- BANKS II: Varun Kacholia et al., VLDB 2005

Spanning and Cleanup

- Spans a set of trees until some of them cover all the l keywords
- O(l)-approximation for linear cost functions
- RIU-E: Wen-Syan Li et al., WWW 2001, TKDE 2002
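The 1-star idea of combining l shortest paths at a common root can be illustrated with one multi-source Dijkstra pass per keyword. This is a generic sketch of that idea only; it is not the BANKS implementation, which uses its own search strategy and scoring. The function names and the 0-based node numbering are assumptions, and the returned value is the sum of the l path lengths, which can exceed the true tree cost when paths share edges.

```python
import heapq
import math

def nearest_group_distances(adj, sources):
    """Multi-source Dijkstra: distance from every node to its nearest source."""
    dist = [math.inf] * len(adj)
    heap = []
    for s in sources:
        dist[s] = 0.0
        heapq.heappush(heap, (0.0, s))
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist[v]:
            continue                           # stale entry
        for u, w in adj[v]:
            if d + w < dist[u]:
                dist[u] = d + w
                heapq.heappush(heap, (d + w, u))
    return dist

def best_star_root(n, edges, groups):
    """Pick the root minimizing the summed shortest-path distance to each keyword group."""
    adj = [[] for _ in range(n)]
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    per_keyword = [nearest_group_distances(adj, g) for g in groups]
    totals = [sum(dist[v] for dist in per_keyword) for v in range(n)]
    root = min(range(n), key=totals.__getitem__)
    return root, totals[root]

# Hypothetical star graph: center 0, with one keyword on each of the leaves 1, 2, 3.
star_edges = [(0, 1, 1.0), (0, 2, 1.0), (0, 3, 1.0)]
print(best_star_root(4, star_edges, [{1}, {2}, {3}]))
```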

SLIDE 55

Experiment Setup

Implementation

- Compare our solution (DPBF) with RIU-E, BANKS-I, and BANKS-II
- Implement these algorithms in memory
- Use a linear cost function (sum of edge weights)
- Environment: PC with a 3.4 GHz CPU and 2 GB of memory, running Windows XP

Datasets: 10 Subsets of DBLP

100K (up to 1982), 300K (up to 1987), 500K (up to 1993), 700K (up to 1996), 900K (up to 1997), 1100K (up to 1999), 1300K (up to 2000), 1500K (up to 2001), 1700K (up to 2002), and 1900K (up to 2004)

SLIDE 56

Some Results

Varying the Size of Database n (4-Keyword Queries, k = 1)

[Figure: two plots comparing BANKS-I, BANKS-II, DPBF, and RIU-E against the number of nodes (ticks from 100 to 1900). Left: time in msec (axis ticks from 10 to 10240). Right: cost (axis ticks from 20 to 120).]

Varying the Number of Keywords l (DBLP 500K, k = 1)

[Figure: two plots comparing BANKS-I, BANKS-II, DPBF, and RIU-E for 2 to 6 keywords. Left: time in msec (axis ticks from 1 to 10000). Right: cost (axis ticks at 50 and 100).]

SLIDE 57

Summary

- Model the keyword search problem as the (top-k) minimum group Steiner tree problem.
- Propose a parameterized algorithm for this problem:
    1 Bounded time / space complexity
    2 Efficient in practice
- Support undirected / directed models, node / edge weights, and cost functions in the form of a linear combination of weights.

SLIDE 58

Related Work: Other Graph-Based Solutions

- BANKS I: Gaurav Bhalotia et al. Keyword Searching and Browsing in Databases using BANKS. ICDE'02, pages 431-440, 2002.
- BANKS II: Varun Kacholia et al. Bidirectional Expansion For Keyword Search on Graph Databases. VLDB'05, pages 505-516, 2005.
- RIU-E: Wen-Syan Li et al. Query Relaxation by Structure and Semantics for Retrieval of Logical Web Documents. IEEE Trans. Knowl. Data Eng., 14(4):768-791, 2002.
- Benny Kimelfeld et al. Finding and Approximating Top-k Answers in Keyword Proximity Search. PODS'06, pages 173-182, 2006.

SLIDE 59

Related Work: Database-Based Solutions

- Sanjay Agrawal et al. DBXplorer: A System for Keyword-Based Search over Relational Databases. ICDE'02, pages 5-16, 2002.
- Vagelis Hristidis et al. Efficient IR-Style Keyword Search over Relational Databases. VLDB'03, pages 850-861, 2003.
- Fang Liu et al. Effective Keyword Search in Relational Databases. SIGMOD'06, pages 563-574, 2006.
- Yi Luo et al. SPARK: Top-k Keyword Query in Relational Databases. To appear in SIGMOD'07, 2007.