on the fly token similarity joins in relational databases
play

On-the-Fly Token Similarity Joins in Relational Databases N. Augsten - PowerPoint PPT Presentation

On-the-Fly Token Similarity Joins in Relational Databases N. Augsten 1 A. Miraglia 2 T. Neumann 3 A. Kemper 3 1 University of Salzburg, Austria nikolaus.augsten@sbg.ac.at 2 VU University Amsterdam, Netherlands a.miraglia@student.vu.nl 3 TU M


  1. On-the-Fly Token Similarity Joins in Relational Databases N. Augsten 1 A. Miraglia 2 T. Neumann 3 A. Kemper 3 1 University of Salzburg, Austria nikolaus.augsten@sbg.ac.at 2 VU University Amsterdam, Netherlands a.miraglia@student.vu.nl 3 TU M¨ unchen, Germany { neumann,kemper } @in.tum.de June 26, 2014 SIGMOD, Snowbird, Utah Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 1 / 21

  2. Outline 1 Motivation 2 The Tokenize Operator Efficient Implementation Query Optimization 3 Experiments Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 2 / 21

  3. Motivation Outline 1 Motivation 2 The Tokenize Operator Efficient Implementation Query Optimization 3 Experiments Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 3 / 21

  4. Motivation Token Similarity Join R S A B snowbird ⋊ ⋉ sim( A , B ) ≥ 70% snowbasin canyons snowbirds . . . sim( A , B ) = | α ( A ) ∩ α ( B ) | . . . | α ( A ) ∪ α ( B ) | Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 4 / 21

  5. Motivation Token Similarity Join R S A B snowbird ⋊ ⋉ sim( A , B ) ≥ 70% snowbasin canyons snowbirds . . . sim( A , B ) = | α ( A ) ∩ α ( B ) | . . . | α ( A ) ∪ α ( B ) | α ( A ) α ( B ) #s ba bi sn as ir no si rd ow in d# wb n# 14 = 36% ✗ sim( A , B ) = 5 Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 4 / 21

  6. Motivation Token Similarity Join R S A B snowbird ⋊ ⋉ sim( A , B ) ≥ 70% snowbasin canyons snowbirds . . . sim( A , B ) = | α ( A ) ∩ α ( B ) | . . . | α ( A ) ∪ α ( B ) | α ( A ) α ( B ) α ( A ) α ( B ) #s ba bi #s sn sn as ir no ow ds no si d# rd wb bi s# ow in d# ir rd wb n# 14 = 36% ✗ 11 = 73% ✓ sim( A , B ) = 5 sim( A , B ) = 8 Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 4 / 21

  7. Motivation Token Generation in Similarity Joins recived little attention well studied generate R token set similarity ∼ ⊲ ⊳ S R join on generate tokens S token Part-Enum (VLDB’06) All-Pairs (WWW’07) precomputed tokens assumed PP-Join (WWW’08) token generation not part of MP-Join (Inf. Syst.’11) query plan Adapt-Join (SIGMOD’12) . . . Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 5 / 21

  8. Motivation Token Generation in Similarity Joins recived little attention well studied generate R token set similarity ∼ ⊲ ⊳ S R join on generate tokens S token Part-Enum (VLDB’06) All-Pairs (WWW’07) precomputed tokens assumed PP-Join (WWW’08) token generation not part of MP-Join (Inf. Syst.’11) query plan Adapt-Join (SIGMOD’12) . . . Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 5 / 21

  9. Motivation Token Generation in Similarity Joins recived little attention well studied generate R token set similarity ∼ ⊲ ⊳ S R join on generate tokens S token Part-Enum (VLDB’06) All-Pairs (WWW’07) precomputed tokens assumed PP-Join (WWW’08) token generation not part of MP-Join (Inf. Syst.’11) query plan Adapt-Join (SIGMOD’12) . . . Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 5 / 21

  10. Motivation Token Generation in Similarity Joins recived little attention well studied generate R token set similarity ∼ ⊲ ⊳ S R join on generate tokens S token Part-Enum (VLDB’06) All-Pairs (WWW’07) precomputed tokens assumed Goal: integrate token generation into query plan! PP-Join (WWW’08) token generation not part of MP-Join (Inf. Syst.’11) query plan Adapt-Join (SIGMOD’12) . . . Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 5 / 21

  11. Motivation Generating Tokens Stand-alone client: export data, generate tokens, import tokens ➥ overhead for export/import ➥ no integration into query plan ➥ only good for precomputation Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 6 / 21

  12. Motivation Generating Tokens Stand-alone client: export data, generate tokens, import tokens ➥ overhead for export/import ➥ no integration into query plan ➥ only good for precomputation Table function: ➥ UDF generates tokens on-the-fly ➥ table function used like a table in query Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 6 / 21

  13. Motivation State-of-the-Art: Table Function Customer tables R , S : ➥ join customers with similar names ➥ only customers from ’SLC’ and ’Salt Lake’ SELECT TR.ssn, TS.ssn FROM tblfunc(’R’, ’name’) TR, tblfunc(’S’, ’name’) TS WHERE TR.city = ’SLC’ AND TS.county=’Salt Lake’ AND TR.token = TS.token GROUP BY TR.ssn, TS.ssn HAVING COUNT(*) > = k; Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 7 / 21

  14. Motivation State-of-the-Art: Table Function Customer tables R , S : ➥ join customers with similar names ➥ only customers from ’SLC’ and ’Salt Lake’ SELECT TR.ssn, TS.ssn FROM tblfunc(’R’, ’name’) TR, tblfunc(’S’, ’name’) TS WHERE TR.city = ’SLC’ AND TS.county=’Salt Lake’ AND TR.token = TS.token GROUP BY TR.ssn, TS.ssn HAVING COUNT(*) > = k; Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 7 / 21

  15. Motivation State-of-the-Art: Table Function π TR . ssn , TS . ssn Customer tables R , S : ➥ join customers with σ olap ≥ k ◦ true cardinality similar names γ TR . ssn , TS . ssn ;COUNT( ∗ ) → olap ➥ only customers from ’SLC’ and ’Salt Lake’ ⋉ TR . token = TS . token ⋊ SELECT TR.ssn, TS.ssn 10 2’000 10 1’000 FROM tblfunc(’R’, ’name’) TR, σ city = ’SLC’ σ county = ’Salt Lake’ tblfunc(’S’, ’name’) TS WHERE TR.city = ’SLC’ 1000 100’000 1000 200’000 AND TS.county=’Salt Lake’ tblfunc / TS tblfunc/TR AND TR.token = TS.token GROUP BY TR.ssn, TS.ssn 20’000 10’000 HAVING COUNT(*) > = k; R ( ssn , name , city ) S ( ssn , name , county ) Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 7 / 21

  16. Motivation State-of-the-Art: Table Function π TR . ssn , TS . ssn Customer tables R , S : ➥ join customers with σ olap ≥ k ◦ true cardinality similar names γ TR . ssn , TS . ssn ;COUNT( ∗ ) → olap ➥ only customers from ’SLC’ and ’Salt Lake’ ⋉ TR . token = TS . token ⋊ SELECT TR.ssn, TS.ssn 10 2’000 10 1’000 FROM tblfunc(’R’, ’name’) TR, σ city = ’SLC’ [1%] σ county = ’Salt Lake’ [5%] tblfunc(’S’, ’name’) TS WHERE TR.city = ’SLC’ 1000 100’000 1000 200’000 AND TS.county=’Salt Lake’ tblfunc / TS tblfunc/TR AND TR.token = TS.token GROUP BY TR.ssn, TS.ssn 20’000 10’000 HAVING COUNT(*) > = k; R ( ssn , name , city ) S ( ssn , name , county ) black box: selection and projection not pushed down ➥ tokens computed for all customers ➥ too many attributes replicated (name, city, county) Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 7 / 21

  17. Motivation State-of-the-Art: Table Function π TR . ssn , TS . ssn Customer tables R , S : ➥ join customers with σ olap ≥ k ◦ true cardinality similar names ◦ estimated card. γ TR . ssn , TS . ssn ;COUNT( ∗ ) → olap ➥ only customers from ’SLC’ and ’Salt Lake’ ⋊ ⋉ TR . token = TS . token SELECT TR.ssn, TS.ssn 10 2’000 10 1’000 FROM tblfunc(’R’, ’name’) TR, σ city = ’SLC’ [1%] σ county = ’Salt Lake’ [5%] tblfunc(’S’, ’name’) TS WHERE TR.city = ’SLC’ 1000 100’000 1000 200’000 AND TS.county=’Salt Lake’ tblfunc / TS tblfunc/TR AND TR.token = TS.token GROUP BY TR.ssn, TS.ssn 20’000 10’000 HAVING COUNT(*) > = k; R ( ssn , name , city ) S ( ssn , name , county ) black box: selection and projection not pushed down ➥ tokens computed for all customers ➥ too many attributes replicated (name, city, county) unknown cardinality of table function (often assumed a constant) Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 7 / 21

  18. Motivation State-of-the-Art: Table Function π TR . ssn , TS . ssn Customer tables R , S : ➥ join customers with σ olap ≥ k ◦ true cardinality similar names ◦ estimated card. γ TR . ssn , TS . ssn ;COUNT( ∗ ) → olap ➥ only customers from ’SLC’ and ’Salt Lake’ ⋉ TR . token = TS . token ⋊ SELECT TR.ssn, TS.ssn 10 2’000 10 1’000 FROM tblfunc(’R’, ’name’) TR, σ city = ’SLC’ [1%] σ county = ’Salt Lake’ [5%] tblfunc(’S’, ’name’) TS WHERE TR.city = ’SLC’ 1000 100’000 1000 200’000 AND TS.county=’Salt Lake’ tblfunc / TS tblfunc/TR AND TR.token = TS.token GROUP BY TR.ssn, TS.ssn 20’000 10’000 HAVING COUNT(*) > = k; R ( ssn , name , city ) S ( ssn , name , county ) Problem: poor query plans with table functions. Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 7 / 21

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend