On-the-Fly Token Similarity Joins in Relational Databases N. Augsten - - PowerPoint PPT Presentation

on the fly token similarity joins in relational databases
SMART_READER_LITE
LIVE PREVIEW

On-the-Fly Token Similarity Joins in Relational Databases N. Augsten - - PowerPoint PPT Presentation

On-the-Fly Token Similarity Joins in Relational Databases N. Augsten 1 A. Miraglia 2 T. Neumann 3 A. Kemper 3 1 University of Salzburg, Austria nikolaus.augsten@sbg.ac.at 2 VU University Amsterdam, Netherlands a.miraglia@student.vu.nl 3 TU M


slide-1
SLIDE 1

On-the-Fly Token Similarity Joins in Relational Databases

  • N. Augsten1
  • A. Miraglia2
  • T. Neumann3
  • A. Kemper3

1University of Salzburg, Austria

nikolaus.augsten@sbg.ac.at

2VU University Amsterdam, Netherlands

a.miraglia@student.vu.nl

3TU M¨

unchen, Germany {neumann,kemper}@in.tum.de

June 26, 2014 SIGMOD, Snowbird, Utah

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 1 / 21

slide-2
SLIDE 2

Outline

1 Motivation 2 The Tokenize Operator

Efficient Implementation Query Optimization

3 Experiments

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 2 / 21

slide-3
SLIDE 3

Motivation

Outline

1 Motivation 2 The Tokenize Operator

Efficient Implementation Query Optimization

3 Experiments

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 3 / 21

slide-4
SLIDE 4

Motivation

Token Similarity Join

R A snowbird canyons . . .

⋊ ⋉sim(A,B)≥70%

sim(A, B) = |α(A)∩α(B)|

|α(A)∪α(B)|

S B snowbasin snowbirds . . .

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 4 / 21

slide-5
SLIDE 5

Motivation

Token Similarity Join

R A snowbird canyons . . .

⋊ ⋉sim(A,B)≥70%

sim(A, B) = |α(A)∩α(B)|

|α(A)∪α(B)|

S B snowbasin snowbirds . . . bi ir rd d# ba as si in n# α(A) α(B) #s sn no

  • w

wb sim(A, B) = 5

14 = 36% ✗

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 4 / 21

slide-6
SLIDE 6

Motivation

Token Similarity Join

R A snowbird canyons . . .

⋊ ⋉sim(A,B)≥70%

sim(A, B) = |α(A)∩α(B)|

|α(A)∪α(B)|

S B snowbasin snowbirds . . . bi ir rd d# ba as si in n# α(A) α(B) #s sn no

  • w

wb sim(A, B) = 5

14 = 36% ✗

d# ds s# α(A) α(B) #s sn no ow wb bi ir rd sim(A, B) = 8

11 = 73% ✓

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 4 / 21

slide-7
SLIDE 7

Motivation

Token Generation in Similarity Joins

R S generate token generate token set similarity join on tokens R

⊲ ⊳ S well studied recived little attention

precomputed tokens assumed token generation not part of query plan Part-Enum (VLDB’06) All-Pairs (WWW’07) PP-Join (WWW’08) MP-Join (Inf. Syst.’11) Adapt-Join (SIGMOD’12) . . .

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 5 / 21

slide-8
SLIDE 8

Motivation

Token Generation in Similarity Joins

R S generate token generate token set similarity join on tokens R

⊲ ⊳ S well studied recived little attention

precomputed tokens assumed token generation not part of query plan Part-Enum (VLDB’06) All-Pairs (WWW’07) PP-Join (WWW’08) MP-Join (Inf. Syst.’11) Adapt-Join (SIGMOD’12) . . .

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 5 / 21

slide-9
SLIDE 9

Motivation

Token Generation in Similarity Joins

R S generate token generate token set similarity join on tokens R

⊲ ⊳ S well studied recived little attention

precomputed tokens assumed token generation not part of query plan Part-Enum (VLDB’06) All-Pairs (WWW’07) PP-Join (WWW’08) MP-Join (Inf. Syst.’11) Adapt-Join (SIGMOD’12) . . .

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 5 / 21

slide-10
SLIDE 10

Motivation

Token Generation in Similarity Joins

R S generate token generate token set similarity join on tokens R

⊲ ⊳ S well studied recived little attention

precomputed tokens assumed token generation not part of query plan Part-Enum (VLDB’06) All-Pairs (WWW’07) PP-Join (WWW’08) MP-Join (Inf. Syst.’11) Adapt-Join (SIGMOD’12) . . .

Goal: integrate token generation into query plan!

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 5 / 21

slide-11
SLIDE 11

Motivation

Generating Tokens

Stand-alone client: export data, generate tokens, import tokens

➥ overhead for export/import ➥ no integration into query plan ➥ only good for precomputation

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 6 / 21

slide-12
SLIDE 12

Motivation

Generating Tokens

Stand-alone client: export data, generate tokens, import tokens

➥ overhead for export/import ➥ no integration into query plan ➥ only good for precomputation

Table function:

➥ UDF generates tokens on-the-fly ➥ table function used like a table in query

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 6 / 21

slide-13
SLIDE 13

Motivation

State-of-the-Art: Table Function

Customer tables R, S: ➥ join customers with similar names ➥ only customers from ’SLC’ and ’Salt Lake’

SELECT TR.ssn, TS.ssn FROM tblfunc(’R’, ’name’) TR, tblfunc(’S’, ’name’) TS WHERE TR.city = ’SLC’ AND TS.county=’Salt Lake’ AND TR.token = TS.token GROUP BY TR.ssn, TS.ssn HAVING COUNT(*) >= k;

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 7 / 21

slide-14
SLIDE 14

Motivation

State-of-the-Art: Table Function

Customer tables R, S: ➥ join customers with similar names ➥ only customers from ’SLC’ and ’Salt Lake’

SELECT TR.ssn, TS.ssn FROM tblfunc(’R’, ’name’) TR, tblfunc(’S’, ’name’) TS WHERE TR.city = ’SLC’ AND TS.county=’Salt Lake’ AND TR.token = TS.token GROUP BY TR.ssn, TS.ssn HAVING COUNT(*) >= k;

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 7 / 21

slide-15
SLIDE 15

Motivation

State-of-the-Art: Table Function

Customer tables R, S: ➥ join customers with similar names ➥ only customers from ’SLC’ and ’Salt Lake’

SELECT TR.ssn, TS.ssn FROM tblfunc(’R’, ’name’) TR, tblfunc(’S’, ’name’) TS WHERE TR.city = ’SLC’ AND TS.county=’Salt Lake’ AND TR.token = TS.token GROUP BY TR.ssn, TS.ssn HAVING COUNT(*) >= k;

πTR.ssn,TS.ssn σolap≥k γTR.ssn,TS.ssn;COUNT(∗)→olap ⋊ ⋉TR.token=TS.token σcity=’SLC’ tblfunc/TR

R(ssn, name, city)

20’000 1000 200’000 10 2’000

σcounty=’Salt Lake’ tblfunc/TS

S(ssn, name, county)

10’000 1000 100’000 10 1’000

  • true cardinality

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 7 / 21

slide-16
SLIDE 16

Motivation

State-of-the-Art: Table Function

Customer tables R, S: ➥ join customers with similar names ➥ only customers from ’SLC’ and ’Salt Lake’

SELECT TR.ssn, TS.ssn FROM tblfunc(’R’, ’name’) TR, tblfunc(’S’, ’name’) TS WHERE TR.city = ’SLC’ AND TS.county=’Salt Lake’ AND TR.token = TS.token GROUP BY TR.ssn, TS.ssn HAVING COUNT(*) >= k;

πTR.ssn,TS.ssn σolap≥k γTR.ssn,TS.ssn;COUNT(∗)→olap ⋊ ⋉TR.token=TS.token σcity=’SLC’[1%] tblfunc/TR

R(ssn, name, city)

20’000 1000 200’000 10 2’000

σcounty=’Salt Lake’[5%] tblfunc/TS

S(ssn, name, county)

10’000 1000 100’000 10 1’000

  • true cardinality

black box: selection and projection not pushed down

➥ tokens computed for all customers ➥ too many attributes replicated (name, city, county)

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 7 / 21

slide-17
SLIDE 17

Motivation

State-of-the-Art: Table Function

Customer tables R, S: ➥ join customers with similar names ➥ only customers from ’SLC’ and ’Salt Lake’

SELECT TR.ssn, TS.ssn FROM tblfunc(’R’, ’name’) TR, tblfunc(’S’, ’name’) TS WHERE TR.city = ’SLC’ AND TS.county=’Salt Lake’ AND TR.token = TS.token GROUP BY TR.ssn, TS.ssn HAVING COUNT(*) >= k;

πTR.ssn,TS.ssn σolap≥k γTR.ssn,TS.ssn;COUNT(∗)→olap ⋊ ⋉TR.token=TS.token σcity=’SLC’[1%] tblfunc/TR

R(ssn, name, city)

20’000 1000 200’000 10 2’000

σcounty=’Salt Lake’[5%] tblfunc/TS

S(ssn, name, county)

10’000 1000 100’000 10 1’000

  • true cardinality
  • estimated card.

black box: selection and projection not pushed down

➥ tokens computed for all customers ➥ too many attributes replicated (name, city, county)

unknown cardinality of table function (often assumed a constant)

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 7 / 21

slide-18
SLIDE 18

Motivation

State-of-the-Art: Table Function

Customer tables R, S: ➥ join customers with similar names ➥ only customers from ’SLC’ and ’Salt Lake’

SELECT TR.ssn, TS.ssn FROM tblfunc(’R’, ’name’) TR, tblfunc(’S’, ’name’) TS WHERE TR.city = ’SLC’ AND TS.county=’Salt Lake’ AND TR.token = TS.token GROUP BY TR.ssn, TS.ssn HAVING COUNT(*) >= k;

πTR.ssn,TS.ssn σolap≥k γTR.ssn,TS.ssn;COUNT(∗)→olap ⋊ ⋉TR.token=TS.token σcity=’SLC’[1%] tblfunc/TR

R(ssn, name, city)

20’000 1000 200’000 10 2’000

σcounty=’Salt Lake’[5%] tblfunc/TS

S(ssn, name, county)

10’000 1000 100’000 10 1’000

  • true cardinality
  • estimated card.

Problem: poor query plans with table functions.

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 7 / 21

slide-19
SLIDE 19

The Tokenize Operator

Outline

1 Motivation 2 The Tokenize Operator

Efficient Implementation Query Optimization

3 Experiments

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 8 / 21

slide-20
SLIDE 20

The Tokenize Operator

Solution: Tokenize Operator

Tokenize τ is a relational operator defined as follows: τα(A)(R) = {r ◦ tk | r ∈ R, tk ∈ α(t.A)} R

A snowbird canyons . . .

τα(A)→token(R)

A token snowbird #s snowbird sn snowbird no · · · · · · canyons #c canyons ca canyons an · · · · · ·

τα(A)

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 9 / 21

slide-21
SLIDE 21

The Tokenize Operator

Solution: Tokenize Operator

Tokenize τ is a relational operator defined as follows: τα(A)(R) = {r ◦ tk | r ∈ R, tk ∈ α(t.A)} R is a relation, A ⊆ schema(R) is a sequence of attributes. R

A snowbird canyons . . .

τα(A)→token(R)

A token snowbird #s snowbird sn snowbird no · · · · · · canyons #c canyons ca canyons an · · · · · ·

τα(A)

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 9 / 21

slide-22
SLIDE 22

The Tokenize Operator

Solution: Tokenize Operator

Tokenize τ is a relational operator defined as follows: τα(A)(R) = {r ◦ tk | r ∈ R, tk ∈ α(t.A)} R is a relation, A ⊆ schema(R) is a sequence of attributes. α is token function

➥ computes tokens for a single value (e.g., q-grams for a string)

R

A snowbird canyons . . .

τα(A)→token(R)

A token snowbird #s snowbird sn snowbird no · · · · · · canyons #c canyons ca canyons an · · · · · ·

τα(A)

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 9 / 21

slide-23
SLIDE 23

The Tokenize Operator

Solution: Tokenize Operator

Tokenize τ is a relational operator defined as follows: τα(A)(R) = {r ◦ tk | r ∈ R, tk ∈ α(t.A)} R is a relation, A ⊆ schema(R) is a sequence of attributes. α is token function

➥ computes tokens for a single value (e.g., q-grams for a string)

Output: relation with tokens R

A snowbird canyons . . .

τα(A)→token(R)

A token snowbird #s snowbird sn snowbird no · · · · · · canyons #c canyons ca canyons an · · · · · ·

τα(A)

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 9 / 21

slide-24
SLIDE 24

The Tokenize Operator

Solution: Tokenize Operator

Tokenize τ is a relational operator defined as follows: τα(A)(R) = {r ◦ tk | r ∈ R, tk ∈ α(t.A)} R is a relation, A ⊆ schema(R) is a sequence of attributes. α is token function

➥ computes tokens for a single value (e.g., q-grams for a string)

Output: relation with tokens R

A snowbird canyons . . .

τα(A)→token(R)

A token snowbird #s snowbird sn snowbird no · · · · · · canyons #c canyons ca canyons an · · · · · ·

τα(A) Note:

➥ α is type specific (like MAX, AVG) ➥ tokenize τ abstracts from types (like GROUP)

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 9 / 21

slide-25
SLIDE 25

The Tokenize Operator Efficient Implementation

Outline

1 Motivation 2 The Tokenize Operator

Efficient Implementation Query Optimization

3 Experiments

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 10 / 21

slide-26
SLIDE 26

The Tokenize Operator Efficient Implementation

Virtual Tuple Replication/1

R

name role conference curtis dyreson general chair sigmod feifei li general chair sigmod tamer ozsu program chair sigmod · · · · · · · · ·

τα(name)→token(R)

name role conference token curtis dyreson general chair sigmod #c curtis dyreson general chair sigmod cu curtis dyreson general chair sigmod ur · · · · · · · · · · · · feifei li general chair sigmod #f feifei li general chair sigmod fe · · · · · · · · · · · · tamer ozsu program chair sigmod #t tamer ozsu program chair sigmod ta · · · · · · · · · · · ·

τα(name)→token;GID(R)

name role conference token GID curtis dyreson general chair sigmod #c 1 curtis dyreson general chair sigmod cu 1 curtis dyreson general chair sigmod ur 1 · · · · · · · · · · · · · · · feifei li general chair sigmod #f 2 feifei li general chair sigmod fe 2 · · · · · · · · · · · · · · · tamer ozsu program chair sigmod #t 3 tamer ozsu program chair sigmod ta 3 · · · · · · · · · · · · · · · VTR Bit Array 1 1 1 1 1 1 1 1 1 · · · · · · · · · · · · · · · 1 1 1 1 1 1 1 · · · · · · · · · · · · · · · 1 1 1 1 1 1 1 · · · · · · · · · · · · · · ·

expensive!

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 11 / 21

slide-27
SLIDE 27

The Tokenize Operator Efficient Implementation

Virtual Tuple Replication/1

R

name role conference curtis dyreson general chair sigmod feifei li general chair sigmod tamer ozsu program chair sigmod · · · · · · · · ·

τα(name)→token(R)

name role conference token curtis dyreson general chair sigmod #c curtis dyreson general chair sigmod cu curtis dyreson general chair sigmod ur · · · · · · · · · · · · feifei li general chair sigmod #f feifei li general chair sigmod fe · · · · · · · · · · · · tamer ozsu program chair sigmod #t tamer ozsu program chair sigmod ta · · · · · · · · · · · ·

τα(name)→token;GID(R)

name role conference token GID curtis dyreson general chair sigmod #c 1 curtis dyreson general chair sigmod cu 1 curtis dyreson general chair sigmod ur 1 · · · · · · · · · · · · · · · feifei li general chair sigmod #f 2 feifei li general chair sigmod fe 2 · · · · · · · · · · · · · · · tamer ozsu program chair sigmod #t 3 tamer ozsu program chair sigmod ta 3 · · · · · · · · · · · · · · · VTR Bit Array 1 1 1 1 1 1 1 1 1 · · · · · · · · · · · · · · · 1 1 1 1 1 1 1 · · · · · · · · · · · · · · · 1 1 1 1 1 1 1 · · · · · · · · · · · · · · ·

expensive!

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 11 / 21

slide-28
SLIDE 28

The Tokenize Operator Efficient Implementation

Virtual Tuple Replication/1

R

name role conference curtis dyreson general chair sigmod feifei li general chair sigmod tamer ozsu program chair sigmod · · · · · · · · ·

τα(name)→token(R)

name role conference token curtis dyreson general chair sigmod #c curtis dyreson general chair sigmod cu curtis dyreson general chair sigmod ur · · · · · · · · · · · · feifei li general chair sigmod #f feifei li general chair sigmod fe · · · · · · · · · · · · tamer ozsu program chair sigmod #t tamer ozsu program chair sigmod ta · · · · · · · · · · · ·

τα(name)→token;GID(R)

name role conference token GID curtis dyreson general chair sigmod #c 1 curtis dyreson general chair sigmod cu 1 curtis dyreson general chair sigmod ur 1 · · · · · · · · · · · · · · · feifei li general chair sigmod #f 2 feifei li general chair sigmod fe 2 · · · · · · · · · · · · · · · tamer ozsu program chair sigmod #t 3 tamer ozsu program chair sigmod ta 3 · · · · · · · · · · · · · · · VTR Bit Array 1 1 1 1 1 1 1 1 1 · · · · · · · · · · · · · · · 1 1 1 1 1 1 1 · · · · · · · · · · · · · · · 1 1 1 1 1 1 1 · · · · · · · · · · · · · · ·

physical replication is expensive

expensive!

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 11 / 21

slide-29
SLIDE 29

The Tokenize Operator Efficient Implementation

Virtual Tuple Replication/1

R

name role conference curtis dyreson general chair sigmod feifei li general chair sigmod tamer ozsu program chair sigmod · · · · · · · · ·

τα(name)→token(R)

name role conference token curtis dyreson general chair sigmod #c curtis dyreson general chair sigmod cu curtis dyreson general chair sigmod ur · · · · · · · · · · · · feifei li general chair sigmod #f feifei li general chair sigmod fe · · · · · · · · · · · · tamer ozsu program chair sigmod #t tamer ozsu program chair sigmod ta · · · · · · · · · · · ·

τα(name)→token;GID(R)

name role conference token GID curtis dyreson general chair sigmod #c 1 curtis dyreson general chair sigmod cu 1 curtis dyreson general chair sigmod ur 1 · · · · · · · · · · · · · · · feifei li general chair sigmod #f 2 feifei li general chair sigmod fe 2 · · · · · · · · · · · · · · · tamer ozsu program chair sigmod #t 3 tamer ozsu program chair sigmod ta 3 · · · · · · · · · · · · · · · VTR Bit Array 1 1 1 1 1 1 1 1 1 · · · · · · · · · · · · · · · 1 1 1 1 1 1 1 · · · · · · · · · · · · · · · 1 1 1 1 1 1 1 · · · · · · · · · · · · · · ·

physical replication is expensive VTR avoids physical replication

expensive!

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 11 / 21

slide-30
SLIDE 30

The Tokenize Operator Efficient Implementation

Virtual Tuple Replication/1

R

name role conference curtis dyreson general chair sigmod feifei li general chair sigmod tamer ozsu program chair sigmod · · · · · · · · ·

τα(name)→token(R)

name role conference token curtis dyreson general chair sigmod #c curtis dyreson general chair sigmod cu curtis dyreson general chair sigmod ur · · · · · · · · · · · · feifei li general chair sigmod #f feifei li general chair sigmod fe · · · · · · · · · · · · tamer ozsu program chair sigmod #t tamer ozsu program chair sigmod ta · · · · · · · · · · · ·

τα(name)→token;GID(R)

name role conference token GID curtis dyreson general chair sigmod #c 1 curtis dyreson general chair sigmod cu 1 curtis dyreson general chair sigmod ur 1 · · · · · · · · · · · · · · · feifei li general chair sigmod #f 2 feifei li general chair sigmod fe 2 · · · · · · · · · · · · · · · tamer ozsu program chair sigmod #t 3 tamer ozsu program chair sigmod ta 3 · · · · · · · · · · · · · · · VTR Bit Array 1 1 1 1 1 1 1 1 1 · · · · · · · · · · · · · · · 1 1 1 1 1 1 1 · · · · · · · · · · · · · · · 1 1 1 1 1 1 1 · · · · · · · · · · · · · · ·

physical replication is expensive VTR avoids physical replication

  • 1. mark group with GID

expensive!

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 11 / 21

slide-31
SLIDE 31

The Tokenize Operator Efficient Implementation

Virtual Tuple Replication/1

R

name role conference curtis dyreson general chair sigmod feifei li general chair sigmod tamer ozsu program chair sigmod · · · · · · · · ·

τα(name)→token(R)

name role conference token curtis dyreson general chair sigmod #c curtis dyreson general chair sigmod cu curtis dyreson general chair sigmod ur · · · · · · · · · · · · feifei li general chair sigmod #f feifei li general chair sigmod fe · · · · · · · · · · · · tamer ozsu program chair sigmod #t tamer ozsu program chair sigmod ta · · · · · · · · · · · ·

τα(name)→token;GID(R)

name role conference token GID curtis dyreson general chair sigmod #c 1 curtis dyreson general chair sigmod cu 1 curtis dyreson general chair sigmod ur 1 · · · · · · · · · · · · · · · feifei li general chair sigmod #f 2 feifei li general chair sigmod fe 2 · · · · · · · · · · · · · · · tamer ozsu program chair sigmod #t 3 tamer ozsu program chair sigmod ta 3 · · · · · · · · · · · · · · · VTR Bit Array 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 · · · · · · · · · · · · · · · 1 1 1 1 1 1 1 1 1 1 · · · · · · · · · · · · · · · 1 1 1 1 1 1 1 1 1 1 · · · · · · · · · · · · · · ·

physical replication is expensive VTR avoids physical replication

  • 1. mark group with GID
  • 2. create VTR bit array

expensive!

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 11 / 21

slide-32
SLIDE 32

The Tokenize Operator Efficient Implementation

Virtual Tuple Replication/1

R

name role conference curtis dyreson general chair sigmod feifei li general chair sigmod tamer ozsu program chair sigmod · · · · · · · · ·

τα(name)→token(R)

name role conference token curtis dyreson general chair sigmod #c curtis dyreson general chair sigmod cu curtis dyreson general chair sigmod ur · · · · · · · · · · · · feifei li general chair sigmod #f feifei li general chair sigmod fe · · · · · · · · · · · · tamer ozsu program chair sigmod #t tamer ozsu program chair sigmod ta · · · · · · · · · · · ·

τα(name)→token;GID(R)

name role conference token GID curtis dyreson general chair sigmod #c 1 ⊥ ⊥ ⊥ cu 1 ⊥ ⊥ ⊥ ur 1 · · · · · · · · · · · · · · · feifei li general chair sigmod #f 2 ⊥ ⊥ ⊥ fe 2 · · · · · · · · · · · · · · · tamer ozsu program chair sigmod #t 3 ⊥ ⊥ ⊥ ta 3 · · · · · · · · · · · · · · · VTR Bit Array 1 1 1 1 1 1 1 1 1 · · · · · · · · · · · · · · · 1 1 1 1 1 1 1 · · · · · · · · · · · · · · · 1 1 1 1 1 1 1 · · · · · · · · · · · · · · ·

physical replication is expensive VTR avoids physical replication

  • 1. mark group with GID
  • 2. create VTR bit array
  • 3. keep single bit for each replicated attribute

expensive!

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 11 / 21

slide-33
SLIDE 33

The Tokenize Operator Efficient Implementation

Virtual Tuple Replication/1

R

name role conference curtis dyreson general chair sigmod feifei li general chair sigmod tamer ozsu program chair sigmod · · · · · · · · ·

τα(name)→token(R)

name role conference token curtis dyreson general chair sigmod #c curtis dyreson general chair sigmod cu curtis dyreson general chair sigmod ur · · · · · · · · · · · · feifei li general chair sigmod #f feifei li general chair sigmod fe · · · · · · · · · · · · tamer ozsu program chair sigmod #t tamer ozsu program chair sigmod ta · · · · · · · · · · · ·

τα(name)→token;GID(R)

name role conference token GID curtis dyreson general chair sigmod #c 1 ⊥ ⊥ ⊥ cu 1 ⊥ ⊥ ⊥ ur 1 · · · · · · · · · · · · · · · feifei li general chair sigmod #f 2 ⊥ ⊥ ⊥ fe 2 · · · · · · · · · · · · · · · tamer ozsu program chair sigmod #t 3 ⊥ ⊥ ⊥ ta 3 · · · · · · · · · · · · · · · VTR Bit Array 1 1 1 1 1 1 1 1 1 · · · · · · · · · · · · · · · 1 1 1 1 1 1 1 · · · · · · · · · · · · · · · 1 1 1 1 1 1 1 · · · · · · · · · · · · · · ·

physical replication is expensive VTR avoids physical replication

  • 1. mark group with GID
  • 2. create VTR bit array
  • 3. keep single bit for each replicated attribute

bit array + grouping restores original values

expensive!

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 11 / 21

slide-34
SLIDE 34

The Tokenize Operator Efficient Implementation

Virtual Tuple Replication/2

Example: count number of tokens for each name γname;COUNT(*)→cnt(τα(name)(R))

➥ name value replicated for each token ➥ replicated values removed by grouping operator

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 12 / 21

slide-35
SLIDE 35

The Tokenize Operator Efficient Implementation

Virtual Tuple Replication/2

Example: count number of tokens for each name γname;COUNT(*)→cnt(τα(name)(R))

➥ name value replicated for each token ➥ replicated values removed by grouping operator

VTR version of query: γ

COUNT(*)→cnt(τα(name)→tk

(R))

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 12 / 21

slide-36
SLIDE 36

The Tokenize Operator Efficient Implementation

Virtual Tuple Replication/2

Example: count number of tokens for each name γname;COUNT(*)→cnt(τα(name)(R))

➥ name value replicated for each token ➥ replicated values removed by grouping operator

VTR version of query: γGID;

COUNT(*)→cnt(τα(name)→tk;GID(R))

➥ generate and group by GID

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 12 / 21

slide-37
SLIDE 37

The Tokenize Operator Efficient Implementation

Virtual Tuple Replication/2

Example: count number of tokens for each name γname;COUNT(*)→cnt(τα(name)(R))

➥ name value replicated for each token ➥ replicated values removed by grouping operator

VTR version of query: γGID; REST(name), COUNT(*)→cnt(τα(name)→tk;GID(R))

➥ generate and group by GID ➥ restore name attribute

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 12 / 21

slide-38
SLIDE 38

The Tokenize Operator Efficient Implementation

Virtual Tuple Replication/2

Example: count number of tokens for each name γname;COUNT(*)→cnt(τα(name)(R))

➥ name value replicated for each token ➥ replicated values removed by grouping operator

VTR version of query: πname,cnt(γGID; REST(name), COUNT(*)→cnt(τα(name)→tk;GID(R)))

➥ generate and group by GID ➥ restore name attribute ➥ remove GID

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 12 / 21

slide-39
SLIDE 39

The Tokenize Operator Efficient Implementation

Virtual Tuple Replication/2

Example: count number of tokens for each name γname;COUNT(*)→cnt(τα(name)(R))

➥ name value replicated for each token ➥ replicated values removed by grouping operator

VTR version of query: πname,cnt(γGID; REST(name), COUNT(*)→cnt(τα(name)→tk;GID(R)))

➥ generate and group by GID ➥ restore name attribute ➥ remove GID

VTR: efficient implementation of tokenize.

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 12 / 21

slide-40
SLIDE 40

The Tokenize Operator Query Optimization

Outline

1 Motivation 2 The Tokenize Operator

Efficient Implementation Query Optimization

3 Experiments

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 13 / 21

slide-41
SLIDE 41

The Tokenize Operator Query Optimization

Logical Query Plan

Equivalence transformations with tokenize: push down selection: attr(θ) ⊆ schema(R) σθ(τα(A)(R)) = τα(A)(σθ(R)) push down projection: A ⊆ B πBA′(τα(A)→A′(R)) = τα(A)→A′(πB(R)) reorder with join: A ⊆ schema(R) τα(A)(R ⋊ ⋉θ S) = τα(A)(R) ⋊ ⋉θ S reorder tokenize operators: A, B ⊆ schema(R) τα(A)τα(B)(R) = τα(B)τα(A)(R)

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 14 / 21

slide-42
SLIDE 42

The Tokenize Operator Query Optimization

Cardinality Estimation

Cardinality estimation for tokenize: |τα(A)(R)| = |R| × |α(A)|avg Most token functions α produce linear number of tokens ⇒ accurate cardinality estimates

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 15 / 21

slide-43
SLIDE 43

The Tokenize Operator Query Optimization

Cardinality Estimation

Cardinality estimation for tokenize: |τα(A)(R)| = |R| × |α(A)|avg Most token functions α produce linear number of tokens ⇒ accurate cardinality estimates Token Type Statistics Cardinality q-grams string

  • avg. string length ¯

s |R|(¯ s + q − 1) binar branchs tree

  • avg. node number ¯

t |R|¯ t pq-grams tree

  • avg. node number ¯

t |R|q¯ t

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 15 / 21

slide-44
SLIDE 44

The Tokenize Operator Query Optimization

Cardinality Estimation

Cardinality estimation for tokenize: |τα(A)(R)| = |R| × |α(A)|avg Most token functions α produce linear number of tokens ⇒ accurate cardinality estimates Token Type Statistics Cardinality q-grams string

  • avg. string length ¯

s |R|(¯ s + q − 1) binar branchs tree

  • avg. node number ¯

t |R|¯ t pq-grams tree

  • avg. node number ¯

t |R|q¯ t

Simple and accurate cardinality estimates for tokenize.

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 15 / 21

slide-45
SLIDE 45

The Tokenize Operator Query Optimization

Query Plans with Tokenize

SELECT R.ssn, S.ssn FROM R, S TOKENIZE R ON name AS R token, S ON name AS S token WHERE R.city = ’SLC’ AND S.county=’Salt Lake’ AND R token = S token GROUP BY R.ssn, S.ssn HAVING COUNT(*) >= k;

Customer tables R, S: ➥ |name| = 9 chars on avg. ➥ |α(name)| = 10 on avg.

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 16 / 21

slide-46
SLIDE 46

The Tokenize Operator Query Optimization

Query Plans with Tokenize

SELECT R.ssn, S.ssn FROM R, S TOKENIZE R ON name AS R token, S ON name AS S token WHERE R.city = ’SLC’ AND S.county=’Salt Lake’ AND R token = S token GROUP BY R.ssn, S.ssn HAVING COUNT(*) >= k;

Customer tables R, S: ➥ |name| = 9 chars on avg. ➥ |α(name)| = 10 on avg.

πR.ssn,S.ssn σolap≥k γR.ssn,S.ssn;COUNT(∗)→olap ⋊ ⋉R token=S token σcity=’SLC’[1%] τα(name)→R token

R(ssn, name, city)

20’000 200’000 2’000

σcounty=’Salt Lake’[5%] τα(name)→S token

S(ssn, name, county)

10’000 100’000 5’000

⋊ ⋉R token=S token τα(name)→R token σcity=’SLC’[1%]

R(ssn, name, city)

20’000 20’000 200 200 2’000 2’000

τα(name)→S token σcounty=’Salt Lake’[5%]

S(ssn, name, county)

10’000 10’000 500 500 5’000 5’000

  • true cardinality

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 16 / 21

slide-47
SLIDE 47

The Tokenize Operator Query Optimization

Query Plans with Tokenize

SELECT R.ssn, S.ssn FROM R, S TOKENIZE R ON name AS R token, S ON name AS S token WHERE R.city = ’SLC’ AND S.county=’Salt Lake’ AND R token = S token GROUP BY R.ssn, S.ssn HAVING COUNT(*) >= k;

Customer tables R, S: ➥ |name| = 9 chars on avg. ➥ |α(name)| = 10 on avg.

πR.ssn,S.ssn σolap≥k γR.ssn,S.ssn;COUNT(∗)→olap ⋊ ⋉R token=S token σcity=’SLC’[1%] τα(name)→R token

R(ssn, name, city)

20’000 200’000 2’000

σcounty=’Salt Lake’[5%] τα(name)→S token

S(ssn, name, county)

10’000 100’000 5’000

⋊ ⋉R token=S token τα(name)→R token σcity=’SLC’[1%]

R(ssn, name, city)

20’000 20’000 200 200 2’000 2’000

τα(name)→S token σcounty=’Salt Lake’[5%]

S(ssn, name, county)

10’000 10’000 500 500 5’000 5’000

  • true cardinality

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 16 / 21

slide-48
SLIDE 48

The Tokenize Operator Query Optimization

Query Plans with Tokenize

SELECT R.ssn, S.ssn FROM R, S TOKENIZE R ON name AS R token, S ON name AS S token WHERE R.city = ’SLC’ AND S.county=’Salt Lake’ AND R token = S token GROUP BY R.ssn, S.ssn HAVING COUNT(*) >= k;

Customer tables R, S: ➥ |name| = 9 chars on avg. ➥ |α(name)| = 10 on avg.

πR.ssn,S.ssn σolap≥k γR.ssn,S.ssn;COUNT(∗)→olap ⋊ ⋉R token=S token σcity=’SLC’[1%] τα(name)→R token

R(ssn, name, city)

20’000 200’000 2’000

σcounty=’Salt Lake’[5%] τα(name)→S token

S(ssn, name, county)

10’000 100’000 5’000

⋊ ⋉R token=S token τα(name)→R token σcity=’SLC’[1%]

R(ssn, name, city)

20’000 20’000 200 200 2’000 2’000

τα(name)→S token σcounty=’Salt Lake’[5%]

S(ssn, name, county)

10’000 10’000 500 500 5’000 5’000

  • true cardinality

efficient logical plans with transformation rules

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 16 / 21

slide-49
SLIDE 49

The Tokenize Operator Query Optimization

Query Plans with Tokenize

SELECT R.ssn, S.ssn FROM R, S TOKENIZE R ON name AS R token, S ON name AS S token WHERE R.city = ’SLC’ AND S.county=’Salt Lake’ AND R token = S token GROUP BY R.ssn, S.ssn HAVING COUNT(*) >= k;

Customer tables R, S: ➥ |name| = 9 chars on avg. ➥ |α(name)| = 10 on avg.

πR.ssn,S.ssn σolap≥k γR.ssn,S.ssn;COUNT(∗)→olap ⋊ ⋉R token=S token σcity=’SLC’[1%] τα(name)→R token

R(ssn, name, city)

20’000 200’000 2’000

σcounty=’Salt Lake’[5%] τα(name)→S token

S(ssn, name, county)

10’000 100’000 5’000

⋊ ⋉R token=S token τα(name)→R token σcity=’SLC’[1%]

R(ssn, name, city)

20’000 20’000 200 200 2’000 2’000

τα(name)→S token σcounty=’Salt Lake’[5%]

S(ssn, name, county)

10’000 10’000 500 500 5’000 5’000

  • true cardinality

efficient logical plans with transformation rules

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 16 / 21

slide-50
SLIDE 50

The Tokenize Operator Query Optimization

Query Plans with Tokenize

SELECT R.ssn, S.ssn FROM R, S TOKENIZE R ON name AS R token, S ON name AS S token WHERE R.city = ’SLC’ AND S.county=’Salt Lake’ AND R token = S token GROUP BY R.ssn, S.ssn HAVING COUNT(*) >= k;

Customer tables R, S: ➥ |name| = 9 chars on avg. ➥ |α(name)| = 10 on avg.

πR.ssn,S.ssn σolap≥k γR.ssn,S.ssn;COUNT(∗)→olap ⋊ ⋉R token=S token σcity=’SLC’[1%] τα(name)→R token

R(ssn, name, city)

20’000 200’000 2’000

σcounty=’Salt Lake’[5%] τα(name)→S token

S(ssn, name, county)

10’000 100’000 5’000

⋊ ⋉R token=S token τα(name)→R token σcity=’SLC’[1%]

R(ssn, name, city)

20’000 20’000 200 200 2’000 2’000

τα(name)→S token σcounty=’Salt Lake’[5%]

S(ssn, name, county)

10’000 10’000 500 500 5’000 5’000

  • true cardinality
  • estimated card.

efficient logical plans with transformation rules accurate cardinality estimates

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 16 / 21

slide-51
SLIDE 51

Experiments

Outline

1 Motivation 2 The Tokenize Operator

Efficient Implementation Query Optimization

3 Experiments

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 17 / 21

slide-52
SLIDE 52

Experiments

Generating Tokens

50 100 150 200 250 300 500000 1e+06 time [sec] number of strings (N) substring stand-alone table function tokenize

q-grams, q = 3, |s| = 10

50 100 150 200 250 1000 2000 3000 time [min] string length (|s|) substring stand-alone table function tokenize

q-grams, q = 5, 1M tuples

generate tokens: stand-alone client, substring function (VLDB’01), table function, tokenize operator increase number of tuples / string length measure runtime

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 18 / 21

slide-53
SLIDE 53

Experiments

Generating Tokens

50 100 150 200 250 300 500000 1e+06 time [sec] number of strings (N) substring stand-alone table function tokenize

q-grams, q = 3, |s| = 10

50 100 150 200 250 1000 2000 3000 time [min] string length (|s|) substring stand-alone table function tokenize

q-grams, q = 5, 1M tuples

generate tokens: stand-alone client, substring function (VLDB’01), table function, tokenize operator increase number of tuples / string length measure runtime

Tokenize scales with tuple size and string length.

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 18 / 21

slide-54
SLIDE 54

Experiments

Tokenize vs. Table Function

5 10 15 20 25 30 35 40 45 500000 1e+06 time [sec] number of tuples (N) table function tokenize

q-grams, q = 2, Jaccard threshold 0.9

join customer tables on similar names select by city (1k customers) increase number of tuples runtime for tokenize vs. table function

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 19 / 21

slide-55
SLIDE 55

Experiments

Tokenize vs. Table Function

5 10 15 20 25 30 35 40 45 500000 1e+06 time [sec] number of tuples (N) table function tokenize

q-grams, q = 2, Jaccard threshold 0.9

join customer tables on similar names select by city (1k customers) increase number of tuples runtime for tokenize vs. table function

Tokenize generates more efficient query plans.

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 19 / 21

slide-56
SLIDE 56

Experiments

Virtual Tuple Replication

10 15 20 25 30 35 40 45 50 55 60 5 10 15 20 time [sec] number of columns without VTR with VTR 1 2 3 4 5 6 7 8 9 10 5 10 15 20 table size [GB] number of columns without VTR with VTR

q-grams, q = 5, 1M tuples

generate tokens for 1M tuples compare VTR vs. physical replication increase tuple size (number of 50 char columns) measure runtime and size on disk

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 20 / 21

slide-57
SLIDE 57

Experiments

Virtual Tuple Replication

10 15 20 25 30 35 40 45 50 55 60 5 10 15 20 time [sec] number of columns without VTR with VTR 1 2 3 4 5 6 7 8 9 10 5 10 15 20 table size [GB] number of columns without VTR with VTR

q-grams, q = 5, 1M tuples

generate tokens for 1M tuples compare VTR vs. physical replication increase tuple size (number of 50 char columns) measure runtime and size on disk

VTR is fast and reduces size of intermediate results.

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 20 / 21

slide-58
SLIDE 58

Experiments

Conclusion

Tokenize is a logical operator that computes tokens VTR avoids replicating tuples physically Efficient query plans with tokenize:

➥ flexible transformation rules ➥ accurate cardinality estimates

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 21 / 21

slide-59
SLIDE 59

Experiments

Conclusion

Tokenize is a logical operator that computes tokens VTR avoids replicating tuples physically Efficient query plans with tokenize:

➥ flexible transformation rules ➥ accurate cardinality estimates

Not shown here:

➥ computing prefixes for filtering ➥ efficient verification with tokenize

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 21 / 21

slide-60
SLIDE 60

Experiments

Conclusion

Tokenize is a logical operator that computes tokens VTR avoids replicating tuples physically Efficient query plans with tokenize:

➥ flexible transformation rules ➥ accurate cardinality estimates

Not shown here:

➥ computing prefixes for filtering ➥ efficient verification with tokenize

Tokenize enables efficient on-the-fly token similarity joins.

Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 21 / 21