SLIDE 1

Motivation Our Approach Experiment

Efficient Parallel Partition based Algorithms for Similarity Search and Join with Edit Distance Constraints

Yu Jiang, Dong Deng, Jiannan Wang, Guoliang Li, and Jianhua Feng

Tsinghua University

Similarity Search & Join Competition at EDBT/ICDT 2013

Dong Deng Parallel PassJoin

SLIDE 2

Outline

1. Motivation: Problem Definition, Application
2. Our Approach: Pass Join Algorithm, Additional Filters, Parallel
3. Experiment: Evaluating Pruning Techniques, Evaluating Parallelism, Evaluating Scalability

SLIDE 3

Problem Definition

STRING SIMILARITY JOINS

Given a set of strings S, the task is to find all pairs of τ-similar strings in S, i.e. pairs whose edit distance is at most τ. A program must output all matches with both string identifiers and distance τ (Track II).

SLIDE 4

An Example

Table 1: A string dataset

ID   String              Length
s1   vankatesh           9
s2   avataresha          10
s3   kaushic chaduri     15
s4   kaushik chakrab     15
s5   kaushuk chadhui     15
s6   caushik chakrabar   17

Consider the string dataset in Table 1. Suppose τ = 3. ⟨s4, s6⟩ is a similar pair, as ED(s4, s6) ≤ τ.
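As a quick check of the example, the edit distance ED can be computed with the textbook dynamic-programming recurrence. This is a minimal sketch, not the authors' implementation; the function name `edit_distance` is an illustrative assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Textbook edit distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]                           # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,         # delete ca
                           cur[j - 1] + 1,      # insert cb
                           prev[j - 1] + cost)) # match or substitute
        prev = cur
    return prev[-1]


s4 = "kaushik chakrab"
s6 = "caushik chakrabar"
tau = 3
print(edit_distance(s4, s6), edit_distance(s4, s6) <= tau)  # 3 True
```

One substitution (k to c) and two insertions (a, r) turn s4 into s6, so the pair is reported for τ = 3.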

SLIDE 5

Application

Data cleaning
Information extraction
Comparison of biological sequences
...

SLIDE 6

Basic Idea

Lemma: Given a string r partitioned into τ + 1 segments and a string s, if s is similar to r within threshold τ, then s must contain at least one segment of r as a substring.

Example: τ = 1, r = “EDBT” has two segments, “ED” and “BT”. s = “ICDT” cannot be similar to r, as s contains neither of the two segments.
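The lemma can be read as a cheap candidate filter: τ edit operations can damage at most τ of the τ + 1 segments, so at least one segment survives intact. A minimal sketch under that reading; the helper name `may_be_similar` is hypothetical.

```python
def may_be_similar(segments_of_r, s):
    """Pigeonhole filter: s can be within the threshold of r only if it
    contains at least one of r's tau + 1 segments as a substring."""
    return any(seg in s for seg in segments_of_r)


segments_of_r = ["ED", "BT"]  # r = "EDBT", tau = 1
print(may_be_similar(segments_of_r, "ICDT"))  # False: pruned without computing ED
print(may_be_similar(segments_of_r, "EDBS"))  # True: still a candidate, must be verified
```

Passing the filter is necessary but not sufficient, so surviving pairs still need verification.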

SLIDE 7

Even Partition Scheme

Definition: In the even partition scheme, each segment has almost the same length: ⌊|s|/(τ+1)⌋ or ⌈|s|/(τ+1)⌉.

Example: τ = 3, we partition s1 = “vankatesh” into four segments “va”, “nk”, “at”, “esh”.
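The even partition scheme can be sketched as follows. Placing the longer segments last is an assumption chosen to reproduce the “vankatesh” example; the function name `even_partition` is illustrative.

```python
def even_partition(s, tau):
    """Split s into tau + 1 segments whose lengths differ by at most one.
    Assumption: the shorter (floor-length) segments come first."""
    n_seg = tau + 1
    base, extra = divmod(len(s), n_seg)   # `extra` segments get one more character
    segments, start = [], 0
    for i in range(n_seg):
        length = base + (1 if i >= n_seg - extra else 0)
        segments.append(s[start:start + length])
        start += length
    return segments


print(even_partition("vankatesh", 3))  # ['va', 'nk', 'at', 'esh']
print(even_partition("EDBT", 1))       # ['ED', 'BT']
```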

SLIDE 8

Substring Selection

Basic Methods

Enumeration: Enumerate all substrings for each segment.
Length-based: For each segment, select only substrings with the same length as the segment.
Shift-based: For a segment with start position pi, select substrings with start positions in [pi − τ, pi + τ].
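To make the shift-based method concrete, the sketch below lists the substrings of a probe string that would be compared against one segment. It assumes 0-based positions and clamps the window to the string bounds; the name `shift_based_substrings` is hypothetical.

```python
def shift_based_substrings(s, p, seg_len, tau):
    """Substrings of s with the segment's length whose start lies in [p - tau, p + tau].
    Positions are 0-based and clamped to valid starts (an assumption of this sketch)."""
    first = max(0, p - tau)
    last = min(len(s) - seg_len, p + tau)
    return [(start, s[start:start + seg_len]) for start in range(first, last + 1)]


# Segment of length 2 starting at position 2, threshold tau = 1.
print(shift_based_substrings("avataresha", 2, 2, 1))
# [(1, 'va'), (2, 'at'), (3, 'ta')]
```

Each segment is probed with at most 2τ + 1 substrings, which the later selection methods reduce further.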

SLIDE 9

Substring Selection

Position-aware Substring Selection

Theorem (Position-aware Substring Selection): For a segment with start position pi, select substrings with start positions in [pi − ⌊(τ−△)/2⌋, pi + ⌊(τ+△)/2⌋], where △ = |s| − |r|.

SLIDE 11

Substring Selection

Position-aware Substring Selection

Example: τ = 3, △ = 1, [pi − ⌊(τ−△)/2⌋, pi + ⌊(τ+△)/2⌋] = [pi − 1, pi + 2]
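The position-aware window is a two-line computation. A minimal sketch, assuming |△| ≤ τ (pairs with a larger length difference cannot be similar); the function name is illustrative.

```python
def position_aware_window(p, tau, delta):
    """Start positions to probe for a segment starting at p, where
    delta = |s| - |r|. Python's // floors, matching the floor brackets."""
    return p - (tau - delta) // 2, p + (tau + delta) // 2


lo, hi = position_aware_window(5, 3, 1)
print(lo, hi)        # 4 7, i.e. [p - 1, p + 2]
print(hi - lo + 1)   # 4 start positions, versus 2 * 3 + 1 = 7 for the shift-based window
```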

SLIDE 12

Substring Selection

Multi-match-aware Substring Selection

Observation: There must be another matching between rr and sr, the parts of r and s to the right of the matched segment.

Theorem (Multi-match-aware Substring Selection): For the i-th segment with start position pi, select substrings with start positions in [pi − i, pi + i] ∩ [pi + △ − (τ+1−i), pi + △ + (τ+1−i)].
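The multi-match-aware window is the intersection of a left-side and a right-side interval. The sketch below follows the theorem exactly as written on the slide; the function name and the use of `None` for an empty intersection are assumptions.

```python
def multi_match_window(p, i, tau, delta):
    """Intersect the left-side window [p - i, p + i] with the right-side window
    [p + delta - (tau + 1 - i), p + delta + (tau + 1 - i)] for the i-th segment."""
    right_slack = tau + 1 - i
    lo = max(p - i, p + delta - right_slack)
    hi = min(p + i, p + delta + right_slack)
    return (lo, hi) if lo <= hi else None   # None: nothing to probe for this segment


tau, delta, p = 3, 1, 10
print(multi_match_window(p, 1, tau, delta))  # (9, 11): the left-side window is the tighter one
print(multi_match_window(p, 4, tau, delta))  # (11, 11): the last segment has one position
```

Early segments are constrained mostly by the left side and late segments by the right side, which is why the intersection is smaller than either window alone.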

SLIDE 14

Substring Selection

Multi-match-aware Substring Selection

Example

SLIDE 15

Substring Selection

Theoretical Results

1. The number of substrings selected by the multi-match-aware method is minimum.
2. For strings longer than 2(τ + 1), our selection method is the only way to select the minimum number of substrings.

SLIDE 16

Substring Selection

Experimental Results

Figure: Numbers of selected substrings (# of selected substrings vs. threshold τ) for the Length, Shift, Position and Multi-Match methods. Panels: (a) Author Name (Avg Len = 15), (b) Query Log (Avg Len = 45), (c) Author+Title (Avg Len = 105).

SLIDE 17

Substring Selection

Experimental Results

Figure: Elapsed time for generating substrings (selection time in seconds vs. threshold τ) for the Length, Shift, Position and Multi-Match methods. Panels: (a) Author Name (Avg Len = 15), (b) Query Log (Avg Len = 45), (c) Author+Title (Avg Len = 105).

SLIDE 18

Verification

Length-aware Verification

Inspired by the position-aware substring selection.
Saves at least half of the computation compared with the traditional dynamic-programming method.
Saves even more using improved early termination.
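One way to read length-aware verification is as a banded dynamic program: only matrix cells whose diagonal offset j − i lies in [−⌊(τ−△)/2⌋, ⌊(τ+△)/2⌋] are filled, which is τ + 1 diagonals instead of 2τ + 1. The sketch below is an illustration of that idea under this assumption, not the authors' code, and it omits the early-termination refinement.

```python
def length_aware_verify(r, s, tau):
    """Return True iff ED(r, s) <= tau, filling only the length-aware band."""
    if len(r) > len(s):
        r, s = s, r                        # make s the longer string
    delta = len(s) - len(r)
    if delta > tau:
        return False                       # length filter
    left = (tau - delta) // 2              # allowed offset below the main diagonal
    right = (tau + delta) // 2             # allowed offset above the main diagonal
    inf = tau + 1                          # any value above tau means "too far"
    prev = [j if j <= right else inf for j in range(len(s) + 1)]
    for i in range(1, len(r) + 1):
        cur = [inf] * (len(s) + 1)
        for j in range(max(0, i - left), min(len(s), i + right) + 1):
            if j == 0:
                cur[0] = i                 # delete the first i characters of r
                continue
            cost = 0 if r[i - 1] == s[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost, prev[j] + 1, cur[j - 1] + 1)
        prev = cur
    return prev[len(s)] <= tau


print(length_aware_verify("kaushik chakrab", "caushik chakrabar", 3))  # True
print(length_aware_verify("kaushik chakrab", "caushik chakrabar", 2))  # False
```

The band is safe because a path through offset d costs at least |d| to reach and at least |△ − d| to finish, so any path within τ stays inside it.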

SLIDE 21

Verification

Extension-based Verification

Inspired by the multi-match-aware substring selection.
Uses tighter thresholds to verify the candidate pairs.
For a match on the i-th segment, verify that ED(rr, sr) ≤ τ + 1 − i and ED(rl, sl) ≤ i − 1.
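The two tighter checks can be sketched as below. For clarity the sketch uses a plain edit-distance helper rather than a thresholded one, and the names `extension_verify`, `r_start`, `s_start` and `seg_len` are hypothetical; it tests a single segment match, not the whole pair.

```python
def edit_distance(a, b):
    """Textbook dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (0 if ca == cb else 1)))
        prev = cur
    return prev[-1]


def extension_verify(r, s, i, r_start, s_start, seg_len, tau):
    """Check one match of the i-th segment (1-based) of r found at s[s_start:].
    The left parts may differ by at most i - 1, the right parts by tau + 1 - i."""
    r_left, r_right = r[:r_start], r[r_start + seg_len:]
    s_left, s_right = s[:s_start], s[s_start + seg_len:]
    if edit_distance(r_left, s_left) > i - 1:
        return False                       # reject before touching the right parts
    return edit_distance(r_right, s_right) <= tau + 1 - i


r, s = "kaushik chakrab", "caushik chakrabar"
# "shik" is the 2nd of the tau + 1 = 4 even segments of r; it starts at index 3 in both strings.
print(extension_verify(r, s, i=2, r_start=3, s_start=3, seg_len=4, tau=3))  # True
```

Here ED(“kau”, “cau”) = 1 ≤ i − 1 and ED(“ chakrab”, “ chakrabar”) = 2 ≤ τ + 1 − i, so this match confirms the pair.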

SLIDE 24

Verification

Experimental Results

Figure: Elapsed time for verification (elapsed time in seconds vs. threshold τ) for the 2τ+1, τ+1, Extension and SharePrefix methods. Panels: (a) Author Name (Avg Len 15), (b) Query Log (Avg Len 45), (c) Author+Title (Avg Len 105).

SLIDE 25

Additional Filters

Effective Indexing Strategy

Partition longer strings into segments.
Select substrings from shorter strings.
Longer segments decrease the possibility of matching, and thus decrease the number of candidates.

SLIDE 29

Additional Filters

Content Filter

Observation:
Let Hr denote the character frequency vector of r.
r = “abyyyy”, s = “axxyyyxy”.
Hr = {{a, 1}, {b, 1}, {y, 4}}, Hs = {{a, 1}, {x, 3}, {y, 4}}
Let H△ = |Hr − Hs|, the sum of the absolute differences of the character frequencies. H△ = |Hr − Hs| = |1| + |−3| = 4.
A deletion or insertion changes H△ by at most 1.
A substitution changes H△ by at most 2.

SLIDE 33

Additional Filters

Content Filter

Observation:
With at most τ edit operations, H△ ≤ 2τ.
With at most τ − ||r| − |s|| substitutions, H△ ≤ 2τ − ||r| − |s||.

Group symbols to improve the content-filter running time.
Integrate the content filter with the extension-based verification.
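The content filter can be sketched with character counters. This is a minimal illustration of the bound above; it does not include the symbol grouping or the integration with extension-based verification, and the function names are assumptions.

```python
from collections import Counter


def content_difference(r, s):
    """Sum of absolute differences between the character frequency vectors of r and s."""
    hr, hs = Counter(r), Counter(s)
    return sum(abs(hr[c] - hs[c]) for c in hr.keys() | hs.keys())


def passes_content_filter(r, s, tau):
    """Keep the pair only if the frequency difference is within 2*tau - ||r| - |s||."""
    return content_difference(r, s) <= 2 * tau - abs(len(r) - len(s))


r, s = "abyyyy", "axxyyyxy"
print(content_difference(r, s))        # 4
print(passes_content_filter(r, s, 3))  # True:  4 <= 2*3 - 2
print(passes_content_filter(r, s, 2))  # False: 4 >  2*2 - 2, pruned
```

The bound follows because the length difference forces at least ||r| − |s|| insertions or deletions (each worth at most 1), leaving at most τ − ||r| − |s|| substitutions (each worth at most 2).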

SLIDE 37

1. Parallel Sorting. Group strings by length using an existing parallel algorithm.
2. Parallel Building Indexes. Build the indexes for each group in parallel.
3. Parallel Joins. Perform similarity joins on each group in parallel.
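The three steps suggest a decomposition in which each length group is an independent task. The sketch below only illustrates that decomposition: grouping is done sequentially, the segment index is replaced by brute-force comparison, and Python threads will not speed up this CPU-bound work. All names are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def edit_distance(a, b):
    """Textbook dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (0 if ca == cb else 1)))
        prev = cur
    return prev[-1]


def join_group(length, groups, tau):
    """Join the strings of one length with strings of the same or a shorter
    length (at most tau shorter), so every pair is handled by exactly one task."""
    results = []
    current = groups[length]
    for idx, (id_a, a) in enumerate(current):
        for id_b, b in current[idx + 1:]:              # same-length pairs, once each
            d = edit_distance(a, b)
            if d <= tau:
                results.append((id_a, id_b, d))
        for other in range(max(0, length - tau), length):
            for id_b, b in groups.get(other, []):      # shorter strings
                d = edit_distance(a, b)
                if d <= tau:
                    results.append((id_b, id_a, d))
    return results


def parallel_join(strings, tau, workers=4):
    groups = defaultdict(list)
    for sid, text in strings.items():                  # step 1: group by length
        groups[len(text)].append((sid, text))
    groups = dict(groups)
    with ThreadPoolExecutor(max_workers=workers) as pool:  # step 3: one task per group
        parts = pool.map(lambda n: join_group(n, groups, tau), sorted(groups))
        return sorted(pair for part in parts for pair in part)


dataset = {
    "s1": "vankatesh", "s2": "avataresha", "s3": "kaushic chaduri",
    "s4": "kaushik chakrab", "s5": "kaushuk chadhui", "s6": "caushik chakrabar",
}
print(parallel_join(dataset, 3))  # [('s4', 's6', 3)]
```

Because a group only looks at its own and shorter lengths, tasks never produce the same pair twice and need no locking on the result.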

SLIDE 38

Experiment Setup

Table: Datasets

Datasets         cardinality   average len   min len   max len
GeoNames         400,000       11.106        1         60
GeoNames Query   100,000       10.7          2         43
Reads            750,000       101.388       86        106
Reads Query      100,000       101.2         88        116

SLIDE 39

Experiment Setup

Figure: Length distribution (number of strings vs. string length). Panels: (a) GeoNames, (b) Reads.

SLIDE 40

Evaluating Pruning Techniques

Figure: Evaluating pruning techniques for similarity joins (8 threads): elapsed time in seconds vs. edit distance threshold for Basic, Content, Longer and ParaJoin. Panels: (a) GeoNames, (b) Reads.

SLIDE 41

Evaluating Pruning Techniques

Figure: Evaluating pruning techniques for similarity search (8 threads): elapsed time in seconds vs. edit distance threshold for BasicSearch and ParaSearch. Panels: (a) GeoNames, (b) Reads.

SLIDE 42

Evaluating Parallelism

Figure: Evaluating running time of similarity join by varying number of threads (elapsed time in seconds vs. number of threads). Panels: (a) GeoNames, τ = 1, 2, 3, 4; (b) Reads, τ = 4, 8, 12, 16.

SLIDE 43

Evaluating Speedup

Figure: Evaluating speedup of similarity join (speedup vs. number of threads, with the ideal speedup shown for reference). Panels: (a) GeoNames, τ = 1, 2, 3, 4; (b) Reads, τ = 4, 8, 12, 16.

SLIDE 44

Evaluating Parallelism

Figure: Evaluating running time of similarity search by varying number of threads (elapsed time in seconds vs. number of threads). Panels: (a) GeoNames, τ = 1, 2, 3, 4; (b) Reads, τ = 4, 8, 12, 16.

SLIDE 45

Evaluating Speedup

Figure: Evaluating speedup of similarity search (speedup vs. number of threads, with the ideal speedup shown for reference). Panels: (a) GeoNames, τ = 1, 2, 3, 4; (b) Reads, τ = 4, 8, 12, 16.

SLIDE 46

Evaluating Scalability

Figure: Evaluating the scalability of the similarity join algorithm (8 threads): elapsed time in seconds vs. number of strings (×1,000,000). Panels: (a) GeoNames, τ = 1, 2, 3, 4; (b) Reads, τ = 4, 8, 12, 16.

SLIDE 47

Evaluating Scalability

Figure: Evaluating the scalability of the similarity search algorithm (8 threads): elapsed time in seconds vs. number of strings (×1,000,000). Panels: (a) GeoNames, τ = 1, 2, 3, 4; (b) Reads, τ = 4, 8, 12, 16.

SLIDE 48

Appendix Our Team

About our team I

We are from Tsinghua University, Beijing, China. Yu Jiang, Jiannan Wang, Guoliang Li, Jianhua Feng and Dong Deng.

SLIDE 49

About our team II

SLIDE 50

Thank You
Q & A
http://dbgroup.cs.tsinghua.edu.cn/dd
Pass-Join: A Partition based Method for Similarity Joins. Guoliang Li, Dong Deng, Jiannan Wang, Jianhua Feng. VLDB 2012.
