Bed-Tree: An All-Purpose Index Structure for String Similarity



SLIDE 1

Bed-Tree: An All-Purpose Index Structure for String Similarity Search Based on Edit Distance

Zhenjie Zhang, Marios Hadjieleftheriou, Beng Chin Ooi, Divesh Srivastava

SLIDE 2

Outline

Motivation and Bed-Tree Framework
String Orders
  Dictionary order
  Gram counting order
  Gram location order
Experiments
Conclusion

2010-6-22 Bed-Tree: An All-Purpose Index Structure for String Similarity Search 2

SLIDE 3

Approximate String Search

Information Retrieval

Web search query with string “Posgre SQL” instead of “Postgre SQL”

Data Cleaning

“13 Computing Road” is the same as “#13 Comput’ng Rd”?

Bioinformatics

Find out all protein sequences similar to “ACBCEEACCDECAAB”

SLIDE 4

Edit Distance

Edit distance on strings

13 Computing Drive → 13 Computing Dr (3 deletions)
13 Computing Dr → 13 Comput’ng Dr (1 replacement)
13 Comput’ng Dr → #13 Comput’ng Dr (1 insertion)
Edit distance: 5

Normalized edit distance

ED(s1,s2) / MaxLength(s1,s2) = 5 / 18
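The address example can be checked with a short script. This is a minimal sketch of the textbook dynamic program for edit distance, not the implementation used in the talk; the function names are illustrative.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance, computed one DP row at a time."""
    prev = list(range(len(s2) + 1))  # distance from "" to each prefix of s2
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # distance from s1[:i] to ""
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,               # delete c1
                curr[j - 1] + 1,           # insert c2
                prev[j - 1] + (c1 != c2),  # replace, free when characters match
            ))
        prev = curr
    return prev[-1]


def normalized_edit_distance(s1: str, s2: str) -> float:
    """ED(s1, s2) divided by the length of the longer string."""
    longest = max(len(s1), len(s2))
    return edit_distance(s1, s2) / longest if longest else 0.0


s1 = "13 Computing Drive"
s2 = "#13 Comput’ng Dr"
print(edit_distance(s1, s2))             # 5
print(normalized_edit_distance(s1, s2))  # 5 / 18
```

The longer string has 18 characters, which is where the denominator 18 comes from.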

SLIDE 5

Existing Solution

Q-Gram

Postgre (Q=3): ##P #Po Pos ost stg tgr gre re# e##
Posgre (Q=3): ##P #Po Pos osg sgr gre re# e##

Observation: If ED(s1,s2)=d, they agree on at least min(|s1|,|s2|)+Q-1-d*(Q+1) grams
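The gram lists and the count bound can be reproduced as follows. A minimal sketch, assuming both ends are padded with Q-1 copies of `#` as on the slide; the helper names are illustrative.

```python
from collections import Counter


def qgrams(s: str, q: int = 3, pad: str = "#") -> list[str]:
    """All q-grams of s after padding both ends with q-1 pad characters."""
    padded = pad * (q - 1) + s + pad * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def common_grams(s1: str, s2: str, q: int = 3) -> int:
    """Number of grams shared by s1 and s2, counted as multisets."""
    c1, c2 = Counter(qgrams(s1, q)), Counter(qgrams(s2, q))
    return sum((c1 & c2).values())


def min_common_grams(s1: str, s2: str, d: int, q: int = 3) -> int:
    """The bound stated on the slide for two strings with ED(s1, s2) = d."""
    return min(len(s1), len(s2)) + q - 1 - d * (q + 1)


print(qgrams("Postgre"))
print(common_grams("Postgre", "Posgre"))         # 6
print(min_common_grams("Postgre", "Posgre", 1))  # 4
```

Here the two strings share 6 grams, which is above the bound of 4 for d=1, so the pair survives the count filter.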

SLIDE 6

Existing Solution

Inverted List

Grams (list keys): ##P #Po Pos osg sgr gre re$ e$$
Indexed strings: Postgre, Posgre
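A minimal sketch of such an inverted list, assuming `##` padding at the start and `$$` at the end as in the diagram. The function names are illustrative and do not come from any of the systems compared in the talk.

```python
from collections import Counter, defaultdict


def qgrams(s: str, q: int = 3, start: str = "#", end: str = "$") -> list[str]:
    """q-grams of s with distinct start and end padding symbols."""
    padded = start * (q - 1) + s + end * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def build_inverted_index(strings: list[str], q: int = 3) -> dict[str, list[str]]:
    """Map every gram to the list of strings that contain it."""
    index: dict[str, list[str]] = defaultdict(list)
    for s in strings:
        for gram in dict.fromkeys(qgrams(s, q)):  # distinct grams, stable order
            index[gram].append(s)
    return index


def shared_gram_counts(index: dict[str, list[str]], query: str, q: int = 3) -> Counter:
    """For each indexed string, how many distinct query grams it shares."""
    hits: Counter = Counter()
    for gram in dict.fromkeys(qgrams(query, q)):
        for s in index.get(gram, []):
            hits[s] += 1
    return hits


index = build_inverted_index(["Postgre", "Posgre"])
print(index["gre"])                         # ['Postgre', 'Posgre']
print(shared_gram_counts(index, "Posgre"))
```

Strings with enough shared grams become candidates and would still need to be verified with the exact edit distance.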

SLIDE 7

Limitations

Inverted List Method

Limited queries supported
Uncontrollable memory consumption
Concurrency protocol

                Range Query   Join Query   Top-K Query   Top-K Join
Edit Distance   Y             Y            N             N
Normalized ED   N             N            N             N

SLIDE 8

Our Contributions

Bed-Tree

Wide support on different queries and distances
Adjustable buffer size and low I/O cost
Highly concurrent
Easy to implement
Competitive performance

                Range Query   Join Query   Top-K Query   Top-K Join
Edit Distance   Y             Y            Y             Y
Normalized ED   Y             Y            Y             Y

SLIDE 9

Basic Index Framework

Bed-Tree Framework

Map all strings to a 1D domain
Estimate the minimal distance to the query and prune B+ tree nodes
Refine the result by exact edit distance
Index construction follows the standard B+ tree

Query: Posgre
Result: Postgre
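The filter-and-refine loop can be sketched on a flat list of leaf pages. Everything here is an assumption made for illustration: the "length order" (sort by length, then alphabetically) is a toy stand-in chosen only because its lower bound is easy to see, not one of the string orders from the talk, and a real B+ tree would apply the same test at internal nodes as well.

```python
from typing import Callable


def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance, computed one DP row at a time."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (c1 != c2)))
        prev = curr
    return prev[-1]


def length_order_key(s: str) -> tuple[int, str]:
    """Toy string order (hypothetical): shorter strings first."""
    return (len(s), s)


def length_lower_bound(query: str, low: str, high: str) -> int:
    """Every string in [low, high] has a length between len(low) and len(high),
    and the edit distance is at least the difference in lengths."""
    if len(query) < len(low):
        return len(low) - len(query)
    if len(query) > len(high):
        return len(query) - len(high)
    return 0


def range_query(pages: list[list[str]], query: str, threshold: int,
                lower_bound: Callable[[str, str, str], int]) -> tuple[list[str], int]:
    """Return the matching strings and the number of pages actually read."""
    results, visited = [], 0
    for page in pages:
        if lower_bound(query, page[0], page[-1]) > threshold:
            continue  # prune: no string on this page can qualify
        visited += 1
        results.extend(s for s in page if edit_distance(query, s) <= threshold)  # refine
    return results, visited


strings = sorted(["sit", "pose", "put", "sad", "power", "powder", "Postgre", "PostgreSQL"],
                 key=length_order_key)
pages = [strings[i:i + 2] for i in range(0, len(strings), 2)]
print(range_query(pages, "Posgre", 1, length_lower_bound))  # (['Postgre'], 2)
```

Two of the four pages are pruned without computing a single edit distance; swapping in a different order only requires a different key function and lower bound.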

SLIDE 10

Outline

Motivation and Bed-Tree Framework
String Orders
  Dictionary order
  Gram counting order
  Gram location order
Experiments
Conclusion

SLIDE 11

String Order Properties

P1: Comparability

Given two strings s1 and s2, we know the order of s1 and s2 under the specified string order

P2: Lower Bounding

Given an interval [L,U] on the string order, we know a lower bound on the edit distance from the query string to any string in the interval


Query: Posgre

Candidates in the sub-tree?

SLIDE 12

String Order Properties

P3: Pairwise Lower Bounding

Given two intervals [L,U] and [L’,U’], we know a lower bound on the edit distance between any s1 from [L,U] and any s2 from [L’,U’]

P4: Length Bounding

Given an interval [L,U] on the string order, we know the minimal length of the strings in the interval


Potential join results?

SLIDE 13

String Order Properties

Properties vs. supported queries and distances

                Range Query   Join Query   Top-K Query   Top-K Join
Edit Distance   P1, P2        P1, P3       P1, P2        P1, P3
Normalized ED   P1, P2, P4    P1, P3, P4   P1, P2, P4    P1, P3, P4

Property   Description
P1         Comparability
P2         Lower Bounding
P3         Pairwise Lower Bounding
P4         Length Bounding
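The table can be read as a lookup: a string order supports a query type once it satisfies every property in the corresponding cell. A minimal sketch, with the table transcribed into a dictionary; the names `REQUIRED` and `supported_queries` are illustrative.

```python
# Required properties per (query type, distance), copied from the table.
REQUIRED = {
    ("range", "ED"): {"P1", "P2"},
    ("join", "ED"): {"P1", "P3"},
    ("top-k", "ED"): {"P1", "P2"},
    ("top-k join", "ED"): {"P1", "P3"},
    ("range", "NED"): {"P1", "P2", "P4"},
    ("join", "NED"): {"P1", "P3", "P4"},
    ("top-k", "NED"): {"P1", "P2", "P4"},
    ("top-k join", "NED"): {"P1", "P3", "P4"},
}


def supported_queries(properties: set[str]) -> list[tuple[str, str]]:
    """Query/distance pairs whose required properties are all satisfied."""
    return sorted(key for key, needed in REQUIRED.items() if needed <= properties)


# An order with P1, P2 and P3 but without P4 covers the edit-distance row only.
print(supported_queries({"P1", "P2", "P3"}))
```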

SLIDE 14

Dictionary Order

All strings are ordered alphabetically, satisfying P1, P2 and P3

Index keys: pose, powder, sit

Insertion: Postgre (it’s between “pose” and “powder”)

Search: Posgre with ED=1

SLIDE 15

Dictionary Order

All strings are ordered alphabetically, satisfying P1, P2 and P3

Search: Posgre with ED=1

Strings in the tree diagram: power, put, sad, pose, powder, sit

Not pruning anything! Pruning happens only when a long prefix exists
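One way to see why pruning needs a long prefix: every string between L and U in dictionary order starts with the longest common prefix of L and U, so the cheapest way to turn any prefix of the query into that common prefix is a lower bound for the whole interval. The sketch below follows this idea under the assumption that case is folded to lowercase; it is an illustration, not necessarily the exact bound used in the talk.

```python
def common_prefix(a: str, b: str) -> str:
    """Longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


def dictionary_lower_bound(query: str, low: str, high: str) -> int:
    """Lower bound on ED(query, s) for every s with low <= s <= high."""
    prefix = common_prefix(low, high)
    # row[j] = edit distance between the processed part of prefix and query[:j]
    row = list(range(len(query) + 1))
    for i, c in enumerate(prefix, start=1):
        new = [i]
        for j, qc in enumerate(query, start=1):
            new.append(min(row[j] + 1, new[j - 1] + 1, row[j - 1] + (c != qc)))
        row = new
    return min(row)  # the best query prefix to align with the common prefix


print(dictionary_lower_bound("posgre", "pose", "powder"))        # 0, nothing pruned
print(dictionary_lower_bound("posgre", "database", "datatype"))  # 4, pruned for ED=1
```

The interval from "pose" to "powder" only shares "po", which the query also starts with, so the bound is 0 and the node must be visited. The second interval is a made-up example with a longer shared prefix.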

SLIDE 16

Gram Counting Order

Example string: Jim Gray

Hash all grams to 4 buckets
Count the grams in binary

SLIDE 17

Gram Counting Order

Transform the count vector to a bit string with z-order

Encode with z-order

Order the strings with this signature
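The signature construction can be sketched as follows. The hash function, the bit width, and the exact interleaving (most significant bits of all buckets first) are assumptions made for illustration; the slides do not specify them.

```python
def qgrams(s: str, q: int = 2, pad: str = "#") -> list[str]:
    """All q-grams of s after padding both ends with q-1 pad characters."""
    padded = pad * (q - 1) + s + pad * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def bucket_of(gram: str, buckets: int = 4) -> int:
    """Hypothetical hash function. Python's built-in hash() is salted per
    process for strings, so a deterministic stand-in is used instead."""
    return sum(ord(ch) for ch in gram) % buckets


def count_vector(s: str, q: int = 2, buckets: int = 4) -> tuple[int, ...]:
    """Number of grams of s that fall into each bucket."""
    counts = [0] * buckets
    for gram in qgrams(s, q):
        counts[bucket_of(gram, buckets)] += 1
    return tuple(counts)


def z_order(counts: tuple[int, ...], bits: int = 4) -> str:
    """Interleave the binary counts: the top bit of every bucket first,
    then the next bit of every bucket, and so on."""
    if max(counts) >= 1 << bits:
        raise ValueError("a count does not fit in the chosen bit width")
    return "".join(str((c >> b) & 1) for b in range(bits - 1, -1, -1) for c in counts)


vector = count_vector("Jim Gray")
print(vector, z_order(vector))
print(z_order((4, 1, 2, 2), bits=3))  # 100000110100
```

Sorting strings by this bit string tends to place strings with similar count vectors under a shared prefix, which is what the lower-bound estimate relies on.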

SLIDE 18

Gram Counting Order

Lower Bounding

Query: Jim Gary
Signature: (4,1,2,2)
Interval: “11011011” to “11011101”
Prefix: “11011???”
Minimal edit distance: 1
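A sketch of how a bound can be read off a signature prefix. It assumes 4 buckets with 2 bits each, interleaved with the high bits of all buckets first, and uses the fact that one edit operation destroys at most Q grams and creates at most Q new ones. The bit layout and the query counts below are made up for illustration; the slide's exact layout is not recoverable from the text.

```python
def decode(signature: str, buckets: int = 4) -> tuple[int, ...]:
    """Undo the bit interleaving: bit i belongs to bucket i % buckets."""
    counts = [0] * buckets
    for i, bit in enumerate(signature):
        counts[i % buckets] = (counts[i % buckets] << 1) | int(bit)
    return tuple(counts)


def bounds_from_prefix(prefix: str, total_bits: int, buckets: int = 4):
    """Smallest and largest possible count per bucket for any signature
    that starts with the given prefix."""
    missing = total_bits - len(prefix)
    low = decode(prefix + "0" * missing, buckets)
    high = decode(prefix + "1" * missing, buckets)
    return low, high


def count_lower_bound(query_counts, low, high, q: int = 2) -> int:
    """Lower bound on the edit distance from the query to any string whose
    count vector lies between low and high in every bucket."""
    surplus = sum(max(0, c - hi) for c, hi in zip(query_counts, high))  # grams to remove
    deficit = sum(max(0, lo - c) for c, lo in zip(query_counts, low))   # grams to add
    return -(-max(surplus, deficit) // q)  # ceiling division


low, high = bounds_from_prefix("11011", total_bits=8)
print(low, high)                                   # (3, 2, 0, 2) (3, 3, 1, 3)
print(count_lower_bound((1, 3, 3, 2), low, high))  # 1
```

The box derived from the prefix is a superset of the actual interval, so the bound stays valid even though it may be looser than necessary.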

SLIDE 19

Gram Location Order

Extension of Gram Counting Order

Include positional information of the grams
Allow better estimation of mismatched grams
Harder to encode

Example strings: Jim Gray, Grace Hopper

SLIDE 20

Outline

Motivation and Framework
String Orders
  Expected properties
  Dictionary order
  Gram counting order
  Gram location order
Experiments
Conclusion

SLIDE 21

Experiment Settings

Data
Five Index Schemes
  Bed-Tree: BD, BGC, BGL
  Inverted List: Flamingo, Mismatch

Default Setting

Q=2, Bucket=4, Page Size=4KB

SLIDE 22

Empirical Observations

How good is Bed-Tree?

With a small threshold, inverted lists are better
When the threshold increases, Bed-Tree is not worse

SLIDE 23

Empirical Observations

Which string order is better?

Gram counting order is generally better
Gram location order: tradeoff between gram content information and position information

SLIDE 24

Conclusion

A new B+ tree index scheme

All similarity queries supported
Both edit distance and normalized distance
General transaction and concurrency protocol
Competitive efficiencies

SLIDE 25

Q&A

SLIDE 26

Results

Range Query

SLIDE 27

Results

Top-K Query

SLIDE 28

Results

Normalized Edit Distance & Join Query
