


SLIDE 1

VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams

Chen Li, Bin Wang, and Xiaochun Yang

Northeastern University, China

SLIDE 2

Approximate selection queries

[Figure: the misspelled query "Schwarrzenger" is matched against a collection of names: …, Keanu Reeves, Samuel Jackson, Schwarzenegger]

Query errors:

  • Limited knowledge about data
  • Typos
  • Limited input devices (e.g., cell phone)

Data errors:

  • Typos
  • Web data
  • OCR

Applications

  • Spellchecking
  • Query relaxation

SLIDE 3

Applications

  • Record linkage
  • …

Example: linking tables R and S

  R: …, microsoft, informix
  S: …, mcrosoft, …, infromix

Similarity functions:

  • Edit distance
  • Jaccard
  • Cosine
  • …
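
As a reminder of what the edit distance measures, here is a minimal sketch of the standard dynamic-programming (Levenshtein) computation. The function name `edit_distance` is chosen for illustration and is not taken from the talk.

```python
def edit_distance(s1, s2):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning s1 into s2."""
    prev = list(range(len(s2) + 1))  # distances from "" to prefixes of s2
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # distance from s1[:i] to ""
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete c1
                            curr[j - 1] + 1,      # insert c2
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(edit_distance("microsoft", "mcrosoft"))  # one deletion
print(edit_distance("informix", "infromix"))   # two substitutions
```

Note that a swapped pair of letters such as "or" vs. "ro" costs two operations under plain edit distance, since transposition is not one of the three basic operations.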

SLIDE 4

“q-grams” of strings

String: universal

2-grams: un, ni, iv, ve, er, rs, sa, al
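
A minimal sketch of q-gram extraction, assuming the plain sliding-window definition (every substring of length q, with no padding characters). The helper name `qgrams` is illustrative.

```python
def qgrams(s, q):
    """Return all substrings of length q, in order of starting position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("universal", 2))
# ['un', 'ni', 'iv', 've', 'er', 'rs', 'sa', 'al']
```

A string of length n has n - q + 1 such grams, which is the quantity used later in the lower bound on common grams.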

SLIDE 5

q-gram inverted lists

Strings:

  id  string
  0   rich
  1   stick
  2   stich
  3   stuck
  4   static

2-gram inverted lists (gram: ids of the strings containing it):

  at: 4
  ch: 0, 2
  ck: 1, 3
  ic: 0, 1, 2, 4
  ri: 0
  st: 1, 2, 3, 4
  ta: 4
  ti: 1, 2, 4
  tu: 3
  uc: 3
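
The inverted lists on this slide can be rebuilt with a few lines of code. This is a sketch under two assumptions: string ids run from 0 to 4 in the order the strings are listed, and each list stores an id at most once. The names `STRINGS` and `build_index` are illustrative.

```python
from collections import defaultdict

STRINGS = ["rich", "stick", "stich", "stuck", "static"]  # ids 0..4 (assumed)


def build_index(strings, q):
    """Map each q-gram to the sorted list of ids of strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        grams = {s[i:i + q] for i in range(len(s) - q + 1)}  # dedupe per string
        for gram in grams:
            index[gram].append(sid)  # ids arrive in increasing order
    return dict(index)


index = build_index(STRINGS, 2)
for gram in sorted(index):
    print(gram, index[gram])
```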

SLIDE 6

Searching using inverted lists

Query: “shtick”, ED(shtick, ?) ≤ 1

  • 2-grams of the query: sh, ht, ti, ic, ck
  • Query grams that have inverted lists: ti, ic, ck
  • Filter: # of common grams >= 3

[Figure: the strings and 2-gram inverted lists of Slide 5, probed with the query grams ti, ic, and ck]

SLIDE 7

2-grams → 3-grams?

Query: “shtick”, ED(shtick, ?) ≤ 1

  • 3-grams of the query: sht, hti, tic, ick
  • Query grams that have inverted lists: tic, ick
  • Filter: # of common grams >= 1

3-gram inverted lists over the same strings (0 rich, 1 stick, 2 stich, 3 stuck, 4 static):

  ati: 4
  ich: 0, 2
  ick: 1
  ric: 0
  sta: 4
  sti: 1, 2
  stu: 3
  tat: 4
  tic: 1, 2, 4
  tuc: 3
  uck: 3

Shorter inverted lists, but more false positives
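
The trade-off between 2-grams and 3-grams can be checked with a small count-filter search. This is a sketch, not the talk's implementation: it assumes string ids 0 to 4 in listing order, counts list hits per string, keeps strings that reach the threshold (|query| - q + 1) - k * q, and then verifies candidates with the edit distance. All function names are illustrative.

```python
from collections import Counter, defaultdict

STRINGS = ["rich", "stick", "stich", "stuck", "static"]  # ids 0..4 (assumed)


def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(strings, q):
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for gram in set(qgrams(s, q)):
            index[gram].append(sid)
    return index


def edit_distance(s1, s2):
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (c1 != c2)))
        prev = curr
    return prev[-1]


def search(query, strings, q, k):
    """Return (candidate ids after the count filter, verified answer ids)."""
    index = build_index(strings, q)
    threshold = (len(query) - q + 1) - k * q
    if threshold <= 0:
        # The filter cannot prune anything: every string is a candidate.
        candidates = list(range(len(strings)))
    else:
        counts = Counter()
        for gram in qgrams(query, q):
            for sid in index.get(gram, []):
                counts[sid] += 1
        candidates = sorted(sid for sid, c in counts.items() if c >= threshold)
    answers = [sid for sid in candidates
               if edit_distance(query, strings[sid]) <= k]
    return candidates, answers


print(search("shtick", STRINGS, q=2, k=1))  # threshold 3
print(search("shtick", STRINGS, q=3, k=1))  # threshold 1
```

With q = 2 only "stick" survives the filter, while with q = 3 the shorter lists are cheaper to merge but "stich" and "static" also pass the filter and must be rejected by the edit-distance check.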

SLIDE 8

Outline

  • Motivation
  • VGRAM
      • Main idea
      • Decomposing strings to grams
      • Choosing good grams
      • Effect of edit operations on grams
      • Adopting VGRAM in existing algorithms
  • Experiments

SLIDE 9

Motivation

  • Small index size (memory)
  • Small running time
      • Merge matched inverted lists
      • Calculate ED(query, candidate)

SLIDE 10

Observation 1: dilemma of choosing “q”

Increasing “q” causes:

  • Longer grams
  • Shorter lists
  • Smaller # of common grams of similar strings

[Figure: the example strings (rich, stick, stich, stuck, static) and their 2-gram inverted lists, as on Slide 5]

SLIDE 11

Observation 2: skewed distribution of gram frequencies

  • DBLP: 276,699 article titles
  • Popular 5-grams: ation (>114K times), tions, ystem, catio

SLIDE 12

VGRAM: Main idea

  • Grams with variable lengths (between qmin and qmax)
      • zebra: ze(123)
      • corrasion: co(5213), cor(859), corr(171)
  • Advantages
      • Reducing index size ☺
      • Reducing running time ☺
      • Adoptable by many algorithms ☺

SLIDE 13

Challenges

  • Generating variable-length grams?
  • Constructing a high-quality gram dictionary?
  • Relationship between string similarity and their gram-set similarity?
  • Adopting VGRAM in existing algorithms?

SLIDE 14

Challenge 1: String → variable-length grams?

  • Fixed-length 2-grams of "universal"
  • Variable-length grams of "universal", using a [2,4]-gram dictionary: ni, ivr, sal, uni, vers
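
One way to turn a string into variable-length grams is sketched below. It is an assumption about the procedure, not code from the talk: at each position take the longest dictionary gram starting there, fall back to the qmin-gram when no dictionary gram matches, and drop any gram that lies entirely inside a gram already emitted. The function name `vgram_decompose` and the 0-based positions are choices made for this example.

```python
def vgram_decompose(s, dictionary, qmin, qmax):
    """Return (position, gram) pairs; positions are 0-based."""
    grams = []
    covered_end = 0  # exclusive end of the right-most gram emitted so far
    for p in range(len(s) - qmin + 1):
        gram = s[p:p + qmin]  # fallback when no dictionary gram starts at p
        for length in range(min(qmax, len(s) - p), qmin - 1, -1):
            if s[p:p + length] in dictionary:
                gram = s[p:p + length]  # longest match wins
                break
        end = p + len(gram)
        if end > covered_end:  # not subsumed by an earlier, longer gram
            grams.append((p, gram))
            covered_end = end
    return grams


DICTIONARY = {"ni", "ivr", "sal", "uni", "vers"}
print(vgram_decompose("universal", DICTIONARY, qmin=2, qmax=4))
# [(0, 'uni'), (2, 'iv'), (3, 'vers'), (6, 'sal')]
```

Under these assumptions "universal" yields four grams instead of the eight fixed-length 2-grams; "ni" is dropped because it lies inside "uni", and "ivr" never matches because the string contains "ive", not "ivr".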

SLIDE 15

Representing gram dictionary as a trie

[2,4]-gram dictionary: ni, ivr, sal, uni, vers

[Figure: trie of the dictionary grams, shown with the fixed-length 2-grams and the variable-length grams of "universal"]
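
A trie makes the "longest dictionary gram starting at this position" lookup a single walk of at most qmax characters. The sketch below uses nested dictionaries; the class name `GramTrie`, its methods, and the end-of-gram sentinel are illustrative choices, not the talk's data structure.

```python
_END = object()  # sentinel key marking "a gram ends at this node"


class GramTrie:
    def __init__(self, grams=()):
        self.root = {}
        for gram in grams:
            self.insert(gram)

    def insert(self, gram):
        node = self.root
        for ch in gram:
            node = node.setdefault(ch, {})
        node[_END] = True

    def longest_match(self, s, start, qmax):
        """Longest dictionary gram that is a prefix of s[start:], or None."""
        node, best = self.root, None
        for i in range(start, min(start + qmax, len(s))):
            node = node.get(s[i])
            if node is None:
                break
            if _END in node:
                best = s[start:i + 1]
        return best


trie = GramTrie(["ni", "ivr", "sal", "uni", "vers"])
for p in range(len("universal")):
    print(p, trie.longest_match("universal", p, qmax=4))
```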

SLIDE 16

Challenge 2: Constructing gram dictionary

selecting grams

  • Pruning trie using a frequency threshold T (e.g., 2)

SLIDE 17

Challenge 2: Constructing gram dictionary

selecting grams

  • Pruning trie using a frequency threshold T (e.g., 2)
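
The idea of selecting grams with a frequency threshold can be illustrated with a deliberately simplified sketch. It is not the pruning procedure from the talk (for instance, it has no pruning policy such as LargeFirst): it simply keeps every qmin-gram and extends a gram by one character only while its frequency in the collection exceeds T. The function name `build_gram_dictionary` and the example collection are assumptions.

```python
from collections import Counter


def build_gram_dictionary(strings, qmin, qmax, T):
    """Simplified selection: extend a gram only while it is too frequent."""
    freq = Counter()
    for s in strings:
        for q in range(qmin, qmax + 1):
            for i in range(len(s) - q + 1):
                freq[s[i:i + q]] += 1

    dictionary = set()
    stack = [g for g in freq if len(g) == qmin]
    while stack:
        gram = stack.pop()
        dictionary.add(gram)
        if freq[gram] > T and len(gram) < qmax:
            # Frequent gram: also keep its one-character extensions.
            stack.extend(h for h in freq
                         if len(h) == len(gram) + 1 and h.startswith(gram))
    return dictionary


strings = ["rich", "stick", "stich", "stuck", "static"]
print(sorted(build_gram_dictionary(strings, qmin=2, qmax=4, T=2)))
```

On this toy collection the frequent grams "st", "ic", and "ti" are extended, so longer and rarer grams such as "tick" enter the dictionary, while rare grams such as "ri" are left at length 2.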

SLIDE 18

Final gram dictionary

Final grams

SLIDE 19

Outline

  • Motivation
  • VGRAM
      • Main idea
      • Decomposing strings to grams
      • Choosing good grams
      • Effect of edit operations on grams
      • Adopting VGRAM in existing algorithms
  • Experiments

SLIDE 20

Challenge 3: Edit operation’s effect on grams

Fixed length q: k operations could affect k * q grams

Example string: universal
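
The k * q figure for fixed-length grams comes from the fact that one character belongs to at most q overlapping q-grams. A minimal sketch that lists the grams touching a given position (0-based; the helper name `grams_touching` is illustrative):

```python
def grams_touching(s, q, i):
    """q-grams of s, with start positions, that contain character index i."""
    return [(p, s[p:p + q]) for p in range(len(s) - q + 1) if p <= i < p + q]


s = "universal"
print(grams_touching(s, 2, 4))  # grams containing the 'e'
print(max(len(grams_touching(s, 3, i)) for i in range(len(s))))  # at most q = 3
```

Characters near either end of the string belong to fewer grams, so k * q is an upper bound rather than an exact count.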

SLIDE 21

Deletion affects variable-length grams

[Figure: a deletion at position i. Grams overlapping the window from i-qmax+1 to i+qmax-1 may be affected; grams entirely to the left or right of the window are not affected.]

SLIDE 22

Grams affected by a deletion

[Figure: the [2,4]-grams of "universal" around a deletion at position i; each gram inside the window from i-qmax+1 to i+qmax-1 is marked "Affected?"]

SLIDE 23

Grams affected by a deletion (cont)

[Figure: the window from i-qmax+1 to i+qmax-1 around a deletion at position i; whether a gram is affected is checked with a trie of grams and a trie of reversed grams]

SLIDE 24

# of grams affected by each operation

  position:          _ u _ n _ i _ v _ e _ r _ s _ a _ l _
  # grams affected:  0 1 1 1 1 2 1 2 2 2 1 1 1 2 1 1 1 1 0

Values under letters: deletion/substitution of that character. Values under “_”: insertion at that gap.

SLIDE 25

Max # of grams affected by k operations

  • NAG vector of s = <2,4> (NAG: # of affected grams)
  • With 2 edit operations, at most 4 grams can be affected
  • Precomputed

  position:          _ u _ n _ i _ v _ e _ r _ s _ a _ l _
  # grams affected:  0 1 1 1 1 2 1 2 2 2 1 1 1 2 1 1 1 1 0

Values under letters: deletion/substitution of that character. Values under “_”: insertion at that gap.
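
One simple way to obtain a bound of this shape from the per-operation counts is to add up the k largest values. This is an assumption made for illustration: it treats each count as an independent upper bound for one operation and is not necessarily how the talk's NAG vector is computed, although it reproduces <2,4> for the counts shown on this slide. The function name `nag_upper_bounds` is illustrative.

```python
def nag_upper_bounds(per_operation_counts, kmax):
    """bounds[k-1] = sum of the k largest per-operation counts."""
    largest_first = sorted(per_operation_counts, reverse=True)
    bounds, total = [], 0
    for k in range(kmax):
        if k < len(largest_first):
            total += largest_first[k]
        bounds.append(total)
    return bounds


# Counts from the slide, in left-to-right order for "universal".
COUNTS = [0, 1, 1, 1, 1, 2, 1, 2, 2, 2, 1, 1, 1, 2, 1, 1, 1, 1, 0]
print(nag_upper_bounds(COUNTS, kmax=2))  # [2, 4]
```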

SLIDE 26

Summary of VGRAM index

SLIDE 27

Challenge 4: adopting VGRAM

  • Easily adoptable by many algorithms
  • Basic interfaces:
      • String s → grams
      • Strings s1, s2 such that ed(s1,s2) <= k → min # of their common grams

SLIDE 28

Lower bound on # of common grams

If ed(s1,s2) <= k, then their # of common grams >=:

  • Fixed length (q): (|s1| - q + 1) - k * q
  • Variable lengths: (# of grams of s1) - NAG(s1,k)

Example string: universal
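
Both bounds are one-line computations. The sketch below clamps negative results to 0, since a negative bound carries no information; the function names are illustrative, and the NAG vector passed to the variable-length version is assumed to be precomputed. The value NAG("shtick", 1) = 2 used in the example is inferred from the lower bound of 1 quoted for the grams sh, ht, tick, not stated directly in the talk.

```python
def lower_bound_fixed(length, q, k):
    """(|s1| - q + 1) - k * q, clamped at 0."""
    return max(0, (length - q + 1) - k * q)


def lower_bound_vgram(num_grams, nag_vector, k):
    """(# of grams of s1) - NAG(s1, k), clamped at 0.

    nag_vector[k-1] is assumed to hold NAG(s1, k).
    """
    return max(0, num_grams - nag_vector[k - 1])


print(lower_bound_fixed(len("shtick"), q=2, k=1))  # 3
print(lower_bound_fixed(len("shtick"), q=3, k=1))  # 1
print(lower_bound_vgram(3, [2], k=1))              # 1, with the inferred NAG value
```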

SLIDE 29

Example: algorithm using inverted lists

Query: “shtick”, ED(shtick, ?) ≤ 1

  • 2-grams: inverted lists … ck, ic, … ti, …; lower bound = 3
  • [2,4]-grams: query grams sh, ht, tick; inverted lists … ck, ic, ich, … tic, tick, …; lower bound = 1

[Figure: the example strings (rich, stick, stich, stuck, static) with both sets of inverted lists]

SLIDE 30

Outline

  • Motivation
  • VGRAM
      • Main idea
      • Decomposing strings to grams
      • Choosing good grams
      • Effect of edit operations on grams
      • Adopting VGRAM in existing algorithms
  • Experiments

SLIDE 31

Data sets

  • Data set 1: Texas Real Estate Commission. 151K person names, average length = 33.
  • Data set 2: English dictionary from the Aspell spellchecker for Cygwin. 149,165 words, average length = 8.
  • Data set 3: DBLP Bibliography. 277K titles, average length = 62.

Environment: VC++, Dell GX620 PC with an Intel Pentium 3.40GHz Dual Core CPU, 2GB memory, Windows XP.

SLIDE 32

VGRAM overhead (index size)

Dataset 3: DBLP titles, [5,7]-gram, T=500, LargeFirst pruning policy

SLIDE 33

VGRAM overhead (construction time)

Dataset 3: DBLP titles, [5,7]-gram, T=500, LargeFirst pruning policy

SLIDE 34

Benefits over fixed-length grams (index)

Dataset 1: 150K Person names, k=1, MergeCount algorithm, T=1000, LargeFirst pruning policy

SLIDE 35

Benefits over fixed-length grams (running time)

Dataset 1: 150K Person names, k=1, MergeCount algorithm, T=1000, LargeFirst pruning policy

SLIDE 36

Enhance approximate join algorithms

  • ProbeCount
  • ProbeCluster
  • PartEnum

SLIDE 37

Improving algorithm ProbeCount

Dataset 1: 50K person names, [4,6]-gram, T=200, k=3, LargeFirst pruning policy

SLIDE 38

Improving algorithm ProbeCluster

Dataset 1: [5,7]-gram, T=1000, LargeFirst pruning policy

SLIDE 39

Improving algorithm PartEnum

Dataset 1: [4,6]-gram, T=1000, LargeFirst pruning policy

SLIDE 40

Conclusions

  • VGRAM: using grams that are
      • variable-length
      • high-quality
  • Adoptable in existing algorithms
      • Reduce index size
      • Reduce running time

SLIDE 41

Related work

  • Approximate string matching
      • q-grams, q-samples
      • Inside DBMS
      • Substring matching
  • Set similarity join
  • Variable-length gram applications
      • Speech recognition, information retrieval, artificial intelligence
      • Substring selectivity estimation
  • Improve space and time efficiency
      • n-Gram/2L

SLIDE 42

Thank you

Questions or Comments?