VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams
Chen Li, Bin Wang, and Xiaochun Yang
Northeastern University, China
Approximate selection queries
2
… Samuel Jackson Schwarzenegger Samuel Jackson Keanu Reeves Schwarrzenger
Query errors:
Data errors
Applications
3
Record linkage …
Edit distance, Jaccard, Cosine, …
4
2-grams
5
[Figure: 2-gram inverted lists]
6
# of common grams >= 3
Query: “shtick”, ED(shtick, ?)≤1
[Figure: 2-gram inverted lists]
7
# of common grams >= 1
Query: “shtick”, ED(shtick, ?)≤1
[Figure: 3-gram inverted lists]
Shorter inverted lists
More false positives
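The two thresholds on the preceding slides (at least 3 common grams for 2-grams, at least 1 for 3-grams) agree with the standard count-filter bound for fixed-length grams: a string of length n has n - q + 1 grams, and one edit operation can affect at most q of them. A minimal Python sketch, assuming no padding characters are added to the strings (the function name is illustrative, not from the talk):

```python
def common_gram_lower_bound(s: str, q: int, k: int) -> int:
    """Minimum number of q-grams that a string within edit distance k shares with s."""
    num_grams = len(s) - q + 1   # number of q-grams of s, without padding
    return num_grams - k * q     # each edit operation affects at most q grams


print(common_gram_lower_bound("shtick", 2, 1))  # 3
print(common_gram_lower_bound("shtick", 3, 1))  # 1
```

A larger q lowers the bound, which is why the 3-gram index admits more false positives for the same query.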
8
Motivation
VGRAM
  Main idea
  Decomposing strings to grams
  Choosing good grams
  Effect of edit operations on grams
  Adopting VGRAM in existing algorithms
Experiments
9
Merge matched inverted lists
Calculate ED(query, candidate)
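The two steps above (merge the inverted lists of the query's grams, then verify each candidate with the edit distance) can be sketched as follows. This is a simplified illustration with fixed-length q-grams, not the implementation used in the talk. The string collection and all function names are hypothetical.

```python
from collections import Counter, defaultdict


def qgrams(s, q):
    """All overlapping q-grams of s, without padding."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def edit_distance(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def build_index(strings, q):
    """Inverted index: gram -> {string id: occurrences of the gram in that string}."""
    index = defaultdict(Counter)
    for sid, s in enumerate(strings):
        for g in qgrams(s, q):
            index[g][sid] += 1
    return index


def search(strings, index, query, q, k):
    """Return all strings within edit distance k of query (filter, then verify)."""
    threshold = (len(query) - q + 1) - k * q
    if threshold <= 0:
        # The count filter cannot prune anything; verify every string.
        candidates = range(len(strings))
    else:
        counts = Counter()
        for g, qc in Counter(qgrams(query, q)).items():
            for sid, dc in index.get(g, {}).items():
                counts[sid] += min(qc, dc)  # multiset intersection size
        candidates = [sid for sid, c in counts.items() if c >= threshold]
    return sorted(strings[sid] for sid in candidates
                  if edit_distance(query, strings[sid]) <= k)


# Hypothetical collection, chosen only for illustration.
strings = ["stick", "stich", "such", "stuck"]
index = build_index(strings, 2)
print(search(strings, index, "shtick", 2, 1))  # ['stick']
```

Only "stick" shares at least 3 of the query's 2-grams, so it is the only string whose edit distance is computed.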
10
Increasing q causes:
Longer grams
Shorter lists
Smaller # of common grams of similar strings
[Figure: 2-gram inverted lists]
11
Observation 2: skewed distribution of gram frequencies
Popular 5-grams: ation (>114K times), tions, ystem, catio
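Gram frequencies like the ones quoted above can be measured by counting every q-gram in the collection. A minimal sketch, assuming a small hypothetical word list (the counts on the slide come from the authors' data, not from this example):

```python
from collections import Counter


def gram_frequencies(strings, q):
    """Count how often each q-gram occurs across the whole collection."""
    counts = Counter()
    for s in strings:
        counts.update(s[i:i + q] for i in range(len(s) - q + 1))
    return counts


# Hypothetical words, chosen only for illustration.
words = ["information", "station", "nation", "system"]
freq = gram_frequencies(words, 5)
print(freq.most_common(1))  # [('ation', 3)]
```

Even in this tiny list one gram dominates, which is the kind of skew that leads to very long inverted lists for popular grams.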
12
Grams with variable lengths (between qmin and qmax)
zebra
ze(123)
corrasion
co(5213), cor(859), corr(171)
Advantages
Reducing index size ☺
Reducing running time ☺
Adoptable by many algorithms ☺
13
Generating variable-length grams?
Constructing a high-quality gram dictionary?
Relationship between string similarity and their …
Adopting VGRAM in existing algorithms?
14
Fixed-length 2-grams Variable-length grams
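One way to decompose a string into variable-length grams, given a gram dictionary, is sketched below. It assumes the following rule: at each position take the longest dictionary gram starting there, fall back to the qmin-gram if there is none, and skip any gram whose extent is already covered by a previously emitted gram. The function name and the toy dictionary are illustrative assumptions, and a real implementation would look grams up in a trie rather than a Python set.

```python
def vgen(s, gram_dict, qmin, qmax):
    """Decompose s into positional variable-length grams (0-based positions)."""
    grams = []
    covered_end = 0  # exclusive end of the text covered by emitted grams
    for p in range(len(s) - qmin + 1):
        t = s[p:p + qmin]  # default: the qmin-gram at this position
        # Prefer the longest dictionary gram that starts at p.
        for length in range(min(qmax, len(s) - p), qmin - 1, -1):
            if s[p:p + length] in gram_dict:
                t = s[p:p + length]
                break
        end = p + len(t)
        # Start positions only increase, so the gram is contained in an
        # earlier one exactly when it does not extend past covered_end.
        if end > covered_end:
            grams.append((p, t))
            covered_end = end
    return grams


# Toy dictionary, chosen only for illustration.
toy_dict = {"ni", "ivr", "sal", "uni", "vers"}
print(vgen("universal", toy_dict, 2, 4))
# [(0, 'uni'), (2, 'iv'), (3, 'vers'), (6, 'sal')]
```

The string yields 4 variable-length grams here, compared with 8 fixed-length 2-grams.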
15
Fixed-length 2-grams Variable-length grams
16
selecting grams
17
selecting grams
18
Final grams
19
20
Fixed length: q
21
Deletion Not affected Not affected Affected
22
Deletion
Affected? Deletion Affected?
23
Deletion
Affected? Trie of grams Trie of reversed grams
24
Deletion/substitution Insertion
25
With 2 edit operations, at most 4 grams can be affected
This is called the NAG vector (# of affected grams)
Precomputed
Deletion/substitution Insertion
26
27
String s: grams
Strings s1, s2 such that ed(s1, s2) <= k
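Given precomputed NAG vectors, a lower bound on the number of grams two strings within edit distance k must share can be sketched as below. It assumes the bound takes the form max(|VG(s1)| - NAG(s1, k), |VG(s2)| - NAG(s2, k)), where VG(s) is the gram set of s and NAG(s, k) is the k-th entry of its NAG vector. The function name and the example values are hypothetical.

```python
def vgram_lower_bound(num_grams_1, nag_1, num_grams_2, nag_2, k):
    """Lower bound on the common grams of two strings with edit distance <= k.

    nag_i[j - 1] is the maximum number of grams of string i that
    j edit operations can affect (the string's NAG vector).
    """
    surviving_1 = num_grams_1 - nag_1[k - 1]
    surviving_2 = num_grams_2 - nag_2[k - 1]
    return max(0, surviving_1, surviving_2)  # a negative bound is vacuous


# Hypothetical values: 4 and 5 grams, both with NAG vector [2, 4].
print(vgram_lower_bound(4, [2, 4], 5, [2, 4], 1))  # 3
print(vgram_lower_bound(4, [2, 4], 5, [2, 4], 2))  # 1
```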
28
29
Query: “shtick”, ED(shtick, ?)≤1
Lower bound = 3; lower bound = 1
[Figure: inverted lists for 2-4 grams and for 2-grams]
30
31
Data set 1: Texas Real Estate Commission.
151K person names, average length = 33.
Data set 2: English dictionary from Aspell.
149,165 words, average length = 8.
Data set 3: DBLP Bibliography.
277K titles, average length = 62.
Environment: VC++, Dell GX620 PC with an Intel Pentium 3.40GHz Dual Core CPU, 2GB memory, Windows XP.
32
Dataset 3: DBLP titles, [5,7]-gram, T=500, LargeFirst pruning policy
33
Dataset 3: DBLP titles, [5,7]-gram, T=500, LargeFirst pruning policy
34
Dataset 1: 150K Person names, k=1, MergeCount algorithm, T=1000, LargeFirst pruning policy
35
Dataset 1: 150K Person names, k=1, MergeCount algorithm, T=1000, LargeFirst pruning policy
36
ProbeCount, ProbeCluster, PartEnum
37
Dataset 1: 50K person names, k=3, [4,6]-gram, T=200, LargeFirst pruning policy
38
Dataset 1: [5,7]-gram, T=1000, LargeFirst pruning policy
39
Dataset 1: [4,6]-gram, T=1000, LargeFirst pruning policy
40
VGRAM: using high-quality grams of variable length
Adoptable in existing algorithms
Reduces index size
Reduces running time
41
Approximate string matching
q-grams, q-samples
Inside DBMS
Substring matching
Set similarity join
Variable-length gram applications
Speech recognition, information retrieval, artificial intelligence
Substring selectivity estimation
Improve space and time efficiency
n-Gram/2L
42