VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams



  1. VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams. Chen Li; Bin Wang and Xiaochun Yang (Northeastern University, China)

  2. Approximate selection queries
     - Example: the query "Schwarrzenger" against a collection containing Keanu Reeves, Samuel Jackson, Schwarzenegger, …
     - Query errors: limited knowledge about data; typos; limited input device (cell phone)
     - Data errors: typos; web data; OCR
     - Applications: spellchecking; query relaxation; …

  3. Record linkage
     - Matching similar records across two relations R and S, e.g., "infromix" vs. "informix" and "microsoft" vs. "mcrosoft"
     - Similarity functions: edit distance, Jaccard, cosine, …
     - Applications: record linkage, …
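
Edit distance is the similarity function used in the rest of the deck. The following is a minimal sketch of the standard dynamic-programming (Levenshtein) computation; the function name `edit_distance` is chosen here for illustration and does not come from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


print(edit_distance("mcrosoft", "microsoft"))  # 1
print(edit_distance("infromix", "informix"))   # 2
```

Under this definition a swapped pair of letters ("infromix") costs two operations, because transposition is not a single edit.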

  4. "q-grams" of strings. Example: the 2-grams of "universal".
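
As a small illustration, fixed-length q-grams can be produced with a sliding window. This sketch assumes no padding characters at the string boundaries, which matches the gram counts used in the later "shtick" example; the function name `qgrams` is illustrative.

```python
def qgrams(s: str, q: int) -> list[str]:
    """All substrings of length q, in order of their start position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("universal", 2))
# ['un', 'ni', 'iv', 've', 'er', 'rs', 'sa', 'al']
```

A string of length n yields n - q + 1 grams under this convention.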

  5. q-gram inverted lists (2-grams)
     - Strings (id: string): 0 rich, 1 stick, 2 stich, 3 stuck, 4 static
     - Inverted lists (gram → string ids): at → 4; ch → 0, 2; ck → 1, 3; ic → 0, 1, 2, 4; ri → 0; st → 1, 2, 3, 4; ta → 4; ti → 1, 2, 4; tu → 3; uc → 3

  6. Searching using inverted lists
     - Query: "shtick", ED(shtick, ?) ≤ 1
     - 2-grams of the query: sh, ht, ti, ic, ck; of these, ti, ic, and ck appear in the index
     - # of common grams >= 3
     - (Figure: the 2-gram inverted lists for rich, stick, stich, stuck, static, as on slide 5)
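
The search on this slide can be sketched as a count filter over inverted lists: merge the lists of the query's grams and keep the strings that share enough grams with the query. This is a simplified illustration, not the authors' implementation; the helper names are hypothetical, and the threshold uses the fixed-length bound (|s| - q + 1) - k * q that the deck gives on a later slide.

```python
from collections import Counter, defaultdict


def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(strings, q):
    """Map each q-gram to the ids of the strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for gram in qgrams(s, q):
            index[gram].append(sid)
    return index


def candidates(query, strings, index, q, k):
    """Ids of strings sharing at least (|query| - q + 1) - k*q grams with query."""
    grams = qgrams(query, q)
    threshold = len(grams) - k * q
    if threshold <= 0:
        # The filter cannot prune anything; every string is a candidate.
        return list(range(len(strings)))
    counts = Counter()
    for gram in grams:
        for sid in index.get(gram, []):
            counts[sid] += 1
    return sorted(sid for sid, c in counts.items() if c >= threshold)


strings = ["rich", "stick", "stich", "stuck", "static"]
print(candidates("shtick", strings, build_index(strings, 2), q=2, k=1))  # [1]
print(candidates("shtick", strings, build_index(strings, 3), q=3, k=1))  # [1, 2, 4]
```

Candidates still need to be verified by computing the actual edit distance. With 3-grams the lists are shorter but the threshold drops to 1, so more candidates survive the filter.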

  7. 2-grams → 3-grams?
     - Query: "shtick", ED(shtick, ?) ≤ 1
     - 3-grams of the query: sht, hti, tic, ick; of these, tic and ick appear in the index
     - # of common grams >= 1
     - 3-gram inverted lists (gram → string ids): ati → 4; ich → 0, 2; ick → 1; ric → 0; sta → 4; sti → 1, 2; stu → 3; tat → 4; tic → 1, 2, 4; tuc → 3; uck → 3
     - Shorter inverted lists
     - More false positives

  8. Outline: Motivation; VGRAM (main idea, decomposing strings to grams, choosing good grams, effect of edit operations on grams, adopting VGRAM in existing algorithms); Experiments

  9. Motivation
     - Small index size (memory)
     - Small running time: merge matched inverted lists; calculate ED(query, candidate)

  10. Observation 1: dilemma of choosing "q"
     - Increasing "q" causes: longer grams; shorter lists; a smaller # of common grams of similar strings
     - (Figure: the 2-gram inverted lists for rich, stick, stich, stuck, static, as on slide 5)

  11. Observation 2: skew distributions of gram frequencies
     - DBLP: 276,699 article titles
     - Popular 5-grams: ation (>114K times), tions, ystem, catio
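
Gram-frequency skew of this kind can be measured by counting every q-gram across a collection. The sketch below uses a tiny made-up word list rather than the DBLP titles, so the numbers are purely illustrative; `gram_frequencies` is a hypothetical helper name.

```python
from collections import Counter


def gram_frequencies(strings, q):
    """Count how often each q-gram occurs across all strings."""
    counts = Counter()
    for s in strings:
        counts.update(s[i:i + q] for i in range(len(s) - q + 1))
    return counts


# Toy data (an assumption for illustration, not the DBLP data set).
words = ["station", "nation", "ration"]
freq = gram_frequencies(words, 5)
print(freq.most_common(1))  # [('ation', 3)]
```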

  12. VGRAM: Main idea
     - Grams with variable lengths (between q_min and q_max)
     - zebra: ze (123)
     - corrasion: co (5213), cor (859), corr (171)
     - Advantages: reducing index size; reducing running time; adoptable by many algorithms

  13. Challenges
     - Generating variable-length grams?
     - Constructing a high-quality gram dictionary?
     - Relationship between string similarity and their gram-set similarity?
     - Adopting VGRAM in existing algorithms?

  14. Challenge 1: String → variable-length grams?
     - Fixed-length 2-grams of "universal"
     - Variable-length grams of "universal" using the [2,4]-gram dictionary {ni, ivr, sal, uni, vers}
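
The slide does not spell out the decomposition procedure, so the following is only a sketch under stated assumptions: at each position take the longest dictionary gram starting there, fall back to the q_min-gram when no dictionary gram matches, and skip any gram that lies entirely inside a previously emitted gram. The function name `vgen` and these rules are assumptions for illustration.

```python
def vgen(s, dictionary, q_min, q_max):
    """Decompose s into positional variable-length grams (start, gram)."""
    grams = []
    covered_end = 0  # exclusive end of the furthest-reaching gram emitted so far
    for p in range(len(s) - q_min + 1):
        # Assumed fallback: the q_min-gram when no dictionary gram starts at p.
        gram = s[p:p + q_min]
        # Prefer the longest dictionary gram starting at p.
        for length in range(min(q_max, len(s) - p), q_min - 1, -1):
            if s[p:p + length] in dictionary:
                gram = s[p:p + length]
                break
        # Skip grams fully contained in an earlier gram.
        if p + len(gram) <= covered_end:
            continue
        grams.append((p, gram))
        covered_end = p + len(gram)
    return grams


dictionary = {"ni", "ivr", "sal", "uni", "vers"}
print(vgen("universal", dictionary, q_min=2, q_max=4))
# [(0, 'uni'), (2, 'iv'), (3, 'vers'), (6, 'sal')]
```

Under these assumptions "universal" is covered by four grams instead of the eight fixed-length 2-grams.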

  15. Representing gram dictionary as a trie
     - Fixed-length 2-grams of "universal"
     - Variable-length grams of "universal" using the [2,4]-gram dictionary {ni, ivr, sal, uni, vers}

  16. Challenge 2: Constructing gram dictionary (selecting grams)
     - Pruning the trie using a frequency threshold T (e.g., 2)

  17. Challenge 2: Constructing gram dictionary (selecting grams)
     - Pruning the trie using a frequency threshold T (e.g., 2)

  18. Final gram dictionary (figure: final grams)

  19. Outline: Motivation; VGRAM (main idea, decomposing strings to grams, choosing good grams, effect of edit operations on grams, adopting VGRAM in existing algorithms); Experiments

  20. Challenge 3: Edit operations' effect on grams
     - Fixed length q (example string: "universal"): k operations could affect k * q grams

  21. Deletion affects variable-length grams
     - For a deletion at position i, only grams overlapping the window from i - q_max + 1 to i + q_max - 1 can be affected; grams outside this window are not affected

  22. Grams affected by a deletion
     - Within the window from i - q_max + 1 to i + q_max - 1 around a deletion at position i, which grams are affected?
     - Example: a deletion in "universal" with [2,4]-grams

  23. Grams affected by a deletion (cont)
     - Same window from i - q_max + 1 to i + q_max - 1 around the deletion at position i
     - (Figure: trie of grams and trie of reversed grams)

  24. # of grams affected by each operation (string: "universal")
     - Deletion/substitution at each character: u 1, n 1, i 2, v 2, e 2, r 1, s 2, a 1, l 1
     - Insertion at each gap, from before "u" to after "l": 0, 1, 1, 1, 2, 1, 1, 1, 1, 0

  25. Max # of grams affected by k operations (string: "universal")
     - Deletion/substitution at each character: u 1, n 1, i 2, v 2, e 2, r 1, s 2, a 1, l 1
     - Insertion at each gap, from before "u" to after "l": 0, 1, 1, 1, 2, 1, 1, 1, 1, 0
     - Vector of s = <2,4>: with 2 edit operations, at most 4 grams can be affected
     - Called the NAG vector (# of affected grams)
     - Precomputed
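
The slide states the NAG vector but not how it is derived. One simple upper bound that is consistent with the <2,4> shown here is to sum the k largest per-position counts; treat this rule as an assumption for illustration rather than the authors' stated method. The 19 counts below are read off the slide in position order (gaps and characters interleaved).

```python
def nag_upper_bounds(per_position_counts, max_k):
    """Upper bound on the # of affected grams for 1..max_k edit operations,
    taken as the sum of the k largest per-position counts (an assumed rule)."""
    ranked = sorted(per_position_counts, reverse=True)
    bounds, total = [], 0
    for k in range(max_k):
        total += ranked[k] if k < len(ranked) else 0
        bounds.append(total)
    return bounds


# Counts for "_u_n_i_v_e_r_s_a_l_" as listed on the slide.
counts = [0, 1, 1, 1, 1, 2, 1, 2, 2, 2, 1, 1, 1, 2, 1, 1, 1, 1, 0]
print(nag_upper_bounds(counts, 2))  # [2, 4]
```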

  26. Summary of VGRAM index

  27. Challenge 4: adopting VGRAM
     - Easily adoptable by many algorithms
     - Basic interfaces:
       - String s → grams
       - Strings s1, s2 such that ed(s1,s2) <= k → min # of their common grams

  28. Lower bound on # of common grams
     - Fixed length (q): if ed(s1,s2) <= k, then their # of common grams >= (|s1| - q + 1) - k * q
     - Variable lengths: lower bound = # of grams of s1 - NAG(s1,k)
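
Both bounds are one-line computations. The sketch below uses hypothetical function names; the variable-length example assumes a string that decomposes into 4 grams and has the NAG vector <2,4> from the earlier slide.

```python
def fixed_lower_bound(length, q, k):
    """(|s1| - q + 1) - k * q for fixed-length q-grams."""
    return (length - q + 1) - k * q


def vgram_lower_bound(num_grams, nag_vector, k):
    """# of grams of s1 minus NAG(s1, k); nag_vector[k - 1] holds NAG(s1, k)."""
    return num_grams - nag_vector[k - 1]


print(fixed_lower_bound(len("shtick"), q=2, k=1))  # 3
print(fixed_lower_bound(len("shtick"), q=3, k=1))  # 1
print(vgram_lower_bound(4, [2, 4], k=1))           # 2
```

The first two values match the thresholds used in the "shtick" search examples with 2-grams and 3-grams.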

  29. Example: algorithm using inverted lists
     - Query: "shtick", ED(shtick, ?) ≤ 1
     - Strings (id: string): 0 rich, 1 stick, 2 stich, 3 stuck, 4 static
     - 2-grams: lower bound = 3; inverted lists include ck → 1, 3; ic → 0, 1, 2, 4; ti → 1, 2, 4; …
     - [2,4]-grams: query grams sh, ht, tick; lower bound = 1; inverted lists include ck → 1, 3; ic → 1, 4; ich → 0, 2; tic → 2, 4; tick → 1; …

  30. Outline: Motivation; VGRAM (main idea, decomposing strings to grams, choosing good grams, effect of edit operations on grams, adopting VGRAM in existing algorithms); Experiments

  31. Data sets
     - Data set 1: Texas Real Estate Commission. 151K person names, average length = 33.
     - Data set 2: English dictionary from the Aspell spellchecker for Cygwin. 149,165 words, average length = 8.
     - Data set 3: DBLP Bibliography. 277K titles, average length = 62.
     - Environment: VC++, Dell GX620 PC with an Intel Pentium 3.40GHz Dual Core CPU, 2GB memory, Windows XP.

  32. VGRAM overhead (index size). Dataset 3: DBLP titles, [5,7]-gram, T=500, LargeFirst pruning policy

  33. VGRAM overhead (construction time). Dataset 3: DBLP titles, [5,7]-gram, T=500, LargeFirst pruning policy

  34. Benefits over fixed-length grams (index). Dataset 1: 150K person names, k=1, MergeCount algorithm, T=1000, LargeFirst pruning policy

  35. Benefits over fixed-length grams (running time). Dataset 1: 150K person names, k=1, MergeCount algorithm, T=1000, LargeFirst pruning policy

  36. Enhance approximate join algorithms: ProbeCount, ProbeCluster, PartEnum

  37. Improving algorithm ProbeCount. K=3, 50K person names. Dataset 1: [4,6]-gram, T=200, LargeFirst pruning policy

  38. Improving algorithm ProbeCluster. Dataset 1: [5,7]-gram, T=1000, LargeFirst pruning policy

  39. Improving algorithm PartEnum. Dataset 1: [4,6]-gram, T=1000, LargeFirst pruning policy

  40. Conclusions
     - VGRAM: using grams of variable length and high quality
     - Adoptable in existing algorithms: reduce index size; reduce running time

  41. Related work
     - Approximate string matching: q-grams, q-samples; inside DBMS; substring matching; set similarity join
     - Variable-length gram applications: speech recognition, information retrieval, artificial intelligence; substring selectivity estimation
     - Improve space and time efficiency: n-Gram/2L

  42. Questions or Comments? Thank you
