
Efficient Top-k Algorithms for Fuzzy Search in String Collections


  1. Efficient Top-k Algorithms for Fuzzy Search in String Collections
     Rares Vernica, Chen Li
     Department of Computer Science, University of California, Irvine
     First International Workshop on Keyword Search on Structured Data, 2009
     Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 1 / 17

  2. Outline
     1 Motivation
     2 Efficient Top-k Algorithms
         Problem Formulation
         Algorithms Overview
         Top-k Single-Pass Search Algorithm
     3 Experimental Evaluation

  5. Need for Approximate String Queries

     ID  FirstName  LastName         # Movies
     10  Al         Swartzberg       1
     11  Hanna      Wartenegg        1
     12  Rik        Swartzwelder     30
     13  Joey       Swartzentruber   1
     14  Rene       Swartenbroekx    4
     15  Arnold     Schwarzenegger   283
     16  Luc        Swartenbroeckx   1
     ...
     Figure: Actor names and popularities

     SELECT * FROM Actors WHERE LastName = 'Shwartzenetrugger' ORDER BY "# Movies" DESC LIMIT 1;

     0 Results (the misspelled name matches no row exactly)

  10. Need for Ranking

      ID  FirstName  LastName         # Movies  ED
      10  Al         Swartzberg       1         8
      11  Hanna      Wartenegg        1         8
      12  Rik        Swartzwelder     30        8
      13  Joey       Swartzentruber   1         4
      14  Rene       Swartenbroekx    4         9
      15  Arnold     Schwarzenegger   283       5
      16  Luc        Swartenbroeckx   1         9
      ...
      Figure: Actor names, popularities, and edit distances (ED) to the query string "Shwartzenetrugger".

      Which one result should the system return?
      Which value is more important, # Movies or similarity?

  12. Top-k Similar Strings
      Given:
          Weighted string collection, e.g., actors' LastName and # Movies
          Query string, e.g., "Shwartzenetrugger"
          Similarity function, e.g., Edit Distance
          Scoring function (score of a string in terms of similarity and weight), e.g., linear combination of similarity and popularity
          Integer k
      Return: the k best strings in terms of overall score to the query string.
      Advantages over Range Search:
          specify k instead of a similarity threshold
          guaranteed k results; a range search might have 0 results
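The problem statement above can be made concrete with a brute-force sketch. This is not the authors' algorithm: it scans every string and scores it. The normalization of edit distance into a similarity in [0, 1], the division of the weight by the maximum weight, and the mixing parameter `alpha` are all assumptions chosen for illustration; the slides only say the score may be a linear combination of similarity and popularity.

```python
import heapq

def edit_distance(a, b):
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def score(query, string, weight, max_weight, alpha=0.5):
    """Hypothetical linear score: alpha * similarity + (1 - alpha) * weight."""
    longest = max(len(query), len(string), 1)
    similarity = 1.0 - edit_distance(query, string) / longest
    return alpha * similarity + (1.0 - alpha) * (weight / max_weight)

def top_k(query, collection, k, alpha=0.5):
    """Return the k (string, weight) pairs with the highest score."""
    if not collection:
        return []
    max_weight = max(w for _, w in collection) or 1.0
    return heapq.nlargest(
        k, collection,
        key=lambda sw: score(query, sw[0], sw[1], max_weight, alpha))

# Toy collection of (string, weight) pairs.
data = [("ab", 0.8), ("ccd", 0.7), ("cd", 0.6), ("abcd", 0.5), ("bcc", 0.4)]
best_by_similarity = top_k("bcd", data, 1, alpha=1.0)
best_by_weight = top_k("bcd", data, 1, alpha=0.0)
```

With `alpha=1.0` only similarity counts, and with `alpha=0.0` only the weight counts; values in between trade the two off, which is the question the ranking slide raises.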

  13. Algorithms Overview
      (Diagram) Range Search; Iterative Range Search; Single-Pass Search; Two-Phase Search

  23. q-grams: Overlapping substrings of fixed length
      Find similar strings, e.g., "Vernica" and "Veronica"
      q-gram: substring of length q of a string, e.g., q = 2
          Vernica  → {Ve, er, rn, ni, ic, ca}
          Veronica → {Ve, er, ro, on, ni, ic, ca}
      Similar strings share a certain number of grams (here: Ve, er, ni, ic, ca)
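The gram decomposition on this slide can be reproduced in a few lines. The function names below are hypothetical. The lower bound in `min_shared_grams` is not stated on the slide; it is the standard count-filter bound from the q-gram literature, included here as an assumption about what "a certain number" refers to, and it is defined over multisets of grams.

```python
def qgrams(s, q=2):
    """Return the overlapping length-q substrings of s, in order."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def shared_grams(a, b, q=2):
    """Grams occurring in both strings (set semantics, duplicates ignored)."""
    return set(qgrams(a, q)) & set(qgrams(b, q))

def min_shared_grams(len_a, len_b, d, q=2):
    """Count-filter bound: two strings within edit distance d share at
    least max(len_a, len_b) - q + 1 - q * d grams (counted as multisets)."""
    return max(len_a, len_b) - q + 1 - q * d

vernica = qgrams("Vernica")
veronica = qgrams("Veronica")
common = shared_grams("Vernica", "Veronica")
bound = min_shared_grams(len("Vernica"), len("Veronica"), d=1)
```

For this pair the edit distance is 1 (one inserted "o"), the bound evaluates to 5, and the two strings share exactly the five grams highlighted on the slide.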

  26. q-gram Inverted List Index (q = 2)

      ID  String  Weight
      1   ab      0.80
      2   ccd     0.70
      3   cd      0.60
      4   abcd    0.50
      5   bcc     0.40
      Figure: Dataset

      Gram  String IDs
      ab    1, 4
      cc    2, 5
      cd    2, 3, 4
      bc    4, 5
      Figure: Gram inverted-list index

      Query string: "bcd"
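A minimal sketch of how the index in this figure could be built and probed with the query "bcd". The helper names (`build_index`, `count_shared`) are hypothetical, and the probe simply counts how many query grams each string ID shares; it does not use the weights and is not the single-pass algorithm from the talk.

```python
from collections import Counter, defaultdict

def qgrams(s, q=2):
    """Return the overlapping length-q substrings of s, in order."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def build_index(dataset, q=2):
    """Map each gram to the sorted list of IDs of strings containing it."""
    index = defaultdict(list)
    for string_id in sorted(dataset):
        # set(): record each ID at most once per gram list.
        for gram in set(qgrams(dataset[string_id], q)):
            index[gram].append(string_id)
    return index

def count_shared(index, query, q=2):
    """Count, per string ID, how many distinct query grams it contains."""
    counts = Counter()
    for gram in set(qgrams(query, q)):
        counts.update(index.get(gram, []))
    return counts

# Dataset from the slide: ID -> string.
dataset = {1: "ab", 2: "ccd", 3: "cd", 4: "abcd", 5: "bcc"}
index = build_index(dataset)
counts = count_shared(index, "bcd")
```

The query "bcd" has the grams bc and cd, so only those two lists are read. String 4 ("abcd") appears on both lists, while strings 2, 3, and 5 each appear on one.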
