Efficient Top-k Algorithms for Fuzzy Search in String Collections

SLIDE 1

Efficient Top-k Algorithms for Fuzzy Search in String Collections

Rares Vernica Chen Li

Department of Computer Science University of California, Irvine

First International Workshop on Keyword Search on Structured Data, 2009

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 1 / 17

SLIDE 2

Outline

1. Motivation
2. Efficient Top-k Algorithms
     • Problem Formulation
     • Algorithms Overview
     • Top-k Single-Pass Search Algorithm
3. Experimental Evaluation

SLIDE 5

Need for Approximate String Queries

ID   FirstName  LastName        # Movies
10   Al         Swartzberg      1
11   Hanna      Wartenegg       1
12   Rik        Swartzwelder    30
13   Joey       Swartzentruber  1
14   Rene       Swartenbroekx   4
15   Arnold     Schwarzenegger  283
16   Luc        Swartenbroeckx  1
...  ...        ...             ...

Figure: Actor names and popularities

SELECT * FROM Actors
WHERE LastName = 'Shwartzenetrugger'
ORDER BY "# Movies" DESC
LIMIT 1;

0 Results

SLIDE 10

Need for Ranking

ID   FirstName  LastName        # Movies  ED
10   Al         Swartzberg      1         8
11   Hanna      Wartenegg       1         8
12   Rik        Swartzwelder    30        8
13   Joey       Swartzentruber  1         4
14   Rene       Swartenbroekx   4         9
15   Arnold     Schwarzenegger  283       5
16   Luc        Swartenbroeckx  1         9
...  ...        ...             ...       ...

Figure: Actor names, popularities, and edit distances to a query string “Shwartzenetrugger”.

Which one result should the system return?
Which value is more important, # Movies or similarity?
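The ED column above is an edit distance. As a minimal sketch, assuming the standard Levenshtein definition (unit-cost insertions, deletions, and substitutions; the slides do not state which variant was used), the distance can be computed with a two-row dynamic program. The function name `edit_distance` is illustrative only.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    # prev[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


query = "Shwartzenetrugger"
for name in ["Swartzentruber", "Schwarzenegger", "Swartzberg"]:
    print(name, edit_distance(query, name))
```

The loop at the end prints the distance from the query to three of the last names in the table, which can be compared against the ED column.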

SLIDE 12

Top-k Similar Strings

Given:

  • Weighted string collection (e.g., actors’ LastName and # Movies)
  • Query string (e.g., “Shwartzenetrugger”)
  • Similarity function (e.g., Edit Distance)
  • Scoring function: score of a string in terms of similarity and weight (e.g., linear combination of similarity and popularity)
  • Integer k

Return: the k best strings in terms of overall score to the query string.

Advantages over Range Search:

  • specify k instead of a similarity threshold
  • guaranteed k results; a range search might have 0 results
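The problem statement can be sketched as a brute-force baseline that scores every string and keeps the k best. Everything specific here is an assumption for illustration: Jaccard similarity over 2-grams as the similarity function, a linear scoring function with a mixing parameter `alpha`, and the five-string toy collection that appears on the index slides. The efficient algorithms in the talk avoid exactly this full scan.

```python
import heapq


def qgrams(s, q=2):
    """Set of overlapping substrings of length q."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard(a, b, q=2):
    """Jaccard similarity of the q-gram sets of a and b."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 1.0


def score(query, string, weight, alpha=0.5):
    """Hypothetical linear combination of similarity and weight."""
    return alpha * jaccard(query, string) + (1 - alpha) * weight


def top_k(query, collection, k, alpha=0.5):
    """Brute force: score every (string, weight) pair, keep the k best."""
    return heapq.nlargest(
        k, collection,
        key=lambda item: score(query, item[0], item[1], alpha))


collection = [("ab", 0.80), ("ccd", 0.70), ("cd", 0.60),
              ("abcd", 0.50), ("bcc", 0.40)]
best = top_k("bcd", collection, k=2)
print(best)
```

With `alpha = 0.0` the ranking depends on weight alone, and with `alpha = 1.0` on similarity alone, which is the trade-off raised on the "Need for Ranking" slide.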

SLIDE 13

Algorithms Overview

  • Iterative Range Search
  • Single-Pass Search
  • Two-Phase Search

[Diagram: two component boxes labeled “Range Search”]

SLIDE 23

q-grams: Overlapping substrings of fixed length

Find similar strings: e.g., “Vernica” and “Veronica”
q-gram: substring of length q of a string: e.g., q = 2

Vernica → {Ve,er,rn,ni,ic,ca}
Veronica → {Ve,er,ro,on,ni,ic,ca}

Similar strings share a certain number of grams
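The decomposition above can be reproduced with a short sliding-window function. This is a minimal sketch; the name `qgrams` is illustrative, and it does not pad the string at either end, which matches the gram sets shown on the slide.

```python
def qgrams(s, q=2):
    """Overlapping substrings of length q, in order of appearance."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


vernica = qgrams("Vernica")
veronica = qgrams("Veronica")
shared = set(vernica) & set(veronica)

print(vernica)         # ['Ve', 'er', 'rn', 'ni', 'ic', 'ca']
print(veronica)        # ['Ve', 'er', 'ro', 'on', 'ni', 'ic', 'ca']
print(sorted(shared))  # ['Ve', 'ca', 'er', 'ic', 'ni']
```

The two names share five of their 2-grams; the single inserted character only disturbs the grams that overlap it.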

SLIDE 28

q-gram Inverted List Index

q = 2

ID  String  Weight
1   ab      0.80
2   ccd     0.70
3   cd      0.60
4   abcd    0.50
5   bcc     0.40

Figure: Dataset

Gram  IDs
ab    1, 4
cc    2, 5
cd    2, 3, 4
bc    4, 5

Figure: Gram inverted-list index

Query string: “bcd”
Identified strings are verified by computing the real similarity.
Verification is usually an expensive process.
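The index figure can be rebuilt from the dataset in a few lines. The sketch below also counts, for the query “bcd”, how many query grams each string shares; that count is what a filter would use to pick candidates before the expensive verification step. The function names `build_index` and `count_shared_grams` are assumptions for illustration.

```python
from collections import Counter, defaultdict


def qgrams(s, q=2):
    """Overlapping substrings of length q."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(dataset, q=2):
    """Map each gram to the ascending list of string IDs containing it."""
    index = defaultdict(list)
    for string_id in sorted(dataset):
        for gram in set(qgrams(dataset[string_id], q)):
            index[gram].append(string_id)
    return dict(index)


def count_shared_grams(index, query, q=2):
    """For each string ID, the number of distinct query grams it contains."""
    counts = Counter()
    for gram in set(qgrams(query, q)):
        counts.update(index.get(gram, []))
    return counts


dataset = {1: "ab", 2: "ccd", 3: "cd", 4: "abcd", 5: "bcc"}
index = build_index(dataset)
counts = count_shared_grams(index, "bcd")
print(index)
print(counts.most_common())
```

String 4 (“abcd”) shares both query grams, while strings 2, 3, and 5 share one each and string 1 shares none.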

SLIDE 32

Top-k Single-pass Search Algorithm

Setup

  • Assign IDs s.t. ascending order of IDs ≡ descending order of weights
  • Sort the IDs on each list in ascending order
  • Scan the lists corresponding to the grams in the query, e.g., “bcd”

ID  String  Weight
1   ab      0.80
2   ccd     0.70
3   cd      0.60
4   abcd    0.50
5   bcc     0.40

Figure: Dataset

Gram  IDs
ab    1, 4
cc    2, 5
cd    2, 3, 4
bc    4, 5

Figure: Gram inverted-list index
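The first setup step can be sketched as a sort followed by enumeration. The helper name `assign_ids` and the shuffled input order are assumptions; the point is only that after this step a smaller ID always means a weight at least as large, so scanning a list in ascending ID order visits strings in descending weight order.

```python
def assign_ids(weighted_strings):
    """Assign IDs 1..n so that ascending ID means descending weight."""
    ranked = sorted(weighted_strings, key=lambda pair: pair[1], reverse=True)
    return {new_id: pair for new_id, pair in enumerate(ranked, start=1)}


# Same strings as the dataset figure, deliberately out of order.
records = [("cd", 0.60), ("bcc", 0.40), ("ab", 0.80),
           ("abcd", 0.50), ("ccd", 0.70)]
by_id = assign_ids(records)
for string_id, (string, weight) in by_id.items():
    print(string_id, string, weight)
```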

SLIDE 33

Top-k Single-pass Search Algorithm

Naïve approach: Round-Robin

  • Scan all the lists at the same time
  • Maintain a list of “open” IDs (might still appear on some of the lists)
  • Store the best k “closed” IDs in a top-k buffer
  • Stop when the top-k buffer cannot improve

SLIDE 48

Top-k Single-pass Search Algorithm

Heap-based

Traverse the lists in a sorted order using a heap on the top IDs of the lists.

Advantages:

1. No need to maintain the list of “open” IDs
2. Skip elements

ab   bc   cd
1    2    2
20   3    4
21   4    5
...  ...  ...
     19   19
     20   20
     ...  ...

Figure: Gram inverted-lists for query “abcd”

Step  Cursors (ab, bc, cd)  Top-k buffer  Min-heap
1     1, -, -               -             1
2     1, 2, -               -             1 2
3     1, 2, 2               -             1 2 2
4     1, 2, 2               1             2 2
5     20, 2, 2              1             2 2 20
6     20, 2, 2              2             20
7     20, 3, 4              2             3 4 20
8     20, 3, 4              2             4 20
9     20, 3, 4              2             20
10    20, 20, 20            2             20

Figure: Cursor positions (→ in the original animation), top-k buffer (k = 1), and min-heap contents after each animation step
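The heap-based traversal can be sketched as a k-way merge over the inverted lists. This is a simplified illustration under stated assumptions, not the authors' implementation: `merge_counts` only reports each ID with the number of lists it appears on, in ascending ID order, and leaves out scoring, the top-k buffer, and early termination. The helper `skip_to` shows one way a cursor could jump ahead on a sorted list (binary search); both names are hypothetical.

```python
import heapq
from bisect import bisect_left


def merge_counts(lists):
    """Yield (id, number of lists containing id) in ascending ID order,
    using a min-heap over the current head of each sorted list."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    while heap:
        current = heap[0][0]
        count = 0
        # Pop every list head equal to the smallest ID and advance that list.
        while heap and heap[0][0] == current:
            _, i, pos = heapq.heappop(heap)
            count += 1
            if pos + 1 < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        yield current, count


def skip_to(lst, pos, target):
    """Index of the first element >= target at or after pos."""
    return bisect_left(lst, target, pos)


# Lists for the grams of "bcd" in the five-string toy dataset: bc and cd.
print(list(merge_counts([[4, 5], [2, 3, 4]])))

# Jumping a cursor on a list shaped like the bc list in the figure.
bc = list(range(2, 21))
print(skip_to(bc, 1, 20))
```

Because every ID leaves the heap together with all of its occurrences, no separate bookkeeping of “open” IDs is needed, which is the first advantage listed on the slide.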

SLIDE 49

Algorithms Overview

  • Iterative Range Search
  • Single-Pass Search
  • Two-Phase Search

[Diagram: two component boxes labeled “Range Search”]

SLIDE 50

Experimental Setting

Datasets:

  • IMDB Actor Names¹
      • actor names and the number of movies they played in
      • 1.2 million actors, average name length 15
      • weight is the number of movies (log normalized)
  • WEB Corpus Word Grams²
      • sequences of English words and their frequency on the Web
      • 2.4 million sequences, average sequence length 20
      • weight is the frequency (log normalized)

Setup:

  • Jaccard similarity and normalized edit similarity, q = 3
  • Index and data are stored in main memory at all times
  • Implemented in C++ (GNU compiler) on Ubuntu Linux OS
  • Intel 2.40GHz PC, 2GB RAM

¹ http://www.imdb.com/interfaces
² http://www.ldc.upenn.edu/Catalog LDC2006T13
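The slide names the two similarity functions without defining them. As a hedged reference, assuming the usual definitions (the exact normalization used in the experiments is not given in the slides), with G_q(s) the set of q-grams of s and ed the edit distance:

```latex
J(s,t) = \frac{|G_q(s) \cap G_q(t)|}{|G_q(s) \cup G_q(t)|},
\qquad
\mathrm{nes}(s,t) = 1 - \frac{\mathrm{ed}(s,t)}{\max(|s|,|t|)}

% Worked example with the earlier q = 2 grams of "Vernica" and "Veronica":
% 5 shared grams, 6 + 7 - 5 = 8 grams in the union, edit distance 1.
J = \frac{5}{8} = 0.625,
\qquad
\mathrm{nes} = 1 - \frac{1}{8} = 0.875
```

The worked numbers use q = 2 only because that is the gram length of the earlier example; the experiments use q = 3.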

SLIDE 51

Benefits of Skipping Elements

[Plot: Time (ms) versus Dataset Size (millions); series: SPS, SPS*]

Average running time for top-10 queries. IMDB dataset with Jaccard similarity. Single-Pass Search (SPS) algorithm and SPS without skipping (SPS*).

SLIDE 52

Potential of the Two-Phase Algorithm

[Plot: Time (ms) for queries Q1, Q2, Q3]

Running time for 3 top-10 queries. WEB Corpus dataset with normalized edit similarity. Two-Phase algorithm with different initial thresholds.

SLIDE 53

Optimum Initial Threshold for the Two-Phase Algorithm

[Plot: Time (ms) versus Dataset Size (millions); series: SPS, 2PH, 2PH Opt]

Average running time for top-10 queries. WEB Corpus dataset with normalized edit similarity. Single-Pass Search (SPS) algorithm, Two-Phase (2PH) algorithm, 2PH with the optimum initial threshold (2PH Opt).

The Iterative Range Search algorithm was too expensive to be plotted. The average running time was around 5 s.

SLIDE 54

Summary

  • Approximate ranking queries in string collections
  • Useful when there is a mismatch between query and data
  • Propose three approaches to solve the problem:

1. Use existing approximate range search algorithms as a “black box”. Proves to be the most expensive.
2. Use particularities of the top-k problem. Proves to be very efficient.
3. Combine (1) and (2) sequentially. Proves to be slightly more efficient.
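Approach 1 can be sketched as a loop that relaxes a similarity threshold until the black-box range search returns at least k strings. This is an illustration under loud assumptions, not the algorithm from the talk: it ranks by similarity only (weights are ignored), uses Jaccard over 2-grams, stands in a linear scan for the range search, and the `start` and `step` values are arbitrary.

```python
def qgrams(s, q=2):
    """Set of overlapping substrings of length q."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard(a, b, q=2):
    """Jaccard similarity of the q-gram sets of a and b."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 1.0


def range_search(query, strings, threshold):
    """Stand-in for a black-box range search: all strings whose
    similarity to the query is at least the threshold."""
    return [s for s in strings if jaccard(query, s) >= threshold]


def iterative_range_top_k(query, strings, k, start=0.9, step=0.1):
    """Lower the threshold until the range search yields k results.
    Any string missed at a threshold is less similar than every hit,
    so the k most similar hits are the overall top k by similarity."""
    threshold = start
    while True:
        hits = range_search(query, strings, threshold)
        if len(hits) >= k or threshold <= 0.0:
            ranked = sorted(hits, key=lambda s: jaccard(query, s),
                            reverse=True)
            return ranked[:k]
        threshold = max(0.0, threshold - step)


strings = ["ab", "ccd", "cd", "abcd", "bcc"]
result = iterative_range_top_k("bcd", strings, k=2)
print(result)
```

Each failed round repeats a full range search, which gives some intuition for why this approach is reported as the most expensive of the three.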

SLIDE 55

The Flamingo Project

This work is part of The Flamingo Project at UC Irvine
http://flamingo.ics.uci.edu

SLIDE 56

A Quick Note on Related Work

Fagin et al. [1]

  • similarity on multiple numerical attributes
  • traverse list of IDs
  • one list per attribute
  • lists sorted on similarity to that attribute
  • lists have different orders of IDs
  • all IDs appear on all the lists

Our Setting

  • similarity on one string attribute
  • traverse list of IDs
  • one list per q-gram
  • lists sorted on global weight
  • lists have the same order of IDs
  • a subset of IDs appear on each list

[1] R. Fagin, A. Lotem, and M. Naor. Optimal Aggregation Algorithms for Middleware. In PODS, 2001.