Approximate Pattern Matching Using Suffix Tries, Hendrik Nigul

SLIDE 1

Approximate Pattern Matching Using Suffix Tries

Hendrik Nigul

nigulh@math.ut.ee

University of Tartu

Veskisilla, Oct 3 2004 – p. 1

SLIDE 2

Overview

Introduction, problem description
Suffix tries
  What is a suffix trie
  How to create suffix tries
  How to use suffix tries
Algorithms with suffix tries
  Exact string matching
  Approximate string matching
  Exact all-against-all matching
  Approximate all-against-all matching
Results
Conclusions

SLIDE 4

Introduction

Problem statement: Given a text $T = t_1 t_2 \dots t_n$ and a pattern $P = p_1 p_2 \dots p_m$, find all occurrences of $P$ in $T$. By an occurrence we mean a position $i$ such that

$t_{i+1} = p_1,\ t_{i+2} = p_2,\ \dots,\ t_{i+m} = p_m$

Sometimes we have several patterns:
Find occurrences of BANANA in text T
Find occurrences of ANANAS in text T

. . .
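
The definition of an occurrence can be checked directly, without any index. The following is a minimal sketch; the function name `occurrences` is an assumption made for illustration, and the returned offsets are 0-based, which coincides with the position i in the definition above.

```python
def occurrences(text, pattern):
    """Return every offset i with text[i:i+m] == pattern.

    With 0-based Python slicing this is the same i as in the slide's
    condition t_{i+1} = p_1, ..., t_{i+m} = p_m.
    """
    m = len(pattern)
    return [i for i in range(len(text) - m + 1) if text[i:i + m] == pattern]


print(occurrences("BANANAS", "ANA"))
print(occurrences("BANANAS", "ANANAS"))
```

Each query scans the whole text, which is the cost that preprocessing the text is meant to avoid when many patterns are searched.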

SLIDE 5

Introduction

Sometimes we accept approximate matches: Find occurrences of BANANA, but also accept MANANA, BANAANA, BAANA, etc. If we make several queries, we should preprocess our text. We use suffix tries.

SLIDE 6

Suffix trie

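Example: Create suffix trie for text BANANA

[Figure: the suffix trie and the suffix tree for BANANA$, side by side]

Suffix trie, nodes in prefix order: : A N A N A $ $ $ B A N A N A $ N A N A $ $ $

Suffix tree, edge labels in prefix order: : A NA NA$ $ $ BANANA$ NA NA$ $ $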

SLIDE 7

Indexing

All suffixes are added to the trie one by one.

[Figure: three snapshots of the trie, nodes listed in prefix order]

Inserting BANANA$: : B A N A N A $
Inserting ANANA$: : A N A N A $ B A N A N A $
Inserting NANA$ and ANA$: : A N A N A $ $ B A N A N A $ N A N A $
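
The insertion of suffixes one by one can be sketched as follows. This is a minimal illustration that assumes a nested-dict representation of the trie; the names `build_suffix_trie` and `count_nodes` are not from the slides.

```python
def build_suffix_trie(text, terminator="$"):
    """Insert every suffix of text + terminator into a nested-dict trie."""
    s = text + terminator
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            # Reuse the child if it exists, otherwise create it.
            node = node.setdefault(ch, {})
    return root


def count_nodes(node):
    """Number of nodes in the subtree, including the node itself."""
    return 1 + sum(count_nodes(child) for child in node.values())


trie = build_suffix_trie("BANANA")
print(sorted(trie))        # first characters of the suffixes
print(count_nodes(trie))   # root plus all character nodes
```

Building the trie this way takes time proportional to the total length of all suffixes, which is quadratic in the text length.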

SLIDE 8

Outputting index to a file

We want to use the index many times, so we write it to a file. Later we must be able to read the trie back from that file.
We output the trie in prefix order, i.e. we output a node first and then its children.
We need to calculate the size of each node, that is, the number of bytes in the description of the subtree rooted at that node.

SLIDE 9

Outputting index to a file

Suffix trie for BANANA contains suffixes
BANANA$
ANANA$
NANA$
ANA$
NA$
A$
$

Trie nodes in prefix order: : A N A N A $ $ $ B A N A N A $ N A N A $ $ $

SLIDE 10

Outputting index to a file

The suffix trie for BANANA, nodes in prefix order with the size of each node:

Node:   :   A   N   A   N   A   $   $   $   B   A   N   A   N   A   $   N   A   N   A   $   $   $
Size:  55  19  14  11   6   4   2   2   2  17  14  11   8   6   4   2  14  11   6   4   2   2   2

The index written to a file:
:55A19N14A11N6A4$2$2$2B17A14N11A8N6A4$2N14A11N6A4$2$2$2
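
The file string can be reproduced with a short sketch. It assumes, as read off the example, that each node is written as its character followed by its size in decimal, that the size counts the node's own character and digits, and that children are written in alphabetical order with $ last. The function names are hypothetical.

```python
def build_suffix_trie(text, terminator="$"):
    """Nested-dict suffix trie of text + terminator."""
    s = text + terminator
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
    return root


def child_order(ch):
    # Assumption taken from the slide's example: letters first, "$" last.
    return (ch == "$", ch)


def serialize(label, node):
    """Write a node in prefix order: label, subtree size in bytes, children."""
    body = "".join(serialize(ch, node[ch]) for ch in sorted(node, key=child_order))
    base = 1 + len(body)              # label byte plus all children
    digits = 1
    # The size also counts its own digits, so look for a consistent width.
    while len(str(base + digits)) != digits:
        digits += 1
    return f"{label}{base + digits}{body}"


print(serialize(":", build_suffix_trie("BANANA")))
```

Because the size of a node is the length of its whole description, adding the size to a node's offset in the file gives the offset of whatever follows its subtree.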

SLIDE 11

Introducing pointers

The size of a trie for a string of length $n$ is $O(n^2)$. Indexing a 1 MB text file would be impractical.
We will use the same idea as in suffix trees: group nodes with a single child. Here we only group nodes with a single leaf child.

[Figure: the trie before and the trie with pointers, side by side]

Trie before, nodes in prefix order: : A N A N A $ $ $ B A N A N A $ N A N A $ $ $

Trie with pointers, nodes in prefix order: : A N A @ @ @ @ N A @ @ @

SLIDE 12

Outputting index with pointers

Input string:
BANANA$
0123456

Suffix trie with pointers, nodes in prefix order with the number stored for each node:

Node:    :   A   N   A   @   @   @   @   N   A   @   @   @
Number: 28  13   8   6   4   6   6   0   8   6   4   6   6

Suffix trie in file:
:28A13N8A6@4@6@6@0N8A6@4@6@6

In order to read the suffix trie from the file, we need the original input.
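
One reading of this file string is that every branch leading to a single leaf is replaced by @ followed by the position in the input where the rest of that branch can be read, while all other nodes keep the character-plus-size format. The sketch below follows that reading; the node layout (`kids`, `pos`, `leaves`) and the function names are assumptions made for illustration.

```python
def build(text):
    """Suffix trie of text (already ending in "$").

    Each node records its children, the input position of the character
    that labels it, and how many suffixes (leaves) pass through it.
    """
    root = {"kids": {}, "pos": None, "leaves": 0}
    for start in range(len(text)):
        node = root
        node["leaves"] += 1
        for pos in range(start, len(text)):
            node = node["kids"].setdefault(
                text[pos], {"kids": {}, "pos": pos, "leaves": 0})
            node["leaves"] += 1
    return root


def child_order(ch):
    # Assumption taken from the slide's example: letters first, "$" last.
    return (ch == "$", ch)


def serialize(label, node, is_root=False):
    # A subtree with a single leaf becomes a pointer into the input.
    if not is_root and node["leaves"] == 1:
        return "@" + str(node["pos"])
    body = "".join(serialize(ch, node["kids"][ch])
                   for ch in sorted(node["kids"], key=child_order))
    base = 1 + len(body)
    digits = 1
    while len(str(base + digits)) != digits:
        digits += 1
    return f"{label}{base + digits}{body}"


print(serialize(":", build("BANANA$"), is_root=True))
```

A pointer stores no characters of its own, which is why the original input is needed to read such a trie back.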

SLIDE 13

Indexing

Sometimes we have data consisting of several items. We can build one suffix trie for many strings. Later we can use the index to search for patterns in all the strings simultaneously.

SLIDE 14

Size of index

Index size / text size ratio

Without pointers:

No. of rows    Length of row
                   10     100    1000
1                12.9     163    2223
10               9.73     157    2214
100              6.94     152
1000             4.93     146
10000            3.54

With pointers:

No. of rows    Length of row
                   10     100    1000   10000
1                3.55    5.14    6.01    7.11
10               4.28    5.95    7.10    8.11
100              5.21    6.97    8.09    9.13
1000             5.92    7.93    9.11
10000            6.65    8.92

If a random string over a 4-letter alphabet has length $n$, then the number of nodes is about $1.72n$. The description of each node is at most $1 + \log_{10} n$ bytes.

SLIDE 15

Using the index

Suppose we have a suffix trie S for text T written to a file. Two operations can be performed for any node:
Get the next sibling of that node
Get the first child of that node

:55A19N14A11N6A4$2$2$2B17A14N11A8N6A4$2N14A11N6A4$2$2$2

How can we walk through the trie?
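
Both operations can be carried out on the file string itself, because the size stored in a node tells how many bytes to skip. The sketch below assumes the character-plus-decimal-size format of the example and an alphabet without digits; all function names are hypothetical.

```python
def read_node(data, pos):
    """Return (label, size, header_len) of the node description at pos."""
    end = pos + 1
    while end < len(data) and data[end].isdigit():
        end += 1
    return data[pos], int(data[pos + 1:end]), end - pos


def first_child(data, pos):
    """Offset of the first child, or None if the node is a leaf."""
    _, size, header = read_node(data, pos)
    return pos + header if size > header else None


def next_sibling(data, pos, parent_pos):
    """Offset of the next sibling, or None if pos is the last child."""
    _, size, _ = read_node(data, pos)
    _, parent_size, _ = read_node(data, parent_pos)
    nxt = pos + size
    return nxt if nxt < parent_pos + parent_size else None


def preorder(data, pos=0):
    """Yield node labels in prefix order using only the two operations."""
    yield data[pos]
    child = first_child(data, pos)
    while child is not None:
        yield from preorder(data, child)
        child = next_sibling(data, child, pos)


data = ":55A19N14A11N6A4$2$2$2B17A14N11A8N6A4$2N14A11N6A4$2$2$2"
print(" ".join(preorder(data)))
```

The walk never builds the trie in memory; it only jumps between offsets in the string.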

SLIDE 16

Walking through the trie

[Animation, slides 16-28: the trie diagram and the file string are shown in every frame while the current node moves through the trie. The walk starts at the root, moves down and back up within the A subtree, continues through the B and N subtrees, and ends back at the root.]

Trie nodes in prefix order: : A N A N A $ $ $ B A N A N A $ N A N A $ $ $

:55A19N14A11N6A4$2$2$2B17A14N11A8N6A4$2N14A11N6A4$2$2$2

SLIDE 29

Algorithms with tries

We now show how to use tries. Suffix tries can be used in the same way.

SLIDE 30

Exact string matching

Example. We have an index containing strings:
ALGO
ANGLO
ANGOLA
ANGO
GO
MANGO
We want to search for occurrences of string ANGO.

SLIDE 31

Exact string matching

Trie containing strings: ALGO, ANGLO, ANGOLA, ANGO, GO, MANGO

Trie nodes in prefix order: : A L G O $ N G L O $ O L A $ $ G O $ M A N G O $

SLIDE 32

Exact string matching

Searching for string ANGO

[Animation, slides 32-39: a search table with columns "char" and "OK" lists the pattern characters A, N, G, O and receives a + for every matched character, while the current node moves through the trie.]

Trie nodes in prefix order: : A L G O $ N G L O $ O L A $ $ G O $ M A N G O $

Path to the current node in successive frames: (root), A, AL, AN, ANG, ANGL, ANGO
When all four characters are marked +, the search has reached node ANGO and the results are printed.

SLIDE 40

Exact string matching

Searching for string ANGO

[Figure: the trie with the examined nodes highlighted; highlighted node labels: A N G O L L L A]

This is how many nodes we needed to examine in the trie.
Searching for a string of length $m$ can be done in $O(ms + M)$ time, where $s$ is the size of the alphabet and $M$ is the number of matches.
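
The search can be sketched on an in-memory trie. This is a minimal illustration that assumes nested dicts with "$" as the end marker; `build_trie`, `collect` and `search` are hypothetical names.

```python
def build_trie(words, terminator="$"):
    """Nested-dict trie; each word is closed with the terminator."""
    root = {}
    for word in words:
        node = root
        for ch in word + terminator:
            node = node.setdefault(ch, {})
    return root


def collect(node, prefix, terminator="$"):
    """Return every stored string in the subtree below node."""
    found = []
    for ch, child in sorted(node.items()):
        if ch == terminator:
            found.append(prefix)
        else:
            found.extend(collect(child, prefix + ch, terminator))
    return found


def search(root, pattern):
    """Follow the pattern from the root, then report everything below."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    return collect(node, pattern)


trie = build_trie(["ALGO", "ANGLO", "ANGOLA", "ANGO", "GO", "MANGO"])
print(search(trie, "ANGO"))
```

A dict finds the right child in one step, whereas scanning the siblings stored in a file may examine up to $s$ children per pattern character, which is where the factor $s$ in the bound comes from.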

SLIDE 41

Approximate string matching

We would like to find all substrings of text T whose edit distance from pattern P is at most D.
Computation of the edit distance of strings $x$ and $y$:

$M_{0,0} \leftarrow 0$
$M_{i,j} \leftarrow \min(M_{i-1,j-1} + \delta(x_i, y_j),\ M_{i-1,j} + 1,\ M_{i,j-1} + 1)$
Return $M_{|x|,|y|}$

Here $\delta(x_i, y_j) = 0$ if $x_i = y_j$, and $1$ if $x_i \neq y_j$.

The edit distance of ANGEL and MANGO is 3:

      M  A  N  G  O
   0  1  2  3  4  5
A  1  1  1  2  3  4
N  2  2  2  1  2  3
G  3  3  3  2  1  2
E  4  4  4  3  2  2
L  5  5  5  4  3  3
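
The table can be recomputed with a direct implementation of the recurrence. This is a minimal sketch; the first row and column are initialised to 0, 1, 2, ..., which is what the recurrence yields when terms with a negative index are left out, and the function name is an assumption.

```python
def edit_distance_table(x, y):
    """Full table M with M[i][j] = edit distance of x[:i] and y[:j]."""
    rows, cols = len(x) + 1, len(y) + 1
    M = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        M[i][0] = i                  # delete i characters of x
    for j in range(cols):
        M[0][j] = j                  # insert j characters of y
    for i in range(1, rows):
        for j in range(1, cols):
            delta = 0 if x[i - 1] == y[j - 1] else 1
            M[i][j] = min(M[i - 1][j - 1] + delta,
                          M[i - 1][j] + 1,
                          M[i][j - 1] + 1)
    return M


table = edit_distance_table("ANGEL", "MANGO")
for row in table:
    print(row)
print("distance:", table[-1][-1])
```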

SLIDE 42

Calculation of edit distance

We do not need to calculate the whole table. If D = 2, we only need some values close to the main diagonal.

      M  A  N  G  O
   0  1  2
A  1  1  1  2
N  2  2  2  1  2
G           2  1  2
E              2  2
L
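
Restricting the computation to the diagonal band can be sketched as follows. The sketch assumes the usual banded variant: only cells with $|i - j| \le D$ are computed, and everything outside the band counts as larger than D. The function name is hypothetical.

```python
def bounded_edit_distance(x, y, D):
    """Edit distance of x and y if it is at most D, otherwise None."""
    INF = D + 1                       # any value above D behaves the same
    if abs(len(x) - len(y)) > D:
        return None
    # Row 0 of the table, restricted to the band.
    prev = {j: j for j in range(min(len(y), D) + 1)}
    for i in range(1, len(x) + 1):
        cur = {}
        for j in range(max(0, i - D), min(len(y), i + D) + 1):
            if j == 0:
                cur[j] = i
                continue
            delta = 0 if x[i - 1] == y[j - 1] else 1
            cur[j] = min(prev.get(j - 1, INF) + delta,
                         prev.get(j, INF) + 1,
                         cur.get(j - 1, INF) + 1,
                         INF)
        prev = cur
    result = prev.get(len(y), INF)
    return result if result <= D else None


print(bounded_edit_distance("ANGEL", "MANGO", 2))
print(bounded_edit_distance("ANGEL", "MANGO", 3))
```

Each row holds at most $2D + 1$ cells, so the work per row no longer depends on the length of the other string.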

SLIDE 43

Approximate string matching

Example. We have a trie containing strings:
ALGO
ANGLO
ANGOLA
ANGO
GO
MANGO
We want to search for occurrences of string ANGEL with edit distance at most 1.

Trie nodes in prefix order: : A L G O $ N G L O $ O L A $ $ G O $ M A N G O $

SLIDE 44

Approximate string matching

Searching for string ANGEL with edit distance at most 1.

[Animation, slides 44-59: an edit distance table with the pattern characters A, N, G, E, L as rows receives one new column for every trie node on the current path; only entries within distance 1 are filled in.]

Trie nodes in prefix order: : A L G O $ N G L O $ O L A $ $ G O $ M A N G O $

Path to the current node in successive frames: (root), A, AL, ALG, ALGO, AN, ANG, ANGL, ANGLO, ANGO, ANGOL, ANGOLA, G, GO, M, and so on.
At node ANGOL the entry in the last row (L) is 1, so the occurrence is printed.

SLIDE 60

Approximate string matching

Searching for a string of length $m$ with edit distance $D$ can be done in $O((ms)^{D+1} + M)$ time, where $s$ is the size of the alphabet and $M$ is the number of matches.
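
The walk with one table column per trie node can be sketched as follows. This is a simplified illustration: it keeps the whole column rather than only the part near the diagonal, it reports the matching trie paths rather than text positions, and the names `build_trie` and `approximate_search` are assumptions.

```python
def build_trie(words, terminator="$"):
    """Nested-dict trie; each word is closed with the terminator."""
    root = {}
    for word in words:
        node = root
        for ch in word + terminator:
            node = node.setdefault(ch, {})
    return root


def approximate_search(root, pattern, D, terminator="$"):
    """Return {path: distance} for every trie path within distance D."""
    hits = {}

    def visit(node, path, column):
        # column[i] is the edit distance between pattern[:i] and path.
        if column[-1] <= D:
            hits[path] = column[-1]
        for ch, child in node.items():
            if ch == terminator:
                continue
            new = [column[0] + 1]
            for i in range(1, len(pattern) + 1):
                delta = 0 if pattern[i - 1] == ch else 1
                new.append(min(column[i - 1] + delta,
                               column[i] + 1,
                               new[i - 1] + 1))
            # If no entry is within D, no longer path can match either.
            if min(new) <= D:
                visit(child, path + ch, new)

    visit(root, "", list(range(len(pattern) + 1)))
    return hits


trie = build_trie(["ALGO", "ANGLO", "ANGOLA", "ANGO", "GO", "MANGO"])
print(approximate_search(trie, "ANGEL", 1))
```

For the example trie, the path ANGOL, where the slides print an occurrence, is among the reported paths with distance 1.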

SLIDE 61

Exact all-against-all matching

Suppose we would like to find all substrings of pattern ANANGO in a trie. That is, we are interested in finding all prefixes of the following strings:
ANANGO
NANGO
ANGO
NGO
GO
O
What should we do?

SLIDE 62

Exact all-against-all matching

We should index the pattern string first. We now have two tries. We would like to find the common nodes of the tries.

SLIDE 63

Exact all-against-all matching

[Animation, slides 63-75: the pattern trie and the text trie are walked simultaneously, and every node found in both is added to the list of common nodes.]

Pattern trie (suffix trie of ANANGO), nodes in prefix order: : A N A N G O $ G O $ G O $ N A N G O $ G O $ O $ $

Text trie, nodes in prefix order: : A L G O $ N G L O $ O L A $ $ G O $ M A N G O $

Common nodes: A AN ANG ANGO G GO
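
Finding the common nodes amounts to walking both tries at the same time and descending only along characters present in both. The sketch below assumes nested-dict tries without end markers, so that only real characters are compared; the function names are hypothetical.

```python
def build_trie(strings):
    """Nested-dict trie without terminators."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
    return root


def suffixes(text):
    return [text[i:] for i in range(len(text))]


def common_nodes(a, b, path=""):
    """Yield every non-root path present in both tries, in prefix order."""
    for ch in sorted(a.keys() & b.keys()):
        yield path + ch
        yield from common_nodes(a[ch], b[ch], path + ch)


pattern_trie = build_trie(suffixes("ANANGO"))
text_trie = build_trie(["ALGO", "ANGLO", "ANGOLA", "ANGO", "GO", "MANGO"])
print(list(common_nodes(pattern_trie, text_trie)))
```

Subtrees that exist in only one of the tries are never entered, so the work is bounded by the size of the smaller trie times the cost of intersecting the child sets.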

SLIDE 76

Approximate all-against-all matching

Find all approximate occurrences of any substring of AAGL. Maximum edit distance 1.

Suffix trie, nodes in prefix order: : A A G L $ G L $ G L $ L $ $

SLIDE 77

Approximate all-against-all matching

[Animation, slides 77-91: the rows of the error table are the nodes of the pattern trie (A, AA, AAG, AAGL, AG, AGL, G, GL, L), and one column is added for every node on the current path in the text trie; only entries within distance 1 are filled in.]

Pattern trie, nodes in prefix order: : A A G L G L G L L

Text trie, nodes in prefix order: : A L G O $ N G L O $ O L A $ $ G O $ M A N G O $

Path to the current text node in successive frames: (root), A, AL, ALG, ALGO, AN, ANG, ANGL, ANGLO, ANGO, G, GO, M, MA, MAN

SLIDE 93

Implementing approximate all-against-all

We cannot use the same DP table as in the approximate string matching algorithm, because the elements we want to calculate are not always close to the main diagonal.
Instead of the error column, we use a list for each node in the text trie. Each element of a list is a pair (pos, error), where pos is the position in the pattern trie and error is the corresponding error table value.

SLIDE 94

Implementing approximate all-against-all

The recurrence

$M_{i,j} \leftarrow \min(M_{i-1,j-1} + \delta(x_i, y_j),\ M_{i-1,j} + 1,\ M_{i,j-1} + 1)$

which worked in approximate string matching, will not work here.
Each element in the list for column $j - 1$ gives new elements to the list for column $j$. Duplicates are removed from the new list.
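
The list-based update can be sketched as follows. This is only one possible realisation under stated assumptions: a position in the pattern trie is represented by the substring that leads to it, each list is kept as a dict from position to the smallest error (which removes duplicates), and the names `close`, `step` and `all_against_all` are hypothetical.

```python
def build_trie(strings):
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
    return root


def substrings(pattern):
    """All substrings of pattern = all nodes of its suffix trie ('' is the root)."""
    n = len(pattern)
    return {pattern[i:j] for i in range(n + 1) for j in range(i, n + 1)}


def close(states, nodes, alphabet, D):
    """Add moves that step down the pattern trie only (cost 1 each)."""
    changed = True
    while changed:
        changed = False
        for pos, err in list(states.items()):
            for c in alphabet:
                nxt = pos + c
                if nxt in nodes and err + 1 <= D and err + 1 < states.get(nxt, D + 1):
                    states[nxt] = err + 1
                    changed = True
    return states


def step(states, ch, nodes, alphabet, D):
    """List for the text node reached by ch, from the list of its parent."""
    new = {}

    def offer(pos, err):
        if err <= D and err < new.get(pos, D + 1):
            new[pos] = err

    for pos, err in states.items():
        offer(pos, err + 1)                       # text character left unmatched
        for c in alphabet:
            if pos + c in nodes:
                offer(pos + c, err + (c != ch))   # match or substitution
    return close(new, nodes, alphabet, D)


def all_against_all(text_trie, pattern, D):
    """Return {text path: {pattern substring: error}} with error <= D."""
    nodes = substrings(pattern)
    alphabet = sorted(set(pattern))
    result = {}

    def visit(node, path, states):
        result[path] = states
        for ch, child in node.items():
            nxt = step(states, ch, nodes, alphabet, D)
            if nxt:                               # empty list: prune the subtree
                visit(child, path + ch, nxt)

    visit(text_trie, "", close({"": 0}, nodes, alphabet, D))
    return result


text_trie = build_trie(["ALGO", "ANGLO", "ANGOLA", "ANGO", "GO", "MANGO"])
lists = all_against_all(text_trie, "AAGL", 1)
print(lists["ALG"])
print(lists["ANGL"])
```

Keeping only the smallest error per position bounds each list by the number of pattern trie nodes, and an empty list lets the walk skip the whole text subtree.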

SLIDE 95

Experiment results

All tests are performed in alphabet Σ = {A, C, G, T}. All texts and patterns are random.
The computer was a 2.8 GHz Pentium 4.
The text consisted of 100000 lines, each line containing 100 symbols and an additional newline '\n', total size 9.63 MB.
The creation of the text took 2.6 seconds (size 9.63 MB). The creation of the index took 34.0 seconds (size 95.2 MB).
All searching times are in milliseconds and do not contain the time for outputting matches.

SLIDE 96

tagrep vs. agrep

Error  Program    Length of pattern
                      5      10      15      20      25
1      tagrep       649      82      82      81      72
       agrep        286     185     241     275     366
2      tagrep      6908     121      97     105      94
       agrep        247     301     312     319     525
3      tagrep     29904     542     218     191     193
       agrep        235     395     447     458    1374
4      tagrep              5314     715     633     697
       agrep        226     242     456     480    2438
5      tagrep             18602    2435    2121    2439
       agrep                261     496     581    3422

SLIDE 97

Experiment results

Finding all exact substrings of length 10 or more of a pattern of 10000 symbols from the 10 MB text took 0.18 seconds.
Finding all approximate substrings with error 1 (other parameters are the same) took 13.7 seconds.
Finding approximate substrings with error 1 and length 20 took 4.1 seconds.
Finding approximate substrings with error 2 and length 20 took 43.8 seconds.

SLIDE 98

Conclusion

Suffix tries are useful when we need to make several queries on the same text.
Tagrep beats agrep!

SLIDE 99

Questions?
