Algorithms theory 15 Text search (1) Prof. Dr. S. Albers Winter - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Winter term 07/08

  • Prof. Dr. S. Albers

Algorithms theory 15 – Text search (1)

slide-2
SLIDE 2


Text search

Various scenarios: Static texts

  • Literature databases
  • Library systems
  • Gene databases
  • World Wide Web

Dynamic texts

  • Text editors
  • Symbol manipulators
slide-3
SLIDE 3

Properties of suffix trees

Search index for a text σ in order to search for several patterns α. Properties:

  • 1. Substring searching in time O(|α|).
  • 2. Queries to σ itself, e.g.: longest substring of σ that occurs at least twice.
  • 3. Prefix search: all positions in σ with prefix α.
slide-4
SLIDE 4

Properties of suffix trees

  • 4. Range search: all locations (substrings) in σ belonging to an interval [α, β] with α ≤lex β, e.g. abrakadabra, acacia ∈ [abc, acc], abacus ∉ [abc, acc].
  • 5. Linear complexity: space requirement and construction time in O(|σ|).
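The interval membership in the range-search example can be checked directly, because Python compares strings lexicographically by code point. This is only a minimal sketch; the helper name `in_range` is hypothetical and not part of the lecture.

```python
def in_range(word, low, high):
    """True if low <= word <= high in lexicographic order."""
    return low <= word <= high


# The three words from the example, tested against the interval [abc, acc].
for word in ["abrakadabra", "acacia", "abacus"]:
    print(word, in_range(word, "abc", "acc"))
```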

slide-5
SLIDE 5

Tries

Trie: A tree representing a set of keys.

  • Alphabet Σ, set S of keys, S ⊂ Σ*
  • Key: string in Σ*
  • Edge of a trie T: labeled with a single character of Σ
  • Neighboring edges (edges that lead to different children of a node): labeled with different characters

slide-6
SLIDE 6

Tries

Example: [Figure: a trie whose edges are labeled with the single characters a, b and c]

slide-7
SLIDE 7


Tries

A leaf represents a key: The corresponding key is the string consisting of the edge labels along the path from the root to the leaf. Keys are not stored in nodes!
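A trie of this kind might be sketched as follows. The class name `Trie`, its methods and the sample keys are assumptions made for illustration. The `is_key` flag is an addition so that a key that is a prefix of another key can still be recognized; the slides only consider keys that end at leaves.

```python
class Trie:
    """Trie with single-character edge labels; keys are spelled by paths."""

    def __init__(self):
        self.children = {}    # edge label (one character) -> child Trie
        self.is_key = False   # True if the path from the root spells a key

    def insert(self, key):
        node = self
        for ch in key:
            # Neighboring edges carry different characters, so a dict suffices.
            node = node.children.setdefault(ch, Trie())
        node.is_key = True

    def contains(self, key):
        node = self
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_key


trie = Trie()
for key in ["abc", "abb", "ca"]:
    trie.insert(key)
print(trie.contains("abb"), trie.contains("ab"))  # True False
```

The keys themselves are never stored; they are recovered from the edge labels along the path.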

slide-8
SLIDE 8

Suffix tries

Trie representing all suffixes of a string.

Example: σ = ababc

Suffixes:

  • suf₁ = ababc
  • suf₂ = babc
  • suf₃ = abc
  • suf₄ = bc
  • suf₅ = c

[Figure: suffix trie for σ = ababc]
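As a rough illustration, the suffix trie for σ = ababc can be built by inserting every suffix character by character. The nested-dict representation and the function names are assumptions chosen for brevity, not the representation used in the lecture.

```python
def build_suffix_trie(text):
    """Return the suffix trie of text as nested dicts (edge label -> child)."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:          # insert the suffix starting at position i
            node = node.setdefault(ch, {})
    return root


def count_nodes(node):
    """Number of nodes in the subtree, including the node itself."""
    return 1 + sum(count_nodes(child) for child in node.values())


trie = build_suffix_trie("ababc")
print(sorted(trie))       # ['a', 'b', 'c']
print(count_nodes(trie))  # 13 nodes, i.e. 12 edges
```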

slide-9
SLIDE 9

Suffix tries

Internal nodes of a suffix trie ≙ substrings of σ. Each proper substring of σ is represented by an internal node.

Let σ = aⁿbⁿ. Then there are n² + 2n + 1 different substrings (or internal nodes) ⇒ space requirement in O(n²).
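The count n² + 2n + 1 = (n + 1)² can be checked by brute force for small n; it includes the empty string. The helper name `distinct_substrings` is hypothetical.

```python
def distinct_substrings(text):
    """Set of all distinct substrings of text, including the empty string."""
    n = len(text)
    return {text[i:j] for i in range(n + 1) for j in range(i, n + 1)}


for n in range(1, 6):
    sigma = "a" * n + "b" * n
    count = len(distinct_substrings(sigma))
    print(n, count, n * n + 2 * n + 1)
```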

slide-10
SLIDE 10

Suffix tries

A suffix trie T satisfies some of the desired properties:

  • 1. String matching for α: Following the path with edge labels α takes O(|α|) time. Leaves of the subtree ≙ occurrences of α.
  • 2. Longest substring occurring at least twice: internal node with maximum depth having at least two children.
  • 3. Prefix search: All occurrences of strings with prefix α are represented by the nodes of the subtree rooted at the internal node corresponding to α.

[Figure: suffix trie with single-character edge labels a, b, c]
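The first two properties might be sketched on a nested-dict suffix trie as follows. Two assumptions are made here that are not in the slides: the text is terminated by a `$` that occurs nowhere else, so that every suffix ends at a leaf, and each leaf stores the 1-based start position of its suffix under the key `None`. All function names are hypothetical.

```python
def build_suffix_trie(text):
    """Suffix trie as nested dicts; leaves store the suffix start under None."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
        node[None] = i + 1           # 1-based start position of this suffix
    return root


def occurrences(root, pattern):
    """Follow the path labeled pattern, then collect the leaves below it."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for label, child in current.items():
            if label is None:
                found.append(child)  # child is a stored start position
            else:
                stack.append(child)
    return sorted(found)


def longest_repeat(node, path=""):
    """Deepest node with at least two children, returned as its path label."""
    children = [(ch, sub) for ch, sub in node.items() if ch is not None]
    best = path if len(children) >= 2 else ""
    for ch, sub in children:
        candidate = longest_repeat(sub, path + ch)
        if len(candidate) > len(best):
            best = candidate
    return best


trie = build_suffix_trie("ababc$")
print(occurrences(trie, "ab"))  # [1, 3]
print(longest_repeat(trie))     # ab
```

Only the walk along the pattern is bounded by its length; collecting the leaves costs additional time proportional to the size of the subtree.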

slide-11
SLIDE 11

Suffix trees

A suffix tree is obtained from a suffix trie by contracting unary nodes:

[Figure: a suffix trie with single-character edge labels next to the corresponding suffix tree with edge labels ab, abc, abc, b, c, c, c]

suffix tree = contracted suffix trie
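Contracting unary nodes could look like the sketch below, again on a nested-dict trie. It assumes that no suffix is a prefix of another suffix, which holds for ababc because c occurs only at the end. The function names are hypothetical, and building the trie first is not how an efficient construction proceeds.

```python
def build_suffix_trie(text):
    """Suffix trie as nested dicts (single-character edge label -> child)."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root


def contract(node):
    """Merge every chain of unary nodes into one edge labeled with a string."""
    tree = {}
    for ch, child in node.items():
        label = ch
        while len(child) == 1:                   # unary node: absorb its edge
            nxt, grandchild = next(iter(child.items()))
            label += nxt
            child = grandchild
        tree[label] = contract(child)
    return tree


tree = contract(build_suffix_trie("ababc"))
print(tree)
```

For ababc the result has the edge labels ab, abc, abc, b, c, c, c.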

slide-12
SLIDE 12

Internal representation of suffix trees

Child-sibling representation; a substring is stored as a pair of numbers (i, j).

Example: σ = ababc

[Figure: suffix tree T for σ = ababc with edge labels ab, abc, abc, b, c, c, c]

slide-13
SLIDE 13

Internal representation of suffix trees

Example: σ = ababc

[Figure: the suffix tree for σ = ababc with each edge label replaced by its index pair: root (∗,∗); ab = (1,2), b = (2,2), c = (5,$), abc = (3,$)]

Node v = (v.l, v.u, v.c, v.s)

Further pointers (suffix links) are added later.
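One possible reading of the node layout is sketched below. The slides do not spell out the four components, so the interpretation is an assumption: `l` and `u` are taken to be the 1-based start and end index of the edge label in σ, `c` the first child and `s` the next sibling, with `None` standing in for `$` (the end of σ). The helper names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    l: int                      # assumed: start index of the edge label (1-based)
    u: Optional[int]            # assumed: end index; None plays the role of "$"
    c: Optional["Node"] = None  # assumed: first child
    s: Optional["Node"] = None  # assumed: next sibling


def leaf(l, sibling=None):
    """Leaf whose edge label runs from position l to the end of sigma."""
    return Node(l, None, None, sibling)


def label(sigma, v):
    """Edge label leading to v, recovered from the index pair."""
    end = len(sigma) if v.u is None else v.u
    return sigma[v.l - 1:end]


def children(v):
    """Iterate over the children of v via the first-child and sibling pointers."""
    w = v.c
    while w is not None:
        yield w
        w = w.s


sigma = "ababc"
c_edge = leaf(5)
b_edge = Node(2, 2, c=leaf(3, leaf(5)), s=c_edge)
ab_edge = Node(1, 2, c=leaf(3, leaf(5)), s=b_edge)
root = Node(0, 0, c=ab_edge)    # the root carries no edge label

print([label(sigma, w) for w in children(root)])     # ['ab', 'b', 'c']
print([label(sigma, w) for w in children(ab_edge)])  # ['abc', 'c']
```

Storing index pairs instead of strings keeps every node at constant size.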

slide-14
SLIDE 14

Properties of suffix trees

(S1) No suffix is a prefix of another suffix. This holds if the last character of σ is $ ∉ Σ.

Search:

(T1) edge ≙ non-empty substring of σ.
(T2) neighboring edges: the corresponding substrings start with different characters.

slide-15
SLIDE 15

Properties of suffix trees

Size:

(T3) each internal node (≠ root) has at least two children.
(T4) leaf ≙ (non-empty) suffix of σ.

Let n = |σ| ≠ 1.

(T4) ⇒ number of leaves = n
(T3) ⇒ number of internal nodes ≤ n − 1
⇒ space requirement in O(n)

slide-16
SLIDE 16

Construction of suffix trees

Definitions:

  • Partial path: path from the root to a node in T.
  • Path: a partial path ending at a leaf.
  • Location of a string α: node where the partial path corresponding to α ends (if it exists).

[Figure: suffix tree T with edge labels ab, abc, abc, b, c, c, c]

slide-17
SLIDE 17

Construction of suffix trees

  • Extension of a string α: string with prefix α.
  • Extended location of a string α: location of the shortest extension of α whose location is defined.
  • Contracted location of a string α: location of the longest prefix of α whose location is defined.

[Figure: suffix tree T with edge labels ab, abc, abc, b, c, c, c]

slide-18
SLIDE 18

Construction of suffix trees

Definitions:

  • sufᵢ: suffix of σ beginning at position i, e.g. suf₁ = σ, sufₙ = $.
  • headᵢ: longest prefix of sufᵢ which is also a prefix of sufⱼ for some j < i.

Example: σ = bbabaabc

  • α = baa (has no location)
  • suf₄ = baabc
  • head₄ = ba
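The two definitions translate almost literally into a brute-force computation. This is a sketch for checking small examples by hand; the function names `suf` and `head` mirror the notation but are otherwise hypothetical, and positions are 1-based as on the slide.

```python
def suf(sigma, i):
    """Suffix of sigma beginning at the 1-based position i."""
    return sigma[i - 1:]


def head(sigma, i):
    """Longest prefix of suf_i that is also a prefix of suf_j for some j < i."""
    current = suf(sigma, i)
    best = ""
    for j in range(1, i):
        earlier = suf(sigma, j)
        k = 0
        while k < len(current) and k < len(earlier) and current[k] == earlier[k]:
            k += 1
        if k > len(best):
            best = current[:k]
    return best


sigma = "bbabaabc"
print(suf(sigma, 4), head(sigma, 4))  # baabc ba
```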

slide-19
SLIDE 19

Construction of suffix trees

[Figure: suffix tree for σ = bbabaabc]

slide-20
SLIDE 20

Naive suffix tree construction

Start with the empty tree T₀. The tree Tᵢ₊₁ is constructed from Tᵢ by inserting the suffix sufᵢ₊₁.

Algorithm suffix-tree
Input: string σ
Output: suffix tree T for σ

    n := |σ|; T₀ := ∅;
    for i := 0 to n − 1 do
        insert sufᵢ₊₁ into Tᵢ, store the result in Tᵢ₊₁;
    end for

slide-21
SLIDE 21

Naive suffix tree construction

All suffixes sufⱼ with j ≤ i have a location in Tᵢ.

headᵢ₊₁ = longest prefix of sufᵢ₊₁ whose extended location exists in Tᵢ.

Definition: tailᵢ₊₁ := sufᵢ₊₁ − headᵢ₊₁, i.e. sufᵢ₊₁ = headᵢ₊₁ tailᵢ₊₁.

(S1) ⇒ tailᵢ₊₁ ≠ ε.

slide-22
SLIDE 22

Naive suffix tree construction

Example: σ = ababc

suf₃ = abc, head₃ = ab, tail₃ = c

[Figure: T₀ is the empty tree; T₁ has the single edge ababc; T₂ has the two edges ababc and babc]

slide-23
SLIDE 23

Naive suffix tree construction

Tᵢ₊₁ can be constructed from Tᵢ as follows:

  • 1. Determine the extended location of headᵢ₊₁ in Tᵢ and split the last edge leading to this location into two new edges by inserting a new node.
  • 2. Insert a new leaf as location for sufᵢ₊₁.

[Figure: nodes v and x, where x = extended location of headᵢ₊₁, with the path labeled headᵢ₊₁ and the new leaf edge labeled tailᵢ₊₁]

slide-24
SLIDE 24

Naive suffix tree construction

Example: σ = ababc, head₃ = ab, tail₃ = c

[Figure: T₂ with the edges ababc and babc; T₃ with the edges ab, abc, c and babc]

slide-25
SLIDE 25

Naive suffix tree construction

Algorithm suffix-insertion
Input: tree Tᵢ and suffix sufᵢ₊₁
Output: tree Tᵢ₊₁

    v := root of Tᵢ
    j := i
    repeat
        find child w of v with σ[w.l] = σ[j+1]
        k := w.l − 1;
        while k < w.u and σ[k+1] = σ[j+1] do
            k := k + 1; j := j + 1
        end while

slide-26
SLIDE 26

Naive suffix tree construction

        if k = w.u then v := w
    until k < w.u or w = nil
    /* v is the contracted location of headᵢ₊₁ */
    insert the location of headᵢ₊₁ and tailᵢ₊₁ below v into Tᵢ

Running time of suffix-insertion: O( )

Total time required for the naive construction: O( )
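The naive construction could be written in Python roughly as follows. This is a sketch under assumptions, not the lecture's implementation: children are kept in a dict keyed by the first character of the edge label instead of the child-sibling representation, the case in which no matching child exists is handled explicitly, and the input is required to end with a character that occurs nowhere else so that (S1) holds. The names `Node`, `naive_suffix_tree` and `edge_labels` are hypothetical.

```python
class Node:
    def __init__(self, l=0, u=0):
        self.l, self.u = l, u   # edge label is sigma[l-1:u] (1-based, inclusive)
        self.children = {}      # first character of an edge label -> child


def naive_suffix_tree(sigma):
    """Insert suf_1, ..., suf_n one after another, each starting at the root."""
    if not sigma or sigma[-1] in sigma[:-1]:
        raise ValueError("last character must be unique so that (S1) holds")
    n = len(sigma)
    root = Node()
    for i in range(n):                   # insert suf_{i+1} = sigma[i:]
        v, j = root, i                   # sigma[i:j] has been matched so far
        while True:
            w = v.children.get(sigma[j])
            if w is None:                # head_{i+1} ends exactly at node v
                break
            k = w.l - 1                  # 0-based position inside the edge label
            while k < w.u and sigma[k] == sigma[j]:
                k += 1
                j += 1
            if k < w.u:                  # mismatch inside the edge: split it
                mid = Node(w.l, k)       # new node, the location of head_{i+1}
                v.children[sigma[w.l - 1]] = mid
                w.l = k + 1
                mid.children[sigma[k]] = w
                v = mid
                break
            v = w                        # whole edge matched, continue below w
        v.children[sigma[j]] = Node(j + 1, n)   # new leaf carrying tail_{i+1}
    return root


def edge_labels(sigma, v):
    """Nested dict (edge label -> subtree), convenient for inspection."""
    return {sigma[w.l - 1:w.u]: edge_labels(sigma, w) for w in v.children.values()}


print(edge_labels("ababc", naive_suffix_tree("ababc")))
```

In this sketch every insertion starts at the root and compares at most n characters, so the number of character comparisons over all n insertions is at most quadratic in n.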

slide-27
SLIDE 27

The algorithm M

(McCreight, 1976)

Idea: The extended location of headᵢ₊₁ in Tᵢ is determined in constant amortized time. (Additional information required!)

When the extended location of headᵢ₊₁ in Tᵢ has been found: creating a new node and splitting an edge takes O(1) time.

Theorem 1: Algorithm M constructs a suffix tree for σ with |σ| leaves and at most |σ| − 1 internal nodes in time O(|σ|).

slide-28
SLIDE 28

Suffix links

Definition: Let xδ be an arbitrary string, where x is a single character and δ is some (possibly empty) substring. For an internal node v whose path from the root has the edge labels xδ, the following holds: if there exists a node s(v) whose path from the root has the edge labels δ, then there is a pointer from v to s(v), which is called a suffix link.

[Figure: node v reached via xδ, node s(v) reached via δ, and the suffix link from v to s(v)]
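To make the definition concrete, the suffix links of a finished suffix tree can be listed by brute force: an internal node is identified with its path label, and its link points to the node whose path label is the same string without its first character. This sketch assumes that the last character of the text is unique, and it relies on the fact that the internal nodes below the root are exactly the substrings followed by at least two different characters. It only illustrates the definition and is not how algorithm M maintains the links; all names are hypothetical.

```python
from collections import defaultdict


def internal_node_labels(sigma):
    """Path labels of the internal nodes (without the root) of the suffix tree."""
    followers = defaultdict(set)
    n = len(sigma)
    for i in range(n):
        for j in range(i + 1, n):
            # The substring sigma[i:j] is followed by the character sigma[j].
            followers[sigma[i:j]].add(sigma[j])
    return {w for w, nxt in followers.items() if len(nxt) >= 2}


def suffix_links(sigma):
    """Map the path label of each internal node v to the path label of s(v)."""
    nodes = internal_node_labels(sigma)
    nodes.add("")                        # the root has the empty path label
    return {w: w[1:] for w in nodes if w and w[1:] in nodes}


print(sorted(suffix_links("bbabaabc").items()))
```

For bbabaabc the internal nodes a, b, ab and ba are found; ab links to b, ba links to a, and both a and b link to the root.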

slide-29
SLIDE 29

Suffix links

The idea is the following: By following the suffix links, we do not have to start each search for a splitting point at the root node. Instead, we can use the suffix links in order to determine these nodes more efficiently, i.e. in constant amortized time.

[Figure: nodes v and s(v) connected by a suffix link]

slide-30
SLIDE 30

Suffix tree: example

suf₁ = bbabaabc; suf₂ = babaabc, head₂ = b

[Figure: T₀ is the empty tree; T₁ has the single edge bbabaabc]

slide-31
SLIDE 31

Suffix tree: example

suf₃ = abaabc, head₃ = ε; suf₄ = baabc, head₄ = ba

[Figure: T₂ with edge labels b, abaabc, babaabc; T₃ with edge labels abaabc, b, abaabc, babaabc]

slide-32
SLIDE 32

Suffix tree: example

suf₅ = aabc, head₅ = a

[Figure: T₄ with edge labels abaabc, b, babaabc, a, abc, baabc; the location of head₄ is marked]

slide-33
SLIDE 33

Suffix tree: example

suf₆ = abc, head₆ = ab

[Figure: T₅ with edge labels babaabc, a, abc, baabc, abc, a, b, baabc; the location of head₅ is marked]

slide-34
SLIDE 34

Suffix tree: example

suf₇ = bc, head₇ = b

[Figure: T₆ with edge labels babaabc, a, abc, baabc, abc, a, b, b, c, aabc; the location of head₆ is marked]

slide-35
SLIDE 35

Suffix tree: example

suf₈ = c

[Figure: T₇ with edge labels babaabc, a, abc, baabc, abc, a, b, b, c, aabc, c]

slide-36
SLIDE 36

Suffix tree: example

[Figure: T₈ with edge labels babaabc, a, abc, baabc, abc, a, b, b, c, aabc, c, c]

slide-37
SLIDE 37

Suffix tree: application

Usage of a suffix tree T:

  • 1. Search for a string α: Follow the path with edge labels α (takes O(|α|) time). Leaves of the subtree ≙ occurrences of α.
  • 2. Search for the longest substring occurring at least twice: Find the location of a substring with maximum weighted depth that is an internal node.
  • 3. Prefix search: All occurrences of strings with prefix α are represented by the nodes of the subtree rooted at the location of α in T.

slide-38
SLIDE 38

Suffix tree: application

  • 4. Range search for [α, β]: range boundaries