Algorithms theory 15 – Text search (1)

Algorithms theory 15 – Text search (1), Prof. Dr. S. Albers, Winter term 07/08



  1. Algorithms theory 15 – Text search (1) Prof. Dr. S. Albers Winter term 07/08

  2. Text search
     Various scenarios:
     Static texts
     • Literature databases
     • Library systems
     • Gene databases
     • World Wide Web
     Dynamic texts
     • Text editors
     • Symbol manipulators

  3. Properties of suffix trees
     A suffix tree is a search index for a text σ that supports searching for several patterns α.
     Properties:
     1. Substring search in time O(|α|).
     2. Queries about σ itself, e.g. the longest substring of σ that occurs at least twice.
     3. Prefix search: all positions in σ with prefix α.

  4. Properties of suffix trees
     4. Range search: all locations (substrings) in σ belonging to an interval [α, β] with α ≤_lex β, e.g. abrakadabra, acacia ∈ [abc, acc], abacus ∉ [abc, acc].
     5. Linear complexity: space requirement and construction time in O(|σ|).
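The interval membership in the range-search example can be checked with ordinary lexicographic string comparison. This is only a small illustration of the ordering, not a range search on a suffix tree; the helper name `in_lex_range` is an assumption, and Python's code-point ordering is assumed to coincide with the intended order for lowercase ASCII letters.

```python
def in_lex_range(s, alpha, beta):
    """Return True if alpha <= s <= beta in lexicographic order."""
    return alpha <= s <= beta


for word in ["abrakadabra", "acacia", "abacus"]:
    print(word, in_lex_range(word, "abc", "acc"))
```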

  5. Tries
     Trie: a tree representing a set of keys.
     Alphabet Σ, set S of keys, S ⊂ Σ*
     Key: string in Σ*
     Edge of a trie T: labeled with a single character of Σ
     Neighboring edges (edges that lead to different children of a node): labeled with different characters

  6. Tries
     Example: [Figure: a trie with edge labels a, b and c]

  7. Tries
     A leaf represents a key: the corresponding key is the string consisting of the edge labels along the path from the root to the leaf.
     Keys are not stored in nodes!
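A minimal Python sketch of such a trie follows. The names `TrieNode`, `trie_insert` and `trie_keys` are illustrative, and the sketch assumes that no key is a prefix of another key, so that every key ends at a leaf.

```python
class TrieNode:
    def __init__(self):
        # One outgoing edge per character; keys themselves are not stored.
        self.children = {}


def trie_insert(root, key):
    """Walk down from the root, creating missing edges character by character."""
    node = root
    for ch in key:
        node = node.children.setdefault(ch, TrieNode())
    return node


def trie_keys(node, prefix=""):
    """Recover the keys by reading edge labels along root-to-leaf paths."""
    if not node.children:
        return [prefix]
    keys = []
    for ch in sorted(node.children):
        keys.extend(trie_keys(node.children[ch], prefix + ch))
    return keys


root = TrieNode()
for key in ["ababc", "babc", "abc", "bc", "c"]:
    trie_insert(root, key)
print(trie_keys(root))
```

Because sibling edges carry different characters, a dictionary keyed by character is one natural way to hold the children of a node.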

  8. Suffix tries
     Trie representing all suffixes of a string.
     Example: σ = ababc
     Suffixes:
     ababc = suf_1
     babc = suf_2
     abc = suf_3
     bc = suf_4
     c = suf_5
     [Figure: suffix trie for σ = ababc]

  9. Suffix tries
     Internal nodes of a suffix trie ≙ substrings of σ
     Each proper substring of σ is represented by an internal node.
     Let σ = a^n b^n. Then there are n^2 + 2n + 1 different substrings (or internal nodes).
     ⇒ space requirement in O(n^2)
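The count can be checked by brute force for small n. The sketch below enumerates all distinct substrings of a^n b^n, including the empty string, which is what makes the total (n + 1)^2 = n^2 + 2n + 1; the function name `distinct_substrings` is illustrative.

```python
def distinct_substrings(text):
    """Set of all distinct substrings of text, including the empty string."""
    n = len(text)
    return {text[i:j] for i in range(n + 1) for j in range(i, n + 1)}


for n in range(1, 6):
    sigma = "a" * n + "b" * n
    count = len(distinct_substrings(sigma))
    print(n, count, n * n + 2 * n + 1)
```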

  10. Suffix tries
      A suffix trie T satisfies some of the desired properties:
      1. String matching for α: following the path with edge labels α takes O(|α|) time. Leaves of the subtree ≙ occurrences of α.
      2. Longest substring occurring at least twice: internal node with maximum depth having at least two children.
      3. Prefix search: all occurrences of strings with prefix α are represented by the nodes of the subtree rooted at the internal node corresponding to α.
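The first two properties can be sketched on a suffix trie built from nested dictionaries. This is a minimal illustration with quadratic space, assuming the last character of σ occurs only once so that every suffix ends at a leaf; all function names are illustrative.

```python
def build_suffix_trie(sigma):
    """Insert every suffix of sigma into a trie of nested dicts."""
    root = {}
    for i in range(len(sigma)):
        node = root
        for ch in sigma[i:]:
            node = node.setdefault(ch, {})
    return root


def count_leaves(node):
    """Number of leaves in the subtree below node."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


def occurrences(trie, alpha):
    """Follow the path labeled alpha, then count the leaves below it."""
    node = trie
    for ch in alpha:
        if ch not in node:
            return 0
        node = node[ch]
    return count_leaves(node)


def longest_repeated(node, prefix=""):
    """Deepest node with at least two children, returned as its path label."""
    best = prefix if len(node) >= 2 else ""
    for ch, child in node.items():
        candidate = longest_repeated(child, prefix + ch)
        if len(candidate) > len(best):
            best = candidate
    return best


trie = build_suffix_trie("ababc")
print(occurrences(trie, "ab"), longest_repeated(trie))
```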

  11. Suffix trees
      A suffix tree is obtained from a suffix trie by contracting unary nodes:
      suffix tree = contracted suffix trie
      [Figure: suffix trie for ababc and the corresponding suffix tree with edge labels ab, b, c, abc, c, abc, c]

  12. Internal representation of suffix trees
      Child-sibling representation
      Substring: pair of numbers (i, j)
      Example: σ = ababc
      [Figure: suffix tree T for σ = ababc]

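  13. Internal representation of suffix trees
      Example: σ = ababc. Each edge label is stored as a pair of indices into σ, where $ stands for the end of σ.
      root (∗,∗)
        ab = (1,2)
          abc = (3,$)
          c = (5,$)
        b = (2,2)
          abc = (3,$)
          c = (5,$)
        c = (5,$)
      node v = (v.l, v.u, v.c, v.s), where v.l and v.u are the lower and upper index of the edge label and, in the child-sibling representation, v.c and v.s presumably point to the first child and the next sibling.
      Further pointers (suffix links) are added later.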

  14. Properties of suffix trees
      (S1) No suffix is a prefix of another suffix. This holds if the last character of σ is $ ∉ Σ.
      Search:
      (T1) edge ≙ non-empty substring of σ.
      (T2) neighboring edges: corresponding substrings start with different characters.

  15. Properties of suffix trees
      Size:
      (T3) each internal node (≠ root) has at least two children.
      (T4) leaf ≙ (non-empty) suffix of σ.
      Let n = |σ| ≠ 1.
      (T4) ⇒ number of leaves = n
      (T3) ⇒ number of internal nodes ≤ n − 1
      ⇒ space requirement in O(n)

  16. Construction of suffix trees
      Definitions:
      Partial path: path from the root to a node in T.
      Path: a partial path ending at a leaf.
      Location of a string α: node where the partial path corresponding to α ends (if it exists).
      [Figure: suffix tree T for σ = ababc]

  17. Construction of suffix trees
      Extension of a string α: string with prefix α.
      Extended location of a string α: location of the shortest extension of α whose location is defined.
      Contracted location of a string α: location of the longest prefix of α whose location is defined.
      [Figure: suffix tree T for σ = ababc]

  18. Construction of suffix trees
      Definitions:
      suf_i: suffix of σ beginning at position i, e.g. suf_1 = σ, suf_n = $.
      head_i: longest prefix of suf_i which is also a prefix of suf_j for some j < i.
      Example: σ = bbabaabc
      α = baa (has no location)
      suf_4 = baabc
      head_4 = ba
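The definitions of suf_i and head_i can be evaluated directly by comparing each suffix with all earlier ones. This brute-force sketch is only meant to make the definitions concrete; the function names `suf`, `common_prefix_len` and `head` are illustrative, and positions are 1-based as on the slide.

```python
def suf(sigma, i):
    """Suffix of sigma beginning at 1-based position i."""
    return sigma[i - 1:]


def common_prefix_len(a, b):
    """Length of the longest common prefix of a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def head(sigma, i):
    """Longest prefix of suf_i that is also a prefix of some suf_j with j < i."""
    s = suf(sigma, i)
    best = max((common_prefix_len(s, suf(sigma, j)) for j in range(1, i)), default=0)
    return s[:best]


sigma = "bbabaabc"
for i in range(1, len(sigma) + 1):
    print(i, suf(sigma, i), repr(head(sigma, i)))
```

The empty string printed for some positions corresponds to head_i = ε.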

  19. Construction of suffix trees
      Example: σ = bbabaabc
      [Figure: suffix tree for σ with edge labels a, b, c at the root; b, abc below a; aabc, c below ab; babaabc, a, c below b; baabc, abc below ba]

  20. Naive suffix tree construction
      Start with the empty tree T_0. The tree T_{i+1} is constructed from T_i by inserting the suffix suf_{i+1}.
      Algorithm suffix-tree
      Input: string σ
      Output: suffix tree T for σ
        n := |σ|; T_0 := ∅;
        for i := 0 to n – 1 do
          insert suf_{i+1} into T_i, store the result in T_{i+1};
        end for

  21. Naive suffix tree construction
      All suffixes suf_j with j ≤ i have a location in T_i.
      ⇒ head_{i+1} = longest prefix of suf_{i+1} whose extended location exists in T_i
      Definition: tail_{i+1} := suf_{i+1} – head_{i+1}, i.e. suf_{i+1} = head_{i+1} tail_{i+1}.
      (S1) ⇒ tail_{i+1} ≠ ε.

  22. Naive suffix tree construction
      Example: σ = ababc
      suf_3 = abc, head_3 = ab, tail_3 = c
      T_0 = ∅ (empty tree)
      T_1 = root with one edge labeled ababc
      T_2 = root with two edges labeled ababc and babc

  23. Naive suffix tree construction
      T_{i+1} can be constructed from T_i as follows:
      1. Determine the extended location of head_{i+1} in T_i and split the last edge leading to this location into two new edges by inserting a new node.
      2. Insert a new leaf as location for suf_{i+1}.
      [Figure: the path head_{i+1} ends at the new node v; x is the extended location of head_{i+1}; the edge to the new leaf is labeled tail_{i+1}]

  24. Naive suffix tree construction
      Example: σ = ababc, head_3 = ab, tail_3 = c
      T_2 = root with edges ababc and babc
      T_3 = root with edges ab and babc; below ab, two edges abc and c

  25. Naive suffix tree construction
      Algorithm suffix-insertion
      Input: tree T_i and suffix suf_{i+1}
      Output: tree T_{i+1}
        v := root of T_i
        j := i
        repeat
          find child w of v with σ_{w.l} = σ_{j+1}
          k := w.l – 1;
          while k < w.u and σ_{k+1} = σ_{j+1} do
            k := k + 1; j := j + 1
          end while

  26. Naive suffix tree construction
      Algorithm suffix-insertion (continued)
          if k = w.u then v := w
        until k < w.u or w = nil
        /* v is the contracted location of head_{i+1} */
        insert the location of head_{i+1} and tail_{i+1} below v into T_i
      Running time of suffix-insertion: O( )
      Total time required for the naive construction: O( )
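The naive construction can be sketched in runnable Python as follows. This is a minimal sketch under stated assumptions, not the lecture's implementation: it requires the last character of σ to be unique (S1), keeps the children of a node in a dictionary instead of the child-sibling representation, and stores the index n where the slides write $. The names `Node`, `build_suffix_tree` and `edges` are illustrative.

```python
class Node:
    def __init__(self, l, u):
        self.l = l          # 1-based start index of the incoming edge label in sigma
        self.u = u          # 1-based end index of the label (inclusive)
        self.children = {}  # assumption: dict keyed by the first label character


def build_suffix_tree(sigma):
    n = len(sigma)
    # (S1): a unique last character guarantees that every insertion ends in a mismatch.
    assert n > 0 and sigma.count(sigma[-1]) == 1
    root = Node(0, 0)  # dummy indices: the root has no incoming edge
    for i in range(n):  # insert suf_{i+1} = sigma[i:]
        v, j = root, i
        while True:
            w = v.children.get(sigma[j])
            if w is None:
                # head_{i+1} ends exactly at node v: hang the tail below v
                v.children[sigma[j]] = Node(j + 1, n)
                break
            k = w.l - 1  # 0-based position inside the label of w
            while k < w.u and sigma[k] == sigma[j]:
                k += 1
                j += 1
            if k == w.u:
                v = w  # the whole edge matched, continue below w
                continue
            # mismatch inside the edge: split it at a new internal node
            mid = Node(w.l, k)
            v.children[sigma[w.l - 1]] = mid
            w.l = k + 1
            mid.children[sigma[k]] = w
            mid.children[sigma[j]] = Node(j + 1, n)
            break
    return root


def edges(node, sigma):
    """List (label, l, u) for every edge in depth-first order."""
    out = []
    for ch in sorted(node.children):
        child = node.children[ch]
        out.append((sigma[child.l - 1:child.u], child.l, child.u))
        out.extend(edges(child, sigma))
    return out


for edge in edges(build_suffix_tree("ababc"), "ababc"):
    print(edge)
```

For σ = ababc the printed index pairs match the internal representation shown earlier, with 5 in place of $.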

  27. The algorithm M (McCreight, 1976)
      Idea: the extended location of head_{i+1} in T_i is determined in constant amortized time. (Additional information required!)
      Once the extended location of head_{i+1} in T_i has been found, creating a new node and splitting an edge takes O(1) time.
      Theorem 1: Algorithm M constructs a suffix tree for σ with |σ| leaves and at most |σ| − 1 internal nodes in time O(|σ|).

  28. Suffix links
      Definition: Let xγ be an arbitrary string, where x is a single character and γ is some (possibly empty) substring. For an internal node v with edge labels xγ the following holds: if there exists a node s(v) with edge labels γ, then there is a pointer from v to s(v), which is called a suffix link.
      [Figure: node v at the end of the path xγ with a suffix link to node s(v) at the end of the path γ]
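To make the definition concrete, the sketch below computes the suffix links of a small example by brute force. It relies on an assumption that holds when the last character of σ is unique (S1): the internal nodes of the suffix tree are exactly the substrings that are followed by at least two different characters in σ. This is a quadratic illustration, not algorithm M, and the function names are illustrative.

```python
def internal_node_labels(sigma):
    """Path labels of internal nodes: substrings followed by >= 2 distinct characters."""
    followers = {}
    n = len(sigma)
    for i in range(n):
        for j in range(i, n):
            # sigma[i:j] occurs at position i and is followed by sigma[j]
            followers.setdefault(sigma[i:j], set()).add(sigma[j])
    return {s for s, chars in followers.items() if len(chars) >= 2}


def suffix_links(sigma):
    """Map the path label of each internal node v (except the root) to that of s(v)."""
    internal = internal_node_labels(sigma)
    links = {}
    for label in internal:
        if label:  # the root (empty label) has no suffix link
            target = label[1:]  # drop the first character
            assert target in internal
            links[label] = target
    return links


for source, target in sorted(suffix_links("bbabaabc").items()):
    print(repr(source), "->", repr(target))
```

An empty target label stands for the root.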

  29. Suffix links
      The idea is the following: by following the suffix links, we do not have to start each search for a splitting point at the root node. Instead, we can use the suffix links to determine these points more efficiently, i.e. in constant amortized time.

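  30. Suffix tree: example
      In the trees below, braces list the labels of the edges leaving a node, and → leads to the subtree below an edge.
      T_0 = ∅ (empty tree)
      T_1 = { bbabaabc }
      suf_1 = bbabaabc
      suf_2 = babaabc, head_2 = b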

  31. Suffix tree: example
      T_2 = { b → { abaabc, babaabc } }
      suf_3 = abaabc, head_3 = ε
      T_3 = { abaabc, b → { abaabc, babaabc } }
      suf_4 = baabc, head_4 = ba

  32. Suffix tree: example
      T_4 = { abaabc, b → { babaabc, a → { baabc, abc } } }
      The node at the end of the path ba is the location of head_4.
      suf_5 = aabc, head_5 = a

  33. Suffix tree: example
      T_5 = { a → { baabc, abc }, b → { babaabc, a → { baabc, abc } } }
      The node at the end of the path a is the location of head_5.
      suf_6 = abc, head_6 = ab

  34. Suffix tree: example
      T_6 = { a → { b → { aabc, c }, abc }, b → { babaabc, a → { baabc, abc } } }
      The node at the end of the path ab is the location of head_6.
      suf_7 = bc, head_7 = b

  35. Suffix tree: example
      T_7 = { a → { b → { aabc, c }, abc }, b → { babaabc, a → { baabc, abc }, c } }
      suf_8 = c

  36. Suffix tree: example
      T_8 = { a → { b → { aabc, c }, abc }, b → { babaabc, a → { baabc, abc }, c }, c }

  37. Suffix tree: application
      Usage of a suffix tree T:
      1. Search for a string α: follow the path with edge labels α (takes O(|α|) time). Leaves of the subtree ≙ occurrences of α.
      2. Search for the longest substring occurring at least twice: find the location of a substring with maximum weighted depth that is an internal node.
      3. Prefix search: all occurrences of strings with prefix α are represented by the nodes of the subtree rooted at the location of α in T.

  38. Suffix tree: application
      4. Range search for [α, β]:
      [Figure: the range boundaries α and β in the suffix tree]
