Tries and Suffix Trees


  1. Tries and Suffix Trees Inge Li Gørtz

  2. String indexing problem • String matching problem. Given strings T (text) and P (pattern) over an alphabet Σ, report the starting positions of all occurrences of P in T. Here n is the length of the text and m is the length of the pattern. • Finite automaton: O(m|Σ| + n) time and space • KMP: O(m + n) time and space • String indexing problem. Given a string S of characters from an alphabet Σ, preprocess S into a data structure to support • Search(P): Return the starting positions of all occurrences of P in S. • Today: Data structure using O(n) space and supporting Search(P) in O(m) time. • Applications: • Search engines, e.g. prefix searches. • Finding common substrings of many biological strings • Finding repeating substructures in biological strings • Detecting DNA contamination

  3. Outline • Tries • Compressed tries • Suffix trees • Applications of suffix trees

  4. Tries

  5. Tries • Text retrieval • [Figure: trie with leaves labeled S1 to S6] • Trie over the strings: sells, by, the, sea, shells, tea.

  6. Tries • Text retrieval • Prefix-free? • [Figure: the same trie with leaves S1 to S6] • Trie over the strings: sells, by, the, sea, shells, tea, she.

  7. Tries • Text retrieval • Prefix-free? • [Figure: trie with a $ terminator on every string and leaves labeled S1 to S7] • Trie over the strings: sells, by, the, sea, shells, tea, she.

  8. Tries • Text retrieval • Search for “sea” • [Figure: trie with $ terminators and leaves labeled S1 to S7] • Trie over the strings: sells$, by$, the$, sea$, shells$, tea$, she$.

  20. Tries • Text retrieval • Search for “short” • [Figure: trie with $ terminators and leaves labeled S1 to S7] • Trie over the strings: sells$, by$, the$, sea$, shells$, tea$, she$.

  25. Tries • Build a trie over the strings: by$, sells$, sea$. • [Figure: the resulting trie with leaves labeled S2, S4 and S1]

  26. Trie • Properties of the trie. A trie T storing a collection S of s strings of total length n from an alphabet of size d has the following properties: • How many children can a node have? • How many leaves does T have? • What is the height of T? • What is the number of nodes in T?

  27. Trie • Search time: O(d) in each node => O(dm). • O(m) if d constant. • d not constant: use dictionary • Hashing O(1) • Balanced BST: O(log d) • Time and space for a trie (for small/constant d): • O(m) for searching for a string of length m. • O(n) space. • Preprocessing: O(n)
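The construction and search walked through on the preceding slides can be sketched as follows. This is a minimal illustration, not the lecture's own code: the names `TrieNode`, `build_trie` and `search` are assumptions, and children are kept in a Python `dict` (the "hashing" option above), so each step costs expected O(1) regardless of d.

```python
END = "$"  # terminator, assumed not to occur inside any stored word


class TrieNode:
    """One trie node (hypothetical helper class for this sketch)."""

    def __init__(self):
        self.children = {}   # character -> TrieNode, a hash-based dictionary
        self.word_id = None  # set only on the leaf reached after END


def build_trie(words):
    """Insert every word followed by END; leaves get 1-based ids S1, S2, ..."""
    root = TrieNode()
    for word_id, word in enumerate(words, start=1):
        node = root
        for ch in word + END:
            node = node.children.setdefault(ch, TrieNode())
        node.word_id = word_id
    return root


def search(root, word):
    """Return the id of word if it is stored, else None. Takes O(m) steps."""
    node = root
    for ch in word + END:
        node = node.children.get(ch)
        if node is None:
            return None
    return node.word_id


trie = build_trie(["sells", "by", "the", "sea", "shells", "tea", "she"])
print(search(trie, "sea"))    # 4
print(search(trie, "short"))  # None
```

Because every word is terminated with `$`, "she" gets its own leaf even though it is a prefix of "shells", and a query such as "se" correctly fails since no stored word ends there.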

  28. Tries • Prefix search: return all words in the trie starting with “se” • [Figure: trie with $ terminators and leaves labeled S1 to S7]

  31. Trie • Time for prefix search: O(m) + time to traverse the subtree below the prefix and report all occurrences. Could be large! • Solution: compact tries.
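A prefix search on a plain trie might look like the sketch below, assuming a nested-`dict` representation and the hypothetical names `build_trie` and `prefix_search`. It walks down the m characters of the prefix and then traverses the whole subtree, which is the part that can be large.

```python
END = "$"  # terminator, assumed not to occur inside any stored word


def build_trie(words):
    """Nested-dict trie: each node maps a character to its child node."""
    root = {}
    for word in words:
        node = root
        for ch in word + END:
            node = node.setdefault(ch, {})
    return root


def prefix_search(root, prefix):
    """Return all stored words that start with prefix, in sorted order."""
    node = root
    for ch in prefix:          # O(m) walk down the prefix
        if ch not in node:
            return []
        node = node[ch]
    results = []
    stack = [(node, prefix)]   # traverse the entire subtree below the prefix
    while stack:
        current, text = stack.pop()
        for ch, child in current.items():
            if ch == END:
                results.append(text)
            else:
                stack.append((child, text + ch))
    return sorted(results)


trie = build_trie(["sells", "by", "the", "sea", "shells", "tea", "she"])
print(prefix_search(trie, "se"))  # ['sea', 'sells']
```

The traversal visits every node below the prefix, including long single-child chains such as the one spelling "lls$", so the work is proportional to the subtree size rather than to the number of words reported.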

  32. Compact tries

  33. Tries • Compact trie: Chains of nodes with a single child are merged into a single node. • [Figure: the trie with $ terminators and leaves labeled S1 to S7, before compaction]

  34. Tries • Compact trie: Chains of nodes with a single child are merged into a single node. • [Figure: the trie above and its compact version, with leaves labeled S1 to S7]

  35. Trie • Properties of the compact trie. A compact trie T storing a collection S of s strings of total length n from an alphabet of size d has the following properties: • Every internal node of T has at least 2 and at most d children. • T has s leaves. • The number of nodes in T is < 2s. • Time and space for a compact trie (constant d): • O(m) for searching for a string of length m. • O(m + occ) for prefix search, where occ is the number of occurrences. • O(s) space for the tree, in addition to the strings themselves. • Preprocessing: O(n).
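One way to see these properties is to compact a plain trie by merging single-child chains and then count nodes. The sketch below is illustrative only; `compact`, `count_nodes` and `search` are hypothetical names. It stores edge labels as explicit strings for readability, whereas a space-efficient version would store each label as a pair of indices into the original strings.

```python
END = "$"  # terminator, assumed not to occur inside any stored word


def build_trie(words):
    """Plain nested-dict trie: character -> child node."""
    root = {}
    for word in words:
        node = root
        for ch in word + END:
            node = node.setdefault(ch, {})
    return root


def compact(node):
    """Return a copy where every chain of single-child nodes is one edge."""
    result = {}
    for ch, child in node.items():
        label = ch
        while len(child) == 1:  # extend the label along the chain
            (next_ch, next_child), = child.items()
            label += next_ch
            child = next_child
        result[label] = compact(child)
    return result


def count_nodes(node):
    """Count this node plus all of its descendants."""
    return 1 + sum(count_nodes(child) for child in node.values())


def search(node, word):
    """Exact search: follow the edge whose label is a prefix of what is left."""
    rest = word + END
    while rest:
        for label, child in node.items():
            if rest.startswith(label):
                node, rest = child, rest[len(label):]
                break
        else:
            return False  # no outgoing edge matches
    return True


words = ["sells", "by", "the", "sea", "shells", "tea", "she"]
tree = compact(build_trie(words))
print(sorted(tree))       # ['by$', 's', 't']
print(count_nodes(tree))  # 12, which is below 2s = 14
```

With s = 7 strings the compact trie has 7 leaves and 12 nodes in total, while the plain trie over the same strings has one node per distinct prefix.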

  36. Suffix trees

  37. Suffix tree • String indexing problem. Given a string S of characters from an alphabet Σ, preprocess S into a data structure to support • Search(P): Return the starting positions of all occurrences of P in S. • Build a compressed trie over all suffixes of S (suffix tree). Label leaves with index of suffix. • Observation: An occurrence of P is a prefix of a suffix of S. • [Figure: an occurrence of P shown as the start of a suffix of S]

  38. Suffix tree • Example: P = ana. • [Figure: the letters b a n a n a s t r i n g s s a l a d s, with suffixes of S marked]

  39. Suffix Tree • Suffix tree over the string banana$ • [Figure: suffix tree of banana$ with leaves labeled 1 to 7]

  40. Suffix Tree • Suffix tree over the string banana$ • Search for P. • Report labels of all leaves below final node

  41. Suffix Tree • Suffix tree over the string banana$ • Find all occurrences of P = “an” • Search for P. • Report labels of all leaves below final node
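The search procedure above (walk down P, then report every leaf label below the final node) can be tried out with the sketch below. To stay short it builds an uncompacted suffix trie by inserting each suffix character by character, which takes quadratic time and space; it is not a linear-size suffix tree and not a linear-time construction. The names `Node`, `build_suffix_trie` and `search` are assumptions made for this illustration.

```python
class Node:
    """Node of the suffix trie (hypothetical helper class for this sketch)."""

    def __init__(self):
        self.children = {}        # character -> Node
        self.suffix_start = None  # 1-based start of the suffix ending here


def build_suffix_trie(text):
    """Naive construction: insert every suffix of text, one character at a time.

    text is assumed to end with a terminator such as '$' that occurs nowhere
    else, so that every suffix ends at its own leaf.
    """
    root = Node()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.children.setdefault(ch, Node())
        node.suffix_start = start + 1
    return root


def search(root, pattern):
    """Return the sorted 1-based starting positions of pattern in the text."""
    node = root
    for ch in pattern:  # walk down P
        node = node.children.get(ch)
        if node is None:
            return []
    positions = []
    stack = [node]      # collect the labels of all leaves below the final node
    while stack:
        current = stack.pop()
        if current.suffix_start is not None:
            positions.append(current.suffix_start)
        stack.extend(current.children.values())
    return sorted(positions)


tree = build_suffix_trie("banana$")
print(search(tree, "an"))   # [2, 4]
print(search(tree, "ana"))  # [2, 4]
```

Each occurrence of P is a prefix of exactly one suffix, so the leaf labels below the final node are exactly the starting positions of P in the text.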
