
Suffix Tries: slides adapted from the course by Ben Langmead

  1. Suffix Tries. Slides adapted from the course by Ben Langmead (ben.langmead@gmail.com).

  2. Indexing with suffixes. Until now, our indexes have been based on extracting substrings from T. A very different approach is to extract suffixes from T. This will lead us to some interesting and practical index data structures: the suffix trie, suffix tree, suffix array, and FM index. [Figure: for T = BANANA$, the suffix array (6 $, 5 A$, 3 ANA$, 1 ANANA$, 0 BANANA$, 4 NA$, 2 NANA$) shown next to the sorted rotations $BANANA, A$BANAN, ANA$BAN, ANANA$B, BANANA$, NA$BANA, NANA$BA.]
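
The suffix array shown for BANANA$ can be reproduced by sorting suffix offsets. This is only a minimal sketch: `suffix_array` is a hypothetical helper name, and the quadratic-time sort is chosen for clarity, not efficiency. It relies on '$' sorting before the letters in ASCII.

```python
def suffix_array(t):
    """Return the starting offsets of the suffixes of t in sorted order."""
    # Sorting offsets by the suffix they start is simple but slow;
    # it is used here only to illustrate the idea.
    return sorted(range(len(t)), key=lambda i: t[i:])


t = "BANANA$"
for i in suffix_array(t):
    print(i, t[i:])
```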

  3. Tries. A trie (pronounced “try”) is a tree representing a collection of strings with one node per common prefix. It is the smallest tree such that: each edge is labeled with a character c ∈ Σ; a node has at most one outgoing edge labeled c, for each c ∈ Σ; and each key is “spelled out” along some path starting at the root. It is a natural way to represent either a set or a map whose keys are strings.
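
As an illustration of the definition, a trie holding a set of strings can be sketched with nested dicts, one dict per node. The names `trie_insert`, `trie_contains`, and the `END` marker are assumptions made for this example, not part of the slides.

```python
END = ""  # hypothetical end-of-key marker; cannot clash with a 1-character edge label


def trie_insert(root, key):
    """Spell key out along a path from the root, creating nodes as needed."""
    cur = root
    for c in key:
        cur = cur.setdefault(c, {})  # at most one outgoing edge per character
    cur[END] = True


def trie_contains(root, key):
    """True if key was inserted (a mere prefix of a key does not count)."""
    cur = root
    for c in key:
        if c not in cur:
            return False
        cur = cur[c]
    return END in cur


root = {}
for word in ["as", "ash", "at"]:
    trie_insert(root, word)

print(trie_contains(root, "ash"), trie_contains(root, "a"))
```

All three keys share the single node for their common prefix "a", which is what keeps the tree as small as possible.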

  4. Suffix trie. Build a trie containing all suffixes of a text T. T: GTTATAGCTGATCGCGGCGTAGCGG$. [Figure: every suffix of T listed one per line, from the full string down to $; the suffixes of a length-m string total m(m+1)/2 chars.]

  5. Suffix trie. First add a special terminal character $ to the end of T. $ is a character that does not appear elsewhere in T, and we define it to be less than all other characters (for DNA: $ < A < C < G < T). $ enforces a rule we’re all used to: e.g. “as” comes before “ash” in the dictionary. $ also guarantees that no suffix is a prefix of any other suffix. T: GTTATAGCTGATCGCGGCGTAGCGG$. [Figure: the suffixes of T listed one per line.]
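
The effect of the terminator can be checked directly on a small string. The sketch below uses hypothetical helper names (`suffixes`, `has_suffix_that_prefixes_another`) and assumes '$' does not occur in the text itself.

```python
def suffixes(t):
    """All non-empty suffixes of t, longest first."""
    return [t[i:] for i in range(len(t))]


def has_suffix_that_prefixes_another(t):
    """True if some suffix of t is a proper prefix of another suffix of t."""
    sufs = suffixes(t)
    return any(a != b and b.startswith(a) for a in sufs for b in sufs)


t = "abaaba"
print(has_suffix_that_prefixes_another(t))        # without the terminator
print(has_suffix_that_prefixes_another(t + "$"))  # with the terminator

# A string of length m has m(m+1)/2 characters across all of its suffixes.
m = len(t + "$")
print(sum(len(s) for s in suffixes(t + "$")), m * (m + 1) // 2)

# '$' sorts before letters in ASCII, so "as$" precedes "ash$".
print(sorted(["ash$", "as$"]))
```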

  6. Tries. Smallest tree such that: each edge is labeled with a character from Σ; a node has at most one outgoing edge labeled with c, for any c ∈ Σ; each key is “spelled out” along some path starting at the root.

  7. Suffix trie. T: abaaba, T$: abaaba$. Each path from root to leaf represents a suffix; each suffix is represented by some path from root to leaf. Would this still be the case if we hadn’t added $? [Figure: suffix trie of abaaba$, with the shortest (non-empty) suffix and the longest suffix marked.]

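  8. Suffix trie. T: abaaba. Each path from root to leaf represents a suffix; each suffix is represented by some path from root to leaf. Would this still be the case if we hadn’t added $? No: without $, a suffix such as a or ba is a prefix of a longer suffix, so its path ends at an internal node rather than at a leaf. [Figure: suffix trie of abaaba built without $.]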

  9. Suffix trie. We can think of nodes as having labels, where the label spells out the characters on the path from the root to the node. [Figure: suffix trie of abaaba$ with the node labeled baa highlighted.]

  10. Suffix trie. How do we check whether a string S is a substring of T? Note: each of T’s substrings is spelled out along a path from the root, i.e., every substring is a prefix of some suffix of T. Start at the root and follow the edges labeled with the characters of S. If we “fall off” the trie, i.e. there is no outgoing edge for the next character of S, then S is not a substring of T. If we exhaust S without falling off, S is a substring of T. Example: S = baa. Yes, it’s a substring. [Figure: suffix trie of abaaba$ with the path for baa highlighted.]

  11. Suffix trie. Substring check, second example: S = abaaba. Following the edges labeled with the characters of S from the root exhausts S without falling off the trie, so yes, it’s a substring. [Figure: suffix trie of abaaba$ with the path for abaaba highlighted.]

  12. Suffix trie. Substring check, third example: S = baabb. The walk from the root spells baab and then falls off the trie, because there is no outgoing edge labeled b (marked x in the figure), so no, it is not a substring. [Figure: suffix trie of abaaba$ with the failed walk for baabb marked.]

  13. Suffix trie. How do we check whether a string S is a suffix of T? Same procedure as for substring, but additionally check whether the final node in the walk has an outgoing edge labeled $. Example: S = baa. Not a suffix. [Figure: suffix trie of abaaba$ with the path for baa highlighted.]

  14. Suffix trie. How do we check whether a string S is a suffix of T? Same procedure as for substring, but additionally check whether the final node in the walk has an outgoing edge labeled $. Example: S = aba. Is a suffix. [Figure: suffix trie of abaaba$ with the path for aba highlighted.]
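
The substring and suffix checks can be sketched on a nested-dict suffix trie. The function names below (`build_suffix_trie`, `walk`, `is_substring`, `is_suffix`) are assumptions for this example, and both the text and the query are assumed not to contain '$'.

```python
def build_suffix_trie(t):
    """Nested-dict suffix trie of t + '$'; each '$' edge stores the suffix offset."""
    t += "$"
    root = {}
    for i in range(len(t)):
        cur = root
        for c in t[i:-1]:  # every character of the suffix except the final '$'
            cur = cur.setdefault(c, {})
        cur["$"] = i
    return root


def walk(root, s):
    """Follow the edges spelled by s; return the final node, or None if we fall off."""
    cur = root
    for c in s:
        if c not in cur:
            return None
        cur = cur[c]
    return cur


def is_substring(root, s):
    return walk(root, s) is not None


def is_suffix(root, s):
    node = walk(root, s)
    return node is not None and "$" in node  # extra check: outgoing '$' edge


root = build_suffix_trie("abaaba")
for s in ["baa", "abaaba", "baabb", "aba"]:
    print(s, is_substring(root, s), is_suffix(root, s))
```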

  15. Suffix trie. How do we count the number of times a string S occurs as a substring of T? Follow the path corresponding to S. Either we fall off, in which case the answer is 0, or we end up at node n and the answer = # of leaf nodes in the subtree rooted at n. Leaves can be counted with depth-first traversal. Example: S = aba, 2 occurrences. [Figure: suffix trie of abaaba$ with node n at the end of the path for aba.]
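
Counting occurrences by counting leaves might look like the following sketch. The helper names are hypothetical, and the trie is the nested-dict form in which every '$' edge leads to one leaf.

```python
def build_suffix_trie(t):
    """Nested-dict suffix trie of t + '$' (t is assumed not to contain '$')."""
    t += "$"
    root = {}
    for i in range(len(t)):
        cur = root
        for c in t[i:-1]:
            cur = cur.setdefault(c, {})
        cur["$"] = i  # leaf: offset of the suffix
    return root


def count_leaves(node):
    """Depth-first count of the leaves below node."""
    return sum(1 if c == "$" else count_leaves(child) for c, child in node.items())


def count_occurrences(root, s):
    """Number of times s occurs in the text (s is assumed not to contain '$')."""
    cur = root
    for c in s:
        if c not in cur:
            return 0  # fell off the trie
        cur = cur[c]
    return count_leaves(cur)


root = build_suffix_trie("abaaba")
for s in ["aba", "a", "ba", "bb"]:
    print(s, count_occurrences(root, s))
```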

  16. Suffix trie. How do we find the longest repeated substring of T? Find the deepest node with more than one child. Example: for T = abaaba, the answer is aba. [Figure: suffix trie of abaaba$ with the node labeled aba highlighted.]
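
A sketch of the “deepest node with more than one child” search follows. The names are hypothetical; a '$' leaf counts as a child; and when several repeated substrings tie for longest, this version simply returns one of them.

```python
def build_suffix_trie(t):
    """Nested-dict suffix trie of t + '$' (t is assumed not to contain '$')."""
    t += "$"
    root = {}
    for i in range(len(t)):
        cur = root
        for c in t[i:-1]:
            cur = cur.setdefault(c, {})
        cur["$"] = i
    return root


def longest_repeated_substring(root):
    """Label of the deepest node that has more than one child."""
    best = ""
    stack = [(root, "")]  # (node, label spelled from the root)
    while stack:
        node, label = stack.pop()
        if len(node) > 1 and len(label) > len(best):
            best = label
        for c, child in node.items():
            if c != "$":  # '$' edges end in leaves, nothing to explore below
                stack.append((child, label + c))
    return best


print(longest_repeated_substring(build_suffix_trie("abaaba")))
```

Note that the two occurrences may overlap: for aaaa the deepest branching node is aaa.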

  17. Suffix Trie implementation (derived from Ben Langmead)

      class SuffixTrie(object):
          ''' building a suffix Trie '''

          def __init__(self, t):
              """ Make suffix trie from t """
              if not t.endswith('$'):
                  t += '$'  # special terminator symbol
              self.root = {}
              for i in range(len(t)):  # for each suffix
                  cur = self.root
                  for c in t[i:]:  # for each character in i'th suffix
                      if c == '$':
                          cur[c] = i  # add outgoing edge and suffix position
                      elif c not in cur:
                          cur[c] = {}  # add outgoing edge if necessary
                      cur = cur[c]

  18. Suffix Trie implementation: followPath

      class SuffixTrie(object):
          # …

          def followPath(self, s):
              """ Follow path given by characters of s. Return node at
                  end of path, or None if we fall off. """
              cur = self.root
              for c in s:
                  if c not in cur:
                      return None
                  cur = cur[c]
              return cur

  19. Suffix Trie implementation: find all positions

      class SuffixTrie(object):
          # …

          def findLeaves(self, v):
              """ Return the leaves from a given vertex v """
              leaves = []
              if v is None:
                  return leaves
              for c in v:
                  if c == '$':
                      leaves.append(v[c])  # '$' edge stores a suffix position
                  else:
                      leaves += self.findLeaves(v[c])
              return leaves

          def findPositions(self, s):
              """ Return a list of matching positions of s """
              return self.findLeaves(self.followPath(s))

  20. Examples

      if __name__ == '__main__':
          seq = 'abaaba'
          print("seq=", seq)
          strie = SuffixTrie(seq)
          for p in ['a', 'ba', 'aa', 'bb']:
              print("find position of", p, "in seq", strie.findPositions(p))
          print("find the leaves=", strie.findLeaves(strie.root))

      Output recorded on the slide (the order within each list follows dict iteration order, so it can vary between Python versions):

      $ python ../codes/ST/STrie.py
      seq= abaaba
      find position of a in seq [2, 0, 3, 5]
      find position of ba in seq [1, 4]
      find position of aa in seq [2]
      find position of bb in seq []
      find the leaves= [2, 0, 3, 5, 1, 4, 6]

  21. Suffix trie. How many nodes does the suffix trie have? Is there a class of string where the number of suffix trie nodes grows linearly with m? Yes: e.g. a string of m a’s in a row (a^m). Example: T = aaaa.
      • 1 root
      • m nodes with incoming a edge
      • m + 1 nodes with incoming $ edge
      Total: 2m + 2 nodes.
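
The 2m + 2 count for a string of m a’s can be checked by building the trie and counting nodes. This is a sketch with hypothetical helper names; each dict is one internal node and each '$' edge ends in one leaf.

```python
def build_suffix_trie(t):
    """Nested-dict suffix trie of t + '$' (t is assumed not to contain '$')."""
    t += "$"
    root = {}
    for i in range(len(t)):
        cur = root
        for c in t[i:-1]:
            cur = cur.setdefault(c, {})
        cur["$"] = i
    return root


def count_nodes(node):
    """Count this node plus everything below it; each '$' edge ends in a leaf."""
    total = 1
    for c, child in node.items():
        total += 1 if c == "$" else count_nodes(child)
    return total


for m in range(1, 6):
    print(m, count_nodes(build_suffix_trie("a" * m)), 2 * m + 2)
```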

  22. Suffix trie. Is there a class of string where the number of suffix trie nodes grows with m^2? Yes: a^n b^n.
      • 1 root
      • n nodes along “b chain,” right
      • n nodes along “a chain,” middle
      • n chains of n “b” nodes hanging off each “a chain” node
      • 2n + 1 $ leaves (not shown)
      Total: n^2 + 4n + 2 nodes, where m = 2n. (Figure & example by Carl Kingsford.)
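
The quadratic count for a^n b^n can be confirmed the same way for small n. Again a sketch with hypothetical helper names, counting one node per dict and one leaf per '$' edge.

```python
def build_suffix_trie(t):
    """Nested-dict suffix trie of t + '$' (t is assumed not to contain '$')."""
    t += "$"
    root = {}
    for i in range(len(t)):
        cur = root
        for c in t[i:-1]:
            cur = cur.setdefault(c, {})
        cur["$"] = i
    return root


def count_nodes(node):
    """Count this node plus everything below it; each '$' edge ends in a leaf."""
    total = 1
    for c, child in node.items():
        total += 1 if c == "$" else count_nodes(child)
    return total


for n in range(1, 6):
    t = "a" * n + "b" * n
    print(n, count_nodes(build_suffix_trie(t)), n * n + 4 * n + 2)
```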

  23. Suffix trie: upper bound on size. Could worst-case # nodes be worse than O(m^2)? Max # nodes from top to bottom (root to deepest leaf) = length of longest suffix + 1 = m + 1. Max # nodes from left to right = max # distinct substrings of any length ≤ m. So O(m^2) is the worst case.
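
One way to see the bound concretely: every node corresponds to a distinct substring of T$ (the root being the empty string), so the node count is the number of distinct substrings plus one. The sketch below checks this by brute force on a few short strings; the helper names are hypothetical, and the bound used, 1 + (m+1)(m+2)/2 with m = |T|, simply counts all (start, end) pairs in T$.

```python
def build_suffix_trie(t):
    """Nested-dict suffix trie of t + '$' (t is assumed not to contain '$')."""
    t += "$"
    root = {}
    for i in range(len(t)):
        cur = root
        for c in t[i:-1]:
            cur = cur.setdefault(c, {})
        cur["$"] = i
    return root


def count_nodes(node):
    """Count this node plus everything below it; each '$' edge ends in a leaf."""
    total = 1
    for c, child in node.items():
        total += 1 if c == "$" else count_nodes(child)
    return total


def distinct_substrings(s):
    """Set of all non-empty substrings of s (brute force)."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


for t in ["abaaba", "aaaa", "aabb", "GTTATAGC"]:
    m = len(t)
    nodes = count_nodes(build_suffix_trie(t))
    bound = 1 + (m + 1) * (m + 2) // 2
    print(t, nodes, len(distinct_substrings(t + "$")) + 1, bound)
```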
