
Suffix trees
Ben Langmead
You are free to use these slides. If you do, please sign the guestbook (www.langmead-lab.org/teaching-materials), or email me (ben.langmead@gmail.com) and tell me briefly how you're using them. For original Keynote files, email me.

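Suffix trie: making it smaller
T = abaaba$
Idea 1: Coalesce non-branching paths into a single edge with a string label (for example, a chain of single-character edges spelling a, b, a, $ becomes one edge labeled aba$).
This reduces the number of nodes and edges, and guarantees that internal nodes have more than one child.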
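
As a rough illustration of Idea 1, the sketch below builds a suffix trie as nested dictionaries and then merges every non-branching chain into a single edge keyed by its string label. The function names (`build_suffix_trie`, `coalesce`, `count_nodes`) and the nested-dict representation are assumptions made for this example, not something taken from the slides.

```python
def build_suffix_trie(t):
    """Suffix trie as nested dicts: each key is one character, each value a child."""
    root = {}
    for i in range(len(t)):
        cur = root
        for c in t[i:]:
            cur = cur.setdefault(c, {})
    return root


def coalesce(node):
    """Return a copy in which each non-branching path is one edge with a string label."""
    out = {}
    for c, child in node.items():
        label = c
        # Keep absorbing the next character while the child has exactly one edge.
        while len(child) == 1:
            (c2, nxt), = child.items()
            label += c2
            child = nxt
        out[label] = coalesce(child)
    return out


def count_nodes(node):
    """Count this node plus all of its descendants."""
    return 1 + sum(count_nodes(child) for child in node.values())


trie = build_suffix_trie("abaaba$")
tree = coalesce(trie)
print(count_nodes(trie), count_nodes(tree))  # 22 nodes before, 11 after
```

The root is never merged away, and leaves (empty dicts) stop the merging loop, so every remaining internal node below the root has at least two children.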

Suffix tree
T = abaaba$
With respect to m, the length of T:
• How many leaves? m
• How many non-leaf nodes? ≤ m - 1
• ≤ 2m - 1 nodes total, or O(m) nodes
Is the total size O(m) now? No: the total length of the edge labels is quadratic in m in the worst case.
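
To see the quadratic growth concretely, one can use the fact that coalescing paths neither adds nor drops characters, so the total edge-label length of the suffix tree equals the number of edges in the suffix trie, which is the number of distinct non-empty substrings of T. The brute-force sketch below counts those substrings; the helper name `total_label_length` and the input family a^n b^n $ are choices made for this illustration only.

```python
def total_label_length(t):
    """Total characters over all edge labels of the suffix tree of t.

    Computed by brute force as the number of distinct non-empty substrings
    of t, so this is only suitable for small inputs.
    """
    return len({t[i:j] for i in range(len(t)) for j in range(i + 1, len(t) + 1)})


print(total_label_length("abaaba$"))  # 21

for n in (2, 4, 8, 16):
    t = "a" * n + "b" * n + "$"  # illustrative input family, m = 2n + 1
    print(len(t), total_label_length(t))
```

For this family the count works out to n^2 + 4n + 1, so doubling m roughly quadruples the total label length.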

Suffix tree
T = abaaba$
Idea 2: Store T itself in addition to the tree. Convert the tree's edge labels to (offset, length) pairs with respect to T.
Edge labels become: $ → (6, 1), a → (0, 1), ba → (1, 2), aba$ → (3, 4)
Space required for the suffix tree is now O(m).
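
A minimal sketch of the (offset, length) encoding follows. The helpers `encode` and `decode` are hypothetical names, and picking the first occurrence of the label in T is just one valid choice, since any occurrence would decode to the same string.

```python
T = "abaaba$"


def encode(label, t):
    """Replace a string label by an (offset, length) pair pointing into t."""
    offset = t.find(label)  # first occurrence; any occurrence would do
    if offset < 0:
        raise ValueError("label does not occur in t")
    return (offset, len(label))


def decode(pair, t):
    """Recover the string label from an (offset, length) pair."""
    offset, length = pair
    return t[offset:offset + length]


for label in ["$", "a", "ba", "aba$"]:
    pair = encode(label, T)
    print(label, pair, decode(pair, T))
```

Each edge now costs two integers regardless of how long its label is, which is where the O(m) total comes from.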

Suffix tree: leaves hold offsets
T = abaaba$
[Figure: the suffix tree for T with (offset, length) edge labels; each leaf stores the offset, 0 through 6, of the suffix it represents.]

Suffix tree: labels
T = abaaba$
Again, each node's label equals the concatenated edge labels from the root to the node. These aren't stored explicitly.
[Figure: the suffix tree for T with two nodes annotated: Label = "ba" and Label = "aaba$".]

Suffix tree: labels
T = abaaba$
Because edges can have string labels, we must distinguish two notions of "depth":
• Node depth: how many edges we must follow from the root to reach the node
• Label depth: total length of edge labels for edges on the path from root to node
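
The two notions of depth can be compared on the tree for T = abaaba$. In the sketch below the tree is written out by hand as nested dictionaries keyed by edge label, and `depths` is a hypothetical helper; both are assumptions for illustration.

```python
# Suffix tree for T = abaaba$, keyed by edge label (hand-written for this example).
tree = {
    "$": {},
    "a": {"$": {}, "ba": {"$": {}, "aba$": {}}, "aba$": {}},
    "ba": {"$": {}, "aba$": {}},
}


def depths(root, path):
    """Return (node depth, label depth) of the node reached via the given edge labels."""
    node = root
    for label in path:
        node = node[label]  # raises KeyError if the path is not in the tree
    node_depth = len(path)
    label_depth = sum(len(label) for label in path)
    return node_depth, label_depth


print(depths(tree, ["ba"]))               # (1, 2): node labeled "ba"
print(depths(tree, ["a", "ba"]))          # (2, 3): node labeled "aba"
print(depths(tree, ["a", "ba", "aba$"]))  # (3, 7): leaf for the whole string
```

A leaf's label depth is always the length of the suffix it represents, while its node depth depends on how much branching occurs along the way.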

Suffix tree: space caveat
T = abaaba$
Minor point: We say the space taken by the edge labels is O(m), because we keep 2 integers per edge and there are O(m) edges.
To store one such integer, we need enough bits to distinguish m positions in T, i.e. ceil(log2 m) bits. We usually ignore this factor, since 64 bits is plenty for all practical purposes.
A similar argument applies to the pointers / references used to distinguish tree nodes.
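
The ceil(log2 m) figure can be checked with integer arithmetic. The sketch below uses Python's `int.bit_length`, which avoids floating-point rounding; the function name `bits_per_offset` is an assumption for this example.

```python
def bits_per_offset(m):
    """Bits needed to distinguish m positions, i.e. ceil(log2 m) for m >= 1."""
    if m < 1:
        raise ValueError("m must be at least 1")
    # Offsets range over 0 .. m-1, so we need enough bits to write m-1.
    return (m - 1).bit_length()


for m in (7, 8, 9, 2**32, 2**64):
    print(m, bits_per_offset(m))
```

For T = abaaba$ (m = 7) this gives 3 bits per integer, and a text would need more than 2^64 characters before 64-bit offsets stopped being enough.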

Suffix tree: building
Naive method 1: build a suffix trie, then coalesce non-branching paths and relabel edges.
Naive method 2: build a single-edge tree representing only the longest suffix, then augment to include the 2nd-longest, then augment to include the 3rd-longest, etc.
Both are O(m^2) time, but the first uses O(m^2) space while the second uses O(m).
Naive method 2 is described in Gusfield 5.4.

Suffix tree: implementation
O(m^2) time, O(m) space

    class SuffixTree(object):

        class Node(object):
            def __init__(self, lab):
                self.lab = lab  # label on path leading to this node
                self.out = {}   # outgoing edges; maps characters to nodes

        def __init__(self, s):
            """ Make suffix tree, without suffix links, from s in quadratic time
                and linear space """
            s += '$'
            # Make 2-node tree for longest suffix
            self.root = self.Node(None)
            self.root.out[s[0]] = self.Node(s)  # trie for just longest suf
            # Add the rest of the suffixes, from longest to shortest,
            # adding 1 or 2 nodes for each
            for i in range(1, len(s)):
                # start at root; we'll walk down as far as we can go
                cur = self.root
                j = i
                while j < len(s):
                    if s[j] in cur.out:
                        child = cur.out[s[j]]
                        lab = child.lab
                        # Walk along edge until we exhaust edge label or
                        # until we mismatch
                        k = j + 1
                        while k - j < len(lab) and s[k] == lab[k - j]:
                            k += 1
                        if k - j == len(lab):
                            cur = child  # we exhausted the edge
                            j = k
                        else:
                            # Most complex case: we fell off in middle of edge
                            cExist, cNew = lab[k - j], s[k]
                            # create "mid": new node bisecting edge
                            mid = self.Node(lab[:k - j])
                            mid.out[cNew] = self.Node(s[k:])
                            # original child becomes mid's child
                            mid.out[cExist] = child
                            # original child's label is curtailed
                            child.lab = lab[k - j:]
                            # mid becomes new child of original parent
                            cur.out[s[j]] = mid
                    else:
                        # Fell off tree at a node: make new edge hanging off it
                        cur.out[s[j]] = self.Node(s[j:])
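
One thing such a tree is commonly used for is substring lookup: walk down from the root, matching the query against edge labels. The sketch below assumes only that nodes expose a `lab` string and an `out` dictionary keyed by the first character of each child's label, as in the implementation above. The minimal `Node` class, the `add` helper, the hand-built tree for abaaba$, and `has_substring` are all illustrative assumptions rather than part of the slides.

```python
class Node:
    """Minimal stand-in node: a label string plus outgoing edges by first character."""
    def __init__(self, lab):
        self.lab = lab
        self.out = {}


def add(parent, lab):
    """Attach a new child with the given edge label and return it."""
    child = Node(lab)
    parent.out[lab[0]] = child
    return child


# Hand-built suffix tree for T = abaaba$.
root = Node(None)
add(root, "$")
a = add(root, "a")
b = add(root, "ba")
add(a, "$")
ab = add(a, "ba")
add(a, "aba$")
add(ab, "$")
add(ab, "aba$")
add(b, "$")
add(b, "aba$")


def has_substring(root, p):
    """Return True if p labels a path starting at the root (p is a substring of T)."""
    cur, i = root, 0
    while i < len(p):
        child = cur.out.get(p[i])
        if child is None:
            return False  # fell off the tree at a node
        n = min(len(child.lab), len(p) - i)
        if p[i:i + n] != child.lab[:n]:
            return False  # mismatch in the middle of an edge
        i += n
        cur = child
    return True


print(has_substring(root, "baa"), has_substring(root, "bb"))  # True False
```

The walk inspects each character of the query at most once, so a lookup takes time proportional to the query length when the alphabet is treated as constant.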
