
splitMEM: graphical pan-genome analysis with suffix skips



  1. splitMEM: graphical pan-genome analysis with suffix skips Shoshana Marcus May 7, 2014

  2. Outline 1 Overview 2 Data Structures 3 splitMEM Algorithm 4 Pan-genome Analysis

  3. Objective. Input: • Several complete genomes • Available today for many microbial species, in the near future for higher eukaryotes • Pan-genome: analyze multiple genomes of a species together. Output: • Compressed de Bruijn graph • Graphical representation depicts how population variants relate to each other, especially where they diverge at branch points • How well conserved is a sequence? • What are network properties?

  4. de Bruijn graph • Node for each distinct k-mer • Directed edge connects consecutive k-mers • Nodes overlap by k-1 bp • Self-loops, multi-edges. Example sequences: AGAAGTCC, ATAAGTTA. Reconstruct original sequence: Eulerian path through graph, visiting each edge once
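The node and edge rules on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the function name `de_bruijn_graph` is hypothetical, and k=3 is an assumed value chosen only because the slide does not state one.

```python
from collections import defaultdict

def de_bruijn_graph(sequences, k):
    """Return (nodes, edges) for the node-centric de Bruijn graph.

    nodes: set of distinct k-mers.
    edges: dict mapping (kmer, next_kmer) to its multiplicity, so
           multi-edges show up as counts greater than 1.
    """
    nodes = set()
    edges = defaultdict(int)
    for seq in sequences:
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        nodes.update(kmers)
        # Consecutive k-mers overlap by k-1 characters.
        for a, b in zip(kmers, kmers[1:]):
            edges[(a, b)] += 1
    return nodes, dict(edges)

# The two example sequences from the slide; k=3 is an assumption.
nodes, edges = de_bruijn_graph(["AGAAGTCC", "ATAAGTTA"], k=3)
```

With these inputs the shared substring AAGT appears as the edge AAG -> AGT with multiplicity 2, which is where the two sequences run through the same part of the graph.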

  5. Compressed de Bruijn graph • Merge non-branching chains of nodes • Min. number of nodes that preserve path labels. Usually built from uncompressed graph. We build directly in O(n log n) time and space
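As a rough illustration of merging non-branching chains, the sketch below takes the usual route the slide mentions: it builds the uncompressed graph first and then merges. It is not the direct O(n log n) splitMEM construction. The function name `compressed_de_bruijn` and the choice k=3 are assumptions made for the example.

```python
from collections import defaultdict

def compressed_de_bruijn(sequences, k):
    """Return the sorted node labels of the compressed de Bruijn graph."""
    succ, pred = defaultdict(set), defaultdict(set)
    nodes = set()
    for seq in sequences:
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        nodes.update(kmers)
        for a, b in zip(kmers, kmers[1:]):
            succ[a].add(b)
            pred[b].add(a)

    def mergeable(a, b):
        # a -> b is non-branching: the only edge out of a and into b.
        return a != b and succ[a] == {b} and pred[b] == {a}

    def extend(start, seen):
        # Walk forward from start, appending one character per merged k-mer.
        label, cur = start, start
        seen.add(start)
        while len(succ[cur]) == 1:
            nxt = next(iter(succ[cur]))
            if nxt in seen or not mergeable(cur, nxt):
                break
            label += nxt[-1]
            seen.add(nxt)
            cur = nxt
        return label

    seen, labels = set(), []
    # A chain starts at any node that cannot be merged into a predecessor.
    for node in sorted(nodes):
        if not any(mergeable(p, node) for p in pred[node]):
            labels.append(extend(node, seen))
    # Nodes still unseen lie on isolated non-branching cycles.
    for node in sorted(nodes):
        if node not in seen:
            labels.append(extend(node, seen))
    return sorted(labels)

labels = compressed_de_bruijn(["AGAAGTCC", "ATAAGTTA"], k=3)
# labels == ['AAGT', 'AGAA', 'ATAA', 'GTCC', 'GTTA']
```

The ten 3-mers collapse to five nodes, and AAGT, the part shared by both sequences, becomes a single node.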

  6. Compressed de Bruijn graph: 9 strains of Bacillus anthracis, k=25

  7. Compressed de Bruijn graph: 9 strains of Bacillus anthracis, k=1000

  8. Outline 1 Overview 2 Data Structures 3 splitMEM Algorithm 4 Pan-genome Analysis

  9. Suffix Tree • Rooted, directed tree with a leaf for each suffix. • Each internal node, except the root, has at least two children. • Each edge is labeled with a nonempty substring. • No two sibling edge labels begin with the same character. • Path from root to leaf i spells suffix S[i . . . n]. • Append special character $ to guarantee each suffix ends at a leaf.

  10. Constructing Suffix Tree. Naïve algorithm, S = banana$. [Figure: insert suf 1, banana$]

  11. Constructing Suffix Tree. Naïve algorithm, S = banana$. [Figure: insert suf 2, anana$; tree holds suf 1]

  12. Constructing Suffix Tree. Naïve algorithm, S = banana$. [Figure: insert suf 3, nana$; tree holds suf 1 and suf 2]

  13. Constructing Suffix Tree. Naïve algorithm, S = banana$. [Figure: insert suf 4, ana$; tree holds suf 1 to suf 3]

  14. Constructing Suffix Tree. Naïve algorithm, S = banana$. [Figure: tree after inserting suf 4, ana$; leaves suf 1 to suf 4]

  15. Constructing Suffix Tree. Naïve algorithm, S = banana$. [Figure: complete suffix tree with leaves suf 1 to suf 7 for the suffixes banana$, anana$, nana$, ana$, na$, a$, $] O(n^2) time
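The naïve construction walked through on these slides can be sketched as follows: each suffix is inserted in turn, and an edge is split wherever the new suffix diverges from an existing label. The `Node` class and the helper names are hypothetical, and the sketch assumes the input ends with a single `$` that occurs nowhere else.

```python
class Node:
    def __init__(self, suffix_index=None):
        self.children = {}                # first char -> (edge label, child Node)
        self.suffix_index = suffix_index  # 1-based, set only on leaves

def build_suffix_tree(s):
    """Naive O(n^2) suffix tree construction by inserting every suffix."""
    assert s.endswith("$") and s.count("$") == 1
    root = Node()
    for i in range(len(s)):
        node, rest = root, s[i:]
        while True:
            if rest[0] not in node.children:
                # No edge starts with this character: hang a new leaf here.
                node.children[rest[0]] = (rest, Node(i + 1))
                break
            label, child = node.children[rest[0]]
            # Length of the common prefix of the edge label and the suffix.
            j = 0
            while j < len(label) and j < len(rest) and label[j] == rest[j]:
                j += 1
            if j == len(label):
                # Whole edge matched: continue below the child.
                node, rest = child, rest[j:]
            else:
                # Mismatch inside the edge: split it at position j.
                mid = Node()
                mid.children[label[j]] = (label[j:], child)
                mid.children[rest[j]] = (rest[j:], Node(i + 1))
                node.children[rest[0]] = (label[:j], mid)
                break
    return root

def leaves(node, prefix=""):
    """Yield (path label, suffix index) for every leaf below node."""
    if node.suffix_index is not None:
        yield prefix, node.suffix_index
    for label, child in node.children.values():
        yield from leaves(child, prefix + label)

def internal_labels(node, prefix=""):
    """Yield the path label of every internal node except the root."""
    for label, child in node.children.values():
        if child.children:
            yield prefix + label
            yield from internal_labels(child, prefix + label)

tree = build_suffix_tree("banana$")
```

For banana$ the result has seven leaves and three internal nodes below the root, labeled a, ana and na, matching the figure on the last slide of the walkthrough.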

  16. Constructing Suffix Tree. O(n) time with suffix links. [Figure: suffix tree with suffix links] On-line construction of suffix trees, E. Ukkonen, Algorithmica (1995)

  17. Suffix Tree Query. S = banana$. Search for ban. [Figure: suffix tree of banana$]

  18. Suffix Tree Query. S = banana$. Search for ban: matched b.

  19. Suffix Tree Query. S = banana$. Search for ban: matched ba.

  20. Suffix Tree Query. S = banana$. Search for ban: matched ban. Found 1 occurrence

  21. Suffix Tree Query. S = banana$. Search for band: matched ban, then mismatch. Not found

  22. Suffix Tree Query. S = banana$. Search for an: matched an. Found 2 occurrences
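The three queries above can be reproduced with a small sketch. To keep it short it uses an uncompressed suffix trie (one node per character, O(n^2) space) rather than a true suffix tree; the walk from the root is the same idea. The names `build_suffix_trie` and `count_occurrences` are hypothetical.

```python
def build_suffix_trie(s):
    """Insert every suffix of s; each node counts the suffixes passing through."""
    root = {"count": 0, "next": {}}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node["next"].setdefault(ch, {"count": 0, "next": {}})
            node["count"] += 1
    return root

def count_occurrences(trie, pattern):
    """Walk the pattern from the root; falling off the trie means no match."""
    node = trie
    for ch in pattern:
        node = node["next"].get(ch)
        if node is None:
            return 0
    # Every suffix passing through this node starts with the pattern.
    return node["count"]

trie = build_suffix_trie("banana$")
hits = {p: count_occurrences(trie, p) for p in ("ban", "band", "an")}
# hits == {'ban': 1, 'band': 0, 'an': 2}
```

The count stored at the node where the pattern ends plays the role of the number of leaves below the matching position in the suffix tree.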

  23. Suffix Tree. Many applications in computational biology. Linear time construction algorithms. Linear time solutions to • Genome alignment • Finding longest common substring • All-pairs suffix-prefix matching • Locating all maximal repetitions • And many more…

  24. MEMs. Maximal Exact Match (MEM): exact match within sequence that cannot be extended left or right without introducing mismatch. [Figure: T G C AC G C A A] We are interested in MEMs of length ≥ k

  25. MEMs. Maximal Exact Match (MEM): exact match within sequence that cannot be extended left or right without introducing mismatch. MEMs are internal nodes in the suffix tree that have left-diverse descendants (descendant leaves that represent suffixes with different characters preceding them). Linear-time suffix tree traversal to locate MEMs.

  26. MEMs in Suffix Tree. S = banana$. Possible MEMs: a, ana, na. [Figure: suffix tree of banana$ with the internal nodes a, ana and na marked "MEM?"] MEMs are internal nodes in suffix tree with left-diverse descendants

  27. MEMs in Suffix Tree. S = banana$. MEMs: a, ana. [Figure: node ana has leaves suf 2 (preceded by b) and suf 4 (preceded by n)] MEMs are internal nodes in suffix tree with left-diverse descendants

  28. MEMs in Suffix Tree. S = banana$. MEMs: a, ana. [Figure: node na has leaves suf 3 and suf 5, both preceded by a, so na is crossed out] MEMs are internal nodes in suffix tree with left-diverse descendants
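The banana$ result can be checked with a brute-force sketch. It does not use a suffix tree and is far from linear time; it only tests the two conditions directly: a repeated substring qualifies when its occurrences are preceded by different characters (left-diverse) and followed by different characters (the counterpart of being an internal node). The function name `maximal_repeats` and the `min_len` parameter, standing in for the length ≥ k filter, are assumptions.

```python
from collections import defaultdict

def maximal_repeats(s, min_len=1):
    """Brute-force MEMs within s: repeats that extend neither left nor right."""
    n = len(s)
    starts_of = defaultdict(list)
    for i in range(n):
        for j in range(i + min_len, n + 1):
            starts_of[s[i:j]].append(i)

    mems = set()
    for sub, starts in starts_of.items():
        if len(starts) < 2:
            continue  # not a repeat
        # None marks a string boundary; only one occurrence can touch each end.
        left = {s[i - 1] if i > 0 else None for i in starts}
        right = {s[i + len(sub)] if i + len(sub) < n else None for i in starts}
        if len(left) > 1 and len(right) > 1:
            mems.add(sub)
    return sorted(mems)

mems = maximal_repeats("banana$")
# mems == ['a', 'ana']
```

Here na is rejected because both of its occurrences are preceded by a, and an is rejected because both are followed by a, which leaves a and ana as on the slides.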

  29. Outline 1 Overview 2 Data Structures 3 splitMEM Algorithm 4 Pan-genome Analysis

  30. Compressed de Bruijn graph. Types of nodes: i. repeatNodes ii. uniqueNodes. Input: AGAAGTCC$ATAAGTTA

  31. splitMEM. Nodes in compressed de Bruijn graph classified as i. repeatNodes ii. uniqueNodes. Algorithm: 1 Construct set of repeatNodes 2 Sort start positions of repeatNodes 3 Create edges and uniqueNodes to link non-contiguous repeatNodes

  32. repeatNodes 1 Construct set of repeatNodes 1. Build suffix tree of genome 2. Mark internal nodes that are MEMs, length ≥ k 3. Preprocess suffix tree for LMA queries 4. Compute repeatNodes in compressed de Bruijn graph by decomposing MEMs and extracting overlapping components, length ≥ k

  33. 1 MEM occurs twice. [Figure: TGCAC … GGCAA, MEM: GCA]

  34. Overlapping MEMs. [Figure: TGCCATCGCCAACCAT, shown twice]

  35. Tandem Repeat AGGCTTGGCTTGGCTTGGCTA AGGCTTGGCTTGGCTTGGCTA AGGCTTGGCTTGGCTTGGCTA AGGCTTGGCTTGGCTTGGCTA AGGCTTGGCTTGGCTTGGCTA

  36. repeatNodes 1 Construct set of repeatNodes 1. Build suffix tree of genome 2. Mark internal nodes that are MEMs, length ≥ k 3. Preprocess suffix tree for LMA queries 4. Compute repeatNodes in compressed de Bruijn graph by decomposing MEMs and extracting overlapping components, length ≥ k

  37. Split MEM to repeatNodes. [Figure: a MEM split into repeatNodes; labels x, y, z, u and α, β, γ]

  38. Split MEM to repeatNodes. [Figure: same example; labels x, y, z, u and α, β, γ] Find MEM in suffix tree.
