splitMEM: graphical pan-genome analysis with suffix skips
Shoshana Marcus
May 7, 2014
Outline
1 Overview
2 Data Structures
3 splitMEM Algorithm
4 Pan-genome Analysis
Objective
Compressed de Bruijn graph
depicts how population variants relate to each other, especially where they diverge at branch points
sequence?
microbial species, near future for higher eukaryotes
genomes of species together
[Figure: Input and Output; sequences labeled A, B, C, D]
de Bruijn graph
Example sequences: AGAAGTCC, ATAAGTTA
Reconstruct original sequence: Eulerian path through graph, visit each edge once
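The Eulerian-path reconstruction can be sketched in a few lines. This is a minimal illustration, not code from splitMEM: it assumes the classic edge-centric convention (nodes are (k-1)-mers, each k-mer is one edge), and the function names are hypothetical. In general a graph can admit several Eulerian paths; for this small example there is only one.

```python
from collections import defaultdict

def de_bruijn_edges(seq, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it; one edge per k-mer."""
    graph = defaultdict(list)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        graph[kmer[:-1]].append(kmer[1:])
    return graph

def eulerian_path(graph):
    """Hierholzer's algorithm: visit every edge exactly once."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next((u for u in graph if len(graph[u]) - in_deg[u] == 1),
                 next(iter(graph)))
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())
        else:
            path.append(stack.pop())
    return path[::-1]

def reconstruct(seq, k):
    """Rebuild a sequence from the Eulerian path of its own de Bruijn graph."""
    path = eulerian_path(de_bruijn_edges(seq, k))
    return path[0] + "".join(node[-1] for node in path[1:])

print(reconstruct("AGAAGTCC", 3))
```

With k = 3 the node AG has two outgoing edges, and the algorithm backtracks out of the dead end at CC before finishing the cycle through GA and AA.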
Compressed de Bruijn graph
Usually built from uncompressed graph
We build directly in O(n log n) time and space
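For contrast with the direct construction, here is a sketch of the usual route: build the uncompressed graph, then merge non-branching chains. It is an illustration only and not the splitMEM method. It assumes nodes are k-mers and that an edge joins two k-mers adjacent in some input sequence; `compressed_nodes` is a hypothetical name.

```python
from collections import defaultdict

def compressed_nodes(seqs, k):
    """Build the k-mer graph, then merge non-branching chains into single nodes."""
    succ, pred = defaultdict(set), defaultdict(set)
    kmers = set()
    for s in seqs:
        for i in range(len(s) - k + 1):
            kmers.add(s[i:i + k])
        for i in range(len(s) - k):
            a, b = s[i:i + k], s[i + 1:i + k + 1]
            succ[a].add(b)
            pred[b].add(a)

    def starts_chain(v):
        # v opens a compressed node unless its single predecessor has v as its only successor.
        if len(pred[v]) != 1:
            return True
        (p,) = pred[v]
        return len(succ[p]) != 1

    labels, seen = [], set()
    for v in sorted(kmers):
        if not starts_chain(v):
            continue
        label, cur = v, v
        seen.add(v)
        while len(succ[cur]) == 1:
            (nxt,) = succ[cur]
            if len(pred[nxt]) != 1 or nxt in seen:
                break
            label += nxt[-1]
            cur = nxt
            seen.add(cur)
        labels.append(label)
    # Note: isolated cycles with no branch point are not handled in this sketch.
    return labels

print(compressed_nodes(["AGAAGTCC", "ATAAGTTA"], 3))
```

On the two example sequences with k = 3 this yields five nodes: AAGT is shared by both sequences, while AGAA, ATAA, GTCC and GTTA each occur in one.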
[Figure: 9 strains of Bacillus anthracis, k=25]
[Figure: 9 strains of Bacillus anthracis, k=1000]
Suffix Tree
tree with leaf for each suffix.
Each internal node, except the root, has at least two children.
ends at leaf.
S = banana$
Constructing Suffix Tree
Naïve Algorithm
S = banana$
Insert the suffixes one at a time:
suf1 banana$
suf2 anana$
suf3 nana$
suf4 ana$
suf5 na$
suf6 a$
suf7 $
O(n²) time
Constructing Suffix Tree
O(n) time
On-line Construction of Suffix Trees, E. Ukkonen, Algorithmica (1995)
Suffix Links: ana → na → a
Suffix Tree Query
S = banana$
Search for ban: Found 1
Search for band: Not found
Search for an: Found 2
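The naïve construction and the three queries can be reproduced with a short sketch. It inserts one suffix at a time and splits an edge at the first mismatch, so it runs in O(n²) time and is not Ukkonen's linear-time algorithm. The class and function names are hypothetical, and the input is assumed to end in a unique `$`.

```python
class Node:
    def __init__(self):
        self.children = {}        # first character -> (edge label, child Node)
        self.suffix_index = None  # 1-based suffix number, set on leaves only

def build_suffix_tree(s):
    """Naive O(n^2) construction; s must end with a unique terminator."""
    root = Node()
    for i in range(len(s)):
        node, rest = root, s[i:]
        while True:
            first = rest[0]
            if first not in node.children:
                leaf = Node()
                leaf.suffix_index = i + 1
                node.children[first] = (rest, leaf)
                break
            label, child = node.children[first]
            j = 0
            while j < len(label) and j < len(rest) and label[j] == rest[j]:
                j += 1
            if j < len(label):
                # Mismatch inside the edge: split it with a new internal node.
                mid = Node()
                mid.children[label[j]] = (label[j:], child)
                node.children[first] = (label[:j], mid)
                child = mid
            node, rest = child, rest[j:]

    return root

def count_leaves(node):
    stack, total = [node], 0
    while stack:
        cur = stack.pop()
        if not cur.children:
            total += 1
        stack.extend(child for _, child in cur.children.values())
    return total

def count_occurrences(root, pattern):
    """Walk down from the root; the leaves below the match are the occurrences."""
    node, rest = root, pattern
    while rest:
        entry = node.children.get(rest[0])
        if entry is None:
            return 0
        label, child = entry
        m = min(len(label), len(rest))
        if label[:m] != rest[:m]:
            return 0
        node, rest = child, rest[m:]
    return count_leaves(node)

tree = build_suffix_tree("banana$")
for query in ("ban", "band", "an"):
    print(query, count_occurrences(tree, query))
```

A query costs time proportional to the pattern length plus the size of the subtree that is counted.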
Suffix Tree
Many applications in computational biology
Linear time construction algorithms
Linear time solutions to
MEMs
Maximal Exact Match (MEM): exact match within sequence that cannot be extended left or right without introducing mismatch.
[Figure: example sequence T G C AC G C A A]
We are interested in MEMs of length ≥ k.
MEMs are internal nodes in the suffix tree that have left-diverse descendants (descendant leaves that represent suffixes with different characters preceding them).
Linear-time suffix tree traversal to locate MEMs.
MEMs in Suffix Tree
S = banana$
MEMs are internal nodes in suffix tree with left-diverse descendants
Possible MEMs (internal nodes): a, ana, na
na is not a MEM: both of its occurrences are preceded by a
MEMs: a, ana
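The left-diversity test can be checked by brute force on banana$. The sketch below does not traverse a suffix tree and is far slower than the linear-time traversal. It relies on the fact that internal suffix tree nodes are the repeated substrings followed by at least two different characters. The function name and the `min_len` parameter (standing in for the length ≥ k filter) are hypothetical.

```python
from collections import defaultdict

def find_mems(s, min_len=1):
    """Return (internal_nodes, mems) for a string s ending in a unique '$'."""
    n = len(s)
    starts = defaultdict(list)
    for i in range(n):
        for j in range(i + 1, n + 1):
            starts[s[i:j]].append(i)

    internal, mems = set(), set()
    for w, occ in starts.items():
        if len(occ) < 2:
            continue
        # Internal node: the repeat is followed by at least two different characters.
        right = {s[i + len(w)] if i + len(w) < n else "" for i in occ}
        if len(right) < 2:
            continue
        internal.add(w)
        # Left-diverse: the occurrences are preceded by at least two different characters.
        left = {s[i - 1] if i > 0 else "" for i in occ}
        if len(left) >= 2 and len(w) >= min_len:
            mems.add(w)
    return internal, mems

internal, mems = find_mems("banana$")
print(sorted(internal), sorted(mems))
```

For banana$ the internal nodes are a, ana and na; na drops out because both occurrences are preceded by a.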
Compressed de Bruijn graph
Input: AGAAGTCC$ATAAGTTA
Types of nodes:
i. repeatNodes
splitMEM
Nodes in compressed de Bruijn graph classified as
i. repeatNodes
Algorithm:
1 Construct set of repeatNodes
2 Sort start positions of repeatNodes
3 Create edges and uniqueNodes to link non-contiguous repeatNodes
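Steps 2 and 3 might look roughly like the sketch below for a single sequence. It is a reading of the slide, not the splitMEM implementation. It assumes the repeatNode occurrences are already known as (start, length) pairs and that consecutive nodes overlap by k-1 characters; `thread_sequence` is a hypothetical name.

```python
def thread_sequence(seq, repeat_occurrences, k):
    """Order the nodes along seq, filling gaps between repeatNodes with uniqueNodes."""
    path = []
    cursor = 0  # start position of the next node to emit
    for start, length in sorted(repeat_occurrences):   # step 2: sort by start position
        if start > cursor:
            # Step 3: a uniqueNode bridges the gap, overlapping the repeatNode by k-1.
            path.append(seq[cursor:start + k - 1])
        path.append(seq[start:start + length])
        cursor = start + length - (k - 1)
    if cursor + k <= len(seq):
        path.append(seq[cursor:])                       # trailing uniqueNode
    edges = list(zip(path, path[1:]))                   # consecutive nodes are linked
    return path, edges

# AAGT is the repeatNode shared by both example sequences (k = 3).
print(thread_sequence("AGAAGTCC", [(2, 4)], 3))
print(thread_sequence("ATAAGTTA", [(2, 4)], 3))
```

Threading both example sequences gives AGAA → AAGT → GTCC and ATAA → AAGT → GTTA, which meet at the shared repeatNode.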
1 Construct set of repeatNodes
graph by decomposing MEMs and extracting
repeatNodes
1 MEM occurs twice
GCA
T G C AC … G G C A A
Overlapping MEMs
T G C C AT C G C C A AC C AT
Tandem Repeat
AGGCTTGGCTTGGCTTGGCTA
Split MEM to repeatNodes
[Figure: suffix tree with MEM nodes; labels α, αβ, αγ, x y z α, u α, x y z α β]
Find MEM in suffix tree.
Traverse suffix link. Look for MEM as ancestor.
Found MEM as ancestor. Decompose.
Remove embedded MEM (suffix links). Find next embedded MEM.
Suffix Skips
Reduce O(n²) time to O(n log n) time
Suffix link: quickly navigate to distant part of tree
Suffix skip:
Genome: babab
Additional Preprocessing: pointer jumping to rapidly add additional links
Suffix skips 0 (dist = 1; suffix links)
Suffix skips 1 (dist = 2)
Suffix skips 2 (dist = 4)
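Pointer jumping can be sketched on a plain array of links. The code is a generic illustration under assumed names, not the splitMEM data structure: `suffix_link[v]` is the node reached by one suffix link from node v, and level j of the table stores the node reached after 2^j links.

```python
def build_suffix_skips(suffix_link, levels):
    """skips[j][v] = node reached from v after following 2**j suffix links."""
    skips = [list(suffix_link)]
    for _ in range(1, levels):
        prev = skips[-1]
        # Pointer jumping: two hops of distance 2**(j-1) make one hop of distance 2**j.
        skips.append([prev[prev[v]] for v in range(len(prev))])
    return skips

def follow(skips, v, count):
    """Follow `count` suffix links from v using one table lookup per set bit."""
    if count >= 1 << len(skips):
        raise ValueError("count exceeds the range covered by the skip table")
    level = 0
    while count:
        if count & 1:
            v = skips[level][v]
        count >>= 1
        level += 1
    return v

# Toy chain: node v links to v-1, and node 0 (the root) links to itself.
links = [0, 0, 1, 2, 3, 4]
skips = build_suffix_skips(links, levels=3)   # distances 1, 2 and 4
print(follow(skips, 5, 3))
```

Each level costs O(n) to build, so log n levels take O(n log n) time and space in total.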
splitMEM
construct several compressed de Bruijn graphs without rebuilding suffix tree
Pan-genome analysis
Examine graph properties:
core genome
Other properties that can be studied:
genome-specific genes.
Graphs of main chromosomes
Species   K      Nodes     Edges     Edges/Nodes
          25     103926    138468    1.33
          100    41343     54954     1.32
                 6627      8659      1.30
          25     494783    662081    1.33
          100    230996    308256    1.33
          1000   11900     15695     1.31
Pan-genome analysis
Examine graph properties
to core genome
[Figure: Histogram of Node Lengths]
Spike at 2k: SNPs
[Figure: Fraction of nodes with each level of genome sharing]
Pan-genome analysis
Graph encodes sequence context of segments.
Core genome: subsequences that occur in at least 70% of underlying genomes.
Nodes can be further in terms of hops while closer by base pairs.
Branch and Bound Search
Pan-genome analysis
Node distances to core genome
Searched 1000-hop radius
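The 70% rule and the genome-sharing fractions reduce to simple counting once each node is annotated with the genomes that contain it. The sketch below assumes such an annotation is available as a dictionary; the node and genome names are made up for illustration.

```python
from collections import Counter

def core_nodes(node_genomes, n_genomes, threshold=0.7):
    """Nodes present in at least `threshold` of the genomes."""
    return {node for node, genomes in node_genomes.items()
            if len(genomes) / n_genomes >= threshold}

def sharing_fractions(node_genomes):
    """Fraction of nodes at each level of sharing (number of genomes containing the node)."""
    counts = Counter(len(genomes) for genomes in node_genomes.values())
    total = len(node_genomes)
    return {level: count / total for level, count in counts.items()}

# Hypothetical annotation for a 9-genome graph.
genomes = [f"g{i}" for i in range(9)]
node_genomes = {
    "n1": set(genomes),        # in all 9 genomes
    "n2": set(genomes[:7]),    # 7 of 9, about 78%
    "n3": set(genomes[:6]),    # 6 of 9, about 67%
    "n4": {"g0"},              # genome specific
}
print(core_nodes(node_genomes, 9))
print(sharing_fractions(node_genomes))
```

With nine genomes, a node needs to occur in at least seven of them to pass the 70% threshold.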
Summary
compressed de Bruijn graph.
graph for single or pan-genome.
SplitMEM: Graphical pan-genome analysis with suffix skips. Marcus, S, Lee, H, Schatz, MC (2014) BioRxiv http://biorxiv.org/content/early/2014/04/06/003954
Future work
Improve splitMEM software:
graph
Biological applications:
specific segments
larger genomes
Acknowledgments
Michael Schatz
Hayan Lee
Giuseppe Narzisi
James Gurtowski
Schatz Lab
IT department
Todd Heywood
Pan-genome analysis
Branch and bound search (like Dijkstra's shortest path algorithm) to compute bp distance from each non-core node to core genome:
Traverse all distinct paths from source until
by a shorter path
Bounded search
maximum search distance along other paths OR
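A Dijkstra-style search with a hop bound is sketched below. Parts of the slide text are missing, so several choices are assumptions rather than the authors' method: the bp distance is taken as the total length of the non-core nodes stepped through after the source, the bound is a maximum hop count, and a path is pruned when its node was already reached by a path that is no longer in bp and no longer in hops. All names are hypothetical.

```python
import heapq

def bp_distance_to_core(graph, length, core, source, max_hops):
    """Smallest bp distance from source to any core node within max_hops, else None."""
    heap = [(0, 0, source)]   # (bp distance, hops, node)
    fewest_hops = {}          # fewest hops among labels already expanded per node
    while heap:
        dist, hops, node = heapq.heappop(heap)
        if node in core:
            return dist
        # Labels pop in order of increasing bp distance, so an earlier label with
        # no more hops dominates this one: prune it.
        if hops >= fewest_hops.get(node, float("inf")):
            continue
        fewest_hops[node] = hops
        if hops == max_hops:
            continue          # bound: do not search beyond the hop radius
        for nxt in graph.get(node, ()):
            step = 0 if nxt in core else length[nxt]
            heapq.heappush(heap, (dist + step, hops + 1, nxt))
    return None

# Toy graph: the core is 2 hops away through a long node, or 3 hops through short ones.
graph = {"s": ["a", "b"], "a": ["core1"], "b": ["c"], "c": ["core2"]}
length = {"s": 50, "a": 1000, "b": 10, "c": 20}
core = {"core1", "core2"}
print(bp_distance_to_core(graph, length, core, "s", max_hops=3))
print(bp_distance_to_core(graph, length, core, "s", max_hops=2))
```

The toy graph shows the point that nodes can be further in hops while closer in base pairs: the 3-hop route costs 30 bp and the 2-hop route costs 1000 bp.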