Assembly
Assembling with Repeats
Mate Pairs
Whole genome shotgun
• Input:
  • Shotgun sequence fragments (reads)
  • Mate pairs
• Output:
  • A single sequence created by consensus of overlapping reads
• First generation of assemblers did not include mate pairs (Phrap, CAP, ...)
• Second generation: CA, Arachne, Euler
• We will discuss Arachne, a freely available sequence assembler (2nd generation)
Arachne: Details
• Initial processing
• Alignment module
Alignment Module
• Input: Collection of DNA sequences of arbitrary length
• Output: Pairwise alignments between them.
Overlap detection
• Option 1: Compute an alignment between every pair.
  • G = 150 Mb, L = 500
  • Coverage LN/G = 10
  • $N = 10 \cdot 150 \cdot 10^6 / 500 = 3 \cdot 10^6$
• Not good! (Only a small fraction are true overlaps)
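To see why aligning every pair is "not good", the arithmetic on this slide can be checked with a few lines of Python. This is only a sketch; the function names `reads_needed` and `all_pairs` are made up for illustration and are not part of Arachne.

```python
def reads_needed(genome_len, read_len, coverage):
    """Number of reads N such that coverage = L * N / G."""
    return coverage * genome_len // read_len


def all_pairs(n):
    """Number of unordered read pairs an all-vs-all aligner would examine."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    n = reads_needed(genome_len=150_000_000, read_len=500, coverage=10)
    print("reads:", n)
    print("pairwise alignments:", all_pairs(n))
```

With the slide's numbers this gives 3 million reads and roughly 4.5 * 10^12 candidate pairs, almost all of which do not overlap.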
K-mer based overlap
• A 25-bp sequence is expected to appear at most once in the genome (repeats aside)!
• Two overlapping sequences should share a 25-mer
• Two non-overlapping sequences should not!
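The shared k-mer test can be sketched as a set intersection. The helper names `kmers` and `may_overlap` are hypothetical, and the sketch ignores reverse complements and sequencing errors, both of which a real assembler has to handle.

```python
def kmers(seq, k=25):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def may_overlap(a, b, k=25):
    """True if reads a and b share at least one k-mer (forward strand only)."""
    return not kmers(a, k).isdisjoint(kmers(b, k))


if __name__ == "__main__":
    # Toy reads with k=5 so the example fits on one line.
    print(may_overlap("ACGTTGCATGCCGAT", "CCGATAGCTTAGGCT", k=5))
    print(may_overlap("ACGTTGCATGCCGAT", "AGGCTAACGT", k=5))
```

The first pair shares the 5-mer CCGAT; the second pair shares none.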
Sorting k-mers
• Build a list of k-mers that appear in the sequences and their reverse complements
• Create a record with 4 entries:
  • K-mer
  • Sequence number
  • Position in the sequence
  • Reverse complementation flag
• Sort a vector of these according to k-mer
• If the number of records for a k-mer exceeds a threshold, discard them (why?)
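The record-and-sort scheme above might look roughly like this in Python. It is a sketch under assumptions: the names `kmer_records`, `candidate_pairs` and `max_records` are invented here, positions on the reverse strand are counted in the reverse-complemented read, and the reading that over-frequent k-mers are dropped because they probably come from repeats is one plausible answer to the slide's "why?", not something the slide states.

```python
from itertools import groupby

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def kmer_records(reads, k):
    """Sorted list of (k-mer, sequence number, position, reverse-complement flag)."""
    records = []
    for seq_no, seq in enumerate(reads):
        for is_rc, s in ((False, seq), (True, reverse_complement(seq))):
            for pos in range(len(s) - k + 1):
                records.append((s[pos:pos + k], seq_no, pos, is_rc))
    records.sort()  # tuples sort by k-mer first, so equal k-mers become adjacent
    return records


def candidate_pairs(reads, k, max_records=50):
    """Pairs of read indices that share a k-mer that is not over-frequent."""
    pairs = set()
    for _, group in groupby(kmer_records(reads, k), key=lambda r: r[0]):
        group = list(group)
        if len(group) > max_records:
            continue  # assumed reason: such a k-mer likely lies in a repeat
        seqs = {r[1] for r in group}
        pairs.update((i, j) for i in seqs for j in seqs if i < j)
    return pairs


if __name__ == "__main__":
    print(candidate_pairs(["AACCAGA", "CAGATTC", "GGGTGTG"], k=4))
```

Sorting makes all occurrences of a k-mer contiguous, so candidate overlaps are found in one pass instead of comparing every pair of reads.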
Phases 2-4 of the Alignment module
• Coalesce k-mer hits into longer, gap-free partial alignments.
• These extended k-mer hits are saved.
• For each pair of sequences, form a directed graph.
• For each maximal path in the graph, construct an alignment.
• Refine the alignment via banded DP
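The first of these phases, coalescing k-mer hits, can be illustrated with a small sketch. It assumes hits are given as (position in a, position in b) pairs and merges hits that lie on the same diagonal and touch or overlap; the function name `coalesce_hits` and the output format are hypothetical, and the graph and banded DP steps are not shown.

```python
from collections import defaultdict


def coalesce_hits(hits, k):
    """Merge k-mer hits into gap-free segments (start_a, start_b, length).

    Two hits belong to the same segment when they share a diagonal
    (pos_a - pos_b) and their k-mers overlap or are adjacent.
    """
    by_diagonal = defaultdict(list)
    for pos_a, pos_b in hits:
        by_diagonal[pos_a - pos_b].append(pos_a)

    segments = []
    for diag, positions in by_diagonal.items():
        positions = sorted(set(positions))
        start = prev = positions[0]
        for p in positions[1:]:
            if p - prev <= k:  # k-mers at prev and p touch or overlap
                prev = p
            else:
                segments.append((start, start - diag, prev - start + k))
                start = prev = p
        segments.append((start, start - diag, prev - start + k))
    return sorted(segments)


if __name__ == "__main__":
    print(coalesce_hits([(0, 5), (1, 6), (2, 7), (10, 3)], k=3))
```

In the toy input, three consecutive hits on one diagonal collapse into a single segment of length 5, while the isolated hit stays a segment of length 3.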
Detecting Chimeric reads
• Chimeric reads: Reads that contain sequence from two genomic locations.
• Good overlaps: G(a,b) if a,b overlap with a high score
• Transitive overlap: T(a,c) if G(a,b) and G(b,c)
• Find a point x across which only transitive overlaps occur. x is a point of chimerism
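A much simplified version of the search for x is sketched below. It assumes each good overlap of the read is summarized as a half-open interval (start, end) on the read, and reports boundaries that no single good overlap spans even though good overlaps exist on both sides. The name `candidate_chimeric_points` and this interval representation are assumptions for illustration; this is not Arachne's actual procedure.

```python
def candidate_chimeric_points(read_len, overlaps):
    """Boundaries x (between bases x-1 and x) not spanned by any good overlap.

    overlaps: list of half-open (start, end) intervals on the read, one per
    good overlap with another read.
    """
    points = []
    for x in range(1, read_len):
        spanned = any(s < x < e for s, e in overlaps)
        left_support = any(e <= x for s, e in overlaps)
        right_support = any(s >= x for s, e in overlaps)
        if not spanned and left_support and right_support:
            points.append(x)
    return points


if __name__ == "__main__":
    # Good overlaps stop at position 5 on both sides: suspicious.
    print(candidate_chimeric_points(10, [(0, 5), (5, 10)]))
    # Good overlaps cross the middle of the read: nothing reported.
    print(candidate_chimeric_points(10, [(0, 6), (4, 10)]))
```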
Repeats
Contig assembly
• Reads are merged into contigs up to repeat boundaries.
• If (a,b) and (a,c) overlap, then (b,c) should overlap as well. Also,
  • shift(a,c)=shift(a,b)+shift(b,c)
• Most of the contigs are unique pieces of the genome, and end at some repeat boundary.
• Some contigs might be entirely within repeats. These must be detected
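The shift rule can be checked mechanically. The sketch below assumes shifts are stored in a dictionary keyed by ordered read pairs and allows a small tolerance for alignment slop; the function name, the dictionary layout and the tolerance parameter are illustrative choices, not taken from the slides.

```python
def shifts_consistent(shift, a, b, c, tolerance=0):
    """Check shift(a,c) == shift(a,b) + shift(b,c), up to a tolerance in bases."""
    expected = shift[(a, b)] + shift[(b, c)]
    return abs(shift[(a, c)] - expected) <= tolerance


if __name__ == "__main__":
    ok = {("a", "b"): 100, ("b", "c"): 150, ("a", "c"): 250}
    bad = {("a", "b"): 100, ("b", "c"): 150, ("a", "c"): 400}
    print(shifts_consistent(ok, "a", "b", "c"))
    print(shifts_consistent(bad, "a", "b", "c"))
```

A triple that fails the check suggests that at least one of the three overlaps is induced by a repeat, so merging should stop there.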
Detecting Repeat Contigs 1: Read Density
• Compute the log-odds ratio of two hypotheses:
• H1: The contig is from a unique region of the genome.
• H2: The contig is from a region that is repeated at least twice
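One way to make the log-odds ratio concrete is to assume that the number of reads in a contig is Poisson distributed, with mean λ = (contig length) * N / G for a unique region and 2λ for a two-copy repeat. The slides do not state the model, so the Poisson assumption and the function name below are illustrative. Under that assumption the ratio simplifies to ln[P(k | λ) / P(k | 2λ)] = λ - k ln 2.

```python
import math


def log_odds_unique(num_reads, contig_len, total_reads, genome_len):
    """ln P(reads | unique) - ln P(reads | two-copy repeat), Poisson model."""
    lam = contig_len * total_reads / genome_len  # expected reads if unique
    return lam - num_reads * math.log(2)


if __name__ == "__main__":
    # Slide's numbers: N = 3e6 reads, G = 150 Mb; a 500 bp contig expects 10 reads.
    print(log_odds_unique(10, 500, 3_000_000, 150_000_000))  # positive: looks unique
    print(log_odds_unique(20, 500, 3_000_000, 150_000_000))  # negative: looks repeated
```

A positive value favors H1 (unique); a contig with about twice the expected read density gets a negative value and is flagged as a likely repeat.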
Creating Super Contigs
Supercontig assembly
• Supercontigs are built incrementally
• Initially, each contig is a supercontig.
• In each round, a pair of supercontigs is merged, until no more merges can be performed.
• Create a Priority Queue with a score for every pair of 'mergeable supercontigs'.
• Score has two terms:
  • A reward for multiple mate-pair links
  • A penalty for distance between the links.
Supercontig merging
• Remove the top scoring pair (S1,S2) from the priority queue.
• Merge (S1,S2) to form supercontig T.
• Remove all pairs in Q containing S1 or S2
• Find all supercontigs W that share mate-pair links with T and insert (T,W) into the priority queue.
• Detect Repeated Supercontigs and remove
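The merging loop can be sketched with Python's `heapq`. Everything specific here is an assumption made for illustration: the scoring formula (link count minus a penalty proportional to the gap), the `min_score` cutoff, the way links of S1 and S2 are combined for T, and the lazy skipping of stale queue entries in place of explicit removal. Order, orientation and repeat detection are ignored.

```python
import heapq


def score(n_links, gap, gap_penalty=0.001):
    """Hypothetical score: reward for links, penalty for distance."""
    return n_links - gap_penalty * gap


def merge_supercontigs(contigs, pair_links, min_score=2.0):
    """pair_links: {(a, b): (n_links, gap)} between contig names."""
    live = {(c,) for c in contigs}  # each contig starts as a supercontig
    links = {frozenset({(a,), (b,)}): v for (a, b), v in pair_links.items()}
    heap = [(-score(*v), tuple(sorted(p))) for p, v in links.items()
            if score(*v) >= min_score]
    heapq.heapify(heap)

    while heap:
        _, (s1, s2) = heapq.heappop(heap)  # top scoring pair
        if s1 not in live or s2 not in live:
            continue  # stale pair containing an already merged supercontig
        t = s1 + s2
        live -= {s1, s2}
        for w in live:
            found = [links[key] for key in (frozenset({s1, w}), frozenset({s2, w}))
                     if key in links]
            if found:
                combined = (sum(n for n, _ in found), min(g for _, g in found))
                links[frozenset({t, w})] = combined
                if score(*combined) >= min_score:
                    heapq.heappush(heap, (-score(*combined), tuple(sorted((t, w)))))
        live.add(t)
    return sorted(live)


if __name__ == "__main__":
    pair_links = {("A", "B"): (5, 1000), ("B", "C"): (3, 500), ("C", "D"): (1, 200)}
    print(merge_supercontigs(["A", "B", "C", "D"], pair_links))
```

In the toy input, A and B merge first, the result then merges with C, and D stays separate because a single mate-pair link scores below the cutoff.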
Repeat Supercontigs
• If the distance between two supercontigs is not correct, they are marked as Repeated
• If transitivity is not maintained, then there is a Repeat
Filling gaps in Supercontigs
Consensus Derivation
• Consensus sequence is created by converting pairwise read alignments into multiple-read alignments
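Once reads are laid out in a multiple alignment, a consensus can be read off column by column. The sketch below assumes the alignment is already given as equal-length rows padded with '-' and uses a plain majority vote; the slides do not say how Arachne calls each column (quality scores would normally matter), so this is only an illustration.

```python
from collections import Counter


def consensus(rows, gap="-"):
    """Majority-vote consensus of equal-length aligned rows."""
    out = []
    for column in zip(*rows):
        base, _ = Counter(column).most_common(1)[0]
        if base != gap:  # a column won by the gap symbol contributes nothing
            out.append(base)
    return "".join(out)


if __name__ == "__main__":
    print(consensus(["ACGT-", "ACGTA", "AGGTA"]))
```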
Summary
• Whole genome shotgun is now routine:
  • Human, Mouse, Rat, Dog, Chimpanzee, ...
  • Many Prokaryotes (One can be sequenced in a day)
  • Plant genomes: Arabidopsis, Rice
  • Model organisms: Worm, Fly, Yeast
• A lot is not known about genome structure, organization and function.
• Comparative genomics offers low hanging fruit
The central dogma again
[Figure labels: Protein Sequence Analysis; Sequence Analysis; Gene Finding; Assembly]
Much other analysis is possible
[Figure labels: Protein Sequence Analysis; Sequence Analysis; Gene Finding; Assembly; ncRNA; Genomic Analysis/Pop. Genetics]