
Assembly: Assembling with Repeats - PowerPoint PPT Presentation



  1. Assembly

  2. Assembling with Repeats

  3. Mate Pairs

  4. Whole genome shotgun
     - Input:
       - Shotgun sequence fragments (reads)
       - Mate pairs
     - Output:
       - A single sequence created by consensus of overlapping reads
     - First generation of assemblers did not include mate pairs (Phrap, CAP..)
     - Second generation: CA, Arachne, Euler
     - We will discuss Arachne, a freely available sequence assembler (2nd generation)

  5. Arachne: Details
     - Initial processing
     - Alignment module

  6. Alignment Module
     - Input: Collection of DNA sequences of arbitrary length
     - Output: Pairwise alignments between them

  7. Overlap detection
     - Option 1: Compute an alignment between every pair.
       - $G = 150$ Mb, $L = 500$
       - Coverage $LN/G = 10$
       - $N = 10 \times 150 \times 10^6 / 500 = 3 \times 10^6$
       - Not good! (Only a small fraction are true overlaps)
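
The arithmetic on this slide can be checked directly. The sketch below recomputes the number of reads from the slide's values and also counts the unordered read pairs that Option 1 would have to align; the pair count is not on the slide and is added here only to show why the all-pairs approach is "not good".

```python
# Values taken from the slide.
G = 150_000_000   # genome length in bp (150 Mb)
L = 500           # read length in bp
coverage = 10     # coverage = L * N / G

# Number of reads needed for the stated coverage.
N = coverage * G // L

# Unordered pairs of reads an all-against-all alignment would examine.
pairs = N * (N - 1) // 2

print(f"reads N = {N:,}")
print(f"read pairs = {pairs:,}")
```

With three million reads, the all-pairs option means on the order of 10^12 pairwise alignments, almost all of them between reads that do not overlap.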

  8. K-mer based overlap
     - A 25-bp sequence appears at most once in the genome!
     - Two overlapping sequences should share a 25-mer
     - Two non-overlapping sequences should not!
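
A minimal sketch of the k-mer test described on this slide, assuming reads are plain strings over A, C, G, T. The function names `kmers` and `share_kmer` are illustrative, not taken from Arachne, and reverse complements are ignored here.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def share_kmer(a, b, k=25):
    """True if reads a and b have at least one k-mer in common."""
    return not kmers(a, k).isdisjoint(kmers(b, k))


# Toy example with a small k so the reads stay short.
read_a = "ACGTTGCATGCC"
read_b = "GCATGCCGATAG"   # overlaps the end of read_a
read_c = "TTTTTTAAAAAA"   # unrelated

print(share_kmer(read_a, read_b, k=6))
print(share_kmer(read_a, read_c, k=6))
```

Comparing sets pair by pair is still quadratic in the number of reads; the sorted k-mer list on the next slide avoids that.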

  9. Sorting k-mers
     - Build a list of k-mers that appear in the sequences and their reverse complements
     - Create a record with 4 entries:
       - K-mer
       - Sequence number
       - Position in the sequence
       - Reverse complementation flag
     - Sort a vector of these according to k-mer
     - If the number of records for a k-mer exceeds a threshold, discard them (why?)
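
The record-and-sort procedure above can be sketched as follows. This is an illustration under assumptions, not Arachne's code: the names `kmer_records` and `candidate_pairs` and the default `max_records` threshold are hypothetical, and positions in reverse-complemented records are measured on the reverse-complemented string.

```python
from itertools import combinations, groupby

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def kmer_records(reads, k):
    """Build (k-mer, sequence number, position, is_reverse_complement) records, sorted by k-mer."""
    records = []
    for seq_no, seq in enumerate(reads):
        for is_rc, s in ((False, seq), (True, reverse_complement(seq))):
            for pos in range(len(s) - k + 1):
                records.append((s[pos:pos + k], seq_no, pos, is_rc))
    records.sort()
    return records


def candidate_pairs(reads, k, max_records=10):
    """Pairs of read numbers sharing a k-mer, skipping over-represented k-mers."""
    pairs = set()
    for _, group in groupby(kmer_records(reads, k), key=lambda r: r[0]):
        group = list(group)
        if len(group) > max_records:
            continue  # assumed threshold: very frequent k-mers are dropped
        for r1, r2 in combinations(group, 2):
            if r1[1] != r2[1]:
                pairs.add((min(r1[1], r2[1]), max(r1[1], r2[1])))
    return pairs


reads = ["ACGTTGCATGCC", "GCATGCCGATAG", reverse_complement("GCCGATAGCTAG")]
print(candidate_pairs(reads, k=6))
```

After sorting, records that share a k-mer sit next to each other, so candidate overlaps come from one pass over the list rather than from comparing every pair of reads.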

  10. Phase 2-4 of Alignment module
      - Coalesce k-mer hits into longer, gap-free partial alignments.
      - These extended k-mer hits are saved.
      - For each pair of sequences, form a directed graph.
      - For each maximal path in the graph, construct an alignment.
      - Refine alignment via banded DP
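
The first step, coalescing k-mer hits, can be sketched as below. It assumes a hit is a pair of start positions `(pos_a, pos_b)` of a shared k-mer in two reads, and merges hits on the same diagonal (`pos_a - pos_b`) that overlap or abut. The function name and data layout are illustrative; the graph construction and banded DP phases are not shown.

```python
def coalesce_hits(hits, k):
    """Merge k-mer hits into gap-free segments (start_a, start_b, length)."""
    segments = []
    # Sort by diagonal, then by position along the diagonal.
    for pos_a, pos_b in sorted(set(hits), key=lambda h: (h[0] - h[1], h[0])):
        if segments:
            start_a, start_b, length = segments[-1]
            same_diagonal = (start_a - start_b) == (pos_a - pos_b)
            # Extend only if the hit overlaps or directly abuts the segment.
            if same_diagonal and pos_a <= start_a + length:
                new_length = max(length, pos_a + k - start_a)
                segments[-1] = (start_a, start_b, new_length)
                continue
        segments.append((pos_a, pos_b, k))
    return segments


hits = [(0, 5), (1, 6), (2, 7), (10, 15), (3, 0)]
print(coalesce_hits(hits, k=4))
```

In the example, the first three hits merge into one segment of length 6, while the hit at (10, 15) stays separate because a gap of uncovered bases lies between it and the merged segment.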

  11. Detecting Chimeric reads
      - Chimeric reads: Reads that contain sequence from two genomic locations.
      - Good overlaps: G(a,b) if a,b overlap with a high score
      - Transitive overlap: T(a,c) if G(a,b), and G(b,c)
      - Find a point x across which only transitive overlaps occur. X is a point of chimerism
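
One simplified reading of the last bullet is that no single good overlap spans the point x. The sketch below implements only that reading and is an assumption, not the slide's exact criterion: overlaps are given as half-open intervals on the read, and `margin` is a hypothetical parameter requiring an overlap to extend some distance on both sides of x before it counts as spanning.

```python
def candidate_chimeric_points(read_len, overlaps, margin=10):
    """Positions x on a read that no good overlap spans by at least `margin` bases.

    overlaps: list of (start, end) half-open intervals on the read,
    one per good overlap with another read.
    """
    points = []
    for x in range(margin, read_len - margin + 1):
        spanned = any(s <= x - margin and e >= x + margin for s, e in overlaps)
        if not spanned:
            points.append(x)
    return points


# Overlaps pile up on each half of the read but none crosses the middle.
print(candidate_chimeric_points(100, [(0, 50), (50, 100)]))
# Overlaps that cross the middle leave no candidate point.
print(candidate_chimeric_points(100, [(0, 70), (30, 100)]))
```

A read with no overlaps at all would report every position, so a real check would also require good overlaps on both sides of x.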

  12. Repeats

  13. Contig assembly
      - Reads are merged into contigs up to repeat boundaries.
      - If (a,b) and (a,c) overlap, (b,c) should overlap as well. Also, shift(a,c) = shift(a,b) + shift(b,c)
      - Most of the contigs are unique pieces of the genome, and end at some repeat boundary.
      - Some contigs might be entirely within repeats. These must be detected
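
The shift condition can be written as a small consistency check. The sketch assumes shifts are stored in a dictionary keyed by ordered read pairs; the `tolerance` parameter is hypothetical and allows for small alignment differences.

```python
def shifts_consistent(shift, a, b, c, tolerance=0):
    """Check shift(a,c) == shift(a,b) + shift(b,c), within a tolerance in bases."""
    expected = shift[(a, b)] + shift[(b, c)]
    return abs(shift[(a, c)] - expected) <= tolerance


consistent = {("a", "b"): 100, ("b", "c"): 150, ("a", "c"): 250}
inconsistent = {("a", "b"): 100, ("b", "c"): 150, ("a", "c"): 400}

print(shifts_consistent(consistent, "a", "b", "c"))
print(shifts_consistent(inconsistent, "a", "b", "c"))
```

A triple that fails the check suggests that at least one of the overlaps joins different copies of a repeat, which is where contig extension should stop.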

  14. Detecting Repeat Contigs 1: Read Density
      - Compute the log-odds ratio of two hypotheses:
        - H1: The contig is from a unique region of the genome.
        - H2: The contig is from a region that is repeated at least twice
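
One way to make the log-odds concrete is a Poisson model, which is an assumption of this sketch and not stated on the slide. If reads start uniformly along the genome, a unique contig of length l is expected to contain λ = l·N/G reads, and a contig collapsed from exactly two repeat copies is expected to contain 2λ. For an observed count k, the log-odds then simplifies to ln[P(k | λ) / P(k | 2λ)] = λ − k·ln 2.

```python
import math


def unique_vs_repeat_log_odds(n_reads, contig_len, total_reads, genome_len):
    """Log-odds of 'unique' against 'two-copy repeat' under a Poisson read-count model.

    Positive values favour the unique hypothesis, negative values the repeat.
    """
    lam = contig_len * total_reads / genome_len  # expected reads if unique
    return lam - n_reads * math.log(2)


# Slide values: N = 3 * 10^6 reads, G = 150 Mb; a 1000-bp contig expects 20 reads.
print(unique_vs_repeat_log_odds(20, 1000, 3_000_000, 150_000_000))  # near expected depth
print(unique_vs_repeat_log_odds(40, 1000, 3_000_000, 150_000_000))  # twice expected depth
```

The first call gives a positive score and the second a negative one, so a contig carrying about twice the expected number of reads is flagged as a likely repeat.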

  15. Creating Super Contigs

  16. Supercontig assembly
      - Supercontigs are built incrementally
      - Initially, each contig is a supercontig.
      - In each round, a pair of supercontigs is merged, until no more merges can be performed.
      - Create a Priority Queue with a score for every pair of 'mergeable supercontigs'.
      - Score has two terms:
        - A reward for multiple mate-pair links
        - A penalty for distance between the links.

  17. Supercontig merging
      - Remove the top scoring pair (S1,S2) from the priority queue.
      - Merge (S1,S2) to form contig T.
      - Remove all pairs in Q containing S1 or S2
      - Find all supercontigs W that share mate-pair links with T and insert (T,W) into the priority queue.
      - Detect Repeated Supercontigs and remove them
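
The greedy loop on this slide can be sketched with Python's `heapq`. Several details are assumptions made for illustration: the `score` function (one point per link minus a penalty proportional to the gap), the `min_links` threshold for a "mergeable" pair, and the rule for combining links after a merge (counts are summed, the smaller gap is kept). Pairs containing S1 or S2 are not deleted eagerly; stale entries are skipped when popped. Repeat detection is not shown.

```python
import heapq


def score(count, gap, gap_penalty=0.001):
    # Hypothetical score: reward per mate-pair link, penalty for distance.
    return count - gap_penalty * gap


def merge_supercontigs(links, min_links=2):
    """links: {(a, b): (link_count, gap)} between supercontig names."""
    adj = {}
    for (a, b), info in links.items():
        adj.setdefault(a, {})[b] = info
        adj.setdefault(b, {})[a] = info
    alive = set(adj)
    heap = [(-score(c, g), a, b) for (a, b), (c, g) in links.items() if c >= min_links]
    heapq.heapify(heap)
    merges = []
    while heap:
        _, a, b = heapq.heappop(heap)
        if a not in alive or b not in alive:
            continue  # stale pair: one side was already merged away
        t = f"({a}+{b})"
        merges.append((a, b))
        alive -= {a, b}
        alive.add(t)
        adj[t] = {}
        for old in (a, b):
            for w, (count, gap) in adj.pop(old).items():
                if w in (a, b):
                    continue
                adj[w].pop(old, None)
                prev_count, prev_gap = adj[t].get(w, (0, gap))
                adj[t][w] = (prev_count + count, min(prev_gap, gap))
        # Insert (T, W) for every supercontig W linked to the new T.
        for w, (count, gap) in adj[t].items():
            adj[w][t] = (count, gap)
            if count >= min_links:
                heapq.heappush(heap, (-score(count, gap), t, w))
    return merges, alive


links = {("A", "B"): (5, 100), ("B", "C"): (3, 200), ("C", "D"): (1, 50)}
merges, remaining = merge_supercontigs(links)
print(merges)
print(remaining)
```

In the example, A and B merge first, then the result merges with C; D stays separate because a single link falls below the assumed `min_links` threshold.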

  18. Repeat Supercontigs
      - If the distance between two supercontigs is not correct, they are marked as Repeated
      - If transitivity is not maintained, then there is a Repeat

  19. Filling gaps in Supercontigs

  20. Consensus Derivation
      - Consensus sequence is created by converting pairwise read alignments into multiple-read alignments
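
A deliberately simplified sketch of consensus calling, assuming the multiple-read alignment is already available as ungapped reads placed at known offsets in contig coordinates. Each column is then decided by majority vote. Real assemblers also handle insertions, deletions and base quality scores, which are omitted here; the function name is illustrative.

```python
from collections import Counter


def consensus(placed_reads):
    """Majority-vote consensus from (offset, sequence) pairs.

    Assumes ungapped placements; positions covered by no read are skipped.
    """
    columns = {}
    for offset, seq in placed_reads:
        for i, base in enumerate(seq):
            columns.setdefault(offset + i, Counter())[base] += 1
    return "".join(columns[pos].most_common(1)[0][0] for pos in sorted(columns))


placed = [
    (0, "ACGTAC"),
    (2, "GTTCGG"),   # contains one sequencing error (T instead of A at position 4)
    (4, "ACGGA"),
]
print(consensus(placed))
```

The erroneous base in the second read is outvoted by the two reads that agree at that column.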

  21. Summary
      - Whole genome shotgun is now routine:
        - Human, Mouse, Rat, Dog, Chimpanzee..
        - Many Prokaryotes (One can be sequenced in a day)
        - Plant genomes: Arabidopsis, Rice
        - Model organisms: Worm, Fly, Yeast
      - A lot is not known about genome structure, organization and function.
        - Comparative genomics offers low hanging fruit

  22. The central dogma again
      [Figure: diagram with boxes labeled Assembly, Sequence Analysis, Gene Finding, Protein Sequence Analysis]

  23. Much other analysis is possible
      [Figure: diagram with boxes labeled Assembly, Sequence Analysis, Gene Finding, Protein Sequence Analysis, ncRNA, Genomic Analysis/Pop. Genetics]

  24. A static picture of the cell is insufficient
      - Each cell is continuously active:
        - Genes are being transcribed into RNA
        - RNA is translated into proteins
        - Proteins are post-translationally modified and transported
        - Proteins perform various cellular functions
      - Can we probe the cell dynamically?
