Assembly: Assembling with Repeats

Assembly

Assembling with Repeats

Mate Pairs

Whole genome shotgun
- Input:
  - Shotgun sequence fragments (reads)
  - Mate pairs
- Output:
  - A single sequence created by consensus of overlapping reads
- First generation of assemblers did not include mate-pairs (Phrap, CAP, ...)
- Second generation: CA, Arachne, Euler
- We will discuss Arachne, a freely available sequence assembler (2nd generation)

Arachne: Details
- Initial processing
- Alignment module

Alignment Module
- Input: Collection of DNA sequences of arbitrary length
- Output: Pairwise alignments between them.

Overlap detection
- Option 1: Compute an alignment between every pair.
  - G = 150 Mb, L = 500
  - Coverage LN/G = 10
  - N = 10 * 150 * 10^6 / 500 = 3 * 10^6
  - Not good! (Only a small fraction are true overlaps)
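The arithmetic on this slide can be checked with a few lines of Python. The variable names and the derived pair count `all_pairs` are illustrative additions, not part of the original slide; the pair count only makes concrete why aligning every pair is "not good".

```python
# Numbers from the slide: genome size G, read length L, coverage L*N/G.
G = 150 * 10**6   # genome length in bp
L = 500           # read length in bp
coverage = 10     # target coverage L*N/G

# Solve coverage = L*N/G for the number of reads N.
N = coverage * G // L

# Unordered read pairs that an all-against-all comparison would have to align
# (a derived quantity, not stated on the slide).
all_pairs = N * (N - 1) // 2

print(f"N = {N:,} reads")
print(f"all-vs-all alignments = {all_pairs:,}")
```

With roughly 4.5 * 10^12 candidate pairs, almost none of which truly overlap, a filter is needed before any alignment is computed.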

K-mer based overlap
- A 25-bp sequence appears at most once in the genome!
- Two overlapping sequences should share a 25-mer
- Two non-overlapping sequences should not!

Sorting k-mers
- Build a list of k-mers that appear in the sequences and their reverse complements
- Create a record with 4 entries:
  - K-mer
  - Sequence number
  - Position in the sequence
  - Reverse complementation flag
- Sort a vector of these according to k-mer
- If number of records exceeds threshold, discard (why?)
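The record-and-sort step above can be sketched as follows. This is a minimal illustration, not Arachne's code: the names `KmerRecord`, `kmer_records`, `candidate_pairs` and the parameter `max_occurrences` are hypothetical, and the threshold rule assumes that a k-mer seen too many times presumably comes from a repeat.

```python
from itertools import groupby
from typing import NamedTuple


class KmerRecord(NamedTuple):
    kmer: str      # the k-mer itself
    seq_id: int    # sequence number
    pos: int       # position in the (possibly reverse-complemented) sequence
    is_rc: bool    # reverse complementation flag


_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def kmer_records(reads, k):
    """One record per k-mer of every read and of its reverse complement."""
    records = []
    for seq_id, read in enumerate(reads):
        for is_rc, seq in ((False, read), (True, reverse_complement(read))):
            for pos in range(len(seq) - k + 1):
                records.append(KmerRecord(seq[pos:pos + k], seq_id, pos, is_rc))
    return records


def candidate_pairs(reads, k, max_occurrences):
    """Pairs of reads that share a k-mer, skipping over-represented k-mers."""
    records = sorted(kmer_records(reads, k))  # tuples sort by k-mer first
    pairs = set()
    for _, group in groupby(records, key=lambda r: r.kmer):
        group = list(group)
        if len(group) > max_occurrences:
            continue  # too many records for this k-mer: discard it
        ids = sorted({r.seq_id for r in group})
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                pairs.add((a, b))
    return pairs
```

After sorting, all records for the same k-mer are adjacent, so candidate overlaps are found in one pass over the vector instead of by comparing every pair of reads.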

Phase 2-4 of Alignment module
- Coalesce k-mer hits into longer, gap-free partial alignments.
- These extended k-mer hits are saved.
- For each pair of sequences, form a directed graph.
- For each maximal path in the graph, construct an alignment.
- Refine alignment via banded DP
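The first of these phases, coalescing k-mer hits, might look like the sketch below. It assumes a hit is a pair of start positions `(pos_a, pos_b)` in the two sequences and that hits on the same diagonal (`pos_a - pos_b`) that overlap or abut belong to one gap-free segment. The function name and data layout are illustrative assumptions; the graph and banded-DP phases are not shown.

```python
from collections import defaultdict


def coalesce_hits(hits, k):
    """Merge k-mer hits (pos_a, pos_b) into gap-free segments.

    Hits on the same diagonal (pos_a - pos_b) that overlap or abut are
    joined. Returns a sorted list of (start_a, start_b, length) segments.
    """
    by_diagonal = defaultdict(list)
    for pos_a, pos_b in hits:
        by_diagonal[pos_a - pos_b].append(pos_a)

    segments = []
    for diagonal, starts in by_diagonal.items():
        starts.sort()
        seg_start = seg_end = None
        for s in starts:
            if seg_start is None:
                seg_start, seg_end = s, s + k
            elif s <= seg_end:  # overlaps or touches the current segment
                seg_end = max(seg_end, s + k)
            else:               # gap on this diagonal: close the segment
                segments.append((seg_start, seg_start - diagonal, seg_end - seg_start))
                seg_start, seg_end = s, s + k
        segments.append((seg_start, seg_start - diagonal, seg_end - seg_start))
    return sorted(segments)
```

For example, three consecutive 4-mer hits at `(10, 0)`, `(11, 1)`, `(12, 2)` collapse into a single segment of length 6 starting at `(10, 0)`.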

Detecting Chimeric reads
- Chimeric reads: Reads that contain sequence from two genomic locations.
- Good overlaps: G(a,b) if a,b overlap with a high score
- Transitive overlap: T(a,c) if G(a,b), and G(b,c)
- Find a point x across which only transitive overlaps occur. X is a point of chimerism
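One possible reading of the last bullet is sketched below: treat each good overlap as an interval on the read, and report a point x when good overlaps exist on both sides of x but no single one spans it. This interpretation, the interval representation, and the function name `chimeric_points` are assumptions made for illustration, not the procedure from the slides.

```python
def chimeric_points(overlaps):
    """Return candidate chimerism points on a read.

    `overlaps` holds half-open (start, end) intervals of the read, each
    covered by a single good overlap with some other read. A junction is
    reported as a (lo, hi) range when good overlaps lie on both sides of
    it but none of them spans it.
    """
    points = []
    reach = None  # furthest read position covered so far
    for start, end in sorted(overlaps):
        if reach is not None and start >= reach:
            points.append((reach, start))  # no good overlap crosses this range
        reach = end if reach is None else max(reach, end)
    return points
```

A read whose good overlaps cover `(0, 300)` and `(300, 600)` but never cross position 300 would be flagged at `(300, 300)`, while overlaps `(0, 350)` and `(300, 600)` give no candidate.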

Repeats

Contig assembly
- Reads are merged into contigs up to repeat boundaries.
- If (a,b) and (a,c) overlap, then (b,c) should overlap as well. Also, shift(a,c) = shift(a,b) + shift(b,c)
- Most of the contigs are unique pieces of the genome, and end at some repeat boundary.
- Some contigs might be entirely within repeats. These must be detected.
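The shift condition above can be tested mechanically. The sketch below assumes the overlaps are stored as a dictionary from ordered read pairs `(a, b)` to the offset of `b` relative to `a`; that layout, the function name, and the `tolerance` parameter are illustrative assumptions.

```python
from itertools import permutations


def inconsistent_triples(shift, tolerance=0):
    """Find triples where shift(a,c) != shift(a,b) + shift(b,c).

    `shift` maps ordered read pairs (a, b) to the offset of b relative
    to a. Only triples with all three overlaps present are checked.
    """
    reads = sorted({r for pair in shift for r in pair})
    bad = []
    for a, b, c in permutations(reads, 3):
        if (a, b) in shift and (b, c) in shift and (a, c) in shift:
            if abs(shift[a, c] - (shift[a, b] + shift[b, c])) > tolerance:
                bad.append((a, b, c))
    return bad
```

A triple that fails this check suggests that at least one of the three overlaps is induced by a repeat rather than by true adjacency in the genome.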

Detecting Repeat Contigs 1: Read Density
- Compute the log-odds ratio of two hypotheses:
  - H1: The contig is from a unique region of the genome.
  - H2: The contig is from a region that is repeated at least twice
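The slide does not say which statistical model is used for the read density. As a sketch, assume read starts follow a Poisson process and simplify H2 to "exactly two copies"; both are assumptions introduced here. Under them, a unique contig of length `contig_length` expects `lam = contig_length * total_reads / genome_length` reads, a two-copy repeat expects `2 * lam`, and the log-odds reduce to `lam - k * ln 2` for `k` observed reads.

```python
import math


def unique_vs_repeat_log_odds(contig_length, reads_in_contig, total_reads, genome_length):
    """Natural-log odds of H1 (unique) versus H2 (two copies).

    Assumes Poisson-distributed read counts: mean lam under H1 and
    2 * lam under H2 (a simplification of 'at least twice').
    """
    lam = contig_length * total_reads / genome_length
    # log[Pois(k; lam) / Pois(k; 2*lam)] = lam - k * ln(2)
    return lam - reads_in_contig * math.log(2)
```

With the earlier numbers (3 * 10^6 reads, a 150 Mb genome), a 10 kb contig expects 200 reads; observing about 200 gives a positive score (unique), while about 400 gives a negative score (likely repeat).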

Creating Super Contigs

Supercontig assembly
- Supercontigs are built incrementally
- Initially, each contig is a supercontig.
- In each round, a pair of supercontigs is merged until no more merges can be performed.
- Create a Priority Queue with a score for every pair of 'mergeable supercontigs'.
- Score has two terms:
  - A reward for multiple mate-pair links
  - A penalty for distance between the links.

Supercontig merging
- Remove the top scoring pair (S1,S2) from the priority queue.
- Merge (S1,S2) to form contig T.
- Remove all pairs in Q containing S1 or S2
- Find all supercontigs W that share mate-pair links with T and insert (T,W) into the priority queue.
- Detect Repeated Supercontigs and remove
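The merging loop can be sketched with Python's `heapq`. Everything specific here is an assumption for illustration: links are summarised per pair as `(num_links, distance)`, the caller supplies the `score` function, the links of T to a neighbour W are obtained by adding link counts and taking the smaller distance, and pairs containing S1 or S2 are removed lazily by skipping stale heap entries. Repeat detection is not shown.

```python
import heapq


def merge_supercontigs(links, score):
    """Greedy supercontig merging driven by a priority queue.

    links: dict mapping frozenset({s1, s2}) -> (num_links, distance)
    score: function (num_links, distance) -> float, higher is better
    Returns the merges performed, in order, as (s1, s2, merged_name).
    """
    links = dict(links)
    alive = {s for pair in links for s in pair}
    heap = [(-score(*v), tuple(sorted(p))) for p, v in links.items()]
    heapq.heapify(heap)
    merges = []
    while heap:
        _, (s1, s2) = heapq.heappop(heap)
        if s1 not in alive or s2 not in alive:
            continue  # stale entry: S1 or S2 was already merged away
        t = f"({s1}+{s2})"
        merges.append((s1, s2, t))
        alive -= {s1, s2}
        alive.add(t)
        # Collect every neighbour W of S1 or S2 and re-link it to T.
        combined = {}
        for pair in [p for p in links if s1 in p or s2 in p]:
            n, d = links.pop(pair)
            others = pair - {s1, s2}
            if not others:
                continue  # this was the (S1, S2) pair itself
            (w,) = others
            old_n, old_d = combined.get(w, (0, d))
            combined[w] = (old_n + n, min(old_d, d))
        for w, value in combined.items():
            links[frozenset((t, w))] = value
            heapq.heappush(heap, (-score(*value), tuple(sorted((t, w)))))
    return merges
```

A score such as `lambda n, d: n - d / 1000` rewards pairs with many mate-pair links and penalises distant ones, mirroring the two terms named on the previous slide.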

Repeat Supercontigs
- If the distance between two supercontigs is not correct, they are marked as Repeated
- If transitivity is not maintained, then there is a Repeat

Filling gaps in Supercontigs

Consensus Derivation
- Consensus sequence is created by converting pairwise read alignments into multiple-read alignments
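As a much simplified illustration of deriving a consensus, the sketch below assumes the reads have already been placed at known offsets with gap-free alignments and takes a majority vote per column. Real multiple-read alignments contain insertions and deletions, so this is a toy model under stated assumptions, not the method described on the slide.

```python
from collections import Counter


def consensus(placed_reads):
    """Majority-vote consensus from reads placed at known offsets.

    placed_reads: list of (offset, sequence); alignments are assumed to
    be gap-free. Columns covered by no read are reported as 'N'.
    """
    length = max(offset + len(seq) for offset, seq in placed_reads)
    columns = [Counter() for _ in range(length)]
    for offset, seq in placed_reads:
        for i, base in enumerate(seq):
            columns[offset + i][base] += 1
    return "".join(c.most_common(1)[0][0] if c else "N" for c in columns)
```

A single sequencing error in one read is outvoted wherever at least two other reads cover the same column.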

Summary
- Whole genome shotgun is now routine:
  - Human, Mouse, Rat, Dog, Chimpanzee, ...
  - Many Prokaryotes (One can be sequenced in a day)
  - Plant genomes: Arabidopsis, Rice
  - Model organisms: Worm, Fly, Yeast
- A lot is not known about genome structure, organization and function.
  - Comparative genomics offers low hanging fruit

The central dogma again
[Diagram labels: Assembly, Gene Finding, Sequence Analysis, Protein Sequence Analysis]

Much other analysis is possible
[Diagram labels: Assembly, Genomic Analysis / Pop. Genetics, Gene Finding, ncRNA, Sequence Analysis, Protein Sequence Analysis]

A Static picture of the cell is insufficient
- Each Cell is continuously active:
  - Genes are being transcribed into RNA
  - RNA is translated into proteins
  - Proteins are post-translationally modified and transported
  - Proteins perform various cellular functions
- Can we probe the Cell dynamically?
