Genome Assembly Sample Preparation Fragments Sequencing - PowerPoint PPT Presentation

Genome Assembly

Sample Preparation → Fragments → Sequencing → Reads → Assembly → Contigs
Reads: ACGTAGAATACGTAGAA, ACGTAGAATCGACCATG, GGGACGTAGAATACGAC
Contigs: ACGTAGAATACGTAGAAACAGATTAGAGAG…

Paired-End Reads
• Genomic segment: genome fragments
• Get two reads from each segment (~100 bp from each end)

Read Coverage
• Length of genomic segment: L
• Number of reads: n
• Length of each read: l
Coverage C = n l / L
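The coverage formula can be checked with a few lines of Python. This is a minimal sketch: the function names `coverage` and `reads_needed` and the example numbers (a 1 Mbp segment, 100 bp reads) are illustrative assumptions, not taken from the slides.

```python
import math


def coverage(num_reads: int, read_length: int, segment_length: int) -> float:
    """Average coverage C = n * l / L."""
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    return num_reads * read_length / segment_length


def reads_needed(target_coverage: float, read_length: int, segment_length: int) -> int:
    """Smallest n such that n * l / L >= target_coverage."""
    return math.ceil(target_coverage * segment_length / read_length)


# Assumed example: 70,000 reads of 100 bp over a 1,000,000 bp segment.
print(coverage(70_000, 100, 1_000_000))   # 7.0
print(reads_needed(7, 100, 1_000_000))    # 70000
```

Note that C is an average: individual positions can be covered more or less often than C times.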

Fragment Assembly
• Cover region with ~7-fold redundancy
• Overlap reads and extend to reconstruct the original genomic region

Challenges in Fragment Assembly
• Repeats: a major problem for fragment assembly
• More than 50% of the human genome consists of repeats:
  - over 1 million Alu repeats (about 300 bp)
  - about 200,000 LINE repeats (1000 bp and longer)
Figure: green and blue fragments are interchangeable when assembling repetitive DNA.

Triazzle: A Fun Example
The puzzle looks simple, BUT there are repeats! The repeats make it very difficult.
Try it: only $7.99 at www.triazzle.com

Repeat Types
• Low-complexity DNA (e.g. ATATATATACATA…)
• Microsatellite repeats (a_1…a_k)^N where k ~ 3-6 (e.g. CAGCAGTAGCAGCACCAG)
• Transposons/retrotransposons
  – SINE: Short Interspersed Nuclear Elements (e.g., Alu: ~300 bp long, 10^6 copies)
  – LINE: Long Interspersed Nuclear Elements (~500 - 5,000 bp long, 200,000 copies)
• Gene families: genes duplicate & then diverge
• Segmental duplications: very long, very similar copies
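As an illustration of the microsatellite pattern (a_1…a_k)^N, the sketch below uses a regular expression to find perfect tandem repeats with a unit of 3 to 6 bases. The helper name `find_microsatellites` and the requirement of at least three exact copies are assumptions made for this example. Real microsatellites are often imperfect, like the slide's own example, and this simple pattern does not report those.

```python
import re

# A unit of 3-6 bases (group 1) followed by at least two more exact copies.
MICROSATELLITE = re.compile(r"([ACGT]{3,6})\1{2,}")


def find_microsatellites(seq):
    """Return (start, unit, copies) for each perfect tandem repeat found."""
    hits = []
    for match in MICROSATELLITE.finditer(seq.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits


print(find_microsatellites("TTCAGCAGCAGCAGTT"))    # [(2, 'CAG', 4)]
print(find_microsatellites("CAGCAGTAGCAGCACCAG"))  # [] (imperfect repeat)
```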

Fragment Assembly
• Computational challenge: assemble individual short fragments (reads) into a single genomic sequence ("superstring")
• Until the late 1990s, shotgun fragment assembly of the human genome was viewed as an intractable problem

Shortest Superstring Problem
• Problem: Given a set of strings, find a shortest string that contains all of them
• Input: Strings s_1, s_2, …, s_n
• Output: A string s that contains all strings s_1, s_2, …, s_n as substrings, such that the length of s is minimized
• Complexity: NP-hard (the decision version is NP-complete)
• Note: this formulation does not take into account sequencing errors
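Because the exact problem is hard, a common heuristic is to merge the two strings with the largest overlap until one string remains. The sketch below shows that greedy heuristic. It is not part of the slides, it is not guaranteed to return a shortest superstring, and the names `overlap_len` and `greedy_superstring` are chosen here for illustration.

```python
def overlap_len(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings):
    """Greedy heuristic: repeatedly merge the pair with the largest overlap."""
    reads = sorted(set(strings))
    # Strings contained in another string add nothing to the superstring.
    reads = [s for s in reads if not any(s != t and s in t for t in reads)]
    while len(reads) > 1:
        best_k, best_i, best_j = -1, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap_len(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads[0] if reads else ""


print(greedy_superstring(["ACGTAG", "TAGAAT", "AATCGA"]))  # ACGTAGAATCGA
```

Each round compares all ordered pairs, so this is only practical for small inputs. As the slide notes, the formulation also ignores sequencing errors.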

Whole Genome Shotgun Sequencing
• Genome amplified and sliced into smaller fragments (>=600 bp)
• Build consensus sequence from overlap

Overlap-Layout-Consensus
Assemblers: ARACHNE, PHRAP, CAP, TIGR, CELERA
Overlap: find potentially overlapping reads
Layout: merge reads into contigs and contigs into supercontigs
Consensus: derive the DNA sequence (..ACGATTACAATAGGTT..) and correct read errors

Overlap
• Each read is compared with every other read, in both the forward and reverse complement orientations.
• As such, the overlap computation step is very time-intensive, especially if the set of reads is very large.
• For example, the whole genome shotgun assembly of Drosophila had about 3 x 10^6 reads of 500 bases, requiring roughly 10^13 comparisons (Deonier et al., 2010).
• Even on today's computers, running that many comparisons is impractical, so seeded algorithms are used.
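The order of magnitude quoted above follows from counting read pairs. The sketch below is an illustration with hypothetical helper names: it computes a reverse complement and counts all-against-all comparisons, assuming each unordered pair of reads is compared once per orientation.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Reverse the sequence and complement each base."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def all_vs_all_comparisons(num_reads):
    """Unordered read pairs, each compared in two orientations."""
    pairs = num_reads * (num_reads - 1) // 2
    return 2 * pairs


print(reverse_complement("ACGTT"))         # AACGT
print(all_vs_all_comparisons(3_000_000))   # 8999997000000, about 10^13
```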

Overlapping Reads
• Sort all k-mers in reads
• Find pairs of reads sharing a k-mer
• Extend to full alignment; throw away if not >95% similar

TACA TAGATTACACAGATTAC T GA
||   ||||||||||||||||| | ||
TAGT TAGATTACACAGATTAC TAGA
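The seeding idea can be sketched as follows. Instead of sorting all k-mers, this example groups reads by k-mer in a dictionary, which finds the same pairs; that substitution, the function names, and the toy reads are assumptions for illustration. The extension and >95% similarity filter are left out.

```python
from collections import defaultdict
from itertools import combinations


def kmer_index(reads, k):
    """Map each k-mer to the set of read ids that contain it."""
    index = defaultdict(set)
    for read_id, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            index[read[i:i + k]].add(read_id)
    return index


def candidate_pairs(reads, k):
    """Pairs of read ids sharing at least one k-mer (overlap candidates)."""
    pairs = set()
    for read_ids in kmer_index(reads, k).values():
        pairs.update(combinations(sorted(read_ids), 2))
    return pairs


reads = ["TAGATTACACAG", "TTACACAGATTA", "GGGGCCCCGGGG"]
print(candidate_pairs(reads, k=8))  # {(0, 1)}
```

Only the candidate pairs go on to full alignment, which avoids the all-against-all comparison. A fuller version would also index reverse complements.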

Finding Overlapping Reads
Create local multiple alignments from the overlapping reads.

TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA

Finding Overlapping Reads
Correct errors using multiple alignment.
• Find locations where a small fraction (1%) of the data diverges from the rest.
• Make those positions agree with the rest.

TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
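One simple way to make divergent positions agree with the rest is a per-column majority vote over the aligned reads. The sketch below assumes the reads are already aligned to equal length, with "-" marking a gap, and uses a plain majority rather than the 1% threshold mentioned above. The function name is hypothetical.

```python
from collections import Counter


def column_consensus(aligned_reads):
    """Most common character in each column of equal-length aligned reads."""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("aligned reads must all have the same length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_reads)
    )


alignment = [
    "TAGATTACACAGATTACTGA",
    "TAGATTACACAGATTACTGA",
    "TAG-TTACACAGATTATTGA",  # one gap and one C->T difference
    "TAGATTACACAGATTACTGA",
    "TAGATTACACAGATTACTGA",
]
print(column_consensus(alignment))  # TAGATTACACAGATTACTGA
```

Replacing each read with the consensus removes the two deviations in the third read. A real assembler would also weigh base quality scores before overriding a read.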

Build the Overlap Graph
• Overlap graph: the nodes represent actual reads, and edges represent overlaps between these reads.
• Thus, genome assembly becomes equivalent to finding a path through the graph that visits each node exactly once (i.e., a Hamiltonian path).
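For a handful of reads, the overlap graph and its Hamiltonian paths can be enumerated directly. The sketch below is a brute-force illustration with assumed names and an assumed minimum overlap of 3 bases. It tries every ordering of the reads, so it only works for toy inputs.

```python
from itertools import permutations


def suffix_prefix_overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def build_overlap_graph(reads, min_overlap=3):
    """graph[a][b] = overlap length, for overlaps of at least min_overlap."""
    graph = {read: {} for read in reads}
    for a, b in permutations(reads, 2):
        k = suffix_prefix_overlap(a, b)
        if k >= min_overlap:
            graph[a][b] = k
    return graph


def hamiltonian_paths(graph):
    """All orderings of the nodes in which consecutive nodes share an edge."""
    return [
        list(order)
        for order in permutations(graph)
        if all(b in graph[a] for a, b in zip(order, order[1:]))
    ]


def spell(path, graph):
    """Merge the reads along a path into one sequence."""
    sequence = path[0]
    for a, b in zip(path, path[1:]):
        sequence += b[graph[a][b]:]
    return sequence


graph = build_overlap_graph(["ACGTAG", "TAGAAT", "AATCGA"])
for path in hamiltonian_paths(graph):
    print(path, spell(path, graph))
# ['ACGTAG', 'TAGAAT', 'AATCGA'] ACGTAGAATCGA
```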

Figure: An overlap graph. Nodes are complete reads and edges connect reads that overlap. Note that in an actual graph, reads and overlaps would be much larger.

Layout
• Finding a Hamiltonian path through the overlap graph is not a trivial task.
• Thus, we need a way to reduce the complexity of the graph.
• To decrease the size of the graph, the OLC assembly graph is simplified in the layout stage, where segments of the graph are compressed into contigs.

Graph Reduction
• A contig is a subgraph, or group of nodes, that is densely connected because its reads all overlap one another and refer to the same sequence (A and B).
• Once a subgraph has been identified, its nodes and edges are compressed into one node, or contig, thereby simplifying the graph (C).
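The slides do not say how the subgraphs are identified. As one possible illustration, the sketch below swaps in a simpler, widely used reduction: it merges maximal non-branching chains of a directed graph into single nodes. The function name and the input format (a dict mapping each node to its set of successors, with every node present as a key) are assumptions.

```python
def compress_chains(graph):
    """Group nodes into maximal non-branching chains (candidate contigs)."""
    preds = {node: set() for node in graph}
    for node, succs in graph.items():
        for succ in succs:
            preds[succ].add(node)

    def mergeable(a, b):
        # Edge a -> b is unambiguous: a's only exit and b's only entry.
        return len(graph[a]) == 1 and len(preds[b]) == 1

    def walk(start, seen):
        chain = [start]
        seen.add(start)
        while len(graph[chain[-1]]) == 1:
            nxt = next(iter(graph[chain[-1]]))
            if nxt in seen or not mergeable(chain[-1], nxt):
                break
            chain.append(nxt)
            seen.add(nxt)
        return chain

    chains, seen = [], set()
    # First pass: start at nodes that cannot be merged into a predecessor.
    for node in graph:
        only_pred = next(iter(preds[node])) if len(preds[node]) == 1 else None
        if node not in seen and (only_pred is None or not mergeable(only_pred, node)):
            chains.append(walk(node, seen))
    # Second pass: anything left lies on a cycle of mergeable edges.
    for node in graph:
        if node not in seen:
            chains.append(walk(node, seen))
    return chains


graph = {"A": {"B"}, "B": {"C"}, "C": {"D", "E"}, "D": set(), "E": set()}
print(compress_chains(graph))  # [['A', 'B', 'C'], ['D'], ['E']]
```

Here A, B, and C collapse into one node because no choice is needed to traverse them, while the branch at C keeps D and E separate.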

Separating Contigs
• There are two classes of contigs: unique contigs and repeat contigs.
• Unique contigs are composed of reads that can be unambiguously assembled.
• Repeat contigs are contigs with abnormally high read coverage or that are connected to an abnormally large number of other contigs.
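The coverage criterion can be sketched as a simple threshold test. The slides give no numeric cutoff, so the rule used here (coverage above twice the median across contigs) and the names are assumptions. The second criterion, the number of connected contigs, is not modeled.

```python
from statistics import median


def classify_contigs(coverage_by_contig, factor=2.0):
    """Label contigs 'repeat' when coverage exceeds factor * median coverage."""
    typical = median(coverage_by_contig.values())
    return {
        name: "repeat" if cov > factor * typical else "unique"
        for name, cov in coverage_by_contig.items()
    }


coverages = {"c1": 7.2, "c2": 6.8, "c3": 21.0, "c4": 7.5}
print(classify_contigs(coverages))
# {'c1': 'unique', 'c2': 'unique', 'c3': 'repeat', 'c4': 'unique'}
```

A contig built from reads of several repeat copies collapsed together shows roughly that many times the expected coverage, which is why c3 stands out.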

Separating Contigs
Figure: read density along contigs, with panels labeled "Normal density", "Too dense: overcollapsed?", and "Inconsistent links: overcollapsed?".

Creating Scaffolds
• Unique contigs are joined into larger sequences, called scaffolds.
• The most common way to piece contigs into scaffolds is through mate-pair information.
• With mate-pair information, assemblers can identify how far apart reads and unique contigs should be.
  – e.g. if a 2 kb fragment of a genome were sequenced 100 bp on each end, then we know these reads and the unique contigs they are in should be roughly 2 kb apart.
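The distance constraint from a mate pair can be turned into a gap estimate between two contigs. The sketch below assumes contig A precedes contig B in the same orientation, that the first read starts at `read_a_start` in A, and that its mate ends at `read_b_end` in B. All names and numbers are illustrative.

```python
def estimate_gap(fragment_length, contig_a_length, read_a_start, read_b_end):
    """Estimated gap between contig A and contig B implied by one mate pair.

    The fragment spans the tail of A (from read_a_start to A's end),
    the unknown gap, and the head of B (up to read_b_end).
    """
    tail_of_a = contig_a_length - read_a_start
    return fragment_length - tail_of_a - read_b_end


# Assumed example: 2 kb fragment, read starts 800 bp before the end of a
# 5,000 bp contig A, and its mate ends 700 bp into contig B.
print(estimate_gap(2000, 5000, 4200, 700))  # 500
```

Fragment lengths vary from fragment to fragment, so assemblers typically combine many mate pairs linking the same two contigs before trusting a gap estimate. A negative estimate suggests the contigs overlap.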
