De novo assembly of complex genomes using single molecule sequencing - PowerPoint PPT Presentation

De novo assembly of complex genomes using single molecule sequencing Michael Schatz Jan 14, 2014 PAG XXII @mike_schatz / #PAGXXII

Assembling a Genome
1. Shear & Sequence DNA
2. Construct assembly graph from overlapping reads
   …AGCCTAGGGATGCGCGACACGT
   GGATGCGCGACACGTCGCATATCCGGTTTGGTCAACCTCGGACGGAC
   CAACCTCGGACGGACCTCAGCGAA…
3. Simplify assembly graph
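
The three steps on this slide can be sketched with the slide's own example reads. This is a minimal illustration, not the algorithm of any assembler named in the talk: the function names (`longest_overlap`, `overlap_graph`, `merge_chain`) and the `min_len` threshold are hypothetical, the overlaps are exact suffix/prefix matches (real long reads need error-tolerant alignment), and the merge step assumes the graph is a simple linear chain.

```python
def longest_overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that exactly matches a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def overlap_graph(reads, min_len=3):
    """Step 2: directed edges (i, j) -> overlap length between read i and read j."""
    edges = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i == j:
                continue
            n = longest_overlap(a, b, min_len)
            if n:
                edges[(i, j)] = n
    return edges


def merge_chain(reads, edges):
    """Step 3 (simplified): collapse a linear, acyclic chain of overlaps into one contig."""
    targets = {j for _, j in edges}
    # The chain starts at the read that no other read overlaps into.
    start = next(i for i in range(len(reads)) if i not in targets)
    nxt = {i: (j, n) for (i, j), n in edges.items()}
    contig, cur = reads[start], start
    while cur in nxt:
        cur, n = nxt[cur]
        contig += reads[cur][n:]  # append only the part not already covered
    return contig


# Read fragments copied from the slide (step 1 is the sequencing itself).
reads = [
    "AGCCTAGGGATGCGCGACACGT",
    "GGATGCGCGACACGTCGCATATCCGGTTTGGTCAACCTCGGACGGAC",
    "CAACCTCGGACGGACCTCAGCGAA",
]
edges = overlap_graph(reads)
contig = merge_chain(reads, edges)
print(edges)
print(contig)
```

Each consecutive pair of slide reads shares a 15 bp exact overlap, so the graph is a two-edge chain that collapses into a single contig.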

Assembly Complexity
[Figure labels: A R B R R C B A R C]

Assembly Complexity
[Figure labels: A R B R R C; A R B R C R; A R B R C R]

Single Molecule Sequencing Technology
• PacBio RS II
• Moleculo
• Oxford Nanopore

PacBio Assembly Algorithms
• PBJelly: Gap Filling and Assembly Upgrade. English et al (2012) PLOS One. 7(11): e47768
• PacBioToCA & ECTools: Hybrid/PB-only Error Correction. Koren, Schatz, et al (2012) Nature Biotechnology. 30:693–700
• HGAP & Quiver: PB-only Correction & Polishing. Chin et al (2013) Nature Methods. 10:563–569
PacBio Coverage axis: from < 5x to > 50x

What should we expect from an assembly? https://en.wikipedia.org/wiki/Genome_size

S. cerevisiae W303
PacBio RS II sequencing at CSHL by Dick McCombie
• Size selection using a 7 Kb elution window on a BluePippin™ device from Sage Science
• Over 175x coverage in 2 days using P5-C3
Read lengths: Mean: 5,910bp; Max: 36,861bp; 83x over 10kbp; 8.7x over 20kbp

S. cerevisiae W303
S288C Reference sequence
• 12.1Mbp; 16 chromosomes + mitochondria; N50: 924kbp
PacBio assembly using HGAP + Celera Assembler
• 12.4Mbp; 21 non-redundant contigs; N50: 811kbp; >99.8% id
• 35kbp repeat cluster
Near-perfect assembly: All but 1 chromosome assembled as a single contig

A. thaliana Ler-0
http://blog.pacificbiosciences.com/2013/08/new-data-release-arabidopsis-assembly.html
A. thaliana Ler-0 sequenced at PacBio
• Sequenced using the previous P4 enzyme and C2 chemistry
• Size selection using an 8 Kb to 50 Kb elution window on a BluePippin™ device from Sage Science
• Total coverage >119x
Genome size: 124.6 Mbp; Chromosome N50: 23.0 Mbp; Raw data: 11 Gb
Sum of Contig Lengths: 149.5 Mb; N50 Contig Length: 8.4 Mb; Number of Contigs: 1788
High quality assembly of chromosome arms
Assembly Performance: 8.4Mbp/23Mbp = 36%
MiSeq assembly: 63kbp/23Mbp [.2%]
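
The "Assembly Performance" figure on this slide is the contig N50 divided by the chromosome N50. A small sketch of both quantities follows, using the usual definition of N50 (the length of the contig at which the running total of lengths, sorted longest first, reaches half of the sum). The function names and the toy contig lengths are illustrative assumptions; only the 8.4 Mbp, 63 kbp, and 23 Mbp values come from the slide.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50() needs at least one contig length")


def assembly_performance(contig_n50, chromosome_n50):
    """Contig N50 as a fraction of chromosome N50 (1.0 means chromosome-scale contigs)."""
    return contig_n50 / chromosome_n50


# Toy contig lengths (hypothetical): total 300, and 80 + 70 reaches half of it.
toy = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = n50(toy)

# Values from the slide, in bp.
pacbio = assembly_performance(8.4e6, 23.0e6)
miseq = assembly_performance(63e3, 23.0e6)
print(f"toy N50: {toy_n50}")
print(f"PacBio: {pacbio:.1%}  MiSeq: {miseq:.2%}")
```

With the slide's numbers the PacBio ratio comes to roughly 36.5% and the MiSeq ratio to roughly 0.27%, which the slide lists as 36% and .2%.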

Hybrid Approaches for Larger Genomes
PacBioToCA fails in complex regions
1. Error Dense Regions – Difficult to compute overlaps with many errors
2. Simple Repeats – Kmer Frequency Too High to Seed Overlaps
3. Extreme GC – Lacks Illumina Coverage
[Figure: Observed Coverage and Observed Error Rate, x-axis 0 to 4000]

ECTools: Error Correction with pre-assembled reads
https://github.com/jgurtowski/ectools
Short Reads -> Assemble Unitigs -> Align & Select -> Error Correct
Can help us overcome:
1. Error Dense Regions – Longer sequences have more seeds to match
2. Simple Repeats – Longer sequences easier to resolve
However, cannot overcome Illumina coverage gaps & other biases
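
One way to see why "longer sequences have more seeds to match" is a back-of-the-envelope count of exact k-mer seeds. The sketch below is not taken from ECTools. It assumes errors fall independently at a uniform per-base rate, so an aligned stretch of length L contains about (L - k + 1) * (1 - e)^k error-free k-mers. The function name, the seed size k = 14, the 15% error rate, and the two lengths are all hypothetical values chosen for illustration.

```python
def expected_exact_seeds(aligned_len, k, error_rate):
    """Expected number of error-free k-mers in an aligned stretch of a noisy read.

    Simplifying assumption: errors are independent with a uniform per-base rate.
    """
    if aligned_len < k:
        return 0.0
    return (aligned_len - k + 1) * (1.0 - error_rate) ** k


K = 14             # hypothetical seed size
ERROR_RATE = 0.15  # hypothetical raw long-read error rate

short_read = expected_exact_seeds(100, K, ERROR_RATE)   # a 100 bp short read
unitig = expected_exact_seeds(1000, K, ERROR_RATE)      # a 1 kbp pre-assembled unitig
print(f"100 bp short read: {short_read:.1f} expected seeds")
print(f"1 kbp unitig:      {unitig:.1f} expected seeds")
```

Under these assumptions the expected seed count grows roughly in proportion to the aligned length, so a unitig gives the aligner many more chances to anchor in an error-dense stretch than a single short read does.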

O. sativa pv Nipponbare
Genome size: 370 Mb; Chromosome N50: 29.7 Mbp
19x PacBio C2XL sequencing at CSHL from Summer 2012
Assembly: Contig NG50 (libraries)
• MiSeq Fragments: 6,332 (23x 459bp; 8x 2x251bp @ 450)
• “ALLPATHS-recipe”: 18,248 (50x 2x100bp @ 180; 36x 2x50bp @ 2100; 51x 2x50bp @ 4800)
• PacBioToCA: 50,995 (19x @ 3500 ** MiSeq for correction)
• ECTools: 155,695 (19x @ 3500 ** MiSeq for correction)
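
To compare the four rice assemblies at a glance, the contig NG50 values on this slide can be expressed as fold changes over the MiSeq-only baseline. The values are copied from the slide; the variable names are illustrative, and the unit is assumed to be bp since the slide does not state it.

```python
# Contig NG50 per assembly, copied from the slide (unit assumed to be bp).
contig_ng50 = {
    "MiSeq Fragments": 6_332,
    "ALLPATHS-recipe": 18_248,
    "PacBioToCA": 50_995,
    "ECTools": 155_695,
}

baseline = contig_ng50["MiSeq Fragments"]
fold = {name: value / baseline for name, value in contig_ng50.items()}
ectools_vs_pacbiotoca = contig_ng50["ECTools"] / contig_ng50["PacBioToCA"]

for name, value in contig_ng50.items():
    print(f"{name:<16} {value:>8,}  {fold[name]:5.1f}x")
print(f"ECTools vs PacBioToCA: {ectools_vs_pacbiotoca:.2f}x")
```

On these numbers ECTools reaches roughly 25 times the MiSeq-only NG50 and about 3 times the PacBioToCA NG50 from the same 19x of PacBio data.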

Assembly Complexity of Long Reads
Lee, H*, Gurtowski, J*, Yoo, S, Marcus, S, McCombie, WR, Schatz MC et al. (2014) Assembly complexity of long read sequencing. In preparation
[Figure: Assembly N50 / Chromosome N50 (Target Percentage, 0 to 100%) against Genome Size (10^6 to 10^9), with one SVR fit per mean read length]
• mean1 (3,650 ± 140bp): “C2” 2012
• mean2 (7,400 ± 245bp): “C3” 2013
• mean4 (15,000 ± 435bp): “C4” ????
• mean8 (30,000 ± 692bp): “C5” ????
Genomes plotted: M.jannaschii (Euryarchaeota), C.hydrogenoformans (Firmicutes), E.coli (Eubacteria), Y.pestis (Proteobacteria), B.anthracis (Firmicutes), A.mirum (Actinobacteria), S.cerevisiae (Yeast), Y.lipolytica (Fungus), D.discoideum (Slime mold), N.crassa (Red bread mold), C.intestinalis (Sea squirt), C.elegans (Roundworm), C.reinhardtii (Green algae), A.thaliana (Arabidopsis), D.melanogaster (Fruitfly), P.persica (Peach), O.sativa (Rice), P.trichocarpa (Poplar), S.lycopersicum (Tomato), G.max (Soybean), M.gallopavo (Turkey), D.rerio (Zebrafish), A.carolinensis (Lizard), Z.mays (Corn), M.musculus (Mouse), H.sapiens (Human)

Summary
• Long read sequencing of eukaryotic genomes is here
• Recommendations
  < 100 Mbp: HGAP/PacBioToCA @ 100x PB C3-P5; expect near perfect chromosome arms
  < 1 Gbp: HGAP/PacBioToCA @ 100x PB C3-P5; expect high quality assembly: contig N50 over 1 Mbp
  > 1 Gbp: hybrid/gap filling; expect contig N50 to be 100 kbp – 1 Mbp
  > 5 Gbp: Email mschatz@cshl.edu
• Caveats
  – Model only as good as the available references (esp. haploid sequences)
  – Technologies are quickly improving, exciting new scaffolding technologies
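
The recommendation tiers above amount to a lookup on genome size, which can be written down as a short function. This is only a restatement of the slide as of early 2014: the function name and return format are hypothetical, and the handling of sizes that fall exactly on a boundary (100 Mbp, 1 Gbp, 5 Gbp) is an assumption, since the slide does not say which tier they belong to.

```python
MBP = 1_000_000
GBP = 1_000_000_000


def recommend(genome_size_bp):
    """Return (strategy, expectation) for a genome size, following the summary slide.

    Boundary sizes are assigned to the larger-genome tier, except exactly 5 Gbp,
    which stays in the hybrid tier; both choices are assumptions.
    """
    if genome_size_bp < 100 * MBP:
        return ("HGAP/PacBioToCA @ 100x PB C3-P5", "near perfect chromosome arms")
    if genome_size_bp < 1 * GBP:
        return ("HGAP/PacBioToCA @ 100x PB C3-P5",
                "high quality assembly: contig N50 over 1 Mbp")
    if genome_size_bp <= 5 * GBP:
        return ("hybrid/gap filling", "contig N50 of 100 kbp to 1 Mbp")
    return ("email mschatz@cshl.edu", "no expectation given on the slide")


# Genome sizes quoted earlier in the talk.
yeast = recommend(12.1 * MBP)       # S. cerevisiae reference, 12.1 Mbp
arabidopsis = recommend(124.6 * MBP)
rice = recommend(370 * MBP)
for name, rec in [("yeast", yeast), ("arabidopsis", arabidopsis), ("rice", rice)]:
    print(f"{name}: {rec[0]} -> {rec[1]}")
```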

Acknowledgements
Schatz Lab: James Gurtowski, Hayan Lee, Shoshana Marcus, Alejandro Wences, Giuseppe Narzisi, Srividya Ramakrishnan, Rob Aboukhalil, Mitch Bekritsky, Charles Underwood, Tyler Gavin, Greg Vurture, Eric Biggers, Aspyn Palatnick
CSHL: McCombie Lab, Hannon Lab, Gingeras Lab, Jackson Lab, Iossifov Lab, Levy Lab, Lippman Lab, Lyon Lab, Martienssen Lab, Tuveson Lab, Ware Lab, Wigler Lab
NBACC: Serge Koren, Adam Phillippy

Thank You! http://schatzlab.cshl.edu @mike_schatz / #PAGXXII Variant Calling and RNA-seq @ 4:25 in the KBase Workshop
