De novo assembly of complex genomes using single molecule sequencing


  1. De novo assembly of complex genomes using single molecule sequencing Michael Schatz Jan 14, 2014 PAG XXII @mike_schatz / #PAGXXII

  2. Assembling a Genome
     1. Shear & Sequence DNA
     2. Construct assembly graph from overlapping reads
        …AGCCTAGGGATGCGCGACACGT
        GGATGCGCGACACGTCGCATATCCGGTTTGGTCAACCTCGGACGGAC
        CAACCTCGGACGGACCTCAGCGAA…
     3. Simplify assembly graph
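
Step 2 of this slide can be illustrated with a minimal sketch, assuming exact (error-free) suffix-prefix matches. Real assemblers tolerate sequencing errors and use indexed seeds rather than this brute-force scan. The function names `overlap` and `merge_chain` are hypothetical; the three reads are the ones shown on the slide.

```python
def overlap(a, b, min_len=3):
    """Return the length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when no exact overlap of at least `min_len` bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def merge_chain(reads, min_len=3):
    """Merge reads that overlap consecutively, in the order given."""
    merged = reads[0]
    for prev, nxt in zip(reads, reads[1:]):
        k = overlap(prev, nxt, min_len)
        if k == 0:
            raise ValueError("consecutive reads do not overlap")
        merged += nxt[k:]  # append only the part not already covered
    return merged


reads = [
    "AGCCTAGGGATGCGCGACACGT",
    "GGATGCGCGACACGTCGCATATCCGGTTTGGTCAACCTCGGACGGAC",
    "CAACCTCGGACGGACCTCAGCGAA",
]

# Edges of a (linear) overlap graph: (from_read, to_read, overlap_length).
edges = [(i, i + 1, overlap(reads[i], reads[i + 1])) for i in range(len(reads) - 1)]
contig = merge_chain(reads)
print(edges)
print(contig)
```

Each edge records how many bases two reads share; walking the edges and appending only the non-overlapping tail of each read yields a single contig for this toy case.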

  3. Assembly Complexity [Figure: diagram of segments A, B and C joined through copies of a repeat R]

  4. Assembly Complexity [Figure: segments A, B and C with repeat R, including the ordering A R B R C R shown twice]

  5. Single Molecule Sequencing Technology PacBio RS II Moleculo Oxford Nanopore

  6. PacBio Assembly Algorithms (arranged from < 5x to > 50x PacBio coverage)
     • PBJelly: Gap Filling and Assembly Upgrade. English et al (2012) PLOS One. 7(11): e47768
     • PacBioToCA & ECTools: Hybrid/PB-only Error Correction. Koren, Schatz, et al (2012) Nature Biotechnology. 30:693–700
     • HGAP & Quiver: PB-only Correction & Polishing. Chin et al (2013) Nature Methods. 10:563–569

  7. What should we expect from an assembly? https://en.wikipedia.org/wiki/Genome_size

  8. S. cerevisiae W303
     PacBio RS II sequencing at CSHL by Dick McCombie
     • Size selection using a 7 Kb elution window on a BluePippin™ device from Sage Science
     • Over 175x coverage in 2 days using P5-C3
     • Read length: mean 5,910 bp; max 36,861 bp
     • 83x coverage over 10 kbp; 8.7x over 20 kbp

  9. S. cerevisiae W303
     S288C Reference sequence
     • 12.1 Mbp; 16 chromosomes + mitochondria; N50: 924 kbp
     PacBio assembly using HGAP + Celera Assembler
     • 12.4 Mbp; 21 non-redundant contigs; N50: 811 kbp; >99.8% identity

  10. S. cerevisiae W303
      S288C Reference sequence
      • 12.1 Mbp; 16 chromosomes + mitochondria; N50: 924 kbp
      PacBio assembly using HGAP + Celera Assembler
      • 12.4 Mbp; 21 non-redundant contigs; N50: 811 kbp; >99.8% identity
      • 35 kbp repeat cluster
      Near-perfect assembly: all but 1 chromosome assembled as a single contig
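
The N50 values quoted on these slides follow the standard definition: the contig length at which contigs of that length or longer hold at least half of the assembled bases. A minimal sketch follows, assuming a plain list of contig lengths; the function name `n50` and the example lengths are hypothetical and are not data from the talk. Passing a genome size as `target_total` gives NG50, the variant used on the later rice slide.

```python
def n50(lengths, target_total=None):
    """Return N50 of `lengths`, or NG50 when `target_total` (genome size) is given.

    Returns 0 if the contigs never reach half of the target total.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths) if target_total is None else target_total
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # reached at least half of the total
            return length
    return 0


# Hypothetical contig lengths, for illustration only.
contigs = [6, 5, 4, 3, 2]
print(n50(contigs))                   # N50 relative to the assembly size (20)
print(n50(contigs, target_total=30))  # NG50 relative to a genome size of 30
```

Because N50 is relative to the assembly's own size, an inflated assembly can shift the value; NG50 avoids this by fixing the denominator at the genome size.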

  11. A. thaliana Ler-0
      http://blog.pacificbiosciences.com/2013/08/new-data-release-arabidopsis-assembly.html
      A. thaliana Ler-0 sequenced at PacBio
      • Sequenced using the previous P4 enzyme and C2 chemistry
      • Size selection using an 8 Kb to 50 Kb elution window on a BluePippin™ device from Sage Science
      • Total coverage >119x
      Genome size: 124.6 Mbp; Chromosome N50: 23.0 Mbp; Raw data: 11 Gb
      Sum of Contig Lengths: 149.5 Mb; N50 Contig Length: 8.4 Mb; Number of Contigs: 1788
      High quality assembly of chromosome arms
      Assembly Performance: 8.4 Mbp / 23 Mbp = 36%
      MiSeq assembly: 63 kbp / 23 Mbp [0.2%]
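
The "Assembly Performance" figure on this slide is the contig N50 divided by the chromosome N50. A short sketch of that arithmetic, using the numbers from the slide; the function name `assembly_performance` is hypothetical. The slide's 36% and 0.2% appear to be truncated rather than rounded values.

```python
def assembly_performance(contig_n50_bp, chromosome_n50_bp):
    """Ratio of contig N50 to chromosome N50 (1.0 would mean chromosome-scale contigs)."""
    return contig_n50_bp / chromosome_n50_bp


pacbio = assembly_performance(8.4e6, 23.0e6)  # PacBio contig N50 vs chromosome N50
miseq = assembly_performance(63e3, 23.0e6)    # MiSeq contig N50 vs chromosome N50

print(f"PacBio: {pacbio:.1%}")  # PacBio: 36.5%
print(f"MiSeq:  {miseq:.2%}")   # MiSeq:  0.27%
```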

  12. Hybrid Approaches for Larger Genomes
      PacBioToCA fails in complex regions
      1. Error Dense Regions – Difficult to compute overlaps with many errors
      2. Simple Repeats – Kmer Frequency Too High to Seed Overlaps
      3. Extreme GC – Lacks Illumina Coverage
      [Figure: observed coverage and observed error rate plotted along positions 0 to 4000]
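
Point 2 on this slide (simple repeats produce k-mers too frequent to seed overlaps) can be illustrated with a small k-mer count. This is a sketch only: the sequences, the value of `k` and the `max_count` cutoff are hypothetical, and it is not how PacBioToCA itself selects seeds.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every k-mer (substring of length k) in `seq`."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def usable_seeds(seq, k, max_count):
    """Keep only k-mers seen at most `max_count` times (hypothetical cutoff)."""
    return {kmer for kmer, n in kmer_counts(seq, k).items() if n <= max_count}


simple_repeat = "AT" * 20  # 40 bp dinucleotide repeat
counts = kmer_counts(simple_repeat, 6)

print(counts.most_common())                         # two k-mers, each seen many times
print(usable_seeds(simple_repeat, 6, max_count=4))  # empty set: nothing left to seed with
```

Every 6-mer in the repeat occurs far more often than the cutoff, so a seed filter of this kind leaves no anchors inside the repeat.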

  13. ECTools: Error Correction with pre-assembled reads
      https://github.com/jgurtowski/ectools
      Short Reads -> Assemble Unitigs -> Align & Select -> Error Correct
      Can help us overcome:
      1. Error Dense Regions – Longer sequences have more seeds to match
      2. Simple Repeats – Longer sequences easier to resolve
      However, cannot overcome Illumina coverage gaps & other biases

  14. O. sativa pv Nipponbare
      Genome size: 370 Mb; Chromosome N50: 29.7 Mbp
      19x PacBio C2XL sequencing at CSHL from Summer 2012
      Assembly: Contig NG50 (input data)
      • MiSeq Fragments: 6,332 (23x 459bp; 8x 2x251bp @ 450)
      • “ALLPATHS-recipe”: 18,248 (50x 2x100bp @ 180; 36x 2x50bp @ 2100; 51x 2x50bp @ 4800)
      • PacBioToCA: 50,995 (19x @ 3500 ** MiSeq for correction)
      • ECTools: 155,695 (19x @ 3500 ** MiSeq for correction)

  15. Assembly Complexity of Long Reads
      Lee, H*, Gurtowski, J*, Yoo, S, Marcus, S, McCombie, WR, Schatz MC et al. (2014) Assembly complexity of long read sequencing. In preparation
      [Figure: Assembly N50 / Chromosome N50 (target percentage, 0 to 100%) versus genome size (10^6 to 10^9), with an SVR fit for mean1 (3,650 ± 140bp), “C2” 2012. Species plotted: M.jannaschii (Euryarchaeota), C.hydrogenoformans (Firmicutes), E.coli (Eubacteria), Y.pestis (Proteobacteria), B.anthracis (Firmicutes), A.mirum (Actinobacteria), S.cerevisiae (Yeast), Y.lipolytica (Fungus), D.discoideum (Slime mold), N.crassa (Red bread mold), C.intestinalis (Sea squirt), C.elegans (Roundworm), C.reinhardtii (Green algae), A.thaliana (Arabidopsis), D.melanogaster (Fruitfly), P.persica (Peach), O.sativa (Rice), P.trichocarpa (Poplar), S.lycopersicum (Tomato), G.max (Soybean), M.gallopavo (Turkey), D.rerio (Zebrafish), A.carolinensis (Lizard), Z.mays (Corn), M.musculus (Mouse), H.sapiens (Human)]

  16. Assembly Complexity of Long Reads
      Lee, H*, Gurtowski, J*, Yoo, S, Marcus, S, McCombie, WR, Schatz MC et al. (2014) Assembly complexity of long read sequencing. In preparation
      [Figure: the same plot and species as slide 15, Assembly N50 / Chromosome N50 (target percentage) versus genome size (10^6 to 10^9), now with two SVR fits: mean1 (3,650 ± 140bp), “C2” 2012, and mean2 (7,400 ± 245bp), “C3” 2013]

  17. Assembly Complexity of Long Reads
      Lee, H*, Gurtowski, J*, Yoo, S, Marcus, S, McCombie, WR, Schatz MC et al. (2014) Assembly complexity of long read sequencing. In preparation
      [Figure: the same plot and species as slide 15, Assembly N50 / Chromosome N50 (target percentage) versus genome size (10^6 to 10^9), now with four SVR fits: mean1 (3,650 ± 140bp), “C2” 2012; mean2 (7,400 ± 245bp), “C3” 2013; mean4 (15,000 ± 435bp), “C4” ????; mean8 (30,000 ± 692bp), “C5” ????]

  18. Summary
      • Long read sequencing of eukaryotic genomes is here
      • Recommendations
        < 100 Mbp: HGAP/PacBioToCA @ 100x PB C3-P5; expect near perfect chromosome arms
        < 1 Gbp: HGAP/PacBioToCA @ 100x PB C3-P5; expect high quality assembly: contig N50 over 1 Mbp
        > 1 Gbp: hybrid/gap filling; expect contig N50 to be 100 kbp – 1 Mbp
        > 5 Gbp: Email mschatz@cshl.edu
      • Caveats
        – Model only as good as the available references (esp. haploid sequences)
        – Technologies are quickly improving, exciting new scaffolding technologies
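
The recommendations on this slide amount to a lookup by genome size, sketched below. The function name `recommend` is hypothetical, and the handling of sizes exactly on a boundary (for example exactly 1 Gbp) is an assumption, since the slide does not specify it.

```python
MBP = 1_000_000
GBP = 1_000_000_000


def recommend(genome_size_bp):
    """Map a genome size in bp to the slide's suggested strategy and expectation."""
    if genome_size_bp < 100 * MBP:
        return "HGAP or PacBioToCA @ 100x PB C3-P5: expect near perfect chromosome arms"
    if genome_size_bp < 1 * GBP:
        return "HGAP or PacBioToCA @ 100x PB C3-P5: expect contig N50 over 1 Mbp"
    if genome_size_bp <= 5 * GBP:  # assumption: exactly 1 Gbp and 5 Gbp fall in this tier
        return "hybrid/gap filling: expect contig N50 of 100 kbp to 1 Mbp"
    return "email mschatz@cshl.edu"


# Genome sizes taken from earlier slides in the deck.
for name, size in [("S. cerevisiae", 12_100_000),
                   ("A. thaliana", 124_600_000),
                   ("O. sativa", 370 * MBP)]:
    print(f"{name}: {recommend(size)}")
```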

  19. Acknowledgements
      Schatz Lab: James Gurtowski, Hayan Lee, Shoshana Marcus, Alejandro Wences, Giuseppe Narzisi, Srividya Ramakrishnan, Rob Aboukhalil, Mitch Bekritsky, Charles Underwood, Tyler Gavin, Greg Vurture, Eric Biggers, Aspyn Palatnick
      CSHL: McCombie Lab, Hannon Lab, Gingeras Lab, Jackson Lab, Iossifov Lab, Levy Lab, Lippman Lab, Lyon Lab, Martienssen Lab, Tuveson Lab, Ware Lab, Wigler Lab
      NBACC: Serge Koren, Adam Phillippy

  20. Thank You! http://schatzlab.cshl.edu @mike_schatz / #PAGXXII Variant Calling and RNA-seq @ 4:25 in the KBase Workshop
