De novo genome assembly
Dr Torsten Seemann
IMB Winter School - Brisbane – Mon 1 July 2013
Introduction
Ideal world
– I would not need to give this talk!
[Figure: Human DNA; non-existent USB3 device; 46 complete haplotype chromosome sequences]
AGTCTAGGATTCGCTA TAGATTCAGGCTCTGA TATATTTCGCGGGATT AGCTAGATCGCTATGC TATGATCTAGATCTCG AGATTCGTATAAGTCT AGGATTCGCTATAGAT TCAGGCTCTGATATAT TTCGCGGGATTAGCTA
– no instrument exists (yet)
– 100 at a time (Sanger)
– 100,000 at a time (Roche 454)
– 1,000,000 at a time (PGM)
– 10,000,000 at a time (Proton, MiSeq)
– 100,000,000 at a time (HiSeq)
– depends on sequencing platform being used
– Shearing: chop DNA into smaller fragments
– Size selection: choose the size range you need
– Adaptor ligation: add special sequence to ends
Platform    Method                        Read Length (bp)   Yield   Quality   Value
Illumina    synthesis + fluorescence      250                ++++    +++++     ++++
SOLiD       ligation + fluorescence       75                 ++++    +++       +++
PGM         non-term NTP + pH wells       300                ++      +++       +++
Proton      non-term NTP + pH wells       400                +++     ++        +++
Roche 454   non-term NTP + luminescence   600                +       +++       ++
PacBio      synthesis + ZMW               12000              ++      +         ++
– Find reads which “fit together” (overlap)
– Could be missing pieces (sequencing bias)
– Some pieces will be dirty (sequencing errors)
I’ll return them tomorrow!
Oops! I dropped them.
ds, Romans, count | ns, countrymen, le | Friends, Rom | send me your ears; | crymen, lend me
I’m good with words.
Friends, Rom | ds, Romans, count | ns, countrymen, le | crymen, lend me | send me your ears;
Friends, Romans, countrymen, lend me your ears;
We have a consensus!
A/Prof. Mihai Pop World leader in de novo assembly research.
He wears glasses so he must be smart
– hmm, sounds like a lot of work…
– a picture of read connections
– sequencing errors will mess it up a lot
– trace a sensible path to produce a consensus
– we have to do ½N(N-1) ~ O(N²) comparisons
– each comparison is an ~ O(L²) alignment
– use special tricks/heuristics to reduce these!
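To get a feel for these numbers, here is a minimal arithmetic sketch. The function names are illustrative, and the cost model (one full L × L alignment table per read pair) is a simplifying assumption, not how any particular assembler counts its work.

```python
def pairwise_comparisons(n_reads: int) -> int:
    """Number of unordered read pairs: N(N-1)/2."""
    return n_reads * (n_reads - 1) // 2


def naive_alignment_cells(n_reads: int, read_length: int) -> int:
    """Rough cost if every pair gets a full L x L alignment table."""
    return pairwise_comparisons(n_reads) * read_length ** 2


# Going from a thousand reads to a million multiplies the work about a million-fold.
for n in (1_000, 1_000_000):
    print(n, pairwise_comparisons(n), naive_alignment_cells(n, 100))
```

This quadratic growth is why the heuristics matter: all-against-all alignment quickly becomes impractical as the read count rises.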
– minimum overlap length, e.g. 20 bp
– minimum % identity across overlap, e.g. 95%
– choice depends on L and expected error rate
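The two thresholds can be sketched as a simple suffix/prefix check. This is an illustration only: `best_overlap` is a hypothetical helper, it compares bases position by position (mismatches allowed, no indels), and real assemblers use proper alignment plus indexing tricks instead of this brute-force scan.

```python
def best_overlap(a: str, b: str, min_len: int = 20, min_identity: float = 0.95):
    """Longest overlap between the end of `a` and the start of `b`.

    Returns (length, identity) for the longest overlap meeting both
    thresholds, or None if no overlap qualifies. Ungapped comparison only.
    """
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        suffix, prefix = a[-length:], b[:length]
        matches = sum(x == y for x, y in zip(suffix, prefix))
        identity = matches / length
        if identity >= min_identity:
            return length, identity
    return None


# One mismatch in a 20 bp overlap is exactly 95% identity, so it just passes.
print(best_overlap("TTTT" + "ACGTACGTACGTACGTACGT", "ACGTACGTACGTACGTACGA"))
```

Raising `min_identity` rejects more true overlaps that contain sequencing errors, while lowering `min_len` accepts more chance matches, which is the trade-off the bullets describe.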
[Table: Read# × Read# matrix for reads 1 to 6; cell positions were lost in extraction. Surviving values: 80, 95, 85, 30, 20, 25, 70, 35, 25, 60, 50]
Thicker lines mean stronger evidence for
Node/Vertex Edge/Arc
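A graph like this, with reads as nodes and overlaps as weighted edges, can be sketched with a plain dictionary. This is a toy under stated assumptions: overlaps must match exactly, all ordered pairs are compared, and the names `exact_overlap` and `overlap_graph` are made up for illustration.

```python
from itertools import permutations


def exact_overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b`, or 0."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def overlap_graph(reads, min_len: int = 3):
    """Return {read index: {next read index: overlap length}}."""
    graph = {i: {} for i in range(len(reads))}
    for i, j in permutations(range(len(reads)), 2):
        length = exact_overlap(reads[i], reads[j], min_len)
        if length:
            graph[i][j] = length  # directed edge i -> j, weight = overlap
    return graph


print(overlap_graph(["ATTCGG", "CGGTAC", "TACAAT"]))
```

The edge weight plays the role of line thickness in the picture: a longer overlap is stronger evidence that two reads are adjacent.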
– introduce false edges and nodes
– heterozygosity causes lots of detours
– if longer than read length – causes nodes to be shared, locality confusion
– collapse small errors (or minor heterozygosity)
– short “dead end” hairs on the graph
– reliable stretches of unique DNA
– at least one per replicon in original sample
– Hamiltonian path/cycle is NP-hard (this is bad)
– solution will be a set of paths which terminate at decision points
– use all the overlap alignments
– each of these collapsed paths is a contig
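Instead of searching for one Hamiltonian path, the graph can be cut into stretches that contain no decisions. The sketch below finds maximal non-branching paths in a directed graph. It is a textbook-style illustration, not the method of any specific assembler, and the input format (a dict of successor lists) is an assumption.

```python
def non_branching_paths(graph):
    """Split a directed graph into maximal non-branching paths.

    graph: dict mapping node -> list of successor nodes.
    Each path starts and ends at a decision point or dead end,
    or is an isolated cycle.
    """
    indeg = {node: 0 for node in graph}
    for successors in graph.values():
        for s in successors:
            indeg[s] = indeg.get(s, 0) + 1
    outdeg = {node: len(graph.get(node, [])) for node in indeg}

    def passes_through(node):
        # Exactly one way in and one way out: no decision to make here.
        return indeg[node] == 1 and outdeg[node] == 1

    paths, seen = [], set()
    for start in indeg:
        if passes_through(start):
            continue
        if indeg[start] == 0 and outdeg[start] == 0:
            paths.append([start])  # unconnected node
        for nxt in graph.get(start, []):
            path = [start, nxt]
            while passes_through(nxt):
                seen.add(nxt)
                nxt = graph[nxt][0]
                path.append(nxt)
            paths.append(path)

    # Anything left sits on an isolated cycle (e.g. a circular replicon).
    for start in indeg:
        if passes_through(start) and start not in seen:
            cycle, nxt = [start], graph[start][0]
            seen.add(start)
            while nxt != start:
                seen.add(nxt)
                cycle.append(nxt)
                nxt = graph[nxt][0]
            paths.append(cycle + [start])
    return paths


# r3 is a fork, so the walk from r1 stops there and two short paths leave it.
print(non_branching_paths({"r1": ["r2"], "r2": ["r3"], "r3": ["r4", "r5"], "r4": [], "r5": []}))
```

Each returned path corresponds to one candidate contig; the fork node appears at the end of one path and the start of the others.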
– Real ends (for linear DNA molecules)
– Dead ends (missing sequence)
– Decision points (forks in the road)
– Transposons (self replicating genes)
– Satellites (repetitive adjacent patterns)
– Gene duplications (paralogs)
The repeated element is collapsed into a single contig
[Figure: mis-assemblies caused by repeats. Panels: collapsed tandem, excision, rearrangement]
– Can’t change this!
– Wait for new technology
– Use “tricks” with existing technology
– atcgtatgatcttgagattctctcttcccttatagctgctata
– atcgtatgatcttgagattctctcttcccttatagctgctata
– Sequence one end of the fragment
– atcgtatgatcttgagattctctcttcccttatagctgctata
– Sequence both ends of same fragment
– we can exploit this information!
– known sequences at either end
– roughly known distance between ends
– unknown sequence between ends
– if our contigs are longer than pair distance
– evidence that these contigs are linked!
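One way to picture this linking evidence is as a count of read pairs whose two ends land on different contigs. This is a minimal sketch: it assumes the pairs have already been mapped to contigs by some aligner, and `contig_links`, `pair_hits` and the `min_links` threshold are hypothetical names and values chosen for illustration.

```python
from collections import Counter


def contig_links(pair_hits, min_links: int = 2):
    """Count read pairs whose two ends map to different contigs.

    pair_hits: iterable of (contig_of_end1, contig_of_end2); None = unmapped.
    Returns {(contig_a, contig_b): number of supporting pairs}, keeping only
    contig pairs with at least `min_links` supporting read pairs.
    """
    counts = Counter()
    for c1, c2 in pair_hits:
        if c1 is None or c2 is None or c1 == c2:
            continue  # unmapped end, or both ends inside one contig
        counts[tuple(sorted((c1, c2)))] += 1
    return {pair: n for pair, n in counts.items() if n >= min_links}


hits = [("c1", "c2"), ("c2", "c1"), ("c1", "c1"), ("c2", "c3"), (None, "c3")]
print(contig_links(hits))
```

Requiring more than one supporting pair is a simple guard against a single chimeric or mis-mapped pair creating a false link. A real scaffolder would also use read orientation and the expected pair distance.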
Gap Gap
– A single organism, e.g. its chromosomal DNA
– Genomic DNA from a mixture of organisms
– A single organism’s RNA, incl. mRNA, ncRNA
– RNA from a mixture of organisms
2:30pm
– Each part of genome represented by roughly equal number of reads
– Genome: 4 Mbp
– Yield: 4 million x 50 bp reads = 200 Mbp
– Coverage: 200 ÷ 4 = 50x (reads per bp)
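The same arithmetic as a small helper, for plugging in other genome sizes and yields. The function name is illustrative, and the calculation gives average coverage only, assuming reads are spread evenly.

```python
def coverage(genome_bp: int, n_reads: int, read_len_bp: int) -> float:
    """Average depth of coverage = total sequenced bases / genome size."""
    return n_reads * read_len_bp / genome_bp


# The worked example: 4 Mbp genome, 4 million reads of 50 bp.
print(coverage(4_000_000, 4_000_000, 50))
```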
– Each genome represented by proportion of reads similar to their proportion in mixture
– Mix of 3 species: ¼ Staph, ¼ Clost, ½ E. coli
– Say we get 4M reads
– Then we expect about: 1M from Staph, 1M from Clost, 2M from E. coli
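The expected split can be written as a one-line calculation. This sketch takes the slide's simplification at face value (reads are drawn in proportion to each species' share of the mixture); the function name and the dictionary layout are assumptions for illustration.

```python
import math


def expected_reads(total_reads: int, proportions: dict) -> dict:
    """Expected read count per organism, given its share of the mixture."""
    if not math.isclose(sum(proportions.values()), 1.0):
        raise ValueError("proportions should sum to 1")
    return {name: total_reads * share for name, share in proportions.items()}


# The worked example: 4M reads from a 1/4, 1/4, 1/2 mixture.
print(expected_reads(4_000_000, {"Staph": 0.25, "Clost": 0.25, "E. coli": 0.5}))
```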
– will have very similar reads
– lots of shared nodes in the graph
– bits of DNA common to lots of organisms
– “hub” nodes in the graph
– need longer reads
– 1,1,3,5,8,12,20
– 1+1+3+5+8+12+20 = 50
– 1+1+3+5+8+12 = 30 (≥ 25) so N50 is 12
– encourages mis-assemblies!
– 1,1,3,5,8,12,20 (previous)
– 1,1,3,5,20,20 (now)
– 1+1+3+5+20+20 = 50 (unchanged)
– 1+1+3+5+20 = 30 (≥ 25) so N50 is 20 (was 12)
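Both worked examples can be checked with a short function. Note one assumption: this sketch uses the common convention of adding contigs from the longest downwards until half the total is reached, while the sums above add from the shortest upwards. The two agree on these examples, but they can differ on other inputs.

```python
def n50(lengths) -> int:
    """Length of the contig at which the running sum (longest first)
    first reaches half of the total assembly size."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("need at least one contig")


print(n50([1, 1, 3, 5, 8, 12, 20]))  # original contigs
print(n50([1, 1, 3, 5, 20, 20]))     # after joining the 8 and the 12
```

Joining the 8 and 12 into one contig of 20 leaves the total unchanged but raises N50. This is why optimising for N50 alone rewards aggressive, possibly wrong, joins.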
– Align reads back to contigs
– Check for errors or discordant pairs
– Use two complementary sequencing methods
– Target troublesome areas for PCR
– Use a genome wide “optical map”
– 100bp paired end
– MRSA_R1.fastq.gz
– MRSA_R2.fastq.gz
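Before assembling, it can be worth a quick sanity check that both files hold the same number of reads. The sketch below is an assumption-laden illustration: it expects plain four-line FASTQ records (no wrapped sequence lines), and `fastq_stats` is a made-up helper name.

```python
import gzip


def fastq_stats(path):
    """Return (number of reads, total bases) for a gzipped FASTQ file.

    Assumes every record is exactly four lines: header, sequence, '+', quality.
    """
    reads = bases = 0
    with gzip.open(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # second line of each record is the sequence
                reads += 1
                bases += len(line.strip())
    return reads, bases
```

Calling `fastq_stats("MRSA_R1.fastq.gz")` and `fastq_stats("MRSA_R2.fastq.gz")` should give the same read count for a properly paired run. The total bases feed directly into a coverage estimate.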
– Velvet, Abyss, Mira, Newbler, SGA, AllPaths, Ray, SOAPdenovo, Spades, Masurca, …
– MetaVelvet, SGA, custom scripts + above
– Trans-Abyss, Oases, Trinity
– custom scripts + above
– Genomics Virtual Laboratory
– http://genome.edu.au
– Microbial de novo assembly for Illumina data
– Written by Simon Gladman (VBC/LSCC)
– https://genome.edu.au/wiki/Protocols
>NODE_1_length_43211_cov_27.36569
AGTCGATGCTTAGAGAGTATGACCTTCTATACAAAA
ATCTTATATTAGCGCTAGTCTGATAGCTCCCTAGAT
CTGATCTGATATGATCTTAGAGTATCGGCTATTGCT
AGTCTCGCGTATAATAAATAATATATTTTTCTAATG
ATCTTATATTAGCGCTAGTCTGATAGCTCCCTAGAT
CTGATCTGATATGATCTTAGAGTATCGGCTATTGCT
AGTCTCGCGTATAATAAATAATATATTTAGTAGTCT
…
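The header line of each contig carries its node number, length and coverage, which can be pulled out with a regular expression. This sketch is based only on the header shown above; the pattern and the function name `parse_node_header` are assumptions, and other assemblers or versions may format headers differently.

```python
import re

# Pattern inferred from the example header: NODE_<id>_length_<n>_cov_<x>
HEADER = re.compile(r"^>?NODE_(\d+)_length_(\d+)_cov_([\d.]+)$")


def parse_node_header(header: str):
    """Return (node id, length, coverage) from a contig header line."""
    match = HEADER.match(header.strip())
    if match is None:
        raise ValueError(f"unrecognised header: {header!r}")
    node_id, length, cov = match.groups()
    return int(node_id), int(length), float(cov)


print(parse_node_header(">NODE_1_length_43211_cov_27.36569"))
```

Collecting the length field from every header gives the list of contig sizes needed for summary statistics such as N50.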
Velvet Assembler Graphical User Environment
– torsten.seemann@monash.edu
– TheGenomeFactory.blogspot.com
– vicbioinformatics.com
– vlsci.org.au/lscc
Torst 5½