De novo genome assembly, Dr Torsten Seemann, IMB Winter School



slide-1
SLIDE 1

De novo genome assembly

Dr Torsten Seemann

IMB Winter School - Brisbane – Mon 1 July 2013

slide-2
SLIDE 2

Introduction

slide-3
SLIDE 3

Ideal world

I would not need to give this talk!

AGTCTAGGATTCGCTA TAGATTCAGGCTCTGA TATATTTCGCGGGATT AGCTAGATCGCTATGC TATGATCTAGATCTCG AGATTCGTATAAGTCT AGGATTCGCTATAGAT TCAGGCTCTGATATAT TTCGCGGGATTAGCTA

Human DNA → Non-existent USB3 device → 46 complete haplotype chromosome sequences

slide-4
SLIDE 4

Real world

  • Can’t sequence full-length native DNA

– no instrument exists (yet)

  • But we can sequence short fragments

– 100 at a time (Sanger)
– 100,000 at a time (Roche 454)
– 1,000,000 at a time (PGM)
– 10,000,000 at a time (Proton, MiSeq)
– 100,000,000 at a time (HiSeq)

slide-5
SLIDE 5

Make a DNA library

  • DNA preparation

– depends on sequencing platform being used

  • Typical steps

– Shearing: chop DNA into smaller fragments
– Size selection: choose the size range you need
– Adaptor ligation: add special sequence to ends

  • Now ready to sequence!
slide-6
SLIDE 6

Instruments

Platform   Method                       Read length (bp)  Yield  Quality  Value
Illumina   synthesis + fluorescence     250               ++++   +++++    ++++
SOLiD      ligation + fluorescence      75                ++++   +++      +++
PGM        non-term NTP + pH wells      300               ++     +++      +++
Proton     non-term NTP + pH wells      400               +++    ++       +++
Roche 454  non-term NTP + luminescence  600               +      +++      ++
PacBio     synthesis + ZMW              12000             ++     +        ++

slide-7
SLIDE 7

Which sequencing platform?

  • Long reads
  • Low cost
  • High yield
  • High quality

Pick any 3

slide-8
SLIDE 8

De novo assembly

The process of reconstructing the original DNA sequence from the fragment reads alone.

  • Instinctively like a jigsaw puzzle

– Find reads which “fit together” (overlap)
– Could be missing pieces (sequencing bias)
– Some pieces will be dirty (sequencing errors)

slide-9
SLIDE 9

An example

slide-10
SLIDE 10

A small “genome”

Friends, Romans, countrymen, lend me your ears;

I’ll return them tomorrow!

slide-11
SLIDE 11

Shakespearomics

  • Reads

ds, Romans, count
ns, countrymen, le
Friends, Rom
send me your ears;
crymen, lend me

Oops! I dropped them.

slide-12
SLIDE 12

Shakespearomics

  • Reads

ds, Romans, count
ns, countrymen, le
Friends, Rom
send me your ears;
crymen, lend me

  • Overlaps

Friends, Rom
     ds, Romans, count
             ns, countrymen, le
                     crymen, lend me
                             send me your ears;

I’m good with words.

slide-13
SLIDE 13

Shakespearomics

  • Reads

ds, Romans, count
ns, countrymen, le
Friends, Rom
send me your ears;
crymen, lend me

  • Overlaps

Friends, Rom
     ds, Romans, count
             ns, countrymen, le
                     crymen, lend me
                             send me your ears;

  • Majority consensus

Friends, Romans, countrymen, lend me your ears;

We have a consensus!
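
The majority vote on this slide can be sketched in a few lines of Python. This is only an illustration: it assumes the read offsets are already known from the overlap step (the numbers below are read off the layout above), and the name `majority_consensus` is made up for the example.

```python
from collections import Counter

# (offset, read) pairs: where each read sits in the layout.
# Two reads carry "sequencing errors" (crymen, send).
layout = [
    (0, "Friends, Rom"),
    (5, "ds, Romans, count"),
    (13, "ns, countrymen, le"),
    (21, "crymen, lend me"),
    (29, "send me your ears;"),
]

def majority_consensus(layout):
    """Vote column by column; the most common character wins.

    Assumes every column is covered by at least one read.
    """
    length = max(offset + len(read) for offset, read in layout)
    columns = [Counter() for _ in range(length)]
    for offset, read in layout:
        for i, char in enumerate(read):
            columns[offset + i][char] += 1
    return "".join(column.most_common(1)[0][0] for column in columns)

consensus = majority_consensus(layout)
print(consensus)
```

The two erroneous characters are each outvoted by two correct reads covering the same column, which is why depth of coverage matters.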

slide-14
SLIDE 14

So far, so good.

slide-15
SLIDE 15

The awful truth

“Genome assembly is impossible.”

A/Prof. Mihai Pop

World leader in de novo assembly research.

He wears glasses so he must be smart

slide-16
SLIDE 16

Methods

slide-17
SLIDE 17

Approaches

  • greedy assembly
  • overlap :: layout :: consensus
  • de Bruijn graphs
  • string graphs
  • seed and extend

… all essentially doing the same thing, but taking different short cuts.
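
As one concrete example of these approaches, here is a minimal sketch of a de Bruijn graph in Python. The function name `de_bruijn`, the choice of k=4 and the three toy reads (cut from the sequence on the "Ideal world" slide) are assumptions for illustration, not how any particular assembler does it.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph

# Toy reads, all substrings of AGTCTAGGATTCGC
reads = ["AGTCTAGG", "CTAGGATT", "GGATTCGC"]
graph = de_bruijn(reads, k=4)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Here the reads are never compared with each other directly: overlaps fall out of shared k-mers, which is the short cut this approach takes.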

slide-18
SLIDE 18

Assembly recipe

  • Find all overlaps between reads

– hmm, sounds like a lot of work…

  • Build a graph

– a picture of read connections

  • Simplify the graph

– sequencing errors will mess it up a lot

  • Traverse the graph

– trace a sensible path to produce a consensus

slide-19
SLIDE 19
slide-20
SLIDE 20

Find read overlaps

  • If we have N reads of length L

– we have to do ½N(N-1) ~ O(N²) comparisons
– each comparison is an ~ O(L²) alignment
– use special tricks/heuristics to reduce these!

  • What counts as “overlapping” ?

– minimum overlap length e.g. 20 bp
– minimum %identity across overlap e.g. 95%
– choice depends on L and expected error rate
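
The pair count and the two overlap thresholds can be sketched as follows. This is a simplified illustration: it checks ungapped suffix/prefix matches only, where a real assembler would use a gapped alignment plus heuristics, and the function names are made up for the example. The defaults mirror the 20 bp / 95% figures above.

```python
def n_comparisons(n_reads):
    """Number of unordered read pairs: N(N-1)/2."""
    return n_reads * (n_reads - 1) // 2

def suffix_prefix_overlap(a, b, min_len=20, min_identity=0.95):
    """Longest ungapped overlap between the end of a and the start of b.

    Returns (overlap_length, identity), or None if no overlap passes
    both thresholds.
    """
    for k in range(min(len(a), len(b)), max(min_len, 1) - 1, -1):
        matches = sum(x == y for x, y in zip(a[-k:], b[:k]))
        if matches / k >= min_identity:
            return k, matches / k
    return None

print(n_comparisons(6))
# Toy reads are short, so the thresholds are relaxed here.
print(suffix_prefix_overlap("Friends, Rom", "ds, Romans, count", min_len=5))
print(suffix_prefix_overlap("ns, countrymen, le", "crymen, lend me",
                            min_len=5, min_identity=0.8))
```

The last call only finds its overlap because the identity threshold tolerates the one erroneous character, which is the trade-off the slide refers to.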

slide-21
SLIDE 21

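N=6 → 15 alignment scores

Pairwise alignment scores between reads 1 to 6 (lower triangle; values as recovered row by row, some cells missing):

Read 2: 80
Read 3: 95 85
Read 4: 30 20
Read 5: 25 70
Read 6: 35 25 60 50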

slide-22
SLIDE 22

Graph construction

Thicker lines mean stronger evidence for overlap

Figure labels: Node/Vertex, Edge/Arc
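
A minimal sketch of such a graph in Python: reads are nodes, and a directed edge carries the overlap length as its weight (the "line thickness"). The exact-match overlap, the `min_len` threshold and the three toy reads are assumptions made for the example.

```python
from itertools import permutations

def exact_overlap(a, b, min_len=3):
    """Length of the longest exact suffix(a)/prefix(b) match, or 0."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def overlap_graph(reads, min_len=3):
    """Nodes are reads; edge a -> b is weighted by the overlap length."""
    graph = {read: {} for read in reads}
    for a, b in permutations(reads, 2):
        weight = exact_overlap(a, b, min_len)
        if weight:
            graph[a][b] = weight
    return graph

# Toy reads cut from AGTCTAGGATTCGCTATA
reads = ["AGTCTAGGAT", "TAGGATTCGC", "ATTCGCTATA"]
graph = overlap_graph(reads)
for read, edges in graph.items():
    print(read, "->", edges)
```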

slide-23
SLIDE 23

A more realistic graph

slide-24
SLIDE 24

What ruins the graph?

  • Read errors

– introduce false edges and nodes

  • Non-haploid organisms

– heterozygosity causes lots of detours

  • Repeats

– if longer than read length
– causes nodes to be shared, locality confusion

slide-25
SLIDE 25

Graph simplification

  • Squash small bubbles

– collapse small errors (or minor heterozygosity)

  • Remove spurs

– short “dead end” hairs on the graph

  • Join unambiguously connected nodes

– reliable stretches of unique DNA
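
Spur removal, for example, might look like the sketch below. It is deliberately naive: it only trims dead ends that are a single node long and that hang off a node with another way forward, whereas real assemblers also weigh tip length and coverage. The graph representation (a dict of successor sets) and the function name are assumptions.

```python
def remove_spurs(graph):
    """Drop one-node dead ends that hang off a branching node.

    graph maps each node to the set of its successors.
    """
    pruned = {node: set(successors) for node, successors in graph.items()}
    spurs = set()
    for node, successors in graph.items():
        dead_ends = {n for n in successors if not graph.get(n)}
        # Only trim when at least one other branch continues.
        if dead_ends and dead_ends != successors:
            pruned[node] -= dead_ends
            spurs |= dead_ends
    # Delete trimmed nodes unless another node still points to them.
    still_used = set().union(*pruned.values())
    for node in spurs - still_used:
        pruned.pop(node, None)
    return pruned

# A -> B -> C -> D, with a short spur X hanging off B
graph = {"A": {"B"}, "B": {"C", "X"}, "C": {"D"}, "D": set(), "X": set()}
print(remove_spurs(graph))
```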

slide-26
SLIDE 26

Graph traversal

  • For each unconnected graph

– at least one per replicon in original sample

  • Find a path which visits each node once

– Hamiltonian path/cycle is NP-hard (this is bad)
– solution will be a set of paths which terminate at decision points

  • Form consensus sequences from paths

– use all the overlap alignments
– each of these collapsed paths is a contig
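
The "set of paths which terminate at decision points" can be sketched without any Hamiltonian search: walk forward from every node that is not a simple pass-through and stop at the next one. The dict-of-sets graph and the function name are assumptions for illustration, and isolated cycles with no decision point are ignored in this sketch.

```python
def unambiguous_paths(graph):
    """Maximal non-branching paths in a directed graph.

    graph maps each node to the set of its successors.
    """
    indegree = {node: 0 for node in graph}
    for successors in graph.values():
        for n in successors:
            indegree[n] = indegree.get(n, 0) + 1

    def is_simple(node):
        # Exactly one way in and one way out: no decision to make here.
        return indegree.get(node, 0) == 1 and len(graph.get(node, ())) == 1

    paths = []
    for node in graph:
        if is_simple(node):
            continue
        for nxt in sorted(graph[node]):
            path = [node, nxt]
            while is_simple(path[-1]):
                path.append(next(iter(graph[path[-1]])))
            paths.append(path)
    return paths

# A fork at C: one branch continues to F, the other stops at E
graph = {"A": {"B"}, "B": {"C"}, "C": {"D", "E"},
         "D": {"F"}, "E": set(), "F": set()}
paths = unambiguous_paths(graph)
print(paths)
```

Each returned path would then be collapsed into one contig; the decision point C appears at the end of one path and the start of two others.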

slide-27
SLIDE 27

Contigs

Contiguous, unambiguous stretches of assembled DNA sequence

  • Contig ends correspond to

– Real ends (for linear DNA molecules)
– Dead ends (missing sequence)
– Decision points (forks in the road)

slide-28
SLIDE 28

Repeats

slide-29
SLIDE 29

What is a repeat?

A segment of DNA which occurs more than once in the genome sequence

  • Very common

– Transposons (self replicating genes)
– Satellites (repetitive adjacent patterns)
– Gene duplications (paralogs)

slide-30
SLIDE 30

Dot plots

Self similarity plot, genome versus itself
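
The off-diagonal dots in such a plot are positions where the genome matches itself somewhere else. A rough Python sketch of finding them with exact k-mer matches follows; the toy genome, k=7 and the function name are assumptions, and real dot plot tools tolerate mismatches.

```python
from collections import defaultdict

def repeated_kmers(sequence, k):
    """Return {kmer: [start positions]} for k-mers seen more than once."""
    positions = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        positions[sequence[i:i + k]].append(i)
    return {kmer: starts for kmer, starts in positions.items()
            if len(starts) > 1}

# Toy genome with one 7 bp element present twice
genome = "CCCC" + "GATTACA" + "TTTT" + "GATTACA" + "GGGG"
repeats = repeated_kmers(genome, k=7)
print(repeats)
```

Each pair of start positions (here 4 and 15) would show up as a dot away from the main diagonal.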

slide-31
SLIDE 31

Effect on assembly

The repeated element is collapsed into a single contig

slide-32
SLIDE 32

Repeat mis-assembly

[Figure: three ways a repeat can be mis-assembled: collapsed tandem, excision, rearrangement]

slide-33
SLIDE 33

The law of repeats

  • It is impossible to resolve repeats of length S unless you have reads longer than S.

  • It is impossible to resolve repeats of length S unless you have reads longer than S.
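
A toy demonstration of the law, under stated assumptions: two made-up genomes carry the same 7 bp repeat three times but with the unique segments between the copies swapped. If every possible error-free read of length k is taken from each genome, the two sets of reads are identical when k is no longer than the repeat, and differ once reads span the repeat plus a base on either side.

```python
from collections import Counter

def kmer_spectrum(sequence, k):
    """Multiset of all length-k substrings (every error-free read of length k)."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

repeat = "GATTACA"  # S = 7
a, b, c, d = "CCCC", "TTTT", "AAAA", "GGGG"
genome_1 = a + repeat + b + repeat + c + repeat + d
genome_2 = a + repeat + c + repeat + b + repeat + d  # b and c swapped

for k in (len(repeat), len(repeat) + 2):
    same = kmer_spectrum(genome_1, k) == kmer_spectrum(genome_2, k)
    print(k, "indistinguishable" if same else "distinguishable")
```

With the shorter reads no assembler could tell which arrangement is the real one, however clever its graph.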

slide-34
SLIDE 34

Scaffolding

slide-35
SLIDE 35

Beyond contigs

Contig sizes are limited by:

  • the length of repeats in your genome

– Can’t change this!

  • the length (or “span”) of the reads

– Wait for new technology
– Use “tricks” with existing technology

slide-36
SLIDE 36

Types of reads

  • Example fragment

– atcgtatgatcttgagattctctcttcccttatagctgctata

  • “Single-end” read

– atcgtatgatcttgagattctctcttcccttatagctgctata

– Sequence one end of the fragment

  • “Paired-end” read

– atcgtatgatcttgagattctctcttcccttatagctgctata

– Sequence both ends of same fragment
– we can exploit this information!
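
To make the two read types concrete, here is a sketch using the example fragment. The read length of 10 is an assumption for illustration, as is reporting the second read as the reverse complement of the far end (the usual convention for forward/reverse pairs).

```python
fragment = "atcgtatgatcttgagattctctcttcccttatagctgctata"
READ_LENGTH = 10  # assumed for illustration; not from the slide

def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(str.maketrans("acgt", "tgca"))[::-1]

# Single-end: one end of the fragment only
single_end = fragment[:READ_LENGTH]

# Paired-end: both ends; the far end is read off the opposite strand
read_1 = fragment[:READ_LENGTH]
read_2 = reverse_complement(fragment[-READ_LENGTH:])
unsequenced = len(fragment) - 2 * READ_LENGTH

print(read_1, read_2, unsequenced)
```

The middle of the fragment stays unknown, but its approximate length is known from size selection, and that is the information scaffolding exploits.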

slide-37
SLIDE 37

Scaffolding

  • Paired-end reads

– known sequences at either end
– roughly known distance between ends
– unknown sequence between ends

  • Most ends will occur in same contig

– if our contigs are longer than pair distance

  • Some ends will be in different contigs

– evidence that these contigs are linked!
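
Collecting that evidence can be sketched as a simple count. The input format (one tuple per read pair naming the contig each end maps to), the contig names and the `min_links` threshold are all hypothetical; real scaffolders also use orientation and distance.

```python
from collections import Counter

def contig_links(pair_mappings, min_links=2):
    """Count read pairs whose two ends map to different contigs."""
    links = Counter()
    for contig_1, contig_2 in pair_mappings:
        if contig_1 != contig_2:
            links[frozenset((contig_1, contig_2))] += 1
    # Keep only links supported by enough pairs.
    return {tuple(sorted(pair)): n
            for pair, n in links.items() if n >= min_links}

pair_mappings = [
    ("contig1", "contig1"),
    ("contig1", "contig2"),
    ("contig2", "contig1"),
    ("contig2", "contig2"),
    ("contig2", "contig3"),
    ("contig1", "contig2"),
]
links = contig_links(pair_mappings)
print(links)
```

Requiring more than one supporting pair is a cheap guard against a single mis-mapped read joining two unrelated contigs.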

slide-38
SLIDE 38

Contigs to scaffolds

[Figure: contigs joined into a scaffold by paired-end reads, with gaps between the contigs]

slide-39
SLIDE 39

Assumptions

slide-40
SLIDE 40

What can we assemble?

  • Genomes

– A single organism eg. its chromosomal DNA

  • Meta-genomes

– Genomic DNA from a mixture of organisms

  • Transcriptomes

– A single organism’s RNA inc. mRNA, ncRNA

  • Meta-transcriptomes

– RNA from a mixture of organisms


slide-41
SLIDE 41

Genomes

  • Expect uniformity

– Each part of genome represented by roughly equal number of reads

  • Average depth of coverage

– Genome: 4 Mbp
– Yield: 4 million x 50 bp reads = 200 Mbp
– Coverage: 200 ÷ 4 = 50x (reads per bp)
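
The coverage arithmetic above as a small Python helper (the function name is made up for the example):

```python
def coverage(genome_bp, n_reads, read_bp):
    """Average depth: total sequenced bases divided by genome size."""
    return n_reads * read_bp / genome_bp

depth = coverage(genome_bp=4_000_000, n_reads=4_000_000, read_bp=50)
print(f"{depth:.0f}x")
```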

slide-42
SLIDE 42

Meta-genomes

  • Expect proportionality & uniformity

– Each genome represented by proportion of reads similar to their proportion in mixture

  • Example

– Mix of 3 species: ¼ Staph, ¼ Clost, ½ Ecoli
– Say we get 4M reads
– Then we expect about: 1M from Staph, 1M from Clost, 2M from Ecoli
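
The same expectation in code, as a sketch that assumes reads are drawn in proportion to each species' share of the mixture:

```python
from fractions import Fraction

mixture = {
    "Staph": Fraction(1, 4),
    "Clost": Fraction(1, 4),
    "Ecoli": Fraction(1, 2),
}
total_reads = 4_000_000

# Expected reads per species = share of mixture x total reads
expected = {species: int(share * total_reads)
            for species, share in mixture.items()}
print(expected)
```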

slide-43
SLIDE 43

Meta-genome issues

  • Closely related species

– will have very similar reads
– lots of shared nodes in the graph

  • Conserved sequence

– bits of DNA common to lots of organisms
– “hub” nodes in the graph

  • Untangling is difficult

– need longer reads

slide-44
SLIDE 44

Assessment

slide-45
SLIDE 45

Assessing assemblies

  • We desire

– Total length similar to genome size
– Fewer, larger contigs
– Correct contigs

  • Metrics

– No generally useful objective measure
– Longest contig, total bp, N50, …

slide-46
SLIDE 46

The “N50”

The length of the contig at which the running total of contig lengths, added from shortest to longest, first reaches 50% of all bases

  • Imagine we got 7 contigs with lengths:

– 1,1,3,5,8,12,20

  • Total

– 1+1+3+5+8+12+20 = 50

  • The “halfway sum” is half the total: 50 ÷ 2 = 25

– 1+1+3+5+8+12 = 30 (≥ 25) so N50 is 12

slide-47
SLIDE 47

N50 concerns

  • Optimizing for N50

– encourages mis-assemblies!

  • An aggressive assembler may over-join:

– 1,1,3,5,8,12,20 (previous)
– 1,1,3,5,20,20 (now)
– 1+1+3+5+20+20 = 50 (unchanged)

  • The “halfway sum” is still 25

– 1+1+3+5+20 = 30 (≥ 25) so N50 is 20 (was 12)
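
The N50 arithmetic for both contig sets can be checked with a short function. This sketch follows the slides and adds contigs from shortest to longest; many tools walk from the longest contig down instead, which gives the same answer on these two examples.

```python
def n50(lengths):
    """N50 as computed on the slides.

    Add contigs from shortest to longest and report the length at
    which the running sum reaches half the total.
    """
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths):
        running += length
        if running >= half:
            return length

before = n50([1, 1, 3, 5, 8, 12, 20])
after = n50([1, 1, 3, 5, 20, 20])  # 8 and 12 over-joined into 20
print(before, after)
```

The total is unchanged and nothing new was learned about the genome, yet the metric improved, which is why N50 alone says little about correctness.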

slide-48
SLIDE 48

Validation

  • Self consistency

– Align reads back to contigs
– Check for errors or discordant pairs

  • Second opinion

– Use two complementary sequencing methods
– Target troublesome areas for PCR
– Use a genome wide “optical map”
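
A rough sketch of the discordant-pair check: once reads are aligned back to the contigs, pairs whose two ends land much closer together or further apart than the library's fragment size hint at a mis-join or collapse. The tuple format, the expected insert size of 300 bp and the tolerance are hypothetical values for illustration; orientation checks are left out.

```python
def discordant_pairs(pairs, expected_insert=300, tolerance=100):
    """Flag same-contig pairs mapped at an unexpected distance.

    Each pair is (contig_1, pos_1, contig_2, pos_2), positions in bp.
    Pairs spanning two contigs are skipped here.
    """
    flagged = []
    for contig_1, pos_1, contig_2, pos_2 in pairs:
        if contig_1 != contig_2:
            continue
        distance = abs(pos_2 - pos_1)
        if abs(distance - expected_insert) > tolerance:
            flagged.append((contig_1, pos_1, contig_2, pos_2))
    return flagged

pairs = [
    ("contig1", 100, "contig1", 390),   # 290 bp apart: fine
    ("contig1", 500, "contig1", 2500),  # 2000 bp apart: suspicious
    ("contig2", 40, "contig2", 350),    # 310 bp apart: fine
]
flagged = discordant_pairs(pairs)
print(flagged)
```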

slide-49
SLIDE 49

How do I do it?

slide-50
SLIDE 50

Example

  • Culture your bacterium
  • Extract your genomic DNA
  • Send it to AGRF for Illumina sequencing

– 100bp paired end

  • Get back two files:

– MRSA_R1.fastq.gz
– MRSA_R2.fastq.gz

  • Now what?
slide-51
SLIDE 51

Assembly tools

  • Genome

– Velvet, Abyss, Mira, Newbler, SGA, AllPaths, Ray, SOAPdenovo, Spades, Masurca, …

  • Meta-genome

– MetaVelvet, SGA, custom scripts + above

  • Transcriptome

– Trans-Abyss, Oases, Trinity

  • Meta-Transcriptome

– custom scripts + above

slide-52
SLIDE 52

Online tutorial

  • The GVL

– Genomics Virtual Laboratory
– http://genome.edu.au

  • Protocols

– Microbial de novo assembly for Illumina data
– Written by Simon Gladman (VBC/LSCC)
– https://genome.edu.au/wiki/Protocols

slide-53
SLIDE 53

Velvet: hash reads

velveth MyFolder 71 -shortPaired -fastq.gz -separate MRSA_R1.fastq.gz MRSA_R2.fastq.gz

  • K-mer size: 71
  • Read type: -shortPaired
  • Read files: MRSA_R1.fastq.gz MRSA_R2.fastq.gz

slide-54
SLIDE 54

Velvet: assembly

velvetg MyFolder -exp_cov auto -cov_cutoff auto

  • -exp_cov: “Signal” level
  • -cov_cutoff: “Noise” level

slide-55
SLIDE 55

Velvet: examine results

less MyFolder/contigs.fa

>NODE_1_length_43211_cov_27.36569
AGTCGATGCTTAGAGAGTATGACCTTCTATACAAAA
ATCTTATATTAGCGCTAGTCTGATAGCTCCCTAGAT
CTGATCTGATATGATCTTAGAGTATCGGCTATTGCT
AGTCTCGCGTATAATAAATAATATATTTTTCTAATG
ATCTTATATTAGCGCTAGTCTGATAGCTCCCTAGAT
CTGATCTGATATGATCTTAGAGTATCGGCTATTGCT
AGTCTCGCGTATAATAAATAATATATTTAGTAGTCT
…
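
The contig header packs the node number, length and coverage into one string. A small sketch of pulling those fields apart in Python follows; the regular expression is written against the header shown here and the function name is made up, so treat it as an assumption rather than a format specification.

```python
import re

HEADER = re.compile(r">NODE_(\d+)_length_(\d+)_cov_([\d.]+)")

def parse_velvet_header(line):
    """Split a contig header into (node, length, coverage)."""
    match = HEADER.match(line)
    if match is None:
        raise ValueError(f"not a Velvet contig header: {line!r}")
    node, length, cov = match.groups()
    return int(node), int(length), float(cov)

fields = parse_velvet_header(">NODE_1_length_43211_cov_27.36569")
print(fields)
```

Applied to every line of contigs.fa that starts with ">", this gives a quick table of contig lengths and coverages to sort or plot.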

slide-56
SLIDE 56

Velvet: GUI

Velvet Assembler Graphical User Environment

[Screenshot callouts: Where to save, Click run, Add your reads]

slide-57
SLIDE 57

Contact

  • Email

– torsten.seemann@monash.edu

  • Blog

– TheGenomeFactory.blogspot.com

  • Web

– vicbioinformatics.com
– vlsci.org.au/lscc

Torst 5½

slide-58
SLIDE 58