short read genome assembly



slide-1
SLIDE 1

short read genome assembly

Sorin Istrail CSCI1820 Short-read genome assembly algorithms 3/6/2014


slide-2
SLIDE 2

Genomathica Assembler

  • Mathematica notebook for genome assembly simulation
  • Assembler can be found at: http://cs.brown.edu/courses/csci1820/software/minimal_assembler.nb
  • Sample FASTA genome phix174.fasta can be found in HW5 Biology: http://cs.brown.edu/courses/csci1820/software/phix174.fasta
  • Remember to
    – Change the input genome to your FASTA file's location
    – Evaluate each cell initially; after that, you only need to evaluate the last two cells, which re-run the assembly and display the results, respectively
    – Mathematica can be downloaded here: http://www.brown.edu/information-technology/software/


slide-3
SLIDE 3

coverage = 1

  • Sequence reads are in black
  • Contiguous strings of assembled DNA (contigs) are in red

slide-4
SLIDE 4

coverage = 2

  • Sequence reads are in black
  • Contiguous strings of assembled DNA (contigs) are in red

slide-5
SLIDE 5

coverage = 3

  • Sequence reads are in black
  • Contiguous strings of assembled DNA (contigs) are in red

slide-6
SLIDE 6

coverage = 4

  • Sequence reads are in black
  • Contiguous strings of assembled DNA (contigs) are in red

slide-7
SLIDE 7

coverage = 5

  • Sequence reads are in black
  • Contiguous strings of assembled DNA (contigs) are in red
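The coverage figures above can be imitated with a small simulation. The sketch below is not the Genomathica notebook; it is a minimal Python stand-in that assumes reads of a fixed length are sampled uniformly along the genome and that a contig is any maximal run of overlapping reads. The function names and parameters (`sample_reads`, `merge_intervals`, `contig_spans`, `seed`) are hypothetical.

```python
import math
import random

def sample_reads(genome_len, read_len, coverage, seed=0):
    """Return read start positions; the read count is chosen so that
    n_reads * read_len / genome_len is at least `coverage`."""
    rng = random.Random(seed)
    n_reads = math.ceil(coverage * genome_len / read_len)
    return [rng.randrange(genome_len - read_len + 1) for _ in range(n_reads)]

def merge_intervals(intervals):
    """Merge half-open intervals that share at least one position."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start < merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def contig_spans(genome_len, read_len, coverage, seed=0):
    """Spans of the genome covered by runs of overlapping reads."""
    starts = sample_reads(genome_len, read_len, coverage, seed)
    return merge_intervals([(s, s + read_len) for s in starts])

if __name__ == "__main__":
    for c in (1, 2, 3, 4, 5):
        spans = contig_spans(genome_len=5000, read_len=100, coverage=c)
        covered = sum(end - start for start, end in spans)
        print(c, len(spans), covered / 5000)
```

Running it for increasing coverage should show the trend in the slides: fewer, longer contigs and a larger covered fraction.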

slide-8
SLIDE 8

coverage = 2, paired ends

slide-9
SLIDE 9

Raw Sequence Reads Sample prep

Sequence data

  • wet-lab experimental methods to isolate, prepare, and sequence the DNA
  • results in a number of large FASTQ files
  • FastQC can be used to check basic statistics of the files
    – http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
  • many tools available for QC
    – e.g. http://hannonlab.cshl.edu/fastx_toolkit/
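To make "basic statistics" concrete, here is a minimal Python sketch of a FASTQ reader. It is not FastQC and makes two simplifying assumptions: every record occupies exactly four lines (no wrapped sequences), and qualities are Phred+33 encoded. The function names are hypothetical.

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, quality_string) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            return
        seq = handle.readline().rstrip()
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip()
        yield header[1:], seq, qual

def mean_quality(qual, offset=33):
    """Mean Phred score of one read, assuming the given ASCII offset."""
    return sum(ord(c) - offset for c in qual) / len(qual)

def summarize(handle):
    """Read count, mean read length, and mean per-read quality."""
    n = total_len = 0
    total_q = 0.0
    for _, seq, qual in read_fastq(handle):
        n += 1
        total_len += len(seq)
        total_q += mean_quality(qual)
    return {"reads": n, "mean_length": total_len / n, "mean_quality": total_q / n}

if __name__ == "__main__":
    demo = "@r1\nGACGTA\n+\nIIIIII\n@r2\nCGTACG\n+\nII!!II\n"
    print(summarize(io.StringIO(demo)))
```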

slide-10
SLIDE 10

Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP) Available at: www.genome.gov/sequencingcosts. Accessed April 2013.

slide-11
SLIDE 11

http://www.ncbi.nlm.nih.gov/Traces/sra/

slide-12
SLIDE 12
slide-13
SLIDE 13

Genome Assembly Software

  • Overlap-layout-consensus
  • Celera: http://wgs-assembler.sourceforge.net/
  • K-mer based
  • Velvet: http://www.ebi.ac.uk/~zerbino/velvet/
  • SOAP-denovo: http://soap.genomics.org.cn/soapdenovo.html
  • ALLPATHS-LG: http://www.broadinstitute.org/software/allpaths-lg/blog/

  • IDBA-UD: http://i.cs.hku.hk/~alse/hkubrg/projects/idba_ud/
slide-14
SLIDE 14

Two graph models

  • A first graph model

    – Nodes (vertices) are contiguous sequences of k characters (k-mers)
    – Directed edge from vi to vj if vi[2..k] = vj[1..k-1]

Example (k=3): sequence ACGTTC gives nodes ACG → CGT → GTT → TTC
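A minimal Python sketch of this first model, assuming the k-mers are taken from a single string and edges are found by comparing every pair of nodes (quadratic, which is fine for a toy example). The function names are hypothetical.

```python
def kmers(seq, k):
    """All length-k substrings of seq, in order of appearance."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def overlap_graph(kmer_list):
    """Edges (u, v) where the last k-1 characters of u equal the first k-1 of v."""
    nodes = sorted(set(kmer_list))
    return {(u, v) for u in nodes for v in nodes if u[1:] == v[:-1]}

if __name__ == "__main__":
    nodes = kmers("ACGTTC", 3)
    print(nodes)                        # ['ACG', 'CGT', 'GTT', 'TTC']
    print(sorted(overlap_graph(nodes)))
```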

slide-15
SLIDE 15

Two graph models

  • De Bruijn graph

    – Nodes (vertices) are contiguous sequences of k-1 characters ((k-1)-mers)
    – Directed edge from vi to vj if vi[1..k-1] + vj[k-1] is a valid k-mer

Example (k=3): sequence ACGTTC gives nodes AC, CG, GT, TT, TC and edges ACG, CGT, GTT, TTC
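The same example can be built in the de Bruijn model. In this minimal Python sketch each k-mer found in the input becomes one edge from its (k-1)-character prefix to its (k-1)-character suffix; repeated k-mers yield repeated edges. The function name is hypothetical.

```python
def de_bruijn_edges(seqs, k):
    """One (prefix, suffix) edge per k-mer occurrence in the input sequences."""
    edges = []
    for s in seqs:
        for i in range(len(s) - k + 1):
            kmer = s[i:i + k]
            edges.append((kmer[:-1], kmer[1:]))
    return edges

if __name__ == "__main__":
    for u, v in de_bruijn_edges(["ACGTTC"], 3):
        print(u, "->", v, "(k-mer", u + v[-1] + ")")
```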

slide-16
SLIDE 16

Compeau et al. (2011) How to apply de Bruijn graphs to genome assembly

Note edges that are not reflected in the input!

slide-17
SLIDE 17

Genome Assembly

  • Building the k-mer graph

– nodes as k-mers, edges (k-1) overlap


slide-18
SLIDE 18

Genome assembly

Genome: GACGTACGTT
Reads: GACGTA, CGTACG, TACGTT

[Figure: k-mer graphs being built from the reads, with count labels. k=4 nodes: GACG, ACGT, CGTA. k=3 nodes: GAC, ACG, CGT, GTA.]

slide-19
SLIDE 19

Genome assembly

Genome: GACGTACGTT
Reads: GACGTA, CGTACG, TACGTT

[Figure: k-mer graphs being built from the reads, with count labels. k=4 nodes: GACG, ACGT, CGTA, GTAC, TACG. k=3 nodes: GAC, ACG, CGT, GTA, TAC.]

slide-20
SLIDE 20

Genome assembly

Genome: GACGTACGTT
Reads: GACGTA, CGTACG, TACGTT

[Figure: k-mer graphs built from all three reads, with count labels. k=4 nodes: GACG, ACGT, CGTA, GTAC, TACG, CGTT. k=3 nodes: GAC, ACG, CGT, GTA, TAC, GTT.]
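The node sets in this example can be reproduced by hashing the three reads into k-mers. The Python sketch below counts how many times each k-mer occurs across the reads; these occurrence counts are an assumption about what one might want to track and need not match the count labels in the figure. The function name is hypothetical.

```python
from collections import Counter

def count_kmers(reads, k):
    """Number of occurrences of each k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

if __name__ == "__main__":
    reads = ["GACGTA", "CGTACG", "TACGTT"]
    for k in (4, 3):
        print(k, sorted(count_kmers(reads, k).items()))
```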

slide-21
SLIDE 21

Genome Assembly

  • Building the k-mer graph

    – nodes as k-mers, edges (k-1) overlap
    – nodes as (k-1)-mers, edges form k-mers


slide-22
SLIDE 22

Genome assembly

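Genome: GACGTACGTT
Reads: GACGTA, CGTACG, TACGTT

[Figure: graphs with (k-1)-mer nodes being built from the reads for k=4 and k=3, with count labels. Labels shown: GA, AC, CG, GT, TA and GAC, ACG, CGT, GTA.]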

slide-23
SLIDE 23

Genome assembly

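Genome: GACGTACGTT
Reads: GACGTA, CGTACG, TACGTT

[Figure: graphs with (k-1)-mer nodes being built from the reads for k=4 and k=3, with count labels. Labels shown: GA, AC, CG, GT, TA and GAC, ACG, CGT, GTA, TAC.]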

slide-24
SLIDE 24

Genome assembly

Genome: GACGTACGTT
Reads: GACGTA, CGTACG, TACGTT

[Figure: graphs with (k-1)-mer nodes built from all three reads for k=4 and k=3, with count labels. Labels shown: GA, AC, CG, GT, TA, TT and GAC, ACG, CGT, GTA, TAC, GTT.]

slide-25
SLIDE 25

Genome Assembly

  • Building the k-mer graph

    – G(k): nodes as k-mers, edges (k-1) overlap
    – H(k): nodes as (k-1)-mers, edges form k-mers

  • H(k)=G(k-1)

– So it really does not matter which you choose to implement

  • Where does the complexity come from?

– Sequencing errors, repeats, uneven coverage, contamination from other organisms, ploidy, unsequenced regions
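The relation H(k)=G(k-1) can be checked on the running example with a few lines of Python. This is only a sketch with hypothetical function names. It also suggests a caveat: both graphs have the same nodes once reads are at least k long, but G(k-1) draws an edge for every overlap, while H(k) only draws edges for k-mers actually seen in the input, so on sparse input H(k) can have fewer edges.

```python
def all_kmers(reads, k):
    """Set of distinct k-mers occurring in any read."""
    return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}

def G(reads, k):
    """Nodes are k-mers; edge u -> v whenever u and v overlap by k-1 characters."""
    nodes = all_kmers(reads, k)
    edges = {(u, v) for u in nodes for v in nodes if u[1:] == v[:-1]}
    return nodes, edges

def H(reads, k):
    """Nodes are (k-1)-mers; one edge per distinct k-mer seen in the reads."""
    edges = {(km[:-1], km[1:]) for km in all_kmers(reads, k)}
    nodes = {n for edge in edges for n in edge}
    return nodes, edges

if __name__ == "__main__":
    reads = ["GACGTA", "CGTACG", "TACGTT"]
    print(G(reads, 3) == H(reads, 4))   # True for this example

    sparse = ["ACG", "GTA"]             # CGT and TAC never occur as 3-mers
    print(H(sparse, 3)[1] < G(sparse, 2)[1])  # True: strictly fewer edges in H
```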


slide-26
SLIDE 26

Popping bubbles

Error occurs in the middle of a read and is propagated to many k-mers.
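A short Python sketch of why one error spreads: a single substituted base is contained in up to k consecutive k-mers, all of which become erroneous. Near the end of a read fewer k-mers cover the position, which is the situation that tends to produce short dead ends rather than bubbles. The helper names are hypothetical.

```python
def substitute(seq, pos, base):
    """Copy of seq with the character at 0-based index pos replaced by base."""
    return seq[:pos] + base + seq[pos + 1:]

def kmers_covering(seq, pos, k):
    """All k-mers of seq that include the 0-based index pos."""
    first = max(0, pos - k + 1)
    last = min(pos, len(seq) - k)
    return [seq[i:i + k] for i in range(first, last + 1)]

if __name__ == "__main__":
    read = "GACGTACGTT"
    middle_error = substitute(read, 5, "T")
    end_error = substitute(read, 9, "A")
    print(kmers_covering(middle_error, 5, 4))  # four erroneous 4-mers
    print(kmers_covering(end_error, 9, 4))     # one erroneous 4-mer
```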

slide-27
SLIDE 27

Trimming tips

Error creates an erroneous ending k-mer
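One simple way to trim tips is sketched below in Python. It is an illustration, not Velvet's procedure: it only handles dead ends with no outgoing edge, it walks back from each dead end to the nearest branching node, and it removes the branch if it has at most `max_tip_len` nodes. It ignores coverage, so if both branches after a fork are short, both are removed. The function name and the threshold parameter are hypothetical.

```python
from collections import defaultdict

def trim_tips(edges, max_tip_len):
    """Return the edge set with short dead-end branches removed."""
    succ = defaultdict(set)
    pred = defaultdict(set)
    for u, v in edges:
        succ[u].add(v)
        pred[v].add(u)
    nodes = set(succ) | set(pred)

    removed = set()
    for end in nodes:
        if succ[end]:
            continue                      # not a dead end
        path = [end]
        cur = end
        while len(pred[cur]) == 1:
            (parent,) = pred[cur]
            if len(succ[parent]) > 1:     # parent is where the branch forks off
                if len(path) <= max_tip_len:
                    removed.update(path)
                break
            path.append(parent)
            cur = parent
    return {(u, v) for u, v in edges if u not in removed and v not in removed}

if __name__ == "__main__":
    # Main path A..F with a one-node tip X hanging off C.
    edges = {("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "F"), ("C", "X")}
    print(sorted(trim_tips(edges, max_tip_len=1)))  # the C -> X edge is gone
```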

slide-28
SLIDE 28

Compeau et al. (2011) How to apply de Bruijn graphs to genome assembly

Chimeric extensions

Errors connect two nodes in the graph which do not correspond to a valid extension in the genome sequence

slide-29
SLIDE 29

Repetitive regions

  • Satellites, SINEs, LINEs
  • Homologous Genes

    – Ortholog: genes descended from the same ancestral sequence and separated by speciation
    – Paralog: genes created by a duplication event


slide-30
SLIDE 30

Compeau et al. (2011) How to apply de Bruijn graphs to genome assembly

slide-31
SLIDE 31

Compeau et al. (2011) How to apply de Bruijn graphs to genome assembly

slide-32
SLIDE 32

Velvet assembler

  • Four stages

    – Hashing reads into k-mers
    – Constructing the de Bruijn graph (not all 4^k k-mers, only those that exist in the input)
    – Correcting errors
    – Resolving repeats

  • But what after?

– Paper gives very little information on this...


slide-33
SLIDE 33

The Chinese postman problem (CPP)

  • Compute a closed tour of minimum length that visits each edge at least once

– Similar to what we want except we may want to visit edges more than once due to repeats

  • How do we deal with repeats?

– Also, the starting and ending vertices are distinct in genome assembly

  • How can we convert the closed tour to an open one?
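When the graph already contains every repeated k-mer as a repeated edge, the open-tour question reduces to finding an Eulerian path: start at the node with one more outgoing than incoming edge and use each edge exactly once. The Python sketch below uses Hierholzer's algorithm on a de Bruijn multigraph. It is a simplified stand-in for the CPP formulation, not a solution to it, and it assumes the graph is connected and that an Eulerian path exists (it does not check either). The function names are hypothetical.

```python
from collections import Counter, defaultdict

def kmer_edges(seq, k):
    """De Bruijn edges (prefix, suffix), one per k-mer occurrence in seq."""
    return [(seq[i:i + k - 1], seq[i + 1:i + k]) for i in range(len(seq) - k + 1)]

def eulerian_path(edges):
    """Hierholzer's algorithm for a directed multigraph given as an edge list."""
    succ = defaultdict(list)
    indeg, outdeg = Counter(), Counter()
    for u, v in edges:
        succ[u].append(v)
        outdeg[u] += 1
        indeg[v] += 1
    # An open path must start where out-degree exceeds in-degree by one.
    starts = [n for n in list(succ) if outdeg[n] - indeg[n] == 1]
    start = starts[0] if starts else edges[0][0]

    stack, path = [start], []
    while stack:
        node = stack[-1]
        if succ[node]:
            stack.append(succ[node].pop())   # follow an unused edge
        else:
            path.append(stack.pop())         # dead end: node is final in the tour
    return path[::-1]

def spell(path):
    """Sequence spelled by a walk over (k-1)-mer nodes."""
    return path[0] + "".join(node[-1] for node in path[1:])

if __name__ == "__main__":
    edges = kmer_edges("GACGTACGTT", 4)
    print(spell(eulerian_path(edges)))   # GACGTACGTT
```

In this toy case the repeat ACGT appears twice in the edge list because the edges come from the genome itself; with real reads, edge multiplicities have to be estimated, which is where the repeat question on the slide comes in.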


slide-34
SLIDE 34

Your homework

  • You are not required to implement section 4 of http://web.eecs.umich.edu/~pettie/matching/Edmonds-Johnson-chinese-postman.pdf
  • You are not even required to model genome assembly as CPP
  • But you do have to build the k-mer graph, correct errors, resolve repeats, and compute a CPP or Eulerian-like tour.


slide-35
SLIDE 35

Evaluating assembly

  • The Assemblathon 2 study lists 102 measures for evaluating assembly quality.
  • Bradnam et al. (2013) Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species
  • 1. NG50 scaffold length: the largest length x such that the scaffolds of length x or longer together make up at least 50% of the genome size
  • 2. NG50 contig length: the largest length x such that the contigs of length x or longer together make up at least 50% of the genome size
  • 3. Number of gene-sized scaffolds (>25 kbp). Useful for gene finding.
  • 4. CEGMA: number of the 458 core genes mapped
slide-36
SLIDE 36
Evaluating assembly

  • 5. Fosmid coverage: how many validated fosmid regions were captured in the assembly
  • 6. Fosmid validity: percentage of the assembly validated by validated fosmid regions
  • 7. Validated fosmid region tag scaffold summary score: number of validated fosmid region tag pairs that match the same scaffold, multiplied by the percentage of uniquely mapping tag pairs that map with the correct distance. Rewards short-range accuracy.
  • 8. and 9. How well the assembly is ordered, using local and global alignments of optical map data.
  • 10. REAPR summary score: from REAPR, a tool that evaluates the accuracy of an assembly using paired reads
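As an illustration of measures 1 and 2, NG50 can be computed by sorting contig or scaffold lengths from longest to shortest and accumulating until half of the genome size is reached. The Python sketch below assumes the genome size is known or estimated separately; N50 is the same computation with the total assembly length in place of the genome size. The function names are hypothetical.

```python
def ng50(lengths, genome_size):
    """Largest x such that sequences of length >= x sum to at least half
    of genome_size; None if the assembly is too small to reach it."""
    target = 0.5 * genome_size
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return None

def n50(lengths):
    """Same statistic measured against the assembly's own total length."""
    return ng50(lengths, sum(lengths))

if __name__ == "__main__":
    contigs = [100, 60, 40, 30, 20]
    print(ng50(contigs, genome_size=400))  # 40
    print(n50(contigs))                    # 60
```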