CS681: Advanced Topics in Computational Biology Week 7 Lecture 1




slide-1
SLIDE 1

CS681: Advanced Topics in Computational Biology

Can Alkan EA224 calkan@cs.bilkent.edu.tr

http://www.cs.bilkent.edu.tr/~calkan/teaching/cs681/ Week 7 Lecture 1

slide-2
SLIDE 2

Genome Assembly

• Given a set of sequence reads (Sanger, NGS single end, NGS paired end, NGS strobe, etc.), reconstruct the genomic sequence
• Reference guided: when a reference genome (same species or highly similar) is available
• de novo: no a priori information needed

slide-3
SLIDE 3

Genome Assembly

[Figure: assembly workflow. Test genome → random shearing and size selection → sequencing → assemble → contigs/scaffolds]

slide-4
SLIDE 4

Challenges

• DNA is double stranded; assemblers must consider 2 versions of each read (the read itself and its reverse complement)
• Sequencing errors
• Repeats & duplications
• Heterozygosity
  • Diploid genomes: 2 alternates of each locus
  • Polyploid plant genomes are harder to deal with!
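Since a read can come from either strand, the second "version" of a read is its reverse complement. A minimal Python sketch; the function names are illustrative and not from the lecture:

```python
# Complement table for the four bases plus N (unknown base).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(read):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return read.translate(_COMPLEMENT)[::-1]


def both_orientations(read):
    """The two versions of a read that an assembler has to consider."""
    return read, reverse_complement(read)
```

Applying `reverse_complement` twice returns the original read, which is a quick sanity check for the table.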

slide-5
SLIDE 5

Challenges (cont’d)

• Large genomes require
  • More computational power
  • More memory (most algorithms need >300 GB for mammalian genomes)
• Contamination:
  • Quite common to have DNA from other sources in the dataset
    • E.g., yeast, E. coli, other bacteria, etc.
  • Initial dataset from the bonobo genome was contaminated even with tomato and corn!
• Big data
  • Billions of reads to work with

slide-6
SLIDE 6

Parameters for assembly

• Coverage
  • GC% biases can be ameliorated a little by increasing overall coverage
• Read length
• Insert size
• Better with multiple libraries with different insert sizes
• Better with multi-platform data
• Better with additional information
  • Physical fingerprinting (if clones available)
  • STS mapping (needs some a priori information)

slide-7
SLIDE 7

Basics

• No technology can read a chromosome from start to finish; all sequencers have limits for read lengths
• Two major approaches
  • Hierarchical sequencing (used by the Human Genome Project)
    • High quality, very low error rate, little fragmentation
    • Slow and expensive!
  • Whole genome shotgun (WGS) sequencing
    • Lower quality, more errors, assembly is more fragmented
    • Fast and cheap(er)

slide-8
SLIDE 8

Hierarchical vs. shotgun sequencing

[Figure: hierarchical sequencing (assemble step by step) vs. shotgun sequencing (assemble all at once)]

slide-9
SLIDE 9

Cloning vectors

• Plasmids: carry 3-10 kbp of DNA
• Fosmids: carry ~40 kbp of DNA
• Cosmids: carry ~35-50 kbp of DNA
• BACs (bacterial artificial chromosomes): ~150-200 kbp of DNA
• YACs (yeast artificial chromosomes): 100 kbp to 3 Mbp of DNA

slide-10
SLIDE 10

Human genomes: public vs private

slide-11
SLIDE 11

Assembly terminology

Contig: contiguous segments of DNA sequences generated by the assembler using the reads

Scaffold: Ordering of contigs separated by gaps

Draft assembly: Includes many contigs and scaffolds, most sequence remains unassigned to chromosomes

Finished assembly: most sequence assigned to chromosomes, most gaps are closed

Typically involves manual intervention & costly and slow methods

http://genome.jgi.doe.gov/help/scaffolds.html

slide-12
SLIDE 12

Assembly quality assessment

• Assembly size: is the summation of contig/scaffold lengths similar to what is expected from the genome of interest?
• Number of contigs/scaffolds: lower is better
  • Ideally equal to # of chromosomes
• N50: contig length such that using equal or longer contigs produces half the bases of the genome (approximated below by the total contig length L)
  • L = sum of all contig lengths c[1..n]
  • Sort contigs in descending order by length
  • X = 0, i = 0
  • Repeat: i = i + 1; X = X + c[i]
  • If X >= L/2: N50 = c[i], stop
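The N50 procedure can be sketched in Python as follows. The function name `n50` and the choice to return 0 for an empty assembly are assumptions made for this example:

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half of the total is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0
```

For contigs of length 80, 70, 50, 40, 30 and 20 the total is 290; the two longest contigs already cover 150 bases, so the N50 is 70.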

slide-13
SLIDE 13

Scaffolding with read pairs

[Figure: reads are assembled into contigs, each with a consensus sequence (15-30 Kbp)]

Assembly without pairs results in contigs whose order and orientation are not known.

Pairs, especially groups of corroborating ones, link the contigs into scaffolds where the size of gaps is well characterized (the mean & std. dev. of the pair separation is known).

Slide from Mihai Pop

slide-14
SLIDE 14

WGS Assembly

[Figure: levels of a WGS assembly. Chromosome; STS-mapped scaffolds; contigs; gaps (mean & std. dev. known); read pairs (mates); consensus; reads (of several haplotypes); SNPs; external "reads"]

STS: sequence-tagged sites = 200-500 bp of sequence that is unique in the genome

Slide from Mihai Pop

slide-15
SLIDE 15

Assembly gaps

• Sequencing gap: we know the order and orientation of the contigs and have at least one clone spanning the gap
• Physical gap: no information is known about the adjacent contigs, nor about the DNA spanning the gap

[Figure: sequencing gaps vs. physical gaps]

Slide from Mihai Pop

slide-16
SLIDE 16

Typical contig coverage

[Figure: reads tiled along a contig, with the per-position coverage (axis from 1 to 6) plotted above]

Slide from Mihai Pop

slide-17
SLIDE 17

Lander-Waterman statistics

L = read length
T = minimum detectable overlap
G = genome size
N = number of reads
c = coverage, $c = NL/G$
$\sigma = 1 - T/L$

$$E(\#\text{islands}) = N e^{-c\sigma}$$

$$E(\text{island size}) = L\left(\frac{e^{c\sigma} - 1}{c} + 1 - \sigma\right)$$

contig = island with 2 or more reads

Slide from Mihai Pop
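The Lander-Waterman expectations are easy to evaluate numerically. A minimal Python sketch; the function name `lander_waterman` and its argument names are chosen for this example:

```python
import math


def lander_waterman(n_reads, read_len, min_overlap, genome_size):
    """Expected number of islands and expected island size (in bases)."""
    c = n_reads * read_len / genome_size      # coverage, c = NL / G
    sigma = 1 - min_overlap / read_len        # sigma = 1 - T / L
    islands = n_reads * math.exp(-c * sigma)
    island_size = read_len * ((math.exp(c * sigma) - 1) / c + 1 - sigma)
    return islands, island_size
```

With N = 1,667 reads of length 600, a detectable overlap of 40 and a 1 Mbp genome, the expected number of islands comes out at roughly 655; with N = 5,000 it drops to roughly 304.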

slide-18
SLIDE 18

Example

c | N      | #islands | #contigs | bases not in any read | bases not in contigs
1 | 1,667  | 655      | 614      | 698                   | 367,806
3 | 5,000  | 304      | 250      | 121                   | 49,787
5 | 8,334  | 78       | 57       | 20                    | 6,735
8 | 13,334 | 7        | 5        | 1                     | 335

Genome size: 1 Mbp
Read length: 600
Detectable overlap: 40

Slide from Mihai Pop

slide-19
SLIDE 19

Experimental data

X coverage | # ctgs | % > 2X | avg ctg size (L-W) | max ctg size | # ORFs
1          | 284    | 54     | 1,234 (1,138)      | 3,337        | 526
3          | 597    | 67     | 1,794 (4,429)      | 9,589        | 1,092
5          | 548    | 79     | 2,495 (21,791)     | 17,977       | 1,398
8          | 495    | 85     | 3,294 (302,545)    | 64,307       | 1,762
complete   | 1      | 100    | 1.26 M             | 1.26 M       | 1,329

Numbers based on artificially chopping up the genome of Wolbachia pipientis

Slide from Mihai Pop

slide-20
SLIDE 20

Basic algorithmic definition

• The genome assembly problem is finding the shortest common superstring of a set of sequences (reads):
  • Given strings {s1, s2, …, sn}, find the shortest superstring T such that every si is a substring of T
• NP-hard problem
• Greedy approximation algorithm
  • Works for simple (low-repeat) genomes

slide-21
SLIDE 21

Shortest superstring problem

input: ABRAC, ACADA, ADABR, DABRA, RACAD

superstring, with the reads aligned beneath it:

ABRACADABRA
ABRAC
  RACAD
   ACADA
     ADABR
      DABRA

slide-22
SLIDE 22


Assembly paradigms

• Overlap-layout-consensus
  • greedy (TIGR Assembler, phrap, CAP3...)
  • graph-based (Celera Assembler, Arachne)
  • SGA for NGS platforms
• Eulerian path on de Bruijn graphs (especially useful for short read sequencing)
  • EULER, Velvet, ABySS, ALLPATHS-LG, Cortex, etc.

Slide from Mihai Pop
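The slide only names the de Bruijn paradigm. As a rough illustration of the data structure, the sketch below builds the graph from error-free reads: nodes are (k-1)-mers and every distinct k-mer contributes one edge. The function name `de_bruijn_graph` is an assumption, and the sketch leaves out finding the Eulerian path, error handling and reverse complements:

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer not in seen:          # one edge per distinct k-mer
                seen.add(kmer)
                graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)
```

For the reads ACCTG and CCTGA with k = 3, the edges form the single chain AC → CC → CT → TG → GA, which spells ACCTGA.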

slide-23
SLIDE 23

Greedy Algorithms

• The greedy solution to the shortest common superstring problem
• Good for small genomes with no or low repeat/duplication content
• The first assembly algorithms used greedy methods

slide-24
SLIDE 24

TIGR Assembler/phrap

Greedy method

Build a rough map of fragment overlaps

Pick the largest scoring overlap

Merge the two fragments

Repeat until no more merges can be done

Slide from Mihai Pop
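The greedy loop can be sketched in a few lines of Python. This is a simplified illustration, not the TIGR Assembler or phrap implementation: overlaps are scored by exact suffix-prefix length only (real assemblers tolerate mismatches), reverse complements are ignored, and the names `overlap`, `drop_contained`, `greedy_assemble` and `min_overlap` are made up for the example:

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def drop_contained(frags):
    """Remove duplicates and fragments that occur inside another fragment."""
    unique = list(dict.fromkeys(frags))
    return [f for f in unique if not any(f != g and f in g for g in unique)]


def greedy_assemble(reads, min_overlap=1):
    """Repeatedly merge the pair with the largest overlap; return the contigs."""
    frags = drop_contained(reads)
    while len(frags) > 1:
        # Score every ordered pair and pick the largest overlap.
        k, i, j = max(
            (overlap(a, b), i, j)
            for i, a in enumerate(frags)
            for j, b in enumerate(frags)
            if i != j
        )
        if k < min_overlap:
            break  # no more merges can be done
        merged = frags[i] + frags[j][k:]
        rest = [f for idx, f in enumerate(frags) if idx not in (i, j)]
        frags = drop_contained(rest + [merged])
    return frags
```

Recomputing all pairwise overlaps in every round keeps the sketch short but is quadratic per merge; it is meant for toy inputs such as the ABRAC/RACAD/ACADA/ADABR/DABRA reads from the superstring slide.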

slide-25
SLIDE 25

Overlap-layout-consensus

Main entity: read
Relationship between reads: overlap

[Figure: overlap graph of reads 1-9 and their layout; the aligned sequences ACCTGA, ACCTGA, AGCTGA, ACCAGA illustrate the consensus step]

Slide from Mihai Pop
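The consensus step can be illustrated with a per-column majority vote. This sketch assumes the reads are already laid out, ungapped and of equal length, which real data rarely is; the function name `consensus` is chosen for the example:

```python
from collections import Counter


def consensus(aligned_reads):
    """Majority base per column for equal-length, already aligned reads."""
    return "".join(
        Counter(column).most_common(1)[0][0]
        for column in zip(*aligned_reads)
    )
```

For the aligned reads ACCTGA, AGCTGA and ACCAGA, the vote gives ACCTGA: each of the two disagreeing columns is outvoted 2 to 1.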

slide-26
SLIDE 26

Paths through graphs and assembly

• Hamiltonian cycle: visit each node exactly once, returning to the start

[Figure: overlap graph with nodes A-I; visiting the nodes in the order A B C D H I F G E spells out the genome]

slide-27
SLIDE 27

IMPLEMENTATION DETAILS

slide-28
SLIDE 28

Overlap between two sequences

…AGCCTAGACCTACAGGATGCGCGGACACGTAGCCAGGAC
       CAGTACTTGGATGCGCTGACACGTAGCTTATCCGGT…

overlap (19 bases): region of similarity between the two sequences
overhang (6 bases): unaligned ends of the sequences

The assembler screens merges based on:

  • length of overlap
  • % identity in overlap region
  • maximum overhang size.

% identity = 18/19 = 94.7%

Slide from Mihai Pop
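The three screening criteria can be computed directly once the overlap coordinates are known. The sketch below assumes an ungapped overlap whose start positions are already given (finding them is the job of the aligner); `overlap_stats`, `accept_merge` and the threshold defaults are illustrative and not taken from any specific assembler:

```python
def overlap_stats(a, b, a_start, b_start, length):
    """Describe the ungapped overlap a[a_start:...] vs b[b_start:...]."""
    seg_a = a[a_start:a_start + length]
    seg_b = b[b_start:b_start + length]
    matches = sum(x == y for x, y in zip(seg_a, seg_b))
    return {
        "length": length,
        "identity": matches / length,
        # Unaligned ends next to the overlap: tail of a, head of b.
        "overhang_a": len(a) - (a_start + length),
        "overhang_b": b_start,
    }


def accept_merge(stats, min_length=15, min_identity=0.9, max_overhang=10):
    """Screen a candidate merge; the thresholds are made-up defaults."""
    return (
        stats["length"] >= min_length
        and stats["identity"] >= min_identity
        and max(stats["overhang_a"], stats["overhang_b"]) <= max_overhang
    )


seq1 = "AGCCTAGACCTACAGGATGCGCGGACACGTAGCCAGGAC"
seq2 = "CAGTACTTGGATGCGCTGACACGTAGCTTATCCGGT"
stats = overlap_stats(seq1, seq2, a_start=14, b_start=8, length=19)
```

For the two sequences on the slide this gives 18 matches out of 19 bases and a 6-base overhang at the end of the first sequence.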

slide-29
SLIDE 29

All pairs alignment

• Needed by the assembler
• Try all pairs: must consider ~n^2 pairs
• Smarter solution: only n x coverage (e.g. 8) pairs are possible
  • Build a table of k-mers contained in sequences (single pass through the genome)
  • Generate the pairs from the k-mer table (single pass through the k-mer table)

[Figure: reads A-I that share a k-mer]

Slide from Mihai Pop
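The two passes can be sketched as follows. The function name `candidate_pairs` is an assumption; the sketch ignores reverse complements and does not filter out frequent (repeat) k-mers, which a real assembler would:

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads, k):
    """Return pairs of read indices that share at least one k-mer."""
    table = defaultdict(set)
    # Pass 1: record which reads contain each k-mer.
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            table[read[i:i + k]].add(idx)
    # Pass 2: two reads listed under the same k-mer form a candidate pair.
    pairs = set()
    for owners in table.values():
        pairs.update(combinations(sorted(owners), 2))
    return pairs
```

Only the candidate pairs are then passed to the (expensive) alignment step, instead of all ~n^2 pairs.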

slide-30
SLIDE 30

REPEATS


slide-31
SLIDE 31

Handling repeats

• Repeat detection
  • pre-assembly: find fragments that belong to repeats
    • statistically (most existing assemblers)
    • repeat database (RepeatMasker)
  • during assembly: detect "tangles" indicative of repeats (Pevzner, Tang, Waterman 2001)
  • post-assembly: find repetitive regions and potential misassemblies
    • REPuter, RepeatMasker
    • "unhappy" mate-pairs (too close, too far, misoriented)
• Repeat resolution
  • find DNA fragments belonging to the repeat
  • determine correct tiling across the repeat

Slide from Mihai Pop
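The "unhappy" mate-pair check can be sketched as a small classifier. The conventions here are assumptions made for the example: both mates lie on the same contig, a correct pair has the left mate on the + strand and the right mate on the - strand, separation is measured between start coordinates, and a pair is flagged when it is more than `n_std` standard deviations from the library mean:

```python
def classify_mate_pair(pos1, strand1, pos2, strand2, mean, std, n_std=3):
    """Label a mate pair as 'happy', 'too close', 'too far' or 'misoriented'."""
    left, right = sorted([(pos1, strand1), (pos2, strand2)])
    # Assumed library layout: mates face each other (+ on the left, - on the right).
    if (left[1], right[1]) != ("+", "-"):
        return "misoriented"
    separation = right[0] - left[0]
    if separation < mean - n_std * std:
        return "too close"
    if separation > mean + n_std * std:
        return "too far"
    return "happy"
```

Clusters of pairs with the same label around one location are stronger evidence of a misassembly than a single unhappy pair.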

slide-32
SLIDE 32

Statistical repeat detection

• Significant deviations from average coverage flagged as repeats.
  • frequent k-mers are ignored
  • "arrival" rate of reads in contigs compared with theoretical value (e.g., 800 bp reads & 8x coverage: reads "arrive" every 100 bp)
• Problem 1: assumption of uniform distribution of fragments leads to false positives
  • non-random libraries
  • poor clonability regions
• Problem 2: repeats with low copy number are missed, which leads to false negatives

Slide from Mihai Pop
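The arrival-rate comparison amounts to simple arithmetic: under uniform sampling, a new read starts every read length / coverage bases. A rough Python sketch; `looks_repetitive` and its `factor` cut-off are illustrative, not the statistic used by any particular assembler:

```python
def expected_read_spacing(read_len, coverage):
    """Average distance between consecutive read starts under uniform sampling."""
    return read_len / coverage


def looks_repetitive(n_reads, contig_len, read_len, coverage, factor=2.0):
    """Flag a contig whose reads arrive much faster than expected.

    `factor` is an illustrative cut-off, not a value from the lecture.
    """
    expected_reads = contig_len / expected_read_spacing(read_len, coverage)
    return n_reads > factor * expected_reads
```

With 800 bp reads at 8x coverage the expected spacing is 100 bp, so a 10 kbp contig should contain about 100 reads; a contig of that size holding 250 reads would be flagged as a likely collapsed repeat.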

slide-33
SLIDE 33

Mis-assembled repeats

[Figure: three types of mis-assembled repeats: collapsed tandem, excision, rearrangement]

Slide from Mihai Pop