Assembly, Assembling with Repeats - PowerPoint PPT Presentation



SLIDE 1

Assembly

SLIDE 2

Assembling with Repeats

SLIDE 3

Mate Pairs

SLIDE 4

Whole genome shotgun

• Input:
  • Shotgun sequence fragments (reads)
  • Mate pairs
• Output:
  • A single sequence created by consensus of overlapping reads
• First generation of assemblers did not include mate-pairs (Phrap, CAP, ...)
• Second generation: CA, Arachne, Euler
• We will discuss Arachne, a freely available sequence assembler (2nd generation)

SLIDE 5

Arachne: Details

• Initial processing
• Alignment module

SLIDE 6

Alignment Module

• Input: Collection of DNA sequences of arbitrary length
• Output: Pairwise alignments between them.

SLIDE 7

Overlap detection

• Option 1: Compute an alignment between every pair.
  • G = 150 Mb, L = 500
  • Coverage LN/G = 10
  • N = 10 * 150 * 10^6 / 500 = 3 * 10^6
• Not good! (Only a small fraction are true overlaps)
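
The arithmetic above can be checked with a short Python sketch. The function names are illustrative and do not come from the slides; the pair count assumes each unordered pair of reads would be aligned exactly once.

```python
def reads_needed(genome_len, read_len, coverage):
    """Number of reads N such that coverage = L * N / G."""
    return coverage * genome_len // read_len


def all_pairs(n):
    """Number of unordered read pairs an all-vs-all comparison would align."""
    return n * (n - 1) // 2


G = 150 * 10**6   # genome length in bp (150 Mb, as on the slide)
L = 500           # read length in bp
N = reads_needed(G, L, coverage=10)
PAIRS = all_pairs(N)

if __name__ == "__main__":
    print(f"reads: {N:,}")
    print(f"pairwise alignments: {PAIRS:,}")
```

With three million reads the all-vs-all approach would need on the order of 10^12 alignments, which is why a filter is needed before any alignment is computed.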

SLIDE 8

K-mer based overlap

• A 25-bp sequence appears at most once in the genome!
• Two overlapping sequences should share a 25-mer
• Two non-overlapping sequences should not!

SLIDE 9

Sorting k-mers

• Build a list of k-mers that appear in the sequences and their reverse complements
• Create a record with 4 entries:
  • K-mer
  • Sequence number
  • Position in the sequence
  • Reverse complementation flag
• Sort a vector of these according to k-mer
• If the number of records for a k-mer exceeds a threshold, discard them (why?)
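
The steps on this slide can be sketched in Python as follows. This is a minimal illustration, not Arachne's implementation: the function names, the tuple layout of a record, and the `max_records` parameter are assumptions made for the example, and reads are assumed to contain only A, C, G and T.

```python
from itertools import groupby

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(COMPLEMENT)[::-1]


def kmer_records(reads, k):
    """Build and sort 4-entry records: (k-mer, read number, position, rc flag)."""
    records = []
    for read_id, read in enumerate(reads):
        for is_rc, seq in ((False, read), (True, reverse_complement(read))):
            for pos in range(len(seq) - k + 1):
                records.append((seq[pos:pos + k], read_id, pos, is_rc))
    records.sort()  # tuples sort by k-mer first, so equal k-mers become adjacent
    return records


def shared_kmer_groups(records, max_records):
    """Group sorted records by k-mer, dropping k-mers with too many records."""
    groups = {}
    for kmer, group in groupby(records, key=lambda rec: rec[0]):
        group = list(group)
        if len(group) <= max_records:  # assumed threshold rule
            groups[kmer] = group
    return groups


def candidate_pairs(groups):
    """Pairs of distinct reads that share at least one retained k-mer."""
    pairs = set()
    for group in groups.values():
        ids = sorted({read_id for _, read_id, _, _ in group})
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                pairs.add((a, b))
    return pairs
```

For example, `candidate_pairs(shared_kmer_groups(kmer_records(reads, 25), max_records=100))` would return the read pairs worth aligning. Only reads that land in the same group are ever compared, so the cost depends on the number of shared k-mers rather than on the number of read pairs.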

SLIDE 10

Phase 2-4 of Alignment module

• Coalesce k-mer hits into longer, gap-free partial alignments.
• These extended k-mer hits are saved.
• For each pair of sequences, form a directed graph.
• For each maximal path in the graph, construct an alignment.
• Refine alignment via banded DP
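
The first step, coalescing k-mer hits, can be sketched as below. This is a simplified illustration under stated assumptions: hits for one pair of reads are given as (position in read a, position in read b), hits on the same diagonal that overlap or abut are merged, and the function name and output layout are hypothetical. The graph construction and banded DP steps are not shown.

```python
from collections import defaultdict


def coalesce_hits(hits, k):
    """Merge exact k-mer hits into gap-free segments.

    hits: iterable of (pos_a, pos_b) exact k-mer matches between two reads.
    Returns a sorted list of (start_a, start_b, length) segments.
    """
    by_diagonal = defaultdict(list)
    for pos_a, pos_b in hits:
        # Hits with the same offset pos_a - pos_b lie on one diagonal,
        # so they can be joined without introducing a gap.
        by_diagonal[pos_a - pos_b].append(pos_a)

    segments = []
    for diagonal, starts in by_diagonal.items():
        starts.sort()
        seg_start, seg_end = starts[0], starts[0] + k
        for s in starts[1:]:
            if s <= seg_end:          # overlaps or abuts the current segment
                seg_end = max(seg_end, s + k)
            else:                     # unmatched bases in between: new segment
                segments.append((seg_start, seg_start - diagonal, seg_end - seg_start))
                seg_start, seg_end = s, s + k
        segments.append((seg_start, seg_start - diagonal, seg_end - seg_start))
    return sorted(segments)
```

Each returned segment is an exact match between the two reads, because every base in it is covered by at least one matching k-mer on the same diagonal.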

SLIDE 11

Detecting Chimeric reads

• Chimeric reads: Reads that contain sequence from two genomic locations.
• Good overlaps: G(a,b) if a,b overlap with a high score
• Transitive overlap: T(a,c) if G(a,b), and G(b,c)
• Find a point x across which only transitive overlaps occur. X is a point of chimerism
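
One possible reading of this test is sketched below, and it is an assumption rather than Arachne's actual rule: a point x on a read is flagged when good overlaps exist on both sides of x but none of them spans x. The function name and the interval representation (half-open spans of the read covered by each good overlap) are hypothetical.

```python
def chimeric_points(read_len, overlap_intervals):
    """Candidate points of chimerism on one read.

    overlap_intervals: (start, end) half-open spans of the read, one per
    good overlap with another read. A point x is the boundary between
    base x - 1 and base x.
    """
    points = []
    for x in range(1, read_len):
        # An overlap crosses x if it covers both base x - 1 and base x.
        spanned = any(s < x < e for s, e in overlap_intervals)
        left = any(e <= x for s, e in overlap_intervals)
        right = any(s >= x for s, e in overlap_intervals)
        if not spanned and left and right:
            points.append(x)
    return points
```

A read whose two halves each overlap other reads well, with no single overlap bridging the join, is reported; a read covered by overlaps that bridge every position is not.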

SLIDE 12

Repeats

SLIDE 13

Contig assembly

• Reads are merged into contigs up to repeat boundaries.
• If (a,b) and (a,c) overlap, then (b,c) should overlap as well. Also:
  • shift(a,c) = shift(a,b) + shift(b,c)
• Most of the contigs are unique pieces of the genome, and end at some repeat boundary.
• Some contigs might be entirely within repeats. These must be detected
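
The shift condition above can be written as a small check. This is a sketch with assumed names: `shift` is taken to be a dictionary from ordered read pairs to the offset between their start positions, and the `tolerance` parameter (not on the slide) allows for small alignment differences.

```python
def consistent_triple(shift, a, b, c, tolerance=0):
    """True if all three overlaps exist and shift(a,c) = shift(a,b) + shift(b,c)."""
    needed = [(a, b), (b, c), (a, c)]
    if any(pair not in shift for pair in needed):
        return False  # a missing overlap already breaks the expected pattern
    expected = shift[(a, b)] + shift[(b, c)]
    return abs(shift[(a, c)] - expected) <= tolerance
```

A triple that fails this check suggests that at least one of the overlaps is induced by a repeat, which is where contig extension should stop.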

SLIDE 14

Detecting Repeat Contigs 1: Read Density

• Compute the log-odds ratio of two hypotheses:
  • H1: The contig is from a unique region of the genome.
  • H2: The contig is from a region that is repeated at least twice
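
A minimal sketch of such a log-odds score follows. The slide does not give the statistical model, so the following are assumptions: read starts are Poisson distributed, a unique contig expects lam = N * (contig length) / G reads, and the repeat hypothesis is simplified to exactly two copies (expected count 2 * lam). The function names are hypothetical.

```python
import math


def poisson_log_pmf(k, lam):
    """Natural log of the Poisson probability of observing k events at rate lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def unique_vs_repeat_log_odds(reads_in_contig, contig_len, total_reads, genome_len):
    """log P(reads | unique) - log P(reads | two-copy repeat).

    Positive values favour H1 (unique), negative values favour H2 (repeat).
    """
    lam = total_reads * contig_len / genome_len
    return (poisson_log_pmf(reads_in_contig, lam)
            - poisson_log_pmf(reads_in_contig, 2 * lam))
```

Under these assumptions the score simplifies to lam - k * ln 2, where k is the observed read count, so a contig holding about twice the expected number of reads scores negative and is flagged as a likely repeat.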

SLIDE 15

Creating Super Contigs

SLIDE 16

Supercontig assembly

• Supercontigs are built incrementally
• Initially, each contig is a supercontig.
• In each round, a pair of supercontigs is merged, until no more merges can be performed.
• Create a Priority Queue with a score for every pair of 'mergeable supercontigs'.
  • Score has two terms:
    • A reward for multiple mate-pair links
    • A penalty for distance between the links.

SLIDE 17

Supercontig merging

• Remove the top scoring pair (S1,S2) from the priority queue.
• Merge (S1,S2) to form supercontig T.
• Remove all pairs in Q containing S1 or S2
• Find all supercontigs W that share mate-pair links with T and insert (T,W) into the priority queue.
• Detect Repeated Supercontigs and remove
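
The merging loop can be sketched with Python's `heapq` module. Everything specific here is an assumption made for illustration: the linear score with its `link_reward` and `gap_penalty` weights, the rule that only positive-scoring pairs are mergeable, the way link counts and gaps are combined after a merge, and the generated names `T0`, `T1`, ... (which must not clash with input contig names). Stale queue entries are skipped when popped instead of being removed eagerly, and repeat detection is not shown.

```python
import heapq


def score(num_links, gap, link_reward=10.0, gap_penalty=0.01):
    """Hypothetical score: reward for mate-pair links, penalty for distance."""
    return link_reward * num_links - gap_penalty * abs(gap)


def merge_supercontigs(contigs, links):
    """Greedy merging. links maps frozenset({a, b}) -> (num_links, gap)."""
    members = {c: (c,) for c in contigs}   # initially each contig is a supercontig
    links = dict(links)
    heap = [(-score(n, g), tuple(sorted(pair))) for pair, (n, g) in links.items()]
    heapq.heapify(heap)                    # min-heap on negated score
    next_id = 0
    while heap:
        neg_score, (s1, s2) = heapq.heappop(heap)
        if -neg_score <= 0:
            break                          # no mergeable pair is left
        if s1 not in members or s2 not in members:
            continue                       # stale pair: S1 or S2 was merged earlier
        t = f"T{next_id}"
        next_id += 1
        members[t] = members.pop(s1) + members.pop(s2)
        # Move every link that touched S1 or S2 over to T.
        moved = {}
        for pair in list(links):
            if s1 in pair or s2 in pair:
                n, g = links.pop(pair)
                others = pair - {s1, s2}
                if others:
                    (w,) = others
                    old_n, old_g = moved.get(w, (0, g))
                    moved[w] = (old_n + n, min(old_g, g))
        for w, (n, g) in moved.items():
            links[frozenset((t, w))] = (n, g)
            heapq.heappush(heap, (-score(n, g), tuple(sorted((t, w)))))
    return sorted(tuple(sorted(m)) for m in members.values())
```

Lazy deletion keeps the sketch short: removing arbitrary entries from a binary heap is awkward, so pairs that mention an already merged supercontig are simply discarded when they reach the top.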

SLIDE 18

Repeat Supercontigs

• If the distance between two supercontigs is not correct, they are marked as Repeated
• If transitivity is not maintained, then there is a Repeat

SLIDE 19

Filling gaps in Supercontigs

SLIDE 20

Consensus Derivation

• Consensus sequence is created by converting pairwise read alignments into multiple-read alignments
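
As a rough illustration of calling a consensus from a multiple-read alignment, the sketch below takes a majority vote in each column. It assumes a gap-free layout in which each read is given with its offset on a common coordinate system. That is a simplification, since real alignments contain insertions and deletions, and the function name and "N" placeholder are assumptions.

```python
from collections import Counter


def consensus(layout):
    """Column-wise majority vote.

    layout: list of (offset, read) pairs; base i of a read sits in
    column offset + i. Columns covered by no read are reported as "N".
    """
    length = max(offset + len(read) for offset, read in layout)
    columns = [Counter() for _ in range(length)]
    for offset, read in layout:
        for i, base in enumerate(read):
            columns[offset + i][base] += 1
    return "".join(
        col.most_common(1)[0][0] if col else "N" for col in columns
    )
```

With enough coverage, an isolated sequencing error in one read is outvoted by the other reads covering the same column.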

SLIDE 21

Summary

• Whole genome shotgun is now routine:
  • Human, Mouse, Rat, Dog, Chimpanzee, ...
  • Many Prokaryotes (One can be sequenced in a day)
  • Plant genomes: Arabidopsis, Rice
  • Model organisms: Worm, Fly, Yeast
• A lot is not known about genome structure, organization and function.
  • Comparative genomics offers low hanging fruit

SLIDE 22

The central dogma again

[Figure labels: Protein Sequence Analysis; Sequence Analysis; Gene Finding; Assembly]

SLIDE 23

Much other analysis is possible

[Figure labels: Protein Sequence Analysis; Sequence Analysis; Gene Finding; Assembly; ncRNA; Genomic Analysis / Pop. Genetics]

SLIDE 24

A static picture of the cell is insufficient

• Each cell is continuously active:
  • Genes are being transcribed into RNA
  • RNA is translated into proteins
  • Proteins are post-translationally (PT) modified and transported
  • Proteins perform various cellular functions
• Can we probe the cell dynamically?