Relaxations of the Seriation Problem and Applications to de novo Genome Assembly

SLIDE 1

Relaxations of the Seriation Problem and Applications to de novo Genome Assembly

PhD thesis defense

Antoine Recanati

Supervised by Alexandre d’Aspremont
29 November 2018

SLIDE 2

Introduction

SLIDE 3

Genome sequencing

...ATGGCGTGCAATG...
...TACCGCACGTTAC...
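The two lines above are base-wise complements of each other (A pairs with T, G pairs with C). A minimal sketch that checks this; the function name `complement` is a hypothetical helper, not something from the talk:

```python
# Watson-Crick base pairing: A <-> T, G <-> C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(strand: str) -> str:
    """Return the base-wise complement of a DNA strand (same reading direction)."""
    return "".join(PAIRS[base] for base in strand)

top = "ATGGCGTGCAATG"
bottom = complement(top)
print(bottom)  # TACCGCACGTTAC
```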

SLIDE 8

DNA sequencing

Image: Nik Spencer/Nature

The genome is cut into overlapping fragments (reads).

Ex: ATGGCGTGCAATG
Reads: CGTGCAA, ATGGCGT, TGCAATG, GGCGTGC
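A toy sketch of how the four example reads can be obtained from the example sequence. It assumes fixed-length windows taken at a regular step, which is a simplification: real sequencers produce reads at random positions and of varying length. The name `cut_into_reads` and its parameters are hypothetical.

```python
def cut_into_reads(genome: str, read_len: int = 7, step: int = 2) -> list[str]:
    """Return the windows of length read_len starting every `step` bases."""
    last_start = len(genome) - read_len
    return [genome[i:i + read_len] for i in range(0, last_start + 1, step)]

reads = cut_into_reads("ATGGCGTGCAATG")
print(reads)  # ['ATGGCGT', 'GGCGTGC', 'CGTGCAA', 'TGCAATG']
```

The slide lists the same four reads in shuffled order, which reflects that read positions are not known after sequencing.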

SLIDE 9

Assembly

Goal: assemble the reads to reconstruct the full sequence. The position and ordering of the reads are unknown.

Reads: CGTGCAA, ATGGCGT, TGCAATG, GGCGTGC

Full sequence: ATGGCGTGCAATG

SLIDE 18

Genome assembly: mapping

If a reference genome is available: map the fragments to it, then derive the consensus sequence.

Reads: CGTGCAA, ATGGCGT, TGCAATG, GGCGTGC

Reference (proxy): AAGGCGTGCATTG

Assembly: ATGGCGTGCAATG
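A minimal sketch of reference-based assembly on this toy example. It assumes reads are placed at the offset with the fewest mismatches (Hamming distance) and that the consensus is a per-position majority vote; real mappers use indexed, gap-aware alignment. The names `map_read` and `consensus_from_mapping` are hypothetical.

```python
from collections import Counter

def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def map_read(read: str, reference: str) -> int:
    """Offset in the reference where the read has the fewest mismatches."""
    offsets = range(len(reference) - len(read) + 1)
    return min(offsets, key=lambda i: hamming(read, reference[i:i + len(read)]))

def consensus_from_mapping(reads: list[str], reference: str) -> str:
    """Majority vote of mapped reads; uncovered positions keep the reference base."""
    votes = [Counter() for _ in reference]
    for read in reads:
        start = map_read(read, reference)
        for k, base in enumerate(read):
            votes[start + k][base] += 1
    return "".join(
        v.most_common(1)[0][0] if v else ref_base
        for v, ref_base in zip(votes, reference)
    )

reads = ["CGTGCAA", "ATGGCGT", "TGCAATG", "GGCGTGC"]
reference = "AAGGCGTGCATTG"  # proxy reference, differs from the sample at two bases
print(consensus_from_mapping(reads, reference))  # ATGGCGTGCAATG
```

The reference only guides the placement of the reads; the two positions where it differs from the sample are overruled by the reads that cover them.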

SLIDE 23

Genome assembly: de novo

No reference available. Greedy assembly: take one read, “add” the one with the largest overlap, and so on, until all reads are included.

Reads: CGTGCAA, ATGGCGT, TGCAATG, GGCGTGC

Assembly: ATGGCGTGCAATG
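The greedy procedure can be sketched as follows, assuming exact (error-free) suffix/prefix overlaps and allowing the growing contig to be extended on either side. The names `overlap` and `greedy_assemble` are hypothetical, and this is an illustration on the toy reads rather than a production assembler.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads: list[str]) -> str:
    """Start from the first read and repeatedly attach the best-overlapping read."""
    contig, remaining = reads[0], list(reads[1:])
    while remaining:
        candidates = []
        for r in remaining:
            candidates.append((overlap(contig, r), "right", r))  # r extends the end
            candidates.append((overlap(r, contig), "left", r))   # r extends the start
        k, side, r = max(candidates, key=lambda c: c[0])
        contig = contig + r[k:] if side == "right" else r[:len(r) - k] + contig
        remaining.remove(r)
    return contig

reads = ["CGTGCAA", "ATGGCGT", "TGCAATG", "GGCGTGC"]
print(greedy_assemble(reads))  # ATGGCGTGCAATG
```

Greedy choices are local: with repeated substrings or sequencing errors, the largest overlap at a given step is not necessarily the correct one.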

SLIDE 24

De novo assembly paradigms

  • Greedy methods
  • De Bruijn graphs
  • Overlap-Layout-Consensus

SLIDE 25

Overlap-Layout-Consensus

  • Compute overlaps between all read pairs
  • Find tiling of reads consistent with overlaps
  • Average the read values to create the consensus sequence

Reads: ATGGCGT, CGTGCAA, TGCAATG, GGCGTGC

Overlaps: GGCGT, CGTGC, TGCAA (length 5); CGT, TGC (length 3)

Consensus: ATGGCGTGCAATG
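The three steps can be sketched on the toy reads as follows. This assumes exact overlaps, and the layout step is a brute-force search over all orderings, used here only as a stand-in that is feasible for a handful of reads; it is not the layout method developed in the talk. The function names are hypothetical.

```python
from collections import Counter
from itertools import permutations

def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def overlap_layout_consensus(reads: list[str]) -> tuple[list[str], str]:
    n = len(reads)
    # 1. Overlap: suffix/prefix overlap length for every ordered pair of reads.
    ov = [[overlap(a, b) if i != j else 0 for j, b in enumerate(reads)]
          for i, a in enumerate(reads)]
    # 2. Layout: ordering that maximizes the sum of consecutive overlaps (toy, O(n!)).
    order = max(permutations(range(n)),
                key=lambda p: sum(ov[p[i]][p[i + 1]] for i in range(n - 1)))
    # 3. Consensus: place reads at their offsets and take a majority vote per position.
    votes, start = [], 0
    for rank, idx in enumerate(order):
        if rank > 0:
            prev = order[rank - 1]
            start += len(reads[prev]) - ov[prev][idx]
        for k, base in enumerate(reads[idx]):
            while len(votes) <= start + k:
                votes.append(Counter())
            votes[start + k][base] += 1
    consensus = "".join(v.most_common(1)[0][0] for v in votes)
    return [reads[i] for i in order], consensus

reads = ["ATGGCGT", "CGTGCAA", "TGCAATG", "GGCGTGC"]
layout, consensus = overlap_layout_consensus(reads)
print(layout)     # ['ATGGCGT', 'GGCGTGC', 'CGTGCAA', 'TGCAATG']
print(consensus)  # ATGGCGTGCAATG
```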

SLIDE 26

Modern sequencing technologies

  • 2nd gen. (SGS): short (∼100bp), accurate (< 2% err.) reads (Illumina/Solexa), with pairing information. De Bruijn graph methods (built on a k-mer graph) preferred.
  • 3rd gen.: long (∼10000bp), noisy (∼10% err.) reads (Pacific Biosciences [PacBio], Oxford Nanopore Technologies [ONT]). Comeback of OLC methods.
  • Can be combined to get both accuracy and length (hybrid methods).
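For the k-mer based graph mentioned in the first bullet, a minimal sketch of one common construction: nodes are (k-1)-mers and each k-mer found in a read adds an edge from its prefix to its suffix. The choice k = 4 and the name `de_bruijn_graph` are assumptions for illustration; the sketch ignores sequencing errors and k-mer multiplicities.

```python
from collections import defaultdict

def de_bruijn_graph(reads: list[str], k: int = 4) -> dict[str, set[str]]:
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)

reads = ["CGTGCAA", "ATGGCGT", "TGCAATG", "GGCGTGC"]
graph = de_bruijn_graph(reads)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Because the graph is built from k-mers rather than from pairwise read comparisons, it suits large numbers of short, accurate reads; a high error rate creates many spurious k-mers.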

SLIDE 27

De novo assembly methods with ONT reads

State of the art: Canu (derived from Celera Assembler). Heavy pre-processing, many heuristics:

  • correction (uses [hash-based] overlaps for consensus)
  • trimming: recalculates overlaps to filter low-coverage/high-error regions
  • re-computation of overlaps with specific target error rates (uses an a priori error model)
  • assembles unitigs (unambiguous sequences) first, then incremental scaffolding

SLIDE 28

De novo assembly methods with ONT reads

  • ONT-only assemblers (non-hybrid): active field of research, 2015 to now
  • Canu: complex pipeline, high-quality consensus.
  • Miniasm: ideas of Canu assembly, no pre-processing, smart heuristics. Ultra-fast, low-quality.
  • Naive OLC approach with a clean mathematical formulation?

SLIDE 29

  • Introduction
  • De novo Genome Assembly
  • Seriation
  • Application of the Spectral Method to Genome Assembly
  • Robust Seriation
  • Multi-dimensional spectral ordering
  • Conclusion