Seriation and de novo genome assembly
Antoine Recanati, CNRS & ENS with Alexandre d’Aspremont, Fajwel Fogel, Thomas Brüls, CNRS - ENS Paris & Genoscope.
Institut Curie, Octobre 2016, 1/17
Pairwise similarity information $A_{ij}$ on $n$ variables.
Suppose the data has a serial structure, i.e. there is an order π such that
Genomes are cloned multiple times and randomly cut into shorter reads
Reorder the reads to recover the genome.
Compute overlap between all read pairs.
Reorder overlap matrix to recover read order.
Average the read values to create a consensus sequence.
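The consensus step can be illustrated with a minimal sketch. It assumes the reads have already been placed on a common coordinate system by the layout step, with `-` marking positions a read does not cover, and it uses a plain per-column majority vote as a stand-in for "averaging". The function name `consensus` and the toy reads are hypothetical.

```python
from collections import Counter

def consensus(aligned_reads):
    """Majority vote per column over reads padded with '-' where they give no coverage."""
    length = max(len(read) for read in aligned_reads)
    out = []
    for col in range(length):
        # Collect the bases of all reads that cover this column.
        bases = [r[col] for r in aligned_reads if col < len(r) and r[col] != "-"]
        if bases:
            out.append(Counter(bases).most_common(1)[0][0])
    return "".join(out)

# Toy layout: three overlapping reads, the second one carries a base-calling error.
aligned = [
    "ACGTAC----",
    "--GTTCGA--",
    "----ACGATT",
]
result = consensus(aligned)
```

With enough coverage, an isolated error in one read is outvoted by the other reads covering the same position.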
There are base calling errors in the reads, typically 2% to 20% depending on
Entire parts of the genome are repeated, which breaks the serial structure.
Next generation : short reads (∼ 400bp), few errors (∼ 2%). Repeats are
Third generation : long reads (∼ 10kbp), more errors (∼ 15%). Can resolve
With short accurate reads, the reordering problem is solved by
With long noisy reads, reads are corrected before assembly (hybrid correction
Layout and consensus not clearly separated, many heuristics...
miniasm: first long raw reads straight assembler (but consensus sequence is as
Introduction
Combinatorial problem
Spectral relaxation
Results (Application to genome assembly)
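The 2-SUM problem is written
$$\min_{\pi \in \mathcal{P}} \sum_{i,j=1}^{n} A_{ij} \, (\pi(i) - \pi(j))^2$$
Define $L_A = \mathrm{diag}(A\mathbf{1}) - A$, the Laplacian of $A$. Up to a constant factor, the 2-SUM problem is equivalently written
$$\min_{\pi \in \mathcal{P}} \pi^T L_A \pi$$
since, for any vector $x$ and symmetric $A$,
$$
\begin{aligned}
x^T L_A x &= \sum_{i=1}^{n} x_i^2 \Big(\sum_{j=1}^{n} A_{ij}\Big) - \sum_{i,j=1}^{n} A_{ij} x_i x_j \\
&= \sum_{i,j=1}^{n} A_{ij} (x_i^2 - x_i x_j) \\
&= \frac{1}{2} \sum_{i,j=1}^{n} A_{ij} (x_j^2 + x_i^2 - 2 x_i x_j) \\
&= \frac{1}{2} \sum_{i,j=1}^{n} A_{ij} (x_i - x_j)^2
\end{aligned}
$$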
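The identity between the Laplacian quadratic form and the weighted sum of squared differences can be checked numerically. The sketch below assumes a random symmetric toy matrix; the helper names `laplacian` and `two_sum` are illustrative and not taken from the slides.

```python
import numpy as np

def laplacian(A):
    """Return L_A = diag(A 1) - A for a symmetric similarity matrix A."""
    return np.diag(A.sum(axis=1)) - A

def two_sum(A, x):
    """Return (1/2) * sum_ij A_ij (x_i - x_j)^2."""
    diff = x[:, None] - x[None, :]
    return 0.5 * np.sum(A * diff ** 2)

rng = np.random.default_rng(0)
n = 6
B = rng.random((n, n))
A = (B + B.T) / 2                # symmetric toy similarity matrix
pi = rng.permutation(n) + 1.0    # a permutation of 1..n, seen as a vector

quad_form = pi @ laplacian(A) @ pi   # pi^T L_A pi
pairwise = two_sum(A, pi)            # (1/2) sum_ij A_ij (pi_i - pi_j)^2
```

The two quantities agree up to floating-point error, and the symmetry of `A` is what makes the third step of the derivation valid.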
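$$\min_{\pi \in \mathcal{P}} \pi^T L_A \pi$$
$\|\pi\|_2^2 = \frac{n(n+1)(2n+1)}{6}$ for every permutation vector $\pi$.
Let $c = \frac{n+1}{2}\mathbf{1}$. $L_A \mathbf{1} = 0$. Subtracting $c$ from any vector $\pi$ does not change the objective.
Relaxing the permutation constraint after centering and normalizing gives
$$\min_{\mathbf{1}^T x = 0,\ \|x\|_2 = 1} x^T L_A x$$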
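The relaxed problem is an eigenvalue problem: for a connected similarity graph its minimizer is the eigenvector of $L_A$ with the second-smallest eigenvalue (the Fiedler vector), and sorting its entries gives a candidate ordering. Below is a minimal sketch on a synthetic matrix whose entries decay with $|i - j|$; the function name `spectral_order` and the toy data are assumptions made for illustration.

```python
import numpy as np

def spectral_order(A):
    """Order items by sorting the Fiedler vector of the Laplacian of A."""
    L = np.diag(A.sum(axis=1)) - A
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                # eigenvector of the second-smallest eigenvalue
    return np.argsort(fiedler)

# Toy similarity matrix with serial structure: A_ij decays with |i - j|.
n = 30
idx = np.arange(n)
A_true = np.maximum(0, 5 - np.abs(idx[:, None] - idx[None, :])).astype(float)

# Hide the order by shuffling rows and columns with the same permutation.
rng = np.random.default_rng(0)
shuffle = rng.permutation(n)
A_shuffled = A_true[np.ix_(shuffle, shuffle)]

order = spectral_order(A_shuffled)
recovered = shuffle[order]   # original indices, listed in the recovered order
```

Because an eigenvector is defined only up to sign, the recovered order may come out reversed; both directions are equally valid for seriation.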
Spectral solution is easy to compute and scales well.
But it is sensitive and not flexible (hard to include additional structural constraints).
Other (convex) relaxations handle structural constraints.
Overlap: computed from k-mers, yielding a similarity matrix A
Layout: A is thresholded to remove noise-induced overlaps, and reordered
Consensus: Genome sliced in windows
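The overlap step can be sketched by counting k-mers shared between reads. This is a simplified illustration that assumes exact set intersection on error-free toy reads; real overlappers rely on hashing and sampling of k-mers. The names `kmers`, `kmer_similarity` and the `threshold` parameter are hypothetical.

```python
from itertools import combinations
import numpy as np

def kmers(read, k):
    """Return the set of all substrings of length k in a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}

def kmer_similarity(reads, k=4, threshold=0):
    """Similarity A_ij = number of shared k-mers, zeroed when not above threshold."""
    n = len(reads)
    sets = [kmers(r, k) for r in reads]
    A = np.zeros((n, n))
    for i, j in combinations(range(n), 2):
        shared = len(sets[i] & sets[j])
        if shared > threshold:
            A[i, j] = A[j, i] = shared
    return A

# Toy genome cut into overlapping reads of length 12, one every 4 bases.
genome = "ACGTTGCAAGGCTTAACCGGTATCGATCGGA"
reads = [genome[s:s + 12] for s in range(0, 20, 4)]
A = kmer_similarity(reads, k=4)
```

Reads that are adjacent on the genome share more k-mers than distant ones, which is the serial structure the layout step exploits.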
Long raw reads (Oxford Nanopore Technologies)
Overlaps computed with minimap: hashing k-mers
Threshold on similarity matrix to remove false overlaps
[Figure: histogram of read lengths (x-axis: read length, y-axis: frequency (count)). Mean: 6863, median: 7002, min: 327, max: 25494, >7 kbp: 50%]
Two bacterial genomes: E. coli and A. baylyi
Circular genomes, size ∼ 4 Mbp
A few connected components after threshold
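Thresholding the similarity matrix can split the overlap graph into several connected components, each of which would then be ordered separately. The sketch below labels components with a breadth-first search; the function name `connected_components`, the toy matrix and the threshold values are assumptions made for illustration.

```python
from collections import deque
import numpy as np

def connected_components(A, threshold):
    """Label the components of the graph that keeps edges with A_ij > threshold."""
    adjacency = A > threshold
    n = A.shape[0]
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue                      # already reached from an earlier node
        labels[start] = current
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in np.flatnonzero(adjacency[u]):
                if labels[v] == -1:
                    labels[v] = current
                    queue.append(v)
        current += 1
    return labels

# Toy similarity matrix: two groups of reads linked only by weak overlaps.
A = np.array([
    [0, 9, 1, 0, 0],
    [9, 0, 8, 0, 0],
    [1, 8, 0, 1, 0],
    [0, 0, 1, 0, 7],
    [0, 0, 0, 7, 0],
], dtype=float)

labels_low = connected_components(A, threshold=0)
labels_high = connected_components(A, threshold=2)
```

Raising the threshold removes the weak overlaps and disconnects the two groups, which is the trade-off between noise removal and fragmentation.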
16 chromosomes
Many repeats
Higher threshold on similarity matrix ⇒ many connected components
[Figure: true ordering (from BWA) vs. recovered ordering (spectral ordering)]
Equivalence 2-SUM ⇐
Layout correctly found by spectral relaxation for bacterial genomes (with
Consensus computed by MSA in sliding windows ⇒∼ 99% avg. identity with
Additional information could help assemble more complex genomes (e.g.
Other problems involving seriation?
Convex relaxations can also handle constraints (e.g. |π(i) − π(j)| ≤ k) for
References
J.E. Atkins, E.G. Boman, B. Hendrickson, et al. A spectral algorithm for seriation and the consecutive ones problem. SIAM J. Comput., 28(1):297–310, 1998.
Avrim Blum, Goran Konjevod, R. Ravi, and Santosh Vempala. Semidefinite relaxations for minimum bandwidth and other vertex ordering
Moses Charikar, Mohammad Taghi Hajiaghayi, Howard Karloff, and Satish Rao. $\ell_2^2$ spreading metrics for vertex ordering problems. Algorithmica, 56(4):577–604, 2010.
paper, 2008.
Guy Even, Joseph Seffi Naor, Satish Rao, and Baruch Schieber. Divide-and-conquer approximation algorithms via spreading metrics. Journal
Uriel Feige. Approximating the bandwidth via volume respecting embeddings. Journal of Computer and System Sciences, 60(3):510–539, 2000.
Uriel Feige and James R. Lee. An improved approximation ratio for the minimum linear arrangement problem. Information Processing Letters, 101(1):26–29, 2007.
Michel X. Goemans. Smallest compact formulation for the permutahedron. Mathematical Programming, pages 1–7, 2014.
David G. Kendall. Incidence matrices, interval graphs and seriation in archeology. Pacific Journal of Mathematics, 28(3):565–570, 1969.
Cong Han Lim and Stephen J. Wright. Beyond the Birkhoff polytope: Convex relaxations for vector permutation problems. arXiv preprint arXiv:1407.6609, 2014.
109(2):283–317, 2007.
Satish Rao and Andréa W. Richa. New approximation techniques for some linear ordering problems. SIAM Journal on Computing, 34(2):388–404, 2005.
Anthony Man-Cho So. Moment inequalities for sums of random matrices and their applications in optimization. Mathematical Programming, 130(1):125–151, 2011.
Once the layout is computed and fine-grained, slicing in windows
Multiple Sequence Alignment using Partial Order Graphs (POA) in windows
Windows merging
[Figure: windows 1, 2 and 3; POA in each window gives consensus 1, 2 and 3, which are merged into consensus (1+2), then consensus ((1+2)+3)]
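The 2-SUM problem is written
$$\min_{\pi \in \mathcal{P}} \sum_{i,j=1}^{n} A_{ij} \, (\pi(i) - \pi(j))^2$$
or equivalently, with $L_A = \mathrm{diag}(A\mathbf{1}) - A$ the Laplacian of $A$,
$$\min_{\pi \in \mathcal{P}} \pi^T L_A \pi$$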
NP-Complete for generic matrices A.
$$\min_{\pi \in \mathcal{P}} \sum_{i,j=1}^{n} A_{ij} \, (\pi(i) - \pi(j))^2$$
Gives a spectral (hence polynomial) solution for 2-SUM on some R-matrices.
Write a convex relaxation for 2-SUM and seriation.
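Let $\mathcal{D}_n$ be the set of doubly stochastic matrices, where
$$\mathcal{D}_n = \{X \in \mathbb{R}^{n \times n} : X \ge 0,\ X\mathbf{1} = \mathbf{1},\ X^T\mathbf{1} = \mathbf{1}\}$$
Notice that $\mathcal{P} = \mathcal{D} \cap \mathcal{O}$, i.e. Π is a permutation matrix if and only if Π is both doubly stochastic and orthogonal.
Solve
$$
\begin{array}{ll}
\text{minimize} & \mathrm{Tr}(Y^T \Pi^T L_A \Pi Y) - \mu \|P\Pi\|_F^2 \\
\text{subject to} & e_1^T \Pi g + 1 \le e_n^T \Pi g, \\
& \Pi\mathbf{1} = \mathbf{1},\ \Pi^T\mathbf{1} = \mathbf{1},\ \Pi \ge 0,
\end{array}
$$
where $P = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$ and $Y \in \mathbb{R}^{n \times p}$ is a matrix with columns $y_i$.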
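2-SUM term $\mathrm{Tr}(Y^T \Pi^T L_A \Pi Y) = \sum_{i=1}^{p} y_i^T \Pi^T L_A \Pi y_i$ where $y_i$ are small
Orthogonalization penalty $-\mu \|P\Pi\|_F^2$, where $P = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$.
$e_1^T \Pi g + 1 \le e_n^T \Pi g$ breaks degeneracies by imposing π(1) ≤ π(n). Without it,
$\Pi\mathbf{1} = \mathbf{1}$, $\Pi^T\mathbf{1} = \mathbf{1}$ and $\Pi \ge 0$ keep Π doubly stochastic.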
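The role of the constraints and of the penalty can be checked on small matrices. The sketch below assumes $g = (1, \dots, n)^T$, which is not stated in the surviving text, and compares a permutation matrix with the uniform doubly stochastic matrix $\frac{1}{n}\mathbf{1}\mathbf{1}^T$. The helper names are hypothetical.

```python
import numpy as np

def is_doubly_stochastic(X, tol=1e-9):
    """Check X >= 0 with all row sums and column sums equal to 1."""
    return bool((X >= -tol).all()
                and np.allclose(X.sum(axis=0), 1)
                and np.allclose(X.sum(axis=1), 1))

def is_orthogonal(X):
    """Check X^T X = I."""
    return bool(np.allclose(X.T @ X, np.eye(X.shape[0])))

def penalty_norm(X):
    """Return ||P X||_F^2 with P = I - (1/n) 1 1^T."""
    n = X.shape[0]
    P = np.eye(n) - np.ones((n, n)) / n
    return np.linalg.norm(P @ X, "fro") ** 2

n = 5
perm = np.array([2, 0, 4, 1, 3])
Pi = np.eye(n)[perm]              # permutation matrix: row i is e_{perm[i]}
uniform = np.ones((n, n)) / n     # doubly stochastic but not a permutation
g = np.arange(1, n + 1)           # assumed reference vector (1, ..., n)

permuted = Pi @ g                 # the permutation, read as a vector
tie_break_ok = permuted[0] + 1 <= permuted[-1]   # e_1^T Pi g + 1 <= e_n^T Pi g
```

For a permutation matrix, $\|P\Pi\|_F^2 = \mathrm{Tr}(P) = n - 1$, while the uniform matrix gives 0, so subtracting $\mu\|P\Pi\|_F^2$ favors solutions close to permutations within the doubly stochastic set.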
Relaxations for orthogonality constraints, e.g. SDPs in [Nemirovski, 2007,
O(√log n) approximation bounds for Minimum Linear Arrangement [Even
All these relaxations form extremely large SDPs.
Semi-Supervised Seriation. We can add structural constraints to the
$e_i^T \Pi g - e_j^T \Pi g \le b$.
Sampling permutations. We can generate permutations from a doubly
Longer reads. Average 10k base pairs in early experiments. Compared with
High error rate. About 20% compared with a few percent for existing
Real-time data. Sequencing data flows continuously.