On the representation of de Bruijn Graphs
Rayan Chikhi
joint work with P. Medvedev, A. Limasset, S. Jackman, J. Simpson
- Univ. Lille
ALEA 2016
de Bruijn graph
sequence:
k-mers (k=3): ATT TTA ...
Nodes: k-mers (words of length k)
Edges: exact suffix-prefix overlaps of length k − 1
Example nodes: GAT ATT TTA TAC CAT ACA CAA
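The node and edge definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's code: the input sequence `GATTACATTACAA` is a hypothetical choice whose 3-mers happen to match the node set listed on the slide, and the function names are made up for the example.

```python
from collections import defaultdict

def kmers(sequence, k):
    """Return the distinct k-mers of a sequence, in order of first occurrence."""
    seen = []
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer not in seen:
            seen.append(kmer)
    return seen

def de_bruijn_edges(nodes):
    """Connect u -> v when the (k-1)-suffix of u equals the (k-1)-prefix of v."""
    by_prefix = defaultdict(list)
    for v in nodes:
        by_prefix[v[:-1]].append(v)
    return {u: sorted(by_prefix.get(u[1:], [])) for u in nodes}

# Hypothetical sequence (an assumption, not taken from the slides).
nodes = kmers("GATTACATTACAA", 3)
graph = de_bruijn_edges(nodes)

for u in nodes:
    print(u, "->", graph[u])
```

Grouping nodes by their (k-1)-prefix avoids comparing every pair of k-mers when listing edges.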
Usages:
- de novo assembly of sequencing data
[Figure: Memory (GB), from [Birol 13]. Labels: Desktop computer, Bacterium, Cluster node, Mammal, Pine tree (20 Gbp). Values shown: 14, 244, 512, 4300.]
Nodes (e.g. GAT, ATG, TGA) are stored in a hash table.
Additional information: coverage, status, etc.
How to encode the de Bruijn graph using as little space as possible?
nodes only: {GAT, ATT, ...}
(human genome: k = 75, n = $3 \cdot 10^9$ k-mers)
Explicit list of k-mers: $2k \cdot n$ bits, 56 GB
[Conway, Bromage 11]: $\log_2 \binom{4^k}{n}$ bits, 44 GB
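The two space figures can be re-derived numerically. The sketch below is an illustration under stated assumptions: decimal gigabytes (10^9 bytes), and the approximation C(4^k, n) ≈ (4^k)^n / n!, which is reasonable only because n is tiny compared to 4^k. The function names are hypothetical.

```python
import math

def explicit_list_bits(k, n):
    """Store each of the n k-mers verbatim, 2 bits per nucleotide."""
    return 2 * k * n

def self_information_bits(k, n):
    """Approximate log2 C(4^k, n), assuming n << 4^k so that
    C(4^k, n) is close to (4^k)^n / n!."""
    log2_n_factorial = math.lgamma(n + 1) / math.log(2)
    return 2 * k * n - log2_n_factorial

def bits_to_gb(bits):
    """Convert bits to decimal gigabytes (10^9 bytes); the unit is an assumption."""
    return bits / 8 / 1e9

k, n = 75, 3 * 10**9
explicit_gb = bits_to_gb(explicit_list_bits(k, n))
self_info_gb = bits_to_gb(self_information_bits(k, n))
print(f"explicit list:    {explicit_gb:.2f} GB")
print(f"self-information: {self_info_gb:.2f} GB")
```

Under these assumptions the explicit list comes to 56.25 GB and the self-information bound lands in the mid-40s of GB, in line with the figures on the slide.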
[Figure: bits/node (axis ticks 4, 8, 16, 22) for self-information (k=27), BF (Bloom filter, k=27) [Chikhi, Rizk 12], [Salikhov et al. 13], and XBW [Bowe et al. 12].]
Why are they doing better? → different types of data structures
A membership data structure is a pair of algorithms (const, contains_node), where:
data ← const(G)
contains_node(data, kmer) returns {true, false}, whether kmer ∈ G
A navigational data structure (NDS) is a pair of algorithms (const, neighbors), where:
data ← const(G)
neighbors(data, kmer) returns the neighbors of kmer in G
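The two interfaces can be sketched as small Python classes. This is an illustrative simplification, not the representations discussed in the talk: both classes simply store the node set, `neighbors` returns out-neighbors only, and the helper `out_neighbors_via_membership` is a hypothetical name showing how membership queries on the four candidate extensions recover navigation.

```python
class MembershipDS:
    """const(G) stores the nodes; contains_node answers kmer ∈ G."""
    def __init__(self, nodes):
        self._nodes = frozenset(nodes)

    def contains_node(self, kmer):
        return kmer in self._nodes

class NavigationalDS:
    """const(G) precomputes out-neighbors; neighbors is only
    meaningful for k-mers that are nodes of G."""
    def __init__(self, nodes):
        node_set = set(nodes)
        self._out = {
            u: sorted(u[1:] + c for c in "ACGT" if u[1:] + c in node_set)
            for u in node_set
        }

    def neighbors(self, kmer):
        return self._out[kmer]

def out_neighbors_via_membership(membership, kmer):
    """Navigate with membership queries on the 4 candidate out-neighbors."""
    candidates = (kmer[1:] + c for c in "ACGT")
    return sorted(v for v in candidates if membership.contains_node(v))

nodes = ["GAT", "ATT", "TTA", "TAC", "CAT", "ACA", "CAA"]
membership = MembershipDS(nodes)
nds = NavigationalDS(nodes)
print(membership.contains_node("GAT"), membership.contains_node("GGG"))
print(nds.neighbors("ACA"), out_neighbors_via_membership(membership, "ACA"))
```

The helper shows why a membership structure can always be used for navigation, while an NDS makes no promise about k-mers outside the graph.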
NDS vs. Membership DS
Membership DS (e.g. hash table)
NDS: traverse dBG from known nodes
Recent techniques are NDS but not Membership DS
“For each node $x = x_1x_2x_3$, in-neighbor: $x_3x_1x_2$”
Valid for these two graphs:
AAT ATA TAA
AAG AGA GAA
So, 1 NDS ←→ >1 dBGs
1 Membership DS ←→ 1 dBG
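The claim that a single navigation rule serves both graphs can be checked directly. The sketch below uses hypothetical function names: it compares the rotation rule against the in-neighbors computed from each node set by (k-1)-overlap.

```python
def rotation_in_neighbor(x):
    """The quoted rule: the in-neighbor of x1x2x3 is x3x1x2."""
    return x[-1] + x[:-1]

def true_in_neighbors(nodes, x):
    """In-neighbors of x in the dBG on `nodes`: every y whose
    (k-1)-suffix equals the (k-1)-prefix of x."""
    return [y for y in nodes if y[1:] == x[:-1]]

def rule_is_valid(nodes):
    """True when the rotation rule gives exactly the in-neighbor of every node."""
    return all(
        true_in_neighbors(nodes, x) == [rotation_in_neighbor(x)]
        for x in nodes
    )

graph_1 = ["AAT", "ATA", "TAA"]
graph_2 = ["AAG", "AGA", "GAA"]
print(rule_is_valid(graph_1), rule_is_valid(graph_2))
```

Because the same rule answers correctly on both node sets, the rule alone cannot tell the two graphs apart, which is the sense in which one NDS corresponds to more than one dBG.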
We seek dBG representation lower bounds in the NDS model.
[Figure: bits/node (values shown: 2, 4, 8, 16, 22) for self-information (k=27), BF (k=27), XBW.]
Linear graphs
NDS for linear graphs need at least 2 bits/k-mer of space. Proof sketch:
same k-mer: ≈ $2^{2n}$ [Gagie 12]
≈ $2^{2n}$
NDS need at least 3.24 bits/k-mer. Proof sketch:
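Construct a family of N graphs such that, for any two graphs, there exists a k-mer that appears in both graphs but with different neighbors.
An NDS using fewer than $\log_2 N$ bits would assign the same data to two graphs of the family (pigeonhole principle), contradiction.
Our construction has $N = 2^{3.24n}$.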
AATA AATC AATG AATT ATAA ATCA ATCC ATCG ATCT ATTA TCTG TCTT TCTA TCTC ATAC ATAG ATAT ATGA ATGC ATGG ATGT ATTC ATTG ATTT TAAA TAAG TAAT TAAC
$\binom{4m}{m}^{\ell}$ possible graphs
$\binom{4m}{m}^{\ell} \ge 2^{(c-\epsilon)\ell m}$ with $c = 8 - 3\log_2 3 \approx 3.24$
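The constant c can be sanity-checked numerically. The sketch below rests on an assumption: that the number of graphs in the construction has the form C(4m, m)^ℓ, which is the binomial form whose growth rate matches c = 8 − 3 log2 3 (equal to 4 times the binary entropy of 1/4). Under that assumption the exponent ℓ cancels, so it is enough to look at log2 C(4m, m) / m. The function names are hypothetical.

```python
import math

def log2_binomial(a, b):
    """log2 of the binomial coefficient C(a, b), via log-gamma."""
    natural_log = math.lgamma(a + 1) - math.lgamma(b + 1) - math.lgamma(a - b + 1)
    return natural_log / math.log(2)

# The constant from the slide, with the logarithm taken in base 2.
c = 8 - 3 * math.log2(3)

def bits_per_kmer(m):
    """log2 C(4m, m) / m: the per-k-mer rate of log2 C(4m, m)^l over l*m k-mers."""
    return log2_binomial(4 * m, m) / m

rates = {m: bits_per_kmer(m) for m in (10, 1_000, 1_000_000)}
for m, rate in rates.items():
    print(f"m = {m:>9}: {rate:.5f} bits (c = {c:.5f})")
```

The rate increases with m and stays below c, which is consistent with a bound of the form 2^((c−ε)ℓm) for any fixed ε > 0 once m is large enough.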
Navigational data structures:
Open questions:
Contact/references: