On the representation of de Bruijn Graphs
Rayan Chikhi
joint work with P. Medvedev, A. Limasset, S. Jackman, J. Simpson
Univ. Lille
ALEA 2016

de Bruijn Graph
sequence: GATTACATTACAA
k-mers (k=3): GAT, ATT, TTA, ...
Nodes: k-mers (words of length k)
Edges: exact suffix-prefix overlaps of length k − 1
[Figure: de Bruijn graph on the nodes GAT, ATT, TTA, TAC, ACA, CAT, CAA]
Usages:
- Bioinformatics
  - de novo assembly of sequencing data
- Distributed applications
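The definition above can be sketched in a few lines of Python. This is a minimal illustration, assuming a node-centric graph (an edge exists whenever two k-mers of the set overlap by k − 1 characters); the function names `kmers` and `de_bruijn_graph` are hypothetical and not taken from the talk.

```python
def kmers(seq, k):
    """Return all length-k substrings of seq, in order of appearance."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def de_bruijn_graph(seq, k):
    """Build the node set and the out-neighbor lists of the dBG of seq."""
    nodes = set(kmers(seq, k))
    edges = {}
    for u in sorted(nodes):
        # An out-neighbor shares its (k-1)-prefix with the (k-1)-suffix of u.
        edges[u] = [u[1:] + c for c in "ACGT" if u[1:] + c in nodes]
    return nodes, edges


nodes, edges = de_bruijn_graph("GATTACATTACAA", 3)
for u in sorted(edges):
    print(u, "->", edges[u])
```

On the slide's example sequence this yields the seven nodes listed above, with ACA being the only branching node.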

Genome sequencing

Genome assembly
Substrings from the genome, but position unknown

dBGs require a lot of memory
[Figure: bar chart, axis "Memory (GB)", values 14, 244, 512 and 4300; labels: desktop computer, cluster node, bacterium, mammal, pine tree (20 Gbp) [Birol 13]]
Hash table of nodes (e.g. TGA, GAT, ATG), with additional information per node: coverage, status, etc.

How to encode the de Bruijn graph using as little space as possible?
Nodes only: { GAT, ATT, ... } (human genome: k = 75, n = $3 \cdot 10^9$ k-mers)
- Explicit list: $2k \cdot n$ bits, i.e. 56 GB
- Self-information of n nodes [Conway, Bromage 11]: $\log_2 \binom{4^k}{n}$ bits, i.e. 44 GB
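The two figures can be checked numerically. The sketch below is an illustration, not the authors' computation: it assumes 1 GB = 10^9 bytes and, because n is tiny compared to 4^k, approximates log2 C(4^k, n) by n·log2(4^k) − log2(n!). The function names are hypothetical.

```python
import math


def explicit_list_bits(k, n):
    """Bits needed to list n k-mers at 2 bits per nucleotide."""
    return 2 * k * n


def self_information_bits(k, n):
    """Approximate log2 C(4^k, n), assuming n is much smaller than 4^k.

    Uses C(N, n) ~ N^n / n!, with log2(n!) obtained from lgamma.
    """
    return 2 * k * n - math.lgamma(n + 1) / math.log(2)


k, n = 75, 3 * 10**9
explicit_gb = explicit_list_bits(k, n) / 8 / 1e9
self_info_gb = self_information_bits(k, n) / 8 / 1e9
print(f"explicit list:    {explicit_gb:.1f} GB")
print(f"self-information: {self_info_gb:.1f} GB")
```

Under these assumptions the results land close to the 56 GB and 44 GB quoted above; small differences come from rounding and from the approximation.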

Recent techniques
[Figure: scale in bits/node with ticks 0, 4, 8, 16, 22; marked points: XBW, BF and self-information, annotated with k=27]
- Bloom filter (BF) of nodes (w/ tricks) [Chikhi, Rizk 12], [Salikhov et al. 13]
- XBW (Burrows-Wheeler for trees) variant [Bowe et al. 12]
Why are they doing better? → different types of data structures

Data structures
A membership data structure is a pair of algorithms (const, contains_node), where:
- data ← const(G)
- contains_node(data, kmer) returns true or false according to whether kmer ∈ G
A navigational data structure is a pair (const, neighbors), where:
- data ← const(G)
- neighbors(data, kmer) returns the neighbors of kmer in G
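The two interfaces can be sketched in Python as follows. This is only an illustration of the definitions: the backing containers (a frozen set and a dictionary) are assumptions chosen for readability, not the space-efficient structures discussed in the talk, and `neighbors_from_membership` is a hypothetical helper showing that a membership structure can always answer neighbor queries by testing the 8 candidate k-mers.

```python
ALPHABET = "ACGT"


def const_membership(G):
    """Membership DS: here simply a frozen set of the k-mers."""
    return frozenset(G)


def contains_node(data, kmer):
    """Return True iff kmer is a node of G."""
    return kmer in data


def const_navigational(G):
    """Navigational DS: precomputed (out-neighbors, in-neighbors) per node."""
    nodes = set(G)
    table = {}
    for u in nodes:
        outs = [u[1:] + c for c in ALPHABET if u[1:] + c in nodes]
        ins = [c + u[:-1] for c in ALPHABET if c + u[:-1] in nodes]
        table[u] = (outs, ins)
    return table


def neighbors(data, kmer):
    """Defined only for k-mers of G; here any other query raises KeyError."""
    return data[kmer]


def neighbors_from_membership(data, kmer):
    """Answer a neighbor query using membership tests only."""
    outs = [kmer[1:] + c for c in ALPHABET if contains_node(data, kmer[1:] + c)]
    ins = [c + kmer[:-1] for c in ALPHABET if contains_node(data, c + kmer[:-1])]
    return outs, ins


G = {"GAT", "ATT", "TTA", "TAC", "ACA", "CAT", "CAA"}
mds = const_membership(G)
nds = const_navigational(G)
print(neighbors(nds, "ACA"))
print(neighbors_from_membership(mds, "ACA"))
```

The converse does not hold in general: a navigational structure is free to return anything for a k-mer outside G, so it cannot be used to test membership.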

Navigational data structures (NDS)
                                      NDS   Membership (e.g. hash table)
Traverse dBG from known nodes         ✓     ✓
Query membership of arbitrary nodes   ✗     ✓
Enumerate nodes                       ✗     ✓
NDS has undefined behavior if the query node is not present.
Recent techniques are NDS but not membership data structures.

Why an NDS "beats" the self-information
Consider this example NDS when k = 3:
"For each node x = x1 x2 x3, out-neighbor: x2 x3 x1, in-neighbor: x3 x1 x2"
Valid for these two graphs:
- AAT, ATA, TAA
- AAG, AGA, GAA
So:
- 1 NDS ←→ more than 1 dBG
- 1 membership DS ←→ 1 dBG
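The rotation example can be verified directly. The sketch below is an illustrative check with hypothetical function names: it compares the answers of the rotation rule against the true neighbors (computed from the node set) for each of the two graphs.

```python
ALPHABET = "ACGT"


def rotation_neighbors(kmer):
    """The example NDS: out-neighbor x2x3x1 and in-neighbor x3x1x2."""
    return kmer[1:] + kmer[0], kmer[-1] + kmer[:-1]


def true_neighbors(G, kmer):
    """Out- and in-neighbor sets of kmer in the dBG with node set G."""
    outs = {kmer[1:] + c for c in ALPHABET if kmer[1:] + c in G}
    ins = {c + kmer[:-1] for c in ALPHABET if c + kmer[:-1] in G}
    return outs, ins


def is_valid_for(G):
    """True iff the rotation rule answers correctly on every node of G."""
    for u in G:
        out_nb, in_nb = rotation_neighbors(u)
        if true_neighbors(G, u) != ({out_nb}, {in_nb}):
            return False
    return True


G1 = {"AAT", "ATA", "TAA"}
G2 = {"AAG", "AGA", "GAA"}
print(is_valid_for(G1), is_valid_for(G2))
```

Both graphs pass, so a single constant-size rule represents more than one graph, which no membership structure can do. The same rule fails on the union of the two node sets, where AAT gains a second in-neighbor (GAA).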

Lower bounds
We seek dBG representation lower bounds in the NDS model.
[Figure: scale in bits/node with ticks 0, 2, 4, 8, 16, 22; marked points: XBW, BF and self-information, annotated with k=27]

NDS lower bound for linear graphs
[Figure: linear graphs]
Theorem: NDS for linear graphs need at least 2 bits/k-mer of space.
Proof sketch:
- Number of DNA strings that have n distinct k-mers and start with the same k-mer: ≈ $2^{2n}$ [Gagie 12]
- Number of linear dBGs with n nodes and the same left-most node: ≈ $2^{2n}$
- Suppose an NDS needs < 2n bits
- Then two graphs have the same NDS (pigeonhole principle), a contradiction

NDS lower bound
Theorem: NDS need at least 3.24 bits/k-mer.
Proof sketch:
1. Construct a large family of N graphs such that, for any two graphs, there exists a k-mer that appears in both graphs but with different neighbors.
2. Suppose an NDS needs < $\log_2(N)$ bits.
3. Then two graphs have the same NDS (pigeonhole principle), a contradiction.
Our construction has $N = 2^{3.24n}$.

[Figure: the construction for k = 4, with levels AATα (AATA, AATC, AATG, AATT), ATα (ATAA, ATAC, ..., ATTT) and Tα (shown: TAAA, TAAC, TAAG, TAAT, TCTA, TCTC, TCTG, TCTT)]
- Fix an even k ≥ 2, $\ell = k/2$, $m = 4^{\ell-1}$
- Consider a graph with $\ell + 1$ levels of $\{ A^{\ell-i} T \alpha,\ \alpha \in \Sigma^{i+\ell-1} \}$
- Select m nodes per level: $\binom{4m}{m}^{\ell}$ possible graphs
- $\binom{4m}{m}^{\ell} \ge 2^{(c-\epsilon)\ell m}$ with $c = 8 - 3\log_2 3 \approx 3.24$
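The constant c can be checked numerically. The sketch below is an illustration under the reading that each level contributes a choice of m nodes among 4m candidates, so the cost per selected node is log2 C(4m, m) / m bits; the function name `bits_per_node` is hypothetical. The limit follows from the entropy estimate log2 C(4m, m) ≈ 4m·H(1/4), and 4·H(1/4) = 8 − 3·log2 3.

```python
import math


def bits_per_node(m):
    """log2 of the number of ways to pick m nodes out of 4m, per picked node."""
    return math.log2(math.comb(4 * m, m)) / m


C = 8 - 3 * math.log2(3)  # limit of bits_per_node(m) as m grows

for ell in (2, 4, 6):
    m = 4 ** (ell - 1)
    print(f"l = {ell}, m = {m}: {bits_per_node(m):.4f} bits/node (c = {C:.4f})")
```

The value approaches c from below as m grows, which is why the bound is stated with a slack ε.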

Conclusion / Perspectives
Navigational data structures:
- Model for recent dBG data structures
- Lower bound: 3.24 bits/k-mer
- Gap with known non-parameterized upper bounds (16)
Open questions:
- Closing the gap above
- Entropy-compressed dBG representations
Contact/references:
- On the Representation of de Bruijn Graphs, 2014
- rayan.chikhi@univ-lille1.fr
- http://rayan.chikhi.name

