REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets

SLIDE 1

REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets

Camille Marchet, Zamin Iqbal, Mikaël Salson, Rayan Chikhi

DSB’20 – Rennes

SLIDE 2

Context

[Figure: raw read datasets in FASTQ format]

SLIDE 3

Sets of k-mer sets

18 related papers and counting since 2016

dataset 1

ACA CAT ATC

dataset 2

GCA CAT ATA

k-mer aggregative method: ACA CAT ATC GCA ATA

color aggregative method: {ATA, CAT, GCA} {ACA, ATC, CAT}

SLIDE 4

Sets of k-mer sets

dataset 1

ACA CAT ATC

dataset 2

GCA CAT ATA

BIGSI
◮ Good performance due to the false-positive (FP) tradeoff
◮ Presence/absence

VARI (Muggli et al. '17)
◮ De Bruijn graph representation
◮ Presence/absence + bubble calling

SLIDE 5

Our goal

REINDEER method: Query abundances of sequences in a collection of datasets of raw reads

dataset 1

ACA CAT ATC

dataset 2

GCA CAT ATA

Set of k-mers from all datasets + abundance matrix

[Figure: the k-mers GCA CAT ATA ACA ATC, each with its count in the two datasets; values shown: 30 0 0 10 31 0 30 9 0 8]

SLIDE 6

Motivation

SLIDE 7

Motivation

SLIDE 8

Color matrix

[Figure: datasets 1 (ACA CAT ATC), 2 (ACA CAT ATA) and 3 are summarized as a color matrix over the k-mers ATC, ACA, CAT, ATA, GAG; the k-mers are then assigned to color classes numbered 1 to 4]

color matrix + color classes

SLIDE 9

Abundance matrix

[Figure: datasets 1 (ACA CAT ATC), 2 (ACA CAT ATA) and 3 are summarized as a count matrix over the k-mers ATC, ACA, CAT, ATA, GAG; values shown: 5 12 5 20 4 21 20 15 5 18 12]

count matrix

equivalence classes for counts? compression (sparse matrix)

SLIDE 10

Definitions

dataset: a raw read multiset; we see it as a set of k-mers

CAGCT AGCTA ATTTA TATTT ACTTA

count vector of a k-mer x: vec[x,i] = count of x in dataset i

10 0 3 ...

abundance matrix: a list of count vectors, one for each x

10 0 3 ...
2 5 13 ...
10 2 3 ...
...
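
The definitions above can be illustrated with a small sketch. It is not REINDEER's implementation: the function names `kmers` and `abundance_matrix` are hypothetical, reads are plain strings, and only the forward strand is considered (no canonical k-mers).

```python
from collections import Counter
from typing import Dict, List, Tuple


def kmers(read: str, k: int) -> List[str]:
    """All k-mers of a read, in order of occurrence."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def abundance_matrix(datasets: List[List[str]], k: int) -> Dict[str, Tuple[int, ...]]:
    """Map each k-mer x to its count vector: entry i is the count of x in dataset i."""
    # One k-mer counter per dataset (a dataset is a multiset of reads).
    per_dataset = [Counter(x for read in reads for x in kmers(read, k))
                   for reads in datasets]
    # The rows of the matrix are indexed by the union of all k-mer sets.
    all_kmers = sorted(set().union(*per_dataset))
    # Counter returns 0 for a k-mer that is absent from a dataset.
    return {x: tuple(c[x] for c in per_dataset) for x in all_kmers}


if __name__ == "__main__":
    toy = [["ACAT", "ACA"], ["GCAT"]]  # two toy datasets
    for x, vec in abundance_matrix(toy, 3).items():
        print(x, vec)
```

Storing one row per k-mer like this is the naive layout; the rest of the talk is about avoiding that redundancy.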

SLIDE 11

Definitions

[Figure: k-mers of the datasets (CAGCT, AGCTA, TATTT, ATTTA, CTTAT, ACTTA) and the De Bruijn graphs built from them]

datasets → De Bruijn graph

In practice we use a compacted DBG (graph of unitigs).

The union De Bruijn graph represents the set of k-mer sets coming from all datasets.
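
As a rough illustration of the union graph, the sketch below builds a node-centric De Bruijn graph from several k-mer sets. It is a simplification under stated assumptions: the name `union_dbg` is hypothetical, only the forward strand is used, and no compaction into unitigs is performed.

```python
from typing import Dict, Iterable, List, Set, Tuple


def union_dbg(kmer_sets: Iterable[Set[str]]) -> Tuple[Set[str], Dict[str, List[str]]]:
    """Nodes are the union of all k-mer sets; an edge x -> y exists when the
    last k-1 characters of x equal the first k-1 characters of y."""
    nodes: Set[str] = set().union(*kmer_sets)
    edges = {
        x: sorted(x[1:] + c for c in "ACGT" if x[1:] + c in nodes)
        for x in nodes
    }
    return nodes, edges


if __name__ == "__main__":
    dataset_1 = {"CAGCT", "AGCTA"}
    dataset_2 = {"TATTT", "ATTTA", "AGCTA"}
    nodes, edges = union_dbg([dataset_1, dataset_2])
    for x in sorted(nodes):
        print(x, "->", edges[x])
```

A compacted DBG would additionally merge every non-branching path of this graph into a single unitig string.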

SLIDE 12

Required building blocks

dataset 1

ACA CAT ATC

dataset 2

GCA CAT ATA

[Figure: the k-mers ACA CAT ATC GCA ATA linked to the abundance matrix; values shown: 30 0 0 10 31 0 30 9 0 8]

◮ k-mer set representation
◮ associative data structure
◮ abundance matrix

SLIDE 13

Associative structure

[Marchet et al. '19]

         nb. 31-mers    Pufferfish (time/mem)                BLight (time/mem)
human    2.5 billion    1 h/20 GB (12.5 GB for the index)    30 min/8 GB (≈ 26 bits/k-mer)

SLIDE 14

K-mer counts

dataset 1

ACA CAT ATC

dataset 2

GCA CAT ATA

per-dataset De Bruijn graphs → unitig graphs

SLIDE 15

K-mer counts

[Figure: unitig graph annotated with counts; values shown: 10 30 40 39 41 9 10 32 31 30 and 10 40 10 32]

◮ Good approximation of k-mer counts
◮ Record more redundant values
◮ Smooth counts due to sequencing errors
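
One way to read this slide is that a single count is kept per unitig instead of one per k-mer. The sketch below assumes that the summary is the arithmetic mean of the k-mer counts; the slide does not state which summary is used, and the name `unitig_mean_count` is hypothetical.

```python
from typing import Dict


def unitig_mean_count(unitig: str, k: int, kmer_counts: Dict[str, int]) -> float:
    """Mean count of the k-mers of one unitig (assumed summary, see lead-in)."""
    ks = [unitig[i:i + k] for i in range(len(unitig) - k + 1)]
    return sum(kmer_counts[x] for x in ks) / len(ks)


if __name__ == "__main__":
    counts = {"ACG": 30, "CGT": 40, "GTT": 41}
    # All three k-mers of the unitig receive the same approximate count.
    print(unitig_mean_count("ACGTT", 3, counts))
```

Under this assumption, neighbouring k-mers share one value, which is what makes the recorded values more redundant and smooths small variations.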

SLIDE 16

Associate counts to kmers

[Figure: individual graphs + counts, with shared k-mers highlighted; sequences shown: ATGGATG ..., GGACAGT]

SLIDE 17

Associate counts to kmers

union graph: k-mer set

1 count vector per unitig

[Figure: union graph annotated with count values 15 6 0 10 0 0 0 0 80]

SLIDE 18

Associate counts to kmers

[Figure: three examples of associating counts from datasets 1 and 2 with sequences of the union graph. Sequences shown: ...CTATTTA, ATTTA, TATTT; ACTTA, CTTAT, ACTTAT; TATTT, CTATT, ACTTA. Count values shown: 15 6 (marked ✔), 0 6, 15 6, and 15 0 (marked ✘, twice)]

SLIDE 19

Represent a set of k-mers: Spectrum Preserving String Sets

An SPSS of a k-mer set S is a set of strings having the same k-mer spectrum as S
◮ k-mer set itself
◮ Unitigs
◮ Super k-mers from reads [Deorowicz et al.’15]
◮ Super k-mers from unitigs [Marchet et al.’19]
◮ Simplitigs [Břinda et al.’20] / UST [Rahman et al.’20]

None of these can guarantee that all k-mers in a given string have the same count-vector
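
The SPSS definition above can be checked mechanically. The sketch below follows the definition as worded on the slide (same set of k-mers); the names `spectrum` and `is_spss` are hypothetical and only the forward strand is considered.

```python
from typing import Iterable, Set


def spectrum(strings: Iterable[str], k: int) -> Set[str]:
    """Set of all k-mers occurring in a collection of strings."""
    return {s[i:i + k] for s in strings for i in range(len(s) - k + 1)}


def is_spss(strings: Iterable[str], kmer_set: Set[str], k: int) -> bool:
    """True when the strings have exactly the k-mer spectrum of kmer_set."""
    return spectrum(strings, k) == set(kmer_set)


if __name__ == "__main__":
    S = {"ACA", "CAT", "ATC"}
    print(is_spss(["ACA", "CAT", "ATC"], S, 3))  # the k-mer set itself
    print(is_spss(["ACATC"], S, 3))              # one longer string
    print(is_spss(["ACAT"], S, 3))               # misses ATC
```

The check says nothing about counts, which is the gap the slide points out: a string can preserve the spectrum while mixing k-mers with different count-vectors.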

SLIDE 20

A new SPSS: Minitigs

Minitigs are paths of the union DBG:

[Figure: union graph (k-mer set) with sequences CTAGAATGGATG ... GGACAGT; count vectors shown: 10 0 3 (three times) and 2 6 12 (twice)]

◮ All k-mers in a minitig have the same count vector
◮ Each k-mer is in one and only one minitig
◮ Minitigs can span several unitigs
◮ In practice
  ◮ All k-mers in a minitig have the same minimizer
  ◮ Greedy algorithm for construction
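
The properties listed above can be reproduced with a toy greedy construction. This is only a sketch under several assumptions, not the authors' algorithm: the names `minimizer` and `greedy_minitigs` are hypothetical, the minimizer is taken as the lexicographically smallest m-mer, only the forward strand is used, and the input is a plain dictionary from k-mer to count vector.

```python
from typing import Dict, List, Optional, Tuple

Vec = Tuple[int, ...]


def minimizer(kmer: str, m: int) -> str:
    """Lexicographically smallest m-mer of a k-mer (assumed ordering)."""
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))


def greedy_minitigs(counts: Dict[str, Vec], m: int) -> List[Tuple[str, Vec]]:
    """Greedily grow paths whose k-mers share one count vector and one minimizer."""
    unused = set(counts)
    result: List[Tuple[str, Vec]] = []

    def pick(seed: str, candidates: List[str]) -> Optional[str]:
        # First unused neighbour with the seed's count vector and minimizer.
        for y in candidates:
            if (y in unused and counts[y] == counts[seed]
                    and minimizer(y, m) == minimizer(seed, m)):
                return y
        return None

    for seed in sorted(counts):
        if seed not in unused:
            continue
        unused.remove(seed)
        path, first, last = seed, seed, seed
        while True:  # extend to the right, one character at a time
            nxt = pick(seed, [last[1:] + c for c in "ACGT"])
            if nxt is None:
                break
            unused.remove(nxt)
            path, last = path + nxt[-1], nxt
        while True:  # then extend to the left
            prv = pick(seed, [c + first[:-1] for c in "ACGT"])
            if prv is None:
                break
            unused.remove(prv)
            path, first = prv[0] + path, prv
        result.append((path, counts[seed]))
    return result


if __name__ == "__main__":
    toy = {"TTAC": (10, 0, 3), "TACG": (10, 0, 3), "ACGG": (10, 0, 3),
           "CGGT": (10, 0, 3), "GGTT": (2, 6, 12)}
    for seq, vec in greedy_minitigs(toy, 2):
        print(seq, vec)
```

Because every k-mer is removed from `unused` exactly once, each k-mer ends up in one and only one output string, and a single count vector can be stored per string.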

SLIDE 21

Minitig example

[Figure: for a region with count-vectors 1, 2 and 3, comparison of unitigs in the individual DBGs, unitigs in the union DBG, super k-mers from unitigs, simplitigs/UST in the union DBG, and minitigs]

SLIDE 22

REINDEER

union graph: k-mer set

individual graphs + counts (not explicitly built, only minitigs are extracted)

hash table (MPHF): k-mer → minitig ID

abundance matrix (rows shown: 10 0 3 and 2 6 12)

SLIDE 23

REINDEER

union graph: k-mer set

individual graphs + counts (not explicitly built, only minitigs are extracted)

hash table (MPHF): k-mer → minitig ID

de-duplicated abundance matrix (rows shown: 10 0 3 and 2 6 12; indices shown: 1 2 3 2 1 2 3 ...)

◮ Each count-vector is compressed with RLE and written to disk
◮ The MPHF can be written to disk as well
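
The two storage ideas on this slide, de-duplication of identical rows and run-length encoding (RLE) of each count-vector, can be sketched as follows. The function names and the (value, run length) pair layout are assumptions made for illustration; REINDEER's actual on-disk format is not described on the slide.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def rle_encode(vec: Sequence[int]) -> List[Tuple[int, int]]:
    """Run-length encode a count-vector as (value, run length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(vec)]


def rle_decode(runs: Sequence[Tuple[int, int]]) -> List[int]:
    """Inverse of rle_encode."""
    return [value for value, length in runs for _ in range(length)]


def deduplicate(rows: Sequence[Sequence[int]]) -> Tuple[List[Tuple[int, ...]], List[int]]:
    """Keep each distinct row once; pointers[i] is the unique row of input row i."""
    index = {}
    unique: List[Tuple[int, ...]] = []
    pointers: List[int] = []
    for row in rows:
        key = tuple(row)
        if key not in index:
            index[key] = len(unique)
            unique.append(key)
        pointers.append(index[key])
    return unique, pointers


if __name__ == "__main__":
    rows = [(10, 0, 3), (2, 6, 12), (10, 0, 3)]
    unique, pointers = deduplicate(rows)
    print(unique, pointers)
    print(rle_encode([0, 0, 0, 80, 80, 0]))
```

RLE pays off when a count-vector contains long runs of equal values, for example a k-mer absent from many consecutive datasets.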

SLIDE 24

Query

query sequence: GATACCGATCACTGAC ...

k-mers → hash table → minitig ID → de-duplicated abundance matrix

[Figure: count values shown for the query: 19 0 ..., 17 7 ..., 17 9 ...; indices shown: 1 2 3 2 1 2 3 ...]

◮ A value is reported for a dataset only if at least X% of the query k-mers were found in that dataset
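
A minimal sketch of the query step, under loudly stated assumptions: a Python dictionary stands in for the MPHF and the minitig ID indirection, the threshold X is passed as a fraction, and the value reported per dataset is the mean of the non-zero counts. The slide does not say how the per-k-mer counts are summarized, and the name `query_abundance` is hypothetical.

```python
from typing import Dict, List, Optional, Sequence


def query_abundance(seq: str, k: int,
                    kmer_to_row: Dict[str, int],
                    matrix: Sequence[Sequence[int]],
                    threshold: float) -> List[Optional[float]]:
    """Per dataset: mean non-zero count of the query k-mers, or None when fewer
    than `threshold` (a fraction) of the query k-mers are present."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    n_datasets = len(matrix[0]) if matrix else 0
    report: List[Optional[float]] = []
    for d in range(n_datasets):
        # Counts of the query k-mers that are present in dataset d.
        hits = [matrix[kmer_to_row[x]][d] for x in kmers
                if x in kmer_to_row and matrix[kmer_to_row[x]][d] > 0]
        if kmers and len(hits) / len(kmers) >= threshold:
            report.append(sum(hits) / len(hits))
        else:
            report.append(None)
    return report


if __name__ == "__main__":
    kmer_to_row = {"GAT": 0, "ATA": 0, "TAC": 1}  # stand-in for MPHF + minitig ID
    matrix = [(19, 0), (17, 7)]                   # de-duplicated rows, 2 datasets
    print(query_abundance("GATAC", 3, kmer_to_row, matrix, 0.5))
```

In this toy run the second dataset contains only one of the three query k-mers, so with a threshold of 0.5 no value is reported for it.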

SLIDE 25

Results: index construction

∼ 2500 human RNA-seq datasets
∼ 4 billion distinct k-mers

Tool                           Ext. Memory (GB)   Time (h)   Peak RAM (GB)   Index Size (GB)   Counts (Y/N)
SBT                            300                55         25              200               N
HowDeSBT                       30                 10         N/A             15                N
Mantis                         3,500              20         N/A             30                N
SeqOthello                     190                2          15              20                N
BIGSI                          N/A                N/A        N/A             145               N
Reindeer - raw counts          6,800              55         36              60                Y
Reindeer - discretized         6,500              58         35              42                Y
Reindeer - log2                5,500              68         28              40                Y
Reindeer - presence/absence    6,600              55         27              36                N

SLIDE 26

Results: query

Batches of sequences using RefSeq human transcripts (mean size 3,300 bases)

Batch size        Index loading time (s, wallclock)   Query time (s, wallclock)   Peak RAM (GB)
                  mean/min/max                        mean/min/max
10 sequences      475.7/459.8/506.5                   41.68/40.55/42.97           75
100 sequences                                         41.95/40.35/45.98
1000 sequences                                        42.60/41.62/46.20
1000 sequences                                        42.70/40.47/46.28

SLIDE 27

Application to transcriptomics

Find abundances of oncogenes/tumor suppressor genes in a few minutes across 2585 datasets

[Figure: boxplots of abundances. Left boxplot: cancer / Right boxplot: non-cancer]

◮ Need normalization to go further with biological conclusions

SLIDE 28

Take home messages

What REINDEER does: query abundances of sequences in a collection of datasets of raw reads
◮ Represent the set of k-mers using minitigs
◮ Exact associative index for k-mer → count information
◮ Counts per dataset in a compressed, non-redundant abundance matrix
◮ REINDEER can do presence/absence, but other data structures perform better for this (HowDeSBT, BIGSI, ...)
