
REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets. Camille Marchet, Zamin Iqbal, Mikaël Salson, Rayan Chikhi. DSB'20, Rennes. 1/28

Context: raw reads datasets (FASTQ). 2/28

Sets of k-mer sets: 18 related papers and counting since 2016. Example: dataset 1 = {ACA, ATC, CAT}, dataset 2 = {ATA, CAT, GCA}.
◮ k-mer aggregative method: index the union of all k-mers (ACA, ATA, ATC, CAT, GCA)
◮ color aggregative method: index each dataset separately (ACA, ATC, CAT and ATA, CAT, GCA)
3/28

Sets of k-mer sets: two example methods (dataset 1 = {ACA, ATC, CAT}, dataset 2 = {ATA, CAT, GCA}).
◮ BIGSI: good performances due to the false-positive (FP) tradeoff; presence/absence
◮ VARI (Muggli et al. '17): De Bruijn graph representation; presence/absence + bubble calling
4/28

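Our goal. REINDEER method: query abundances of sequences in a collection of datasets of raw reads. [Figure: dataset 1 = {ACA, ATC, CAT} and dataset 2 = {ATA, CAT, GCA}; the index is the set of k-mers from all datasets (ACA, ATA, ATC, CAT, GCA) + an abundance matrix with one row per k-mer and one column per dataset (rows shown: 30 0, 0 10, 31 0, 30 9, 0 8).] 5/28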

Motivation [figure] 6/28

Motivation [figure] 7/28

Color matrix. [Figure: three datasets (dataset 1, dataset 2, dataset 3) over the k-mers ATC, ACA, CAT, ATA, GAG. The color matrix records which datasets contain each k-mer; k-mers with identical rows share a color class: ATC → 1, ACA → 2, CAT → 2, ATA → 3, GAG → 4.] 8/28
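
The color-class idea on this slide can be sketched in a few lines of Python. This is an illustration, not the authors' code: the function name `color_classes` is hypothetical, and the presence rows are made up, because the slide's dataset membership is not recoverable from the extracted text. Only the resulting class IDs match the figure.

```python
def color_classes(presence):
    """Give the same class ID to all k-mers that share a presence row."""
    class_of_row = {}    # presence row -> class ID (1-based, in order of first use)
    kmer_to_class = {}
    for kmer, row in presence.items():
        row = tuple(row)
        if row not in class_of_row:
            class_of_row[row] = len(class_of_row) + 1
        kmer_to_class[kmer] = class_of_row[row]
    # Invert the mapping: one stored row per class instead of one per k-mer.
    classes = {cid: row for row, cid in class_of_row.items()}
    return kmer_to_class, classes


# Hypothetical presence rows (dataset 1, dataset 2, dataset 3).
presence = {
    "ATC": (1, 1, 0),
    "ACA": (1, 0, 1),
    "CAT": (1, 0, 1),
    "ATA": (0, 1, 1),
    "GAG": (1, 1, 1),
}
kmer_to_class, classes = color_classes(presence)
print(kmer_to_class)
print(classes)
```

Five k-mers need only four stored rows here, since ACA and CAT share one class.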

Abundance matrix. [Figure: the same three datasets, now with a count matrix instead of a color matrix; non-zero counts shown per k-mer: ATC 5, 12; ACA 5, 20; CAT 4, 21; ATA 20, 15; GAG 5, 18, 12.] Open questions: equivalence classes for counts? Compression (sparse matrix)? 9/28

Definitions
◮ dataset: a raw read multiset; we see it as a set of k-mers (e.g. CAGCT, AGCTA, ATTTA, TATTT, ACTTA)
◮ count vector: vec[x,i] = count of x in dataset i (e.g. x: 10 0 3 ...)
◮ abundance matrix: a list of count vectors, one for each x (e.g. rows 10 0 3 ..., 2 5 13 ..., 10 2 3 ...)
10/28
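
The definitions above can be turned into a small Python sketch. It is a toy under stated assumptions: the names `kmers` and `abundance_matrix` are hypothetical, reverse complements (canonical k-mers) are ignored, and the reads are invented.

```python
from collections import Counter


def kmers(seq, k):
    """All k-mers of seq, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def abundance_matrix(datasets, k):
    """Map each k-mer x to its count vector: vec[x][i] = count of x in dataset i."""
    counts = [Counter(km for read in reads for km in kmers(read, k))
              for reads in datasets]
    all_kmers = sorted(set().union(*counts))
    # A Counter returns 0 for a k-mer that is absent from a dataset.
    return {x: [c[x] for c in counts] for x in all_kmers}


# Two toy datasets of reads (invented for the example).
datasets = [["ACATC"], ["GCATA", "CAT"]]
matrix = abundance_matrix(datasets, k=3)
for x, vec in matrix.items():
    print(x, vec)
```

Each row of `matrix` is one count vector, and the rows together form the abundance matrix.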

Definitions
◮ datasets → De Bruijn graph (one per dataset; example k-mers CAGCT, AGCTA, TATTT, CTTAT, ATTTA, ACTTA)
◮ union De Bruijn graph: represents the set of k-mer sets coming from all datasets
◮ In practice we use a compacted DBG (graph of unitigs)
11/28

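Required building blocks: a k-mer set representation, an associative data structure, and an abundance matrix. [Figure: dataset 1 = {ACA, ATC, CAT} and dataset 2 = {ATA, CAT, GCA}; each k-mer of the union is mapped through the associative structure to its row of the abundance matrix (rows shown: 30 0, 0 10, 31 0, 30 9, 0 8).] 12/28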

Associative structure [Marchet et al. '19]

Dataset   Nb. 31-mers   Pufferfish (time/mem)                  BLight (time/mem)
human     2.5 billion   1 h / 20 GB (12.5 GB for the index)    30 min / 8 GB (≈ 26 bits/k-mer)
13/28

K-mer counts per dataset: De Bruijn graphs → unitig graphs. [Figure: dataset 1 = {ACA, CAT, ATC}; dataset 2 = {GCA, CAT, ATA}.] 14/28

K-mer counts [Figure: example counts along the graph: 32 31 10 32 10 40 40 39 41 30 10 30 9 10.]
◮ Good approximation of k-mer counts
◮ Record more redundant values
◮ Smooth counts due to sequencing errors
15/28

Associate counts to k-mers: individual graphs + counts. [Figure: unitigs of the individual graphs (ATGGATG, GGACAGT, ...) with the k-mers shared between datasets highlighted.] 16/28

Associate counts to k-mers: union graph (k-mer set), 1 count vector per unitig. [Figure: example count values 15 6 0 10 0 0 0 0 80.] 17/28

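Associate counts to k-mers. [Figure: the k-mer TATTT occurs in dataset 1 and dataset 2 and gets the single count vector (15, 6) ✔. Longer strings of the union graph such as ...CTATTTA and ACTTAT, built from the k-mers ACTTA, CTATT, ATTTA, CTTAT and TATTT, mix k-mers with different count vectors, (15, 0), (15, 6) and (0, 6) ✘.] 18/28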

Represent a set of k-mers: Spectrum Preserving String Sets
A SPSS of a k-mer set S is a set of strings having the same k-mer spectrum as S
◮ k-mer set itself
◮ Unitigs
◮ Super k-mers from reads [Deorowicz et al. '15]
◮ Super k-mers from unitigs [Marchet et al. '19]
◮ Simplitigs [Brinda et al. '20] / UST [Rahman et al. '20]
None can guarantee that all k-mers in a given string have the same count-vector
19/28
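
The SPSS definition can be checked mechanically. The sketch below uses hypothetical helper names (`spectrum`, `is_spss`) and, as a simplification, ignores reverse complements.

```python
def spectrum(strings, k):
    """Set of all k-mers occurring in a collection of strings."""
    return {s[i:i + k] for s in strings for i in range(len(s) - k + 1)}


def is_spss(strings, kmer_set, k):
    """True if the strings have exactly the k-mer spectrum of kmer_set."""
    return spectrum(strings, k) == set(kmer_set)


S = {"ACA", "CAT", "ATC"}
print(is_spss(["ACA", "CAT", "ATC"], S, 3))  # the k-mer set itself
print(is_spss(["ACATC"], S, 3))              # one longer string covering all of S
print(is_spss(["ACAT"], S, 3))               # ATC is missing
```

The first two calls return `True` and the last returns `False`: a k-mer set is an SPSS of itself, and so is any set of longer strings that covers the same k-mers and no others.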

A new SPSS: Minitigs
Minitigs are paths of the union DBG. [Figure: union graph (k-mer set) in which GGACAGT... carries the count vector 10 0 3 and CTAGAATGGATG... carries the count vector 2 6 12.]
◮ All k-mers in a minitig have the same count vector
◮ Each k-mer is in one and only one minitig
◮ Minitigs can span several unitigs
◮ In practice
  ◮ All k-mers in a minitig have the same minimizer
  ◮ Greedy algorithm for construction
20/28
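
The first two minitig properties (one count vector per string, each k-mer in exactly one string) can be illustrated with a simplified sketch. It is not the REINDEER construction: it splits a single string greedily, so it neither spans several unitigs nor enforces a shared minimizer. The function name and the count vectors are hypothetical.

```python
def split_by_count_vector(seq, k, vec_of):
    """Cut seq into maximal substrings whose k-mers all share one count vector.

    vec_of maps every k-mer of seq to its count vector (a tuple).
    Returns a list of (substring, count_vector) pairs.
    """
    pieces = []
    n_kmers = len(seq) - k + 1
    start = 0  # index of the first k-mer of the current run
    for i in range(1, n_kmers + 1):
        at_end = i == n_kmers
        if at_end or vec_of[seq[i:i + k]] != vec_of[seq[start:start + k]]:
            # k-mers start..i-1 form one run; it spans seq[start : i - 1 + k].
            pieces.append((seq[start:i - 1 + k], vec_of[seq[start:start + k]]))
            start = i
    return pieces


# Hypothetical count vectors over two datasets.
vec_of = {"CTATT": (15, 0), "TATTT": (15, 6), "ATTTA": (15, 6)}
pieces = split_by_count_vector("CTATTTA", 5, vec_of)
print(pieces)
```

The string CTATTTA is cut into CTATT with vector (15, 0) and TATTTA with vector (15, 6), so one count vector can be stored per output string.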

Minitig example. [Figure comparing, for given count-vectors, how the same k-mers are grouped by: unitigs in the individual DBGs, unitigs in the union DBG, simplitigs/UST in the union DBG, super k-mers from unitigs, and minitigs.] 21/28

REINDEER: individual graphs + counts → union graph: k-mer set (not explicitly built, only minitigs are extracted) → hash table (MPHF): k-mer → minitig ID → abundance matrix (rows 10 0 3, 2 6 12, ...) 22/28

REINDEER with a de-duplicated abundance matrix: individual graphs + counts → union graph: k-mer set (not explicitly built, only minitigs are extracted) → hash table (MPHF): k-mer → minitig ID → row ID (1, 2, 2, 1, 3, ...) → de-duplicated abundance matrix (rows 10 0 3 ..., 2 6 12 ..., ...)
◮ Each count-vector is compressed with RLE and written to disk
◮ The MPHF can be written to disk as well
23/28
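
De-duplication and run-length encoding (RLE) of count vectors can be sketched as follows. The helper names and the (value, run length) pair layout are assumptions made for this example and do not describe REINDEER's actual on-disk format.

```python
from itertools import groupby


def rle_encode(vec):
    """Encode a count vector as (value, run length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(vec)]


def rle_decode(pairs):
    """Inverse of rle_encode."""
    return [value for value, n in pairs for _ in range(n)]


def deduplicate(vectors):
    """Store each distinct count vector once (RLE-compressed).

    Returns (rows, ids): rows holds the unique encoded vectors and
    ids[i] is the row index of the i-th input vector.
    """
    index_of = {}
    rows, ids = [], []
    for vec in vectors:
        key = tuple(vec)
        if key not in index_of:
            index_of[key] = len(rows)
            rows.append(rle_encode(vec))
        ids.append(index_of[key])
    return rows, ids


# One count vector per minitig (toy values); the first and third are identical.
vectors = [[10, 0, 3], [2, 6, 12], [10, 0, 3]]
rows, ids = deduplicate(vectors)
print(ids)
print(rle_encode([0, 0, 0, 5, 5, 1]))
```

Three minitigs need only two stored rows here. RLE pays off most on vectors with long runs of equal values, such as runs of zeros across many datasets.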

Query: query sequence (e.g. GATACCGATCACTGAC) → k-mers → hash table → minitig ID → row of the de-duplicated abundance matrix (example rows: 19 0 ..., 2 17 7 ..., 17 9 ...)
◮ Value reported only if X% of the query k-mers were found present in a dataset
24/28
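
The X% rule can be sketched as below. A plain dictionary stands in for the MPHF and the abundance matrix. Reporting the mean count of the k-mers found is an assumption made for this example, since the slide does not say which value is reported. All names and counts are hypothetical.

```python
def query(seq, k, kmer_to_vec, n_datasets, min_fraction):
    """Report one value per dataset, or None when too few query k-mers are present.

    kmer_to_vec: dict k-mer -> count vector (stand-in for MPHF + matrix).
    min_fraction: the X% threshold, as a fraction in [0, 1].
    """
    query_kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not query_kmers:
        return [None] * n_datasets
    result = []
    for d in range(n_datasets):
        # Counts of the query k-mers that are present in dataset d.
        found = [kmer_to_vec[x][d] for x in query_kmers
                 if x in kmer_to_vec and kmer_to_vec[x][d] > 0]
        if found and len(found) / len(query_kmers) >= min_fraction:
            result.append(sum(found) / len(found))  # assumed summary: mean count
        else:
            result.append(None)
    return result


# Toy index over two datasets.
index = {"GAT": (19, 0), "ATA": (19, 0), "TAC": (17, 7)}
result = query("GATACC", 3, index, n_datasets=2, min_fraction=0.7)
print(result)
```

Three of the four query k-mers are present in the first dataset (75%, above the 70% threshold), so a value is reported. Only one is present in the second dataset, so nothing is reported there.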

Results: index construction
~2500 human RNA-seq datasets, ~4 billion distinct k-mers

Tool                          Ext. Memory (GB)  Time (h)  Peak RAM (GB)  Index Size (GB)  Counts (Y/N)
SBT                           300               55        25             200              N
HowDeSBT                      30                10        N/A            15               N
Mantis                        3,500             20        N/A            30               N
SeqOthello                    190               2         15             20               N
BIGSI                         N/A               N/A       N/A            145              N
Reindeer - raw counts         6,800             55        36             60               Y
Reindeer - discretized        6,500             58        35             42               Y
Reindeer - log2               5,500             68        28             40               Y
Reindeer - presence/absence   6,600             55        27             36               N
25/28

Results: query
Batches of sequences using RefSeq human transcripts (mean size 3,300 bases)
Columns: batch size; index loading time (s, wallclock, mean/min/max); query time (s, wallclock, mean/min/max); peak RAM (GB)

Per-batch timings (mean/min/max):
10 sequences     41.68/40.55/42.97
100 sequences    41.95/40.35/45.98
1000 sequences   42.60/41.62/46.20
1000 sequences   42.70/40.47/46.28

Values shared by all batches: 475.7/459.8/506.5 s (mean/min/max); peak RAM 75 GB
26/28

Application to transcriptomics
Find abundances of oncogenes/tumor repressor genes in a few minutes across 2585 datasets. [Figure: boxplots per gene; left boxplot: cancer, right boxplot: non-cancer.]
◮ Need normalization to go further with biological conclusions
27/28

Take home messages
What REINDEER does: query abundances of sequences in a collection of datasets of raw reads
◮ Represent the set of k-mers using minitigs
◮ Exact associative index for k-mer → count information
◮ Counts per dataset in a compressed, non-redundant abundance matrix
◮ REINDEER can do presence/absence, but other data structures perform better for this (HowDeSBT, BIGSI, ...)
28/28
