SPRING: a next-generation compressor for FASTQ data Shubham Chandak - - PowerPoint PPT Presentation


SPRING: a next-generation compressor for FASTQ data Shubham Chandak Stanford University ISMB/ECCB 2019 Joint work with Kedar Tatwawadi, Stanford University Idoia Ochoa, UIUC Mikel Hernaez, UIUC Tsachy Weissman, Stanford


slide-1
SLIDE 1

SPRING: a next-generation compressor for FASTQ data

Shubham Chandak Stanford University ISMB/ECCB 2019

slide-2
SLIDE 2

Joint work with

  • Kedar Tatwawadi, Stanford University
  • Idoia Ochoa, UIUC
  • Mikel Hernaez, UIUC
  • Tsachy Weissman, Stanford University
slide-3
SLIDE 3

Outline

  • Introduction and motivation
  • FASTQ format and compression results
  • Algorithms - SPRING and others
  • SPRING as a practical tool
  • Next steps
slide-4
SLIDE 4

Genome sequencing

  • Genome: long string of bases {A, C, G, T}
  • Sequenced as noisy paired substrings (reads):

[Figure: paired reads of ~100–150 bases each, spanning ~300–500 bases, drawn from a genome of ~3 billion bases]

AACGATGTCGTATATCGTAGTAGCTCTATGTTCTCATTAGCTCGCTAGTAGCTATGCTCTAATGCTAT

Coverage/ Depth: ~30x-60x
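
As a rough illustration of what the coverage figure implies, the sketch below estimates how many read pairs are needed for a given depth. The function name and the specific values (30x, 150-base reads) are illustrative choices taken from the ranges on this slide, not numbers from a particular dataset.

```python
def read_pairs_needed(genome_len: int, coverage: int, read_len: int) -> int:
    """Read pairs needed so that total sequenced bases = coverage * genome_len."""
    total_bases = coverage * genome_len
    # Each pair contributes two reads of read_len bases.
    return total_bases // (2 * read_len)


# ~3 billion base genome, 30x coverage, 150-base paired reads (assumed values)
pairs = read_pairs_needed(3_000_000_000, 30, 150)
print(f"{pairs:,} read pairs")
```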

slide-5
SLIDE 5

Typical workflows

slide-6
SLIDE 6

Typical workflows

Sequencing → Raw reads → Alignment to reference → Aligned reads → Variant calling w.r.t. reference → VCF (tabular data)

slide-7
SLIDE 7

Typical workflows

Sequencing → Raw reads → Alignment to reference → Aligned reads → Variant calling w.r.t. reference → VCF (tabular data)

Sequencing → Raw reads → Assembly → Assembled genome

slide-8
SLIDE 8

Why store raw reads?

slide-9
SLIDE 9

Why store raw reads?

  • Pipelines improve with time - need raw data for reanalysis
slide-10
SLIDE 10

Why store raw reads?

  • Pipelines improve with time - need raw data for reanalysis
  • For temporary storage - alignment and assembly time-consuming
slide-11
SLIDE 11

Why store raw reads?

  • Pipelines improve with time - need raw data for reanalysis
  • For temporary storage - alignment and assembly time-consuming
  • Can’t perform alignment when reference genome not available – e.g., de novo assembly or metagenomics

slide-12
SLIDE 12

Why store raw reads?

  • Pipelines improve with time - need raw data for reanalysis
  • For temporary storage - alignment and assembly time-consuming
  • Can’t perform alignment when reference genome not available – e.g., de novo assembly or metagenomics
  • Can get better compression than aligned data compression if significant variation from reference (more on this later)!

slide-13
SLIDE 13

FASTQ format

slide-14
SLIDE 14

FASTQ format

We’ll mostly focus on reads in this talk.

slide-15
SLIDE 15

Read compression

slide-16
SLIDE 16

Read compression

  • For a typical 25x human dataset:
  • Uncompressed: 79 GB (1 byte/base)
slide-17
SLIDE 17

Read compression

  • For a typical 25x human dataset:
  • Uncompressed: 79 GB (1 byte/base)
  • Gzip: ~20 GB (2 bits/base) – still far from optimal

slide-18
SLIDE 18

Read compression

  • For a typical 25x human dataset:
  • Uncompressed: 79 GB (1 byte/base)
  • Gzip: ~20 GB (2 bits/base) – still far from optimal

  • Order of read pairs in FASTQ irrelevant – can this help?
slide-19
SLIDE 19

Read compression results

Compressor                   25x human
Uncompressed                 79 GB
Gzip                         ~20 GB

slide-20
SLIDE 20

Read compression results

Compressor                   25x human
Uncompressed                 79 GB
Gzip                         ~20 GB
FaStore (allow reordering)   6 GB

Łukasz Roguski, Idoia Ochoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data, Bioinformatics, Volume 34, Issue 16, 15 August 2018, Pages 2748–2756

slide-21
SLIDE 21

Read compression results

Compressor                   25x human
Uncompressed                 79 GB
Gzip                         ~20 GB
FaStore (allow reordering)   6 GB
SPRING (no reordering)       3 GB
SPRING (allow reordering)    2 GB

Łukasz Roguski, Idoia Ochoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data, Bioinformatics, Volume 34, Issue 16, 15 August 2018, Pages 2748–2756

slide-22
SLIDE 22

Read compression results

Compressor                   25x human   100x human
Uncompressed                 79 GB       319 GB
Gzip                         ~20 GB      ~80 GB
FaStore (allow reordering)   6 GB        13.7 GB
SPRING (no reordering)       3 GB        10 GB
SPRING (allow reordering)    2 GB        5.7 GB

Łukasz Roguski, Idoia Ochoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data, Bioinformatics, Volume 34, Issue 16, 15 August 2018, Pages 2748–2756
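
To put the sizes above on a common scale, the sketch below converts them to bits per base. It relies on the figure quoted earlier in the talk that the uncompressed reads take 1 byte (8 bits) per base; the helper name is illustrative.

```python
def bits_per_base(compressed_gb: float, uncompressed_gb: float) -> float:
    """Bits per base, assuming the uncompressed file stores 1 byte (8 bits) per base."""
    return 8.0 * compressed_gb / uncompressed_gb


# Sizes in GB for the 25x human dataset, as listed in the table above.
sizes_25x = {
    "Gzip": 20,
    "FaStore (allow reordering)": 6,
    "SPRING (no reordering)": 3,
    "SPRING (allow reordering)": 2,
}

for name, gb in sizes_25x.items():
    print(f"{name:28s} {bits_per_base(gb, 79):.2f} bits/base")
```

Under this assumption Gzip comes out at about 2 bits/base, matching the earlier slide, while SPRING with reordering is roughly 0.2 bits/base.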

slide-23
SLIDE 23

Key idea

  • Storing reads equivalent to

AACGATGTCGTATATCGTAGTAGCTCTATGTTCTCATTAGCTCGCTAGTAGCTATGCTCTAATGCTAT

slide-24
SLIDE 24

Key idea

  • Storing reads equivalent to
  • Store genome

AACGATGTCGTATATCGTAGTAGCTCTATGTTCTCATTAGCTCGCTAGTAGCTATGCTCTAATGCTAT

slide-25
SLIDE 25

Key idea

  • Storing reads equivalent to
  • Store genome
  • Store read positions in genome (+ gap between paired reads)

AACGATGTCGTATATCGTAGTAGCTCTATGTTCTCATTAGCTCGCTAGTAGCTATGCTCTAATGCTAT

slide-26
SLIDE 26

Key idea

  • Storing reads equivalent to
  • Store genome
  • Store read positions in genome (+ gap between paired reads)
  • Store noise in reads

AACGATGTCGTATATCGTAGTAGCTCTATGTTCTCATTAGCTCGCTAGTAGCTATGCTCTAATGCTAT

slide-27
SLIDE 27

Key idea

  • Storing reads equivalent to
  • Store genome
  • Store read positions in genome (+ gap between paired reads)
  • Store noise in reads
  • Entropy calculations show this outperforms previous compressors

AACGATGTCGTATATCGTAGTAGCTCTATGTTCTCATTAGCTCGCTAGTAGCTATGCTCTAATGCTAT
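
The decomposition above can be sketched as follows: a read is stored as its position in the genome plus the bases where it differs from the genome. This is a toy illustration only; the function names are hypothetical, it handles substitutions only, and it ignores paired-end gaps and the entropy coding that a real compressor would apply to each stream.

```python
def encode_read(read: str, genome: str, pos: int):
    """Represent a read as (position, length, noise) relative to the genome."""
    ref = genome[pos:pos + len(read)]
    # Noise = list of (offset within read, observed base) where read != genome.
    noise = [(i, b) for i, (b, r) in enumerate(zip(read, ref)) if b != r]
    return pos, len(read), noise


def decode_read(genome: str, pos: int, length: int, noise) -> str:
    """Rebuild the read from the genome, its position, and the noisy bases."""
    bases = list(genome[pos:pos + length])
    for i, b in noise:
        bases[i] = b
    return "".join(bases)


genome = "AACGATGTCGTATATCGTAGTAGCTCTATGTTCTCATTAGCTCGCTAGTAGCTATGCTCTAATGCTAT"
clean = genome[5:15]                   # a 10-base read starting at position 5
noisy = clean[:3] + "A" + clean[4:]    # one substitution error at offset 3
encoded = encode_read(noisy, genome, 5)
assert decode_read(genome, *encoded) == noisy
```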

slide-28
SLIDE 28

Key idea

  • But... How to get the genome from the reads?
slide-29
SLIDE 29

Key idea

  • But... How to get the genome from the reads?
  • Genome assembly too expensive - big challenges:
  • resolve repeats
  • get very long pieces of genome from shorter assemblies
slide-30
SLIDE 30

Key idea

  • But... How to get the genome from the reads?
  • Genome assembly too expensive - big challenges:
  • resolve repeats
  • get very long pieces of genome from shorter assemblies
  • Solution: Don’t need perfect assembly for compression!
slide-31
SLIDE 31

SPRING workflow

Raw reads

slide-32
SLIDE 32

SPRING workflow

Raw reads → Approximate assembly → Contigs

slide-33
SLIDE 33

SPRING workflow

Raw reads → Approximate assembly → Contigs → Encode:

  • Assembled sequence
  • Read position in assembled sequence
  • Gap b/w paired reads
  • Noisy bases + positions
  • Etc.

slide-34
SLIDE 34

SPRING workflow

Raw reads → Approximate assembly → Contigs → Encode:

  • Assembled sequence
  • Read position in assembled sequence
  • Gap b/w paired reads
  • Noisy bases + positions
  • Etc.

Encoded streams → BSC (https://github.com/IlyaGrebnov/libbsc) → Compressed file

slide-35
SLIDE 35

SPRING workflow

Raw reads → Approximate assembly → Contigs → Encode:

  • Assembled sequence
  • Read position in assembled sequence
  • Gap b/w paired reads
  • Noisy bases + positions
  • Etc.

Encoded streams → BSC (https://github.com/IlyaGrebnov/libbsc) → Compressed file

In “allow reordering” mode: reorder by position in approximate assembly

slide-36
SLIDE 36
  • Approx. assembly/reordering step (simplified)
slide-37
SLIDE 37
  • Approx. assembly/reordering step (simplified)
  • Index reads by specific substrings using hash tables
slide-38
SLIDE 38
  • Approx. assembly/reordering step (simplified)
  • Index reads by specific substrings using hash tables
  • For the current read, try to find an overlapping read within small Hamming distance

slide-39
SLIDE 39
  • Approx. assembly/reordering step (simplified)
  • Index reads by specific substrings using hash tables
  • For the current read, try to find an overlapping read within small Hamming distance
  • Example (reads indexed by prefix for simplicity):
  • ACGATCGTACGTACGATCGTCAG (current read)

slide-40
SLIDE 40
  • Approx. assembly/reordering step (simplified)
  • Index reads by specific substrings using hash tables
  • For the current read, try to find an overlapping read within small Hamming distance
  • Example (reads indexed by prefix for simplicity):
  • ACGATCGTACGTACGATCGTCAG (current read)
  • ACGATCGTACGTATACGGGTACG (candidate next read)

slide-41
SLIDE 41
  • Approx. assembly/reordering step (simplified)
  • Index reads by specific substrings using hash tables
  • For the current read, try to find an overlapping read within small Hamming distance
  • Example (reads indexed by prefix for simplicity):
  • ACGATCGTACGTACGATCGTCAG (current read)
  • ACGATCGTACGTATACGGGTACG (candidate next read)
  • Index match found but Hamming distance too large → shift search substring by one

slide-42
SLIDE 42
  • Approx. assembly/reordering step (simplified)
  • Index reads by specific substrings using hash tables
  • For the current read, try to find an overlapping read within small Hamming distance
  • Example (reads indexed by prefix for simplicity):
  • ACGATCGTACGTACGATCGTCAG (current read)

slide-43
SLIDE 43
  • Approx. assembly/reordering step (simplified)
  • Index reads by specific substrings using hash tables
  • For the current read, try to find an overlapping read within small Hamming distance
  • Example (reads indexed by prefix for simplicity):
  • ACGATCGTACGTACGATCGTCAG (current read)

  • No index match found → shift search substring by one
slide-44
SLIDE 44
  • Approx. assembly/reordering step (simplified)
  • Index reads by specific substrings using hash tables
  • For the current read, try to find an overlapping read within small Hamming distance
  • Example (reads indexed by prefix for simplicity):
  • ACGATCGTACGTACGATCGTCAG (current read)
  • GATCGTACGTATGATGGTCATTA (candidate next read)

slide-45
SLIDE 45
  • Approx. assembly/reordering step (simplified)
  • Index reads by specific substrings using hash tables
  • For the current read, try to find an overlapping read within small Hamming distance
  • Example (reads indexed by prefix for simplicity):
  • ACGATCGTACGTACGATCGTCAG (current read)
  • GATCGTACGTATGATGGTCATTA (candidate next read)

  • Next read found!
  • Repeat process with the new read
slide-46
SLIDE 46
  • Approx. assembly/reordering step (simplified)
  • Index reads by specific substrings using hash tables
  • For the current read, try to find an overlapping read within small Hamming distance
  • Example (reads indexed by prefix for simplicity):
  • ACGATCGTACGTACGATCGTCAG (current read)
  • GATCGTACGTATGATGGTCATTA (candidate next read)
  • Next read found!
  • Repeat process with the new read.
  • If no match found at any shift, pick arbitrary remaining read & start new contig
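
The simplified procedure above can be sketched in a few lines. This is a toy version under stated assumptions, not SPRING's implementation: reads are indexed by a single prefix of length k (as in the slide's simplification), and the values k = 8 and a Hamming threshold of 4 are arbitrary illustrative choices. Reverse complements and paired-end information are ignored.

```python
from collections import defaultdict


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions (compares up to the shorter length)."""
    return sum(x != y for x, y in zip(a, b))


def build_index(reads, k):
    """Hash table: length-k prefix -> list of read ids with that prefix."""
    index = defaultdict(list)
    for i, r in enumerate(reads):
        index[r[:k]].append(i)
    return index


def find_next(cur, reads, index, used, k, max_dist):
    """Try shifts 0, 1, 2, ... of the search substring; return (read id, shift) or None."""
    for shift in range(len(cur) - k + 1):
        for j in index.get(cur[shift:shift + k], []):
            if j in used:
                continue
            overlap = len(cur) - shift
            if hamming(cur[shift:], reads[j][:overlap]) <= max_dist:
                return j, shift
    return None


def reorder(reads, k=8, max_dist=4):
    """Greedy reordering; a shift of None marks the start of a new contig."""
    index, used, order = build_index(reads, k), set(), []
    for start in range(len(reads)):
        if start in used:
            continue
        cur = start
        used.add(cur)
        order.append((cur, None))
        while (hit := find_next(reads[cur], reads, index, used, k, max_dist)):
            cur, shift = hit
            used.add(cur)
            order.append((cur, shift))
    return order


reads = [
    "ACGATCGTACGTACGATCGTCAG",  # current read
    "ACGATCGTACGTATACGGGTACG",  # index match at shift 0, but too many mismatches
    "GATCGTACGTATGATGGTCATTA",  # accepted at shift 2
]
print(reorder(reads))
```

On the three reads from the slides, the first candidate is rejected at shift 0, nothing matches at shift 1, and the second candidate is accepted at shift 2 with three mismatches in the overlap; the rejected read later starts its own contig.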

slide-47
SLIDE 47

Quality and read identifier compression

slide-48
SLIDE 48

Quality and read identifier compression

  • Quality – use general purpose compressor BSC (optionally apply quantization)
  • Read identifier – split into tokens and use arithmetic coding [1]

1. Bonfield, James K., and Matthew V. Mahoney. "Compression of FASTQ and SAM format sequencing data." PloS one 8.3 (2013): e59190.
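
To illustrate why optional quantization helps quality compression, the sketch below maps Phred+33 quality characters onto two levels and compares compressed sizes. Everything here is an assumption made for illustration: the threshold and the two output levels are hypothetical, the input is synthetic random data, and Python's `bz2` module stands in for BSC, which is not used here.

```python
import bz2
import random


def binarize(quals: str, threshold=20, low=6, high=40, offset=33) -> str:
    """Map each Phred+33 quality character to one of two levels (lossy)."""
    return "".join(
        chr(offset + (high if ord(c) - offset >= threshold else low))
        for c in quals
    )


# Synthetic quality string with scores 0..40 (not real sequencer output).
rng = random.Random(0)
quals = "".join(chr(33 + rng.randrange(41)) for _ in range(10_000))

original_size = len(bz2.compress(quals.encode()))
quantized_size = len(bz2.compress(binarize(quals).encode()))
print(original_size, quantized_size)
```

Fewer distinct quality values means less entropy for the general-purpose compressor to encode, at the cost of losing the exact scores.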

slide-49
SLIDE 49

Quality and read identifier compression

  • Quality – use general purpose compressor BSC (optionally apply quantization)
  • Read identifier – split into tokens and use arithmetic coding [1]

Dataset                       Reads   Quality   Read identifier
HiSeq 2000 28x, 100 bp x 2    4.3     23.8      0.9
NovaSeq 25x, 150 bp x 2       3.0     3.6       0.3
All human datasets. Sizes in GB.

1. Bonfield, James K., and Matthew V. Mahoney. "Compression of FASTQ and SAM format sequencing data." PloS one 8.3 (2013): e59190.

slide-50
SLIDE 50

Quality and read identifier compression

  • Quality – use general purpose compressor BSC (optionally apply quantization)
  • Read identifier – split into tokens and use arithmetic coding [1]

Dataset                                      Reads   Quality   Read identifier
HiSeq 2000 28x, 100 bp x 2                   4.3     23.8      0.9
NovaSeq 25x, 150 bp x 2                      3.0     3.6       0.3
NovaSeq 25x, 150 bp x 2 (allow reordering)   2.0     3.6       1.4
All human datasets. Sizes in GB.

1. Bonfield, James K., and Matthew V. Mahoney. "Compression of FASTQ and SAM format sequencing data." PloS one 8.3 (2013): e59190.
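
The token-splitting idea for read identifiers can be sketched as below: each identifier is split into numeric and non-numeric tokens and compared with the previous identifier, so that most tokens reduce to "same" or a small delta. This is a simplified illustration in the spirit of [1], not the exact scheme used there or in SPRING; the arithmetic coding stage is omitted, and the identifiers are made-up examples.

```python
import re

TOKEN = re.compile(r"\d+|\D+")  # runs of digits, or runs of non-digits


def tokenize(identifier: str):
    return TOKEN.findall(identifier)


def delta_tokens(prev, cur):
    """Describe each token of cur relative to the token at the same index in prev."""
    out = []
    for i, tok in enumerate(cur):
        old = prev[i] if i < len(prev) else None
        if tok == old:
            out.append(("same",))
        elif old is not None and tok.isdigit() and old.isdigit():
            out.append(("delta", int(tok) - int(old)))
        else:
            out.append(("new", tok))
    return out


# Hypothetical identifiers of two consecutive reads.
a = tokenize("@SRR001.1 length=100")
b = tokenize("@SRR001.2 length=100")
print(delta_tokens(a, b))
```

When consecutive identifiers differ only in a counter, the output is highly repetitive and cheap to entropy-code. This also suggests why the identifier size in the table grows in "allow reordering" mode: neighbouring reads no longer have consecutive identifiers.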

slide-51
SLIDE 51

SPRING vs. reference-based compression

195 GB 25x human FASTQ (NovaSeq)

slide-52
SLIDE 52

SPRING vs. reference-based compression

195 GB 25x human FASTQ (NovaSeq):

  • SPRING (2 hours) → 7 GB lossless SPRING archive

slide-53
SLIDE 53

SPRING vs. reference-based compression

195 GB 25x human FASTQ (NovaSeq):

  • SPRING (2 hours) → 7 GB lossless SPRING archive
  • BWA-MEM alignment (hg19, 8 hours) → SAM file → Remove irrelevant fields (sorting)

slide-54
SLIDE 54

SPRING vs. reference-based compression

195 GB 25x human FASTQ (NovaSeq):

  • SPRING (2 hours) → 7 GB lossless SPRING archive
  • BWA-MEM alignment (hg19, 8 hours) → SAM file → Remove irrelevant fields (sorting) → CRAM v3 (25 min)
  • CRAM sizes: Unsorted: 7.6 GB; Sorted: 7.8 GB; Sorted (+ embedded reference): 8.5 GB

*partly due to quality compression improvements in SPRING

slide-55
SLIDE 55

SPRING vs. reference-based compression

195 GB 25x human FASTQ (NovaSeq):

  • SPRING (2 hours) → 7 GB lossless SPRING archive
  • BWA-MEM alignment (hg19, 8 hours) → SAM file → Remove irrelevant fields (sorting) → CRAM v3 (25 min)
  • CRAM sizes: Unsorted: 7.6 GB; Sorted: 7.8 GB; Sorted (+ embedded reference): 8.5 GB

Advantage can be even greater in case of large variations between reference genome & FASTQ genome.

*partly due to quality compression improvements in SPRING

slide-56
SLIDE 56

Other approaches for FASTQ compression

Numanagić, Ibrahim, et al. "Comparison of high-throughput sequencing data compression tools." Nature Methods 13.12 (2016): 1005.
Hernaez, Mikel, et al. "Genomic Data Compression." Annual Review of Biomedical Data Science 2 (2019).

slide-57
SLIDE 57

Other approaches for FASTQ compression

  • gzip/bzip2

Numanagić, Ibrahim, et al. "Comparison of high-throughput sequencing data compression tools." Nature Methods 13.12 (2016): 1005.
Hernaez, Mikel, et al. "Genomic Data Compression." Annual Review of Biomedical Data Science 2 (2019).

slide-58
SLIDE 58

Other approaches for FASTQ compression

  • gzip/bzip2
  • Context-based arithmetic coding: DSRC 2, Fqzcomp, Quip

Numanagić, Ibrahim, et al. "Comparison of high-throughput sequencing data compression tools." Nature Methods 13.12 (2016): 1005.
Hernaez, Mikel, et al. "Genomic Data Compression." Annual Review of Biomedical Data Science 2 (2019).

slide-59
SLIDE 59

Other approaches for FASTQ compression

  • gzip/bzip2
  • Context-based arithmetic coding: DSRC 2, Fqzcomp, Quip
  • Assembly based: Leon, Quip, Assembltrie

Numanagić, Ibrahim, et al. "Comparison of high-throughput sequencing data compression tools." Nature Methods 13.12 (2016): 1005.
Hernaez, Mikel, et al. "Genomic Data Compression." Annual Review of Biomedical Data Science 2 (2019).

slide-60
SLIDE 60

Other approaches for FASTQ compression

  • gzip/bzip2
  • Context-based arithmetic coding: DSRC 2, Fqzcomp, Quip
  • Assembly based: Leon, Quip, Assembltrie
  • Reordering based:
  • Reordering based on substrings/minimizers: Orcom, Mince, FaStore, SCALCE
  • BWT-based reordering: BEETL

Numanagić, Ibrahim, et al. "Comparison of high-throughput sequencing data compression tools." Nature Methods 13.12 (2016): 1005.
Hernaez, Mikel, et al. "Genomic Data Compression." Annual Review of Biomedical Data Science 2 (2019).

slide-61
SLIDE 61

Recent FASTQ compressors

slide-62
SLIDE 62

Recent FASTQ compressors

  • minicom [1]: Use minimizers to construct large contigs (assemblies)
  • Slight improvement (5-10%) over SPRING on RNA-seq reads

1. Yuansheng Liu, Zuguo Yu, Marcel E Dinger, Jinyan Li, Index suffix–prefix overlaps by (w, k)-minimizer to generate long contigs for reads compression, Bioinformatics, Volume 35, Issue 12, June 2019, Pages 2066–2074.

slide-63
SLIDE 63

Recent FASTQ compressors

  • minicom [1]: Use minimizers to construct large contigs (assemblies)
  • Slight improvement (5-10%) over SPRING on RNA-seq reads
  • FQSqueezer [2]: Adapt general-purpose compressors such as prediction by partial matching (PPM) and dynamic Markov coding (DMC) to read compression
  • 10-30% improvement over SPRING for bacterial datasets

1. Yuansheng Liu, Zuguo Yu, Marcel E Dinger, Jinyan Li, Index suffix–prefix overlaps by (w, k)-minimizer to generate long contigs for reads compression, Bioinformatics, Volume 35, Issue 12, June 2019, Pages 2066–2074.
2. Deorowicz, Sebastian. "FQSqueezer: k-mer-based compression of sequencing data." bioRxiv (2019): 559807.

slide-64
SLIDE 64

Recent FASTQ compressors

  • minicom [1]: Use minimizers to construct large contigs (assemblies)
  • Slight improvement (5-10%) over SPRING on RNA-seq reads
  • FQSqueezer [2]: Adapt general-purpose compressors such as prediction by partial matching (PPM) and dynamic Markov coding (DMC) to read compression
  • 10-30% improvement over SPRING for bacterial datasets
  • Both require significantly more time and memory than SPRING
  • Not tested on moderate to high coverage human datasets

1. Yuansheng Liu, Zuguo Yu, Marcel E Dinger, Jinyan Li, Index suffix–prefix overlaps by (w, k)-minimizer to generate long contigs for reads compression, Bioinformatics, Volume 35, Issue 12, June 2019, Pages 2066–2074.
2. Deorowicz, Sebastian. "FQSqueezer: k-mer-based compression of sequencing data." bioRxiv (2019): 559807.
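
For readers unfamiliar with the minimizers mentioned for minicom, the sketch below shows the generic (w, k)-minimizer definition: in every window of w consecutive k-mers, keep the lexicographically smallest one. This is the textbook definition only, not minicom's implementation, and the function name and tie-breaking rule (leftmost wins) are illustrative choices.

```python
def minimizers(seq: str, w: int, k: int):
    """Return sorted (position, k-mer) pairs selected as (w, k)-minimizers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # Smallest k-mer in the window; ties broken by leftmost position.
        best = min(range(w), key=lambda i: (window[i], i))
        picked.add((start + best, window[best]))
    return sorted(picked)


print(minimizers("ACGTACGT", w=3, k=3))
```

Overlapping reads tend to share minimizers, which is what makes them useful as keys for grouping reads before building contigs.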

slide-65
SLIDE 65

SPRING as a practical tool

slide-66
SLIDE 66

SPRING as a practical tool

195 GB 25x human FASTQ → compression (2 hours, 32 GB RAM, 8 threads) → 7 GB SPRING archive → decompression (26 minutes, 6 GB RAM, 8 threads) → original FASTQ

slide-67
SLIDE 67

SPRING as a practical tool

  • ~1.6x better compression than FaStore with similar time/memory

195 GB 25x human FASTQ → compression (2 hours, 32 GB RAM, 8 threads) → 7 GB SPRING archive → decompression (26 minutes, 6 GB RAM, 8 threads) → original FASTQ

slide-68
SLIDE 68

SPRING as a practical tool

  • ~1.6x better compression than FaStore with similar time/memory
  • Easy to use with support for:
  • Lossless and lossy modes
  • Variable length reads, long reads, etc.
  • Random access
  • Scalable to large datasets

195 GB 25x human FASTQ → compression (2 hours, 32 GB RAM, 8 threads) → 7 GB SPRING archive → decompression (26 minutes, 6 GB RAM, 8 threads) → original FASTQ

slide-69
SLIDE 69

SPRING as a practical tool

  • ~1.6x better compression than FaStore with similar time/memory
  • Easy to use with support for:
  • Lossless and lossy modes
  • Variable length reads, long reads, etc.
  • Random access
  • Scalable to large datasets
  • GitHub: https://github.com/shubhamchandak94/SPRING/

195 GB 25x human FASTQ → compression (2 hours, 32 GB RAM, 8 threads) → 7 GB SPRING archive → decompression (26 minutes, 6 GB RAM, 8 threads) → original FASTQ
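
The figures above translate into a compression ratio and approximate throughput as follows. The sketch assumes 1 GB = 1000 MB and measures throughput against the uncompressed FASTQ size; the function name is illustrative.

```python
def summarize(input_gb, archive_gb, compress_min, decompress_min):
    """Return (compression ratio, compress MB/s, decompress MB/s)."""
    ratio = input_gb / archive_gb
    compress_mb_s = input_gb * 1000 / (compress_min * 60)
    decompress_mb_s = input_gb * 1000 / (decompress_min * 60)
    return ratio, compress_mb_s, decompress_mb_s


ratio, comp, decomp = summarize(195, 7, 120, 26)
print(f"ratio {ratio:.1f}x, compress {comp:.1f} MB/s, decompress {decomp:.1f} MB/s")
```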

slide-70
SLIDE 70

Next steps

  • Currently integrating SPRING with genie, an upcoming open source MPEG-G codec

slide-71
SLIDE 71

Next steps

  • Currently integrating SPRING with genie, an upcoming open source MPEG-G codec

  • Third generation sequencing technologies (e.g., nanopore):
slide-72
SLIDE 72

Next steps

  • Currently integrating SPRING with genie, an upcoming open source MPEG-G codec

  • Third generation sequencing technologies (e.g., nanopore):
  • Long reads, lots of insertions and deletion errors
  • Hash based approximate assembly doesn’t extend immediately
slide-73
SLIDE 73

Next steps

  • Currently integrating SPRING with genie, an upcoming open source MPEG-G codec

  • Third generation sequencing technologies (e.g., nanopore):
  • Long reads, lots of insertions and deletion errors
  • Hash based approximate assembly doesn’t extend immediately
  • New types of raw data – e.g., raw current signal for nanopore sequencing
  • Need huge amounts of space and typically retained for further analysis
slide-74
SLIDE 74

Next steps

  • Currently integrating SPRING with genie, an upcoming open source MPEG-G codec

  • Third generation sequencing technologies (e.g., nanopore):
  • Long reads, lots of insertions and deletion errors
  • Hash based approximate assembly doesn’t extend immediately
  • New types of raw data – e.g., raw current signal for nanopore sequencing
  • Need huge amounts of space and typically retained for further analysis
  • Time and memory efficient tool with compression close to SPRING:
slide-75
SLIDE 75

Next steps

  • Currently integrating SPRING with genie, an upcoming open source MPEG-G codec

  • Third generation sequencing technologies (e.g., nanopore):
  • Long reads, lots of insertions and deletion errors
  • Hash based approximate assembly doesn’t extend immediately
  • New types of raw data – e.g., raw current signal for nanopore sequencing
  • Need huge amounts of space and typically retained for further analysis
  • Time and memory efficient tool with compression close to SPRING:
  • Disk based strategies (like Orcom/FaStore)
slide-76
SLIDE 76

Next steps

  • Currently integrating SPRING with genie, an upcoming open source MPEG-G codec

  • Third generation sequencing technologies (e.g., nanopore):
  • Long reads, lots of insertions and deletion errors
  • Hash based approximate assembly doesn’t extend immediately
  • New types of raw data – e.g., raw current signal for nanopore sequencing
  • Need huge amounts of space and typically retained for further analysis
  • Time and memory efficient tool with compression close to SPRING:
  • Disk based strategies (like Orcom/FaStore)
  • When reference is available, can do fast and approximate alignment
slide-77
SLIDE 77

Thank you!

slide-78
SLIDE 78

References

  • Shubham Chandak, Kedar Tatwawadi, Tsachy Weissman; Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis, Bioinformatics, Volume 34, Issue 4, 15 February 2018, Pages 558–567
  • Shubham Chandak, Kedar Tatwawadi, Idoia Ochoa, Mikel Hernaez, Tsachy Weissman; SPRING: a next-generation compressor for FASTQ data, Bioinformatics, bty1015

  • SPRING download: https://github.com/shubhamchandak94/Spring
  • genie (open source MPEG-G codec – under development): https://github.com/mitogen/genie