SPRING: a next-generation compressor for FASTQ data, Shubham Chandak



SLIDE 1

SPRING: a next-generation compressor for FASTQ data

Shubham Chandak
Stanford University
Stanford Compression Workshop 2019

SLIDE 2

Joint work with

  • Kedar Tatwawadi, Stanford University
  • Idoia Ochoa, UIUC
  • Mikel Hernaez, UIUC
  • Tsachy Weissman, Stanford University
SLIDE 3

Outline

  • Intro to genome sequencing
  • FASTQ format and compression results
  • SPRING algorithm
  • SPRING as a practical tool
SLIDE 4

Genome sequencing

  • Genome: long string of bases {A, C, G, T}
  • Sequenced as noisy paired substrings (reads):

Figure: a genome of ~3 billion bases, sequenced as paired reads of ~100–150 bases each, taken from fragments of ~300–500 bases.

AACGATGTCGTATATCGTAGTAGCTCTATGTTCTCATTAGCTCGCTAGTAGCTATGCTCTAATGCTAT

Coverage/depth: ~30x–60x
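As a rough illustration of this sequencing model, the Python sketch below samples read pairs from both ends of random fragments of a toy genome and substitutes bases at a fixed error rate. All parameter values and function names are assumptions chosen for illustration, and the noise model (substitutions only) is a simplification. The second read of each pair is taken as the reverse complement of the fragment's far end, as is usual for paired-end data.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))

def add_noise(seq, error_rate, rng):
    """Substitute each base with a different base with probability error_rate."""
    out = []
    for base in seq:
        if rng.random() < error_rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)

def sample_read_pairs(genome, coverage, read_len=100, fragment_len=300,
                      error_rate=0.01, seed=0):
    """Sample noisy read pairs from the two ends of random fragments."""
    rng = random.Random(seed)
    # Coverage = total sequenced bases / genome length; each pair has 2 reads.
    n_pairs = round(coverage * len(genome) / (2 * read_len))
    pairs = []
    for _ in range(n_pairs):
        start = rng.randrange(len(genome) - fragment_len + 1)
        fragment = genome[start:start + fragment_len]
        first = add_noise(fragment[:read_len], error_rate, rng)
        second = add_noise(reverse_complement(fragment[-read_len:]), error_rate, rng)
        pairs.append((first, second))
    return pairs

# Toy genome of 10,000 random bases (a real human genome is ~3 billion).
genome_rng = random.Random(1)
toy_genome = "".join(genome_rng.choice("ACGT") for _ in range(10_000))
pairs = sample_read_pairs(toy_genome, coverage=30)
```

With these toy parameters the call produces 1,500 pairs, i.e. 300,000 sequenced bases over a 10,000-base genome, which is 30x coverage.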

SLIDE 6

Why compression?

  • 500K human genomes
  • ~1.5M eukaryote species

SLIDE 8

FASTQ format

We’ll mostly focus on reads in this talk.
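For reference, each FASTQ record spans four lines: an identifier line starting with `@`, the read (the bases), a separator line starting with `+`, and a quality string with one character per base. A minimal Python sketch of a parser follows. The function name `parse_fastq` and the sample record are made up for illustration, and the sketch assumes unwrapped four-line records.

```python
import io

def parse_fastq(handle):
    """Yield (identifier, read, quality) tuples from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        read = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(read) != len(quality):
            raise ValueError("read and quality lengths differ")
        yield header[1:], read, quality

# Illustrative record (not taken from a real dataset).
sample = "@read1\nAACGATGTCG\n+\nIIIIIHHHGG\n"
records = list(parse_fastq(io.StringIO(sample)))
```

Of the three fields, the read is the one the remaining slides concentrate on.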

SLIDE 12

Read compression

  • For a typical 25x human dataset:
      • Uncompressed: 79 GB (1 byte/base)
      • Gzip: ~20 GB (2 bits/base), still far from optimal
  • Order of read pairs in FASTQ irrelevant – can this help?
SLIDE 16

Read compression results

Compressor                    25x human    100x human
Uncompressed                  79 GB        319 GB
Gzip                          ~20 GB       ~80 GB
FaStore (allow reordering)    6 GB         13.7 GB
SPRING (no reordering)        3 GB         10 GB
SPRING (allow reordering)     2 GB         5.7 GB
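These sizes can be converted to bits per base, taking the uncompressed size as 1 byte per base as stated earlier in the talk. A short Python sketch of the arithmetic follows; the variable and function names are illustrative.

```python
# Sizes in GB from the results table: (25x human, 100x human).
sizes_gb = {
    "Uncompressed": (79, 319),
    "Gzip": (20, 80),
    "FaStore (allow reordering)": (6, 13.7),
    "SPRING (no reordering)": (3, 10),
    "SPRING (allow reordering)": (2, 5.7),
}

def bits_per_base(compressed_gb, uncompressed_gb):
    """Uncompressed data uses 8 bits per base, so scale 8 by the size ratio."""
    return 8 * compressed_gb / uncompressed_gb

raw_25x, raw_100x = sizes_gb["Uncompressed"]
for name, (gb_25x, gb_100x) in sizes_gb.items():
    rate_25x = bits_per_base(gb_25x, raw_25x)
    rate_100x = bits_per_base(gb_100x, raw_100x)
    print(f"{name:28s} {rate_25x:5.2f} {rate_100x:5.2f}")
```

Gzip lands near 2 bits/base, while SPRING with reordering comes to roughly 0.2 bits/base on the 25x dataset and about 0.14 bits/base on the 100x dataset.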

SLIDE 21

Key idea

  • Storing reads is equivalent to:
      • Store genome
      • Store read positions in genome
      • Store noise in reads
  • Entropy calculations show this outperforms previous compressors

AACGATGTCGTATATCGTAGTAGCTCTATGTTCTCATTAGCTCGCTAGTAGCTATGCTCTAATGCTAT
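To make the decomposition concrete, here is a toy Python sketch that stores each read as a position in the genome plus a list of (offset, base) substitutions for the noise, then reconstructs the reads exactly. The function names and the substitution-only noise model are assumptions for illustration, not SPRING's actual format; the genome string is the one shown on the slide.

```python
GENOME = "AACGATGTCGTATATCGTAGTAGCTCTATGTTCTCATTAGCTCGCTAGTAGCTATGCTCTAATGCTAT"

def encode_read(read, position, genome=GENOME):
    """Represent a read as its genome position, its length, and its mismatches."""
    reference = genome[position:position + len(read)]
    if len(reference) != len(read):
        raise ValueError("read runs past the end of the genome")
    noise = [(i, base) for i, (base, ref) in enumerate(zip(read, reference))
             if base != ref]
    return position, len(read), noise

def decode_read(position, length, noise, genome=GENOME):
    """Rebuild the read by copying from the genome and re-applying the noise."""
    bases = list(genome[position:position + length])
    for offset, base in noise:
        bases[offset] = base
    return "".join(bases)

# Two illustrative reads with known positions; the second has one substitution.
reads = [("CGTATATCGT", 8), ("CGTAGTTGCT", 15)]
encoded = [encode_read(read, position) for read, position in reads]
decoded = [decode_read(*entry) for entry in encoded]
```

Here the first read encodes to a position with an empty noise list, and the second to a position plus a single substitution, so most of the information sits in the shared genome rather than in the individual reads.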

SLIDE 24

Key idea

  • But... how to get the genome from the reads?
  • Genome assembly is too expensive. Big challenges:
      • resolve repeats
      • get very long pieces of genome from shorter assemblies
  • Solution: Don’t need perfect assembly for compression!
SLIDE 29

SPRING workflow

Raw reads → Approximate assembly → Encode → BSC → Compressed file

Encode:

  • Assembled sequence
  • Read position in assembled sequence
  • Noisy bases
  • Etc.

In “allow reordering” mode: sort by position in approximate assembly
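One way to see why sorting by position helps: once reads are sorted, positions can be stored as small gaps between consecutive reads rather than as large absolute values. The Python sketch below compares the two on synthetic positions. It uses `bz2` from the standard library purely as a stand-in for BSC, and the fixed-width integer packing, the toy genome length, and the function names are illustrative assumptions, not SPRING's actual stream layout.

```python
import bz2
import random
import struct

def pack(values):
    """Serialize non-negative integers as 4-byte little-endian words."""
    return struct.pack(f"<{len(values)}I", *values)

def compressed_size(values):
    """Size in bytes after general-purpose compression (bz2 as a BSC stand-in)."""
    return len(bz2.compress(pack(values)))

rng = random.Random(0)
genome_length = 3_000_000  # toy value, far smaller than a human genome
positions = [rng.randrange(genome_length) for _ in range(10_000)]

# Original read order: each position is stored in full.
size_original_order = compressed_size(positions)

# "Allow reordering": sort reads by position and store gaps between neighbors.
ordered = sorted(positions)
gaps = [ordered[0]] + [b - a for a, b in zip(ordered, ordered[1:])]
size_reordered = compressed_size(gaps)

# The sorted positions are recoverable from the gaps by a running sum.
recovered = []
total = 0
for gap in gaps:
    total += gap
    recovered.append(total)
```

On this synthetic input the gap stream compresses to fewer bytes than the positions in their original order; the exact sizes depend on the compressor used.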

SLIDE 34

SPRING as a practical tool

  • Support for:
      • Lossless and lossy modes
      • Variable length reads, long reads, etc.
      • Random access
  • GitHub: https://github.com/shubhamchandak94/SPRING/
  • Currently integrating with genie, an upcoming open source MPEG-G codec

Compression: 195 GB 25x human FASTQ → 7 GB SPRING archive (2 hours, 32 GB RAM, 8 threads)
Decompression: 7 GB SPRING archive → original FASTQ (26 minutes, 6 GB RAM, 8 threads)

SLIDE 35

Thank you!

SLIDE 36

References

  • Shubham Chandak, Kedar Tatwawadi, Tsachy Weissman; Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis, Bioinformatics, Volume 34, Issue 4, 15 February 2018, Pages 558–567
  • Shubham Chandak, Kedar Tatwawadi, Idoia Ochoa, Mikel Hernaez, Tsachy Weissman; SPRING: a next-generation compressor for FASTQ data, Bioinformatics, bty1015
  • Łukasz Roguski, Idoia Ochoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data, Bioinformatics, Volume 34, Issue 16, 15 August 2018, Pages 2748–2756
  • Alberti C. et al. (2018) An introduction to MPEG-G, the new ISO standard for genomic information representation. https://www.biorxiv.org/content/early/2018/10/08/426353.
  • BSC: https://github.com/IlyaGrebnov/libbsc
  • genie (open source MPEG-G codec): https://mitogen.github.io/
  • Image credits:
      • https://www.genome.gov/27541954/dna-sequencing-costs-data/
      • https://twitter.com/nature/status/1050115893957730305
      • http://www.earlham.ac.uk/newsroom/decoding-life-earth