SPRING: a next-generation compressor for FASTQ data
Shubham Chandak, Stanford University
ISMB/ECCB 2019

Joint work with
• Kedar Tatwawadi, Stanford University
• Idoia Ochoa, UIUC
• Mikel Hernaez, UIUC
• Tsachy Weissman, Stanford University

Outline
• Introduction and motivation
• FASTQ format and compression results
• Algorithms - SPRING and others
• SPRING as a practical tool
• Next steps

Genome sequencing
• Genome: long string of bases {A, C, G, T}
• Sequenced as noisy paired substrings (reads)
[Figure: a genome of ~3 billion bases (AACGATGTCGTATATCGTAGTAGCTCTATGTTCTCATTAGCTCGCTAGTAGCTATGCTCTAATGCTAT) covered by paired reads. Labels: coverage/depth ~30x-60x; ~300–500 bases; ~100–150 bases]

Typical workflows
• Sequencing → Raw reads → Alignment to reference → Aligned reads → Variant calling w.r.t. reference → VCF (tabular data)
• Sequencing → Raw reads → Assembly → Assembled genome

Why store raw reads?
• Pipelines improve with time - need raw data for reanalysis
• For temporary storage - alignment and assembly are time-consuming
• Can’t perform alignment when a reference genome is not available – e.g., de novo assembly or metagenomics
• Can get better compression than aligned data compression if there is significant variation from the reference (more on this later)!

FASTQ format We’ll mostly focus on reads in this talk.
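For readers who do not have the slide's figure: a FASTQ record is normally four lines (an identifier line starting with "@", the read, a separator line starting with "+", and one quality character per base). The sketch below is a minimal parser under that assumption; it does not handle line-wrapped records, and the names FastqRecord and parse_fastq are illustrative, not part of SPRING.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # header line without the leading "@"
    sequence: str    # the read (bases)
    quality: str     # one quality character per base


def parse_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming four lines per record."""
    while True:
        header = handle.readline()
        if not header:  # end of file
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline()
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("read and quality lengths differ")
        yield FastqRecord(header[1:].rstrip("\n"), sequence, quality)


# Two made-up records, only to exercise the parser.
example = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!#I\n"
records = list(parse_fastq(io.StringIO(example)))
reads_only = [record.sequence for record in records]
```

The reads_only list holds the part of the file that the rest of the talk is about.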

Read compression
• For a typical 25x human dataset:
  • Uncompressed: 79 GB (1 byte/base)
  • Gzip: ~20 GB (2 bits/base) – still far from optimal
• Order of read pairs in FASTQ irrelevant – can this help?

Read compression results

Compressor                    25x human    100x human
Uncompressed                  79 GB        319 GB
Gzip                          ~20 GB       ~80 GB
FaStore (allow reordering)    6 GB         13.7 GB
SPRING (no reordering)        3 GB         10 GB
SPRING (allow reordering)     2 GB         5.7 GB

Łukasz Roguski, Idoia Ochoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data, Bioinformatics, Volume 34, Issue 16, 15 August 2018, Pages 2748–2756
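The sizes above can be turned into bits per base with simple arithmetic. This sketch uses only the 25x human numbers from the slides and the stated 1 byte/base for the uncompressed reads; the function name bits_per_base is illustrative.

```python
# 25x human dataset sizes in GB, taken from the results slide.
UNCOMPRESSED_GB = 79.0  # stored at 1 byte (8 bits) per base

sizes_gb = {
    "Gzip": 20.0,
    "FaStore (allow reordering)": 6.0,
    "SPRING (no reordering)": 3.0,
    "SPRING (allow reordering)": 2.0,
}


def bits_per_base(compressed_gb: float,
                  uncompressed_gb: float = UNCOMPRESSED_GB) -> float:
    """Bits spent per base, given that the uncompressed file uses 8 bits/base."""
    return 8.0 * compressed_gb / uncompressed_gb


for name, size in sizes_gb.items():
    print(f"{name}: {bits_per_base(size):.2f} bits/base")
```

Gzip comes out at about 2 bits/base, matching the slide, and SPRING without reordering at roughly 0.3 bits/base.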

Key idea
AACGATGTCGTATATCGTAGTAGCTCTATGTTCTCATTAGCTCGCTAGTAGCTATGCTCTAATGCTAT
• Storing reads is equivalent to:
  • Store genome
  • Store read positions in genome (+ gap between paired reads)
  • Store noise in reads
• Entropy calculations show this outperforms previous compressors
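The decomposition above can be illustrated with a toy encoder: a read is replaced by its position in the genome plus the list of bases where it differs. This is a minimal sketch, assuming substitution-only noise and a known position; encode_read and decode_read are hypothetical helpers, not SPRING's code, and the noisy read is made up from the genome string on the slide.

```python
from typing import List, Tuple

GENOME = "AACGATGTCGTATATCGTAGTAGCTCTATGTTCTCATTAGCTCGCTAGTAGCTATGCTCTAATGCTAT"

Noise = List[Tuple[int, str]]  # (offset within read, observed base)


def encode_read(genome: str, read: str, position: int) -> Tuple[int, int, Noise]:
    """Represent a read as (position, length, noise) relative to the genome."""
    reference = genome[position:position + len(read)]
    if len(reference) != len(read):
        raise ValueError("read does not fit in the genome at this position")
    noise = [(i, base) for i, (base, ref) in enumerate(zip(read, reference))
             if base != ref]
    return position, len(read), noise


def decode_read(genome: str, position: int, length: int, noise: Noise) -> str:
    """Rebuild the read from the genome, its position, and its noisy bases."""
    bases = list(genome[position:position + length])
    for offset, base in noise:
        bases[offset] = base
    return "".join(bases)


# The genome has ATGTCGTATA at position 4; this read has one substitution (G -> C).
noisy_read = "ATGTCCTATA"
encoded = encode_read(GENOME, noisy_read, 4)
decoded = decode_read(GENOME, *encoded)
```

Here the ten-base read is stored as (4, 10, [(5, "C")]), and decoding returns the original read exactly.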

Key idea
• But... How to get the genome from the reads?
• Genome assembly too expensive - big challenges:
  • resolve repeats
  • get very long pieces of genome from shorter assemblies
• Solution: Don’t need perfect assembly for compression!

SPRING workflow
• Raw reads → Approximate assembly → Contigs → Encode → BSC (https://github.com/IlyaGrebnov/libbsc) → Compressed file
• Streams produced by the Encode step:
  • Assembled sequence
  • Read position in assembled sequence
  • Gap b/w paired reads
  • Noisy bases + positions
  • Etc.
• In “allow reordering” mode: reorder by position in approximate assembly

Approx. assembly/reordering step (simplified)
• Index reads by specific substrings using hash tables
• For the current read, try to find an overlapping read within small Hamming distance
• Example (reads indexed by prefix for simplicity):
  (current read)        ACGATCGTACGTACGATCGTCAG
  (candidate next read) ACGATCGTACGTATACGGGTACG
  • Index match found but Hamming distance too large → shift search substring by one
  (current read)        ACGATCGTACGTACGATCGTCAG
  • No index match found → shift search substring by one
  (current read)        ACGATCGTACGTACGATCGTCAG
  (candidate next read)   GATCGTACGTATGATGGTCATTA
  • Next read found!
• Repeat process with the new read
• If no match found at any shift, pick arbitrary remaining read & start new contig
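The simplified procedure above can be sketched in a few lines. This is an illustration only, not SPRING's implementation: reads are indexed by a single prefix, and the index length k = 8 and the Hamming threshold max_dist = 4 are assumptions chosen so that the slide's example behaves as described.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple


def hamming(a: str, b: str) -> int:
    """Mismatches over the overlap of a and b (zip stops at the shorter one)."""
    return sum(x != y for x, y in zip(a, b))


def build_prefix_index(reads: List[str], k: int) -> Dict[str, List[int]]:
    """Hash table from each length-k prefix to the reads that start with it."""
    index: Dict[str, List[int]] = defaultdict(list)
    for i, read in enumerate(reads):
        index[read[:k]].append(i)
    return index


def find_next_read(current: str, reads: List[str], index: Dict[str, List[int]],
                   used: Set[int], k: int, max_dist: int) -> Optional[Tuple[int, int]]:
    """Shift the search substring one base at a time until a close read is found."""
    for shift in range(len(current) - k + 1):
        suffix = current[shift:]
        for j in index.get(suffix[:k], []):
            if j not in used and hamming(suffix, reads[j]) <= max_dist:
                return j, shift
    return None  # no match at any shift


def greedy_order(reads: List[str], k: int = 8,
                 max_dist: int = 4) -> List[List[Tuple[int, int]]]:
    """Group reads into contigs, each a list of (read index, shift from previous)."""
    index = build_prefix_index(reads, k)
    used: Set[int] = set()
    contigs = []
    for start in range(len(reads)):  # next remaining read starts a new contig
        if start in used:
            continue
        used.add(start)
        contig, current = [(start, 0)], start
        while True:
            hit = find_next_read(reads[current], reads, index, used, k, max_dist)
            if hit is None:
                break
            used.add(hit[0])
            contig.append(hit)
            current = hit[0]
        contigs.append(contig)
    return contigs


reads = [
    "ACGATCGTACGTACGATCGTCAG",  # current read from the slide
    "ACGATCGTACGTATACGGGTACG",  # index match, but 7 mismatches
    "GATCGTACGTATGATGGTCATTA",  # matches at shift 2 with 3 mismatches
]
contigs = greedy_order(reads)
```

With the three reads from the slide, the first read is followed by the third at shift 2, and the second read starts its own contig: [[(0, 0), (2, 2)], [(1, 0)]].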
