Implementing iRODS for Next Generation Sequencing Data Management
Gen-Tao Chiang
Wellcome Trust Sanger Institute
gtc@sanger.ac.uk
ISGC, March 20, 2011

Outline
1. DNA Sequencing
2. Managing Data
3. iRODS
4. WTSI use case
5. Future Work

The Sanger Institute
Funded by the Wellcome Trust.
• 2nd largest research charity in the world.
• More than 800 employees.
• Based at the Wellcome Trust Genome Campus, Hinxton, Cambridge, UK (shared with EBI).
• Most cited in the UK (Science Watch, 2008).
Large-scale genomic research.
• Sequenced 1/3 of the human genome (largest single contributor).
• We have active cancer, malaria, pathogen and genomic variation / human health studies.
• All data is made publicly available: websites, FTP, direct database access, programmatic APIs.
By Guy Coates

Data Centre
• Completed in 2005.
• 1,000 square meters of floor space split equally into four rooms.
• Capable of supporting up to 50,000 processors.
• Currently about 10,000 cores and 10 petabytes of storage.

Managing Data

DNA Sequencing

Capillary Based
• In 2001, in the era of the Human Genome Project (HGP), DNA sequencing technology used a capillary-based approach.
• Each sequencer produced about 115 kbp (thousand base pairs) per day (Mardis, 2011).

Next Generation Sequencing
The life sciences are drowning in data from our new sequencing machines.
Traditional sequencing:
• 96 sequencing reactions carried out per run.
Next-generation sequencing:
• 52 million reactions per run.
Machines are cheap(ish) and small.
• Small labs can afford one.
• Big labs can afford lots of them.

Illumina HiSeq
• Migrating to Illumina HiSeq since October 2010.
• 5 times more data than the Illumina GA2.
• 20 machines on site.
• Makes data management extremely difficult.
http://www.illumina.com

E. R. Mardis, Nature 470, 198-203 (2011)

Output Trends
Our peak “old generation” sequencing:
• August 2007: 3.5 Gbases/month.
Current output:
• Jan 2010: 4 Tbases/month.
1000x increase in our sequencing output.
• In August 2007, the total size of GenBank was 200 Gbases.
Improvements in chemistry continue to increase the output of machines.
[Figure: monthly sequencing output in Gbases, capillary vs. Illumina, up to Jan 2010]

Data Growth
Current weekly sequencing: 3000 Gbase
Peak yearly capillary sequencing: 30 Gbase

Managing Growth
We have exponential growth in storage and compute.
• Storage/compute doubles every 12 months.
– 2009: ~7 PB raw.
Gigabase of sequence ≠ gigabyte of storage.
• 16 bytes per base for sequence data.
• Intermediate analysis typically needs 10x the disk space of the raw data.
Moore's law will not save us.
• Transistor/disk density: T_d = 18 months.
• Sequencing cost: T_d = 12 months.
By Guy Coates
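The arithmetic behind this slide can be sketched in a few lines of Python. The 16 bytes per base, the 10x intermediate factor and the two T_d values come from the slide, and the 3000 Gbase weekly figure comes from the Data Growth slide. The function names, decimal units (1 TB = 10^12 bytes) and the 36-month horizon are assumptions made only for illustration.

```python
# Figures quoted on the slides.
BYTES_PER_BASE = 16        # storage per base of sequence data
INTERMEDIATE_FACTOR = 10   # intermediate analysis needs ~10x the raw data


def raw_storage_bytes(bases):
    """Disk needed for the raw sequence data of `bases` bases."""
    return bases * BYTES_PER_BASE


def scratch_storage_bytes(bases):
    """Disk needed for intermediate analysis of the same data."""
    return raw_storage_bytes(bases) * INTERMEDIATE_FACTOR


def growth_factor(months, doubling_time_months):
    """How much a quantity grows in `months` if it doubles every T_d months."""
    return 2 ** (months / doubling_time_months)


weekly_bases = 3000 * 10**9                              # 3000 Gbase per week
raw_tb = raw_storage_bytes(weekly_bases) / 10**12        # raw data, in TB
scratch_tb = scratch_storage_bytes(weekly_bases) / 10**12

# Illustrative 36-month horizon: T_d = 12 months vs. T_d = 18 months.
sequencing_growth = growth_factor(36, 12)
hardware_growth = growth_factor(36, 18)

print(f"raw: {raw_tb:.0f} TB/week, scratch: {scratch_tb:.0f} TB/week")
print(f"after 36 months: sequencing x{sequencing_growth:.0f}, "
      f"hardware x{hardware_growth:.0f}")
```

The gap between the two growth factors widens with every doubling period, which is the sense in which Moore's law does not keep up.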

Economic Trends
The Human Genome Project:
• 13 years.
• 23 labs.
• $500 million.
A human genome today:
• 3 days.
• 1 machine.
• $10,000.
• Large centres are now doing studies with 1000s and 10,000s of genomes.
Changes in sequencing technology are going to continue this trend.
• “Next-next” generation sequencers are on their way.
• One Pacific Biosciences RS test machine at WTSI now.
• A $500 genome is probable within 5 years.

Managing Data

Bulk Data
Data size per genome, from structured data (databases) down to unstructured data (flat files):
• Individual features (3 MB)
• Variation data (1 GB)
• Alignments (200 GB)
• Sequence + quality data (500 GB)
• Intensities / raw data (2 TB)
[Figure label: Sequencing informatics specialists]
By Guy Coates

Bulk Data Management
We thought we were really good at it.
• All samples that come through the sequencing lab are bar-coded and tracked (Laboratory Information Systems).
• Sequencing machines fed into an automated analysis pipeline.
• All the data was tracked, analysed and archived appropriately.
Strict metadata controls.
• Experiments do not start in the wet lab until the investigator has supplied all the data privacy and archiving requirements.
– Anonymised data → straight into the archive.
– Identifiable data → private/controlled archives.
– Some data held back until journal publication.

[Figure: sequencing data flow. Labels: mainly for QC pipeline; SRF; SRA; fastq; analysis, alignment, assembly; further analysis; Ensembl annotation]

We had been focused on the sequencing pipeline.
• For many investigators, data coming off the end of the sequencing pipeline is where they start.
• Investigators take the mass of finished sequence data out of the archives, onto our compute farms and “do stuff”.
Huge explosion of data and disk use all over the institute.
• We had no idea what people were doing with their data.

Alignment
Find the best match of fragments to a known genome / genomes.
• “grep” for DNA sequences.
• Use more sophisticated algorithms that can do fuzzy matching.
– Real DNA has insertions, deletions and mutations.
– Typical algorithms are maq, bwa, ssaha, blast.
Reference: ...TTTGCTGAAACCCAAGTGACGCCATCCAGCGTGACCACTGCATTTTTCTCGGTCATCACCAGCATTCTC....
Query: CAAGTGACGCCATCCAGCGTGACCACTGCATTTTTCT A GGTCATCACCAGCA
Look for differences:
• Single base pair differences (SNP).
• Larger insertions/deletions/mutations.
Typical experiment:
• Compare cancer cell genomes with healthy ones.
By Guy Coates
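As a rough illustration of “grep with fuzzy matching”, the sketch below slides the query from the slide along the reference excerpt and keeps the offset with the fewest mismatches. This is a toy, ungapped Hamming-distance search written for this example. It is not how maq, bwa, ssaha or blast work, and it cannot find insertions or deletions. The function names are hypothetical, and positions are 0-based within the excerpt shown on the slide.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def best_ungapped_match(reference, query):
    """Return (offset, mismatches) of the best ungapped placement of query."""
    best_offset, best_dist = None, None
    for offset in range(len(reference) - len(query) + 1):
        window = reference[offset:offset + len(query)]
        dist = hamming(window, query)
        if best_dist is None or dist < best_dist:
            best_offset, best_dist = offset, dist
    return best_offset, best_dist


def differences(reference, query, offset):
    """List (reference position, reference base, query base) for each mismatch."""
    return [(offset + i, reference[offset + i], q)
            for i, q in enumerate(query)
            if reference[offset + i] != q]


# Sequences copied from the slide (the spaced-out "A" in the query is joined).
REFERENCE = ("TTTGCTGAAACCCAAGTGACGCCATCCAGCGTGACCACTGCATTTTTCTC"
             "GGTCATCACCAGCATTCTC")
QUERY = "CAAGTGACGCCATCCAGCGTGACCACTGCATTTTTCTAGGTCATCACCAGCA"

offset, distance = best_ungapped_match(REFERENCE, QUERY)
snps = differences(REFERENCE, QUERY, offset)
print(offset, distance, snps)
```

On these two sequences the search places the query with a single mismatch, the A highlighted on the slide where the reference has a C. Real aligners index the reference instead of scanning it, which is what makes millions of reads per run tractable.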

Assembly
Assemble fragments into a complete genome.
• Typical experiment: build a reference genome for a new species.
“De novo” assembly.
• Assemble fragments with no external data.
• Harder than it looks.
– Non-uniform coverage, low depth, non-unique sequence (repeats).
By Guy Coates
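To make the idea concrete, here is a minimal greedy overlap-merge sketch: repeatedly join the two fragments with the longest suffix/prefix overlap. It is an illustrative toy under strong assumptions (error-free reads, no repeats, a single strand), which are exactly the conditions the slide says do not hold in practice. The fragments and function names are made up for the example.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_len=3):
    """Merge fragments by largest overlap until no overlaps remain."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left to merge; return the remaining contigs
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags)
                 if idx not in (best_i, best_j)] + [merged]
    return frags


# Hypothetical error-free fragments covering a 14-base toy "genome".
contigs = greedy_assemble(["TGCAATCC", "ATGGCGTG", "GCGTGCAA"])
print(contigs)
```

With repeated sequence the largest overlap is no longer unique and a greedy merge can join the wrong fragments, which is one reason de novo assembly is harder than it looks.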

Analysing Cancer Genomes
Cancer genomes contain a lot of genetic damage.
• Many of the mutations in cancer are incidental.
• An initial mutation disrupts the normal DNA repair/replication processes.
• Corruption spreads through the rest of the genome.
Today: find the “driver” mutations amongst the thousands of “passengers”.
• Identifying the driver mutations will give us new targets for therapies.
Tomorrow: analyse the cancer genome of every patient in the clinic.
• Variations in the genetic makeup of a patient and their cancer play a major role in how effective a particular drug will be.
• Clinicians will use this information to tailor therapies.

Accidents waiting to happen... From: <User A> (who left 12 months ago) I find the <project> directory is removed . The original directory is "/scratch/ <User B> (who left 6 months ago) " ..where is it ? If this problem cannot be solved ,I am afraid that <project> cannot be released.

Need a file tracking system for unstructured data!
• Groups could not keep track of where their results were.
• Problem exacerbated by student turnover (summer students, PhD students, visiting researchers on rotation).
Big wins with little effort.
• Disk space usage dropped by 2/3.
– Lots of individuals were keeping copies of the same data set “so I know where it is”.
• Team leaders are happy that their data are where they think they are.
– Important stuff is on file systems that are backed up etc.
But:
• Systems are ad hoc, quick hacks.
• We want an institute-wide, standardised system.
– Invest in people to maintain/develop it.
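One of the simplest checks such a tracking system can run is to spot identical copies of the same data set. The sketch below groups files under a directory by SHA-256 checksum. It is a hypothetical illustration using only the Python standard library, not the ad hoc tooling the slide refers to, and the function names are invented here.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Map checksum -> sorted paths, for content stored more than once."""
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so a link to a file is not counted as a copy.
            if os.path.isfile(path) and not os.path.islink(path):
                by_digest[file_digest(path)].append(path)
    return {d: sorted(paths) for d, paths in by_digest.items()
            if len(paths) > 1}
```

Calling `find_duplicates("/some/scratch/area")` (a placeholder path) returns only the checksums that occur more than once. Hashing every byte is slow at petabyte scale, so a real system would more likely record checksums once, when files are registered.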

Data Grid
• Many scientific fields today deal with large, geographically distributed data sets. The size of these data sets has scaled up from terabytes to petabytes.
• This combines several issues:
– large datasets,
– distributed data,
– computationally intensive analysis.
• Data grid: a unified environment which allows users to deal with all of the above issues.
• SRB, dCache, CASTOR, etc.

iRODS Architecture

iRODS
• iRODS: Integrated Rule-Oriented Data System.
• Produced by the DICE (Data Intensive Cyber Environments) group at the University of North Carolina, Chapel Hill.
• Successor to SRB.

Important Features
• Catalogue: maps logical file names to physical locations.
• Metadata: metadata can be attached to each file.
• Rule engine:
– Manipulate files or the DB. For example, replicate data to multiple resources.
– Implement policies.
• Easy-to-use client tools:
– icommands.
– Web interface.
– API.
• Federation.
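The first three features can be pictured with a small in-memory model: a catalogue that maps a logical name to its physical replicas, key/value metadata on each object, and a rule that fires on registration to record a second replica. This is a conceptual toy only. None of the classes, functions, resource names or paths below are part of iRODS or its API, and the rule merely records a replica location instead of copying any bytes.

```python
from dataclasses import dataclass, field


@dataclass
class DataObject:
    logical_name: str
    replicas: dict = field(default_factory=dict)   # resource -> physical path
    metadata: dict = field(default_factory=dict)   # attribute -> value


class Catalogue:
    """Toy catalogue: logical names, replicas, metadata and rules."""

    def __init__(self):
        self._objects = {}
        self._rules = []

    def add_rule(self, rule):
        """Register a callable run on every object after registration."""
        self._rules.append(rule)

    def register(self, logical_name, resource, physical_path, **metadata):
        obj = self._objects.setdefault(logical_name, DataObject(logical_name))
        obj.replicas[resource] = physical_path
        obj.metadata.update(metadata)
        for rule in self._rules:
            rule(self, obj)
        return obj

    def locate(self, logical_name):
        """Physical locations of one logical file."""
        return dict(self._objects[logical_name].replicas)

    def query(self, **criteria):
        """Logical names whose metadata matches every criterion."""
        return sorted(name for name, obj in self._objects.items()
                      if all(obj.metadata.get(k) == v
                             for k, v in criteria.items()))


def replicate_to(resource, prefix):
    """Policy: every object should also have a replica on `resource`."""
    def rule(catalogue, obj):
        if resource not in obj.replicas:
            # A real rule would copy the data; here we only record the path.
            obj.replicas[resource] = prefix + obj.logical_name
    return rule


catalogue = Catalogue()
catalogue.add_rule(replicate_to("offsite", "/dr"))
catalogue.register("/seq/run1/lane1.bam", "warehouse",
                   "/nfs/p3/lane1.bam", study="cancer")
print(catalogue.locate("/seq/run1/lane1.bam"))
print(catalogue.query(study="cancer"))
```

The point of the separation is that users only ever name the logical path, so data can move between storage pools without breaking anyone's references.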

What are we doing with it?
Piloting it for internal use.
• Help groups keep track of their data.
• Move files between different storage pools.
– Fast scratch space ↔ warehouse disk ↔ offsite DR centre.
• Link metadata back to our LIMS/tracking databases.
We need to share data with other institutions.
• Public data is easy: FTP/HTTP.
• Controlled data is hard:
– Encrypt files and place on private FTP dropboxes.
– Cumbersome to manage and insecure.

First Stage: A preservation system

[Figure labels: BAM; multiple NFS partitions]
