Implementing iRODS for Next Generation Sequencing Data Management
Gen-Tao Chiang Wellcome Trust Sanger Institute gtc@sanger.ac.uk
ISGC, March 20, 2011
Outline
1. DNA Sequencing
2. Managing Data
3. iRODS
4. WTSI use case
5. Future Work
By Guy Coates
ER Mardis. Nature 470, 198-203 (2011)
[Figure: sequencing output in Gbases over time, Capillary vs. Illumina]
months.
2009 ~7 PB raw
disk space of the raw data.
Td = 12 months
By Guy Coates
10,000s of genomes.
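As a rough illustration of what a 12-month doubling time implies, the sketch below projects storage growth exponentially. It assumes that Td denotes a doubling time and that the ~7 PB figure for 2009 is the starting point; the function name `projected_size_pb` is hypothetical and the numbers are not a forecast from the talk.

```python
def projected_size_pb(initial_pb: float, months_elapsed: float,
                      doubling_months: float = 12.0) -> float:
    """Exponential growth: the size doubles every `doubling_months` months.

    Assumption: Td on the slide is a doubling time (12 months).
    """
    return initial_pb * 2 ** (months_elapsed / doubling_months)


if __name__ == "__main__":
    # Assumed starting point: ~7 PB of raw data in 2009 (from the slide).
    for years in range(4):
        size = projected_size_pb(7.0, 12 * years)
        print(f"2009 + {years} years: about {size:.0f} PB")
```

Under these assumptions the raw volume grows eightfold in three years, which is the kind of pressure that motivates a managed data layer.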
Data size per genome:
Intensities / raw data: 2 TB
Alignments: 200 GB
Sequence + quality data: 500 GB
Variation data: 1 GB
Individual features: 3 MB
Diagram labels: Structured data (databases); Unstructured data (flat files); Sequencing informatics specialists
By Guy Coates
Anonymised data → straight into the archive. Identifiable data → private/controlled archives. Some data held back until journal publication.
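The release rules above can be read as a small decision procedure. The sketch below is one possible encoding, not the institute's actual policy engine: the `Dataset` fields, the `Destination` names, and the choice to let a publication hold take precedence over the other two rules are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Destination(Enum):
    PUBLIC_ARCHIVE = "archive"
    CONTROLLED_ARCHIVE = "private/controlled archive"
    HELD_BACK = "held back until journal publication"


@dataclass(frozen=True)
class Dataset:
    name: str                              # hypothetical identifier
    identifiable: bool                     # False means anonymised
    awaiting_publication: bool = False     # hypothetical hold flag


def route(dataset: Dataset) -> Destination:
    """Pick a destination for a dataset.

    Assumption: a publication hold is checked first, then identifiability.
    """
    if dataset.awaiting_publication:
        return Destination.HELD_BACK
    if dataset.identifiable:
        return Destination.CONTROLLED_ARCHIVE
    return Destination.PUBLIC_ARCHIVE


if __name__ == "__main__":
    for ds in (
        Dataset("anonymised-run", identifiable=False),
        Dataset("patient-cohort", identifiable=True),
        Dataset("unpublished-study", identifiable=False, awaiting_publication=True),
    ):
        print(f"{ds.name}: {route(ds).value}")
```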
SRF SRA fastq Mainly for QC pipeline Analysis, alignment, assembly Further analysis Ensembl annotation
Find the best match of fragments to a known genome / genomes.
Real DNA has insertions, deletions and mutations. Typical algorithms are maq, bwa, ssaha, blast.
Look for differences
Reference: ...TTTGCTGAAACCCAAGTGACGCCATCCAGCGTGACCACTGCATTTTTCTCGGTCATCACCAGCATTCTC....
Query: CAAGTGACGCCATCCAGCGTGACCACTGCATTTTTCTAGGTCATCACCAGCA
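To make "find the best match, then look for differences" concrete, here is a toy sketch using the reference and query fragments from the slide. It slides the query along the reference without gaps and keeps the offset with the fewest mismatches. This is a deliberately naive stand-in: it handles substitutions only, not insertions or deletions, and it is not how maq, bwa, ssaha or blast work internally. The function names are hypothetical.

```python
REFERENCE = "TTTGCTGAAACCCAAGTGACGCCATCCAGCGTGACCACTGCATTTTTCTCGGTCATCACCAGCATTCTC"
QUERY = "CAAGTGACGCCATCCAGCGTGACCACTGCATTTTTCTAGGTCATCACCAGCA"


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def best_ungapped_match(reference: str, query: str) -> tuple[int, int]:
    """Return (offset, mismatches) of the best gap-free placement of query."""
    if len(query) > len(reference):
        raise ValueError("query is longer than reference")
    n = len(query)
    offsets = range(len(reference) - n + 1)
    best = min(offsets, key=lambda off: hamming(reference[off:off + n], query))
    return best, hamming(reference[best:best + n], query)


def differences(reference: str, query: str, offset: int) -> list[tuple[int, str, str]]:
    """List (reference_position, reference_base, query_base) for each mismatch."""
    window = reference[offset:offset + len(query)]
    return [(offset + i, r, q)
            for i, (r, q) in enumerate(zip(window, query)) if r != q]


if __name__ == "__main__":
    offset, mismatches = best_ungapped_match(REFERENCE, QUERY)
    print(f"best offset: {offset}, mismatches: {mismatches}")
    for pos, ref_base, query_base in differences(REFERENCE, QUERY, offset):
        print(f"reference position {pos} (0-based): {ref_base} -> {query_base}")
```

The exhaustive scan costs time proportional to reference length times query length, which is fine for a 69-base fragment but hopeless for a whole genome; that gap is why production aligners rely on indexed data structures.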
By Guy Coates
Non uniform coverage, low
By Guy Coates
Lots of individuals keeping copies of the same data set “so I know where it is”.
Important stuff is on file systems that are backed up etc.
Invest in people to maintain/develop it.
Fast scratch space ↔ warehouse disk ↔ Offsite DR centre.
Multiple NFS Partitions BAM
1k_v37.fasta
Sequencing Centre Data Biologist Biologist Biologist
Sequencing Centre Data Universities Data Small Labs Data Sequencing Centre Data Federation Access