
slide-1
SLIDE 1

Implementing iRODS for Next Generation Sequencing Data Management

Gen-Tao Chiang Wellcome Trust Sanger Institute gtc@sanger.ac.uk

ISGC, March 20, 2011

slide-2
SLIDE 2

Outline

  • 1. DNA Sequencing
  • 2. Managing Data
  • 3. iRODS
  • 4. WTSI use case
  • 5. Future Works

slide-3
SLIDE 3

The Sanger Institute

Funded by the Wellcome Trust.

  • 2nd largest research charity in the world.
  • More than 800 employees.
  • Based at the Wellcome Trust Genome Campus, Hinxton, Cambridge, UK (shared with EBI).
  • Most cited in the UK (Science Watch, 2008).

Large-scale genomic research.

  • Sequenced 1/3 of the human genome (largest single contributor).
  • We have active cancer, malaria, pathogen and genomic variation / human health studies.
  • All data is made publicly available: websites, FTP, direct database access, programmatic APIs.

By Guy Coates

slide-4
SLIDE 4

Data Centre

  • Completed in 2005.
  • 1,000 square meters of floor space split equally into four rooms.
  • Able to support up to 50,000 processors.
  • Currently about 10,000 cores and 10 petabytes of storage.

slide-5
SLIDE 5

Managing Data

slide-6
SLIDE 6

DNA Sequencing

slide-7
SLIDE 7

Capillary Based

  • In 2001, in the era of the HGP, DNA sequencing technology used a capillary-based approach.
  • Each sequencer produced about 115 kbp (thousand base pairs) per day (Mardis, 2011).

slide-8
SLIDE 8

Next Generation Sequencing

The life sciences are drowning in data from our new sequencing machines.

Traditional sequencing:

  • 96 sequencing reactions carried out per run.

Next-generation sequencing:

  • 52 million reactions per run.

Machines are cheap(ish) and small.

  • Small labs can afford one.
  • Big labs can afford lots of them.

slide-9
SLIDE 9

Illumina HiSeq

  • Migrating to Illumina HiSeq since October 2010.
  • 5 times more data than the Illumina GA2.
  • 20 machines on site.
  • Makes data management extremely difficult.

http://www.illumina.com

slide-10
SLIDE 10

ER Mardis. Nature 470, 198-203 (2011)

slide-11
SLIDE 11

Output Trends

Our peak “old generation” sequencing:

  • August 2007: 3.5 Gbases/month.

Current output:

  • Jan 2010: 4 Tbases/month.

1000x increase in our sequencing output.

  • In August 2007, the total size of GenBank was 200 Gbases.

Improvements in chemistry continue to increase the output of machines.

[Bar chart: monthly output in Gbases, capillary (3.5) vs. Illumina in Jan 2010 (4000)]

slide-12
SLIDE 12

Data Growth

Peak yearly capillary sequencing: 30 Gbase

Current weekly sequencing: 3000 Gbase

slide-13
SLIDE 13

Managing Growth

We have exponential growth in storage and compute.

  • Storage/compute doubles every 12 months.

– 2009: ~7 PB raw.

A gigabase of sequence ≠ a gigabyte of storage.

  • 16 bytes per base for sequence data.
  • Intermediate analysis typically needs 10x the disk space of the raw data.

Moore's law will not save us.

  • Transistor/disk density doubles every 18 months (Td = 18 months).
  • Sequencing cost: Td = 12 months.

By Guy Coates
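
The arithmetic behind this slide can be sketched in a few lines of Python. The figures (16 bytes per base, 10x intermediate space, 4 Tbases/month, 12- and 18-month doubling times) come from the slides; the function names and the use of decimal terabytes are assumptions made for illustration only.

```python
BYTES_PER_BASE = 16        # from the slide: 16 bytes per base of sequence data
INTERMEDIATE_FACTOR = 10   # from the slide: analysis needs ~10x the raw data


def raw_terabytes(bases: float) -> float:
    """Raw storage for a number of bases, in decimal terabytes (assumption)."""
    return bases * BYTES_PER_BASE / 1e12


def growth_factor(months: float, doubling_time_months: float) -> float:
    """How much a quantity grows in `months` if it doubles every Td months."""
    return 2 ** (months / doubling_time_months)


monthly_bases = 4e12                          # 4 Tbases/month (Jan 2010)
raw_tb = raw_terabytes(monthly_bases)         # raw sequence data per month
scratch_tb = raw_tb * INTERMEDIATE_FACTOR     # intermediate analysis space

# Three years out: demand (Td = 12 months) vs. disk density (Td = 18 months).
demand_growth = growth_factor(36, 12)
density_growth = growth_factor(36, 18)

print(f"raw: {raw_tb:.0f} TB/month, intermediate: {scratch_tb:.0f} TB/month")
print(f"after 3 years: demand x{demand_growth:.0f}, disk density x{density_growth:.0f}")
```

The gap between the two growth factors is the point of "Moore's law will not save us": under these assumptions demand doubles once more than disk density does over three years.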

slide-14
SLIDE 14

Economic Trends:

The Human genome project:

  • 13 years.
  • 23 labs.
  • $500 Million.

A Human genome today:

  • 3 days.
  • 1 machine.
  • $10,000.
  • Large centres are now doing studies with 1000s and 10,000s of genomes.

Changes in sequencing technology are going to continue this trend.

  • “Next-next” generation sequencers are on their way.
  • One Pacific Biosciences RS test machine at WTSI now.
  • $500 genome is probable within 5 years.
slide-15
SLIDE 15

Managing Data

slide-16
SLIDE 16

Bulk Data

Data size per genome:

  • Intensities / raw data (2 TB)
  • Alignments (200 GB)
  • Sequence + quality data (500 GB)
  • Variation data (1 GB)
  • Individual features (3 MB)

[Diagram labels: structured data (databases); unstructured data (flat files); sequencing informatics specialists]

By Guy Coates

slide-17
SLIDE 17

Bulk Data Management

We thought we were really good at it.

  • All samples that come through the sequencing lab are bar-coded and tracked (Laboratory Information Systems).
  • Sequencing machines fed into an automated analysis pipeline.
  • All the data was tracked, analysed and archived appropriately.

Strict metadata controls.

  • Experiments do not start in the wet lab until the investigator has supplied all the required data privacy and archiving requirements.

– Anonymised data → straight into the archive.
– Identifiable data → private/controlled archives.
– Some data held back until journal publication.

slide-18
SLIDE 18

[Diagram: data flow. Labels: SRF; SRA; fastq; mainly for QC pipeline; analysis, alignment, assembly; further analysis; Ensembl annotation]

slide-19
SLIDE 19

slide-20
SLIDE 20

slide-21
SLIDE 21

We had been focused on the sequencing pipeline.

  • For many investigators, data coming off the end of the sequencing pipeline is where they start.
  • Investigators take the mass of finished sequence data out of the archives, onto our compute farms and “do stuff”.

Huge explosion of data and disk use all over the institute.

  • We had no idea what people were doing with their data.

slide-22
SLIDE 22

Alignment

Find the best match of fragments to a known genome / genomes.

  • “grep” for DNA sequences.
  • Use more sophisticated algorithms that can do fuzzy matching.

– Real DNA has insertions, deletions and mutations.
– Typical algorithms are maq, bwa, ssaha, blast.

Look for differences.

  • Single base pair differences (SNP).
  • Larger insertions/deletions/mutations.

Typical experiment:

  • Compare cancer cell genomes with healthy ones.

Reference: ...TTTGCTGAAACCCAAGTGACGCCATCCAGCGTGACCACTGCATTTTTCTCGGTCATCACCAGCATTCTC....

Query: CAAGTGACGCCATCCAGCGTGACCACTGCATTTTTCTAGGTCATCACCAGCA

By Guy Coates
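
The "fuzzy grep" idea can be illustrated with a toy ungapped search, shown here on the reference and query from the slide. This is only a sketch: it slides the query along the reference and counts mismatches, which finds single-base differences but not insertions or deletions, and it is not how maq, bwa, ssaha or blast work internally. The function and type names are hypothetical.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    offset: int            # start of the best window in the reference
    mismatches: List[int]  # query positions that differ from the reference


def best_ungapped_hit(reference: str, query: str) -> Hit:
    """Return the window of `reference` with the fewest mismatches to `query`."""
    if len(query) > len(reference):
        raise ValueError("query is longer than the reference")
    best = None
    for offset in range(len(reference) - len(query) + 1):
        window = reference[offset:offset + len(query)]
        diffs = [i for i, (r, q) in enumerate(zip(window, query)) if r != q]
        if best is None or len(diffs) < len(best.mismatches):
            best = Hit(offset, diffs)
    return best


REFERENCE = "TTTGCTGAAACCCAAGTGACGCCATCCAGCGTGACCACTGCATTTTTCTCGGTCATCACCAGCATTCTC"
QUERY = "CAAGTGACGCCATCCAGCGTGACCACTGCATTTTTCTAGGTCATCACCAGCA"

hit = best_ungapped_hit(REFERENCE, QUERY)
for i in hit.mismatches:
    # Each mismatch is a candidate single-base difference (SNP).
    print(f"reference {REFERENCE[hit.offset + i]} -> query {QUERY[i]} "
          f"at reference position {hit.offset + i}")
```

On the slide's example the best window differs from the query at a single base, which is the kind of single-base difference the slide calls a SNP.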

slide-23
SLIDE 23

Assembly

Assemble fragments into a complete genome.

  • Typical experiment: collect a reference genome for a new species.

“De-novo” assembly.

  • Assemble fragments with no external data.
  • Harder than it looks.

– Non-uniform coverage, low depth, non-unique sequence (repeats).

By Guy Coates
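
A toy version of de novo assembly is the greedy overlap merge sketched below: repeatedly join the two fragments with the longest suffix/prefix overlap. The fragments are made-up data and the function names are hypothetical; production assemblers use far more elaborate methods, precisely because repeats and uneven coverage break this simple approach.

```python
from typing import List


def overlap(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `left` that is a prefix of `right`."""
    for n in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:n]):
            return n
    return 0


def greedy_assemble(fragments: List[str]) -> List[str]:
    """Merge fragments pairwise by best overlap until no overlaps remain."""
    frags = list(fragments)
    while len(frags) > 1:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing overlaps: the result stays in several contigs
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Hypothetical fragments cut from one short made-up sequence.
contigs = greedy_assemble(["CGTACGTT", "ATGGCGTA", "CGTTAGC"])
print(contigs)
```

With a repeated subsequence in the input, two different merges can have the same overlap length, and the greedy choice may be the wrong one.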

slide-24
SLIDE 24

Analysing Cancer Genomes

Cancer genomes contain a lot of genetic damage.

  • Many of the mutations in cancer are incidental.
  • Initial mutation disrupts the normal DNA repair/replication processes.
  • Corruption spreads through the rest of the genome.

Today: Find the “driver” mutations amongst the thousands of “passengers”.

  • Identifying the driver mutations will give us new targets for therapies.

Tomorrow: Analyse the cancer genome of every patient in the clinic.

  • Variations in the genetic makeup of a patient and of their cancer play a major role in how effective a particular drug will be.
  • Clinicians will use this information to tailor therapies.
slide-25
SLIDE 25

slide-26
SLIDE 26

Accidents waiting to happen...

From: <User A> (who left 12 months ago)

I find the <project> directory is removed. The original directory is "/scratch/<User B> (who left 6 months ago)" ... where is it? If this problem cannot be solved, I am afraid that <project> cannot be released.

slide-27
SLIDE 27

Need a file tracking system for unstructured data!

  • They could not keep track of where the results were.
  • Problem exacerbated by student turnover (summer students, PhD students, visiting researchers on rotation).

Big wins with little effort.

  • Disk space usage dropped by 2/3.

– Lots of individuals were keeping copies of the same data set “so I know where it is”.

  • Team leaders are happy that their data are where they think they are.

– Important stuff is on file systems that are backed up etc.

But:

  • Systems are ad-hoc, quick hacks.
  • We want an institute-wide, standardised system.

– Invest in people to maintain/develop it.

slide-28
SLIDE 28

Data Grid

  • Many science fields today require dealing with large and geographically distributed data sets. The size of these data sets has scaled up from terabytes to petabytes.
  • Several issues combine:

– large datasets
– distributed data
– computationally intensive analysis

  • Data grid: a unified environment which allows users to deal with all of the above issues.
  • Examples: SRB, dCache, CASTOR, etc.

slide-29
SLIDE 29

iRODS Architecture

slide-30
SLIDE 30

iRODS

  • iRODS: Integrated Rule-Oriented Data System.
  • Produced by the DICE (Data Intensive Cyber Environments) group at the University of North Carolina at Chapel Hill.
  • Successor to SRB.
slide-31
SLIDE 31

Important Features

  • Catalogue: maps logical file names to physical locations.
  • Metadata: metadata can be attached to each file.
  • Rule engine:

– Manipulates files or databases, for example replicating data to multiple resources.
– Implements policies.

  • Easy-to-use client tools:

– icommands
– Web interface
– API

  • Federation
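
The catalogue and metadata features listed above can be pictured with a small in-memory model. This is a conceptual sketch only, not the iRODS catalogue or any iRODS API: the class names are hypothetical, the physical paths are invented, and only the logical path and resource names (res-g2, res-r2) are taken from later slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataObject:
    logical_path: str
    replicas: Dict[str, str] = field(default_factory=dict)  # resource -> physical path
    metadata: Dict[str, str] = field(default_factory=dict)  # attribute -> value


class Catalogue:
    """Maps one logical name to its physical copies and its metadata."""

    def __init__(self) -> None:
        self._objects: Dict[str, DataObject] = {}

    def register(self, logical_path: str, resource: str, physical_path: str) -> DataObject:
        obj = self._objects.setdefault(logical_path, DataObject(logical_path))
        obj.replicas[resource] = physical_path
        return obj

    def locate(self, logical_path: str) -> Dict[str, str]:
        return dict(self._objects[logical_path].replicas)

    def query(self, attribute: str, value: str) -> List[str]:
        return sorted(path for path, obj in self._objects.items()
                      if obj.metadata.get(attribute) == value)


catalogue = Catalogue()
bam = "/seq/5635/5635_3#2.bam"
# Physical paths below are invented for the example.
catalogue.register(bam, "res-g2", "/vault-g2/seq/5635/5635_3#2.bam")
obj = catalogue.register(bam, "res-r2", "/vault-r2/seq/5635/5635_3#2.bam")
obj.metadata["id_run"] = "5635"

print(catalogue.locate(bam))
print(catalogue.query("id_run", "5635"))
```

Users only ever see the logical path; which resource serves the bytes, and how many copies exist, is a matter for the catalogue and the rules.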

slide-32
SLIDE 32

What are we doing with it?

Piloting it for internal use.

  • Help groups keep track of their data.
  • Move files between different storage pools.

– Fast scratch space ↔ warehouse disk ↔ offsite DR centre.

  • Link metadata back to our LIMS/tracking databases.

We need to share data with other institutions.

  • Public data is easy: FTP/HTTP.
  • Controlled data is hard:

– Encrypt files and place on private FTP dropboxes.
– Cumbersome to manage and insecure.
slide-33
SLIDE 33

First Stage: A preservation system

slide-34
SLIDE 34

Multiple NFS Partitions BAM

slide-35
SLIDE 35

Rules in core.irb

  • Force the “seq” resource group as the default resource:

acSetRescSchemeForCreate||msiSetDefaultResc(seq,forced)|nop

  • Define the “seq” resource group as the preferred resource and avoid res-r2. The idea is to use res-r2 as a backup resource:

acPreprocForDataObjOpen||msiSetDataObjPreferredResc(seq%res-g2)##msiSetDataObjAvoidResc(res-r2)|msiSetDataObjPreferredResc(seq%res-r2)##nop

  • After the data has been put into iRODS, replicate the data object to all the resources within the “seq” resource group:

acPostProcForPut||msiSysReplDataObj(seq,all)|nop
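
Each rule above follows the core.irb layout of four fields separated by "|": rule name, condition, action chain and recovery chain, with "##" separating the steps of a chain. The Python sketch below splits a rule line along those separators to make the structure visible. The parser is a hypothetical helper for reading the slides, not part of iRODS, and it assumes that no field contains a literal "|".

```python
from typing import List, NamedTuple


class Rule(NamedTuple):
    name: str
    condition: str        # empty string means "always applies"
    actions: List[str]    # microservices run in order
    recovery: List[str]   # matching recovery steps


def parse_rule(line: str) -> Rule:
    """Split one core.irb-style rule line into its four fields."""
    name, condition, actions, recovery = (part.strip() for part in line.split("|"))
    return Rule(name, condition, actions.split("##"), recovery.split("##"))


put_rule = parse_rule("acPostProcForPut||msiSysReplDataObj(seq,all)|nop")
open_rule = parse_rule(
    "acPreprocForDataObjOpen||"
    "msiSetDataObjPreferredResc(seq%res-g2)##msiSetDataObjAvoidResc(res-r2)|"
    "msiSetDataObjPreferredResc(seq%res-r2)##nop"
)

print(put_rule)
print(open_rule.actions)
```

Read this way, the second rule has two actions (prefer res-g2 within "seq", avoid res-r2) and two recovery steps.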

slide-36
SLIDE 36

Data Replication

  • At this moment, files can be replicated to both resources (res-g2 and res-r2) directly.

irods-g1@irods-g1:/tmp$ iput repl.ir
irods-g1@irods-g1:/tmp$ ils -l repl.ir
  irods-g1          0 res-g2          74 2011-03-11.12:11 & repl.ir
  irods-g1          1 res-r2          74 2011-03-11.12:11 & repl.ir

slide-37
SLIDE 37

Data Replication

  • May still miss a replication!

– Network instability
– Servers temporarily not available
– Unknown reasons

  • We need extra protection:

– One more rule using delayExec to check the replication periodically.
– Running irule.
– Running irepl -BPr /seq from crontab.
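
A periodic check of this kind could, for example, scan `ils -l` output for objects that lack a good replica on both resources. The Python sketch below assumes the column layout shown on the previous slide (owner, replica number, resource, size, timestamp, "&", name) and treats "&" as the marker of a good replica. The helper names and the third sample line are hypothetical, and names containing spaces are not handled.

```python
from collections import defaultdict
from typing import Dict, List, Set


def good_replicas(ils_long_output: str) -> Dict[str, Set[str]]:
    """Map each data object name to the resources holding a replica marked '&'."""
    resources: Dict[str, Set[str]] = defaultdict(set)
    for line in ils_long_output.splitlines():
        fields = line.split()
        # Skip collection headers, sub-collections and anything unexpected.
        if len(fields) < 6 or not fields[1].isdigit():
            continue
        replicas = resources[fields[-1]]   # registers the object even if stale
        if "&" in fields[5:-1]:
            replicas.add(fields[2])
    return dict(resources)


def under_replicated(ils_long_output: str,
                     required=("res-g2", "res-r2")) -> List[str]:
    """Names of objects missing a good replica on any required resource."""
    return sorted(name for name, found in good_replicas(ils_long_output).items()
                  if not set(required) <= found)


SAMPLE = """\
  irods-g1          0 res-g2          74 2011-03-11.12:11 & repl.ir
  irods-g1          1 res-r2          74 2011-03-11.12:11 & repl.ir
  irods-g1          0 res-g2          74 2011-03-11.12:15 & only_one.ir
"""

print(under_replicated(SAMPLE))
```

Objects reported here are the ones a follow-up irepl run would need to repair.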

slide-38
SLIDE 38

Authentication

  • Password, GSI, Kerberos.
  • WTSI uses LDAP and AD. iRODS does not support LDAP.
  • Kerberos was chosen because it can be integrated with the current Active Directory.

slide-39
SLIDE 39

Kerberos

  • All users log in to the zone Sanger.
  • Zone Sanger is federated with zone Seq.
  • Users can use their original WTSI username/password to access iRODS and use icommands from their desktop or the farm. For example:

gtc@irods-sanger1:~$ kinit
Password for gtc@INTERNAL.SANGER.AC.UK:
gtc@irods-sanger1:~$ ils /seq |more
/seq:
  C- /seq/1003
  C- /seq/1008

slide-40
SLIDE 40

Web Client

slide-41
SLIDE 41

Web Client

  • Does not support Kerberos yet.
  • We will need to share data with other genome centres in the future.
  • Needs federated authentication such as Shibboleth.

slide-42
SLIDE 42

WTSI Use case

slide-43
SLIDE 43

WTSI Use Case

  • Currently, WTSI users use iRODS mainly for managing and accessing sequencing Binary Alignment/Map (BAM) files; BAM is the binary representation of the Sequence Alignment/Map (SAM) file format.
  • BAM files come from:

– Illumina HiSeq
– Conversion from fastq or old Illumina GA2 runs

  • Each BAM file normally has an index file (BAM index, BAI), which is stored in iRODS as well.
  • Data from PacBio are stored in /pacbio, but only for testing purposes at the moment.

slide-44
SLIDE 44

WTSI Use Case

  • The NGP Group is responsible for uploading the quality-controlled BAM files into iRODS.
  • Quality control Perl scripts running on the sequencing farm use an iRODS upload Perl module to upload the data.

slide-45
SLIDE 45

Metadata

  • The following metadata are inserted at the upload stage: study, library, sample, id_run, lane, tag, tag_index and human_split.
  • Together, id_run, lane and tag_index are unique within WTSI.
  • Some BAM files don't have a tag_index, which means the file is for the whole lane of the run.
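
The file names shown elsewhere in this talk (5635_3#2.bam, 5261_5_human.bam, 5208_2.bam) appear to encode the same identifying fields. The Python sketch below parses names under that inferred convention. The pattern is an assumption drawn only from those examples, not a documented WTSI naming rule, and "human" is the only suffix that appears in the slides.

```python
import posixpath
import re
from typing import Dict, Optional, Tuple

# Inferred layout: <id_run>_<lane>[#<tag_index>][_<human_split>].bam
BAM_NAME = re.compile(
    r"^(?P<id_run>\d+)_(?P<lane>\d+)"
    r"(?:#(?P<tag_index>\d+))?"
    r"(?:_(?P<human_split>[A-Za-z]+))?\.bam$"
)


def parse_bam_name(path: str) -> Dict[str, Optional[str]]:
    """Extract id_run, lane, tag_index and human_split from a BAM path."""
    match = BAM_NAME.match(posixpath.basename(path))
    if match is None:
        raise ValueError(f"unrecognised BAM name: {path}")
    return match.groupdict()


def unique_key(path: str) -> Tuple[str, str, Optional[str]]:
    """(id_run, lane, tag_index); tag_index is None for a whole-lane file."""
    fields = parse_bam_name(path)
    return (fields["id_run"], fields["lane"], fields["tag_index"])


tagged = parse_bam_name("/seq/5635/5635_3#2.bam")
whole_lane = parse_bam_name("/seq/5261/5261_5_human.bam")
print(tagged)
print(whole_lane)
```

A missing tag_index comes back as None, matching the slide's statement that such a file covers the whole lane.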

slide-46
SLIDE 46

Metadata

  • Each BAM file belongs to one study, has one or more samples, forms one actual library for sequencing, and each sequence may have a tag sequence with it.
  • Therefore, we added these metadata to the file: study, sample (possibly more than one), library and, where present, tag.

slide-47
SLIDE 47

BAM file with tag:

deskpro19635[gq1]5: imeta ls -d /seq/5635/5635_3#2.bam
AVUs defined for dataObj /seq/5635/5635_3#2.bam:
attribute: type
value: bam
units:
attribute: lane
value: 3
units:
attribute: sample
value: SZ0002
units:
attribute: reference
value: /nfs/repository/d0031/references/Streptococcus_equi/4047/all/bwa/S-equi-4047.fasta
units:
slide-48
SLIDE 48

attribute: study
value: Streptococcus equi genome diversity
units:
attribute: tag
value: CGATGTTT
units:
attribute: library
value: SZ0002 1560825
units:
attribute: id_run
value: 5635
units:
attribute: tag_index
value: 2
units:
attribute: alignment
value: 1
units:
slide-49
SLIDE 49

Metadata

  • Some BAM files contain non-consented human data, so we split the files into two parts: human and non-human. The public is not normally able to see the human part. human_split is used to indicate this situation.

slide-50
SLIDE 50

Metadata

  • Sequences in a BAM file may be aligned to a reference. The metadata attribute 'alignment' indicates this. If the file has an alignment, we add a 'reference' attribute to indicate which reference we used. The following are some examples.

slide-51
SLIDE 51

BAM without tag, containing only the non-consented human part:

srpipe@sf-2-1-01:/nfs/sf44/irods$ imeta ls -d /seq/5261/5261_5_human.bam
AVUs defined for dataObj /seq/5261/5261_5_human.bam:
attribute: type
value: bam
units:
attribute: study
value: Plasmodium falciparum Illumina sequencing R&D 1
units:
attribute: reference

slide-52
SLIDE 52

value: /nfs/repository/d0031/references/Homo_sapiens/1000Genomes/all/bwa/human_g1k_v37.fasta
units:
attribute: sample
value: PK0039
units:
attribute: human_split
value: human
units:
attribute: lane
value: 5
units:

slide-53
SLIDE 53

----
attribute: library
value: PK0039 455682
units:
----
attribute: id_run
value: 5261
units:
----
attribute: alignment
value: 1
units:
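
Output in this "attribute / value / units" layout is easy to turn into a dictionary for scripting. The sketch below is a hypothetical helper that parses text of the form shown on these slides; it works on captured output and does not call iRODS itself. Values are kept as lists because an attribute such as sample may occur more than once.

```python
from typing import Dict, List


def parse_imeta_ls(output: str) -> Dict[str, List[str]]:
    """Collect attribute -> list of values from `imeta ls`-style text."""
    avus: Dict[str, List[str]] = {}
    attribute = None
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("attribute:"):
            attribute = line[len("attribute:"):].strip()
        elif line.startswith("value:") and attribute is not None:
            avus.setdefault(attribute, []).append(line[len("value:"):].strip())
            attribute = None  # wait for the next attribute line
    return avus


# Abbreviated from the listing for /seq/5261/5261_5_human.bam.
SAMPLE_OUTPUT = """\
AVUs defined for dataObj /seq/5261/5261_5_human.bam:
attribute: type
value: bam
units:
----
attribute: human_split
value: human
units:
----
attribute: id_run
value: 5261
units:
"""

avus = parse_imeta_ls(SAMPLE_OUTPUT)
print(avus)
```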

slide-54
SLIDE 54

Metadata

  • Users can query the metadata to find the data they are interested in.

slide-55
SLIDE 55

gtc@irods-sanger1:~$ imeta qu -z seq -d study = 'Hyperplastic Polyposis'
collection: /seq/5208
dataObj: 5208_2.bam
collection: /seq/5208
dataObj: 5208_3.bam
collection: /seq/5208
dataObj: 5208_5.bam
collection: /seq/5230
dataObj: 5230_1.bam
collection: /seq/5230
dataObj: 5230_2.bam
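
For scripting, the collection/dataObj pairs in this query output can be joined into full logical paths. The following Python sketch is a hypothetical helper that works on captured output text in the layout shown above; it is not an iRODS client.

```python
import posixpath
from typing import List


def parse_imeta_qu(output: str) -> List[str]:
    """Join each 'collection:' line with the 'dataObj:' line that follows it."""
    paths: List[str] = []
    collection = None
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("collection:"):
            collection = line.split(":", 1)[1].strip()
        elif line.startswith("dataObj:") and collection is not None:
            paths.append(posixpath.join(collection, line.split(":", 1)[1].strip()))
    return paths


QUERY_OUTPUT = """\
collection: /seq/5208
dataObj: 5208_2.bam
collection: /seq/5208
dataObj: 5208_3.bam
collection: /seq/5208
dataObj: 5208_5.bam
collection: /seq/5230
dataObj: 5230_1.bam
collection: /seq/5230
dataObj: 5230_2.bam
"""

bam_paths = parse_imeta_qu(QUERY_OUTPUT)
print(bam_paths)
```

The resulting paths can then be handed to other icommands, one per file.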

slide-56
SLIDE 56

Future Work

slide-57
SLIDE 57

Non-Interactive environment

  • Kerberos credentials have a limited lifetime; if jobs are queued in the batch system (LSF in our case), the credential may expire before the job runs.
  • This means that Kerberos needs to be able to support a non-interactive computing environment.
  • Theoretically, the lifetime can be configured by the AD administrator. In practice, it would go against WTSI policy to configure credentials with an unlimited lifetime.
  • A Grid MyProxy server may be the answer, but it means extra work for users.
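
The timing problem described above is simple date arithmetic: a job is only safe if the credential outlives both the time spent in the queue and the run itself. The sketch below illustrates this with invented numbers; the 10-hour lifetime, the wait times and the function name are all hypothetical, and nothing here talks to Kerberos or LSF.

```python
from datetime import datetime, timedelta


def ticket_covers_job(now: datetime, ticket_expiry: datetime,
                      queue_wait: timedelta, runtime: timedelta) -> bool:
    """True if the credential is still valid when the job is expected to end."""
    return now + queue_wait + runtime <= ticket_expiry


now = datetime(2011, 3, 20, 9, 0)
expiry = now + timedelta(hours=10)  # hypothetical credential lifetime

# A short wait fits inside the credential lifetime...
short_wait_ok = ticket_covers_job(now, expiry, timedelta(hours=1), timedelta(hours=2))
# ...but a long spell in the queue does not.
long_wait_ok = ticket_covers_job(now, expiry, timedelta(hours=9), timedelta(hours=2))

print(short_wait_ok, long_wait_ok)
```

Since queue wait is not known at submission time, a check like this can only flag risk; it does not remove the need for renewable or delegated credentials.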

slide-58
SLIDE 58

More zones

  • Our idea is that each major project may have its own iRODS zone, with the zones federated to each other.
  • Zone UK10K and zone Cancer are going to be created next.

slide-59
SLIDE 59

Federation with Collaborators

Sequencing Centre Data Biologist Biologist Biologist

slide-60
SLIDE 60

Future Collaborations

Sequencing Centre Data Universities Data Small Labs Data Sequencing Centre Data Federation Access

slide-61
SLIDE 61

Collaboration

  • A typical project takes 18 months to 3 years.
  • Unlike the HEP model, with its Tier 0 (CERN) and some Tier 1 centres (RAL, ASGC, etc.).
  • Small labs join and leave more frequently.
  • There is no dedicated network.
  • We will do a pilot iRODS federation with the Broad Institute first.

slide-62
SLIDE 62

Moving computing to data?

  • Moving large amounts of genomic data to computing resources is time-consuming.
  • Moving computing to data?
  • The Resource Broker (WMS) of gLite is able to allocate computing jobs based on the metadata provided by each site.
  • SRM-iRODS, developed by ASGC, becomes a very important part, gluing together the computing grid and the data management system we are using.

slide-63
SLIDE 63

Acknowledgements

Sanger Institute

  • Phil Butcher
  • Pete Clapham
  • Guy Coates
  • Kevin Sale
  • Guoying Qi

STFC

  • Jens Jensen