BioMart Data integration in four easy steps Arek Kasprzyk European - - PowerPoint PPT Presentation




slide-1
SLIDE 1

Arek Kasprzyk European Bioinformatics Institute 22 July 2006

BioMart

Data integration in four easy steps

slide-2
SLIDE 2

BioMart

  • A joint project

– European Bioinformatics Institute (EBI)
– Cold Spring Harbor Laboratory (CSHL)

  • Funding

– Wellcome Trust
– European Commission
– NIH

slide-3
SLIDE 3

Synopsis

  • Higher level data management system

– Data mining type access to descriptive data
– Query optimization
– Data federation
– Meta data support

slide-4
SLIDE 4

BioMart

Diagram labels: Source data; 1 Transformation; 2 Configuration (XML files); BioMart software; 3 Querying

slide-5
SLIDE 5

Transformation and Configuration Tools

slide-6
SLIDE 6

Query interfaces

slide-7
SLIDE 7

Programmatic access

  • APIs

– Perl (biomart-plib)
– Java (martj)
– R (biomaRt)

  • Web service
slide-8
SLIDE 8

Data federation

Diagram labels: MySQL, Oracle and PostgreSQL databases, each with XML configuration files, joined through a REGISTRY

slide-9
SLIDE 9

Dataset, Attribute and Filter

Table GENE:

gene_id (PK), gene_stable_id, gene_start, gene_chrom_end, chromosome, gene_display_id, description

Diagram labels: Mart, Dataset, Attribute, Filter
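
One way to read this slide: a dataset corresponds to a main table such as GENE, an attribute is a column you ask to see, and a filter is a restriction on a column. The following is a minimal Python sketch of that reading, not BioMart code. It assumes an in-memory SQLite table with the column names from the slide; the helper `run_query` and the sample rows are hypothetical.

```python
import sqlite3

# Column names of the GENE table as listed on the slide.
GENE_COLUMNS = ["gene_id", "gene_stable_id", "gene_start", "gene_chrom_end",
                "chromosome", "gene_display_id", "description"]

def run_query(conn, attributes, filters):
    """Hypothetical helper: attributes become SELECT columns,
    filters become equality conditions in the WHERE clause."""
    for name in list(attributes) + list(filters):
        if name not in GENE_COLUMNS:
            raise ValueError(f"unknown attribute or filter: {name}")
    sql = "SELECT " + ", ".join(attributes) + " FROM gene"
    if filters:
        sql += " WHERE " + " AND ".join(f"{name} = ?" for name in filters)
    return conn.execute(sql, list(filters.values())).fetchall()

# Build a toy dataset; every row is made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, gene_stable_id TEXT, "
    "gene_start INTEGER, gene_chrom_end INTEGER, chromosome TEXT, "
    "gene_display_id TEXT, description TEXT)"
)
conn.executemany("INSERT INTO gene VALUES (?, ?, ?, ?, ?, ?, ?)", [
    (1, "GENE0001", 100, 900, "22", "abc1", "made-up example gene"),
    (2, "GENE0002", 500, 1500, "1", "xyz2", "another made-up gene"),
])

# Attributes: which columns to return. Filter: restrict to chromosome 22.
rows = run_query(conn, ["gene_stable_id", "gene_start"], {"chromosome": "22"})
print(rows)
```

Checking names against a known column list keeps the sketch from building SQL out of arbitrary strings; a real system would take these names from its configuration.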

slide-10
SLIDE 10

Joining two datasets

Links

Dataset 1: Exportable
  name = uniprot_id
  attributes = uniprot_ac

Dataset 2: Importable
  name = uniprot_id
  filters = uniprot_ac
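
The link on this slide can be sketched as follows: values of the exported attribute from Dataset 1 are fed into the imported filter of Dataset 2, and the two ends are matched by their shared link name. This is an illustrative Python sketch under that assumption, not the BioMart implementation; the function `join_datasets` and all rows are hypothetical.

```python
# Two toy datasets; all rows are made up for illustration.
genes = [      # Dataset 1
    {"gene_stable_id": "GENE0001", "uniprot_ac": "P00001"},
    {"gene_stable_id": "GENE0002", "uniprot_ac": "P00002"},
    {"gene_stable_id": "GENE0003", "uniprot_ac": None},
]
proteins = [   # Dataset 2
    {"uniprot_ac": "P00001", "pfam": "PF00001"},
    {"uniprot_ac": "P00003", "pfam": "PF00002"},
]

# Link definitions mirroring the slide: both ends share the name uniprot_id.
exportable = {"name": "uniprot_id", "attributes": ["uniprot_ac"]}
importable = {"name": "uniprot_id", "filters": ["uniprot_ac"]}

def join_datasets(first, second, exportable, importable):
    """Hypothetical helper: join rows where the exported attribute values
    of the first dataset equal the imported filter values of the second."""
    if exportable["name"] != importable["name"]:
        raise ValueError("exportable and importable names must match")
    # Index rows of the first dataset by their exported values.
    exported = {}
    for row in first:
        key = tuple(row[col] for col in exportable["attributes"])
        if None not in key:  # rows without a value cannot be linked
            exported.setdefault(key, []).append(row)
    # Use the exported values as a filter on the second dataset.
    joined = []
    for row in second:
        key = tuple(row[col] for col in importable["filters"])
        for left in exported.get(key, []):
            joined.append({**left, **row})
    return joined

linked = join_datasets(genes, proteins, exportable, importable)
print(linked)
```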

slide-11
SLIDE 11

Dataset linking

slide-12
SLIDE 12

Third party software

slide-13
SLIDE 13

Ensembl

slide-14
SLIDE 14

GMOD

slide-15
SLIDE 15

biomaRt

slide-16
SLIDE 16

Distributed Annotation System

slide-17
SLIDE 17

Taverna

slide-18
SLIDE 18

Galaxy

slide-19
SLIDE 19

Examples

slide-20
SLIDE 20

Genomic data

slide-21
SLIDE 21

Uniprot, MSD, ArrayExpress

Proteomic, structure, expression

slide-22
SLIDE 22

Model organism databases

Genes, Expression, Phenotypes, Variations, Literature, Ontologies, Sequence

slide-23
SLIDE 23

Zebrafish models for human development and disease

slide-24
SLIDE 24

Central Server

slide-25
SLIDE 25

Behind closed doors ; )

slide-26
SLIDE 26

Target SNP selection for the study of one autoimmune disease, type 1 diabetes (T1D), and infectious diseases, malaria and dengue

Laboratory of Genetics of Infectious and Autoimmune Diseases

slide-27
SLIDE 27

Name   Fragment   Position   Alleles   Strand
SNP1   AL139258   1659852    T/A       1
SNP2   NT_25698   2569873    C/T       -1
SNP3   chr13      1125698    C/G       1

Data conversion and integration: Ensembl, HapMap, NCBI, UCSC, proprietary data, Diabetes-Gene Association DataBase

Combined proprietary and public data

Genetics of Infectious and Autoimmune Diseases, Pasteur Institute, INSERM U730, Paris, France.

slide-28
SLIDE 28

Output format:

  • Genome location
  • Links to databases
  • Overlaps with TFBS
  • Location + predicted functional role

Ensembl (dbSNP), Ensembl, Vega, RefSeq, Acembly

Genetics of Infectious and Autoimmune Diseases, Pasteur Institute, INSERM U730, Paris, France.

slide-29
SLIDE 29

Using the Molecular Integration Database to Answer CAPRISA’s Questions

Research that contributes to understanding HIV pathogenesis and epidemiology as well as HIV/AIDS treatment and prevention

slide-30
SLIDE 30

How is the MID populated?

Clinical Data

MID

Cellular Immunity, Humoral Immunity, HLA Typing, Sequence & Sequence Related

Pipeline

slide-31
SLIDE 31

CAPRISA

slide-32
SLIDE 32

What role for ‘Omics’?

  • Human study to evaluate Omics in assessing safety indicators
  • Study of skin inflammation in response to detergent
  • Skin samples taken and analyzed with multiple Omics techniques.

– Blood
– Skin biopsy
– Microdialysis

slide-33
SLIDE 33

System Data Flow

  • Oracle 9i database used for staging area and BioMarts
  • Database indexes files on a separate file system
  • Requires an extensible file and metadata management system for omics data

Diagram labels: Analysis files, Data files, Generic CSV files; parsing; import; Coral staging area; transformation; BioMarts; Import Interface; download; BioMart Interface; Oracle 9i database

slide-34
SLIDE 34

Adding Annotation

  • Query Ensembl for details of genes measured or identified in experiments, e.g. GeneSpring Annotation
  • For example, we can link to Ensembl from Microarray Experiments by Gene ID

Diagram labels: Ensembl Mart, Microarray Mart, link on Entrez gene id
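
The link described on this slide amounts to a lookup keyed on the Entrez gene id: each microarray measurement picks up annotation from the matching Ensembl record. A minimal Python sketch of that step, assuming both marts expose an `entrez_gene_id` field; the function `annotate`, the other field names and all rows are hypothetical and do not come from the slides.

```python
# Made-up rows standing in for a microarray mart and an Ensembl mart.
microarray_mart = [
    {"probe": "probe_a", "entrez_gene_id": 101, "fold_change": 2.5},
    {"probe": "probe_b", "entrez_gene_id": 202, "fold_change": 0.4},
    {"probe": "probe_c", "entrez_gene_id": 303, "fold_change": 1.1},
]
ensembl_mart = [
    {"entrez_gene_id": 101, "gene_stable_id": "GENE0001",
     "description": "example gene one"},
    {"entrez_gene_id": 202, "gene_stable_id": "GENE0002",
     "description": "example gene two"},
]

def annotate(measurements, annotation, key="entrez_gene_id"):
    """Hypothetical helper: attach annotation fields to each measurement
    by matching on the shared key. Unmatched measurements are kept,
    with None in the annotation fields."""
    lookup = {row[key]: row for row in annotation}
    annotated = []
    for measurement in measurements:
        extra = lookup.get(measurement[key], {})
        annotated.append({
            **measurement,
            "gene_stable_id": extra.get("gene_stable_id"),
            "description": extra.get("description"),
        })
    return annotated

annotated_rows = annotate(microarray_mart, ensembl_mart)
for row in annotated_rows:
    print(row["probe"], row["gene_stable_id"], row["description"])
```

Keeping unmatched probes, rather than dropping them, makes gaps in the annotation visible in the output.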

slide-35
SLIDE 35

Four easy(?) steps

slide-36
SLIDE 36

Step 1

Transformation

slide-37
SLIDE 37

Step 2

Configuration

slide-38
SLIDE 38

Step 3

Query

slide-39
SLIDE 39

User interfaces

slide-40
SLIDE 40

Web service

<Query virtualSchemaName="default" count="0">
  <Dataset name="hsapiens_gene_ensembl">
    <Attribute name="gene_stable_id"/>
    <Filter name="chr_name" value="22"/>
  </Dataset>
  <Dataset name="uniprot">
    <Attribute name="accession"/>
    <Filter name="pfam" value="only"/>
  </Dataset>
</Query>
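
A query document of this shape can also be generated from a program instead of being written by hand. The sketch below builds the same structure with Python's standard library and URL-encodes it as a form field. The helper `build_query` is hypothetical, and the form field name `query` is an assumption about how a mart service receives the document; no request is sent.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

def build_query(datasets, virtual_schema="default", count="0"):
    """Hypothetical helper: build a Query document from a list of
    (dataset_name, {"attributes": [...], "filters": {...}}) pairs."""
    query = ET.Element("Query",
                       {"virtualSchemaName": virtual_schema, "count": count})
    for name, spec in datasets:
        dataset = ET.SubElement(query, "Dataset", {"name": name})
        for attribute in spec.get("attributes", []):
            ET.SubElement(dataset, "Attribute", {"name": attribute})
        for filter_name, value in spec.get("filters", {}).items():
            ET.SubElement(dataset, "Filter",
                          {"name": filter_name, "value": value})
    return ET.tostring(query, encoding="unicode")

# Same two-dataset query as on the slide.
xml_query = build_query([
    ("hsapiens_gene_ensembl",
     {"attributes": ["gene_stable_id"], "filters": {"chr_name": "22"}}),
    ("uniprot",
     {"attributes": ["accession"], "filters": {"pfam": "only"}}),
])

# Assumed transport: the XML travels as a URL-encoded form field.
form_body = urlencode({"query": xml_query})
print(xml_query)
```

Building the document through an XML library rather than string concatenation guarantees matching quotes and closed tags, which is exactly where hand-typed queries tend to break.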

slide-41
SLIDE 41

API

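use BioMart::Initializer;
use BioMart::Query;
use BioMart::QueryRunner;

# $confFile must hold the path to your registry file
my $initializer = BioMart::Initializer->new('registryFile'=>$confFile);
my $registry = $initializer->getRegistry();
$registry->configure();

# The slide does not show how $query is created; this constructor call is an assumption
my $query = BioMart::Query->new('registry'=>$registry, 'virtualSchemaName'=>'default');

# Attributes and filters for the two linked datasets
$query->addAttribute('hsapiens_gene_ensembl','ensembl_gene_id');
$query->addFilter('hsapiens_gene_ensembl','chromosome_name',['1']);
$query->addAttribute('uniprot','accession');
$query->addFilter('uniprot','chromosome_name',['1']);
$query->formatter('HTML');

# Run the query and print the results
my $runner = BioMart::QueryRunner->new();
$runner->execute($query);
$runner->printResults();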

slide-42
SLIDE 42

Step 4

Ask for a pay rise : )

slide-43
SLIDE 43

Summary

  • A generic data management system
  • Provides building blocks for designing your own ‘tailor-made’ data management

– A set of easily configurable user interfaces
– Distributed data federation
– Query optimization

  • Easy to install and manage

– A project for bioinformatics students

  • Open source software.

– No restrictions for academics or commercial users

slide-44
SLIDE 44

Credits

  • BioMart

– Syed Haider
– Richard Holland
– Damian Smedley
– Gudmundur Thorisson

  • Contributors

– Steffen Durinck (NCI, NIH)
– Eric Just (Northwestern University)
– Don Gilbert (Indiana University)
– Darin London (Duke University)
– Will Spooner (CSHL)
– Benoit Ballester (Universite de la Mediterranee)
– James Smith (Ensembl)
– Arne Stabenau (Ensembl)
– Andreas Kahari (Ensembl)
– Craig Melsopp (Ensembl)
– Katerina Tzouvara (EBI)
– Paul Donlon (Unilever)