NEBC Database Course 2008 Biological Databases Online Web-Based - - PowerPoint PPT Presentation

nebc database course 2008 biological databases online web
SMART_READER_LITE
LIVE PREVIEW

NEBC Database Course 2008 Biological Databases Online Web-Based - - PowerPoint PPT Presentation

NEBC Database Course 2008 Biological Databases Online Web-Based Tools (practical) Tim Booth : tbooth@ceh.ac.uk Introduction To EnsEMBL EnsEMBL is a project to develop a software system which produces and maintains automatic annotation on


slide-1
SLIDE 1

NEBC Database Course 2008 Biological Databases Online Web-Based Tools (practical)

Tim Booth : tbooth@ceh.ac.uk

slide-2
SLIDE 2

Introduction To EnsEMBL

  • EnsEMBL is a project to develop a software system which

produces and maintains automatic annotation on metazoan genomes.

  • EnsEMBL is based on a large MySQL database system.
  • Take a look at the website:

http://www.ensembl.org/ Suggested exercise: Seek out the gene coding for the human p53 tumour suppressor protein, which is located on chromosome 17. Find the location of the predicted orthologue for the rat.

slide-3
SLIDE 3

Batch Queries via Martview

  • Go to http://www.ensembl.org/biomart/martview
  • Using the Vega 32 Homo sapiens dataset, ask to see all genes
  • n the first chromosome, within the first 100 kilobases, for

which there is a UniProt/Swissprot id.

  • How many entries pass the filter? (should be 12)
  • Generate an HTML report with the default settings. How many

lines do you see?

  • Go back and ask to add the InterPro descriptions to the report.

Now how many lines come back?

  • Can you account for the discrepancy, based on your knowledge
  • f how queries might be constructed on the underlying

database?

slide-4
SLIDE 4

Direct ENSEMBL Access

  • In a terminal type:

mysql -h ensembldb.ensembl.org -u anonymous

...you should now be logged into EnsEMBL...

show databases; use rattus_norvegicus_core_47_34q; show tables;

  • There are a lot of tables in the core database. The following

slides show some important parts.

  • Don't be phased by the complexity - the point is just to

appreciate what a major database can look like. It will be discussed later why this complexity arises and how we can deal meaningfully with it.

slide-5
SLIDE 5

EnsEMBL Sequence Regions

1 int

  • ri

int cmp_end int cmp_start int asm_end int asm_start int cmp_seq_region_id int asm_seq_region_id

assembly

int coord_system_id int length varchar name int seq_region_id

seq_region

varchar value int attrib_type_id int seq_region_id

seq_region_attrib

varchar name text description varchar code int attrib_type_id

attrib_type

mediumtext sequence int seq_region_id

dna

mediumblob sequence text n_line int seq_region_id

dnac

“default_version”, “sequence_level” attrib varchar version int rank varchar name int coord_system_id

coord_system

0..1 0..n 0..1 0..n 1 1 0..1 0..1 1 1…n 1 0..n 1 0..n

slide-6
SLIDE 6

EnsEMBL Genes and Transcripts

tinyint phase tinyint end_phase int seq_region_id int seq_region_start int seq_region_end tinyint seq_region_strand int exon_id

exon

varchar stable_id int version int exon_id

exon_stable_id

int transcript_id int rank int exon_id

exon_transcript

varchar type int analysis_id int display_xref_id int seq_region_id int seq_region_start int seq_region_end tinyint seq_region_strand int gene_id

gene

varchar stable_id int version int gene_id

gene_stable_id

int gene_id int display_xref_id int seq_region_id int seq_region_start int seq_region_end tinyint seq_region_strand int transcript_id

transcript

varchar stable_id int version int transcript_id

transcript_stable_id

varchar stable_id int version int translation_id

translation_stable_id

int transcript_id int seq_start int start_exon_id int seq_end int end_exon_id int translation_id

translation

text description int gene_id

gene_description

1 0..1 0..1 0..1 1 1 1 0..n 0..n 1 0..1 1..n 1 1 0..1 1 1 0..n 0..n 1 0..1

slide-7
SLIDE 7

EnsEMBL External References

int ensembl_id “Translation”, “Gene”, “Transcript” ensembl_object_type int xref_id int

  • bject_xref_id
  • bject_xref

double score double evalue int analysis_id int target_identity int hit_start int hit_end int translation_start int translation_end text cigar_line int query_identity int

  • bject_xref_id

identity_xref

varchar description varchar dbprimary_acc int xref_id varchar version varchar display_label int external_db_id

xref

varchar release “KNOWN”, “PRED”, “ORTH”,… status varchar dbname int external_db_id

external_db

“IC”,”IDA”,”IEA”,”IEP”,”I GI”,”IMP,”IPI”,… linkage_type int

  • bject_xref_id

go_xref

varchar synonym int xref_id

external_synonym

1 1..n 1 0..n 1 0..1 1 0..1 1 1..n

slide-8
SLIDE 8

Complex Database = Nasty Query

select gene.seq_region_id, gene_stable_id.stable_id, transcript_stable_id.stable_id, xref.display_label from external_db inner join xref on external_db.external_db_id = xref.external_db_id inner join object_xref on xref.xref_id = object_xref.xref_id inner join translation on object_xref.ensembl_id = translation.translation_id inner join transcript on translation.transcript_id = transcript.transcript_id inner join gene on gene.gene_id = transcript.gene_id inner join seq_region on gene.seq_region_id = seq_region.seq_region_id inner join coord_system on seq_region.coord_system_id = coord_system.coord_system_id left outer join gene_stable_id on gene_stable_id.gene_id = gene.gene_id left outer join transcript_stable_id on transcript.transcript_id = transcript_stable_id.transcript_id where coord_system.name = 'chromosome' and seq_region.name = '1' and gene.seq_region_end < 150000 and external_db.db_name = 'Uniprot/SWISSPROT';

slide-9
SLIDE 9

Simplifying things

  • The previous query will actually retrieve the same results as

the MartView search, though this is hardly obvious.

  • Making a friendly interface is one way to simplify the use of

your database.

  • We will look into these tomorrow, along with views,

denormalised tables and programming APIs.

  • For now, let's look at a simpler online database...
slide-10
SLIDE 10
  • The GeneSwytch database is held at CEH Oxford.
  • Go to http://www.genomics.ceh.ac.uk/GeneSwytch/all.php
  • Try some of the standard queries, such as viewing

'Haemophilus genomes'.

  • Now run each of the following:

describe table bac_tax_genomes

select * from bac_tax_genomes limit 20

describe table bac_rep_genomes

select repeat_name, motif, repeat_type from bac_rep_genomes where genome_id = 1

select genus, repeat_name, motif, repeat type from (bac_rep_genomes r inner join bac_tax_genomes t on t.genome_id = r.genome_id) where t.genome_id in (1,31)

A Look at The GeneSwytch Microsatellite Database

slide-11
SLIDE 11

A Few More Examples

  • The following are also good examples of database-powered
  • sites. Unfortunately, none of them provide the same level of

access or technical documentation as EnsEMBL:

  • www.hapmap.org : International human haplotyping project.

Note link to 'HapMart' search tool.

  • www.plasmodb.org : Similar to EnsEMBL, but specific to

plasmodium genomes

  • ftp://ftp.ncbi.nih.gov/pub/taxonomy/ : The contents of the NCBI

taxonomy database can be downloaded for local use. A suitable template database to store this data can be found at http://www.biosql.org

slide-12
SLIDE 12

Credits and Notes

  • Technical documentation on EnsEMBL can be located here:

http://www.ensembl.org/Docs/wiki/html/EnsemblDocs/EnsemblCore.html

  • The baculovirus microsatellite database was created by Milo

Thurston, Paul Swift and Dawn Field.

  • Thanks to Man-Suen Chan for website suggestions.