NEBC Database Course 2008 Biological Databases Online Web-Based - - PowerPoint PPT Presentation
NEBC Database Course 2008 Biological Databases Online Web-Based - - PowerPoint PPT Presentation
NEBC Database Course 2008 Biological Databases Online Web-Based Tools (practical) Tim Booth : tbooth@ceh.ac.uk Introduction To EnsEMBL EnsEMBL is a project to develop a software system which produces and maintains automatic annotation on
Introduction To EnsEMBL
- EnsEMBL is a project to develop a software system which
produces and maintains automatic annotation on metazoan genomes.
- EnsEMBL is based on a large MySQL database system.
- Take a look at the website:
http://www.ensembl.org/ Suggested exercise: Seek out the gene coding for the human p53 tumour suppressor protein, which is located on chromosome 17. Find the location of the predicted orthologue for the rat.
Batch Queries via Martview
- Go to http://www.ensembl.org/biomart/martview
- Using the Vega 32 Homo sapiens dataset, ask to see all genes
- n the first chromosome, within the first 100 kilobases, for
which there is a UniProt/Swissprot id.
- How many entries pass the filter? (should be 12)
- Generate an HTML report with the default settings. How many
lines do you see?
- Go back and ask to add the InterPro descriptions to the report.
Now how many lines come back?
- Can you account for the discrepancy, based on your knowledge
- f how queries might be constructed on the underlying
database?
Direct ENSEMBL Access
- In a terminal type:
mysql -h ensembldb.ensembl.org -u anonymous
...you should now be logged into EnsEMBL...
show databases; use rattus_norvegicus_core_47_34q; show tables;
- There are a lot of tables in the core database. The following
slides show some important parts.
- Don't be phased by the complexity - the point is just to
appreciate what a major database can look like. It will be discussed later why this complexity arises and how we can deal meaningfully with it.
EnsEMBL Sequence Regions
1 int
- ri
int cmp_end int cmp_start int asm_end int asm_start int cmp_seq_region_id int asm_seq_region_id
assembly
int coord_system_id int length varchar name int seq_region_id
seq_region
varchar value int attrib_type_id int seq_region_id
seq_region_attrib
varchar name text description varchar code int attrib_type_id
attrib_type
mediumtext sequence int seq_region_id
dna
mediumblob sequence text n_line int seq_region_id
dnac
“default_version”, “sequence_level” attrib varchar version int rank varchar name int coord_system_id
coord_system
0..1 0..n 0..1 0..n 1 1 0..1 0..1 1 1…n 1 0..n 1 0..n
EnsEMBL Genes and Transcripts
tinyint phase tinyint end_phase int seq_region_id int seq_region_start int seq_region_end tinyint seq_region_strand int exon_id
exon
varchar stable_id int version int exon_id
exon_stable_id
int transcript_id int rank int exon_id
exon_transcript
varchar type int analysis_id int display_xref_id int seq_region_id int seq_region_start int seq_region_end tinyint seq_region_strand int gene_id
gene
varchar stable_id int version int gene_id
gene_stable_id
int gene_id int display_xref_id int seq_region_id int seq_region_start int seq_region_end tinyint seq_region_strand int transcript_id
transcript
varchar stable_id int version int transcript_id
transcript_stable_id
varchar stable_id int version int translation_id
translation_stable_id
int transcript_id int seq_start int start_exon_id int seq_end int end_exon_id int translation_id
translation
text description int gene_id
gene_description
1 0..1 0..1 0..1 1 1 1 0..n 0..n 1 0..1 1..n 1 1 0..1 1 1 0..n 0..n 1 0..1
EnsEMBL External References
int ensembl_id “Translation”, “Gene”, “Transcript” ensembl_object_type int xref_id int
- bject_xref_id
- bject_xref
double score double evalue int analysis_id int target_identity int hit_start int hit_end int translation_start int translation_end text cigar_line int query_identity int
- bject_xref_id
identity_xref
varchar description varchar dbprimary_acc int xref_id varchar version varchar display_label int external_db_id
xref
varchar release “KNOWN”, “PRED”, “ORTH”,… status varchar dbname int external_db_id
external_db
“IC”,”IDA”,”IEA”,”IEP”,”I GI”,”IMP,”IPI”,… linkage_type int
- bject_xref_id
go_xref
varchar synonym int xref_id
external_synonym
1 1..n 1 0..n 1 0..1 1 0..1 1 1..n
Complex Database = Nasty Query
select gene.seq_region_id, gene_stable_id.stable_id, transcript_stable_id.stable_id, xref.display_label from external_db inner join xref on external_db.external_db_id = xref.external_db_id inner join object_xref on xref.xref_id = object_xref.xref_id inner join translation on object_xref.ensembl_id = translation.translation_id inner join transcript on translation.transcript_id = transcript.transcript_id inner join gene on gene.gene_id = transcript.gene_id inner join seq_region on gene.seq_region_id = seq_region.seq_region_id inner join coord_system on seq_region.coord_system_id = coord_system.coord_system_id left outer join gene_stable_id on gene_stable_id.gene_id = gene.gene_id left outer join transcript_stable_id on transcript.transcript_id = transcript_stable_id.transcript_id where coord_system.name = 'chromosome' and seq_region.name = '1' and gene.seq_region_end < 150000 and external_db.db_name = 'Uniprot/SWISSPROT';
Simplifying things
- The previous query will actually retrieve the same results as
the MartView search, though this is hardly obvious.
- Making a friendly interface is one way to simplify the use of
your database.
- We will look into these tomorrow, along with views,
denormalised tables and programming APIs.
- For now, let's look at a simpler online database...
- The GeneSwytch database is held at CEH Oxford.
- Go to http://www.genomics.ceh.ac.uk/GeneSwytch/all.php
- Try some of the standard queries, such as viewing
'Haemophilus genomes'.
- Now run each of the following:
–
describe table bac_tax_genomes
–
select * from bac_tax_genomes limit 20
–
describe table bac_rep_genomes
–
select repeat_name, motif, repeat_type from bac_rep_genomes where genome_id = 1
–
select genus, repeat_name, motif, repeat type from (bac_rep_genomes r inner join bac_tax_genomes t on t.genome_id = r.genome_id) where t.genome_id in (1,31)
A Look at The GeneSwytch Microsatellite Database
A Few More Examples
- The following are also good examples of database-powered
- sites. Unfortunately, none of them provide the same level of
access or technical documentation as EnsEMBL:
- www.hapmap.org : International human haplotyping project.
Note link to 'HapMart' search tool.
- www.plasmodb.org : Similar to EnsEMBL, but specific to
plasmodium genomes
- ftp://ftp.ncbi.nih.gov/pub/taxonomy/ : The contents of the NCBI
taxonomy database can be downloaded for local use. A suitable template database to store this data can be found at http://www.biosql.org
Credits and Notes
- Technical documentation on EnsEMBL can be located here:
http://www.ensembl.org/Docs/wiki/html/EnsemblDocs/EnsemblCore.html
- The baculovirus microsatellite database was created by Milo
Thurston, Paul Swift and Dawn Field.
- Thanks to Man-Suen Chan for website suggestions.