EGTDC Database Course 2004 Biological Databases Online (practical) - - PowerPoint PPT Presentation

egtdc database course 2004 biological databases online
SMART_READER_LITE
LIVE PREVIEW

EGTDC Database Course 2004 Biological Databases Online (practical) - - PowerPoint PPT Presentation

EGTDC Database Course 2004 Biological Databases Online (practical) Tim Booth : tbooth@ceh.ac.uk Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk Introduction To EnsEMBL EnsEMBL is a project to develop a software


slide-1
SLIDE 1

Environmental Genomics Thematic Programme Data Centre

http://envgen.nox.ac.uk

EGTDC Database Course 2004 Biological Databases Online (practical)

Tim Booth : tbooth@ceh.ac.uk

slide-2
SLIDE 2

Environmental Genomics Thematic Programme Data Centre

http://envgen.nox.ac.uk

Introduction To EnsEMBL

  • EnsEMBL is a project to develop a software system which produces and

maintains automatic annotation on metazoan genomes.

  • EnsEMBL is based on a large MySQL database system.
  • Take a look at the website:

http://www.ensembl.org/ Suggested exercise: Seek out the gene coding for the human p53 tumour suppressor protein, which is located on chromosome 17. Find the location of the predicted orthologue for the rat.

slide-3
SLIDE 3

Environmental Genomics Thematic Programme Data Centre

http://envgen.nox.ac.uk

Batch Queries via EnsMart

  • Go to http://www.ensembl.org/Multi/martview
  • Ask to see all genes on the first chromosome, within the first 30

megabases, for which there is a SwissProt id.

  • How many entries pass the filter? (should be 13)
  • Generate an HTML report with the default settings. How many lines do

you see?

  • Go back and ask to add the InterPro descriptions to the report. Now how

many lines come back?

  • Can you account for the discrepancy, based on your knowledge of how

queries might be constructed on the underlying database?

slide-4
SLIDE 4

Environmental Genomics Thematic Programme Data Centre

http://envgen.nox.ac.uk

Direct ENSEMBL Access

  • In a terminal type:

mysql -h ensembldb.ensembl.org -u anonymous

...you should now be logged into EnsEMBL...

show databases; use rattus_norvegicus_core_20_3b; show tables;

  • There are a lot of tables in the core database. The following slides show

some important parts.

  • Don't be phased by the complexity - the point is just to appreciate what a

major database can look like. It will be discussed later why this complexity arises and how we can deal meaningfully with it.

slide-5
SLIDE 5

Environmental Genomics Thematic Programme Data Centre

http://envgen.nox.ac.uk

EnsEMBL Sequence Regions

1 int

  • ri

int cmp_end int cmp_start int asm_end int asm_start int cmp_seq_region_id int asm_seq_region_id

assembly

int coord_system_id int length varchar name int seq_region_id

seq_region

varchar value int attrib_type_id int seq_region_id

seq_region_attrib

varchar name text description varchar code int attrib_type_id

attrib_type

mediumtext sequence int seq_region_id

dna

mediumblob sequence text n_line int seq_region_id

dnac

“default_version”, “sequence_level” attrib varchar version int rank varchar name int coord_system_id

coord_system

0..1 0..n 0..1 0..n 1 1 0..1 0..1 1 1…n 1 0..n 1 0..n

slide-6
SLIDE 6

Environmental Genomics Thematic Programme Data Centre

http://envgen.nox.ac.uk

EnsEMBL Genes and Transcripts

tinyint phase tinyint end_phase int seq_region_id int seq_region_start int seq_region_end tinyint seq_region_strand int exon_id

exon

varchar stable_id int version int exon_id

exon_stable_id

int transcript_id int rank int exon_id

exon_transcript

varchar type int analysis_id int display_xref_id int seq_region_id int seq_region_start int seq_region_end tinyint seq_region_strand int gene_id

gene

varchar stable_id int version int gene_id

gene_stable_id

int gene_id int display_xref_id int seq_region_id int seq_region_start int seq_region_end tinyint seq_region_strand int transcript_id

transcript

varchar stable_id int version int transcript_id

transcript_stable_id

varchar stable_id int version int translation_id

translation_stable_id

int transcript_id int seq_start int start_exon_id int seq_end int end_exon_id int translation_id

translation

text description int gene_id

gene_description

1 0..1 0..1 0..1 1 1 1 0..n 0..n 1 0..1 1..n 1 1 0..1 1 1 0..n 0..n 1 0..1

slide-7
SLIDE 7

Environmental Genomics Thematic Programme Data Centre

http://envgen.nox.ac.uk

EnsEMBL External References

int ensembl_id “Translation”, “Gene”, “Transcript” ensembl_object_type int xref_id int

  • bject_xref_id
  • bject_xref

double score double evalue int analysis_id int target_identity int hit_start int hit_end int translation_start int translation_end text cigar_line int query_identity int

  • bject_xref_id

identity_xref

varchar description varchar dbprimary_acc int xref_id varchar version varchar display_label int external_db_id

xref

varchar release “KNOWN”, “PRED”, “ORTH”,… status varchar dbname int external_db_id

external_db

“IC”,”IDA”,”IEA”,”IEP”,”I GI”,”IMP,”IPI”,… linkage_type int

  • bject_xref_id

go_xref

varchar synonym int xref_id

external_synonym

1 1..n 1 0..n 1 0..1 1 0..1 1 1..n

slide-8
SLIDE 8

Environmental Genomics Thematic Programme Data Centre

http://envgen.nox.ac.uk

Complex Database -> Nasty Query

select gene.seq_region_id, gene_stable_id.stable_id, transcript_stable_id.stable_id, xref.display_label from external_db inner join xref on external_db.external_db_id = xref.external_db_id inner join object_xref on xref.xref_id = object_xref.xref_id inner join translation on object_xref.ensembl_id = translation.translation_id inner join transcript on translation.transcript_id = transcript.transcript_id inner join gene on gene.gene_id = transcript.gene_id inner join seq_region on gene.seq_region_id = seq_region.seq_region_id inner join coord_system on seq_region.coord_system_id = coord_system.coord_system_id left outer join gene_stable_id on gene_stable_id.gene_id = gene.gene_id left outer join transcript_stable_id on transcript.transcript_id = transcript_stable_id.transcript_id where coord_system.name = 'chromosome' and seq_region.name = '1' and gene.seq_region_end < 30000000 and external_db.db_name = 'SWISSPROT';

slide-9
SLIDE 9

Environmental Genomics Thematic Programme Data Centre

http://envgen.nox.ac.uk

Simplifying things

  • The previous query will actually retrieve the same results as the

EnsMART search, though this is hardly obvious.

  • Making a friendly interface is one way to simplify the use of your

database.

  • We will look into these tomorrow, along with views, denormalised tables

and programming APIs.

  • For now, let's look at a simpler online database...
slide-10
SLIDE 10

Environmental Genomics Thematic Programme Data Centre

http://envgen.nox.ac.uk

  • This database was created by the Msatminer software, and is held at CEH

Oxford.

  • Go to http://www.genomics.ceh.ac.uk/cgi-bin/baculo/msatviewer.cgi
  • Try some of the standard queries, such as viewing 'all tris'.
  • Now run each of the following:

show tables

select * from baculo_genomes

describe repeats

select repeat_name, motif, genome from repeats where repeat_type = 'penta'

select organism, genome, count(*) as count from repeats where repeat_type = 'penta'

A Look at The Baculovirus Microsatellite Database

slide-11
SLIDE 11

Environmental Genomics Thematic Programme Data Centre

http://envgen.nox.ac.uk

A Few More Examples

  • The following are also good examples of database-powered sites.

Unfortunately, none of them provide the same level of access or technical documentation as EnsEMBL:

  • www.hapmap.org : International human haplotyping project
  • www.plasmodb.org : Similar to EnsEMBL, but specific to plasmodium

genomes

  • ftp://ftp.ncbi.nih.gov/pub/taxonomy/ : The contents of the NCBI

taxonomy database can be downloaded for local use.

slide-12
SLIDE 12

Environmental Genomics Thematic Programme Data Centre

http://envgen.nox.ac.uk

Credits and Notes

  • Technical documentation on ENSEMBL can be located here:

http://www.ensembl.org/Docs/wiki/html/EnsemblDocs/EnsemblCore.html

  • The baculovirus microsatellite database was created by Milo Thurston

and Dawn Field.

  • Thanks to Man-Suen Chan for website suggestions.