Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
Implementing a Relational Database Joe Wood anjw@ceh.ac.uk - - PowerPoint PPT Presentation
Implementing a Relational Database Joe Wood anjw@ceh.ac.uk - - PowerPoint PPT Presentation
Implementing a Relational Database Joe Wood anjw@ceh.ac.uk Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk Implementing a relational database establish requirements Data Requirements data analysis
Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
Implementing a relational database
establish requirements data analysis Conceptual Data Model database design Logical Schema Specification implement
schema and database
Data Requirements
Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
Our design so far:
relation PublicDatabase DatabaseName: String DataType: String URL: String primary key: DatabaseName relation Sequence AccessionNumber: String ID: String primary key: AccessionNumber foreign key: SourceDatabase references PublicDatabase foreign key: SourceOrganism references Organism relation Organism OrganismNo: Integer Species: String Strain: String GenomeSeq: Boolean CommonName: String primary key: OrganismNo relation Feature FeatID: String Name: String primary key: FeatID foreign key: SourceSequence references Sequence
Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
Implementing our database
- Most of hard work is already done
- Create relations using SQL
- Define the data types for our columns
- Define primary and foreign keys
- Add constraints
- Add any appropriate default values
Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
Our design so far:
relation PublicDatabase DatabaseName: String DataType: String URL: String primary key: DatabaseName
Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
Create Table
relation PublicDatabase DatabaseName: String DataType: String URL: String primary key: DatabaseName
CREATE TABLE publicdatabase ( databasename datatype url );
Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
Add data types
Reminder:
- Numerical
- integer,float,numerical
- String/Text
- varchar,text
- Date/Time
- timestamp,date
- Boolean
Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
Add data types
relation PublicDatabase DatabaseName: String DataType: String URL: String primary key: DatabaseName
CREATE TABLE publicdatabase ( databasename varchar(50), datatype varchar(20), url varchar(200) );
Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
Primary Keys
relation PublicDatabase DatabaseName: String DataType: String URL: String primary key: DatabaseName
CREATE TABLE publicdatabase ( databasename varchar(50), datatype varchar(20), url varchar(200), primary key (databasename) );
Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
Foreign Keys
relation Sequence AccessionNumber: String ID: String primary key: AccessionNumber foreign key: SourceDatabase references PublicDatabase foreign key: SourceOrganism references Organism
CREATE TABLE sequence ( accessionnumber varchar(50), id varchar(50), sourcedatabase varchar(30), sourceorganism integer, primary key (accessionnumber), foreign key (sourcedatabase) references publicdatabase(databasename), foreign key (sourceorganism) references organism(organismnumber) );
Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
Foreign Keys
relation Sequence AccessionNumber: String ID: String primary key: AccessionNumber foreign key: SourceDatabase references PublicDatabase foreign key: SourceOrganism references Organism
CREATE TABLE sequence ( accessionnumber varchar(50), id varchar(50), sourcedatabase varchar(30), sourceorganism integer, primary key (accessionnumber), foreign key (sourcedatabase) references publicdatabase, foreign key (sourceorganism) references organism );
Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
feature table
CREATE TABLE feature ( featid varchar(50), name varchar(100), sourcesequence varchar(50), primary key (featid), foreign key (sourcesequence) references sequence(accessionnumber) );
Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
- rganism table
CREATE TABLE organism (
- rganismnumber integer
species varchar(100), strain varchar(100), genomeseq boolean, commonname varchar(100), primary key (organismnumber) );
Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
Constraints
- Constraints restrict the values that can be inserted or
updated in columns
- Types of constraints
- NOT NULL
- UNIQUE
- Simply add to column definition
url varchar(100) NOT NULL
- r
url varchar(100) UNIQUE
Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
Constraints
- NOT NULL and UNIQUE implicit when primary key
- CHECK constraint
numberoflegs integer check (numberoflegs>2)
Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
publicdatabase table
CREATE TABLE publicdatabase ( databasename varchar(50), datatype varchar(20), url varchar(200) UNIQUE, primary key (databasename) );
Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
Constraints
- Foreign key constraints
- To keep links between tables working you need to
preserve the matching values – referential integrity
- These constraints automatically set up for you by
RDBMS when you declare the primary and foreign keys
- This will prevent you from deleting a record with a
primary key before you have deleted all the child foreign key records
Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
Constraints
- Example from 'BigHit' database
Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
Constraints
Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
- rganism table
CREATE TABLE organism (
- rganismnumber integer,
species varchar(100), strain varchar(100), genomeseq boolean, commonname varchar(100), primary key (organismnumber) );
How shall we create the unique primary key values for
- rganismnumber?
Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
Sequences
- Could do complicated checks to work out the last
number entered, add 1 and insert for each new record
- Sequence is a database object in PostgreSQL which
essentially an automatically incrementing numeric value (equivalent to 'autonumber' in Access)
CREATE SEQUENCE 'my_seq' (can specify increment,min and max) SELECT NEXTVAL('my_seq')
Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
Default values
- Still don't want to have to select the value each time
- Can set a default value for column which is
automatically filled in every time a record is inserted
CREATE TABLE organism (
- rganismnumber integer DEFAULT NEXTVAL('my_seq'),
species varchar(100), strain varchar(100), genomeseq boolean, commonname varchar(100), primary key (organismnumber) );
Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
Create your database
- To create your database run your SQL table and other
- bject creation statements in a single script
- Example - demodatabase.sql
- Be sure create tables in the right order
- Can't create table that refers to a primary key in a
table that doesn't exist yet
- You also need data...
Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
Populate your database
- Insert data using INSERT sql statements
INSERT INTO organism (species,strain,genomeseq,commonname) VALUES ('Oryctolagus cuniculus',NULL,'false','rabbit');
- Default values will inserted automatically
CREATE TABLE organism (
- rganismnumber integer DEFAULT NEXTVAL('my_seq'),
...
Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
Populate your database
- Be sure to insert data in correct order
- Don't try and insert a foreign key value when the
primary key value hasn't been inserted yet
- Run the demodatabase.sql script
Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
Querying your database
- Now that your database is set up and data has been
inserted we can query it
Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
Rabbit OXPKA
Rabbit phosphorylase
Nucleotide EMBL
TRIC_RABIT
Troponin I
protein Swissprot
rabbit PHS2_RABIT
Glycogen phosphorylase
Proten Swissprot
rabit 1ABB
Glycogen Phosphorylase
protein_structure
PDB
Rabbit Q8MJF7
pol protein
protein TrEMBL
Rabbit CK829726
Nucleotide dbEST
rabbit OCPHOS2
rabbit muscle phosphorylase mrna
Nucleotide EMBL
Trt3_rabit
Troponin T
Protein UniProt
Rabbit KPB1_Rabit
Phosphorylase B kinase alpha regulatory chain
Protein Swissprot
Organism Accession Number ID Sequence Type Database Name
Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
Querying your database
Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
Querying your database
Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
Querying your database
Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
Querying your database
- What have we gained?
- No data redundancy
- Data is consistent
- Enforced quality control – no missing data
- Only have change data once
- Flexibility to run a variety of queries
Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
Views
- Views are queries that are saved in the database as
- bjects
- Appear much like a table which can be queried in the
same way
- Good if underlying query is very complex
CREATE VIEW viewname AS query
Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
Views
Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
Views
Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
Indices
- Searching data is slow
- Indices make this searching faster
- Implicit indices are set up for primary keys as these are
used a lot for searching data
Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
Indices
- An index can be created on any column
CREATE INDEX orgname_idx on organism (commonname)
- An index is helpful on a column that is regularly
searched on (i.e. Used in the WHERE clause)
Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
Normalisation
- Normalisation is the process of reducing data
redundancy in your database design
- Denormalisation increases data redundancy but is
sometimes necessary to increase the usability of a database
- Example – ensembl
Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
Denormalisation
select gene.seq_region_id, gene_stable_id.stable_id, transcript_stable_id.stable_id, xref.display_label from external_db inner join xref on external_db.external_db_id = xref.external_db_id inner join object_xref on xref.xref_id = object_xref.xref_id inner join translation on object_xref.ensembl_id = translation.translation_id inner join transcript on translation.transcript_id = transcript.transcript_id inner join gene on gene.gene_id = transcript.gene_id inner join seq_region on gene.seq_region_id = seq_region.seq_region_id inner join coord_system on seq_region.coord_system_id = coord_system.coord_system_id left outer join gene_stable_id on gene_stable_id.gene_id = gene.gene_id left outer join transcript_stable_id on transcript.transcript_id = transcript_stable_id.transcript_id where coord_system.name = 'chromosome' and seq_region.name = '1' and gene.seq_region_end < 30000000 and external_db.db_name = 'SWISSPROT';
Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
Denormalisation
Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
Denormalisation
- Implement a complex query as a table
- Need to set up “triggers” to populate table and maintain
data integrity
- Triggers could run on each insert or as bulk loading
processes
- This increases database administration but can increase
its usability
- Note that this different from a view which is a saved