Implementing a Relational Database Joe Wood anjw@ceh.ac.uk - - PowerPoint PPT Presentation

implementing a relational database
SMART_READER_LITE
LIVE PREVIEW

Implementing a Relational Database Joe Wood anjw@ceh.ac.uk - - PowerPoint PPT Presentation

Implementing a Relational Database Joe Wood anjw@ceh.ac.uk Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk Implementing a relational database establish requirements Data Requirements data analysis


slide-1
SLIDE 1

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Implementing a Relational Database

Joe Wood anjw@ceh.ac.uk

slide-2
SLIDE 2

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Implementing a relational database

establish requirements data analysis Conceptual Data Model database design Logical Schema Specification implement

schema and database

Data Requirements

slide-3
SLIDE 3

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Our design so far:

relation PublicDatabase DatabaseName: String DataType: String URL: String primary key: DatabaseName relation Sequence AccessionNumber: String ID: String primary key: AccessionNumber foreign key: SourceDatabase references PublicDatabase foreign key: SourceOrganism references Organism relation Organism OrganismNo: Integer Species: String Strain: String GenomeSeq: Boolean CommonName: String primary key: OrganismNo relation Feature FeatID: String Name: String primary key: FeatID foreign key: SourceSequence references Sequence

slide-4
SLIDE 4

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Implementing our database

  • Most of hard work is already done
  • Create relations using SQL
  • Define the data types for our columns
  • Define primary and foreign keys
  • Add constraints
  • Add any appropriate default values
slide-5
SLIDE 5

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Our design so far:

relation PublicDatabase DatabaseName: String DataType: String URL: String primary key: DatabaseName

slide-6
SLIDE 6

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Create Table

relation PublicDatabase DatabaseName: String DataType: String URL: String primary key: DatabaseName

CREATE TABLE publicdatabase ( databasename datatype url );

slide-7
SLIDE 7

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Add data types

Reminder:

  • Numerical
  • integer,float,numerical
  • String/Text
  • varchar,text
  • Date/Time
  • timestamp,date
  • Boolean
slide-8
SLIDE 8

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Add data types

relation PublicDatabase DatabaseName: String DataType: String URL: String primary key: DatabaseName

CREATE TABLE publicdatabase ( databasename varchar(50), datatype varchar(20), url varchar(200) );

slide-9
SLIDE 9

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Primary Keys

relation PublicDatabase DatabaseName: String DataType: String URL: String primary key: DatabaseName

CREATE TABLE publicdatabase ( databasename varchar(50), datatype varchar(20), url varchar(200), primary key (databasename) );

slide-10
SLIDE 10

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Foreign Keys

relation Sequence AccessionNumber: String ID: String primary key: AccessionNumber foreign key: SourceDatabase references PublicDatabase foreign key: SourceOrganism references Organism

CREATE TABLE sequence ( accessionnumber varchar(50), id varchar(50), sourcedatabase varchar(30), sourceorganism integer, primary key (accessionnumber), foreign key (sourcedatabase) references publicdatabase(databasename), foreign key (sourceorganism) references organism(organismnumber) );

slide-11
SLIDE 11

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Foreign Keys

relation Sequence AccessionNumber: String ID: String primary key: AccessionNumber foreign key: SourceDatabase references PublicDatabase foreign key: SourceOrganism references Organism

CREATE TABLE sequence ( accessionnumber varchar(50), id varchar(50), sourcedatabase varchar(30), sourceorganism integer, primary key (accessionnumber), foreign key (sourcedatabase) references publicdatabase, foreign key (sourceorganism) references organism );

slide-12
SLIDE 12

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

feature table

CREATE TABLE feature ( featid varchar(50), name varchar(100), sourcesequence varchar(50), primary key (featid), foreign key (sourcesequence) references sequence(accessionnumber) );

slide-13
SLIDE 13

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

  • rganism table

CREATE TABLE organism (

  • rganismnumber integer

species varchar(100), strain varchar(100), genomeseq boolean, commonname varchar(100), primary key (organismnumber) );

slide-14
SLIDE 14

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Constraints

  • Constraints restrict the values that can be inserted or

updated in columns

  • Types of constraints
  • NOT NULL
  • UNIQUE
  • Simply add to column definition

url varchar(100) NOT NULL

  • r

url varchar(100) UNIQUE

slide-15
SLIDE 15

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Constraints

  • NOT NULL and UNIQUE implicit when primary key
  • CHECK constraint

numberoflegs integer check (numberoflegs>2)

slide-16
SLIDE 16

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

publicdatabase table

CREATE TABLE publicdatabase ( databasename varchar(50), datatype varchar(20), url varchar(200) UNIQUE, primary key (databasename) );

slide-17
SLIDE 17

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Constraints

  • Foreign key constraints
  • To keep links between tables working you need to

preserve the matching values – referential integrity

  • These constraints automatically set up for you by

RDBMS when you declare the primary and foreign keys

  • This will prevent you from deleting a record with a

primary key before you have deleted all the child foreign key records

slide-18
SLIDE 18

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Constraints

  • Example from 'BigHit' database
slide-19
SLIDE 19

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Constraints

slide-20
SLIDE 20

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

  • rganism table

CREATE TABLE organism (

  • rganismnumber integer,

species varchar(100), strain varchar(100), genomeseq boolean, commonname varchar(100), primary key (organismnumber) );

How shall we create the unique primary key values for

  • rganismnumber?
slide-21
SLIDE 21

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Sequences

  • Could do complicated checks to work out the last

number entered, add 1 and insert for each new record

  • Sequence is a database object in PostgreSQL which

essentially an automatically incrementing numeric value (equivalent to 'autonumber' in Access)

CREATE SEQUENCE 'my_seq' (can specify increment,min and max) SELECT NEXTVAL('my_seq')

slide-22
SLIDE 22

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Default values

  • Still don't want to have to select the value each time
  • Can set a default value for column which is

automatically filled in every time a record is inserted

CREATE TABLE organism (

  • rganismnumber integer DEFAULT NEXTVAL('my_seq'),

species varchar(100), strain varchar(100), genomeseq boolean, commonname varchar(100), primary key (organismnumber) );

slide-23
SLIDE 23

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Create your database

  • To create your database run your SQL table and other
  • bject creation statements in a single script
  • Example - demodatabase.sql
  • Be sure create tables in the right order
  • Can't create table that refers to a primary key in a

table that doesn't exist yet

  • You also need data...
slide-24
SLIDE 24

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Populate your database

  • Insert data using INSERT sql statements

INSERT INTO organism (species,strain,genomeseq,commonname) VALUES ('Oryctolagus cuniculus',NULL,'false','rabbit');

  • Default values will inserted automatically

CREATE TABLE organism (

  • rganismnumber integer DEFAULT NEXTVAL('my_seq'),

...

slide-25
SLIDE 25

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Populate your database

  • Be sure to insert data in correct order
  • Don't try and insert a foreign key value when the

primary key value hasn't been inserted yet

  • Run the demodatabase.sql script
slide-26
SLIDE 26

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Querying your database

  • Now that your database is set up and data has been

inserted we can query it

slide-27
SLIDE 27

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Rabbit OXPKA

Rabbit phosphorylase

Nucleotide EMBL

TRIC_RABIT

Troponin I

protein Swissprot

rabbit PHS2_RABIT

Glycogen phosphorylase

Proten Swissprot

rabit 1ABB

Glycogen Phosphorylase

protein_structure

PDB

Rabbit Q8MJF7

pol protein

protein TrEMBL

Rabbit CK829726

Nucleotide dbEST

rabbit OCPHOS2

rabbit muscle phosphorylase mrna

Nucleotide EMBL

Trt3_rabit

Troponin T

Protein UniProt

Rabbit KPB1_Rabit

Phosphorylase B kinase alpha regulatory chain

Protein Swissprot

Organism Accession Number ID Sequence Type Database Name

slide-28
SLIDE 28

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Querying your database

slide-29
SLIDE 29

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Querying your database

slide-30
SLIDE 30

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Querying your database

slide-31
SLIDE 31

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Querying your database

  • What have we gained?
  • No data redundancy
  • Data is consistent
  • Enforced quality control – no missing data
  • Only have change data once
  • Flexibility to run a variety of queries
slide-32
SLIDE 32

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Views

  • Views are queries that are saved in the database as
  • bjects
  • Appear much like a table which can be queried in the

same way

  • Good if underlying query is very complex

CREATE VIEW viewname AS query

slide-33
SLIDE 33

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Views

slide-34
SLIDE 34

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Views

slide-35
SLIDE 35

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Indices

  • Searching data is slow
  • Indices make this searching faster
  • Implicit indices are set up for primary keys as these are

used a lot for searching data

slide-36
SLIDE 36

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Indices

  • An index can be created on any column

CREATE INDEX orgname_idx on organism (commonname)

  • An index is helpful on a column that is regularly

searched on (i.e. Used in the WHERE clause)

slide-37
SLIDE 37

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Normalisation

  • Normalisation is the process of reducing data

redundancy in your database design

  • Denormalisation increases data redundancy but is

sometimes necessary to increase the usability of a database

  • Example – ensembl
slide-38
SLIDE 38

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Denormalisation

select gene.seq_region_id, gene_stable_id.stable_id, transcript_stable_id.stable_id, xref.display_label from external_db inner join xref on external_db.external_db_id = xref.external_db_id inner join object_xref on xref.xref_id = object_xref.xref_id inner join translation on object_xref.ensembl_id = translation.translation_id inner join transcript on translation.transcript_id = transcript.transcript_id inner join gene on gene.gene_id = transcript.gene_id inner join seq_region on gene.seq_region_id = seq_region.seq_region_id inner join coord_system on seq_region.coord_system_id = coord_system.coord_system_id left outer join gene_stable_id on gene_stable_id.gene_id = gene.gene_id left outer join transcript_stable_id on transcript.transcript_id = transcript_stable_id.transcript_id where coord_system.name = 'chromosome' and seq_region.name = '1' and gene.seq_region_end < 30000000 and external_db.db_name = 'SWISSPROT';

slide-39
SLIDE 39

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Denormalisation

slide-40
SLIDE 40

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Denormalisation

  • Implement a complex query as a table
  • Need to set up “triggers” to populate table and maintain

data integrity

  • Triggers could run on each insert or as bulk loading

processes

  • This increases database administration but can increase

its usability

  • Note that this different from a view which is a saved

query