Modelling and Designing a Database
Bela Tiwari btiwari@ceh.ac.uk
Environmental Genomics Thematic Programme Data Centre
http://envgen.nox.ac.uk
Modelling and Designing a Database Bela Tiwari btiwari@ceh.ac.uk - - PowerPoint PPT Presentation
Modelling and Designing a Database Bela Tiwari btiwari@ceh.ac.uk Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk Modelling and designing a database A well thought out database will consist of a co-ordinated set
Bela Tiwari btiwari@ceh.ac.uk
Environmental Genomics Thematic Programme Data Centre
http://envgen.nox.ac.uk
Rabbit OXPKA
Rabbit phosphorylase
Nucleotide EMBL
TRIC_RABIT
Troponin I
protein Swissprot
rabbit PHS2_RABIT
Glycogen phosphorylase
Proten Swissprot
rabit 1ABB
Glycogen Phosphorylase
protein_structure
PDB
Rabbit Q8MJF7
pol protein
protein TrEMBL
Rabbit CK829726
Nucleotide dbEST
rabbit OCPHOS2
rabbit muscle phosphorylase mrna
Nucleotide EMBL
Trt3_rabit
Troponin T
Protein UniProt
Rabbit KPB1_Rabit
Phosphorylase B kinase alpha regulatory chain
Protein Swissprot
Organism ID Name Sequence Type Database Name
Rabbit OXPKA
Rabbit phosphorylase
Nucleotide EMBL
TRIC_RABIT
Troponin I
protein Swissprot
rabbit PHS2_RABIT
Glycogen phosphorylase
Peptide Swissprot
rabit 1ABB
Glycogen Phosphorylase
protein_structure
PDB
Rabbit Q8MJF7
pol protein
protein TrEMBL
Rabbit CK829726
Nucleotide dbEST
rabbit OCPHOS2
rabbit muscle phosphorylase mrna
Nucleotide EMBL
Trt3_rabit
Troponin T
Protein UniProt
Rabbit KPB1_Rabit
Phosphorylase B kinase alpha regulatory chain
Protein Swissprot
Organism ID Name Sequence Type Database Name
Rabbit OXPKA
Rabbit phosphorylase
Nucleotide EMBL
TRIC_RABIT
Troponin I
protein Swissprot
rabbit PHS2_RABIT
Glycogen phosphorylase
Peptide Swissprot
rabit 1ABB
Glycogen Phosphorylase
protein_structure
PDB
Rabbit Q8MJF7
pol protein
protein TrEMBL
Rabbit CK829726
Nucleotide dbEST
rabbit OCPHOS2
rabbit muscle phosphorylase mrna
Nucleotide EMBL
Trt3_rabit
Troponin T
Protein UniProt
Rabbit KPB1_Rabit
Phosphorylase B kinase alpha regulatory chain
Protein Swissprot
Organism ID Name Sequence Type Database Name
Rabbit OXPKA
Rabbit phosphorylase
Nucleotide EMBL
TRIC_RABIT
Troponin I
protein Swissprot
rabbit PHS2_RABIT
Glycogen phosphorylase
Peptide Swissprot
rabit 1ABB
Glycogen Phosphorylase
protein_structure
PDB
Rabbit Q8MJF7
pol protein
protein TrEMBL
Rabbit CK829726
Nucleotide dbEST
rabbit OCPHOS2
rabbit muscle phosphorylase mrna
Nucleotide EMBL
Trt3_rabit
Troponin T
Protein UniProt
Rabbit KPB1_Rabit
Phosphorylase B kinase alpha regulatory chain
Protein Swissprot
Organism ID Name Sequence Type Database Name
Rabbit OXPKA
Rabbit phosphorylase
Nucleotide EMBL
TRIC_RABIT
Troponin I
protein Swissprot
rabbit PHS2_RABIT
Glycogen phosphorylase
Peptide Swissprot
rabit 1ABB
Glycogen Phosphorylase
protein_structure
PDB
Rabbit Q8MJF7
pol protein
protein TrEMBL
Rabbit CK829726
Nucleotide dbEST
rabbit OCPHOS2
rabbit muscle phosphorylase mrna
Nucleotide EMBL
Trt3_rabit
Troponin T
Protein UniProt
Rabbit KPB1_Rabit
Phosphorylase B kinase alpha regulatory chain
Protein Swissprot
Organism ID Name Sequence Type Database Name
establish requirements Data Requirements data analysis Conceptual Data Model database design Logical Schema Specification implement
schema and database
establish requirements Data Requirements data analysis Conceptual Data Model database design Logical Schema Specification implement
schema and database
Consult with the users of your system and agree upon
Include in your considerations the following:
e.g. if you are storing the birth year of a person, you should not also store their age; it can easily be derived
as whether the data may be needed to answer a number of different questions, and whether you anticipate that additional types of data may be added in the future. In the end, you want your database to be useful for what it needs to do!
Feed C Feed B Feed A Mouse 20g 10g 10g m3 40g m2 20g 20g m1 20g m1 10g m3 FeedA Mouse 40g m2 20g m1 10g m3 FeedB Mouse 10g m3 FeedC Mouse 40g FeedB m2 10g FeedA m3 10g FeedB m3 20g FeedC m3 20g FeedB m1 20g FeedA m1 Amount FeedType Mouse
What persistent data is to be stored in the database? Write a description of what is required of the database, including the kinds of data that should be stored and what questions users would like to be able to ask. Information about the limits on the data and any issues like whether a certain type of data is required or not should also be included.
Example: Initial Statement of Database Requirements A database is required that will hold information about entries from public databases of interest along with the type of features in that sequence. The users should be able to search for a database accession number, organism, or public database name and retrieve a listing of all relevant sequences. They should also be able to retrieve any sequences containing particular feature types. For each public database sequence listed, the users would like to see the URL for that database. Questions:
entry for that public database to be added?
date – can you set up an automated retrieval system from a local or remote copy of the public database?)
strains as well?
databases, features, etc that are not stated in this initial description?
Does a sequence from a particular public database have to be stored in order for an entry for that public database to be added? No Do they want to store the sequence data? No – automated retrieval preferred Do the users want to search for just organism names, or do they wish to search for strains as well? Strains as well Do they wish to store only formal names, or common names also? Common names as well Is there additional information the users would like to retrieve about the organisms, databases, features, etc that are not stated in this initial description? Yes – they would like to be able to see retrieve a list of public database names stored, along with the type of data that is stored in that database. They would also like to see if the genome sequence of an organism is available.
Revised Statement of Database Requirements
A database is required that will hold information about entries from public databases of interest along with the type of features in that sequence. The users should be able to search for a database accession number, organism name, formal or common, and/or strain, or public database name and retrieve a listing of all relevant sequences as well as whether the full genome sequence of the organism in question is publicly available. They should also be able to retrieve any sequences containing particular feature types. For each public database sequence listed, the users would like to see the URL for that database and the type of data it stores. The sequence data itself does not need to be stored in the database.
A way of formally representing your database requirements.
Your model will include a list of entity types, constraints, and assumptions and an E-R diagram. Entity: a representation of a thing that there is a need to record data about Relationship: an association between entities that needs to be recorded E-R diagram: a pictorial representation of your data requirements
A good first step to defining the entities – underline the nouns in the requirements statement
Statement of Database Requirements A database is required that will hold information about sequences from public databases of interest along with the type of features in that sequence. The users should be able to search for a database accession number,
relevant sequences as well as whether the full genome sequence of the organism in question is publicly
public database sequences listed, the users would like to see the URL for that database and the type of data it stores. The sequence data itself does not need to be stored in the database. Possible entities: PublicDatabase Sequence Organism Feature
Other possibilities: GenomeSequence……………………..(should we go back to the users and ask about this?) CommonNames……………………......(should we go back to the users and ask about this?)
Representing Entities and Relationships in an E-R diagram
Entity:
EntityName
Relationship
EntityTypeOne EntityTypeTwo
RelationshipName
Relationships
E.g. A PublicDatabase may contain many Sequences Holds An Organism may have many Sequences recorded ContainsSeqs or ComesFrom
Representing Entities and Relationships in an E-R diagram Different types of Relationships 1:1
EntityTypeOne EntityTypeTwo
RelationshipName
1:n (1 to many) m:n (many to many)
EntityTypeOne EntityTypeTwo
RelationshipName
EntityTypeOne EntityTypeTwo
RelationshipName
Representing Entities and Relationships in an E-R diagram Different types of Relationships
mandatory
PublicDatabase Sequence
Holds
PublicDatabase Sequence
Holds
Based on our Database Requirements
We defined our entities as: PublicDatabase Sequence Organism Feature What type of information do we need to store about each of these data types? What is the relationship between these entities? What type of relationship associates each entity? What will we use to identify each instance of each entity? What assumptions have we made in making our model? What constraints are there on our data?
What type of information do we need to store about each of these data types? PublicDatabase(DatabaseName, DataType,URL) Sequence (AccessionNumber, ID) Organism (Species, Strain, GenomeSeq, CommonName) Feature (FeatID, Name)
What will we use to identify each instance of each entity? PublicDatabase(DatabaseName, DataType, URL) Sequence (AccessionNumber, ID) Organism (Species, Strain, GenomeSeq, CommonName) Feature (FeatID, Name) What assumptions have we made so far? Every public database has a unique name The AccessionNumbers of sequences are unique across all databases Feature ID’s are unique Every species has a strain Each species has at most one common name
What will we use to identify each instance of each entity? PublicDatabase(DatabaseName, DataType, URL) Sequence (AccessionNumber, ID) Organism (Species, Strain, GenomeSeq, CommonName) Feature (FeatID, Name) What assumptions have we made so far? Every public database has a unique name The AccessionNumbers of sequences are unique across all databases Feature ID’s are unique Each species has at most one common name
Currently we have no unique identifier for the Organism entity. We will deal with this during the design stage.
What is the relationship between these entities? PublicDatabase(DatabaseName, DataType, URL) Sequence (AccessionNumber, ID) Organism (Species, Strain, GenomeSeq, CommonName) Feature (FeatID, Name)
PublicDatabase Sequence Organism
Holds Contains ComesFrom
Feature
What type of relationship associates each entity? PublicDatabase(DatabaseName, DataType, URL) Sequence (AccessionNumber, ID) Organism (Species, Strain, GenomeSeq, CommonName) Feature (FeatID, Name)
PublicDatabase Sequence Organism
Holds Contains ComesFrom
Feature
What assumptions have we made? Every public database has a unique name The AccessionNumbers of sequences are unique across all databases Feature ID’s are unique Each species has at most one common name Feature types are not recorded unless they are found in a sequence in a database Information about a public database can be recorded even if no sequences from it are in our database Sequences do not require that you have recorded information about the organism they come from Organism information is only recorded if a sequence from that organism is present in the database
PublicDatabase Sequence Organism
Holds Contains ComesFrom
Feature
Not all aspects of the data can be represented by the model. Things that cannot be represented should be listed as constraints Examples: if modelling university course applications, a constraint might be that each student can only take a total of 5 courses in a year If modelling student records, a constraint might be that grades can be
PublicDatabase Sequence Organism
Holds Contains ComesFrom
Feature
Entity types: PublicDatabase(DatabaseName, DataType, URL) Sequence (AccessionNumber, ID) Organism (Species, Strain, GenomeSeq, CommonName) Feature (FeatID, Name) Constraints: Assumptions: Every public database has a unique name The AccessionNumbers of sequences are unique across all databases Feature ID’s are unique Each species has at most one common name Feature types are not recorded unless they are found in a sequence in a database Information about a public database can be recorded even if no sequences from it are in our database Sequences do not require that you have recorded information about the organism they come from Organism information is only recorded if a sequence from that organism is present in the database
establish requirements Data Requirements data analysis Conceptual Data Model database design Logical Schema Specification implement
schema and database
establish requirements Data Requirements data analysis Conceptual Data Model database design Logical Schema Specification implement
schema and database
A conceptual data model and the “logical schema specification”
A conceptual data model like an E-R model represents the data independently of any concerns about implementation as a database. A logical schema, or design describes a set of tables and more closely describes how data will be stored in a database.
How do we represent the entities?
As a first step, we can represent each entity as a “relation” (a table), having attributes (columns) corresponding to the attributes of the entity.
Entity types: PublicDatabase(DatabaseName, DataType, URL) Sequence (AccessionNumber, ID) Organism (Species, Strain, GenomeSeq, CommonName) Feature (FeatID, Name) Relations: PublicDatabase(DatabaseName, DataType, URL) Sequence (AccessionNumber, ID) Organism (Species, Strain, GenomeSeq, CommonName) Feature (FeatID, Name)
The identifier from an Entity has become the primary key for the corresponding Relation.
What about the Organism entity?
Recall: We had no clear identifier for instances of this Entity. Organism (Species, Strain, GenomeSeq, CommonName) We can create a surrogate key for Organism. A surrogate key is a unique value that is not derived from any data in the database; its only significance is to act as the primary key of a relation. Organism (OrganismNo, Species, Strain, GenomeSeq, CommonName)
OrganismNo is a surrogate key in the Organism Relation
So, the Entities from our model are now translated into corresponding Relations in our design:
Entity types:
PublicDatabase(DatabaseName, DataType, URL) Sequence (AccessionNumber, ID) Organism (Species, Strain, GenomeSeq, CommonName) Feature (FeatID, Name)
Relations:
PublicDatabase(DatabaseName, DataType, URL) Sequence (AccessionNumber, ID) Organism (OrganismNo, Species, Strain, GenomeSeq, CommonName) Feature (FeatID, Name) and each Relation has a designated primary key.
For example: PublicDatabase(DatabaseName, DataType, URL)
DatabaseName DatabaseName, DataType DatabaseName, DataType, URL
Each of these uniquely identify any given instance of PublicDatabase. Only DatabaseName is both unique and minimal
For each relation, we need to define its name and its primary key. For each attribute, we will need to determine what the data type is.
relation PublicDatabase DatabaseName: String DataType: String URL: String primary key: DatabaseName Recall: this is not the database implementation stage!! SQL has no type “String”. For implementation there will be a number of data types to choose from that are strings, but have varying conditions
At the stage, you can describe the data type you with any descriptive name you
the database system you are using.
Some other common data types for attributes include Integer, Real, Date, Time, etc.
In fact, in this database, most of the data types are “string”.
relation PublicDatabase DatabaseName: String DataType: String URL: String primary key: DatabaseName relation Sequence AccessionNumber: String ID: String primary key: AccessionNumber relation Organism OrganismNo: Integer Species: String Strain: String GenomeSeq: Boolean CommonName: String primary key: OrganismNo relation Feature FeatID: string Name: string primary key: FeatID So, now the entities have been represented. What about the relationships?
PublicDatabase(DatabaseName, DataType, URL) Sequence (AccessionNumber, ID) Organism (OrganismNo, Species, Strain, GenomeSeq, CommonName) Feature (FeatID, Name)
PublicDatabase Sequence Organism
Holds Contains ComesFrom
Feature
PublicDatabase Sequence
Holds
AB117611 Q04917 AE017138 AK004199 Q919P1 2.103764336-104764335 EMBL SwissProt EnsEMBL E.g. Here, you would wish to record in the Sequence relation that sequence: AB117611 is held in the EMBL database Q04917 is held in the SwissProt database AE017138 is held in the EMBL database and so on.
I.e. we want to record the primary key information from the PublicDatabase relation in the appropriate rows of the Sequence relation. The information we wish to record about a Sequence in the Sequence relation might look something like: The description of the Sequence relation would have a new line added: relation Sequence AccessionNumber: String ID: String primary key: AccessionNumber foreign key: SourceDatabase references PublicDatabase
EMBL Thang3 AE017138 SwissProt Anotherthing Q04917 EMBL Something AB117611 SourceDatabase ID AccessionNumber
Terminology
We declare a foreign key in the relation on the :n side of the relationship. foreign key: SourceDatabase references PublicDatabase This foreign key refers to a unique attribute (usually the primary key) in the relation on the 1: side of the relationship. foreign key: SourceDatabase references PublicDatabase
EMBL Thang3 AE017138 SwissProt Anotherthing Q04917 EMBL Something AB117611 SourceDatabase ID AccessionNumber
EnsEMBL 2.103764336- 104764335 2.103764336- 104764335 SwissProt 3MPH_AERPE Q919P1 EMBL AK004199 AK004199 EMBL AE107138 AE107138 SwissProt 143F_HUMAN Q04917 EMBL AB117611 AB117611 SourceDatabase ID AccessionNumber
www.ensembl.org
nucleotide EnsEMBL
ca.expasy.org/sprot/
peptide SwissProt
www.ebi.ac.uk/embl/index.html
nucleotide EMBL URL DataType DatabaseName 2.103764336-104764335 2.103764336-104764335 3MPH_AERPE Q919P1 AK004199 AK004199 AE107138 AE107138 143F_HUMAN Q04917 AB117611 AB117611 ID AccessionNumber
Original Sequence Relation The Holds relationship is now represented by the posting of SourceDatabase as a foreign key in the Sequence relation.
EnsEMBL 2.103764336- 104764335 2.103764336- 104764335 SwissProt 3MPH_AERPE Q919P1 EMBL AK004199 AK004199 EMBL AE107138 AE107138 SwissProt 143F_HUMAN Q04917 EMBL AB117611 AB117611 SourceDatabase ID AccessionNumber
Revised Sequence Relation
relation Sequence AccessionNumber: string ID: string primary key: AccessionNumber foreign key: SourceDatabase references PublicDatabase
You now need to represent all the other relationships to complete your design.
PublicDatabase Sequence Organism
Holds Contains ComesFrom
Feature
Entity types: PublicDatabase(DatabaseName, DataType, URL) Sequence (AccessionNumber, ID) Organism (OrganismNo, Species, Strain, GenomeSeq, CommonName) Feature (FeatID, Name) E.g. for each relation, you will have a listing like: relation Sequence AccessionNumber: String ID: String primary key: AccessionNumber foreign key: SourceDatabase references PublicDatabase
relation PublicDatabase DatabaseName: String DataType: String URL: String primary key: DatabaseName relation Sequence AccessionNumber: String ID: String primary key: AccessionNumber foreign key: SourceDatabase references PublicDatabase foreign key: SourceOrganism references Organism relation Organism OrganismNo: Integer Species: String Strain: String GenomeSeq: Boolean CommonName: String primary key: OrganismNo relation Feature FeatID: String Name: String primary key: FeatID foreign key: SourceSequence references Sequence
EntityTypeOne EntityTypeTwo
RelationshipName
m:n (many to many)
EntityTypeOne EntityTypeTwo
RelationshipName
1:1
A picture is worth a thousand words…
Team PostDoc
ConsistsOf
TeamA TeamB
HeadedBy
Team PostDoc
ConsistsOf
TeamA TeamB
HeadedBy
Team PostDoc
ConsistsOf HeadedBy relation Team TeamName: String TeamLoc: String TeamPhone: String primary key: TeamName relation PostDoc PostDocId: String Name: String SpecialistSubject: String primary key: PostDocId foreign key: TeamName references Team
How do we represent the relationship HeadedBy?
A picture is worth a thousand words…
Team PostDoc
ConsistsOf HeadedBy
How do we represent the relationship HeadedBy?
Because this relationship is symmetrical, you can choose to post the foreign key in Team or PostDoc. Make the decision based on what the most natural grouping of the data is.
Team PostDoc
ConsistsOf HeadedBy
Here, you would probably choose to post the foreign key into the Team relation
relation Team TeamName: String TeamLoc: String TeamPhone: String primary key: TeamName foreign key: TeamHead references PostDocs
Reasoning:
It seems a natural attribute of a team to have a team leader. Each team will have a leader. Every PostDoc will not also be a team leader, so if you chose to put the foreign key in the PostDoc relation, many rows will contain null values for the TeamHead attribute.
Team PostDoc
ConsistsOf HeadedBy
So far, this is the same as dealing with 1:n relationships.
You now need a mechanism to ensure that values of TeamHead in the Team relation are unique to represent this as a 1:1 relationship. In a model you can represent this by declaring this attribute as an alternate key of Team.
Alternate keys, like primary keys, constrain the set of valid states of a relation – i.e. no two rows can contain the same entry for this attribute.
relation Team TeamName: String TeamLoc: String TeamPhone: String primary key: TeamName alternate key: TeamHead foreign key: TeamHead references PostDocs
SoftwareCategory BioinfProgram
BelongsTo
Here we have bioinformatics programs belonging software categories; each program can belong to one or more categories. Each software category can contain one or more bioinformatics programs. A picture is worth a thousand words…
NucleicRestriction NucleicTranslation ProteinComposition Display remap backtranseq restrict showseq transeq pepstats compseq abiview
You can represent m:n relationships by using another relation to represent the relationship itself.
SoftwareCategory BioinfProgram
BelongsTo
SoftwareCategory BioinfProgram BelongsTo
You now have your m:n relationship represented as two 1:n relationships (and a new relation). You can represent these relationships in your model using the foreign key mechanism introduced earlier.
SoftwareCategory BioinfProgram BelongsTo SoftwareCategory BioinfProgram
BelongsTo relation SoftwareCategory CategoryName: String Description: String primary key: CategoryName relation BioinfProgram ProgramName: String Description: String primary key: ProgramName relation SoftwareCategory CategoryName: String Description: String primary key: CategoryName relation BioinfProgram ProgramName: String Description: String primary key: ProgramName relation BelongsTo CategoryName: String ProgramName: String primary key: (CategoryName, ProgramName) foreign key: CategoryName references SoftwareCategory foreign key: ProgramName references BioinfProgram
establish requirements Data Requirements data analysis Conceptual Data Model database design Logical Schema Specification implement
schema and database