SLIDE 1
Overview of current biological databases
Qi Sun
Computational Biology Service Unit
Cornell University
SLIDE 2
SLIDE 3
Platforms for Bioinformatics
Web Server, Database Server
HTTP, SOAP, FTP, SQL
SLIDE 4
Platforms for Bioinformatics
Open source: Linux, Apache, MySQL, Perl/Python/PHP
Microsoft: Windows, ASP.NET, SQL Server, C#
SLIDE 5
Archival database (GenBank, GenPept) vs Computer algorithm generated database (UniGene) vs Manually curated database (RefSeq)
Public Database - 1
NCBI Sequence Data Model
SLIDE 6
The NCBI Data Model
GenBank: a DNA-centered database
SLIDE 7
Identifier:
1. LOCUS (obsolete)
2. Accession (version)
3. GI
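As a rough illustration of the "Accession (version)" identifier, the sketch below splits an accession.version string into its two parts. The helper name `split_accession` and the regular expression are assumptions made for this example (a loose pattern, not an official NCBI grammar), and the ".2" version suffix is invented for illustration.

```python
import re

# Loose, assumed pattern: letters/underscore prefix, digits, optional ".version".
ACCESSION_RE = re.compile(r"^([A-Za-z_]+\d+)(?:\.(\d+))?$")


def split_accession(identifier):
    """Return (accession, version); version is None when no suffix is present."""
    match = ACCESSION_RE.match(identifier.strip())
    if match is None:
        raise ValueError(f"not an accession-like identifier: {identifier!r}")
    accession, version = match.groups()
    return accession, int(version) if version is not None else None


# The ".2" suffix is a made-up version number for illustration only.
print(split_accession("NM_123456.2"))
print(split_accession("AP33493"))
```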
SLIDE 8
Features
SLIDE 9
GenPept: a protein-centered database
SLIDE 10
FTP sites:
GenBank: ftp://ftp.ncbi.nih.gov/genbank/
GenPept: ftp://ftp.ncifcrf.gov/pub/genpept/
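A minimal sketch of how one of the FTP sites above could be browsed from a script, using only the Python standard library. The helper names `split_ftp_url` and `list_directory` are hypothetical, and the sketch assumes the server still accepts anonymous FTP logins, which has not been verified here.

```python
from ftplib import FTP
from urllib.parse import urlparse

GENBANK_FTP = "ftp://ftp.ncbi.nih.gov/genbank/"  # URL taken from the slide


def split_ftp_url(url):
    """Split an ftp:// URL into (host, path)."""
    parts = urlparse(url)
    if parts.scheme != "ftp":
        raise ValueError(f"not an FTP URL: {url!r}")
    return parts.hostname, parts.path or "/"


def list_directory(url, timeout=30):
    """List file names in the directory (assumes anonymous login is allowed)."""
    host, path = split_ftp_url(url)
    with FTP(host, timeout=timeout) as ftp:
        ftp.login()      # anonymous login
        ftp.cwd(path)
        return ftp.nlst()


if __name__ == "__main__":
    # Only the URL parsing runs here; call list_directory(GENBANK_FTP)
    # manually when network access is available.
    print(split_ftp_url(GENBANK_FTP))
```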
SLIDE 11
Problems with GenBank and GenPept
- They do not distinguish between sequence categories.
- There is a lot of redundancy.
- The same gene can be deposited into the database many times under different names.
- Different versions of the same gene can be submitted many times with different accession numbers.
- The features of a GenBank record can be chaotic.
SLIDE 12
Archival database (GenBank, GenPept) vs Computer algorithm generated database (UniGene) vs Curated database (RefSeq, LocusLink ...)
Public Database - 1
NCBI Sequence Databases
SLIDE 13
UniGene: a non-redundant set of gene-oriented clusters
Inputs: GenBank mRNAs, GenBank genomic CDSs, dbEST ESTs
Output: UniGene
SLIDE 14
UniGene identifier
Species prefixes:
- Hs for human
- Mm for mouse
- Rn for rat
- Bt for cow
- Dr for zebrafish
- Dm for fruit fly
- Aga for mosquito
- Xl for frog
- At for cress
- Hv for barley
- Os for rice
- Ta for wheat
- Zm for maize
Examples: Mm.213407, Hs.13303, At.138
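The prefix convention above can be sketched as a small lookup: the part before the dot names the species, the part after it is the cluster number. The function name `parse_unigene_id` is hypothetical; the prefix table is copied from the slide.

```python
# Species prefixes as listed on the slide.
SPECIES_PREFIXES = {
    "Hs": "human", "Mm": "mouse", "Rn": "rat", "Bt": "cow",
    "Dr": "zebrafish", "Dm": "fruit fly", "Aga": "mosquito",
    "Xl": "frog", "At": "cress", "Hv": "barley", "Os": "rice",
    "Ta": "wheat", "Zm": "maize",
}


def parse_unigene_id(identifier):
    """Split e.g. 'Hs.13303' into (species name, cluster number)."""
    prefix, dot, number = identifier.partition(".")
    if not dot or not number.isdigit():
        raise ValueError(f"not a UniGene-style identifier: {identifier!r}")
    if prefix not in SPECIES_PREFIXES:
        raise ValueError(f"unknown species prefix: {prefix!r}")
    return SPECIES_PREFIXES[prefix], int(number)


for example in ("Mm.213407", "Hs.13303", "At.138"):
    print(example, parse_unigene_id(example))
```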
SLIDE 15
Archival database (GenBank, GenPept) vs Computer generated database (UniGene) vs Curated database (RefSeq, Gene ...)
NCBI Sequence Databases
Public Database - 1
SLIDE 16
NCBI human genome annotation pipeline
RefSeq incorporates predicted transcript and protein sequences, experimentally identified mRNA sequences, and EST sequences.
SLIDE 17
RefSeq Accession Numbers:
- NT_123456 constructed genomic contigs
- NM_123456 mRNAs
- NP_123456 proteins
- NC_123456 chromosomes
- XM_123456 predicted mRNA
- XP_123456 predicted protein
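The two-letter prefixes above are enough to tell what kind of molecule a RefSeq accession refers to. The sketch below covers only the six prefixes shown on the slide; the function name `refseq_category` is hypothetical, and any other prefix is simply reported as unknown.

```python
# Prefix meanings as listed on the slide (not a complete RefSeq prefix table).
REFSEQ_PREFIXES = {
    "NT": "constructed genomic contig",
    "NM": "mRNA",
    "NP": "protein",
    "NC": "chromosome",
    "XM": "predicted mRNA",
    "XP": "predicted protein",
}


def refseq_category(accession):
    """Return the category for a RefSeq-style accession, or None if unknown."""
    prefix, underscore, _rest = accession.partition("_")
    if not underscore:
        return None  # no underscore, so not a RefSeq-style accession
    return REFSEQ_PREFIXES.get(prefix)


for acc in ("NM_123456", "XP_123456", "AP33493"):
    print(acc, "->", refseq_category(acc))
```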
SLIDE 18
RefSeq? UniGene? GenBank?
- Genome sequence available: RefSeq (acc: NP_123456, etc.)
- EST sequence available: UniGene (acc: Hs.13303, etc.)
- GenBank (acc: AP33493, etc.)
SLIDE 19
Go to the web
SLIDE 20
Files that you can download from the NCBI Gene database:
- gene_info
- gene2refseq
- gene2go
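These files are plain tab-delimited text, so they can be read with the standard library. The sketch below assumes the first line is a header that starts with "#" and names the columns; the column names `GeneID` and `GO_ID` and every value in `SAMPLE` are placeholders invented for this example, not real annotations. Check the header of the downloaded file before relying on specific column names.

```python
import csv
import io


def read_tab_table(handle):
    """Yield one dict per data row, using the first line as column names."""
    header = handle.readline().rstrip("\r\n").lstrip("#").split("\t")
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:
            yield dict(zip(header, row))


def go_ids_by_gene(rows):
    """Group GO identifiers by gene identifier (assumed column names)."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row["GeneID"], []).append(row["GO_ID"])
    return grouped


# Invented placeholder rows in a gene2go-like layout.
SAMPLE = (
    "#tax_id\tGeneID\tGO_ID\n"
    "0\t39844\tGO:PLACEHOLDER1\n"
    "0\t39844\tGO:PLACEHOLDER2\n"
)

print(go_ids_by_gene(read_tab_table(io.StringIO(SAMPLE))))
```

For a real download, replace `io.StringIO(SAMPLE)` with an open file handle.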
SLIDE 21
NCBI search engine: Entrez
- Boolean operators “AND” “OR” “NOT”
- Entrez tags
- using limits
- MeSH terms
Batch Entrez: search by accession list
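To make the Boolean operators and tags concrete, the sketch below builds an Entrez query string and, as an extra step not covered on the slide, turns it into an E-utilities `esearch` URL. The helper names are hypothetical, the example terms are arbitrary, and which field tags a given database accepts should be checked against the Entrez help pages.

```python
from urllib.parse import urlencode

# E-utilities search endpoint (an addition for illustration; not on the slide).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def tagged(term, field):
    """Attach an Entrez field tag, e.g. human[Organism]."""
    return f"{term}[{field}]"


def combine(operator, *terms):
    """Join terms with one of the Boolean operators AND, OR, NOT."""
    if operator not in ("AND", "OR", "NOT"):
        raise ValueError(f"unsupported operator: {operator!r}")
    return "(" + f" {operator} ".join(terms) + ")"


def esearch_url(db, query):
    """Build an esearch URL for the given database and query string."""
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": query})


query = combine("AND", tagged("human", "Organism"), tagged("Smn", "Gene Name"))
print(query)
print(esearch_url("nucleotide", query))
```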
SLIDE 22
Other Sequence Databases:
Genomic DNA: Ensembl genome annotation database
(http://www.ensembl.org; HTTP, FTP, MySQL interface)
Protein: UniProt (http://www.pir.uniprot.org/)
SLIDE 23
KEGG database: go to the web
SLIDE 24
Public Database - 2
GO
Gene Ontology
1. Molecular Function
2. Biological Process
3. Cellular Component
http://www.geneontology.org
SLIDE 25
Public Database - 2
SLIDE 26
Public Database - 2
GO (3673)
- Molecular Function (3674)
- Biological Process (8150)
- Cellular Component (5575)
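Assuming the numbers above are GO term numbers, they are normally written as "GO:" followed by seven zero-padded digits. The sketch below converts between the two forms; the function names are hypothetical.

```python
# Root term numbers as shown on the slide.
GO_ROOTS = {
    "Molecular Function": 3674,
    "Biological Process": 8150,
    "Cellular Component": 5575,
}


def format_go_id(number):
    """Format a term number as a GO identifier, e.g. 3674 -> 'GO:0003674'."""
    if not 0 <= number <= 9_999_999:
        raise ValueError(f"GO term number out of range: {number}")
    return f"GO:{number:07d}"


def parse_go_id(go_id):
    """Inverse of format_go_id: 'GO:0003674' -> 3674."""
    prefix, colon, digits = go_id.partition(":")
    if prefix != "GO" or not colon or len(digits) != 7 or not digits.isdigit():
        raise ValueError(f"not a GO identifier: {go_id!r}")
    return int(digits)


for name, number in GO_ROOTS.items():
    print(name, format_go_id(number))
```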
SLIDE 27
GO Example 1: Biological Process
SLIDE 28
GO Example 2: Molecular Function
SLIDE 29
Gene Ontology Annotation
Smn: survival motor neuron
Gene ID: 39844
SLIDE 30
Public Database - 4
Species Specific Databases
- Arabidopsis – TAIR
- Yeast – SGD
- Fly – FlyBase
- Worm – WormBase
- Mouse – MGD