Overview of current biological databases Qi Sun Computational - - PowerPoint PPT Presentation


SLIDE 1

Overview of current biological databases

Qi Sun
Computational Biology Service Unit
Cornell University

SLIDE 2
SLIDE 3

Platforms for Bioinformatics

Servers: Web Server, Database Server

Protocols and interfaces: SOAP, HTTP, FTP, SQL

SLIDE 4

Platforms for Bioinformatics

Open source: Linux, Apache, MySQL, Perl/Python/PHP

Microsoft: Windows, ASP.NET, SQL Server, C#

SLIDE 5

Public Database - 1

NCBI Sequence Data Model

Archival database (GenBank, GenPept)
vs. computer-algorithm-generated database (UniGene)
vs. manually curated database (RefSeq)

SLIDE 6

The NCBI Data Model

GenBank: a DNA-centered database

SLIDE 7

Identifier:

  1. LOCUS (obsolete)
  2. Accession (version)
  3. GI
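
The accession (version) identifier can be handled programmatically. The following is a minimal Python sketch, assuming the common `ACCESSION.VERSION` form (for example `NM_123456.2`). The function name `split_accession` is hypothetical, and the example values are placeholders rather than real records.

```python
from typing import Optional, Tuple


def split_accession(identifier: str) -> Tuple[str, Optional[int]]:
    """Split 'ACCESSION.VERSION' into (accession, version).

    Returns version=None when there is no numeric version suffix.
    """
    text = identifier.strip()
    accession, dot, version = text.rpartition(".")
    if accession and dot and version.isdigit():
        return accession, int(version)
    # No ".N" suffix: treat the whole string as the accession.
    return text, None


# Placeholder identifiers, for illustration only.
print(split_accession("NM_123456.2"))
print(split_accession("NM_123456"))
```

Keeping the version as a separate integer makes it easy to compare two revisions of the same record.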

SLIDE 8

Features

SLIDE 9

GenPept: a protein-centered database

SLIDE 10

FTP sites:

  • GenBank: ftp://ftp.ncbi.nih.gov/genbank/
  • GenPept: ftp://ftp.ncifcrf.gov/pub/genpept/
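
As a rough illustration of how such an FTP site could be browsed from a script, here is a minimal Python sketch using the standard `ftplib` module. It assumes the servers still accept anonymous logins at the URLs given on the slide, which may no longer be true. The helper names `split_ftp_url` and `list_ftp_directory` are hypothetical.

```python
from ftplib import FTP
from urllib.parse import urlparse

# URLs as given on the slide; these servers may have moved or been retired.
FTP_SITES = {
    "GenBank": "ftp://ftp.ncbi.nih.gov/genbank/",
    "GenPept": "ftp://ftp.ncifcrf.gov/pub/genpept/",
}


def split_ftp_url(url):
    """Return (host, directory) for an ftp:// URL."""
    parts = urlparse(url)
    if parts.scheme != "ftp":
        raise ValueError(f"not an FTP URL: {url}")
    return parts.hostname, parts.path


def list_ftp_directory(url, timeout=30):
    """List the names in the directory behind an ftp:// URL.

    Assumes anonymous login is allowed (an assumption, not verified here).
    """
    host, directory = split_ftp_url(url)
    with FTP(host, timeout=timeout) as ftp:
        ftp.login()          # anonymous login
        ftp.cwd(directory)
        return ftp.nlst()


if __name__ == "__main__":
    # Only the URL parsing runs here; no network connection is made.
    for name, url in FTP_SITES.items():
        print(name, split_ftp_url(url))
```

Calling `list_ftp_directory(FTP_SITES["GenBank"])` would attempt the actual connection.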

SLIDE 11

Problems with GenBank and GenPept

  • They do not distinguish between sequence categories.
  • There is a lot of redundancy.
  • The same gene can be deposited many times under different names.
  • Different versions of the same gene can be submitted many times under different accession numbers.
  • The features of a GenBank record can be chaotic.
SLIDE 12

Public Database - 1

NCBI Sequence Databases

Archival database (GenBank, GenPept)
vs. computer-algorithm-generated database (UniGene)
vs. curated database (RefSeq, LocusLink ...)

SLIDE 13

UniGene: a non-redundant set of gene-oriented clusters

Sources clustered into UniGene:

  • GenBank mRNAs
  • GenBank genomic CDSs
  • dbEST ESTs

SLIDE 14

UniGene identifier

Organism prefixes:

  • Hs: human
  • Mm: mouse
  • Rn: rat
  • Bt: cow
  • Dr: zebrafish
  • Dm: fruit fly
  • Aga: mosquito
  • Xl: frog
  • At: cress
  • Hv: barley
  • Os: rice
  • Ta: wheat
  • Zm: maize

Examples: Mm.213407, Hs.13303, At.138
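
The prefix scheme lends itself to a small lookup. Below is a minimal Python sketch that splits a UniGene identifier into its organism prefix and cluster number, using only the prefixes listed on this slide. The names `UNIGENE_ORGANISMS` and `parse_unigene_id` are hypothetical.

```python
# Organism prefixes as listed on the slide.
UNIGENE_ORGANISMS = {
    "Hs": "human", "Mm": "mouse", "Rn": "rat", "Bt": "cow",
    "Dr": "zebrafish", "Dm": "fruit fly", "Aga": "mosquito", "Xl": "frog",
    "At": "cress", "Hv": "barley", "Os": "rice", "Ta": "wheat", "Zm": "maize",
}


def parse_unigene_id(identifier):
    """Split 'Prefix.Number' into (prefix, organism, cluster number)."""
    prefix, dot, number = identifier.strip().partition(".")
    if not dot or not number.isdigit():
        raise ValueError(f"not a UniGene-style identifier: {identifier!r}")
    if prefix not in UNIGENE_ORGANISMS:
        raise ValueError(f"unknown organism prefix: {prefix!r}")
    return prefix, UNIGENE_ORGANISMS[prefix], int(number)


for example in ("Mm.213407", "Hs.13303", "At.138"):
    print(parse_unigene_id(example))
```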

SLIDE 15

Public Database - 1

NCBI Sequence Databases

Archival database (GenBank, GenPept)
vs. computer-generated database (UniGene)
vs. curated database (RefSeq, Gene ...)

SLIDE 16

NCBI human genome annotation pipeline

RefSeq incorporates predicted transcript and protein sequences, experimentally identified mRNA sequences, and EST sequences.

SLIDE 17

RefSeq Accession Numbers:

  • NT_123456: constructed genomic contigs
  • NM_123456: mRNAs
  • NP_123456: proteins
  • NC_123456: chromosomes
  • XM_123456: predicted mRNA
  • XP_123456: predicted protein
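
The two-letter prefix is enough to tell what kind of molecule a RefSeq accession describes. Here is a minimal Python sketch covering only the prefixes shown on this slide. The names `REFSEQ_PREFIXES` and `refseq_type` are hypothetical, and the accessions are the slide's placeholders.

```python
# Prefix meanings as listed on the slide; RefSeq defines other prefixes too.
REFSEQ_PREFIXES = {
    "NT": "constructed genomic contig",
    "NM": "mRNA",
    "NP": "protein",
    "NC": "chromosome",
    "XM": "predicted mRNA",
    "XP": "predicted protein",
}


def refseq_type(accession):
    """Return the record type for a RefSeq-style accession, or None."""
    prefix, underscore, _rest = accession.strip().partition("_")
    if not underscore:
        return None  # no underscore, so not a RefSeq-style accession
    return REFSEQ_PREFIXES.get(prefix.upper())


for acc in ("NM_123456", "XP_123456", "Hs.13303"):
    print(acc, "->", refseq_type(acc))
```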

SLIDE 18

RefSeq? UniGene? GenBank?

  • Genome sequence available: RefSeq (acc: NP_123456, etc.)
  • EST sequence available: UniGene (acc: Hs.13303, etc.)
  • GenBank (acc: AP33493, etc.)

SLIDE 19

Go to the web

SLIDE 20

Files that you can download from the NCBI Gene database:

  • gene_info
  • gene2refseq
  • gene2go
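
These downloads are plain text tables, so they can be read with a few lines of code. The following minimal Python sketch groups GO identifiers by gene from a gene2go-style file. It assumes a tab-delimited layout with a header line starting with `#` and columns named `GeneID` and `GO_ID`. That layout is an assumption to check against the real file, and the sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows in an ASSUMED gene2go-like layout (not real data).
SAMPLE_GENE2GO = (
    "#tax_id\tGeneID\tGO_ID\tCategory\n"
    "9999\t111\tGO:0003674\tFunction\n"
    "9999\t111\tGO:0008150\tProcess\n"
    "9999\t222\tGO:0005575\tComponent\n"
)


def go_ids_by_gene(handle):
    """Map each GeneID to the list of GO_IDs found in a tab-delimited file."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = [name.lstrip("#") for name in next(reader)]
    gene_col = header.index("GeneID")
    go_col = header.index("GO_ID")
    mapping = defaultdict(list)
    for row in reader:
        if not row:
            continue  # skip blank lines
        mapping[row[gene_col]].append(row[go_col])
    return dict(mapping)


print(go_ids_by_gene(io.StringIO(SAMPLE_GENE2GO)))
```

For a real download, pass an open file handle instead of the `io.StringIO` sample.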

SLIDE 21

NCBI search engine: Entrez

  • Boolean operators: "AND", "OR", "NOT"
  • Entrez tags
  • Using limits
  • MeSH terms

Batch Entrez: search by accession list
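
Boolean operators and field tags combine into a single query string. The sketch below, in Python, builds such a string and wraps it in an E-utilities `esearch` URL without sending any request. The helper names (`tagged`, `combine`, `esearch_url`) are hypothetical, and the gene and organism in the example are arbitrary illustrations.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def tagged(term, field):
    """Attach an Entrez field tag, e.g. BRCA1[Gene Name]."""
    return f"{term}[{field}]"


def combine(operator, *terms):
    """Join terms with an upper-case Boolean operator, in parentheses."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    return "(" + f" {operator} ".join(terms) + ")"


def esearch_url(database, query):
    """Build (but do not fetch) an esearch URL for the given query."""
    return ESEARCH_URL + "?" + urlencode({"db": database, "term": query})


query = combine(
    "AND",
    tagged('"Homo sapiens"', "Organism"),
    tagged("BRCA1", "Gene Name"),
)
print(query)
print(esearch_url("nucleotide", query))
```

Building the string in small pieces keeps the operator precedence explicit through the parentheses.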

SLIDE 22

Other Sequence Databases:

  • Genomic DNA: Ensembl genome annotation database (http://www.ensembl.org; HTTP, FTP, and MySQL interfaces)
  • Protein: UniProt (http://www.pir.uniprot.org/)

SLIDE 23

KEGG database: go to the web

SLIDE 24

Public Database - 2

GO: Gene Ontology

  1. Molecular Function
  2. Biological Process
  3. Cellular Component

http://www.geneontology.org

SLIDE 25

Public Database - 2

SLIDE 26

Public Database - 2

GO (3673)

  • Molecular Function: 3674
  • Biological Process: 8150
  • Cellular Component: 5575
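
The numbers on this slide are GO term identifiers with their leading zeros dropped. Full GO identifiers are written as `GO:` followed by seven digits. Here is a minimal Python sketch of the conversion in both directions. The names `GO_ROOTS`, `format_go_id`, and `parse_go_id` are hypothetical.

```python
# Root term numbers as shown on the slide.
GO_ROOTS = {
    "Molecular Function": 3674,
    "Biological Process": 8150,
    "Cellular Component": 5575,
}


def format_go_id(number):
    """Turn a bare number into a zero-padded GO identifier."""
    return f"GO:{number:07d}"


def parse_go_id(go_id):
    """Turn 'GO:0003674' back into the integer 3674."""
    prefix, colon, digits = go_id.strip().partition(":")
    if prefix != "GO" or not colon or len(digits) != 7 or not digits.isdigit():
        raise ValueError(f"not a GO identifier: {go_id!r}")
    return int(digits)


for name, number in GO_ROOTS.items():
    print(name, format_go_id(number))
```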

SLIDE 27

GO Example 1: Biological Process

SLIDE 28

GO Example 2: Molecular Function

SLIDE 29

Gene Ontology Annotation

Smn: survival motor neuron
Gene ID: 39844

SLIDE 30

Public Database - 4

Species Specific Databases

  • Arabidopsis – TAIR
  • Yeast – SGD
  • Fly – FlyBase
  • Worm – WormBase
  • Mouse – MGD