1
Bioinformatics Databases
Introduction to Bioinformatics
Dortmund, 16.-20.07.2007
Lectures: Sven Rahmann
Exercises: Udo Feldkamp, Michael Wurst
2
Overview
- Databases at NCBI (via Entrez)
- DNA
– GenBank, EMBL, DDBJ
– Data Format Issues
- UCSC Genome Browser
- Protein
– SwissProt, PIR, PDB
- Sequence Retrieval System at EBI
3
Fundamentals
- Accession number :=
– unique identifier for each entry (“record”) in a DB
– Example: PubMed ID [PMID]
– If you know the accession number, you obtain the record without searching
– Different databases can be linked via accession numbers
– Data integration: Hide the details (accession numbers) behind a convenient interface
4
Databases at NCBI (2007) http://www.ncbi.nlm.nih.gov/
5
Different Databases
- DNA
– nucleotide sequence
– gene
– transcript / gene expression
– genome
- Protein
– sequence and annotation
– structure
- ...
6
Different Databases
- Repositories of primary sequence data
– Everything related to a topic goes in here
– GenBank (NCBI Nucleotide): all nucleotide sequences
- Machine-curated annotation data
– automatically generated from primary data
– quality depends on primary data and method
- Manually curated annotation data
– reviewed by experts (SwissProt: Amos Bairoch)
– high quality, slow to grow
7
Integration
- “Meta Search Engines”
– Entrez at NCBI (U.S.)
– SRS at EBI (Europe)
- Value comes from linking databases
- Accession numbers provide unique identifiers
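The linking idea can be sketched with two in-memory dictionaries standing in for separate databases. Everything here is invented for illustration: the identifiers `NUC0001` and `PROT0001`, the field names, and the helper `linked_protein` do not correspond to real records or to the Entrez or SRS interfaces.

```python
# Toy stand-ins for two databases, each keyed by its own accession numbers.
# All identifiers and field values are hypothetical.
nucleotide_db = {
    "NUC0001": {"description": "example gene", "protein_xref": "PROT0001"},
}
protein_db = {
    "PROT0001": {"description": "example protein"},
}


def linked_protein(nucleotide_accession):
    """Follow the cross-reference from a nucleotide record to its protein record."""
    record = nucleotide_db[nucleotide_accession]
    xref = record.get("protein_xref")
    # The accession number stored in one database is the lookup key in the other.
    return protein_db.get(xref) if xref is not None else None


print(linked_protein("NUC0001")["description"])
```

A meta search engine hides this lookup chain: the user follows a link, and the accession numbers do the work behind the interface.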
8
Security
- Assume that everything you send over the internet can be intercepted.
- Don't send confidential data, patent data, etc.
- None of the public databases currently (2007) supports encryption.
9
Searching Entrez
10
Nucleotide Results
11
Core Nucleotide DB
12
DNA / Nucleotide DBs
- International Nucleotide Sequence Database Collaboration (INSDC)
– GenBank, EMBL, DDBJ: same content
– GenBank = NCBI Nucleotide
13
File Formats: GenBank
LOCUS       AAURRA        118 bp ss-rRNA            RNA       16-JUN-1986
DEFINITION  A.auricula-judae (mushroom) 5S ribosomal RNA.
ACCESSION   K03160
VERSION     K03160.1  GI:173593
KEYWORDS    5S ribosomal RNA; ribosomal RNA.
SOURCE      A.auricula-judae (mushroom) ribosomal RNA.
  ORGANISM  Auricularia auricula-judae
            Eukaryota; Fungi; Eumycota; Basidiomycotina;
            Phragmobasidiomycetes; Heterobasidiomycetidae; Auriculariales;
            Auriculariaceae.
REFERENCE   1  (bases 1 to 118)
  AUTHORS   Huysmans,E., Dams,E., Vandenberghe,A. and De Wachter,R.
  TITLE     The nucleotide sequences of the 5S rRNAs of four mushrooms and
            their use in studying the phylogenetic position of basidiomycetes
            among the eukaryotes
  JOURNAL   Nucleic Acids Res. 11, 2871-2880 (1983)
FEATURES             Location/Qualifiers
     rRNA            1..118
                     /note="5S ribosomal RNA"
BASE COUNT       27 a     34 c     34 g     23 t
ORIGIN      5' end of mature rRNA.
        1 atccacggcc ataggactct gaaagcactg catcccgtcc gatctgcaaa gttaaccaga
       61 gtaccgccca gttagtacca cggtggggga ccacgcggga atcctgggtg ctgtggtt
//
LOCUS       ABCRRAA       118 bp ss-rRNA            RNA       15-SEP-1990
...
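As a rough illustration of how the flat-file layout can be read by a program, the sketch below pulls the accession number and the sequence out of one record. It is a simplified reader written for this example, not a full GenBank parser: the function name `parse_genbank_record` is hypothetical, the record is a shortened copy of the entry above, and multi-record files, features and references are ignored.

```python
from collections import Counter

# Shortened copy of the K03160 record shown on the slide.
RECORD = """\
LOCUS       AAURRA        118 bp ss-rRNA            RNA       16-JUN-1986
ACCESSION   K03160
VERSION     K03160.1  GI:173593
BASE COUNT       27 a     34 c     34 g     23 t
ORIGIN      5' end of mature rRNA.
        1 atccacggcc ataggactct gaaagcactg catcccgtcc gatctgcaaa gttaaccaga
       61 gtaccgccca gttagtacca cggtggggga ccacgcggga atcctgggtg ctgtggtt
//
"""


def parse_genbank_record(text):
    """Return (accession, sequence) for the first record in a GenBank flat file."""
    accession = None
    blocks = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-record marker
            break
        if in_origin:
            # Sequence lines: a position number followed by blocks of 10 bases.
            blocks.extend(line.split()[1:])
        elif line.startswith("ACCESSION"):
            accession = line.split()[1]
        elif line.startswith("ORIGIN"):
            in_origin = True
    return accession, "".join(blocks)


accession, sequence = parse_genbank_record(RECORD)
base_counts = Counter(sequence)
print(accession, len(sequence), dict(base_counts))
```

Comparing `base_counts` with the BASE COUNT line is a cheap consistency check on a downloaded record.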
14
File Formats: FASTA
>gi|173593|gb|K03160.1|AAURRA Auricularia auricula-judae 5S ribosomal RNA
ATCCACGGCCATAGGACTCTGAAAGCACTGCATCCCGTCCGATCTGCAA
AGTTAACCAGAGTACCGCCCAGTTAGTACCACGGTGGGGGACCACGCG
GGAATCCTGGGTGCTGTGGTT
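FASTA is much simpler to read: a header line starting with `>`, followed by sequence lines that are concatenated. The sketch below is a minimal reader under that assumption. `parse_fasta` is a hypothetical helper, and splitting the header on `|` relies on the NCBI-style identifier layout seen in this particular example, which other sources need not follow.

```python
FASTA_TEXT = """\
>gi|173593|gb|K03160.1|AAURRA Auricularia auricula-judae 5S ribosomal RNA
ATCCACGGCCATAGGACTCTGAAAGCACTGCATCCCGTCCGATCTGCAA
AGTTAACCAGAGTACCGCCCAGTTAGTACCACGGTGGGGGACCACGCG
GGAATCCTGGGTGCTGTGGTT
"""


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


fasta_records = parse_fasta(FASTA_TEXT)
header, fasta_sequence = fasta_records[0]

# Assumption: NCBI-style identifier "gi|<gi>|gb|<accession.version>|<locus>".
identifier = header.split()[0].split("|")
print(identifier[3], len(fasta_sequence))
```

The sequence read this way has the same 118 bases as the GenBank entry, only in upper case and without position numbers.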
15
Sequence Retrieval System (SRS)
- URL: http://srs.ebi.ac.uk/
16
Selecting Libraries (DBs) to Search
17
Standard Query Form
18
UCSC Genome Browser
- Portal to ENCODE (Encyclopedia of DNA Elements): functional annotation of the human genome
19
Protein: UniProt / SwissProt
- URL: http://expasy.org/sprot/
– SwissProt: manually curated
– TrEMBL: annotated automatically
20
Protein Structure: (WW)PDB
- http://www.wwpdb.org/