MCB - 4/3/2004 EMBnet 2004: Proteomics using bioinformatic tools
Proteomics databases and protein characterization tools
Marie-Claude.Blatter@ISB-SIB.ch
Part I: Proteomics databases
CoDing Sequences provided by submitters
Data not submitted to public databases, delayed or cancelled…
CoDing Sequences provided by submitters and « de novo » gene prediction (XP_NNNNN)
Manually annotated
Scientific publications derived sequences
with or without annotated CDS
3D structures
PRF, PIR
with or without annotated CDS provided by authors
CDS (CoDing Sequence): portion of DNA/RNA translated into protein (from Met to STOP)
Human/Mouse/Rat: Organisms with the highest redundancy !
1980: 80 genes fully sequenced !
(exception: TPA, since January 2003)
ID   HSERPG     standard; genomic DNA; HUM; 3398 BP.
XX
AC   X02158;
XX
SV   X02158.1
XX
DT   13-JUN-1985 (Rel. 06, Created)
DT   22-JUN-1993 (Rel. 36, Last updated, Version 2)
XX
DE   Human gene for erythropoietin
XX
KW   erythropoietin; glycoprotein hormone; hormone; signal peptide.
XX
OS   Homo sapiens (human)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Primates; Catarrhini; Hominidae; Homo.
XX
RN   [1]
RP   1-3398
RX   MEDLINE; 85137899.
RA   Jacobs K., Shoemaker C., Rudersdorf R., Neill S.D., Kaufman R.J.,
RA   Mufson A., Seehra J., Jones S.S., Hewick R., Fritsch E.F., Kawakita M.,
RA   Shimizu T., Miyake T.;
RT   Isolation and characterization of genomic and cDNA clones of human
RT   erythropoietin;
RL   Nature 313:806-810(1985).
XX
DR   GDB; 119110; EPO.
DR   GDB; 119615; TIMP1.
DR   Swiss-Prot; P01588; EPO_HUMAN.
XX
…
Callouts on the entry: keyword, taxonomy, references, cross-references
Molecule type: DNA (genomic) or RNA
CC   Data kindly reviewed (24-FEB-1986) by K. Jacobs
FH   Key             Location/Qualifiers
FH
FT   source          1..3398
FT                   /db_xref=taxon:9606
FT                   /organism=Homo sapiens
FT   mRNA            join(397..627,1194..1339,1596..1682,2294..2473,2608..3327)
FT   CDS             join(615..627,1194..1339,1596..1682,2294..2473,2608..2763)
FT                   /db_xref=SWISS-PROT:P01588
FT                   /product=erythropoietin
FT                   /protein_id=CAA26095.1
FT                   /translation=MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLQRYLLE
FT                   AKEAENITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEAVLRG
FT                   QALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPDAASAAPLRTITAD
FT                   TFRKLFRVYSNFLRGKLKLYTGEACRTGDR
FT   mat_peptide     join(1262..1339,1596..1682,2294..2473,2608..2763)
FT                   /product=erythropoietin
FT   sig_peptide     join(615..627,1194..1261)
FT   exon            397..627
FT                   /number=1
FT   intron          628..1193
FT                   /number=1
FT   exon            1194..1339
FT                   /number=2
FT   intron          1340..1595
FT                   /number=2
FT   exon            1596..1682
FT                   /number=3
FT   intron          1683..2293
FT                   /number=3
FT   exon            2294..2473
FT                   /number=4
FT   intron          2474..2607
FT                   /number=4
FT   exon            2608..3327
FT                   /note=3' untranslated region
FT                   /number=5
XX
SQ   Sequence 3398 BP; 698 A; 1034 C; 991 G; 675 T; 0 other;
     agcttctggg cttccagacc cagctacttt gcggaactca gcaacccagg catctctgag        60
     tctccgccca agaccgggat gccccccagg aggtgtccgg gagcccagcc tttcccagat       120
Annotation (prediction or experimentally determined)
sequence
CDS: CoDing Sequence (proposed by submitters)
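The CDS location in the erythropoietin entry above is a `join(...)` of exon segments. As a minimal sketch (the function names `parse_location` and `cds_length` are illustrative, not part of any EMBL tool), the segments can be summed to check that the annotated CDS is a whole number of codons. The sketch only handles plain `start..end` segments, optionally wrapped in `join(...)` or `complement(...)`; partial-location markers such as `<` and `>` are not handled.

```python
import re


def parse_location(location: str):
    """Return (segments, is_complement) for an EMBL-style feature location."""
    is_complement = location.startswith("complement(")
    # Each segment is written start..end, 1-based and inclusive.
    segments = [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]
    return segments, is_complement


def cds_length(location: str) -> int:
    """Total number of bases covered by all segments of the location."""
    segments, _ = parse_location(location)
    return sum(end - start + 1 for start, end in segments)


# CDS location copied from the HSERPG (X02158) entry shown above.
epo_cds = "join(615..627,1194..1339,1596..1682,2294..2473,2608..2763)"
length = cds_length(epo_cds)
print(length, "bases;", length // 3 - 1, "residues (stop codon excluded)")
```

For the entry shown, the segments add up to 582 bases, that is 194 codons including the stop codon, which corresponds to a 193-residue precursor. The same sum applied to the `sig_peptide` location gives 81 bases (27 residues).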
FT   CDS             complement(45959..47332)
FT                   /db_xref="SPTREMBL:Q9UZ71"
FT                   /note="PAB2386"
FT                   /transl_table=11
FT                   /product="4-AMINOBUTYRATE AMINOTRANSFERASE
FT                   (EC 2.6.1.19)"
FT                   /protein_id="CAB50188.1"
FT                   /translation="MDYPRIVVNPPGPKAKELIEREKRVLSTGIGVKLFPLVPKRGFGP
FT                   FIEDVDGNVFIDFLAGAAAASTGYSHPKLVKAVKEQVELIQHSMIGYTHSERAIRVAEK
FT                   LVKISPIKNSKVLFGLSGSDAVDMAIKVSKFSTRRPWILAFIGAYHGQTLGATSVASFQ
FT                   VSQKRGYSPLMPNVFWVPYPNPYRNPWGINGYEEPQELVNRVVEYLEDYVFSHVVPPDE
FT                   VAAFFAEPIQGDAGIVVPPENFFKELKKLLDEHGILLVMDEVQTGIGRTGKWFASEWFE
FT                   VKPDMIIFGKGVASGMGLSGVIGREDIMDITSGSALLTPAANPVISAAADATLEIIEEE
FT                   NLLKNAIEVGSFIMKRLNELKEQFDIIGDVRGKGLMIGVEIVKENGRPDPEMTGKICWR
FT                   AFELGLILPSYGMFGNVIRITPPLVLTKEVAEKGLEIIEKAIKDAIAGKVERKVVTWH"
(integration of the PIR data)
* submitted to the public databases (EMBL, GenBank, DDBJ, SWISS-PROT)
Annotation of conflicts
Once an entry is in Swiss-Prot, it is no longer in TrEMBL
**is going to disappear soon !
Swiss-Prot: since 1986; TrEMBL: since 1996
Search also with accession numbers (Swiss-Prot or other databases)
View « by default » on the ExPASy server
sequence
Sequencing errors? Polymorphisms? Alternative splicing? Alternative initiation? Usage of an alternative promoter? RNA editing?
Selenocysteine? Fragment?
Same gene ?
Entry sections: Accession Nr., Identifier, Protein name, Gene name, Taxonomy, References, Comments, sequence
ICOL_HUMAN, O75144
« - »: experimentally proved
« By similarity »: experimentally proved in an ortholog or in another member of the family
« Probable »: not proved, but realistic
« Potential »: predicted (bioinformatic tools)
BRH2_HUMAN, Q9NY43 AAA1_HUMAN, Q9NS82
Entry sections: Accession Nr., Identifier, Protein name, Gene name, Taxonomy, References, Comments, Cross-references, sequence
Gasteiger et al., Curr. Issues Mol. Biol. (2001), 3(3): 47-55
Domains, sites, families: HAMAP, InterPro, PIRSF, Pfam, PRINTS, ProDom, PROSITE, SMART, TIGRFAMs
Nucleotide sequences: EMBL
Structure: HSSP, PDB
Organism-specific: dbSNP, DictyDb, EcoGene, FlyBase, GeneDB_SPombe, Genew, GK, Gramene, HIV, Leproma, ListiList, MaizeDB, MGD, MypuList, OMIM, SagaList, SGD, StyGene, SubtiList, TIGR, TubercuList, WormPep, ZFIN
Miscellaneous: GermOnline, GO, MEROPS, PIR, REBASE, TRANSFAC
2D-gel electrophoresis: ANU-2DPAGE, Aarhus/Ghent-2DPAGE, COMPLUYEAST-2DPAGE, ECO2DBASE, HSC-2DPAGE, MAIZE-2DPAGE, PHCI-2DPAGE, PMMA-2DPAGE, Siena-2DPAGE, SWISS-2DPAGE
PTM: GlycoSuiteDB, PhosSite
(explicit links)
Swiss-Prot currently acts as the main index for the 15 federated 2D-PAGE databases.
ICE8_HUMAN Q14790
Examples of implicit links to GenBank and DDBJ added ‘on the fly’ by the ExPASy server
2D-PAGE
143S_HUMAN, P31947
Theoretically computed pI and MW
Theoretically computed pI and MW with potential phosphorylation and acetylation sites
Experimentally determined position
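To make the idea of a theoretically computed MW concrete, here is a minimal sketch that sums average residue masses and adds one water molecule, with optional mass shifts for phosphorylation and acetylation. The mass table holds commonly tabulated average residue masses and the function name `theoretical_mw` is an assumption of this sketch; it is not the implementation used by any ExPASy tool, and it does not compute pI.

```python
# Average residue masses in daltons (commonly tabulated values; an assumption
# of this sketch, not data taken from the slides).
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528   # one water per chain (free N- and C-termini)
PHOSPHO_SHIFT = 79.98   # average mass added by one phosphate group (HPO3)
ACETYL_SHIFT = 42.04    # average mass added by one acetyl group (C2H2O)


def theoretical_mw(sequence: str, n_phospho: int = 0, n_acetyl: int = 0) -> float:
    """Average molecular weight of an unmodified or simply modified chain."""
    mass = sum(AVERAGE_RESIDUE_MASS[aa] for aa in sequence.upper()) + WATER_MASS
    return mass + n_phospho * PHOSPHO_SHIFT + n_acetyl * ACETYL_SHIFT


# Hypothetical short peptide, used only to show the call.
peptide = "MGVHECPAW"
print(round(theoretical_mw(peptide), 2))
print(round(theoretical_mw(peptide, n_phospho=1, n_acetyl=1), 2))
```

Each potential modification site shifts the computed mass, which is why a spot on a 2D gel may sit away from the position predicted for the unmodified sequence.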
Entry sections: Accession Nr., Identifier, Protein name, Gene name, Taxonomy, References, Comments, Cross-references, Keywords, sequence
Q9HC96 Calpain 10
n=481 entries
Entry sections: Accession Nr., Identifier, Protein name, Gene name, Taxonomy, References, Comments, Cross-references, Keywords, Feature table, sequence
ICOL_HUMAN, O75144
General topology
Domains
PTM
Alternative splicing
BRC2_HUMAN, P51587
Polymorphisms
Differences between the sequence shown and other submitted sequences
« new » genome sequencing -> « new » proteins (AB, sept 2002)
Are missing:
the submitters» or no DNA sequence)
swiss-prot@expasy.org
Three collections of sequence clusters from the UniProt knowledgebase (Swiss-Prot, TrEMBL):
One UniRef100 entry -> all identical sequences (including fragments)
One UniRef90 entry -> sequences that are at least 90 % identical
One UniRef50 entry -> sequences that are at least 50 % identical
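The UniRef100 idea (one entry for all identical sequences, fragments included) can be illustrated with a toy sketch. It assumes that "fragment" simply means an exact substring of a longer sequence, and the function name and identifiers are hypothetical; the actual UniRef procedure is not described in these slides. UniRef90 and UniRef50 would additionally need alignment-based identity scores, which are not shown here.

```python
def cluster_identical(sequences: dict) -> list:
    """Group sequences that are identical to, or exact fragments of, a longer one.

    `sequences` maps an identifier to its amino acid sequence.
    Returns a list of clusters, each a list of identifiers.
    """
    clusters = {}  # representative sequence -> member identifiers
    # Longest sequences first, so that fragments meet their parent sequence.
    ordered = sorted(sequences.items(), key=lambda item: (-len(item[1]), item[0]))
    for seq_id, seq in ordered:
        for representative in clusters:
            if seq in representative:
                clusters[representative].append(seq_id)
                break
        else:
            clusters[seq] = [seq_id]
    return list(clusters.values())


# Hypothetical identifiers and sequences, for illustration only.
toy = {"A1": "MKTAYIAK", "A2": "MKTAYIAK", "A3": "TAYI", "B1": "MSSQQ"}
print(cluster_identical(toy))
```

In this toy input, `A1` and `A2` are identical and `A3` is a fragment of both, so the three end up in one cluster while `B1` stays alone.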
UniRef100 does not include TrEMBLnew (tn), because TrEMBLnew is going to « disappear » soon
http://www.pir.uniprot.org/cgi-bin/textSearch_AR
Use with extreme caution: also contains pseudogenes, incorrect CDS predictions, etc.! Also patent office database data (EPO, ESPO…).
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein
~TrEMBL, except that it is redundant with Swiss-Prot
All PIR data have been integrated into Swiss-Prot and TrEMBL (UniProt)
3D structure database: all the protein sequences which have been crystallized (Swiss-Prot/TrEMBL are cross-linked to PDB)
Scientific publications derived sequences (integrated into TrEMBL)
http://www.ncbi.nlm.nih.gov/RefSeq/
The sequence data are tightly linked to LocusLink which contains the associated biological information (« interdependent curated resources »)
(!!! 1 entry = 1 sequence ….)
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein
AMBN
Swiss-Prot, GenPept, RefSeq
Corresponds to Swiss-Prot entry AMBN_HUMAN, Q9NP70
KW, AC, Taxonomy, References
GenBank source used for the construction of the RefSeq entry
Description of the sequence differences
Annotation
DR EMBL; AC009488; AAP22357.1; -.
(RCSB) (USA).
protein-nucleic acid complexes, and viruses. Proteins represent more than 90% of available structures
structure has been obtained by X-ray or NMR studies
3D structure (e.g., SwissPDB-viewer, Chime, Rasmol)).
molecules, but far fewer protein families (highly redundant)!
HEADER LYASE(OXO-ACID) 01-OCT-91 12CA 12CA 2
COMPND CARBONIC ANHYDRASE /II (CARBONATE DEHYDRATASE) (/HCA II) 12CA 3
COMPND 2 (E.C.4.2.1.1) MUTANT WITH VAL 121 REPLACED BY ALA (/V121A) 12CA 4
SOURCE HUMAN (HOMO SAPIENS) RECOMBINANT PROTEIN 12CA 5
AUTHOR S.K.NAIR,D.W.CHRISTIANSON 12CA 6
REVDAT 1 15-OCT-92 12CA 0 12CA 7
JRNL AUTH S.K.NAIR,T.L.CALDERONE,D.W.CHRISTIANSON,C.A.FIERKE 12CA 8
JRNL TITL ALTERING THE MOUTH OF A HYDROPHOBIC POCKET. 12CA 9
JRNL TITL 2 STRUCTURE AND KINETICS OF HUMAN CARBONIC ANHYDRASE 12CA 10
JRNL TITL 3 /II$ MUTANTS AT RESIDUE VAL-121 12CA 11
JRNL REF J.BIOL.CHEM. V. 266 17320 1991 12CA 12
JRNL REFN ASTM JBCHA3 US ISSN 0021-9258 071 12CA 13
REMARK 1 12CA 14
REMARK 2 12CA 15
REMARK 2 RESOLUTION. 2.4 ANGSTROMS. 12CA 16
REMARK 3 12CA 17
REMARK 3 REFINEMENT. 12CA 18
REMARK 3 PROGRAM PROLSQ 12CA 19
REMARK 3 AUTHORS HENDRICKSON,KONNERT 12CA 20
REMARK 3 R VALUE 0.170 12CA 21
REMARK 3 RMSD BOND DISTANCES 0.011 ANGSTROMS 12CA 22
REMARK 3 RMSD BOND ANGLES 1.3 DEGREES 12CA 23
REMARK 4 12CA 24
REMARK 4 N-TERMINAL RESIDUES SER 2, HIS 3, HIS 4 AND C-TERMINAL 12CA 25
REMARK 4 RESIDUE LYS 260 WERE NOT LOCATED IN THE DENSITY MAPS AND, 12CA 26
REMARK 4 THEREFORE, NO COORDINATES ARE INCLUDED FOR THESE RESIDUES. 12CA 27
………
SHEET 3 S10 PHE 66 PHE 70 -1 O ASN 67 N LEU 60 12CA 68
SHEET 4 S10 TYR 88 TRP 97 -1 O PHE 93 N VAL 68 12CA 69
SHEET 5 S10 ALA 116 ASN 124 -1 O HIS 119 N HIS 94 12CA 70
SHEET 6 S10 LEU 141 VAL 150 -1 O LEU 144 N LEU 120 12CA 71
SHEET 7 S10 VAL 207 LEU 212 1 O ILE 210 N GLY 145 12CA 72
SHEET 8 S10 TYR 191 GLY 196 -1 O TRP 192 N VAL 211 12CA 73
SHEET 9 S10 LYS 257 ALA 258 -1 O LYS 257 N THR 193 12CA 74
SHEET 10 S10 LYS 39 TYR 40 1 O LYS 39 N ALA 258 12CA 75
TURN 1 T1 GLN 28 VAL 31 TYPE VIB (CIS-PRO 30) 12CA 76
TURN 2 T2 GLY 81 LEU 84 TYPE II(PRIME) (GLY 82) 12CA 77
TURN 3 T3 ALA 134 GLN 137 TYPE I (GLN 136) 12CA 78
TURN 4 T4 GLN 137 GLY 140 TYPE I (ASP 139) 12CA 79
TURN 5 T5 THR 200 LEU 203 TYPE VIA (CIS-PRO 202) 12CA 80
TURN 6 T6 GLY 233 GLU 236 TYPE II (GLY 235) 12CA 81
CRYST1 42.700 41.700 73.000 90.00 104.60 90.00 P 21 2 12CA 82
ORIGX1 1.000000 0.000000 0.000000 0.00000 12CA 83
ORIGX2 0.000000 1.000000 0.000000 0.00000 12CA 84
ORIGX3 0.000000 0.000000 1.000000 0.00000 12CA 85
SCALE1 0.023419 0.000000 0.006100 0.00000 12CA 86
SCALE2 0.000000 0.023981 0.000000 0.00000 12CA 87
SCALE3 0.000000 0.000000 0.014156 0.00000 12CA 88
ATOM 1 N TRP 5 8.519 -0.751 10.738 1.00 13.37 12CA 89
ATOM 2 CA TRP 5 7.743 -1.668 11.585 1.00 13.42 12CA 90
ATOM 3 C TRP 5 6.786 -2.502 10.667 1.00 13.47 12CA 91
ATOM 4 O TRP 5 6.422 -2.085 9.607 1.00 13.57 12CA 92
ATOM 5 CB TRP 5 6.997 -0.917 12.645 1.00 13.34 12CA 93
ATOM 6 CG TRP 5 5.784 -0.209 12.221 1.00 13.40 12CA 94
ATOM 7 CD1 TRP 5 5.681 1.084 11.797 1.00 13.29 12CA 95
ATOM 8 CD2 TRP 5 4.417 -0.667 12.221 1.00 13.34 12CA 96
ATOM 9 NE1 TRP 5 4.388 1.418 11.515 1.00 13.30 12CA 97
ATOM 10 CE2 TRP 5 3.588 0.375 11.797 1.00 13.35 12CA 98
ATOM 11 CE3 TRP 5 3.837 -1.877 12.645 1.00 13.39 12CA 99
ATOM 12 CZ2 TRP 5 2.216 0.208 11.656 1.00 13.39 12CA 100
ATOM 13 CZ3 TRP 5 2.465 -2.043 12.504 1.00 13.33 12CA 101
ATOM 14 CH2 TRP 5 1.654 -1.001 12.009 1.00 13.34 12CA 102
…….
They are all derived from PDB data!
structures in 3D.
* OWL has been obsolete since 1999
http://www.matrixscience.com/help/database_help.html
MSDB: “non-identical protein sequence database”
Contains sequences derived from:
PIR (now integrated into UniProt (Swiss-Prot/TrEMBL))
TrEMBL
REMTrEMBL (does not exist anymore, see UniParc)
GenBank
Swiss-Prot
NRL3D (PDB derived sequences)
Type of record: Sample accession format
GenBank/EMBL/DDBJ: one letter followed by five digits, e.g. U12345; two letters followed by six digits, e.g. AF123456
Swiss-Prot/TrEMBL: one letter (O, P, Q) and five digits/letters, e.g. P12345
RefSeq nucleotide: two letters, underscore and six digits, e.g. mRNA NM_000492, genomic NT_000907
RefSeq protein: e.g. NP_00483
RefSeq prediction: e.g. XM_000483, XP_000467
PDB (protein structure): one digit followed by three letters or digits, e.g. 1TUP
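The accession formats listed above can be turned into a small classifier. This is a sketch under stated assumptions: the regular expressions are a loose reading of the formats on the slide, the function name `classify_accession` is hypothetical, and the order of the checks matters because an accession such as P12345 fits both the Swiss-Prot/TrEMBL rule and the "one letter followed by five digits" nucleotide rule (Swiss-Prot/TrEMBL is tried first here).

```python
import re

# Loose patterns derived from the formats listed above (assumptions of this sketch).
ACCESSION_PATTERNS = [
    ("RefSeq", re.compile(r"^[A-Z]{2}_\d+$")),
    ("Swiss-Prot/TrEMBL", re.compile(r"^[OPQ][A-Z0-9]{5}$")),
    ("GenBank/EMBL/DDBJ", re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6})$")),
    ("PDB", re.compile(r"^\d[A-Z0-9]{3}$")),
]


def classify_accession(accession: str) -> str:
    """Return the first record type whose pattern matches, or 'unknown'."""
    acc = accession.strip().upper()
    for record_type, pattern in ACCESSION_PATTERNS:
        if pattern.match(acc):
            return record_type
    return "unknown"


for example in ["1TUP", "NM_000492", "XP_000467", "P12345", "U12345", "AF123456"]:
    print(example, "->", classify_accession(example))
```

A format check like this only suggests where to look; whether an identifier actually exists can only be confirmed by querying the corresponding database.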
Most proteins have « modular » structures
Estimation: ~ 3 domains / protein
Domains not only share a common structure but
– Pattern (regular expression); used for very conserved domains
– Profiles (weighted matrices): two-dimensional tables of position-specific match, gap, and insertion scores, derived from aligned sequence families; used for less conserved domains
– Hidden Markov Model (HMM); probabilistic models; an
[LIVM]-[ST]-A-[STAG]-H-C
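Since a pattern is a regular expression, the PROSITE notation above maps almost directly onto the regular expressions of a programming language. The following sketch translates the common PROSITE elements (`x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, `(n)` or `(n,m)` for repeats, `<` and `>` for the sequence ends). The function name `prosite_to_regex` and the test sequence are illustrative assumptions, and rarer constructs such as `>` inside square brackets are not handled.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE pattern into a Python regular expression string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        anchor_start = element.startswith("<")
        anchor_end = element.endswith(">")
        element = element.strip("<>")
        # Optional repetition suffix, e.g. x(2) or x(2,4).
        repeat = ""
        match = re.fullmatch(r"(.+?)\((\d+(?:,\d+)?)\)", element)
        if match:
            element, repeat = match.group(1), "{" + match.group(2) + "}"
        if element == "x":
            core = "."                            # any residue
        elif element.startswith("{") and element.endswith("}"):
            core = "[^" + element[1:-1] + "]"     # excluded residues
        else:
            core = element                        # single residue or [ABC] class
        parts.append(("^" if anchor_start else "") + core + repeat + ("$" if anchor_end else ""))
    return "".join(parts)


regex = prosite_to_regex("[LIVM]-[ST]-A-[STAG]-H-C")
hit = re.search(regex, "AAVTAAHCAA")  # hypothetical sequence
print(regex, hit.start() if hit else None)
```

A pattern either matches or it does not, which is why it suits only very conserved regions; profiles and HMMs give graded scores instead.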
ID   TRYPSIN_DOM; MATRIX.
AC   PS50240;
DT   DEC-2001 (CREATED); DEC-2001 (DATA UPDATE); DEC-2001 (INFO UPDATE).
DE   Serine proteases, trypsin domain profile.
MA   /GENERAL_SPEC: ALPHABET='ABCDEFGHIKLMNPQRSTVWYZ'; LENGTH=234;
MA   /DISJOINT: DEFINITION=PROTECT; N1=6; N2=229;
MA   /NORMALIZATION: MODE=1; FUNCTION=LINEAR; R1=0.0169; R2=0.00836256; TEXT='-LogE';
MA   /CUT_OFF: LEVEL=0; SCORE=1134; N_SCORE=9.5; MODE=1; TEXT='!';
MA   /CUT_OFF: LEVEL=-1; SCORE=775; N_SCORE=6.5; MODE=1; TEXT='?';
MA   /DEFAULT: M0=-9; D=-20; I=-20; B1=-60; E1=-60; MI=-105; MD=-105; IM=-105; DM=-105;
MA   /I: B1=0; BI=-105; BD=-105;
MA   A B D E F G H I K L M N P Q R S T V W Y
MA   /M: SY='I'; M= -8,-29,-34,-26, 3,-34,-24, 34,-26, 19, 15,-24,-21,-21,-24,-19, -8, 25,-19, 3;
MA   /M: SY='N'; M= 0, 14, 10, 1,-22, -1, 6,-23, -4,-26,-17, 20,-14, -1, -6, 13, 2,-20,-34,-15;
MA   /M: SY='E'; M= -4, 4, 7, 14,-26,-13, -7,-23, 3,-22,-16, 2, 7, 3, -3, 2, -2,-21,-30,-18;
MA   /M: SY='R'; M=-12, 5, 5, 7,-23,-17, 3,-24, 8,-20,-12, 7,-16, 10, 12, -2, -6,-21,-27, -9;
MA   /M: SY='W'; M=-16,-33,-35,-27, 13,-22,-24,-11,-18,-13,-13,-31,-27,-20,-18,-30,-21,-18, 97, 25;
MA   /M: SY='V'; M= 1,-29,-31,-28, -1,-30,-29, 31,-22, 13, 11,-27,-27,-26,-22,-12, -2, 41,-27, -8;
MA   /M: SY='L'; M= -8,-29,-31,-22, 9,-30,-21, 23,-27, 37, 20,-28,-28,-21,-20,-25, -8, 17,-20, -1;
MA   /M: SY='T'; M= 2, -1, -9, -9,-11,-17,-19,-10,-10,-13,-11, 1,-11, -9,-10, 23, 43, 0,-32,-12;
MA   /M: SY='A'; M= 45, -9,-19,-10,-20, -2,-15,-11,-10,-11,-10, -9,-11, -9,-19, 10, 1, -1,-21,-18;
MA   /M: SY='A'; M= 40, -9,-17, -8,-21, 5,-18,-14, -9,-13,-12, -8,-11, -9,-16, 9, -2, -5,-21,-21;
MA   /M: SY='H'; M=-18, 0, 0, 1,-21,-19, 89,-29, -8,-21, -1, 9,-19, 11, 0, -7,-17,-29,-30, 16;
MA   /M: SY='C'; M= -9,-18,-28,-29,-20,-29,-29,-29,-29,-20,-19,-18,-39,-29,-29, -9, -9, -9,-49,-29;
MA   /I: E1=0; IE=-105; DE=-105;
//
with a specific signal
~ 25’000 human genes
alternative splicing: 2-5 fold increase
~ 100’000 human transcripts
post-translational modifications of proteins (PTMs): 5-10 fold increase
~ 1'000'000 human proteins
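The orders of magnitude above follow from simple multiplication. The short sketch below, with the hypothetical helper `fold_range`, applies the quoted fold increases to the quoted counts; the input figures are the rough estimates from the slide, not measured values.

```python
def fold_range(count: int, fold_low: int, fold_high: int) -> tuple:
    """Lower and upper estimate after applying a fold-increase range."""
    return (count * fold_low, count * fold_high)


genes = 25_000
# Alternative splicing: 2-5 fold increase over the number of genes.
transcripts = fold_range(genes, 2, 5)
# PTMs: 5-10 fold increase over the ~100,000 transcripts quoted above.
proteins = fold_range(100_000, 5, 10)

print("transcripts:", transcripts)
print("proteins:", proteins)
```

The ~100,000 transcripts quoted on the slide sit inside the first range, and the ~1,000,000 proteins correspond to the upper end of the second.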
Am: Amidation
AcN: Acetylation N-terminal
AcI: Acetylation internal
Alk: Alkylation
Adp: ADP-ribosylation
Bio: Biotinylation
Bro: Bromination
Cgly: C-linked glycosylation
Ogly: O-linked glycosylation
Ngly: N-linked glycosylation
Dea: Deamidation
Sul: Sulfation
Far: Farnesylation
Ger: Geranylgeranylation
GPI: GPI-anchoring
Met: Methylation
Myr: Myristoylation
Hyd: Hydroxylation
Pho: Phosphorylation
Pal: Palmitoylation
Pyr: Pyrrolidone carboxylic acid
Oxo: 2-amino-3-oxo-propionic acid
Figure labels: Sul, Pho, Ogly, Myr, GPI, Ngly
http://www-nbrf.georgetown.edu/pirwww/dbinfo/resid.html
descriptive, chemical, structural and bibliographic information.
http://www.expasy.org/tools/findmod/findmod_masses.html
222 401 422 4963 135 563 38828 12959 total 1458 324 37046 N-GlcNAc 41 101 119 1001 48 451 4056 Exp. 79 8 O-GlcNAc 73 240 136 2791 83 1446 By sim. Pot./prob. 108 GPI-anchor 60 myristate 167 sulfation 1162 phosphorylation 29 O-GalNAc 7457 signal peptide all organisms Number of PTMs in Swiss-Prot release 40
[Bar chart: total vs. proven counts (axis 0 to 5000) for acetyl, phosphate, methyl, sulfate]