Proteomics databases and protein characterization tools



slide-1
SLIDE 1

MCB - 4/3/2004 EMBnet 2004: Proteomics using bioinformatic tools

Proteomics databases and protein characterization tools

Marie-Claude.Blatter@ISB-SIB.ch


Part I Proteomics databases

slide-2
SLIDE 2


Proteomics databases

  • 1. Sequence databases:

« The story of a protein sequence’s life »

  • 2. Swiss-Prot: a quick overview
  • 3. UniProt utilities: UniRef and UniParc
  • 4. Swiss-Prot … and the other protein databases

Where do protein sequences come from? How reliable are they? What do you have to watch out for?

slide-3
SLIDE 3

Real life of a protein sequence …

[Diagram: how sequence data flow between the databases]

  • cDNAs, ESTs, genomes, … -> EMBL, GenBank, DDBJ (with or without annotated CDS)
  • Data not submitted to public databases, delayed or cancelled…
  • CoDing Sequences provided by submitters -> TrEMBL, GenPept
  • Swiss-Prot: manually annotated
  • CoDing Sequences provided by submitters and « de novo » gene prediction -> RefSeq (XP_NNNNN)
  • Sequences derived from scientific publications -> PRF, PIR
  • 3D structures -> PDB
  • UniProt: Swiss-Prot + TrEMBL + (PIR)
  • NCBI-nr: Swiss-Prot + GenPept + (PIR) + RefSeq + PDB + PRF


Let’s start at the very beginning…

slide-4
SLIDE 4

Real life of a protein sequence …

cDNAs, ESTs, genomes, … EMBL, GenBank, DDBJ

Data not submitted to public databases, delayed or cancelled…

with or without annotated CDS provided by authors

CDS (CoDing Sequence): portion of DNA/RNA translated into protein (from Met to STOP)


EMBL/GenBank/DDBJ

  • The 3 main public nucleic acid sequence databases are EMBL (EBI), GenBank (NCBI) and DDBJ (Japan): « different views of the same data set », synchronized within 2-3 days
  • Contribution: EMBL 10 %; GenBank 73 %; DDBJ 17 %
  • EMBL: since 1982
slide-5
SLIDE 5


  • Serve as archives
  • Contain all public sequences derived from:

      – Genome projects (> 80 % of entries)
      – Sequencing centers (cDNAs, ESTs…)
      – Individual scientists (15 % of entries)
      – Patent offices (e.g. European Patent Office, EPO)

  • Currently: 30 × 10^6 sequences, ~36 × 10^9 bp;
  • Sequences from > 50’000 different species;

EMBL/GenBank/DDBJ

Human/Mouse/Rat: Organisms with the highest redundancy !

The tremendous increase in nucleotide sequences

1980: 80 genes fully sequenced !

[Chart legend: Human, Rat, Mouse, Other]

slide-6
SLIDE 6


EMBL/GenBank/DDBJ

A sort of sequence museum, where sequences are preserved for eternity as they were originally determined, interpreted and published by their authors (primary sequence repository).

The authors have full authority over the content of the entries they submit!

(exception: TPA, Third Party Annotation, since January 2003)

an EMBL entry

ID   HSERPG     standard; genomic DNA; HUM; 3398 BP.
XX
AC   X02158;
XX
SV   X02158.1
XX
DT   13-JUN-1985 (Rel. 06, Created)
DT   22-JUN-1993 (Rel. 36, Last updated, Version 2)
XX
DE   Human gene for erythropoietin
XX
KW   erythropoietin; glycoprotein hormone; hormone; signal peptide.
XX
OS   Homo sapiens (human)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Primates; Catarrhini; Hominidae; Homo.
XX
RN   [1]
RP   1-3398
RX   MEDLINE; 85137899.
RA   Jacobs K., Shoemaker C., Rudersdorf R., Neill S.D., Kaufman R.J.,
RA   Mufson A., Seehra J., Jones S.S., Hewick R., Fritsch E.F., Kawakita M.,
RA   Shimizu T., Miyake T.;
RT   Isolation and characterization of genomic and cDNA clones of human
RT   erythropoietin;
RL   Nature 313:806-810(1985).
XX
DR   GDB; 119110; EPO.
DR   GDB; 119615; TIMP1.
DR   Swiss-Prot; P01588; EPO_HUMAN.
XX
…

Slide callouts: keyword (KW), taxonomy (OS/OC), references (RN to RL), cross-references (DR); molecule type in the ID line: DNA (genomic) or RNA

slide-7
SLIDE 7

CC   Data kindly reviewed (24-FEB-1986) by K. Jacobs
FH   Key             Location/Qualifiers
FH
FT   source          1..3398
FT                   /db_xref=taxon:9606
FT                   /organism=Homo sapiens
FT   mRNA            join(397..627,1194..1339,1596..1682,2294..2473,2608..3327)
FT   CDS             join(615..627,1194..1339,1596..1682,2294..2473,2608..2763)
FT                   /db_xref=SWISS-PROT:P01588
FT                   /product=erythropoietin
FT                   /protein_id=CAA26095.1
FT                   /translation=MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLQRYLLE
FT                   AKEAENITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEAVLRG
FT                   QALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPDAASAAPLRTITAD
FT                   TFRKLFRVYSNFLRGKLKLYTGEACRTGDR
FT   mat_peptide     join(1262..1339,1596..1682,2294..2473,2608..2763)
FT                   /product=erythropoietin
FT   sig_peptide     join(615..627,1194..1261)
FT   exon            397..627
FT                   /number=1
FT   intron          628..1193
FT                   /number=1
FT   exon            1194..1339
FT                   /number=2
FT   intron          1340..1595
FT                   /number=2
FT   exon            1596..1682
FT                   /number=3
FT   intron          1683..2293
FT                   /number=3
FT   exon            2294..2473
FT                   /number=4
FT   intron          2474..2607
FT                   /number=4
FT   exon            2608..3327
FT                   /note=3' untranslated region
FT                   /number=5
XX
SQ   Sequence 3398 BP; 698 A; 1034 C; 991 G; 675 T; 0 other;
     agcttctggg cttccagacc cagctacttt gcggaactca gcaacccagg catctctgag        60
     tctccgccca agaccgggat gccccccagg aggtgtccgg gagcccagcc tttcccagat       120

Annotation (Prediction or experimentally determined) sequence

CDS CoDing Sequence

(proposed by submitters)
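As a rough illustration of how a CDS location relates to the protein, the sketch below reads the `join(...)` location of the erythropoietin CDS shown in the EMBL entry and counts the coding bases. The function names `parse_location` and `cds_length` are hypothetical helpers written for this example; only simple `join(...)` and `complement(...)` locations with plain `start..end` ranges are handled.

```python
import re


def parse_location(location: str) -> tuple[list[tuple[int, int]], bool]:
    """Return ([(start, end), ...], is_complement) for a simple feature location."""
    is_complement = location.startswith("complement(")
    segments = [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]
    return segments, is_complement


def cds_length(location: str) -> int:
    """Number of bases covered by the location (both ends are inclusive)."""
    segments, _ = parse_location(location)
    return sum(end - start + 1 for start, end in segments)


# CDS location copied from the EMBL entry X02158 (human erythropoietin gene).
epo_cds = "join(615..627,1194..1339,1596..1682,2294..2473,2608..2763)"
n_bases = cds_length(epo_cds)
n_codons = n_bases // 3
# The last codon is the STOP codon, so the protein has one residue fewer.
print(n_bases, n_codons, n_codons - 1)
```

The count agrees with the /translation qualifier of the entry, which holds 193 residues: the CDS runs from Met to STOP, introns excluded.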


slide-8
SLIDE 8


FT   CDS             complement(45959..47332)
FT                   /db_xref="SPTREMBL:Q9UZ71"
FT                   /note="PAB2386"
FT                   /transl_table=11
FT                   /product="4-AMINOBUTYRATE qui se dilate AMINOTRANSFERASE
FT                   (EC 2.6.1.19)"
FT                   /protein_id="CAB50188.1"
FT                   /translation="MDYPRIVVNPPGPKAKELIEREKRVLSTGIGVKLFPLVPKRGFGP
FT                   FIEDVDGNVFIDFLAGAAAASTGYSHPKLVKAVKEQVELIQHSMIGYTHSERAIRVAEK
FT                   LVKISPIKNSKVLFGLSGSDAVDMAIKVSKFSTRRPWILAFIGAYHGQTLGATSVASFQ
FT                   VSQKRGYSPLMPNVFWVPYPNPYRNPWGINGYEEPQELVNRVVEYLEDYVFSHVVPPDE
FT                   VAAFFAEPIQGDAGIVVPPENFFKELKKLLDEHGILLVMDEVQTGIGRTGKWFASEWFE
FT                   VKPDMIIFGKGVASGMGLSGVIGREDIMDITSGSALLTPAANPVISAAADATLEIIEEE
FT                   NLLKNAIEVGSFIMKRLNELKEQFDIIGDVRGKGLMIGVEIVKENGRPDPEMTGKICWR
FT                   AFELGLILPSYGMFGNVIRITPPLVLTKEVAEKGLEIIEKAIKDAIAGKVERKVVTWH"

slide-9
SLIDE 9


Proteomics databases

  • 1. Sequence databases:

« The story of a protein sequence’s life »

  • 2. Swiss-Prot: a quick overview
  • 3. UniProt utilities: UniRef and UniParc
  • 4. Swiss-Prot … and the other protein databases

Real life of a protein sequence …

[Diagram]

  • Nucleic acids: cDNAs, ESTs, genomes, … -> EMBL (with or without annotated CDS)
  • Data not submitted to public databases, delayed or cancelled…
  • Amino acids: CoDing Sequences provided by submitters -> TrEMBL
  • Swiss-Prot: manually annotated

slide-10
SLIDE 10

Since December 15, 2003, Swiss-Prot and TrEMBL constitute the UniProt Knowledgebase

(integration of the PIR data)

  • > give access to all known* protein sequences

* submitted to the public databases (EMBL, GenBank, DDBJ, SWISS-PROT)

slide-11
SLIDE 11


a SWISS-PROT entry = a protein sequence…

… associated with

  • manually-checked
  • well-structured
  • periodically-updated
  • searchable

… biological information


a TrEMBL entry = a protein sequence…

… associated with

  • computer-annotated
  • well-structured
  • periodically-updated
  • searchable

… biological information

slide-12
SLIDE 12

EMBL Swiss-Prot TrEMBL

CDS

Swiss-Prot

Annotation of conflicts

Once in Swiss-Prot, no more in TrEMBL

  • > Minimal redundancy

EMBL TrEMBL

CDS

slide-13
SLIDE 13

EMBL TrEMBL

CDS

Swiss-Prot

How to make things clear…? Depending on the server…

  • UniProt = Swiss-Prot + TrEMBL = SPTR = SWALL
  • Swiss-Prot = UniProt/Swiss-Prot
  • TrEMBL = UniProt/TrEMBL = SPTrEMBL
  • TrEMBL = SPTrEMBL + TrEMBLnew**

**is going to disappear soon !

slide-14
SLIDE 14


Swiss-Prot

  • 1. Minimal redundancy;
  • 2. Maximal manual annotation;
  • 3. Integration with other databases.

Swiss-Prot

  • 1. Minimal redundancy;

1 gene (1 species) -> 1 entry Swiss-Prot

Identical sequences are merged, as are variants, fragments, alternative splicing isoforms….

slide-15
SLIDE 15


Swiss-Prot

  • 1. Minimal redundancy.
  • 2. Maximal manual annotation:
  • Function(s);
  • Interactions;
  • Subcellular localization and tissue expression;
  • Structure (domains, …);
  • Post translational modifications (PTMs);
  • Variants (alternative splicing, polymorphisms, …);
  • Similarities…

Swiss-Prot

  • 1. Minimal redundancy;
  • 2. Manual annotation;
  • 3. Integration with other databases:

Release 41.25 (26-Sep-2003): 83 links to other databases.

slide-16
SLIDE 16

Up-to-date sources:

  • Swiss-Prot (since 1986) -> ExPASy (www.expasy.org)
  • TrEMBL (since 1996) -> EBI (European Bioinformatics Institute) (www.ebi.ac.uk/trembl/)

You can install the ExPASyBar on your computer (Amos’ links: www.expasy.org)

slide-17
SLIDE 17

Search also with accession numbers (Swiss-Prot or other databases)

www.expasy.org


slide-18
SLIDE 18


Swiss-Prot an overview


View « by default » on the ExPASy server

slide-19
SLIDE 19

ExPASy EBI NCBI


slide-20
SLIDE 20

It is not always obvious which database your protein sequence is derived from!…


Topology of a Swiss-Prot entry

sequence

slide-21
SLIDE 21


Swiss-Prot Protein sequence:

  • The longest sequence is usually « displayed »
  • Precursor (except INIT_MET 0 and « amino acid sequencing »)
  • Comparison of genomic and cDNA sequences
  • > carefully checked; validated !
  • > choose the most “representative”
  • > The “sequence quality” is always increasing….

Sequencing errors ? Polymorphisms ? Alternative splicing ? Alternative initiation ? Usage of an alternative promoter ? RNA editing ?

Swiss-Prot’s daily bread

Selenocysteine ? Fragment ?

Same gene ?

slide-22
SLIDE 22


Topology of a Swiss-Prot entry

sequence

Accession Nr. Identifier Gene name Protein name

www.expasy.org

Always cite the primary accession number !

slide-23
SLIDE 23


Topology of a Swiss-Prot entry

sequence

Accession Nr. Identifier Gene name Taxonomy Protein name

Topology of a Swiss-Prot entry

sequence

Accession Nr. Identifier Gene name Taxonomy References Protein name

slide-24
SLIDE 24


References

  • Complete sequences;
  • Fragments ;
  • Function, characterization, interaction…;
  • Post translational modifications;
  • 3D structure (crystallography or NMR);
  • Polymorphisms.


Topology of a Swiss-Prot entry

sequence

Accession Nr. Identifier Gene name Taxonomy References Comments Protein name

slide-25
SLIDE 25


Comment lines

  • Function(s) and role(s); enzymes:
  • a. Catalytic activity (if EC number)
  • b. Cofactor
  • c. Enzyme regulation
  • d. Pathway
  • Subunit (Protein/protein interactions)
  • Subcellular location
  • Alternative products (alt. splicing, alt. initiation, RNA editing)
  • Tissue specificity (Northern and Western results)
  • Developmental stage
  • Induction (genetic control)
  • Domain
  • Post translational modifications (PTM)
  • Mass spectrometry
  • Polymorphisms
  • Disease
  • Biotechnology
  • Pharmaceutical
  • Miscellaneous
  • Similarities
  • Caution
  • Database (specialized cross-references)


Information is derived from:

  • Publications;
  • Databases;
  • Personal communications;
  • Predictions;
  • Brain storming…

Comment lines

slide-26
SLIDE 26


ICOL_HUMAN, O75144

Experimental qualifiers:

« - »: experimentally proved; « By similarity »: experimentally proved in an ortholog or in another member of the family; « Probable »: not proved, but realistic; « Potential »: predicted (bioinformatic tools).
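When annotations are processed by a script, these qualifiers can be turned into an evidence level. The sketch below assumes that the qualifier appears in parentheses inside the annotation text, e.g. "(By similarity)"; the function name `evidence_level` and the three example strings are hypothetical and only serve as illustration.

```python
# Qualifier -> meaning, as listed on the slide.
QUALIFIER_LEVELS = {
    "by similarity": "proved in an ortholog or family member",
    "probable": "not proved, but realistic",
    "potential": "predicted by bioinformatic tools",
}


def evidence_level(annotation: str) -> str:
    """Classify one annotation string by its experimental qualifier."""
    lowered = annotation.lower()
    for qualifier, meaning in QUALIFIER_LEVELS.items():
        if f"({qualifier})" in lowered:
            return meaning
    # No qualifier at all corresponds to the « - » case of the slide.
    return "experimentally proved"


# Hypothetical annotation strings, for illustration only.
for text in ("Phosphoserine (By similarity).",
             "N-linked (GlcNAc...) (Potential).",
             "Sulfated."):
    print(f"{text:40s} -> {evidence_level(text)}")
```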


BRH2_HUMAN, Q9NY43 AAA1_HUMAN, Q9NS82

slide-27
SLIDE 27


Topology of a Swiss-Prot entry

sequence

Accession Nr. Identifier Gene name Taxonomy References Comments Cross-references Protein name

Cross-references (X-ref)

  • Swiss-Prot was the first database with X-ref.;
  • Explicit links to 53 databases;
  • Implicit X-references to 30 additional databases added by the ExPASy servers on the WWW (such as GenBank, Ensembl, …) => links to 83 databases from the ExPASy servers
  • Currently 1.2 × 10^6 cross-references in Swiss-Prot

Gasteiger et al., Curr. Issues Mol. Biol. (2001), 3(3): 47-55

slide-28
SLIDE 28

Swiss-Prot (explicit links)

  • Domains, sites, families: HAMAP, InterPro, PIRSF, Pfam, PRINTS, ProDom, PROSITE, SMART, TIGRFAMs
  • Nucleotide sequences: EMBL
  • Structure: HSSP, PDB
  • Organism-specific: dbSNP, DictyDb, EcoGene, FlyBase, GeneDB_SPombe, Genew, GK, Gramene, HIV, Leproma, ListiList, MaizeDB, MGD, MypuList, OMIM, SagaList, SGD, StyGene, SubtiList, TIGR, TubercuList, WormPep, ZFIN
  • Miscellaneous: GermOnline, GO, MEROPS, PIR, REBASE, TRANSFAC
  • 2D-gel electrophoresis: ANU-2DPAGE, Aarhus/Ghent-2DPAGE, COMPLUYEAST-2DPAGE, ECO2DBASE, HSC-2DPAGE, MAIZE-2DPAGE, PHCI-2DPAGE, PMMA-2DPAGE, Siena-2DPAGE, SWISS-2DPAGE
  • PTM: GlycoSuiteDB, PhosSite

Swiss-Prot currently acts as the main index for the 15 federated 2D-PAGE databases.


Cross-references 1.

DNA (Index of low redundancy) 3D genomic

ICE8_HUMAN Q14790

Examples of implicit links to GenBank and DDBJ added ‘on the fly’ by the ExPASy server

slide-29
SLIDE 29


2D-PAGE

Cross-references 2.

143S_HUMAN P31947

slide-30
SLIDE 30


  • Theoretically computed pI and MW
  • Theoretically computed pI and MW with potential phosphorylation and acetylation sites
  • Experimentally determined position
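To give an idea of what a "theoretically computed MW" involves, here is a minimal sketch that sums average residue masses and adds one water molecule, with optional mass shifts for phosphorylation and acetylation. The function name `molecular_weight` and the rounded modification shifts are assumptions made for this example, not the procedure used by the ExPASy tools; pI is left out because it depends on a choice of pK values.

```python
# Average residue masses in Da (amino acid minus one water).
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528           # free N- and C-terminus
PHOSPHORYLATION = 79.98    # assumed average shift per phosphate group (HPO3)
ACETYLATION = 42.04        # assumed average shift per acetyl group


def molecular_weight(sequence: str, n_phospho: int = 0, n_acetyl: int = 0) -> float:
    """Average molecular weight of a peptide, with optional modifications."""
    mass = sum(AVERAGE_RESIDUE_MASS[residue] for residue in sequence.upper())
    return mass + WATER + n_phospho * PHOSPHORYLATION + n_acetyl * ACETYLATION


# Hypothetical short peptide, unmodified and with one phosphate group.
peptide = "MGSTY"
print(f"{molecular_weight(peptide):.2f} Da")
print(f"{molecular_weight(peptide, n_phospho=1):.2f} Da")
```

Each modification moves the computed mass, and also the pI, which is why a modified protein can migrate to a different position on a 2D gel than the unmodified sequence would suggest.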


Topology of a Swiss-Prot entry

sequence

Accession Nr. Identifier Gene name Taxonomy References Comments Cross-references Keywords Protein name

slide-31
SLIDE 31


Keywords

(automated and manual annotation)

Q9HC96 Calpain 10


n=481 entries

slide-32
SLIDE 32


Topology of a Swiss-Prot entry

sequence

Accession Nr. Identifier Gene name Taxonomy References Comments Cross-references Keywords Feature table Protein name

General topology

ICOL_HUMAN, O75144

Sequence features:

Manual annotation

slide-33
SLIDE 33


General topology

ICOL_HUMAN, O75144

Sequence features:

Manual annotation

Domains


General topology

ICOL_HUMAN, O75144

Sequence features:

Manual annotation

PTM Domains

slide-34
SLIDE 34


ICOL_HUMAN, O75144


General topology

ICOL_HUMAN, O75144

Sequence features:

Manual annotation

PTM Alternative splicing Domains

slide-35
SLIDE 35


All the « alternatively spliced sequences » are available on the ExPASy server in FASTA format, e.g. for BLAST searches or proteomic tools.

Some proteomic tools on other servers, such as Mascot, also include these « alternatively spliced sequences » in their search engines.


BRC2_HUMAN, P51587

Slide callouts: polymorphisms; differences between the sequence shown and other submitted sequences

slide-36
SLIDE 36


Swiss-Prot and PTM annotations


Swiss-Prot PTM annotations

  • References (Rx lines)
  • Comments (CC lines): CC -!- PTM: …
  • Keywords (KW lines): KW …
  • Feature table (FT lines): FT …
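Because PTM information is spread over several line types, a script has to look in more than one place. The sketch below collects the CC, KW and FT lines that mention a PTM from a flat-file entry. The entry text is a hypothetical example written for illustration (only the comment "Sulfated." comes from the slides), the function name `ptm_annotations` is made up, and the keyword and feature-key sets are short selections, not complete lists. Reference lines and multi-line comments are not handled.

```python
from collections import defaultdict

# Hypothetical flat-file fragment: two-letter line code, then the content.
ENTRY = """\
ID   EXAMPLE_HUMAN   STANDARD;      PRT;   864 AA.
CC   -!- FUNCTION: Hypothetical example.
CC   -!- PTM: Sulfated.
KW   Cytoskeleton; Phosphorylation; Sulfation.
FT   DOMAIN        1    100       Hypothetical.
FT   MOD_RES     660    660       Phosphotyrosine.
"""

# Short selections for the example, not exhaustive lists.
PTM_KEYWORDS = {"Acetylation", "Glycoprotein", "Phosphorylation", "Sulfation"}
PTM_FEATURE_KEYS = {"MOD_RES", "CARBOHYD", "LIPID", "DISULFID", "CROSSLNK"}


def ptm_annotations(entry_text: str) -> dict[str, list[str]]:
    """Group the PTM-related content of an entry by line code (CC, KW, FT)."""
    found = defaultdict(list)
    for line in entry_text.splitlines():
        code, content = line[:2], line[5:].strip()
        fields = content.split()
        if code == "CC" and content.startswith("-!- PTM:"):
            found["CC"].append(content)
        elif code == "KW":
            keywords = [word.strip(" .") for word in content.split(";")]
            found["KW"].extend(k for k in keywords if k in PTM_KEYWORDS)
        elif code == "FT" and fields and fields[0] in PTM_FEATURE_KEYS:
            found["FT"].append(content)
    return dict(found)


for code, items in ptm_annotations(ENTRY).items():
    print(code, items)
```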

slide-37
SLIDE 37


references

references


Comments (CC PTM)

  • “The N-terminus is blocked.”
  • “Phosphorylation of Tyr-660 reduces the ability of 4.1 to promote the assembly of the spectrin/actin/4.1 ternary complex.”
  • “Sulfated.”

slide-38
SLIDE 38

Keywords

  • Linkage: Acetylation, Amidation, D-amino acid, Formylation, Glycoprotein, GPI-anchor, Hydroxylation, Hypusine, Iodination, Myristate, Palmitate, Phosphorylation, Prenylation, Sulfation, etc.
  • Cleavage: Signal, Transit peptide, Protein splicing, etc.
  • Cross-link: Thioether bond, Thioester bond.


Features

  • Cleavage: INIT_MET, PROPEP, SIGNAL, TRANSIT
  • Linkage: MOD_RES, CARBOHYD, LIPID, BINDING
  • Cross-link: DISULFID, CROSSLNK

slide-39
SLIDE 39

144’731 + 1’072’890 ≈ 700’000 Swiss-Prot & TrEMBL

introduce a new arithmetical concept !

Redundancy in TrEMBL & Redundancy between TrEMBL and Swiss-Prot

  • In 3 years….more than 2’000’000 protein sequences
  • But, in the future: redundancy is going to decrease:

« new » genome sequencing -> « new » proteins (AB, sept 2002)

In the case of human proteins, the redundancy is still very high:

10’542 + 54’152 ≈ about 22’000*
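The « new arithmetical concept » can be made explicit with a small calculation. The sketch below reads the right-hand side of each line as the approximate number of distinct proteins behind the entries, which is an interpretation of the slide rather than something it states; the function name `redundancy` is hypothetical.

```python
def redundancy(swiss_prot: int, trembl: int, distinct: int) -> tuple[int, float]:
    """Return (total entries, fraction of entries that are redundant)."""
    total = swiss_prot + trembl
    return total, 1 - distinct / total


# Figures from the slide: all species, then human only.
all_total, all_redundant = redundancy(144_731, 1_072_890, 700_000)
human_total, human_redundant = redundancy(10_542, 54_152, 22_000)

print(f"All species: {all_total} entries, {all_redundant:.1%} redundant")
print(f"Human:       {human_total} entries, {human_redundant:.1%} redundant")
```

Under this reading, the redundant fraction is clearly higher for human entries than for the knowledgebase as a whole, which is the point the slide makes.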

Are missing:

  • Sequences not submitted to EMBL/GenBank/DDBJ (and PIR)
  • Not yet predicted or known genes (« no CDS provided by the submitters » or no DNA sequence)
  • Confidential data (Patent application sequences)
  • Immunoglobulins, T-cell receptors (-> UniParc)

* human gene number estimation: 25’000-35’000

slide-40
SLIDE 40


Take home message

  • Swiss-Prot is a non-redundant, manually annotated and highly cross-referenced protein knowledgebase.
  • Be aware of the differences between TrEMBL and Swiss-Prot.
  • Always cite the Accession number, not the ID.
  • We need your feedback!

swiss-prot@expasy.org

Righting the wrongs

  • “Sequences are rarely deposited in a “mature” state; as with all scientific research, DNA and protein annotation is a continual process of learning, revision and corrections.”
  • “Sequencing error rates: ~1 base in 10’000”
  • “Making people aware of errors is good and great; making people aware that they’re responsible also for correcting errors is even greater”

C. Hadley, EMBO reports, 4(9), 2003.
slide-41
SLIDE 41


Proteomics databases

  • 1. Sequence databases:

« The story of a protein sequence’s life »

  • 2. Swiss-Prot: a quick overview
  • 3. UniProt utilities: UniRef and UniParc
  • 4. Swiss-Prot … and the other protein databases


UniProt consortium (since Oct. 2002):

  • The UniProt Knowledgebase (UniProt) (Swiss-Prot and TrEMBL; integration of PIR data) (Release 1, Dec. 2003).
  • The UniProt Non-redundant Reference (UniRef) databases combine closely related sequences into a single record to speed up BLAST searches.
  • The UniProt Archive (UniParc) is a comprehensive repository, reflecting the history of all protein sequences.

slide-42
SLIDE 42


UniRef is useful for comprehensive BLAST searches because it provides sets of representative sequences («Collapsing BLAST results»)

= Three collections of sequence clusters from the UniProt knowledgebase (Swiss-Prot, TrEMBL):

  • One UniRef100 entry -> all identical sequences (including fragments)
  • One UniRef90 entry -> sequences that are at least 90 % identical
  • One UniRef50 entry -> sequences that are at least 50 % identical
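The UniRef100 idea (identical sequences and fragments share one record) can be sketched in a few lines. This is a toy illustration under simple assumptions, not the procedure used to build UniRef: a fragment is taken to be any exact substring of a longer sequence, the function name `collapse_identical` and the four sequences are hypothetical, and UniRef90/UniRef50 would need a real identity calculation from alignments, which is not attempted here.

```python
def collapse_identical(sequences: dict[str, str]) -> dict[str, list[str]]:
    """Group identical sequences and exact fragments under the longest sequence."""
    clusters: dict[str, list[str]] = {}  # representative id -> member ids
    # Longest first, so that a fragment always meets its full-length parent.
    ordered = sorted(sequences.items(), key=lambda item: (-len(item[1]), item[0]))
    for seq_id, seq in ordered:
        for rep_id in clusters:
            if seq in sequences[rep_id]:  # identical, or an exact fragment
                clusters[rep_id].append(seq_id)
                break
        else:
            clusters[seq_id] = [seq_id]   # nothing matched: new representative
    return clusters


# Hypothetical sequences: B duplicates A, C is a fragment of A, D is unrelated.
toy = {"A": "MKTAYIAKQR", "B": "MKTAYIAKQR", "C": "TAYIA", "D": "MKVLLAG"}
print(collapse_identical(toy))
```

A BLAST search against the representatives only returns one hit per cluster, which is the « collapsing » effect mentioned above.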


BLASTP: UniRef100

UniRef100 does not include TrEMBLnew (tn) , because TrEMBLnew is going to « disappear » soon

slide-43
SLIDE 43


BLASTP: UniRef100


BLASTP: UniRef90

slide-44
SLIDE 44


BLASTP: UniRef90


BLASTP: UniRef50

slide-45
SLIDE 45

UniParc makes it possible to keep track of a protein sequence and of its integration into various databases

http://www.pir.uniprot.org/cgi-bin/textSearch_AR

Use with extreme caution: it also contains pseudogenes, incorrect CDS predictions, etc.! It also includes patent office database data (EPO, ESPO…).

UniParc

slide-46
SLIDE 46


Proteomics databases

  • 1. Sequence databases:

« The story of a protein sequence’s life »

  • 2. Swiss-Prot: a quick overview
  • 3. UniProt utilities: UniRef and UniParc
  • 4. Swiss-Prot … and the other protein databases

Real life of a protein sequence …

[Diagram]

  • Nucleic acids: cDNAs, ESTs, genomes, … -> EMBL
  • Data not submitted to public databases, delayed or cancelled…
  • Amino acids: CoDing Sequences provided by submitters -> TrEMBL
  • Swiss-Prot: manually annotated

slide-47
SLIDE 47

Real life of a protein sequence …

[Diagram: how sequence data flow between the databases]

  • cDNAs, ESTs, genomes, … -> EMBL, GenBank, DDBJ
  • Data not submitted to public databases, delayed or cancelled…
  • CoDing Sequences provided by submitters -> TrEMBL, GenPept
  • Swiss-Prot: manually annotated
  • CoDing Sequences provided by submitters and « de novo » gene prediction -> RefSeq (XP_NNNNN)
  • Sequences derived from scientific publications -> PRF
  • UniProt: Swiss-Prot + TrEMBL + (PIR)
  • NCBI-nr: Swiss-Prot + GenPept + (PIR) + RefSeq + PDB + PRF

Protein sequences: « NR database »

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein

slide-48
SLIDE 48


NCBI-nr: Swiss-Prot + GenPept + (PIR) + RefSeq + PDB + PRF

  • GenPept: ~TrEMBL, except that it is redundant with Swiss-Prot
  • PIR: all PIR data have been integrated into Swiss-Prot and TrEMBL (UniProt)
  • PDB: 3D structure database, i.e. all the protein sequences which have been crystallized (Swiss-Prot/TrEMBL are cross-linked to PDB)
  • PRF: sequences derived from scientific publications (integrated into TrEMBL)

NCBI Reference Sequence (RefSeq)

http://www.ncbi.nlm.nih.gov/RefSeq/

The RefSeq collection: genomic DNA, transcript (RNA), and protein products.

RefSeq provides a non-redundant set of sequences, derived from GenBank, the literature and gene prediction.

Release 3 includes over 844’000 proteins from 2218 organisms (including 1100 viruses and 150 bacteria).

The sequence data are tightly linked to LocusLink which contains the associated biological information (« interdependent curated resources »)

(!!! 1 entry = 1 sequence ….)

slide-49
SLIDE 49


Example 1 Search for a gene name

slide-50
SLIDE 50

Protein sequences: « NR database »

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein

AMBN


Swiss-Prot

20 entries

slide-51
SLIDE 51

« Entrez protein AMBN»

Genpept Genpept RefSeq RefSeq

Correspond to Swiss-Prot entry AMBN_HUMAN Q9NP70 GenBank source

KW AC Taxonomy References

GenBank source

slide-52
SLIDE 52

used for the construction of the RefSeq entry

Description of the sequence differences


Annotation

slide-53
SLIDE 53

Example 2 BLAST searches

Human EPO: Blastp against Swiss-Prot/TrEMBL (at the ExPASy server)

DR EMBL; AC009488; AAP22357.1; -.

*

slide-54
SLIDE 54

Human EPO: Blastp against NR

All these human sequences are integrated into the corresponding Swiss-Prot entry with the annotation of their differences (conflicts, variants, fragments…)


slide-55
SLIDE 55

PDB: Protein Data Bank www.rcsb.org/pdb/

  • Managed by the Research Collaboratory for Structural Bioinformatics (RCSB) (USA).
  • Contains macromolecular structure data on proteins, nucleic acids, protein-nucleic acid complexes, and viruses. Proteins represent more than 90 % of available structures.
  • Contains the spatial coordinates of macromolecules whose 3D structure has been obtained by X-ray or NMR studies.
  • Specialized programs allow the visualization of the corresponding 3D structure (e.g., SwissPDB-viewer, Chime, Rasmol).
  • Currently there are 24’358 structural data sets for about 6’000 molecules, but far fewer protein families (highly redundant)!

PDB: example

HEADER LYASE(OXO-ACID) 01-OCT-91 12CA 12CA 2
COMPND CARBONIC ANHYDRASE /II (CARBONATE DEHYDRATASE) (/HCA II) 12CA 3
COMPND 2 (E.C.4.2.1.1) MUTANT WITH VAL 121 REPLACED BY ALA (/V121A) 12CA 4
SOURCE HUMAN (HOMO SAPIENS) RECOMBINANT PROTEIN 12CA 5
AUTHOR S.K.NAIR,D.W.CHRISTIANSON 12CA 6
REVDAT 1 15-OCT-92 12CA 0 12CA 7
JRNL AUTH S.K.NAIR,T.L.CALDERONE,D.W.CHRISTIANSON,C.A.FIERKE 12CA 8
JRNL TITL ALTERING THE MOUTH OF A HYDROPHOBIC POCKET. 12CA 9
JRNL TITL 2 STRUCTURE AND KINETICS OF HUMAN CARBONIC ANHYDRASE 12CA 10
JRNL TITL 3 /II$ MUTANTS AT RESIDUE VAL-121 12CA 11
JRNL REF J.BIOL.CHEM. V. 266 17320 1991 12CA 12
JRNL REFN ASTM JBCHA3 US ISSN 0021-9258 071 12CA 13
REMARK 1 12CA 14
REMARK 2 12CA 15
REMARK 2 RESOLUTION. 2.4 ANGSTROMS. 12CA 16
REMARK 3 12CA 17
REMARK 3 REFINEMENT. 12CA 18
REMARK 3 PROGRAM PROLSQ 12CA 19
REMARK 3 AUTHORS HENDRICKSON,KONNERT 12CA 20
REMARK 3 R VALUE 0.170 12CA 21
REMARK 3 RMSD BOND DISTANCES 0.011 ANGSTROMS 12CA 22
REMARK 3 RMSD BOND ANGLES 1.3 DEGREES 12CA 23
REMARK 4 12CA 24
REMARK 4 N-TERMINAL RESIDUES SER 2, HIS 3, HIS 4 AND C-TERMINAL 12CA 25
REMARK 4 RESIDUE LYS 260 WERE NOT LOCATED IN THE DENSITY MAPS AND, 12CA 26
REMARK 4 THEREFORE, NO COORDINATES ARE INCLUDED FOR THESE RESIDUES. 12CA 27
………

slide-56
SLIDE 56

PDB (cont.)

SHEET 3 S10 PHE 66 PHE 70 -1 O ASN 67 N LEU 60 12CA 68
SHEET 4 S10 TYR 88 TRP 97 -1 O PHE 93 N VAL 68 12CA 69
SHEET 5 S10 ALA 116 ASN 124 -1 O HIS 119 N HIS 94 12CA 70
SHEET 6 S10 LEU 141 VAL 150 -1 O LEU 144 N LEU 120 12CA 71
SHEET 7 S10 VAL 207 LEU 212 1 O ILE 210 N GLY 145 12CA 72
SHEET 8 S10 TYR 191 GLY 196 -1 O TRP 192 N VAL 211 12CA 73
SHEET 9 S10 LYS 257 ALA 258 -1 O LYS 257 N THR 193 12CA 74
SHEET 10 S10 LYS 39 TYR 40 1 O LYS 39 N ALA 258 12CA 75
TURN 1 T1 GLN 28 VAL 31 TYPE VIB (CIS-PRO 30) 12CA 76
TURN 2 T2 GLY 81 LEU 84 TYPE II(PRIME) (GLY 82) 12CA 77
TURN 3 T3 ALA 134 GLN 137 TYPE I (GLN 136) 12CA 78
TURN 4 T4 GLN 137 GLY 140 TYPE I (ASP 139) 12CA 79
TURN 5 T5 THR 200 LEU 203 TYPE VIA (CIS-PRO 202) 12CA 80
TURN 6 T6 GLY 233 GLU 236 TYPE II (GLY 235) 12CA 81
CRYST1 42.700 41.700 73.000 90.00 104.60 90.00 P 21 2 12CA 82
ORIGX1 1.000000 0.000000 0.000000 0.00000 12CA 83
ORIGX2 0.000000 1.000000 0.000000 0.00000 12CA 84
ORIGX3 0.000000 0.000000 1.000000 0.00000 12CA 85
SCALE1 0.023419 0.000000 0.006100 0.00000 12CA 86
SCALE2 0.000000 0.023981 0.000000 0.00000 12CA 87
SCALE3 0.000000 0.000000 0.014156 0.00000 12CA 88
ATOM 1 N TRP 5 8.519 -0.751 10.738 1.00 13.37 12CA 89
ATOM 2 CA TRP 5 7.743 -1.668 11.585 1.00 13.42 12CA 90
ATOM 3 C TRP 5 6.786 -2.502 10.667 1.00 13.47 12CA 91
ATOM 4 O TRP 5 6.422 -2.085 9.607 1.00 13.57 12CA 92
ATOM 5 CB TRP 5 6.997 -0.917 12.645 1.00 13.34 12CA 93
ATOM 6 CG TRP 5 5.784 -0.209 12.221 1.00 13.40 12CA 94
ATOM 7 CD1 TRP 5 5.681 1.084 11.797 1.00 13.29 12CA 95
ATOM 8 CD2 TRP 5 4.417 -0.667 12.221 1.00 13.34 12CA 96
ATOM 9 NE1 TRP 5 4.388 1.418 11.515 1.00 13.30 12CA 97
ATOM 10 CE2 TRP 5 3.588 0.375 11.797 1.00 13.35 12CA 98
ATOM 11 CE3 TRP 5 3.837 -1.877 12.645 1.00 13.39 12CA 99
ATOM 12 CZ2 TRP 5 2.216 0.208 11.656 1.00 13.39 12CA 100
ATOM 13 CZ3 TRP 5 2.465 -2.043 12.504 1.00 13.33 12CA 101
ATOM 14 CH2 TRP 5 1.654 -1.001 12.009 1.00 13.34 12CA 102
…….

Coordinates of each atom

The same PDB entry “visualized” with Chime
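To show what can be done with the coordinates, the sketch below parses three ATOM records of the 12CA entry and computes the distance between the backbone N and CA atoms of TRP 5. The function name `parse_atom` is hypothetical, and the records are split on whitespace because the fixed-column layout of the original file is not preserved here; a parser for real PDB files should read the fixed columns instead.

```python
import math

# Three ATOM records copied from the 12CA entry (residue TRP 5).
ATOM_RECORDS = """\
ATOM 1 N TRP 5 8.519 -0.751 10.738 1.00 13.37 12CA 89
ATOM 2 CA TRP 5 7.743 -1.668 11.585 1.00 13.42 12CA 90
ATOM 3 C TRP 5 6.786 -2.502 10.667 1.00 13.47 12CA 91
"""


def parse_atom(line: str) -> dict:
    """Parse one whitespace-separated ATOM record into a dictionary."""
    fields = line.split()
    return {
        "serial": int(fields[1]),
        "name": fields[2],
        "residue": fields[3],
        "residue_number": int(fields[4]),
        "xyz": tuple(float(value) for value in fields[5:8]),  # in Angstroms
        "occupancy": float(fields[8]),
        "b_factor": float(fields[9]),
    }


atoms = {atom["name"]: atom for atom in map(parse_atom, ATOM_RECORDS.splitlines())}
n_ca_distance = math.dist(atoms["N"]["xyz"], atoms["CA"]["xyz"])
print(f"N-CA distance in TRP 5: {n_ca_distance:.2f} Angstroms")
```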

slide-57
SLIDE 57

3D structure database: other

They are all derived from PDB data!

  • HSSP: Homology-derived secondary structure of proteins
  • FSSP: structural alignment
  • SCOP: Structural classification of proteins
  • CATH: hierarchical domain classification of protein structures
  • HomStrad: (HOMologous STRucture Alignment Database)
  • DALI server (EBI): network service for comparing protein structures in 3D.

Protein databases used by the protein identification tools: the jungle…

slide-58
SLIDE 58

  • PROWL: NCBInr, Swiss-Prot, dbEST
  • Protein prospector: NCBInr, Swiss-Prot, dbEST, GenPept, Ludwignr, OWL*
  • Peptident (Aldente): Swiss-Prot, TrEMBL
  • Mascot: NCBInr, Swiss-Prot, dbEST, OWL*, MSDB

* OWL is obsolete since 1999


Matrix Science (Mascot) Sequence databases

http://www.matrixscience.com/help/database_help.html

MSDB: “non-identical protein sequence database”

Contains sequences derived from:

  • PIR (now integrated into UniProt (Swiss-Prot/TrEMBL))
  • TrEMBL
  • REMTrEMBL (does not exist anymore, see UniParc)
  • GenBank
  • Swiss-Prot
  • NRL3D (PDB derived sequences)

slide-59
SLIDE 59

Type of record            Sample accession format
GenBank/EMBL/DDBJ         One letter followed by five digits: e.g. U12345
                          Two letters followed by six digits: e.g. AF123456
Swiss-Prot/TrEMBL         One letter (O, P, Q) and five digits/letters: e.g. P12345
RefSeq nucleotide         Two letters, underscore and six digits: e.g. mRNA NM_000492, genomic NT_000907
RefSeq protein            e.g. NP_000483
RefSeq prediction         e.g. XM_000483, XP_000467
PDB (protein structure)   One digit followed by three letters: e.g. 1TUP

The AC number jungle
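The format rules in this table can be turned into a small classifier. The sketch below is only an illustration of those simplified rules: the `classify_ac` function and the rule order are assumptions, real accession formats have more variants, and the PDB rule is relaxed to "one digit plus three letters or digits" so that entries such as 12CA also fit.

```python
import re

# RefSeq prefixes named in the table and the record type they indicate.
REFSEQ_PREFIXES = {
    "NM": "RefSeq nucleotide (mRNA)",
    "NT": "RefSeq nucleotide (genomic)",
    "NP": "RefSeq protein",
    "XM": "RefSeq prediction",
    "XP": "RefSeq prediction",
}

# Order matters: "P12345" also fits the GenBank rule "one letter + five digits",
# so the Swiss-Prot/TrEMBL rule (first letter O, P or Q) is tested first.
AC_RULES = [
    ("Swiss-Prot/TrEMBL", re.compile(r"^[OPQ][A-Z0-9]{5}$")),
    ("GenBank/EMBL/DDBJ", re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6})$")),
    ("PDB", re.compile(r"^\d[A-Z0-9]{3}$")),
]
REFSEQ_RE = re.compile(r"^([A-Z]{2})_\d{6}$")


def classify_ac(ac: str) -> str:
    """Guess the type of record from the shape of an accession number."""
    ac = ac.strip().upper()
    refseq = REFSEQ_RE.match(ac)
    if refseq:
        return REFSEQ_PREFIXES.get(refseq.group(1), "RefSeq (other)")
    for label, pattern in AC_RULES:
        if pattern.match(ac):
            return label
    return "unknown"


for example in ["U12345", "AF123456", "P12345", "NM_000492", "XP_000467", "1TUP"]:
    print(example, "->", classify_ac(example))
```

A shape-based guess like this cannot resolve every case (a GenBank accession starting with O, P or Q would be misread), which is part of why the slide calls it a jungle.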


The end of part I…

slide-60
SLIDE 60


PART II

Protein characterization tools


What can we learn in silico from an amino acid sequence?

  • 1. Domain, family attribution
  • 2. Subcellular location
  • 3. Posttranslational modifications (PTMs)
slide-61
SLIDE 61


  • Most proteins have « modular » structures
  • Estimate: ~ 3 domains per protein
  • Domains not only share a common structure but often also have a similar function that contributes to the overall activity of the proteins that contain them.

Protein domain/family: some definitions

slide-62
SLIDE 62
  • Domains are identified by multiple sequence alignments
  • Domains can be defined by different methods:

– Pattern (regular expression): used for very conserved domains
– Profile (weighted matrix): two-dimensional table of position-specific match, gap and insertion scores, derived from aligned sequence families; used for less conserved domains
– Hidden Markov Model (HMM): probabilistic model; another method to generate profiles

Pattern-Profile

  • Pattern: [LIVM]-[ST]-A-[STAG]-H-C

Result: yes or no

  • Profile:

ID TRYPSIN_DOM; MATRIX.
AC PS50240;
DT DEC-2001 (CREATED); DEC-2001 (DATA UPDATE); DEC-2001 (INFO UPDATE).
DE Serine proteases, trypsin domain profile.
MA /GENERAL_SPEC: ALPHABET='ABCDEFGHIKLMNPQRSTVWYZ'; LENGTH=234;
MA /DISJOINT: DEFINITION=PROTECT; N1=6; N2=229;
MA /NORMALIZATION: MODE=1; FUNCTION=LINEAR; R1=0.0169; R2=0.00836256; TEXT='-LogE';
MA /CUT_OFF: LEVEL=0; SCORE=1134; N_SCORE=9.5; MODE=1; TEXT='!';
MA /CUT_OFF: LEVEL=-1; SCORE=775; N_SCORE=6.5; MODE=1; TEXT='?';
MA /DEFAULT: M0=-9; D=-20; I=-20; B1=-60; E1=-60; MI=-105; MD=-105; IM=-105; DM=-105;
MA /I: B1=0; BI=-105; BD=-105;
MA A B D E F G H I K L M N P Q R S T V W Y
MA /M: SY='I'; M= -8,-29,-34,-26, 3,-34,-24, 34,-26, 19, 15,-24,-21,-21,-24,-19, -8, 25,-19, 3;
MA /M: SY='N'; M= 0, 14, 10, 1,-22, -1, 6,-23, -4,-26,-17, 20,-14, -1, -6, 13, 2,-20,-34,-15;
MA /M: SY='E'; M= -4, 4, 7, 14,-26,-13, -7,-23, 3,-22,-16, 2, 7, 3, -3, 2, -2,-21,-30,-18;
MA /M: SY='R'; M=-12, 5, 5, 7,-23,-17, 3,-24, 8,-20,-12, 7,-16, 10, 12, -2, -6,-21,-27, -9;
MA /M: SY='W'; M=-16,-33,-35,-27, 13,-22,-24,-11,-18,-13,-13,-31,-27,-20,-18,-30,-21,-18, 97, 25;
MA /M: SY='V'; M= 1,-29,-31,-28, -1,-30,-29, 31,-22, 13, 11,-27,-27,-26,-22,-12, -2, 41,-27, -8;
MA /M: SY='L'; M= -8,-29,-31,-22, 9,-30,-21, 23,-27, 37, 20,-28,-28,-21,-20,-25, -8, 17,-20, -1;
MA /M: SY='T'; M= 2, -1, -9, -9,-11,-17,-19,-10,-10,-13,-11, 1,-11, -9,-10, 23, 43, 0,-32,-12;
MA /M: SY='A'; M= 45, -9,-19,-10,-20, -2,-15,-11,-10,-11,-10, -9,-11, -9,-19, 10, 1, -1,-21,-18;
MA /M: SY='A'; M= 40, -9,-17, -8,-21, 5,-18,-14, -9,-13,-12, -8,-11, -9,-16, 9, -2, -5,-21,-21;
MA /M: SY='H'; M=-18, 0, 0, 1,-21,-19, 89,-29, -8,-21, -1, 9,-19, 11, 0, -7,-17,-29,-30, 16;
MA /M: SY='C'; M= -9,-18,-28,-29,-20,-29,-29,-29,-29,-20,-19,-18,-39,-29,-29, -9, -9, -9,-49,-29;
MA /I: E1=0; IE=-105; DE=-105;
//

Result: score/threshold
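A pattern gives a yes-or-no answer because it is essentially a regular expression. The sketch below converts the PROSITE pattern shown on this slide into a Python regular expression and scans a sequence with it. The `prosite_to_regex` helper is an assumption written for this example and covers only basic PROSITE syntax. The test sequence is simply the string of SY= consensus residues from the profile excerpt, not a real protein.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE pattern into a Python regular expression.

    Supported elements: x (any residue), [..] (allowed residues),
    {..} (forbidden residues), (n) or (n,m) repeats, and the < / > anchors.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # {P} -> [^P]; must run before parentheses are turned into braces.
        element = element.replace("{", "[^").replace("}", "]")
        # x(2,4) -> x{2,4}
        element = element.replace("(", "{").replace(")", "}")
        # x -> . (residues are upper case, so a lower-case x is unambiguous)
        element = element.replace("x", ".")
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)


trypsin_his = prosite_to_regex("[LIVM]-[ST]-A-[STAG]-H-C")
sequence = "INERWVLTAAHC"  # consensus residues (SY=) of the profile rows on the slide

hit = re.search(trypsin_his, sequence)
print("regex:", trypsin_his)
print("match:", "yes" if hit else "no")
if hit:
    print("matched segment:", hit.group(), "starting at residue", hit.start() + 1)
```

A single substitution outside the allowed residue sets turns the answer into "no", whereas a profile would only lower the score. That difference is why profiles are preferred for less conserved domains.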

slide-63
SLIDE 63

Protein domain/family db

PROSITE       Patterns / Profiles
ProDom        Aligned motifs (PSI-BLAST) (Pfam B)
PRINTS        Aligned motifs
Pfam          HMM (Hidden Markov Models)
SMART         HMM
TIGRfam       HMM
DOMO          Aligned motifs
BLOCKS        Aligned motifs (PSI-BLAST)
CDD (CDART)   PSI-BLAST (PSSM) of Pfam and SMART

InterPro

http://www.ebi.ac.uk/InterProScan/

  • Searches many domain databases simultaneously (PRINTS, PROSITE, Pfam, ProDom, SMART, and TIGRFAMs).
  • Contains a unique AC, a functional description of the domain, and references.
  • Links are made back to the relevant member databases.

slide-64
SLIDE 64


slide-65
SLIDE 65


What can we learn in silico from an amino acid sequence?

  • 1. Domain, family attribution
  • 2. Subcellular location
  • 3. Posttranslational modifications (PTMs)

Protein pathway in Eukaryota

Secretory pathway

  • --> per default

with a specific signal

slide-66
SLIDE 66


slide-67
SLIDE 67


What can we learn in silico from an amino acid sequence?

  • 1. Domain, family attribution
  • 2. Subcellular location
  • 3. Posttranslational modifications (PTMs)


From genome to proteome

~ 25’000 human genes
  --> alternative splicing of mRNA: 2-5 fold increase
~ 100’000 human transcripts
  --> post-translational modifications of proteins (PTMs): 5-10 fold increase
~ 1'000'000 human proteins

protein complexity
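The fold increases multiply, which is where the jump from tens of thousands of genes to about a million protein forms comes from. The short calculation below is only a sanity check of the slide's orders of magnitude. The variable names are ad hoc and the inputs are the slide's approximate figures, not measured values.

```python
genes = 25_000            # ~ number of human genes on the slide
splicing_fold = (2, 5)    # alternative splicing: 2-5 fold increase
ptm_fold = (5, 10)        # PTMs: 5-10 fold increase

# Lower and upper bounds obtained by multiplying the fold ranges.
transcripts = tuple(genes * f for f in splicing_fold)
proteins = tuple(t * f for t, f in zip(transcripts, ptm_fold))

print(f"transcripts: {transcripts[0]:,} to {transcripts[1]:,}")
print(f"protein forms: {proteins[0]:,} to {proteins[1]:,}")
```

The slide's round figures of ~ 100’000 transcripts and ~ 1'000'000 proteins both fall inside the ranges this produces.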

slide-68
SLIDE 68

PTM diversity

Am     Amidation
AcN    Acetylation N-terminal
AcI    Acetylation internal
Alk    Alkylation
Adp    ADP-ribosylation
Bio    Biotinylation
Bro    Bromination
Cgly   C-linked glycosylation
Ogly   O-linked glycosylation
Ngly   N-linked glycosylation
Dea    Deamidation
Sul    Sulfation
Far    Farnesylation
Ger    Geranylgeranylation
GPI    GPI-anchoring
Met    Methylation
Myr    Myristoylation
Hyd    Hydroxylation
Pho    Phosphorylation
Pal    Palmitoylation
Pyr    Pyrrolidone carboxylic acid
Oxo    2-amino-3-oxo-propionic acid


Three major categories

  • cleavage: initiator Met, signal and transit peptides, propeptides, complex processing, etc.
  • linkage:

– simple chemical groups: phosphate, sulfate, methyl, hydroxyl, acetate, etc.
– complex molecules: N-, O- or C-linked glycans, lipids (e.g. palmitate, myristate, GPI)

  • x-linking: disulfide bonds, thioester, thioether bonds, etc.

slide-69
SLIDE 69

PTM database

http://www-nbrf.georgetown.edu/pirwww/dbinfo/resid.html

  • RESID is a database of protein post-translational modifications with descriptive, chemical, structural and bibliographic information.
  • Contains 351 entries (last update: Nov 2003)
slide-70
SLIDE 70


PTM prediction tools


PTM prediction on ExPASy + PROSITE predictions (n~15)

slide-71
SLIDE 71

http://www.expasy.org/tools/findmod/findmod_masses.html


PTM prediction

  • Beware of « biological consistency »!
  • Organisms (Eubacteria, Archaea, Eukaryota)
  • Subcellular location
  • Secretory pathway (ER, Golgi)
  • Shuttle between organelles…
  • Topology
  • A well-characterized orthologous protein
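The consistency check can be pictured as a filter applied after prediction. The sketch below finds potential N-glycosylation sites with the PROSITE pattern N-{P}-[ST]-{P} and keeps them only if the protein's annotated location is compatible. The rule table, the location labels, the function names and the test sequence are all hypothetical and deliberately oversimplified. Real decisions also need the organism, the topology and evidence from orthologs.

```python
import re

# PROSITE pattern N-{P}-[ST]-{P}, written with a lookahead so that
# overlapping candidate sites are all reported.
N_GLYC = re.compile(r"(?=N[^P][ST][^P])")

# Hypothetical rule table: PTM -> locations where the prediction is plausible.
PLAUSIBLE_LOCATIONS = {
    "N-glycosylation": {"secretory pathway"},
}


def predict_n_glyc(sequence: str) -> list:
    """Return 1-based positions of Asn residues in an N-{P}-[ST]-{P} context."""
    return [m.start() + 1 for m in N_GLYC.finditer(sequence)]


def consistent_sites(ptm: str, sites: list, location: str) -> list:
    """Keep predicted sites only if the PTM is plausible in this location."""
    allowed = PLAUSIBLE_LOCATIONS.get(ptm)
    if allowed is not None and location not in allowed:
        return []
    return sites


sequence = "MKNGSAANPTLNVTQ"  # made-up sequence for illustration
sites = predict_n_glyc(sequence)
print("predicted sites:", sites)
print("kept if secreted:", consistent_sites("N-glycosylation", sites, "secretory pathway"))
print("kept if cytoplasmic:", consistent_sites("N-glycosylation", sites, "cytoplasm"))
```

The pattern alone matches very often by chance, so the location filter is what separates a potential site from a biologically meaningful one.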
slide-72
SLIDE 72


Number of PTMs in Swiss-Prot release 40 (all organisms)

                  Exp.   By sim.   Pot./prob.   total
signal peptide    4056   1446      7457         12959
N-GlcNAc          1458   324       37046        38828
O-GalNAc          451    83        29           563
O-GlcNAc          48     79        8            135
phosphorylation   1001   2791      1162         4963
sulfation         119    136       167          422
myristate         101    240       60           401
GPI-anchor        41     73        108          222

Some statistics

Total number of proteins < total number of PTMs


[Bar chart] PTM annotation in SWISS-PROT: all organisms (categories: acetyl, phosphate, methyl, sulfate; series: total, proven; scale: 0 to 5000)

slide-73
SLIDE 73


We need your help!

Swiss-prot@expasy.org
PTM@isb-sib.ch


The end of part II…