SLIDE 1 Introduction to Protein Structure Bioinformatics 29.9.2004 Lorenza Bordoli 1
Swiss Institute of Bioinformatics
Protein Structure Bioinformatics Introduction
Secondary Structure Prediction & Fold recognition
EMBnet course Basel, September 29, 2004
Lorenza Bordoli
Overview
- Introduction
- Secondary Structure Prediction
- Fold Recognition
SLIDE 2
Principles of protein structure
- Primary Structure
- Secondary Structure
- Tertiary Structure (Fold)
- Quaternary Structure
Principles of protein structure
Protein structures include:
Core region:
- Secondary structure elements packed in close proximity in a hydrophobic environment
- Limited amino acid substitution
Outside the core:
- Loops and structural elements in contact with water or membrane
- Amino acid substitution: not as restricted as in the core
SLIDE 3
PDB Holdings
SLIDE 4
Protein Structure Databases
PDB: http://www.pdb.org
X-ray, NMR => atom coordinates of proteins are deposited in the PDB, the worldwide repository for 3-D biological macromolecular structure data.
EBI-MSD: http://www.ebi.ac.uk/msd/ (2003)
suite of web-based search and retrieval interfaces for macromolecular structure research.
Protein Structure Databases
http://www.wwpdb.org/
SLIDE 5
Introduction
Goal: What is the relationship between amino acid sequence and three-dimensional structure in proteins? Can we predict the structure from the sequence?
Currently: comparative (homology) modeling
See Lecture Thursday (Torsten) Homology Modeling
Similar sequence => similar structure
Homology modeling = comparative protein modeling
Idea: Using experimental 3D-structures of related family members (templates) to calculate a model for a new sequence (target).
Structure is better conserved than sequence
SLIDE 6
Flow chart: analyze a new protein sequence
Flow chart boxes:
- Protein Sequence
- Database similarity search (BLAST)
- Protein family sequence search (Pfam)
- Relationship to known structure? Does the sequence align with a protein of known structure?
- Homology Modeling
- Predicted 3D structural model
- Structure prediction (Secondary Structure, Fold recognition)
- 3D structural analysis in laboratory
- Hints for domain assignment? Function?
Secondary structure assignment
DSSP: Dictionary of Secondary Structure of Proteins (Kabsch & Sander, 1983)
- Based on recognition of hydrogen-bonding patterns in known structures
- Automated assignment of secondary structure
- Interprets backbone hydrogen bonds
- Uses a Coulomb approximation for the hydrogen bond energy (-0.5 kcal/mol cut-off)
- Secondary structures are assigned to consecutive segments of residues with hydrogen bonds
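As a hedged illustration of the Coulomb approximation mentioned for DSSP, the sketch below uses the electrostatic formula published by Kabsch & Sander (1983): E = q1 * q2 * f * (1/r_ON + 1/r_CH - 1/r_OH - 1/r_CN), with q1 = 0.42 e, q2 = 0.20 e and f = 332, distances in Angstrom and energy in kcal/mol. These constants are not stated on the slide and are quoted here as an assumption; the function names and the example distances are hypothetical.

```python
# Sketch of the DSSP hydrogen-bond criterion.
# Constants follow Kabsch & Sander (1983); they are NOT given on the slide.
Q1, Q2, F = 0.42, 0.20, 332.0   # partial charges (e) and dimensional factor
CUTOFF = -0.5                    # kcal/mol, the cut-off quoted on the slide


def hbond_energy(r_on, r_ch, r_oh, r_cn):
    """Electrostatic energy (kcal/mol) between a C=O group and an N-H group.

    All four arguments are interatomic distances in Angstrom.
    """
    return Q1 * Q2 * F * (1.0 / r_on + 1.0 / r_ch - 1.0 / r_oh - 1.0 / r_cn)


def is_hbond(r_on, r_ch, r_oh, r_cn):
    """A backbone hydrogen bond is assigned when the energy is below the cut-off."""
    return hbond_energy(r_on, r_ch, r_oh, r_cn) < CUTOFF


# Hypothetical distances for a well-formed bond (illustration only).
energy = hbond_energy(r_on=2.9, r_ch=3.3, r_oh=1.9, r_cn=4.1)
```

With these made-up distances the energy is well below the cut-off, so the pair would count as hydrogen-bonded; groups that are far apart give an energy close to zero and are rejected.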
SLIDE 7
Secondary structure assignment
DSSP secondary structure elements
8 secondary structure classes, reduced to 3 states:
- H (α-helix) → H
- G (3₁₀-helix) → H
- I (π-helix) → H
- E (extended strand) → E
- B (residue in isolated β-bridge) → E
- T (turn) → L
- S (bend) → L
- " " (blank = other) → L
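The 8-to-3 state reduction listed above can be sketched as a simple lookup. The function name and the example string are hypothetical, and treating any unknown character as loop is an assumption of this sketch, not something the slide specifies.

```python
# Mapping taken from the slide: H, G, I -> H; E, B -> E; T, S, blank -> L.
DSSP_TO_3STATE = {
    "H": "H", "G": "H", "I": "H",
    "E": "E", "B": "E",
    "T": "L", "S": "L", " ": "L",
}


def reduce_dssp(dssp_string):
    """Convert an 8-state DSSP string into the 3-state alphabet (H, E, L).

    Characters outside the eight DSSP classes are mapped to L (assumption).
    """
    return "".join(DSSP_TO_3STATE.get(symbol, "L") for symbol in dssp_string)


# Made-up DSSP assignment, for illustration only.
example = reduce_dssp(" EEEE TTGGGHHHHS B ")
```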
Secondary Structure prediction
What is protein secondary structure prediction?
Simplification of prediction problem 3D → 1D
Why do we need it?
As starting point for 3D modeling:
- Improve sequence alignments
- Use in fold recognition (discover family/superfamily relationship)
- Definition of loops / core regions
SLIDE 8
Secondary Structure prediction
Assumption: there should be a correlation between amino acid sequence and secondary structure
What can we predict?
- α-helix
- β-strand
- Loop (coil)
Secondary Structure prediction
Projection onto strings of structural assignments
“Secondary Structure” 3-state model: (H) α-helix, (E) β-strand (shown as S below), (L) loop
SEQ MRIILLGAPGAGKGTQAQFIMEKYGIPQISTGDMLRAAVKSGSELGKQAK
SS  SSSSSSLLLLLLHHHHHHHHHHHLLLSSSLHHHHHHHHHHHLLLLLLHHH
SS  SSSSSS      HHHHHHHHHHH   SSS HHHHHHHHHHH      HHH
SLIDE 9
Accuracy of prediction
3-state-per-residue accuracy:
Gives % of correctly predicted residues in α, β or other state
$Q_3 = 100 \cdot \frac{\sum_{i \in \{H,E,L\}} c_i}{N}$
- N = total number of residues
- c_i = number of correctly predicted residues in state i (H, E, L)
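A minimal sketch of the Q3 calculation, assuming the observed and predicted assignments are given as equal-length strings over H, E, L. The function name and the two example strings are hypothetical.

```python
def q3(observed, predicted):
    """Three-state per-residue accuracy in percent.

    Counts, for each state i in (H, E, L), the residues whose predicted
    state equals the observed state, then divides by the total length N.
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("empty assignment")
    correct = {state: 0 for state in "HEL"}
    for obs, pred in zip(observed, predicted):
        if obs == pred and obs in correct:
            correct[obs] += 1
    return 100.0 * sum(correct.values()) / len(observed)


# Made-up assignments: 13 of 16 residues agree.
observed = "LLHHHHHHLLEEEELL"
predicted = "LLHHHHLLLLEEELLL"
score = q3(observed, predicted)  # 81.25
```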
Performance Evaluation
Assumption: there should be a correlation* between amino acid sequence and secondary structure
Systematic performance testing is a prerequisite for the reliability of a method
Dataset (PDB) is split into:
- Training set (PDB subset): derive correlation*
- Test set (PDB subset): => Q3
SLIDE 10
Conformational Preferences
Biochimica et Biophysica Acta 916: 200-204 (1987).
[Figure: conformational preferences of the amino acids for α and β structure]
1st Generation secondary structure prediction
1st Generation based on single amino acid propensities
- Chou and Fasman, 1974
- Robson, 1976
- GOR-1: Garnier, Osguthorpe, and Robson, 1978
Preference of particular residues for certain secondary structure elements:
Single-residue statistics: analysis of the frequency of each of the 20 amino acids in α-helices, β-strands or coils
Databases of very limited size
< 55% Q3 accuracy
SLIDE 11
1st Generation secondary structure prediction
Chou and Fasman (partial table):
Amino Acid   P α    P β    P t
Glu          1.51   0.37   0.74
Met          1.45   1.05   0.60
Ala          1.42   0.83   0.66
Val          1.06   1.70   0.50
Ile          1.08   1.60   0.50
Tyr          0.69   1.47   1.14
Pro          0.57   0.55   1.52
Gly          0.57   0.75   1.56
Name            P(H)  P(E)  P(turn)  f(i)    f(i+1)  f(i+2)  f(i+3)
Alanine          142    83     66    0.06    0.076   0.035   0.058
Arginine          98    93     95    0.07    0.106   0.099   0.085
Aspartic Acid    101    54    146    0.147   0.11    0.179   0.081
Asparagine        67    89    156    0.161   0.083   0.191   0.091
Cysteine          70   119    119    0.149   0.05    0.117   0.128
Glutamic Acid    151    37     74    0.056   0.06    0.077   0.064
Glutamine        111   110     98    0.074   0.098   0.037   0.098
Glycine           57    75    156    0.102   0.085   0.19    0.152
Histidine        100    87     95    0.14    0.047   0.093   0.054
Isoleucine       108   160     47    0.043   0.034   0.013   0.056
Leucine          121   130     59    0.061   0.025   0.036   0.07
Lysine           114    74    101    0.055   0.115   0.072   0.095
Methionine       145   105     60    0.068   0.082   0.014   0.055
Phenylalanine    113   138     60    0.059   0.041   0.065   0.065
Proline           57    55    152    0.102   0.301   0.034   0.068
Serine            77    75    143    0.12    0.139   0.125   0.106
Threonine         83   119     96    0.086   0.108   0.065   0.079
Tryptophan       108   137     96    0.077   0.013   0.064   0.167
Tyrosine          69   147    114    0.082   0.065   0.114   0.125
Valine           106   170     50    0.062   0.048   0.028   0.053
Chou-Fasman Pij-values
SLIDE 12
Chou-Fasman
How it works:
- a. Assign all of the residues the appropriate set of parameters.
- b. Identify α-helix and β-sheet regions. Extend the regions in both directions.
- c. If structures overlap, compare average values for P(H) and P(E) and assign secondary structure based on the best scores.
- d. Turns are modeled as tetra-peptides using 2 different probability values.
Assign Pij values
- 1. Assign all of the residues the appropriate set of parameters
          T   S   P   T   A   E   L   M   R   S   T   G
P(H)     69  77  57  69 142 151 121 145  98  77  69  57
P(E)    147  75  55 147  83  37 130 105  93  75 147  75
P(turn) 114 143 152 114  66  74  59  60  95 143 114 156
SLIDE 13
Scan peptide for α−helix regions
- 2. Identify regions where 4/6 aa have a P(H) > 100: “alpha-helix nucleus”
          T   S   P   T   A   E   L   M   R   S   T   G
P(H)     69  77  57  69 142 151 121 145  98  77  69  57
Extend α-helix nucleus
- 3. Extend helix in both directions until a set of four residues have an average P(H) < 100.
          T   S   P   T   A   E   L   M   R   S   T   G
P(H)     69  77  57  69 142 151 121 145  98  77  69  57
Repeat steps 1 – 3 for entire peptide
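The nucleation and extension steps for helices can be sketched as follows. The P(H) values are copied verbatim from the example row on the slide; the function names are hypothetical, and the extension rule (grow one residue at a time while the four-residue window that includes the new residue still averages at least 100) is only one possible reading of step 3, so results need not match any particular Chou-Fasman program.

```python
# P(H) row for the example peptide TSPTAELMRSTG, as shown on the slide.
P_H = [69, 77, 57, 69, 142, 151, 121, 145, 98, 77, 69, 57]


def helix_nuclei(p_h, window=6, min_hits=4, threshold=100):
    """Start indices of windows in which at least min_hits values exceed threshold."""
    starts = []
    for start in range(len(p_h) - window + 1):
        hits = sum(1 for value in p_h[start:start + window] if value > threshold)
        if hits >= min_hits:
            starts.append(start)
    return starts


def extend(p_h, start, end, span=4, threshold=100):
    """Grow the half-open segment [start, end) in both directions.

    A residue is added while the window of `span` residues that includes
    it keeps an average of at least `threshold` (assumed interpretation).
    """
    while end < len(p_h) and sum(p_h[end - span + 1:end + 1]) / span >= threshold:
        end += 1
    while start > 0 and sum(p_h[start - 1:start - 1 + span]) / span >= threshold:
        start -= 1
    return start, end


nuclei = helix_nuclei(P_H)                        # windows starting at 2, 3 and 4
segment = extend(P_H, nuclei[0], nuclei[0] + 6)   # (2, 10) under this reading
```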
SLIDE 14
Scan peptide for β-sheet regions
- 4. Identify regions where 3/5 aa have a P(E) > 100: “β-sheet nucleus”
- Extend the β-sheet until 4 contiguous residues have an average P(E) < 100
- If the region average is > 105 and the average P(E) > average P(H), then “β-sheet”
          T   S   P   T   A   E   L   M   R   S   T   G
P(H)     69  77  57  69 142 151 121 145  98  77  69  57
P(E)    147  75  55 147  83  37 130 105  93  75 147  75
Chou-Fasman
- 1. Assign all of the residues in the peptide the appropriate set of parameters.
- 2. Scan through the peptide and identify regions where 4 out of 6 contiguous residues have P(α-helix) > 100. That region is declared an α-helix. Extend the helix in both directions until a set of four contiguous residues with an average P(α-helix) < 100 is reached. That is declared the end of the helix. If the segment defined by this procedure is longer than 5 residues and the average P(α-helix) > P(β-sheet) for that segment, the segment can be assigned as a helix.
- 3. Repeat this procedure to locate all of the helical regions in the sequence.
- 4. Scan through the peptide and identify a region where 3 out of 5 of the residues have a value of P(β-sheet) > 100. That region is declared a β-sheet. Extend the sheet in both directions until a set of four contiguous residues with an average P(β-sheet) < 100 is reached. That is declared the end of the β-sheet. Any segment of the region located by this procedure is assigned as a β-sheet if the average P(β-sheet) > 105 and the average P(β-sheet) > P(α-helix) for that region.
- 5. Any region containing overlapping α-helical and β-sheet assignments is taken to be helical if the average P(α-helix) > P(β-sheet) for that region. It is a β-sheet if the average P(β-sheet) > P(α-helix) for that region.
- 6. To identify a bend at residue number j, calculate the following value: p(t) = f(j) f(j+1) f(j+2) f(j+3), where the f(j) value is used for residue j, the f(j+1) value for residue j+1, the f(j+2) value for residue j+2 and the f(j+3) value for residue j+3. If (1) p(t) > 0.000075; (2) the average value for P(turn) > 1.00 in the tetra-peptide (100 on the scale of the Pij table); and (3) the averages for the tetra-peptide obey the inequality P(α-helix) < P(turn) > P(β-sheet), then a β-turn is predicted at that location.
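The turn rule in step 6 can be sketched as below. The parameter rows are copied from the Chou-Fasman Pij table in this lecture (only eight residues are included to keep the example short); the function names and the two test tetra-peptides are hypothetical.

```python
# Rows copied from the Pij table of this lecture:
# (P(H), P(E), P(turn), f(i), f(i+1), f(i+2), f(i+3))
PARAMS = {
    "A": (142, 83, 66, 0.06, 0.076, 0.035, 0.058),
    "N": (67, 89, 156, 0.161, 0.083, 0.191, 0.091),
    "E": (151, 37, 74, 0.056, 0.06, 0.077, 0.064),
    "G": (57, 75, 156, 0.102, 0.085, 0.19, 0.152),
    "L": (121, 130, 59, 0.061, 0.025, 0.036, 0.07),
    "M": (145, 105, 60, 0.068, 0.082, 0.014, 0.055),
    "P": (57, 55, 152, 0.102, 0.301, 0.034, 0.068),
    "S": (77, 75, 143, 0.12, 0.139, 0.125, 0.106),
}


def turn_score(tetrapeptide):
    """p(t): product of the position-specific bend frequencies."""
    if len(tetrapeptide) != 4:
        raise ValueError("expected exactly four residues")
    score = 1.0
    for offset, residue in enumerate(tetrapeptide):
        score *= PARAMS[residue][3 + offset]  # f(i), f(i+1), f(i+2), f(i+3)
    return score


def predicts_turn(tetrapeptide):
    """Apply the three conditions of step 6 to one tetra-peptide."""
    score = turn_score(tetrapeptide)
    rows = [PARAMS[residue] for residue in tetrapeptide]
    avg_helix = sum(row[0] for row in rows) / 4
    avg_sheet = sum(row[1] for row in rows) / 4
    avg_turn = sum(row[2] for row in rows) / 4
    return (
        score > 0.000075
        and avg_turn > 100      # 1.00 on the unscaled propensity scale
        and avg_helix < avg_turn
        and avg_sheet < avg_turn
    )
```

For instance, the made-up tetra-peptide `"NPGS"` satisfies all three conditions, whereas `"AELM"` fails already on the p(t) threshold.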
SLIDE 15
CHOFAS predicts protein secondary structure, version 2.0u61, September 1998
Please cite: Chou and Fasman (1974) Biochem., 13:222-245
Chou-Fasman plot of @, 12 aa; SEQ1 sequence.
TSPTAELMRSTG
helix <>
sheet EEEEEEE
turns T
Residue totals: H: 2 E: 7 T: 1
percent: H: 16.7 E: 58.3 T: 8.3
Chou-Fasman Results
2nd Generation secondary structure prediction
Improvements
- Larger database of protein structures
- Segment-based statistics (11-21 residue window)
Basic idea:
"How likely is it that the central residue in a window adopts a particular secondary structure state?"
Algorithm used:
Presumably all conceivable algorithms on this planet have been applied to the Secondary Structure prediction problem. E.g. statistical information, physicochemical properties, sequence patterns, neural networks, graph theory, expert rules
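Whatever algorithm is used, second-generation methods look at a window of residues centred on the position to be predicted. A minimal sketch of the window extraction, assuming an odd window size and a "." spacer for positions beyond the sequence ends; the function name, the default size of 13 and the example sequence fragment are illustrative choices, not part of the lecture.

```python
def windows(sequence, size=13, pad="."):
    """Return one window per residue, centred on that residue.

    Positions beyond either end of the sequence are filled with `pad`.
    """
    if size % 2 == 0:
        raise ValueError("window size must be odd so that it has a central residue")
    half = size // 2
    padded = pad * half + sequence + pad * half
    return [padded[i:i + size] for i in range(len(sequence))]


# Small example with a 5-residue window.
example = windows("MRIILLGAPG", size=5)
```

Each window would then be passed to the predictor, which outputs the state of the central residue only.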
SLIDE 16
(H) α-Helix, local interactions
Neural Network
Artificial intelligence:
Computer programs are trained to recognize amino acid patterns that are located in known secondary structures and to distinguish them from other patterns not located in these structures
Neural networks can detect interactions between amino acids in a sequence window.
Neural Networks for Secondary Structure prediction
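[Figure (B. Rost, Columbia, New York): neural network with an input layer (units A C D E F G H I K L M N P Q R S T V W Y and a spacer "."), a hidden layer and an output layer (H, E, L), connected by weights. Example window: D (L) R (E) Q (E) G (E) F (E) V (E) P (E) A (H) A (H) Y (H) V (E) K (E) K (E)]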
SLIDE 17
Neural Networks for secondary structure predictions
[Figure (B. Rost, Columbia, New York): same network and example window; output units H = 0.19, E = 0.61, L = 0.17]
The winner is: E
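The "winner takes all" step amounts to picking the output unit with the largest value. A small sketch using the three output values shown on the slide; the variable names are hypothetical.

```python
# Output unit values from the slide (H, E, L).
outputs = {"H": 0.19, "E": 0.61, "L": 0.17}

# The predicted state is the unit with the maximal output.
winner = max(outputs, key=outputs.get)
```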
Neural Networks
Benefits
- Generally applicable
- Can capture higher-order correlations
- Inputs other than sequence information
Drawbacks
- Needs many data points (solved structures)
- Risk of overtraining
SLIDE 18
2nd Generation secondary structure prediction
Methods:
GORIII
COMBINE
Q3 accuracy < 70%
Problems with first and second generation methods
- Q3 accuracy < 70%
- β-strands predicted poorly: only 28-48% correct (slightly better than random)
- Predicted helices and strands are too short
SLIDE 19
3rd Generation secondary structure prediction
Breakthrough: Using evolutionary information
           1                                                   50
fyn_human  VTLFVALYDY EARTEDDLSF HKGEKFQILN SSEGDWWEAR SLTTGETGYI
yrk_chick  VTLFIALYDY EARTEDDLSF QKGEKFHIIN NTEGDWWEAR SLSSGATGYI
fgr_human  VTLFIALYDY EARTEDDLTF TKGEKFHILN NTEGDWWEAR SLSSGKTGCI
yes_chick  VTVFVALYDY EARTTDDLSF KKGERFQIIN NTEGDWWEAR SIATGKTGYI
src_avis2  VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
src_aviss  VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
src_avisr  VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
src_chick  VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
stk_hydat  VTIFVALYDY EARISEDLSF KKGERLQIIN TADGDWWYAR SLITNSEGYI
src_rsvpa  .......... ESRIETDLSF KKRERLQIVN NTEGTWWLAH SLTTGQTGYI
hck_human  ..IVVALYDY EAIHHEDLSF QKGDQMVVLE ES.GEWWKAR SLATRKEGYI
blk_mouse  ..FVVALFDY AAVNDRDLQV LKGEKLQVLR .STGDWWLAR SLVTGREGYV
hck_mouse  .TIVVALYDY EAIHREDLSF QKGDQMVVLE .EAGEWWKAR SLATKKEGYI
lyn_human  ..IVVALYPY DGIHPDDLSF KKGEKMKVLE .EHGEWWKAK SLLTKKEGFI
lck_human  ..LVIALHSY EPSHDGDLGF EKGEQLRILE QS.GEWWKAQ SLTTGQEGFI
ss81_yeast .....ALYPY DADDDdeISF EQNEILQVSD .IEGRWWKAR R.ANGETGII
abl_mouse  ..LFVALYDF VASGDNTLSI TKGEKLRVLG YnnGEWCEAQ ..TKNGQGWV
abl1_human ..LFVALYDF VASGDNTLSI TKGEKLRVLG YnnGEWCEAQ ..TKNGQGWV
src1_drome ..VVVSLYDY KSRDESDLSF MKGDRMEVID DTESDWWRVV NLTTRQEGLI
mysd_dicdi .....ALYDF DAESSMELSF KEGDILTVLD QSSGDWWDAE L..KGRRGKV
yfj4_yeast ....VALYSF AGEESGDLPF RKGDVITILK ksQNDWWTGR V..NGREGIF
abl2_human ..LFVALYDF VASGDNTLSI TKGEKLRVLG YNQNGEWSEV RSKNG.QGWV
tec_human  .EIVVAMYDF QAAEGHDLRL ERGQEYLILE KNDVHWWRAR D.KYGNEGYI
abl1_caeel ..LFVALYDF HGVGEEQLSL RKGDQVRILG YNKNNEWCEA RlrLGEIGWV
txk_human  .....ALYDF LPREPCNLAL RRAEEYLILE KYNPHWWKAR D.RLGNEGLI
yha2_yeast VRRVRALYDL TTNEPDELSF RKGDVITVLE QVYRDWWKGA L..RGNMGIF
abp1_sacex .....AEYDY EAGEDNELTF AENDKIINIE FVDDDWWLGE LETTGQKGLF
3rd Generation secondary structure prediction
PHD method (Rost and Sander)
- Combines neural networks with MAXHOM multiple sequence profiles
- 6-8 percentage points increase in prediction accuracy over standard neural networks
SLIDE 20
3rd Generation secondary structure prediction
[Figure (B. Rost, Columbia, New York): PHD network architecture. The protein sequence and its alignments are converted into a profile table (columns GSAPD NTEKQ CVHIR LMYFW; entries are residue counts per position). The profile feeds the input layer, followed by the first or hidden layer and the second or output layer with units H, E, L; the maximal unit is picked as the current prediction. One input block corresponds to the 21*3 bits coding for the profile of one residue.]
Protein:    G Y I Y D P E D G D P D D G V N P G T D F
Alignments: G Y I Y D P A V G D P D N G V E P G T E F
            G Y I Y D P E V G D P T Q N I P P G T K F
            G Y E Y D P A E G D P D N G V K P G T S F
            G Y E Y D P A E G D P D N G V K P G T A F
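To illustrate how a profile table is derived from an alignment, the sketch below counts the residues in each alignment column for the five sequences shown in the figure (the protein plus its four aligned sequences). Whether the original figure counts the query sequence itself is an assumption here, and the function name is hypothetical; PHD's actual input encoding is more elaborate than raw counts.

```python
from collections import Counter

# Sequences from the figure: the query protein followed by four alignments.
ALIGNMENT = [
    "GYIYDPEDGDPDDGVNPGTDF",
    "GYIYDPAVGDPDNGVEPGTEF",
    "GYIYDPEVGDPTQNIPPGTKF",
    "GYEYDPAEGDPDNGVKPGTSF",
    "GYEYDPAEGDPDNGVKPGTAF",
]


def column_counts(alignment):
    """Count how often each residue occurs in every alignment column."""
    length = len(alignment[0])
    if any(len(sequence) != length for sequence in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [Counter(column) for column in zip(*alignment)]


profile = column_counts(ALIGNMENT)
# profile[0] holds the counts for the first column (fully conserved G),
# profile[2] those for the third column (three I, two E).
```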
3rd generation secondary structure prediction
PHD (Rost et al.)
Q3 better than 72 %
[ B.Rost (2001) J.Struct.Biol. 134, 204 ]
59 % 65 % 72 % Q3
Prediction reliability (0 = weak, 9 = strong)
[http://www.embl-heidelberg.de/predictprotein/]
SLIDE 21
3rd generation secondary structure prediction
PSI-Pred (Jones, DT)
- Uses alignments from iterative sequence searches (PSI-Blast) as input to a neural network
- Better predictions due to better sequence profiles
- Available as a stand-alone program and via the web
[http://bioinf.cs.ucl.ac.uk/psipred/psiform.html]
How accurate are predictions today?
[Figure (B. Rost, Columbia, New York): histogram of the number of protein chains versus per-residue accuracy (Q3); <Q3> = 72.3%, sigma = 10.5%; labelled chains: 1spf, 1bct, 1stu, 3ifm, 1psm]
SLIDE 22
How accurate are predictions today?
Q3 = 72-76% ± 11% (on average)
- I.e. roughly 30% of predicted assignments are wrong
- I.e. for 2/3 of all proteins, between 60% and 80% of residues are predicted correctly
- I.e. for your protein, accuracy can be lower than 60%
How accurate are predictions today?
- At present it is not always possible to predict secondary structure with very high reliability
- As methods have improved (from the 1st to the 3rd generation), prediction has reached an average accuracy of 64%-75%
SLIDE 23
Secondary Structure Prediction
META-PredictProtein Server
http://cubic.bioc.columbia.edu/meta/
- Simultaneous submission tool to several other servers, e.g. JPRED, PHD, PROF, PSIpred, SAM-T99, APSSP2, SSpro
- Also includes motif searches, domain assignments, TM predictions, etc.
1D-Structure prediction
- Secondary Structure Prediction
- Solvent Accessibility Prediction
Identify exposed residues, e.g. for mutation studies, epitopes, etc.
SLIDE 24
1D-Structure prediction
Projection onto strings of structural assignments
E.g. “Solvent Accessibility” (buried or exposed?)
A B C D E F G…
¦ ¦ ¦ ¦ ¦ ¦ ¦
e e b b e e e…
Accuracy of two-state prediction: 75% ± 10 %
PHDacc: solvent accessibility prediction
[http://cubic.bioc.columbia.edu/predictprotein/]
SLIDE 25
1D-Structure prediction
- Secondary Structure Prediction
- Solvent Accessibility Prediction
- Transmembrane Helices prediction
PHDhtm [http://www.embl-heidelberg.de/predictprotein/predictprotein.html]
TMHMM [http://www.cbs.dtu.dk/services/TMHMM/]
TMpred [http://www.ch.embnet.org/software/TMPRED_form.html]
Fold Recognition
SLIDE 26
[ PDB: http://www.pdb.org ]
Growth of the Protein Data Bank PDB
Christine Orengo (Structure, 1997, 5, 1093-1108)
Fold Classification Databases
SLIDE 27
Christine Orengo (Structure, 1997, 5, 1093-1108)
Fold Classification Databases
Protein structure classification databases
Databases provide structural comparisons for the proteins in PDB.
Methods used to classify the protein structures:
- Manual examination
- Fully automatic computer algorithms
Examples:
- SCOP
- CATH
- FSSP
SLIDE 28
[ http://scop.mrc-lmb.cam.ac.uk/scop/ ]
SCOP - Structural Classification of Proteins
MRC Cambridge UK: A. Murzin, S. E. Brenner, T. Hubbard, C. Chothia
- Created by manual inspection
- Hierarchical classification of protein domain structures
- Comprehensive description of the structural and evolutionary relationships
Organized as a tree structure:
- Class: all α class
- Fold: globin-like fold (6 helices; folded leaf)
- Superfamily: globin-like superfamily
- Family: globin and phycocyanin families
- Domain: hemoglobin 1, myoglobin, …
- Species
Domain = segment of a polypeptide chain that can autonomously fold into a 3D structure
[ http://www.biochem.ucl.ac.uk/bsm/cath_new/ ]
CATH - Protein Structure Classification
UCL, Janet Thornton & Christine Orengo
Hierarchical classification of protein domain structures; clusters proteins at four major levels:
- Class (C)
- Architecture (A)
- Topology (T)
- Homologous superfamily (H)
SLIDE 29
[ http://www.biochem.ucl.ac.uk/bsm/cath_new/ ]
CATH - Protein Structure Classification
- Class (C): derived from secondary structure content; assigned automatically
- Architecture (A): describes the gross orientation of secondary structures, independent of connectivity
- Topology (T): clusters structures according to their topological connections and numbers of secondary structures
FSSP - Fold classification based on structure-structure alignment
Holm and Sander, EBI, UK
- Fold classification based on pair-wise structural alignment of PDB structures (DALI program)
- Clusters of fold types = unique configuration of secondary structure elements
[http://www2.ebi.ac.uk/dali/fssp/]
SLIDE 30
Structural Alignments
- Protein structure is better conserved than sequence
- Structural alignments establish equivalences between amino acid residues based on the 3D structures of two or more proteins
- Structure alignments therefore provide information not available from sequence alignment methods
- Structural alignments can be used to guide sequence alignments (see: T_COFFEE / SAP)
See Lecture Thursday (Laurent) Sequence alignment
SLIDE 31
[ PDB: http://www.pdb.org ]
Growth of the Protein Data Bank PDB
New folds per year “Old” folds per year
The number of folds appears to be limited
Many different sequences will adopt the same fold:
there is a reasonable probability that a new sequence will possess an already identified fold
Goal of fold recognition: discover which fold is best matched
- Sequence alignment methods (e.g. HMM)
- 3D structure prediction methods (e.g. threading)
SLIDE 32
Find a compatible fold for a given sequence ....
>Protein XY
MSTLYEKLGGTTAVDLAV
DKFYERVLQDDRIKHFFA
DVDMAKQRAHQKAFLTYA
FGGTDKYDGRYMREAHKE
LVENHGLNGEHFDAVAED
LLATLKEMGVPEDLIAEV
AAVAGAPAHKRDVLNQ
≈
?
Fold recognition
The number of protein folds that occur in nature is limited.
Fold recognition can be used to:
- Identify templates for modeling
- Assign protein function
Fold recognition: sequence based
Sequence alignment (HMM) can be used to identify a family of homologous proteins that have similar sequences and presumably a similar 3D structure
Example: SUPERFAMILY database:
uses a library (covering all proteins of known structure) consisting of 1294 SCOP superfamilies, each of which is represented by a group of hidden Markov models (HMMs)
[http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/]
SLIDE 33
Fold recognition: threading
The amino acid sequence of a query protein is examined for compatibility with the structural core of known protein structures:
- Structure profile method (e.g. 3D-PSSM)
- Contact potential method (e.g. 123D)
Fold recognition methods
position specific scoring matrix
Kelley et al, JMB, 299, 499 (2000)
http://www.sbg.bio.ic.ac.uk/~3dpssm/
SLIDE 34
Fold recognition and Function
Some words of warning concerning fold recognition:
There is no simple close association of fold and function in a
The five most versatile folds (TIM-barrel, alpha-beta hydrolase, Rossmann, P-loop containing NTP hydrolase, ferredoxin fold) accommodate from six to as many as 16 functions.
The two most versatile enzymatic functions (hydrolases and O-glycosyl glucosidases) are associated with seven folds each.
Aspartase [1JSW]
Histidase [1B8F]
δ2-Crystallin [1AUW], avian eye lens protein
[Figure: chemical structure diagrams for aspartase and histidase]
Functional assignment by fold recognition ?
SLIDE 35
Fold Recognition Servers
Meta server
http://bioinfo.pl/meta/
3DPSSM
http://www.sbg.bio.ic.ac.uk/servers/3dpssm/
GenTHREADER
http://bioinf.cs.ucl.ac.uk/psipred/
FUGUE2
http://www-cryst.bioc.cam.ac.uk/~fugue/prfsearch.html
SAM
http://www.cse.ucsc.edu/research/compbio/HMM-apps/T99-query.html
FOLD
http://fold.doe-mbi.ucla.edu/
FFAS/PDBBLAST
http://bioinformatics.burnham-inst.org/
References
- D.W. Mount, Bioinformatics, CSHLP.
- P.E. Bourne, H. Weissig, Structural Bioinformatics, Wiley-Liss and Sons.
- Methods in Molecular Biology 143: Protein Structure Prediction, Humana Press.
- Protein Structure Prediction: A Practical Approach, Oxford University Press.