
Introduction to Protein Structure Bioinformatics 29.9.2004 Protein Structure Bioinformatics Introduction Secondary Structure Prediction & Fold recognition EMBnet course Basel, September 29, 2004 Lorenza Bordoli Swiss Institute of


SLIDE 1

Introduction to Protein Structure Bioinformatics 29.9.2004 Lorenza Bordoli 1

Swiss Institute of Bioinformatics

Protein Structure Bioinformatics Introduction

Secondary Structure Prediction & Fold recognition

EMBnet course Basel, September 29, 2004

Lorenza Bordoli

Overview

  • Introduction
  • Secondary Structure Prediction
  • Fold Recognition

SLIDE 2

Principles of protein structure

  • Primary Structure
  • Secondary Structure
  • Tertiary Structure (Fold)
  • Quaternary Structure

Principles of protein structure

Protein structures include:

Core region:

  • Secondary structure elements packed in close proximity in a hydrophobic environment
  • Limited amino acid substitution

Outside the core:

  • Loops and structural elements in contact with water, membrane or other proteins
  • Amino acid substitution: not as restricted as in the core

SLIDE 3

PDB Holdings

SLIDE 4

Protein Structure Databases

PDB: http://www.pdb.org

Atomic coordinates of proteins determined by X-ray crystallography or NMR are deposited in the PDB, the worldwide repository for 3-D biological macromolecular structure data.

EBI-MSD: http://www.ebi.ac.uk/msd/ (2003)

suite of web-based search and retrieval interfaces for macromolecular structure research.

Protein Structure Databases

http://www.wwpdb.org/

SLIDE 5

Introduction

Goal: understand the relationship between amino acid sequence and three-dimensional structure in proteins. Can we predict the structure from the sequence?

Currently: comparative (homology) modeling.

See Lecture Thursday (Torsten) Homology Modeling

Similar sequence, similar structure

Homology modeling = comparative protein modeling

Idea: Using experimental 3D-structures of related family members (templates) to calculate a model for a new sequence (target).

Structure is better conserved than sequence

SLIDE 6

Flow chart: analyze a new protein sequence

[Figure: flow chart. Box labels, in the order they were extracted:]

  • Protein sequence
  • Homology modeling
  • Predicted 3D structural model
  • 3D structural analysis in laboratory
  • Structure prediction (secondary structure, fold recognition)
  • Protein family sequence search (Pfam)
  • Database similarity search (BLAST)
  • Relationship to known structure?
  • Does the sequence align with a protein of known structure?
  • Hints for domain assignment? Function?

Secondary structure assignment

DSSP: Dictionary of Secondary Structure of Proteins (Kabsch & Sander, 1983)

  • Based on recognition of hydrogen-bonding patterns in known structures
  • Automated assignment of secondary structure
  • Interprets backbone hydrogen bonds
  • Uses a Coulomb approximation for the hydrogen bond energy (-0.5 kcal/mol cut-off)
  • Secondary structures are assigned to consecutive segments of residues with hydrogen bonds
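The Coulomb approximation mentioned above can be sketched as follows. The slide only gives the -0.5 kcal/mol cut-off; the partial charges (0.42 e, 0.20 e) and the factor 332 are the constants from the original Kabsch & Sander paper, and the function names and the coordinates in the usage note are made up for illustration.

```python
import math

# Constants from Kabsch & Sander (1983); not stated on the slide.
Q1 = 0.42      # partial charge on the C=O group, in units of e
Q2 = 0.20      # partial charge on the N-H group, in units of e
F = 332.0      # dimensional factor, gives kcal/mol for distances in Angstrom
CUTOFF = -0.5  # kcal/mol, the cut-off quoted on the slide


def dssp_hbond_energy(c, o, n, h):
    """Electrostatic energy between a backbone C=O and a backbone N-H group.

    c, o, n, h are (x, y, z) coordinates in Angstrom.
    """
    r_on = math.dist(o, n)
    r_ch = math.dist(c, h)
    r_oh = math.dist(o, h)
    r_cn = math.dist(c, n)
    return Q1 * Q2 * F * (1 / r_on + 1 / r_ch - 1 / r_oh - 1 / r_cn)


def is_hbond(c, o, n, h):
    """A hydrogen bond is assigned when the energy is below the cut-off."""
    return dssp_hbond_energy(c, o, n, h) < CUTOFF
```

For example, a collinear arrangement with C at the origin, O at 1.23 Å, H at 3.13 Å and N at 4.13 Å along the x axis gives about -2.9 kcal/mol, well below the cut-off; moving the N-H group 10 Å further away gives an energy close to zero.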

SLIDE 7

Secondary structure assignment

DSSP secondary structure elements

8 secondary structure classes, mapped onto 3 states:

  • H (α-helix) → H
  • G (3₁₀-helix) → H
  • I (π-helix) → H
  • E (extended strand) → E
  • B (residue in isolated β-bridge) → E
  • T (turn) → L
  • S (bend) → L
  • " " (blank = other) → L
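The reduction from 8 DSSP classes to 3 states listed above amounts to a lookup table. A minimal sketch; the names `DSSP_TO_3STATE` and `reduce_dssp` are illustrative, and mapping any unexpected character to L is an assumption.

```python
# 8-state DSSP code -> 3-state code, as listed on the slide.
DSSP_TO_3STATE = {
    "H": "H", "G": "H", "I": "H",   # helices
    "E": "E", "B": "E",             # strands and bridges
    "T": "L", "S": "L", " ": "L",   # turn, bend, other
}


def reduce_dssp(ss8: str) -> str:
    """Convert a DSSP 8-state string into a 3-state (H/E/L) string.

    Characters outside the table are treated as loop (an assumption).
    """
    return "".join(DSSP_TO_3STATE.get(code, "L") for code in ss8)
```

For example, `reduce_dssp("HGIEBTS ")` returns `"HHHEELLL"`.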

Secondary Structure prediction

What is protein secondary structure prediction?

Simplification of prediction problem 3D → 1D

Why do we need it?

As starting point for 3D modeling:

  • Improve sequence alignments
  • Use in fold recognition (discover family/superfamily relationship)
  • Definition of loops / core regions

SLIDE 8

Secondary Structure prediction

Assumption:

there should be a correlation between amino acid sequence and secondary structure

What can we predict?

  • α-helix
  • β-strand
  • Loop (coil)

Secondary Structure prediction

Projection onto strings of structural assignments

“Secondary Structure” 3-state model:

  • (E) β-Strand, written S in the example below
  • (H) α-Helix
  • (L) Loop

SEQ MRIILLGAPGAGKGTQAQFIMEKYGIPQISTGDMLRAAVKSGSELGKQAK
SS  SSSSSSLLLLLLHHHHHHHHHHHLLLSSSLHHHHHHHHHHHLLLLLLHHH
SS  SSSSSS      HHHHHHHHHHH   SSS HHHHHHHHHHH      HHH

SLIDE 9

Accuracy of prediction

3-state-per-residue accuracy: gives the percentage of correctly predicted residues in the α, β or other state.

$Q_3 = 100 \cdot \frac{1}{N} \sum_{i \in \{H, E, L\}} c_i$

  • $N$ = total number of residues
  • $c_i$ = number of correctly predicted residues in state $i$ (H, E, L)
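As a small worked example of the Q3 definition: summing the correct predictions over the three states is the same as counting the positions where the predicted and observed strings agree. The function name `q3` is illustrative.

```python
def q3(predicted: str, observed: str) -> float:
    """Three-state per-residue accuracy in percent.

    Both arguments are strings over H, E, L of equal, non-zero length.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("empty input")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)
```

With `predicted = "HHHEELLL"` and `observed = "HHHEELLH"`, seven of eight residues agree, so `q3` returns 87.5.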

Performance Evaluation

Assumption: there should be a correlation* between amino acid sequence and secondary structure.

Systematic performance testing is a prerequisite for the reliability of a method.

  • Dataset: PDB
  • Training set: PDB subset, used to derive the correlation*
  • Test set: PDB subset, used to measure Q3

SLIDE 10

Conformational Preferences

Biochimica et Biophysica Acta 916: 200-204 (1987).

[Figure; surviving labels: α, β, RT]

1st Generation secondary structure prediction

1st generation: based on single amino acid propensities

  • Chou and Fasman, 1974
  • Robson, 1976
  • GOR-1: Garnier, Osguthorpe, and Robson, 1978

Preference of particular residues for certain secondary structure elements:

Single-residue statistics: analysis of the frequency of each of the 20 amino acids in α-helices, β-strands or coils

  • Databases of very limited size
  • Q3 accuracy < 55%
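A minimal sketch of how single-residue statistics can be turned into propensities. The slide does not give the formula; the one used here is the usual propensity definition (the fraction of a residue type found in a state, divided by the overall fraction of residues in that state), and it is an assumption that the values tabulated in this deck were derived this way. The function name is illustrative.

```python
from collections import Counter


def propensities(seq: str, ss: str) -> dict:
    """Propensity of each residue type for each secondary structure state.

    seq is an amino acid string and ss the matching state string (e.g. H/E/L).
    A value above 1 means the residue is over-represented in that state.
    """
    if len(seq) != len(ss) or not seq:
        raise ValueError("seq and ss must be non-empty and of equal length")
    total = len(seq)
    n_aa = Counter(seq)             # occurrences of each residue type
    n_state = Counter(ss)           # occurrences of each state
    n_pair = Counter(zip(seq, ss))  # occurrences of (residue, state)
    return {
        (aa, st): (n_pair[(aa, st)] / n_aa[aa]) / (n_state[st] / total)
        for aa in n_aa
        for st in n_state
    }
```

On the toy input `seq = "AAAG"`, `ss = "HHLL"`, alanine gets a helix propensity of about 1.33 and glycine a loop propensity of 2.0. Real tables are computed over a whole set of known structures, not a single chain.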

SLIDE 11

1st Generation secondary structure prediction

Chou and Fasman (partial table):

Amino Acid   Pα     Pβ     Pt
Glu          1.51   0.37   0.74
Met          1.45   1.05   0.60
Ala          1.42   0.83   0.66
Val          1.06   1.70   0.50
Ile          1.08   1.60   0.50
Tyr          0.69   1.47   1.14
Pro          0.57   0.55   1.52
Gly          0.57   0.75   1.56

Name            P(H)  P(E)  P(turn)  f(i)   f(i+1)  f(i+2)  f(i+3)
Alanine         142    83    66      0.06   0.076   0.035   0.058
Arginine         98    93    95      0.07   0.106   0.099   0.085
Aspartic Acid   101    54   146      0.147  0.11    0.179   0.081
Asparagine       67    89   156      0.161  0.083   0.191   0.091
Cysteine         70   119   119      0.149  0.05    0.117   0.128
Glutamic Acid   151    37    74      0.056  0.06    0.077   0.064
Glutamine       111   110    98      0.074  0.098   0.037   0.098
Glycine          57    75   156      0.102  0.085   0.19    0.152
Histidine       100    87    95      0.14   0.047   0.093   0.054
Isoleucine      108   160    47      0.043  0.034   0.013   0.056
Leucine         121   130    59      0.061  0.025   0.036   0.07
Lysine          114    74   101      0.055  0.115   0.072   0.095
Methionine      145   105    60      0.068  0.082   0.014   0.055
Phenylalanine   113   138    60      0.059  0.041   0.065   0.065
Proline          57    55   152      0.102  0.301   0.034   0.068
Serine           77    75   143      0.12   0.139   0.125   0.106
Threonine        83   119    96      0.086  0.108   0.065   0.079
Tryptophan      108   137    96      0.077  0.013   0.064   0.167
Tyrosine         69   147   114      0.082  0.065   0.114   0.125
Valine          106   170    50      0.062  0.048   0.028   0.053

Chou-Fasman Pij-values

SLIDE 12

Chou-Fasman

How it works:

  • a. Assign all of the residues the appropriate set of parameters.
  • b. Identify α-helix and β-sheet regions. Extend the regions in both directions.
  • c. If structures overlap, compare average values for P(H) and P(E) and assign secondary structure based on the best scores.
  • d. Turns are modeled as tetra-peptides using 2 different probability values.

Assign Pij values

1. Assign all of the residues the appropriate set of parameters

           T    S    P    T    A    E    L    M    R    S    T    G
P(H)      69   77   57   69  142  151  121  145   98   77   69   57
P(E)     147   75   55  147   83   37  130  105   93   75  147   75
P(turn)  114  143  152  114   66   74   59   60   95  143  114  156

SLIDE 13

Scan peptide for α-helix regions

  • 2. Identify regions where 4/6 aa have a P(H) > 100: “alpha-helix nucleus”

           T    S    P    T    A    E    L    M    R    S    T    G
P(H)      69   77   57   69  142  151  121  145   98   77   69   57

Extend α-helix nucleus

  • 3. Extend the helix in both directions until a set of four residues has an average P(H) < 100.

           T    S    P    T    A    E    L    M    R    S    T    G
P(H)      69   77   57   69  142  151  121  145   98   77   69   57

Repeat steps 1 – 3 for entire peptide
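Steps 2 and 3 can be sketched as below. This is a simplified reading, not a reference implementation: the slide does not say exactly how the four-residue window is placed during extension, so the code assumes the window ends at the residue being added. The P(H) values are taken from the parameter table in this deck; in particular T uses the Threonine row (83), whereas the worked example on the slide lists 69 under T, which is the Tyrosine value in that table. Function names are illustrative.

```python
# P(H) values from the Chou-Fasman parameter table (x100 scale).
P_H = {
    "A": 142, "R": 98, "D": 101, "N": 67, "C": 70, "E": 151, "Q": 111,
    "G": 57, "H": 100, "I": 108, "L": 121, "K": 114, "M": 145, "F": 113,
    "P": 57, "S": 77, "T": 83, "W": 108, "Y": 69, "V": 106,
}


def helix_nuclei(seq, window=6, min_formers=4, threshold=100):
    """Start indices of windows with at least min_formers residues above threshold."""
    return [
        i for i in range(len(seq) - window + 1)
        if sum(P_H[aa] > threshold for aa in seq[i:i + window]) >= min_formers
    ]


def extend_helix(seq, start, end, threshold=100):
    """Grow the half-open segment [start, end) residue by residue.

    Assumption: a residue is added while the four-residue window that
    ends at it (counting from inside the helix) averages >= threshold.
    """
    def average(segment):
        return sum(P_H[aa] for aa in segment) / len(segment)

    while start > 0 and average(seq[start - 1:start + 3]) >= threshold:
        start -= 1
    while end < len(seq) and average(seq[end - 3:end + 1]) >= threshold:
        end += 1
    return start, end
```

For `TSPTAELMRSTG`, `helix_nuclei` finds nuclei starting at indices 2, 3 and 4, and `extend_helix(seq, 2, 8)` returns `(2, 11)` under these assumptions. Results depend on the extension convention and on the parameter values used, so they need not match the slide or the CHOFAS output.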

SLIDE 14

Scan peptide for β-sheet regions

  • 4. Identify regions where 3/5 aa have a P(E) > 100: “β-sheet nucleus”. Extend the β-sheet until 4 contiguous residues have an average P(E) < 100. If the region average P(E) > 105 and the average P(E) > average P(H), then “β-sheet”.

           T    S    P    T    A    E    L    M    R    S    T    G
P(H)      69   77   57   69  142  151  121  145   98   77   69   57
P(E)     147   75   55  147   83   37  130  105   93   75  147   75

Chou-Fasman

  • 1. Assign all of the residues in the peptide the appropriate set of parameters.
  • 2. Scan through the peptide and identify regions where 4 out of 6 contiguous residues have P(α-helix) > 100. That region is declared an α-helix. Extend the helix in both directions until a set of four contiguous residues with an average P(α-helix) < 100 is reached. That is declared the end of the helix. If the segment defined by this procedure is longer than 5 residues and the average P(α-helix) > P(β-sheet) for that segment, the segment can be assigned as a helix.
  • 3. Repeat this procedure to locate all of the helical regions in the sequence.
  • 4. Scan through the peptide and identify a region where 3 out of 5 of the residues have a value of P(β-sheet) > 100. That region is declared a β-sheet. Extend the sheet in both directions until a set of four contiguous residues with an average P(β-sheet) < 100 is reached. That is declared the end of the β-sheet. Any segment of the region located by this procedure is assigned as a β-sheet if the average P(β-sheet) > 105 and the average P(β-sheet) > P(α-helix) for that region.
  • 5. Any region containing overlapping α-helical and β-sheet assignments is taken to be helical if the average P(α-helix) > P(β-sheet) for that region. It is a β-sheet if the average P(β-sheet) > P(α-helix) for that region.
  • 6. To identify a bend at residue number j, calculate the following value:

$p(t) = f_{i}(j) \cdot f_{i+1}(j+1) \cdot f_{i+2}(j+2) \cdot f_{i+3}(j+3)$

where the f(i) value is used for residue j, the f(i+1) value for residue j+1, the f(i+2) value for residue j+2 and the f(i+3) value for residue j+3. If (1) p(t) > 0.000075; (2) the average value of P(turn) > 1.00 in the tetra-peptide (that is, > 100 on the ×100 scale of the parameter table); and (3) the averages for the tetra-peptide obey the inequality P(α-helix) < P(turn) > P(β-sheet), then a β-turn is predicted at that location.
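The turn rule in step 6 can be sketched as follows. To keep the example short, only a few rows of the parameter table are included; the dictionary layout and the function name are illustrative, and the averages are compared on the ×100 scale used in the table.

```python
# residue: (P(H), P(E), P(turn), f(i), f(i+1), f(i+2), f(i+3)), from the table.
PARAMS = {
    "N": (67, 89, 156, 0.161, 0.083, 0.191, 0.091),
    "P": (57, 55, 152, 0.102, 0.301, 0.034, 0.068),
    "D": (101, 54, 146, 0.147, 0.110, 0.179, 0.081),
    "G": (57, 75, 156, 0.102, 0.085, 0.190, 0.152),
    "A": (142, 83, 66, 0.060, 0.076, 0.035, 0.058),
    "E": (151, 37, 74, 0.056, 0.060, 0.077, 0.064),
    "L": (121, 130, 59, 0.061, 0.025, 0.036, 0.070),
    "M": (145, 105, 60, 0.068, 0.082, 0.014, 0.055),
}


def is_turn(tetrapeptide: str) -> bool:
    """Apply the three Chou-Fasman turn conditions to four residues."""
    if len(tetrapeptide) != 4:
        raise ValueError("expected exactly four residues")
    rows = [PARAMS[aa] for aa in tetrapeptide]

    # p(t): position-specific bend frequencies f(i) .. f(i+3) multiplied together.
    p_t = 1.0
    for offset, row in enumerate(rows):
        p_t *= row[3 + offset]

    def mean(column):
        return sum(row[column] for row in rows) / 4

    p_helix, p_sheet, p_turn = mean(0), mean(1), mean(2)
    return (p_t > 0.000075 and p_turn > 100
            and p_turn > p_helix and p_turn > p_sheet)
```

With these values, `is_turn("NPDG")` is true (p(t) is about 0.0013 and the average P(turn) is 152.5), while `is_turn("AELM")` is false because p(t) is about 0.000007.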

SLIDE 15

CHOFAS predicts protein secondary structure
version 2.0u61 September 1998
Please cite: Chou and Fasman (1974) Biochem., 13:222-245
Chou-Fasman plot of @, 12 aa; SEQ1 sequence.

TSPTAELMRSTG
helix <>
sheet EEEEEEE
turns T
Residue totals: H: 2 E: 7 T: 1
percent: H: 16.7 E: 58.3 T: 8.3

Chou-Fasman Results

2nd Generation secondary structure prediction

Improvements

  • Larger database of protein structures
  • Segment-based statistics (11-21 residue window)

Basic idea:

"How likely is it that the central residue in a window adopts a particular secondary structure state?"

Algorithm used:

Presumably all conceivable algorithms on this planet have been applied to the Secondary Structure prediction problem. E.g. statistical information, physicochemical properties, sequence patterns, neural networks, graph theory, expert rules
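The window idea can be sketched independently of the algorithm that later scores each window. In this sketch the window size of 13 is just one value inside the 11-21 range quoted above, and padding the chain ends with "." is an assumption.

```python
def windows(seq: str, size: int = 13, pad: str = ".") -> list:
    """Return one window per residue, centred on that residue.

    size must be odd so that the window has a central position.
    Positions beyond the chain ends are filled with the pad symbol.
    """
    if size % 2 == 0:
        raise ValueError("window size must be odd")
    half = size // 2
    padded = pad * half + seq + pad * half
    return [padded[i:i + size] for i in range(len(seq))]
```

For example, `windows("MRIIL", size=3)` returns `[".MR", "MRI", "RII", "IIL", "IL."]`; the middle character of each window is the residue whose state is being predicted.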

SLIDE 16

(H) α-Helix, local interactions

Neural Network

Artificial intelligence:

Computer programs are trained to recognize amino acid patterns that are located in known secondary structures and to distinguish them from other patterns not located in these structures.

A neural network (NN) can detect interactions between amino acids in a sequence window.

Neural Networks for Secondary Structure prediction

[Figure (B. Rost, Columbia, New York): a sequence window with its observed states, D (L) R (E) Q (E) G (E) F (E) V (E) P (E) A (H) A (H) Y (H) V (E) K (E) K (E), is presented to the input layer, where each position is coded over the symbols A C D E F G H I K L M N P Q R S T V W Y and "."; weights connect the input layer to the hidden layer and the hidden layer to the output layer, which has the units H, E, L.]
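The input coding suggested by the figure (20 amino acids plus a "." symbol) can be sketched as a one-hot encoding of the window. This is an illustration of the general technique; the exact coding used by any particular program is not given in the slides.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY."  # 20 amino acids plus the spacer symbol


def encode_window(window: str) -> list:
    """One-hot encode a sequence window for a network input layer.

    Each position contributes len(ALPHABET) = 21 input units, exactly
    one of which is set to 1.
    """
    vector = []
    for symbol in window:
        units = [0] * len(ALPHABET)
        units[ALPHABET.index(symbol)] = 1  # raises ValueError for unknown symbols
        vector.extend(units)
    return vector
```

The 13-residue window from the figure, `DRQGFVPAAYVKK`, becomes a vector of 13 × 21 = 273 input units containing thirteen ones.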

SLIDE 17

Neural Networks for secondary structure predictions

[Figure (B. Rost, Columbia, New York): the window D R Q G F V P A A Y V K K with output units H = 0.19, E = 0.61, L = 0.17.]

The winner is: E
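The "winner takes all" step can be sketched in a few lines. The reliability index below is an assumption added for illustration: it scales the gap between the two strongest outputs to the range 0 to 9, which is how such indices are commonly defined, but the slides do not give a formula.

```python
def winner(outputs: dict) -> str:
    """Return the state whose output unit has the largest value."""
    return max(outputs, key=outputs.get)


def reliability(outputs: dict) -> int:
    """Hypothetical reliability index: 10 x (best - second best), capped at 9."""
    ranked = sorted(outputs.values(), reverse=True)
    return min(9, int(10 * (ranked[0] - ranked[1])))
```

With the values from the figure, `winner({"H": 0.19, "E": 0.61, "L": 0.17})` returns `"E"`, and the illustrative reliability index is 4.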

Neural Networks

Benefits

  • Generally applicable
  • Can capture higher-order correlations
  • Can use inputs other than sequence information

Drawbacks

  • Needs many data points (solved structures)
  • Risk of overtraining

SLIDE 18

2nd Generation secondary structure prediction

Methods:

  • GORIII
  • COMBINE

Q3 accuracy < 70%

Problems with first and second generation methods

  • Q3 accuracy < 70%
  • β-strands predicted with only 28 - 48 % accuracy (slightly better than random)
  • Predicted helices and strands are too short

SLIDE 19

3rd Generation secondary structure prediction

Breakthrough: Using evolutionary information

            1                                                   50
fyn_human   VTLFVALYDY EARTEDDLSF HKGEKFQILN SSEGDWWEAR SLTTGETGYI
yrk_chick   VTLFIALYDY EARTEDDLSF QKGEKFHIIN NTEGDWWEAR SLSSGATGYI
fgr_human   VTLFIALYDY EARTEDDLTF TKGEKFHILN NTEGDWWEAR SLSSGKTGCI
yes_chick   VTVFVALYDY EARTTDDLSF KKGERFQIIN NTEGDWWEAR SIATGKTGYI
src_avis2   VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
src_aviss   VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
src_avisr   VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
src_chick   VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
stk_hydat   VTIFVALYDY EARISEDLSF KKGERLQIIN TADGDWWYAR SLITNSEGYI
src_rsvpa   .......... ESRIETDLSF KKRERLQIVN NTEGTWWLAH SLTTGQTGYI
hck_human   ..IVVALYDY EAIHHEDLSF QKGDQMVVLE ES.GEWWKAR SLATRKEGYI
blk_mouse   ..FVVALFDY AAVNDRDLQV LKGEKLQVLR .STGDWWLAR SLVTGREGYV
hck_mouse   .TIVVALYDY EAIHREDLSF QKGDQMVVLE .EAGEWWKAR SLATKKEGYI
lyn_human   ..IVVALYPY DGIHPDDLSF KKGEKMKVLE .EHGEWWKAK SLLTKKEGFI
lck_human   ..LVIALHSY EPSHDGDLGF EKGEQLRILE QS.GEWWKAQ SLTTGQEGFI
ss81_yeast  .....ALYPY DADDDdeISF EQNEILQVSD .IEGRWWKAR R.ANGETGII
abl_mouse   ..LFVALYDF VASGDNTLSI TKGEKLRVLG YnnGEWCEAQ ..TKNGQGWV
abl1_human  ..LFVALYDF VASGDNTLSI TKGEKLRVLG YnnGEWCEAQ ..TKNGQGWV
src1_drome  ..VVVSLYDY KSRDESDLSF MKGDRMEVID DTESDWWRVV NLTTRQEGLI
mysd_dicdi  .....ALYDF DAESSMELSF KEGDILTVLD QSSGDWWDAE L..KGRRGKV
yfj4_yeast  ....VALYSF AGEESGDLPF RKGDVITILK ksQNDWWTGR V..NGREGIF
abl2_human  ..LFVALYDF VASGDNTLSI TKGEKLRVLG YNQNGEWSEV RSKNG.QGWV
tec_human   .EIVVAMYDF QAAEGHDLRL ERGQEYLILE KNDVHWWRAR D.KYGNEGYI
abl1_caeel  ..LFVALYDF HGVGEEQLSL RKGDQVRILG YNKNNEWCEA RlrLGEIGWV
txk_human   .....ALYDF LPREPCNLAL RRAEEYLILE KYNPHWWKAR D.RLGNEGLI
yha2_yeast  VRRVRALYDL TTNEPDELSF RKGDVITVLE QVYRDWWKGA L..RGNMGIF
abp1_sacex  .....AEYDY EAGEDNELTF AENDKIINIE FVDDDWWLGE LETTGQKGLF

3rd Generation secondary structure prediction

PHD method (Rost and Sander)

  • Combines neural networks with MAXHOM multiple sequence profiles
  • 6-8 percentage points increase in prediction accuracy over standard neural networks

SLIDE 20

3rd Generation secondary structure prediction

[Figure: PHD network. The protein and its aligned sequences are converted into a profile table (columns GSAPD NTEKQ CVHIR LMYFW, counts out of 5 sequences), which feeds the input layer, followed by the first or hidden layer and the second or output layer with units H, E, L; the maximal unit is picked as the current prediction. Other labels in the figure: s0, s1, s2, J1, J2.]

Protein:
: G Y I Y D P E D G D P D D G V N P G T D F :

Alignments:
: G Y I Y D P A V G D P D N G V E P G T E F :
: G Y I Y D P E V G D P T Q N I P P G T K F :
: G Y E Y D P A E G D P D N G V K P G T S F :
: G Y E Y D P A E G D P D N G V K P G T A F :

The input corresponds to the 21*3 bits coding for the profile of one residue.

(B.Rost, Columbia, NewYork)
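How an alignment becomes a profile can be sketched by counting residues per column. The five sequences are the protein and the four aligned sequences shown in the figure; the function name is illustrative, and real profile methods add weighting and gap handling that are left out here.

```python
from collections import Counter

ALIGNMENT = [
    "GYIYDPEDGDPDDGVNPGTDF",  # the protein itself
    "GYIYDPAVGDPDNGVEPGTEF",
    "GYIYDPEVGDPTQNIPPGTKF",
    "GYEYDPAEGDPDNGVKPGTSF",
    "GYEYDPAEGDPDNGVKPGTAF",
]


def column_profile(alignment: list) -> list:
    """Count how often each residue occurs in each alignment column."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return [Counter(column) for column in zip(*alignment)]


profile = column_profile(ALIGNMENT)
# Relative frequencies for the third column (index 2): I in 3 of 5, E in 2 of 5.
frequencies = {aa: n / len(ALIGNMENT) for aa, n in profile[2].items()}
```

A fully conserved column such as the first one gives a single count of 5 for G, while the third column gives I: 3 and E: 2. These per-column counts, rather than a single residue, are what the network sees at each window position.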

3rd generation secondary structure prediction

PHD (Rost et al.)

Q3 better than 72 %

[Figure: Q3 values of 59 %, 65 % and 72 %] [ B.Rost (2001) J.Struct.Biol. 134, 204 ]

Prediction reliability (0 = weak, 9 = strong)

[http://www.embl-heidelberg.de/predictprotein/]

SLIDE 21

3rd generation secondary structure prediction

PSI-Pred (Jones, DT)

  • Uses alignments from iterative sequence searches (PSI-Blast) as input to a neural network
  • Better predictions due to better sequence profiles
  • Available as a stand-alone program and via the web

[http://bioinf.cs.ucl.ac.uk/psipred/psiform.html]

How accurate are predictions today?

[Figure (B. Rost, Columbia, New York): histogram of the number of protein chains against per-residue accuracy (Q3); <Q3> = 72.3%; sigma = 10.5%; labelled chains: 1spf, 1bct, 1stu, 3ifm, 1psm]

SLIDE 22

How accurate are predictions today?

Q3 = 72-76 % ± 11 % (on average)

  • I.e. about 30 % of predicted assignments are wrong
  • I.e. for 2/3 of all proteins, between 60 % and 80 % of residues are predicted correctly
  • I.e. for your protein, accuracy can be lower than 60 % or higher than 80 %

How accurate are predictions today?

  • At present it is not always possible to predict secondary structure with very high reliability.
  • As methods have improved (from the 1st to the 3rd generation), prediction has reached an average accuracy of 64 % - 75 %.

SLIDE 23

Secondary Structure Prediction

META-PredictProtein Server

http://cubic.bioc.columbia.edu/meta/

  • Simultaneous submission tool to several other servers, e.g. JPRED, PHD, PROF, PSIPRED, SAM-T99, APSSP2, SSpro
  • Also includes motif searches, domain assignments, TM predictions, etc.

1D-Structure prediction

  • Secondary Structure Prediction
  • Solvent Accessibility Prediction

Identify exposed residues, e.g. for mutation studies, epitopes, etc.

SLIDE 24

1D-Structure prediction

Projection onto strings of structural assignments

E.g. “Solvent Accessibility” (buried or exposed?)

A B C D E F G…
¦ ¦ ¦ ¦ ¦ ¦ ¦
e e b b e e e…

Accuracy of two-state prediction: 75% ± 10 %

PHDacc: solvent accessibility prediction

[http://cubic.bioc.columbia.edu/predictprotein/]

SLIDE 25

1D-Structure prediction

  • Secondary Structure Prediction
  • Solvent Accessibility Prediction
  • Transmembrane Helices prediction

PHDhtm [http://www.embl-heidelberg.de/predictprotein/predictprotein.html]
TMHMM [http://www.cbs.dtu.dk/services/TMHMM/]
TMpred [http://www.ch.embnet.org/software/TMPRED_form.html]

Fold Recognition

SLIDE 26

[ PDB: http://www.pdb.org ]

Growth of the Protein Data Bank PDB

Christine Orengo (Structure, 1997, 5, 1093-1108)

Fold Classification Databases

SLIDE 27

Fold Classification Databases

Protein structure classification databases

Databases: provide structural comparisons for the proteins in PDB

Methods used to classify the protein structures:

  • Manual examination
  • Fully automatic computer algorithms

Examples:

  • SCOP
  • CATH
  • FSSP

SLIDE 28

[ http://scop.mrc-lmb.cam.ac.uk/scop/ ]

SCOP - Structural Classification of Proteins

  • MRC Cambridge UK: A. Murzin, S. E. Brenner, T. Hubbard, C. Chothia
  • Created by manual inspection
  • Hierarchical classification of protein domain structures
  • Comprehensive description of the structural and evolutionary relationships
  • Organized as a tree structure:

Class: all α class
Fold: globin-like fold (6 helices; folded leaf)
Superfamily: globin-like superfamily
Family: globin and phycocyanin families
Domain: hemoglobin 1, myoglobin, …
Species

Domain = segment of a polypeptide chain that can autonomously fold into a 3D structure

[ http://www.biochem.ucl.ac.uk/bsm/cath_new/ ]

CATH - Protein Structure Classification

  • UCL, Janet Thornton & Christine Orengo
  • Hierarchical classification of protein domain structures
  • Clusters proteins at four major levels:

Class (C)
Architecture (A)
Topology (T)
Homologous superfamily (H)

SLIDE 29

[ http://www.biochem.ucl.ac.uk/bsm/cath_new/ ]

CATH - Protein Structure Classification

  • Class (C): derived from secondary structure content; assigned automatically.
  • Architecture (A): describes the gross orientation of secondary structures, independent of connectivity.
  • Topology (T): clusters structures according to their topological connections and numbers of secondary structures.

FSSP - Fold classification, structure-structure alignment

  • Holm and Sander, EBI, UK
  • Fold classification based on pair-wise structural alignment of PDB structures (DALI program)
  • Clusters of fold types = unique configurations of secondary structure elements

[http://www2.ebi.ac.uk/dali/fssp/]

SLIDE 30

Structural Alignments

  • Protein structure is better conserved than sequence
  • Structural alignments establish equivalences between amino acid residues based on the 3D structures of two or more proteins
  • Structure alignments therefore provide information not available from sequence alignment methods
  • Structural alignments can be used to guide sequence alignments (see: T_COFFEE / SAP)

See Lecture Thursday (Laurent) Sequence alignment

[ PDB: http://www.pdb.org ]

Growth of the Protein Data Bank PDB

SLIDE 31

[ PDB: http://www.pdb.org ]

Growth of the Protein Data Bank PDB

[Figure legend: new folds per year; “old” folds per year]

The number of folds appears to be limited

  • Many different sequences will adopt the same fold
  • There is a reasonable probability that a new sequence will possess an already identified fold

Goal of fold recognition: discover which fold is best matched

  • Sequence alignment methods (e.g. HMM)
  • 3D structure prediction methods (e.g. threading)

SLIDE 32

Find a compatible fold for a given sequence ....

>Protein XY
MSTLYEKLGGTTAVDLAV
DKFYERVLQDDRIKHFFA
DVDMAKQRAHQKAFLTYA
FGGTDKYDGRYMREAHKE
LVENHGLNGEHFDAVAED
LLATLKEMGVPEDLIAEV
AAVAGAPAHKRDVLNQ

?

Fold recognition

The number of protein folds that occur in nature is limited.

Fold recognition can be used to:

  • Identify templates for modeling
  • Assign protein function

Fold recognition: sequence based

Sequence alignment (HMM) can be used to identify a family of homologous proteins that have similar sequences and presumably a similar 3D structure.

Example: SUPERFAMILY database

Uses a library (covering all proteins of known structure) consisting of 1294 SCOP superfamilies, each of which is represented by a group of hidden Markov models (HMMs).

[http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/]

SLIDE 33

Fold recognition: threading

The amino acid sequence of a query protein is examined for compatibility with the structural core of known protein structures:

  • Structure profile method (e.g. 3D-PSSM)
  • Contact potential method (e.g. 123D)

Fold recognition methods

  • 3DPSSM: three-dimensional position-specific scoring matrix

Kelley et al, JMB, 299, 499 (2000)

http://www.sbg.bio.ic.ac.uk/~3dpssm/

SLIDE 34

Fold recognition and Function

Some words of warning concerning fold recognition:

  • There is no simple close association of fold and function in a one-to-one sense.
  • The five most versatile folds (TIM-barrel, alpha-beta hydrolase, Rossmann, P-loop containing NTP hydrolase, ferredoxin fold) accommodate from six to as many as 16 functions.
  • The two most versatile enzymatic functions (hydrolases and O-glycosyl glucosidases) are associated with seven folds each.

[Figure: reaction schemes (chemical structure drawings) for Aspartase [1JSW] and Histidase [1B8F], shown together with δ2-Crystallin [1AUW], an avian eye lens protein]

Functional assignment by fold recognition ?

SLIDE 35

Fold Recognition Servers

Meta server

http://bioinfo.pl/meta/

3DPSSM

http://www.sbg.bio.ic.ac.uk/servers/3dpssm/

GenTHREADER

http://bioinf.cs.ucl.ac.uk/psipred/

FUGUE2

http://www-cryst.bioc.cam.ac.uk/~fugue/prfsearch.html

SAM

http://www.cse.ucsc.edu/research/compbio/HMM-apps/T99-query.html

FOLD

http://fold.doe-mbi.ucla.edu/

FFAS/PDBBLAST

http://bioinformatics.burnham-inst.org/

References

  • D.W. Mount, Bioinformatics, CSHLP.
  • P.E. Bourne, H. Weissig, Structural Bioinformatics, Wiley-Liss and Sons.
  • Methods in Molecular Biology 143: Protein Structure Prediction, Humana Press.
  • Protein Structure Prediction: A Practical Approach, Oxford University Press.