Outline Day 1 & 2 Introduction: The protein structure knowledge - - PDF document



slide-1
SLIDE 1

Swiss Institute of Bioinformatics

Torsten Schwede
Biozentrum - Universität Basel
Swiss Institute of Bioinformatics
Klingelbergstr. 50-70
CH-4056 Basel, Switzerland
Tel: +41-61 267 15 81

SIB Doctoral School in Bioinformatics Advanced Course

Protein Structure: Prediction and Analysis

Day 2: Protein Structure Modeling

Lausanne, September 1-5, 2008

Outline Day 1 & 2

  • Introduction: The protein structure knowledge gap
  • Recap: Basic principles of proteins and their 3-dimensional structures
  • Protein Structure modeling and prediction
  • Comparative protein structure modeling
  • What happened to fold recognition?
  • De novo prediction
  • Evaluation and Assessment of Protein Structure Model Quality

Practicals & Tutorials:

  • Tutorial: Structure Visualization with DeepView (Nicolas Guex, SIB Lausanne)
  • Practical: Examples of comparative modeling and model evaluation

Exam / credit points:

  • Student presentations on one of the evaluation examples
slide-2
SLIDE 2

The number of distinct protein domains in nature is limited.

Chothia (1992) Proteins. One thousand families for the molecular biologist. Nature 357: 543-4.

Idea: determine structures of representative proteins and then derive most other structures by homology modeling.

Structural Genomics (Protein Structure Initiative, Riken in Japan, SPINE in Europe)

Fold Classification

  • Fold classification is important for the systematic study of protein structure evolution.
  • Multi-domain proteins have to be divided into domains prior to classification.
  • There is no consensus on how to delineate the domains.

Three main protein structure classification databases are commonly used:

  • SCOP: manual classification based on evolutionary information
  • CATH: semi-automatic classification based on geometric criteria
  • FSSP: automatic classification based on direct structural similarity

slide-3
SLIDE 3

[ http://www.biochem.ucl.ac.uk/bsm/cath_new/ ]

The CATH database is a hierarchical domain classification of protein structures in the Brookhaven protein databank.

UCL, Janet Thornton & Christine Orengo

CATH clusters proteins semi-automatically at four major levels:

  • Class (C)
  • Architecture (A)
  • Topology (T)
  • Homologous superfamily (H)

Fold Classification Databases

[ http://www.cathdb.info ]

Protein Structure Classification

  • Class (C): derived from secondary structure content; assigned automatically.
  • Architecture (A): describes the gross orientation of secondary structures, independent of connectivity.
  • Topology (T): clusters structures according to their topological connections and numbers of secondary structures.
  • Homologous Superfamily (H): groups together protein domains which are thought to share a common ancestor and can therefore be described as homologous.

slide-4
SLIDE 4

Top of the hierarchy:

Fold Classification Databases

Example: 1EWF

slide-5
SLIDE 5

Example: 1EWF

[ http://scop.mrc-lmb.cam.ac.uk/scop/ ]

Structural Classification of Proteins: SCOP

MRC Cambridge (UK), Alexey Murzin, Brenner S. E., Hubbard T., Chothia C.

  • hierarchical classification of protein domain structures
  • created by manual inspection
  • comprehensive description of the structural and evolutionary relationships
  • organized as a hierarchical structure:
      Class
      Fold
      Superfamily
      Family
      Species

Fold Classification Databases

slide-6
SLIDE 6

The different major levels in the hierarchy are:

  • Fold: Major structural similarity. Proteins are defined as having a common fold if they have the same major secondary structures in the same arrangement and with the same topological connections.
  • Superfamily: Probable common evolutionary origin. Proteins that have low sequence identities, but whose structural and functional features suggest that a common evolutionary origin is probable, are placed together in superfamilies.
  • Family: Clear evolutionary relationship. Proteins clustered together into families are clearly evolutionarily related. Generally, this means that pairwise residue identities between the proteins are 30% and greater.

Fold Classification Databases

slide-7
SLIDE 7

Fold Classification Databases

slide-8
SLIDE 8

qCOPS and Fold Space Navigator

Quantitative classification of protein structures, navigation through fold space and visualization of pairwise structure similarities.

http://www.came.sbg.ac.at

Protein Structure / Fold Databases

  • PDB: http://www.pdb.org
  • EBI-MSD: http://www.ebi.ac.uk/msd/
  • SCOP: http://scop.mrc-lmb.cam.ac.uk/scop/
  • CATH: http://www.biochem.ucl.ac.uk/bsm/cath_new/
  • FSSP: http://ekhidna.biocenter.helsinki.fi/dali/start

slide-9
SLIDE 9

Comparing Protein Structures

Why do we want to compare protein structures?

  • Classify structures
  • Identify structural movements (induced fit, NMR, etc.)
  • Analyze evolutionary relationships
  • Identify recurring structural motifs
  • Assess quality of predicted models
  • …

Comparing Protein Structures

  • What do we need to compare structures?
  • Protein sequences can be treated as linear strings of letters. For a given similarity matrix, sequences can be aligned optimally using dynamic programming (DP).
  • Protein structures are 3-dimensional objects. We need algorithms (analogous to DP for sequences) which find an optimal match for two shapes, given a certain similarity measure.

slide-10
SLIDE 10

Comparing Protein Structures

What do we need to compare structures?

1. Structural feature description (description A, description B)
2. Comparison / superposition algorithms
3. Distance / similarity measure

Local or global comparison?

[Figure: global vs. local comparison of two structures; panel labels n = 4 and n = 5]

Comparing Protein Structures

slide-11
SLIDE 11

Distance measure: Root mean square deviation

Comparing two sets of points (= atoms in structures) A = {a1 … an} and B = {b1 … bn}, with

  • ai: position vector of atom i in structure A (bi accordingly for structure B)
  • n: number of equivalent atoms

We need to define a 1:1 correspondence between the atoms in A and B. The root mean square deviation is calculated from the squared Euclidean distances between corresponding points:

$$\mathrm{r.m.s.d.} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left\| a_i - b_i \right\|^2}$$
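The r.m.s.d. definition above can be sketched in a few lines of Python. This is a minimal illustration, not code from the course: the function name `rmsd` and the toy coordinates are assumptions, and the points are compared as given, without any prior superposition.

```python
import math

def rmsd(a, b):
    """Root mean square deviation between two equally long lists of 3D points.

    a[i] and b[i] are assumed to be corresponding atoms (1:1 correspondence).
    """
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty point sets of equal length")
    total = 0.0
    for p, q in zip(a, b):
        # squared Euclidean distance between corresponding atoms
        total += sum((pi - qi) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(total / len(a))

# Hypothetical toy coordinates: structure B is structure A shifted by 1 Å along z.
A = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
B = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(A, B))  # every atom is displaced by exactly 1 Å, so this prints 1.0
```

Because no superposition is applied, a pure translation of one structure already produces a non-zero value; this is why the superposition step discussed later is needed before the r.m.s.d. becomes meaningful.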

Comparing Protein Structures

Distances in Euclidean Space

For two points x = (x1, x2, x3, …, xn) and y = (y1, y2, y3, …, yn), a p-norm distance is defined as:

1-norm distance:

$$\sum_{i=1}^{n} \left| x_i - y_i \right|$$

2-norm distance:

$$\left( \sum_{i=1}^{n} \left| x_i - y_i \right|^2 \right)^{1/2}$$

p-norm distance:

$$\left( \sum_{i=1}^{n} \left| x_i - y_i \right|^p \right)^{1/p}$$

In Euclidean space R^n, distances are normally given as Euclidean distance (= 2-norm distance), which is a generalization of the Pythagorean theorem. p need not be an integer, but cannot be less than 1.
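As a small worked example of the p-norm distance, the following Python sketch evaluates the 1-norm and 2-norm distances for one pair of points. The function name `p_norm_distance` and the sample points are illustrative assumptions.

```python
def p_norm_distance(x, y, p=2.0):
    """p-norm distance between two points given as equally long sequences."""
    if p < 1:
        raise ValueError("p must not be less than 1")
    if len(x) != len(y):
        raise ValueError("points must have the same dimension")
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1.0 / p)

x = (0.0, 0.0)
y = (3.0, 4.0)
print(p_norm_distance(x, y, 1))  # 1-norm: |3| + |4| = 7.0
print(p_norm_distance(x, y, 2))  # 2-norm: sqrt(9 + 16) = 5.0 (Pythagorean theorem)
```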

slide-12
SLIDE 12

Distances Mathematically, “distance” is a function which meets the following criteria:

  • One can find the distance between any two points.
  • That distance is a distinct real number.
  • It is positive definite: d(x,y) ≥ 0, and d(x,y) = 0 if and only if x = y.
  • It is symmetric: d(x,y) = d(y,x).
  • It satisfies the triangle inequality: d(x,z) ≤ d(x,y) + d(y,z).

Such a function is known as a metric. Geometrically, the triangle inequality states that the sum of the lengths of any two sides of a triangle is not less than the length of the remaining side:

Comparing Protein Structures

[Figure: triangle with the corners X, Y and Z]

Comparing Protein Structures

Protein Structure Superposition

  • Assume that we know the correspondence set between A and B (e.g. NMR ensembles, induced fit).
  • Task: find a rigid transformation RT which best superposes A = {a1 … an} and B = {b1 … bn}.
  • Many solutions exist in image analysis, mechanical engineering, …
  • Find the rigid transformation RT which minimizes the error function E:

$$E = \min_{RT} \sum_{i=1}^{n} \left\| RT(a_i) - b_i \right\|^2$$

slide-13
SLIDE 13

Comparing Protein Structures

Protein Structure Superposition

The rigid transformation RT has a translational component T and a rotational component R: RT(a) = R(a) + T. The error function to be minimized becomes:

$$E = \min_{RT} \sum_{i=1}^{n} \left\| R a_i - b_i + T \right\|^2$$

Comparing Protein Structures

  • 1. The translational component

The error function is at its minimum with respect to the translation when:

$$\frac{\partial E}{\partial T} = 2 \sum_{i=1}^{n} \left( R a_i - b_i + T \right) = 0$$

$$T = \frac{1}{n} \left( -R \sum_{i=1}^{n} a_i + \sum_{i=1}^{n} b_i \right)$$

If both structures A and B are centered on the coordinate origin, Σai and Σbi become 0, and then also T = 0. In the first step, we therefore translate the centers of structures A and B to the origin of the coordinate system.

slide-14
SLIDE 14

Comparing Protein Structures

  • 2. The rotational component

Shifting {an} and {bn} to their respective barycenters a0 and b0 gives ai' = ai - a0 and bi' = bi - b0, and the error function reduces to:

$$E = \min_{R} \sum_{i=1}^{n} \left\| R a'_i - b'_i \right\|^2$$

Several methods have been proposed to find the optimal solution for the rotation R.

Comparing Protein Structures

  • 2. The rotational component

The most common method, by Kabsch (1976), involves the singular value decomposition (SVD) of the 3x3 covariance matrix C = AB^T, with A = [a'1 … a'n] and B = [b'1 … b'n]. From the singular values of C (σ1 ≥ σ2 ≥ σ3, χ = sgn(det C)), the rotation matrix R and the remaining residual can be calculated:

$$E_{\min} = \sum_{i=1}^{n} \left( \left\| a'_i \right\|^2 + \left\| b'_i \right\|^2 \right) - 2 \left( \sigma_1 + \sigma_2 + \chi \sigma_3 \right), \qquad \mathrm{r.m.s.d.} = \sqrt{E_{\min} / n}$$

Dill and co-workers (2004) have proposed the use of quaternions to derive the RT operator and the r.m.s.d. This method also allows "gradients of the r.m.s.d." to be derived with respect to some parameter.

(Coutsias et al., J. Comput. Chem. (2004) 25, 1849-1857)
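The two steps above (centering, then SVD of the covariance matrix) can be sketched with NumPy as follows. This is an illustrative sketch, not the code used in DeepView or any other tool: the function name `kabsch_superpose` and the toy data are assumptions, and the rotation is built with the standard result R = V diag(1, 1, χ) U^T for C = U Σ V^T, which is not spelled out on the slide.

```python
import numpy as np

def kabsch_superpose(a, b):
    """Return (R, T, rmsd) so that R @ a_i + T matches b_i in the least-squares sense."""
    a = np.asarray(a, dtype=float)          # shape (n, 3), one atom per row
    b = np.asarray(b, dtype=float)
    center_a, center_b = a.mean(axis=0), b.mean(axis=0)
    a0, b0 = a - center_a, b - center_b     # step 1: move both barycenters to the origin
    C = a0.T @ b0                           # 3x3 covariance matrix (A B^T with atoms as columns)
    U, S, Vt = np.linalg.svd(C)             # singular values are sorted: S[0] >= S[1] >= S[2]
    # chi = +1 or -1; it prevents the solution from being a reflection
    chi = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, chi]) @ U.T
    T = center_b - R @ center_a
    # residual from the singular values, as in the formula above
    e_min = (a0 ** 2).sum() + (b0 ** 2).sum() - 2.0 * (S[0] + S[1] + chi * S[2])
    rmsd = float(np.sqrt(max(e_min, 0.0) / len(a)))
    return R, T, rmsd

# Hypothetical toy data: rotate four points by 30 degrees about z, then shift them.
angle = np.radians(30.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
T_true = np.array([1.0, -2.0, 0.5])
A = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
B = A @ R_true.T + T_true
R, T, rmsd = kabsch_superpose(A, B)
print(np.allclose(R, R_true), np.allclose(T, T_true), rmsd < 1e-6)
```

Since B is an exact rigid-body copy of A, the recovered rotation and translation should match the ones used to generate it and the residual should vanish up to rounding.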

slide-15
SLIDE 15

Comparing Protein Structures

Protein Structure Superposition

Example 1: Ensembles of NMR structures (PDB: 1A0N)

Rmsd (backbone atoms) between first and tenth structure: 3.55 Å

Comparing Protein Structures

Protein Structure Superposition

Example 2: Induced fit of Adenylate Kinases

PDB: 1AKE(A) / 4AKE(A)

Rmsd (backbone atoms):

  • 7.15 Å (global rmsd of 214 residues)
  • 1.27 Å (local rmsd of 125 residues)

slide-16
SLIDE 16

Comparing Protein Structures

  • Protein Structure Superposition / Alignment Problem

We need an optimal subset of corresponding residues in both structures A = {a1 … an} and B = {b1 … bn} to find the optimal RT operator which minimizes the distance measure (here: r.m.s.d.).

Conversely, we need an optimal RT operator to generate a structural alignment, which allows us to define the optimal subset of corresponding residues in both structures A = {a1 … an} and B = {b1 … bn} which minimizes the distance measure.

Comparing Protein Structures

  • Protein Structure Superposition / Alignment Problem

A method needs to identify the largest possible set of corresponding atoms/residues that gives a good superposition (= a low r.m.s.d.). Several methods have been developed:

  • DALI (Holm & Sander, 1993)
  • SSAP (Orengo & Taylor, 1989)
  • VAST (Gibrat et al., 1996)
  • CE (Shindyalov & Bourne, 1998)
  • SSM (Krissinel & Henrick, 2004)
  • qCOPS (Sippl, 2008)
slide-17
SLIDE 17

Comparing Protein Structures

  • The “naïve” approach as implemented in DeepView’s “Iterative magic fit” function:

1. Generate a local pair-wise sequence alignment between the structures A and B using SIM* as seed alignment.
2. Use the aligned segments as correspondence set for a least squares superposition.
3. Generate a structural alignment by
  • including residues from A and B in the alignment which are closer than the distance threshold,
  • removing residues in A and B from the alignment which are more distant than the distance threshold.
4. Use this structure-derived correspondence set to repeat steps 2 and 3 iteratively until the procedure has converged.

* Huang, X. and Miller, W. (1991) A time-efficient, linear-space local similarity algorithm. Adv. Appl. Math. 12, 337-357.
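The iteration in steps 2 to 4 can be sketched as below. This is a simplified illustration under several assumptions, not DeepView's implementation: the seed correspondence (step 1) is passed in as index pairs instead of being computed with SIM, residues are re-paired greedily by distance without enforcing sequence order, and the names `superpose`, `iterative_fit` and the 3 Å threshold are hypothetical.

```python
import numpy as np

def superpose(a, b):
    """Least-squares rotation R and translation T mapping point set a onto b (Kabsch)."""
    ca, cb = a.mean(axis=0), b.mean(axis=0)
    U, S, Vt = np.linalg.svd((a - ca).T @ (b - cb))
    chi = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, chi]) @ U.T
    return R, cb - R @ ca

def iterative_fit(A, B, seed_pairs, threshold=3.0, max_iter=50):
    """Alternate superposition (step 2) and re-pairing (step 3) until nothing changes."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    pairs = sorted(set(seed_pairs))
    for _ in range(max_iter):
        R, T = superpose(A[[i for i, _ in pairs]], B[[j for _, j in pairs]])
        moved = A @ R.T + T
        # all pairwise distances between superposed A atoms and B atoms
        dist = np.linalg.norm(moved[:, None, :] - B[None, :, :], axis=2)
        new_pairs, used_a, used_b = [], set(), set()
        for flat in np.argsort(dist, axis=None):        # closest pairs first
            i, j = divmod(int(flat), dist.shape[1])
            if dist[i, j] >= threshold:
                break                                   # everything further away is dropped
            if i not in used_a and j not in used_b:
                new_pairs.append((i, j))
                used_a.add(i)
                used_b.add(j)
        new_pairs.sort()
        if len(new_pairs) < 3 or new_pairs == pairs:    # too few pairs, or converged
            break
        pairs = new_pairs
    R, T = superpose(A[[i for i, _ in pairs]], B[[j for _, j in pairs]])
    return pairs, R, T

# Hypothetical toy data: B is a rotated and shifted copy of A, except for the last atom.
angle = np.radians(40.0)
Rz = np.array([[np.cos(angle), -np.sin(angle), 0.0],
               [np.sin(angle),  np.cos(angle), 0.0],
               [0.0, 0.0, 1.0]])
A = np.array([[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [3.8, 3.8, 0.0],
              [0.0, 3.8, 1.0], [0.0, 0.0, 3.8], [5.0, 5.0, 5.0]])
B = A @ Rz.T + np.array([5.0, 1.0, -2.0])
B[5] += np.array([10.0, 0.0, 0.0])                      # one residue has moved away
pairs, R, T = iterative_fit(A, B, seed_pairs=[(0, 0), (1, 1), (2, 2)])
print(pairs)
```

Starting from three seed pairs, the correspondence set grows to include the two further residues that superpose well, while the displaced residue stays excluded.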

Comparing Protein Structures

  • The “naïve” approach as implemented in DeepView’s “Iterative magic fit” function:
  • Although mathematically not very sophisticated, the method converges in most cases to a reasonable solution.
  • This simple approach cannot automatically handle domain movements.
  • In each iteration, DeepView corrects for chemically equivalent atoms in symmetric molecules:

Atom labels, conformation 1: CB CG CD1 CD2 CZ CE2 CE1
Atom labels, conformation 2: CB CG CD2 CD1 CZ CE1 CE2

What is the r.m.s.d. between these two conformations of Phe?

slide-18
SLIDE 18

Comparing Protein Structures

Evolution of the Globin Family

[Figure: r.m.s.d. of backbone atoms in the core (0.0 to 2.5 Å) plotted against the percentage of identical residues in the core (100 to 50 %)]

[ Chothia & Lesk (1986) ]

Common core = all residues that can be superposed in 3D. For proteins with > 60% identical residues, the core contains > 90% of all residues, deviating by less than 1.0 Å.

slide-19
SLIDE 19

Sequence similarity implies structural similarity?

[Figure: percentage sequence identity / similarity (20 to 100) plotted against the number of residues aligned (50 to 250), with one curve for identity and one for similarity; the regions are labeled "Sequence identity implies structural similarity" and "Don’t know region"]

(B. Rost, Columbia, New York)

Homology modeling

= Comparative protein modeling
= Knowledge-based modeling

Idea: Using experimental 3D-structures of related family members (templates) to calculate a model for a new sequence (target).

Similar Sequence → Similar Structure

slide-20
SLIDE 20

Known Structures (Templates) + Target Sequence → Template Selection → Alignment Template - Target → Structure modeling → Structure Evaluation & Assessment → Homology Model(s)

Comparative Protein Structure Modeling


  • Protein Data Bank PDB: http://www.pdb.org
  • Database of templates
  • Separate into single chains
  • Remove bad structures (models)
  • Create BLASTable database or fold library (profiles, HMMs)

Comparative Protein Structure Modeling

slide-21
SLIDE 21


Template selection:

1. Sequence similarity / fold recognition
2. Structure quality (resolution, experimental method)
3. Experimental conditions (ligands and cofactors)

Comparative Protein Structure Modeling Comparative Modeling - Fold recognition

  • The "biological" perspective:

Homologous proteins have evolved by molecular evolution from a common ancestor over millions of years. If we can establish homology to a known protein, we can predict aspects of structure and function of a new protein by analogy.

  • The "physical" perspective:

The native conformation of a protein corresponds to a global free energy minimum of the protein / solvent system. To identify a compatible fold, the protein sequence is "threaded" through a library of folds, and empirical energy calculations are used to evaluate compatibility.

Ludwig Boltzmann (20 February 1844 - 5 September 1906)

slide-22
SLIDE 22

Comparative Modeling - Fold recognition

  • Simple example: PDB-BLAST (profile Blast)
  • Pair-wise sequence comparisons assume position independence of amino acid substitutions, i.e. the score of an "ALA → SER" exchange will simply be taken from a substitution matrix.
  • A multiple sequence alignment of a protein family reflects the evolutionary history of the family in the form of position specific substitution frequencies.
  • This information can be represented by PSSMs ("Position Specific Scoring Matrix") and sequence profiles.
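To make the idea of position specific scores concrete, the sketch below derives a small log-odds PSSM from a toy alignment. It is a deliberately simplified illustration and not the PSI-BLAST procedure: the uniform background frequency of 1/20, the pseudocount of 1, the missing sequence weighting and the names `build_pssm` and `score` are all assumptions.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0):
    """Return one dict per alignment column with a log2-odds score per amino acid."""
    background = 1.0 / len(AMINO_ACIDS)      # assumption: uniform background frequencies
    pssm = []
    for col in range(len(alignment[0])):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seq in alignment:
            if seq[col] in counts:           # gaps and unknown residues are skipped
                counts[seq[col]] += 1
        total = sum(counts.values())
        pssm.append({aa: math.log2(counts[aa] / total / background) for aa in AMINO_ACIDS})
    return pssm

def score(pssm, seq):
    """Sum of the position specific scores of an ungapped sequence of matching length."""
    return sum(column[aa] for column, aa in zip(pssm, seq))

# Hypothetical toy alignment of four sequences with three columns.
msa = ["ACD", "ACE", "SCD", "ACD"]
pssm = build_pssm(msa)
print(round(pssm[0]["A"], 2), round(pssm[0]["S"], 2), round(pssm[0]["W"], 2))
print(score(pssm, "ACD") > score(pssm, "WWW"))
```

In the first column, Ala scores higher than Ser, and both score higher than a residue never observed there; a general substitution matrix could not express such a column-specific preference.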

Comparative Modeling - Fold recognition

  • Simple example: PDB-BLAST (profile Blast)
  • PSI-Blast profile search:

1. Build a sequence profile (for the query) by iterative PSI-BLAST search against a large sequence database ("P").
2. Use the profile to search a database of proteins with known structure ("X").

[Figure: query Q, sequence database entries P and proteins of known structure X]

slide-23
SLIDE 23


Fold recognition methods

  • Transitive PSI-BLAST
  • Also known as Intermediate Sequence Search (ISS)
  • Procedure:

1. Find homologues of the query sequence.
2. Find homologues of these "hits".
3. Repeat until no additional "hits" are identified.

[Figure: query Q, intermediate sequences P and a protein of known structure X]
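The three-step procedure is essentially a breadth-first expansion of the hit list. The sketch below illustrates this with a toy lookup table; the function `transitive_search`, the callback `find_homologues` and the sequence names are hypothetical stand-ins for real PSI-BLAST searches.

```python
from collections import deque

def transitive_search(query, find_homologues):
    """Collect all sequences reachable from the query through chains of homologues."""
    seen = {query}
    queue = deque([query])
    while queue:                              # step 3: repeat until no new hits appear
        current = queue.popleft()
        for hit in find_homologues(current):  # steps 1 and 2: search with query, then with hits
            if hit not in seen:
                seen.add(hit)
                queue.append(hit)
    seen.discard(query)
    return seen

# Hypothetical toy "database": P* are sequences, X1 is a protein of known structure.
toy_hits = {"Q": ["P1", "P2"], "P1": ["Q", "P3"], "P2": [], "P3": ["X1"], "X1": []}
found = transitive_search("Q", lambda seq: toy_hits.get(seq, []))
print(sorted(found))  # X1 is reached only through the intermediates P1 and P3
```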

HHSearch

  • HMM-HMM comparison method
  • Uses PSI-Blast profiles to generate alignments for both target and templates for building HMMs
  • Searching by HMM-HMM alignment
  • More sensitive than PSI-Blast

slide-24
SLIDE 24


  • Multiple sequence alignment for pairs > 40% identity

or

  • Use structural alignment of templates to guide sequence alignment of target

or

  • Use separate profiles (or HMMs) for template and targets

Comparative Protein Structure Modeling


  • Errors in template selection or alignment result in bad models
  • Iterative cycles of alignment, modeling and evaluation
  • Build many models, choose the best.

Comparative Protein Structure Modeling

slide-25
SLIDE 25


  • I. Manual model building
  • II. Template based fragment assembly
      – Composer (Sybyl, Tripos)
      – SWISS-MODEL
  • III. Satisfaction of spatial restraints
      – Modeller (Insight II, MSI)
      – CPH-Models

Comparative Protein Structure Modeling

Manual protein modeling

“Human” protein structure modelers often perform better than fully automated methods. However, “human” protein structure modelers use automated approaches as input. During the last CASP experiment, some automated methods started to perform better than a significant number of human predictors.

Comparative Protein Structure Modeling

slide-26
SLIDE 26

Step 1: Identify structurally conserved core regions in the available optimally superposed template structures.

Model building by Rigid Fragment Assembly

Step 2: Build the core of the model backbone. Averaging of core template backbone atoms (weighted by local sequence similarity with the target sequence) can only be applied to _very_ close residues. Otherwise, choose the best representative backbone.

Model building by Rigid Fragment Assembly

slide-27
SLIDE 27

Step 3a: Loop (insertion) modeling by e.g. CSP (constraint space programming), finding “spare parts” in Loop-Databases, or “ab-initio” rebuilding. The most difficult part is the scoring function to rank possible loop conformations. Energy- based scoring schemes require complete side chain models.

Model building by Rigid Fragment Assembly

Very short loops: analytical enumeration of solutions possible (e.g. Wedemeyer, Scheraga (1999) J. Comput. Chem. 20, 819-844)

Medium sized loops:

  • Loop libraries based on PDB fragments
  • Ab initio rebuilding of loops
      e.g. Loopy: Xiang, Z., Soto, C.S. and Honig, B. Evaluating conformational free energies: the colony energy and its application to the problem of loop prediction. Proc. Natl. Acad. Sci. USA 99: 7432-7437.
      e.g. Rapper: M.A. DePristo, P.I.W. de Bakker, S.C. Lovell, T.L. Blundell (2002) Ab initio construction of polypeptide fragments: Efficient generation of accurate, representative ensembles. Proteins Struct. Funct. Genet. 51: 41-55

Large loops? Most methods are pretty bad …

For more details, see

http://nook.cs.ucdavis.edu:8080/~koehl/BioEbook/loop_building.html

Model building by Rigid Fragment Assembly

slide-28
SLIDE 28

Step 3b: Side chain modeling. Find the most probable side chain conformation, using template structure information, rotamer libraries and energetic and packing criteria. This problem occurs e.g. in homology modeling, protein design, protein- protein-docking, automated electron-density fitting in X-Ray crystallography.

Model building by Rigid Fragment Assembly

What is the problem when modeling side- chain rotamers?

  • We are searching the “global minimum energy conformation” (GMEC).
  • The conformational space of protein side chains is huge, i.e. a systematic search of all conformations is not possible.
  • Use rotamer libraries to reduce the complexity.

Model building by Rigid Fragment Assembly

slide-29
SLIDE 29

Rotamer Libraries

  • Only a small fraction of all possible side chain conformations is observed in experimental structures.
  • Rotamer libraries provide an ensemble of likely conformations.
  • The propensity of rotamers depends on the backbone geometry:

Model building by Rigid Fragment Assembly

[Figure: probabilities of the g+, trans and g- rotamers as a function of the backbone angles, p(g+ | phi), p(t | phi), p(g- | phi), p(g+ | psi), p(t | psi), p(g- | psi), for Phe, Tyr, His]

Backbone-dependent Rotamer Libraries

slide-30
SLIDE 30

Examples:

SCWRL3.0

  • A. A. Canutescu, A. A. Shelenkov, and R. L. Dunbrack, Jr. A graph theory algorithm for protein side-chain prediction. Protein Science 12, 2001-2014 (2003).

Penultimate Rotamer Library

  • SC Lovell, JM Word, JS Richardson and DC Richardson (2000) "The Penultimate Rotamer Library" Proteins: Structure Function and Genetics 40: 389-408.

Backbone-dependent Rotamer Libraries

We are searching the “global minimum energy conformation” (GMEC) of all side chain rotamers.

Example: 4 residues with n = 2, 2, 3 and 5 rotamers, respectively, give 2 × 2 × 3 × 5 = 60 combinations.

For a typical modeling or design problem, more than 10^100 conformations must be evaluated. This is technically not feasible. Possible solutions:

Side Chain Modeling

  • Monte Carlo Searching
  • Genetic algorithms
  • Mean Field Theory
  • Dead End Elimination (DEE)
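The size of the search space is simply the product of the rotamer counts per residue. The short sketch below checks the four-residue example and shows how a number of the order of 10^100 arises; the function name and the "100 residues with 10 rotamers each" scenario are illustrative assumptions.

```python
import math
from itertools import product

def count_conformations(rotamer_counts):
    """Number of rotamer combinations: the product of the per-residue rotamer counts."""
    return math.prod(rotamer_counts)

slide_example = [2, 2, 3, 5]
print(count_conformations(slide_example))                       # 2 * 2 * 3 * 5 = 60
print(len(list(product(*[range(n) for n in slide_example]))))   # explicit enumeration, also 60

# Assumption for illustration: 100 residues with 10 rotamers each.
print(count_conformations([10] * 100) == 10 ** 100)             # True
```

Sixty combinations can be enumerated explicitly, whereas 10^100 cannot, which motivates the search strategies listed above.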
slide-31
SLIDE 31

Dead End Elimination (DEE)

Discrete conformational search, functionally equivalent to an exhaustive search. Guaranteed to converge to the global minimum solution. DEE uses rejection criteria to prune the search space.

Side Chain Modeling

Dead End Elimination (DEE)

Theorem: A global minimum energy conformation (GMEC) exists, for which there is a unique conformation for each side chain.

Each residue i has a set of possible rotamers r. The total conformation energy is reformulated as a) the internal energy of a rotamer and its interactions with the backbone plus b) the sum of pair-wise interactions between side chain conformations. The energy of each protein conformation can be expressed as:

$$E_{Tot}(conf) = \sum_{i=1}^{N} E(i_r) + \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} E(i_r, j_s)$$

Note that:

$$E_{Tot}(conf) \ge E_{Tot}(GMEC) \quad \forall \; conf$$

Side Chain Modeling

slide-32
SLIDE 32

Dead End Elimination (DEE)

Rejection criteria: eliminate all conformations from the search which can be shown to be not part of the GMEC. Iterate until convergence occurs, or the problem becomes sufficiently simplified for an exhaustive search.


Side Chain Modeling

Dead End Elimination (DEE)

Simple rejection criterion: Let’s consider two rotamers ir and is at residue position i. If the pair-wise energy between ir and all other side chain conformations js is always higher than the pair-wise energy between is and js, then ir cannot exist in the GMEC and is eliminated.

Eliminate ir if:

$$E(i_r) - E(i_s) + \sum_{j \ne i} \min_{s} E(i_r, j_s) - \sum_{j \ne i} \max_{s} E(i_s, j_s) > 0$$

The min term is the most favorable interaction of ir with the conformational background; the max term is the most unfavorable interaction of is with the conformational background. If the inequality holds, conformation is eliminates ir from further evaluations.

Side Chain Modeling

slide-33
SLIDE 33

Dead End Elimination (DEE)

Goldstein criterion / limited criterion: is and ir can be compared in the same conformational background. The conformational background is obtained by fixing each residue j in the conformation that most favors ir with respect to is. This criterion is stronger than the simple one.

Eliminate ir if:

$$E(i_r) - E(i_s) + \sum_{j \ne i} \min_{s} \left( E(i_r, j_s) - E(i_s, j_s) \right) > 0$$
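A compact sketch of dead-end elimination with the Goldstein criterion is given below. It is an illustration under stated assumptions rather than a production algorithm: the energies are random toy numbers for four residues with 2, 2, 3 and 5 rotamers, the data layout (`E_self` as nested lists, `E_pair` as a dict keyed by residue pairs i < j) and all function names are hypothetical, and the result is checked against a brute-force search, which is only possible because the example is tiny.

```python
import itertools
import random

def pair_energy(E_pair, i, r, j, s):
    """E(i_r, j_s); the table stores each residue pair once, with i < j."""
    return E_pair[(i, j)][r][s] if i < j else E_pair[(j, i)][s][r]

def total_energy(conf, E_self, E_pair):
    """E_Tot(conf): sum of self energies plus sum of pair energies over all i < j."""
    n = len(conf)
    energy = sum(E_self[i][conf[i]] for i in range(n))
    for i in range(n - 1):
        for j in range(i + 1, n):
            energy += E_pair[(i, j)][conf[i]][conf[j]]
    return energy

def goldstein_eliminates(i, r, s, alive, E_self, E_pair):
    """True if rotamer s proves that rotamer r at residue i cannot be part of the GMEC."""
    gap = E_self[i][r] - E_self[i][s]
    for j, rotamers in alive.items():
        if j != i:
            # worst case for the elimination: background rotamer that most favors r over s
            gap += min(pair_energy(E_pair, i, r, j, t) - pair_energy(E_pair, i, s, j, t)
                       for t in rotamers)
    return gap > 0

def dead_end_elimination(E_self, E_pair):
    """Iterate the Goldstein criterion until no further rotamer can be eliminated."""
    alive = {i: set(range(len(e))) for i, e in enumerate(E_self)}
    changed = True
    while changed:
        changed = False
        for i in alive:
            for r in sorted(alive[i]):
                if any(s != r and goldstein_eliminates(i, r, s, alive, E_self, E_pair)
                       for s in sorted(alive[i])):
                    alive[i].discard(r)
                    changed = True
    return alive

# Hypothetical toy energies (arbitrary units), fixed seed for reproducibility.
rng = random.Random(0)
counts = [2, 2, 3, 5]
E_self = [[rng.uniform(-5.0, 5.0) for _ in range(n)] for n in counts]
E_pair = {(i, j): [[rng.uniform(-1.0, 1.0) for _ in range(counts[j])]
                   for _ in range(counts[i])]
          for i in range(len(counts)) for j in range(i + 1, len(counts))}

alive = dead_end_elimination(E_self, E_pair)
gmec = min(itertools.product(*[range(n) for n in counts]),
           key=lambda conf: total_energy(conf, E_self, E_pair))
print("remaining rotamers:", alive)
print("brute-force GMEC:", gmec)
```

Whatever rotamers are pruned, the brute-force GMEC must still be contained in the remaining sets; that property is what makes DEE functionally equivalent to an exhaustive search.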

Side Chain Modeling

References:

  • Desmet, J., De Maeyer, M., Hazes, B., and Lasters, I. 1992. The dead-end elimination theorem and its use in protein side-chain positioning. Nature 356: 539–542.
  • Looger, L.L. and Hellinga, H.W. 2001. Generalized dead-end elimination algorithms make large-scale protein side-chain structure prediction tractable: Implications for protein design and structural genomics. J. Mol. Biol. 307: 429–445.
  • Dunbrack Jr., R.L. and Karplus, M. 1993. Backbone-dependent rotamer library for proteins: Application to side-chain prediction. J. Mol. Biol. 230: 543–574.
  • Lovell, S.C., Word, J.M., Richardson, J.S., and Richardson, D.C. 2000. The penultimate rotamer library. Proteins 40: 389–408.
  • Canutescu, A.A., Shelenkov, A.A., and Dunbrack Jr., R.L. 2003. A graph theory algorithm for protein side-chain prediction. Protein Science 12: 2001–2014.

Side Chain Modeling

slide-34
SLIDE 34

Energy minimization

Fragment assembly will produce unfavorable contacts and distorted bonds. Energy minimization is used to

  • regularize local bond and angle geometry
  • relax close contacts and geometric strain

Extensive energy minimization will move coordinates away from the real structure ⇒ keep it to a minimum! Molecular dynamics does not improve model quality. SWISS-MODEL uses the GROMOS 96 force field for a steepest descent minimization.

Comparative Protein Structure Modeling


Comparative Protein Structure Modeling

slide-35
SLIDE 35

  • III. Satisfaction of spatial restraints
  • A. Sali and T.L. Blundell. Comparative protein modelling by satisfaction of spatial restraints. Journal of Molecular Biology, 234:779-815, 1993.

Comparative Protein Structure Modeling

1. Alignment of target sequence with templates
2. Extraction of spatial restraints from templates
3. Modeling by satisfaction of spatial restraints

[Figure: example target sequence M A T E A F T S G Q]

Satisfaction of spatial restraints

slide-36
SLIDE 36

Some features of a protein structure:

  R         resolution of X-ray experiment
  r         amino acid residue type
  Φ, Ψ      main chain angles
  t         secondary structure class
  M         main chain conformation class
  χi, ci    side chain dihedral angle, side chain dihedral angle class
  a         residue solvent accessibility
  s         residue neighborhood difference
  d         Cα - Cα distance
  Δd        difference between two Cα - Cα distances

Satisfaction of spatial restraints

Feature properties can be associated with

  • a protein (e.g. X-ray resolution)
  • residues (e.g. solvent accessibility)
  • pairs of residues (e.g. Cα - Cα distance)
  • other features (e.g. main chain classes)

How can we derive modeling restraints from this data?

A restraint is defined as a probability density function (pdf) p(x):

$$p(x_1 \le x < x_2) = \int_{x_1}^{x_2} p(x) \, dx$$

with

$$\int p(x) \, dx = 1 \quad \text{and} \quad p(x) > 0$$

Satisfaction of spatial restraints

slide-37
SLIDE 37

Derive pdfs from frequency tables by smoothing:

[Figure: a) Chi-1 angles of 11 Cys residues; b) smoothed distribution from a); c) Chi-1 angles of 297 Cys residues as control]
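The step from a sparse frequency table to a smooth pdf can be illustrated with a simple periodic Gaussian kernel. This is only a sketch of the general idea and not the smoothing procedure used by Modeller: the kernel width, the bin width, the function name `smoothed_pdf` and the eleven Chi-1 values are all hypothetical.

```python
import math

def smoothed_pdf(angles_deg, bin_width=10.0, sigma=20.0):
    """Smooth a sample of dihedral angles into a normalized pdf on [-180, 180)."""
    n_bins = int(round(360.0 / bin_width))
    centers = [-180.0 + (k + 0.5) * bin_width for k in range(n_bins)]
    density = []
    for c in centers:
        weight = 0.0
        for a in angles_deg:
            d = (c - a + 180.0) % 360.0 - 180.0      # periodic angle difference
            weight += math.exp(-0.5 * (d / sigma) ** 2)
        density.append(weight)
    norm = sum(density) * bin_width                   # makes the pdf integrate to 1
    return centers, [w / norm for w in density]

# Hypothetical sample of 11 Chi-1 angles (degrees), clustered near -60, 180 and +60.
chi1 = [-65, -60, -58, -70, -62, -55, 175, -178, 180, 62, 58]
centers, pdf = smoothed_pdf(chi1)
peak = centers[pdf.index(max(pdf))]
print(peak, round(sum(pdf) * 10.0, 6))
```

The smoothed curve is strictly positive everywhere and integrates to 1, so it satisfies the two conditions on p(x) stated above even for angle ranges that were never observed in the small sample.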

Satisfaction of spatial restraints

[Figure: basis pdfs for combinations of the feature ranges 0.2 < s' < 0.4, 0.4 < s' < 0.6, 0.2 < s'' < 0.4 and 0.4 < s'' < 0.6]

  • Combine basis pdfs to molecular probability density functions

Satisfaction of spatial restraints

slide-38
SLIDE 38

Satisfaction of spatial restraints

Find the protein model with the highest probability.

Variable target function:

  • Start with a linear conformation model or a model close to the template conformation.
  • At first, use only local restraints.
  • Minimize for some steps using a conjugate gradient optimization.
  • Repeat, introducing more and more long range restraints, until all restraints are used.

Satisfaction of spatial restraints

  • Optimization schedule and progress

Satisfaction of spatial restraints

slide-39
SLIDE 39

Known Structures (Templates) Target Sequence Template Selection Alignment Template - Target Structure modeling Structure Evaluation & Assessment Homology Model(s)

Comparative Protein Structure Modeling Accuracy of protein structure models

“... a model must be wrong, in some respects -- else it would be the thing itself. The trick is to see ... where it is right.”

Henry A. Bent, "Uses (and Abuses) of Models in Teaching Chemistry," J. Chem. Ed. 1984, 61, 774.

T0286

slide-40
SLIDE 40

EVA

Evaluation of Automatic protein structure prediction

http://eva.compbio.ucsf.edu/~eva/

CASP

Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction

http://PredictionCenter.org

Assessing the accuracy of protein structure models

Continuous evaluation of automatic protein structure prediction

[ Burkhard Rost, Andrej Sali, http://eva.compbio.ucsf.edu/~eva/cm/ ]

Target Sequence:

MNIFEMLRID EGLRLKIYKD TEGYYTIGIG HLLTKSPSLN AAKSELDKAI GRNCNGVITK

[Figure: EVA pipeline with four numbered steps, labeled: new experimental structure, prediction servers, evaluation of prediction accuracy]

EVAcm: 54’847 models for 22’364 target proteins evaluated (since 2000).

slide-41
SLIDE 41

CASP

  • CASP: Double-blind assessment of protein structure prediction techniques

1. Assessment of pre-dictions (not post-dictions); i.e. the experimental result is not known at the time of prediction.
2. All predictions are made at the same time on the same data set provided by the organizers.
3. After the experiment, all predictions and experimental results are provided to an independent assessor.
  • The assessor does not know the origin of the predictions, i.e. the results are evaluated anonymously.
  • Evaluate only final results, not “elegance” of a method etc.
  • The assessor is free to adopt evaluation criteria that make scientific sense.
4. After the assessment has been completed, a meeting is organized where results are presented and the identity of the predictors is revealed.

Comparing Protein Structures

Model Evaluation.

How can we compare models from different protein structure prediction methods (shown as back-bone trace) with the “real” crystal structure (shown as ribbon)?

slide-42
SLIDE 42

GDT-TS - Global Distance Test

A large sample of possible structure superpositions of the model on the corresponding experimental structure is generated by superposing all sets of three, five, and seven consecutive Cα atoms along the backbone (each peptide segment provides one superposition). Each of these initial superpositions is iteratively extended, including all residue pairs under a specified threshold in the next iteration, and continuing until there is no change in included residues. Superimposed residues are not required to be continuous in the sequence, nor is there necessarily any relationship between the sets of residues superimposed at different thresholds.

Comparing Protein Structures

GDT-TS - Global Distance Test Total Score

The GDT procedure is carried out using thresholds of 1, 2, 4, and 8 Å, and the final superposition that includes the maximum number of residues is selected for each threshold. GDT_TS is then obtained by averaging over the four superposition scores for the different thresholds:

$$\mathrm{GDT\_TS} = \frac{\mathrm{GDT\_1} + \mathrm{GDT\_2} + \mathrm{GDT\_4} + \mathrm{GDT\_8}}{4}$$
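The averaging step can be sketched in a few lines of Python. This is a simplified illustration, not the full GDT procedure: it assumes that a single fixed superposition has already been applied and that the per-residue Cα distances (in Å) are given, so it skips the search for the best superposition at each threshold. The function name and the example distances are made up for illustration.

```python
from typing import Sequence

THRESHOLDS = (1.0, 2.0, 4.0, 8.0)  # distance cutoffs in Å, as listed above


def gdt_ts_fixed_superposition(ca_distances: Sequence[float]) -> float:
    """Average percentage of residues whose Cα deviation is within each cutoff.

    Assumption: all distances come from ONE superposition; the real GDT
    procedure selects the best superposition separately for each cutoff.
    """
    if not ca_distances:
        raise ValueError("need at least one residue distance")
    n = len(ca_distances)
    percentages = [
        100.0 * sum(d <= cutoff for d in ca_distances) / n
        for cutoff in THRESHOLDS
    ]
    return sum(percentages) / len(THRESHOLDS)


# Hypothetical Cα deviations for a five-residue model.
distances = [0.5, 1.5, 3.0, 6.0, 10.0]
score = gdt_ts_fixed_superposition(distances)
print(score)
```

For these distances 20%, 40%, 60%, and 80% of the residues fall under the 1, 2, 4, and 8 Å cutoffs, so the averaged score is 50.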

slide-43
SLIDE 43

GDT - Global Distance Test

  • GDT analysis: largest set of Cα atoms (percent of the modeled structure) that can fit under a DISTANCE cutoff: 0.5 Å, 1.0 Å, 1.5 Å, ..., 10.0 Å

Comparing Protein Structures

Human predictors vs. automated servers

[Figure: % significant wins per group (GDT_HA, AL0, HB-Score, total sum); labeled servers include Zhang-Server, HHpred1, HHpred2, HHpred3, MetaTasser, and BayesHH.]

Direct head-to-head comparison (t-test) on GDT, AL0, and HB-score on common targets.

Battey et al., Proteins: Structure, Function, and Bioinformatics, 69 (S8) 68 – 82.

slide-44
SLIDE 44

What do you mean by “accurate model”?

Chothia & Lesk vs. CASP7

What do you mean by “accurate model”?

Baker D, Sali A. Protein structure prediction and structural genomics. (2001) Science. 294:93-96.

slide-45
SLIDE 45

Atomic contact map from protein-ligand distance matrices [1], where r_{i,k} is the distance between atoms i and k:

$$\mathrm{Cont}_{i,k} = \begin{cases} 1 & \text{if } 2.0\,\text{Å} \le r_{i,k} \le 4.0\,\text{Å} \\ 0 & \text{otherwise} \end{cases} \qquad \mathrm{Clash}_{i,k} = \begin{cases} 1 & \text{if } r_{i,k} \le 1.5\,\text{Å} \\ 0 & \text{otherwise} \end{cases}$$

Atomic contact score (ACS) [2]:

$$\mathrm{ACS} = \frac{\sum_{i,k} \mathrm{Cont}^{\mathrm{target}}_{i,k} \cdot \mathrm{Cont}^{\mathrm{model}}_{i,k} - \sum_{i,k} \mathrm{Clash}^{\mathrm{model}}_{i,k}}{\sum_{i,k} \mathrm{Cont}^{\mathrm{target}}_{i,k}}$$

[1] NCONT, CCP4, Krissinel E, European Bioinformatics Institute, Cambridge, UK.
[2] Kopp J, Bordoli L, Battey JND, Kiefer F, Schwede T (2007). Assessment of CASP7 Predictions for Template-Based Modeling Targets. Proteins: Structure, Function, and Bioinformatics, 69, 38-56.
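A minimal Python sketch of the contact score follows. It assumes one reading of the slide's formula: the number of target contacts that the model reproduces, minus the number of clashes in the model, divided by the number of target contacts. The distance matrices are hypothetical, with rows for protein atoms i and columns for ligand atoms k, both in the same atom order for target and model.

```python
from typing import Sequence

Matrix = Sequence[Sequence[float]]  # protein-ligand distances in Å


def contact(r: float) -> int:
    """1 if the atom pair is in contact (2.0 Å <= r <= 4.0 Å), else 0."""
    return 1 if 2.0 <= r <= 4.0 else 0


def clash(r: float) -> int:
    """1 if the atom pair clashes (r <= 1.5 Å), else 0."""
    return 1 if r <= 1.5 else 0


def atomic_contact_score(target: Matrix, model: Matrix) -> float:
    """ACS under the assumed formula described in the text above."""
    shared = clashes = native = 0
    for row_t, row_m in zip(target, model):
        for r_t, r_m in zip(row_t, row_m):
            shared += contact(r_t) * contact(r_m)  # contact kept in the model
            clashes += clash(r_m)                  # penalty for model clashes
            native += contact(r_t)                 # contacts in the target
    if native == 0:
        raise ValueError("target has no protein-ligand contacts")
    return (shared - clashes) / native


# Hypothetical 2 x 2 distance matrices.
target = [[3.0, 5.0], [3.5, 2.5]]
model = [[3.2, 4.5], [1.2, 6.0]]
acs = atomic_contact_score(target, model)
print(acs)
```

In this toy case the model reproduces one of three target contacts and introduces one clash, so the score is 0; a model identical to the target scores 1.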

What do you mean by “accurate model”?

CASP7: Backbone vs. active site accuracy

Kopp J, Bordoli L, Battey JND, Kiefer F, Schwede T (2007). Assessment of CASP7 Predictions for Template-Based Modeling Targets. Proteins: Structure, Function, and Bioinformatics, 69, 38-56.
slide-46
SLIDE 46

CASP7: Backbone vs. active site accuracy

T0313

Superposition of experimental (grey) and predicted ADP binding sites of target T0313, a human KIFC3 motor domain. Predictions by group 20 in orange, group 24 in light blue, and group 186 in green.

Protein Model Quality Estimation

Problem: At the time of modeling, the experimental protein structure is not known, i.e. we need to assess the model for possible errors without knowing the right answer.

slide-47
SLIDE 47

Typical types of errors

Sequence alignment errors. Loops which cannot be rebuilt. Inappropriate template selection. Subunit displacement.

Problem : How can we identify errors in a 3-dimensional protein model without knowing the correct answer?

Protein Structure Evaluation

Bond & angle geometry can be checked using empirically derived rules (e.g. ProCheck, WhatCheck), molecular force fields, or statistical methods (e.g. Ramachandran plots). Caveat: These parameters are usually part of the target function of the modeling / refinement program. Correct geometry is therefore not sufficient to indicate model accuracy.
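A Ramachandran plot is built from the backbone dihedral angles φ (atoms C(i-1), N(i), Cα(i), C(i)) and ψ (atoms N(i), Cα(i), C(i), N(i+1)). As a small illustration of the underlying geometry check, the sketch below computes a dihedral angle from four points. The helper names and test coordinates are illustrative and not taken from ProCheck or WhatCheck.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral_degrees(p0, p1, p2, p3):
    """Dihedral angle (degrees, in (-180, 180]) around the p1-p2 bond."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)  # normals of the two planes
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Toy coordinates: central bond along the x axis.
p0, p1, p2 = (0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
cis = dihedral_degrees(p0, p1, p2, (1.0, 1.0, 0.0))
trans = dihedral_degrees(p0, p1, p2, (1.0, -1.0, 0.0))
quarter = dihedral_degrees(p0, p1, p2, (1.0, 0.0, 1.0))
print(cis, trans, quarter)
```

Applied to every residue of a model, the resulting (φ, ψ) pairs can be compared against the regions that are commonly populated in experimental structures.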

slide-48
SLIDE 48

[Figure labels: I, Val13; II, Phe134; III, Ala182; *, Met80; +, Ile86]

Problem : How can we identify errors in a 3-dimensional protein model without knowing the correct answer?

Protein Structure Evaluation

  • Statistical Mean Force Potentials can assess the packing quality of a model by evaluating the "Non-Local Environment" (NLE) of each heavy atom in the molecule.
  • Example: ANOLEA (Melo, F. and E. Feytmans (1998). "Assessing protein structures with a non-local atomic interaction energy." J Mol Biol 277(5): 1141-1152.)

(Melo & Feytmans J Mol Biol 277, 1141)

[Figure: mean force potential (MFP, kcal/mol) versus distance (Å) for methyl-methyl pairs and cysteine S-S pairs.]

Protein Structure Evaluation

Statistical Mean Force Potentials

  • Use the inverse Boltzmann law to derive an atomic Potential of Mean Force (Ū) from the observed number of atomic pairs (i,j) within a distance shell r ± Δr in the training database of protein structures:

$$\bar{U}(i,j,r) = -RT \ln \frac{N_{\mathrm{observed}}(i,j,r)}{N_{\mathrm{expected}}(i,j,r)}$$

  • N_expected is the expected number of atomic pairs (i,j) in the same distance shell if there were no interactions between atoms (reference state).

R: gas constant, T: temperature
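The inverse Boltzmann relation, Ū = -RT ln(N_observed / N_expected), can be evaluated directly once the pair counts for a distance shell are known. The sketch below is a minimal illustration; the pair counts, the default temperature of 298.15 K, and the function name are assumptions for the example and are not taken from ANOLEA.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)


def mean_force_potential(n_observed: float, n_expected: float,
                         temperature: float = 298.15) -> float:
    """Potential of mean force (kcal/mol) for one atom-pair type and shell."""
    if n_observed <= 0 or n_expected <= 0:
        raise ValueError("pair counts must be positive")
    return -R_KCAL * temperature * math.log(n_observed / n_expected)


# Hypothetical counts for one (i, j) pair type in one distance shell.
favourable = mean_force_potential(n_observed=300, n_expected=150)
neutral = mean_force_potential(n_observed=150, n_expected=150)
unfavourable = mean_force_potential(n_observed=50, n_expected=150)
print(favourable, neutral, unfavourable)
```

Pairs seen more often than expected get a negative (favourable) energy, pairs seen less often a positive one. Shells with zero observed pairs need separate handling (for example pseudo-counts), which this sketch leaves out.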

slide-49
SLIDE 49

ANOLEA: (Atomic Non-Local Environment Assessment)

MFPs can detect local packing errors, e.g. as a result of misalignments:

Model with wrong alignment.

Protein Structure Evaluation

Correct Structure: 1GES

(Melo & Feytmans J Mol Biol 277, 1141)

All checking tools are happy, so can I believe it now?

Models are not experimental facts. Models can be partially inaccurate or sometimes completely wrong. A model is a tool that helps to interpret biochemical data.

slide-50
SLIDE 50

Safe Zone, Twilight Zone, Midnight Zone

Model quality vs. sequence identity

  • Annotation by fold assignment
  • 3D-motif searching, active site recognition
  • Including NMR restraints
  • Supporting site directed mutagenesis
  • X-ray molecular replacement models
  • Docking of small molecules
  • Drug development; comparable to medium resolution NMR or low resolution X-ray structures

What can models be used for?

slide-51
SLIDE 51

Applications of Protein Modeling

Application examples:

  • Rational planning of mutagenesis experiments
  • Structure-based drug design

Application: Rational design of Functional Protein Mutations

The role of the “plug” domain in yeast SecY

Collaboration with M. Spiess (Biozentrum).

  • Yeast Sec61 hetero-trimer homology model based on the archaeal SecY crystal structure from Methanococcus jannaschii [1].
  • Design of partial and complete deletion mutations of the plug domain to study its functional role.
  • Conclusion: The plug domain of Sec61p is non-essential, but influences topogenesis and Sec61 assembly [2].

[1] Van den Berg B et al. (2004) Nature 427, 36-44.
[2] Junne et al. Molecular Cell 2006, in press.

slide-52
SLIDE 52

Application: Models in structure based drug discovery

Can computational models be used in structure based drug design? How do errors and inaccuracies of the homology models affect the subsequent molecular modeling of protein-ligand interaction?

HM based on EIAV

High quality models can be used for structure-based drug design nearly as well as experimental structures.

Reference: Discovery of a potent and selective protein kinase CK2 inhibitor by high-throughput docking. Vangrevelinghe E, Zimmermann K, Schoepfer J, Portmann R, Fabbro D, Furet P. Oncology Research, Novartis Pharma, Basle, J Med Chem. 2003 Jun 19;46(13):2656-62.

Discovery of CK2a Inhibitors by in silico docking Homology model of the target molecule:

slide-53
SLIDE 53

Can we predict the structure of a protein “de novo” only from sequence?

MNIFEMLRID EGLRLKIYKD TEGYYTIGIG HLLTKSPSLN AAKSELDKAI GRNCNGVITK DEAEKLFNQD VDAAVRGILR NAKLKPVYDS LDAVRRCALI NMVFQMGETG VAGFTNSLRM LQQKRWDEAA VNLAKSRWYN QTPNRAKRVI TTFRTGTWDA YKNL

The physics-based approach:

  • We are searching for the conformation of the protein that corresponds to the free energy minimum.
  • Solvent plays a crucial role in protein folding, so we need to take the solvent into account in these calculations.
  • How long would it take to simulate the folding of a protein?

Protein Structure Prediction - ab initio

slide-54
SLIDE 54

[Figure axes: helix position; amino acid statistics]

Rosetta

Fragment based ab initio prediction methods

Rosetta was developed by the David Baker group (similar approaches by Skolnick).

Find sequence patterns that strongly correlate with protein structure at the local level to create a library of fragments (I-sites), e.g. an “amphipathic helix”:

Rosetta

To build a model for a new sequence:

  • Search for compatible fragments
  • Use Monte Carlo simulated annealing to assemble overlapping fragments
  • Scoring functions / empirical force fields are used to select the best models (~1000)
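The Monte Carlo simulated annealing step can be illustrated with a toy example. The sketch below is not Rosetta: the "conformation" is just a tuple of integers, a move changes one position by ±1 as a stand-in for a fragment insertion, and the energy function, cooling schedule, and all names are invented for illustration. Only the Metropolis acceptance rule (always accept downhill moves, accept uphill moves with probability exp(-ΔE/T)) is the standard technique.

```python
import math
import random


def metropolis_accept(delta_e, temperature, rng):
    """Accept downhill moves always, uphill moves with prob. exp(-dE/T)."""
    if delta_e <= 0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)


def anneal(energy, propose, start, steps=2000, t_start=5.0, t_end=0.05, seed=0):
    """Simulated annealing; returns the lowest-energy state visited."""
    rng = random.Random(seed)
    current, e_current = start, energy(start)
    best, e_best = current, e_current
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temperature = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        candidate = propose(current, rng)
        e_candidate = energy(candidate)
        if metropolis_accept(e_candidate - e_current, temperature, rng):
            current, e_current = candidate, e_candidate
            if e_current < e_best:
                best, e_best = current, e_current
    return best, e_best


def toy_energy(conf):
    """Hypothetical energy with its minimum at (3, 3, 3, 3)."""
    return sum((x - 3) ** 2 for x in conf)


def toy_move(conf, rng):
    """Stand-in for a fragment insertion: change one position by +/-1."""
    i = rng.randrange(len(conf))
    return conf[:i] + (conf[i] + rng.choice((-1, 1)),) + conf[i + 1:]


start = (0, 0, 0, 0)
best_conf, best_energy = anneal(toy_energy, toy_move, start)
print(best_conf, best_energy)
```

High temperatures early in the run let the search escape local minima, while the low final temperature makes it behave like a greedy minimization. In practice many independent runs are started to produce a set of candidate models.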

http://www.bakerlab.org

slide-55
SLIDE 55

DECOYS:

Generate a large number of possible shapes

DISCRIMINATION:

Select the correct, native-like fold

  • Need good decoy structures
  • Need a good energy function

Rosetta

Scoring Functions

MC methods are not likely to generate exactly native structures, but native-like structures. Scoring needs to distinguish native-like structures from non-native ones.

Solution: “Decoy” testing.

  • Create many structures that are plausible and not too far from the native fold, and try to distinguish these
  • Use sequence dependent and sequence independent evaluation of the predicted structure

Rosetta

slide-56
SLIDE 56

Rosetta – The energy function

  • Bayesian separation of the total energy: P(structure | sequence)
  • According to Bayes’ theorem:

$$P(\mathrm{structure} \mid \mathrm{sequence}) = \frac{P(\mathrm{sequence} \mid \mathrm{structure}) \, P(\mathrm{structure})}{P(\mathrm{sequence})}$$

  • P(sequence): constant (since we compare the same sequence)
  • P(sequence | structure): sequence dependent term
  • P(structure): sequence independent term
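Because P(sequence) is the same for all decoys of one sequence, ranking decoys by P(structure | sequence) only needs the two remaining terms of Bayes' theorem, P(sequence | structure) and P(structure). The sketch below assumes that the score is taken as the negative logarithm of their product, so a lower score means a more probable structure. The decoy names and probabilities are hypothetical and do not come from Rosetta.

```python
import math


def neg_log_posterior(p_seq_given_struct: float, p_struct: float) -> float:
    """-log P(structure | sequence), up to the constant log P(sequence)."""
    sequence_dependent = -math.log(p_seq_given_struct)
    sequence_independent = -math.log(p_struct)
    return sequence_dependent + sequence_independent


# Hypothetical decoys: (P(sequence | structure), P(structure)).
decoys = {
    "decoy_a": (1e-4, 0.02),
    "decoy_b": (5e-4, 0.01),
    "decoy_c": (1e-5, 0.05),
}

# Lowest score first, i.e. highest posterior probability first.
ranked = sorted(decoys, key=lambda name: neg_log_posterior(*decoys[name]))
print(ranked)
```

Working with sums of negative logarithms instead of products of probabilities is what allows the score to be split into additive sequence dependent and sequence independent terms.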

Scoring Function Components

Rosetta

slide-57
SLIDE 57

De novo examples from CASP7

Prediction Success: T0283

[Figure: GDT_TS plot for target T0283 with predictions by group 20 (Baker) and group 13 (Jones-UCL); AL0 values shown: 77.32 and 62.89.]

http://www.predictioncenter.org/casp7/

slide-58
SLIDE 58

Prediction Success: T0283

[Figure: GDT_TS plot for target T0283 with predictions by group 20 (Baker) and group 13 (Jones-UCL); AL0 values shown: 77.32 and 62.89. Structure panels: T0283 Target; T0283 Template: 2b2j; T0283 Group 20; T0283 Template: 2b2j / Group 20; T0283 Group 20 Side Chains.]

Qian B, Raman S, Das R, Bradley P, McCoy AJ, Read RJ, Baker D. High-resolution structure prediction and the crystallographic phase problem. Nature (2007) 450:259-64.

Protein Modeling Resources

  • SWISS-MODEL: http://swissmodel.expasy.org
  • Modeller: http://www.salilab.org
  • M4T: http://manaslu.aecom.yu.edu/M4T/
  • SCWRL3: http://dunbrack.fccc.edu/SCWRL3.php