Homology Modeling I, Basel, September 30, 2004, Torsten Schwede


SLIDE 1

Swiss Institute of Bioinformatics Torsten Schwede Biozentrum - Universität Basel Swiss Institute of Bioinformatics Klingelbergstr 50-70 CH - 4056 Basel, Switzerland Tel: +41-61 267 15 81

EMBnet course: Introduction to Protein Structure Bioinformatics

Homology Modeling I

Basel, September 30, 2004

[ PDB: http://www.pdb.org ]

Growth of the Protein Data Bank PDB

SLIDE 2

Public Database Holdings

[Figure: number of entries in TrEMBL, SwissProt and PDB, 1986 to 2004, on a logarithmic axis from 100 to 1'000'000]

No experimental structure for most sequences

The protein sequence contains all information needed to create a correctly folded protein. Can we predict protein structures from protein sequences alone (ab initio)?

  • Many proteins fold spontaneously to their native structure
  • Protein folding is relatively fast (nsec – sec)
  • Chaperones speed up folding, but do not alter the structure

MNIFEMLRID EGLRLKIYKD TEGYYTIGIG HLLTKSPSLN AAKSELDKAI GRNCNGVITK DEAEKLFNQD VDAAVRGILR NAKLKPVYDS LDAVRRCALI NMVFQMGETG VAGFTNSLRM LQQKRWDEAA VNLAKSRWYN QTPNRAKRVI TTFRTGTWDA YKNL

SLIDE 3

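\[
V(r^N) = \sum_{\text{bonds}} \frac{k_i}{2}\,(l_i - l_{i,0})^2
       + \sum_{\text{angles}} \frac{k_i}{2}\,(\theta_i - \theta_{i,0})^2
       + \sum_{\text{torsions}} \frac{V_n}{2}\,\bigl(1 + \cos(n\omega - \gamma)\bigr)
       + \sum_{i=1}^{N} \sum_{j=i+1}^{N} \left( 4\varepsilon_{ij} \left[ \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{12} - \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{6} \right] + \frac{q_i q_j}{4\pi\varepsilon_0 r_{ij}} \right)
\]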
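The non-bonded part of an empirical force field of this kind (Lennard-Jones plus Coulomb, summed over all atom pairs) can be sketched as follows. This is a minimal illustration in reduced units, not the code of any MD package: the function names are hypothetical, a single epsilon and sigma are assumed for all pairs, and `coulomb_constant` stands in for 1/(4πε0).

```python
import math

def lennard_jones(r, epsilon, sigma):
    """Lennard-Jones energy 4*eps*[(sigma/r)^12 - (sigma/r)^6] for one pair."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)

def coulomb(r, q_i, q_j, coulomb_constant=1.0):
    """Coulomb energy q_i*q_j / (4*pi*eps0*r); the prefactor is passed in."""
    return coulomb_constant * q_i * q_j / r

def nonbonded_energy(coords, charges, epsilon, sigma, coulomb_constant=1.0):
    """Double sum over all pairs i < j, as in the last term of the potential.

    Assumption for brevity: one epsilon/sigma for every pair. A real force
    field looks these up per atom-type pair.
    """
    total = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            r = math.dist(coords[i], coords[j])
            total += lennard_jones(r, epsilon, sigma)
            total += coulomb(r, charges[i], charges[j], coulomb_constant)
    return total
```

The pair loop is what makes the force calculation expensive: its cost grows with the square of the number of atoms unless cutoffs are used.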

Molecular Dynamics Ab initio protein folding simulation

[ http://www.research.ibm.com/bluegene/ ]

  • Physical time for simulation: 10^-4 seconds
  • Typical time-step size: 10^-15 seconds
  • Number of MD time steps: 10^11
  • Atoms in a typical protein and water simulation: 32'000
  • Approximate number of interactions in force calculation: 10^9
  • Machine instructions per force calculation: 1000
  • Total number of machine instructions: 10^23
  • BlueGene capacity (floating point operations per second): 1 petaflop (10^15)

Blue Gene will need 1-3 years to simulate 100 µsec.
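The estimate can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted on the slide. Treating one machine instruction as one floating point operation at sustained peak rate is a simplifying assumption, and the function name is hypothetical.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def md_cost(physical_time_s, time_step_s, interactions_per_step,
            instructions_per_interaction, instructions_per_second):
    """Return (number of MD steps, total instructions, wall-clock years)."""
    steps = physical_time_s / time_step_s
    total_instructions = steps * interactions_per_step * instructions_per_interaction
    wall_clock_s = total_instructions / instructions_per_second
    return steps, total_instructions, wall_clock_s / SECONDS_PER_YEAR

# Slide values: 10^-4 s simulated, 10^-15 s time step, 10^9 interactions,
# 1000 instructions each, 10^15 operations per second.
steps, total_instructions, years = md_cost(1e-4, 1e-15, 1e9, 1e3, 1e15)
print(f"{steps:.0e} steps, {total_instructions:.0e} instructions, {years:.1f} years")
```

Under these assumptions the run takes about 3.2 years, at the upper end of the quoted 1-3 year range.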

SLIDE 4

[Figure: amino acid statistics by helix position]

Rosetta Stone Approach

David Baker group: find sequence patterns that strongly correlate with protein structure at the local level to create a library of fragments (I-sites). E.g. "amphipathic helix":

Rosetta Stone Approach

  • To build a model for a new sequence:
  • Search for compatible fragments (reduced alphabet)
  • Use Monte Carlo simulated annealing to assemble overlapping fragments
  • Scoring functions are used to select best models (~1000)
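The assembly step can be illustrated with a toy Monte Carlo simulated annealing loop. This is a sketch of the general technique only, not the Rosetta implementation: the conformation is reduced to one state letter per residue, and the fragment library, scoring function, and cooling schedule are all hypothetical choices.

```python
import math
import random

def insert_fragment(conformation, start, fragment):
    """Return a copy of the conformation with `fragment` written at `start`.

    Assumes the fragment fits inside the conformation.
    """
    new = list(conformation)
    new[start:start + len(fragment)] = fragment
    return new

def assemble(initial, fragment_library, score, steps=500,
             t_start=2.0, t_end=0.05, seed=0):
    """Fragment insertion with the Metropolis criterion and geometric cooling.

    fragment_library: list of (start_position, list_of_states).
    score: callable, lower is better.
    Returns the best conformation seen and its score.
    """
    rng = random.Random(seed)
    current = list(initial)
    current_score = score(current)
    best, best_score = list(current), current_score
    for step in range(steps):
        # Temperature decreases geometrically from t_start to t_end.
        t = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        start, fragment = rng.choice(fragment_library)
        candidate = insert_fragment(current, start, fragment)
        delta = score(candidate) - current_score
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current, current_score = candidate, current_score + delta
            if current_score < best_score:
                best, best_score = list(current), current_score
    return best, best_score
```

Running the loop many times with different seeds yields the large pool of candidate models from which the scoring functions then select.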

http://isites.bio.rpi.edu

SLIDE 5
  • Generates thousands of models
  • Best models in CASP4: ~ 5 – 10 Å Cα rmsd
  • Difficult to distinguish good and bad models

http://isites.bio.rpi.edu

Rosetta Stone Approach

The number of different protein folds is limited:

[Figure: PDB submissions per year, split into already known folds and new folds]

SLIDE 6

Evolution of the globin family:

[Figure: rmsd of backbone atoms in core (0.0 to 2.5) plotted against percent identical residues in core (100 to 50)]

[ Chothia & Lesk (1986) ]

Evolution of protein structure families

Common core = all residues that can be superposed in 3D.

For proteins with > 60% identical residues, the core contains > 90% of all residues, deviating by less than 1.0 Å.
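The two quantities in this statement, backbone rmsd and the fraction of residues within a distance cutoff, can be computed as in the following sketch. It assumes the two coordinate sets are already optimally superposed and list equivalent atoms in the same order; the superposition itself is not shown and the function names are hypothetical.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))

def fraction_within(coords_a, coords_b, cutoff=1.0):
    """Fraction of equivalent atoms closer than `cutoff` (same unit as coords)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    close = sum(1 for a, b in zip(coords_a, coords_b) if math.dist(a, b) < cutoff)
    return close / len(coords_a)
```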

SLIDE 7

Sequence similarity implies structural similarity?

[Figure: percentage sequence identity (20 to 100) plotted against number of residues aligned (50 to 250), with an identity curve. Region labels: "Sequence identity implies structural similarity" and "Don't know region".]

(B. Rost, Columbia, New York)

Sequence similarity implies structural similarity?

[Figure: the same plot with curves for both identity and similarity. Region labels: "Sequence identity implies structural similarity" and "Don't know region".]

(B. Rost, Columbia, New York)
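Both axes of this plot come directly from a pairwise alignment. The sketch below computes them for two aligned sequences of equal length with "-" as the gap character. The function name is hypothetical, and the threshold curve itself is not reproduced here.

```python
def identity_and_length(aligned_a, aligned_b, gap="-"):
    """Return (percent identity, number of aligned residue pairs).

    Columns containing a gap in either sequence are not counted as aligned.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    aligned = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue
        aligned += 1
        if a == b:
            identical += 1
    if aligned == 0:
        return 0.0, 0
    return 100.0 * identical / aligned, aligned

percent, n_aligned = identity_and_length("ACDE-G", "ACDQYG")
```

A given percent identity means more over a long alignment than over a short one, which is why the plot uses both numbers.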

SLIDE 8

Homology modeling

= Comparative protein modeling
= Knowledge-based modeling

Idea: Using experimental 3D-structures of related family members (templates) to calculate a model for a new sequence (target).

Similar Sequence → Similar Structure

[Flowchart: Known Structures (Templates) + Target Sequence → Template Selection → Alignment Template - Target → Structure modeling → Structure Evaluation & Assessment → Homology Model(s)]

Comparative Modeling

SLIDE 9

  • Protein Data Bank PDB

http://www.pdb.org

  • Database of templates
  • Separate into single chains
  • Remove bad structures (models)
  • Create BLASTable database or fold library (profiles, HMMs)

Comparative Modeling


Template selection:

  1. Sequence similarity / fold recognition
  2. Structure quality (resolution, experimental method)
  3. Experimental conditions (ligands and cofactors)

Comparative Modeling

SLIDE 10

  • Multiple sequence alignment for pairs > 40% identity

or

  • Use structural alignment of templates to guide sequence alignment of target

or

  • Use separate profiles for template and targets

Comparative Modeling


  • Errors in template selection or alignment result in bad models
  • Iterative cycles of alignment, modeling and evaluation
  • Build many models, choose the best

Comparative Modeling

SLIDE 11

  • I. Manual model building
  • II. Template based fragment assembly
    – Composer (Sybyl, Tripos)
    – SWISS-MODEL
  • III. Satisfaction of spatial restraints
    – Modeller (Insight II, MSI)
    – CPH-Models

Comparative Modeling

[ http://www.expasy.org/spdbv/ ]

  • I. Manual Modeling
SLIDE 12
  • II. Template based fragment assembly

Find structurally conserved core regions

  • II. Template based fragment assembly
  • Build model core by averaging core template backbone atoms (weighted by local sequence similarity with the target sequence). Leave non-conserved regions (loops) for later.
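The averaging step can be sketched as a weighted mean over superposed templates. This is an illustration of the idea, not the SWISS-MODEL code: the data layout and function name are hypothetical, and the templates are assumed to be already superposed with equivalent core positions in the same order.

```python
def average_core(templates, weights):
    """Weighted average of backbone positions over several templates.

    templates[t][k] is the (x, y, z) of core position k in template t.
    weights[t][k] is the local sequence similarity of template t to the
    target around position k (any non-negative number).
    """
    n_templates = len(templates)
    n_positions = len(templates[0])
    model = []
    for k in range(n_positions):
        w_sum = sum(weights[t][k] for t in range(n_templates))
        if w_sum <= 0:
            raise ValueError(f"no template supports position {k}")
        model.append(tuple(
            sum(weights[t][k] * templates[t][k][axis] for t in range(n_templates)) / w_sum
            for axis in range(3)
        ))
    return model
```

With per-position weights, the template that is locally most similar to the target dominates the model in that region.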

SLIDE 13
  • II. Template based fragment assembly

Loop (insertion) modeling

Use the "spare part" algorithm to find compatible fragments in a loop database, or "ab initio" rebuilding (e.g. Monte Carlo, molecular dynamics, genetic algorithms, etc.) to build missing loops.

  • II. Template based fragment assembly

Side Chain placement

Find the most probable side chain conformation, using

  • homologous structure information
  • backbone-dependent rotamer libraries
  • energetic and packing criteria
SLIDE 14
  • II. Template based fragment assembly

Rotamer Libraries

  • Only a small fraction of all possible side chain conformations is observed in experimental structures
  • Rotamer libraries provide an ensemble of likely conformations
  • The propensity of rotamers depends on the backbone geometry

[Figure: probabilities of the g+, trans and g- rotamers as a function of the backbone angles, p(g+ | phi), p(t | phi), p(g- | phi), p(g+ | psi), p(t | psi), p(g- | psi), for Phe, Tyr, His]

Backbone-dependent rotamer libraries

SLIDE 15
  • II. Template based fragment assembly

Energy minimization

Modeling methods will produce unfavorable contacts and bonds. Energy minimization is used to

  • regularize local bond and angle geometry
  • relax close contacts and geometric strain

Extensive energy minimization will move coordinates away from the real structure ⇒ keep it to a minimum.

SWISS-MODEL uses the GROMOS 96 force field for a steepest descent minimization.
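Steepest descent itself is a simple loop: move a small step against the energy gradient and stop after a fixed, small number of steps. The sketch below shows it for a generic gradient function. It is not the GROMOS 96 force field; the harmonic bond used as an example and all names are illustrative assumptions.

```python
def steepest_descent(gradient, x0, step_size=0.1, n_steps=50):
    """Minimize by repeatedly stepping against the gradient.

    gradient: callable returning the list of partial derivatives at x.
    The deliberately small n_steps reflects the advice to keep
    minimization short so the model stays close to the template.
    """
    x = list(x0)
    for _ in range(n_steps):
        g = gradient(x)
        x = [xi - step_size * gi for xi, gi in zip(x, g)]
    return x

# Example: one harmonic bond, E = k/2 * (l - l0)^2, so dE/dl = k * (l - l0).
k, l0 = 1.0, 1.5
relaxed = steepest_descent(lambda x: [k * (x[0] - l0)], [2.0])
```

Each step removes a fixed fraction of the remaining strain, so bad contacts relax quickly while the overall coordinates move little.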

  • III. Satisfaction of Spatial restraints

  1. Alignment of target sequence with templates
  2. Extraction of spatial restraints from templates
  3. Modeling by satisfaction of spatial restraints

[Figure: example target sequence M A T E A F T S G Q]

SLIDE 16

Some features of a protein structure:

  R        resolution of X-ray experiment
  r        amino acid residue type
  Φ, Ψ     main chain angles
  t        secondary structure class
  M        main chain conformation class
  χi, ci   side chain dihedral angle class
  a        residue solvent accessibility
  s        residue neighborhood difference
  d        Cα - Cα distance
  ∆d       difference between two Cα - Cα distances

  • III. Satisfaction of Spatial restraints

Feature properties can be associated with

  • a protein (e.g. X-ray resolution)
  • residues (e.g. solvent accessibility)
  • pairs of residues (e.g. Cα - Cα distance)
  • other features (e.g. main chain classes)

How can we derive modeling restraints from this data?

A restraint is defined as probability density function (pdf) p(x):

\[
p(x_1 \le x < x_2) = \int_{x_1}^{x_2} p(x)\,dx
\qquad \text{with} \qquad
\int p(x)\,dx = 1, \quad p(x) > 0
\]
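To make the definition concrete, the probability of a feature falling in an interval is the integral of its pdf over that interval. The sketch below evaluates this numerically with the trapezoid rule. The Gaussian used as an example pdf is an illustrative assumption, and the function names are hypothetical.

```python
import math

def gaussian_pdf(x, mean, sigma):
    """Normalized Gaussian density, used here only as an example restraint."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def probability(pdf, x1, x2, n=2000):
    """p(x1 <= x < x2): trapezoid-rule integral of pdf from x1 to x2."""
    h = (x2 - x1) / n
    total = 0.5 * (pdf(x1) + pdf(x2))
    for k in range(1, n):
        total += pdf(x1 + k * h)
    return total * h

# Probability that a feature with mean 0 and sigma 1 lies within one sigma.
p_one_sigma = probability(lambda x: gaussian_pdf(x, 0.0, 1.0), -1.0, 1.0)
```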

  • III. Satisfaction of Spatial restraints
SLIDE 17

[Figure: a) Chi-1 angles of 11 Cys residues; b) smoothed distribution from a); c) Chi-1 angles of 297 Cys residues as control]

  • III. Satisfaction of Spatial restraints

Derive pdfs from frequency tables by smoothing:

[Figure: panels for the neighborhood difference ranges 0.2 < s' < 0.4 and 0.2 < s'' < 0.4; 0.2 < s' < 0.4 and 0.4 < s'' < 0.6; 0.2 < s'' < 0.4 and 0.4 < s' < 0.6]
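Smoothing turns a sparse frequency table into a distribution that is non-zero everywhere. The sketch below uses a circular Gaussian kernel over angle bins. This is a stand-in technique chosen for brevity, not necessarily the smoothing procedure used in the method described here, and the function name and bandwidth are hypothetical.

```python
import math

def smooth_circular(counts, sigma_bins=1.0):
    """Smooth a histogram over a periodic variable (e.g. a dihedral angle).

    Returns one probability per bin, summing to 1. Dividing each value by
    the bin width would give the corresponding probability density.
    """
    n = len(counts)
    if n == 0 or sum(counts) <= 0:
        raise ValueError("need at least one observation")
    smoothed = []
    for i in range(n):
        value = 0.0
        for j, c in enumerate(counts):
            d = abs(i - j)
            d = min(d, n - d)  # distance around the circle
            value += c * math.exp(-0.5 * (d / sigma_bins) ** 2)
        smoothed.append(value)
    norm = sum(smoothed)
    return [v / norm for v in smoothed]
```

With only a handful of observations, as in panel a), the raw table has many empty bins; after smoothing, no conformation is assigned zero probability.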

  • III. Satisfaction of Spatial restraints
  • Combine basis pdfs to molecular probability density functions
SLIDE 18

Satisfaction of spatial restraints

Find the protein model with the highest probability.

Variable target function:

  • Start with a linear conformation model or a model close to the template conformation
  • At first, use only local restraints
  • Minimize some steps using a conjugate gradient optimization
  • Repeat, introducing more and more long-range restraints until all restraints are used
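The staging of restraints in a variable target function can be sketched as a generator that widens the allowed sequence separation stage by stage. Only the restraint schedule is shown; the conjugate gradient minimization that would run at each stage is left out. The restraint layout and the cutoff values are hypothetical.

```python
def staged_restraints(restraints, cutoffs):
    """Yield (cutoff, active restraints) for increasing sequence separation.

    restraints: list of (i, j, target_distance) with residue indices i, j.
    cutoffs: increasing maximum values of |i - j|; the last one should
    cover the whole chain so that all restraints end up active.
    """
    for cutoff in cutoffs:
        active = [r for r in restraints if abs(r[0] - r[1]) <= cutoff]
        # A real implementation would minimize the model against `active` here.
        yield cutoff, active

example = [(1, 2, 3.8), (1, 5, 6.2), (1, 20, 12.0)]
stages = list(staged_restraints(example, [1, 5, 20]))
```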

  • III. Satisfaction of Spatial restraints

EVA Evaluation of Automatic protein structure prediction

[ Burkhard Rost, Andrej Sali, http://cubic.bioc.columbia.edu/eva/ ]

CASP Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction

http://PredictionCenter.llnl.gov

Model Accuracy Evaluation

SLIDE 19

Evaluation of Automatic protein structure prediction

[ Burkhard Rost, Andrej Sali, http://cubic.bioc.columbia.edu/eva/ ]

[Figure: EVA workflow. 1: New PDB release provides target sequences, e.g. MNIFEMLRID EGLRLKIYKD TEGYYTIGIG HLLTKSPSLN AAKSELDKAI GRNCNGVITK. 2: Prediction servers. 3: Evaluation of prediction accuracy.]

Typical types of errors

  • Sequence alignment errors
  • Loops which cannot be rebuilt
  • Inappropriate template selection
  • Structural rearrangements

SLIDE 20

e.g. GROMOS, CHARMM, AMBER, ...

  • Which types of errors in a protein structure can you identify with an empirical force field?
  • Which types of errors are not recognized?

Empirical Force Fields

  • Useful to identify regions with errors in backbone geometry

Statistical Methods

Ramachandran Plot of backbone angles (ϕ,ψ)

Regions: favored, generously allowed, disallowed

Amino acids with special properties:

  • PRO: ϕ = −60º
  • GLY ()
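The backbone angles shown in a Ramachandran plot are dihedral angles over four consecutive backbone atoms. The sketch below computes a dihedral from four points with the standard atan2 formula and applies it to ϕ (C of the previous residue, N, Cα, C) and ψ (N, Cα, C, N of the next residue). Function names are hypothetical, and no classification into favored or disallowed regions is attempted.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p1, p2, p3, p4):
    """Signed dihedral angle in degrees around the p2-p3 bond."""
    b1, b2, b3 = _sub(p2, p1), _sub(p3, p2), _sub(p4, p3)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

def phi_psi(c_prev, n, ca, c, n_next):
    """Backbone angles of one residue from five backbone atom positions."""
    return dihedral(c_prev, n, ca, c), dihedral(n, ca, c, n_next)
```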
SLIDE 21

Probability for a feature to occur in a given environment, e.g.

  • Solvent exposed / buried
  • Hydrophobic / polar environment
  • Electrostatic interactions
  • Secondary structure
  • etc.

1D - 3D Checks

[Figure: panels A and B with labeled residues I: Val13, II: Phe134, III: Ala182, *: Met80, +: Ile86]

Statistical Mean Force Potentials

SLIDE 22

Atom Type Definitions

[Figure: mean force potential (MFP, kcal/mol) as a function of distance (Å) for methyl-methyl pairs and for cysteine S-S pairs]

Statistical Mean Force Potentials
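Potentials of this kind are commonly derived from observed distance frequencies with the inverse Boltzmann relation, E(r) = -kT ln(f_observed(r) / f_reference(r)). The sketch below applies it per distance bin. Whether a particular tool uses exactly this form is not stated in the slides; the function name, the choice of reference counts, and the kT value of about 0.596 kcal/mol (roughly 300 K) are assumptions.

```python
import math

def mean_force_potential(observed_counts, reference_counts, kT=0.596):
    """Per-bin energy -kT * ln(f_obs / f_ref) in the units of kT.

    Both count lists are normalized to frequencies first. Bins never
    observed get +infinity; empty reference bins are rejected.
    """
    obs_total = sum(observed_counts)
    ref_total = sum(reference_counts)
    if obs_total <= 0 or ref_total <= 0:
        raise ValueError("need observations in both tables")
    energies = []
    for obs, ref in zip(observed_counts, reference_counts):
        if ref <= 0:
            raise ValueError("reference bin without counts")
        if obs <= 0:
            energies.append(math.inf)
            continue
        ratio = (obs / obs_total) / (ref / ref_total)
        energies.append(-kT * math.log(ratio))
    return energies
```

Distances seen more often than in the reference state get a negative (favorable) energy, rarely seen distances a positive one.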

SLIDE 23

ANOLEA : (Atomic Non-Local Environment Assessment)

  • http://protein.bio.puc.cl/cardex/servers/anolea/
  • http://swissmodel.expasy.org/anolea/

Detects:

  • local packing errors
  • errors in alignments

[Figure: ANOLEA profiles for the correct structure (PDB: 1GES) and for a model with wrong alignment]

ANOLEA

SLIDE 24

Checks the stereo-chemical quality of a protein structure, producing a number of plots analyzing its overall and residue-by-residue geometry.

  • Covalent geometry
  • Planarity
  • Dihedral angles
  • Chirality
  • Non-bonded interactions
  • Main-chain hydrogen bonds
  • Disulphide bonds
  • Stereochemical parameters
  • Residue-by-residue analysis

Laskowski R A, MacArthur M W, Moss D S & Thornton J M (1993). PROCHECK: a program to check the stereochemical quality of protein structures. J. Appl. Cryst., 26, 283-291.

Morris A L, MacArthur M W, Hutchinson E G & Thornton J M (1992). Stereochemical quality of protein structure coordinates. Proteins, 12, 345-364.

PROCHECK

WHAT IF I check my structure? Imagine ...

  • An everyday situation in a biocomputing lab: "Should they use the structure?"
  • An everyday situation in a crystallography lab: "Should they deposit the structure already?"

In a WHAT_CHECK report, each reported fact has an assigned severity:

  • error: severe errors encountered during the analyses. Items marked as errors are considered severe problems requiring immediate attention.
  • warning: either less severe problems or uncommon structural features. These still need special attention.
  • note: statistical values, plots, or other verbose results of tests and analyses that have been performed.

WHAT IF: A molecular modeling and drug design program. G. Vriend, J. Mol. Graph. (1990) 8, 52-56.

Errors in protein structures. R.W.W. Hooft, G. Vriend, C. Sander, E.E. Abola, Nature (1996) 381, 272-272.

WhatCheck / WhatIf

SLIDE 25

# 49 # Note: Summary report for users of a structure

This is an overall summary of the quality of the structure as compared with current reliable structures. This summary is most useful for biologists seeking a good structure to use for modelling calculations.

The second part of the table mostly gives an impression of how well the model conforms to common refinement constraint values. The first part of the table shows a number of constraint-independent quality indicators.

Structure Z-scores, positive is better than average:
  1st generation packing quality :  -2.550
  2nd generation packing quality :  -5.472 (bad)
  Ramachandran plot appearance   :  -1.898
  chi-1/chi-2 rotamer normality  :  -1.433
  Backbone conformation          :  -2.173

RMS Z-scores, should be close to 1.0:
  Bond lengths                   :   0.905
  Bond angles                    :   1.476
  Omega angle restraints         :   0.921
  Side chain planarity           :   2.681 (loose)
  Improper dihedral distribution :   1.771 (loose)
  Inside/Outside distribution    :   1.333 (unusual)

WhatCheck / WhatIf report for a bad model ...

All checking tools are happy, so can I believe it now?

  • Models are not experimental facts!
  • Models can be partially inaccurate or sometimes completely wrong!
  • A model is a tool that helps to interpret biochemical data.

SLIDE 26

ANOLEA : (Atomic Non-Local Environment Assessment)

  • http://protein.bio.puc.cl/cardex/servers/anolea/
  • http://swissmodel.expasy.org/anolea/

ProCheck

  • http://www.biochem.ucl.ac.uk/~roman/procheck/procheck.html

WhatCheck

  • http://www.cmbi.kun.nl/gv/whatcheck/

Verify3D

  • http://www.doe-mbi.ucla.edu/Services/Verify_3D/

Biotech Validation Suite for Protein Structures

  • http://biotech.ebi.ac.uk:8400/

Some useful Evaluation Tools

Model quality vs. sequence identity

[Figure: model quality as a function of sequence identity, divided into Safe Zone, Twilight Zone and Midnight Zone]

SLIDE 27

What can models be used for ?

Reference: Discovery of a potent and selective protein kinase CK2 inhibitor by high-throughput docking. Vangrevelinghe E, Zimmermann K, Schoepfer J, Portmann R, Fabbro D, Furet P. Oncology Research, Novartis Pharma, Basle, J Med Chem. 2003 Jun 19;46(13):2656-62.

Discovery of CK2α Inhibitors by in silico docking

Homology model of the target molecule:

SLIDE 28

  • In silico docking of a virtual library of 400'000 compounds
  • Distributed computing on PC grid

Discovery of CK2α Inhibitors by in silico docking

  • large scale experimental structure solution projects

Goal: Most of the sequences in a genome database should match at least one structure with a sufficient sequence identity allowing for reliable modeling.

Range of sequence space that can be modeled with acceptable accuracy.

The modeling error determines selection of targets for structural genomics.

Structural Genomics

SLIDE 29

Structural Genomics – Target Selection

Protein Modeling Resources

  • SWISS-MODEL: http://swissmodel.expasy.org
  • Modeller: http://www.salilab.org
  • WhatIf: http://www.cmbi.kun.nl/whatif/
  • 3D-JIGSAW: http://www.bmm.icnet.uk/people/paulb/3dj/form.html
  • CPHmodels: http://www.cbs.dtu.dk/services/CPHmodels/
  • SDSC1: http://cl.sdsc.edu/hm.html