Five hierarchical levels of sequence-structure correlations in - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Five hierarchical levels of sequence-structure correlations in proteins

Chris Bystroff Rensselaer Polytechnic Institute Troy, New York, USA

slide-2
SLIDE 2

What does structure prediction tell us about the physics of folding?

Check one: A. If we can predict protein structures, then we know how proteins fold. B. If we know how proteins fold, then we can predict protein structures.

slide-3
SLIDE 3

Two ways to predict protein structure...

Database search (statistics): query sequence --> best alignment

Folding simulation (physics): query sequence --> lowest energy

slide-4
SLIDE 4

Darwin:

Proteins with a common ancestor have the same fold. (query sequence --> best alignment; timescale: millions of years)

Boltzmann:

Proteins adopt the minimum free energy conformation. (query sequence --> lowest energy; timescale: microseconds to seconds)

...two very different underlying principles

slide-5
SLIDE 5

Darwin versus Boltzmann. Do hybrid models make sense?

[Figure: spectrum of methods from statistics to physics: BLAST (sequence similarity), threading (global structure similarity), Rosetta (knowledge-based physics), AMBER (physics)]

slide-6
SLIDE 6

We know proteins fold via pathways.

local structure first, eliminating alternate pathways, then global

slide-7
SLIDE 7

Proteins can fold because they don't have to search all of conformational space.

slide-8
SLIDE 8

We know that proteins have a hierarchy of structural similarity...

Image borrowed from CATH database

Class conserves 2° content
Architecture conserves packing of 2°
Topology* conserves chain connectivity

*Fold recognition algorithms work at this level

slide-9
SLIDE 9

Can we use the database to make models for folding pathways?

Steps along the folding pathway (early to late), each paired with a step in data mining:

(1) Initiation: local motifs
(2) propagation: extended local motifs
(3) condensation: pairs of motifs
(4) molten globule: multiple motifs
(5) native state: aligned multiple motifs

slide-10
SLIDE 10

Hierarchical level 1: Folding initiation site motifs

[Figure: a recurrent sequence pattern highlighted in a set of non-homologous sequences]

Is it a recurrent structure?

slide-11
SLIDE 11

Sampling bias creates problems for motif mining

First we must "factor out" inheritance

slide-12
SLIDE 12

Removing database redundancy

(1): Cluster sequences into phylogenetic trees. One family, one count.
(2): Apply a tree weight w_k to each sequence k.
(3): Convert each position to a probability distribution.

$$P_{ij} = \frac{\sum_{k \in \text{seqs}} w_k \, \delta(s_{kj} = aa_i)}{\sum_{k \in \text{seqs}} w_k}$$
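The weighting steps above can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation: it assumes the sequences are already aligned to equal length and that the tree weights have been computed elsewhere. The names `sequence_profile`, `aligned_seqs` and `weights` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequence_profile(aligned_seqs, weights):
    """Return profile[j][aa]: the weighted frequency of amino acid aa at column j.

    aligned_seqs: equal-length strings (assumed already aligned).
    weights: one tree weight per sequence (assumed precomputed).
    """
    if len(aligned_seqs) != len(weights):
        raise ValueError("need exactly one weight per sequence")
    total_weight = sum(weights)
    profile = []
    for j in range(len(aligned_seqs[0])):
        column = {aa: 0.0 for aa in AMINO_ACIDS}
        for seq, w in zip(aligned_seqs, weights):
            # delta(s_kj = aa_i): only the observed residue receives the weight.
            # Gap or unknown characters are skipped in this sketch.
            if seq[j] in column:
                column[seq[j]] += w
        profile.append({aa: v / total_weight for aa, v in column.items()})
    return profile
```

Two near-identical sequences that share a tree weight of 0.5 each then count as much as one isolated sequence with weight 1.0, which is the point of factoring out inheritance.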

"sequence profile"

slide-13
SLIDE 13

Clustering sequence profiles to find recurrent patterns

[Figure: sequence profiles for positions 26 to 32; rows are amino acids (AA) in the order C F L I V W Y M A Q N T S H R K E D P G]

Each dot represents a short profile

$$\sum_{l=1}^{L} \sum_{i=1}^{20} \left| P_{ijl} - P_{ikl} \right|$$

similarity metric (product of log-likelihood ratios)

$$D(p, q) = \sum_{\text{positions } j} \; \sum_{\text{amino acids } i} LLR(p_{ij}) \, LLR(q_{ij})$$
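As a rough illustration of the similarity metric, the sketch below sums the product of log-likelihood ratios over positions and amino acids. The slide does not define LLR, so this assumes LLR(p_ij) = log(p_ij / background_i) with a small pseudocount to avoid log(0); `llr`, `profile_similarity`, `background` and `pseudo` are hypothetical names.

```python
import math


def llr(p, background, pseudo=1e-3):
    """Log-likelihood ratio of a profile entry against a background frequency.

    The pseudocount is an assumption of this sketch, added to avoid log(0).
    """
    return math.log((p + pseudo) / (background + pseudo))


def profile_similarity(p, q, background):
    """Sum over positions j and amino acids i of LLR(p_ij) * LLR(q_ij).

    p, q: lists of dicts (one dict per position) mapping amino acid -> probability.
    background: dict mapping amino acid -> background frequency.
    """
    if len(p) != len(q):
        raise ValueError("profiles must have the same length")
    total = 0.0
    for col_p, col_q in zip(p, q):
        for aa, bg in background.items():
            total += llr(col_p.get(aa, 0.0), bg) * llr(col_q.get(aa, 0.0), bg)
    return total
```

Two profiles that are enriched and depleted in the same amino acids score positive; profiles with opposite preferences score negative.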

slide-14
SLIDE 14

The I-sites Library

Example motifs: diverging type-2 turn, serine hairpin, proline helix C-cap, alpha-alpha corner, glycine helix N-cap, frayed helix, type-I hairpin

Amino acids arranged from non-polar to polar

Backbone angles: ψ=green, φ=red

slide-15
SLIDE 15

Are I-sites really folding initiation sites?

Prediction experiments

(Bystroff & Baker, Proteins, 1997)

NMR data on peptides

(Yi et al, J.Mol.Biol., 1998)

Molecular dynamics simulations

(Bystroff & Garde, Proteins, 2002)

slide-16
SLIDE 16

Level 2. Motif grammar

Arrangement of I-sites motifs in proteins is highly non-random.

[Figure labels: helix, helix cap, beta strand, beta turn]

Adjacencies can be modeled as a Markov chain
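One way to picture this: if each protein is written as an ordered list of motif labels, the adjacency statistics are just first-order transition probabilities. The sketch below estimates them by counting. It is an illustration only, not how HMMSTR was trained, and `transition_probabilities` and `label_sequences` are hypothetical names.

```python
from collections import defaultdict


def transition_probabilities(label_sequences):
    """Estimate P(next motif | current motif) by counting adjacent pairs."""
    counts = defaultdict(lambda: defaultdict(int))
    for labels in label_sequences:
        for prev, nxt in zip(labels, labels[1:]):
            counts[prev][nxt] += 1
    probs = {}
    for prev, row in counts.items():
        total = sum(row.values())
        probs[prev] = {nxt: n / total for nxt, n in row.items()}
    return probs


# Toy input: motif labels read along three hypothetical chains.
chains = [
    ["helix", "helix cap", "beta strand"],
    ["helix", "helix cap", "beta turn"],
    ["beta strand", "beta turn", "beta strand"],
]
probs = transition_probabilities(chains)
```

A highly non-random arrangement shows up as rows dominated by one or two successors, such as "helix" being followed by "helix cap" in this toy data.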

slide-17
SLIDE 17

Aligned motifs become a Markov chain

[Figure: aligned profiles and aligned structures (φ, ψ) for the α helix, Type-1 G α C-cap and Type-2 G α C-cap motifs, and the resulting state topology]

slide-18
SLIDE 18

A Markov state from HMMSTR

Transitions: a_hi (from the previous state), a_ij and a_ik (to the next states)

Amino acid symbols: b_i = {ACDEF...}

Structure symbols: r_i = {HGEBdblLex}, d_i = {HST}, c_i = {mnhd...}

slide-19
SLIDE 19

Discretized structure states: backbone angle regions (ri)

slide-20
SLIDE 20

How an HMM works

We have S (the sequence). We want Q (the state sequence). P(Q|S) is the probability of Q given S:

$$P(Q \mid S) = \pi_{q_1}(s_1) \prod_{t=2}^{N} a_{q_{t-1} q_t} \, B_{q_t}(s_t)$$

Here π are the starting states, a are the arrows (transitions) and B are the amino acid profiles:

$$B_i(s_t) = b_{q_i}(O_t) \begin{pmatrix} d_i(D_t) \\ r_i(R_t) \\ c_i(C_t) \end{pmatrix}$$
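To make the product concrete, here is a minimal sketch of scoring one state path and of finding the highest-scoring path with the Viterbi algorithm. It makes several assumptions that are not on the slide: a single emission table stands in for B, the first term is read as the start probability times the emission of the first symbol, and the product is treated as a score proportional to P(Q|S). All names (`path_log_probability`, `viterbi`, `start`, `trans`, `emit`) are hypothetical.

```python
import math


def safe_log(x):
    """log(x), with log(0) mapped to -inf so impossible paths never win."""
    return math.log(x) if x > 0 else float("-inf")


def path_log_probability(states, symbols, start, trans, emit):
    """log of pi[q1] * B[q1](s1) * prod_t a[q(t-1)][q(t)] * B[q(t)](s(t))."""
    logp = safe_log(start[states[0]]) + safe_log(emit[states[0]].get(symbols[0], 0.0))
    for t in range(1, len(states)):
        prev, cur = states[t - 1], states[t]
        logp += safe_log(trans[prev].get(cur, 0.0)) + safe_log(emit[cur].get(symbols[t], 0.0))
    return logp


def viterbi(symbols, start, trans, emit):
    """Return (best state path, its log score) by dynamic programming."""
    states = list(start)
    score = {q: safe_log(start[q]) + safe_log(emit[q].get(symbols[0], 0.0)) for q in states}
    back = []  # back[t][q] = best predecessor of state q at step t + 1
    for s in symbols[1:]:
        new_score, pointers = {}, {}
        for q in states:
            best_prev = max(states, key=lambda p: score[p] + safe_log(trans[p].get(q, 0.0)))
            new_score[q] = (score[best_prev] + safe_log(trans[best_prev].get(q, 0.0))
                            + safe_log(emit[q].get(s, 0.0)))
            pointers[q] = best_prev
        back.append(pointers)
        score = new_score
    last = max(states, key=lambda q: score[q])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


# Toy two-state model (hypothetical numbers, not HMMSTR parameters).
start = {"H": 0.6, "E": 0.4}
trans = {"H": {"H": 0.9, "E": 0.1}, "E": {"H": 0.1, "E": 0.9}}
emit = {"H": {"A": 0.8, "V": 0.2}, "E": {"A": 0.2, "V": 0.8}}
best_path, best_score = viterbi("AAVV", start, trans, emit)
```

Working in log space avoids numerical underflow when N is the length of a real protein sequence.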

slide-21
SLIDE 21

HMMSTR

Hidden Markov Model for local protein STRucture

282 nodes 317 transitions

Unified model for 31 distinct sequence-structure motifs

(Bystroff & Baker, J. Mol. Biol., 2000)

slide-22
SLIDE 22

Level 1: I-sites initiation

Level 2: HMMSTR propagation

slide-23
SLIDE 23

Level 3: Pairwise Motif-Motif Contact Potentials

  • G(p, q, s) represents the free energy of a motif-motif contact.

$$G(p, q, s) = -\log \frac{\sum_{\text{PDBselect}} \; \sum_{i \,\ni\, D_{i,i+s} < 8\,\text{Å}} \Gamma(i, p) \, \Gamma(i+s, q)}{\sum_{\text{PDBselect}} \; \sum_{i} \Gamma(i, p) \, \Gamma(i+s, q)}$$
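The ratio inside the logarithm can be sketched as two running sums over a set of proteins. This is an illustration under assumptions the slide does not state: Γ(i, p) is taken to be a per-position probability of motif state p, stored as `gamma[i][p]`, and `dist[i][j]` is a precomputed residue-residue distance in Å. The function name and data layout are hypothetical.

```python
import math


def contact_potential(proteins, p, q, s, cutoff=8.0):
    """-log of (weighted count of p..q pairs in contact) / (weighted count of all p..q pairs).

    proteins: list of (gamma, dist) tuples, one per protein, where
      gamma[i] is a dict mapping state -> probability at position i (assumed),
      dist[i][j] is the distance between residues i and j in Angstroms.
    s: sequence separation between the two motif positions.
    """
    in_contact = 0.0
    total = 0.0
    for gamma, dist in proteins:
        for i in range(len(gamma) - s):
            weight = gamma[i].get(p, 0.0) * gamma[i + s].get(q, 0.0)
            total += weight
            if dist[i][i + s] < cutoff:
                in_contact += weight
    if total == 0.0:
        raise ValueError("states p and q never occur at separation s")
    if in_contact == 0.0:
        return math.inf  # never observed in contact
    return -math.log(in_contact / total)
```

A value near zero means the pair is almost always in contact at that separation; larger values mean contact is rarer.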

slide-24
SLIDE 24

What is a contact map?

Definition:

$$S(i, j) = \begin{cases} 1 & \text{if } d(i, j) \le D \\ 0 & \text{if } d(i, j) > D \end{cases}$$
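The definition translates directly into code. The sketch below assumes one 3D coordinate per residue (for example the Cα atom) and a cutoff D of 8 Å; both choices are assumptions for illustration, and `contact_map` is a hypothetical name.

```python
import math


def contact_map(coords, cutoff=8.0):
    """Return S as a nested list: S[i][j] = 1 if d(i, j) <= cutoff, else 0.

    coords: one (x, y, z) tuple per residue, in Angstroms.
    """
    n = len(coords)
    return [
        [1 if math.dist(coords[i], coords[j]) <= cutoff else 0 for j in range(n)]
        for i in range(n)
    ]


# Three residues on a line: the first two are close, the third is far away.
example = contact_map([(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (20.0, 0.0, 0.0)])
```

The map is symmetric and its diagonal is always 1, since every residue is at distance zero from itself.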

slide-25
SLIDE 25

E(i,j)

Both axes: sequence. Red: favorable contact. Blue: unfavorable contact.

slide-26
SLIDE 26

Features in a contact map (helices, strands) can be interpreted as a TOPS diagram

slide-27
SLIDE 27

Features in a contact map (helices, strands) can be interpreted as a TOPS diagram

Which one is right?

slide-28
SLIDE 28

CASP5 target T0130

A rule-based simulation procedure.

[Figure panels: contact energies, ab initio prediction, true contact map. Legend: amphipathic, non-polar]

slide-29
SLIDE 29

Level 4: Multibody arrangements of local motifs

It is difficult to see similarities between these two proteins, but...

slide-30
SLIDE 30

Different folds can have the same arrangement of secondary structure elements.

[Figure: numbered secondary structure elements of 1alk and 1vpt]

slide-31
SLIDE 31

SCALI: Structural Core ALIgnment

slide-32
SLIDE 32

How SCALI works

(1) Gapless alignment of HMMSTR states.
(2) Initialize tree search with one gapless fragment.
(3) Add a new fragment iff it is compatible and has a high score.
(4) A branch becomes a leaf when no more fragments can be added.

Score of leaves = aligned contacts + permutation penalty.
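The tree search can be pictured with a small sketch. This is not the SCALI implementation: it assumes fragments arrive as precomputed tuples `(start_a, start_b, length, score)`, treats two fragments as compatible when they do not overlap in either protein, uses the summed fragment scores as a stand-in for "aligned contacts", and applies the permutation penalty as a fixed negative term per out-of-order fragment pair. The names and the `min_score` and `penalty` parameters are hypothetical.

```python
from itertools import combinations


def compatible(frag, chosen):
    """A fragment is compatible if it overlaps no chosen fragment in either protein."""
    a, b, n, _ = frag
    for a2, b2, n2, _ in chosen:
        if a < a2 + n2 and a2 < a + n:
            return False
        if b < b2 + n2 and b2 < b + n:
            return False
    return True


def leaf_score(chosen, penalty):
    """Summed fragment scores minus a penalty per pair aligned out of sequential order."""
    total = sum(f[3] for f in chosen)
    inversions = sum(
        1 for f, g in combinations(chosen, 2) if (f[0] - g[0]) * (f[1] - g[1]) < 0
    )
    return total - penalty * inversions


def tree_search(fragments, min_score=1.0, penalty=0.5):
    """Depth-first search over compatible fragment sets; returns (best score, fragments)."""
    candidates = [f for f in fragments if f[3] >= min_score]
    best = (float("-inf"), ())
    seen = set()

    def extend(chosen):
        nonlocal best
        key = frozenset(chosen)
        if key in seen:  # the same set reached in a different order
            return
        seen.add(key)
        additions = [f for f in candidates if compatible(f, chosen)]
        if not additions:  # leaf: nothing more can be added
            score = leaf_score(chosen, penalty)
            if score > best[0]:
                best = (score, tuple(sorted(chosen)))
            return
        for f in additions:
            extend(chosen + [f])

    for seed in candidates:
        extend([seed])
    return best


# Toy fragments: the first two align in swapped order (a non-sequential alignment).
toy = [(0, 10, 5, 3.0), (10, 0, 5, 2.0), (2, 12, 4, 1.5), (20, 20, 5, 0.5)]
best_score, best_fragments = tree_search(toy)
```

In the toy input the best leaf keeps the two swapped fragments despite the penalty, which is the kind of non-sequential alignment a purely sequential aligner would miss. The exhaustive search is only practical for small fragment sets.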

slide-33
SLIDE 33

HMMs may be built based on non-sequential alignments

Markov states represent amino acid sequences and positions in space. Connections between them represent loops.

slide-34
SLIDE 34

Hidden Markov models for α/β/α proteins

slide-35
SLIDE 35

Core packing classes

Multiple non-sequential alignments are more specific than “architecture” but not as specific as “topology”.

Non-sequential clusters may be useful for classifying proteins

slide-36
SLIDE 36

Level 1: I-sites initiation

Level 2: HMMSTR propagation

Level 3: HMMSTR-CM condensation

Level 4: SCALI molten globule

slide-37
SLIDE 37

Level 5: Global topology

Separation of the SCOP 1.53 database into training and test sequences, shown for the G proteins test family

slide-38
SLIDE 38

Support Vector Machine

[Figure: points in the (x1, x2) plane separated by the optimal hyperplane, with the support vectors marked]

4052 proteins --> 54-dimensional vector. Each dimension is the order of appearance of HMMSTR states for one family.

slide-39
SLIDE 39

HMMSTR as the basis for a Support Vector Machine

SCOP benchmark of 54 sequence families, 4052 proteins. Each protein is represented as a 282-dimensional vector: the probability of each HMMSTR state.

(Hou, Y. et al, Bioinformatics, 2003; Proteins, 2004)
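A feature vector of this kind could be built as sketched below. The slide does not say how the per-state probability is aggregated over a protein, so this assumes the mean over positions of a per-position state probability table `gamma[t][k]`. The function name and the table layout are hypothetical.

```python
def state_probability_vector(gamma, n_states):
    """Average each state's probability over all positions of one protein.

    gamma[t][k]: probability of state k at position t (assumed to be given).
    Returns a list of length n_states (282 for HMMSTR) that sums to 1
    when every row of gamma sums to 1.
    """
    length = len(gamma)
    if length == 0:
        raise ValueError("empty protein")
    return [sum(row[k] for row in gamma) / length for k in range(n_states)]


# Toy example with 3 states and 3 positions.
vector = state_probability_vector(
    [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.0, 0.5, 0.5]], n_states=3
)
```

Because the vector has a fixed length regardless of protein length, proteins of different sizes can be passed to the same support vector machine.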

slide-40
SLIDE 40

No sparse data problem as we mine longer and longer patterns! Why?

Steps along the folding pathway (early to late), with the model used and its complexity:

(1) Initiation: I-sites, ~40 motifs
(2) propagation: HMMSTR, 1.1 transitions/node
(3) condensation: HMMSTR-CM, ~1% of pairs occur
(4) molten globule: SCALI, only self-avoiding paths
(5) native state: SVM-HMMSTR, ~1000-2000 folds

slide-41
SLIDE 41

Are there any conclusions?

We assumed that proteins fold in a certain, hierarchical manner, mined the data accordingly, and found recurrence at every level, from short motifs to global structure.

slide-42
SLIDE 42

Bystroff Lab

Yaoming Huang, Yu Shao, Donna Crone, Xin Yuan, Rachel van Duyne, Kwang Kim, Ben Cole

Funding from: NSF-CISE

HMMSTR says: Think Globally, Act Locally.

www.bioinfo.rpi.edu/~bystrc/

SVM-HMMSTR (Nat.Univ.Singapore): Yuna Hou, Mong-Li Lee, Wynne Hsu

HMMSTR: Chris Bystroff, Vesteinn Thorsson, David Baker

slide-43
SLIDE 43

Are I-sites folding initiation sites?

Patterns of conservation suggest energetic motive

  • 1. backbone angle constraints
  • 2. sidechain contacts
  • 3. negative design

slide-44
SLIDE 44

NMR structures confirm independent folding

[Figure, panels (a) to (d): sequence profiles for positions 26 to 32 and positions 1 to 7; rows are amino acids (AA) in the order C F L I V W Y M A Q N T S H R K E D P G; color scale running from 1.0 to -1]

NMR structure of a 7-residue I-sites motif in isolation (Yi et al, J. Mol. Biol., 1998)

diverging turn motif

slide-45
SLIDE 45

Peptide simulations show a correlation between sequence and stability

AAALDRMR AALEALLR AANRSHMP AARYKFIE ADFKAAVA AFDGETEI AKELVVVY AKGVETAD ARFTKRLG ATLEEKLN CNGGHWIA DAVTRYWP DEAIDAYI DELTRHIR DYVRSKIA EDLVERLK EELKQALR EEMVSKLK EKLLESLE EKPFGTSY EQIKAAVK FHMYFMLR FSVMNDAS FYSSYVYL GQLMALKQ HNLIEAFE IEHTLNEK IQNGDWTF KAAIAQLR KKYRPETD KNPDNVVG KPMGPLLV KQAHPDLK KQDKHYGY KSYLRSLR LDLHQTYL NAVWAAIK NETHSGRK NFLEVGEY NPVKESRH PAIISAAE PLQHHNLL PRDANTSH QDDARKLM QGIIDKLD QKMKTYFN QTLAQLSV RDFEERMN RIILDRHR RLLLKAYR RPIARMLS RVLGRDLF SCDVKFPI TEVMKRLV TLNEKRIL YASLRSLV YESHVGCR

Sequences simulated

  • AMBER 6.0
  • 800-900 waters
  • Ion balance (Na, Cl)
  • 340 K
  • ≥ 5 ns


slide-46
SLIDE 46

3-state secondary structure prediction

74.9% correct 74.6% correct

slide-47
SLIDE 47

Predicting super-secondary context

Results are for the independent test set.

slide-48
SLIDE 48

HMMSTR can predict which parts of a structure might misfold.

[Figure: HMMSTR secondary structure prediction (H, E, L; scale 0 to 1) for residues 125 to 265 of a human prion protein fragment, with helices 1, 2 and 3 marked. X-ray structure solved in 2002.]

Helix 3 is known to be the location of familial prion disease mutations.

Knaus et al, NSB 8:770-4, 2001