Inferring protein functions by matching binding surfaces through evolutionary models



slide-1
SLIDE 1

Inferring protein functions by matching binding surfaces through evolutionary models

Jie Liang (Joint work with Jeffrey Tseng)

  • Dept. of Bioengineering

University of Illinois at Chicago

slide-2
SLIDE 2

Outline

Methodology:

  • Computational geometry of surface patterns:

– Candidate motifs.

  • Assessing surface similarity:

– Sequence, shape, orientation, and p-values.

  • Incorporation of evolutionary information by Bayesian Markov chain Monte Carlo.

Discovery:

  • Protein functional prediction.
slide-3
SLIDE 3

The Universe of Protein Structures

  • Human genome: 3 billion nucleotides
  • Number of genes: 30,000
  • Protein families: 10,000-30,000
  • Number of folds: 1,000 - 4,000
  • Currently in PDB: < 700 folds

– Comparative modeling: needs a structural template with sequence identity > 30-35%.

  • E.g., ~50% of ORFs and ~18% of residues of the S. cerevisiae genome.
  • Structural genomics: populating each fold with 4-5 structures.

– One for each superfamily at 30-35% sequence identity.
– The fold of a novel gene can be identified.

  • Its structure can then be interpolated by comparative modeling.

[Figure: example folds, all β and α/β (from SCOP)]

slide-4
SLIDE 4
  • Main chain folds:

– Important for understanding evolution.
– May not directly lead to understanding of function.

[Figure: tenascin (1ten; SCOP all beta proteins, Ig-like beta sandwich) and phosphotransferase (1poh; SCOP a+b proteins, HPr fold) (from Jaroszewski & Godzik, ISMB 00)]

slide-5
SLIDE 5

Predicting protein function by matching surfaces

  • Proteins from structural genomics are often of unknown function.

– Sequence homologs are often hypothetical proteins.

  • Strategy: matching automatically computed surfaces that may be binding sites.

  • Three tasks:

– Geometric computation: a library of > 2 million surface patterns on > 20,000 PDB structures (cast.engr.uic.edu).
– Similarity measure: sequence patterns, coordinate RMSD, and orientational RMSD.
– Scoring matrix.

Shape library

(Binkowski, Adamian, and Liang, J. Mol. Biol. 332:505-526, 2003)

(Mucke and Edelsbrunner, ACM Trans. Graphics, 1994; Edelsbrunner et al., Discrete Applied Math., 1998)

slide-6
SLIDE 6

Protein Functional Surfaces

Ras p21 and FtsZ

GDP Binding Pockets

slide-7
SLIDE 7

http://cast.engr.uic.edu

slide-8
SLIDE 8

Voids and Pockets in Soluble Proteins

[Figure: number of voids and pockets vs. number of residues]

  • Many voids and pockets.

– Each large enough to hold at least 1 water molecule.
– About 15 per 100 residues.

(Liang & Dill, 2001, Bioph J)

slide-9
SLIDE 9

Simulating Protein Packing with Off-Lattice Chain Polymers

  • 32-state off-lattice discrete model
  • Sequential Monte Carlo and resampling:

– 1,000+ conformations of N = 2,000 (Zhang, Chen, Tang and Liang, 2003, J. Chem. Phys.)

slide-10
SLIDE 10
  • Proteins are not optimized by evolution to eliminate voids.

– Protein dictated by generic compactness constraint related to nc.

slide-11
SLIDE 11

How to distinguish biologically important pockets and voids from random ones?

Local Sequence and Shape Similarity

(Binkowski, Adamian, Liang, 2003, JMB, 332:505-526)

slide-12
SLIDE 12

Binding Site Pocket: Sparse Residues, Long Gaps

  • ATP Binding: cAMP Dependent Protein Kinase (1cdk)
  • Tyr Protein Kinase c-src (2src)

1cdk.A     49 LGTGSFGRVMLVKHKETGNHFAMKILDKQKVVKLKQIEHTLNEKRILQAVNFPFLVKLEYSFKDNSNL
              YMVMEYVPGGEMFSHLRRIGRFSEPHARFYAAQIVLTFEYLHSLDLIYRDLKPENLLIDQQGYIQVTDFG
              FAKRVKGRTWTLCGTPEYLAPEIILSKGYNKAVDWWALGVLIYEMAAGYPPFFADQPIQIYEKIVSGKVR
              FPSHFSSDLKDLLRNLLQVDLTKRFGNLKDGVNDIKNHKWFATTDWIAIYQRKVEAPFIPKFKGPGDTSN
              F 327
1cdk.A_p   49 LGTGSFGRV A K V MEYV E K EN L TD F

2src.m    273 LGQGCFGEVWMGTWNGTTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKLVQLYAVVSEEPIYIV
              TEYMSKGSLLDFLKGETGKYLRLPQLVDMAAQIASGMAYVERMNYVHRDLRAANILVGENLVCKVAD 404
2src.m_p  273 LGQGCFGEV A K V TEYM GS D D R AN L AD

Low overall sequence identity: 13 %

slide-13
SLIDE 13

High Sequence Similarity of Pocket Residues

2src

Tyr Protein Kinase c-src

1cdk

cAMP Dependent Protein Kinase

1cdk.A  LGTGSFGRVAKVMEYV---EKENLTDF  24
2src.m  LGQGCFGEVAKVTEYMGSDDRANLAD-  26
        ** *.**.**** **:   :: **:*

High sequence identity: 51 %
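
The identity figure for the pocket alignment can be checked with a few lines of Python. This is a minimal sketch, not the authors' procedure: it assumes identity is counted over all alignment columns, gaps included, and the function name `percent_identity` is made up for illustration.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Return (identical_columns, total_columns, percent) for two aligned strings."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # A column counts as identical only if both residues match and neither is a gap.
    identical = sum(
        1 for a, b in zip(aligned_a, aligned_b)
        if a == b and a != gap
    )
    total = len(aligned_a)
    return identical, total, 100.0 * identical / total


# Pocket residues of 1cdk.A and 2src.m as aligned on this slide.
pocket_1cdk = "LGTGSFGRVAKVMEYV---EKENLTDF"
pocket_2src = "LGQGCFGEVAKVTEYMGSDDRANLAD-"

identical, total, pct = percent_identity(pocket_1cdk, pocket_2src)
print(f"{identical}/{total} identical columns = {pct:.1f}%")
```

With this denominator, 14 of 27 columns are identical, in line with the roughly 51% quoted above. A different denominator (for example the shorter ungapped length) would give a slightly different value.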

slide-14
SLIDE 14

Sequence Similarity of Surface Pockets

  • Similarity detection:

– Dynamic programming SSEARCH (Pearson, 1998)

  • BLOSUM50 scoring matrix (Henikoff, 1994).
  • Not identity.

– Order Dependent Sequence Pattern.

  • Statistics of Null Model:

– Gapless local alignment: extreme value distribution (Altschul & Karlin, 90).
– Alignment with gaps: (Altschul, Bundschuh, Olsen & Hwa, 01).

Statistical Significance !
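
To make the dynamic-programming step concrete, here is a minimal Smith-Waterman local alignment score in Python. It is a sketch under simplifying assumptions: a flat match/mismatch score stands in for the BLOSUM50 matrix, and a linear gap penalty stands in for the gap model used by SSEARCH. The function name and parameter values are illustrative only.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b (score only, no traceback)."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a new local alignment
                H[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,    # gap in b
                H[i][j - 1] + gap,    # gap in a
            )
            best = max(best, H[i][j])
    return best


# Pocket residue strings from the kinase example, gaps removed.
score = smith_waterman_score("LGTGSFGRVAKVMEYVEKENLTDF",
                             "LGQGCFGEVAKVTEYMGSDDRANLAD")
print(score)
```

Because pocket residues are concatenated in sequence order, the alignment is order dependent, which matches the "order dependent sequence pattern" noted above.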

slide-15
SLIDE 15

Approximation with EVD distribution (Pearson, 1998, JMB)

  • Kolmogorov-Smirnov test:

– Estimate K and λ parameters.

  • Estimation of E-value:

– Estimate the p-value of the observed Smith-Waterman score S by EVD:

$$S' = \lambda S - \ln(Kmn), \qquad p(S' \ge x) = 1 - \exp(-e^{-x})$$

– Estimate E-value:

$$E = p \cdot (N_{\text{all}} - N_d) \le p \cdot N_{\text{all}}$$

(Binkowski, Adamian, Liang, 2003, JMB, 332:505-526)
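
The EVD-based p-value and the E-value bound can be sketched in Python as follows. This is an illustration, not the authors' code: the values of λ, K, the raw score, and the database size are hypothetical, and only the upper bound E ≤ p · N_all is computed because the slide does not define N_d.

```python
import math


def normalized_score(raw_score, lam, k, m, n):
    """S' = lambda * S - ln(K * m * n) for sequences of lengths m and n."""
    return lam * raw_score - math.log(k * m * n)


def evd_p_value(x):
    """p(S' >= x) = 1 - exp(-exp(-x)), written with expm1 for small p-values."""
    return -math.expm1(-math.exp(-x))


def e_value_upper_bound(p, n_all):
    """Upper bound E <= p * N_all on the expected number of chance hits."""
    return p * n_all


# Hypothetical parameters, chosen only to exercise the formulas.
lam, k = 0.25, 0.05
s_prime = normalized_score(raw_score=60, lam=lam, k=k, m=24, n=26)
p = evd_p_value(s_prime)
e_bound = e_value_upper_bound(p, n_all=2_000_000)
print(f"S' = {s_prime:.2f}, p = {p:.3g}, E <= {e_bound:.3g}")
```

In practice λ and K would be estimated from scores of random pocket pairs, as the Kolmogorov-Smirnov bullet above indicates.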

slide-16
SLIDE 16

Shape Similarity Measure

  • cRMSD (coordinate root mean square distance)
  • oRMSD (Orientational RMSD):

– Place a unit sphere S² at the center of mass x0 ∈ R³.
– Map each residue x ∈ R³ to a unit vector on S²: f : x = (x, y, z)ᵀ ↦ u = (x − x0) / ||x − x0||.
– Measure the RMSD between the two sets of unit vectors.

(cf. uRMSD by Kedem and Chew, 2002)
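
The mapping onto the unit sphere can be sketched in plain Python. Two assumptions are made here that the slide does not state: the center of mass is taken as the unweighted centroid of the residue coordinates, and the two pockets are assumed to be already matched residue by residue and placed in a common frame (no optimal rotation is searched for). Function names are illustrative.

```python
import math


def centroid(points):
    """Unweighted centroid of a list of (x, y, z) tuples."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))


def unit_vectors(points):
    """Map each point x to u = (x - x0) / ||x - x0||, with x0 the centroid.

    Assumes no point coincides with the centroid (the norm would be zero).
    """
    x0 = centroid(points)
    vectors = []
    for p in points:
        d = tuple(p[k] - x0[k] for k in range(3))
        norm = math.sqrt(sum(c * c for c in d))
        vectors.append(tuple(c / norm for c in d))
    return vectors


def rmsd(a, b):
    """Root mean square distance between two equally long, matched point lists."""
    if len(a) != len(b):
        raise ValueError("point sets must have the same size")
    sq = sum((p[k] - q[k]) ** 2 for p, q in zip(a, b) for k in range(3))
    return math.sqrt(sq / len(a))


def ormsd(a, b):
    """Orientational RMSD: RMSD between the unit vectors of the two point sets."""
    return rmsd(unit_vectors(a), unit_vectors(b))


# A toy pocket and a copy scaled by 2 about its centroid (the origin here).
pocket = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0), (-1.0, -2.0, -3.0)]
scaled = [tuple(2.0 * c for c in p) for p in pocket]
print(rmsd(pocket, scaled), ormsd(pocket, scaled))
```

The scaled copy has a nonzero cRMSD but an oRMSD of zero, which shows what the orientational measure ignores: distances from the center, keeping only directions.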

slide-17
SLIDE 17

Statistical Significance of Shape Similarity

  • Estimate the probability p of obtaining a specific cRMSD or oRMSD value for random pockets with Nres residues:

– EVD and other parametric distributions are not accurate.
– Randomly select 2 pockets.
– Calculate cRMSD for Nres randomly selected residues.
– Also calculate oRMSD.

Nres    Random surfaces
3       10-8
30      10-7
100     10-6

(Binkowski, Adamian, Liang, 2003, JMB, 332:505-526)
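
The sampling procedure above amounts to building an empirical null distribution. The sketch below illustrates the idea in Python with synthetic coordinates standing in for a real pocket library; it skips structural superposition, and the names `sample_null` and `empirical_p_value` are hypothetical.

```python
import math
import random


def crmsd(a, b):
    """Coordinate RMSD between two matched lists of (x, y, z) tuples."""
    sq = sum((p[k] - q[k]) ** 2 for p, q in zip(a, b) for k in range(3))
    return math.sqrt(sq / len(a))


def sample_null(pockets, n_res, n_samples, rng):
    """cRMSD values for n_res randomly chosen residues from random pocket pairs."""
    values = []
    for _ in range(n_samples):
        first, second = rng.sample(pockets, 2)  # randomly select 2 pockets
        a = rng.sample(first, n_res)            # n_res random residues from each
        b = rng.sample(second, n_res)
        values.append(crmsd(a, b))
    return values


def empirical_p_value(observed, null_values):
    """Fraction of null values at least as small as the observed RMSD."""
    hits = sum(1 for v in null_values if v <= observed)
    return hits / len(null_values)


# Synthetic stand-in for a pocket library: 50 pockets of 30 random points each.
rng = random.Random(0)
pockets = [
    [(rng.uniform(-10, 10), rng.uniform(-10, 10), rng.uniform(-10, 10))
     for _ in range(30)]
    for _ in range(50)
]
null = sample_null(pockets, n_res=8, n_samples=2000, rng=rng)
p = empirical_p_value(observed=2.0, null_values=null)
print(p)
```

Resolving very small p-values this way requires a very large number of random samples, since the smallest nonzero estimate is 1 divided by the number of samples.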

slide-18
SLIDE 18

Surprising Surface Similarity

HIV-1 Protease (5hvp)

  • Class: All β
  • Fold (CATH): Acid proteases
  • Family: Retroviral protease
  • Pocket: binds polypeptide substrate; acetyl-pepstatin

Heat Shock Protein 90 (1yes)

  • Class: α+β
  • Fold (CATH): α/β sandwich
  • Family: Hsp90
  • Pocket: binds protein segment; geldanamycin

  • Conserved residues in both pockets are important in polypeptide binding.
  • Both pockets undergo conformational changes upon binding.

slide-19
SLIDE 19

How to incorporate evolutionary information? What to do if related sequences all have unknown functions?

slide-20
SLIDE 20

Likelihood function of a given phylogeny

  • Given a set of multiple-aligned sequences $S = (x_1, x_2, \ldots, x_s)$ and a phylogenetic tree $T = (V, \mathcal{E})$, a column $x_h$ at position $h$ is represented as:

$$x_h = (x_{1,h}, x_{2,h}, \ldots, x_{s,h})^T$$

[Figure: example phylogenetic tree with 16 leaves; scale bar 0.1 substitution/site]

  • The likelihood function of observing these sequences is:

One column (sum over residues $x_i \in A$ at internal nodes $i \in I$, product over edges $(i, j)$ of the tree):

$$p(x_h \mid T, Q) = \pi_{x_k} \sum_{i \in I,\; x_i \in A} \; \prod_{(i,j) \in \mathcal{E}} p_{x_i x_j}(t_{ij})$$

Whole sequence:

$$P(S \mid T, Q) = P(x_1, \ldots, x_s \mid T, Q) = \prod_{h=1}^{s} p(x_h \mid T, Q)$$
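
The column likelihood on this slide sums over all residue assignments at internal nodes; in practice that sum is usually evaluated by a recursive pass over the tree. The Python sketch below illustrates this on a three-leaf tree. It is a simplified illustration, not the authors' implementation: an equal-rates model with a single rate `alpha` and uniform stationary frequencies replaces the general rate matrix Q, because its transition probabilities have a closed form, and the tree, leaf names, and columns are made up.

```python
import math

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def transition_prob(a, b, t, alpha, n_states):
    """p_ab(t) when every off-diagonal rate equals alpha (assumed toy model)."""
    decay = math.exp(-n_states * alpha * t)
    if a == b:
        return 1.0 / n_states + (n_states - 1) / n_states * decay
    return (1.0 - decay) / n_states


def conditional_likelihoods(node, column, alphabet, alpha):
    """Map residue -> P(observed leaves below node | node carries that residue)."""
    if isinstance(node, str):  # leaf: node is the sequence name
        return {a: 1.0 if column[node] == a else 0.0 for a in alphabet}
    result = {a: 1.0 for a in alphabet}
    for child, branch_length in node:  # internal node: (child, length) pairs
        below = conditional_likelihoods(child, column, alphabet, alpha)
        for a in alphabet:
            result[a] *= sum(
                transition_prob(a, b, branch_length, alpha, len(alphabet)) * below[b]
                for b in alphabet
            )
    return result


def column_likelihood(tree, column, alphabet=AMINO_ACIDS, alpha=0.1):
    """p(x_h | T, Q): root frequencies (uniform here) times root conditionals."""
    root = conditional_likelihoods(tree, column, alphabet, alpha)
    return sum(root[a] / len(alphabet) for a in alphabet)


def log_likelihood(tree, columns, alphabet=AMINO_ACIDS, alpha=0.1):
    """log P(S | T, Q): columns are treated as independent, so logs add."""
    return sum(math.log(column_likelihood(tree, c, alphabet, alpha)) for c in columns)


# Hypothetical tree ((s1:0.1, s2:0.1):0.05, s3:0.2) and two alignment columns.
inner = (("s1", 0.1), ("s2", 0.1))
tree = ((inner, 0.05), ("s3", 0.2))
columns = [
    {"s1": "L", "s2": "L", "s3": "L"},
    {"s1": "G", "s2": "G", "s3": "A"},
]
print(log_likelihood(tree, columns))
```

Replacing `transition_prob` with entries of the matrix exponential of a general Q would recover the form used on the slide; the recursion itself would not change.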

slide-21
SLIDE 21

Estimation of instantaneous rates Q

  • Posterior probability of the rate matrix given the sequences and tree:

$$\pi(Q \mid S, T) \propto P(S \mid T, Q) \cdot \pi(Q)$$

where $\pi(Q)$ is the prior distribution, $P(S \mid T, Q)$ the likelihood, and $\pi(Q \mid S, T)$ the posterior distribution; the normalizing constant is $\int P(S \mid T, Q)\, \pi(Q)\, dQ$.

  • Bayesian estimation of posterior mean of rates in Q :

Eπ(Q) = ∫ Q · π (Q | S, T) d Q,

  • Estimated by Markov chain Monte Carlo.
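
As a toy illustration of estimating a posterior mean by Markov chain Monte Carlo, the Python sketch below runs a random-walk Metropolis sampler for a single rate parameter. Everything in it is a simplifying assumption rather than the method of the talk: one rate `alpha` replaces the full matrix Q, the data are just the number of differing sites between two sequences separated by time `t`, the equal-rates 20-state model gives the likelihood, and the prior is Exponential(1).

```python
import math
import random

N_STATES = 20


def log_posterior(alpha, n_diff, n_sites, t):
    """Unnormalized log posterior of alpha under the assumed toy model."""
    if alpha <= 0:
        return -math.inf
    # Probability that a site differs after time t under the equal-rates model.
    p_diff = (N_STATES - 1) / N_STATES * (1.0 - math.exp(-N_STATES * alpha * t))
    log_lik = n_diff * math.log(p_diff) + (n_sites - n_diff) * math.log(1.0 - p_diff)
    return log_lik - alpha  # Exponential(1) prior contributes -alpha


def posterior_mean(n_diff, n_sites, t, steps=20000, burn_in=2000,
                   step_sd=0.002, seed=1):
    """Random-walk Metropolis estimate of E[alpha | data]."""
    rng = random.Random(seed)
    alpha = 0.05  # deliberately poor starting value
    current = log_posterior(alpha, n_diff, n_sites, t)
    total, kept = 0.0, 0
    for step in range(steps):
        proposal = alpha + rng.gauss(0.0, step_sd)
        candidate = log_posterior(proposal, n_diff, n_sites, t)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            alpha, current = proposal, candidate
        if step >= burn_in:
            total += alpha
            kept += 1
    return total / kept


# 86 differing sites out of 500, roughly what alpha = 0.01 and t = 1 would give.
estimate = posterior_mean(n_diff=86, n_sites=500, t=1.0)
print(estimate)
```

The same accept/reject loop applies when the state is a full rate matrix and the likelihood is evaluated on a phylogenetic tree; only `log_posterior` and the proposal change.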
slide-22
SLIDE 22

Validation by simulation

  • Generate 16 artificial sequences from a known tree and known rates (JTT model).

– Carboxypeptidase A2 precursor as ancestor, length = 147.

  • Goal: recovering the substitution rates.

[Figure: phylogenetic tree used to generate the 16 sequences; scale bar 0.1 substitution/site]

[Figure: convergence of the Markov chain; negative log likelihood vs. number of steps, data points collected every 500 simulation steps]

[Figure: JTT model rates vs. parameters estimated from two runs (Parameters Estimated 1, Parameters Estimated 2), plotted by rate index]

Estimations from two initial conditions are very similar to the true values of residue substitution rates.

slide-23
SLIDE 23

Accurate Estimation with > 20 residues and random initial values

[Figure: distribution of relative errors of estimated rates starting from 50 sets of random initial values. All relative errors < 5%.]

  • Rel. Err. = (|Q’|_F − |Q|_F) / |Q|_F

[Figure: relative error vs. sequence length. Accurate when > 20 residues in length.]

The Q’ matrix estimated by Bayesian MCMC has a small relative error (< 5%) to Q, as measured by the Frobenius norm.
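
The relative error defined on this slide can be computed directly. The Python sketch below follows the slide's formula, which compares the two Frobenius norms rather than taking the norm of the difference Q’ − Q; the 2 × 2 matrices are made-up stand-ins for 20 × 20 rate matrices.

```python
import math


def frobenius_norm(matrix):
    """Square root of the sum of squared entries of a list-of-lists matrix."""
    return math.sqrt(sum(v * v for row in matrix for v in row))


def relative_error(q_est, q_true):
    """(|Q'|_F - |Q|_F) / |Q|_F, as written on the slide."""
    true_norm = frobenius_norm(q_true)
    return (frobenius_norm(q_est) - true_norm) / true_norm


# Hypothetical "true" rate matrix and an estimate that is uniformly 10% too large.
q_true = [[-1.0, 1.0], [2.0, -2.0]]
q_est = [[1.1 * v for v in row] for row in q_true]
print(relative_error(q_est, q_true))
```

Note that this measure can be zero for two different matrices with equal norms, so it is a coarser check than the norm of the elementwise difference.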

slide-24
SLIDE 24

Surface motifs known to be biologically important

[Figure (a): active-site pocket length distribution (length vs. frequency). Total 6273 protein functional pockets; mean ~ 35 residues, median ~ 23 residues.]

[Figure (b): amino acid composition of active-site pockets (amino acid probability over A R N D C Q E G H I L K M F P S T W Y V); active-site pocket composition vs. JTT amino acid composition.]

Fig (a). From 6,273 protein active site pockets, 80% have between 8 and 200 a.a.

  • The average length: 35 residues.

Fig (b). Compare amino acid composition of functional site pockets (7,173 protein pockets) with protein sequence database (16,300 proteins) by JTT. Functionally important residues: His (H), Asp (D), Tyr (Y), Trp (W), and Gly (G); Phe (F), Asn (N), and Arg (R).
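
A composition profile like the one in Fig (b) is a normalized residue count over a set of pocket sequences. The Python sketch below shows the computation on the two kinase pocket strings from the earlier slides; they are used only as sample input and are not the data set behind the figure.

```python
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def composition(sequences):
    """Fraction of each of the 20 amino acids over all given sequences."""
    counts = Counter(
        res for seq in sequences for res in seq if res in AMINO_ACIDS
    )
    total = sum(counts.values())
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


# Pocket residues of 1cdk.A and 2src.m (gaps removed), as sample input.
pockets = ["LGTGSFGRVAKVMEYVEKENLTDF", "LGQGCFGEVAKVTEYMGSDDRANLAD"]
freqs = composition(pockets)
for aa in AMINO_ACIDS:
    print(aa, f"{freqs[aa]:.2f}")
```

Comparing such a profile against background frequencies (for example, those of the JTT model) highlights residues that are over- or under-represented in functional pockets.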

slide-25
SLIDE 25

Evolutionary rates of binding sites and other regions are different

Residues on protein functional surface experience different selection pressure.

Estimated substitution rate matrices of amylase (each a 20 × 20 matrix over A R N D C Q E G H I L K M F P S T W Y V):

  • (a) The active pocket (functional surface residues) [ValidPairs: 39]
  • (b) The rest of the surface [ValidPairs: 177]
  • (c) The interior residues [ValidPairs: 190]
  • (d) All surface residues [ValidPairs: 187]

Sampled pairs are residues (i, j) shown in the same column of the MSA; Sij are estimated by Bayesian MCMC.

slide-26
SLIDE 26

Improved functional prediction

slide-27
SLIDE 27

Finding alpha amylase by matching pocket surfaces

Challenging:

– Amylases often have low overall sequence identity (< 25%).
– 1bag, pocket 60; B. subtilis
– 14 sequences, none with structures, 2 are hypothetical
– 1bg9; barley
– 9 sequences, none with structures.

slide-28
SLIDE 28

Query: B. subtilis (1bag), barley (1bg9)

Results for Amylase

  • 1bag: found 58 PDB structures.
  • 1bg9: found 48 PDB structures.
  • Altogether: 69

– All belong to amylase (EC 3.2.1.1)

Comparison:

  • Annotated enzyme structure database (Thornton): 75.

Hits: human 1b2y, 1u2y (22%, 23%)

slide-29
SLIDE 29

Comparison with others

Benchmark data:

– Enzyme Structure Database (ESD):

template   our results   ESD   psi-blast
1bag       58            75    31
1bg9       48            75    11
union      69            75    41

– Psi-blast: does not contain information about which surface region, active residues, and geometry; contains many uninterpretable false positives.
– Ssearch: 32 structures found.

slide-30
SLIDE 30

Inferring biological functions of protein BioH

Protein of unknown function from structural genomics.

[Figure, panels (a) and (b)] The candidate binding pocket (CASTp id=35) of BioH (1m33) and a similar functional surface detected from proteinase A (2jxr), CASTp id=104. The phylogenetic tree of 28 sequences related to BioH (scale bar: 0.1 substitutions/site). Many are hypothetical genes.

Tree leaves: NP_991531, 1A7U, 1BRT, 1HKH, 1A88, 1A8S, 1A8Q, 1IUN, 1J1I, ZP_00217495, ZP_00221914, ZP_00194950, NP_822724, ZP_00031039, ZP_00032388, NP_069699, ZP_00081046, NP_273326, NP_618510, NP_871588, NP_778087, NP_249193, NP_790345, ZP_00086843, NP_904047, NP_298645, NP_796527, 1M33.

slide-31
SLIDE 31

Summary

  • Model for evolution of binding surfaces:

– Continuous Markov process for residue substitution.

  • Bayesian Markov chain Monte Carlo works for residue rates:

– Fast convergence, insensitive to perturbation of tree topology and representative sequences.
– Small relative errors (< 5%) for > 20 residues.

  • Can be used for function prediction.

– Database search of functionally related binding surfaces.

slide-32
SLIDE 32

Collaborators

  • Andrew Binkowski (UIC)
  • Jinfeng Zhang (UIC)

Acknowledgement

  • NSF CAREER DBI 0133856 and DBI 0078270
  • NIH GM68958
  • ONR MURI
  • Whitaker Foundation