

SLIDE 1

An Optimized Neural Network for Contact Prediction

George Shackelford Kevin Karplus University of California, Santa Cruz

UCSC – p. 1/28

SLIDE 2

Using Contact Predictions

3D structure prediction is hard. Local structure predictions, such as secondary structure, are good. Tools for searching fold space are good, but challenged by complexity. With contact prediction we would use a small but accurate set of predicted contacts as constraints in Undertaker or Rosetta. But contact prediction is hard.

SLIDE 3

Residue-Residue Contact Definitions

Contact between residues does not mean actual physical contact (i.e., at van der Waals distance). CASP definition: two residues i, j are in contact when the distance between their respective Cβ atoms is less than 8 Å. We define separation as |i − j|.
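The CASP definition above fits in a few lines; this is a minimal sketch assuming 3-D Cβ coordinates are available (the function names are illustrative):

```python
import math

CASP_THRESHOLD = 8.0  # angstroms, between C-beta atoms

def is_contact(cb_i, cb_j, threshold=CASP_THRESHOLD):
    """CASP contact: distance between the two C-beta atoms is under 8 angstroms."""
    return math.dist(cb_i, cb_j) < threshold

def separation(i, j):
    """Sequence separation between residue indices i and j."""
    return abs(i - j)
```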

SLIDE 4

Method: Neural Network

Upside: can provide excellent classification. Downside: a black box that gives little or no information about feature relationships. Software is based on FANN (Fast Artificial Neural Network library). Training used improved resilient back-propagation. CASP6 approach: use all the inputs we could. CASP7 goal: use good inputs while eliminating weak or redundant ones.

SLIDE 5

Multiple Sequence Alignment

We use multiple sequence alignments from SAM-t04 as a source of evolutionary data:

>2baa i j
SVSSIVSR AQFDRMLLHRNDGACQAKGFYTYDAFV
asaDISSLISQ DMFNEMLKHRNDGNCPGKGFYTYDAFI
avtAVASLVTSgGFFAEARWYGPGGKCSSVE-------A
dtiQANFVVSE AQFNQMFPNRNP-------FYTYQGLV

We have features for single columns, i and j, and for paired columns, (i, j).

SLIDE 6

Thinning the Sequence Alignment

If the sequences are too similar, we tend to see false correlations. We use thinning to reduce the sample bias. To thin an MSA to 50%, we remove sequences from the set until no pair of sequences has more than 50% identity. We use 80% thinning and sequence weighting for single-column features, and 50% thinning with NO weighting for paired features.
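A minimal greedy sketch of thinning, assuming percent identity is computed over positions where neither sequence has a gap (helper names are hypothetical, and a production thinner would likely be order-independent):

```python
def percent_identity(a, b):
    """Fraction of aligned, gap-free positions where the residues match."""
    matched = total = 0
    for x, y in zip(a, b):
        if x == '-' or y == '-':
            continue
        total += 1
        if x == y:
            matched += 1
    return matched / total if total else 0.0

def thin_msa(seqs, max_identity=0.5):
    """Greedily drop sequences until no retained pair exceeds max_identity."""
    kept = []
    for s in seqs:
        if all(percent_identity(s, t) <= max_identity for t in kept):
            kept.append(s)
    return kept
```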

SLIDE 7

Single-column Features

Distribution of residues in the column, regularized using mixtures of Dirichlet distributions. Entropy of the distribution. Predicted local features: a secondary structure alphabet (str2) with 13 classes, and a burial alphabet with 11 classes.

SLIDE 8

Inputs: Using Windows

For single columns we input feature values for positions i − 2, i − 1, i, i + 1, i + 2 and j − 2, j − 1, j, j + 1, j + 2. Tests indicated this window width was the best. The exception is entropy, which uses no window. (20 + 13 + 11) × 5 × 2 + 2 = 442 inputs so far!
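The arithmetic behind the input count can be checked directly; the constants mirror the alphabet sizes and window above:

```python
# Windowed single-column inputs: residue distribution (20 values),
# str2 secondary structure (13), and burial (11) per position;
# a window of 5 positions around each of i and j; plus two
# entropy inputs (one for column i, one for column j, no window).
ALPHABET_SIZES = {"residues": 20, "str2": 13, "burial": 11}
WINDOW = 5          # i-2 .. i+2
COLUMNS = 2         # a window around i and another around j
ENTROPY_INPUTS = 2  # entropy of column i and of column j

def count_inputs():
    per_position = sum(ALPHABET_SIZES.values())  # 44 values per position
    return per_position * WINDOW * COLUMNS + ENTROPY_INPUTS

# count_inputs() == 442
```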

SLIDE 9

Paired-columns Features

>2baa i j
SVSSIVSR AQFDRMLLHRNDGACQAKGFYTYDAFI
asaDISSLISQ DMFNEMLKHRNDGNCPGKGFYTYDAFI
avtAVASLVTSgGFFAEARWYGPGGKCSSVE-------A
dtiQANFVVSE AQFNQMFPNRNP-------FYTYQGLV

Yields pairs: DD, ND, NQ. No pairing with gaps. Paired-column features: contact propensity, E-values from mutual information, joint entropy, number of pairs between the two columns, and log(|i − j|).
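Extracting the residue pairs for a pair of columns, skipping any row where either column has a gap, might look like this (the example columns are chosen to reproduce the DD, ND, NQ pairs above):

```python
def column_pairs(col_i, col_j, gap='-'):
    """Residue pairs from two alignment columns, skipping rows with a gap."""
    return [(x, y) for x, y in zip(col_i, col_j)
            if x != gap and y != gap]
```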

SLIDE 10

Contact Propensity

The log-likelihood that two amino acids (e.g., A and L) are in contact: contact propensity is log(prob(contact(x, y)) / (prob(x) prob(y))). Contact propensity is largely due to hydrophobicity (M. Cline et al. '02); some very small part is due to other signals. We average the propensity over all sequences. Results show a significant increase in the signal.
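A sketch of the propensity formula, assuming a list of observed contacting residue pairs and background residue probabilities are given (ordered pairs, for simplicity):

```python
import math
from collections import Counter

def contact_propensity(contact_pairs, background):
    """log( prob(contact(x, y)) / (prob(x) * prob(y)) ) per residue pair.

    contact_pairs: (x, y) residue pairs observed in contact.
    background: dict mapping residue -> background probability prob(x).
    """
    counts = Counter(contact_pairs)
    total = sum(counts.values())
    return {
        (x, y): math.log((n / total) / (background[x] * background[y]))
        for (x, y), n in counts.items()
    }
```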

SLIDE 11

Correlated Mutations

When a residue in a protein structure mutates, there is a possibility that a nearby residue will also mutate in compensation, e.g. in beta bridges, sidechain-sidechain interactions, and functional regions. We can detect these correlated mutations with correlation statistics.

SLIDE 12

Mutual Information

MI_{i,j} = Σ_{k=1}^{T} p(r_{i,k}, r_{j,k}) log( p(r_{i,k}, r_{j,k}) / (p(r_{i,k}) p(r_{j,k})) )

where r_{i,k} is the residue in column i, pair k. Mutual information is a very weak predictor by itself. We can improve it by calculating an E-value over possible MI values.
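The MI of two columns can be computed from their empirical distributions; this sketch sums over the distinct residue pairs observed in the two columns:

```python
import math
from collections import Counter

def mutual_information(col_i, col_j):
    """Mutual information between two alignment columns, from the
    empirical joint and marginal residue distributions."""
    T = len(col_i)
    joint = Counter(zip(col_i, col_j))
    p_i, p_j = Counter(col_i), Counter(col_j)
    mi = 0.0
    for (x, y), n in joint.items():
        p_xy = n / T
        mi += p_xy * math.log(p_xy / ((p_i[x] / T) * (p_j[y] / T)))
    return mi
```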

SLIDE 13

Mutual Information E-value

Shuffle the residues in one column and calculate the mutual information value. Repeat 500 times, recording the MI values. Determine the parameters of a Gamma distribution by moment matching. Use that distribution and the original MI value to derive a p-value. Derive an E-value from the p-value.
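The procedure can be sketched as below, under two stated assumptions: the Gamma survival function is approximated by numerical integration (a real implementation would use an incomplete-gamma routine), and the E-value is the p-value scaled by a caller-supplied number of tests. All names are illustrative:

```python
import math
import random
from collections import Counter
from statistics import mean, pvariance

def column_mi(col_i, col_j):
    """Mutual information between two alignment columns."""
    T = len(col_i)
    joint, p_i, p_j = Counter(zip(col_i, col_j)), Counter(col_i), Counter(col_j)
    return sum((n / T) * math.log((n / T) / ((p_i[x] / T) * (p_j[y] / T)))
               for (x, y), n in joint.items())

def gamma_sf(x, shape, scale, steps=10_000):
    """P(X >= x) for Gamma(shape, scale), by midpoint-rule integration."""
    if x <= 0:
        return 1.0
    norm = math.gamma(shape) * scale ** shape
    dt = x / steps
    cdf = sum(((s + 0.5) * dt) ** (shape - 1)
              * math.exp(-((s + 0.5) * dt) / scale) * dt / norm
              for s in range(steps))
    return max(0.0, 1.0 - cdf)

def mi_evalue(col_i, col_j, n_shuffles=500, n_tests=1, seed=0):
    """Shuffle one column repeatedly, fit a Gamma to the shuffled MI values
    by moment matching, turn the observed MI into a p-value, then an E-value."""
    observed = column_mi(col_i, col_j)
    rng, shuffled, samples = random.Random(seed), list(col_j), []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        samples.append(column_mi(col_i, shuffled))
    mu, var = mean(samples), pvariance(samples)
    shape, scale = mu * mu / var, var / mu  # moment matching
    return gamma_sf(observed, shape, scale) * n_tests
```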

SLIDE 14

Joint Entropy

Ent_{i,j} = − Σ_{x∈R} Σ_{y∈R} (C^{i,j}_{x,y} / T) log( C^{i,j}_{x,y} / T )

where i and j are the indices of the pair of columns, R is the set of twenty residues, T is the number of valid residue pairs, and C^{i,j}_{x,y} is the count of amino-acid pairs (x, y) in columns i, j.
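In code, with the counts C built directly from the two columns and rows containing a gap excluded as invalid pairs, a sketch might be:

```python
import math
from collections import Counter

def joint_entropy(col_i, col_j, gap='-'):
    """Joint entropy of two alignment columns:
    -sum over residue pairs (x, y) of (C/T) * log(C/T),
    where C counts pair (x, y) and T is the number of valid (gap-free) pairs."""
    pairs = [(x, y) for x, y in zip(col_i, col_j) if x != gap and y != gap]
    T = len(pairs)
    counts = Counter(pairs)
    return -sum((c / T) * math.log(c / T) for c in counts.values())
```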

SLIDE 15

There are a LOT of Pairs

We track only the top (10 × length) values for each statistic. We sort each list by value to get a rank. We calculate Z-values using the mean and standard deviation over all pairs (i + separation ≤ j). We form a final set over the intersection of the lists, keeping each pair's value, Z-value, and rank.
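A sketch of the bookkeeping for one statistic, assuming it is given as a dict mapping a residue pair (i, j) to its value (the data structure and names are illustrative):

```python
from statistics import mean, pstdev

def rank_and_z(scores, length, top_factor=10):
    """Keep the top (top_factor * length) pairs by score; record each kept
    pair's raw value, rank (1 = best), and Z-value over all pairs."""
    values = scores.values()
    mu, sd = mean(values), pstdev(values)
    top = sorted(scores.items(), key=lambda kv: kv[1],
                 reverse=True)[: top_factor * length]
    return {
        pair: {"value": v, "rank": r, "z": (v - mu) / sd}
        for r, (pair, v) in enumerate(top, start=1)
    }
```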

SLIDE 16

Use Rank and/or Value for Inputs?

We experimented with using the ranks, values, and Z-values of the pair statistics. For contact propensity we use −log(rank). For MI E-values we use −log(rank) and the Z-value. For joint entropy we use −log(rank).

SLIDE 17

Misc. Input

Input number 449: log of sequence length.

SLIDE 18

Evaluation: Comparing Predictors

[Plot: true positives / predictions vs. log(predictions / residue), comparing the CASP7 neural net, the MI E-value + propensity NN, MI E-value alone, propensity alone, and raw mutual information.]

Results for separation >= 9.

SLIDE 19

CASP7 Results

For 0.1 predictions/residue and separation ≥ 12. We show two good results, T0321 and T0350, and a bad result, T0307. We compare accuracy to the difficulty of the target, using BLAST E-values and using Zhang Server GDT. We compare accuracy to the number of sequences in the MSA. We examine how confident we can be in the neural net output scores.

SLIDE 20

The Good: T0321

Thickness of side-chains represents neural net output.

SLIDE 21

The Good and Difficult: T0353

T0353 is from the free modeling class.

SLIDE 22

The Bad: T0307

SLIDE 23

Accuracy vs. Log BLAST E-value

[Plot: true positives / predictions vs. −log(top E-value in PDB).]

SLIDE 24

Accuracy vs. Zhang Server GDT

[Plot: true positives / predictions vs. GDT score of Zhang Server model 1.]

Pearson’s r: 0.45

SLIDE 25

Accuracy vs. # Seq. in MSA

[Plot: true positives / predictions vs. number of sequences in the MSA.]

SLIDE 26

Accuracy vs. Neural Net Score

[Plot: true positives / predictions vs. NN score at the mark = 0.1·L pairs.]

Pearson’s r: 0.58

SLIDE 27

Conclusion and Future Work

Conclusions: Contact predictions range from very poor to very good. Contact predictions may sometimes be useful. There is poor correlation between the neural net score and accuracy. Future work: improve calibration of the neural network score; investigate separate predictor(s) for small alignments; demonstrate the usefulness of contact predictions.

SLIDE 28

Thanks and Acknowledgments

Kevin Karplus (adviser). Richard Hughey (SAM software). Rachel Karchin (Johns Hopkins). Postdoc: Martin Madera. Graduates: Firas Khatib, Grant Thiltgen, Pinal Kanabar, Chris Wong, Zach Sanborn. Undergraduates: Cynthia Hsu, Silvia Do, Navya Swetha Davuluri, Crissan Harris. Last but not least, Anthony Fodor at Stanford for Java software. In memory of my father, John Cooper Shackelford.
