Protein Solubility Prediction Reese Lennarson Rex Richard Project - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Protein Solubility Prediction

Reese Lennarson Rex Richard

slide-2
SLIDE 2

Project Relevance

Recombinant DNA Technology: Insert gene of protein of interest into Escherichia coli accessory DNA

  • E. coli uses these new instructions from new DNA and becomes a reactor for the production of the protein of interest

Proteins not native to E. coli may be soluble or insoluble when expressed

Insoluble proteins form pellets that are difficult to recover and are not desired in production

Accurate predictions can save time performing experiments

slide-3
SLIDE 3

Project Objectives

Develop models that can predict whether a protein will be soluble or insoluble when expressed in Escherichia coli based on trends in parameters for collected proteins

Evaluate different methods for prediction and see which is best

Identify most important parameters for accurate prediction of solubility

slide-4
SLIDE 4

Protein Background: Amino Acids

Proteins composed of building blocks called amino acids

R groups responsible for protein folding and ultimately function

20 amino acids each with different R group

slide-5
SLIDE 5

Protein Background: Amino Acids (cont’d)

R groups characterized by H-bond character, charge, size, shape, hydrophobicity

Examples: Serine (hydrophilic), Valine (hydrophobic)

Sequence of amino acids’ R groups (primary structure) determines how protein folds

slide-6
SLIDE 6

Protein Background: Secondary Structure

Secondary structure (local 3-D structure) has three common motifs: α-helix, β-sheet, and turns

α-helix forms stabilizing H-bonds along adjacent coil strands

Secondary structure can be predicted fairly well with knowledge of amino acid sequence

[Figure: β-sheet and α-helix]

slide-7
SLIDE 7

Creating a Protein Database

226 proteins found in research for which solubility status on expression in E. coli is known at set conditions (37 °C, no chaperones or fusion partners)

Amino acid sequences catalogued for each found protein

17 parameters based on amino acid sequence and hypothesized to affect solubility calculated for each protein

slide-8
SLIDE 8

Protein Parameters

Parameters based on fraction of specific amino acids:

  • cysteine fraction
  • proline fraction
  • asparagine fraction
  • threonine fraction
  • tyrosine fraction
  • combined fraction of asn, thr, and tyr

Parameters based on protein-solvent interaction:

  • hydrophilicity index
  • hydrophobic residue fraction
  • average number of contiguous hydrophobic residues
  • aliphatic index
  • approximate charge average

slide-9
SLIDE 9

Protein Parameters (cont’d)

Parameters based on secondary structure:

  • alpha helix propensity
  • beta sheet propensity
  • alpha helix propensity/beta sheet propensity
  • turn-forming residue fraction

Parameters based on protein size:

  • molecular weight
  • total number of residues
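As a rough illustration of how the composition-based parameters could be computed from a one-letter amino acid sequence, here is a minimal Python sketch. The function name `sequence_parameters` is hypothetical, and the residue groupings for "turn-forming" residues and for the charge average are assumptions made for this example; the slides do not spell out those definitions.

```python
from collections import Counter

# Residue groupings below are assumptions for illustration only; the slides
# do not state which residues count as turn-forming or how charge is averaged.
TURN_FORMING = "NGPS"   # assumed: Asn, Gly, Pro, Ser
POSITIVE = "RK"         # assumed positively charged residues
NEGATIVE = "DE"         # assumed negatively charged residues


def sequence_parameters(seq: str) -> dict:
    """Compute a few composition-based parameters from a one-letter sequence."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)  # missing residues count as 0

    def fraction(residues: str) -> float:
        # Fraction of the sequence made up of any residue in `residues`.
        return sum(counts[r] for r in residues) / n

    return {
        "total_residues": n,
        "cysteine_fraction": fraction("C"),
        "proline_fraction": fraction("P"),
        "asparagine_fraction": fraction("N"),
        "threonine_fraction": fraction("T"),
        "tyrosine_fraction": fraction("Y"),
        "asn_thr_tyr_fraction": fraction("NTY"),
        "turn_forming_fraction": fraction(TURN_FORMING),
        "charge_average": fraction(POSITIVE) - fraction(NEGATIVE),
    }
```

Calling `sequence_parameters("NNTYCPKD")` returns one value per parameter for that eight-residue toy sequence; the remaining parameters (hydrophilicity index, aliphatic index, propensities) would need published per-residue scales that are not given in the slides.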

slide-10
SLIDE 10

Developing a Model that Can Predict Solubility

Three methods used for prediction: discriminant analysis, logistic regression, and neural network

Models look for parameter trends from protein to protein in the database

Each model develops an equation to predict solubility for new proteins

slide-11
SLIDE 11

Statistical Analyses

Discriminant Analysis (DA)

Used in all previous solubility studies

Logistic Regression (LR)

More commonly used than discriminant analysis in recent years

SAS (Statistical Analysis System) software used to build models for both methods

slide-12
SLIDE 12

Why investigate logistic regression?

LR fits our system better than DA!

LR more accurate when there are only 2 dichotomous groups in the dependent variable

LR more accurate than DA when independent (input) variables are continuous

DA must assume normal distribution of independent variables

LR handles unequal group sizes better than DA

LR can give us a more robust model to make future solubility predictions.

slide-13
SLIDE 13

2-D Representation of Statistical Models

[Figure legend: Soluble, Insoluble]

slide-14
SLIDE 14

2-D Representation of Statistical Models

slide-15
SLIDE 15

Discriminant Analysis

Used to model systems with categorical, rather than continuous, dependent (outcome) variables

Calculates canonical variable (CV) from parameters for each data point

$$CV = \sum_{i=1}^{n} \lambda_i x_i$$

n = number of parameters
x_i = value of parameter i
λ_i = adjustable coefficient of parameter i

slide-16
SLIDE 16

Discriminant Analysis, continued

DA optimizes λ values to achieve maximum distinction between groups

Value of discriminant found

Discriminant is the dividing line between groups for prediction of new data

  • CV > discriminant: data belongs to Group 1
  • CV < discriminant: data belongs to Group 2

$$CV = \sum_{i=1}^{n} \lambda_i x_i$$
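The classification rule above can be sketched in a few lines of Python. This is only an illustration of the CV formula and the threshold comparison: the function names are hypothetical, and the coefficients and discriminant value passed in would come from a fitted model (the slides do not report a discriminant value). The slides also do not say what happens when CV equals the discriminant, so the tie handling here is an assumption.

```python
from typing import Sequence


def canonical_variable(x: Sequence[float], lam: Sequence[float]) -> float:
    """CV = sum over i of lambda_i * x_i for one data point."""
    if len(x) != len(lam):
        raise ValueError("x and lam must have the same length")
    return sum(l * v for l, v in zip(lam, x))


def classify_da(x: Sequence[float], lam: Sequence[float], discriminant: float) -> str:
    """Assign Group 1 if CV > discriminant, otherwise Group 2.

    Assumption: a CV exactly equal to the discriminant falls into Group 2.
    """
    return "Group 1" if canonical_variable(x, lam) > discriminant else "Group 2"
```

For example, with made-up coefficients `lam = [2.0, -1.0]` and a made-up discriminant of `0.5`, the point `[1.0, 0.5]` has CV = 1.5 and is assigned to Group 1.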

slide-17
SLIDE 17

Logistic Regression

Similar in approach to DA, but it transforms the dependent variable via a logit function

$$\log\left(\frac{p_i}{1 - p_i}\right) = \alpha + \sum_{i=1}^{n} \beta_i x_i$$

where p_i = probability that data belongs to group 1 (soluble) and $\log\left(\frac{p_i}{1 - p_i}\right)$ = “logit” or “log-odds”

  • Maximum likelihood method used to determine α and β values
  • p_i ≥ 0.5: Soluble
  • p_i < 0.5: Insoluble
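Solving the logit equation for the probability gives p = 1 / (1 + exp(-(α + Σ β_i x_i))). The Python sketch below applies that inversion and the 0.5 cutoff from the slide. The function names are hypothetical and the α and β values are placeholders supplied by the caller; the slides report coefficient estimates but no intercept, so no fitted model is reproduced here.

```python
import math
from typing import Sequence


def soluble_probability(x: Sequence[float], alpha: float, beta: Sequence[float]) -> float:
    """Invert the logit: p = 1 / (1 + exp(-(alpha + sum(beta_i * x_i))))."""
    if len(x) != len(beta):
        raise ValueError("x and beta must have the same length")
    log_odds = alpha + sum(b * v for b, v in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-log_odds))


def classify_lr(x: Sequence[float], alpha: float, beta: Sequence[float]) -> str:
    """Soluble if p >= 0.5, otherwise Insoluble (the cutoff used on the slide)."""
    return "Soluble" if soluble_probability(x, alpha, beta) >= 0.5 else "Insoluble"
```

A log-odds of exactly zero corresponds to p = 0.5, which the slide's rule classifies as soluble.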

slide-18
SLIDE 18

Building a DA model in SAS

Step 1: Significant parameters determined with the STEPDISC procedure

  • Stepwise construction of model
  • Parameters evaluated one by one (F to enter, F to remove)
  • Parameters with lowest pr > F value (null-hypothesis test) included in model
  • Remaining parameters reevaluated; additional parameters included as necessary
  • Parameters may be excluded from the model at any step if their pr > F value rises above 0.05 (95% confidence)
  • Process continues until no more parameters can be added to or removed from model

slide-19
SLIDE 19

Building a DA model in SAS

slide-20
SLIDE 20

Building a DA model in SAS

Step 2: Coefficients determined with the CANDISC procedure

  • Provides raw and weighted coefficients for parameters

Step 3: Model evaluated with the DISCRIM procedure

  • Provides accuracy of predictions for insoluble proteins, soluble proteins, and overall database

slide-21
SLIDE 21

Building a LR Model in SAS

Model built in reverse-stepwise fashion

  • All parameters included at first, run with the LOGISTIC procedure
  • Parameter with highest null-hypothesis probability removed
  • Model run again, next parameter deleted
  • Process continues until remaining parameters have null-hypothesis probability ≤ 0.05 (95% confidence)
  • Intercept (α) and coefficient estimates (β) generated as output
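The reverse-stepwise procedure is essentially a loop, which the following Python sketch makes explicit. It is not SAS and does not fit a model itself: `fit_pvalues` is a hypothetical callable, supplied by the caller, that refits the model on the given parameter names and returns one null-hypothesis probability per parameter.

```python
from typing import Callable, Dict, List


def backward_eliminate(
    params: List[str],
    fit_pvalues: Callable[[List[str]], Dict[str, float]],
    threshold: float = 0.05,
) -> List[str]:
    """Drop the least significant parameter until all p-values are <= threshold."""
    kept = list(params)
    while kept:
        pvals = fit_pvalues(kept)                   # refit with the current set
        worst = max(kept, key=lambda name: pvals[name])
        if pvals[worst] <= threshold:               # everything left is significant
            break
        kept.remove(worst)                          # delete it and run again
    return kept
```

In a real run the p-values change after every refit, so `fit_pvalues` has to re-estimate the model each time it is called; a fixed lookup table is only useful for testing the loop itself.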
slide-22
SLIDE 22

Building a LR Model in SAS

slide-23
SLIDE 23

Evaluating the Models

Post hoc (training set) evaluations

  • All proteins used to build model
  • Same proteins plugged into model
  • Model solubility predictions compared to actual solubility of proteins
  • Result reported as percentage accuracy

A priori (test set) evaluations

  • Some proteins used to build model
  • Remaining proteins plugged into model
  • Provides more realistic evaluation of how well models will predict solubility for new proteins
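Both kinds of evaluation reduce to comparing predicted and actual labels; only the set of proteins differs. A minimal Python sketch of the three percentages reported throughout the results (overall, insoluble, soluble) might look like this. The function name and the 1 = soluble, 0 = insoluble encoding are assumptions for the example.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def class_accuracies(actual: Sequence[int], predicted: Sequence[int]) -> Dict[str, Optional[float]]:
    """Percentage of correct predictions overall and per class.

    Labels are assumed to be 1 for soluble and 0 for insoluble.
    A class with no members yields None instead of a percentage.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    pairs: List[Tuple[int, int]] = list(zip(actual, predicted))

    def percent_correct(subset: List[Tuple[int, int]]) -> Optional[float]:
        if not subset:
            return None
        return 100.0 * sum(a == p for a, p in subset) / len(subset)

    return {
        "overall": percent_correct(pairs),
        "insoluble": percent_correct([pair for pair in pairs if pair[0] == 0]),
        "soluble": percent_correct([pair for pair in pairs if pair[0] == 1]),
    }
```

For a post hoc evaluation the inputs are the proteins used to build the model; for an a priori evaluation they are the held-out proteins only.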

slide-24
SLIDE 24

Discriminant Analysis Results

Important parameters:

Previous research:

Wilkinson-Harrison: charge average, turn-forming residue fraction

Idicula-Thomas: aliphatic index, molecular weight, net charge

Current work:

Asparagine fraction, α-helix propensity

slide-25
SLIDE 25

Discriminant Analysis Results

Parameter coefficients:

| Parameter | Raw Coefficient | Standardized Coefficient |
|---|---|---|
| Asparagine Fraction | -31.02 | -0.64 |
| α-helix Propensity | 18.12 | 0.68 |

Post hoc accuracy:

| Overall | Insoluble | Soluble |
|---|---|---|
| 66.5% | 62.3% | 70.7% |

slide-26
SLIDE 26

Logistic Regression Results

Removal of parameters from model:

| Parameter | pr in Removal Step |
|---|---|
| Total Number of Residues | 0.858 |
| αβ Propensity Ratio | 0.839 |
| Aliphatic Index | 0.810 |
| β-sheet Propensity | 0.794 |
| Average # of Contiguous Hydrophobic Residues | 0.692 |
| Proline Fraction | 0.653 |
| Threonine Fraction | 0.628 |
| Combined Asn, Tyr, Thr Fraction | 0.628 |
| Turn-Forming Residue Fraction | 0.416 |
| α-helix Propensity | 0.398 |
| Cysteine Fraction | 0.155 |

slide-27
SLIDE 27

Logistic Regression Results

Parameters included in model:

| Parameter | pr | Relative Weight | Estimated Coefficient |
|---|---|---|---|
| Molecular Weight (kDa) | <0.0001 | 1.00 | -0.1693 |
| Total # of Hydrophobic Residues | <0.0001 | 0.95 | 0.0600 |
| Hydrophilicity Index | 0.0002 | 0.02 | 4.9629 |
| Approximate Charge Average | 0.0192 | 0.05 | -12.3538 |
| Asparagine Fraction | 0.0325 | 0.11 | -20.4259 |
| Tyrosine Fraction | 0.0511 | 0.07 | 15.1898 |

Post hoc accuracy:

| Overall | Insoluble | Soluble |
|---|---|---|
| 73.9% | 89.4% | 42.7% |

slide-28
SLIDE 28

Logistic Regression Model Accuracy over Prediction Ranges

(Post hoc analysis of entire database)

[Figure: Model Accuracy (% Correct Predictions) and Number of Proteins in Range, plotted against Solubility Prediction Range (%) in steps of 10 from 0 to 100]
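A chart of this kind can be produced by binning the predicted probabilities and scoring each bin separately. The Python sketch below shows one way to do that; the function name, the ten equal-width bins, and the 1 = soluble encoding are assumptions for illustration, not details taken from the slides.

```python
from typing import Dict, List, Optional, Sequence


def accuracy_by_range(
    probabilities: Sequence[float],
    actual: Sequence[int],
    n_bins: int = 10,
) -> List[Dict[str, Optional[float]]]:
    """Count proteins and percent-correct predictions per probability range.

    A protein is predicted soluble (1) when its probability is >= 0.5.
    """
    if len(probabilities) != len(actual):
        raise ValueError("inputs must have the same length")
    totals = [0] * n_bins
    correct = [0] * n_bins
    for p, label in zip(probabilities, actual):
        b = min(int(p * n_bins), n_bins - 1)   # p == 1.0 goes into the last bin
        predicted = 1 if p >= 0.5 else 0
        totals[b] += 1
        correct[b] += int(predicted == label)
    return [
        {
            "range_start_pct": 100.0 * b / n_bins,
            "count": totals[b],
            "accuracy_pct": 100.0 * correct[b] / totals[b] if totals[b] else None,
        }
        for b in range(n_bins)
    ]
```

Empty ranges report `None` for accuracy rather than zero, so they are not mistaken for ranges where every prediction was wrong.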

slide-29
SLIDE 29

LR A Priori Analysis

Database randomized eight times

Data split into training and test sets of the following ratios:

  • 80/20
  • 85/15
  • 90/10
  • 95/5

For each ratio, accuracies using the eight randomized data sets were averaged
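The randomize-and-split step can be sketched as follows in Python. The function name, the seeded generator, and the rounding of the cut point are assumptions made so that the example is reproducible; the slides do not say how the randomization was carried out.

```python
import random
from typing import List, Tuple


def random_splits(
    n_items: int,
    train_fraction: float,
    n_repeats: int = 8,
    seed: int = 0,
) -> List[Tuple[List[int], List[int]]]:
    """Return n_repeats shuffled (train_indices, test_indices) pairs."""
    rng = random.Random(seed)                  # seeded for reproducibility
    cut = round(n_items * train_fraction)      # size of the training set
    splits = []
    for _ in range(n_repeats):
        indices = list(range(n_items))
        rng.shuffle(indices)                   # one fresh randomization per repeat
        splits.append((indices[:cut], indices[cut:]))
    return splits
```

With 226 proteins and an 80/20 ratio this gives 181 training and 45 test proteins per split. Fitting a model on each training set, scoring it on the matching test set, and averaging the eight accuracies (for example with `statistics.mean`) gives one averaged value per ratio.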

slide-30
SLIDE 30

Logistic Regression Results

Accuracy averages for test sets:

| Test-Set Size (percent of overall database) | Test-Set Overall (%) | Test-Set Insoluble (%) | Test-Set Soluble (%) | Training-Set Overall (%) | Training-Set Insoluble (%) | Training-Set Soluble (%) |
|---|---|---|---|---|---|---|
| 5% | 88.6 | 100.0 | 25.3 | 72.4 | 87.1 | 43.7 |
| 10% | 78.7 | 98.5 | 17.0 | 74.3 | 88.1 | 45.2 |
| 15% | 78.7 | 98.5 | 19.5 | 73.1 | 86.7 | 47.2 |
| 20% | 76.1 | 98.1 | 21.7 | 72.9 | 87.1 | 45.9 |

slide-31
SLIDE 31

Statistical Analysis Summary

Discriminant analysis models overpredict solubility

Logistic regression models overpredict insolubility

LR models demonstrate better post hoc accuracy than DA models

LR models very accurate (>90%) for solubility probabilities nearing 0% and 100%

slide-32
SLIDE 32

Neural Network (NN) Theory

Neural networks essentially learn by decreasing error through iterations

For this project, a feedforward network is used with backpropagation

The most common neural network consists of one input layer, one hidden layer, and one output layer with two connection layers

slide-33
SLIDE 33

Feedforward NNs with Backpropagation

Data flows from input layer to hidden layer to output layer for each iteration (epoch); output given as a value between 0 and 1, where numbers higher than 0.5 are rounded up and numbers lower than 0.5 are rounded down

Error signal is calculated and sent back to first connection layer to update weights for next iteration

Each connection layer supplies weights (initially randomized) from input layer to hidden layer and from hidden layer to output layer

All data is normalized
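To make the forward pass and the weight update concrete, here is a small pure-Python sketch of a one-hidden-layer feedforward network trained by backpropagation. It is an illustration, not the software used in the project: the class name `TinyNet`, the sigmoid activation, the squared-error signal, the bias terms, and the initial weight range are all assumptions. The default step sizes (0.5 for the hidden layer, 0.05 for the output layer) follow the values listed under "Set Parameters" in this deck.

```python
import math
import random
from typing import List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


class TinyNet:
    """One hidden layer, one output node; inputs are assumed to be normalized."""

    def __init__(self, n_inputs: int, n_hidden: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        # Weights start out randomized; the last entry of each row is a bias.
        self.w_hidden = [
            [rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
            for _ in range(n_hidden)
        ]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x: List[float]) -> Tuple[List[float], float]:
        """Input layer -> hidden layer -> output value between 0 and 1."""
        xb = x + [1.0]
        hidden = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in self.w_hidden]
        output = sigmoid(sum(w * v for w, v in zip(self.w_out, hidden + [1.0])))
        return hidden, output

    def train_step(
        self,
        x: List[float],
        target: float,
        eta_hidden: float = 0.5,
        eta_out: float = 0.05,
    ) -> float:
        """One backpropagation update; returns the squared error before the update."""
        hidden, output = self.forward(x)
        delta_out = (target - output) * output * (1.0 - output)
        # Hidden-layer error signals use the output weights from before the update.
        delta_hidden = [
            delta_out * self.w_out[j] * hidden[j] * (1.0 - hidden[j])
            for j in range(len(hidden))
        ]
        for j, v in enumerate(hidden + [1.0]):
            self.w_out[j] += eta_out * delta_out * v
        for j, row in enumerate(self.w_hidden):
            for k, v in enumerate(x + [1.0]):
                row[k] += eta_hidden * delta_hidden[j] * v
        return (target - output) ** 2
```

Training repeats `train_step` over every protein for the chosen number of iterations. A prediction is then the second element of `net.forward(x)`, rounded at 0.5 as described above.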

slide-34
SLIDE 34

2-D Representation of Neural Network Models

[Figure legend: Soluble, Insoluble]

slide-35
SLIDE 35

2-D Representation of Neural Network Models

slide-36
SLIDE 36

Neural Network Data Analysis: Training/Test Set Randomization

First, eight randomized training/test set combinations with each in the ratio of 80%/20% were made

The randomized training/test set combo with the highest test set accuracy was chosen for the optimization of number of nodes

Set Parameters

  • Number of nodes: 4
  • Number of iterations: 25,000
  • Hidden Layer Step Size: 0.5
  • Output Layer Step Size: 0.05

slide-37
SLIDE 37

Training/Test Set Randomization Results

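| Random Set | Test Overall (%) | Test Insoluble (%) | Test Soluble (%) | Training Overall (%) | Training Insoluble (%) | Training Soluble (%) |
|---|---|---|---|---|---|---|
| 1 | 87 | 89 | 78 | 86 | 97 | 67 |
| 2 | 78 | 90 | 50 | 95 | 94 | 97 |
| 3 | 53 | 65 | 29 | 93 | 98 | 82 |
| 4 | 62 | 77 | 29 | 95 | 98 | 90 |
| 5 | 38 | 54 | 32 | 95 | 98 | 84 |
| 6 | 71 | 81 | 46 | 92 | 97 | 82 |
| 7 | 58 | 63 | 40 | 88 | 93 | 80 |
| 8 | 56 | 60 | 47 | 92 | 98 | 80 |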

slide-38
SLIDE 38

Neural Network Data Analysis: Node Optimization

Number of nodes varied from 3 to 9 using optimum training/test combo from before

Number of iterations and step sizes kept same as before

Number of nodes giving highest test set accuracy considered optimum

slide-39
SLIDE 39

Node Optimization Results

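| Number of Nodes | Test Overall (%) | Test Insoluble (%) | Test Soluble (%) | Training Overall (%) | Training Insoluble (%) | Training Soluble (%) |
|---|---|---|---|---|---|---|
| 3 | 78 | 84 | 65 | 89 | 91 | 84 |
| 4 | 87 | 89 | 78 | 86 | 97 | 67 |
| 5 | 74 | 84 | 55 | 91 | 96 | 83 |
| 6 | 74 | 82 | 60 | 97 | 98 | 95 |
| 7 | 72 | 79 | 60 | 97 | 99 | 94 |
| 8 | 71 | 76 | 60 | 98 | 99 | 95 |
| 9 | 66 | 74 | 50 | 97 | 99 | 94 |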
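The selection rule (pick the node count with the highest test-set accuracy) can be applied directly to the overall test accuracies reported on this slide. The short Python sketch below does that; the variable and function names are hypothetical.

```python
from typing import Dict

# Overall test-set accuracy (%) by number of hidden nodes, as reported on this slide.
test_accuracy_by_nodes: Dict[int, int] = {3: 78, 4: 87, 5: 74, 6: 74, 7: 72, 8: 71, 9: 66}


def best_node_count(accuracy_by_nodes: Dict[int, int]) -> int:
    """Return the node count with the highest test accuracy.

    On a tie, the first entry in insertion order wins.
    """
    return max(accuracy_by_nodes, key=accuracy_by_nodes.get)


optimum_nodes = best_node_count(test_accuracy_by_nodes)
```

With these values the rule selects 4 nodes, which is the setting with 87% overall test accuracy.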

slide-40
SLIDE 40

Neural Network Model Using All Proteins

Final model included all 226 proteins giving the following training accuracy.

| Training Accuracy (%) | Overall | Insoluble | Soluble |
|---|---|---|---|
| | 91 | 96 | 80 |

Almost 90% of outputs in this analysis fell in the ranges of 0-0.1 and 0.9-1

Can we get a better idea of what kind of accuracy one can expect when this model is used on new proteins?

slide-41
SLIDE 41

Neural Network Data Analysis: Varying the Training Set Size

Same procedure used for logistic regression

Seven new randomized training/test set combos added to the one used in node optimization

This was done for 80/20, 85/15, 90/10, and 95/5 ratios

slide-42
SLIDE 42

Results of Varying the Test Set Size

| % Training Set Proteins/% Test Set Proteins | Test Overall (%) | Test Insoluble (%) | Test Soluble (%) | Training Overall (%) | Training Insoluble (%) | Training Soluble (%) |
|---|---|---|---|---|---|---|
| 80/20 | 63 | 72 | 44 | 92 | 96 | 83 |
| 85/15 | 69 | 76 | 54 | 92 | 95 | 86 |
| 90/10 | 66 | 72 | 54 | 92 | 96 | 84 |
| 95/5 | 80 | 77 | 82 | 91 | 92 | 89 |

Trend indicates that, on average, prediction accuracy on new proteins will be worse (possibly by 15 to 25%) than the training accuracy given post hoc

Also indicates that predictions for soluble and insoluble proteins are fairly well-balanced

slide-43
SLIDE 43

Variation of Accuracy with Output

Neural Network Model Accuracy over Prediction Ranges

[Figure: % of Protein in Range and Overall Classification Accuracy (%), plotted against Solubility Output Range in steps of 0.1 from 0 to 1]

slide-44
SLIDE 44

Evaluating the Most Important Parameters

The higher a parameter’s weight, the higher the significance

Asparagine, Tyrosine, and Total Hydrophobic Residues Most Important

[Figure: Parameter Contribution, Averaged over 4 Nodes; Weight plotted for Parameters 1 to 17]

Parameter key:

  • 1 Total Residues
  • 2 Molecular Weight (Da)
  • 3 Cysteine Fraction
  • 4 Proline Fraction
  • 5 Turn-Forming Residue Fraction
  • 6 Hydrophilicity Index
  • 7 Approximate Charge Average
  • 8 Total Hydrophobic Residues
  • 9 Average Number of Contiguous Hydrophobic Residues
  • 10 Aliphatic Index
  • 11 Alpha Helix Propensity
  • 12 Beta Sheet Propensity
  • 13 Alpha Helix Propensity/Beta Sheet Propensity
  • 14 Asparagine Fraction
  • 15 Threonine Fraction
  • 16 Tyrosine Fraction
  • 17 Combined Fraction of Asparagine, Threonine, and Tyrosine

slide-45
SLIDE 45

Comparing the Methods

| Method | Post hoc accuracy (for entire database) |
|---|---|
| Discriminant Analysis | 66.5% |
| Logistic Regression | 73.9% |
| Neural Networks | 91.0% |

slide-46
SLIDE 46

Comparing the Methods

| Method | A priori accuracy (10% of database for testing) |
|---|---|
| Logistic Regression | ≥78.7% |
| Neural Networks | ≥66.0% |

slide-47
SLIDE 47

Model Trends

Neural network has the highest post hoc accuracy, while logistic regression has the highest accuracy when predicting new proteins

Logistic regression model very accurate for high and low probability post hoc predictions

Neural network better than statistical methods at classifying soluble proteins correctly

slide-48
SLIDE 48

Comparing Three Methods

Asparagine common to NN and DA; Hydrophobic residues common to NN and LR

Asparagine only parameter found significant in all three models

Prediction of solubility from amino acid sequence and primary structure extremely difficult

Secondary structure data would be very useful, but information is limited

Neural networks represent the most promising method for solubility prediction with the available data

slide-49
SLIDE 49

Recommendations for Further Study

Examine other parameters

  • Secondary structure
  • Second virial coefficient

Investigate parameter interactions

Utilize all models in concert

Incorporate more proteins from other host organisms
slide-50
SLIDE 50

Acknowledgements

  • Dr. Miguel Bagajewicz
  • Dr. Roger Harrison
  • Armando Diaz
  • Zehra Tosun