Proteomics Informatics – Protein identification I: searching protein sequence collections and significance testing (Week 4)
Peptide Mapping - Mass Accuracy
Peptide Mapping - Database Size
- Human
- C. elegans
- S. cerevisiae
Peptide Mapping - Cys-Containing Peptides
- Human
- C. elegans
- S. cerevisiae
MS Identification – Peptide Mass Fingerprinting
Workflow:
- Input: MS data and a Sequence DB
- Pick Protein → Digestion → All Peptide Masses
- Compare (with the MS data), Score, Test Significance
- Repeat for each protein
- Output: Identified Proteins
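The workflow above can be sketched in a few lines of Python. This is a minimal illustration, not the scoring used by ProFound or Mascot: it assumes cleavage after every K or R, neutral monoisotopic peptide masses, and a plain count of matching masses as the score. The function names, the 0.5 Da tolerance, and the toy database in the usage note are hypothetical.

```python
import re

# Monoisotopic residue masses (Da), as listed in the amino acid mass table of this lecture.
RESIDUE_MASS = {
    "A": 71.0371, "R": 156.101, "N": 114.043, "D": 115.027, "C": 103.009,
    "E": 129.043, "Q": 128.059, "G": 57.0215, "H": 137.059, "I": 113.084,
    "L": 113.084, "K": 128.095, "M": 131.04, "F": 147.068, "P": 97.0528,
    "S": 87.032, "T": 101.048, "W": 186.079, "Y": 163.063, "V": 99.0684,
}
WATER = 18.0106  # a free peptide is the sum of its residues plus one water


def tryptic_peptides(sequence, missed_cleavages=0):
    """Cleave after every K or R (simplification: the proline rule is ignored)."""
    pieces = re.findall(r".*?[KR]|.+$", sequence)
    peptides = []
    for start in range(len(pieces)):
        stop = min(start + missed_cleavages + 1, len(pieces))
        for end in range(start + 1, stop + 1):
            peptides.append("".join(pieces[start:end]))
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of a peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def count_matches(measured, sequence, tolerance=0.5, missed_cleavages=0):
    """Number of measured masses that fall within the tolerance of any theoretical peptide."""
    theoretical = [peptide_mass(p) for p in tryptic_peptides(sequence, missed_cleavages)]
    return sum(any(abs(m - t) <= tolerance for t in theoretical) for m in measured)


def rank_proteins(measured, database, **kwargs):
    """Score every protein in the database and sort by the number of matches."""
    scores = {name: count_matches(measured, seq, **kwargs) for name, seq in database.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

For example, `rank_proteins([1165.55, 316.19], {"P1": "SGFLEEDELKAAR", "P2": "GGGGKAAAAR"})` places the hypothetical entry `P1` first, because both measured masses match its tryptic peptides. A match count alone says nothing about significance, which is why the workflow also includes a significance test.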
ProFound – Search Parameters
http://prowl.rockefeller.edu/
ProFound – Protein Identification by Peptide Mapping
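$$
P(k \mid D I) \;\propto\; P(k \mid I)\,\frac{(N-r)!}{N!}\,\prod_{i=1}^{r}\left\{\sqrt{\frac{2}{\pi}}\;\frac{m_{\max}-m_{\min}}{\sigma_i}\;\sum_{j=1}^{g_i}\exp\!\left[-\frac{(m_i-m_{ij0})^2}{2\sigma_i^2}\right]\right\}F_{\mathrm{pattern}}
$$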
- W. Zhang & B.T. Chait,
Analytical Chemistry 72 (2000) 2482-2489
ProFound Results
Peptide Mapping – Mass Accuracy
[Figure: ProFound, -log(e) vs. mass tolerance (Da); Mascot, score vs. mass tolerance (Da); mass tolerance axis from 0.5 to 2 Da]
Peptide Mapping - Database Size
[Figure: panels for S. cerevisiae, Fungi, and All Taxa]
Expectation values, peptide mapping example, by database size:
- S. cerevisiae: 4.8e-7
- Fungi: 8.4e-6
- All Taxa: 2.9e-4
Missed Cleavage Sites
[Figure: panels for u = 1, u = 2, and u = 4]
Expectation values, peptide mapping example, by missed cleavage setting u:
- u = 1: 4.8e-7
- u = 2: 1.1e-5
- u = 4: 6.8e-4
Peptide Mapping - Partial Modifications
[Figure: panels for No Modifications and Phosphorylation (S, T, or Y)]
Protein     Searched without possible modifications    Searched with phosphorylation of S/T/Y
DARPP-32    0.00006                                    0.01
CFTR        0.00002                                    0.005
Even if the protein is modified it is usually better to search a protein sequence database without specifying possible modifications using peptide mapping data.
Peptide Mapping - Ranking by Direct Calculation of the Significance
General Criteria for a Good Protein Identification Algorithm
- The response to random input data should be random.
- Maximum number of correct identifications and minimum number of incorrect identifications for any data set.
- Maximal separation between scores for correct identifications and the distribution of scores for random matching proteins for any data set.
- The statistical significance of the results should be calculated.
- The searches should be fast.
Response to Random Data
[Figure: y-axis, Normalized Frequency]
Peptide Fragmentation
Ion Source → Mass Analyzer 1 → Fragmentation → Mass Analyzer 2 → Detector
Fragment ion types: b, y
Identification – Tandem MS
[Figure: MS/MS spectrum, % relative abundance vs. m/z (100 to 1000)]
Tandem MS – Sequence Confirmation
Peptide residues, in the order shown on the slide: K L E D E E L F G S
Residue    y ions (m/z)    b ions (m/z)
K          147             1166
L          260             1020
E          389             907
D          504             778
E          633             663
E          762             534
L          875             405
F          1022            292
G          1080            145
S          1166            88
[Figure: MS/MS spectrum, % relative abundance vs. m/z (100 to 1000); labeled peaks: [M+2H]2+, 260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, 1080; peak spacings of 113 and 129 are highlighted]
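The b and y ion values shown for this peptide can be recomputed from residue masses. The sketch below assumes the peptide read from N-terminus to C-terminus is SGFLEEDELK (the slide lists the residues in the opposite order), singly charged fragments, and monoisotopic masses; the function names are hypothetical.

```python
# Monoisotopic residue masses (Da) for the residues that occur in this peptide.
RESIDUE_MASS = {
    "S": 87.032, "G": 57.0215, "F": 147.068, "L": 113.084,
    "E": 129.043, "D": 115.027, "K": 128.095,
}
PROTON = 1.00728
WATER = 18.0106


def b_ions(peptide):
    """Singly charged b ions: N-terminal fragments, residue sum plus one proton."""
    total = PROTON
    masses = []
    for aa in peptide[:-1]:
        total += RESIDUE_MASS[aa]
        masses.append(total)
    return masses


def y_ions(peptide):
    """Singly charged y ions: C-terminal fragments, residue sum plus water plus one proton."""
    total = WATER + PROTON
    masses = []
    for aa in reversed(peptide[1:]):
        total += RESIDUE_MASS[aa]
        masses.append(total)
    return masses


def precursor_mh(peptide):
    """Singly protonated peptide mass, M+H."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON


peptide = "SGFLEEDELK"  # assumption: N- to C-terminal reading of the slide's residue list
nominal_b = [int(m) for m in b_ions(peptide)]
nominal_y = [int(m) for m in y_ions(peptide)]
```

Taking the integer part of each mass gives the nominal values used on the slides, for example 88, 145, 292 for the first b ions and 147, 260, 389 for the first y ions, with M+H at 1166. The ninth y ion comes out at about 1079.5, which explains why it appears as 1080 in one place and 1079 in another.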
Tandem MS – de novo Sequencing
[Figure: MS/MS spectrum, % relative abundance vs. m/z (100 to 1000); labeled peaks: [M+2H]2+, 260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, 1080]
Mass Differences
Amino acid masses
1-letter code    3-letter code    Chemical formula    Monoisotopic    Average
A                Ala              C3H5ON              71.0371         71.0788
R                Arg              C6H12ON4            156.101         156.188
N                Asn              C4H6O2N2            114.043         114.104
D                Asp              C4H5O3N             115.027         115.089
C                Cys              C3H5ONS             103.009         103.139
E                Glu              C5H7O3N             129.043         129.116
Q                Gln              C5H8O2N2            128.059         128.131
G                Gly              C2H3ON              57.0215         57.0519
H                His              C6H7ON3             137.059         137.141
I                Ile              C6H11ON             113.084         113.159
L                Leu              C6H11ON             113.084         113.159
K                Lys              C6H12ON2            128.095         128.174
M                Met              C5H9ONS             131.04          131.193
F                Phe              C9H9ON              147.068         147.177
P                Pro              C5H7ON              97.0528         97.1167
S                Ser              C3H5O2N             87.032          87.0782
T                Thr              C4H7O2N             101.048         101.105
W                Trp              C11H10ON2           186.079         186.213
Y                Tyr              C9H9O2N             163.063         163.176
V                Val              C5H9ON              99.0684         99.1326
Sequences consistent with spectrum
Tandem MS – de novo Sequencing
Peaks (m/z): 260 292 389 405 504 534 633 663 762 778 875 907 1020 1022 1079
Mass differences between each peak and every higher peak; differences that match an amino acid residue mass are labeled:
260: 32, 129 (E), 145, 244, 274, 373, 403, 502, 518, 615, 647, 760, 762, 819
292: 97 (P), 113 (I/L), 212, 242, 341, 371, 470, 486, 583, 615, 728, 730, 787
389: 16, 115 (D), 145, 244, 274, 373, 389, 486, 518, 631, 633, 690
405: 99 (V), 129 (E), 228, 258, 357, 373, 470, 502, 615, 617, 674
504: 30, 129 (E), 159, 258, 274, 371, 403, 516, 518, 575
534: 99 (V), 129 (E), 228, 244, 341, 373, 486, 488, 545
633: 30, 129 (E), 145, 242, 274, 387, 389, 446
663: 99 (V), 115 (D), 212, 244, 357, 359, 416
762: 16, 113 (I/L), 145, 258, 260, 317
778: 97 (P), 129 (E), 242, 244, 301
875: 32, 145, 147 (F), 204
907: 113 (I/L), 115 (D), 172
1020: 2, 59
1022: 57 (G)
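The mass difference table can be generated programmatically. The following sketch assumes nominal (integer) peak positions and nominal residue masses, with I/L and K/Q treated as indistinguishable; the names `NOMINAL_RESIDUE` and `residue_steps` are hypothetical.

```python
from itertools import combinations

# Nominal residue masses; I/L and K/Q cannot be told apart at this resolution.
NOMINAL_RESIDUE = {
    57: "G", 71: "A", 87: "S", 97: "P", 99: "V", 101: "T", 103: "C",
    113: "I/L", 114: "N", 115: "D", 128: "K/Q", 129: "E", 131: "M",
    137: "H", 147: "F", 156: "R", 163: "Y", 186: "W",
}

# Peak list from the example spectrum.
PEAKS = [260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, 1079]


def residue_steps(peaks):
    """Return (lower peak, higher peak, residue) for every pair whose
    mass difference equals a nominal residue mass."""
    steps = []
    for low, high in combinations(sorted(peaks), 2):
        residue = NOMINAL_RESIDUE.get(high - low)
        if residue is not None:
            steps.append((low, high, residue))
    return steps


steps = residue_steps(PEAKS)
for low, high, residue in steps:
    print(f"{low} -> {high}: {residue}")
```

Chaining steps that share a peak gives sequence ladders, for example 1022 → 1079 (G) preceded by 875 → 1022 (F). Several steps, such as the P and V assignments, are consistent with a residue mass but link peaks from different ion series, which is one reason de novo interpretation yields more than one candidate sequence.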
Tandem MS – de novo Sequencing
Candidate partial sequences:
- …GF(I/L)EEDE(I/L)…
- …(I/L)EDEE(I/L)FG…
Peptide M+H = 1166
- 1166 - 1079 = 87 ⇒ S: SGF(I/L)EEDE(I/L)…
- 1166 - 1020 - 18 = 128 ⇒ K or Q: SGF(I/L)EEDE(I/L)(K/Q)
Tandem MS – de novo Sequencing
Challenges in de novo sequencing
- Neutral loss (-H2O, -NH3)
- Modifications
- Background peaks
- Incomplete information
Tandem MS – Database Search
Lysis → Fractionation → Digestion → LC-MS → MS/MS
Workflow:
- Input: MS/MS data and a Sequence DB
- Pick Protein → Pick Peptide → All Fragment Masses
- Compare, Score, Test Significance
- Repeat for all peptides
- Repeat for all proteins
Algorithms
Comparing and Optimizing Algorithms
[Figure: for Algorithm 1 and Algorithm 2, score distributions of true and false identifications, and the corresponding curves of sensitivity vs. 1-specificity]
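A comparison of sensitivity against 1-specificity can be computed directly from two lists of scores, one for known true identifications and one for known false ones. The sketch below is a generic illustration under that assumption; the function names and the use of the area under the curve as a single summary number are not taken from the slides.

```python
def roc_points(true_scores, false_scores):
    """Sweep a score threshold from high to low and return
    (1 - specificity, sensitivity) pairs, starting at (0, 0)."""
    thresholds = sorted(set(true_scores) | set(false_scores), reverse=True)
    points = [(0.0, 0.0)]
    for threshold in thresholds:
        sensitivity = sum(s >= threshold for s in true_scores) / len(true_scores)
        false_positive_rate = sum(s >= threshold for s in false_scores) / len(false_scores)
        points.append((false_positive_rate, sensitivity))
    return points


def area_under_curve(points):
    """Trapezoidal area under the curve; 1.0 is perfect separation, 0.5 is chance."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

An algorithm whose true and false score distributions do not overlap gives an area of 1.0, while identical distributions give 0.5, so the better of two algorithms on the same data set is the one whose curve lies closer to the upper left corner.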
MS/MS - Parent Mass Error and Enzyme Specificity
$$
x_{II} = x_{I}\,\left(n_b!\; n_y!\right)
$$
Expectation values, MS/MS example:
- ∆m = 2, Trypsin: 2.5e-5
- ∆m = 100, Trypsin: 2.5e-5
- ∆m = 2, non-specific: 7.9e-5
- ∆m = 100, non-specific: 1.6e-4
Sequest
Cross-correlation
X! Tandem - Search Parameters
http://www.thegpm.org/
Conventional, single-stage searching
- Generic search engine
- Test all cleavages, modifications, & mutations for all sequences
[Figure: spectra compared against sequences]
Some hard problems in MS/MS analysis in proteomics
Determining potential modifications
- e.g., oxidation, phosphorylation, deamidation
- calculation order 2^n
- NP complete
Allowing for unanticipated peptide cleavages
- e.g., chymotryptic contamination in trypsin
- calculation order ~ 200 × tryptic cleavage
- “unfortunate” coefficient
Detecting point mutations
- e.g., sequence homology
- calculation order 18^N
- NP complete
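The exponential growth for potential modifications is easy to see by enumeration: with n modifiable sites, each either modified or not, there are 2^n candidate forms of a peptide. The sketch below uses phosphorylation of S, T, or Y as the example; the function names are hypothetical, and 79.9663 Da is the monoisotopic mass shift of one phosphate group (HPO3).

```python
from itertools import product

PHOSPHO_SHIFT = 79.9663  # monoisotopic mass added by one phosphorylation (HPO3)


def modification_states(peptide, sites="STY"):
    """Yield every combination of modified positions (0-based) in the peptide."""
    positions = [i for i, aa in enumerate(peptide) if aa in sites]
    for flags in product((False, True), repeat=len(positions)):
        yield tuple(pos for pos, modified in zip(positions, flags) if modified)


def mass_shifts(peptide, sites="STY"):
    """Total mass shift for every modification state."""
    return [len(state) * PHOSPHO_SHIFT for state in modification_states(peptide, sites)]
```

A peptide with 3 candidate sites has 8 states to test, one with 10 sites has 1024, and every additional variable modification type multiplies the count again, which is the motivation for restricting modifications to a later search stage.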
Multi-stage searching
- Tryptic cleavage → Modifications #1 → Modifications #2 → Point mutation
[Figure: sequences and spectra]
X! Tandem
Search Results
Sequence Annotations
Identification – Spectrum Library Search
Lysis → Fractionation → Digestion → LC-MS/MS
Workflow:
- Input: MS/MS data and a Spectrum Library
- Pick Spectrum
- Compare, Score, Test Significance
- Repeat for all spectra
- Output: Identified Proteins
Steps in making an Annotated Spectrum Library (ASL):
- 1. Find the best 10 spectra for a particular sequence, with the same PTMs and charge.
- 2. Add the spectra together and normalize the intensity values.
- 3. Assign a “quality” value: the median expectation value of the 10 spectra used.
- 4. Record the 20 most intense peaks in the averaged spectrum, its parent ion z, m/z, sequence, protein accessions & quality.
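The four steps for building a library entry can be sketched as follows. Several details are assumptions, since the slide does not specify them: spectra are dictionaries of binned m/z to intensity, "best" means lowest expectation value, and normalization scales the most intense summed peak to 1. The function name and the returned field names are hypothetical.

```python
from statistics import median


def build_library_entry(scored_spectra, n_spectra=10, n_peaks=20):
    """scored_spectra: list of (spectrum, expectation_value) pairs for one
    sequence, with the same PTMs and charge. spectrum maps m/z bin -> intensity."""
    # Step 1: keep the best spectra (assumption: lowest expectation values).
    best = sorted(scored_spectra, key=lambda pair: pair[1])[:n_spectra]

    # Step 2: add the spectra together and normalize the intensities.
    summed = {}
    for spectrum, _ in best:
        for mz, intensity in spectrum.items():
            summed[mz] = summed.get(mz, 0.0) + intensity
    base_peak = max(summed.values())
    normalized = {mz: intensity / base_peak for mz, intensity in summed.items()}

    # Step 3: quality is the median expectation value of the spectra used.
    quality = median(value for _, value in best)

    # Step 4: record only the most intense peaks, sorted by m/z.
    top = sorted(normalized.items(), key=lambda item: item[1], reverse=True)[:n_peaks]
    return {"peaks": sorted(top), "quality": quality}
```

A full entry would also carry the parent ion charge and m/z, the sequence, and the protein accessions named in step 4; they are omitted here to keep the sketch short.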
Spectrum Library Characteristics – Peptide Length
[Figure: fraction of library (%) vs. peptide length]
Spectrum Library Characteristics – Protein Coverage
[Figure: % coverage (residues, peptides) vs. protein Mr (kDa)]
Library spectrum (5:25) and test spectrum (5:25)
Results: 4 peaks selected, 1 peak missed
Identification – Spectrum Library Search
Apply a hypergeometric probability model:
- 25 possible m/z values;
- 5 peaks in the library spectrum; and
- 4 selected by the test spectrum.
How likely is this?
Matches    Probability
1          0.45
2          0.15
3          0.016
4          0.00039
5          0.0000037
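The hypergeometric model can be evaluated in a few lines. The sketch below treats the 25 possible m/z values as the population, the 5 library peaks as the "successes", and the 4 test peaks as the draws; the function name is hypothetical.

```python
from math import comb


def match_probability(matches, possible, library_peaks, test_peaks):
    """Probability that exactly `matches` of the test peaks coincide with
    library peaks when the test peaks are placed at random."""
    return (
        comb(library_peaks, matches)
        * comb(possible - library_peaks, test_peaks - matches)
        / comb(possible, test_peaks)
    )


for k in range(1, 5):
    print(k, match_probability(k, possible=25, library_peaks=5, test_peaks=4))
```

With these parameters the probabilities for 1 to 4 matches come out at roughly 0.45, 0.15, 0.016, and 0.0004, so matching all 4 selected peaks by chance is unlikely even in this very small example.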
Identification – Spectrum Library Search
If you have 1000 possible m/z values and 20 peaks in both the test and library spectra?
[Figure: p (log scale, 1.0E-14 to 1.0E+00) vs. number of matches (1 to 10)]
- 1 matched: p = 0.6
- 5 matched: p = 0.0002
- 10 matched: p = 0.0000000000001
Identification – Spectrum Library Search
[Figure: an experimental mass spectrum (m/z) is compared with a library of assigned mass spectra to give the best search result]
Identification – Spectrum Library Search
X! Hunter
X! Hunter algorithm:
- 1. Use dot product to find a library spectrum that best matches a test spectrum.
- 2. Calculate p-value with hypergeometric distribution.
- 3. Use p-value to calculate expectation value, given the identification parameters.
- 4. If expectation value is less than the median expectation value of the library spectrum, report the median value.
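Step 1 of the algorithm, picking the library spectrum with the highest dot product, can be sketched as below. This is an illustration only and not X! Hunter's code: it assumes spectra stored as dictionaries of binned m/z to intensity and a dot product normalized by the vector lengths, and the function names are hypothetical.

```python
from math import sqrt


def normalized_dot(spectrum_a, spectrum_b):
    """Cosine-style similarity between two binned spectra (0 to 1)."""
    shared = set(spectrum_a) & set(spectrum_b)
    numerator = sum(spectrum_a[mz] * spectrum_b[mz] for mz in shared)
    norm_a = sqrt(sum(v * v for v in spectrum_a.values()))
    norm_b = sqrt(sum(v * v for v in spectrum_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return numerator / (norm_a * norm_b)


def best_library_match(test_spectrum, library):
    """library maps an entry name to its spectrum; return the best-scoring name."""
    return max(library, key=lambda name: normalized_dot(test_spectrum, library[name]))
```

The number of peaks shared between the test spectrum and the chosen library spectrum then feeds the hypergeometric p-value in step 2.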
X! Hunter Result
[Figure: query spectrum compared with library spectrum]
Significance Testing
- False protein identification is caused by random matching.
- An objective criterion for testing the significance of protein identification results is necessary.
- The significance of protein identifications can be tested once the distribution of scores for false results is known.
Significance Testing - Expectation Values
- The majority of sequences in a collection will give a score due to random matching.
Workflow:
- M/Z spectrum → Database Search → List of Candidates
- Distribution of Scores for Random and False Identifications → Extrapolate and Calculate Expectation Values
- Output: List of Candidates With Expectation Values
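The extrapolation step can be sketched as follows. The sketch assumes that the logarithm of the number of random matches scoring at or above a given score falls off linearly in the high-score tail, fits a straight line to that tail, and extrapolates it to the score of a candidate. The choice of the upper half of the distinct scores as "the tail" and the function name are assumptions made for illustration.

```python
from math import log10


def expectation_value(score, random_scores):
    """Expected number of random matches with a score >= `score`,
    extrapolated from the tail of the random score distribution."""
    distinct = sorted(set(random_scores))
    # log10 of the survival count: how many random scores are >= each value
    points = [(x, log10(sum(s >= x for s in random_scores))) for x in distinct]
    tail = points[len(points) // 2:]  # assumption: fit the upper half only
    if len(tail) < 2:
        raise ValueError("need at least two distinct scores in the tail")

    # Ordinary least-squares line through the tail points.
    mean_x = sum(x for x, _ in tail) / len(tail)
    mean_y = sum(y for _, y in tail) / len(tail)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in tail) / sum(
        (x - mean_x) ** 2 for x, _ in tail
    )
    intercept = mean_y - slope * mean_x
    return 10 ** (intercept + slope * score)
```

A candidate whose score lies far beyond the bulk of the random scores gets a small expectation value, meaning that few or no random matches would be expected to score that well.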
Significance Testing - Expectation Values