Proteomics Informatics
Protein identification I: searching protein sequence collections and significance testing (Week 4)
Peptide Mapping - Mass Accuracy
Peptide Mapping - Database Size
- Human
- C. elegans
- S. cerevisiae
Peptide Mapping - Cys-Containing Peptides
- Human
- C. elegans
- S. cerevisiae
MS Identification – Peptide Mass Fingerprinting
Workflow:
- Sample: Digestion → MS
- Sequence DB: Pick Protein → All Peptide Masses
- Compare, Score, Test Significance
- Repeat for each protein
- Output: Identified Proteins
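The workflow above can be sketched in a few lines of Python. This is a minimal illustration, not the scoring used by ProFound or any other search engine: it assumes tryptic cleavage with no missed cleavages, uses standard monoisotopic residue masses, and scores a protein by simply counting measured masses that match a theoretical peptide mass. The function names and the toy database in the usage note are hypothetical.

```python
import re

# Standard monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_peptides(protein):
    """Cleave after K or R, but not before P (assumption: no missed cleavages)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def peptide_mass(peptide):
    """Neutral monoisotopic mass: sum of residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def count_matches(measured, protein, tol=0.5):
    """Number of measured masses within tol (Da) of any theoretical peptide mass."""
    theoretical = [peptide_mass(p) for p in tryptic_peptides(protein)]
    return sum(any(abs(m - t) <= tol for t in theoretical) for m in measured)


def rank_proteins(measured, database, tol=0.5):
    """Repeat the comparison for each protein and sort by match count."""
    scores = {name: count_matches(measured, seq, tol) for name, seq in database.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

For example, `rank_proteins([1165.55, 387.22], {"protA": "SGFLEEDELKAAAR", "protB": "MKTAYIAKQR"})` ranks the made-up entry `protA` first. A match count alone says nothing about significance, which is why the workflow includes a significance test.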
ProFound Results
Database size
Mixtures
Peptide Fragmentation
Instrument: Ion Source → Mass Analyzer 1 → Fragmentation → Mass Analyzer 2 → Detector
Fragment ion types: b, y
Identification – Tandem MS
[Figure: tandem mass spectrum; x-axis: m/z (250 to 1000), y-axis: % Relative Abundance (up to 100)]
Tandem MS – Sequence Confirmation
Peptide as drawn on the slide: K L E D E E L F G S. Residues are listed from the C-terminal K (y ion 147) to the N-terminal S (b ion 88).

| Residue | y ion (m/z) | b ion (m/z) |
|---|---|---|
| K | 147 | 1166 |
| L | 260 | 1020 |
| E | 389 | 907 |
| D | 504 | 778 |
| E | 633 | 663 |
| E | 762 | 534 |
| L | 875 | 405 |
| F | 1022 | 292 |
| G | 1080 | 145 |
| S | 1166 | 88 |

[Figure: tandem mass spectrum; x-axis: m/z (250 to 1000), y-axis: % Relative Abundance (up to 100). Labeled peaks: [M+2H]2+, 260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, 1080. The slide builds assign the peaks one series at a time and annotate mass differences between adjacent peaks of a series, e.g. 113 (L) and 129 (E).]
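The b and y ion values on these slides can be reproduced from residue masses. The sketch below assumes singly charged fragments, standard monoisotopic residue masses, and the peptide read from the N-terminus as SGFLEEDELK (the reverse of the order drawn on the slide). The function names are illustrative.

```python
# Standard monoisotopic residue masses in Da (only the residues needed here).
RESIDUE_MASS = {
    "S": 87.03203, "G": 57.02146, "F": 147.06841, "L": 113.08406,
    "E": 129.04259, "D": 115.02694, "K": 128.09496,
}
WATER = 18.01056
PROTON = 1.00728


def b_ions(peptide):
    """Singly charged b ions b1..b(n-1): N-terminal prefix mass plus a proton."""
    masses, prefix = [], 0.0
    for aa in peptide[:-1]:
        prefix += RESIDUE_MASS[aa]
        masses.append(prefix + PROTON)
    return masses


def y_ions(peptide):
    """Singly charged y ions y1..y(n-1): C-terminal suffix mass plus water and a proton."""
    masses, suffix = [], 0.0
    for aa in reversed(peptide[1:]):
        suffix += RESIDUE_MASS[aa]
        masses.append(suffix + WATER + PROTON)
    return masses


def precursor_mh(peptide):
    """Singly protonated precursor mass, M+H."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON


peptide = "SGFLEEDELK"
for label, ions in (("b", b_ions(peptide)), ("y", y_ions(peptide))):
    print(label, [f"{m:.2f}" for m in ions])
print("M+H", f"{precursor_mh(peptide):.2f}")
```

The computed values agree with the integer labels on the slides to within 1 Da. The slides list M+H (1166) at the end of both series.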
Tandem MS – de novo Sequencing
[Figure: tandem mass spectrum; x-axis: m/z (250 to 1000), y-axis: % Relative Abundance (up to 100). Labeled peaks: [M+2H]2+, 260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, 1080]
Mass Differences
| 1-letter code | 3-letter code | Chemical formula | Monoisotopic (Da) | Average (Da) |
|---|---|---|---|---|
| A | Ala | C3H5ON | 71.0371 | 71.0788 |
| R | Arg | C6H12ON4 | 156.101 | 156.188 |
| N | Asn | C4H6O2N2 | 114.043 | 114.104 |
| D | Asp | C4H5O3N | 115.027 | 115.089 |
| C | Cys | C3H5ONS | 103.009 | 103.139 |
| E | Glu | C5H7O3N | 129.043 | 129.116 |
| Q | Gln | C5H8O2N2 | 128.059 | 128.131 |
| G | Gly | C2H3ON | 57.0215 | 57.0519 |
| H | His | C6H7ON3 | 137.059 | 137.141 |
| I | Ile | C6H11ON | 113.084 | 113.159 |
| L | Leu | C6H11ON | 113.084 | 113.159 |
| K | Lys | C6H12ON2 | 128.095 | 128.174 |
| M | Met | C5H9ONS | 131.04 | 131.193 |
| F | Phe | C9H9ON | 147.068 | 147.177 |
| P | Pro | C5H7ON | 97.0528 | 97.1167 |
| S | Ser | C3H5O2N | 87.032 | 87.0782 |
| T | Thr | C4H7O2N | 101.048 | 101.105 |
| W | Trp | C11H10ON2 | 186.079 | 186.213 |
| Y | Tyr | C9H9O2N | 163.063 | 163.176 |
| V | Val | C5H9ON 	| 99.0684 | 99.1326 |
Amino acid masses
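The two mass columns follow directly from the chemical formulas. As a rough check, the sketch below sums element masses for a residue formula. The monoisotopic element masses are standard values. The average atomic weights are abridged standard values chosen for this example, so the average masses may differ from the table in the last digits.

```python
import re

# Monoisotopic masses of the most abundant isotopes (Da).
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.0030740,
                "O": 15.9949146, "S": 31.9720707}
# Abridged standard atomic weights (assumption: the slide may use slightly different values).
AVERAGE = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999, "S": 32.06}


def formula_mass(formula, element_masses):
    """Sum element masses for a formula such as 'C3H5ON' (a missing count means 1)."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += element_masses[element] * (int(count) if count else 1)
    return total


for name, formula in (("Ala", "C3H5ON"), ("Cys", "C3H5ONS"), ("Trp", "C11H10ON2")):
    mono = formula_mass(formula, MONOISOTOPIC)
    avg = formula_mass(formula, AVERAGE)
    print(f"{name}: monoisotopic {mono:.4f}, average {avg:.3f}")
```

Ile and Leu share the formula C6H11ON, so they have identical masses and cannot be distinguished by mass differences alone.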
Sequences consistent with spectrum
Tandem MS – de novo Sequencing
Mass differences between all pairs of peaks (column peak minus row peak). Differences that match an amino acid residue mass are labeled with the residue:

| m/z | 292 | 389 | 405 | 504 | 534 | 633 | 663 | 762 | 778 | 875 | 907 | 1020 | 1022 | 1079 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 260 | 32 | 129 (E) | 145 | 244 | 274 | 373 | 403 | 502 | 518 | 615 | 647 | 760 | 762 | 819 |
| 292 | | 97 (P) | 113 (I/L) | 212 | 242 | 341 | 371 | 470 | 486 | 583 | 615 | 728 | 730 | 787 |
| 389 | | | 16 | 115 (D) | 145 | 244 | 274 | 373 | 389 | 486 | 518 | 631 | 633 | 690 |
| 405 | | | | 99 (V) | 129 (E) | 228 | 258 | 357 | 373 | 470 | 502 | 615 | 617 | 674 |
| 504 | | | | | 30 | 129 (E) | 159 | 258 | 274 | 371 | 403 | 516 | 518 | 575 |
| 534 | | | | | | 99 (V) | 129 (E) | 228 | 244 | 341 | 373 | 486 | 488 | 545 |
| 633 | | | | | | | 30 | 129 (E) | 145 | 242 | 274 | 387 | 389 | 446 |
| 663 | | | | | | | | 99 (V) | 115 (D) | 212 | 244 | 357 | 359 | 416 |
| 762 | | | | | | | | | 16 | 113 (I/L) | 145 | 258 | 260 | 317 |
| 778 | | | | | | | | | | 97 (P) | 129 (E) | 242 | 244 | 301 |
| 875 | | | | | | | | | | | 32 | 145 | 147 (F) | 204 |
| 907 | | | | | | | | | | | | 113 (I/L) | 115 (D) | 172 |
| 1020 | | | | | | | | | | | | | 2 | 59 |
| 1022 | | | | | | | | | | | | | | 57 (G) |
Six of the candidate residue assignments are crossed out on the slide (X X X X X X). The remaining assignments give two partial sequences, which are the same sequence read in opposite directions:
- …GF(I/L)EEDE(I/L)…
- …(I/L)EDEE(I/L)FG…
Peptide M+H = 1166
- 1166 - 1079 = 87 => S, giving SGF(I/L)EEDE(I/L)…
- 1166 - 1020 - 18 = 128 => K or Q, giving SGF(I/L)EEDE(I/L)(K/Q)
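The mass-difference step can be sketched as follows. The code takes the peak list from the slide, forms all pairwise differences, and keeps those equal to a nominal (integer) residue mass. Matching on integer masses is a simplification made for this example; real data would need a mass tolerance. The names are illustrative.

```python
from itertools import combinations

# Nominal residue masses (Da). I/L and K/Q are indistinguishable at integer resolution.
NOMINAL_RESIDUE = {
    57: "G", 71: "A", 87: "S", 97: "P", 99: "V", 101: "T", 103: "C",
    113: "I/L", 114: "N", 115: "D", 128: "K/Q", 129: "E", 131: "M",
    137: "H", 147: "F", 156: "R", 163: "Y", 186: "W",
}

PEAKS = [260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, 1079]


def residue_candidates(peaks):
    """Return (low peak, high peak, residue) for each pair whose difference is a residue mass."""
    return [
        (low, high, NOMINAL_RESIDUE[high - low])
        for low, high in combinations(sorted(peaks), 2)
        if high - low in NOMINAL_RESIDUE
    ]


for low, high, residue in residue_candidates(PEAKS):
    print(f"{low} -> {high}: {high - low} ({residue})")

# Terminal residues from the precursor mass, as on the slide.
M_PLUS_H = 1166
print("N-terminal:", NOMINAL_RESIDUE[M_PLUS_H - 1079])
print("C-terminal:", NOMINAL_RESIDUE[M_PLUS_H - 1020 - 18])
```

Not every candidate belongs to a real ion series: differences between a b ion and a y ion can match a residue mass by chance, which is why some candidates have to be discarded before reading off the sequence.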
Tandem MS – de novo Sequencing
Challenges in de novo sequencing
- Neutral loss (-H2O, -NH3)
- Modifications
- Background peaks
- Incomplete information
Tandem MS – Database Search
Workflow:
- Sample: Lysis → Fractionation → Digestion → LC-MS → MS/MS
- Sequence DB: Pick Protein → Pick Peptide → All Fragment Masses
- Compare, Score, Test Significance
- Repeat for all peptides
- Repeat for all proteins
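The nested loop in this workflow (all proteins, all peptides) can be sketched as below. This is an illustration under stated assumptions, not the algorithm of any particular search engine: tryptic digestion without missed cleavages, singly charged b and y ions only, and a shared-peak count as a stand-in score. The database entries in the usage note are made up.

```python
import re

# Standard monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER, PROTON = 18.01056, 1.00728


def digest(protein):
    """Tryptic peptides: cleave after K or R, not before P, no missed cleavages."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def fragment_masses(peptide):
    """Singly charged b and y ion masses for every backbone cleavage site."""
    total = sum(RESIDUE_MASS[aa] for aa in peptide)
    fragments, prefix = [], 0.0
    for aa in peptide[:-1]:
        prefix += RESIDUE_MASS[aa]
        fragments.append(prefix + PROTON)                  # b ion
        fragments.append(total - prefix + WATER + PROTON)  # complementary y ion
    return fragments


def search(precursor_mh, spectrum, database, precursor_tol=1.0, fragment_tol=0.5):
    """Score every peptide whose M+H is within precursor_tol; return hits, best first."""
    hits = []
    for name, protein in database.items():   # repeat for all proteins
        for peptide in digest(protein):      # repeat for all peptides
            mh = sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON
            if abs(mh - precursor_mh) > precursor_tol:
                continue
            theoretical = fragment_masses(peptide)
            score = sum(
                any(abs(m - t) <= fragment_tol for t in theoretical) for m in spectrum
            )
            hits.append((score, name, peptide))
    return sorted(hits, reverse=True)
```

For example, `search(1166.56, [260.2, 292.1, 389.2, 405.2, 504.3, 534.2], {"protA": "AAARSGFLEEDELKGGR", "protB": "MKTAYIAKQR"})` returns a single hit for the peptide SGFLEEDELK with all six peaks matched. The score by itself still has to be tested for significance.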
Search Results
Significance Testing
- False protein identification is caused by random matching.
- An objective criterion for testing the significance of protein identification results is necessary.
- The significance of protein identifications can be tested once the distribution of scores for false results is known.
Significance Testing - Expectation Values
The majority of sequences in a collection will give a score due to random matching.
Workflow: M/Z spectrum → Database Search → List of Candidates → Distribution of Scores for Random and False Identifications → Extrapolate and Calculate Expectation Values → List of Candidates With Expectation Values
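One way to picture the "extrapolate and calculate expectation values" step: treat the scores of the lower-ranked candidates as a sample of random matches, fit a straight line to the logarithm of the number of candidates scoring at least s, and extrapolate that line to the top score. The slides do not state the functional form used, so the log-linear fit below is an assumption of this sketch, and the function name is hypothetical.

```python
import math


def expectation_value(random_scores, top_score):
    """Extrapolate the expected number of random matches scoring >= top_score.

    Fits log(count of scores >= s) = intercept + slope * s by least squares
    over the distinct scores, then evaluates the fit at top_score.
    Needs at least two distinct scores.
    """
    distinct = sorted(set(random_scores))
    xs = distinct
    ys = [math.log(sum(1 for r in random_scores if r >= s)) for s in distinct]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept + slope * top_score)


# Toy score distribution whose survival count halves with each score unit.
scores = [5, 4, 3, 3, 2, 2, 2, 2] + [1] * 8
print(expectation_value(scores, top_score=10))
```

A small expectation value means that a score this high is unlikely to arise from random matching alone. A value near or above 1 means that a random match of that quality is expected.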
Rho-diagrams: Overall Quality of a Data Set
Definition: $E_i$ ($i = 0, -1, -2, \ldots$) is the number of spectra that have been assigned an expectation value between $\exp(i)$ and $\exp(i-1)$.

For random matching, with $N$ the number of spectra:

$$E_i = N \int_{\exp(i-1)}^{\exp(i)} de = N\{\exp(i) - \exp(i-1)\} = N \exp(i)\{1 - \exp(-1)\}$$

$$\log(E_i) = i + \log(N\{1 - \exp(-1)\})$$

Expectation values as a function of score for random matching: the slide gives $e(s)$ as an exponential function of the score $s$.
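The definition of E_i translates into a simple binning of expectation values on a natural-log scale. The sketch below counts E_i for a list of expectation values and computes the count expected under random matching, assuming that for N spectra the expected number with an expectation value below e is N times e. The function names are illustrative.

```python
import math
from collections import Counter


def rho_counts(expectation_values):
    """E_i: number of spectra with exp(i-1) < e <= exp(i), for i = 0, -1, -2, ..."""
    counts = Counter()
    for e in expectation_values:
        if 0 < e <= 1:
            counts[math.ceil(math.log(e))] += 1
    return counts


def random_expectation(n_spectra, i):
    """Expected E_i under random matching: N * (exp(i) - exp(i-1))."""
    return n_spectra * (math.exp(i) - math.exp(i - 1))


values = [0.5, 0.9, 0.2, 0.01]
counts = rho_counts(values)
for i in sorted(counts, reverse=True):
    print(i, counts[i], round(random_expectation(len(values), i), 3))
```

Under random matching log(E_i) falls by one unit for each unit decrease in i. A data set with many true identifications shows more spectra at low expectation values than this line predicts.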
[Figure: rho-diagram; both axes span -6 to -1; x-axis: log(e)]
Rho-diagram – Random Matching
Rho-diagram – Data Quality
[Figure: rho-diagram; both axes span -10 to -2; x-axis: log(e)]
Rho-diagram – Parameters
How many fragments are sufficient?
- To identify an unmodified peptide?
- To identify a modified peptide?
- To localize a modification on a peptide?
How does it depend on different parameters?
- Precursor mass
- Precursor mass error
- Fragment mass error
- Background peaks
Simulations using synthetic spectra
Procedure:
- Select a peptide sequence from a protein sequence in the sequence database (Seq. DB)
- Calculate possible fragment ion masses
- Choose number of fragment ions to select (5, 6, 7, 8 or 9 on the slide)
- Randomly select fragment ions
- Search and store result
- Average over peptides

Example peptide: LSDPGVSPAVLSLEMLTDR

Possible fragment ion masses:
- b ions (b18 to b1): 1825.92 1710.89 1609.84 1496.76 1365.72 1236.68 1123.59 1036.56 923.48 824.41 753.37 656.32 569.29 470.22 413.20 316.15 201.12 114.09
- y ions (y1 to y18): 175.12 290.15 391.19 504.28 635.32 764.36 877.44 964.48 1077.56 1176.63 1247.67 1344.72 1431.75 1530.82 1587.84 1684.89 1799.92 1886.95

Randomly selected fragment ions (8 chosen): 201.12 504.28 964.48 1123.59 1247.67 1496.76 1530.82 1710.89

The selected masses are submitted to the search engine, which searches Seq. DB and returns an identification. Two questions are asked of each result:
- Is it significant?
- Is the identified sequence identical to the one used to generate the synthetic data?
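The simulation loop can be sketched as follows. The fragment masses are computed as singly charged b and y ions from standard monoisotopic residue masses, which reproduces the slide values to within about 0.02 Da. The `search` argument is a hypothetical callable standing in for the search engine: it takes a list of fragment masses and returns the identified sequence. Nothing here reflects a real search engine's interface.

```python
import random

# Standard monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER, PROTON = 18.01056, 1.00728


def fragment_ion_masses(peptide):
    """All singly charged b ions (b1..b(n-1)) followed by y ions (y1..y(n-1))."""
    b, prefix = [], 0.0
    for aa in peptide[:-1]:
        prefix += RESIDUE_MASS[aa]
        b.append(prefix + PROTON)
    y, suffix = [], 0.0
    for aa in reversed(peptide[1:]):
        suffix += RESIDUE_MASS[aa]
        y.append(suffix + WATER + PROTON)
    return b + y


def synthetic_spectrum(peptide, n_fragments, rng):
    """Randomly select n_fragments of the possible fragment ion masses."""
    return sorted(rng.sample(fragment_ion_masses(peptide), n_fragments))


def identification_rate(peptide, n_fragments, search, trials=20, seed=0):
    """Fraction of synthetic spectra for which `search` returns the original peptide."""
    rng = random.Random(seed)
    hits = sum(
        search(synthetic_spectrum(peptide, n_fragments, rng)) == peptide
        for _ in range(trials)
    )
    return hits / trials


PEPTIDE = "LSDPGVSPAVLSLEMLTDR"
print([f"{m:.2f}" for m in synthetic_spectrum(PEPTIDE, 8, random.Random(0))])
```

Averaging `identification_rate` over many peptides for each number of fragment ions gives one curve of the kind shown on the following slides. A significance threshold on the search result would be applied inside `search`.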
Simulations using synthetic spectra
Each point is an average of searches with 20 randomly generated synthetic fragment mass spectra.
Each point is an average of 50 peptides.
[Figure labels: Threshold; Average over peptides]
Critical number of fragment masses
Small peptides are slightly more difficult to identify
[Figure: Probability of Identification (y-axis) vs. Number of fragment ions (x-axis, 5 to 20); one curve per precursor mass m_precursor: 1000 Da, 1500 Da, 2000 Da, 2500 Da]
Parameters: Δm_precursor = 1 Da, Δm_fragment = 0.5 Da, no modification
A lower precursor mass error requires fewer fragment masses for identification of unmodified peptides
[Figure: Probability of Identification (y-axis) vs. Number of fragment ions (x-axis, 5 to 20); one curve per precursor mass error: 0.01 Da, 1 Da, 10 Da]
Parameters: m_precursor = 2000 Da, Δm_fragment = 0.5 Da, no modification
The dependence on the fragment mass error is weak below a threshold for identification of unmodified peptides
[Figure: Probability of Identification (y-axis) vs. Number of fragment ions (x-axis, 5 to 20); one curve per fragment mass error Δm_fragment: 0.01 Da, 0.5 Da, 1 Da, 2 Da]
Parameters: m_precursor = 2000 Da, Δm_precursor = 1 Da, no modification
A moderate number of background peaks can be tolerated when identifying unmodified peptides
[Figure: Probability of Identification (y-axis) vs. Number of fragment ions (x-axis, 5 to 20); one curve per background level: 0%, 50%, 80%]
Parameters: m_precursor = 2000 Da, Δm_precursor = 1 Da, Δm_fragment = 0.5 Da, no modification
A large number of background peaks can be tolerated if the fragment mass is accurate
[Figure: Probability of Identification (y-axis) vs. Number of fragment ions (x-axis, 5 to 20); one curve per background level: 0%, 50%, 80%]
Parameters: m_precursor = 2000 Da, Δm_precursor = 1 Da, Δm_fragment = 0.01 Da, no modification
Identification of phosphopeptides is only slightly more difficult
[Figure: Probability of Identification (y-axis) vs. Number of fragment ions (x-axis, 5 to 20); curves: Phosphorylated, Unmodified]
Parameters: m_precursor = 2000 Da, Δm_precursor = 1 Da, Δm_fragment = 0.5 Da