SLIDE 1

Proteomics Informatics – Protein identification I: searching protein sequence collections and significance testing (Week 4)

SLIDE 2

Peptide Mapping - Mass Accuracy

SLIDE 3

Peptide Mapping - Database Size

[Figure panels: C. elegans, S. cerevisiae, Human]

SLIDE 4

Peptide Mapping - Cys-Containing Peptides

[Figure panels: C. elegans, S. cerevisiae, Human]

SLIDE 5

MS Identification – Peptide Mass Fingerprinting

[Workflow diagram. Sample: Digestion → MS → All Peptide Masses. Search: Sequence DB → Pick Protein → Compare, Score, Test Significance (repeat for each protein) → Identified Proteins.]
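The peptide mass fingerprinting loop can be sketched as follows. This is a minimal illustration under stated assumptions, not ProFound's scoring: the function names are hypothetical, cleavage is taken to occur after every K or R, and the "score" is simply the number of measured masses that match a theoretical peptide within a tolerance. Residue masses are the monoisotopic values from the amino acid table on Slide 29.

```python
# Monoisotopic residue masses (Da), as listed in the amino acid table on Slide 29.
RESIDUE_MASS = {
    "A": 71.0371, "R": 156.101, "N": 114.043, "D": 115.027, "C": 103.009,
    "E": 129.043, "Q": 128.059, "G": 57.0215, "H": 137.059, "I": 113.084,
    "L": 113.084, "K": 128.095, "M": 131.04, "F": 147.068, "P": 97.0528,
    "S": 87.032, "T": 101.048, "W": 186.079, "Y": 163.063, "V": 99.0684,
}
WATER = 18.0106  # monoisotopic mass of H2O (Da); assumed constant, not on the slides


def tryptic_digest(protein, missed_cleavages=0):
    """Cleave after every K or R (simplified rule) and return the peptides."""
    pieces, start = [], 0
    for i, aa in enumerate(protein):
        if aa in "KR":
            pieces.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        pieces.append(protein[start:])
    peptides = []
    for i in range(len(pieces)):
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic peptide mass: residues plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def count_matches(measured, protein, tolerance=0.5, missed_cleavages=0):
    """Number of measured masses within `tolerance` Da of a theoretical peptide."""
    theoretical = [peptide_mass(p) for p in tryptic_digest(protein, missed_cleavages)]
    return sum(any(abs(m - t) <= tolerance for t in theoretical) for m in measured)


def rank_proteins(measured, database, tolerance=0.5):
    """Score every protein in the database and sort best-first."""
    scores = {name: count_matches(measured, seq, tolerance)
              for name, seq in database.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A real search engine would replace the match count with a statistical score and then test its significance, as the later slides describe.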

SLIDE 6

ProFound – Search Parameters

http://prowl.rockefeller.edu/

SLIDE 7

ProFound – Protein Identification by Peptide Mapping

$$P(k \mid DI) \propto P(k \mid I)\,\frac{(N-r)!}{N!}\,\prod_{i=1}^{r}\left\{\sqrt{\frac{2}{\pi}}\,\frac{m_{\max}-m_{\min}}{\sigma_i}\,\sum_{j=1}^{g_i}\exp\left[-\frac{(m_i-m_{ij0})^2}{2\sigma_i^2}\right]\right\}F_{\text{pattern}}$$

W. Zhang & B.T. Chait, Analytical Chemistry 72 (2000) 2482-2489

SLIDE 8

ProFound Results

SLIDE 9

Peptide Mapping – Mass Accuracy

[Figure, two panels. ProFound: log(e) vs. Mass Tolerance (Da). Mascot: Score vs. Mass Tolerance (Da). Mass tolerance axis: 0.5 to 2 Da.]

SLIDE 10

Peptide Mapping - Database Size

[Figure panels: S. cerevisiae, Fungi, All Taxa]

Expectation values, peptide mapping example:

Database        Expectation value
S. cerevisiae   4.8e-7
Fungi           8.4e-6
All Taxa        2.9e-4

SLIDE 11

Database size

SLIDE 12

Missed Cleavage Sites

[Figure panels: u = 1, u = 2, u = 4]

Expectation values, peptide mapping example:

u      Expectation value
1      4.8e-7
2      1.1e-5
4      6.8e-4

SLIDE 13

Peptide Mapping - Partial Modifications

[Figure panels: No Modifications; Phosphorylation (S, T, or Y)]

Protein     Searched without modifications   Searched with possible phosphorylation of S/T/Y
DARPP-32    0.00006                          0.01
CFTR        0.00002                          0.005

Even if the protein is modified it is usually better to search a protein sequence database without specifying possible modifications using peptide mapping data.

SLIDE 14

Peptide Mapping - Ranking by Direct Calculation of the Significance

SLIDE 15

General Criteria for a Good Protein Identification Algorithm

The response to random input data should be random.
Maximum number of correct identifications and minimum number of incorrect identifications for any data set.
Maximal separation between scores for correct identifications and the distribution of scores for random matching proteins for any data set.
The statistical significance of the results should be calculated.
The searches should be fast.

SLIDE 16

Response to Random Data

Normalized Frequency

SLIDE 17

Peptide Fragmentation

[Diagram: Ion Source → Mass Analyzer 1 → Fragmentation → Mass Analyzer 2 → Detector. Fragment ion types: b, y.]

SLIDE 18

Identification – Tandem MS

SLIDE 19

Tandem MS – Sequence Confirmation

[Figure: MS/MS spectrum, % Relative Abundance vs. m/z (100 to 1000), shown with the sequence K L E D E E L F G S]

SLIDE 20

Tandem MS – Sequence Confirmation

[Figure: MS/MS spectrum, % Relative Abundance vs. m/z (100 to 1000), with the b ion masses listed next to the sequence]

Residue   b ion (m/z)
K         1166
L         1020
E         907
D         778
E         663
E         534
L         405
F         292
G         145
S         88

SLIDE 21

Tandem MS – Sequence Confirmation

[Figure: MS/MS spectrum, % Relative Abundance vs. m/z (100 to 1000), with the y and b ion masses listed next to the sequence]

Residue   y ion (m/z)   b ion (m/z)
K         147           1166
L         260           1020
E         389           907
D         504           778
E         633           663
E         762           534
L         875           405
F         1022          292
G         1080          145
S         1166          88
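The b and y ion masses listed on this slide can be recomputed from residue masses. The sketch below assumes the peptide reads SGFLEEDELK from N- to C-terminus (taking L for I/L and K for K/Q), uses the monoisotopic residue masses from Slide 29, and adds constants for water and a proton that are not given on the slides. The function name is hypothetical.

```python
# Monoisotopic residue masses (Da) for the residues in this peptide (Slide 29 table).
RESIDUE_MASS = {
    "S": 87.032, "G": 57.0215, "F": 147.068, "L": 113.084,
    "E": 129.043, "D": 115.027, "K": 128.095,
}
WATER = 18.0106   # H2O, part of every y ion (assumed constant)
PROTON = 1.00728  # charge carrier for singly protonated ions (assumed constant)


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values.

    b_ions[k-1] covers the first k residues (N-terminal fragment);
    y_ions[k-1] covers the last k residues (C-terminal fragment).
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    prefix = suffix = 0.0
    for k in range(1, len(masses)):
        prefix += masses[k - 1]
        suffix += masses[-k]
        b_ions.append(prefix + PROTON)
        y_ions.append(suffix + WATER + PROTON)
    return b_ions, y_ions


b_ions, y_ions = fragment_ions("SGFLEEDELK")
for k, (b, y) in enumerate(zip(b_ions, y_ions), start=1):
    print(f"b{k} = {b:8.3f}   y{k} = {y:8.3f}")
```

The computed values agree with the nominal masses on the slide to within 1 Da; the slide appears to list nominal (integer) values rather than exact monoisotopic ones.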

SLIDE 22

Tandem MS – Sequence Confirmation

[Figure: MS/MS spectrum, % Relative Abundance vs. m/z (100 to 1000), with the y and b ion list from Slide 21 next to the sequence K L E D E E L F G S. Labeled peaks: [M+2H]2+; 260, 389, 504, 633, 762, 875, 1022, 1080; 292, 405, 534, 663, 778, 907, 1020.]

SLIDE 24

Tandem MS – Sequence Confirmation

[Figure: MS/MS spectrum, % Relative Abundance vs. m/z (100 to 1000), with the y and b ion list from Slide 21 next to the sequence K L E D E E L F G S. Labeled peaks: [M+2H]2+; 260, 389, 504, 633, 762, 875, 1022, 1080; 292, 405, 534, 663, 778, 907, 1020. A mass difference of 113 is marked in two places.]

SLIDE 25

Tandem MS – Sequence Confirmation

[Figure: MS/MS spectrum, % Relative Abundance vs. m/z (100 to 1000), with the y and b ion list from Slide 21 next to the sequence K L E D E E L F G S. Labeled peaks: [M+2H]2+; 260, 389, 504, 633, 762, 875, 1022, 1080; 292, 405, 534, 663, 778, 907, 1020. A mass difference of 129 is marked in two places.]

SLIDE 26

Tandem MS – Sequence Confirmation

[Figure: MS/MS spectrum, % Relative Abundance vs. m/z (100 to 1000), with the y and b ion list from Slide 21 and the sequence K L E D E E L F G S. Labeled peaks: [M+2H]2+; 260, 389, 504, 633, 762, 875, 1022, 1080; 292, 405, 534, 663, 778, 907, 1020.]

SLIDE 29

Tandem MS – de novo Sequencing

[Figure: MS/MS spectrum, % Relative Abundance vs. m/z (100 to 1000). Labeled peaks: [M+2H]2+, 260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, 1080. Diagram steps: Mass Differences, Amino acid masses, Sequences consistent with spectrum.]

Amino acid masses

1-letter code   3-letter code   Chemical formula   Monoisotopic   Average
A               Ala             C3H5ON             71.0371        71.0788
R               Arg             C6H12ON4           156.101        156.188
N               Asn             C4H6O2N2           114.043        114.104
D               Asp             C4H5O3N            115.027        115.089
C               Cys             C3H5ONS            103.009        103.139
E               Glu             C5H7O3N            129.043        129.116
Q               Gln             C5H8O2N2           128.059        128.131
G               Gly             C2H3ON             57.0215        57.0519
H               His             C6H7ON3            137.059        137.141
I               Ile             C6H11ON            113.084        113.159
L               Leu             C6H11ON            113.084        113.159
K               Lys             C6H12ON2           128.095        128.174
M               Met             C5H9ONS            131.04         131.193
F               Phe             C9H9ON             147.068        147.177
P               Pro             C5H7ON             97.0528        97.1167
S               Ser             C3H5O2N            87.032         87.0782
T               Thr             C4H7O2N            101.048        101.105
W               Trp             C11H10ON2          186.079        186.213
Y               Tyr             C9H9O2N            163.063        163.176
V               Val             C5H9ON             99.0684        99.1326

SLIDE 30

Tandem MS – de novo Sequencing

Pairwise mass differences between peaks (row peak subtracted from column peak):

      292  389  405  504  534  633  663  762  778  875  907 1020 1022 1079
 260   32  129  145  244  274  373  403  502  518  615  647  760  762  819
 292        97  113  212  242  341  371  470  486  583  615  728  730  787
 389             16  115  145  244  274  373  389  486  518  631  633  690
 405                  99  129  228  258  357  373  470  502  615  617  674
 504                       30  129  159  258  274  371  403  516  518  575
 534                            99  129  228  244  341  373  486  488  545
 633                                 30  129  145  242  274  387  389  446
 663                                      99  115  212  244  357  359  416
 762                                           16  113  145  258  260  317
 778                                                97  129  242  244  301
 875                                                     32  145  147  204
 907                                                         113  115  172
1020                                                                2   59
1022                                                                    57

SLIDE 31

Tandem MS – de novo Sequencing

[Same mass-difference matrix as on Slide 30, with the differences that correspond to an amino acid residue mass highlighted.]

Row peak: highlighted differences
260: 129
292: 97, 113
389: 115
405: 99, 129
504: 129
534: 99, 129
633: 129
663: 99, 115
762: 113
778: 97, 129
875: 147
907: 113, 115
1022: 57

SLIDE 32

Tandem MS – de novo Sequencing

[Same mass-difference matrix as on Slide 30, with the matching differences replaced by amino acid letters. Six candidate assignments are crossed out (X).]

Row peak: candidate residues
260: E
292: P, I/L
389: D
405: V, E
504: E
534: V, E
633: E
663: V, D
762: I/L
778: P, E
875: F
907: I/L, D
1022: G

Partial sequences read from the two ladders (one is the reverse of the other):
…GF(I/L)EEDE(I/L)…
…(I/L)EDEE(I/L)FG…

Peptide M+H = 1166
1166 - 1079 = 87 ⇒ S
SGF(I/L)EEDE(I/L)…
1166 - 1020 - 18 = 128 ⇒ K or Q
SGF(I/L)EEDE(I/L)(K/Q)
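The first step of this de novo procedure, finding peak pairs whose mass difference equals a residue mass, can be sketched as below. The peak list is the one from the matrix on Slide 30 and the nominal residue masses are rounded from the Slide 29 table; the function name is hypothetical. The sketch only lists candidate steps. Deciding which candidates belong to the same ion series (the crossed-out entries on the slide) is a further step that is not shown.

```python
# Peak list from the mass-difference matrix (Slide 30).
PEAKS = [260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, 1079]

# Nominal residue masses, rounded from the monoisotopic column of the Slide 29 table.
NOMINAL = {
    57: "G", 71: "A", 87: "S", 97: "P", 99: "V", 101: "T", 103: "C",
    113: "I/L", 114: "N", 115: "D", 128: "K/Q", 129: "E", 131: "M",
    137: "H", 147: "F", 156: "R", 163: "Y", 186: "W",
}


def residue_steps(peaks, table=NOMINAL):
    """Map each peak to the higher peaks reachable by adding one residue mass."""
    peaks = sorted(peaks)
    steps = {}
    for i, low in enumerate(peaks):
        for high in peaks[i + 1:]:
            label = table.get(high - low)
            if label is not None:
                steps.setdefault(low, []).append((high, label))
    return steps


for low, targets in residue_steps(PEAKS).items():
    print(low, ", ".join(f"{label} -> {high}" for high, label in targets))
```

Following the 260 entry step by step (260 → 389 → 504 → 633 → 762 → 875 → 1022 → 1079) spells E, D, E, E, I/L, F, G, which matches one of the partial sequences on the slide.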

SLIDE 33

Tandem MS – de novo Sequencing

Challenges in de novo sequencing
Neutral loss (-H2O, -NH3)
Modifications
Background peaks
Incomplete information

SLIDE 34

Tandem MS – Database Search

[Workflow diagram. Sample: Lysis → Fractionation → Digestion → LC-MS → MS/MS. Search: Sequence DB → Pick Protein → Pick Peptide → All Fragment Masses → Compare, Score, Test Significance (repeat for all peptides; repeat for all proteins).]

SLIDE 35

Algorithms

SLIDE 36

Comparing and Optimizing Algorithms

[Figure: for Algorithm 1 and Algorithm 2, the score distributions of true and false identifications and the corresponding curves of Sensitivity vs. 1-Specificity.]
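A comparison of this kind can be sketched by sweeping a score threshold over the true and false score distributions and recording sensitivity against 1-specificity. The function names and the example scores below are hypothetical; the area under the curve is used here as one possible summary, which the slide does not specify.

```python
def roc_points(true_scores, false_scores):
    """(1-specificity, sensitivity) for every distinct score used as a threshold."""
    thresholds = sorted(set(true_scores) | set(false_scores), reverse=True)
    points = []
    for t in thresholds:
        sensitivity = sum(s >= t for s in true_scores) / len(true_scores)
        false_positive_rate = sum(s >= t for s in false_scores) / len(false_scores)
        points.append((false_positive_rate, sensitivity))
    return points


def area_under_curve(points):
    """Trapezoidal area under the curve, anchored at (0, 0) and (1, 1)."""
    pts = [(0.0, 0.0)] + sorted(points) + [(1.0, 1.0)]
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))


# Hypothetical scores: algorithm 1 separates true from false better than algorithm 2.
algorithm_1 = area_under_curve(roc_points([5, 6, 7], [1, 2, 3]))
algorithm_2 = area_under_curve(roc_points([1, 2, 3], [1, 2, 3]))
print(algorithm_1, algorithm_2)
```

An algorithm whose true and false score distributions do not overlap gives an area of 1; identical distributions give 0.5.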

SLIDE 37

MS/MS - Parent Mass Error and Enzyme Specificity

$$x_{II} = x_{I}\,(n_b!\;n_y!)$$

Expectation values, MS/MS example:

Parent mass error, enzyme specificity   Expectation value
∆m=2, Trypsin                           2.5e-5
∆m=100, Trypsin                         2.5e-5
∆m=2, non-specific                      7.9e-5
∆m=100, non-specific                    1.6e-4
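The score relation on this slide appears to multiply a first-stage score x_I by the factorials of the numbers of matched b and y ions, n_b and n_y, to give x_II. A minimal sketch under that reading follows; the function name and the example numbers are hypothetical.

```python
from math import factorial


def second_stage_score(first_stage_score, n_b, n_y):
    """x_II = x_I * n_b! * n_y!, with n_b and n_y the matched b and y ion counts."""
    return first_stage_score * factorial(n_b) * factorial(n_y)


# Two candidates with the same first-stage score but different numbers of matched ions.
print(second_stage_score(10.0, 3, 2))
print(second_stage_score(10.0, 6, 5))
```

The factorial terms grow quickly, so candidates that match many fragment ions are pushed well above those that match only a few.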

SLIDE 38

Sequest

Cross-correlation

SLIDE 39

X! Tandem - Search Parameters

http://www.thegpm.org/

SLIDE 40

X! Tandem - Search Parameters

SLIDE 41

X! Tandem - Search Parameters

SLIDE 42

Conventional, single stage searching

[Diagram: sequences and spectra → generic search engine]

Test all cleavages, modifications, & mutations for all sequences

SLIDE 43

Some hard problems in MS/MS analysis in proteomics

Determining potential modifications
e.g., oxidation, phosphorylation, deamidation
calculation order 2^n
NP complete

Allowing for unanticipated peptide cleavages
e.g., chymotryptic contamination in trypsin
calculation order ~ 200 × tryptic cleavage
“unfortunate” coefficient

Detecting point mutations
e.g., sequence homology
calculation order 18^N
NP complete
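The exponential growth for potential modifications can be made concrete by enumerating every modified or unmodified combination of the candidate sites in a peptide. The sketch assumes phosphorylation of S, T, or Y as the modification; the function name is hypothetical.

```python
from itertools import product


def modification_states(peptide, sites="STY"):
    """Yield every combination of modified positions (each site on or off)."""
    positions = [i for i, aa in enumerate(peptide) if aa in sites]
    for mask in product((False, True), repeat=len(positions)):
        yield tuple(p for p, on in zip(positions, mask) if on)


# One candidate site gives 2 states; six candidate sites give 2**6 = 64.
for peptide in ("SGFLEEDELK", "STYSTY"):
    print(peptide, len(list(modification_states(peptide))))
```

Each additional candidate site doubles the number of peptide variants that must be scored against the spectrum.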

SLIDE 44

Multi-stage searching (X! Tandem)

[Diagram: sequences and spectra pass through successive search stages: Tryptic cleavage → Modifications #1 → Modifications #2 → Point mutation]

SLIDE 45

Search Results

SLIDE 46

Search Results

SLIDE 47

Sequence Annotations

SLIDE 48

Search Results

SLIDE 49

Search Results

SLIDE 50

Identification – Spectrum Library Search

[Workflow diagram. Sample: Lysis → Fractionation → Digestion → LC-MS/MS → MS/MS. Search: Spectrum Library → Pick Spectrum → Compare, Score, Test Significance (repeat for all spectra) → Identified Proteins.]

SLIDE 51

Steps in making an Annotated Spectrum Library (ASL):

1. Find the best 10 spectra for a particular sequence, with the same PTMs and charge.
2. Add the spectra together and normalize the intensity values.
3. Assign a “quality” value: the median expectation value of the 10 spectra used.
4. Record the 20 most intense peaks in the averaged spectrum, its parent ion z, m/z, sequence, protein accessions & quality.
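The four steps can be sketched for a single sequence as follows. Several details are assumptions rather than part of the slide: "best" is taken to mean lowest expectation value, spectra are represented as dictionaries from binned m/z to intensity, and intensities are normalized so the largest summed peak equals 1. The function name is hypothetical.

```python
from statistics import median


def build_library_entry(spectra, top_n=10, n_peaks=20):
    """Build one library entry from (expectation_value, {mz_bin: intensity}) pairs."""
    # Step 1: keep the best spectra (assumed: lowest expectation values).
    best = sorted(spectra, key=lambda s: s[0])[:top_n]
    # Step 2: add the spectra together and normalize the intensities.
    summed = {}
    for _, peaks in best:
        for mz, intensity in peaks.items():
            summed[mz] = summed.get(mz, 0.0) + intensity
    top = max(summed.values())
    normalized = {mz: value / top for mz, value in summed.items()}
    # Step 3: quality is the median expectation value of the spectra used.
    quality = median(e for e, _ in best)
    # Step 4: record only the most intense peaks.
    strongest = sorted(normalized.items(), key=lambda kv: kv[1], reverse=True)[:n_peaks]
    return {"quality": quality, "peaks": dict(strongest)}


entry = build_library_entry(
    [(1e-3, {100: 2.0, 200: 1.0}), (1e-5, {100: 2.0, 300: 4.0})],
    n_peaks=2,
)
print(entry)
```

A full entry would also carry the parent ion charge and m/z, the sequence, and the protein accessions listed in step 4.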

SLIDE 52

[Figure: fraction of library (%) vs. peptide length]

Spectrum Library Characteristics – Peptide Length

SLIDE 53

[Figure: % coverage (residues, peptides) vs. protein Mr (kDa)]

Spectrum Library Characteristics – Protein Coverage

SLIDE 54

[Figure: library spectrum (5:25) compared with test spectrum (5:25)]

Results: 4 peaks selected, 1 peak missed

Identification – Spectrum Library Search

SLIDE 55

Identification – Spectrum Library Search

Apply a hypergeometric probability model:

25 possible m/z values;
5 peaks in the library spectrum; and
4 selected by the test spectrum.

How likely is this?

Matches   Probability
1         0.45
2         0.15
3         0.016
4         0.00039
5         0.0000037
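The probabilities on this slide can be checked with a hypergeometric calculation. The parameterization below is an assumption: 25 possible m/z values, 5 of them occupied by library peaks, and 4 draws. It reproduces the listed values for 1 to 4 matches; the value listed for 5 matches is not reproduced by this parameterization. The function name is hypothetical.

```python
from math import comb


def hypergeom_pmf(k, population, marked, draws):
    """Probability of exactly k marked items in `draws` draws without replacement."""
    return comb(marked, k) * comb(population - marked, draws - k) / comb(population, draws)


# 25 possible m/z values, 5 library peaks, 4 draws (assumed parameterization).
for k in range(1, 5):
    print(k, f"{hypergeom_pmf(k, population=25, marked=5, draws=4):.2g}")
```

The probability falls steeply with each additional matching peak, which is what makes the number of matches a useful basis for a significance estimate.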

SLIDE 56

What if you have 1000 possible m/z values and 20 peaks in both the test and library spectra?

[Figure: p (log scale, 1.0E-14 to 1.0E+00) vs. number of matches (1 to 10)]

1 matched: p = 0.6
5 matched: p = 0.0002
10 matched: p = 0.0000000000001

Identification – Spectrum Library Search

SLIDE 57

[Diagram: Experimental Mass Spectrum (m/z) → Library of Assigned Mass Spectra → Best search result]

Identification – Spectrum Library Search

SLIDE 58

X! Hunter

SLIDE 59

X! Hunter algorithm:

1. Use dot product to find a library spectrum that best matches a test spectrum.
2. Calculate p-value with hypergeometric distribution.
3. Use p-value to calculate expectation value, given the identification parameters.
4. If expectation value is less than the median expectation value of the library spectrum, report the median value.
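Step 1 of the algorithm can be sketched with a normalized dot product between peak lists. This is an illustration, not X! Hunter's code: spectra are represented as dictionaries from binned m/z to intensity, and the function names and example spectra are hypothetical.

```python
from math import sqrt


def normalized_dot(spec_a, spec_b):
    """Cosine-style similarity between two {mz_bin: intensity} spectra."""
    shared = set(spec_a) & set(spec_b)
    dot = sum(spec_a[mz] * spec_b[mz] for mz in shared)
    norm = (sqrt(sum(v * v for v in spec_a.values()))
            * sqrt(sum(v * v for v in spec_b.values())))
    return dot / norm if norm else 0.0


def best_library_match(test_spectrum, library):
    """Name of the library spectrum with the highest similarity to the test spectrum."""
    return max(library, key=lambda name: normalized_dot(test_spectrum, library[name]))


library = {
    "peptide_A": {260: 1.0, 389: 0.8, 504: 0.5},
    "peptide_B": {292: 1.0, 405: 0.7, 534: 0.4},
}
test_spectrum = {260: 0.9, 389: 1.0, 504: 0.4, 700: 0.1}
print(best_library_match(test_spectrum, library))
```

The best match found this way would then go through steps 2 to 4 to obtain its p-value and expectation value.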

SLIDE 60

X! Hunter Result

Query Spectrum Library Spectrum

SLIDE 61

Significance Testing

False protein identification is caused by random matching.
An objective criterion for testing the significance of protein identification results is necessary.
The significance of protein identifications can be tested once the distribution of scores for false results is known.

SLIDE 62

Significance Testing - Expectation Values

The majority of sequences in a collection will give a score due to random matching.

SLIDE 63

Significance Testing - Expectation Values

[Workflow diagram: spectrum (m/z) → Database Search → List of Candidates → Distribution of Scores for Random and False Identifications → Extrapolate and Calculate Expectation Values → List of Candidates with Expectation Values]
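The "extrapolate and calculate" step can be sketched by fitting a straight line to the logarithm of the survival counts of the random scores (how many candidates score at least x) and extrapolating that line to the score of interest. This is a simplified illustration with hypothetical function names: it fits all distinct scores, whereas practical implementations typically fit only the high-scoring tail, and it assumes the random scores fall off exponentially.

```python
from math import exp, log


def fit_log_survival(scores):
    """Least-squares line through (x, ln(number of scores >= x)) for distinct x."""
    xs = sorted(set(scores))
    ys = [log(sum(s >= x for s in scores)) for x in xs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def expectation_value(score, random_scores):
    """Expected number of random candidates scoring at least `score`."""
    slope, intercept = fit_log_survival(random_scores)
    return exp(intercept + slope * score)


# Hypothetical random scores whose survival counts halve with each unit of score.
random_scores = [0, 0, 0, 0, 1, 1, 2, 3]
print(expectation_value(6, random_scores))
```

A candidate whose score lies far beyond the random distribution gets an expectation value much smaller than 1, meaning a match that good is unlikely to arise from random matching.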

SLIDE 64

Proteomics Informatics – Protein identification I: searching protein sequence collections and significance testing (Week 4)