Proteomics Informatics Protein identification I: searching protein - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Proteomics Informatics – Protein identification I: searching protein sequence collections and significance testing (Week 4)

slide-2
SLIDE 2


Peptide Mapping - Mass Accuracy

slide-3
SLIDE 3

Peptide Mapping - Database Size

  • Human
  • C. elegans
  • S. cerevisiae

slide-4
SLIDE 4

Peptide Mapping - Cys-Containing Peptides

  • Human
  • C. elegans
  • S. cerevisiae

slide-5
SLIDE 5

MS Identification – Peptide Mass Fingerprinting

  • Experiment: Digestion → MS → All Peptide Masses
  • Search: Sequence DB → Pick Protein → Compare, Score, Test Significance
  • Repeat for each protein
  • Result: Identified Proteins
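
The loop on this slide can be sketched in a few lines of Python. This is a minimal illustration and not the ProFound algorithm: the score is a plain count of matched masses, significance testing is left out, and the two-protein database, the 0.5 Da tolerance, and all function names are assumptions made for the example.

```python
import re  # zero-width re.split needs Python 3.7+

# Monoisotopic residue masses in Da (standard values, rounded to 4 decimals).
MONO = {
    "G": 57.0215, "A": 71.0371, "S": 87.0320, "P": 97.0528, "V": 99.0684,
    "T": 101.0477, "C": 103.0092, "L": 113.0841, "I": 113.0841, "N": 114.0429,
    "D": 115.0269, "Q": 128.0586, "K": 128.0950, "E": 129.0426, "M": 131.0405,
    "H": 137.0589, "F": 147.0684, "R": 156.1011, "Y": 163.0633, "W": 186.0793,
}
WATER = 18.0106

def tryptic_digest(protein):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]

def peptide_mass(peptide):
    """Neutral monoisotopic mass: residue masses plus one water."""
    return sum(MONO[aa] for aa in peptide) + WATER

def count_matches(measured, protein, tol=0.5):
    """Score = number of measured masses explained by a peptide of this protein."""
    theoretical = [peptide_mass(p) for p in tryptic_digest(protein)]
    return sum(any(abs(m - t) <= tol for t in theoretical) for m in measured)

def rank_proteins(measured, database, tol=0.5):
    """Repeat the comparison for each protein and sort by score, best first."""
    scores = [(count_matches(measured, seq, tol), name) for name, seq in database.items()]
    return sorted(scores, reverse=True)

# Hypothetical two-protein database and measured masses, for illustration only.
TOY_DB = {"protA": "MKWVTFISLLR", "protB": "GASPVTCK"}
measured = [peptide_mass("MK"), peptide_mass("WVTFISLLR")]
print(rank_proteins(measured, TOY_DB))
```

A real search engine would replace the match count with a probabilistic score and then test its significance, as discussed later in the deck.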

slide-6
SLIDE 6

ProFound Results

slide-7
SLIDE 7

Database size

slide-8
SLIDE 8

Mixtures

slide-9
SLIDE 9

Peptide Fragmentation

Ion Source → Mass Analyzer 1 → Fragmentation → Mass Analyzer 2 → Detector

Fragment ion types: b, y

slide-10
SLIDE 10

Identification – Tandem MS

slide-11
SLIDE 11

Tandem MS – Sequence Confirmation

[Figure: MS/MS spectrum; x-axis m/z (100 to 1000), y-axis % relative abundance; residues shown above the spectrum: K L E D E E L F G S]

slide-12
SLIDE 12

Tandem MS – Sequence Confirmation

Residue   b ion (m/z)
K         1166
L         1020
E          907
D          778
E          663
E          534
L          405
F          292
G          145
S           88

[Figure: MS/MS spectrum; x-axis m/z (100 to 1000), y-axis % relative abundance; residues shown as K L E D E E L F G S]

slide-13
SLIDE 13

Tandem MS – Sequence Confirmation

Residue   y ion (m/z)   b ion (m/z)
K          147          1166
L          260          1020
E          389           907
D          504           778
E          633           663
E          762           534
L          875           405
F         1022           292
G         1080           145
S         1166            88

[Figure: MS/MS spectrum; x-axis m/z (100 to 1000), y-axis % relative abundance; residues shown as K L E D E E L F G S]
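
The b and y ion values in the table can be reproduced from residue masses. The sketch below assumes singly charged fragments, standard monoisotopic masses, and the peptide read from its N-terminus as SGFLEEDELK (b1 = S at 88, y1 = K at 147); the function name is illustrative.

```python
# Monoisotopic residue masses in Da (standard values, rounded to 4 decimals).
MONO = {
    "G": 57.0215, "A": 71.0371, "S": 87.0320, "P": 97.0528, "V": 99.0684,
    "T": 101.0477, "C": 103.0092, "L": 113.0841, "I": 113.0841, "N": 114.0429,
    "D": 115.0269, "Q": 128.0586, "K": 128.0950, "E": 129.0426, "M": 131.0405,
    "H": 137.0589, "F": 147.0684, "R": 156.1011, "Y": 163.0633, "W": 186.0793,
}
WATER, PROTON = 18.0106, 1.0073

def fragment_ions(peptide):
    """Singly charged b1..b(n-1) and y1..y(n-1) m/z values."""
    masses = [MONO[aa] for aa in peptide]
    # b ions: N-terminal prefix plus a proton.
    b = [sum(masses[:i]) + PROTON for i in range(1, len(masses))]
    # y ions: C-terminal suffix plus water plus a proton.
    y = [sum(masses[-i:]) + WATER + PROTON for i in range(1, len(masses))]
    return b, y

peptide = "SGFLEEDELK"
b_ions, y_ions = fragment_ions(peptide)
precursor_mh = sum(MONO[aa] for aa in peptide) + WATER + PROTON

print("b:", [f"{m:.2f}" for m in b_ions])
print("y:", [f"{m:.2f}" for m in y_ions])
print("[M+H]+:", f"{precursor_mh:.2f}")
```

The value 1166 appears at both ends of the table because it is the intact [M+H]+ ion, which this sketch reports separately as `precursor_mh` rather than as a tenth fragment.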

slide-14
SLIDE 14

Tandem MS – Sequence Confirmation

[Figure: MS/MS spectrum (m/z vs. % relative abundance) with the [M+2H]2+ peak and the fragment peaks labeled. y ions: 260, 389, 504, 633, 762, 875, 1022, 1080. b ions: 292, 405, 534, 663, 778, 907, 1020. The b/y ion table for K L E D E E L F G S is shown alongside.]

slide-15
SLIDE 15


Tandem MS – Sequence Confirmation

slide-16
SLIDE 16

Tandem MS – Sequence Confirmation

[Figure: the labeled MS/MS spectrum and b/y ion table for K L E D E E L F G S, with a mass difference of 113 marked on the spectrum and on the sequence]

slide-17
SLIDE 17

Tandem MS – Sequence Confirmation

[Figure: the labeled MS/MS spectrum and b/y ion table for K L E D E E L F G S, with a mass difference of 129 marked on the spectrum and on the sequence]

slide-18
SLIDE 18


Tandem MS – Sequence Confirmation

slide-19
SLIDE 19


Tandem MS – Sequence Confirmation

slide-20
SLIDE 20


Tandem MS – Sequence Confirmation

slide-21
SLIDE 21

Tandem MS – de novo Sequencing

[Figure: MS/MS spectrum (m/z vs. % relative abundance) with the [M+2H]2+ peak and fragment peaks labeled 260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, 1080]

Mass differences between peaks are compared with the amino acid masses to find the sequences consistent with the spectrum.

Amino acid masses (Da)

1-letter  3-letter  Chemical formula  Monoisotopic  Average
A         Ala       C3H5ON             71.0371       71.0788
R         Arg       C6H12ON4          156.101       156.188
N         Asn       C4H6O2N2          114.043       114.104
D         Asp       C4H5O3N           115.027       115.089
C         Cys       C3H5ONS           103.009       103.139
E         Glu       C5H7O3N           129.043       129.116
Q         Gln       C5H8O2N2          128.059       128.131
G         Gly       C2H3ON             57.0215       57.0519
H         His       C6H7ON3           137.059       137.141
I         Ile       C6H11ON           113.084       113.159
L         Leu       C6H11ON           113.084       113.159
K         Lys       C6H12ON2          128.095       128.174
M         Met       C5H9ONS           131.04        131.193
F         Phe       C9H9ON            147.068       147.177
P         Pro       C5H7ON             97.0528       97.1167
S         Ser       C3H5O2N            87.032        87.0782
T         Thr       C4H7O2N           101.048       101.105
W         Trp       C11H10ON2         186.079       186.213
Y         Tyr       C9H9O2N           163.063       163.176
V         Val       C5H9ON             99.0684       99.1326

slide-22
SLIDE 22

Tandem MS – de novo Sequencing

Mass differences between peaks (column peak minus row peak):

       292  389  405  504  534  633  663  762  778  875  907 1020 1022 1079
 260    32  129  145  244  274  373  403  502  518  615  647  760  762  819
 292         97  113  212  242  341  371  470  486  583  615  728  730  787
 389              16  115  145  244  274  373  389  486  518  631  633  690
 405                   99  129  228  258  357  373  470  502  615  617  674
 504                        30  129  159  258  274  371  403  516  518  575
 534                             99  129  228  244  341  373  486  488  545
 633                                  30  129  145  242  274  387  389  446
 663                                       99  115  212  244  357  359  416
 762                                            16  113  145  258  260  317
 778                                                 97  129  242  244  301
 875                                                      32  145  147  204
 907                                                          113  115  172
1020                                                                 2   59
1022                                                                     57

slide-23
SLIDE 23

Tandem MS – de novo Sequencing

[Table: the matrix of mass differences between the peaks 260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, 1079, with the differences that correspond to an amino acid mass highlighted]

Highlighted differences, by starting peak:

  • 260: 129
  • 292: 97, 113
  • 389: 115
  • 405: 99, 129
  • 504: 129
  • 534: 99, 129
  • 633: 129
  • 663: 99, 115
  • 762: 113
  • 778: 97, 129
  • 875: 147
  • 907: 113, 115
  • 1022: 57

slide-24
SLIDE 24

Tandem MS – de novo Sequencing

[Table: the matrix of mass differences, with each highlighted difference replaced by the matching amino acid]

Amino acid assignments, by starting peak:

  • 260: E
  • 292: P, I/L
  • 389: D
  • 405: V, E
  • 504: E
  • 534: V, E
  • 633: E
  • 663: V, D
  • 762: I/L
  • 778: P, E
  • 875: F
  • 907: I/L, D
  • 1022: G

Six of the assignments are crossed out (X) on the slide.

Sequence tags:

  • …GF(I/L)EEDE(I/L)…
  • …(I/L)EDEE(I/L)FG…

Peptide M+H = 1166

  • 1166 - 1079 = 87 => S, giving SGF(I/L)EEDE(I/L)…
  • 1166 - 1020 - 18 = 128 => K or Q, giving SGF(I/L)EEDE(I/L)(K/Q)
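
The mass-difference step can be sketched as follows. This is an illustration rather than a full de novo algorithm: it uses nominal (integer) residue masses because the slide's peak labels are integers, and the two ladders passed to `read_series` are chosen by hand from the peak list; it does not decide on its own which assignments to reject. Function names are illustrative.

```python
from itertools import combinations

# Nominal residue masses; I/L and K/Q cannot be told apart at this accuracy.
RESIDUES = {57: "G", 71: "A", 87: "S", 97: "P", 99: "V", 101: "T", 103: "C",
            113: "(I/L)", 114: "N", 115: "D", 128: "(K/Q)", 129: "E", 131: "M",
            137: "H", 147: "F", 156: "R", 163: "Y", 186: "W"}

PEAKS = [260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, 1079]

def residue_steps(peaks):
    """All peak pairs (low, high, residue) whose difference is a residue mass."""
    return [(lo, hi, RESIDUES[hi - lo])
            for lo, hi in combinations(sorted(peaks), 2) if hi - lo in RESIDUES]

def read_series(series):
    """Spell out the residues along one ladder of consecutive peaks."""
    return "".join(RESIDUES[hi - lo] for lo, hi in zip(series, series[1:]))

steps = residue_steps(PEAKS)
ladder_a = read_series([260, 389, 504, 633, 762, 875, 1022, 1079])
ladder_b = read_series([292, 405, 534, 663, 778, 907, 1020])

# Terminal residues from the precursor mass, as on the slide (M+H = 1166).
n_terminal = RESIDUES[1166 - 1079]
c_terminal = RESIDUES[1166 - 1020 - 18]

print(len(steps), ladder_a, ladder_b, n_terminal, c_terminal)
```

The two ladders read the same stretch of sequence in opposite directions, which is why one tag on the slide is the reverse of the other.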

slide-25
SLIDE 25

Tandem MS – de novo Sequencing

Challenges in de novo sequencing

  • Neutral loss (-H2O, -NH3)
  • Modifications
  • Background peaks
  • Incomplete information

slide-26
SLIDE 26

Tandem MS – Database Search

  • Experiment: Lysis → Fractionation → Digestion → LC-MS → MS/MS
  • Search: Sequence DB → Pick Protein → Pick Peptide → All Fragment Masses → Compare, Score, Test Significance
  • Repeat for all peptides
  • Repeat for all proteins
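
A minimal sketch of the nested search loop, under stated assumptions: tryptic digestion without missed cleavages, singly charged b and y ions only, and a score that simply counts explained peaks. The toy database (one protein containing SGFLEEDELK and one containing a rearranged peptide of the same mass), the tolerances, and the function names are hypothetical; the 0.6 Da fragment tolerance is chosen only because the peak labels used here are rounded to integers.

```python
import re  # zero-width re.split needs Python 3.7+

MONO = {
    "G": 57.0215, "A": 71.0371, "S": 87.0320, "P": 97.0528, "V": 99.0684,
    "T": 101.0477, "C": 103.0092, "L": 113.0841, "I": 113.0841, "N": 114.0429,
    "D": 115.0269, "Q": 128.0586, "K": 128.0950, "E": 129.0426, "M": 131.0405,
    "H": 137.0589, "F": 147.0684, "R": 156.1011, "Y": 163.0633, "W": 186.0793,
}
WATER, PROTON = 18.0106, 1.0073

def digest(protein):
    """Cleave after K or R unless the next residue is P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]

def mh(peptide):
    """Singly protonated peptide mass."""
    return sum(MONO[aa] for aa in peptide) + WATER + PROTON

def fragments(peptide):
    """Singly charged b and y ion m/z values."""
    m = [MONO[aa] for aa in peptide]
    b = [sum(m[:i]) + PROTON for i in range(1, len(m))]
    y = [sum(m[-i:]) + WATER + PROTON for i in range(1, len(m))]
    return b + y

def search(precursor_mh, peaks, database, precursor_tol=1.0, fragment_tol=0.6):
    hits = []
    for name, protein in database.items():       # repeat for all proteins
        for peptide in digest(protein):          # repeat for all peptides
            if abs(mh(peptide) - precursor_mh) > precursor_tol:
                continue                         # precursor mass does not fit
            theo = fragments(peptide)
            score = sum(any(abs(p - t) <= fragment_tol for t in theo) for p in peaks)
            hits.append((score, peptide, name))
    return sorted(hits, reverse=True)            # best score first

# Hypothetical database; peak labels taken from the spectrum shown earlier.
TOY_DB = {"target": "MKSGFLEEDELKAAR", "rearranged": "AAKSGFLEDEELKGGR"}
PEAKS = [260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, 1079]
hits = search(1166, PEAKS, TOY_DB)
print(hits)
```

The rearranged peptide passes the precursor filter but explains fewer peaks, which illustrates why the score must then be tested for significance rather than taken at face value.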

slide-27
SLIDE 27

Search Results

slide-28
SLIDE 28

Significance Testing

  • False protein identification is caused by random matching.
  • An objective criterion for testing the significance of protein identification results is necessary.
  • The significance of protein identifications can be tested once the distribution of scores for false results is known.

slide-29
SLIDE 29

Significance Testing - Expectation Values

The majority of sequences in a collection will give a score due to random matching.

slide-30
SLIDE 30

Significance Testing - Expectation Values

  • Database Search (input: M/Z) → List of Candidates
  • Distribution of Scores for Random and False Identifications → Extrapolate and Calculate Expectation Values
  • Result: List of Candidates With Expectation Values
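
One way to carry out the "extrapolate" step is to fit a straight line to the logarithm of the survival counts of the candidate scores and evaluate it at the best score. The sketch below assumes that the score distribution of random matches has an exponential tail; the slide does not specify the fitting procedure, so the method, the function name, and the synthetic scores are assumptions for illustration.

```python
import math

def expectation_value(scores, best_score):
    """Extrapolate the survival curve of candidate scores to best_score.

    Fits log(number of candidates scoring >= s) = intercept + slope * s by
    least squares and evaluates the fitted line at best_score.
    """
    xs = sorted(set(scores))
    ys = [math.log(sum(1 for v in scores if v >= s)) for s in xs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept + slope * best_score)

# Synthetic scores whose survival counts halve with every score unit
# (1024 candidates score >= 0, 512 score >= 1, ..., 1 scores >= 10).
random_scores = [s for s in range(10) for _ in range(2 ** (9 - s))] + [10]
e_value = expectation_value(random_scores, best_score=20)
print(e_value)
```

The result is the number of random candidates expected to score at least as well as the best match; values far below 1 indicate that the match is unlikely to be due to random matching.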

slide-31
SLIDE 31

Rho-diagrams: Overall Quality of a Data Set

Definition: E_i (i = 0, -1, -2, …) is the number of spectra that have been assigned an expectation value between exp(i) and exp(i-1).

Expectation values as a function of score for random matching:

\[ e(s) = \beta \exp(-\alpha s) \]

For random matching:

\[ E_i = \int_{\exp(i-1)}^{\exp(i)} N \, de = N \{ \exp(i) - \exp(i-1) \} \]

\[ \rho_i = \log(E_i) - \log(E_0) = \log\left( \frac{N \exp(i) \{ 1 - \exp(-1) \}}{N \{ 1 - \exp(-1) \}} \right) = i \]
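
The definition of E_i translates directly into a binning step. The sketch below assumes that the quantity plotted in a rho-diagram is ρ_i = log(E_i / E_0) with natural logarithms, and checks that uniformly distributed expectation values, the random-matching case, give ρ_i ≈ i. The function name and the synthetic input are illustrative.

```python
import math
from collections import Counter

def rho_values(expectation_values):
    """Return {i: rho_i} with rho_i = ln(E_i / E_0) for i = 0, -1, -2, ...

    E_i counts the expectation values e with exp(i-1) < e <= exp(i).
    Values above 1 are ignored.
    """
    bins = Counter(math.ceil(math.log(e)) for e in expectation_values if 0 < e <= 1)
    e0 = bins[0]
    if e0 == 0:
        raise ValueError("no expectation values between exp(-1) and 1")
    return {i: math.log(count / e0) for i, count in sorted(bins.items(), reverse=True)}

# Random matching: expectation values spread evenly over (0, 1).
n_spectra = 100_000
uniform = [(k + 0.5) / n_spectra for k in range(n_spectra)]
rho = rho_values(uniform)
print({i: round(r, 3) for i, r in rho.items() if i >= -4})
```

A data set containing real identifications would have more spectra at low expectation values than random matching predicts, so its ρ_i would lie above the line ρ_i = i.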

slide-32
SLIDE 32

Rho-diagram: Random Matching

[Figure: rho-diagram for random matching; horizontal axis log(e)]

slide-33
SLIDE 33

Rho-diagram: Data Quality

[Figure: rho-diagrams illustrating data quality; horizontal axis log(e)]

slide-34
SLIDE 34

Rho-diagram

Parameters

slide-35
SLIDE 35

How many fragments are sufficient?

  • To identify an unmodified peptide?
  • To identify a modified peptide?
  • To localize a modification on a peptide?

slide-36
SLIDE 36

How many fragments are sufficient?

How does it depend on different parameters?

  • Precursor mass
  • Precursor mass error
  • Fragment mass error
  • Background peaks
slide-37
SLIDE 37

Simulations using synthetic spectra

  • Select a peptide sequence (from the Seq. DB)
  • Calculate possible fragment ion masses
  • Choose number of fragment ions to select
  • Randomly select fragment ions
  • Search and store result
  • Average over peptides
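
The first four steps, and the averaging loop around them, can be sketched as below. The sketch assumes singly charged b and y ions and standard monoisotopic masses. The `search` argument is a stand-in for a real search engine that returns the identified sequence for a list of fragment masses; it and the other function names are hypothetical.

```python
import random

MONO = {
    "G": 57.0215, "A": 71.0371, "S": 87.0320, "P": 97.0528, "V": 99.0684,
    "T": 101.0477, "C": 103.0092, "L": 113.0841, "I": 113.0841, "N": 114.0429,
    "D": 115.0269, "Q": 128.0586, "K": 128.0950, "E": 129.0426, "M": 131.0405,
    "H": 137.0589, "F": 147.0684, "R": 156.1011, "Y": 163.0633, "W": 186.0793,
}
WATER, PROTON = 18.0106, 1.0073

def fragment_masses(peptide):
    """All singly charged b and y ion m/z values of the peptide."""
    m = [MONO[aa] for aa in peptide]
    b = [sum(m[:i]) + PROTON for i in range(1, len(m))]
    y = [sum(m[-i:]) + WATER + PROTON for i in range(1, len(m))]
    return b + y

def synthetic_spectrum(peptide, n_fragments, rng):
    """Randomly select n_fragments of the possible fragment ions."""
    return sorted(rng.sample(fragment_masses(peptide), n_fragments))

def identification_rate(peptides, n_fragments, search, trials=20, seed=0):
    """Fraction of synthetic spectra for which search() returns the true peptide."""
    rng = random.Random(seed)
    correct = 0
    for peptide in peptides:
        for _ in range(trials):
            spectrum = synthetic_spectrum(peptide, n_fragments, rng)
            correct += search(spectrum) == peptide
    return correct / (len(peptides) * trials)

example = "LSDPGVSPAVLSLEMLTDR"
all_fragments = fragment_masses(example)
spectrum = synthetic_spectrum(example, 8, random.Random(0))
print([round(m, 2) for m in spectrum])
```

Calling `identification_rate` once per number of fragment ions gives one curve of identification probability against fragment count.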

slide-38
SLIDE 38

Simulations using synthetic spectra

Peptide: LSDPGVSPAVLSLEMLTDR

b ions (m/z): 114.09 201.12 316.15 413.20 470.22 569.29 656.32 753.37 824.41 923.48 1036.56 1123.59 1236.68 1365.72 1496.76 1609.84 1710.89 1825.92

y ions (m/z): 175.12 290.15 391.19 504.28 635.32 764.36 877.44 964.48 1077.56 1176.63 1247.67 1344.72 1431.75 1530.82 1587.84 1684.89 1799.92 1886.95

slide-39
SLIDE 39

Simulations using synthetic spectra

Peptide: LSDPGVSPAVLSLEMLTDR

[Figure: the 36 calculated fragment ion masses (114.09 to 1886.95); the number of fragment ions to select is picked from 6, 8, 9, 7, 5, and 8 is chosen]

slide-40
SLIDE 40

Simulations using synthetic spectra

[Figure: 8 fragment ions are randomly selected from the 36 calculated fragment ion masses (114.09 to 1886.95)]

Selected fragment ions: 201.12 504.28 964.48 1123.59 1247.67 1496.76 1530.82 1710.89

slide-41
SLIDE 41

Simulations using synthetic spectra

Selected fragment ions: 201.12 504.28 964.48 1123.59 1247.67 1496.76 1530.82 1710.89

Selected fragment ions + Seq. DB → Search engine → Identification: LSDPGVSPAVLSLEMLTDR

  • Is it significant?
  • Is the identified sequence identical to the one used to generate the synthetic data?

slide-42
SLIDE 42

Simulations using synthetic spectra

[Figure: one simulation round. From the 36 calculated fragment ion masses, 8 are randomly selected (201.12 504.28 964.48 1123.59 1247.67 1496.76 1530.82 1710.89) and submitted with the Seq. DB to the search engine, which returns an identification.]

slide-43
SLIDE 43

Simulations using synthetic spectra

[Figure: the simulation round is repeated with a different number of fragment ions (9 is highlighted among 6, 8, 9, 7, 5): calculated fragment ion masses → randomly selected fragment ions → Search engine (with Seq. DB) → Identification]

slide-44
SLIDE 44

Simulations using synthetic spectra

[Figure: complete simulation loop for the peptide LSDPGVSPAVLSLEMLTDR, taken from a protein sequence. From the 36 calculated fragment ion masses, 8 are randomly selected (201.12 504.28 964.48 1123.59 1247.67 1496.76 1530.82 1710.89) and submitted with the Seq. DB to the search engine, which returns the identification LSDPGVSPAVLSLEMLTDR.]

  • Is it significant?
  • Is the identified sequence identical to the one used to generate the synthetic data?

slide-45
SLIDE 45

Simulations using synthetic spectra

  • Each point is an average of searches with 20 randomly generated synthetic fragment mass spectra.
  • Each point is an average of 50 peptides.

[Figure: search results averaged over peptides, with a threshold marked]

slide-46
SLIDE 46

Critical number of fragment masses

slide-47
SLIDE 47

Small peptides are slightly more difficult to identify

[Figure: probability of identification vs. number of fragment ions (5 to 20); one curve per m_precursor: 1000 Da, 1500 Da, 2000 Da, 2500 Da]

Parameters: Δm_precursor = 1 Da, Δm_fragment = 0.5 Da, no modification

slide-48
SLIDE 48

A lower precursor mass error requires fewer fragment masses for identification of unmodified peptides

[Figure: probability of identification vs. number of fragment ions (5 to 20); one curve per Δm_precursor: 0.01 Da, 1 Da, 10 Da]

Parameters: m_precursor = 2000 Da, Δm_fragment = 0.5 Da, no modification

slide-49
SLIDE 49

The dependence on the fragment mass error is weak below a threshold for identification of unmodified peptides

[Figure: probability of identification vs. number of fragment ions (5 to 20); one curve per Δm_fragment: 0.01 Da, 0.5 Da, 1 Da, 2 Da]

Parameters: m_precursor = 2000 Da, Δm_precursor = 1 Da, no modification

slide-50
SLIDE 50

A moderate number of background peaks can be tolerated when identifying unmodified peptides

[Figure: probability of identification vs. number of fragment ions (5 to 20); one curve per background level: 0%, 50%, 80%]

Parameters: m_precursor = 2000 Da, Δm_precursor = 1 Da, Δm_fragment = 0.5 Da, no modification

slide-51
SLIDE 51

A large number of background peaks can be tolerated if the fragment mass is accurate

[Figure: probability of identification vs. number of fragment ions (5 to 20); one curve per background level: 0%, 50%, 80%]

Parameters: m_precursor = 2000 Da, Δm_precursor = 1 Da, Δm_fragment = 0.01 Da, no modification

slide-52
SLIDE 52

Identification of phosphopeptides is only slightly more difficult

[Figure: probability of identification vs. number of fragment ions (5 to 20); curves: Phosphorylated, Unmodified]

Parameters: m_precursor = 2000 Da, Δm_precursor = 1 Da, Δm_fragment = 0.5 Da

slide-53
SLIDE 53

Proteomics Informatics – Protein identification I: searching protein sequence collections and significance testing (Week 4)