How We Handle Mass Spectra
NIST Mass Spectrometry Data Center
NIST/EPA/NIH Mass Spectral Library
[Figure: Numbers of Spectra (0 to 200,000) by release year ('78, '80, '83, '86, '88, '90, '93, '98, '02); series: Replicates, Compounds; eras: Red Books, EPA, NIST]
Libraries Distributed/Year
[Figure: libraries distributed per year, '88 to '04; vertical axis 500 to 4500]
The Data
[Figure: 2D chemical structure with numbered atoms 1 to 20; labeled atoms include Cl, N, NH, NH2, O, and H]
Connection Table
[Figure: connection table for the structure, with atom numbers and bond entries marked S and D]
From Structure to Spectrum: A Mass “Fragmentogram”
[Figure: electron ionization of the molecule (mass = 140 u), M + e- -> M+ + 2e-, followed by fragmentation into ions of mass = 125 u and mass = 99 u plus neutral hydrocarbon fragments]
Molecular Fingerprints
[Figure: mass spectra labeled VX, HD, and GB]
I will discuss
- Library Searching
– Full and Partial Spectra
- Spectrum Purification
- Chemical Structure Representation
- Peptide Spectra Libraries
Instrument Effects
[Figure: Instrument ‘Noise Signature’. 250 hexachlorobenzene spectra, same instrument, calibration mix; bars show quartiles]
Library Search
[Figure: unknown spectrum compared with library spectra; labels: MF=93, MF=68, sarin]
Spectral Similarity
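- M = f(Abundance) of a peak in the measured spectrum
- R = f(Abundance) of the corresponding peak in the reference spectrum
- Sum over all peaks
- Choices for f(Abundance):
– Abundance
– Abundance * m/z
– Certainty
- Similarity is computed from the sums of M, R, and the product MR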
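The quantities above can be combined in more than one way. A minimal sketch of one common choice, a normalized dot product (cosine) of M and R, is given here. The function names, the 0 to 100 scaling, and the example abundances are assumptions made for illustration; this is not presented as the NIST implementation.

```python
from math import sqrt

def weight(mz, abundance, mz_power=0.0):
    """f(Abundance): plain abundance by default; mz_power=1 gives abundance * m/z."""
    return abundance * (mz ** mz_power)

def match_factor(measured, reference, mz_power=0.0):
    """Normalized dot product of two spectra, scaled to 0-100 (assumed scaling).

    measured, reference: dict mapping integer m/z -> abundance.
    """
    all_mz = set(measured) | set(reference)
    m = {mz: weight(mz, measured.get(mz, 0.0), mz_power) for mz in all_mz}
    r = {mz: weight(mz, reference.get(mz, 0.0), mz_power) for mz in all_mz}
    sum_mr = sum(m[mz] * r[mz] for mz in all_mz)   # sum of M*R over all peaks
    sum_mm = sum(v * v for v in m.values())        # sum of M*M
    sum_rr = sum(v * v for v in r.values())        # sum of R*R
    if sum_mm == 0 or sum_rr == 0:
        return 0.0
    return 100.0 * sum_mr / sqrt(sum_mm * sum_rr)

# Illustrative abundances only, not library data.
unknown = {99: 100.0, 125: 30.0, 81: 10.0}
candidate = {99: 100.0, 125: 25.0, 81: 12.0}
unrelated = {43: 100.0, 57: 80.0}
print(match_factor(unknown, candidate), match_factor(unknown, unrelated))
```

Setting `mz_power=1.0` switches f(Abundance) to the "Abundance * m/z" option, which gives more weight to high-mass peaks.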
Algorithm Performance
12,592 Replicate Spectra against NIST Library
Percent Correct

Model                     Top Hit   Top 2 Hits   Top 3 Hits
Correlation – Weighted    74.9      86.9         91.7
Correlation               72.9      85.9         90.8
Euclidean Distance        71.9      83.9         88.9
Absolute Distance         67.9      80.3         85.5
PBM - Published           64.7      78.4         84.8
Hites/Hertz/Biemann       64.4      77.2         83.2
FP/FN Above Given Match Factor for NIST Library Spectra
[Figure: Fraction Recovered (0.0 to 1.0) vs Match Factor Threshold (20 to 100); curves: False Positives (108,000 compounds), False Negatives (21,000 replicate spectra)]
FP/FN Expanded View
[Figure: Fraction Recovered (0.0 to 0.8) vs Match Factor (80 to 100); curves: m/z weighting, no weighting; series: FN, FP x 10,000]
FP Depends on Spectrum Uniqueness
[Figure: FP vs Match Factor for decane, decalin, TMB, HCB, malathion, DMPB, and sarin]
HCB = hexachlorobenzene; DMPB = dimethylphenobarbital; TMB = 1,2,3-trimethylbenzene
Multiple Ion Monitoring
- What is it?
– Use 2-5 Major Peaks in Spectrum of Target
- 10 to 100 times more sensitive
- What’s the problem?
– Can match major Target peaks with Minor Sample Peaks
- What we have done:
– Examine risk using library as source of potential false positive IDs
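The risk estimate described above can be sketched as a count over a library: pick the target's N largest peaks and ask what fraction of library spectra show a peak at every one of those m/z values. The function names, the `min_abundance` cutoff, and the toy spectra are hypothetical and serve only to illustrate the idea.

```python
def top_peaks(spectrum, n):
    """Return the m/z values of the n most abundant peaks in a spectrum."""
    ranked = sorted(spectrum.items(), key=lambda kv: kv[1], reverse=True)
    return [mz for mz, _ in ranked[:n]]

def false_positive_fraction(target, library, n_peaks, min_abundance=1.0):
    """Fraction of library spectra with a peak at every monitored m/z.

    target: dict m/z -> abundance; library: list of such dicts.
    min_abundance is an assumed cutoff for counting a library peak as present.
    """
    monitored = top_peaks(target, n_peaks)
    hits = sum(
        1 for spec in library
        if all(spec.get(mz, 0.0) >= min_abundance for mz in monitored)
    )
    return hits / len(library) if library else 0.0

# Toy data for illustration only.
target = {99: 100.0, 125: 30.0, 81: 10.0}
library = [
    {99: 50.0, 43: 100.0},
    {99: 5.0, 125: 2.0, 57: 100.0},
    {41: 100.0},
]
for n in (1, 2, 3):
    print(n, false_positive_fraction(target, library, n))
```

In this toy case the fraction falls as more peaks are monitored, which is the trade-off between sensitivity and selectivity that the slide raises.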
False Positive Risk vs Number of Peaks Used
[Figure 1. Median FPP vs. NP. x-axis: Number of Peaks (1 to 5); y-axis: FP/spectrum (median), log scale from 0.00001 to 1; one curve per Abundance Ratio (Biggest Search Peak / Matching Peak in FP): 1, 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128; label: BMA]
Mass Spectral Peak Occurrences are Correlated
[Figure: Joint Occurrence Prob. (Relative Probabilities, 10 to 100) vs Difference in Peak Position (m/z, 10 to 100); curves: Big Peaks, Medium Peaks, Small Peaks]
FP Observed and Computed
(from individual peak probabilities)
[Figure: Observed FPP (log scale, 0.1 to 10000) vs FP Percentile/10 (1 to 10); series: Actual, No Peak Correlation]
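To make the comparison concrete, the sketch below computes a false-positive probability two ways on a toy library: from individual peak probabilities multiplied together (which assumes peaks occur independently) and from direct counting. The function names and the four-spectrum library are hypothetical.

```python
from math import prod

def peak_probability(library, mz):
    """Fraction of library spectra that contain a peak at mz."""
    return sum(1 for spec in library if mz in spec) / len(library)

def independent_estimate(library, mzs):
    """Joint probability computed as if peak occurrences were independent."""
    return prod(peak_probability(library, mz) for mz in mzs)

def observed_joint_probability(library, mzs):
    """Fraction of library spectra that actually contain every peak in mzs."""
    return sum(1 for spec in library if all(mz in spec for mz in mzs)) / len(library)

# Toy library: each spectrum is the set of m/z values where it has a peak.
library = [{43, 57}, {43, 57}, {41}, {77}]
print(independent_estimate(library, [43, 57]))        # independence assumed
print(observed_joint_probability(library, [43, 57]))  # counted directly
```

Because m/z 43 and 57 always occur together in this toy library, the counted probability is larger than the product of the individual probabilities.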
Search Results Depend on Search Spectrum Quality
AMDIS: http://chemdata.nist.gov
Real Data
[Figure panels: total ion chromatogram; a mass spectrum (scan); chromatogram with single ion]
AMDIS Analysis of Data
AMDIS Match = 81
[Figure: structure of the matched compound, with P, O, and F atoms labeled]
Order of Analysis
- Noise Analysis – find ‘Noise Factor’
- Find and quantify maximizing ions
- Combine to create ‘Model Peak’
- Use Model Peak shape (intensity vs time) to purify spectra
- Find best matching library spectrum
Derive Noise Factor
$$\text{Noise} = K_{\text{noise}} \sqrt{\text{Intensity}}$$
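A minimal sketch of how a noise factor might be estimated, assuming noise grows with the square root of intensity. The segment-based procedure, the use of medians, and all names here are assumptions for illustration rather than the AMDIS algorithm itself.

```python
from math import sqrt
from statistics import mean, median

def noise_factor(segments):
    """Estimate K_noise from segments of roughly constant signal.

    For each segment, take the median absolute deviation from the segment
    mean and divide by sqrt(mean); return the median of those ratios.
    """
    ratios = []
    for segment in segments:
        average = mean(segment)
        if average <= 0:
            continue  # skip empty-signal segments
        deviation = median(abs(x - average) for x in segment)
        ratios.append(deviation / sqrt(average))
    return median(ratios) if ratios else 0.0

def expected_noise(intensity, k_noise):
    """Noise level predicted for a given intensity."""
    return k_noise * sqrt(intensity)

# Hypothetical flat stretch of an ion chromatogram.
k = noise_factor([[98, 102, 100, 96, 104]])
print(k, expected_noise(400, k))
```

A single factor like this lets one threshold adapt across weak and strong signals instead of using a fixed absolute noise level.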
Finding Possible Peaks for Each m/z
[Figure: ion chromatogram for one m/z vs Scan number n; label: Maximum rate]
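Peak finding for a single m/z can be sketched as a search for local maxima in that ion's chromatogram. The threshold parameter and the simple three-point test are assumptions; a real implementation would tie the threshold to the noise factor.

```python
def find_maxima(intensities, threshold=0.0):
    """Return scan indices where an ion chromatogram has a local maximum.

    A scan n counts as a maximum when its intensity exceeds the threshold,
    is greater than scan n-1, and is at least as large as scan n+1.
    """
    maxima = []
    for n in range(1, len(intensities) - 1):
        if (intensities[n] > threshold
                and intensities[n] > intensities[n - 1]
                and intensities[n] >= intensities[n + 1]):
            maxima.append(n)
    return maxima

# Hypothetical chromatogram for one m/z.
chromatogram = [0, 2, 5, 9, 4, 1, 0, 3, 7, 3, 0]
print(find_maxima(chromatogram))
print(find_maxima(chromatogram, threshold=8))
```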
Find Possible Compounds: Do Ions Maximize at Same Time?
[Figure: ion maxima placed on a fractional scan axis (.0 to .9) between scans 1 and 2]
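One way to ask whether ions maximize at the same time is to sort their apex positions and start a new group whenever the gap to the previous apex exceeds a tolerance. The tolerance value, the function name, and the apex positions are hypothetical.

```python
def group_by_apex(apexes, tolerance=0.3):
    """Group ions whose maxima lie close together in time.

    apexes: dict mapping m/z -> apex position (fractional scan number).
    Returns a list of groups, each a list of m/z values, ordered by apex.
    """
    groups = []
    last_position = None
    for mz, position in sorted(apexes.items(), key=lambda kv: kv[1]):
        if last_position is not None and position - last_position <= tolerance:
            groups[-1].append(mz)   # close to the previous apex: same component
        else:
            groups.append([mz])     # gap too large: start a new component
        last_position = position
    return groups

# Hypothetical apex positions in fractional scans.
apexes = {99: 10.3, 125: 10.2, 81: 10.3, 57: 10.9, 43: 11.0}
print(group_by_apex(apexes))
```

Each resulting group is a candidate component whose ions can then be examined together.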
Separate the Components
[Figure: ion maxima binned by fractional scan position (.0 to .9) between scans 1 and 2; candidate components marked yes or NO]
A ‘Model Peak’ Provides Shape
The model shape is defined as the sum of all of the ion chromatograms that maximize within the range and have a sharpness value within 75% of the maximum.
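The definition above can be sketched in a few lines. The sharpness measure used here (apex height minus the mean of its two neighbors, relative to apex height), the reading of "within 75% of the maximum" as "at least 0.75 times the largest sharpness", and the least-squares purification step are all assumptions for illustration.

```python
def sharpness(chrom, apex):
    """Hypothetical sharpness: relative drop from the apex to its neighbors."""
    neighbors = (chrom[apex - 1] + chrom[apex + 1]) / 2
    return (chrom[apex] - neighbors) / chrom[apex]

def model_peak(chromatograms, window, fraction=0.75):
    """Sum the ion chromatograms that maximize inside `window` (inclusive scan
    range) and whose sharpness is at least `fraction` of the largest sharpness."""
    low, high = window
    candidates = {}
    for mz, chrom in chromatograms.items():
        apex = max(range(len(chrom)), key=chrom.__getitem__)
        if low <= apex <= high and 0 < apex < len(chrom) - 1:
            candidates[mz] = sharpness(chrom, apex)
    if not candidates:
        return [], []
    cutoff = fraction * max(candidates.values())
    selected = [mz for mz, s in candidates.items() if s >= cutoff]
    length = len(chromatograms[selected[0]])
    shape = [sum(chromatograms[mz][i] for mz in selected) for i in range(length)]
    return shape, selected

def purified_abundance(chrom, shape):
    """Least-squares scale a that minimizes sum((chrom - a * shape)^2)."""
    denominator = sum(s * s for s in shape)
    if denominator == 0:
        return 0.0
    return sum(c * s for c, s in zip(chrom, shape)) / denominator

# Hypothetical ion chromatograms (intensity per scan).
chromatograms = {
    99: [0, 10, 100, 10, 0],
    125: [0, 3, 30, 3, 0],
    57: [20, 40, 50, 40, 20],   # broad background ion, low sharpness
}
shape, used = model_peak(chromatograms, window=(1, 3))
print(used, shape, purified_abundance(chromatograms[99], shape))
```

Fitting each ion chromatogram to the model shape keeps the part of its signal that rises and falls with the component and discounts background that does not.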
AMDIS Testing – Closely Eluting Components
Representing Chemical Identity
- Visual: 2D Structure
- Text: IUPAC Name
- Digital: No Accepted, Open Method
- Solution:
The IUPAC/NIST Chemical Identifier
Connection Table
[Figure: connection table for the structure, with atom numbers and bond entries marked S and D]
Chemical Identity Problems
[Figure: example structures with CH3 groups]
- Registry Number possible for each exact form, mixture, unknown, unspecified
- Experts required
- Expensive, ambiguous and error prone
Requirements
- Different compounds have different identifiers
– Keep all distinguishing structural information
[Figure: two different structures, labeled IChI-1 and IChI-2, receive different identifiers]
Requirements
- One compound has only one identifier
– Omit unnecessary information
[Figure: alternative drawings of the same nitro group (with and without charge separation) all give the same INChI]
3 Steps to INChI
- Chemistry
– ‘Normalize’ Input Structure
- Implement chemical rules
- Math
– ‘Canonicalize’ (label the atoms)
- Equivalent atoms get the same label
- Format
– ‘Serialize’ Labeled Structure
- Output as character string (‘name’)
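The canonicalize and serialize steps can be illustrated on a toy scale. The sketch below tries every atom ordering of a very small graph, keeps the lexicographically smallest description, and prints it as a string, so the same molecule gives the same string however its atoms were numbered on input. This brute-force search is a stand-in chosen for clarity and is not the INChI algorithm; normalization is not covered, and the function name and output format are hypothetical.

```python
from itertools import permutations

def canonical_string(elements, bonds):
    """Return an input-order-independent string for a small molecular graph.

    elements: list of element symbols, one per atom.
    bonds: list of (i, j) pairs of 0-based atom indices.
    Only practical for a handful of atoms, since it tries all orderings.
    """
    best = None
    for order in permutations(range(len(elements))):
        new_label = {old: new for new, old in enumerate(order)}
        atoms = tuple(elements[old] for old in order)
        edges = tuple(sorted(
            tuple(sorted((new_label[i], new_label[j]))) for i, j in bonds
        ))
        candidate = (atoms, edges)
        if best is None or candidate < best:
            best = candidate          # canonicalize: keep the smallest labeling
    atoms, edges = best
    # Serialize: element symbols, then 1-based bonded atom pairs.
    return "".join(atoms) + "/" + ",".join(f"{a + 1}-{b + 1}" for a, b in edges)

# Ethanol skeleton entered with two different atom orders, then dimethyl ether.
print(canonical_string(["C", "C", "O"], [(0, 1), (1, 2)]))
print(canonical_string(["O", "C", "C"], [(0, 1), (1, 2)]))
print(canonical_string(["C", "O", "C"], [(0, 1), (1, 2)]))
```

The first two calls describe the same connectivity and produce the same string, while the third has a different connectivity and produces a different one, matching the two requirements stated earlier.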
“Layers” for Chemical Substances: formula, connectivity, stereo, isotope
Nitrobenzene
[Figure: nitrobenzene structure with canonical numbering: CH 1, CH 2, CH 3, CH 4, CH 5, C 6, N+ 7, O 8, O 9]
Description Layers
  formula       C6H5NO2
  connectivity  8-7(9)6-4-2-1-3-5-6
  H-atoms       1-5H
  charges
MSG
[Figure: structure with canonical numbering: CH2 1, CH2 2, CH 3, C 4, C 5, NH2 6, OH 7, O 8, O 9, O 10; Na+ 1]
Description Layers
  formula       C5H8NO4.Na
  connectivity  6-3(5(9)10)1-2-4(7)8;
  H-atoms       1-2H2,3H,6H2(H-,7,8,9,10);
  stereo sp3    3-;
  charges       -1;+1
C5H9NO4.Na/c6-3(5(9)10)1-2-4(7)8;/h1-2H2,3H,6H2,(H,7,8)(H,9,10);/q;+1/p-1/t3-;/m1./s1
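The layered string can be taken apart mechanically: layers are separated by "/", the first is the formula, and each later layer starts with a one-letter prefix. The sketch below only splits the string and keys each layer by its prefix letter; the function name is hypothetical, and the meaning of each prefix is not interpreted here.

```python
def split_layers(identifier):
    """Split a layered identifier string into {prefix: content}.

    The first segment is stored under "formula"; every later segment is
    stored under its first character (for example "c" or "h").
    """
    parts = identifier.split("/")
    layers = {"formula": parts[0]}
    for part in parts[1:]:
        if part:                      # ignore empty segments
            layers[part[0]] = part[1:]
    return layers

identifier = ("C5H9NO4.Na/c6-3(5(9)10)1-2-4(7)8;"
              "/h1-2H2,3H,6H2,(H,7,8)(H,9,10);/q;+1/p-1/t3-;/m1./s1")
for prefix, content in split_layers(identifier).items():
    print(prefix, content)
```

Comparing the output with the table above, the "c" segment carries the same connectivity string as the connectivity row.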
INChI Test Version
[Screenshot: INChI test version; labels: Input/Result, Mobile H On/Off, Include Org-Metal Bonds]
Peptide Mass Spectra: Libraries for Organisms
- Proteins are linear sequences of amino acids
– characteristic of Genome (organism)
- Peptides are ‘digested’ fragments of proteins
- MS ‘sequences’ peptides to reveal source Protein
- Peptide fragmentation spectra are not quite predictable
- Peptide fragmentation spectra for a ‘genome’ can be contained in one Library.
Spectrum Prediction Programs
Peptide Spectra Reference Library
(multiple measurements each of 10,000 peptides)
HLQLAIR/2+
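Reading the label HLQLAIR/2+ as a peptide sequence followed by a charge state (an assumption about the notation), the precursor m/z of that library entry can be worked out from standard monoisotopic residue masses. The function name and the rounded mass constants are supplied here for illustration.

```python
# Monoisotopic residue masses (Da) for the amino acids in this peptide.
RESIDUE_MASS = {
    "H": 137.05891, "L": 113.08406, "Q": 128.05858,
    "A": 71.03711, "I": 113.08406, "R": 156.10111,
}
WATER = 18.01056    # added once per peptide for the free termini
PROTON = 1.00728    # mass of each charge-carrying proton

def precursor_mz(label):
    """Compute m/z for a label of the form SEQUENCE/CHARGE+ (assumed format)."""
    sequence, charge_text = label.split("/")
    charge = int(charge_text.rstrip("+"))
    neutral_mass = sum(RESIDUE_MASS[residue] for residue in sequence) + WATER
    return (neutral_mass + charge * PROTON) / charge

print(round(precursor_mz("HLQLAIR/2+"), 4))
```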
MS Mapped to the Genome (from Eric Deutsch, ISB, 6/2004)