How We Handle Mass Spectra NIST Mass Spectrometry Data Center - - PowerPoint PPT Presentation



slide-1
SLIDE 1

How We Handle Mass Spectra

NIST Mass Spectrometry Data Center

slide-2
SLIDE 2
slide-3
SLIDE 3

NIST/EPA/NIH Mass Spectral Library

[Figure: Numbers of Spectra (0 to 200,000) by release year ('78, '80, '83, '86, '88, '90, '93, '98, '02); series: Replicates, Compounds; sources labeled Red Books, EPA, NIST]

slide-4
SLIDE 4

Libraries Distributed/Year

[Figure: libraries distributed per year (axis 500 to 4,500), '88 to '04]

slide-5
SLIDE 5

The Data

[Figure: 2D chemical structure with atoms numbered 1 to 20, including Cl, N, NH, NH2, and O groups]

slide-6
SLIDE 6

Connection Table

[Figure: connection table for a four-atom fragment containing Cl; atoms numbered 1 to 4, bond entries marked S or D]
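A connection table can be held as a list of atoms plus a list of typed bonds. The sketch below is only an illustration: the class name `ConnectionTable`, the reading of S and D as single and double bonds, and the four-atom example (everything except the Cl atom) are assumptions, not values read from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class ConnectionTable:
    """Atoms keyed by 1-based index, plus bonds as (atom, atom, order)."""
    atoms: dict = field(default_factory=dict)
    bonds: list = field(default_factory=list)

    def add_atom(self, index, symbol):
        self.atoms[index] = symbol

    def add_bond(self, a, b, order):
        # Assumed bond codes: "S" = single, "D" = double.
        if a not in self.atoms or b not in self.atoms:
            raise KeyError("both atoms must exist before they are bonded")
        if order not in ("S", "D"):
            raise ValueError("order must be 'S' or 'D'")
        self.bonds.append((min(a, b), max(a, b), order))

    def neighbors(self, index):
        """Return the sorted indices of atoms bonded to `index`."""
        found = []
        for a, b, _ in self.bonds:
            if a == index:
                found.append(b)
            elif b == index:
                found.append(a)
        return sorted(found)


# Hypothetical four-atom fragment; hydrogens are left implicit.
ct = ConnectionTable()
for i, symbol in enumerate(["Cl", "C", "C", "C"], start=1):
    ct.add_atom(i, symbol)
ct.add_bond(1, 2, "S")
ct.add_bond(2, 3, "S")
ct.add_bond(3, 4, "D")
print(ct.neighbors(2))
```

Storing each bond once with the lower index first keeps lookups simple and avoids duplicate entries for the same pair.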

slide-7
SLIDE 7

From Structure to Spectrum: A Mass “Fragmentogram”

[Figure: electron ionization (+ e- in, 2e- out) of a molecule of mass = 140 u containing P, F, O, and CH3 groups gives a molecular ion, which fragments into ions of mass = 125 u and mass = 99 u plus small hydrocarbon fragments]

slide-8
SLIDE 8

Molecular Fingerprints

VX HD GB

slide-9
SLIDE 9

I will discuss

  • Library Searching

– Full and Partial Spectra

  • Spectrum Purification
  • Chemical Structure Representation
  • Peptide Spectra Libraries
slide-10
SLIDE 10

Instrument ‘Noise Signature’

[Figure: 250 hexachlorobenzene spectra from the same instrument (calibration mix); bars show quartiles]

slide-11
SLIDE 11

Instrument Effects

slide-12
SLIDE 12

Library Search

[Figure: unknown spectrum compared with library spectra; match factors MF=93 and MF=68; sarin]

slide-13
SLIDE 13

Spectral Similarity

  • M = f(Abundance) of a peak in the measured spectrum
  • R = f(Abundance) of a peak in the reference spectrum
  • Sum over all peaks
  • f(Abundance)

– Abundance
– Abundance * m/z
– Certainty

[Equation: similarity computed from the products M × R, summed over all peaks]
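One common way to turn the quantities above into a score is a normalized dot product. The sketch below assumes that form and a 0 to 100 scale; the slide text does not confirm either, and the function names, the `mz_power` parameter, and the two toy spectra are hypothetical.

```python
import math


def weight(peaks, mz_power=0.0):
    """Apply f(Abundance) = abundance * m/z**mz_power to each peak.

    mz_power=0 gives plain abundance; mz_power=1 gives abundance * m/z.
    """
    return {mz: ab * (mz ** mz_power) for mz, ab in peaks.items()}


def match_factor(measured, reference, mz_power=0.0):
    """Normalized dot product of two spectra, scaled to 0..100 (assumed form)."""
    m = weight(measured, mz_power)
    r = weight(reference, mz_power)
    # Only peaks present in both spectra contribute to the sum of M * R.
    dot = sum(m[mz] * r[mz] for mz in m.keys() & r.keys())
    norm = math.sqrt(sum(v * v for v in m.values()) *
                     sum(v * v for v in r.values()))
    return 0.0 if norm == 0 else 100.0 * dot / norm


# Hypothetical spectra as {m/z: abundance}.
unknown = {81: 10.0, 99: 100.0, 125: 35.0}
reference = {81: 12.0, 99: 100.0, 125: 30.0}
print(round(match_factor(unknown, reference), 1))
print(round(match_factor(unknown, reference, mz_power=1.0), 1))
```

Weighting by m/z gives more influence to high-mass peaks, which tend to be more specific to a compound than low-mass fragments.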

slide-14
SLIDE 14

Algorithm Performance

12,592 Replicate Spectra against NIST Library

Percent Correct

Model                    Top Hit   Top 2 Hits   Top 3 Hits
Correlation – Weighted   74.9      86.9         91.7
Correlation              72.9      85.9         90.8
Euclidean Distance       71.9      83.9         88.9
Absolute Distance        67.9      80.3         85.5
PBM - Published          64.7      78.4         84.8
Hites/Hertz/Biemann      64.4      77.2         83.2

slide-15
SLIDE 15

FP/FN Above Given Match Factor for NIST Library Spectra

[Figure: Fraction Recovered (0.0 to 1.0) vs. Match Factor Threshold (20 to 100); curves: False Positives (108,000 compounds), False Negatives (21,000 replicate spectra)]
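Curves of this kind can be computed by sweeping a threshold over two sets of match factors: scores of correct matches (replicates of the same compound) and scores of incorrect matches (other compounds). A minimal sketch, assuming that definition; the function names and all score values are hypothetical.

```python
def fraction_at_or_above(scores, threshold):
    """Fraction of scores that pass the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


def fp_fn_curve(correct_scores, incorrect_scores, thresholds):
    """Return (threshold, FP fraction, FN fraction) for each threshold."""
    rows = []
    for t in thresholds:
        fp = fraction_at_or_above(incorrect_scores, t)      # wrong hits kept
        fn = 1.0 - fraction_at_or_above(correct_scores, t)  # right hits lost
        rows.append((t, fp, fn))
    return rows


# Hypothetical match factors.
correct = [95, 90, 85, 60]
incorrect = [70, 40, 30, 20, 10]
curve = fp_fn_curve(correct, incorrect, [50, 80])
for row in curve:
    print(row)
```

Raising the threshold trades false positives for false negatives, which is why both curves are plotted against the same axis.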

slide-16
SLIDE 16

FP/FN Expanded View

[Figure: Fraction Recovered (0.0 to 0.8) vs. Match Factor (80 to 100); curves for m/z weighting and no weighting; FN, and FP x 10,000]

slide-17
SLIDE 17

FP Depends on Spectrum Uniqueness

[Figure: FP vs. Match Factor; one curve each for decane, decalin, TMB, HCB, malathion, DMPB, sarin]

HCB = hexachlorobenzene
DMPB = dimethylphenobarbital
TMB = 1,2,3-trimethylbenzene

slide-18
SLIDE 18

Multiple Ion Monitoring

  • What is it?

– Use 2-5 Major Peaks in Spectrum of Target

  • 10 – 100 times more sensitive
  • What’s the problem?

– Can match major Target peaks with Minor Sample Peaks

  • What we have done:

– Examine risk using library as source of potential false positive IDs
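The risk estimate described above can be sketched as a scan over a library: take the target's N largest peaks and count the library spectra that contain all of them in similar relative proportions, regardless of absolute abundance. The matching rule, the `ratio_tol` tolerance, and the toy spectra are assumptions made for illustration.

```python
def top_peaks(spectrum, n):
    """Return the n most abundant (m/z, abundance) pairs, largest first."""
    return sorted(spectrum.items(), key=lambda p: p[1], reverse=True)[:n]


def could_mimic(target_peaks, candidate, ratio_tol=0.3):
    """True if candidate shows every target peak with similar relative ratios.

    Absolute abundance is ignored, so minor peaks of the candidate can
    match major peaks of the target.
    """
    base_mz, base_ab = target_peaks[0]
    if candidate.get(base_mz, 0.0) <= 0.0:
        return False
    for mz, ab in target_peaks[1:]:
        expected = ab / base_ab
        observed = candidate.get(mz, 0.0) / candidate[base_mz]
        if abs(observed - expected) > ratio_tol * expected:
            return False
    return True


def false_positive_count(target, library, n_peaks):
    """Number of library spectra that could be mistaken for the target."""
    peaks = top_peaks(target, n_peaks)
    return sum(could_mimic(peaks, spectrum) for spectrum in library.values())


# Hypothetical target and library, as {m/z: abundance}.
target = {99: 100.0, 125: 40.0, 81: 15.0}
library = {
    "A": {57: 100.0, 99: 5.0, 125: 2.0},
    "B": {43: 100.0, 99: 50.0},
    "C": {43: 100.0, 57: 80.0},
}
counts = [false_positive_count(target, library, n) for n in (1, 2, 3)]
print(counts)
```

In this toy library the number of potential false positives falls as more peaks are monitored.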

slide-19
SLIDE 19

False Positive Risk vs Number of Peaks Used

[Figure 1. Median FPP vs. NP. x-axis: Number of Peaks (1 to 5); y-axis: FP/spectrum (median), log scale from 0.00001 to 1; one curve per Abundance Ratio (Biggest Search Peak / Matching Peak in FP): 1, 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128; label: BMA]

slide-20
SLIDE 20

Mass Spectral Peak Occurrences are Correlated

[Figure: Joint Occurrence Prob. (Relative Probabilities) vs. Difference in Peak Position (m/z); curves: Big Peaks, Medium Peaks, Small Peaks]

slide-21
SLIDE 21

FP Observed and Computed

(from individual peak probabilities)

[Figure: FP (log scale, 0.1 to 10,000) vs. FP Percentile/10 (1 to 10), Observed FPP Percentile; curves: Actual, No Peak Correlation]
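The "No Peak Correlation" estimate can be illustrated by multiplying individual peak occurrence probabilities and comparing the product with the observed joint frequency. This is a sketch under that independence assumption; the function names and the four-spectrum library (each spectrum reduced to a set of m/z values) are hypothetical.

```python
def occurrence_probability(library, mz):
    """Fraction of library spectra that contain a peak at mz."""
    return sum(mz in spectrum for spectrum in library) / len(library)


def independent_joint_probability(library, mzs):
    """Joint probability if peaks occurred independently of each other."""
    p = 1.0
    for mz in mzs:
        p *= occurrence_probability(library, mz)
    return p


def observed_joint_probability(library, mzs):
    """Fraction of library spectra that contain every listed peak."""
    return sum(all(mz in s for mz in mzs) for s in library) / len(library)


# Hypothetical library: each spectrum is the set of m/z values with a peak.
library = [{43, 57, 71}, {43, 57}, {51, 77}, {91}]
computed = independent_joint_probability(library, [43, 57])
observed = observed_joint_probability(library, [43, 57])
print(computed, observed)
```

When peaks tend to occur together, as m/z 43 and 57 do in this toy library, the independence estimate understates how often they are found jointly.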

slide-22
SLIDE 22

Search Results Depend on Search Spectrum Quality

AMDIS: http://chemdata.nist.gov

slide-23
SLIDE 23

Real Data

[Figure panels: total ion chromatogram; a mass spectrum (scan)]

slide-24
SLIDE 24

Chromatogram with single ion

slide-25
SLIDE 25

AMDIS Analysis of Data

AMDIS Match = 81

[Figure: structure of the matched compound, containing P, F, and O atoms]

slide-26
SLIDE 26

Order of Analysis

  • Noise Analysis – find ‘Noise Factor’
  • Find and quantify maximizing ions
  • Combine to create ‘Model Peak’
  • Use Model Peak shape (intensity vs time) to purify spectra

  • Find best matching library spectrum
slide-27
SLIDE 27

Derive Noise Factor

$K_{noise} = \dfrac{\text{Noise}}{\sqrt{\text{Intensity}}}$

[Figure: Noise plotted against Intensity]
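A noise factor of this kind can be estimated from signal-free stretches of data. The sketch below assumes that noise grows with the square root of intensity, as expected for ion-counting statistics, and uses a median-based estimate over fixed-length segments; the function name, the segment length, and the estimator are illustrative choices, not the AMDIS procedure.

```python
import math
import statistics


def noise_factor(intensities, segment_length=13):
    """Median over segments of (typical deviation) / sqrt(mean intensity)."""
    ratios = []
    last_start = len(intensities) - segment_length
    for start in range(0, last_start + 1, segment_length):
        segment = intensities[start:start + segment_length]
        mean = statistics.fmean(segment)
        if mean <= 0:
            continue  # a segment with no signal carries no information
        deviation = statistics.median(abs(x - mean) for x in segment)
        ratios.append(deviation / math.sqrt(mean))
    if not ratios:
        raise ValueError("no usable segments")
    return statistics.median(ratios)


# Hypothetical baseline alternating around 100 counts.
baseline = [96, 104, 96, 104] * 2
print(noise_factor(baseline, segment_length=4))
```

Medians are used so that a segment containing a real chromatographic peak does not dominate the estimate.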

slide-28
SLIDE 28

Finding Possible Peaks for Each m/z

[Figure: ion chromatogram for one m/z, marking the Maximum rate and Scan number n]

slide-29
SLIDE 29

Find Possible Compounds: Do Ions Maximize at Same Time?

[Figure: ions binned by maximization time in tenths of a scan (.0 to .9) between scans 1 and 2]
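Whether ions maximize at the same time can be tested by locating each ion chromatogram's apex to a fraction of a scan and grouping ions by that position. A minimal sketch, assuming three-point parabolic interpolation and bins of a tenth of a scan; the function names and the traces are hypothetical.

```python
from collections import defaultdict


def apex_position(trace):
    """Interpolated apex position in scans, or None if the maximum is on an edge."""
    i = max(range(len(trace)), key=trace.__getitem__)
    if i == 0 or i == len(trace) - 1:
        return None
    left, mid, right = trace[i - 1], trace[i], trace[i + 1]
    curvature = left - 2 * mid + right
    if curvature == 0:
        return float(i)
    # Vertex of the parabola through the three points around the maximum.
    return i + 0.5 * (left - right) / curvature


def group_by_apex(traces, bin_width=0.1):
    """Map bin index (apex position / bin_width, rounded) to sorted m/z values."""
    groups = defaultdict(list)
    for mz, trace in traces.items():
        position = apex_position(trace)
        if position is not None:
            groups[round(position / bin_width)].append(mz)
    return {index: sorted(mzs) for index, mzs in groups.items()}


# Hypothetical ion chromatograms: {m/z: abundance per scan}.
traces = {
    99: [0, 10, 40, 10, 0],
    125: [0, 4, 16, 4, 0],
    57: [0, 5, 20, 30, 10],
}
groups = group_by_apex(traces)
print(groups)
```

Ions that land in the same bin are candidates for belonging to one component; here m/z 99 and 125 peak together while m/z 57 peaks later.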

slide-30
SLIDE 30

Separate the Components

[Figure: ion abundances binned by maximization time in tenths of a scan (.0 to .9) between scans 1 and 2; candidate components are marked yes or NO]

slide-31
SLIDE 31

A ‘Model Peak’ Provides Shape

[Figure: ion abundances binned by maximization time in tenths of a scan (.0 to .9) between scans 1 and 2; candidate components are marked yes or NO]

The model shape is defined as the sum of all of the ion chromatograms that maximize within the range and have a sharpness value within 75% of the maximum.
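The rule above can be sketched as a filter followed by a sum. Two points are assumptions here: the `sharpness` function is a stand-in (steepest relative drop from the apex per scan), and "within 75% of the maximum" is read as at least 0.75 times the largest sharpness. The traces are hypothetical.

```python
def apex_index(trace):
    """Index of the largest point in a trace."""
    return max(range(len(trace)), key=trace.__getitem__)


def sharpness(trace):
    """Stand-in sharpness: steepest drop from the apex per scan, relative to height."""
    apex = apex_index(trace)
    height = trace[apex]
    if height <= 0:
        return 0.0
    best = 0.0
    for j, value in enumerate(trace):
        if j != apex:
            best = max(best, (height - value) / (abs(j - apex) * height))
    return best


def model_peak(traces, apex_range, fraction=0.75):
    """Sum the traces that maximize in apex_range and are sharp enough."""
    low, high = apex_range
    candidates = {mz: t for mz, t in traces.items()
                  if low <= apex_index(t) <= high}
    if not candidates:
        raise ValueError("no ion maximizes within the range")
    sharp = {mz: sharpness(t) for mz, t in candidates.items()}
    cutoff = fraction * max(sharp.values())
    chosen = sorted(mz for mz in candidates if sharp[mz] >= cutoff)
    length = len(next(iter(candidates.values())))
    shape = [sum(candidates[mz][k] for mz in chosen) for k in range(length)]
    return shape, chosen


# Hypothetical ion chromatograms: two sharp ions, one broad background ion,
# and one ion that maximizes a scan later.
traces = {
    99: [0, 10, 40, 10, 0],
    125: [0, 4, 16, 4, 0],
    43: [18, 19, 20, 19, 18],
    57: [0, 5, 20, 30, 10],
}
shape, chosen = model_peak(traces, apex_range=(2, 2))
print(shape, chosen)
```

The broad m/z 43 trace maximizes in range but is excluded by the sharpness cutoff, so background ions do not distort the model shape.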

slide-32
SLIDE 32

AMDIS Testing – Closely Eluting Components

slide-33
SLIDE 33

Representing Chemical Identity

  • Visual: 2D Structure
  • Text: IUPAC Name
  • Digital: No Accepted, Open Method
  • Solution: The IUPAC/NIST Chemical Identifier

slide-34
SLIDE 34

Connection Table

[Figure: connection table for a four-atom fragment containing Cl; atoms numbered 1 to 4, bond entries marked S or D]

slide-35
SLIDE 35

Chemical Identity Problems

[Figure: structure drawings with CH3 groups]

  • Registry Number possible for each exact form, mixture, unknown, unspecified
  • Experts required
  • Expensive, ambiguous and error prone

slide-36
SLIDE 36

Requirements

  • Different compounds have different identifiers

– Keep all distinguishing structural information

[Figure: two different structures, labeled IChI - 1 and IChI - 2]

slide-37
SLIDE 37

Requirements

  • One compound has only one identifier

– Omit unnecessary information

[Figure: several drawings of the same N/O-containing group, with and without a formal + charge on N, all giving the Same INChI]

slide-38
SLIDE 38

3 Steps to INChI

  • Chemistry

– ‘Normalize’ Input Structure
– Implement chemical rules

  • Math

– ‘Canonicalize’ (label the atoms)
– Equivalent atoms get the same label

  • Format

– ‘Serialize’ Labeled Structure
– Output as character string (‘name’)
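The idea that equivalent atoms get the same label can be illustrated with a Morgan-style refinement: start from (element, number of neighbors) and repeatedly refine each label with the sorted labels of the neighbors until no new classes appear. This is a generic textbook technique used here as a sketch, not the INChI canonicalization algorithm; the example uses the heavy-atom skeleton of nitrobenzene with hydrogens implicit.

```python
def equivalence_classes(symbols, bonds):
    """Return {atom: class label}; atoms in equivalent positions share a label."""
    neighbors = {atom: [] for atom in symbols}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    def rank(keys):
        # Replace arbitrary sortable keys by small integers 1, 2, 3, ...
        order = {k: i + 1 for i, k in enumerate(sorted(set(keys.values())))}
        return {atom: order[k] for atom, k in keys.items()}

    labels = rank({a: (symbols[a], len(neighbors[a])) for a in symbols})
    while True:
        refined = rank({
            a: (labels[a], tuple(sorted(labels[n] for n in neighbors[a])))
            for a in symbols
        })
        # Refinement can only split classes; stop when nothing splits.
        if len(set(refined.values())) == len(set(labels.values())):
            return labels
        labels = refined


# Nitrobenzene heavy atoms: ring carbons 1-6, N 7, O 8 and 9.
symbols = {1: "C", 2: "C", 3: "C", 4: "C", 5: "C", 6: "C",
           7: "N", 8: "O", 9: "O"}
bonds = [(8, 7), (7, 9), (7, 6), (6, 4), (4, 2), (2, 1), (1, 3), (3, 5), (5, 6)]
classes = equivalence_classes(symbols, bonds)
print(classes)
```

The two oxygens end up in one class, as do each pair of symmetry-related ring carbons, which is what lets a serializer emit the same string no matter how the input structure was drawn or numbered.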
slide-39
SLIDE 39

“Layers”

[Figure: Chemical Substances described by successive layers: formula, connectivity, stereo, isotope]

slide-40
SLIDE 40

Nitrobenzene

[Figure: nitrobenzene with Canonical numbering: CH 1, CH 2, CH 3, CH 4, CH 5, C 6, N+ 7, O 8, O 9]

Description Layers

formula       C6H5NO2
connectivity  8-7(9)6-4-2-1-3-5-6
H-atoms       1-5H
charges

slide-41
SLIDE 41

MSG

[Figure: MSG structure with Canonical numbering: CH2 1, CH2 2, CH 3, C 4, C 5, NH2 6, OH 7, O 8, O 9, O 10; Na+ 1]

Description Layers

formula       C5H8NO4.Na
connectivity  6-3(5(9)10)1-2-4(7)8;
H-atoms       1-2H2,3H,6H2(H-,7,8,9,10);
stereo        sp3 3-;
charges       -1;+1

C5H9NO4.Na/c6-3(5(9)10)1-2-4(7)8;/h1-2H2,3H,6H2,(H,7,8)(H,9,10);/q;+1/p-1/t3-;/m1./s1
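The layered string can be taken apart by splitting on "/" and using each segment's first character as the layer key. The sketch below makes only that assumption about the format; the function name `split_layers` is hypothetical, and the keys are left as the raw prefix letters rather than mapped to layer names.

```python
def split_layers(identifier):
    """Split a layered identifier into {prefix: content}; first segment is the formula."""
    # Drop any whitespace introduced by line wrapping.
    parts = "".join(identifier.split()).split("/")
    layers = {"formula": parts[0]}
    for part in parts[1:]:
        if part:
            layers[part[0]] = part[1:]
    return layers


msg = ("C5H9NO4.Na/c6-3(5(9)10)1-2-4(7)8;"
       "/h1-2H2,3H,6H2,(H,7,8)(H,9,10);/q;+1/p-1/t3-;/m1./s1")
layers = split_layers(msg)
for prefix, content in layers.items():
    print(prefix, content)
```

Within a layer, ";" separates the components of the formula, so the empty entry before ";+1" in the q layer belongs to the first component and "+1" to Na.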

slide-42
SLIDE 42

INChI Test Version

[Screenshot: INChI test program; controls: Input/Result, Mobile H On/Off, Include Org-Metal Bonds]

slide-43
SLIDE 43

Peptide Mass Spectra: Libraries for Organisms

  • Proteins are linear sequences of amino acids

– characteristic of Genome (organism)

  • Peptides are ‘digested’ fragments of proteins
  • MS ‘sequences’ peptides to reveal source Protein
  • Peptide fragmentation spectra are not quite predictable
  • Peptide fragmentation spectra for a ‘genome’ can be contained in one Library.
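Digestion can be simulated on a sequence string. The sketch below assumes a trypsin-like rule (cleave after K or R unless the next residue is P); the slides do not name the enzyme, and the function name and the protein sequence are hypothetical.

```python
def digest(protein, cleave_after="KR", blocked_before="P"):
    """Split a protein sequence into peptides using a trypsin-like rule."""
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        is_last = i == len(protein) - 1
        if (residue in cleave_after and not is_last
                and protein[i + 1] not in blocked_before):
            peptides.append(protein[start:i + 1])
            start = i + 1
    peptides.append(protein[start:])  # remaining C-terminal peptide
    return peptides


# Hypothetical protein sequence in one-letter amino acid codes.
peptides = digest("MKHLQLAIRGGKPSW")
print(peptides)
```

Because a genome fixes the set of protein sequences, it also fixes the set of peptides such a digest can produce, which is what makes a per-organism spectral library finite.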

slide-44
SLIDE 44

Spectrum Prediction Programs

slide-45
SLIDE 45

Peptide Spectra Reference Library

(multiple measurements each of 10,000 peptides)

HLQLAIR/2+

slide-46
SLIDE 46

MS Mapped to the Genome From Eric Deutsch, ISB, 6/2004