Metabolite Identification via Machine Learning Huibin Shen - - PowerPoint PPT Presentation



SLIDE 1

Introduction Fingerprints prediction Database matching Result Conclusion

Metabolite Identification via Machine Learning

Huibin Shen

Department of Information and Computer Science Aalto University

February 7, 2013

Huibin Shen Metabolite Identification via Machine Learning

SLIDE 2

Outline

1. Introduction
2. Fingerprints prediction
3. Database matching
4. Result
5. Conclusion

SLIDE 3

General picture

What is metabolite identification?

Figure 1: Metabolomics pipeline towards a systems biology approach: from the whole metabolome to identified metabolites [M. Sofia, 2007].

SLIDE 9

Standard computational method

Matching against a reference spectral database. Problems:

  • Quality of data.
  • Seldom public.
  • Limited number.
  • Diversity of mass spectrometers.
  • Similarity definition.

SLIDE 11

Molecular fingerprint

Figure 2: Representation of a molecular substructure fingerprint with a dictionary of given substructure patterns. The molecule is represented as a series of binary bits that indicate the presence or absence of particular substructures in the molecule [D.S. Cao, 2012].
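A minimal sketch of the bit-vector idea, assuming the substructures present in a molecule are already known. The five-entry dictionary and the function name are hypothetical; real fingerprint dictionaries are far larger, and detecting the substructures would normally be done by a cheminformatics toolkit.

```python
# Hypothetical, tiny substructure dictionary used only for illustration.
SUBSTRUCTURE_DICTIONARY = [
    "carboxylic acid",
    "primary amine",
    "benzene ring",
    "thiol",
    "halogen",
]


def substructure_fingerprint(substructures_present, dictionary=SUBSTRUCTURE_DICTIONARY):
    """Return one bit per dictionary pattern: 1 if present in the molecule, else 0."""
    present = set(substructures_present)
    return [1 if pattern in present else 0 for pattern in dictionary]


# Glycine contains a carboxylic acid group and a primary amine group.
glycine_bits = substructure_fingerprint({"carboxylic acid", "primary amine"})
print(glycine_bits)  # [1, 1, 0, 0, 0]
```

Each bit position corresponds to one dictionary entry, so two molecules can be compared bit by bit.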

SLIDE 12

Machine learning method

We propose a new framework to identify metabolites through machine learning:

Figure 3: The overview of the two-step metabolite identification framework.

SLIDE 13

Support Vector Machine (SVM)

The SVM is a supervised machine learning method for classification and regression.

Figure 4: Three-dimensional case for SVM [1].

[1] Figure from http://www.dtreg.com/svm.htm

SLIDE 16

Kernels for mass spectra

  • Feature mapping ≈ kernel function.
  • Three basic features and their combination.
  • Two families of kernels: integral mass kernel and probability product kernel.

SLIDE 17

Integral mass kernels

$$k(x, x') = \langle x, x' \rangle$$

[Plot: example mass spectrum, intensity (0.1 to 1) against m/z (20 to 200), at collision energies 10 eV, 20 eV and 30 eV, with the molecular structure and annotated mass values.]

Figure 5: Three basic features and integral mass kernel.
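A minimal sketch of an integral mass kernel, assuming the feature vector simply sums peak intensities into bins at the nearest integer m/z and the kernel is the inner product of two such vectors. Only the peaks feature is shown; the function names, the binning rule and the example spectra are illustrative assumptions, not taken from the slides.

```python
import math
from collections import defaultdict


def integral_mass_features(peaks):
    """Sum peak intensities into bins indexed by the nearest integer m/z."""
    features = defaultdict(float)
    for mz, intensity in peaks:
        features[math.floor(mz + 0.5)] += intensity
    return dict(features)


def integral_mass_kernel(peaks_a, peaks_b):
    """Inner product <x, x'> of the two binned feature vectors."""
    x = integral_mass_features(peaks_a)
    x_prime = integral_mass_features(peaks_b)
    return sum(value * x_prime.get(mass, 0.0) for mass, value in x.items())


# Illustrative spectra as (m/z, relative intensity) pairs.
spectrum_a = [(73.0, 0.4), (117.0, 1.0), (145.1, 0.6)]
spectrum_b = [(73.2, 0.5), (145.0, 1.0), (169.3, 0.2)]
kernel_value = integral_mass_kernel(spectrum_a, spectrum_b)
print(kernel_value)
```

Only the bins shared by both spectra (here 73 and 145) contribute to the kernel value, which is why small mass measurement errors that cross a bin boundary can hurt this kernel.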

SLIDE 18

Probability product kernel

$$k(x, x') = k_{\mathrm{prob}}\big(p(x), p'(x')\big) = \int_{X} p(x)\, p'(x')\, dx$$

Figure 6: Probability product kernel.
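A minimal sketch of a probability product kernel, under the assumption that each spectrum is modelled as a one-dimensional Gaussian mixture over m/z, with one component per peak, weights proportional to intensity, and a shared standard deviation `sigma`. These modelling choices and the function names are illustrative; the slides do not specify the densities. The integral of a product of two Gaussians has a closed form, so no numerical integration is needed.

```python
import math


def gaussian_product_integral(mu_a, mu_b, sigma):
    """Integral over x of N(x; mu_a, sigma^2) * N(x; mu_b, sigma^2), in closed form."""
    variance_sum = 2.0 * sigma ** 2
    exponent = -((mu_a - mu_b) ** 2) / (2.0 * variance_sum)
    return math.exp(exponent) / math.sqrt(2.0 * math.pi * variance_sum)


def probability_product_kernel(peaks_a, peaks_b, sigma=0.1):
    """Kernel between two spectra modelled as intensity-weighted Gaussian mixtures."""
    total_a = sum(intensity for _, intensity in peaks_a)
    total_b = sum(intensity for _, intensity in peaks_b)
    kernel = 0.0
    for mz_a, intensity_a in peaks_a:
        for mz_b, intensity_b in peaks_b:
            weight = (intensity_a / total_a) * (intensity_b / total_b)
            kernel += weight * gaussian_product_integral(mz_a, mz_b, sigma)
    return kernel


near = probability_product_kernel([(100.00, 1.0)], [(100.05, 1.0)])
far = probability_product_kernel([(100.00, 1.0)], [(101.00, 1.0)])
print(near > far)  # True
```

Unlike integer binning, the kernel value decays smoothly with the mass difference between peaks, so nearly coincident peaks still contribute.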

SLIDE 19

Scoring

Given the cross-validation accuracy $p = (p_i)_{i=1}^{m} \in \mathbb{R}^m$ over $m$ fingerprints $y = (y_i)_{i=1}^{m}$, the similarity score between two fingerprints $y$ and $y^*$ is:

$$p(y \mid p, y^*) = \prod_{i=1}^{m} p_i^{\,1 - |y_i - y_i^*|} \, (1 - p_i)^{|y_i - y_i^*|}.$$

SLIDE 20

Experimental data

A summary of the datasets is listed in this table.

Data    Device                Size  Mode  Mass error  Std    Fingerprints
QqQ     misc                  514   Pos                      286
        - API3000             410   Pos    0.128      0.164
        - QuattroPremier XE   82    Pos   −0.092      0.073
        - TSQ 7000            17    Pos   −0.124      0.036
        - TSQ Quantum AM      3     Pos
        - Q-Trap              2     Pos
Ltq     LTQ Orbitrap XL       293   Pos    0.0        0.049  128
Lipids  LTQ Orbitrap          403   Neg   −0.135      0.090  20

Table 1: Dataset statistics. Only a subset of the fingerprints is exhibited by the molecules in each dataset.

SLIDE 21

Fingerprint prediction

We show the prediction accuracies for the Ltq dataset.

[Plot: accuracy (0.5 to 1.0) for each of the 128 fingerprints; legend: integral mass kernel, high resolution mass kernel.]

Figure 7: Light grey is the improvement of the integral mass kernel over the default classifier. Dark grey is the improvement of the probability product kernel over the integral mass kernel.

SLIDE 22

Feature selection

We show the effect of different features.

[Scatter plot: mean F1 (40 to 65) and mean accuracy (88 to 94); kernels: integral mass kernel, high resolution mass kernel; feature sets: peaks, nloss, diff, peaks+nloss, peaks+diff, full.]

Figure 8: Scatter plot of the aggregate average accuracy/F1 across the three datasets. The unfilled marks represent a higher accuracy/F1 ratio with the quadratic kernel.
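A minimal sketch of the two metrics for a single binary fingerprint bit, using the standard definitions of accuracy and F1. The function name and the toy labels are illustrative, and how the slides aggregate the metrics across fingerprints and datasets is not shown here.

```python
def accuracy_and_f1(y_true, y_pred):
    """Accuracy and F1 score for binary labels given as 0/1 sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1


# A rare substructure: only 2 of 10 molecules contain it.
y_true = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
accuracy, f1 = accuracy_and_f1(y_true, y_pred)
print(accuracy, f1)  # 0.8 0.5
```

For rare substructures, accuracy stays high even when half of the positive molecules are missed, which is one reason to report F1 alongside it.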

SLIDE 23

Experimental data (for the CASMI challenge)

MS2 spectra are used to train the model, and MS1 spectra are used for comparing the results of isotopic pattern matching.

MS type  Instrument type   Size  No. of Mol  Fingerprints
MS2      APCI-ITFT-CID     295   65          179
         APCI-ITFT-HCD     882   86          181
         LC-ESI-ITFT-CID   447   244         281
         LC-ESI-ITFT-HCD   2655  225         281
         LC-ESI-QTOF-CID   1027  523         290
MS1      LC-ESI-ITFT       41    41
         LC-ESI-QTOF       62    62

Table 2: Dataset statistics. Only a subset of the fingerprints is exhibited by the molecules in each dataset.

SLIDE 24

Learning curve

[Six learning-curve panels: APCI-ITFT-CID (295), APCI-ITFT-HCD (882), APCI-ITFT (1177), LC-ESI-ITFT-CID (447), LC-ESI-ITFT-HCD (2655), LC-ESI-QTOF (1027). Axes: percent of training data (0.2 to 1) and average prediction error (0.00 to 0.30) for fingerprints of baseline <= 0.8.]

Figure 9: The blue line is the training error; the red line is the cross-validation prediction error; the black line is the relative rank of the correct molecule. The matching database is KEGG.

SLIDE 25

Combining isotopic pattern matching (LC-ESI-ITFT)

[Plot titled "ITFT": proportion of dataset (0.0 to 1) against log rank (1 to 10^2); curves: fingerid, isotope, average rank, diff_iso, mini rank.]

Figure 10: The red fingerid line ranks the candidates by fingerid scores only, while the blue isotope line ranks them by isotopic pattern matching scores. The black average rank line takes the average of the fingerid and isotopic matching ranks, while the green diff_iso line ranks by isotopic matching scores first and then breaks ties by fingerid scores. The grey mini rank line takes the minimum of the fingerid and isotopic matching ranks.
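A minimal sketch of the three combination strategies named in the caption (average rank, isotope-first with fingerid tie-breaking, and minimum rank). How ties are ranked in the original experiments is not stated, so the competition-style ranking below, the function names and the toy scores are assumptions.

```python
def ranks_from_scores(scores):
    """Rank 1 is best; a candidate's rank is 1 + the number of strictly higher scores."""
    values = list(scores.values())
    return {
        name: 1 + sum(1 for other in values if other > value)
        for name, value in scores.items()
    }


def combine_rankings(fingerid_scores, isotope_scores):
    """Return the average rank, the minimum rank, and the isotope-first ordering."""
    fingerid_rank = ranks_from_scores(fingerid_scores)
    isotope_rank = ranks_from_scores(isotope_scores)
    average_rank = {c: (fingerid_rank[c] + isotope_rank[c]) / 2 for c in fingerid_rank}
    mini_rank = {c: min(fingerid_rank[c], isotope_rank[c]) for c in fingerid_rank}
    # Sort by isotope score first; candidates tied on it are ordered by fingerid score.
    diff_iso_order = sorted(
        fingerid_scores,
        key=lambda c: (-isotope_scores[c], -fingerid_scores[c]),
    )
    return average_rank, mini_rank, diff_iso_order


# Toy scores for three candidates; A and B tie on the isotope score.
fingerid_scores = {"A": 0.9, "B": 0.5, "C": 0.1}
isotope_scores = {"A": 0.7, "B": 0.7, "C": 0.9}
average_rank, mini_rank, diff_iso_order = combine_rankings(fingerid_scores, isotope_scores)
print(diff_iso_order)  # ['C', 'A', 'B']
```

In this toy case the isotope score puts C first, and the fingerid score only decides the order of the tied candidates A and B.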

SLIDE 26

Combining isotopic pattern matching (LC-ESI-QTOF)

[Plot titled "ITFT": proportion of dataset (0.0 to 1) against log rank (1 to 10^2); curves: fingerid, isotope, average rank, diff_iso, mini rank.]

Figure 11: The red fingerid line ranks the candidates by fingerid scores only, while the blue isotope line ranks them by isotopic pattern matching scores. The black average rank line takes the average of the fingerid and isotopic matching ranks, while the green diff_iso line ranks by isotopic matching scores first and then breaks ties by fingerid scores. The grey mini rank line takes the minimum of the fingerid and isotopic matching ranks.

SLIDE 27

Results for the CASMI challenge

Challenge  1 2 3 4 5 6 10 11 12 13 14 15 16 17
Category1  4 1 3 4 4 1 1 5
Category2  5 1 4 11

Table 3: Results for the CASMI-2012 challenge. Category 1 is chemical formula identification and category 2 is molecular structure identification. For Challenge 11, we deduced the wrong exact mass of the molecule. For the others, we do not have the molecule in our database (most molecules are from KEGG).

SLIDE 28

Conclusion

  • The fingerprints are predicted with high accuracy.
  • Probability product kernels and combined features are better in most cases.
  • Isotopic pattern matching does not help much.
  • Choosing the right molecular database can be a critical problem in our framework.
