 
Metabolite Identification via Machine Learning
Huibin Shen
Department of Information and Computer Science, Aalto University
February 7, 2013

Outline
1. Introduction
2. Fingerprints prediction
3. Database matching
4. Result
5. Conclusion

General picture
What is metabolite identification?
Figure 1: Metabolomics pipeline towards a systems biology approach: from the whole metabolome to identified metabolites [M. Sofia, 2007].

Standard computational method
Matching reference spectral database. Problems:
- Quality of data.
- Seldom public.
- Limited number.
- Diversity of mass spectrometer.
- Similarity definition.

Molecular fingerprint
Figure 2: Representation of a molecular substructure fingerprint with a substructure fingerprint dictionary of given substructure patterns. This molecule is represented in a series of binary bits that represent the presence or absence of particular substructures in the molecules [D.S. Cao, 2012].

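As a toy illustration of the binary encoding in Figure 2, the sketch below maps a molecule to bits using a small dictionary of substructure names. The dictionary entries, the `fingerprint` helper and the example annotation are hypothetical; real fingerprints come from substructure search on the molecular graph, which is not shown here.

```python
# Hypothetical substructure dictionary; the order fixes the bit positions.
SUBSTRUCTURES = ["hydroxyl", "carboxyl", "amine", "benzene ring", "phosphate"]


def fingerprint(present, dictionary=SUBSTRUCTURES):
    """Return one bit per dictionary entry: 1 if the substructure is present."""
    present = set(present)
    return [1 if name in present else 0 for name in dictionary]


# Assumed annotation of a molecule containing a carboxyl and an amine group.
example = fingerprint({"carboxyl", "amine"})
print(example)  # [0, 1, 1, 0, 0]
```

Each bit position can then be treated as a separate binary prediction target.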
Machine learning method
We propose a new framework to identify metabolites through machine learning:
Figure 3: The overview of the two-step metabolite identification framework.

Support Vector Machine (SVM)
SVM is a supervised machine learning method for classification and regression.
Figure 4: Three-dimensional case for SVM. Figure from http://www.dtreg.com/svm.htm

Kernels for mass spectrum
- Feature mapping ≈ kernel function.
- Three basic features and their combination.
- Two families of kernels: integral mass kernel and probability product kernel.

Integral mass kernels
$k(x, x') = \langle x, x' \rangle$
[Figure: chemical structure and example spectra at collision energies 10 eV, 20 eV and 30 eV; intensity against m/z (0 to 200), with labeled peaks at 73, 117.0, 145.1, 169.3 and 187.4.]
Figure 5: Three basic features and integral mass kernel.

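A minimal sketch of the inner product above, assuming each spectrum is a list of (m/z, intensity) peaks and that peaks are binned to the nearest integer mass. The binning rule and the helper names are assumptions; the slides do not give the exact feature construction.

```python
from collections import defaultdict


def integral_mass_features(peaks):
    """Sum peak intensities into integer-mass bins (assumed binning rule)."""
    bins = defaultdict(float)
    for mz, intensity in peaks:
        bins[round(mz)] += intensity
    return dict(bins)


def integral_mass_kernel(peaks_a, peaks_b):
    """Linear kernel k(x, x') = <x, x'> over the integer-mass bins."""
    xa = integral_mass_features(peaks_a)
    xb = integral_mass_features(peaks_b)
    return sum(value * xb[mass] for mass, value in xa.items() if mass in xb)


# Illustrative spectra; only the bins at mass 117 overlap.
spectrum_a = [(117.0, 0.4), (145.1, 0.7)]
spectrum_b = [(117.2, 0.5), (169.3, 0.2)]
print(integral_mass_kernel(spectrum_a, spectrum_b))  # 0.2
```

Two peaks contribute to the kernel only when they fall into the same bin, so small mass errors across a bin boundary are treated as a complete mismatch.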
Probability product kernel
$k(x, x') = k_{\mathrm{prob}}(p(x), p'(x')) = \int_{\mathcal{X}} p(x)\, p'(x')\, dx$
Figure 6: Probability product kernel.

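A hedged sketch of one way to evaluate such an integral in closed form. It assumes each spectrum is modeled as an equally weighted mixture of one-dimensional Gaussians centered on the peak m/z values with a shared standard deviation `sigma`; this density model and the function names are assumptions, not taken from the slides. Under that model, the integral of the product of two Gaussian components is a Gaussian in the difference of their means with variance 2 * sigma^2.

```python
import math


def gaussian_overlap(m1, m2, sigma):
    """Closed form of the integral of N(x; m1, sigma^2) * N(x; m2, sigma^2) dx."""
    var = 2.0 * sigma ** 2
    return math.exp(-((m1 - m2) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def probability_product_kernel(masses_a, masses_b, sigma=0.1):
    """Average component overlap between two equally weighted Gaussian mixtures.

    Both mass lists are assumed to be non-empty.
    """
    total = sum(gaussian_overlap(a, b, sigma) for a in masses_a for b in masses_b)
    return total / (len(masses_a) * len(masses_b))


# Peaks that nearly coincide give a larger kernel value than distant peaks.
print(probability_product_kernel([145.10], [145.12]) >
      probability_product_kernel([145.10], [146.10]))  # True
```

Unlike integer binning, this kernel decays smoothly with the mass difference, which is why it suits high resolution measurements.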
Scoring
Given the cross validation accuracy $p = (p_i)_{i=1}^{m} \in \mathbb{R}^m$ over $m$ fingerprints, the similarity score between two fingerprints $y = (y_i)_{i=1}^{m}$ and $y^*$ is:
$$p(y \mid p, y^*) = \prod_{i=1}^{m} p_i^{\,1 - |y_i - y_i^*|} \, (1 - p_i)^{|y_i - y_i^*|}.$$

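The score can be computed directly from its definition. The sketch below works in log space to avoid underflow for long fingerprints and ranks a few hypothetical database candidates; the function names, accuracies and candidate fingerprints are illustrative assumptions.

```python
import math


def log_score(y, y_star, accuracy):
    """Log of prod_i p_i^(1 - |y_i - y*_i|) * (1 - p_i)^|y_i - y*_i|.

    Assumes binary fingerprints and accuracies strictly between 0 and 1.
    """
    total = 0.0
    for yi, ysi, pi in zip(y, y_star, accuracy):
        # Matching bits contribute p_i, mismatching bits contribute 1 - p_i.
        total += math.log(pi) if yi == ysi else math.log(1.0 - pi)
    return total


def rank_candidates(y_star, candidates, accuracy):
    """Return candidate names sorted by decreasing score."""
    return sorted(
        candidates,
        key=lambda name: log_score(candidates[name], y_star, accuracy),
        reverse=True,
    )


accuracy = [0.9, 0.8, 0.6]   # hypothetical cross validation accuracies
predicted = [1, 0, 1]        # hypothetical predicted fingerprint
candidates = {"A": [1, 0, 1], "B": [1, 1, 1], "C": [0, 1, 0]}
print(rank_candidates(predicted, candidates, accuracy))  # ['A', 'B', 'C']
```

A mismatch on a bit that is predicted with high accuracy is penalized more heavily than a mismatch on an unreliable bit.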
Experiments data
A summary of the datasets is listed in this table.

| Data | Device | Size | Mode | Mass error | Std | Fingerprints |
|---|---|---|---|---|---|---|
| QqQ | misc | 514 | Pos | | | 286 |
| | - API3000 | 410 | Pos | 0.128 | 0.164 | |
| | - QuattroPremier XE | 82 | Pos | -0.092 | 0.073 | |
| | - TSQ 7000 | 17 | Pos | -0.124 | 0.036 | |
| | - TSQ Quantum AM | 3 | Pos | | | |
| | - Q-Trap | 2 | Pos | | | |
| Ltq | LTQ Orbitrap XL | 293 | Pos | 0.0 | 0.049 | 128 |
| Lipids | LTQ Orbitrap | 403 | Neg | -0.135 | 0.090 | 20 |

Table 1: The dataset statistics. Only a subset of fingerprints are exhibited in each dataset’s molecules.

Fingerprint prediction
We show the prediction accuracies for the Ltq dataset.
[Figure: accuracy (0.5 to 1.0) against fingerprints (1 to 128); legend: Integral mass kernel, High resolution mass kernel.]
Figure 7: Light grey is improvement by integral kernel from default classifier. Dark grey is improvement by probability product kernel from integral kernel.

Feature selection
We show the effect of different features.
[Figure: mean accuracy (88 to 94) against mean F1 (40 to 65); features: peaks, nloss, diff, peaks+nloss, peaks+diff, full; kernels: Integral mass kernel, High resolution mass kernel.]
Figure 8: Scatter plot of the aggregate average accuracy/F1 across three datasets. The non-filled marks represent higher accuracy/F1 ratio in quadratic kernel.

Experiments data (for CASMI challenge)
MS2 spectra are used to train the model, and MS1 spectra are used for comparing the results of isotopic pattern matching.

| MS type | Instrument type | Size | No. of Mol | Fingerprints |
|---|---|---|---|---|
| MS2 | APCI-ITFT-CID | 295 | 65 | 179 |
| | APCI-ITFT-HCD | 882 | 86 | 181 |
| | LC-ESI-ITFT-CID | 447 | 244 | 281 |
| | LC-ESI-ITFT-HCD | 2655 | 225 | 281 |
| | LC-ESI-QTOF-CID | 1027 | 523 | 290 |
| MS1 | LC-ESI-ITFT | 41 | 41 | |
| | LC-ESI-QTOF | 62 | 62 | |

Table 2: The dataset statistics. Only a subset of fingerprints are exhibited in each dataset’s molecules.