SLIDE 1

Linking biological data using data science and cross-disciplinary software development

Florian Huber¹, Justin van der Hooft², Simon Rogers³, Marnix Medema², Lars Ridder¹

de-RSE conference, Potsdam 05/06/2019

¹ Netherlands eScience Center
² Bioinformatics Group, Wageningen University
³ School of Computing Science, University of Glasgow

SLIDE 2

Breaking down scientific monocultures by cross-disciplinary software development

SLIDE 3

Florian Huber | @me_datapoint | de-RSE 2019

talk by Alys Brett

SLIDE 4

SLIDE 5

We signal challenges and opportunities at the intersection of software and academic research

Photography: Elodie Burrillon

SLIDE 6

SLIDE 7

Our technological expertise areas

Big data analytics: Scientific visualization, Machine learning, Information retrieval, Computer vision, Information visualization, Text mining

Efficient computing: Low power computing, Accelerated computing, Orchestrated computing, High performance computing, Distributed computing

Optimized data handling: Databases, Linked data, Handling sensor data, Information integration, Data assimilation

SLIDE 8

What do we do?

Link between researchers and IT infrastructure

Research software

Data stewards/data scientists

Cross-disciplinary transfer

SLIDE 9

Example project: Integrated ‘omics’ analysis

Medema lab, Wageningen UR, NL

NL eScience Center

Glasgow University: Simon Rogers, Andrew Ramsay, Grimur Hjorleifsson Eldjar

UCSD: Madeleine Ernst, Pieter Dorrestein

SLIDE 10

secondary metabolites

SLIDE 11

mass spectra DNA

HMM (Hidden Markov Model) + manually written rules

SLIDE 14

Mass spectrometry and fragmentation

[Figure: mass spectrometer schematic showing ionization, mass separation, mass trapping, and mass detection, producing an MS1 spectrum (m/z axis).]

SLIDE 15

Mass spectrometry and fragmentation

[Figure: mass spectrometer schematic showing ionization, mass separation, mass trapping, and mass detection, producing MS1 and MS2 spectra (m/z axes).]

SLIDE 16

Mass spectrometry and fragmentation

[Figure: mass spectrometer schematic showing ionization, mass separation, mass trapping, and mass detection, producing MS1 and MS2 spectra (m/z axes).]

Fragments help to puzzle out the metabolite structure.

SLIDE 17

Bacteria, fungi, and plants produce a large and diverse arsenal of high-value molecules:

doxorubicin (chemotherapeutic agent)
rapamycin (immunosuppressant)
lovastatin (cholesterol-lowering agent)
spinosad (insecticide)
pneumocandin (antifungal)
vancomycin (antibiotic)

The challenge is large-scale coupling of spectral data to the molecular structures of known and especially novel natural product molecules.

[Figure: mass spectrometry fragmentation spectrum]

SLIDE 18

But… how similar are they?

Spectral similarity

SLIDE 21

…likes cake with a cappuccino.

…loves to have a cookie and a coffee.

How similar are they?

What does similar mean? Number of words, number of characters, grammatical structure, meaning, style, topic, or phonetic structure?

SLIDE 22

…likes cake with a cappuccino.

…loves to have a cookie and a coffee.

[Annotation: a single term is a ‘word’; a whole line is a ‘sentence’ (or ‘document’).]

SLIDE 23

Count how often ‘words’ co-occur (find word ‘context’)

[Figure: N×N word co-occurrence matrix, where N is the number of words in the dictionary (all words in the corpus). Example words: sweet, cake, cookie, monster. Example counts: 9, 17, 24.]
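
The counting step on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the project: the function name `cooccurrence_counts`, the toy corpus, and the choice to treat a whole sentence as the context (rather than a sliding window) are all assumptions made here.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count how often each unordered pair of distinct words shares a document."""
    counts = Counter()
    for words in documents:
        # Use the set of words so a repeated word counts once per document;
        # sorting makes every pair appear in alphabetical order.
        for a, b in combinations(sorted(set(words)), 2):
            counts[(a, b)] += 1
    return counts


# Toy corpus invented for illustration; each sentence is one 'document'.
corpus = [
    "she likes sweet cake with a cappuccino".split(),
    "he loves a sweet cookie and a coffee".split(),
    "the monster eats a sweet cookie".split(),
]

counts = cooccurrence_counts(corpus)
print(counts[("cookie", "sweet")])
print(counts[("cake", "sweet")])
```

Only the non-zero entries of the N×N matrix are stored here, which tends to matter once N approaches the size of a real dictionary.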

SLIDE 24

‘Word2Vec’ → lower-dimensional context vector

[Figure: factorization of the word co-occurrence matrix (example words: sweet, cake, cookie, monster).]

SLIDE 25

‘Word2Vec’ → lower-dimensional context vector

[Figure: rows of the word co-occurrence matrix (e.g. cookie, cake) mapped to context vectors $V_{\text{cookie}}$ and $V_{\text{cake}}$.]

SLIDE 26

NLP → metabolomics: use peaks as words

[Figure: co-occurrence matrix with peak positions as ‘words’, e.g. m(A), m(Aa), m(A’’); each peak is mapped to a vector, e.g. $V_A$ and $V_{Aa}$.]
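
One possible way to turn a spectrum into a ‘document’ is to write each peak position as a string token. The sketch below assumes peaks are given as (m/z, intensity) pairs. The token format `peak_<m/z>`, the rounding to two decimals, and the `min_intensity` filter are hypothetical choices for illustration, not details taken from the talk.

```python
def spectrum_to_document(peaks, decimals=2, min_intensity=0.0):
    """Turn a list of (m/z, intensity) peaks into a list of 'words'."""
    words = []
    for mz, intensity in peaks:
        if intensity < min_intensity:
            continue  # skip peaks below the (hypothetical) noise threshold
        # The rounded peak position becomes the 'word'.
        words.append(f"peak_{mz:.{decimals}f}")
    return words


# Invented example spectrum: (m/z, relative intensity).
spectrum = [(105.07, 0.40), (163.0612, 1.00), (289.1, 0.02)]
document = spectrum_to_document(spectrum, decimals=2, min_intensity=0.05)
print(document)
```

The rounding precision decides how many distinct ‘words’ the dictionary contains: coarser rounding merges nearby peaks, finer rounding keeps them apart.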

SLIDE 27

Spectral similarity measures.

NLP/word2vec based method

$V_{\text{spectrum1}} = V_A + V_B + V_{A'} + V_{B'} + \dots$

Each term on the right is a ‘word’ vector; the sum is the ‘document’ vector.

SLIDE 28

Spectral similarity measures.

NLP/word2vec based method

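$V_{\text{spectrum1}} = V_A + V_B + V_{A'} + V_{B'} + \dots$

$V_{\text{spectrum2}} = V_{Aa} + V_{Bb} + V_{A'} + V_{Bb'} + \dots$

Each term on the right is a ‘word’ vector; each sum is a ‘document’ vector.

$\text{Similarity} = \cos(\theta) = \dfrac{V_{\text{spectrum1}} \cdot V_{\text{spectrum2}}}{\lVert V_{\text{spectrum1}} \rVert \, \lVert V_{\text{spectrum2}} \rVert}$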
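
The two steps on this slide (sum the ‘word’ vectors, then take the cosine between the two ‘document’ vectors) can be sketched as follows. The embeddings are made-up two-dimensional vectors; in practice they would come from a trained Word2Vec model. The names `document_vector` and `cosine_similarity` are illustrative, not the project's API.

```python
import math


def document_vector(words, word_vectors):
    """Sum the vectors of all words (peaks) that have a known embedding."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for word in words:
        vec = word_vectors.get(word)
        if vec is None:
            continue  # words without an embedding are ignored
        total = [t + v for t, v in zip(total, vec)]
    return total


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Invented embeddings: 'Aa' is close to 'A', and 'Bb' is close to 'B'.
word_vectors = {
    "A": [1.0, 0.0],
    "B": [0.0, 1.0],
    "Aa": [0.9, 0.1],
    "Bb": [0.1, 0.9],
}

v_spectrum1 = document_vector(["A", "B"], word_vectors)
v_spectrum2 = document_vector(["Aa", "B"], word_vectors)
print(cosine_similarity(v_spectrum1, v_spectrum2))
```

Because ‘Aa’ lies close to ‘A’ in the embedding space, the two spectra score as highly similar even though they share only one identical peak.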

SLIDE 29

Spectral similarity measures: evaluation.

Dataset: 11,000 spectra with known molecular structures

[Figure: illustrative (fake) spectra with example similarity scores 0.85, 0.23, and 0.13.]

SLIDE 30

Molecular similarity scores: 10,000 highest ‘classical’ scores*

* = scores > 0.998

[Figure: matrix of molecular similarity scores between spectra (axes: spectrum ID 1, 2, 3, …), with a scale from low to high molecular similarity. Molecular similarity is based on circular fingerprints (Morgan3 / ECFP6).]

[Figure: histogram of reference scores for the 10,000 best-scoring pairs (classical score); annotated value: 16%.]

SLIDE 31

Molecular similarity scores: 10,000 highest NLP-based scores*

* = scores > 0.84

[Figure: matrix of molecular similarity scores between spectra (axes: spectrum ID 1, 2, 3, …), with a scale from low to high molecular similarity. Molecular similarity is based on circular fingerprints (Morgan3 / ECFP6).]

[Figure: histogram of reference scores for the 10,000 best-scoring pairs (NLP-based score); annotated value: 73%.]
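
The evaluation idea on these two slides, ranking spectrum pairs by a spectral score and then checking how structurally similar the top pairs are, might look like this in code. Several things are assumed here: fingerprints are represented as sets of on-bits, molecular similarity is the Tanimoto (Jaccard) score (the slides name only the fingerprint type, Morgan3 / ECFP6, not the score), and the 0.5 threshold and all data are invented.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity of two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)


def fraction_high_similarity(pair_scores, fingerprints, top_n, threshold):
    """Among the top_n pairs by spectral score, return the fraction whose
    molecular similarity is at least `threshold`."""
    best = sorted(pair_scores, key=pair_scores.get, reverse=True)[:top_n]
    if not best:
        return 0.0
    hits = sum(
        1 for i, j in best
        if tanimoto(fingerprints[i], fingerprints[j]) >= threshold
    )
    return hits / len(best)


# Invented fingerprints (sets of on-bits), keyed by spectrum ID.
fingerprints = {
    1: {1, 2, 3, 4},
    2: {1, 2, 3, 5},
    3: {7, 8, 9},
    4: {1, 2, 3, 4},
}
# Invented spectral similarity scores for pairs of spectrum IDs.
pair_scores = {(1, 2): 0.95, (1, 3): 0.90, (1, 4): 0.99, (2, 3): 0.10}

result = fraction_high_similarity(pair_scores, fingerprints, top_n=3, threshold=0.5)
print(result)
```

Running the same function once with classical scores and once with NLP-based scores would give two fractions that can be compared directly, which is one plausible reading of the percentages shown on the slides.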

SLIDE 32

Spectral similarity measures: examples.

[Figure: query molecule (spectrum ID: 3351) and its 9 closest candidates (1 to 9) according to molecular networking similarity; two candidates are marked ‘bad’.]

SLIDE 33

Spectral similarity measures: examples.

[Figure: query molecule (spectrum ID: 3351) and its 9 closest candidates (1 to 9) according to Word2Vec-based spectral similarity.]

SLIDE 34

RSEs creating unique links

RSEs work in teams with a broad range of expertise and backgrounds.

RSEs work on projects in different scientific domains.

→ This creates opportunities unlike anywhere else in the academic setting:

Transfer methods/techniques between domains.
Spot potential synergies between (sub-)fields.

SLIDE 35

Interested in research software?

+31 (0)20 460 4770

www.esciencecenter.nl

blog.esciencecenter.nl

The Netherlands eScience Center is the Dutch national center of excellence for the development and application of research software to advance academic research.

Join the team!

n.renaud@esciencecenter.nl

@me_datapoint

Florian Huber

@neocarlitos

Carlos Martinez-Ortiz