Linking biological data using data science and cross-disciplinary software development


slide-1
SLIDE 1

Linking biological data using data science and cross-disciplinary software development

Florian Huber¹, Justin van der Hooft², Simon Rogers³, Marnix Medema², Lars Ridder¹

de-RSE conference, Potsdam 05/06/2019

¹ Netherlands eScience Center
² Bioinformatics Group, Wageningen University
³ School of Computing Science, University of Glasgow

slide-2
SLIDE 2

Breaking down scientific monocultures by cross-disciplinary software development

slide-3
SLIDE 3

Florian Huber | @me_datapoint | de-RSE 2019

talk by Alys Brett

slide-4
SLIDE 4

slide-5
SLIDE 5

We signal challenges and opportunities at the intersection of software and academic research

Photography: Elodie Burrillon

slide-6
SLIDE 6
slide-7
SLIDE 7

Our technological expertise areas

Big data analytics
  • Scientific visualization
  • Machine learning
  • Information retrieval
  • Computer vision
  • Information visualization
  • Text mining

Efficient computing
  • Low power computing
  • Accelerated computing
  • Orchestrated computing
  • High performance computing
  • Distributed computing

Optimized data handling
  • Databases
  • Linked data
  • Handling sensor data
  • Information integration
  • Data assimilation

slide-8
SLIDE 8

What do we do?

  • Link between researchers and IT infrastructure
  • Research software
  • Data stewards/data scientists
  • Cross-disciplinary transfer

slide-9
SLIDE 9

Example project: Integrated ‘omics’ analysis

  • Medema lab, Wageningen UR, NL
  • NL eScience Center
  • Glasgow University: Simon Rogers, Andrew Ramsay, Grimur Hjorleifsson Eldjarn
  • UCSD: Madeleine Ernst, Pieter Dorrestein

slide-10
SLIDE 10

secondary metabolites

slide-11
SLIDE 11

mass spectra DNA

HMM (Hidden Markov Model) + manually written rules

slide-16
SLIDE 16

Mass spectrometry and fragmentation

Ionization → Mass Separation → Mass Trapping → Mass Detection

[Figure: MS1 spectrum (m/z axis) of the ionized molecules and MS2 spectrum (m/z axis) of their fragments.]

Fragments to puzzle the metabolite structure

slide-17
SLIDE 17

Bacteria, fungi, and plants produce a large and diverse arsenal of high-value molecules:

  • doxorubicin (chemotherapeutic agent)
  • rapamycin (immunosuppressant)
  • lovastatin (cholesterol-lowering agent)
  • spinosad (insecticide)
  • pneumocandin (antifungal)
  • vancomycin (antibiotic)

The challenge is the large-scale coupling of spectral data (mass spectrometry fragmentation spectra) to the molecular structures of known and especially novel natural product molecules.

slide-18
SLIDE 18

But how similar are they?

Spectral similarity

slide-21
SLIDE 21

…likes cake with a cappuccino.
…loves to have a cookie and a coffee.

How similar are they?

What does similar mean?

  • number of words?
  • number of characters?
  • grammatical structure?
  • meaning?
  • style?
  • topic?
  • phonetic structure?
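
To make the question concrete, here is a minimal sketch that scores the two example sentences by two of the criteria listed above. The helper names `tokenize`, `word_overlap`, and `length_ratio` are hypothetical illustrations, not part of the talk, and the leading ellipsis of each sentence is dropped.

```python
import re


def tokenize(sentence):
    """Lowercase a sentence and split it into alphabetic words."""
    return re.findall(r"[a-z]+", sentence.lower())


def word_overlap(a, b):
    """Jaccard index of the two word sets (shared words / all distinct words)."""
    words_a, words_b = set(tokenize(a)), set(tokenize(b))
    return len(words_a & words_b) / len(words_a | words_b)


def length_ratio(a, b):
    """Ratio of the shorter to the longer sentence, measured in characters."""
    return min(len(a), len(b)) / max(len(a), len(b))


s1 = "likes cake with a cappuccino."
s2 = "loves to have a cookie and a coffee."

print(f"word overlap: {word_overlap(s1, s2):.2f}")
print(f"length ratio: {length_ratio(s1, s2):.2f}")
```

The only word the two sentences share is "a", so the word overlap is low even though both sentences say nearly the same thing, while the length ratio is high without telling us anything about meaning.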

slide-22
SLIDE 22

…likes cake with a cappuccino.
…loves to have a cookie and a coffee.

[Figure: individual terms are labelled ‘word’; each full line is labelled ‘sentence’ (or ‘document’).]

slide-23
SLIDE 23

Count how often ‘words’ co-occur (find word ‘context’)

[Figure: symmetric N×N word co-occurrence matrix with one row and one column for every word in the corpus (e.g. sweet, cake, cookie) and example counts such as 9, 17, and 24. N is the number of words in the dictionary.]
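
The counting step can be sketched as follows. This is a minimal illustration under one assumption that is not stated on the slide: two words count as co-occurring when they appear in the same document (Word2Vec itself uses a sliding context window). The corpus and the function name `cooccurrence_counts` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count in how many documents each unordered pair of distinct words co-occurs."""
    counts = Counter()
    for doc in documents:
        # Sort the distinct words so every pair has one canonical key.
        words = sorted(set(doc.lower().split()))
        for pair in combinations(words, 2):
            counts[pair] += 1
    return counts


# Hypothetical toy corpus, not data from the talk.
corpus = [
    "sweet cake",
    "sweet cookie",
    "sweet cake cookie",
    "monster cookie",
]
counts = cooccurrence_counts(corpus)
for pair in [("cake", "sweet"), ("cookie", "sweet"), ("cake", "cookie")]:
    print(pair, counts[pair])
```

Each entry of `counts` corresponds to one off-diagonal cell of the N×N matrix; because the pairs are unordered, the matrix is symmetric.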

slide-24
SLIDE 24

‘Word2Vec’ → lower-dimensional context vector

[Figure: factorization of the N×N word co-occurrence matrix (sweet, cake, cookie, …) into a product of lower-dimensional matrices.]

slide-25
SLIDE 25

‘Word2Vec’ → lower dimensional context vector

[Figure: each word in the co-occurrence matrix (sweet, cake, cookie, …) is mapped to a context vector, e.g. $V_{\text{cookie}}$ and $V_{\text{cake}}$.]

slide-26
SLIDE 26

NLP → metabolomics: use peaks as words

[Figure: co-occurrence matrix with peak positions as ‘words’ (rows and columns m(A), m(Aa), m(A’’), …), from which peak vectors such as $V_A$ and $V_{Aa}$ are derived.]
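
One way to turn peaks into ‘words’ is sketched below. The slide only states that peak positions act as words; the binning by rounding m/z to a fixed number of decimals, the `peak@…` token format, the intensity filter, and the function name `spectrum_to_words` are all assumptions made for this illustration.

```python
def spectrum_to_words(peaks, decimals=1, min_intensity=0.0):
    """Turn (m/z, intensity) peaks into 'words' by rounding the peak position.

    Peaks below min_intensity are skipped; both parameters are illustrative.
    """
    return [
        f"peak@{mz:.{decimals}f}"
        for mz, intensity in sorted(peaks)
        if intensity >= min_intensity
    ]


# Hypothetical spectrum as (m/z, relative intensity) pairs, not data from the talk.
spectrum = [(128.07, 0.40), (85.03, 1.00), (57.07, 0.02)]
document = spectrum_to_words(spectrum, decimals=1, min_intensity=0.05)
print(document)
```

With this representation a spectrum becomes a ‘document’, so the same co-occurrence and embedding machinery used for text can be applied to a collection of spectra.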

slide-27
SLIDE 27

Spectral similarity measures.

NLP/word2vec based method

$V_{\text{spectrum1}} = V_A + V_B + V_{A'} + V_{B'} + \dots$

Each term on the right-hand side is a ‘word’ vector; their sum $V_{\text{spectrum1}}$ is the ‘document’ vector.

slide-28
SLIDE 28

Spectral similarity measures.

NLP/word2vec based method

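$V_{\text{spectrum1}} = V_A + V_B + V_{A'} + V_{B'} + \dots$
$V_{\text{spectrum2}} = V_{Aa} + V_{Bb} + V_{A'} + V_{Bb'} + \dots$

Each term on the right-hand side is a ‘word’ vector; each sum is a ‘document’ vector.

$$\text{Similarity} = \cos(\theta) = \frac{V_{\text{spectrum1}} \cdot V_{\text{spectrum2}}}{\lVert V_{\text{spectrum1}} \rVert \, \lVert V_{\text{spectrum2}} \rVert}$$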
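
The two steps on this slide, summing ‘word’ vectors into a ‘document’ vector and comparing documents by the cosine of their angle, can be sketched in plain Python. The 2-dimensional peak vectors below are hypothetical stand-ins; in practice they would be learned by a Word2Vec model.

```python
import math


def document_vector(words, word_vectors):
    """Sum the vectors of all words (peaks) that occur in one document (spectrum)."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for word in words:
        for i, value in enumerate(word_vectors[word]):
            total[i] += value
    return total


def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (|u| |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical peak vectors: A/Aa and B/Bb are assumed to have similar contexts.
word_vectors = {
    "A": [1.0, 0.0],
    "Aa": [0.9, 0.1],
    "B": [0.0, 1.0],
    "Bb": [0.1, 0.9],
}
v1 = document_vector(["A", "B"], word_vectors)
v2 = document_vector(["Aa", "Bb"], word_vectors)
print(cosine_similarity(v1, v2))
```

In this toy case the two spectra share no peak at all, yet their similarity is close to 1 because their peaks have similar vectors. A score based only on exactly matching peaks would treat them as unrelated.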

slide-29
SLIDE 29

Spectral similarity measures: evaluation.

Dataset: 11,000 spectra with known molecular structures

[Figure: illustrative (fake) spectra with example similarity scores 0.85, 0.23, and 0.13.]

slide-30
SLIDE 30

Molecular similarity scores: 10,000 highest ‘classical’ scores*

* = scores > 0.998

[Figure: matrix of pairwise scores between spectra, with spectrum IDs 1, 2, 3, 4, 5, 6, … on both axes.]

[Figure: histogram of reference scores (molecular similarity scores; circular fingerprint: Morgan3 / ECFP6) for the 10,000 best-scoring pairs by classical score, ranging from low to high molecular similarity and annotated with 16%.]

slide-31
SLIDE 31

Molecular similarity scores: 10,000 highest NLP-based scores*

* = scores > 0.84

[Figure: matrix of pairwise scores between spectra, with spectrum IDs 1, 2, 3, 4, 5, 6, … on both axes.]

[Figure: histogram of reference scores (molecular similarity scores; circular fingerprint: Morgan3 / ECFP6) for the 10,000 best-scoring pairs by NLP-based score, ranging from low to high molecular similarity and annotated with 73%.]
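
The evaluation idea, checking how many of the best-scoring spectrum pairs also have similar molecular structures, can be sketched as follows. This is a simplified illustration, not the authors' pipeline: fingerprints are modelled as plain sets of on-bits (real Morgan/ECFP fingerprints would come from a cheminformatics toolkit such as RDKit), and the data, the function names, and the 0.5 threshold are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity of two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)


def fraction_similar(spectral_scores, fingerprints, top_k, threshold):
    """Share of the top_k best-scoring spectrum pairs whose molecules are similar."""
    best = sorted(spectral_scores, key=spectral_scores.get, reverse=True)[:top_k]
    hits = sum(
        tanimoto(fingerprints[i], fingerprints[j]) >= threshold for i, j in best
    )
    return hits / len(best)


# Hypothetical toy data: fingerprints per spectrum ID and a score per spectrum pair.
fingerprints = {1: {1, 2, 3, 4}, 2: {1, 2, 3, 5}, 3: {7, 8, 9}, 4: {1, 2, 3, 4, 5}}
spectral_scores = {(1, 2): 0.95, (1, 3): 0.90, (1, 4): 0.85, (2, 3): 0.10}

print(fraction_similar(spectral_scores, fingerprints, top_k=3, threshold=0.5))
```

Running the same function once with classical scores and once with NLP-based scores as `spectral_scores` gives two fractions that can be compared directly, which is the kind of comparison the two histograms make.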

slide-32
SLIDE 32

Spectral similarity measures: examples.

[Figure: query molecule (spectrum ID: 3351) and its 9 closest candidates, numbered 1 to 9, according to molecular networking similarity; two candidates are marked ‘bad’.]

slide-33
SLIDE 33

Spectral similarity measures: examples.

[Figure: query molecule (spectrum ID: 3351) and its 9 closest candidates, numbered 1 to 9, according to Word2vec-based spectral similarity.]

slide-34
SLIDE 34

RSEs creating unique links

  • RSEs work in teams with a broad range of expertise and backgrounds.
  • RSEs work on projects from different scientific domains.

→ This creates opportunities unlike anywhere else in the academic setting!

  • Transfer methods/techniques between domains.
  • Spot potential synergies between (sub-)fields.

slide-35
SLIDE 35

Interested in Research Software?

The Netherlands eScience Center is the Dutch national center of excellence for the development and application of research software to advance academic research.

Join the team!

+31 (0)20 460 4770
www.esciencecenter.nl
blog.esciencecenter.nl
n.renaud@esciencecenter.nl

Florian Huber: @me_datapoint
Carlos Martinez-Ortiz: @neocarlitos