RDKit (cheminformatics) Neo4j Integration



SLIDE 1

RDKit (cheminformatics) Neo4j Integration

Mentors: Christian Pilger (BASF) Greg Landrum (RDKit) Stefan Armbruster (Neo4j) Presenter - Evgeny Sorokin

SLIDE 2

Motivation

  • Neo4j = useful tool to map knowledge
  • Chemical/pharmaceutical R&D:

    ○ Required: mapping data of completely different kinds (recipe, process, application test, chemical structures)
    ○ Knowledge graphs are frequently a better choice here than other data models
    ○ Problem: Neo4j does not support chemical structures

  • RDKit

    ○ is a widely used open-source toolkit for working with chemical structures
    ○ has proven its value in conjunction with Postgres

  • Idea: enrich Neo4j's capabilities by combining it with RDKit => GSoC project
SLIDE 3

Chemical structure representation

  • Not intended: “dissolve” atoms and bonds as nodes and relationships into the graph!
  • Intended: use available structure representation as node properties

    ○ SMILES format: c1ccccc1 (single-line ASCII representation --> exact search via string matching)
    ○ MOL format: (2D or 3D atom coordinates: richer format, more details --> advantages in substructure searches)

SLIDE 4

Chemical structures example

name: benzene
formula: C6H6
SMILES: c1ccccc1

SLIDE 5

Requirements

  • Basic functionality:

    ○ Exact chemical search (“find the molecule benzene”)
    ○ Chemical substructure search (“find all molecules that contain a benzene moiety”)

  • Typical application scenarios in Graph context

    ○ Find entry points into the graph
    ○ Filter paths during graph traversal with chemical structure conditions
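The second scenario, filtering paths during traversal by a chemical structure condition, can be sketched in plain Python. Everything named here is hypothetical: the toy graph, the `has_benzene` flag (a stand-in for a real RDKit substructure check) and the `expand_paths` helper, which is not the Neo4j or APOC API.

```python
from collections import deque

# Hypothetical toy graph: a recipe uses two structures, each linked to a test result.
GRAPH = {
    "recipe_1": ["mol_a", "mol_b"],
    "mol_a": ["test_1"],
    "mol_b": ["test_2"],
}

# Hypothetical precomputed flag standing in for a substructure check on the node.
PROPS = {
    "mol_a": {"has_benzene": True},
    "mol_b": {"has_benzene": False},
}


def node_ok(node):
    """Accept nodes without a chemical flag; reject structures lacking the moiety."""
    return PROPS.get(node, {}).get("has_benzene", True)


def expand_paths(graph, start, accept, max_depth=3):
    """Breadth-first path expansion that only continues through accepted nodes."""
    paths = []
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        if len(path) > 1:
            paths.append(path)
        if len(path) - 1 >= max_depth:
            continue  # depth limit reached, do not expand further
        for neighbour in graph.get(path[-1], []):
            if neighbour not in path and accept(neighbour):
                queue.append(path + [neighbour])
    return paths


if __name__ == "__main__":
    for p in expand_paths(GRAPH, "recipe_1", node_ok):
        print(" -> ".join(p))
```

Here `recipe_1` plays the role of the entry point, and the branch through `mol_b` is pruned because it fails the chemical condition.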

SLIDE 6

How was it implemented - storage in a graph

  • A new node with the labels :Chemical:Structure is processed by the RDKit event handler
  • From either the smiles or the mdlmol property, a list of 7-8 properties is created per node:

    ○ canonical_smiles
    ○ inchi
    ○ formula
    ○ molecular_weight
    ○ fp - bitvector fingerprint
    ○ fp_ones - count of positive bits
    ○ mdlmol
    ○ smiles [optional]

  • A full-text index is created for the fingerprint property
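To make a few of the derived properties concrete, here is a minimal sketch of how `formula`, `molecular_weight` and `fp_ones` relate to their inputs for the benzene example. In the extension these values come from RDKit. The small atomic-weight table, the bracket-free formula parser, the `node_properties` helper and the toy fingerprint below are assumptions made only so the sketch runs without RDKit.

```python
import re

# Rounded standard atomic weights for a few elements (assumption: enough for this sketch).
ATOMIC_WEIGHTS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "S": 32.06}


def parse_formula(formula):
    """Split a simple formula such as 'C6H6' into element counts (no brackets)."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def molecular_weight(formula):
    """Sum atomic weights over all atoms in the formula."""
    return sum(ATOMIC_WEIGHTS[s] * n for s, n in parse_formula(formula).items())


def node_properties(name, formula, fp_bits):
    """Hypothetical helper: assemble a few of the per-node properties listed above."""
    return {
        "name": name,
        "formula": formula,
        "molecular_weight": round(molecular_weight(formula), 3),
        "fp_ones": sum(fp_bits),  # count of positive bits
    }


# The bit list is a toy fingerprint, not benzene's real one.
benzene = node_properties("benzene", "C6H6", [1, 0, 1, 0, 1, 1, 0, 0, 1])

if __name__ == "__main__":
    print(benzene)
```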
SLIDE 7

How was it implemented - exact search

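Simple case: compare two canonical SMILES with each other and look for a match. Different SMILES strings can describe the same molecule, so both sides are canonicalized before the comparison.

SMILES: O=S(=O)(Cc1ccccc1)CS(=O)(=O)Cc1ccccc1
Canonical SMILES: O=S(=O)(CC1=CC=CC=C1)CS(=O)(=O)CC1=CC=CC=C1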
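Once every structure has a canonical form, exact search reduces to a string lookup. The sketch below shows only that lookup step. The `CANONICAL_FORM` table is a stand-in for RDKit's canonicalization so the code runs without RDKit installed; it contains just the SMILES pair from this slide, and `build_index` and `exact_search` are hypothetical helper names.

```python
# The canonical form as given on the slide.
CANONICAL = "O=S(=O)(CC1=CC=CC=C1)CS(=O)(=O)CC1=CC=CC=C1"

# Stand-in for RDKit canonicalization (assumption): a real setup would compute this.
CANONICAL_FORM = {
    "O=S(=O)(Cc1ccccc1)CS(=O)(=O)Cc1ccccc1": CANONICAL,
    CANONICAL: CANONICAL,
}


def canonicalize(smiles):
    """Map an input SMILES to its canonical form via the stand-in table."""
    return CANONICAL_FORM[smiles]


def build_index(nodes):
    """Index node ids by the canonical SMILES of their structure."""
    index = {}
    for node_id, smiles in nodes.items():
        index.setdefault(canonicalize(smiles), []).append(node_id)
    return index


def exact_search(index, query_smiles):
    """Canonicalize the query, then find matches by plain string equality."""
    return index.get(canonicalize(query_smiles), [])


index = build_index({"node_1": "O=S(=O)(Cc1ccccc1)CS(=O)(=O)Cc1ccccc1"})

if __name__ == "__main__":
    print(exact_search(index, CANONICAL))
```

Both spellings of the molecule find `node_1`, because the comparison happens on the canonical string rather than on the text the user typed.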

SLIDE 8

How was it implemented - substructure search (SSS)

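A chemical fingerprint is a pattern that encodes which structural features are present in a molecule. It is a compact summary rather than a unique identifier: different molecules can share bits. There are bitvector and count-based fingerprints.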
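The difference between the two fingerprint kinds can be shown with a toy hashing scheme. This is a sketch under stated assumptions, not RDKit's fingerprint algorithm: the feature labels, the 16-bit length and the use of `zlib.crc32` as the hash are all chosen for illustration.

```python
import zlib
from collections import Counter

N_BITS = 16  # tiny on purpose; real fingerprints use far more bits


def bit_index(feature):
    """Hash a feature label to a bit position (crc32 is stable across runs)."""
    return zlib.crc32(feature.encode("utf-8")) % N_BITS


def bitvector_fingerprint(features):
    """Bit i is 1 if any feature hashes to i; repeats leave no trace."""
    bits = [0] * N_BITS
    for feature in features:
        bits[bit_index(feature)] = 1
    return bits


def count_fingerprint(features):
    """Position i maps to how many features hashed to i."""
    return Counter(bit_index(feature) for feature in features)


if __name__ == "__main__":
    # Hypothetical feature labels; a real toolkit derives them from the structure.
    features = ["c:c", "c:c", "c:c", "C-S", "S=O", "S=O"]
    print(bitvector_fingerprint(features))
    print(dict(count_fingerprint(features)))
```

The bitvector only records that a feature occurs, while the count-based variant also records how often.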

SLIDE 9

How was it implemented - SSS

1. Each of the structures is encoded as a bitvector fingerprint
2. Bitvectors are transformed into a string of positive indexes
3. A full-text index is applied to the transformed bitvectors (numbers -> words)
4. Search is done with constraints regarding specific properties
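Steps 2 to 4 can be sketched in plain Python. Assumptions: positions are 1-based, the set comparison stands in for the full-text index query, and the `fp_ones` pre-filter is one plausible reading of "constraints regarding specific properties". The node data and function names are hypothetical.

```python
def bits_to_words(bits):
    """Step 2: turn a bitvector into a string of 1-based positions of set bits."""
    return " ".join(str(i) for i, bit in enumerate(bits, start=1) if bit)


def passes_screen(query_words, candidate_words):
    """Every on-bit of the query must also be set in the candidate."""
    return set(query_words.split()) <= set(candidate_words.split())


def search(query_bits, nodes):
    """Steps 3-4: keep nodes with enough on-bits whose words contain the query's."""
    query_words = bits_to_words(query_bits)
    query_ones = sum(query_bits)
    return [
        node["name"]
        for node in nodes
        if node["fp_ones"] >= query_ones and passes_screen(query_words, node["fp"])
    ]


# Hypothetical stored nodes with toy fingerprints.
NODES = [
    {"name": "A", "fp": bits_to_words([1, 0, 1, 0, 1, 1, 0, 0, 1]), "fp_ones": 5},
    {"name": "B", "fp": bits_to_words([1, 1, 0, 0, 0, 0, 0, 0, 0]), "fp_ones": 2},
]

if __name__ == "__main__":
    print(search([1, 0, 1, 0, 0, 0, 0, 0, 0], NODES))
```

A fingerprint screen of this kind is a necessary condition only: it can let through candidates that do not contain the query, so surviving hits still need a real substructure match to confirm them.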

SLIDE 10

How was it implemented - SSS

SLIDE 11

Chemical reactions’ relationships

SLIDE 12

What are possible applications

2.) expand path apoc.path.expand

SLIDE 13

Resources

  • https://github.com/rdkit/neo4j-rdkit
  • https://www.rdkit.org/UGM/2012/Landrum_RDKit_UGM.Fingerprints.Final.pptx.pdf
  • https://www.rdkit.org/
  • https://neo4j.com/docs/cypher-manual/current/schema/index/
  • http://tiny.cc/mol_block_definition
  • @evgerher via Telegram
SLIDE 14

Hunger games Q&A

1) Hard: which format can represent a chemical structure that has chirality (e.g. lactic acid)?

    a) SMILES
    b) MOL block
    c) All of the above

2) Medium: how does a bitvector fingerprint differ from a count-based one?

    a) It is harder to store
    b) It does not support similarity search
    c) It does not keep track of occurrence counts

3) Easy: the transformation of the bitvector [1 0 1 0 1 1 0 0 1] is:

    a) “1 3 5 6 9”
    b) “2 4 6 7 9”
    c) “1 3 5 6 9”