RDKit (cheminformatics) Neo4j Integration
Mentors: Christian Pilger (BASF), Greg Landrum (RDKit), Stefan Armbruster (Neo4j)
Presenter: Evgeny Sorokin
Motivation
Neo4j = useful tool to map knowledge
Chemical/pharmaceutical R&D:
○ Required: mapping data of completely different nature (recipe, process, application test, chemical structures)
○ Knowledge graphs are frequently a better choice here than other data models
○ Problem: Neo4j does not support chemical structures
RDKit:
○ is a widely used open-source tool for working with chemical structures
○ has proven its value in conjunction with Postgres
○ SMILES format: c1ccccc1 (single-line ASCII representation --> exact search via string matching)
○ MOL format: (atom coordinates and an explicit bond table: richer format, more details --> advantages in substructure searches)
name: benzene
formula: C6H6
SMILES: c1ccccc1
○ Exact chemical search ("find the molecule benzene")
○ Chemical substructure search ("find all molecules that contain a benzene moiety")
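The substructure case can be sketched with RDKit's Python API. This is only an illustration of the idea, not the integration's actual code: the `LIBRARY` dictionary and the `contains_substructure` helper are hypothetical names introduced here, while `Chem.MolFromSmiles` and `HasSubstructMatch` are standard RDKit calls.

```python
from rdkit import Chem

# Hypothetical mini-library of molecules (name -> SMILES), for illustration only.
LIBRARY = {
    "benzene": "c1ccccc1",
    "toluene": "Cc1ccccc1",
    "phenol": "Oc1ccccc1",
    "cyclohexane": "C1CCCCC1",
}


def contains_substructure(smiles, pattern_smiles):
    """Return True if the molecule contains the pattern as a substructure."""
    mol = Chem.MolFromSmiles(smiles)
    pattern = Chem.MolFromSmiles(pattern_smiles)
    if mol is None or pattern is None:
        raise ValueError("could not parse SMILES")
    return mol.HasSubstructMatch(pattern)


# "Find all molecules that contain a benzene moiety"
hits = sorted(
    name for name, smi in LIBRARY.items()
    if contains_substructure(smi, "c1ccccc1")
)
print(hits)
```

Cyclohexane is not returned because its ring is saturated, so it has no aromatic ring to match the benzene pattern.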
○ Find entry points into the graph
○ Filter paths during graph traversal with chemical structure conditions
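Simple case: compare two canonical SMILES strings with each other to find a match. The same molecule can be written as several different SMILES strings, so both sides must be canonicalized before the string comparison.
SMILES: O=S(=O)(Cc1ccccc1)CS(=O)(=O)Cc1ccccc1
Canonical SMILES: O=S(=O)(CC1=CC=CC=C1)CS(=O)(=O)CC1=CC=CC=C1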
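A minimal sketch of exact search via canonicalization, assuming RDKit's Python API is used for the conversion. The helper name `canonical_smiles` is hypothetical; the exact canonical string depends on the toolkit and its settings, so the sketch only checks that both spellings collapse to the same string.

```python
from rdkit import Chem


def canonical_smiles(smiles):
    """Parse a SMILES string and write it back in RDKit's canonical form."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    return Chem.MolToSmiles(mol)


# The two spellings of the same molecule from the slide.
aromatic = "O=S(=O)(Cc1ccccc1)CS(=O)(=O)Cc1ccccc1"
kekule = "O=S(=O)(CC1=CC=CC=C1)CS(=O)(=O)CC1=CC=CC=C1"

# The raw strings differ, but their canonical forms are identical,
# so an exact search reduces to plain string equality.
same_molecule = canonical_smiles(aromatic) == canonical_smiles(kekule)
print(aromatic == kekule, same_molecule)
```

Storing the canonical string as a node property would let an ordinary string index answer exact-match queries.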
A chemical fingerprint is a characteristic pattern that encodes the presence of particular structural features in a molecule. There are bitvector and count-based fingerprints.
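The difference between the two fingerprint kinds can be illustrated without any chemistry toolkit. In this toy sketch the integer feature ids, the tiny size `N_BITS`, and both function names are assumptions made for illustration; real fingerprints derive the features from the molecular structure.

```python
from collections import Counter

N_BITS = 8  # deliberately tiny; an assumption for this illustration


def bit_fingerprint(feature_ids, n_bits=N_BITS):
    """Bitvector fingerprint: records only WHETHER a feature is present."""
    bits = [0] * n_bits
    for feature in feature_ids:
        bits[feature % n_bits] = 1
    return bits


def count_fingerprint(feature_ids, n_bits=N_BITS):
    """Count-based fingerprint: records HOW OFTEN each feature occurs."""
    return dict(Counter(feature % n_bits for feature in feature_ids))


# Hypothetical feature ids for one molecule; feature 3 occurs three times.
features = [3, 3, 3, 5, 10]

print(bit_fingerprint(features))    # [0, 0, 1, 1, 0, 1, 0, 0]
print(count_fingerprint(features))  # {3: 3, 5: 1, 2: 1}
```

The bitvector loses the information that feature 3 occurred three times, while the count-based form keeps it at the cost of storing more data.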
1. Each of the structures is encoded as a bitvector fingerprint
2. Bitvectors are transformed into a string of positive indexes
3. A fulltext index is applied to the transformed bitvectors (numbers -> words)
4. Search is done with constraints regarding specific properties.
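Steps 2 and 3 above can be sketched in plain Python. This is an illustration under assumptions, not the plugin's code: positions are taken as 1-based, the function names are hypothetical, and a set-inclusion check stands in for the fulltext index query.

```python
def bits_to_index_string(bits):
    """Turn a bitvector into a space-separated string of 1-based on-bit positions."""
    return " ".join(str(pos) for pos, bit in enumerate(bits, start=1) if bit)


def may_contain(candidate_words, query_words):
    """Screening step: every 'word' of the query must occur in the candidate."""
    return set(query_words.split()) <= set(candidate_words.split())


# Hypothetical stored molecules with toy 9-bit fingerprints.
stored = {
    "mol_a": bits_to_index_string([1, 0, 1, 0, 1, 1, 0, 0, 1]),
    "mol_b": bits_to_index_string([1, 0, 0, 0, 1, 0, 0, 0, 0]),
}
query = bits_to_index_string([1, 0, 0, 0, 1, 0, 0, 0, 1])

hits = [name for name, words in stored.items() if may_contain(words, query)]
print(stored["mol_a"])  # 1 3 5 6 9
print(query)            # 1 5 9
print(hits)             # ['mol_a']
```

Because different structures can set the same bits, a fingerprint screen of this kind only narrows the candidate set; the surviving candidates still need a real substructure match to confirm the hit.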
2.) Expand path: apoc.path.expand
1) Hard: which format can represent a chemical structure that has chirality (e.g. lactic acid)?
a) SMILES
b) MOL block
c) All of the above
2) Medium: how does a bitvector fingerprint differ from a count-based fingerprint?
a) Harder to store
b) Does not support similarity search
c) Does not keep track of the number of occurrences
3) Easy: the transformation of the bitvector [1 0 1 0 1 1 0 0 1] is:
a) "1 3 5 6 9"
b) "2 4 6 7 9"
c) "1 3 5 6 9"