RDKit (cheminformatics) Neo4j Integration



  1. RDKit (cheminformatics) Neo4j Integration
     Presenter: Evgeny Sorokin
     Mentors: Christian Pilger (BASF), Greg Landrum (RDKit), Stefan Armbruster (Neo4j)

  2. Motivation
     ● Neo4j = useful tool to map knowledge
     ● Chemical/pharmaceutical R&D:
        ○ Required: mapping data of completely different nature (recipe, process, application test, chemical structures)
        ○ Knowledge graphs are frequently a good choice here over other data models
        ○ Problem: Neo4j does not support chemical structures
     ● RDKit
        ○ is a widely used Open Source tool to deal with chemical structures
        ○ has proven its value in conjunction with Postgres
     ● Idea: enrich Neo4j's capabilities by combining it with RDKit => GSoC project

  3. Chemical structure representation
     ● Not intended: “dissolve” atoms and bonds as nodes and relations into the graph!
     ● Intended: use available structure representations as node properties
        ○ SMILES format: c1ccccc1 (single-line ASCII representation --> exact search via string matching)
        ○ MOL format: (3D coordinates: richer format, more details --> advantages in substructure searches)
     Example node properties: name: benzene, formula: C6H6, SMILES: c1ccccc1

  4. Chemical structures example
     ● name: benzene
     ● formula: C6H6
     ● SMILES: c1ccccc1

  5. Requirements
     ● Basic functionality:
        ○ Exact chemical search (“find the molecule benzene”)
        ○ Chemical substructure search (“find all molecules that contain a benzene moiety”)
     ● Typical application scenarios in graph context
        ○ Find entry points into the graph
        ○ Filter paths during graph traversal with chemical structure conditions

  6. How was it implemented - storage in a graph
     ● A new node with labels :Chemical:Structure is processed by the RDKit event handler
     ● From either the smiles or the mdlmol property, a list of 7-8 properties is created per node:
        ○ canonical_smiles
        ○ inchi
        ○ formula
        ○ molecular_weight
        ○ fp - bit-vector fingerprint
        ○ fp_ones - count of positive bits
        ○ mdlmol
        ○ smiles [optional]
     ● A full-text index is created for the fingerprint property
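
The two fingerprint-derived properties can be illustrated in a few lines of Python. This is a minimal sketch, not the plugin's code: the helper name `fingerprint_properties` is hypothetical, and the encoding of `fp` as space-separated 1-based indexes is an assumption (the quiz on the last slide suggests 1-based indexes).

```python
def fingerprint_properties(bits):
    """Derive illustrative `fp` and `fp_ones` values from a list of 0/1 bits.

    Assumption: `fp` holds the space-separated, 1-based indexes of the set bits,
    and `fp_ones` is the number of set bits.
    """
    on_indexes = [i for i, bit in enumerate(bits, start=1) if bit]
    return {
        "fp": " ".join(str(i) for i in on_indexes),
        "fp_ones": len(on_indexes),
    }


props = fingerprint_properties([1, 0, 1, 0, 1, 1, 0, 0, 1])
print(props)  # {'fp': '1 3 5 6 9', 'fp_ones': 5}
```

Storing the count next to the bit pattern gives queries a cheap numeric property to filter on before any bit comparison.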

  7. How was it implemented - exact search
     Simple case: compare two canonical SMILES with each other, find a match.
     ● SMILES: O=S(=O)(Cc1ccccc1)CS(=O)(=O)Cc1ccccc1
     ● Canonical SMILES: O=S(=O)(CC1=CC=CC=C1)CS(=O)(=O)CC1=CC=CC=C1
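
Why the comparison has to run on canonical strings can be sketched as follows. This is an illustration under stated assumptions: the `canonicalize` function is a hypothetical stand-in (a lookup table holding only the pair from the slide) for the canonicalization that RDKit performs, and the in-memory `nodes` list stands in for graph nodes with a `canonical_smiles` property.

```python
# Hypothetical stand-in for RDKit canonicalization: only knows the slide's example.
CANONICAL = {
    "O=S(=O)(Cc1ccccc1)CS(=O)(=O)Cc1ccccc1":
        "O=S(=O)(CC1=CC=CC=C1)CS(=O)(=O)CC1=CC=CC=C1",
}


def canonicalize(smiles):
    # A real system would delegate to RDKit; unknown inputs pass through unchanged.
    return CANONICAL.get(smiles, smiles)


# Illustrative nodes; canonical_smiles is computed once, when the node is stored.
nodes = [
    {"id": 1, "canonical_smiles": "O=S(=O)(CC1=CC=CC=C1)CS(=O)(=O)CC1=CC=CC=C1"},
    {"id": 2, "canonical_smiles": "CCO"},
]


def exact_search(query_smiles, nodes):
    """Canonicalize the query, then match by plain string equality."""
    target = canonicalize(query_smiles)
    return [n["id"] for n in nodes if n["canonical_smiles"] == target]


query = "O=S(=O)(Cc1ccccc1)CS(=O)(=O)Cc1ccccc1"
raw_hits = [n["id"] for n in nodes if n["canonical_smiles"] == query]
hits = exact_search(query, nodes)
print(raw_hits, hits)  # [] [1]
```

The raw query string finds nothing because the same molecule can be written in several ways; once both sides use the same canonical form, string equality is enough.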

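  8. How was it implemented - SSS (substructure search)
     ● A chemical fingerprint is a pattern that encodes which structural features are present in a molecule. It is characteristic of the molecule but not guaranteed to be unique: different molecules can share bits.
     ● Two kinds: bitvector and count-based fingerprints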
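
The difference between the two kinds of fingerprint can be shown with a toy example. This is a sketch only: real fingerprints hash automatically enumerated substructures into a fixed-length vector, whereas here the feature list `FEATURE_KEYS` and the hand-listed features of the molecule are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical fixed feature vocabulary; one vector position per feature.
FEATURE_KEYS = ["benzene ring", "sulfonyl group", "hydroxyl group", "carbonyl group"]


def count_fingerprint(features):
    """Count-based fingerprint: how many times each feature occurs."""
    counts = Counter(features)
    return [counts[key] for key in FEATURE_KEYS]


def bit_fingerprint(features):
    """Bitvector fingerprint: only whether each feature occurs at all."""
    return [1 if count > 0 else 0 for count in count_fingerprint(features)]


# Hand-listed features of O=S(=O)(Cc1ccccc1)CS(=O)(=O)Cc1ccccc1 (illustrative).
features = ["benzene ring", "sulfonyl group", "benzene ring", "sulfonyl group"]

counts = count_fingerprint(features)
bits = bit_fingerprint(features)
print(counts)  # [2, 2, 0, 0]
print(bits)    # [1, 1, 0, 0]
```

The bitvector drops the number of occurrences, which makes it compact and easy to index as a set of "on" positions.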

  9. How was it implemented - SSS
     1. Each of the structures is encoded as a bitvector fingerprint
     2. Bitvectors are transformed into a string of positive indexes
     3. A full-text index is applied to the transformed bitvectors (numbers -> words)
     4. Search is done with constraints regarding specific properties
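
The screening idea behind these steps can be sketched in plain Python. This is not the plugin's implementation: the in-memory `nodes` list replaces the full-text index, the function names are hypothetical, and using `fp_ones` as the property constraint in step 4 is an assumption. In general practice a fingerprint screen is a necessary but not sufficient condition, so surviving candidates still need a real substructure match.

```python
def tokens(fp):
    """Split a stored fingerprint string into its set of index 'words'."""
    return set(fp.split())


def screen_candidates(query_fp, nodes):
    """Return names of nodes whose fingerprint contains every bit of the query."""
    query_tokens = tokens(query_fp)
    hits = []
    for node in nodes:
        # Cheap pre-filter: a match cannot have fewer set bits than the query.
        if node["fp_ones"] < len(query_tokens):
            continue
        # Every query word must appear in the candidate's fingerprint string.
        if query_tokens <= tokens(node["fp"]):
            hits.append(node["name"])
    return hits


# Illustrative data, not real fingerprints.
nodes = [
    {"name": "A", "fp": "1 3 5 6 9", "fp_ones": 5},
    {"name": "B", "fp": "1 3 7", "fp_ones": 3},
    {"name": "C", "fp": "3 5", "fp_ones": 2},
]

candidates = screen_candidates("3 5", nodes)
print(candidates)  # ['A', 'C']
```

Treating each index as a word is what lets a text index answer "contains all of these bits" queries.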

  10. How was it implemented - SSS

  11. Chemical reactions’ relationships

  12. What are possible applications
      2.) Expand path: apoc.path.expand

  13. Resources
      ● https://github.com/rdkit/neo4j-rdkit
      ● https://www.rdkit.org/UGM/2012/Landrum_RDKit_UGM.Fingerprints.Final.pptx.pdf
      ● https://www.rdkit.org/
      ● https://neo4j.com/docs/cypher-manual/current/schema/index/
      ● http://tiny.cc/mol_block_definition
      ● @evgerher via telegram

  14. Hunger games Q&A
      1) Hard: what format can resolve a situation when a chemical structure has a chirality property (e.g. lactic acid)?
         a) SMILES
         b) MOL block
         c) All of the above
      2) Medium: what is the difference between bitvector and count-based fingerprints?
         a) Harder to store
         b) Does not support similarity search
         c) Does not keep track of occurrence amount
      3) Easy: transformation of bitvector [1 0 1 0 1 1 0 0 1] is:
         a) “1 3 5 6 9”
         b) “2 4 6 7 9”
         c) “1 3 5 6 9”
