A large-scale chemical data integration system
Gaia Paolini, Pfizer
SLIDE 1
SLIDE 2
Large-Scale Chemical Data Integration
Summary
Current situation
Business case
Aims
The design process
Functionality
Applications
SLIDE 3
The Project
A large chemical data warehouse to store and integrate Pfizer and third-party information, using chemical structure as the natural entry point
Millions of chemical structures
Based on the DayCart Oracle Cartridge
SLIDE 4
Why Integrate?
In-house data
In-licensing
Mergers & acquisitions
3rd-party databases
The need to integrate and mine disparate sources of data
SLIDE 5
Why Integrate?
Data available to buy and integrate from external sources
Need for active chemoinformatics research repository
Opportunity to highlight connections:
Chemical properties
Structural similarities
SLIDE 6
Aims of the Data Warehouse
Enable chemical/pharmaceutical data mining and knowledge discovery
Store chemical structures and properties together with related entities:
Biology
Portfolio
Inventory
SLIDE 7
Scope
Data warehouse:
Common consolidated set of data
Repository of selected fields from Pfizer and third-party data
Source independent
Chemo-centric: indexed on structure, not compound ID
Emphasis on data integration rather than front-end client application
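To make "indexed on structure, not compound ID" concrete, here is a minimal sketch in Python. The record layout, source names, compound IDs and property values are all invented for illustration, and the sketch assumes the SMILES strings were already canonicalised upstream so that identical structures have identical strings. It does not show how the warehouse itself is implemented.

```python
from collections import defaultdict

# Hypothetical records from two sources. IDs and values are made up.
# Assumption: SMILES are already canonical, so equal structures compare equal.
records = [
    {"source": "in_house", "compound_id": "INHOUSE-1", "smiles": "CCO", "property": "assay_a", "value": 1.0},
    {"source": "vendor_x", "compound_id": "VX-77", "smiles": "CCO", "property": "assay_b", "value": 2.0},
    {"source": "vendor_x", "compound_id": "VX-78", "smiles": "c1ccccc1", "property": "assay_b", "value": 3.0},
]


def index_by_structure(rows):
    """Group rows under their structure key, ignoring source-specific IDs."""
    index = defaultdict(list)
    for row in rows:
        index[row["smiles"]].append(row)
    return dict(index)


structure_index = index_by_structure(records)

# The same structure is reachable once, with data from both sources attached.
sources_for_ethanol = sorted(r["source"] for r in structure_index["CCO"])
```

Keying on the structure means two suppliers' identifiers for the same molecule collapse into one entry, which is what lets data from different sources be compared directly.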
SLIDE 8
Requirements
Unique chemical structure indexing
Multiple and hierarchical tautomeric and stereochemical indexing
Integrate internal and external data
Indexed by chemical structure
Integrate chemo- and bio-informatics communities
Fit-for-purpose model architecture
Uses corporate dictionaries to standardise entities
Create connections and synonym tables
SLIDE 9
What do we want from our data?
Data should be easy to access, compare, exchange and manipulate
[Figure labels: compound, target, phase, launched]
SLIDE 10
Why data integration?
SLIDE 11
Database Design Decisions
Central data warehouse
Selective data integration
Focus on chemical structure
SMILES representation in DayCart
Flexible compound wiring
SLIDE 12
Central Data Warehouse
Chemical Structure Integration
[Diagram labels: Pfizer and external structures; external staging tables; ETL with chemical structure integration; DrugStore warehouse; three DataMarts]
Data is decoded, loaded, cleaned and mapped
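The "decoded, loaded, cleaned and mapped" sequence can be sketched as a small pipeline. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 in place of Oracle, the table names `staging` and `warehouse`, the feed layout, the compound IDs and the synonym table are hypothetical, and the slides do not describe the real ETL code.

```python
import csv
import io
import sqlite3

# Hypothetical synonym table: source-specific names mapped to one dictionary term.
TARGET_SYNONYMS = {"kinase-x": "KINASE_X", "kinase x": "KINASE_X", "kinase_x": "KINASE_X"}

# Hypothetical external feed. All rows are invented for illustration.
RAW_FEED = """compound_id,smiles,target
EXT-1, CCO ,Kinase-X
EXT-2,,KINASE_X
EXT-3,c1ccccc1,kinase x
"""


def decode(text):
    """Decode: parse the delimited feed into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def load_staging(conn, rows):
    """Load: copy the decoded rows, unmodified, into a staging table."""
    conn.execute("CREATE TABLE staging (compound_id TEXT, smiles TEXT, target TEXT)")
    conn.executemany("INSERT INTO staging VALUES (:compound_id, :smiles, :target)", rows)


def clean_and_map(conn):
    """Clean and map: drop rows without a structure, standardise target names."""
    conn.execute("CREATE TABLE warehouse (compound_id TEXT, smiles TEXT, target TEXT)")
    staged = conn.execute("SELECT compound_id, smiles, target FROM staging").fetchall()
    for compound_id, smiles, target in staged:
        smiles = smiles.strip()
        if not smiles:
            continue  # a row with no structure cannot be indexed by structure
        target = target.strip()
        target = TARGET_SYNONYMS.get(target.lower(), target)
        conn.execute("INSERT INTO warehouse VALUES (?, ?, ?)", (compound_id.strip(), smiles, target))


conn = sqlite3.connect(":memory:")
load_staging(conn, decode(RAW_FEED))
clean_and_map(conn)
warehouse_rows = conn.execute(
    "SELECT compound_id, smiles, target FROM warehouse ORDER BY compound_id"
).fetchall()
```

Keeping the raw rows in staging before cleaning means a mapping rule can be corrected and re-run without fetching the source again.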
SLIDE 13
Selective data integration
Drug Store Database
[Diagram labels: Pfizer marts …, 3rd-party …, Contributed research…]
Flexible UI: Pipeline Pilot, Spotfire, ad-hoc queries and data mining
SLIDE 14
Data Integration
Consolidated, homogeneous set of data:
One index for every entity
One unit of measure for every property
We can:
Highlight connections between entities
Create new connections
Filter on properties
Interface to other databases
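"One unit of measure for every property" implies a conversion step at load time. A minimal sketch, assuming concentration-type properties are standardised to nanomolar; the choice of nM, the unit spellings and the function name are assumptions for illustration, not details from the slides.

```python
# Multipliers that convert a value in the given unit to nanomolar.
TO_NANOMOLAR = {"M": 1e9, "mM": 1e6, "uM": 1e3, "nM": 1.0, "pM": 1e-3}


def to_nanomolar(value, unit):
    """Return the concentration expressed in nM, or raise on an unknown unit."""
    try:
        factor = TO_NANOMOLAR[unit]
    except KeyError:
        raise ValueError(f"unknown concentration unit: {unit!r}") from None
    return value * factor


# Values reported in different units become directly comparable.
standardised = [to_nanomolar(2.5, "uM"), to_nanomolar(40.0, "nM")]
```

Rejecting unknown units, rather than passing values through, keeps unconverted numbers out of a column that is meant to hold a single unit.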
SLIDE 15
Chemo-centric design
Every entity and property is connected to a chemical structure
Seamless integration of different data sources
Can measure how a data source enriches chemical space
Consistent modelling of tautomers and stereoisomers
Easy to apply hierarchical order (e.g. parent-child)
Any (and multiple) grouping of structures allowed
Intuitive application of chemo-informatics methods
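One simple reading of "measure how a data source enriches chemical space" is to count the structures a new source contributes that the warehouse does not already hold. The sketch below takes that reading. The metric itself is an assumption (the slides do not define one), and it presumes structures are compared as canonical SMILES strings.

```python
def enrichment(warehouse_structures, source_structures):
    """Summarise how many structures in a source are new to the warehouse."""
    existing = set(warehouse_structures)
    incoming = set(source_structures)
    novel = incoming - existing
    return {
        "novel": len(novel),
        "overlap": len(incoming & existing),
        # Fraction of the source's distinct structures that are new.
        "novel_fraction": len(novel) / len(incoming) if incoming else 0.0,
    }


# Illustrative structures only.
report = enrichment({"CCO", "CCN"}, ["CCO", "c1ccccc1", "CCC"])
```

A set difference only counts exact structure matches; a fuller measure might also ask how similar the new structures are to existing ones.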
SLIDE 16
DayCart Oracle Cartridge
SMILES chemical representation
Structure comparison, transformation, manipulation
Fast data retrieval
SLIDE 17
DayCart: Chemical Representation
SMILES syntax support:
Compact, linear representation
Self-contained language
Computer-friendly & searchable
No proprietary data types!
SLIDE 18
DayCart: Functions for Chemical Information
Exact match
Substructure
Similarity
Tautomers
Salts
Stereochemistry
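The slides do not say how similarity is scored. A widely used measure for fingerprint-based similarity search is the Tanimoto coefficient, sketched here in plain Python on sets of "on" bit positions. This illustrates the measure only: it does not call DayCart, and the fingerprints are made-up bit sets.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient: shared on-bits divided by total distinct on-bits."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        return 0.0  # convention chosen for this sketch; other conventions exist
    return len(a & b) / len(a | b)


# Hypothetical fingerprints given as positions of set bits.
query = {1, 2, 3}
candidate = {2, 3, 4}
score = tanimoto(query, candidate)  # 2 shared bits out of 4 distinct bits
```

A similarity search then amounts to keeping the candidates whose score is at or above a chosen threshold.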
SLIDE 19
DayCart: Indexes
Four (domain) indexes:
DDBLOB: substructure, similarity
DDGRAPH: tautomers, stereochemistry
DDROLE: salts
DDEXACT: exact match
Essential for performance
Trade-off data-load/index building
Partitioning? (Next version)
SLIDE 20
DayCart: Indexes
[Chart: index build time (in mins) against number of records being indexed (1,000,000 to 5,000,000), with one series each for DDBLOB, DDGRAPH and DDROLE]
SLIDE 21
DayCart: VCS_normalize
Transform structures according to database rules encoded in SMIRKS
Apply internal business rules
Standardize structures
Performance?
SLIDE 22
Applications
Perform large-scale data mining
Accelerate exploration of new ideas at project inception
Repository for chemo-informatics knowledge
Advanced research database for computational chemists
SLIDE 23
Example Query: chemical toolbox
[Diagram labels: compound, target, activity]
Select top ten representative diverse structures
Find all screens and compounds tested against each target
Find all activity results & rank compounds
Filter:
Filter out non-druggable compounds
Select available compounds
Filter out non-selective compounds
“Show me the most potent, selective tools for each target, available in-house”
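The filter-and-rank part of this query can be sketched in memory with Python. Everything specific here is an assumption for illustration: the compound and target names, the IC50 values, the druggability and availability sets, and the definition of "selective" as at least 100-fold more potent against the target than against any other target tested. The diversity-selection step is left out.

```python
from collections import defaultdict

# Hypothetical activity rows: (compound, target, IC50 in nM). Values are invented.
activities = [
    ("cmpd_A", "target_1", 5.0),
    ("cmpd_A", "target_2", 4000.0),
    ("cmpd_B", "target_1", 20.0),
    ("cmpd_B", "target_2", 30.0),
    ("cmpd_C", "target_2", 8.0),
    ("cmpd_D", "target_1", 2.0),
]
druggable = {"cmpd_A", "cmpd_B", "cmpd_C"}            # cmpd_D fails the filter
available = {"cmpd_A", "cmpd_B", "cmpd_C", "cmpd_D"}  # held in-house


def best_tools(rows, druggable, available, min_fold=100.0, top_n=10):
    """Return, per target, compounds that pass all filters, most potent first."""
    # Keep the most potent (lowest) IC50 for each compound/target pair.
    potency = defaultdict(dict)
    for compound, target, ic50 in rows:
        previous = potency[compound].get(target)
        if previous is None or ic50 < previous:
            potency[compound][target] = ic50

    hits = defaultdict(list)
    for compound, per_target in potency.items():
        if compound not in druggable or compound not in available:
            continue
        for target, ic50 in per_target.items():
            others = [v for t, v in per_target.items() if t != target]
            # A compound tested on a single target passes trivially here.
            if all(other / ic50 >= min_fold for other in others):
                hits[target].append((ic50, compound))

    return {t: [c for _, c in sorted(found)[:top_n]] for t, found in hits.items()}


tools = best_tools(activities, druggable, available)
```

With these invented rows, `cmpd_B` is dropped as non-selective and `cmpd_D` as non-druggable, leaving one tool compound per target.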
SLIDE 24
Acknowledgements
SLIDE 25