A large-scale chemical data integration system, Gaia Paolini, Pfizer

SLIDE 1

Pfizer Confidential

A large-scale chemical data integration system

Gaia Paolini

SLIDE 2

Large-Scale Chemical Data Integration

Summary

Current situation
Business case
Aims
The design process
Functionality
Applications

SLIDE 3

The Project

A large chemical data warehouse to store and integrate Pfizer and third-party information, using chemical structure as the natural entry point
Millions of chemical structures
Based on the DayCart Oracle Cartridge

SLIDE 4

Why Integrate?

In-house data
In-licensing
Mergers & acquisitions
3rd-party databases

The need to integrate and mine disparate sources of data

SLIDE 5

Why Integrate?

Data available to buy and integrate from external sources
Need for an active chemoinformatics research repository
Opportunity to highlight connections
  Chemical properties
  Structural similarities

SLIDE 6

Aims of the Data Warehouse

Enable chemical/pharmaceutical data mining and knowledge discovery
Store chemical structures and properties together with related entities
  Biology
  Portfolio
  Inventory

SLIDE 7

Scope

Data warehouse
Common consolidated set of data
Repository of selected fields from Pfizer and third-party data
Source independent
Chemo-centric: indexed on structure, not compound ID
Emphasis on data integration rather than front-end client application

SLIDE 8

Requirements

Unique chemical structure indexing
Multiple and hierarchical tautomeric and stereochemical indexing
Integrate internal and external data
Indexed by chemical structure
Integrate chemo- and bio-informatics communities
Fit-for-purpose model architecture
Uses corporate dictionaries to standardise entities
Create connections and synonym tables

SLIDE 9

What do we want from our data?

Data should be easy to
  access
  compare
  exchange
  manipulate

[Diagram labels: compound, target, phase, launched]

SLIDE 10

Why data integration?

SLIDE 11

Database Design Decisions

Central data warehouse
Selective data integration
Focus on chemical structure
SMILES representation in DayCart
Flexible compound wiring

SLIDE 12

Central Data Warehouse

Chemical Structure Integration

[Diagram labels: External Structures; Pfizer; External; Staging Tables; DrugStore warehouse; DataMart (three boxes)]

Data is decoded, loaded, cleaned and mapped

ETL
Chemical structure integration
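
The load, clean and map steps named on this slide could be sketched roughly as follows. This is an illustration only: the `STAGING` rows, the `Mapping` record, the `clean_smiles` helper and its whitespace-trimming rule are all assumptions made for the example, and the SMILES text itself stands in for the canonical structure key that a chemistry cartridge would compute.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical raw rows as they might sit in a staging table:
# (source name, source compound ID, SMILES text). All values are invented.
STAGING = [
    ("pfizer", "PF-0001", " CCO "),
    ("vendor_a", "A-17", "CCO"),
    ("vendor_a", "A-18", ""),          # unusable: no structure
    ("vendor_b", "B-3", "c1ccccc1"),
]

@dataclass(frozen=True)
class Mapping:
    structure_key: str   # stand-in for a canonical structure identifier
    source: str
    source_id: str

def clean_smiles(raw: str) -> Optional[str]:
    """Trim whitespace and reject empty structures (toy cleaning rule)."""
    smiles = raw.strip()
    return smiles or None

def load(staging):
    """Clean each staged row and map it onto a structure-keyed table."""
    warehouse = {}
    rejected = []
    for source, source_id, raw in staging:
        smiles = clean_smiles(raw)
        if smiles is None:
            rejected.append((source, source_id))
            continue
        warehouse.setdefault(smiles, []).append(Mapping(smiles, source, source_id))
    return warehouse, rejected

warehouse, rejected = load(STAGING)
```

Keying the table on the structure rather than on the source ID is what lets the two `CCO` rows from different sources land in the same entry.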

SLIDE 13

Selective data integration

Drug Store Database
Pfizer marts …
3rd-party …
Contributed research …
Pipeline Pilot
Spotfire
Ad-hoc queries and data mining
Flexible UI

SLIDE 14

Data Integration

Consolidated, homogeneous set of data:
  One index for every entity
  One unit of measure for every property
We can:
  Highlight connections between entities
  Create new connections
  Filter on properties
  Interface to other databases
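
As a small illustration of "one unit of measure for every property", the sketch below converts concentration values reported in different units to a single reference unit. The choice of nanomolar as the reference, the `to_nanomolar` helper and the sample results are assumptions for the example, not details taken from the presentation.

```python
# Conversion factors to a single reference unit (nanomolar).
TO_NANOMOLAR = {"nM": 1.0, "uM": 1_000.0, "mM": 1_000_000.0}

def to_nanomolar(value: float, unit: str) -> float:
    """Convert a concentration to nM; raise on units we do not recognise."""
    try:
        return value * TO_NANOMOLAR[unit]
    except KeyError:
        raise ValueError(f"unknown concentration unit: {unit!r}") from None

# Hypothetical activity results from three sources, each in its own unit.
raw_results = [
    ("source_a", 250.0, "nM"),
    ("source_b", 0.5, "uM"),
    ("source_c", 0.002, "mM"),
]

# After harmonisation every value is comparable and can be filtered on.
harmonised = [(src, to_nanomolar(v, u)) for src, v, u in raw_results]
```

Rejecting unknown units outright, rather than passing values through, keeps unconverted numbers from silently entering a consolidated table.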

SLIDE 15

Chemo-centric design

Every entity and property is connected to a chemical structure
Seamless integration of different data sources
Can measure how a data source enriches chemical space
Consistent modelling of tautomers and stereoisomers
Easy to apply hierarchical order (e.g. parent-child)
Any (and multiple) grouping of structures allowed
Intuitive application of chemo-informatics methods
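
One way to read "measure how a data source enriches chemical space" is to count the structures a source contributes that no other source already holds. The sketch below does this with plain set operations. The source names and contents are invented, and comparing SMILES strings directly is only valid if they have already been canonicalised, which is assumed here.

```python
# Hypothetical structure keys (canonical SMILES assumed) held by each source.
sources = {
    "in_house": {"CCO", "c1ccccc1", "CC(=O)O"},
    "vendor_a": {"CCO", "CCN"},
    "vendor_b": {"CCN", "c1ccncc1", "CC(=O)O"},
}

def novel_structures(name: str, sources: dict) -> set:
    """Structures contributed by `name` that no other source contains."""
    others = set().union(*(s for n, s in sources.items() if n != name))
    return sources[name] - others

# Number of structures each source adds beyond all the others combined.
enrichment = {name: len(novel_structures(name, sources)) for name in sources}
```

With these sample sets, `vendor_a` adds nothing new because both of its structures also appear elsewhere.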

SLIDE 16

DayCart Oracle Cartridge

SMILES chemical representation
Structure comparison, transformation, manipulation
Fast data retrieval

SLIDE 17

DayCart: Chemical Representation

SMILES syntax support
  Compact, linear representation
  Self-contained language
  Computer friendly & searchable
No proprietary data types!

SLIDE 18

DayCart: Functions for Chemical Information

Exact match
Substructure
Similarity
Tautomers
Salts
Stereochemistry
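
Similarity searching is commonly scored with the Tanimoto coefficient over structural fingerprints. The sketch below shows that general calculation on invented bit sets. It does not use or reproduce the DayCart functions, and the `fingerprints` table and `similar_to` helper are hypothetical.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient: shared bits divided by bits set in either."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

# Hypothetical fingerprints: sets of "on" bit positions. Real fingerprints
# would be derived from the structure by a chemistry toolkit.
fingerprints = {
    "cmpd_1": frozenset({1, 4, 7, 9}),
    "cmpd_2": frozenset({1, 4, 7, 12}),
    "cmpd_3": frozenset({2, 3, 5}),
}

def similar_to(query: str, threshold: float) -> list:
    """Names of compounds whose similarity to `query` meets the threshold."""
    q = fingerprints[query]
    return sorted(
        name for name, fp in fingerprints.items()
        if name != query and tanimoto(q, fp) >= threshold
    )
```

Here `cmpd_1` and `cmpd_2` share three of five distinct bits, so they pass a 0.5 threshold, while `cmpd_3` shares none.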

SLIDE 19

DayCart: Indexes

Four (domain) indexes
  DDBLOB: substructure, similarity
  DDGRAPH: tautomers, stereochemistry
  DDROLE: salts
  DDEXACT: exact match
Essential for performance
Trade-off data-load/index building
Partitioning? (Next version)

SLIDE 20

DayCart: Indexes

[Chart: time (in mins) against number of records being indexed, for DDBLOB, DDGRAPH and DDROLE; axis ticks run from 100 to 700 and from 1,000,000 to 5,000,000]

SLIDE 21

DayCart: VCS_normalize

Transform structures according to database rules encoded in SMIRKS
Apply internal business rules
Standardize structures
Performance?
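
Applying SMIRKS rules needs a chemistry toolkit, so the sketch below substitutes a much simpler string-level standardisation rule to show the idea of rule-based clean-up: keep the largest dot-separated component of a SMILES as a crude salt-stripping step. It is not VCS_normalize and not a SMIRKS transform; `atom_count` and `strip_salts` are hypothetical helpers, and the atom count is only approximate.

```python
import re

# Matches one atom token in a SMILES fragment: a bracket atom such as [Na+],
# a two-letter halogen, or a one-letter atom (aromatic atoms are lowercase).
ATOM = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")

def atom_count(fragment: str) -> int:
    """Rough count of the atoms written explicitly in a SMILES fragment."""
    return len(ATOM.findall(fragment))

def strip_salts(smiles: str) -> str:
    """Keep the largest dot-separated component (toy standardisation rule)."""
    fragments = [f for f in smiles.split(".") if f]
    if not fragments:
        raise ValueError("empty SMILES")
    return max(fragments, key=atom_count)
```

For example, `strip_salts("CC(=O)[O-].[Na+]")` keeps the acetate fragment and drops the sodium counter-ion. A production rule set would also have to handle charges, mixtures and ties between equally sized fragments.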

SLIDE 22

Applications

Perform large-scale data mining
Accelerate exploration of new ideas at project inception
Repository for chemo-informatics knowledge
Advanced research database for computational chemists

SLIDE 23

Example Query: chemical toolbox

[Diagram labels: compound, target, activity]

Select top ten representative diverse structures
Find all screens and compounds tested against each target
Find all activity results & rank compounds

Filter
  Filter out non-druggable compounds
  Select available compounds
  Filter out non-selective compounds

“Show me the most potent, selective tools for each target, available in-house”
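
The filter-and-rank part of this query could be sketched in memory as below. Everything here is assumed for illustration: the `Result` record, the sample data, the boolean druggability and stock flags, and the 100-fold selectivity threshold. The diversity-selection step is left out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Result:
    compound: str
    target: str
    ic50_nm: float      # lower means more potent
    druggable: bool
    in_stock: bool

# Hypothetical activity results; every value here is invented.
RESULTS = [
    Result("c1", "T1", 5.0, True, True),
    Result("c1", "T2", 4000.0, True, True),
    Result("c2", "T1", 2.0, True, False),     # potent but not available
    Result("c3", "T1", 8.0, True, True),
    Result("c3", "T2", 12.0, True, True),     # hits both targets
    Result("c4", "T2", 20.0, False, True),    # fails the druggability flag
    Result("c5", "T2", 30.0, True, True),
]

def is_selective(compound, target, results, fold=100.0):
    """True if every off-target IC50 is at least `fold` times the on-target one."""
    on = min(r.ic50_nm for r in results
             if r.compound == compound and r.target == target)
    off = [r.ic50_nm for r in results
           if r.compound == compound and r.target != target]
    return all(v >= fold * on for v in off)

def best_tools(results, fold=100.0):
    """Per target: druggable, in-stock, selective compounds, most potent first."""
    tools = {}
    for r in sorted(results, key=lambda r: r.ic50_nm):
        if (r.druggable and r.in_stock
                and is_selective(r.compound, r.target, results, fold)):
            tools.setdefault(r.target, []).append(r.compound)
    return tools

tools = best_tools(RESULTS)
```

With this sample data the most potent T1 compound, `c2`, is dropped because it is not in stock, and `c3` is dropped for both targets because it is not selective.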

SLIDE 24

Acknowledgements

SLIDE 25

Thank you!