
Using GPUs to Generate Reproducible Workflows to Accelerate Drug Discovery
Amanda J. Minnich, Staff Research Scientist, Lawrence Livermore National Laboratory
GPU Technology Conference | March 21, 2019
LLNL-PRES-769348

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC

This document was prepared as an account of work sponsored by an agency of the United States government. Neither the United States government nor Lawrence Livermore National Security, LLC, nor any of their employees makes any warranty, expressed or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States government or Lawrence Livermore National Security, LLC. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States government or Lawrence Livermore National Security, LLC, and shall not be used for advertising or product endorsement purposes.

ATOM: Accelerating Therapeutics for Opportunities in Medicine
[Diagram: founding members and partners from cancer centers, pharma, tech, gov’t labs, and academia, contributing high-performance computing, diverse biological data, and emerging experimental capabilities]

What is ATOM?
• Approach: An open public-private partnership
  - Lead with computation supported by targeted experiments
  - Data-sharing to build models using everyone’s data
  - Build an open-source framework of tools and capabilities
• Status:
  - Shared collaboration space at Mission Bay, SF
  - 25 FTEs engaged across the partners
  - R&D started March 2018
  - In the process of engaging new partners

Current drug discovery: long, costly, high failure
Is there a better way to get medicines to patients?
[Timeline: Target → Lead Discovery (1.5 yrs) → Lead Optimization (3 yrs) → Preclinical (1.5 yrs) → Human clinical trials; 6 years in total before clinical trials]
• Screen millions of functional molecules to inform design
• Design, make, & test 1000s of new molecules; sequential evaluation and optimization
• Lengthy in-vitro and in-vivo experiments; synthesis bottlenecks
• 33% of total cost of medicine development
• Clinical success only ~12%, indicating poor translation in patients
Source: http://www.nature.com/nrd/journal/v9/n3/pdf/nrd3078.pdf

Accelerated drug discovery concept
Vision of ATOM workflow in practice
[Diagram: ATOM Workflow loop (design, synthesize, assay, simulate) driven by active learning, combining in silico and empirical evaluation of PK, safety, and efficacy]
• Patient-specific data and samples input to workflow to develop new therapeutics
• Open source models and generated data: released to public after a 1-year member benefit
• Members use workflow for internal drug discovery
• Models of drug behavior in humans
• Therapeutic candidates: commercialization by members for patient benefit

Top-level view of the ATOM molecular design platform
[Diagram components]
• Design Criteria
• Initial Library (selected compounds)
• Working Compound Library (10k Compounds)
• Human therapeutic window model
• Multiparameter Optimization
  • Genetic optimizer
  • Bayesian optimizer
• Retrain property prediction models
• Mechanistic Feature Simulations
• Uncertainty analysis & experiment design
• Human-relevant Assays, Complex in vitro Models
Software framework is being released as open source

Roadmap
• Infrastructure and Architecture – what GPUs are we using?
• Data-Driven Modeling Pipeline – what have we built?
• Experiments – what have we been able to do?
• Future work – where are we going from here?

Infrastructure
Docker/Kubernetes Cluster (Development)
• Browsable Directories
  • Upload files to Datastore via GUI or API
  • Access control via Unix groups
• JupyterLab
  • Acts as front end for interactive development
  • Also set up VNC to enable use of IDE for debugging
• Relational Database
  • Public data: ChEMBL, KEGG, PDB
  • GSK data
• Data Lake
  • Contains all input and output files
  • GUI and REST API
• Metadata Database
  • Metadata for Data Lake and Model Zoo
• Results DB
  • Stores model prediction results
HPC Clusters (Supercomputer Servers)
• Deploy parallelized runs for hyperparameter search
• Memory/GPU/CPU-intensive jobs

Kubernetes allocates GPU resources on our development server
• Our development server has 4 GPU nodes with 4 Titan XPs in each node
• 1 data server (cephid), 1 login/head node
• Kubernetes is an open source container orchestrator
  • Manages containerized workloads and services
  • Use it to orchestrate allocation of GPUs, CPUs, and memory
  • Handles Role-Based Access Control
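As a rough illustration of how a containerized job asks Kubernetes for GPUs, CPUs, and memory, the sketch below builds a Pod manifest as a plain Python dict. It assumes the cluster exposes GPUs through the NVIDIA device plugin under the `nvidia.com/gpu` resource name; the pod name, image, and resource sizes are hypothetical and not taken from the ATOM deployment.

```python
import json


def gpu_pod_manifest(name, image, gpus=1, cpus="4", memory="16Gi"):
    """Build a Kubernetes Pod manifest (plain dict) that requests GPUs.

    Assumption: GPUs are advertised as the extended resource
    "nvidia.com/gpu", which is how the NVIDIA device plugin exposes them.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "resources": {
                        # GPUs are requested under "limits"; the scheduler
                        # places the pod on a node with enough free GPUs.
                        "limits": {
                            "nvidia.com/gpu": gpus,
                            "cpu": cpus,
                            "memory": memory,
                        }
                    },
                }
            ],
        },
    }


# Hypothetical job asking for 2 of the 4 GPUs on one node.
manifest = gpu_pod_manifest("train-job", "example.org/atom/train:latest", gpus=2)
print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and submitted with `kubectl apply -f`, since Kubernetes accepts JSON manifests as well as YAML.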

LLNL HPC Software Specs and Computer Architecture
• Nodes: 164
• Cores/Node: 36
• Total Cores: 5,904
• Memory/Node: 256 GB
• Total Memory: 41,984 GB
• GPU Architecture: NVIDIA Tesla P100 GPUs
• Total GPUs: 326
• GPUs per compute node: 2
• GPU peak performance (TFLOP/s double precision): 5.00
• GPU global memory (GB): 16
• Switch: Omni-Path
• Peak TFLOPS (GPUs): 1,727.8
• Peak TFLOPS (CPUs+GPUs): 1,926.1
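The totals on this slide follow from the per-node figures, which the short check below reproduces. It is only an arithmetic sketch: the variable names are made up here, and it assumes memory per node is given in GB.

```python
# Per-node and aggregate figures as listed on the slide.
nodes = 164
cores_per_node = 36
memory_per_node_gb = 256   # assumption: the slide's "256" is in GB
total_gpus = 326
peak_gpu_tflops = 1727.8
peak_total_tflops = 1926.1

# Derived totals.
total_cores = nodes * cores_per_node
total_memory_gb = nodes * memory_per_node_gb
implied_tflops_per_gpu = peak_gpu_tflops / total_gpus
implied_cpu_tflops = peak_total_tflops - peak_gpu_tflops

print(f"total cores:        {total_cores:,}")
print(f"total memory (GB):  {total_memory_gb:,}")
print(f"TFLOP/s per GPU:    {implied_tflops_per_gpu:.2f}")
print(f"CPU-only TFLOP/s:   {implied_cpu_tflops:.1f}")
```

The core and memory totals match the slide. The aggregate GPU peak divided by the GPU count gives a per-GPU figure slightly above the 5.00 TFLOP/s listed, so the two numbers appear to be rounded or sourced differently.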

Data services are a necessity
• Data services are required to organize:
  • Raw data
  • Curated datasets
  • Model-ready datasets
  • Train/test/validation split of datasets
  • Serialized models
  • Performance results
  • Simulation output
• These data types vary in size, format, and level of organization/complexity

Have a variety of services to handle our needs
• Data Lake
  • In-house object store service
  • Allows for association of complex metadata with any type of file
  • Can access via GUI and REST API
• mongoDB
  • Used as backend for Data Lake metadata
  • Used as backend for Model Zoo metadata
  • Used for Results DB
• MySQL
  • Many public datasets are available in SQL format
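The idea of an object store that attaches arbitrary metadata to any file can be sketched with a small in-memory stand-in. This is not the ATOM Data Lake API: the class, method names, and metadata fields below are hypothetical, and a real deployment would keep the blobs in an object store and the metadata in a document database such as mongoDB.

```python
import hashlib


class InMemoryDataLake:
    """Toy stand-in for an object store with a queryable metadata index."""

    def __init__(self):
        self._blobs = {}     # key -> raw bytes
        self._metadata = {}  # key -> metadata dict

    def put(self, data: bytes, **metadata) -> str:
        """Store a file and its metadata; returns a content-derived key.

        Storing identical bytes again overwrites the earlier metadata.
        """
        key = hashlib.sha256(data).hexdigest()
        self._blobs[key] = data
        self._metadata[key] = dict(metadata)
        return key

    def get(self, key: str) -> bytes:
        """Fetch the raw bytes for a key."""
        return self._blobs[key]

    def find(self, **criteria):
        """Return keys whose metadata matches every given field exactly."""
        return [
            key
            for key, meta in self._metadata.items()
            if all(meta.get(field) == value for field, value in criteria.items())
        ]


lake = InMemoryDataLake()
raw_key = lake.put(b"smiles,value\nCCO,1.2\n", kind="raw", source="example")
model_key = lake.put(b"\x00serialized-model", kind="model", featurizer="ecfp")

print(lake.find(kind="model") == [model_key])
```

Keeping metadata separate from the file contents is what lets very different artifacts (raw data, curated datasets, serialized models) share one store while still being searchable.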

Overall structure of data services
[Layered diagram, bottom to top]
• Backend Services deployed via containers: NoSQL, SQL, Object Store
• Web Application Services, Metadata Services, Application Services (Results DB, ...), Machine Learning Apps (Tensorflow, ...)
• Server APIs (secure REST interface)
• Application Client APIs (Python, R, etc.)
• Interactive Data Science Apps (Jupyter/Browser) and Large-Scale Data Science Apps (HPC)

End-to-End Data-Driven Modeling Pipeline
Enables portability of models and reproducibility of results
[Pipeline: Data Ingestion + Curation → Featurization → Model Training + Tuning → Prediction Generation → Visualization + Analysis, backed by the Data Lake, Model Zoo, and Results DB]

Data Ingestion + Curation
• Raw pharma data consists of 300 GB of a variety of bioassay and animal toxicology data on ~2 million compounds from GSK
• Proprietary or sensitive data must only be stored on approved servers
• Data may need to remain sequestered from other members

ATOM has curated ~150 model-ready data sets
[Figure: GSK Pharmacokinetic Datasets]

Descriptor data (MOE 3D Descriptors)
Data Set    Compounds
GSK         1.86M
ChEMBL      1.6M
Enamine     680M

Featurization
• Support loading datasets from either Data Lake or filesystem
• Support a variety of feature types
  • Extended Connectivity Fingerprint
  • Graph-based features
  • Molecular descriptor-based features (MOE, DRAGON7, rdkit)
  • Autoencoder-based features (MolVAE)
  • Allow for custom featurizer classes
• Split dataset based on structure to avoid bias
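One common way to split by structure is to group compounds by molecular scaffold and keep each group entirely within one subset, so that close analogs of training compounds do not leak into the test set. The sketch below illustrates that idea and is not necessarily the splitter used in the ATOM pipeline. It assumes a scaffold string has already been computed for every compound (for example with a cheminformatics toolkit); the function name and the toy scaffold labels are hypothetical.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Split compound indices so that no scaffold spans two subsets.

    scaffolds: one scaffold identifier per compound, in dataset order.
    Returns (train, valid, test) lists of indices.
    """
    groups = defaultdict(list)
    for index, scaffold in enumerate(scaffolds):
        groups[scaffold].append(index)

    # Largest scaffold groups first; ties broken by first index so the
    # split is deterministic.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(valid) + len(group) <= frac_valid * n:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


# Toy dataset: 10 compounds sharing 4 scaffolds (labels are placeholders).
toy_scaffolds = ["A"] * 5 + ["B"] * 3 + ["C", "D"]
train_idx, valid_idx, test_idx = scaffold_split(toy_scaffolds)
print(len(train_idx), len(valid_idx), len(test_idx))
```

Because whole groups are assigned at once, the realized fractions only approximate the requested ones, which is the price of keeping structurally similar compounds together.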

Featurization is key
• We have found that the best-performing feature type varies by dataset
• In general, chemical descriptors outperform other feature types
• Graph convolutions occasionally outperform others

Dimensionality reduction can improve performance

Model Training + Tuning
• Have built a train/tune/predict framework to create high-quality models
• Currently support:
  • sklearn models
  • deepchem models (wrapper for TensorFlow)
  • Allow for custom model classes
• Tune models using the validation set and perform k-fold cross validation
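For readers unfamiliar with k-fold cross validation, the sketch below shows how the fold indices can be generated: the data is cut into k contiguous folds, and each fold is held out once while the rest is used for training. It is a minimal pure-Python illustration with a hypothetical function name, not the framework's own code, and it does not shuffle or group by structure as a production splitter typically would.

```python
def k_fold_indices(n_samples, k):
    """Return a list of (train_indices, held_out_indices) pairs.

    Fold sizes differ by at most one; the first n_samples % k folds
    receive the extra sample.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")

    base, extra = divmod(n_samples, k)
    indices = list(range(n_samples))
    folds = []
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        held_out = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        folds.append((training, held_out))
        start += size
    return folds


folds = k_fold_indices(10, 3)
for training, held_out in folds:
    print(f"train on {len(training)} samples, evaluate on {held_out}")
```

Averaging a model's score over the held-out folds gives a less noisy estimate than a single validation split, which helps when comparing hyperparameter settings.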
