SLIDE 1

Robert Esnouf, University of Oxford: GTC-Europe 11 October 2017

23451: Developing a deep learning and AI platform for life science research

Robert Esnouf
robert@well.ox.ac.uk
Head of Research Computing Core, Wellcome Centre for Human Genetics
Director of Research Computing, Big Data Institute
Research Computing Strategy Officer, Nuffield Department of Medicine
University of Oxford, UK

SLIDE 2

Overview of talk

  • The WCHG, the BDI and the Old Road Campus
  • Areas of interest for applying DL techniques in the clinical/life sciences
  • Early promising results
  • Expanding provision for DL/AI and general-purpose GPU computing
  • Acknowledgments
SLIDE 3

The Wellcome Centre for Human Genetics

About 500 researchers in a purpose-built institute

  • “to advance the understanding of genetically-related conditions through multi-disciplinary research”
  • Sequencing, statistical genetics, disease-focused research (diabetes, obesity, heart disease, malaria), optical microscopy, MRI, functional genetics, crystallography & electron microscopy

Opened in 1999, the first building on the “Old Road Campus”, surrounded by five hospitals in Headington, east Oxford

SLIDE 4

Computing growth in the WCHG

  • Genetics was largely a lab-based science with small separate servers for each research group
  • Next-generation sequencing (~2007) changed all that, and in 2009 I started to build a shared infrastructure for the whole of the WCHG
  • WCHG now has an HPC cluster of ~4200 CPU cores; 5x Tesla K80, 8x Tesla P100 and consumer cards; ~6.7PB raw GPFS and ~5PB other storage

SLIDE 5

Big data is transforming the study of human biology and disease

  • Death registries
  • Cancer registries
  • Hospital records
  • Primary care data
  • Pharmacy records
  • Pathology records
  • Screening programmes
  • Environmental data
  • Employment records
  • Built environment
  • Genetic data
  • Imaging

SLIDE 6

The Oxford Big Data Institute: The Li Ka Shing Centre for Health Information and Discovery

Interdisciplinary and problem-focused research institute of 350 researchers working on the acquisition and analysis of population-scale data resources linking detailed biological measurement with longitudinal information on health, treatment and outcome.

Cohorts

  • Prospective cohorts (UKB, China, Mexico)
  • Disease-focused cohorts
  • Partnerships with NHS / NIHR
  • Tropical Medicine overseas centres
  • WHO & National ID surveillance

Measurement technologies

  • Imaging
  • Genomics and other ‘omics
  • Sensors
  • Electronic healthcare records
  • Patient-interactive systems

Integrative analysis methods

  • Statistics
  • Epidemiology
  • Machine learning
  • Software development
  • Computational ecosystem

Data access and sharing

  • Consent
  • Privacy and security
  • Information governance
  • Intellectual property
  • Standards and protocols

SLIDE 7

A research computing infrastructure for the WCHG and BDI

  • Linking with dark fibre and quad EDR InfiniBand
  • Expanding shared HPC and high-performance storage
  • Creating a scalable virtualization platform on OpenStack
  • Secure multisite scalable S3 object store
  • GPU-accelerated virtual desktop infrastructure and independent identity management and authorization
  • Opening facility across Oxford departments to drive efficient collaboration

SLIDE 8

UK Biobank: wearable accelerometer data

Aiden Doherty

  • 103,712 participants with 7 days of data per participant
  • 100Hz tri-axial acceleration data and 0.2Hz temperature/light information

[Figure labels: Self-reporting: 50%, Accelerometer: 5%; Self-reporting: 38%, Accelerometer: 5%]

Accelerometers better than self-reporting! (R = 0.48–0.60 vs. R = 0.07–0.28)
Objective measures of physical activity more strongly associated with mortality
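The participant count, recording length and sampling rate quoted on this slide imply a very large raw dataset. The following is a back-of-envelope sketch only: it assumes every participant contributed a complete 7 days of data, and the 2-byte storage size per value is an assumption for illustration, not a figure from the talk.

```python
# Back-of-envelope count of raw acceleration values, using the slide's figures.
PARTICIPANTS = 103_712
DAYS = 7
SAMPLE_RATE_HZ = 100
AXES = 3
SECONDS_PER_DAY = 24 * 60 * 60

samples_per_axis = DAYS * SECONDS_PER_DAY * SAMPLE_RATE_HZ
values_per_participant = samples_per_axis * AXES
total_values = values_per_participant * PARTICIPANTS

# Assumption (not from the talk): each value is stored as a 2-byte integer.
BYTES_PER_VALUE = 2
total_terabytes = total_values * BYTES_PER_VALUE / 1e12

print(f"{values_per_participant:,} values per participant")
print(f"{total_values:.3e} values in total")
print(f"~{total_terabytes:.1f} TB at {BYTES_PER_VALUE} bytes per value")
```

Under these assumptions the raw signal runs to tens of terabytes before any derived features are computed.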

SLIDE 9

Predicting malaria risk

  • Relate environmental factors (temperature, rainfall etc.) to malaria prevalence.
  • Using point surveys and environment from annual 5km x 5km raster pixels.
  • We already use stacking: train a number of machine learning models and feed predictions from these models into a meta-learner. We use geostatistical models as our meta-learners.

Tim Lucas and Pete Gething
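The stacking idea described on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the data are synthetic, the two base learners and all function names are hypothetical, and the meta-learner here is ordinary least squares standing in for the geostatistical model used in the talk. The key step is that the meta-learner is trained on out-of-fold predictions, so it never sees a base prediction made on data that the base model was fitted to.

```python
import random

def fit_line(xs, ys):
    """Least-squares line y = a + b*x; returns a prediction function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return lambda x: a + b * x

def fit_knn(xs, ys, k=15):
    """Mean of the k nearest neighbours in one dimension; returns a prediction function."""
    pairs = list(zip(xs, ys))
    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)
    return predict

# Hypothetical base learners: each sees only one environmental covariate.
def line_on_temperature(rows, ys):
    f = fit_line([r[0] for r in rows], ys)
    return lambda r: f(r[0])

def knn_on_rainfall(rows, ys):
    f = fit_knn([r[1] for r in rows], ys)
    return lambda r: f(r[1])

LEARNERS = [line_on_temperature, knn_on_rainfall]

def out_of_fold_predictions(learners, rows, ys, n_folds=5):
    """For each row, predict with base models trained on the other folds."""
    n = len(rows)
    oof = [[0.0] * len(learners) for _ in range(n)]
    for fold in range(n_folds):
        train = [i for i in range(n) if i % n_folds != fold]
        test = [i for i in range(n) if i % n_folds == fold]
        for j, learner in enumerate(learners):
            model = learner([rows[i] for i in train], [ys[i] for i in train])
            for i in test:
                oof[i][j] = model(rows[i])
    return oof

def fit_meta(oof, ys):
    """Stand-in meta-learner: least squares with intercept on two base predictions."""
    n = len(ys)
    m1 = sum(p[0] for p in oof) / n
    m2 = sum(p[1] for p in oof) / n
    my = sum(ys) / n
    s11 = sum((p[0] - m1) ** 2 for p in oof)
    s22 = sum((p[1] - m2) ** 2 for p in oof)
    s12 = sum((p[0] - m1) * (p[1] - m2) for p in oof)
    t1 = sum((p[0] - m1) * (y - my) for p, y in zip(oof, ys))
    t2 = sum((p[1] - m2) * (y - my) for p, y in zip(oof, ys))
    det = s11 * s22 - s12 * s12
    w1 = (t1 * s22 - t2 * s12) / det
    w2 = (s11 * t2 - s12 * t1) / det
    w0 = my - w1 * m1 - w2 * m2
    return lambda p: w0 + w1 * p[0] + w2 * p[1]

def mse(preds, ys):
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

def simulate(n):
    """Synthetic (temperature, rainfall) -> prevalence data, invented for this sketch."""
    rows, ys = [], []
    for _ in range(n):
        temp, rain = random.uniform(15, 35), random.uniform(0, 300)
        y = 0.05 + 0.02 * (temp - 15) + 0.3 * (rain / 300) ** 2 + random.gauss(0, 0.02)
        rows.append((temp, rain))
        ys.append(y)
    return rows, ys

random.seed(0)
train_rows, train_ys = simulate(200)
test_rows, test_ys = simulate(100)

meta = fit_meta(out_of_fold_predictions(LEARNERS, train_rows, train_ys), train_ys)
base_models = [learner(train_rows, train_ys) for learner in LEARNERS]
base_test = [[m(r) for m in base_models] for r in test_rows]

base_mses = [mse([p[j] for p in base_test], test_ys) for j in range(len(LEARNERS))]
stacked_mse = mse([meta(p) for p in base_test], test_ys)
print("base MSEs:", base_mses, "stacked MSE:", stacked_mse)
```

Because each base learner sees only one covariate, neither fits well alone; the meta-learner combines their predictions and should give a lower held-out error on this toy data.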

SLIDE 10

Base-calling data from Oxford Nanopore Technologies sequencers

Hannah Roberts and Gerton Lunter

  • As DNA or RNA passes through the pore, current traces are recorded from which the sequence of bases can be inferred (‘base-calling’)
  • State-of-the-art base-callers use deep neural networks to interpret the current signals, improving accuracy from 71% (HMM; R7.3 chemistry) to 90% (DNN; R9 chemistry)
  • With more training, DNNs may be able to detect modified bases (e.g. methylation patterns)

[Figure labels: G T T C T G T A T AT C TT]

SLIDE 11

Detecting rare genetic conditions from craniofacial features

  • There are many, many rare genetic conditions that often go undiagnosed
  • Something like 1 in 12 people has one of these conditions
  • Often these conditions are also manifest in craniofacial features
  • www.minervaandme.com does image analysis on faces to predict genetic conditions
  • With better feature recognition and DL techniques researchers expect to be able to detect more conditions more reliably

Michael Ferlaino and Chris Nellåker

SLIDE 12

Deep neural networks as models for the brain

Jessie Liu and Tom Nichols

  • Deep neural networks (DNNs) were inspired by the brain and learn similar features
  • DNNs could take further inspiration from the brain
  • Can we build more sophisticated or cognitive neural representations into DNNs?
  • Such as the brain’s GPS system

This approach will offer:

  • Insights into principles underlying neural representations in the brain
  • New DNN architectures capable of powerful, brain-like computations

[Figure panels: Artificial neuron firing field; Real neuron firing field]

SLIDE 13

Deep learning of chromatin features to predict islet-specific SNP effects

Agata Wesolowska-Andersen, Chris Holmes and Mark McCarthy

SLIDE 14

CNNs capture motifs of input ChIP-seq and known islet transcription factors

Agata Wesolowska-Andersen, Chris Holmes and Mark McCarthy

SLIDE 15

Deep learning predicts regulatory effects for high PPA SNPs

Agata Wesolowska-Andersen, Chris Holmes and Mark McCarthy

SLIDE 16

Predicting Expression using Convolutional Neural Networks (CNNs)

Moustafa Abdalla, Chris Holmes and Mark McCarthy

peaBrain

  • a promoter-derived embedding and abundance (pea) model
  • a convolutional neural network that leverages DNA sequence to predict expression
  • can be used to predict both average gene expression and variation in expression (between individuals)
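To show how a convolutional layer can read DNA sequence at all, here is a minimal sketch of one filter scanning a one-hot encoded sequence. It is not the peaBrain architecture: it assumes a plain four-channel A/C/G/T encoding, the sequence is made up, and the filter weights are hand-set to match a "TATA" motif, whereas a trained network would learn its filters from data.

```python
BASES = "ACGT"

def one_hot(sequence):
    """Encode a DNA string as rows of 4 values (A, C, G, T); other symbols become all zeros."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in sequence.upper()]

def conv1d(encoded, kernel):
    """Valid 1-D cross-correlation of a (length x 4) input with a (width x 4) kernel."""
    width = len(kernel)
    outputs = []
    for start in range(len(encoded) - width + 1):
        total = 0.0
        for offset in range(width):
            for channel in range(len(BASES)):
                total += encoded[start + offset][channel] * kernel[offset][channel]
        outputs.append(total)
    return outputs

def relu(values):
    return [max(0.0, v) for v in values]

# Hypothetical hand-set filter: +1 where the base matches the motif, -1 otherwise.
motif = "TATA"
kernel = [[1.0 if b == base else -1.0 for b in BASES] for base in motif]

promoter = "GCGCTATAAAGGC"  # made-up sequence for illustration
activations = relu(conv1d(one_hot(promoter), kernel))
best_position = max(range(len(activations)), key=activations.__getitem__)
print(activations)
print("strongest match starts at index", best_position)
```

Each output value is the number of matching bases minus the number of mismatches in that window, so the filter responds most strongly where the motif occurs exactly.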

SLIDE 17

GPUs are necessary for computational tractability

Moustafa Abdalla, Chris Holmes and Mark McCarthy

Dataset: 19k genes x 4 kilo-basepairs x 32 channels (18.47 GB) representing the “core” promoter sequence of all protein-coding genes in the human reference genome

[Chart labels: Quad E5-4640 (64 threads); Single Tesla K80; Single Tesla K80; Single Tesla P100]
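The dataset dimensions quoted on this slide can be sanity-checked with simple arithmetic. The sketch below uses the rounded figures from the slide (19,000 genes, 4,000 base pairs, 32 channels); the element widths are assumptions, since the talk does not state the storage type.

```python
# Rounded dimensions taken from the slide.
GENES = 19_000
BASEPAIRS = 4_000
CHANNELS = 32

elements = GENES * BASEPAIRS * CHANNELS

# Assumed element widths (not stated in the talk).
for label, bytes_per_element in [("float32", 4), ("float64", 8)]:
    size_bytes = elements * bytes_per_element
    print(f"{label}: {size_bytes / 1e9:.2f} GB ({size_bytes / 2**30:.2f} GiB)")
```

With 8-byte values the rounded dimensions give a size of the same order as the quoted 18.47 GB; the remaining difference could come from the rounding of "19k" and "4 kilo-basepairs", though that is a guess.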

SLIDE 18

CNNs already outperform previous computational and experimental methods

Moustafa Abdalla, Chris Holmes and Mark McCarthy

[Chart labels: Neural Network; Regularized Linear Regression]

  • Green: fraction of genes whose expression can be predicted using the model
  • R² is the average over repeated out-of-sample (test) sets
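The evaluation described on this slide, an R² averaged over repeated out-of-sample test sets, can be sketched as follows. This is an illustration under stated assumptions: the model is a simple least-squares line rather than a CNN, the data are synthetic, and the function names, number of repeats and test fraction are all hypothetical.

```python
import random

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def fit_line(xs, ys):
    """Least-squares intercept and slope for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def repeated_holdout_r2(xs, ys, n_repeats=20, test_fraction=0.2, seed=0):
    """Average test-set R² over repeated random train/test splits."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    n_test = max(2, int(len(xs) * test_fraction))
    scores = []
    for _ in range(n_repeats):
        rng.shuffle(indices)
        test, train = indices[:n_test], indices[n_test:]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        predictions = [a + b * xs[i] for i in test]
        scores.append(r_squared([ys[i] for i in test], predictions))
    return sum(scores) / len(scores)

# Synthetic data invented for this sketch.
data_rng = random.Random(1)
xs = [data_rng.uniform(0, 1) for _ in range(200)]
ys = [2.0 * x + data_rng.gauss(0, 0.3) for x in xs]

mean_r2 = repeated_holdout_r2(xs, ys)
print(f"mean out-of-sample R^2: {mean_r2:.2f}")
```

Scoring only on held-out rows keeps the reported R² from being inflated by overfitting, and averaging over several splits reduces the dependence on any single partition.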

SLIDE 19

Provision for DL/AI in WCHG/BDI

  • Until Easter 2017, GPUs were mainly used for electron tomography (“dynamo”) and single-particle electron microscopy (“relion”)
      • Dell C4130 with 4x K80; Dell R730 with 1x K80; Scan workstation with TitanXp
      • Free-for-all access
  • Adding a tiered set of local and shared resources:
      • Initial exploration and testing
          • 3x Gigabyte servers each with 4x GTX 1080Ti
          • 1x SuperMicro workstation with 1x GTX 1080Ti
          • 1x Scan workstation with 1x TitanXp
      • Mid-scale training and inference along with image analysis
          • 1x Dell R730 with 1x K80
          • 1x Dell C4130 with 4x K80
          • 2x Dell C4130 each with 4x P100 (SXM2)
          • 1x Scan workstation with 1x V100 (PCIe)
      • Controlling access within Univa Grid Engine
SLIDE 20

Provision for DL/AI with central Oxford IT

  • Oxford IT Services Advanced Research Computing (ARC) mainly supports maths, physical and life sciences across Oxford
  • ARC has ~7500 core cluster and ~40x K40 and K80 GPUs
  • Shared purchase of an NVidia DGX-1V between ARC, WCHG, BDI and WIMM to be housed by ARC
      • First Volta system within UK academic sector
      • Will be majority devoted to life-science and clinical research
      • Delivery before end of October – thanks NVidia!
  • Oxford ARC manages new JADE cluster
      • UK national GPU cluster
      • 22x NVidia DGX-1 Pascal systems
      • Some access for life-science projects
  • There is vast enthusiasm for DL/AI in life sciences and clinical research. Early results are promising. The WCHG and BDI will grow their DL/AI hardware capability to meet this researcher demand

SLIDE 21

Acknowledgments

  • Members of the research computing teams
      • Jon Diprose, Colin Freeman, Callum Smith and Adam Huffman
  • The researchers across Oxford who have provided me with descriptions of their research
      • Moustafa Abdalla, Gavin Band, Adrian Cortes, Aiden Doherty, Michael Ferlaino, Jessie Liu, Tim Lucas, Hannah Roberts, Agata Wesolowska-Andersen and Joe Zhu
  • My bosses and others who have helped us to grow
      • Profs. Peter Donnelly (WCHG) and Gil McVean (BDI)
      • Funding agencies, especially the Wellcome Trust, the Li Ka Shing and Robertson Foundations, the Medical Research Council
  • And to you for your attention…
      • Enjoy the rest of the conference!