[PPT] - Deep Computing in Biology Challenges and Progress Ajay K. Royyuru PowerPoint Presentation

SLIDE 1

Deep Computing in Biology

Challenges and Progress

Computational Biology Center Thomas J. Watson Research Center ajayr@us.ibm.com

Ajay K. Royyuru

SLIDE 2


IBM Computational Biology Center

Outline

Biology has become an Information Science

Data explosion – how to take advantage
  High Throughput technologies
  Genomics
  Genographic
  Proteomics
  Medical Imaging
  Data Integration, Mining and Analysis
Scale of Computing is rapidly advancing – think big
Tackle Complexity in Biology – think multiscale

SLIDE 3


microRNAs in a nutshell

SLIDE 4


rna22's Predictions

Genome | Currently known precursors (June 2005) | rna22 predicted precursors (keys) | rna22 predicted 3' UTR targets (locks) | rna22 predicted affected transcripts
H. sapiens | 321 | 50,139 | >550,000 | 23,616
M. musculus | 245 | 44,358 | >400,000 | 18,597
D. melanogaster | 78 | 1,117 | >150,000 | 13,104
C. elegans | 114 | 623 | >60,000 | 9,752

Rigoutsos et al., Cell (2006)

SLIDE 5


www.nationalgeographic.com/genographic

The Genographic Project

SLIDE 6


Map of Human Migration

The Genographic Project

SLIDE 7


Public Participation

Over 217,000 participants to date
www.nationalgeographic.com/genographic
www.ibm.com/dna

Behar et al., PLoS Genetics, 3/e104: 1083-1095 (2007)

SLIDE 8


mtDNA Report

HVS1 Sequence

Haplogroup: M*
16223T, 16519C
ATTCTAATTTAAACTATTCTCTGTTCTTTCATGGGGAAGCAGATTTGGGTA

CCACCCAAGTATTGACTCACCCATCAACAACCGCTATGTATTTCGTACATT ACTGCCAGCCACCATGAATATTGTACGGTACCATAAATACTTGACCACCTG TAGTACATAAAAACCCAATCCACATCAAAACCCCCTCCCCATGCTTACAAG CAAGTACAGCAATCAACCTTCAACTATCACACATCAACTGCAACTCCAAAG CCACCCCTCACCCACTAGGATACCAACAAACCTACCCACCCTTAACAGTAC ATAGTACATAAAGCCATTTACCGTACATAGCACATTACAGTCAAATCCCTT CTCGTCCCCATGGATGACCCCCCTCAGATAGGGGTCCCTTGACCACCATCC TCCGTGAAATCAATATCCCGCACAAGAGTGCTACTCTCCTCGCTCCGGGCC CATAACACTTGGGGGTAGCTAAAGTGAACTGTATCCGACATCTGGTTCCTA CTTCAGGGCCATAAAGCCTAAATAGCCCACACGTTCCCCTTAAATAAGACA TCACGATG
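The report above lists HVS1 differences in a compact position-plus-base notation (16223T, 16519C). A minimal sketch of reading that notation follows. It assumes each entry is a position followed by exactly one observed base; the names `Variant`, `parse_variant` and `parse_variant_list` are hypothetical, and insertions or deletions are not handled.

```python
import re
from typing import NamedTuple


class Variant(NamedTuple):
    position: int  # coordinate in the reference sequence numbering
    base: str      # base observed in the sample at that position


# Assumption: an entry is digits followed by exactly one of A, C, G, T.
_VARIANT_RE = re.compile(r"^(\d+)([ACGT])$")


def parse_variant(text: str) -> Variant:
    """Parse one entry such as '16223T' into (position, base)."""
    match = _VARIANT_RE.match(text.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized variant notation: {text!r}")
    return Variant(int(match.group(1)), match.group(2))


def parse_variant_list(text: str) -> list[Variant]:
    """Parse a comma-separated list such as '16223T, 16519C'."""
    return [parse_variant(item) for item in text.split(",") if item.strip()]


if __name__ == "__main__":
    for variant in parse_variant_list("16223T, 16519C"):
        print(variant.position, variant.base)
```

Keeping position and base as separate fields makes it straightforward to compare a participant's variant list against a set of haplogroup-defining positions, should such a set be available.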

SLIDE 9


Why is Proteomics Important?

Proteins do real work in cells, not genes

Many diseases involve post-translational modifications of proteins (hence, not encoded in genes)

Looking for protein-based biomarkers to track disease state or progression

Folded, modified, translocated

Mature Protein Cellular Machinery

SLIDE 10


Diagnostics with Proteomics

Process of identification of protein fragments in the blood of an individual
  Extract blood from subject
  Process serum in mass spec
  Extract raw data
  Identify peaks via novel 2D analysis

Identification of markers characteristic of disease
  Medical condition
  Healthy individuals
  Our algorithms
  Biomarkers of disease

Take serum from patient
Analyze serum peaks
Are these biomarkers of disease present?
  YES: patient has condition
  NO: patient does not have condition

The near future: Proteomics will allow for early diagnostics of some conditions from blood samples
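The YES/NO decision on this slide can be sketched as a lookup of known marker positions in a patient's peak list. This is only an illustration under stated assumptions: the slide does not give the decision rule, so the sketch assumes every marker must be found within a fixed tolerance. The function names, the tolerance and the peak values are all hypothetical, and this is not the 2D peak analysis mentioned above.

```python
from typing import Iterable, Sequence


def marker_present(peaks: Sequence[float], marker: float, tol: float) -> bool:
    """True if any observed peak lies within `tol` of the marker position."""
    return any(abs(peak - marker) <= tol for peak in peaks)


def has_condition(peaks: Iterable[float], markers: Sequence[float],
                  tol: float = 0.5) -> bool:
    """Assumed rule: the condition is called only if ALL markers are present."""
    if not markers:
        raise ValueError("at least one biomarker position is required")
    observed = list(peaks)
    return all(marker_present(observed, marker, tol) for marker in markers)


if __name__ == "__main__":
    panel = [1000.0, 1500.0]                 # made-up marker positions
    patient_peaks = [1000.2, 1500.1, 2200.0]  # made-up observed peaks
    print("YES" if has_condition(patient_peaks, panel) else "NO")
```

A real panel would more likely weigh peak intensities and tolerate missing markers; the all-or-nothing rule is chosen here only because it is the simplest reading of the slide.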

SLIDE 11


Medical Imaging: fMRI

Listening to music Hubs analysis Activity analysis

SLIDE 12


Directional links explain the difference

Visual Auditory Visual Auditory

Neutral Links Directed Links

Presented at the Human Brain Mapping Conference (2006)

SLIDE 13


Network Analysis

Graphs determined by the structure of pairwise correlations between voxels display very robust topological statistical regularities, including power-law connectivity scaling and small-worldness*

However, the computations become intractable very easily as one moves up from two-point correlations

We developed a novel approach that extends our previous findings to include directional links, and based on this analyze the presence and significance of higher-order correlation patterns

We implemented a series of algorithms on distributed platforms that render our approach feasible

*Scale-Free Brain Functional Networks, V.M. Eguiluz, D.R. Chialvo, G.A. Cecchi, M. Baliki & V.A. Apkarian, Physical Review Letters 94:018102 (2005)
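A small sketch of the pairwise-correlation graph described on this slide, together with the count of voxel tuples that makes higher-order analysis expensive. It assumes two voxels are linked when their Pearson correlation exceeds a threshold; the threshold value, the function names and the toy time series are hypothetical, and directional links are not modeled here.

```python
from math import comb, sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length series (0.0 if one is flat)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlation_graph(series, threshold):
    """Undirected edges (i, j), i < j, where correlation exceeds threshold."""
    edges = set()
    for i in range(len(series)):
        for j in range(i + 1, len(series)):
            if pearson(series[i], series[j]) > threshold:
                edges.add((i, j))
    return edges


def degrees(n_nodes, edges):
    """Number of links per voxel; its distribution is what shows scaling."""
    degree = [0] * n_nodes
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    return degree


if __name__ == "__main__":
    voxels = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1], [1, 3, 2, 4]]
    links = correlation_graph(voxels, threshold=0.9)
    print(sorted(links), degrees(len(voxels), links))
    # Two-point versus three-point tuples for a hypothetical 1000 voxels.
    print(comb(1000, 2), comb(1000, 3))
```

The pair loop grows as n(n-1)/2 while triples grow as n(n-1)(n-2)/6, which is one way to see why moving beyond two-point correlations calls for distributed platforms.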

SLIDE 14


Outline

Biology has become an Information Science

Data explosion – how to take advantage
  High Throughput technologies
  Genomics
  Genographic
  Proteomics
  Medical Imaging
  Data Integration, Mining and Analysis
Scale of Computing is rapidly advancing – think big
  Molecular Simulations
  Docking, Virtual Screening
  Medical Imaging
Tackle Complexity in Biology – think multiscale

SLIDE 15

Top Supercomputers in the World, June 2007
Source: www.top500.org

# | Vendor | Installation | Rmax (TFlops)
1 | IBM | DOE/NNSA/LLNL (64 racks BlueGene/L) | 280.6
2 | Cray | Oak Ridge NL (XT3 Opteron) | 101.7
3 | Cray | Sandia – Red Storm (XT3 Opteron) | 101.4
4 | IBM | BlueGene at Watson (20 racks BlueGene/L) | 91.29
5 | IBM | BlueGene at Stony Brook / BNL (18 racks BlueGene/L) | 82.16
6 | IBM | ASC Purple LLNL (1526 nodes p5 575) | 75.76
7 | IBM | BlueGene at RPI (16 racks BlueGene/L) | 73.03
8 | Dell | NCSA (QC Xeon/Infiniband) | 62.68
9 | IBM | BSC MareNostrum (2560 JS21 Blades) | 62.63
10 | SGI | Altix4700 at LRZ (DC Itanium 2/Infiniband) | 56.52
11 | Dell | Sandia NL (Xeon/Infiniband) | 53.00
12 | Bull | CEA/DAM Tera10 (Itanium2) | 52.84
13 | SGI | NASA/Columbia (Itanium2) | 51.87
14 | NEC/Sun | Tsubame Galaxy TiTech (Opteron/Clearspeed/IB) | 48.88
15 | Dell | TACC – Lonestar (Xeon/Infiniband) | 46.73
16 | Dell | Maui HPCC – Jaws (Xeon/Infiniband) | 42.39
17 | Linux Networx | ARL (DC Xeon 51xx/Infiniband) | 40.61
18 | IBM | FZJ – Juelich (8 racks BlueGene/L) | 37.33
19 | Appro | Japan Earth Simulator (DC Opteron/IB) | 36.58
20 | NEC | Japan Earth Simulator (NEC) | 35.86

SLIDE 16


Time Scales: Biopolymers and Membranes

Time axis (s): 10^-15, 10^-12, 10^-9, 10^-6, 10^-3, 1, 10^3, 10^6, 10^9

Processes shown: Bond Vibration, DNA Twisting, Hinge Motion, Helix-Coil Transition, Protein Folding, Ligand-Protein Binding, Electron Transfer, Lipid exchange via diffusion, Torsional correlation in lipid headgroups

Ranges marked: Simulation, Experiment

Adapted from “The Protein Folding Problem”, Chan and Dill, Physics Today, Feb. 1993
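The gap between the fast and slow ends of this time axis can be made concrete with a little arithmetic. The sketch below assumes a simulation advances in fixed steps of 1 fs (roughly the bond-vibration scale, a common choice in molecular dynamics) and runs at a hypothetical 100 time-steps per second; both numbers and both function names are illustrative assumptions, not measurements from the talk.

```python
FS_PER_MICROSECOND = 10**9  # 1 microsecond = 10^9 femtoseconds
SECONDS_PER_DAY = 86_400


def steps_needed(duration_fs: int, timestep_fs: int = 1) -> int:
    """Time steps needed to cover `duration_fs`, rounded up."""
    return -(-duration_fs // timestep_fs)  # ceiling division on integers


def wallclock_days(steps: int, steps_per_second: float) -> float:
    """Wall-clock days to run `steps` at a sustained rate."""
    return steps / steps_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    steps = steps_needed(FS_PER_MICROSECOND)  # one microsecond of motion
    print(steps, "steps;", round(wallclock_days(steps, 100.0), 1), "days")
```

Under these assumptions a single microsecond of simulated motion takes a billion steps and on the order of a hundred days, which is why processes at the millisecond end of the axis sit far outside the range marked Simulation.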

SLIDE 17


Blue Matter strong scaling performance

Figure: Computation rates as a function of atoms per node. Axes: Atoms/Node (200 to 1400) and Computation Rate (time-steps/second, 0.1 to 100).
Series: Hairpin SPI 64^3 (V5); SOPE SPI 64^3 (V5); Hairpin SPI 64^3 (V4); SOPE SPI (V5); Rhodopsin SPI (V5); SOPE SPI (V4); ApoA1 SPI (V5); Rhodopsin SPI (V4); ApoA1 SPI (V4); SOPE MPI (V4); Rhodopsin MPI (V4); ApoA1 MPI (V4); ApoA1 NAMD Msging Layer; ApoA1 NAMD MPI
www.research.ibm.com/bluegene

SLIDE 18


Lysozyme System

Trp62Ala mutation of Lysozyme
  Dramatically reduces stability in 8M urea solution
  Responsible for amyloid formation

Lysozyme structure consists of:
  α-domain with 4 alpha helices (A-D) and one 3-10 helix
  β-domain with antiparallel beta sheet
  Loop of the beta-domain

C. Dobson and coworkers, Nature 424, 783, 2003

SLIDE 19


R Zhou, M Eleftheriou, AK Royyuru, BJ Berne, Destruction of long-range interactions by a single mutation in lysozyme

Proc. Natl. Acad. Sci., 104:5824-5829 (2007)

SLIDE 20


GPCR-based drugs among the 200 best-selling prescriptions, and their GPCR targets

GPCR target | Drug | Disease | Company | 2000 sales (US $m)
ADP receptors | Plavix | Stroke | Bristol-Myers Squibb | 900
Prostaglandin (PGE1) receptors | Cytotec | Ulcers | Pharmacia | 100
Dopamine receptors | Requip | Parkinson’s disease | AstraZeneca | 90
GnRH receptors | Zoladex | Cancer | AstraZeneca | 740
Muscarinic acetylcholine receptors | Atrovent | COPD | Boehringer Ingelheim | 600
Adrenoceptors | Serevent | Asthma | GlaxoSmithKline | 940
Adrenoceptors | Coreg | Congestive heart failure | GlaxoSmithKline | 250
Adrenoceptors | Toprol-XL | | AstraZeneca | 580
Angiotensin receptors | Cozaar | Hypertension | Merck | 1,700
5-HT receptors | Zyprexa | Schizophrenia | Eli Lilly | 2,400
5-HT receptors | BuSpar | Anxiety | Bristol-Myers Squibb | 714
5-HT receptors | Imitrex | Migraine | GlaxoSmithKline | 1,100
5-HT receptors | Risperdal | Psychosis | Johnson & Johnson | 1,600
Histamine receptors | Allegra | | Aventis | 1,100
Histamine receptors | Claritin | Allergies | Schering-Plough | 2,200
Histamine receptors | Pepcid | | Merck | 850
Histamine receptors | Zantac | Ulcers | AstraZeneca | 870

http://www.predixpharm.com/market_table.htm

SLIDE 21


Pitman, M. C., Suits, F., Gawrisch, K. & Feller, S. E. J Chem Phys 122, 244715 (2005).
Pitman, M. C., Grossfield, A., Suits, F. & Feller, S. E. J. Am. Chem. Soc. 127, 4576-4577 (2005).
Pitman, M. C., Suits, F., Mackerell, A. D., Jr. & Feller, S. E. Biochemistry 43, 15318-28 (2004).
Suits, F., Pitman, M. C. & Feller, S. E. J Chem Phys 122, 244714 (2005).

SOPE 3:1 SDPC/CHOL

Rhodopsin in 2:2:1 SDPC/SDPE/CHOL

Rhodopsin - Dark Ensemble Light-adapted Rhodopsin

Toward Active Rhodopsin

Pitman et al., PNAS (2005)

SLIDE 22

8/9/2007

Rhodopsin Photocycle

Dark-adapted Rhodopsin → (hν) → Photorhodopsin → Bathorhodopsin → Lumirhodopsin → Meta-I → Meta-II
Isomerize retinal
Bind transducin (G-protein)
Activate phosphodiesterase
Timescales marked on the diagram: ~200 fs, ns, μs, ms

SLIDE 23


SLIDE 24


Magnetization Transfer from Water to Protein and Lipid Resonances During Meta-I Rhodopsin

M. C. Pitman, S. E. Feller, A. Grossfield, O Soubias, K. Gawrisch (under review)
Figure: Relative Attenuation (0.0 to 1.0) versus Time/Minutes; ROS disks, pH 8.0, 20 °C

SLIDE 25


Current Strategy for Pandemic Influenza is reactive

Challenges addressed by current US Pandemic Influenza Implementation Plan:
  Detect major disease pathogens
  React to real time outbreak detection
  Contain disease spread, collaborating with public health agencies

However… rapid virus evolution makes any pandemic plan a reactive challenge and vaccination efforts difficult
  Evolving virus strains are more aggressive and tolerant
  Vaccines may become ineffective unless you stay ahead of the strain

Evolution of Influenza A Virus Subtypes in the Human Population
Figure: timeline by YEAR (1889, 1900, 1918, 1940, 1960, 1980, 2000) with subtype labels H3, H1, H1, H2, H1? and ?
H = Hemagglutinin

SLIDE 26


Checkmate: Research on Avian influenza

Project Checkmate changes focus from reactive disease control to proactive pandemic prevention

By partnering the world’s best research science & supercomputing technology, Project Checkmate provides a means to:
  Anticipate genetic variation and disease evolution to develop effective vaccines
  Develop prophylactics and therapeutics for potential future pandemics
  Apply this strategy to other emerging infectious diseases

SLIDE 27


Advances in Biotechnology and Supercomputing

Biological analysis: Reconstructed 1918 pandemic virus to study disease properties
Structural analysis: Completed Hemagglutinin from 1918 pandemic & H5N1 avian viruses
Blue Gene technology: World-renowned high capacity computational power
IBM ranks #1, 4, 5, 7 in Top10

Result is ability to proactively predict disease evolution

Characterization of the Reconstructed 1918 Spanish Influenza Pandemic Virus, Tumpey et al., Science: Vol. 310, Page 77, Published 2005
Structure and Receptor Specificity of the Hemagglutinin from an H5N1 Influenza Virus, Stevens, et al., Science: Vol. 312, Page 404, Published 2006
Observation of a dewetting transition in the collapse of the melittin tetramer, P. Liu, et al., Nature: Vol. 435, Page 159, Published 2005

SLIDE 28


Team 1: Rapid In Vitro Evolution: Developing a methodology to identify and neutralize future virulent strains (Janda).
Team 2: Computer Modeling and Structural Prediction of Influenza Virus Evolution (Wilson).
Team 3: Antibodies and Vaccines: Finding and targeting influenza’s Achilles’ heel (Burton).
Team 4: Small Molecule Inhibitors: Targeting HA attachment to host cells (Wong).
Team 5: Computational Prediction of Antigenic Variation and Biological Validation: Reverse genetics to reconstruct influenza viruses to test computational and experimental predictions of antigenic variation (Palese).

Figure: Antigenic evolution of avian influenza A (H5N1) virus from 1997 to 200?; strain labels: HK97, HK01, Indo03, HK03, Viet04, Viet05, Av?, Hu?, Av/Hu?

SLIDE 29


Hemagglutinin

HA trimer from H5N1 Simulation on Blue Gene/L

www.ibm.com/avianflu

SLIDE 30


Blue Gene/L Development

Process of Drug Discovery

Target Identification and Selection
Target Isolation and Purification
Structure Determination
Analyze Structure for Potential Ligand Binding Sites
Docking of Small Molecules using Computational Methods
Biochemical Assays and Further Testing
Lead Optimization to Improve Potency
Pre-clinical Testing
Drug Candidate

Figure scales: Years (4, 8); Compounds (3,000 – 10,000; 250)

SLIDE 31


Docking Components

Efficient Search Procedure
  Speed and efficiency

Scoring Function
  Fast
  Accurate (discriminate between native and non-native docked information)
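The two docking components named on this slide, a search procedure and a scoring function, can be shown in a toy form. The sketch below is an assumption-heavy illustration: it works in 2D, allows only translations, uses exhaustive grid search, and scores a pose by summed squared distance to the nearest site point (lower is better). None of this is taken from DOCK6 or any other real docking code; all names are hypothetical.

```python
from itertools import product

Point = tuple[float, float]


def score(ligand: list[Point], site: list[Point], shift: Point) -> float:
    """Scoring function: squared distance of each shifted atom to its
    nearest site point, summed over atoms. Zero means a perfect overlay."""
    dx, dy = shift
    total = 0.0
    for x, y in ligand:
        px, py = x + dx, y + dy
        total += min((px - sx) ** 2 + (py - sy) ** 2 for sx, sy in site)
    return total


def grid_search(ligand: list[Point], site: list[Point],
                steps: int, step_size: float) -> Point:
    """Search procedure: try every translation on a square grid and
    return the one with the lowest score."""
    offsets = [i * step_size for i in range(-steps, steps + 1)]
    return min(product(offsets, offsets),
               key=lambda shift: score(ligand, site, shift))


if __name__ == "__main__":
    ligand = [(0.0, 0.0), (1.0, 0.0)]
    site = [(3.0, 2.0), (4.0, 2.0)]
    best = grid_search(ligand, site, steps=5, step_size=1.0)
    print(best, score(ligand, site, best))
```

The split mirrors the slide: the search cost grows with the number of poses tried, so it must be efficient, while the scoring function is called once per pose, so it must be fast and still separate good poses from bad ones.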

SLIDE 32


Dr. Yuan-Ping Pang, Mayo Clinic

SLIDE 33


Drug Docking on Blue Gene

SLIDE 34


SARS Virus

SLIDE 35


SARS Virus Cysteine Proteinase

SLIDE 36


Blue Gene Accelerates the Pace of Discovery

SLIDE 37


DOCK6 External Collaborations

SLIDE 38


Results for HTC Mode

Parallelizing the dispatching of embarrassingly parallel ("pleasantly parallel") codes provides near linear speed up in HTC mode

Linear scale up has been validated to 8192 processors, but all indications suggest this will continue as the system is grown

Figure: Dock6 in HTC mode. Axes: Processors (1, 2048, 4096, 6144, 8192) and Speed Up (2000 to 10000). Series: Linear; Optimized MPI code; Dock6 in HTC mode

SLIDE 39


Dock6 and High Throughput Compute Mode

Why HTC?

Enables large number of independent tasks
Multiple Instruction, Multiple Data (MIMD) architecture
Dock6 is an embarrassingly parallel code where each processor is conducting independent calculations on different ligands in the library. This makes Dock6 a great candidate for HTC mode.
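The structure of an embarrassingly parallel job, one independent calculation per ligand with no communication between tasks, can be sketched with a worker pool. This is a stand-in only: `score_ligand` is a hypothetical placeholder for a docking calculation, and a Python thread pool is used for brevity rather than anything resembling the Blue Gene HTC dispatcher.

```python
from concurrent.futures import ThreadPoolExecutor


def score_ligand(ligand: str) -> tuple[str, int]:
    """Placeholder for one independent docking calculation (hypothetical).
    It depends only on its own input, never on another task's result."""
    return ligand, sum(ord(char) for char in ligand) % 100


def dock_library(ligands: list[str], workers: int = 4) -> dict[str, int]:
    """Dispatch every ligand to a pool of workers and collect the scores."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(score_ligand, ligands))


if __name__ == "__main__":
    library = ["ligand_a", "ligand_b", "ligand_c"]
    print(dock_library(library))
```

Because no task waits on another, the result is the same whatever the worker count, and adding workers should shorten the run almost proportionally; for CPU-bound work in Python a process pool would be the more realistic choice.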

Results

Demonstrated scalable task dispatch to 1000’s of processors.
Optimal ratio of dispatcher to partition size is 1:32 – latencies increase above this level, possibly due to Launcher contention for socket resource. May depend on task duration and arrival rates.
Delayed dispatch proportional to executable size for effective task distribution across partitions (using 0.24 microseconds per byte) – due to IO Node to Compute Node bandwidth.
Led to invocation of multiple dispatchers: Near Linear Scaling
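The two figures quoted in these results lend themselves to a quick back-of-the-envelope calculation. The sketch below assumes the 0.24 microseconds per byte applies to the whole executable and that the 1:32 ratio means one dispatcher per 32 units of the partition; the executable size in the example and the function names are hypothetical.

```python
import math

SECONDS_PER_BYTE = 0.24e-6       # 0.24 microseconds per byte, from the slide
UNITS_PER_DISPATCHER = 32        # 1:32 dispatcher-to-partition ratio


def dispatch_delay_seconds(executable_bytes: int) -> float:
    """Delay attributed to moving the executable, proportional to its size."""
    return executable_bytes * SECONDS_PER_BYTE


def dispatchers_needed(partition_size: int) -> int:
    """Dispatchers required to stay at or below the 1:32 ratio."""
    return math.ceil(partition_size / UNITS_PER_DISPATCHER)


if __name__ == "__main__":
    print(dispatch_delay_seconds(10_000_000), "s for a 10 MB executable")
    print(dispatchers_needed(8192), "dispatchers for 8192 processors")
```

With these assumptions a 10 MB executable costs about 2.4 seconds per dispatch, and 8192 processors call for 256 dispatchers, which is consistent with the slide's move to multiple dispatchers.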

SLIDE 40


HRRT PET Blue Gene/L

SLIDE 41


Outline

Biology has become an Information Science

Data explosion – how to take advantage
Scale of Computing is rapidly advancing – think big
Tackle Complexity in Biology – think multiscale

SLIDE 42


Systems Biology at IBM Research

Reverse Engineering
Data Mining
Modeling & Simulations
Experimentation & collaborations

Figure labels: p53 / TP53, MDM2 / mdm2, cycg1 / CYCG1, wip1 / WIP1, p14, p38, DNA damage, Nutlin

SLIDE 43


Digital response of tumor suppressor p53 to IR

Lahav et al., Nature Genetics, 2004

Figure labels: p53 protein; Irradiation; DNA damage initiation & repair; ATM: DNA damage detection; P53-MDM2 oscillator; Cell Cycle Arrest and DNA Repair

SLIDE 44


Digital response of tumor suppressor p53 to IR

Diagram labels: ATM*; p53 DNA; mRNA; Protein TP53; Protein TP53*; mdm2 DNA; mRNA; Protein MDM2; Basal; Basal (P1); Induced (P2); (Fast); (Slow); Delay=?

Lahav et al., Nature Genetics 2004

Plot: Molecular intensity (fold, 1 to 5) versus Time (min, 500 to 1500) for TP53, mdm2 and MDM2; response to 5Gy DNA damage

Ma, Wagner, Rice, Hu, Levine and Stolovitzky, A plausible model for the digital response of p53 to DNA damage, PNAS (2005).

Wenwei Hu, Zhaohui Feng, Lan Ma, John Wagner, J. Jeremy Rice, Gustavo Stolovitzky, and Arnold J. Levine, A Single Nucleotide Polymorphism in the MDM2 Gene Disrupts the Oscillation of p53 and MDM2 Levels in Cells, Cancer Research (in press), 2007.
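The p53-MDM2 loop on this slide is a negative feedback with delays, and a toy version of such a loop is easy to integrate numerically. The sketch below is not the model of Ma et al. or of Lahav et al.: the two equations, the single delay, every parameter value and the function name are assumptions chosen only to show the mechanics of a delayed feedback simulation.

```python
def simulate_delayed_feedback(t_end=50.0, dt=0.01, tau=2.0,
                              beta=1.0, alpha=1.0, gamma=1.0, delta=1.0):
    """Euler integration of a toy delayed negative-feedback loop.

    dp/dt = beta - alpha * m(t) * p(t)          (p53-like species)
    dm/dt = gamma * p(t - tau) - delta * m(t)   (MDM2-like species)

    All parameter values are made up for illustration.
    """
    n_steps = int(round(t_end / dt))
    lag = int(round(tau / dt))  # delay expressed in time steps
    p, m = [0.1], [0.1]
    for k in range(n_steps):
        p_delayed = p[max(k - lag, 0)]  # constant history before t = 0
        dp = beta - alpha * m[k] * p[k]
        dm = gamma * p_delayed - delta * m[k]
        # Concentrations cannot be negative, so clamp at zero.
        p.append(max(p[k] + dt * dp, 0.0))
        m.append(max(m[k] + dt * dm, 0.0))
    times = [k * dt for k in range(n_steps + 1)]
    return times, p, m


if __name__ == "__main__":
    times, p, m = simulate_delayed_feedback()
    for k in range(0, len(times), 500):
        print(f"t={times[k]:5.1f}  p={p[k]:.3f}  m={m[k]:.3f}")
```

Whether and how strongly the output oscillates depends on the delay and rate values, which is the kind of question such models are built to explore; fitting them to data like the 5Gy response is a separate step not attempted here.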

SLIDE 45


Blue Brain Project

Ecole Polytechnique Fédérale de Lausanne

Research collaboration with EPFL to simulate the neocortical column using the Blue Gene supercomputer

Why
  Better understand the basis of cognitive function → human nature
  Better understand neocortical diseases
  Identify treatment targets and strategies

Why use computation
  Enable scientific discovery impossible or difficult with biological experimentation alone
  Manage complexity
  Integrate knowledge
  Inform experimental design and theory

SLIDE 46


The Data

Unprecedented neocortical microcircuit data comprising

Neurons
  Morphological type
  Electrophysiological type
  Ion channel types/distributions
  Gene expression
Synapses
  Neurotransmitter receptors
  Number and innervation patterns
  Dynamics and plasticity rules
Microcircuit structure
  Frequency/layout of neuron types
  Connectivity patterns/probabilities

SLIDE 47


The Approach

Maintain tight coupling between simulation and experimental validation
  Exploit unparalleled neocortical microcircuit data
  Develop parameterization techniques & models
  Validate models with in vitro & in vivo experimental data
  Use experimentation to validate simulated findings

Employ multiple modeling strategies
  Up: Neuron → Slice → Column
  Down: Network → Point neurons → Complex neurons

Facilitate large-scale neural simulation
  Optimize simulation layout
  Develop parameterization and analysis techniques

Henry Markram, Opinion: The Blue Brain Project, Nature Reviews Neuroscience 7:153-160 (Feb 2006)

SLIDE 48


Summary

Biology has become an Information Science
  Will continue to present challenges orders of magnitude beyond current capability
  Information technology is a vital tool enabling discovery

Data explosion
  Capacity computing to handle large and heterogeneous data
  Rapidly evolving algorithms require general and easy to use systems

Scale of Computing
  Molecular simulations amongst the most mature to exploit petascale
  Meaningful biology can exploit orders of magnitude more capability

Tackle Complexity in Biology
  Think multiscale – brute force won’t get you far
  Problems are not compute limited
  Challenge to find the right abstraction and model that provides insight

SLIDE 49


IBM Thomas J. Watson Research Center, Yorktown Heights, NY

IBM Computational Biology Center www.research.ibm.com/compsci/compbio

SLIDE 50


World Community Grid

www.worldcommunitygrid.org
Launched in Nov ’04
Mobilize the community
>2400 years of computing in first 50 days
Technology solving problems
Advisory Board of experts in health sciences, technology and philanthropy

RFP for new projects
FightAIDS@Home launched Nov 21, 2005
30,000 additional devices in first 10 days
Art Olson, Scripps
AutoDOCK for HIV Protease inhibitors