Deep Computing in Biology: Challenges and Progress
Ajay K. Royyuru
Computational Biology Center, Thomas J. Watson Research Center
ajayr@us.ibm.com
2
IBM Computational Biology Center
Outline
Biology has become an Information Science
Data explosion – how to take advantage
- High Throughput technologies
- Genomics
- Genographic
- Proteomics
- Medical Imaging
- Data Integration, Mining and Analysis
Scale of Computing is rapidly advancing – think big
Tackle Complexity in Biology – think multiscale
3
microRNAs in a nutshell
4
rna22's Predictions

Genome | Currently known precursors (June 2005) | rna22 predicted precursors (keys) | rna22 predicted 3' UTR targets (locks) | rna22 predicted affected transcripts
H. sapiens | 321 | 50,139 | > 550,000 | 23,616
M. musculus | 245 | 44,358 | > 400,000 | 18,597
D. melanogaster | 78 | 1,117 | > 150,000 | 13,104
C. elegans | 114 | 623 | > 60,000 | 9,752
Rigoutsos et al., Cell (2006)
5
www.nationalgeographic.com/genographic
The Genographic Project
6
Map of Human Migration
The Genographic Project
7
Public Participation
Over 217,000 participants to date www.nationalgeographic.com/genographic www.ibm.com/dna
Behar et al., PLoS Genetics, 3/e104: 1083-1095 (2007)
8
mtDNA Report
HVS1 Sequence
- Haplogroup: M*
- 16223T, 16519C
- ATTCTAATTTAAACTATTCTCTGTTCTTTCATGGGGAAGCAGATTTGGGTA
CCACCCAAGTATTGACTCACCCATCAACAACCGCTATGTATTTCGTACATT ACTGCCAGCCACCATGAATATTGTACGGTACCATAAATACTTGACCACCTG TAGTACATAAAAACCCAATCCACATCAAAACCCCCTCCCCATGCTTACAAG CAAGTACAGCAATCAACCTTCAACTATCACACATCAACTGCAACTCCAAAG CCACCCCTCACCCACTAGGATACCAACAAACCTACCCACCCTTAACAGTAC ATAGTACATAAAGCCATTTACCGTACATAGCACATTACAGTCAAATCCCTT CTCGTCCCCATGGATGACCCCCCTCAGATAGGGGTCCCTTGACCACCATCC TCCGTGAAATCAATATCCCGCACAAGAGTGCTACTCTCCTCGCTCCGGGCC CATAACACTTGGGGGTAGCTAAAGTGAACTGTATCCGACATCTGGTTCCTA CTTCAGGGCCATAAAGCCTAAATAGCCCACACGTTCCCCTTAAATAAGACA TCACGATG
9
Why is Proteomics Important?
- Proteins do real work in cells, not genes
- Many diseases involve post-translational modifications of proteins (hence, not encoded in genes)
- Looking for protein-based biomarkers to track disease state or progression
Figure labels: Folded, modified, translocated; Mature Protein; Cellular Machinery
10
Diagnostics with Proteomics
Process of identification of protein fragments in the blood of an individual:
- Extract blood from subject
- Process serum in mass spec
- Extract raw data
- Identify peaks via novel 2D analysis
Identification of markers characteristic of disease:
- Medical condition, Healthy individuals → Our algorithms → Biomarkers of disease
Diagnosis:
- Take serum from patient
- Analyze serum peaks
- Are these biomarkers of disease present? YES: patient has condition. NO: patient does not have condition.
The near future: Proteomics will allow for early diagnostics of some conditions from blood samples
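The YES/NO decision above can be sketched in a few lines. This is only an illustration under stated assumptions: the slide's "novel 2D analysis" is not described, so a simple 1D local-maximum peak picker is swapped in, and the names `find_peaks`, `has_condition`, the tolerance `tol` and all m/z values are hypothetical.

```python
from typing import List, Tuple

def find_peaks(spectrum: List[Tuple[float, float]], min_intensity: float) -> List[float]:
    """Return m/z values that are local intensity maxima above min_intensity.

    spectrum is a list of (m/z, intensity) pairs sorted by m/z.
    Stand-in for the slide's 2D analysis, which is not specified.
    """
    peaks = []
    for i in range(1, len(spectrum) - 1):
        mz, intensity = spectrum[i]
        if (intensity >= min_intensity
                and intensity > spectrum[i - 1][1]
                and intensity >= spectrum[i + 1][1]):
            peaks.append(mz)
    return peaks

def has_condition(peaks: List[float], biomarkers: List[float], tol: float = 0.5) -> bool:
    """YES when every biomarker m/z is matched by some peak within tol (assumed rule)."""
    return all(any(abs(p - b) <= tol for p in peaks) for b in biomarkers)

# Hypothetical serum spectrum and biomarker list
spectrum = [(1000.0, 1.0), (1001.0, 5.0), (1002.0, 1.0),
            (1500.0, 0.5), (1501.0, 8.0), (1502.0, 0.2)]
peaks = find_peaks(spectrum, min_intensity=2.0)
print(peaks, has_condition(peaks, [1001.2, 1501.0]))
```

A real classifier would weigh peak intensities and tolerate missing markers; the all-or-nothing rule here only mirrors the binary branch drawn on the slide.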
11
Medical Imaging: fMRI
Listening to music Hubs analysis Activity analysis
12
Directional links explain the difference
Figure panels: Neutral Links (Visual, Auditory); Directed Links (Visual, Auditory)
Presented at the Human Brain Mapping Conference (2006)
13
Network Analysis
- Graphs determined by the structure of pairwise correlations between voxels display very robust topological statistical regularities, including power-law connectivity scaling and small-worldness*
- However, the computations quickly become intractable as one moves up from two-point correlations
- We developed a novel approach that extends our previous findings to include directional links, and on this basis analyzes the presence and significance of higher-order correlation patterns
- We implemented a series of algorithms on distributed platforms that render our approach feasible
*Scale-Free Brain Functional Networks, V.M. Eguiluz, D.R. Chialvo, G.A. Cecchi, M. Baliki & V.A. Apkarian, Physical Review Letters 94:18102 (2005)
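As a rough illustration of the pairwise-correlation graph mentioned on this slide, the sketch below links two voxels when the absolute Pearson correlation of their time series exceeds a threshold, then counts each voxel's degree. It is not the authors' pipeline: the threshold, the toy time series and the function names are assumptions, and directional links are not modeled.

```python
from itertools import combinations
from math import comb, sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def correlation_graph(series, threshold):
    """Undirected edges (i, j), i < j, where |correlation| >= threshold."""
    return {(i, j) for i, j in combinations(range(len(series)), 2)
            if abs(pearson(series[i], series[j])) >= threshold}

def degrees(n, edges):
    """Number of links attached to each of the n voxels."""
    deg = [0] * n
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    return deg

# Three hypothetical voxel time series
series = [[1, 2, 3, 4], [2, 4, 6, 8], [1, -1, 1, -1]]
edges = correlation_graph(series, threshold=0.9)
print(edges, degrees(len(series), edges))

# Why higher-order patterns get expensive: tuples to examine among n voxels
n = 10_000
print(comb(n, 2), comb(n, 3))
```

The last two numbers show the jump from pairs to triples for the same voxel count, which is the combinatorial growth behind the slide's remark about intractability.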
14
Outline
Biology has become an Information Science
Data explosion – how to take advantage
- High Throughput technologies
- Genomics
- Genographic
- Proteomics
- Medical Imaging
- Data Integration, Mining and Analysis
Scale of Computing is rapidly advancing – think big
- Molecular Simulations
- Docking, Virtual Screening
- Medical Imaging
Tackle Complexity in Biology – think multiscale
15
Top Supercomputers in the World, June 2007
Source: www.top500.org

# | Vendor | Rmax (TFlops) | Installation
1 | IBM | 280.6 | DOE/NNSA/LLNL (64 racks BlueGene/L)
2 | Cray | 101.7 | Oak Ridge NL (XT3 Opteron)
3 | Cray | 101.4 | Sandia – Red Storm (XT3 Opteron)
4 | IBM | 91.29 | BlueGene at Watson (20 racks BlueGene/L)
5 | IBM | 82.16 | BlueGene at Stony Brook / BNL (18 racks BlueGene/L)
6 | IBM | 75.76 | ASC Purple LLNL (1526 nodes p5 575)
7 | IBM | 73.03 | BlueGene at RPI (16 racks BlueGene/L)
8 | Dell | 62.68 | NCSA (QC Xeon/Infiniband)
9 | IBM | 62.63 | BSC MareNostrum (2560 JS21 Blades)
10 | SGI | 56.52 | Altix4700 at LRZ (DC Itanium 2/Infiniband)
11 | Dell | 53.00 | Sandia NL (Xeon/Infiniband)
12 | Bull | 52.84 | CEA/DAM Tera10 (Itanium2)
13 | SGI | 51.87 | NASA/Columbia (Itanium2)
14 | NEC/Sun | 48.88 | Tsubame Galaxy TiTech (Opteron/Clearspeed/IB)
15 | Dell | 46.73 | TACC – Lonestar (Xeon/Infiniband)
16 | Dell | 42.39 | Maui HPCC – Jaws (Xeon/Infiniband)
17 | Linux Networx | 40.61 | ARL (DC Xeon 51xx/Infiniband)
18 | IBM | 37.33 | FZJ – Juelich (8 racks BlueGene/L)
19 | Appro | 36.58 | Japan Earth Simulator (DC Opteron/IB)
20 | NEC | 35.86 | Japan Earth Simulator (NEC)

Slide flags on individual entries: New, Upgrade
16
Time Scales: Biopolymers and Membranes
Time axis (s): 10^-15, 10^-12, 10^-9, 10^-6, 10^-3, 1, 10^3, 10^6, 10^9
Processes shown: Bond Vibration, DNA Twisting, Hinge Motion, Helix-Coil Transition, Protein Folding, Ligand-Protein Binding, Electron Transfer, Lipid exchange via diffusion, Torsional correlation in lipid headgroups
Ranges marked: Simulation, Experiment
Adapted from “The Protein Folding Problem”, Chan and Dill, Physics Today, Feb. 1993
17
Blue Matter strong scaling performance
Plot: Computation Rate (time-steps/second) as a function of Atoms/Node
Series: Hairpin SPI 64^3 (V5), SOPE SPI 64^3 (V5), Hairpin SPI 64^3 (V4), SOPE SPI (V5), Rhodopsin SPI (V5), SOPE SPI (V4), ApoA1 SPI (V5), Rhodopsin SPI (V4), ApoA1 SPI (V4), SOPE MPI (V4), Rhodopsin MPI (V4), ApoA1 MPI (V4), ApoA1 NAMD Msging Layer, ApoA1 NAMD MPI
www.research.ibm.com/bluegene
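To connect a computation rate in time-steps/second to the time scales on the previous slide, the arithmetic below converts a simulated duration into wall-clock days. The 2 fs time step and the rate of 1000 time-steps/second are illustrative assumptions, not values read from the plot, and `wall_clock_days` is a hypothetical helper name.

```python
def wall_clock_days(simulated_seconds: float, timestep_fs: float, steps_per_second: float) -> float:
    """Wall-clock days needed to cover simulated_seconds of molecular time."""
    steps = simulated_seconds / (timestep_fs * 1e-15)   # number of MD time steps
    return steps / steps_per_second / 86400.0           # 86400 seconds per day

# Assumed values: 1 microsecond of simulated time, 2 fs step, 1000 steps/s
print(round(wall_clock_days(1e-6, 2.0, 1000.0), 2), "days")
```

Under these assumptions a microsecond trajectory takes a little under six days; each further factor of 1000 in simulated time multiplies that by 1000 unless the rate improves.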
18
Lysozyme System
Trp62Ala mutation of Lysozyme
- Dramatically reduces stability in 8M urea solution
- Responsible for amyloid formation
Lysozyme structure consists of:
- α-domain with 4 alpha helices (A-D) and one 3_10 helix
- β-domain with antiparallel beta sheet
- Loop of the beta-domain
C. Dobson and coworkers, Nature 424, 783, 2003
19
R Zhou, M Eleftheriou, AK Royyuru, BJ Berne, Destruction of long-range interactions by a single mutation in lysozyme, Proc. Natl. Acad. Sci., 104:5824-5829 (2007)
20
GPCR-based drugs among the 200 best-selling prescriptions, and their GPCR targets

GPCR target | Drug | Disease | Company | 2000 sales (US $m)
Histamine receptors | Zantac | Ulcers | AstraZeneca | 870
Histamine receptors | Pepcid | | Merck | 850
Histamine receptors | Claritin | Allergies | Schering-Plough | 2,200
Histamine receptors | Allegra | | Aventis | 1,100
5-HT receptors | Risperdal | Psychosis | Johnson & Johnson | 1,600
5-HT receptors | Imitrex | Migraine | GlaxoSmithKline | 1,100
5-HT receptors | BuSpar | Anxiety | Bristol-Myers Squibb | 714
5-HT receptors | Zyprexa | Schizophrenia | Eli Lilly | 2,400
Angiotensin receptors | Cozaar | Hypertension | Merck | 1,700
Adrenoceptors | Toprol-XL | | AstraZeneca | 580
Adrenoceptors | Coreg | Congestive heart failure | GlaxoSmithKline | 250
Adrenoceptors | Serevent | Asthma | GlaxoSmithKline | 940
Muscarinic acetylcholine receptors | Atrovent | COPD | Boehringer Ingelheim | 600
GnRH receptors | Zoladex | Cancer | AstraZeneca | 740
Dopamine receptors | Requip | Parkinson’s disease | AstraZeneca | 90
Prostaglandin (PGE1) receptors | Cytotec | Ulcers | Pharmacia | 100
ADP receptors | Plavix | Stroke | Bristol-Myers Squibb | 900
http://www.predixpharm.com/market_table.htm
21
Pitman, M. C., Suits, F., Gawrisch, K. & Feller, S. E. J Chem Phys 122, 244715 (2005).
Pitman, M. C., Grossfield, A., Suits, F. & Feller, S. E. J. Am. Chem. Soc. 127, 4576-4577 (2005).
Pitman, M. C., Suits, F., Mackerell, A. D., Jr. & Feller, S. E. Biochemistry 43, 15318-28 (2004).
Suits, F., Pitman, M. C. & Feller, S. E. J Chem Phys 122, 244714 (2005).
Figure panels:
- SOPE 3:1 SDPC/CHOL
- Rhodopsin in 2:2:1 SDPC/SDPE/CHOL
- Rhodopsin - Dark Ensemble
- Light-adapted Rhodopsin
- Toward Active Rhodopsin
Pitman et al., PNAS (2005)
8/9/2007 22
Rhodopsin Photocycle
Dark-adapted Rhodopsin → (hν) Photorhodopsin → Bathorhodopsin → Lumirhodopsin → Meta-I → Meta-II
Step labels: Isomerize retinal; Bind transducin (G-protein); Activate phosphodiesterase
Timescales marked: ~200 fs, ns, μs, ms
23
24
Magnetization Transfer from Water to Protein and Lipid Resonances During Meta-I Rhodopsin
M. C. Pitman, S. E. Feller, A. Grossfield, O. Soubias, K. Gawrisch (under review)
Plot: Relative Attenuation versus Time/Minutes; ROS disks, pH 8.0, 20 °C
25
However…
rapid virus evolution makes any pandemic plan a reactive challenge and vaccination efforts difficult
Evolution of Influenza A Virus Subtypes in the Human Population
Figure: subtype timeline (H3, H1, H1, H2, H1?) against YEAR, with marks at 1889, 1900, 1918, 1940, 1960, 1980, 2000
Challenges addressed by current US Pandemic Influenza Implementation Plan:
- Detect major disease pathogens
- React to real time outbreak detection
- Contain disease spread, collaborating with public health agencies
- Evolving virus strains are more aggressive and tolerant
- Vaccines may become ineffective unless you stay ahead of the strain
H = Hemagglutinin
Current Strategy for Pandemic Influenza is reactive
26
Checkmate: Research on Avian influenza
Project Checkmate changes focus from reactive disease control to proactive pandemic prevention
By partnering the world’s best research science & supercomputing technology, Project Checkmate provides a means to:
- Anticipate genetic variation and disease evolution to develop effective vaccines
- Develop prophylactics and therapeutics for potential future pandemics
- Apply this strategy to other emerging infectious diseases
27
Advances in Biotechnology and Supercomputing
- Biological analysis: Reconstructed 1918 pandemic virus to study disease properties
- Structural analysis: Completed Hemagglutinin from 1918 pandemic & H5N1 avian viruses
- Blue Gene technology: World-renowned high capacity computational power
Result is ability to proactively predict disease evolution
References shown on the slide:
- Characterization of the Reconstructed 1918 Spanish Influenza Pandemic Virus, Tumpey et al., Science: Vol. 310, Page 77, Published 2005
- Structure and Receptor Specificity of the Hemagglutinin from an H5N1 Influenza Virus, Stevens et al., Science: Vol. 312, Page 404, Published 2006
- Observation of a dewetting transition in the collapse of the melittin tetramer, P. Liu et al., Nature: Vol. 435, Page 159, Published 2005
IBM ranks #1, 4, 5, 7 in Top10
28
Team 1: Rapid In Vitro Evolution: Developing a methodology to identify and neutralize future virulent strains (Janda).
Team 2: Computer Modeling and Structural Prediction of Influenza Virus Evolution (Wilson).
Team 3: Antibodies and Vaccines: Finding and targeting influenza’s Achilles’ heel (Burton).
Team 4: Small Molecule Inhibitors: Targeting HA attachment to host cells (Wong).
Team 5: Computational Prediction of Antigenic Variation and Biological Validation: Reverse genetics to reconstruct influenza viruses to test computational and experimental predictions of antigenic variation (Palese).
Antigenic evolution of avian influenza A (H5N1) virus from 1997 to 200?
Figure: strains HK97, HK01, Indo03, HK03, Viet04, Viet05; host labels Av?, Hu?, Av/Hu?
29
Hemagglutinin
HA trimer from H5N1 Simulation on Blue Gene/L
www.ibm.com/avianflu
30
Blue Gene/L Development
Process of Drug Discovery
- Target Identification and Selection
- Target Isolation and Purification
- Structure Determination
- Analyze Structure for Potential Ligand Binding Sites
- Docking of Small Molecules using Computational Methods
- Biochemical Assays and Further Testing
- Lead Optimization to Improve Potency
- Pre-clinical Testing
- Drug Candidate
Timeline labels: Years (4, 8); Compounds (3,000 – 10,000; 250)
31
Docking Components
Efficient Search Procedure
- Speed and efficiency
Scoring Function
- Fast
- Accurate (discriminate between native and non-native docked information)
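The two components named on this slide, a search procedure and a scoring function, can be sketched with a toy example. Everything here is an assumption made for illustration: the score is a simple squared deviation from a preferred contact distance `r0`, the search is plain random translation, and none of it reflects how DOCK or any other docking program actually scores or searches.

```python
import random
from math import dist

def score(ligand, receptor, r0=3.5):
    """Toy scoring function: squared deviation of each ligand atom's
    nearest-receptor distance from a preferred distance r0 (lower is better)."""
    return sum((min(dist(a, b) for b in receptor) - r0) ** 2 for a in ligand)

def translate(ligand, shift):
    """Rigidly shift every ligand atom by the same (dx, dy, dz)."""
    return [(x + shift[0], y + shift[1], z + shift[2]) for x, y, z in ligand]

def random_search(ligand, receptor, n_trials=200, step=5.0, seed=0):
    """Toy search procedure: try random translations of the starting pose
    and keep the best-scoring one."""
    rng = random.Random(seed)
    best_pose, best_score = ligand, score(ligand, receptor)
    for _ in range(n_trials):
        shift = tuple(rng.uniform(-step, step) for _ in range(3))
        pose = translate(ligand, shift)
        s = score(pose, receptor)
        if s < best_score:
            best_pose, best_score = pose, s
    return best_pose, best_score

# Hypothetical one-atom receptor and one-atom ligand
receptor = [(0.0, 0.0, 0.0)]
ligand = [(10.0, 0.0, 0.0)]
pose, best = random_search(ligand, receptor)
print(score(ligand, receptor), best)
```

The split matters for cost: the scoring function is called once per trial pose, so a fast score permits a wider search, while an inaccurate score makes even an exhaustive search return the wrong pose.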
32
- Dr. Yuan-Ping Pang, Mayo Clinic
33
Drug Docking on Blue Gene
34
SARS Virus
35
SARS Virus Cysteine Proteinase
36
Blue Gene Accelerates the Pace of Discovery
37
DOCK6 External Collaborations
38
Results for HTC Mode
- Parallelizing the dispatching of embarrassingly pleasantly parallel codes provides near linear speed up in HTC mode
- Linear scale up has been validated to 8192 processors, but all indications suggest this will continue as the system is grown
Plot: Dock6 in HTC mode, Speed Up versus Processors (1 to 8192); series: Linear, Optimized MPI code, Dock6 in HTC mode
39
Dock6 and High Throughput Compute Mode
- Why HTC?
  - Enables large number of independent tasks
  - Multiple Instruction, Multiple Data (MIMD) architecture
  - Dock6 is an embarrassingly parallel code where each processor is conducting independent calculations on different ligands in the library. This makes Dock6 a great candidate for HTC mode.
- Results
  - Demonstrated scalable task dispatch to thousands of processors.
  - Optimal ratio of dispatcher to partition size is 1:32; latencies increase above this level, possibly due to Launcher contention for socket resource. May depend on task duration and arrival rates.
  - Delayed dispatch proportional to executable size for effective task distribution across partitions (using 0.24 microseconds per byte), due to IO Node to Compute Node bandwidth. Led to invocation of multiple dispatchers.
  - Near Linear Scaling
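The two figures quoted in the results, 0.24 microseconds per byte and a 1:32 dispatcher-to-partition ratio, lend themselves to a short back-of-the-envelope calculation. The sketch assumes the ratio means one dispatcher per 32 processors and that the delay scales linearly with executable size; the function names and the 1 MB executable are hypothetical.

```python
from math import ceil

US_PER_BYTE = 0.24   # dispatch delay per byte of executable, from the slide
RATIO = 32           # processors served per dispatcher (assumed reading of 1:32)

def dispatch_delay_seconds(executable_bytes: int) -> float:
    """Delay applied before dispatching an executable of the given size."""
    return executable_bytes * US_PER_BYTE * 1e-6

def dispatchers_needed(processors: int, ratio: int = RATIO) -> int:
    """Dispatchers required to keep the dispatcher:processor ratio at 1:ratio."""
    return ceil(processors / ratio)

print(dispatch_delay_seconds(1_000_000), "s for a 1 MB executable")
print(dispatchers_needed(8192), "dispatchers for 8192 processors")
```

With these assumptions a 1 MB executable is delayed by about a quarter of a second, and the 8192-processor run reported on the previous slide would use 256 dispatchers.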
40
HRRT PET Blue Gene/L
41
Outline
Biology has become an Information Science
Data explosion – how to take advantage
Scale of Computing is rapidly advancing – think big
Tackle Complexity in Biology – think multiscale
42
Systems Biology at IBM Research
Activities: Reverse Engineering; Data Mining; Modeling & Simulations; Experimentation & collaborations
Network diagram labels: DNA damage, Nutlin, p14, p38, p53, TP53, MDM2, mdm2, cycg1, CYCG1, wip1, WIP1
43
Digital response of tumor suppressor p53 to IR
Lahav et al., Nature Genetics, 2004 (p53 protein)
Diagram blocks: Irradiation; DNA damage initiation & repair; ATM: DNA damage detection; P53-MDM2 oscillator; Cell Cycle Arrest and DNA Repair
44
Digital response of tumor suppressor p53 to IR
Model diagram labels: DNA damage; ATM*; p53 DNA; p53 mRNA; TP53 protein; TP53* protein; mdm2 DNA; mdm2 mRNA, Basal (P1) and Induced (P2); (Fast), (Slow); MDM2 protein; two Delay elements
Lahav et al., Nature Genetics 2004
Plot: Response to 5Gy; Molecular intensity (fold) versus Time (min); series: TP53, mdm2, MDM2
Ma, Wagner, Rice, Hu, Levine and Stolovitzky, A plausible model for the digital response of p53 to DNA damage, PNAS (2005).
Wenwei Hu, Zhaohui Feng, Lan Ma, John Wagner, J. Jeremy Rice, Gustavo Stolovitzky, and Arnold J. Levine, A Single Nucleotide Polymorphism in the MDM2 Gene Disrupts the Oscillation of p53 and MDM2 Levels in Cells, Cancer Research (in press), 2007.
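A feedback loop with delays, as the diagram labels suggest, can be explored with a very small delayed negative-feedback simulation. The sketch below is a generic two-variable toy, not the model of Ma et al.: the equations, every rate constant, the delay `tau` and the function name `simulate` are assumptions chosen only to show how a delay is handled in a forward-Euler integration.

```python
def simulate(p0, m0, tau, t_end, dt=0.01,
             beta_p=1.0, alpha=1.0, beta_m=1.0, gamma=1.0):
    """Forward-Euler integration of a toy delayed feedback loop:

        dp/dt = beta_p - alpha * m(t) * p(t)      # p53 made, removed by MDM2
        dm/dt = beta_m * p(t - tau) - gamma * m   # MDM2 induced by earlier p53

    History before t = 0 is held constant at p0. Returns (p, m) lists.
    """
    n_steps = int(round(t_end / dt))
    lag = int(round(tau / dt))            # delay expressed in time steps
    p, m = [p0], [m0]
    for k in range(n_steps):
        p_delayed = p[k - lag] if k >= lag else p0
        dp = beta_p - alpha * m[k] * p[k]
        dm = beta_m * p_delayed - gamma * m[k]
        p.append(p[k] + dt * dp)
        m.append(m[k] + dt * dm)
    return p, m

# Starting away from the steady state (p = m = 1) lets the loop respond
p, m = simulate(p0=2.0, m0=0.0, tau=2.0, t_end=50.0)
print(len(p), p[0], m[0])
```

Varying `tau` in such a toy is a quick way to see how the lag between p53 and MDM2 changes the response; fitting the measured fold changes would require the published model and its parameters.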
45
Blue Brain Project
Ecole Polytechnique Fédérale de Lausanne
- Research collaboration with EPFL to simulate the neocortical column using the Blue Gene supercomputer
- Why
  - Better understand the basis of cognitive function → human nature
  - Better understand neocortical diseases
  - Identify treatment targets and strategies
- Why use computation
  - Enable scientific discovery impossible or difficult with biological experimentation alone
  - Manage complexity
  - Integrate knowledge
  - Inform experimental design and theory
46
The Data
- Unprecedented neocortical microcircuit data comprising
  - Neurons: Morphological type; Electrophysiological type; Ion channel types/distributions; Gene expression
  - Synapses: Neurotransmitter receptors; Number and innervation patterns; Dynamics and plasticity rules
  - Microcircuit structure: Frequency/layout of neuron types; Connectivity patterns/probabilities
47
The Approach
- Maintain tight coupling between simulation and experimental validation
  - Exploit unparalleled neocortical microcircuit data
  - Develop parameterization techniques & models
  - Validate models with in vitro & in vivo experimental data
  - Use experimentation to validate simulated findings
- Employ multiple modeling strategies
  - Up: Neuron → Slice → Column
  - Down: Network → Point neurons → Complex neurons
- Facilitate large-scale neural simulation
  - Optimize simulation layout
  - Develop parameterization and analysis techniques
Henry Markram, Opinion: The Blue Brain Project, Nature Reviews Neuroscience 7:153-160 (Feb 2006)
48
Summary
- Biology has become an Information Science
  - Will continue to present challenges orders of magnitude beyond current capability
  - Information technology is a vital tool enabling discovery
- Data explosion
  - Capacity computing to handle large and heterogeneous data
  - Rapidly evolving algorithms require general and easy to use systems
- Scale of Computing
  - Molecular simulations amongst the most mature to exploit petascale
  - Meaningful biology can exploit orders of magnitude more capability
- Tackle Complexity in Biology
  - Think multiscale – brute force won’t get you far
  - Problems are not compute limited
  - Challenge to find the right abstraction and model that provides insight
49
IBM Thomas J. Watson Research Center, Yorktown Heights, NY
IBM Computational Biology Center www.research.ibm.com/compsci/compbio
50
World Community Grid
- www.worldcommunitygrid.org
- Launched in Nov ’04
- Mobilize the community
- >2400 years of computing in first 50 days
- Technology solving problems
- Advisory Board of experts in health sciences, technology and philanthropy
- RFP for new projects
- FightAIDS@Home launched Nov 21, 2005
- 30,000 additional devices in first 10 days
- Art Olson, Scripps
- AutoDOCK for HIV Protease inhibitors