SLIDE 1 New Architectures for a New Biology
David E. Shaw
- D. E. Shaw Research, LLC and
Center for Computational Biology and Bioinformatics Columbia University
SLIDE 2
*** Background (A Bit of Basic Biochemistry)
SLIDE 3
DNA Codes for Proteins
SLIDE 4
The 20 Amino Acids
SLIDE 5 Polypeptide Chain
Source: www.yourgenome.org
SLIDE 6 Levels of Protein Structure
Source: Robert Melamede, U. Colorado
SLIDE 7
What We Know and What We Don’t
Decoded the genome Don’t know most protein structures – Especially membrane proteins No detailed picture of what most proteins do Don’t know how everything fits together into a working system
SLIDE 8
We Now Have The Parts List ...
SLIDE 9
But We Don’t Know What the Parts Look Like ...
SLIDE 10
Or How They Fit Together ...
SLIDE 11
Or How The Whole Machine Works
SLIDE 12 How Can We Get There?
Two major approaches: Experiments – Wet lab – Hard, since everything is so small Simulation – Simulate:
- How proteins fold (structure, dynamics)
- How proteins interact with
- Other proteins
- Nucleic acids
- Drug molecules
– Gold standard: Molecular dynamics (MD)
SLIDE 13
*** Molecular Dynamics
SLIDE 14
Molecular Dynamics
t
Divide time into discrete time steps ~1 fs time step
SLIDE 15
Molecular Dynamics
Calculate forces Molecular mechanics force field
SLIDE 16
Molecular Dynamics
Move atoms
SLIDE 17
Molecular Dynamics
Move atoms ... a little bit
SLIDE 18
Molecular Dynamics
Iterate Iterate Iterate ... and iterate Iterate ... and iterate Integrate Newton’s laws of motion
SLIDE 19
Example of an MD Simulation
SLIDE 20
Main Problem With MD Too slow!
Example I just showed: 2 ns simulated time 3.4 CPU-days to simulate
SLIDE 21
*** Goals and Strategy
SLIDE 22
Thought Experiment
What if MD were – Perfectly accurate? – Infinitely fast? Would be easy to perform arbitrary computational experiments – Determine structures by watching them form – Figure out what happens by watching it happen – Transform measurement into data mining
SLIDE 23
Two Distinct Problems
Problem 1: Simulate many short trajectories Problem 2: Simulate one long trajectory
SLIDE 24
Simulating Many Short Trajectories
Can answer surprising number of interesting questions Can be done using – Many slow computers – Distributed processing approach – Little inter-processor communication E.g., Pande’s Folding at Home project
SLIDE 25
Simulating One Long Trajectory
Harder problem Essential to elucidate many biologically interesting processes Requires a single machine with – Extremely high performance – Truly massive parallelism – Lots of inter-processor communication
SLIDE 26
Our Goal
Single, millisecond-scale MD simulations – Protein with 64K atoms – Explicit water molecules Why? – That’s the time scale at which many biologically interesting things start to happen
SLIDE 27 Image: Istvan Kolossvary & Annabel Todd,
Protein Folding
SLIDE 28 Interactions Between Proteins
Image: Vijayakumar, et al., J. Mol. Biol. 278, 1015 (1998)
SLIDE 29 Image: Nagar, et al., Cancer Res. 62, 4236 (2002)
Binding of Drugs to their Molecular Targets
SLIDE 30 Image: H. Grubmüller, in Attig, et al. (eds.), Computational Soft Matter (2004)
Mechanisms of Intracellular Machines
SLIDE 31
What Will It Take to Simulate a Millisecond?
We need an enormous increase in speed – Current (single processor): ~ 100 ms / fs – Goal will require < 10 µs / fs Required speedup: > 10,000x faster than current single-processor speed ~ 1,000x faster than current parallel implementations
SLIDE 32 Target Simulation Speed
3.4 days today (one processor) ~ 13 seconds on
(one segment)
SLIDE 33 Molecular Mechanics Force Field
( )
2 bonds 2 angles torsions 12 6
( ) [1 cos( )]
b i j i j i ij ij ij i j i ij ij
E k r r k A n q q r A B r r
θ θ
θ τ ϕ
> >
= − + − + + − + + −
∑ ∑ ∑ ∑∑ ∑∑
Stretch Bend Torsion Electrostatic Van der Waals Non- Bonded Bonded
SLIDE 34 Distance Between Centers of Atoms Potential Energy
r0
Stretch Term
( )
2 bonds b
E k r r = −
∑
SLIDE 35 Distance Between Centers of Atoms
r0
Stretch Term
Potential Energy
( )
2 bonds b
E k r r = −
∑
SLIDE 36 Distance Between Centers of Atoms
r0
Stretch Term
Potential Energy
( )
2 bonds b
E k r r = −
∑
SLIDE 37 Distance Between Centers of Atoms Potential Energy
r0
Stretch Term
( )
2 bonds b
E k r r = −
∑
SLIDE 38 Distance Between Centers of Atoms Potential Energy
r0
Stretch Term
( )
2 bonds b
E k r r = −
∑
SLIDE 39 Distance Between Centers of Atoms Potential Energy
r0
Stretch Term
( )
2 bonds b
E k r r = −
∑
SLIDE 40 Distance Between Centers of Atoms Potential Energy
r0
Stretch Term
( )
2 bonds b
E k r r = −
∑
SLIDE 41 Bond Angle Potential Energy
2 angles
( ) E kθ θ θ = −
∑
θ
Bend Term
θ θ =
SLIDE 42 Bond Angle Potential Energy
θ
Bend Term
θ θ
2 angles
( ) E kθ θ θ = −
∑
SLIDE 43 Bond Angle Potential Energy
θ
Bend Term
θ θ
2 angles
( ) E kθ θ θ = −
∑
SLIDE 44 Bond Angle Potential Energy
θ
Bend Term
θ θ =
2 angles
( ) E kθ θ θ = −
∑
SLIDE 45 Bond Angle Potential Energy
θ
Bend Term
θ θ
2 angles
( ) E kθ θ θ = −
∑
SLIDE 46 Bond Angle Potential Energy
θ
Bend Term
θ θ
2 angles
( ) E kθ θ θ = −
∑
SLIDE 47 Bond Angle Potential Energy
θ
Bend Term
θ θ =
2 angles
( ) E kθ θ θ = −
∑
SLIDE 48 Torsion Term
torsions
[1 cos( )] E A nτ ϕ = + −
∑
4 /3 π 5 /3 π 2π /3 π 2 /3 π π
Oblique View
Potential Energy
Torsion Angle (radians)
Axial View
SLIDE 49 Torsion Term
torsions
[1 cos( )] E A nτ ϕ = + −
∑
4 /3 π 5 /3 π 2π /3 π 2 /3 π π
Oblique View
Potential Energy
Torsion Angle (radians)
Axial View
SLIDE 50 Torsion Term
torsions
[1 cos( )] E A nτ ϕ = + −
∑
4 /3 π 5 /3 π 2π /3 π 2 /3 π π
Oblique View
Potential Energy
Torsion Angle (radians)
Axial View
SLIDE 51 Electrostatic Term
Distance Between Centers of Atoms Potential Energy
i j i j i ij
q q E r
>
=∑∑
+ +
SLIDE 52 Electrostatic Term
Distance Between Centers of Atoms Potential Energy
i j i j i ij
q q E r
>
=∑∑
+ +
SLIDE 53 Electrostatic Term
Distance Between Centers of Atoms Potential Energy
i j i j i ij
q q E r
>
=∑∑
+ +
SLIDE 54 Electrostatic Term
Distance Between Centers of Atoms Potential Energy
i j i j i ij
q q E r
>
=∑∑
+ _ +
SLIDE 55 Electrostatic Term
Distance Between Centers of Atoms Potential Energy
i j i j i ij
q q E r
>
=∑∑
+ _ +
SLIDE 56 Electrostatic Term
Distance Between Centers of Atoms Potential Energy
i j i j i ij
q q E r
>
=∑∑
+ _
SLIDE 57 Attractive (1/r 6 ) Repulsive (1/r 12 ) Combined
Van der Waals Terms
Potential Energy Distance Between Centers of Atoms
12 6
=
ij ij i j i ij ij
A B E r r
>
−
∑∑
SLIDE 58 Attractive (1/r 6 ) Repulsive (1/r 12 ) Combined
Van der Waals Terms
Distance Between Centers of Atoms
12 6
=
ij ij i j i ij ij
A B E r r
>
−
∑∑
Potential Energy
SLIDE 59 Attractive (1/r 6 ) Repulsive (1/r 12 ) Combined
Van der Waals Terms
Distance Between Centers of Atoms
12 6
=
ij ij i j i ij ij
A B E r r
>
−
∑∑
Potential Energy
SLIDE 60 Attractive (1/r 6 ) Repulsive (1/r 12 ) Combined
Van der Waals Terms
Potential Energy Distance Between Centers of Atoms
12 6
=
ij ij i j i ij ij
A B E r r
>
−
∑∑
SLIDE 61 Attractive (1/r 6 ) Repulsive (1/r 12 ) Combined
Van der Waals Terms
Distance Between Centers of Atoms
12 6
=
ij ij i j i ij ij
A B E r r
>
−
∑∑
Potential Energy
SLIDE 62 Attractive (1/r 6 ) Repulsive (1/r 12 ) Combined
Van der Waals Terms
Distance Between Centers of Atoms
12 6
=
ij ij i j i ij ij
A B E r r
>
−
∑∑
Potential Energy
SLIDE 63 Attractive (1/r 6 ) Repulsive (1/r 12 ) Combined
Van der Waals Terms
Potential Energy Distance Between Centers of Atoms
12 6
=
ij ij i j i ij ij
A B E r r
>
−
∑∑
SLIDE 64 Molecular Mechanics Force Field
( )
2 bonds 2 angles torsions 12 6
( ) [1 cos( )]
b i j i j i ij ij ij i j i ij ij
E k r r k A n q q r A B r r
θ θ
θ τ ϕ
> >
= − + − + + − + + −
∑ ∑ ∑ ∑∑ ∑∑
Stretch Bend Torsion Electrostatic Van der Waals Non- Bonded Bonded
SLIDE 65
What Takes So Long?
Inner loop of force field evaluation looks at all pairs of atoms (within distance R) On the order of 64K atoms in typical system Repeat ~1012 times Current approaches too slow by several orders of magnitude What can be done?
SLIDE 66 Our Strategy
New architectures – Designing a specialized machine – Enormously parallel architecture – Based on special-purpose ASICs – Dramatically faster for MD, but less flexible – Projected completion: 2008 New algorithms – Applicable to
- Conventional clusters
- Our own machine
– Scale to very large # of processing elements
SLIDE 67 Interdisciplinary Lab
Computational Chemists and Biologists Computer Scientists and Applied Mathematicians Computer Architects and Engineers
SLIDE 68
*** New Architectures
SLIDE 69
Alternative Machine Architectures
Conventional cluster of commodity processors General-purpose scientific supercomputer Special-purpose molecular dynamics machine
SLIDE 70 Conventional Cluster of Commodity Processors
Strengths: – Flexibility – Mass market economies of scale Limitations – Doesn’t exploit special features of the problem – Communication bottlenecks
- Between processor and memory
- Among processors
– Insufficient arithmetic power
SLIDE 71
Typical Commodity Microprocessor
SLIDE 72
Typical Commodity Microprocessor
SLIDE 73
General-Purpose Scientific Supercomputer
E.g., IBM Blue Gene More demanding goal than ours – General-purpose scientific supercomputing – Fast for wide range of applications Strengths: – Flexibility – Ease of programmability Limitations for MD simulations – Expensive – Still not fast enough for our purposes
SLIDE 74 Our Special-Purpose MD Machine
Strengths: – Several orders of magnitude faster for MD – Excellent cost/performance characteristics Limitations: – Not designed for other scientific applications
- They’d be difficult to program
- Still wouldn’t be especially fast
– Limited flexibility
SLIDE 75 Source of Speedup on Our Machine
Judicious use of arithmetic specialization – Flexibility, programmability only where needed – Elsewhere, hardware tailored for speed
- Tables and parameters, but not programmable
Carefully choreographed communication – Data flows to just where it’s needed – Almost never need to access off-chip memory
SLIDE 76 Two Subsystems on Each ASIC
Specialized Subsystem Flexible Subsystem
general-purpose
- Efficient geometric
- perations
- Pairwise point
interactions
SLIDE 77
Where We Use Specialized Hardware
Specialized hardware (with tables, parameters) where: Inner loop Simple, regular algorithmic structure Unlikely to change Examples: Electrostatic forces Van der Waals interactions (at least attractive term)
SLIDE 78
Example: Particle Interaction Pipeline (one of 32)
SLIDE 79
Array of 32 Particle Interaction Pipelines
SLIDE 80
Advantages of Particle Interaction Pipelines
Save area that would have been allocated to – Cache – Control logic – Wires Achieve extremely high arithmetic density Save time that would have been spent on – Cache misses, – Load/store instructions – Misc. data shuffling
SLIDE 81 Where We Use Flexible Hardware
– Use programmable hardware where:
- Algorithm less regular
- Smaller % of total time
- E.g., local interactions (fewer of them)
- More likely to change
– Examples:
- Bonded interactions
- Bond length constraints
- Experimentation with
- New, short-range force field terms
- Alternative integration techniques
SLIDE 82
Forms of Parallelism in Flexible Subsystem
The Flexible Subsystem exploits three forms of parallelism: – Multi-core parallelism – Instruction-level parallelism – SIMD parallelism
SLIDE 83 Overview of the Flexible Subsystem
GC = Geometry Core (each a VLIW processor)
SLIDE 84 Geometry Core (one of 8; 64 pipelined lanes/chip)
+ X + + + +
Instruction Memory Decode
From Tensilica Core X X X X Y Z W PC + X + + + + X X X X Y Z W
Data Memory
f f f f f f f f
SLIDE 85
System-Level Organization
Multiple segments (probably 8 in first machine) 512 nodes (each with one ASIC) per segment – Organized in an 8 x 8 x 8 toroidal mesh Topology reflects physical space being simulated: – Three-dimensional nearest neighbor connections – Periodic boundary conditions
SLIDE 86
3D Torus Network
SLIDE 87
But Communication is Still a Bottleneck
Scalability limited by inter-chip communication To execute a single millisecond-scale simulation, – Need a huge number of processing elements – Must dramatically reduce amount of data transferred between these processing elements Can’t do this without fundamentally new algorithms
SLIDE 88
*** The NT Algorithm
SLIDE 89 Range-Limited Pairwise Particle Interactions
Efficient methods known for distant interactions
R
Pairwise, non-bonded interactions dominate Range-limited n-body problem
SLIDE 90
New Algorithm
Parallel algorithm for range-limited n-body problem Called the NT (for “Neutral Territory”) Method* Asymptotically less inter-processor communication than traditional spatial decomposition methods Constant factors also very attractive – Significant improvements on typical cluster – Major win on large machines
* Shaw, J. Comp. Chem. 26, Oct. 2005
SLIDE 91
Desirable Properties
Ideally, a parallel algorithm for the range-limited
n-body problem would:
Exploit the range limitation to reduce computational load Scale such that data transfer approaches zero as
p → ∞
SLIDE 92
Asymptotic Comparison With Traditional Spatial Decomposition Methods
Exploitable range limitation Scaling with number of processors Traditional methods O (R 3) neighbors Not scalable NT Method O (R 3/2) neighbors O (P –1/2) scaling
NT Method has both of these properties:
SLIDE 93 Partitioning of Space Into Boxes
Atom A Home box of atom A
SLIDE 94 Two-Dimensional Analog of the NT Method
Traditional Method (2D Analog) NT Method (2D Analog) Green = interaction box; blue = import region
SLIDE 95 How can it be better to meet on neutral territory?
Number of pairwise interactions (~ product of areas) Number of atoms imported (~ sum of areas):
Traditional Method (2D) NT Method (2D)
SLIDE 96
Actual 3D Algorithm
Considerably more complex – Odd number of dimensions introduces complications Can be made to work – Math gets more complicated – Performance advantage just as large Start by describing 3D version of traditional spatial decomposition methods
SLIDE 97
Traditional 3D Spatial Decomposition Methods
SLIDE 98
Traditional Spatial Decomposition Method
Interaction Box and Import Region Green = Interaction box Blue = Import region
SLIDE 99
Site of Interaction, Traditional Method
Interact – One atom from (cubical) interaction box – One atom from either interaction box or import region All interactions occur within home box of one of the two atoms How much inter-processor communication?
SLIDE 100
Import Subregion Face(–x)
SLIDE 101
Import Subregion Edge(–x, +z)
SLIDE 102
Import Subregion Corner(+x, –y, +z)
SLIDE 103
Import region of traditional spatial decomposition method: 3 face subregions 6 edge subregions 4 corner subregions 3Rb2 + 3πR2b/2 + 2πR3/3 where b = side length of (cubical) box In limit as p , import volume approaches 2πR3/3
Import Volume, Traditional Method
SLIDE 104
*** The Three-Dimensional NT Algorithm
SLIDE 105
NT Method
Interaction Box and Import Region Green = Interaction box Blue = Import region
SLIDE 106
The Tower
(outer tower in blue)
SLIDE 107
The Plate
(outer plate in blue)
SLIDE 108 Interact – One atom from tower – One atom from plate Both atoms may have to be imported They meet “on neutral territory”
Site of Interaction, NT Method
Plate Atom Tower Atom
SLIDE 109
Aspect Ratio Optimization in NT
Dimensions of box ⇒ dimensions of tower, plate Volume of box determined by – Size of molecular system – Number of processors Aspect ratio of box is free parameter – x and y dimensions equal; ratio to z can vary – Optimize to minimize communication Optimal aspect ratio depends on number of processors – More processors ⇒ shorter, fatter box (balance)
SLIDE 110 Scaling of the NT Method
64 Processors
Assumes 50,000 atoms, interaction radius = 12A, density = 0.1 atom/A3
SLIDE 111 Scaling of the NT Method
512 Processors
Assumes 50,000 atoms, interaction radius = 12A, density = 0.1 atom/A3
SLIDE 112 Scaling of the NT Method
4K Processors
Assumes 50,000 atoms, interaction radius = 12A, density = 0.1 atom/A3
SLIDE 113 Scaling of the NT Method
32K Processors
Assumes 50,000 atoms, interaction radius = 12A, density = 0.1 atom/A3
SLIDE 114
Import volume: 4 face subregions 2 edge subregions No corner subregions Vi = 2 Rbxy2 + 2 Rbxybz + πR2bz/2 where bxy = x & y dimensions of box bz = z dimension of box Optimize ratio bxy / bz to minimize import volume
NT’s Import Volume With Cubical Box
SLIDE 115
NT: Optimal Aspect Ratio and Import Volume
Results: Optimal bxy = [ c1/2 + (Vb c –1/2 – c)1/2 ] / 2 where c = d/6 – 2πRVb/d d = {27Vb2 – 3[3Vb3((4πR)3 + 27Vb)]1/2}1/3 Vb = box volume To find minimal import volume: – Use optimal bxy to calculate optimal bz – Substitute into equation for Vi
SLIDE 116
NT’s Import Volume With Optimized Box
Limit as p → ∞: Note decrease in exponent Vi = 2π 1/2 R 3/2 Vb 1/2 Vb ~ N / p, where N is # atoms in molecular system So Vi = O (R 3/2 (N/p)1/2)
SLIDE 117
Comparison of Traditional and NT Methods
SLIDE 118
Traditional Method Imports Corner Subregions
SLIDE 119
NT Method Doesn’t Import Any Corner Subregions
SLIDE 120 NT vs. Traditional Method
Traditional spatial decomposition method:
- Transfer time ~ volume of sphere of radius R
(for large p) NT method
- Transfer time ~ square root of that sphere’s
volume Advantage of NT over traditional method grows as number of processors increases
SLIDE 121 Scaling of Traditional vs. NT Method
64 Processors
Assumes 50,000 atoms, interaction radius = 12A, density = 0.1 atom/A3
SLIDE 122 Scaling of Traditional vs. NT Method
512 Processors
Assumes 50,000 atoms, interaction radius = 12A, density = 0.1 atom/A3
SLIDE 123 Scaling of Traditional vs. NT Method
4K Processors
Assumes 50,000 atoms, interaction radius = 12A, density = 0.1 atom/A3
SLIDE 124 Scaling of Traditional vs. NT Method
32K Processors
Assumes 50,000 atoms, interaction radius = 12A, density = 0.1 atom/A3
SLIDE 125 Inter-Processor Transfer Time, Traditional vs. NT
8 64 512 4K 32K 1000 2000 3000 4000 5000 6000 7000 Number of Processors Time
Assumes 50,000 atoms, interaction radius = 12A, density = 0.1 atom/A3 Time unit is time required to import data associated with one atom Traditional Method NT Method
SLIDE 126
An Open Question That Keeps Me Awake at Night
SLIDE 127 Are Force Fields Accurate Enough?
Nobody knows how accurate the force fields that everyone uses actually are – Can’t simulate for long enough to know – If problems surface, we may at least be able to
- Figure out why
- Take steps to fix them
But we already know that fast, single MD simulations will prove sufficient to answer at least some major scientific questions