New Architectures for a New Biology David E. Shaw D. E. Shaw - - PowerPoint PPT Presentation



slide-1
SLIDE 1

New Architectures for a New Biology

David E. Shaw

D. E. Shaw Research, LLC and
Center for Computational Biology and Bioinformatics, Columbia University

slide-2
SLIDE 2

*** Background (A Bit of Basic Biochemistry)

slide-3
SLIDE 3

DNA Codes for Proteins

slide-4
SLIDE 4

The 20 Amino Acids

slide-5
SLIDE 5

Polypeptide Chain

Source: www.yourgenome.org

slide-6
SLIDE 6

Levels of Protein Structure

Source: Robert Melamede, U. Colorado

slide-7
SLIDE 7

What We Know and What We Don’t

Decoded the genome
Don’t know most protein structures
– Especially membrane proteins
No detailed picture of what most proteins do
Don’t know how everything fits together into a working system

slide-8
SLIDE 8

We Now Have The Parts List ...

slide-9
SLIDE 9

But We Don’t Know What the Parts Look Like ...

slide-10
SLIDE 10

Or How They Fit Together ...

slide-11
SLIDE 11

Or How The Whole Machine Works

slide-12
SLIDE 12

How Can We Get There?

Two major approaches:
Experiments
– Wet lab
– Hard, since everything is so small
Simulation
– Simulate:
  • How proteins fold (structure, dynamics)
  • How proteins interact with other proteins, nucleic acids, and drug molecules
– Gold standard: Molecular dynamics (MD)

slide-13
SLIDE 13

*** Molecular Dynamics

slide-14
SLIDE 14

Molecular Dynamics

Divide time (t) into discrete time steps
~1 fs time step

slide-15
SLIDE 15

Molecular Dynamics

Calculate forces Molecular mechanics force field

slide-16
SLIDE 16

Molecular Dynamics

Move atoms

slide-17
SLIDE 17

Molecular Dynamics

Move atoms ... a little bit

slide-18
SLIDE 18

Molecular Dynamics

Iterate ... and iterate
Integrate Newton’s laws of motion
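The loop described on these slides (calculate forces, move atoms a little bit, iterate) can be sketched in a few lines. This is a minimal sketch, not the integrator used in the talk: it assumes the velocity Verlet scheme, a single particle on a hypothetical harmonic spring, and arbitrary units. The names `spring_force` and `velocity_verlet` are illustrative only.

```python
import math

def spring_force(x, k=1.0, x0=0.0):
    """Force from a harmonic spring with energy 0.5 * k * (x - x0)**2."""
    return -k * (x - x0)

def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Advance one particle by n_steps time steps of size dt."""
    f = force(x)                            # calculate forces
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass    # half-step velocity update
        x = x + dt * v_half                 # move the atom a little bit
        f = force(x)                        # recalculate forces at the new position
        v = v_half + 0.5 * dt * f / mass    # complete the velocity update
    return x, v

# Start at rest, displaced by 1.0, and integrate to t = 10 (arbitrary units).
x, v = velocity_verlet(x=1.0, v=0.0, force=spring_force, mass=1.0, dt=0.01, n_steps=1000)
energy = 0.5 * v * v + 0.5 * x * x          # should stay close to the initial 0.5
print(x, math.cos(10.0), energy)
```

For this toy system the exact trajectory is cos(t), so the printed position can be compared with `math.cos(10.0)`. A real MD code applies the same update to every atom, with forces taken from the molecular mechanics force field.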

slide-19
SLIDE 19

Example of an MD Simulation

slide-20
SLIDE 20

Main Problem With MD

Too slow!
Example I just showed:
– 2 ns simulated time
– 3.4 CPU-days to simulate

slide-21
SLIDE 21

*** Goals and Strategy

slide-22
SLIDE 22

Thought Experiment

What if MD were
– Perfectly accurate?
– Infinitely fast?
Would be easy to perform arbitrary computational experiments
– Determine structures by watching them form
– Figure out what happens by watching it happen
– Transform measurement into data mining

slide-23
SLIDE 23

Two Distinct Problems

Problem 1: Simulate many short trajectories
Problem 2: Simulate one long trajectory

slide-24
SLIDE 24

Simulating Many Short Trajectories

Can answer a surprising number of interesting questions
Can be done using
– Many slow computers
– Distributed processing approach
– Little inter-processor communication
E.g., Pande’s Folding at Home project

slide-25
SLIDE 25

Simulating One Long Trajectory

Harder problem
Essential to elucidate many biologically interesting processes
Requires a single machine with
– Extremely high performance
– Truly massive parallelism
– Lots of inter-processor communication

slide-26
SLIDE 26

Our Goal

Single, millisecond-scale MD simulations
– Protein with 64K atoms
– Explicit water molecules
Why?
– That’s the time scale at which many biologically interesting things start to happen

slide-27
SLIDE 27

Protein Folding

Image: Istvan Kolossvary & Annabel Todd, D. E. Shaw Research

slide-28
SLIDE 28

Interactions Between Proteins

Image: Vijayakumar, et al., J. Mol. Biol. 278, 1015 (1998)

slide-29
SLIDE 29

Image: Nagar, et al., Cancer Res. 62, 4236 (2002)

Binding of Drugs to their Molecular Targets

slide-30
SLIDE 30

Image: H. Grubmüller, in Attig, et al. (eds.), Computational Soft Matter (2004)

Mechanisms of Intracellular Machines

slide-31
SLIDE 31

What Will It Take to Simulate a Millisecond?

We need an enormous increase in speed
– Current (single processor): ~100 ms / fs
– Goal will require < 10 µs / fs
Required speedup:
– > 10,000x faster than current single-processor speed
– ~1,000x faster than current parallel implementations
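The speedup figure follows directly from the two rates quoted on the slide. The short calculation below is only a worked check of that arithmetic. It assumes one time step per simulated femtosecond, as in the ~1 fs time step mentioned earlier, and the names are illustrative.

```python
FS_PER_MS = 1e12            # femtoseconds in one millisecond
SECONDS_PER_DAY = 86_400

def wall_clock_days(simulated_fs, seconds_per_fs):
    """Wall-clock days needed to simulate `simulated_fs` femtoseconds."""
    return simulated_fs * seconds_per_fs / SECONDS_PER_DAY

current_rate = 100e-3       # ~100 ms of computing per simulated fs (single processor)
goal_rate = 10e-6           # < 10 microseconds per simulated fs (goal)

speedup = current_rate / goal_rate
days_now = wall_clock_days(FS_PER_MS, current_rate)
days_goal = wall_clock_days(FS_PER_MS, goal_rate)

print(f"required speedup: {speedup:,.0f}x")
print(f"1 ms at current rate: {days_now / 365:,.0f} years")
print(f"1 ms at goal rate: {days_goal:.0f} days")
```

At the current single-processor rate a millisecond of simulated time would take thousands of years. At the goal rate it takes on the order of a hundred days.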

slide-32
SLIDE 32

Target Simulation Speed

3.4 days today (one processor)
~13 seconds on our machine (one segment)

slide-33
SLIDE 33

Molecular Mechanics Force Field

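$$E = \sum_{\text{bonds}} k_b (r - r_0)^2 + \sum_{\text{angles}} k_\theta (\theta - \theta_0)^2 + \sum_{\text{torsions}} A\,[1 + \cos(n\tau - \phi)] + \sum_i \sum_{j>i} \frac{q_i q_j}{r_{ij}} + \sum_i \sum_{j>i} \left( \frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}} \right)$$

Bonded terms: Stretch, Bend, Torsion
Non-bonded terms: Electrostatic, Van der Waals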

slide-34
SLIDE 34

Stretch Term

$E = \sum_{\text{bonds}} k_b (r - r_0)^2$

Plot: Potential Energy vs. Distance Between Centers of Atoms (r0 marked on the distance axis)
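The stretch term is a harmonic penalty on the bond length, so the restoring force is its negative derivative. A minimal sketch for a single bond follows. The force constant and rest length are hypothetical values chosen for illustration, not parameters from any real force field.

```python
def stretch_energy(r, k_b, r0):
    """Stretch energy of one bond: k_b * (r - r0)**2."""
    return k_b * (r - r0) ** 2

def stretch_force(r, k_b, r0):
    """Force along the bond, F = -dE/dr; it pulls r back toward r0."""
    return -2.0 * k_b * (r - r0)

K_B, R0 = 300.0, 1.5        # hypothetical force constant and rest length
for r in (1.4, 1.5, 1.6):
    print(r, stretch_energy(r, K_B, R0), stretch_force(r, K_B, R0))
```

The energy is zero at r = r0 and the force changes sign there, which is what the animated plot on these slides illustrates.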

slide-41
SLIDE 41

Bend Term

$E = \sum_{\text{angles}} k_\theta (\theta - \theta_0)^2$

Plot: Potential Energy vs. Bond Angle θ (θ = θ0 marked)
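Evaluating the bend term requires the angle formed by three bonded atoms. The sketch below computes that angle from Cartesian coordinates and then applies the quadratic penalty. The coordinates and the constants `K_THETA` and `THETA0` are hypothetical.

```python
import math

def bond_angle(a, b, c):
    """Angle at atom b (in radians) formed by atoms a-b-c, each an (x, y, z) tuple."""
    u = [a[i] - b[i] for i in range(3)]
    w = [c[i] - b[i] for i in range(3)]
    dot = sum(ui * wi for ui, wi in zip(u, w))
    norm = math.sqrt(sum(ui * ui for ui in u)) * math.sqrt(sum(wi * wi for wi in w))
    cos_theta = max(-1.0, min(1.0, dot / norm))   # guard against rounding outside [-1, 1]
    return math.acos(cos_theta)

def bend_energy(theta, k_theta, theta0):
    """Bend energy of one angle: k_theta * (theta - theta0)**2."""
    return k_theta * (theta - theta0) ** 2

K_THETA, THETA0 = 50.0, math.radians(109.5)       # hypothetical parameters
theta = bond_angle((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0))
print(math.degrees(theta), bend_energy(theta, K_THETA, THETA0))
```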

slide-48
SLIDE 48

Torsion Term

$E = \sum_{\text{torsions}} A\,[1 + \cos(n\tau - \phi)]$

Plot: Potential Energy vs. Torsion Angle (radians), with ticks at π/3, 2π/3, π, 4π/3, 5π/3, 2π
Molecule shown in Oblique View and Axial View
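The torsion term is periodic in the torsion angle, and the multiplicity n sets how many minima occur per full rotation. The sketch below evaluates a single torsion. The values A = 2, n = 3 and φ = 0 are hypothetical and are used only to show the period of 2π/3 that results from n = 3.

```python
import math

def torsion_energy(tau, A, n, phi):
    """Torsion energy of one dihedral: A * [1 + cos(n * tau - phi)]."""
    return A * (1.0 + math.cos(n * tau - phi))

A, N, PHI = 2.0, 3, 0.0     # hypothetical amplitude, multiplicity, and phase
for k in range(7):
    tau = k * math.pi / 3
    print(f"tau = {k}*pi/3  E = {torsion_energy(tau, A, N, PHI):.3f}")
```

With these values the energy alternates between its maximum (2A) and its minimum (0) every π/3 radians.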

slide-51
SLIDE 51

Electrostatic Term

$E = \sum_i \sum_{j>i} \frac{q_i q_j}{r_{ij}}$

Plot: Potential Energy vs. Distance Between Centers of Atoms, shown for like charges (+ +) and opposite charges (+ −)
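The electrostatic term is a double sum over distinct pairs of atoms. A direct sketch of that sum is shown below. As in the slide's formula, the Coulomb constant is omitted, so the units are arbitrary, and the charges and positions are made up for illustration.

```python
import math
from itertools import combinations

def electrostatic_energy(charges, positions):
    """Sum of q_i * q_j / r_ij over all pairs with j > i."""
    total = 0.0
    for i, j in combinations(range(len(charges)), 2):
        r_ij = math.dist(positions[i], positions[j])
        total += charges[i] * charges[j] / r_ij
    return total

like = electrostatic_energy([1, 1], [(0, 0, 0), (2, 0, 0)])        # like charges repel: E > 0
opposite = electrostatic_energy([1, -1], [(0, 0, 0), (2, 0, 0)])   # opposite charges attract: E < 0
chain = electrostatic_energy([1, -1, 1], [(0, 0, 0), (1, 0, 0), (2, 0, 0)])
print(like, opposite, chain)
```

This all-pairs loop costs O(N^2) operations, which is one reason practical codes limit the explicit pairwise sum to a cutoff radius.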

slide-57
SLIDE 57

Van der Waals Terms

$E = \sum_i \sum_{j>i} \left( \frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}} \right)$

Plot: Potential Energy vs. Distance Between Centers of Atoms
Curves: Attractive (1/r^6), Repulsive (1/r^12), Combined
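The combined curve has a single minimum where the repulsive and attractive contributions balance. Setting dE/dr = 0 for one pair gives r_min = (2A/B)^(1/6) and E_min = −B²/(4A). The sketch below checks this numerically for hypothetical coefficients A = 1 and B = 2.

```python
def van_der_waals_energy(r, A, B):
    """Pair energy A / r**12 - B / r**6 (repulsive plus attractive parts)."""
    repulsive = A / r**12
    attractive = -B / r**6
    return repulsive + attractive

def energy_minimum(A, B):
    """Location and depth of the minimum, from dE/dr = 0."""
    r_min = (2.0 * A / B) ** (1.0 / 6.0)
    e_min = -B * B / (4.0 * A)
    return r_min, e_min

A, B = 1.0, 2.0             # hypothetical coefficients
r_min, e_min = energy_minimum(A, B)
for r in (0.95 * r_min, r_min, 1.05 * r_min):
    print(round(r, 3), round(van_der_waals_energy(r, A, B), 4))
```

Closer than r_min the 1/r^12 repulsion dominates. Farther away the 1/r^6 attraction dominates and the energy rises slowly back toward zero.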

slide-64
SLIDE 64

Molecular Mechanics Force Field

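$$E = \sum_{\text{bonds}} k_b (r - r_0)^2 + \sum_{\text{angles}} k_\theta (\theta - \theta_0)^2 + \sum_{\text{torsions}} A\,[1 + \cos(n\tau - \phi)] + \sum_i \sum_{j>i} \frac{q_i q_j}{r_{ij}} + \sum_i \sum_{j>i} \left( \frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}} \right)$$

Bonded terms: Stretch, Bend, Torsion
Non-bonded terms: Electrostatic, Van der Waals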

slide-65
SLIDE 65

What Takes So Long?

Inner loop of force field evaluation looks at all pairs of atoms (within distance R)
On the order of 64K atoms in typical system
Repeat ~10^12 times
Current approaches too slow by several orders of magnitude
What can be done?
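A rough count of the work in that inner loop shows why it dominates. The estimate below is a back-of-the-envelope sketch. It borrows the interaction radius (12 Å) and density (0.1 atom/Å^3) quoted on the scaling slides later in the deck, and applying them to a 64K-atom system is an assumption made here for illustration.

```python
import math

N_ATOMS = 64 * 1024         # "on the order of 64K atoms"
N_STEPS = 1e12              # ~10^12 time steps (1 ms at ~1 fs per step)
R = 12.0                    # interaction radius in angstroms (assumed)
DENSITY = 0.1               # atoms per cubic angstrom (assumed)

# Atoms inside a sphere of radius R around any given atom.
neighbors_per_atom = DENSITY * (4.0 / 3.0) * math.pi * R**3
# Each pair is shared by two atoms, so divide by two.
pairs_per_step = N_ATOMS * neighbors_per_atom / 2
total_pair_evaluations = pairs_per_step * N_STEPS

print(f"neighbors per atom:  {neighbors_per_atom:.0f}")
print(f"pairs per time step: {pairs_per_step:.2e}")
print(f"pair evaluations:    {total_pair_evaluations:.2e}")
```

Under these assumptions each time step involves tens of millions of pair interactions, and the full simulation repeats that about 10^12 times.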

slide-66
SLIDE 66

Our Strategy

New architectures
– Designing a specialized machine
– Enormously parallel architecture
– Based on special-purpose ASICs
– Dramatically faster for MD, but less flexible
– Projected completion: 2008
New algorithms
– Applicable to
  • Conventional clusters
  • Our own machine
– Scale to very large # of processing elements

slide-67
SLIDE 67

Interdisciplinary Lab

Computational Chemists and Biologists
Computer Scientists and Applied Mathematicians
Computer Architects and Engineers

slide-68
SLIDE 68

*** New Architectures

slide-69
SLIDE 69

Alternative Machine Architectures

Conventional cluster of commodity processors
General-purpose scientific supercomputer
Special-purpose molecular dynamics machine

slide-70
SLIDE 70

Conventional Cluster of Commodity Processors

Strengths:
– Flexibility
– Mass market economies of scale
Limitations:
– Doesn’t exploit special features of the problem
– Communication bottlenecks
  • Between processor and memory
  • Among processors
– Insufficient arithmetic power

slide-71
SLIDE 71

Typical Commodity Microprocessor

slide-72
SLIDE 72

Typical Commodity Microprocessor

slide-73
SLIDE 73

General-Purpose Scientific Supercomputer

E.g., IBM Blue Gene
More demanding goal than ours
– General-purpose scientific supercomputing
– Fast for wide range of applications
Strengths:
– Flexibility
– Ease of programmability
Limitations for MD simulations:
– Expensive
– Still not fast enough for our purposes

slide-74
SLIDE 74

Our Special-Purpose MD Machine

Strengths:
– Several orders of magnitude faster for MD
– Excellent cost/performance characteristics
Limitations:
– Not designed for other scientific applications
  • They’d be difficult to program
  • Still wouldn’t be especially fast
– Limited flexibility

slide-75
SLIDE 75

Source of Speedup on Our Machine

Judicious use of arithmetic specialization
– Flexibility, programmability only where needed
– Elsewhere, hardware tailored for speed
  • Tables and parameters, but not programmable
Carefully choreographed communication
– Data flows to just where it’s needed
– Almost never need to access off-chip memory

slide-76
SLIDE 76

Two Subsystems on Each ASIC

Specialized Subsystem
  • Pairwise point interactions
  • Enormously parallel

Flexible Subsystem
  • Programmable, general-purpose
  • Efficient geometric operations
slide-77
SLIDE 77

Where We Use Specialized Hardware

Specialized hardware (with tables, parameters) where:
– Inner loop
– Simple, regular algorithmic structure
– Unlikely to change
Examples:
– Electrostatic forces
– Van der Waals interactions (at least attractive term)

slide-78
SLIDE 78

Example: Particle Interaction Pipeline (one of 32)

slide-79
SLIDE 79

Array of 32 Particle Interaction Pipelines

slide-80
SLIDE 80

Advantages of Particle Interaction Pipelines

Save area that would have been allocated to
– Cache
– Control logic
– Wires
Achieve extremely high arithmetic density
Save time that would have been spent on
– Cache misses
– Load/store instructions
– Misc. data shuffling

slide-81
SLIDE 81

Where We Use Flexible Hardware

– Use programmable hardware where:

  • Algorithm less regular
  • Smaller % of total time (e.g., local interactions, of which there are fewer)
  • More likely to change

– Examples:

  • Bonded interactions
  • Bond length constraints
  • Experimentation with new, short-range force field terms
  • Experimentation with alternative integration techniques
slide-82
SLIDE 82

Forms of Parallelism in Flexible Subsystem

The Flexible Subsystem exploits three forms of parallelism:
– Multi-core parallelism
– Instruction-level parallelism
– SIMD parallelism

slide-83
SLIDE 83

Overview of the Flexible Subsystem

GC = Geometry Core (each a VLIW processor)

slide-84
SLIDE 84

Geometry Core (one of 8; 64 pipelined lanes/chip)

Figure: block diagram of a Geometry Core. Recoverable labels: From Tensilica Core, PC, Instruction Memory, Decode, Data Memory, X Y Z W, and arrays of + and X arithmetic units.

slide-85
SLIDE 85

System-Level Organization

Multiple segments (probably 8 in first machine)
512 nodes (each with one ASIC) per segment
– Organized in an 8 x 8 x 8 toroidal mesh
Topology reflects physical space being simulated:
– Three-dimensional nearest neighbor connections
– Periodic boundary conditions
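In a toroidal mesh every node has six nearest neighbors, and the links wrap around at the edges just as periodic boundary conditions wrap the simulated volume. The sketch below shows that neighbor computation for an 8 x 8 x 8 segment. The coordinate scheme and the function name are illustrative assumptions, not a description of the machine's actual addressing.

```python
def torus_neighbors(node, dims=(8, 8, 8)):
    """Six nearest neighbors of `node` = (x, y, z) in a 3D torus of size `dims`."""
    x, y, z = node
    nx, ny, nz = dims
    return [
        ((x - 1) % nx, y, z), ((x + 1) % nx, y, z),   # -x and +x links
        (x, (y - 1) % ny, z), (x, (y + 1) % ny, z),   # -y and +y links
        (x, y, (z - 1) % nz), (x, y, (z + 1) % nz),   # -z and +z links
    ]

node_count = 8 * 8 * 8                   # 512 nodes per segment
neighbors = torus_neighbors((0, 0, 7))   # a node on the boundary of the mesh
print(node_count, neighbors)
```

The modulo arithmetic provides the wraparound: the node at z = 7 is directly linked to the node at z = 0.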

slide-86
SLIDE 86

3D Torus Network

slide-87
SLIDE 87

But Communication is Still a Bottleneck

Scalability limited by inter-chip communication
To execute a single millisecond-scale simulation,
– Need a huge number of processing elements
– Must dramatically reduce amount of data transferred between these processing elements
Can’t do this without fundamentally new algorithms

slide-88
SLIDE 88

*** The NT Algorithm

slide-89
SLIDE 89

Range-Limited Pairwise Particle Interactions

Pairwise, non-bonded interactions dominate
Efficient methods known for distant interactions
Range-limited n-body problem
Figure label: R (interaction radius)

slide-90
SLIDE 90

New Algorithm

Parallel algorithm for range-limited n-body problem
Called the NT (for “Neutral Territory”) Method*
Asymptotically less inter-processor communication than traditional spatial decomposition methods
Constant factors also very attractive
– Significant improvements on typical cluster
– Major win on large machines

* Shaw, J. Comp. Chem. 26, Oct. 2005

slide-91
SLIDE 91

Desirable Properties

Ideally, a parallel algorithm for the range-limited n-body problem would:
– Exploit the range limitation to reduce computational load
– Scale such that data transfer approaches zero as p → ∞

slide-92
SLIDE 92

Asymptotic Comparison With Traditional Spatial Decomposition Methods

NT Method has both of these properties:

                      Exploitable range limitation   Scaling with number of processors
Traditional methods   O(R^3) neighbors               Not scalable
NT Method             O(R^(3/2)) neighbors           O(p^(-1/2)) scaling

slide-93
SLIDE 93

Partitioning of Space Into Boxes

Atom A Home box of atom A

slide-94
SLIDE 94

Two-Dimensional Analog of the NT Method

Panels: Traditional Method (2D Analog), NT Method (2D Analog)
Green = interaction box; blue = import region

slide-95
SLIDE 95

How can it be better to meet on neutral territory?

Number of pairwise interactions (~ product of areas)
Number of atoms imported (~ sum of areas)
Panels: Traditional Method (2D), NT Method (2D)

slide-96
SLIDE 96

Actual 3D Algorithm

Considerably more complex
– Odd number of dimensions introduces complications
Can be made to work
– Math gets more complicated
– Performance advantage just as large
Start by describing 3D version of traditional spatial decomposition methods

slide-97
SLIDE 97

Traditional 3D Spatial Decomposition Methods

slide-98
SLIDE 98

Traditional Spatial Decomposition Method

Interaction Box and Import Region
Green = Interaction box
Blue = Import region

slide-99
SLIDE 99

Site of Interaction, Traditional Method

Interact
– One atom from (cubical) interaction box
– One atom from either interaction box or import region
All interactions occur within home box of one of the two atoms
How much inter-processor communication?

slide-100
SLIDE 100

Import Subregion Face(–x)

slide-101
SLIDE 101

Import Subregion Edge(–x, +z)

slide-102
SLIDE 102

Import Subregion Corner(+x, –y, +z)

slide-103
SLIDE 103

Import Volume, Traditional Method

Import region of traditional spatial decomposition method:
– 3 face subregions
– 6 edge subregions
– 4 corner subregions
Import volume: $3Rb^2 + 3\pi R^2 b/2 + 2\pi R^3/3$, where b = side length of (cubical) box
In limit as p → ∞, import volume approaches $2\pi R^3/3$
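The import volume formula can be rebuilt from the subregion counts. The sketch below uses one geometric reading that reproduces the slide's expression term by term: each face subregion is a slab of thickness R, each edge subregion a quarter cylinder of radius R, and each corner subregion an eighth of a sphere of radius R. That reading, and the function name, are assumptions made for illustration.

```python
import math

def traditional_import_volume(R, b):
    """Import volume for a cubical box of side b and interaction radius R."""
    faces = 3 * (R * b**2)                   # 3 slabs of thickness R
    edges = 6 * (math.pi * R**2 / 4) * b     # 6 quarter cylinders of length b
    corners = 4 * (math.pi * R**3 / 6)       # 4 eighth-spheres of radius R
    return faces + edges + corners

R = 12.0
for b in (40.0, 10.0, 1.0, 0.0):
    print(b, round(traditional_import_volume(R, b), 1))
print("limit:", round(2 * math.pi * R**3 / 3, 1))
```

As the boxes shrink (more processors), the face and edge terms vanish but the corner term does not, so the import volume never drops below 2πR^3/3.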

slide-104
SLIDE 104

*** The Three-Dimensional NT Algorithm

slide-105
SLIDE 105

NT Method

Interaction Box and Import Region
Green = Interaction box
Blue = Import region

slide-106
SLIDE 106

The Tower

(outer tower in blue)

slide-107
SLIDE 107

The Plate

(outer plate in blue)

slide-108
SLIDE 108

Site of Interaction, NT Method

Interact
– One atom from tower
– One atom from plate
Both atoms may have to be imported
They meet “on neutral territory”
Figure labels: Plate Atom, Tower Atom

slide-109
SLIDE 109

Aspect Ratio Optimization in NT

Dimensions of box ⇒ dimensions of tower, plate
Volume of box determined by
– Size of molecular system
– Number of processors
Aspect ratio of box is free parameter
– x and y dimensions equal; ratio to z can vary
– Optimize to minimize communication
Optimal aspect ratio depends on number of processors
– More processors ⇒ shorter, fatter box (balance)

slide-110
SLIDE 110

Scaling of the NT Method

64 Processors

Assumes 50,000 atoms, interaction radius = 12 Å, density = 0.1 atom/Å^3

slide-111
SLIDE 111

Scaling of the NT Method

512 Processors

Assumes 50,000 atoms, interaction radius = 12 Å, density = 0.1 atom/Å^3

slide-112
SLIDE 112

Scaling of the NT Method

4K Processors

Assumes 50,000 atoms, interaction radius = 12 Å, density = 0.1 atom/Å^3

slide-113
SLIDE 113

Scaling of the NT Method

32K Processors

Assumes 50,000 atoms, interaction radius = 12 Å, density = 0.1 atom/Å^3

slide-114
SLIDE 114

NT’s Import Volume With Cubical Box

Import volume:
– 4 face subregions
– 2 edge subregions
– No corner subregions
$V_i = 2Rb_{xy}^2 + 2Rb_{xy}b_z + \pi R^2 b_z/2$
where $b_{xy}$ = x & y dimensions of box, $b_z$ = z dimension of box
Optimize ratio $b_{xy}/b_z$ to minimize import volume
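The NT import volume can be assembled from its subregions in the same way. The sketch below assumes the four face subregions are slabs of thickness R (two on the square faces of the box, two on rectangular side faces) and the two edge subregions are quarter cylinders of length b_z. This reading reproduces the slide's formula but is an interpretation, and the 512-processor box size is only an example.

```python
import math

def nt_import_volume(R, b_xy, b_z):
    """NT import volume for a box of size b_xy x b_xy x b_z."""
    faces = 2 * R * b_xy**2 + 2 * R * b_xy * b_z    # four slab-shaped face subregions
    edges = 2 * (math.pi * R**2 / 4) * b_z          # two quarter-cylinder edge subregions
    return faces + edges                            # no corner subregions

# Cubical box for 50,000 atoms at 0.1 atom per cubic angstrom on 512 processors.
R = 12.0
box_volume = 50_000 / 0.1 / 512
b = box_volume ** (1.0 / 3.0)
print(round(b, 2), round(nt_import_volume(R, b, b), 1))
```

Because there is no term independent of the box dimensions, this volume keeps shrinking as the boxes get smaller.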

slide-115
SLIDE 115

NT: Optimal Aspect Ratio and Import Volume

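Results:
Optimal $b_{xy} = \left[ c^{1/2} + \left( V_b\, c^{-1/2} - c \right)^{1/2} \right] / 2$
where
$c = d/6 - 2\pi R V_b / d$
$d = \left\{ 27 V_b^2 - 3 \left[ 3 V_b^3 \left( (4\pi R)^3 + 27 V_b \right) \right]^{1/2} \right\}^{1/3}$
$V_b$ = box volume
To find minimal import volume:
– Use optimal $b_{xy}$ to calculate optimal $b_z$
– Substitute into equation for $V_i$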

slide-116
SLIDE 116

NT’s Import Volume With Optimized Box

Limit as p → ∞:
$V_i = 2\pi^{1/2} R^{3/2} V_b^{1/2}$ (note decrease in exponent)
$V_b \sim N/p$, where N is # atoms in molecular system
So $V_i = O(R^{3/2} (N/p)^{1/2})$
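The optimization can be checked numerically. With the box volume fixed at V_b = b_xy² b_z, the import volume V_i = 2R b_xy² + 2R b_xy b_z + πR² b_z/2 becomes a function of b_xy alone. Setting its derivative to zero gives the quartic 4 b_xy⁴ − 2 V_b b_xy − πR V_b = 0. The sketch below solves that quartic by bisection and compares the result with the closed-form expression for the optimal b_xy and with the large-p limit. The function names are illustrative. The atom count, radius and density are the ones quoted on the scaling slides.

```python
import math

def nt_import_volume(R, b_xy, b_z):
    return 2 * R * b_xy**2 + 2 * R * b_xy * b_z + math.pi * R**2 * b_z / 2

def optimal_box(R, V_b, iterations=200):
    """Return (b_xy, b_z) minimizing the import volume at fixed box volume V_b."""
    def slope(b):                 # sign of dV_i/db_xy after substituting b_z = V_b / b**2
        return 4 * b**4 - 2 * V_b * b - math.pi * R * V_b
    lo, hi = 0.0, 1.0
    while slope(hi) < 0:          # grow the bracket until it contains the root
        hi *= 2.0
    for _ in range(iterations):   # bisection on the single positive root
        mid = 0.5 * (lo + hi)
        if slope(mid) < 0:
            lo = mid
        else:
            hi = mid
    b_xy = 0.5 * (lo + hi)
    return b_xy, V_b / b_xy**2

def closed_form_b_xy(R, V_b):
    """Closed-form optimal b_xy, using a real cube root for d."""
    inner = 3 * V_b**3 * ((4 * math.pi * R)**3 + 27 * V_b)
    d_cubed = 27 * V_b**2 - 3 * math.sqrt(inner)
    d = math.copysign(abs(d_cubed) ** (1.0 / 3.0), d_cubed)
    c = d / 6 - 2 * math.pi * R * V_b / d
    return (math.sqrt(c) + math.sqrt(V_b / math.sqrt(c) - c)) / 2

def asymptotic_import_volume(R, V_b):
    return 2 * math.sqrt(math.pi) * R**1.5 * math.sqrt(V_b)

R, N_ATOMS, DENSITY = 12.0, 50_000, 0.1
for p in (64, 512, 4096, 32768):
    V_b = N_ATOMS / DENSITY / p
    b_xy, b_z = optimal_box(R, V_b)
    v_opt = nt_import_volume(R, b_xy, b_z)
    print(p, round(b_xy / b_z, 2), round(v_opt), round(asymptotic_import_volume(R, V_b)))
```

The printed aspect ratio b_xy / b_z grows with the processor count, matching the "shorter, fatter box" observation. The optimized import volume approaches the asymptotic expression from above as the boxes shrink.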

slide-117
SLIDE 117

Comparison of Traditional and NT Methods

slide-118
SLIDE 118

Traditional Method Imports Corner Subregions

slide-119
SLIDE 119

NT Method Doesn’t Import Any Corner Subregions

slide-120
SLIDE 120

NT vs. Traditional Method

Traditional spatial decomposition method:
  • Transfer time ~ volume of sphere of radius R (for large p)
NT method:
  • Transfer time ~ square root of that sphere’s volume
Advantage of NT over traditional method grows as number of processors increases

slide-121
SLIDE 121

Scaling of Traditional vs. NT Method

64 Processors

Assumes 50,000 atoms, interaction radius = 12 Å, density = 0.1 atom/Å^3

slide-122
SLIDE 122

Scaling of Traditional vs. NT Method

512 Processors

Assumes 50,000 atoms, interaction radius = 12 Å, density = 0.1 atom/Å^3

slide-123
SLIDE 123

Scaling of Traditional vs. NT Method

4K Processors

Assumes 50,000 atoms, interaction radius = 12 Å, density = 0.1 atom/Å^3

slide-124
SLIDE 124

Scaling of Traditional vs. NT Method

32K Processors

Assumes 50,000 atoms, interaction radius = 12 Å, density = 0.1 atom/Å^3

slide-125
SLIDE 125

Inter-Processor Transfer Time, Traditional vs. NT

Plot: Time (0 to 7000) vs. Number of Processors (8, 64, 512, 4K, 32K), with curves for Traditional Method and NT Method
Assumes 50,000 atoms, interaction radius = 12 Å, density = 0.1 atom/Å^3
Time unit is time required to import data associated with one atom

slide-126
SLIDE 126

An Open Question That Keeps Me Awake at Night

slide-127
SLIDE 127

Are Force Fields Accurate Enough?

Nobody knows how accurate the force fields that everyone uses actually are
– Can’t simulate for long enough to know
– If problems surface, we may at least be able to
  • Figure out why
  • Take steps to fix them
But we already know that fast, single MD simulations will prove sufficient to answer at least some major scientific questions