New Architectures for a New Biology


New Architectures for a New Biology
David E. Shaw
D. E. Shaw Research, LLC and Center for Computational Biology and Bioinformatics, Columbia University

*** Background (A Bit of Basic Biochemistry)
DNA Codes for Proteins
The 20 Amino Acids


  1. Our Strategy
     - New architectures
       - Designing a specialized machine
       - Enormously parallel architecture
       - Based on special-purpose ASICs
       - Dramatically faster for MD, but less flexible
       - Projected completion: 2008
     - New algorithms
       - Applicable to
         - Conventional clusters
         - Our own machine
       - Scale to a very large number of processing elements

  2. Interdisciplinary Lab
     - Computational Chemists and Biologists
     - Computer Scientists and Applied Mathematicians
     - Computer Architects and Engineers

  3. *** New Architectures

  4. Alternative Machine Architectures
     - Conventional cluster of commodity processors
     - General-purpose scientific supercomputer
     - Special-purpose molecular dynamics machine

  5. Conventional Cluster of Commodity Processors
     - Strengths:
       - Flexibility
       - Mass-market economies of scale
     - Limitations:
       - Doesn’t exploit special features of the problem
       - Communication bottlenecks
         - Between processor and memory
         - Among processors
       - Insufficient arithmetic power

  6. Typical Commodity Microprocessor

  7. Typical Commodity Microprocessor

  8. General-Purpose Scientific Supercomputer
     - E.g., IBM Blue Gene
     - More demanding goal than ours
       - General-purpose scientific supercomputing
       - Fast for a wide range of applications
     - Strengths:
       - Flexibility
       - Ease of programmability
     - Limitations for MD simulations:
       - Expensive
       - Still not fast enough for our purposes

  9. Our Special-Purpose MD Machine
     - Strengths:
       - Several orders of magnitude faster for MD
       - Excellent cost/performance characteristics
     - Limitations:
       - Not designed for other scientific applications
         - They’d be difficult to program
         - Still wouldn’t be especially fast
       - Limited flexibility

  10. Source of Speedup on Our Machine
      - Judicious use of arithmetic specialization
        - Flexibility, programmability only where needed
        - Elsewhere, hardware tailored for speed
          - Tables and parameters, but not programmable
      - Carefully choreographed communication
        - Data flows to just where it’s needed
        - Almost never need to access off-chip memory

  11. Two Subsystems on Each ASIC
      - Flexible Subsystem
        - Programmable, general-purpose
        - Efficient geometric operations
      - Specialized Subsystem
        - Pairwise point interactions
        - Enormously parallel

  12. Where We Use Specialized Hardware
      - Specialized hardware (with tables, parameters) where:
        - Inner loop
        - Simple, regular algorithmic structure
        - Unlikely to change
      - Examples:
        - Electrostatic forces
        - Van der Waals interactions (at least attractive term)
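
To make the inner loop concrete, here is a minimal Python sketch of a range-limited pairwise energy that combines an electrostatic term with a van der Waals term. It is not the machine's pipeline: the slides describe table-driven hardware, whereas this sketch evaluates the formulas directly. The function name, the Lennard-Jones 12-6 form, the cutoff handling, and the Coulomb constant (in kcal·Å/(mol·e²)) are assumptions made for illustration.

```python
def pair_energy(r, qi, qj, epsilon, sigma, cutoff, coulomb_k=332.0637):
    """Nonbonded energy of one atom pair at distance r (hypothetical helper).

    Assumes a plain Coulomb term plus a Lennard-Jones 12-6 term, both
    truncated at `cutoff`; real force fields and the hardware differ in detail.
    """
    if r >= cutoff:
        return 0.0  # range-limited: pairs beyond the cutoff are skipped
    electrostatic = coulomb_k * qi * qj / r
    sr6 = (sigma / r) ** 6
    van_der_waals = 4.0 * epsilon * (sr6 * sr6 - sr6)  # repulsive minus attractive
    return electrostatic + van_der_waals


if __name__ == "__main__":
    # Two oppositely charged atoms 3 angstroms apart, 10 angstrom cutoff.
    print(pair_energy(3.0, 0.4, -0.4, epsilon=0.1, sigma=3.2, cutoff=10.0))
```

The regular structure is visible here: the same few arithmetic operations are applied to every pair, with only the parameters changing, which is the kind of computation the slide assigns to fixed-function hardware.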

  13. Example: Particle Interaction Pipeline (one of 32)

  14. Array of 32 Particle Interaction Pipelines

  15. Advantages of Particle Interaction Pipelines
      - Save area that would have been allocated to
        - Cache
        - Control logic
        - Wires
      - Achieve extremely high arithmetic density
      - Save time that would have been spent on
        - Cache misses
        - Load/store instructions
        - Misc. data shuffling

  16. Where We Use Flexible Hardware
      - Use programmable hardware where:
        - Algorithm less regular
        - Smaller % of total time
          - E.g., local interactions (fewer of them)
        - More likely to change
      - Examples:
        - Bonded interactions
        - Bond length constraints
        - Experimentation with
          - New, short-range force field terms
          - Alternative integration techniques

  17. Forms of Parallelism in Flexible Subsystem
      - The Flexible Subsystem exploits three forms of parallelism:
        - Multi-core parallelism
        - Instruction-level parallelism
        - SIMD parallelism

  18. Overview of the Flexible Subsystem
      - GC = Geometry Core (each a VLIW processor)

  19. Geometry Core (one of 8; 64 pipelined lanes/chip)
      - [Figure: block diagram; legible labels include Instruction Memory, PC, Decode, Tensilica Core, Data Memory, and X/Y/Z/W lanes]

  20. System-Level Organization
      - Multiple segments (probably 8 in first machine)
      - 512 nodes (each with one ASIC) per segment
        - Organized in an 8 x 8 x 8 toroidal mesh
      - Topology reflects physical space being simulated:
        - Three-dimensional nearest neighbor connections
        - Periodic boundary conditions
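
The wraparound connectivity of an 8 x 8 x 8 toroidal mesh can be sketched in a few lines of Python. The function name and the coordinate-tuple representation are illustrative assumptions; the slides do not say how nodes are addressed.

```python
def torus_neighbors(node, dims=(8, 8, 8)):
    """Return the six nearest neighbors of `node` on a 3D torus.

    `node` is an (x, y, z) tuple; the modulo wraps indices around each axis,
    which mirrors periodic boundary conditions in the simulated space.
    """
    x, y, z = node
    nx, ny, nz = dims
    return [
        ((x + 1) % nx, y, z), ((x - 1) % nx, y, z),
        (x, (y + 1) % ny, z), (x, (y - 1) % ny, z),
        (x, y, (z + 1) % nz), (x, y, (z - 1) % nz),
    ]


if __name__ == "__main__":
    # A corner node still has six neighbors because the mesh wraps around.
    print(torus_neighbors((0, 0, 0)))
```

Because every node has exactly six links regardless of position, an atom that leaves one face of the simulated volume can be handed to the node on the opposite face without any special case.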

  21. 3D Torus Network

  22. But Communication is Still a Bottleneck
      - Scalability limited by inter-chip communication
      - To execute a single millisecond-scale simulation,
        - Need a huge number of processing elements
        - Must dramatically reduce amount of data transferred between these processing elements
      - Can’t do this without fundamentally new algorithms

  23. *** The NT Algorithm

  24. Range-Limited Pairwise Particle Interactions
      - Efficient methods known for distant interactions
      - Pairwise, non-bonded interactions dominate
      - Range-limited n-body problem
      - [Figure label: R]

  25. New Algorithm
      - Parallel algorithm for range-limited n-body problem
      - Called the NT (for “Neutral Territory”) Method *
      - Asymptotically less inter-processor communication than traditional spatial decomposition methods
      - Constant factors also very attractive
        - Significant improvements on typical cluster
        - Major win on large machines
      - * Shaw, J. Comp. Chem. 26, Oct. 2005

  26. Desirable Properties
      - Ideally, a parallel algorithm for the range-limited n-body problem would:
        - Exploit the range limitation to reduce computational load
        - Scale such that data transfer approaches zero as p → ∞

  27. Asymptotic Comparison With Traditional Spatial Decomposition Methods
      - NT Method has both of these properties:

                               Exploitable range limitation   Scaling with number of processors
          Traditional methods  O(R^3) neighbors               Not scalable
          NT Method            O(R^(3/2)) neighbors           O(P^(-1/2)) scaling

  28. Partitioning of Space Into Boxes
      - [Figure labels: Atom A; Home box of atom A]
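
A minimal Python sketch of assigning an atom to its home box follows. It assumes cubic boxes of equal side length and periodic wrapping of coordinates; the function and parameter names are hypothetical and not taken from the slides.

```python
import math


def home_box(position, box_side, boxes_per_dim):
    """Return the (i, j, k) index of the box containing `position`.

    Assumes space is cut into cubic boxes of side `box_side`, with
    `boxes_per_dim` boxes along each axis and periodic wraparound.
    """
    return tuple(math.floor(c / box_side) % boxes_per_dim for c in position)


if __name__ == "__main__":
    # An atom slightly outside the lower z boundary wraps to the last box in z.
    print(home_box((0.5, 9.9, -0.1), box_side=1.0, boxes_per_dim=8))
```

With one box per node, this index is also the coordinate of the node that owns the atom.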

  29. Two-Dimensional Analog of the NT Method
      - [Figure panels: Traditional Method (2D Analog); NT Method (2D Analog)]
      - Green = interaction box; blue = import region

  30. How can it be better to meet on neutral territory?
      - [Figure panels: Traditional Method (2D); NT Method (2D)]
      - Number of pairwise interactions (~ product of areas)
      - Number of atoms imported (~ sum of areas)
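
The "sum of areas" comparison can be explored numerically with a rough Python sketch. The geometry here is an assumption, since the figures did not survive in this transcript: the traditional method is modeled as importing half of the ring of width R around a b-by-b box, and the 2D NT analog as importing a column segment of height 2R plus a row segment of length R, each of width b. The function names are hypothetical.

```python
import math


def traditional_import_area_2d(b, R):
    """Assumed model: half of the ring of width R around a b-by-b box."""
    return 2.0 * b * R + 0.5 * math.pi * R * R


def nt_import_area_2d(b, R):
    """Assumed model: column segment (2R by b) plus row segment (R by b)."""
    return 3.0 * b * R


if __name__ == "__main__":
    R = 1.0
    # Shrinking the box side b corresponds to using more processors.
    for b in (2.0, 1.0, 0.5, 0.1, 0.01):
        print(b, traditional_import_area_2d(b, R), nt_import_area_2d(b, R))
```

Under these assumptions the traditional import area levels off near πR²/2 as the boxes shrink, while the NT import area keeps falling in proportion to b. For boxes larger than about πR/2 the traditional area is the smaller of the two, so the advantage appears only at higher levels of parallelism.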

  31. Actual 3D Algorithm
      - Considerably more complex
        - Odd number of dimensions introduces complications
      - Can be made to work
        - Math gets more complicated
        - Performance advantage just as large
      - Start by describing 3D version of traditional spatial decomposition methods

  32. Traditional 3D Spatial Decomposition Methods

  33. Traditional Spatial Decomposition Method: Interaction Box and Import Region
      - Green = interaction box
      - Blue = import region

  34. Site of Interaction, Traditional Method
      - Interact
        - One atom from (cubical) interaction box
        - One atom from either interaction box or import region
      - All interactions occur within home box of one of the two atoms
      - How much inter-processor communication?
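
One way to estimate the communication volume asked about above is to compute the size of the import region. The Python sketch below assumes the traditional method imports half of the shell of thickness R around a cubical box of side b (a "half-shell" scheme), split into face, edge, and corner subregions. That choice of scheme and the function name are assumptions; the slides do not give the formula.

```python
import math


def half_shell_import_volume(b, R):
    """Volume of half the shell of thickness R around a cube of side b.

    Assumed decomposition: 3 of 6 face slabs, 6 of 12 quarter-cylinder
    edges, and 4 of 8 sphere-octant corners.
    """
    faces = 3.0 * b * b * R
    edges = 1.5 * math.pi * b * R * R
    corners = (2.0 / 3.0) * math.pi * R ** 3
    return faces + edges + corners


if __name__ == "__main__":
    R = 1.0
    # As the box shrinks, the corner term (2/3)*pi*R^3 dominates.
    for b in (4.0, 1.0, 0.25, 0.0):
        print(b, half_shell_import_volume(b, R))
```

Under this model the import volume never drops below (2/3)πR³ however small the boxes become, which is consistent with the earlier slide describing traditional methods as O(R^3) and not scalable.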

  35. Import Subregion Face(-x)
