Roadrunner: What makes it tick? Los Alamos Computer Science Symposium

LA-UR-08-6246
Roadrunner: What makes it tick?
Los Alamos Computer Science Symposium, October 14, 2008
Ken Koch, Roadrunner Technical Manager, Computer, Computational, and Statistical Sciences Division, Los Alamos National Laboratory
Work presented was performed by a large team of Roadrunner project staff!
Operated by the Los Alamos National Security, LLC for the DOE/NNSA
IBM Confidential

The messages this talk will convey are:
• Why Roadrunner? Why Cell?
  • A bold but important step toward the future
• What does Roadrunner look like?
  • Cluster-of-clusters with node-attached Cells
• Concepts for Programming Roadrunner
  • MPI, Opteron+Cell, “local-store” memory & DMA transfers
• Status and plans for Roadrunner
  • Unclassified Science opportunities

The Cell Processor: a harbinger of the future

Microprocessor trends are changing
• Moore’s law still holds, but is now being realized differently
  • Frequency, power, & instruction-level-parallelism (ILP) have all plateaued
  • Multi-core is here today and many-core ( ≥ 32 ) looks to be the future
  • Memory bandwidth and capacity per core are headed downward (caused by increased core counts)
• Key findings of Jan. 2007 IDC Study: “Next Phase in HPC”
  • new ways of dealing with parallelism will be required
  • must focus more heavily on bandwidth (flow of data) and less on processor
[Figure: trends in transistors, clock, power, and ILP, with the 386, Pentium, and Montecito marked. From Burton Smith, LASCI-06 keynote, with permission]

We are programming thousands of processors with MPI
Message Passing:
• High protocol overhead
• Large granularity
• Symmetric
• Synchronous
[Diagram levels: cluster, node]

Future supercomputers will require new programming models
Message Passing:
• High protocol overhead
• Large granularity
• Symmetric
• Synchronous
Not Message Passing:
• Parallelism and heterogeneity require new approaches: Threads, OpenMP, Accelerators …
[Diagram levels: cluster, node, socket]

The Cell processor is an (8+1)-way heterogeneous parallel processor
• Cell Broadband Engine (CBE*) developed by Sony-Toshiba-IBM
  • used in Sony PlayStation 3
• 8 Synergistic Processing Elements (SPEs)
  • 128-bit vector engines
  • 256 kB local memory (LS = Local Store)
  • Direct Memory Access (DMA) engine (25.6 GB/s each)
  • Chip interconnect (EIB)
  • Run SPE-code as POSIX threads (SPMD, MPMD, streaming)
• PowerPC PPE runs Linux OS
• Current Cell performance:
  • 204.8 GF/s SP & 13.65 GF/s DP
  • 512 MB @ 25.6 GB/s XDR memory
  • Insufficient for a Petaflop/s machine
[Diagram labels: SPU, SPE, PowerPC, to memory, to PCIe]
* trademark of Sony Computer Entertainment, Inc.
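The 204.8 GF/s single-precision figure can be reproduced with simple arithmetic. The sketch below is illustrative only: it assumes a 3.2 GHz clock and one 4-wide single-precision fused multiply-add per SPE per cycle, neither of which is stated on the slide, and the function name is hypothetical.

```python
def peak_gflops(units, simd_lanes, flops_per_lane_per_cycle, clock_ghz):
    """Theoretical peak in GF/s, assuming every unit issues one
    full-width SIMD operation on every cycle."""
    return units * simd_lanes * flops_per_lane_per_cycle * clock_ghz


# Assumptions (not on the slide): 3.2 GHz clock; a 128-bit vector holds
# 4 single-precision lanes; a fused multiply-add counts as 2 flops.
cell_sp = peak_gflops(units=8, simd_lanes=4,
                      flops_per_lane_per_cycle=2, clock_ghz=3.2)
print(f"8 SPEs, single precision: {cell_sp:.1f} GF/s")
```

Under these assumptions the result agrees with the slide's single-precision number; the PPE's own contribution is ignored.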

IBM is creating new Cell processors
[Roadmap figure, 2006 to 2010, with two paths:]
• Cost Reduction Path: Cell BE (1+8) at 90nm SOI, 65nm SOI, and 45nm SOI (continued shrinks)
• Performance Enhancements/Scaling Path: Enhanced Cell (1+8eDP SPE) at 65nm SOI, then Next Gen (2PPE’+32SPE’) at 45nm SOI, ~1 TF-SP (est.)
PowerXCell 8i chip:
• To be used in Roadrunner
• 65nm SOI
• 102.4 GF/s double precision
• 4 GB DDR2 @ 25.6 GB/s
PowerXCell is IBM’s name for this new enhanced double-precision (eDP) Cell processor variant.
All future dates and specifications are estimations only; subject to change without notice. Dashed outlines indicate concept designs.
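Two ratios follow directly from the quoted numbers: the double-precision gain of the PowerXCell 8i over the original Cell, and the memory bandwidth available per double-precision flop. This is a minimal sketch using only figures from the slides; the variable and function names are illustrative.

```python
CELL_BE_DP_GFLOPS = 13.65      # original Cell, double precision
POWERXCELL_DP_GFLOPS = 102.4   # PowerXCell 8i, double precision
MEMORY_BW_GB_PER_S = 25.6      # memory bandwidth quoted for both chips


def bytes_per_flop(bandwidth_gb_per_s, gflops):
    """Memory bytes deliverable per peak floating-point operation."""
    return bandwidth_gb_per_s / gflops


dp_gain = POWERXCELL_DP_GFLOPS / CELL_BE_DP_GFLOPS
balance_new = bytes_per_flop(MEMORY_BW_GB_PER_S, POWERXCELL_DP_GFLOPS)
balance_old = bytes_per_flop(MEMORY_BW_GB_PER_S, CELL_BE_DP_GFLOPS)

print(f"DP gain: {dp_gain:.1f}x")
print(f"Bytes per DP flop: {balance_old:.2f} (Cell BE) vs {balance_new:.2f} (PowerXCell 8i)")
```

The double-precision peak rises by roughly 7.5x while the quoted memory bandwidth stays the same, so the bytes available per flop drop accordingly.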

Industry presentations show changing trends in processors
• Intel’s Microprocessor Research Lab
• AMD Fusion
• Intel’s Visual Computing Group - Larrabee
• nVidia G80 - 2006
Taken from publicly available information

Roadrunner is on a different path to petascale
LANL has been looking at hybrid & petascale computing for some time
[Timeline figure, 2002 to 2007; labels: DARK HORSE; Skunkworks; Cell, 3D memory; Clearspeed; Clearspeed, Cell; Adv. Arch. Project; GPU, FPGA; HPCS: PERCS; PF system design; Roadrunner; Roadrunner Contract Award 9/8/2006]
• Cell is fast
• Cell is energy efficient
• Cell is commodity
• Cell brings heterogeneity
• Cell brings fine-scale parallelism

A Roadrunner is born

IBM built hybrid nodes in Rochester, MN and assembled the system in Poughkeepsie, NY

Roadrunner broke the 1 Petaflop/s mark on May 26th, 2008
• Calculation: ~2 hours
• Matrix: ~5 trillion entries
• Performance: 1.026 Petaflop/s
Only 3 days after the full machine was finally assembled!
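The three numbers on this slide are mutually consistent, which can be checked with a short calculation. The sketch below assumes the matrix is square and dense and uses the customary LINPACK operation count of (2/3)N^3 + 2N^2; both are assumptions, since the slide states only the entry count, the rate, and the approximate run time.

```python
import math


def hpl_runtime_hours(matrix_entries, sustained_flops_per_s):
    """Estimate the run time of a dense LU solve.

    Assumes a square N x N matrix and the customary LINPACK operation
    count of (2/3) N^3 + 2 N^2 floating-point operations.
    """
    n = math.sqrt(matrix_entries)
    total_flops = (2.0 / 3.0) * n**3 + 2.0 * n**2
    return n, total_flops / sustained_flops_per_s / 3600.0


n, hours = hpl_runtime_hours(matrix_entries=5e12,
                             sustained_flops_per_s=1.026e15)
print(f"N is about {n:,.0f}; estimated run time is about {hours:.2f} hours")
```

A matrix of about 5 trillion entries corresponds to an order of roughly 2.2 million, and at 1.026 Petaflop/s the estimate comes out close to the quoted two hours.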

Roadrunner is a TOP performer!
#1 on the TOP500 (from June 2008 Top 500 List)
#  Site                         System                       TF/sec  Country        Vendor
1  DOE/NNSA/LANL                Roadrunner, QS22/LS21        1026    United States  IBM
2  DOE/NNSA/LLNL                Blue Gene/L                  478     United States  IBM
3  Argonne National Laboratory  Blue Gene/P                  450     United States  IBM
4  Texas Adv. Comp. Center      SunBlade Opteron IB Cluster  326     United States  Sun
5  DOE/ORNL                     Jaguar, XT4-QuadCore         205     United States  Cray
6  Forschungszentrum Juelich    Blue Gene/P                  180     Germany        IBM
#3 on the Green500
[Figure: Green 500, Mflops / Watt versus list position; labeled systems: Cell QS22 clusters, Roadrunner, BG/P, Xeon Quad, BG/L]

Roadrunner System Configuration

Roadrunner Phase 3 is Cell-accelerated, not a cluster of Cells
[Diagram labels: Cell-accelerated compute node (add Cells to each individual node); multi-socket multi-core Opteron cluster nodes (100’s of such cluster nodes); I/O gateway nodes; “Scalable Unit” Cluster Interconnect Switch/Fabric]
Node-attached Cells is what makes Roadrunner different!

A Roadrunner TriBlade node integrates Cell and Opteron blades
• QS22 is an IBM Cell blade containing two new enhanced double-precision (eDP/PowerXCell™) Cell chips
• Expansion blade connects two QS22 via four PCI-e x8 links to LS21 & provides the node’s ConnectX IB 4X DDR cluster attachment
• LS21 is an IBM dual-socket Opteron blade
• 4-wide IBM BladeCenter packaging
• Roadrunner Triblades are completely diskless and run from RAM disks with NFS & Panasas only to the LS21
• Node design points:
  • One Cell chip per Opteron core
  • ~400 GF/s double-precision & ~800 GF/s single-precision
  • 16 GB Opteron memory PLUS 16 GB Cell memory
  • 1 PCI-E x8 to each Cell
[Diagram labels:
 QS22 (two blades): two Cell eDP chips with 4 GB each; I/O hubs; 2x PCI-E x16 (unused); 2x PCI-E x8; dual PCI-E x8 flex-cable
 Expansion blade: HT2100; HSDC connector; PCI-E x8 (unused); Std PCI-E connector; IB 4x DDR to cluster; 2 x HT x16 expansion connector; 2 GB/s, 2us, per PCI-e link
 LS21: two AMD dual-core Opterons with 8 GB each; HT x16 links]

A Roadrunner TriBlade node integrates Cell and Opteron blades
[Photo labels: Two QS22’s with 2 Cells each; Expansion blade; LS21 with two dual-core Opterons]
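The node design points can be derived from the per-blade counts. This is a rough sketch, not an official accounting: it counts only the Cell chips toward the floating-point peak (the Opteron contribution is ignored) and assumes the PowerXCell 8i retains the 204.8 GF/s single-precision peak quoted for the original Cell. All names are illustrative.

```python
# Per-blade counts taken from the slides.
QS22_PER_TRIBLADE = 2
CELLS_PER_QS22 = 2
CELL_MEMORY_GB = 4
OPTERON_SOCKETS = 2
CORES_PER_OPTERON = 2
OPTERON_MEMORY_GB_PER_SOCKET = 8

CELL_DP_GFLOPS = 102.4   # PowerXCell 8i double precision, from the slides
CELL_SP_GFLOPS = 204.8   # assumption: same SP peak as the original Cell

cells = QS22_PER_TRIBLADE * CELLS_PER_QS22
opteron_cores = OPTERON_SOCKETS * CORES_PER_OPTERON

node = {
    "cells": cells,
    "opteron_cores": opteron_cores,
    "cells_per_opteron_core": cells / opteron_cores,
    "cell_dp_gflops": cells * CELL_DP_GFLOPS,   # Opterons not counted
    "cell_sp_gflops": cells * CELL_SP_GFLOPS,   # Opterons not counted
    "cell_memory_gb": cells * CELL_MEMORY_GB,
    "opteron_memory_gb": OPTERON_SOCKETS * OPTERON_MEMORY_GB_PER_SOCKET,
}

for key, value in node.items():
    print(f"{key}: {value}")
```

Four Cells against four Opteron cores gives the one-to-one design point, and the Cell-only totals land near the quoted ~400 GF/s and ~800 GF/s.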

A Connected Unit (CU) forms a building block
[Diagram:
 60 BC-H chassis (BC-H chassis 1 to 60) holding 180 TriBlades (TriBlade 1 to 180)
 12 2U I/O nodes (2U I/O Node 1 to 12) and a 2U Service Node
 ISR2012 IB 4x DDR switch; IB 4x DDR links at 2+2 GB/s; 96 links to 2nd stage switches
 10 GigE at 1+1 GB/s from the I/O nodes to file systems & LANs]
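The counts in the CU diagram can be totaled to get a feel for the building block. The sketch below is a hedged reading of the flattened diagram: it assumes three TriBlades per chassis (180 TriBlades over 60 chassis), one switch port per TriBlade and per I/O node, four Cells per TriBlade, and 102.4 GF/s double precision per Cell. How the service node attaches is not clear from the diagram, so it is left out.

```python
CHASSIS_PER_CU = 60
TRIBLADES_PER_CHASSIS = 3    # assumption: 180 TriBlades spread over 60 chassis
IO_NODES_PER_CU = 12
UPLINKS_TO_SECOND_STAGE = 96
CELLS_PER_TRIBLADE = 4       # two QS22 blades with two Cells each
CELL_DP_GFLOPS = 102.4       # PowerXCell 8i double precision

triblades = CHASSIS_PER_CU * TRIBLADES_PER_CHASSIS

# Assumption: every TriBlade and every I/O node uses one IB port on the CU switch.
downlinks = triblades + IO_NODES_PER_CU
ports_in_use = downlinks + UPLINKS_TO_SECOND_STAGE
oversubscription = downlinks / UPLINKS_TO_SECOND_STAGE

cu_cell_dp_tflops = triblades * CELLS_PER_TRIBLADE * CELL_DP_GFLOPS / 1000.0

print(f"TriBlades per CU: {triblades}")
print(f"Switch ports in use: {ports_in_use} ({downlinks} down, {UPLINKS_TO_SECOND_STAGE} up)")
print(f"Downlink-to-uplink ratio: {oversubscription:.1f}:1")
print(f"Cell DP peak per CU: {cu_cell_dp_tflops:.1f} TF/s")
```

Under these assumptions a CU uses twice as many node-facing ports as uplinks, and its Cells alone contribute a little under 74 TF/s of double-precision peak.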
