
SLIDE 1

Tightly Coupled Accelerators with Proprietary Interconnect and Its Programming and Applications

Toshihiro Hanawa

Information Technology Center, The University of Tokyo

Taisuke Boku

Center for Computational Sciences, University of Tsukuba

In collaboration with Yuetsu Kodama, Mitsuhisa Sato, Masayuki Umemura @ CCS, Univ. of Tsukuba; Hitoshi Murai @ RIKEN AICS; Hideharu Amano @ Keio Univ.

Mar. 19, 2015
GPU Technology Conference 2015
SLIDE 2

Agenda

• Background
  • HA-PACS / AC-CREST Project
  • Introduction of HA-PACS / TCA
• Organization of TCA
  • PEACH2 board designed for TCA
  • Evaluation of basic performance
• Collective communications
  • Implementation examples
  • Performance evaluation
• Application examples
  • QUDA (QCD)
  • FFTE (FFT)
• Introduction of XcalableACC
  • Concept
  • Code examples
  • Evaluations
• Summary

SLIDE 3

Current Trend of HPC using GPU Computing

• Advantageous features
  • High peak performance / cost ratio
  • High peak performance / power ratio
• Examples of HPC systems
  • GPU clusters and MPPs in the TOP500 (Nov. 2014)
    • 2nd: Titan (NVIDIA K20X, 27 PF)
    • 6th: Piz Daint (NVIDIA K20X, 7.8 PF)
    • 10th: Cray CS-Storm (NVIDIA K40, 6.1 PF)
    • 15th: TSUBAME2.5 (NVIDIA K20X, 5.6 PF)
    • 48 systems use NVIDIA GPUs.
  • GPU clusters in the Green500 (Nov. 2014) ("greenest" supercomputers ranked in the TOP500)
    • 3rd: TSUBAME-KFC (NVIDIA K20X, 4.4 GF/W)
    • 4th: Cray Storm1 (NVIDIA K40, 3.9 GF/W)
    • 7th: HA-PACS/TCA (NVIDIA K20X, 3.5 GF/W)
    • 8 of the top 10 systems use NVIDIA GPUs.

SLIDE 4

Issues of GPU Computing

• Data I/O performance limitation
  • Ex.) K20X: PCIe Gen2 x16 peak 8 GB/s (I/O) ⇔ 1.3 TFLOPS (computation)
  • The communication bottleneck becomes significant in multi-GPU applications.
• Strong scaling on GPU clusters
  • Important to shorten the turn-around time of production runs
  • Communication latency has a heavy impact.
• Ultra-low latency between GPUs is important for next-generation HPC.

Our target is to develop a direct communication system between GPUs across nodes as a feasibility study for future accelerated computing.
⇒ "Tightly Coupled Accelerators (TCA)" architecture

SLIDE 5

HA-PACS Project

HA-PACS (Highly Accelerated Parallel Advanced system for Computational Sciences)
• 8th generation of the PAX/PACS series supercomputers at the University of Tsukuba
• FY2011-2013, operation until FY2016(?)
• Promotion of computational science applications in key areas at CCS, Tsukuba
  • Target fields: QCD, astrophysics, QM/MM (quantum mechanics / molecular mechanics, bioscience)
• HA-PACS is not only a "commodity GPU cluster" but also an experimental platform.

HA-PACS base cluster
• For developing GPU-accelerated code for the target fields and for production runs
• In operation since Feb. 2012

HA-PACS/TCA (TCA = Tightly Coupled Accelerators)
• For elementary research on direct communication technology for accelerated computing
• Our original communication chip, "PEACH2", is installed in each node.
• In operation since Nov. 2013

SLIDE 6

AC-CREST Project

• Project "Research and Development on Unified Environment of Accelerated Computing and Interconnection for Post-Petascale Era" (AC-CREST)
• Objectives
  • Realization of high-performance (direct) communication among accelerators
  • Development of system software supporting the communication system among accelerators
  • Development of parallel languages and compilers
    • Higher productivity
    • Highly optimized (offload, communication)
  • Development of practical applications
• Supported by the JST-CREST program "Development of System Software Technologies for post-Peta Scale High Performance Computing"
SLIDE 7

What is “Tightly Coupled Accelerators (TCA)” ?

Concept:
• Direct connection between accelerators (GPUs) over the nodes, without CPU assistance
  • Eliminate extra memory copies to the host
  • Reduce latency; improve strong scaling with small data sizes
  • Enable hardware support for complicated communication patterns

SLIDE 8

Communication on TCA Architecture

(Figure: two nodes, each with a CPU + CPU memory, a GPU + GPU memory, and a PCIe switch; a PEACH2 in each node links the two nodes directly over PCIe.)

• PCIe is used as the communication link between accelerators over the nodes.
  • Direct device-to-device (P2P) communication is available through PCIe.
• PEACH2: PCI Express Adaptive Communication Hub ver. 2
  • Implementation of the interface and data transfer engine for TCA

SLIDE 9

GPU Communication with Traditional MPI

• Traditional MPI over InfiniBand requires three data copies.
• The data copies between CPU and GPU (steps 1 and 3) have to be performed manually.

(Figure: 1: copy from GPU memory to CPU memory through PCI Express (PCIe); 2: data transfer over IB; 3: copy from CPU memory to GPU memory through PCIe.)
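For reference, a minimal sketch of this three-copy path with standard CUDA runtime and MPI calls; the function and buffer names and the use of MPI_Sendrecv are illustrative, not taken from the slides.

#include <mpi.h>
#include <cuda_runtime.h>

/* Exchange n doubles with a peer rank, staging through host memory. */
void exchange_via_host(const double *d_send, double *d_recv,
                       double *h_send, double *h_recv,
                       size_t n, int peer)
{
    /* 1: copy from GPU memory to CPU memory through PCIe */
    cudaMemcpy(h_send, d_send, n * sizeof(double), cudaMemcpyDeviceToHost);

    /* 2: data transfer over InfiniBand (here via MPI) */
    MPI_Sendrecv(h_send, (int)n, MPI_DOUBLE, peer, 0,
                 h_recv, (int)n, MPI_DOUBLE, peer, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* 3: copy from CPU memory back to GPU memory through PCIe */
    cudaMemcpy(d_recv, h_recv, n * sizeof(double), cudaMemcpyHostToDevice);
}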

SLIDE 10

GPU Communication with IB/GDR

• With GDR, the InfiniBand controller reads and writes GPU memory directly.
  • The temporary data copy is eliminated.
  • Lower latency than the previous method
  • Protocol conversion is still needed.

(Figure: 1: direct data transfer (PCIe -> IB -> PCIe) between GPU memories, without staging in CPU memory.)

SLIDE 11

GPU Communication with TCA (PEACH2)

• TCA does not need protocol conversion.
  • Direct data copy using GDR
  • Much lower latency than InfiniBand

(Figure: 1: direct data transfer (PCIe -> PCIe -> PCIe) between GPU memories through the TCA (PEACH2) link.)

SLIDE 12

TCA node structure example

• PEACH2 can access all GPUs.
  • NVIDIA Kepler architecture + "GPUDirect Support for RDMA" are required.
• Connection among 3 nodes using the remaining PEACH2 ports
• Similar to an ordinary GPU cluster configuration except for PEACH2
  • 80 PCIe lanes are required.

(Figure: node block diagram with two Xeon E5 v2 CPUs connected by QPI; four NVIDIA K20X GPUs on PCIe Gen2 x16 links, PEACH2 on Gen2 x8 with three external Gen2 x8 ports, and a 2-port QDR InfiniBand HCA on Gen3 x8; the whole node forms a single PCI address space.)
SLIDE 13

TCA node structure example

Actually,
• Performance over QPI is miserable.
• PEACH2 is available for GPU0 and GPU1.
• Note that InfiniBand with GPUDirect for RDMA is available only for GPU2 and GPU3.
• Similar to an ordinary GPU cluster configuration except for PEACH2
  • 80 PCIe lanes are required.

(Figure: same node block diagram as SLIDE 12; GPU: NVIDIA K20X.)
SLIDE 14

Design of PEACH2

• Implemented in an FPGA with four PCIe Gen2 IPs
  • Altera Stratix IV GX
  • Prototyping, flexible enhancement
• Sufficient communication bandwidth
  • PCI Express Gen2 x8 for each port (40 Gbps = IB QDR)
• Sophisticated DMA controller
  • Chaining DMA, block-stride transfer function
• Latency reduction
  • Hardwired logic
  • Low-overhead routing mechanism
    • Efficient address mapping in the PCIe address area using unused bits
    • Simple comparator for deciding the output port

It is not only a proof-of-concept implementation; it is also available for production runs in a GPU cluster.

SLIDE 15

PEACH2 board (Production version for HA-PACS/TCA)


(Photo: main board + sub board. Most of the logic operates at 250 MHz (the PCIe Gen2 logic runs at 250 MHz). Visible parts: PCI Express x8 card edge, power supplies for various voltages, DDR3 SDRAM, FPGA (Altera Stratix IV 530GX), PCIe x16 cable connector, and PCIe x8 cable connector.)
SLIDE 16

HA-PACS/TCA Compute Node

(Photos: rear and front views of the 3U compute node (8 nodes per rack); the PEACH2 board is installed here.)
SLIDE 17

Inside of HA-PACS/TCA Compute Node

SLIDE 18
Spec. of HA-PACS base cluster & HA-PACS/TCA

                     Base cluster (Feb. 2012)                    TCA (Nov. 2013)
Node                 CRAY GreenBlade 8204                        CRAY 3623G4-SM
MotherBoard          Intel Washington Pass                       SuperMicro X9DRG-QF
CPU                  Intel Xeon E5-2670 x 2 sockets              Intel Xeon E5-2680 v2 x 2 sockets
                     (SandyBridge-EP, 2.6 GHz, 8 cores)          (IvyBridge-EP, 2.8 GHz, 10 cores)
Memory               DDR3-1600, 128 GB                           DDR3-1866, 128 GB
GPU                  NVIDIA M2090 x 4                            NVIDIA K20X x 4
# of Nodes (Racks)   268 (26)                                    64 (10)
Interconnect         Mellanox InfiniBand QDR x 2 (ConnectX-3)    Mellanox InfiniBand QDR x 2 + PEACH2
Peak Perf.           802 TFlops                                  364 TFlops
Power                408 kW                                      99.3 kW


In total, HA-PACS is an over-1-PFlops system!
SLIDE 19

HA-PACS/TCA (Compute Node)

(Figure: compute-node block diagram. CPUs: 2x Ivy Bridge, AVX, 2.8 GHz x 8 flop/clock, 22.4 GFLOPS x 20 cores = 448.0 GFLOPS; GPUs: 4x NVIDIA K20X, 1.31 TFLOPS x 4 = 5.24 TFLOPS; node total 5.688 TFLOPS. Host memory: 4 channels of DDR3-1,866 MHz per CPU at 59.7 GB/s, (16 GB, 14.9 GB/s) x 8 = 128 GB, 119.4 GB/s. GPU memory: (6 GB, 250 GB/s) x 4 = 24 GB, 1 TB/s. Each GPU is attached via PCIe Gen2 x16 (8 GB/s); the PEACH2 board (proprietary interconnect for TCA) has three Gen2 x8 links; parts shown in red were upgraded from the base cluster to TCA.)
SLIDE 20

HA-PACS/TCA (since Nov. 2013) + Base cluster

(Photo: HA-PACS base cluster and TCA racks.)

LINPACK: 277 TFlops (efficiency 76%), 3.52 GFLOPS/W (#3 in the Green500, Nov. 2013)
SLIDE 21

Configuration of TCA Sub-cluster (16 nodes/group)

• Each group consists of 2 racks (16 nodes); HA-PACS/TCA includes 4 TCA groups.
  • Orange: ring; red: cross links between the 2 rings
• Within a TCA sub-cluster, 32 GPUs can be treated seamlessly
  • limited to the 2 GPUs under the same socket on each node.

SLIDE 22

Communication on TCA

• TCA provides two types of communication (see the table below).
• DMA controller function
  • Chaining
    • Multiple DMA descriptors are chained in memory.
    • The DMA transactions are then operated automatically by hardware.
  • Block-stride support
  • 4-channel DMA engine

Comm. type   Min. latency        Bandwidth   How it works            Comm. patterns
DMA          Low (< 2 us)        High        DMA controller          Any (CPU or GPU)
PIO          Very low (< 1 us)   Low         CPU's write operation   CPU-CPU

(Figure: chained DMA descriptors in memory; the DMAC control register points to the head of the descriptor list, each descriptor holds Source, Destination, Length, Flags, and Next, and the last descriptor's Next is NULL.)
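To make the chaining idea concrete, here is a small C sketch; the field names, widths, and layout are assumptions for illustration only and do not reflect the actual PEACH2 descriptor format or register interface.

#include <stddef.h>
#include <stdint.h>

/* One DMA descriptor: source, destination, length, flags, next (as on the slide). */
struct dma_desc {
    uint64_t src;           /* PCIe address of the source (CPU or GPU memory) */
    uint64_t dst;           /* PCIe address of the destination */
    uint32_t len;           /* transfer length in bytes */
    uint32_t flags;         /* e.g. block-stride or completion-notify bits (assumed) */
    struct dma_desc *next;  /* next descriptor; NULL terminates the chain */
};

/* Link n descriptors into a chain; once the head is handed to the DMAC
 * control register, the hardware walks the chain without CPU involvement. */
static void chain_descriptors(struct dma_desc *d, size_t n)
{
    for (size_t i = 0; i + 1 < n; i++)
        d[i].next = &d[i + 1];
    d[n - 1].next = NULL;
}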

SLIDE 23

Evaluation Results

• Ping-pong performance between nodes
  • Latency and bandwidth
  • Written as an application-level program
• Comparison
  • MVAPICH2-GDR 2.0b (with/without GPUDirect support) for GPU-GPU communication on TCA nodes
    • A single InfiniBand QDR link (40 Gbps) is used, which has the same link performance as PEACH2.
  • Performance over QPI on TCA nodes
• To access GPU memory from another device, the "GPUDirect Support for RDMA" API of CUDA 5 is used.
  • A special driver named "TCA p2p driver" was developed to enable the memory mapping.
  • A "PEACH2 driver" to control the board was also developed.

SLIDE 24

Ping-pong Latency

Minimum latency (nearest-neighbor communication):
• PIO: CPU to CPU: 0.8 us
• DMA: CPU to CPU: 1.8 us; GPU to GPU: 2.0 us
  • cf. MVAPICH2-GDR 2.0: 4.5 us (with GDR), 17 us (without GDR)

(Figure: ping-pong latency (1-10 usec) vs. data size (8 B to 32 KB) for PIO, DMA (CPU), DMA (GPU), and MVAPICH2-GDR 2.0.)
SLIDE 25

Ping-pong Latency

Minimum latency (nearest-neighbor communication):
• PIO: CPU to CPU: 0.8 us
• DMA: CPU to CPU: 1.8 us; GPU to GPU: 2.3 us
• Forwarding overhead: 200-300 nsec
• Bandwidth converges to the same peak for various hop counts.

(Figure: DMA (CPU) ping-pong latency (1-8 usec) vs. data size (8 B to 32 KB) for direct, 1-hop, 2-hop, and 3-hop transfers.)
SLIDE 26

Ping-pong Bandwidth

• Max. 3.5 GB/s between CPUs
  • 95% of the theoretical peak
  • With Max Payload Size = 256 bytes, the detailed theoretical peak is 4 GB/s x 256 / (256 + 24) = 3.66 GB/s.
• Converges to the same peak as the hop count increases
• GPU-GPU DMA performance is up to 2.8 GB/s
  • better than MVAPICH2-GDR below 1 MB
• Over QPI: limited to 360 MB/s
• SB (SandyBridge): limited to 880 MB/s due to PCIe switch performance

(Figure: ping-pong bandwidth (MB/s) vs. message size (8 B to 512 KB) for DMA (GPU), DMA (CPU), MVAPICH2-GDR 2.0, DMA (SB, GPU), and DMA (QPI, GPU); DMA (CPU) reaches 3.5 GB/s and DMA (GPU) reaches 2.8 GB/s.)
SLIDE 27

GPUDirect behavior in MVAPICH2-GDR

From the README:
• MV2_GPUDIRECT_LIMIT (default: 8192)
• MV2_USE_GPUDIRECT_RECEIVE_LIMIT (default: 131072)

Resulting behavior for message size X:
• X <= 8 KB: GDR read + GDR write
• 8 KB < X <= 128 KB: memcpy H2D + GDR write
• X > 128 KB: memcpy H2D + memcpy D2H
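The README behavior above amounts to a size-based protocol selection; the following C sketch simply restates those thresholds. It is not MVAPICH2 source code, and the enum and function names are made up for illustration.

#include <stdlib.h>

enum gdr_path {
    PATH_GDR_READ_GDR_WRITE,   /* X <= 8 KB: GDR read + GDR write            */
    PATH_HOSTCOPY_GDR_WRITE,   /* 8 KB < X <= 128 KB: memcpy H2D + GDR write */
    PATH_HOSTCOPY_ONLY         /* X > 128 KB: memcpy H2D + memcpy D2H        */
};

enum gdr_path select_path(size_t x)
{
    size_t gdr_limit  = 8192;    /* MV2_GPUDIRECT_LIMIT default */
    size_t recv_limit = 131072;  /* MV2_USE_GPUDIRECT_RECEIVE_LIMIT default */
    const char *env;

    if ((env = getenv("MV2_GPUDIRECT_LIMIT")) != NULL)
        gdr_limit = strtoul(env, NULL, 10);
    if ((env = getenv("MV2_USE_GPUDIRECT_RECEIVE_LIMIT")) != NULL)
        recv_limit = strtoul(env, NULL, 10);

    if (x <= gdr_limit)  return PATH_GDR_READ_GDR_WRITE;
    if (x <= recv_limit) return PATH_HOSTCOPY_GDR_WRITE;
    return PATH_HOSTCOPY_ONLY;
}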

SLIDE 28

Collective Communications

• Allgather
  • All processes gather the data of every process.
  • Communication bandwidth as well as latency is important.
  • Implemented with GPU-GPU DMA
• Allreduce
  • Performs the specified operation over the data arrays of all processes and stores the result on every process.
  • Latency decides the performance.
  • Implemented with CPU-CPU PIO plus a host copy
• Alltoall
  • All processes exchange specific data with each other (transpose).
  • Communication bandwidth is important.
  • Implemented with GPU-GPU DMA; all requests to every node are chained. [AsHES2015] (to appear)

Packet contention may occur on the ring, so optimization is required.
SLIDE 29

Allgather Implementation: Recursive Doubling

Initial State

• Requires log2(p) steps
  • Ex.: p = 16 => 4 steps
• Node mapping optimization
  1. Same hop count between any pair of nodes in every step
  2. Communicate data with a neighboring node in the last step
SLIDE 30

Allgather Implementation: Recursive Doubling (Step 1; bullets as on SLIDE 29)
SLIDE 31

Allgather Implementation: Recursive Doubling (Step 2; bullets as on SLIDE 29)
SLIDE 32

Allgather Implementation: Recursive Doubling (Step 3; bullets as on SLIDE 29)
SLIDE 33

Allgather Implementation: Recursive Doubling (Step 4, the final step; bullets as on SLIDE 29)
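As a reference for the algorithm itself (independent of TCA), the recursive-doubling exchange of SLIDES 29-33 can be sketched with plain MPI as below. It assumes the number of processes is a power of two; the TCA implementation replaces the pairwise exchange with chained GPU-GPU DMA.

#include <mpi.h>
#include <string.h>

/* Recursive-doubling allgather: log2(p) steps, p assumed to be a power of two.
 * Each rank contributes bytes_per_rank bytes; recvbuf holds p such blocks. */
void allgather_rd(const void *sendbuf, void *recvbuf, int bytes_per_rank,
                  MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    char *buf = (char *)recvbuf;
    memcpy(buf + (size_t)rank * bytes_per_rank, sendbuf, bytes_per_rank);

    for (int step = 1; step < p; step <<= 1) {
        int peer       = rank ^ step;           /* partner at distance "step"      */
        int my_block   = (rank / step) * step;  /* block of ranks I already hold   */
        int peer_block = (peer / step) * step;  /* block the partner already holds */

        MPI_Sendrecv(buf + (size_t)my_block   * bytes_per_rank,
                     step * bytes_per_rank, MPI_BYTE, peer, 0,
                     buf + (size_t)peer_block * bytes_per_rank,
                     step * bytes_per_rank, MPI_BYTE, peer, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}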

SLIDE 34

Allgather Performance Comparison among Various Algorithms

• Time for all-gathering 128 KB of data
  • the N = 16384 case in the CG method
• Recursive Doubling shows good performance.
• However, at p = 16, TCA is slower than MPI at this data size.

(Figure: communication time (50-250 usec) vs. number of processes (2, 4, 8, 16) for Ring, Neighbor Exchange, Recursive Doubling, Dissemination, and MPI; lower is better.)
SLIDE 35

Allreduce Performance

• Allreduce time for 8-byte scalar data
• Dissemination is the fastest.
• TCA is more than twice as fast as MPI.
• The low latency of TCA works effectively.

(Figure: communication time (10-30 usec) vs. number of processes (2, 4, 8, 16) for Ring, Neighbor Exchange, Recursive Doubling, Dissemination, and MPI; lower is better.)
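For reference, the dissemination pattern for an 8-byte scalar can be sketched with plain MPI as follows. The sketch assumes a sum operation and a power-of-two process count (as in the 2/4/8/16-node runs above); the TCA version performs the exchange with CPU-CPU PIO plus a host copy instead of MPI messages.

#include <mpi.h>

/* Dissemination allreduce (sum) of one double; assumes p is a power of two. */
double allreduce_sum_dissemination(double my_val, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    double acc = my_val;
    for (int dist = 1; dist < p; dist <<= 1) {
        int to   = (rank + dist) % p;       /* send my partial sum forward      */
        int from = (rank - dist + p) % p;   /* receive a partial sum from behind */
        double incoming;
        MPI_Sendrecv(&acc, 1, MPI_DOUBLE, to, 0,
                     &incoming, 1, MPI_DOUBLE, from, 0,
                     comm, MPI_STATUS_IGNORE);
        acc += incoming;                    /* after log2(p) rounds, acc is the global sum */
    }
    return acc;
}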

SLIDE 36

QUDA

• QUDA: an open-source Lattice QCD library
  • widely used as an LQCD library for NVIDIA GPUs
  • optimized for NVIDIA GPUs; all calculations run on the GPUs
  • solves a linear equation using the CG method
  • supports inter-node parallelism and multiple GPUs per node
  • source code is available at GitHub: https://github.com/lattice/quda

[HeteroPar2014]
SLIDE 37

Communication in QUDA

• Halo data exchange with RMA is dominant.
  • Each process writes data into its neighbor processes' memory regions.
• Allreduce communication in CG
  • Latency is important.

(Figure: halo data exchange between neighboring sub-lattices, followed by an Allreduce.)
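A minimal sketch of the RMA-style halo write using plain MPI one-sided calls; the buffer layout, neighbor naming, and per-call window creation are illustrative assumptions, and in QUDA on TCA the write is a direct GPU-to-GPU DMA rather than MPI_Put.

#include <mpi.h>

/* halo_send holds 2*n doubles (my two boundary slabs);
 * halo_recv (2*n doubles) is exposed so neighbors can write into it. */
void exchange_halo(double *halo_send, double *halo_recv, int n,
                   int left, int right, MPI_Comm comm)
{
    MPI_Win win;
    MPI_Win_create(halo_recv, (MPI_Aint)(2 * n * sizeof(double)), sizeof(double),
                   MPI_INFO_NULL, comm, &win);

    MPI_Win_fence(0, win);
    /* Write my two boundary slabs into the corresponding halves of the
     * right and left neighbors' windows. */
    MPI_Put(halo_send,     n, MPI_DOUBLE, right, 0, n, MPI_DOUBLE, win);
    MPI_Put(halo_send + n, n, MPI_DOUBLE, left,  n, n, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);

    MPI_Win_free(&win);
}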

SLIDE 38

QUDA results: Large Model (16^4)

(Figure: time per iteration (200-1400 us), broken down into calculation, Allreduce, and communication, for MPI-P2P, MPI-RMA, and TCA on 2, 4, 8, and 16 nodes with (x,y) node decompositions (2,1), (1,2), (4,1), (2,2), (1,4), (4,2), (2,4), (4,4).)

• Up to 1.15x speedup against MPI-P2P
• The 4-node configuration is the crossover point.
• Message size per dimension = 2 x (192 KB / # of nodes in each dimension)
SLIDE 39

QUDA Results: Small Model (8^4)

(Figure: time per iteration (200-1200 us), broken down into calculation, Allreduce, and communication, for MPI-P2P, MPI-RMA, and TCA on 2-16 nodes with (x,y) node decompositions.)

• Up to 1.96x speedup against MPI-P2P
• Message size per dimension = 2 x (24 KB / # of nodes in each dimension)
SLIDE 40

Summary

• TCA: Tightly Coupled Accelerators
  • TCA enables direct communication among accelerators as an element technology and becomes a basic technology for next-generation accelerated computing in the exascale era.
• PEACH2 board: an implementation realizing TCA using PCIe technology
  • Bandwidth: max. 3.5 GB/s between CPUs (over 95% of the theoretical peak), 2.8 GB/s between GPUs
  • Min. latency: 0.8 us (PIO), 1.8 us (DMA between CPUs), 2.0 us (DMA between GPUs)
  • GPU-GPU communication over the nodes can be utilized within a 16-node sub-cluster.
  • Ping-pong program: PEACH2 achieves lower latency than MPI for small data sizes.
• Collective communications on TCA
  • Allreduce: more than 2x faster than MPI
  • Allgather: slightly faster than MPI
• QUDA: TCA performs well on short messages
  • Small model: all configurations (but speedup was not shown...)
  • Large model: 8- and 16-node configurations
• FFTE: small and medium sizes are good for TCA

SLIDE 41

Future Work

• Offload functions in PEACH2
  • Reduction, etc.
• A prototype of PEACH3 is under development with PCIe Gen3 x8.
  • Altera Stratix V GX
  • Max bandwidth between CPUs is approx. 7 GB/s with Gen3 x8, double that of PEACH2. [CANDAR2014]

SLIDE 42

XcalableACC: a parallel programming language for accelerated parallel systems

Taisuke Boku, Center for Computational Sciences, University of Tsukuba

SLIDE 43

Complexity of parallel GPU programming

• Multiple orthogonal paradigms
  • MPI: arrays must be distributed and communicated (two-sided or one-sided)
  • CUDA, OpenCL, OpenACC: memory allocation, data movement (to/from host), computation
  • Controlling multiple devices, if present: CUDA 4.0 or OpenMP multithreading
• Issues
  • How to combine array distribution, internal communication, external communication, ...
  • A simple and easy-to-understand programming model is required for high productivity.

SLIDE 44

XcalableACC (XACC)

• A PGAS language (C & Fortran) with directive-based parallel programming for massively parallel accelerated computing
  • Based on our traditional PGAS language XcalableMP (XMP)
  • OpenACC is used to control the accelerator devices.
  • Developed at RIKEN AICS under the JST-CREST joint project
  • We implement the compiler and run-time system both for general MPI-based systems and for the TCA architecture.
SLIDE 45

Outline of base language XcalableMP

• Execution model: SPMD (as in MPI)
• Two programming models for the data view
  • Global view (PGAS): based on the data-parallel concept; directives similar to OpenMP are used for data and task distribution (easy programming)
  • Local view: based on local data and explicit communication (easy performance tuning)
• OpenMP-like directives
  • Incremental parallelization from the original sequential code
  • Low cost of parallelization -> high productivity
• Not "fully automatic parallelization"; the user must keep in mind that:
  • Each node processes the local data on that node.
  • The user can clearly picture the data distribution and parallelization, which eases tuning.
  • The communication targets (variables/arrays and their partitions) can be specified simply.
  • The communication points are specified by the user, in an easy manner.
SLIDE 46

Data Distribution Using Template

#pragma xmp nodes p(4)                   declare node set
#pragma xmp template t(0:99)             declare template
#pragma xmp align array[i] with t(i)     distribute array: the owner of t(i) has array[i]
#pragma xmp distribute t(BLOCK) onto p   distribute template

Template
• a virtual array representing the data (index) space
• array distribution and work-sharing must be done through the template

(Figure: double array[100] aligned with template t(0:99); with BLOCK distribution onto p(1)..p(4), each node owns 25 consecutive elements (0-24, 25-49, 50-74, 75-99).)
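A minimal sketch combining the directives above with a work-sharing loop (the loop directive is the same one used in the Laplace example on the later slides); the init function and the values written are illustrative only.

double array[100];
#pragma xmp nodes p(4)
#pragma xmp template t(0:99)
#pragma xmp distribute t(BLOCK) onto p
#pragma xmp align array[i] with t(i)

void init(void)
{
    /* each node touches only the 25 elements it owns */
#pragma xmp loop (i) on t(i)
    for (int i = 0; i < 100; i++)
        array[i] = (double)i;
}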

SLIDE 47

Data Synchronization of Arrays (shadow)

Shadow region
• In XMP, memory access is always local.
• The shadow duplicates overlapping data that is distributed onto other nodes.
• Data synchronization: the reflect directive

(Figure: array a[] (elements 1-15) distributed over NODE1-NODE4, with one-element shadow regions at the block boundaries.)

#pragma xmp shadow a[1:1]   declare shadow
#pragma xmp reflect a       synchronize shadow
SLIDE 48

Data Synchronization of Arrays (gather)

#pragma xmp gather(var=list)   gather array data (collect the entire elements)

(Figure: pieces of array[] on process0-process3 are gathered so that all elements of the array hold correct data.)
SLIDE 49

Internode Communication

• broadcast
  #pragma xmp bcast var on node from node
• barrier synchronization
  #pragma xmp barrier
• reduce operation
  #pragma xmp reduction (op:var)
• data movement in global view
  #pragma xmp gmove
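A minimal usage sketch of these directives; the variable names are illustrative, and the exact clause spellings (reduction(op:var), bcast (...) from p(1)) are assumptions based on the XcalableMP syntax rather than text taken from the slides.

double local_err, global_err;
int niter;

void end_of_iteration(void)
{
    /* broadcast the iteration count from node p(1) to all nodes */
#pragma xmp bcast (niter) from p(1)

    /* reduce the per-node error to a global maximum on every node */
    global_err = local_err;
#pragma xmp reduction (max:global_err)

    /* barrier synchronization across the node set */
#pragma xmp barrier
}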

SLIDE 50

Processing model of XACC

(Figure: a distributed array and its work are first distributed among the nodes (#0, #1) and then among the accelerators within each node; communication happens directly between accelerators or between CPUs.)

#pragma acc device d = nvidia(0:3)
#pragma xmp reflect_init (a) device
#pragma xmp loop (i) on t(i)
for (int i = 0; i < 100; i++){
#pragma acc kernels loop on_device(d)
  for (int j = 0; j < 100; j++){
    a[i][j] = ...
  }
}
#pragma xmp reflect_do (a)
SLIDE 51

Two implementations of XACC

• Based on a traditional communication library
  • for MPI
  • directive-based communication on distributed arrays is performed automatically with OpenACC data I/O and MPI communication
• Based on TCA
  • uses TCA for direct GPU-memory copies

SLIDE 52

Example of XcalableACC program

2-D Laplace Eq.

double u[XSIZE][YSIZE], uu[XSIZE][YSIZE];
#pragma xmp nodes p(x, y)
#pragma xmp template t(0:YSIZE-1, 0:XSIZE-1)
#pragma xmp distribute t(block, block) onto p
#pragma xmp align [j][i] with t(i,j) :: u, uu
#pragma xmp shadow uu[1:1][1:1]
...
#pragma acc data copy(u) copyin(uu)
{
  for(k=0; k<MAX_ITER; k++){
#pragma xmp loop (y,x) on t(y,x)
#pragma acc parallel loop collapse(2)
    for(x=1; x<XSIZE-1; x++)
      for(y=1; y<YSIZE-1; y++)
        uu[x][y] = u[x][y];

#pragma xmp reflect (uu) acc

#pragma xmp loop (y,x) on t(y,x)
#pragma acc parallel loop collapse(2)
    for(x=1; x<XSIZE-1; x++)
      for(y=1; y<YSIZE-1; y++)
        u[x][y] = (uu[x-1][y]+uu[x+1][y]+uu[x][y-1]+uu[x][y+1])/4.0;
  } // end k
} // end data
SLIDE 53

Example of XcalableACC program (2-D Laplace Eq.)

(Code listing as on SLIDE 52, here with #pragma xmp reflect (uu) shown without the acc clause.) Callouts on this slide:
• array distribution and "sleeve" declaration (the align and shadow directives)
• exchange of sleeves on array "uu" (the reflect directive)
SLIDE 54

Example of XcalableACC program (2-D Laplace Eq.)

(Code listing as on SLIDE 52.) Callouts on this slide:
• copy the partial (distributed) arrays to device memory: an array distributed by XMP is processed according to the OpenACC directive.
• the acc clause indicates that the target is the array on device memory.
SLIDE 55

Example of XcalableACC program (2-D Laplace Eq.)

(Code listing and callouts as on SLIDES 52-54.)
SLIDE 56

Performance of the Himeno Benchmark with XcalableACC

• 2-D stencil computation for fluid dynamics

(Figure: performance (GFlops) vs. number of nodes (1-16) for XACC (TCA) and OpenACC+MPI (GDR); left: size M (128x128x256), right: size L (256x256x512); a maximum improvement of 2.7x is marked on the chart; higher is better.)

For size L, the sleeve area is approximately 520 KB, so TCA's advantage over MVAPICH2-GDR is small. Additionally, TCA requires a barrier synchronization after the DMA transfer, which causes additional overhead.
SLIDE 57

Summary

• TCA is basic research on the possibility of a direct network between accelerators (GPUs) using currently available technology.
• Toward strong scaling in post-peta to exascale HPC, such a direct network for accelerators is essential.
• Language/programming is also a very important issue for high productivity across multiple programming paradigms.
  • XcalableACC + TCA is a solution.
  • Awarded the HPC Challenge Class 2 Best Performance Award at SC14

SLIDE 58

References

[AsHES2015] Kazuya Matsumoto, Toshihiro Hanawa, Yuetsu Kodama, Hisafumi Fujii, Taisuke Boku, "Implementation of CG Method on GPU Cluster with Proprietary Interconnect TCA for GPU Direct Communication," The International Workshop on Accelerators and Hybrid Exascale Systems (AsHES2015), May 2015 (to appear).

[CANDAR2014] Takuya Kuhara, Takahiro Kaneda, Toshihiro Hanawa, Yuetsu Kodama, Taisuke Boku, and Hideharu Amano, "A preliminarily evaluation of PEACH3: a switching hub for tightly coupled accelerators," 2nd International Workshop on Computer Systems and Architectures (CSA'14), in conjunction with the 2nd International Symposium on Computing and Networking (CANDAR 2014), pp. 377-381, Dec. 2014.

[WACCPD2014] Masahiro Nakao, Hitoshi Murai, Takenori Shimosaka, Akihiro Tabuchi, Toshihiro Hanawa, Yuetsu Kodama, Taisuke Boku, Mitsuhisa Sato, "XcalableACC: Extension of XcalableMP PGAS Language using OpenACC for Accelerator Clusters," Workshop on Accelerator Programming Using Directives (WACCPD 2014), in conjunction with SC14, pp. 27-36, Nov. 2014.

[HeteroPar2014] Norihisa Fujita, Hisafumi Fujii, Toshihiro Hanawa, Yuetsu Kodama, Taisuke Boku, Yoshinobu Kuramashi, and Mike Clark, "QCD Library for GPU Cluster with Proprietary Interconnect for GPU Direct Communication," 12th International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms (HeteroPar2014), LNCS 8805, pp. 251-262, Aug. 2014.

[HEART2014] Yuetsu Kodama, Toshihiro Hanawa, Taisuke Boku, and Mitsuhisa Sato, "PEACH2: FPGA based PCIe network device for Tightly Coupled Accelerators," Fifth International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies (HEART2014), pp. 3-8, Jun. 2014.

[HOTI2013] Toshihiro Hanawa, Yuetsu Kodama, Taisuke Boku, and Mitsuhisa Sato, "Interconnect for Tightly Coupled Accelerators Architecture," IEEE 21st Annual Symposium on High-Performance Interconnects (HOT Interconnects 21), short paper, pp. 79-82, Aug. 2013.

[AsHES2013] Toshihiro Hanawa, Yuetsu Kodama, Taisuke Boku, and Mitsuhisa Sato, "Tightly Coupled Accelerators Architecture for Minimizing Communication Latency among Accelerators," The Third International Workshop on Accelerators and Hybrid Exascale Systems (AsHES2013), pp. 1030-1039, May 2013.

SLIDE 59

Contact:
• Toshihiro Hanawa: hanawa@cc.u-tokyo.ac.jp
• Taisuke Boku: taisuke@cs.tsukuba.ac.jp