
SLIDE 1

Tightly Coupled Accelerators with Proprietary Interconnect and Its Programming and Applications

Toshihiro Hanawa

Information Technology Center, The University of Tokyo

Taisuke Boku

Center for Computational Sciences, University of Tsukuba

In collaboration with Yuetsu Kodama, Mitsuhisa Sato, Masayuki Umemura @ CCS, Univ. of Tsukuba; Hitoshi Murai @ RIKEN AICS; Hideharu Amano @ Keio Univ.

Mar. 19, 2015
GPU Technology Conference 2015
SLIDE 2

Agenda

• Background
  • HA-PACS / AC-CREST Project
  • Introduction of HA-PACS / TCA
• Organization of TCA
  • PEACH2 board designed for TCA
  • Evaluation of basic performance
• Collective communications
  • Implementation examples
  • Performance evaluation
• Application examples
  • QUDA (QCD)
  • FFTE (FFT)
• Introduction of XcalableACC
  • Concept
  • Code examples
  • Evaluations
• Summary

SLIDE 3

Current Trend of HPC using GPU Computing

• Advantageous features
  • High peak performance / cost ratio
  • High peak performance / power ratio
• Examples of HPC systems
  • GPU clusters and MPPs in the TOP500 (Nov. 2014)
    • 2nd: Titan (NVIDIA K20X, 27 PF)
    • 6th: Piz Daint (NVIDIA K20X, 7.8 PF)
    • 10th: Cray CS-Storm (NVIDIA K40, 6.1 PF)
    • 15th: TSUBAME2.5 (NVIDIA K20X, 5.6 PF)
    • 48 systems use NVIDIA GPUs.
  • GPU clusters in the Green500 (Nov. 2014) ("greenest" supercomputers ranked in the TOP500)
    • 3rd: TSUBAME-KFC (NVIDIA K20X, 4.4 GF/W)
    • 4th: Cray Storm1 (NVIDIA K40, 3.9 GF/W)
    • 7th: HA-PACS/TCA (NVIDIA K20X, 3.5 GF/W)
    • 8 of the top 10 systems use NVIDIA GPUs.

SLIDE 4

Issues of GPU Computing

• Data I/O performance limitation
  • Ex.) K20X: PCIe Gen2 x16 peak 8 GB/s (I/O) ⇔ 1.3 TFLOPS (computation)
  • The communication bottleneck becomes significant in multi-GPU applications.
• Strong scaling on GPU clusters
  • Important to shorten the turn-around time of production runs
  • Communication latency has a heavy impact.
• Ultra-low latency between GPUs is important for next-generation HPC.

Our target is to develop a direct communication system between GPUs across nodes as a feasibility study for future accelerated computing.
⇒ "Tightly Coupled Accelerators (TCA)" architecture

SLIDE 5

HA-PACS Project

HA-PACS (Highly Accelerated Parallel Advanced system for Computational Sciences)
• 8th generation of the PAX/PACS series supercomputers at the University of Tsukuba
• FY2011-2013, operation until FY2016(?)
• Promotion of computational science applications in key areas at CCS, Tsukuba
  • Target fields: QCD, astrophysics, QM/MM (quantum mechanics / molecular mechanics, bioscience)
• HA-PACS is not only a "commodity GPU cluster" but also an experimental platform.

HA-PACS base cluster
• For developing GPU-accelerated code for the target fields and for production runs
• In operation since Feb. 2012

HA-PACS/TCA (TCA = Tightly Coupled Accelerators)
• For elementary research on direct communication technology for accelerated computing
• Our original communication chip, "PEACH2", is installed in each node.
• In operation since Nov. 2013

SLIDE 6

AC-CREST Project

• Project "Research and Development on Unified Environment of Accelerated Computing and Interconnection for Post-Petascale Era" (AC-CREST)
• Objectives
  • Realization of high-performance (direct) communication among accelerators
  • Development of system software supporting the communication system among accelerators
  • Development of parallel languages and compilers
    • Higher productivity
    • Highly optimized (offload, communication)
  • Development of practical applications
• Supported by the JST-CREST program "Development of System Software Technologies for post-Peta Scale High Performance Computing"
SLIDE 7

What is “Tightly Coupled Accelerators (TCA)” ?

Concept:
• Direct connection between accelerators (GPUs) over the nodes, without CPU assistance
  • Eliminate extra memory copies to the host
  • Reduce latency; improve strong scaling with small data sizes
  • Enable hardware support for complicated communication patterns

SLIDE 8

Communication on TCA Architecture

(Figure: two nodes, each with a CPU + CPU memory, a GPU + GPU memory, and a PCIe switch; a PEACH2 in each node links the two nodes directly over PCIe.)

• PCIe is used as the communication link between accelerators over the nodes.
  • Direct device-to-device (P2P) communication is available through PCIe.
• PEACH2: PCI Express Adaptive Communication Hub ver. 2
  • Implementation of the interface and data transfer engine for TCA

SLIDE 9

GPU Communication with Traditional MPI

• Traditional MPI over InfiniBand requires three data copies.
• The data copies between CPU and GPU (steps 1 and 3) have to be performed manually.

(Figure: 1: copy from GPU memory to CPU memory through PCI Express (PCIe); 2: data transfer over IB; 3: copy from CPU memory to GPU memory through PCIe.)
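For reference, a minimal sketch of this three-copy path with standard CUDA runtime and MPI calls; the function and buffer names and the use of MPI_Sendrecv are illustrative, not taken from the slides.

#include <mpi.h>
#include <cuda_runtime.h>

/* Exchange n doubles with a peer rank, staging through host memory. */
void exchange_via_host(const double *d_send, double *d_recv,
                       double *h_send, double *h_recv,
                       size_t n, int peer)
{
    /* 1: copy from GPU memory to CPU memory through PCIe */
    cudaMemcpy(h_send, d_send, n * sizeof(double), cudaMemcpyDeviceToHost);

    /* 2: data transfer over InfiniBand (here via MPI) */
    MPI_Sendrecv(h_send, (int)n, MPI_DOUBLE, peer, 0,
                 h_recv, (int)n, MPI_DOUBLE, peer, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* 3: copy from CPU memory back to GPU memory through PCIe */
    cudaMemcpy(d_recv, h_recv, n * sizeof(double), cudaMemcpyHostToDevice);
}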

SLIDE 10

GPU Communication with IB/GDR

• With GDR, the InfiniBand controller reads and writes GPU memory directly.
  • The temporary data copy is eliminated.
  • Lower latency than the previous method
  • Protocol conversion is still needed.

(Figure: 1: direct data transfer (PCIe -> IB -> PCIe) between GPU memories, without staging in CPU memory.)

SLIDE 11

GPU Communication with TCA (PEACH2)

• TCA does not need protocol conversion.
  • Direct data copy using GDR
  • Much lower latency than InfiniBand

(Figure: 1: direct data transfer (PCIe -> PCIe -> PCIe) between GPU memories through the TCA (PEACH2) link.)

SLIDE 12

TCA node structure example

• PEACH2 can access all GPUs.
  • NVIDIA Kepler architecture + "GPUDirect Support for RDMA" are required.
• Connection among 3 nodes using the remaining PEACH2 ports
• Similar to an ordinary GPU cluster configuration except for PEACH2
  • 80 PCIe lanes are required.

(Figure: node block diagram with two Xeon E5 v2 CPUs connected by QPI; four NVIDIA K20X GPUs on PCIe Gen2 x16 links, PEACH2 on Gen2 x8 with three external Gen2 x8 ports, and a 2-port QDR InfiniBand HCA on Gen3 x8; the whole node forms a single PCI address space.)
SLIDE 13

TCA node structure example

Actually,
• Performance over QPI is miserable.
• PEACH2 is available for GPU0 and GPU1.
• Note that InfiniBand with GPUDirect for RDMA is available only for GPU2 and GPU3.
• Similar to an ordinary GPU cluster configuration except for PEACH2
  • 80 PCIe lanes are required.

(Figure: same node block diagram as SLIDE 12; GPU: NVIDIA K20X.)
SLIDE 14

Design of PEACH2

• Implemented in an FPGA with four PCIe Gen2 IPs
  • Altera Stratix IV GX
  • Prototyping, flexible enhancement
• Sufficient communication bandwidth
  • PCI Express Gen2 x8 for each port (40 Gbps = IB QDR)
• Sophisticated DMA controller
  • Chaining DMA, block-stride transfer function
• Latency reduction
  • Hardwired logic
  • Low-overhead routing mechanism
    • Efficient address mapping in the PCIe address area using unused bits
    • Simple comparator for deciding the output port

It is not only a proof-of-concept implementation; it is also available for production runs in a GPU cluster.

SLIDE 15

PEACH2 board (Production version for HA-PACS/TCA)


(Photo: main board + sub board. Most of the logic operates at 250 MHz (the PCIe Gen2 logic runs at 250 MHz). Visible parts: PCI Express x8 card edge, power supplies for various voltages, DDR3 SDRAM, FPGA (Altera Stratix IV 530GX), PCIe x16 cable connector, and PCIe x8 cable connector.)
SLIDE 16

HA-PACS/TCA Compute Node

(Photos: rear and front views of the 3U compute node (8 nodes per rack); the PEACH2 board is installed here.)
SLIDE 17

Inside of HA-PACS/TCA Compute Node

SLIDE 18
Spec. of HA-PACS base cluster & HA-PACS/TCA

                     Base cluster (Feb. 2012)                    TCA (Nov. 2013)
Node                 CRAY GreenBlade 8204                        CRAY 3623G4-SM
MotherBoard          Intel Washington Pass                       SuperMicro X9DRG-QF
CPU                  Intel Xeon E5-2670 x 2 sockets              Intel Xeon E5-2680 v2 x 2 sockets
                     (SandyBridge-EP, 2.6 GHz, 8 cores)          (IvyBridge-EP, 2.8 GHz, 10 cores)
Memory               DDR3-1600, 128 GB                           DDR3-1866, 128 GB
GPU                  NVIDIA M2090 x 4                            NVIDIA K20X x 4
# of Nodes (Racks)   268 (26)                                    64 (10)
Interconnect         Mellanox InfiniBand QDR x 2 (ConnectX-3)    Mellanox InfiniBand QDR x 2 + PEACH2
Peak Perf.           802 TFlops                                  364 TFlops
Power                408 kW                                      99.3 kW


In total, HA-PACS is an over-1-PFlops system!
SLIDE 19

HA-PACS/TCA (Compute Node)

(Figure: compute-node block diagram. CPUs: 2x Ivy Bridge, AVX, 2.8 GHz x 8 flop/clock, 22.4 GFLOPS x 20 cores = 448.0 GFLOPS; GPUs: 4x NVIDIA K20X, 1.31 TFLOPS x 4 = 5.24 TFLOPS; node total 5.688 TFLOPS. Host memory: 4 channels of DDR3-1,866 MHz per CPU at 59.7 GB/s, (16 GB, 14.9 GB/s) x 8 = 128 GB, 119.4 GB/s. GPU memory: (6 GB, 250 GB/s) x 4 = 24 GB, 1 TB/s. Each GPU is attached via PCIe Gen2 x16 (8 GB/s); the PEACH2 board (proprietary interconnect for TCA) has three Gen2 x8 links; parts shown in red were upgraded from the base cluster to TCA.)
SLIDE 20

HA-PACS/TCA (since Nov. 2013) + Base cluster

(Photo: HA-PACS base cluster and TCA racks.)

LINPACK: 277 TFlops (efficiency 76%), 3.52 GFLOPS/W (#3 in the Green500, Nov. 2013)
SLIDE 21

Configuration of TCA Sub-cluster (16 nodes/group)

• Each group consists of 2 racks (16 nodes); HA-PACS/TCA includes 4 TCA groups.
  • Orange: ring; red: cross links between the 2 rings
• Within a TCA sub-cluster, 32 GPUs can be treated seamlessly
  • limited to the 2 GPUs under the same socket on each node.

SLIDE 22

Communication on TCA

• TCA provides two types of communication (see the table below).
• DMA controller function
  • Chaining
    • Multiple DMA descriptors are chained in memory.
    • The DMA transactions are then operated automatically by hardware.
  • Block-stride support
  • 4-channel DMA engine

Comm. type   Min. latency        Bandwidth   How it works            Comm. patterns
DMA          Low (< 2 us)        High        DMA controller          Any (CPU or GPU)
PIO          Very low (< 1 us)   Low         CPU's write operation   CPU-CPU

(Figure: chained DMA descriptors in memory; the DMAC control register points to the head of the descriptor list, each descriptor holds Source, Destination, Length, Flags, and Next, and the last descriptor's Next is NULL.)
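To make the chaining idea concrete, here is a small C sketch; the field names, widths, and layout are assumptions for illustration only and do not reflect the actual PEACH2 descriptor format or register interface.

#include <stddef.h>
#include <stdint.h>

/* One DMA descriptor: source, destination, length, flags, next (as on the slide). */
struct dma_desc {
    uint64_t src;           /* PCIe address of the source (CPU or GPU memory) */
    uint64_t dst;           /* PCIe address of the destination */
    uint32_t len;           /* transfer length in bytes */
    uint32_t flags;         /* e.g. block-stride or completion-notify bits (assumed) */
    struct dma_desc *next;  /* next descriptor; NULL terminates the chain */
};

/* Link n descriptors into a chain; once the head is handed to the DMAC
 * control register, the hardware walks the chain without CPU involvement. */
static void chain_descriptors(struct dma_desc *d, size_t n)
{
    for (size_t i = 0; i + 1 < n; i++)
        d[i].next = &d[i + 1];
    d[n - 1].next = NULL;
}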

SLIDE 23

Evaluation Results

• Ping-pong performance between nodes
  • Latency and bandwidth
  • Written as an application-level program
• Comparison
  • MVAPICH2-GDR 2.0b (with/without GPUDirect support) for GPU-GPU communication on TCA nodes
    • A single InfiniBand QDR link (40 Gbps) is used, which has the same link performance as PEACH2.
  • Performance over QPI on TCA nodes
• To access GPU memory from another device, the "GPUDirect Support for RDMA" API of CUDA 5 is used.
  • A special driver named "TCA p2p driver" was developed to enable the memory mapping.
  • A "PEACH2 driver" to control the board was also developed.

SLIDE 24

Ping-pong Latency

Minimum latency (nearest-neighbor communication):
• PIO: CPU to CPU: 0.8 us
• DMA: CPU to CPU: 1.8 us; GPU to GPU: 2.0 us
  • cf. MVAPICH2-GDR 2.0: 4.5 us (with GDR), 17 us (without GDR)

(Figure: ping-pong latency (1-10 usec) vs. data size (8 B to 32 KB) for PIO, DMA (CPU), DMA (GPU), and MVAPICH2-GDR 2.0.)
SLIDE 25

Ping-pong Latency

Minimum latency (nearest-neighbor communication):
• PIO: CPU to CPU: 0.8 us
• DMA: CPU to CPU: 1.8 us; GPU to GPU: 2.3 us
• Forwarding overhead: 200-300 nsec
• Bandwidth converges to the same peak for various hop counts.

(Figure: DMA (CPU) ping-pong latency (1-8 usec) vs. data size (8 B to 32 KB) for direct, 1-hop, 2-hop, and 3-hop transfers.)
SLIDE 26

Ping-pong Bandwidth

• Max. 3.5 GB/s between CPUs
  • 95% of the theoretical peak
  • With Max Payload Size = 256 bytes, the detailed theoretical peak is 4 GB/s x 256 / (256 + 24) = 3.66 GB/s.
• Converges to the same peak as the hop count increases
• GPU-GPU DMA performance is up to 2.8 GB/s
  • better than MVAPICH2-GDR below 1 MB
• Over QPI: limited to 360 MB/s
• SB (SandyBridge): limited to 880 MB/s due to PCIe switch performance

(Figure: ping-pong bandwidth (MB/s) vs. message size (8 B to 512 KB) for DMA (GPU), DMA (CPU), MVAPICH2-GDR 2.0, DMA (SB, GPU), and DMA (QPI, GPU); DMA (CPU) reaches 3.5 GB/s and DMA (GPU) reaches 2.8 GB/s.)
SLIDE 27

GPUDirect behavior in MVAPICH2-GDR

From the README:
• MV2_GPUDIRECT_LIMIT (default: 8192)
• MV2_USE_GPUDIRECT_RECEIVE_LIMIT (default: 131072)

Resulting behavior for message size X:
• X <= 8 KB: GDR read + GDR write
• 8 KB < X <= 128 KB: memcpy H2D + GDR write
• X > 128 KB: memcpy H2D + memcpy D2H
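The README behavior above amounts to a size-based protocol selection; the following C sketch simply restates those thresholds. It is not MVAPICH2 source code, and the enum and function names are made up for illustration.

#include <stdlib.h>

enum gdr_path {
    PATH_GDR_READ_GDR_WRITE,   /* X <= 8 KB: GDR read + GDR write            */
    PATH_HOSTCOPY_GDR_WRITE,   /* 8 KB < X <= 128 KB: memcpy H2D + GDR write */
    PATH_HOSTCOPY_ONLY         /* X > 128 KB: memcpy H2D + memcpy D2H        */
};

enum gdr_path select_path(size_t x)
{
    size_t gdr_limit  = 8192;    /* MV2_GPUDIRECT_LIMIT default */
    size_t recv_limit = 131072;  /* MV2_USE_GPUDIRECT_RECEIVE_LIMIT default */
    const char *env;

    if ((env = getenv("MV2_GPUDIRECT_LIMIT")) != NULL)
        gdr_limit = strtoul(env, NULL, 10);
    if ((env = getenv("MV2_USE_GPUDIRECT_RECEIVE_LIMIT")) != NULL)
        recv_limit = strtoul(env, NULL, 10);

    if (x <= gdr_limit)  return PATH_GDR_READ_GDR_WRITE;
    if (x <= recv_limit) return PATH_HOSTCOPY_GDR_WRITE;
    return PATH_HOSTCOPY_ONLY;
}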

SLIDE 28

Collective Communications

• Allgather
  • All processes gather the data of every process.
  • Communication bandwidth as well as latency is important.
  • Implemented with GPU-GPU DMA
• Allreduce
  • Performs the specified operation over the data arrays of all processes and stores the result on every process.
  • Latency decides the performance.
  • Implemented with CPU-CPU PIO plus a host copy
• Alltoall
  • All processes exchange specific data with each other (transpose).
  • Communication bandwidth is important.
  • Implemented with GPU-GPU DMA; all requests to every node are chained. [AsHES2015] (to appear)

Packet contention may occur on the ring, so optimization is required.
SLIDE 29

Allgather Implementation: Recursive Doubling

Initial State

• Requires log2(p) steps
  • Ex.: p = 16 => 4 steps
• Node mapping optimization
  1. Same hop count between any pair of nodes in every step
  2. Communicate data with a neighboring node in the last step
SLIDE 30

Allgather Implementation: Recursive Doubling (Step 1; bullets as on SLIDE 29)
SLIDE 31

Allgather Implementation: Recursive Doubling (Step 2; bullets as on SLIDE 29)
SLIDE 32

Allgather Implementation: Recursive Doubling (Step 3; bullets as on SLIDE 29)
SLIDE 33

Allgather Implementation: Recursive Doubling (Step 4, the final step; bullets as on SLIDE 29)
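As a reference for the algorithm itself (independent of TCA), the recursive-doubling exchange of SLIDES 29-33 can be sketched with plain MPI as below. It assumes the number of processes is a power of two; the TCA implementation replaces the pairwise exchange with chained GPU-GPU DMA.

#include <mpi.h>
#include <string.h>

/* Recursive-doubling allgather: log2(p) steps, p assumed to be a power of two.
 * Each rank contributes bytes_per_rank bytes; recvbuf holds p such blocks. */
void allgather_rd(const void *sendbuf, void *recvbuf, int bytes_per_rank,
                  MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    char *buf = (char *)recvbuf;
    memcpy(buf + (size_t)rank * bytes_per_rank, sendbuf, bytes_per_rank);

    for (int step = 1; step < p; step <<= 1) {
        int peer       = rank ^ step;           /* partner at distance "step"      */
        int my_block   = (rank / step) * step;  /* block of ranks I already hold   */
        int peer_block = (peer / step) * step;  /* block the partner already holds */

        MPI_Sendrecv(buf + (size_t)my_block   * bytes_per_rank,
                     step * bytes_per_rank, MPI_BYTE, peer, 0,
                     buf + (size_t)peer_block * bytes_per_rank,
                     step * bytes_per_rank, MPI_BYTE, peer, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}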

SLIDE 34

Allgather Performance Comparison among Various Algorithms

• Time for all-gathering 128 KB of data
  • the N = 16384 case in the CG method
• Recursive Doubling shows good performance.
• However, at p = 16, TCA is slower than MPI at this data size.

(Figure: communication time (50-250 usec) vs. number of processes (2, 4, 8, 16) for Ring, Neighbor Exchange, Recursive Doubling, Dissemination, and MPI; lower is better.)
SLIDE 35

Allreduce Performance

• Allreduce time for 8-byte scalar data
• Dissemination is the fastest.
• TCA is more than twice as fast as MPI.
• The low latency of TCA works effectively.

(Figure: communication time (10-30 usec) vs. number of processes (2, 4, 8, 16) for Ring, Neighbor Exchange, Recursive Doubling, Dissemination, and MPI; lower is better.)
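For reference, the dissemination pattern for an 8-byte scalar can be sketched with plain MPI as follows. The sketch assumes a sum operation and a power-of-two process count (as in the 2/4/8/16-node runs above); the TCA version performs the exchange with CPU-CPU PIO plus a host copy instead of MPI messages.

#include <mpi.h>

/* Dissemination allreduce (sum) of one double; assumes p is a power of two. */
double allreduce_sum_dissemination(double my_val, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    double acc = my_val;
    for (int dist = 1; dist < p; dist <<= 1) {
        int to   = (rank + dist) % p;       /* send my partial sum forward      */
        int from = (rank - dist + p) % p;   /* receive a partial sum from behind */
        double incoming;
        MPI_Sendrecv(&acc, 1, MPI_DOUBLE, to, 0,
                     &incoming, 1, MPI_DOUBLE, from, 0,
                     comm, MPI_STATUS_IGNORE);
        acc += incoming;                    /* after log2(p) rounds, acc is the global sum */
    }
    return acc;
}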

SLIDE 36

QUDA

• QUDA: an open-source Lattice QCD library
  • widely used as an LQCD library for NVIDIA GPUs
  • optimized for NVIDIA GPUs; all calculations run on the GPUs
  • solves a linear equation using the CG method
  • supports inter-node parallelism and multiple GPUs per node
  • source code is available at GitHub: https://github.com/lattice/quda

[HeteroPar2014]
SLIDE 37

Communication in QUDA

• Halo data exchange with RMA is dominant.
  • Each process writes data into its neighbor processes' memory regions.
• Allreduce communication in CG
  • Latency is important.

(Figure: halo data exchange between neighboring sub-lattices, followed by an Allreduce.)
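A minimal sketch of the RMA-style halo write using plain MPI one-sided calls; the buffer layout, neighbor naming, and per-call window creation are illustrative assumptions, and in QUDA on TCA the write is a direct GPU-to-GPU DMA rather than MPI_Put.

#include <mpi.h>

/* halo_send holds 2*n doubles (my two boundary slabs);
 * halo_recv (2*n doubles) is exposed so neighbors can write into it. */
void exchange_halo(double *halo_send, double *halo_recv, int n,
                   int left, int right, MPI_Comm comm)
{
    MPI_Win win;
    MPI_Win_create(halo_recv, (MPI_Aint)(2 * n * sizeof(double)), sizeof(double),
                   MPI_INFO_NULL, comm, &win);

    MPI_Win_fence(0, win);
    /* Write my two boundary slabs into the corresponding halves of the
     * right and left neighbors' windows. */
    MPI_Put(halo_send,     n, MPI_DOUBLE, right, 0, n, MPI_DOUBLE, win);
    MPI_Put(halo_send + n, n, MPI_DOUBLE, left,  n, n, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);

    MPI_Win_free(&win);
}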

SLIDE 38

QUDA results: Large Model (16^4)

(Figure: time per iteration (200-1400 us), broken down into calculation, Allreduce, and communication, for MPI-P2P, MPI-RMA, and TCA on 2, 4, 8, and 16 nodes with (x,y) node decompositions (2,1), (1,2), (4,1), (2,2), (1,4), (4,2), (2,4), (4,4).)

• Up to 1.15x speedup against MPI-P2P
• The 4-node configuration is the crossover point.
• Message size per dimension = 2 x (192 KB / # of nodes in each dimension)
SLIDE 39

QUDA Results: Small Model (8^4)

(Figure: time per iteration (200-1200 us), broken down into calculation, Allreduce, and communication, for MPI-P2P, MPI-RMA, and TCA on 2-16 nodes with (x,y) node decompositions.)

• Up to 1.96x speedup against MPI-P2P
• Message size per dimension = 2 x (24 KB / # of nodes in each dimension)
SLIDE 40

Summary

• TCA: Tightly Coupled Accelerators
  • TCA enables direct communication among accelerators as an element technology and becomes a basic technology for next-generation accelerated computing in the exascale era.
• PEACH2 board: an implementation realizing TCA using PCIe technology
  • Bandwidth: max. 3.5 GB/s between CPUs (over 95% of the theoretical peak), 2.8 GB/s between GPUs
  • Min. latency: 0.8 us (PIO), 1.8 us (DMA between CPUs), 2.0 us (DMA between GPUs)
  • GPU-GPU communication over the nodes can be utilized within a 16-node sub-cluster.
  • Ping-pong program: PEACH2 achieves lower latency than MPI for small data sizes.
• Collective communications on TCA
  • Allreduce: more than 2x faster than MPI
  • Allgather: slightly faster than MPI
• QUDA: TCA performs well on short messages
  • Small model: all configurations (but speedup was not shown...)
  • Large model: 8- and 16-node configurations
• FFTE: small and medium sizes are good for TCA

SLIDE 41

Future Work

• Offload functions in PEACH2
  • Reduction, etc.
• A prototype of PEACH3 is under development with PCIe Gen3 x8.
  • Altera Stratix V GX
  • Max bandwidth between CPUs is approx. 7 GB/s with Gen3 x8, double that of PEACH2. [CANDAR2014]

SLIDE 42

XcalableACC: a parallel programming language for accelerated parallel systems

Taisuke Boku, Center for Computational Sciences, University of Tsukuba

SLIDE 43

Complexity of parallel GPU programming

• Multiple orthogonal paradigms
  • MPI: arrays must be distributed and communicated (two-sided or one-sided)
  • CUDA, OpenCL, OpenACC: memory allocation, data movement (to/from host), computation
  • Controlling multiple devices, if present: CUDA 4.0 or OpenMP multithreading
• Issues
  • How to combine array distribution, internal communication, external communication, ...
  • A simple and easy-to-understand programming model is required for high productivity.

SLIDE 44

XcalableACC (XACC)

• A PGAS language (C & Fortran) with directive-based parallel programming for massively parallel accelerated computing
  • Based on our traditional PGAS language XcalableMP (XMP)
  • OpenACC is used to control the accelerator devices.
  • Developed at RIKEN AICS under the JST-CREST joint project
  • We implement the compiler and run-time system both for general MPI-based systems and for the TCA architecture.
SLIDE 45

Outline of base language XcalableMP

• Execution model: SPMD (as in MPI)
• Two programming models for the data view
  • Global view (PGAS): based on the data-parallel concept; directives similar to OpenMP are used for data and task distribution (easy programming)
  • Local view: based on local data and explicit communication (easy performance tuning)
• OpenMP-like directives
  • Incremental parallelization from the original sequential code
  • Low cost of parallelization -> high productivity
• Not "fully automatic parallelization"; the user must keep in mind that:
  • Each node processes the local data on that node.
  • The user can clearly picture the data distribution and parallelization, which eases tuning.
  • The communication targets (variables/arrays and their partitions) can be specified simply.
  • The communication points are specified by the user, in an easy manner.
SLIDE 46

Data Distribution Using Template

#pragma xmp nodes p(4)                   declare node set
#pragma xmp template t(0:99)             declare template
#pragma xmp align array[i] with t(i)     distribute array: the owner of t(i) has array[i]
#pragma xmp distribute t(BLOCK) onto p   distribute template

Template
• a virtual array representing the data (index) space
• array distribution and work-sharing must be done through the template

(Figure: double array[100] aligned with template t(0:99); with BLOCK distribution onto p(1)..p(4), each node owns 25 consecutive elements (0-24, 25-49, 50-74, 75-99).)
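A minimal sketch combining the directives above with a work-sharing loop (the loop directive is the same one used in the Laplace example on the later slides); the init function and the values written are illustrative only.

double array[100];
#pragma xmp nodes p(4)
#pragma xmp template t(0:99)
#pragma xmp distribute t(BLOCK) onto p
#pragma xmp align array[i] with t(i)

void init(void)
{
    /* each node touches only the 25 elements it owns */
#pragma xmp loop (i) on t(i)
    for (int i = 0; i < 100; i++)
        array[i] = (double)i;
}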

SLIDE 47

Data Synchronization of Arrays (shadow)

Shadow region
• In XMP, memory access is always local.
• The shadow duplicates overlapping data that is distributed onto other nodes.
• Data synchronization: the reflect directive

(Figure: array a[] (elements 1-15) distributed over NODE1-NODE4, with one-element shadow regions at the block boundaries.)

#pragma xmp shadow a[1:1]   declare shadow
#pragma xmp reflect a       synchronize shadow
SLIDE 48

Data Synchronization of Arrays (gather)

#pragma xmp gather(var=list)   gather array data (collect the entire elements)

(Figure: pieces of array[] on process0-process3 are gathered so that all elements of the array hold correct data.)
SLIDE 49

Internode Communication

• broadcast
  #pragma xmp bcast var on node from node
• barrier synchronization
  #pragma xmp barrier
• reduce operation
  #pragma xmp reduction (op:var)
• data movement in global view
  #pragma xmp gmove
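A minimal usage sketch of these directives; the variable names are illustrative, and the exact clause spellings (reduction(op:var), bcast (...) from p(1)) are assumptions based on the XcalableMP syntax rather than text taken from the slides.

double local_err, global_err;
int niter;

void end_of_iteration(void)
{
    /* broadcast the iteration count from node p(1) to all nodes */
#pragma xmp bcast (niter) from p(1)

    /* reduce the per-node error to a global maximum on every node */
    global_err = local_err;
#pragma xmp reduction (max:global_err)

    /* barrier synchronization across the node set */
#pragma xmp barrier
}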

SLIDE 50

Processing model of XACC

(Figure: a distributed array and its work are first distributed among the nodes (#0, #1) and then among the accelerators within each node; communication happens directly between accelerators or between CPUs.)

#pragma acc device d = nvidia(0:3)
#pragma xmp reflect_init (a) device
#pragma xmp loop (i) on t(i)
for (int i = 0; i < 100; i++){
#pragma acc kernels loop on_device(d)
  for (int j = 0; j < 100; j++){
    a[i][j] = ...
  }
}
#pragma xmp reflect_do (a)
SLIDE 51

Two implementations of XACC

• Based on a traditional communication library
  • for MPI
  • directive-based communication on distributed arrays is performed automatically with OpenACC data I/O and MPI communication
• Based on TCA
  • uses TCA for direct GPU-memory copies

SLIDE 52

Example of XcalableACC program

2-D Laplace Eq.

double u[XSIZE][YSIZE], uu[XSIZE][YSIZE];
#pragma xmp nodes p(x, y)
#pragma xmp template t(0:YSIZE-1, 0:XSIZE-1)
#pragma xmp distribute t(block, block) onto p
#pragma xmp align [j][i] with t(i,j) :: u, uu
#pragma xmp shadow uu[1:1][1:1]
...
#pragma acc data copy(u) copyin(uu)
{
  for(k=0; k<MAX_ITER; k++){
#pragma xmp loop (y,x) on t(y,x)
#pragma acc parallel loop collapse(2)
    for(x=1; x<XSIZE-1; x++)
      for(y=1; y<YSIZE-1; y++)
        uu[x][y] = u[x][y];

#pragma xmp reflect (uu) acc

#pragma xmp loop (y,x) on t(y,x)
#pragma acc parallel loop collapse(2)
    for(x=1; x<XSIZE-1; x++)
      for(y=1; y<YSIZE-1; y++)
        u[x][y] = (uu[x-1][y]+uu[x+1][y]+uu[x][y-1]+uu[x][y+1])/4.0;
  } // end k
} // end data
SLIDE 53

Example of XcalableACC program (2-D Laplace Eq.)

(Code listing as on SLIDE 52, here with #pragma xmp reflect (uu) shown without the acc clause.) Callouts on this slide:
• array distribution and "sleeve" declaration (the align and shadow directives)
• exchange of sleeves on array "uu" (the reflect directive)
SLIDE 54

Example of XcalableACC program (2-D Laplace Eq.)

(Code listing as on SLIDE 52.) Callouts on this slide:
• copy the partial (distributed) arrays to device memory: an array distributed by XMP is processed according to the OpenACC directive.
• the acc clause indicates that the target is the array on device memory.
SLIDE 55

Example of XcalableACC program (2-D Laplace Eq.)

(Code listing and callouts as on SLIDES 52-54.)
SLIDE 56

Performance of the Himeno Benchmark with XcalableACC

• 2-D stencil computation for fluid dynamics

(Figure: performance (GFlops) vs. number of nodes (1-16) for XACC (TCA) and OpenACC+MPI (GDR); left: size M (128x128x256), right: size L (256x256x512); a maximum improvement of 2.7x is marked on the chart; higher is better.)

For size L, the sleeve area is approximately 520 KB, so TCA's advantage over MVAPICH2-GDR is small. Additionally, TCA requires a barrier synchronization after the DMA transfer, which causes additional overhead.
SLIDE 57

Summary

• TCA is basic research on the possibility of a direct network between accelerators (GPUs) using currently available technology.
• Toward strong scaling in post-peta to exascale HPC, such a direct network for accelerators is essential.
• Language/programming is also a very important issue for high productivity across multiple programming paradigms.
  • XcalableACC + TCA is a solution.
  • Awarded the HPC Challenge Class 2 Best Performance Award at SC14

SLIDE 58

References

[AsHES2015] Kazuya Matsumoto, Toshihiro Hanawa, Yuetsu Kodama, Hisafumi Fujii, Taisuke Boku, "Implementation of CG Method on GPU Cluster with Proprietary Interconnect TCA for GPU Direct Communication," The International Workshop on Accelerators and Hybrid Exascale Systems (AsHES2015), May 2015 (to appear).

[CANDAR2014] Takuya Kuhara, Takahiro Kaneda, Toshihiro Hanawa, Yuetsu Kodama, Taisuke Boku, and Hideharu Amano, "A preliminarily evaluation of PEACH3: a switching hub for tightly coupled accelerators," 2nd International Workshop on Computer Systems and Architectures (CSA'14), in conjunction with the 2nd International Symposium on Computing and Networking (CANDAR 2014), pp. 377-381, Dec. 2014.

[WACCPD2014] Masahiro Nakao, Hitoshi Murai, Takenori Shimosaka, Akihiro Tabuchi, Toshihiro Hanawa, Yuetsu Kodama, Taisuke Boku, Mitsuhisa Sato, "XcalableACC: Extension of XcalableMP PGAS Language using OpenACC for Accelerator Clusters," Workshop on Accelerator Programming Using Directives (WACCPD 2014), in conjunction with SC14, pp. 27-36, Nov. 2014.

[HeteroPar2014] Norihisa Fujita, Hisafumi Fujii, Toshihiro Hanawa, Yuetsu Kodama, Taisuke Boku, Yoshinobu Kuramashi, and Mike Clark, "QCD Library for GPU Cluster with Proprietary Interconnect for GPU Direct Communication," 12th International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms (HeteroPar2014), LNCS 8805, pp. 251-262, Aug. 2014.

[HEART2014] Yuetsu Kodama, Toshihiro Hanawa, Taisuke Boku, and Mitsuhisa Sato, "PEACH2: FPGA based PCIe network device for Tightly Coupled Accelerators," Fifth International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies (HEART2014), pp. 3-8, Jun. 2014.

[HOTI2013] Toshihiro Hanawa, Yuetsu Kodama, Taisuke Boku, and Mitsuhisa Sato, "Interconnect for Tightly Coupled Accelerators Architecture," IEEE 21st Annual Symposium on High-Performance Interconnects (HOT Interconnects 21), short paper, pp. 79-82, Aug. 2013.

[AsHES2013] Toshihiro Hanawa, Yuetsu Kodama, Taisuke Boku, and Mitsuhisa Sato, "Tightly Coupled Accelerators Architecture for Minimizing Communication Latency among Accelerators," The Third International Workshop on Accelerators and Hybrid Exascale Systems (AsHES2013), pp. 1030-1039, May 2013.

SLIDE 59

Contact:
• Toshihiro Hanawa: hanawa@cc.u-tokyo.ac.jp
• Taisuke Boku: taisuke@cs.tsukuba.ac.jp