SLIDE 1

QCD Library for GPU Cluster with Proprietary Interconnect for GPU Direct Communication

Norihisa Fujita †1, Hisafumi Fujii †1, Toshihiro Hanawa †2, Yuetsu Kodama †3,
Taisuke Boku †1,3, Yoshinobu Kuramashi †3, Mike Clark †4

†1: Graduate School of Systems and Information Engineering, University of Tsukuba
†2: Information Technology Center, The University of Tokyo
†3: Center for Computational Sciences, University of Tsukuba
†4: NVIDIA Corporation

SLIDE 2

Overview

  • Motivation
  • Tightly Coupled Accelerators (TCA) Architecture
    – PEACH2 Board
  • QUDA
    – Detail of the MPI-3 RMA Implementation
    – Detail of the TCA Implementation
  • Performance Evaluation
  • Conclusion
SLIDE 3

Motivation

  • Recently, HPC systems that use GPUs as accelerators have become widely used
  • For parallel GPU computing, communication between GPUs becomes a performance bottleneck
    – especially for strong-scaling problems
    – an extra data copy is required
    – longer latency than CPU-only communication
  • A direct communication method between GPUs will be a solution
SLIDE 4

Motivation

  • We have been developing the TCA architecture
    – an interconnect for direct communication among accelerators
  • We apply the TCA architecture to a parallel GPU application
    – the Lattice QCD library QUDA
      • add TCA support to QUDA
    – Evaluate the performance of the TCA architecture
      • compare the original version, the MPI-3 RMA version, and TCA
SLIDE 5

Overview

  • Motivation
  • Tightly Coupled Accelerators (TCA) Architecture
    – PEACH2 Board
  • QUDA
    – Detail of the MPI-3 RMA Implementation
    – Detail of the TCA Implementation
  • Performance Evaluation
  • Conclusion
SLIDE 6

Tightly Coupled Accelerators (TCA) Architecture

  • TCA is an interconnection network for direct communication between accelerators
    – developed at the Center for Computational Sciences (CCS), University of Tsukuba
  • TCA can transfer data from one GPU to another without CPU assistance
    – No temporary data copy to the CPU's memory
    – No CPU computation power is required
SLIDE 7

TCA Architecture

  • TCA connects GPUs using PCIe
    – Nodes are connected by PCIe
    – for ultra-low latency communication among GPUs

[Figure: two nodes, each with a CPU, CPU memory, a PCIe switch, a GPU, and GPU memory; the PEACH2 boards of the two nodes are connected to each other over PCIe]
SLIDE 8

PCI Express Adaptive Communication Hub ver. 2 (PEACH2) Board

  • PEACH2 is a TCA implementation for GPU clusters
  • Uses an FPGA (Altera Stratix IV 530GX)
    – three external PCIe links
    – Each connection has PCIe x8 bandwidth (4 GB/s peak)

[Figure: photo of the PEACH2 board with the Altera FPGA chip]
SLIDE 9

GPUDirect RDMA (GDR)

  • Third-party devices can access GPU memory directly through PCI Express if the environment supports GDR
    – CUDA 5.0 or later and NVIDIA Kepler-class GPUs
  • InfiniBand
    – Mellanox InfiniBand and driver (OFED)
  • MPI
    – MVAPICH2-GDR, Open MPI
  • TCA
SLIDE 10

GPU Communication with MPI

  • Conventional MPI over InfiniBand requires copying the data three times
    – The data copies between CPU and GPU (steps 1 and 3) have to be performed manually

[Figure: 1: copy from GPU memory to CPU memory through PCI Express (PCIe); 2: data transfer over IB; 3: copy from CPU memory to GPU memory through PCIe]
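For concreteness, a minimal sketch of this staged path in C with CUDA and MPI (not taken from QUDA; the function, buffer, and peer names are illustrative):

    /* Staged GPU-to-GPU exchange: GPU -> host -> network -> host -> GPU */
    #include <stdlib.h>
    #include <mpi.h>
    #include <cuda_runtime.h>

    void exchange_staged(const float *d_send, float *d_recv, size_t n,
                         int peer, MPI_Comm comm)
    {
        float *h_send = malloc(n * sizeof(float));
        float *h_recv = malloc(n * sizeof(float));

        /* 1: copy from GPU memory to CPU memory over PCIe */
        cudaMemcpy(h_send, d_send, n * sizeof(float), cudaMemcpyDeviceToHost);

        /* 2: transfer over InfiniBand via MPI */
        MPI_Sendrecv(h_send, (int)n, MPI_FLOAT, peer, 0,
                     h_recv, (int)n, MPI_FLOAT, peer, 0,
                     comm, MPI_STATUS_IGNORE);

        /* 3: copy from CPU memory back to GPU memory over PCIe */
        cudaMemcpy(d_recv, h_recv, n * sizeof(float), cudaMemcpyHostToDevice);

        free(h_send);
        free(h_recv);
    }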

SLIDE 11

GPU Communication with IB/GDR

  • The InfiniBand controller reads and writes GPU memory directly (with GDR)
    – The temporary data copy is eliminated
    – Lower latency than the previous method
    – Protocol conversion is still needed

[Figure: 1: direct data transfer (PCIe -> IB -> PCIe)]
SLIDE 12

GPU Communication with TCA

  • TCA does not need protocol conversion
    – direct data copy using GDR
    – much lower latency than InfiniBand

[Figure: 1: direct data transfer (PCIe -> PCIe -> PCIe)]
SLIDE 13

Overview

  • Motivation
  • Tightly Coupled Accelerators (TCA) Architecture
    – PEACH2 Board
  • QUDA
    – Detail of the MPI-3 RMA Implementation
    – Detail of the TCA Implementation
  • Performance Evaluation
  • Conclusion
SLIDE 14

QUDA

  • QUDA: the open-source Lattice QCD library
    – widely used as an LQCD library for NVIDIA GPUs
  • Optimized for NVIDIA GPUs
  • All calculations run on the GPUs
    – Solves a linear equation using the CG method (see the recurrences below)
    – supports multi-node configurations
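For reference, these are the standard CG recurrences for solving $Ax = b$; QUDA's solver is more elaborate in its details, but the per-iteration cost discussed later maps onto one such iteration, where the product $A p_k$ requires the halo exchange and the inner products require the allreduce:

$$
\alpha_k = \frac{r_k^T r_k}{p_k^T A p_k}, \qquad
x_{k+1} = x_k + \alpha_k p_k, \qquad
r_{k+1} = r_k - \alpha_k A p_k, \qquad
\beta_k = \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k}, \qquad
p_{k+1} = r_{k+1} + \beta_k p_k.
$$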

SLIDE 15

Communication in QUDA

  • QUDA already supports inter-node parallelism using MPI
    – Based on peer-to-peer (P2P) communication: MPI_Send, MPI_Recv
  • We cannot use this implementation directly for TCA communication
    – because TCA supports remote memory access (RMA) only
    – We have to modify QUDA to support RMA communication
SLIDE 16

RMA Support in QUDA

  • We design a new RMA interface in QUDA
    – based on QUDA's communication abstraction layer
    – the original has a layer for peer-to-peer communication
  • Two implementations of the new API
    – TCA
    – MPI-3 RMA
      • for portability
      • for performance comparisons

[Figure: the QUDA library sits on the QUDA RMA interface, which supports multiple communication methods through the abstraction layer: the MPI-3 RMA API and the TCA API]
SLIDE 17

Replace Send with Write

  • We replace send operations with write operations
    – Not an emulation of send and receive
    – Receive operations are removed
  • TCA supports only write operations
    – A read operation is emulated by a proxy write
    – We should avoid read operations as much as we can

[Figure: proxy read: 1: the origin writes a proxy read request to the target; 2: interrupt on the target; 3: the target writes back a response]
SLIDE 18

Communications in QUDA

  • We apply the new RMA APIs to the nearest-neighbor halo exchange
    – Write data into the neighbor processes' memory regions
  • We also apply TCA to the allreduce communication
    – latency is essential because only a small scalar value is reduced

[Figure: halo data exchange between neighboring processes, and allreduce]
SLIDE 19

Calculation in QUDA

  • Basic flow of a single CG iteration
  • The inner calculation is overlapped with communication
    – so the communication time can be hidden

[Figure: per-iteration flow on the host and GPU: halo data packing; calculation of the inner points overlapped with the halo data exchange; cudaEvent synchronization; calculation of the boundary points (4-times loop over the dimensions); reduction on the GPU; cudaEvent synchronization; allreduce (2-times loop); timeline of kernel launch, kernel execution, and communication]
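A rough sketch of this overlap pattern with CUDA streams and events (the kernel and function names are illustrative, not QUDA's actual symbols):

    /* Overlap: the inner-point kernel runs while the halo is packed and exchanged. */
    cudaStream_t compute_stream, comm_stream;
    cudaEvent_t  pack_done;
    cudaStreamCreate(&compute_stream);
    cudaStreamCreate(&comm_stream);
    cudaEventCreate(&pack_done);

    pack_halo<<<grid_pack, block, 0, comm_stream>>>(d_halo_send, d_field);
    cudaEventRecord(pack_done, comm_stream);

    compute_inner<<<grid_inner, block, 0, compute_stream>>>(d_out, d_field);

    cudaEventSynchronize(pack_done);           /* host waits until packing has finished        */
    exchange_halo(d_halo_send, d_halo_recv);   /* RMA write / MPI, overlapped with inner kernel */

    compute_boundary<<<grid_bdy, block, 0, compute_stream>>>(d_out, d_field, d_halo_recv);
    cudaStreamSynchronize(compute_stream);     /* everything done before the reduction/allreduce */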

SLIDE 20
[Figure: basic flow of RMA communication on each rank: window_alloc → Init → Loop { Calc → queue_start → queue_wait } → Free]

  • Prepare all communication before the calculation loop
  • Start and wait only inside the calculation loop
    – Reuse the prepared communications (see the sketch below)
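In terms of the queue API described in the backup slides, this pattern might look roughly like the following sketch; the window variable and the message handles mh_* are placeholders prepared during initialization, and only the comm_queue_* calls are the API actually introduced here:

    /* Persistent RMA pattern: build the queue once, reuse it in every iteration. */
    RmaQueue *q = comm_queue_alloc(window);   /* window from the RMA window allocation (window_alloc) */
    comm_queue_push(q, mh_halo_xp);           /* message handles prepared during initialization */
    comm_queue_push(q, mh_halo_xm);
    comm_queue_commit(q);                     /* the queue is now fixed and reusable */

    for (int iter = 0; iter < niter; iter++) {
        /* ... calculation ... */
        comm_queue_start(q);                  /* launch all queued RMA writes */
        /* ... calculation overlapped with communication ... */
        comm_queue_wait(q);                   /* wait until the writes have completed */
    }
    /* Free: release the queue and the window after the loop */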

SLIDE 21

Overview

  • Motivation
  • Tightly Coupled Accelerators (TCA) Architecture
    – PEACH2 Board
  • QUDA
    – Detail of the MPI-3 RMA Implementation
    – Detail of the TCA Implementation
  • Performance Evaluation
  • Conclusion
SLIDE 22

MPI-3 RMA Implementation

  • The memory region for RMA is allocated with the MPI_Win APIs
  • MPI_Put is used to write data to the remote memory
  • MPI_Win_{post,wait,start,complete} is used to synchronize the RMA operations (queue_wait)
    – synchronizes with the neighbor processes (see the sketch below)
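As a rough illustration (not QUDA's actual code), one halo write with this post-start-complete-wait pattern looks like the following; the neighbor group, ranks, and buffers are placeholders:

    /* One RMA exchange step with MPI-3 windows and PSCW synchronization. */
    MPI_Win win;
    MPI_Win_create(recv_buf, bytes, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_post(neighbor_group, 0, win);    /* expose the local window to the neighbors */
    MPI_Win_start(neighbor_group, 0, win);   /* open an access epoch to the neighbors    */
    MPI_Put(send_buf, bytes, MPI_BYTE, neighbor_rank,
            0 /* target displacement */, bytes, MPI_BYTE, win);
    MPI_Win_complete(win);                   /* local puts issued (queue_start side)       */
    MPI_Win_wait(win);                       /* neighbors' puts have arrived (queue_wait)  */

    MPI_Win_free(&win);

In the actual implementation the window would be created once (held in the RmaWindow) and only the post/start/complete/wait cycle repeated per iteration.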

SLIDE 23

TCA Implementation

  • Based on the MPI-3 RMA implementation
    – TCA is used only for the core communication
      • halo data exchange
      • Allreduce (built on TCA RMA write operations)
    – The TCA network is not self-contained
      • MPI is used for initialization
SLIDE 24

TCA Implementation

  • Chained DMA
    – a function of PEACH2's DMA controller
    – Multiple memory writes can be chained
      • Once a DMA chain is kicked, all chained operations are executed continuously
      • reduces the cost of invoking the DMA controller
  • The RMA writes of the multi-dimensional halo data exchange are chained
    – The DMA controller is launched once per iteration
    – the communication pattern does not change
SLIDE 25

Overview

  • Motivation
  • Tightly Coupled Accelerators (TCA) Architecture
    – PEACH2 Board
  • QUDA
    – Detail of the MPI-3 RMA Implementation
    – Detail of the TCA Implementation
  • Performance Evaluation
  • Conclusion
SLIDE 26

Machine Environment

HA-PACS/TCA
  CPU:           Intel E5-2680v2 × 2 (2.8 GHz, 10 cores)
  CPU Memory:    128 GB
  GPU:           NVIDIA Tesla K20 × 4
  GPU Memory:    6 GB/GPU
  CUDA Toolkit:  6.0
  NVIDIA Driver: 331.89
  MPI:           MVAPICH2-GDR 2.0b
  Interconnect:  InfiniBand QDR 4x, 2 rails, and TCA (PEACH2)

To avoid crossing QPI, we use GPU1 for the TCA evaluations and GPU3 for the IB evaluations.

[Figure: HA-PACS/TCA node diagram: CPU1 and CPU2 connected by QPI ×2 (64 GB/s); GPU1 and GPU2 attached under CPU1 and GPU3 and GPU4 under CPU2, each via PCIe Gen2 x16 (16 GB/s); the IB HCA and the TCA (PEACH2) board sit under the PCIe switches]
SLIDE 27

Fundamental Performance of TCA

[Figure: latency (µs) and bandwidth (MB/s) of inter-node GPU-GPU communication vs. message size, TCA vs. MPI]

  • TCA: 2.0 µs latency, less than 1/3 of the MPI/InfiniBand latency
  • The performance crossover point is at about 128 KB
SLIDE 28

Problem for Test

  • We use the "invert_test" program distributed with QUDA
    – Auto-tuning at runtime is disabled
  • We compare the "calculation time per CG iteration" across various runtime configurations
    – because the number of iterations depends on the process-count and grid-size configuration
    – Calculation part only; the initialization and finalization parts are not included
SLIDE 29

Problem for Test

  • Three implementations are compared
    – MPI (InfiniBand + GDR)
      • MPI-P2P: the original version (quda-0.7 branch)
      • MPI-RMA
    – TCA
  • Two problem sizes: the Large and Small models
    – Small: (X, Y, Z, T) = (8, 8, 8, 8) = 8^4
    – Large: (X, Y, Z, T) = (16, 16, 16, 16) = 16^4
  • Scaling: strong scaling
    – 16 nodes at maximum
SLIDE 30

Performance Degradation

  • The performance data in the informal proceedings was badly degraded
    – This was caused by a system configuration issue on HA-PACS/TCA
    – We have fixed the problem and show updated performance data here
    – We also updated the data in the formal proceedings
SLIDE 31

Small Model (8^4)

[Figure: time per iteration (µs), measured on rank 0, for MPI-P2P, MPI-RMA, and TCA, broken down into calculation, allreduce, and communication, on 2, 4, 8, and 16 nodes with (x, y) node decompositions (2,1), (1,2), (4,1), (2,2), (1,4), (4,2), (2,4), (4,4)]

  • 1.96 times speedup against MPI-P2P
  • Message size per dimension = 2 × (24 KB / # of nodes in that dimension)
SLIDE 32

Large Model (16^4)

[Figure: time per iteration (µs), measured on rank 0, for MPI-P2P, MPI-RMA, and TCA, broken down into calculation, allreduce, and communication, on 2, 4, 8, and 16 nodes with (x, y) node decompositions (2,1), (1,2), (4,1), (2,2), (1,4), (4,2), (2,4), (4,4)]

  • 1.15 times speedup against MPI-P2P
  • The 4-node configuration is the crossover point
  • Message size per dimension = 2 × (192 KB / # of nodes in that dimension)
SLIDE 33

Discussion

  • TCA improves performance when the message size is small
    – Small Model:
      • 24 KB or smaller
    – Large Model:
      • 1.15 times faster than MPI-P2P at (x, y) = (4, 2)
      • 48 KB for the X dimension and 96 KB for the Y dimension
  • Performance of the Small Model does not scale
    – the cost of CUDA kernel launches becomes the bottleneck
SLIDE 34

Conclusion

  • TCA achieves 2.0 µs latency for inter-node GPU-GPU communication
    – about 3 times faster than MPI + InfiniBand (6 µs)
  • TCA performs well on short messages
    – Small Model: all configurations
    – Large Model: the 8- and 16-node configurations
      • but slower than MPI on small node counts
  • TCA is suitable for strong-scaling problems
    – the message size becomes smaller as the number of nodes increases
SLIDE 35

Future Work

  • The MPI-3 RMA version will be merged into QUDA upstream
    – You will be able to get the source code from GitHub
  • Further optimization of the TCA implementation
    – QUDA side
      • Better synchronization
    – TCA side (driver + FPGA)
SLIDE 36
SLIDE 37
SLIDE 38

PEACH2 Network on HA-PACS/TCA

  • Ring topology; each node has three links
  • Each link has 4 GB/s of bandwidth
  • 16 nodes at maximum in a network

[Figure: 16-node ring topology]
SLIDE 39

Data Types for RMA

  • We introduce three types for RMA communication
    – MsgHandle (extended)
      • Based on the original version
      • Some fields are added for RMA communication
    – RmaWindow
      • Represents a memory region for RMA
      • Contains an MPI_Win in the MPI-3 RMA implementation
    – RmaQueue
      • Used for RMA communication management
      • MsgHandles are pushed into an RmaQueue
      • Similar mechanism to a cudaStream
SLIDE 40

RMA Queue

  • RmaQueue is an object that holds MsgHandles and manages RMA operations
  • MsgHandles are pushed into an RmaQueue
    – Once they have been pushed, the RmaQueue can be used persistently
  • We manage RMA communication via an RmaQueue
    – start, wait, push, clear, ...
SLIDE 41

RmaQueue APIs (1/2)

RmaQueue* comm_queue_alloc(RmaWindow* window)
    Creates a new RmaQueue associated with window.

void comm_queue_push(RmaQueue* queue, MsgHandle* mh)
    Pushes mh into queue.

void comm_queue_commit(RmaQueue* queue)
    Tells queue that its modification is finished.
SLIDE 42

RmaQueue APIs (2/2)

void comm_queue_add_origin(RmaQueue* queue, int rank)
    Specifies an origin rank whose writes have to be waited for.

void comm_queue_start(RmaQueue* queue)
    Launches the RMA operations in queue.

void comm_queue_wait(RmaQueue* queue)
    Waits for the completion of the RMA operations in queue.
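Put together, a hedged sketch of how these calls might combine for one direction of a halo exchange (the handle and rank names are placeholders; the reading of comm_queue_add_origin follows the description above):

    /* Queue that writes to the +x neighbor and waits for the -x neighbor's write. */
    comm_queue_push(q, mh_write_to_xp);    /* our outgoing RMA write                 */
    comm_queue_add_origin(q, rank_xm);     /* we also expect a write from this rank  */
    comm_queue_commit(q);

    comm_queue_start(q);                   /* launch the outgoing write              */
    comm_queue_wait(q);                    /* done: our write completed and the      */
                                           /* registered origin's write has arrived  */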

SLIDE 43

MPI-3 RMA Problems

  • Currently, comm_queue_wait is performed as a fence
    – Pipelining is lost, unlike in the peer-to-peer version
    – MPI-3 RMA does not provide APIs for fine-grained synchronization
  • Using MPI_Send/Recv for notification may improve performance scaling
SLIDE 44

Persistence Mode Problem

  • The performance shown in the informal proceedings was degraded
    – this seems to be related to Persistence Mode (PM)
    – PM keeps the NVIDIA GPU driver loaded even if no application is using the GPUs
    – We had enabled it on HA-PACS/TCA

"A flag that indicates whether persistence mode is enabled for the GPU. Value is either 'Enabled' or 'Disabled'. When persistence mode is enabled the NVIDIA driver remains loaded even when no active clients, such as X11 or nvidia-smi, exist. This minimizes the driver load latency associated with running dependent apps, such as CUDA programs. For all CUDA-capable products. Linux only."
— quoted from the nvidia-smi man page
SLIDE 45

Persistence Mode Problem

  • Run on our development platform (not on HA-PACS/TCA)
  • MPI-P2P, Large Model, 2 nodes, InfiniBand
  • 50 runs

[Figure: GFLOPS over 50 runs with driver 331.89, Persistence Mode off vs. on]

  • PM degrades QUDA's performance
SLIDE 46

Communication API in QUDA

  • QUDA supports two communication APIs
    – MPI and Lattice QCD Message Passing (QMP)
    – QUDA has a communication abstraction layer
    – based on peer-to-peer semantics
  • We extend the API to use one-sided communication
    – to support TCA communication
    – Additionally, MPI-3 RMA is also supported
SLIDE 47

GPU Communication with CUDA-aware MPI

  • Recently, some MPI implementations have gained native CUDA support (a.k.a. CUDA-aware MPI)
    – We can pass GPU pointers directly to MPI functions
    – The data copies are optimized internally, e.g. by pipelining
    – Three data movements are still required

[Figure: 1: optimized data copy from GPU to CPU; 2: data transfer over IB; 3: optimized data copy from CPU to GPU]
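A minimal sketch of the CUDA-aware usage (illustrative buffer names; compare with the staged version earlier): the device pointers go straight into MPI, and the library performs the staging, or GPUDirect RDMA where available, internally:

    /* With a CUDA-aware MPI, device pointers can be passed directly. */
    float *d_send, *d_recv;
    cudaMalloc((void **)&d_send, n * sizeof(float));
    cudaMalloc((void **)&d_recv, n * sizeof(float));

    /* No explicit cudaMemcpy: the MPI library moves the data GPU -> host -> GPU
       (pipelined), or GPU -> GPU over the network with GDR, on its own. */
    MPI_Sendrecv(d_send, (int)n, MPI_FLOAT, peer, 0,
                 d_recv, (int)n, MPI_FLOAT, peer, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);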

SLIDE 48

SLIDE 49

TCA Implementation

  • Based on the MPI-3 RMA implementation
    – TCA is used only for the core communication
      • halo data exchange
      • Allreduce (built on TCA RMA write operations)
    – The TCA network is not self-contained
      • MPI is used for initialization
  • An RmaWindow contains a memory region allocated with the TCA memory allocation API
  • Currently, comm_queue_wait acts as a fence, as in the MPI implementation