 
QCD Library for GPU Cluster with Proprietary Interconnect for GPU Direct Communication
Norihisa Fujita †1, Hisafumi Fujii †1, Toshihiro Hanawa †2, Yuetsu Kodama †3, Yoshinobu Kuramashi †3, Mike Clark †4, Taisuke Boku †1,3
†1: Graduate School of Systems and Information Engineering, University of Tsukuba
†2: Information Technology Center, The University of Tokyo
†3: Center for Computational Sciences, University of Tsukuba
†4: NVIDIA Corporation

Overview
• Motivation
• Tightly Coupled Accelerators (TCA) Architecture
  – PEACH2 Board
• QUDA
  – Detail of MPI-3 RMA Implementation
  – Detail of TCA Implementation
• Performance Evaluation
• Conclusion

Motivation
• HPC systems that use GPUs as accelerators are now widely used
• In parallel GPU computing, communication between GPUs becomes a performance bottleneck
  – especially for strong-scaling problems
  – an extra data copy is required
  – latency is longer than for CPU-only communication
• A direct communication method between GPUs would be a solution

Motivation
• We have been developing the TCA architecture
  – An interconnect for direct communication among accelerators
• Applying the TCA architecture to a parallel GPU application
  – Lattice QCD library: QUDA
    • add TCA support to QUDA
  – Evaluate the performance of the TCA architecture
    • compare the original version, the MPI-3 RMA version, and the TCA version

Tightly Coupled Accelerators (TCA) Architecture
• TCA is an interconnection network for direct communication between accelerators
  – developed at the Center for Computational Sciences (CCS), University of Tsukuba
• TCA can transfer data from one GPU to another without CPU assistance
  – No temporary data copy to the CPU's memory
  – No CPU computation power is required

TCA Architecture
• TCA connects GPUs using PCIe
  – Nodes are connected by PCIe
  – for ultra-low latency communication among GPUs
[Figure: two nodes, each with a CPU and CPU memory, a PCIe switch, a GPU with GPU memory, and a PEACH2 board; the PEACH2 boards of the two nodes are linked by PCIe.]

PCI Express Adaptive Communication Hub ver. 2 (PEACH2) Board
• PEACH2 is a TCA implementation for GPU clusters
• Uses an FPGA (Altera Stratix IV 530GX)
  – three external PCIe links
  – each connection has PCIe x8 bandwidth (4 GB/s peak)
[Figure: PEACH2 board, with the Altera FPGA chip labeled.]

GPUDirect RDMA (GDR)
• Third-party devices can access GPU memory directly through PCI Express if the environment supports GDR
  – CUDA 5.0 or later and NVIDIA Kepler-class GPUs
• InfiniBand
  – Mellanox InfiniBand and driver (OFED)
• MPI
  – MVAPICH2-GDR, OpenMPI
• TCA

GPU Communication with MPI
• Conventional MPI over InfiniBand requires three data copies
  – The copies between CPU and GPU (1 and 3) have to be performed manually
[Figure: two nodes, each with CPU, CPU memory, PCIe switch, GPU, GPU memory, and IB adapter. 1: Copy from GPU memory to CPU memory through PCI Express (PCIe). 2: Data transfer over IB. 3: Copy from CPU memory to GPU memory through PCIe.]

GPU Communication with IB/GDR
• The InfiniBand controller reads and writes GPU memory directly (with GDR)
  – The temporary data copy is eliminated
  – Lower latency than the previous method
  – Protocol conversion is still needed
[Figure: two nodes, each with CPU, CPU memory, PCIe switch, GPU, GPU memory, and IB adapter. 1: Direct data transfer (PCIe -> IB -> PCIe).]

GPU Communication with TCA
• TCA does not need protocol conversion
  – direct data copy using GDR
  – much lower latency than InfiniBand
[Figure: two nodes, each with CPU, CPU memory, PCIe switch, GPU, GPU memory, and TCA board. 1: Direct data transfer (PCIe -> PCIe -> PCIe).]

QUDA
• QUDA: the open-source lattice QCD library
  – widely used as an LQCD library for NVIDIA GPUs
    • Optimized for NVIDIA GPUs
    • All calculations run on GPUs
  – Solves a linear equation using the CG method
  – supports multi-node configurations

Communication in QUDA
• QUDA already supports inter-node parallelism using MPI
  – Based on peer-to-peer (P2P) communication: MPI_Send, MPI_Recv
• We cannot use this implementation directly for TCA communication
  – because TCA supports remote memory access (RMA) only
  – We have to modify QUDA to support RMA communication

RMA Support in QUDA
• We design a new RMA interface in QUDA
  – It is based on QUDA's communication abstraction layer
  – the original has a layer for peer-to-peer communication
• Two implementations of the new APIs
  – TCA
  – MPI-3 RMA
    • for portability
    • for performance comparisons
[Figure: layer diagram. The QUDA Library sits on the QUDA RMA Interface, which sits on the MPI-3 RMA API and the TCA API. Caption: Support multiple communication methods using the abstract layer.]

Replace send with write
• We replace send operations with write operations
  – This is not an emulation of send and receive
  – Receive operations are removed
• TCA supports only write operations
  – A read operation is emulated by a proxy write
  – We should avoid read operations as far as possible
[Figure: Origin and Target. 1: Write a proxy read request. 2: Interrupt. 3: Write a response.]

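
The write-only model on this slide can be illustrated with a toy, single-process Python sketch. The class `Node` and its methods `write`, `read_via_proxy`, and `handle_interrupt` are hypothetical names chosen for illustration; they are not TCA or QUDA APIs, and the interrupt is modeled as a plain function call.

```python
# Toy single-process model of write-only RMA; all names are hypothetical.

class Node:
    def __init__(self, name, size):
        self.name = name
        self.memory = [0] * size   # region exposed for remote writes
        self.pending = []          # proxy read requests written by peers

    def write(self, target, offset, data):
        """One-sided write: store directly into the target's memory.
        The target calls no receive function."""
        target.memory[offset:offset + len(data)] = data

    def read_via_proxy(self, target, src_offset, length, dst_offset):
        """Emulate a read using writes only."""
        # 1: write a proxy read request to the target
        target.pending.append((self, src_offset, length, dst_offset))
        # 2: the target is interrupted (modeled here as a direct call)
        target.handle_interrupt()

    def handle_interrupt(self):
        while self.pending:
            origin, src, length, dst = self.pending.pop(0)
            # 3: the target writes the response back to the origin
            self.write(origin, dst, self.memory[src:src + length])


origin = Node("origin", 8)
target = Node("target", 8)

origin.write(target, 2, [7, 8, 9])       # a send replaced by a remote write
target.memory[5] = 42
origin.read_via_proxy(target, 5, 1, 0)   # a read costs a request plus a response
```

In this model a read needs two writes and an interrupt on the target, which is one way to see why reads are best avoided.
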
Communications in QUDA
• We apply the new RMA APIs to the nearest-neighbor halo exchange
  – Write data to the neighbor processes' memory regions
• We also apply TCA to allreduce communication
  – latency is essential because the value is a small scalar
[Figure: halo data exchange between neighboring processes, and allreduce.]

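
Both communication patterns can be sketched as a toy model in plain Python. The 1-D ring, the single halo cell per side, and the all-to-all write scheme used for the allreduce are assumptions made for illustration; the slides do not specify the actual allreduce algorithm, and all function names are hypothetical.

```python
# Toy model: a 1-D ring of ranks. Layout per rank: [left halo, interior..., right halo].

def make_ranks(num_ranks, interior):
    return [[0.0] + [float(r * interior + i) for i in range(interior)] + [0.0]
            for r in range(num_ranks)]

def halo_exchange(ranks):
    n = len(ranks)
    for r, local in enumerate(ranks):
        left, right = ranks[(r - 1) % n], ranks[(r + 1) % n]
        # Remote writes into the neighbors' halo slots; no receive is posted.
        left[-1] = local[1]    # first interior cell -> left neighbor's right halo
        right[0] = local[-2]   # last interior cell  -> right neighbor's left halo

def allreduce_sum(values):
    """Allreduce built from writes only: every rank writes its scalar into a
    slot on every rank, then each rank sums its own slots locally."""
    n = len(values)
    slots = [[0.0] * n for _ in range(n)]
    for src, v in enumerate(values):
        for dst in range(n):
            slots[dst][src] = v    # remote write of one small scalar
    return [sum(s) for s in slots]

ranks = make_ranks(3, 4)
halo_exchange(ranks)
totals = allreduce_sum([1.0, 2.0, 3.0])
```

Each allreduce message here is a single scalar, so its cost is dominated by latency rather than bandwidth.
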
Calculation of QUDA
• Basic flow of a single iteration of CG
• Inner calculation is overlapped with communication
  – the communication time will be hidden
[Figure: Host and GPU timelines for one CG iteration. Loop executed 4 times: halo data packing, calculation of inner points, cudaEvent synchronize, halo data exchange, calculation of boundary points. Loop executed 2 times: reduce in GPU, cudaEvent synchronize, AllReduce. Legend: kernel launch, kernel execution, communication.]

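
For orientation, one CG iteration can be sketched in plain Python. This is a generic textbook CG on a small dense matrix, not QUDA's solver: the dense matrix stands in for the lattice operator, and the comments marking where a multi-GPU run would communicate are assumptions based on the flow described on this slide.

```python
# Minimal conjugate gradient (CG) for a small symmetric positive-definite system.
# Single process; comments mark where a multi-GPU run would communicate.

def dot(u, v):
    # Multi-GPU: local reduction on each GPU followed by an allreduce.
    return sum(a * b for a, b in zip(u, v))

def matvec(A, x):
    # Multi-GPU: boundary points need halo data from neighbors, while inner
    # points can be computed during the halo exchange.
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def cg(A, b, tol=1e-10, max_iter=100):
    x = [0.0] * len(b)
    r = list(b)                  # residual b - A x for the initial guess x = 0
    p = list(r)
    rr = dot(r, r)
    for _ in range(max_iter):
        if rr < tol * tol:
            break
        Ap = matvec(A, p)
        alpha = rr / dot(p, Ap)                        # reduction 1
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = dot(r, r)                             # reduction 2
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = cg(A, b)
```

The two dot products per iteration are the points where a distributed run would need a global reduction.
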
[Figure: Basic flow of RMA communication. Init.: window_alloc; prepare all communication before the calculation loop. Loop: Calc., queue_start, queue_wait; only starting and waiting happen in the calculation loop, reusing the prepared communications. Free.]

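
The prepare-once, start-and-wait-many pattern in this figure can be sketched as a toy Python model. The names `window_alloc`, `queue_start`, and `queue_wait` echo labels in the figure, but their signatures here are guesses; `Window`, `Queue`, and `add_write` are hypothetical, and the copy is performed inside `queue_wait` only because the model is single-threaded.

```python
# Toy model of the prepare-once, start/wait-many pattern.
# Signatures are assumptions; Window, Queue and add_write are hypothetical.

class Window:
    def __init__(self, size):
        self.buf = [0] * size

def window_alloc(size):
    return Window(size)

class Queue:
    def __init__(self):
        self.ops = []            # prepared writes: (src, dst, dst_offset, length)
        self.in_flight = False
        self.starts = 0

    def add_write(self, src, dst, dst_offset, length):
        self.ops.append((src, dst, dst_offset, length))

def queue_start(q):
    assert not q.in_flight
    q.in_flight = True
    q.starts += 1

def queue_wait(q):
    # The toy model copies here; real hardware would transfer asynchronously
    # between start and wait.
    assert q.in_flight
    for src, dst, off, n in q.ops:
        dst.buf[off:off + n] = src.buf[:n]
    q.in_flight = False

# Init.: allocate windows and prepare every communication once.
send = window_alloc(4)
recv = window_alloc(4)
q = Queue()
q.add_write(send, recv, 0, 4)

# Loop: only start and wait; the prepared queue is reused each iteration.
history = []
for step in range(3):
    send.buf = [step] * 4        # stands in for "Calc."
    queue_start(q)
    queue_wait(q)
    history.append(list(recv.buf))
```
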
MPI-3 RMA Implementation
• The memory region for RMA is allocated by the MPI_Win APIs
• MPI_Put is used for writing data to remote memory
• MPI_Win_{post,wait,start,complete} are used to synchronize RMA operations (queue_wait)
  – synchronize with neighbor processes

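
The ordering imposed by post/start/complete/wait synchronization can be illustrated with a toy, single-process Python model. It calls no MPI function: `ToyWindow` and its methods are hypothetical and only mirror the roles of MPI_Win_post, MPI_Win_start, MPI_Put, MPI_Win_complete, and MPI_Win_wait. Where a real MPI call might block, the model raises an error instead, and it delivers data strictly at `complete`, whereas MPI only guarantees the data is visible by then.

```python
# Toy single-process model of post/start/complete/wait ordering (no MPI calls).

class ToyWindow:
    def __init__(self, size):
        self.mem = [0] * size
        self.exposed = False     # target side: True between post and wait
        self.accessing = False   # origin side: True between start and complete
        self.completed = False   # origin has closed its access epoch
        self.staged = []         # puts issued in the current access epoch

    def post(self):              # target opens an exposure epoch
        self.exposed = True
        self.completed = False

    def start(self):             # origin opens an access epoch
        if not self.exposed:
            raise RuntimeError("would block: target has not posted")
        self.accessing = True

    def put(self, offset, data):
        if not self.accessing:
            raise RuntimeError("put outside an access epoch")
        self.staged.append((offset, list(data)))

    def complete(self):          # origin closes the epoch; puts are delivered
        for offset, data in self.staged:
            self.mem[offset:offset + len(data)] = data
        self.staged.clear()
        self.accessing = False
        self.completed = True

    def wait(self):              # target closes the exposure epoch
        if not self.completed:
            raise RuntimeError("would block: origin has not completed")
        self.exposed = False

win = ToyWindow(4)
win.post()
win.start()
win.put(1, [5, 6])
before_complete = list(win.mem)
win.complete()
win.wait()
```
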
TCA Implementation
• Based on the MPI-3 RMA implementation
  – TCA is used for core communication only
    • halo data exchange
    • Allreduce (built on TCA RMA write operations)
  – The TCA network is not self-contained
    • MPI is used for initialization

TCA Implementation
• Chaining DMA
  – is a function of PEACH2's DMA controller
  – We can chain multiple memory writes
    • Once a DMA chain is kicked, all chained operations are executed continuously
    • reduces the cost of DMA controller invocation
• The RMAs of the multi-dimensional halo data exchange are DMA-chained
  – Launch the DMA controller once per iteration
  – the communication pattern does not change

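
The benefit of chaining can be illustrated with a toy cost model in Python. The invocation overhead and the message sizes are made-up numbers, not PEACH2 measurements; only the 4 GB/s peak bandwidth comes from the slides, and `transfer_time_us` is a hypothetical helper.

```python
# Toy cost model for chained DMA; overhead and message sizes are made-up numbers.

KICK_OVERHEAD_US = 2.0    # hypothetical cost of invoking the DMA controller once
BYTES_PER_US = 4000.0     # 4 GB/s peak expressed in bytes per microsecond

def transfer_time_us(sizes, chained):
    """Time to issue all writes: one kick if chained, one kick per write otherwise."""
    kicks = 1 if chained else len(sizes)
    return kicks * KICK_OVERHEAD_US + sum(sizes) / BYTES_PER_US

# Assumed example: two directions in each of four dimensions, 8000 bytes each.
halo_sizes = [8000] * 8
separate = transfer_time_us(halo_sizes, chained=False)
chained = transfer_time_us(halo_sizes, chained=True)
```

Under these assumed numbers the data movement time is the same in both cases; chaining only removes the repeated invocation overhead, which matters most when the individual messages are small.
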