SLIDE 1

QCD Library for GPU Cluster with Proprietary Interconnect for GPU Direct Communication

Norihisa Fujita †1, Hisafumi Fujii †1, Toshihiro Hanawa †2, Yuetsu Kodama †3,
Taisuke Boku †1,3, Yoshinobu Kuramashi †3, Mike Clark †4

†1: Graduate School of Systems and Information Engineering, University of Tsukuba
†2: Information Technology Center, The University of Tokyo
†3: Center for Computational Sciences, University of Tsukuba
†4: NVIDIA Corporation

SLIDE 2

Overview

  • Motivation
  • Tightly Coupled Accelerators (TCA) Architecture
    – PEACH2 Board
  • QUDA
    – Detail of the MPI-3 RMA Implementation
    – Detail of the TCA Implementation
  • Performance Evaluation
  • Conclusion
SLIDE 3

Motivation

  • Recently, HPC systems that use GPUs as accelerators have become widely used
  • For parallel GPU computing, communication between GPUs becomes a performance bottleneck
    – especially for strong-scaling problems
    – an extra data copy is required
    – longer latency than CPU-only communication
  • A direct communication method between GPUs will be a solution
SLIDE 4

Motivation

  • We have been developing the TCA architecture
    – an interconnect for direct communication among accelerators
  • We apply the TCA architecture to a parallel GPU application
    – the Lattice QCD library QUDA
      • add TCA support to QUDA
    – Evaluate the performance of the TCA architecture
      • compare the original version, the MPI-3 RMA version, and TCA
SLIDE 5

Overview

  • Motivation
  • Tightly Coupled Accelerators (TCA) Architecture
    – PEACH2 Board
  • QUDA
    – Detail of the MPI-3 RMA Implementation
    – Detail of the TCA Implementation
  • Performance Evaluation
  • Conclusion
SLIDE 6

Tightly Coupled Accelerators (TCA) Architecture

  • TCA is an interconnection network for direct communication between accelerators
    – developed at the Center for Computational Sciences (CCS), University of Tsukuba
  • TCA can transfer data from one GPU to another without CPU assistance
    – No temporary data copy to the CPU's memory
    – No CPU computation power is required
SLIDE 7

TCA Architecture

  • TCA connects GPUs using PCIe
    – Nodes are connected by PCIe
    – for ultra-low latency communication among GPUs

[Figure: two nodes, each with a CPU, CPU memory, a PCIe switch, a GPU, and GPU memory; the PEACH2 boards of the two nodes are connected to each other over PCIe]
SLIDE 8

PCI Express Adaptive Communication Hub ver. 2 (PEACH2) Board

  • PEACH2 is a TCA implementation for GPU clusters
  • Uses an FPGA (Altera Stratix IV 530GX)
    – three external PCIe links
    – Each connection has PCIe x8 bandwidth (4 GB/s peak)

[Figure: photo of the PEACH2 board with the Altera FPGA chip]
SLIDE 9

GPUDirect RDMA (GDR)

  • Third-party devices can access GPU memory directly through PCI Express if the environment supports GDR
    – CUDA 5.0 or later and NVIDIA Kepler-class GPUs
  • InfiniBand
    – Mellanox InfiniBand and driver (OFED)
  • MPI
    – MVAPICH2-GDR, Open MPI
  • TCA
SLIDE 10

GPU Communication with MPI

  • Conventional MPI over InfiniBand requires copying the data three times
    – The data copies between CPU and GPU (steps 1 and 3) have to be performed manually

[Figure: 1: copy from GPU memory to CPU memory through PCI Express (PCIe); 2: data transfer over IB; 3: copy from CPU memory to GPU memory through PCIe]
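For concreteness, a minimal sketch of this staged path in C with CUDA and MPI (not taken from QUDA; the function, buffer, and peer names are illustrative):

    /* Staged GPU-to-GPU exchange: GPU -> host -> network -> host -> GPU */
    #include <stdlib.h>
    #include <mpi.h>
    #include <cuda_runtime.h>

    void exchange_staged(const float *d_send, float *d_recv, size_t n,
                         int peer, MPI_Comm comm)
    {
        float *h_send = malloc(n * sizeof(float));
        float *h_recv = malloc(n * sizeof(float));

        /* 1: copy from GPU memory to CPU memory over PCIe */
        cudaMemcpy(h_send, d_send, n * sizeof(float), cudaMemcpyDeviceToHost);

        /* 2: transfer over InfiniBand via MPI */
        MPI_Sendrecv(h_send, (int)n, MPI_FLOAT, peer, 0,
                     h_recv, (int)n, MPI_FLOAT, peer, 0,
                     comm, MPI_STATUS_IGNORE);

        /* 3: copy from CPU memory back to GPU memory over PCIe */
        cudaMemcpy(d_recv, h_recv, n * sizeof(float), cudaMemcpyHostToDevice);

        free(h_send);
        free(h_recv);
    }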

SLIDE 11

GPU Communication with IB/GDR

  • The InfiniBand controller reads and writes GPU memory directly (with GDR)
    – The temporary data copy is eliminated
    – Lower latency than the previous method
    – Protocol conversion is still needed

[Figure: 1: direct data transfer (PCIe -> IB -> PCIe)]
SLIDE 12

GPU Communication with TCA

  • TCA does not need protocol conversion
    – direct data copy using GDR
    – much lower latency than InfiniBand

[Figure: 1: direct data transfer (PCIe -> PCIe -> PCIe)]
SLIDE 13

Overview

  • Motivation
  • Tightly Coupled Accelerators (TCA) Architecture
    – PEACH2 Board
  • QUDA
    – Detail of the MPI-3 RMA Implementation
    – Detail of the TCA Implementation
  • Performance Evaluation
  • Conclusion
SLIDE 14

QUDA

  • QUDA: the open-source Lattice QCD library
    – widely used as an LQCD library for NVIDIA GPUs
  • Optimized for NVIDIA GPUs
  • All calculations run on the GPUs
    – Solves a linear equation using the CG method (see the recurrences below)
    – supports multi-node configurations
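For reference, these are the standard CG recurrences for solving $Ax = b$; QUDA's solver is more elaborate in its details, but the per-iteration cost discussed later maps onto one such iteration, where the product $A p_k$ requires the halo exchange and the inner products require the allreduce:

$$
\alpha_k = \frac{r_k^T r_k}{p_k^T A p_k}, \qquad
x_{k+1} = x_k + \alpha_k p_k, \qquad
r_{k+1} = r_k - \alpha_k A p_k, \qquad
\beta_k = \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k}, \qquad
p_{k+1} = r_{k+1} + \beta_k p_k.
$$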

SLIDE 15

Communication in QUDA

  • QUDA already supports inter-node parallelism using MPI
    – Based on peer-to-peer (P2P) communication: MPI_Send, MPI_Recv
  • We cannot use this implementation directly for TCA communication
    – because TCA supports remote memory access (RMA) only
    – We have to modify QUDA to support RMA communication
SLIDE 16

RMA Support in QUDA

  • We design a new RMA interface in QUDA
    – based on QUDA's communication abstraction layer
    – the original has a layer for peer-to-peer communication
  • Two implementations of the new API
    – TCA
    – MPI-3 RMA
      • for portability
      • for performance comparisons

[Figure: the QUDA library sits on the QUDA RMA interface, which supports multiple communication methods through the abstraction layer: the MPI-3 RMA API and the TCA API]
SLIDE 17

Replace Send with Write

  • We replace send operations with write operations
    – Not an emulation of send and receive
    – Receive operations are removed
  • TCA supports only write operations
    – A read operation is emulated by a proxy write
    – We should avoid read operations as much as we can

[Figure: proxy read: 1: the origin writes a proxy read request to the target; 2: interrupt on the target; 3: the target writes back a response]
SLIDE 18

Communications in QUDA

  • We apply the new RMA APIs to the nearest-neighbor halo exchange
    – Write data into the neighbor processes' memory regions
  • We also apply TCA to the allreduce communication
    – latency is essential because only a small scalar value is reduced

[Figure: halo data exchange between neighboring processes, and allreduce]
SLIDE 19

Calculation in QUDA

  • Basic flow of a single CG iteration
  • The inner calculation is overlapped with communication
    – so the communication time can be hidden

[Figure: per-iteration flow on the host and GPU: halo data packing; calculation of the inner points overlapped with the halo data exchange; cudaEvent synchronization; calculation of the boundary points (4-times loop over the dimensions); reduction on the GPU; cudaEvent synchronization; allreduce (2-times loop); timeline of kernel launch, kernel execution, and communication]
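A rough sketch of this overlap pattern with CUDA streams and events (the kernel and function names are illustrative, not QUDA's actual symbols):

    /* Overlap: the inner-point kernel runs while the halo is packed and exchanged. */
    cudaStream_t compute_stream, comm_stream;
    cudaEvent_t  pack_done;
    cudaStreamCreate(&compute_stream);
    cudaStreamCreate(&comm_stream);
    cudaEventCreate(&pack_done);

    pack_halo<<<grid_pack, block, 0, comm_stream>>>(d_halo_send, d_field);
    cudaEventRecord(pack_done, comm_stream);

    compute_inner<<<grid_inner, block, 0, compute_stream>>>(d_out, d_field);

    cudaEventSynchronize(pack_done);           /* host waits until packing has finished        */
    exchange_halo(d_halo_send, d_halo_recv);   /* RMA write / MPI, overlapped with inner kernel */

    compute_boundary<<<grid_bdy, block, 0, compute_stream>>>(d_out, d_field, d_halo_recv);
    cudaStreamSynchronize(compute_stream);     /* everything done before the reduction/allreduce */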

SLIDE 20
[Figure: basic flow of RMA communication on each rank: window_alloc → Init → Loop { Calc → queue_start → queue_wait } → Free]

  • Prepare all communication before the calculation loop
  • Start and wait only inside the calculation loop
    – Reuse the prepared communications (see the sketch below)
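In terms of the queue API described in the backup slides, this pattern might look roughly like the following sketch; the window variable and the message handles mh_* are placeholders prepared during initialization, and only the comm_queue_* calls are the API actually introduced here:

    /* Persistent RMA pattern: build the queue once, reuse it in every iteration. */
    RmaQueue *q = comm_queue_alloc(window);   /* window from the RMA window allocation (window_alloc) */
    comm_queue_push(q, mh_halo_xp);           /* message handles prepared during initialization */
    comm_queue_push(q, mh_halo_xm);
    comm_queue_commit(q);                     /* the queue is now fixed and reusable */

    for (int iter = 0; iter < niter; iter++) {
        /* ... calculation ... */
        comm_queue_start(q);                  /* launch all queued RMA writes */
        /* ... calculation overlapped with communication ... */
        comm_queue_wait(q);                   /* wait until the writes have completed */
    }
    /* Free: release the queue and the window after the loop */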

SLIDE 21

Overview

  • Motivation
  • Tightly Coupled Accelerators (TCA) Architecture
    – PEACH2 Board
  • QUDA
    – Detail of the MPI-3 RMA Implementation
    – Detail of the TCA Implementation
  • Performance Evaluation
  • Conclusion
SLIDE 22

MPI-3 RMA Implementation

  • The memory region for RMA is allocated with the MPI_Win APIs
  • MPI_Put is used to write data to the remote memory
  • MPI_Win_{post,wait,start,complete} is used to synchronize the RMA operations (queue_wait)
    – synchronizes with the neighbor processes (see the sketch below)
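As a rough illustration (not QUDA's actual code), one halo write with this post-start-complete-wait pattern looks like the following; the neighbor group, ranks, and buffers are placeholders:

    /* One RMA exchange step with MPI-3 windows and PSCW synchronization. */
    MPI_Win win;
    MPI_Win_create(recv_buf, bytes, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_post(neighbor_group, 0, win);    /* expose the local window to the neighbors */
    MPI_Win_start(neighbor_group, 0, win);   /* open an access epoch to the neighbors    */
    MPI_Put(send_buf, bytes, MPI_BYTE, neighbor_rank,
            0 /* target displacement */, bytes, MPI_BYTE, win);
    MPI_Win_complete(win);                   /* local puts issued (queue_start side)       */
    MPI_Win_wait(win);                       /* neighbors' puts have arrived (queue_wait)  */

    MPI_Win_free(&win);

In the actual implementation the window would be created once (held in the RmaWindow) and only the post/start/complete/wait cycle repeated per iteration.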

SLIDE 23

TCA Implementation

  • Based on the MPI-3 RMA implementation
    – TCA is used only for the core communication
      • halo data exchange
      • Allreduce (built on TCA RMA write operations)
    – The TCA network is not self-contained
      • MPI is used for initialization
SLIDE 24

TCA Implementation

  • Chained DMA
    – a function of PEACH2's DMA controller
    – Multiple memory writes can be chained
      • Once a DMA chain is kicked, all chained operations are executed continuously
      • reduces the cost of invoking the DMA controller
  • The RMA writes of the multi-dimensional halo data exchange are chained
    – The DMA controller is launched once per iteration
    – the communication pattern does not change
SLIDE 25

Overview

  • Motivation
  • Tightly Coupled Accelerators (TCA) Architecture
    – PEACH2 Board
  • QUDA
    – Detail of the MPI-3 RMA Implementation
    – Detail of the TCA Implementation
  • Performance Evaluation
  • Conclusion
SLIDE 26

Machine Environment

HA-PACS/TCA
  CPU:           Intel E5-2680v2 × 2 (2.8 GHz, 10 cores)
  CPU Memory:    128 GB
  GPU:           NVIDIA Tesla K20 × 4
  GPU Memory:    6 GB/GPU
  CUDA Toolkit:  6.0
  NVIDIA Driver: 331.89
  MPI:           MVAPICH2-GDR 2.0b
  Interconnect:  InfiniBand QDR 4x, 2 rails, and TCA (PEACH2)

To avoid crossing QPI, we use GPU1 for the TCA evaluations and GPU3 for the IB evaluations.

[Figure: HA-PACS/TCA node diagram: CPU1 and CPU2 connected by QPI ×2 (64 GB/s); GPU1 and GPU2 attached under CPU1 and GPU3 and GPU4 under CPU2, each via PCIe Gen2 x16 (16 GB/s); the IB HCA and the TCA (PEACH2) board sit under the PCIe switches]
SLIDE 27

Fundamental Performance of TCA

[Figure: latency (µs) and bandwidth (MB/s) of inter-node GPU-GPU communication vs. message size, TCA vs. MPI]

  • TCA: 2.0 µs latency, less than 1/3 of the MPI/InfiniBand latency
  • The performance crossover point is at about 128 KB
SLIDE 28

Problem for Test

  • We use the "invert_test" program distributed with QUDA
    – Auto-tuning at runtime is disabled
  • We compare the "calculation time per CG iteration" across various runtime configurations
    – because the number of iterations depends on the process-count and grid-size configuration
    – Calculation part only; the initialization and finalization parts are not included
SLIDE 29

Problem for Test

  • Three implementations are compared
    – MPI (InfiniBand + GDR)
      • MPI-P2P: the original version (quda-0.7 branch)
      • MPI-RMA
    – TCA
  • Two problem sizes: the Large and Small models
    – Small: (X, Y, Z, T) = (8, 8, 8, 8) = 8^4
    – Large: (X, Y, Z, T) = (16, 16, 16, 16) = 16^4
  • Scaling: strong scaling
    – 16 nodes at maximum
SLIDE 30

Performance Degradation

  • The performance data in the informal proceedings was badly degraded
    – This was caused by a system configuration issue on HA-PACS/TCA
    – We have fixed the problem and show updated performance data here
    – We also updated the data in the formal proceedings
SLIDE 31

Small Model (8^4)

[Figure: time per iteration (µs), measured on rank 0, for MPI-P2P, MPI-RMA, and TCA, broken down into calculation, allreduce, and communication, on 2, 4, 8, and 16 nodes with (x, y) node decompositions (2,1), (1,2), (4,1), (2,2), (1,4), (4,2), (2,4), (4,4)]

  • 1.96 times speedup against MPI-P2P
  • Message size per dimension = 2 × (24 KB / # of nodes in that dimension)
SLIDE 32

Large Model (16^4)

[Figure: time per iteration (µs), measured on rank 0, for MPI-P2P, MPI-RMA, and TCA, broken down into calculation, allreduce, and communication, on 2, 4, 8, and 16 nodes with (x, y) node decompositions (2,1), (1,2), (4,1), (2,2), (1,4), (4,2), (2,4), (4,4)]

  • 1.15 times speedup against MPI-P2P
  • The 4-node configuration is the crossover point
  • Message size per dimension = 2 × (192 KB / # of nodes in that dimension)
SLIDE 33

Discussion

  • TCA improves performance when the message size is small
    – Small Model:
      • 24 KB or smaller
    – Large Model:
      • 1.15 times faster than MPI-P2P at (x, y) = (4, 2)
      • 48 KB for the X dimension and 96 KB for the Y dimension
  • Performance of the Small Model does not scale
    – the cost of CUDA kernel launches becomes the bottleneck
SLIDE 34

Conclusion

  • TCA achieves 2.0 µs latency for inter-node GPU-GPU communication
    – about 3 times faster than MPI + InfiniBand (6 µs)
  • TCA performs well on short messages
    – Small Model: all configurations
    – Large Model: the 8- and 16-node configurations
      • but slower than MPI on small node counts
  • TCA is suitable for strong-scaling problems
    – the message size becomes smaller as the number of nodes increases
SLIDE 35

Future Work

  • The MPI-3 RMA version will be merged into QUDA upstream
    – You will be able to get the source code from GitHub
  • Further optimization of the TCA implementation
    – QUDA side
      • Better synchronization
    – TCA side (driver + FPGA)
SLIDE 36
SLIDE 37
SLIDE 38

PEACH2 Network on HA-PACS/TCA

  • Ring topology; each node has three links
  • Each link has 4 GB/s of bandwidth
  • 16 nodes at maximum in a network

[Figure: 16-node ring topology]
SLIDE 39

Data Types for RMA

  • We introduce three types for RMA communication
    – MsgHandle (extended)
      • Based on the original version
      • Some fields are added for RMA communication
    – RmaWindow
      • Represents a memory region for RMA
      • Contains an MPI_Win in the MPI-3 RMA implementation
    – RmaQueue
      • Used for RMA communication management
      • MsgHandles are pushed into an RmaQueue
      • Similar mechanism to a cudaStream
SLIDE 40

RMA Queue

  • RmaQueue is an object that holds MsgHandles and manages RMA operations
  • MsgHandles are pushed into an RmaQueue
    – Once they have been pushed, the RmaQueue can be used persistently
  • We manage RMA communication via an RmaQueue
    – start, wait, push, clear, ...
SLIDE 41

RmaQueue APIs (1/2)

RmaQueue* comm_queue_alloc(RmaWindow* window)
    Creates a new RmaQueue associated with window.

void comm_queue_push(RmaQueue* queue, MsgHandle* mh)
    Pushes mh into queue.

void comm_queue_commit(RmaQueue* queue)
    Tells queue that its modification is finished.
SLIDE 42

RmaQueue APIs (2/2)

void comm_queue_add_origin(RmaQueue* queue, int rank)
    Specifies an origin rank whose writes have to be waited for.

void comm_queue_start(RmaQueue* queue)
    Launches the RMA operations in queue.

void comm_queue_wait(RmaQueue* queue)
    Waits for the completion of the RMA operations in queue.
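Put together, a hedged sketch of how these calls might combine for one direction of a halo exchange (the handle and rank names are placeholders; the reading of comm_queue_add_origin follows the description above):

    /* Queue that writes to the +x neighbor and waits for the -x neighbor's write. */
    comm_queue_push(q, mh_write_to_xp);    /* our outgoing RMA write                 */
    comm_queue_add_origin(q, rank_xm);     /* we also expect a write from this rank  */
    comm_queue_commit(q);

    comm_queue_start(q);                   /* launch the outgoing write              */
    comm_queue_wait(q);                    /* done: our write completed and the      */
                                           /* registered origin's write has arrived  */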

SLIDE 43

MPI-3 RMA Problems

  • Currently, comm_queue_wait is performed as a fence
    – Pipelining is lost, unlike in the peer-to-peer version
    – MPI-3 RMA does not provide APIs for fine-grained synchronization
  • Using MPI_Send/Recv for notification may improve performance scaling
SLIDE 44

Persistence Mode Problem

  • The performance shown in the informal proceedings was degraded
    – this seems to be related to Persistence Mode (PM)
    – PM keeps the NVIDIA GPU driver loaded even if no application is using the GPUs
    – We had enabled it on HA-PACS/TCA

"A flag that indicates whether persistence mode is enabled for the GPU. Value is either 'Enabled' or 'Disabled'. When persistence mode is enabled the NVIDIA driver remains loaded even when no active clients, such as X11 or nvidia-smi, exist. This minimizes the driver load latency associated with running dependent apps, such as CUDA programs. For all CUDA-capable products. Linux only."
— quoted from the nvidia-smi man page
SLIDE 45

Persistence Mode Problem

  • Run on our development platform (not on HA-PACS/TCA)
  • MPI-P2P, Large Model, 2 nodes, InfiniBand
  • 50 runs

[Figure: GFLOPS over 50 runs with driver 331.89, Persistence Mode off vs. on]

  • PM degrades QUDA's performance
SLIDE 46

Communication API in QUDA

  • QUDA supports two communication APIs
    – MPI and Lattice QCD Message Passing (QMP)
    – QUDA has a communication abstraction layer
    – based on peer-to-peer semantics
  • We extend the API to use one-sided communication
    – to support TCA communication
    – Additionally, MPI-3 RMA is also supported
SLIDE 47

GPU Communication with CUDA-aware MPI

  • Recently, some MPI implementations have gained native CUDA support (a.k.a. CUDA-aware MPI)
    – We can pass GPU pointers directly to MPI functions
    – The data copies are optimized internally, e.g. by pipelining
    – Three data movements are still required

[Figure: 1: optimized data copy from GPU to CPU; 2: data transfer over IB; 3: optimized data copy from CPU to GPU]
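A minimal sketch of the CUDA-aware usage (illustrative buffer names; compare with the staged version earlier): the device pointers go straight into MPI, and the library performs the staging, or GPUDirect RDMA where available, internally:

    /* With a CUDA-aware MPI, device pointers can be passed directly. */
    float *d_send, *d_recv;
    cudaMalloc((void **)&d_send, n * sizeof(float));
    cudaMalloc((void **)&d_recv, n * sizeof(float));

    /* No explicit cudaMemcpy: the MPI library moves the data GPU -> host -> GPU
       (pipelined), or GPU -> GPU over the network with GDR, on its own. */
    MPI_Sendrecv(d_send, (int)n, MPI_FLOAT, peer, 0,
                 d_recv, (int)n, MPI_FLOAT, peer, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);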

SLIDE 48

SLIDE 49

TCA Implementation

  • Based on the MPI-3 RMA implementation
    – TCA is used only for the core communication
      • halo data exchange
      • Allreduce (built on TCA RMA write operations)
    – The TCA network is not self-contained
      • MPI is used for initialization
  • An RmaWindow contains a memory region allocated with the TCA memory allocation API
  • Currently, comm_queue_wait acts as a fence, as in the MPI implementation