Scien&fic Simula&ons on Thousands of GPUs with - PowerPoint PPT Presentation

Scien&fic ¡Simula&ons ¡on ¡ Thousands ¡of ¡GPUs ¡with ¡ Performance ¡Portability ¡ Alan Gray and Kevin Stratford EPCC, The University of Edinburgh

CORAL ¡procurement ¡ ¡ • Three ¡“pre-‑exascale” ¡machines ¡have ¡been ¡announced ¡in ¡the ¡US, ¡each ¡ in ¡the ¡region ¡of ¡100-‑300 ¡petaflops ¡ • Summit ¡at ¡ORNL ¡and ¡ Sierra ¡at ¡LLNL ¡will ¡use ¡ NVIDIA ¡GPUs ¡ (with ¡IBM ¡ CPUs) ¡. ¡ ¡ • Aurora ¡at ¡Argonne ¡will ¡use ¡ Intel ¡Xeon ¡Phi ¡many-‑core ¡CPUs ¡ (Cray ¡ system) ¡ • Performance ¡Portability ¡ is ¡the ¡key ¡issue ¡for ¡the ¡programmer ¡ 2

Outline  Applications: Ludwig and MILC  Performance Portability with targetDP  Performance results on GPU, CPU and Xeon Phi § Using same source code for each  Scaling to many nodes with MPI+targetDP 3

Ludwig ¡Applica&on ¡ • So3 ¡ma4er ¡ substances ¡or ¡ complex ¡fluids ¡ are ¡all ¡around ¡us ¡ • Ludwig: ¡uses ¡laVce ¡Boltzmann ¡and ¡finite ¡difference ¡methods ¡to ¡ simulate ¡a ¡wide ¡range ¡of ¡systems ¡ • Improving ¡the ¡understanding ¡of, ¡and ¡ability ¡to ¡ manipulate, ¡ liquid ¡crystals ¡ is ¡a ¡very ¡ac&ve ¡research ¡ area ¡ ¡ • But ¡required ¡simula&ons ¡can ¡be ¡extremely ¡ computa&onally ¡demanding, ¡due ¡to ¡range ¡of ¡scales ¡ involved ¡ • targetDP ¡developed ¡in ¡co-‑design ¡with ¡Ludwig ¡ Gray, A., Hart, A., Henrich, O. & Stratford, K., Scaling soft Stratford, K., A. Gray, and J. S. Lintuvuori. " Large Colloids matter physics to thousands of graphics processing units in in Cholesteric Liquid Crystals ." Journal of Statistical 4 parallel, IJHPCA (2015) Physics 161.6 (2015): 1496-1507.

MILC ¡applica&on ¡ • LaVce ¡QCD ¡simula&ons ¡provide ¡numerical ¡ studies ¡to ¡help ¡understand ¡how ¡quarks ¡and ¡ gluons ¡interact ¡to ¡form ¡protons, ¡neutrons ¡and ¡ other ¡elementary ¡par&cles. ¡ • The ¡Unified ¡European ¡Applica&on ¡ Benchmark ¡Suite ¡(UEABS) ¡is ¡a ¡set ¡of ¡12 ¡ applica&on ¡codes ¡designed ¡to ¡be ¡ representa&ve ¡of ¡EU ¡HPC ¡usage ¡ ¡ including ¡LaVce ¡QCD ¡component, ¡ derived ¡from ¡MILC ¡codebase ¡ • targetDP ¡applied ¡to ¡this ¡ ¡ h_p://www.prace-‑ri.eu/ueabs/ ¡ applica&on ¡benchmark ¡to ¡enable ¡ for ¡GPU ¡and ¡Xeon ¡Phi ¡ 5

Mul&-‑valued ¡data ¡ ¡ • For ¡most ¡scien&fic ¡simula&ons ¡the ¡bo_leneck ¡is ¡ memory ¡bandwidth ¡ • Simula&on ¡data ¡consists ¡of ¡ mul?ple ¡values ¡ at ¡ each ¡site ¡ • In ¡memory, ¡we ¡have ¡a ¡choice ¡of ¡how ¡to ¡store ¡this ¡ ¡ |rgb|rgb|rgb|rgb| ¡ ¡(Array ¡of ¡Structs ¡AoS) ¡ ¡ |rrrr|gggg|bbbb| ¡ ¡(Struct ¡of ¡Arrays ¡SoA) ¡ ¡ Most ¡general ¡case ¡is ¡Array ¡of ¡Structs ¡of ¡(short) ¡Arrays ¡(AoSoA) ¡ ¡ E.g. ¡||rr|gg|bb|||rr|gg|bb|| ¡ ¡has ¡SA ¡length ¡of ¡2 ¡ ¡ Major ¡effect ¡on ¡bandwidth. ¡Best ¡layout ¡architecture-‑specific ¡ • Solu&on: ¡ ¡ ¡ De-‑couple ¡memory ¡layout ¡from ¡applica&on ¡source ¡code ¡ ¡ Can ¡simply ¡be ¡done ¡with ¡macro, ¡e.g. ¡ ¡ field[INDEX(iDim,iSite)] 6

targetDP ¡ • Simple ¡serial ¡code ¡example: ¡loop ¡over ¡N ¡grid ¡points ¡ ¡ With ¡some ¡opera&on ¡ … ¡at ¡each ¡point ¡ int iSite; for (iSite = 0; iSite < N; iSite++) { ... } 7

• OpenMP ¡ int iSite; #pragma omp parallel for • targetDP ¡ for (iSite = 0; iSite < N; iSite++) { ... __targetEntry__ void scale(double* field){ } int iSite; • CUDA ¡ __targetTLP__(iSite, N) { __global__ void scale(double* field) { ... } int iSite; return; iSite=blockIdx.x*blockDim.x+threadIdx.x } if(iSite<N) { ... } return; } 8

__targetEntry__ void scale(double* t_field) { int index; __targetTLP__ (iSite, N) { int iDim; for (iDim = 0; iDim < 3; iDim++) { t_field[INDEX(iDim,iSite)] = t_a*t_field[INDEX(iDim,iSite)]; } } return; } • PROBLEM: ¡ to ¡fully ¡u&lise ¡modern ¡CPUs, ¡compiler ¡must ¡vectorize ¡innermost ¡ loops ¡to ¡create ¡vector ¡instruc&ons. ¡ ¡ • SOLUTION : ¡TLP ¡can ¡be ¡strided, ¡such ¡that ¡each ¡thread ¡operates ¡on ¡chunk ¡of ¡VVL ¡ laVce ¡sites ¡ ¡ ¡ VVL ¡must ¡be ¡1 ¡for ¡above ¡example ¡to ¡work ¡ ¡ But ¡we ¡can ¡set ¡VVL>1, ¡and ¡add ¡a ¡new ¡innermost ¡loop ¡ 9

__targetEntry__ void scale(double* t_field) { int baseIndex; __targetTLP__(baseIndex, N) { int iDim, vecIndex; for (iDim = 0; iDim < 3; iDim++) { __targetILP__(vecIndex) \ t_field[INDEX(iDim,baseIndex+vecIndex)] = \ t_a*t_field[INDEX(iDim,baseIndex+vecIndex)]; } } return; } • ILP ¡can ¡map ¡to ¡loop ¡over ¡chunk ¡of ¡laVce ¡sites, ¡with ¡OpenMP ¡SIMD ¡ direc&ve ¡ • Easily ¡vectorizable ¡by ¡compiler ¡ ¡ • VVL ¡can ¡be ¡tuned ¡specifically ¡for ¡hardware, ¡e.g. ¡VVL=8 ¡will ¡create ¡single ¡ IMCI ¡instruc&on ¡for ¡8-‑way ¡DP ¡vector ¡unit ¡on ¡Xeon ¡Phi ¡ ¡ Without ¡this, ¡performance ¡is ¡several ¡&mes ¡worse ¡on ¡Xeon ¡Phi ¡ • We ¡can ¡just ¡map ¡to ¡an ¡empty ¡macro, ¡when ¡we ¡don’t ¡want ¡ILP ¡ 10

• Func&on ¡called ¡from ¡host ¡code ¡using ¡wrappers ¡to ¡CUDA ¡API ¡ ¡ That ¡can ¡alterna&vely ¡map ¡to ¡regular ¡CPU ¡(malloc, ¡memcpy ¡etc) ¡ targetMalloc((void **) &t_field, datasize); copyToTarget(t_field, field, datasize); copyConstDoubleToTarget(&t_a, &a, sizeof(double)); scale __targetLaunch__(N) (t_field); targetSynchronize(); copyFromTarget(field, t_field, datasize); targetFree(t_field); 11

Results ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡CPU ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡Xeon ¡Phi ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡GPU ¡ • Same ¡performance-‑portable ¡targetDP ¡ source ¡code ¡on ¡all ¡ architectures ¡ 12

700" Full$Ludwig$Liquid$Crystal$128x128x128$Test$Case$$ 600" Ludwig"Remainder" Advect."Bound." 500" AdvecPon" LC"Update" 400" !me$(s)$ Chemical"Stress" Order"Par."Grad." 300" Collision" PropagaPon" 200" " "" 100" " " " "" "" 0" " " Intel"Ivy1 Intel"Haswell" AMD" Intel"Xeon" NVIDIA"K20X" NVIDIA"K40" bridge"121 81core"CPU" Interlagos" Phi"" GPU" GPU" core"CPU" 161core"CPU" AoSoA,%% Best%% AoSoA,%% %%AoS,%% %%AoS,%% %%SoA,%% %%SoA,%% VVL=8% Config:% VVL=4% VVL=1% VVL=1% VVL=1% VVL=1% 13

700" Full$MILC$Conjugate$Gradient$64x64x32x8$Test$Case$$ 600" MILC"Remainder" ShiN" 500" Scalar"Mult."Add" 400" Insert" !me$(s)$ Insert"&"Mult." 300" Extract"&"Mult." Extract" 200" " 100" "" " " " "" "" " " 0" Intel"Ivy1 Intel"Haswell" AMD" Intel"Xeon" NVIDIA"K20X" NVIDIA"K40" bridge"121 81core"CPU" Interlagos" Phi"" GPU" GPU" core"CPU" 161core"CPU" Best%% AoSoA,%% %%AoS,%% %%AoS,%% AoSoA,%% %%SoA,%% %%SoA,%% Config:% VVL=4% VVL=1% VVL=1% VVL=8% VVL=1% VVL=1% 14

Scien&fic Simula&ons on Thousands of GPUs with - PowerPoint PPT Presentation

Scien&fic Simula&ons on Thousands of GPUs with Performance Portability Alan Gray and Kevin Stratford EPCC, The University of Edinburgh CORAL procurement Three

FIC Roadshow Enforcement of the FIC Act September to October 2016 AGENDA Supervision of

Scien&fic Data File Formats Han-Wei Shen The Ohio

terms of the FIC Act AGENDA Compliance with the FIC Act Registration and Reporting Enforcement

Estate Agency Affairs Board FIC AMENDMENT ACT 1 FIC Amendment Act This is an overview of the

Introduc)on to Scien)fic and Technical compu)ng SSC 335/394, 2011 Victor

Scien&fic Skep&cism What is Skep&cism What does it mean to be skep&cal?

Click to edit Master /tle style "Scien'fic thought is the common heritage of

Welcome to the Biology FEST Launch Party! Biology FEST: Faculty Explora.ons in

Heimdallr: A Dataset Michael Riegler, Simula & UiO Duc-Tien Dang-Nguyen, University of Trento

FIC Roadshow Compliance with the FIC Act September to October 2016 AGENDA Applicable

Biology FEST: Faculty Explora.ons in Scien.fic Teaching Biology FEST

Biology FEST: Faculty Explora.ons in Scien.fic Teaching Biology FEST

Biology FEST: Faculty Explora.ons in Scien.fic Teaching Biology FEST

SSC 335/394: Scien.fic and Technical Compu.ng Computer Architectures:

Biology FEST: Faculty Explora.ons in Scien.fic Teaching HHMI Biology

Func+on applica+ons (calls, invoca+ons) lambda denotes a anonymous func+on To use a func+on, you

INVESTOR PRESENTATION 1 H 2 0 1 8 R E S U L T S 2 5 J U L Y 2 0 1 8 This presentation is for

Writing for the Web Marketing & Creative Services Today we are going to discuss How

(Three of) the eight key competences and why they matter Dr Sandra Leaton Gray UCL Institute

Financing a College Education After this program, you should know The types of aid

Job Order Contracting for Small Facilities and Maintenance Projects TE XAS P UBLI C P URCH ASI N

Peter Quinter, Attorney Customs & International Trade Law Group GrayRobinson, P.A. Mobile

February 11, 2014 Honorable Adam Gray Chair, Joint Legislative Audit Committee 1020 N Street,

The Business Savvy Strategist Todd Corham - Lowenstein Sandler PC Jeffrey K. Brandt - ILTA

Scien&fic Simula&ons on Thousands of GPUs with - PowerPoint PPT Presentation

Scien&fic Simula&ons on Thousands of GPUs with Performance Portability Alan Gray and Kevin Stratford EPCC, The University of Edinburgh CORAL procurement Three

FIC Roadshow Enforcement of the FIC Act September to October 2016 AGENDA Supervision of

Scien&amp;fic Data File Formats Han-Wei Shen The Ohio

terms of the FIC Act AGENDA Compliance with the FIC Act Registration and Reporting Enforcement

Estate Agency Affairs Board FIC AMENDMENT ACT 1 FIC Amendment Act This is an overview of the

Introduc)on to Scien)fic and Technical compu)ng SSC 335/394, 2011 Victor

Scien&amp;fic Skep&amp;cism What is Skep&amp;cism What does it mean to be skep&amp;cal?

Click to edit Master /tle style &quot;Scien'fic thought is the common heritage of

Welcome to the Biology FEST Launch Party! Biology FEST: Faculty Explora.ons in

Heimdallr: A Dataset Michael Riegler, Simula &amp; UiO Duc-Tien Dang-Nguyen, University of Trento

FIC Roadshow Compliance with the FIC Act September to October 2016 AGENDA Applicable

Biology FEST: Faculty Explora.ons in Scien.fic Teaching Biology FEST

Biology FEST: Faculty Explora.ons in Scien.fic Teaching Biology FEST

Biology FEST: Faculty Explora.ons in Scien.fic Teaching Biology FEST

SSC 335/394: Scien.fic and Technical Compu.ng Computer Architectures:

Biology FEST: Faculty Explora.ons in Scien.fic Teaching HHMI Biology

Func+on applica+ons (calls, invoca+ons) lambda denotes a anonymous func+on To use a func+on, you

INVESTOR PRESENTATION 1 H 2 0 1 8 R E S U L T S 2 5 J U L Y 2 0 1 8 This presentation is for

Writing for the Web Marketing &amp; Creative Services Today we are going to discuss How

(Three of) the eight key competences and why they matter Dr Sandra Leaton Gray UCL Institute

Financing a College Education After this program, you should know The types of aid

Job Order Contracting for Small Facilities and Maintenance Projects TE XAS P UBLI C P URCH ASI N

Peter Quinter, Attorney Customs &amp; International Trade Law Group GrayRobinson, P.A. Mobile

February 11, 2014 Honorable Adam Gray Chair, Joint Legislative Audit Committee 1020 N Street,

The Business Savvy Strategist Todd Corham - Lowenstein Sandler PC Jeffrey K. Brandt - ILTA

Scien&fic Data File Formats Han-Wei Shen The Ohio

Scien&fic Skep&cism What is Skep&cism What does it mean to be skep&cal?

Click to edit Master /tle style "Scien'fic thought is the common heritage of

Heimdallr: A Dataset Michael Riegler, Simula & UiO Duc-Tien Dang-Nguyen, University of Trento

Writing for the Web Marketing & Creative Services Today we are going to discuss How

Peter Quinter, Attorney Customs & International Trade Law Group GrayRobinson, P.A. Mobile