

SLIDE 1

Thomas Jefferson National Accelerator Facility

NP SciDAC Project: JLab Site Report

Bálint Joó, Jefferson Lab, Oct 18, 2013

SLIDE 2

JLab Year 1 Tasks

  • T1: Extend the Just-In-Time (JIT) based version of QDP++ to use multiple GPUs, for deployment on large-scale GPU-based systems
  • T2: Optimize QDP++ and Chroma for large-scale multi-GPU resources
  • T3: Continue collaboration with Intel Corporation on MIC
  • T4: Develop a generalized contraction code, suitable for a small number of initial algorithms and final-state particles, using a three-dimensional implementation over QDP++
  • T5: Implement “distillation” for the study of hadronic structure and matrix elements
  • T6: Optimize QDP++ and Chroma to efficiently use the new floating-point features of the BG/Q

SLIDE 3

Tasks 1 & 2: QDP++ & Chroma on GPU

  • Status: Done.
  • QDP-JIT/PTX is in production on Titan
  • Using Chroma + QUDA solvers & Device interface
  • Paper in preparation for IPDPS
  • Staffing:
  • Frank Winter and Bálint Joó; NVIDIA colleague (not funded by us): Mike Clark

SLIDE 4

A Sampling of Results

  • NVIDIA K20m GPUs have a maximum memory bandwidth of ~180 (208) GB/sec with ECC on (off)
  • QDP-JIT/PTX achieves 150 (162) GB/sec, ~83% (~78%) of peak, with ECC on (off)
  • Maximum performance is reached around a 12⁴–14⁴ local lattice on a single node

[Figure: achieved memory bandwidth vs. local lattice size, showing the shoulder of the good memory-bandwidth region]
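As a quick check, the quoted fractions of peak follow directly from the measured and peak bandwidths:

```latex
% Achieved fraction of peak memory bandwidth
\frac{150~\mathrm{GB/s}}{180~\mathrm{GB/s}} \approx 0.83 \quad (\text{ECC on}),
\qquad
\frac{162~\mathrm{GB/s}}{208~\mathrm{GB/s}} \approx 0.78 \quad (\text{ECC off})
```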

SLIDE 5

Large Scale Running

  • Runs on Titan from the summer
  • In terms of local volume, we hit the “shoulder region” at 400 GPUs
  • Local volume: 40x8x8x16 = 40960 sites
  • Cannot strong-scale this volume beyond 800 nodes

[Figure: trajectory time in seconds vs. number of nodes (400–1600) for CPU (all MPI) and GPU (QDP-JIT/PTX + QUDA); V = 40x40x40x256, mπ ~ 230 MeV, anisotropic clover; 0.65x execution time = 1.53x speedup, 0.5x execution time = 2x speedup]

Start of shoulder region (~14⁴ sites/node)
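As a worked check of these numbers (the 1x5x5x16 processor grid is an assumption, chosen to be consistent with the quoted local volume):

```latex
% Global lattice 40x40x40x256 over 400 GPUs on an assumed
% 1 x 5 x 5 x 16 processor grid (1 \cdot 5 \cdot 5 \cdot 16 = 400):
\frac{40}{1}\times\frac{40}{5}\times\frac{40}{5}\times\frac{256}{16}
  = 40\times 8\times 8\times 16 = 40960~\text{sites per GPU},
\qquad 14^4 = 38416~\text{sites (shoulder)}
```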

SLIDE 6

Collaboration with SUPER

  • New student on the project: Diptorup Deb
  • Identifying potential inefficiencies in QDP-JIT/PTX/LLVM
  • e.g. C++ literal constants used in operations impact performance (see the sketch below)
  • up to 20% overhead in certain operations
  • Future work on QDP-JIT:
  • code refactoring (improve performance of huge kernels; requires LLVM work)
  • kernel fusion (requires higher-level work)
  • Staffing: Frank Winter (JLab); Rob Fowler, Allan Porterfield, Diptorup Deb (RENCI)
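A minimal sketch of the pattern in question, assuming (as is common for JIT code generators) that a literal embedded in an expression ends up specializing the generated kernel, while a named Real can be passed as a kernel argument; the 20% figure is the SUPER measurement, not something this toy code demonstrates:

```cpp
// Illustrative only: the same QDP++ expression written with a literal
// constant versus a named variable. Under a JIT backend the literal
// variant may bake the constant into the generated kernel.
#include "qdp.h"
using namespace QDP;

void scale_field(LatticeFermion& chi, const LatticeFermion& psi)
{
    chi = Real(2.0) * psi;   // literal constant in the expression

    Real a = 2.0;            // named constant, passable as a kernel argument
    chi = a * psi;
}
```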

SLIDE 7

Task 3: Xeon Phi & work with Intel

  • Status
  • Paper at ISC’13, Leipzig, June 16-20, 2013
  • Running code on Stampede @ TACC for testing purposes
  • Current focus: integration with Chroma, boundaries, anisotropy, double precision
  • Prognosis: excellent
  • Intel colleagues are engaged and enthusiastic
  • Staffing:
  • Bálint Joó, Jie Chen + Intel colleagues (not funded by us): primarily M. Smelyanskiy, D. Kalamkar, K. Vaidyanathan
  • Future steps:
  • Aiming for a public code release soon

SLIDE 8

Optimizing QDP++ on Xeon Phi

  • QDP++ ‘parscalarvec’ - work by Jie Chen & B. Joó
  • vector-friendly layout in QDP++
  • Single Xeon Phi comparable to 2 SNB sockets (no intrinsics, no prefetch)
  • parscalarvec intrinsic-free host code comparable to SSE-optimized host code for single precision

!" #!!!!" $!!!!" %!!!!" &!!!!" '!!!!" (!!!!" )!!!!" *+*" ,-./*0+*" *12*+*" *12,-./*0+*" 32*+3" 42*+4" !"#$%&%'%()*+,-& 56-"789:";8-<"=>:7"??@A%" 789:"$B#(":7C<,-9" &":7C<,-9D;8C<" %":7C<,-9D;8C<" $":7C<,-9D;8C<" #":7C<,-D;8C<"

SLIDE 9

Optimized Code: Wilson Dslash

  • Blocking scheme maximises the number of cores used
  • SOA layout with tuned ‘inner array length’
  • CPU performance is also excellent (used 2.6 GHz SNB)
  • Here Xeon Phi is comparable to 4 sockets

!"!# !"!# ""$# "%"# "$%# "&'# !((# "))# !""# !"*# "$!# "&'# ")&# *!&# !(%# ")!# !!'# !!'# "+'# "%!# "*'# *+!# !'$# ""(# !!&# !!&# "!%# "(+# "$"# *+&# !(!# "$!# !""# !""# "*(# "&)# ")(# *"+# !("# "$$#

+# )+# !++# !)+# "++# ")+# *++# ,-./01234435# 6/01234435# ,-./01234435# 6/01234435# ,-./01234435# 6/01234435# ,-./01234435# 6/01234435# 7-839:#;3/-:##<)="'(+#>?@A=<BC# 7-839:#;3/-#BDEF#>G@6C#)!!+B# 7-839:#;3/-#BDEF#>G@6C# A!BHI=%!!+B# @J7K7L:#G31932#G"+0# ME94/-#K49N4D# JO"$P"$P"$P!"(## JO*"P*"P*"P!"(# JO$+P$+P$+P&'# JO$(P$(P"$P'$# JO*"P$+P"$P&'#

From: B. Joó, D. D. Kalamkar, K. Vaidyanathan, M. Smelyanskiy, K. Pamnani, V. W. Lee, P. Dubey, W. Watson III, “Lattice QCD on Intel(R) Xeon Phi(tm) Coprocessors”, Proceedings of ISC’13 (Leipzig), Lecture Notes in Computer Science Vol 7905 (to appear).
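A sketch of the “SOA layout with tuned inner array length” idea (hypothetical names; the tunable inner length need not equal the full SIMD width, which is what makes it a tuning knob):

```cpp
// Hypothetical four-spinor block in SOA form: SOALEN sites are packed
// contiguously per (spin, color, re/im) component. SOALEN is tuned per
// architecture rather than fixed at the vector width, trading vector
// efficiency against cache-line and blocking behaviour.
template <int SOALEN>
struct FourSpinorBlock {
    float v[4][3][2][SOALEN];   // [spin][color][re-im][site lane]
};

// e.g. SOALEN = 8 or 16 might be tuned on Xeon Phi, 4 or 8 on SNB.
using PhiSpinorBlock = FourSpinorBlock<16>;
```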

SLIDE 10

Multi-Node Performance

  • 2D comms only (Z & T)
  • vectorization mixes X & Y
  • Intel Endeavor cluster
  • 1 Xeon Phi device per node
  • MPI proxy:
  • pick the fastest-bandwidth path between devices (via the host in this case); see the sketch below
  • similar to GPU strong scaling at this level (expected)
  • Space to explore here: e.g. CCL-Proxy, MVAPICH2-MIC

!"#$ ""%&$ &"'($ )!''$ )#()$

&&%"$ *('($ +(+!$

($ "((($ &((($ )((($ *((($ +((($ !((($ &$ *$ %$ "!$ )&$ ,-./01$23$4025$678$-589:$ ;<)&=)&=)&=&+!$ ;<*%=*%=*%=&+!$ !""# $%&# '"(&# %")$# %$!!# '**&# %&!"# )&)"# (# "((# '(((# '"((# %(((# %"((# )(((# )"((# !(((# !"((# %# !# $# '*# )%# +,-./0#12#3/14#567#8479:# ;<)%=)%=)%=%"*# ;<!$=!$=!$=%"*#

Wilson Dslash Wilson CG

From: B. Joó, D. D. Kalamkar, K. Vaidyanathan, M. Smelyanskiy, K. Pamnani, V. W. Lee, P. Dubey, W. Watson III, “Lattice QCD on Intel(R) Xeon Phi(tm) Coprocessors”, Proceedings of ISC’13 (Leipzig), Lecture Notes in Computer Science Vol 7905 (to appear).
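A minimal sketch of the proxy idea under stated assumptions: halo data is staged through host memory, and the host does the wire transfer because the host-to-host fabric path is faster than direct device-to-device MPI. The function and names are illustrative, not the actual proxy code:

```cpp
// Illustrative host-side relay of one halo-face exchange. The device
// stages send_buf into host memory over PCIe; the host performs the
// network transfer on the fast host<->host path, then the device pulls
// recv_buf back. (Sketch only; the real proxy also overlaps faces.)
#include <mpi.h>

void relay_face(float* send_buf, float* recv_buf, int count,
                int fwd_rank, int bwd_rank, MPI_Comm comm)
{
    MPI_Request reqs[2];
    MPI_Irecv(recv_buf, count, MPI_FLOAT, bwd_rank, 0, comm, &reqs[0]);
    MPI_Isend(send_buf, count, MPI_FLOAT, fwd_rank, 0, comm, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}
```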

SLIDE 11

Weak Scaling on Stampede

  • Without proxy:
  • drop in performance when going to multiple nodes
  • performance halves when introducing a second comms direction
  • suggests the issue is with async progress rather than attainable bandwidth or latency (see the sketch below)
  • With proxy:
  • small drop in performance from 1D to 2D comms; more likely due to B/W constraints...
  • Dslash scaled to 31 TF on 128 nodes (CG to 16.8 TF)

[Figure: Wilson Dslash weak scaling, 48x48x24x64 sites per node, single precision; GFLOPS per node vs. number of nodes (1–128), without proxy vs. with CML proxy, for communication in 1 and 2 dimensions; tops out at 31 TF]
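A sketch of the asynchronous-progress issue hinted at above, assuming the standard MPI behaviour that nonblocking transfers may only advance inside MPI calls; polling with MPI_Testall during compute is one common workaround when no proxy or progress thread is available:

```cpp
// Illustrative compute/communication overlap loop: interleave chunks of
// interior work with MPI_Testall calls so the MPI progress engine keeps
// the pending halo exchange moving.
#include <mpi.h>

void do_interior_chunk();   // hypothetical: one tile of interior Dslash work

void overlap_interior_work(MPI_Request* reqs, int nreqs)
{
    int done = 0;
    while (!done) {
        do_interior_chunk();
        MPI_Testall(nreqs, reqs, &done, MPI_STATUSES_IGNORE);
    }
}
```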

SLIDE 12

Strong Scaling

  • Endeavor results from the ISC’13 paper
  • Need to understand the performance difference better, but:
  • Stampede:
  • no icache_snoop_off
  • shape difference may be due to a different virtual topology on 16 nodes
  • Over 4 TF reached in Dslash and 3.6 TF reached in CG on Stampede

[Figure: strong scaling, 48x48x48x256 sites, single precision, using the CML proxy; GFLOPS vs. number of nodes (4–64) for Dslash and CG on Endeavor and Stampede]

SLIDE 13

Tasks 4 & 5: Contractions etc.

  • Status
  • ‘redstar’ - generalized contraction code:
  • Computes quark-line diagrams, generates the list of propagators for Chroma to compute
  • ‘harom’ - 3D version of QDP++
  • Performs timeslice-by-timeslice contractions
  • Status: In production now
  • Used for multi-particle 2pt and 3pt calculations with the variational method
  • Staff:
  • Robert Edwards, Jie Chen
  • Follow-on tasks:
  • More optimization via BLAS/LAPACK library integration (GPU acceleration via CUBLAS?); see the sketch below
  • Support “isobar”-like operator constructions (recursive contractions)
  • Improve I/O performance
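A hedged sketch of why BLAS integration is attractive here: in the distillation approach a timeslice-by-timeslice contraction reduces to small, dense complex matrix multiplies over the distillation-vector indices, one per timeslice, each of which maps directly onto ZGEMM (and onto CUBLAS on a GPU). Dimensions and names are illustrative:

```cpp
// Illustrative only: C_t = A_t * B_t for each timeslice t, where the
// matrices are nev x nev over distillation-vector indices. Each call
// is a standard complex GEMM, so a tuned BLAS (or CUBLAS) applies.
#include <complex>
#include <vector>
#include <cblas.h>

using cplx = std::complex<double>;

void contract_timeslices(const std::vector<cplx>& A,
                         const std::vector<cplx>& B,
                         std::vector<cplx>& C,
                         int nt, int nev)
{
    const cplx one(1.0, 0.0), zero(0.0, 0.0);
    for (int t = 0; t < nt; ++t) {
        const cplx* At = &A[t * nev * nev];
        const cplx* Bt = &B[t * nev * nev];
        cplx*       Ct = &C[t * nev * nev];
        cblas_zgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    nev, nev, nev, &one, At, nev, Bt, nev, &zero, Ct, nev);
    }
}
```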

SLIDE 14

Task 6: Chroma on BG/Q

  • Status: Slight progress
  • Chroma has compiled with XLC on BG/Q, but performance is low
  • Have a Clover solver through BAGEL, not yet integrated
  • IBM has re-coded the SSE cpp_wilson_dslash package for BG/Q under contract with ANL
  • Integrated this on Cetus. Best observed performance is on the order of:
  • ~11-12% of peak in my single-node tests
  • ~7-9% of peak communicating in all directions
  • Tried parscalarvec on a single node, but this still needs work
  • Staffing:
  • Bálint Joó

SLIDE 15

New/Old Tasks

  • Need for algorithmic improvement in solvers for CPUs
  • Driven by the availability of unaccelerated resources:
  • Non-GPU part of Blue Waters (90% of the resource)
  • BlueGene
  • Edison at NERSC
  • Would need integration of multi-grid / domain-decomposed solvers
  • Would need this to improve throughput in both analysis and HMC
  • Combining improved solvers with HMC is a research area

SLIDE 16

Summary

  • Good progress on most of the Y1 tasks:
  • Y1 GPU tasks complete; now on to LLVM
  • Xeon Phi tasks are coming along
  • redstar/harom in production
  • We have some exposure on the BlueGene/Q, which needs investment
  • Focus for the rest of the year:
  • Xeon Phi + Chroma integration and improvement (aim for publication of the code)
  • Improvements to parscalarvec; investigate offload possibilities
  • Analysis improvements:
  • New tasks: integrate advanced solvers
