

SLIDE 1

Thomas Jefferson National Accelerator Facility

NP SciDAC Project: JLab Site Report

Bálint Joó, Jefferson Lab, Oct 18, 2013

SLIDE 2

JLab Year 1 Tasks

  • T1: Extend the Just-In-Time (JIT) based version of QDP++ to use multiple GPUs, for deployment on large-scale GPU-based systems
  • T2: Optimize QDP++ and Chroma for large-scale multi-GPU resources
  • T3: Continue collaboration with Intel Corporation on MIC
  • T4: Develop a generalized contraction code, suitable for a small number of initial algorithms and final-state particles, using a three-dimensional implementation over QDP++
  • T5: Implement “distillation” for the study of hadronic structure and matrix elements
  • T6: Optimize QDP++ and Chroma to efficiently use the new floating-point features of the BG/Q

SLIDE 3

Tasks 1 & 2: QDP++ & Chroma on GPU

  • Status: Done.
  • QDP-JIT/PTX is in production on Titan
  • Using Chroma + QUDA solvers & Device interface
  • Paper in preparation for IPDPS
  • Staffing:
  • Frank Winter and Bálint Joó; NVIDIA colleague (not funded by us): Mike Clark

SLIDE 4

A Sampling of Results

  • NVIDIA K20m GPUs have a maximum memory bandwidth of ~180 (208) GB/sec with ECC on (off)
  • QDP-JIT/PTX achieves 150 (162) GB/sec, ~83% (~78%) of peak, with ECC on (off)
  • Maximum performance is reached around a 12⁴–14⁴ local lattice on a single node

[Figure: achieved memory bandwidth vs. local lattice size, showing the shoulder of the good memory-bandwidth region]
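As a quick check, the quoted fractions of peak follow directly from the measured and peak bandwidths:

```latex
% Achieved fraction of peak memory bandwidth
\frac{150~\mathrm{GB/s}}{180~\mathrm{GB/s}} \approx 0.83 \quad (\text{ECC on}),
\qquad
\frac{162~\mathrm{GB/s}}{208~\mathrm{GB/s}} \approx 0.78 \quad (\text{ECC off})
```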

SLIDE 5

Large Scale Running

  • Runs on Titan from the summer
  • In terms of local volume, we hit the “shoulder region” at 400 GPUs
  • Local volume: 40x8x8x16 = 40960 sites
  • Cannot strong-scale this volume beyond 800 nodes

[Figure: trajectory time in seconds vs. number of nodes (400–1600) for CPU (all MPI) and GPU (QDP-JIT/PTX + QUDA); V = 40x40x40x256, mπ ~ 230 MeV, anisotropic clover; 0.65x execution time = 1.53x speedup, 0.5x execution time = 2x speedup]

Start of shoulder region (~14⁴ sites/node)
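As a worked check of these numbers (the 1x5x5x16 processor grid is an assumption, chosen to be consistent with the quoted local volume):

```latex
% Global lattice 40x40x40x256 over 400 GPUs on an assumed
% 1 x 5 x 5 x 16 processor grid (1 \cdot 5 \cdot 5 \cdot 16 = 400):
\frac{40}{1}\times\frac{40}{5}\times\frac{40}{5}\times\frac{256}{16}
  = 40\times 8\times 8\times 16 = 40960~\text{sites per GPU},
\qquad 14^4 = 38416~\text{sites (shoulder)}
```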

SLIDE 6

Collaboration with SUPER

  • New student on the project: Diptorup Deb
  • Identifying potential inefficiencies in QDP-JIT/PTX/LLVM
  • e.g. C++ literal constants used in operations impact performance (see the sketch below)
  • up to 20% overhead in certain operations
  • Future work on QDP-JIT:
  • code refactoring (improve performance of huge kernels; requires LLVM work)
  • kernel fusion (requires higher-level work)
  • Staffing: Frank Winter (JLab); Rob Fowler, Allan Porterfield, Diptorup Deb (RENCI)
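A minimal sketch of the pattern in question, assuming (as is common for JIT code generators) that a literal embedded in an expression ends up specializing the generated kernel, while a named Real can be passed as a kernel argument; the 20% figure is the SUPER measurement, not something this toy code demonstrates:

```cpp
// Illustrative only: the same QDP++ expression written with a literal
// constant versus a named variable. Under a JIT backend the literal
// variant may bake the constant into the generated kernel.
#include "qdp.h"
using namespace QDP;

void scale_field(LatticeFermion& chi, const LatticeFermion& psi)
{
    chi = Real(2.0) * psi;   // literal constant in the expression

    Real a = 2.0;            // named constant, passable as a kernel argument
    chi = a * psi;
}
```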

SLIDE 7

Task 3: Xeon Phi & work with Intel

  • Status
  • Paper at ISC’13, Leipzig, June 16-20, 2013
  • Running code on Stampede @ TACC for testing purposes
  • Current focus: integration with Chroma, boundaries, anisotropy, double precision
  • Prognosis: excellent
  • Intel colleagues are engaged and enthusiastic
  • Staffing:
  • Bálint Joó, Jie Chen + Intel colleagues (not funded by us): primarily M. Smelyanskiy, D. Kalamkar, K. Vaidyanathan
  • Future steps:
  • Aiming for a public code release soon

SLIDE 8

Optimizing QDP++ on Xeon Phi

  • QDP++ ‘parscalarvec’ - work by Jie Chen & B. Joó
  • vector-friendly layout in QDP++
  • Single Xeon Phi comparable to 2 SNB sockets (no intrinsics, no prefetch)
  • parscalarvec intrinsic-free host code comparable to SSE-optimized host code for single precision

!" #!!!!" $!!!!" %!!!!" &!!!!" '!!!!" (!!!!" )!!!!" *+*" ,-./*0+*" *12*+*" *12,-./*0+*" 32*+3" 42*+4" !"#$%&%'%()*+,-& 56-"789:";8-<"=>:7"??@A%" 789:"$B#(":7C<,-9" &":7C<,-9D;8C<" %":7C<,-9D;8C<" $":7C<,-9D;8C<" #":7C<,-D;8C<"

SLIDE 9

Optimized Code: Wilson Dslash

  • Blocking scheme maximises the number of cores used
  • SOA layout with tuned ‘inner array length’
  • CPU performance is also excellent (used 2.6 GHz SNB)
  • Here Xeon Phi is comparable to 4 sockets

!"!# !"!# ""$# "%"# "$%# "&'# !((# "))# !""# !"*# "$!# "&'# ")&# *!&# !(%# ")!# !!'# !!'# "+'# "%!# "*'# *+!# !'$# ""(# !!&# !!&# "!%# "(+# "$"# *+&# !(!# "$!# !""# !""# "*(# "&)# ")(# *"+# !("# "$$#

+# )+# !++# !)+# "++# ")+# *++# ,-./01234435# 6/01234435# ,-./01234435# 6/01234435# ,-./01234435# 6/01234435# ,-./01234435# 6/01234435# 7-839:#;3/-:##<)="'(+#>?@A=<BC# 7-839:#;3/-#BDEF#>G@6C#)!!+B# 7-839:#;3/-#BDEF#>G@6C# A!BHI=%!!+B# @J7K7L:#G31932#G"+0# ME94/-#K49N4D# JO"$P"$P"$P!"(## JO*"P*"P*"P!"(# JO$+P$+P$+P&'# JO$(P$(P"$P'$# JO*"P$+P"$P&'#

From: B. Joó, D. D. Kalamkar, K. Vaidyanathan, M. Smelyanskiy, K. Pamnani, V. W. Lee, P. Dubey, W. Watson III, “Lattice QCD on Intel(R) Xeon Phi(tm) Coprocessors”, Proceedings of ISC’13 (Leipzig), Lecture Notes in Computer Science Vol 7905 (to appear).
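A sketch of the “SOA layout with tuned inner array length” idea (hypothetical names; the tunable inner length need not equal the full SIMD width, which is what makes it a tuning knob):

```cpp
// Hypothetical four-spinor block in SOA form: SOALEN sites are packed
// contiguously per (spin, color, re/im) component. SOALEN is tuned per
// architecture rather than fixed at the vector width, trading vector
// efficiency against cache-line and blocking behaviour.
template <int SOALEN>
struct FourSpinorBlock {
    float v[4][3][2][SOALEN];   // [spin][color][re-im][site lane]
};

// e.g. SOALEN = 8 or 16 might be tuned on Xeon Phi, 4 or 8 on SNB.
using PhiSpinorBlock = FourSpinorBlock<16>;
```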

SLIDE 10

Multi-Node Performance

  • 2D comms only (Z & T)
  • vectorization mixes X & Y
  • Intel Endeavor cluster
  • 1 Xeon Phi device per node
  • MPI proxy:
  • pick the fastest-bandwidth path between devices (via the host in this case); see the sketch below
  • similar to GPU strong scaling at this level (expected)
  • Space to explore here: e.g. CCL-Proxy, MVAPICH2-MIC

!"#$ ""%&$ &"'($ )!''$ )#()$

&&%"$ *('($ +(+!$

($ "((($ &((($ )((($ *((($ +((($ !((($ &$ *$ %$ "!$ )&$ ,-./01$23$4025$678$-589:$ ;<)&=)&=)&=&+!$ ;<*%=*%=*%=&+!$ !""# $%&# '"(&# %")$# %$!!# '**&# %&!"# )&)"# (# "((# '(((# '"((# %(((# %"((# )(((# )"((# !(((# !"((# %# !# $# '*# )%# +,-./0#12#3/14#567#8479:# ;<)%=)%=)%=%"*# ;<!$=!$=!$=%"*#

Wilson Dslash Wilson CG

From: B. Joó, D. D. Kalamkar, K. Vaidyanathan, M. Smelyanskiy, K. Pamnani, V. W. Lee, P. Dubey, W. Watson III, “Lattice QCD on Intel(R) Xeon Phi(tm) Coprocessors”, Proceedings of ISC’13 (Leipzig), Lecture Notes in Computer Science Vol 7905 (to appear).
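A minimal sketch of the proxy idea under stated assumptions: halo data is staged through host memory, and the host does the wire transfer because the host-to-host fabric path is faster than direct device-to-device MPI. The function and names are illustrative, not the actual proxy code:

```cpp
// Illustrative host-side relay of one halo-face exchange. The device
// stages send_buf into host memory over PCIe; the host performs the
// network transfer on the fast host<->host path, then the device pulls
// recv_buf back. (Sketch only; the real proxy also overlaps faces.)
#include <mpi.h>

void relay_face(float* send_buf, float* recv_buf, int count,
                int fwd_rank, int bwd_rank, MPI_Comm comm)
{
    MPI_Request reqs[2];
    MPI_Irecv(recv_buf, count, MPI_FLOAT, bwd_rank, 0, comm, &reqs[0]);
    MPI_Isend(send_buf, count, MPI_FLOAT, fwd_rank, 0, comm, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}
```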

SLIDE 11

Weak Scaling on Stampede

  • Without proxy:
  • drop in performance when going to multiple nodes
  • performance halves when introducing a second comms direction
  • suggests the issue is with async progress rather than attainable bandwidth or latency (see the sketch below)
  • With proxy:
  • small drop in performance from 1D to 2D comms; more likely due to B/W constraints...
  • Dslash scaled to 31 TF on 128 nodes (CG to 16.8 TF)

[Figure: Wilson Dslash weak scaling, 48x48x24x64 sites per node, single precision; GFLOPS per node vs. number of nodes (1–128), without proxy vs. with CML proxy, for communication in 1 and 2 dimensions; tops out at 31 TF]
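A sketch of the asynchronous-progress issue hinted at above, assuming the standard MPI behaviour that nonblocking transfers may only advance inside MPI calls; polling with MPI_Testall during compute is one common workaround when no proxy or progress thread is available:

```cpp
// Illustrative compute/communication overlap loop: interleave chunks of
// interior work with MPI_Testall calls so the MPI progress engine keeps
// the pending halo exchange moving.
#include <mpi.h>

void do_interior_chunk();   // hypothetical: one tile of interior Dslash work

void overlap_interior_work(MPI_Request* reqs, int nreqs)
{
    int done = 0;
    while (!done) {
        do_interior_chunk();
        MPI_Testall(nreqs, reqs, &done, MPI_STATUSES_IGNORE);
    }
}
```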

SLIDE 12

Strong Scaling

  • Endeavor results from the ISC’13 paper
  • Need to understand the performance difference better, but:
  • Stampede:
  • no icache_snoop_off
  • shape difference may be due to a different virtual topology on 16 nodes
  • Over 4 TF reached in Dslash and 3.6 TF reached in CG on Stampede

[Figure: strong scaling, 48x48x48x256 sites, single precision, using the CML proxy; GFLOPS vs. number of nodes (4–64) for Dslash and CG on Endeavor and Stampede]

SLIDE 13

Tasks 4 & 5: Contractions etc.

  • Status
  • ‘redstar’ - generalized contraction code:
  • Computes quark-line diagrams, generates the list of propagators for Chroma to compute
  • ‘harom’ - 3D version of QDP++
  • Performs timeslice-by-timeslice contractions
  • Status: In production now
  • Used for multi-particle 2pt and 3pt calculations with the variational method
  • Staff:
  • Robert Edwards, Jie Chen
  • Follow-on tasks:
  • More optimization via BLAS/LAPACK library integration (GPU acceleration via CUBLAS?); see the sketch below
  • Support “isobar”-like operator constructions (recursive contractions)
  • Improve I/O performance
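A hedged sketch of why BLAS integration is attractive here: in the distillation approach a timeslice-by-timeslice contraction reduces to small, dense complex matrix multiplies over the distillation-vector indices, one per timeslice, each of which maps directly onto ZGEMM (and onto CUBLAS on a GPU). Dimensions and names are illustrative:

```cpp
// Illustrative only: C_t = A_t * B_t for each timeslice t, where the
// matrices are nev x nev over distillation-vector indices. Each call
// is a standard complex GEMM, so a tuned BLAS (or CUBLAS) applies.
#include <complex>
#include <vector>
#include <cblas.h>

using cplx = std::complex<double>;

void contract_timeslices(const std::vector<cplx>& A,
                         const std::vector<cplx>& B,
                         std::vector<cplx>& C,
                         int nt, int nev)
{
    const cplx one(1.0, 0.0), zero(0.0, 0.0);
    for (int t = 0; t < nt; ++t) {
        const cplx* At = &A[t * nev * nev];
        const cplx* Bt = &B[t * nev * nev];
        cplx*       Ct = &C[t * nev * nev];
        cblas_zgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    nev, nev, nev, &one, At, nev, Bt, nev, &zero, Ct, nev);
    }
}
```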

SLIDE 14

Task 6: Chroma on BG/Q

  • Status: Slight progress
  • Chroma has compiled with XLC on BG/Q, but performance is low
  • Have a Clover solver through BAGEL, not yet integrated
  • IBM has re-coded the SSE cpp_wilson_dslash package for BG/Q under contract with ANL
  • Integrated this on Cetus. Best observed performance is on the order of:
  • ~11-12% of peak in my single-node tests
  • ~7-9% of peak communicating in all directions
  • Tried parscalarvec on a single node, but this still needs work
  • Staffing:
  • Bálint Joó

SLIDE 15

New/Old Tasks

  • Need for algorithmic improvement in solvers for CPUs
  • Driven by the availability of unaccelerated resources:
  • Non-GPU part of Blue Waters (90% of the resource)
  • BlueGene
  • Edison at NERSC
  • Would need integration of multi-grid / domain-decomposed solvers
  • Would need this to improve throughput in both analysis and HMC
  • Combining improved solvers with HMC is a research area

SLIDE 16

Summary

  • Good progress on most of the Y1 tasks:
  • Y1 GPU tasks complete; now on to LLVM
  • Xeon Phi tasks are coming along
  • redstar/harom in production
  • We have some exposure on the BlueGene/Q, which needs investment
  • Focus for the rest of the year:
  • Xeon Phi + Chroma integration and improvement (aim for publication of the code)
  • Improvements to parscalarvec; investigate offload possibilities
  • Analysis improvements:
  • New tasks: integrate advanced solvers
