Functional Requirements
Don Holmgren Fermilab djholm@fnal.gov LQCD-ext II CD-1 Review Germantown, MD February 25, 2014
Outline
– Computational needs
– Functional requirements

Capability and Capacity Computing
Capability tasks, such as gauge configuration generation, depend critically on achieving minimum time-to-solution, because each step depends on the prior step.
These tasks require high sustained computing power (FLOPs/sec) to be achieved on individual large problems.
They are run at DOE leadership-class facilities and at NSF supercomputing sites (e.g., NCSA BlueWaters).
These capability resources are obtained through allocations at the national centers rather than provided by the project's dedicated hardware.
(Diagram: analysis workflow from Gauge Files to Quark Propagators)
Capacity tasks, such as propagator generation, achieve high aggregate computing throughput by working on many independent computations (“jobs”) simultaneously using systems with large numbers of processors.
– Capacity jobs use many fewer cores than gauge configuration jobs, with relatively large simulation volumes addressed by each core.
– Systems with a few to many dozens of GPU accelerators are well suited to capacity tasks.
– Capacity hardware is also used for performing correlations between propagators (“tie-ups”) and for doing fits.
Analysis demand relative to gauge configuration generation is growing:
– The complexity of analysis jobs is typically much greater, and the number of jobs is about 1000X greater
– The need for capacity is essentially unbounded; USQCD would use 100X more capacity if it were available
– Because the desired capacity is not available, different approaches and algorithms are used to make tradeoffs that achieve specific physics goals (for example, the various subfields employ different actions: HISQ, DWF, anisotropic clover)
– Because of this variety of approaches, it is possible to optimize capacity systems for specific problems. For example, the first GPU-accelerated cluster in 2009 was tailored for NP spectrum calculations that required very high numbers of inversions to produce propagators.
USQCD develops and maintains software optimized for all available hardware. DOE support through SciDAC has been an important part of this.
– Individual analysis jobs (e.g., propagator generation) require from 8 to 128 cluster nodes (0.4 TF/s to 6.4 TF/s based on 50 GF/node); many of these jobs also have high file I/O requirements
– GPU-accelerated propagator generation requires from 4 to 16 accelerators (0.6 TF/s to 2.4 TF/s based on 150 GF/GPU)
– Capacity requirements are expressed in sustained TF/s (conventional hardware) and effective TF/s (GPUs)
– In aggregate, at least 188 TF/s of capacity will be needed by the end of the project
– Some of this capacity will be brought forward from the prior project (LQCD-ext)
– Deployments planned through FY16 will result in the 188 TF/s aggregate (+50 TF/s, -62 TF/s)
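These ranges are simple products of node/GPU counts and the planning rates quoted in the bullets above (50 GF/s per node, 150 GF/s effective per GPU); a minimal C check, as a sketch only:

    /* Sanity check of the per-job capacity figures quoted above,
     * using the slide's planning rates. */
    #include <stdio.h>

    int main(void)
    {
        const double gf_per_node = 50.0, gf_per_gpu = 150.0;

        printf("8-128 nodes: %.1f-%.1f TF/s\n",
               8 * gf_per_node / 1000.0, 128 * gf_per_node / 1000.0);
        printf("4-16 GPUs:   %.1f-%.1f TF/s\n",
               4 * gf_per_gpu / 1000.0, 16 * gf_per_gpu / 1000.0);
        return 0;
    }

This reproduces the 0.4-6.4 TF/s and 0.6-2.4 TF/s ranges.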
– SU(3) algebra dominates (low arithmetic intensity)
– Example: the single-precision SU(3) matrix-vector multiply reads 96 bytes, writes 24 bytes, and performs 66 FLOPs, a 1.8:1 bytes:FLOPs ratio (see the sketch after this list)
– Inter-node message passing requires, as a rough rule of thumb, 1 Gbit/sec of network bandwidth for each sustained GFLOP/sec
– Capability jobs require strong scaling: a fixed problem size spread across many nodes
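The 1.8:1 figure comes from counting bytes and flops in a single-precision SU(3) matrix-vector multiply. A minimal C sketch of that kernel (the function name and data layout here are illustrative, not any particular LQCD library's API):

    #include <complex.h>
    #include <stdio.h>

    typedef float complex cfloat;

    /* Single-precision SU(3) matrix times color vector.
     * Reads:  9 complex floats (72 B) + 3 complex floats (24 B) = 96 bytes
     * Writes: 3 complex floats = 24 bytes
     * FLOPs:  9 complex multiplies (6 each) + 6 complex adds (2 each) = 66
     * => (96 + 24) / 66 ~ 1.8 bytes per flop */
    void su3_matvec(const cfloat U[3][3], const cfloat in[3], cfloat out[3])
    {
        for (int i = 0; i < 3; i++) {
            cfloat acc = 0;
            for (int j = 0; j < 3; j++)
                acc += U[i][j] * in[j];   /* one complex multiply-add */
            out[i] = acc;
        }
    }

    int main(void)
    {
        cfloat U[3][3] = {{1, 0, 0}, {0, 1, 0}, {0, 0, 1}};  /* identity */
        cfloat in[3] = {1 + 2*I, 3, 4*I}, out[3];
        su3_matvec(U, in, out);
        printf("%.1f%+.1fi\n", crealf(out[0]), cimagf(out[0]));
        return 0;
    }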
– Traditional supercomputers (BlueGene, Cray)
– Conventional clusters based on x86_64 processors and Infiniband networking
– Accelerated clusters based on NVIDIA GPUs, x86_64 processors, and Infiniband networking, of two types:
  – Gaming GPUs, i.e., the display hardware of desktop computers
  – Tesla GPUs, designed for computation
– BlueGene/Q half-rack at BNL
– Conventional, gaming-GPU, and Tesla-GPU clusters at JLab
– Conventional and Tesla-GPU clusters at FNAL
– Large lattices are generated at DOE and NSF leadership centers (BlueGene and Cray architectures)
– Small lattices are generated on the BG/Q half-rack at BNL, or in some cases (BSM, some thermodynamics) on conventional clusters
– Propagators are produced on conventional clusters and/or on accelerated clusters (for actions with available GPU code)
– Propagators from small lattices can be produced on gaming GPUs with additional correctness checks
– Large jobs on conventional clusters are also suitable for the BlueGene/Q
– Memory bandwidth or network performance (bandwidth at the message sizes used) will be the limit on performance on a given parallel machine
– Memory bandwidth is usually the dominant constraint
– GPUs have proven to be very cost effective for LQCD because they have the lowest price per unit of memory bandwidth. Intel Xeon Phi accelerators have memory bandwidth costs similar to those of NVIDIA GPUs.
– A given job is limited by memory bandwidth or by network performance, depending upon how many nodes are used (see the roofline sketch below)
– Network performance limits scaling: Surface area to volume ratio increases as more nodes are used, causing relatively more communications and smaller messages
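A roofline-style bound makes the memory bandwidth limit concrete. In this sketch the bandwidth figures are assumed for illustration, not measured values from any project system:

    /* Roofline-style estimate: sustained LQCD flops are bounded by
     * memory bandwidth / (bytes per flop). Bandwidths are assumed. */
    #include <stdio.h>

    int main(void)
    {
        double bytes_per_flop = 1.8;   /* from the SU(3) matvec count above */
        double bw_cpu_gbs = 50.0;      /* assumed x86_64 node, GB/s         */
        double bw_gpu_gbs = 250.0;     /* assumed GPU, GB/s                 */

        printf("CPU node bound: %.0f GF/s\n", bw_cpu_gbs / bytes_per_flop);
        printf("GPU bound:      %.0f GF/s\n", bw_gpu_gbs / bytes_per_flop);
        return 0;
    }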
Steve Gottlieb's staggered model:
physics.indiana.edu/~sg/pcnets/
– L^4 lattice – communications in 4 directions
– L implies the message size needed to communicate a hyperplane
– Sustained MFlop/sec together with message size implies the achieved communications bandwidth
– Required network bandwidth grows as L decreases and as sustained MFlop/sec increases
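A rough C rendering of this kind of scaling model, assuming single-precision staggered fermions (570 flops per site per Dslash, 24-byte surface vectors); the constants are illustrative placeholders, and Gottlieb's actual model at the URL above differs in detail:

    #include <stdio.h>

    int main(void)
    {
        double L = 14.0;               /* local lattice: L^4 sites per node (assumed) */
        double mflops = 1500.0;        /* sustained MFlop/s per node (assumed)        */
        double flops_per_site = 570.0; /* staggered Dslash flop count per site        */
        double bytes_per_site = 24.0;  /* single-precision 3-vector at the surface    */

        double sites = L * L * L * L;
        double dslash_per_sec = mflops * 1e6 / (flops_per_site * sites);
        /* one L^3 hyperplane exchanged in each of 4 directions, both ways */
        double msg_bytes = L * L * L * bytes_per_site;
        double bw = dslash_per_sec * 8.0 * msg_bytes;   /* bytes/sec */

        printf("message size: %.0f bytes, required bandwidth: %.2f Gbit/s\n",
               msg_bytes, bw * 8.0 / 1e9);
        return 0;
    }

Shrinking L or raising the sustained MFlop/s rate in this sketch drives the required bandwidth up, which is the strong-scaling pressure described above.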
Using a 64^3 x 96 ensemble, for each configuration:
1. Generate a set of staggered quark propagators
2. Extract many extended sources from the propagators from #1
3. Compute many clover propagators, write some to disk, and tie them together with each other and with the propagators from #1
4. Compute another set of clover propagators and tie them together with each other, with the propagators from #3, and with the propagators from #1
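Structurally, the per-configuration campaign is a short pipeline. In the sketch below every type and function name is an invented placeholder, not an actual USQCD code interface:

    #include <stdio.h>

    typedef struct { int dummy; } Propagators;  /* placeholder types */
    typedef struct { int dummy; } Sources;

    /* Stubs so the sketch compiles; production runs would invoke the
     * collaboration's real analysis codes here. */
    static Propagators generate_staggered_propagators(const char *g)
    { printf("staggered propagators from %s\n", g); return (Propagators){0}; }
    static Sources extract_extended_sources(Propagators p)
    { (void)p; puts("extended sources"); return (Sources){0}; }
    static Propagators compute_clover_propagators(Sources s)
    { (void)s; puts("clover propagators"); return (Propagators){0}; }
    static void write_some_to_disk(Propagators p) { (void)p; puts("write"); }
    static void tie_up(Propagators a, Propagators b)
    { (void)a; (void)b; puts("tie-up"); }

    void run_configuration(const char *gauge_file)
    {
        Propagators stag  = generate_staggered_propagators(gauge_file); /* step 1 */
        Sources     src   = extract_extended_sources(stag);             /* step 2 */

        Propagators clov1 = compute_clover_propagators(src);            /* step 3 */
        write_some_to_disk(clov1);
        tie_up(clov1, clov1);
        tie_up(clov1, stag);

        Propagators clov2 = compute_clover_propagators(src);            /* step 4 */
        tie_up(clov2, clov2);
        tie_up(clov2, clov1);
        tie_up(clov2, stag);
    }

    int main(void) { run_configuration("config_0001"); return 0; }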
The dedicated capacity system, “Ds”, was more appropriate for this workload than the BG/P and was used for production:
Ds: $184/core; BG/P: $317/core (based on $1.3M per 4,096-core rack)
Ds cost advantage, total job time: 4 x ($317/$184) x (3717 sec / 5876 sec) = 4.36
Ds cost advantage, inverter only: 2.35

System           Cores  Total Inverter Time  Total I/O Time  Perf/Core
Ds (cluster)     1024   5480 sec             396 sec (7%)    488.8 MF (clover), 473.6 MF (asqtad)
Intrepid (BG/P)  4096   1872 sec             1845 sec (50%)  345.5 MF (clover), 316.5 MF (asqtad)
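The 4.36 and 2.35 ratios follow directly from the table as (core-count ratio) x (cost-per-core ratio) x (run-time ratio); a small C check of that arithmetic:

    #include <stdio.h>

    int main(void)
    {
        double cores_ratio = 4096.0 / 1024.0;  /* Intrepid cores / Ds cores */
        double cost_ratio  = 317.0 / 184.0;    /* $/core, BG/P vs. Ds       */

        /* total job time: Intrepid 1872+1845 s, Ds 5480+396 s */
        double total = cores_ratio * cost_ratio
                     * ((1872.0 + 1845.0) / (5480.0 + 396.0));
        /* inverter time only */
        double inv = cores_ratio * cost_ratio * (1872.0 / 5480.0);

        printf("total job:     %.2f\n", total);  /* ~4.36 */
        printf("inverter only: %.2f\n", inv);    /* ~2.35 */
        return 0;
    }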
– The activity is from 60 users @ FNAL, 24 @ TJNAF
– Total job count: 1.42M; in aggregate: 305M JPsi-core-hrs; average job length: 11.6 hours
– Substantial demand for computation across the entire job-size range of 1 to 1K cores
– Intrepid (BG/P) core-hrs are essentially the same units as “JPsi core-hrs”
– Aggregate is 143M core-hrs
– If all jobs were the same length as the cluster average above (11.6 hours), job count: 1670
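These statistics also imply an average cluster job size; the arithmetic below is derived here from the quoted totals, not stated on the slide:

    /* Derived averages from the usage statistics above. */
    #include <stdio.h>

    int main(void)
    {
        double jobs = 1.42e6, core_hrs = 305e6, avg_len_hrs = 11.6;

        double core_hrs_per_job = core_hrs / jobs;
        printf("core-hrs per job: %.0f\n", core_hrs_per_job);  /* ~215  */
        printf("average job size: %.1f cores\n",
               core_hrs_per_job / avg_len_hrs);                /* ~18.5 */
        return 0;
    }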