

SLIDE 1

Functional Requirements

Don Holmgren Fermilab djholm@fnal.gov LQCD-ext II CD-1 Review Germantown, MD February 25, 2014

SLIDE 2

Outline

  • Computational needs
  • Functional requirements


SLIDE 3

Capability and Capacity Computing

LQCD computing involves a mixture of capability and capacity computing tasks.

SLIDE 4

Capability versus Capacity Computing

Capability tasks, such as gauge configuration generation, depend critically on achieving minimum time-to-solution, because each step depends on the prior step.

  • They benefit from architectures and algorithms that allow the maximum computing power (FLOPs/sec) to be achieved on individual large problems.

  • Large gauge configurations are generated at the DOE LCFs and at other large supercomputing sites (e.g., NCSA BlueWaters).

  • Smaller configurations, e.g. for BSM or thermodynamics, are often generated on USQCD dedicated hardware.

[Figure: Gauge Files]


SLIDE 5

Capability versus Capacity Computing


[Figure: Quark Propagators]

Capacity tasks, such as propagator generation, achieve high aggregate computing throughput by working on many independent computations (“jobs”) simultaneously using systems with large numbers of processors.

  • The jobs are relatively small – O(10) to O(1K) core counts – compared to gauge configuration jobs, with relatively large simulation volumes addressed by each core.

  • Clusters, with good performance on jobs with hundreds to a few thousands of cores, or with a few to many dozens of GPU accelerators, are well suited to capacity tasks.

  • A large number of very small jobs – from one to a few cluster nodes – are required for performing correlations between propagators (“tie-ups”) and for doing fits.


SLIDE 6

Computational Needs

  • LQCD is dependent on capability computing for generating ensembles of gauge configurations

  • The need for computing capacity for analysis is at least as large, and growing:

– The complexity of analysis jobs is typically much greater, and the number of jobs about 1000X greater
– The need for capacity is essentially unbounded: USQCD would use 100X more capacity if it were available
– Because the desired capacity is not available, different approaches and algorithms are used to make tradeoffs to achieve specific physics goals (for example, different actions – HISQ, DWF, anisotropic clover – are employed by the various subfields)
– Because of this variety of approaches, it is possible to optimize capacity systems for specific problems. For example, the first GPU-accelerated cluster in 2009 was tailored for NP spectrum calculations that required very high numbers of inversions to produce propagators.

  • The LQCD community has a long history of producing well-optimized code for all available hardware. DOE support through SciDAC has been an important part of this.


SLIDE 7

Functional Requirements

  • Computational capacity

– Individual analysis jobs (e.g. propagator generation) require from 8 to 128 cluster nodes (0.4 TF/s to 6.4 TF/s based on 50 GF/node)

  • The larger jobs – 64 to 128 nodes – are used for eigenvector projection methods that also have high file I/O requirements

– GPU-accelerated propagator generation requires from 4 to 16 accelerators (0.6 TF/s to 2.5 TF/s based on 150 GF/GPU)

  • In the next talk, Chip Watson will define LQCD metrics like sustained TF (conventional hardware) and effective TF (GPUs)

– In aggregate, at least 188 TF/s capacity will be needed by the end of FY16

  • Because of the funding profile, no new hardware will be purchased in FY15
  • At the beginning of FY15, an estimated 200 TF/s aggregate capacity will be brought forward from the prior project (LQCD-ext)
  • A combination of new hardware deployment and old system decommissioning in FY16 will result in the 188 TF/s aggregate (+50 TF/s, -62 TF/s)
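The capacity figures above are simple arithmetic; a quick sketch checking them (the 50 GF/node and 150 GF/GPU conversions are the ones stated on this slide):

```python
# Check the per-job and aggregate capacity figures quoted above.
def job_tf(units, gf_per_unit):
    """Convert a unit count (nodes or GPUs) to TF/s at a given GF/s rate."""
    return units * gf_per_unit / 1000.0

assert job_tf(8, 50) == 0.4 and job_tf(128, 50) == 6.4   # 8-128 cluster nodes
assert job_tf(4, 150) == 0.6                              # 4 GPU accelerators

# FY16 aggregate: capacity carried forward from LQCD-ext, plus new
# deployment, minus decommissioned systems.
fy16_aggregate_tf = 200 + 50 - 62
print(fy16_aggregate_tf)  # 188
```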


SLIDE 8

Functional Requirements

  • Characteristics of production LQCD codes:

– SU(3) algebra dominates (low arithmetic intensity)

  • Single precision complex matrix (3x3) – vector (3x1): 96 bytes read, 24 bytes written, 66 FLOPs → 1.8:1 bytes:flops

  • Memory bandwidth is more important than peak FLOPs

– Inter-node communications for message passing require roughly (an oversimplification) 1 Gbit/sec of bandwidth for each GFLOP/sec of node capability

  • Also, low latency is required for efficient global reductions, and for good strong scaling
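A back-of-the-envelope check of the 1.8:1 figure above (a sketch; it assumes the standard complex matrix-vector flop count of 9 complex multiplies and 6 complex adds):

```python
# Arithmetic intensity of a single-precision complex 3x3 matrix times
# 3x1 vector multiply (the SU(3) kernel described above).
BYTES_PER_COMPLEX = 8  # single precision: 2 x 4-byte floats

def su3_mat_vec_bytes_per_flop():
    bytes_read = (9 + 3) * BYTES_PER_COMPLEX    # 3x3 matrix + 3x1 vector = 96
    bytes_written = 3 * BYTES_PER_COMPLEX       # 3x1 result = 24
    # 9 complex multiplies (6 flops each) + 6 complex adds (2 flops each)
    flops = 9 * 6 + 6 * 2                       # = 66
    return (bytes_read + bytes_written) / flops

print(round(su3_mat_vec_bytes_per_flop(), 1))   # 1.8 bytes per flop
```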


SLIDE 9

Functional Requirements

  • Access to large shared file systems and to tape storage

– Analysis jobs read gauge configurations, and read and write propagators (for each configuration, O(10) or more propagators). Gauge configurations are of order 10 GByte (the latest are 250 GB). Individual propagators are up to 12X the volume of gauge configurations.
– Tape provides intermediate storage for long analysis campaigns with high data volumes (cheaper than disk), and archival storage of important files (gauge configurations, expensive propagators).
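The sizes above imply substantial per-configuration data volumes. An illustrative estimate (using the low end, 10 propagators, of the O(10)+ range; the propagator count and 12X ratio are the figures quoted above):

```python
# Rough analysis data volume per gauge configuration, in GB.
def data_volume_per_config_gb(config_gb, n_propagators=10, prop_ratio=12):
    # the configuration itself, plus propagators each up to prop_ratio
    # times the configuration size
    return config_gb + n_propagators * prop_ratio * config_gb

print(data_volume_per_config_gb(10))    # 1210 GB for a typical 10 GB config
print(data_volume_per_config_gb(250))   # 30250 GB for the largest (250 GB)
```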


SLIDE 10

LQCD Machine Architectures

  • A number of architectures are currently used by USQCD:

– Traditional supercomputers (BlueGene, Cray)
– Conventional clusters based on x86_64 processors and Infiniband networking
– Accelerated clusters based on NVIDIA GPUs, x86_64 processors, and Infiniband networking, of two types:

  • “Gaming” card GPU systems, using graphics cards designed for the display hardware of desktop computers

  • “Tesla-class” systems, using GPUs designed for numerical work
  • LQCD hardware deployed at FNAL, JLab, BNL:

– BlueGene/Q half-rack at BNL
– Conventional, gaming-GPU, and Tesla-GPU clusters at JLab
– Conventional and Tesla-GPU clusters at FNAL


SLIDE 11

Matching Architectures to Job Requirements

  • Gauge configuration generation

– Large lattices are generated at DOE and NSF leadership centers (BlueGene and Cray architectures)
– Small lattices are generated on the BG/Q half-rack at BNL, or in some cases (BSM, some thermodynamics) on conventional clusters

  • Quark propagator production (traditional)

– Propagators are produced on conventional clusters and/or on accelerated clusters (for actions with available GPU code)
– Propagators from small lattices can be produced on gaming GPUs with additional correctness checks.

  • Quark propagator production (eigenvector projection)

– Large jobs on conventional clusters (suitable for BlueGene/Q)

  • Combining propagators (“tie-ups”): conventional clusters (I/O bound)
  • Physics parameter extraction: conventional clusters


SLIDE 12

End


SLIDE 13

Backup Slides


SLIDE 14

Hardware Requirements

  • Either memory bandwidth, floating point performance, or network performance (bandwidth at the message sizes used) will be the limit on performance on a given parallel machine

  • On current single commodity nodes, memory bandwidth is the constraint

– GPUs have proven to be very cost effective for LQCD because they have the lowest price per unit of memory bandwidth. Intel Xeon Phi accelerators have memory bandwidth costs similar to those of NVIDIA GPUs.

  • On current parallel computer clusters, the constraint is either memory bandwidth or network performance, depending upon how many nodes are used on a given job

– Network performance limits scaling: the surface-area-to-volume ratio increases as more nodes are used, causing relatively more communication and smaller messages
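The surface-to-volume argument can be sketched directly (a toy model, assuming a 4D local sub-lattice of linear size l with nearest-neighbor exchange in all 4 dimensions):

```python
# Surface-to-volume ratio of the per-node sub-lattice. As more nodes are
# used, the local linear size l shrinks and relative communication grows.
def surface_to_volume(l):
    surface_sites = 8 * l**3   # 2 faces x 4 dimensions, each l^3 sites
    volume_sites = l**4
    return surface_sites / volume_sites   # = 8 / l

# Halving the local size doubles the relative communication load:
print(surface_to_volume(16))  # 0.5
print(surface_to_volume(8))   # 1.0
```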


SLIDE 15


Balanced Design Requirements

Communications for Dslash

  • Modified for improved staggered from Steve Gottlieb's staggered model: physics.indiana.edu/~sg/pcnets/

  • Assume:

– L^4 lattice
– communications in 4 directions

  • Then:

– L determines the message size needed to communicate a hyperplane (L^3 sites)
– sustained MFlop/sec together with the message size implies the achieved communications bandwidth

  • Required network bandwidth increases as L decreases, and as sustained MFlop/sec increases

SLIDE 16


SDR vs. DDR Infiniband

SLIDE 17

Sample Analysis Workflow (Fermilab-MILC “Superscript”)

Using a 64^3 x 96 ensemble, for each configuration:

1. Generate a set of staggered quark propagators
2. Extract many extended sources from the propagators from #1
3. Compute many clover propagators, write some to disk, and tie them together with each other and with the propagators from #1
4. Compute another set of clover propagators and tie them together with each other, with the propagators from #3, and with the propagators from #1
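The dependency structure of the four steps above can be sketched as a small DAG (step names are illustrative; the actual "superscript" production scripts are not shown on the slide):

```python
# Steps of the Fermilab-MILC analysis workflow and their dependencies.
workflow = {
    "staggered_props": [],                                   # step 1
    "extended_sources": ["staggered_props"],                 # step 2
    "clover_props_a": ["extended_sources"],                  # step 3
    "tieups_a": ["clover_props_a", "staggered_props"],       # step 3 tie-ups
    "clover_props_b": ["extended_sources"],                  # step 4
    "tieups_b": ["clover_props_b", "clover_props_a",
                 "staggered_props"],                         # step 4 tie-ups
}

def run_order(deps):
    """Topologically order the steps so every dependency runs first."""
    order, done = [], set()
    def visit(step):
        for d in deps[step]:
            if d not in done:
                visit(d)
        if step not in done:
            done.add(step)
            order.append(step)
    for s in deps:
        visit(s)
    return order

order = run_order(workflow)
print(order[0])  # staggered_props must come first
```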


The dedicated capacity system, “Ds”, was more appropriate and was used for production:

  • better per core performance
  • better I/O performance (smaller fraction of time with idle cores)
  • more cost effective by a factor of more than 4X based on hardware costs

Ds: $184/core; BG/P: $317/core, based on $1.3M/rack.
Cost-effectiveness ratio: 4 x ($317/$184) x (3717 sec / 5876 sec) = 4.36 (inverter-only: 2.35)

  System            Cores   Total Inverter Time   Total I/O Time    Perf/Core
  Ds (cluster)      1024    5480 sec              396 sec (7%)      488.8 MF (clover), 473.6 MF (asqtad)
  Intrepid (BG/P)   4096    1872 sec              1845 sec (50%)    345.5 MF (clover), 316.5 MF (asqtad)
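The 4.36X figure follows from the table: the relative cost is (cores x $/core x wall time) on BG/P divided by the same product on Ds. A quick reproduction:

```python
# Reproduce the cost-effectiveness ratios quoted above.
def cost_ratio(bgp_cores, ds_cores, bgp_cost, ds_cost, bgp_time, ds_time):
    """Cost of running the job on BG/P relative to the Ds cluster."""
    return (bgp_cores * bgp_cost * bgp_time) / (ds_cores * ds_cost * ds_time)

full = cost_ratio(4096, 1024, 317, 184, 1872 + 1845, 5480 + 396)
inverter_only = cost_ratio(4096, 1024, 317, 184, 1872, 5480)
print(round(full, 2), round(inverter_only, 2))  # 4.36 2.35
```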


SLIDE 18

Job Sizes – Analysis versus Capability

  • Left: Conventional (non-GPU) core-hrs on USQCD dedicated hardware.

– The activity is from 60 users @ FNAL, 24 @ TJNAF
– Total job count: 1.42M; in aggregate: 305M JPsi-core-hrs; average job length: 11.6 hours
– Substantial demand for computation across the entire job-size range of 1 to 1K cores

  • Right: ALCF “Intrepid” usage for LQCD

– Intrepid (BG/P) core-hrs are essentially the same units as “JPsi core-hrs”
– Aggregate is 143M core-hrs
– If all jobs were the same length as the cluster average above (11.6 hours), the job count would be about 1670
