

SLIDE 1

Costin Iancu Lawrence Berkeley National Laboratory

WPSE 2009

  • Unified Parallel C

– SPMD programming model, shared memory space abstraction

– Communication is either implicit or explicit – one-sided
– Memory model: relaxed and strict

  • Ubiquitous UPC implementation

– Compiler based on the Open64 framework

– Source-to-source translation

– GASNet communication libraries

  • PUT/GET primitives
  • Vector/Index/Strided (VIS) primitives
  • Synchronization, collective operations
  • Provide integration across all levels of the software stack
  • Mechanisms for finer grained control over system resources
  • Application level resource usage policies
  • Language and compiler support

[Software stack diagram: UPC code → UPC compiler → compiler-generated C code → UPC runtime system → GASNet communication system → network hardware]

Emphasize production quality development tools
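To ground the language model described above, here is a minimal UPC sketch (not from the talk): SPMD execution, a shared array distributed across threads, and communication implicit in shared-array accesses.

#include <upc.h>
#include <stdio.h>

#define N 1024
shared double a[N];              /* shared array, elements distributed cyclically */

int main(void) {
    int i;
    /* SPMD: every one of the THREADS ranks executes main(); upc_forall
       splits iterations by affinity so each thread updates the elements
       it owns locally */
    upc_forall (i = 0; i < N; i++; &a[i])
        a[i] = (double)i;
    upc_barrier;                 /* strict synchronization point */
    if (MYTHREAD == 0)           /* reading a[1] may be a remote, one-sided access */
        printf("a[1] = %g on %d threads\n", a[1], THREADS);
    return 0;
}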

SLIDE 2
  • Productivity = performance without pain + portability
  • Provide support for application adaptation (load balance, comm/comp overlap, scheduling, synchronization)
  • Challenges: scale, heterogeneity, convergence of shared and distributed memory optimizations

  • Broad spectrum of approaches (distributed / shared memory)
  • Fine grained communication optimizations (PACT’05)
  • Automatic non-blocking communication (PACT’05, ICS’07)
  • Performance models for loop nest optimizations (PPoPP’07, ICS’08, PACT’08)
  • Applications (IPDPS’05, SC’07, PPoPP’08, IPDPS’09)

Adoption: >7 years concerted effort, DOE support and encouragement, one big government user

  • One of the highest scaling FFT (NAS) results to date (~2 Tflops)
  • Communication is aggressively overlapped with computation
  • UPC vs MPI – 10%-70% faster
  • One-sided is more effective
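The overlap claimed above can be made concrete with non-blocking one-sided transfers. The exact calls vary by implementation (Berkeley UPC shipped bupc_memget_async/bupc_waitsync; UPC 1.3 later standardized upc_memget_nb/upc_sync); a sketch assuming the standardized names, with compute_interior and compute_boundary as hypothetical application routines:

#include <upc.h>
#include <upc_nb.h>                      /* non-blocking extensions (UPC 1.3) */

#define N 4096
shared [N] double blocks[THREADS][N];    /* one contiguous block per thread */

static void compute_interior(void)          { /* work independent of the transfer */ }
static void compute_boundary(double *ghost) { (void)ghost; /* consumes fetched data */ }

void exchange_step(void) {
    double ghost[N];
    int peer = (MYTHREAD + 1) % THREADS;
    /* initiate the one-sided get, then overlap it with independent work */
    upc_handle_t h = upc_memget_nb(ghost, &blocks[peer][0], N * sizeof(double));
    compute_interior();
    upc_sync(h);                         /* block only when the data is needed */
    compute_boundary(ghost);
}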
SLIDE 3
  • Best performance of “primitive” operations

– Select best implementation available for “primitive” operations (put/get, sync)
– Provide efficient implementations for library “abstractions” (collectives)

  • Optimizations

– Single node performance
– Mechanisms to efficiently map application to hardware/OS
– Program transformations – minimize processor “idle” waiting

Runtime Adaptation

  • Multi-level optimizations (distributed and shared memory)
  • Compile-time, static optimizations are not sufficient
  • Adaptation = runtime (see the sketch below)

– Program Description
– Performance Models vs Autotuning
– Parameter Estimation/Classification (Instantaneous vs Asymptotic, Guided vs Automatic, Offline vs Online)
– Feedback Loop
– Static topology mapping vs dynamic
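One concrete reading of offline, guided adaptation is a micro-benchmark pass that records a categorical winner per parameter setting rather than absolute timings; a hypothetical sketch (impl_block and impl_pipe stand in for two candidate implementations of the same primitive):

#include <stdio.h>
#include <stddef.h>
#include <time.h>

typedef void (*impl_fn)(size_t n);

/* placeholder candidate implementations of one communication primitive */
static void impl_block(size_t n) { (void)n; /* ... */ }
static void impl_pipe(size_t n)  { (void)n; /* ... */ }

static double time_one(impl_fn f, size_t n, int iters) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++) f(n);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
}

int main(void) {
    /* offline pass: store which implementation wins per size class; the
       runtime later consults this table instead of a time-accurate model */
    for (size_t n = 8; n <= (1u << 20); n <<= 1) {
        int pipe_wins = time_one(impl_pipe, n, 100) < time_one(impl_block, n, 100);
        printf("%zu %s\n", n, pipe_wins ? "PIPE" : "BLOCK");
    }
    return 0;
}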

SLIDE 4

[Framework diagram: compile-time transformations – communication-oblivious transformations, communication-aware analysis (message vectorization, message strip-mining, data redistribution), estimation of performance parameters – produce a description plus code templates; runtime mechanisms – performance database, performance models, memory manager (cache) – estimate parameters (categorical and numerical), analyze communication requirements, estimate load, instantiate the communication plan, eliminate redundant communication & reshape, and drive code generation]

  • Describe program behavior, lightweight representation (Paek – LMAD, perfect nests)
  • Easily extended for symbolic analysis
  • RT-LMAD similar to SSA – irregular loops
  • Decouple serial transformations from communication transformations
  • Serial transformations - cache parameters (static/conservative)
  • Communication transformations - network parameters (dynamic)
  • No performance loss when decoupling optimizations
  • Coarse grained characteristics
  • Blocking for cache and network at different scales
  • Compute and communication bound are categories
  • Multithreading
  • No global communication scheduling (intrinsic computation)
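To illustrate the message vectorization transformation named in the framework diagram above (an illustration, not the compiler's actual output): a loop of fine-grained remote reads is replaced by one bulk get into a private buffer.

#include <upc.h>

#define N 1024
shared [N] double src[THREADS][N];       /* one block per thread */

double sum_naive(int peer) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        s += src[peer][i];               /* N fine-grained remote reads */
    return s;
}

double sum_vectorized(int peer) {
    double buf[N], s = 0.0;
    upc_memget(buf, &src[peer][0], N * sizeof(double));  /* one bulk transfer */
    for (int i = 0; i < N; i++)
        s += buf[i];
    return s;
}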
SLIDE 5

COMMUNICATION OPTIMIZATIONS

  • Domain Decomposition and Scheduling for Code Generation
  • Efficient High Level Communication Primitives (collectives, p2p)
  • Application level performance determining factors:

– Computation
– Spatial – topology (point-to-point, one-from-many, many-from-one, many-to-many)
– Temporal – schedule (burst, peer order)

  • System level performance determining factors:

– Multiple available implementations
– Resource constraints (issue queue, TLB footprint)
– Interaction with OS (mapping, scheduling)

Adaptation: offline search, easy to evaluate heuristics, lightweight analysis

SLIDE 6

[Figure: overhead (or inverse bandwidth) vs load – models: asymptotic; optimizations: instantaneous; flow control, fairness; throttling load is desirable for performance (> 2X)]

  • Deployed systems are under-provisioned, unfair, noisy

Two processors saturate the network, four processors overwhelm it (Underwood et al, SC’07)

  • Performance is unpredictable and unreproducible
  • Simple models can’t capture variation

[Figure: InfiniBand bandwidth repartition for 128 procs across the bisection – bandwidth (KB/s) vs message size (bytes)]

Quantitative or Qualitative?

SLIDE 7
  • Previous approaches measure asymptotic values; optimizations need instantaneous values
  • Existing “time accurate” performance models do not account well for system scale OR wide SMP nodes
  • Qualitative models: which is faster, not how fast! (PPoPP’07, ICS’08)

Not time accurate, understand errors and model robustness, allow for imprecision/noise

  • Spatiotemporal exploration of network performance:
  • Short and long time scales – account for variability and system noise
  • Small and large system scales – SMP node, full system
  • Preserve Ordering

– Sample implementation space, transformation specific
– Be pessimistic – determine the worst case
– Track derivatives, not absolute values

  • Analytical performance models (strip-mining transformations, PPoPP’07) > 90% efficiency
  • Multiprotocol implementation of vector operations (ICS’08, PACT’08)
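The strip-mining transformations referenced above split one large transfer so that computation on strip i overlaps the fetch of strip i+1. A sketch under the same assumed non-blocking API as before, with consume() as a hypothetical per-strip routine and STRIP as the tuning parameter the models select:

#include <upc.h>
#include <upc_nb.h>

#define N     (1 << 20)
#define STRIP (1 << 14)                  /* strip size: the tuning parameter */
shared [N] double big[THREADS][N];

static void consume(double *d, size_t n) { (void)d; (void)n; /* per-strip work */ }

void strip_mined_get(int peer) {
    static double buf[2][STRIP];         /* double buffering */
    upc_handle_t h[2];
    size_t off;
    h[0] = upc_memget_nb(buf[0], &big[peer][0], STRIP * sizeof(double));
    for (off = 0; off < N; off += STRIP) {
        int cur = (off / STRIP) & 1;
        if (off + STRIP < N)             /* prefetch the next strip */
            h[cur ^ 1] = upc_memget_nb(buf[cur ^ 1], &big[peer][off + STRIP],
                                       STRIP * sizeof(double));
        upc_sync(h[cur]);                /* wait for the current strip only */
        consume(buf[cur], STRIP);
    }
}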
SLIDE 8

TUNING OF VECTOR OPERATIONS

  • Vector Operations – copy disjoint memory regions in one logical step (scatter/gather)
  • Often used in applications: boundary data in finite difference, particle-mesh, sparse matrices, MPI Derived Data Types

  • Well supported:
  • Native: Elan, InfiniBand, IBM LAPI/DCMF
  • Third party comm libraries: GASNet, ARMCI, MPI
  • “Frameworks”: UPC, Titanium, CAF, GA, LAPI
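The same "disjoint regions in one logical step" idea is what MPI exposes as derived datatypes, as the bullet above notes; a small sketch using MPI_Type_vector to ship one column of a row-major grid as a single strided message (dimensions are made up):

#include <mpi.h>

#define ROWS 256
#define COLS 256

/* send one column of a row-major ROWS x COLS grid as a single message */
void send_column(double grid[ROWS][COLS], int col, int dest) {
    MPI_Datatype column;
    /* ROWS blocks of 1 double, consecutive blocks COLS doubles apart */
    MPI_Type_vector(ROWS, 1, COLS, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);
    MPI_Send(&grid[0][col], 1, column, dest, /*tag=*/0, MPI_COMM_WORLD);
    MPI_Type_free(&column);
}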
SLIDE 9
  • Interfaces: strided, indexed
  • Previous studies show the need for a multi-protocol approach
  • Implementations:

– Blocking – no overlap (BLOCK)
– Pipelining – flow control and fairness are problems (PIPE)
– Packing – flow control and attentiveness are problems (VIS)

BLOCK: foreach(S) { start_time(); for(iters) foreach(N) get(S); end_time(); }
PIPE:  foreach(S) { start_time(); for(iters) { foreach(N) get_nb(S); sync_all(); } end_time(); }
VIS:   foreach(S) { start_time(); for(iters) vector_get(N, S); end_time(); }

  • Protocols: Blocking, Non-Blocking, Packing (AM-based)
  • Empirical approach based on optimization space exploration
  • Transfer structure (N, S)
  • Application characteristics: active processors, communication topology, system size, instantaneous load

  • For each setting – Which implementation is faster?
  • Fast, lightweight decision mechanism – prune parameter space
  • Strategy: best OR worst case scenario?
SLIDE 10
  • Best algorithm determined by SMP arity and load

Resource constraints determine algorithm change

  • VIS → BLOCK
  • VIS → PIPE

See PACT’08 paper for details

  • Changing system size or topology does not cause protocol changes
  • Magnitude of performance differences is lowered (40x – 20x)
  • Accuracy > 90%, less than 2x performance loss
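The outcome of such tuning can be captured as a small categorical dispatcher rather than a time-accurate model; a hypothetical sketch (the thresholds are placeholders, not values from the paper):

#include <stddef.h>

typedef enum { PROTO_BLOCK, PROTO_PIPE, PROTO_VIS } proto_t;

/* choose a protocol from coarse categories – SMP arity (cores per node)
   and an instantaneous load flag – which the results above identify as the
   factors that actually drive the crossover */
proto_t select_protocol(int smp_arity, int node_loaded, size_t msg_size, int nmsgs) {
    if (nmsgs == 1)                   return PROTO_BLOCK;  /* nothing to batch */
    if (node_loaded && smp_arity > 4) return PROTO_VIS;    /* pack under contention */
    if (msg_size < 512)               return PROTO_PIPE;   /* small messages pipeline well */
    return PROTO_VIS;
}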
SLIDE 11

[Figure: interrupt-to-poll performance ratio heatmaps over NMSG and message size (doubles) for the BLOCK and VIS protocols]

  • Polling vs Interrupts
  • Different event notification mechanisms required for different protocols (event inter-arrival rate)

  • Categorical choice

> 5X performance difference

Bassi – Power5/Federation

  • Pessimistic (max) predictors obtained under high load work best.

Our micro-benchmarks and models are always concerned with worst-case performance.

SLIDE 12
  • UPC compiler, GASNet communication layer
  • 2 x 2068 x 2.6 GHz Opteron, Cray (BigBen)

– 2 x 320 x 2.2 GHz Opteron, InfiniBand 4x cluster (Jacquard)
– 8 x 111 x 1.9 GHz Power5, Federation (Bassi)
– 16 x 3936 x 1.9 GHz Barcelona, InfiniBand (Ranger)

  • NAS Parallel Benchmarks – manual optimizations vs compiler optimized

– MG: point-to-point Put, dynamic granularity across one run
– SP: point-to-point VIS Put, “static”
– BT: point-to-point VIS Put/Get, “static”

  • Node load (category) is the determining performance factor for wide SMPs
  • Categories can be further refined into numerical values, e.g. instantaneous load estimation

Workload: 22% improvement

Load estimation?

[Figure: performance of VIS, PIPE, BLOCK, and ADAPTIVE relative to the VIS implementation for BT, SP, and MG across class/processor-count configurations on an IBM p575; higher is better]

SLIDE 13
  • Communication optimizations: qualitative models, worst case performance, offline/guided exploration
  • First order performance determining factors are system dependent; number of correlations tends to be constant, large ranges
  • Strip-mining optimizations: Fat-tree and Torus
  • Vector optimizations: thin nodes and wide nodes
  • Instantaneous behavior important, can be coarsely categorized (#pragma)
  • Runtime Analysis feasible: algorithms O(n*log n) transfers, O(enest) faster than RTT
  • Decoupling transformations (comm/comp) works – no whole program analysis

  • SPMD performance can be enhanced by RT/OS mechanisms

Thank You!

SLIDE 14
  • Large number of network performance models (LogGP variants) – measurement methodology and validation on applications (asymptotic values)

– Su et al (SC’05)
– Cameron et al (IEEE ToC’07)

  • Implementations:

– Tipparaju et al (IPDPS’04) – InfiniBand – Nieplocha et al (HPCA’04) – Quadrics – Santhanaraman et al (PVM/MPI ’04) – InfiniBand

  • PGAS compilers

– CAF: message vectorization – Titanium: array copy operations, inspector-executor

SLIDE 15

Ideal Development Environment

[Diagram: HPC languages (C/C++, Fortran, UPC/CAF/Chapel, OpenMP) and application codes (including DOD application code) feed program analysis, automated task recognition, source-to-source code transformations, and source code generation; autotuning, optimization, and learning & reasoning draw on micro-benchmarks, processing system characterization, a knowledge & experience base, architecture and network models, and a configuration file & system model; language extensions & libraries, a component framework, the OS & runtime system, and a back-end processor-specific compiler yield an optimized parallel executable]

  • J. Demmel, M. Hall, C. Iancu, D. Quinlan, K. Yelick…
  • All protocols chosen across the whole workload and systems
  • Two types of systems:

– IBM – N-N estimators – static estimators are enough
– Sun – P-N, P-HN, P-P – heuristics to change predictors with scale or use instantaneous load estimation

Overall improved scalability and performance

  • Improvement: 22% workload, 3x speedup max

Load estimation?

SLIDE 16

NAS Application Benchmarks, InfiniBand Cluster

[Figure: performance relative to the UNOPTIMIZED implementation for MG, SP, BT, CG, IS, FT, and FT-NLE across class/processor-count configurations; hand-optimized versions reach 2.96x and 2.15x]

Improvement: 22% workload, 3x speedup max

(Sun: 2.5% workload, 15% speedup)

Iancu, Yelick

Instantaneous load estimation required for these results (SMP load, comm topology, comm distance)