Time to Start Over? Software for Exascale
William Gropp
www.cs.illinois.edu/~wgropp
Why Is Exascale Different?
- Extreme power constraints, leading to
  ♦ Clock rates similar to today's systems
  ♦ A wide diversity of simple computing elements (simple for hardware but complex for algorithms and software)
  ♦ Memory per core and per FLOP will be much smaller
  ♦ Moving data anywhere will be expensive (time and power)
- Faults that will need to be detected and managed
  ♦ Some detection may be the job of the programmer, as hardware detection takes power
Why Is Exascale Different?
- Extreme scalability and performance irregularity
  ♦ Performance will require enormous concurrency (10^8 – 10^9)
  ♦ Performance is likely to be variable
    - Simple, static decompositions will not scale
- A need for latency-tolerant algorithms and programming
  ♦ Memory and processors will be 100s to 10000s of cycles away; waiting for operations to complete will cripple performance
Why is Everyone Worried?
- Exascale makes all problems extreme:
  ♦ Power, data motion costs, performance irregularities, faults, extreme degree of parallelism, specialized functional units
- Added to each of these is
  ♦ Complexity resulting from all of the above
- These issues are not new, but may be impossible to ignore
  ♦ The "free ride" from rapid improvement in hardware performance is ending/over
That “Kink” in #500 is Real
- Extrapolation of recent data gives ~1 PF HPL in 2018 on the #500 system
- Extrapolation of older data gives ~1 PF in 2015, ~7 PF in 2018
- #500 may be a better predictor of trends
[Chart: measured HPL performance (TF) and fitted trend (TF) for the #500 system over time]
Current Petascale Systems Already Complex
- Typical processor
  ♦ 8 floating point units, 16 integer units
    - What is a "core"?
  ♦ Full FP performance requires use of short vector instructions
- Memory
  ♦ Performance depends on location, access pattern
  ♦ "Saturates" on a multicore chip
- Specialized processing elements
  ♦ E.g., NVIDIA GPU (K20X): 2688 "cores" (or 56…)
- Network
  ♦ 3- or 5-D torus; latency, bandwidth, contention important
Blue Waters: NSF’s Most Powerful System
- 4,224 XK7 nodes and 22,640 XE6 nodes
  ♦ ~1/7 GPU+CPU, ~6/7 CPU+CPU
  ♦ Peak performance >13 PF: ~1/3 GPU+CPU, ~2/3 CPU+CPU
- 1.5 PB memory, >1 TB/s I/O bandwidth
- System sustains >1 petaFLOPS on a wide range of applications
  ♦ From starting to read input from disk to results written to disk, not just computational kernels
  ♦ No Top500 run – does not represent application workload
How Do We Program These Systems?
- There are many claims about how we can and cannot program extreme scale systems
  ♦ Confusion is rampant
  ♦ Incorrect statements and conclusions are common
  ♦ Often reflects "I don't want to do it that way" instead of "there's a good reason why it can't be done that way"
- General impression
  ♦ The programming model influences the solutions used by programmers and algorithm developers
  ♦ In linguistics, this is the Sapir-Whorf or Whorfian hypothesis
- We need to understand our terms first
How Should We Think About Parallel Programming?
- Need a more formal way to think about programming
  ♦ Must be based on the realities of real systems
  ♦ Not the system that we wish we could build (see PRAM)
- Not talking about a programming model
  ♦ Rather, first need to think about what an extreme scale parallel system can do
  ♦ System – the hardware and the software together
Separate the Programming Model from the Execution Model
- What is an execution model?
  ♦ It's how you think about how you can use a parallel computer to solve a problem
- Why talk about this?
  ♦ The execution model can influence what solutions you consider (the Whorfian hypothesis)
  ♦ After decades where many computer scientists only worked with one execution model, we are now seeing new models and their impact on programming and algorithms
Examples of Execution Models
- Von Neumann machine:
  ♦ Program counter
  ♦ Arithmetic logic unit
  ♦ Addressable memory
- Classic vector machine:
  ♦ Add "vectors" – apply the same operation to a group of data with a single instruction
    - Arbitrary length (CDC STAR-100), 64 words (Cray), 2 words (SSE) – see the sketch after this list
- GPUs with collections of threads (warps)
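To make the contrast between the first two execution models concrete, here is a minimal sketch (not from the talk) of the same array addition written in the scalar von Neumann style and with 2-word SSE vector instructions; the function names are invented for illustration, and SSE2 support (emmintrin.h) is assumed.

/* Illustrative sketch, assuming an x86 compiler with SSE2 available. */
#include <emmintrin.h>

/* Von Neumann style: one scalar operation per instruction */
void vec_add_scalar(double *c, const double *a, const double *b, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* "Classic vector" style on SSE: each instruction operates on 2 doubles */
void vec_add_sse(double *c, const double *a, const double *b, int n) {
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        __m128d va = _mm_loadu_pd(&a[i]);
        __m128d vb = _mm_loadu_pd(&b[i]);
        _mm_storeu_pd(&c[i], _mm_add_pd(va, vb));
    }
    for (; i < n; i++)          /* scalar cleanup for odd n */
        c[i] = a[i] + b[i];
}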
Programming Models and Systems
- In the past, there was often a tight connection between the execution model and the programming approach
  ♦ Fortran: FORmula TRANslation to the von Neumann machine
  ♦ C: e.g., "register" and the ++ operator match PDP-11 capabilities and needs
- Over time, execution models and reality changed, but programming models rarely reflected those changes
  ♦ Rely on the compiler to "hide" those changes from the user – e.g., auto-vectorization for SSE(n)
- Consequence: mismatch between users' expectations and system abilities
  ♦ Can't fully exploit the system because the user's mental model of execution does not match the real hardware
  ♦ Decades of compiler research have shown this problem is extremely hard – you can't expect the system to do everything for you
Programming Models and Systems
- Programming model: an abstraction of a way to write a program
  ♦ Many levels
    - Procedural or imperative?
    - Single address space with threads?
    - Vectors as basic units of programming?
  ♦ A programming model is often expressed with pseudocode
- Programming system (my terminology):
  ♦ An API that implements parts or all of one or more programming models, enabling the precise specification of a program
Why the Distinction?
- In parallel computing,
  ♦ Message passing is a programming model
    - Abstraction: a program consists of processes that communicate by sending messages. See "Communicating Sequential Processes", CACM 21#8, 1978, by C.A.R. Hoare.
  ♦ The Message Passing Interface (MPI) is a programming system (a minimal sketch follows this list)
    - Implements message passing and other parallel programming models, including:
      - Bulk synchronous programming
      - One-sided communication
      - Shared memory (between processes)
  ♦ CUDA/OpenACC/OpenCL are systems implementing a "GPU programming model"
    - The execution model involves teams, threads, synchronization primitives, and different types of memory and operations
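As a concrete illustration of the message-passing model as expressed through the MPI programming system, here is a minimal sketch (not from the talk); the two-rank exchange and the value 42 are invented for illustration.

/* Minimal message-passing sketch: rank 0 sends one integer to rank 1. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* to rank 1, tag 0 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}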
The Devil Is in the Details
- There is no unique execution model
  ♦ What level of detail do you need to design and implement your program?
    - Don't forget – you decided to use parallelism because you could not get the performance you need without it
- Getting what you need already?
  ♦ Great! It ain't broke
- But if you need more performance of any type (scalability, total time to solution, user productivity)
  ♦ Rethink your model of computation and the programming models and systems that you use
Rethinking Parallel Computing
- Changing the execution model
  ♦ No assumption of performance regularity – but not unpredictable, just imprecise
    - Predictable within limits and most of the time
  ♦ Any synchronization cost amplifies irregularity – don't include synchronizing communication as a desirable operation
  ♦ Memory operations are always costly, so moving the operation to the data may be more efficient
    - Some hardware designs provide direct support for this, not just software emulation
  ♦ Important to represent key hardware operations, which go beyond a simple single Arithmetic Logic Unit (ALU)
    - Remote update (RDMA)
    - Remote atomic operation (compare and swap) – see the sketch after this list
    - Execute short code sequence (active messages, parcels)
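One way such a remote atomic operation can already be expressed today is through MPI-3 one-sided communication. The sketch below is illustrative (not from the talk): window creation is omitted, and the helper name, target rank, and displacement 0 are invented.

/* Minimal sketch: remote compare-and-swap on a long stored at displacement 0
   of the target rank's window. */
#include <mpi.h>

void remote_cas(MPI_Win win, int target, long expected, long desired, long *old) {
    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    /* Atomically: if target's value equals 'expected', write 'desired';
       the previous value is returned in 'old'. */
    MPI_Compare_and_swap(&desired, &expected, old, MPI_LONG, target, 0, win);
    MPI_Win_flush(target, win);
    MPI_Win_unlock(target, win);
}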
How Does This Change The Way You Should Look At Parallel Programming?
- More dynamic: plan for performance irregularity
  ♦ But still exploit as much regularity as possible (to minimize the overhead of being dynamic)
- Recognize that communication takes time, which is not precisely predictable
  ♦ Communication between cache and memory, or between two nodes in a parallel system
  ♦ Contention in the system is hard to avoid
- Think about the execution model
  ♦ Your abstraction of how a parallel machine works
  ♦ Include the hardware-supported features that you need for performance
- Finally, use a programming system that lets you express the elements you need from the execution model (one way to tolerate communication time is sketched below)
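For example, nonblocking communication is one common way to live with imprecisely predictable communication time: start it early, compute on data you already have, and wait only when the remote data is needed. The sketch below is illustrative (not from the talk); the buffer names, neighbor ranks, and message size are invented.

/* Minimal latency-hiding sketch: start a halo exchange, overlap it with
   independent computation, then wait before using the received data. */
#include <mpi.h>

void exchange_and_compute(double *sendbuf, double *recvbuf, int n,
                          int left, int right, MPI_Comm comm) {
    MPI_Request req[2];

    /* Start communication without blocking */
    MPI_Irecv(recvbuf, n, MPI_DOUBLE, left,  0, comm, &req[0]);
    MPI_Isend(sendbuf, n, MPI_DOUBLE, right, 0, comm, &req[1]);

    /* ... compute on interior data that does not depend on recvbuf ... */

    /* Wait only when the halo data is actually needed */
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

    /* ... compute on boundary data that uses recvbuf ... */
}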
Challenges for Programming Models
- Parallel programming models need to provide ways to coordinate resource allocation
  ♦ Numbers of cores/threads/functional units
  ♦ Assignment (affinity) of cores/threads (one way to express this is sketched after this list)
  ♦ Intranode memory bandwidth
  ♦ Internode memory bandwidth
- They must also provide clean ways to share data
  ♦ Consistent memory models
  ♦ Decide whether it's best to make it easy and transparent for the programmer (but slow) or fast but hard (or impossible, which is often the current state)
- Remember, parallel programming is about performance
  ♦ You will always get higher programmer productivity with single-threaded code
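As one example of how a current programming system exposes thread affinity, OpenMP 4.0 provides the proc_bind clause (and the OMP_PROC_BIND environment variable). The sketch below is illustrative (not from the talk); the function and loop are invented.

/* Illustrative sketch, assuming OpenMP 4.0 or later: bind the threads of this
   parallel region close together (e.g., on one socket) so they share cache and
   intranode memory bandwidth predictably. */
#include <omp.h>

void scale(double *x, int n, double a) {
    #pragma omp parallel for proc_bind(close)
    for (int i = 0; i < n; i++)
        x[i] *= a;
}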
Solutions
- All new: applications not well served by current systems
  ♦ E.g., not PDE simulations
- MPI 4+: especially ensembles of MPI simulations
  ♦ After all, MPI took us from giga to tera to peta, despite claims that it was impossible
- Addition and composition
  ♦ MPI+X, including more interesting X (a minimal hybrid sketch follows this list)
  ♦ Includes embedded abstract-data-structure-specific languages
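To make the MPI+X composition concrete with the most common X, here is a minimal MPI+OpenMP sketch (not from the talk); the requested thread level and the printed message are illustrative.

/* Minimal MPI+OpenMP sketch: MPI between nodes, OpenMP threads within a rank. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank;
    /* MPI_THREAD_FUNNELED: only the main thread will make MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        #pragma omp single
        printf("rank %d runs %d threads (thread support level %d)\n",
               rank, omp_get_num_threads(), provided);
    }

    MPI_Finalize();
    return 0;
}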
Quotes from "Enabling Technologies for Petaflops Computing" (MIT Press 1995)
- "The software for the current generation of 100 GF machines is not adequate to be scaled to a TF…"
- "The Petaflops computer is achievable at reasonable cost with technology available in about 20 years [2014]."
  ♦ (estimated clock speed in 2004: 700 MHz; this turned out to be the BG/L clock speed)
- "Software technology for MPP's must evolve new ways to design software that is portable across a wide variety of computer architectures. Only then can the small but important MPP sector of the computer hardware market leverage the massive investment that is being applied to commercial software for the business and commodity computer market."
- "To address the inadequate state of software productivity, there is a need to develop language systems able to integrate software components that use different paradigms and language dialects."
But MPI is Message Passing…
- No. This hasn't been true for nearly 20 years
- It's not Bulk Synchronous Parallelism either
- MPI features include
  ♦ Designed to be thread-safe, enabling interoperation with OpenMP, OpenACC (original MPI)
  ♦ Nonblocking, latency-hiding communication (MPI-1, 2, 3)
  ♦ Limited (MPI-2) and full-featured (MPI-3) one-sided communication
    - Includes RMW operations such as CAS, FetchAndOp
  ♦ Portable shared-memory interface (sketched after this list)
  ♦ Parallel I/O with high-performance semantics (MPI-2)
- Features under discussion include
  ♦ Fault tolerance (hard to get this right: what kind of faults?)
  ♦ Finer-grain interaction with threads (endpoints, thread sharing)
  ♦ Active messages
    - "Toward Asynchronous and MPI-Interoperable Active Messages"
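As an illustration of that portable shared-memory interface (introduced in MPI-3), here is a minimal sketch (not from the talk); the segment size, communicator split, and the single load/store shown are invented, and synchronization between ranks is omitted.

/* Minimal sketch: ranks on the same node allocate a shared window and can
   access each other's portion with plain loads and stores. */
#include <mpi.h>

void shared_example(void) {
    MPI_Comm nodecomm;
    MPI_Win win;
    double *mybase, *base0;
    MPI_Aint size0;
    int disp0, noderank;

    /* Group the ranks that can share memory (same node) */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &nodecomm);
    MPI_Comm_rank(nodecomm, &noderank);

    /* Each rank contributes 1024 doubles to a node-wide shared segment */
    MPI_Win_allocate_shared(1024 * sizeof(double), sizeof(double),
                            MPI_INFO_NULL, nodecomm, &mybase, &win);

    /* Get a direct pointer to rank 0's portion of the segment */
    MPI_Win_shared_query(win, 0, &size0, &disp0, &base0);
    if (noderank != 0)
        mybase[0] = base0[0];   /* plain load/store; synchronization omitted here */

    MPI_Win_free(&win);
    MPI_Comm_free(&nodecomm);
}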
Aside: Comparing MPI and X
- Many papers claim to prove X is better than MPI
  ♦ Most are really: "My implementation of X on system Y is better than the stock implementation of MPI from Z on system Y"
- In many cases, these are really
  ♦ "I have a better way to implement an operation that could be (a) within MPI or (b) above MPI"
- This doesn't mean that we should stick with MPI – just that we should focus on the real issues
  ♦ High level – focus on supporting data structures and algorithms
  ♦ Low level – focus on supporting experts exploiting hardware
  ♦ Composition – make sure high and low levels interoperate
Observations on Programming for Exascale
- Restrict the use of separate computational and communication "phases"
  ♦ Need more overlap of communication and computation to achieve latency tolerance (and energy reduction)
  ♦ Adds pressure to be memory efficient
- May need to re-think the entire solution stack
  ♦ E.g., nonlinear Schwarz instead of approximate Newton
  ♦ Don't reduce everything to linear algebra
  ♦ Libraries may need to be more open to source transformation
Observations on Programming for Exascale
- Use aggregates that match the hardware
- Limit scalars to essential control
  ♦ Data must be in a hierarchy of small to large
- Fully automatic fixes are unlikely
  ♦ No vendor compiles the simple code for DGEMM (the kind sketched after this list) and uses that for benchmarks
  ♦ No vendor compiles simple code for a shared-memory barrier and uses that (e.g., in OpenMP)
  ♦ Until they do, the best case is a human-machine interaction, with the compiler helping
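For reference, here is a minimal sketch of the "simple code" for DGEMM (illustrative only; row-major storage and square matrices are assumed). Tuned vendor libraries instead block for cache and use the vector units, which is exactly the gap a purely automatic approach would have to close.

/* Naive triple-loop DGEMM: C = C + A*B for n-by-n row-major matrices. */
void dgemm_simple(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = C[i * n + j];
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}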
Observations on Programming for Exascale
- Use mathematics as the organizing principle
  ♦ Continuous representations: possibly adaptive, memory-optimizing, lossy (within accuracy limits) but preserving essential properties (e.g., conservation)
- Manage code by using abstract-data-structure-specific languages (ADSLs) to handle operations and vertical integration across components
  ♦ So-called "domain-specific languages" are really abstract-data-structure-specific languages – they support more applications but fewer algorithms
  ♦ The difference is important because a "science domain" almost certainly requires flexibility with data structures and algorithms
Observations on Programming for Exascale
- Adaptive programming models with a multi-level approach
  ♦ Lightweight, locality-optimized for fine grain
  ♦ Within node/locality domain for medium grain
  ♦ Regional/global for coarse grain
  ♦ May be different programming models (hierarchies are ok!), but they must work well together
- Performance annotations to support a complex compilation environment
- Asynchronous and multilevel algorithms to match hardware
Conclusions
- Is it time to start over?
  ♦ Yes and no
- Yes: especially for new application areas
  ♦ Both at the high level and close to the now very different hardware
- No: we can augment current systems