SLIDE 1

Time to Start over? Software for Exascale

William Gropp www.cs.illinois.edu/~wgropp

SLIDE 2

Why Is Exascale Different?

  • Extreme power constraints, leading to
    ♦ Clock rates similar to today’s systems
    ♦ A wide diversity of simple computing elements (simple for hardware but complex for algorithms and software)
    ♦ Memory per core and per FLOP will be much smaller
    ♦ Moving data anywhere will be expensive (time and power)
  • Faults that will need to be detected and managed
    ♦ Some detection may be the job of the programmer, as hardware detection takes power

SLIDE 3

Why Is Exascale Different?

  • Extreme scalability and performance irregularity
    ♦ Performance will require enormous concurrency (10^8 – 10^9)
    ♦ Performance is likely to be variable
      • Simple, static decompositions will not scale
  • A need for latency-tolerant algorithms and programming
    ♦ Memory and processors will be 100s to 10000s of cycles away. Waiting for operations to complete will cripple performance

SLIDE 4

Why is Everyone Worried?

  • Exascale makes all problems extreme:
    ♦ Power, data motion costs, performance irregularities, faults, extreme degree of parallelism, specialized functional units
  • Added to each of these is
    ♦ Complexity resulting from all of the above
  • These issues are not new, but may be impossible to ignore
    ♦ The “free ride” from rapid improvement in hardware performance is ending, if not already over

SLIDE 5

That “Kink” in #500 is Real

  • Extrapolation of recent data gives ~1 PF HPL in 2018 on the #500 system
  • Extrapolation of older data gives ~1 PF in 2015, ~7 PF in 2018
  • #500 may be a better predictor of trends

[Figure: log-scale plot of HPL Perf (TF) and Fit perf (TF) over time for the #500 system.]

SLIDE 6

Current Petascale Systems Already Complex

  • Typical processor
    ♦ 8 floating-point units, 16 integer units
  • What is a “core”?
    ♦ Full FP performance requires use of short vector instructions
  • Memory
    ♦ Performance depends on location and access pattern
    ♦ “Saturates” on a multicore chip
  • Specialized processing elements
    ♦ E.g., NVIDIA GPU (K20X): 2688 “cores” (or 56…)
  • Network
    ♦ 3- or 5-D torus; latency, bandwidth, and contention are important

SLIDE 7

Blue Waters: NSF’s Most Powerful System

  • 4,224 XK7 nodes and 22,640 XE6 nodes
    ♦ ~1/7 GPU+CPU, 6/7 CPU+CPU
    ♦ Peak perf >13 PF: ~1/3 GPU+CPU, 2/3 CPU+CPU
  • 1.5 PB memory, >1 TB/s I/O bandwidth
  • System sustains >1 PetaFLOPS on a wide range of applications
    ♦ From starting to read input from disk to results written to disk, not just computational kernels
    ♦ No Top500 run – it does not represent the application workload

SLIDE 8

How Do We Program These Systems?

  • There are many claims about how we can and cannot program extreme-scale systems
    ♦ Confusion is rampant
    ♦ Incorrect statements and conclusions are common
    ♦ Often reflects “I don’t want to do it that way” instead of “there’s a good reason why it can’t be done that way”
  • General impression
    ♦ The programming model influences the solutions used by programmers and algorithm developers
    ♦ In linguistics, this is the Sapir-Whorf or Whorfian hypothesis
  • We need to understand our terms first

SLIDE 9

How Should We Think About Parallel Programming?

  • Need a more formal way to think about programming
    ♦ Must be based on the realities of real systems
    ♦ Not the system that we wish we could build (see PRAM)
  • Not talking about a programming model
    ♦ Rather, first need to think about what an extreme-scale parallel system can do
    ♦ System – the hardware and the software together

SLIDE 10

Separate the Programming Model from the Execution Model

  • What is an execution model?
    ♦ It’s how you think about how you can use a parallel computer to solve a problem
  • Why talk about this?
    ♦ The execution model can influence what solutions you consider (the Whorfian hypothesis)
    ♦ After decades in which many computer scientists worked with only one execution model, we are now seeing new models and their impact on programming and algorithms

SLIDE 11

Examples of Execution Models

  • Von Neumann machine:
    ♦ Program counter
    ♦ Arithmetic logic unit
    ♦ Addressable memory
  • Classic vector machine:
    ♦ Add “vectors” – apply the same operation to a group of data with a single instruction
      • Arbitrary length (CDC STAR-100), 64 words (Cray), 2 words (SSE); see the sketch below
  • GPUs with collections of threads (warps)
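A minimal sketch of the scalar and vector execution models side by side, assuming C with SSE2 intrinsics (the example is illustrative, not from the talk):

    #include <emmintrin.h>  /* SSE2 intrinsics */

    /* Scalar (von Neumann) model: one add per instruction. */
    void add_scalar(double *c, const double *a, const double *b, int n) {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    /* Vector model, the "2 words (SSE)" case: one instruction adds
       two doubles at a time. */
    void add_sse(double *c, const double *a, const double *b, int n) {
        int i;
        for (i = 0; i + 1 < n; i += 2) {
            __m128d va = _mm_loadu_pd(&a[i]);
            __m128d vb = _mm_loadu_pd(&b[i]);
            _mm_storeu_pd(&c[i], _mm_add_pd(va, vb));
        }
        for (; i < n; i++)  /* scalar remainder */
            c[i] = a[i] + b[i];
    }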

SLIDE 12

Programming Models and Systems

  • In the past, there was often a tight connection between the execution model and the programming approach
    ♦ Fortran: FORmula TRANslation to the von Neumann machine
    ♦ C: e.g., “register” and the ++ operator match PDP-11 capabilities and needs (see the sketch below)
  • Over time, execution models and reality changed, but programming models rarely reflected those changes
    ♦ Rely on the compiler to “hide” those changes from the user – e.g., auto-vectorization for SSE(n)
  • Consequence: mismatch between users’ expectations and system abilities
    ♦ Can’t fully exploit the system because the user’s mental model of execution does not match the real hardware
    ♦ Decades of compiler research have shown this problem is extremely hard – can’t expect the system to do everything for you
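For example, the classic copy idiom below (an illustrative sketch, not from the talk) mapped almost one-for-one onto PDP-11 autoincrement addressing modes and registers:

    /* Classic C idiom: *dst++ = *src++ compiled almost directly to
       PDP-11 autoincrement addressing, and "register" hinted that a
       variable should live in a machine register. */
    void copy_bytes(register char *dst, register const char *src,
                    register int n) {
        while (n-- > 0)
            *dst++ = *src++;
    }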

SLIDE 13

Programming Models and Systems

  • Programming model: an abstraction of a way to write a program
    ♦ Many levels
      • Procedural or imperative?
      • Single address space with threads?
      • Vectors as basic units of programming?
    ♦ A programming model is often expressed with pseudo-code
  • Programming system (my terminology):
    ♦ An API that implements parts or all of one or more programming models, enabling the precise specification of a program

SLIDE 14

Why the Distinction?

  • In parallel computing,
    ♦ Message passing is a programming model
      • Abstraction: a program consists of processes that communicate by sending messages. See “Communicating Sequential Processes”, CACM 21(8), 1978, by C.A.R. Hoare.
    ♦ The Message Passing Interface (MPI) is a programming system (see the sketch below)
      • It implements message passing and other parallel programming models, including:
        • Bulk synchronous programming
        • One-sided communication
        • Shared memory (between processes)
    ♦ CUDA/OpenACC/OpenCL are systems implementing a “GPU programming model”
      • The execution model involves teams, threads, synchronization primitives, and different types of memory and operations
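A minimal sketch of the message-passing model written in the MPI programming system (an illustrative two-process program; the calls are standard MPI-1):

    #include <mpi.h>
    #include <stdio.h>

    /* Two processes, one message: the essence of the model. */
    int main(int argc, char **argv) {
        int rank, value = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }
        MPI_Finalize();
        return 0;
    }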

SLIDE 15

The Devil Is in the Details

  • There is no unique execution model
    ♦ What level of detail do you need to design and implement your program?
      • Don’t forget – you decided to use parallelism because you could not get the performance you need without it
  • Getting what you need already?
    ♦ Great! It ain’t broke
  • But if you need more performance of any type (scalability, total time to solution, user productivity)
    ♦ Rethink your model of computation and the programming models and systems that you use

SLIDE 16

Rethinking Parallel Computing

  • Changing the execution model
    ♦ No assumption of performance regularity – but not unpredictable, just imprecise
      • Predictable within limits, and most of the time
    ♦ Any synchronization cost amplifies irregularity – don’t include synchronizing communication as a desirable operation
    ♦ Memory operations are always costly, so moving the operation to the data may be more efficient
      • Some hardware designs provide direct support for this, not just software emulation
    ♦ Important to represent key hardware operations, which go beyond a simple single arithmetic logic unit (ALU); see the sketch below
      • Remote update (RDMA)
      • Remote atomic operation (compare and swap)
      • Execute short code sequence (active messages, parcels)
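As one hedged illustration, a remote atomic fetch-and-add expressed with standard MPI-3 one-sided calls; the counter variable and the next_value helper are hypothetical names used only for this sketch:

    #include <mpi.h>

    long counter = 0;  /* exposed via a window created earlier, e.g.
                          MPI_Win_create(&counter, sizeof(long),
                          sizeof(long), MPI_INFO_NULL,
                          MPI_COMM_WORLD, &win); */

    /* Atomically increments the counter owned by rank "owner"; many
       networks execute this remotely, without involving the owner's
       CPU. Returns the value before the increment. */
    long next_value(MPI_Win win, int owner) {
        long one = 1, old;
        MPI_Win_lock(MPI_LOCK_SHARED, owner, 0, win);
        MPI_Fetch_and_op(&one, &old, MPI_LONG, owner, 0, MPI_SUM, win);
        MPI_Win_unlock(owner, win);
        return old;
    }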

SLIDE 17

How Does This Change The Way You Should Look At Parallel Programming?

  • More dynamic: plan for performance irregularity
    ♦ But still exploit as much regularity as possible (to minimize the overhead of being dynamic)
  • Recognize that communication takes time, which is not precisely predictable
    ♦ Communication between cache and memory, or between two nodes in a parallel system
    ♦ Contention in the system is hard to avoid
  • Think about the execution model
    ♦ Your abstraction of how a parallel machine works
    ♦ Include the hardware-supported features that you need for performance
  • Finally, use a programming system that lets you express the elements you need from the execution model

SLIDE 18

Challenges for Programming Models

  • Parallel programming models need to provide ways to coordinate resource allocation
    ♦ Numbers of cores/threads/functional units
    ♦ Assignment (affinity) of cores/threads (see the sketch after this list)
    ♦ Intranode memory bandwidth
    ♦ Internode memory bandwidth
  • They must also provide clean ways to share data
    ♦ Consistent memory models
    ♦ Decide whether it’s best to make it easy and transparent for the programmer (but slow) or fast but hard (or impossible, which is often the current state)
  • Remember, parallel programming is about performance
    ♦ You will always get higher programmer productivity with a single-threaded code
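Affinity today is usually coordinated outside the program text. A small Linux-specific sketch (sched_getcpu is a glibc call; OMP_PROC_BIND and OMP_PLACES are standard OpenMP 4.0 controls) that simply reports where each thread actually lands:

    #define _GNU_SOURCE
    #include <omp.h>
    #include <sched.h>   /* sched_getcpu(): glibc/Linux only */
    #include <stdio.h>

    /* Run as, e.g., OMP_PROC_BIND=close OMP_PLACES=cores ./a.out
       and compare placements across settings. */
    int main(void) {
        #pragma omp parallel
        printf("thread %d of %d is on cpu %d\n",
               omp_get_thread_num(), omp_get_num_threads(),
               sched_getcpu());
        return 0;
    }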

SLIDE 19

Solutions

  • All new: applications not well served by current systems
    ♦ e.g., not PDE simulations
  • MPI 4+: especially ensembles of MPI simulations
    ♦ After all, MPI took us from giga to tera to peta, despite claims that this was impossible
  • Addition and composition
    ♦ MPI+X, including more interesting X (see the sketch after this list)
    ♦ Includes embedded abstract-data-structure-specific languages
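A minimal MPI+X sketch with X = OpenMP, illustrating the composition detail that matters most: requesting a thread level so the two runtimes interoperate safely (standard calls; the program itself is hypothetical):

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, provided;
        /* FUNNELED: only the main thread will make MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        #pragma omp parallel  /* node-level parallelism under each rank */
        printf("rank %d, thread %d\n", rank, omp_get_thread_num());
        MPI_Finalize();
        return 0;
    }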

SLIDE 20

Quotes from “Enabling Technologies for Petaflops Computing” (MIT Press 1995)

  • “The software for the current generation of 100 GF machines is not adequate to be scaled to a TF…”
  • “The Petaflops computer is achievable at reasonable cost with technology available in about 20 years [2014].”
    ♦ (Estimated clock speed in 2004 — 700 MHz; roughly the BG/L clock speed)
  • “Software technology for MPP’s must evolve new ways to design software that is portable across a wide variety of computer architectures. Only then can the small but important MPP sector of the computer hardware market leverage the massive investment that is being applied to commercial software for the business and commodity computer market.”
  • “To address the inadequate state of software productivity, there is a need to develop language systems able to integrate software components that use different paradigms and language dialects.”

SLIDE 23

But MPI is Message Passing…

  • No. This hasn’t been true for nearly 20 years
  • It’s not Bulk Synchronous Parallelism either
  • MPI features include:
    ♦ Designed to be thread-safe, enabling interoperation with OpenMP and OpenACC (original MPI)
    ♦ Nonblocking, latency-hiding communication (MPI-1, 2, 3)
    ♦ Limited (MPI-2) and full-featured (MPI-3) one-sided communication
      • Includes read-modify-write operations such as CompareAndSwap and FetchAndOp
    ♦ Portable shared-memory interface (MPI-3; see the sketch below)
    ♦ Parallel I/O with high-performance semantics (MPI-2)
  • Features under discussion include:
    ♦ Fault tolerance (hard to get this right: what kinds of faults?)
    ♦ Finer-grain interaction with threads (endpoints, thread sharing)
    ♦ Active messages
      • See “Toward Asynchronous and MPI-Interoperable Active Messages”
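A hedged sketch of the MPI-3 portable shared-memory interface: ranks on the same node get direct load/store access to a common segment. The calls are standard MPI-3; the routine name and the assumption that n is at least the node’s rank count are mine:

    #include <mpi.h>

    /* One shared segment per node; only node-rank 0 contributes the
       memory, but every rank on the node maps it. */
    void node_shared_array(int n) {
        MPI_Comm node;  MPI_Win win;  double *base;
        MPI_Aint size;  int nrank, disp;

        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node);
        MPI_Comm_rank(node, &nrank);
        MPI_Win_allocate_shared(nrank == 0 ? n * sizeof(double) : 0,
                                sizeof(double), MPI_INFO_NULL, node,
                                &base, &win);
        /* Every rank obtains a direct pointer to rank 0's segment. */
        MPI_Win_shared_query(win, 0, &size, &disp, &base);
        base[nrank] = (double)nrank;  /* plain store, no MPI_Send */
        MPI_Win_fence(0, win);        /* one way to order the stores */
        MPI_Win_free(&win);
        MPI_Comm_free(&node);
    }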

SLIDE 24

Aside: Comparing MPI and X

  • Many papers claim to prove that X is better than MPI
    ♦ Most are really: “My implementation of X on system Y is better than the stock implementation of MPI from Z on system Y”
  • In many cases, these are really
    ♦ “I have a better way to implement an operation that could be (a) within MPI or (b) above MPI”
  • This doesn’t mean that we should stick with MPI – just that we should focus on the real issues
    ♦ High level – focus on supporting data structures and algorithms
    ♦ Low level – focus on supporting experts exploiting hardware
    ♦ Composition – make sure the high and low levels interoperate

SLIDE 25

Observations on Programming for Exascale

  • Restrict the use of separate computational and communication “phases”
    ♦ Need more overlap of communication and computation to achieve latency tolerance (and energy reduction); see the sketch after this list
    ♦ Adds pressure to be memory efficient
  • May need to re-think the entire solution stack
    ♦ E.g., nonlinear Schwarz instead of approximate Newton
    ♦ Don’t reduce everything to linear algebra
    ♦ Libraries may need to be more open to source transformation
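A minimal latency-hiding sketch using standard nonblocking MPI; compute_interior and compute_boundary are hypothetical application kernels standing in for real work:

    #include <mpi.h>

    void compute_interior(double *u);               /* hypothetical */
    void compute_boundary(double *u, double *halo); /* hypothetical */

    /* Post the halo exchange early, do halo-independent work while
       the messages are in flight, then finish the boundary. */
    void step(double *u, double *halo, int n, int left, int right,
              MPI_Comm comm) {
        MPI_Request req[2];
        MPI_Irecv(halo, n, MPI_DOUBLE, left,  0, comm, &req[0]);
        MPI_Isend(u,    n, MPI_DOUBLE, right, 0, comm, &req[1]);

        compute_interior(u);        /* needs no halo data */

        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        compute_boundary(u, halo);  /* needed the halo */
    }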

SLIDE 26

Observations on Programming for Exascale

  • Use aggregates that match the hardware
  • Limit scalars to essential control
    ♦ Data must be in a hierarchy of small to large
  • Fully automatic fixes are unlikely
    ♦ No vendor compiles the simple code for DGEMM and uses that for benchmarks
    ♦ No vendor compiles simple code for a shared-memory barrier and uses that (e.g., in OpenMP)
    ♦ Until they do, the best case is a human-machine interaction, with the compiler helping

SLIDE 27

Observations on Programming for Exascale

  • Use mathematics as the organizing principle
    ♦ Continuous representations: possibly adaptive, memory-optimizing representations, lossy (within accuracy limits) but preserving essential properties (e.g., conservation)
  • Manage code by using abstract-data-structure-specific languages (ADSLs) to handle operations and vertical integration across components
    ♦ So-called “domain-specific languages” are really abstract-data-structure-specific languages – they support more applications but fewer algorithms
    ♦ The difference is important because a “science domain” almost certainly requires flexibility with data structures and algorithms

SLIDE 28

Observations on Programming for Exascale

  • Adaptive programming models with a multi-level approach
    ♦ Lightweight, locality-optimized for fine grain
    ♦ Within node/locality domain for medium grain
    ♦ Regional/global for coarse grain
    ♦ May be different programming models (hierarchies are OK!) but they must work well together
  • Performance annotations to support a complex compilation environment
  • Asynchronous and multilevel algorithms to match the hardware

SLIDE 29

Conclusions

  • Is it time to start over?
    ♦ Yes and no
  • Yes: especially for new application areas
    ♦ Both at the high level and close to the now very different hardware
  • No: we can augment current systems by addressing their limitations
    ♦ Composition of solutions is key
    ♦ Mature programming systems have proven adaptable to new ideas and capabilities