Towards a Usable Programming Model for GPGPU

Dr. Orion Sky Lawlor
lawlor@alaska.edu
U. Alaska Fairbanks
2011-04-19
http://lawlor.cs.uaf.edu/


Obligatory Introductory Quote

“He who controls the past, controls the future.” (George Orwell, 1984)


In Parallel Programming...

“He who controls the past, controls the future.” (George Orwell, 1984)

“He who controls the writes, controls performance.” (Orion Lawlor, 2011)


Talk Outline

  • Existing parallel programming models
  • Who controls the writes?
  • Charm++ and Charm--
  • Charm-style GPGPU
  • Conclusions


Existing Model: Superscalar

  • Hardware parallelization of a sequential programming model
  • Fetch future instructions
    • Needs good branch prediction
  • Runtime dependency analysis (see the sketch below)
    • Load/store buffer for memory-carried dependencies
    • Rename away the false dependencies (RAR, WAR, WAW), leaving only true RAW dependencies
  • Now “solved”: low future gain

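To make those dependency classes concrete, here is a small source-level sketch (my example, not the talk's; the hardware does this at the instruction level):

// Dependency classes at source level: only RAW forces the hardware to wait.
void dependencies(float x, float y, float z, float p, float q) {
  float a = x * y;            // write a
  float b = a + z;            // read-after-write (RAW): b truly must wait for a
  float c = x + y;            // read-after-read (RAR) on x, y: no ordering constraint
  a = p * q;                  // write-after-write (WAW) on a: renaming gives it a fresh register
  (void)a; (void)b; (void)c;  // silence unused-variable warnings
}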

Spacetime Data Arrows

[Diagram: read and write data arrows plotted in “spacetime”, with “time” as program order and “space” as memory location / node.]


Read After Write Dependency

[Diagram: a write followed by a read across an artificial instruction boundary, forming a true read-after-write dependency.]


Read After Read: No Problem!

[Diagram: two reads across an artificial instruction boundary; read-after-read imposes no ordering constraint.]


Existing Model: Shared Memory

  • OpenMP, threads, shmem
  • “Just” let different processors access each others' memory
  • HW: cache coherence
    • false sharing, cache thrashing
  • SW: synchronization
    • locks, semaphores, fences, ...
  • Correctness is a huge issue (see the sketch below)
    • Weird race conditions abound
    • New bugs in 10+ year old code

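As a minimal sketch of that correctness hazard (my example, not the speaker's), two threads updating one shared location race unless the writes are explicitly controlled:

#include <atomic>
#include <iostream>
#include <thread>

int main() {
  long plain = 0;               // unsynchronized shared write target: a data race
  std::atomic<long> safe{0};    // synchronized: concurrent writes are well defined
  auto work = [&] {
    for (int i = 0; i < 1000000; ++i) {
      ++plain;                                     // racy read-modify-write
      safe.fetch_add(1, std::memory_order_relaxed);
    }
  };
  std::thread a(work), b(work);
  a.join(); b.join();
  // "plain" typically prints less than 2000000; "safe" is always exactly 2000000.
  std::cout << "plain=" << plain << " safe=" << safe.load() << "\n";
}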

Gather: Works Fine

[Diagram: gather pattern, with distributed reads funneling into centralized writes.]


Scatter: Tough to Synchronize!

[Diagram: scatter pattern, with two distributed writes colliding at the same location. “Oops!”]


Existing Model: Message Passing

  • MPI, sockets
  • Explicit control of parallel reads (send) and writes (recv)
  • Far fewer race conditions
  • Programmability is an issue (see the sketch below)
    • Raw byte-based interface (C style)
    • High per-message cost (alpha)
    • Synchronization issues: when does MPI_Send block?
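A minimal sketch (mine, with made-up buffer sizes) of why “when does MPI_Send block?” matters: if both ranks send first, the program deadlocks whenever the messages exceed MPI's internal buffering; MPI_Sendrecv is one way to sidestep it.

#include <mpi.h>
#include <vector>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  std::vector<double> out(1 << 20, rank), in(1 << 20);
  int other = 1 - rank;   // assumes exactly two ranks
  // Risky version: both ranks may block inside MPI_Send, each waiting for a
  // receive the other never reaches.
  //   MPI_Send(out.data(), (int)out.size(), MPI_DOUBLE, other, 0, MPI_COMM_WORLD);
  //   MPI_Recv(in.data(), (int)in.size(), MPI_DOUBLE, other, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  // Safe version: let MPI pair the send and the receive.
  MPI_Sendrecv(out.data(), (int)out.size(), MPI_DOUBLE, other, 0,
               in.data(),  (int)in.size(),  MPI_DOUBLE, other, 0,
               MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  MPI_Finalize();
  return 0;
}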


Existing Model: SIMD

  • SSE, AVX, and GPU
  • Single Instruction, Multiple Data
    • Far fewer fetches & decodes
    • Far higher arithmetic intensity
  • CPU: programmability N/A
    • Assembly language (hello, 1984!)
    • <xmmintrin.h> wrappers: _mm_add_ps (see the sketch below)
    • Or pray for the automagic vectorizing compiler!
  • GPU: programmability OK
    • Graphics side: GLSL, HLSL, Cg
    • GPGPU: CUDA, OpenCL, DirectX Compute Shaders

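A small sketch of that wrapper style (my example): one _mm_add_ps call performs four float additions in a single SSE instruction.

#include <xmmintrin.h>   // SSE intrinsics, including _mm_add_ps
#include <cstdio>

int main() {
  float a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, c[4];
  __m128 va = _mm_loadu_ps(a);     // load four unaligned floats
  __m128 vb = _mm_loadu_ps(b);
  __m128 vc = _mm_add_ps(va, vb);  // four additions in one instruction
  _mm_storeu_ps(c, vc);
  printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);  // 11 22 33 44
}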

NVIDIA CUDA

  • CPU calls a GPU “kernel” with a “block” of threads (see the sketch below)
  • Now fully programmable (throw/catch, virtual methods, recursion, etc.)
  • Read and write memory anywhere
    • Zero protection against multithreaded race conditions
  • Manual control over a small __shared__ memory region
  • Only runs on NVIDIA hardware (OpenCL is portable... sorta)
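A minimal sketch of that model (my example, not the speaker's code): the CPU launches a kernel over blocks of threads, and each thread is free to write global memory with no protection beyond the guard it writes itself.

#include <cstdio>
#include <cuda_runtime.h>

// Each thread scales one element; nothing stops two threads from writing
// the same location if the index math is wrong.
__global__ void scale(float *data, float s, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= s;
}

int main() {
  const int n = 1024;
  float *d = nullptr;
  cudaMalloc(&d, n * sizeof(float));
  cudaMemset(d, 0, n * sizeof(float));
  scale<<<(n + 127) / 128, 128>>>(d, 2.0f, n);  // grid of 128-thread blocks
  cudaDeviceSynchronize();
  cudaFree(d);
  printf("kernel finished\n");
  return 0;
}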


OpenGL: GL Shading Language

  • Mostly programmable (loops, etc.)
  • Can read anywhere in “textures”, but only write to the “framebuffer” (2D/3D arrays)
    • Reads go through the “texture cache”, so performance is good (iff locality)
    • Writes land along a space-filling curve
    • Writes are controlled by the graphics driver
    • So cannot have synchronization bugs!
  • Rich selection of texture filtering (array interpolation) modes
    • Includes mipmaps, for multigrid
  • GLSL can run OK on every modern GPU (well, except Intel...)


GLSL vs CUDA

[Venn diagram, built up over four slides: the set of GLSL programs (mipmaps; texture writes) and the set of CUDA programs (arbitrary writes) overlap, and are compared against the set of correct programs and the set of high performance programs.]

GPU/CPU Convergence

  • GPU, per socket:
    • SIMD: 16-32 way
    • SMT: 2-50 way (register limited)
    • SMP: 4-36 way
  • CPUs will get there, soon!
    • SIMD: 8-way AVX (64-way SWAR)
    • SMT: 2-way Intel; 4-way IBM
    • SMP: 6-8 way per socket already
    • Intel has shown 48-way chips
  • Biggest difference: the CPU has branch prediction & superscalar execution!


CUDA: Memory Output Bandwidth

[Plot: kernel time vs. output size on an NVIDIA GeForce GTX 280, with a fixed 128 threads per block. Kernel startup latency: 4 µs; output bandwidth: 8 GB/s. Fit: t = 4000 ns/kernel + bytes × 0.0125 ns/byte.]
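Taking the fitted constants above at face value, the per-byte term only matches the 4000 ns launch cost at 4000 / 0.0125 = 320,000 bytes, so kernels that write much less than a few hundred kilobytes of output pay mostly startup latency; a 1 MB output costs roughly 4000 ns + 1,048,576 × 0.0125 ns ≈ 17 µs.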


Charm++ and “Charm--”


Existing Model: Charm++

  • Chares send each other messages
  • Runtime system does delivery
    • Scheduling!
    • Migration with efficient forwarding
    • Cheap broadcasts
  • Runtime system schedules chares
    • Overlap communication and computation
  • Programmability still an issue
    • Per-message overhead, even with the message combining library
    • Collect up your messages (SDAG?)
    • Cheap SMP reads? SIMD? GPU?


One Charm++ Method Invocation

[Diagram: an entry method of a chare receives one message (but in what order?), reads and updates the chare's internal state, and sends messages to other chares; between a send and the matching receive, the runtime may interpose migration, checkpointing, ...]
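A rough Charm++-style sketch of that cycle (my own; the Worker chare, its .ci declaration, and recvBoundary are invented for illustration): one entry method receives a message, reads and updates the chare's state, and sends a message onward through a proxy.

// worker.ci (interface file), assumed for this sketch:
//   array [1D] Worker {
//     entry Worker();
//     entry void recvBoundary(int n, float ghost[n]);
//   };
#include <vector>
#include "worker.decl.h"   // generated from the .ci file above

class Worker : public CBase_Worker {
  std::vector<float> data;                    // internal state
public:
  Worker() : data(1000, 0.0f) {}
  void recvBoundary(int n, float *ghost) {    // runs when one message arrives
    data[0] = 0.5f * (data[0] + ghost[0]);    // update internal state
    // Send a message to the next chare (bounds handling omitted):
    thisProxy[thisIndex + 1].recvBoundary(1, &data.back());
  }
};
#include "worker.def.h"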


The Future: SIMD

  • AVX, SSE, AltiVec, GPU, etc.
  • Thought experiment
    • Imagine a block of 8 chares living in one SIMD register
    • Deliver 8 messages at once (!)
    • Or imagine 100K chares living in GPU RAM
  • Locality (mapping) is important!
    • Branch divergence penalty
  • Struct-of-Arrays member storage (see the sketch below)
    • xxxxxxxx yyyyyyyy zzzzzzzz
    • Members of 8 separate chares!
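A tiny sketch (mine, with a made-up Chare8 type) of that struct-of-arrays layout: the same member from 8 chares sits contiguously, so one SIMD register can hold all 8 copies at once.

// Struct-of-arrays storage for a block of 8 chares.
struct Chare8 {
  float x[8];   // xxxxxxxx: member x of 8 separate chares
  float y[8];   // yyyyyyyy
  float z[8];   // zzzzzzzz
};

// "Deliver 8 messages at once": one loop the compiler can turn into SSE/AVX.
void deliver(Chare8 &block, const float dx[8]) {
  for (int i = 0; i < 8; ++i)
    block.x[i] += dx[i];
}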

Vision: Charm-- Stencil

array [2D] stencil {
public:
  float data;
  [entry] void average(float nbors[4]=fetchnbors()) {
    data=0.25*(nbors[0]+nbors[1]+nbors[2]+nbors[3]);
  }
};


Vision: Charm-- Explained

(The same stencil code as above, annotated across three build slides:)

  • Assembled into GPU arrays or SSE vectors
  • Broadcast out to blocks of array elements
  • Hides local synchronized reads, network, and domain boundaries


Vision: Charm-- Springs

array [1D] sim_spring {
public:
  float restlength;
  [entry] void netforce(sim_vertex ends[2]=fetch_ends()) {
    vec3 along=ends[1].pos-ends[0].pos;
    float f=-k*(length(along)-restlength);  // k: spring stiffness constant
    vec3 F=f*normalize(along);
    ends[0].netforce+=F;
    ends[1].netforce-=F;
  }
};


One Charm-- Method Invocation

[Diagram: a chare block on the GPU fetches together multiple messages, reads and updates its internal states, and sends off network messages, coordinated by a “mainchare” on the CPU.]


Noncontiguous Communication

[Diagram: a packed network data buffer scattered into its noncontiguous places in the GPU target buffer.]

  • Run a scatter kernel (see the sketch below)
  • Or fold the scatter into the fetch
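A minimal sketch of the first option (my code, not the talk's): a scatter kernel where each thread moves one packed network-buffer entry to its noncontiguous slot in the GPU target buffer.

// Element i of the packed network buffer lands at target[index[i]].
__global__ void scatter(const float *netbuf, const int *index,
                        float *target, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) target[index[i]] = netbuf[i];
}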


Key Charm-- Design Features

  • Multiple chares receive a message at once
    • Runtime block-allocates incoming and outgoing message storage
    • Critical for SIMD, GPU, SMP
  • Receive multiple messages in one entry point
    • Minimize roundtrips to the GPU
  • Explicit support for timesteps
    • E.g., double-buffer message storage (see the sketch below)
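A small sketch (mine) of that double-buffering idea: the runtime fills a “next” buffer with arriving messages while entry methods consume the “current” one, swapping at each timestep boundary.

#include <utility>
#include <vector>

struct MessageBuffers {
  std::vector<float> current;   // messages consumed during this timestep
  std::vector<float> next;      // messages arriving for the next timestep
  void endOfTimestep() {
    std::swap(current, next);   // next timestep's messages become current
    next.clear();               // reuse the emptied buffer for new arrivals
  }
};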


Charm-- Not Shown

  • Lots of work in the “mainchare”
    • Controls decomposition & communication
    • Sets up the “fetch”
  • Still lots of network work
    • Gather & send off messages
    • Distribute incoming messages
  • Division of labor?
    • Application scientist writes the Chare
    • Computer scientist writes the Mainchare


Related Work

  • Charm++ Accelerator API [Wesolowski]
    • Pipelines CUDA copies, queues kernels
    • Good backend for Charm--
  • Intel ArBB: SIMD from kernels
    • Based on RapidMind
    • But GPU support?
  • My “GPGPU” library
    • Based on GLSL


The Future


The Future: Memory Bandwidth

  • Today: 1 TF/s of compute, but only 0.1 TB/s of memory bandwidth
  • Don't communicate, recompute
    • Multistep stencil methods
    • “fetch” gets even more complex!
  • 64-bit -> 32-bit -> 16-bit -> 8?
    • Spend flops scaling the data
    • Split solution + residual storage
    • Most flops use fewer bits, in the residual
  • Fight roundoff with stochastic rounding (see the sketch below)
    • Add noise to improve precision
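A rough sketch (mine, not the speaker's method) of stochastic rounding when narrowing precision: round up with probability equal to the fractional remainder, so the rounding error averages out to zero over many operations instead of accumulating.

#include <cmath>
#include <random>

// Stochastically round x to a multiple of 'step' (e.g. the spacing of a
// lower-precision format).  Unbiased on average, unlike round-to-nearest.
float stochastic_round(float x, float step, std::mt19937 &rng) {
  float scaled = x / step;
  float lo = std::floor(scaled);
  float frac = scaled - lo;                          // fractional part in [0,1)
  std::uniform_real_distribution<float> u(0.0f, 1.0f);
  return (lo + (u(rng) < frac ? 1.0f : 0.0f)) * step;
}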

Conclusions

  • C++ is dead. Long live C++!
  • CPU and GPU are on a collision course
    • SIMD + SMT + SMP + network
  • Software is the bottleneck
    • Exciting time to build software!
  • Charm-- model
    • Support ultra-low grainsize chares
    • Combine into SIMD blocks at runtime
    • Simplify the programmer's life
    • Add flexibility for the runtime system
    • BUT must scale to real applications!