GRAPHICS PROCESSOR PROGRAMMING IN CUDA
Tamás Budavári / The Johns Hopkins University
7/18/2012, ISSAC at HiPACC


SLIDE 1

GRAPHICS PROCESSOR PROGRAMMING IN CUDA

Tamás Budavári / The Johns Hopkins University

7/18/2012

SLIDE 2

How I got into this?

• Galaxy correlation function
• Histogram of distances
• State-of-the-art method
• Dual-tree traversal

[Figure: 8-bin histogram of pair distances]

SLIDE 3

What if?

[Figure: 800 × 800 bins]

SLIDE 4

Extending SQL Server

• Dedicated service for direct access
• Shared memory IPC w/ on-the-fly data transform

[Diagram: shared-memory IPC]

SLIDE 5

User-Defined Functions

• Pair counts computed on the GPU
• Returns 2D histogram as a table (i, j, cts)
• Calculate the correlation fn in SQL

SLIDE 6

User-Defined Functions

• Pair counts computed on the GPU
• Returns 2D histogram as a table (i, j, cts)
• Calculate the correlation fn in SQL

SLIDE 7

Multiple GPUs in Parallel

• Several C# proxies to launch jobs on more cards
• Non-blocking SQL routines

[Diagram: shared-memory IPC]

SLIDE 8

Async SQL Interface

SLIDE 9

Baryon Acoustic Oscillations (BAO)

• 600 trillion galaxy pairs
• C for CUDA on GPUs

Tian, Neyrinck, TB & Szalay (2011)

SLIDE 10

Outline

• Parallelism
• Hardware
• Programming
• Multithreading
• Coding for GPUs
• CUDA, Thrust, …

SLIDE 11

Parallelism

• Data parallel: same processing on different pieces of data
• Task parallel: simultaneous processing on the same data

SLIDE 12

On all levels of the hierarchy

• Clouds
• Clusters
• Machines
• Cores
• Threads

SLIDE 13

Scalability

• Scale up (vertically)
  • Add resources to a node: bigger memory, faster processor, …
• Scale out (horizontally)
  • Use more of the threads, cores, machines, clusters, clouds, …

SLIDE 14

Cluster

SLIDE 15

High-Performance Computing

• Traditional HPC clusters
  • Launching jobs on a cluster of machines
  • Use MPI (Message Passing Interface) to communicate among nodes

SLIDE 16

Queuing Systems

• Used for batch jobs on computer clusters
  • Fair scheduling of user jobs
  • Group policies
• Several systems
  • Portable Batch System (PBS)
  • Condor, etc.

SLIDE 17

Computer

SLIDE 18

Classification of Parallel Computers

• Flynn's Taxonomy: SISD, SIMD, MISD, MIMD

SLIDE 19

SISD

• Single Instruction, Single Data
  • Classical von Neumann machines
  • Single-threaded codes

[Image: arstechnica.com]

SLIDE 20

SIMD

• Single Instruction, Multiple Data
• On x86
  • MMX: MultiMedia eXtensions
  • SSE: Streaming SIMD Extensions
  • …and more…
• GPU programming!!

[Image: arstechnica.com]

SLIDE 21

Amdahl's Law of Parallelism

• Speed up: with a serial fraction S and parallel fraction P (S + P = 1), the runtime on N processors is

  T(N) = T(1) (S + P/N)

  so the speedup is

  T(1)/T(N) = 1 / ((1 − P) + P/N) → 1/S as N → ∞

• Before looking into parallelism, speed up the serial code to figure out the max speedup, i.e., 1/S

SLIDE 22

Chip

SLIDE 23

Moore's Law

SLIDE 24

New Limitation is Energy!

• Power to compute the same thing?
  • CPU is 10× less efficient than a digital signal processor
  • DSP is 10× less efficient than a custom chip
• New design: multicores with slower clocks
  • But the interconnect is expensive
  • Need simpler components

SLIDE 25

Emerging Architectures

• Andrew Chien: 10×10 to replace the 90/10 rule
• Custom modules on chip, cf. SoC in cellphones
• Statistics on a video codec module?

SLIDE 26

Emerging Architectures

• Andrew Chien: 10×10 to replace the 90/10 rule
• Custom modules on chip, cf. SoC in cellphones
• Scientific analysis on such specialized units?

SLIDE 27

GPUs Evolved to be General Purpose

• Virtual world: simulation of real physics
• C for CUDA and OpenCL
• 512 cores, 25k threads, running 1 billion threads/sec
• Old algorithms built on the wrong assumption
  • Today processing is free but memory is slow

New programming paradigm!

SLIDE 28

New Moore's Law

• In the number of cores
• Faster than ever

SLIDE 29

Programming

SLIDE 30

Programming Languages

• No one language to rule them all
• And many to choose from

SLIDE 31

Assembly

• Low-level (almost) machine code
• Different for each computer

SLIDE 32

The "C" Language

• Higher level but still close to hardware, i.e., fast
• Pointers!
• Many things written in C
  • Operating systems
  • Other languages, …

SLIDE 33

Java

• Pros
  • Memory management with garbage collection
  • Just-In-Time compilation from 'bytecode'
• Cons
  • Not so great performance
  • Hard to include legacy codes
  • New language features were an afterthought

SLIDE 34

Python

• Scripting to glue things together
• Easy to wrap legacy codes
• Lots of scientific modules and plotting
• Good for prototyping

SLIDE 35

Etc…

• Perl
• Matlab
• Mathematica
• IDL
• R
• Lisp
• Haskell
• OCaml
• Erlang
• Your favorite here…

SLIDE 36

Programming in C

• Skeleton of an application

SLIDE 37

Programming in C

• Files
  • Headers *.h
  • Source *.c
• Building an application
  • Compile source
  • Link object files

SLIDE 38

Using Pointers

SLIDE 39

Arrays

• Dynamic arrays
  • Memory allocation
  • Freeing memory
• Pointer arithmetic

SLIDE 40

Matrix, etc…

• Pointers to pointers
  • Data allocated in v
  • Pointers in A
  • For 2D indexing
• One can have
  • Matrix, tensor, …
  • Jagged arrays, …

SLIDE 41

Parallel Actions

Concurrency

SLIDE 42

Data Parallel Techniques

• "Embarrassingly Parallel"
  • Decoupled problems, independent processing
• MapReduce
  • Map
  • Reduce

SLIDE 43

The Elevator Problem

• People on multiple levels
• Press the button…

SLIDE 44

Mutual Exclusion

• Multiple processes or threads
  • Access shared resources in critical sections
  • E.g., call the elevator when it's time to go
• Locking
  • Elevators, etc…

SLIDE 45

Dining Philosophers

• Five silent philosophers sit at the table
  • Alternate between eating and thinking
  • Need both forks, left & right, to eat
    • Must be picked up one by one!
  • Infinite food in front of them
• How can they all think & eat forever?

SLIDE 46

Parallel Threads

SLIDE 47

Threading

• Concurrent parallelism in a machine

SLIDE 48

Parallelism

• Data parallel: same processing on different pieces of data
• Task parallel: simultaneous processing on the same data

SLIDE 49

Comparing Chips

SLIDE 50

Hybrid Architecture

[Diagram: the CPU launches kernels, the GPU runs them, then the host syncs]

SLIDE 51

Programming GPGPUs

• CUDA
  • Low-level & high-level
• OpenCL
• DirectCompute
  • DirectX, etc…
• C++ AMP (new!)
  • Accelerated Massive Parallelism

SLIDE 52

CUDA

SLIDE 53

Projects on CUDA Zone

SLIDE 54

Currently Available

• GPU-optimized sorting, RNG, BLAS, FFT, Hadamard…
• SDK w/ examples
• Nsight debugger!
• Imaging routines
• Python w/ PyCUDA
• High-level C++ programming with Thrust

SLIDE 55

Fermi

• Previous generation
  • 20-series Tesla cards, e.g., C2050
  • 400+ series GeForce cards, e.g., GTX 480
• IEEE-754 arithmetic
  • Standard floating point
  • Same as in the CPUs

SLIDE 56

Kepler

• Latest generation
  • More efficient, more cores
  • GTX 680 has 1536 cores

SLIDE 57

Which Device?

• Tesla
  • Computation (& games)
  • Up to 6 GB memory
  • ECC (Error Correcting Codes) on/off
  • More double-precision units on chip
• GeForce
  • Games (& computation)
  • Typically 1.5 GB memory
  • Faster clock & more cores
  • Heat! Not to run for years
  • Great for development
• …and more

SLIDE 58

Multiprocessors

• 1536 simultaneous threads per multiprocessor
  • Up to ~25,000 threads on a Fermi GPU w/ 16 MPs
• L1 cache memory – implicitly speeds codes up
• Shared memory – explicit programming
  • 100× access speed compared to global memory
  • Allows for fast communication between threads
• L1 + shared is 64 KB (same on-chip memory)
  • Configurable to 16 KB + 48 KB or 48 KB + 16 KB

SLIDE 59

Kernel

• GPU code to run on all threads
  • Pick your data
  • Process it
  • Save the results
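The pick/process/save pattern can be sketched as a CUDA kernel (a minimal sketch, not the slide's code; it assumes d_x and d_y are arrays already copied to device memory, and it requires nvcc to build):

```cuda
// Each thread picks one element by its global index,
// processes it, and saves the result.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // pick your data
    if (i < n)                                      // guard the tail
        y[i] = a * x[i] + y[i];                     // process & save
}

// Launch: enough blocks of 256 threads to cover all n elements,
// regardless of how many cores the device actually has.
// saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
```

The `<<<grid, block>>>` launch configuration is exactly the grid-of-blocks hierarchy described on the following slides.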

SLIDE 60

Warp

• Threads are grouped into warps
  • 32 threads running together, SIMD style

SLIDE 61

Block

• Block of threads
  • 1D, 2D or 3D to best match the data layout
  • Threads in a block can communicate with each other!

SLIDE 62

Grid

• Grid of blocks
  • Launch millions of threads
    • Regardless of how many cores are available
  • No communication between blocks
    • Different blocks can run sequentially or on different processors
SLIDE 63

Same code on different devices

SLIDE 64

Hello World!
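A classic CUDA hello-world sketch (illustrative; assumes nvcc and a Fermi-class or newer card, since device-side printf needs compute capability 2.0+):

```cuda
#include <cstdio>

// One printf per GPU thread.
__global__ void hello()
{
    printf("Hello World from block %d, thread %d\n",
           blockIdx.x, threadIdx.x);
}

int main()
{
    hello<<<2, 4>>>();        // 2 blocks x 4 threads = 8 greetings
    cudaDeviceSynchronize();  // wait for the GPU before exiting
    return 0;
}
```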

SLIDE 65

To Be Continued…