How Powerful are GPUs?
Pat Hanrahan
Computer Science Department, Stanford University
Computer Forum 2007

Modern Graphics Pipeline
Application → Command → Geometry → Rasterization → Texture → Fragment → Display
A Pitch from 5 Years Ago …
Cinematic games and media drive GPU market
Current GPUs faster than CPUs (at graphics)
Gap between the GPU and the CPU increasing
Why? Efficiently use VLSI resources
Programmable GPUs ≈ stream processors
Many applications map to stream processing
Therefore, a $50 high-performance, massively parallel computer will soon ship with every PC
Pat Hanrahan, circa 2002-2005
What Happened?
Now:
AMD and Intel gave up on sequential CPUs with high clock rates and went multi-core (2-4 cores)
Gap between GPU and CPU has stabilized
GPUs are data parallel (64-128 cores)
DX10 mandates a unified graphics pipeline
GPGPU: many algorithms implemented on GPUs
Future:
Two main types of processors
CPU: fast sequential processor
GPU: fast data parallel processor
Hybrid CPU/GPU
Overview
Current programmable GPUs
Performance
Programming model: the stream abstraction
Applications
How general?
Programmable GPUs
ATI R600 (X2X00)
80 nm process
~700 million transistors
64 4-wide unified shaders
~700 MHz clock
512-bit GDDR memory
GDDR3 @ 900 MHz = 115 GB/s
GDDR4 @ 1100 MHz = 140 GB/s
230 Watts
R300 not R600
NVIDIA G80 (8800)
90 nm TSMC process
681 million transistors, 480 mm^2
128 scalar processors
1.3 GHz clock rate
384-bit GDDR memory
GDDR3 @ 900 MHz = 86.4 GB/s
130 Watts
GeForce 8800 Series GPU
[Block diagram: host and input assembler feed vertex, geometry, and pixel thread issue (with rasterization) under a thread processor; clusters of streaming processors (SP), each with an L1 cache and texture filter (TF) units; multiple L2 cache / frame buffer (FB) partitions below.]
Shader Model 4.0 Architecture
[Diagram: Shader Model 4.0 shader resources: a program of up to 64K instructions reading input parameters (32 registers of 4 x 32-bit), 64K 32-bit constants, and textures, and writing up to 8 4 x 32-bit outputs.]
Simple Graphics Pipeline
# c[0-3]  = modelview projection (composite) matrix
# c[4-7]  = modelview inverse transpose
# c[32]   = eye-space light direction
# c[33]   = constant eye-space half-angle vector
# c[35].x = pre-multiplied diffuse light color & diffuse mat.
# c[35].y = pre-multiplied ambient light color & diffuse mat.
# c[36]   = specular color; c[38].x = specular power

DP4 o[HPOS].x, c[0], v[OPOS];     # Transform position.
DP4 o[HPOS].y, c[1], v[OPOS];
DP4 o[HPOS].z, c[2], v[OPOS];
DP4 o[HPOS].w, c[3], v[OPOS];
DP3 R0.x, c[4], v[NRML];          # Transform normal.
DP3 R0.y, c[5], v[NRML];
DP3 R0.z, c[6], v[NRML];
DP3 R1.x, c[32], R0;              # R1.x = L DOT N'
DP3 R1.y, c[33], R0;              # R1.y = H DOT N'
MOV R1.w, c[38].x;                # R1.w = specular power
LIT R2, R1;                       # Compute lighting
MAD R3, c[35].x, R2.y, c[35].y;   # diffuse + ambient
MAD o[COL0].xyz, c[36], R2.z, R3; # + specular
END
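For readers who do not read vertex-program assembly, here is a hedged CUDA C sketch of the same per-vertex computation. The structure, parameter names, and types are illustrative assumptions standing in for the c[...] constants above, not code from the talk.

#include <cuda_runtime.h>
#include <math.h>

struct VertexOut { float4 hpos; float3 color; };

__device__ float dot4(float4 a, float4 b) { return a.x*b.x + a.y*b.y + a.z*b.z + a.w*b.w; }
__device__ float dot3(float4 a, float3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

__device__ VertexOut shade_vertex(float4 opos, float3 nrml,
                                  const float4 mvp[4],    // c[0-3]
                                  const float4 mvIT[4],   // c[4-7]
                                  float3 L, float3 H,     // c[32], c[33]
                                  float3 diffuse, float3 ambient,    // c[35].x, c[35].y
                                  float3 specular, float specPower)  // c[36], c[38].x
{
    VertexOut o;
    // DP4 x4: clip-space position = rows of the composite matrix dotted with the position.
    o.hpos = make_float4(dot4(mvp[0], opos), dot4(mvp[1], opos),
                         dot4(mvp[2], opos), dot4(mvp[3], opos));
    // DP3 x3: transform the normal by the modelview inverse transpose.
    float3 n = make_float3(dot3(mvIT[0], nrml), dot3(mvIT[1], nrml), dot3(mvIT[2], nrml));
    // DP3 + LIT: clamp L.N' and H.N'; apply the specular exponent only when the light faces the surface.
    float ldotn = fmaxf(L.x*n.x + L.y*n.y + L.z*n.z, 0.0f);
    float hdotn = fmaxf(H.x*n.x + H.y*n.y + H.z*n.z, 0.0f);
    float spec  = (ldotn > 0.0f) ? powf(hdotn, specPower) : 0.0f;
    // MAD x2: color = diffuse * (L.N') + ambient, then add the specular contribution.
    o.color = make_float3(diffuse.x * ldotn + ambient.x + specular.x * spec,
                          diffuse.y * ldotn + ambient.y + specular.y * spec,
                          diffuse.z * ldotn + ambient.z + specular.z * spec);
    return o;
}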
G80 = Data Parallel Computer
[Block diagram: host, input assembler, and thread execution manager feed SIMD cores, each with its own parallel data cache; all cores load/store to a shared global memory.]
G80 “core”
Each core:
8 functional units
SIMD 16/32 "warp"
8-10 stage pipeline
Thread scheduler
128-512 threads/core
16 KB shared memory (the parallel data cache)

Total threads/chip: 16 cores * 512 threads = 8K threads
GPU Multi-threading (version 1)
Change threads each cycle (round robin)
[Diagram: four fragments (frag1-frag4) interleaved round-robin, advancing one instruction (instr1, instr2, instr3) per cycle.]
GPU Multi-threading (version 2)
Change thread after texture fetch/stall
[Diagram: each fragment (frag1-frag4) runs for multiple instructions until it stalls at a texture fetch, then the hardware switches to another fragment.]
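A hedged CUDA sketch of the programmer-visible side of this latency hiding: each thread issues a long-latency, data-dependent load, and launching many threads per core (the 128-512 figure above) gives the scheduler other warps to run while any one warp waits. The kernel and variable names are illustrative assumptions.

#include <cuda_runtime.h>

// Each thread does a data-dependent load (much like a texture fetch). When a
// warp stalls on the load, the hardware scheduler switches to another ready warp.
__global__ void gather(const float* table, const int* index, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = table[index[i]];   // long-latency memory access
}

// Illustrative launch: 256 threads per block keeps hundreds of threads resident
// per core, matching the 128-512 threads/core figure above.
// gather<<<(n + 255) / 256, 256>>>(d_table, d_index, d_out, n);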
8800GTX Peak Performance
1.3 GHz clock * 128 scalar processors * 2 flops/clock (MAD instruction)
= 332.8 GFLOPS
Instructions Issue Rate
http://graphics.stanford.edu/projects/gpubench/
[Chart: measured instruction issue rates, ATI X1900XTX vs. NVIDIA 7900GTX.]
Instructions Issue Rate
http://graphics.stanford.edu/projects/gpubench/
[Chart: measured instruction issue rates, NVIDIA 7900GTX vs. NVIDIA 8800GTX.]
Measured BLAS Performance
SAXPY (see the sketch after this table)
  X1900 (DX9):      6 GFLOPS
  X1900 (CTM):      6 GFLOPS
  8800GTX (DX9):   12 GFLOPS

SGEMV
  X1900 (DX9):      4 GFLOPS
  X1900 (CTM):      6 GFLOPS
  8800GTX (DX9):   14 GFLOPS

SGEMM
  X1900 (DX9):     30 GFLOPS
  X1900 (CTM):    120 GFLOPS
  8800GTX (DX9):  105 GFLOPS
  3 GHz Core 2:    40 GFLOPS
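For reference, a minimal CUDA sketch of the SAXPY kernel (y = a*x + y) measured above. This is a hedged illustration, not the DX9/CTM code that was actually benchmarked; SAXPY performs one multiply-add per element loaded, so it is bandwidth-bound, which is why the measured numbers sit well below peak GFLOPS.

#include <cuda_runtime.h>

// y[i] = a * x[i] + y[i]: a single MAD per element, limited by memory bandwidth.
__global__ void saxpy(int n, float a, const float* x, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// Illustrative launch: saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);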
Programming Abstractions
Approach I
Run the application using the graphics library
Graphics library-based programming models:
NVIDIA's Cg
Microsoft's HLSL
OpenGL Shading Language
RapidMind Sh [McCool et al. 2004]
Approach II
Map the application to a parallel computer
Communicating sequential processes (CSP)
Threads: pthreads, Occam, UPC, …
Message passing: MPI
Data parallel programming
APL, SETL, S, Fortran90, …
C* (*Lisp), NESL, …
Stream languages
StreamIt, StreamC/KernelC
MS Accelerator, CUDA, DPVM, PeakStream
Stream Programming Environment
Collections stored in memory
Multidimensional arrays (stencils)
Graphs and meshes (topology)

Data parallel operators (see the CUDA sketch below)
Application: map
Reductions: scan, reduce (fold)
Communication: send, sort, gather, scatter
Filter (|O| < |I|) and generate (|O| > |I|)
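A hedged CUDA sketch of two of these operators, map and reduce. Kernel names, the block size, and the element-wise function are illustrative assumptions, not code from the talk.

#include <cuda_runtime.h>

// map: apply a function independently to every element (here f(x) = x*x).
__global__ void map_square(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * in[i];
}

// reduce (fold): per-block sum in shared memory; assumes a power-of-two
// block size of at most 256 threads. The host sums the per-block partials.
__global__ void reduce_sum(const float* in, float* partial, int n)
{
    __shared__ float buf[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        partial[blockIdx.x] = buf[0];
}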
Brook
Ian Buck, PhD thesis, Stanford University
Brook for GPUs: Stream computing on graphics hardware,
I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, P. Hanrahan, SIGGRAPH 2004
Brook Example
kernel void foo (float a<>, float b<>, out float result<>)
{
    result = a + b;
}

float a<100>;
float b<100>;
float c<100>;

foo(a, b, c);

// equivalent sequential code:
for (i = 0; i < 100; i++)
    c[i] = a[i] + b[i];
Classical N-Body Simulation
Stellar dynamics
Gravitational acceleration (see the CUDA sketch below)
Gravitational acceleration + jerk
Molecular dynamics
Implicit solvent models
Lennard-Jones
Coulomb
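A hedged CUDA sketch of the all-pairs gravitational acceleration kernel, one thread per body. The float4 {x, y, z, mass} layout, the softening parameter, and the names are illustrative assumptions; a real implementation would also tile bodies through shared memory, and G is assumed folded into the masses.

#include <cuda_runtime.h>

__global__ void nbody_accel(const float4* body,  // xyz = position, w = mass
                            float3* accel, int n, float eps2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float4 bi = body[i];
    float3 a = make_float3(0.0f, 0.0f, 0.0f);
    for (int j = 0; j < n; ++j) {              // all pairs: O(N^2) interactions
        float4 bj = body[j];
        float3 r = make_float3(bj.x - bi.x, bj.y - bi.y, bj.z - bi.z);
        float d2 = r.x * r.x + r.y * r.y + r.z * r.z + eps2;  // softened |r|^2
        float inv_d = rsqrtf(d2);
        float s = bj.w * inv_d * inv_d * inv_d;               // m_j / |r|^3
        a.x += s * r.x;
        a.y += s * r.y;
        a.z += s * r.z;
    }
    accel[i] = a;
}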
Folding@Home Performance
Vijay Pande's group: GROMACS ported to Brook
GPU : CPU core performance ratio ≈ 40 : 1
CPU: 3.0 GHz P4
GPU: ATI X1900X
Current Statistics: March 19, 2007
Client type       Current TFLOPS*   Current processors
Windows                 150             157457
Mac OS X/PPC              7               8710
Mac OS X/Intel            7               2520
Linux                    34              24639
GPU                      40                682
PS/3                     26                877
Total                   223            1824132
*TFLOPs is actual flops from software cores, not peak values
Folding@Home GPU Cluster
25 nodes
Nforce4 SLI motherboard, dual-core Opteron, 2x ATI X1900XTX, Linux
5 TFLOPS of folding "power"
(the photo on the slide is not the actual machine)
Future
Summary
Cinematic games and media drive the GPU market
GPU is evolving into a high-throughput processor
"Data parallel multi-threaded machine"
Many applications map to GPUs
Processor of the future likely to be a hybrid CPU/GPU
Small number of traditional CPU cores
Large number of GPU cores
Opportunities
Current hardware not optimal
Incredible opportunity for architectural innovation
Current software environment immature
Incredible opportunity for reinventing parallel programming