How Powerful are GPUs?
Pat Hanrahan
Computer Science Department, Stanford University
Computer Forum 2007

Modern Graphics Pipeline
Application → Command → Geometry → Rasterization → Texture → Fragment → Display
A Pitch from 5 Years Ago …
Cinematic games and media drive GPU market
Current GPUs faster than CPUs (at graphics)
Gap between the GPU and the CPU increasing
Why? Efficiently use VLSI resources
Programmable GPUs ≈ stream processors
Many applications map to stream processing
Therefore, a $50 high-performance, massively parallel computer will soon ship with every PC
Pat Hanrahan, circa 2002-2005
What Happened?
Now:
AMD and Intel gave up on sequential CPUs with high clock rates and went multi-core (2-4 cores)
Gap between GPU and CPU has stabilized
GPUs are data parallel (64-128 cores)
DX10 mandates a unified graphics pipeline
GPGPU: many algorithms implemented on GPUs
Future:
Two main types of processors
CPU: fast sequential processor
GPU: fast data parallel processor
Hybrid CPU/GPU
Overview
Current programmable GPUs
Performance
Programming model: the stream abstraction
Applications
How general?
Programmable GPUs
ATI R600 (X2X00)
80 nm process
~700 million transistors
64 4-wide unified shaders
~700 MHz clock
512-bit GDDR memory
GDDR3 @ 900 MHz = 115 GB/s
GDDR4 @ 1100 MHz = 140 GB/s
230 Watts
R300 not R600
NVIDIA G80 (8800)
90 nm TSMC process
681 million transistors, 480 mm^2
128 scalar processors
1.3 GHz clock rate
384-bit GDDR memory
GDDR3 @ 900 MHz = 86.4 GB/s
130 Watts
GeForce 8800 Series GPU
[Block diagram: host and input assembler feed vertex, geometry, and pixel thread issue (with rasterization) under a thread processor; clusters of streaming processors (SP), each with an L1 cache and texture filter (TF) units; multiple L2 cache / frame buffer (FB) partitions below.]
Shader Model 4.0 Architecture
[Diagram: Shader Model 4.0 shader resources: a program of up to 64K instructions reading input parameters (32 registers of 4 x 32-bit), 64K 32-bit constants, and textures, and writing up to 8 4 x 32-bit outputs.]
Simple Graphics Pipeline
# c[0-3]  = modelview projection (composite) matrix
# c[4-7]  = modelview inverse transpose
# c[32]   = eye-space light direction
# c[33]   = constant eye-space half-angle vector
# c[35].x = pre-multiplied diffuse light color & diffuse mat.
# c[35].y = pre-multiplied ambient light color & diffuse mat.
# c[36]   = specular color; c[38].x = specular power

DP4 o[HPOS].x, c[0], v[OPOS];     # Transform position.
DP4 o[HPOS].y, c[1], v[OPOS];
DP4 o[HPOS].z, c[2], v[OPOS];
DP4 o[HPOS].w, c[3], v[OPOS];
DP3 R0.x, c[4], v[NRML];          # Transform normal.
DP3 R0.y, c[5], v[NRML];
DP3 R0.z, c[6], v[NRML];
DP3 R1.x, c[32], R0;              # R1.x = L DOT N'
DP3 R1.y, c[33], R0;              # R1.y = H DOT N'
MOV R1.w, c[38].x;                # R1.w = specular power
LIT R2, R1;                       # Compute lighting
MAD R3, c[35].x, R2.y, c[35].y;   # diffuse + ambient
MAD o[COL0].xyz, c[36], R2.z, R3; # + specular
END
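For readers who do not read vertex-program assembly, here is a hedged CUDA C sketch of the same per-vertex computation. The structure, parameter names, and types are illustrative assumptions standing in for the c[...] constants above, not code from the talk.

#include <cuda_runtime.h>
#include <math.h>

struct VertexOut { float4 hpos; float3 color; };

__device__ float dot4(float4 a, float4 b) { return a.x*b.x + a.y*b.y + a.z*b.z + a.w*b.w; }
__device__ float dot3(float4 a, float3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

__device__ VertexOut shade_vertex(float4 opos, float3 nrml,
                                  const float4 mvp[4],    // c[0-3]
                                  const float4 mvIT[4],   // c[4-7]
                                  float3 L, float3 H,     // c[32], c[33]
                                  float3 diffuse, float3 ambient,    // c[35].x, c[35].y
                                  float3 specular, float specPower)  // c[36], c[38].x
{
    VertexOut o;
    // DP4 x4: clip-space position = rows of the composite matrix dotted with the position.
    o.hpos = make_float4(dot4(mvp[0], opos), dot4(mvp[1], opos),
                         dot4(mvp[2], opos), dot4(mvp[3], opos));
    // DP3 x3: transform the normal by the modelview inverse transpose.
    float3 n = make_float3(dot3(mvIT[0], nrml), dot3(mvIT[1], nrml), dot3(mvIT[2], nrml));
    // DP3 + LIT: clamp L.N' and H.N'; apply the specular exponent only when the light faces the surface.
    float ldotn = fmaxf(L.x*n.x + L.y*n.y + L.z*n.z, 0.0f);
    float hdotn = fmaxf(H.x*n.x + H.y*n.y + H.z*n.z, 0.0f);
    float spec  = (ldotn > 0.0f) ? powf(hdotn, specPower) : 0.0f;
    // MAD x2: color = diffuse * (L.N') + ambient, then add the specular contribution.
    o.color = make_float3(diffuse.x * ldotn + ambient.x + specular.x * spec,
                          diffuse.y * ldotn + ambient.y + specular.y * spec,
                          diffuse.z * ldotn + ambient.z + specular.z * spec);
    return o;
}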
G80 = Data Parallel Computer
[Block diagram: host, input assembler, and thread execution manager feed SIMD cores, each with its own parallel data cache; all cores load/store to a shared global memory.]
G80 “core”
Each core:
8 functional units
SIMD 16/32 "warp"
8-10 stage pipeline
Thread scheduler
128-512 threads/core
16 KB shared memory (the parallel data cache)

Total threads/chip: 16 cores * 512 threads = 8K threads
GPU Multi-threading (version 1)
Change threads each cycle (round robin)
[Diagram: four fragments (frag1-frag4) interleaved round-robin, advancing one instruction (instr1, instr2, instr3) per cycle.]
GPU Multi-threading (version 2)
Change thread after texture fetch/stall
[Diagram: each fragment (frag1-frag4) runs for multiple instructions until it stalls at a texture fetch, then the hardware switches to another fragment.]
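A hedged CUDA sketch of the programmer-visible side of this latency hiding: each thread issues a long-latency, data-dependent load, and launching many threads per core (the 128-512 figure above) gives the scheduler other warps to run while any one warp waits. The kernel and variable names are illustrative assumptions.

#include <cuda_runtime.h>

// Each thread does a data-dependent load (much like a texture fetch). When a
// warp stalls on the load, the hardware scheduler switches to another ready warp.
__global__ void gather(const float* table, const int* index, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = table[index[i]];   // long-latency memory access
}

// Illustrative launch: 256 threads per block keeps hundreds of threads resident
// per core, matching the 128-512 threads/core figure above.
// gather<<<(n + 255) / 256, 256>>>(d_table, d_index, d_out, n);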
8800GTX Peak Performance
1.3 GHz clock * 128 scalar processors * 2 flops/clock (MAD instruction)
= 332.8 GFLOPS
Instructions Issue Rate
http://graphics.stanford.edu/projects/gpubench/
[Chart: measured instruction issue rates, ATI X1900XTX vs. NVIDIA 7900GTX.]
Instructions Issue Rate
http://graphics.stanford.edu/projects/gpubench/
[Chart: measured instruction issue rates, NVIDIA 7900GTX vs. NVIDIA 8800GTX.]
Measured BLAS Performance
SAXPY (see the sketch after this table)
  X1900 (DX9):      6 GFLOPS
  X1900 (CTM):      6 GFLOPS
  8800GTX (DX9):   12 GFLOPS

SGEMV
  X1900 (DX9):      4 GFLOPS
  X1900 (CTM):      6 GFLOPS
  8800GTX (DX9):   14 GFLOPS

SGEMM
  X1900 (DX9):     30 GFLOPS
  X1900 (CTM):    120 GFLOPS
  8800GTX (DX9):  105 GFLOPS
  3 GHz Core 2:    40 GFLOPS
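For reference, a minimal CUDA sketch of the SAXPY kernel (y = a*x + y) measured above. This is a hedged illustration, not the DX9/CTM code that was actually benchmarked; SAXPY performs one multiply-add per element loaded, so it is bandwidth-bound, which is why the measured numbers sit well below peak GFLOPS.

#include <cuda_runtime.h>

// y[i] = a * x[i] + y[i]: a single MAD per element, limited by memory bandwidth.
__global__ void saxpy(int n, float a, const float* x, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// Illustrative launch: saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);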
Programming Abstractions
Approach I
Run the application using the graphics library
Graphics library-based programming models:
NVIDIA's Cg
Microsoft's HLSL
OpenGL Shading Language
RapidMind Sh [McCool et al. 2004]
Approach II
Map the application to a parallel computer
Communicating sequential processes (CSP)
Threads: pthreads, Occam, UPC, …
Message passing: MPI
Data parallel programming
APL, SETL, S, Fortran90, …
C* (*Lisp), NESL, …
Stream languages
StreamIt, StreamC/KernelC
MS Accelerator, CUDA, DPVM, PeakStream
Stream Programming Environment
Collections stored in memory
Multidimensional arrays (stencils)
Graphs and meshes (topology)

Data parallel operators (see the CUDA sketch below)
Application: map
Reductions: scan, reduce (fold)
Communication: send, sort, gather, scatter
Filter (|O| < |I|) and generate (|O| > |I|)
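A hedged CUDA sketch of two of these operators, map and reduce. Kernel names, the block size, and the element-wise function are illustrative assumptions, not code from the talk.

#include <cuda_runtime.h>

// map: apply a function independently to every element (here f(x) = x*x).
__global__ void map_square(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * in[i];
}

// reduce (fold): per-block sum in shared memory; assumes a power-of-two
// block size of at most 256 threads. The host sums the per-block partials.
__global__ void reduce_sum(const float* in, float* partial, int n)
{
    __shared__ float buf[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        partial[blockIdx.x] = buf[0];
}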
Brook
Ian Buck, PhD thesis, Stanford University
Brook for GPUs: Stream computing on graphics hardware,
I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, P. Hanrahan, SIGGRAPH 2004
Brook Example
kernel void foo (float a<>, float b<>, out float result<>)
{
    result = a + b;
}

float a<100>;
float b<100>;
float c<100>;

foo(a, b, c);

// equivalent sequential code:
for (i = 0; i < 100; i++)
    c[i] = a[i] + b[i];
Classical N-Body Simulation
Stellar dynamics
Gravitational acceleration (see the CUDA sketch below)
Gravitational acceleration + jerk
Molecular dynamics
Implicit solvent models
Lennard-Jones
Coulomb
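A hedged CUDA sketch of the all-pairs gravitational acceleration kernel, one thread per body. The float4 {x, y, z, mass} layout, the softening parameter, and the names are illustrative assumptions; a real implementation would also tile bodies through shared memory, and G is assumed folded into the masses.

#include <cuda_runtime.h>

__global__ void nbody_accel(const float4* body,  // xyz = position, w = mass
                            float3* accel, int n, float eps2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float4 bi = body[i];
    float3 a = make_float3(0.0f, 0.0f, 0.0f);
    for (int j = 0; j < n; ++j) {              // all pairs: O(N^2) interactions
        float4 bj = body[j];
        float3 r = make_float3(bj.x - bi.x, bj.y - bi.y, bj.z - bi.z);
        float d2 = r.x * r.x + r.y * r.y + r.z * r.z + eps2;  // softened |r|^2
        float inv_d = rsqrtf(d2);
        float s = bj.w * inv_d * inv_d * inv_d;               // m_j / |r|^3
        a.x += s * r.x;
        a.y += s * r.y;
        a.z += s * r.z;
    }
    accel[i] = a;
}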
Folding@Home Performance
Vijay Pande's group: GROMACS ported to Brook
GPU : CPU core performance ratio ≈ 40 : 1
CPU: 3.0 GHz P4
GPU: ATI X1900X
Current Statistics: March 19, 2007
Client type       Current TFLOPS*   Current processors
Windows                 150             157457
Mac OS X/PPC              7               8710
Mac OS X/Intel            7               2520
Linux                    34              24639
GPU                      40                682
PS/3                     26                877
Total                   223            1824132
*TFLOPs is actual flops from software cores, not peak values
Folding@Home GPU Cluster
25 nodes
Nforce4 SLI motherboard, dual-core Opteron, 2x ATI X1900XTX, Linux
5 TFLOPS of folding "power"
(the photo on the slide is not the actual machine)
Future
Summary
Cinematic games and media drive the GPU market
GPU is evolving into a high-throughput processor
"Data parallel multi-threaded machine"
Many applications map to GPUs
Processor of the future likely to be a hybrid CPU/GPU
Small number of traditional CPU cores
Large number of GPU cores
Opportunities
Current hardware not optimal
Incredible opportunity for architectural innovation
Current software environment immature
Incredible opportunity for reinventing parallel programming