Stream Programming Environments



SLIDE 1

Stream Programming Environments

Pat Hanrahan

Computer Science & Electrical Engineering Stanford University

GP^2 Workshop August 7-8, 2004

Acknowledgements

  • Bill Dally
  • Eric Darve
  • Vijay Pande
  • Bill Mark
  • John Owens
  • Kurt Akeley
  • Mark Horowitz
  • Ian Buck
  • Mattan Erez
  • Kayvon Fatahalian
  • Tim Foley
  • Daniel Horn
  • Michael Houston
  • Jeremy Sugerman

Funding: DARPA, DOE, ATI, IBM, NVIDIA, SONY

SLIDE 2

Motivation

  • Cinematic games and media drive GPU market
  • Current GPU faster than CPU (at graphics)
  • Gap between the GPU and the CPU increasing
  • Why? Data parallelism; efficient communication
  • Programmable GPUs ≈ Stream processors
  • Many applications map to stream processing
  • Therefore, a $50 high-performance parallel computer is shipping with every PC

  • Revolutionize computing

Overview

  • Technology trends
  • Stream programming abstraction
  • Brook for GPUs
  • Applications

SLIDE 3

VLSI for Programmers :-)

SLIDE 4

The Capability Gap

[Figure: log-scale plot over the years 1980 to 2020, values from 1e-4 to 1e+7. Axis labels: Perf (ps/Inst), Delay/CPUs. Annotated rates: 52%/year, 74%/year, 19%/year. Annotated gaps: 30:1, 1,000:1, 30,000:1.]

Graph courtesy of Bill Dally

Recent Performance Trends

[Figure: GFLOPS (multiplies per second) from July 2001 to January 2004. Series: NVIDIA NV30, 35, 40; ATI R300, 360, 420; Pentium 4.]

SLIDE 5

Programming Environments[1999]?

RenderMan Real-Time Shading Language (RTSL)

Rendering

RenderMan++

Programming Environments[2004]?

RenderMan Real-Time Shading Language (RTSL)

Simulation

Brook (Stream)

SLIDE 6

Stream Abstraction

Streams – Old/Hot Idea in CS

  • <stream.h>
  • OpenGL/GLS/Chromium
  • Data visualization systems (vtk, avs, dx)
  • Signal processing (signal flow graphs)
  • Functional programming
  • Streaming databases
  • Sensor nets

SLIDE 7

Minimize State!

Fragment Processor

[Diagram: fragment processor. Labeled blocks: FP Texture Processor, L1 Texture Cache, L2 Texture Cache, Branch Processor, FP32 Shader Unit 1, FP32 Shader Unit 2, Fog ALU. Labeled data: Input Fragment Data, Texture Data, Output Shaded Fragments.]

  • SIMD Architecture
  • Dual Issue / Co-Issue
  • FP32 Computation
  • Shader Model 3.0

Shader Unit 1

– 4 FP Ops / pixel
– Dual/Co-Issue
– Texture Address Calc
– Free fp16 normalize
– + mini ALU

Texture Filter

– Bi / Tri / Aniso
– 1 texture @ full speed
– 4-tap filter @ full speed
– 16:1 Aniso w/ Trilinear (128-tap)
– FP16 Texture Filtering

Shader Unit 2

– 4 FP Ops / pixel
– Dual/Co-Issue
– + mini ALU

SLIDE 8

GeForce 6800 Series 3D Pipeline

[Diagram labels: Triangle Setup, Z-Cull, Shader Instruction Dispatch, L2 Tex, Fragment Crossbar, Memory Partition (x4)]

Stream Programming Abstraction

  • Streams

– Collection of data records

  • Kernels

– Inputs/outputs are streams
– Perform computation
– Can be chained together

[Diagram: a kernel with streams as inputs and outputs]
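
A minimal sketch of the abstraction above, written in Python rather than Brook. It assumes a stream can be modeled as an iterable of records and a kernel as a function from input streams to an output stream. The kernel names `scale` and `add` are hypothetical and do not come from any stream language.

```python
from typing import Iterable, Iterator


def scale(xs: Iterable[float], factor: float) -> Iterator[float]:
    """Kernel: consumes one stream and produces one stream."""
    for x in xs:
        yield x * factor


def add(xs: Iterable[float], ys: Iterable[float]) -> Iterator[float]:
    """Kernel: consumes two streams element by element."""
    for x, y in zip(xs, ys):
        yield x + y


# Streams are collections of records; here, plain lists of floats.
a = [1.0, 2.0, 3.0]
b = [10.0, 20.0, 30.0]

# Kernels chain: the output stream of one feeds the input of the next.
result = list(add(scale(a, 2.0), b))
print(result)
```

Each output record here depends only on the matching input records, so the elements could in principle be processed in any order or in parallel.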

SLIDE 9

Why Architects like Streams?

  • Parallelism

– Data parallelism
– Pipeline (task) parallelism

  • Communication

– Producer-consumer locality
– Predictable memory access pattern
– No read-write hazards; simple coherence
– Hide latency of random memory accesses
– High arithmetic intensity

A lot like vector machines …

Arithmetic Intensity

Arithmetic Intensity = Compute-to-bandwidth ratio

Graphics pipeline

– Vertex

BW: 1 vertex = 32 bytes; OP: 100-500 f32-ops / vertex

– Fragment

BW: 1 fragment = 10 bytes; OP: 300-1000 i8-ops / fragment
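
As a worked example of the ratio above, the sketch below divides the operation counts by the record sizes quoted on the slide. The function name `arithmetic_intensity` is hypothetical, and expressing the ratio as operations per byte is an assumption about the intended units.

```python
def arithmetic_intensity(ops: float, bytes_moved: float) -> float:
    """Operations performed per byte of data transferred."""
    return ops / bytes_moved


# Vertex stage: 32 bytes per vertex, 100-500 f32 ops per vertex.
vertex_low = arithmetic_intensity(100, 32)
vertex_high = arithmetic_intensity(500, 32)

# Fragment stage: 10 bytes per fragment, 300-1000 i8 ops per fragment.
fragment_low = arithmetic_intensity(300, 10)
fragment_high = arithmetic_intensity(1000, 10)

print(f"vertex:   {vertex_low:.1f} to {vertex_high:.1f} ops/byte")
print(f"fragment: {fragment_low:.1f} to {fragment_high:.1f} ops/byte")
```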

SLIDE 10

Microbenchmarks

                    Seq BW   Cache BW   GFLOPS
ATI X800 XT PE*      15.6      28.4      63.7
ATI 9800 XT           7.3      12.2      26.1
NV 6800 Ultra         8.4      20.6      53.4
NV 5900 Ultra         4.1      11.4      40.0

Bandwidth measured in GB/sec.
* ATI X800 XT PE is a prerelease board: 500 MHz core / 500 MHz clock

GPUBench: Evaluating GPU performance for numerical and scientific applications, K. Fatahalian, I. Buck, M. Houston, P. Hanrahan, GP^2 2004

Measured Arithmetic Intensity CPU vs GPU

  • Intel 3 GHz Pentium 4

– 12 GFLOPS peak performance (via SSE2)
– 6 GB/sec peak memory bandwidth
– 44 GB/sec peak bw from 8K L1 data cache

  • NVIDIA GeForce 6800

– 45 GFLOPS peak performance
– 36 GB/sec peak memory bandwidth
– 21 GB/sec peak bw from ?K L1 data cache
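
To put the peak figures above on one scale, the following sketch computes how many floating-point operations each processor could perform per byte delivered at peak rates. The helper name `flops_per_byte` is hypothetical, and using the quoted peak numbers (rather than measured ones) is an assumption made for illustration.

```python
def flops_per_byte(gflops: float, gb_per_sec: float) -> float:
    """Peak floating-point operations available per byte of bandwidth."""
    return gflops / gb_per_sec


# Peak figures as quoted on the slide.
p4_memory = flops_per_byte(12, 6)      # Pentium 4, main memory
p4_cache = flops_per_byte(12, 44)      # Pentium 4, L1 data cache
gpu_memory = flops_per_byte(45, 36)    # GeForce 6800, memory
gpu_cache = flops_per_byte(45, 21)     # GeForce 6800, L1 cache

for name, value in [("P4 memory", p4_memory), ("P4 L1", p4_cache),
                    ("6800 memory", gpu_memory), ("6800 L1", gpu_cache)]:
    print(f"{name:12s} {value:.2f} flops/byte")
```

A kernel that performs fewer operations per byte than these ratios cannot keep the arithmetic units busy at peak rates and is limited by bandwidth instead.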

SLIDE 11

Approach I

Map application to graphics primitives

  • Graphics library-based programming models

– Cg/HLSL
– OpenGL Shading Language
– Sh [McCool et al. 2004]

Approach II

Map application to parallel computer

  • Stream languages

– AWK, Ptolemy, …
– StreaMIT, StreamC/KernelC, …

  • Data parallel programming

– APL, SETL, S, Fortran90, …
– C* (lisp*), NESL, …

  • Communicating sequential processes (CSP)

– Threads: Occam, UPC
– Message passing: MPI

SLIDE 12

Stream Programming Environment

  • Collections stored in memory

– Multidimensional arrays (stencils)
– Graphs and meshes (topology)

  • Data parallel operators

– Application: map
– Reductions: scan, reduce (fold)
– Communication: send, sort, gather, scatter
– Filter (|O|<|I|) and generate (|O|>|I|)
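
Several of the operators listed above can be sketched with Python built-ins. This is only an illustration on ordinary lists, not a stream implementation; `gather` and `scatter` are hypothetical helper names, and `send` and `sort` are left out.

```python
from functools import reduce
from itertools import accumulate
import operator

data = [3, 1, 4, 1, 5]

# Application: map applies a function to every element.
squares = list(map(lambda x: x * x, data))

# Reductions: reduce (fold) collapses to one value; scan keeps the prefixes.
total = reduce(operator.add, data)
prefix_sums = list(accumulate(data, operator.add))


def gather(src, indices):
    """Indexed read: collect src[i] for each index i."""
    return [src[i] for i in indices]


def scatter(values, indices, size, fill=0):
    """Indexed write: place each value at its target index."""
    out = [fill] * size
    for v, i in zip(values, indices):
        out[i] = v
    return out


gathered = gather(data, [4, 0, 2])
scattered = scatter([7, 8, 9], [2, 0, 4], size=5)

# Filter produces fewer outputs than inputs (|O| < |I|).
evens = [x for x in data if x % 2 == 0]

# Generate produces more outputs than inputs (|O| > |I|).
repeated = [x for x in data for _ in range(2)]
```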

Brook

Ian Buck, Ph.D. Thesis, Stanford

Brook for GPUs: Stream computing on graphics hardware, I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, P. Hanrahan, SIGGRAPH 2004

SLIDE 13

Brook Language

    kernel void foo (float a<>, float b<>,
                     out float result<>)
    {
        result = a + b;
    }

    float a<100>;
    float b<100>;
    float c<100>;

    foo(a, b, c);

    // The kernel call above behaves like this loop:
    for (i = 0; i < 100; i++)
        c[i] = a[i] + b[i];

Goals

  • Develop version of PCA Brook for GPUs

– Programmer need not know GL

  • Versions

– New ATI (420) and NVIDIA (NV40) hardware
– Linux and Windows
– DX and OpenGL

  • Release as open source [V1.0 Dec 2003]

– http://brook.sourceforge.net
– http://sourceforge.net/projects/brook
– over 6,300 downloads in 8 months

SLIDE 14

Brook Performance

First Generation GPUs

ATI Radeon 9800 XT, NVIDIA GeForceFX

Floating-point precisions differ:

– ATI: 24-bit
– NV: ~IEEE 32-bit
– Intel: IEEE 32-bit

Compared against:

  • Intel Math Library
  • Atlas Math Library
  • Cached blocked segmentation
  • FFTW
  • SSE-opt Ray Triangle code

Brook Performance

Second Generation GPUs

ATI Radeon X800 XT, NVIDIA GeForce 6800

Floating-point precisions differ:

– ATI: 24-bit
– NV: ~IEEE 32-bit
– Intel: IEEE 32-bit

Compared against:

  • Intel Math Library
  • Atlas Math Library
  • Cached blocked segmentation
  • FFTW
  • SSE-opt Ray Triangle code

SLIDE 15

Dense Matrix-Matrix Multiplication

Understanding the Efficiency of GPU Algorithms for Matrix-Matrix Multiplication, K. Fatahalian, J. Sugerman, P. Hanrahan, Graphics Hardware 2004

                 BW (GB/sec)   GFLOPS   Time (s)   Compute Efficiency   BW Efficiency
P4 ATLAS            27.68        7.78     0.289          64.8%              61.9%
ATI X800 XT*        27.50       11.40     0.188          17.9%              96.8%
ATI 9800 XT         12.06        4.83     0.445          18.5%              98.9%
NV 6800 Ultra       18.78        9.25     0.232          17.3%              90.9%
NV 5900 Ultra        9.07        3.01     0.713           7.5%              79.6%

* ATI X800 XT PE is a prerelease board: 500 MHz core / 500 MHz clock

Dense Matrix-Matrix Multiplication

  • Matrix-matrix multiplication is bandwidth-limited on the GPU.

– Memory blocking to increase cache utilization does not help
– Architectural problem, not programming model problem
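
The compute-efficiency percentages for matrix multiplication can be reproduced as achieved GFLOPS divided by peak GFLOPS. The sketch below assumes the peak values are the GFLOPS figures from the microbenchmarks slide (and the 12 GFLOPS SSE2 peak for the Pentium 4); that pairing is inferred from the numbers, not stated in the slides. The function name `compute_efficiency` is hypothetical.

```python
def compute_efficiency(achieved_gflops: float, peak_gflops: float) -> float:
    """Achieved throughput as a percentage of peak, rounded to 0.1%."""
    return round(100.0 * achieved_gflops / peak_gflops, 1)


# (achieved GFLOPS on matrix multiply, assumed peak GFLOPS)
results = {
    "P4 ATLAS": (7.78, 12.0),
    "ATI X800 XT": (11.40, 63.7),
    "ATI 9800 XT": (4.83, 26.1),
    "NV 6800 Ultra": (9.25, 53.4),
    "NV 5900 Ultra": (3.01, 40.0),
}

efficiency = {name: compute_efficiency(a, p) for name, (a, p) in results.items()}
for name, pct in efficiency.items():
    print(f"{name:14s} {pct:5.1f}%")
```

Under these assumptions the GPUs reach under a fifth of their peak arithmetic rate while the CPU reaches roughly two thirds, which is consistent with the bandwidth-limited conclusion above.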

SLIDE 16

Beyond Graphics and Imaging …

Molecular Dynamics, Folding@Home, Fluid Flow

Accelerating molecular dynamics with GPUs, I. Buck, V. Rangasayee, E. Darve, V. Pande, P. Hanrahan, GP^2 2004

Applications

  • Media: audio, images (vision), video, …
  • Simulation

– Monte Carlo

      • Ray tracing

– Ordinary differential equations

      • N-body problems: molecular dynamics, astrophysics
      • Particle systems and rigid body dynamics

– Partial differential equations

      • Explicit: elastic deformations
      • Implicit: cloth, fluid flow

  • Machine learning and computational statistics?

SLIDE 17

16 Node GPU Cluster

  • Compute

– 32 2.4GHz P4 Xeons
– 16GB DDR
– 1.2TB disk
– Intel E7505 chipset

  • Network

– Infiniband 4X interconnect
– GigE

  • Graphics

– ATI Radeon 9800 Pro 256MB

Parallel computation on a cluster of GPUs, M. Houston, K. Fatahalian, J. Sugerman, I. Buck, P. Hanrahan, GP^2 2004

[Diagram: Merrimac system hierarchy]

– Node: Stream Processor, 128 FPUs, 128 GFLOPS; 16 x DRDRAM, 2 GBytes, 16 GBytes/s; 16 GBytes/s, 32+32 pairs to the On-Board Network
– Board: 16 Nodes, 1K FPUs, 2 TFLOPS, 32 GBytes; 64 GBytes/s, 128+128 pairs, 6" Teradyne GbX to the Intra-Cabinet Network
– Backplane: 32 Boards, 512 Nodes, 64K FPUs, 64 TFLOPS, 1 TByte; E/O and O/E, 1 TBytes/s, 2K+2K links, Ribbon Fiber to the Inter-Cabinet Network
– Bisection: 32 TBytes/s
– All links 5 Gb/s per pair or fiber; all bandwidths are full duplex

Merrimac: Supercomputing with streams, M. Erez, J. Ahn, N. Jayasena, T. Knight, A. Das, F. Labonte, J. Gummaraju, W. Dally, P. Hanrahan, M. Rosenblum, GP^2 2004

SLIDE 18

Wrap-Up

Vision

  • Cinematic games and media drive GPU market
  • Current GPU faster than CPU (at graphics)
  • Gap between the GPU and the CPU increasing
  • Why? Data parallelism; efficient communication
  • Programmable GPUs = Stream processors
  • Many applications map to stream processing
  • Therefore, a $50 high-performance parallel computer is shipping with every PC

  • Revolutionize computing

SLIDE 19

Opportunities

  • Current hardware not optimal

– Incredible opportunity for architectural innovation

  • Current software environment immature

– Incredible opportunity for reinventing parallel computing software

Questions?

Fly-fishing fly images from The English Fly Fishing Shop