Computer Architecture for the Next Millenium
November 1, 1999
William J. Dally Computer Systems Laboratory Stanford University billd@csl.stanford.edu
– applications (media)
– technology (wire-limited)
– techniques (explicit parallelism)
– distributed register files
– 20GFLOPS on a 0.5cm² chip
– it's not a mature field yet
– Torus Routing Chip, Network Design Frame, Reliable Router
– Basis for Intel, Cray/SGI, Mercury, Avici network chips
– Fast messaging, scalable processing nodes, scalable memory architecture
[Figure: MDP chip, J-Machine, Cray T3D, MAP chip]
– Simultaneous bidirectional signaling, 1989
– High-speed signaling
– Low-voltage on-chip signaling
– Low-skew clock distribution
– Mesochronous, Plesiochronous
– Self-Timed Design
[Figure: 4Gb/s CMOS I/O waveform, 250ps/division]
[Figure: the computer architect, placed between technology and applications. Labels: Interfaces, Machine Organization, Measurement & Evaluation, ISA, API, Link, I/O Chan, Regs, IR]
– video, graphics, audio, DSL modems, cellular base stations
– power and delay dominated by communication, not arithmetic
– global structures: register files and instruction issue don’t scale
– to the point of diminishing returns on squeezing performance from sequential code
– explicit parallelism (data parallelism and thread-level parallelism) required to continue scaling performance
– read each pixel once
– often non-unit stride
– but there is producer-consumer locality
– 100s of arithmetic operations per memory reference
Minimum-width wire in a 0.35µm process
[Figure: chip scaling, 1999 versus 2008 (P marks a processor)]
1999: 0.18µm; 256Mb DRAM; 16 64b FP Proc; 500MHz; 18mm; 30,000 tracks; 1 clock; repeaters every 3mm
2008: 0.07µm; 4Gb DRAM; 256 64b FP Proc; 2.5GHz; 25mm; 120,000 tracks; 16 clocks; repeaters every 0.5mm
[Figure: the ‘feeding’ structure dwarfs the ALU. Labels: Data Bandwidth, Instruction Bandwidth, Regs, Instr. Cache, IR, IP]
– Media problems have lots of parallelism and locality
– VLSI technology enables 100s of ALUs per chip (1000s soon)
– Locality - global structures won’t work
– Explicit parallelism - ILP won’t keep 100 ALUs busy
– Memory - streaming applications don’t cache well
– Short-term storage for intermediate results
– Communication between multiple function units
– Need more registers to hold more results (grows with N)
– Need more ports to connect all of the units (grows with N²)
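The scaling argument above can be made concrete with a rough sketch. The model below is an assumption, not taken from the slides: each port is taken to add one wire track to the register cell in each dimension, so per-cell area grows roughly with the square of the port count, while the number of registers grows linearly with N. All constants and function names are hypothetical and chosen only for illustration.

```python
# Hypothetical first-order area model; every constant below is an assumption.
CELL_W = 10          # base cell width in wire tracks (assumed)
CELL_H = 10          # base cell height in wire tracks (assumed)
PORTS_PER_UNIT = 3   # two read ports and one write port per arithmetic unit (assumed)
REGS_PER_UNIT = 16   # registers needed per arithmetic unit (assumed)
BITS = 32            # bits per register (assumed)


def central_area(n_units: int) -> int:
    """Area, in wire-track squares, of one register file shared by n_units."""
    ports = PORTS_PER_UNIT * n_units
    registers = REGS_PER_UNIT * n_units
    # Each port is assumed to widen and heighten every cell by one track.
    cell = (CELL_W + ports) * (CELL_H + ports)
    return registers * BITS * cell


def clustered_area(n_units: int, clusters: int) -> int:
    """Total area when the units are split into equal clusters, one file each."""
    if n_units % clusters != 0:
        raise ValueError("n_units must be divisible by clusters")
    return clusters * central_area(n_units // clusters)


if __name__ == "__main__":
    for n in (1, 4, 16, 64, 256):
        print(n, central_area(n), clustered_area(n, min(n, 8)))
```

Under these assumptions, doubling N in the central organization approaches an eightfold area increase for large N, while splitting the units into clusters keeps each file small. The sketch ignores the cost of communication between clusters, which a real comparison would have to include.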
[Figure: register file cell layout on a 1-wire grid, showing bit lines, word lines, Vdd, and Gnd, with dimension labels w, h, and p]
[Figure: four register file organizations, panels (A) through (D), serving either N arithmetic units directly or C SIMD clusters of N/C arithmetic units each]
[Figure: log-log plot against number of arithmetic units (1 to 1000) for the Central, SIMD, DRF, and SIMD/DRF organizations; vertical axis 0.1 to 1000]
[Figure: log-log plot against number of arithmetic units (1 to 1000) for the Central, DRF, SIMD/DRF, and SIMD organizations; vertical axis 0.1 to 1000]
[Figure: two bar charts, (A) Raw Performance and (B) Performance with Latency, comparing the Central, SIMD, SIMD/DRF, Hierarchical, and Stream organizations on a scale of 0.00 to 1.20]
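[Figure: a data transfer from the function unit executing Op 1 to the function unit executing Op 2 through a register file (RF), showing the write stub and the read stub]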
[Figure: three scheduling steps, (a), (b), and (c), for Instruction 5 and Instruction 6 on the +, *, and L/S units, using values t1, t2, k, and x; step (c) inserts a Pass operation]
[Figure: Imagine Stream Processor block diagram: Host Processor and Host Interface, Network and Network Interface, Stream Register File, Microcontroller, ALU Clusters 0 through 7, and a Streaming Memory System with four SDRAMs]
[Figure: Imagine Stream Processor bandwidth hierarchy, spanning four SDRAMs, the Stream Register File, and the ALU Clusters; bandwidth labels 3.2GB/s, 64GB/s, and 544GB/s]
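A short calculation shows how steep this hierarchy is. The sketch assumes the three figures attach to the levels in the order SDRAM, stream register file, ALU clusters; that assignment is inferred from the label order and is not stated on the slide.

```python
# Bandwidth figures from the slide; their assignment to levels is an assumption.
BANDWIDTH_GB_PER_S = [
    ("SDRAM", 3.2),
    ("stream register file", 64.0),
    ("ALU clusters", 544.0),
]


def relative_bandwidth(levels):
    """Return each level's bandwidth as a multiple of the first level's."""
    base = levels[0][1]
    return [(name, bandwidth / base) for name, bandwidth in levels]


if __name__ == "__main__":
    for name, ratio in relative_bandwidth(BANDWIDTH_GB_PER_S):
        print(f"{name:22s} {ratio:6.1f}x")
```

Under that assumption the ratios come out to roughly 1 : 20 : 170, which is why a stream program tries to keep most of its data traffic inside the clusters.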
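[Figure: ALU cluster with three adders (+), two multipliers (*), one divider (/), and a communication unit (CU) attached to the Intercluster Network; a Cross Point and Local Register Files connect the units, with paths From Stream Buffers and To Stream Buffers]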
– operands are streams
– also Send and Receive for multiple-Imagine systems
– read elements from input streams
– perform a local computation
– append elements to output streams
– repeat until input stream is consumed
– (e.g., triangle transform)
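The kernel pattern in the bullets above can be sketched in a few lines. This is an illustration in Python, not Imagine kernel code; the names `transform_kernel` and `apply_affine`, and the choice of a 3x4 affine matrix, are assumptions made for the example.

```python
from typing import Iterable, Iterator, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Triangle = Tuple[Vec3, Vec3, Vec3]


def apply_affine(matrix: Sequence[Sequence[float]], v: Vec3) -> Vec3:
    """Apply a 3x4 affine matrix (rotation/scale plus translation) to a vertex."""
    x, y, z = v
    return tuple(row[0] * x + row[1] * y + row[2] * z + row[3] for row in matrix)


def transform_kernel(triangles: Iterable[Triangle],
                     matrix: Sequence[Sequence[float]]) -> Iterator[Triangle]:
    """Read each triangle from the input stream, transform it, append it to the output."""
    for tri in triangles:  # repeat until the input stream is consumed
        # The local computation touches only this element and the matrix.
        yield tuple(apply_affine(matrix, vertex) for vertex in tri)


if __name__ == "__main__":
    translate = [[1, 0, 0, 1], [0, 1, 0, 2], [0, 0, 1, 3]]
    stream = [((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))]
    print(list(transform_kernel(stream, translate)))
```

Because each output element depends on a single input element, the loop body carries no state between iterations, which is what lets several clusters run the same kernel on different elements at once.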
[Figure: polygon rendering mapped onto a stream processor, in three columns: Memory, Stream Register File, and Arithmetic Clusters, with memory bandwidth and register bandwidth marked. Kernels: Transform, Project/Cull, Shade, Span Setup, Process Span, Sort, Compact, Z-Composite. Streams: Triangle Records, Projected Triangle Records, Shaded Triangle Records, Span Records, Fragment Records, Pixel Depth & Color. Memory contents: Input Data, Image Buffer, Image Depth & Color, Indices]
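[Figure: three ways to handle a data-dependent branch on x>0 that follows block A and selects between blocks B, C (Y) and J, K (N). Branching: the data-dependent branch itself. Speculation: a wrong guess ("Whoops") costs a speculative loss of D x W, ~100s. Predication: y=(x>0), with both paths issued under "if y" and "if ~y", giving an exponentially decreasing duty factor]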
– Branching control flow - dead issue slots on mispredicted branches
– Predication (SIMD select, masked vectors) - a large fraction of execution ‘opportunities’ goes idle
– append an element to an output stream depending on a case variable.
[Figure: a Value Stream and a Case Stream of {0,1} flags combine to produce an Output Stream]
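A minimal sketch of the conditional output described above, written in Python for illustration rather than in Imagine's own kernel language. The function names are hypothetical; the case stream here is derived from the x>0 test used in the branch example.

```python
from typing import Iterable, List, Tuple


def conditional_output(values: Iterable[int], cases: Iterable[int]) -> List[int]:
    """Append a value to the output stream only where its case flag is 1."""
    return [value for value, case in zip(values, cases) if case]


def split_by_case(values: Iterable[int],
                  cases: Iterable[int]) -> Tuple[List[int], List[int]]:
    """Sort values into two homogeneous streams, one per case."""
    taken: List[int] = []
    not_taken: List[int] = []
    for value, case in zip(values, cases):
        (taken if case else not_taken).append(value)
    return taken, not_taken


if __name__ == "__main__":
    xs = [3, -1, 4, -5]
    flags = [1 if x > 0 else 0 for x in xs]  # case stream: y = (x > 0)
    print(conditional_output(xs, flags))
    print(split_by_case(xs, flags))
```

Each resulting stream can then be handed to a kernel that runs the same code on every element, so no issue slot is spent on a path that is not taken.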
[Figure: bar chart of GOPS (axis 0 to 30) for the Transform, FFT, FIR, Blockwarp, Sort, Merge, DCT, and FIR kernels. Legend: Communication, 16-bit Arithmetic, 32-bit Arithmetic, FP Arithmetic. Bar labels as extracted: 12.8, 8.0, 8.9, 8.1, 7.0, 4.0, 24.4, 14.6, 3.2, 1.8, 3.6, 2.8, 0.8, 2.6]
[Figure: bar chart of GOPS/W (axis 0 to 6) comparing Imagine, AD 21160, TI C67x, and StrongARM SA-1100. Workload labels: 1024-point Floating-Point FFT, Communications, Dhrystone. Bar labels as extracted: 3.7, 1.5, 0.28, 0.22, 1.2]
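[Figure: power in W (axis 0 to 3.5) for the FIR, FFT, BWARP, XFORM, MERGE, and DCT kernels, broken down into Other, Mem Sys, SRF, Clusters, and Clock. Percentage labels as extracted: 34%, 45%, 9%, 9%, 3%]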
[Figure: two execution timelines with rows for the Clusters, Mem_0, and Mem_1. The first (time axis 21400 to 23600) repeats STORE, UNPACK, LOAD, CONV 7x7, CONV 3x3. The second (time axis 501400 to 503300) interleaves BlockSAD with Load and Store]
First timeline:
– Load original packed row
– Unpack (8-bit -> 16-bit)
– 7x7 convolve
– 3x3 convolve
– Store convolved row
Second timeline:
– Load convolved rows
– Calculate BlockSADs at different disparities
– Store best disparity values
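The BlockSAD step can be illustrated with a simplified sketch. It works on a single pair of one-dimensional rows rather than two-dimensional blocks of convolved rows, and it assumes the two rows come from a left and a right image; the function names, the block size, and the disparity range are hypothetical.

```python
from typing import List, Sequence


def block_sad(left: Sequence[int], right: Sequence[int],
              start: int, size: int, disparity: int) -> int:
    """Sum of absolute differences between a left block and a shifted right block."""
    return sum(abs(left[start + i] - right[start + i - disparity])
               for i in range(size))


def best_disparities(left: Sequence[int], right: Sequence[int],
                     size: int, max_disparity: int) -> List[int]:
    """For each non-overlapping block, keep the disparity with the lowest SAD."""
    best = []
    for start in range(0, len(left) - size + 1, size):
        # Only consider shifts that keep the right block inside the row.
        candidates = [d for d in range(max_disparity + 1) if start - d >= 0]
        best.append(min(candidates,
                        key=lambda d: block_sad(left, right, start, size, d)))
    return best


if __name__ == "__main__":
    right_row = [0, 0, 5, 9, 5, 0, 0, 0]
    left_row = [0, 0, 0, 0, 5, 9, 5, 0]  # same feature, shifted by two pixels
    print(best_disparities(left_row, right_row, size=4, max_disparity=3))
```

Every block and every candidate disparity is evaluated independently, which is the kind of regular, data-parallel work the clusters in the timeline are kept busy with.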
[Figure: compiled kernel schedule, shown twice, with one column per cluster resource: ADD0, ADD1, ADD2, MUL0, MUL1, DIV0, INP0, INP1, INP2, INP3, OUT0, OUT1, SP_0, COM0, MC_0, JUK0, VAL0. Most slots hold IMULRND16 and IADDS16 operations; the rest hold PASS, SHUFFLE, SELECT, NSELECT, SHIFTA16, COMMUCPERM, COMMUCDATA, DATA_IN, DATA_OUT, GEN_CISTATE, COND_IN_D, GEN_CCEND, SPCREAD_WT, SPCWRITE, CHK_ANY, and LOOP]
– simplifies programming
– exposes locality and concurrency
– perform a subroutine on each stream element
– reduces global register bandwidth
– use bandwidth where it's inexpensive
– distributed and hierarchical register organization
– sort elements into homogeneous streams
– avoid predication or speculation
– media applications process streams of low-precision samples
– wires dominate gates
– new applications have lots of parallelism and locality
– modern technology can build chips with 100s of ALUs (32b FP), with 1000s in the near future
– that can harness this potential performance
– in a way that can be easily programmed