Computer Architecture for the Next Millenium, November 1, 1999


SLIDE 1

Computer Architecture for the Next Millenium

November 1, 1999

William J. Dally Computer Systems Laboratory Stanford University billd@csl.stanford.edu

SLIDE 2

Computer Architecture for the Next Millenium WJD November 1, 1999 2

Outline

  • The Stanford Concurrent VLSI Architecture Group
  • Forces acting on computer architecture

– applications (media)
– technology (wire-limited)
– techniques (explicit parallelism)

  • Example: register organization

– distributed register files

  • Imagine: a stream processor

– 20GFLOPS on a 0.5cm² chip

  • Tremendous opportunities and challenges for computer architecture in the next millennium

– it's not a mature field yet

SLIDE 3

The Concurrent VLSI Architecture Group

  • Architecture and design technology for VLSI
  • Routing chips

– Torus Routing Chip, Network Design Frame, Reliable Router
– Basis for Intel, Cray/SGI, Mercury, Avici network chips

SLIDE 4

Parallel computer systems

  • J-Machine (MDP) led to Cray T3D/T3E
  • M-Machine (MAP)

– Fast messaging, scalable processing nodes, scalable memory architecture

[Figure: MDP Chip, J-Machine, Cray T3D, MAP Chip]

SLIDE 5

Design technology

  • Off-chip I/O

– Simultaneous bidirectional signaling, 1989 (now used by Intel and Hitachi)
– High-speed signaling: 4Gb/s in 0.6µm CMOS, equalization, 1995

  • On-chip signaling

– Low-voltage on-chip signaling
– Low-skew clock distribution

  • Synchronization

– Mesochronous, plesiochronous
– Self-timed design

[Figure: 4Gb/s CMOS I/O, 250ps/division]

SLIDE 6

What is Computer Architecture?

[Figure labels: Technology; Applications; Computer Architect; Interfaces (ISA, API, Link, I/O Chan); Machine Organization (Regs, IR); Measurement & Evaluation]

SLIDE 7

Forces Acting on Architecture

  • Applications - shifting towards media applications dealing with streams of low-precision samples

– video, graphics, audio, DSL modems, cellular base stations

  • Technology - becoming wire-limited

– power and delay dominated by communication, not arithmetic
– global structures (register files and instruction issue) don’t scale

  • Technique - micro-architecture - ILP has been mined out

– to the point of diminishing returns on squeezing performance from sequential code
– explicit parallelism (data parallelism and thread-level parallelism) required to continue scaling performance

SLIDE 8

Applications

  • Little locality of reference

– read each pixel once
– often non-unit stride
– but there is producer-consumer locality

  • Very high arithmetic intensity

– 100s of arithmetic operations per memory reference

  • Dominated by low-precision (16-bit) integer operations

SLIDE 9

Wires Are Becoming Like Wet Noodles

[Figure: a minimum-width wire in a 0.35µm process, shown against a scale from 0.0mm to 10.0mm]

SLIDE 10

Technology scaling makes communication the scarce resource

[Figure: two chips compared]

– 1999: 0.18µm, 256Mb DRAM, 16 64b FP Proc, 500MHz; 18mm, 30,000 tracks, 1 clock, repeaters every 3mm
– 2008: 0.07µm, 4Gb DRAM, 256 64b FP Proc, 2.5GHz; 25mm, 120,000 tracks, 16 clocks, repeaters every 0.5mm

SLIDE 11

Care and Feeding of ALUs

[Figure: an ALU with its data bandwidth path (Regs) and instruction bandwidth path (Instr. Cache, IR, IP); the ‘feeding’ structure dwarfs the ALU]

SLIDE 12

What Does This Say About Architecture?

  • Tremendous opportunities

– Media problems have lots of parallelism and locality
– VLSI technology enables 100s of ALUs per chip (1000s soon): in 0.18µm, 0.1mm² per integer adder and 0.5mm² per FP adder

  • Challenging problems

– Locality - global structures won’t work
– Explicit parallelism - ILP won’t keep 100 ALUs busy
– Memory - streaming applications don’t cache well

  • It’s time to try some new approaches

SLIDE 13

Example Register File Organization

  • Register files serve two functions:

– Short-term storage for intermediate results
– Communication between multiple function units

  • Global register files don’t scale well as N, the number of ALUs, increases

– Need more registers to hold more results (grows with N)
– Need more ports to connect all of the units (grows with N²)

SLIDE 14

Register Cells are Mostly Switch

[Figure: register cell layout showing bit lines, word lines, Vdd, and Gnd, with dimensions labeled w, h, and p in units of 1 wire grid]

SLIDE 15

Register Architecture for ‘wide’ Processors

[Figure: four register organizations, panels (A) to (D), built from N arithmetic units, or from C SIMD clusters of N/C arithmetic units each]

SLIDE 16

Area of Register Organizations

[Figure: area versus number of arithmetic units (log-log axes, 1 to 1000 units) for the Central, SIMD, DRF, and SIMD/DRF organizations]

SLIDE 17

Delay of Register Organizations

[Figure: delay versus number of arithmetic units (log-log axes, 1 to 1000 units) for the Central, DRF, SIMD/DRF, and SIMD organizations]

SLIDE 18

Performance of Register Organizations

[Figure: two bar charts on a 0.00 to 1.20 scale, (A) Raw Performance and (B) Performance with Latency, each comparing the Central, SIMD, SIMD/DRF, Hierarchical, and Stream organizations]

SLIDE 19

Stubs Abstract the Communication Between Operations

[Figure labels: FU (Op 1) with RF; FU (Op 2) with RF; Op 1 and Op 2 linked by a write stub, a read stub, and a data transfer]

SLIDE 20

A Communication Example

[Figure: three panels, (a), (b), and (c), showing values t1, t2, k, and x moving among adder (+), multiplier (*), and load/store (L/S) units across Instruction 5 and Instruction 6; one panel includes a Pass operation]

SLIDE 21

The Imagine Stream Processor

[Figure: block diagram of the Imagine stream processor, showing the stream register file, ALU clusters 0 to 7, the microcontroller, the streaming memory system with four SDRAMs, the network interface to the network, and the host interface to the host processor]

SLIDE 22

Data Bandwidth Hierarchy

[Figure: bandwidth hierarchy of the Imagine stream processor, with four SDRAMs, the stream register file, and the ALU clusters; labeled bandwidths are 3.2GB/s, 64GB/s, and 544GB/s]

SLIDE 23

Cluster Architecture

  • VLIW organization with shared control
  • Local register files provide high data bandwidth

[Figure: one ALU cluster, with adders (+), multipliers (*), a divider (/), a CU, local register files, a cross point, paths from and to the stream buffers, and the intercluster network]

SLIDE 24

Imagine is a Stream Processor

  • Instructions are Load, Store, and Operate

– operands are streams
– also Send and Receive for multiple-Imagine systems

  • Operate performs a compound stream operation

– read elements from input streams
– perform a local computation
– append elements to output streams
– repeat until input stream is consumed
– (e.g., triangle transform)

  • Order of magnitude less global register bandwidth than a vector processor
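
The read, compute, append loop of Operate can be modeled in a few lines of software. This is only a sequential sketch of the programming model, not Imagine's instruction set or hardware; `operate` and `scale_and_shift` are hypothetical names, and the kernel is a stand-in rather than the real triangle transform.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Vec3 = Tuple[float, float, float]


def operate(kernel: Callable[[Vec3], Iterable[Vec3]],
            in_stream: Iterable[Vec3]) -> Iterator[Vec3]:
    """Apply a kernel to every element of an input stream.

    Intermediate values exist only inside the kernel call (standing in
    for local register files); only stream elements cross the boundary.
    """
    for element in in_stream:            # read an element from the input stream
        for result in kernel(element):   # perform the local computation
            yield result                 # append to the output stream
    # the loop ends when the input stream is consumed


def scale_and_shift(v: Vec3) -> List[Vec3]:
    """Hypothetical stand-in kernel: scale by 2, then shift x by 1."""
    x, y, z = v
    return [(2.0 * x + 1.0, 2.0 * y, 2.0 * z)]


if __name__ == "__main__":
    vertices = [(0.0, 0.0, 0.0), (1.0, 2.0, 3.0)]
    print(list(operate(scale_and_shift, vertices)))
```

Letting the kernel return zero, one, or several elements per input keeps the sketch general enough for kernels that cull or expand records.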

SLIDE 25

Triangle Rendering

[Figure: stream flow for triangle rendering across memory, the stream register file, and the arithmetic clusters. Kernel labels: Transform, Shade, Project/Cull, Span Setup, Process Span, Sort, Compact, Z-Composite. Stream labels: Input Data, Triangle Records, Shaded Triangle Records, Projected Triangle Records, Span Records, Fragment Records, Pixel Depth & Color, Image Depth & Color, Image Buffer, Indices. Legend: data word, record; memory bandwidth, register bandwidth]

SLIDE 26

Transform Kernel

References (per ∆)   Stream   Scalar       Vector
Memory               5.5      117 (21.3)   48 (8.7)
Global RF            48       624 (13.0)   261 (5.4)
Local RF             372      N/A          N/A

Bandwidth Demands
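
The parenthesized figures appear to be ratios relative to the Stream column. The short check below recomputes them from the reference counts on the slide; the dictionary layout and the name `ratio_to_stream` are illustrative only.

```python
# Reference counts per ∆, copied from the slide's figures.
references = {
    "Memory":    {"Stream": 5.5, "Scalar": 117, "Vector": 48},
    "Global RF": {"Stream": 48,  "Scalar": 624, "Vector": 261},
}


def ratio_to_stream(level: str, machine: str) -> float:
    """How many times more references `machine` makes than the stream machine."""
    row = references[level]
    return round(row[machine] / row["Stream"], 1)


if __name__ == "__main__":
    for level in references:
        for machine in ("Scalar", "Vector"):
            print(level, machine, ratio_to_stream(level, machine))
```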

SLIDE 27

Data Parallelism is easier than ILP

Kernel              1 to 8 Cluster Speedup
FFT (1024)          6.4
DCT (8x8)           7.8
Blockwarp (8x8)     7.2
Transform (∆)       8.0
Harmonic Mean       7.3
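
The harmonic mean is the usual way to average speedups, and the slide's value can be reproduced from the four kernel figures. The snippet below is a small check using Python's standard library; the variable names are illustrative.

```python
from statistics import harmonic_mean

# Speedups going from 1 to 8 clusters, copied from the slide.
speedups = {
    "FFT (1024)": 6.4,
    "DCT (8x8)": 7.8,
    "Blockwarp (8x8)": 7.2,
    "Transform": 8.0,
}

mean_speedup = harmonic_mean(list(speedups.values()))
efficiency = mean_speedup / 8  # fraction of ideal 8x scaling

if __name__ == "__main__":
    print(round(mean_speedup, 1))  # 7.3
    print(round(efficiency, 2))    # 0.91
```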

SLIDE 28

Conventional Approaches to Data-Dependent Conditional Execution

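[Figure: three conventional approaches. Data-dependent branch: block A tests x>0 and continues with B, C (Y) or J, K (N). Speculation: the Y path B, C is started early and a wrong guess ("Whoops") must switch to J, K, with a speculative loss of D x W, ~100s. Predication: y=(x>0), with B and C executed if y and J and K if ~y, giving an exponentially decreasing duty factor]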

SLIDE 29

Zero-Cost Conditionals

  • Most Approaches to Conditional Operations are Costly

– Branching control flow - dead issue slots on mispredicted branches
– Predication (SIMD select, masked vectors) - large fraction of execution ‘opportunities’ go idle

  • Conditional Streams

– append an element to an output stream depending on a case variable

[Figure: a value stream and a case stream {0,1} feeding an output stream]
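
A software model of the idea: each value is appended to whichever output stream its case variable selects, so later kernels see streams in which every element takes the same path. This sequential sketch shows only the semantics, not how Imagine implements conditional streams; `conditional_split` is a hypothetical name.

```python
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def conditional_split(values: Iterable[T], cases: Iterable[int],
                      n_cases: int = 2) -> List[List[T]]:
    """Append each value to the output stream selected by its case variable."""
    outputs: List[List[T]] = [[] for _ in range(n_cases)]
    for value, case in zip(values, cases):
        outputs[case].append(value)
    return outputs


if __name__ == "__main__":
    xs = [3, -1, 4, -5, 9]
    cases = [1 if x > 0 else 0 for x in xs]   # case stream in {0, 1}
    non_positive, positive = conditional_split(xs, cases)
    print(non_positive, positive)
```

Each resulting stream can then be processed by a kernel with no branch or predicate on the original condition.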

SLIDE 30

Sustainable Performance

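[Figure: bar chart of sustainable performance in GOPS (axis 5 to 30) for the kernels Transform, FFT, FIR, Blockwarp, Sort, Merge, DCT, and FIR, with bars keyed as Communication, 16-bit Arithmetic, 32-bit Arithmetic, and FP Arithmetic. Bar labels as extracted: 12.8, 8.0, 8.9, 8.1, 7.0, 4.0, 24.4, 14.6, 3.2, 1.8, 3.6, 2.8, 0.8, 2.6]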

SLIDE 31

Power Comparison

[Figure: bar chart of GOPs/W (axis 1 to 6) comparing Imagine, AD 21160, TI C67x, and StrongARM SA-1100 on a 1024-point floating-point FFT, communications, and Dhrystone. Bar labels as extracted: 3.7, 1.5, 0.28, 0.22, 1.2]

  • Source: Web Pages of Intel, TI, and Analog Devices

SLIDE 32

Power and Performance

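[Figure: power in W (axis 0.5 to 3.5) for the kernels FIR, FFT, BWARP, XFORM, MERGE, and DCT, broken down into Other, Mem Sys, SRF, Clusters, and Clock. Percentage labels as extracted: 34%, 45%, 9%, 9%, 3%. GOPs/W labels as extracted: 11.5, 5.2, 5.2, 3.8, 7.1, 14.1]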

SLIDE 33

A Look Inside an Application: Stereo Depth Extraction

  • 320x240 8-bit grayscale images

  • 30 disparity search
  • 220 frames/second
  • 12.7 GOPS
  • 5.7 GOPS/W
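
A back-of-envelope calculation relates these figures to one another. It uses only the numbers on the slide and is not a measurement reported in the talk: it works out to roughly 750 operations per pixel per frame, about 25 per pixel per disparity, and an implied power of about 2.2 W.

```python
# Figures copied from the slide.
width, height = 320, 240
disparities = 30
frames_per_second = 220
gops = 12.7
gops_per_watt = 5.7

# Derived quantities (illustrative arithmetic, not from the slides).
pixels_per_second = width * height * frames_per_second
ops_per_pixel = gops * 1e9 / pixels_per_second
ops_per_pixel_per_disparity = ops_per_pixel / disparities
implied_power_watts = gops / gops_per_watt

if __name__ == "__main__":
    print(pixels_per_second)
    print(round(ops_per_pixel), round(ops_per_pixel_per_disparity))
    print(round(implied_power_watts, 1))
```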

SLIDE 34

Stereo Depth Extractor

[Figure: activity timelines for the clusters and the two memory channels (Mem_0, Mem_1) during the two phases; time axes run from 21400 to 23600 for Convolutions and from 501400 to 503300 for Disparity Search]

Convolutions

– Load original packed row
– Unpack (8-bit -> 16-bit)
– 7x7 Convolve
– 3x3 Convolve
– Store convolved row

Disparity Search

– Load convolved rows
– Calculate BlockSADs at different disparities
– Store best disparity values

SLIDE 35

[Figure: schedule of the kernel across the cluster's function-unit columns ADD0, ADD1, ADD2, MUL0, MUL1, DIV0, INP0 to INP3, OUT0, OUT1, SP_0, COM0, MC_0, JUK0, and VAL0. Most slots hold IMULRND16 and IADDS16 operations; the others include SELECT, NSELECT, SHUFFLE, PASS, SHIFTA16, COMMUCPERM, COMMUCDATA, DATA_IN, DATA_OUT, COND_IN_D, GEN_CISTATE, GEN_CCEND, SPCREAD_WT, SPCWRITE, CHK_ANY, and LOOP]

7x7 Convolve Kernel

SLIDE 36

Imagine Summary

  • Imagine operates on streams of records

– simplifies programming
– exposes locality and concurrency

  • Compound stream operations

– perform a subroutine on each stream element
– reduces global register bandwidth

  • Bandwidth hierarchy

– use bandwidth where it’s inexpensive
– distributed and hierarchical register organization

  • Conditional stream operations

– sort elements into homogeneous streams
– avoid predication or speculation

SLIDE 37

Computer Architecture for the Next Millenium

  • Applications and technology are changing

– media applications process streams of low-precision samples
– wires dominate gates

  • ILP is at the point of diminishing returns
  • Tremendous opportunities for new architectures

– new applications have lots of parallelism and locality
– modern technology can build chips with 100s of ALUs (32b FP), 1000s in the near future

  • The challenge is to develop architectures

– that can harness this potential performance
– in a way that can be easily programmed

  • Stream processing is one approach; there are many others. We need to start exploring them.