Computer Architecture for the Next Millennium
November 1, 1999
William J. Dally
Computer Systems Laboratory
Stanford University
billd@csl.stanford.edu

Outline
• The Stanford Concurrent VLSI Architecture Group
• Forces acting on computer architecture
  – applications (media)
  – technology (wire-limited)
  – techniques (explicit parallelism)
• Example: register organization
  – distributed register files
• Imagine: a stream processor
  – 20 GFLOPS on a 0.5 cm² chip
• Tremendous opportunities and challenges for computer architecture in the next millennium
  – it’s not a mature field yet
WJD November 1, 1999 Computer Architecture for the Next Millennium

The Concurrent VLSI Architecture Group
• Architecture and design technology for VLSI
• Routing chips
  – Torus Routing Chip, Network Design Frame, Reliable Router
  – Basis for Intel, Cray/SGI, Mercury, Avici network chips

Parallel computer systems
• J-Machine (MDP) led to Cray T3D/T3E
• M-Machine (MAP)
  – Fast messaging, scalable processing nodes, scalable memory architecture
[Figure labels: MDP Chip, J-Machine, Cray T3D, MAP Chip]

Design technology
• Off-chip I/O
  – Simultaneous bidirectional signaling, 1989
    • now used by Intel and Hitachi
  – High-speed signalling
    • 4Gb/s in 0.6 µm CMOS, Equalization, 1995
• On-Chip Signalling
  – Low-voltage on-chip signalling
  – Low-skew clock distribution
• Synchronization
  – Mesochronous, Plesiochronous
  – Self-Timed Design
[Figure labels: 4Gb/s CMOS I/O, 250ps/division]

What is Computer Architecture?
[Figure labels: Computer Architect; Applications; Technology; Interfaces (API, ISA, Link, I/O Chan); Machine Organization (IR, Regs); Measurement & Evaluation]

Forces Acting on Architecture
• Applications - shifting towards media applications dealing with streams of low-precision samples
  – video, graphics, audio, DSL modems, cellular base stations
• Technology - becoming wire-limited
  – power and delay dominated by communication, not arithmetic
  – global structures (register files and instruction issue) don’t scale
• Technique - micro-architecture: ILP has been mined out
  – to the point of diminishing returns on squeezing performance from sequential code
  – explicit parallelism (data parallelism and thread-level parallelism) required to continue scaling performance

Applications
• Little locality of reference
  – read each pixel once
  – often non-unit stride
  – but there is producer-consumer locality
• Very high arithmetic intensity
  – 100s of arithmetic operations per memory reference
• Dominated by low-precision (16-bit) integer operations

Wires Are Becoming Like Wet Noodles
[Figure: minimum width wire in a 0.35 µm process; labels 0.0mm, 2.5mm, 5.0mm, 7.5mm, 10.0mm]

Technology scaling makes communication the scarce resource
  1999                    2008
  0.18 µm                 0.07 µm
  256Mb DRAM              4Gb DRAM
  16 64b FP Proc          256 64b FP Proc
  500MHz                  2.5GHz
  18mm                    25mm
  30,000 tracks           120,000 tracks
  1 clock                 16 clocks
  repeaters every 3mm     repeaters every 0.5mm
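The scaling argument on this slide can be checked with a little arithmetic. The sketch below uses only the 1999 and 2008 figures quoted above. It assumes that "tracks" counts the wiring tracks available across the chip and that "1 clock" and "16 clocks" are cross-chip signal times; the dictionary and function names are illustrative only.

```python
# Figures copied from the 1999 and 2008 columns of the slide.
# Assumption: "tracks" is the number of wiring tracks across the chip and
# "cross_chip_clocks" is the number of clocks a signal needs to cross it.
CHIPS = {
    1999: {"fp_procs": 16, "tracks": 30_000, "clock_mhz": 500, "cross_chip_clocks": 1},
    2008: {"fp_procs": 256, "tracks": 120_000, "clock_mhz": 2_500, "cross_chip_clocks": 16},
}


def growth(key):
    """Ratio of the 2008 figure to the 1999 figure for one quantity."""
    return CHIPS[2008][key] / CHIPS[1999][key]


def tracks_per_processor(year):
    """Wiring tracks available per 64b FP processor in a given year."""
    return CHIPS[year]["tracks"] / CHIPS[year]["fp_procs"]


if __name__ == "__main__":
    print("processor count grows by", growth("fp_procs"))
    print("wiring tracks grow by", growth("tracks"))
    for year in CHIPS:
        print(year, "tracks per processor:", tracks_per_processor(year))
```

Under these assumptions the number of processors grows four times faster than the number of tracks, so each processor has a quarter of the wiring it had before, while a cross-chip signal takes 16 times as many clocks.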

Care and Feeding of ALUs
[Figure labels: Instruction Bandwidth (IP, Instr. Cache, IR); Data Bandwidth (Regs)]
• ‘Feeding’ structure dwarfs ALU

What Does This Say About Architecture?
• Tremendous opportunities
  – Media problems have lots of parallelism and locality
  – VLSI technology enables 100s of ALUs per chip (1000s soon)
    • (in 0.18 µm: 0.1 mm² per integer adder, 0.5 mm² per FP adder)
• Challenging problems
  – Locality - global structures won’t work
  – Explicit parallelism - ILP won’t keep 100 ALUs busy
  – Memory - streaming applications don’t cache well
• It’s time to try some new approaches
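As a rough check on the "100s of ALUs per chip" claim, the sketch below adds up adder area from the two per-adder figures on this slide. It counts the adders alone; register files, instruction issue, and wiring are not included, so it is a lower bound on area, not a floorplan. The function name is illustrative.

```python
# Per-adder areas in a 0.18 µm process, as quoted on the slide.
INT_ADDER_MM2 = 0.1
FP_ADDER_MM2 = 0.5


def adder_area_mm2(int_adders, fp_adders):
    """Total area of the adders alone, in mm^2 (no registers or wiring)."""
    return int_adders * INT_ADDER_MM2 + fp_adders * FP_ADDER_MM2


if __name__ == "__main__":
    print("100 FP adders:", adder_area_mm2(0, 100), "mm^2")
    print("1000 integer adders:", adder_area_mm2(1000, 0), "mm^2")
```

By this count 100 FP adders need about 50 mm² and 1000 integer adders about 100 mm², which suggests that arithmetic itself is not the limiting cost.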

Example: Register File Organization
• Register files serve two functions:
  – Short-term storage for intermediate results
  – Communication between multiple function units
• Global register files don’t scale well as N, the number of ALUs, increases
  – Need more registers to hold more results (grows with N)
  – Need more ports to connect all of the units (grows with N²)
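A minimal sketch of why a global register file scales poorly, under stated assumptions: each ALU needs a fixed number of registers and ports, and every port adds one wire track to both the width and the height of each register cell. The parameter values (16 registers and 3 ports per ALU, a 3 by 3 track storage cell) are hypothetical and chosen only for illustration.

```python
def central_rf_area(n_alus, regs_per_alu=16, ports_per_alu=3, cell_w=3, cell_h=3):
    """Area of one bit-slice of a central register file, in wire-track units.

    Assumed model: the register count grows with N, and each cell grows by
    one track per port in both dimensions, so cell area grows roughly as N^2.
    """
    ports = ports_per_alu * n_alus
    cell_area = (cell_w + ports) * (cell_h + ports)
    return regs_per_alu * n_alus * cell_area


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16, 32, 64, 128):
        print(n, central_rf_area(n))
```

In this model, doubling N from 64 to 128 multiplies the area by almost 8, because the register count doubles while each cell grows nearly fourfold.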

Register Cells are Mostly Switch
[Figure: register cell layout on a 1-wire grid; labels w, h, p, Bit Lines, Word Lines, Vdd, Gnd]

Register Architecture for ‘wide’ Processors
[Figure: four register organizations. (A) N arithmetic units; (B) C SIMD clusters of N/C arithmetic units; (C) N arithmetic units; (D) C SIMD clusters of N/C arithmetic units]
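The benefit of splitting N arithmetic units into C SIMD clusters can be sketched with a simple area model. The model is an assumption, not taken from the slides: each port adds one wire track to both dimensions of a register cell, each cluster has its own register file, and the intercluster switch is ignored. All parameter values are hypothetical.

```python
def rf_area(n_alus, regs_per_alu=16, ports_per_alu=3, cell_w=3, cell_h=3):
    """Bit-slice area of one register file serving n_alus units (track units)."""
    ports = ports_per_alu * n_alus
    return regs_per_alu * n_alus * (cell_w + ports) * (cell_h + ports)


def clustered_rf_area(n_alus, n_clusters, **cell_params):
    """Total area when the units are split evenly into n_clusters clusters."""
    if n_alus % n_clusters != 0:
        raise ValueError("n_alus must be a multiple of n_clusters")
    return n_clusters * rf_area(n_alus // n_clusters, **cell_params)


if __name__ == "__main__":
    n = 48
    for c in (1, 2, 4, 8):
        print(c, "clusters:", clustered_rf_area(n, c))
```

With 48 units, moving from one register file to 8 clusters shrinks the modeled register area by a factor of 49, since the total register count is unchanged while each cell becomes much smaller.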

Area of Register Organizations
[Plot: area versus number of arithmetic units (1 to 1000) for the Central, SIMD, DRF, and SIMD/DRF organizations]

Delay of Register Organizations
[Plot: delay versus number of arithmetic units (1 to 1000) for the Central, SIMD, DRF, and SIMD/DRF organizations]

Performance of Register Organizations
[Bar charts: (A) Raw Performance and (B) Performance with Latency, each on a 0.00 to 1.20 scale, for the Central, SIMD, SIMD/DRF, Hierarchical, and Stream organizations]

Stubs Abstract the Communication Between Operations
[Figure labels: Op 1 (FU), Write stub, RF, Data transfer, Read stub, RF, Op 2 (FU)]

A Communication Example
[Figure: panels (a), (b), and (c) showing Instruction 5 and Instruction 6 scheduled on the +, *, and L/S units, with values k, x, t1, and t2 and a Pass operation]

The Imagine Stream Processor
[Block diagram labels: four SDRAMs; Streaming Memory System; Host Processor and Host Interface; Network and Network Interface; Stream Register File; Microcontroller; ALU Cluster 0 through ALU Cluster 7]

Data Bandwidth Hierarchy
[Figure: Imagine Stream Processor with SDRAMs, Stream Register File, and ALU Clusters; bandwidth labels 3.2GB/s, 64GB/s, 544GB/s]
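The three bandwidth figures on this slide imply the ratios computed below. The assignment of each figure to a level (3.2GB/s at the SDRAM, 64GB/s at the stream register file, 544GB/s inside the ALU clusters) is an assumption read from the order of the labels; the names are illustrative.

```python
# Bandwidth at each level in GB/s. The level-to-figure mapping is an
# assumption based on the order of the labels on the slide.
BANDWIDTH_GBPS = {
    "sdram": 3.2,
    "stream_register_file": 64.0,
    "alu_clusters": 544.0,
}


def bandwidth_ratio(inner, outer):
    """How many times more bandwidth the inner level has than the outer one."""
    return BANDWIDTH_GBPS[inner] / BANDWIDTH_GBPS[outer]


if __name__ == "__main__":
    print("SRF / SDRAM:", bandwidth_ratio("stream_register_file", "sdram"))
    print("clusters / SRF:", bandwidth_ratio("alu_clusters", "stream_register_file"))
    print("clusters / SDRAM:", bandwidth_ratio("alu_clusters", "sdram"))
```

Under that reading, each step inward multiplies the available bandwidth, by about 20 and then 8.5, for roughly 170 times more bandwidth in the clusters than at the memory pins.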

Cluster Architecture
[Figure: function units + + + * * / and CU, each with a Local Register File, joined by a Cross Point to the Intercluster Network, with paths To Stream Buffers and From Stream Buffers]
• VLIW organization with shared control
• Local register files provide high data bandwidth

Imagine is a Stream Processor
• Instructions are Load, Store, and Operate
  – operands are streams
  – also Send and Receive for multiple-Imagine systems
• Operate performs a compound stream operation
  – read elements from input streams
  – perform a local computation
  – append elements to output streams
  – repeat until input stream is consumed
  – (e.g., triangle transform)
• Order of magnitude less global register bandwidth than a vector processor
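The four steps of Operate listed above can be sketched in software. This is a minimal model, not Imagine's instruction set: streams are plain Python lists, the kernel is an ordinary function, and the names `operate` and `make_transform` are hypothetical. Load, Store, Send, and Receive are not modeled.

```python
def operate(kernel, *input_streams):
    """Apply a kernel to the input streams and return the output stream.

    Each step reads one element from every input stream, performs the local
    computation, and appends the result to the output stream. It stops when
    an input stream is consumed.
    """
    output_stream = []
    for elements in zip(*input_streams):
        output_stream.append(kernel(*elements))
    return output_stream


def make_transform(matrix):
    """Build a kernel that multiplies a vertex by a fixed square matrix."""
    def transform(vertex):
        return tuple(sum(m * v for m, v in zip(row, vertex)) for row in matrix)
    return transform


if __name__ == "__main__":
    scale_by_two = make_transform(((2, 0, 0), (0, 2, 0), (0, 0, 2)))
    vertices = [(1, 2, 3), (0, 1, 0)]
    print(operate(scale_by_two, vertices))  # [(2, 4, 6), (0, 2, 0)]
```

The intermediate values inside `transform` never leave the kernel, which is the software analogue of keeping them in local register files instead of a global one.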

Triangle Rendering
[Figure: streams flowing among Memory, the Stream Register File, and the Arithmetic Clusters, with Memory Bandwidth and Register Bandwidth marked per word and per record. Kernels: Transform, Shade, Project/Cull, Span Setup, Process Span, Sort, Compact, Z-Composite. Streams: Input Data, Triangle Records, Shaded Triangle Records, Projected Triangle Records, Span Records, Fragment Records, Image Buffer Indices, Pixel Depth & Color, Image Depth & Color]

Bandwidth Demands
Transform Kernel
  References (per ∆)   Stream   Scalar       Vector
  Memory               5.5      117 (21.3)   48 (8.7)
  Global RF            48       624 (13.0)   261 (5.4)
  Local RF             372      N/A          N/A
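The parenthesized values in this table appear to be ratios relative to the Stream column. The sketch below recomputes them from the reference counts as a consistency check; that reading of the parentheses is an assumption, and the names are illustrative.

```python
# References per triangle for the transform kernel, copied from the table.
REFERENCES = {
    "memory": {"stream": 5.5, "scalar": 117, "vector": 48},
    "global_rf": {"stream": 48, "scalar": 624, "vector": 261},
}
LOCAL_RF_STREAM_REFERENCES = 372


def relative_to_stream(level, organization):
    """References at one level divided by the stream processor's count."""
    return REFERENCES[level][organization] / REFERENCES[level]["stream"]


if __name__ == "__main__":
    for level in REFERENCES:
        for organization in ("scalar", "vector"):
            ratio = relative_to_stream(level, organization)
            print(level, organization, round(ratio, 1))
    # Local register file references made for each memory reference.
    print(round(LOCAL_RF_STREAM_REFERENCES / REFERENCES["memory"]["stream"], 1))
```

The recomputed ratios round to 21.3, 8.7, 13.0, and 5.4, matching the table. The last line shows that the stream organization makes about 68 local register file references for each memory reference.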

Data Parallelism is easier than ILP
  Kernel             1 to 8 Cluster Speedup
  FFT (1024)         6.4
  DCT (8x8)          7.8
  Blockwarp (8x8)    7.2
  Transform (∆)      8.0
  Harmonic Mean      7.3
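The harmonic mean in the last row can be reproduced from the four kernel speedups. The sketch below uses Python's standard `statistics.harmonic_mean`; the per-kernel efficiency, speedup divided by the 8 clusters, is an added illustration and not a figure from the slide.

```python
from statistics import harmonic_mean

# Speedup going from 1 to 8 clusters, copied from the table.
SPEEDUPS = {
    "FFT (1024)": 6.4,
    "DCT (8x8)": 7.8,
    "Blockwarp (8x8)": 7.2,
    "Transform": 8.0,
}
CLUSTERS = 8


def mean_speedup():
    """Harmonic mean of the per-kernel speedups."""
    return harmonic_mean(list(SPEEDUPS.values()))


def efficiency(kernel):
    """Fraction of ideal linear speedup achieved by one kernel."""
    return SPEEDUPS[kernel] / CLUSTERS


if __name__ == "__main__":
    print("harmonic mean:", round(mean_speedup(), 1))
    for kernel in SPEEDUPS:
        print(kernel, "efficiency:", round(efficiency(kernel), 3))
```

The computed mean rounds to 7.3, in agreement with the table.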

Conventional Approaches to Data-Dependent Conditional Execution
[Figure: three treatments of a block A followed by a test of x>0 that selects B, J (Y) or C, K (N). Labels: Data-Dependent Branch, "Whoops"; Speculative Loss, D x W ~100s; y=(x>0) with B, J executed if y and C, K executed if ~y, Exponentially Decreasing Duty Factor]
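One of the approaches on this slide, computing y=(x>0) and guarding each side with "if y" or "if ~y", can be sketched as predicated execution. The code below is an illustrative model only: both sides run for every element and the predicate selects the result. The `duty_factor` function encodes one reading of "exponentially decreasing duty factor", namely that with d nested two-way conditions and equally likely outcomes only 1/2^d of the executed work is useful; that reading is an assumption.

```python
def predicated(xs, then_op, else_op):
    """Run both sides of a conditional for every element, then select."""
    results = []
    for x in xs:
        y = x > 0                    # predicate, as in y=(x>0)
        then_result = then_op(x)     # executed whether or not y holds
        else_result = else_op(x)     # executed whether or not y holds
        results.append(then_result if y else else_result)
    return results


def duty_factor(depth):
    """Useful fraction of work with `depth` nested two-way conditions.

    Assumes every side of every condition is executed and that the
    outcomes are equally likely.
    """
    return 1 / 2 ** depth


if __name__ == "__main__":
    print(predicated([1, -2, 3], lambda x: x * 2, lambda x: -x))  # [2, 2, 6]
    print([duty_factor(d) for d in range(4)])
```

The sketch makes the cost visible: every element pays for both `then_op` and `else_op`, and nesting halves the useful fraction again at each level.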
