SLIDE 1

Vector IRAM: ISA and Micro-architecture

Christoforos E. Kozyrakis

Computer Science Division University of California, Berkeley kozyraki@cs.berkeley.edu http://iram.cs.berkeley.edu/

SLIDE 2

C.E. Kozyrakis, IEEE Computer Elements Workshop, June 22, 1998

Outline

  • Project motivation, goals and approach
  • Vector IRAM ISA
  • VIRAM-1 micro-architecture
  • Project status
SLIDE 3

Project Motivation

  • Processor-memory gap is growing exponentially
  • Applications are shifting from engineering/desktop to multimedia
    – importance of performance of media functions
    – importance of real-time, predictable performance
  • Embedded/portable systems are gaining popularity
    – importance of energy consumption
    – importance of system size
  • Focus on processors for portable, multimedia systems

SLIDE 4

The Vector IRAM Approach

Vector processing

  • multimedia ready
  • predictable, high performance
  • simple
  • energy savings
  • high code density
  • well understood programming model

Embedded DRAM

  • high memory bandwidth
  • low memory latency
  • energy savings
  • system size benefits

Serial I/O

  • Gbit/sec I/O bandwidth
  • low pin count
  • low power
SLIDE 5

Outline

  • Project motivation and goals
  • Vector IRAM ISA

    – Overview of VIRAM ISA extensions
    – Fixed-point and DSP support
    – Conditional and speculative execution
    – Memory model

  • VIRAM-1 micro-architecture
  • Project status
SLIDE 6

Vector Execution Model

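[Figure: scalar versus vector execution]

SCALAR (1 operation): add r3, r1, r2 adds registers r1 and r2 into r3.

VECTOR (N operations): add.vv v3, v1, v2 adds vector registers v1 and v2 element by element into v3; N is the vector length.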
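The contrast between the two instructions can be sketched in Python. This is only an illustration of the execution model: the function names are hypothetical, and Python lists stand in for vector registers.

```python
def scalar_add(r1: int, r2: int) -> int:
    """Models `add r3, r1, r2`: one instruction, one operation."""
    return r1 + r2


def vector_add_vv(v1: list, v2: list, vl: int) -> list:
    """Models `add.vv v3, v1, v2`: one instruction, `vl` element-wise operations.

    `vl` plays the role of the vector length; elements at or beyond `vl`
    are simply not produced in this sketch.
    """
    return [v1[i] + v2[i] for i in range(vl)]


r3 = scalar_add(1, 2)
v3 = vector_add_vv([1, 2, 3, 4], [10, 20, 30, 40], vl=4)
```

A single vector instruction replaces the whole loop body plus its loop-control overhead, which is the source of the code-density and control-simplicity claims made earlier in the deck.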

SLIDE 7

Vector Architectural State

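[Figure: vector architectural state]

  • General Purpose Registers (32): vr0, vr1, ..., vr31, $vpw bits per element
  • Flag Registers (32): vf0, vf1, ..., vf31, 1b per element
  • Control Registers: vcr0, vcr1, ..., vcr31, 32b each
  • Virtual Processors ($vlr): VP0, VP1, ..., VP$vlr-1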

SLIDE 8

Overview of V-IRAM ISA Extensions

[Figure: V-IRAM ISA summary]

  • Scalar: MIPS-V scalar instruction set
  • Vector ALU: alu op
    – operand forms: .v, .vv, .vs, .sv
    – data types: s.int, u.int, s.fp, d.fp
    – data widths: 8, 16, 32, 64
  • Vector Memory: load, store
    – data types: s.int, u.int
    – data widths: 8, 16, 32, 64
    – addressing: unit stride, constant stride, indexed
  • Plus: flag, convert, fixed-point, and transfer operations
  • Vector Registers
    – data: 32 x VL x 64b, 32 x 4VL x 32b, 32 x 8VL x 16b
    – flags: 32 x VL x 1b, 32 x 2VL x 1b, 32 x 8VL x 1b
  • All ALU / memory operations under mask
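The three vector memory addressing modes named on this slide (unit stride, constant stride, indexed) differ only in how element addresses are generated. A minimal Python sketch follows; the function names are hypothetical, memory is modelled as a list addressed in whole elements rather than bytes, and none of this reflects actual VIRAM instruction encodings.

```python
def unit_stride_addresses(base: int, vl: int) -> list:
    """Consecutive elements: base, base+1, ..., base+vl-1."""
    return [base + i for i in range(vl)]


def constant_stride_addresses(base: int, stride: int, vl: int) -> list:
    """Elements separated by a fixed stride: base, base+stride, ..."""
    return [base + i * stride for i in range(vl)]


def indexed_addresses(base: int, index_vector: list) -> list:
    """Gather/scatter: each element address is base plus an offset taken from a vector register."""
    return [base + offset for offset in index_vector]


def vector_load(memory: list, addresses: list) -> list:
    """Load one element per generated address."""
    return [memory[a] for a in addresses]


memory = list(range(100, 132))  # element k holds the value 100 + k
row = vector_load(memory, unit_stride_addresses(0, 4))
column = vector_load(memory, constant_stride_addresses(0, 8, 4))
gathered = vector_load(memory, indexed_addresses(4, [3, 0, 9]))
```

Constant-stride access is what lets a vector unit walk a matrix column or an interleaved sample stream without per-element address arithmetic in software.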

SLIDE 9

Fixed-point and DSP support

  • GOAL: Competitive DSP performance
  • Many DSP features already provided

    – narrow data widths [provided]
    – high speed MACs [instruction chaining]
    – multiple LD/ST per cycle [multiple memory units]
    – auto increment / decrement [strided memory access]
    – zero overhead loops [vector instructions]
    – fixed/floating convert [provided]
    – bit reverse addressing [use better FFT algorithm]

SLIDE 10

Fixed-point Multiply-Add Model

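[Figure: fixed-point multiply-add datapath]

  • Signals in the figure: a, w, x, y, z, with a multiplier (*) and an adder (+)
  • Widths in the figure: n/2 on the two multiplier inputs, n elsewhere
  • Mul stage ends in Round; Add stage ends in F
  • Round = truncate, round nearest even, round nearest up, or jam
  • F = signed saturate, unsigned saturate, or shift by one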
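One possible reading of this model, sketched in Python. Several things here are assumptions rather than facts from the slide: that a is a right-shift amount applied to the product before rounding, that the order is multiply, shift and round, add w, then saturate, and that "jam" means ORing any discarded bit into the least significant kept bit. All function names are hypothetical, and only signed saturation is shown for F.

```python
def shift_round(value: int, shift: int, mode: str) -> int:
    """Arithmetic right shift by `shift` bits using one of the four rounding modes."""
    if shift == 0:
        return value
    kept = value >> shift                  # floor(value / 2**shift), also for negatives
    dropped = value & ((1 << shift) - 1)   # discarded low bits, always non-negative
    half = 1 << (shift - 1)
    if mode == "truncate":
        return kept
    if mode == "nearest_up":               # ties go toward +infinity
        return kept + (1 if dropped >= half else 0)
    if mode == "nearest_even":             # ties go to the even neighbour
        if dropped > half or (dropped == half and (kept & 1)):
            return kept + 1
        return kept
    if mode == "jam":                      # assumed: OR any discarded bit into the LSB
        return kept | (1 if dropped else 0)
    raise ValueError(f"unknown rounding mode: {mode}")


def saturate_signed(value: int, bits: int) -> int:
    """Clamp to the signed two's-complement range of `bits` bits."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


def fixed_mul_add(x: int, y: int, w: int, a: int, n: int = 16,
                  mode: str = "nearest_even") -> int:
    """z = saturate(w + round((x * y) >> a)), with x, y of n/2 bits and w, z of n bits."""
    return saturate_signed(w + shift_round(x * y, a, mode), n)


z = fixed_mul_add(x=127, y=127, w=30000, a=0)  # 46129 saturates to 32767
```

Saturating instead of wrapping on overflow is the usual reason DSP code prefers this kind of operation: a clipped sample is far less audible or visible than a sign flip.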

SLIDE 11

Fixed-point instructions

  • Vector half-width integer multiply
  • Vector fixed-point shift and add
  • Vector saturate
  • Vector saturating left arithmetic shift
SLIDE 12

Conditional (Predicated) Execution

  • Almost every vector instruction is executed subject to one of two vector masks
  • 15 GP flag registers provided to buffer masks or operate on them
  • 6 flag logical and 13 flag processing instructions (like population count, iota, etc.)
  • 15 flag registers used for sticky exception bits for arithmetic/FP operations and speculative operations
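Execution under mask can be illustrated with a short Python sketch. The function names are hypothetical and lists of 0/1 values stand in for flag registers; the sketch assumes that masked-off elements keep their previous destination value, which the slide does not state.

```python
def masked_add_vv(v1: list, v2: list, dest: list, mask: list) -> list:
    """Element-wise add where the mask bit is set; other elements keep `dest`."""
    return [a + b if m else d for a, b, d, m in zip(v1, v2, dest, mask)]


def flag_and(f1: list, f2: list) -> list:
    """A flag logical operation: combine two masks bit by bit."""
    return [a & b for a, b in zip(f1, f2)]


def flag_popcount(mask: list) -> int:
    """A flag processing operation: number of set bits in a mask."""
    return sum(1 for m in mask if m)


mask = flag_and([1, 1, 1, 0], [1, 0, 1, 1])            # [1, 0, 1, 0]
result = masked_add_vv([1, 2, 3, 4], [10, 10, 10, 10], [0, 0, 0, 0], mask)
active = flag_popcount(mask)
```

With masks, an if statement inside a loop body becomes a compare that writes a flag register followed by masked instructions, so the loop can still be vectorized.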
SLIDE 13

Speculative Execution

  • Vectorizing loops with conditional exit conditions
    – Need to speculate past loop exit
    – Need to temporarily suppress exceptions
  • Speculation controlled by software
  • Solution:
    – A duplicate set of arithmetic exception flag registers
    – A flag register reserved for load faults
    – Speculative loads and speculative arithmetic instructions write these duplicate exception bits

SLIDE 14

Speculative Execution (cont.)

  • Perform loads and enough arithmetic to determine loop exit condition
    – Stores cannot be speculated!
  • Generate mask to exclude iterations after loop exit (flag processor instruction)
  • VCOMMIT instruction (under mask):
    – ORs speculative flags into real flags
    – Raises memory exceptions
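The sequence above can be sketched in Python for a search loop with a data-dependent exit. This is an illustrative model only: the function name, the use of an out-of-range list index as a "load fault", and the MemoryError exception are all assumptions, not VIRAM behaviour taken from the slides.

```python
def speculative_find(memory: list, base: int, vl: int, target: int):
    """Search `vl` elements starting at `base` for `target`, vector style."""
    # 1. Speculative loads: a faulting element sets a speculative flag instead of trapping.
    values, spec_fault = [], []
    for i in range(vl):
        addr = base + i
        in_range = 0 <= addr < len(memory)
        values.append(memory[addr] if in_range else 0)
        spec_fault.append(not in_range)

    # 2. Enough arithmetic to determine the loop exit condition.
    hit = [(not f) and v == target for v, f in zip(values, spec_fault)]

    # 3. Mask that excludes iterations after the loop exit.
    first = hit.index(True) if True in hit else vl
    mask = [i <= first for i in range(vl)]

    # 4. Commit under mask: only faults in iterations that really execute are raised.
    if any(f and m for f, m in zip(spec_fault, mask)):
        raise MemoryError("load fault in a committed iteration")
    return first if first < vl else None


memory = [5, 7, 9]
found = speculative_find(memory, base=0, vl=8, target=7)  # elements 3..7 fault but are masked off
```

Faults in iterations past the exit point are discarded, which matches what a scalar loop would have done because it would never have executed those iterations.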

SLIDE 15

Memory Model

  • Relaxed consistency to simplify hardware: no guarantee about ordering of memory operations, even within the same VP
  • Register interlocks provided on a per-element basis
  • Vector memory barrier used for ordering between scalar unit and vector unit and between VPs
  • Indexed memory operations do not specify ordering; separate ordered indexed store instruction

SLIDE 16

Outline

  • Project motivation and goals
  • Vector IRAM ISA
  • VIRAM-1 micro-architecture

    – Overview of VIRAM-1 micro-architecture
    – Vector pipelines
    – Memory system architecture

  • Project status
SLIDE 17

VIRAM-1 Block Diagram

SLIDE 18

VIRAM-1 Features

  • Scalar unit
    – 64-bit MIPS core with FP unit
    – 8KB I+D caches, write-through
    – cache invalidation interface
  • Vector unit
    – maximum vector length 32
    – 64, 32, 16 bit data-types
    – 2 vector arithmetic units
    – 2 vector flag processing units
    – 4 pipelines per functional unit
    – 2 vector load/store units
    – 64 entry vector TLB, multi-ported

SLIDE 19

Vector Pipelines

  • Multiple pipelines can increase performance, OR
  • decrease energy by lowering clock frequency and power supply voltage
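The trade-off can be made concrete with the standard first-order model of dynamic power, P = C x V^2 x f, sketched below in Python. The single-pipeline 800 MHz, 3.0 V design point is a made-up comparison, not a figure from this talk; only the 200 MHz clock and 1.5 V logic supply appear elsewhere in the deck. Function and parameter names are hypothetical.

```python
def dynamic_power(pipelines: int, volts: float, freq_hz: float,
                  cap_per_pipeline: float = 1.0) -> float:
    """First-order dynamic power: switched capacitance x V^2 x f, summed over pipelines."""
    return pipelines * cap_per_pipeline * volts ** 2 * freq_hz


def throughput(pipelines: int, freq_hz: float) -> float:
    """Peak element operations per second for one functional unit."""
    return pipelines * freq_hz


# Hypothetical fast, narrow design versus a slow, wide design at equal peak throughput.
narrow = dynamic_power(pipelines=1, volts=3.0, freq_hz=800e6)
wide = dynamic_power(pipelines=4, volts=1.5, freq_hz=200e6)
power_ratio = wide / narrow
```

In this model, four pipelines at a quarter of the clock rate match the peak throughput of one fast pipeline, and the lower clock permits a lower supply voltage; halving the voltage cuts power to a quarter under these assumed numbers.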

SLIDE 20

VIRAM-1 Memory System

  • 16 to 32MB DRAM
  • 16 independently addressed banks
  • 8 2Mbit DRAM macros per bank with 256-bit synchronous interface
  • Memory crossbar
    – interconnects scalar, vector unit and I/O to memory
    – 8 addresses per cycle
    – 12.8GB/sec maximum data bandwidth per direction
    – implemented using low-swing techniques
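A quick arithmetic check of these figures in Python. The capacity calculation uses only numbers from this slide. The bandwidth calculation is one reading that happens to reproduce 12.8GB/sec: it assumes each of the 8 addresses per cycle moves one 64-bit word and that the crossbar runs at the 200MHz vector clock listed on the goals slide. Neither assumption is stated in the talk.

```python
BANKS = 16
MACROS_PER_BANK = 8
MBIT_PER_MACRO = 2

# Capacity: 16 banks x 8 macros x 2 Mbit, converted from Mbit to MB.
capacity_mbit = BANKS * MACROS_PER_BANK * MBIT_PER_MACRO
capacity_mbyte = capacity_mbit // 8

# Bandwidth (assumed reading): addresses per cycle x bytes per word x clock rate.
ADDRESSES_PER_CYCLE = 8
BYTES_PER_WORD = 8             # assumption: one 64-bit word per address
CLOCK_HZ = 200_000_000         # assumption: crossbar at the vector clock
bandwidth_bytes_per_sec = ADDRESSES_PER_CYCLE * BYTES_PER_WORD * CLOCK_HZ
```

The capacity works out to 32 MB, the upper end of the 16 to 32MB range quoted on the slide.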

SLIDE 21

VIRAM-1 Floorplan

SLIDE 22

VIRAM-1 Goals

Technology: 0.20 micron, 5 metal layers, embedded DRAM-logic process
Memory: 16-32 MB
Die size: 250-300 mm2
Vector pipelines: 4 64-bit (or 8 32-bit or 16 16-bit)
Clock Frequency: 200MHz scalar, 200MHz vector, 100MHz DRAM
Serial I/O: 4 lines @ 1 Gbit/s
Power: 2 W @ 1.5 volt logic
Performance: 1.6 GFLOPS64 – 6.4 GOPS16

First microprocessor above 0.25B transistors?

SLIDE 23

Scaling Down VIRAM-1

  • Scaled-down version automatically generated from the original
  • 8 MB in 4 banks
  • Vector unit with single pipeline per functional unit => same control
  • die: 80 mm2
  • transistors: 70M
  • power: 0.5 Watts
  • performance: 0.4 GFLOPS64, 1.6 GOPS16
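The peak figures for both the full and the scaled-down design are consistent with a simple product of arithmetic units, pipelines and clock rate. The Python sketch below assumes 2 vector arithmetic units, a 200MHz vector clock, one operation per pipeline per cycle, and four 16-bit operations per 64-bit pipeline; this is an inferred reading of the slides, not a formula given in the talk, and the function name is hypothetical.

```python
ARITHMETIC_UNITS = 2
VECTOR_CLOCK_HZ = 200_000_000


def peak_ops_per_sec(pipelines_64bit: int, element_bits: int) -> int:
    """Peak operations per second, assuming one operation per sub-pipeline per cycle."""
    subpipelines = pipelines_64bit * (64 // element_bits)
    return ARITHMETIC_UNITS * subpipelines * VECTOR_CLOCK_HZ


full_64 = peak_ops_per_sec(pipelines_64bit=4, element_bits=64)     # 1.6 G
full_16 = peak_ops_per_sec(pipelines_64bit=4, element_bits=16)     # 6.4 G
scaled_64 = peak_ops_per_sec(pipelines_64bit=1, element_bits=64)   # 0.4 G
scaled_16 = peak_ops_per_sec(pipelines_64bit=1, element_bits=16)   # 1.6 G
```

Dropping from four pipelines to one per functional unit scales peak performance by exactly a factor of four at every data width, which matches the quoted numbers.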

SLIDE 24

Project Status

  • ISA extensions frozen
  • Micro-architecture still under development but design has started
  • Developing simulation infrastructure
  • Designed 2 test-chips for circuit evaluation
    – serial I/O @ 1Gbit/s
    – embedded DRAM and on-chip crossbar

  • Expected VIRAM-1 tape-out: early 2000
SLIDE 25

Acknowledgments

  • Thanks for advice/support: DARPA, California MICRO, ARM, Hitachi, IBM, Intel, LG Semicon, Microsoft, Mitsubishi, Neomagic, Samsung, SGI/Cray, Sun Microsystems
  • The IRAM/ISTORE cast: D. Patterson, K. Asanovic, A. Brown, J. Gebis, B. Gribstad, R. Fromm, J. Golbus, K. Keeton, C. Kozyrakis, J. Kubiatowicz, D. Martin, S. Perissakis, R. Thomas, N. Treuhaft and K. Yelick