SLIDE 1

Day 3

Advanced Vector Architectures

Session A: Vector Instruction Execution Pipelines
Break
Session B: Vector Flag Processing & Vector Register Files
Lunch
Session C: Virtual Processor Caches
Break
Session D: Vector IRAM

Vector Instruction Execution Pipelines

Main issues:

  • Hiding/tolerating memory latency
  • Handling exceptions
  • Avoiding complexity

slide-2
SLIDE 2

Tolerating Memory Latency with Short Chimes

With short chimes, a vector machine can use the same latency-tolerance techniques as scalar processors (and also multithreading with parallel threads):

  • Hardware or software prefetch: request the data earlier
  • Static scheduling: move the load earlier in the instruction stream
  • Dynamic scheduling / out-of-order execution: execute the add later [Espasa, PhD '97]
  • Decoupled pipeline

[Figure: an instruction stream in which a Load is separated from its dependent Add by the memory latency, annotated with each technique above.]

Vectors allow simple control logic to buffer 1000s of outstanding operations

Tolerating Memory Latency with Vectors

[Figure: traditional in-order vector issue pipeline versus decoupled vector pipeline (Espasa, PhD '97); full out-of-order issue is also possible (Espasa, PhD '97). In the in-order pipeline, VLD v1 chained into VMUL v2,v1,r1 leaves the issue stage blocked for the whole memory latency. In the decoupled pipeline, an instruction queue and a load data queue keep the issue stage free.]

SLIDE 3

Memory Latency and Short Vectors

Vector instruction sequence: VLD v1; VMUL v2,v1,r1; VLD v3; VMUL v4,v3,r2

[Figure: instruction execution in time. Cray-style: VLD v1 address generation, VLD v1 data return after the memory latency, then VMUL v2,v1,r1; only then VLD v3 address, VLD v3 data, and VMUL v4,v3,r2, so the memory system and the multiplier each sit idle for long stretches. Decoupled pipeline: the VMULs and the returning VLD data are enqueued, so the address generator, data bus, and multiplier stay busy and the second load's address generation overlaps the first load's data return.]
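To make the contrast concrete, here is a minimal timing sketch in Python for the four-instruction sequence above; all latencies are assumed for illustration, not taken from the slides:

    # Minimal timing sketch with assumed latencies: contrast Cray-style
    # in-order issue with a decoupled pipeline on the sequence
    #   VLD v1; VMUL v2,v1,r1; VLD v3; VMUL v4,v3,r2
    MEM_LAT = 20   # assumed memory latency, in clocks
    VLEN    = 8    # assumed clocks to stream one vector through a unit

    def cray_style():
        # Each load's addresses, data return, and chained multiply serialize,
        # and the next load cannot start until the multiply has issued.
        t = 0
        for _ in range(2):       # two load+multiply pairs
            t += VLEN            # generate load addresses
            t += MEM_LAT         # wait for data
            t += VLEN            # chained multiply drains
        return t

    def decoupled():
        # Address generation for the second load overlaps the first load's
        # data return; multiplies drain from the load data queue.
        last_data = VLEN + MEM_LAT + VLEN   # second load finishes returning
        return last_data + VLEN             # final chained multiply drains

    print("Cray-style:", cray_style(), "clocks")  # 2*(8+20+8) = 72
    print("Decoupled: ", decoupled(), "clocks")   # 8+20+8+8 = 44

Under these assumptions the decoupled version saves one full memory latency simply by keeping the address generator busy while data is in flight.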

Decoupled Pipeline Issues

Latencies:

  • Decoupling hides memory latency in most cases but exposes it in others, notably scalar unit reads of vector unit state (scatter/gather indices, load/store masks).

Exceptions:

  • IEEE floating-point
  • Page faults for virtual memory

[Figure: scalar pipe (F D X M W) feeding instruction queues, a vector load pipe, and a vector arithmetic pipe that together span the memory latency.]

SLIDE 4

Vector IEEE Floating-Point Model

Vector FP arithmetic instructions never cause machine traps

  • (Except in special debugging modes)
  • IEEE default results handled without user-visible traps (unlike Alpha)
  • Largest expense is hardware subnormal handling

Vector FP exceptions signaled by writes to vector flag registers

  • Reserve 5 vector flag registers to receive exception information:

Invalid, DivideByZero, Overflow, Underflow, Inexact

User trap handlers: inline conditional code or trap barrier

  • Use normal vector conditional execution to handle vector FP exceptions
  • Explicit trap barrier instruction checks flags and takes precise trap

Full IEEE support at full speed in deep vector pipeline
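As a concrete illustration of the model (semantics assumed here, not quoted from any ISA), a vector divide can write IEEE default results plus a DivideByZero flag vector, and a user handler can then patch the flagged elements with ordinary masked operations instead of taking a trap:

    # Sketch of trap-free vector FP: the operation returns IEEE default
    # results and signals exceptions through a per-element flag vector.
    def vdiv_with_flags(va, vb):
        results, div_by_zero = [], []
        for x, y in zip(va, vb):
            if y == 0.0:
                # IEEE default result for x/0 (x != 0): signed infinity.
                results.append(float("inf") if x >= 0 else float("-inf"))
                div_by_zero.append(1)   # set exception flag for this element
            else:
                results.append(x / y)
                div_by_zero.append(0)
        return results, div_by_zero

    q, flags = vdiv_with_flags([1.0, -2.0, 3.0], [2.0, 0.0, 4.0])
    # Inline conditional code: fix up flagged elements under the flag mask,
    # rather than trapping. (A trap barrier would instead check the flags
    # and take a precise trap here.)
    q = [0.0 if f else v for v, f in zip(q, flags)]
    print(q, flags)   # [0.5, 0.0, 0.75] [0, 1, 0]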

Short-Running Vector Instructions Simplify Virtual Memory

  • Address translate/check (C) of whole vector takes only 4-8 clocks
  • Overlap checks with memory latency - no added latency for VM
  • Buffer following instructions until address check complete
  • For in-order machine, short vectors limit size of state to save/restore
  • For out-of-order machine, short vectors limit reorder buffer size

[Figure: scalar pipe (F D X M W) dispatching vector instructions into a pre-address-check instruction queue, an address translate/check stage (C) that can raise a page fault within a few clock cycles, a committed instruction queue, and a load data queue covering the many clock cycles of memory latency.]

SLIDE 5

Instruction Queue Design

[Figure: a single circular buffer of vector memory instruction records (PC, instruction, vlen, scalar operands) between dispatch and issue, with PAIQ head/tail, ACIQ head, and CIQ head pointers delimiting the pre-address-check, address-check, and committed regions.]
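A minimal sketch (data layout assumed) of the figure's single circular buffer, where the three queues are just regions between pointers, so moving an instruction from one queue to the next only advances a pointer:

    # Sketch: one ring buffer holds vector memory instructions; PAIQ, ACIQ,
    # and CIQ are regions delimited by head pointers.
    class VectorMemQueues:
        def __init__(self, size=16):
            self.buf = [None] * size
            self.size = size
            self.ciq_head  = 0   # oldest committed instruction
            self.aciq_head = 0   # oldest instruction awaiting address check
            self.paiq_head = 0   # oldest instruction awaiting address generation
            self.paiq_tail = 0   # next free slot for dispatch

        def dispatch(self, inst):                     # enter PAIQ
            assert (self.paiq_tail + 1) % self.size != self.ciq_head, "full"
            self.buf[self.paiq_tail] = inst
            self.paiq_tail = (self.paiq_tail + 1) % self.size

        def addresses_generated(self):                # PAIQ -> ACIQ
            self.paiq_head = (self.paiq_head + 1) % self.size

        def address_check_passed(self):               # ACIQ -> CIQ
            self.aciq_head = (self.aciq_head + 1) % self.size

        def retire(self):                             # leave CIQ
            self.buf[self.ciq_head] = None
            self.ciq_head = (self.ciq_head + 1) % self.size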

Delayed Pipeline

Replace the queues with a fixed-length instruction delay pipeline: simpler than the decoupled pipeline, with no data buffers. Works best for fixed-latency memory with few collisions.

[Figure: scalar pipe (F D X M W) followed by an instruction delay, with a vector load pipe and a vector store pipe spanning the memory latency and a vector arithmetic pipe alongside.]

Short bypass latencies

SLIDE 6

Out-of-Order Vector Execution

Simpler than scalar out-of-order execution because of the reduced instruction bandwidth. Vector register renaming solves the exception problem, but renaming has problems of its own:

  • Elements beyond vector length (change ISA to mark them undefined)
  • Masked elements (change ISA to leave them undefined; requires merges)
  • Scalar insert into vector register (make it slow so programmers avoid it)

But OOO may not be a big win given more vector registers, a better vector compiler, and a decoupled pipeline (vector loops should be mostly statically schedulable). OOO without vector register renaming may give a small boost (put OOO after address commit).

Day 3, Session B: Vector Flag Processing Model & Vector Register Files

SLIDE 7

Flags are more than Masks

Flags are used for:

  • Conditional Execution (Mask Registers)
  • Reporting Status (Popcount and Count Leading/Trailing Zeros)
  • Exception Reporting (IEEE754 FP)
  • Speculative Execution

Flag Priority Instructions

Goal: avoid the latency of a scalar read-flags then write-new-length sequence.

Approach: generate a mask vector with the correct length. Each instruction reads a flag register and writes a flag register; three forms:

  • Flag-before-first (fbf)
  • Flag-including-first (fif)
  • Flag-only-first (fof)

Also an operation that compresses a flag register: compress-flags (cpf).

[Figure: example source flags and the resulting fbf, fif, fof, and cpf flag vectors.]
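A minimal sketch of the three priority forms (semantics inferred from the names; the boundary behavior when no flag is set is my assumption), operating on a Python list of 0/1 flags:

    # Relative to the first set bit in the source flags:
    #   fbf sets every position strictly before it,
    #   fif sets every position up to and including it,
    #   fof sets only that position itself.
    # If no bit is set, fbf and fif are assumed to set all positions.
    def first_set(flags):
        return next((i for i, f in enumerate(flags) if f), len(flags))

    def fbf(flags):
        k = first_set(flags)
        return [int(i < k) for i in range(len(flags))]

    def fif(flags):
        k = first_set(flags)
        return [int(i <= k) for i in range(len(flags))]

    def fof(flags):
        k = first_set(flags)
        return [int(i == k) for i in range(len(flags))]

    src = [0, 0, 1, 0, 1, 0]
    print(fbf(src))  # [1, 1, 0, 0, 0, 0]
    print(fif(src))  # [1, 1, 1, 0, 0, 0]
    print(fof(src))  # [0, 0, 1, 0, 0, 0]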

SLIDE 8

Vector Register File Design

Construct a high-bandwidth VRF from multiple banks of less heavily multiported memory. Design decisions:

  • form of bank partitioning
  • number of banks versus ports/bank

Bank Partitioning Alternatives

[Figure: three layouts of four vector registers (V0-V3), eight elements each, across four storage banks. Register partitioned: each bank holds all eight elements of one register. Element partitioned: each bank holds the same element indices (modulo the number of banks) of every register. Register and element partitioned: each pair of banks holds two registers, with even and odd elements interleaved across the pair.]
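To pin down the three schemes, a small sketch (toy configuration assumed: 4 registers of 8 elements across 4 banks) of the mapping from register r and element e to a bank:

    # Toy mappings for the three partitioning schemes shown above.
    NREGS, NELEMS, NBANKS = 4, 8, 4

    def register_partitioned(r, e):
        # Each bank holds every element of one register.
        return r % NBANKS

    def element_partitioned(r, e):
        # Each bank holds the same element indices of every register.
        return e % NBANKS

    def register_and_element_partitioned(r, e):
        # A pair of banks holds two registers, with elements interleaved
        # even/odd across the pair.
        return (r // 2) * 2 + (e % 2)

    for fn in (register_partitioned, element_partitioned,
               register_and_element_partitioned):
        print(fn.__name__)
        for r in range(NREGS):
            print(" ", [fn(r, e) for e in range(NELEMS)])

The mapping determines which accesses can proceed in parallel: register partitioning keeps accesses to different registers conflict-free, while element partitioning lets consecutive elements of one register stream from different banks.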

SLIDE 9

Vector Register File: 1 Lane Example

[Figure: four element banks (0-3), each with write word selects and read X/read Y word selects, shared by VAFU0, VAFU1, and VMFU through per-unit write selects and read enables.]

Multiported Storage Cells

1R+1W, 2R+1W, 3R+2W, and 5R+3W cells (all designs double-pumped)

SLIDE 10

Vector Regfile: Design Comparison

All designs provide 256 64-bit elements and 5R+3W ports.

  Cell:   5R+3W   3R+2W   2R+1W   2R+1W   1R+1W
  Width:  1       1       1       2       2

Day 3, Session C: Virtual Processor (VP) Caches

  • Highly parallel primary caches for vector units
  • Reduce bandwidth demands on main memory
  • Convert strided and scatter/gather operations to unit-stride

Two forms:

  • Rake Cache (Spatial VP Cache)
  • Histogram Cache (Temporal VP Cache)

SLIDE 11

Virtual Processor Paradigm

[Figure: a scalar unit with integer registers r0-r7 and float registers f0-f7 next to a vector unit with vector data registers v0-v7 of MAXVL elements and a vector length register VLR. Vector arithmetic instructions such as VADD v3,v1,v2 operate on elements [0]..[VLR-1]; vector load/store instructions such as VLD v1,r1,r2 take a base (r1) and a stride (r2) into memory. Each element position acts as a virtual processor.]
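A tiny executable sketch (semantics assumed) of the two instruction forms in the figure, with each element position acting as a virtual processor:

    MAXVL = 8

    def vadd(v1, v2, vlr):
        # VADD v3,v1,v2: virtual processor i (i < VLR) adds one element pair.
        assert vlr <= MAXVL
        return [v1[i] + v2[i] for i in range(vlr)]

    def vld(mem, base, stride, vlr):
        # VLD v1,r1,r2: virtual processor i loads mem[base + i*stride].
        assert vlr <= MAXVL
        return [mem[base + i * stride] for i in range(vlr)]

    mem = list(range(100))
    v1 = vld(mem, base=4, stride=3, vlr=5)   # [4, 7, 10, 13, 16]
    v3 = vadd(v1, v1, vlr=5)                 # [8, 14, 20, 26, 32]
    print(v1, v3)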

Many Useful Vector Algorithms use Virtual Processor Paradigm

Developed by Blelloch et al., CMU SCANDAL group:

  • Sorting
  • Sparse Matrix-Vector Multiply
  • Connected Components
  • Linear Recurrences
  • List Ranking

But these algorithms make frequent scatter/gather and non-unit-stride accesses, and address bandwidth is expensive:

  • Address Crossbars
  • TLB ports
  • Cache Transactions
  • DRAM Page Breaks
SLIDE 12

Matrix-Vector Multiply

Strided vector accesses overall, but each virtual processor accesses a unit-stride stream: C = A x B.

[Figure: matrix-vector multiply with row-major matrix storage; VP0-VP7 each walk one row of A as a unit-stride stream.]

Rake Cache

KEY IDEA: associate one (or more) cache lines with each virtual processor.

Advantages over a shared cache:

  • Access local to lane, lower energy and compact layout
  • High-bandwidth without multiport or interleaved memories
  • No inter-VP conflicts, power-of-2 stride OK!

[Figure: vector data registers v0-v7, elements [0]..[MAXVL-1], with a separate cache line per VP.]
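An illustrative back-of-the-envelope sketch (sizes assumed) of why per-VP lines cut address traffic: each VP streams unit-stride through its own line, so a new address is needed only once per line, consistent with the up-to-4x figure on the next slide:

    # With an L-word rake cache line, each VP issues one address per L
    # unit-stride elements instead of one per element.
    LINE_WORDS = 4          # assumed rake cache line size, in words

    def rake_address_traffic(n_vps, words_per_vp, line_words):
        per_vp = -(-words_per_vp // line_words)   # ceil: one miss per line
        return n_vps * per_vp

    n_vps, words_per_vp = 8, 64
    no_cache = n_vps * words_per_vp                               # 512 addresses
    rake = rake_address_traffic(n_vps, words_per_vp, LINE_WORDS)  # 128 addresses
    print(no_cache, "->", rake)   # 4x fewer addresses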

SLIDE 13

Rake Cache for Matrix-Vector Multiply

With a 4-word cache line, the rake cache can reduce address bandwidth by up to 4x.

[Figure: C = A x B with a four-word rake cache line per VP; VP0-VP7 each hold one line of their row of A.]

Other Forms of Rake

  • 1D Strided Rake
  • Indexed Rake (parallel structure access)

[Figure: VP0-VP7 access patterns for the strided and indexed rakes.]

SLIDE 14

Rake Cache Design

Explicitly Selected and Indexed

  • Strided and indexed instructions specify use of rake cache (and which line if more than one)

Non-coherent

  • weak vector consistency model, flush at vector memory barrier instructions

Virtually Tagged

  • reduces TLB accesses
  • weak vector consistency model, no problem with synonyms

Per Byte Dirty Bits

  • Avoids the false sharing problem: only write back modified bytes

[Figure: a single rake cache line, virtually tagged with its VPN and holding the translated PPN, a valid bit per line, and a dirty bit per data byte.]
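A minimal sketch (field layout assumed from the bullets above) of one rake cache line with per-byte dirty bits, writing back only modified bytes:

    # One rake cache line: virtually tagged, valid bit per line, dirty bit
    # per byte so only modified bytes are written back (no false sharing).
    class RakeCacheLine:
        def __init__(self, line_bytes=32):
            self.valid = False
            self.vpn = None                      # virtual page number (tag)
            self.ppn = None                      # translated physical page
            self.data = bytearray(line_bytes)
            self.dirty = [False] * line_bytes

        def store_byte(self, offset, value):
            self.data[offset] = value
            self.dirty[offset] = True

        def write_back(self, write_byte):
            # write_byte(ppn, offset, value): only dirty bytes go to memory.
            for i, d in enumerate(self.dirty):
                if d:
                    write_byte(self.ppn, i, self.data[i])
            self.dirty = [False] * len(self.dirty)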

Rake Cache Implementation

[Figure: rake cache implementation. An address generator combines base, stride, and rake index; the virtual page number is compared against the rake line's VTag (page hit?) and the cache index against an index FIFO (line hit?). Hits move 256-bit data between the cache and the vector register file over the data bus; misses use the stored PPN to send the physical address over the physical address bus, writing back dirty data as needed.]

SLIDE 15

VP Cache Performance

Garbage collection in SPECint95 li:

  • Rake cache removes 37% of address traffic

OpenGL rasterization (Aaron Brown, UCB):

  • Rake cache removes 68% of address traffic

Radix sorting:

  • Rake cache + histogram cache remove 78% of address traffic and 57% of data traffic

IDEA encryption:

  • Rake cache removes 74% of address traffic

All assuming a 1KB rake cache and a 32KB histogram cache.

Day 3, Session D: Vector IRAM (UC Berkeley)

  • Profs. Patterson, Yelick, Kubiatowicz

http://iram.cs.berkeley.edu/

SLIDE 16

Key Observation

DRAM capacity increases 4x per generation (every 3 years), while DRAM cost/bit decreases 2x per generation => fewer DRAMs per system at constant cost: constant dollars buy only 2x more bits per generation while each chip holds 4x more, so the chip count halves each generation.

IRAM Will Engulf Low-End Market

[Figure: system memory capacity (8MB up to 2GB) versus DRAM/IRAM generation (64Mb up to 16Gb), comparing a constant-system-price curve against the largest practical IRAM; IRAM-PDA, VIRAM-1, and IRAM-PC mark where single-chip IRAM overtakes the low-end system.]

SLIDE 17

IRAM

Put processor and DRAM main memory on same die

  • Memory latency 5x-10x improvement
  • Memory bandwidth 50-100x improvement
  • Lower power
  • Smaller board size

What Type of Processor for IRAM?

  • Must be small to leave room for DRAM
  • Must convert 100s GB/s memory bandwidth into application speedups
  • Desire low power

=> Vector!

Vector IRAM: Pocket Supercomputer

  • 16-32MB on-chip DRAM
  • Scalar unit + I/D caches
  • Vector unit
  • 12.8 GB/s memory bandwidth
  • 1.6 GFLOPS (64-bit), 3.2 GFLOPS (32-bit), 6.4 GOPS (16-bit)
  • Single die, 256Mb merged DRAM/logic technology
  • Serial I/O lines: 8 x 1Gb/s
  • 200 MHz, ~2W

SLIDE 18

VIRAM-1 Block Diagram

[Figure: VIRAM-1 block diagram: a scalar unit with instruction cache, scalar data cache, and write buffer; vector and flag registers feeding two vector flag functional units (VFFUs), two vector arithmetic units (VAFU0, VAFU1), and two vector memory units (VMFU0, VMFU1); all connected to on-chip DRAM and I/O.]

Vector Memory Subsystem

[Figure: CPU with I$ and D$ plus I/O connected through crossbars (64-bit scalar and 256-bit vector paths) to two wings of DRAM banks, each bank split into sub-banks.]

Total memory: 32MB. 2 wings x 8 banks x 8 sub-banks = 128 sub-banks. Each sub-bank is 1K rows by 2K bits. Each column access returns 256 bits. Total random-access memory latency is 12-15 clocks at 200MHz.
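A quick check of the capacity arithmetic:

    # 2 wings x 8 banks x 8 sub-banks, each 1K rows x 2K bits.
    subbanks = 2 * 8 * 8                    # = 128 sub-banks
    total_bits = subbanks * 1024 * 2048     # rows x bits per row
    print(total_bits // (8 * 2**20), "MB")  # -> 32 MB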

SLIDE 19

VMFU Questions

How many and what type of load/store units? How many addresses per cycle per unit?

Vector Load and Store Components

Load:  generate address(es) -> translate address(es) -> send physical address to DRAM -> access memory -> receive data from DRAM -> write vector register file

Store: generate address(es) -> translate address(es) -> invalidate scalar data cache -> read vector register file -> send physical address to DRAM -> send data to DRAM -> access memory

SLIDE 20

Two Wings Make Two VMFUs Cheap

We already have two data and address crossbars. At the cost of a second address generator, a TLB port, a two-way address multiplexer, and an extra vector register write port, we can add a second, load-only VMFU. VIRAM-1 has one load/store unit plus one load-only unit. Unit-stride operations alternate between wings and so synchronize after the first collision.

[Figure: VMFU0 and VMFU1 alternating accesses between Wing A and Wing B over time.]

Supporting Fast Non-Unit Stride with Many Addresses per Cycle Is Expensive

Need:

  • more address generators
  • more TLB ports
  • more scalar data cache ports (stores only)
  • more address crossbar
  • conflict detection logic
  • more data crossbar control
  • maybe more buffering to smooth out conflicts

=> Not obvious that more addresses per cycle is the best use of silicon.

SLIDE 21

IRAM Memory Subsystem

[Figure: CPU, I$, D$, and I/O sharing the crossbars with VMFU0 and VMFU1; each VMFU has A and B ports into Wing A and Wing B through per-wing reference stages (Ref.A, Ref.B).]

Drive DRAM with Fixed Pipeline

[Figure: fixed DRAM pipeline timing. Each access steps through precharge (PCH), row access (RAS), and column access (CAS) before returning load data. Addresses A0-A3 to an open row pipeline their CAS starts back to back; a row miss leaves the bank busy through a new PCH/RAS sequence, creating a potential structural hazard for the following access B0.]
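A minimal sketch (stage lengths assumed) of the fixed-pipeline scheduling in the figure: row hits start a CAS every clock, while a row miss must wait out the bank's precharge and row access, which is the structural hazard shown:

    # Fixed DRAM pipeline: assumed stage lengths, in clocks.
    PCH, RAS, CAS = 2, 3, 3

    def cas_start_times(accesses):
        # accesses: list of (bank, row) in program order.
        open_row, busy_until = {}, {}
        t, times = 0, []
        for bank, row in accesses:
            if open_row.get(bank) != row:            # row miss
                t = max(t, busy_until.get(bank, 0))  # stall while bank busy
                t += PCH + RAS                       # precharge + open new row
                open_row[bank] = row
            times.append(t)                          # CAS start for this access
            busy_until[bank] = t + CAS
            t += 1                                   # one CAS start per clock
        return times

    # A0-A3 hit one open row and pipeline back to back; B0 misses the row.
    print(cas_start_times([(0, 5), (0, 5), (0, 5), (0, 5), (0, 9)]))
    # -> [5, 6, 7, 8, 16]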
