SLIDE 1

Cache Refill/Access Decoupling for Vector Machines

Christopher Batten, Ronny Krashinsky, Steve Gerding, Krste Asanović
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
September 23, 2004

SLIDE 2

Outline

  • Motivation
    – Large bandwidth-delay product memory systems
    – Access parallelism and resource requirements
  • The SCALE memory system
    – Baseline SCALE memory system
    – Refill/access decoupling
    – Vector segment accesses
  • Evaluation
  • Conclusions

SLIDE 3

Bandwidth-Delay Product

  • Modern memory systems
    – Increasing latency: higher frequency processors
    – Increasing bandwidth: DDR, highly pipelined, interleaved banks

  • These trends combine to yield very large and growing bandwidth-delay products
    – Number of bytes of memory bandwidth per processor cycle times the number of processor cycles for a round-trip memory access
    – To saturate such memory systems, processors must be able to generate and manage many hundreds of outstanding elements

[Figure: bandwidth-delay product illustrated for higher frequency versus lower frequency processors, as the product of bandwidth (BW) and latency.]
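As a concrete check on these numbers, here is a minimal C++ sketch (not from the talk) that computes the bandwidth-delay product for the parameters quoted later in the evaluation: 8 bytes/cycle of bandwidth, a 100-cycle round-trip latency, and the 32-byte cache lines implied by 800 bytes = 25 lines.

    #include <cstdio>

    int main() {
        const int bytes_per_cycle = 8;    // memory bandwidth per processor cycle
        const int latency_cycles  = 100;  // round-trip memory latency
        const int line_bytes      = 32;   // line size implied by the evaluation slide

        const int bdp_bytes = bytes_per_cycle * latency_cycles;
        std::printf("bandwidth-delay product: %d bytes = %d cache lines\n",
                    bdp_bytes, bdp_bytes / line_bytes);  // 800 bytes = 25 lines
        return 0;
    }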

SLIDE 4

Access Parallelism

  • Memory accesses which are independent and thus can be performed in parallel exhibit access parallelism
  • The addresses of such accesses are usually known well in advance
  • We can exploit access parallelism to saturate large bandwidth-delay memory systems

    loop:
      load
      load
      compute
      store
    end

[Figure: timeline showing the loop's two loads (L) and store (S) in flight over time.]
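A concrete C++ rendering of the slide's loop (essentially the VVAdd microkernel evaluated later); the array names are illustrative. Each iteration's loads are independent of every other iteration's, which is exactly the access parallelism a decoupled machine can exploit.

    // Two independent loads, a compute, and a store per iteration; no
    // iteration depends on another, so many loads can be in flight at once.
    void vvadd(const int* x, const int* y, int* z, int n) {
        for (int i = 0; i < n; i++) {
            int a = x[i];    // load
            int b = y[i];    // load
            int c = a + b;   // compute
            z[i] = c;        // store
        }
    }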



SLIDE 7

Access Parallelism

Exploiting access parallelism requires

  • Access management state
  • Reserved element data storage

[Figure: timeline with the two loads and the store of iterations 1, 2, and 3 all in flight simultaneously.]

SLIDE 8

Structured Access Parallelism

  • The amount of required access management state and reserved element data storage scales roughly linearly with the number of outstanding elements
  • Structured access parallelism is when the addresses of parallel accesses form a simple pattern, such as each address having a constant offset from the previous address

Goal: Exploit structured access parallelism to saturate large bandwidth-delay product memory systems, while efficiently utilizing the available access management state and reserved element data storage
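A minimal sketch of what makes such parallelism "structured": with a constant stride, the entire access stream is described by three scalars rather than one record per outstanding element, so hardware can regenerate addresses instead of storing them. The names are illustrative.

    #include <cstddef>
    #include <cstdint>

    // A compact descriptor from which the i-th address is recomputed on
    // demand, rather than held in per-element access management state.
    struct StridedAccess {
        std::uintptr_t base;
        std::size_t    stride;  // constant offset between consecutive elements
        std::size_t    length;  // number of elements

        std::uintptr_t address(std::size_t i) const { return base + i * stride; }
    };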

SLIDE 9

Access Parallelism in SCALE

  • SCALE is a highly decoupled vector-thread processor
    – Several parallel execution units effectively exploit data-level compute parallelism
    – A vector memory access unit attempts to bring whole vectors of data into vector registers, as in traditional vector machines
    – Includes a unified cache to capture the temporal and spatial locality readily available in some applications
    – Cache is non-blocking to enable many overlapping misses
  • We introduce two mechanisms which enable the SCALE processor to more efficiently exploit access parallelism
    – Vector memory refill unit provides refill/access decoupling
    – Vector segment accesses represent a common structured access pattern in a more compact form

SLIDE 10

The SCALE Memory System

[Diagram: the SCALE memory system. The control processor (CP) feeds the VEU, load, and store command queues; the vector memory access unit (VMAU) sends addresses and the vector execution unit (VEU) sends store data to a non-blocking cache containing tags, data, a load data queue (LDQ), a replay queue, and pending tags, backed by main memory.]

SLIDE 11

The SCALE Memory System

Control processor issues commands to the vector memory access unit and the vector execution unit.

SLIDE 12

The SCALE Memory System

Command queues allow decoupled execution.
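A hedged C++ sketch of the decoupling on this slide: the CP pushes commands into per-unit queues and each unit drains its own queue at its own rate, so a stall in one unit does not immediately back up the others. The types and names are assumptions for illustration.

    #include <cstdint>
    #include <queue>

    struct VectorCmd { std::uint8_t opcode; std::uintptr_t base; std::uint32_t vlen; };

    std::queue<VectorCmd> vmau_cmdq;  // CP -> vector memory access unit
    std::queue<VectorCmd> veu_cmdq;   // CP -> vector execution unit

    // The CP issues a command and moves on; the VMAU and VEU each pop
    // from their own queue when ready, letting the memory side run
    // ahead of (or behind) the execute side.
    void cp_issue(const VectorCmd& c) {
        vmau_cmdq.push(c);
        veu_cmdq.push(c);
    }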

SLIDE 13

Tracing a Vector Load

Control processor issues a vector load command to the VMAU: vlw rbase, vr1

SLIDE 14

Tracing a Vector Load

VMAU breaks the vector load into multiple cache-bandwidth-sized blocks and reserves storage in the load data queue.
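A hedged sketch of this step: the vector load is chopped into cache-bandwidth-sized blocks, with load data queue storage reserved before each request so returning data always has a home. The block size and helper names are assumptions, not SCALE's actual interface.

    #include <cstddef>
    #include <cstdint>

    constexpr std::size_t kCacheBlockBytes = 8;  // assumed bytes per cache request

    int  ldq_reserve(std::size_t bytes) { return 0; }         // stub: reserve an LDQ slot
    void cache_request(std::uintptr_t addr, int ldq_slot) {}  // stub: issue a block request

    // Break one unit-stride vector load into per-block cache requests,
    // reserving LDQ storage up front for each block.
    void vmau_vector_load(std::uintptr_t base, std::size_t vlen, std::size_t elem_bytes) {
        const std::size_t total_bytes = vlen * elem_bytes;
        for (std::size_t off = 0; off < total_bytes; off += kCacheBlockBytes) {
            int slot = ldq_reserve(kCacheBlockBytes);
            cache_request(base + off, slot);
        }
    }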

SLIDE 15

Tracing a Vector Load

VMAU makes a cache request for each block; if the request is a hit, the data is written into the load data queue.

SLIDE 16

Tracing a Vector Load

VEU executes a register writeback command to move the data into the architectural register.

SLIDE 17

Tracing a Vector Load

On a miss, the cache allocates a new pending tag and replayQ entry.

SLIDE 18

Tracing a Vector Load

If needed, the cache reserves a victim line in the cache data array.

SLIDE 19

Tracing a Vector Load

If a pending tag for the desired line already exists, then the cache just needs to add a new replayQ entry.

SLIDE 20

Tracing a Vector Load

When a refill returns from memory, the cache writes the refill data into the data RAM.

SLIDE 21

Tracing a Vector Load

The cache then replays each entry in the replay queue, sending data to the LDQ as needed.
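The miss path traced on the last few slides can be summarized in one hedged C++ sketch; the data structures and names are illustrative assumptions, not the actual SCALE implementation.

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    using Addr = std::uintptr_t;
    constexpr Addr kLineBytes = 32;  // assumed cache line size

    struct ReplayEntry { Addr addr; int ldq_slot; };

    struct NonBlockingCache {
        std::unordered_map<Addr, bool> valid;                        // lines with data present
        std::unordered_map<Addr, std::vector<ReplayEntry>> pending;  // pending tags

        void issue_refill(Addr line) {}               // stub: request the line from memory
        void deliver_to_ldq(const ReplayEntry& e) {}  // stub: write data into the LDQ slot

        // VMAU access: true on a hit; on a miss, record a replayQ entry,
        // and only allocate a pending tag (plus a refill request and a
        // victim line) if the line is not already in flight.
        bool access(Addr addr, int ldq_slot) {
            const Addr line = addr & ~(kLineBytes - 1);
            if (valid.count(line)) return true;         // hit: data goes to the LDQ now
            const bool in_flight = pending.count(line) != 0;
            pending[line].push_back({addr, ldq_slot});  // new replayQ entry
            if (!in_flight) issue_refill(line);         // new miss: new pending tag
            return false;
        }

        // Refill return: write the data RAM, then replay every waiting entry.
        void refill_done(Addr line) {
            valid[line] = true;
            for (const ReplayEntry& e : pending[line]) deliver_to_ldq(e);
            pending.erase(line);
        }
    };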

SLIDE 22

Tracing a Vector Load

Large numbers of outstanding accesses require a great deal of access management state and reserved element data storage.

SLIDE 23

Required Queuing Resources

[Diagram: program execution timeline showing the CP and VMAU running ahead of the VEU, with VMAU loads and stores issued a full memory latency before their use.]

SLIDE 24

Required Queuing Resources

[Diagram: the same timeline annotated with the buffering that must cover the memory latency: the VEU command queue, load and store command queues, load data queue, replay queue, and pending tags.]

SLIDE 25

Vector Memory Refill Unit

Add a decoupled vector memory refill unit (VMRU) to bring lines into the cache before the VMAU accesses them.

[Diagram: the baseline memory system extended with the VMRU, its own command queue, and a miss address file alongside the pending tags.]

SLIDE 26

Vector Memory Refill Unit

  • VMRU runs ahead of the VMAU and pre-executes vector load commands
    – Issues refill requests for each cache line the vector load requires
    – Uses the cache as an efficient prefetch buffer for vector accesses, but because it is a cache, the buffer also captures reuse
    – Ideally the VMRU is far enough ahead that the VMAU always hits
  • Key implementation concerns (a sketch of the throttling logic follows below)
    – Throttling the VMRU to prevent it from evicting lines which have yet to be used by the VMAU
    – Throttling the VMRU to prevent it from using up all the cache miss resources and blocking the VMAU
    – Throttling the VMAU to enable the VMRU to get ahead for memory-bandwidth-limited applications
    – Interaction between the VMRU and the cache replacement policy
    – Handling vector stores: allocating versus non-allocating
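The following C++ fragment is a hypothetical sketch of the first two throttling concerns; the counters and thresholds are invented for illustration and are not SCALE's actual policy.

    #include <cstddef>

    // The VMRU holds off issuing another refill when it would either
    // exhaust the miss resources the VMAU also needs, or run so far
    // ahead that new refills start evicting lines the VMAU has not
    // yet read. Both limits are illustrative assumptions.
    struct VmruThrottle {
        std::size_t pending_refills = 0;  // VMRU misses currently outstanding
        std::size_t lines_ahead     = 0;  // refilled lines the VMAU has not consumed
        std::size_t max_refills;          // leaves miss resources for the VMAU
        std::size_t max_ahead;            // bounded by cache capacity

        bool may_issue_refill() const {
            return pending_refills < max_refills && lines_ahead < max_ahead;
        }
    };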

SLIDE 27

Required Queuing Resources

[Diagram: the decoupled timeline with the VMRU running a memory latency ahead of the VMAU; compact command queues and pending tags now cover the latency instead of the load data and replay queues.]

Trade off an increase in compact command queues for a drastic decrease in expensive replay and load data queues.


SLIDE 30

Vector Segment Accesses

  • Vector processors usually use multiple strided accesses to load streams of records or groups of columns into vector registers
  • Several disadvantages (a scalar rendering of this approach appears after the figure below)
    – Increases bank conflicts in banked caches or memory systems
    – Ignores spatial locality in the application
    – Makes inefficient use of access management state

[Figure: stream corner turn, loading the fields of consecutive records from memory into vector registers vr1, vr2, and vr3.]
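A scalar C++ rendering (with an assumed three-field record layout) of the conventional approach just described: one strided pass over memory per record field.

    #include <cstddef>

    struct Record { float x, y, z; };  // assumed record layout

    // Three separate strided accesses, one per field. Each pass strides
    // through memory by sizeof(Record), so every cache line of the
    // stream is touched three times and each pass needs its own
    // per-element access management state.
    void strided_corner_turn(const Record* mem, std::size_t vlen,
                             float* vr1, float* vr2, float* vr3) {
        for (std::size_t i = 0; i < vlen; i++) vr1[i] = mem[i].x;
        for (std::size_t i = 0; i < vlen; i++) vr2[i] = mem[i].y;
        for (std::size_t i = 0; i < vlen; i++) vr3[i] = mem[i].z;
    }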

SLIDE 31

Vector Segment Accesses

  • Segment accesses use efficient buffering under the existing data crossbar to perform the stream corner turn in hardware with a single vector memory command
  • Reads data from the cache in a unit-stride fashion and then writes it into the vector register file one element at a time

[Figure: the same stream corner turn performed by a single vector segment access.]
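By contrast, a segment access makes a single unit-stride pass over memory and performs the corner turn as it goes. This scalar sketch (same assumed record type as above) shows the software-visible effect of one vector segment load.

    #include <cstddef>

    struct Record { float x, y, z; };  // same assumed record layout as above

    // One unit-stride pass over memory; each record's fields are
    // scattered into the three destination registers as they arrive,
    // so one command and one pass replace three strided accesses.
    void segment_load(const Record* mem, std::size_t vlen,
                      float* vr1, float* vr2, float* vr3) {
        for (std::size_t i = 0; i < vlen; i++) {
            vr1[i] = mem[i].x;
            vr2[i] = mem[i].y;
            vr3[i] = mem[i].z;
        }
    }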

SLIDE 32

Evaluation

  • Microarchitectural C++ model of the control processor, VMRU, VMAU, VEU, and non-blocking cache
  • "Magic" main memory with a latency of 100 cycles and a bandwidth of 8 bytes/cycle to model the planned SCALE prototype system
    – Bandwidth-delay product is 800 bytes = 25 cache lines
  • A selection of kernels and microkernels which exhibit a wide variety of characteristics
    – Cooley-Tukey Fast Fourier Transform
    – Inverse Discrete Cosine Transform
    – Vertex 3D to 2D projection
    – Matrix transpose
    – Color conversion

SLIDE 33

VVAdd Word Microkernel

[Plot: normalized performance versus total load data queue size and total replay queue size.]

SLIDE 34

RGBYIQ Kernel

[Plot: normalized performance versus total load data queue size and total replay queue size.]

SLIDE 35

Performance vs. Mem Latency

SLIDE 36

Conclusions

  • Saturating large bandwidth-delay memory systems requires many in-flight elements and thus a great deal of access management state and reserved element data storage
  • The SCALE processor uses refill/access decoupling and vector segment accesses to efficiently saturate its memory system with hundreds of outstanding accesses

Paper to appear in the 37th International Symposium on Microarchitecture, December 2004.