SLIDE 1

Cache Refill/Access Decoupling for Vector Machines

Christopher Batten, Ronny Krashinsky, Steve Gerding, Krste Asanović

Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology
December 8, 2004

SLIDE 2

Cache Refill/Access Decoupling for Vector Machines

  • Intuition
    – Motivation and Background
    – Cache Refill/Access Decoupling
    – Vector Segment Memory Accesses
  • Evaluation
    – The SCALE Vector-Thread Processor
    – Selected Results

SLIDE 3

Applications with Ample Memory Access Parallelism

Turning access parallelism into performance is challenging

[Figure: processor architecture connected to a modern high-bandwidth memory system]

SLIDE 4

Applications with Ample Memory Access Parallelism

Turning access parallelism into performance is challenging

[Figure: processor architecture connected to a modern high-bandwidth memory system]

Target application domain
  – Streaming
  – Embedded
  – Media
  – Graphics
  – Scientific

SLIDE 5

Applications with Ample Memory Access Parallelism

Turning access parallelism into performance is challenging

[Figure: processor architecture connected to a modern high-bandwidth memory system]

Techniques for high bandwidth memory systems
  – DDR interfaces
  – Interleaved banks
  – Extensive pipelining

Target application domain
  – Streaming
  – Embedded
  – Media
  – Graphics
  – Scientific

SLIDE 6

Applications with Ample Memory Access Parallelism

Turning access parallelism into performance is challenging

[Figure: processor architecture connected to a modern high-bandwidth memory system]

Many architectures have difficulty turning memory access parallelism into performance since they are unable to fully saturate their memory systems

SLIDE 7

Applications with Ample Memory Access Parallelism

Turning access parallelism into performance is challenging

[Figure: processor architecture connected to a modern high-bandwidth memory system]

Memory access parallelism is poorly encoded in a scalar ISA
Supporting many in-flight accesses is very expensive

SLIDE 8

Applications with Ample Memory Access Parallelism

Turning access parallelism into performance is challenging

[Figure: vector architecture connected to a modern high-bandwidth memory system]

Supporting many in-flight accesses is very expensive

SLIDE 9

Applications with Ample Memory Access Parallelism

Turning access parallelism into performance is challenging

[Figure: vector architecture with a non-blocking data cache in front of modern high-bandwidth main memory]

A data cache helps reduce off-chip bandwidth costs at the expense of additional on-chip hardware

SLIDE 10

Each in-flight access has an associated hardware cost

[Figure: Processor / Cache / Memory with a 100 cycle memory latency; a primary miss triggers a cache refill]

SLIDE 11

Each in-flight access has an associated hardware cost

[Figure: as above, highlighting the reserved element data buffering and access management state held for each in-flight access]

SLIDE 12

Saturating modern memory systems requires many in-flight accesses

Main memory bandwidth-delay product: 1 element/cycle × 100 cycles = 100 in-flight elements

[Figure: Processor / Cache / Memory with a 100 cycle memory latency; primary and secondary misses trigger a cache refill]

SLIDE 13

Caches increase the effective bandwidth-delay product

Effective bandwidth-delay product: 2 elements/cycle × 100 cycles = 200 in-flight elements

[Figure: Processor / Cache / Memory with a 100 cycle memory latency; primary and secondary misses trigger a cache refill]
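The in-flight element counts on slides 12 and 13 are simply bandwidth multiplied by latency; a minimal check of that arithmetic, using only the rates quoted on the slides:

// Bandwidth-delay arithmetic from slides 12-13 (values are the ones on the slides).
#include <cstdio>

int main() {
  const int latency_cycles = 100;  // main memory latency
  const int mem_rate       = 1;    // elements/cycle from main memory (slide 12)
  const int cache_rate     = 2;    // elements/cycle through the cache (slide 13)
  std::printf("in-flight elements to saturate main memory: %d\n", mem_rate * latency_cycles);
  std::printf("in-flight elements with a cache (effective): %d\n", cache_rate * latency_cycles);
  return 0;
}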

SLIDE 14

Goal For This Work

Reduce the hardware cost of non-blocking caches in vector machines while still turning access parallelism into performance by saturating the memory system

SLIDE 15

In a basic vector machine a single vector instruction operates on a vector of data

[Figure: control processor and vector processor (memory unit, four FUs, vector registers vr0-vr2) connected to the memory system]

SLIDE 16

In a basic vector machine a single vector instruction operates on a vector of data

[Figure: as above, executing vlw vr2, r1]

SLIDE 17

In a basic vector machine a single vector instruction operates on a vector of data

[Figure: as above, executing vadd vr0, vr1, vr2]

SLIDE 18

In a basic vector machine a single vector instruction operates on a vector of data

[Figure: as above, executing vsw vr0, r2]
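Taken together, the vlw / vadd / vsw sequence on slides 16-18 is the vector encoding of a simple element-wise loop. A scalar C++ sketch of the same computation follows; the array names are illustrative, and the load that fills vr1 is assumed rather than shown on the slides.

// Scalar equivalent of:  vlw vr2, r1 ; vadd vr0, vr1, vr2 ; vsw vr0, r2
void vvadd(const int* src1, const int* src2, int* dst, int vlen) {
  for (int i = 0; i < vlen; ++i) {
    int b = src1[i];   // vlw  vr2, r1  : vector load from the address in r1
    int a = src2[i];   // (the load that fills vr1 is assumed, not shown)
    dst[i] = a + b;    // vadd vr0, vr1, vr2  followed by  vsw vr0, r2
  }
}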

SLIDE 19

In a decoupled vector machine the vector units are connected by queues

[Figure: control processor, vector execution unit (VEU), vector load unit (VLU), and vector store unit (VSU) connected to the memory system through the VEU-CmdQ, VLU-CmdQ, VSU-CmdQ, VLDQ, and VSDQ]
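A minimal sketch of that decoupling: the control processor, vector load unit, and vector execution unit interact only through queues, so the load unit can run ahead of execution. This is a simplification for illustration (the structure and field names are assumed), not the SCALE microarchitecture.

// Decoupled units connected by queues (a simplification of the slide's diagram).
#include <cstdint>
#include <queue>

struct VectorLoadCmd { uint32_t base, stride, vlen; };

std::queue<VectorLoadCmd> vlu_cmdq;   // control processor -> vector load unit
std::queue<int>           vldq;       // vector load unit  -> vector execution unit (load data)

// Control processor: issues commands and runs far ahead.
void cp_issue(const VectorLoadCmd& c) { vlu_cmdq.push(c); }

// Vector load unit: drains commands, pushes element data into the VLDQ.
void vlu_step(const int* memory) {
  if (vlu_cmdq.empty()) return;
  VectorLoadCmd c = vlu_cmdq.front(); vlu_cmdq.pop();
  for (uint32_t i = 0; i < c.vlen; ++i)
    vldq.push(memory[(c.base + i * c.stride) / sizeof(int)]);
}

// Vector execution unit: consumes data whenever it is available.
bool veu_step(int& value) {
  if (vldq.empty()) return false;     // only the consumer stalls, not the whole machine
  value = vldq.front(); vldq.pop();
  return true;
}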

SLIDE 20

Non-blocking caches require extra state to manage outstanding misses

[Figure: decoupled vector machine (control processor, VEU, VLU, VSU, command queues, VLDQ, VSDQ) with a non-blocking cache holding a tag array, data array, and MSHRs (miss tags and replay queues) in front of main memory]

SLIDE 21

Control processor issues a vector load command to vector units

[Figure: as slide 20, plus a Proc / Cache / Mem timeline]

SLIDE 22

Vector load unit reserves storage in the vector load data queue

[Figure: as above]

SLIDE 23

If request is a hit, then data is written into the VLDQ

[Figure: as above, timeline marked HIT]

SLIDE 24

VEU executes writeback command to move data into architectural register

[Figure: as above, timeline marked HIT]

SLIDE 25

On a primary miss, cache allocates a new miss tag and replay queue entry

[Figure: as above, timeline marked MISS]

Replay Queue Entries

  • Target register specifier
  • Cache line offset
  • Other management state
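The miss-handling flow on slides 25-30 (miss tags, replay queue entries, and replays when the refill returns) can be captured in a short sketch; this is an assumed simplification for illustration, not the SCALE cache implementation.

// Sketch of non-blocking cache miss handling: one miss tag (MSHR) per in-flight
// line, one replay queue entry per pending element access. Illustrative only.
#include <cstdint>
#include <deque>
#include <unordered_map>
#include <unordered_set>

struct ReplayEntry {          // what slide 25 lists per pending access
  int      target_reg;        // target register specifier
  uint32_t line_offset;       // cache line offset of the element
};

class NonBlockingCache {
 public:
  // Returns true on a hit; otherwise queues a replay entry, allocating a new
  // miss tag and issuing a refill only on a primary miss.
  bool access(uint32_t line_addr, const ReplayEntry& e) {
    if (present_.count(line_addr)) return true;                // hit
    auto it = miss_tags_.find(line_addr);
    if (it == miss_tags_.end()) {                              // primary miss
      it = miss_tags_.emplace(line_addr, std::deque<ReplayEntry>{}).first;
      issue_refill(line_addr);                                 // one refill per line
    }
    it->second.push_back(e);                                   // secondary miss: replay entry only
    return false;
  }

  // Refill returned from memory: replay every pending access for the line.
  void refill_complete(uint32_t line_addr) {
    present_.insert(line_addr);
    for (const ReplayEntry& e : miss_tags_[line_addr]) replay(e);
    miss_tags_.erase(line_addr);
  }

 private:
  void issue_refill(uint32_t) { /* request line from main memory */ }
  void replay(const ReplayEntry&) { /* write data into the reserved VLDQ entry */ }
  std::unordered_map<uint32_t, std::deque<ReplayEntry>> miss_tags_;  // MSHRs
  std::unordered_set<uint32_t> present_;                             // stands in for the tag array
};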

SLIDE 26

On a primary miss, cache allocates a new miss tag and replay queue entry

[Figure: as above, timeline marked MISS and REFILL]

SLIDE 27

On a secondary miss, cache just allocates a new replay queue entry

[Figure: as above, timeline marked REFILL and MISS]

SLIDE 28

Processor is free to continue issuing requests which may hit in the cache

[Figure: as above, timeline marked HIT, MISS, and REFILL]

SLIDE 29

When the refill returns from memory, the cache replays each pending access

[Figure: as above, timeline marked MISS, REFILL, and HIT]

SLIDE 30

When the refill returns from memory, the cache replays each pending access

[Figure: as above, timeline marked MISS, REPLAY, REFILL, and HIT]

SLIDE 31

Expensive hardware is required to support many in-flight accesses

[Figure: decoupled vector machine with non-blocking cache, highlighting the VLDQ, VSDQ, replay queues, and miss tags]

SLIDE 32

Effective decoupling requires command and data queuing

[Figure: program execution timeline for the CP, VLU, VEU, and VSU against main memory (tags and data)]

SLIDE 33

Effective decoupling requires command and data queuing

[Figure: as above, showing the VEU-CmdQ and VLU-CmdQ entries needed while the units run decoupled]

SLIDE 34

Effective decoupling requires command and data queuing

[Figure: as above, adding the VSU-CmdQ]

SLIDE 35

Effective decoupling requires command and data queuing

[Figure: as above, adding the VLDQ and VSDQ entries]

SLIDE 36

Saturating memory system with many misses requires additional queuing

[Figure: as above, adding the miss tags and replay queue entries]

SLIDE 37

Saturating memory system with many misses requires additional queuing

[Figure: as above]

SLIDE 38

Saturating memory system with many misses requires additional queuing

[Figure: as above]

SLIDE 39

Saturating memory system with many misses requires additional queuing

[Figure: as above, with the bandwidth-delay product marked on the timeline]

SLIDE 40

Refill/access decoupling prefetches lines into cache

[Figure: two Processor / Cache / Memory timelines: without a VRU (primary miss, secondary misses, replay) and with refill/access decoupling (prefetch followed by hits)]

SLIDE 41

Refill/access decoupling prefetches lines into cache

[Figure: Processor / Cache / Memory timeline with a prefetch followed by hits]

  • Acts as an inexpensive and non-speculative hardware prefetch
  • Only need one prefetch per cache line
  • Prefetch requests are cheaper than the actual accesses (see the sketch below)
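A minimal sketch of the refill walk for a unit-stride vector load command, issuing one refill request per cache line touched. The function and structure names are illustrative; the 32B line size is the one given later on slide 66.

// Vector refill unit: walk a unit-stride load command and request each cache
// line once, well before the VLU performs the element accesses. Illustrative.
#include <cstdint>
#include <cstdio>

constexpr uint32_t kLineBytes = 32;   // SCALE uses 32B cache lines (slide 66)

struct VectorLoadCmd {
  uint32_t base;        // base address in bytes
  uint32_t elem_bytes;  // element size
  uint32_t vlen;        // vector length
};

// Stand-in for probing the tags and allocating a miss tag on a miss.
void issue_refill(uint32_t line_addr) { std::printf("refill 0x%08x\n", line_addr); }

void vru_process(const VectorLoadCmd& cmd) {
  uint32_t first = cmd.base / kLineBytes;
  uint32_t last  = (cmd.base + cmd.elem_bytes * cmd.vlen - 1) / kLineBytes;
  for (uint32_t line = first; line <= last; ++line)
    issue_refill(line * kLineBytes);  // one cheap request per line, no data storage reserved
}

int main() {
  vru_process({0x1000, 4, 100});      // 100 words starting at 0x1000 -> 13 line refills
  return 0;
}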

SLIDE 42

The vector refill unit brings lines into the cache before the VLU accesses them

[Figure: decoupled vector machine with a non-blocking cache, now including a vector refill unit and VRU-CmdQ, plus a Proc / Cache / Mem timeline]

SLIDE 43

The vector refill unit brings lines into the cache before the VLU accesses them

[Figure: as above]

SLIDE 44

The vector refill unit brings lines into the cache before the VLU accesses them

[Figure: as above, timeline marked PREFETCH]

SLIDE 45

The vector refill unit brings lines into the cache before the VLU accesses them

[Figure: as above, timeline marked PREFETCH and REFILL]

SLIDE 46

The vector refill unit brings lines into the cache before the VLU accesses them

[Figure: as above, timeline marked PREFETCH, REFILL, and HIT]

SLIDE 47

The vector refill unit brings lines into the cache before the VLU accesses them

[Figure: as above, timeline marked PREFETCH, REFILL, and HIT]

SLIDE 48

The VRU reduces the need for hardware that scales with the number of in-flight elements

[Figure: storage in a decoupled vector machine (VEU/VLU/VSU-CmdQs, miss tags, replay queue entries, VLDQ/VSDQ entries) versus a decoupled vector machine with a VRU (adds a VRU-CmdQ; no replay queue entries listed)]

SLIDE 49

The VRU reduces the need for hardware that scales with the number of in-flight elements

[Figure: as above]

SLIDE 50

The VRU reduces the need for hardware that scales with the number of in-flight elements

[Figure: as above, with the bandwidth-delay product marked]

SLIDE 51

The VRU reduces the need for hardware that scales with the number of in-flight elements

[Figure: as above, marked Throttling]
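Slides 51 and 65 only name the throttle logic. One plausible policy, stated here purely as an assumption for illustration (the paper evaluates several VRU and VLU throttling schemes, see slide 74), is to cap how far the VRU may run ahead of the VLU, measured in cache lines, so prefetched lines are not evicted before they are accessed.

// Hypothetical VRU throttle: let the refill unit run ahead of the load unit,
// but never by more than a fixed number of cache lines. Illustrative only.
#include <cstdint>

struct VruThrottle {
  uint32_t max_lines_ahead;        // e.g. some fraction of the cache's line capacity
  uint32_t refills_issued  = 0;    // lines requested by the VRU so far
  uint32_t lines_consumed  = 0;    // lines whose elements the VLU has finished accessing

  bool vru_may_issue() const { return refills_issued - lines_consumed < max_lines_ahead; }
  void on_refill_issued()    { ++refills_issued; }
  void on_line_consumed()    { ++lines_consumed; }
};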

SLIDE 52

Vector Segment Accesses

Vector segment memory accesses explicitly capture two-dimensional access patterns and thus make the memory system more efficient

[Figure: 1D access patterns (unit stride, strided) versus 2D access patterns (array of structures, neighboring columns)]

SLIDE 53

Using multiple strided accesses for 2D access patterns is inefficient

    la    r1, A
    li    r2, 3
    vlbst vr0, r1, r2
    addu  r1, r1, 1
    vlbst vr1, r1, r2
    addu  r1, r1, 1
    vlbst vr2, r1, r2

[Figure: vector execution unit (four FUs, registers vr0-vr2) performing the strided loads from memory]

SLIDE 54

Using multiple strided accesses for 2D access patterns is inefficient

[Figure: as above]

SLIDE 55

Using multiple strided accesses for 2D access patterns is inefficient

[Figure: as above]

SLIDE 56

Using multiple strided accesses for 2D access patterns is inefficient

[Figure: as above]

Multiple strided accesses do not capture the spatial locality inherent in the 2D access pattern

SLIDE 57

Vector segment accesses perform the 2D access pattern more efficiently

    la     r1, A
    vlbseg 3, vr0, r1

[Figure: vector execution unit (four FUs, registers vr0-vr2) performing the segment load from memory]

SLIDE 58

Vector segment accesses perform the 2D access pattern more efficiently

[Figure: as above]

SLIDE 59

Vector segment accesses perform the 2D access pattern more efficiently

[Figure: as above]

SLIDE 60

Vector segment accesses perform the 2D access pattern more efficiently

[Figure: as above, with segment buffers]

SLIDE 61

Vector segment accesses perform the 2D access pattern more efficiently

[Figure: as above, with segment buffers]

SLIDE 62

Vector segment accesses perform the 2D access pattern more efficiently

[Figure: as above, with segment buffers]

SLIDE 63

Vector segment accesses perform the 2D access pattern more efficiently

Efficient encoding
  – More compact command queues
  – VRU processes commands faster

Captures locality
  – Reduces bank conflicts
  – Moves data in unit-stride bursts (compare the address streams sketched below)

[Figure: as above]
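To make the locality argument concrete, the two encodings generate the address streams below for the same array-of-structures access (3-byte structures starting at base address A; the base address and vector length are illustrative).

// Address streams for the two encodings of the same array-of-structures access.
#include <cstdio>

int main() {
  const unsigned A = 0x1000, vlen = 4, seg = 3;

  // Three strided byte loads (vlbst x3): each pass strides through memory,
  // touching one byte per structure, so the same cache lines are revisited.
  for (unsigned f = 0; f < seg; ++f)
    for (unsigned i = 0; i < vlen; ++i)
      std::printf("strided  field %u elem %u -> 0x%x\n", f, i, A + i * seg + f);

  // One segment load (vlbseg 3, vr0, r1): the consecutive bytes of each
  // structure are fetched together, so memory is traversed once in unit-stride order.
  for (unsigned i = 0; i < vlen; ++i)
    for (unsigned f = 0; f < seg; ++f)
      std::printf("segment  elem %u field %u -> 0x%x\n", i, f, A + i * seg + f);
  return 0;
}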

SLIDE 64

Cache Refill/Access Decoupling for Vector Machines

  • Intuition
    – Motivation
    – Background
    – Cache Refill/Access Decoupling
    – Vector Segment Memory Accesses
  • Evaluation
    – The SCALE Vector-Thread Processor
    – Selected Results

SLIDE 65

SCALE Vector Processor

[Figure: control processor, VRU (refill address generator and throttle logic), and vector execution unit with lanes 0-3, unit-stride and per-lane segment (SEG) address generators]

Key Features
  – 4 lanes, 4 clusters
  – Cluster for indexed accesses
  – 4 segment address generators
  – 4 VLDQs
  – VRU includes throttle logic, refill address generator

SLIDE 66

SCALE Cache

[Figure: cache arbiter and crossbar, four banks (each with tags, data, MSHRs, and segment buffers), memory port arbiter and crossbar]

Key Features
  – Unified I/D cache
  – Two cycle hit latency
  – Four 8 KB banks
  – 32-way associative
  – 32B cache lines
  – 16B/cycle per bank
  – Four 16B segment buffers per bank

SLIDE 67

Methodology and Kernels

  • Simulation methodology
    – Microarchitectural C++ simulator of SCALE vector processor and non-blocking multi-banked cache
    – Main memory is modeled with a simple pipelined magic memory
    – Benchmarks were compiled for the control processor with gcc and key kernels were coded by hand in assembly
  • 14 kernels with varying access patterns
    – vvaddw: add two word element vectors and store result
    – hpg: 2D high pass filter on image with 8-bit pixels [EEMBC]
    – rgbyiq: RGB to YIQ color conversion with segments [EEMBC]

SLIDE 68

Normalized performance for vvaddw with varying queue sizes

[Plot: normalized performance versus maximum vector load data queue size, for the decoupled vector machine with and without the vector refill unit]

Configuration
  – Limit study with very large queue sizes except for the queue under consideration
  – 8B/cycle bandwidth and 100 cycle latency main memory
  – Normalized performance with and without the vector refill unit

SLIDE 69

Normalized performance for vvaddw with varying queue sizes

[Plot: normalized performance versus maximum queue size for the vector load data queues and the replay queues, with and without the vector refill unit]

SLIDE 70

Normalized performance for hpg with varying queue sizes

[Plot: as above, for hpg]

SLIDE 71

Normalized performance for rgbyiq with varying queue sizes

[Plot: as above, for rgbyiq]

SLIDE 72

Normalized performance for rgbyiq with varying queue sizes

[Plot: as above, for rgbyiq; dashed lines indicate segments are turned into strided accesses]

SLIDE 73

Performance with refill/access decoupling scales well with longer memory latencies

[Plot: performance for vvaddw, rgbyiq, and hpg versus memory latency in cycles]

Configuration
  – Includes the VRU
  – Reasonable queues and buffering
  – 8B/cycle memory bandwidth
  – VLDQ and replay queues are a constant size
  – Command queues and miss tags are scaled linearly with latency (see the arithmetic below)
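The linear scaling of miss tags follows from the configuration's own parameters (8B/cycle of memory bandwidth, 32B cache lines): the number of lines that must be in flight is bandwidth times latency divided by the line size. A quick check of that arithmetic, with an illustrative set of latencies:

// Miss tags needed to cover the latency, from the parameters on these slides.
#include <cstdio>

int main() {
  const int latencies[] = {100, 200, 400, 800};      // memory latency in cycles (illustrative)
  for (int latency : latencies) {
    const int bw_bytes_per_cycle = 8;                // 8B/cycle main memory bandwidth
    const int line_bytes         = 32;               // 32B cache lines (slide 66)
    int lines_in_flight = latency * bw_bytes_per_cycle / line_bytes;
    std::printf("latency %4d cycles -> about %3d in-flight lines (miss tags)\n",
                latency, lines_in_flight);           // 25, 50, 100, 200
  }
  return 0;
}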

SLIDE 74

Paper includes additional results and analysis

  • 14 kernels with varying access patterns
  • Performance versus number of miss tags
  • Performance versus memory latency and bandwidth
  • Comparison with an approximation of a scalar machine
  • Various VRU and VLU throttling schemes

SLIDE 75

Related Work

  • Refill/Access Decoupling
    – Software prefetching
    – Second-level vector register files [NEC SX, Imagine]
    – Speculative hardware prefetching [Jouppi90, Palacharla94]
    – Run-ahead processing [Baer91, Dundas97, Mutlu03]
  • Vector Segment Memory Accesses
    – Streaming loads/stores [Khailany01, Ciricescu03]

SLIDE 76

Conclusions

  • Saturating large bandwidth-delay memory systems requires many in-flight accesses and thus a great deal of access management state and reserved element data storage
  • Refill/access decoupling and vector segment accesses are simple techniques which reduce these costs and improve performance