NOC-BASED SUPPORT OF HETEROGENEOUS CACHE-COHERENCE MODELS FOR ACCELERATORS
Davide Giri, Paolo Mantovani, Luca P. Carloni
Columbia University, New York, USA
ACM/IEEE NOCS 2018, Torino, Italy

SOC TRENDS
o Heterogeneity
o Custom accelerators
o NoC
o Shared memory
Challenges
o Scalability
o Programmability
Example SoCs: NVIDIA Parker, 2016. Qualcomm Snapdragon 835, 2017. Xilinx Everest, 2018. Mobileye EyeQ5, 2020.
October 4th, 2018 ACM/IEEE NOCS 2018, TORINO, ITALY

LOOSELY-COUPLED ACCELERATORS
Major speedups and energy savings:
o Highly parallel and customized datapath
o Aggressively banked private local memory (PLM)
What should the cache coherence model for accelerators be?
o We identified 3 main models in the literature

ACCELERATOR MODELS: FULLY COHERENT
Coherent with the entire cache hierarchy
o Same coherence model as the processor
Programming requirements
o Race-free accelerator execution
Implementation variants
o Generally bus-based
o Accelerators may own a cache
  o With a cache: IBM CAPI, [Y. Shao et al., MICRO '16], [M. J. Lyons et al., TACO '12]
  o Without a cache: ARM ACE-lite

ACCELERATOR MODELS: NON COHERENT
Not coherent with the cache hierarchy
o Caches are bypassed
Programming requirements
o Race-free accelerator execution
o Flush all caches prior to accelerator execution
Implementation variants
o Generally NoC-based and DMA-based
o [Y. Chen et al., ICCD '13], [E. Cota et al., DAC '15], [Y. Shao et al., MICRO '16]

ACCELERATOR MODELS: LLC COHERENT
Coherent with the LLC only
o Processors' private caches are bypassed
Programming requirements
o Race-free accelerator execution
o Flush processors' private caches prior to accelerator execution
Implementation variants
o No implementation in the literature
o First proposed by [E. Cota et al., DAC '15]
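
The software-side flush requirements of the three models can be summarised in a short Python sketch. The names `CoherenceModel` and `caches_to_flush` are hypothetical and do not come from the slides; every model additionally requires race-free accelerator execution, which this sketch does not model.

```python
from enum import Enum


class CoherenceModel(Enum):
    FULLY_COHERENT = "fully-coherent"
    LLC_COHERENT = "LLC-coherent"
    NON_COHERENT = "non-coherent"


def caches_to_flush(model):
    """Return the cache levels software flushes before starting the accelerator."""
    if model is CoherenceModel.FULLY_COHERENT:
        return []                    # coherent with the whole hierarchy: no flush
    if model is CoherenceModel.LLC_COHERENT:
        return ["private"]           # only the processors' private caches
    if model is CoherenceModel.NON_COHERENT:
        return ["private", "llc"]    # all caches, since they are bypassed
    raise ValueError(f"unknown coherence model: {model!r}")
```

For example, `caches_to_flush(CoherenceModel.LLC_COHERENT)` lists only the private caches, matching the LLC-coherent programming requirement.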

CONTRIBUTIONS
Protocol.
o Variation of MESI to support 3 coherence models for accelerators (NoC-based)
Coherence Models.
o Show how each model can outperform the others in some cases
o Show that the best choice of model varies at runtime
Architecture. Design of a multi-core NoC-based architecture that supports:
o Three models of coherence for accelerators
o Run-time selection of the coherence model for each accelerator
o Coexistence of heterogeneous coherence models for accelerators

OUR SOC PLATFORM
Our design is based on an instance of Embedded Scalable Platforms (ESP) [L. P. Carloni, DAC '16]
o Socketed tiles
o NoC
o Easy integration and reuse of heterogeneous components
We added a cache hierarchy to ESP
o Now it can run multi-processor and multi-accelerator applications on Linux SMP

ESP: NOC
o 2D-mesh
o 1-cycle hops
o 6 physical planes to prevent deadlock and to provide sufficient bandwidth
o Point-to-point ordering required to prevent deadlock
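
As a rough illustration of what 1-cycle hops mean on a 2D mesh, the sketch below computes the minimum hop count between two tiles. The function name `min_hops` and the (x, y) tile coordinates are assumptions made for illustration; the slides do not state the routing algorithm, and the figure ignores contention and any injection or ejection overhead.

```python
def min_hops(src, dst):
    """Minimum number of router-to-router hops between two tiles on a 2D mesh.

    On a mesh the shortest path length is the Manhattan distance. With
    1-cycle hops this is also a lower bound, in cycles, on the time a
    message spends traversing links (assumption: no contention).
    """
    (sx, sy), (dx, dy) = src, dst
    return abs(sx - dx) + abs(sy - dy)
```

Under these assumptions, a message between opposite corners of a 3x3 mesh needs at least `min_hops((0, 0), (2, 2))` hops.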

ESP: PROCESSOR TILE
Main components
o Single processor core
o L2 private cache (added for this work)
In this work
o Up to 2 processor tiles
o 64KB private caches
o Off-the-shelf processor with L1 write-through caches

ESP: MEMORY TILE
Main components
o Memory controller
o LLC and directory
  o Added for this work
  o Can be split over multiple tiles
In this work
o Up to 2 memory tiles
o Up to 2MB aggregate LLC

ESP: ACCELERATOR TILE
Main components
o Any accelerator complying with a simple interface
o A small TLB
o A DMA controller and/or a private cache (added for this work)
Support for run-time selection of the coherence model through one I/O write to the configuration registers
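
A minimal software model of this run-time selection might look as follows. Everything named here is hypothetical: the register name `coherence`, the numeric encoding, and the `AcceleratorTile` class are illustrative assumptions, not the actual ESP register map or driver interface.

```python
# Hypothetical encoding of the coherence-model field; not from the slides.
COHERENCE_ENCODING = {
    "non-coherent": 0,
    "LLC-coherent": 1,
    "fully-coherent": 2,
}


class AcceleratorTile:
    """Toy model of an accelerator tile's configuration registers."""

    def __init__(self):
        self.config_regs = {"coherence": COHERENCE_ENCODING["non-coherent"]}
        self.io_writes = 0  # counts writes, to show that one write suffices

    def io_write(self, reg, value):
        if reg not in self.config_regs:
            raise KeyError(f"unknown register: {reg}")
        self.config_regs[reg] = value
        self.io_writes += 1


def select_coherence_model(tile, model):
    """Select the model with a single I/O write to the configuration registers."""
    tile.io_write("coherence", COHERENCE_ENCODING[model])
```

The point of the sketch is only that the choice is per tile and costs a single register write, so different accelerators can run under different models at the same time.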

OUR PROTOCOL
We modified a classic MESI directory-based cache-coherence protocol
o to make it work over a NoC (atomic operations)
o to support all coherence models for accelerators (recalls, flush, LLC-coherent requests)
Private cache controller
o L1 invalidation
o Recalls
o Flush
o Atomic operations
Directory controller
o Write-back: add a Valid state and dirty bit
o Recalls
o Flush
o LLC-coherent read/write requests

OUR PROTOCOL: DIRECTORY CONTROLLER EXCERPT
State \ Request   LLC-coherent Read         LLC-coherent Write
Invalid           Read memory               Read memory if misaligned
                  Send data to requestor    Write to LLC
                  Go to Valid state         Go to Valid state
Valid             Send data to requestor    Write to LLC
Shared            -                         -
Exclusive         -                         -
Modified          -                         -
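
The Invalid and Valid rows of this excerpt can be expressed as a small transition function. This is a sketch under stated assumptions, not the authors' controller: the function name and string labels are hypothetical, the excerpt does not show the next state for the Valid row (staying in Valid is assumed), and the Shared, Exclusive and Modified rows are not given, so the sketch refuses them.

```python
def handle_llc_coherent(state, request, misaligned=False):
    """Return (actions, next_state) for an LLC-coherent request at the directory."""
    if request not in ("read", "write"):
        raise ValueError(f"unknown request: {request}")

    if state == "Invalid":
        if request == "read":
            return ["read memory", "send data to requestor"], "Valid"
        # Write: memory is read only when the write is misaligned.
        actions = ["read memory"] if misaligned else []
        return actions + ["write to LLC"], "Valid"

    if state == "Valid":
        # Assumption: the line stays Valid; the excerpt does not say.
        if request == "read":
            return ["send data to requestor"], "Valid"
        return ["write to LLC"], "Valid"

    # Shared / Exclusive / Modified are not specified in the excerpt.
    raise NotImplementedError(f"state {state!r} is not covered by the excerpt")
```

For instance, a misaligned LLC-coherent write to an Invalid line first reads memory, then writes to the LLC and moves the line to Valid.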

EXPERIMENTAL SETUP
ESP's GUI: the CAD flow from GUI to bitstream is fully automated.
We designed 4 custom accelerators:
o Sort (merge and bitonic sort combined)
o Sparse Matrix-Vector Multiplication
o FFT-1D and FFT-2D
These accelerators represent a good mix of memory access pattern characteristics:
o Varying footprint size (32KB to 20MB)
o Streaming vs. irregular pattern
o Temporal and spatial locality
We deployed our SoC on FPGA and executed applications on Linux SMP.

RESULTS: SINGLE ACCELERATOR
[Figure: speedup and DRAM accesses for a single accelerator. NC = non-coherent, LLC = LLC-coherent. Four regions are annotated "LLC winning"; the callouts 0.5x and 20x appear on the plots.]

RESULTS: MULTIPLE ACCELERATORS
[Figure: speedup and DRAM accesses with multiple accelerators. NC = non-coherent, LLC = LLC-coherent. Dataset size: 256KB to 512KB.]

RESULTS: FULLY-COHERENT ACCELERATORS
The fully-coherent model can win for workloads whose data structures fit the accelerator's private cache. No flush needed.
[Figure: speedup. NC = non-coherent, LLC = LLC-coherent, FC = fully-coherent. Two regions are annotated "FC winning".]

RESULTS: SUMMARY
o The best coherence model varies with the accelerator workload size and with the number of active accelerators in the system.
o LLC-coherent and fully-coherent models can significantly reduce accesses to DRAM.
RULE OF THUMB
Memory footprint of workload     Best model
Up to ~ private cache size       fully-coherent model
Up to ~ LLC size                 LLC-coherent model
Larger than ~ LLC size           non-coherent model
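
The rule of thumb can be written as a small selection function. This is only a sketch: the name `pick_model` is hypothetical, the default sizes are the 64KB private cache and 2MB LLC reported for this work, the thresholds are approximate (the slide marks them with "~"), and the sketch ignores the number of active accelerators, which the summary says also matters.

```python
KB = 1024
MB = 1024 * KB


def pick_model(footprint_bytes, private_cache_bytes=64 * KB, llc_bytes=2 * MB):
    """Pick a coherence model from the workload's memory footprint."""
    if footprint_bytes <= private_cache_bytes:
        return "fully-coherent"   # data fits the accelerator's private cache
    if footprint_bytes <= llc_bytes:
        return "LLC-coherent"     # data fits the LLC
    return "non-coherent"         # footprint exceeds the LLC
```

With these defaults, a 32KB workload maps to the fully-coherent model and a 20MB workload to the non-coherent one.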

CONCLUSIONS
o There is no absolute winner among the coherence models.
  o Workload size, cache sizes and the number of active accelerators influence the best choice.
  → Hence, the best choice can vary at runtime.
o We proposed a cache-coherence protocol that supports all three coherence models in a NoC-based SoC:
  o Fully-coherent, LLC-coherent, non-coherent.
o We designed a NoC-based SoC architecture enabling
  o Coexistence of heterogeneous coherence models operating simultaneously.
  o Run-time selection of the coherence model for each accelerator.

THANK YOU!
Any questions?
NOC-BASED SUPPORT OF HETEROGENEOUS CACHE-COHERENCE MODELS FOR ACCELERATORS
Davide Giri, Paolo Mantovani, Luca P. Carloni

BACKUP

ESP: PROGRAMMABILITY
o The accelerator driver is invoked by an application to offload a task.
o Accelerator tiles handle virtual memory without interrupting the processor cores.
o We use locks to enforce race-free execution of the accelerators.
Additionally:
o During the execution of non-coherent accelerators, we ensure that there exists only a single copy of the data.
o For LLC-coherent accelerators, data can be present both in DRAM and in the LLC.
o The flush phase becomes a negligible overhead for large accelerator workloads.

ESP: CACHES
o Designed in SystemC and implemented through HLS.
o Configurable number of sets, ways, sharers and owners.
o The device driver can select which caches to flush.
For this work:
o LLC: 2MB
o Private caches: 64KB

OUR PROTOCOL: DIRECTORY CONTROLLER EXCERPT
o Put the whole table and the list of features: Valid state, Recalls, DMA requests.
o Make an example with a timing diagram (a zig-zag), basically for a DMA request.
o Slide with the list of features for L2.
