BRU: Bandwidth Regulation Unit for Real-Time Multicore Processors
Farzad Farshchi §, Qijing Huang ¶, Heechul Yun §
§ University of Kansas, ¶ University of California, Berkeley
RTAS 2020

Multicore Processors in Real-time Systems
● Provide high computing performance needed for intelligent real-time systems
● Allow consolidation, reducing cost, size, weight, and power

Challenge: Inter-core Memory Interference
● Memory system is shared between the cores
● Memory performance varies widely due to memory interference
● Task WCET can be extremely pessimistic: >10x or >100x
P.K. Valsan et al. “Addressing Isolation Challenges of Non-blocking Caches for Multicore Real-Time Systems”. Real-time Systems Journal

Software Solutions
● To bound memory interference: MemGuard [1], PALLOC [2], etc.
● Usually implemented in OS or hypervisor
● Use COTS processor features (performance counters, MMU, etc.)
❌ Fundamentally limited due to lack of full control over hardware
❌ Treat hardware as a black box
❌ Overhead, e.g. interrupt-handler overhead
[1] H. Yun et al. “Memguard: Memory bandwidth reservation system for efficient performance isolation in multi-core platforms” RTAS'13
[2] H. Yun et al. “PALLOC: DRAM bank-aware memory allocator for performance isolation on multicore platforms” RTAS'14

Hardware Solutions
● Real-time architectures: T-CREST [1], MERASA [2]
● Priority-aware memory components: LLC [3], DRAM controller [4]
❌ Low average performance
❌ Verifying a new IP is costly
❌ Hard to justify commercially
[Figure: Cost of Developing a New Chip. Source: https://www.extremetech.com/computing/272096-3nm-process-node]
[1] M. Schoeberl et al. “T-CREST Time-predictable multi-core architecture for embedded systems” Journal of Systems Architecture 2015
[2] T. Ungerer et al. “MERASA: Multicore execution of hard real-time applications supporting analyzability” Micro'10
[3] J. Yan et al. “Time-predictable L2 cache design for high performance real-time systems” RTCSA'10
[4] F. Farshchi et al. “Deterministic memory abstraction and supporting multicore system architecture” ECRTS'18

Outline
● Motivation
● BRU
  ○ Access Regulation
  ○ Writeback Regulation
● Implementation
● Evaluation

BRU: Bandwidth Regulation Unit
● BRU is a hardware IP
  ✔ Drop-in module, less intrusive
  ✔ No runtime overhead (e.g. interrupt handling)
● BRU enables
  ✔ Fine-grained regulation period
  ✔ Group-regulation for multiple cores

Bird’s Eye View of BRU Architecture
● Located between private caches and the shared memory
● Regulates bandwidth by throttling private cache misses and writebacks
● Low logic complexity due to direct connection to private caches
● Can throttle each core independently without interfering with the other cores
● No LLC metadata to store core ID

Access (Cache Miss) Regulation
● Multiple cores can be assigned to a domain. B/W is regulated collectively for these cores.
● Domain budget is decremented when a private cache miss causes access to shared memory.
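The domain mechanism on this slide can be modeled in a few lines of software. This is only an illustrative sketch, not the BRU hardware: the class `Domain`, its fields, and the `new_period` replenishment step are hypothetical names and assumptions chosen for the example.

```python
class Domain:
    """A regulation domain: a group of cores sharing one access budget (illustrative model)."""

    def __init__(self, max_accesses, cores):
        self.max_accesses = max_accesses  # hypothetical name: accesses allowed per period
        self.cores = set(cores)           # cores regulated collectively
        self.remaining = max_accesses

    def on_miss(self, core_id):
        """Handle a private cache miss that needs shared memory.

        Returns True if the access may proceed, False if the core is throttled.
        """
        if core_id not in self.cores:
            raise ValueError(f"core {core_id} is not assigned to this domain")
        if self.remaining == 0:
            return False          # budget depleted: every core in the domain is throttled
        self.remaining -= 1       # budget is decremented per shared-memory access
        return True

    def new_period(self):
        """Assumption: the budget is replenished at the start of each regulation period."""
        self.remaining = self.max_accesses


best_effort = Domain(max_accesses=3, cores={1, 2, 3})
results = [best_effort.on_miss(c) for c in (1, 2, 3, 1)]
print(results)  # [True, True, True, False]
```

Because the three cores draw from a single counter, one core can use budget the others leave idle, which is the point of regulating them as a group.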

Bandwidth Budget Equation
● Shared memory is accessed at the granularity of a cache line
LS: Cache line size
f_clk: System clock frequency
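As a worked example of how a budget relates to bandwidth, the sketch below assumes the budget is a maximum number of cache-line accesses per regulation period and that the period is measured in clock cycles; under that assumption, bandwidth = accesses × LS × f_clk / period. The function names and the 2130-cycle period are hypothetical; the 64-byte line size and 2.13 GHz clock are taken from the evaluation setup.

```python
def budget_bandwidth(max_accesses, period_cycles,
                     line_size=64, f_clk_hz=2_130_000_000):
    """Bandwidth in bytes/s allowed by max_accesses line-sized accesses per period."""
    return max_accesses * line_size * f_clk_hz / period_cycles


def accesses_for_bandwidth(bytes_per_s, period_cycles,
                           line_size=64, f_clk_hz=2_130_000_000):
    """Largest whole number of accesses per period that stays within the target bandwidth."""
    return (bytes_per_s * period_cycles) // (line_size * f_clk_hz)


period = 2130  # cycles, i.e. 1 microsecond at 2.13 GHz (assumed period length)
print(budget_bandwidth(20, period) / 1e9)           # 1.28
print(accesses_for_bandwidth(640_000_000, period))  # 10
```

With these assumed numbers, 20 accesses per period correspond to 1.28 GB/s and 10 accesses to 0.64 GB/s.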

Writeback Regulation
● Cause-and-effect relationship between cache misses and writebacks: Cache miss → cache conflict → dirty line eviction → writeback
● With access bandwidth set to X MB/s, the writeback bandwidth is also limited to X MB/s
● Writes contend more severely in shared memory [1]. We want to set a lower budget for writebacks
● Add a writeback budget to each domain. When writeback budget depletes, throttle writebacks
[1] M. Bechtel et al. “Denial-of-Service Attacks on Shared Cache in Multicore: Analysis and Prevention”. RTAS’19
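A minimal software model of a domain with separate access and writeback budgets is sketched below. It is an illustration under assumptions, not the hardware design: the class `RegulatedDomain`, its method names, and the per-period replenishment are hypothetical.

```python
class RegulatedDomain:
    """Illustrative domain with independent access and writeback budgets."""

    def __init__(self, access_budget, writeback_budget):
        self.access_budget = access_budget
        self.writeback_budget = writeback_budget
        self.new_period()

    def new_period(self):
        # Assumption: both budgets are replenished at the start of each period.
        self.accesses_left = self.access_budget
        self.writebacks_left = self.writeback_budget

    def on_miss(self):
        """Return True if a cache miss may access shared memory."""
        if self.accesses_left == 0:
            return False
        self.accesses_left -= 1
        return True

    def on_writeback(self):
        """Return True if a dirty-line writeback may proceed."""
        if self.writebacks_left == 0:
            return False          # writeback budget depleted: throttle writebacks
        self.writebacks_left -= 1
        return True


# Writeback budget set to half the access budget.
dom = RegulatedDomain(access_budget=20, writeback_budget=10)
allowed_writebacks = sum(dom.on_writeback() for _ in range(15))
print(allowed_writebacks)  # 10
print(dom.on_miss())       # True
```

In this model the writeback counter runs out first, so writebacks are throttled while misses can still be served from the remaining access budget.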

Rocket Chip SoC [1]
● An open-source system on chip
● Can be configured with BOOM [2] out-of-order processor
● Uses TileLink cache-coherent protocol for on-chip communication and accessing memory
[Figure: Rocket Chip augmented with BRU]
[1] K. Asanovic et al. “The Rocket Chip Generator” UC Berkeley Tech. Rep. 2016
[2] C. Celio et al. “The Berkeley Out-of-Order Machine (BOOM): An Industry-Competitive, Synthesizable, Parameterized RISC-V Processor” UC Berkeley Tech. Rep. 2015

Access Regulation Implementation
[Figure: Channels of a TileLink link]
● On a cache miss, an Acquire message is transferred over Channel A
● BRU counts Acquires and, when the budget depletes, throttles Channel A

Writeback Regulation Implementation
WB: Writeback Unit
● On a writeback, a Release message is transferred over Channel C
● Cannot throttle Channel C due to other messages (Probe responses) going through this channel
● A special throttle logic inserted after WB unit (only two AND gates)
● BRU sends a signal to D cache to throttle writebacks
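One plausible reading of the two-AND-gate throttle logic is that it masks the ready/valid handshake of the writeback unit's Release path while the throttle signal is asserted. The sketch below models that reading only; the exact wiring is an assumption, and `gate_release` and its parameter names are hypothetical.

```python
def gate_release(wb_valid, downstream_ready, throttle):
    """Mask a ready/valid handshake while throttle is asserted (assumed wiring).

    Returns (valid seen downstream, ready seen by the writeback unit).
    """
    allow = not throttle
    valid_out = wb_valid and allow           # AND gate 1: hide the pending Release
    ready_out = downstream_ready and allow   # AND gate 2: hold the writeback unit
    return valid_out, ready_out


print(gate_release(True, True, throttle=False))  # (True, True)
print(gate_release(True, True, throttle=True))   # (False, False)
```

Gating only the writeback unit's output leaves Probe responses on Channel C unaffected, which matches the reason the slide gives for not throttling the channel itself.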

Evaluation
● FireSim FPGA-accelerated simulator
  ○ Directly derived from RTL
  ○ Runs on FPGAs in Amazon cloud
  ○ Fast, highly accurate
● Setup
  ○ Quad-core out-of-order (RISC-V ISA) 2.13 GHz
  ○ Caches: 64-byte lines, Private L1-I/D: 16/16 KiB, Shared LLC: 2 MiB
  ○ DDR3-2133, 1 rank, 8 banks, FR-FCFS
● Workloads
  ○ SD-VBS [1], IsolBench [2] (synthetic)
[1] S. K. Venkata et al. "SD-VBS: The San Diego Vision Benchmark Suite" IISWC'09
[2] https://github.com/CSL-KU/IsolBench

Effect of Regulation Period Length
Real-time task WCET: 1.5ms in isolation, run for 1k periods
[Figure: Distribution of the real-time task response time vs. different regulation period lengths (ms); annotation: Less variation]
● Regulation period shorter than task WCET reduces response time variation

Effect of Group Bandwidth Regulation
[Figure annotation: 37% faster]
● Memory intensity: disparity > mser > texture_syn
● Group bandwidth regulation of best-effort tasks improves utilization

Effect of Writeback Regulation
Benchmark: sift
[Figure, panel 1: Writeback regulation: disabled; Access budget: 1.28 GB/s]
[Figure, panel 2: Writeback budget: 0.64 GB/s; Access budget: 1.28 GB/s]
● Access regulation limits writeback bandwidth
● Writeback regulation allows setting a lower budget for writebacks

Hardware Overhead
● Synthesis and place and route for 7nm
● The area overhead is negligible: < 0.3%
● < 2% impact on max clock frequency
[Figure: A dual-core processor chip layout with BRU circled in red]

Conclusion
● BRU enables bounding the memory interference with minimal changes to the hardware
● Single drop-in module; less intrusive than other hardware solutions
● No runtime overhead; reduces response time variation and improves utilization
● Negligible hardware overhead

Thank you for listening!
Acknowledgement: This research is supported in part by NSF CNS 1718880 and CNS 1815959, NSA Science of Security initiative contract #H98230-18-D-0009, and AWS Cloud Credits for Research.
