Enabling Virtual Memory Research on RISC-V with a Configurable TLB Hierarchy for the Rocket Chip Generator
Nikos Ch. Papadopoulos, Vasileios Karakostas, Konstantinos Nikas, Nectarios Koziris, Dionisios N. Pnevmatikatos
ncpapad@cslab.ece.ntua.gr
National Technical University of Athens
School of Electrical and Computer Engineering
Computing Systems Laboratory

Motivation
● Explore RISC-V ISA and Rocket Chip Generator
● Vanilla L1 TLB is fully-associative
  ○ May impact the critical path
  ○ #entries vs resource usage tradeoff
● Vanilla L2 TLB is direct-mapped
  ○ May impact the miss rate
● We want to lift these restrictions and enable:
  ○ Configurable L1 and L2 TLBs
  ○ From direct-mapped up to fully-associative structures
CARRV 2020 | May 29, 2020 | Virtual Workshop

Outline
● Background
  ○ Rocket Chip Generator
  ○ RISC-V Virtual Memory support
● Configurable TLB Hierarchy features
● Methodology
  ○ Hardware & Software Development Flow
● Performance and Area Results
● Related & Future work
● Conclusions

Rocket Chip Generator
● SoC generator that produces synthesizable RTL
  ○ Written in Chisel
  ○ Rocket core or BOOM (Berkeley Out-of-Order Machine)
  ○ Parameterized Tiles, Caches, Accelerators, etc.
● Library of processor parts and utilities
  ○ Replacement policies
  ○ Branch predictors
  ○ ...and many more

RV64-Sv39 Paging Scheme
● 39-bit (512GB) virtual address space
● 3-level page table
● Supports 4KB base pages
  ○ But also 2MB, 1GB superpages
● 27-bit VPN → 44-bit PPN
  ○ 12-bit page offset for 4KB pages
● SATP register
  ○ Stores the root of the page table
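The field widths on this slide can be checked with a short sketch. It assumes the standard Sv39 split of the 27-bit VPN into three 9-bit indices (one per page-table level). The function names `split_sv39` and `page_size` are hypothetical, and the sketch ignores the rule that the upper address bits must be a sign extension of bit 38.

```python
PAGE_OFFSET_BITS = 12    # 4KB base pages -> 12-bit page offset
VPN_BITS_PER_LEVEL = 9   # assumption: 27-bit VPN = 3 levels x 9 bits
LEVELS = 3
VA_BITS = PAGE_OFFSET_BITS + VPN_BITS_PER_LEVEL * LEVELS  # 39


def split_sv39(vaddr: int) -> tuple:
    """Return (VPN[2], VPN[1], VPN[0], offset) for a 39-bit virtual address."""
    if not 0 <= vaddr < (1 << VA_BITS):
        raise ValueError("expected a 39-bit virtual address")
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
    vpn = vaddr >> PAGE_OFFSET_BITS
    mask = (1 << VPN_BITS_PER_LEVEL) - 1
    vpn0 = vpn & mask
    vpn1 = (vpn >> VPN_BITS_PER_LEVEL) & mask
    vpn2 = (vpn >> (2 * VPN_BITS_PER_LEVEL)) & mask
    return vpn2, vpn1, vpn0, offset


def page_size(level: int) -> int:
    """Bytes mapped by a leaf entry at the given level (0 = 4KB, 1 = 2MB, 2 = 1GB)."""
    if level not in range(LEVELS):
        raise ValueError("level must be 0, 1 or 2")
    return 1 << (PAGE_OFFSET_BITS + VPN_BITS_PER_LEVEL * level)


if __name__ == "__main__":
    print(split_sv39(0x12345678))
    print([page_size(level) for level in range(LEVELS)])
```

A leaf found one level higher in the table covers 512 times more memory, which is where the 2MB and 1GB superpage sizes come from.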

Existing MMU in Rocket Chip Generator
● Fully-associative L1 TLB
  ○ Separate Data/Instr L1 TLB
  ○ Vector of Registers
  ○ Fast & small (32-128 entries)
● Direct-mapped L2 TLB
  ○ SyncReadMem
  ○ Slower but larger (128-1024 entries)
● Fully-associative PTW Cache
  ○ Vector of Registers
  ○ Keeps non-leaf nodes

Configurable TLB hierarchy in Rocket
● Kept the same overall structure
  ○ Lookups, refill, replacement policies, flushing
● Added about 70 LoC for the L1 TLB and 50 LoC for the L2 TLB
● Implementation in two different editions of the Rocket Chip Generator (RCG)
  ○ April 2018 version
    ■ Supports Xilinx ZCU102
  ○ January 2020 version

Hardware Development Flow
● Implementation
  ○ Chisel & FIRRTL checks
  ○ Syntax errors, unconnected wires, etc.
● Testing
  ○ Verilator: cycle-accurate simulator
  ○ Chisel debug statements
  ○ Assembly tests
● Evaluation
  ○ Generate bitstream for the Xilinx ZCU102
  ○ Run tests and benchmarks using Buildroot

Software Flow
● Freedom-U-SDK by SiFive
  ○ SW for the Freedom Unleashed
● Buildroot
  ○ Minimal embedded distribution
  ○ Easy to add custom packages
● Linux kernel 4.15
  ○ Cross-compilation for RISC-V
● Berkeley Boot Loader (BBL)
  ○ Sets up performance counters (cycles, TLB misses)
  ○ Boots Linux

L1 | L2 TLB Contributions
(cells list the L1 value first, then the L2 value, separated by "/")

Feature              | Vanilla L1 / L2 TLB                 | Configurable L1 / L2 TLB
Organization         | Fully-associative / Direct-mapped   | Any associativity
Parameterization     | #Entries                            | #Sets, #Ways (powers of 2)
Replacement policies | PseudoLRU or Random / No policy     | Set-associative alternatives of PseudoLRU and Random
Other features       | Sectored L1 TLB entries             | Sectored L1 TLB entries are supported too
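How a single #Sets/#Ways parameterization covers every organization can be seen in a small software model. This is only a behavioral sketch, not the Chisel code from the repository: the class name `SetAssocTLB` is hypothetical, and it uses true LRU as a simple stand-in for the PseudoLRU and Random policies named on the slide.

```python
from collections import OrderedDict


class SetAssocTLB:
    """Behavioral TLB model: n_ways=1 is direct-mapped, n_sets=1 is fully-associative."""

    def __init__(self, n_sets: int, n_ways: int):
        for n in (n_sets, n_ways):
            if n < 1 or n & (n - 1):
                raise ValueError("sets and ways must be powers of two")
        self.n_sets = n_sets
        self.n_ways = n_ways
        # One ordered map per set; insertion order tracks recency (oldest first).
        self.sets = [OrderedDict() for _ in range(n_sets)]
        self.hits = 0
        self.misses = 0

    def _set_for(self, vpn: int) -> OrderedDict:
        # The low VPN bits select the set. The full VPN is kept as the key
        # for simplicity; hardware would store only the remaining tag bits.
        return self.sets[vpn & (self.n_sets - 1)]

    def lookup(self, vpn: int):
        """Return the cached PPN on a hit, or None on a miss."""
        entries = self._set_for(vpn)
        if vpn in entries:
            entries.move_to_end(vpn)  # mark as most recently used
            self.hits += 1
            return entries[vpn]
        self.misses += 1
        return None

    def refill(self, vpn: int, ppn: int) -> None:
        """Install a translation, evicting the least recently used way if the set is full."""
        entries = self._set_for(vpn)
        if vpn not in entries and len(entries) >= self.n_ways:
            entries.popitem(last=False)
        entries[vpn] = ppn
        entries.move_to_end(vpn)


if __name__ == "__main__":
    direct = SetAssocTLB(n_sets=4, n_ways=1)
    full = SetAssocTLB(n_sets=1, n_ways=4)
    for tlb in (direct, full):
        tlb.refill(0, 100)
        tlb.refill(4, 200)  # VPN 4 maps to the same set as VPN 0 when n_sets=4
        print(tlb.lookup(0), tlb.lookup(4))
```

With the same four entries, the direct-mapped instance loses VPN 0 to a conflict while the fully-associative one keeps both, which is the miss-rate concern raised for the vanilla L2 TLB.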

Evaluation Metrics
● FPGA Resource Usage
  ○ Lookup Tables (LUTs), Flip-Flops (FFs), Block RAMs (BRAMs)
● Performance Metrics
  ○ SPEC2006 benchmarks (with test input set)
    ■ Misses per kilo-instructions (MPKI)
    ■ Instructions per cycle (IPC)
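Both performance metrics are simple ratios over raw counter values. The sketch below shows the arithmetic; the function names and the counter values in the usage example are hypothetical and are not measurements from this work.

```python
def mpki(misses: int, instructions: int) -> float:
    """Misses per kilo-instructions: misses for every 1000 retired instructions."""
    if instructions <= 0:
        raise ValueError("instruction count must be positive")
    return misses * 1000 / instructions


def ipc(instructions: int, cycles: int) -> float:
    """Instructions per cycle."""
    if cycles <= 0:
        raise ValueError("cycle count must be positive")
    return instructions / cycles


if __name__ == "__main__":
    # Hypothetical counter readings, for illustration only.
    print(mpki(misses=5_000, instructions=2_000_000))
    print(ipc(instructions=1_500_000, cycles=2_000_000))
```

MPKI normalizes by instruction count, so TLB behavior can be compared across benchmarks of different lengths.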

Evaluation Scenarios
● Configurations resembling well-known architectures
  ○ Conf III → ARM Cortex A57
  ○ Conf IV → Intel Skylake
  ○ Conf V → Intel Skylake (swapped I/D TLB sizes)

FPGA resource usage evaluation

L1 TLB Performance Evaluation (MPKI)
● Results for L1 Data and Instruction TLBs
● Most TLB misses come from data accesses
● Several benchmarks show similar behavior across configurations
● But larger L1 DTLB may improve performance
● mcf stresses the TLB hierarchy the most

L2 TLB Performance Evaluation (MPKI)
● L2 TLB misses are rare for most benchmarks
● Larger L2 TLB reach may reduce page walks
  ○ Configurations IV and V
● mcf improves significantly as L2 TLB increases
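TLB reach, the amount of memory a TLB can map without a page walk, is the number of entries times the page size. The sketch below assumes every entry maps a page of the same size; `tlb_reach_bytes` is a hypothetical helper, and the entry counts are the 128 and 1024 bounds quoted earlier for the vanilla L2 TLB.

```python
KIB = 1 << 10
MIB = 1 << 20


def tlb_reach_bytes(entries: int, page_bytes: int = 4 * KIB) -> int:
    """Memory covered when every entry maps one page of page_bytes bytes."""
    if entries < 0 or page_bytes <= 0:
        raise ValueError("entries must be >= 0 and page_bytes > 0")
    return entries * page_bytes


if __name__ == "__main__":
    for entries in (128, 1024):
        print(entries, "entries x 4KB:", tlb_reach_bytes(entries) // KIB, "KiB")
    print("1024 entries x 2MB:", tlb_reach_bytes(1024, 2 * MIB) // MIB, "MiB")
```

A workload whose working set exceeds the reach will keep triggering page walks, which is consistent with mcf benefiting from a larger L2 TLB.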

System Performance Evaluation (IPC)

… Further Evaluation
● Unfortunately, the Xilinx ZCU102 board reserves only 512MB of RAM for the PL, which limits the benchmarks we could run
  ○ Older Rocket Chip commit
● Correctness evaluation of the more recent Rocket Chip edition
● We plan on moving to FireSim
  ○ Evaluation with SPEC2017 and other benchmarks
  ○ + Multicore benchmarking
● BOOM performance evaluation

Related & Future Work
● Research/Develop new MMU features
  ○ Direct Segments [ISCA'13]
  ○ Coalesced/Clustered TLBs [MICRO'12, HPCA'14]
  ○ Redundant Memory Mappings [ISCA'15]
  ○ Hybrid TLB Coalescing [ISCA'17]
● Reduce resource usage in FPGA simulation
  ○ TLBs are CAMs → FPGA-hostile structure

Conclusions
● Enabled further configurability in the Rocket Chip Generator
● Our design can output any L1/L2 TLB organization/size
● Evaluated resource usage & application performance
● Feel free to review our work on GitHub!
  ○ https://github.com/ncppd/rocket-chip

Thank you!
