SLIDE 1

Enabling Virtual Memory Research on RISC-V with a Configurable TLB Hierarchy for the Rocket Chip Generator

Nikos Ch. Papadopoulos, Vasileios Karakostas, Konstantinos Nikas, Nectarios Koziris, Dionisios N. Pnevmatikatos

National Technical University of Athens

School of Electrical and Computer Engineering
Computing Systems Laboratory

ncpapad@cslab.ece.ntua.gr

SLIDE 2

Motivation

Explore RISC-V ISA and Rocket Chip Generator

  • Vanilla L1 TLB is fully-associative

○ May impact the critical path
○ #entries vs resource usage tradeoff

  • Vanilla L2 TLB is direct-mapped

○ May impact the miss rate

  • We want to lift these restrictions and enable:

○ Configurable L1 and L2 TLBs
○ From direct-mapped up to fully-associative structures

CARRV 2020 | May 29, 2020 | Virtual Workshop 2

SLIDE 3

Outline

  • Background

○ Rocket Chip Generator
○ RISC-V Virtual Memory support

  • Configurable TLB Hierarchy features
  • Methodology

○ Hardware & Software Development Flow

  • Performance and Area Results
  • Related & Future work
  • Conclusions

SLIDE 4

Rocket Chip Generator

  • SoC Generator that produces Synthesizable RTL

○ Written in Chisel
○ Rocket core or BOOM (Berkeley Out-of-Order Machine)
○ Parameterized Tiles, Caches, Accelerators, etc.

  • Library of processor parts and utilities

○ Replacement policies
○ Branch predictors
○ ...and many more

SLIDE 5

RV64-Sv39 Paging Scheme

  • 39-bit (512GB) virtual address space
  • 3-level page table
  • Supports 4KB base pages

○ But also 2MB, 1GB superpages

  • 27-bit VPN → 44-bit PPN

○ 12-bit page offset for 4KB pages

  • SATP register

○ Stores the root of the page table
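
As an illustration of the Sv39 layout above, the sketch below splits a virtual address into its three 9-bit VPN fields and the 12-bit page offset, and derives the page size from the level at which a leaf entry is found. It is a plain Python model, not code from the Rocket Chip Generator. The names `split_sv39` and `page_size` are hypothetical, and the check that bits 63-39 match bit 38 is left out for brevity.

```python
PAGE_OFFSET_BITS = 12  # 4KB base pages
VPN_FIELD_BITS = 9     # each page-table level is indexed by 9 bits (512 entries)
LEVELS = 3             # Sv39 uses a 3-level page table


def split_sv39(vaddr):
    """Split a 39-bit virtual address into ([VPN[2], VPN[1], VPN[0]], offset)."""
    if not 0 <= vaddr < (1 << 39):
        raise ValueError("expected the low 39 bits of an Sv39 virtual address")
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
    vpn = vaddr >> PAGE_OFFSET_BITS  # the 27-bit virtual page number
    mask = (1 << VPN_FIELD_BITS) - 1
    fields = [(vpn >> (VPN_FIELD_BITS * i)) & mask for i in reversed(range(LEVELS))]
    return fields, offset


def page_size(leaf_level):
    """Bytes mapped by a leaf found at the given level (0 = 4KB, 1 = 2MB, 2 = 1GB)."""
    if leaf_level not in range(LEVELS):
        raise ValueError("Sv39 has levels 0, 1 and 2 only")
    return (1 << PAGE_OFFSET_BITS) << (VPN_FIELD_BITS * leaf_level)
```

For example, `split_sv39(0x40201003)` yields VPN fields `[1, 1, 1]` and offset `3`, and `page_size(1)` is 2MB.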

SLIDE 6

Existing MMU in Rocket Chip Generator

  • Fully-associative L1 TLB

○ Separate Data/Instr L1 TLB
○ Vector of Registers
○ Fast & small (32-128 entries)

  • Direct-mapped L2 TLB

○ SyncReadMem
○ Slower but larger (128-1024 entries)

  • Fully-associative PTW Cache

○ Vector of Registers
○ Keeps non-leaf nodes
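
To show why caching non-leaf nodes helps, here is a small conceptual model of a page-table walk with a PTW cache. It is only a sketch: the page tables are Python dictionaries, the cache is keyed by (table, index), and the names `walk` and `EXAMPLE_TABLES` are hypothetical. How the real Rocket PTW cache is tagged and sized is not described on the slide.

```python
# Hypothetical 3-level page table: table id -> {index: (kind, value)}.
# A "ptr" entry points to the next-level table, a "leaf" entry holds a PPN.
EXAMPLE_TABLES = {
    0: {1: ("ptr", 1)},
    1: {1: ("ptr", 2)},
    2: {1: ("leaf", 0xABC), 2: ("leaf", 0xDEF)},
}


def walk(root, vpn_fields, tables, ptw_cache):
    """Walk the tables for [VPN[2], VPN[1], VPN[0]].

    Returns (ppn, memory_accesses). Only non-leaf entries are stored in
    ptw_cache, so a later walk that shares the upper levels skips those reads.
    A missing entry raises KeyError, standing in for a page fault.
    """
    table = root
    mem_accesses = 0
    for index in vpn_fields:
        key = (table, index)
        if key in ptw_cache:
            kind, value = "ptr", ptw_cache[key]
        else:
            mem_accesses += 1  # one memory read per uncached page-table entry
            kind, value = tables[table][index]
            if kind == "ptr":
                ptw_cache[key] = value
        if kind == "leaf":
            return value, mem_accesses
        table = value
    raise ValueError("no leaf entry found within the given levels")
```

With an empty cache, walking `[1, 1, 1]` costs three memory reads; a following walk of `[1, 1, 2]` reuses the two cached non-leaf entries and costs one.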

SLIDE 7

Configurable TLB hierarchy in Rocket

  • Kept the same overall structure

○ Lookups, refill, replacement policies, flushing

  • Added about 70 LoC for the L1 TLB
  • 50 LoC for the L2 TLB
  • Implementation in two different editions of the RCG

○ April 2018 version
■ Supports Xilinx ZCU102
○ January 2020 version

SLIDE 8

Hardware Development Flow

  • Implementation

○ Chisel & FIRRTL checks
○ Syntax errors, unconnected wires, etc.

  • Testing

○ Verilator: Cycle-accurate Simulator
○ Chisel debug statements
○ Assembly tests

  • Evaluation

○ Generate bitstream for the Xilinx ZCU102
○ Run tests and benchmarks using Buildroot

SLIDE 9

Software Flow

  • Freedom-U-SDK by SiFive

○ SW for the Freedom Unleashed

  • Buildroot

○ Minimal embedded distribution
○ Easy to add custom packages

  • Linux kernel 4.15

○ Cross-compilation for RISC-V

  • Berkeley Boot Loader (BBL)

○ Sets up performance counters (cycles, TLB misses)
○ Boots Linux

SLIDE 10

L1 | L2 TLB Contributions

                      Vanilla L1 | L2 TLB             Configurable L1 | L2 TLB
Organization          Fully-assoc | Direct-mapped     Any associativity
Parameterization      #Entries                        #Sets, #Ways (pow2)
Replacement policies  PseudoLRU/Random | No policy    PseudoLRU/Random set-associative alternatives
Other features        Sectored L1 TLB entries         Sectored L1 TLB entries are supported too
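
The #Sets/#Ways parameterization can be pictured with a small software model. The sketch below is not the Chisel implementation from the talk: the class `SetAssocTLB` and its methods are hypothetical, and it uses exact LRU per set for simplicity where the hardware offers PseudoLRU or Random. Setting one set gives a fully-associative TLB, and setting one way gives a direct-mapped one.

```python
from collections import OrderedDict


class SetAssocTLB:
    """Software model of a TLB with n_sets x n_ways entries (both powers of two)."""

    def __init__(self, n_sets, n_ways):
        for n in (n_sets, n_ways):
            if n < 1 or n & (n - 1):
                raise ValueError("n_sets and n_ways must be powers of two")
        self.n_sets = n_sets
        self.n_ways = n_ways
        # One ordered dict per set: least recently used entry first.
        self.sets = [OrderedDict() for _ in range(n_sets)]
        self.hits = 0
        self.misses = 0

    def _set_for(self, vpn):
        # The low bits of the VPN select the set.
        return self.sets[vpn & (self.n_sets - 1)]

    def lookup(self, vpn):
        """Return the cached PPN, or None on a miss."""
        entries = self._set_for(vpn)
        if vpn in entries:
            entries.move_to_end(vpn)  # mark as most recently used
            self.hits += 1
            return entries[vpn]
        self.misses += 1
        return None

    def refill(self, vpn, ppn):
        """Install a translation, evicting the LRU entry of a full set."""
        entries = self._set_for(vpn)
        if vpn not in entries and len(entries) >= self.n_ways:
            entries.popitem(last=False)
        entries[vpn] = ppn
        entries.move_to_end(vpn)
```

With `SetAssocTLB(4, 1)`, VPNs 0 and 4 map to the same set and evict each other; with `SetAssocTLB(1, 4)` both fit at once.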

SLIDE 11

Evaluation Metrics

  • FPGA Resource Usage

○ Lookup Tables (LUTs), Flip-Flops (FFs), Block RAMs (BRAMs)

  • Performance Metrics

○ SPEC2006 benchmarks (with test input set)
■ Misses-per-kilo-Instructions (MPKI)
■ Instructions-per-cycle (IPC)
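
Both performance metrics are simple ratios of raw counter values. A minimal sketch, assuming the miss, instruction and cycle counts have already been read from the performance counters (the function names are hypothetical, and the numbers in the usage note are made up, not results from the talk):

```python
def mpki(misses, instructions):
    """Misses per kilo-instruction: misses per 1000 retired instructions."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return 1000.0 * misses / instructions


def ipc(instructions, cycles):
    """Instructions per cycle."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles
```

For instance, 50 TLB misses over 10,000 instructions give an MPKI of 5.0, and 500 instructions in 1,000 cycles give an IPC of 0.5.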

SLIDE 12

Evaluation Scenarios

  • Configurations resembling well-known architectures

○ Conf III → ARM Cortex A57
○ Conf IV → Intel Skylake
○ Conf V → Intel Skylake (swapped I/D TLB sizes)

SLIDE 13

FPGA resource usage evaluation

SLIDE 14

L1 TLB Performance Evaluation (MPKI)

  • Results for L1 Data and Instruction TLBs
  • Most TLB misses come from data accesses
  • Several benchmarks show similar behavior across configurations

  • But larger L1 DTLB may improve performance
  • mcf stresses the TLB hierarchy the most

SLIDE 18

L2 TLB Performance Evaluation (MPKI)

  • L2 TLB misses are rare for most benchmarks
  • Larger L2 TLB reach may reduce page walks

○ Configurations IV and V

  • mcf improves significantly as L2 TLB increases
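
TLB reach is the amount of memory a TLB can map at once: the number of entries times the page size. The sketch below assumes every entry maps a page of the same size. The function name `tlb_reach` is hypothetical, and the entry counts in the usage note come from the 128-1024 range quoted for the vanilla L2 TLB, not from the evaluated configurations.

```python
def tlb_reach(entries, page_bytes=4096):
    """Bytes of address space covered when every entry maps one page."""
    if entries < 0 or page_bytes <= 0:
        raise ValueError("entries must be >= 0 and page_bytes must be positive")
    return entries * page_bytes
```

With 4KB pages, 128 entries reach 512KB and 1024 entries reach 4MB, so a workload whose working set falls between the two would see far fewer page walks with the larger L2 TLB.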

SLIDE 22

System Performance Evaluation (IPC)

SLIDE 25

… Further Evaluation

  • Unfortunately, the Xilinx ZCU102 board reserves only 512MB of RAM for the PL, which limited the benchmarks we could run

○ Older Rocket Chip commit

  • Correctness evaluation of the more recent RC edition
  • We plan on moving to FireSim

○ Evaluation with SPEC2017 and other benchmarks
○ + Multicore benchmarking

  • BOOM performance evaluation

SLIDE 26

Related & Future Work

  • Research/Develop new MMU features

○ Direct Segments [ISCA'13]
○ Coalesced/Clustered TLBs [MICRO'12, HPCA'14]
○ Redundant Memory Mappings [ISCA'15]
○ Hybrid TLB Coalescing [ISCA'17]

  • Reduce resource usage in FPGA simulation

○ TLBs are CAMs → FPGA-hostile structure

SLIDE 27

Conclusions

  • Enabled further configurability in the Rocket Chip Generator
  • Our design can output any L1/L2 TLB organization/size
  • Evaluated resource usage & application performance
  • Feel free to review our work on GitHub!

○ https://github.com/ncppd/rocket-chip

Thank you!
