Enabling Virtual Memory Research on RISC-V with a Configurable TLB Hierarchy for the Rocket Chip Generator
Nikos Ch. Papadopoulos, Vasileios Karakostas, Konstantinos Nikas, Nectarios Koziris, Dionisios N. Pnevmatikatos
ncpapad@cslab.ece.ntua.gr
Motivation
Explore RISC-V ISA and Rocket Chip Generator
- Vanilla L1 TLB is fully-associative
○ May impact the critical path
○ #entries vs resource usage tradeoff
- Vanilla L2 TLB is direct-mapped
○ May impact the miss rate
- We want to lift these restrictions and enable:
○ Configurable L1 and L2 TLBs
○ From direct-mapped up to fully-associative structures
CARRV 2020 | May 29, 2020 | Virtual Workshop
Outline
- Background
○ Rocket Chip Generator
○ RISC-V Virtual Memory support
- Configurable TLB Hierarchy features
- Methodology
○ Hardware & Software Development Flow
- Performance and Area Results
- Related & Future work
- Conclusions
Rocket Chip Generator
- SoC Generator that produces Synthesizable RTL
○ Written in Chisel
○ Rocket core or BOOM (Berkeley Out-of-Order Machine)
○ Parameterized Tiles, Caches, Accelerators, etc.
- Library of processor parts and utilities
○ Replacement policies
○ Branch predictors
○ ...and many more
RV64-Sv39 Paging Scheme
- 39-bit (512GB) virtual address space
- 3-level page table
- Supports 4KB base pages
○ But also 2MB, 1GB superpages
- 27-bit VPN → 44-bit PPN
○ 12-bit page offset for 4KB pages
- SATP register
○ Stores the root of the page table
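The field widths above can be illustrated with a minimal Python sketch. It assumes the standard Sv39 layout (three 9-bit VPN fields above a 12-bit offset) and only models the low 39 bits of the address; the names `split_sv39` and `page_size` are hypothetical helpers for illustration, not part of Rocket Chip.

```python
PAGE_OFFSET_BITS = 12  # 4KB base pages
VPN_FIELD_BITS = 9     # 27-bit VPN split across 3 page-table levels


def split_sv39(vaddr):
    """Split a 39-bit virtual address into (vpn2, vpn1, vpn0, offset)."""
    if not 0 <= vaddr < (1 << 39):
        raise ValueError("expected a 39-bit virtual address")
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
    vpn = vaddr >> PAGE_OFFSET_BITS
    mask = (1 << VPN_FIELD_BITS) - 1
    vpn0 = vpn & mask
    vpn1 = (vpn >> VPN_FIELD_BITS) & mask
    vpn2 = (vpn >> (2 * VPN_FIELD_BITS)) & mask
    return vpn2, vpn1, vpn0, offset


def page_size(level):
    """Bytes mapped by a leaf entry at a level: 0 -> 4KB, 1 -> 2MB, 2 -> 1GB."""
    if level not in (0, 1, 2):
        raise ValueError("Sv39 has levels 0, 1 and 2")
    return 1 << (PAGE_OFFSET_BITS + VPN_FIELD_BITS * level)
```

A leaf found one level higher in the table covers 512 times more memory, which is where the 2MB and 1GB superpage sizes come from.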
Existing MMU in Rocket Chip Generator
- Fully-associative L1 TLB
○ Separate Data/Instr L1 TLB
○ Vector of Registers
○ Fast & small (32-128 entries)
- Direct-mapped L2 TLB
○ SyncReadMem
○ Slower but larger (128-1024 entries)
- Fully-associative PTW Cache
○ Vector of Registers
○ Keeps non-leaf nodes
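As a rough behavioral sketch of how such a hierarchy is typically consulted, the Python below assumes the usual order (L1 first, then L2, then a page-table walk) and refills the upper levels on the way back. Plain dictionaries stand in for the TLBs and the page table, the PTW cache is not modeled, and `translate` is a hypothetical name, not the Rocket Chip implementation.

```python
def translate(vpn, l1, l2, page_table):
    """Return (ppn, source), where source is "L1", "L2" or "walk".

    l1, l2 and page_table are dicts mapping VPN -> PPN. A KeyError from
    page_table stands in for a page fault.
    """
    if vpn in l1:
        return l1[vpn], "L1"
    if vpn in l2:
        l1[vpn] = l2[vpn]      # refill L1 from L2
        return l1[vpn], "L2"
    ppn = page_table[vpn]      # stands in for the multi-level page walk
    l2[vpn] = ppn              # refill both levels after the walk
    l1[vpn] = ppn
    return ppn, "walk"
```

The sketch ignores capacity and replacement entirely; it only shows which structure answers a request and when refills happen.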
Configurable TLB hierarchy in Rocket
- Kept the same overall structure
○ Lookups, refill, replacement policies, flushing
- Added about 70 LoC for the L1 TLB
- 50 LoC for the L2 TLB
- Implementation in two different editions of the RCG
○ April 2018 version
■ Supports Xilinx ZCU102
○ January 2020 version
Hardware Development Flow
- Implementation
○ Chisel & FIRRTL checks
○ Syntax errors, unconnected wires, etc.
- Testing
○ Verilator: Cycle-accurate Simulator
○ Chisel debug statements
○ Assembly tests
- Evaluation
○ Generate bitstream for the Xilinx ZCU102
○ Run tests and benchmarks using Buildroot
Software Flow
- Freedom-U-SDK by SiFive
○ SW for the Freedom Unleashed
- Buildroot
○ Minimal embedded distribution
○ Easy to add custom packages
- Linux kernel 4.15
○ Cross-compilation for RISC-V
- Berkeley Boot Loader (BBL)
○ Sets up performance counters (cycles, TLB misses)
○ Boots Linux
L1 | L2 TLB Contributions
                      Vanilla L1 | L2 TLB              Configurable L1 | L2 TLB
Organization          Fully-assoc | Direct-mapped      Any associativity
Parameterization      #Entries                         #Sets, #Ways (pow2)
Replacement policies  PseudoLRU/Random | No policy     PseudoLRU/Random set-associative alternatives
Other features        Sectored L1 TLB entries          Sectored L1 TLB entries are supported too
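The #Sets/#Ways parameterization can be sketched as a small behavioral model in Python. This is only an illustration under stated assumptions: the class name `SetAssocTLB` is hypothetical, the low VPN bits are assumed to select the set, and true LRU is used as a simple stand-in for the PseudoLRU/Random policies listed above. It is not the Chisel implementation.

```python
from collections import OrderedDict


class SetAssocTLB:
    """Behavioral TLB model with power-of-two sets and ways."""

    def __init__(self, n_sets, n_ways):
        for n in (n_sets, n_ways):
            if n < 1 or n & (n - 1):
                raise ValueError("sets and ways must be powers of two")
        self.n_sets = n_sets
        self.n_ways = n_ways
        # One ordered map per set: oldest (least recently used) entry first.
        self.sets = [OrderedDict() for _ in range(n_sets)]

    def _set_for(self, vpn):
        # Assumption: the low VPN bits index the set.
        return self.sets[vpn & (self.n_sets - 1)]

    def lookup(self, vpn):
        """Return the cached PPN, or None on a miss."""
        entries = self._set_for(vpn)
        if vpn in entries:
            entries.move_to_end(vpn)  # mark as most recently used
            return entries[vpn]
        return None

    def refill(self, vpn, ppn):
        """Install a translation, evicting the LRU entry of a full set."""
        entries = self._set_for(vpn)
        if vpn not in entries and len(entries) >= self.n_ways:
            entries.popitem(last=False)
        entries[vpn] = ppn
        entries.move_to_end(vpn)
```

With `n_ways=1` the model behaves as a direct-mapped TLB, and with `n_sets=1` as a fully-associative one, which mirrors the range of organizations in the table.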
Evaluation Metrics
- FPGA Resource Usage
○ Lookup-Tables (LUTs)
○ Flip-Flops (FFs)
○ Block RAMs (BRAMs)
- Performance Metrics
○ SPEC2006 benchmarks (with test input set)
■ Misses-per-kilo-Instructions (MPKI)
■ Instructions-per-cycle (IPC)
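Both performance metrics follow directly from raw counter values. A minimal Python sketch, assuming the counters (instructions retired, cycles, TLB misses) have already been read out; the function names are hypothetical:

```python
def mpki(misses, instructions):
    """Misses per kilo-instruction: misses per 1000 retired instructions."""
    if instructions <= 0:
        raise ValueError("instruction count must be positive")
    return 1000.0 * misses / instructions


def ipc(instructions, cycles):
    """Instructions per cycle."""
    if cycles <= 0:
        raise ValueError("cycle count must be positive")
    return instructions / cycles
```

For example, 5000 TLB misses over one million instructions gives an MPKI of 5.0.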
Evaluation Scenarios
- Configurations resembling well-known architectures
○ Conf III → ARM Cortex A57
○ Conf IV → Intel Skylake
○ Conf V → Intel Skylake (swapped I/D TLB sizes)
FPGA resource usage evaluation
L1 TLB Performance Evaluation (MPKI)
- Results for L1 Data and Instruction TLBs
- Most TLB misses come from data accesses
- Several benchmarks show similar behavior across configurations
- But larger L1 DTLB may improve performance
- mcf stresses the TLB hierarchy the most
L2 TLB Performance Evaluation (MPKI)
- L2 TLB misses are rare for most benchmarks
- Larger L2 TLB reach may reduce page walks
○ Configurations IV and V
- mcf improves significantly as L2 TLB increases
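TLB reach is the amount of memory a TLB can map at once, commonly estimated as entries times page size. A small Python sketch of that arithmetic, using the 128 and 1024 entry counts mentioned earlier for the L2 TLB; `tlb_reach` is a hypothetical helper and assumes every entry maps a page of the same size:

```python
def tlb_reach(entries, page_bytes=4096):
    """Bytes of memory covered when every entry maps one page."""
    if entries < 0 or page_bytes <= 0:
        raise ValueError("entries must be >= 0 and page size positive")
    return entries * page_bytes


KB = 1024
MB = 1024 * KB

small = tlb_reach(128)    # 128 entries of 4KB pages  -> 512KB
large = tlb_reach(1024)   # 1024 entries of 4KB pages -> 4MB
```

A workload whose hot working set exceeds the reach will keep missing, which is consistent with a memory-intensive benchmark such as mcf benefiting from a larger L2 TLB.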
System Performance Evaluation (IPC)
… Further Evaluation
- Unfortunately, the Xilinx ZCU102 board reserves only 512MB of RAM for the PL, thus limiting the benchmarks we could run
○ Older Rocket Chip commit
- Correctness evaluation of the more recent RC edition
- We plan on moving to FireSim
○ Evaluation with SPEC2017 and other benchmarks
○ + Multicore benchmarking
- BOOM performance evaluation
Related & Future Work
- Research/Develop new MMU features
○ Direct Segments [ISCA'13]
○ Coalesced/Clustered TLBs [MICRO'12, HPCA'14]
○ Redundant Memory Mappings [ISCA'15]
○ Hybrid TLB Coalescing [ISCA'17]
- Reduce resource usage in FPGA simulation
○ TLBs are CAMs → FPGA-hostile structure
Conclusions
- Enabled further configurability in the Rocket Chip Generator
- Our design can output any L1/L2 TLB organization/size
- Evaluated resource usage & application performance
- Feel free to review our work on GitHub!
○ https://github.com/ncppd/rocket-chip
Thank you!