SLIDE 1

HERO: Open-Source Heterogeneous Embedded Research Platform for Exploring RISC-V Manycore Architectures on FPGA

First Workshop on Computer Architecture Research with RISC-V (CARRV) @ MICRO 50

2017-10-14

Andreas Kurth Pirmin Vogel Alessandro Capotondi Andrea Marongiu Luca Benini

Integrated Systems Laboratory Digital Circuits and Systems Group

SLIDE 2

Heterogeneous Embedded Systems on Chip (HESoCs)

Host PMCA PMCA N

I/O

Shared Main Memory

Die shot of an Apple A11 SoC (Source: chipworks). Architectural template of HESoCs.

HESoCs co-integrate a general-purpose host processor and efficient, domain-specific programmable manycore accelerators (PMCAs). They combine versatility with extreme nominal energy efficiency. While industry rapidly advances products, ...

  • A. Kurth, P. Vogel, A. Capotondi, A. Marongiu, L. Benini (Digital CAS Group, IIS)

1 / 20

SLIDE 3

The Research Gap on HESoCs

... research on HESoCs lags behind!

Host PMCA PMCA N

I/O

Shared Main Memory

Architectural template of HESoCs.

There are many open questions in various areas of computer engineering: programming models, task distribution and scheduling, memory organization, communication, synchronization, accelerator architectures and granularity, ... But there is no research platform for HESoCs!

SLIDE 4

Problems with Simulating HESoCs

Developing HESoC components in isolation and estimating their system-level performance is problematic: Complex interactions between host, accelerators, and memory hierarchy make (reasonably accurate) simulations orders of magnitude slower than running prototypes. Even full-system simulators (e.g., GEM5) do not model all HESoC components. Models make assumptions about non-deterministic processes. The validity of results thus entirely depends on the validity of assumptions, and the assumptions for modeling HESoCs are very complex.

Conclusion: A research platform for HESoCs must be available.

This is not only about hardware: For system-level research, the platform must be efficiently programmable. Additionally, the platform should come with tools to increase the observability and decrease the validation and implementation overhead of the prototype.

SLIDE 5

HERO: Open-Source Heterogeneous Embedded Research Platform

Heterogeneous Hardware Architecture

[Figure: HERO's heterogeneous hardware architecture. The host is an ARM Juno SoC: dual Cortex-A57 and quad Cortex-A53 clusters, each core with private L1 I$/D$ and MMU, per-cluster L2 $, and a coherent interconnect to shared DDR DRAM. TLX-400 links and an ACE-Lite port connect the host to the PMCA on FPGA: L clusters, each with N RISC-V PEs, a shared L1 I$, M L1 SPM banks, a DMA engine, a timer, an event unit, and a shared APU, joined by an X-bar interconnect to a mailbox, L2 memory, and the RAB on the SoC bus.]

Heterogeneous Software Stack

  • single-source, single-binary cross-compilation toolchain
  • OpenMP 4.5
  • shared virtual memory for host and PMCA

[Figure: HERO's software stack. On the host, a heterogeneous application with the OpenMP RTE and libraries runs at user level on top of a Linux kernel with a driver; the offloaded kernel runs on the PMCA hardware with its own OpenMP RTE, runtime environment, VMM library, and HAL.]

Profiling and automated verification solutions

SLIDE 6

HERO’s Hardware Architecture

[Figure: block diagram of HERO's hardware (ARM Juno host, TLX-400/ACE-Lite links, RISC-V PMCA clusters, shared DDR DRAM), as on the previous slide.]

  • industry-standard, hard-macro ARM Cortex-A host processor
  • scalable, configurable, modifiable FPGA implementation of a silicon-proven, cluster-based PMCA with RISC-V PEs
  • shared main DRAM
  • low-latency interconnect, which offers coherency to host caches

HERO's hardware, as implemented on the Juno ADP.

SLIDE 7

PMCA Implementation on FPGA: Overview

[Figure: the PMCA on FPGA. L clusters, each with N RISC-V PEs, a shared L1 I$, M L1 SPM banks, a multi-channel DMA engine, a timer, an event unit, and a shared APU on a cluster bus; an X-bar interconnect joins the clusters with the mailbox, L2 memory, and the RAB on the SoC bus, which connects to the host via TLX-400.]

  • multi-cluster design to overcome scalability limitations
  • multi-banked, software-managed scratchpad memories (SPMs) and a multi-channel DMA engine instead of data caches
  • RISC-V processing elements (PEs) and shared auxiliary processing units (APUs) operating on local data
  • shared virtual memory access through the software-managed, lightweight Remapping Address Block (RAB)

PMCA based on the PULP architectural template.

SLIDE 8

PMCA on FPGA: Configurable, Modifiable, and Expandable

Configurable:

[Figure: PMCA block diagram, as on the previous slide.]

  • # of clusters ∈ {1, 2, 4, 8}
  • # of PEs per cluster ∈ {2, 4, 8}
  • FPU ∈ {private, shared (APU), off}
  • integer DSP unit ∈ {private, shared (APU)}
  • L1 SPM size and # of banks
  • I$ design, size, and # of banks
  • L2 SPM size
  • system-level interconnect topology
  • RAB L1 TLB size; L2 TLB size, associativity, and # of banks

Modifiable and expandable: All components are open-source and written in industry-standard SystemVerilog. Interfaces are either standard (mostly AXI) or simple (e.g., stream-payload). New components can be easily added to the memory map.

SLIDE 9

HERO’s Memory Map (from the Perspective of a PE in the PMCA)

256 MiB of virtual addresses (0x1000 0000 – 0x1FFF FFFF) are reserved for PMCA-internal usage:

  • Own Cluster: 0x1B00 0000 – 0x1B3F FFFF
      • Tightly-Coupled Data Memory: 0x1B00 0000
      • Peripherals: 0x1B20 0000
      • Timer: 0x1B20 0400
      • Clkgate Control: 0x1B20 0900
      • Event Unit: 0x1B40 4000
      • DMA Control: 0x1B40 4400
  • SoC Peripherals: 0x1A10 0000 – 0x1A1F FFFF
      • UART: 0x1A10 2000
      • Standard I/O: 0x1A11 0000
      • Mailbox: 0x1A12 0000
      • RAB Configuration: 0x1A13 0000
  • Remote Cluster 0: 0x1000 0000 – 0x103F FFFF
  • Remote Cluster 1: 0x1040 0000 – 0x107F FFFF
  • Remote Cluster n: (0x1000 0000 + n·0x40 0000) – (0x103F FFFF + n·0x40 0000)
  • L2 Memory: 0x1C00 0000 – 0x1CFF FFFF

SLIDE 10

HERO’s Software Stack

Allows writing programs that start on the host but seamlessly integrate the PMCAs.

[Figure: HERO's software stack, as on slide 5 — user-level heterogeneous application with OpenMP RTE and libraries on the host, a kernel-level driver in Linux, and the offloaded kernel with OpenMP RTE, VMM library, and HAL on the PMCA.]

int main() {
    vertex vertices[N];
    load(&vertices, N);
    #pragma omp target map(tofrom: vertices)
    {
        #pragma omp parallel for
        for (i = 0; i < N; ++i)
            vertices[i] = process();
    }
}

  • Offloads with OpenMP 4.5 target semantics, zero-copy (pointer passing) or copy-based
  • Integrated cross compilation and single-binary linkage
  • PMCA-specific runtime environment and hardware abstraction library (HAL)

SLIDE 11

Software Stack: OpenMP

The libgomp plugin determines how input and output variables are passed between host and PMCA: With copy-based shared memory, data is copied to and from a physically contiguous, uncached section in main memory, and physical pointers are passed to the PMCA. Shared virtual memory enables zero-copy offloads, directly passing virtual pointers to the PMCA.

Furthermore, the plugin implements essential OpenMP functionality efficiently on the specific PMCA hardware, such as:

  • parallel (starting parallel execution)
  • team (definition of parallel thread teams)
  • sections (distributed, one-time execution worksharing)
  • barrier (synchronization barrier)
  • critical (single-threaded execution within a parallel region)

SLIDE 12

Software Stack: Runtime Environment and VMM Library

The PMCA can access the page table of the heterogeneous user-space application and can operate its virtual memory hardware, the RAB, autonomously: Assume a core accesses a virtual address that is currently not mapped in the RAB. That core goes to sleep, and its miss is enqueued in the RAB. Another core handles the miss using the VMM library (details in the paper). The VMM library is compatible with any host architecture supported by the Linux kernel.

SLIDE 13

Software Stack: Cross Compilation Toolchain

OpenMP offloading with the GCC toolchain requires a host compiler plus one target compiler for each PMCA ISA in the system.

[Figure: the heterogeneous GCC toolchain. arm-linux-gnueabihf-gcc (cc1, ld, lto-wrapper, pulp-mkoffload) produces the host binary src.bin (ARM ISA) with .gnu.offload_vars, .gnu.offload_funcs, and .gnu.offload_images sections; riscv-none-gcc (cc1-lto, ld) builds target.bin (RISC-V ISA) against the PULP syslibs, libgomp, and the HAL, applying OpenMP expansion and ipa_write passes on the SSA representation.]

Details in: Capotondi et al. 2017. Enabling zero-copy OpenMP offloading on the PULP many-core accelerator.

A target compiler requires both compiler extensions (e.g., additional compilation and optimization passes) and runtime extensions (e.g., libgomp plugins).

HERO includes the first non-commercial heterogeneous cross compilation toolchain.

SLIDE 14

Tools: Cycle-Accurate, Non-Interfering Tracer

Problem: Poor observability of implementation internals compared to simulation.
Solution: Run-time event tracing and post-mortem analysis.

Requirements on the tracer:

1. It must not interfere with program execution (e.g., inserting instructions to write memory is not an option),
2. it must be cycle-accurate yet able to trace millions of consecutive events (to cover complex applications), and
3. it should use FPGA resources economically (so as not to hamper the evaluation of complex hardware).

Hybrid tracer design: software-controlled, lightweight, customizable hardware blocks that ...

  • can use any signals in the fabric for data and trigger,
  • log timestamped events to dedicated, local buffers, and
  • get flushed to main memory by the host while the PMCA is “frozen”.

Details are in the paper.

SLIDE 15

Tools: Event Analysis

Input: Recorded events from tracers, e.g., memory transactions at the RAB (with meta information on source, read/write, and hit/miss):

    timestamp   address      meta
    4789575     0x00580384   (1, 7), R, M
    4790083     0x00581d2c   (1, 2), W, H
    4790493     0x00580384   (1, 5), R, H

Example: Time sequence analysis of events

  • VM hardware: L1 TLB hit-under-miss behavior and L2 TLB latency
  • VM software: Page table walk and TLB entry replacement

Example: Memory access pattern analysis (virtual address over time)

Two phases of a parallel graph processing algorithm are shown; different colors are different PEs.

1. First phase: Two linear traversals and sparse memory accesses in parallel.
2. Second phase: Single linear traversal.

SLIDE 16

Tools: Automated Builds and Tests

Automated full-system builds and tests are a prerequisite for many effective development paradigms. They are fairly standard for the hardware or the software alone, but the combination is highly complex on HESoCs. Our solution is described in the paper.

[Figure: the automated build-and-test flow. Hardware-related: a manual HDL change to an IP triggers unit-test simulation, simulation of all testbenches, bitstream builds for all targets, and result checks before bitstreams are deployed to the target platforms. Software-related: a SW change triggers compiling and running individual tests, building all tests and benchmarks in all platform-specific configurations, and executing all (application, build configuration, run parameter) combinations on all platforms. Failures loop back to manual changes.]

SLIDE 17

Supported Platforms and Configurations

Property                      | ARM Juno (with a Xilinx Virtex-7 2000T) | Xilinx Zynq ZC706
------------------------------|-----------------------------------------|---------------------------
Host CPU                      | 64-bit ARMv8 big.LITTLE                 | 32-bit ARMv7 dual-core A9
Shared main memory            | 8 GiB DDR3L                             | 1 GiB DDR3
PMCA clock frequency          | 31 MHz                                  | 57 MHz
# of RISC-V PEs               | 64 in 8 clusters                        | 8 in 1 cluster
Integer DSP unit              | private per PE                          | private per PE
L1 SPM                        | 256 KiB in 16 banks                     | 256 KiB in 16 banks
Instruction cache             | 8 KiB in 8 single-ported banks          | 4 KiB in 4 multi-ported banks
Slices used by clusters       | 80%                                     | 65%
Slices used by infrastructure | 7%                                      | 12%
BRAMs used by clusters        | 89%                                     | 70%
BRAMs used by infrastructure  | 6%                                      | 13%
Price                         | $25,000                                 | $2,500

SLIDE 18

Case Study: Parallel Speedup Analysis

Benchmarking parallel execution and data transfers of the PMCA on the Juno ADP Matrix-matrix multiplication C = AB A and C are tiled row-wise over the clusters, and each row is parallelized block-wise over the PEs. Data is transferred with DMA bursts, and all PEs operate on data in local SPMs. baseline perfect linear speedup bottleneck interconnect

(NoC could be selected instead)

◮ HERO allows to make architectural choices based on measured results of benchmarks.

SLIDE 19

Case Study: Shared Virtual Memory Performance Analysis

The main motivation for shared virtual memory (SVM) is programmability. However, SVM can also significantly improve performance!

PageRank is a well-known algorithm for analyzing the connectivity of graphs. The overhead of manipulating pointers at offload time in copy-based offloading exceeds the run-time overhead of translating pointers with shared virtual memory. In this case, SVM reduces the run time by nearly 60 %.

[Plot: normalized run time of offload, kernel execution, and total, copy-based SM vs. SVM.]

MemCopy simply copies a large array from DRAM to the PMCA and back, which is representative of streaming applications with little actual work. Letting the host copy data to physically contiguous, uncached memory is much slower than letting the PMCA access data directly with high-bandwidth DMA transfers. In this case, SVM reduces the run time by more than 95 %.

[Plot: normalized run time of offload, kernel execution, and total, copy-based SM vs. SVM.]

◮ HERO allows backing research claims with reproducible, falsifiable implementation results.

SLIDE 20

Conclusion

HERO is the first open-source heterogeneous embedded research platform. It unites an ARM Cortex-A host processor with a fully modifiable RISC-V manycore implemented on an FPGA. HERO enables efficient hardware and software research on HESoCs through a heterogeneous software stack, which supports shared virtual memory and OpenMP 4.5 (tremendously simplifying the porting of standard benchmarks and real-world applications), and through profiling and automated verification solutions. We have been successfully using HERO in our research over the last years and will continue its development as open-source hardware and software!

SLIDE 21

HERO Will Be Released Open-Source!

Coming Q4 2017

pulp-platform.org/hero
