SLIDE 1

ENABLING LOW-COST AND LIGHTWEIGHT ZERO-COPY OFFLOADING ON HETEROGENEOUS MANY‐CORE ACCELERATORS: THE PULP EXPERIENCE

ERC GRANT N° 291125

IWES17 September 07-08, 2017, Rome (Italy)

Alessandro Capotondi (alessandro.capotondi@unibo.it)

Andrea Marongiu, Luca Benini (University of Bologna)

SLIDE 2


TL;DR: Low-cost Unified Virtual Memory support on embedded SoCs

SLIDE 3

Heterogeneous Manycores

The ever-increasing demand for computational power has recently led to a radical evolution of computer architectures. Two design paradigms have proven effective in increasing the performance and energy efficiency of compute systems:

  • Many-cores
  • Architectural heterogeneity

A common template couples a powerful general-purpose processor (the host) with one or more many-core accelerators.

slide-4
SLIDE 4

Heterogeneous Manycores: HPC / SERVER

  • Titan (Cray XK7): Opteron 6274 16C 2.2GHz, Cray Gemini, NVIDIA K20x
  • Tianhe-2: Xeon E5-2692 12C 2.2GHz, TH Express-2, Intel Xeon Phi
  • Gyoukou: Xeon D-1571 16C 1.3GHz, InfiniBand EDR, PEZY-SC2

SLIDE 5

Heterogeneous Manycores: the same HPC/server systems as above, now alongside SoCs (Kalray MPPA256, NVIDIA Tegra X1, TI KeystoneII).

True in every computing domain and at every scale!

SLIDE 6

Heterogeneous Manycores

  • The host executes control-intensive and sequential tasks; highly parallel tasks are offloaded at a fine grain to the accelerator
  • Host and accelerator communicate via coherent shared memory
  • IOMMU for hUMA in high-end systems
  • Examples: CUDA 6 Unified Virtual Memory; the Pascal architecture and Tegra X series
SLIDE 7

Heterogeneous Manycores

What about low-power, embedded systems?

SLIDE 8

Embedded Heterogeneous SoCs

  • Many-core accelerators: Kalray MPPA256, Adapteva Epiphany, STMicroelectronics STHORM
  • DSP/ASIC/FPGA accelerators: TI KeystoneII, Xilinx Zynq, Altera Arria

SLIDE 9

Embedded Heterogeneous SoCs

Coherent virtual memory for the host. The accelerator can only access a contiguous section in shared main memory; no virtual memory.

copy-based approach

SLIDE 10

Embedded Heterogeneous SoCs: the copy-based approach

Pros

  • Does not require specific HW
  • Cheap and low-power

Cons

  • Overhead for copying data to/from the dedicated memory
  • Complex data structures require ad-hoc transfers
  • Performance issues on non-paged sections
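To make the cons concrete, here is a minimal sketch of a copy-based offload under stated assumptions: contig_alloc, contig_to_phys, and pulp_run are hypothetical names standing in for a contiguous-memory allocator and an offload trigger, not an actual API.

    /* Sketch of the copy-based flow the pros/cons above refer to.
       All names (contig_alloc, contig_to_phys, pulp_run) are hypothetical. */
    #include <string.h>
    #include <stddef.h>

    extern void *contig_alloc(size_t n);            /* contiguous, accelerator-visible */
    extern unsigned long contig_to_phys(void *p);   /* constant-offset translation     */
    extern void pulp_run(unsigned long in_phys, unsigned long out_phys, size_t n);

    void offload_copy_based(const float *in, float *out, size_t n)
    {
        float *buf = contig_alloc(2 * n * sizeof *buf);
        memcpy(buf, in, n * sizeof *buf);                 /* copy-in overhead  */
        pulp_run(contig_to_phys(buf),
                 contig_to_phys(buf + n), n);             /* accelerator runs  */
        memcpy(out, buf + n, n * sizeof *buf);            /* copy-out overhead */
    }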


SLIDE 12

Contributions

  • Lightweight mixed HW/SW-managed IOMMU for UVM support
    – PULP architecture
    – IOMMU implementation
  • GNU GCC toolchain extensions for offloading to the PULP accelerator
    – Compiler extensions
    – Runtime/library extensions
  • UVM experimental evaluation on OpenMP offloading
SLIDE 13

PULP - An Open Parallel Ultra-Low-Power Processing Platform

A joint project between the Integrated Systems Laboratory (IIS) of ETH Zurich and the Energy-efficient Embedded Systems (EEES) group of UNIBO to develop an open, scalable hardware and software research platform, with the goal of breaking the pJ/op barrier within a power envelope of a few mW. The PULP platform is a multi-core platform achieving leading-edge energy efficiency and featuring widely tunable performance.

Cluster-based, scalable, silicon-proven, OpenRISC/RISC-V.

SLIDE 14

Not only an ULP power envelope!

SLIDE 15

PULP as a heterogeneous programmable accelerator emulator

Host: dual-core ARM Cortex-A9 running full-fledged Ubuntu 16.04. Accelerator: 8-core PULP Fulmine cluster (www.pulp-platform.org).

SLIDE 16

Lightweight UVM (Unified Virtual Memory)

Goals:

  • Sharing of virtual address pointers
  • Transparent to the application developer
  • Zero-copy offload, performance predictability
  • Low complexity, low area, low cost
  • Non-intrusive to the accelerator architecture

Mixed hardware/software solution:

  • Input/output translation lookaside buffer (IOTLB)
  • Special-purpose TRYX control register

Requires:

  • Compiler extension to insert tryread/trywrite operations
  • Kernel-level driver module

Shared Memory Remapping Address Block (RAB):

  • Virtual-to-physical address translation
  • Per-port private IOTLBs, shared configuration interface
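To make the mechanism concrete, here is a minimal sketch of the protection the compiler inserts around accesses to shared virtual pointers; the tryread/trywrite names come from this slide, while the TRYX register address and the polling protocol are illustrative assumptions, not the actual PULP HAL.

    /* Sketch only: protected read of a shared virtual address.
       The TRYX_REG address and its semantics are assumptions. */
    #include <stdint.h>

    #define TRYX_REG (*(volatile uint32_t *)0x10200000u)  /* hypothetical address */

    static inline uint32_t tryread(const volatile uint32_t *addr)
    {
        uint32_t v = *addr;      /* the access may miss in the RAB IOTLB        */
        while (TRYX_REG != 0) {  /* TRYX flags the miss; the core sleeps until  */
            v = *addr;           /* the host repairs the mapping, then retries  */
        }
        return v;
    }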

SLIDE 17
Lightweight UVM (Unified Virtual Memory)

  • No hardware modifications to the processing elements.
  • Portable RAB miss handling routine on the host.
  • Optimized for the common case: overhead of 8 cycles.

SLIDE 18

OpenMP

▲ De-facto standard for shared-memory programming
▲ Support for nested (multi-level) parallelism → good for clusters
▲ Annotations to incrementally convey parallelism to the compiler → increased ease of use
▲ Based on well-understood programming practices (shared memory, C language) → increased productivity

"OpenCL for programming shared memory multicore CPUs" by Akhtar Ali, Usman Dastgeer, Christoph Kessler

SLIDE 19

OpenMP

▲ Since specification 4.0, OpenMP supports a heterogeneous execution model based on offloads!

At the moment, GCC supports OpenMP offloading ONLY to:

  • Intel Xeon Phi
  • NVIDIA PTX (only through OpenACC)
SLIDE 20

The compiler outlines the code within the target region and generates a binary version for each accelerator (multi-ISA). The runtime libraries are in charge of:

  • managing the accelerator devices
  • mapping the variables
  • running/waiting for execution of target regions

OpenMP target example

    void vec_mult()
    {
        double p[N], v1[N], v2[N];
    #pragma omp target map(to: v1, v2) map(from: p)
        {
    #pragma omp parallel for
            for (int i = 0; i < N; i++)
                p[i] = v1[i] * v2[i];
        }
    }

  • 1. Initialize target device
  • 2. Offload target image
  • 3. Map TO the device mem
  • 4. Trigger execution target region
  • 5. Wait termination
  • 6. Map FROM the device mem
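These six steps map onto host-side glue code roughly as in the sketch below; every dev_* function is a hypothetical stand-in for the libgomp machinery (the actual GOMP_OFFLOAD_* plugin entry points appear on a later slide), not a real API.

    /* Hypothetical pseudo-driver for the offload sequence above (steps 1-6).
       All dev_* names are illustrative stand-ins, not real libgomp calls. */
    #include <stddef.h>

    extern void  dev_init(void);                                /* step 1 */
    extern void  dev_load_image(const void *image);             /* step 2 */
    extern void *dev_map_to(const void *host, size_t n);        /* step 3 */
    extern void  dev_run(const char *fn, void **args);          /* step 4 */
    extern void  dev_wait(void);                                /* step 5 */
    extern void  dev_map_from(void *host, void *dev, size_t n); /* step 6 */

    void offload_vec_mult(const double *v1, const double *v2, double *p,
                          size_t n, const void *target_image)
    {
        dev_init();                                 /* 1. initialize target device    */
        dev_load_image(target_image);               /* 2. offload target image        */
        void *d1 = dev_map_to(v1, n * sizeof *v1);  /* 3. map TO the device memory    */
        void *d2 = dev_map_to(v2, n * sizeof *v2);
        void *dp = dev_map_to(NULL, n * sizeof *p); /*    allocation only, for 'from' */
        void *args[] = { d1, d2, dp };
        dev_run("vec_mult._omp_fn.0", args);        /* 4. trigger target region       */
        dev_wait();                                 /* 5. wait for termination        */
        dev_map_from(p, dp, n * sizeof *p);         /* 6. map FROM the device memory  */
    }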
SLIDE 21
GNU GCC Extensions

  • Added PULP as a target accelerator
    – Enabled the OpenRISC back-end as an OpenMP4 accelerator-supported ISA
    – Created an ad-hoc lto-wrapped linker tool for PULP offloaded regions (pulp-mkoffload)
  • Enabled UVM (zero-copy) support for PULP
    – Added a new SSA pass to protect uses of shared mapped variables between the accelerator and the host

SLIDE 22

Added PULP as target accelerator (1)

ORIGINAL CODE:

    struct vertex {
        unsigned int vertex_id, n_successors;
        float pagerank, pagerank_next;
        struct vertex **successors;
    } *vertices;

    #pragma omp target map(tofrom: vertices, n_vertices)
    for (i = 0; i < n_vertices; i++) {
        vertices[i].pagerank = compute(...);
        vertices[i].pagerank_next = compute_next(...);
        pr_sum += (vertices + i)->pagerank;
        if ((vertices + i)->n_successors == 0) {
            pr_sum_dangling += (vertices + i)->pagerank;
        }
    }

[Toolchain diagram: GCC (arm-linux-gnueabihf-gcc) runs cc1 to produce src.object (ARM ISA) with sections .text, .text.target._omp_fn.0, {.data, .bss, etc.}, .gnu.offload_vars, .gnu.offload_funcs; the LTO object (GIMPLE) sections .gnu.offload_lto_target._omp_fn.0 and .gnu.offload_lto_.{decls, refs, etc.} are appended; ld and the lto-wrapper then hand the LTO objects to pulp-mkoffload, which drives or1kl-none-gcc (cc1-lto, ld).]

The Link-Time Optimization (LTO) representation of the target regions is appended to the object file.

SLIDE 24

Added PULP as target accelerator (2)

[Toolchain diagram: at link time, the final src.bin (ARM ISA) carries the host sections (.text, .text.target._omp_fn.0, {.data, .bss, etc.}, .gnu.offload_vars, .gnu.offload_funcs) plus a target.bin (or1k ISA), built against the PULP syslibs (HAL, libgomp) and embedded in .gnu.offload_images.]

Linking: all LTO objects are passed by the lto-wrapper to pulp-mkoffload, which:

  • compiles the target region to the accelerator ISA
  • links the pre-compiled accelerator system libraries (PULP syslibs)
  • appends the whole .gnu.offload_image to the "host" binary

SLIDE 25

Compiler UVM support for PULP

[Compiler pass pipeline inside cc1: OpenMP expansion (RAB statement marking), the new PULP_RAB_Pass among the early SSA passes (before SSA-opt1), the remaining SSA passes, then the IPA passes and ipa_write_passes.]

After PULP_RAB_Pass (tryread/trywrite inserted by the compiler):

    for (i = 0; i < tryread(&n_vertices); i++) {
        vertex_i = tryread(&vertices) + i*20;  /* &vertices[i]: sizeof(struct vertex) == 20 */
        p_rank = compute(...);
        trywrite(&vertex_i->pagerank, p_rank);
        trywrite(&vertex_i->pagerank_next, compute_next(...));
        pr_sum = p_rank + pr_sum;
        if (tryread(&vertex_i->n_successors) == 0)
            pr_sum_dangling = p_rank + pr_sum_dangling;
    }

SLIDE 26

Compiler UVM support for PULP

OpenMP Expansion Pass: annotates the statements containing the first use of every mapped variable.

New SSA PULP_RAB_Pass: traverses use-def chains to determine which uses of the value/address of the annotated variables need to be instrumented.

SLIDE 27

Full SW stack overview

[SW stack diagram: the host passes virtual address pointers; the compiler protects accesses to virtual address pointers (extension: only pointers passed by the host are protected, thanks to tight integration into the compiler); the kernel-level driver handles RAB misses and wakes up sleeping cores; offloads go through the PULP libgomp plugin.]

SLIDE 28

Experimental setup

Objective: while UVM's greatest advantage is simplified programmability, we want to evaluate its impact on performance.

Benchmarks:

  • memcpy (MEM): representative of heavily memory-bound, streaming applications with a regular access pattern to shared memory.
  • pointer chasing (PC): representative of graph-processing applications with highly irregular access patterns, like PageRank, Breadth-First Search, clustering, Nearest Neighbor Search.
  • random forest traversal (RFT): representative of irregular applications for regression, classification problem solving, and pattern recognition.
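To make the PC pattern concrete, here is a minimal sketch of a pointer-chasing kernel offloaded with zero-copy semantics; the node layout and the reduction are assumptions for illustration, not the actual benchmark code.

    /* Sketch of the pointer-chasing (PC) pattern under UVM: the accelerator
       dereferences host virtual pointers directly, so the linked structure
       needs no marshalling. The node layout is an assumption. */
    #include <stddef.h>

    struct node {
        int          value;
        struct node *next;   /* host virtual pointer, resolved through the RAB */
    };

    long chase(struct node *head)
    {
        long sum = 0;
    #pragma omp target map(to: head) map(tofrom: sum)
        {
            for (struct node *n = head; n != NULL; n = n->next)
                sum += n->value;   /* each dereference may trigger a RAB miss */
        }
        return sum;
    }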

SLIDE 29

Results (1)

On regular applications, UVM executes on average 1.6× faster. Capacity RAB misses, arising when the data size exceeds the TLB capacity, limit the speedup to 1.79×.

[Charts: speedup (UVM vs copy-based) for MEM over data size (128-512 KiB), PC over graph size (10240-20240 KiB), and RFT over tree depth (12-16), at CCRs of 0.2 and 20 cycles/byte.]

SLIDE 30

Results (2)

PC shows a slowly but steadily increasing speedup (up to 1.4× for the considered data sets). Small graphs are penalized by the higher RAB miss-handling costs compared to regular applications like MEM.

SLIDE 31

Results (3)

RFT reaches 4.11× and 2.85× speedup for CCRs of 20 and 0.2, respectively. The higher speedups are due to the fact that the copy-based approach copies a lot of data that is (potentially) never accessed.

SLIDE 32

Conclusion

We presented an RTL-proven, mixed HW/SW, lightweight IOMMU for low-power embedded many-core accelerators. We presented a full implementation of OpenMP 4 in GCC for the PULP architecture. We extended the toolchain at the compiler and runtime level to enable Unified Virtual Memory support, achieved by a low-cost, low-area IOTLB infrastructure. With smaller programming effort, UVM enables a performance gain compared to standard copy-based offloading mechanisms.

SLIDE 33

Conclusion (2)

Current status:

  • Making the first OpenMP-ready RISC-V accelerator!
  • Bringing UVM support to FPGA accelerators (custom or HLS flows, SDSoC, etc.)

Looking ahead:

  • Open-source release (near future)
  • Ultra-large-scale accelerators (tens, hundreds of clusters)

Contact us if you are interested in using it as a research platform or in joining as a collaborator!

http://www.pulp-platform.org/

SLIDE 34

Work supported by

ERC GRANT N° 291125

Thank You!

Questions? Answers?

Alessandro Capotondi (alessandro.capotondi@unibo.it)

SLIDE 35

How many parallel programming models?

  • Proprietary programming models
  • Khronos standard for heterogeneous computing
  • Standard for shared-memory systems
  • Academic proposals: OmpSs, OpenHMPP, SparkCL, many others
SLIDE 36

GCC Runtime Library Extensions

On the host side:

  • Modified the standard libgomp to remove the forced device-to/from-host data transfers
  • Created two libgomp plugins for the PULP accelerator: pulp-vmem-plugin.so and pulp-cmem-plugin.so

On the PULP side:

  • Customized the (already existing) libgomp to manage offload requests

libgomp <host> plugin interface: GOMP_OFFLOAD_get_name, GOMP_OFFLOAD_get_caps, GOMP_OFFLOAD_get_type, GOMP_OFFLOAD_get_num_devices, GOMP_OFFLOAD_init_device, GOMP_OFFLOAD_fini_device, GOMP_OFFLOAD_version, GOMP_OFFLOAD_load_image, GOMP_OFFLOAD_unload_image, GOMP_OFFLOAD_alloc, GOMP_OFFLOAD_free, GOMP_OFFLOAD_host2dev, GOMP_OFFLOAD_dev2host, GOMP_OFFLOAD_dev2dev, GOMP_OFFLOAD_run, GOMP_OFFLOAD_async_run
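For a flavor of what such a plugin looks like, here is a skeletal sketch exporting three of the callbacks above; the GOMP_OFFLOAD_* names come from the slide, but the signatures shown are simplified assumptions rather than the exact libgomp plugin ABI.

    /* Skeletal libgomp offload plugin for PULP (sketch only). Entry-point
       names are from the slide; signatures are simplified assumptions. */
    #include <stdbool.h>

    const char *GOMP_OFFLOAD_get_name(void)
    {
        return "pulp";               /* hypothetical device name */
    }

    int GOMP_OFFLOAD_get_num_devices(void)
    {
        return 1;                    /* a single PULP cluster */
    }

    bool GOMP_OFFLOAD_init_device(int device)
    {
        (void)device;
        /* e.g. map the cluster memories, configure the RAB, boot the cores */
        return true;
    }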

SLIDE 37
Memory Sharing in Embedded Systems (1)

  • Contiguous Memory Allocator (CMA):
    – Pre-allocate a contiguous kernel-space buffer at boot time.
    – Apply a constant offset for virtual-to-physical address translation.
    – Zero-copy
  • Drawbacks:
    – Requires a custom kernel module to expose the contiguous memory to user space and to get the physical address.
    – High latency, no guarantees on availability
    – The contiguous buffer is uncached on ARM
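The constant-offset translation is what makes this scheme cheap; a minimal sketch, assuming the custom kernel module exports the buffer's virtual and physical base addresses (the names are illustrative):

    /* Sketch: inside the CMA buffer, virtual-to-physical translation is a
       single subtraction; no page-table walk is needed. The base variables
       are hypothetical exports of the custom kernel module. */
    #include <stdint.h>

    static uintptr_t cma_virt_base;   /* filled in via the kernel module */
    static uintptr_t cma_phys_base;

    static inline uintptr_t cma_virt_to_phys(const void *p)
    {
        return (uintptr_t)p - cma_virt_base + cma_phys_base;
    }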
SLIDE 38
  • Fulmine cluster with 8 cores, 256 KiB L1, 8 KiB I$
  • RAB:
    – L1 TLB: 4 + 32 slices
    – L2 TLB: 1024 entries
  • IOMMU [Kornaros et al., SoC'14]:
    – 64-entry IOTLB, 6-cycle look-up latency

Results: FPGA Resource Utilization

    Block             | Slice LUTs [K] | Slice Regs [K] | Block RAM [Kb]
    PULP Cluster      | 120            | 56             | 2163
    L1 TLBs           | 6.6            | 4.7            | -
    L2 TLB            | 0.3            | 0.1            | 45
    Buffer & Control  | 1.8            | 2.7            | -
    RAB Total         | 8.7            | 7.5            | 45
    IOMMU [Kornaros]  | 11.15          | -              | 407.65

SLIDE 39
Platform Details

  • ARM @ 333 MHz, PULP @ 50 MHz, DDR @ 300 MHz
  • The RAB could be clocked at 100 MHz, for a peak bandwidth to shared memory of 6.4 Gbps
  • FIFO replacement strategy for RAB management

SLIDE 40
RAB Miss Handling: Cost Breakdown Analysis

  • Average RAB miss handling time: ~5500 cycles
  • RAB miss handler:
    – Not optimized for the host architecture, fully portable
    – Page table walker not executable in interrupt context
    – Uses the Concurrency Managed Workqueue API of Linux
  • Cost breakdown:
    – 20% until the host starts to handle the interrupt and schedules the work
    – 50% until the worker thread starts to handle the miss
    – 30% actual miss handling, of which 23% is get_user_pages()
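For context, a hedged sketch of the worker-thread part of the miss handler: pin the faulting user page with get_user_pages() (which cannot run in interrupt context, hence the workqueue), then program a RAB slice. rab_program_slice() is a hypothetical helper, and get_user_pages()'s signature has changed across kernel versions; the form used here matches recent kernels.

    /* Sketch of the worker-thread part of the RAB miss handler; assumes the
       caller holds the offloading process' mmap read lock. */
    #include <linux/mm.h>

    extern void rab_program_slice(unsigned long va, phys_addr_t pa, size_t len);

    static int handle_rab_miss(unsigned long va)
    {
        struct page *page;

        /* Page-table walk + pinning: not allowed in interrupt context. */
        if (get_user_pages(va & PAGE_MASK, 1, FOLL_WRITE, &page) != 1)
            return -EFAULT;

        /* Write the virtual-to-physical mapping into a free RAB slice. */
        rab_program_slice(va & PAGE_MASK, page_to_phys(page), PAGE_SIZE);
        return 0;
    }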