M3: INTEGRATING ARBITRARY COMPUTE UNITS AS FIRST-CLASS CITIZENS

SLIDE 1

Faculty of Computer Science, Institute for System Architecture, Operating Systems Group

M3: INTEGRATING ARBITRARY COMPUTE UNITS AS FIRST-CLASS CITIZENS

OS: Nils Asmussen, Hermann Härtig, Marcus Völp

EE: Benedikt Nöthen, Gerhard Fettweis

Dagstuhl Seminar, 02/09/2017

SLIDE 2

Why?

  • FPGA-based memcached 16x better in performance per watt than Atom CPU [1]
  • Machine learning accelerator is 20% faster than FPGA and requires 128 times less energy [2]
  • ...

[1] Thin servers with smart pipes: Designing SoC accelerators for memcached, ISCA’13
[2] PuDianNao: A polyvalent machine learning accelerator, ASPLOS’15

Nils Asmussen Slide 2 of 16

SLIDE 3

The Problem for OSes

[Figure: compute units in a heterogeneous system: Intel Xeon (x2), ARM big, ARM LITTLE, DSP (x2), Audio Decoder, FPGA]

SLIDE 4

The Problem for OSes

[Figure: Intel Xeon (x2), ARM big, ARM LITTLE, DSP (x2), Audio Decoder, FPGA; two Kernel labels]

SLIDE 5

The Problem for OSes

[Figure: Intel Xeon (x2), ARM big, ARM LITTLE, DSP (x2), Audio Decoder, FPGA; four Kernel labels]

SLIDE 6

The Problem for OSes

[Figure: Intel Xeon (x2), ARM big, ARM LITTLE; four Kernel labels]

SLIDE 7

The Goal

Treat all compute units (CU) as first-class citizens:

1. Run untrusted code without causing harm
2. Access operating system services
3. Interact as the master with other CUs

SLIDE 8

First-class Citizenship as Enabler

  • Pipe communication between arbitrary CUs
  • Use parallelism on GPUs for FS operations
  • Direct access to accelerators from the net
  • ...

SLIDE 9

M3 Approach – Hardware

[Figure: Intel Xeon (x2), ARM big, ARM LITTLE, DSP (x2), Audio Decoder, FPGA; eight Mem blocks]

Asmussen et al.: M3: A Hardware/OS Co-Design to Tame Heterogeneous Manycores, ASPLOS’16

SLIDE 10

M3 Approach – Hardware

[Figure: Intel Xeon (x2), ARM big, ARM LITTLE, DSP (x2), Audio Decoder, FPGA; eight Mem blocks and eight DTUs]

SLIDE 11

M3 Approach – Hardware

[Figure: Intel Xeon (x2), ARM big, ARM LITTLE, DSP (x2), Audio Decoder, FPGA; eight Mem blocks and eight DTUs; eight PE labels]

SLIDE 12

M3 Approach – Software

[Figure: Intel Xeon (x2), ARM big, ARM LITTLE, DSP (x2), Audio Decoder, FPGA; eight Mem blocks and eight DTUs; eight PE labels, seven marked App and one marked Kernel]

SLIDE 13

Data Transfer Unit

  • Supports memory access and message passing
  • Provides a number of endpoints
  • Each endpoint can be configured for:

      1. Accessing memory (contiguous range, byte granular)
      2. Receiving messages into a receive buffer
      3. Sending messages to a receiving endpoint

  • Configuration only by kernel, usage by application
  • Credit system to prevent DoS attacks
  • Direct reply on received messages
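
The message-passing side of the points above can be sketched as a small host-side model. This is only an illustration under stated assumptions, not the DTU hardware interface or the M3 API: `Dtu`, `Kernel`, `Message`, `config_send`, `config_receive`, and `dtu_demo` are hypothetical names, and the rule that a reply refunds the sender's credit is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

class Dtu;

// Illustrative message: payload plus what is needed for a direct reply.
struct Message {
    Dtu* sender;            // sending DTU (nullptr marks a reply)
    std::size_t credit_ep;  // sender's send endpoint that regains the credit
    std::size_t reply_ep;   // sender's receive endpoint for the reply
    std::string payload;
};

class Dtu {
public:
    explicit Dtu(std::size_t num_eps) : eps_(num_eps) {}

    // Application: send through a send endpoint; fails without credits.
    bool send(std::size_t ep, const std::string& payload) {
        if (ep >= eps_.size()) return false;
        Endpoint& e = eps_[ep];
        if (e.kind != Kind::Send || e.credits == 0) return false;
        --e.credits;  // one credit per message in flight
        e.target->eps_[e.target_ep].inbox.push_back(
            Message{this, ep, e.reply_ep, payload});
        return true;
    }

    // Application: pop the oldest message from a receive endpoint's buffer.
    bool fetch(std::size_t ep, Message& out) {
        if (ep >= eps_.size()) return false;
        Endpoint& e = eps_[ep];
        if (e.kind != Kind::Receive || e.inbox.empty()) return false;
        out = e.inbox.front();
        e.inbox.pop_front();
        return true;
    }

    // Application: reply directly to a received message.
    bool reply(const Message& msg, const std::string& payload) {
        if (msg.sender == nullptr) return false;  // a reply cannot be replied to
        Dtu& s = *msg.sender;
        s.eps_[msg.reply_ep].inbox.push_back(Message{nullptr, 0, 0, payload});
        ++s.eps_[msg.credit_ep].credits;  // assumption: the reply refunds the credit
        return true;
    }

private:
    friend class Kernel;  // only the kernel configures endpoints
    enum class Kind { Invalid, Receive, Send };
    struct Endpoint {
        Kind kind = Kind::Invalid;
        std::deque<Message> inbox;  // receive: buffered messages
        Dtu* target = nullptr;      // send: destination DTU
        std::size_t target_ep = 0;  // send: destination receive endpoint
        std::size_t reply_ep = 0;   // send: local receive endpoint for replies
        unsigned credits = 0;       // send: messages that may still be sent
    };
    std::vector<Endpoint> eps_;
};

// Hypothetical kernel: the only code allowed to change endpoint state.
class Kernel {
public:
    static void config_receive(Dtu& d, std::size_t ep) {
        d.eps_.at(ep).kind = Dtu::Kind::Receive;
    }
    static void config_send(Dtu& d, std::size_t ep, Dtu& target,
                            std::size_t target_ep, std::size_t reply_ep,
                            unsigned credits) {
        assert(target.eps_.at(target_ep).kind == Dtu::Kind::Receive);
        assert(d.eps_.at(reply_ep).kind == Dtu::Kind::Receive);
        Dtu::Endpoint& e = d.eps_.at(ep);
        e.kind = Dtu::Kind::Send;
        e.target = &target;
        e.target_ep = target_ep;
        e.reply_ep = reply_ep;
        e.credits = credits;
    }
};

// Walks through send, credit exhaustion, direct reply, and credit refund.
bool dtu_demo() {
    Dtu app(2), server(1);
    Kernel::config_receive(server, 0);
    Kernel::config_receive(app, 1);
    Kernel::config_send(app, 0, server, 0, 1, 1);  // a single credit

    bool first = app.send(0, "ping");   // consumes the only credit
    bool second = app.send(0, "ping");  // rejected: no credit left
    Message req, rep;
    bool got_req = server.fetch(0, req) && server.reply(req, "pong");
    bool got_rep = app.fetch(1, rep) && rep.payload == "pong";
    bool third = app.send(0, "ping");   // possible again after the reply
    return first && !second && got_req && got_rep && third;
}
```

In this model the application can only use endpoints, while all configuration goes through `Kernel`, which mirrors the split between configuration and usage; the credit counter bounds how many messages a sender can place in a receive buffer.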

SLIDE 14

M3 System Call

[Figure: App PE and Kernel PE, each with Mem and a DTU; a send endpoint (S) and a receive endpoint (R) link the two DTUs]

SLIDE 15

Prototype Platform: Tomahawk 2

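[Figure: Tomahawk 2 block diagram: eight PEs and DRAM behind a memory controller (Mem Ctrl.), attached to nine nodes labeled R; one PE is shown in detail with an Xtensa LX4 core, instruction SPM, data SPM, and DTU]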

PEs have no OS support:

  • No privileged mode
  • No MMU
  • No caches, but SPM

SLIDE 16

Prototype Platform: gem5

[Figure: gem5 platform: x86 PEs, a hash accelerator PE, and DRAM with a DRAM controller; components shown include DTUs, L1/L2 caches, SPMs, and VM]

SLIDE 17

M3

  • Microkernel-based system for heterogeneous manycores
  • Mechanisms for PEs, memory and communication
  • Drivers, filesystems, ... are implemented on top
  • Kernel manages permissions
  • DTU enforces permissions (communication, memory access)

  • Kernel is independent of other CUs in the system
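
The division of labour between kernel and DTU for memory access can be illustrated with a minimal sketch. Everything here is hypothetical (`MemWindow`, `MemEndpoint`, `permission_demo`), and the separate read and write rights are an assumption of the sketch; the slides only state that a memory endpoint covers a contiguous, byte-granular range.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical memory endpoint state: a contiguous, byte-granular window
// plus access rights (the rights are an assumption of this sketch).
struct MemWindow {
    std::size_t base = 0;
    std::size_t size = 0;
    bool readable = false;
    bool writable = false;
};

class MemEndpoint {
public:
    explicit MemEndpoint(std::vector<std::uint8_t>& mem) : mem_(mem) {}

    // Kernel side: decide what the application may access.
    bool configure(const MemWindow& w) {
        if (w.base > mem_.size() || w.size > mem_.size() - w.base) return false;
        win_ = w;
        return true;
    }
    void revoke() { win_ = MemWindow(); }

    // Application side: every access is checked against the window.
    bool read(std::size_t off, void* dst, std::size_t len) const {
        assert(dst != nullptr);
        if (!win_.readable || !covers(off, len)) return false;
        std::memcpy(dst, mem_.data() + win_.base + off, len);
        return true;
    }
    bool write(std::size_t off, const void* src, std::size_t len) {
        assert(src != nullptr);
        if (!win_.writable || !covers(off, len)) return false;
        std::memcpy(mem_.data() + win_.base + off, src, len);
        return true;
    }

private:
    // Overflow-safe test that [off, off + len) lies inside the window.
    bool covers(std::size_t off, std::size_t len) const {
        return off <= win_.size && len <= win_.size - off;
    }
    std::vector<std::uint8_t>& mem_;
    MemWindow win_;
};

// Grants a read-only 8-byte window, probes its limits, then revokes it.
bool permission_demo() {
    std::vector<std::uint8_t> dram(64, 0);
    dram[16] = 42;
    MemEndpoint ep(dram);

    MemWindow w;
    w.base = 16;
    w.size = 8;
    w.readable = true;
    bool configured = ep.configure(w);

    std::uint8_t buf[8] = {0};
    bool ok_read = ep.read(0, buf, 8) && buf[0] == 42;
    bool past_end = ep.read(1, buf, 8);      // crosses the end of the window
    bool wrote = ep.write(0, buf, 1);        // no write permission granted
    ep.revoke();
    bool after_revoke = ep.read(0, buf, 1);  // nothing is accessible any more
    return configured && ok_read && !past_end && !wrote && !after_revoke;
}
```

The point of the sketch is that the policy (which window, which rights) lives in one place and the check runs on every access, so code on the PE needs no privileged mode or MMU to be contained.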

SLIDE 19

Virtual PEs

  • Comparable to a process with zero or one threads
  • Creating a VPE yields a VPE capability and a memory capability
  • Library provides primitives like fork and exec

Execute function on different PE

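    VPE vpe;
    // Start the lambda on the PE behind this VPE without blocking the caller.
    vpe.run_async([]() {
        Serial::get() << "Hello World!\n";
        return 0;
    });
    // Wait for the VPE to finish and collect its exit code.
    int exitcode = vpe.wait();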

SLIDE 20

Virtual PEs

  • VPE with 0 threads for HW accelerators
  • Allows direct access for applications
  • Time-multiplexed by the kernel

Access an accelerator

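    // Request a VPE on a hash accelerator.
    VPE vpe(VPEDesc::HASH_ACCEL);
    // Gate for sending messages to that VPE.
    SendGate sg(vpe);
    // Send the values 1, 2, 3 and wait for the reply.
    GateIStream reply = send_receive_vmsg(sg, 1, 2, 3);
    // Unmarshal the result from the reply.
    int res;
    reply >> res;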
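
How a variadic call such as `send_receive_vmsg(sg, 1, 2, 3)` and the later `reply >> res` could marshal values into and out of a message can be sketched as follows. This is not the M3 library code: `OutMsg`, `InMsg`, `pack`, `send_receive`, `fake_accelerator`, and `accel_demo` are hypothetical names, and the stand-in accelerator simply adds its three inputs instead of hashing anything.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical outgoing message: values are appended as raw bytes.
class OutMsg {
public:
    template <typename T>
    OutMsg& operator<<(const T& v) {
        static_assert(std::is_trivially_copyable<T>::value, "plain data only");
        const unsigned char* p = reinterpret_cast<const unsigned char*>(&v);
        buf_.insert(buf_.end(), p, p + sizeof(T));
        return *this;
    }
    const std::vector<unsigned char>& bytes() const { return buf_; }

private:
    std::vector<unsigned char> buf_;
};

// Hypothetical incoming message: values are read back in the same order.
class InMsg {
public:
    explicit InMsg(std::vector<unsigned char> b) : buf_(std::move(b)) {}

    template <typename T>
    InMsg& operator>>(T& v) {
        static_assert(std::is_trivially_copyable<T>::value, "plain data only");
        if (ok_ && sizeof(T) <= buf_.size() - pos_) {
            std::memcpy(&v, buf_.data() + pos_, sizeof(T));
            pos_ += sizeof(T);
        } else {
            ok_ = false;  // reading past the end marks the stream as failed
        }
        return *this;
    }
    bool ok() const { return ok_; }

private:
    std::vector<unsigned char> buf_;
    std::size_t pos_ = 0;
    bool ok_ = true;
};

// Append all arguments in order (C++11-style recursion over the pack).
inline void pack(OutMsg&) {}
template <typename T, typename... Rest>
void pack(OutMsg& out, const T& first, const Rest&... rest) {
    out << first;
    pack(out, rest...);
}

// Stand-in for the accelerator: replies with the sum of three ints.
inline InMsg fake_accelerator(const OutMsg& request) {
    InMsg in(request.bytes());
    int a = 0, b = 0, c = 0;
    in >> a >> b >> c;
    OutMsg reply;
    reply << (in.ok() ? a + b + c : -1);
    return InMsg(reply.bytes());
}

// Marshal the arguments, hand them to the stand-in, return its reply.
template <typename... Args>
InMsg send_receive(const Args&... args) {
    OutMsg request;
    pack(request, args...);
    return fake_accelerator(request);
}

// Mirrors the slide's usage pattern with the stand-in accelerator.
inline int accel_demo() {
    InMsg reply = send_receive(1, 2, 3);
    int res = 0;
    reply >> res;
    assert(reply.ok());
    return res;
}
```

With this stand-in, `accel_demo()` returns 6. In the real system the bytes would travel through a send endpoint to the accelerator's receive buffer rather than through a function call.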

SLIDE 21

Filesystem: m3fs

[Figure: App, Kernel, and m3fs PEs, each with Mem and a DTU, plus DRAM; endpoints labeled S and R]

SLIDE 24

Filesystem: m3fs

[Figure: App, Kernel, and m3fs PEs, each with Mem and a DTU, plus DRAM; endpoints labeled S, R, and M]

SLIDE 25

Performance Comparison

[Figure: bar chart comparing M3 and Lx for tar, untar, find, and sqlite; axis: Time (M cycles), ticks 1 to 7; bars split into App, Xfers, and OS]

SLIDE 26

Summary

  • M3 uses a HW/SW co-design
  • DTU creates common interface for all CUs
  • M3 kernel controls DTUs remotely
  • Allows treating all CUs as first-class citizens
