

SLIDE 1

Helios: Heterogeneous Multiprocessing with Satellite Kernels

Ed Nightingale, Orion Hodson, Ross McIlroy, Chris Hawblitzel, Galen Hunt

MICROSOFT RESEARCH

SLIDE 2

Once upon a time…

• Hardware was homogeneous

[Figure: a single CPU attached to RAM]

SLIDE 3

Once upon a time…

• Hardware was homogeneous

[Figure: two CPUs sharing one RAM (SMP)]

SLIDE 4

Once upon a time…

• Hardware was homogeneous

[Figure: two groups of eight CPUs, each group with its own RAM (NUMA)]

SLIDE 5

Problem: HW now heterogeneous

• Heterogeneity ignored by operating systems
• Programming models are fragmented
• Standard OS abstractions are missing

[Figure: an x86 NUMA machine alongside a GP-GPU and a programmable NIC, each with its own RAM]

SLIDE 6

Solution

• Helios manages ‘distributed system in the small’
  • Simplify app development, deployment, and tuning
  • Provide single programming model for heterogeneous systems
• 4 techniques to manage heterogeneity
  • Satellite kernels: Same OS abstraction everywhere
  • Remote message passing: Transparent IPC between kernels
  • Affinity: Easily express arbitrary placement policies to OS
  • 2-phase compilation: Run apps on arbitrary devices

SLIDE 7

Results

• Helios offloads processes with zero code changes
  • Entire networking stack
  • Entire file system
  • Arbitrary applications
• Improve performance on NUMA architectures
  • Eliminate resource contention with multiple kernels
  • Eliminate remote memory accesses

SLIDE 8

Outline

• Motivation
• Helios design
  • Satellite kernels
  • Remote message passing
  • Affinity
  • Encapsulating many ISAs
• Evaluation
• Conclusion

SLIDE 9

Driver interface is poor app interface

[Figure: apps run on the kernel on the CPU and reach an I/O device only through its driver]

SLIDE 10

Driver interface is poor app interface

• Hard to perform basic tasks: debugging, I/O, IPC
• Driver encompasses services and runtime… an OS!

[Figure: behind the driver, the programmable device runs its own JIT, scheduler, memory manager, and IPC]

SLIDE 11

Satellite kernels provide single interface

• Satellite kernels:
  • Efficiently manage local resources
  • Apps developed for single system call interface
  • μkernel: Scheduler, memory manager, namespace manager

[Figure: a satellite kernel runs on the CPU and another on the programmable device; the file system, TCP stack, and apps each run on a satellite kernel]

SLIDE 12

Satellite kernels provide single interface

[Figure: the same design extended to NUMA hardware: each NUMA domain runs its own satellite kernel, alongside the one on the programmable device]

SLIDE 13

Remote Message Passing

• Local IPC uses zero-copy message passing
• Remote IPC transparently marshals data
• Unmodified apps work with multiple kernels

[Figure: processes on different satellite kernels (NUMA domains and the programmable device) communicate over message-passing channels]
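
To make the transparency concrete, here is a minimal Python sketch (illustrative only, not Helios or Singularity code; the Channel class and the pickle-based marshaling are assumptions) of a single send/recv API whose transport is chosen when the channel is bound: by-reference delivery when both endpoints share a kernel, marshaling when they do not.

    import pickle

    class Channel:
        def __init__(self, same_kernel: bool):
            self.same_kernel = same_kernel   # decided at bind time, not by the app
            self._queue = []                 # stands in for a shared or remote queue

        def send(self, msg):
            if self.same_kernel:
                # Local IPC: hand the message over by reference (zero-copy in spirit).
                self._queue.append(msg)
            else:
                # Remote IPC: marshal into bytes and ship them to the other kernel.
                self._queue.append(pickle.dumps(msg))

        def recv(self):
            item = self._queue.pop(0)
            return item if self.same_kernel else pickle.loads(item)

    # The application code is identical in both cases:
    for local in (True, False):
        ch = Channel(same_kernel=local)
        ch.send({"op": "read", "path": "/fs/log"})
        print(ch.recv())
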
SLIDE 14

Connecting processes and services

• Applications register in a namespace as services
• Satellite kernels register in namespace
• Namespace is used to connect IPC channels

Example namespace entries:
  /fs
  /dev/nic0
  /dev/disk0
  /services/TCP
  /services/PNGEater
  /services/kernels/ARMv5
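
As an aside, a minimal Python sketch of the idea, assuming a plain dictionary and hypothetical register/connect helpers (the real Helios namespace is a kernel-managed service, which the slides do not detail):

    namespace = {}

    def register(path, endpoint):
        # Services and satellite kernels advertise themselves under a path.
        namespace[path] = endpoint

    def connect(path):
        # Applications look a service up by name; the namespace returns the
        # endpoint to open an IPC channel to, wherever it happens to run.
        return namespace[path]

    register("/services/kernels/ARMv5", "xscale-kernel")
    register("/services/TCP", "tcp-service-endpoint")
    print(connect("/services/TCP"))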

SLIDE 15

Where should a process execute?

• Three constraints impact the initial placement decision:
  1. Heterogeneous ISAs make migration difficult
  2. Fast message passing may be expected
  3. Processes might prefer a particular platform
• Helios exports an affinity metric to applications
  • Affinity is expressed in application metadata and acts as a hint
  • Positive affinity represents an emphasis on communication (zero-copy IPC)
  • Negative affinity represents a desire for non-interference

SLIDE 16

Affinity Expressed in Manifests

• Affinity is easily edited by a developer, admin, or user

<?xml version="1.0" encoding="utf-8"?>
<application name="TcpTest" runtime="full">
  <endpoints>
    <inputPipe id="0" affinity="0" contractName="PipeContract"/>
    <endpoint id="2" affinity="+10" contractName="TcpContract"/>
  </endpoints>
</application>
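
For illustration, a short Python sketch of reading the affinity hints back out of a manifest like the one above; the element and attribute names follow this example only, not a published schema.

    import xml.etree.ElementTree as ET

    manifest = """<?xml version="1.0" encoding="utf-8"?>
    <application name="TcpTest" runtime="full">
      <endpoints>
        <inputPipe id="0" affinity="0" contractName="PipeContract"/>
        <endpoint id="2" affinity="+10" contractName="TcpContract"/>
      </endpoints>
    </application>"""

    root = ET.fromstring(manifest.encode("utf-8"))
    for ep in root.find("endpoints"):
        # Each endpoint carries an affinity hint that placement code can read.
        print(ep.tag, ep.get("contractName"), int(ep.get("affinity")))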

SLIDE 18

Platform Affinity

• Platform affinity processed first
• Guarantees certain performance characteristics

Example:
  /services/kernels/vector-CPU   platform affinity = +2
  /services/kernels/x86          platform affinity = +1

[Figure: across an x86 NUMA machine, a GP-GPU, and a programmable NIC, the GP-GPU kernel scores +2 and each x86 kernel scores +1]


SLIDE 20

Positive Affinity

• Represents ‘tight-coupling’ between processes
• Ensure fast message passing between processes
• Positive affinities on each kernel summed

Example:
  /services/TCP         communication affinity = +1
  /services/PNGEater    communication affinity = +2
  /services/antivirus   communication affinity = +3

[Figure: TCP runs on the programmable NIC (+1) while PNGEater and antivirus run on one x86 NUMA kernel (+2 and +3, summing to +5), so the process is placed on that NUMA kernel]


SLIDE 23

Negative Affinity

• Expresses a preference for non-interference
• Used as a means of avoiding resource contention
• Negative affinities on each kernel summed

Example:
  /services/kernels/x86   platform affinity = +100
  /services/antivirus     non-interference affinity = -1

[Figure: both x86 NUMA kernels satisfy the platform affinity, but the one already running the antivirus scores -1, so the process is placed on the other NUMA kernel]


SLIDE 26

Self-Reference Affinity

• Simple scale-out policy across available processors

Example:
  /services/webserver   non-interference affinity = -1

[Figure: each new webserver instance (W1, W2, W3) avoids kernels that already run one, so instances spread across the available kernels]
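
A toy Python sketch of the assumed semantics (not Helios code, all names invented): when a service lists a negative affinity against itself, kernels that already host an instance score lower, so successive instances spread across kernels.

    def pick_kernel(kernels, placed, self_affinity=-1):
        # Score each kernel by the self-affinity contributed by instances already there.
        score = {k: self_affinity * placed.count(k) for k in kernels}
        return max(kernels, key=lambda k: score[k])

    kernels = ["numa0", "numa1", "nic"]
    placed = []
    for _ in range(3):                 # start three webserver instances
        placed.append(pick_kernel(kernels, placed))
    print(placed)                      # each instance lands on a different kernel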


SLIDE 29

Turning policies into actions

• Priority-based algorithm reduces candidate kernels by:
  • First: platform affinities
  • Second: other positive affinities
  • Third: negative affinities
  • Fourth: CPU utilization
• Attempt to balance simplicity and optimality
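
The slides do not spell the algorithm out, so the following Python sketch is only one plausible reading of the four-stage filter (all names and data shapes are invented): each stage keeps the best-scoring candidate kernels and later stages break ties.

    def place(process, kernels):
        def best(cands, score, smallest=False):
            scored = [(score(k), k) for k in cands]
            target = min(s for s, _ in scored) if smallest else max(s for s, _ in scored)
            return [k for s, k in scored if s == target]

        cands = list(kernels)
        # 1. Platform affinities (e.g. /services/kernels/x86 = +1).
        cands = best(cands, lambda k: process["platform"].get(k["platform"], 0))
        # 2. Other positive (communication) affinities, summed per kernel.
        cands = best(cands, lambda k: sum(process["positive"].get(s, 0) for s in k["services"]))
        # 3. Negative (non-interference) affinities, summed per kernel; closest to zero wins.
        cands = best(cands, lambda k: sum(process["negative"].get(s, 0) for s in k["services"]))
        # 4. Finally prefer the least-utilized kernel.
        return best(cands, lambda k: k["cpu_util"], smallest=True)[0]

    kernels = [
        {"name": "numa0", "platform": "x86",   "services": ["/services/TCP"], "cpu_util": 0.6},
        {"name": "numa1", "platform": "x86",   "services": [],                "cpu_util": 0.1},
        {"name": "nic",   "platform": "ARMv5", "services": [],                "cpu_util": 0.0},
    ]
    proc = {"platform": {"x86": 1}, "positive": {"/services/TCP": 3}, "negative": {}}
    print(place(proc, kernels)["name"])    # -> numa0 (platform match plus TCP affinity)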

SLIDE 30

Encapsulating many architectures

• Two-phase compilation strategy
  • All apps first compiled to MSIL
  • At install time, apps compiled down to available ISAs
• MSIL encapsulates multiple versions of a method
  • Example: ARM and x86 versions of the Interlocked.CompareExchange function
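
A deliberately simplified Python sketch of the install-time idea (real MSIL and the Helios back-end compilers work very differently; the strings below are placeholders, not generated code): the intermediate package carries one body per ISA for such methods, and installation picks the one matching the target device.

    # Placeholder bodies standing in for the per-ISA versions carried in the MSIL.
    compare_exchange = {
        "x86":   "lock cmpxchg-based implementation",
        "ARMv5": "swp/retry-loop-based implementation",
    }

    def install(method_versions, device_isa):
        # Install-time compilation reduced here to selecting the matching version.
        return method_versions[device_isa]

    for isa in ("x86", "ARMv5"):
        print(isa, "->", install(compare_exchange, isa))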

SLIDE 31

Implementation

• Based on the Singularity operating system
  • Added satellite kernels, remote message passing, and affinity
• XScale programmable I/O card
  • 2.0 GHz ARM processor, Gig E, 256 MB of DRAM
  • Satellite kernel identical to x86 (except for ARM asm bits)
  • Roughly 7x slower than comparable x86
• NUMA support on 2-socket, dual-core AMD machine
  • 2 GHz CPU, 1 GB RAM per domain
  • Satellite kernel on each NUMA domain

SLIDE 32

Limitations

• Satellite kernels require timer, interrupts, exceptions
  • Balance device support with support for basic abstractions
  • GPUs headed in this direction (e.g., Intel Larrabee)
• Only supports two platforms
  • Need new compiler support for new platforms
• Limited set of applications
  • Create satellite kernels out of a commodity system
  • Access to more applications

SLIDE 33

Outline

• Motivation
• Helios design
  • Satellite kernels
  • Remote message passing
  • Affinity
  • Encapsulating many ISAs
• Evaluation
• Conclusion

SLIDE 34

Evaluation platform

[Figure: two configurations per platform. XScale evaluation: (A) a single x86 kernel drives the NIC vs. (B) an x86 satellite kernel plus a satellite kernel on the XScale NIC. NUMA evaluation: (A) a single kernel spans both x86 NUMA domains vs. (B) a satellite kernel runs on each NUMA domain]

SLIDE 35

Offloading Singularity applications

• Helios applications offloaded with very little effort

Name               LOC     LOC changed   LOM changed
Networking stack   9600                  1
FAT 32 FS          14200                 1
TCP test harness   300     5             1
Disk indexer       900                   1
Network driver     1700
Mail server        2700                  1
Web server         1850                  1

(LOC = lines of code; LOM = lines of manifest)


SLIDE 39

Netstack offload

• Offloading improves performance as cycles are freed
• Affinity made it easy to experiment with offloading

PNG size   X86 only (uploads/sec)   X86+XScale (uploads/sec)   Speedup   % reduction in context switches
28 KB      161                      171                        6%        54%
92 KB      55                       61                         12%       58%
150 KB     35                       38                         10%       65%
290 KB     19                       21                         10%       53%

SLIDE 40

Email NUMA benchmark

• Satellite kernels improve performance by 39%

[Charts: emails per second and instructions per cycle (IPC), each comparing the single-kernel configuration ("No Sat. Kernel") against the per-domain satellite kernel configuration ("Sat. Kernel")]

SLIDE 41

Related Work

• Hive [Chapin et al. '95]
  • Multiple kernels, single system image
• Multikernel [Baumann et al. '09]
  • Focus on scale-out performance on large NUMA architectures
• Spine [Fiuczynski et al. '98] and Hydra [Weinsberg et al. '08]
  • Custom run-time on a programmable device

SLIDE 42

Conclusions

• Helios manages ‘distributed system in the small’
  • Simplify application development, deployment, tuning
• 4 techniques to manage heterogeneity
  • Satellite kernels: Same OS abstraction everywhere
  • Remote message passing: Transparent IPC between kernels
  • Affinity: Easily express arbitrary placement policies to OS
  • 2-phase compilation: Run apps on arbitrary devices
• Offloading applications with zero code changes
• Helios code release soon