SLIDE 1

Achieving Performance Isolation with Lightweight Co-Kernels

Jiannan Ouyang, Brian Kocoloski, John Lange (The Prognostic Lab @ University of Pittsburgh)
Kevin Pedretti (Sandia National Laboratories)
HPDC 2015

SLIDE 2

HPC Architecture

Traditional
— Simulation runs on the Supercomputer; data flows through a Shared Storage Cluster to a Processing Cluster
— Problem: massive data movement over interconnects

In Situ Data Processing
— Move computation to data: Simulation and Analytic/Visualization share the Compute Node Operating System and Runtimes (OS/R)
— Improved data locality
— Reduced power consumption
— Reduced network traffic

SLIDE 3

Challenge: Predictable High Performance

— Tightly coupled HPC workloads are sensitive to OS noise and overhead [Petrini SC '03, Ferreira SC '08, Hoefler SC '10]
— Specialized kernels for predictable performance
  — Tailored from Linux: CNL for Cray supercomputers
  — Lightweight kernels (LWK) developed from scratch: IBM CNK, Kitten
— Data processing workloads favor Linux environments
— Cross-workload interference
  — Shared hardware (CPU time, cache, memory bandwidth)
  — Shared system software

How can we provide both Linux and specialized kernels on the same node, while ensuring performance isolation?

SLIDE 4

Approach: Lightweight Co-Kernels

— Hardware resources on one node are dynamically composed into multiple partitions, or enclaves
— Independent software stacks are deployed on each enclave
  — Optimized for certain applications and hardware
— Performance isolation at both the software and hardware level

[Diagram: on a stock node, Linux runs Simulation and Analytic/Visualization together on the same hardware; with co-kernels, the hardware is split so Linux hosts Analytic/Visualization while an LWK hosts Simulation]

SLIDE 5

Agenda

— Introduction
— The Pisces Lightweight Co-Kernel Architecture
— Implementation
— Evaluation
— Related Work
— Conclusion

SLIDE 6

Building Blocks: Kitten and Palacios

— The Kitten Lightweight Kernel (LWK)
  — Goal: provide predictable performance for massively parallel HPC applications
  — Simple resource management policies
  — Limited kernel I/O support + direct user-level network access
— The Palacios Lightweight Virtual Machine Monitor (VMM)
  — Goal: predictable performance
  — Lightweight resource management policies
  — Established history of providing virtualized environments for HPC [Lange et al. VEE '11, Kocoloski and Lange ROSS '12]

Kitten: https://software.sandia.gov/trac/kitten
Palacios: http://www.prognosticlab.org/palacios
http://www.v3vee.org/

SLIDE 7

The Pisces Lightweight Co-Kernel Architecture

[Diagram: hardware is partitioned among Linux and two Kitten co-kernels. Linux runs applications + virtual machines; Kitten co-kernel (1) runs the Palacios VMM hosting an isolated virtual machine; Kitten co-kernel (2) runs an isolated application]

Pisces: http://www.prognosticlab.org/pisces/

Pisces Design Goals
— Performance isolation at both the software and hardware level
— Dynamic creation of resizable enclaves
— Isolated virtual environments

SLIDE 8

Design Decisions

— Elimination of cross-OS dependencies
  — Each enclave must implement its own complete set of supported system calls
  — No system call forwarding is allowed
— Internalized management of I/O
  — Each enclave must provide its own I/O device drivers and manage its hardware resources directly
— Userspace cross-enclave communication
  — Cross-enclave communication is not a kernel-provided feature
  — Cross-enclave shared memory is explicitly set up at runtime (XEMEM); see the sketch after this list
— Using virtualization to provide missing OS features
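To make the userspace shared-memory decision concrete, here is a minimal sketch of the pattern XEMEM builds on, written against plain POSIX shared memory rather than XEMEM's actual cross-enclave API: one process exports a named region, another attaches it, and data then moves with no kernel involvement on the data path. The segment name and size are illustrative.

```c
/* Minimal sketch of the named shared-memory pattern behind XEMEM,
 * using POSIX shm as a stand-in (compile with -lrt on older glibc).
 * XEMEM generalizes this export/attach step across enclaves. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION "/enclave-channel"   /* illustrative segment name */
#define SIZE   4096

int main(int argc, char **argv)
{
    int exporter = (argc > 1);       /* run with any argument to export */
    int fd = shm_open(REGION, O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, SIZE) < 0)
        return 1;

    char *buf = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED)
        return 1;

    if (exporter)                    /* producer side writes... */
        strcpy(buf, "hello across enclaves");
    else                             /* ...consumer side reads */
        printf("peer says: %s\n", buf);

    munmap(buf, SIZE);
    close(fd);
    return 0;
}
```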

SLIDE 9

Cross Kernel Communication

[Diagram: two hardware partitions, one running Linux and one running the Kitten co-kernel. In kernel context, cross-kernel messages flow over a shared-memory control channel. In user context, a control process on each side manages the channel, and Linux-compatible workloads exchange data with isolated processes + virtual machines through shared-memory communication channels]

XEMEM: Efficient Shared Memory for Composed Applications on Multi-OS/R Exascale Systems [Kocoloski and Lange, HPDC '15]
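The kernel-context side of this picture is essentially a mailbox: a command written to shared memory plus a notification. Below is a minimal sketch of that pattern, with two threads standing in for the Linux and Kitten kernels and an atomic flag standing in for the IPI; the struct layout and the "add_cpu 7" payload are illustrative assumptions, not Pisces' actual message format.

```c
/* Sketch of a shared-memory mailbox with a doorbell, the pattern used
 * for kernel-level cross-kernel messages (IPI + shared memory). */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <string.h>

struct mailbox {
    atomic_int pending;          /* stands in for the IPI doorbell */
    char cmd[64];                /* command payload in shared memory */
} mbox;

static void *cokernel_side(void *arg)
{
    (void)arg;
    while (!atomic_load(&mbox.pending))   /* "wait for the IPI" */
        ;
    printf("co-kernel handled command: %s\n", mbox.cmd);
    atomic_store(&mbox.pending, 0);       /* ack back to the Linux side */
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, cokernel_side, NULL);

    strcpy(mbox.cmd, "add_cpu 7");        /* write the payload first... */
    atomic_store(&mbox.pending, 1);       /* ...then ring the doorbell */

    pthread_join(t, NULL);
    return 0;
}
```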

SLIDE 10

Agenda

— Introduction
— The Pisces Lightweight Co-Kernel Architecture
— Implementation
— Evaluation
— Related Work
— Conclusion

SLIDE 11

Challenges & Approaches

— How to boot a co-kernel?
  — Hot-remove resources from Linux, and load the co-kernel (see the hotplug sketch after this list)
  — Reuse Linux boot code with a modified target kernel address
  — Restrict the Kitten co-kernel to access assigned resources only
— How to share hardware resources among kernels?
  — Hot-remove from Linux + direct assignment and adjustment (e.g. CPU cores, memory blocks, PCI devices)
  — Managed by Linux and Pisces (e.g. IOMMU)
— How to communicate with a co-kernel?
  — Kernel level: IPI + shared memory, primarily for Pisces commands
  — Application level: XEMEM [Kocoloski HPDC '15]
— How to route device interrupts?
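The hot-remove step uses Linux's standard CPU hotplug interface in sysfs. A minimal sketch, assuming root privileges and using cpu1 as an arbitrary example; the subsequent hand-off of the freed core to Kitten is Pisces-specific and omitted here.

```c
/* Sketch of the first step in booting a co-kernel: offlining a CPU
 * from Linux via the standard sysfs hotplug interface. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int set_cpu_online(int cpu, int online)
{
    char path[64];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/online", cpu);

    int fd = open(path, O_WRONLY);
    if (fd < 0) {
        perror("open");
        return -1;
    }
    /* Writing "0" hot-removes the core from Linux's scheduler and
     * interrupt routing; writing "1" gives it back. */
    ssize_t n = write(fd, online ? "1" : "0", 1);
    close(fd);
    return n == 1 ? 0 : -1;
}

int main(void)
{
    if (set_cpu_online(1, 0) == 0)
        puts("cpu1 hot-removed from Linux, ready for assignment");
    return 0;
}
```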

SLIDE 12

I/O Interrupt Routing

[Diagram: two interrupt paths. Legacy Interrupt Forwarding: a legacy device raises INTx through the IO-APIC to the management kernel, whose IRQ forwarder re-delivers it to the co-kernel's IRQ handler as an IPI. Direct Device Assignment (w/ MSI): an MSI/MSI-X device delivers its MSI interrupt directly to the co-kernel's IRQ handler]

  • Legacy interrupt vectors are potentially shared among multiple devices, so Pisces provides an IRQ forwarding service (sketched below)
  • IRQ forwarding is only used during initialization for PCI devices
  • Modern PCI devices support dedicated interrupt vectors (MSI/MSI-X), which are routed directly to the corresponding enclave
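For illustration, a hedged fragment of what a legacy IRQ forwarder could look like as a Linux kernel module: it claims a (possibly shared) legacy line and re-delivers each interrupt to the co-kernel's core as an IPI. The IRQ number, vector, and target core are assumptions, and this sketches the mechanism only, not Pisces' actual code.

```c
/* Sketch (x86): forward a legacy device interrupt to a co-kernel core
 * as an IPI. All constants below are illustrative assumptions. */
#include <linux/interrupt.h>
#include <linux/module.h>
#include <asm/apic.h>

#define FWD_VECTOR 0xe5          /* illustrative vector given to the co-kernel */
static int cokernel_cpu = 1;     /* illustrative core owned by the co-kernel */

static irqreturn_t fwd_handler(int irq, void *dev)
{
    /* Re-deliver the device interrupt to the enclave's core as an IPI. */
    apic->send_IPI_mask(cpumask_of(cokernel_cpu), FWD_VECTOR);
    return IRQ_HANDLED;
}

static int __init fwd_init(void)
{
    /* IRQ 11 is an arbitrary example of a legacy, possibly shared line. */
    return request_irq(11, fwd_handler, IRQF_SHARED, "pisces-fwd",
                       &cokernel_cpu);
}

static void __exit fwd_exit(void)
{
    free_irq(11, &cokernel_cpu);
}

module_init(fwd_init);
module_exit(fwd_exit);
MODULE_LICENSE("GPL");
```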
SLIDE 13

Implementation

— Pisces
  — Linux kernel module; supports unmodified Linux kernels (2.6.3x – 3.x.y)
  — Co-kernel initialization and management
— Kitten (~9000 LOC changed)
  — Manages assigned hardware resources
  — Dynamic resource assignment
  — Kernel-level communication channel
— Palacios (~5000 LOC changed)
  — Dynamic resource assignment
  — Command forwarding channel

Pisces: http://www.prognosticlab.org/pisces/
Kitten: https://software.sandia.gov/trac/kitten
Palacios: http://www.prognosticlab.org/palacios
http://www.v3vee.org/

SLIDE 14

Agenda

— Introduction
— The Pisces Lightweight Co-Kernel Architecture
— Implementation
— Evaluation
— Related Work
— Conclusion

SLIDE 15

Evaluation

— 8-node Dell R450 cluster
  — Two six-core Intel "Ivy Bridge" Xeon processors
  — 24GB RAM split across two NUMA domains
  — QDR InfiniBand
  — CentOS 7, Linux kernel 3.16
— For the performance isolation experiments, the hardware is partitioned by NUMA domain (i.e. Linux on one NUMA domain, the co-kernel on the other)

SLIDE 16

Fast Pisces Management Operations

Operation                      Latency (ms)
Booting a co-kernel            265.98
Adding a single CPU core       33.74
Adding a 128MB memory block    82.66
Adding an Ethernet NIC         118.98

SLIDE 17

Eliminating Cross Kernel Dependencies

Execution Time of getpid()

                 solitary workloads (µs)    w/ other workloads (µs)
Linux            3.05                       3.48
co-kernel fwd    6.12                       14.00
co-kernel        0.39                       0.36

— The co-kernel has the best average performance
— The co-kernel has the most consistent performance
— System call forwarding has longer latency and suffers from cross-stack performance interference
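The numbers above come from timing repeated getpid() calls. A minimal sketch of that kind of microbenchmark follows; the iteration count is an assumption (the paper's exact harness may differ), and syscall(SYS_getpid) is used so each iteration really enters the kernel rather than hitting a cached value.

```c
/* Time repeated getpid() system calls and report the mean latency. */
#include <stdio.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

#define ITERS 1000000

int main(void)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < ITERS; i++)
        syscall(SYS_getpid);      /* force a real kernel entry each time */
    clock_gettime(CLOCK_MONOTONIC, &end);

    double ns = (end.tv_sec - start.tv_sec) * 1e9
              + (end.tv_nsec - start.tv_nsec);
    printf("getpid: %.3f us per call\n", ns / ITERS / 1e3);
    return 0;
}
```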

SLIDE 18

Noise Analysis

[Figure: OS interruption latency (µs) over a 5-second window for Linux and the Kitten co-kernel, (a) without and (b) with competing workloads; each point represents the latency of one OS interruption]

Co-Kernel: less noise + better isolation
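One common way to collect traces like these is a "selfish detour"-style loop: spin reading the clock and record any gap noticeably larger than the loop's base cost, since such a gap means the OS interrupted the benchmark. A minimal sketch, with the 1 µs detection threshold and 5-second window as assumed parameters (the paper's harness may differ):

```c
/* Selfish-detour-style noise trace: print (time offset, latency)
 * pairs for every detected OS interruption over a 5-second run. */
#include <stdio.h>
#include <time.h>

static inline double now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

int main(void)
{
    const double threshold_us = 1.0;   /* assumed gap that flags an interruption */
    double start = now_us(), prev = start, t;

    while ((t = now_us()) - start < 5e6) {   /* sample for 5 seconds */
        if (t - prev > threshold_us)
            printf("%.3f %.3f\n", (t - start) / 1e6, t - prev); /* s, µs */
        prev = t;
    }
    return 0;
}
```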

SLIDE 19

Single Node Performance

19

1 CentOS Kitten/KVM co-Kernel 82 83 84 85 Completion Time (Seconds) without bg with bg

250 CentOS Kitten/KVM co-Kernel 20250 20500 20750 21000 21250 Throughput (GUPS) without bg with bg

CoMD Performance Stream Performance

Co-Kernel: consist performance + performance isolation

SLIDE 20

8 Node Performance

[Figure: throughput (GFLOP/s) vs. number of nodes (1–8) for co-VMM, native, and KVM, each without and with background (bg) workloads]

w/o bg: co-VMM achieves native Linux performance
w/ bg: co-VMM outperforms native Linux

SLIDE 21

Co-VMM for HPC in the Cloud

[Figure: CDF (%) of HPCCG runtime (44–51 seconds) for Co-VMM, Native, and KVM]

CDF of HPCCG Performance (running with Hadoop, 8 nodes)

co-VMM: consistent performance + performance isolation

SLIDE 22

Related Work

— Exascale operating systems and runtimes (OS/Rs)
  — Hobbes (SNL, LBNL, LANL, ORNL, U. Pitt, various universities)
  — Argo (ANL, LLNL, PNNL, various universities)
— FusedOS (Intel / IBM)
— mOS (Intel)
— McKernel (RIKEN AICS, University of Tokyo)

Our uniqueness: performance isolation, dynamic resource composition, lightweight virtualization

SLIDE 23

Conclusion

— Design and implementation of the Pisces co-kernel architecture
  — Pisces framework
  — Kitten co-kernel
  — Palacios VMM for the Kitten co-kernel
— Demonstrated that the co-kernel architecture provides
  — Optimized execution environments for in situ processing
  — Performance isolation

https://software.sandia.gov/trac/kitten
http://www.prognosticlab.org/pisces/
http://www.prognosticlab.org/palacios

SLIDE 24

Thank You

Jiannan Ouyang
— Ph.D. Candidate @ University of Pittsburgh
— ouyang@cs.pitt.edu
— http://people.cs.pitt.edu/~ouyang/
— The Prognostic Lab @ U. Pittsburgh
— http://www.prognosticlab.org