

SLIDE 1

The benefits and costs of writing a UNIX kernel in a high-level language

Cody Cutler, M. Frans Kaashoek, Robert T. Morris

MIT CSAIL

1 / 62

SLIDE 2

What language to use for developing a kernel?

A hotly debated question, but often with few facts
6.828/6.S081 students ask: why are we using C? Why not a type-safe language?
To shed some light, we wrote a new kernel with:

  • A language with automatic memory management (i.e., with a garbage collector)
  • A traditional, monolithic Unix organization

2 / 62

SLIDE 3

C is popular for kernels

  • Windows
  • Linux
  • *BSD

3 / 62

SLIDE 4

Why C is good: complete control

  • Control of memory allocation and freeing
  • Almost no implicit, hidden code
  • Direct access to memory
  • Few dependencies

4 / 62

SLIDE 5

Why C is bad

Writing secure C code is difficult

  • buffer overruns
  • use-after-free bugs
  • threads sharing dynamic memory

40 Linux kernel execute-code CVEs in 2017 due to memory-safety errors
(an execute-code CVE is a bug that enables an attacker to run malicious code in the kernel)

5 / 62

SLIDE 6

High-level languages (HLLs) provide memory-safety

In an HLL, none of the 40 CVEs would have executed malicious code

6 / 62

SLIDE 7

HLL benefits

  • Type safety
  • Automatic memory management with a garbage collector
  • Concurrency
  • Abstraction

7 / 62

SLIDE 8

HLL potential downsides

Poor performance:

  • Bounds, cast, nil-pointer checks
  • Garbage collection

Incompatibility with kernel programming:

  • No direct memory access
  • No hand-written assembly
  • Limited concurrency or parallelism

8 / 62

SLIDE 9

Goal: measure HLL trade-offs

Explore total effect of using HLL instead of C:

  • Impact on safety
  • Impact on programmability
  • Performance cost

...for production-grade kernel

9 / 62

SLIDE 10

Prior work: HLL trade-offs

Many studies of HLL trade-offs for user programs (Hertz ’05, Yang ’04)
But kernels differ from user programs (e.g., more careful memory management)
Need to measure HLL trade-offs in the kernel

10 / 62

SLIDE 11

Prior work: HLL kernels

Singularity (SOSP ’07), J-Kernel (ATC ’98), Taos (ASPLOS ’87), SPIN (SOSP ’95), Tock (SOSP ’17), KaffeOS (ATC ’00), House (ICFP ’05), ...
Explore new ideas and architectures
None measure HLL trade-offs vs. a C kernel

11 / 62

SLIDE 12

Measuring trade-offs is tricky

Must compare with a production-grade C kernel (e.g., Linux)
Problem: can’t build a production-grade HLL kernel

12 / 62

SLIDE 13

The most we can do

Build an HLL kernel
Keep important parts the same as Linux
Optimize until performance is roughly similar to Linux
Measure HLL trade-offs
Risk: measurements might differ for production-grade kernels

13 / 62

SLIDE 14

Methodology

Built an HLL kernel
Same apps, POSIX interface, and monolithic organization
Optimized, then measured HLL trade-offs

14 / 62

SLIDE 15

Which HLL?

Go is a good choice:

  • Easy to call assembly
  • Compiled to machine code w/good compiler
  • Easy concurrency
  • Easy static analysis
  • GC (Concurrent mark and sweep)

Rust might be a fine choice too

15 / 62

SLIDE 16

BISCUIT overview

58 system calls, LOC: 28k Go

16 / 62

SLIDE 17

BISCUIT Features

  • Multicore
  • Threads
  • Journaled FS (7k LOC)
  • Virtual memory (2k LOC)
  • TCP/IP stack (5k LOC)
  • Drivers: AHCI and Intel 10Gb NIC (3k LOC)

17 / 62

SLIDE 18

User programs

Each process has its own address space
User/kernel memory isolated by hardware
Each user thread has a companion kernel thread
Kernel threads are “goroutines”

18 / 62

SLIDE 19

System calls

User thread puts arguments in registers
User thread executes SYSENTER
Control passes to the kernel thread
Kernel thread executes the system call, returns via SYSEXIT

19 / 62

SLIDE 20

BISCUIT design puzzles

Runtime on bare metal
Goroutines run different applications
Device interrupts in runtime critical sections
Hardest puzzle: heap exhaustion

20 / 62


SLIDE 25

Puzzle: Heap exhaustion

Can’t allocate heap memory ⇒ nothing works
All kernels face this problem

21 / 62

SLIDE 26

How to recover?

Strawman 0: panic (xv6)
Strawman 1: wait for memory in the allocator?

  • May deadlock!

Strawman 2: check and handle allocation failure, like C kernels?

  • Difficult to get right
  • Can’t – Go implicitly allocates
  • Go doesn’t expose failed allocations

Both strawmen cause problems for Linux; see the “too small to fail” rule

22 / 62

SLIDE 27

BISCUIT solution: reserve memory

To execute a system call: reserve enough heap memory first, waiting if necessary
No checks, no error-handling code, no deadlock

23 / 62

SLIDE 28

Heap reservation bounds

How to compute the max heap memory for each system call?
Smaller heap bounds ⇒ more concurrent system calls

24 / 62

SLIDE 29

Heap bounds via static analysis

The HLL is easy to analyze
A tool computes each reservation via escape analysis, using Go’s static-analysis packages
Annotations for difficult cases
≈ three days of expert effort to apply the tool

25 / 62

SLIDE 31

BISCUIT implementation

Building BISCUIT was similar to other kernels
BISCUIT adopted many Linux optimizations:

  • large pages for kernel text
  • per-CPU NIC transmit queues
  • RCU-like directory cache
  • execute FS ops concurrently with commit
  • pad structs to remove false sharing

Good OS performance is more about optimizations, less about the HLL

26 / 62

SLIDE 32

Evaluation

Part 1: HLL benefits
Part 2: HLL performance costs

27 / 62

SLIDE 33

Evaluation: HLL benefits

Should we use high-level languages to build OS kernels?

1. Does BISCUIT use HLL features?
2. Does the HLL simplify BISCUIT’s code?
3. Would an HLL prevent kernel exploits?

28 / 62

SLIDE 34

1: Does BISCUIT use HLL features?

Counted HLL feature use in BISCUIT and two huge Go projects (Moby and Golang, >1M LOC)

29 / 62

SLIDE 35

1: BISCUIT uses HLL features

[Bar chart: uses per 1,000 lines of allocations, maps, slices, channels, strings, multi-value returns, closures, finalizers, defer, go statements, interfaces, type asserts, and imports, for Biscuit, Golang, and Moby]

30 / 62

SLIDE 36

2: Does HLL simplify BISCUIT code?

Qualitatively, my favorite features:

  • GC’ed allocation
  • slices
  • defer
  • multi-valued return
  • strings
  • closures
  • maps

Net effect: simpler code

31 / 62

SLIDE 37

2: Simpler concurrency

Simpler data sharing between threads
In an HLL, the GC frees memory
In C, the programmer must free memory

32 / 62

SLIDE 38

2: Simpler concurrency example

buf := new(object_t)
// Initialize buf...
go func() {
    process1(buf)
}()
process2(buf)
// When should C code free(buf)?

33 / 62

SLIDE 39

2: Simpler read-lock-free concurrency

Locks and reference counts are expensive in hot paths
Good for performance to avoid them
Challenge in C: when is it safe to free an object?

34 / 62

SLIDE 40

2: Read-lock-free example

var Head *Node

func get() *Node {
    return atomic_load(&Head)
}

func pop() {
    Lock()
    v := Head
    if v != nil {
        atomic_store(&Head, v.next)
    }
    Unlock()
}

35 / 62

SLIDE 41

2: Simpler read-lock-free concurrency

Linux safely frees via RCU (McKenney ’98)
Defers the free until all CPUs have context-switched
Programmer must follow RCU rules:

  • Prologue and epilogue surrounding accesses
  • No sleeping or scheduling

Error-prone in more complex situations
GC makes these challenges disappear
HLL significantly simplifies read-lock-free code

36 / 62

SLIDE 42

3: Would HLL prevent kernel exploits?

Inspected fixes for all publicly available execute-code CVEs in the Linux kernel for 2017
Classified each based on the bug’s outcome in BISCUIT

37 / 62

SLIDE 43

3: HLL prevents kernel exploits

Category                      #    Outcome in Go
—                             11   unknown
logic                         14   same
use-after-free/double-free     8   disappear due to GC
out-of-bounds                 32   panic or disappear

Panic is likely better than malicious code execution
HLL would prevent kernel exploits

38 / 62

SLIDE 44

Evaluation: HLL performance

Should we use high-level languages to build OS kernels?

1. Is BISCUIT’s performance roughly similar to Linux?
2. What is the breakdown of the HLL tax?
3. How much might GC cost?
4. What are the GC pauses?
5. What is the performance cost of Go compared to C?
6. Does BISCUIT’s performance scale with cores?

39 / 62

SLIDE 45

Experimental setup

Hardware:

  • 4-core 2.8 GHz Xeon X3460
  • 16 GB RAM
  • Hyperthreads disabled

Eval applications:

  • NGINX (1.11.5) – webserver
  • Redis (3.0.5) – key/value store
  • CMailbench – mail-server benchmark

40 / 62

SLIDE 46

Applications are kernel intensive

No idle time; 79%–92% kernel time
In-memory FS
Ran for one minute
512 MB heap RAM for BISCUIT

41 / 62

SLIDE 47

1: Is BISCUIT’s perf roughly similar to Linux?

i.e., is BISCUIT’s performance similar to a production-grade kernel?
Compare app throughput on BISCUIT and Linux

42 / 62

SLIDE 48

Linux setup

Debian 9.4, Linux 4.9.82
Disabled features that slowed Linux down on our apps:

  • page-table isolation
  • retpoline
  • kernel address space layout randomization
  • transparent huge-pages
  • ...

43 / 62

SLIDE 49

1: Is BISCUIT’s perf roughly similar to Linux?

                  BISCUIT ops/s   Linux ops/s   Ratio
CMailbench (mem)        15,862        17,034    1.07
NGINX                   88,592        94,492    1.07
Redis                  711,792       775,317    1.09

Linux has more features: NUMA, scales to many cores, ...
Not apples-to-apples, but BISCUIT perf roughly similar

44 / 62

SLIDE 50

2: What is the breakdown of HLL tax?

Recorded a CPU time profile of our apps
Categorized samples into HLL cost buckets
Measured the HLL tax of our apps:

  • GC cycles
  • Prologue cycles
  • Write barrier cycles
  • Safety cycles

45 / 62

SLIDE 51

Prologue cycles are the most expensive

             GC cycles   GCs   Prologue cycles   Write-barrier cycles   Safety cycles
CMailbench   3%          42    6%                < 1%                   3%
NGINX        2%          32    6%                < 1%                   2%
Redis        1%          30    4%                < 1%                   2%

Benchmarks allocate kernel heap rapidly but have few long-lived kernel heap objects

46 / 62

SLIDE 52

GC cost varies by program

More live data ⇒ more cycles per GC
Less free heap RAM ⇒ more frequent GCs
Total GC cost ∝ ratio of live data to free heap RAM

47 / 62

SLIDE 53

3: How much might GC cost?

Created two million vnodes of live data
Varied free heap RAM
Ran CMailbench, measured GC cost

48 / 62

SLIDE 54

3: How much might GC cost?

Live (MB)   Free (MB)   Ratio   Tput     GC%
640         320         2       10,448   34%
640         640         1       12,848   19%
640         1280        0.5     14,430    9%

⇒ Need 3× heap RAM to keep GC cost < 10%

49 / 62

SLIDE 55

3: GC memory cost in practice?

Few programs allocate millions of resources
MIT’s big time-sharing machines: 80 users, 800 tasks, 9–16 GB RSS, < 2 GB kernel heap
(Exception: cached files, maybe evictable)
Memory cost acceptable in common situations?

50 / 62

SLIDE 56

GC pauses

GC must eventually execute
Could delay latency-sensitive work
Some GCs cause one large pause, but not Go’s:

  • Go’s GC is interleaved with execution (Baker’78, McCloskey’08)
  • Causes many small delays

51 / 62

SLIDE 57

4: What are the GC pauses?

Measured the duration of each GC pause during NGINX
Multiple pauses occur during a single request
Summed pause durations over each request

52 / 62

SLIDE 58

4: What are the GC pauses?

Max single pause: 115 µs (marking a large part of the TCP connection table)
Max total pauses during a request: 582 µs
Fewer than 0.3% of requests paused > 100 µs

53 / 62

SLIDE 59

4: GC pauses OK?

Some programs can’t tolerate rare 582 µs pauses
But many probably can
The 99th-percentile latency in the example service of Google’s “Tail at Scale” was 10 ms

54 / 62

SLIDE 60

5: What is the cost of Go compared to C?

Compared OS code paths with identical functionality
Chose paths that are:

  • core OS paths
  • small enough to implement with identical functionality

Two code paths in OSDI’18 paper

  • pipe ping-pong (system calls, context switching)
  • page-fault handler (exceptions, VM)

55 / 62

SLIDE 61

5: What is the cost of Go compared to C?

Pipe ping-pong code path:

  • LOC: 1.2k Go, 1.8k C
  • No allocation; no GC
  • Top-10 most expensive instructions match

56 / 62

SLIDE 62

5: C is 15% faster

Pipe ping-pong:

                 C (ops/s)   Go (ops/s)   Ratio
                 536,193     465,811      1.15

Prologue/safety checks ⇒ 16% more instructions
Go is slower, but competitive

57 / 62

SLIDE 63

6: Does BISCUIT scale?

Can BISCUIT efficiently use many cores?
Is Go a scalability bottleneck?

58 / 62

SLIDE 64

6: Does BISCUIT scale?

Ran CMailbench, varied cores from 1 to 20
Measured throughput

59 / 62

SLIDE 65

6: BISCUIT scales well to 10 cores

[Graph: Throughput (k/s) vs. cores, 1–20; series: Perfect and Biscuit]

At 20 cores: lock contention in CMailbench, and BISCUIT is not NUMA-aware

60 / 62

SLIDE 66

Should one use HLL for a new kernel?

The HLL worked well for kernel development
Performance is paramount ⇒ use C (up to 15% faster)
Minimizing memory use is paramount ⇒ use C (↓ memory budget, ↑ GC cost)
Safety is paramount ⇒ use HLL (40 CVEs stopped)
Performance merely important ⇒ use HLL (pay 15% and some memory)

61 / 62

SLIDE 67

6.S081/6.828 and HLL

Should we use an HLL in 6.S081?

git clone https://github.com/mit-pdos/biscuit.git

62 / 62