Container mechanics in Linux and rkt FOSDEM 2016 Alban Crequy - - PowerPoint PPT Presentation




slide-1
SLIDE 1

FOSDEM 2016

Container mechanics in Linux and rkt

slide-2
SLIDE 2

Jonathan Boulle

github.com/jonboulle @baronboulle

Alban Crequy

github.com/alban

slide-3
SLIDE 3

a modern, secure, composable container runtime

slide-4
SLIDE 4

an implementation of appc (image format, execution environment)

slide-5
SLIDE 5

rkt

simple CLI tool
golang + Linux
self-contained

slide-6
SLIDE 6

simple CLI tool

no (mandatory) daemon: apps run directly under spawning process

slide-7
SLIDE 7

bash/runit/systemd rkt application(s)

slide-8
SLIDE 8

rkt internals

modular architecture
execution divided into stages
stage0 → stage1 → stage2

slide-9
SLIDE 9

bash/runit/systemd rkt application(s)

slide-10
SLIDE 10

rkt (stage0) pod (stage1) bash/runit/systemd/... (invoking process) app1 (stage2) app2 (stage2)

slide-12
SLIDE 12

stage0 (rkt binary)

discover, fetch, manage application images
set up pod filesystems
commands to manage pod lifecycle

slide-13
SLIDE 13

rkt (stage0) pod (stage1) bash/runit/systemd/... (invoking process) app1 (stage2) app2 (stage2)

slide-15
SLIDE 15

stage1

"the container"
execution environment for pods
process lifecycle management
resource constraints (isolators)

slide-16
SLIDE 16

stage1 (swappable)

  • binary ABI with stage0
  • rkt's stage0 calls exec(stage1, args...)
  • default implementation
      ○ based on systemd-nspawn + systemd
      ○ Linux namespaces + cgroups for isolation
  • kvm implementation
      ○ based on lkvm + systemd
      ○ hardware virtualisation for isolation
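
Because stage0 hands over with exec() rather than talking to a daemon, the pod keeps running as the very process the invoker started. A minimal sketch of that property, assuming the Python interpreter as a stand-in for a stage1 binary (`exec_keeps_pid` is a made-up helper name, not part of rkt):

```python
import os
import sys


def exec_keeps_pid():
    """Fork a child, exec a new program in it, and check the PID is unchanged."""
    read_fd, write_fd = os.pipe()
    child = os.fork()
    if child == 0:
        try:
            os.close(read_fd)
            os.dup2(write_fd, 1)  # the exec'ed program reports on stdout
            # Stand-in for exec(stage1, args...): replaces this process image.
            os.execv(sys.executable,
                     [sys.executable, "-c", "import os; print(os.getpid())"])
        finally:
            os._exit(127)  # only reached if exec failed
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        reported = pipe.read().strip()
    os.waitpid(child, 0)
    return reported == str(child)


if __name__ == "__main__":
    print("same PID before and after exec:", exec_keeps_pid())
```

The invoking process (bash, runit, systemd, ...) therefore supervises the pod directly through the PID it forked.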

slide-17
SLIDE 17

rkt (stage0) pod (stage1) bash/runit/systemd/... (invoking process) app1 (stage2) app2 (stage2)

slide-18
SLIDE 18

rkt (stage0) systemd-nspawn (stage1) bash/runit/systemd/... (invoking process) cached (stage2) workerd (stage2) systemd

slide-19
SLIDE 19

rkt (stage0) systemd-nspawn (stage1) bash/runit/systemd/... (invoking process) cached (stage2) workerd (stage2) systemd

container

slide-20
SLIDE 20

Containers on Linux

namespaces cgroups chroot

slide-21
SLIDE 21

Linux namespaces

slide-22
SLIDE 22

hostname: thunderstorm hostname: rainbow hostname: sunshine containers host

slide-23
SLIDE 23

Containers: no guest kernel

Hardware Host Linux kernel rkt rkt app

app app

system calls: example: sethostname()

kernel API

slide-24
SLIDE 24

Containers with an example

Getting and setting the hostname:
✤ The system calls for getting and setting the hostname are older than containers

int uname(struct utsname *buf);
int gethostname(char *name, size_t len);
int sethostname(const char *name, size_t len);

slide-25
SLIDE 25

Processes in namespaces

1 2 6 3 9

gethostname() -> “rainbow” gethostname() -> “thunderstorm”

slide-26
SLIDE 26

Linux Namespaces

Several independent namespaces

✤ uts (Unix Timesharing System) namespace
✤ mount namespace
✤ pid namespace
✤ network namespace
✤ user namespace
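
Each of these namespaces is visible as a symbolic link under /proc/<pid>/ns/. A small sketch that lists them for a process (the helper name `namespace_ids` is an assumption for illustration; it returns an empty dict where /proc is not available):

```python
import os


def namespace_ids(pid="self"):
    """Map namespace name -> link target such as 'uts:[4026531838]'."""
    ns_dir = f"/proc/{pid}/ns"
    if not os.path.isdir(ns_dir):
        return {}
    ids = {}
    for name in sorted(os.listdir(ns_dir)):
        try:
            ids[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            continue  # entry vanished or is not readable
    return ids


if __name__ == "__main__":
    for name, ident in namespace_ids().items():
        print(f"{name:20} {ident}")
```

Two processes are in the same namespace of a given type when they show the same identifier for it.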

slide-27
SLIDE 27

1 2 6

“rainbow” unshare(CLONE_NEWUTS);

Creating new namespaces

slide-28
SLIDE 28

1 2 6

“rainbow”

6

“rainbow”

Creating new namespaces
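
The unshare(CLONE_NEWUTS) step can be tried out with a sketch like the following. It assumes Linux and calls libc through ctypes; the function name is made up for illustration. The hostname change happens in a forked child, so the caller's hostname is never touched, and without CAP_SYS_ADMIN the kernel is expected to refuse the call.

```python
import ctypes
import os
import socket

CLONE_NEWUTS = 0x04000000  # value from <linux/sched.h>


def rename_in_new_uts_namespace(new_name):
    """Fork a child that unshares its UTS namespace and renames itself.

    Returns the child's exit code: 0 on success, 1 if unshare() was refused
    (typically EPERM without CAP_SYS_ADMIN), 2 on any other failure.
    """
    child = os.fork()
    if child == 0:
        try:
            libc = ctypes.CDLL(None, use_errno=True)
            if libc.unshare(CLONE_NEWUTS) != 0:
                os._exit(1)
            socket.sethostname(new_name)  # only affects the new namespace
            os._exit(0 if socket.gethostname() == new_name else 2)
        except BaseException:
            os._exit(2)
    _, status = os.waitpid(child, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else 2


if __name__ == "__main__":
    before = socket.gethostname()
    code = rename_in_new_uts_namespace("rainbow")
    print("child exit code:", code)
    print("hostname unchanged in parent:", socket.gethostname() == before)
```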

slide-29
SLIDE 29
slide-30
SLIDE 30

PID namespace

slide-31
SLIDE 31

Hiding processes and PID translation

✤ the host sees all processes
✤ the container only its own processes

slide-32
SLIDE 32

✤ the host sees all processes ✤ the container only its own processes

Actually pid 30920

Hiding processes and PID translation
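
On Linux 4.1 and later the translation can be read from the `NSpid:` line of /proc/<pid>/status, which lists the PID of a process in each nested PID namespace, outermost first. A sketch of a parser (the sample text in the usage lines is made up, reusing the host PID from the slide and assuming the process is PID 1 in its container):

```python
def parse_nspid(status_text):
    """Return the PIDs on the 'NSpid:' line of /proc/<pid>/status.

    The first entry is the PID as seen from the namespace that mounted
    /proc, the last one the PID in the process's innermost namespace.
    """
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []


if __name__ == "__main__":
    sample = "Name:\tinit\nNSpid:\t30920\t1\n"  # made-up sample
    print(parse_nspid(sample))
```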

slide-33
SLIDE 33

Initial PID namespace

1 2 6 7

slide-34
SLIDE 34

1 2 6 7

Creating a new namespace

clone(CLONE_NEWPID, ...);

slide-35
SLIDE 35

Creating a new namespace

1 2 6 1 7
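
The effect in the diagram can be reproduced with a short sketch. It does not call clone() as on the slide; it swaps in unshare(CLONE_NEWPID) followed by fork(), which also makes the next child the first process of a new PID namespace. It assumes Linux and returns None when the kernel refuses (typically without CAP_SYS_ADMIN). The function name is hypothetical.

```python
import ctypes
import os

CLONE_NEWPID = 0x20000000  # value from <linux/sched.h>


def pid_seen_in_new_namespace():
    """Return the PID a process sees for itself in a fresh PID namespace."""
    read_fd, write_fd = os.pipe()
    child = os.fork()
    if child == 0:
        try:
            os.close(read_fd)
            libc = ctypes.CDLL(None, use_errno=True)
            if libc.unshare(CLONE_NEWPID) != 0:
                os._exit(1)  # refused: nothing is written to the pipe
            grandchild = os.fork()  # first process in the new namespace
            if grandchild == 0:
                os.write(write_fd, str(os.getpid()).encode())
                os._exit(0)
            os.waitpid(grandchild, 0)
            os._exit(0)
        except BaseException:
            os._exit(2)
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        data = pipe.read()
    os.waitpid(child, 0)
    return int(data) if data else None


if __name__ == "__main__":
    print("PID inside the new namespace:", pid_seen_in_new_namespace())
```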

slide-36
SLIDE 36

rkt

rkt run ...
✤ uses clone() to start the first process in the container with a new pid namespace
✤ uses unshare() to create a new network namespace

rkt enter ...
✤ uses setns() to enter an existing namespace

slide-37
SLIDE 37

Joining an existing namespace

1 2 6 1 7

setns(...,CLONE_NEWPID);

slide-38
SLIDE 38

Joining an existing namespace

1 2 6 1 4
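
A sketch of the setns() call through ctypes, assuming Linux. The helper names `ns_path` and `enter_namespace` are made up for illustration and are not rkt's code. Joining a namespace needs CAP_SYS_ADMIN, and for a PID namespace only children created after the call land in it, which matches the new process appearing in the diagram.

```python
import ctypes
import os

# CLONE_* values from <linux/sched.h>, keyed by file name in /proc/<pid>/ns/.
NS_FLAGS = {
    "mnt": 0x00020000,
    "uts": 0x04000000,
    "ipc": 0x08000000,
    "user": 0x10000000,
    "pid": 0x20000000,
    "net": 0x40000000,
}


def ns_path(pid, ns_type):
    """Path of the namespace file to open before calling setns()."""
    if ns_type not in NS_FLAGS:
        raise ValueError(f"unknown namespace type: {ns_type}")
    return f"/proc/{pid}/ns/{ns_type}"


def enter_namespace(pid, ns_type):
    """Join one namespace of `pid`; raises OSError if the kernel refuses."""
    libc = ctypes.CDLL(None, use_errno=True)
    fd = os.open(ns_path(pid, ns_type), os.O_RDONLY)
    try:
        if libc.setns(fd, NS_FLAGS[ns_type]) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
    finally:
        os.close(fd)
```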

slide-39
SLIDE 39

Mount namespaces

slide-40
SLIDE 40

container host

/ /home /var /etc user / /my-app

slide-41
SLIDE 41

Storing the container data (Copy-on-write)

Container filesystem Overlay fs “upper” directory

/var/lib/rkt/pods/run/<pod-uuid>/overlay/sha512- .../upper/

Application Container Image

/var/lib/rkt/cas/tree/sha512-...
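
For orientation, an overlay mount combines a read-only lower directory (here the image tree) with a writable upper directory, plus a work directory that overlayfs requires on the same filesystem as the upper one. A sketch that only builds the mount option string; the work directory and all paths in the usage lines are hypothetical, not taken from rkt:

```python
def overlay_mount_options(lower, upper, work):
    """Build the data string for `mount -t overlay overlay -o <options> <target>`."""
    for path in (lower, upper, work):
        # ',' separates options and ':' separates stacked lower directories.
        if "," in path or ":" in path:
            raise ValueError(f"unsupported character in path: {path}")
    return f"lowerdir={lower},upperdir={upper},workdir={work}"


if __name__ == "__main__":
    # Hypothetical paths for illustration only.
    print(overlay_mount_options("/img/rootfs", "/pod/upper", "/pod/work"))
```

Writes inside the container land in the upper directory, so the image tree can be shared unmodified between pods.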

slide-42
SLIDE 42

rkt directories

/var/lib/rkt
├─ cas
│  └─ tree
│     ├─ deps-sha512-19bf...
│     └─ deps-sha512-a5c2...
└─ pods
   └─ run
      └─ e0ccc8d8
         ├─ overlay/sha512-19bf.../upper
         └─ stage1/rootfs/

slide-43
SLIDE 43

/ /home /var /etc user 1 7 3

unshare(..., CLONE_NEWNS);

slide-44
SLIDE 44

/ /home /var /etc user 1 7 3 / /home /var /etc user 7

slide-45
SLIDE 45

/

Changing root with MS_MOVE

/ ... rootfs my-app

mount($ROOTFS, "/", NULL, MS_MOVE, NULL)

/ ... rootfs my-app

$ROOTFS = /var/lib/rkt/pods/run/e0ccc8d8.../stage1/rootfs

slide-46
SLIDE 46

/ /home /var /etc user / /home /var /etc user

Relationship between the two mounts:

  • shared
  • master / slave
  • private

Mount propagation events

slide-47
SLIDE 47

Private Shared Master and slave

/home /home

Mount propagation events

/home /home /home /home

slide-48
SLIDE 48

✤ / in the container namespace is recursively set as slave:

mount(NULL, "/", NULL, MS_SLAVE|MS_REC, NULL)

How rkt uses mount propagation events

/ /home /var /etc user
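
The propagation type of each mount can be inspected in /proc/<pid>/mountinfo, where the optional fields before the "-" separator carry tags such as `shared:N` and `master:N`. A sketch of a parser for one line (the function name is an assumption, and the sample lines in the usage section are illustrative):

```python
def mount_propagation(mountinfo_line):
    """Return (mount point, propagation kinds) for one mountinfo line."""
    fields = mountinfo_line.split()
    separator = fields.index("-")     # ends the optional fields
    mount_point = fields[4]
    kinds = []
    for tag in fields[6:separator]:   # optional fields start at index 6
        if tag.startswith("shared:"):
            kinds.append("shared")
        elif tag.startswith("master:"):
            kinds.append("slave")     # receives events from a master peer group
        elif tag == "unbindable":
            kinds.append("unbindable")
    return mount_point, kinds or ["private"]


if __name__ == "__main__":
    line = "36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw"
    print(mount_propagation(line))
```

After the MS_SLAVE|MS_REC call, mounts inside the container would be expected to show a `master:` tag and no `shared:` tag.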

slide-49
SLIDE 49

Network namespace

slide-50
SLIDE 50

Network isolation

Goal:
✤ each container has its own network interfaces
✤ a container cannot see the network traffic outside itself (e.g. with tcpdump)

container1 host container2

eth0 eth0 eth0

slide-51
SLIDE 51

Network tooling

✤ Linux can create pairs of virtual net interfaces
✤ Can be linked in a bridge

container1 container2

eth0 veth1 eth0 veth2

IP masquerading via iptables

eth0

bridge

slide-52
SLIDE 52

rkt networking
✤ plugin based
✤ Container Network Interface (CNI)
  ✣ rkt
  ✣ Kubernetes
  ✣ Calico

slide-53
SLIDE 53

Container Runtime (e.g. rkt) veth macvlan ipvlan OVS

Container Networking Interface (CNI)

slide-54
SLIDE 54

How does rkt do it?

✤ rkt uses the network plugins implemented by the Container Network Interface (CNI, https://github.com/appc/cni) rkt network plugins

exec

systemd-nspawn

exec() /var/lib/rkt/pods/run/$POD_UUID/netns

network namespace

configure via setns + netlink create, join
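
A CNI plugin is an executable that receives its parameters through `CNI_*` environment variables and its network configuration as JSON on stdin. The sketch below builds both for an ADD call. The network name, bridge name, subnet and plugin search path are made-up values, and the function is illustrative rather than rkt's implementation:

```python
import json


def cni_add_request(container_id, netns_path, ifname="eth0"):
    """Return (environment, JSON config) for invoking a CNI plugin with ADD."""
    env = {
        "CNI_COMMAND": "ADD",
        "CNI_CONTAINERID": container_id,
        "CNI_NETNS": netns_path,     # network namespace to configure
        "CNI_IFNAME": ifname,        # interface name inside the container
        "CNI_PATH": "/usr/lib/cni",  # hypothetical plugin search path
    }
    config = {
        "name": "example-net",       # hypothetical network name
        "type": "bridge",            # selects the plugin executable
        "bridge": "cni0",
        "isGateway": True,
        "ipMasq": True,
        "ipam": {"type": "host-local", "subnet": "10.1.0.0/16"},
    }
    return env, json.dumps(config)


if __name__ == "__main__":
    env, config = cni_add_request(
        "example-pod", "/var/lib/rkt/pods/run/$POD_UUID/netns")
    print(env)
    print(config)
```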

slide-55
SLIDE 55

User namespaces

slide-56
SLIDE 56

History of Linux namespaces

✓ 1991: Linux
✓ 2002: namespaces in Linux 2.4.19
✓ 2008: LXC
✓ 2011: systemd-nspawn
✓ 2013: user namespaces in Linux 3.8
✓ 2013: Docker
✓ 2014: rkt
… development still active

slide-57
SLIDE 57

Why user namespaces?

✤ Better isolation
✤ Run applications which would need more capabilities
✤ Per user limits
✤ Future:
  ✣ Unprivileged containers: possibility to have container without root

slide-58
SLIDE 58

host 65535 4,294,967,295 (32-bit range) container 1 65535 container 2

User ID ranges

slide-59
SLIDE 59

unmapped

User ID mapping

/proc/$PID/uid_map: “0 1048576 65536” host container 1048576

65536 65536

unmapped unmapped
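
Each uid_map line reads "first ID inside the namespace, first ID outside, length of the range". A sketch of the translation for the mapping on the slide (helper names are assumptions for illustration):

```python
def parse_uid_map(text):
    """Parse /proc/<pid>/uid_map into (inside, outside, count) tuples."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, count = (int(field) for field in line.split())
            entries.append((inside, outside, count))
    return entries


def to_host_uid(entries, container_uid):
    """Translate a container UID to a host UID, or None if it is unmapped."""
    for inside, outside, count in entries:
        if inside <= container_uid < inside + count:
            return outside + (container_uid - inside)
    return None


if __name__ == "__main__":
    mapping = parse_uid_map("0 1048576 65536")
    print(to_host_uid(mapping, 0))      # root in the container
    print(to_host_uid(mapping, 65536))  # outside the mapped range
```

With this mapping, root inside the container is an ordinary unprivileged UID on the host.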

slide-60
SLIDE 60

Problems with container images

Container filesystem Container filesystem Overlayfs “upper” directory Overlayfs “upper” directory Application Container Image (ACI) Application Container Image (ACI)

container 1 container 2 downloading web server

slide-61
SLIDE 61

Problems with container images

✤ Files UID / GID
✤ rkt currently only supports user namespaces without overlayfs
  ✣ Performance loss: no COW from overlayfs
  ✣ “chown -R” for every file in each container

slide-62
SLIDE 62

Problems with volumes

/ /home /var user / /data /my-app

bind mount (rw / ro)

/data

✤ mounted in several containers ✤ No UID translation ✤ Dynamic UID maps

/data

slide-63
SLIDE 63

User namespace and filesystem problem

✤ Possible solution: add options to mount() to apply a UID mapping
✤ rkt would use it when mounting:
  ✣ the overlay rootfs
  ✣ volumes

slide-64
SLIDE 64

Isolators

slide-65
SLIDE 65

Isolators in rkt

✤ specified in an image manifest
✤ limiting capabilities or resources
slide-66
SLIDE 66

Isolators in rkt

Currently implemented
✤ capabilities
✤ cpu
✤ memory

Possible additions
✤ block-bandwidth
✤ block-iops
✤ network-bandwidth
✤ disk-space

slide-67
SLIDE 67

cgroups

slide-68
SLIDE 68

What’s a control group (cgroup)?
✤ group processes together
✤ organised in trees
✤ applying limits to them as a group

slide-69
SLIDE 69

cgroup API

/sys/fs/cgroup/*/ /proc/cgroups /proc/$PID/cgroup
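
Each line of /proc/$PID/cgroup has the form "hierarchy-ID:controller-list:cgroup-path". A sketch of a parser (the sample content in the usage lines is made up):

```python
def parse_proc_cgroup(text):
    """Parse /proc/<pid>/cgroup into (hierarchy id, controllers, path) tuples."""
    entries = []
    for line in text.splitlines():
        if not line:
            continue
        # Split at most twice: the path itself may contain ':'.
        hierarchy_id, controllers, path = line.split(":", 2)
        names = controllers.split(",") if controllers else []
        entries.append((int(hierarchy_id), names, path))
    return entries


if __name__ == "__main__":
    sample = "4:cpu,cpuacct:/machine.slice/pod\n0::/init.scope\n"  # made-up
    for entry in parse_proc_cgroup(sample):
        print(entry)
```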

slide-70
SLIDE 70

cgroups

slide-71
SLIDE 71

List of cgroup controllers

/sys/fs/cgroup/ ├─ cpu ├─ devices ├─ freezer ├─ memory ├─ ... └─ systemd

slide-72
SLIDE 72

Memory isolator

Application Image Manifest: “limit”: “500M”
systemd service file: [Service] ExecStart= MemoryLimit=500M
systemd action: write to memory.limit_in_bytes
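
systemd reads the K, M, G and T suffixes of MemoryLimit= with base 1024, so the value that ends up in memory.limit_in_bytes can be worked out as below. This sketch covers only the systemd side of the chain; how the manifest value is interpreted before it reaches the service file is not shown on the slide.

```python
_SUFFIXES = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def systemd_size_to_bytes(value):
    """Convert a systemd size such as '500M' to bytes (base 1024 suffixes)."""
    value = value.strip()
    if value and value[-1] in _SUFFIXES:
        return int(value[:-1]) * _SUFFIXES[value[-1]]
    return int(value)  # plain number of bytes


if __name__ == "__main__":
    print(systemd_size_to_bytes("500M"))
```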

slide-73
SLIDE 73

CPU isolator

Application Image Manifest: “limit”: “500m”
systemd service file: [Service] ExecStart= CPUShares=512
systemd action: write to cpu.shares
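
One conversion that reproduces the numbers on the slide is to scale millicores so that 1000m corresponds to the default weight of 1024 shares. This is an assumption made for illustration, not confirmed here as rkt's exact formula:

```python
def millicores_to_cpu_shares(limit):
    """Convert a CPU limit such as '500m' to a cpu.shares value.

    Assumes 1000m (one full CPU) maps to 1024 shares.
    """
    if not limit.endswith("m"):
        raise ValueError("expected a millicore value such as '500m'")
    millicores = int(limit[:-1])
    return millicores * 1024 // 1000


if __name__ == "__main__":
    print(millicores_to_cpu_shares("500m"))
```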

slide-74
SLIDE 74

Unified cgroup hierarchy

✤ Multiple hierarchies:
  ✣ one cgroup mount point for each controller (memory, cpu...)
  ✣ flexible but complex
  ✣ cannot remount with a different set of controllers
  ✣ difficult to give to containers in a safe way
✤ Unified hierarchy:
  ✣ cgroup filesystem mounted only one time
  ✣ soon to be stable in Linux (mount option “__DEVEL__sane_behavior” being removed)
  ✣ initial implementation in systemd-v226 (September 2015)
  ✣ no support in rkt yet
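
A sketch for telling the two layouts apart on a running system. It relies on the stabilised unified interface exposing a `cgroup.controllers` file at the root of the single mount, whereas the multiple-hierarchy layout shows one directory per controller. The function name and return strings are made up for illustration:

```python
import os


def cgroup_layout(root="/sys/fs/cgroup"):
    """Return 'unified', 'per-controller' or 'unknown' for a cgroup mount root."""
    if os.path.isfile(os.path.join(root, "cgroup.controllers")):
        return "unified"
    if os.path.isdir(root) and any(
        os.path.isdir(os.path.join(root, entry)) for entry in os.listdir(root)
    ):
        return "per-controller"  # e.g. cpu/, memory/, devices/ ...
    return "unknown"


if __name__ == "__main__":
    print(cgroup_layout())
```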

slide-75
SLIDE 75

Questions?

github.com/coreos/rkt Join us!

slide-76
SLIDE 76
  • Early bird tickets
  • Sponsorships are still available
  • Submit a talk before February 29th!

coreos.com/fest

May 9 & 10, 2016 | Berlin, Germany

@coreosfest