Rethinking the Linux kernel
Thomas Graf Cilium Project, Co-Founder & CTO, Isovalent

Remember GeoCities?
Cameron Askin: Cameron’s World

What enabled this evolution?
Markup Only (HTML) → Programmable Platform

Programmability
Safety: Untrusted code runs in the browser of the user. → Sandboxing
Continuous Delivery: Allow logic to evolve without constantly shipping new browser versions. → Deploy anytime with seamless upgrades
Performance: Programmability must be provided with minimal overhead. → Native execution (JIT compiler)

[Diagram: the Linux kernel between user space and hardware. Processes use sendmsg()/recvmsg() syscalls on sockets (TCP/IP, network device, network hardware) and read()/write() syscalls on file descriptors (VFS, block device, storage hardware); an admin configures the kernel via sysfs, netlink, procfs, ...]

Option 1: Native Support
Option 2: Kernel Module
Cons:
[...] module for each kernel version
[...] to upgrade

[Diagram: a process issuing the execve() syscall into the Linux kernel; labels: Process, Syscall, execve(), Scheduler, Linux Kernel]

[Diagram: a controller submits BPF program bytecode through the bpf() syscall; the verifier approves it and the JIT compiler emits x86_64 code; the BPF programs run in the Linux kernel on the sockets, TCP/IP and network device path that a process uses via sendmsg()/recvmsg().]

Safety & Security: The verifier rejects any unsafe program and provides a sandbox.
Continuous Delivery: Programs can be exchanged without disrupting workloads.
Performance: The JIT compiler ensures native execution performance.
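
The "verify before it may run" contract can be illustrated with a toy model. The sketch below is plain Python, not real eBPF: the instruction tuples, the names `verify` and `load`, and the two rules (jumps must go forward and stay inside the program, and the last instruction must be an exit) are assumptions made purely for illustration. The real verifier checks far more, such as register state and memory bounds.

```python
# Toy model of load-time verification. NOT the real eBPF verifier:
# the instruction tuples and both rules are hypothetical simplifications.

def verify(program):
    """Return (True, "ok") if the toy program passes, else (False, reason)."""
    if not program:
        return False, "empty program"
    if program[-1][0] != "exit":
        return False, "last instruction must be exit"
    for pc, insn in enumerate(program):
        if insn[0] == "jmp":
            offset = insn[1]
            if offset < 0:
                return False, f"backward jump at {pc} (possible unbounded loop)"
            # The offset is relative to the instruction after the jump.
            if pc + 1 + offset >= len(program):
                return False, f"jump out of bounds at {pc}"
    return True, "ok"


def load(program):
    """Accept a program only if it verifies: rejection happens before it runs."""
    ok, reason = verify(program)
    if not ok:
        raise ValueError(f"rejected: {reason}")
    return program


safe = [("mov", "r0", 0), ("jmp", 1), ("mov", "r0", 1), ("exit",)]
looping = [("mov", "r0", 0), ("jmp", -2), ("exit",)]
```

In this toy, `load(safe)` returns the program while `load(looping)` raises before anything executes, which mirrors the idea that unsafe programs never reach the kernel's execution path.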

[Diagram: two processes issuing syscalls (sendmsg()/recvmsg(), read()/write()) into the Linux kernel, along the sockets, TCP/IP, network device and file descriptor, VFS, block device paths down to the network and storage hardware.]

Where can you hook?
kernel functions (kprobes), userspace functions (uprobes), system calls, fentry/fexit, tracepoints, network devices (tc/xdp), network routes, TCP congestion algorithms, sockets (data level)

[Diagram: a BPF Map inside the Linux kernel, reached by the Controller and the Admin through syscalls, next to the sockets, TCP/IP and network device path that a process uses via sendmsg()/recvmsg().]

Map Types: [...]
What are Maps used for?
[...] statistics with user space
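
The statistics use case can be sketched with a toy model in plain Python. Nothing here is the kernel API: `ToyHashMap`, `count_packet` and `dump` are hypothetical names, and the fixed capacity is an assumption loosely modelled on maps being created with a maximum number of entries. One side accumulates counters under a key, the other side reads them back.

```python
# Toy model of a map used for statistics. NOT the kernel API:
# ToyHashMap, count_packet and dump are hypothetical names.

class ToyHashMap:
    """Fixed-capacity key/value store, loosely modelled on a BPF hash map."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._data = {}

    def lookup(self, key):
        # None plays the role of a failed (NULL) lookup.
        return self._data.get(key)

    def update(self, key, value):
        # Assumption: inserting a new key into a full map fails instead of growing.
        if key not in self._data and len(self._data) >= self.max_entries:
            raise OverflowError("map is full")
        self._data[key] = value

    def items(self):
        return sorted(self._data.items())


def count_packet(stats, src_ip, length):
    """'Kernel side': add the packet length to the per-source byte counter."""
    current = stats.lookup(src_ip)
    stats.update(src_ip, (current or 0) + length)


def dump(stats):
    """'User space side': read the counters the program accumulated."""
    return dict(stats.items())


stats = ToyHashMap(max_entries=2)
count_packet(stats, "10.0.0.1", 100)
count_packet(stats, "10.0.0.1", 50)
count_packet(stats, "10.0.0.2", 20)
```

The point of the design is that the writer and the reader never call each other; they only agree on the key and value layout of the shared map.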

[Diagram: a process using sendmsg()/recvmsg() syscalls on the sockets, TCP/IP and network device path in the Linux kernel.]

What helpers exist?
[...] forwarding

[...]
num = bpf_get_prandom_u32();
[...]

[Diagram: Linux Kernel]

What are Tail Calls used for?
[...] logical components
What are Function Calls used for?
[...] program
[...] inlining)
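
Splitting a program into logical components that hand off to one another can be sketched as a toy dispatcher. This is plain Python, not kernel code: the program array is modelled as a dict, the function names are hypothetical, and `MAX_TAIL_CALLS = 32` is an assumed cap for the sketch (the kernel enforces its own limit on chained tail calls). As a further simplification, a tail call into an empty slot returns a fallback verdict here, whereas a real program would continue executing after the failed call.

```python
# Toy model of tail calls through a program array. NOT kernel code:
# all names are hypothetical and MAX_TAIL_CALLS is an assumed cap.

MAX_TAIL_CALLS = 32


def parse_ethernet(ctx):
    # First logical component: dispatch on the (toy) protocol field.
    return ("tail", ctx["proto"])


def handle_ipv4(ctx):
    return ("done", "ipv4:" + ctx["payload"])


def handle_ipv6(ctx):
    return ("done", "ipv6:" + ctx["payload"])


def run(prog_array, start, ctx, fallback="pass"):
    """Run the program in slot `start`, following tail calls until one finishes."""
    index = start
    # One initial program plus at most MAX_TAIL_CALLS hand-offs.
    for _ in range(MAX_TAIL_CALLS + 1):
        prog = prog_array.get(index)
        if prog is None:
            return fallback  # empty slot: simplified fall-through
        kind, value = prog(ctx)
        if kind == "done":
            return value
        index = value  # "tail": jump to the next slot, never return here
    raise RuntimeError("tail call limit exceeded")


PROGS = {"entry": parse_ethernet, "ipv4": handle_ipv4, "ipv6": handle_ipv6}
```

Replacing a single slot, for example `PROGS["ipv4"]`, changes how that protocol is handled without touching the entry program, which is one reason to split logic this way.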

287 contributors (Jan 2016 to Jan 2020)

High-performance L4 load balancer: facebookincubator/katran
Android & Security: kernel runtime security instrumentation (KRSI), Android BPF loader, eBPF traffic monitor
bcc, bpftrace: performance troubleshooting & profiling (iovisor/bcc)
Traffic Optimization: DDoS mitigation, QoS, traffic optimization, load balancer (cloudflare/bpftools)
Falco: container runtime security, behavior analysis (falcosecurity/falco)
Cilium: networking, security and load-balancing for k8s (cilium/cilium)
et al.

BCC: github.com/iovisor/bcc

[Diagram: a Python program using BCC loads a BPF program into the Linux kernel via syscall (verifier, JIT compiler) and reads BPF maps; the program observes the sockets and TCP/IP path that a process uses via sendmsg()/recvmsg().]

# tcptop
Tracing... Output every 1 secs. Hit Ctrl-C to end
<screen clears>
19:46:24 loadavg: 1.86 2.67 2.91 3/362 16681

PID    COMM   LADDR            RADDR                 RX_KB  TX_KB
16648  16648  100.66.3.172:22  100.127.69.165:6684   1      0
16647  sshd   100.66.3.172:22  100.127.69.165:6684   0      2149
14374  sshd   100.66.3.172:22  100.127.69.165:25219  0      0
14458  sshd   100.66.3.172:22  100.127.69.165:7165   0      0

bpftrace: github.com/iovisor/bpftrace

[Diagram: a bpftrace program is loaded into the Linux kernel via syscall (verifier, JIT compiler) and reports through BPF maps; it observes the file descriptor and VFS path used by a process.]

# bpftrace -e 'kprobe:do_sys_open { printf("%s: %s\n", comm, str(arg1)) }'
Attaching 1 probe...
git: .git/objects/da
git: .git/objects/pack
git: /etc/localtime
systemd-journal: /var/log/journal/72d0774c88dc4943ae3d34ac356125dd
DNS Res~ver #15: /etc/hosts
^C

[Diagram: Cilium, alongside Kubernetes, loads programs into the Linux kernel via syscall (verifier, JIT compiler) and uses BPF maps; each container reaches the network hardware through its own sockets, TCP/IP and network device path.]

Container Networking
Service Load balancing:
[...] kube-proxy)
Container Security
[...] Cassandra, memcached, ..)
Visibility
Servicemesh:
[...] servicemesh sidecar proxies

# hubble observe --since=1m -t l7 -j \
    | jq 'select(.l7.dns.rcode==3) | .destination.namespace + "/" + .destination.pod_name' \
    | sort | uniq -c | sort -r
42 "starwars/jar-jar-binks-6f5847c97c-qmggv"
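
Development
[Diagram: C source is compiled with clang -target bpf into BPF program bytecode.]

Runtime
[Diagram: a Go library loads the BPF program into the Linux kernel via syscall (verifier, JIT compiler) and accesses the BPF map via syscall; the program runs on the sockets and TCP/IP path that a process uses via sendmsg()/recvmsg().]

Go Library: https://github.com/cilium/ebpf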

eBPF is turning the Linux kernel into a microkernel.
[...] functionality is implemented with eBPF.
Much quicker than normal kernel development.
Example: The Linux kernel is not aware of containers and microservices (it only knows about namespaces). Cilium is making the Linux kernel container and Kubernetes aware.

eBPF could enable the Linux kernel hotpatching we always dreamed about.
Problem:
[...] patch kernel.
[...] long time without risking extensive downtime.
[Diagram: a Hotfix applied to one of several functions in the Linux kernel.]

eBPF Maintainers: Daniel Borkmann, Alexei Starovoitov
Cilium Team: André Martins, Jarno Rajahalme, Joe Stringer, John Fastabend, Maciej Kwiek, Martynas Pumputis, Paul Chaignon, Quentin Monnet, Ray Bejjani, Tobias Klauser
Facebook Team: Andrii Nakryiko, Andrey Ignatov, Jakub Kicinski, Martin KaFai Lau, Roman Gushchin, Song Liu, Yonghong Song
Google Team: Chenbo Feng, KP Singh, Lorenzo Colitti, Maciej Żenczykowski, Stanislav Fomichev
BCC & bpftrace: Alastair Robertson, Brendan Gregg, Brenden Blanco
Kernel Team: Björn Töpel, David S. Miller, Edward Cree, Jesper Brouer, Toke Høiland-Jørgensen
BPF and XDP Reference Guide
github.com/cilium/cilium
@ciliumproject
@tgraf__
All images: Pixabay