Replacing iptables with eBPF in Kubernetes with Cilium
Cilium, eBPF, Envoy, Istio, Hubble
Michal Rostecki, Software Engineer (mrostecki@suse.com, mrostecki@opensuse.org)
Swaminathan Vasudevan, Software Engineer (svasudevan@suse.com)
iptables runs into a couple of significant problems: rules are matched against a sequential list, so classification cost grows linearly with the number of rules, and every update replaces the whole rule set. Under heavy traffic, or in a system with a large number of changes to iptables rules, performance therefore degrades. Measurements show unpredictable latency and reduced performance as the number of services grows.
[Diagram: the Linux kernel network stack: system call interface, raw sockets, TCP and UDP, IPv4/IPv6 with Netfilter, Ethernet, traffic shaping, HW bridge/OVS, netdevices and drivers]
The kernel network stack is built out of many layers. It has grown this way over more than two decades, and packet filtering has to be applied to packets that move up and down the stack.
[Diagram: the same stack annotated with BPF hook points: BPF system calls, BPF sockmap and sockops at the socket layer, BPF cGroups, BPF TC hooks, and BPF XDP at the driver level]
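To make the lowest of these hook points concrete, here is a minimal XDP sketch (illustrative, not from the talk): a program attached at the driver that drops IPv4 UDP packets and passes everything else up the stack.

    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/ip.h>
    #include <linux/in.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    SEC("xdp")
    int xdp_drop_udp(struct xdp_md *ctx)
    {
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;

        /* Bounds checks are mandatory: the verifier rejects the
         * program without them. */
        struct ethhdr *eth = data;
        if ((void *)(eth + 1) > data_end)
            return XDP_PASS;
        if (eth->h_proto != bpf_htons(ETH_P_IP))
            return XDP_PASS;

        struct iphdr *ip = (void *)(eth + 1);
        if ((void *)(ip + 1) > data_end)
            return XDP_PASS;

        /* Drop UDP before the kernel even builds an sk_buff. */
        return ip->protocol == IPPROTO_UDP ? XDP_DROP : XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";

Compiled with clang -O2 -target bpf, the object can be attached with ip link set dev eth0 xdp obj prog.o sec xdp.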
[Benchmark chart: packet processing throughput in Mpps]
[Diagram: iptables netfilter hooks vs. eBPF hooks. With iptables, a packet entering a netdevice (physical or virtual) traverses PREROUTING (nat), a routing decision, then either INPUT (filter) toward local processes or FORWARD (filter), followed by another routing decision, OUTPUT (nat, filter) for locally generated traffic, and POSTROUTING (nat) on the way out. With eBPF, the equivalent code runs directly at the XDP and TC hooks on the netdevice itself.]
[Diagram: eBPF replacement for the iptables chains. At the TC/XDP ingress hook, an ingress chain selector steers packets with a local destination into an INGRESS chain and packets with a remote destination into a FORWARD chain; at the TC egress hook, an egress chain selector sends locally generated traffic through an OUTPUT chain. Each chain updates and stores session state for connection tracking (sketched below) and labels packets as they pass to and from the Linux stack, where netfilter still sits.]
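The connection-tracking step above boils down to a per-flow lookup in a BPF map. A hedged sketch of such a session table (illustrative names and layout, not Cilium's actual datapath):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct flow_key {
        __u32 src_ip;
        __u32 dst_ip;
        __u16 src_port;
        __u16 dst_port;
        __u8  proto;
        __u8  pad[3];   /* explicit padding: hash keys must be fully set */
    };

    struct flow_state {
        __u64 packets;
        __u64 last_seen_ns;
    };

    /* LRU hash: old sessions are evicted automatically when full. */
    struct {
        __uint(type, BPF_MAP_TYPE_LRU_HASH);
        __uint(max_entries, 65536);
        __type(key, struct flow_key);
        __type(value, struct flow_state);
    } sessions SEC(".maps");

    /* Called from a TC program after the 5-tuple has been parsed. */
    static __always_inline void track_flow(const struct flow_key *key)
    {
        struct flow_state *st = bpf_map_lookup_elem(&sessions, key);
        if (st) {
            /* Known session: update its counters in place. */
            __sync_fetch_and_add(&st->packets, 1);
            st->last_seen_ns = bpf_ktime_get_ns();
        } else {
            /* New session: store the initial state. */
            struct flow_state init = {
                .packets = 1,
                .last_seen_ns = bpf_ktime_get_ns(),
            };
            bpf_map_update_elem(&sessions, key, &init, BPF_ANY);
        }
    }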
[Diagram: classification pipeline. After header parsing, eBPF program #1 looks up IP.dst (IP1 -> bitv1, IP2 -> bitv2, IP3 -> bitv3), eBPF program #2 looks up IP.proto (* -> bitv1, udp -> bitv2, tcp -> bitv3), and so on; the final program ANDs the bit-vectors, searches for the first matching rule (rule1 -> act1, rule2 -> act2, ...), updates the per-rule counters, and applies the action (drop/accept). Packet header offsets and the bit-vector holding the temporary result live in per-CPU arrays shared across the entire program chain, and packets move from the ingress eBPF hook through the chain to the egress hook via tail calls.]

Classification follows the LBVS (Linear Bit-Vector Search) scheme, implemented as a chain of eBPF programs connected through tail calls. Each eBPF program can exploit a different matching algorithm (e.g., exact match, longest prefix match), and a program is injected only if there are rules operating on that field. Header parsing is done once, and the resulting offsets are kept in a shared map for performance reasons. A sketch of the tail-call plumbing follows this description.
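A hedged sketch of that plumbing, assuming illustrative map and program names (this shows the general tail-call technique, not the talk's actual code):

    #include <linux/bpf.h>
    #include <linux/pkt_cls.h>
    #include <bpf/bpf_helpers.h>

    #define SLOT_MATCH_PROTO 1   /* hypothetical jump-table slot */

    /* Program array holding the next matchers in the chain. */
    struct {
        __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
        __uint(max_entries, 8);
        __type(key, __u32);
        __type(value, __u32);
    } jump_table SEC(".maps");

    /* Per-CPU scratch slot for the temporary rule bit-vector,
     * shared by every program in the chain. */
    struct {
        __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, __u64);
    } bitvec SEC(".maps");

    SEC("tc")
    int match_ip_dst(struct __sk_buff *skb)
    {
        __u32 zero = 0;
        __u64 *bv = bpf_map_lookup_elem(&bitvec, &zero);
        if (!bv)
            return TC_ACT_OK;

        /* AND in this field's bit-vector (placeholder constant). */
        *bv &= 0xff;

        /* Hand off to the next matcher; on success this never returns. */
        bpf_tail_call(skb, &jump_table, SLOT_MATCH_PROTO);

        /* Reached only if the tail call fails (e.g., empty slot). */
        return TC_ACT_OK;
    }

    char _license[] SEC("license") = "GPL";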
CNI (Container Network Interface) is a CNCF (Cloud Native Computing Foundation) project for Linux containers. It consists of a specification and libraries for writing plugins, and it cares only about the network connectivity of containers.

A general container runtime consideration for CNI: the container runtime must create a new network namespace for the container before invoking any plugins.

When the CNI ADD call is invoked, the plugin adds the network to the container, creating the respective veth pairs and assigning an IP address from the respective IPAM plugin or using the host scope. When the CNI DEL call is invoked, it removes the container network, releases the IP address back to the IPAM manager, and cleans up the veth pairs.
[Diagram: the CNI flow in Kubernetes with Cilium. kubectl talks to the Kubernetes API server; on the node, the kubelet drives CRI-containerd, which invokes the Cilium CNI plugin (cni-add()); the plugin hands off to the Cilium agent, which loads BPF programs and populates BPF maps through bpf() syscalls. In the kernel, BPF hooks in the network stack connect the pod's eth0 to containers 1 and 2.]
[Diagram: Cilium architecture. In user space, the Cilium pod (control plane) runs the agent daemon together with the Cilium CLI, monitor, health, and namespace plugin components, plus the Cilium operator and host networking, all driven by the orchestrator. The agent manages the datapath through bpf() syscalls (bpf_create_map, bpf_lookup_elem) and SO_ATTACH_BPF. In kernel space, per-container BPF programs (BPF-Cont1/2/3, BPF-Cilium) and sockmap/sockops hooks attach at XDP, TC, and socket level across the stack: device driver, queueing and forwarding, IP layer, TCP/UDP layer, virtual net devices, and the AF_INET, AF_RAW, and AF_XDP socket interfaces serving VMs, containers, and apps.]
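From user space, the map plumbing in that picture is ordinary bpf() syscall work. A minimal sketch with libbpf (illustrative map name and layout, not the agent's actual code):

    #include <stdio.h>
    #include <bpf/bpf.h>

    int main(void)
    {
        /* Create a hash map keyed by IPv4 address, holding a verdict. */
        int map_fd = bpf_map_create(BPF_MAP_TYPE_HASH, "policy_map",
                                    sizeof(__u32), sizeof(__u8), 1024, NULL);
        if (map_fd < 0) {
            perror("bpf_map_create");
            return 1;
        }

        __u32 ip = 0x0a000001;  /* 10.0.0.1, host byte order for brevity */
        __u8 allow = 1;
        bpf_map_update_elem(map_fd, &ip, &allow, BPF_ANY);

        __u8 verdict = 0;
        if (bpf_map_lookup_elem(map_fd, &ip, &verdict) == 0)
            printf("verdict for 10.0.0.1: %u\n", verdict);
        return 0;
    }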
[Diagram: a Kubernetes cluster running Cilium as the networking CNI. Pods with containers A, B, and C sit on two nodes; each container's eth0 pairs with an lxc* veth device managed by Cilium, which connects it to the node's eth0.]
Cilium offers two modes for connecting nodes:
- Encapsulation: traffic between nodes A, B, and C is tunnelled over VXLAN.
- Direct routing: traffic is forwarded natively by the cloud network or BGP routing between the nodes.
[Diagram: identity-based L3 policy. Frontend pods (role=frontend, at 10.0.0.1, 10.0.0.2, and 10.0.0.4) are allowed to reach the backend pod (role=backend, 10.0.0.3); an unlabelled pod (10.0.0.5) is denied.]
    apiVersion: "cilium.io/v2"
    kind: CiliumNetworkPolicy
    description: "Allow frontends to access backends"
    metadata:
      name: "frontend-backend"
    spec:
      endpointSelector:
        matchLabels:
          role: backend
      ingress:
      - fromEndpoints:
        - matchLabels:
            class: frontend
[Diagram: CIDR-based egress policy. The backend pod (role=backend, 10.0.0.1) may reach 10.0.1.1 in subnet 10.0.1.0/24; any IP not belonging to 10.0.1.0/24, such as 10.0.2.1 in 10.0.2.0/24, is denied.]
    apiVersion: "cilium.io/v2"
    kind: CiliumNetworkPolicy
    description: "Allow backends to access 10.0.1.0/24"
    metadata:
      name: "frontend-backend"
    spec:
      endpointSelector:
        matchLabels:
          role: backend
      egress:
      - toCIDR:
        - 10.0.1.0/24
    apiVersion: "cilium.io/v2"
    kind: CiliumNetworkPolicy
    description: "Allow to access backends only on TCP/80"
    metadata:
      name: "frontend-backend"
    spec:
      endpointSelector:
        matchLabels:
          role: backend
      ingress:
      - toPorts:
        - ports:
          - port: "80"
            protocol: "TCP"
[Diagram: L4 policy in action. Traffic to the backend pod (role=backend, 10.0.0.1) is allowed on TCP/80 and denied on any other port.]
[Diagram: L7 policy. A client pod (10.0.0.5) calls the API pod (role=api, 10.0.0.1); GET /articles/{id} is allowed while GET /private is blocked.]
    apiVersion: "cilium.io/v2"
    kind: CiliumNetworkPolicy
    description: "L7 policy to restrict access to specific HTTP endpoints"
    metadata:
      name: "frontend-backend"
    spec:
      endpointSelector:
        matchLabels:
          role: backend
      ingress:
      - toPorts:
        - ports:
          - port: "80"   # port assumed from the TCP/80 example above
            protocol: "TCP"
          rules:
            http:
            - method: "GET"
              path: "/article/$"
[Diagram: L7 enforcement with Envoy. On node A and node B, each pod (pod A, pod B) runs with BPF programs and an Envoy proxy; Cilium generates BPF programs for L3/L4 filtering directly, and generates BPF programs for L7 filtering through libcilium.so in Envoy. The two nodes are connected over VXLAN.]
[Diagram: cluster mesh. Pods A, B, and C, each a container with its own eth0 and backed by BPF programs, run across nodes A and B; cluster-wide state is shared through an external etcd.]
[Diagram: service-to-service communication with kube-proxy. For two services in pods on the same Kubernetes node, each request descends one pod's full stack (socket, TCP/IP, Ethernet), passes several iptables stages and the loopback device, and climbs back up the peer pod's stack (iptables, TCP/IP, socket) before reaching the other service; traffic leaving the node additionally goes out through eth0.]
[Diagram: the same two services with Cilium CNI. Using sockmap and sockops programs, traffic between node-local sockets is redirected socket to socket, skipping TCP/IP, iptables, Ethernet, and loopback entirely; only traffic leaving the node traverses TCP/IP and eth0.]
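A hedged sketch of the socket-level redirect behind that picture: an sk_msg program that splices messages straight into a peer socket stored in a sockhash map (illustrative key scheme, not Cilium's actual datapath):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    /* Sockets on this node, keyed by an illustrative 4-tuple hash. */
    struct {
        __uint(type, BPF_MAP_TYPE_SOCKHASH);
        __uint(max_entries, 65536);
        __type(key, __u64);
        __type(value, __u64);
    } sock_map SEC(".maps");

    SEC("sk_msg")
    int redirect_local(struct sk_msg_md *msg)
    {
        /* Illustrative key derivation from the peer address. */
        __u64 key = ((__u64)msg->remote_ip4 << 32) | msg->remote_port;

        /* If the peer socket is local, deliver the message directly
         * to its receive queue, bypassing TCP/IP and iptables.
         * Otherwise fall through to the regular path. */
        bpf_msg_redirect_hash(msg, &sock_map, &key, BPF_F_INGRESS);
        return SK_PASS;
    }

    char _license[] SEC("license") = "GPL";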
[Diagram: Istio integration. Services A, B, and C run in pods across two nodes of the cluster, with Cilium as the networking CNI underneath and Istio Pilot/Mixer/Citadel as the control plane.]
[Diagram: the same setup with mutual TLS between service A and service B across nodes.]
[Diagram: the same setup with deferred kTLS encryption applied to traffic from the cluster to an external GitHub service over the external cloud network.]
BPF, Cilium: key/value hash maps. Search O(1), insert O(1), delete O(1).
iptables, kube-proxy: sequential rule list (rule 1, rule 2, ..., rule n). Search O(n), insert O(1), delete O(n).
The sketch below shows what the O(1) side looks like in practice.
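For instance, a per-service lookup becomes a single constant-time map operation, independent of how many services exist. A hedged sketch (illustrative names and layouts, not Cilium's actual datapath):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct svc_key {
        __u32 vip;     /* service (virtual) IPv4 address */
        __u16 port;    /* service port */
        __u16 pad;
    };

    struct svc_backend {
        __u32 ip;      /* backend pod address */
        __u16 port;    /* backend port */
        __u16 pad;
    };

    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 16384);
        __type(key, struct svc_key);
        __type(value, struct svc_backend);
    } services SEC(".maps");

    /* One constant-time lookup regardless of how many services exist;
     * kube-proxy's iptables mode instead walks a rule list that grows
     * with the number of services. */
    static __always_inline const struct svc_backend *
    lookup_service(__u32 vip, __u16 port)
    {
        struct svc_key key = { .vip = vip, .port = port };
        return bpf_map_lookup_elem(&services, &key);
    }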
[Benchmark chart: latency in microseconds vs. number of services in the cluster]
Hubble is a fully distributed networking and security observability platform for cloud-native workloads. It is built on top of Cilium and eBPF to enable deep visibility in a transparent manner. Hubble provides visibility into network flows, a service communication map, and metrics.
Known limitations of Hubble:
The following components make up Hubble:
- Hubble Agent: runs on each worker node. It interacts with the Cilium agent running on the same node and serves the flow query API as well as the metrics.
- Hubble Storage: the storage layer consists of an in-memory store able to keep a fixed number of flows per node.
- Hubble CLI: connects to the flow query API of a Hubble agent running on a node and allows querying the flows kept in the in-memory storage using server-side filtering.
- Hubble UI: uses the flow query API to provide a graphical service communication map based on the observed flows.