Linux Kernel Networking Raoul Rivas Kernel vs Application - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Linux Kernel Networking

Raoul Rivas

slide-2
SLIDE 2

Kernel vs Application Programming

Kernel programming:

  • No memory protection
  • We share memory with devices, scheduler
  • Sometimes no preemption
  • Can hog the CPU
  • Concurrency is difficult
  • No libraries
  • No printf, fopen
  • No security descriptors
  • In Linux no access to files
  • Direct access to hardware

Application programming:

  • Memory protection
  • Segmentation fault
  • Preemption
  • Scheduling isn't our responsibility
  • Signals (Control-C)
  • Libraries
  • Security descriptors
  • In Linux everything is a file descriptor
  • Access to hardware as files
slide-3
SLIDE 3

Outline

  • User Space and Kernel Space
  • Running Context in the Kernel
  • Locking
  • Deferring Work
  • Linux Network Architecture
  • Sockets, Families and Protocols
  • Packet Creation
  • Fragmentation and Routing
  • Data Link Layer and Packet Scheduling
  • High Performance Networking
slide-4
SLIDE 4

System Calls

  • A system call is triggered through a software interrupt (INT 0x80 on 32-bit x86)
  • syscall(number, arguments)
  • The kernel runs in a different address space
  • Data must be copied back and forth
  • copy_to_user(), copy_from_user()
  • Never directly dereference any pointer from user space

[Figure: in user space, write(ptr, size) becomes syscall(WRITE, ptr, size) and raises INT 0x80; in kernel space, the syscall table dispatches to sys_write(), which uses copy_from_user() to fetch the data behind ptr. Example addresses shown: 0xFFFF50 and 0x011075.]
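The user-space side of the syscall(number, arguments) form can be sketched as follows. This is a minimal illustration that assumes Linux with glibc; raw_write is a hypothetical helper name, not something from the slides. The kernel side (sys_write and copy_from_user) is not shown.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: issue write(2) by system call number instead of
 * going through the libc write() wrapper. */
long raw_write(int fd, const void *buf, size_t len)
{
    assert(buf != NULL);
    /* syscall() passes the number and arguments to the kernel and traps
     * into it. The kernel receives buf as a user-space pointer and must
     * copy the data itself rather than dereference it directly. */
    return syscall(SYS_write, fd, buf, len);
}
```

Calling raw_write(1, "hi\n", 3) should behave like write(1, "hi\n", 3): it returns the number of bytes written, or -1 with errno set.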

slide-5
SLIDE 5

Context

  • Context: the entity on whose behalf the kernel is running code
  • Process context and kernel context are preemptible
  • Interrupt handlers cannot sleep and should be short
  • All three run concurrently
  • Process context and kernel context have a PID:
  • struct task_struct *current

              Kernel Context   Process Context   Interrupt Context
Preemptible   Yes              Yes               No
PID           Itself           Application PID   No
Can Sleep?    Yes              Yes               No
Example       Kernel Thread    System Call       Timer Interrupt

slide-6
SLIDE 6

Race Conditions

  • Process context, kernel context and interrupts run concurrently
  • How do we protect critical sections from race conditions?

  • Spinlocks
  • Mutex
  • Semaphores
  • Reader-Writer Locks (Mutex, Semaphores)
  • Reader-Writer Spinlocks
slide-7
SLIDE 7

Inside Locking Primitives

  • Spinlock

    // spinlock_lock:
    disable_interrupts();
    while (test_and_set(&locked) == true)
        ;   // spin: atomically set locked, retry while it was already set
    // critical region
    // spinlock_unlock:
    locked = false;
    enable_interrupts();

  • Mutex

    // mutex_lock:
    if (locked == true) {
        enqueue(this);
        yield();   // sleep until mutex_unlock wakes us up
    }
    locked = true;
    // critical region
    // mutex_unlock:
    if (!isEmpty(waitqueue)) {
        wakeup(dequeue());   // hand the lock to the next waiter
    } else
        locked = false;

We can't sleep while the spinlock is locked!
We can't use a mutex in an interrupt because interrupts can't sleep!

THE MUTEX SLEEPS, THE SPINLOCK SPINS...
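The "spinlock spins" idea can be tried out in user space with C11 atomics. The sketch below is an illustration only: it is not the kernel's spinlock_t, it does not disable interrupts, and the spin_sketch names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space spinlock built on an atomic flag. */
struct spin_sketch {
    atomic_flag locked;
};

void spin_sketch_init(struct spin_sketch *l)
{
    atomic_flag_clear(&l->locked);
}

/* One atomic test-and-set; returns true if we took the lock. */
bool spin_sketch_trylock(struct spin_sketch *l)
{
    return !atomic_flag_test_and_set_explicit(&l->locked,
                                              memory_order_acquire);
}

/* Busy-wait until the test-and-set succeeds. The caller never sleeps. */
void spin_sketch_lock(struct spin_sketch *l)
{
    while (!spin_sketch_trylock(l))
        ;   /* spin */
}

void spin_sketch_unlock(struct spin_sketch *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

Testing and setting the flag in one atomic step is what prevents two contexts from both seeing the lock as free and entering the critical region together.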

slide-8
SLIDE 8

When to use what?

Requirement          Recommended lock
Short lock time      Spinlock
Long lock time       Mutex
Interrupt context    Spinlock
Sleeping             Mutex

  • Functions that handle memory, user space, devices or scheduling can usually sleep
  • kmalloc (with GFP_KERNEL), copy_to_user, schedule
  • printk and wake_up_process do not sleep
slide-9
SLIDE 9

Linux Kernel Modules

  • Extensibility
  • Ideally you don't want to patch the kernel but build a kernel module
  • Separate compilation
  • Runtime linkage
  • Entry and exit functions
  • Run in process context
  • LKM "Hello-World"

    #define MODULE
    #define LINUX
    #define __KERNEL__
    #include <linux/module.h>
    #include <linux/kernel.h>
    #include <linux/init.h>

    /* Entry function: runs when the module is loaded. */
    static int __init myinit(void)
    {
        printk(KERN_ALERT "Hello, world\n");
        return 0;
    }

    /* Exit function: runs when the module is unloaded. */
    static void __exit myexit(void)
    {
        printk(KERN_ALERT "Goodbye, world\n");
    }

    module_init(myinit);
    module_exit(myexit);
    MODULE_LICENSE("GPL");

slide-10
SLIDE 10

The Kernel Loop

  • The Linux kernel uses the concept of jiffies to measure time
  • Inside the kernel there is a loop that measures time and preempts tasks
  • A jiffy is the period at which the timer in this loop is triggered
  • The tick rate varies from system to system: 100 Hz, 250 Hz, 1000 Hz
  • Use the constant HZ to get the value
  • The schedule function is the function that preempts tasks

[Figure: a timer with period 1/HZ runs tick_periodic, which performs jiffies++, calls scheduler_tick() (leading to schedule()) and re-arms itself with add_timer(1 jiffy).]
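Since one jiffy lasts 1/HZ seconds, converting a tick count to wall-clock time is a single multiplication and division. The sketch below is a user-space illustration: ticks_to_ms and tick_after are hypothetical names, and the tick rate is passed as a parameter instead of using the kernel's HZ constant. The second function shows the wraparound-safe comparison idea that the kernel's time_after() macro is based on.

```c
#include <assert.h>

/* Convert a tick count to milliseconds for a given tick rate in Hz.
 * Note: ticks * 1000 can overflow for very large tick counts. */
unsigned long ticks_to_ms(unsigned long ticks, unsigned long hz)
{
    assert(hz > 0);
    return ticks * 1000UL / hz;
}

/* Returns 1 if tick value a is later than b, even if the counter has
 * wrapped around in between (as long as they are less than half the
 * counter range apart). */
int tick_after(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}
```

For example, at 250 Hz a delay of 250 ticks is one second, while at 100 Hz a single tick is 10 ms.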

slide-11
SLIDE 11

Deferring Work / Two Halves

  • Kernel timers are used to create timed events
  • They use jiffies to measure time
  • Timer handlers run in interrupt context
  • We can't do much in them!
  • Solution: divide the work in two parts
  • Use the timer handler to signal a thread (TOP HALF)
  • Let the kernel thread do the real job (BOTTOM HALF)

    // Timer handler (interrupt context, TOP HALF):
    wake_up(thread);

    // Thread (kernel context, BOTTOM HALF):
    while (1) {
        do_work();
        schedule();
    }
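The split between the two halves can be illustrated without any kernel API. The sketch below is a single-threaded user-space simulation with hypothetical names (struct deferred, top_half, bottom_half): the top half only records that something happened, and the bottom half later drains the queue and does the real work. It has no locking, so it is not safe to call from concurrent contexts as written.

```c
#include <assert.h>

#define PENDING_MAX 16   /* hypothetical queue size, a power of two */

struct deferred {
    int events[PENDING_MAX];
    unsigned head, tail;   /* tail - head = number of queued events */
    long total;            /* stands in for the result of the real work */
};

/* Top half: record the event and return immediately.
 * Returns 0 on success, -1 if the queue is full. */
int top_half(struct deferred *d, int event)
{
    if (d->tail - d->head == PENDING_MAX)
        return -1;
    d->events[d->tail++ % PENDING_MAX] = event;
    return 0;
}

/* Bottom half: drain the queue and do the real work (here, a sum).
 * Returns the number of events handled. */
int bottom_half(struct deferred *d)
{
    int handled = 0;
    while (d->head != d->tail) {
        d->total += d->events[d->head++ % PENDING_MAX];
        handled++;
    }
    return handled;
}
```

In the kernel version described above, top_half would correspond to the timer handler calling wake_up(thread), and bottom_half to one iteration of the thread's loop.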

slide-12
SLIDE 12

Linux Kernel Map

slide-13
SLIDE 13

Linux Network Architecture

[Figure: Linux network architecture. Blocks shown: File Access and Socket Access; VFS, Logical Filesystem (EXT4), Network Storage (NFS, SMB, iSCSI) and Socket Splice; Protocol Families (INET, UNIX); Protocols (UDP, TCP, IP); Network Interface (ethernet, 802.11); Network Device Driver.]

slide-14
SLIDE 14

Socket Access

  • Contains the system call functions like socket, connect, accept, bind
  • Implements the POSIX socket interface
  • Independent of protocols or socket types
  • Responsible for mapping socket data structures to integer handles
  • Calls the underlying layer functions
  • sys_socket() → sock_create

[Figure: socket() enters sys_socket, which calls sock_create and stores the new socket in the handle table; the integer handle is returned to the caller.]
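The mapping from socket structures to integer handles can be pictured as a small table of pointers. The sketch below is a simplified user-space illustration; struct sock_sketch, struct handle_table and the handle_* functions are hypothetical names, and the real kernel uses the per-process file descriptor table instead.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_HANDLES 8   /* hypothetical table size */

struct sock_sketch { int family; };

struct handle_table {
    struct sock_sketch *slots[MAX_HANDLES];
};

/* Store the socket in the lowest free slot and return its index,
 * or -1 if the table is full. */
int handle_install(struct handle_table *t, struct sock_sketch *s)
{
    assert(s != NULL);
    for (int i = 0; i < MAX_HANDLES; i++) {
        if (t->slots[i] == NULL) {
            t->slots[i] = s;
            return i;
        }
    }
    return -1;
}

/* Translate an integer handle back to the socket, or NULL if invalid. */
struct sock_sketch *handle_lookup(const struct handle_table *t, int h)
{
    if (h < 0 || h >= MAX_HANDLES)
        return NULL;
    return t->slots[h];
}

/* Free the slot so the handle value can be reused. */
void handle_release(struct handle_table *t, int h)
{
    if (h >= 0 && h < MAX_HANDLES)
        t->slots[h] = NULL;
}
```

Because the application only ever sees the integer, the layer can validate every handle on entry and never has to trust a pointer coming from user space.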

slide-15
SLIDE 15

Protocol Families

  • Implements different socket families: INET, UNIX
  • Extensible through the use of pointers to functions and modules
  • Allocates memory for the socket
  • Calls net_proto_family→create for family-specific initialization

[Figure: table of net_proto_family entries indexed by address family (e.g. AF_LOCAL / AF_UNIX); each entry *pf points to a create function such as inet_create.]
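The function-pointer extensibility described above can be sketched in plain C. This is a simplified user-space model, not the kernel's actual data structures: struct family_sketch stands in for net_proto_family, and family_register, sock_create_sketch and demo_create are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define NFAMILIES 4   /* hypothetical; the real table is larger */

struct sock_sketch { int family; int initialized; };

/* Simplified stand-in for net_proto_family: a number plus create(). */
struct family_sketch {
    int family;
    int (*create)(struct sock_sketch *sk);
};

static const struct family_sketch *families[NFAMILIES];

/* A module would call this at load time. Returns 0, or -1 on error. */
int family_register(const struct family_sketch *f)
{
    if (f->family < 0 || f->family >= NFAMILIES || families[f->family])
        return -1;
    families[f->family] = f;
    return 0;
}

/* Generic path: fill in the common part, then let the family finish. */
int sock_create_sketch(int family, struct sock_sketch *sk)
{
    if (family < 0 || family >= NFAMILIES || !families[family])
        return -1;   /* family not registered */
    sk->family = family;
    return families[family]->create(sk);
}

/* Example family-specific initializer. */
int demo_create(struct sock_sketch *sk)
{
    sk->initialized = 1;
    return 0;
}
```

A new family only needs to register its own create function; the generic creation path does not change.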

slide-16
SLIDE 16

Socket Splice

  • Unix uses the abstraction of files as first-class objects
  • Linux supports sending entire files between file descriptors
  • A descriptor can be a socket
  • Unix also supports network file systems
  • NFS, Samba, Coda, Andrew
  • The socket splice is responsible for handling these abstractions

slide-17
SLIDE 17

Protocols

  • Families have multiple protocols
  • INET: TCP, UDP
  • Protocol functions are stored in proto_ops
  • Some functions are not used in a given protocol, so they point to dummies
  • Some functions are the same across many protocols and can be shared

proto_ops          bind        listen        connect
inet_stream_ops    inet_bind   inet_listen   inet_stream_connect
inet_dgram_ops     inet_bind   NULL          inet_dgram_connect
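The sharing and the dummy entries can be illustrated with a cut-down operations table. The sketch below is a user-space model with hypothetical names (struct ops_sketch, common_bind, no_listen and so on); it uses a dummy function that returns -EOPNOTSUPP instead of a NULL pointer, so callers never have to check the pointer before calling.

```c
#include <assert.h>
#include <errno.h>

/* Cut-down stand-in for proto_ops with only two operations. */
struct ops_sketch {
    int (*bind)(int port);
    int (*listen)(int backlog);
};

/* Shared by both protocols. */
int common_bind(int port)
{
    return port > 0 ? 0 : -EINVAL;
}

/* Only meaningful for the connection-oriented protocol. */
int stream_listen(int backlog)
{
    return backlog >= 0 ? 0 : -EINVAL;
}

/* Dummy for protocols that do not support the operation. */
int no_listen(int backlog)
{
    (void)backlog;
    return -EOPNOTSUPP;
}

const struct ops_sketch stream_ops_sketch = {
    .bind = common_bind, .listen = stream_listen,
};
const struct ops_sketch dgram_ops_sketch = {
    .bind = common_bind, .listen = no_listen,
};
```

Code holding a pointer to either table calls ops->listen(backlog) the same way; the datagram table simply reports that the operation is not supported.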

slide-18
SLIDE 18

Packet Creation

  • In the sending function, the buffer is packetized
  • Packets are represented by the sk_buff data structure
  • It contains pointers to the:
  • Transport-layer header
  • Link-layer header
  • Received timestamp
  • Device the packet was received on
  • Some fields can be NULL

[Figure: tcp_sendmsg wraps the char* buffer in a struct sk_buff; tcp_transmit_skb adds the TCP header; the struct sk_buff is then handed to ip_queue_xmit.]
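One reason a packet is more than a plain char* is that each layer needs to prepend its header without copying the payload. The sketch below models that idea in user space with a fixed buffer and two offsets. The names (struct pkt_sketch, pkt_reserve, pkt_put, pkt_push) are hypothetical and only loosely mirror the kernel's skb_reserve, skb_put and skb_push; the real sk_buff has many more fields.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUF_SIZE 256   /* hypothetical fixed capacity */

struct pkt_sketch {
    unsigned char head[BUF_SIZE];
    size_t data;   /* offset of the first used byte */
    size_t tail;   /* offset one past the last used byte */
};

/* Leave headroom for headers that lower layers will prepend.
 * Only valid while the packet is still empty. */
void pkt_reserve(struct pkt_sketch *p, size_t len)
{
    assert(p->data == p->tail && p->tail + len <= BUF_SIZE);
    p->data += len;
    p->tail += len;
}

/* Append len bytes at the tail; returns where to write them. */
unsigned char *pkt_put(struct pkt_sketch *p, size_t len)
{
    assert(p->tail + len <= BUF_SIZE);
    unsigned char *start = p->head + p->tail;
    p->tail += len;
    return start;
}

/* Prepend len bytes in the headroom; returns the new start of data. */
unsigned char *pkt_push(struct pkt_sketch *p, size_t len)
{
    assert(len <= p->data);
    p->data -= len;
    return p->head + p->data;
}

size_t pkt_len(const struct pkt_sketch *p)
{
    return p->tail - p->data;
}
```

A sender would reserve headroom, memcpy the payload into the pointer returned by pkt_put, and then let each layer call pkt_push for its header; the payload bytes never move.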

slide-19
SLIDE 19

Fragmentation and Routing

  • Fragmentation is performed inside ip_fragment
  • If the packet does not have a route, it is filled in by ip_route_output_flow
  • Three routing mechanisms are used:
  • Route cache
  • Forwarding Information Base (FIB)
  • Slow routing

[Figure: flowchart with yes/no decisions. Nodes shown: ip_route_output_flow, route cache, FIB, slow routing, forward, ip_forward (packet forwarding), ip_fragment, dev_queue_xmit (queue packet).]
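The arithmetic behind fragmentation can be worked through in a few lines. In IPv4, fragment offsets are expressed in units of 8 bytes, so every fragment except the last must carry a payload that is a multiple of 8. The sketch below is a user-space illustration with hypothetical function names; it is not the code of ip_fragment.

```c
#include <assert.h>
#include <stddef.h>

/* Largest payload a non-final fragment can carry: what fits after the
 * IP header, rounded down to a multiple of 8 bytes. */
size_t frag_payload_max(size_t mtu, size_t ip_hdr_len)
{
    assert(mtu >= ip_hdr_len + 8);
    return (mtu - ip_hdr_len) & ~(size_t)7;
}

/* Number of fragments needed to carry payload_len bytes of IP payload. */
size_t frag_count(size_t payload_len, size_t mtu, size_t ip_hdr_len)
{
    assert(mtu >= ip_hdr_len + 8);
    size_t last_max = mtu - ip_hdr_len;   /* last fragment: no rounding */
    if (payload_len <= last_max)
        return 1;                         /* fits in a single packet */
    size_t per_frag = frag_payload_max(mtu, ip_hdr_len);
    /* Non-final fragments needed until the rest fits in the last one. */
    size_t rest = payload_len - last_max;
    return 1 + (rest + per_frag - 1) / per_frag;
}
```

With a 1500-byte MTU and a 20-byte header, each fragment carries up to 1480 bytes, so a 4000-byte payload is split into three fragments of 1480, 1480 and 1040 bytes.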

slide-20
SLIDE 20

Data Link Layer

  • The Data Link Layer is responsible for packet scheduling
  • dev_queue_xmit is responsible for enqueuing packets for transmission in the qdisc of the device
  • Then, in process context, it tries to send the packet
  • If the device is busy, the send is scheduled for a later time
  • dev_hard_start_xmit is responsible for sending to the device

[Figure: dev_queue_xmit(sk_buff) → device qdisc enqueue → device qdisc dequeue → dev_hard_start_xmit()]

slide-21
SLIDE 21

Case Study: iNET

  • iNET is an EDF (Earliest Deadline First) packet scheduler
  • Each packet has a deadline specified in the TOS field
  • We implemented it as a Linux kernel module
  • We implement a packet scheduler at the qdisc level
  • Replace the qdisc enqueue and dequeue functions
  • Enqueued packets are put in a heap sorted by deadline

[Figure: enqueue(sk_buff) → deadline heap → dequeue(sk_buff) → HW]
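The deadline heap can be sketched as a binary min-heap keyed by deadline. The code below is a user-space illustration under assumed names and sizes (struct pkt, struct edf_heap, HEAP_MAX); it is not the iNET source and does not use the kernel qdisc API.

```c
#include <assert.h>
#include <stddef.h>

#define HEAP_MAX 64   /* hypothetical queue capacity */

struct pkt { unsigned deadline; int id; };

struct edf_heap {
    struct pkt items[HEAP_MAX];
    size_t n;
};

static void swap_pkt(struct pkt *a, struct pkt *b)
{
    struct pkt t = *a; *a = *b; *b = t;
}

/* Insert a packet; returns 0, or -1 if the queue is full. */
int edf_enqueue(struct edf_heap *h, struct pkt p)
{
    if (h->n == HEAP_MAX)
        return -1;
    size_t i = h->n++;
    h->items[i] = p;
    /* Sift up while the parent has a later deadline. */
    while (i > 0 && h->items[(i - 1) / 2].deadline > h->items[i].deadline) {
        swap_pkt(&h->items[i], &h->items[(i - 1) / 2]);
        i = (i - 1) / 2;
    }
    return 0;
}

/* Remove the packet with the earliest deadline; 0, or -1 if empty. */
int edf_dequeue(struct edf_heap *h, struct pkt *out)
{
    if (h->n == 0)
        return -1;
    *out = h->items[0];
    h->items[0] = h->items[--h->n];
    /* Sift down: swap with the child that has the earliest deadline. */
    for (size_t i = 0;;) {
        size_t l = 2 * i + 1, r = l + 1, m = i;
        if (l < h->n && h->items[l].deadline < h->items[m].deadline) m = l;
        if (r < h->n && h->items[r].deadline < h->items[m].deadline) m = r;
        if (m == i) break;
        swap_pkt(&h->items[i], &h->items[m]);
        i = m;
    }
    return 0;
}
```

Both operations take O(log n) steps, whereas keeping a sorted list would need O(n) per insertion.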

slide-22
SLIDE 22

High-Performance Network Stacks

  • Minimize copying
  • Zero copy technique
  • Page remapping
  • Use good data structures
  • iNET v0.1 used a list instead of a heap
  • Optimize the common case
  • Branch optimization
  • Avoid process migration or cache misses
  • Avoid dynamic assignment of interrupts to different CPUs
  • Combine operations within the same layer to minimize passes over the data

  • Checksum + data copying
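Combining the checksum with the copy means each byte is read from memory once instead of twice. The sketch below is a plain user-space illustration of that idea using the 16-bit Internet checksum; copy_and_checksum is a hypothetical name, and this is not the kernel's optimized routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy len bytes from src to dst and return the 16-bit Internet
 * checksum (ones' complement of the ones' complement sum of 16-bit
 * big-endian words) of the same bytes, in a single pass. */
uint16_t copy_and_checksum(uint8_t *dst, const uint8_t *src, size_t len)
{
    assert(dst != NULL && src != NULL);
    uint32_t sum = 0;
    size_t i = 0;

    for (; i + 1 < len; i += 2) {
        dst[i] = src[i];
        dst[i + 1] = src[i + 1];
        sum += ((uint32_t)src[i] << 8) | src[i + 1];
        sum = (sum & 0xffff) + (sum >> 16);   /* fold the carry back in */
    }
    if (i < len) {                            /* odd trailing byte */
        dst[i] = src[i];
        sum += (uint32_t)src[i] << 8;
    }
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

For the bytes 00 01 f2 03 f4 f5 f6 f7 the folded sum is 0xddf2, so the function returns 0x220d.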
slide-23
SLIDE 23

High-Performance Network Stacks

  • Cache/Reuse as much as you can
  • Headers, SLAB allocator
  • Hierarchical Design + Information Hiding
  • Data encapsulation
  • Separation of concerns
  • Interrupt Moderation/Mitigation
  • Receive packets in timed intervals only (e.g. ATM)
  • Packet Mitigation
  • Similar but at the packet level
slide-24
SLIDE 24

Conclusion

  • The Linux kernel has 3 main contexts: kernel, process and interrupt
  • Use spinlocks in interrupt context and mutexes if you plan to sleep while holding the lock
  • Implement a module to avoid patching the main kernel tree
  • To defer work, implement two halves: timers + threads
  • Socket families are implemented through pointers to functions (net_proto_family and proto_ops)
  • Packets are represented by the sk_buff structure
  • Packet scheduling is done at the qdisc level in the Link Layer
slide-25
SLIDE 25

References

  • Linux Kernel Map, http://www.makelinux.net/kernel_map
  • A. Chimata, Path of a Packet in the Linux Kernel Stack, University of Kansas, 2005
  • Linux Kernel Cross Reference Source
  • R. Love, Linux Kernel Development, 2nd Edition, Novell Press, 2006
  • H. Nguyen, R. Rivas, iDSRT: Integrated Dynamic Soft Realtime Architecture for Critical Infrastructure Data Delivery over WAN, Qshine 2009
  • M. Hassan and R. Jain, High Performance TCP/IP Networking: Concepts, Issues, and Solutions, Prentice-Hall, 2003
  • K. Ilhwan, Timer-Based Interrupt Mitigation for High Performance Packet Processing, HPC, 2001
  • Anand V., TCPIP Network Stack Performance in Linux Kernel 2.4 and 2.5, IBM Linux Technology Center