Everything’s a File Descriptor (Josh Triplett, josh@joshtriplett.org)

slide-1
SLIDE 1

Everything’s a File Descriptor

Josh Triplett josh@joshtriplett.org Linux Plumbers Conference 2015

slide-2
SLIDE 2

“Everything’s a file”

slide-10
SLIDE 10

◮ /home/josh/doc/presentations/lpc-2015/fd/fd.pdf
◮ /etc/hostname
◮ /dev/null
◮ /dev/zero
◮ /dev/ttyS0
◮ /dev/dri/card0
◮ /dev/cpu/0/cpuid
◮ /tmp/.X11-unix/X0
◮ /proc/1/environ
◮ /proc/cmdline
◮ /sys/class/block/sda/queue/rotational
◮ /sys/firmware/acpi/tables/DSDT

slide-11
SLIDE 11

Everything has a filename?

slide-13
SLIDE 13

Everything has a filename? [struck through]

slide-14
SLIDE 14

◮ Pipes
◮ Sockets
◮ epoll
◮ memfd
◮ KVM virtual machines and CPUs
◮ ...
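As a rough illustration of the list above, the following sketch checks whether a descriptor refers to something that has no path in the filesystem. It assumes Linux with procfs mounted and looks at the `/proc/self/fd/N` symlink; the helper names `fd_link_target` and `fd_has_no_filename` are invented for this example.

```c
#include <stdio.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: copy the target of /proc/self/fd/<fd> into buf.
 * Returns 0 on success, -1 on error. Assumes procfs is mounted. */
static int fd_link_target(int fd, char *buf, size_t len)
{
    char path[64];
    ssize_t n;

    snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
    n = readlink(path, buf, len - 1);
    if (n < 0)
        return -1;
    buf[n] = '\0';
    return 0;
}

/* Hypothetical helper: returns 1 if the link target is not an absolute
 * path (for example "pipe:[...]" or "socket:[...]"), 0 if it is a path,
 * -1 on error. */
static int fd_has_no_filename(int fd)
{
    char target[256];

    if (fd_link_target(fd, target, sizeof(target)) < 0)
        return -1;
    return target[0] != '/';
}
```

Descriptors from `pipe`, `socketpair`, and `epoll_create1` should all report 1 here, while a descriptor opened from a path such as `/dev/null` should report 0.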

slide-15
SLIDE 15

Everything’s a file descriptor

slide-17
SLIDE 17

◮ What is a file descriptor, really?
◮ What can you do with a file descriptor?
◮ What interesting file descriptors exist?
◮ How do you build a new type of file descriptor?
◮ What interesting file descriptors don’t exist yet?

slide-18
SLIDE 18

What is a file descriptor, really?

slide-19
SLIDE 19

◮ struct fd, struct fdtable
◮ struct file

slide-24
SLIDE 24

struct fd versus struct file

testfile contains “0123456789”

x = open("testfile", O_RDONLY);
xdup = dup(x);
y = open("testfile", O_RDONLY);
read(x, &c, 1); putchar(c);    /* Prints '0' */
read(xdup, &c, 1); putchar(c); /* Prints '1' */
read(y, &c, 1); putchar(c);    /* Prints '0' */
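A self-contained version of the fragment above might look like the following sketch. The helper names `make_testfile` and `read_via_three_fds` and the use of a temporary file are assumptions made for this example; the slide only shows the six calls.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: create a temporary file holding "0123456789".
 * path_template must end in "XXXXXX" (see mkstemp). Returns 0 or -1. */
static int make_testfile(char *path_template)
{
    int fd = mkstemp(path_template);

    if (fd < 0)
        return -1;
    if (write(fd, "0123456789", 10) != 10) {
        close(fd);
        return -1;
    }
    close(fd);
    return 0;
}

/* Hypothetical helper: read one byte each through x, xdup and y, in that
 * order, into out[0..2]. Returns 0 on success, -1 on error. */
static int read_via_three_fds(const char *path, char out[3])
{
    int x = open(path, O_RDONLY);
    int xdup = (x >= 0) ? dup(x) : -1;  /* shares x's struct file and offset */
    int y = open(path, O_RDONLY);       /* separate struct file, own offset */
    int ok = x >= 0 && xdup >= 0 && y >= 0 &&
             read(x, &out[0], 1) == 1 &&
             read(xdup, &out[1], 1) == 1 &&
             read(y, &out[2], 1) == 1;

    if (x >= 0)
        close(x);
    if (xdup >= 0)
        close(xdup);
    if (y >= 0)
        close(y);
    return ok ? 0 : -1;
}
```

With the file contents from the slide, `out` ends up holding '0', '1', '0': the duplicated descriptor continues from the shared offset, while the second `open` starts at the beginning.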

slide-34
SLIDE 34

[Diagram: three layers, top to bottom]
struct fd (userspace int): x, xdup, y
struct file (kernel object): one shared by x and xdup (f_pos: 2, f_count: 2), one for y (f_pos: 1, f_count: 1)
testfile (driver-specific): 1 2 3 . . .

slide-35
SLIDE 35

File descriptor: Userspace reference to kernel object

slide-36
SLIDE 36

What can you do with a file descriptor?

slide-50
SLIDE 50

◮ read, write
◮ seek
◮ preadv, pwritev
◮ stat
◮ Blocking or non-blocking
◮ poll, select, epoll
◮ dup, dup2
◮ Send over a UNIX socket via SCM_RIGHTS
◮ Inherited over exec
◮ mmap
◮ sendfile, splice, tee
◮ openat
◮ ...
◮ ioctl
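Of the operations listed above, passing a descriptor over a UNIX socket is probably the least familiar. The sketch below shows one way to do it with `sendmsg`/`recvmsg` and an `SCM_RIGHTS` control message; the function names `send_fd` and `recv_fd` are made up for this example, and error handling is kept minimal.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send fd_to_send over the UNIX socket `sock` as SCM_RIGHTS ancillary
 * data. Returns 0 on success, -1 on error. */
static int send_fd(int sock, int fd_to_send)
{
    char dummy = 'x';   /* one byte of ordinary data accompanies the fd */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;           /* ensures suitable alignment */
    } u;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from `sock`. Returns the new fd, or -1. */
static int recv_fd(int sock)
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } u;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS)
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The receiver gets a new descriptor number that refers to the same kernel object as the sender's, much like `dup` across processes.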

slide-51
SLIDE 51

Use file descriptors!

slide-52
SLIDE 52

What interesting file descriptors exist?

slide-57
SLIDE 57

eventfd

◮ 64-bit counter used as an event queue
◮ write: Add value to counter
◮ read: Block until non-zero; read value and reset to 0
  ◮ “Semaphore mode”: Read 1 and decrement by 1
◮ poll: Ready for reading if non-zero
◮ Several drivers use eventfd to signal events between kernel and userspace
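The counter semantics above can be tried from userspace with a sketch like this one. The wrapper names `efd_add` and `efd_take` are invented for the example; `eventfd`, `EFD_NONBLOCK`, and `EFD_SEMAPHORE` are the real interface.

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical wrapper: add `value` to the eventfd counter.
 * Returns 0 on success, -1 on error. */
static int efd_add(int efd, uint64_t value)
{
    ssize_t n = write(efd, &value, sizeof(value));

    return n == (ssize_t)sizeof(value) ? 0 : -1;
}

/* Hypothetical wrapper: read from the eventfd. Returns the value read,
 * or 0 if the read failed (for a non-blocking eventfd, that includes
 * the counter currently being zero). */
static uint64_t efd_take(int efd)
{
    uint64_t value = 0;

    if (read(efd, &value, sizeof(value)) != (ssize_t)sizeof(value))
        return 0;
    return value;
}
```

For a descriptor from `eventfd(0, EFD_NONBLOCK)`, adding 3 and then 4 makes the next `efd_take` return 7 and reset the counter. With `EFD_SEMAPHORE` added to the flags, each `efd_take` returns 1 and decrements the counter by 1 instead.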

slide-58
SLIDE 58

timerfd

◮ Allows handling timers as file descriptors
◮ Throw them in the poll loop with everything else
◮ Create with specified timeout
◮ read: Block until timeout; return number of times expired
◮ poll: Ready for reading if timeout passed

slide-69
SLIDE 69

Signals

◮ Receive asynchronous events in a process
◮ Suspend execution, save registers, move execution to handler
◮ Restore registers and resume execution when handler done
◮ Assume a userspace stack to push and pop state
◮ sigaltstack sets an alternate stack to switch to
◮ Set up stack to return into call to sigreturn for cleanup
◮ Can receive signals while in a kernel syscall
◮ Some syscalls restart afterward
◮ Syscalls with timeouts adjust them (restart_syscall)
◮ Other syscalls return EINTR
◮ Can mask signals to avoid interruption
◮ Special syscalls that also set signal mask (ppoll, pselect, KVM_SET_SIGNAL_MASK ioctl)
◮ “async-signal-safe” library functions

slide-70
SLIDE 70

Signed-off-by: <( ; , ; )@r’lyeh>

slide-72
SLIDE 72

signalfd

◮ File descriptor to receive a given set of signals
◮ Block “normal” signal delivery; receive via signalfd instead
◮ read: Block until signal, return struct signalfd_siginfo
◮ poll: Readable when signal received

slide-73
SLIDE 73

How do you build a new type of file descriptor?

slide-79
SLIDE 79

Semantics

◮ read and write
  ◮ Nothing
  ◮ Raw data
  ◮ Specific data structure
◮ poll/select/epoll
  ◮ Must match read/write blocking behavior if any
  ◮ Can have pollable fd even if read/write do nothing
◮ seek and file position
◮ mmap
◮ What happens with multiple processes, or dup?
◮ For everything else: ioctl
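The requirement that poll match read/write blocking behavior can be seen from the userspace side with an existing descriptor type. This sketch uses a pipe as a stand-in for a new fd type; the helper names are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if poll() reports fd readable right
 * now, 0 if not, -1 on error. */
static int is_readable_now(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int n = poll(&pfd, 1, 0);           /* timeout 0: do not wait */

    if (n < 0)
        return -1;
    return (n == 1 && (pfd.revents & POLLIN)) ? 1 : 0;
}

/* Hypothetical helper: switch fd to non-blocking mode. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Hypothetical helper: returns 1 if a one-byte read fails with EAGAIN,
 * that is, a blocking read would have waited. Consumes a byte if one
 * is available. */
static int read_would_block(int fd)
{
    char c;
    ssize_t n = read(fd, &c, 1);

    return n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK);
}
```

On an empty non-blocking pipe, `is_readable_now` returns 0 and `read_would_block` returns 1; after one byte is written, both answers flip. A new descriptor type would be expected to keep those two views consistent in the same way.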

slide-83
SLIDE 83

Implementation

◮ anon_inode_getfd
  ◮ Doesn’t need a backing inode or filesystem
  ◮ Provide an ops structure and private data pointer
  ◮ Private data points to your kernel object
◮ simple_read_from_buffer, simple_write_to_buffer
◮ no_llseek, fixed_size_llseek
◮ Check file->f_flags & O_NONBLOCK
  ◮ Blocking: wait_queue_head
  ◮ Non-blocking: return -EAGAIN

slide-84
SLIDE 84

What interesting file descriptors don’t exist yet?

slide-85
SLIDE 85

Child processes

slide-94
SLIDE 94

◮ fork/clone
◮ Parent process gets the child PID
◮ Parent uses dedicated syscalls (waitpid) to wait for child exit
◮ When child exits, parent gets SIGCHLD signal
◮ Parent makes waitpid call to get exit status

Problems:

◮ Waiting not integrated with poll loops
◮ Signals
◮ Process-global; libraries can’t manage only their own processes

slide-96
SLIDE 96

Alternatives

◮ Set SIGCHLD handler, write to pipe or eventfd
  ◮ Still process-global; gets all child exit notifications
  ◮ Requires coordinating global signal handling between libraries
  ◮ Signals
◮ signalfd for SIGCHLD
  ◮ Still process-global; gets all child exit notifications
  ◮ Requires coordinating global signal handling between libraries
  ◮ Must block SIGCHLD; breaks code expecting SIGCHLD

slide-102
SLIDE 102

clonefd

◮ New flag for clone
◮ Return a file descriptor for the child process
◮ read: block until child exits, return exit information
◮ poll: becomes readable when child exits
◮ Maintains a reference to the child’s task_struct
◮ Relatively simple, except...

slide-111
SLIDE 111

Complications

Need a new clone system call for the fd out parameter

clone syscall parameters vary by architecture
◮ Avoided in the new syscall

clone is out of parameters (6) on some architectures
◮ Pass parameters via a struct and size

Low-level copy_thread function grabbed tls parameter directly from syscall register arguments; couldn’t move it
◮ Pass parameter normally via C, fix assembly syscall entry
◮ Fixed with copy_thread_tls (merged in 4.2)

ptrace and reparenting
◮ Work in progress

slide-112
SLIDE 112

History and status

◮ Thiago Macieira originally proposed forkfd to simplify Qt
◮ Josh and Thiago started on clonefd earlier this year
◮ Some infrastructure merged into 4.2
◮ Syscall aimed for future kernel after resolving ptrace issues

slide-113
SLIDE 113

File descriptor: Userspace reference to kernel object

slide-114
SLIDE 114

What else can we do with a reference to task_struct?

slide-118
SLIDE 118

Process IDs

◮ Small integers used to reference processes
◮ Used pervasively in process syscalls
◮ Enumerated as directories in /proc
◮ Unique within root container
◮ Container PID namespaces map a subset of these
◮ PIDs do not hold a reference; can be reused
◮ Race condition if used from non-parent process

slide-123
SLIDE 123

clonefd as process identifier

◮ Unique across the entire system
◮ Holds a reference to the process
◮ Race-free
◮ Can pass via exec, UNIX sockets
◮ Allows non-parent processes to obtain exit information

slide-124
SLIDE 124

Next steps

◮ Merge clonefd
◮ For each PID syscall, add an fd variant
◮ Add ioctls to obtain process information
◮ Add process enumeration (next, child, root)

slide-126
SLIDE 126

Other future file descriptors

Warning: wild speculation and conjecture ahead

slide-131
SLIDE 131

User and group IDs

◮ Suppose users and groups were unique kernel objects?
◮ Unique across container user namespaces
◮ “Get unused user/group”
◮ Set up arbitrary mappings when mounting a filesystem
◮ Allow a process to hold multiple credentials (like setgroups)

slide-135
SLIDE 135

Filesystem mounts

◮ Suppose mount returned a directory file descriptor
◮ openat relative to the filesystem
◮ Separate call to bind into the filesystem namespace
◮ Bind existing dirfd for bind mounts

slide-140
SLIDE 140

Summary

◮ File descriptor: Userspace reference to kernel object
◮ Reference-counted, race-free, unambiguous ID
◮ Well-defined semantics
◮ Extensive operations
◮ poll and blocking
◮ Use file descriptors in new APIs
◮ Don’t invent new identifier namespaces