Walking Through Walls
tmunro@freebsd.org tmunro@postgresql.org thomas.munro@enterprisedb.com
About me
New to FreeBSD hacking
Mentors: mjg, allanjude
~20 years work on proprietary C, C++, …
gateway drug of ZFS home storage boxes, now my main development and server environment
best relational database stack
instead of SQL), developed on PDPs just as Unix arrived at Berkeley
needed AT&T licence for parts)
4.2BSD, added SMP support for Sequent computers) and (probably) various other flavours of BSD
UX, Solaris, Windows; in the past we supported IRIX, Tru64, UnixWare, BSD/OS, BeOS, QNX, SunOS, SCO OpenServer
Latter-day PostgreSQL hackers on a pilgrimage to Berkeley; Michael Stonebraker
standards chiselled in stone
tools, tunables, monitoring tools
interesting if you can actually influence the
do own work scheduling over N threads (tuned for CPU topology)?
buffered, what amount of user space IO scheduling (read ahead, write behind, write coalescing etc)?
switched to direct IO around the time of the following messages from Linux leadership (indicating that this was highly contentious)
approximately the last RDBMS still using buffered IO (though others can as an option)
postmaster checkpointer
…
user backend
Buffer pool, WAL buffers, process tracking, array of sem_t, … — an anonymous shared mmap region, inherited by all processes (before PG 9.3 this was just one big SysV segment); a vestigial tiny SysV shm segment remains
pgdata/postgres/mycluster
├─ base
│  ├─ 16384
│  └─ 16385
│     ├─ 12345
│     └─ 12346
└─ pg_wal
   ├─ 000000010000000000000003
   └─ 000000010000000000000004
parallel worker
Relation files
Parallel query segment
“dynamic” shared memory segments created/destroyed as needed (POSIX shm_open())
socket
File descriptor pool
13316 └─ postgres -D /data/clusters/main
13441    ├─ postgres: fred salesdb [local] idle
13437    ├─ postgres: fred salesdb [local] idle
13337    ├─ postgres: fred salesdb [local] SELECT
13323    ├─ postgres: logical replication launcher
13322    ├─ postgres: stats collector
13321    ├─ postgres: autovacuum launcher
13320    ├─ postgres: walwriter
13319    ├─ postgres: background writer
13318    └─ postgres: checkpointer
"Currently, POSTGRES runs as one process for each active user. This was done as an expedient to get a system operational as quickly as possible. We plan on converting POSTGRES to use lightweight processes available in the operating systems we are using. These include PRESTO for the Sequent Symmetry and threads in Version 4 of Sun/OS." Stonebraker, Rowe and Hirohama, “The Implementation of POSTGRES”, 1989
Idle backend process:
poll({ 9/POLLIN 10/POLLIN 3/POLLIN },3,-1) = 1 (0x1)
Processing a simple read-only query with and without hot cache:
recvfrom(9,"B\0\0\0\^[\0P0_1\0\0\0\0\^A\0\0"...,8192,0,NULL,0x0) = 50 (0x32)
sendto(9,"2\0\0\0\^DT\0\0\0!\0\^Aabalance"...,71,0,NULL,0) = 71 (0x47)
recvfrom(9,"B\0\0\0\^[\0P0_1\0\0\0\0\^A\0\0"...,8192,0,NULL,0x0) = 50 (0x32)
pread(14,"\0\0\0\0000D?\^B\0\0\^D\0\f\^A"...,8192,0x1bc000) = 8192 (0x2000)
sendto(9,"2\0\0\0\^DT\0\0\0!\0\^Aabalance"...,71,0,NULL,0) = 71 (0x47)
Writing to the WAL when we COMMIT a transaction:
pwrite(30,"\M^X\M-P\^D\0\^A\0\0\0\0`\M-l\n"...,16384,0xec6000) = 16384 (0x4000)
fdatasync(0x1e) = 0 (0x0)
The checkpointer process writing back dirty data durably:
pwrite(17,"\0\0\0\0x\M^?D\f\0\0\0\0\M-P\^C"...,8192,0x2c000) = 8192 (0x2000)
pwrite(17,"\0\0\0\0\bOD\f\0\0\0\0\M-@\^C\0"...,8192,0x4e000) = 8192 (0x2000)
pwrite(17,"\0\0\0\08\^\D\f\0\0\0\0\^P\^D \b"...,8192,0x5a000) = 8192 (0x2000)
...
fsync(0x13) = 0 (0x0)
fsync(0xf) = 0 (0x0)
fsync(0xe) = 0 (0x0)
fsync(0xd) = 0 (0x0)
…
memory region and switch to an inherited anonymous shared mmap region
was being used on large machines (avoiding the creation of pv entries that performed poorly at scale), but we don’t have a similar mode for anonymous memory
contention problems; is the problem completely fixed?
back to System V (mainly for AIX), which will allow this option to be used again
that way on a 40 vCPU m4.10xlarge system (but I don’t have high confidence in the results; more testing required)
WAL, just flush the data. 1 random IO instead of 2?
for a starter patch (filesystems are scary), though I had something kinda working…
guns had beaten me to it
Fixed in FreeBSD 11
13316 └─ postgres -D /data/clusters/main
13441    ├─ postgres: fred salesdb [local] idle
13437    ├─ postgres: fred salesdb [local] UPDATE
13337    ├─ postgres: fred salesdb [local] SELECT
13323    ├─ postgres: logical replication launcher
13322    ├─ postgres: stats collector
13321    ├─ postgres: autovacuum launcher
13320    ├─ postgres: walwriter
13319    ├─ postgres: background writer
13318    └─ postgres: checkpointer
PostgreSQL updates the process title 2+ times per query
setproctitle(3) makes two system calls
setproctitle_fast(3): no more syscalls! It simply writes to a buffer in user-space memory
TPS on 40-core pgbench
Fixed in FreeBSD 12 + PostgreSQL 12
PostgreSQL 9.6 running trivial query:
recvfrom(9,"…"...,8192,0,NULL,0x0)
getpid()
__sysctl(0x7fffffffde80,0x4,0x0,0x0,0x801a0f000,0x28)
sendto(8,"\…"...,152,0,NULL,0)
getpid()
__sysctl(0x7fffffffdfd0,0x4,0x0,0x0,0x801a0f000,0x26)
sendto(9,"…"...,63,0,NULL,0)
PostgreSQL 12 running trivial query:
recvfrom(9,"…"...,8192,0,NULL,0x0)
sendto(9,"…"...,71,0,NULL,0)
immediately if the ‘postmaster’
but testing that is inconvenient and expensive during busy work loops
prctl(PR_SET_PDEATHSIG), stolen from IRIX, to request a signal when your parent dies; PostgreSQL 12 now uses that
procctl(PROC_PDEATHSIG_CTL)
measurably faster*
13316 └─ postgres -D /data/clusters/main
13441    ├─ postgres: fred salesdb [local] idle
13437    ├─ postgres: fred salesdb [local] UPDATE
13337    ├─ postgres: fred salesdb [local] SELECT
13323    ├─ postgres: logical replication launcher
13322    ├─ postgres: stats collector
13321    ├─ postgres: autovacuum launcher
13320    ├─ postgres: walwriter
13319    ├─ postgres: background writer
13318    └─ postgres: checkpointer
Fixed in FreeBSD 12
separate jails would interfere with each other, because System V shared memory was not jailed (there was one single key namespace for all jails on the same host); this required using different network ports or UIDs for PostgreSQL instances in different jails! (And probably worse things.)
Fixed in FreeBSD 11
select(2)
Linux, reusing an fd set to fix contention problems on large multi-socket machines
high concurrency pgbench tests, but lower on some others!
apparently related to timing-sensitive wakeup and scheduling logic in the kernel
to influence write-back rates, instead of (or in preparation for) the big hammer of fdatasync()
ZFS-on-Linux support it?)
Idea for FreeBSD 13?
semantics of fsync() on a very popular operating system
your buffered data on the floor even though it was still dirty; subsequent fsync() calls may therefore report success but your data is gone
probably everyone else who spat out their coffee while reading LWN) now PANIC on any fsync() failure, rather than retrying
have this problem since 1999; dirty data is dirty data, you can’t drop it unless the device goes away, so future fsync() calls will also fail (or perhaps truly succeed)
strcoll(). Almost everybody wants to use Unicode. The FreeBSD PostgreSQL port carried a patch to use ICU instead for collations.
code with Illumos and DragonflyBSD.
database).
Fixed in FreeBSD 11
strcoll_l(3), and come from upstream sources like the Unicode CLDR project
the indexes are corrupted … this really happens!
for the version of the collation definition with querylocale(3) so we know when we need to rebuild indexes
Idea for FreeBSD 13?
time, malloc->expand->free. Could we… not do that?
terminator! Non-standard strncoll_l()? (note “n”, was rejected from C99, semantics unclear)
we turned it off because popular implementations were busted (didn’t always match strcoll_l() order, causing corruption); can we make a 100% reliable strxfrm_l()? strnxfrm_l() (note “n”)?
OOM killer strikes (like AIX)
memory or exit voluntarily
exist on iOS, and Linux user-space OOM tools
installed simultaneously (using paths like /usr/local/libexec/postgresql/ 12/bin/postgres)
nicest way to manage starting/stopping/listing multiple PostgreSQL server instances on the same machine, possibly of different major versions
pg_lsclusters pg_createcluster <version> <name> …
to add special ZFS based commands (snapshot, clone, send/receive database clusters) and jails support
segments that exist on the system! There is no ipcs or similar to see them. Solving this possibly involves changing the way shm_open() jailing works first.
the name of the mapped segment. It should!
table with a big lock around it; perhaps that could be improved.
aborts); it also contains a bunch of patches that are not upstreamed
starts with a file descriptor for its root directory
new interlocking trick probably needed, based on a file under pgdata; flock() not available) and POSIX shm (would need to switch to file-backed mmap files for parallel query?)
image” the first time you dirty each data page, to defend against torn pages
power failure and restart, you either have the old 8KB page or the new 8KB page contents, but not some kind of frankenpage)
you can set full_page_writes = off to avoid a ton of extra WAL IO on some workloads (a form of “write amplification” that people complain about on PostgreSQL)
underlying device can make a guarantee about power-failure atomic write size) then it’d be nice if it could report that to us somehow
be faster (on non-COW filesystems), was slower than simply creating new files every time on ZFS, for them
you have to fault in some random access disk pages (even though the actual data pages don’t need to be faulted in, because we entirely write
recycling; the whole concept doesn’t really make sense for COW…
PostgreSQL’s buffers, performance is best
in PostgreSQL’s buffers, but does fit in OS page cache, performance is nearly as good
doom where it doesn’t quite fit in either, even though you have enough physical RAM
became available and reliable on operating systems
be a bit tricky, but probably worth the pain
data to be returned to the kernel page cache (or ARC), so you could have exclusive buffering and fix the “valley of doom”
file
are really sequential, but UFS doesn’t detect that
then why stop there, you could have more streams like ZFS does….)
[Diagram: runs of sequential 8kb blocks coalesced into 32kb (or 128kb, or …) reads]
8kb pages into the PostgreSQL buffer pool
ahead heuristics (hopefully) detects this pattern and ideally begins issuing larger reads to the disk to pre-load OS page cache pages
better with direct IO and our own IO scheduler
[Diagram: stream of 4kb pages]
it expects eager reservation (close() and fsync() are not great times to discover ENOSPC, after pwrite() has already succeeded).
clear how a humble userspace program can get the right guarantees (on various operating systems)