Walking Through Walls PostgreSQL FreeBSD tmunro@freebsd.org - PowerPoint PPT Presentation

Walking Through Walls PostgreSQL ♥ FreeBSD tmunro@freebsd.org   tmunro@postgresql.org   thomas.munro@enterprisedb.com  

About me • New to FreeBSD hacking   Mentors: mjg, allanjude • ~20 years work on proprietary C, C++, … applications on lots of di ff erent kinds of Unix • Past ~4 years working on PostgreSQL at EnterpriseDB • Past ~3 years dabbling in FreeBSD, beginning with the gateway drug of ZFS home storage boxes, now my main development and server environment • Personal goal: make FreeBSD and PostgreSQL the best relational database stack

Berkeley • INGRES: Developed at UC Berkeley, 197x-1985 • Relational database ideas inspired by IBM’s System/R (though using QUEL instead of SQL), developed on PDPs just as Unix arrived at Berkeley • First software released entirely under BSD licence (CSRG distribution still needed AT&T licence for parts) Michael Stonebraker • POSTGRES: Developed at UC Berkeley, 1986-1994 • Entirely new system (but still using INGRES’s QUEL query language) • Developed on SunOS (derived from 4.3BSD) and Dynix (derived from 4.2BSD, added SMP support for Sequent computers) and (probably) various other flavours of BSD • PostgreSQL: Modern open source project, 1996- • We current claim to support Linux, {Open,Net,Free}BSD, macOS, AIX, HP/ UX, Solaris, Windows; in the past we supported IRIX, Tru64, UnixWare, Latter day PostgreSQL BSD/OS, BeOS, QNX, SunOS, SCO OpenServer hackers on a pilgrimage to Berkeley

How operating systems look to database hackers

• APIs, man pages, standards chiselled in stone • Administration tools, tunables, monitoring tools • Things get more interesting if you can actually influence the operating system!

Database hacker dilemmas • Use thread/process per sessions and rely on kernel scheduling, or do own work scheduling over N threads (tuned for CPU topology)? • Use OS page cache (as well as own cache!), or do direct IO? If bu ff ered, what amount of user space IO scheduling (read ahead, write behind, write coalescing etc)? • Use OS-supplied collation rules for text? • Use our own userspace locking and IPC primitives? • … more questions like this • General theme: use OS facilities or do it ourselves?

• DB2 and Oracle switched to direct IO around the time of the following messages from Linux leadership (indicating that this was highly contentious) • PostgreSQL is approximately the last RDBMS still using bu ff ered IO (though others can as an option)

How PostgreSQL looks to operating systems

} vestigial tiny sem_t, … } sysv postmaster shm segment mmap checkpointer Bu ff er pool, WAL anonymous bu ff ers, process shared, tracking, array of inherited by … all processes (before PG9.3, was just one big   user backend sysv segment) socket “dynamic” } shared parallel worker memory Parallel query segments segment created/destroyed as needed (POSIX shm_open()) File   pgdata/postgres/mycluster descriptor ├─ base pool } │ ├─ 16384 │ └─ 16385   Relation files │ ├─ 12345 │ └─ 12346 } Write ahead log └─ pg_wal ├─ 000000010000000000000003 └─ 000000010000000000000004

Processes 13316 └─ postgres -D /data/clusters/main 13441 ├─ postgres: fred salesdb [local] idle 13437 ├─ postgres: fred salesdb [local] idle 13337 ├─ postgres: fred salesdb [local] SELECT 13323 ├─ postgres: logical replication launcher   13322 ├─ postgres: stats collector 13321 ├─ postgres: autovacuum launcher 13320 ├─ postgres: walwriter 13319 ├─ postgres: background writer 13318 └─ postgres: checkpointer "Currently, POSTGRES runs as one process for each active user. This was done as an expedient to get a system operational as quickly as possible. We plan on converting POSTGRES to use lightweight processes available in the operating systems we are using. These include PRESTO for the Sequent Symmetry and threads in Version 4 of Sun/OS." Stonebraker, Rowe and Herohama, “The Implementation of POSTGRES”, 1989

              System calls Idle backend process:   poll({ 9/POLLIN 10/POLLIN 3/POLLIN },3,-1) = 1 (0x1) Processing a simple read-only query with and without hot cache:   recvfrom(9,"B\0\0\0\^[\0P0_1\0\0\0\0\Â\0\0"...,8192,0,NULL,0x0) = 50 (0x32) sendto(9,"2\0\0\0\^DT\0\0\0!\0\Âabalance"...,71,0,NULL,0) = 71 (0x47) recvfrom(9,"B\0\0\0\^[\0P0_1\0\0\0\0\Â\0\0"...,8192,0,NULL,0x0) = 50 (0x32) pread(14,"\0\0\0\0000D?\^B\0\0\^D\0\f\Â"...,8192,0x1bc000) = 8192 (0x2000) sendto(9,"2\0\0\0\^DT\0\0\0!\0\Âabalance"...,71,0,NULL,0) = 71 (0x47) Writing to the WAL when we COMMIT a transaction:   pwrite(30,"\M^X\M-P\^D\0\Â\0\0\0\0`\M-l\n"...,16384,0xec6000) = 16384 (0x4000) fdatasync(0x1e) = 0 (0x0) The checkpointer process writing back dirty data durably:   openat(AT_FDCWD,"base/13002/2674",O_RDWR,00) = 17 (0x11) pwrite(17,"\0\0\0\0x\M^?D\f\0\0\0\0\M-P\^C"...,8192,0x2c000) = 8192 (0x2000) pwrite(17,"\0\0\0\0\bOD\f\0\0\0\0\M-@\^C\0"...,8192,0x4e000) = 8192 (0x2000) pwrite(17,"\0\0\0\08\^\D\f\0\0\0\0\^P\^D \b"...,8192,0x5a000) = 8192 (0x2000) ... fsync(0x13) = 0 (0x0) fsync(0xf) = 0 (0x0) fsync(0xe) = 0 (0x0) fsync(0xd) = 0 (0x0) …

MAP_SHARED | MAP_ANONYMOUS • In PostgreSQL 9.3 we stopped using a big System V shared memory region and switch to an inherited anonymous shared mmap region • Performance tanked on large many-cored FreeBSD systems • The explanation was mostly that kern.ipc.shm_use_phys=1 was being used on large machines (avoiding the creation of pv entries that performed poorly at scale), but we don’t have a similar mode for anonymous memory • Many improvements were made since then to address the contention problems; is the problem completely fixed?

• We are planning to add a shared_memory_type=sysv option so that we can go back to System V (mainly for AIX), which will allow this option to be used again • Quick testing seemed to indicate that there is still some speed-up reachable that way on a 40 vCPU m4.x10large system (but I don’t have high confidence in the results, more testing required)

fdatasync() • Don’t flush file meta-data when flushing data blocks in the WAL, just flush the data. 1 random IO instead of 2? • I tried to work on this myself… probably a bit too tricky for a starter patch (filesystems are scary), though I had something kinda working… • I updated my source tree one day and *blam*, the big guns had beaten me to it 1 1 D S B e e r F n i e n o D

  setproctitle_fast(3) 13316 └─ postgres -D /data/clusters/main • PostgreSQL updates the 13441 ├─ postgres: fred salesdb [local] idle process title 2+ times per 13437 ├─ postgres: fred salesdb [local] UPDATE 13337 ├─ postgres: fred salesdb [local] SELECT query 13323 ├─ postgres: logical replication launcher   13322 ├─ postgres: stats collector 13321 ├─ postgres: autovacuum launcher • Linux and other BSDs: 13320 ├─ postgres: walwriter 13319 ├─ postgres: background writer simply write to a bu ff er in 13318 └─ postgres: checkpointer user-space memory • FreeBSD: setproctitle(3) PostgreSQL 9.6 running trivial query:   makes two system calls recvfrom(9,”…”...,8192,0,NULL,0x0) getpid() __sysctl(0x7fffffffde80,0x4,0x0,0x0,0x801a0f000,0x28) • New in FreeBSD 12: sendto(8,”\…”...,152,0,NULL,0) getpid() setproctitle_fast(3): no __sysctl(0x7fffffffdfd0,0x4,0x0,0x0,0x801a0f000,0x26) more syscalls! sendto(9,”…”...,63,0,NULL,0) 2 1 PostgreSQL 12 running trivial query: D 2 • Result: ~10% increase in S 1 B L e recvfrom(9,”…”...,8192,0,NULL,0x0) Q e TPS on 40-core pgbench r S F sendto(9,”….”…,71,0,NULL,0) e n r -S g i e t s n o o P D +

PROC_PDEATHSIG_CTL 13316 └─ postgres -D /data/clusters/main • We want child processes to exit 13441 ├─ postgres: fred salesdb [local] idle 13437 ├─ postgres: fred salesdb [local] UPDATE immediately if the ‘postmaster’ 13337 ├─ postgres: fred salesdb [local] SELECT 13323 ├─ postgres: logical replication launcher   dies. Every process holds a pipe, 13322 ├─ postgres: stats collector but testing that is inconvenient 13321 ├─ postgres: autovacuum launcher 13320 ├─ postgres: walwriter and expensive during busy work 13319 ├─ postgres: background writer loops 13318 └─ postgres: checkpointer • Linux has prctl(PR_SET_PDEATHSIG), stolen from IRIX, to request a signal when your parent dies; PostgreSQL 12 now uses that • New in FreeBSD 11.2: procctrl(PROC_PDEATHSIG_CTL) 2 1 D S B • Result: replication/recovery is e e r F measurably faster* n i e n o D

System V shared memory in jails • Previously, multiple copies of PostgreSQL running in separate jails would interfere with each other, because System V shared memory was not jailed (there was one single key namespace for all jails on the same host); this required using di ff erent network ports or UIDs for PostgreSQL instances in di ff erent jails! (And probably worse things.) • This was fixed in FreeBSD 11. 1 1 D S B e e r F n i d e x i F

Walking Through Walls PostgreSQL FreeBSD tmunro@freebsd.org - PowerPoint PPT Presentation

Walking Through Walls PostgreSQL FreeBSD tmunro@freebsd.org tmunro@postgresql.org thomas.munro@enterprisedb.com About me New to FreeBSD hacking Mentors: mjg, allanjude ~20 years work on proprietary C, C++,

PORTFOLIO OUR MANIFEST Walls. We have them all around. Walls - in front of or within us. There

Providing for walking Identifying how we can encourage a better walking environment Walking in

Leaders Guide Benefits of walking and biking Barriers to walking and biking

Walking and Work Reflections on the practice in light of the research Walking and Work Look at

Water Walking Faith Peter Walking on Water Matthew 14:25-32, NIV During the fourth watch of the

Tearing Down Walls Tearing Down Walls & & Building Bridges Building Bridges

Walking, the dilaton, and complex CFT (II) Walking, the dilaton, and complex CFT (II) Chik Him

Age-Friendly Walking in Small and Rural Towns April 18, 2018 What is Age-Friendly Walking? What

Were Walking and Biking!! What are the rules for walking? Be aware of your surroundings

Walking Bass & Jazz Founda'ons Guide The Easy To Understand Guide To Crea'ng Walking Bass

The Walking for Health Sheffield AGM Charlene Simon- Walking for Health Development Manager

Architectural walls OVERVIEW BENEFITS SOLUTIONS EXPERIENCE HAWORTH OVERVIEW When it comes to

NVIDIA Quadro and NVS Video Walls NVIDIA Quadro and NVS Video Walls Using NVIDIA technology to

GIAnT: Visualizing Group Interaction at Large Display Walls Ulrich von Zadow , Raimund Dachselt

Severe Weather Walls/Roofs Walls/Roofs SPFA Conference March 16 th 2008 a c 6 008 Value

Island of Linga, Walls A Unique Shetland Crofting & Fishing Experience May 2015 Rev D Walls

Transaction Management Part II: Recovery vanilladb.org Todays Topic: Recovery Mgr VanillaCore

Drive-Thru: Drive-Thru: Fast, Accurate Evaluation of Fast, Accurate Evaluation of Storage Power

Installing slides into electronic enclosures Installing slides into electronic enclosures 2012

Active Access: A Mechanism for High-Performance Distributed Data-Centric Computations M ACIEJ B

CS5460: Operating Systems Lecture 9: Implementing Synchronization (Chapter 6) CS 5460:

A Coherent and Managed Run0me for ML on the SCC KC

Flutter Top to Bottom Hans Muller Dan Field Advanced User Interface Software, 05-830 Spring

Build beautiful native apps in record time with flutter Eduardo Telaya - CTO / Software

Walking Through Walls PostgreSQL FreeBSD tmunro@freebsd.org - PowerPoint PPT Presentation

Walking Through Walls PostgreSQL FreeBSD tmunro@freebsd.org tmunro@postgresql.org thomas.munro@enterprisedb.com About me New to FreeBSD hacking Mentors: mjg, allanjude ~20 years work on proprietary C, C++,

PORTFOLIO OUR MANIFEST Walls. We have them all around. Walls - in front of or within us. There

Providing for walking Identifying how we can encourage a better walking environment Walking in

Leaders Guide Benefits of walking and biking Barriers to walking and biking

Walking and Work Reflections on the practice in light of the research Walking and Work Look at

Water Walking Faith Peter Walking on Water Matthew 14:25-32, NIV During the fourth watch of the

Tearing Down Walls Tearing Down Walls &amp; &amp; Building Bridges Building Bridges

Walking, the dilaton, and complex CFT (II) Walking, the dilaton, and complex CFT (II) Chik Him

Age-Friendly Walking in Small and Rural Towns April 18, 2018 What is Age-Friendly Walking? What

Were Walking and Biking!! What are the rules for walking? Be aware of your surroundings

Walking Bass &amp; Jazz Founda'ons Guide The Easy To Understand Guide To Crea'ng Walking Bass

The Walking for Health Sheffield AGM Charlene Simon- Walking for Health Development Manager

Architectural walls OVERVIEW BENEFITS SOLUTIONS EXPERIENCE HAWORTH OVERVIEW When it comes to

NVIDIA Quadro and NVS Video Walls NVIDIA Quadro and NVS Video Walls Using NVIDIA technology to

GIAnT: Visualizing Group Interaction at Large Display Walls Ulrich von Zadow , Raimund Dachselt

Severe Weather Walls/Roofs Walls/Roofs SPFA Conference March 16 th 2008 a c 6 008 Value

Island of Linga, Walls A Unique Shetland Crofting &amp; Fishing Experience May 2015 Rev D Walls

Transaction Management Part II: Recovery vanilladb.org Todays Topic: Recovery Mgr VanillaCore

Drive-Thru: Drive-Thru: Fast, Accurate Evaluation of Fast, Accurate Evaluation of Storage Power

Installing slides into electronic enclosures Installing slides into electronic enclosures 2012

Active Access: A Mechanism for High-Performance Distributed Data-Centric Computations M ACIEJ B

CS5460: Operating Systems Lecture 9: Implementing Synchronization (Chapter 6) CS 5460:

A Coherent and Managed Run0me for ML on the SCC KC

Flutter Top to Bottom Hans Muller Dan Field Advanced User Interface Software, 05-830 Spring

Build beautiful native apps in record time with flutter Eduardo Telaya - CTO / Software

Tearing Down Walls Tearing Down Walls & & Building Bridges Building Bridges

Walking Bass & Jazz Founda'ons Guide The Easy To Understand Guide To Crea'ng Walking Bass

Island of Linga, Walls A Unique Shetland Crofting & Fishing Experience May 2015 Rev D Walls