NUMA Mark Johnston markj@FreeBSD.org FreeBSD Bay Area Vendor - PowerPoint PPT Presentation

NUMA Mark Johnston markj@FreeBSD.org FreeBSD Bay Area Vendor Summit October 11, 2019

Non-Uniform Memory Access
[Diagram: CPUs, each with its own attached RAM]

OS Responsibilities
Minimize remote memory accesses
◮ Avoid remote access latency penalty
◮ Avoid bottlenecking on cross-domain interconnect
Requirements:
◮ Balance resource utilization
◮ Allow applications to provide hints (scheduling, memory allocation)
◮ Handle local memory shortages gracefully
◮ Affinitize static data structures

APIs
Kernel:
◮ bus_get_domain(9), bus_dma_tag_set_domain(9)
◮ malloc_domainset(9), kmem_malloc_domainset(9)
◮ uma_zalloc_domain(9) (slow!)
Userspace:
◮ cpuset(1)
◮ cpuset_getdomain(2), cpuset_setdomain(2)

Review: Domain Selection Policies, domainset(9)
DOMAINSET_POLICY_ROUNDROBIN
◮ Cycle through domains: d = iter++ % ds->ds_cnt
◮ 0, 1, 2, 3, 0, 1, 2, 3, 0, ...
DOMAINSET_POLICY_FIRSTTOUCH
◮ Pick the domain of the current CPU: d = PCPU_GET(domain)
◮ Userland default, good for short-lived processes
DOMAINSET_POLICY_PREFER
◮ Pick the domain specified in the policy: d = ds->ds_prefer
◮ Fall back to round-robin when free pages are scarce
DOMAINSET_POLICY_INTERLEAVE
◮ Round-robin with a stride
◮ 0, 0, ..., 0, 1, 1, ..., 1, 0, 0, ...
◮ Superpage-friendly: use a stride of 512
◮ Kernel default
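The four policies can be sketched in portable C as a single selection function. This is an illustrative userspace model, not the kernel's domainset(9) code: struct domset, the POLICY_* tags, pick_domain(), and the prefer_low flag are hypothetical names, and the current CPU's domain is passed in as a parameter instead of being read per-CPU.

```c
/* Hypothetical policy tags mirroring the four policies on the slide. */
enum policy {
    POLICY_ROUNDROBIN,
    POLICY_FIRSTTOUCH,
    POLICY_PREFER,
    POLICY_INTERLEAVE
};

/* Simplified stand-in for a domain set; not the kernel structure. */
struct domset {
    enum policy policy;
    unsigned cnt;     /* number of domains in the set, assumed >= 1 */
    int prefer;       /* preferred domain, used by POLICY_PREFER */
    unsigned stride;  /* allocations per domain for POLICY_INTERLEAVE, assumed >= 1 */
};

/*
 * Pick a domain for the next allocation.
 *   iter       - caller-owned iterator, advanced by the cycling policies
 *   curdomain  - domain of the current CPU (assumed to be known by the caller)
 *   prefer_low - nonzero when the preferred domain is short on free pages
 */
int pick_domain(const struct domset *ds, unsigned *iter, int curdomain,
    int prefer_low)
{
    switch (ds->policy) {
    case POLICY_ROUNDROBIN:
        /* 0, 1, 2, 3, 0, 1, ... */
        return (int)((*iter)++ % ds->cnt);
    case POLICY_FIRSTTOUCH:
        /* Allocate where the caller is running. */
        return curdomain;
    case POLICY_PREFER:
        /* Use the preferred domain unless it is running low. */
        if (!prefer_low)
            return ds->prefer;
        return (int)((*iter)++ % ds->cnt);
    case POLICY_INTERLEAVE:
        /* Round-robin, but switch domains only every `stride` calls. */
        return (int)(((*iter)++ / ds->stride) % ds->cnt);
    }
    return 0;
}
```

With two domains and a stride of 512, the interleave case returns domain 0 for the first 512 calls and domain 1 for the next 512, which is why a 512-page run (one 2MB superpage of 4KB pages) stays within one domain.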

Review: UMA per-CPU caches
◮ Bucket: dynamically allocated array
◮ Items allocated from alloc bucket
◮ Items freed to free bucket
◮ Buckets are swapped if empty (alloc) or full (free)
◮ Per-domain cache of full buckets
◮ Slow path: lock the zone, check bucket cache
[Diagram: a CPU with an alloc bucket and a free bucket, each an array of item pointers and empty slots]
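The alloc/free bucket pair can be modeled with a small userspace sketch. It is a simplification under stated assumptions, not UMA's implementation: the bucket size is fixed at a hypothetical BUCKET_SIZE, there is no per-domain cache of full buckets, and the slow path is only signaled to the caller. All type and function names here are made up for illustration.

```c
#include <stddef.h>

#define BUCKET_SIZE 4   /* assumed fixed capacity, for illustration only */

struct bucket {
    void *items[BUCKET_SIZE];
    int cnt;            /* number of valid item pointers */
};

struct cpu_cache {
    struct bucket *alloc_bucket;  /* allocations pop from here */
    struct bucket *free_bucket;   /* frees push to here */
};

static void swap_buckets(struct cpu_cache *c)
{
    struct bucket *tmp = c->alloc_bucket;

    c->alloc_bucket = c->free_bucket;
    c->free_bucket = tmp;
}

/* Returns an item, or NULL when the caller must take the slow path. */
void *cache_alloc(struct cpu_cache *c)
{
    /* Alloc bucket empty: reuse recently freed items if there are any. */
    if (c->alloc_bucket->cnt == 0 && c->free_bucket->cnt > 0)
        swap_buckets(c);
    if (c->alloc_bucket->cnt == 0)
        return NULL;
    return c->alloc_bucket->items[--c->alloc_bucket->cnt];
}

/* Returns 1 if the item was cached, 0 when the slow path is needed. */
int cache_free(struct cpu_cache *c, void *item)
{
    /* Free bucket full: swap if the alloc bucket still has room. */
    if (c->free_bucket->cnt == BUCKET_SIZE &&
        c->alloc_bucket->cnt < BUCKET_SIZE)
        swap_buckets(c);
    if (c->free_bucket->cnt == BUCKET_SIZE)
        return 0;
    c->free_bucket->items[c->free_bucket->cnt++] = item;
    return 1;
}
```

In this model both fast paths touch only per-CPU state, so a lock is needed only when cache_alloc() returns NULL or cache_free() returns 0.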

options UMA_XDOMAIN
◮ On free, find item domain
◮ Cache in free if domain == PCPU_GET(domain), else xdomain
◮ Slow path: lock the zone, drain xdomain
◮ Special optimization for 2 domains
[Diagram: a CPU with alloc, free, and xdomain buckets, each an array of item pointers and empty slots]
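A minimal sketch of the free-path decision, assuming the item's domain has already been determined and is passed in by the caller. The names xbucket, xcache, and xdomain_free() are hypothetical, the bucket size is fixed for illustration, and draining the cross-domain bucket under the zone lock is left to the caller.

```c
#define XB_SIZE 4   /* assumed fixed capacity, for illustration only */

struct xbucket {
    void *items[XB_SIZE];
    int cnt;
};

struct xcache {
    struct xbucket free_bucket;  /* items that belong to this CPU's domain */
    struct xbucket cross;        /* items that belong to some other domain */
};

enum free_result { FREED_LOCAL, FREED_XDOMAIN, NEED_SLOW_PATH };

/*
 * Free an item into the per-CPU cache.
 *   item_domain - domain that owns the item's memory
 *   cpu_domain  - domain of the CPU performing the free
 */
enum free_result xdomain_free(struct xcache *c, void *item, int item_domain,
    int cpu_domain)
{
    int local = (item_domain == cpu_domain);
    struct xbucket *b = local ? &c->free_bucket : &c->cross;

    /* A full bucket means the caller must lock the zone and drain it. */
    if (b->cnt == XB_SIZE)
        return NEED_SLOW_PATH;
    b->items[b->cnt++] = item;
    return local ? FREED_LOCAL : FREED_XDOMAIN;
}
```

Keeping remote items out of the free bucket means a later bucket swap cannot hand memory from another domain to a local allocation.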

Network affinity
[Diagram labels: socket; syncache; TCP PCB; tcp_output(); TCP SYN, ACK; RX mbufs; LAGG; UMA; lacp_select_tx_port(); ifnet; if_alloc_dev(dev_t)]

vm_page_array (amd64 only)
◮ One vm_page structure per 4KB page
◮ vm_page_array allocated early during boot
◮ Physically contiguous → allocated from single domain
◮ Unfriendly to first-touch allocation policy
◮ Now backed by “correct” memory, up to 2MB boundaries

Other Data Structures
◮ PCPU area (amd64)
◮ ULE per-CPU thread queues
◮ callout wheel
◮ vm_page locks (by removing their usage)
◮ Kernel thread stacks

Memory-bound pgbench on a 2-socket system, r353116

Core | IPC  | Instructions | Cycles | Local DRAM accesses | Remote DRAM accesses
0    | 0.46 | 1097 M       | 2402 M | 1467 K              | 759 K
1    | 0.45 | 1090 M       | 2402 M | 1464 K              | 766 K
2    | 0.46 | 1095 M       | 2402 M | 1556 K              | 801 K
3    | 0.46 | 1096 M       | 2402 M | 1445 K              | 755 K
4    | 0.46 | 1099 M       | 2402 M | 1507 K              | 787 K
5    | 0.45 | 1091 M       | 2402 M | 1550 K              | 813 K
6    | 0.46 | 1099 M       | 2402 M | 1482 K              | 785 K
7    | 0.45 | 1092 M       | 2402 M | 1509 K              | 790 K
8    | 0.46 | 1100 M       | 2402 M | 1469 K              | 771 K
9    | 0.45 | 1090 M       | 2402 M | 1535 K              | 800 K
10   | 0.46 | 1094 M       | 2402 M | 1585 K              | 830 K
11   | 0.45 | 1092 M       | 2402 M | 1507 K              | 777 K
12   | 0.46 | 1099 M       | 2402 M | 1481 K              | 776 K
13   | 0.46 | 1095 M       | 2402 M | 1482 K              | 780 K
14   | 0.46 | 1094 M       | 2402 M | 1535 K              | 793 K
15   | 0.45 | 1092 M       | 2402 M | 1516 K              | 776 K
16   | 0.41 | 992 M        | 2402 M | 796 K               | 1256 K
17   | 0.41 | 991 M        | 2402 M | 763 K               | 1208 K
18   | 0.43 | 1040 M       | 2402 M | 851 K               | 1365 K
19   | 0.43 | 1034 M       | 2402 M | 860 K               | 1390 K
20   | 0.43 | 1042 M       | 2402 M | 840 K               | 1332 K
21   | 0.43 | 1030 M       | 2402 M | 852 K               | 1404 K
22   | 0.43 | 1035 M       | 2402 M | 857 K               | 1392 K
23   | 0.43 | 1035 M       | 2402 M | 836 K               | 1335 K
24   | 0.43 | 1039 M       | 2402 M | 834 K               | 1341 K
25   | 0.43 | 1034 M       | 2402 M | 830 K               | 1335 K
26   | 0.43 | 1040 M       | 2402 M | 838 K               | 1339 K
27   | 0.43 | 1035 M       | 2402 M | 841 K               | 1335 K
28   | 0.43 | 1040 M       | 2402 M | 835 K               | 1321 K
29   | 0.43 | 1038 M       | 2402 M | 818 K               | 1319 K
30   | 0.43 | 1041 M       | 2402 M | 806 K               | 1269 K
31   | 0.43 | 1031 M       | 2402 M | 831 K               | 1327 K
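One way to read the two DRAM columns is as a per-core remote-access fraction, remote / (local + remote). The helper below is a hypothetical convenience for that arithmetic and is not part of the original measurements. Using the slide's counts, core 0 (1467 K local, 759 K remote) comes out at roughly 34% remote, while core 16 (796 K local, 1256 K remote) comes out at roughly 61%.

```c
/*
 * Fraction of DRAM accesses that were remote.
 * Both counts must use the same unit (for example, thousands of accesses).
 * Returns 0.0 when there were no accesses at all.
 */
double remote_fraction(double local, double remote)
{
    double total = local + remote;

    return total > 0.0 ? remote / total : 0.0;
}
```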

Future Direction
◮ Continue affinitizing static kernel data structures
  ◮ e.g., vm_reserv_array, vm_dom[]
◮ Taskqueue affinity
◮ NUMA awareness in UMA by default
◮ Improve NUMA support on !amd64
◮ ...?
