Large Pages May Be Harmful on NUMA Systems Fabien Gaud

Fabien Gaud (Simon Fraser University), Baptiste Lepers (CNRS), Jeremie Decouchant (Grenoble University), Justin Funston (Simon Fraser University), Alexandra Fedorova (Simon Fraser University), Vivien Quéma (Grenoble INP)

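Virtual-to-physical translation is done by the TLB and page table
Diagram: a virtual address is looked up in the TLB; on a TLB hit the physical address comes straight from the TLB, and on a TLB miss it comes from the page table.
Typical TLB size: 1024 entries (AMD Bulldozer), 512 entries (Intel i7).
Notes: Can large pages reduce TLB misses? If file accesses are managed through page tables, will TLB misses become a serious problem? Files are larger than in-memory data, so they need more page table entries. File accesses also have poorer locality, which makes the TLB miss problem worse. Open question: hybrid memory under NUMA.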

Virtual-to-physical translation is done by the TLB and page table
Diagram: a virtual address is looked up in the TLB; a TLB hit yields the physical address directly, while a TLB miss goes through the page table and costs 43 cycles.
Typical TLB size: 1024 entries (AMD Bulldozer), 512 entries (Intel i7).

Large pages: known advantages & downsides
Known advantages:
• Fewer TLB misses
  Page size       512 entries coverage   1024 entries coverage
  4KB (default)   2MB                    4MB
  2MB             1GB                    2GB
  1GB             512GB                  1024GB
• Fewer page allocations (reduces contention in the kernel memory manager)
Known downsides:
• Increased memory footprint
• Memory fragmentation
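The coverage figures in the table follow from multiplying the number of TLB entries by the page size. A minimal sketch that reproduces them (the function name `tlb_coverage` is illustrative, not from the slides):

```python
KB, MB, GB = 1024, 1024**2, 1024**3

def tlb_coverage(entries: int, page_size: int) -> int:
    """Bytes of memory that can be mapped by the TLB at once."""
    return entries * page_size

# Reproduce the slide's table for 512-entry and 1024-entry TLBs.
for label, size in (("4KB", 4 * KB), ("2MB", 2 * MB), ("1GB", 1 * GB)):
    small_tlb = tlb_coverage(512, size) // MB
    large_tlb = tlb_coverage(1024, size) // MB
    print(f"{label}: {small_tlb} MB with 512 entries, {large_tlb} MB with 1024 entries")
```

With 4KB pages a 512-entry TLB covers only 2MB, which is why workloads with large working sets spend time on TLB misses.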

Machines are NUMA
Remote memory accesses hurt performance.
Diagram: several NUMA nodes, each with its own memory; one node is expanded to show CPU0 to CPU3. The access paths are labeled "8GB/s, 160 cycles" and "3GB/s, 300 cycles".

Machines are NUMA
Contention hurts performance even more.
Diagram: several NUMA nodes, each with its own memory; one node is expanded to show CPU0 to CPU3. The contended access path is labeled "1200 cycles!".

Experimental Environment
• Machine A
  – Two AMD processors, 12 cores per processor, 64GB DRAM
  – 4 nodes, 6 cores and 12GB RAM per node
• Machine B
  – Four AMD processors, 16 cores per processor, 512GB DRAM
  – 8 nodes, 8 cores and 64GB RAM per node
• Benchmarks
  – NAS Parallel Benchmarks
  – SSCA
  – SPECjbb

New observation: large pages may hurt performance on NUMA machines
[Figure: Performance improvement of THP (2M pages) over 4K pages. Two panels, 24-core machine and 64-core machine; y-axis: Perf. improvement relative to 4K (%), from -30 to 30. Benchmarks: BT.B, CG.D, DC.A, EP.C, FT.C, IS.D, LU.B, MG.D, SP.B, UA.B, UA.C, WC, WR, Kmeans, MatrixM., pca, wrmem, SSCA.20, SPECjbb. Off-scale bars are labeled -43, 109, 70 and 51.]

Performance example (1/2)
  App.      Perf. increase   % of time spent in   % of time spent in   Imbalance   Imbalance
            THP/4K (%)       TLB miss, 4K         TLB miss, 2M         4K (%)      2M (%)
  CG.D      -43              0                    0                    1           59
  SSCA.20   17               15                   2                    8           52
  SpecJBB   -6               7                    0                    16          39
Using large pages, 1 node is overloaded in CG, SSCA and SpecJBB. Only SSCA benefits from the reduction of TLB misses.

Large pages on NUMA machines (1/2)
void *a = malloc(2 * 1024 * 1024); /* 2MB */
Diagram: the allocation is spread across Node 0, Node 1, Node 2 and Node 3.
With 4K pages, load is balanced.

Large pages on NUMA machines (1/2)
void *a = malloc(2 * 1024 * 1024); /* 2MB */
Diagram: the whole allocation sits on a single node out of Node 0 to Node 3.
With 2M pages, data are allocated on 1 node => contention.

Large pages on NUMA machines (1/2)
void *a = malloc(2 * 1024 * 1024); /* 2MB */
Diagram: the whole allocation sits on a single node out of Node 0 to Node 3; that page is marked HOT PAGE.
With 2M pages, data are allocated on 1 node => contention.
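The hot-page effect can be illustrated with a small simulation. This is only a sketch under stated assumptions: a first-touch placement policy, four nodes, a page-aligned 2MB buffer, and one thread per node that first touches its own equal slice of the buffer. None of these details are given on the slides, and `first_touch_placement` is a hypothetical helper.

```python
from collections import Counter

KB, MB = 1024, 1024**2
NODES = 4

def first_touch_placement(buf_size, page_size, nodes=NODES):
    """Return bytes placed on each node under a first-touch policy.

    Assumption: the thread on node i first touches the i-th equal slice
    of the buffer, and a page lands on the node that touches its first byte.
    """
    slice_size = buf_size // nodes
    placement = Counter()
    for page_start in range(0, buf_size, page_size):
        toucher = min(page_start // slice_size, nodes - 1)
        placement[toucher] += min(page_size, buf_size - page_start)
    return placement

small = first_touch_placement(2 * MB, 4 * KB)   # 512 pages of 4KB
large = first_touch_placement(2 * MB, 2 * MB)   # a single 2MB page

print("4K pages:", dict(small))
print("2M pages:", dict(large))
```

With 4K pages every node ends up holding a quarter of the buffer; with a 2M page the entire buffer lands on one node, so all four threads send their requests to the same memory controller.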

Performance example (2/2)
  App.   Perf. increase   Local Access Ratio   Local Access Ratio
         THP/4K (%)       4K (%)               2M (%)
  UA.C   -15              88                   66
The locality decreases when using large pages.

Large pages on NUMA machines (2/2)
void *a = malloc(1536 * 1024); /* 1.5MB, used by node 0 */
void *b = malloc(1536 * 1024); /* 1.5MB, used by node 1 */
Diagram: Node 0, Node 1, Node 2 and Node 3; a and b share one 2M page (PAGE-LEVEL FALSE SHARING).
Page-level false sharing reduces the maximum achievable locality.
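A short calculation shows how much locality is lost in this example. The sketch assumes that a and b are laid out back to back starting at a 2MB boundary, that a is accessed only from node 0 and b only from node 1, and that accesses are uniform over bytes; `max_local_ratio` is an illustrative helper, not part of the authors' system.

```python
KB, MB = 1024, 1024**2

def max_local_ratio(regions, page_size):
    """Best achievable local access ratio when each page lives on one node.

    regions: list of (start, end, node) byte ranges, each accessed only by `node`.
    Assumption: accesses are uniform over bytes, so bytes stand in for accesses.
    """
    total = sum(end - start for start, end, _ in regions)
    last = max(end for _, end, _ in regions)
    local = 0
    for page_start in range(0, last, page_size):
        page_end = page_start + page_size
        per_node = {}
        for start, end, node in regions:
            overlap = min(end, page_end) - max(start, page_start)
            if overlap > 0:
                per_node[node] = per_node.get(node, 0) + overlap
        # Best case: the page is placed on the node that uses most of it.
        local += max(per_node.values(), default=0)
    return local / total

a = (0, 1536 * KB, 0)            # a: 1.5MB accessed from node 0
b = (1536 * KB, 3072 * KB, 1)    # b: 1.5MB accessed from node 1

print("4K pages:", max_local_ratio([a, b], 4 * KB))
print("2M pages:", max_local_ratio([a, b], 2 * MB))
```

Under these assumptions 4K pages allow every access to be local, while with 2M pages the first page holds all of a plus 0.5MB of b, so at most 2.5MB out of 3MB (about 83%) can be local wherever the pages are placed.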

Can existing memory management algorithms solve the problem?

Existing memory management algorithms do not solve the problem
We run the applications with Carrefour [1], the state-of-the-art memory management algorithm. Carrefour monitors memory accesses and places pages to minimize imbalance and maximize locality.
Carrefour solves imbalance and locality issues on some applications, but does not improve performance on some other applications (hot pages or page-level false sharing).
[Figure: Perf. improvement relative to 4K (%), y-axis from -30 to 30, for CG.D, LU.B, UA.B, UA.C, MatrixM., wrmem, SSCA.20 and SPECjbb.]
[1] M. Dashti, A. Fedorova, J. Funston, F. Gaud, R. Lachaize, B. Lepers, V. Quéma, and M. Roth. Traffic management: A holistic approach to memory placement on NUMA systems. ASPLOS 2013.

Carrefour
• Method
  – Gather access samples for memory pages.
  – Choose a host node for each page based on the samples:
    • if all of the samples for a page originated from a single node, the page is migrated to that node;
    • if the samples came from multiple nodes, the page is migrated to a random node.
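The placement rule on this slide can be sketched in a few lines. This is a simplified illustration of the rule as stated here, not Carrefour's actual implementation; `choose_host_node` is a hypothetical name, and leaving unsampled pages in place is an assumption.

```python
import random

def choose_host_node(sample_nodes, num_nodes, rng=random):
    """Pick a host node for a page from the nodes that issued its access samples.

    sample_nodes: node ids, one per access sample recorded for the page.
    """
    origins = set(sample_nodes)
    if not origins:
        return None  # assumption: a page without samples is left where it is
    if len(origins) == 1:
        # All samples came from one node: migrate the page there.
        return next(iter(origins))
    # Samples came from multiple nodes: migrate the page to a random node.
    return rng.randrange(num_nodes)

print(choose_host_node([2, 2, 2], num_nodes=4))
print(choose_host_node([0, 1, 3], num_nodes=4, rng=random.Random(0)))
```

Random placement of shared pages spreads them across memory controllers, which addresses imbalance but cannot make a page local to all of the nodes that access it.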

Reasons: hot pages

Reasons: page sharing

We need a better memory management algorithm

Our solution: Carrefour-LP
• Built on top of Carrefour.
• By default, 2M pages are activated.
• Two components that run every second:
  – Reactive component: splits 2M pages (detects and removes "hot pages" and page-level "false sharing") and deactivates 2M page allocation.
  – Conservative component: promotes 4K pages when the time spent handling TLB misses is high, and forces 2M page allocation in case of contention in the page fault handler.
• We show in the paper that the two components are required.

Implementation
Reactive component (splits 2M pages)
• Sample memory accesses using IBS.
• Does a page represent more than 5% of all accesses, and is it accessed from multiple nodes? YES: split and interleave the hot page.
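The hot-page test can be expressed as a filter over access samples. The sketch below assumes each sample is reduced to a (page, node) pair; the real system works on IBS samples inside the kernel, and `find_hot_pages` is an illustrative name.

```python
from collections import defaultdict

HOT_SHARE = 0.05  # the 5% figure from the slide

def find_hot_pages(samples, hot_share=HOT_SHARE):
    """Return pages that should be split and interleaved.

    samples: iterable of (page_id, node) pairs, one per sampled access.
    A page is hot if it receives more than `hot_share` of all samples
    and those samples come from more than one node.
    """
    samples = list(samples)
    counts = defaultdict(int)
    nodes = defaultdict(set)
    for page, node in samples:
        counts[page] += 1
        nodes[page].add(node)
    total = len(samples)
    return sorted(
        page for page in counts
        if counts[page] > hot_share * total and len(nodes[page]) > 1
    )

samples = (
    [("A", n % 4) for n in range(60)]        # 60% of accesses, from 4 nodes
    + [("B", 0)] * 30                        # 30% of accesses, from 1 node
    + [(f"C{i}", 1) for i in range(10)]      # ten cold pages
)
print(find_hot_pages(samples))
```

Page B is heavily used but only from one node, so migrating it is enough; only page A needs to be split.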

Implementation
Reactive component (splits 2M pages)
• Sample memory accesses using IBS.
• Compute the observed local access ratio (LAR1).
• Compute the LAR that would have been obtained if each page was placed on the node that accessed it the most.
• Can LAR1 be significantly improved? YES: run Carrefour. NO: continue.
• Compute the LAR that would have been obtained if each page was split and then placed on the node that accessed it the most.
• Can LAR1 be significantly improved? YES: split all 2M pages and run Carrefour.
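The three local access ratios in this flowchart can be estimated from the same set of samples. The following is a sketch under assumptions: each sample is a (2M page, 4K sub-page, accessing node) triple, `home` gives the current node of every 2M page, and `min_gain` stands in for "significantly improved", whose actual value is not given on the slides. The final "no action" branch is also an assumption.

```python
from collections import Counter, defaultdict

def best_case_local(groups):
    """Local accesses if every group is placed on its most frequent accessor."""
    return sum(max(counter.values()) for counter in groups.values())

def local_access_ratios(samples, home):
    """Return (observed LAR, LAR after migration, LAR after split + migration).

    samples: list of (page_2m, subpage_4k, accessing_node).
    home: dict mapping each 2M page to the node it currently resides on.
    """
    total = len(samples)
    if total == 0:
        return 0.0, 0.0, 0.0
    by_2m, by_4k = defaultdict(Counter), defaultdict(Counter)
    local_now = 0
    for page, sub, node in samples:
        by_2m[page][node] += 1
        by_4k[(page, sub)][node] += 1
        local_now += home[page] == node
    return (local_now / total,
            best_case_local(by_2m) / total,
            best_case_local(by_4k) / total)

def next_step(lar1, lar_migrate, lar_split, min_gain):
    """Follow the flowchart; min_gain is a hypothetical threshold."""
    if lar_migrate - lar1 >= min_gain:
        return "run Carrefour"
    if lar_split - lar1 >= min_gain:
        return "split all 2M pages and run Carrefour"
    return "no action"  # assumption: this branch is not shown on the slide

samples = [(0, 0, 0)] * 4 + [(0, 1, 1)] * 4 + [(0, 2, 2)] * 2
lar1, lar_migrate, lar_split = local_access_ratios(samples, home={0: 3})
print(lar1, lar_migrate, lar_split)
```

In this toy input one 2M page is shared by three nodes at 4K granularity: migrating it whole makes at most 40% of accesses local, while splitting it first would make all of them local.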

Implementation challenges
Reactive component (splits 2M pages)
• Sample memory accesses using IBS. (COSTLY)
• Compute the observed local access ratio (LAR1).
• Compute the LAR that would have been obtained if each page was placed on the node that accessed it the most (without splitting).
• Can LAR1 be significantly improved? YES: run Carrefour. NO: continue.
• Compute the LAR that would have been obtained if each page was split and then placed on the node that accessed it the most. (IMPRECISE)
• Can LAR1 be significantly improved? YES: split all 2M pages and run Carrefour. (COSTLY)

Implementation challenges
Reactive component (splits 2M pages)
• We only have a few IBS samples.
• The LAR estimated for "2M pages split into 4K pages" can be wrong.
• We try to be conservative by running Carrefour first and only splitting pages when necessary (splitting pages is expensive).
• Predicting whether splitting a 2M page will increase the TLB miss rate is too costly. This is why the conservative component is required.

Implementation
Conservative component
• Monitor the time spent in TLB misses (hardware counters). If it exceeds 5%: cluster 4K pages and force 2M page allocation.
• Monitor the time spent in the page fault handler (kernel statistics). If it exceeds 5%: force 2M page allocation.
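The two checks of the conservative component amount to a pair of threshold tests. A minimal sketch, assuming both measurements are already available as fractions of execution time (the function name and the string results are illustrative):

```python
THRESHOLD = 0.05  # the 5% figure from the slide

def conservative_actions(tlb_miss_time_share, page_fault_time_share,
                         threshold=THRESHOLD):
    """Return the actions the conservative component would take.

    Both arguments are fractions of time (0.0 to 1.0), assumed to come from
    hardware counters and kernel statistics respectively.
    """
    actions = []
    if tlb_miss_time_share > threshold:
        actions.append("cluster 4K pages and force 2M page allocation")
    if page_fault_time_share > threshold:
        actions.append("force 2M page allocation")
    return actions

print(conservative_actions(0.12, 0.01))
print(conservative_actions(0.02, 0.02))
```

Measuring the actual TLB cost after the fact avoids having to predict the effect of a split in advance.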

Evaluation
[Figure: Perf. improvement relative to 4K (%), y-axis from -30 to 30, in two panels (24-core machine and 64-core machine). Series: Carrefour-2M over Linux 4K, Reactive over Linux 4K, Conservative over Linux 4K, Carrefour-LP over Linux 4K. Benchmarks: CG.D, LU.B, UA.B, UA.C, MatrixM., wrmem, SSCA.20, SPECjbb. Off-scale bars are labeled 46, 45, 32, 46 and -43.]
