

SLIDE 1

NUMA obliviousness through memory mapping

Mrunal Gawade and Martin Kersten, CWI, Amsterdam. DaMoN 2015 (1st June 2015), Melbourne, Australia.

SLIDE 2

NUMA architecture

Intel Xeon E5-4657L v2 @2.40GHz

SLIDE 3

Memory mapping

What is it? The operating system maps disk files to memory, e.g. executable file mapping.
How is it done? Through the system calls mmap() and munmap().
Relevance for the database world? In-memory columnar storage: disk files mapped to memory.
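
To make the mechanism concrete, here is a minimal sketch of mapping a column file and scanning it; the file name and element type are illustrative, not MonetDB's actual storage format.

/* Minimal sketch: map a (hypothetical) column file read-only and scan it.
 * Pages are faulted in lazily; the OS decides on which NUMA node
 * each page lands. Compile with any C99 compiler on Linux. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("lineitem_quantity.col", O_RDONLY);  /* hypothetical file */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    const int *col = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (col == MAP_FAILED) { perror("mmap"); return 1; }

    size_t n = st.st_size / sizeof(int);
    long long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += col[i];        /* first touch of a page triggers the fault */
    printf("sum = %lld over %zu values\n", sum, n);

    munmap((void *)col, st.st_size);
    close(fd);
    return 0;
}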

SLIDE 4

Motivation

NUMA effects on memory-mapped columnar storage under an analytic workload.

SLIDE 5

TPC-H Q1 … (4 sockets, 100GB, MonetDB)

[Bar chart: Time (sec) for TPC-H Q1 as the sockets on which memory is allocated are varied]

numactl -N 0,1 -m "varied between sockets 0-3" "database server process"

SLIDE 6

Contributions

• NUMA-oblivious (shared-everything) execution is relatively good compared to NUMA-aware (shared-nothing) execution (using an SQL workload).

• Insights into the effect of memory mapping on NUMA obliviousness (using micro-benchmarks).

• A distributed database system using multiple sockets (shared-nothing) reduces remote memory accesses.

SLIDE 7

NUMA oblivious vs NUMA aware plans

NUMA_Obliv (shared-everything): default parallel plans in MonetDB; only the "lineitem" table is sliced.

NUMA_Shard (a variation of NUMA_Obliv): shard-aware plans in MonetDB; the "lineitem" and "orders" tables are sharded into 4 pieces (on orderkey) and sliced.

NUMA_Distr (shared-nothing): socket-aware plans in MonetDB; the "lineitem" and "orders" tables are sharded into 4 pieces (on orderkey) and sliced; dimension tables are replicated.

SLIDE 8

System configuration

• Intel Xeon E5-4657L v2 @ 2.40GHz, 4 sockets, 12 cores per socket (96 threads in total with hyper-threading)

• Cache: L1 = 32KB, L2 = 256KB, shared L3 = 30MB

• 1TB four-channel DDR3 memory (256GB per socket)

• OS: Fedora 20; data set: TPC-H 100GB

• Tools: numactl, Intel PCM, Linux perf

• MonetDB: open-source system with memory-mapped columnar storage

SLIDE 9

TPC-H performance

[Bar chart: Time (sec) for TPC-H queries Q4, Q6, Q15, Q19 under NUMA_Obliv, NUMA_Shard, and NUMA_Distr]

NUMA_Shard is a variation of NUMA_Obliv with sharded & partitioned “orders” table.

SLIDE 10

Micro-experiments on modified Q6

Why Q6? - select count(*) from lineitem where l_quantity > 24000000;

• Selection on the "lineitem" table
• Easily parallelizable
• NUMA effects are easy to analyze (read-only query)
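
As a rough illustration (not MonetDB's actual kernel), the modified Q6 boils down to a parallel selection count over one column, sketched here with pthreads; where each thread runs, and on which socket the scanned pages live, is exactly what the following micro-experiments vary.

/* Sketch: parallel "select count(*) ... where l_quantity > 24000000"
 * over an int column (e.g. the memory-mapped one from slide 3). */
#include <pthread.h>
#include <stddef.h>

#define NTHREADS 4                       /* varied from 12 to 96 in the talk */

struct task { const int *col; size_t lo, hi; long long count; };

static void *scan(void *arg) {
    struct task *t = arg;
    long long c = 0;
    for (size_t i = t->lo; i < t->hi; i++)
        if (t->col[i] > 24000000) c++;   /* the selection predicate */
    t->count = c;
    return NULL;
}

long long count_q6(const int *col, size_t n) {
    pthread_t tid[NTHREADS];
    struct task tasks[NTHREADS];
    for (int t = 0; t < NTHREADS; t++) { /* split the column into slices */
        tasks[t] = (struct task){ col, n * t / NTHREADS,
                                  n * (t + 1) / NTHREADS, 0 };
        pthread_create(&tid[t], NULL, scan, &tasks[t]);
    }
    long long total = 0;
    for (int t = 0; t < NTHREADS; t++) { /* join and aggregate */
        pthread_join(tid[t], NULL);
        total += tasks[t].count;
    }
    return total;
}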

SLIDE 11

Process and memory affinity

            Socket 0   Socket 1   Socket 2   Socket 3
cores       0-11       12-23      24-35      36-47
cores (HT)  48-59      60-71      72-83      84-95

Example:

numactl -C 0-11,12-23,24-35 -m 0,1,2 “Database Server”
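
The same binding can be requested programmatically; a minimal sketch with libnuma (compile with -lnuma), assuming, as on this machine, that NUMA node numbers coincide with socket numbers:

/* Programmatic equivalent of: numactl -C 0-11,12-23,24-35 -m 0,1,2 ... */
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not supported on this system\n");
        return 1;
    }
    struct bitmask *nodes = numa_parse_nodestring("0-2");
    numa_run_on_node_mask(nodes);   /* run threads on nodes 0-2 only */
    numa_set_membind(nodes);        /* allocate memory on nodes 0-2 only */
    numa_bitmask_free(nodes);

    /* ... start the database server's worker threads here ... */
    return 0;
}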

SLIDE 12

NUMA_Obliv Micro-experiments on Q6

SLIDE 13

Local vs remote memory access

[Three panel charts: memory accesses (in millions) vs. number of threads (12-96), showing local and remote memory accesses for each of the configurations listed below]

Process and memory affinity = PMA. Buffer cache cleared = BCC (echo 3 | sudo /usr/bin/tee /proc/sys/vm/drop_caches).

Panels: PMA=yes, BCC=yes; PMA=no, BCC=yes; PMA=no, BCC=no.

SLIDE 14

Execution time (Robustness)

[Three panel charts: time (milliseconds) vs. number of threads (12-96) for each of the configurations listed below]

Panels: PMA=yes, BCC=yes (most robust); PMA=no, BCC=yes (less robust); PMA=no, BCC=no (least robust).

Process and memory affinity = PMA. Buffer cache cleared = BCC (echo 3 | sudo /usr/bin/tee /proc/sys/vm/drop_caches).

SLIDE 15

Increase in threads = more remote accesses?

SLIDE 16

Distribution of mapped pages

[Chart: proportion of mapped pages per socket (sockets 0-3) vs. number of threads (12-48)]

/proc/<process id>/numa_maps
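
A programmatic counterpart to parsing /proc/<process id>/numa_maps is move_pages(2): called with a NULL nodes argument it migrates nothing and instead reports, per page, the node the page currently resides on. A sketch (compile with -lnuma):

/* Report the NUMA node of every page in a mapping.
 * Pages not yet faulted in get a negative status (-ENOENT). */
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

void print_page_nodes(void *buf, size_t bytes) {
    long page = sysconf(_SC_PAGESIZE);
    size_t npages = (bytes + page - 1) / page;
    void **pages = malloc(npages * sizeof *pages);
    int *status = malloc(npages * sizeof *status);
    for (size_t i = 0; i < npages; i++)
        pages[i] = (char *)buf + i * page;

    /* nodes == NULL: query placement only, move nothing */
    if (move_pages(0, npages, pages, NULL, status, 0) == 0)
        for (size_t i = 0; i < npages; i++)
            printf("page %zu -> node %d\n", i, status[i]);

    free(pages);
    free(status);
}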

SLIDE 17

# CPU migrations

[Chart: number of CPU migrations vs. number of threads (12-96)]

SLIDE 18

Why are remote accesses bad?

[Bar chart: time (milliseconds) for the modified TPC-H Q6 under NUMA_Obliv and NUMA_Distr]

             #Local accesses    #Remote accesses
NUMA_Obliv   69 million (M)     136 M
NUMA_Distr   196 M              9 M

SLIDE 19

NUMA_Distr to minimize remote accesses?

SLIDE 20

Comparison with Vectorwise

[Bar chart: Time (sec) for TPC-H queries Q4, Q6, Q15, Q19 under MonetDB NUMA_Shard, MonetDB NUMA_Distr, Vector_Def, and Vector_Distr]

Vectorwise has no NUMA awareness and uses a dedicated buffer manager.

SLIDE 21

Comparison with Hyper

[Bar chart: Time (sec) for TPC-H queries Q6, Q9, Q12, Q14, Q15, Q19 under MonetDB NUMA_Distr and Hyper]

Speed-up of Hyper over the MonetDB NUMA_Distr plans: 2.5x, 2x, 1.15x, 5.7x, 2.3x. Hyper generates NUMA-aware, LLVM JIT-compiled, fused operator pipeline plans.

SLIDE 22

Conclusion

• NUMA obliviousness fares reasonably well compared to NUMA awareness.

• Process and memory affinity helps NUMA-oblivious plans perform robustly.

• A simple distributed shared-nothing database configuration can compete with a state-of-the-art database system.

SLIDE 23

Thank you