

SLIDE 1

NUMA obliviousness through memory mapping

Mrunal Gawade and Martin Kersten, CWI, Amsterdam. DaMoN 2015 (1st June 2015), Melbourne, Australia.

SLIDE 2

NUMA architecture

Intel Xeon E5-4657L v2 @2.40GHz

SLIDE 3

Memory mapping

What is it? The operating system maps disk files to memory, e.g. executable file mapping.
How is it done? Through the system calls mmap() and munmap().
Relevance for the database world? In-memory columnar storage: disk files mapped to memory.
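
To make the mechanism concrete, here is a minimal sketch of mapping a column file and scanning it; the file name and element type are illustrative, not MonetDB's actual storage format.

/* Minimal sketch: map a (hypothetical) column file read-only and scan it.
 * Pages are faulted in lazily; the OS decides on which NUMA node
 * each page lands. Compile with any C99 compiler on Linux. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("lineitem_quantity.col", O_RDONLY);  /* hypothetical file */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    const int *col = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (col == MAP_FAILED) { perror("mmap"); return 1; }

    size_t n = st.st_size / sizeof(int);
    long long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += col[i];        /* first touch of a page triggers the fault */
    printf("sum = %lld over %zu values\n", sum, n);

    munmap((void *)col, st.st_size);
    close(fd);
    return 0;
}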

SLIDE 4

Motivation

NUMA effects on memory-mapped columnar storage under an analytic workload.

SLIDE 5

TPC-H Q1 … (4 sockets, 100GB, MonetDB)

[Bar chart: Time (sec) for TPC-H Q1 as the sockets on which memory is allocated are varied]

numactl -N 0,1 -m "varied between sockets 0-3" "database server process"

SLIDE 6

Contributions

• NUMA-oblivious (shared-everything) execution is relatively good compared to NUMA-aware (shared-nothing) execution (using an SQL workload).

• Insights into the effect of memory mapping on NUMA obliviousness (using micro-benchmarks).

• A distributed database system using multiple sockets (shared-nothing) reduces remote memory accesses.

SLIDE 7

NUMA oblivious vs NUMA aware plans

NUMA_Obliv (shared-everything): default parallel plans in MonetDB; only the "lineitem" table is sliced.

NUMA_Shard (a variation of NUMA_Obliv): shard-aware plans in MonetDB; the "lineitem" and "orders" tables are sharded into 4 pieces (on orderkey) and sliced.

NUMA_Distr (shared-nothing): socket-aware plans in MonetDB; the "lineitem" and "orders" tables are sharded into 4 pieces (on orderkey) and sliced; dimension tables are replicated.

SLIDE 8

System configuration

• Intel Xeon E5-4657L v2 @ 2.40GHz, 4 sockets, 12 cores per socket (96 threads in total with hyper-threading)

• Cache: L1 = 32KB, L2 = 256KB, shared L3 = 30MB

• 1TB four-channel DDR3 memory (256GB per socket)

• OS: Fedora 20; data set: TPC-H 100GB

• Tools: numactl, Intel PCM, Linux perf

• MonetDB: open-source system with memory-mapped columnar storage

SLIDE 9

TPC-H performance

[Bar chart: Time (sec) for TPC-H queries Q4, Q6, Q15, Q19 under NUMA_Obliv, NUMA_Shard, and NUMA_Distr]

NUMA_Shard is a variation of NUMA_Obliv with sharded & partitioned “orders” table.

SLIDE 10

Micro-experiments on modified Q6

Why Q6? - select count(*) from lineitem where l_quantity > 24000000;

• Selection on the "lineitem" table
• Easily parallelizable
• NUMA effects are easy to analyze (read-only query)
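
As a rough illustration (not MonetDB's actual kernel), the modified Q6 boils down to a parallel selection count over one column, sketched here with pthreads; where each thread runs, and on which socket the scanned pages live, is exactly what the following micro-experiments vary.

/* Sketch: parallel "select count(*) ... where l_quantity > 24000000"
 * over an int column (e.g. the memory-mapped one from slide 3). */
#include <pthread.h>
#include <stddef.h>

#define NTHREADS 4                       /* varied from 12 to 96 in the talk */

struct task { const int *col; size_t lo, hi; long long count; };

static void *scan(void *arg) {
    struct task *t = arg;
    long long c = 0;
    for (size_t i = t->lo; i < t->hi; i++)
        if (t->col[i] > 24000000) c++;   /* the selection predicate */
    t->count = c;
    return NULL;
}

long long count_q6(const int *col, size_t n) {
    pthread_t tid[NTHREADS];
    struct task tasks[NTHREADS];
    for (int t = 0; t < NTHREADS; t++) { /* split the column into slices */
        tasks[t] = (struct task){ col, n * t / NTHREADS,
                                  n * (t + 1) / NTHREADS, 0 };
        pthread_create(&tid[t], NULL, scan, &tasks[t]);
    }
    long long total = 0;
    for (int t = 0; t < NTHREADS; t++) { /* join and aggregate */
        pthread_join(tid[t], NULL);
        total += tasks[t].count;
    }
    return total;
}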

SLIDE 11

Process and memory affinity

            Socket 0   Socket 1   Socket 2   Socket 3
cores       0-11       12-23      24-35      36-47
cores (HT)  48-59      60-71      72-83      84-95

Example:

numactl -C 0-11,12-23,24-35 -m 0,1,2 “Database Server”
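
The same binding can be requested programmatically; a minimal sketch with libnuma (compile with -lnuma), assuming, as on this machine, that NUMA node numbers coincide with socket numbers:

/* Programmatic equivalent of: numactl -C 0-11,12-23,24-35 -m 0,1,2 ... */
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not supported on this system\n");
        return 1;
    }
    struct bitmask *nodes = numa_parse_nodestring("0-2");
    numa_run_on_node_mask(nodes);   /* run threads on nodes 0-2 only */
    numa_set_membind(nodes);        /* allocate memory on nodes 0-2 only */
    numa_bitmask_free(nodes);

    /* ... start the database server's worker threads here ... */
    return 0;
}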

SLIDE 12

NUMA_Obliv Micro-experiments on Q6

SLIDE 13

Local vs remote memory access

[Three panel charts: memory accesses (in millions) vs. number of threads (12-96), showing local and remote memory accesses for each of the configurations listed below]

Process and memory affinity = PMA. Buffer cache cleared = BCC (echo 3 | sudo /usr/bin/tee /proc/sys/vm/drop_caches).

Panels: PMA=yes, BCC=yes; PMA=no, BCC=yes; PMA=no, BCC=no.

SLIDE 14

Execution time (Robustness)

[Three panel charts: time (milliseconds) vs. number of threads (12-96) for each of the configurations listed below]

Panels: PMA=yes, BCC=yes (most robust); PMA=no, BCC=yes (less robust); PMA=no, BCC=no (least robust).

Process and memory affinity = PMA. Buffer cache cleared = BCC (echo 3 | sudo /usr/bin/tee /proc/sys/vm/drop_caches).

SLIDE 15

Increase in threads = more remote accesses?

SLIDE 16

Distribution of mapped pages

[Chart: proportion of mapped pages per socket (sockets 0-3) vs. number of threads (12-48)]

/proc/<process id>/numa_maps
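
A programmatic counterpart to parsing /proc/<process id>/numa_maps is move_pages(2): called with a NULL nodes argument it migrates nothing and instead reports, per page, the node the page currently resides on. A sketch (compile with -lnuma):

/* Report the NUMA node of every page in a mapping.
 * Pages not yet faulted in get a negative status (-ENOENT). */
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

void print_page_nodes(void *buf, size_t bytes) {
    long page = sysconf(_SC_PAGESIZE);
    size_t npages = (bytes + page - 1) / page;
    void **pages = malloc(npages * sizeof *pages);
    int *status = malloc(npages * sizeof *status);
    for (size_t i = 0; i < npages; i++)
        pages[i] = (char *)buf + i * page;

    /* nodes == NULL: query placement only, move nothing */
    if (move_pages(0, npages, pages, NULL, status, 0) == 0)
        for (size_t i = 0; i < npages; i++)
            printf("page %zu -> node %d\n", i, status[i]);

    free(pages);
    free(status);
}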

SLIDE 17

# CPU migrations

[Chart: number of CPU migrations vs. number of threads (12-96)]

SLIDE 18

Why are remote accesses bad?

[Bar chart: time (milliseconds) for the modified TPC-H Q6 under NUMA_Obliv and NUMA_Distr]

             #Local accesses    #Remote accesses
NUMA_Obliv   69 million (M)     136 M
NUMA_Distr   196 M              9 M

SLIDE 19

NUMA_Distr to minimize remote accesses?

SLIDE 20

Comparison with Vectorwise

[Bar chart: Time (sec) for TPC-H queries Q4, Q6, Q15, Q19 under MonetDB NUMA_Shard, MonetDB NUMA_Distr, Vector_Def, and Vector_Distr]

Vectorwise has no NUMA awareness and uses a dedicated buffer manager.

SLIDE 21

Comparison with Hyper

[Bar chart: Time (sec) for TPC-H queries Q6, Q9, Q12, Q14, Q15, Q19 under MonetDB NUMA_Distr and Hyper]

Speed-up of Hyper over the MonetDB NUMA_Distr plans: 2.5x, 2x, 1.15x, 5.7x, 2.3x. Hyper generates NUMA-aware, LLVM JIT-compiled, fused operator pipeline plans.

SLIDE 22

Conclusion

• NUMA obliviousness fares reasonably well compared to NUMA awareness.

• Process and memory affinity helps NUMA-oblivious plans perform robustly.

• A simple distributed shared-nothing database configuration can compete with a state-of-the-art database system.

SLIDE 23

Thank you