Hardware Transactional Memory on Haswell EP (Viktor Leis, Technische Universität München)

SLIDE 1

Hardware Transactional Memory on Haswell EP

Viktor Leis

Technische Universität München

1 / 14

SLIDE 2

Introduction

◮ Intel’s new mid-level server platform: Haswell EP

◮ up to 18 cores per socket (up to 72 hardware threads with 2 sockets)

◮ supports hardware transactional memory (TSX)

SLIDE 3

Experimental Setup

◮ global fallback lock

  ◮ built-in Hardware Lock Elision (HLE)

  ◮ lock elision implemented using RTM, restarts and re-speculation

◮ workload

  ◮ Adaptive Radix Tree (trie, fanout 2-256), designed for main-memory database systems

  ◮ random lookups in tree with 64M entries

  ◮ 64M random inserts into (initially empty) tree
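The retry policy named on this slide (speculate, restart a bounded number of times, then take the global fallback lock) can be illustrated with a small model. This is only a sketch of the control flow, not real TSX code: `make_model_transaction`, `run_elided`, and the random abort model are hypothetical names and assumptions introduced here, and a real implementation would use the RTM instructions and rely on the hardware to roll back aborted work.

```python
import random
import threading

fallback_lock = threading.Lock()  # the single global fallback lock


def make_model_transaction(abort_probability, seed=0):
    """Hypothetical stand-in for a hardware transaction (not real TSX)."""
    rng = random.Random(seed)

    def try_transaction(critical_section):
        # A held fallback lock must abort speculating threads; the random
        # draw imitates aborts caused by conflicts or capacity overflows.
        if fallback_lock.locked() or rng.random() < abort_probability:
            return False  # aborted: the critical section has no effect
        critical_section()
        return True  # committed

    return try_transaction


def run_elided(critical_section, try_transaction, max_restarts):
    """Run critical_section speculatively; return (path, restarts_used)."""
    for restarts_used in range(max_restarts + 1):
        # Re-speculation: wait until the lock is free, then speculate again
        # instead of queueing up behind the current lock holder.
        while fallback_lock.locked():
            pass
        if try_transaction(critical_section):
            return "speculative", restarts_used
    # Every speculative attempt aborted: serialize on the fallback lock.
    with fallback_lock:
        critical_section()
    return "fallback", max_restarts
```

With `max_restarts=0` the model makes one speculative attempt and then falls back to the lock, which corresponds to a "0 restarts" configuration.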

SLIDE 4

Intel Xeon E5-2697 v3

◮ 14 cores (28 threads), 2.6GHz-3.6GHz, 35MB LLC

◮ 2 sockets

[Figure: die diagram showing cores 0-13, 14 L3 slices, two memory controllers, an internal link between the two rings, and the QPI interconnect to the other socket]
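The thread counts quoted in this deck follow from simple multiplication. The sketch below assumes 2 hardware threads per core, which is consistent with "14 cores (28 threads)" above; the function name is made up for illustration.

```python
def hardware_threads(cores_per_socket, sockets, threads_per_core=2):
    """Total hardware threads; threads_per_core=2 is an assumption."""
    return cores_per_socket * threads_per_core * sockets


# Largest Haswell EP configuration mentioned in the introduction.
print(hardware_threads(cores_per_socket=18, sockets=2))
# The two-socket E5-2697 v3 machine described on this slide.
print(hardware_threads(cores_per_socket=14, sockets=2))
```

For the test machine this gives 56 hardware threads, the largest thread count on the plot axes.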

SLIDE 5

Lookups with Locking

[Plot: lookup throughput in M ops/s (axis ticks 25, 50, 75) versus number of threads (1, 14, 28, 42, 56); series: atomic, no sync, rw_spin_lock]

SLIDE 6

Lookups with HTM

[Plot: lookup throughput in M ops/s (axis ticks 25, 50, 75) versus number of threads (1, 14, 28, 42, 56); series: no sync, built-in HLE, and 0, 1, 2, 3, and 7 or more restarts]

SLIDE 7

Random Inserts with HTM

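[Plot: insert throughput in M ops/s (axis ticks 30, 60, 90, 120) versus number of threads (1, 14, 28, 42, 56); series: malloc, pre-allocate, pre-allocate + memset, tcmalloc; speedup labels on the plot: 26.0x, 16.1x, 12.2x, 0.8x]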

SLIDE 8

HTM and NUMA

◮ lookup:

             1 thread   7 threads   speedup
1 cluster    9.2        53.0        5.8×
1 socket     5.4        36.0        6.7×
2 sockets    3.6        24.5        6.8×

◮ insert:

             1 thread   7 threads   speedup
1 cluster    5.3        30.6        5.8×
1 socket     4.3        26.8        6.2×
2 sockets    3.0        20.2        6.7×
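The speedup column can be recomputed from the two throughput columns as the 7-thread value divided by the 1-thread value. The sketch below uses the numbers from this slide; the slide does not state the unit (presumably M ops/s, as on the plot axes), and the parallel efficiency (speedup divided by 7) is an extra figure added here for illustration.

```python
# Throughput values copied from this slide: (1 thread, 7 threads).
measurements = {
    "lookup": {"1 cluster": (9.2, 53.0), "1 socket": (5.4, 36.0), "2 sockets": (3.6, 24.5)},
    "insert": {"1 cluster": (5.3, 30.6), "1 socket": (4.3, 26.8), "2 sockets": (3.0, 20.2)},
}


def speedup(one_thread, seven_threads):
    """Throughput with 7 threads relative to throughput with 1 thread."""
    return seven_threads / one_thread


for workload, rows in measurements.items():
    for placement, (t1, t7) in rows.items():
        s = speedup(t1, t7)
        # Efficiency is speedup per thread; 100% would be perfect scaling.
        print(f"{workload:6} {placement:10} speedup {s:.1f}x, efficiency {s / 7:.0%}")
```

Absolute throughput drops as the threads are spread over a socket or over both sockets, while the relative speedup from 1 to 7 threads stays between 5.8× and 6.8×.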

SLIDE 9

Conclusions

◮ Intel’s HTM implementation can scale to NUMA systems with many cores

◮ pitfalls at higher thread counts:

  ◮ built-in HLE does not scale

  ◮ lock elision with 20 restarts and re-speculation should be used instead

  ◮ even infrequent kernel traps or system calls can be a problem at higher thread counts (Amdahl’s Law)
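The reference to Amdahl's Law can be made concrete with a short calculation. The sketch below uses the standard formula, speedup = 1 / (s + (1 - s) / n), for a serial fraction s and n threads; the 1% serial share is an assumed value chosen for illustration, not a measurement from this talk.

```python
def amdahl_speedup(threads, serial_fraction):
    """Upper bound on speedup when serial_fraction of the work cannot run in parallel."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / threads)


# Assumed example: 1% of the time is spent serialized (e.g. in kernel traps).
for threads in (1, 14, 28, 56):
    print(threads, round(amdahl_speedup(threads, serial_fraction=0.01), 1))
```

Under this assumption, 56 threads reach a speedup of roughly 36 rather than 56, so a serial share that is negligible at low thread counts becomes a noticeable limit at high ones.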
