SLIDE 1

Applying HTM to an OLTP System: No Free Lunch

David Cervini, Danica Porobic, Pınar Tözün, Anastasia Ailamaki

SLIDE 2

Why Hardware Transactional Memory?

  • Multicores are here to stay
  • Lock-based and lock-free programming is hard
  • Transactional memory should ease programming
  • Software transactional memory is not fast enough

Very promising for synchronization-heavy software

SLIDE 3

Why HTM and OLTP?

Many critical sections even for a simple transaction

[Figure: number of critical sections (CSs) per transaction (axis 10 to 60) for Shore-MT retrieving 1 row, broken down by component: other, xct manager, logging, buffer pool, catalog, latching, locking]

SLIDE 4

Promise:

    – HTM simplifies lock-free programming
    – Shore-MT relies on fine-grained locking
    – Expect performance improvement

A match made in heaven?

SLIDE 5

Transactional Synchronization eXtensions (TSX)

  • On Intel’s Haswell
  • RTM (Restricted Transactional Memory):
    – Directly uses TM, more flexible
    – _xbegin, _xabort, _xend, _xtest
    – Requires new implementation
  • HLE (Hardware Lock Elision):
    – Speculative execution of existing locking code
    – __ATOMIC_HLE_ACQUIRE or __ATOMIC_HLE_RELEASE
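
As a rough illustration of the HLE route, the sketch below elides a simple test-and-set spinlock using GCC's `__atomic` builtins together with the two HLE flags named above. The class name `HleSpinLock`, the helper `cpu_relax`, and the `HLE_ACQ` / `HLE_REL` macros (including their fallback for compilers that do not define the HLE flags) are assumptions made for this example; none of them come from Shore-MT.

```cpp
#include <cassert>

// Combine the ordinary memory order with the HLE hint when the compiler
// provides it (GCC on x86); otherwise build a plain spinlock. (Assumption:
// a GCC- or Clang-compatible compiler with the __atomic builtins.)
#ifdef __ATOMIC_HLE_ACQUIRE
#define HLE_ACQ (__ATOMIC_ACQUIRE | __ATOMIC_HLE_ACQUIRE)
#define HLE_REL (__ATOMIC_RELEASE | __ATOMIC_HLE_RELEASE)
#else
#define HLE_ACQ __ATOMIC_ACQUIRE
#define HLE_REL __ATOMIC_RELEASE
#endif

// Hypothetical helper: PAUSE on x86 (which also ends a failed elision
// attempt), a no-op elsewhere.
inline void cpu_relax() {
#if defined(__x86_64__) || defined(__i386__)
    __builtin_ia32_pause();
#endif
}

class HleSpinLock {
public:
    void lock() {
        // With the HLE hint, the write of 1 can be elided and the critical
        // section executed speculatively.
        while (__atomic_exchange_n(&word_, 1, HLE_ACQ))
            cpu_relax();
    }

    void unlock() {
        // Restores the lock word to its pre-acquire value, as elision requires.
        __atomic_store_n(&word_, 0, HLE_REL);
    }

    bool is_locked() const {
        return __atomic_load_n(&word_, __ATOMIC_ACQUIRE) != 0;
    }

private:
    int word_ = 0;
};
```

On hardware without HLE the prefixes are ignored, so the same binary behaves as an ordinary test-and-set lock.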

SLIDE 6

TSX in a nutshell

  • Uses cache coherency of L1 cache
  • Tracks data at cache line granularity

  • Aborts on conflict, plus capacity and miscellaneous aborts
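
Because data is tracked per cache line, two unrelated variables that happen to share a line can abort each other's transactions. The sketch below assumes 64-byte lines (the line size on Haswell) and uses hypothetical names (`kCacheLine`, `line_of`, `same_cache_line`, `PackedCounters`, `PaddedCounter`) to show one common mitigation: giving hot data its own line.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // assumed cache line size in bytes

// Index of the cache line that an address falls into.
inline std::uintptr_t line_of(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) / kCacheLine;
}

inline bool same_cache_line(const void* a, const void* b) {
    return line_of(a) == line_of(b);
}

// Two counters packed next to each other: with this alignment they share a
// line, so a write to one conflicts with a transaction touching the other.
struct alignas(kCacheLine) PackedCounters {
    long a;
    long b;
};

// One counter per line: alignment pads the struct out to a full line.
struct alignas(kCacheLine) PaddedCounter {
    long value;
};
```

Padding trades memory for fewer false conflicts; it does nothing for real data conflicts, where two threads update the same value.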

SLIDE 7

Experimental platform

  • Software:
    – Shore-MT
    – TM-1 benchmark: GetSubscriberData
    – From 1 to 8 workers
    – SLI enabled
    – 80,000-row dataset
  • Hardware:
    – Intel i7-4770 3.4 GHz 4-core processor, hyperthreading on
    – 16 GB RAM

SLIDE 8

Which lock types are used?

Many lock types for different use-cases

[Figure: number of CSs per transaction (axis 10 to 60) for Shore-MT GetSubscriberData, broken down by component (other, xct manager, logging, buffer pool, catalog, latching, locking) and by lock type (occ_rwlock, tatas, srwlock, mcs)]

SLIDE 9

HLE pthread instead of occ_rwlock

[Figure: throughput (KTps) vs. number of threads (1 to 8); series: baseline, pthread_rwlock]

No impact: pthread implementation is limited

SLIDE 10

RTM-enabled lock example: acquire

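    // Needs <immintrin.h> (RTM intrinsics, _mm_pause) and <cstdlib> (random,
    // RAND_MAX); build with RTM support enabled (e.g. -mrtm on GCC).
    void occ_rwlock::acquire() {
    #ifdef OCC_RWLOCK_RTM_WRAPPER
        unsigned int status;
        // Try the transactional path at most twice, then fall back.
        for (int i = 0; i < 2; i++) {
            if ((status = _xbegin()) == _XBEGIN_STARTED) {
                // Inside the transaction: abort explicitly if the check fails.
                if (has_reader()) {
                    _xabort(0xff);
                }
                return;
            } else if ((status & _XABORT_EXPLICIT) && _XABORT_CODE(status) == 0xff) {
                // Our own explicit abort: wait for _active_count to drop to zero.
                while (__atomic_load_n(&_active_count, __ATOMIC_ACQUIRE))
                    _mm_pause();
            } else if (status & _XABORT_CONFLICT) {
                // Data conflict: back off for a short random interval.
                long int backoff = 10 * random() / RAND_MAX;
                while (backoff--)
                    _mm_pause();
            } else if (status & _XABORT_RETRY) {
                _mm_pause();
            } else {
                break;  // any other abort cause: use the original lock
            }
        }
    #endif
        /** original acquire code here **/
    }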

  • Retry transaction multiple times
  • Tune retry policies
  • Avoid Lemming Effect

RTM preferable for new code for flexibility
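
The retry policy in `acquire()` can be pulled out into a pure function, which makes it easier to tune and to exercise on machines without TSX. The sketch below is one possible factoring, not the authors' code: the `kAbort*` constants are local copies of the `_XABORT_*` status bits from `<immintrin.h>`, and `RetryAction`, `abort_code`, `kLockBusyCode`, and `decide` are hypothetical names introduced here.

```cpp
#include <cassert>
#include <cstdint>

// Local copies of the RTM abort-status bits, redefined so this compiles
// without RTM support.
constexpr std::uint32_t kAbortExplicit = 1u << 0;  // _XABORT_EXPLICIT
constexpr std::uint32_t kAbortRetry    = 1u << 1;  // _XABORT_RETRY
constexpr std::uint32_t kAbortConflict = 1u << 2;  // _XABORT_CONFLICT
constexpr std::uint32_t kAbortCapacity = 1u << 3;  // _XABORT_CAPACITY

// Same extraction as _XABORT_CODE: the explicit abort code is in bits 31:24.
constexpr std::uint32_t abort_code(std::uint32_t status) {
    return (status >> 24) & 0xffu;
}

// Code passed to _xabort() in the slide's acquire().
constexpr std::uint32_t kLockBusyCode = 0xffu;

enum class RetryAction {
    WaitForLock,  // spin until the lock is free, then retry
    BackOff,      // random backoff, then retry
    RetryNow,     // short pause, then retry
    FallBack      // give up on the transaction and take the real lock
};

// Mirrors the if/else chain in acquire(), in the same order.
inline RetryAction decide(std::uint32_t status) {
    if ((status & kAbortExplicit) && abort_code(status) == kLockBusyCode)
        return RetryAction::WaitForLock;
    if (status & kAbortConflict)
        return RetryAction::BackOff;
    if (status & kAbortRetry)
        return RetryAction::RetryNow;
    return RetryAction::FallBack;
}
```

The `WaitForLock` case is the one aimed at the lemming effect: retrying only once the lock is free keeps a single fallback from pushing every other thread onto the real lock.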

SLIDE 11

RTM-enabled lock example: release

Ending a transaction requires no tuning

    void occ_rwlock::release() {
    #ifdef OCC_RWLOCK_RTM_WRAPPER
        // Commit only when running inside a transaction (_xtest() is nonzero).
        if (!has_reader() && _xtest()) {
            _xend();
            return;
        }
    #endif
        /** original release code here **/
    }

Must be inside a transaction
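
Since the transaction is only committed through `release()`, every `acquire()` needs a matching `release()` on every path out of the critical section, including early returns. A conventional C++ way to get that is a scoped guard. The sketch below is generic: `ScopedAcquire`, `CountingLock`, and `guarded_work` are hypothetical names, and `CountingLock` merely stands in for a lock with `acquire()` / `release()` methods such as `occ_rwlock`.

```cpp
#include <cassert>

// Calls acquire() on construction and release() on destruction.
template <typename Lock>
class ScopedAcquire {
public:
    explicit ScopedAcquire(Lock& lock) : lock_(lock) { lock_.acquire(); }
    ~ScopedAcquire() { lock_.release(); }

    ScopedAcquire(const ScopedAcquire&) = delete;
    ScopedAcquire& operator=(const ScopedAcquire&) = delete;

private:
    Lock& lock_;
};

// Stand-in lock that only counts calls, so the pairing can be checked.
struct CountingLock {
    int acquired = 0;
    int released = 0;
    void acquire() { ++acquired; }
    void release() { ++released; }
};

// Example critical section with an early return.
inline int guarded_work(CountingLock& lock, bool early_exit) {
    ScopedAcquire<CountingLock> guard(lock);
    if (early_exit)
        return -1;  // the guard still calls release() here
    return 42;
}
```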

SLIDE 12

RTM locks: good & bad news

[Figure: throughput (KTps) vs. number of threads (1 to 8); series: baseline; occ, tatas, mcs]

HTM improves throughput by 13-18%

[Figure: throughput (KTps) vs. number of threads (1 to 8); series: baseline; occ, tatas, mcs, srwlock]

Improvement is not guaranteed

SLIDE 13

Reason: aborts

Hyper-threading causes capacity aborts

Real data conflicts cause high abort rates

[Figure: two panels, percent of aborts vs. number of threads (1 to 8); series: total, capacity, conflict]
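
As a sketch of how per-cause abort percentages like these can be computed, the snippet below tallies started transactions and aborts by cause. It assumes the percentage is taken over started transactions, which the slides do not state; the struct and field names are hypothetical, and any numbers fed into it for illustration are not measurements from the talk.

```cpp
#include <cassert>
#include <cstdint>

struct AbortStats {
    std::uint64_t started = 0;   // transactions begun
    std::uint64_t capacity = 0;  // aborts flagged as capacity
    std::uint64_t conflict = 0;  // aborts flagged as conflict
    std::uint64_t other = 0;     // everything else

    std::uint64_t total_aborts() const { return capacity + conflict + other; }

    // part / whole in percent; 0 when nothing has been started yet.
    static double percent(std::uint64_t part, std::uint64_t whole) {
        if (whole == 0) return 0.0;
        return 100.0 * static_cast<double>(part) / static_cast<double>(whole);
    }

    double total_pct() const { return percent(total_aborts(), started); }
    double capacity_pct() const { return percent(capacity, started); }
    double conflict_pct() const { return percent(conflict, started); }
};
```

In an RTM wrapper, `started` would be incremented before each `_xbegin()` and the cause counters updated from the returned status on abort.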

SLIDE 14

Coarse grained B-tree lock

Throughput drops by up to 73%

Large critical sections make aborts very expensive

[Figure: throughput (KTps) vs. number of threads (1 to 8); series: baseline, coarse-grained lock]

[Figure: percent of aborts vs. number of threads (1 to 8); series: total, capacity, conflict]
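
One way to reason about why a large critical section is risky is to compare the memory it touches with what the L1 data cache can hold. The sketch below assumes a 32 KB L1 data cache with 64-byte lines (the Haswell configuration) and, as a deliberate simplification, that two hyper-threads on a core split it evenly. The function names and the even-split model are assumptions made for illustration, not results from the talk.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kLineBytes = 64;         // assumed cache line size
constexpr std::size_t kL1Bytes   = 32 * 1024;  // assumed L1 data cache size

// Cache lines available to one hardware thread under the even-split model.
constexpr std::size_t lines_per_thread(bool hyperthreading) {
    return (kL1Bytes / kLineBytes) / (hyperthreading ? 2 : 1);
}

// Cache lines covered by `bytes` of contiguous, line-aligned data.
constexpr std::size_t lines_touched(std::size_t bytes) {
    return (bytes + kLineBytes - 1) / kLineBytes;
}

// Optimistic check: true if the footprint could fit in this thread's share.
constexpr bool may_fit(std::size_t bytes, bool hyperthreading) {
    return lines_touched(bytes) <= lines_per_thread(hyperthreading);
}
```

This is an upper bound only: set-associativity can evict a line well before the cache is full, so a section that passes `may_fit` can still suffer capacity aborts.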

SLIDE 15

Applying HTM to an OLTP system

  • Promise
    – TSX democratizes lock-free programming
    – Shore-MT relies on fine-grained locking
    – Possible match made in heaven
  • Reality
    – Low-hanging fruit: TSX is great for short critical sections
    – Requires tuning; not always beneficial
    – Cannot be used for large code sections
    – Realizing full benefits requires system redesign

Thank you!