  1. Applying HTM to an OLTP System: No Free Lunch
     David Cervini, Danica Porobic, Pınar Tözün, Anastasia Ailamaki

  2. Why Hardware Transactional Memory?
     • Multicores are here to stay
     • Lock-based and lock-free programming is hard
     • Transactional memory should ease programming
     • Software transactional memory is not fast enough
     Very promising for synchronization-heavy software.

  3. Why HTM and OLTP?
     [Figure: Shore-MT, retrieving 1 row. Number of critical sections (CSs) per transaction, broken down by component: xct manager, logging, buffer pool, catalog, latching, locking, other (y-axis 0 to 60).]
     Many critical sections even for a simple transaction.

  4. A match made in heaven?
     Promise:
     – HTM simplifies lock-free programming
     – Shore-MT relies on fine-grained locking
     – Expect a performance improvement

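  5. Transactional Synchronization eXtensions (TSX)
     • Available on Intel’s Haswell processors
     • RTM (Restricted Transactional Memory):
       – Exposes transactions directly to the programmer; more flexible
       – Intrinsics: _xbegin, _xabort, _xend, _xtest
       – Requires new code, including a non-transactional fallback path
     • HLE (Hardware Lock Elision):
       – Speculatively executes existing locking code
       – In GCC, enabled by adding __ATOMIC_HLE_ACQUIRE or __ATOMIC_HLE_RELEASE to the memory-order argument of the __atomic builtins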

  6. TSX in a nutshell
     • Transactions abort on data conflicts, on capacity overflow, and for miscellaneous other reasons
     • Uses the cache-coherence mechanism of the L1 cache to detect conflicts
     • Tracks data at cache-line granularity
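
Because conflicts are detected per cache line, two threads that write different variables stored on the same line still abort each other: false sharing becomes a false conflict. The sketch below is an illustration, not part of Shore-MT or the TSX API. It assumes 64-byte lines (the line size on Haswell), and line_of, same_cache_line, lines_spanned, Packed, and Padded are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache-line size: 64 bytes on Haswell and most current x86 CPUs.
constexpr std::size_t kCacheLine = 64;

// Index of the cache line containing address p.
std::uintptr_t line_of(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) / kCacheLine;
}

// With line-granularity tracking, two accesses collide whenever they land
// on the same line, even if they touch different variables.
bool same_cache_line(const void* a, const void* b) {
    return line_of(a) == line_of(b);
}

// Number of lines touched when accessing n bytes starting at p.
std::size_t lines_spanned(const void* p, std::size_t n) {
    assert(n > 0);
    const char* last = static_cast<const char*>(p) + (n - 1);
    return static_cast<std::size_t>(line_of(last) - line_of(p) + 1);
}

// Two counters on one line: a writer of `a` and a writer of `b` conflict.
struct alignas(kCacheLine) Packed {
    long a;
    long b;
};

// Each counter on its own line: writers of `a` and `b` do not conflict.
struct Padded {
    alignas(kCacheLine) long a;
    alignas(kCacheLine) long b;
};
```

Padding independently written hot fields onto separate lines, as Padded does, is the usual remedy; it trades memory for fewer spurious aborts.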

  7. Experimental platform
     • Software:
       – Shore-MT
       – TM-1 benchmark: GetSubscriberData transaction
       – 1 to 8 worker threads
       – SLI (speculative lock inheritance) enabled
       – 80,000-row dataset
     • Hardware:
       – Intel Core i7-4770, 3.4 GHz, 4 cores, Hyper-Threading enabled
       – 16 GB RAM

  8. Which lock types are used?
     [Figure: Shore-MT, GetSubscriberData. Number of critical sections per transaction, broken down by component (xct manager, logging, buffer pool, catalog, latching, locking, other) and annotated with the lock types occ_rwlock, mcs, srwlock, and tatas (y-axis 0 to 60).]
     Many lock types for different use cases.
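
Of the lock types above, tatas presumably names a test-and-test-and-set spinlock, the simplest of the four. For reference, here is a generic TATAS lock on std::atomic. It is a textbook sketch, not Shore-MT's tatas class; the name TatasLock is hypothetical, and it yields while spinning for portability where x86 code would more typically call _mm_pause().

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Test-and-test-and-set spinlock (generic sketch, not Shore-MT's tatas).
class TatasLock {
public:
    void lock() {
        for (;;) {
            // Test: spin on a load, which keeps the line shared in the local
            // cache instead of bouncing it between cores with writes.
            while (locked_.load(std::memory_order_relaxed)) {
                std::this_thread::yield();
            }
            // Test-and-set: exchange returns the previous value, so false
            // means this thread acquired the lock.
            if (!locked_.exchange(true, std::memory_order_acquire)) {
                return;
            }
        }
    }

    // A single attempt that never spins.
    bool try_lock() {
        return !locked_.load(std::memory_order_relaxed) &&
               !locked_.exchange(true, std::memory_order_acquire);
    }

    void unlock() {
        assert(locked_.load(std::memory_order_relaxed) && "unlock of a free lock");
        locked_.store(false, std::memory_order_release);
    }

private:
    std::atomic<bool> locked_{false};
};
```

Because it provides lock, try_lock, and unlock, TatasLock also works with std::lock_guard and std::unique_lock.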

  9. HLE: pthread_rwlock instead of occ_rwlock
     [Figure: throughput (KTps) vs. number of threads (1 to 8), baseline vs. pthread_rwlock (y-axis 0 to 800).]
     No impact: the pthread implementation is limited.

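  10. RTM-enabled lock example: acquire

      #include <immintrin.h>  // _xbegin, _xabort, _XABORT_*, _mm_pause (compile with -mrtm)
      #include <cstdlib>      // random, RAND_MAX

      void occ_rwlock::acquire() {
      #ifdef OCC_RWLOCK_RTM_WRAPPER
          unsigned int status;
          // Retry the transaction a bounded number of times.
          for (int i = 0; i < 2; i++) {
              if ((status = _xbegin()) == _XBEGIN_STARTED) {
                  if (has_reader()) {
                      _xabort(0xff);  // lock is held: abort with a recognizable code
                  }
                  return;  // lock elided: run the critical section speculatively
              }
              // Aborted: tune the retry policy to the cause reported in status.
              if ((status & _XABORT_EXPLICIT) && _XABORT_CODE(status) == 0xff) {
                  // Avoid the lemming effect: wait until the lock is free
                  // before speculating again.
                  while (__atomic_load_n(&_active_count, __ATOMIC_ACQUIRE))
                      _mm_pause();
              } else if (status & _XABORT_CONFLICT) {
                  // Data conflict: randomized back-off of 0 to 10 pauses
                  // (floating point avoids overflow where long is 32 bits).
                  long backoff = (long)(10.0 * random() / RAND_MAX);
                  while (backoff--)
                      _mm_pause();
              } else if (status & _XABORT_RETRY) {
                  _mm_pause();  // hardware hints that a retry may succeed
              } else {
                  break;  // e.g. capacity abort: retrying will not help
              }
          }
      #endif
          /** original acquire code here (non-transactional fallback) **/
      }

      RTM is preferable for new code because it is more flexible.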
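
The branches in acquire() form a retry policy keyed on the status word that _xbegin() returns after an abort. One way to make such a policy easier to tune and to unit-test without TSX hardware is to pull the decision into a pure function, as in this sketch. The constant names, Action, and on_abort are hypothetical; the bit values follow the documented RTM abort-status layout (bit 0 explicit, bit 1 retry, bit 2 conflict, bit 3 capacity, bits 31:24 the _xabort code), and real code should use the <immintrin.h> macros instead.

```cpp
#include <cstdint>

// Local copies of the RTM abort-status bits; values match the _XABORT_*
// macros in <immintrin.h>, redefined here so the sketch builds anywhere.
constexpr std::uint32_t kAbortExplicit = 1u << 0;  // _XABORT_EXPLICIT
constexpr std::uint32_t kAbortRetry    = 1u << 1;  // _XABORT_RETRY
constexpr std::uint32_t kAbortConflict = 1u << 2;  // _XABORT_CONFLICT
constexpr std::uint32_t kAbortCapacity = 1u << 3;  // _XABORT_CAPACITY

// Code passed to _xabort(), stored in bits 31:24 (like _XABORT_CODE).
constexpr std::uint32_t abort_code(std::uint32_t status) {
    return (status >> 24) & 0xFFu;
}

enum class Action {
    WaitForLockThenRetry,  // explicit abort 0xff: lock held, wait until free
    BackoffThenRetry,      // data conflict: randomized back-off
    PauseThenRetry,        // hardware hints that a retry may succeed
    FallBackToLock,        // capacity and all other aborts
};

// Same branch order as acquire(): the first matching cause wins.
Action on_abort(std::uint32_t status) {
    if ((status & kAbortExplicit) && abort_code(status) == 0xFFu)
        return Action::WaitForLockThenRetry;
    if (status & kAbortConflict)
        return Action::BackoffThenRetry;
    if (status & kAbortRetry)
        return Action::PauseThenRetry;
    return Action::FallBackToLock;
}
```

With this split, acquire() would switch on on_abort(status) and keep the pause and back-off loops at the call site, so a policy change touches only on_abort.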

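  11. RTM-enabled lock example: release

      void occ_rwlock::release() {
      #ifdef OCC_RWLOCK_RTM_WRAPPER
          // _xend() must only be called inside a transaction; _xtest()
          // reports whether this thread is in one (i.e. the lock was elided).
          if (_xtest() && !has_reader()) {
              _xend();  // commit the speculative critical section
              return;
          }
      #endif
          /** original release code here **/
      }

      Ending a transaction requires no tuning.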

  12. RTM locks: good & bad news
      [Figure, two panels: throughput (KTps) vs. number of threads (1 to 8), y-axis 0 to 800. Left: baseline vs. RTM applied to occ, tatas, mcs. Right: baseline vs. RTM applied to occ, tatas, mcs, srwlock.]
      HTM improves throughput by 13-18%, but the improvement is not guaranteed.

  13. Reason: aborts
      [Figure, two panels: percent of aborts (total, capacity, conflict) vs. number of threads (1 to 8), y-axis 0 to 25%.]
      • Hyper-threading causes capacity aborts
      • Real data conflicts cause high abort rates
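
Hyper-threads on the same core share its L1 data cache, which is consistent with the slide's observation that hyper-threading causes capacity aborts. A rough way to reason about this is a set-associative fit check. This is a simplified model under stated assumptions, not how the hardware decides: it assumes Haswell's 32 KiB, 8-way L1 data cache with 64-byte lines (64 sets), considers only the write set, and uses a hypothetical ways_available parameter to stand in for ways occupied by the sibling hyper-thread. write_set_fits is an illustrative name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Assumed Haswell L1 data cache: 32 KiB, 8-way, 64-byte lines -> 64 sets.
constexpr std::size_t kLineBytes = 64;
constexpr std::size_t kWays = 8;
constexpr std::size_t kSets = 32 * 1024 / (kLineBytes * kWays);  // 64

// Rough model: a write set fits if no cache set needs more distinct lines
// than the ways available to this transaction. Ignores the read set, the
// replacement policy, and other data competing for the same sets.
bool write_set_fits(const std::vector<std::uintptr_t>& write_addrs,
                    std::size_t ways_available = kWays) {
    assert(ways_available > 0 && ways_available <= kWays);
    std::unordered_set<std::uintptr_t> seen_lines;
    std::unordered_map<std::size_t, std::size_t> lines_per_set;
    for (std::uintptr_t addr : write_addrs) {
        std::uintptr_t line = addr / kLineBytes;
        if (!seen_lines.insert(line).second) continue;  // line already counted
        std::size_t set = static_cast<std::size_t>(line % kSets);
        if (++lines_per_set[set] > ways_available) return false;  // set overflows
    }
    return true;
}
```

Under this model, 8 lines that map to the same set fit when all 8 ways are available but overflow when only 4 are, so a footprint that succeeds on an idle core can abort once a sibling thread competes for the same sets.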

  14. Coarse-grained B-tree lock
      [Figure, two panels vs. number of threads (1 to 8). Left: throughput (KTps), baseline vs. coarse-grained lock, y-axis 0 to 800. Right: percent of aborts (total, capacity, conflict), y-axis 0 to 25%.]
      Throughput drops by up to 73%: large critical sections make aborts very expensive.

  15. Applying HTM to an OLTP system
      • Promise
        – TSX democratizes lock-free programming
        – Shore-MT relies on fine-grained locking
        – Possibly a match made in heaven
      • Reality
        – Low-hanging fruit: TSX is great for short critical sections
        – Requires tuning; not always beneficial
        – Cannot be used for large code sections
        – Realizing the full benefits requires a system redesign
      Thank you!
