NON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY
Tim Harris, 14 November 2014


  1. NON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY Tim Harris, 14 November 2014

  2. Lecture 6 • Introduction • Amdahl’s law • Basic spin-locks • Queue-based locks • Hierarchical locks • Reader-writer locks • Reading without locking • Flat combining

  3. Overview • Building shared memory data structures: lists, queues, hashtables, … • Why? • Used directly by applications (e.g., in C/C++, Java, C#, …) • Used in the language runtime system (e.g., management of work, implementations of message passing, …) • Used in traditional operating systems (e.g., synchronization between top/bottom-half code) • Why not? • Don’t think of “threads + shared data structures” as a default/good/complete/desirable programming model • It’s better to have shared memory and not need it…

  4. What do we care about? • Ease to write: Does it matter? Who is the target audience? How much effort can they put into it? Is implementing a data structure an undergrad programming exercise? …or a research paper? • Correctness: What does it mean to be correct? e.g., if multiple concurrent threads are using iterators on a shared data structure at the same time? • When can it be used? Between threads in the same process? Between processes sharing memory? Within an interrupt handler? With/without some kind of runtime system support? • How fast is it? Suppose I have a sequential implementation (no concurrency control at all): is the new implementation 5% slower? 5x slower? 100x slower? • How well does it scale? How does performance change as we increase the number of threads? When does the implementation add or avoid synchronization?

  5. What do we care about? • Ease to write • Correctness • When can it be used? • How fast is it? • How well does it scale?

  6. What do we care about? Be explicit about goals and trade-offs: 1. A benefit in one dimension often has costs in another • Does a perf increase prevent a data structure being used in some particular setting? • Does a technique to make something easier to write make the implementation slower? • Do we care? It depends on the setting. 2. Remember, parallel programming is rarely a recreational activity • The ultimate goal is to increase perf (time, or resources used) • Does an implementation scale well enough to out-perform a good sequential implementation?

  7. Suggested reading • “The art of multiprocessor programming”, Herlihy & Shavit – excellent coverage of shared memory data structures, from both practical and theoretical perspectives • “Transactional memory, 2nd edition”, Harris, Larus, Rajwar – recently revamped survey of TM work, with 350+ references • “NOrec: streamlining STM by abolishing ownership records”, Dalessandro, Spear, Scott, PPoPP 2010 • “Simplifying concurrent algorithms by exploiting transactional memory”, Dice, Lev, Marathe, Moir, Nussbaum, Olszewski, SPAA 2010 • Intel “Haswell” spec for SLE (speculative lock elision) and RTM (restricted transactional memory)

  8. Amdahl’s law

  9. Amdahl’s law • “Sorting takes 70% of the execution time of a sequential program. You replace the sorting algorithm with one that scales perfectly on multi-core hardware. On a machine with n cores, how many cores do you need to use to get a 4x speed-up on the overall algorithm?”

  10. Amdahl’s law, f=70% [Chart: speedup vs. #cores, 1–16. The speedup achieved (perfect scaling on the 70%) rises from 1.0 but flattens out well below the desired 4x line.]

  11. Amdahl’s law, f=70% • speedup(f, c) = 1 / ((1 − f) + f/c) • f = fraction of code the speedup applies to • c = number of cores used

  12. Amdahl’s law, f=70% [Chart: speedup vs. #cores, 1–16, with the desired 4x line. The curve approaches its limit as c → ∞ of 1/(1−f) = 3.33, so the desired 4x speedup is unreachable.]

  13. Amdahl’s law, f=10% [Chart: speedup achieved vs. #cores, 1–16, with perfect scaling on the 10%. The Amdahl’s law limit is just 1.11x.]

  14. Amdahl’s law, f=98% [Chart: speedup vs. #cores, 1–128. With f = 98% the speedup keeps climbing across the whole range, heading for a limit of 1/(1−f) = 50x.]

  15. Amdahl’s law & multi-core • Suppose that the same h/w budget (space or power) can make us: [Diagram: 16 small cores, or 4 medium cores, or 1 big core, all built from the same 16-unit budget]

  16. Perf of big & small cores • Assumption: perf = α √resource [Chart: core perf (relative to 1 big core) vs. resources dedicated to the core, 1/16 up to 1. One big core: total perf 1 × 1 = 1. Sixteen 1/16-size cores: total perf 16 × 1/4 = 4.]

  17. Amdahl’s law, f=98% [Chart: perf (relative to 1 big core) vs. #cores, 1–16. With almost everything parallel, 16 small cores come out best, ahead of 4 medium; 1 big stays at 1.0.]

  18. Amdahl’s law, f=75% [Chart: perf (relative to 1 big core) vs. #cores. Here 4 medium cores ultimately edge out 1 big, and 16 small trail behind.]

  19. Amdahl’s law, f=5% [Chart: with almost no parallelism, 1 big core is best at 1.0; 4 medium and 16 small both fall well below it.]

  20. Asymmetric chips [Diagram: one big core built from 4 small-core units of the budget, plus 12 small cores: the “1+12” configuration]

  21. Amdahl’s law, f=75% [Chart: perf (relative to 1 big core) vs. #cores. The asymmetric 1+12 configuration is best, at about 1.4x, ahead of 4 medium, 1 big (1.0), and 16 small.]

  22. Amdahl’s law, f=5% [Chart: 1 big core is best at 1.0; 4 medium and 1+12 manage only about half of that, and 16 small are worst.]

  23. Amdahl’s law, f=98% [Chart: the 1+12 configuration wins, slightly ahead of 16 small and well ahead of 4 medium; 1 big trails at 1.0.]

  24. Amdahl’s law, f=98% [Chart: speedup (relative to 1 big core) vs. #cores, for a larger budget: a 1+192 asymmetric configuration reaches about 8x, well ahead of 256 small cores at about 2x.]

  25. Amdahl’s law, f=98% [Chart: as slide 24, 1+192 vs. 256 small, with an extra curve showing the effect of leaving the larger core idle in the parallel section.]

  26. Basic spin-locks

  27. Test and set (pseudo-code) • b is a pointer to a location holding a boolean value (TRUE/FALSE); the read and the write happen atomically, as a single operation:

    bool testAndSet(bool *b) {
      bool result = *b;   // read the current contents of the location b points to…
      *b = TRUE;          // …set the contents of *b to TRUE
      return result;
    }

  28. Test and set • Suppose two threads use it at once: Thread 1’s testAndSet(b) and Thread 2’s testAndSet(b) return different results (one true, one false). Even if the calls overlap in time, exactly one caller gets back FALSE, and so knows it was the one that set the flag.

  29. Test and set lock • lock: FALSE • FALSE => lock available, TRUE => lock held • Each call to testAndSet tries to acquire the lock, returning TRUE if it is already held • NB: all this is pseudo-code, assuming SC memory

    void acquireLock(bool *lock) {
      while (testAndSet(lock)) {
        /* Nothing */
      }
    }

    void releaseLock(bool *lock) {
      *lock = FALSE;
    }

  30. Test and set lock • lock: FALSE → TRUE • Thread 1 and Thread 2 race in acquireLock (code as on slide 29): the first testAndSet to run flips lock from FALSE to TRUE and returns FALSE, so that thread enters; the other thread’s testAndSet returns TRUE and it spins until releaseLock stores FALSE.

  31. What are the problems here? • The testAndSet implementation causes contention

  32. Contention from testAndSet [Diagram: two single-threaded cores, each with its own L1 and L2 cache, above a shared main memory]

  33. Multi-core h/w – separate L2 • testAndSet(k) [Diagram: one core executes testAndSet(k); the cache line holding k is fetched in exclusive mode into that core’s L2 and L1]

  34. Multi-core h/w – separate L2 • testAndSet(k) [Diagram: the other core now executes testAndSet(k); because testAndSet writes, the line migrates across in exclusive mode, invalidating the first core’s copies]

  35. Multi-core h/w – separate L2 • testAndSet(k) [Diagram: the line holding k keeps bouncing between the two caches] • Does this still happen in practice? Do modern CPUs avoid fetching the line in exclusive mode on failing TAS?

  36. What are the problems here? • The testAndSet implementation causes contention • No control over locking policy • Only supports mutual exclusion: not reader-writer locking • Spinning may waste resources while waiting
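One classic mitigation for the contention problem, not shown on this slide, is test-and-test-and-set: spin on an ordinary read, which lets the line sit in shared state in every waiter’s cache, and only attempt the invalidating exchange when the lock looks free. A sketch, again using C11 atomics:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Test-and-test-and-set acquire: waiters spin on a plain load, so the
   cache line stays shared while the lock is held; the write (exchange)
   is attempted only when the lock appears free. */
static void acquireLockTATAS(atomic_bool *lock) {
    for (;;) {
        while (atomic_load_explicit(lock, memory_order_relaxed)) {
            /* Spin reading: no cache-line invalidations while held. */
        }
        if (!atomic_exchange_explicit(lock, true, memory_order_acquire))
            return;   /* old value was FALSE: we acquired the lock */
    }
}

static void releaseLockTATAS(atomic_bool *lock) {
    atomic_store_explicit(lock, false, memory_order_release);
}
```

This removes the steady stream of exclusive-mode fetches from failing testAndSet calls, though all waiters still stampede when the lock is released; the queue-based locks listed in the lecture outline address that.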

  37. General problem • No logical conflict between two failed lock acquires • Cache protocol introduces a physical conflict • For a good algorithm: only introduce physical conflicts if a logical conflict occurs • In a lock: successful lock-acquire & failed lock-acquire • In a set: successful insert(10) & failed insert(10) • But not: • In a lock: two failed lock acquires • In a set: successful insert(10) & successful insert(20) • In a non-empty queue: enqueue on the left and remove on the right
