
The moment of truth: are we done with STM?


  1. The moment of truth: are we done with STM?
     Nuno Diegues, Paolo Romano, Luís Rodrigues (ndiegues@gsd.inesc-id.pt)
     Slide footer: Nuno Diegues, 1/27

  4. Over 20 years of Transactional Memory
       ◮ Commodity processors with hardware support
       ◮ Processors by IBM (BG/Q and zEC12) and Intel (Haswell)

  8. The question
     Raise the question: are we done with STM?
       + Hardware ought to be faster
       + Transparency and ease of use
       - Research in STMs has evolved into a mature state
       - Limited nature of hardware
     What else is there to find?

  9. Outline
       1 (Quick) Motivation
       2 Study Description
       3 Compared Techniques
       4 Results and Insights
       5 Summary of Conclusions

  15. Study
      Commodity hardware in Intel TSX
        ◮ IBM processors target high performance computing
        ◮ Intel Haswell Xeon E3-1275v3 3.5GHz (3.9GHz Turbo)
        ◮ 4 cores, 8 hardware threads (via hyper-threading)
        ◮ 4x32KB L1 caches, 4x256KB L2 caches, 8MB L3 cache
      Standard metrics for evaluation
        ◮ Time to complete benchmarks
        ◮ Power consumed (collected via Intel RAPL)
        ◮ Relative to sequential, non-instrumented executions
        ◮ Combined metric: Speedup / KJoules
      STAMP benchmarks (excluded Bayes) with standard parameters
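
The combined metric on the Study slide can be sketched as a small calculation. This is a minimal sketch, assuming speedup means sequential time divided by parallel time and that energy is obtained as the difference of two RAPL counter readings in microjoules; the function names and the wraparound handling are illustrative and do not come from the slides.

```python
def speedup_per_kilojoule(sequential_s, parallel_s, energy_joules):
    """Speedup over the sequential run, divided by the energy used in kJ."""
    if parallel_s <= 0 or energy_joules <= 0:
        raise ValueError("time and energy must be positive")
    speedup = sequential_s / parallel_s
    return speedup / (energy_joules / 1000.0)


def rapl_energy_joules(start_uj, end_uj, max_range_uj):
    """Energy between two counter readings in microjoules.

    Assumption: the counter wraps at most once between the readings, and
    max_range_uj is the counter range reported by the platform.
    """
    delta = end_uj - start_uj
    if delta < 0:  # counter wrapped around between the two readings
        delta += max_range_uj
    return delta / 1_000_000.0


# Hypothetical numbers, only to show the arithmetic:
# 10 s sequential, 2.5 s parallel, 500 J consumed.
print(speedup_per_kilojoule(10.0, 2.5, 500.0))
```

Dividing by energy rewards a technique only when its extra threads shorten the run by more than they raise consumption.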

  17. Compared Techniques: Locks, STM, HTM, Hybrid TM

  18. Compared Techniques - Locks
      All benchmarks used an interface with the atomic construct:
        ◮ GL: single global lock
        ◮ FL: fine-grained locks (per-application effort)
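
The GL variant can be illustrated with a short sketch: every atomic block acquires the same lock, so blocks never overlap. This is only an illustration in Python, not the interface used in the study; `atomic`, `worker` and `run` are hypothetical names, and a reentrant lock is assumed so that atomic blocks can nest.

```python
import threading
from contextlib import contextmanager

# One lock shared by every atomic block in the program (the "GL" scheme).
_global_lock = threading.RLock()  # reentrant, so atomic blocks may nest


@contextmanager
def atomic():
    """Run the enclosed block while holding the single global lock."""
    with _global_lock:
        yield


counter = {"value": 0}


def worker(iterations):
    for _ in range(iterations):
        with atomic():
            counter["value"] += 1  # read-modify-write, protected by GL


def run(n_threads=4, iterations=1000):
    counter["value"] = 0
    threads = [threading.Thread(target=worker, args=(iterations,))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]


print(run())
```

Fine-grained locking would replace the single lock with one lock per data item or region, which is where the per-application effort comes from.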

  20. Compared Techniques - STM
        ◮ TL2: commit-time locking
        ◮ NOrec: aimed at low thread count (single commit lock)
        ◮ TinySTM: encounter-time locking
        ◮ SwissTM: mixed encounter-time and commit-time locking
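
To make the "single commit lock" idea concrete, here is a simplified sketch in the spirit of NOrec: writes are buffered, reads are validated by value, and one global sequence counter serializes committing writers. It is an illustration under stated assumptions, not the NOrec implementation evaluated in the study; `Store`, `Tx`, `Abort` and `atomically` are hypothetical names, and a Python lock stands in for the atomic compare-and-swap a real STM would use.

```python
import threading


class Abort(Exception):
    """Raised when a transaction observes a conflicting commit."""


class Store:
    def __init__(self):
        self.clock = 0                    # even: idle, odd: a writer is committing
        self.cas_guard = threading.Lock() # stand-in for an atomic compare-and-swap
        self.data = {}


class Tx:
    def __init__(self, store):
        self.store, self.reads, self.writes = store, {}, {}
        self.snapshot = self._wait_even()

    def _wait_even(self):
        while True:                       # spin until no writer is committing
            t = self.store.clock
            if t % 2 == 0:
                return t

    def _validate(self):
        while True:
            t = self._wait_even()
            for key, seen in self.reads.items():
                if self.store.data.get(key, 0) != seen:
                    raise Abort(key)      # a value we read has changed
            if t == self.store.clock:
                return t                  # read set is consistent at time t

    def read(self, key):
        if key in self.writes:
            return self.writes[key]
        value = self.store.data.get(key, 0)
        while self.snapshot != self.store.clock:
            self.snapshot = self._validate()
            value = self.store.data.get(key, 0)
        self.reads[key] = value
        return value

    def write(self, key, value):
        self.writes[key] = value          # buffered until commit

    def commit(self):
        if not self.writes:
            return                        # read-only: nothing to publish
        while True:
            with self.store.cas_guard:
                if self.store.clock == self.snapshot:
                    self.store.clock = self.snapshot + 1  # take the commit lock
                    break
            self.snapshot = self._validate()
        self.store.data.update(self.writes)
        self.store.clock = self.snapshot + 2              # release it


def atomically(store, body):
    while True:                           # retry until the transaction commits
        tx = Tx(store)
        try:
            result = body(tx)
            tx.commit()
            return result
        except Abort:
            continue
```

Because every writer funnels through the one counter, commits are serialized, which fits the slide's remark that NOrec is aimed at low thread counts.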

  24. Compared Techniques - HTM
      Intel TSX is single version, ensures strong isolation and allows nesting. Most importantly, it is best-effort:
        ◮ No transaction is guaranteed to commit
        ◮ Exhausting cache lines with transactional footprint
        ◮ Architectural states, instructions, traps
      Fallback path must be provided in software
        ◮ address of the routine provided on XBEGIN
      TSX-GL and TSX-FL
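
The best-effort contract implies a retry-then-fallback control flow, which can be modeled without TSX hardware. The sketch below is a simulation only: `hw_execute` is a hypothetical stand-in for an XBEGIN ... XEND region that either runs the body or raises `HardwareAbort`, and the retry budget of 5 is an assumed policy that the slides do not state.

```python
import threading


class HardwareAbort(Exception):
    """Models a best-effort hardware transaction that failed to commit."""


class ElidedGlobalLock:
    def __init__(self, hw_execute, max_attempts=5):
        self._lock = threading.Lock()     # the software fallback path (GL)
        self._hw_execute = hw_execute     # hypothetical stand-in for XBEGIN ... XEND
        self._max_attempts = max_attempts # assumed retry budget

    def run(self, body):
        """Return (path, result), where path is 'htm' or 'fallback'."""
        for _ in range(self._max_attempts):
            # Real code reads the lock inside the transaction, so that a
            # thread taking the fallback lock aborts concurrent transactions.
            if self._lock.locked():
                continue
            try:
                return "htm", self._hw_execute(body)
            except HardwareAbort:
                continue                  # no commit guarantee: try again
        with self._lock:                  # budget exhausted: run under the lock
            return "fallback", body()


def always_aborts(body):
    raise HardwareAbort("capacity")       # e.g. footprint exceeds the cache


def make_flaky(failures):
    remaining = [failures]

    def hw(body):
        if remaining[0] > 0:
            remaining[0] -= 1
            raise HardwareAbort("conflict")
        return body()
    return hw


print(ElidedGlobalLock(always_aborts).run(lambda: 42))
print(ElidedGlobalLock(make_flaky(2)).run(lambda: 42))
```

Swapping the lock-protected fallback for a software transaction gives the hybrid shape, where an STM runs in the fallback path of TSX.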

  25. Compared Techniques - HyTM
      Use an STM in the fallback path of TSX:
        ◮ TSX-TL2: with reduced hardware transactions
        ◮ TSX-NOrec: simpler, since NOrec has a single lock

  28. STAMP results - Workload Characterization
      Intensity  Benchmark  Time in Tx (%)  Contention
      L          kmeans     low (7)         low
      L          ssca2      low (17)        low
      M          intruder   medium (33)     high
      M          vacation   high (89)       low
      H          genome     high (97)       low
      H          yada       high (99)       medium
      H          labyrinth  high (100)      high

  29. STAMP results - Characterization of the Techniques
      [Empty table: columns Most Performant and Least Power Consumption; rows L kmeans, L ssca2, M intruder, M vacation, H genome, H yada, H labyrinth]

  32. Plot labels
      Techniques: GL, TSX-GL, TL2, TSX-TL2, NOrec, TSX-NOrec, SwissTM, TinySTM
      Highlighted: TSX-GL, TSX-NOrec, TinySTM
      Speedup / KJoule along increasing threads

  33. kmeans - low intensity
      [Plot: Speedup/Joule (0 to 120) vs. threads (1 to 8); series: TSX-GL, TSX-NOrec, TinySTM]
        ◮ Sequential overhead is noticeable
        ◮ GL allows some concurrency due to L workload
        ◮ HyTMs lag behind due to the STMs' poor performance

  34. STAMP results - low intensity of transactions
      [Chart: % of transactions aborted (0 to 80) for TSX-GL, TSX-TL2 and TSX-NOrec at 1, 4 and 8 threads; legend: capacity, architectural, conflict, interaction]
        ◮ 1 thread has negligible aborts
        ◮ STMs have 15% abort rate

  35. STAMP results - low intensity of transactions
      Characterization of the Techniques
      Intensity  Benchmark  Most Performant  Least Power Consumption
      L          kmeans     TSX-GL           TSX-GL
      L          ssca2      TSX-GL           TSX-GL
      (rows M intruder, M vacation, H genome, H yada and H labyrinth are still empty on this slide)

  36. intruder - medium intensity
      [Plot: Speedup/Joule (2 to 14) vs. threads (1 to 8); series: TSX-GL, TSX-NOrec, TinySTM]
        ◮ Binding threads round-robin: > 4t uses hyper-threading
        ◮ TSX-based approaches suffer from pressure on caches
        ◮ Best STMs (not TL2) scale regardless

  37. STAMP results - medium intensity of transactions
      Intensity  Benchmark  Most Performant            Least Power Consumption
      L          kmeans     TSX-GL                     TSX-GL
      L          ssca2      TSX-GL                     TSX-GL
      M          intruder   TSX-GL ≤ 4t; TinySTM ≥ 5t  TSX-GL ≤ 5t; TinySTM ≥ 6t
      M          vacation   TSX-GL ≤ 2t; TinySTM ≥ 3t  TSX-GL ≤ 4t; TinySTM ≥ 5t
      (rows H genome, H yada and H labyrinth are still empty on this slide)

  38. yada - high intensity
      [Plot: Speedup/Joule (0.5 to 6) vs. threads (1 to 8); series: TSX-GL, TSX-NOrec, TinySTM]
        ◮ TSX-GL does not scale
        ◮ HyTMs follow the trend of the STM counterpart
        ◮ When time to complete stagnates, power consumption stagnates
            ◮ Logical cores of hyper-threading
            ◮ Allow for additional hardware parallelism
            ◮ Do not consume as much additional power

  39. STAMP results - high intensity of transactions
      [Chart: aborts (0 to 80) for TSX-GL, TSX-TL2 and TSX-NOrec; legend: capacity, architectural, conflict, interaction]
        ◮ Most conflicts are not due to data accesses

  40. STAMP results
      Intensity  Benchmark  Most Performant            Least Power Consumption
      L          kmeans     TSX-GL                     TSX-GL
      L          ssca2      TSX-GL                     TSX-GL
      M          intruder   TSX-GL ≤ 4t; TinySTM ≥ 5t  TSX-GL ≤ 5t; TinySTM ≥ 6t
      M          vacation   TSX-GL ≤ 2t; TinySTM ≥ 3t  TSX-GL ≤ 4t; TinySTM ≥ 5t
      H          genome     TinySTM                    TinySTM
      H          yada       SwissTM                    TinySTM
      H          labyrinth  STMs (except TL2)          STMs (except TL2)
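
The final results table reads as a lookup from benchmark and thread count to the winning technique. The sketch below encodes the table's entries as data; the structure and the names `RESULTS` and `best_technique` are illustrative choices, and each cell is stored as a list of (largest thread count, technique) pairs covering 1 to 8 threads.

```python
# Each cell: list of (largest thread count it applies to, technique),
# transcribed from the results table for 1 to 8 threads.
RESULTS = {
    "kmeans":    {"performance": [(8, "TSX-GL")],
                  "power":       [(8, "TSX-GL")]},
    "ssca2":     {"performance": [(8, "TSX-GL")],
                  "power":       [(8, "TSX-GL")]},
    "intruder":  {"performance": [(4, "TSX-GL"), (8, "TinySTM")],
                  "power":       [(5, "TSX-GL"), (8, "TinySTM")]},
    "vacation":  {"performance": [(2, "TSX-GL"), (8, "TinySTM")],
                  "power":       [(4, "TSX-GL"), (8, "TinySTM")]},
    "genome":    {"performance": [(8, "TinySTM")],
                  "power":       [(8, "TinySTM")]},
    "yada":      {"performance": [(8, "SwissTM")],
                  "power":       [(8, "TinySTM")]},
    "labyrinth": {"performance": [(8, "STMs (except TL2)")],
                  "power":       [(8, "STMs (except TL2)")]},
}


def best_technique(benchmark, threads, metric="performance"):
    """Return the table entry for a benchmark at a given thread count."""
    if not 1 <= threads <= 8:
        raise ValueError("the study covers 1 to 8 threads")
    for upper, technique in RESULTS[benchmark][metric]:
        if threads <= upper:
            return technique


print(best_technique("intruder", 5))           # TinySTM
print(best_technique("intruder", 5, "power"))  # TSX-GL
```

The intruder row shows why the two columns are kept apart: at 5 threads the most performant technique and the least power-hungry one differ.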

  42. STAMP - fine-grained locking
        ◮ Requires a per-application effort
        ◮ Reasoning with transactions is meant to simplify programming
        ◮ Does not change the landscape of performance and power consumption
