The moment of truth: are we done with STM?
Nuno Diegues, Paolo Romano, Luís Rodrigues
ndiegues@gsd.inesc-id.pt
Nuno Diegues 1/27
Over 20 years of Transactional Memory
1 (Quick) Motivation
2 Study Description
3 Compared Techniques
4 Results and Insights
5 Summary of Conclusions
◮ IBM processors target high performance computing
◮ Intel Haswell Xeon E3-1275v3 3.5GHz (3.9GHz Turbo)
◮ 4 cores, 8 hardware threads (via hyper-threading)
◮ 4x32KB L1 caches, 4x256KB L2 caches, 8MB L3 cache

◮ Time to complete benchmarks
◮ Power consumed (collected via Intel RAPL)
◮ Relative to sequential, non-instrumented executions
◮ Combined metric: Speedup / kJoules
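A minimal sketch of how the combined metric could be computed from raw measurements. The slides do not show the measurement code, so the function names and the shape of the inputs are hypothetical. It also assumes the energy readings are RAPL-style cumulative counters in microjoules that can wrap around, as the Linux powercap interface exposes them; the study may have read RAPL differently.

```python
def energy_delta_joules(start_uj, end_uj, max_range_uj):
    """Energy consumed between two cumulative counter readings, in joules.

    Assumption: readings are in microjoules and the counter wraps back
    towards zero once it reaches max_range_uj.
    """
    delta_uj = end_uj - start_uj
    if delta_uj < 0:
        # The counter wrapped during the run, so add one full range.
        delta_uj += max_range_uj
    return delta_uj / 1e6


def speedup_per_kilojoule(seq_seconds, par_seconds, energy_joules):
    """Combined metric: speedup over the sequential run, per kJ consumed."""
    # Speedup is relative to the sequential, non-instrumented execution.
    speedup = seq_seconds / par_seconds
    return speedup / (energy_joules / 1000.0)


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    joules = energy_delta_joules(1_000_000, 501_000_000, 2**32)
    print(speedup_per_kilojoule(10.0, 2.5, joules))
```

Dividing speedup by energy rewards a configuration only if it is both fast and frugal: a run that is twice as fast but consumes twice the energy scores the same as the baseline.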
◮ The address of the fallback (abort) routine is provided on XBEGIN
[Figure: Speedup/Joule (y-axis, 20 to 120) against number of threads (x-axis, 1 to 8) for TSX-GL, TSX-NOrec, and TinySTM]
[Figure: panels for 1 thread, 4 threads, and 8 threads]
[Figure: Speedup/Joule (y-axis, 2 to 14) against number of threads (x-axis, 1 to 8) for TSX-GL, TSX-NOrec, and TinySTM]
[Figure: Speedup/Joule (y-axis, 0.5 to 6) against number of threads (x-axis, 1 to 8) for TSX-GL, TSX-NOrec, and TinySTM]
◮ Logical cores of hyper-threading
◮ Allow for additional hardware parallelism
◮ Do not consume as much additional power
◮ Additional lock acquisitions are noticeable in L workloads
◮ An efficient fine-grained lock scheme was not found
◮ TinySTM was competitive in H workloads