ROME Workshop, August 23, 2016
DEALING WITH LAYERS OF OBFUSCATION IN PSEUDO-UNIFORM MEMORY
Robert Kuban, Mark Simon Schöps, Jörg Nolte, Randolf Rotta (rottaran@b-tu.de)
¹ Research supported by German BMBF grant 01IH13003.
PROBLEM: MEMORY LATENCY ON INTEL XEON PHI KNC
Example: the measured average access time is unstable between restarts.
Affects: micro-benchmarks, algorithm tuning, the developer's sanity... and possibly also application performance?
⇒ Outline
OUTLINE
1 · Causes?
CAUSES: MULTIPLE PERFORMANCE BOTTLENECKS
- streaming, matrix algebra
- key-value stores, graphs
- a single synchronisation variable
- many synchronisation variables
[Figure: data path core → cache → coherence directory → memory; the numbered stages mark where each workload class hits its bottleneck]
HW SOLUTION: STRIPING TO MAXIMISE THROUGHPUT
[Figure: cache lines striped across multiple directories and memory controllers, so each core's accesses spread over all memory resources]
INTEL XEON PHI KNC IN DETAIL
[Figure: 57–61 cores, each with 4 threads and a private L2 cache; 64 tag directories and 4 × 2 memory controllers on a bi-directional ring network]
memory striping by (PhysAddr / 64) & 0xF ¹
some lines end up near to their memory; up to 28% application speedup possible ³
¹ John McCalpin: https://software.intel.com/en-us/forums/intel-many-integrated-core/topic/586138
² Ramos et al.: Modeling communication in cache-coherent SMP systems: A case-study with Xeon Phi.
³ Balazs Gerofi et al.: Exploiting Hidden Non-uniformity of Uniform Memory Access on Manycore CPUs
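The striping rule above amounts to a one-line address hash. A minimal sketch, assuming the 64-byte cache line size and the 16-channel reading of McCalpin's post (the function name is ours):

```c
#include <stdint.h>

/* Sketch of the reported KNC striping rule: consecutive 64-byte cache
 * lines rotate round-robin over 16 memory channels. */
static inline unsigned knc_memory_channel(uint64_t phys_addr) {
    return (unsigned)((phys_addr / 64) & 0xF);
}
```

Two lines that are 16 cache lines apart therefore land on the same channel, which is what makes dense streaming spread evenly while a single hot line always stresses one channel.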
OUTLINE
2 · Solutions?
REVERSE ENGINEERING KNC'S DIRECTORY STRIPING
- measure: fetch a line currently owned by a neighbour's L2
- two cores, two lines: one for measurement, the other for coordination
- minimum of many RDTSC timings; MyThOS kernel as bare-metal environment
[Figure: two cores with their L2 caches; steps 1–3 trace the measured fetch through the responsible tag directory]
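The measurement loop can be sketched as below; the names are ours, and this user-space version only illustrates the min-of-many idea, since the actual runs fetch a line owned by the neighbour's L2 from inside the bare-metal MyThOS kernel:

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() */

/* Take the MINIMUM over many timed fetches so that interrupts and
 * pipeline noise cannot inflate the reported latency. */
static uint64_t min_fetch_cycles(volatile uint64_t *line, int reps) {
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < reps; i++) {
        uint64_t t0 = __rdtsc();
        (void)*line;                  /* the timed (remote) fetch */
        uint64_t t1 = __rdtsc();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```

Taking the minimum rather than the average is what makes the per-line latency reproducible enough to recover the striping pattern.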
RESULTS: PSEUDO-RANDOMLY SCATTERED
≈140 cycles best case vs. ≈400 cycles worst case
[Plot: cache line latency from core 0 to core 1 over 1024 consecutive lines; values scatter pseudo-randomly between ≈140 and ≈400 cycles]
RESULTS: RECONSTRUCTED MAPPING OF LINES TO DIRECTORIES
Enables quick initialisation without measurements.
[Plot: tag directory latency from core 0 to core 1 for the 64 directories]
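A reconstructed mapping could be consumed as a plain lookup table at runtime; this sketch uses made-up names and an arbitrary permutation purely for illustration:

```c
#include <stdint.h>

enum { LINES_PER_PAGE = 64 };   /* 4 KiB page / 64 B cache lines */

/* Which of the 64 tag directories serves each of a page's 64 cache
 * lines; filled once from the reconstructed mapping, so no per-line
 * latency probing is needed afterwards. */
typedef struct {
    uint8_t dir_of_line[LINES_PER_PAGE];
} page_dir_map;

static unsigned directory_of(const page_dir_map *m, uint64_t addr) {
    return m->dir_of_line[(addr >> 6) & (LINES_PER_PAGE - 1)];
}
```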
IMPLICATIONS
Support in the MyThOS kernel
- per page: the base address determines the line → directory mapping
- per node: a balanced mapping from directory → nearby core
- kernel objects can allocate local lines for synchronisation variables
Application challenges
- avoid >16 threads accessing the same line
- co-locate dependent tasks
- squeeze synchronisation into cache lines
- no easy migration after allocation
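The per-node kernel support could look roughly like this (hypothetical interfaces, not the MyThOS API): from a pool of free lines, pick one whose tag directory is adjacent to the core that will synchronise on it.

```c
#include <stdint.h>
#include <stddef.h>

enum { NDIRS = 64, NCORES = 60 };

/* Hypothetical table, filled at boot from the reconstructed mapping:
 * the core sitting next to each tag directory, balanced so that every
 * enabled core owns roughly the same number of directories. */
static unsigned nearest_core_of_dir[NDIRS];

/* Return the index of the first candidate line whose directory sits
 * next to `core`, or -1 if none qualifies. `dir_of_line[i]` is the
 * directory serving candidate line i. */
static int pick_local_line(const unsigned *dir_of_line, size_t nlines,
                           unsigned core) {
    for (size_t i = 0; i < nlines; i++)
        if (nearest_core_of_dir[dir_of_line[i]] == core)
            return (int)i;
    return -1;
}
```

Since there is no easy migration after allocation (see above), the placement decision has to happen here, at allocation time.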
OUTLINE
3 · Is it worthwhile?
PING-PONG BENCHMARK: BUSY POLLING, THEN WRITE
[Plot: mean latency (ns) vs. distance between the cores (1–30), worst vs. best placement]
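A minimal user-space version of the ping-pong kernel (ours, using C11 atomics and pthreads rather than the bare-metal MyThOS code): each side busy-polls the shared counter until it is its turn, then writes the next value.

```c
#include <stdint.h>
#include <stdatomic.h>
#include <pthread.h>

enum { ROUNDS = 100000 };
static _Atomic uint64_t ball;   /* lives in one contended cache line */

static void *player(void *arg) {
    uint64_t me = (uint64_t)(uintptr_t)arg;    /* 0 or 1 */
    for (uint64_t r = 0; r < ROUNDS; r++) {
        uint64_t turn = 2 * r + me;
        while (atomic_load_explicit(&ball, memory_order_acquire) != turn)
            ;                                  /* busy poll */
        atomic_store_explicit(&ball, turn + 1, memory_order_release);
    }
    return NULL;
}
```

Timing the rounds from thread creation to join and dividing by 2 · ROUNDS gives the one-way latency that the plot reports per core distance.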
PING-PONG BENCHMARK: TIMES DON’T ADD UP
[Figure: coherence steps for one hop: 1 read request, 2 directory lookup, 3 copy delivered to the reader's L2; then 4 RFO for the reply write, 5 invalidation broadcast, 6 acknowledgements]
PING-PONG BENCHMARK: AVOID INVALIDATION BROADCASTS!
[Plot: mean latency (ns) vs. distance between the cores (1–30), worst vs. best placement, with invalidation broadcasts avoided]
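One standard way to keep invalidations point-to-point (our sketch of the general technique, not necessarily the exact optimisation behind this plot) is to give every waiter its own cache-line-sized flag: a notifying write then invalidates exactly one sharer instead of broadcasting to all cores that ever polled the line.

```c
#include <stdatomic.h>

#define CACHELINE 64

/* Each waiter polls only its private, cache-line-aligned flag. */
struct waiter_flag {
    _Atomic int go;
    char pad[CACHELINE - sizeof(_Atomic int)];
} __attribute__((aligned(CACHELINE)));

static void notify(struct waiter_flag *w) {
    atomic_store_explicit(&w->go, 1, memory_order_release);
}

static void wait_for(struct waiter_flag *w) {
    while (!atomic_load_explicit(&w->go, memory_order_acquire))
        ;   /* busy poll on our own line: only one sharer to invalidate */
}
```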
INTEL XEON PHI KNL: DOES IT APPLY?
[Figure: KNL mesh with 2 DDR4 controllers and 8 MCDRAM devices; numbers 1–3 mark a request's route across the mesh]
- modes: all2all, quadrant, sub-NUMA; MCDRAM as memory or as L3 cache
- benchmarks ⁴: quadrant > all2all > sub-NUMA
⇒ memory + directory striping persists
- smaller latency? overhead of the Y-X crossing?
⁴ Carlos Rosales: A Comparative Study of Application Performance and Scalability on the Intel Knights Landing Processor
OUTLINE
4 · Conclusions
CONCLUSIONS
memory striping = directory striping
- good for throughput-bound computations
- bad for latency- and synchronisation-bound computations
Intel KNC: pseudo-uniform
- up to 3× synchronisation latency
- but avoiding broadcasts and contention is equally important
- benchmarks: average over multiple random allocations
- MyThOS: evaluate impact on in-kernel synchronisation
- Intel KNL: latency and contention benchmarks
- HW: dedicated memory/network for synchronisation!?
OUTLINE
5 · Appendix
RESULTS: UNEVEN MAPPING, DEPENDS ON ENABLED CORES!
[Histogram: share of nearest cache lines (0–4%) per core; the distribution over the enabled cores is uneven]
READING FROM MEMORY: LATENCY FROM CORE 0
[Plot: cache line read latency from core 0 over 8192 consecutive lines, 100–400 cycles]
READING FROM MEMORY: LATENCY FROM BEST CORE
[Plot: cache line read latency from the best core over 8192 consecutive lines, 100–400 cycles]
“PSEUDO-UNIFORM” MEMORY ARCHITECTURES
Good for throughput bound computations
- HW maximises average throughput over large data sets
- average latency is hidden by prefetching & many threads
⇒ no need for data partitioning and placement; can focus on computation balance
Bad for latency and synchronisation bound computations
- most synchronisation variables are very small, so prefetching does not help
- the average latency does not apply: a permanent overhead depending on placement