

  1. ROME Workshop, August 23, 2016
     DEALING WITH LAYERS OF OBFUSCATION IN PSEUDO-UNIFORM MEMORY
     Robert Kuban, Mark Simon Schöps, Jörg Nolte, Randolf Rotta (rottaran@b-tu.de)
     Research supported by German BMBF grant 01IH13003.

  2. PROBLEM: MEMORY LATENCY ON INTEL XEON PHI KNC
     Example: measuring the avg. time is unstable between restarts.
     Affects: micro-benchmarks, algorithm tuning, developer's sanity... also application performance?
     ⇒ Outline: 1. Causes? 2. Solutions? 3. Is it worthwhile?

  3. OUTLINE
     1. Causes?  2. Solutions?  3. Is it worthwhile?  4. Conclusions
     1 · Causes?

  4. CAUSES: MULTIPLE PERFORMANCE BOTTLENECKS
     1. compute bound
     2. memory throughput: streaming, matrix algorithms
     3. memory latency: key-value stores, graphs
     4. coherence latency: synchronisation variables
     5. coherence throughput: many sync. variables
     [Diagram: cores and caches above a coherence directory and memory, with bottlenecks 1-5 marked]

  5. HW SOLUTION: STRIPING TO MAXIMISE THROUGHPUT
     1. striping over memory channels, banks, and coherence directories
     2. past: NUMA throughput bottlenecks ⇒ mostly local striping
     3. many-cores: no throughput bottlenecks, but a larger network
     [Diagram: four cores with caches, striped over directories and memory banks]


  8. INTEL XEON PHI KNC IN DETAIL
     [Diagram: 57-61 cores, 4 threads each, with L2 caches, 64 tag directories,
      and 2x memory controllers on a bi-directional ring network]
     - memory striping by (PhysAddr/62)&0xF [1]
     - avg. remote L2 read ≈ 240 cycles, contention above 16 threads [2]
     - some lines are near to memory, up to 28% app. speedup possible [3]
     [1] John McCalpin: https://software.intel.com/en-us/forums/intel-many-integrated-core/topic/586138
     [2] Ramos et al.: Modeling communication in cache-coherent SMP systems: A case-study with Xeon Phi.
     [3] Balazs Gerofi et al.: Exploiting Hidden Non-uniformity of Uniform Memory Access on Manycore CPUs.

  9. OUTLINE
     1. Causes?  2. Solutions?  3. Is it worthwhile?  4. Conclusions
     2 · Solutions?

 10. REVERSE ENGINEERING KNC'S DIRECTORY STRIPING
     - measure: fetch a line currently owned by a neighbour's L2
     - two cores, two lines: one for measurement, the other for coordination
     - minimum RDTSC cycles, MyThOS kernel as bare-metal environment
     [Diagram: two cores with L2 caches; steps 1-3 of the remote fetch routed via the directory]
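The measurement loop described above might look roughly like this in C. The `rdtsc` helper is standard x86; `min_read_latency` and its repetition count are illustrative assumptions, and the coordination with the neighbour core (which must own the line in its L2 before each probe) is elided as a comment, since the talk runs the real loop bare-metal under the MyThOS kernel.

```c
#include <stdint.h>

/* Read the x86 time-stamp counter. */
static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Time a read of a cache line and keep the minimum cycle count over
 * many repetitions to filter out interrupts and other noise, as the
 * slide's "minimum RDTSC cycles" method suggests. Sketch only. */
uint64_t min_read_latency(volatile uint64_t *line, int reps)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < reps; i++) {
        /* ...coordinate with the neighbour core here so that it owns
         * `line` in its L2 before the timed read... */
        uint64_t t0 = rdtsc();
        (void)*line;                /* the remote fetch being measured */
        uint64_t t1 = rdtsc();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```

Taking the minimum rather than the average is what makes the per-line numbers stable enough to reveal the directory striping pattern.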

 11. RESULTS: PSEUDO-RANDOMLY SCATTERED
     ≈ 140 cycles best case vs. ≈ 400 cycles worst case
     [Plot: latency from core 0 to 1 (cycles, 0-400) per cache line (0-1024)]

 12. RESULTS: RECONSTRUCTED MAPPING OF LINES TO DIRECTORIES
     Enables quick initialisation without measurements.
     [Plot: latency from core 0 to 1 (cycles, 0-400) per tag directory (0-64)]

 13. IMPLICATIONS
     Support in the MyThOS kernel:
     - per page: base address for line ↦ directory
     - per node: balanced mapping for directory ↦ nearby core
     - kernel objects can allocate local lines for sync. vars.
     Application challenges:
     - avoid > 16 threads accessing the same line
     - co-locate dependent tasks
     - squeeze synchronisation into cache lines
     - no easy migration after allocation
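The per-page "line to directory" support could be realised as a lookup table plus a search, sketched below. All names here (`page_map`, `find_local_line`) are hypothetical; the talk describes the idea, not this API.

```c
#include <stdint.h>
#include <stddef.h>

#define LINES_PER_PAGE 64   /* 4 KiB page / 64 B cache lines */

/* Per-page table mapping each cache line to its tag directory,
 * reconstructed once so no re-measurement is needed at runtime. */
struct page_map {
    uint8_t line_to_dir[LINES_PER_PAGE];
};

/* Byte offset of a line in the page served by `wanted_dir`, or -1.
 * An allocator for synchronisation variables can use this to place
 * a sync. var. on a line whose directory sits near the requesting
 * core (via the per-node directory-to-nearby-core mapping). */
ptrdiff_t find_local_line(const struct page_map *page, uint8_t wanted_dir)
{
    for (int i = 0; i < LINES_PER_PAGE; i++)
        if (page->line_to_dir[i] == wanted_dir)
            return (ptrdiff_t)i * 64;
    return -1;
}
```

The "no easy migration after allocation" challenge follows directly: once a sync. variable occupies a line, its directory is fixed by the physical address, so moving it means reallocating on a different line.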

 14. OUTLINE
     1. Causes?  2. Solutions?  3. Is it worthwhile?  4. Conclusions
     3 · Is it worthwhile?
