

SLIDE 1

ROME Workshop, August 23, 2016

DEALING WITH LAYERS OF OBFUSCATION IN PSEUDO-UNIFORM MEMORY

Robert Kuban, Mark Simon Schöps, Jörg Nolte, Randolf Rotta (rottaran@b-tu.de)

[1] Research supported by German BMBF grant 01IH13003.

SLIDE 2

PROBLEM: MEMORY LATENCY ON INTEL XEON PHI KNC

Example: measured average access times are unstable between restarts.
Affects: micro-benchmarks, algorithm tuning, developer’s sanity… also application performance?

⇒ Outline

  • 1. Causes?
  • 2. Solutions?
  • 3. Is it worthwhile?


SLIDE 3

OUTLINE

  • 1. Causes?
  • 2. Solutions?
  • 3. Is it worthwhile?
  • 4. Conclusions


SLIDE 4

CAUSES: MULTIPLE PERFORMANCE BOTTLENECKS

  • 1. compute bound
  • 2. memory throughput: streaming, matrix algebra
  • 3. memory latency: key-value stores, graphs
  • 4. coherence latency: a synchronisation variable
  • 5. coherence throughput: many synchronisation variables

[Diagram: core/cache pairs connected through the cache-coherence directory to memory; bottleneck 1 at the cores, bottlenecks 2,3 at memory, bottlenecks 4,5 at the directory.]


SLIDE 5

HW SOLUTION: STRIPING TO MAXIMISE THROUGHPUT

  • 1. striping over memory channels, banks, and coherence directories
  • 2. past: NUMA throughput bottlenecks ⇒ mostly local striping
  • 3. many-cores: no throughput bottlenecks but larger network

[Diagram: four core/cache pairs and several directories spread across the network, each directory striped over multiple memory banks.]



SLIDE 8

INTEL XEON PHI KNC IN DETAIL

[Diagram: 57–61 cores on a bi-directional ring network, each core with 4 threads, an L2 cache, and a directory (64 directories in total); 4 × 2 memory controllers attached to the ring.]

memory striping by (PhysAddr / 62) & 0xF [1] (worked example below)

  • avg. remote L2 read ≈ 240 cycles, contention beyond 16 threads [2]
  • some lines lie near their memory: up to 28% application speedup possible [3]
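A minimal worked example of this hash in C, taking the units exactly as the formula on this slide states them (channel_of is an illustrative name, not from the talk):

    #include <stdint.h>
    #include <stdio.h>

    /* Striping formula from this slide: channel = (PhysAddr / 62) & 0xF,
     * per footnote [1]. Consecutive 62-unit blocks of the physical
     * address space rotate over the 16 memory channels. */
    static unsigned channel_of(uint64_t phys_addr)
    {
        return (unsigned)((phys_addr / 62) & 0xF);
    }

    int main(void)
    {
        /* addresses 62 units apart land on consecutive channels,
         * and the stripe wraps after 16 blocks */
        printf("%u %u %u\n",
               channel_of(0), channel_of(62), channel_of(16 * 62));
        /* prints: 0 1 0 */
        return 0;
    }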

[1] John McCalpin: https://software.intel.com/en-us/forums/intel-many-integrated-core/topic/586138
[2] Ramos et al.: Modeling communication in cache-coherent SMP systems: A case-study with Xeon Phi.
[3] Balazs Gerofi et al.: Exploiting Hidden Non-uniformity of Uniform Memory Access on Manycore CPUs.

SLIDE 9

OUTLINE

  • 1. Causes?
  • 2. Solutions?
  • 3. Is it worthwhile?
  • 4. Conclusions


SLIDE 10

REVERSE ENGINEERING KNC’S DIRECTORY STRIPING

  • measure: fetch a line currently owned by the neighbour’s L2
  • two cores, two lines: one for measurement, the other for coordination
  • minimum over RDTSC cycle counts, MyThOS kernel as bare-metal environment (see the sketch below)

[Diagram: two step sequences 1–3, each between a requesting core, a directory, and the owning core’s L2 cache, for the two line placements.]
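A minimal sketch of the timing kernel, assuming the bare-metal setting the slide names (interrupts off, constant TSC); the coordination protocol over the second line is elided, and min_fetch_cycles is an illustrative name:

    #include <stdint.h>

    /* Read the time-stamp counter (KNC supports plain RDTSC). */
    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    /* Minimum cycles to fetch a line currently owned by the neighbour's
     * L2. Before each trial the neighbour must re-acquire ownership of
     * *line, coordinated via the second cache line (elided here). */
    static uint64_t min_fetch_cycles(volatile uint64_t *line, int trials)
    {
        uint64_t best = UINT64_MAX;
        for (int i = 0; i < trials; ++i) {
            /* ...wait until the neighbour owns *line again... */
            uint64_t t0 = rdtsc();
            (void)*line;              /* the measured remote fetch */
            uint64_t t1 = rdtsc();
            if (t1 - t0 < best)
                best = t1 - t0;       /* keep the minimum over trials */
        }
        return best;
    }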


SLIDE 11

RESULTS: PSEUDO-RANDOMLY SCATTERED

≈140 cycles best case vs. ≈400 cycles worst case

[Plot: cache line latency from core 0 to 1, per line index (0–1024), y-axis 100–400 cycles; latencies are pseudo-randomly scattered.]


SLIDE 12

RESULTS: RECONSTRUCTED MAPPING OF LINES TO DIRECTORIES

Enables quick initialisation without measurements (sketch below).

[Plot: tag directory latency from core 0 to 1, per directory (1–64), y-axis 100–400 cycles.]
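A hedged sketch of what initialisation without measurements could look like: once the reconstructed scatter pattern is known, the directory of any line follows from a table lookup. dir_of_line, the per-page base value, and the repeat-per-page assumption are illustrative, not the MyThOS API:

    #include <stdint.h>

    #define LINES_PER_PAGE 64   /* 4 KiB page / 64 B cache lines */
    #define NUM_DIRS       64   /* KNC: 64 tag directories */

    /* Reconstructed scatter pattern: directory offset of each line
     * within a page, relative to the page's base directory
     * (filled in once from the reconstructed mapping). */
    static uint8_t dir_of_line[LINES_PER_PAGE];

    /* Assumption: the pattern repeats per page up to a per-page
     * rotation, so a single base value per page suffices. */
    static inline unsigned directory_of(uintptr_t addr,
                                        unsigned page_base_dir)
    {
        unsigned line = (unsigned)(addr >> 6) & (LINES_PER_PAGE - 1);
        return (page_base_dir + dir_of_line[line]) % NUM_DIRS;
    }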


SLIDE 13

IMPLICATIONS

Support in the MyThOS kernel

  • per page: base address for the line → directory mapping
  • per node: balanced mapping from directory → nearby core
  • kernel objects can allocate local lines for synchronisation variables (sketch below)
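A hedged sketch of the per-object allocation this enables; directory_of is the lookup from the previous sketch, and nearest_core_of_dir stands in for the per-node balanced mapping (both illustrative, not the MyThOS API):

    #include <stdint.h>
    #include <stddef.h>

    /* assumed helpers from the surrounding sketches */
    extern unsigned directory_of(uintptr_t addr, unsigned page_base_dir);
    extern unsigned nearest_core_of_dir[64];  /* directory -> core */

    enum { LINE = 64, LINES_PER_PAGE = 64 };

    /* Pick a cache line in `page` whose directory is near `core`, so a
     * synchronisation variable placed there has local coherence traffic. */
    static void *alloc_local_sync_line(void *page, unsigned page_base_dir,
                                       unsigned core)
    {
        for (unsigned line = 0; line < LINES_PER_PAGE; ++line) {
            unsigned dir = directory_of((uintptr_t)page + line * LINE,
                                        page_base_dir);
            if (nearest_core_of_dir[dir] == core)
                return (char *)page + line * LINE;  /* local line found */
        }
        return NULL;  /* no nearby line in this page */
    }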

Application challenges

  • avoid more than 16 threads accessing the same line
  • co-locate dependent tasks
  • squeeze synchronisation into cache lines
  • no easy migration after allocation


SLIDE 14

OUTLINE

  • 1. Causes?
  • 2. Solutions?
  • 3. Is it worthwhile?
  • 4. Conclusions


SLIDE 15

PING-PONG BENCHMARK: BUSY POLLING, THEN WRITE

[Plot: mean latency in ns (200–1600) vs. distance between the cores (1–30), worst vs. best placement; operation: read.]
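A minimal sketch of the measured pattern, assuming two threads pinned to the two cores; the names are illustrative:

    #include <stdatomic.h>

    /* One side of the ping-pong: busy-poll the shared flag until it
     * holds the peer's id (plain read), then write our own id. The
     * other thread runs the same loop with `me` and `peer` swapped. */
    static void ping_pong(atomic_uint *flag, unsigned me, unsigned peer,
                          int rounds)
    {
        for (int i = 0; i < rounds; ++i) {
            while (atomic_load_explicit(flag,
                                        memory_order_acquire) != peer)
                ;                                   /* busy polling */
            atomic_store_explicit(flag, me,
                                  memory_order_release);  /* then write */
        }
    }

The plain polling load is what leaves a shared copy of the line behind; the next slide shows the consequence.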


SLIDE 16

PING-PONG BENCHMARK: TIMES DON’T ADD UP

[Diagram: message sequence between two cores’ L2 caches and the directory: 1 read, 2 (directory forward), 3 copy from the owning L2, 4 RFO, 5 invalidation broadcast, 6 ack.]

The polling reads leave shared copies in the reader’s cache, so the subsequent write (RFO) must broadcast invalidations and wait for the ack; this extra round is why the measured ping-pong time exceeds the sum of the individual transfer latencies.


SLIDE 17

PING-PONG BENCHMARK: AVOID INVALIDATION BROADCASTS!

[Plot: mean latency in ns (200–1600) vs. distance between the cores (1–30), worst vs. best placement; operation: atomic fetch-and-add.]
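One plausible reading of the fetch-and-add variant, sketched under that assumption: polling with an atomic read-modify-write acquires the line exclusively, so no shared copy is left behind for the later write to invalidate. This mirrors the previous sketch with only the poll changed:

    #include <stdatomic.h>

    /* Poll with fetch-and-add of 0: the RMW takes the line in an
     * exclusive state, avoiding the invalidation broadcast (step 5
     * on the previous slide) that a plain polling read provokes. */
    static void ping_pong_faa(atomic_uint *flag, unsigned me,
                              unsigned peer, int rounds)
    {
        for (int i = 0; i < rounds; ++i) {
            while (atomic_fetch_add_explicit(flag, 0,
                                             memory_order_acq_rel) != peer)
                ;                         /* polling via atomic RMW */
            atomic_store_explicit(flag, me, memory_order_release);
        }
    }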


SLIDE 18

INTEL XEON PHI KNL: DOES IT APPLY?

[Diagram: KNL chip with 2× DDR4 interfaces and 8 MCDRAM devices around the tile mesh; numbered request path (1–3).]

  • modes: all2all, quadrant, sub-NUMA; MCDRAM usable as memory or as L3 cache
  • benchmarks[4]: quadrant > all2all > sub-NUMA
  • ⇒ memory + directory striping persists; smaller latency? overhead of Y-X crossing?

[4] Carlos Rosales: A Comparative Study of Application Performance and Scalability on the Intel Knights Landing Processor.
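As an aside on the "as memory" mode: in flat mode the MCDRAM appears as a separate NUMA node, and the memkind library’s hbwmalloc interface can place hot data there. A hedged sketch, assuming memkind is installed (alloc_hot is an illustrative name):

    #include <hbwmalloc.h>   /* memkind's high-bandwidth-memory API */
    #include <stdlib.h>

    /* Allocate n doubles from MCDRAM when present, else from DDR4. */
    static double *alloc_hot(size_t n)
    {
        double *p = NULL;
        if (hbw_check_available() == 0)    /* 0 => MCDRAM available */
            p = hbw_malloc(n * sizeof *p);
        if (p == NULL)
            p = malloc(n * sizeof *p);     /* DDR4 fallback */
        return p;
    }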

SLIDE 19

OUTLINE

  • 1. Causes?
  • 2. Solutions?
  • 3. Is it worthwhile?
  • 4. Conclusions


SLIDE 20

CONCLUSIONS

memory striping = directory striping

good for throughput-bound computations, bad for latency- and synchronisation-bound computations

Intel KNC: pseudo-uniform

  • up to 3× synchronisation latency, but avoiding broadcasts and contention is equally important
  • benchmarks: average over multiple random allocations

Future…

  • MyThOS: evaluate impact on in-kernel synchronisation
  • Intel KNL: latency and contention benchmarks
  • HW: dedicated memory/network for synchronisation!?


SLIDE 21

OUTLINE

  • 5. Appendix


SLIDE 22

RESULTS: UNEVEN MAPPING, DEPENDS ON ENABLED CORES!

[Plot: share of nearest cache lines (0–4%) per core (up to 60); the distribution is clearly uneven.]


SLIDE 23

READING FROM MEMORY: LATENCY FROM CORE 0

[Plot: cache line latency from core 0, per line (0–8192), y-axis 100–400 cycles.]


SLIDE 24

READING FROM MEMORY: LATENCY FROM BEST CORE

[Plot: cache line latency from the best core per line (0–8192), y-axis 100–400 cycles.]


SLIDE 25

“PSEUDO-UNIFORM” MEMORY ARCHITECTURES

Good for throughput-bound computations

  • HW maximises average throughput over large data sets; average latency is hidden by prefetching and many threads
  • ⇒ no need for data partitioning and placement; can focus on computation balance

Bad for latency- and synchronisation-bound computations

  • most synchronisation variables are very small, so prefetching does not help
  • the average latency does not apply; instead there is a permanent overhead that depends on placement
