
Does Cache Sharing on Modern CMP Matter to the Performance of Contemporary Multithreaded Programs? - PowerPoint PPT Presentation



  1. Does Cache Sharing on Modern CMP Matter to the Performance of Contemporary Multithreaded Programs? Eddy Zheng Zhang, Yunlian Jiang, Xipeng Shen (presenter). Computer Science Department, The College of William and Mary, VA, USA

  2. Cache Sharing • A common feature on modern CMP

  3. Cache Sharing on CMP • A double-edged sword • Reduces communication latency • But causes conflicts & contention

  4. Cache Sharing on CMP: Non-Uniformity • A double-edged sword • Reduces communication latency • But causes conflicts & contention

  5. Many Efforts for Exploitation • Example: shared-cache-aware scheduling • Assigning suitable programs/threads to the same chip • Independent jobs • Job Co-Scheduling [Snavely+:00, Snavely+:02, El-Moursy+:06, Fedorova+:07, Jiang+:08, Zhou+:09] • Parallel threads of server applications • Thread Clustering [Tam+:07]

  6. Overview of this Work (1/3) • A surprising finding • Insignificant effects from shared cache on a recent multithreaded benchmark suite (PARSEC) • Drawn from a systematic measurement • thousands of runs • 7 dimensions on the levels of programs, OS, & architecture • derived from timing results • confirmed by hardware performance counters

  7. Overview of this Work (2/3) • A detailed analysis • Reason • three mismatches between executables and CMP cache architecture • Cause • current development and compilation practices are oblivious to cache sharing

  8. Overview of this Work (3/3) • An exploration of the implications • Exploiting cache sharing deserves not less but more attention • But to exert its power, cache-sharing-aware transformations are critical • Cuts cache misses by half • Improves performance by 36%

  9. Outline • Experiment design • Measurement and findings • Cache-sharing-aware transformation • Related work, summary, and conclusions

  10. Benchmarks (1/3) • PARSEC suite by Princeton Univ. [Bienia+:08]: "focuses on emerging workloads and was designed to be representative of next-generation shared-memory programs for chip-multiprocessors"

  11. Benchmarks (2/3) • Composed of • RMS applications • Systems applications • … • A wide spectrum of • working sets, locality, data sharing, synchronization, off-chip traffic, etc.

  12. Benchmarks (3/3)
     Program       | Description             | Parallelism  | Working Set
     Blackscholes  | Black-Scholes equation  | data         | 2MB
     Bodytrack     | body tracking           | data         | 8MB
     Canneal       | simulated annealing     | unstructured | 256MB
     Facesim       | face simulation         | data         | 256MB
     Fluidanimate  | fluid dynamics          | data         | 64MB
     Streamcluster | online clustering       | data         | 16MB
     Swaptions     | portfolio pricing       | data         | 0.5MB
     X264          | video encoding          | pipeline     | 16MB
     Dedup         | stream compression      | pipeline     | 256MB
     Ferret        | image search            | pipeline     | 64MB

  13. Factors Covered in Measurements
     Level         | Dimension       | # Variations | Description
     Program level | benchmarks      | 10           | from PARSEC
     Program level | parallelism     | 3            | data, pipeline, unstructured
     Program level | inputs          | 4            | simsmall, simmedium, simlarge, native
     Program level | # of threads    | 4            | 1, 2, 4, 8
     OS level      | assignment      | 3            | thread assignment to cores
     OS level      | binding         | 2            | yes, no
     OS level      | subset of cores | 7            | the cores a program uses
     Arch. level   | platforms       | 2            | Intel Xeon & AMD Opteron
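The binding and assignment dimensions require pinning each thread to a chosen core. The slides do not show how this was done; below is a minimal Linux sketch using glibc's pthread_setaffinity_np, with hypothetical core ids standing in for one "sharing" and one "non-sharing" placement.

```c
/* Minimal sketch of pinning a thread to one core on Linux.
 * The core ids below are hypothetical: the mapping of core ids to
 * shared caches must be read from the machine itself. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void pin_to_core(int core_id)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);           /* allow only this core */
    if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0)
        fprintf(stderr, "failed to pin thread to core %d\n", core_id);
}

/* Example placements for two threads on a chip where cores 0 and 1
 * share an L2 (sharing case) and cores 0 and 2 do not (non-sharing). */
static const int sharing_cores[2]     = {0, 1};
static const int non_sharing_cores[2] = {0, 2};

int main(void)
{
    /* Demo: realize the "sharing" placement for the calling thread;
     * a real harness would pin each worker thread similarly. */
    pin_to_core(sharing_cores[0]);
    (void)non_sharing_cores;          /* the alternative placement */
    return 0;
}
```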

  14. Machines
     Intel Xeon 5310: 8 cores, 32KB L1 per core, each pair of cores shares a 4MB L2 (four L2 caches in total), 8GB DRAM.
     AMD Opteron 2352: 8 cores on two quad-core chips; per core, 64KB L1 and a private 512KB L2; each chip shares a 2MB L3 and is attached to 4GB DRAM.
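Setting up such placements requires knowing which core ids share a cache on a given machine. Here is a minimal Linux-specific sketch (not from the paper) that reads this from sysfs; index2 is assumed to be the L2 entry and should be verified against the accompanying "level" file on the target system.

```c
/* Minimal sketch (Linux-specific): list the cores that share the
 * cache described by cpu0's cache/index2 entry. */
#include <stdio.h>

int main(void)
{
    char buf[256];
    FILE *f = fopen("/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list", "r");
    if (!f) { perror("fopen"); return 1; }
    if (fgets(buf, sizeof buf, f))
        printf("cores sharing cpu0's index2 cache: %s", buf);
    fclose(f);
    return 0;
}
```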

  15. Measurement Schemes • Running times • Built-in hooks in PARSEC • Hardware performance counters • PAPI • cache misses, memory bus transactions, shared-data accesses
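As a rough illustration of the counter-based scheme, the sketch below reads L2 access and miss counts around a region of interest using PAPI preset events (PAPI_L2_TCA, PAPI_L2_TCM). These are standard PAPI presets but not necessarily the exact events the authors used, and their availability depends on the processor.

```c
/* Minimal sketch: count L2 accesses and misses around a code region
 * with PAPI preset events. Check availability with papi_avail first. */
#include <papi.h>
#include <stdio.h>

int main(void)
{
    int events[2] = { PAPI_L2_TCA, PAPI_L2_TCM };   /* accesses, misses */
    long long counts[2];
    int evset = PAPI_NULL;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI init failed\n");
        return 1;
    }
    PAPI_create_eventset(&evset);
    if (PAPI_add_events(evset, events, 2) != PAPI_OK) {
        fprintf(stderr, "L2 presets not available on this CPU\n");
        return 1;
    }

    PAPI_start(evset);
    /* ... region of interest (e.g., the parallel phase of a benchmark) ... */
    PAPI_stop(evset, counts);

    printf("L2 accesses: %lld, L2 misses: %lld\n", counts[0], counts[1]);
    return 0;
}
```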

  16. Outline • Experiment design • Measurement and findings • Cache-sharing-aware transformation • Related work, summary, and conclusions

  17. Observation I: Sharing vs. Non-sharing

  18. Sharing vs. Non-sharing [Diagram: threads T1 and T2 placed on cores that share one cache vs. on cores with separate caches]

  19. Sharing vs. Non-sharing [Diagram: threads T1–T4 placed pairwise on shared caches vs. spread across separate caches]

  20. Sharing vs. Non-sharing • Performance Evaluation (Intel) [Bar chart: normalized performance of the sharing vs. non-sharing placements, 2 and 4 threads with simlarge and native inputs, for blackscholes, bodytrack, canneal, facesim, fluidanimate, streamcluster, swaptions, and x264]

  21. Sharing vs. Non-sharing • Performance Evaluation (AMD) [Bar chart: normalized performance of the sharing vs. non-sharing placements, 2 and 4 threads with simlarge and native inputs, for blackscholes, bodytrack, canneal, facesim, fluidanimate, streamcluster, swaptions, and x264]

  22. Sharing vs. Non-sharing • L2-cache accesses & misses (Intel)

  23. Reasons (1/2) • 1) Small amount of inter-thread data sharing [Bar chart: sharing ratio of reads (%) on Intel, 0–7%, for blackscholes, bodytrack, canneal, facesim, fluidanimate, streamcluster, swaptions, x264]

  24. Reasons (2/2) • 2) Large working sets (see the benchmark table on slide 12: working sets range from 0.5MB to 256MB, and most exceed the shared cache capacity — 4MB L2 on the Intel machine, 2MB L3 on the AMD)

  25. Observation II: Different Sharing Cases • Threads may differ • Different data to be processed or tasks to be conducted • Non-uniform communication and data sharing • Different thread placements may give different performance in the sharing case

  26. Different Sharing Cases [Diagram: three different pairings of threads T1–T4 onto the two shared caches, e.g., (T1,T3)(T2,T4) vs. (T1,T2)(T3,T4)]

  27. Max. Perf. Diff (%) [Bar chart: maximum performance difference (%) across thread placements, 0–16%, 2 and 4 threads with simlarge and native inputs, for blackscholes, bodytrack, canneal, facesim, fluidanimate, streamcluster, swaptions, x264. The differences are statistically insignificant: there are large fluctuations across runs of the same configuration.]

  28. Two Possible Reasons • Similar interactions among threads • Differences are smoothed by phase shifts

  29. Temporal Traces of L2 Misses

  30. Temporal Traces of L2 Misses

  31. Two Possible Reasons • Similar interactions among threads • Differences are smoothed by phase shifts

  32. Pipeline Programs • Two such programs: ferret and dedup • Numerous concurrent stages • Interactions within and between stages • Large differences between different thread-core assignments • Mainly due to load imbalance rather than differences in cache sharing

  33. A Short Summary • Insignificant influence of cache sharing on performance • Large working sets • Little data sharing • Thread placement does not matter • Due to uniform relations among threads • These findings hold across inputs, numbers of threads, architectures, and phases

  34. Outline • Experiment design • Measurement and findings • Cache-sharing-aware transformation • Related work, summary, and conclusions

  35. Principle • Increase data sharing among sibling threads (threads on cores that share a cache) • Decrease data sharing otherwise • Match non-uniform threads to non-uniform cache sharing

  36. Example: streamcluster — original code
     thread 1:
       for i = 1 to N, step = 1
         ...
         for j = T1 to T2
           dist = foo(p[j], p[c[i]])
         end
         ...
       end
     thread 2:
       for i = 1 to N, step = 1
         ...
         for j = T2+1 to T3
           dist = foo(p[j], p[c[i]])
         end
         ...
       end

  37. Example: streamcluster — optimized code
     thread 1:
       for i = 1 to N, step = 2
         ...
         for j = T1 to T3
           dist = foo(p[j], p[c[i]])
         end
         ...
       end
     thread 2:
       for i = 1 to N, step = 2
         ...
         for j = T1 to T3
           dist = foo(p[j], p[c[i+1]])
         end
         ...
       end
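A C-style sketch of the transformation on slides 36 and 37, using hypothetical names (p, c, dist_to_center, lo/hi bounds) rather than streamcluster's actual data structures: in the original partitioning, sibling threads scan disjoint blocks of points for every center and compete for the shared cache; after the transformation, both siblings scan the same block of points but process interleaved centers, so the data one thread brings into the shared cache is reused by the other.

```c
/* Sketch of the cache-sharing-aware restructuring on slides 36-37.
 * Names (p, c, dist_to_center, lo, hi, N) are illustrative, not the
 * actual streamcluster code. Two sibling threads share one L2 cache. */

/* Hypothetical distance routine standing in for streamcluster's dist(). */
static float dist_to_center(const float *p, int j, int center)
{
    float d = p[j] - p[center];
    return d * d;
}

/* Original: each thread scans its own disjoint block of points [lo,hi)
 * for every center, so the two siblings pull disjoint data into the
 * cache they share and compete for its capacity. */
float original_thread(const float *p, const int *c, int N, int lo, int hi)
{
    float sum = 0.0f;
    for (int i = 0; i < N; i++)            /* every center */
        for (int j = lo; j < hi; j++)      /* this thread's points only */
            sum += dist_to_center(p, j, c[i]);
    return sum;
}

/* Transformed: both siblings scan the SAME block of points [lo,hi) but
 * handle interleaved centers (even vs. odd), so the points one thread
 * loads into the shared cache are reused by the other. */
float transformed_thread(int tid, const float *p, const int *c, int N,
                         int lo, int hi)
{
    float sum = 0.0f;
    for (int i = tid; i < N; i += 2)       /* tid 0: even centers, tid 1: odd */
        for (int j = lo; j < hi; j++)      /* the siblings' combined points */
            sum += dist_to_center(p, j, c[i]);
    return sum;
}
```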

  38. Performance Improvement (streamcluster) [Bar chart: normalized L2 cache misses and memory bus transactions of the transformed code, 4 and 8 threads]

  39. Other Programs [Bar chart: normalized L2 misses on Intel for the transformed blackscholes and bodytrack, 4 and 8 threads]

  40. Implication • To exert the potential of shared cache, program-level transformations are critical • Limited existing explorations: Sarkar & Tullsen '08, Kumar & Tullsen '02, Nikolopoulos '03 • A contrast to the large body of work in OS and architecture
