

SLIDE 1

Does Cache Sharing on Modern CMP Matter to the Performance of Contemporary Multithreaded Programs?

Eddy Zheng Zhang, Yunlian Jiang, Xipeng Shen (presenter), Computer Science Department, The College of William and Mary, VA, USA

SLIDE 2

Cache Sharing

  • A common feature on modern CMP


SLIDE 3

Cache Sharing on CMP

  • A double-edged sword
  • Reduces communication latency
  • But causes conflicts & contention


SLIDE 4

Cache Sharing on CMP

  • A double-edged sword
  • Reduces communication latency
  • But causes conflicts & contention


Non-Uniformity

SLIDE 5

Many Efforts for Exploitation

  • Example: shared-cache-aware scheduling
  • Assigning suitable programs/threads to the same chip
  • Independent jobs
  • Job co-scheduling [Snavely+:00, Snavely+:02, El-Moursy+:06, Fedorova+:07, Jiang+:08, Zhou+:09]
  • Parallel threads of server applications
  • Thread clustering [Tam+:07]

SLIDE 6

Overview of this Work (1/3)

  • A surprising finding
  • Insignificant effects from shared cache on a recent multithreaded benchmark suite (PARSEC)
  • Drawn from a systematic measurement
  • thousands of runs
  • 7 dimensions at the program, OS, & architecture levels
  • derived from timing results
  • confirmed by hardware performance counters

SLIDE 7

Overview of this Work (2/3)

  • A detailed analysis
  • Reason
  • three mismatches between executables and the CMP cache architecture
  • Cause
  • current development and compilation practices are oblivious to cache sharing

SLIDE 8

Overview of this Work (3/3)

  • An exploration of the implications
  • Exploiting cache sharing deserves not less but more attention.
  • But to exert its power, cache-sharing-aware transformations are critical
  • Cuts cache misses by half
  • Improves performance by 36%

SLIDE 9

Outline

  • Experiment design
  • Measurement and findings
  • Cache-sharing-aware transformation
  • Related work, summary, and conclusions


SLIDE 10

Benchmarks (1/3)

  • PARSEC suite by Princeton Univ [Bienia+:08]


“focuses on emerging workloads and was designed to be representative of next-generation shared-memory programs for chip-multiprocessors”

SLIDE 11

Benchmarks (2/3)

  • Composed of
  • RMS applications
  • Systems applications
  • ……
  • A wide spectrum of
  • working sets, locality, data sharing, synch., off-chip traffic, etc.

SLIDE 12

Benchmarks (3/3)

Program        Description             Parallelism    Working Set
Blackscholes   Black-Scholes equation  data           2MB
Bodytrack      body tracking           data           8MB
Canneal        simulated annealing     unstructured   256MB
Facesim        face simulation         data           256MB
Fluidanimate   fluid dynamics          data           64MB
Streamcluster  online clustering       data           16MB
Swaptions      portfolio pricing       data           0.5MB
X264           video encoding          pipeline       16MB
Dedup          stream compression      pipeline       256MB
Ferret         image search            pipeline       64MB

SLIDE 13

Factors Covered in Measurements

Dimension        Variations   Description
benchmarks       10           from PARSEC
parallelism      3            data, pipeline, unstructured
inputs           4            simsmall, simmedium, simlarge, native
# of threads     4            1, 2, 4, 8
assignment       3            thread assignment to cores
binding          2            yes, no
subset of cores  7            the cores a program uses
platforms        2            Intel Xeon & AMD Opteron

The dimensions span the program, OS, and architecture levels.
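
The binding, assignment, and subset-of-cores dimensions amount to pinning threads to chosen cores. Below is a minimal sketch of one way to do this on Linux; the use of pthread_setaffinity_np and the core id are illustrative assumptions, as the slides do not say how binding was implemented.

    /* Illustrative sketch: pin one worker thread to a chosen core.      */
    /* pthread_setaffinity_np is a GNU extension; core 3 is an arbitrary */
    /* example, not a value taken from the experiments.                  */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    static void *work(void *arg) {
        (void)arg;
        /* ... thread body ... */
        return NULL;
    }

    int main(void) {
        pthread_t t;
        cpu_set_t set;

        pthread_create(&t, NULL, work, NULL);

        CPU_ZERO(&set);
        CPU_SET(3, &set);                       /* bind the thread to core 3 */
        if (pthread_setaffinity_np(t, sizeof(set), &set) != 0)
            fprintf(stderr, "setaffinity failed\n");

        pthread_join(t, NULL);
        return 0;
    }

Pinning one thread per chosen core in this way is one way to make a given thread-to-core assignment repeatable across runs.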

SLIDE 14

Machines

Intel (Xeon 5310): 8 cores with 32KB L1 each; four 4MB L2 caches, each shared by a pair of cores; 8GB DRAM.

AMD (Opteron 2352): 8 cores with 64KB L1 and 512KB L2 each; two 2MB L3 caches, each shared by four cores; two 4GB DRAM nodes.

SLIDE 15

Measurement Schemes

  • Running times
  • Built-in hooks in PARSEC
  • Hardware performance counters
  • PAPI
  • cache misses, memory bus transactions, shared data accesses (see the sketch below)

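As an illustration of the counter-based scheme, here is a minimal PAPI sketch that reads L2 cache misses around a region of interest; the specific preset event and the program structure are assumptions for illustration, not the authors' actual harness.

    /* Illustrative sketch: count L2 total cache misses around a region */
    /* of interest with PAPI.                                           */
    #include <papi.h>
    #include <stdio.h>

    int main(void) {
        int evset = PAPI_NULL;
        long long counts[1] = {0};

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
            fprintf(stderr, "PAPI init failed\n");
            return 1;
        }
        if (PAPI_create_eventset(&evset) != PAPI_OK ||
            PAPI_add_event(evset, PAPI_L2_TCM) != PAPI_OK) {
            fprintf(stderr, "event setup failed\n");
            return 1;
        }

        PAPI_start(evset);
        /* ... region of interest: the parallel phase being measured ... */
        PAPI_stop(evset, counts);

        printf("L2 total cache misses: %lld\n", counts[0]);
        return 0;
    }

PAPI presets such as PAPI_L2_TCM map onto platform-specific raw events, so the same measurement code can be reused on both the Intel and AMD machines where the presets are available.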

SLIDE 16

Outline

  • Experiment design
  • Measurement and findings
  • Cache-sharing-aware transformation
  • Related work, summary, and conclusions


SLIDE 17

Observation I: Sharing vs. Non-sharing


SLIDE 18

Sharing vs. Non-sharing

[Diagram: two threads T1 and T2 placed on cores that share one cache vs. on cores with separate caches.]

SLIDE 19

Sharing vs. Non-sharing

[Diagram: the four-thread case, comparing a placement where thread pairs share caches with one where each thread uses a separate cache.]

SLIDE 20

Sharing vs. Non-sharing

  • Performance Evaluation (Intel)

[Bar chart: sharing vs. non-sharing performance for blackscholes, bodytrack, canneal, facesim, fluidanimate, streamcluster, swaptions, and x264, with 2 and 4 threads, simlarge and native inputs.]

SLIDE 21

Sharing vs. Non-sharing

  • Performance Evaluation (AMD)

[Bar chart: the same sharing vs. non-sharing comparison on the AMD machine, for the same benchmarks and configurations.]

SLIDE 22

Sharing vs. Non-sharing

  • L2-cache accesses & misses (Intel)


SLIDE 23

Reasons (1/2)

1) Small amount of inter-thread data sharing

[Bar chart: sharing ratio of reads (%) on Intel for blackscholes, bodytrack, canneal, facesim, fluidanimate, streamcluster, swaptions, and x264.]

SLIDE 24

Reasons (2/2)

2) Large working sets

Program        Description             Parallelism    Working Set
Blackscholes   Black-Scholes equation  data           2MB
Bodytrack      body tracking           data           8MB
Canneal        simulated annealing     unstructured   256MB
Facesim        face simulation         data           256MB
Fluidanimate   fluid dynamics          data           64MB
Streamcluster  online clustering       data           16MB
Swaptions      portfolio pricing       data           0.5MB
X264           video encoding          pipeline       16MB
Dedup          stream compression      pipeline       256MB
Ferret         image search            pipeline       64MB

SLIDE 25

Observation II: Different Sharing Cases

  • Threads may differ
  • Different data to be processed or tasks to be conducted
  • Non-uniform communication and data sharing
  • Different thread placement may give different performance in the sharing case

SLIDE 26

Different Sharing Cases

[Diagram: the three ways to pair four threads onto two shared caches: (T1,T2)(T3,T4), (T1,T3)(T2,T4), (T1,T4)(T2,T3).]

SLIDE 27

  • Max. Perf. Diff (%)

[Bar chart: maximum performance difference (%) among thread placements for blackscholes, bodytrack, canneal, facesim, fluidanimate, streamcluster, swaptions, and x264, with 2 and 4 threads, simlarge and native inputs.]

Statistically insignificant: large fluctuations across runs of the same configuration.

SLIDE 28

Two Possible Reasons

  • Similar interactions among threads
  • Differences are smoothed by phase shifts


SLIDE 29

Temporal Traces of L2 misses


SLIDE 30

Temporal Traces of L2 misses


SLIDE 31

Two Possible Reasons

  • Similar interactions among threads
  • Differences are smoothed by phase shifts


SLIDE 32

Pipeline Programs

  • Two such programs: ferret and dedup
  • Numerous concurrent stages
  • Interactions within and between stages
  • Large differences between different thread-core assignments
  • Mainly due to load balance rather than differences in cache sharing

SLIDE 33

A Short Summary

  • Insignificant influence on performance
  • Large working sets
  • Little data sharing
  • Thread placement does not matter
  • Due to uniform relations among threads
  • These findings hold across inputs, # threads, architectures, and phases


SLIDE 34

Outline

  • Experiment design
  • Measurement and findings
  • Cache-sharing-aware transformation
  • Related work, summary, and conclusions


SLIDE 35

Principle

  • Increase data sharing among sibling threads (threads on cores that share a cache)
  • Decrease data sharing otherwise

Non-uniform threads → non-uniform cache sharing

SLIDE 36

Example: streamcluster

  • Original code

thread 1:
  for i = 1 to N, step = 1
    ...
    for j = T1 to T2
      dist = foo(p[j], p[c[i]])
    end
    ...
  end

thread 2:
  for i = 1 to N, step = 1
    ...
    for j = T2+1 to T3
      dist = foo(p[j], p[c[i]])
    end
    ...
  end

SLIDE 37

Example: streamcluster

  • Optimized code

thread 1:
  for i = 1 to N, step = 2
    ...
    for j = T1 to T3
      dist = foo(p[j], p[c[i]])
    end
    ...
  end

thread 2:
  for i = 1 to N, step = 2
    ...
    for j = T1 to T3
      dist = foo(p[j], p[c[i+1]])
    end
    ...
  end
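
A self-contained C sketch of the transformed kernel follows; the names N, T1, T3, p, c, and foo mirror the pseudocode above, while the array sizes, the accumulation of dist, and the pthread scaffolding are illustrative assumptions rather than the actual streamcluster source.

    /* Illustrative sketch of the optimized loop structure: both threads   */
    /* scan the full j range and split the i iterations, so they reuse the */
    /* same p[] data in the shared cache. Not the actual streamcluster.    */
    #include <pthread.h>
    #include <stdio.h>

    #define N   1024                 /* number of outer iterations (assumed) */
    #define T1  0                    /* start of the j range (assumed)       */
    #define T3  4096                 /* end of the j range (assumed)         */

    static float p[T3 + 1];          /* point data shared by both threads    */
    static int   c[N + 2];           /* center indices                       */
    static float result[2];          /* per-thread accumulators              */

    static float foo(float a, float b) { return (a - b) * (a - b); }

    static void *worker(void *arg) {
        long tid = (long)arg;        /* 0 for thread 1, 1 for thread 2 */
        float acc = 0.0f;
        for (int i = 1; i <= N; i += 2) {
            for (int j = T1; j <= T3; j++)
                acc += foo(p[j], p[c[i + tid]]);   /* dist accumulated for demo */
        }
        result[tid] = acc;
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        for (long tid = 0; tid < 2; tid++)
            pthread_create(&t[tid], NULL, worker, (void *)tid);
        for (int k = 0; k < 2; k++)
            pthread_join(t[k], NULL);
        printf("%f %f\n", result[0], result[1]);
        return 0;
    }

Because both threads now sweep the same p[T1..T3] range while splitting the i iterations, sibling threads reuse the same data in the shared cache instead of streaming disjoint halves of it, which is the increase in inter-thread sharing the transformation targets.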

SLIDE 38

Performance Improvement (streamcluster)

[Bar chart: normalized L2 cache misses and memory bus transactions of the optimized streamcluster, for 4 and 8 threads.]

SLIDE 39

Other Programs

[Bar chart: normalized L2 misses (on Intel) after the transformation, for blackscholes and bodytrack, with 4 and 8 threads.]

SLIDE 40

Implication

  • To exploit the potential of shared cache, program-level transformations are critical.
  • Limited existing explorations
  • Sarkar & Tullsen’08, Kumar & Tullsen’02, Nikolopoulos’03

* A contrast to the large body of work in OS and architecture.

SLIDE 41

Related Work

  • Co-runs of independent programs
  • Snavely+:00, Snavely+:02, El-Moursy+:06, Fedorova+:07, Jiang+:08, Zhou+:09, Tian+:09
  • Co-runs of parallel threads of multithreaded programs
  • Liao+:05, Tuck+:03, Tam+:07
  • Prior work has focused on certain aspects of CMP
  • Simulator-based studies for cache design
  • Old benchmarks (e.g., SPLASH-2)
  • Specific classes of apps (e.g., server apps)
  • Old CMPs with no shared cache

First systematic examination of the influence of cache sharing in modern CMP on the performance of contemporary multithreaded applications.

SLIDE 42

Summary

  • Measurement: insignificant influence from cache sharing despite inputs, architecture, # threads, thread placement, parallelism, phases, etc.
  • Analysis: mismatches between SW & HW cause the observations.
  • Transformation: large potential of cache-sharing-aware code optimizations.

SLIDE 43

Conclusion

Does cache sharing on CMP matter to contemporary multithreaded programs?

  • Yes. But the main effects show up only after cache-sharing-aware transformations.

SLIDE 44

Thanks!


Questions?