SLIDE 1

Is Reuse Distance Applicable to Data Locality Analysis

  • n Chip Multiprocessors?

Yunlian Jiang Eddy Z. Zhang Kai Tian Xipeng Shen (presenter)

Department of Computer Science The College of William and Mary, VA, USA

SLIDE 2

The College of William and Mary

Cache Sharing

  • A common feature on modern CMPs

SLIDE 3

Data Locality

  • Extensively studied for uni-core processors
  • Two classes of metrics
  • At hardware level
  • E.g., cache miss rate
  • At program level
  • E.g., reuse distance

SLIDE 4

Reuse Distance (RD)

  • Def: # of distinct data between two adjacent ref. to a data element

  • E.g. b c a a c b rd=2

[Figure: RD histogram]

SLIDE 5

Reuse Distance (RD)

  • Def: # of distinct data between two adjacent ref. to a data element

  • E.g. b c a a c b rd=2
  • Appealing properties
  • Hardware-independence
  • Accurate, point to point
  • Cross-input predictable
  • Bounded value---data size
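The definition above can be made concrete with a minimal sketch. This is a naive O(n²) list-based simulation for illustration only, not the profiler used in the paper; practical tools use tree-based algorithms for large traces:

```python
def reuse_distances(trace):
    """Reuse distance of each access: # of distinct data elements
    referenced between this access and the previous access to the
    same element (None for a first access)."""
    distances = []
    stack = []  # elements ordered by most recent use, last = newest
    for x in trace:
        if x in stack:
            # distinct elements touched since the last access to x
            distances.append(len(stack) - stack.index(x) - 1)
            stack.remove(x)
        else:
            distances.append(None)
        stack.append(x)
    return distances

print(reuse_distances(list("bcaacb")))  # last access to b has rd = 2
```

On the slide's example trace b c a a c b, the second b sees {c, a} in between, giving rd = 2.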

SLIDE 6

Many Uses of Reuse Distance

  • Cross-arch performance prediction [Marin+:SIGMETRICS04, Zhong+:PACT03]

  • Model reference affinity [Zhong+:PLDI04]
  • Guide memory disambiguation [Fang+:PACT05]
  • Detect locality phases [Shen+:ASPLOS04]
  • Software refactoring [Beyls+:HPCC06]
  • Model cache sharing [Chandra+:HPCA05]
  • Study data reuses [Ding+:SC04,Huang+:ASPLOS05]
  • Insert cache hints [Beyls+:JSA05]
  • Manage superpages [Cascaval+:PACT05]

SLIDE 7

Complexity Caused by Cache Sharing

  • Data locality is not solely determined by a process itself
  • Accesses by its co-runners need to be considered

SLIDE 8

Questions to Answer

  • Is reuse distance applicable for locality characterization on CMP?

  • What are the new challenges?
  • Are these challenges addressable?

SLIDE 9

Outline

  • Complexities in extending reuse distance model to CMP
  • Loss of hardware-independence
  • A chicken-egg dilemma for performance prediction
  • Addressing the issues for some multithreading app.
  • A probabilistic model to derive reuse distance in co-runs
  • Evaluation
SLIDE 10

Terms

  • Concurrent reuse distance (CRD)
  • # of distinct data accessed by all co-runners between two adjacent ref. to a data element
  • Standalone reuse distance (SRD)
  • # of distinct data accessed by the current process between two adjacent ref. to a data element
  • Example: P1: a b b c d a; P2: p q p q
  • SRD = 3; CRD = 3 + 2 = 5
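The SRD/CRD distinction can be checked mechanically. The sketch below assumes a hypothetical interleaving of the example traces (P1: a b b c d a, P2: p q p q) in which all of P2's accesses fall inside P1's reuse interval of a:

```python
def srd_and_crd(global_trace, thread, elem):
    """For the first reuse of `elem` by `thread` in an interleaved
    (thread, address) trace: SRD counts distinct data accessed by
    `thread` alone in between; CRD counts distinct data accessed by
    all co-runners (no data sharing assumed here)."""
    refs = [i for i, (t, x) in enumerate(global_trace)
            if t == thread and x == elem]
    start, end = refs[0], refs[1]
    own, everyone = set(), set()
    for t, x in global_trace[start + 1:end]:
        everyone.add(x)
        if t == thread:
            own.add(x)
    return len(own), len(everyone)

# Hypothetical interleaving of P1: a b b c d a with P2: p q p q
trace = [(1, "a"), (2, "p"), (1, "b"), (2, "q"), (1, "b"),
         (2, "p"), (1, "c"), (2, "q"), (1, "d"), (1, "a")]
print(srd_and_crd(trace, 1, "a"))  # (3, 5): SRD = 3, CRD = 3 + 2 = 5
```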

SLIDE 11

Distinctive Property of CRD

  • Example: mem. references by P1 are a b c b a
  • SRD = 2; CRD = 2 + x, where x is the # of distinct data the co-runner accesses in the interval
  • Let r = speed(P2)/speed(P1): the larger r is, the greater x tends to be
  • CRD depends on the relative running speeds of co-runners

SLIDE 12

Two Implications

  • First, CRD is hard to measure in real programs.
  • Instrumentation changes relative speeds

  • Original relative speed: r = IPC_i / IPC_j
  • After instrumentation: r' = IPC'_i / IPC'_j
  • Change of relative speed: |r - r'| / r
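As a worked example of this formula, with made-up IPC numbers:

```python
def relative_speed_change(ipc_i, ipc_j, ipc_i_prime, ipc_j_prime):
    """|r - r'| / r, where r = IPC_i/IPC_j before instrumentation
    and r' = IPC'_i/IPC'_j after."""
    r = ipc_i / ipc_j
    r_prime = ipc_i_prime / ipc_j_prime
    return abs(r - r_prime) / r

# Hypothetical: instrumentation slows thread i by 4x but thread j by
# only 2x, so r = 2/1 = 2 becomes r' = 0.5/0.5 = 1: a 50% change.
print(relative_speed_change(2.0, 1.0, 0.5, 0.5))  # 0.5
```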

SLIDE 13
Two Implications (cont.)

  • Second, CRD loses hardware-independence.
  • Relative speeds change across architectures.

  • Consequence
  • Cross-arch. perf. pred. becomes hard for co-runs

SLIDE 14

Cross-Arch. Performance Prediction

  • For single runs: train an SRD → IPC predictor on the training platform, then apply it to the SRD measured on the testing platform.
  • For co-runs: the analogous CRD → IPC predictor needs CRD on the testing platform, but CRD there depends on the threads' relative speeds, hence on the very IPCs being predicted: a chicken-egg dilemma.

SLIDE 15

Iterative Approach Not Applicable

[Figure: training a CRD → IPC predictor on the training platform and applying it on the testing platform]

SLIDE 16

Iterative Approach Not Applicable

[Figure: the circular dependence on the testing platform: IPC(I), IPC(J) → CRD(I), CRD(J) → CacheMiss(I), CacheMiss(J) → IPC(I), IPC(J)]

SLIDE 17

Outline

  • Complexities in extending reuse distance model to CMP

  • Loss of hardware-independence
  • A chicken-egg dilemma for performance prediction
  • Addressing the issues for some multithreading app.
  • A probabilistic model to derive reuse distance in co-runs

  • Evaluation

SLIDE 18

Favorable Observations

  • From a systematic study [Zhang+:PPoPP'10] on PARSEC non-pipelining multithreading benchmarks:
  • All parallel threads of an app. conduct similar computations
  • Uniform relations among threads
  • These observations hold across arch., inputs, # of threads, thread-core assignments, and program phases.

SLIDE 19

Implication

  • Relative speeds among threads tend to remain the same across arch. and inputs.

SLIDE 20

An Efficient Way to Estimate CRD

SRD_T1, SRD_T2, ..., SRD_Tm  →  [prob. model]  →  CRD_T1, CRD_T2, ..., CRD_Tm

SLIDE 21

Two Steps

(1) From the interval length ∆, estimate d (# of distinct data accessed) for each thread

  • For a reuse interval of length ∆ in T1's trace (a ... a), obtain d_T1, d_T2, ..., d_Tm
  • Assuming no data sharing: CRD_T1 = d_T1 + d_T2 + ... + d_Tm

(2) Handle effects of data sharing
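Step (1) amounts to summing per-thread distinct counts over the same interval. A minimal sketch under the no-sharing assumption, using window contents taken from the earlier SRD/CRD example:

```python
def estimate_crd(windows):
    """CRD_T1 = d_T1 + d_T2 + ... + d_Tm: each window holds the data
    one thread accesses during T1's reuse interval; with no data
    sharing, CRD is the sum of the per-thread distinct counts."""
    return sum(len(set(w)) for w in windows)

# T1 accesses b b c d inside its reuse interval of a; T2 accesses p q p q.
print(estimate_crd([["b", "b", "c", "d"], ["p", "q", "p", "q"]]))  # 3 + 2 = 5
```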

SLIDE 22

Time Distance (TD)

  • Def: the # of accesses (repeats included, unlike RD) between two adjacent references to the same element
  • E.g. b c a a c b: td = 4 (rd = 2) for the reuse of b
  • TD Histogram (TDH): shows the probability for an access to have a certain TD
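A TDH is cheap to collect in one pass over a trace. A minimal sketch (a real profiler would bin the distances):

```python
from collections import Counter

def td_histogram(trace):
    """Time distance of a reuse: # of accesses (repeats included,
    unlike RD) between two adjacent references to the same element.
    Returns {td: fraction of reuses with that td}."""
    last_seen, tds = {}, []
    for i, x in enumerate(trace):
        if x in last_seen:
            tds.append(i - last_seen[x] - 1)
        last_seen[x] = i
    return {td: n / len(tds) for td, n in Counter(tds).items()}

print(td_histogram(list("bcaacb")))  # the reuse of b has td = 4
```

On the slide's trace b c a a c b, the three reuses have time distances 0 (a), 2 (c), and 4 (b), so each TD value holds a third of the probability mass.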

SLIDE 23

  • P_i(∆): probability for an object O_i to be referenced in a ∆-long interval
  • q_i(∆): O_i is accessed at time point ∆, but not at the ∆-1 points ahead

Recurrence:
  P_i(∆) = P_i(∆-1) + q_i(∆)
  P_i(∆-1) = P_i(∆-2) + q_i(∆-1)
  ...
  P_i(1) = P_i(0) + q_i(1)

Summing (with P_i(0) = 0):
  P_i(∆) = Σ_{τ=1..∆} q_i(τ)

SLIDE 24

  • q_i(τ): O_i is accessed at time point τ, but not at the τ-1 points ahead. It is equivalent to:
    1) the object accessed at τ is O_i, and
    2) the time distance of that reference is greater than τ.

  q_i(τ) = (n_i / T) · Σ_{δ=τ+1..T} H_i(δ)

Substituting into P_i(∆) = Σ_{τ=1..∆} q_i(τ):

  P_i(∆) = (n_i / T) · Σ_{τ=1..∆} Σ_{δ=τ+1..T} H_i(δ)

(n_i: # of accesses to O_i; T: trace length; H_i: the TDH of O_i)
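The two formulas translate directly into code. The sketch below feeds a hypothetical per-object TDH H_i (a dict mapping time distance to probability; the values are made up for illustration) into q_i and P_i:

```python
def q_i(tau, n_i, T, H_i):
    """q_i(tau) = (n_i / T) * sum_{delta = tau+1..T} H_i(delta):
    probability that the access at time tau is O_i and that its
    time distance exceeds tau."""
    return (n_i / T) * sum(H_i.get(d, 0.0) for d in range(tau + 1, T + 1))

def p_i(delta_len, n_i, T, H_i):
    """P_i(Delta) = sum_{tau = 1..Delta} q_i(tau): probability that
    O_i is referenced at least once in a Delta-long interval."""
    return sum(q_i(tau, n_i, T, H_i) for tau in range(1, delta_len + 1))

# Hypothetical object: 3 of the trace's 10 accesses touch O_i, and
# its reuses have time distance 2 or 4 with equal probability.
H = {2: 0.5, 4: 0.5}
print(p_i(1, 3, 10, H))  # q_i(1) = 0.3 * 1.0
print(p_i(2, 3, 10, H))  # adds q_i(2) = 0.3 * 0.5
```

Note that P_i(∆) is nondecreasing in ∆, as the recurrence on the previous slide requires.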

SLIDE 25

  • P(k, ∆): prob. for a ∆-long interval to contain k distinct data
  • d: # of distinct data referenced in a ∆-long interval

See paper for details.
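The paper derives P(k, ∆) in full. A common sketch of the same idea, under the simplifying assumption that each object is referenced independently with probability P_i(∆) (an assumption made here for illustration, not the paper's exact derivation), is the Poisson-binomial dynamic program:

```python
def distinct_distribution(p_values):
    """P(k, Delta): probability that a Delta-long interval contains
    exactly k distinct data objects, given p_values[i] = P_i(Delta)
    and assuming independence across objects (Poisson binomial DP)."""
    dist = [1.0]  # dist[k] = prob. of k distinct objects so far
    for p in p_values:
        nxt = [0.0] * (len(dist) + 1)
        for k, prob in enumerate(dist):
            nxt[k] += prob * (1 - p)   # this object absent from interval
            nxt[k + 1] += prob * p     # this object present in interval
        dist = nxt
    return dist

print(distinct_distribution([0.5, 0.5]))  # [0.25, 0.5, 0.25]
```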

SLIDE 26

Handling Data Sharing

  • Two effects from data sharing on CRD
  • Example: T1's references a b _ _ b _ c d _ a, interleaved with references X by T2
  • Scenario 1: Xs ∉ {a, b, c, d}: a b p q b p c d q a, CRD = 3 + 2 = 5
  • Scenario 2: a ∈ Xs: a b p a b p c d q a, the reuse interval breaks into 2 intervals
  • Scenario 3: {b, c, d} ∩ Xs ≠ ϕ: a b p c b p c d c a, CRD = 3 + 1 = 4; shared data already counted for T1 should not be counted again
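The three scenarios can be replayed on merged traces. In the sketch below, set semantics naturally counts shared data once, and a co-runner's access to the reused element itself splits the interval:

```python
def crd_of_reuses(merged, elem):
    """CRD of each reuse of `elem` in an interleaved trace: # of
    distinct data between adjacent references to `elem`.  Shared
    data is counted once; an intervening access to `elem` by a
    co-runner splits the reuse interval in two."""
    refs = [i for i, x in enumerate(merged) if x == elem]
    return [len(set(merged[s + 1:e])) for s, e in zip(refs, refs[1:])]

print(crd_of_reuses(list("abpqbpcdqa"), "a"))  # scenario 1: [5]
print(crd_of_reuses(list("abpabpcdqa"), "a"))  # scenario 2: [2, 5], interval broken
print(crd_of_reuses(list("abpcbpcdca"), "a"))  # scenario 3: [4]
```

In scenario 3, T2's accesses to c coincide with data T1 already touched, so only T2's p adds to the count: 3 + 1 = 4.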

SLIDE 27

Treating the Effects

  • Probability for a reuse interval to break
  • Probability for |C| = c (derived in the paper)

S: set of all shared data. N1, N2: data sizes of T1 and T2. n1, n2: # of distinct data accessed by T1 and T2 in an interval of length V. C: intersection of the data sets referenced by T1 and T2 in the interval.

See paper for details.

SLIDE 28

Estimation Accuracy of CRD Histograms

  • On synthetic traces
  • s: sharing ratio; n1, n2: data sizes

SLIDE 29

On Traces of Real Programs

  • Using a simulator to record traces
  • SIMICS with GEMS
  • Simulate UltraSPARC with 1MB shared L2 cache.
  • Three PARSEC programs
  • vips (image processing)
  • negligible shared data, 33,000 locks
  • accuracy 76%
  • swaptions (portfolio pricing)
  • 27% shared data, 23 locks
  • accuracy 74%
  • streamcluster (online clustering)
  • 3% shared data, 129,600 barriers
  • accuracy 72%

SLIDE 30

Related Work

  • All-window profiling [Ding and Chilimbi]
  • Predict cache misses of co-runs from circular stack distance histograms [Chandra et al., Chen & Aamodt]

  • Statistical shared cache model [Berg et al.]

SLIDE 31

Conclusions

  • Is reuse distance applicable for locality characterization on CMP?
  • Difficult in general.
  • What are the new challenges?
  • Reliance on relative speeds; loss of hardware-independence; a chicken-egg dilemma for performance prediction.
  • Are these challenges addressable?
  • Yes, for a class of multithreading applications: a probabilistic model facilitates the derivation of CRD.

SLIDE 32

Thanks!


Questions?