1

Lecture 25: Advanced Data Prefetching Techniques

  • Prefetching and data prefetching overview
  • Stride prefetching
  • Markov prefetching
  • Precomputation-based prefetching

Zhao Zhang, CPRE 585 Fall 2003

2

Memory Wall

[Figure: CPU vs. DRAM performance, 1980–2000, on a log scale from 1 to 1000 — the widening gap between CPU speed and DRAM speed]

Consider a memory latency of 1,000 processor cycles, or a few thousand instructions …

3

Where Are Solutions?

1. Reducing miss rates

  • Larger block size
  • Larger cache size
  • Higher associativity
  • Victim caches
  • Way prediction and pseudo-associativity
  • Compiler optimization

2. Reducing miss penalty

  • Multilevel caches
  • Critical word first
  • Read miss first
  • Merging write buffers

3. Reducing miss penalty or miss rates via parallelism

  • Non-blocking caches
  • Hardware prefetching
  • Compiler prefetching

4. Reducing cache hit time

  • Small and simple caches
  • Avoiding address translation
  • Pipelined cache access
  • Trace caches

4

Where Are Solutions?

Consider a 4-way issue OOO processor:

  • 20-entry issue queue
  • 80-entry ROB
  • 100ns main memory access latency

For how many cycles will the processor stall on a cache miss to main memory? OOO processors may tolerate L2 latency, but not main memory latency. Increase cache size? Add more levels of memory hierarchy?

  • Itanium: 2-4MB L3 cache
  • IBM Power4: 32MB eDRAM cache

Large caches are still very useful but may not fully address the issue.
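A back-of-envelope estimate for the question above. The clock frequency is not given on the slide; 1 GHz is assumed here so that 100 ns equals 100 cycles:

```python
# Stall estimate for the slide's example machine.
# Assumption (not in the slides): 1 GHz clock, so 100 ns = 100 cycles.
CLOCK_GHZ = 1.0
MEM_LATENCY_NS = 100
ISSUE_WIDTH = 4
ROB_ENTRIES = 80

miss_latency_cycles = int(MEM_LATENCY_NS * CLOCK_GHZ)  # 100 cycles
# After the miss, the OOO engine keeps issuing until the ROB fills:
cycles_until_rob_full = ROB_ENTRIES // ISSUE_WIDTH     # 20 cycles
stall_cycles = max(0, miss_latency_cycles - cycles_until_rob_full)
print(stall_cycles)  # 80: most of the miss latency is exposed as stall
```

Even with an 80-entry ROB, the window hides only ~20 cycles of a 100-cycle miss; at higher clock rates the exposed fraction grows further.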

5

Prefetching Evaluation

Prefetch: predict future accesses and fetch data before it is demanded.

  • Accuracy: how many prefetched items are really needed?
    False prefetching: fetching the wrong data
    Cache pollution: replacing “good” data with “bad” data
  • Coverage: how many cache misses are removed?
  • Timeliness: does the data return before it is demanded?
  • Other considerations: complexity and cost
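The two headline metrics are simple ratios over event counts; a minimal sketch with hypothetical numbers:

```python
def prefetch_metrics(prefetches_issued, prefetches_used,
                     misses_without, misses_with):
    """Accuracy and coverage of a prefetcher from raw event counts."""
    # Accuracy: fraction of issued prefetches that were actually needed
    accuracy = prefetches_used / prefetches_issued
    # Coverage: fraction of the original demand misses that were removed
    coverage = (misses_without - misses_with) / misses_without
    return accuracy, coverage

# Hypothetical counts, for illustration only:
acc, cov = prefetch_metrics(prefetches_issued=1000, prefetches_used=600,
                            misses_without=800, misses_with=300)
print(acc, cov)  # 0.6 0.625
```

Note the tension: aggressive prefetching raises coverage but typically lowers accuracy, wasting bandwidth and polluting the cache.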

6

Prefetching Targets

  • Instruction prefetching: the stream buffer is very useful
  • Data prefetching: more complicated because of the diversity of data access patterns
  • Prefetching for dynamic data (hashing, heaps, sparse arrays, etc.): usually irregular access patterns
  • Linked-list prefetching (pointer chasing): a special type of data prefetching for data in linked lists


7

Prefetching Implementations

Sequential and stride prefetching

  • Tagged prefetching
  • Simple stream buffer
  • Stride prefetching

Correlation-based prefetching

  • Markov prefetching
  • Dead-block correlating prefetching

Precomputation-based

  • Keep running programs on cache misses; or
  • Use separate hardware for prefetching; or
  • Use compiler-generated threads on multithreaded processors

Other considerations

  • Predict on miss addresses or reference addresses?
  • Prefetch into the cache or a temporary buffer?
  • Demand-based or decoupled prefetching?

8

Recall Stream Buffer Diagram

[Diagram: a direct-mapped cache (data + tags) backed by a stream buffer. Each stream buffer entry holds a tag, an available bit, and one cache block of data; a comparator checks the head entry against the miss address from the processor, and “+1” logic fetches successive blocks from the next level of cache into the tail. Shown with a single stream buffer (way); multiple ways and a filter may be used.]

Source: Jouppi, ISCA 1990
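The head-check and refill behavior in the diagram can be sketched as a tiny simulator (one way, block-granularity addresses, no filter; all names are illustrative):

```python
from collections import deque

class StreamBuffer:
    """Minimal one-way sequential stream buffer, in the spirit of Jouppi."""
    def __init__(self, depth=4):
        self.depth = depth
        self.entries = deque()   # FIFO of prefetched block addresses
        self.next_block = None   # "+1" logic: next block to fetch

    def _refill(self):
        # Fetch successive blocks from the next level of cache into the tail
        while len(self.entries) < self.depth:
            self.entries.append(self.next_block)
            self.next_block += 1

    def access(self, block):
        """Called on a cache miss; True if the head entry supplies the block."""
        if self.entries and self.entries[0] == block:
            self.entries.popleft()   # head hit: hand the block to the cache
            self._refill()
            return True
        # Head miss: flush and restart the stream one block past the miss
        self.entries.clear()
        self.next_block = block + 1
        self._refill()
        return False

buf = StreamBuffer(depth=4)
hits = [buf.access(b) for b in range(100, 108)]  # sequential miss stream
print(hits)  # first access allocates the stream, the rest hit at the head
```

On a purely sequential miss stream, only the first access misses the buffer; every later miss is caught by the head entry, which is exactly why stream buffers work so well for instruction fetch.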

9

Stride Prefetching

Limits of the stream buffer: a program may access data in either direction, e.g.

    for (i = N-1; i >= 0; i--)

Data may also be accessed in strides, e.g.

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            sum[i] += X[i][j];

10

Stride Prefetching Diagram

[Diagram: reference prediction table indexed by the load's PC. Each entry holds an instruction tag, the previous address, the stride, and a state field; the prefetch address is computed as previous address + stride, with the stride updated from each new effective address.]

11

Stride Prefetching Example

float a[100][100], b[100][100], c[100][100];
...
for (i = 0; i < 100; i++)
    for (j = 0; j < 100; j++)
        for (k = 0; k < 100; k++)
            a[i][j] += b[i][k] * c[k][j];

Iteration 1:
    tag     addr    stride  state
    Load b  20000   0       init
    Load c  30000   0       init
    Load a  30000   0       init

Iteration 2:
    tag     addr    stride  state
    Load b  20004   4       trans.
    Load c  30400   400     trans.
    Load a  30000   0       steady

Iteration 3:
    tag     addr    stride  state
    Load b  20008   4       steady
    Load c  30800   400     steady
    Load a  30000   0       steady
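The table evolution above can be reproduced with a simplified three-state reference prediction table. The state machine below is a sketch (exact transition rules vary across designs):

```python
class RPT:
    """Simplified reference prediction table: init -> trans. -> steady."""
    def __init__(self):
        self.table = {}   # PC -> [prev_addr, stride, state]

    def access(self, pc, addr):
        if pc not in self.table:
            self.table[pc] = [addr, 0, 'init']   # allocate on first access
            return None
        entry = self.table[pc]
        new_stride = addr - entry[0]
        if new_stride == entry[1]:               # stride prediction correct
            entry[2] = 'steady'
        else:                                    # mispredicted: retrain
            entry[2] = 'trans.' if entry[2] == 'init' else 'init'
            entry[1] = new_stride
        entry[0] = addr
        # Issue a prefetch only once the entry reaches the steady state
        return addr + entry[1] if entry[2] == 'steady' else None

rpt = RPT()
# Replay the slide's first three iterations of the inner k loop:
trace = [('Load b', 20000), ('Load c', 30000), ('Load a', 30000),
         ('Load b', 20004), ('Load c', 30400), ('Load a', 30000),
         ('Load b', 20008), ('Load c', 30800), ('Load a', 30000)]
for pc, addr in trace:
    rpt.access(pc, addr)
print({pc: tuple(e) for pc, e in rpt.table.items()})
# all three loads reach 'steady'; b and c carry strides 4 and 400
```

After three iterations, b (stride 4) and c (stride 400) are both in the steady state and begin generating prefetch addresses, matching the table on the slide.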

12

Markov Prefetching

Miss addresses: A B C D C E A C F F E A A B C D E A B C D C

Target irregular mem access pattern

[Diagram: Markov prediction table with N entries, each storing a miss address and up to four predicted next-miss addresses (pred 1..pred 4). On a miss, the matching entry's predicted addresses are pushed into the prefetch queue.]

Markov model

Joseph and Grunwald, ISCA 1997
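Building the Markov table from the miss sequence above can be sketched in a few lines (prediction lists ordered most-recent-first, at most four per entry, as an illustrative replacement policy):

```python
from collections import defaultdict

def build_markov_table(misses, preds=4):
    """Markov prefetch table: map each miss address to up to `preds`
    predicted next-miss addresses, most recently observed first."""
    table = defaultdict(list)
    for cur, nxt in zip(misses, misses[1:]):   # consecutive miss pairs
        succ = table[cur]
        if nxt in succ:
            succ.remove(nxt)
        succ.insert(0, nxt)      # most recent transition gets priority
        del succ[preds:]         # keep at most `preds` predictions
    return table

misses = list("ABCDCEACFFEAABCDEABCDC")   # the slide's miss address stream
table = build_markov_table(misses)
print(table['A'], table['C'])
```

On a miss to A, the table would push B (and the lower-priority predictions A, C) into the prefetch queue. Real tables are the sticking point: indexing by full miss addresses can require megabytes of history storage.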


13

Markov Prefetching Performance

From left to right: number of addresses in table

14

Markov Prefetching Performance

From left to right: stream, stride, correlation (Pomerene and Puzak), Markov, stream+stride+Markov serial, stream+stride+Markov parallel

15

Predictor-directed Stream Buffer

Cons of existing approaches:

  • Stride prefetching (using an updated stream buffer): only useful for stride accesses; suffers interference from non-stride accesses
  • Markov prefetching: works for general access patterns but requires large history storage (megabytes)

PSB combines the two methods:

  • To improve the coverage of the stream buffer; and
  • To keep the required storage low (several kilobytes)

Sair et al., MICRO 2000

16

Predictor-directed Stream Buffer

The Markov prediction table filters out irregular address transitions (reducing stream buffer thrashing). The stream buffer filters out addresses used to train the Markov prediction table (reducing storage).

Which prefetching is used on each address?

17

Precomputation-based Prefetching

Potential problems of stream buffer or Markov prefetching: low accuracy => high memory bandwidth waste. Another approach: use some computation resources for prefetching, because computation is increasingly cheap. Speculative execution for prefetching:

  • No architectural changes
  • Not limited by hardware
  • High accuracy and good coverage

for (i = 0; i < 10; i++)
    for (j = 0; j < 100; j++)
        data[j]->val[j]++;

Loop: I1  load  r1 = [r2]
      I2  add   r3 = r3 + 1
      I3  add   r6 = r3 - 100
      I4  add   r2 = r2 + 8
      I5  add   r1 = r4 + r1
      I6  load  r5 = [r1]
      I7  add   r5 = r5 + 1
      I8  store [r1] = r5
      I9  blt   r6, Loop

Collins et al., MICRO 2001
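The precomputation slice for this loop contains only the address-generating instructions (I1, I4, I5, I6). A minimal Python sketch of the idea; the trigger policy and prefetch distance of 8 are illustrative assumptions, not taken from Collins et al.:

```python
# Sketch: a speculative slice runs only the address-generating work
# (I1, I4, I5, I6) a few iterations ahead of the main thread, touching
# data[j].val[j] early so the demand access would hit in cache.
class Node:
    def __init__(self):
        self.val = [0] * 100

data = [Node() for _ in range(100)]   # stands in for the pointer array at [r2]
touched = []                          # addresses the p-slice would prefetch

def p_slice(start, distance=8):
    """Speculative precomputation slice: generate and touch future addresses."""
    for j in range(start, min(start + distance, len(data))):
        node = data[j]                # I1: load r1 = [r2] (pointer load)
        _ = node.val[j]               # I5 + I6: form address, delinquent load
        touched.append((j, j))

def main_thread():
    for j in range(len(data)):
        if j % 8 == 0:                # trigger: spawn a p-thread periodically
            p_slice(j + 8)
        data[j].val[j] += 1           # the real work (I1..I8)

main_thread()
print(touched[:3])                    # the slice ran ahead of the demand stream
```

Because the slice executes the same address computation as the main thread, its prefetches are accurate by construction; the open question is only how far ahead it can run.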

18

Prefetching by Dynamically Building Data-dependence Graph

Annavaram et al., “Data prefetching by dependence graph precomputation”, ISCA 2001

  • Needs external help to identify problematic loads
  • Builds the dependence graph in reverse order
  • Uses a separate prefetching engine

[Diagram: pipeline stages IF, PRE-DECODE, DECODE, EXE, WB, COMMIT. The pre-decode stage feeds instructions (opcode and operands) to a DG generator, which builds the dependence graph in a DG buffer; a separate execution engine walks the graph to issue prefetches, and the instruction fetch queue is updated accordingly.]


19

Using “Future” Threads for Prefetching

Balasubramonian et al., “Dynamically allocating processor resources between nearby and distant ILP,” ISCA 2001

OOO processors stall on cache misses to DRAM because they exhaust some resource (issue queue, ROB, or registers). Why not keep the program running during the stall time for prefetching? Then resources must be reserved for a “future” thread:

  • The future thread continues execution for prefetching
  • It uses the existing OOO pipeline and FUs for execution
  • It may release registers or ROB entries speculatively, and thus can examine a much larger instruction window
  • It is still accurate in producing reference addresses

20

Precomputation with SMT Supporting Speculative Threads

Collins et al., “Speculative precomputation: long-range prefetching of delinquent loads,” ISCA 2001

  • Precomputation is done by an explicit speculative thread (p-thread)
  • The code of p-threads may be constructed by the compiler or by hardware
  • The main thread spawns p-threads on triggers (e.g., when a trigger PC is encountered)
  • The main thread provides some register values and the initial PC for the p-thread
  • A p-thread may trigger another p-thread for further prefetching

For more compiler issues, see Luk, “Tolerating Memory Latency through Software-Controlled Pre-Execution in Simultaneous Multithreading Processors”, ISCA 2001

21

Summary of Advanced Prefetching

  • Being actively studied because of the increasing CPU–memory speed gap
  • Improves cache performance beyond the limit of cache size
  • Precomputation may be limited in prefetching distance (how good is the timeliness?)
  • Note there is no perfect cache/prefetching solution, e.g.

        while (1) {
            myload(addr);
            addr = myrandom() + addr;
        }

How to design complexity-effective memory systems for future processors?