Improving Cache Performance
SLIDE 1

Improving Cache Performance

AMAT: Average Memory Access Time

AMAT = T_hit + Miss Rate x Miss Penalty   (see the sketch after the list)

Optimizations are based on:

  • Reducing Miss Rate
    • Structural: cache size, associativity, block size, compiler support
  • Reducing Miss Penalty
    • Structural: multi-level caches, critical word first / early restart
  • Latency Hiding: using concurrency to reduce miss rate or miss penalty
  • Improving Hit Time
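To make the formula concrete, here is a minimal C sketch (mine, not part of the original deck; the function name and parameters are illustrative):

#include <stdio.h>

/* AMAT = T_hit + miss_rate * miss_penalty, all times in cycles */
static double amat(double t_hit, double miss_rate, double miss_penalty) {
    return t_hit + miss_rate * miss_penalty;
}

int main(void) {
    /* Numbers from the associativity example later in the deck:
     * 1-cycle hit, 8% miss rate, 25-cycle miss penalty */
    printf("AMAT = %.2f cycles\n", amat(1.0, 0.08, 25.0));  /* 3.00 */
    return 0;
}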

SLIDE 2

Cache Performance Models

Temporal locality: repeated access to the same word.
Spatial locality: access to words in physical proximity to an accessed word.

Miss categories:

  • Compulsory: cold-start (first-reference) misses
    • Equal to the miss rate of an infinite cache
    • Characteristic of the workload, e.g. streams (where the majority of misses are compulsory)
  • Capacity: data set size larger than that of the cache
    • Increase the size of the cache to avoid thrashing
    • Defined via the fully associative abstraction

Replacement algorithms: the optimal off-line algorithm is Belady's rule: evict the cache block whose next reference is furthest in the future. It provides a lower bound on the number of capacity misses for a given cache size.

  • Conflict: the cache organization causes a block to be discarded and later retrieved
    • Also called collision or interference misses

SLIDE 3

Cache Replacement

Replacement algorithms: the optimal off-line algorithm is Belady's rule: evict the cache block whose next reference is furthest in the future. This provides a lower bound on the number of capacity misses for a given cache size.

Cache size: 4 blocks. Block access sequence: A B C D E C E A D B C D E A B

OPTIMAL (Belady) trace:

  • Accesses 1-4 (A, B, C, D): compulsory misses; cache = A B C D
  • Miss 5, on access 5 (E): evict B, whose next reference is furthest in the future; cache = A E C D
  • Miss 6, on access 10 (B): evict A; cache = B E C D
  • Miss 7, on access 14 (A): evict D (or E or C, none of which is referenced again); cache = B E C A

7 misses total: 5 compulsory misses (A, B, C, D, E) + 2 capacity misses.
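As a cross-check, here is a small C simulation of Belady's rule on this sequence (my sketch, not from the slides); for a 4-block cache it reports the same 7 misses:

#include <stdio.h>
#include <string.h>

#define WAYS 4

int main(void) {
    const char seq[] = "ABCDECEADBCDEAB";
    int n = (int)strlen(seq);
    char cache[WAYS];
    int used = 0, misses = 0;

    for (int i = 0; i < n; i++) {
        int hit = 0;
        for (int w = 0; w < used; w++)
            if (cache[w] == seq[i]) { hit = 1; break; }
        if (hit) continue;
        misses++;
        if (used < WAYS) { cache[used++] = seq[i]; continue; }
        /* Belady: evict the block whose next reference is furthest away */
        int victim = 0, worst = -1;
        for (int w = 0; w < WAYS; w++) {
            int next = n;  /* "never referenced again" sorts last */
            for (int j = i + 1; j < n; j++)
                if (seq[j] == cache[w]) { next = j; break; }
            if (next > worst) { worst = next; victim = w; }
        }
        cache[victim] = seq[i];
    }
    printf("misses = %d\n", misses);  /* prints 7 */
    return 0;
}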

SLIDE 4

Cache Replacement

Replacement algorithms: Least Recently Used (LRU): evict the cache block that was last referenced furthest in the past.

Cache size: 4 blocks. Block access sequence: A B C D E C E A D B C D E A B

LRU trace:

  • Accesses 1-4 (A, B, C, D): compulsory misses; cache = A B C D
  • Miss 5, on access 5 (E): evict A (least recently used); cache = E B C D
  • Miss 6, on access 8 (A): evict B; cache = E A C D
  • Miss 7, on access 10 (B): evict C; cache = E A B D
  • Miss 8, on access 11 (C): evict E; cache = C A B D
  • Miss 9, on access 13 (E): evict A; cache = C E B D
  • Miss 10, on access 14 (A): evict B; cache = C E A D
  • Miss 11, on access 15 (B): evict C; cache = B E A D

11 misses total: 5 compulsory misses (A, B, C, D, E) + 6 capacity misses.

LRU incurs 4 additional misses compared to optimal replacement (11 vs. 7) due to non-optimal replacement.

SLIDE 5

LRU

  • Hard to implement efficiently
  • Software: LRU Stack

Block access sequence: A B C D E C E A D B C D E A B

LRU stack (top = most recently used block, bottom = LRU block). After accesses 1-4 the stack is D C B A. The next four accesses update it as follows:

  • Access 5 (E, miss): stack = E D C B
  • Access 6 (C, hit):  stack = C E D B
  • Access 7 (E, hit):  stack = E C D B
  • Access 8 (A, miss): stack = A E C D

On a hit, the ordering information must be both read and rewritten: too expensive for a hardware-maintained cache. A software sketch follows.
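A minimal C sketch of the software LRU stack (my illustration, not the slides' code): blocks live in a move-to-front array whose bottom entry is the eviction victim.

#include <stdio.h>
#include <string.h>

#define WAYS 4

/* LRU stack: stack[0] is most recently used, stack[depth-1] is the victim. */
static char stack[WAYS];
static int depth = 0;

/* Returns 1 on hit, 0 on miss; either way the block moves to the top. */
static int access_block(char b) {
    int pos = -1;
    for (int i = 0; i < depth; i++)
        if (stack[i] == b) { pos = i; break; }
    int hit = (pos >= 0);
    if (pos < 0) pos = (depth < WAYS) ? depth++ : WAYS - 1; /* miss: take bottom slot */
    memmove(&stack[1], &stack[0], (size_t)pos);             /* shift others down */
    stack[0] = b;                                           /* move to front */
    return hit;
}

int main(void) {
    const char *seq = "ABCDECEADBCDEAB";
    int misses = 0;
    for (const char *p = seq; *p; p++)
        if (!access_block(*p)) misses++;
    printf("misses = %d\n", misses);  /* prints 11, matching the LRU trace */
    return 0;
}

Note that every access, hit or miss, rewrites the ordering; that per-access update cost is exactly why this bookkeeping is left to software.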

SLIDE 6

LRU

  • Approximate LRU (Some Intel processors)

Tree-based pseudo-LRU over an 8-way set (ways A, B, C, D, E, F, G, H): a binary tree of 7 one-bit nodes covers the 8 ways. Each node records whether its Left or Right subtree was accessed last. On a miss, follow the path of the subtrees NOT accessed last to find the victim (see the sketch below).

  • Random Selection
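Here is a minimal C sketch of tree pseudo-LRU for 8 ways, assuming one bit per internal node (0 = left subtree used last, 1 = right); the layout and names are my illustration, not any specific processor's implementation:

#include <stdio.h>

/* 7 internal nodes of a complete binary tree over 8 ways, stored
 * heap-style: node 0 is the root, children of node i are 2i+1 and 2i+2.
 * bit = 0 means the LEFT subtree was accessed last, 1 means the RIGHT. */
static unsigned char plru[7];

/* On an access to `way`, point the bits along its path toward it. */
static void touch(int way) {
    int node = 0;
    for (int level = 0; level < 3; level++) {
        int right = (way >> (2 - level)) & 1;
        plru[node] = (unsigned char)right;
        node = 2 * node + 1 + right;
    }
}

/* On a miss, walk AWAY from the recently used side to pick the victim. */
static int victim(void) {
    int node = 0, way = 0;
    for (int level = 0; level < 3; level++) {
        int go_right = !plru[node];      /* opposite of the last access */
        way = (way << 1) | go_right;
        node = 2 * node + 1 + go_right;
    }
    return way;
}

int main(void) {
    touch(0); touch(5); touch(3);
    printf("victim way = %d\n", victim());  /* an untouched way, here 7 */
    return 0;
}

Only 7 bits per set are updated on an access, versus a full reordering for true LRU.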
SLIDE 7

Reducing Miss Rate

1. Larger cache size
   + Reduces capacity misses
   - Hit time may increase
   - Cost increases

2. Increased associativity
   + Miss rate decreases (fewer conflict misses)
   - Hit time increases, and may increase the clock cycle time
   - Hardware cost increases

Miss rate with 8-way set associativity is comparable to fully associative (an empirical finding).

Example: Direct-mapped cache: hit time 1 cycle, miss penalty 25 cycles (low!), miss rate 0.08. 8-way set associative: clock cycle 1.5x longer, miss rate 0.07. Let T be the clock cycle of the direct-mapped cache.

AMAT (direct mapped) = (1 + 0.08 x 25) x T = 3.0T

AMAT (set associative): the new clock period is 1.5T, and the miss penalty must be rounded up to a whole number of these longer cycles:
Miss penalty = ceiling(25T / 1.5T) x 1.5T = ceiling(25 / 1.5) x 1.5T = 17 x 1.5T = 25.5T
AMAT = 1.5T + 0.07 x 25.5T = (1.5 + 1.785)T = 3.285T

Increasing associativity hurts in this example!
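A quick C check of this arithmetic (my sketch); the rounding of the miss penalty up to whole cycles is what makes associativity lose here:

#include <stdio.h>
#include <math.h>

int main(void) {
    /* Direct mapped: 1-cycle hit, 8% misses, 25-cycle penalty (unit: T) */
    double amat_dm = 1.0 + 0.08 * 25.0;
    /* 8-way: the clock stretches to 1.5T, so the 25T memory time must be
     * rounded up to whole 1.5T cycles before multiplying back. */
    double penalty_sa = ceil(25.0 / 1.5) * 1.5;
    double amat_sa = 1.5 + 0.07 * penalty_sa;
    printf("direct mapped: %.3fT, 8-way: %.3fT\n", amat_dm, amat_sa); /* 3.000T, 3.285T */
    return 0;
}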

SLIDE 8

Reducing Miss Rate

3. Block size (B)

  • Miss rate first decreases and then increases with increasing block size
    + a) Compulsory miss rate decreases due to better use of spatial locality
    - b) Capacity (and conflict) misses increase as the effective cache size decreases
  • Miss penalty increases with increasing block size
    - c) Wasted memory access time: the miss penalty increase provides no gain
      Do (a) and (c) balance each other?
    + d) Amortized memory access time per byte decreases (burst-mode memory)
  • Tag overhead decreases

Low latency, low bandwidth memory: smaller block size.
High latency, high bandwidth memory: larger block size.

SLIDE 9

Reducing Miss Rate: Block Size (contd.)

Low latency, low bandwidth memory: smaller block size.
High latency, high bandwidth memory: larger block size.

Example: Case 1: miss ratio of 5% with B = 8. Case 2: miss ratio of 4% with B = 16. Burst-mode memory: latency 8 cycles, transfer rate 2 bytes/cycle. Cache hit time 1 cycle.

AMAT = Hit time + Miss rate x Miss penalty, where Miss penalty = latency + B / transfer rate.

Case 1: AMAT = 1 + 5% x (8 + 8/2) = 1.6 cycles
Case 2: AMAT = 1 + 4% x (8 + 16/2) = 1.64 cycles

Suppose the memory latency were 16 cycles instead; this favors the larger block size:

Case 1: AMAT = 1 + 5% x (16 + 8/2) = 2.0 cycles
Case 2: AMAT = 1 + 4% x (16 + 16/2) = 1.96 cycles
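The same tradeoff in a small C sketch (mine), parameterizing the burst-mode miss penalty by block size:

#include <stdio.h>

/* Burst-mode memory: penalty = latency + block_size / bytes_per_cycle */
static double amat(double latency, double block, double miss_rate) {
    double penalty = latency + block / 2.0;  /* 2 bytes transferred per cycle */
    return 1.0 + miss_rate * penalty;        /* 1-cycle hit time */
}

int main(void) {
    printf("latency 8:  B=8: %.2f  B=16: %.2f\n",
           amat(8, 8, 0.05), amat(8, 16, 0.04));   /* 1.60, 1.64 */
    printf("latency 16: B=8: %.2f  B=16: %.2f\n",
           amat(16, 8, 0.05), amat(16, 16, 0.04)); /* 2.00, 1.96 */
    return 0;
}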

SLIDE 10

Reducing Miss Rate

4. Pseudo-associative caches
   + Maintain the hit speed of a direct-mapped cache
   + Reduce conflict misses

Column (or pseudo) associative: on a miss, check one more location in the direct-mapped cache. This is like having a fixed way prediction.

Way prediction: predict which block in the set will be read on the next access.
  • If the tag matches: 1-cycle hit
  • On a failure: do the complete selection on subsequent cycles
  + Power savings potential
  - Poor prediction increases hit time

SLIDE 11

Column (or pseudo) associative: on a miss, check one more location in the direct-mapped cache. This is like having a fixed way prediction.


[Figure: direct-mapped cache. A block indexed at 0xxxx has its alternate cache location at 1xxxx, the same index with the leading index bit flipped.]
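A minimal C sketch of the column-associative lookup (my illustration; the alternate location is formed by flipping the leading index bit, as in the figure):

#include <stdint.h>
#include <stdio.h>

#define SETS 1024                        /* direct-mapped cache, 1024 sets */
#define ALT(idx) ((idx) ^ (SETS >> 1))   /* flip the leading index bit */

typedef struct { uint32_t tag; int valid; } Line;
static Line cache[SETS];

/* Returns 1 on a first-probe hit, 2 on an alternate-location hit, 0 on miss.
 * Real designs also store a bit distinguishing "home" from displaced blocks;
 * that detail is omitted here. */
static int lookup(uint32_t addr) {
    uint32_t idx = (addr / 64) % SETS;   /* 64-byte blocks */
    uint32_t tag = addr / 64 / SETS;
    if (cache[idx].valid && cache[idx].tag == tag) return 1;
    uint32_t alt = ALT(idx);             /* second probe: the pseudo way */
    if (cache[alt].valid && cache[alt].tag == tag) return 2;
    return 0;
}

int main(void) {
    cache[5].valid = 1; cache[5].tag = 7;
    uint32_t addr = ((7u * SETS) + 5u) * 64u;   /* tag 7, index 5 */
    printf("probe result: %d\n", lookup(addr)); /* 1: first-probe hit */
    return 0;
}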

SLIDE 12

Way prediction: predict which block in the set will be read on the next access. If the tag matches: 1-cycle hit. On a failure: do the complete selection on subsequent cycles.


[Figure: 2-way set-associative map; the two direct-mapped locations 0xxxx and 1xxxx form one set, and the predicted way is probed first.]
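A matching C sketch of way prediction for a 2-way set (again my illustration): the predicted way is probed first; on a wrong guess the other way is checked and the predictor updated.

#include <stdint.h>
#include <stdio.h>

#define SETS 512
typedef struct { uint32_t tag[2]; int valid[2]; int pred; } Set;
static Set sets[SETS];

/* Returns cycles taken: 1 on a predicted-way hit, 2 on the other way, 0 on miss. */
static int lookup(uint32_t idx, uint32_t tag) {
    Set *s = &sets[idx];
    int w = s->pred;                       /* probe the predicted way first */
    if (s->valid[w] && s->tag[w] == tag) return 1;
    w ^= 1;                                /* wrong guess: probe the other way */
    if (s->valid[w] && s->tag[w] == tag) {
        s->pred = w;                       /* remember for the next access */
        return 2;
    }
    return 0;
}

int main(void) {
    sets[3].valid[1] = 1; sets[3].tag[1] = 42; sets[3].pred = 0;
    printf("%d cycles\n", lookup(3, 42));  /* 2: mispredicted first access */
    printf("%d cycles\n", lookup(3, 42));  /* 1: predictor now points to way 1 */
    return 0;
}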

SLIDE 13

Reducing Miss Rate

  • 5. Compiler optimizations
  • Instruction access
    • Rearrange code (procedure and code-block placement) to reduce conflict misses
    • Align the entry point of a basic block with the start of a cache block
  • Data access: improve spatial/temporal locality in arrays

a) Merging arrays: replace parallel arrays with an array of structs (spatial locality)

update(j): { *name[j] = …; id[j] = …; age[j] = …; salary[j] = …; }

update(j): { *(person[j].name) = …; person[j].id = …; person[j].age = …; person[j].salary = …; }

When might separate arrays be better?

b) Loop fusion: combine loops that use the same data (temporal locality)

/* Before fusion: two passes over x */
for (j = 0; j < n; j++) x[j] = y[2 * j];
for (j = 0; j < n; j++) sum += x[j];

/* After fusion: one pass; x[j] is reused while still cached */
for (j = 0; j < n; j++) { x[j] = y[2 * j]; sum += x[j]; }

When might separate loops be better?

SLIDE 14

Reducing Miss Rate


Compiler Optimizations (contd …)

  • Data access: Improve spatial/temporal locality in arrays

c) Loop interchange: Convert column-major matrix access to row-major access (spatial)

/* Column-major traversal (j outer, k inner) */
for (j = 0; j < n; j++)
    for (k = 0; k < m; k++)
        a[k][j] = 0;

Assuming row-major storage in memory, consecutive inner-loop iterations touch different rows, so this could miss on each access of a[][]: mn misses for the m x n array.

/* Row-major traversal after interchange (k outer, j inner) */
for (k = 0; k < m; k++)
    for (j = 0; j < n; j++)
        a[k][j] = 0;

Only compulsory misses remain, 1 per block: with array elements of w bytes and a block size of B bytes, each block holds B/w elements, so misses = mn / (B/w) = mnw/B.

SLIDE 15

Reducing Miss Rate

Compiler/Programmer optimizations (contd …)

d) Blocking: use block-oriented access to maximize both temporal and spatial locality

Cache-insensitive matrix multiplication: O(n^3) cache misses for accessing the elements of matrix b, since the inner loop walks down a column of b and, with row-major storage, each access to b[k][j] can miss:

for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        for (k = 0; k < n; k++)
            c[i][j] += a[i][k] * b[k][j];

SLIDE 16

Reducing Miss Rate

Compiler/Programmer optimizations (contd …)

d) Blocking: use block-oriented access to maximize both temporal and spatial locality

The unblocked loop incurs O(n^3) cache misses for the elements of matrix b; blocking reuses each s x s block while it is cache-resident. The loops now run over block indices, and + and * denote block-matrix addition and multiplication:

for (i = 0; i < n/s; i++)
    for (j = 0; j < n/s; j++)
        for (k = 0; k < n/s; k++)
            C[i][j] = C[i][j] + A[i][k] * B[k][j];   /* s x s block operations */

[Figure: s x s block matrices A[0][0] and B[0][0]. Block-matrix multiplication of A[i][k] with B[k][j], followed by matrix addition, produces one update of C[i][j].]
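Spelled out at the element level, the blocked version becomes six loops. This is a standard formulation (my sketch, not necessarily the original slide's code), with s assumed to divide n:

/* Blocked (tiled) matrix multiplication: the three outer loops walk
 * s x s blocks; the three inner loops do the block multiply-add, so
 * each block of a, b, and c is reused while it is cache-resident. */
void matmul_blocked(int n, int s, double a[n][n], double b[n][n], double c[n][n]) {
    for (int i = 0; i < n; i += s)
        for (int j = 0; j < n; j += s)
            for (int k = 0; k < n; k += s)
                /* C[i][j] += A[i][k] * B[k][j] on s x s blocks */
                for (int ii = i; ii < i + s; ii++)
                    for (int kk = k; kk < k + s; kk++)
                        for (int jj = j; jj < j + s; jj++)
                            c[ii][jj] += a[ii][kk] * b[kk][jj];
}

The ii/kk/jj order of the inner loops keeps the innermost accesses to b and c streaming along rows, preserving spatial locality within each block.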
