Complex Address Patterns Manjunath Shevgoor , Sahil Koladiya, Rajeev - - PowerPoint PPT Presentation

SLIDE 1

Efficiently Prefetching Complex Address Patterns

Manjunath Shevgoor, Sahil Koladiya, Rajeev Balasubramonian
University of Utah

Chris Wilkerson, Zeshan Chishti, Seth Pugsley
Intel Labs

Variable Length Delta Prefetcher 1

SLIDE 2

Prefetchers

Confirmation-Based Prefetchers (accurate)

  • Issue predictions only after observing a few deltas
  • High accuracy
  • Short streams lose out

Immediate Prefetchers (fast)

  • Aggressive
  • Low accuracy
  • Waste DRAM bandwidth and cache capacity

SLIDE 3

Spatial Correlation

  • Learn access (delta) patterns
  • Apply patterns when similar conditions recur
  • E.g., PC, physical address, delta patterns

Delta Patterns

  • Regular delta patterns, e.g. (+1, +1, +1)…, (+2, +2, +2, +2)…
  • Irregular delta patterns, e.g. (+1, +2, +3)…

SLIDE 4

Long Repeatable Streams of Irregular Deltas

Delta patterns for milc

Page Num: 479218
Deltas:   1, 9, -8, 1, 8, 1, -8, 1, 1, 7, …

SLIDE 5

Long Repeatable Streams of Irregular Deltas

Deltas:     1, 9, -8, 1, 8, 1, -8, 1, 1, 7, -1, -5, …
Cache Line: A+1, A+10, A+2, A+3, A+11, A+12, A+4, A+5, A+6, A+13, A+12, A+7, …
Stream 1:   A+1, A+2, A+3, A+4, A+5, A+6, A+7
Stream 2:   A+10, A+11, A+12, A+13

Confirmation Prefetches

  • Stride Prefetcher coverage: 5/11
  • SandBox Prefetcher coverage: 9/11

Neither is perfectly timely!
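The cache-line row on this slide is the running sum of the delta row. A minimal sketch in C, assuming the starting line A is treated as offset 0; `deltas_to_offsets` is a hypothetical helper name, not something from the slides:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: turn a delta stream into cache-line offsets
 * relative to the starting line A (so A+1 is stored as 1). */
void deltas_to_offsets(const int *deltas, size_t n, int *offsets)
{
    assert(deltas != NULL && offsets != NULL);
    int pos = 0;                  /* offset of the current line from A */
    for (size_t i = 0; i < n; i++) {
        pos += deltas[i];         /* next line = current line + delta */
        offsets[i] = pos;
    }
}
```

Feeding it the twelve deltas listed above reproduces the offsets of the Cache Line row (1, 10, 2, 3, …), which makes the two interleaved streams easy to see.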

SLIDE 6

Variable Length Delta Prefetcher

SLIDE 7

Structure of VLDP

[Figure: Core 1 through Core 8 share the Last Level Cache. For each core, cache accesses feed Per Page Delta History Tables, which drive the Delta Prediction Tables and Offset Prediction Tables; these output the Predicted Delta/Offset.]

SLIDE 8

Delta History Table

  • Tracks deltas within a page

for (i = 0; i < BIGNUM; i++) {
    a[i] = b[i] + c[i];
}

  • a, b, c can each belong to different pages
  • So deltas between pages are meaningless

Delta = Current Address - Last Address
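A sketch of how an in-page delta could be computed, using delta = current minus last, the sign convention that matches the deltas and cache lines listed on slide 5. The 64-byte line size and 4 KB page size are assumptions for illustration (the slides do not state them), and the function names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BITS 6   /* assumption: 64-byte cache lines */
#define PAGE_BITS 12  /* assumption: 4 KB pages */

/* Page number of a physical address. */
uint64_t page_num(uint64_t addr) { return addr >> PAGE_BITS; }

/* Cache-line offset of an address within its page. */
int line_offset(uint64_t addr)
{
    return (int)((addr >> LINE_BITS) & ((1u << (PAGE_BITS - LINE_BITS)) - 1));
}

/* Delta between two accesses, in cache lines.
 * Returns 1 and stores the delta if both addresses share a page;
 * returns 0 otherwise, because a cross-page delta is not tracked. */
int page_delta(uint64_t last, uint64_t cur, int *delta)
{
    assert(delta != NULL);
    if (page_num(last) != page_num(cur))
        return 0;
    *delta = line_offset(cur) - line_offset(last);
    return 1;
}
```

In the loop above, consecutive accesses to a[i], b[i], and c[i] would usually land in three different pages, so `page_delta` rejects those pairs, while successive accesses to the same array produce small in-page deltas.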

SLIDE 9

Delta History Table

Fields of each entry:

  • Page Num.
  • Last Addr.
  • Last 4 Deltas
  • Last Predictor
  • Num. Times Used
  • Last Four Prefetched Offsets

SLIDE 10

Delta Prediction Tables

Table 1 (lowest priority, t=1):  Delta (1): 8 b | Pred.: 8 b | Accuracy: 2 b
Table 3 (highest priority, t=3): Deltas (3): 8 b, 8 b, 8 b | Pred.: 8 b | Accuracy: 2 b

  • 64 rows per table
  • Each table reports "Match?"; a MUX selects the Predicted Delta from the highest-priority table that matches
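The priority selection can be sketched as a longest-history-first search. This is an illustration under assumptions, not the hardware design: rows are searched linearly (the slides do not say how rows are indexed), and the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DPT_ROWS   64  /* from the slide: 64 rows per table */
#define DPT_TABLES 3   /* table t is keyed by the last t deltas */

typedef struct {
    int8_t  key[DPT_TABLES]; /* first t bytes hold the history, oldest first */
    int8_t  pred;            /* predicted next delta */
    uint8_t accuracy;        /* 2-bit counter */
    uint8_t valid;
} dpt_entry;

typedef struct { dpt_entry rows[DPT_TABLES][DPT_ROWS]; } dpt;

/* Install a prediction in table t (1..3) at a given row. */
void dpt_set(dpt *p, int t, int row, const int8_t *key, int8_t pred)
{
    dpt_entry *e = &p->rows[t - 1][row];
    memcpy(e->key, key, (size_t)t);
    e->pred = pred;
    e->accuracy = 0;
    e->valid = 1;
}

/* hist holds the n most recent deltas, oldest first (n >= 1).
 * Returns the table number (1..3) that supplied *pred, or 0 on a miss. */
int dpt_lookup(const dpt *p, const int8_t *hist, int n, int8_t *pred)
{
    assert(n >= 1);
    for (int t = (n < DPT_TABLES ? n : DPT_TABLES); t >= 1; t--) {
        const int8_t *key = hist + (n - t);        /* the last t deltas */
        for (int r = 0; r < DPT_ROWS; r++) {
            const dpt_entry *e = &p->rows[t - 1][r];
            if (e->valid && memcmp(e->key, key, (size_t)t) == 0) {
                *pred = e->pred;                   /* highest-priority match wins */
                return t;
            }
        }
    }
    return 0;
}
```

With an entry 2 → 3 in table 1 and (5, 2) → 4 in table 2, the history 1, 5, 2 is answered by table 2, while 3, 1, 2 falls back to table 1.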

SLIDE 11

Offset Prediction Table

Fields of each entry:

  • First Page Offset: 7 b
  • Pred. Offset: 7 b
  • Accuracy: 2 b

OPT is used only to predict the second access to a page
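A sketch of how such a table might be used. Two things here are assumptions rather than facts from the slides: the table is indexed directly by the first offset, and training follows the same increment/decrement/replace-at-zero rule that the deck describes for the Delta Prediction Tables. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define OPT_ENTRIES 64   /* from the configuration slide */

typedef struct {
    uint8_t pred_offset;  /* predicted offset of the second access (7 b) */
    uint8_t accuracy;     /* 2-bit saturating counter */
    uint8_t valid;
} opt_entry;

typedef struct { opt_entry e[OPT_ENTRIES]; } opt;

/* Assumption: the table is indexed directly by the first offset. */
static unsigned opt_index(unsigned first_offset)
{
    return first_offset % OPT_ENTRIES;
}

/* On the first access to a page: returns 1 and the predicted offset
 * of the second access if an entry exists, else 0. */
int opt_predict(const opt *t, unsigned first_offset, unsigned *pred)
{
    const opt_entry *e = &t->e[opt_index(first_offset)];
    if (!e->valid)
        return 0;
    *pred = e->pred_offset;
    return 1;
}

/* On the second access to a page: train with the observed offset. */
void opt_train(opt *t, unsigned first_offset, unsigned second_offset)
{
    opt_entry *e = &t->e[opt_index(first_offset)];
    assert(second_offset < 128);            /* offsets are 7-bit fields */
    if (e->valid && e->pred_offset == second_offset) {
        if (e->accuracy < 3) e->accuracy++; /* correct: strengthen */
    } else if (e->valid && e->accuracy > 0) {
        e->accuracy--;                      /* wrong: weaken, keep prediction */
    } else {                                /* empty, or wrong at accuracy 0 */
        e->pred_offset = (uint8_t)second_offset;
        e->accuracy = 0;
        e->valid = 1;
    }
}
```

Because the prediction depends only on the first offset touched in a page, it can be issued before any delta exists for that page, which is what lets OPT skip the confirmation phase.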

SLIDE 12

Need for Multiple Tables

Repeating delta pattern: (1, 2, 3, 5, 2, 4)…

Table 1
Delta   Pred.
1       2
2       3
3       5
5       2

Table 2
Deltas  Pred.
1,2     3
2,3     5
3,5     2
5,2     4

50% accuracy: in Table 1, delta 2 is followed by 3 half the time and by 4 the other half

Search for a delta pattern match starts from the rightmost table (the one with the longest delta history)
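The benefit of the longer history can be checked mechanically. The sketch below counts, for one period of a repeating pattern, the positions whose preceding h deltas also occur elsewhere followed by a different delta. The function name and the counting approach are illustrative assumptions, not part of the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Counts positions in one period of a repeating delta pattern whose
 * preceding h deltas also appear elsewhere in the period followed by
 * a different delta, i.e. histories that a table keyed on h deltas
 * cannot predict reliably. */
int ambiguous_positions(const int *pat, size_t n, size_t h)
{
    assert(n > 0 && h > 0 && h <= n);
    int count = 0;
    for (size_t i = 0; i < n; i++) {           /* i: delta to be predicted */
        int ambiguous = 0;
        for (size_t j = 0; j < n && !ambiguous; j++) {
            if (pat[j] == pat[i])
                continue;                      /* same successor: no conflict */
            size_t k = 1;                      /* compare the h preceding deltas */
            while (k <= h && pat[(i + n - k) % n] == pat[(j + n - k) % n])
                k++;
            if (k > h)
                ambiguous = 1;                 /* same history, different successor */
        }
        count += ambiguous;
    }
    return count;
}
```

For (1, 2, 3, 5, 2, 4) a one-delta history leaves two ambiguous positions (the deltas that follow each 2), while a two-delta history leaves none.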

SLIDE 13

Looking Farther Than One Delta Ahead

Repeating delta pattern: (1, 2, 3), (1, 2, 3), …

Table 1
Delta   Pred.
1       2
2       3
3       1

Table 2
Deltas  Pred.
1,2     3
2,3     1
3,1     2

[Figure labels: Current Delta, Degree 1 Prediction]

SLIDE 14

Looking Farther Than One Delta Ahead

Repeating delta pattern: 1, 2, 3, 1, 2, 3, …

Table 1
Delta   Pred.
1       2
2       3
3       1

Table 2
Deltas  Pred.
1,2     3
2,3     1
3,1     2

Use recursive lookup to look farther than one delta ahead

[Figure labels: Current Delta, Deg 1 Prediction, Degree 1 Prediction, Degree 2 Prediction]
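The recursive lookup can be sketched as a loop in which each predicted delta is appended to a speculative copy of the history and looked up again. This is a simplified illustration: a flat rule list stands in for the per-length tables, histories are capped at two deltas, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int len;      /* number of deltas in the key (1 or 2) */
    int key[2];   /* delta history, oldest first */
    int pred;     /* predicted next delta */
} rule;

/* Longest-history-first lookup over a flat rule list. Returns 1 on a hit. */
static int lookup(const rule *rules, size_t nrules,
                  const int *hist, int n, int *pred)
{
    for (int len = (n < 2 ? n : 2); len >= 1; len--)
        for (size_t r = 0; r < nrules; r++) {
            if (rules[r].len != len)
                continue;
            int match = 1;
            for (int k = 0; k < len; k++)
                if (rules[r].key[k] != hist[n - len + k])
                    match = 0;
            if (match) { *pred = rules[r].pred; return 1; }
        }
    return 0;
}

/* hist holds the two most recent deltas, oldest first.
 * Writes up to `degree` predicted deltas to out and returns the count. */
int predict_ahead(const rule *rules, size_t nrules,
                  const int hist[2], int degree, int *out)
{
    assert(degree >= 0);
    int h[2] = { hist[0], hist[1] };
    int produced = 0;
    while (produced < degree) {
        int p;
        if (!lookup(rules, nrules, h, 2, &p))
            break;                 /* no match: stop looking ahead */
        out[produced++] = p;
        h[0] = h[1];               /* slide the speculative history window */
        h[1] = p;
    }
    return produced;
}
```

With the six table entries from this slide and the history (1, 2), four lookups yield 3, 1, 2, 3. Prefetch addresses would then come from adding the running sum of those deltas to the current address.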

SLIDE 15

Case Study: Streaming Workloads

Repeating delta pattern: 1, 1, 1, 1, 1, …

Table 1
Delta   Pred.
1       1

Table 2
Delta   Pred.

Patterns learned from one page are applied to another

SLIDE 16

Updating the Delta History Tables

On each LLC access:

  • If the page is present, add the delta
  • If the page is not present, replace an entry (evict Not Recently Used)

Entry fields: Page Num., Last Addr., Last 4 Deltas, Last Predictor, Num. Used, Last 4 Prefetches
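The update rule can be sketched as follows. It is a simplified model under assumptions: only the page number, last offset, and last four deltas are kept (Last Predictor, Num. Used, and Last 4 Prefetches are omitted), and NRU is modeled with one bit per entry that is cleared for all entries when no victim is available. The names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DHT_ENTRIES 16   /* from the configuration slide */
#define DHT_DELTAS  4

typedef struct {
    uint64_t page;                /* Page Num. */
    int      last_offset;         /* Last Addr., as a line offset in the page */
    int8_t   deltas[DHT_DELTAS];  /* Last 4 Deltas, oldest first */
    int      ndeltas;             /* how many of them are valid */
    uint8_t  valid;
    uint8_t  nru;                 /* 1 = recently used */
} dht_entry;

typedef struct { dht_entry e[DHT_ENTRIES]; } dht;

/* Victim: an invalid entry, else one whose NRU bit is clear.
 * If every bit is set, clear them all and take entry 0. */
static dht_entry *dht_victim(dht *t)
{
    for (int i = 0; i < DHT_ENTRIES; i++)
        if (!t->e[i].valid || !t->e[i].nru)
            return &t->e[i];
    for (int i = 0; i < DHT_ENTRIES; i++)
        t->e[i].nru = 0;
    return &t->e[0];
}

/* Called on each LLC access; returns the entry for the page. */
dht_entry *dht_access(dht *t, uint64_t page, int offset)
{
    assert(offset >= 0);
    for (int i = 0; i < DHT_ENTRIES; i++) {
        dht_entry *e = &t->e[i];
        if (e->valid && e->page == page) {        /* page present: add delta */
            int8_t d = (int8_t)(offset - e->last_offset);
            if (e->ndeltas == DHT_DELTAS) {       /* drop the oldest delta */
                memmove(e->deltas, e->deltas + 1, DHT_DELTAS - 1);
                e->ndeltas--;
            }
            e->deltas[e->ndeltas++] = d;
            e->last_offset = offset;
            e->nru = 1;
            return e;
        }
    }
    dht_entry *v = dht_victim(t);                 /* page absent: replace */
    memset(v, 0, sizeof *v);
    v->page = page;
    v->last_offset = offset;
    v->valid = 1;
    v->nru = 1;
    return v;
}
```

The first access to a page only allocates an entry; deltas start accumulating from the second access onward.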

SLIDE 17

Updating the Prediction Tables

DHT entry: Page Num. | Last Addr. | Last 3 Deltas = B, C, D

Latest Delta: E

Can the current state predict the latest delta?

Table 3 (Last Predictor)
Deltas  Pred.
B,C,D   E?

Table 2
Deltas  Pred.
C,D     E?

Table 1
Delta   Pred.
D       F?

  • If the prediction is correct, increment accuracy
  • If the prediction is wrong, decrement accuracy
  • If accuracy == 0, update and promote the prediction
  • If the prediction is missing, seed T1 with the prediction
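The four rules can be sketched for a single table row as below. One ordering detail is an assumption: the accuracy == 0 check is applied before the decrement, so a row is rewritten only when it mispredicts while already at zero. The names are hypothetical, and promotion itself is left to the caller.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    int8_t  pred;       /* predicted next delta */
    uint8_t accuracy;   /* 2-bit saturating counter, 0..3 */
    uint8_t valid;
} dpt_row;

enum update_result { SEEDED, CORRECT, WRONG, REPLACED };

/* Train one row with the delta that was actually observed.
 * REPLACED tells the caller that the row mispredicted at accuracy 0,
 * the point at which the pattern is a candidate for promotion to the
 * table with the next-longer history. */
enum update_result dpt_train(dpt_row *row, int8_t latest_delta)
{
    assert(row);
    if (!row->valid) {                  /* prediction missing: seed */
        row->pred = latest_delta;
        row->accuracy = 0;
        row->valid = 1;
        return SEEDED;
    }
    if (row->pred == latest_delta) {    /* correct: increment */
        if (row->accuracy < 3)
            row->accuracy++;
        return CORRECT;
    }
    if (row->accuracy > 0) {            /* wrong: decrement */
        row->accuracy--;
        return WRONG;
    }
    row->pred = latest_delta;           /* wrong at accuracy 0: update */
    return REPLACED;
}
```

The counter gives an established prediction some hysteresis: a single stray delta weakens the row but does not overwrite it.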

SLIDE 18

Populating the Prediction Tables

Table 1
Delta   Pred.
1       A

Table 2
Deltas  Pred.
1,1     B

Table 3
Deltas  Pred.
1,1,1   C

  • Pattern missing: an entry is added to Table 1
  • Table 1 wrong: an entry is added to Table 2
  • Table 2 wrong: an entry is added to Table 3
  • Each table uses NRU replacement

If a prediction is wrong, a longer delta history might be needed

SLIDE 19

Evaluation Methodology

  • Simics + USIMM
  • 8 RISC cores, UltraSPARC III ISA
  • 3.2 GHz, 4-wide OoO, 128-entry RoB
  • 32 KB I&D L1 caches, 4 cycles
  • 8 MB shared (1 MB per core) L2 cache, 10 cycles
  • DRAM specifications: 2 channels, 2 ranks per channel, 8 banks per rank; 800 MHz DDR3 DRAM
  • SPEC 2006, NPB, and CloudSuite
  • Mix1: milc, astar, lbm, libq; Mix2: xalancbmk, lbm, zeusmp, milc

SLIDE 20

VLDP Configuration

  • Per-core VLDP
  • 1 Offset Prediction Table, 64 entries
  • 3 Delta Prediction Tables, 64 entries each
  • 16-entry Delta History Table
  • Only Delta Prediction Tables 2 and 3 contribute to multi-degree prefetch

Storage overhead:

Offset Prediction Table    128 B
Delta History Table        222 B
Delta Prediction Tables    648 B
Total                      998 B/core
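The budget can be cross-checked with a few lines of arithmetic. In the sketch below the OPT size is derived from the field widths on the OPT slide, while the DHT and DPT figures are taken from this slide as given. The 26-bit width used for Table 2 (two 8-bit deltas, an 8-bit prediction, a 2-bit accuracy) is an assumption by analogy with Tables 1 and 3, and the function names are hypothetical.

```c
#include <assert.h>

/* Bytes needed for `entries` rows of `bits_per_entry` bits each. */
unsigned table_bytes(unsigned entries, unsigned bits_per_entry)
{
    assert((entries * bits_per_entry) % 8 == 0);
    return entries * bits_per_entry / 8;
}

/* Bytes accounted for by the DPT fields shown on the DPT slide. */
unsigned dpt_field_bytes(void)
{
    return table_bytes(64, 8 + 8 + 2)        /* Table 1: delta, pred., accuracy */
         + table_bytes(64, 2 * 8 + 8 + 2)    /* Table 2: assumed field widths */
         + table_bytes(64, 3 * 8 + 8 + 2);   /* Table 3: 3 deltas, pred., accuracy */
}

/* Per-core total from the three budget lines on this slide. */
unsigned vldp_total_bytes(void)
{
    unsigned opt = table_bytes(64, 7 + 7 + 2);  /* first offset, pred. offset, accuracy */
    unsigned dht = 222;                          /* taken from the slide as-is */
    unsigned dpt = 648;                          /* taken from the slide as-is */
    return opt + dht + dpt;
}
```

The OPT fields give exactly 128 B and the three lines sum to 998 B. The listed DPT fields account for 624 B of the 648 B; the remaining 24 B equals one extra bit per row across the 192 rows, which may be replacement state, though the slides do not say so.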

SLIDE 21

Performance Improvement (Vs No PC)

VLDP is:

  • 6% better than AMPM
  • 9% better than SBP
  • 17% better than FDP

[Figure: Speedup (axis 0.8 to 2.0) for FDP, SBP, AMPM, and VLDP]

SLIDE 22

Performance Improvement (Vs PC)

VLDP is:

  • 7.1% better than GHB
  • 7.6% better than SMS

[Figure: Speedup (axis 0.8 to 2.0) for SMS, GHB_PC_DC, and VLDP]

SLIDE 23

Coverage

Prefetcher  Coverage
FDP         16%
GHB         33%
SBP         40%
AMPM        49%
SMS         55%
VLDP        61%

[Figure: Coverage (axis 0% to 120%) of FDP, SMS, SBP, GHB_PC_DC, AMPM, and VLDP for NPB, CloudSuite, Spec2006, Spec2006-Mix, and GM]

SLIDE 24

Sensitivity to Table Size

2% increase in performance when DPT size is increased

[Figure: Speedup (axis 0.98 to 1.03)]

SLIDE 25

Sensitivity to Number of Delta Prediction Tables

3DPT improves efficiency: despite a modest 1% performance improvement, it reduces DRAM requests by 3%

[Figure: Speedup and DRAM Accesses (axis 1 to 1.5) for 1DPT_NoOPT, 1DPT+OPT, 2DPT+OPT, 3DPT+OPT, and 4DPT+OPT]

SLIDE 26

Conclusions

  • OPT issues predictions without confirmation
  • DPT recognizes irregular delta patterns
  • Long delta patterns provide high accuracy
  • Less than 1 KB per core overhead
  • 6% better performance

SLIDE 27

Thank You