Five poTAGEs and a COLT for an unrealistic predictor - Pierre Michaud - PowerPoint PPT Presentation



SLIDE 1

Five poTAGEs and a COLT for an unrealistic predictor

Pierre Michaud, June 2014

SLIDE 2

Competition track: unlimited size

SLIDE 3

I did not modify the predictor after the submission

SLIDE 4

Two-level history branch predictors

First level = context, e.g., global branch history, local branch history, branch address

Second level = prediction, e.g., TAGE

SLIDE 5

PPM-like second level

  • Search for the longest context that already occurred at least once, and predict from the past history for that context
  • search with the maximum context length L1
  • if no past occurrence for L1, search with L2 < L1
  • if no past occurrence for L2, search with L3 < L2
  • and so on…
  • One table per context length
  • To know if a context already occurred, use tags
  • false-hit probability divided by 2 every time we increase the tag length by 1 bit
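The search just described can be sketched as follows. This is a minimal illustration, not the submitted predictor: the context lengths, table sizes, hash functions, and the bimodal fallback are all assumptions made for the sketch.

```python
# Minimal sketch of a PPM-like lookup/update over decreasing context lengths.
# All sizes and hash functions are illustrative assumptions.

CONTEXT_LENGTHS = [12, 6, 3]   # longest context first
TABLE_BITS = 10                # 1024 entries per tagged table
TAG_BITS = 8                   # each extra tag bit halves the false-hit probability

def fold(bits, width):
    """Fold a sequence of 0/1 history bits into a width-bit hash."""
    h = 0
    for b in bits:
        h = (((h << 1) | b) ^ (h >> (width - 1))) & ((1 << width) - 1)
    return h

class PPMLike:
    def __init__(self):
        self.tables = [{} for _ in CONTEXT_LENGTHS]  # idx -> (tag, signed 3-bit ctr)
        self.bimodal = {}                            # fallback: pc -> counter

    def _index_tag(self, t, pc, history):
        ctx = history[-CONTEXT_LENGTHS[t]:]
        idx = (fold(ctx, TABLE_BITS) ^ pc) & ((1 << TABLE_BITS) - 1)
        tag = (fold(ctx, TAG_BITS) ^ (pc >> 2)) & ((1 << TAG_BITS) - 1)
        return idx, tag

    def _longest_hit(self, pc, history):
        # search the longest context that already occurred (tag match)
        for t in range(len(CONTEXT_LENGTHS)):
            idx, tag = self._index_tag(t, pc, history)
            entry = self.tables[t].get(idx)
            if entry is not None and entry[0] == tag:
                return t, idx, tag
        return None

    def predict(self, pc, history):
        hit = self._longest_hit(pc, history)
        if hit is not None:
            t, idx, _ = hit
            return self.tables[t][idx][1] >= 0
        return self.bimodal.get(pc, 0) >= 0

    def update(self, pc, history, taken):
        pred = self.predict(pc, history)
        hit = self._longest_hit(pc, history)
        step = 1 if taken else -1
        if hit is not None:
            t, idx, tag = hit
            ctr = max(-4, min(3, self.tables[t][idx][1] + step))
            self.tables[t][idx] = (tag, ctr)
        else:
            self.bimodal[pc] = max(-4, min(3, self.bimodal.get(pc, 0) + step))
        if pred != taken:  # allocate an entry for the next longer context
            t_alloc = (hit[0] if hit is not None else len(self.tables)) - 1
            if t_alloc >= 0:
                idx, tag = self._index_tag(t_alloc, pc, history)
                # the new counter starts "cold", initialized with this outcome
                self.tables[t_alloc][idx] = (tag, 0 if taken else -1)
```

Note how an allocated counter is initialized from a single outcome: this is exactly the cold-counter situation the later slides discuss.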

SLIDE 6

TAGE

  • PPM-like (TAgged) with GEometric context lengths
  • does not name a specific predictor but a predictor family
  • PPM-like 2004, TAGE 2006, TAGE 2011
  • Most of the tricks are in the update
  • allocation policy, u bit, selection counter, ...
  • makes the difference between a bad TAGE (e.g., PPM-like 2004) and a good TAGE

SLIDE 7

Let’s tune TAGE for limit studies

SLIDE 8

PPM’s main weakness: the cold-counter problem



SLIDE 11

Biased-coin tossing game

  • The coin is biased; we don’t know which side the bias favors
  • We play repeatedly with the same coin
  • At game N+1, we count how many times head occurred vs. tail in the N previous games → we choose the side that occurred the most
  • if head and tail counts are equal → choice = outcome of the last game

similar to TAGE’s taken/not-taken counters

SLIDE 12

Cold-counter problem

game win probability:

game        1      2      3      4      5      6      7      8      9      10
bias = 60%  0.500  0.520  0.520  0.530  0.530  0.537  0.537  0.542  0.542  0.547
bias = 90%  0.500  0.820  0.820  0.878  0.878  0.893  0.893  0.898  0.898  0.899
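These win probabilities can be computed exactly: at game N+1 we pick the majority side of the N previous games, and a tie is broken by the last outcome (which, given a tie, is either side with probability 1/2).

```python
from math import comb

def win_prob(n, p):
    """Probability of winning game n+1 after observing n games of a coin
    with bias p, choosing the majority side (ties: outcome of the last game)."""
    q = 1.0 - p
    total = 0.0
    for k in range(n + 1):                 # k = number of heads in n games
        weight = comb(n, k) * p**k * q**(n - k)
        if 2 * k > n:
            total += weight * p            # majority says heads
        elif 2 * k < n:
            total += weight * q            # majority says tails
        else:
            # tie: the last game is heads with probability 1/2,
            # so the win probability is 0.5*p + 0.5*q = 0.5
            total += weight * 0.5
    return total
```

For example, `win_prob(1, 0.6)` gives p² + q² = 0.52, matching the table's second column; the first column is 0.5 because game 1 is an uninformed guess.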

SLIDE 13

Cold-counter problem in TAGE

  • Limited storage → allocate an entry for a longer context only upon a misprediction
  • → counter likely to be initialized with the least frequent outcome
  • TAGE has a mechanism for reducing the cold-counter problem
  • sometimes, the second-longest-match entry is more accurate than the (cold) longest-match entry
  • a single global selection counter chooses between the longest match and the second-longest match

SLIDE 14

poTAGE: post-predicted TAGE

  • TAGE tuned for limit studies
  • Tackle the cold-counter problem
  • Replace the selection counter with a post-predictor
  • Aggressive update & allocation for fast ramp-up

SLIDE 15

Selection counter → post-predictor

  • The selection counter is cost-effective, but does not solve the cold-counter problem completely
  • Post-predictor → a more effective solution


SLIDE 17

Post-predictor

[Diagram: the 3-bit ctr of TAGE's first, second, and third hitting entries, plus the u bit of the first hit (3+3+3+1 = 10 bits), index a table of 1024 five-bit counters; the counter's sign gives the T/NT prediction (T: increment, NT: decrement)]

5% fewer mispredictions than the selection counter
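The structure on this slide can be sketched as follows. The 3+3+3+1 = 10-bit input and the 1024 five-bit counters come from the slide; the bit ordering of the index and the exact saturation bounds are illustrative assumptions.

```python
class PostPredictor:
    """10-bit index = three signed 3-bit TAGE counters (first/second/third
    hitting entries) plus the u bit of the longest match; it selects one of
    1024 five-bit saturating counters whose sign gives the T/NT prediction."""

    def __init__(self):
        self.table = [0] * 1024  # five-bit signed counters in [-16, 15]

    @staticmethod
    def _index(ctr1, ctr2, ctr3, u):
        # each ctr is a signed 3-bit counter in [-4, 3]; u is 0 or 1
        return ((ctr1 + 4) << 7) | ((ctr2 + 4) << 4) | ((ctr3 + 4) << 1) | u

    def predict(self, ctr1, ctr2, ctr3, u):
        return self.table[self._index(ctr1, ctr2, ctr3, u)] >= 0

    def update(self, ctr1, ctr2, ctr3, u, taken):
        i = self._index(ctr1, ctr2, ctr3, u)
        # T: increment, NT: decrement (saturating)
        self.table[i] = min(15, self.table[i] + 1) if taken else max(-16, self.table[i] - 1)
```

Unlike the single global selection counter, this table can learn a different decision for every combination of counter values, e.g. "when the longest match is cold and disagrees with the others, trust the others".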


SLIDE 19

Ramp up

  • Realistic TAGE → a careful policy allocates new entries only upon mispredictions
  • good use of limited storage by minimizing useless allocations
  • poTAGE → an aggressive policy for reducing cold-start mispredictions
  • update all hitting counters
  • allocate for all context lengths greater than the longest hitting context and for which the u bit is reset
  • stop aggressive allocation for context lengths greater than 200 when all hitting counters are saturated
  • switch to the careful policy after a fixed number of mispredictions

4% fewer mispredictions
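The allocation rules above can be collected into a small policy object, sketched here; the slides only say "a fixed number of mispredictions", so the concrete threshold below is an illustrative assumption.

```python
class RampUp:
    """Encodes the poTAGE ramp-up allocation rules from the slide.
    The misprediction threshold is an illustrative assumption."""

    def __init__(self, threshold=100000):
        self.mispredictions = 0
        self.threshold = threshold

    def record_misprediction(self):
        self.mispredictions += 1

    @property
    def aggressive(self):
        # switch to the careful policy after a fixed number of mispredictions
        return self.mispredictions < self.threshold

    def allocate(self, length, longest_hit, u_bit, all_hitting_saturated, mispredicted):
        """Should an entry be allocated for context length `length`?"""
        if not self.aggressive:
            return mispredicted                 # careful: allocate only on a misprediction
        if length <= longest_hit or u_bit:      # only longer contexts whose u bit is reset
            return False
        if length > 200 and all_hitting_saturated:
            return False                        # stop aggressive allocation
        return True
```

In the aggressive phase the predictor allocates even on correct predictions, trading useless allocations (harmless with unlimited storage) for a faster ramp-up.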

SLIDE 20

Global-path TAGE: footprint problem

  • The global path, if long enough, can (in theory) capture all branch correlations
  • Problem: high-entropy branches grow the footprint (number of allocations)
  • We could try to filter out of the global path the branches that carry no useful correlation information
  • in practice, it is difficult to identify these branches
  • filtering them out does not necessarily reduce the footprint
  • Alternative approach: intentional path aliasing

SLIDE 21

Intentional path aliasing

  • Path aliasing = several distinct global paths aliased to the same predictor entry and tag
  • something we try to avoid in a global-path TAGE
  • Intentional path aliasing reduces the footprint
  • we lose some correlation information → only some branches benefit from it
  • Local history can be viewed as intentional path aliasing
  • Per-set history (Yeh & Patt, 1993) is intentional path aliasing
  • was used in the FTL++ predictor (Yasuo Ishii et al., CBP-3)
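Per-set history makes the aliasing concrete: every branch in the same address region updates one shared history register, so distinct global paths collapse onto the same context on purpose. A minimal sketch, assuming 128-byte sets and 16 subpaths (the register width is an illustrative assumption):

```python
NUM_SUBPATHS = 16   # 16 per-set subpaths
SET_BYTES = 128     # branches in the same 128-byte region share a history

class PerSetHistory:
    """All branches of a set update one shared history register, so
    distinct paths alias intentionally, shrinking the footprint."""

    def __init__(self, path_bits=32):
        self.subpath = [0] * NUM_SUBPATHS
        self.mask = (1 << path_bits) - 1

    def _set(self, pc):
        return (pc // SET_BYTES) % NUM_SUBPATHS

    def history_for(self, pc):
        # the predictor is indexed with this branch's per-set subpath
        return self.subpath[self._set(pc)]

    def update(self, pc, taken):
        s = self._set(pc)
        self.subpath[s] = ((self.subpath[s] << 1) | int(taken)) & self.mask
```

A high-entropy branch then pollutes only the subpath of its own set instead of the single global path.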

SLIDE 22

multi-poTAGE

  • Combine several poTAGE predictors using different first-level histories
  • P0: 1 global path
  • P1: 32 local (per-address) subpaths
  • P2: 16 per-set subpaths (128-byte sets)
  • P3: 4 per-set subpaths (2-byte sets)
  • P4: 8 frequency subpaths
  • Combined through COLT Fusion
  • Loh & Henry, PACT 2002
  • Better to have a few long subpaths than many short ones
  • Yasuo Ishii et al., CBP-3

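COLT-style fusion (Loh & Henry, PACT 2002) can be sketched as follows: the vector of component predictions, together with a few branch-address bits, indexes a table of saturating counters whose sign gives the final prediction. The counter width and number of address bits here are illustrative assumptions.

```python
class COLT:
    """Fuses the predictions of several components into one T/NT prediction."""

    def __init__(self, num_components=5, addr_bits=5):
        self.addr_bits = addr_bits
        self.table = [0] * (1 << (num_components + addr_bits))  # 6-bit counters

    def _index(self, preds, pc):
        i = pc & ((1 << self.addr_bits) - 1)
        for p in preds:                      # concatenate the component predictions
            i = (i << 1) | int(p)
        return i

    def predict(self, preds, pc):
        return self.table[self._index(preds, pc)] >= 0

    def update(self, preds, pc, taken):
        i = self._index(preds, pc)
        self.table[i] = min(31, self.table[i] + 1) if taken else max(-32, self.table[i] - 1)
```

Because the table sees the whole prediction vector, it can learn per-pattern rules ("trust P1 when it disagrees with everyone else on this branch"), which subsumes both majority voting and simple selection.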

SLIDE 23

multi-poTAGE

[Diagram: the branch address and the P0 (global), P1 (local), P2 (per set), P3 (per set), and P4 (frequency) predictions feed COLT, which produces the final T/NT prediction]


SLIDE 25

Frequency-based first-level history

  • Branch frequency = number of times the branch was executed
  • Branch Frequency Table → one counter per branch address
  • increment the counter on each dynamic occurrence
  • Exploit correlations between branches with (roughly) the same frequency
  • Define 8 frequency bins
  • from high to low frequency
  • Associate one subpath with each frequency bin
  • Access poTAGE with the subpath corresponding to the branch frequency
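The mechanism above can be sketched as follows. The slide fixes the 8 bins and the one-counter-per-address Branch Frequency Table; the log-scale bin boundaries and the register width are illustrative assumptions.

```python
import math

NUM_BINS = 8   # 8 frequency subpaths

class FrequencyHistory:
    def __init__(self, path_bits=32):
        self.freq = {}                   # Branch Frequency Table: pc -> count
        self.subpath = [0] * NUM_BINS    # one history register per frequency bin
        self.mask = (1 << path_bits) - 1

    def _bin(self, pc):
        # log-scale bins (an assumption) group branches of roughly equal frequency
        return min(NUM_BINS - 1, int(math.log2(1 + self.freq.get(pc, 0))))

    def history_for(self, pc):
        # poTAGE is accessed with the subpath of this branch's bin
        return self.subpath[self._bin(pc)]

    def update(self, pc, taken):
        b = self._bin(pc)
        self.freq[pc] = self.freq.get(pc, 0) + 1   # one count per dynamic occurrence
        self.subpath[b] = ((self.subpath[b] << 1) | int(taken)) & self.mask
```

Branches of similar frequency thus share a subpath, another form of intentional path aliasing.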

SLIDE 26

Global path: most accurate single component

[Diagram: P0 (global) alone]

SLIDE 27

Global path: most accurate single component

[Diagram: P0 (global) and the branch address feed COLT, which gives the T/NT prediction]

  • 0.5 % (misprediction reduction from adding COLT)
SLIDE 28

2nd most important: 128-byte sets

[Diagram: P0 (global) + P2 (per set) feed COLT with the branch address]

  • P2 (128-byte sets): 5 %
SLIDE 29

3rd: local

[Diagram: P0 (global) + P2 (per set) + P1 (local) feed COLT with the branch address]

  • P2 (128-byte sets): 5 %
  • P1 (local): 3 %
SLIDE 30

4th: frequency

[Diagram: P0 (global) + P2 (per set) + P1 (local) + P4 (frequency) feed COLT with the branch address]

  • P2 (128-byte sets): 5 %
  • P1 (local): 3 %
  • P4 (frequency): 2.5 %
SLIDE 31

5th: 4-byte sets

[Diagram: all five poTAGEs feed COLT with the branch address]

  • P2 (128-byte sets): 5 %
  • P1 (local): 3 %
  • P4 (frequency): 2.5 %
  • P3 (4-byte sets): 1 %
SLIDE 32

Total

[Diagram: all five poTAGEs feed COLT with the branch address, giving the T/NT prediction]

  • 10 %
SLIDE 33

Conclusion

  • The post-predictor is more effective than the selection counter for reducing the cold-counter problem
  • A huge TAGE can use aggressive update & allocation
  • Fundamental weakness of global-path TAGE: high-entropy branches grow the footprint
  • Proposed solution: blind use of intentional path aliasing
  • Is it possible to use intentional path aliasing in a cost-effective way?

SLIDE 34

Questions?