Execution-based Prediction Using Speculative Slices, Craig Zilles


SLIDE 1

Execution-based Prediction Using Speculative Slices

Craig Zilles and Guri Sohi
University of Wisconsin - Madison
International Symposium on Computer Architecture
July, 2001

SLIDE 2

Execution-based Prediction using Speculative Slices - Craig Zilles and Guri Sohi International Symposium on Computer Architecture (ISCA-28), July 2001

The Problem

Two major barriers to achieving high ILP: MISPREDICTED BRANCHES and CACHE MISSES

TRADITIONAL PREDICTION: SOMEWHAT MATURE TECHNOLOGY

  • correctly anticipate > 90% of instructions
  • exploit patterns in outcome/address stream
  • remaining mispredictions still expensive

EXECUTION-BASED PREDICTION

  • exploit regularity in computations
  • speculatively compute results early for use as predictions
  • speedups from 1 to 43% on SPECINT 2000

SLIDE 3

The Solution

1. Identify frequently mispredicting instructions
2. Extract and pack dependent computation into code fragments called slices

[Figure: the program's retirement stream over time, with a problem BRANCH (branch mispredict) and a problem LOAD (cache miss), and the branch slice and load slice extracted from the program.]

SLIDE 4

The Solution

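3. Execute slices in helper threads to generate predictions

[Figure: the retirement stream over time. The branch slice and load slice are forked onto idle threads ahead of the problem instructions; the load slice turns the cache miss into a cache hit, the branch slice supplies a prediction for the branch, and the time saved is labeled as the speedup.]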

SLIDE 5

The Outline

  • PROBLEM INSTRUCTIONS

SLIDE 6

The Outline

  • PROBLEM INSTRUCTIONS
  • EXECUTION-BASED PREDICTION

SLIDE 7

The Outline

  • PROBLEM INSTRUCTIONS
  • EXECUTION-BASED PREDICTION
  • PREDICTION CORRELATION

SLIDE 8

The Outline

  • PROBLEM INSTRUCTIONS
  • EXECUTION-BASED PREDICTION
  • PREDICTION CORRELATION
  • PERFORMANCE RESULTS

SLIDE 9

Problem Instructions

Misses and mispredictions are not evenly distributed.

EXAMPLE: PERLBMK

  • 82 static branches: 68% of mispredictions, 9% of dynamic branches
  • 140 static loads: 67% of misses, 2% of dynamic memory instructions

Fixing just the problem instructions gives > 1/2 the performance of a perfect cache/predictor.

OUTCOMES OF THESE INSTRUCTIONS DO NOT EXHIBIT A PREDICTABLE PATTERN...

  • consistently mispredicted

... BUT SOMETIMES THE COMPUTATION IS REGULAR.

while (...) { ... ptr = ptr->next; }

while (i < n) { if (object[i] != NULL) { ... } ... }
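
The two loops on this slide can be read as a rough software analogy of a branch slice and a load slice. The following is only a sketch under assumptions: `branch_slice`, `load_slice`, `struct node`, and the `cap`/`max_ahead` limits are hypothetical names that do not appear in the talk, the mechanism described in the talk runs slices in hardware helper threads rather than as ordinary function calls, and `__builtin_prefetch` is a GCC/Clang builtin used here as a stand-in for a slice-issued prefetch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical list node for the `ptr = ptr->next` loop. */
struct node {
    struct node *next;
    int payload;
};

/* Branch slice sketch: computes the outcome of `object[i] != NULL`
 * for upcoming iterations and records each outcome as a prediction.
 * Returns the number of predictions written (at most `cap`). */
size_t branch_slice(void *const *object, size_t i, size_t n,
                    bool *pred, size_t cap)
{
    size_t count = 0;
    assert(object != NULL && pred != NULL);
    while (i < n && count < cap) {
        pred[count++] = (object[i] != NULL);
        i++;
    }
    return count;
}

/* Load slice sketch: walks the list ahead of the main thread and
 * requests the next node early. Only the pointer chase is kept;
 * the rest of the loop body is dropped.
 * Returns the number of nodes visited (at most `max_ahead`). */
size_t load_slice(const struct node *ptr, size_t max_ahead)
{
    size_t visited = 0;
    while (ptr != NULL && visited < max_ahead) {
#if defined(__GNUC__)
        __builtin_prefetch(ptr->next); /* a hint only; does not fault */
#endif
        ptr = ptr->next;
        visited++;
    }
    return visited;
}
```

Neither function writes to program state that the main loop reads, which mirrors the requirement that slices only produce predictions and prefetches.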

SLIDE 10

Outline

  • PROBLEM INSTRUCTIONS
  • EXECUTION-BASED PREDICTION

O A different pre-execution approach
O Speculative slices and imprecise transformations
O Slice structure
O Slice characterization

  • PREDICTION CORRELATION
  • PERFORMANCE RESULTS

SLIDE 11

Previous pre-execution proposal

Speculative Data-driven Multithreading: Roth and Sohi, HPCA’01

  • Speculatively pre-executes data-driven threads (DDTs)
  • Register integration matches DDTs to main thread

+ avoids re-execution of DDT instructions
+ early branch resolution (at decode stage)

  • DDTs must be a subset of the original program

SLIDE 12

Two Observations

  • benefit comes from prefetches and predictions
  • strict program subsets not most efficient slices

Our approach: generate predictions/prefetches in as efficient a manner as possible.

OPTIMIZE SLICES:

+ reduce fetch/execution overhead
+ reduce critical path to making prediction

  • need a new mechanism to correlate predictions

SLIDE 13

Speculative Slices

DON’T ALLOW SLICES TO AFFECT ARCHITECTED STATE

  • only generate pre-fetches and predictions
  • need not be 100% accurate

3 CLASSES OF TRANSFORMATIONS: (NOT ORIGINALLY APPLIED BY COMPILER)

  • Imprecise

O static branch assertion (remove branches/cold code)

  • Not-provably safe

O register allocation in the presence of aliases

  • Previously unprofitable

O if-conversion (of a subset of a block)
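
A small sketch of the first class, static branch assertion, is given here as an illustration only. The loop body, the names `main_thread_step`, `slice_step`, and `count_agreement`, and the arithmetic are all hypothetical and are not taken from the talk. The slice drops a cold path that the main thread must keep, so it is shorter, and it disagrees with the main thread only when the cold path is actually taken.

```c
#include <assert.h>
#include <stddef.h>

/* Main-thread version of one loop iteration (hypothetical code).
 * Returns the outcome of the problem branch `*x > key`. */
int main_thread_step(int *x, int key, int *rare_events)
{
    assert(x != NULL && rare_events != NULL);
    if (*x < 0) {              /* cold path: assumed almost never taken */
        (*rare_events)++;
        *x = -*x;
    }
    *x = (*x * 3 + 1) & 0xff;
    if (*x > key)              /* the problem branch */
        return 1;
    return 0;
}

/* Slice version: the cold path is removed by asserting that its
 * guarding branch is never taken. The result is only a prediction,
 * so being wrong on the cold path is tolerated. */
int slice_step(int *x, int key)
{
    assert(x != NULL);
    *x = (*x * 3 + 1) & 0xff;
    return *x > key;
}

/* Runs both versions side by side from the same starting value and
 * counts the steps on which the slice's prediction matches the
 * main thread's actual branch outcome. */
size_t count_agreement(int x0, int key, size_t steps)
{
    int xm = x0, xs = x0, rare = 0;
    size_t agree = 0;
    for (size_t s = 0; s < steps; s++) {
        if (main_thread_step(&xm, key, &rare) == slice_step(&xs, key))
            agree++;
    }
    return agree;
}
```

Starting from a non-negative value the cold path is never entered, so the two versions agree on every step; starting from a negative value they diverge, which is the sense in which the transformation is imprecise.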

SLIDE 14

Slice Structure

  • problem instructions frequently in loops
  • encapsulate loop in slice

[Figure: the Program and its Slice, showing the Fork point and the problem load.]

BENEFITS:

  • lower overhead
  • earlier predictions
  • amortize fork overhead
  • single helper thread

ISSUES & SOLUTIONS: in paper

SLIDE 15

Slice Characterization

CONSTRUCTED AND OPTIMIZED SLICES BY HAND

  • encouraging results

STATISTICS:

  • 85% of slices cover multiple static problem instructions
  • 70% of slices contained loops
  • small static size

O smaller than 4 * # problem instructions covered

  • prefetch or prediction generated every ~3 dynamic inst’s.
  • small number of live-in values

O 80% of slices had 2 or fewer

slices can be very small

SLIDE 16

Outline

  • PROBLEM INSTRUCTIONS
  • EXECUTION-BASED PREDICTION
  • PREDICTION CORRELATION

O difficult problem
O valid regions

  • RESULTS AND ANALYSIS

SLIDE 17

Prediction Correlation

TO BENEFIT FROM A SLICE-GENERATED PREDICTION

  • must bind it to fetched branch instruction
  • overrides hardware branch predictor

HOW ARE PREDICTIONS CORRELATED TO DYNAMIC BRANCHES?

[Figure: a branch slice, forked ahead of the BRANCH, produces a prediction for it.]

SLIDE 18

Prediction Correlation

CHALLENGES:

  • re-ordering predictions produced out-of-order
  • recovering from misspeculation by main thread
  • dealing with conditionally-executed problem branches

Tagged prediction queues

[Figure: the branch slice, forked ahead of the BRANCH, writes predictions into tagged prediction queues. Each queue carries a BRANCH PC tag and holds a sequence of T/F predictions.]

Related Work: Farcy, et al, Micro ’98
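
One way to picture a tagged prediction queue is as a small FIFO of taken/not-taken bits tagged with the PC of the problem branch. The sketch below is an assumption-laden software model, not the hardware design from the talk: `struct pred_queue`, `PQ_CAPACITY`, and the `pq_*` and `predict` functions are hypothetical names, and the model does not attempt the re-ordering or misspeculation recovery listed under CHALLENGES.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PQ_CAPACITY 8 /* hypothetical queue depth */

/* One queue per problem branch, tagged with that branch's PC. */
struct pred_queue {
    unsigned long branch_pc;
    bool slots[PQ_CAPACITY];
    size_t head;  /* index of the oldest prediction */
    size_t count; /* number of predictions currently queued */
};

void pq_init(struct pred_queue *q, unsigned long branch_pc)
{
    q->branch_pc = branch_pc;
    q->head = 0;
    q->count = 0;
}

/* Called by the slice. Returns false if the queue is full. */
bool pq_push(struct pred_queue *q, bool taken)
{
    if (q->count == PQ_CAPACITY)
        return false;
    q->slots[(q->head + q->count) % PQ_CAPACITY] = taken;
    q->count++;
    return true;
}

/* Discards the oldest prediction. */
void pq_pop(struct pred_queue *q)
{
    assert(q->count > 0);
    q->head = (q->head + 1) % PQ_CAPACITY;
    q->count--;
}

/* Called at fetch: a queued slice prediction whose tag matches the
 * fetched PC overrides the hardware predictor; otherwise the
 * hardware prediction is used unchanged. */
bool predict(const struct pred_queue *q, unsigned long fetch_pc,
             bool hw_prediction)
{
    if (q->branch_pc == fetch_pc && q->count > 0)
        return q->slots[q->head];
    return hw_prediction;
}
```

Keeping `predict` separate from `pq_pop` leaves open the question of when a prediction should be discarded, which the following slides address.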

SLIDE 19

Conditionally-executed problem branches

MINIMIZE OVERHEAD BY BUILDING SIMPLEST SLICE

  • compute prediction for each iteration

NAIVE IMPLEMENTATION

  • predictions dequeued when used
  • mis-alignment occurs on path CF

CONDITIONALLY GENERATE PREDICTIONS?

  • include “existence slice” in slice
  • too much overhead

INSIGHT

  • existence slice encoded in fetch path

[Figure: program’s CFG with blocks A through G, marking the fork point and the problem branch, which is not executed on all iterations.]
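
The mis-alignment in the naive implementation can be shown with a toy model. This is a sketch under assumptions: the slice is taken to produce exactly one prediction per loop iteration, the array `executes_branch` (a hypothetical name) records whether the main thread's fetch path reaches the problem branch in each iteration, and the function name is likewise invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Naive policy: a prediction is dequeued only when it is used.
 * `head` is the iteration number whose prediction sits at the front
 * of the queue. Returns how many executed problem branches were
 * given the prediction computed for their own iteration. */
size_t naive_correct_bindings(const bool *executes_branch,
                              size_t iterations)
{
    size_t head = 0;
    size_t correct = 0;
    assert(executes_branch != NULL);
    for (size_t it = 0; it < iterations; it++) {
        if (executes_branch[it]) {
            if (head == it)
                correct++;
            head++; /* dequeued when used */
        }
        /* If the branch is skipped, the stale prediction stays at
         * the head and is consumed by a later iteration. */
    }
    return correct;
}
```

In this model a single iteration that skips the problem branch leaves every later branch instance paired with an older iteration's prediction.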

SLIDE 20

Valid Regions

DEFINE REGION WHERE PREDICTION IS VALID

  • using assumptions from building slice

[Figure: the CFG unrolled over the first iteration and second iteration (blocks A through G), showing the regions in which the 1st pred and 2nd pred are valid.]

SLIDE 21

Valid Regions

DEFINE REGION WHERE PREDICTION IS VALID

  • using assumptions from building slice
  • “markers” to indicate region boundary
  • implementation discussed in paper

DEQUEUE PREDICTION WHEN MARKER ENCOUNTERED

  • using a prediction doesn’t dequeue it

greater than 99% correlation accuracy

[Figure: the CFG unrolled over the first iteration and second iteration (blocks A through G), showing the regions in which the 1st pred and 2nd pred are valid, with X marks on the graph.]
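
The dequeue-on-marker rule can be modeled in a few lines. This is a sketch under assumptions: the slice produces one prediction per loop iteration, a single marker is placed at the end of every iteration (the actual marker placement is described in the paper, not here), and `executes_branch` and the function name are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Valid-region policy: using a prediction does not dequeue it; the
 * marker at the end of each iteration does. `head` is the iteration
 * number whose prediction sits at the front of the queue. Returns
 * how many executed problem branches were given the prediction
 * computed for their own iteration. */
size_t valid_region_correct_bindings(const bool *executes_branch,
                                     size_t iterations)
{
    size_t head = 0;
    size_t correct = 0;
    assert(executes_branch != NULL);
    for (size_t it = 0; it < iterations; it++) {
        if (executes_branch[it] && head == it)
            correct++; /* prediction used, but left in the queue */
        head++;        /* marker encountered: dequeue */
    }
    return correct;
}
```

Because the marker is fetched on every iteration whether or not the problem branch executes, the queue head stays aligned with the main thread in this model, and every executed branch instance receives its own prediction.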

SLIDE 22

Outline

  • PROBLEM INSTRUCTIONS
  • EXECUTION-BASED PREDICTION
  • PREDICTION CORRELATION
  • RESULTS AND ANALYSIS

O Methodology
O Results
O Discussion

SLIDE 23

Methodology

Used SPEC2000 integer benchmarks

  • spectrum of program behaviors

Identified dominant program phase

  • selected 100M inst. region for simulation

Built slices (by hand) to cover problem instructions

Warmed up simulator for 100M instructions

SLIDE 24

Methodology, cont.

AGGRESSIVE BASELINE:

  • 4-wide superscalar, 128 entry window, 14 cycle mispredict penalty
  • 2 load/store units, 4 fully pipelined integer/floating point units
  • 64Kb YAGS branch, 32Kb cascaded indirect, RAS predictors
  • Fetches across basic blocks, perfect BTB for direct branches
  • 2-way associative 64KB L1 caches (64B blocks)
  • 4-way associative 2MB unified L2 cache (128B blocks)
  • 64-entry unified pre-fetch/victim buffer with hardware stream pre-fetcher

Deeply-pipelined, 4-wide, out-of-order superscalar with big predictors, associative caches, hardware stride pre-fetcher, and victim buffers.

SLIDE 25

Results

speedups ranging from 1% to 43%

  • must be regularity in branch/address computation
  • speedups proportional to memory, branch stall time
  • low base IPC → lower opportunity cost of slice execution

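[Figure: bar chart of IPC (axis from 0.0 to 4.0) for the base and slice configurations on bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vpr, and vortex. Speedup labels, in the order extracted: 16%, 3%, 7%, 11%, 1%, 16%, 43%, 1%, 7%, 12%, 35%, 1%.]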

SLIDE 26

Related Work

Pre-execution:

  • Roth and Sohi: HPCA-2001 and TR-2000

SPECULATIVE SLICES:

  • Zilles and Sohi: ISCA-2000

Limited forms of pre-execution:

  • Roth, et al: ASPLOS-1998 and ICS-1999
  • Farcy, et al: Micro-1998

Slipstream processors:

  • Sundaramoorthy, et al: ASPLOS-2000

Helper threads:

  • Chappell, et al: ISCA-1999
  • Song and Dubois: TR-1998

SLIDE 27

Summary

PROBLEM INSTRUCTIONS

  • behavior not predictable with existing predictors
  • sometimes computation is regular

EXECUTION-BASED PREDICTION

  • execute code fragments to generate prediction/prefetch
  • imprecise transformations enable small slices

PREDICTION CORRELATION: VALID REGIONS

  • monitor main thread’s fetch path
  • greater than 99% correlation accuracy

Speedups of 1 to 43% over an aggressive baseline