SLIDE 1

20 Nov 2005 roberto innocente

Prediction and speculation :

the role of stochastic models of program behaviour in the performance of modern computers

  • r. innocente
SLIDE 2

Speculation

 from the Merriam-Webster dict :

an assumption of an unusual risk in hopes of obtaining commensurate gains

SLIDE 3

Speculative execution

  • A prediction is made of what work is likely to be needed soon. That work is then executed speculatively, in such a way that it can be committed if the prediction was correct, or aborted if it was not.

SLIDE 4

von Neumann's model : Stored Program Computer

The Control Counter, today called the Program Counter (PC) or Instruction Pointer (IP), keeps the address of the next instruction to be executed. The control part fetches this instruction, decodes it and executes it. At the end the PC is updated.

SLIDE 5

Linear scaling of speed - quadratic scaling of transistors

Let's look at the last scaling step in silicon lithography, from 0.13 µ to 0.09 µ : a 0.70 linear scaling, a 0.49 scaling of surface.

Gate delays scale linearly, transistors available scale quadratically

We will get much more in available complexity than in gate speed

[Chart: gate speed and transistor count versus feature size (0.25, 0.18, 0.13, 0.09, 0.065 µ)]

SLIDE 6

von Neumann's Projection/ Collapse postulate of QM

A system can be described by any mix of states, but if you observe it you can only find it in one of the eigenstates, and you can only measure an eigenvalue.

(When you look at it, Schrödinger's cat is either dead or alive.)

SLIDE 7

Modern microprocessors

Today's microprocessors take advantage of the fact that they need to present an architectural state compliant with the standard von Neumann model only from time to time, being free for the remaining time to proceed in whatever way they find convenient.

SLIDE 8

ILP – Instruction Level Parallelism (Fisher 1981)

  • Obeying the standard semantics when required, try to overlap the execution of multiple instructions as much as possible. (We will see that current microprocessors can have more than 100 instructions in flight.)

SLIDE 9

Enabling technologies for ILP exploitation

  • Pipelining
  • Multiple issue = Superscalar

SLIDE 10

A microprocessor in 1989 (Intel 386)

  • CPI = Cycles Per Instruction
  • Performance = Frequency / CPI
  • Intel 386 :

feature size : 1 micron

frequency : 33 MHz

CPI = 5/6 (5 to 6 cycles per instruction)

  • Performance = 33 M / 6 ~ 6 Minstructions/s

SLIDE 11

Pipelining

The work to be done is divided into stages, with a clear signal interface between them. After each stage a latch memorizes the state for the next cycle. This adds some overhead, but the hope is to get 1 result per cycle once the pipe is full.

[Diagram: a 5-stage pipeline with stages Fetch (F), Decode (D), eXecute (X), Memory (M), Writeback (W), and a pipeline latch between consecutive stages]

SLIDE 12

Limits of pipelining

A latch can add 2 or 3 gate delays.

The work of a current design is around 400 gate delays.

With n pipeline stages you get a result every 400/n + 3 gate delays,

but you add a total overhead of 3*n gate delays.
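As a rough illustration of this trade-off, here is a small sketch (assuming the slide's figures of 400 gate delays of work and 3 gate delays per latch) that prints the delay per result and the total latch overhead for a few stage counts n:

    #include <stdio.h>

    /* Pipelining trade-off from the slide:
       ~400 gate delays of useful work, each pipeline latch adds ~3 gate delays. */
    int main(void)
    {
        const double work  = 400.0;  /* gate delays of useful work       */
        const double latch = 3.0;    /* gate delays added by each latch  */
        int stages[] = {1, 5, 10, 20, 30, 40};

        for (int i = 0; i < 6; i++) {
            int n = stages[i];
            double per_result = work / n + latch;  /* cycle time: one result per cycle */
            double overhead   = latch * n;         /* total latch overhead             */
            printf("n=%2d stages: result every %6.1f gate delays, overhead %5.1f\n",
                   n, per_result, overhead);
        }
        return 0;
    }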

SLIDE 13

Pipeline at work

cycle : instruction entering the pipe (stages F D X M W)

1 : add r1,r3,r4
2 : mul r5,r6,r7
3 : bnez loop,r1
4-7 : (bubbles)
8 : div r8,r3,r6
9 : add r10,r8,r9
10 : jmp loop

When there is a dependency we say that the pipeline is stalled, or that a bubble is inserted, waiting for the dependency to resolve. Here a control dependency causes a 4-cycle stall.

SLIDE 14

Instruction dependencies

Data dependency :
  add r1,r2,r3 ; r1 <- r2+r3
  mul r1,r4,r5 ; r1 <- r4*r5

Solution:

register renaming, result forwarding

Structural dependency:

Solution:

add functional units

Control dependency :
  bne label1,r1,r2
  add r1,r2,r3
label1:
  mul r4,r5,r6

Solution:

branch prediction

SLIDE 15

Multiple issue (Superscalar) Architectures

[Diagram: two parallel 5-stage pipelines F D X M W]

Architectures that are able to process multiple instructions at a time. While it was common to have multiple execution units (like an integer and an FP unit), the first superscalar architectures, e.g. IBM Power and Pentium Pro, appeared only in the '90s. These architectures require very good branch prediction. Depicted here is a 2-way superscalar.

SLIDE 16

Superscalar/2

  • Current architectures are commonly 4 or 8 way superscalars

  • The design of the last Alpha, canceled in its late phase, was for an 8 way superscalar

  • Extremely good branch prediction is needed : there can be hundreds of instructions in flight (4 way * 30 stages = 120)

SLIDE 17

Superscalar at work

cycle : instructions entering the 2-way pipe (stages F D X M W)

1 : add r1,r3,r4 ; mul r5,r6,r7
2 : bnez loop,r1 ; (bubble)
3-5 : (bubbles)

The wasted slots are now many more than in the pipelined-only case.
SLIDE 18

Real World Architectures

IBM power5

SLIDE 19

15 years of x86

year   processor      feature size (µ)   transistors (K)   cycles/instr.   frequency (MHz)   pipe length   FO4 gates/cycle
1979   8088           -                  -                 12              -                 -             -
1988   386dx          1                  275               5               33                -             80
1991   486dx          1                  1100              -               50                -             -
1993   pentium 60     0.8                3100              -               60                5             -
1995   pentiumPro     0.6                5500              -               150               10            -
1997   Pentium II     0.35               7500              -               233               10            -
1999   Pentium III    0.25               9500              -               450               10            -
2000   Pentium 4      0.18               42000             -               1300              20            -
2005   Pentium 4 571  0.09               130000            -               3800              30            13

SLIDE 20

Feature size, frequency, complexity

[Charts: feature size (µ), frequency (MHz) and transistor count for 386, 486dx, P 60, P Pro, P II, P III, P 4, P 4 571]
SLIDE 21

A microprocessor in 2005 (Intel Pentium4)

  • IPC = Instructions Per Cycle
  • Performance = Frequency * IPC
  • Intel Pentium4 :

feature size : 90 nm

frequency : 3 GHz

IPC ~ 2/3 (2 for SPECint, 3 for SPECfp)

  • Performance = 3 G * 2 = 6 Ginstructions/s

SLIDE 22

Control xfer instructions

Some of the instructions, instead of simply incrementing the PC to the next instruction, change it to a different value. We distinguish :

Unconditional branches or simply jumps

Conditional branches or simply branches

subroutine calls

subroutine returns

traps, returns from interrupts or exceptions

SLIDE 23

Assembly – Machine instructions

  • Only jumps or branches :

j <label>

j @register

beq <label>

bne <label>

bz <label>

bnz <label>

SLIDE 24

High level Language – Assembly

for(i=1;i<=4;i++) { .. }

    ld r1,1
    ld r2,4
loop: cmp r1,r2
    beq out
    ..
    add r1,r1,1
    jmp loop
out:

if (i) { .. }

    ld r1,i
    bz next
    ..
next:

while (i--) { .. }

loop: sub r1,1
    bz out
    ..
    jmp loop
out:
SLIDE 25

SPEC-Std Perf. Evaluation Corporation benchmarks

Well-known set of benchmarks, continuously updated, recognized as representative of possible workloads

Divided into 2 big sets :

SPECint : integer programs (go, m88ksim, compress, li, ijpeg, perl, vortex)

SPECfp : floating point programs (mathematical simulation prgs)

http://www.spec.org

SLIDE 26

Branches by type

Average from SPECint95 : Conditional 72 %, Immediate 16 %, Returns 10 %, Indirect 2 %

SLIDE 27

Branches by frequency

SPEC95 benchmarks : compress, gcc, go, ijpeg, m88ksim, perl, vortex, xlisp

[Chart: dynamic instructions, dynamic branches and dynamic conditional branches per benchmark, in millions of instructions]
SLIDE 28

Branches by taken rate

Average from SPECint95 : Always taken 14 %, taken 95-100 % of the time 21 %, 50-95 % 20 %, 5-50 % 24 %, 0-5 % 7 %, Never taken 14 %

SLIDE 29

Occurrences of branches

Occurrences of branches (conditional branches) :

SPECint 95 : 1 out of 5 instructions executed (20 %)

SPECfp 95 : 1 out of 10 instructions executed (10 %)

Basic block is the term used for a sequence of instructions without any control xfer. Note : this is different from, and much higher than, the rate of branches in the static program.

SLIDE 30

Mispredictions effects

b = rate of executed instructions that are branches (0.1-0.2)

p = prediction accuracy (currently the best is in the 0.90-0.97 range)

f = instructions "in flight" (in execution, currently over 100)

Oversimplification : the misprediction is recognized only at the very end and forces us to squash all the following f in-flight instructions.

Then every 1/(b*(1-p)) instructions we squash f instructions.

E = 1 / (1 + f*b*(1-p))
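A small sketch of this formula, evaluated for illustrative values of b, p and f:

    #include <stdio.h>

    /* Efficiency E = 1 / (1 + f*b*(1-p)) from the slide:
       b = branch rate, p = prediction accuracy, f = instructions in flight. */
    static double efficiency(double b, double p, double f)
    {
        return 1.0 / (1.0 + f * b * (1.0 - p));
    }

    int main(void)
    {
        double accuracies[] = {0.80, 0.90, 0.95, 0.97};
        double b = 0.2;    /* SPECint-like branch rate    */
        double f = 100.0;  /* ~100 instructions in flight */

        for (int i = 0; i < 4; i++)
            printf("p = %.2f -> E = %.2f\n",
                   accuracies[i], efficiency(b, accuracies[i], f));
        return 0;
    }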

SLIDE 31

Efficiency versus bp accuracy

[Chart: efficiency E versus branch prediction accuracy, for 5, 100, 200 and 300 in-flight instructions]

SLIDE 32

Branch prediction methods

Do they use information collected from the running programs ?

  • Static : no
  • Semi-static : info collected from test samples
  • Dynamic : yes

SLIDE 33

Static branch prediction

Always taken (AT), Always not taken (ANT)

Backward taken, forward not taken (BTFNT)

frequently used by current processors, relying on compilers too (Intel Pentium4)

Complicated rules : for example the bp of the Pentium-M looks at the distance between addresses and opcodes

Programmer hints (special opcodes on Pentium, flags on Itanium)

program reorganization by compilers

Achieves ~ 60/70 % accuracy

SLIDE 34

Semi-static branch prediction

  • It relies on data collected from previous runs of the program (profiling : Sun Sparc)

  • Insertion into the code of appropriate hints :

predict taken

predict not taken

  • Achieves accuracy of ~ 65/80 %

SLIDE 35

Dynamic branch prediction

Predict the same outcome as the last time

Bimodal predictors

Achieve accuracy of 70/85 %

2 level / correlation predictors

Achieve accuracy of 80/90 %

Combining/Meta predictors

Markov/PPM predictors

Neural predictors

[Chart: branch prediction accuracy ranges (from/to, in %) for static, semi-static, bimodal, 2-level and combined predictors, spanning roughly 50 % to 95 %]

SLIDE 36

2bc – Two bit saturated counter

The best 4-state FSA (Finite State Automaton)

SNT, NT, T, ST (Strongly Not Taken, Not Taken, Taken, Strongly Taken)

Add 1 when branch is taken, subtract 1 when not taken. Saturate at 0 and 3

[Diagram: 4-state FSA with states SNT (00), NT (01), T (10), ST (11) and transitions on taken (t) / not taken (nt)]
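A minimal sketch of such a counter (the 0..3 encoding follows the slide's 00..11 states; the function names are illustrative):

    /* Two-bit saturating counter: 0 = SNT, 1 = NT, 2 = T, 3 = ST. */
    typedef unsigned char counter2b;

    /* Predict taken when the counter is in one of the two "taken" states. */
    static int predict(counter2b c) { return c >= 2; }

    /* Add 1 on a taken branch, subtract 1 on a not-taken one, saturating at 0 and 3. */
    static counter2b update(counter2b c, int taken)
    {
        if (taken) return c < 3 ? c + 1 : 3;
        else       return c > 0 ? c - 1 : 0;
    }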

SLIDE 37

bimodal predictor (Smith ’85)

Array of 2-bit saturated counters

  • At every branch, hashing on the instruction address (usually simply using some of the bits of the address), a counter is chosen and a prediction is made. The whole array is initialized to the T or NT state

  • When the outcome of the branch is known, the counter is updated

  • There is general consensus on using the 2-bit saturated counter, indexed by the address of the branch instruction
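A minimal sketch of a bimodal predictor along these lines (the table size, the hash on the branch address and the initialization to the T state are illustrative assumptions):

    #include <stdint.h>

    #define BIMODAL_SIZE 4096                 /* illustrative table size (power of two) */

    static unsigned char table[BIMODAL_SIZE]; /* 2-bit counters */

    static void bimodal_init(void)
    {
        for (unsigned i = 0; i < BIMODAL_SIZE; i++)
            table[i] = 2;                     /* start every counter in the T state */
    }

    /* Pick a counter by hashing on low bits of the branch address. */
    static unsigned index_of(uint64_t branch_addr)
    {
        return (branch_addr >> 2) & (BIMODAL_SIZE - 1);
    }

    static int bimodal_predict(uint64_t branch_addr)
    {
        return table[index_of(branch_addr)] >= 2;   /* taken if in T or ST state */
    }

    static void bimodal_update(uint64_t branch_addr, int taken)
    {
        unsigned char *c = &table[index_of(branch_addr)];
        if (taken && *c < 3) (*c)++;
        else if (!taken && *c > 0) (*c)--;
    }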

SLIDE 38

Branch correlations

Global correlation : the outcome depends on the outcome of previous branches

  if (cond1) { .. }
  if (cond1 && cond2) { .. }

  if (cond1) a=2;
  ..
  if (a==0) { .. }

  if (cond1) { .. }
  if (cond2) { .. }
  if (cond1 && cond2) { .. }

Local correlation : the outcome depends on previous outcomes of the same branch

  for(i=0;i<1000;i++) { if (i%4 == 0) a[i]=0; }

  for(i=0;i<12;i++) { .. }
SLIDE 39

Two-level/Correlation predictor (Yeh-Patt '92, Pan-So-Rahmeh '92)

Branches are correlated one to the other

We keep a shift register with the most recent branch outcomes

We index a bimodal table (Pattern History Table, PHT) with this branch history register (BHR)

We can keep a single global BHR for all the branches (global 2-level predictor) or a BHR per branch (local 2-level predictor). We can do the same for the PHT.

[Diagram: the branch history register indexes the Pattern History Table; the selected counter provides the prediction and is updated with the last outcome]
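A minimal sketch of a purely global two-level predictor of this kind (the history length and table size are illustrative assumptions):

    #include <stdint.h>

    #define HIST_BITS 12                       /* illustrative global history length */
    #define PHT_SIZE  (1u << HIST_BITS)

    static uint32_t      ghr;                  /* global branch history register          */
    static unsigned char pht[PHT_SIZE];        /* pattern history table of 2-bit counters */

    static int twolevel_predict(void)
    {
        return pht[ghr & (PHT_SIZE - 1)] >= 2;
    }

    static void twolevel_update(int taken)
    {
        unsigned char *c = &pht[ghr & (PHT_SIZE - 1)];
        if (taken  && *c < 3) (*c)++;
        if (!taken && *c > 0) (*c)--;
        ghr = (ghr << 1) | (taken ? 1 : 0);    /* shift the outcome into the history */
    }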
SLIDE 40

Taxonomy of 2 level predictors

Branch History : G = Global, S = Shared (aSsociative), P = Per address

Pattern History : g = global, s = shared (associative), p = per address

GAs = Global History Register, adaptive, with shared Pattern History Table (for instance 8 ways)

SLIDE 41

gshare (McFarling ’93)

Alleviates the problem of PHT destructive interference between branches

The PHT is indexed with the XOR of the BHR and the BIA (branch instruction address)

[Diagram: the branch history register XORed with the branch address indexes the Pattern History Table; the selected counter provides the prediction and is updated with the last outcome]
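A minimal sketch of the gshare lookup and update (the table size, history length and the hash of the branch address are illustrative assumptions, not McFarling's exact parameters):

    #include <stdint.h>

    #define GSHARE_BITS 14                          /* illustrative history/index width */
    #define GSHARE_SIZE (1u << GSHARE_BITS)

    static uint32_t      history;                   /* global branch history register */
    static unsigned char pht[GSHARE_SIZE];          /* 2-bit counters                 */

    /* gshare: XOR the branch address with the global history, so that different
       branches sharing the same history map to different PHT entries. */
    static unsigned gshare_index(uint64_t branch_addr)
    {
        return ((uint32_t)(branch_addr >> 2) ^ history) & (GSHARE_SIZE - 1);
    }

    static int gshare_predict(uint64_t branch_addr)
    {
        return pht[gshare_index(branch_addr)] >= 2;
    }

    static void gshare_update(uint64_t branch_addr, int taken)
    {
        unsigned char *c = &pht[gshare_index(branch_addr)];
        if (taken  && *c < 3) (*c)++;
        if (!taken && *c > 0) (*c)--;
        history = (history << 1) | (taken ? 1 : 0);
    }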

SLIDE 42

Path correlated prediction

The same branch history may be the result of very different program behaviours. To disentangle such situations we can take some bits of the target address of each of the last n taken branches and use those to address the bimodal PHT.

[Diagram: a path history register, made of target-address (TA) bits from the last taken branches, indexes the Pattern History Table]

SLIDE 43

Tournament/Meta predictor (McFarling ’93)

It often happens that one predictor is better for some branches and another predictor is better for other branches

A bimodal predictor can then be used to drive a mux that will choose between the 2 predictors

When the outcome is known, the metapredictor is updated if one of the predictors was right and the other wrong

In this case the states represent the confidence in the 2 predictors

[Diagram: hybrid predictor; the address of the branch instruction indexes Predictor1, Predictor2 and the Meta Predictor, which drives a mux selecting between the two predictions; all are updated with the outcome]
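A minimal sketch of such a chooser (the two component predictions pred1/pred2 are assumed to come from predictors implemented elsewhere; the table size and indexing are illustrative):

    #include <stdint.h>

    #define META_SIZE 4096                       /* illustrative chooser table size */

    static unsigned char meta[META_SIZE];        /* 2-bit counters: >= 2 means "trust predictor 2" */

    static unsigned meta_index(uint64_t addr) { return (addr >> 2) & (META_SIZE - 1); }

    /* Choose between the two component predictions. */
    static int tournament_predict(uint64_t addr, int pred1, int pred2)
    {
        return meta[meta_index(addr)] >= 2 ? pred2 : pred1;
    }

    /* Update the chooser only when exactly one component predictor was right. */
    static void tournament_update(uint64_t addr, int pred1, int pred2, int taken)
    {
        unsigned char *c = &meta[meta_index(addr)];
        int ok1 = (pred1 == taken), ok2 = (pred2 == taken);
        if (ok2 && !ok1 && *c < 3) (*c)++;
        if (ok1 && !ok2 && *c > 0) (*c)--;
    }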
SLIDE 44

Data compression

It is a similar and well-studied problem, for which there exists an algorithm reputed to be nearly optimal (PPM).

The goal is to represent the data with fewer bits :

you use fewer bits for frequent sequences and more bits for the infrequent ones. The net effect is to use fewer bits overall.

It relies on accurately predicting the probabilistic distribution of data and using a coder tuned to that

SLIDE 45

Markov predictor

A Markov predictor of order j bases its prediction on the last j outcomes

It builds the matrix of transition frequencies and makes the prediction according to that

[Table: for the example stream of last outcomes 1 1 1 1 1 0, the frequency with which each next outcome followed each 2-bit pattern (00, 01, 10, 11)]
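A minimal sketch of an order-j Markov predictor over a binary outcome stream (ORDER and the table layout are illustrative assumptions):

    #include <stdint.h>

    #define ORDER 2                                   /* illustrative order j        */
    #define NPATTERNS (1u << ORDER)

    static unsigned freq[NPATTERNS][2];               /* transition frequency counts */
    static unsigned history;                          /* last ORDER outcomes, as bits */

    /* Predict the outcome that has followed the current pattern most often so far. */
    static int markov_predict(void)
    {
        unsigned p = history & (NPATTERNS - 1);
        return freq[p][1] >= freq[p][0];
    }

    /* Record the observed outcome and slide the history window. */
    static void markov_update(int outcome)
    {
        unsigned p = history & (NPATTERNS - 1);
        freq[p][outcome ? 1 : 0]++;
        history = ((history << 1) | (outcome ? 1 : 0)) & (NPATTERNS - 1);
    }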

SLIDE 46

PPM – (Cleary, Witten 1984) Prediction by Partial Matching

  • A PPM predictor of order m is a set of m+1 Markov predictors

[Diagram: look up the last m bits; if the pattern is found, predict with the Markov predictor of order m; if not found, fall back to the last m-1 bits and the order m-1 predictor, and so on down to the order-0 predictor]
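A minimal sketch of this fallback scheme (MAX_ORDER and the table layout are illustrative; a real PPM compressor also maintains escape probabilities, which are omitted here):

    #include <stdint.h>

    #define MAX_ORDER 4                                /* illustrative m */

    /* freq[j][pattern][outcome]: counts for the order-j Markov predictor. */
    static unsigned freq[MAX_ORDER + 1][1u << MAX_ORDER][2];

    /* Predict from the highest-order predictor whose pattern has been seen before,
       falling back order by order; default to "taken" if nothing matches. */
    static int ppm_predict(unsigned history)
    {
        for (int j = MAX_ORDER; j >= 0; j--) {
            unsigned pattern = history & ((1u << j) - 1);   /* last j outcomes */
            unsigned n0 = freq[j][pattern][0], n1 = freq[j][pattern][1];
            if (n0 + n1 > 0)
                return n1 >= n0;
        }
        return 1;
    }

    /* Update every order's table with the observed outcome. */
    static void ppm_update(unsigned history, int outcome)
    {
        for (int j = 0; j <= MAX_ORDER; j++)
            freq[j][history & ((1u << j) - 1)][outcome ? 1 : 0]++;
    }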

SLIDE 47

Neural methods-D.Jimenez 2002

Machine learning has often used neural methods

Most neural networks can't be candidates for hardware prediction at the microarchitecture level

Their implementation would require much more than several cycles

The standard method of training, the backpropagation algorithm, is infeasible in a few machine cycles

SLIDE 48

Perceptron

Introduced by Rosenblatt in 1962 as a model of brain functioning, popularized by M.Minsky

We will consider the simplest: the single-layer perceptron

A vector of n inputs: x[1]..x[n]

Each input has a weight associated with it: w[0]..w[n]. This vector of weights characterizes the perceptron; its output is y = w[0] + Σ w[i]*x[i]

SLIDE 49

Bipolar perceptron

  • The inputs and the outcome t can only be 1 or -1

  • Then t*x[i] = 1 if they agree, or -1 if they disagree

  • If the w[i] are integers, y is an integer too and sign(y) is the prediction
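A minimal sketch of that output computation (N, the history length, is an illustrative assumption; x[0] is the constant 1 input for the bias weight w[0]):

    #define N 16                         /* illustrative number of inputs */

    /* Bipolar perceptron output: y = w[0] + sum of w[i]*x[i], with x[i] in {-1, +1}.
       sign(y) gives the prediction: taken if y >= 0, not taken otherwise. */
    static int perceptron_output(const int w[N + 1], const int x[N + 1])
    {
        int y = w[0];                    /* bias weight, x[0] is implicitly 1 */
        for (int i = 1; i <= N; i++)
            y += w[i] * x[i];
        return y;
    }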

SLIDE 50

Perceptron training

Simply stated : increase the weights of those inputs that agree with the outcome, and decrease the weights of those that do not

Let t be the outcome and θ (theta) be a threshold beyond which we stop training the perceptron. Then the algorithm is : if ((sign(y) != t) || (abs(y) < theta)) {

    for (i = 0; i <= n; i++) {

        w[i] = w[i] + t * x[i];

    }

}

SLIDE 51

Perceptron limitations

A single perceptron can only learn linearly separable functions of the inputs. The linear equation

w[0] + Σ w[i]*x[i] = 0

represents a hyperplane in the n-dimensional space of inputs

AND, OR, NAND, NOR are linearly separable, XOR is not

Of course any boolean function can be learned by a 2-layer network of perceptrons (as any boolean function can be represented by a 2-layer net of ANDs and ORs), but it has been shown that for bp there is not much gain and the delay gets much worse

SLIDE 52

Branch prediction with perceptrons

The inputs of the perceptron are the branch history

We keep a table of perceptrons (the weights) that we address hashing on the branch address

Every time we meet a branch we load the perceptron into a vector register and we compute in parallel the dot product between the weights and the branch history (summing one's complements instead of two's complements)

According to the result we predict the branch taken or not taken

The training alg. is performed and the updated perceptron is written back
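A minimal software sketch of this scheme (the table size, history length, hash and the sequential dot product are illustrative assumptions; real hardware computes the sum in parallel, as the slide notes):

    #include <stdint.h>

    #define NPERCEPTRONS 1024                          /* illustrative table size     */
    #define HIST 16                                    /* illustrative history length */

    static int weights[NPERCEPTRONS][HIST + 1];        /* table of perceptrons        */
    static int ghist[HIST + 1];                        /* global history as +1/-1     */
    static int last_y;                                 /* output of the last lookup   */

    static unsigned hash(uint64_t addr) { return (addr >> 2) % NPERCEPTRONS; }

    static int perceptron_bp_predict(uint64_t addr)
    {
        const int *w = weights[hash(addr)];
        last_y = w[0];                                 /* bias weight */
        for (int i = 1; i <= HIST; i++)
            last_y += w[i] * ghist[i];
        return last_y >= 0;                            /* sign(y) is the prediction */
    }

    static void perceptron_bp_update(uint64_t addr, int taken, int theta)
    {
        int *w = weights[hash(addr)];
        int t = taken ? 1 : -1;
        int y_mag = last_y >= 0 ? last_y : -last_y;

        /* Train on a misprediction or when |y| is below the threshold theta. */
        if ((last_y >= 0) != taken || y_mag < theta)
            for (int i = 0; i <= HIST; i++)
                w[i] += t * (i == 0 ? 1 : ghist[i]);

        /* Shift the new outcome into the history. */
        for (int i = HIST; i > 1; i--)
            ghist[i] = ghist[i - 1];
        ghist[1] = t;
    }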

SLIDE 53

The dataflow limit

It's the serialization constraint imposed by data dependencies among instructions

It was always thought to be an insurmountable limit

An instruction that needs data from another instruction must be executed after it

ADD R1,R2,R3 ; R1 <- R2+R3
ADD R4,R1,R5 ; R4 <- R1+R5

SLIDE 54

Exceeding the dataflow limit

At the end of the '90s some authors proposed the use of data prediction to overcome the dataflow limit

M. Lipasti, J. Shen : Exceeding the dataflow limit via value prediction (1996)

This is much more difficult than branch prediction where you need to predict only a binary value

SLIDE 55

Value locality

  • The simulations showed in fact that applications also obey a new locality principle : Value Locality

Value Locality : Temporal, Spatial

SLIDE 56

Value prediction/1

  • It was shown for instance that 20/30 % of instructions that write a value into a register write the same value as the last time

  • And 40/50 % write one of the last 4 preceding values

SLIDE 57

Value prediction/2

  • What makes these values so predictable ?

It seems this is due to the generality real-world programs pay for : they are designed not only to manage quite infrequent contingencies, like exceptions and error conditions, but also to be general by design. This shows up even in code aggressively optimized by modern state-of-the-art compilers.

SLIDE 58

Value prediction/3

  • Load Value prediction
  • Register Value prediction

SLIDE 59

Speculation taxonomy

Speculative execution
  Control speculation : Branch outcome, Branch target
  Data speculation :
    Data Location : Address
    Data Value : Load, Register Value

SLIDE 60

Research areas

Reverse engineering of prediction algorithm implementations

Simulation of new prediction algorithms :

Using legacy Instruction Sets (IS)

Using abstract RISC instruction sets

Hand-coded optimization and compiler optimization techniques

SLIDE 61

Reverse engineering

  • A python or perl script :

produces assembly language kernels (with, for example, a fixed distance between branch instructions)

compiles and runs the kernels using the hardware counters for mispredictions to detect table sizes, conflicts and so on

SLIDE 62

Legacy IS/OS simulations

  • Can be obtained by instrumenting an x86 open-source simulator like bochs, which can run Windows or Linux

  • You can then run statically precompiled binaries over it

  • Problem : bochs is not even a complete Pentium II simulator !

SLIDE 63

Abstract IS simulators

SimpleScalar is an open-source framework for a generic software simulator, over which modules for different prediction algorithms can be implemented

It also offers the possibility to customize the Instruction Set (IS)

Problem : you need the program sources, and they must be compiled against special libraries, to use this tool

SLIDE 64

Optimization techniques

  • Basic block extension
  • Code duplication
  • Scheduling techniques :

SLIDE 65

Scheduling

Code scheduling, or reordering of instructions, is used to improve performance or guarantee correctness

Important for dynamically scheduled architectures, essential for statically scheduled architectures

Examples : branch delay slots, memory delays, multi-cycle operations

Block scheduling, List scheduling, Superblock scheduling, Trace Scheduling

SLIDE 66

BTA era is here (Billion Transistor Architecture)

Intel Itanium2 with 6MB L3 cache has 0.41 billion transistors of which around 0.3 billion transistors are for the cache memory

It's not clear what will be the best use of the available silicon:

CMP (Single-Chip MultiProcessors)

Superwide superspeculative superscalar

Simultaneous MultiThreading

Raw Processors

SLIDE 67

[Chart: FO4 gates per cycle, pipe length, feature size, transistor count and frequency for 386, 486dx, P 60, P Pro, P II, P III, P 4, P 4 571]