20 Nov 2005 roberto innocente 1
Prediction and speculation :
the role of stochastic models of program behaviour in the performance of modern computers
- r. innocente
Speculation, from the Merriam-Webster dictionary :
an assumption of an unusual risk in hopes of obtaining commensurate gains
A prediction of what work is likely to be needed soon is made. Then that work is speculatively executed, in such a way that you can commit it if the prediction was correct, or abort it otherwise.
The Control Counter, today called Program Counter (PC) or Instruction Pointer (IP), keeps the address of the next instruction to be executed. The fetch part of the processor fetches this instruction, which is then decoded and executed. At the end the PC is updated.
Linear scaling of speed - quadratic scaling of transistors
Let's look at the last scaling in silicon lithography, from 0.13 u to 0.09 u : a 0.70 linear scaling, a 0.49 scaling of surface.
Gate delays scale linearly, transistors available scale quadratically :
we will get much more in available complexity than in gate speed.
[Chart: gate speed vs. transistor count across the 0.25, 0.18, 0.13, 0.09 and 0.065 micron process nodes]
A system can be described by any mix of states, but if you observe it you can only find it in one of the eigenstates, and you can only measure an eigenvalue.
( When you look at it, Schrödinger's cat is either dead or alive )
Today µprocessors take advantage of the fact that they need to present an architectural state compliant with the standard von Neumann model only from time to time, being for the remaining time free to proceed in whatever way they find convenient.
While obeying the standard semantics when required, they try to overlap the execution of multiple instructions as much as possible. (We will see that current microprocessors can have more than 100 instructions in flight.)
Pipelining
Multiple issue (= superscalar)
CPI = Cycles Per Instruction
Performance = Frequency / CPI
Intel 386 :
feature size : 1 micron
frequency : 33 MHz
CPI ~ 5-6
Performance = 33 M / 5.5 ~ 6 Minstructions/s
The work to be done is divided in stages, with a clear signal interface between them. After each stage a pipeline latch memorizes the state for the next stage. This allows the processor to produce 1 result per cycle, once the pipe is full.
Stages : Fetch (F), Decode (D), eXecute (X), Memory (M), Writeback (W), with a pipeline latch between consecutive stages.
A latch can add 2 or 3 gate delays.
Current designs use around 400 gate delays of work per instruction. Splitting this work into n pipeline stages :
you get a result every 400/n + 3 gate delays
you add an overhead of 3n gate delays overall
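The trade-off above can be turned into a quick sketch (Python; the 400-gate work figure and the 3-gate latch overhead are the ones quoted in the text):

```python
# Pipeline-depth trade-off: total work of 400 gate delays, each
# pipeline latch adding 3 gate delays (figures from the text).

WORK = 400    # gate delays of useful work per instruction
LATCH = 3     # gate delays added by each pipeline latch

def cycle_time(n):
    """Gate delays per cycle with an n-stage pipeline."""
    return WORK / n + LATCH

def total_overhead(n):
    """Extra gate delays added across the whole pipeline."""
    return LATCH * n

for n in (1, 5, 10, 20, 40):
    print(n, cycle_time(n), total_overhead(n))
```

Deeper pipelines push the cycle time toward the 3-gate latch floor, while the total latch overhead grows linearly with n.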
cycle  instruction
 1     add  r1,r3,r4
 2     mul  r5,r6,r7
 3     bnez loop,r1
 4-7   (bubbles)
 8     div  r8,r3,r6
 9     add  r10,r8,r9
10     jmp  loop
When there is a dependency we say that the pipeline is stalled : bubbles are inserted while waiting for the dependency to resolve. Here a control dependency causes a 4-cycle stall.
Data dependency :
add r1,r2,r3 ; r1<-r2+r3
mul r5,r1,r4 ; r5<-r1*r4 (mul needs the result of add)
Solution : register renaming, result forwarding
Structural dependency : two instructions need the same functional unit at the same time
Solution : add functional units
Control dependency :
       bne label1,r1,r2
       add r1,r2,r3
label1: mul r4,r5,r6
Solution : branch prediction
[Diagram: two parallel pipelines, F D X M W over F D X M W]
Architectures that are able to process multiple instructions at a time. While it was long common to have multiple execution units (like an integer and an FP unit), the first superscalar architectures, e.g. IBM POWER and Pentium Pro, appeared only in the '90s. These architectures require very good branch prediction. Here a 2-way superscalar is depicted.
Current architectures are commonly 4- or 8-way superscalar.
The design of the last Alpha, canceled in its late phase, was for an 8-way superscalar.
Extremely good branch prediction is needed : there can be hundreds of instructions in flight (4 ways * 30 stages = 120).
cycle  issue slot 1    issue slot 2
 1     add r1,r3,r4    mul r5,r6,r7
 2     bnez loop,r1    (bubble)
 3-5   (bubbles)       (bubbles)
The wasted slots are now many more than in the simple pipelined case.
IBM POWER5
year  processor      feat.size  transistors  cycles/  freq.   pipe    FO4 gates
                     (micron)   (x1000)      instr.   (MHz)   length  per cycle
1979  8088           -          -            12       -       -       -
1988  386dx          1          275          5        33      -       80
1991  486dx          1          1100         -        50      -       -
1993  Pentium 60     0.8        3100         -        60      5       -
1995  Pentium Pro    0.6        5500         -        150     10      -
1997  Pentium II     0.35       7500         -        233     10      -
1999  Pentium III    0.25       9500         -        450     10      -
2000  Pentium 4      0.18       42000        -        1300    20      -
2005  Pentium 4 571  0.09       130000       -        3800    30      13
(dashes mark values not given in the source)
IPC = Instructions Per Cycle
Performance = Frequency * IPC
Intel Pentium 4 :
feature size : 90 nm
frequency : 3 GHz
IPC ~ 2-3 (2 for SPECint, 3 for SPECfp)
Performance = 3 G * 2 = 6 Ginstructions/s
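The two eras can be put side by side in a short sketch (Python; the 386 and Pentium 4 figures are the ones quoted in these slides):

```python
def perf_from_cpi(freq_hz, cpi):
    """CPI-era performance: Frequency / CPI."""
    return freq_hz / cpi

def perf_from_ipc(freq_hz, ipc):
    """IPC-era performance: Frequency * IPC."""
    return freq_hz * ipc

i386 = perf_from_cpi(33e6, 5.5)   # Intel 386: ~6 M instructions/s
p4   = perf_from_ipc(3e9, 2)      # Pentium 4: 6 G instructions/s
print(i386, p4)
```

A thousandfold gain, of which only ~100x comes from frequency; the rest comes from executing several instructions per cycle instead of several cycles per instruction.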
Some of the instructions, instead of simply incrementing the PC to the next instruction, change it to a different value. We distinguish :
Unconditional branches or simply jumps
Conditional branches or simply branches
Subroutine calls
Subroutine returns
Traps, returns from interrupts or exceptions
Only jumps or branches :
j <label>
j @register
beq <label>
bne <label>
bz <label>
bnz <label>
for(i=1;i<=4;i++) { .. }
      ld   r1,1
      ld   r2,4
loop: cmp  r1,r2
      beq  out
      ..
      add  r1,r1,1
      jmp  loop
out:

if (i) { .. }
      ld   r1,i
      bz   next
      ..
next:

while (i--) { .. }
loop: sub  r1,1
      bz   out
      ..
      jmp  loop
out:
Well-known set of benchmarks, continuously updated, recognized as representative of possible workloads.
Divided into 2 big sets :
SPECint : integer programs (go, m88ksim, compress, li, ijpeg, perl, vortex)
SPECfp : floating point programs (mathematical simulation programs)
http://www.spec.org
Branch type mix (average from SPECint95) : Conditional 72 %, Immediate 16 %, Returns 10 %, Indirect 2 %.
[Chart: dynamic instructions, dynamic branches and dynamic conditional branches (millions, y-axis) for the SPEC95 benchmarks compress, gcc, go, ijpeg, m88ksim, perl, vortex, xlisp]
Distribution of branches by taken rate (average from SPECint95) : always taken 14 %, taken 95-100 % of the time 21 %, 50-95 % 20 %, 5-50 % 24 %, 0-5 % 7 %, never taken 14 %.
Occurrences of branches (conditional branches) :
SPECint 95 : 1 out of 5 instructions executed (20 %)
SPECfp 95 : 1 out of 10 instructions executed (10 %)
Basic block is the term used for a sequence of instructions without any control transfer.
Note : this is a different and much higher rate than the rate of branches in the static program text.
b = rate of executed instructions that are branches (0.1-0.2)
p = prediction accuracy (currently the best is in the 0.90-0.97 range)
f = instructions "in-flight" (in execution at the same time)
Oversimplification : when a misprediction is recognized, it forces us to squash all f following in-flight instructions.
Then every 1/(b*(1-p)) instructions we squash f instructions, so the efficiency is :
E = 1/(1 + f*b*(1-p))
[Chart: efficiency E as a function of prediction accuracy, for 5, 100, 200 and 300 in-flight instructions]
Are they using information collected from the running programs ?
Static : no
Semi-static : info collected from test samples
Dynamic : yes
Always taken (AT), Always not taken (ANT)
Backward taken, forward not taken (BTFNT) : frequently used by current processors, relies on compilers too (Intel Pentium 4)
Complicated rules : for example the branch predictor of the Pentium-M looks at the distance between addresses and at opcodes
Programmer hints (special opcodes on Pentium, flags on Itanium)
Program reorganization by compilers
Achieves ~ 60-70 % accuracy
It relies on data collected from previous runs.
Insertion in the code of appropriate hints :
predict taken
predict not taken
Achieves accuracy of ~ 65-80 %
"As the last time" (1-bit) predictors
Bimodal predictors
achieve accuracy of 70-85 %
2-level / correlation predictors
achieve accuracy of 80-90 %
Combining / meta predictors
Markov / PPM predictors
Neural predictors
[Chart: branch prediction accuracy ranges (from-to, %) for static, semi-static, bimodal, 2-level and combined predictors, spanning roughly 50-95 %]
The best-known 4-state FSA (Finite State Automaton) :
SNT, NT, T, ST (Strongly Not Taken, Not Taken, Taken, Strongly Taken)
Add 1 when the branch is taken, subtract 1 when not taken. Saturate at 0 and 3.
[State diagram: SNT(00) <-> NT(01) <-> T(10) <-> ST(11); 't' transitions move toward ST, 'nt' transitions toward SNT]
An array of 2-bit saturating counters. Indexed by the branch instruction address (usually simply using some of the bits of the address), a counter is chosen and a prediction is made. The whole array is initialized to the T or NT state.
When the outcome is known, the counter is updated.
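A bimodal predictor can be sketched in a few lines of Python (the table size and the weakly-taken initial state are illustrative assumptions, not figures from the slides):

```python
class Bimodal:
    """Array of 2-bit saturating counters (0=SNT, 1=NT, 2=T, 3=ST)
    indexed by the low bits of the branch instruction address."""

    def __init__(self, bits=12, init=2):
        self.mask = (1 << bits) - 1
        self.table = [init] * (1 << bits)   # whole array starts in state T

    def predict(self, addr):
        """True = predict taken (counter in state T or ST)."""
        return self.table[addr & self.mask] >= 2

    def update(self, addr, taken):
        """Outcome known: saturating +1 on taken, -1 on not taken."""
        i = addr & self.mask
        c = self.table[i]
        self.table[i] = min(c + 1, 3) if taken else max(c - 1, 0)
```

The two-bit hysteresis means a single anomalous outcome (e.g. a loop exit) does not flip a strongly-biased prediction.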
Global correlation : the outcome depends on the outcome of previous (different) branches, e.g.
if (cond1) { .. } if (cond2) { .. } if (cond1 && cond2) { .. }
if (cond1) a=2; .. if (a==0) { .. }
Local correlation : the outcome depends on the previous outcomes of the same branch, e.g.
for(i=0;i<1000;i++) { if (i%4 == 0) a[i]=0; }
for(i=0;i<12;i++) { .. }
Two-level / correlation predictor (Yeh-Patt '92, Pan-So-Rahmeh '92)
Branches are correlated one to the other.
We keep a shift register with the most recent branch outcomes : the branch history register (BHR).
We index a table of bimodal counters (the Pattern History Table, PHT) with this branch history register.
We can keep only one global BHR for all the branches (global 2-level predictor) or a BHR per branch (local 2-level predictor). The same can be done for the PHT.
[Diagram: the branch history register indexes the Pattern History Table, which supplies the prediction]
Naming scheme : the uppercase letter describes the branch history (G = Global, S = Shared/aSsociative, P = Per address), the lowercase letter describes the pattern history table (g = global, s = shared, p = per address), and 'a' stands for adaptive.
GAs = Global History Register, adaptive, with shared Pattern History Table (for instance 8 ways)
Alleviates the problem of destructive PHT interference between branches.
The PHT is indexed with the XOR of the branch history register and the BIA (branch instruction address).
[Diagram: branch history register XOR branch address indexes the Pattern History Table, which supplies the prediction]
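A gshare sketch in Python (history length, table size and initial counter state are illustrative assumptions):

```python
class GShare:
    """PHT of 2-bit counters indexed by (global history XOR branch address)."""

    def __init__(self, bits=12):
        self.mask = (1 << bits) - 1
        self.pht = [2] * (1 << bits)   # counters start weakly taken
        self.bhr = 0                   # global branch history register

    def _index(self, addr):
        return (addr ^ self.bhr) & self.mask

    def predict(self, addr):
        return self.pht[self._index(addr)] >= 2

    def update(self, addr, taken):
        i = self._index(addr)
        c = self.pht[i]
        self.pht[i] = min(c + 1, 3) if taken else max(c - 1, 0)
        # shift the outcome into the global history register
        self.bhr = ((self.bhr << 1) | int(taken)) & self.mask
```

On a branch such as the `if (i%4 == 0)` local-correlation example, the history disambiguates the four phases of the pattern, and after a warm-up period the predictions become essentially perfect.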
The same branch history may be the result of very different program behaviours. To disentangle such situations we can take some bits of the target address of each of the last n taken branches and use those to address the bimodal PHT.
[Diagram: target-address bits of the last n taken branches form a path history register that indexes the bimodal Pattern History Table]
Tournament / meta predictor (McFarling '93)
It often happens that one predictor is better for some branches and another for other branches.
A bimodal counter can then be used to drive a mux that chooses between the 2 predictors.
When the outcome is known, the metapredictor is updated only if one predictor was right and the other wrong.
In this case the states represent the confidence in the 2 predictors.
[Diagram: the branch address indexes Predictor1, Predictor2 and the Meta predictor, which drives a mux selecting the final prediction (hybrid predictor)]
Data compression is a similar and well-studied problem, for which there exists an algorithm reputed nearly optimal (PPM).
There the goal is to represent the data with fewer bits :
you use fewer bits for frequent sequences and more bits for the infrequent ones; the net effect is to use fewer bits overall.
It relies on accurately predicting the probability distribution of the data and using a coder tuned to it.
A Markov predictor of order j bases its prediction on the last j outcomes.
It builds the matrix of transition frequencies and makes the prediction according to that.
[Table: for each 2-bit pattern (00, 01, 10, 11), the frequency with which the next outcome was 0 or 1, built from the recent outcome sequence]
A PPM predictor of order m is a set of m+1 Markov predictors, of orders m down to 0.
[Diagram: look up the last m bits; if the pattern has been seen, predict with the Markov predictor of order m; if not found, fall back to the last m-1 bits and the order m-1 predictor, and so on down to the order-0 predictor]
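The fallback chain can be sketched in Python (the order, the table representation and the taken-by-default rule are illustrative assumptions):

```python
class PPM:
    """m+1 Markov predictors of orders 0..m; predict with the longest
    history pattern already seen, falling back to shorter orders."""

    def __init__(self, m=4):
        self.m = m
        self.tables = [{} for _ in range(m + 1)]  # order k: pattern -> [freq0, freq1]
        self.history = []                          # recent outcomes, newest last

    def predict(self):
        for k in range(self.m, -1, -1):            # longest pattern first
            ctx = tuple(self.history[-k:]) if k else ()
            counts = self.tables[k].get(ctx)
            if counts:
                return counts[1] >= counts[0]      # most frequent next outcome
        return True                                 # nothing seen yet

    def update(self, taken):
        bit = int(taken)
        for k in range(self.m + 1):
            if len(self.history) >= k:
                ctx = tuple(self.history[-k:]) if k else ()
                self.tables[k].setdefault(ctx, [0, 0])[bit] += 1
        self.history = (self.history + [bit])[-self.m:]
```

Each table stores only patterns actually observed, so long-history orders stay sparse while still capturing repetitive branch behaviour.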
Machine learning has often used neural methods.
Most neural networks can't be candidates for hardware prediction at the microarchitecture level :
their implementation would require many cycles, and
the standard method of training, the backpropagation algorithm, is infeasible in a few machine cycles.
Introduced by Rosenblatt in 1962 as a model of brain functioning, popularized by M. Minsky.
We will consider the simplest : the single-layer perceptron.
A vector of n inputs : x[1]..x[n]
Each input has a weight associated with it; the weight vector w[0]..w[n] (w[0] is a bias) characterizes the perceptron.
The inputs x[i] and the outcome t can only be 1 or -1.
Then t*x[i] = 1 if they agree, or -1 if they disagree.
The output is y = w[0] + Σ w[i]*x[i] :
if the w[i] are integers, y is an integer too,
and sign(y) is the prediction.
Simply stated : increase the weights of those inputs that agree with the outcome, and decrease the weights of those that disagree.
Let t be the outcome and θ a training threshold (once the prediction is correct and |y| exceeds θ, we stop training). Then the algorithm is :
if ((sign(y) != t) || (|y| < θ)) {
    for (i = 0; i <= n; i++) {
        w[i] = w[i] + t * x[i];
    }
}
A single perceptron can only learn linearly separable functions of the inputs. The linear equation w[0] + Σ w[i]*x[i] = 0 represents a hyperplane in the n-dimensional space of inputs.
AND, OR, NAND, NOR are linearly separable; XOR is not.
Of course any boolean function can be learned by a 2-layer network of perceptrons (as any boolean function can be represented by a 2-layer net of ANDs and ORs), but it has been shown that for branch prediction there is not much gain and the delay gets much worse.
The inputs of the perceptron are the branch history bits.
We keep a table of perceptrons (the weights) addressed by hashing the branch address.
Every time we meet a branch we load its perceptron into a vector register and compute in parallel the dot product between the weights and the branch history (adding one's complements instead of two's complements, since the inputs are just ±1).
According to the sign of the result we predict the branch taken or not taken.
The training algorithm is performed and the updated perceptron is written back.
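Putting the pieces together, the whole predictor can be sketched in Python (table size, history length and θ are illustrative; the real hardware computes the dot product in parallel, here it is an ordinary loop):

```python
class PerceptronPredictor:
    """Table of weight vectors indexed by hashing the branch address;
    inputs are the global history encoded as +1 (taken) / -1 (not taken)."""

    def __init__(self, n=16, rows=256, theta=30):
        self.n, self.rows, self.theta = n, rows, theta
        self.w = [[0] * (n + 1) for _ in range(rows)]  # w[0] is the bias
        self.history = [1] * n

    def _y(self, addr):
        w = self.w[addr % self.rows]
        return w[0] + sum(wi * xi for wi, xi in zip(w[1:], self.history))

    def predict(self, addr):
        return self._y(addr) >= 0          # sign(y) is the prediction

    def update(self, addr, taken):
        y = self._y(addr)
        t = 1 if taken else -1
        # train on a misprediction, or while confidence |y| is below theta
        if (y >= 0) != taken or abs(y) < self.theta:
            w = self.w[addr % self.rows]
            w[0] += t
            for i in range(self.n):
                w[i + 1] += t * self.history[i]
        self.history = self.history[1:] + [t]
```

Unlike a 2-level predictor, whose table grows exponentially with the history length, the perceptron's storage grows only linearly, which is what makes long histories affordable.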
It's the serialization constraint imposed by data dependencies among instructions.
It was always thought to be an insurmountable limit :
an instruction that needs data from another instruction has to be executed after it.
ADD R1,R2,R3 ; R1<-R2+R3
ADD R4,R1,R5 ; R4<-R1+R5
At the end of the '90s some authors proposed the use of data prediction to overcome the dataflow limit.
M. Lipasti, J. Shen : "Exceeding the dataflow limit"
This is much more difficult than branch prediction, where you need to predict only a binary value.
The simulations showed in fact that applications exhibit a new locality principle : Value Locality, which comes in temporal and spatial flavors.
It was shown for instance that 20-30 % of the instructions that write a value into a register write the same value as the last time,
and 40-50 % write one of the last 4 preceding values.
What makes these values so predictable ?
It seems this is because real-world programs are designed not only to manage quite infrequent contingencies, like exceptions and error conditions, but also to be general by design. This shows up even in code aggressively optimized by modern state-of-the-art compilers.
Load value prediction
Register value prediction
Speculative execution
  Control speculation : branch outcome, branch target
  Data speculation :
    data location : address
    data value : load value, register value
Reverse engineering of prediction algorithm implementations
Simulation of new prediction algorithms :
using legacy Instruction Sets (IS)
using abstract RISC instruction sets
Hand-coded optimization and compiler optimization techniques
A python or perl script :
produces assembly language kernels (for example with a fixed distance between branch instructions)
compiles and runs the kernels, using the hardware misprediction counters to detect table sizes, conflicts and so on
Can be obtained by instrumenting an open-source x86 simulator like bochs, which can run windows or linux.
You can then run statically precompiled binaries over it.
Problem : bochs is not even a complete Pentium II simulator !
SimpleScalar is an open-source framework : a generic software simulator over which modules for different prediction algorithms can be implemented.
It also offers the possibility to customize the Instruction Set (IS).
Problem : you need the sources and to compile special libraries to use this tool.
Basic block extension
Code duplication
Scheduling techniques
Code scheduling, or reordering of instructions, is used to improve performance or guarantee correctness.
Important for dynamically scheduled architectures, essential for statically scheduled architectures.
Examples : branch delay slots, memory delays, multi-cycle operations
Block scheduling, list scheduling, superblock scheduling, trace scheduling
The Intel Itanium 2 with 6 MB L3 cache has 0.41 billion transistors, of which around 0.3 billion are for the cache memory.
It's not clear what will be the best use of the available silicon :
CMP (Single-Chip MultiProcessors)
Superwide, superspeculative superscalars
Simultaneous MultiThreading
Raw processors
[Chart: FO4 gates per cycle, pipe length, feature size, transistor count and frequency across the processors from the 386 to the Pentium 4 571]