Precise Exceptions and Out-of-Order Execution
Samira Khan
Multi-Cycle Execution
Not all instructions take the same amount of time for execution.
Idea: Have multiple different functional units that take different numbers of cycles.
Independent instructions can start execution on a different functional unit before a previous long-latency instruction finishes execution.
EXECUTE stage
[Figure: pipeline timing diagram (stages F, D, E, W) for the sequence below. Each FMUL occupies the E stage for several cycles, while each ADD needs only one.]
FMUL R4 ← R1, R2
ADD  R3 ← R1, R2
FMUL R2 ← R5, R6
ADD  R4 ← R5, R6
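The effect of a multi-cycle EXECUTE stage on completion order can be sketched with a small Python model. This is only an illustration: it assumes one instruction is fetched per cycle, fully pipelined functional units, no stalls, and hypothetical latencies of 8 cycles for FMUL and 1 cycle for ADD. The name `writeback_cycles` is made up for this sketch.

```python
# Hypothetical EXECUTE latencies in cycles, assumed for illustration only.
LATENCY = {"FMUL": 8, "ADD": 1}

def writeback_cycles(program):
    """Return (instruction, writeback cycle) pairs for an ideal F-D-E-W pipeline.

    Assumes one fetch per cycle, fully pipelined functional units, and
    no dependence or structural stalls.
    """
    result = []
    for i, (op, dest) in enumerate(program):
        fetch = i + 1                          # cycle spent in F
        decode = fetch + 1                     # cycle spent in D
        last_execute = decode + LATENCY[op]    # final cycle spent in E
        result.append((f"{op} {dest}", last_execute + 1))  # W follows E
    return result

program = [("FMUL", "R4"), ("ADD", "R3"), ("FMUL", "R2"), ("ADD", "R4")]
timeline = writeback_cycles(program)

# Sorting by writeback cycle exposes the out-of-order completion.
completion_order = [name for name, _ in sorted(timeline, key=lambda t: t[1])]
```

Under these assumptions the younger ADD R4 writes back before the older FMUL R4, so registers are not updated in program order and the final value of R4 would come from the wrong instruction.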
memory). Two key properties:
instructions
(interrupt)
exception/interrupt is ready to be handled
Retire = commit = finish execution and update architectural state
[Figure: pipeline timing diagram (stages F, D, E, W) for FMUL R3 ← R1, R2 followed by ADD R4 ← R1, R2, with long runs of E cycles.]
before making results visible to architectural state
result moved to reg. file or memory
[Figure: block diagram with Instruction Cache, Register File, three Functional Units, and Reorder Buffer.]
[Figure: reorder buffer with fields V, DEST REG, DEST VAL, and COMPLETE, ordered oldest to youngest, holding valid entries for R4 and R3. Below it, a pipeline timing diagram over cycles 1 to 11 for FMUL, ADD, FMUL, ADD with stages F, D, E, R, W.]
[Figure, CYCLE 5: reorder buffer entries, oldest to youngest: R4 (valid), R3 with value 1000 (complete), R2 (valid). Instructions shown: FMUL R2 ← R5, R6 and ADD R4 ← R5, R6. Same pipeline timing diagram, cycles 1 to 11.]
[Figure, CYCLE 5: reorder buffer entries, oldest to youngest: R4 (valid), R3 with value 1000 (complete), R2 (valid), and a second R4 entry for the youngest ADD. Instructions shown: FMUL R2 ← R5, R6 and ADD R4 ← R5, R6. Same pipeline timing diagram, cycles 1 to 11.]
[Figure, CYCLE 11: the oldest entry, R4, now holds value 101. The other entries are R3 with value 1000 (complete), R2, and the younger R4. Same pipeline timing diagram, cycles 1 to 11.]
[Figure, CYCLE 12, RETIRE OLDEST: the oldest entry, R4 with value 101, is marked complete and is ready to retire. The other entries are R3 with value 1000 (complete), R2, and the younger R4.]
[Figure, CYCLE 12, RETIRE OLDEST: the valid bit of the oldest entry (R4, value 101) is cleared as it retires. The other entries are R3 with value 1000 (complete), R2, and the younger R4.]
[Figure, CYCLE 12: the retired FMUL entry is gone. Remaining entries, oldest to youngest: R3 with value 1000 (complete), R2, R4, belonging to ADD, FMUL, ADD. Instructions shown: FMUL R2 ← R5, R6 and ADD R4 ← R5, R6.]
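The reorder buffer frames above can be approximated with a short Python sketch. It is a simplified model, not the hardware design from the slides: the class and field names are assumptions, and only the values 1000 and 101 are taken from the figures. The point it illustrates is that a completed younger instruction cannot update the register file until every older instruction has retired.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

@dataclass
class RobEntry:
    dest_reg: str
    dest_val: Optional[int] = None
    complete: bool = False

class ReorderBuffer:
    """Minimal in-order-retire buffer; names and structure are illustrative."""

    def __init__(self):
        self.entries = deque()   # left end = oldest, right end = youngest

    def allocate(self, dest_reg):
        entry = RobEntry(dest_reg)
        self.entries.append(entry)
        return entry

    def retire(self, reg_file):
        """Retire completed entries from the oldest end only."""
        retired = []
        while self.entries and self.entries[0].complete:
            entry = self.entries.popleft()
            reg_file[entry.dest_reg] = entry.dest_val
            retired.append(entry.dest_reg)
        return retired

reg_file = {}
rob = ReorderBuffer()
fmul = rob.allocate("R4")   # older, long-latency instruction
add = rob.allocate("R3")    # younger, short-latency instruction

add.dest_val, add.complete = 1000, True    # ADD finishes first
early = rob.retire(reg_file)               # blocked behind the older FMUL
state_after_early = dict(reg_file)

fmul.dest_val, fmul.complete = 101, True   # FMUL finishes later
late = rob.retire(reg_file)                # both retire, oldest first
```

In this sketch `early` is empty because the oldest entry is still executing, and `late` lists R4 before R3, which matches program order.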
What if a later operation needs a value in the reorder buffer? Read reorder buffer in parallel with the register file. How?
(or bypass paths)
[Figure: block diagram with Instruction Cache, Register File, three Functional Units, and Reorder Buffer, plus a bypass path. The Reorder Buffer is a Content Addressable Memory (searched with register ID).]
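One way to picture the content-addressable lookup is the sketch below. It models the reorder buffer as a list ordered oldest to youngest and searches it by destination register ID, preferring the youngest match. The function `read_operand`, the tuple layout, and the stale register file values are assumptions for illustration; real hardware compares all entries in parallel rather than looping.

```python
def read_operand(reg_id, rob, reg_file):
    """Return (value, ready) for reg_id.

    rob is a list of (dest_reg, dest_val, complete) tuples, oldest first.
    A hardware CAM compares every entry at once; this loop only mimics
    the outcome by scanning from youngest to oldest.
    """
    for dest_reg, dest_val, complete in reversed(rob):
        if dest_reg == reg_id:
            # The youngest in-flight producer overrides the register file.
            return (dest_val, complete)
    return (reg_file[reg_id], True)

rob = [("R3", 1000, True), ("R2", None, False), ("R4", None, False)]
reg_file = {"R1": 1, "R2": 7, "R3": 3, "R5": 5}   # R2 and R3 are stale here

r3 = read_operand("R3", rob, reg_file)   # forwarded from the reorder buffer
r2 = read_operand("R2", rob, reg_file)   # producer in flight, not ready yet
r5 = read_operand("R5", rob, reg_file)   # no producer in flight
```

The register file is consulted only when no in-flight instruction writes the requested register.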
[Figure: reorder buffer entries, oldest to youngest, for R3 (value 1000, complete), R2, and R4, next to a register file with VAL and V (valid) columns for R1 to R11. R1 and R5 onward hold values; R2, R3, and R4 show no valid value.]
entry that contains (or will contain) the value of the register
Fields of a reorder buffer entry: V, DestRegID, DestRegVal, StoreAddr, StoreData, BranchTarget, PC/IP, Control/valid bits
[Figure: the same reorder buffer next to a register file with VAL, TAG, and V columns. Registers without a valid value carry a tag instead: R2 has tag 5, R3 has tag 2, and R4 has tag 6.]
yet to be written to the register file
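The tag-based indirection can be sketched as follows. This is an assumption-laden illustration, not the slides' design: the `RegisterFile` class, its method names, and the dictionary used for the reorder buffer are hypothetical, and only the tags 2 and 5 and the value 1000 echo the figure. Each register holds either a valid value or a tag naming the reorder buffer entry that contains (or will contain) its value, so the read becomes a direct index instead of an associative search.

```python
class RegisterFile:
    """Register file whose entries hold either a value or a ROB tag."""

    def __init__(self, values):
        # Each register maps to [value, tag, valid].
        self.regs = {reg: [val, None, True] for reg, val in values.items()}

    def rename_dest(self, reg, rob_tag):
        """Mark reg as pending; its value will come from ROB entry rob_tag."""
        self.regs[reg] = [None, rob_tag, False]

    def read(self, reg, rob):
        """Return the value, or ("wait", tag) if the producer is unfinished."""
        value, tag, valid = self.regs[reg]
        if valid:
            return value
        entry = rob[tag]                 # direct index, no associative search
        return entry["val"] if entry["complete"] else ("wait", tag)

# Hypothetical ROB contents keyed by tag.
rob = {
    2: {"val": 1000, "complete": True},
    5: {"val": None, "complete": False},
}
rf = RegisterFile({"R1": 1, "R2": 0, "R3": 0, "R5": 5})
rf.rename_dest("R3", 2)   # R3 will be produced by ROB entry 2
rf.rename_dest("R2", 5)   # R2 will be produced by ROB entry 5

r1 = rf.read("R1", rob)   # valid in the register file
r3 = rf.read("R3", rob)   # completed, read through the tag
r2 = rf.read("R2", rob)   # still executing
```

A reader that receives `("wait", tag)` would hold the tag and pick up the value once that entry completes.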
Boggs et al., “The Microarchitecture of the Pentium 4 Processor,” Intel Technology Journal, 2001.
execute, if so dispatch instruction
architectural register file or memory; else, flush pipeline and start from exception handler
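The retirement decision described above (write the result to architectural state, or flush and go to the exception handler) can be sketched in a few lines. The function `retire_oldest`, the dictionary keys, and the handler address `0x80` are hypothetical names chosen for this example.

```python
def retire_oldest(rob, arch_regs, handler_pc):
    """Process the oldest ROB entry; return a redirect PC or None.

    rob: list of dicts (oldest first) with keys dest, val, complete, exception.
    """
    if not rob or not rob[0]["complete"]:
        return None                  # oldest not finished, so nothing retires
    oldest = rob[0]
    if oldest["exception"]:
        rob.clear()                  # flush the faulting and all younger instructions
        return handler_pc            # restart fetch at the exception handler
    arch_regs[oldest["dest"]] = oldest["val"]   # make the result architectural
    rob.pop(0)
    return None

rob = [
    {"dest": "R1", "val": 10, "complete": True, "exception": False},
    {"dest": "R2", "val": None, "complete": True, "exception": True},
    {"dest": "R3", "val": 30, "complete": True, "exception": False},
]
arch_regs = {}
first = retire_oldest(rob, arch_regs, handler_pc=0x80)    # retires R1
second = retire_oldest(rob, arch_regs, handler_pc=0x80)   # exception: flush
```

After the second call the architectural registers contain only the result of the instruction older than the faulting one, even though the younger instruction writing R3 had already completed.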
[Figure: pipeline with stages F and D feeding parallel E paths of different lengths (Integer add, Integer mul, FP mul, Load/store), followed by R and W.]
younger instructions into functional (execution) units
unit
[Figure: the same pipeline with parallel E paths of different lengths (Integer add, Integer mul, FP mul, Cache miss), followed by R and W.]
(with respect to execution in the previous design)?
IMUL R3 ← R1, R2
ADD  R3 ← R3, R1
ADD  R1 ← R6, R7
IMUL R5 ← R6, R8
ADD  R7 ← R9, R9

LD   R3 ← R1 (0)
ADD  R3 ← R3, R1
ADD  R1 ← R6, R7
IMUL R5 ← R6, R8
ADD  R7 ← R9, R9
[Figure: two pipeline timing diagrams (stages F, D, E, R, W) for the sequence below, one in which instructions STALL and one in which they WAIT.]
IMUL R3 ← R1, R2
ADD  R3 ← R3, R1
ADD  R1 ← R6, R7
IMUL R5 ← R6, R8
ADD  R7 ← R3, R5
ready
Units,” IBM Journal of R&D, Jan. 1967.
introduction,” MICRO 1985.
microarchitecture,” MICRO 1985.
independent ones (s.t. independent ones can execute)
resting area
“fire” (i.e. dispatch) the instruction
complete in the presence of a long latency operation
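The idea of parking dependent instructions in a resting area and firing them once their operands arrive can be sketched as below. This is a minimal model under stated assumptions: the `ReservationStation` class, the `fire_ready` helper, the tag name `T1`, and the operand values are all hypothetical, and functional unit availability is ignored.

```python
class ReservationStation:
    """Holds one waiting instruction until all source operands are ready."""

    def __init__(self, op, sources):
        self.op = op
        # Each source is ("val", number) or ("tag", producer_tag).
        self.sources = list(sources)

    def capture(self, tag, value):
        """Replace any source waiting on tag with the broadcast value."""
        self.sources = [("val", value) if src == ("tag", tag) else src
                        for src in self.sources]

    def ready(self):
        return all(kind == "val" for kind, _ in self.sources)

def fire_ready(stations):
    """Remove and return the ops whose operands are all available."""
    fired = [rs.op for rs in stations if rs.ready()]
    stations[:] = [rs for rs in stations if not rs.ready()]
    return fired

# ADD R3 depends on a long-latency IMUL (tag T1); ADD R1 is independent.
stations = [
    ReservationStation("ADD R3", [("tag", "T1"), ("val", 1)]),
    ReservationStation("ADD R1", [("val", 6), ("val", 7)]),
]
first = fire_ready(stations)     # only the independent instruction fires

for rs in stations:
    rs.capture("T1", 2)          # the IMUL result is broadcast with its tag
second = fire_ready(stations)    # the dependent instruction can now fire
```

The independent ADD R1 fires while ADD R3 is still waiting, which is the behavior that lets useful work complete in the presence of a long-latency operation.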
memory). Two key properties:
instructions
When is a value interpreted as an instruction?
executed in control flow order
in data flow order
Consider a Von Neumann program
  What is the significance of the program order?
  What is the significance of the storage locations?
Which model is more natural to you as a programmer?
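Sequential:
v <= a + b;
w <= b * 2;
x <= v - w;
y <= v + w;
z <= x * y;

Dataflow:
[Figure: the same computation drawn as a dataflow graph with inputs a and b, operator nodes including + and *2 near the inputs and * at the output, and result z.]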
semantics?)
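The contrast between the sequential listing and the dataflow graph can be made concrete with a small sketch in which a node fires as soon as all of its inputs are available. The `GRAPH` table mirrors the five statements of the example; the function `run_dataflow`, the wave-by-wave scheduling, and the input values 3 and 4 are assumptions for illustration.

```python
import operator

# Each node: output name -> (function, input names). This mirrors
# v = a + b; w = b * 2; x = v - w; y = v + w; z = x * y.
GRAPH = {
    "v": (operator.add, ("a", "b")),
    "w": (lambda b: b * 2, ("b",)),
    "x": (operator.sub, ("v", "w")),
    "y": (operator.add, ("v", "w")),
    "z": (operator.mul, ("x", "y")),
}

def run_dataflow(graph, inputs):
    """Fire every node whose inputs are available, wave by wave."""
    values = dict(inputs)
    pending = dict(graph)
    waves = []
    while pending:
        ready = sorted(name for name, (_, srcs) in pending.items()
                       if all(src in values for src in srcs))
        if not ready:
            raise ValueError("graph has unsatisfied inputs")
        # Compute all ready nodes from the same snapshot of values.
        results = {name: pending[name][0](*(values[s] for s in pending[name][1]))
                   for name in ready}
        values.update(results)
        for name in ready:
            del pending[name]
        waves.append(ready)
    return values, waves

values, waves = run_dataflow(GRAPH, {"a": 3, "b": 4})
```

Here `waves` groups the nodes that could fire at the same time: v and w first, then x and y, then z. The sequential listing fixes one order for all five statements, while the dataflow view only requires that a node run after its producers.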
graph of a piece of the program
instructions
window?
Smith and Sohi, “The Microarchitecture of Superscalar Processors,” Proc. IEEE, Dec. 1995.
[Figure: block diagram with two FP functional units (FP FU), reservation stations, load buffers (from memory), FP registers, an input from the instruction unit, store buffers (to memory), and a common data bus.]