Superscalar Processors
Raul Queiroz Feitosa
Parts of these slides are from the support material provided by W. Stallings
Objective
To provide an overview of the superscalar approach and the key design issues associated with its implementation.
Outline
Parallelism Concepts
Superscalar × Superpipelining
Limitations to Parallelism
Instruction Issue Policies
Register Renaming and Dynamic Scheduling
Two Parallelism Concepts
Instruction Level Parallelism (ILP)
exists when instructions in a sequence are independent and thus can be executed in parallel, e.g.,
    ...
    ADD EAX,ECX
    MOV EBX,ESI
    ...
can be executed simultaneously, keeping the same result as in a sequential execution.
Machine Parallelism
is a measure of the ability of the processor to take advantage of ILP.
Superpipelining Approach
In a conventional pipeline the most time consuming task determines the clock rate. A superpipelined machine runs at higher clock rates by splitting most time consuming stages into smaller stages.
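[Figure: pipeline stages Ifetch, Decode, Execute, Write. Conventional pipeline: clock period = 2t, set by the slowest stage. Superpipeline: the slow stage is split into more stages, each taking t, so clock period = t.]
Example of a pipeline with many stages: Pentium IV (20 stages)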
Superpipelining Execution
[Figure: time diagrams of a conventional pipeline and a superpipeline. Rows: Ifetch, Decode, Execute, Write. Horizontal axis: time, 0 to 9.]
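The gain from a shorter clock period can be checked with a small calculation. The sketch below assumes an ideal pipeline with no stalls, in which n instructions need (stages + n - 1) clock periods. The function name, the instruction count and the choice of 4 stages at 2t versus 5 stages at t are illustrative assumptions, not values taken from the slides.

```python
def pipeline_time(n_instructions, n_stages, clock_period):
    """Total time of an ideal pipeline (no stalls, one instruction per clock)."""
    cycles = n_stages + n_instructions - 1
    return cycles * clock_period


t = 1.0  # arbitrary time unit

# Assumption: 4 stages, clock period set by the slowest stage (2t).
conventional = pipeline_time(100, n_stages=4, clock_period=2 * t)

# Assumption: the slow stage is split in two, giving 5 stages of t each.
superpipelined = pipeline_time(100, n_stages=5, clock_period=t)

print(conventional, superpipelined, conventional / superpipelined)
```

For long instruction sequences the ratio approaches the ratio of the two clock periods, since the fill time of the pipeline becomes negligible.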
Superscalar Approach
A superscalar machine is able to execute multiple instructions independently and concurrently in multiple pipelines
[Figure: General Superscalar Organization. Labels: integer register file, floating-point register file, pipelined functional units, memory.]
Superscalar Execution
[Figure: time diagrams of a conventional pipeline and a superscalar pipeline. Rows: Ifetch, Decode, Execute, Write. Horizontal axis: time, 0 to 9.]
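The time diagrams can be summarized by a cycle count. The sketch below assumes an ideal machine with no dependencies and no resource conflicts, so that a superscalar pipeline of a given width starts that many instructions in every clock. The function name and the parameter `width` are illustrative assumptions.

```python
import math


def superscalar_cycles(n_instructions, n_stages, width):
    """Clock cycles needed when `width` instructions enter the pipeline per clock."""
    groups = math.ceil(n_instructions / width)  # instructions start in groups
    return n_stages + groups - 1


# Six instructions through a 4-stage pipeline (Ifetch, Decode, Execute, Write).
print(superscalar_cycles(6, n_stages=4, width=1))  # conventional pipeline
print(superscalar_cycles(6, n_stages=4, width=2))  # two parallel pipelines
```

Real programs rarely reach this bound because of the limitations discussed next.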
Limitations to Parallelism
1. Resource Conflicts
   Competition of two or more instructions for the same resource at the same time, e.g., consecutive arithmetic instructions. A possible solution is adding a second ALU.
2. Procedural Dependency
   Conditional branches (already seen). In superscalar processors more is lost if the prediction fails.
3. Data Dependencies
Data Dependencies
1. True Data Dependency
    ...
    ADD EAX,ECX
    MOV EBX,EAX
    ...
2. Output Dependency
    ...
    ADD EAX,ECX
    MOV EAX,EBX
    ...
3. Antidependency
    ...
    ADD EBX,EAX
    MOV EAX,ECX
    ...
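The three kinds of data dependency can be detected by comparing the registers written and read by two instructions. The sketch below is a minimal illustration: the `Instr` type and the `classify` function are hypothetical names, and each instruction is described by hand as a destination register plus the registers it reads.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    text: str
    dest: str               # register written
    srcs: Tuple[str, ...]   # registers read


def classify(first, second):
    """Return the data dependencies of `second` on the earlier `first`."""
    kinds = []
    if first.dest in second.srcs:
        kinds.append("true (RAW)")      # second reads what first writes
    if first.dest == second.dest:
        kinds.append("output (WAW)")    # both write the same register
    if second.dest in first.srcs:
        kinds.append("anti (WAR)")      # second overwrites what first reads
    return kinds


# x86 ADD reads both operands and writes the first one.
add = Instr("ADD EAX,ECX", dest="EAX", srcs=("EAX", "ECX"))

print(classify(add, Instr("MOV EBX,ESI", dest="EBX", srcs=("ESI",))))
print(classify(add, Instr("MOV EBX,EAX", dest="EBX", srcs=("EAX",))))
print(classify(add, Instr("MOV EAX,EBX", dest="EAX", srcs=("EBX",))))
```

Because ADD also reads its destination, the last pair carries an antidependency in addition to the output dependency; an empty list means the two instructions are independent.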
Instruction Issue Policy
It refers to:
1. Instruction fetch: order in which instructions are fetched
2. Instruction execution: order in which instructions are delivered to a functional unit for execution
3. Instruction commit: order in which instruction results are stored in registers and memory
In-order issue with in-order completion
Instruction issuing is stalled by resource conflicts, procedural dependencies or any data dependency.
Example:
1. Up to two instructions may be fetched, issued and written back at a time.
2. The fetch of the next two instructions waits until the decode buffer is cleared.
3. Three functional units: * (2 clocks), / (2 clocks), +/- (1 clock).
4. A data dependency stalls instruction issuing until the execution of the earlier instruction is completed.
5. With a RAW (read-after-write) dependency, the later instruction may be issued only after the earlier instruction has written its result.
In-order issue and completion
[Table: instructions 1 to 8 occupying the decode, execute (/, *, +/-) and write stages in each clock cycle (CY 1 to 15), for in-order issue and in-order completion.]
Instruction window
A buffer where decoded instructions are stored while waiting for execution.
It decouples the decode stages from the execution stages.
The processor can continue to fetch and decode until this window is full.
When a functional unit becomes available, an instruction can be executed.
Since the instructions have already been decoded, the processor can look ahead.
Out-of-order issue and completion
Instruction issuing is stalled by resource conflicts, procedural dependencies or TRUE data dependencies.
Example:
1. Up to two instructions may be fetched, issued and written back at a time.
2. Three functional units: * (2 clocks), / (2 clocks), +/- (1 clock).
3. A data dependency in one instruction does not stall the issuing of later, independent instructions.
4. With a RAW (read-after-write) dependency, the later instruction may be issued only after the earlier instruction has written its result.
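The difference between the two issue policies can be explored with a small scheduler. The sketch below is a simplification under stated assumptions: it tracks only issue cycles, true (RAW) dependencies and functional-unit conflicts, it folds the write stage into the unit latency, and it does not model fetch, decode or completion order. The function names, the window size and the five-instruction program are hypothetical and are not the program used in the slides.

```python
LATENCY = {"*": 2, "/": 2, "+": 1}  # clocks per functional unit, as in the examples


def producers(program):
    """For each instruction, indices of the earlier instructions it truly depends on."""
    last_writer, deps = {}, []
    for i, (unit, dest, srcs) in enumerate(program):
        deps.append([last_writer[s] for s in srcs if s in last_writer])
        last_writer[dest] = i
    return deps


def schedule(program, in_order, width=2, window=4):
    """Return the issue cycle of every instruction (cycles start at 1)."""
    deps = producers(program)
    issue = [None] * len(program)
    unit_free = {u: 1 for u in LATENCY}  # first cycle in which each unit is free
    cycle = 1
    while None in issue:
        pending = [i for i in range(len(program)) if issue[i] is None]
        issued_now = 0
        for i in pending[:window]:       # only the instruction window is examined
            unit = program[i][0]
            operands_ready = all(
                issue[j] is not None and issue[j] + LATENCY[program[j][0]] <= cycle
                for j in deps[i]
            )
            if operands_ready and unit_free[unit] <= cycle and issued_now < width:
                issue[i] = cycle
                unit_free[unit] = cycle + LATENCY[unit]
                issued_now += 1
            elif in_order:
                break                    # a stalled instruction blocks all later ones
        cycle += 1
    return issue


# Hypothetical program: (functional unit, destination, sources)
PROGRAM = [
    ("/", "R1", ("R2", "R3")),
    ("+", "R4", ("R1", "R5")),   # needs R1 from instruction 1
    ("*", "R6", ("R7", "R8")),
    ("+", "R9", ("R6", "R5")),   # needs R6 from instruction 3
    ("+", "R10", ("R2", "R3")),  # independent
]

print(schedule(PROGRAM, in_order=True))
print(schedule(PROGRAM, in_order=False))
```

With in-order issue the second instruction holds back everything behind it while it waits for the divide; with out-of-order issue the independent multiply and add are issued during that wait, so the last instruction is issued earlier.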
Out-of-order issue and completion
[Table: instructions 1 to 8 occupying the decode, execute (/, *, +/-) and write stages in each clock cycle (CY 1 to 15), for out-of-order issue and out-of-order completion.]
[Figure: the same table shown again under the title "Register Renaming", with the labels S1, S1, S2.]
Out-of-order issue with in-order completion
Exercise: How would the schedule change if out-of-order issue is allowed but in-order completion is required?
[Table: instructions 1 to 8 occupying the decode, execute (/, *, +/-) and write stages in each clock cycle (CY 1 to 15), for out-of-order issue with in-order completion.]
Exercises
Exercise 1: Complete the tables on the right under the same assumptions as in the previous examples, for the program fragment below and for in-order issue and completion.
[Table: decode, execute (/, *, +/-) and write stages per clock cycle (CY 1 to 15).]
Exercise 2: Complete the tables on the right under the same assumptions as in the previous examples, for the program fragment below and for out-of-order issue and completion.
[Table: decode, execute (/, *, +/-) and write stages per clock cycle (CY 1 to 15).]
Exercise 3: Complete the tables on the right under the same assumptions as in the previous examples, for the program fragment below and for in-order issue and completion.
[Table: decode, execute (/, *, +/-) and write stages per clock cycle (CY 1 to 15).]
Exercise 4: Complete the tables on the right under the same assumptions as in the previous examples, for the program fragment below and for out-of-order issue and completion.
[Table: decode, execute (/, *, +/-) and write stages per clock cycle (CY 1 to 15).]
Register Renaming
[Figure: logical registers R0 to R3 and hidden registers S0 to S7; values shown in the figure: 3 4 7 5.]
Hidden registers contain the data.
Logical registers contain pointers to the hidden registers. The hardware keeps track of the hidden registers that have not yet been committed.
Register Renaming
After instruction decode, instruction packages are dispatched specifying the
    Functional unit (F), which will execute the operation
    Hidden Register containing the 1st Operand (Op1), or the functional unit that will provide it
    Hidden Register containing the 2nd Operand (Op2), or the functional unit that will provide it
    Hidden Register that will contain the Result (Re)
[Figure: example instruction packages with the fields F, Op1, Op2, Re.]
Dynamic Scheduling
Example program:
    R3=R1*R2
    R1=R0-R2
    R3=R3/R1
    R3=R1+R0
[Figure: instruction packages (F, Op1, Op2, Re) dispatched for this program, and the content of logical registers R0 to R3 (initially 5 7 4 3) as it changes over time.]
[Figure: pipeline processing over time. Labels: instruction fetch and branch prediction, instruction dispatch, instruction issue, dispatched instructions, reorder buffer.]
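The renaming step for the four-instruction program above can be sketched in a few lines. This is a minimal illustration, not the hardware algorithm of any particular processor: the `rename` function, the initial mapping of R0..R3 to S0..S3 and the simple free list S4..S7 are assumptions chosen for readability and do not reproduce the register numbers in the figure.

```python
def rename(program, n_logical=4, n_hidden=8):
    """Turn (dest, op1, f, op2) instructions into (F, Op1, Op2, Re) packages."""
    # Assumption: initially R0..R3 point to S0..S3 and S4..S7 are free.
    mapping = {f"R{i}": f"S{i}" for i in range(n_logical)}
    free = [f"S{i}" for i in range(n_logical, n_hidden)]
    packages = []
    for dest, op1, f, op2 in program:
        s1, s2 = mapping[op1], mapping[op2]  # sources use the current mapping
        result = free.pop(0)                 # fresh hidden register for the result
        mapping[dest] = result               # later readers of dest see the new one
        packages.append((f, s1, s2, result))
    return packages, mapping


PROGRAM = [
    ("R3", "R1", "*", "R2"),  # R3=R1*R2
    ("R1", "R0", "-", "R2"),  # R1=R0-R2
    ("R3", "R3", "/", "R1"),  # R3=R3/R1
    ("R3", "R1", "+", "R0"),  # R3=R1+R0
]

packages, final_mapping = rename(PROGRAM)
for package in packages:
    print(package)
print(final_mapping)
```

Every write to R3 receives its own hidden register, so the output dependencies and antidependencies between the instructions disappear and only the true dependencies remain. The sketch never returns hidden registers to the free list; a real design would recycle them once their instructions are committed.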
Techniques for Machine Parallelism
1. Duplication of Resources
2. Out-of-order issue
3. Renaming
4. Instruction window large enough
Speedups with no Procedural Dependencies
It is not worth duplicating functional units without register renaming.
The instruction window must be large enough (more than 8 instructions).