Superscalar Processors (Raul Queiroz Feitosa)



slide-1
SLIDE 1

Superscalar Processors

Raul Queiroz Feitosa

Parts of these slides are from the support material provided by W. Stallings

slide-2
SLIDE 2 Superscalar Processors 2

Objective

To provide an overview of the superscalar approach and the key design issues associated with its implementation.

slide-3
SLIDE 3 Superscalar Processors 3

Outline

  • Parallelism Concepts
  • Superscalar × Superpipelining
  • Limitations to Parallelism
  • Instruction Issue Policies
  • Register Renaming and Dynamic Scheduling

slide-4
SLIDE 4 Superscalar Processors 4

Two Parallelism Concepts

Instruction Level Parallelism (ILP)

exists when instructions in a sequence are independent and thus can be executed in parallel, e.g.,

    ...
    ADD EAX,ECX
    MOV EBX,ESI
    ...

can be executed simultaneously, keeping the same result as in a sequential execution.

Machine Parallelism

is a measure of the ability of the processor to take advantage of ILP.
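One informal way to see that the two instructions above are independent is to run them in both orders on a toy register file and compare the final state. The sketch below is only an illustration: the dictionary-based register file, the `run` helper and the initial register values are assumptions, and agreement for one initial state is a sanity check rather than a proof.

```python
def run(instructions, registers):
    """Execute (opcode, dest, src) tuples on a copy of the register file."""
    regs = dict(registers)
    for opcode, dest, src in instructions:
        if opcode == "ADD":        # ADD dest,src  ->  dest = dest + src
            regs[dest] = regs[dest] + regs[src]
        elif opcode == "MOV":      # MOV dest,src  ->  dest = src
            regs[dest] = regs[src]
        else:
            raise ValueError(f"unsupported opcode: {opcode}")
    return regs

# Arbitrary initial values, chosen only for the illustration.
initial = {"EAX": 1, "EBX": 2, "ECX": 3, "ESI": 4}

add = ("ADD", "EAX", "ECX")
mov = ("MOV", "EBX", "ESI")

# Independent pair: either order gives the same final registers.
same = run([add, mov], initial) == run([mov, add], initial)

# Dependent pair (MOV reads the EAX written by ADD): the order matters.
dependent = ("MOV", "EBX", "EAX")
differs = run([add, dependent], initial) != run([dependent, add], initial)
```

Here `same` is true for the pair from the slide, while `differs` is true once the second instruction reads a register written by the first.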


slide-6
SLIDE 6 Superscalar Processors 6

Superpipelining Approach

In a conventional pipeline the most time-consuming stage determines the clock rate. A superpipelined machine runs at a higher clock rate by splitting the most time-consuming stages into smaller stages.

[Figure: a conventional pipeline (stages Ifetch, Decode, Execute, Write) clocked with period 2t, next to a superpipelined version with more stages and clock period t. Example of a deep pipeline: Pentium IV (20 stages).]
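The relation between stage delays and clock period can be sketched in a few lines. This is a minimal illustration, assuming the clock period equals the slowest stage delay and that a slow stage can be split into equal parts; the helper names and the concrete delays (in units of t) are made up for the example.

```python
def clock_period(stage_delays):
    """The clock must be slow enough for the slowest stage."""
    return max(stage_delays)

def split_slowest(stage_delays, parts=2):
    """Return a new stage list with the slowest stage split into equal parts."""
    slowest = max(stage_delays)
    idx = stage_delays.index(slowest)
    return stage_delays[:idx] + [slowest / parts] * parts + stage_delays[idx + 1:]

# Assumed delays in units of t: Ifetch, Decode, Execute, Write.
conventional = [1.0, 1.0, 2.0, 1.0]
superpipelined = split_slowest(conventional)

period_before = clock_period(conventional)     # 2.0, i.e. 2t
period_after = clock_period(superpipelined)    # 1.0, i.e. t
```

Splitting the one slow stage adds a stage to the pipeline but halves the clock period in this example.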

slide-7
SLIDE 7 Superscalar Processors 7

Superpipelining Execution

[Figure: time diagrams of a conventional pipeline and of a superpipeline. Stages: Ifetch, Decode, Execute, Write. Time axis: 0 to 9.]

slide-8
SLIDE 8 Superscalar Processors 8

Superscalar Approach

A superscalar machine is able to execute multiple instructions independently and concurrently in multiple pipelines.

[Figure: General Superscalar Organization, showing an integer register file, a floating-point register file, pipeline functional units and memory.]

slide-9
SLIDE 9 Superscalar Processors 9

Superscalar Execution

[Figure: time diagrams of a conventional pipeline and of a superscalar pipeline. Stages: Ifetch, Decode, Execute, Write. Time axis: 0 to 9.]
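The time diagrams can be summarised with idealised cycle counts. The sketch below assumes no stalls of any kind (no dependencies, no resource conflicts), a base pipeline of `k` stages and `n` instructions, and measures everything in base clock cycles; the function names are chosen for this illustration only.

```python
import math

def base_cycles(n, k):
    """Conventional pipeline: k cycles for the first instruction, 1 more per extra one."""
    return k + n - 1

def superscalar_cycles(n, k, degree):
    """Superscalar of the given degree: `degree` instructions start every base cycle."""
    return k + math.ceil(n / degree) - 1

def superpipelined_cycles(n, k, degree):
    """Superpipeline of the given degree: one instruction starts every 1/degree base cycle."""
    return k + (n - 1) / degree

n, k = 6, 4                               # six instructions, four-stage base pipeline
t_base = base_cycles(n, k)                # 9
t_ss = superscalar_cycles(n, k, 2)        # 6
t_sp = superpipelined_cycles(n, k, 2)     # 6.5
```

Under these ideal assumptions both approaches of degree 2 come close to halving the execution time of long instruction sequences; the limitations discussed next explain why real programs fall short of this.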


slide-11
SLIDE 11 Superscalar Processors 11

Limitations to Parallelism

1. Resource Conflicts: competition of two or more instructions for the same resource at the same time, e.g., consecutive arithmetic instructions. A possible solution is adding a second ALU.

2. Procedural Dependency: conditional branches (already seen). In superscalar processors more is lost if the prediction fails.

3. Data Dependencies

slide-12
SLIDE 12 Superscalar Processors 12

Data Dependencies

1. True Data Dependency, or Read-after-Write (RAW)

    ...
    ADD EAX,ECX
    MOV EBX,EAX
    ...

2. Output Dependency, or Write-after-Write (WAW)

    ...
    ADD EAX,ECX
    MOV EAX,EBX
    ...

3. Antidependency, or Write-after-Read (WAR)

    ...
    ADD EBX,EAX
    MOV EAX,ECX
    ...
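The three kinds of data dependency can be detected mechanically from the registers each instruction reads and writes. The following sketch is an illustration only: it uses a three-address form (destination plus a tuple of sources, as in the program fragments of the later slides) rather than the x86 examples, and the `classify` helper is a name invented here.

```python
def classify(first, second):
    """Return the set of dependencies of `second` on the earlier instruction `first`.

    Each instruction is a (dest, sources) pair of register names.
    """
    dest1, srcs1 = first
    dest2, srcs2 = second
    kinds = set()
    if dest1 in srcs2:
        kinds.add("RAW")   # second reads what first writes
    if dest1 == dest2:
        kinds.add("WAW")   # both write the same register
    if dest2 in srcs1:
        kinds.add("WAR")   # second overwrites what first reads
    return kinds

raw = classify(("R4", ("R0", "R2")), ("R6", ("R1", "R4")))   # R4=R0+R2 ; R6=R1+R4
waw = classify(("R1", ("R0", "R2")), ("R1", ("R4", "R4")))   # R1=R0-R2 ; R1=R4+R4
war = classify(("R7", ("R1", "R2")), ("R1", ("R0", "R2")))   # R7=R1*R2 ; R1=R0-R2
none = classify(("R3", ("R0", "R1")), ("R4", ("R0", "R2")))  # independent pair
```

A pair of instructions may carry more than one kind of dependency at once, which is why the function returns a set.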


slide-14
SLIDE 14 Superscalar Processors 14

Instruction Issue Policy

It refers to:

1. Instruction fetch: order in which instructions are fetched

2. Instruction execution: order in which instructions are delivered to a functional unit to execute the operation

3. Instruction commit: order in which instruction results are stored in registers and memory

slide-15
SLIDE 15 Superscalar Processors 15

In-order issue with in-order completion

Instruction issuing is stalled by resource conflicts, procedural dependencies or any data dependencies.

Example:

1. Up to two instructions may be fetched, issued and written back at a time.

2. The fetch of the next two instructions waits until the decode buffer is cleared.

3. 3 functional units: * (2 clocks), / (2 clocks), (+,-) (1 clock).

4. A data dependency stalls instruction issuing until the execution of the earlier instruction is completed.

5. With a RAW dependency, the later instruction may be issued only after the earlier instruction has written its result.

slide-16
SLIDE 16 Superscalar Processors 16

In-order issue and completion

  • 1. R3=R0*R1
  • 2. R4=R0+R2
  • 3. R5=R0/R1
  • 4. R6=R1+R4
  • 5. R7=R1*R2
  • 6. R1=R0-R2
  • 7. R3=R3*R1
  • 8. R1=R4+R4

A dash marks an empty cell.

decode   /    *    +/-   write   CY
1 2      -    -    -     -       1
3 4      -    1    2     -       2
4        3    1    -     -       3
4        3    -    -     1 2     4
5 6      -    -    4     3       5
6        -    5    -     4       6
6        -    5    -     -       7
7 8      -    -    6     5       8
7 8      -    -    -     6       9
8        -    7    -     -       10
8        -    7    -     -       11
-        -    -    8     7       12
-        -    -    -     8       13
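The in-order schedule on this slide can be recomputed from the rules of the example. The sketch below is one possible reading of those rules, not the author's tool: it assumes that a new pair enters the decode buffer in the cycle in which the last instruction of the previous pair issues, that functional units are not pipelined, and that results are written back in program order, at most two per cycle. All names (`in_order_schedule`, `PROGRAM`, `UNIT`, `LATENCY`) are invented for the illustration.

```python
LATENCY = {"*": 2, "/": 2, "+": 1, "-": 1}                 # clocks per operation
UNIT = {"*": "mul", "/": "div", "+": "add", "-": "add"}    # three functional units

# (destination, operator, sources) for the eight instructions of the example.
PROGRAM = [
    ("R3", "*", ("R0", "R1")),
    ("R4", "+", ("R0", "R2")),
    ("R5", "/", ("R0", "R1")),
    ("R6", "+", ("R1", "R4")),
    ("R7", "*", ("R1", "R2")),
    ("R1", "-", ("R0", "R2")),
    ("R3", "*", ("R3", "R1")),
    ("R1", "+", ("R4", "R4")),
]

def in_order_schedule(program, width=2):
    """Return (issue, write) cycle lists for in-order issue and completion."""
    issue, done, write = [], [], []
    unit_free = {}                       # unit name -> first cycle it is free again
    for i, (dest, op, srcs) in enumerate(program):
        if i < width:
            decode = 1                   # first pair is decoded in cycle 1
        else:
            # Assumed rule: the pair is decoded when the previous pair's last
            # instruction leaves the decode buffer.
            decode = issue[(i // width) * width - 1]
        earliest = decode + 1
        if i > 0:
            earliest = max(earliest, issue[i - 1])             # in-order issue
        earliest = max(earliest, unit_free.get(UNIT[op], 0))   # resource conflict
        for j, (dest_j, _, srcs_j) in enumerate(program[:i]):
            if dest_j in srcs:                                 # RAW: wait for the write
                earliest = max(earliest, write[j] + 1)
            if dest in srcs_j or dest == dest_j:               # WAR / WAW: wait for execution
                earliest = max(earliest, done[j] + 1)
        issue.append(earliest)
        done.append(earliest + LATENCY[op] - 1)
        unit_free[UNIT[op]] = done[-1] + 1
        w = done[-1] + 1
        if write:
            w = max(w, write[-1])                              # in-order completion
        while write.count(w) >= width:                         # at most `width` writes per cycle
            w += 1
        write.append(w)
    return issue, write

issue_cycles, write_cycles = in_order_schedule(PROGRAM)
total_cycles = max(write_cycles)
```

With these assumptions `issue_cycles` is the first cycle each instruction spends in a functional unit and `write_cycles` the cycle in which its result is written; the last write falls in cycle 13, as in the table.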
slide-17
SLIDE 17 Superscalar Processors 17

Out-of-order issue and completion

Instruction window

  • A buffer where decoded instructions are stored waiting for execution.
  • It decouples the decode stages from the execution stages.
  • The processor can continue to fetch and decode until this window is full.
  • When a functional unit becomes available, an instruction can be executed.
  • Since instructions have been decoded, the processor can look ahead.

slide-18
SLIDE 18 Superscalar Processors 18

Out-of-order issue and completion

Instruction issuing is stalled by resource conflicts, procedural dependencies or TRUE data dependencies.

Example:

1. Up to two instructions may be fetched, issued and written back at a time.

2. 3 functional units: * (2 clocks), / (2 clocks), (+,-) (1 clock).

3. Output dependencies and antidependencies do not stall instruction issuing.

4. With a RAW dependency, the later instruction may be issued only after the earlier instruction has written its result.

slide-19
SLIDE 19 Superscalar Processors 19

Out-of-order issue and completion

  • 1. R3=R0*R1
  • 2. R4=R0+R2
  • 3. R5=R0/R1
  • 4. R6=R1+R4
  • 5. R7=R1*R2
  • 6. R1=R0-R2
  • 7. R3=R3*R1
  • 8. R1=R4+R4

[Slide annotation: Register Renaming, with the labels S1, S1, S2.]

A dash marks an empty cell.

decode   /    *    +/-   write   CY
1 2      -    -    -     -       1
3 4      -    1    2     -       2
5 6      3    1    -     2       3
7 8      3    5    4     1       4
-        -    5    6     3 4     5
-        -    -    8     5 6     6
-        -    7    -     8       7
-        -    7    -     -       8
-        -    -    -     7       9
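The out-of-order schedule on this slide can be reproduced with a small cycle-by-cycle simulation. The sketch below is a hedged reading of the example's rules, not the author's method: it assumes an instruction window that never fills (two instructions are decoded every cycle), non-pipelined functional units, register renaming (so only true RAW dependencies on the most recent earlier writer are checked), and write-back of at most two finished instructions per cycle, oldest first. All names are invented for the illustration.

```python
LATENCY = {"*": 2, "/": 2, "+": 1, "-": 1}                 # clocks per operation
UNIT = {"*": "mul", "/": "div", "+": "add", "-": "add"}    # three functional units

# (destination, operator, sources) for the eight instructions of the example.
PROGRAM = [
    ("R3", "*", ("R0", "R1")),
    ("R4", "+", ("R0", "R2")),
    ("R5", "/", ("R0", "R1")),
    ("R6", "+", ("R1", "R4")),
    ("R7", "*", ("R1", "R2")),
    ("R1", "-", ("R0", "R2")),
    ("R3", "*", ("R3", "R1")),
    ("R1", "+", ("R4", "R4")),
]

def out_of_order_schedule(program, width=2):
    """Return (issue, write) cycle lists for out-of-order issue and completion."""
    n = len(program)
    # True dependencies only: for each source, the latest earlier writer.
    producers, last_writer = [], {}
    for i, (dest, _, srcs) in enumerate(program):
        producers.append({last_writer[s] for s in srcs if s in last_writer})
        last_writer[dest] = i

    issue, done, write = [None] * n, [None] * n, [None] * n
    unit_free = {}                       # unit name -> first cycle it is free again
    cycle = 1                            # cycle 1 only decodes the first pair
    while None in write:
        cycle += 1
        # Write back instructions that finished executing in an earlier cycle.
        finished = [i for i in range(n)
                    if write[i] is None and done[i] is not None and done[i] < cycle]
        for i in finished[:width]:
            write[i] = cycle
        # Issue up to `width` ready instructions, scanning the window in program order.
        issued = 0
        for i, (_, op, _) in enumerate(program):
            if issued == width or i // width + 1 >= cycle:
                break                    # issue limit reached, or not decoded yet
            if issue[i] is not None:
                continue                 # already issued
            if unit_free.get(UNIT[op], 0) > cycle:
                continue                 # functional unit busy
            if any(write[p] is None or write[p] >= cycle for p in producers[i]):
                continue                 # RAW: producer has not written yet
            issue[i] = cycle
            done[i] = cycle + LATENCY[op] - 1
            unit_free[UNIT[op]] = done[i] + 1
            issued += 1
    return issue, write

issue_cycles, write_cycles = out_of_order_schedule(PROGRAM)
total_cycles = max(write_cycles)
```

Under these assumptions instruction 8 issues before instruction 7, and the last write happens in cycle 9, which matches the table.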
slide-20
SLIDE 20 Superscalar Processors 20

Out-of-order issue with In-order completion

Exercise: How would it be if out-of-order issue is allowed but in-order completion is required?

  • 1. R3=R0*R1
  • 2. R4=R0+R2
  • 3. R5=R0/R1
  • 4. R6=R1+R4
  • 5. R7=R1*R2
  • 6. R1=R0-R2
  • 7. R3=R3*R1
  • 8. R1=R4+R4

A dash marks an empty cell.

decode   /    *    +/-   write   CY
1 2      -    -    -     -       1
3 4      -    1    2     -       2
5 6      3    1    -     -       3
7 8      3    -    -     1 2     4
-        -    5    4     3       5
-        -    5    6     4       6
-        -    -    8     5 6     7
-        -    7    -     -       8
-        -    7    -     -       9
-        -    -    -     7 8     10
slide-21
SLIDE 21 Superscalar Processors 21

Exercises

Exercise 1: Complete the tables under the same assumptions as in the previous examples, for the program fragment below and for in-order issue and completion.

  • 1. R3=R0*R1
  • 2. R4=R0*R2
  • 3. R5=R0/R1
  • 4. R6=R5+R4
  • 5. R5=R1-R2
  • 6. R1=R0-R2
  • 7. R3=R3*R1

Table columns: decode, /, *, +/-, write, CY (cycles 1 to 15). Entries per cycle as extracted; the column positions were lost:

CY 1: 1 2 1
CY 2: 1
CY 3: 3 4 2 1
CY 4: 3 2 4
CY 5: 3 2
CY 6: 5 6 5 3 4
CY 7: 6 5
CY 8: 7 6
CY 9: 7
CY 10: 7
CY 11: 7
slide-22
SLIDE 22 Superscalar Processors 22

Exercises

Exercise 2: Complete the tables under the same assumptions as in the previous examples, for the program fragment below and for out-of-order issue and completion.

  • 1. R3=R0*R1
  • 2. R4=R0*R2
  • 3. R5=R0/R1
  • 4. R6=R5+R4
  • 5. R5=R1-R2
  • 6. R1=R0-R2
  • 7. R3=R3*R1

Table columns: decode, /, *, +/-, write, CY (cycles 1 to 15). Entries per cycle as extracted; the column positions were lost:

CY 1: 1 2
CY 2: 3 4 1
CY 3: 5 6 3 1
CY 4: 7 3 2 4 1
CY 5: 2 5 3 4
CY 6: 7 6 2 5
CY 7: 7 6
CY 8: 7
slide-23
SLIDE 23 Superscalar Processors 23

Exercises

Exercise 3: Complete the tables under the same assumptions as in the previous examples, for the program fragment below and for in-order issue and completion.

  • 1. R3=R0-R1
  • 2. R4=R0+R3
  • 3. R3=R0/R1
  • 4. R6=R5*R4
  • 5. R5=R1-R2
  • 6. R1=R0*R2
  • 7. R3=R3*R5

Table columns: decode, /, *, +/-, write, CY (cycles 1 to 15). Entries per cycle as extracted; the column positions were lost:

CY 1: 1 2
CY 2: 1
CY 3: 1
CY 4: 3 4 3 2
CY 5: 3 4 2
CY 6: 5 6 4 3
CY 7: 7 6 4 5
CY 8: 6
CY 9: 7 6
CY 10: 7
CY 11: 7
slide-24
SLIDE 24 Superscalar Processors 24

Exercises

Exercise 4: Complete the tables under the same assumptions as in the previous examples, for the program fragment below and for out-of-order issue and completion.

  • 1. R3=R0-R1
  • 2. R4=R0+R3
  • 3. R3=R0/R1
  • 4. R6=R5*R4
  • 5. R5=R1-R2
  • 6. R1=R0*R2
  • 7. R3=R3*R5

Table columns: decode, /, *, +/-, write, CY (cycles 1 to 15). Entries per cycle as extracted; the column positions were lost:

CY 1: 1 2
CY 2: 3 4 1
CY 3: 5 6 3 5 1
CY 4: 7 3 6 2 5
CY 5: 6 2 3
CY 6: 7 6
CY 7: 7
CY 8: 7

slide-26
SLIDE 26 Superscalar Processors 26

Register Renaming

Logical registers contain pointers to hidden registers, which actually contain the data. The hardware keeps track of non-committed hidden registers.

[Figure: logical registers R0 to R3 holding pointers (values shown: 3, 4, 7, 5) to the hidden registers S0 to S7, which contain the data.]

slide-27
SLIDE 27 Superscalar Processors 27

Register Renaming

After instruction decode, instruction packages are dispatched specifying:

  • the functional unit (F) that will execute the operation
  • the hidden register containing the 1st operand (Op1), or the functional unit that will provide it
  • the hidden register containing the 2nd operand (Op2), or the functional unit that will provide it
  • the hidden register that will contain the result (Re)

Example package:

F    Op1   Op2   Re
*    4     /     5

slide-28
SLIDE 28 Superscalar Processors 28

Dynamic Scheduling

  • 1. R3=R1*R2
  • 2. R1=R0-R2
  • 3. R3=R3/R1
  • 4. R3=R1+R0

[Figure: dynamic scheduling of the four instructions over time. Labelled elements: instruction fetch and branch prediction, instruction dispatch, dispatched instructions (packages with the fields F, Op1, Op2, Re), instruction issue, pipeline processing, reorder buffer, and the content of the logical registers R0 to R3. The numeric cell values of the figure are not recoverable from the extracted text.]
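The renaming step itself can be sketched on the four-instruction fragment of this slide. The code below is a simplified illustration under stated assumptions: logical register Ri initially points to hidden register Si, hidden registers S4 to S7 are free, every destination gets a fresh hidden register, and hidden registers are never recycled. The real pointer values in the slide's figure differ, and the `rename` helper is a name invented here.

```python
def rename(program, num_logical=4, num_hidden=8):
    """Rewrite (dest, op, sources) tuples so every result gets its own hidden register."""
    # Assumed initial state: Ri -> Si, remaining hidden registers are free.
    mapping = {f"R{i}": f"S{i}" for i in range(num_logical)}
    free = [f"S{i}" for i in range(num_logical, num_hidden)]
    renamed = []
    for dest, op, srcs in program:
        # Sources are read through the current pointers ...
        new_srcs = tuple(mapping[s] for s in srcs)
        # ... and the destination is redirected to a fresh hidden register.
        new_dest = free.pop(0)
        mapping[dest] = new_dest
        renamed.append((new_dest, op, new_srcs))
    return renamed, mapping

program = [
    ("R3", "*", ("R1", "R2")),   # 1. R3=R1*R2
    ("R1", "-", ("R0", "R2")),   # 2. R1=R0-R2
    ("R3", "/", ("R3", "R1")),   # 3. R3=R3/R1
    ("R3", "+", ("R1", "R0")),   # 4. R3=R1+R0
]

renamed, final_mapping = rename(program)
destinations = [dest for dest, _, _ in renamed]
```

After renaming, no two instructions write the same hidden register and no instruction overwrites a register that an earlier one still reads, so only the true (RAW) dependencies remain, for example instruction 3 reading the results of instructions 1 and 2.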

slide-29
SLIDE 29 Superscalar Processors 29

Techniques for Machine Parallelism

1. Duplication of resources

2. Out-of-order issue

3. Renaming

4. A large enough instruction window

slide-30
SLIDE 30 Superscalar Processors 30

Speedups with no Procedural Dependencies

  • It is not worth duplicating functional units without register renaming.
  • The instruction window needs to be large enough (more than 8).

slide-31
SLIDE 31 Superscalar Processors 31

Superscalar Processors

End