SLIDE 1

Exploitation of instruction level parallelism

Computer Architecture

J. Daniel García Sánchez (coordinator)
David Expósito Singh
Francisco Javier García Blas

ARCOS Group Computer Science and Engineering Department University Carlos III of Madrid


– Computer Architecture – ARCOS Group – http://www.arcos.inf.uc3m.es 1/55

SLIDE 2

Outline

1. Compilation techniques and ILP
2. Advanced branch prediction techniques
3. Introduction to dynamic scheduling
4. Speculation
5. Multiple issue techniques
6. ILP limits
7. Thread level parallelism
8. Conclusion

SLIDE 3

Exploitation of instruction level parallelism Compilation techniques and ILP

Taking advantage of ILP

ILP is directly exploitable within basic blocks.

Basic block: a sequence of instructions without branches. In a typical MIPS program, the average basic block size is 3 to 6 instructions, so little ILP can be exploited within a single block.

We need to exploit ILP across basic blocks.

Example

for (i = 0; i < 1000; i++) { x[i] = x[i] + y[i]; }

This loop exhibits loop-level parallelism, which can be transformed into ILP by the compiler or by the hardware.

Alternatives: vector instructions, or SIMD instructions in the processor.

SLIDE 4

Exploitation of instruction level parallelism Compilation techniques and ILP

Scheduling and loop unrolling

Parallelism exploitation:

Interleave execution of unrelated instructions. Fill stall slots with useful instructions. Do not alter the effects of the original program.

The compiler can do this with detailed knowledge of the architecture.

SLIDE 5

Exploitation of instruction level parallelism Compilation techniques and ILP

ILP exploitation

Example

for (i = 999; i >= 0; i--) { x[i] = x[i] + s; }

Each iteration body is independent.

Latencies between instructions:

Producing instruction | Using instruction | Latency (cycles)
FP ALU operation      | FP ALU operation  | 3
FP ALU operation      | Store double      | 2
Load double           | FP ALU operation  | 1
Load double           | Store double      | 0

SLIDE 6

Exploitation of instruction level parallelism Compilation techniques and ILP

Compiled code

R1 → last array element. F2 → scalar s. R2 → precomputed so that 8(R2) is the first element in the array.

Assembler code:

Loop:  L.D    F0, 0(R1)     ; F0 <- x[i]
       ADD.D  F4, F0, F2    ; F4 <- F0 + s
       S.D    F4, 0(R1)     ; x[i] <- F4
       DADDUI R1, R1, #-8   ; i--
       BNE    R1, R2, Loop  ; branch if R1 != R2

SLIDE 7

Exploitation of instruction level parallelism Compilation techniques and ILP

Stalls in execution

Original

Loop:  L.D    F0, 0(R1)
       ADD.D  F4, F0, F2
       S.D    F4, 0(R1)
       DADDUI R1, R1, #-8
       BNE    R1, R2, Loop

Stalls

Loop:  L.D    F0, 0(R1)
       <stall>
       ADD.D  F4, F0, F2
       <stall>
       <stall>
       S.D    F4, 0(R1)
       DADDUI R1, R1, #-8
       <stall>
       BNE    R1, R2, Loop

SLIDE 8

Exploitation of instruction level parallelism Compilation techniques and ILP

Loop scheduling

Original

Loop:  L.D    F0, 0(R1)
       <stall>
       ADD.D  F4, F0, F2
       <stall>
       <stall>
       S.D    F4, 0(R1)
       DADDUI R1, R1, #-8
       <stall>
       BNE    R1, R2, Loop

9 cycles per iteration.

Scheduled

Loop:  L.D    F0, 0(R1)
       DADDUI R1, R1, #-8
       ADD.D  F4, F0, F2
       <stall>
       <stall>
       S.D    F4, 8(R1)
       BNE    R1, R2, Loop

7 cycles per iteration.
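The cycle counts above can be checked with a small issue-time simulator. This is a sketch under the latency table from the previous slides, plus the assumption that a branch depending on the immediately preceding integer operation costs one stall (matching the stall shown between DADDUI and BNE); the (name, kind, sources) encoding is invented for the example.

```python
# Sketch: count issue cycles for the loop bodies above.
LATENCY = {
    ("FP_ALU", "FP_ALU"): 3,
    ("FP_ALU", "STORE"): 2,
    ("LOAD", "FP_ALU"): 1,
    ("LOAD", "STORE"): 0,
    ("INT_ALU", "BRANCH"): 1,  # assumed: branch sees the integer result one cycle late
}

def cycles(instrs):
    """instrs: list of (name, kind, sources); returns issue cycle of the last one."""
    issued = {}
    cycle = 0
    for name, kind, sources in instrs:
        cycle += 1                      # at best, one instruction per cycle
        for src in sources:
            prod_cycle, prod_kind = issued[src]
            ready = prod_cycle + 1 + LATENCY.get((prod_kind, kind), 0)
            cycle = max(cycle, ready)   # wait out any stalls
        issued[name] = (cycle, kind)
    return cycle

original = [("L.D", "LOAD", []), ("ADD.D", "FP_ALU", ["L.D"]),
            ("S.D", "STORE", ["ADD.D"]), ("DADDUI", "INT_ALU", []),
            ("BNE", "BRANCH", ["DADDUI"])]
scheduled = [("L.D", "LOAD", []), ("DADDUI", "INT_ALU", []),
             ("ADD.D", "FP_ALU", ["L.D"]), ("S.D", "STORE", ["ADD.D"]),
             ("BNE", "BRANCH", ["DADDUI"])]
print(cycles(original), cycles(scheduled))  # 9 7
```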

SLIDE 9

Exploitation of instruction level parallelism Compilation techniques and ILP

Loop unrolling

Idea:

Replicate loop body several times. Adjust termination code. Use different registers for each iteration replica to reduce dependencies.

Effect:

Increase basic block length. Increase use of available ILP .

cbed

– Computer Architecture – ARCOS Group – http://www.arcos.inf.uc3m.es 9/55

slide-10
SLIDE 10

Exploitation of instruction level parallelism Compilation techniques and ILP

Unrolling

Unrolling (x4)

Loop:  L.D    F0, 0(R1)
       ADD.D  F4, F0, F2
       S.D    F4, 0(R1)
       L.D    F6, -8(R1)
       ADD.D  F8, F6, F2
       S.D    F8, -8(R1)
       L.D    F10, -16(R1)
       ADD.D  F12, F10, F2
       S.D    F12, -16(R1)
       L.D    F14, -24(R1)
       ADD.D  F16, F14, F2
       S.D    F16, -24(R1)
       DADDUI R1, R1, #-32
       BNE    R1, R2, Loop

Unrolling 4 iterations requires more registers. This example assumes that the array size is a multiple of 4.
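At the source level, the same transformation can be sketched as follows. The helper name is invented, and the cleanup loop is the "termination code adjustment" needed when the array size is not a multiple of 4 (the slide's example avoids it by assumption):

```python
# Sketch: unroll x[i] = x[i] + s by a factor of 4, with a cleanup loop.
def add_scalar_unrolled4(x, s):
    n = len(x)
    limit = n - n % 4
    i = 0
    while i < limit:            # unrolled part: four copies of the body
        x[i] = x[i] + s
        x[i + 1] = x[i + 1] + s
        x[i + 2] = x[i + 2] + s
        x[i + 3] = x[i + 3] + s
        i += 4
    while i < n:                # cleanup loop for the leftover iterations
        x[i] = x[i] + s
        i += 1
    return x

print(add_scalar_unrolled4([1, 2, 3, 4, 5, 6], 10))
```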

SLIDE 11

Exploitation of instruction level parallelism Compilation techniques and ILP

Stalls and unrolling

Unrolling (x4)

Loop:  L.D    F0, 0(R1)
       <stall>
       ADD.D  F4, F0, F2
       <stall>
       <stall>
       S.D    F4, 0(R1)
       L.D    F6, -8(R1)
       <stall>
       ADD.D  F8, F6, F2
       <stall>
       <stall>
       S.D    F8, -8(R1)
       L.D    F10, -16(R1)
       <stall>
       ADD.D  F12, F10, F2
       <stall>
       <stall>
       S.D    F12, -16(R1)
       L.D    F14, -24(R1)
       <stall>
       ADD.D  F16, F14, F2
       <stall>
       <stall>
       S.D    F16, -24(R1)
       DADDUI R1, R1, #-32
       <stall>
       BNE    R1, R2, Loop

27 cycles for every 4 iterations → 6.75 cycles per iteration.

SLIDE 12

Exploitation of instruction level parallelism Compilation techniques and ILP

Scheduling and unrolling

Unrolling (x4)

Loop:  L.D    F0, 0(R1)
       L.D    F6, -8(R1)
       L.D    F10, -16(R1)
       L.D    F14, -24(R1)
       ADD.D  F4, F0, F2
       ADD.D  F8, F6, F2
       ADD.D  F12, F10, F2
       ADD.D  F16, F14, F2
       S.D    F4, 0(R1)
       S.D    F8, -8(R1)
       S.D    F12, -16(R1)
       DADDUI R1, R1, #-32
       S.D    F16, 8(R1)
       BNE    R1, R2, Loop

Code reorganization: preserves dependencies and is semantically equivalent. Goal: make use of the stall slots.

The update of R1 is placed far enough from BNE. 14 cycles for every 4 iterations → 3.5 cycles per iteration.

SLIDE 13

Exploitation of instruction level parallelism Compilation techniques and ILP

Limits of loop unrolling

The improvement decreases with each additional unrolling.

Gains are limited to stall removal; the loop overhead is amortized among the unrolled iterations.

Increase in code size.

May increase the instruction cache miss rate.

Pressure on register file.

May cause a shortage of registers. The advantages are lost if not enough registers are available.

SLIDE 14

Exploitation of instruction level parallelism Advanced branch prediction techniques


SLIDE 15

Exploitation of instruction level parallelism Advanced branch prediction techniques

Branch prediction

Branches have a high impact on program performance. To reduce their impact:

Loop unrolling.

Branch prediction: at compile time; each branch handled in isolation.

Advanced branch prediction: correlated predictors and tournament predictors.

SLIDE 16

Exploitation of instruction level parallelism Advanced branch prediction techniques

Dynamic scheduling

Hardware reorders instruction execution to reduce stalls while preserving data flow and exception behavior. It can handle cases unknown at compile time, such as cache misses and hits.

Benefits: code is less dependent on a concrete pipeline, the compiler is simplified, and hardware speculation becomes possible.

SLIDE 17

Exploitation of instruction level parallelism Advanced branch prediction techniques

Correlated prediction

Example: if the first and second branches are taken, the third is NOT taken.

if (a == 2) { a = 0; }
if (b == 2) { b = 0; }
if (a != b) { f(); }

Maintains a history of the last branches to select among several predictors. An (m, n) predictor:

Uses the outcome of the last m branches to select among 2^m predictors. Each predictor has n bits.

A (1, 2) predictor uses the outcome of the last branch to select between 2 two-bit predictors.
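A (1, 2) predictor can be sketched in a few lines. The class and its table layout are illustrative, not a hardware description:

```python
# Sketch of a (1, 2) correlating predictor: the outcome of the last branch
# selects one of two 2-bit saturating counters per branch address.
class CorrelatingPredictor:
    def __init__(self):
        self.history = 0            # outcome of the last branch (0 or 1)
        self.counters = {}          # (address, history) -> 2-bit counter

    def predict(self, address):
        counter = self.counters.get((address, self.history), 0)
        return counter >= 2         # predict taken if counter is 2 or 3

    def update(self, address, taken):
        key = (address, self.history)
        counter = self.counters.get(key, 0)
        # saturating increment on taken, decrement on not-taken
        self.counters[key] = min(3, counter + 1) if taken else max(0, counter - 1)
        self.history = 1 if taken else 0

p = CorrelatingPredictor()
# A branch that alternates taken / not-taken is perfectly correlated with
# its own previous outcome, so only the warm-up iterations miss.
misses = 0
for taken in [True, False] * 20:
    if p.predict(0x40) != taken:
        misses += 1
    p.update(0x40, taken)
print(misses)  # 2
```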

SLIDE 18

Exploitation of instruction level parallelism Advanced branch prediction techniques

Size of predictor

An (m, n) predictor has several entries for each branch address.

Total size: S = 2^m × n × entries per address.

Examples:

(0, 2) predictor with 4K entries → 8 Kbit
(2, 2) predictor with 4K entries → 32 Kbit
(2, 2) predictor with 1K entries → 8 Kbit
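The example sizes can be checked directly from the formula:

```python
# Sketch: predictor size in bits, S = 2**m * n * entries.
def predictor_bits(m, n, entries):
    return 2 ** m * n * entries

print(predictor_bits(0, 2, 4096) // 1024)  # 8  Kbit
print(predictor_bits(2, 2, 4096) // 1024)  # 32 Kbit
print(predictor_bits(2, 2, 1024) // 1024)  # 8  Kbit
```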

SLIDE 19

Exploitation of instruction level parallelism Advanced branch prediction techniques

Miss rate

A correlated predictor has fewer misses than a simple predictor of the same size, and even fewer misses than a simple predictor with an unlimited number of entries.

SLIDE 20

Exploitation of instruction level parallelism Advanced branch prediction techniques

Tournament prediction

Combines two predictors:

Global information based predictor. Local information based predictor.

Uses a selector to choose between the two predictors. The selection is driven by a 2-bit saturating counter.

Advantage: allows different behavior for integer and FP code.

SPEC: for integer benchmarks the global predictor is selected about 40% of the time; for FP benchmarks, about 15%.

Used in: Alpha 21264 and AMD Opteron.
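The selector mechanism can be sketched as follows. The component predictors here are toy stand-ins (stateless functions), not real global and local predictors:

```python
# Sketch of tournament selection: a 2-bit saturating counter per branch
# picks a component predictor and is trained toward whichever one was right.
class TournamentPredictor:
    def __init__(self, global_pred, local_pred):
        self.global_pred = global_pred
        self.local_pred = local_pred
        self.selector = {}                  # address -> 2-bit counter

    def predict(self, address):
        use_global = self.selector.get(address, 2) >= 2
        pred = self.global_pred if use_global else self.local_pred
        return pred(address)

    def update(self, address, taken):
        g_ok = self.global_pred(address) == taken
        l_ok = self.local_pred(address) == taken
        counter = self.selector.get(address, 2)
        if g_ok and not l_ok:
            counter = min(3, counter + 1)   # move toward the global predictor
        elif l_ok and not g_ok:
            counter = max(0, counter - 1)   # move toward the local predictor
        self.selector[address] = counter

# Toy components: the "global" one always says taken, the "local" one never.
t = TournamentPredictor(lambda a: True, lambda a: False)
for _ in range(4):                          # this branch is always taken...
    t.update(0x10, True)
print(t.predict(0x10))  # True: selector saturated toward the global predictor
```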

SLIDE 21

Exploitation of instruction level parallelism Advanced branch prediction techniques

Intel Core i7

Predictor with two levels:

Smaller first level predictor. Larger second level predictor as backup.

Each predictor combines 3 predictors:

Simple 2-bit predictor. Global history predictor. Loop-exit predictor (iteration counter).

Besides:

Indirect jump predictor. Return address predictor.

SLIDE 22

Exploitation of instruction level parallelism Introduction to dynamic scheduling


SLIDE 23

Exploitation of instruction level parallelism Introduction to dynamic scheduling

Dynamic scheduling

Idea: hardware reorders instruction execution to decrease stalls. Advantages:

Compiled code optimized for one pipeline runs efficiently on another pipeline. Dependencies unknown at compile time are managed correctly. Unpredictable delays (e.g. cache misses) are tolerated.

Drawback:

More complex hardware.

SLIDE 24

Exploitation of instruction level parallelism Introduction to dynamic scheduling

Dynamic scheduling

Effects:

Out-of-order (OoO) execution. Out-of-order instruction completion. May introduce WAR and WAW hazards.

Separation of the ID stage into two stages:

Issue: decodes the instruction and checks for structural hazards. Read operands: waits until there is no data hazard, then fetches the operands.

Instruction Fetch (IF):

Fetches into instruction register or instruction queue.

SLIDE 25

Exploitation of instruction level parallelism Introduction to dynamic scheduling

Dynamic scheduling techniques

Scoreboard:

Stalls instruction issue until enough resources are available and there is no data hazard. Examples: CDC 6600, ARM Cortex-A8.

Tomasulo Algorithm:

Removes WAR and WAW hazards through register renaming. Examples: IBM 360/91, Intel Core i7.
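Register renaming, the key idea behind Tomasulo's approach, can be sketched as a simple table walk. The (dest, src1, src2) encoding and register counts are invented for the example:

```python
# Sketch: each write to an architectural register gets a fresh physical
# register, so WAR and WAW hazards disappear and only RAW dependencies remain.
def rename(instructions, num_arch_regs=32):
    """instructions: list of (dest, src1, src2) architectural register ids."""
    mapping = {r: r for r in range(num_arch_regs)}   # arch -> physical
    next_phys = num_arch_regs
    renamed = []
    for dest, src1, src2 in instructions:
        s1, s2 = mapping[src1], mapping[src2]        # read current mappings
        mapping[dest] = next_phys                    # allocate a fresh register
        renamed.append((next_phys, s1, s2))
        next_phys += 1
    return renamed

# r1 = r2 + r3 ; r4 = r1 + r5 ; r1 = r6 + r7  (WAW on r1, WAR with the 2nd op)
code = [(1, 2, 3), (4, 1, 5), (1, 6, 7)]
print(rename(code))
# [(32, 2, 3), (33, 32, 5), (34, 6, 7)] -- the two writes to r1 now target
# different physical registers, so they can complete in any order.
```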

SLIDE 26

Exploitation of instruction level parallelism Speculation


SLIDE 27

Exploitation of instruction level parallelism Speculation

Branches and parallelism limits

As parallelism increases, control dependencies become a harder problem; branch prediction alone is not enough.

The next step is to speculate on the branch outcome and execute as if the speculation were right: instructions are fetched, issued, and executed. A mechanism is needed to handle wrong speculations.

SLIDE 28

Exploitation of instruction level parallelism Speculation

Components

Ideas:

Dynamic branch prediction: selects the instructions to execute. Speculation: executes instructions before control dependencies are resolved; their effects may eventually be undone. Dynamic scheduling.

To achieve this, two things must be separated:

Passing an instruction result to another instruction that uses it. Instruction commit.

IMPORTANT: processor state (register file / memory) is not updated until changes are confirmed.

SLIDE 29

Exploitation of instruction level parallelism Speculation

Solution

Reorder Buffer (ROB):

When an instruction finishes execution, its result is written to the ROB. When the instruction is confirmed (commits), the result is written to its real target. Dependent instructions read modified data from the ROB.

ROB entries:

Instruction type: branch, store, register operation. Target: Register Id or memory address. Value: Instruction result value. Ready: Indication of instruction completion.
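The in-order commit discipline of the ROB can be sketched as follows. This is a simplified model (no speculation squashing, register targets only):

```python
# Sketch of a reorder buffer: results are recorded out of order, but the
# architectural state is only updated in program order, head-first.
from collections import deque

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()           # entries in program order
        self.registers = {}              # committed architectural state

    def issue(self, target):
        entry = {"target": target, "value": None, "ready": False}
        self.entries.append(entry)
        return entry

    def finish(self, entry, value):      # may happen out of order
        entry["value"] = value
        entry["ready"] = True

    def commit(self):                    # retire ready entries, in order only
        while self.entries and self.entries[0]["ready"]:
            e = self.entries.popleft()
            self.registers[e["target"]] = e["value"]

rob = ReorderBuffer()
e1 = rob.issue("F0")
e2 = rob.issue("F4")
rob.finish(e2, 7.0)                      # the second instruction finishes first
rob.commit()
print(rob.registers)                     # {} -- F4 must wait for F0
rob.finish(e1, 3.0)
rob.commit()
print(rob.registers)                     # {'F0': 3.0, 'F4': 7.0}
```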

SLIDE 30

Exploitation of instruction level parallelism Multiple issue techniques


SLIDE 31

Exploitation of instruction level parallelism Multiple issue techniques

CPI < 1

CPI ≥ 1 → at most one instruction issued per cycle. Multiple issue processors (CPI < 1, i.e. IPC > 1):

Statically scheduled superscalar processors.

In-order execution. Variable number of instructions per cycle.

Dynamically scheduled superscalar processors.

Out-of-order execution. Variable number of instructions per cycle.

VLIW processors (Very Long Instruction Word).

Several instructions packed into a single long instruction. Static scheduling. ILP made explicit by the compiler.

SLIDE 32

Exploitation of instruction level parallelism Multiple issue techniques

Approaches to multiple issue

Several approaches to multiple issue.

Static superscalar. Dynamic superscalar. Speculative superscalar. VLIW/LIW. EPIC.

SLIDE 33

Exploitation of instruction level parallelism Multiple issue techniques

Static superscalar

Issue: Dynamic. Hazards detection: Hardware. Scheduling: Static. Discriminating feature:

In-order execution.

Examples:

MIPS. ARM Cortex-A7.

SLIDE 34

Exploitation of instruction level parallelism Multiple issue techniques

Dynamic Superscalar

Issue: Dynamic. Hazards detection: Hardware. Scheduling: Dynamic. Discriminating feature:

Out-of-Order execution with no speculation.

Examples: None.

SLIDE 35

Exploitation of instruction level parallelism Multiple issue techniques

Speculative superscalar

Issue: Dynamic. Hazards detection: Hardware. Scheduling: Speculative dynamic. Discriminating feature:

Out-of-Order execution with speculation.

Examples:

Intel Core i3, i5, i7. AMD Phenom. IBM Power 7

SLIDE 36

Exploitation of instruction level parallelism Multiple issue techniques

VLIW

Packs several operations into a single long instruction. Example of one instruction in a VLIW ISA:

One integer operation or a branch. Two independent floating point operations. Two independent memory references.

IMPORTANT: Code must exhibit enough parallelism.
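Packing into such an instruction can be sketched as a slot-filling problem. The slot mix matches the example above, but the format and helper are hypothetical, and a real compiler must also verify that the packed operations are independent:

```python
# Sketch: fill one VLIW instruction with 1 integer/branch slot, 2 FP slots
# and 2 memory slots; unfilled slots become NOPs (wasted issue bandwidth).
SLOTS = {"int": 1, "fp": 2, "mem": 2}

def pack_bundle(ops):
    """ops: list of (kind, text), assumed independent. Returns (bundle, leftover)."""
    free = dict(SLOTS)
    bundle, leftover = [], []
    for kind, text in ops:
        if free.get(kind, 0) > 0:
            free[kind] -= 1
            bundle.append(text)
        else:
            leftover.append((kind, text))  # no slot left: goes in the next bundle
    bundle += ["NOP"] * sum(free.values()) # pad the empty slots
    return bundle, leftover

ops = [("mem", "L.D F0,0(R1)"), ("mem", "L.D F6,-8(R1)"),
       ("fp", "ADD.D F4,F8,F2"), ("int", "DADDUI R1,R1,#-16"),
       ("mem", "S.D F14,16(R1)")]
bundle, rest = pack_bundle(ops)
print(bundle)  # the third memory reference overflows the 2 memory slots
print(rest)
```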

SLIDE 37

Exploitation of instruction level parallelism Multiple issue techniques

VLIW / LIW

Issue: Static. Hazards detection: Mostly software. Scheduling: Static. Discriminating feature:

All hazards determined and specified by the compiler.

Examples:

DSPs (e.g. TI C6x).

SLIDE 38

Exploitation of instruction level parallelism Multiple issue techniques

Problems with VLIW

Drawbacks from original VLIW model:

Complexity of statically finding enough parallelism. Large generated code size. No hazard detection hardware. More binary compatibility problems than in regular superscalar designs.

EPIC tries to solve most of these problems.

SLIDE 39

Exploitation of instruction level parallelism Multiple issue techniques

EPIC

Issue: Mostly static. Hazards detection: Mostly software. Scheduling: Mostly static. Discriminating feature:

All hazards determined and specified by compiler.

Examples:

Itanium.

SLIDE 40

Exploitation of instruction level parallelism ILP limits


SLIDE 41

Exploitation of instruction level parallelism ILP limits

ILP limits

To study the maximum ILP, we model an ideal processor with:

Infinite register renaming: all WAR and WAW hazards can be avoided. Perfect branch prediction: all conditional branch predictions hit. Perfect jump prediction: all jumps (including returns) are correctly predicted. Perfect memory address alias analysis: a load can safely be moved before a store if the addresses are not identical. Perfect caches: all cache accesses take one clock cycle (always hit).

SLIDE 42

Exploitation of instruction level parallelism ILP limits

Available ILP

[Bar chart: available ILP in the ideal processor, in instructions per cycle — gcc 54.8, espresso 62.6, li 17.9, fpppp 75.2, doduc 118.7, tomcatv 150.1]

SLIDE 43

Exploitation of instruction level parallelism ILP limits

However . . .

More ILP implies more control logic:

Smaller caches. Longer clock cycle. Higher energy consumption.

Practical limitation:

Issue from 3 to 6 instructions per cycle.

SLIDE 44

Exploitation of instruction level parallelism Thread level parallelism


SLIDE 45

Exploitation of instruction level parallelism Thread level parallelism

Why TLP?

Some applications exhibit more natural parallelism than can be exploited with ILP.

Servers, scientific applications, . . .

Two models emerge:

Thread level parallelism (TLP):

Thread: has its own instructions and data; it may be part of a program or an independent program. Each thread has an associated state (instructions, data, PC, registers, . . . ).

Data level parallelism (DLP):

Identical operation on different data items.

SLIDE 46

Exploitation of instruction level parallelism Thread level parallelism

TLP

ILP exploits implicit parallelism within a basic block or a loop. TLP uses multiple threads of execution that are inherently parallel. TLP goal:

Use multiple instruction flows to improve:

Throughput in computers using many programs. Execution time of multi-threaded programs.

SLIDE 47

Exploitation of instruction level parallelism Thread level parallelism

Multi-threaded execution

Multiple threads share the processor functional units, overlapping their use.

Per-thread state must be replicated n times: register file, PC, and page table (when threads do not belong to the same program). Memory is shared through the virtual memory mechanisms. Hardware provides fast thread context switching.

Kinds:

Fine grain: thread switch on every instruction. Coarse grain: thread switch on stalls (e.g. cache miss). Simultaneous: fine grain combined with multiple issue and dynamic scheduling.

SLIDE 48

Exploitation of instruction level parallelism Thread level parallelism

Fine-grain multithreading

Switches between threads on each instruction.

Thread execution is interleaved, usually in round-robin order, skipping threads that are stalled. The processor must be able to switch threads every clock cycle.

Advantage:

Can hide short and long stalls.

Drawback:

Delays individual thread execution due to sharing.

Example: Sun Niagara.
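Round-robin interleaving with stalled threads excluded can be sketched as follows (the function and its encoding are illustrative only):

```python
# Sketch of fine-grain multithreading: each cycle, issue from the next
# ready thread in round-robin order, skipping threads that are stalled.
def fine_grain_schedule(num_threads, stalled, cycles):
    """stalled: set of stalled thread ids. Returns per-cycle issue order."""
    order = []
    thread = 0
    for _ in range(cycles):
        for _ in range(num_threads):        # look for the next ready thread
            if thread not in stalled:
                order.append(thread)
                thread = (thread + 1) % num_threads
                break
            thread = (thread + 1) % num_threads
        else:
            order.append(None)              # every thread stalled: bubble
    return order

print(fine_grain_schedule(4, {2}, 6))  # [0, 1, 3, 0, 1, 3]
```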

SLIDE 49

Exploitation of instruction level parallelism Thread level parallelism

Coarse grain multithreading

Switch thread only on long stalls.

Example: L2 cache miss.

Advantages:

No need for a very fast thread switch. Does not delay individual threads.

Drawbacks:

Must flush or freeze the pipeline. Needs to fill pipeline with instructions from the new thread (latency).

Appropriate when filling the pipeline takes much less time than the stall itself.

Example: IBM AS/400.

SLIDE 50

Exploitation of instruction level parallelism Thread level parallelism

SMT: Simultaneous multithreading

Idea: dynamically scheduled processors already have many mechanisms to support multithreading:

A large set of virtual registers: can hold the register sets of multiple threads. Register renaming: avoids conflicts between threads accessing the registers. Out-of-order completion.

Modifications needed: a per-thread renaming table, separate PC registers, and a separate ROB per thread.

Examples: Intel Core i7, IBM Power 7.

SLIDE 51

Exploitation of instruction level parallelism Thread level parallelism

TLP: Summary

[Figure: issue-slot usage for five threads under superscalar, fine-grain, coarse-grain, SMT, and multiprocessor execution; empty slots mark stalls]

SLIDE 52

Exploitation of instruction level parallelism Conclusion


SLIDE 53

Exploitation of instruction level parallelism Conclusion

Summary

Loop unrolling hides stall latencies, but offers limited improvement. Dynamic scheduling manages stalls unknown at compile time. Speculative techniques build on branch prediction and dynamic scheduling. Multiple issue is limited in practice to 3 to 6 instructions per cycle. SMT is an approach to TLP within one core.

SLIDE 54

Exploitation of instruction level parallelism Conclusion

References

Computer Architecture. A Quantitative Approach 5th Ed. Hennessy and Patterson. Sections 3.1, 3.2, 3.3, 3.4, 3.6, 3.7, 3.10, 3.12. Recommended exercises:

3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.11, 3.14, 3.17.

SLIDE 55

Exploitation of instruction level parallelism Conclusion

Exploitation of instruction level parallelism
