Kilo Instruction Processors, Adrián Cristal, 2/7/2019, YALE 80


slide-1
SLIDE 1

Kilo Instruction Processors

Adrián Cristal

2/7/2019 YALE 80

slide-2
SLIDE 2

Processor-DRAM Gap (latency)

[Figure: Performance (log scale, 1 to 1000) versus Time (1980 to 2000). CPU (µProc, “Moore’s Law”): 60%/yr. DRAM: 7%/yr. Processor-Memory Performance Gap: grows 50% / year.]

D.A. Patterson, “New Directions in Computer Architecture”, Berkeley, June 1998
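
The 50%/year figure follows from the two growth rates on this slide: 1.60 / 1.07 is roughly 1.495. The sketch below only illustrates that arithmetic. The function names are hypothetical, and it assumes both rates stay constant over the whole period, which the chart only approximates.

```c
#include <assert.h>

/* Relative performance after `years` years at a fixed annual growth rate
 * (e.g. 0.60 for 60%/yr), normalized to 1.0 at year 0.
 * Assumption: the rate is constant over the whole period. */
double relative_performance(double annual_growth, int years)
{
    assert(years >= 0);
    double perf = 1.0;
    for (int y = 0; y < years; y++) {
        perf *= 1.0 + annual_growth;
    }
    return perf;
}

/* Ratio of processor performance to DRAM performance after `years` years. */
double processor_dram_gap(double cpu_growth, double dram_growth, int years)
{
    return relative_performance(cpu_growth, years) /
           relative_performance(dram_growth, years);
}
```

Calling `processor_dram_gap(0.60, 0.07, 1)` gives a one-year ratio just under 1.5, i.e. the gap itself grows by about 50% per year.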

slide-3
SLIDE 3

Integer, 8-way, L2 1MB

[Figure: IPC (0.0 to 5.0) versus ROB Size (128, 512, 1024, 4096) and Branch Predictor (perceptron, perfect), for Memory Latency 100, 500 and 1000. Reference lines: Perfect Mem. & Perfect BP; Perfect Mem. & Perceptron BP. Annotations: 0.6X, 1.22X, 1.41X.]

  • Research Proposal to Intel (July 2001) and presentation to Intel-MRL, March 2002
  • Cristal et al., “Large Virtual ROBs by Processor Checkpointing”, TR UPC-DAC, July 2002
  • M. Valero, NSF Workshop on Computer Architecture, ISCA Conference, San Diego, June 2003
slide-4
SLIDE 4

Floating-point, 8-way, L2 1MB

[Figure: IPC (0.0 to 6.0) versus ROB Size (128, 512, 1024, 4096, 8192) and Branch Predictor (perceptron, perfect), for Memory Latency 100, 500 and 1000. Reference lines: Perfect Mem. & Perfect BP; Perfect Mem. & Perceptron BP. Annotations: 0.45X, 2.34X, 3.91X.]

  • Research Proposal to Intel (July 2001) and presentation to Intel-MRL, March 2002
  • Cristal et al., “Large Virtual ROBs by Processor Checkpointing”, TR UPC-DAC, July 2002
  • M. Valero, NSF Workshop on Computer Architecture, ISCA Conference, San Diego, June 2003
slide-5
SLIDE 5

Execution Locality

    void smvp(int nodes, double ***A, int *Acol, int *Aindex,
              double **v, double **w)
    {
        [...]
        sum0 = A[Anext][0][0]*v[i][0]
             + A[Anext][0][1]*v[i][1]
             + A[Anext][0][2]*v[i][2];
        [...]
    }

[Figure annotation: 22.2%]

[Figure labels: Cache-Dependent Code; Miss-Dependent Instruction Clusters]

slide-6
SLIDE 6

Mapping Clusters to Processors

  • An execution cluster is a partition of the dynamic DDG belonging to the same locality group.

High Locality Clusters:

  • large number of instructions (70% in SpecFP, even more in SpecINT)
  • need to tolerate L2 cache hit latencies
  • advance as fast as possible (prefetching effect!)
  • thus the Cache Processor can be small, but must be Out-of-Order

Low Locality Clusters:

  • small number of instructions (<30%)
  • generally not in critical path (Karkhanis, WMPI'02)
  • thus the Memory Processor can be even smaller, and probably In-Order
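
One way to picture the split into high- and low-locality clusters is a register "poison bit" pass over a dynamic instruction trace. The sketch below is an illustration only, not the mechanism from the cited papers. The `Inst` record, the 32-register limit and the function name are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_REGS 32   /* hypothetical number of logical registers */
#define NO_REG   (-1) /* marks an unused operand slot */

/* Hypothetical, simplified dynamic instruction record. */
typedef struct {
    int  dest;      /* destination register, or NO_REG */
    int  src[2];    /* source registers, or NO_REG */
    bool l2_miss;   /* true for a load that misses in the L2 cache */
} Inst;

/* Marks instruction i as low locality (low_locality[i] = true) when it is an
 * L2-missing load or reads a register last written by a low-locality
 * instruction. Returns the number of low-locality instructions. */
size_t classify_locality(const Inst *trace, size_t n, bool *low_locality)
{
    bool poisoned[NUM_REGS] = { false };
    size_t low = 0;

    for (size_t i = 0; i < n; i++) {
        bool dep = trace[i].l2_miss;
        for (int s = 0; s < 2; s++) {
            int r = trace[i].src[s];
            if (r != NO_REG) {
                assert(r >= 0 && r < NUM_REGS);
                dep = dep || poisoned[r];
            }
        }
        if (trace[i].dest != NO_REG) {
            assert(trace[i].dest >= 0 && trace[i].dest < NUM_REGS);
            /* Overwriting a register with a cache-dependent value clears it. */
            poisoned[trace[i].dest] = dep;
        }
        low_locality[i] = dep;
        if (dep) {
            low++;
        }
    }
    return low;
}
```

In this model, the instructions left unmarked would run on the Cache Processor and the marked ones would be handed to the Memory Processor.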

slide-8
SLIDE 8

A different view: D-KIP

DECOUPLED KILO-INSTRUCTION PROCESSOR (HPCA'06, PACT'07)

[Figure: Distribution of Instructions based on Decode->Issue Latency, measured in groups of 30 cycles. SPEC FP 2000, KILO-Instruction processor model, 2MB L2 Cache, 400 cycles main memory access latency. Bar labels: 58%, 15%, 6%, 20%. About 70% High Locality, about 30% Low Locality.]

[Diagram: FETCH, Cache Processor, Memory Processor; miss-dependent insts go to the Memory Processor.]

Cache Processor

  • Processes only Cache Hit/Register dependent Insts
  • Latency Critical
  • Buffer few instructions (<100)
  • Speculative / Out-of-Order
  • Executes most of control code
  • LD/ST intensive
  • Memory Lookahead (Prefetching)

Memory Processor

  • Miss-dependent Instructions
  • Latency Tolerant
  • Buffer thousands of instructions
  • Relaxed Scheduling
  • Little Control Code -> Few Recoveries
  • Few address calculations
  • No caches, No fetch/decode logic

Miquel Pericas et al, “A decoupled KILO-instruction processor”, HPCA06

slide-9
SLIDE 9

Flexible Heterogeneous MultiCore (I)

[Diagram: Instruction Window, from oldest to youngest, divided into Cache-Dependent Code (Cache Processor, ROB) and Miss-Dependent Code (Memory Processor, Memory Engines).]

Cache Processor: small out-of-order core designed assuming perfect L2. It processes all cache-dependent code.

Memory Processor (Memory Engines): small dual-issue in-order processors executing code depending on L2 cache misses. Each memory engine processes a portion of the instruction window.

  • Inclusion of all loads/stores provides sequential memory semantics.
  • Activation and de-activation of memory engines changes the window size and allows control of power consumption.

Miquel Pericas et al, “A Flexible Heterogeneous Multi-Core Architecture”, PACT07

slide-10
SLIDE 10

Flexible Heterogeneous MultiCore (II)

Extension to MultiThreading

Pool of Memory Engines can be shared for higher throughput/fairness

[Diagram: several Cache Processors, each with its own ROB, sharing a Dynamically Assigned Pool of MEs.]

slide-11
SLIDE 11

Kilo, Runahead and Prefetching

Prefetching

  • Anticipates memory requests
  • Reduces the impact of misses in the memory hierarchy

Runahead Mechanism

  • Executes speculative instructions under an LLC miss
  • Prevents the processor from stalling when the ROB is full
  • Generates useful data prefetches

Kilo-instruction Processors

  • Exploit more ILP by maintaining thousands of in-flight instructions while long-latency loads are outstanding in memory (implicit prefetching)

Tanausú Ramírez et al., “Kilo-instruction Processors, Runahead and Prefetching”, CF’06, May, 2006.

slide-12
SLIDE 12

Performance versus RunAhead and Stride Prefetching

  • OoO and RunAhead are 4-way with 64/256-entry ROBs
  • Cache Processors are 4-way with 64-entry ROB
  • Memory Processor/Memory Engines are two-way in-order processors
  • A Memory Engine can hold up to 128 long-latency instructions and 128 loads/stores
  • RunAhead features ideal runahead cache
  • Stream prefetcher can hold up to 64KB of prefetched data
slide-13
SLIDE 13

“Kilo-processor” and multiprocessor systems

[Figure: IPC (0.5 to 3.5) versus ROB Size (64, 128, 512, 1024, 2048) for the FFT, RADIX and LU benchmarks. Legend labels: IDEAL NET, BADA, IDEAL NET & MEM.]

  • M. Galluzzi et al., “A First Glance at Kilo-instruction Based Multiprocessors”, Invited Paper, ACM Computing Frontiers Conference, Ischia, Italy, April 10-12, 2004
slide-14
SLIDE 14

What we wanted to do

  • Can we extend a Big-Little multicore to implement the FMC?
  • Are the Memory Engines (MEs) used all the time, or are they waiting for long-latency loads?
  • Can we do something to avoid discarding all the MEs in case of branch mispredictions?
  • What does a practical kilo-vector processor look like?
slide-15
SLIDE 15

Some ideas “stolen” from “Edge Processors” and “Decoupled Architectures”

[Diagram: Cache Processors, Pool of Waiting Queues, Pool of MEs]

… Split the functionality of the MEs, the instruction queues and the functional units

slide-16
SLIDE 16

Waiting Queue

  • Instructions + Logical Registers
  • Wait until all used logical registers are ready
  • Assign a Memory Engine

[Diagram: a Waiting Queue with an Input Register File and its instructions (Inst, Inst, Inst, …, Inst); queues in the Waiting state; Memory Engine.]
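
The three bullets on this slide (hold instructions plus logical registers, wait until every used register is ready, then get a Memory Engine) can be sketched as a small state check. This is a minimal illustration under stated assumptions, not the design from the talk: the bitmask encoding, the four-engine pool and all names below are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_ENGINES 4    /* hypothetical size of the Memory Engine pool */
#define NO_ENGINE   (-1) /* queue has no Memory Engine assigned yet */

/* Hypothetical waiting queue state; the instructions themselves are omitted. */
typedef struct {
    uint32_t needed;  /* bit r set: the queue reads logical register r */
    uint32_t ready;   /* bit r set: the value of register r has arrived */
    int      engine;  /* assigned Memory Engine, or NO_ENGINE */
} WaitingQueue;

/* Records that logical register `reg` now holds its value. */
void wq_mark_ready(WaitingQueue *q, int reg)
{
    assert(reg >= 0 && reg < 32);
    q->ready |= (uint32_t)1 << reg;
}

/* True when every logical register the queue uses is ready. */
bool wq_can_issue(const WaitingQueue *q)
{
    return (q->needed & ~q->ready) == 0;
}

/* Assigns the first free engine once the queue is ready.
 * Returns the engine index, or NO_ENGINE if the queue must keep waiting. */
int wq_try_assign(WaitingQueue *q, bool engine_busy[NUM_ENGINES])
{
    if (q->engine != NO_ENGINE) {
        return q->engine;            /* already running on an engine */
    }
    if (!wq_can_issue(q)) {
        return NO_ENGINE;            /* still waiting for input registers */
    }
    for (int e = 0; e < NUM_ENGINES; e++) {
        if (!engine_busy[e]) {
            engine_busy[e] = true;
            q->engine = e;
            return e;
        }
    }
    return NO_ENGINE;                /* ready, but no engine is free */
}
```

A queue that needs R1 and R2 stays unassigned until `wq_mark_ready` has been called for both, after which `wq_try_assign` picks the first idle engine.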

slide-17
SLIDE 17

Where to start a waiting queue

  • Loop:
  • If: Br …
  • Else: …
  • Fi:
  • Endloop

[Diagram: a sequence of waiting queues labelled Loop:, followed by a queue labelled If.]

slide-18
SLIDE 18

Where to start a waiting queue

  • Loop:
  • If: Br …
  • Else: …
  • Fi:
  • Endloop

[Diagram: a sequence of waiting queues labelled Loop:, followed by queues labelled If and Else; the Else queue writes R1 (R1<-).]

slide-19
SLIDE 19

Where to start a waiting queue

  • Loop:
  • If: Br …
  • Else: …
  • Fi:
  • Endloop

[Diagram: a sequence of waiting queues labelled Loop:, with an Else queue that writes R1 (R1<-). Annotations: New R1, R2; New R1.]

slide-20
SLIDE 20

Problems and more problems

  • What to do if the addresses of Loads and Stores are modified?

  • Fetching instructions, and partial reexecution
  • Pointer Chasing
  • Start a new waiting queue or suspend the execution of a waiting queue?
slide-21
SLIDE 21

“Kilo-vector” processor

[Figure: three bars for a program. First bar: 20 + 80. Second bar: 20 + 8 (Speedup: 3.5). Kilo Vector bar: 5 + 8 (Speedup: 7.7).]

  • F. Quintana et al, “Kilo-vector” processors, UPC-DAC
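
The two speedups on this slide are consistent with reading the numbers as execution-time segments: (20 + 80) / (20 + 8) is about 3.57, and (20 + 80) / (5 + 8) is about 7.69. That reading is an assumption, since the figure itself did not survive extraction. The helper below is hypothetical and only reproduces the arithmetic.

```c
#include <assert.h>

/* Speedup of a two-part program, given the time of each part before and
 * after an optimization. Assumption: the two parts run back to back, so
 * total time is their sum; units are arbitrary but must match. */
double two_part_speedup(double base_a, double base_b,
                        double new_a, double new_b)
{
    double base_total = base_a + base_b;
    double new_total = new_a + new_b;
    assert(base_total > 0.0 && new_total > 0.0);
    return base_total / new_total;
}
```

Under this reading, `two_part_speedup(20, 80, 20, 8)` corresponds to the slide's 3.5 and `two_part_speedup(20, 80, 5, 8)` to its 7.7.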
slide-22
SLIDE 22

Adrian.cristal@bsc.es