Kilo Instruction Processors
Adrián Cristal
2/7/2019 YALE 80
[Figure: Processor-DRAM latency gap, 1980-2000. Processor performance improves at 60%/yr ("Moore's Law"), DRAM at only 7%/yr; the processor-memory performance gap grows about 50% per year.]

D.A. Patterson, "New Directions in Computer Architecture", Berkeley, June 1998
[Figure: IPC vs. ROB size (128-4096), perceptron vs. perfect branch predictor, for memory latencies of 100, 500 and 1000 cycles. Curves: Perfect Mem. & Perfect BP, Perfect Mem. & Perceptron BP. Annotated speedups: 0.6X, 1.22X, 1.41X.]
Research proposal to Intel (July 2001), presented to Intel-MRL in March 2002. Cristal et al., "Large Virtual ROBs by Processor Checkpointing", TR UPC-DAC, July 2002.
[Figure: IPC vs. ROB size (128-8192), perceptron vs. perfect branch predictor, for memory latencies of 100, 500 and 1000 cycles. Curves: Perfect Mem. & Perfect BP, Perfect Mem. & Perceptron BP. Annotated speedups: 0.45X, 2.34X, 3.91X.]
void smvp(int nodes, double ***A, int *Acol, int *Aindex,
          double **v, double **w) {
    [...]
    sum0 = A[Anext][0][0]*v[i][0]
         + A[Anext][0][1]*v[i][1]
         + A[Anext][0][2]*v[i][2];   /* 22.2% */
    [...]
}
Cache-Dependent Code vs. Miss-Dependent Instruction Clusters: instructions belonging to the same locality group.

High Locality Clusters:
- a large fraction of the instructions (~70% in SpecFP, even more in SpecINT)
- need to tolerate L2 cache hit latencies
- should advance as fast as possible (prefetching effect!)
- thus the Cache Processor can be small, but must be Out-of-Order

Low Locality Clusters:
- a small fraction of the instructions (<30%)
- generally not on the critical path (Karkhanis, WMPI'02)
- thus the Memory Processor can be even smaller, and probably In-Order
Cache Processor:
- processes only cache-hit / register-dependent instructions
- Latency Critical Buffer: few instructions (<100)
- speculative / Out-of-Order
- executes most of the control code
- LD/ST intensive
- memory lookahead (prefetching)
A different view (D-KIP): about 70% of instructions show high locality, about 30% low locality.
[Chart: instruction breakdown annotated 58% / 15% / 6% / 20%.]
[Diagram: FETCH feeds the Cache Processor; miss-dependent instructions are passed on to the Memory Processor.]
DECOUPLED KILO-INSTRUCTION PROCESSOR (HPCA'06, PACT'07)
Setup: KILO-Instruction processor model, 2MB L2 cache, 400-cycle main memory access latency, SPEC FP 2000. Distribution of instructions based on Decode->Issue latency, measured in groups of 30 cycles.
Memory Processor:
- handles miss-dependent instructions
- Latency Tolerant Buffer: thousands of instructions
- relaxed scheduling
- little control code -> few recoveries
- few address calculations
- no caches, no fetch/decode logic
Miquel Pericas et al., "A Decoupled KILO-Instruction Processor", HPCA'06
[Diagram: Cache Processor plus Memory Engines, jointly forming the Instruction Window; cache-dependent code (youngest) runs on the Cache Processor, miss-dependent code on the Memory Engines.]

- Cache Processor: small out-of-order core designed assuming perfect memory, executing the cache-dependent code.
- Memory Engines: small dual-issue in-order processors executing code depending on missed loads; together with the ROB they form the instruction window.
- Inclusion of all loads/stores provides sequential memory semantics.
- Activation and de-activation of Memory Engines changes the window size and allows power consumption to be controlled.
Miquel Pericas et al., "A Flexible Heterogeneous Multi-Core Architecture", PACT'07
Extension to MultiThreading: the pool of Memory Engines can be shared for higher throughput/fairness.
[Diagram: several Cache Processors, each with its own ROB, sharing a dynamically assigned pool of MEs.]
Execution continues while long-latency loads are outstanding in memory (implicit prefetching).
Tanausú Ramírez et al., “Kilo-instruction Processors, Runahead and Prefetching”, CF’06, May, 2006.
[Figure: IPC vs. ROB size (64-2048) for FFT, RADIX and LU; series labeled IDEAL, NET, BADA and IDEAL NET & MEM.]
[Diagram: split the functionality of the MEs, the instruction queues and the functional units. Cache Processors share a pool of MEs and a pool of Waiting Queues; each Memory Engine has an input register file and instruction waiting queues.]
[Animation: a loop body (Loop: / If / Else, R1 <- ...) repeated across iterations, with new physical registers (R1, R2) allocated per iteration.]

Speedup example: a 100-unit program (20 compute + 80 memory). Kilo-instruction processing cuts the memory part from 80 to 8 units, Speedup: 3.5. A Kilo Vector design also cuts the compute part from 20 to 5 units, Speedup: 7.7.