HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing
Alvin M. Despain, Jean-Luc Gaudiot, Manil Makhija and Wonwoo Ro
University of Southern California
http://www-pdpc.usc.edu
19 May 2000
USC, University of Southern California: Alvin M. Despain, Jean-Luc Gaudiot, Manil Makhija and Wonwoo Ro
HiDISC: Hierarchical Decoupled Instruction Set Computer
New Ideas
memory hierarchy
using instructions generated by the compiler
access predictability to data access locality
extensive scheduling hardware
computation throughput
Impact
sets over an in-order superscalar processor
issue superscalar processor
an in-order issue superscalar processor
high memory bandwidths (e.g. PIMs, RAMBUS)
irregular applications
Schedule
[Figure: HiDISC processor and decoupling compiler, with application, registers, cache, memory and dynamic database levels.]
Timeline: April 98 (Start) to May 01 (End), with a milestone at April 99
HiDISC: Hierarchical Decoupled Instruction Set Computer
[Figure: HiDISC processor and decoupling compiler serving an application; labels include sensor inputs, processors, registers, cache, memory and dynamic database.]
Technological Trend: Memory latency is getting longer relative to microprocessor speed (40% per year)
Problem: Some SPEC benchmarks spend more than half of their time stalling [Lebeck and Wood 1994]
Domain: Benchmarks with large data sets: symbolic, signal processing and scientific programs
Present Solutions: Multithreading (Homogeneous), Larger Caches, Prefetching, Software Multithreading
Present Solutions
Solutions and their limitations:
Larger Caches
– Slow
– Works well only if the working set fits in the cache and there is temporal locality
Hardware Prefetching
– Cannot be tailored for each application
– Behavior based on past and present execution-time behavior
Software Prefetching
– Overheads of prefetching must not outweigh the benefits, which leads to conservative prefetching
– Adaptive software prefetching is required to change the prefetch distance during run time
– Hard to insert prefetches for irregular access patterns
Multithreading
– Solves the throughput problem, not the memory latency problem
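To make the software prefetching limitations concrete, the following is a minimal C sketch of a loop with prefetches inserted at a fixed distance. It is an illustration only, not code from the HiDISC project: the function name `gather_sum`, the `PREFETCH` macro and the `PREFETCH_DISTANCE` value are hypothetical, and `__builtin_prefetch` is a GCC/Clang builtin (the macro falls back to a no-op elsewhere).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tuning knob: how many elements ahead to prefetch.
   It is fixed at compile time, so it cannot adapt to run-time latency. */
#define PREFETCH_DISTANCE 16

#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(addr) __builtin_prefetch((addr), 0, 1)
#else
#define PREFETCH(addr) ((void)0)
#endif

/* Sum x[index[i]]: an irregular (indirect) access pattern. */
double gather_sum(const double *x, const size_t *index, size_t n)
{
    double sum = 0.0;

    assert(x != NULL && index != NULL);
    for (size_t i = 0; i < n; ++i) {
        /* Overhead paid by the compute stream itself: an extra bounds
           check, an extra index load and the prefetch instruction. */
        if (i + PREFETCH_DISTANCE < n)
            PREFETCH(&x[index[i + PREFETCH_DISTANCE]]);
        sum += x[index[i]];
    }
    return sum;
}
```

The prefetch work sits in the same instruction stream as the computation, and the distance is a compile-time constant, which is the kind of overhead and inflexibility the limitations above describe.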
Observations:
– Software prefetching impacts compute performance
– PIMs and RAMBUS offer a high-bandwidth memory system, useful for speculative prefetching
The HiDISC Approach
Approach:
– Add a processor to manage prefetching and hide the overhead
– Compiler explicitly manages the memory hierarchy
– Prefetch distance adapts to the program runtime behavior
What's HiDISC
level of the memory hierarchy
the memory hierarchy using instructions generated by the compiler
converting data access predictability to data access locality (Just in Time Fetch)
parallelism without extensive scheduling hardware
maximal computation throughput
[Figure: the compiler splits a program into computation instructions, access instructions and cache management instructions for the Computation Processor (CP), Access Processor (AP) and Cache Management Processor (CMP); memory levels shown are registers, cache, and 2nd-level cache and main memory.]
Decoupled Architectures
[Figure: block diagrams of four configurations, built from processors (AP, CP, CMP), the Load Queue, Store Address Queue, Store Data Queue and Slip Control Queue, registers, cache, and second-level cache and main memory.]
MIPS (conventional): 8-way
DEAP (decoupled): 3-way + 5-way
CAPP (decoupled): 5-way + 3-way
HiDISC (decoupled, new): 2-way + 3-way + 3-way
Slip Control Queue
– Late prefetches = prefetched data arrived after load had been issued
– Useful prefetches = prefetched data arrived before load had been issued
    if (prefetch_buffer_full ())
        Don’t change size of SCQ;
    else if ((2 * late_prefetches) > useful_prefetches)
        Increase size of SCQ;
    else
        Decrease size of SCQ;
Decoupling Programs for HiDISC-3 (Discrete Convolution - Inner Loop)
Inner Loop Convolution:
    for (j = 0; j < i; ++j)
        y[i] = y[i] + (x[j] * h[i-j-1]);
Computation Processor Code:
    while (not end of loop)
        y = y + (x * h);
    send y to SDQ
Access Processor Code:
    for (j = 0; j < i; ++j) {
        load (x[j]);
        load (h[i-j-1]);
        GET_SCQ;
    }
    send (EOD token)
    send address of y[i] to SAQ
Cache Management Code:
    for (j = 0; j < i; ++j) {
        prefetch (x[j]);
        prefetch (h[i-j-1]);
        PUT_SCQ;
    }
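The split between the access and computation streams can be imitated in plain C. The sketch below is a single-threaded simulation under stated assumptions, not the HiDISC implementation: the `Queue` type, the function names and the choice to run the access stream to completion before the computation stream are all hypothetical, and the Slip Control Queue and cache management stream are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAPACITY 128  /* illustrative load queue depth */

typedef struct { double value; bool eod; } Entry;
typedef struct { Entry slots[QUEUE_CAPACITY]; size_t head, tail, count; } Queue;

static void queue_put(Queue *q, double value, bool eod)
{
    assert(q->count < QUEUE_CAPACITY);
    q->slots[q->tail].value = value;
    q->slots[q->tail].eod = eod;
    q->tail = (q->tail + 1) % QUEUE_CAPACITY;
    q->count++;
}

static Entry queue_get(Queue *q)
{
    assert(q->count > 0);
    Entry e = q->slots[q->head];
    q->head = (q->head + 1) % QUEUE_CAPACITY;
    q->count--;
    return e;
}

/* Access stream: does all address arithmetic, then sends an EOD token. */
static void access_stream(Queue *lq, const double *x, const double *h, size_t i)
{
    for (size_t j = 0; j < i; ++j) {
        queue_put(lq, x[j], false);
        queue_put(lq, h[i - j - 1], false);
    }
    queue_put(lq, 0.0, true);
}

/* Computation stream: consumes operand pairs and never forms an address. */
static double compute_stream(Queue *lq, double y)
{
    for (;;) {
        Entry a = queue_get(lq);
        if (a.eod)
            return y;
        Entry b = queue_get(lq);
        y += a.value * b.value;
    }
}

/* Compute y[i] for the inner convolution loop using the two streams. */
double convolve_point(const double *x, const double *h, size_t i, double y_init)
{
    Queue lq = {0};

    assert(2 * i + 1 <= QUEUE_CAPACITY);  /* everything is queued up front here */
    access_stream(&lq, x, h, i);
    return compute_stream(&lq, y_init);
}
```

In a decoupled machine the two streams would run concurrently with the queue providing the slack between them; running them back to back here only keeps the example deterministic.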
Benchmarks
Benchmark | Source of Benchmark | Lines of Source Code | Description | Data Set Size
LLL1 | Livermore Loops [45] | 20 | 1024-element arrays, 100 iterations | 24 KB
LLL2 | Livermore Loops | 24 | 1024-element arrays, 100 iterations | 16 KB
LLL3 | Livermore Loops | 18 | 1024-element arrays, 100 iterations | 16 KB
LLL4 | Livermore Loops | 25 | 1024-element arrays, 100 iterations | 16 KB
LLL5 | Livermore Loops | 17 | 1024-element arrays, 100 iterations | 24 KB
Tomcatv | SPECfp95 [68] | 190 | 33x33-element matrices, 5 iterations | <64 KB
MXM | NAS kernels [5] | 113 | Unrolled matrix multiply, 2 iterations | 448 KB
CHOLSKY | NAS kernels | 156 | Cholesky matrix decomposition | 724 KB
VPENTA | NAS kernels | 199 | Invert three pentadiagonals simultaneously | 128 KB
Qsort | Quicksort sorting algorithm [14] | 58 | Quicksort | 128 KB
Simulation Parameters
Parameter | Value | Parameter | Value
FLC Size | 4 KB | SLC Size | 16 KB
FLC Associativity | 2 | SLC Associativity | 2
FLC Block Size | 32 B | SLC Block Size | 32 B
Memory Latency | varied | Memory Contention Time | varied
Victim Cache Size | 32 entries | Prefetch Buffer Size | 8 entries
Load Queue Size | 128 | Store Address Queue Size | 128
Store Data Queue Size | 128 | Total issue width | 8
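As a worked example of what these cache parameters imply, the C sketch below derives the number of blocks and sets for the first-level (FLC) and second-level (SLC) caches. It assumes 1 KB = 1024 bytes; the `CacheConfig` type and function names are hypothetical helpers, not part of the simulator.

```c
#include <assert.h>

typedef struct {
    unsigned size_bytes;
    unsigned associativity;
    unsigned block_bytes;
} CacheConfig;

/* Total number of blocks (lines) the cache can hold. */
unsigned cache_num_blocks(CacheConfig c)
{
    assert(c.block_bytes > 0);
    return c.size_bytes / c.block_bytes;
}

/* Number of sets = size / (associativity * block size). */
unsigned cache_num_sets(CacheConfig c)
{
    assert(c.associativity > 0 && c.block_bytes > 0);
    assert(c.size_bytes % (c.associativity * c.block_bytes) == 0);
    return c.size_bytes / (c.associativity * c.block_bytes);
}
```

Under that assumption the 4 KB, 2-way FLC with 32 B blocks holds 128 blocks in 64 sets, and the 16 KB, 2-way SLC holds 512 blocks in 256 sets.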
Simulation Results
[Figure: four plots, one each for LLL3, Tomcatv, Cholsky and Vpenta, comparing MIPS, DEAP, CAPP and HiDISC as Main Memory Latency varies from 40 to 200.]
Our Results: Impact
superscalar processor - (similar operations are used in ATR/SLD)
an in-order issue superscalar processor
bandwidths (e.g. PIMs, RAMBUS)
applications
Schedule
April 98 (Start), April 99, April 00, May 01 (End)
simulations on hand-compiled benchmarks
(ATR/SLD)
architecture
statistics and evaluate design
full decoupling compiler
Summary
bandwidths (PIMs, RAMBUS)
for ATR/SLD)
BEYOND HiDISC
– Distributed Processing
– Multiprocessing
– Applications
The McDISC Invention
when faced with a tightly-coupled parallel program.
communication is needed between nodes. This long latency kills performance when executing tightly-coupled programs. (Note that multi-threading a la the Tera machine does not help when there are dependencies.)
(NIP) with a programmable processor to execute not only OS code (e.g. Stanford Flash), but user code, generated by the compiler.
is needed by the node processors, eliminating the network fetch latency most of the time.
programs.
The McDISC System: Memory-Centered Distributed Instruction Set Computer
[Figure: McDISC system diagram. Recoverable labels: Cache, Main Memory, Processor (AP), Processor (CP), Processor (CMP), Processor (DP), PIM (ASP), PIM (AGP), Processor (NIP), Sensor Inputs, SAR, Video, Database, RAID, 3-D Torus, Network.]
Matrix Multiply on McDISC
p = # of processors
min_i = pid * (n / p);          /* first row owned by this processor */
max_i = min_i + (n / p) - 1;    /* last row owned by this processor */
for (i = min_i; i <= max_i; ++i) {
    for (j = 0; j < m; ++j) {
        c[i][j] = 0;
        for (k = 0; k < l; ++k) {
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
        }
    }
}
[Figure: Parallel Matrix Multiply, showing matrices A and B with dimension n.]
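The bounds `min_i` and `max_i` express a block partition of the rows across processors. The C sketch below shows one way to compute such a partition; handling a row count that is not a multiple of the processor count is an extension beyond the slide, and the `RowRange` type and `rows_for` function are hypothetical names.

```c
#include <assert.h>

/* Inclusive row range; the range is empty when max_i < min_i. */
typedef struct { int min_i; int max_i; } RowRange;

/* Rows of an n-row matrix owned by processor pid out of p processors.
   The first (n % p) processors each take one extra row. */
RowRange rows_for(int pid, int p, int n)
{
    assert(p > 0 && pid >= 0 && pid < p && n >= 0);

    int base = n / p;
    int extra = n % p;
    RowRange r;

    r.min_i = pid * base + (pid < extra ? pid : extra);
    r.max_i = r.min_i + base + (pid < extra ? 1 : 0) - 1;
    return r;
}
```

When `n` is a multiple of `p` this reduces to `min_i = pid * (n / p)` and `max_i = min_i + (n / p) - 1`; for example, with 8 rows on 4 processors, processor 3 owns rows 6 to 7.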
Matrix Multiply on McDISC
CMP code:
    for (i = min_i; i <= max_i; ++i) {
        for (j = 0; j < m; ++j) {
            for (k = 0; k < l; ++k) {
                prefetch (a[i][k]);
                prefetch (b[k][j]);
            }
        }
    }
CP code:
    while (not end-of-data) {
        c = 0;
        while (not end-of-data) {
            /* a and b from queue */
            c = c + a * b;
        }
        send c to store queue;
    }
AP code:
    for (i = min_i; i <= max_i; ++i) {
        for (j = 0; j < m; ++j) {
            for (k = 0; k < l; ++k) {
                load a[i][k] to load queue;
                load b[k][j] to load queue;
            }
            send end-of-data to CP;
            put &c[i][j] in store queue;
            send signal to NIP;
        }
        send end-of-data;
    }
    send end-of-data
NIP code:
    for (i = min_i; i <= max_i; ++i) {
        for (j = 0; j < m; ++j) {
            wait for signal from AP;
            send c[i][j] to processor 0;
        }
    }
Sparse Matrix Multiply on McDISC
bp = blist;
while ((ap != NULL) && (ap->row == i) &&
       (bp != NULL) && (bp->row == i)) {
    if (ap->col == bp->col) {
        sum = sum + (ap->data * bp->data);
        ap = ap->next;
        bp = bp->next;
    }
    else if (ap->col < bp->col)
        ap = ap->next;
    else
        bp = bp->next;
}
[Figure: Parallel Sparse Matrix Multiply, with a row highlighted.]
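The merge-style loop above can be made self-contained as follows. This is a sketch under assumptions the slide does not state: each list is sorted by column within a row, the entries for one row are contiguous, and the `Node` layout and the function name `sparse_row_dot` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One nonzero entry of a sparse matrix stored as a linked list. */
typedef struct Node {
    int row;
    int col;
    double data;
    struct Node *next;
} Node;

/* Dot product of the row-i entries of two column-sorted lists.
   ap and bp must point at the first row-i entry of each list (or NULL). */
double sparse_row_dot(const Node *ap, const Node *bp, int i)
{
    double sum = 0.0;

    assert(i >= 0);
    while (ap != NULL && ap->row == i && bp != NULL && bp->row == i) {
        if (ap->col == bp->col) {          /* matching columns: multiply */
            sum += ap->data * bp->data;
            ap = ap->next;
            bp = bp->next;
        } else if (ap->col < bp->col) {    /* advance whichever list is behind */
            ap = ap->next;
        } else {
            bp = bp->next;
        }
    }
    return sum;
}
```

Every step of the loop follows a pointer whose target depends on the previous load, which is the kind of irregular access pattern that makes compiler-inserted prefetches hard to place.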
Sparse Matrix Multiply on McDISC
CP code:
    sum = 0;
    while (not EOD)
        sum += LQ * LQ
    send sum to SDQ
AP code:
    bp = blist;
    while ((ap != NULL) && (ap->row == i) &&
           (bp != NULL) && (bp->row == i)) {
        if (ap->col == bp->col) {
            Put ap->data and bp->data in LQ
            ap = ap->next;
            bp = bp->next;
        }
        else if (ap->col < bp->col)
            ap = ap->next;
        else
            bp = bp->next;
    }
    Send EOD token to CP;
    Send &c[i][j] to SAQ;
    Send signal and address to NIP;
NIP code:
    wait for signal from AP;
    send data to home node;
CMP code:
    bp = blist;
    while ((ap != NULL) && (ap->row == i) &&
           (bp != NULL) && (bp->row == i)) {
        if (ap->col == bp->col) {
            prefetch (ap->data);
            prefetch (bp->data);
            ap = ap->next;
            bp = bp->next;
        }
        else if (ap->col < bp->col)
            ap = ap->next;
        else
            bp = bp->next;
    }