HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing (PowerPoint PPT Presentation)

Alvin M. Despain, Jean-Luc Gaudiot, Manil Makhija and Wonwoo Ro, University of Southern California


SLIDE 1

HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing

Alvin M. Despain, Jean-Luc Gaudiot, Manil Makhija and Wonwoo Ro
University of Southern California
http://www-pdpc.usc.edu
19 May 2000

SLIDE 2

University of Southern California: Alvin M. Despain, Jean-Luc Gaudiot, Manil Makhija and Wonwoo Ro

HiDISC: Hierarchical Decoupled Instruction Set Computer

New Ideas

  • A dedicated processor for each level of the memory hierarchy
  • Explicitly manage each level of the memory hierarchy using instructions generated by the compiler
  • Hide memory latency by converting data access predictability to data access locality
  • Exploit instruction-level parallelism without extensive scheduling hardware
  • Zero-overhead prefetches for maximal computation throughput

Impact

  • 2x speedup for scientific benchmarks with large data sets over an in-order superscalar processor
  • 7.4x speedup for matrix multiply over an in-order issue superscalar processor
  • 2.6x speedup for matrix decomposition/substitution over an in-order issue superscalar processor
  • Reduced memory latency for systems that have high memory bandwidths (e.g. PIMs, RAMBUS)
  • Allows the compiler to solve indexing functions for irregular applications
  • Reduced system cost for high-throughput scientific codes

Schedule

[Figure: HiDISC system overview. Applications (FLIR, SAR, video, ATR/SLD, scientific) pass through a decoupling compiler to the HiDISC processor (registers, cache, memory, dynamic database), producing situational awareness from sensor inputs.]

April 98 Start, May 01 End (milestones at April 99 and April 00):

  • Defined benchmarks
  • Completed simulator
  • Performed instruction-level simulations on hand-compiled benchmarks
  • Continue simulations of more benchmarks (SAR)
  • Define HiDISC architecture
  • Develop and test a full decoupling compiler
  • Generate performance statistics and evaluate design
  • Update Simulator
SLIDE 3

HiDISC: Hierarchical Decoupled Instruction Set Computer

[Figure: HiDISC system overview. Applications (FLIR, SAR, video, ATR/SLD, scientific) pass through a decoupling compiler to the HiDISC processor (registers, cache, memory, dynamic database), producing situational awareness from sensor inputs.]

Technological Trend: Memory latency is getting longer relative to microprocessor speed (40% per year).
Problem: Some SPEC benchmarks spend more than half of their time stalling [Lebeck and Wood 1994].
Domain: Benchmarks with large data sets: symbolic, signal processing and scientific programs.
Present Solutions: Multithreading (homogeneous), larger caches, prefetching, software multithreading.

SLIDE 4

Present Solutions

Larger Caches:
  – Slow
  – Works well only if the working set fits the cache and there is temporal locality

Hardware Prefetching:
  – Cannot be tailored for each application
  – Behavior based on past and present execution-time behavior

Software Prefetching:
  – Ensure overheads of prefetching do not outweigh the benefits → conservative prefetching
  – Adaptive software prefetching is required to change prefetch distance during run-time
  – Hard to insert prefetches for irregular access patterns

Multithreading:
  – Solves the throughput problem, not the memory latency problem

SLIDE 5

The HiDISC Approach

Observation:

  • Software prefetching impacts compute performance
  • PIMs and RAMBUS offer a high-bandwidth memory system, useful for speculative prefetching

Approach:

  • Add a processor to manage prefetching → hide overhead
  • Compiler explicitly manages the memory hierarchy
  • Prefetch distance adapts to the program runtime behavior
SLIDE 6

What's HiDISC

  • A dedicated processor for each level of the memory hierarchy
  • Explicitly manage each level of the memory hierarchy using instructions generated by the compiler
  • Hide memory latency by converting data access predictability to data access locality (Just-in-Time Fetch)
  • Exploit instruction-level parallelism without extensive scheduling hardware
  • Zero-overhead prefetches for maximal computation throughput

[Figure: HiDISC pipeline. The compiler splits the program into Computation Instructions for the Computation Processor (CP), Access Instructions for the Access Processor (AP), and Cache Management Instructions for the Cache Management Processor (CMP); the CP works out of registers, the AP out of the cache, and the CMP manages the 2nd-level cache and main memory.]

SLIDE 7

Decoupled Architectures

[Figure: Comparison of MIPS (conventional), DEAP (decoupled), CAPP (decoupled) and HiDISC (new). The decoupled designs combine a Computation Processor (CP), Access Processor (AP) and/or Cache Management Processor (CMP) with load, store address, store data and slip control queues in front of the registers, cache, second-level cache and main memory. Issue widths shown: 8-way, 3-way, 5-way, 5-way and 3-way, 2-way, 3-way, 3-way.]

SLIDE 8

Slip Control Queue

  • The Slip Control Queue (SCQ) adapts dynamically
    – Late prefetches = prefetched data arrived after the load had been issued
    – Useful prefetches = prefetched data arrived before the load had been issued

    if (prefetch_buffer_full())
        don't change size of SCQ;
    else if ((2 * late_prefetches) > useful_prefetches)
        increase size of SCQ;
    else
        decrease size of SCQ;
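The adaptation rule above can be written as a small pure function; a minimal sketch, assuming illustrative names (`scq_adapt` and its parameters are not from the HiDISC sources) and a unit step size with a floor of one entry:

```c
#include <assert.h>

/* Hypothetical sketch of the SCQ sizing heuristic from this slide.
   Grow when late prefetches dominate, shrink otherwise, and hold
   when the prefetch buffer is full. */
static int scq_adapt(int scq_size, int buffer_full,
                     int late_prefetches, int useful_prefetches)
{
    if (buffer_full)
        return scq_size;                    /* prefetch buffer full: hold */
    if (2 * late_prefetches > useful_prefetches)
        return scq_size + 1;                /* too many late prefetches: grow */
    return scq_size > 1 ? scq_size - 1 : scq_size;  /* shrink, floor at 1 */
}
```

The step size and floor are assumptions; the slide only specifies the direction of the change.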

SLIDE 9

Decoupling Programs for HiDISC-3 (Discrete Convolution, Inner Loop)

Inner Loop Convolution:

    for (j = 0; j < i; ++j)
        y[i] = y[i] + (x[j] * h[i-j-1]);

Computation Processor Code:

    while (not end of loop) {
        y = y + (x * h);       /* operands arrive from the load queue */
        send y to SDQ;
    }

Access Processor Code:

    for (j = 0; j < i; ++j) {
        load (x[j]);
        load (h[i-j-1]);
        GET_SCQ;
    }
    send (EOD token);
    send address of y[i] to SAQ;

Cache Management Code:

    for (j = 0; j < i; ++j) {
        prefetch (x[j]);
        prefetch (h[i-j-1]);
        PUT_SCQ;
    }
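For reference, the undecoupled inner loop above is directly runnable; a minimal sketch, with `conv_point` an illustrative name not taken from the slides:

```c
#include <assert.h>

/* Compute one output point y[i] of the discrete convolution,
   using the same index pattern as the slide's inner loop. */
static int conv_point(const int *x, const int *h, int i)
{
    int y = 0;
    for (int j = 0; j < i; ++j)
        y += x[j] * h[i - j - 1];
    return y;
}
```

Decoupling splits exactly this loop three ways: the multiplies stay on the CP, the `x[j]`/`h[i-j-1]` address generation moves to the AP, and the CMP issues the matching prefetches.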

SLIDE 10

Benchmarks

Benchmark   Source of Benchmark               Lines  Description                                   Data Set Size
LLL1        Livermore Loops [45]              20     1024-element arrays, 100 iterations           24 KB
LLL2        Livermore Loops                   24     1024-element arrays, 100 iterations           16 KB
LLL3        Livermore Loops                   18     1024-element arrays, 100 iterations           16 KB
LLL4        Livermore Loops                   25     1024-element arrays, 100 iterations           16 KB
LLL5        Livermore Loops                   17     1024-element arrays, 100 iterations           24 KB
Tomcatv     SPECfp95 [68]                     190    33x33-element matrices, 5 iterations          <64 KB
MXM         NAS kernels [5]                   113    Unrolled matrix multiply, 2 iterations        448 KB
CHOLSKY     NAS kernels                       156    Cholesky matrix decomposition                 724 KB
VPENTA      NAS kernels                       199    Invert three pentadiagonals simultaneously    128 KB
Qsort       Quicksort sorting algorithm [14]  58     Quicksort                                     128 KB

SLIDE 11

Simulation Parameters

Parameter               Value        Parameter                  Value
FLC Size                4 KB         SLC Size                   16 KB
FLC Associativity       2            SLC Associativity          2
FLC Block Size          32 B         SLC Block Size             32 B
Memory Latency          varied       Memory Contention Time     varied
Victim Cache Size       32 entries   Prefetch Buffer Size       8 entries
Load Queue Size         128          Store Address Queue Size   128
Store Data Queue Size   128          Total Issue Width          8

SLIDE 12

Simulation Results

[Figure: Speedup vs. main memory latency (40–200 cycles) for LLL3, Tomcatv, Cholsky and Vpenta, comparing MIPS, DEAP, CAPP and HiDISC.]

SLIDE 13

Our Results: Impact

  • 2x speedup for scientific benchmarks with large data sets over an in-order superscalar processor
  • 7.4x speedup for matrix multiply (MXM) over an in-order issue superscalar processor (similar operations are used in ATR/SLD)
  • 2.6x speedup for matrix decomposition/substitution (Cholsky) over an in-order issue superscalar processor
  • Reduced memory latency for systems that have high memory bandwidths (e.g. PIMs, RAMBUS)
  • Allows the compiler to solve indexing functions for irregular applications
  • Reduced system cost for high-throughput scientific codes
SLIDE 14

Schedule

April 98 Start – May 01 End (milestones at April 99 and April 00)

  • Defined benchmarks
  • Completed simulator
  • Performed instruction-level simulations on hand-compiled benchmarks
  • Continue simulations of more benchmarks (ATR/SLD)
  • Define HiDISC architecture
  • Update Simulator
  • Generate performance statistics and evaluate design
  • Develop and test a full decoupling compiler

SLIDE 15

Summary

  • A processor for each level of the memory hierarchy
  • Adaptive memory hierarchy management
  • Reduces memory latency for systems with high memory bandwidths (PIMs, RAMBUS)
  • 2x speedup for scientific benchmarks
  • 3x speedup for matrix decomposition/substitution (Cholesky)
  • 7x speedup for matrix multiply (MXM); similar results expected for ATR/SLD

SLIDE 16

BEYOND HiDISC

! Distributed Processing

  • Sensors
  • Data I/O (disk farms)
  • Multiprocessors

! Multiprocessing

  • McDISC-on-a-chip
  • SMT/MT/I-structures
  • VLSI layout/performance tradeoffs

! Applications

  • Compute/database search and retrieval
SLIDE 17

The McDISC Invention

  • Problem: All extant, large-scale multiprocessors perform poorly when faced with a tightly-coupled parallel program.
  • Reason: Extant machines have a long latency when communication is needed between nodes. This long latency kills performance when executing tightly-coupled programs. (Note that multi-threading a la the Tera machine does not help when there are dependencies.)
  • The McDISC solution: Provide the network interface processor (NIP) with a programmable processor to execute not only OS code (e.g. Stanford Flash), but user code, generated by the compiler.
  • Advantage: The NIP, executing user code, fetches data before it is needed by the node processors, eliminating the network fetch latency most of the time.
  • Result: Fast execution (speedup) of tightly-coupled parallel programs.

SLIDE 18

The McDISC System: Memory-Centered Distributed Instruction Set Computer

[Figure: A McDISC node. The compiler splits the program into Computation Instructions (Computation Processor, CP), Access Instructions (Access Processor, AP) and Cache Management Instructions (Cache Management Processor, CMP); the node adds a Disc Processor (DP) with a disc cache and disc farm (RAID), an Adaptive Signal PIM (ASP), an Adaptive Graphics PIM (AGP) and a Network Interface Processor (NIP). Sensor inputs (FLIR, SAR, video, ESS, SES) feed a dynamic database; understanding, inference and analysis support situation awareness, the decision process, targeting and network management. Nodes connect through register links to CP neighbors in a 3-D torus (X, Y, Z) of pipelined rings, with output to displays and the network.]

SLIDE 19

Matrix Multiply on McDISC

    pid = processor id
    p   = # of processors

    min_i = (pid / p) * n;
    max_i = min_i + (p / n) - 1;

    for (i = min_i; i <= max_i; ++i) {
        for (j = 0; j < m; ++j) {
            c[i][j] = 0;
            for (k = 0; k < l; ++k) {
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
            }
        }
    }

[Figure: Parallel Matrix Multiply. C (n×m) = A (n×l) × B (l×m), with the rows of C partitioned across processors.]
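The row-block partition above can be exercised on a single machine; a minimal sketch, assuming illustrative names (`block_matmul`, the fixed sizes) and the common n/p row split, since the extracted bounds (`min_i = (pid/p)*n`, `max_i = min_i + (p/n) - 1`) appear garbled:

```c
#include <assert.h>

enum { N = 4, L = 3, M = 2 };   /* C (N x M) = A (N x L) * B (L x M) */

/* Each "processor" pid of p computes a contiguous block of N/p rows
   of C, mirroring the per-node loop on the slide (assumes p divides N). */
static void block_matmul(int pid, int p,
                         int a[N][L], int b[L][M], int c[N][M])
{
    int rows  = N / p;
    int min_i = pid * rows;
    int max_i = min_i + rows - 1;
    for (int i = min_i; i <= max_i; ++i)
        for (int j = 0; j < M; ++j) {
            c[i][j] = 0;
            for (int k = 0; k < L; ++k)
                c[i][j] += a[i][k] * b[k][j];
        }
}
```

Calling `block_matmul` once per pid (0..p-1) fills all of C, which is what the p McDISC nodes do in parallel.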

SLIDE 20

Matrix Multiply on McDISC

CMP:

    for (i = min_i; i <= max_i; ++i) {
        for (j = 0; j < m; ++j) {
            for (k = 0; k < l; ++k) {
                prefetch(a[i][k]);
                prefetch(b[i][k]);
            }
        }
    }

CP:

    while (not end-of-data) {
        while (not end-of-data) {
            c = 0;
            while (not end-of-data) {
                /* a and b from queue */
                c = c + a * b;
            }
            send c to store queue;
        }
    }

AP:

    for (i = min_i; i <= max_i; ++i) {
        for (j = 0; j < m; ++j) {
            for (k = 0; k < l; ++k) {
                load a[i][k] to load queue;
                load b[i][k] to load queue;
            }
            send end-of-data to CP;
            put &c[i][j] in store queue;
            send signal to NIP;
        }
        send end-of-data;
    }
    send end-of-data;

NIP:

    for (i = min_i; i <= max_i; ++i) {
        for (j = 0; j < m; ++j) {
            wait for signal from AP;
            send c[i][j] to processor 0;
        }
    }

SLIDE 21

Sparse Matrix Multiply on McDISC

Inner loop:

    ap = alist;
    bp = blist;
    while ((ap != NULL) && (ap->row == i)
        && (bp != NULL) && (bp->row == i)) {
        if (ap->col == bp->col) {
            sum = sum + (ap->data * bp->data);
            ap = ap->next;
            bp = bp->next;
        }
        else if (ap->col < bp->col)
            ap = ap->next;
        else
            bp = bp->next;
    }

[Figure: Parallel Sparse Matrix Multiply. Alist and Blist (B transpose) are linked lists of nodes with fields row, col, val, next.]
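The inner loop above is a merge of two column-sorted row lists; a self-contained sketch, with `struct node` and `sparse_dot` as illustrative names not taken from the slides:

```c
#include <assert.h>
#include <stddef.h>

/* One row node of a sparse matrix stored as a linked list,
   sorted by column within each row. */
struct node { int row, col, data; struct node *next; };

/* Dot product of row i of A with row i of B-transpose: advance
   whichever list has the smaller column, multiply on a match. */
static int sparse_dot(const struct node *ap, const struct node *bp, int i)
{
    int sum = 0;
    while (ap != NULL && ap->row == i && bp != NULL && bp->row == i) {
        if (ap->col == bp->col) {
            sum += ap->data * bp->data;
            ap = ap->next;
            bp = bp->next;
        } else if (ap->col < bp->col) {
            ap = ap->next;
        } else {
            bp = bp->next;
        }
    }
    return sum;
}
```

The decoupled version on the next slide runs this same merge twice: once on the AP to feed the load queue, and once on the CMP to issue the matching prefetches.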
SLIDE 22

Sparse Matrix Multiply on McDISC

CP:

    sum = 0;
    while (not EOD)
        sum += LQ * LQ;
    send sum to SDQ;

AP:

    ap = alist;
    bp = blist;
    while ((ap != NULL) && (ap->row == i)
        && (bp != NULL) && (bp->row == i)) {
        if (ap->col == bp->col) {
            put ap->data and bp->data in LQ;
            ap = ap->next;
            bp = bp->next;
        }
        else if (ap->col < bp->col)
            ap = ap->next;
        else
            bp = bp->next;
    }
    send EOD token to CP;
    send &c[i][j] to SAQ;
    send signal and address to NIP;

NIP:

    wait for signal from AP;
    send data to home node;

CMP:

    ap = alist;
    bp = blist;
    while ((ap != NULL) && (ap->row == i)
        && (bp != NULL) && (bp->row == i)) {
        if (ap->col == bp->col) {
            prefetch(ap->data);
            prefetch(bp->data);
            ap = ap->next;
            bp = bp->next;
        }
        else if (ap->col < bp->col)
            ap = ap->next;
        else
            bp = bp->next;
    }