EDGE, TRIPS, and CLP: Bending architecture to fit workload
Zachary Weinberg, 22 Jan 2009


SLIDE 1

EDGE, TRIPS, and CLP

Bending architecture to fit workload

Zachary Weinberg, 22 Jan 2009

SLIDE 2

The superscalar problem

◮ To have many instructions in flight at once, need huge on-chip control structures
  ◮ issue queue, reorder buffer, rename registers, hazard detection, bypass network, result bus, ...
◮ Wire delays limit these to 100 instructions or so
  ◮ Alpha 21264: 80
  ◮ Intel Core 2: 96
  ◮ Intel Nehalem: 128 (64 with two threads)
◮ Heavy load on branch prediction
  ◮ Have only a few gate delays to make a prediction
  ◮ Need to issue a speculative branch nearly every cycle
  ◮ Need near-perfect accuracy to avoid frequent pipeline flushes

SLIDE 3

Grid processor

[Figure: TRIPS prototype block diagram: two processors, each a grid of execution (E), register (R), instruction (I), data (D), and memory (M) tiles routed through network (N) tiles under a single global control (G) tile, sharing a secondary memory system with SDRAM controllers, DMA, an external bus controller, and chip-to-chip (C2C) links.]

[Figure: the four TRIPS micronetworks, overlaid on the tile grid:]
◮ Global dispatch network (GDN): issues block fetch commands and dispatches instructions
◮ Operand network (OPN): handles transport of all data operands
◮ Global control network (GCN): issues block commit and block flush commands
◮ Global status network (GSN): signals completion of block execution, I-cache miss refill, and block commit completion

◮ Limit communication between tiles
◮ Limit size of global control logic (G-tile only)
◮ Tremendous execution bandwidth (64 insns per E-tile)
◮ This is what we’d like to build, but how?

SLIDE 4

Data flow architectures avoid the superscalar problem

◮ Instructions activate when they have all their inputs
◮ There is no “program order” to maintain
◮ Each instruction says where its output goes
◮ Much control flow is replaced with predicated execution

which means...

◮ Issue queue is simpler and can be distributed
◮ No reorder buffer necessary
◮ No need for rename registers or shared result bus
◮ Reduced load on branch prediction
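The firing rule is easy to state in code. Here is a minimal sketch in Python (the Insn structure and all names are invented for illustration, not any real dataflow ISA): an instruction executes the moment its last operand arrives, and pushes its result directly to the consumers it names, so no reorder buffer or shared result bus is involved.

```python
from collections import deque

class Insn:
    """One dataflow instruction: fires when all operand slots are filled."""
    def __init__(self, op, n_inputs, targets):
        self.op = op                  # function producing the result
        self.slots = [None] * n_inputs
        self.pending = n_inputs       # operands still outstanding
        self.targets = targets        # (consumer, slot) pairs: where the output goes

def deliver(insn, slot, value, ready):
    insn.slots[slot] = value
    insn.pending -= 1
    if insn.pending == 0:             # the firing rule: all inputs present
        ready.append(insn)

def run(initial_operands):
    ready = deque()
    for insn, slot, value in initial_operands:
        deliver(insn, slot, value, ready)
    while ready:                      # no program order: fire anything ready
        insn = ready.popleft()
        result = insn.op(*insn.slots)
        for consumer, slot in insn.targets:
            deliver(consumer, slot, result, ready)  # direct forwarding

# Example: (a+b) * (a-b), with the product printed by a sink instruction
sink = Insn(print, 1, [])
mul  = Insn(lambda x, y: x * y, 2, [(sink, 0)])
add  = Insn(lambda x, y: x + y, 2, [(mul, 0)])
sub  = Insn(lambda x, y: x - y, 2, [(mul, 1)])
run([(add, 0, 7), (add, 1, 3), (sub, 0, 7), (sub, 1, 3)])  # prints 40
```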

SLIDE 5

But they have their own problems

◮ Loops must be “throttled” to avoid swamping the system with tokens
◮ Some implementations require exotic memory hardware (e.g. I-structures)
◮ Arguably superior but unfamiliar concurrency model (also true for exceptions, virtual memory, multitasking)
◮ Difficult or impossible to support code written in a conventional language

SLIDE 6

EDGE: a middle ground

◮ ISA designed for grid processors
◮ Lay out code in hyperblocks
  ◮ one entry, many exits
  ◮ instructions within form a data flow graph
  ◮ static assignment to execution tiles
  ◮ commit all instructions at once
◮ Conventional control flow between blocks
◮ Exceptions delayed to block boundary
◮ Can speculatively execute ahead of current block
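A toy rendering of those rules, with an invented encoding (this is not the real TRIPS ISA): within the block, instructions fire dataflow-style; register writes and the chosen exit are buffered and applied in one atomic commit at block end.

```python
def run_hyperblock(insns, regs):
    # insns: id -> (op, sources, write_reg_or_None, exit_label_or_None)
    # a source is either ('reg', name), read at block entry, or a producer id
    vals, done = {}, set()
    progress = True
    while progress:                   # naive scheduler: fire whatever is ready
        progress = False
        for i, (op, srcs, _, _) in insns.items():
            if i in done:
                continue
            if all(s[0] == 'reg' or s in vals for s in srcs):
                args = [regs[s[1]] if s[0] == 'reg' else vals[s] for s in srcs]
                vals[i] = op(*args)
                done.add(i)
                progress = True
    # block-atomic commit: apply every register write, then take one exit
    # (in a real block the exit predicates are mutually exclusive)
    regs.update({w: vals[i] for i, (_, _, w, _) in insns.items() if w})
    taken = [e for i, (_, _, _, e) in insns.items() if e and vals.get(i)]
    return taken[0] if taken else 'fallthrough'

# Example: r2 = r0 + r1; exit to 'loop' if r2 < 10
regs = {'r0': 3, 'r1': 4}
block = {
    'i0': (lambda a, b: a + b, [('reg', 'r0'), ('reg', 'r1')], 'r2', None),
    'i1': (lambda s: s < 10,   ['i0'],                         None, 'loop'),
}
print(run_hyperblock(block, regs), regs)  # -> loop {'r0': 3, 'r1': 4, 'r2': 7}
```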

SLIDE 7

Benefits

◮ Within a hyperblock, get benefits of data flow architecture
◮ Global structures only need to track inter-hyperblock state
  ◮ can be done with smaller structures
  ◮ gives more time to make decisions
  ◮ allows far more instructions in flight overall
◮ Avoids problems of pure data flow architecture
  ◮ Can execute conventional language code
  ◮ More familiar concurrency and exception model
  ◮ No looping within hyperblock, so no loop throttling needed

SLIDE 8

Problems

◮ Only one branch target per hyperblock
  ◮ Not uncommon to need one per 5–10 instructions
  ◮ Compiler must aggressively if-convert and unroll loops
  ◮ Code size may increase significantly
◮ Intra-hyperblock scheduling is fragile
  ◮ Goal is to put dependent instructions near each other
  ◮ Optimal schedule depends on processor details
  ◮ Like VLIW, may need recompilation for good performance
◮ Exception model is awkward for virtual memory
  ◮ must repeat entire blocks for each page fault
  ◮ worst case O(n²) penalty (a back-of-envelope sketch follows)
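A back-of-envelope illustration (mine, not from the papers) of where the quadratic worst case comes from:

```python
# Worst case for block-atomic page faults: each of a block's n memory
# accesses faults on a distinct page, and every fault restarts the block
# from its entry.  Restart k re-executes roughly k instructions before
# hitting fault k, so total work grows as n*(n+1)/2, i.e. O(n^2).
n = 128  # TRIPS hyperblock size
total = sum(k for k in range(1, n + 1))
print(total, "instructions executed for", n, "useful ones")  # 8256 vs 128
```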

SLIDE 9

First iteration: TRIPS

◮ Concrete design of a grid processor
◮ Simulated with simplifications (e.g. no page faults)
◮ 128-instruction hyperblocks
◮ Very simple execution tiles
◮ Three operational modes (a sketch of the slot allocation follows this list):
  D-morph: from one thread, take one definite and many speculative blocks
  T-morph: from several threads, take one definite and a few speculative blocks each
  S-morph: unroll a computational kernel across many blocks and run them all at once
◮ Values forwarded from producer to consumer blocks as available, through register file
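One way to picture the three modes is as different policies for filling the processor's in-flight block slots. A minimal sketch, assuming an eight-block window (the slot count and allocation policy are illustrative assumptions, not taken from the papers):

```python
NUM_SLOTS = 8  # assumed window size, in hyperblocks

def d_morph(thread):
    # one thread: one definite block plus many speculative successors
    return [(thread, 'definite')] + [(thread, 'speculative')] * (NUM_SLOTS - 1)

def t_morph(threads):
    # several threads: one definite and a few speculative blocks each
    per_thread = NUM_SLOTS // len(threads)
    return [(t, kind)
            for t in threads
            for kind in ['definite'] + ['speculative'] * (per_thread - 1)]

def s_morph(kernel):
    # one kernel unrolled across every slot; all blocks are definite work
    return [(kernel, f'unrolled iteration {i}') for i in range(NUM_SLOTS)]

print(d_morph('gzip'))
print(t_morph(['gzip', 'mcf']))
print(s_morph('fir16'))
```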

SLIDE 10

D-morph

[Figure: IPC versus speculative depth (1, 2, 4, 8, 16, 32, plus 32 with perfect memory and 32 with perfect memory and branch prediction) against the Alpha 21264 baseline, for integer benchmarks (adpcm, bzip2, compress, gzip, m88ksim, mcf, parser, twolf, vortex) and floating-point benchmarks (ammp, art, dct, equake, hydro2d, mgrid, mpeg2, swim, tomcatv, turb3d), each with its mean.]

◮ Most like a regular single-thread OOO
◮ Tested on a subset of SPEC (what compiler could handle)
◮ Skip initialization, count only “useful” insns
◮ Competitive with Alpha even with no speculation
◮ Leaves Alpha in the dust with deeper speculation, esp. for floating point
◮ Mispredictions hurt, especially integer code

SLIDE 11

T-morph

◮ Like any SMT, sacrifices per-thread resources for concurrency
◮ Biggest hit is from lower speculative depth, higher network contention
◮ Selected eight SPEC benchmarks and ran them in pairs, fours, or all at once
  ◮ Two threads: 87% of single-thread throughput, 1.7x speedup
  ◮ Four threads: 61% of single-thread throughput, 2.4x speedup
  ◮ Eight threads: 39% of single-thread throughput, 2.9x speedup
◮ No comparison to multitasking on D-morph

SLIDE 12

S-morph

◮ Unroll a loop into many concurrent hyperblocks
◮ Can repeat without refetching
◮ Software control of L2 cache
◮ Benchmarked on 7 streaming kernels; hand-coded, machine-scheduled assembly
◮ Graph shows several different design points
◮ 2–4x D-morph performance
◮ Requires extra control logic

[Figure: compute instructions per cycle on the streaming kernels (among them convert, dct, fft8, fir16, idea, and transform) and their mean, comparing D-morph, S-morph, idealized S-morph, 1/4 load bandwidth, 4x store bandwidth, and no-revitalization design points.]

SLIDE 13

Second iteration: CLP/TFlex

[Figure: one TFlex core, its next-block predictor, and the 32-core TFlex array. Each core holds a 128-entry instruction window with select logic, an integer ALU and an FP ALU, register forwarding logic and queues, two 128x64b operand buffers, a 128-entry architectural register file (2 read / 1 write port), a 40-entry load/store queue, an 8KB 2-way L1 D-cache, a 4KB direct-mapped L1 I-cache, and operand network in/out queues. Block control connects to an 8-Kbit next-block predictor with a 4KB block header cache, which combines local and global exit histories with a return address stack to produce the predicted next-block address for the next owner core.]

◮ Dynamically aggregates cores as workload demands
◮ Distributes all control logic (abolish the G-tile)
◮ Cores are now dual-issue for integer ops
◮ Each core has its own L1 caches
◮ More operand network bandwidth

SLIDE 14

Dynamic aggregation of cores

[Figure: a 32-core array of processor tiles (P) surrounded by L2 cache banks, shown in several configurations: every core running its own thread, and various groupings of cores composed into larger logical processors.]

◮ Threads can run on any number of cores
◮ Authors anticipate operating system (or even hardware!) will assign each task an appropriate number of cores
◮ Benchmarking done with core counts chosen by hand
◮ Will execution with varying core count mess up scheduling?

SLIDE 15

Distributing all the control logic

[Figure: lifetimes of blocks A0 and A1 in thread0 and blocks B0 and B1 in thread1, traced through the distributed protocol phases: (a) fetch, (b) next-block prediction, (c) execution, and (d) commit for the first blocks, and (e)-(h) the same phases for the following blocks, each phase directed by that block's owner core.]

◮ Each EDGE block has an owner core
  ◮ Chosen by hash of block address (a sketch follows this list)
  ◮ Directs the other cores through fetch, execution, commit
◮ Instructions spread across cores as in TRIPS
◮ Fetch done by all cores in parallel
◮ Block commit by four-way handshake
◮ Branch history either kept with owner core (local) or transmitted along with branch predictions (global)
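A minimal sketch of the owner-core choice; the slides say only "hash of block address", so the bit-selection hash, the block alignment, and the power-of-two restriction below are assumptions for illustration:

```python
def owner_core(block_addr: int, num_cores: int) -> int:
    """Pick the core that directs fetch, prediction, and commit for this
    block.  Assumes compositions are powers of two, so the 'hash' can be
    simple bit selection above the (assumed) block-offset bits."""
    assert num_cores & (num_cores - 1) == 0
    return (block_addr >> 7) & (num_cores - 1)

# Consecutive blocks get different owners, so the control work is spread
# across the composed cores rather than centralized in one G-tile:
for addr in range(0x4000, 0x4400, 0x80):
    print(hex(addr), '-> owner core', owner_core(addr, 8))
```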

SLIDE 16

Coordination overhead

[Figure: distributed fetch overhead per block (cycles at 1, 2, 4, 8, 16, and 32 cores, broken into block tag access, I-cache access, block prediction, control hand-off, order propagation, and instruction reads) and distributed commit overhead per block (cycles, broken into writing values and the four-way handshake).]

◮ Fetch has significant fixed costs
◮ As cores increase, both fetch and commit can use more parallelism accessing architectural state
◮ But pay for it in more inter-core communication delay
◮ 4–8 cores seems to be a sweet spot
◮ These are overheads per block, not per unit time
◮ Even 32 cores suffer less than 2% overhead per unit time, since per-block costs overlap with the execution of other in-flight blocks

SLIDE 17

Performance relative to TRIPS

[Figure: total performance relative to TRIPS at 2, 4, 8, 16, and 32 cores, for high-ILP benchmarks (conv, ct, a2time01, autocor00, basefp01, bezier02, rspeed01, 802.11a, 8b10b, swim, mgrid, applu, apsi) and low-ILP benchmarks (genalg, dither01, tblook01, gzip, gcc, crafty, parser, perlbmk, vortex, bzip2, twolf, wupwise, sixtrack), split into hand-optimized and compiled-only groups, with averages against the best TRIPS configuration; two off-scale bars are labeled 11.6 and 18.3.]

◮ Yeah, they crammed too much stuff into this graph.
◮ Take-homes seem to be:
  ◮ 19% faster than TRIPS with matched configuration
  ◮ 42% faster than TRIPS with configuration tuned for each application
  ◮ Makes more of a difference for benchmarks with more ILP
  ◮ Hand-optimized code much better than compiled code
◮ This time around they have a cycle-accurate simulator

SLIDE 18

Unaddressed issues

◮ Code size, code size, code size
◮ Bandwidth to main memory
◮ Compiler quality
◮ Robustness of schedule
◮ Virtual memory
◮ Context switching
◮ Debugging