EDGE, TRIPS, and CLP: Bending architecture to fit workload
Zachary Weinberg, 22 Jan 2009


SLIDE 1

EDGE, TRIPS, and CLP

Bending architecture to fit workload

Zachary Weinberg, 22 Jan 2009

SLIDE 2

The superscalar problem

◮ To have many instructions in flight at once, need huge on-chip control structures
  ◮ issue queue, reorder buffer, rename registers, hazard detection, bypass network, result bus, ...
◮ Wire delays limit these to 100 instructions or so
  ◮ Alpha 21264: 80
  ◮ Intel Core 2: 96
  ◮ Intel Nehalem: 128 (64 with two threads)
◮ Heavy load on branch prediction
  ◮ Have only a few gate delays to make a prediction
  ◮ Need to issue a speculative branch nearly every cycle
  ◮ Need near-perfect accuracy to avoid frequent pipeline flushes

SLIDE 3

Grid processor

[Figure: TRIPS prototype block diagram: two processors, each a grid of execution (E), register (R), instruction (I), data (D), and memory (M) tiles routed through network (N) tiles under a single global control (G) tile, sharing a secondary memory system with SDRAM controllers, DMA, an external bus controller, and chip-to-chip (C2C) links.]

[Figure: the four TRIPS micronetworks, overlaid on the tile grid:]
◮ Global dispatch network (GDN): issues block fetch commands and dispatches instructions
◮ Operand network (OPN): handles transport of all data operands
◮ Global control network (GCN): issues block commit and block flush commands
◮ Global status network (GSN): signals completion of block execution, I-cache miss refill, and block commit completion

◮ Limit communication between tiles
◮ Limit size of global control logic (G-tile only)
◮ Tremendous execution bandwidth (64 insns per E-tile)
◮ This is what we’d like to build, but how?

SLIDE 4

Data flow architectures avoid the superscalar problem

◮ Instructions activate when they have all their inputs
◮ There is no “program order” to maintain
◮ Each instruction says where its output goes
◮ Much control flow is replaced with predicated execution

which means...

◮ Issue queue is simpler and can be distributed
◮ No reorder buffer necessary
◮ No need for rename registers or shared result bus
◮ Reduced load on branch prediction
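The firing rule is easy to state in code. Here is a minimal sketch in Python (the Insn structure and all names are invented for illustration, not any real dataflow ISA): an instruction executes the moment its last operand arrives, and pushes its result directly to the consumers it names, so no reorder buffer or shared result bus is involved.

```python
from collections import deque

class Insn:
    """One dataflow instruction: fires when all operand slots are filled."""
    def __init__(self, op, n_inputs, targets):
        self.op = op                  # function producing the result
        self.slots = [None] * n_inputs
        self.pending = n_inputs       # operands still outstanding
        self.targets = targets        # (consumer, slot) pairs: where the output goes

def deliver(insn, slot, value, ready):
    insn.slots[slot] = value
    insn.pending -= 1
    if insn.pending == 0:             # the firing rule: all inputs present
        ready.append(insn)

def run(initial_operands):
    ready = deque()
    for insn, slot, value in initial_operands:
        deliver(insn, slot, value, ready)
    while ready:                      # no program order: fire anything ready
        insn = ready.popleft()
        result = insn.op(*insn.slots)
        for consumer, slot in insn.targets:
            deliver(consumer, slot, result, ready)  # direct forwarding

# Example: (a+b) * (a-b), with the product printed by a sink instruction
sink = Insn(print, 1, [])
mul  = Insn(lambda x, y: x * y, 2, [(sink, 0)])
add  = Insn(lambda x, y: x + y, 2, [(mul, 0)])
sub  = Insn(lambda x, y: x - y, 2, [(mul, 1)])
run([(add, 0, 7), (add, 1, 3), (sub, 0, 7), (sub, 1, 3)])  # prints 40
```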

SLIDE 5

But they have their own problems

◮ Loops must be “throttled” to avoid swamping the system with tokens
◮ Some implementations require exotic memory hardware (e.g. I-structures)
◮ Arguably superior but unfamiliar concurrency model (also true for exceptions, virtual memory, multitasking)
◮ Difficult or impossible to support code written in a conventional language

SLIDE 6

EDGE: a middle ground

◮ ISA designed for grid processors
◮ Lay out code in hyperblocks
  ◮ one entry, many exits
  ◮ instructions within form a data flow graph
  ◮ static assignment to execution tiles
  ◮ commit all instructions at once
◮ Conventional control flow between blocks
◮ Exceptions delayed to block boundary
◮ Can speculatively execute ahead of current block
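A toy rendering of those rules, with an invented encoding (this is not the real TRIPS ISA): within the block, instructions fire dataflow-style; register writes and the chosen exit are buffered and applied in one atomic commit at block end.

```python
def run_hyperblock(insns, regs):
    # insns: id -> (op, sources, write_reg_or_None, exit_label_or_None)
    # a source is either ('reg', name), read at block entry, or a producer id
    vals, done = {}, set()
    progress = True
    while progress:                   # naive scheduler: fire whatever is ready
        progress = False
        for i, (op, srcs, _, _) in insns.items():
            if i in done:
                continue
            if all(s[0] == 'reg' or s in vals for s in srcs):
                args = [regs[s[1]] if s[0] == 'reg' else vals[s] for s in srcs]
                vals[i] = op(*args)
                done.add(i)
                progress = True
    # block-atomic commit: apply every register write, then take one exit
    # (in a real block the exit predicates are mutually exclusive)
    regs.update({w: vals[i] for i, (_, _, w, _) in insns.items() if w})
    taken = [e for i, (_, _, _, e) in insns.items() if e and vals.get(i)]
    return taken[0] if taken else 'fallthrough'

# Example: r2 = r0 + r1; exit to 'loop' if r2 < 10
regs = {'r0': 3, 'r1': 4}
block = {
    'i0': (lambda a, b: a + b, [('reg', 'r0'), ('reg', 'r1')], 'r2', None),
    'i1': (lambda s: s < 10,   ['i0'],                         None, 'loop'),
}
print(run_hyperblock(block, regs), regs)  # -> loop {'r0': 3, 'r1': 4, 'r2': 7}
```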

SLIDE 7

Benefits

◮ Within a hyperblock, get benefits of data flow architecture
◮ Global structures only need to track inter-hyperblock state
  ◮ can be done with smaller structures
  ◮ gives more time to make decisions
  ◮ allows far more instructions in flight overall
◮ Avoids problems of pure data flow architecture
  ◮ Can execute conventional language code
  ◮ More familiar concurrency and exception model
  ◮ No looping within hyperblock, so no loop throttling needed

SLIDE 8

Problems

◮ Only one branch target per hyperblock
  ◮ Not uncommon to need one per 5–10 instructions
  ◮ Compiler must aggressively if-convert and unroll loops
  ◮ Code size may increase significantly
◮ Intra-hyperblock scheduling is fragile
  ◮ Goal is to put dependent instructions near each other
  ◮ Optimal schedule depends on processor details
  ◮ Like VLIW, may need recompilation for good performance
◮ Exception model is awkward for virtual memory
  ◮ must repeat entire blocks for each page fault
  ◮ worst case O(n²) penalty (a back-of-envelope sketch follows)
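A back-of-envelope illustration (mine, not from the papers) of where the quadratic worst case comes from:

```python
# Worst case for block-atomic page faults: each of a block's n memory
# accesses faults on a distinct page, and every fault restarts the block
# from its entry.  Restart k re-executes roughly k instructions before
# hitting fault k, so total work grows as n*(n+1)/2, i.e. O(n^2).
n = 128  # TRIPS hyperblock size
total = sum(k for k in range(1, n + 1))
print(total, "instructions executed for", n, "useful ones")  # 8256 vs 128
```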

SLIDE 9

First iteration: TRIPS

◮ Concrete design of a grid processor
◮ Simulated with simplifications (e.g. no page faults)
◮ 128-instruction hyperblocks
◮ Very simple execution tiles
◮ Three operational modes (a sketch of the slot allocation follows this list):
  D-morph: from one thread, take one definite and many speculative blocks
  T-morph: from several threads, take one definite and a few speculative blocks each
  S-morph: unroll a computational kernel across many blocks and run them all at once
◮ Values forwarded from producer to consumer blocks as available, through register file
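One way to picture the three modes is as different policies for filling the processor's in-flight block slots. A minimal sketch, assuming an eight-block window (the slot count and allocation policy are illustrative assumptions, not taken from the papers):

```python
NUM_SLOTS = 8  # assumed window size, in hyperblocks

def d_morph(thread):
    # one thread: one definite block plus many speculative successors
    return [(thread, 'definite')] + [(thread, 'speculative')] * (NUM_SLOTS - 1)

def t_morph(threads):
    # several threads: one definite and a few speculative blocks each
    per_thread = NUM_SLOTS // len(threads)
    return [(t, kind)
            for t in threads
            for kind in ['definite'] + ['speculative'] * (per_thread - 1)]

def s_morph(kernel):
    # one kernel unrolled across every slot; all blocks are definite work
    return [(kernel, f'unrolled iteration {i}') for i in range(NUM_SLOTS)]

print(d_morph('gzip'))
print(t_morph(['gzip', 'mcf']))
print(s_morph('fir16'))
```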

SLIDE 10

D-morph

[Figure: IPC versus speculative depth (1, 2, 4, 8, 16, 32, plus 32 with perfect memory and 32 with perfect memory and branch prediction) against the Alpha 21264 baseline, for integer benchmarks (adpcm, bzip2, compress, gzip, m88ksim, mcf, parser, twolf, vortex) and floating-point benchmarks (ammp, art, dct, equake, hydro2d, mgrid, mpeg2, swim, tomcatv, turb3d), each with its mean.]

◮ Most like a regular single-thread OOO
◮ Tested on a subset of SPEC (what compiler could handle)
◮ Skip initialization, count only “useful” insns
◮ Competitive with Alpha even with no speculation
◮ Leaves Alpha in the dust with deeper speculation, esp. for floating point
◮ Mispredictions hurt, especially integer code

SLIDE 11

T-morph

◮ Like any SMT, sacrifices per-thread resources for concurrency
◮ Biggest hit is from lower speculative depth, higher network contention
◮ Selected eight SPEC benchmarks and ran them in pairs, fours, or all at once
  ◮ Two threads: 87% of single-thread throughput, 1.7x speedup
  ◮ Four threads: 61% of single-thread throughput, 2.4x speedup
  ◮ Eight threads: 39% of single-thread throughput, 2.9x speedup
◮ No comparison to multitasking on D-morph

SLIDE 12

S-morph

◮ Unroll a loop into many concurrent hyperblocks
◮ Can repeat without refetching
◮ Software control of L2 cache
◮ Benchmarked on 7 streaming kernels; hand-coded, machine-scheduled assembly
◮ Graph shows several different design points
◮ 2–4x D-morph performance
◮ Requires extra control logic

[Figure: compute instructions per cycle on the streaming kernels (among them convert, dct, fft8, fir16, idea, and transform) and their mean, comparing D-morph, S-morph, idealized S-morph, 1/4 load bandwidth, 4x store bandwidth, and no-revitalization design points.]

SLIDE 13

Second iteration: CLP/TFlex

[Figure: one TFlex core, its next-block predictor, and the 32-core TFlex array. Each core holds a 128-entry instruction window with select logic, an integer ALU and an FP ALU, register forwarding logic and queues, two 128x64b operand buffers, a 128-entry architectural register file (2 read / 1 write port), a 40-entry load/store queue, an 8KB 2-way L1 D-cache, a 4KB direct-mapped L1 I-cache, and operand network in/out queues. Block control connects to an 8-Kbit next-block predictor with a 4KB block header cache, which combines local and global exit histories with a return address stack to produce the predicted next-block address for the next owner core.]

◮ Dynamically aggregates cores as workload demands
◮ Distributes all control logic (abolish the G-tile)
◮ Cores are now dual-issue for integer ops
◮ Each core has its own L1 caches
◮ More operand network bandwidth

SLIDE 14

Dynamic aggregation of cores

[Figure: a 32-core array of processor tiles (P) surrounded by L2 cache banks, shown in several configurations: every core running its own thread, and various groupings of cores composed into larger logical processors.]

◮ Threads can run on any number of cores
◮ Authors anticipate operating system (or even hardware!) will assign each task an appropriate number of cores
◮ Benchmarking done with core counts chosen by hand
◮ Will execution with varying core count mess up scheduling?

SLIDE 15

Distributing all the control logic

[Figure: lifetimes of blocks A0 and A1 in thread0 and blocks B0 and B1 in thread1, traced through the distributed protocol phases: (a) fetch, (b) next-block prediction, (c) execution, and (d) commit for the first blocks, and (e)-(h) the same phases for the following blocks, each phase directed by that block's owner core.]

◮ Each EDGE block has an owner core
  ◮ Chosen by hash of block address (a sketch follows this list)
  ◮ Directs the other cores through fetch, execution, commit
◮ Instructions spread across cores as in TRIPS
◮ Fetch done by all cores in parallel
◮ Block commit by four-way handshake
◮ Branch history either kept with owner core (local) or transmitted along with branch predictions (global)
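A minimal sketch of the owner-core choice; the slides say only "hash of block address", so the bit-selection hash, the block alignment, and the power-of-two restriction below are assumptions for illustration:

```python
def owner_core(block_addr: int, num_cores: int) -> int:
    """Pick the core that directs fetch, prediction, and commit for this
    block.  Assumes compositions are powers of two, so the 'hash' can be
    simple bit selection above the (assumed) block-offset bits."""
    assert num_cores & (num_cores - 1) == 0
    return (block_addr >> 7) & (num_cores - 1)

# Consecutive blocks get different owners, so the control work is spread
# across the composed cores rather than centralized in one G-tile:
for addr in range(0x4000, 0x4400, 0x80):
    print(hex(addr), '-> owner core', owner_core(addr, 8))
```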

SLIDE 16

Coordination overhead

[Figure: distributed fetch overhead per block (cycles at 1, 2, 4, 8, 16, and 32 cores, broken into block tag access, I-cache access, block prediction, control hand-off, order propagation, and instruction reads) and distributed commit overhead per block (cycles, broken into writing values and the four-way handshake).]

◮ Fetch has significant fixed costs
◮ As cores increase, both fetch and commit can use more parallelism accessing architectural state
◮ But pay for it in more inter-core communication delay
◮ 4–8 cores seems to be a sweet spot
◮ These are overheads per block, not per unit time
◮ Even 32 cores suffer less than 2% overhead per unit time, since per-block costs overlap with the execution of other in-flight blocks

SLIDE 17

Performance relative to TRIPS

[Figure: total performance relative to TRIPS at 2, 4, 8, 16, and 32 cores, for high-ILP benchmarks (conv, ct, a2time01, autocor00, basefp01, bezier02, rspeed01, 802.11a, 8b10b, swim, mgrid, applu, apsi) and low-ILP benchmarks (genalg, dither01, tblook01, gzip, gcc, crafty, parser, perlbmk, vortex, bzip2, twolf, wupwise, sixtrack), split into hand-optimized and compiled-only groups, with averages against the best TRIPS configuration; two off-scale bars are labeled 11.6 and 18.3.]

◮ Yeah, they crammed too much stuff into this graph.
◮ Take-homes seem to be:
  ◮ 19% faster than TRIPS with matched configuration
  ◮ 42% faster than TRIPS with configuration tuned for each application
  ◮ Makes more of a difference for benchmarks with more ILP
  ◮ Hand-optimized code much better than compiled code
◮ This time around they have a cycle-accurate simulator

SLIDE 18

Unaddressed issues

◮ Code size, code size, code size
◮ Bandwidth to main memory
◮ Compiler quality
◮ Robustness of schedule
◮ Virtual memory
◮ Context switching
◮ Debugging