How Many Simulators Does it Take to Build a Chip?
Steve Keckler
Department of Computer Sciences
The University of Texas at Austin
MOBS Keynote 6/22/08


But Wait - There’s More
- Broader question: what tools and analysis are required to design a new processor?
  - New ISA
  - New microarchitectures (processor, memory system)
  - New levels of design hierarchy
- This is a “Design Experience” talk
  - No new research results
  - Insight into system design methodologies based on TRIPS

Outline
- TRIPS System Design Overview
  - ISA and microarchitecture
  - Prototype specifications
- Simulators
  - ISA and SW design
  - Microarchitecture design
  - System design
- Hardware Validation Methodology
  - Correctness and performance validation
- Power Analysis
- TRIPS Software Tools
  - Binary utilities, debugger, performance analysis
- Conclusions

TRIPS EDGE ISA
- Explicit Data Graph Execution [IEEE Computer ‘04]
- Defined by two key features
  - Program graph is broken into sequences of blocks
    - Basic blocks, hyperblocks (max 128 instructions in TRIPS)
    - Blocks commit atomically or not at all - a block never partially executes
    - Amortize overheads over many instructions
    - Compiler forms blocks via loop unrolling, predication, inlining, etc.
  - Within a block, ISA support for direct producer-to-consumer communication
    - No shared named registers within a block (point-to-point dataflow edges only)
    - Instructions “fire” when their operands arrive
    - The block’s dataflow graph (DFG) is explicit in the architecture
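The firing rule can be illustrated with a toy interpreter. This is a minimal sketch and not the TRIPS encoding: the `Inst` class, the four opcodes, and the `(consumer, slot)` target format are hypothetical names chosen for illustration. Each instruction names its consumers directly instead of writing a shared register, and it fires once all of its operand slots are filled.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Inst:
    op: str                      # "const", "add", "mul", or "out" (hypothetical opcodes)
    targets: list                # (consumer index, operand slot) pairs
    arity: int = 2               # operands required before the instruction fires
    value: int = 0               # immediate, used by "const" only
    operands: dict = field(default_factory=dict)

def run_block(insts):
    """Execute one block in dataflow order; return (outputs, firing order)."""
    outputs, order = [], []
    # Instructions that need no operands (constants) are ready immediately.
    ready = deque(i for i, inst in enumerate(insts) if inst.arity == 0)
    while ready:
        i = ready.popleft()
        inst = insts[i]
        order.append(i)
        if inst.op == "const":
            result = inst.value
        elif inst.op == "add":
            result = inst.operands[0] + inst.operands[1]
        elif inst.op == "mul":
            result = inst.operands[0] * inst.operands[1]
        elif inst.op == "out":
            outputs.append(inst.operands[0])
            continue
        else:
            raise ValueError(f"unknown opcode {inst.op}")
        # Deliver the result point-to-point to each consumer's operand slot.
        for consumer, slot in inst.targets:
            insts[consumer].operands[slot] = result
            if len(insts[consumer].operands) == insts[consumer].arity:
                ready.append(consumer)
    return outputs, order

# (2 + 3) * 4, listed in reverse order to show that firing follows
# operand arrival rather than position in the block.
block = [
    Inst("out", [], arity=1),                    # 0: block output
    Inst("mul", [(0, 0)]),                       # 1: needs the add result and 4
    Inst("add", [(1, 0)]),                       # 2: needs 2 and 3
    Inst("const", [(1, 1)], arity=0, value=4),   # 3
    Inst("const", [(2, 0)], arity=0, value=2),   # 4
    Inst("const", [(2, 1)], arity=0, value=3),   # 5
]
outputs, order = run_block(block)
print(outputs, order)
```

The sketch leaves out predication, memory ordering, and atomic block commit, all of which a real EDGE implementation has to handle.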

TRIPS Processor Specifications
- An aggressive, general-purpose processor
  - Up to 16 instructions per cycle
  - Up to 4 loads and stores per cycle
  - Up to 64 outstanding L1 data cache misses
  - Up to 1024 dynamically executing instructions
  - Up to 4 simultaneous multithreading (SMT) threads
  - Inter- and intra-block speculation
- Memory system
  - 4 simultaneous L1 cache fills per processor
  - Up to 16 simultaneous L2 cache accesses

TRIPS Prototype Chip
- 2 TRIPS Processors
- NUCA L2 Cache
  - 1 MB, 16 banks
- On-Chip Network (OCN)
  - 2D mesh network
  - Replaces on-chip bus
- Controllers
  - 2 DDR SDRAM controllers
  - 2 DMA controllers
  - External bus controller
  - C2C network controller
[Figure: chip block diagram showing PROC 0, PROC 1, the NUCA L2 cache, the OCN, and the DMA, SDC, EBC, C2C, TEST and PLLS blocks, with external interfaces labeled DDR SDRAM, EBI, IRQ, GPIO, JTAG, CLK and C2C Links. Width labels in the figure: 108, 44, 16, 108, 8x39.]

TRIPS Tile-level Microarchitecture
TRIPS Tiles
- G: Processor control - TLB w/ variable size pages, dispatch, next block predict, commit
- R: Register file - 32 registers x 4 threads, register forwarding
- I: Instruction cache - 16KB storage per tile
- D: Data cache - 8KB per tile, 256-entry load/store queue, TLB
- E: Execution unit - Int/FP ALUs, 64 reservation stations
- M: Memory - 64KB, configurable as L2 cache or scratchpad
- N: OCN network interface - router, translation tables
- DMA: Direct memory access controller
- SDC: DDR SDRAM controller
- EBC: External bus controller - interface to external PowerPC
- C2C: Chip-to-chip network controller - 4 links to XY neighbors

Grid Processor Tiles and Interfaces
- GDN: global dispatch network
- OPN: operand network
- GSN: global status network
- GCN: global control network
[Figure: processor tile array. Top row: I G R R R R. Four rows below: I D E E E E.]

Non-Uniform L2 Cache (NUCA)
- 1MB L2 cache
  - Sixteen tiled 64KB banks
- On-chip network
  - 4x10 2D mesh topology
  - 128-bit links, 366MHz (4.7GB/sec)
  - 4 virtual channels prevent deadlocks
- Requests and replies are wormhole-routed across the network
- Up to 10 memory requests per cycle
- Up to 128 bytes per cycle returned to the processors
- Individual banks reconfigurable as scratchpad
[Figure: PROC 0 and PROC 1 beside the array of L2 banks, with request and reply paths marked.]

TRIPS Chip Implementation

| Parameter | Value |
|---|---|
| Process Technology | 130nm ASIC with 7 metal layers |
| Die Size | 18.3mm x 18.37mm (336 mm^2) |
| Package | 47mm x 47mm BGA |
| Pin Count | 626 signals, 352 Vdd, 348 GND |
| # of placed cells | 6.1 million |
| Transistor count | 170 million (est.) |
| # of routed nets | 6.5 million |
| Total wire length | 1.06 km |
| Power (measured) | 36W at 366MHz, 1.5V (chip has no power mgt.) |
| Clock period | 2.7ns (actual), 4.5ns (worst case sim) |

Experiments show that chip achieves 400MHz at 1.6V
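The derived figures in this table can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers from the slide; the helper name `period_to_mhz` is made up for illustration.

```python
# Figures copied from the slide.
die_w_mm, die_h_mm = 18.3, 18.37
actual_period_ns = 2.7
worst_case_sim_period_ns = 4.5

def period_to_mhz(period_ns):
    """Convert a clock period in nanoseconds to a frequency in MHz."""
    return 1000.0 / period_ns

die_area_mm2 = die_w_mm * die_h_mm
print(f"die area:        {die_area_mm2:.1f} mm^2")
print(f"actual period:   {period_to_mhz(actual_period_ns):.0f} MHz")
print(f"worst-case sim:  {period_to_mhz(worst_case_sim_period_ns):.0f} MHz")
```

The product of the die dimensions rounds to the quoted 336 mm^2. The 2.7ns period corresponds to roughly 370 MHz, close to the 366MHz operating point, while the worst-case simulated period of 4.5ns would have allowed only about 222 MHz.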

Chip Area Breakdown
Overall Chip Area:
- 29% - Processor 0
- 29% - Processor 1
- 21% - Level 2 Cache
- 14% - On-Chip Network
- 7% - Other
Processor Area:
- 30% - Functional Units (ALUs)
- 4% - Register Files & Queues
- 10% - Level 1 Caches
- 13% - Instruction Queues
- 13% - Load & Store Queues
- 12% - Operand Network
- 2% - Branch Predictor
- 16% - Other

TRIPS Motherboard
- 1 motherboard includes:
  - 4 daughter-boards
  - 4 TRIPS chips
  - 8 GBytes DRAM
  - PowerPC 440GP control processor
  - I/O: ethernet, serial, C2C links
  - FPGA I/O interface
- Peak performance
  - 48 GFlops at 366 MHz
  - 180 Watts

TRIPS System I
- 8 TRIPS boards
- 374 Gflops/Gops peak
- 5 boards currently deployed
[Figure: front and back views of the system.]
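The system peak can be reproduced from figures given on earlier slides, assuming that "peak" counts every issue slot as one operation. That counting rule is an assumption, since the slides do not state how the peak was computed.

```python
ISSUE_WIDTH = 16        # instructions per cycle per processor (specifications slide)
PROCS_PER_CHIP = 2      # prototype chip slide
CHIPS_PER_BOARD = 4     # motherboard slide
BOARDS = 8              # this slide
CLOCK_HZ = 366e6

def peak_gops(boards):
    """Peak operations per second, in units of 1e9, for `boards` boards."""
    ops_per_sec = ISSUE_WIDTH * PROCS_PER_CHIP * CHIPS_PER_BOARD * boards * CLOCK_HZ
    return ops_per_sec / 1e9

print(f"one board:    {peak_gops(1):.1f}")
print(f"eight boards: {peak_gops(BOARDS):.1f}")
```

Under this assumption the eight-board figure comes to about 374.8, in line with the 374 quoted here. The single-board figure comes to about 46.8, slightly below the 48 GFlops quoted for the motherboard; the source does not explain the difference.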

TRIPS System Software Stack
- Host PC (x86 Linux)
  - TRIPS Resource Manager (TRM)
  - File system
  - Runtime services
  - Login/debug/etc.
- PPC
  - Local Resource Manager (LRM) listens to HostPC
  - Runs embedded Linux
  - PPC EBI device driver to control TRIPS chips
  - PPC EBI ↔ TRIPS EBC
- TRIPS chips
  - Runs TRIPS apps
  - Interrupts PPC if necessary
  - System calls, exceptions
[Figure: Host PC (x86 Linux), Ethernet switch, and Boards 0-2; Board 0 is drawn with its PPC, EBC, and TRIPS chips 0-3.]

TRIPS Simulator Overview

| Simulator | Purpose | Speed | LoC | Accuracy |
|---|---|---|---|---|
| tsim_arch | ISA emulator: ISA and SW design | 1M instr/sec | 5.4K | None |
| tsim_proc | uarch simulator (1 proc.): perf. analysis, HW validation | 1-2K instr/sec | 37.2K | 5% |
| tsim_cyc | uarch cycle estimator: SW perf. analysis | 500K instr/sec | 7.7K | 20-30% |
| tsim_sys | multiprocessor and system: parallel apps, system software | tsim_cyc/procs | 5.2K | ~30% |
| tsim_ocn | interconnect and NUCA cache: uarch design, perf. analysis | 200K cyc/sec | 7.8K | 10% |
| tsim_nuca | flexible NUCA simulator: architecture tradeoffs | 400K cyc/sec | 5.2K | 20% |
| tmax | flexible uarch simulator: TRIPS extension studies | 100K instr/sec | 33K | ~15% |

- tsim processor simulators share common infrastructure (5.2K LoC)
- Total simulator code: 126K LoC
- TRIPS RTL design - 229K LoC
  - Processor: 169K LoC
  - NUCA + peripherals: 60K LoC
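The speed column translates directly into turnaround time. The sketch below converts the instruction-rate entries into wall-clock hours for a hypothetical one-billion-instruction run; the workload size is an assumption chosen only to make the spread visible, and the cycle-rate simulators (tsim_ocn, tsim_nuca) are left out because their speeds are not in instructions per second.

```python
# Speeds from the table, in simulated instructions per second, as (low, high).
# tsim_proc is listed as 1-2K instr/sec, so both ends of the range are kept.
SPEEDS = {
    "tsim_arch": (1_000_000, 1_000_000),
    "tsim_cyc": (500_000, 500_000),
    "tmax": (100_000, 100_000),
    "tsim_proc": (1_000, 2_000),
}

WORKLOAD = 1_000_000_000  # hypothetical run length, in instructions

def wall_clock_hours(instructions, instr_per_sec):
    """Hours needed to simulate `instructions` at the given rate."""
    return instructions / instr_per_sec / 3600.0

for name, (low, high) in SPEEDS.items():
    best = wall_clock_hours(WORKLOAD, high)
    worst = wall_clock_hours(WORKLOAD, low)
    print(f"{name:10s} {best:7.1f} to {worst:7.1f} hours")
```

At these rates the same run takes well under an hour on tsim_arch and on the order of a week or more on tsim_proc, which is one way to see why a project would keep several simulators at different accuracy levels.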

Design Phases
[Figure: project timeline with columns 2000-2002, 2003, 2004, 2005, 2006. The placement of each item along the time axis is not recoverable from the extracted text.]
- Phases shown:
  - Early architecture development (Grid Processor and NUCA)
  - High-level simulation, experiments
  - Chip and system specification
  - Construction of cycle-simulator
  - Tile-level RTL and verification
  - Chip integration and verification
  - Floorplanning, electrical design, physical design
  - tsim/RTL validation
  - Manufacturing
- Simulators marked on the timeline: trimaran-based simulator, first ISA simulator, tsim_nuca, tsim_proc, tsim_services, tsim_ocn, tsim_arch, tsim_sys, tsim_cyc, tmax

TRIPS ISA Design
- First TRIPS exploration (Micro ‘01)
  - Trimaran VLIW compiler (block formation)
  - Instruction rescheduler for ALU array
  - Custom high-level simulator
  - Useful - but a long way from our final implementation
- TRIPS ISA #1
  - Specification, assembler, simulator
  - Flawed in a number of ways
    - Predication model was broken
    - Instruction encodings were complicated
    - Didn’t have all of the byte operations
- TRIPS ISA #2
  - Implemented in tsim_arch (C++)
  - Executes 1 block at a time, follows data dependences
  - Statistics: instruction counts, dataflow depth
  - Experiments proved out ISA, added features
    - Store null operations, constant generation
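The two statistics named above are cheap to gather once a block is represented as a dependence graph. The sketch below assumes that "dataflow depth" means the longest chain of dependent instructions in a block; the slide does not define the term, and neither the function nor the data layout is taken from tsim_arch.

```python
from functools import lru_cache

def dataflow_depth(targets):
    """Length, in instructions, of the longest dependence chain in a block.

    `targets[i]` lists the instructions that consume the result of
    instruction i. The graph is assumed to be acyclic.
    """
    @lru_cache(maxsize=None)
    def depth_from(i):
        # One instruction plus the deepest chain among its consumers.
        return 1 + max((depth_from(c) for c in targets[i]), default=0)

    return max((depth_from(i) for i in range(len(targets))), default=0)

# Hypothetical 6-instruction block with edges
#   0 -> 2, 1 -> 2, 2 -> 4, 3 -> 4, 4 -> 5
block_targets = [[2], [2], [4], [4], [5], []]
stats = {
    "instruction_count": len(block_targets),
    "dataflow_depth": dataflow_depth(block_targets),
}
print(stats)
```

A block with many instructions but a small depth has plenty of parallelism for the ALU array to exploit, which is presumably why the two numbers are reported together.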

TRIPS Microarchitecture Design
- Tile-level specifications and interfaces
- Cycle-precise C++ performance models
  - tsim_proc - all processor uarch features
    - Fully pipelined design of processor
    - Performance analysis of processor protocols (fetch, bypass, commit, etc.)
    - Common infrastructure for pipeline (wire/register models)
  - tsim_ocn - same for NUCA + interconnect
- Uses
  - Performance analysis: accurate but slow
  - Reference model for RTL design (all latencies)
  - Functional and performance validation
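A common way to build the kind of wire/register infrastructure mentioned above is a two-phase update: during a cycle every stage reads current register outputs and schedules writes, and all registers latch together at the clock edge. The sketch below shows that general technique in Python. The `Register` and `Pipeline` classes are hypothetical and are not taken from tsim_proc, which the slide says is written in C++.

```python
class Register:
    """Clocked storage: a write becomes visible only after the next tick."""

    def __init__(self, reset=None):
        self.q = reset     # value readers see during the current cycle
        self._d = reset    # value that will be latched at the clock edge

    def write(self, value):
        self._d = value

    def tick(self):
        self.q = self._d


class Pipeline:
    """A chain of registers; each stage adds one cycle of latency."""

    def __init__(self, stages):
        self.regs = [Register() for _ in range(stages)]

    def cycle(self, value_in):
        # Evaluate phase: each stage reads its upstream neighbour's current
        # output. Order does not matter because nothing changes until tick().
        self.regs[0].write(value_in)
        for prev, cur in zip(self.regs, self.regs[1:]):
            cur.write(prev.q)
        # Clock edge: all registers update together.
        for reg in self.regs:
            reg.tick()
        return self.regs[-1].q


pipe = Pipeline(stages=3)
trace = [pipe.cycle(v) for v in ["a", "b", "c", None, None]]
print(trace)
```

Separating evaluation from latching keeps the model's latencies independent of the order in which stages are evaluated, a useful property for a simulator that serves as a latency reference for RTL.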
