FAST: A Functionally Accurate Simulation Toolset for the Cyclops-64

  1. First Workshop on Modeling, Benchmarking and Simulation (MoBS) at the 32nd International Symposium on Computer Architecture. FAST: A Functionally Accurate Simulation Toolset for the Cyclops-64 Cellular Architecture. Juan del Cuvillo, Weirong Zhu, Ziang Hu, Guang Gao. Department of Electrical and Computer Engineering, University of Delaware. June 4, 2005.

  2. Outline ● Cyclops-64 architecture: C64 supercomputer; C64 node; C64 architecture details. ● FAST design and implementation: pipeline; segmented memory; memory and interconnect contention. ● Experience: architecture design verification; early system software development; application development.

  3. Cyclops-64 Supercomputer

  4. Cyclops-64 Node Logical View

  5. Cyclops-64 Node Main Features ● Clock frequency 500MHz. ● 75 processors, each with: 2 thread units (in-order issue, out-of-order completion); 2 32KB SRAM banks; FPU and integer multiply-accumulate unit. ● 32KB I-cache shared by 5 processors. ● A-switch: 4 ports, 4GB/s per port. ● 1GB off-chip SDRAM. ● Crossbar network: 96 ports, 4GB/s per port; connects processors (80), I-cache (4), DRAM (4), A-switch (6), host-interface (2).

  6. Cyclops-64 Architecture Relevant Features ● Integrates processing, memory and communication: 150 thread units, 4.7MB on-chip SRAM, crossbar switch and A-switch device. ● No resource virtualization: non-preemptive execution; OS will not interrupt user program; no HW virtual memory manager; memory hierarchy visible to the programmer. ● ISA provides: support for thread-level execution; in-memory atomic operations.
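The headline figures on the two feature slides can be cross-checked with a few lines of arithmetic. The sketch below is only a sanity check on the quoted numbers, assuming 1MB = 1024KB; the variable names are illustrative. Note that the crossbar client list counts 80 processor ports while the feature list quotes 75 processors; the sketch takes both figures as quoted.

```python
# Figures quoted on the node feature slides.
PROCESSORS = 75
THREADS_PER_PROCESSOR = 2
SRAM_BANKS_PER_PROCESSOR = 2
SRAM_BANK_KB = 32

# Crossbar clients as listed on the slide (port counts).
CROSSBAR_CLIENTS = {
    "processors": 80,
    "I-cache": 4,
    "DRAM": 4,
    "A-switch": 6,
    "host-interface": 2,
}

thread_units = PROCESSORS * THREADS_PER_PROCESSOR
sram_kb = PROCESSORS * SRAM_BANKS_PER_PROCESSOR * SRAM_BANK_KB
sram_mb = sram_kb / 1024  # ASSUMPTION: 1MB = 1024KB
crossbar_ports = sum(CROSSBAR_CLIENTS.values())

print(thread_units, round(sram_mb, 1), crossbar_ports)
```

The three values agree with the 150 thread units, 4.7MB of on-chip SRAM and 96 crossbar ports stated on the slides.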

  7. Simulation Requirements ● Architecture team: ● Multicore technology. ● Design verification. ● Space exploration. ● System software group: ● Toolchain development & testing. ● Runtime system design. ● Users: ● Application development. ● Performance estimation.

  8. How does FAST meet the requirements? ● Multichip, multithreaded C64 system. ● Functionally accurate (not cycle accurate). ● Timing sensitive. ● Binary compatible. ● Instrumentability. ● Execution driven.

  9. How does FAST meet the requirements? ● Instruction execution. ● Exception handling. ● Segmented memory space. ● Execution trace and statistics. ● Memory and interconnect contention. ● A-switch device. ● Debugger.

  10. Instruction Pipeline: Fetch and Decode ● Instruction fetch: accounts for the delay on PIB and I-cache misses, but not for branch prediction.

  11. Instruction Pipeline: Read Registers ● 1 cycle for all 1- and 2-operand instructions, 2 cycles for 3-operand instructions (FMAx). ● In-order issue.

  12. Instruction Pipeline: Execution ● RISC-like instruction execution model based on eXecution/Delay (x/d) pairs. ● Independent instructions are executed in parallel.

  13. Instruction Pipeline: Instruction Timing
      Instruction type                  x    d
      Branch                            2    0
      Count pop.                        1    1
      Int. multiply                     1    5
      Int. divide, remainder            1   33
      Float. add, multiply, convert     1    5
      Float. multiply-add               1   10
      Float. divide                     1   33
      Float. square root                1   56
      Memory op (local SPM)             1    2
      Memory op (global SRAM)           1   20
      Memory op (off-chip DRAM)         1   36
      Others                            1    0
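To make the x/d timing model concrete, here is a minimal sketch of how such pairs could drive a single in-order thread. It is not FAST's implementation. It assumes that x is the number of cycles the thread unit is busy issuing an instruction and that d is the extra delay before the result can be consumed; the `schedule` function, the instruction tuples and the register names are hypothetical. The (x, d) values are copied from the timing slide.

```python
# (x, d) pairs taken from the instruction timing slide.
TIMING = {
    "branch": (2, 0),
    "int_mul": (1, 5),
    "fp_madd": (1, 10),
    "mem_spm": (1, 2),    # local scratchpad
    "mem_sram": (1, 20),  # global SRAM
    "mem_dram": (1, 36),  # off-chip DRAM
    "other": (1, 0),
}


def schedule(program):
    """Return (issue_cycles, finish_cycle) for one in-order thread.

    program is a list of (kind, dest_register, source_registers).
    ASSUMPTION: an instruction keeps the thread busy for x cycles and
    its result becomes available d cycles after that.
    """
    ready = {}   # register -> cycle at which its value is available
    clock = 0    # next cycle at which the thread may issue
    finish = 0
    issue_cycles = []
    for kind, dest, sources in program:
        x, d = TIMING[kind]
        # In-order issue: stall until every source operand is ready.
        start = max([clock] + [ready.get(reg, 0) for reg in sources])
        issue_cycles.append(start)
        clock = start + x
        if dest is not None:
            ready[dest] = clock + d
        finish = max(finish, clock + d)
    return issue_cycles, finish


program = [
    ("mem_sram", "r1", []),        # long-latency load
    ("mem_spm", "r2", []),         # independent load issues right behind it
    ("other", "r3", ["r2"]),       # waits for the scratchpad load
    ("other", "r4", ["r1", "r3"]), # waits for the global SRAM load
]
print(schedule(program))
```

The two loads overlap because they are independent, while the last instruction stalls until the global SRAM load delivers its result. That is the behavior the "independent instructions executed in parallel" bullet describes.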

  14. Instruction Pipeline: Commit ● Put results away. ● Out-of-order completion. ● Write conflicts in the register file are not accounted for.

  15. Commit (cont.) ● Out-of-order execution: x/d model. ● But threads run synchronously; single clock signal. ● Special instructions imply inter-thread synchronization. ● FAST synchronizes thread execution at the commit stage. ● A thread that is ahead waits; the slowest thread updates the clock. [Timeline figure: T1 commits, commits, then waits over cycles 1 to 3; T2 waits at cycle 1 and commits at cycle 3.]
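The commit-stage rule (threads that are ahead wait, and the slowest thread advances the clock) can be sketched as a small event loop. This illustrates the idea only and is not FAST's code. `run_lockstep`, the per-instruction latency lists and the log format are hypothetical, and "slowest" is read here as the thread whose next commit is due earliest.

```python
def run_lockstep(threads):
    """Simulate commit-stage synchronization.

    threads maps a thread name to a list of per-instruction latencies.
    Returns a log of (clock, thread_name, instruction_index) commits.
    """
    local_time = {name: 0 for name in threads}
    next_instr = {name: 0 for name in threads}
    log = []
    while any(next_instr[n] < len(threads[n]) for n in threads):
        active = [n for n in threads if next_instr[n] < len(threads[n])]
        # Cycle at which each active thread's next instruction is due.
        due = {n: local_time[n] + threads[n][next_instr[n]] for n in active}
        # The thread that is furthest behind decides the global clock.
        clock = min(due.values())
        for name in sorted(active):
            if due[name] == clock:
                log.append((clock, name, next_instr[name]))
                local_time[name] = clock
                next_instr[name] += 1
            # Otherwise the thread is ahead of the clock and waits.
    return log


print(run_lockstep({"T1": [1, 1], "T2": [3]}))
```

With these made-up latencies T1 commits at cycles 1 and 2 while T2 waits, and T2 commits at cycle 3.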

  16. Segmented Memory Space: non-uniform shared address space ● AULD: off-chip DRAM memory, starting at 0x80000000. ● AULS: local scratchpad memories (SPM 0, SPM 1, ..., SPM 149), starting at 0x40000000. ● AULI: global interleaved memory, starting at 0x00000000.
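A simulator has to map each address to one of these three segments before it can apply the matching latency. The sketch below shows one way to do that decode. It assumes 32-bit addresses and that each segment extends up to the next base address, neither of which the slide states; `segment_of` is a hypothetical helper.

```python
# Segment base addresses as shown on the slide, highest first.
SEGMENT_BASES = [
    (0x80000000, "AULD"),  # off-chip DRAM
    (0x40000000, "AULS"),  # local scratchpad memories
    (0x00000000, "AULI"),  # global interleaved memory
]


def segment_of(address):
    """Return (segment_name, offset_within_segment) for an address.

    ASSUMPTION: addresses are 32 bits wide and each segment runs from
    its base up to the next segment's base.
    """
    if not 0 <= address <= 0xFFFFFFFF:
        raise ValueError("expected a 32-bit address")
    for base, name in SEGMENT_BASES:
        if address >= base:
            return name, address - base


print(segment_of(0x40000010))
```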

  17. Memory and Interconnect Contention

  18. Contention in the Crossbar [Figure labels: LOAD X, STORE X, DATUM X]

  19. Memory Bandwidth Limitation [Figure labels: STORE X, FULL]

  20. Interconnect Contention [Figure labels: LOAD X, FULL]

  21. A-Switch Device ● Under testing. ● Reads from / writes to memory. ● Not connected to the crossbar/memory module. ● Overhead not accounted for. ● Estimation for multi-chip programs less accurate.

  22. Debugger ● Embedded assembly-level debugger. ● GDB-based source-level debugger: handles single-threaded programs; work in progress for multi-chip multi-threaded programs.

  23. Toolchain Verification C64 toolchain: supports C64 ISA, segmented memory space, etc.

  24. System Software Toolchain Verification ● Methodology: source in C/Fortran/assembly; generate executable; run binary in FAST simulator; if the result is as expected, the toolchain is OK. ● 2 phases: thorough ISA testing (manually inspect trace file); toolchain coverage (large number of test cases, from the compiler regression test suite to sizable applications).

  25. Architecture Design Verification ● FAST generates execution trace. ● Trace from VHDL simulator. ● Compare program execution, instruction by instruction. ● Help validate VHDL simulator, i.e., chip design. ● Continue testing once hardware platform (emulator) is available. ● C32 DIMES – CYSIM (SC03, SC04) ● C64 Mrs. Cyclops – FAST (SC05)
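Comparing two execution traces instruction by instruction amounts to finding the first position at which they disagree. The sketch below assumes each trace is already a list of comparable records. The `(pc, instruction)` tuples and the function name are hypothetical, since the slides do not describe either simulator's actual trace format.

```python
from itertools import zip_longest


def first_divergence(trace_a, trace_b):
    """Return (index, record_a, record_b) at the first mismatch, or None.

    A trace that ends early is reported as diverging at the point where
    it stops (its missing record shows up as None).
    """
    pairs = zip_longest(trace_a, trace_b)
    for index, (rec_a, rec_b) in enumerate(pairs):
        if rec_a != rec_b:
            return index, rec_a, rec_b
    return None


# Hypothetical records: (program counter, disassembled instruction).
functional_trace = [(0x100, "add r1,r2,r3"), (0x104, "ld r4,0(r1)")]
rtl_trace = [(0x100, "add r1,r2,r3"), (0x104, "ld r4,4(r1)")]
print(first_divergence(functional_trace, rtl_trace))
```

Reporting only the first mismatch is a deliberate choice in this sketch: once two traces diverge, later differences are usually consequences of the first one.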

  26. System Software Development ● Early evaluation of multithreaded runtime system library: test functionality of the TNT* library; estimation of thread creation, termination, reuse times; study of spin lock algorithms. ● CNET communication library development. (*) In Workshop of Massive Parallel Processing (IPDPS05)

  27. Application Development ● FAST verification: do the results match what the architecture is capable of? ● Microbenchmarks: TableToy; matrix-matrix multiply.

  28. Experience ● TableToy: measure Giga Updates Per Second (GUPS)
      tmp1 = stable[j]       (load)
      tmp2 = table[i]        (load)
      val = tmp1 xor tmp2    (xor)
      table[i] = val         (store)
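The four-step update on this slide translates directly into a loop. The sketch below is a plain Python rendering for illustration, not the benchmark source. The slide does not say how the indices `i` and `j` are generated, so the caller supplies them, and `table_toy` and `gups` are hypothetical helper names.

```python
def table_toy(table, stable, updates):
    """Apply the slide's load/load/xor/store sequence for each (i, j)."""
    for i, j in updates:
        tmp1 = stable[j]    # load
        tmp2 = table[i]     # load
        val = tmp1 ^ tmp2   # xor
        table[i] = val      # store
    return table


def gups(num_updates, seconds):
    """Giga updates per second for a measured run."""
    return num_updates / seconds / 1e9


# Tiny example with hand-picked indices (real runs use pseudo-random ones).
result = table_toy([0, 0, 0, 0], [5, 9], [(0, 0), (1, 1), (0, 1)])
print(result)
```

Each update costs two loads and one store, so the achievable update rate is bounded by memory bandwidth and sensitive to how the indices map onto memory banks.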

  29. GUPS with TableToy ● Little throughput. ● Pseudo-random number generation. ● Bank conflicts.

  30. GUPS with NewToy (SRAM) ● 75% of maximum throughput.

  31. MUPS with NewToy (DRAM) ● DRAM maximum bandwidth: 250 MUPS.

  32. MFLOPS Matrix-matrix Multiply ● Baseline: 17 MFLOPS. ● Optimized: 215 MFLOPS/thread, approx. 500 MFLOPS/proc, ~50% of maximum.
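The quoted figures can be checked against the 500MHz clock from the node features slide. The sketch below derives the per-processor peak implied by "approx. 500 MFLOPS/proc is about 50% of maximum". Treating that implied peak as one multiply-add (2 floating-point operations) per cycle is an assumption on my part, not something the slides state, and the function names are illustrative.

```python
CLOCK_MHZ = 500            # from the node features slide
THREADS_PER_PROCESSOR = 2  # from the node features slide


def implied_peak_mflops(measured_mflops, fraction_of_peak):
    """Peak rate implied by a measured rate and its share of the peak."""
    return measured_mflops / fraction_of_peak


def flops_per_cycle(peak_mflops, clock_mhz=CLOCK_MHZ):
    """Floating-point operations per clock cycle at the given peak."""
    return peak_mflops / clock_mhz


peak = implied_peak_mflops(500, 0.5)         # slide: approx. 500 at ~50%
per_cycle = flops_per_cycle(peak)            # matches one multiply-add per cycle
two_threads = 215 * THREADS_PER_PROCESSOR    # per-thread figure, doubled
print(peak, per_cycle, two_threads, two_threads / peak)
```

Doubling the per-thread figure gives 430 MFLOPS, or 43% of the implied 1000 MFLOPS peak, which is somewhat below the slide's rounded "approx. 500" and "~50%".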

  33. ACKNOWLEDGMENTS ● Monty Denneau (IBM) ● Henry Warren (IBM) ● José Castaños (IBM) ● Christos Georgiou (IBM) ● ET International, Inc. ● Our sponsors ● Alban Douillet ● Brice Dobry ● Ge Gan ● Geoff Gerfin ● John Tully ● Weirong Zhu ● Wesley Toland ● Ziang Hu ● Fei Chen ● Hirofumi Sakane ● Yuhei ● Vishal Karna
