FAST: A Functionally Accurate Simulation Toolset for the Cyclops-64




slide-1
SLIDE 1

FAST: A Functionally Accurate Simulation Toolset for the Cyclops-64 Cellular Architecture

Juan del Cuvillo, Weirong Zhu, Ziang Hu, Guang Gao
Department of Electrical and Computer Engineering, University of Delaware
June 4, 2005
First Workshop on Modeling, Benchmarking and Simulation (MoBS) at the 32nd International Symposium on Computer Architecture

slide-2
SLIDE 2

Outline

  • Cyclops-64 architecture:
      • C64 supercomputer.
      • C64 node.
      • C64 architecture details.
  • FAST design and implementation:
      • Pipeline.
      • Segmented memory.
      • Memory and interconnect contention.
  • Experience:
      • Architecture design verification.
      • Early system software development.
      • Application development.
slide-3
SLIDE 3

Cyclops-64 Supercomputer

slide-4
SLIDE 4

Cyclops-64 Node Logical View

slide-5
SLIDE 5

Cyclops-64 Node Main Features

  • Clock frequency 500MHz.
  • 75 processors:
      • 2 thread units (in-order issue, out-of-order completion).
      • 2 32KB SRAM banks.
      • FPU and integer multiply-accumulate unit.
  • 32KB I-cache shared by 5 processors.
  • A-switch
      • 4 ports, 4GB/s per port.
  • 1GB off-chip SDRAM.
  • Crossbar network:
      • 96 ports, 4GB/s per port.
      • Connects processors (80), I-cache (4), DRAM (4), A-switch (6), host interface (2).

slide-6
SLIDE 6

Cyclops-64 Architecture Relevant Features

  • Integrates processing, memory and communication.
      • 150 thread units, 4.7MB on-chip SRAM, crossbar switch and A-switch device.
  • No resource virtualization
      • Non-preemptive execution.
      • OS will not interrupt user program.
      • No HW virtual memory manager.
      • Memory hierarchy visible to the programmer.
  • ISA provides:
      • Support for thread level execution.
      • In-memory atomic operations.
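
The headline figures on this slide and the preceding one are mutually consistent, which a few lines of arithmetic can confirm. The sketch below uses only numbers quoted on the slides; the variable names are illustrative, and treating 1 MB as 1024 KB is an assumption.

```python
# Figures quoted on the slides (names are illustrative).
PROCESSORS = 75
THREAD_UNITS_PER_PROCESSOR = 2
SRAM_BANKS_PER_PROCESSOR = 2
SRAM_BANK_KB = 32

thread_units = PROCESSORS * THREAD_UNITS_PER_PROCESSOR
on_chip_sram_kb = PROCESSORS * SRAM_BANKS_PER_PROCESSOR * SRAM_BANK_KB
on_chip_sram_mb = on_chip_sram_kb / 1024  # assumption: 1 MB = 1024 KB

# Crossbar port allocation as listed on the node slide.
crossbar_ports = {
    "processors": 80,
    "I-cache": 4,
    "DRAM": 4,
    "A-switch": 6,
    "host interface": 2,
}
total_ports = sum(crossbar_ports.values())

print(thread_units)               # 150
print(round(on_chip_sram_mb, 1))  # 4.7
print(total_ports)                # 96
```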
slide-7
SLIDE 7

Simulation Requirements

  • Architecture team:
      • Multicore technology.
      • Design verification.
      • Design space exploration.
  • System software group:
      • Toolchain development & testing.
      • Runtime system design.
  • Users:
      • Application development.
      • Performance estimation.
slide-8
SLIDE 8

How does FAST meet the requirements?

  • Multichip, multithreaded C64 system.
  • Functionally accurate (not cycle accurate).

  • Timing sensitive.
  • Binary compatible.
  • Instrumentability.
  • Execution driven.
slide-9
SLIDE 9

How does FAST meet the requirements?

  • Instruction execution.
  • Exception handling.
  • Segmented memory space.
  • Execution trace and statistics.
  • Memory and interconnect contention.
  • A-switch device.
  • Debugger.
slide-10
SLIDE 10

Instruction Pipeline: Fetch and Decode

Instruction fetch: accounts for the delay of PIB and I-cache misses, but not for branch prediction.

slide-11
SLIDE 11

Instruction Pipeline: Read Registers

1 cycle for all 1- and 2-operand instructions, 2 cycles for 3-operand instructions (FMAx). In-order issue.

slide-12
SLIDE 12

Instruction Pipeline: Execution

RISC-like instruction execution model based on eXecution/Delay (x/d) pairs. Independent instructions are executed in parallel.

slide-13
SLIDE 13

Instruction Pipeline: Instruction Timing

Instruction type                  x    d
Branch                            2
Count pop.                        1    1
Int. multiply                     1    5
Int. divide, remainder            1    33
Float. add, multiply, convert     1    5
Float. multiply-add               1    10
Float. divide                     1    33
Float. square root                1    56
Memory op (local SPM)             1    2
Memory op (global SRAM)           1    20
Memory op (off-chip DRAM)         1    36
Others                            1
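
One way to read the x/d pairs in the timing table is sketched below. It assumes that x is the number of cycles an instruction keeps the issuing thread unit busy and that d is the additional delay before its result can be used by a dependent instruction; the slides do not spell this out, so treat it as an interpretation rather than FAST's implementation. The `TIMING` keys, the `schedule` function and the three-field instruction tuples are hypothetical names introduced for illustration, and the register-read stage is ignored.

```python
# (x, d) pairs copied from the timing table; the keys are illustrative names.
TIMING = {
    "int_mul": (1, 5),
    "fp_add": (1, 5),
    "fp_madd": (1, 10),
    "load_spm": (1, 2),
    "load_sram": (1, 20),
}

def schedule(program):
    """Return the issue cycle of each instruction under in-order issue.

    program: list of (opcode, dest_register, source_registers).
    Assumption: an instruction occupies the thread unit for x cycles and
    its result becomes available d cycles after that.
    """
    ready = {}      # register -> cycle at which its value is available
    next_free = 0   # first cycle at which the thread unit can issue again
    issue_cycles = []
    for opcode, dest, sources in program:
        x, d = TIMING[opcode]
        # In-order issue: wait for the unit and for every source operand.
        issue = max([next_free] + [ready.get(r, 0) for r in sources])
        issue_cycles.append(issue)
        next_free = issue + x
        ready[dest] = issue + x + d
    return issue_cycles

program = [
    ("load_sram", "r1", []),          # issues at 0, r1 ready at 0 + 1 + 20 = 21
    ("load_spm", "r2", []),           # independent: issues at 1, r2 ready at 4
    ("fp_add", "r3", ["r1", "r2"]),   # must wait for r1: issues at 21
]
print(schedule(program))  # [0, 1, 21]
```

The second load issues while the first is still outstanding, which is the sense in which independent instructions overlap; the dependent add stalls until the slower operand arrives.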

slide-14
SLIDE 14

Instruction Pipeline: Commit

Put results away. Out-of-order completion. Register-file write conflicts are not accounted for.

slide-15
SLIDE 15

Commit (cont)

  • Out-of-order execution: x/d model.
  • But threads run synchronously; single clock signal.
  • Special instructions imply inter-thread synchronization.
  • FAST synchronizes thread execution at the commit stage.
  • If a thread is ahead, it waits. The slowest thread updates the clock.

[Figure: commit timeline for threads T1 and T2; event labels: 1 commit, 1 wait, 2 commit, 3 commit, 3 wait.]
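
The commit-stage rule above (a thread that is ahead waits, and the slowest thread updates the clock) can be sketched as a small loop. This is a minimal illustration under stated assumptions, not FAST's code: each thread is reduced to a list of local cycles at which its instructions become ready to commit, and the function name `run` and the input format are hypothetical.

```python
def run(threads):
    """Apply commits in global clock order.

    threads: dict mapping a thread name to a list of increasing local
    cycles at which its instructions are ready to commit (assumed format).
    Returns the list of (clock, thread) commits in the order applied.
    """
    pending = {name: list(cycles) for name, cycles in threads.items()}
    clock = 0
    log = []
    while any(pending.values()):
        # The slowest thread is the one whose next commit has the lowest
        # cycle; ties go to the thread listed first.
        name = min(
            (n for n in pending if pending[n]),
            key=lambda n: pending[n][0],
        )
        clock = pending[name].pop(0)  # the slowest thread updates the clock
        log.append((clock, name))     # every other thread is ahead and waits
    return log

print(run({"T1": [1, 2, 3], "T2": [3]}))
# [(1, 'T1'), (2, 'T1'), (3, 'T1'), (3, 'T2')]
```

In this example T2 is ahead of the clock until cycle 3, so it waits while T1 commits at cycles 1 and 2.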

slide-16
SLIDE 16

Segmented Memory Space

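[Figure: segmented memory map. Segment labels: AULI, AULS, AULD. Address labels: 0x00000000, 0x40000000, 0x80000000. Regions: global/interleaved memory, local/scratchpad memories (scratchpads labeled SPM0, SPM1 and SPM149), off-chip/DRAM memory.]

Non-uniform shared address space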

slide-17
SLIDE 17

Memory and Interconnect Contention

slide-18
SLIDE 18

Contention in the Crossbar

[Figure: contention in the crossbar; labels: STORE X, LOAD X, DATUM X.]

slide-19
SLIDE 19

Memory Bandwidth Limitation

[Figure: memory bandwidth limitation; labels: STORE X, FULL.]

slide-20
SLIDE 20

Interconnect Contention

[Figure: interconnect contention; labels: LOAD X, FULL.]

slide-21
SLIDE 21

A-Switch Device

  • Under testing.
  • Reads from / writes to memory.
  • Not connected to the crossbar/memory module.
  • Overhead not accounted for.
  • Estimation for multi-chip programs less accurate.

slide-22
SLIDE 22

Debugger

  • Embedded assembly-level debugger.
  • GDB-based source level debugger:
      • Handles single-threaded programs.
      • Work in progress for multi-chip multithreaded programs.

slide-23
SLIDE 23

Toolchain Verification

C64 toolchain: supports C64 ISA, segmented memory space, etc.

slide-24
SLIDE 24

System Software Toolchain Verification

Methodology:

  • Source: C/Fortran/assembly.
  • Generate executable.
  • Run binary in FAST simulator.
  • If the result is as expected, the toolchain is OK.

2 Phases:

  • Thorough ISA testing: manually inspect trace file.
  • Toolchain coverage: large number of test cases, from compiler regression testsuite to sizable applications.

slide-25
SLIDE 25

Architecture Design Verification

  • FAST generates execution trace.
  • Trace from VHDL simulator.
  • Compare program execution, instruction by instruction.
  • Help validate VHDL simulator, i.e., chip design.
  • Continue testing once hardware platform (emulator) is available.
  • C32 DIMES – CYSIM (SC03, SC04)
  • C64 Mrs. Cyclops – FAST (SC05)
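
Comparing two executions instruction by instruction amounts to finding the first record where the traces disagree. The sketch below shows that idea only; the record format, a (program counter, mnemonic) tuple, and the function name `first_divergence` are hypothetical, since the slides do not describe the actual FAST or VHDL trace formats.

```python
from itertools import zip_longest

def first_divergence(trace_a, trace_b):
    """Return (index, entry_a, entry_b) for the first mismatch, or None.

    Each trace is a sequence of per-instruction records. A trace that ends
    early is reported as a mismatch against None.
    """
    for index, (a, b) in enumerate(zip_longest(trace_a, trace_b)):
        if a != b:
            return index, a, b
    return None

# Hypothetical traces: (program counter, mnemonic) per executed instruction.
fast_trace = [(0x100, "add"), (0x104, "load"), (0x108, "store")]
vhdl_trace = [(0x100, "add"), (0x104, "load"), (0x10C, "store")]
print(first_divergence(fast_trace, vhdl_trace))
# (2, (264, 'store'), (268, 'store'))
```

Reporting only the first divergence is a deliberate choice: once two executions disagree, later records usually differ as a consequence and add little information.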
slide-26
SLIDE 26

System Software Development

  • Early evaluation of multithreaded runtime system library:
      • Test functionality of the TNT* library.
      • Estimation of thread creation, termination, reuse times.
      • Study of spin lock algorithms.
  • CNET communication library development.

(*) In Workshop on Massively Parallel Processing (IPDPS05)

slide-27
SLIDE 27

Application Development

  • FAST verification: Do the results match what the architecture is capable of?
  • Microbenchmarks:
      • TableToy
      • Matrix-matrix-multiply
slide-28
SLIDE 28

Experience

  • TableToy: Measure Giga Updates Per Second (GUPS)

    tmp1 = stable[j]        (load)
    tmp2 = table[i]         (load)
    val = tmp1 xor tmp2     (xor)
    table[i] = val;         (store)
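
The update listed above (two loads, an xor and a store) can be written out as a small runnable kernel. This is a sketch under stated assumptions: the slide does not say how the indices i and j are generated, so a seeded pseudo-random generator stands in for that, and the names `table_toy` and `gups` are illustrative. Any rate it prints describes the host running Python, not the C64.

```python
import random
import time

def table_toy(table, stable, updates, seed=0):
    """Apply `updates` XOR updates to `table` in place."""
    rng = random.Random(seed)  # ASSUMPTION: index generation is not given on the slide
    for _ in range(updates):
        i = rng.randrange(len(table))
        j = rng.randrange(len(stable))
        tmp1 = stable[j]       # load
        tmp2 = table[i]        # load
        val = tmp1 ^ tmp2      # xor
        table[i] = val         # store

def gups(updates, seconds):
    """Giga updates per second."""
    return updates / seconds / 1e9

table = [0] * 1024
stable = list(range(128))
start = time.perf_counter()
table_toy(table, stable, updates=10_000)
elapsed = time.perf_counter() - start
print(f"{gups(10_000, elapsed):.6f} GUPS (host Python, not a C64 figure)")
```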

slide-29
SLIDE 29

GUPS with TableToy

Little throughput.

  • Pseudo-random number generation.
  • Bank conflicts.

slide-30
SLIDE 30

GUPS with NewToy (SRAM)

75% maximum throughput

slide-31
SLIDE 31

MUPS with NewToy (DRAM)

DRAM maximum bandwidth 250 MUPS

slide-32
SLIDE 32

MFLOPS Matrix-matrix Multiply

Baseline: 17 MFLOPS
Optimized: 215 MFLOPS/thread

  • approx. 500 MFLOPS/proc
  • ~50% of maximum

slide-33
SLIDE 33

ACKNOWLEDGMENTS

  • Monty Denneau (IBM)
  • Henry Warren (IBM)
  • José Castaños (IBM)
  • Christos Georgiou (IBM)
  • ET International, Inc.
  • Our sponsors
  • Alban Douillet
  • Brice Dobry
  • Ge Gan
  • Geoff Gerfin
  • John Tully
  • Weirong Zhu
  • Wesley Toland
  • Ziang Hu
  • Fei Chen
  • Hirofumi Sakane
  • Yuhei
  • Vishal Karna