Day 2 VLSI Microprocessor Design Flow, Session A: Circuit design



SLIDE 1

Day 2

VLSI Microprocessor Design Flow

  • Session A: Circuit design styles
  • Break
  • Session B: Design paths
  • Lunch
  • Session C: Verification
  • Break
  • Session D: Manufacture, fabrication testing, packaging

Today Organized Bottom-Up

  • Circuit design style
  • Full-custom design path
  • Standard cell design path
  • RTL design
  • Verification strategy
  • Packaging
  • Manufacture & testing

Important: real designs proceed at all levels simultaneously.

SLIDE 2

T0 Circuit Design Style

Typical design style for modern microprocessor

Datapaths and memories:

  • Full-custom layout
  • Regular structures
  • Most of the die area
  • Few design bugs
  • Mostly hand-specified procedural layout and routing (some hand layout and routing)
  • Sometimes exotic circuit designs (dynamic, self-timed)

Control logic:

  • Standard cells
  • Irregular structures
  • Most of the complexity
  • Most of the design bugs
  • Placed and routed automatically
  • Conservative static CMOS circuits

T0 Die Breakdown

(Figure: T0 die with regions labeled Std. Cell and Full-Custom.)

SLIDE 3

Global Design Style Decisions

Extremely important:

  • Clock methodology and latch design
  • Power, ground, and clock distribution

Must be settled early since these affect every circuit on the chip.

T0 Clock and Latch Style

  • Input clock signal at 2x on-chip frequency (e.g., 80MHz crystal for 40MHz Spert-II board), divided by 2 on-chip to guarantee 50% duty cycle.
  • Clock buffered up; last stage drives single clock grid across entire chip, <1ns skew across chip, <500ps rise/fall time.
  • Clock output pad to phase lock external circuitry to T0 clock.
  • TSPC dynamic latches (T0 has minimum operating frequency).
  • Also, some special pseudo-static load-enabled latches.
  • Very similar to Alpha 21064 clocking strategy.
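Why dividing a 2x clock by 2 gives a 50% duty cycle can be seen in a small simulation. This is a minimal sketch, assuming an ideal toggle flip-flop that changes state on every rising edge of the input; the function names and the sampled-waveform model are illustrative and not taken from the T0 design.

```python
def input_clock(n_periods, period, high_time):
    """Sampled 2x input clock: high for `high_time` samples of each period."""
    return [1 if (t % period) < high_time else 0
            for t in range(n_periods * period)]

def divide_by_two(clk_2x):
    """Model a toggle flip-flop: output flips on each rising edge of clk_2x."""
    out, state, prev = [], 0, 0
    for level in clk_2x:
        if prev == 0 and level == 1:   # rising edge of the input clock
            state ^= 1
        out.append(state)
        prev = level
    return out

def duty_cycle(wave):
    """Fraction of samples at logic 1."""
    return sum(wave) / len(wave)

# Hypothetical lopsided input: 30% duty cycle, 10 samples per period.
clk_in = input_clock(n_periods=8, period=10, high_time=3)
clk_chip = divide_by_two(clk_in)
print(duty_cycle(clk_in), duty_cycle(clk_chip))
```

Because the output only changes on rising edges, its high and low times each equal one full input period, whatever the input duty cycle is.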

SLIDE 4

T0 Clock Distribution

(Figure: 2x clock input, clock buffer, clock grid (in reality hundreds of wires), clock output.)

T0 Latch Style

Standard-cell controller designed with edge-triggered flip-flops

  • Only negative edge-triggered flip-flops
  • Simpler for state machines
  • Simplifies synthesis timing specification
  • State stall handled with mux around flip-flop - no clock gating
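The stall mux can be modeled behaviorally in a few lines. This is a sketch only: the class and method names are made up for illustration, and it models a generic edge-triggered flip-flop rather than the actual T0 cell.

```python
class StallableFlop:
    """Edge-triggered flip-flop with a recirculating mux on its D input."""

    def __init__(self, reset_value=0):
        self.q = reset_value

    def clock_edge(self, d, stall):
        # When stalled, the mux feeds the current output back into D,
        # so the flop reloads its own value. The clock is never gated.
        self.q = self.q if stall else d
        return self.q

flop = StallableFlop()
trace = [flop.clock_edge(d, stall)
         for d, stall in [(1, False), (0, True), (0, True), (0, False)]]
print(trace)  # [1, 1, 1, 0]
```

The flop still sees every clock edge, which keeps the clock network free of gating logic and the skew it would add.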

Full-custom datapaths and memories used transparent latches

  • p- and n- type latches transparent on clock low or high respectively
  • Can steal time across clock cycle boundaries
  • Can place latches in convenient place in signal flow to save area
  • Simplifies double-cycling (used in vector register file, some buses)
  • Special stallable n-latch (small area without clock gating)

Designed library of latches verified to operate across all process corners with clock skew/rise/fall spec, and when placed in series with other latches.

SLIDE 5

T0 Power/Ground Distribution

  • Half of all pins were power and ground (204/408).
  • Chip-on-board packaging gave low-inductance path to board (~1nH per wire).
  • Grid across whole chip in wide M1 and M2, strapped wherever possible.
  • Required IR drop less than 5% of Vdd in middle of chip.
  • On-chip gate oxide decoupling capacitors placed everywhere possible, especially under power rails.
  • Enough bypass capacitance for <5% power bounce, even if power/ground wires open circuit for one cycle.
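The last requirement turns into a quick sizing estimate: if the on-chip capacitance must supply the whole chip for one cycle, the droop is dV = I*T/C. The sketch below is a back-of-envelope calculation under that assumption. The 1 A average current is a made-up illustrative figure, not a T0 measurement; the 40MHz clock and 5V supply are the values mentioned elsewhere in these slides.

```python
def min_bypass_capacitance(avg_current_a, clock_hz, vdd, max_droop_fraction=0.05):
    """Smallest capacitance (farads) that keeps supply droop within budget
    when the capacitor alone feeds the chip for one clock cycle."""
    charge_per_cycle = avg_current_a / clock_hz   # Q = I * T
    allowed_droop = max_droop_fraction * vdd      # dV budget in volts
    return charge_per_cycle / allowed_droop       # C = Q / dV

# Assumed 1 A average current (hypothetical), 40 MHz clock, 5 V supply.
c_min = min_bypass_capacitance(avg_current_a=1.0, clock_hz=40e6, vdd=5.0)
print(f"{c_min * 1e9:.0f} nF")
```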

T0 Power/Ground Distribution

(Figure: M1/M2 power grid, with bypass cap. under power rails, additional bypass cap. in empty space, and every other pad power or ground.)
SLIDE 6

T0 Custom Memories

Instruction cache

  • 1KB storage + tags + valid
  • Classic 6T SRAM design
  • One port: differential write (128b) or differential read (32b)
  • 1 word line and 2 bit lines per bit cell
  • Special wire to clear all valid bits in one cycle for cache flush
  • Fast dynamic tag comparator built into tag sense amps - critical path

Scalar Register File

  • 128B storage (32x4B registers)
  • Three ports: One differential write plus two single-ended reads
  • 3 word lines and 4 bit lines per bit cell

Vector Register File (Trickiest piece of circuit design in T0)

  • 2KB storage (16x32x4B registers)
  • Eight ports: three diff. write on clock low, five single-end. read on clock high
  • Self-timed to generate all timing edges in one cycle
  • 5 word lines and 6 bit lines per bit cell
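The word-line and bit-line counts for the three memories fit a simple wire-counting model. The model is an inference from the numbers above, not something the slides state: each port needs one word line, a differential port needs two bit lines, a single-ended port needs one, and ports used on opposite clock phases share wires. The function name and data layout are illustrative.

```python
def wires_per_cell(phases):
    """phases: one list per clock phase of (port_count, is_differential).
    Returns (word_lines, bit_lines), assuming wires are reused across phases."""
    word_lines = bit_lines = 0
    for ports in phases:
        wl = sum(count for count, _ in ports)
        bl = sum(count * (2 if differential else 1)
                 for count, differential in ports)
        word_lines = max(word_lines, wl)
        bit_lines = max(bit_lines, bl)
    return word_lines, bit_lines

icache = wires_per_cell([[(1, True)]])                    # one differential port
scalar_rf = wires_per_cell([[(1, True), (2, False)]])     # 1 write + 2 reads
vector_rf = wires_per_cell([[(3, True)], [(5, False)]])   # writes low, reads high
print(icache, scalar_rf, vector_rf)  # (1, 2) (3, 4) (5, 6)
```

Under this model the vector register file gets eight ports for the wiring cost of five word lines and six bit lines, which is the area saving that double-cycling buys.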

T0 Datapath Design Style

Select datapath pitch, tradeoff between:

  • wasted space for simple cells
  • crunched inefficient design for complex cells

Vector unit has 72λ bit pitch (late change from 80λ to fit reticle). Scalar unit has 80λ bit pitch.

Decide on metal layer assignments.

Data busses in Metal 1, control/clock/Vdd/GND in Metal 2. Roughly half of datapath bit pitch is used for busses passing by cell.

Design library of datapath cells (mostly latches and muxes).

Special cells created where needed (maybe 5% are special)

Mostly static CMOS logic and static pass-transistor logic, some critical places use dynamic logic:

  • Adder carry-chains
  • Branch zero comparator
  • Saturation overflow comparators
SLIDE 7

T0 Datapath Latch Designs

Latches mostly dynamic TSPC plus holders (a la 21064)

(Figure: n-latch and p-latch schematics with signals D, PHI, X, and Q; transistor widths annotated.)

Special Pseudo-Static n-Latch

Restrictive enable control line timing caused problems later

(Figure: pseudo-static n-latch schematic with signals D, Q, X, PHI, LEN, and LENB; transistor sizes annotated.)

SLIDE 8

T0 Datapath Mux Designs

Muxes n-pass-transistor with level restoring p-transistor:

(Figure: 3-input mux schematic with data inputs A, B, and C, select inputs ASEL, BSEL, and CSEL, and output OUT; transistor widths annotated.)

Example Datapath Layout

SLIDE 9

T0 Standard Cell Designs

Started with public domain library, but hand-inspected each cell and threw away/redesigned bad cells

  • Some cells had too many series transistors or bad output driver

Changed every cell to have much wider power/ground rails

  • To avoid IR drop in middle of long standard cell row

Added separate clock rail into every cell

  • Fits into overall clock gridding scheme
  • Ensures controlled skew on clock (don’t want clock auto-routed!)

Designed our own standard cell flip-flops and latches

  • Connects to special clock rail - uses our clocking methodology
  • Latches used to synchronize with datapath signals

Added greater variety of inverters and buffers

  • Existing buffers not big enough to drive loads on our chip
  • More flexibility for synthesis to trade area and delay

T0 Pads

  • Pad design is especially tricky.
  • Many esoteric device structures used to provide protection against latch-up and ESD damage.
  • Obtained HP’s design guidelines under NDA.
  • Designed custom pads using most of HP’s recommendations for pad protection.
  • Pad output drivers used n-type pullup to reduce power consumption - output only swings to ~4V, not 5V.
  • Separate power supply rings for output drivers and core logic.

SLIDE 10

Summary

T0 circuit design mostly conservative, low risk. Robustness engineered into all cells and overall design. Only a few tricks where big wins possible:

  • Fast dynamic datapath logic to shorten critical paths
  • Double-pumped vector register file to save area
  • Novel output drivers to reduce power

Day 2, Session B: Design Paths

  • Full-custom
  • Standard cell
  • Final global checks

SLIDE 11

Full-Custom Tools

Pre-existing tools used:

  • Viewlogic schematic editor (commercial)
  • Magic layout editor and extraction (university)
  • HSpice circuit simulator (commercial)
  • CAzM table-driven circuit simulator (university, now commercial)
  • irsim switch-level simulator (university)
  • gemini layout versus schematic compare (university)
  • Dracula design rule checker (commercial)

In-house tools:

  • flat SPICE netlist flattener/processor
  • tilem procedural layout generator

Full-Custom Design Process

Initial specification with high-level schematic plus verbal communication (most full-custom work done before RTL finished).

Design loop:

  • Viewlogic schematic design (functionality and transistor sizing)
  • Timing simulations with HSpice
  • Functionality simulations with irsim
  • magic layout
  • Extractions with magic (get real parasitics - feed back into schematic)

Iterate until design goals met. Clock cycle initially fixed at <50MHz to prevent over-optimization.
SLIDE 12

Example Viewlogic Schematic

(I-Cache SRAM bit)

(Figure: schematic with nodes IBIT, IBITB, BIT, BITB, and RSEL; transistor widths annotated.)

Example magic Layout

(Two halves of SRAM cache bits)

SLIDE 13

Standard Cell Design Path

  • Initial RTL (Register Transfer Level) in C++.
  • Each RTL control block manually translated into BDS, a limited, combinational-circuit-only hardware description language.
  • bdsyn compiles BDS into blif (Berkeley Logic Interchange Format).
  • blif optimized and synthesized into gates using sis.
  • Gate netlist input to TimberWolf place and route.
  • Also, generate Viewlogic schematic from gate netlist.

RTL Model

  • RTL (Register Transfer Level) design in C++.
  • RTL model is “golden reference” for whole T0 design.
  • Models state in every latch on every clock phase.
  • Ran at 1,500 cycles/second on Sparcstation-20/61.
  • 100-1000 times faster than Verilog or VHDL RTL model.

(More on RTL in next session)

SLIDE 14

BDS Blocks

C++ RTL control logic was manually split into about 20 blocks that the synthesis tool could handle (by trial and error). Each control block manually translated into equivalent BDS. Example BDS code (piece of JTAG block):

    routine run_tdo;
        state tdo<7:0>;
        if tapcin<3> then tdo = regioin
        else if iregin<3> then tdo = regioin
        else tdo = memioin;
        tdob = not tdo;
    endroutine;
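For readers unfamiliar with BDS, the routine above can be read as the following behavioral model. This is a rough sketch that assumes `<3>` selects bit 3 of a signal and that tdo is 8 bits wide as declared; the widths of the other signals are not given in the slide, so the masking here is a guess.

```python
def run_tdo(tapcin, iregin, regioin, memioin):
    """Return (tdo, tdob) as 8-bit values, following the BDS routine."""
    def bit3(word):
        return (word >> 3) & 1

    # Both of the first two branches select regioin, so they collapse to an OR.
    if bit3(tapcin) or bit3(iregin):
        tdo = regioin & 0xFF
    else:
        tdo = memioin & 0xFF
    tdob = ~tdo & 0xFF            # bitwise complement, kept to 8 bits
    return tdo, tdob

print(run_tdo(0b1000, 0, 0xA5, 0x3C))  # (165, 90)
```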

Synthesis with sis

  • Each BDS block was translated into logic equations in blif.
  • Also, had to create timing specs for each block.
  • Optimized and synthesized by sis (Berkeley synthesis package).

Two basic synthesis scripts created:

  • target minimal area
  • target minimal delay

Some critical blocks were tuned with their own custom synthesis scripts. Synthesis could sometimes take infinite time or infinite memory, so we had to split blocks further or rewrite the script.

SLIDE 15

Place and Route

  • Synthesized blocks connected by schematic.
  • Entire control unit then extracted into single gate netlist.
  • Place and route using TimberWolf (simulated annealing).
  • Had to fix TimberWolf to fit control into non-rectangular space.
  • Placed outer loop around entire place and route run to iterate parameters.
  • Last piece of T0 design: 3 months of CAD hacking after everything else finished!
  • Final place and route took 1 week on Sparc-20/61.
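The idea of annealing-based placement with an outer loop over parameters can be sketched in a few lines. This is a toy illustration, not TimberWolf's algorithm: the cost function (half-perimeter wirelength), the swap move, the cooling schedule, and all names and parameter values are assumptions chosen to keep the example small.

```python
import math
import random

def wirelength(placement, nets):
    """Total half-perimeter wirelength; placement maps cell -> (x, y)."""
    total = 0
    for net in nets:
        xs = [placement[c][0] for c in net]
        ys = [placement[c][1] for c in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

def anneal(cells, nets, slots, t_start, cooling, steps, seed):
    """Swap-based simulated annealing; returns (best_cost, best_placement)."""
    rng = random.Random(seed)
    placement = dict(zip(cells, slots))
    cost = wirelength(placement, nets)
    best_cost, best = cost, dict(placement)
    temp = t_start
    for _ in range(steps):
        a, b = rng.sample(cells, 2)
        placement[a], placement[b] = placement[b], placement[a]
        new_cost = wirelength(placement, nets)
        delta = new_cost - cost
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cost = new_cost                      # accept the swap
            if cost < best_cost:
                best_cost, best = cost, dict(placement)
        else:                                    # reject: undo the swap
            placement[a], placement[b] = placement[b], placement[a]
        temp *= cooling
    return best_cost, best

cells = list("abcdefgh")
nets = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"),
        ("e", "f"), ("f", "g"), ("g", "h")]
slots = [(x, y) for y in range(2) for x in range(4)]   # any slot list works
random.Random(0).shuffle(slots)
initial_cost = wirelength(dict(zip(cells, slots)), nets)

# Outer loop: rerun the whole placement over a grid of parameters, keep the best.
runs = [anneal(cells, nets, slots, t_start, cooling, 2000, seed=1)
        for t_start in (1.0, 5.0) for cooling in (0.99, 0.999)]
best_cost, best_placement = min(runs, key=lambda run: run[0])
print(initial_cost, best_cost)
```

Because the slots are just a list of coordinates, a non-rectangular region is handled by listing only the legal positions.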

Example Stdcell Layout

SLIDE 16

Static Timing Analysis

  • Find critical paths in control logic and datapath interface.
  • Manual database of signal timing specs.
  • Scripts extracted RC delays of long wires from layout.

Timing script considered:

  • synthesis predicted timing
  • output drive capability
  • wire capacitive load
  • wire RC delay
  • input timing specs
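How these five quantities combine into a pass/fail check can be sketched with a first-order Elmore delay estimate. This is an illustrative model, not the T0 timing script: the function names and all numeric values are hypothetical, and the wire is treated as a single distributed RC segment.

```python
def arrival_time_ns(t_logic_ns, r_drive_ohm, r_wire_ohm, c_wire_pf, c_load_pf):
    """Elmore estimate of signal arrival time at the receiving input.
    ohm * pF gives picoseconds, so divide by 1000 for nanoseconds."""
    driver_ps = r_drive_ohm * (c_wire_pf + c_load_pf)    # driver charges everything
    wire_ps = r_wire_ohm * (c_wire_pf / 2 + c_load_pf)   # distributed wire RC
    return t_logic_ns + (driver_ps + wire_ps) / 1000.0

def slack_ns(arrival_ns, required_ns):
    """Positive slack means the input timing spec is met."""
    return required_ns - arrival_ns

# Hypothetical path: 10 ns of synthesized logic, 1 kohm driver,
# 500 ohm / 2 pF wire, 0.5 pF receiver load, input needed by 15 ns.
arrival = arrival_time_ns(10.0, 1000.0, 500.0, 2.0, 0.5)
print(arrival, slack_ns(arrival, 15.0))
```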

Fixed any timing violations found by:

  • changing control logic
  • changing datapaths
  • changing wires (fatter wires for lower RC)

Gave up with 33MHz predicted cycle time (very conservative)

Critical Paths

1) Host performing DMA at same time as indexed load/store

  • Have to drive long stall wire with bad RC delay into static latches with difficult timing constraints. No time to change latches.

2) Branches/I-cache

  • BEQ/BNE instructions need XOR plus zero comparator fed to instruction cache fetch in same cycle. Could solve with branch prediction.

3) Address generator/new-old instruction

  • Many possible ways to load address generator input latches depending on current/next instruction vector/scalar. Could fix with more pipelining.

SLIDE 17

Design Rule Checks (DRC)

  • Magic performs dynamic DRC during layout entry.
  • DRC also run at each level of the design hierarchy after procedural layout.
  • Final layout also DRC checked using Dracula, which found a few minor bugs.

Problems with CAD Tools

Bugs

  • many features don’t work - fixed a lot ourselves

Limitations

  • size --- had to recompile with bigger constants (if source available)
  • signal naming --- had to stick to a-z (one case) and 0-9 (no underscores)
  • don’t handle hierarchy --- had to “flatten” circuits on the fly (used Unix pipe)

Bad Design

  • often obvious that author never built real chips (useless features esp. GUIs)
  • too automatic, can’t control what happens (“take it or leave it” tool)
  • requires bulky, constrictive framework
  • wouldn’t work in script or Makefile
  • awkward binary data formats

(Commercial tools no better than university tools, sometimes worse)

SLIDE 18

Q: What Was Best CAD Tool? A: Unix Development Environment!

  • Sed/Awk/Perl used extensively for format conversion
  • RCS used for revision control
  • Shell scripts/Makefiles to automate processes
  • Pipes used for on-the-fly netlist flattening, test vector generation
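What netlist flattening involves can be shown with a short recursive expansion. This is a sketch under assumptions: the dictionary-based netlist format, the "/" path separator, the global net names, and the transistor terminal order are all invented for illustration and do not reflect the formats of the in-house T0 tools.

```python
GLOBAL_NETS = {"vdd", "gnd"}   # assumed global supply names, never renamed

def flatten(subckts, top, prefix="", port_map=None):
    """Yield (instance_path, primitive_type, nets) for every leaf device."""
    port_map = port_map or {}
    for inst, kind, nets in subckts[top]["instances"]:
        # Ports resolve to the parent's nets; internal nets get a path prefix.
        resolved = [n if n in GLOBAL_NETS else port_map.get(n, prefix + n)
                    for n in nets]
        if kind in subckts:                       # hierarchical cell: recurse
            ports = subckts[kind]["ports"]
            yield from flatten(subckts, kind, prefix + inst + "/",
                               dict(zip(ports, resolved)))
        else:                                     # primitive device: emit it
            yield (prefix + inst, kind, resolved)

SUBCKTS = {
    "inv": {"ports": ["in", "out"],
            "instances": [("mp", "pmos", ["out", "in", "vdd"]),
                          ("mn", "nmos", ["out", "in", "gnd"])]},
    "buf": {"ports": ["a", "y"],
            "instances": [("i1", "inv", ["a", "mid"]),
                          ("i2", "inv", ["mid", "y"])]},
}
flat = list(flatten(SUBCKTS, "buf"))
for device in flat:
    print(device)
```

Written as a generator, the flattener emits devices as it walks the hierarchy, which suits use as a stage in a pipe feeding a simulator.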

SLIDE 19

Design Path Summary

  • Many tools (>50), many data formats.
  • Over half of the total T0 design effort was spent on CAD tools! (We began project intent on not developing any new tools)
  • Built design flow over several projects in group.
  • Continually added new tools/methodologies. (Many candidate tools tried and abandoned)

Biggest gaps:

  • good HDL --- would use Verilog now
  • good static timing analysis --- would use TimeMill now

Day 2, Session C: Verification

SLIDE 20

Levels of Design Representation

  • ISA: semantics of instruction set (ISA interpreter)
  • RTL: state on each cycle (RTL simulator)
  • Schematic: transistors (Irsim switch level simulator)
  • Layout: mask layers
  • Real chip: fabricated silicon

Verification Framework

Defined set of “virtual machines”, each defining allowable:

  • registers
  • instructions
  • exceptions
  • memory regions
  • whether cycle accurate
  • and form of test result communication

for valid test programs.

Virtual machines:

  • mips: MIPS-II scalar instruction set
  • t0u: T0 user level instruction set
  • t0raw: T0 user+kernel instruction set
  • t0cyc: T0 user+kernel instruction set + cycle accurate
  • t0die: T0 raw with no SRAM (wafer test)
  • t0diecyc: T0 raw with no SRAM + cycle accurate (wafer test)
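The virtual machine list can be written down as data, which makes it easy to select the machines a given test may target. This is a hypothetical encoding: the class and field names are invented here, and reading "T0 raw" as the t0raw (user+kernel) instruction set is an interpretation of the slide.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VirtualMachine:
    name: str
    instruction_set: str
    cycle_accurate: bool = False
    uses_sram: bool = True        # False for the wafer-test machines

VIRTUAL_MACHINES = [
    VirtualMachine("mips", "MIPS-II scalar"),
    VirtualMachine("t0u", "T0 user"),
    VirtualMachine("t0raw", "T0 user+kernel"),
    VirtualMachine("t0cyc", "T0 user+kernel", cycle_accurate=True),
    VirtualMachine("t0die", "T0 user+kernel", uses_sram=False),
    VirtualMachine("t0diecyc", "T0 user+kernel",
                   cycle_accurate=True, uses_sram=False),
]

wafer_test = [vm.name for vm in VIRTUAL_MACHINES if not vm.uses_sram]
cycle_accurate = [vm.name for vm in VIRTUAL_MACHINES if vm.cycle_accurate]
print(wafer_test, cycle_accurate)
```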

SLIDE 21

Virtual Machine Execution Platforms

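Platforms: SGI R4K Indigo, T0 ISA Interpreter, T0 RTL Simulator, Bare T0 Die, Spert-II Board.

(Table: an X marks each platform that can run a virtual machine. The column position of each mark is not recoverable here; only the number of marks per row is.)

  • mips: X X X X
  • t0u: X X X
  • t0raw: X X X
  • t0cyc: X X
  • t0die: X X X X
  • t0diecyc: X X X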

Example Test Program

    /* Simple test of cpu and memory system life. */
    #include <t0test.h>

    TEST_MIPS                   # Type of virtual machine

    TEST_CODEBEGIN              # Begin test program
    life_test:
            lw      $2, life_test_dat
            addi    $2, 1
            sw      $2, life_test_res
    exit:
    TEST_CODEEND                # End test program

            .data
    TEST_DATABEGIN              # Begin data region for test input and result data.
    life_test_dat:  .word 41
    life_test_res:  .word 0xffffffff
    TEST_DATAEND                # End data region for test input and result data.

SLIDE 22

mips Test Compilation and Execution

(Figure: test compilation and execution flow, by stage.)

  • Test program: test.S
  • Test compilation: testbuild-spert, testbuild-iris
  • Test executables: test.spert, a.out
  • Test run: SGI Indigo R4000 (Irix), t0isatest, t0rtltest
  • Test results: iris.mem, isa.mem, rtl.mem

Producing Switch-Level Test Vectors

(Figure: flow for producing switch-level test vectors.)

  • Test program: test.S, built by testbuild-spert
  • Test executable: test.spert
  • Test rig based on RTL model: t0cpudptr
  • Test vectors: test.irsim
  • Workview schematic: sch/cell.1 files, processed by wspice, flatspice, and spice2sim
  • Schematic netlist: cpudp.sim
  • irsim run: assertion failures?
SLIDE 23

Simulation Speeds (Cycles/Second)

  • ISA: 500,000 (inst/second)
  • RTL: 1,100
  • Schematic/Layout (IRSIM): 0.05
  • Fabbed chip: 45,000,000
  • R4000 (MIPS-II only): 100,000,000

(Simulation speeds measured on Sparcstation-10/51)
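These speeds are easier to appreciate as wall-clock time. The short calculation below uses only the figures from the list above to ask how long each simulator needs to cover one second of real chip operation (45,000,000 cycles); the variable names are illustrative.

```python
SPEEDS_CYCLES_PER_SEC = {
    "RTL": 1_100,
    "IRSIM": 0.05,
    "Fabbed chip": 45_000_000,
}

def seconds_to_simulate(cycles, level):
    """Wall-clock seconds needed to run `cycles` at the given level."""
    return cycles / SPEEDS_CYCLES_PER_SEC[level]

one_chip_second = 45_000_000   # cycles the fabbed chip runs in one second
rtl_hours = seconds_to_simulate(one_chip_second, "RTL") / 3600
irsim_years = seconds_to_simulate(one_chip_second, "IRSIM") / (3600 * 24 * 365)
print(f"RTL: {rtl_hours:.1f} hours, IRSIM: {irsim_years:.1f} years")
```

One chip-second costs roughly half a day of RTL simulation and decades of switch-level simulation, which is why switch-level runs are limited to short test vectors.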

Test Programs

Two classes:

  • Design verification (does the RTL implement a vector micro?)
  • Fabrication testing (does the fabricated chip meet specs?)

These classes do not necessarily overlap:

  • Verification tests don’t necessarily exercise all paths in circuits (e.g., all adder propagations or SRAM data retention).
  • Fab tests won’t tell if RTL has design bug, only that chip matches buggy RTL.

SLIDE 24

Directed + Random Tests

Hand-written directed tests for specific functions.

  • Make sure all single events covered
  • Some events very hard to generate randomly
  • Very time-consuming

Randomly-generated tests for greater coverage

  • Good at finding bugs in combinations of events
  • Fast way of generating lots of test code
  • Difficult to randomly generate valid virtual machine test code

Hand-Written Tests

Nearly 100,000 lines of hand-written assembly test code! (Includes both design verification and fab test code)

              Programs     Lines
  mips             107    14,293
  t0u              199    57,992
  t0raw             63    17,512
  t0cyc              3       537
  t0die             44     6,490
  t0diecyc           1       173
  Total            417    96,997

SLIDE 25

Random Program Generation

First attempt: rantor

rantor incrementally generates random test program, one instruction at a time.

Problems:

  • Only instruction-by-instruction random - can’t generate instruction sequences

  • Difficult to guarantee that random code obeys virtual machine limitations

rantor found quite a few RTL bugs initially, but eventually most bugs were found to be in rantor-generated test programs.

Second Attempt: torture

1) User builds library of random test code sequence generators.

2) torture core randomly selects sequence generator.

3) Sequence generator builds random instruction sequence with virtual registers (both visible and invisible in final test state).

4) torture interleaves multiple sequences, randomly allocating virtual registers to random physical registers.

Test programs guaranteed to obey virtual machine constraints. All written as C++ class library.
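The four steps can be sketched in miniature. This is an illustrative toy in Python, not the C++ torture library: the two sequence generators, the instruction tuples, and the register pool `$2`..`$25` are all invented for the example, and the visible/invisible register distinction is not modeled.

```python
import random

def add_sequence(rng):
    """Sequence generator: load two values and add them."""
    a, b = rng.randrange(100), rng.randrange(100)
    return [("li", "v0", a), ("li", "v1", b), ("add", "v2", "v0", "v1")]

def shift_sequence(rng):
    """Sequence generator: load a value and shift it left."""
    a, s = rng.randrange(100), rng.randrange(1, 4)
    return [("li", "v0", a), ("sll", "v1", "v0", s)]

def allocate(seq, free_regs):
    """Rename a sequence's virtual registers to private physical registers."""
    names = {}
    def phys(operand):
        if isinstance(operand, str) and operand.startswith("v"):
            if operand not in names:
                names[operand] = free_regs.pop()
            return names[operand]
        return operand
    return [(ins[0],) + tuple(phys(op) for op in ins[1:]) for ins in seq]

def interleave(seqs, rng):
    """Randomly merge sequences, preserving order within each sequence."""
    pending = [list(s) for s in seqs if s]
    program = []
    while pending:
        seq = rng.choice(pending)
        program.append(seq.pop(0))
        if not seq:
            pending.remove(seq)
    return program

def torture(n_sequences, seed, generators=(add_sequence, shift_sequence)):
    rng = random.Random(seed)
    free_regs = [f"${i}" for i in range(2, 26)]   # assumed usable registers
    rng.shuffle(free_regs)
    seqs = [allocate(rng.choice(generators)(rng), free_regs)
            for _ in range(n_sequences)]
    return interleave(seqs, rng)

for instruction in torture(n_sequences=4, seed=0):
    print(instruction)
```

Since each sequence owns its physical registers, any interleaving preserves every sequence's result, which is what lets the generator guarantee valid programs by construction.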

SLIDE 26

Random Environment Events

Run test programs on simulated machine both in quiet environment and also:

  • with random host and timer interrupts
  • with random host DMA I/O and scan-chain activity

Random Testing Results

  • Billions of RTL cycles run on network of workstations at ICSI (continuously running over several months on 4-20 workstations).
  • Highly successful: 26 bugs found through random tests.
  • Any bug found by random testing was added to the regression tests.

SLIDE 27

LVS: Static Netlist Comparison

(Figure: magic extract produces the layout netlist from the layout; gemini LVS checks it for equivalence (=?) against the schematic netlist.)

Verification Summary

  • Intensive effort
  • Highly successful

No known logic bugs in first-pass silicon!

SLIDE 28

Day 2, Session D: Manufacture, Testing, Packaging

Manufacturing Path

  • T0 fabbed by Hewlett-Packard via MOSIS
  • Wafers delivered to test house
  • Wafer sort to select good die for bonding
  • Good die bonded to Spert-II boards (Chip-On-Board packaging)
  • Bare die on Spert-II board tested at ICSI
  • Assembly house surface-mounted components to good boards
  • Final whole board assembly and test at ICSI