Post-Silicon Patchable Hardware




slide-1
SLIDE 1

Post-Silicon Patchable Hardware

Masahiro Fujita
VLSI Design and Education Center (VDEC)
The University of Tokyo

July 22nd, 2011

slide-2
SLIDE 2

Respin Statistics (North America)

Respin is becoming more frequent.

[Figure: first-silicon success rate (vertical axis, 0% to 100%) by year (1998, 2000, 2002, 2004). Data labels legible in the original: 48%, 44%, 44%, 39%, 33%.]

Fujita Lab. – VLSI Design and Education Center - University of Tokyo 2


[G. S. Spirakis, DATE 2006]

slide-3
SLIDE 3

Manufacturing Cost

Respin risk is increasing dramatically.

[Figure: mask set cost (US$), vertical axis from $1M to $5M, by technology node: 90nm, 65nm, 45nm, 32nm.]

[Nikkei Electronics, 2008]

slide-4
SLIDE 4

Causes for Respins

Logic and functional errors are the leading cause.

IC/ASIC Designs Having One or More Re-spins by Type of Flaw

Type of flaw      Designs
Logic/Function    91%
Clock             36%
Fast Path         32%
Slow Path         26%
Delay/Glitch      26%
Power             21%
Yield             19%
Analog            19%
Firmware          17%
Mixed Signal      15%
IR Drop           15%

[Collett International Research 2005]

slide-5
SLIDE 5

Conventional SoC Design Flow

75% of the whole development time [Source: Intel 2007]

[Figure: conventional SoC design flow. Design side: High-Level Description, High-Level Synthesis, machine-generated RTL, Logic Synthesis, Place & Route, SoC. RTL validation side: Pre-Silicon RTL Verification/Simulation and Post-Silicon Error Detection, each followed by RTL Bug Localization and RTL Bug Fix. Annotation: Need to Understand RTL.]

slide-6
SLIDE 6

Proposed Patchable SoC Design Flow

[Figure: proposed patchable SoC design flow. Design side: High-Level Description, High-Level Synthesis of Patchable HW, Logic Synthesis, Place & Route, Patchable SoC. High-level ECO side: Pre-Silicon High-Level Verification/Simulation and Post-Silicon Error Detection, each followed by Bug Localization and Bug Fix; a post-silicon fix goes through Patch Compilation and is applied to the chip as a patch.]

No Respin Needed!

slide-7
SLIDE 7

Proposed Patchable Hardware

Efficeum
  • Offers behavioral-level programmability using a patchable controller

[Figure: a custom datapath (ALU1, ALU2) driven by a patchable controller that combines a hardwired FSM with a patch FSM.]

Partially-Programmable Circuit (PPC)
  • Offers logic-level programmability using a mixed gate/LUT circuit

slide-8
SLIDE 8

Efficeum:

An Energy-Efficient Patchable Accelerator For Post-Silicon Engineering Changes

slide-9
SLIDE 9

Energy Efficiency vs. Programmability

Energy Efficiency of 90nm OFDM
  • Fixed-function HW: 200GOPS/W
  • Embedded Proc.: 4GOPS/W (50X!)
  • Laptop Proc.: 0.05GOPS/W (4,000X!)

High Performance (>100GOPS) under Power/Thermal Constraints (~1W)

Energy efficiency (in [GOPS/W] or [J/op])
  • How much computation can be done in a given energy
  • Slowing down the chip reduces power but not efficiency
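The ratios and units on this slide can be checked with a few lines of arithmetic. The sketch below uses only the three GOPS/W figures quoted above; the function names and the 100 GOPS workload are illustrative assumptions, not part of the original material.

```python
# Energy-efficiency figures quoted on the slide (GOPS/W, 90nm OFDM).
EFFICIENCY_GOPS_PER_W = {
    "fixed_function_hw": 200.0,
    "embedded_proc": 4.0,
    "laptop_proc": 0.05,
}


def joules_per_op(gops_per_watt: float) -> float:
    """Convert GOPS/W to J/op (1 GOPS/W equals 1e9 operations per joule)."""
    return 1.0 / (gops_per_watt * 1e9)


def power_watts(throughput_gops: float, gops_per_watt: float) -> float:
    """Power needed to sustain a given throughput at a given efficiency."""
    return throughput_gops / gops_per_watt


# Efficiency gap of each platform relative to fixed-function hardware.
baseline = EFFICIENCY_GOPS_PER_W["fixed_function_hw"]
ratios = {name: baseline / value for name, value in EFFICIENCY_GOPS_PER_W.items()}

for name, value in EFFICIENCY_GOPS_PER_W.items():
    print(
        f"{name}: {joules_per_op(value):.2e} J/op, "
        f"{ratios[name]:.0f}x gap, "
        f"{power_watts(100.0, value):.1f} W at 100 GOPS (assumed workload)"
    )
```

Because efficiency is a ratio of throughput to power, lowering the clock scales both together, which is why slowing the chip down reduces power but leaves GOPS/W unchanged.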

slide-10
SLIDE 10

Fixed-Function Accelerator

  • Achieves high energy efficiency by customization:
      • Hardwired controller → No reprogrammability
      • Highly-customized datapath → Low flexibility

[Figure: fixed-function accelerator. A hardwired controller sends control signals to a datapath made of a local store, registers (Reg 1, Reg 2, Reg 3, ...), and functional units (comparator, multiplier, ALU1, ALU2, ...) connected by sparse interconnect networks.]

slide-11
SLIDE 11

Proposed Patchable Accelerator

  • Behavioral reprogrammability by control patching
  • Increased flexibility by adding register file via data bus

[Figure: proposed patchable accelerator. The fixed-function accelerator (hardwired controller, local store, registers, sparse interconnect networks, comparator, multiplier, ALU1, ALU2) is extended with patch logic, a control bus, a data bus, and a register file.]

slide-12
SLIDE 12

Patch Logic

[Figure: patch logic. The program counter is compared ("=?") against PC1, PC2, ... and tested against PCpatch (">PCpatch?"). The hardwired controller and the patch memory both supply control signals; the patch memory holds program counter patches (PC1', PC2', ...) and control signal patches.]
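One way to read the patch logic is as a lookup that prefers patch memory over the hardwired controller whenever the program counter matches a patched state or lies beyond the hardwired range. The sketch below is a behavioral model under that assumption; the class names, the signal strings, and the small three-state program are hypothetical and are not taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ControlWord:
    signals: str  # control signals driven onto the datapath (illustrative)
    next_pc: int  # next program counter value


class PatchableController:
    """Behavioral model: hardwired FSM plus a patch memory (assumed semantics)."""

    def __init__(self, hardwired, patch_memory, pc_patch):
        self.hardwired = hardwired        # fixed at design time
        self.patch_memory = patch_memory  # written after fabrication
        self.pc_patch = pc_patch          # states above this exist only in patch memory

    def lookup(self, pc):
        # "=?" comparators: a patched PC overrides the hardwired entry.
        # ">PCpatch?" comparator: the state lies outside the hardwired range.
        if pc in self.patch_memory or pc > self.pc_patch:
            return self.patch_memory[pc]
        return self.hardwired[pc]

    def run(self, start_pc, halt_pc):
        pc, trace = start_pc, []
        while pc != halt_pc:
            word = self.lookup(pc)
            trace.append((pc, word.signals))
            pc = word.next_pc
        return trace


# Hypothetical initial design: states 1 -> 2 -> 3 -> halt (0).
hardwired = {
    1: ControlWord("add", 2),
    2: ControlWord("mul", 3),
    3: ControlWord("store", 0),
}
original = PatchableController(hardwired, {}, pc_patch=3)

# Hypothetical engineering change: state 2 now jumps to a new state 4,
# which performs an extra shift and then rejoins the hardwired state 3.
patch = {2: ControlWord("mul", 4), 4: ControlWord("shift", 3)}
patched = PatchableController(hardwired, patch, pc_patch=3)

print(original.run(1, 0))
print(patched.run(1, 0))
```

In this model the hardwired table is never modified; only the patch memory changes, which mirrors the idea that the fix is applied without a respin.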

slide-13
SLIDE 13

Patching Example (1/2)

Scheduling Result of Initial Design

[Figure: scheduling table with columns PC, ALU1, ALU2, MUL1, and Next PC for PC values 1 to 5, with rows divided into hardwired logic and patch logic, shown next to the dataflow graph for the initial design.]

slide-14
SLIDE 14

Patching Example (2/2)

Scheduling Result After Engineering Change

[Figure: scheduling table with columns PC, ALU1, ALU2, MUL1, and Next PC for PC values 1 to 5, with rows divided into hardwired logic and patch logic, shown next to the dataflow graph after the engineering change (EC).]

slide-15
SLIDE 15

Patching-Based Post-Silicon ECO Flow

[Figure: patching-based post-silicon ECO flow. Initial design: C Program, High-Level Synthesis, Fixed-Function HW, Inserting RF & Patch Logic, Efficeum. Post-Silicon ECO (Spec. Change & Bug Fix): Post-ECO Program, Computing the Difference Between Two Programs, Patch Compilation, Patch, Writing into Patch Memory. Both programs are drawn as dataflow graphs of +, x, and << operations.]

slide-16
SLIDE 16

Experimental Setup

  • Example: 8x8 IDCT
  • Technology: FreePDK 45nm
  • Logic Synthesis: Synopsys Design Compiler Ultra
      • High effort options with gated clock optimization
  • P&R: Cadence SoC Encounter
  • Simulation: Synopsys VCS
  • Power/timing analysis: Synopsys PrimeTime PX
      • Simulation results are used for power calculation
  • Energy efficiencies (GOPS/W) are compared

slide-17
SLIDE 17

Energy Efficiency Comparison

[Figure: energy efficiency of the 8x8 IDCT (FreePDK 45nm technology), ranging from No Patching to Fully-Patched. Data labels legible in the original: 6%, 48%, 89%.]

Offers a tradeoff between efficiency and programmability

slide-18
SLIDE 18

Area & Performance Comparison

[Figure: area and performance comparison. Annotations legible in the original: 20%, 5%, 5X Smaller, Up to 40% Increase.]

slide-19
SLIDE 19

Area Comparison

[Figure: area of the fully-programmable accelerator, the single-function hardwired accelerator, and Efficeum. Annotations: 4x reduction, 18% increase.]

(Technology: FreePDK 45nm (NCSU/Nangate), Operating Frequency: 200MHz)

slide-20
SLIDE 20

Power Comparison

[Figure: power of the fully-programmable accelerator, the single-function hardwired accelerator, and Efficeum. Annotations: 6x reduction, 13% increase.]

(Technology: FreePDK 45nm (NCSU/Nangate), Operating Frequency: 200MHz)

slide-21
SLIDE 21

Incremental High-Level Synthesis and Patch Compilation For High-Level ECO

slide-22
SLIDE 22

Conventional High-Level Synthesis

  • Several phases are applied separately
  • This prevents incremental synthesis

[Figure: conventional high-level synthesis in three separate phases. Scheduling assigns the operations (+, x, <<) to Steps 1 to 3; Allocation selects ADD1, ADD2, MUL1, and SHFT1; Binding maps the scheduled operations onto these units and the registers. The result is an FSM and a datapath.]

slide-23
SLIDE 23

Incremental High-Level Synthesis

  • Each operation is scheduled and bound incrementally, and the hardware is enhanced accordingly

[Figure: incremental scheduling & binding. Operations (+, <<, x) are added one at a time to Steps 1 to 3 and bound to ADD1, ADD2, MUL1, and SHFT1; the FSM, the registers, and the datapath grow with each addition.]

slide-24
SLIDE 24

Patch Compilation

  • Same as incremental synthesis, except that datapath enhancement is not allowed (only the FSM is enhanced)

[Figure: incremental scheduling & binding for patch compilation. Operations (+, <<, x) are added one at a time to Steps 1 to 3 and bound to ADD1, ADD2, MUL1, and SHFT1; the FSM, registers, and datapath are shown at each stage.]

slide-25
SLIDE 25

Incremental Scheduling Procedure

Extension of Swing Modulo Scheduling for VLIW compilers [Llosa et al., PACT '96]

[Figure: scheduling order. Conventional Scheduling (Top Down): very long register lifetime, 6 registers required. Incremental Swing Scheduling (Mix of Top Down & Bottom Up): shorter register lifetime, only 4 registers required.]

slide-26
SLIDE 26

Swing Scheduling Procedure

Mix of top-down and bottom-up scheduling

  • Phase 1: Bottom-Up Scheduling of Critical Path and Their Ancestors
  • Phase 2: Top-Down Scheduling of the Descendants of Scheduled Operations
  • Phase 3: Bottom-Up Scheduling of the Ancestors of Scheduled Operations

slide-27
SLIDE 27

Incremental Step Insertion

  • A novel technique enabling incremental swing scheduling
  • During swing scheduling, a new control step is inserted between the scheduled steps when needed

[Figure: a schedule that starts with Step A and Step B and gains Step C, Step D, and Step E as new control steps are inserted.]
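Step insertion can be illustrated with a schedule stored as a list of control steps. The sketch below is a simplified model, assuming a single resource class with a fixed per-step capacity; `place_between` and its parameters are hypothetical names, not the Cyneum implementation.

```python
from copy import deepcopy


def place_between(schedule, op, producer_step, consumer_step, capacity):
    """Place `op` strictly after `producer_step` and strictly before `consumer_step`.

    `schedule` is a list of control steps, each a list of operation names.
    An existing step with spare capacity is reused; otherwise a new control
    step is inserted just before the consumer's step. Returns the new
    schedule and the index of the step that received `op`.
    """
    new_schedule = deepcopy(schedule)
    for i in range(producer_step + 1, consumer_step):
        if len(new_schedule[i]) < capacity:
            new_schedule[i].append(op)
            return new_schedule, i
    # No room between producer and consumer: insert a new control step.
    new_schedule.insert(consumer_step, [op])
    return new_schedule, consumer_step


# Step A holds "a" and Step B holds "b"; "c" must run between them.
inserted, where_inserted = place_between([["a"], ["b"]], "c", 0, 1, capacity=1)
print(inserted, where_inserted)

# With a free slot in an existing middle step, no new step is needed.
reused, where_reused = place_between([["a"], ["x"], ["b"]], "c", 0, 2, capacity=2)
print(reused, where_reused)
```

Inserting a step only lengthens the FSM, so the same idea applies when the datapath must stay fixed, as in patch compilation.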

slide-28
SLIDE 28

Incremental Binding Procedure

  • For each operation, all possible combinations of (resource, registers) are examined
  • If no such binding is found, new interconnects between resource and registers are introduced

[Figure: incremental scheduling & binding followed by datapath enhancement. Operations in Steps 1 to 3 are bound to ADD1, ADD2, MUL1, and SHFT; in the example, a multiplier exists but no register-to-multiplier interconnect exists, so the datapath is enhanced.]
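The binding rule above can be sketched as a search over (resource, destination register) pairs, with a fallback that adds wires. This is a minimal model under stated assumptions: interconnects are a set of directed (source, sink) pairs, operands already live in known registers, and all names (`bind`, `R1`, `MUL1`, ...) are illustrative. Setting `allow_enhancement=False` mimics patch compilation, where the datapath may not change.

```python
from itertools import product


def bind(op_type, src_regs, resources, free_regs, wires, allow_enhancement=True):
    """Bind one operation to a (resource, destination register) pair.

    resources: dict mapping resource name to operation type.
    wires: set of directed (source, sink) interconnects; updated in place
    when enhancement is needed. Returns (resource, register, added_wires),
    or None when no binding exists and enhancement is not allowed.
    """
    candidates = [name for name, kind in resources.items() if kind == op_type]
    # Examine all combinations of (resource, register) on the existing datapath.
    for res, dst in product(candidates, free_regs):
        needed = {(src, res) for src in src_regs} | {(res, dst)}
        if needed <= wires:
            return res, dst, set()
    if not allow_enhancement or not candidates or not free_regs:
        return None
    # No binding found: introduce the missing interconnects.
    res, dst = candidates[0], free_regs[0]
    added = ({(src, res) for src in src_regs} | {(res, dst)}) - wires
    wires |= added
    return res, dst, added


resources = {"ADD1": "add", "ADD2": "add", "MUL1": "mul", "SHFT1": "shift"}
base_wires = {("R1", "ADD1"), ("R2", "ADD1"), ("ADD1", "R3"), ("MUL1", "R3")}

# A multiplier exists, but no register-to-multiplier interconnect exists.
patch_only = bind("mul", ["R1", "R2"], resources, ["R3"], set(base_wires),
                  allow_enhancement=False)
enhanced = bind("mul", ["R1", "R2"], resources, ["R3"], set(base_wires))
print(patch_only)
print(enhanced)
```

An addition with operands in R1 and R2 binds to ADD1 without any new wires, which is the case the first loop handles.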

slide-29
SLIDE 29

Experimental Setup

  • The proposed method has been implemented in our high-level synthesis framework Cyneum
      • Incremental high-level synthesis
      • Patch compiler for Efficeum
  • Example: 5 benchmark designs
      • C programs of about 100 lines
      • Functions from IDCT, ADPCM, MPEG
      • Post-ECO examples are generated by random graph perturbation (next slide)
  • Evaluated the quality of the method through the patch size and compilation time

slide-30
SLIDE 30

Generating ECO Examples

The following graph perturbations are randomly applied to the CDFG:

  • Perturbation 1: Opcode Change
  • Perturbation 2: Operand Change
  • Perturbation 3: Introducing A New Operation

[Figure: the original CDFG and the result of each perturbation.]
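The three perturbations can be sketched on a toy dataflow graph. The representation below is an assumption made for illustration: each node is an `(opcode, lhs, rhs)` tuple, operands are either primary-input names (strings) or indices of earlier nodes, and control flow is ignored. The opcode set and function names are hypothetical.

```python
import random

OPCODES = ["+", "-", "*", "<<"]


def legal_sources(inputs, node_index):
    """Operands a node may read: primary inputs and earlier nodes (keeps the graph acyclic)."""
    return list(inputs) + list(range(node_index))


def perturb(cdfg, inputs, rng):
    """Apply one randomly chosen perturbation; return (new graph, perturbation kind)."""
    nodes = list(cdfg)
    kind = rng.choice(["opcode", "operand", "new_operation"])
    if kind == "new_operation":
        sources = legal_sources(inputs, len(nodes))
        nodes.append((rng.choice(OPCODES), rng.choice(sources), rng.choice(sources)))
        return nodes, kind
    i = rng.randrange(len(nodes))
    op, lhs, rhs = nodes[i]
    if kind == "opcode":
        op = rng.choice([o for o in OPCODES if o != op])
    else:
        sources = legal_sources(inputs, i)
        if rng.random() < 0.5:
            lhs = rng.choice([s for s in sources if s != lhs])
        else:
            rhs = rng.choice([s for s in sources if s != rhs])
    nodes[i] = (op, lhs, rhs)
    return nodes, kind


def apply_eco(cdfg, inputs, m, seed):
    """Apply M random perturbations, as in an ECO of size M."""
    rng = random.Random(seed)
    nodes = list(cdfg)
    for _ in range(m):
        nodes, _kind = perturb(nodes, inputs, rng)
    return nodes


# Toy graph: node 0 = in0 + in1, node 1 = node0 * in1.
base = [("+", "in0", "in1"), ("*", 0, "in1")]
print(apply_eco(base, ["in0", "in1"], m=3, seed=0))
```

Restricting operands to earlier nodes is a design choice of this sketch that keeps every perturbed graph acyclic; at least two primary inputs are assumed so that an operand change always has an alternative source.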

slide-31
SLIDE 31

Evaluation of Patch Compiler

  • For each benchmark, random CDFG perturbation is applied M times. For each M, 100 different post-ECO designs are generated and then patches are compiled.

[Figure: two plots against ECO Size M (#Perturbations): Average Patch Size (#FSM States) and Average Compilation Time [sec.].]

slide-32
SLIDE 32

PPC:

Increasing Yield Using Partially-Programmable Circuits

A collaborative work with Prof. Shigeru Yamashita (Ritsumeikan Univ.)

slide-33
SLIDE 33

Design For Yield

  • At physical level, there are many techniques available for yield/defect tolerance enhancement
  • At logic level, the following techniques can be applied for each module
      • TMR: Voting
      • DMR: If one module is defective, the other can be used
      • Reconfigurable devices: synthesize not to use defective parts
  • Too much overhead in area and performance

slide-34
SLIDE 34

Objective of This Work

  • Enhance the defect tolerance by using a Partially-Programmable Circuit (PPC)
      • PPC is a hybrid LUT/gate circuit
      • To correct a single defect, full programmability such as that of FPGAs is unnecessary
      • A defective wire can be made redundant by reprogramming LUTs in PPC
  • Propose a design methodology
      • Synthesis of PPC
          • Where to put LUTs
      • How to reprogram LUTs for defective wires

slide-35
SLIDE 35

PPC (Partially-Programmable Circuit)

[Figure: a PPC between primary inputs and primary outputs. Non-programmable part consisting of conventional gates; programmable part consisting of LUTs and MUXs.]

slide-36
SLIDE 36

Defect Correction in PPC

  • Find out that a wire ci is defective
  • By reprogramming LUTs, the wire ci becomes redundant
  • Call ci a Robust Connection (RC)

[Figure: a PPC (conventional gates, LUTs, MUX) with the defective wire ci marked.]
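Whether a wire is a Robust Connection can be checked by asking whether some LUT truth table still produces the intended output when that wire is stuck. The sketch below is a simplified single-LUT model under stated assumptions: the defect is a stuck-at-0/1 fault on one LUT input, each LUT input is given as a function of the primary inputs, and the function and variable names are hypothetical.

```python
from itertools import product


def reprogram_for_fault(spec, lut_inputs, n_inputs, wire, stuck_value):
    """Find a LUT truth table that realizes `spec` with one LUT input stuck.

    spec: function from a primary-input tuple to the required output bit.
    lut_inputs: list of functions, one per LUT input wire.
    Returns a dict mapping LUT input patterns to output bits (patterns that
    never occur are omitted), or None if no truth table can mask the fault.
    """
    table = {}
    for x in product((0, 1), repeat=n_inputs):
        pattern = [f(x) for f in lut_inputs]
        pattern[wire] = stuck_value  # the defective wire always reads this value
        key, want = tuple(pattern), spec(x)
        # The same LUT input pattern must never require two different outputs.
        if table.setdefault(key, want) != want:
            return None
    return table


def is_robust_connection(spec, lut_inputs, n_inputs, wire):
    """A wire is treated as an RC if both stuck-at-0 and stuck-at-1 can be masked."""
    return all(
        reprogram_for_fault(spec, lut_inputs, n_inputs, wire, v) is not None
        for v in (0, 1)
    )


# LUT fed by an AND gate output plus redundant wires from inputs a and b.
and_spec = lambda x: x[0] & x[1]
with_redundancy = [lambda x: x[0] & x[1], lambda x: x[0], lambda x: x[1]]
print([is_robust_connection(and_spec, with_redundancy, 2, w) for w in range(3)])

# LUT fed only by the AND gate output and input c: no redundancy to exploit.
xor_spec = lambda x: (x[0] & x[1]) ^ x[2]
without_redundancy = [lambda x: x[0] & x[1], lambda x: x[2]]
print(is_robust_connection(xor_spec, without_redundancy, 3, 0))
```

The consistency check avoids enumerating all truth tables: a repair exists exactly when no LUT input pattern seen under the fault demands both 0 and 1.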

slide-37
SLIDE 37

PPC Example: Initial Circuit

[Figure: initial circuit between primary inputs and primary outputs, with nodes N1 to N9, some of them implemented as LUTs.]

  • LUTs are used partially in the circuit
  • There is no redundancy now

slide-38
SLIDE 38

PPC Example: Redundancy Addition 1

[Figure: the circuit (nodes N1 to N9, LUTs marked) with a redundant wire added to a LUT.]

  • By adding this wire, some wires become Robust Connections (RCs)
  • RC: Wires which can become redundant by reprogramming LUTs
  • NRC: Wires which are not RCs

slide-39
SLIDE 39

PPC Example: Redundancy Addition 2

[Figure: the circuit (nodes N1 to N9, LUTs marked) at the second redundancy addition step.]

slide-40
SLIDE 40

PPC Example: Redundancy Addition 3

[Figure: the circuit (nodes N1 to N9, LUTs marked) at the third redundancy addition step.]

slide-41
SLIDE 41

PPC Example: Final Circuit

  • Since we assume a single defect, multiple redundant connections to an LUT are multiplexed

[Figure: final circuit (nodes N1 to N9, LUTs, and a MUX) between primary inputs and primary outputs.]

  • Colored wires: Robust Connection (RC)
  • Black wires: Non-Robust Connection (NRC)

slide-42
SLIDE 42

CSPF: Flexibility of Logic Circuits

[Figure: truth tables of a logic function f and of its internal signals g1 and g2, together with their CSPFs; * marks a don't-care entry.]

slide-43
SLIDE 43

SPFD: Flexibility of LUT Circuits

[Figure: an LUT L1 computing f from g1 and g2, with their truth tables and g1's flexibility by SPFDs.]

slide-44
SLIDE 44

Proposed Synthesis Method

Basic procedure

  • 1. Perform a LUT mapping
  • 2. Determine LUTs to keep
      • Needs a better heuristic
      • Reconvergence points, non-critical nodes
  • 3. Perform a technology re-mapping
  • 4. Adding redundant wires to LUTs
      • How to find good wires?
      • Currently, wires are identified exhaustively

slide-45
SLIDE 45

Preliminary Experiments

  • 1. Mapping to K-input LUTs (K=3,4,5)
  • 2. Re-mapping while keeping the LUTs at the outputs
  • 3. For each LUT, if connecting a wire to the LUT makes another wire an RC, then it is selected. Terminate if the number of LUT inputs is 6.

[Figure: circuit between primary inputs and primary outputs with LUTs kept at the outputs.]

slide-46
SLIDE 46

Experimental Results

#Connections which are robust to stuck-at-0/1

Column groups in the original header: Stuck-at-0 and Stuck-at-1, each split into Robust (Original, Added) and Non-Robust.

Circuit       Stuck-at-0              Stuck-at-1
alu2           586     13     25       582     20     29
alu4          1093     28     92      1070     25    115
apex6          591     86    106       589     98    108
rot            421    161    218       411    172    228
too_large      472     15    256       453      9    275
vda           1251    153    221      1318    213    154
C880           185     32    144       212     57    117
C1355          219    225    143       390    176    182
C1908          626     37     66       599     36     93
C2670          710     59    176       704     66    182
C3540         1500     63    210      1462     52    248