Flexible Hardware Design at Low Levels of Abstraction - PowerPoint PPT Presentation



SLIDE 1

Flexible Hardware Design at Low Levels of Abstraction

Emil Axelsson
Hardware Description and Verification
May 2009

SLIDE 2

Why low-level?

gadget a b = case a of
  2 -> thing (b+10)
  3 -> thing (b+20)
  _ -> fixNumber a

Related question: Why is some software written in C?
(But the difference between high- and low-level is much greater in hardware.)

Ideal:

Software-like code → magic compiler → chip masks

SLIDE 3

Why low-level?

Related question: Why is some software written in C?
(But the difference between high- and low-level is much greater in hardware.)

Ideal:

Software-like code → magic compiler → chip masks

gadget a b = case a of
  2 -> thing (b+10)
  3 -> thing (b+20)
  _ -> fixNumber a

SLIDE 4

Why low-level?

Reality:

“ASCII schematic” → chain of synthesis tools → chip masks

SLIDE 5

Why low-level?

Reality:

“ASCII schematic” → chain of synthesis tools → chip masks

Iterate to improve timing/power/area, etc.

Very costly / time-consuming

Each fabrication run costs ≈ $1,000,000

SLIDE 6

Failing abstraction

A realistic flow cannot avoid low-level awareness

Paradox:

Modern designs require a higher abstraction level
...but...
modern chip technologies make abstraction harder

Main problem: routing wires dominate signal delay and power consumption

Controlling the wires is key to performance!

SLIDE 7

Gate vs. wire delay under scaling

[Figure: relative delay vs. process technology node (nm)]

SLIDE 8

Physical design level

Certain high-performance components (e.g. arithmetic) need to be designed at an even lower level

Physical level:

A set of connected standard cells (implemented gates)
Absolute or relative positions of cells (placement)
Shape of connecting wires (routing)
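To make the ingredients concrete, a small sketch (added here; illustrative Haskell types, not Wired's actual API) of the data a physical-level description carries:

-- Hypothetical physical-level data: placed cells and routed wires.
data Cell   = Cell   { cellKind :: String, position :: (Int, Int) }
data Wire   = Wire   { source :: String, sink :: String, path :: [(Int, Int)] }
data Layout = Layout { cells :: [Cell], wires :: [Wire] }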

SLIDE 9

Physical design level

Design by interfacing to physical CAD tools

Call automatic tools for certain tasks (mainly routing)

Often done through scripting code

Tedious
Hard to explore design space
Limited design reuse

Aim of this work: Raise the abstraction level of physical design!

SLIDE 10

Two ways to raise abstraction

Automatic synthesis

+ Powerful abstraction
– May not be optimal for e.g. high-performance arithmetic
– Opaque (hard to control the result)
– Unstable (heuristics-based)

Language-based techniques (higher-order functions, recursion, etc.)

+ Transparent, stable
– Still quite low-level
– Somewhat limited to regular circuits

SLIDE 11

Two ways to raise abstraction

Automatic synthesis

+ Powerful abstraction
– May not be optimal for e.g. high-performance arithmetic
– Opaque (hard to control the result)
– Unstable (heuristics-based)

Language-based techniques (higher-order functions, recursion, etc.)

+ Transparent, stable
– Still quite low-level
– Somewhat limited to regular circuits

Our approach

SLIDE 12

Lava

Gate-level hardware description in Haskell

Parameterized module generators: Haskell programs that generate circuits

Can be smart, e.g. optimize for speed in a given environment

Basic placement expressed through combinators

Used successfully to generate high-performance FPGA cores
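For a flavour of what such generators look like, here is a classic Lava-style combinator reconstructed in plain Haskell (an added sketch; row and fullAdderCell are illustrative names, not quotes from Lava's API):

-- Ripple a component f :: carry-in -> input -> (output, carry-out)
-- across a list of inputs, threading the carry through the row.
row :: (c -> a -> (b, c)) -> c -> [a] -> ([b], c)
row _ c []     = ([], c)
row f c (x:xs) = let (y,  c')  = f c x
                     (ys, c'') = row f c' xs
                 in  (y:ys, c'')

-- e.g. an n-bit ripple-carry adder: row fullAdderCell carryIn bitPairs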

SLIDE 13

Wired: Extension to Lava

Finer control over geometry
More accurate performance models

Feedback from timing/power analysis enables self-optimizing generators

Wire-awareness (unique to Wired)

Performance analysis based on wire length estimates
Control routing through “guides” (experimental)

...

SLIDE 14

Monads in Haskell

Haskell functions are pure
Side-effects can be “simulated” using monads

import Control.Monad.State

add a b = do
    as <- get
    put (a:as)
    return (a+b)

prog = do
    a <- add 5 6
    b <- add a 7
    add b 8

*Main> runState prog []
(26, [18,11,5])    -- (result, side-effect)

Syntactic sugar, expands to a pure program with explicit state passing

Monads can also be used to model e.g. IO, exceptions, non-determinism etc.
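To make “expands to a pure program” concrete, here is prog after desugaring (an added sketch; the type annotation assumes State from Control.Monad.State):

prog' :: State [Int] Int
prog' = add 5 6 >>= \a ->
        add a 7 >>= \b ->
        add b 8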

SLIDE 15

Monad combinators

Haskell has a general and well-understood combinator library for monadic programs

*Main> runState (mapM (add 2) [11..13]) []
([13,14,15],[2,2,2])

*Main> runState (mapM (add 2 >=> add 4) [11..13]) []
([17,18,19],[4,2,4,2,4,2])
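For reference (added note), (>=>) is Kleisli composition from Control.Monad:

(f >=> g) x  =  f x >>= g

so add 2 >=> add 4 first adds 2 (recording 2 in the state), then adds 4 to the result (recording 4), which explains the interleaved side-effect list above.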

SLIDE 16

Example: Parallel prefix

Given inputs x1, x2, … xn, compute

y1 = x1
y2 = x1 ∘ x2
…
yn = x1 ∘ x2 ∘ … ∘ xn

for ∘, an associative (but not necessarily commutative) operator
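As a reference semantics (an added sketch, not from the slides), this specification is exactly Haskell's scanl1:

prefixSpec :: (a -> a -> a) -> [a] -> [a]
prefixSpec = scanl1

-- prefixSpec (+) [1,2,3,4] == [1,3,6,10]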

SLIDE 17

Parallel prefix

Very central component in microprocessors
Most common use: computing carries in fast adders

Trying different operators:

Addition: prefix (+) [1,2,3,4]

SLIDE 18

Parallel prefix

Very central component in microprocessors
Most common use: computing carries in fast adders

Trying different operators:

Addition: prefix (+) [1,2,3,4] = [1, 1+2, 1+2+3, 1+2+3+4] = [1,3,6,10]

SLIDE 19

Parallel prefix

Very central component in microprocessors
Most common use: computing carries in fast adders

Trying different operators:

Addition: prefix (+) [1,2,3,4] = [1, 1+2, 1+2+3, 1+2+3+4] = [1,3,6,10]
Boolean OR: prefix (||) [F,F,F,T,F,T,T,F]

SLIDE 20

Parallel prefix

Very central component in microprocessors
Most common use: computing carries in fast adders

Trying different operators:

Addition: prefix (+) [1,2,3,4] = [1, 1+2, 1+2+3, 1+2+3+4] = [1,3,6,10]
Boolean OR: prefix (||) [F,F,F,T,F,T,T,F] = [F,F,F,T,T,T,T,T]

SLIDE 21

Parallel prefix

Implementation choices (relying on associativity):

prefix (∘) [x1,x2,x3,x4] = [y1,y2,y3,y4]

Serial:   y4 = ((x1 ∘ x2) ∘ x3) ∘ x4
Parallel: y4 = (x1 ∘ x2) ∘ (x3 ∘ x4)
Sharing:  y4 = y3 ∘ x4
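Combining the parallel and sharing ideas gives the divide-and-conquer structure used on the coming slides. A pure list-level sketch (added here, not the slides' code):

-- Prefixes of each half; every right-half prefix is then extended
-- with the total of the left half (sharing last ls').
prefixDC :: (a -> a -> a) -> [a] -> [a]
prefixDC _  [x] = [x]
prefixDC op xs  = ls' ++ [ op (last ls') r | r <- rs' ]
  where
    (ls, rs) = splitAt (length xs `div` 2) xs
    ls'      = prefixDC op ls
    rs'      = prefixDC op rs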

SLIDE 22

There are many of them...

Sklansky Brent-Kung Ladner-Fischer

SLIDE 23

Parallel prefix: Sklansky

sklansky op [a] = return [a]
sklansky op as  = do
    let k       = length as `div` 2
        (ls,rs) = splitAt k as
    ls'  <- sklansky op ls
    rs'  <- sklansky op rs
    rs'' <- sequence [op (last ls', r) | r <- rs']
    return (ls' ++ rs'')

Simplest approach (divide-and-conquer)
Purely structural (no geometry)
Could have been (monadic) Lava
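Since sklansky is written against an arbitrary monad, it can also run outside any hardware context. An added sketch (not from the slides), counting operator instances with the State monad:

import Control.Monad.State

-- A "soft" operator: behaves like (+) and counts its own instances.
countOp :: (Int, Int) -> State Int Int
countOp (a, b) = do
    modify (+1)        -- one more operator instantiated
    return (a + b)

-- *Main> runState (sklansky countOp [1..8]) 0
-- ([1,3,6,10,15,21,28,36], 12)    -- prefix sums; 12 operators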

SLIDE 24

Refinement: Add placement

sklansky op [a] = space cellWidth [a]
sklansky op as  = downwards 1 $ do
    let k       = length as `div` 2
        (ls,rs) = splitAt k as
    (ls',rs') <- rightwards 0 $
        liftM2 (,) (sklansky op ls) (sklansky op rs)
    rs'' <- rightwards 0 $
        sequence [op (last ls', r) | r <- rs']
    return (ls' ++ rs'')

SLIDE 25

Sklansky with placement

Simple PostScript output allows interactive development of placement

SLIDE 26

Refinement: Add routing guides

bus = rightwards 0 . mapM bus1
  where
    bus1 =  space 2750
        >=> guide 3 500
        >=> space 1250

sklanskyIO op = downwards 0
     $  inputList 16 "in"
    >>= bus
    >>= space 1000
    >>= sklansky op
    >>= space 1000
    >>= bus
    >>= output "out"

Reusing standard (monadic) Haskell combinators (nothing Wired-specific)
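Indeed, every combinator in the plumbing above is stock Haskell (added reference; signatures as in the standard library, specialized to a monad m):

(.)   :: (b -> c) -> (a -> b) -> (a -> c)
mapM  :: Monad m => (a -> m b) -> [a] -> m [b]
(>=>) :: Monad m => (a -> m b) -> (b -> m c) -> (a -> m c)
(>>=) :: Monad m => m a -> (a -> m b) -> m b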

SLIDE 27

Sklansky with guides

SLIDE 28

Refinement: More guides

sklansky op [a] = space cellWidthD [a]
sklansky op as  = downwards 1 $ do
    bus as
    let k       = length as `div` 2
        (ls,rs) = splitAt k as
    (ls',rs') <- rightwards 0 $
        liftM2 (,) (sklansky op ls) (sklansky op rs)
    rs'' <- rightwards 0 $
        sequence [op (last ls', r) | r <- rs']
    bus (ls' ++ rs'')

SLIDE 29

Sklansky with guides

SLIDE 30

Experiment: Compaction

Base case changed from

sklansky op [a] = space cellWidthD [a]

to

sklansky op [a] = return [a]

Buses were compacted separately

SLIDE 31

Export to CAD tool (Cadence SoC Encounter)

Auto-routed in Encounter
Odd rows flipped to share power rails

Simple change in recursive call:

sklansky (flipY.op) ls

Design data exchanged using the DEF file format
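For context (an added sketch; flipY is the slides' name, its placement in the code reconstructed), the change lands in the left recursive call of the placed generator:

(ls',rs') <- rightwards 0 $
    liftM2 (,) (sklansky (flipY.op) ls) (sklansky op rs)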

SLIDE 32

Fast, low-power prefix networks

Mary Sheeran has developed circuit generators in Lava that search for fast, low-power parallel prefix networks

Initially, crude performance models:

Delay: logical depth
Power: number of operators

Still good results

Now using Wired to improve accuracy:

Static timing/power analysis using models from cell library

SLIDE 33

Minimal change to search algorithm

prefix f p = memo pm
  where
    pm ([],w)  = perhaps id' ([],w)
    pm ([i],w) = perhaps id' ([i],w)
    pm (is,w) | 2^(maxd (is,w)) < length is = Fail
    pm (is,w) = (bestOn is f . dropFail)
        [ wrpC ds (prefix f p) (prefix p p) | ds <- igen ... ]
      where
        wrpC ds p1 p2 = wrp ds (perhaps id' c) (p1 c1) (p2 c2)
        ...

SLIDE 34

Minimal change to search algorithm

prefix f p = memo pm
  where
    pm ([],w)  = perhaps id' ([],w)
    pm ([i],w) = perhaps id' ([i],w)
    pm (is,w) | 2^(maxd (is,w)) < length is = Fail
    pm (is,w) = (bestOn is f . dropFail)
        [ wrpC ds (prefix f p) (prefix p p) | ds <- igen ... ]
      where
        wrpC ds p1 p2 = wrp ds (perhaps id' c) (p1 c1) (p2 c2)
        ...

Plug in cost functions that analyze the placed network through Wired

SLIDE 35

85 bits, depth 8

SLIDE 36

85 bits, depth 8

SLIDE 37

Design exploration

85 inputs, depth 8, varying allowed fanout

At 128 bits, minimum depth is slower than going one level deeper (the crude delay model fails there)

Accurate model consistent with timing report from Encounter

Fanout   Delay [ns]   Power [mW]
7        0.646        15.2
8        0.628        15.7
9        0.624        15.9
10       0.620        16.1

SLIDE 38

[Figure: generated networks for fanout 7, 8, 9, and 10]

SLIDE 39

Binary multiplication

      101100        (= 44)
    * 001011        (= 11)
    --------
      101100
     101100
    000000          “partial products”
   101100
  000000
+000000
------------
000111100100        (= 484)

1) Generate the partial products (PPs)
2) Sum the partial products
   a) Sum until two terms left
   b) Add the two remaining terms

SLIDE 40

Binary multiplication

      101100        (= 44)
    * 001011        (= 11)
    --------
      101100
     101100
    000000          “partial products”
   101100
  000000
+000000
------------
000111100100        (= 484)

1) Generate the partial products (PPs)
2) Sum the partial products
   a) Sum until two terms left
   b) Add the two remaining terms

Not in this talk

SLIDE 41

Column compression multipliers

      101100
    * 001011
    --------
      101100
     101100
    000000
   101100
  000000
+000000

Use full adders to compress the bits in each column until only two bits remain

Each full adder produces a carry, which is forwarded to the next column

Different strategies for which order to process the bits yield very different characteristics (e.g. linear vs. logarithmic depth)
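To illustrate the idea, an added sketch in plain Haskell (not the slides' code; for simplicity, carries here join the next column within the same pass, a simplification of real reduction-tree scheduling):

-- Full adder on three bits: (sum, carry-out).
fullAdder :: Bool -> Bool -> Bool -> (Bool, Bool)
fullAdder a b c = ((a /= b) /= c, (a && b) || (a && c) || (b && c))

-- One reduction level over the columns (least significant first):
-- each column absorbs incoming carries, is compressed by full adders,
-- and forwards its new carries to the next column.
level :: [[Bool]] -> [[Bool]]
level = go []
  where
    go cs []         = [cs | not (null cs)]
    go cs (col:cols) = col' : go cs' cols
      where (col', cs') = reduceCol (cs ++ col)
    reduceCol (a:b:c:rest) = let (s, cy)     = fullAdder a b c
                                 (bits, cys) = reduceCol rest
                             in  (s:bits, cy:cys)
    reduceCol bits         = (bits, [])

-- Iterate until every column holds at most two bits; a fast
-- (e.g. parallel prefix) adder then sums the two remaining rows.
compress :: [[Bool]] -> [[Bool]]
compress cols
  | all ((<= 2) . length) cols = cols
  | otherwise                  = compress (level cols)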

SLIDE 42

High-performance multiplier (HPM)

Multiplier reduction tree with logarithmic logic depth and regular connectivity. Eriksson, Sheeran, et al. ISCAS '06.

Simple scheme:

Process PP signals first
Process full adder output bits “as late as possible”
Prioritize carry bits

SLIDE 43

Purely structural version (≈ Lava)

Show code...

SLIDE 44

Refinement 1

SLIDE 45

Refinement 2

SLIDE 46

Refinement 3

SLIDE 47

Rectangular transform

SLIDE 48

Using reduction tree in real design

SLIDE 49

Using reduction tree in real design

By Kasyab, Ph.D. student in Computer Engineering

SLIDE 50

Summary

Wire-aware hardware design methods are needed

Wired offers flexible hardware design at low levels of abstraction

Sklansky

At Intel: 1000 lines of scripting code (Perl)
In Wired: <50 lines (though with fewer details)

Layout-/wire-aware design exploration

SLIDE 51

Get Wired

Install Haskell Platform (to get the Cabal tool):

http://hackage.haskell.org/platform/

Install Wired:

> cabal install Wired

Manual download:

http://hackage.haskell.org/cgi-bin/hackage-scripts/package/Wired