Haskell to Hardware and Other Dreams
Stephen A. Edwards, Richard Townsend, Martha A. Kim, Lianne Lairmore, Kuangya Zhai
Columbia University. Synchron, Bamberg, Germany, December 7, 2016.


SLIDE 1

Haskell to Hardware and Other Dreams

Stephen A. Edwards Richard Townsend Martha A. Kim Lianne Lairmore Kuangya Zhai

Columbia University

Synchron, Bamberg, Germany, December 7, 2016

SLIDE 2

Popular Science, November 1969

SLIDE 3

Where Is My Jetpack?

Popular Science, November 1969

SLIDE 4

SLIDE 5

Where The Heck Is My 10 GHz Processor?

SLIDE 6

Moore’s Law

“The complexity for minimum component costs has increased at a rate of roughly a factor of two per year.”

Closer to every 24 months

Gordon Moore, Cramming More Components onto Integrated Circuits, Electronics, 38(8) April 19, 1965.

SLIDE 7

Four Decades of Microprocessors Later...

Source: https://www.karlrupp.net/2015/06/40-years-of-microprocessor-trend-data/

SLIDE 8

What Happened in 2005?

              Pentium 4   Core 2 Duo   Xeon E5
  Year        2000        2006         2012
  Cores       1           2            8
  Transistors 42 M        291 M        2.3 G

SLIDE 9

The Cray-2: Immersed in Fluorinert

1985, ECL, 150 kW

SLIDE 10

Heat Flux in IBM Mainframes: A Familiar Trend

Schmidt. Liquid Cooling is Back. Electronics Cooling, August 2005.

SLIDE 11

Liquid Cooled Apple Power Mac G5

2004, CMOS, 1.2 kW

SLIDE 12

Dally: Calculation Cheap; Communication Costly

(Figure: a 64-bit FPU is 0.1 mm² and takes 50 pJ/op at 1.5 GHz; a 64-bit 1 mm on-chip channel takes 25 pJ/word; a 64-bit off-chip channel takes 1 nJ/word; sending a 64-bit float across a 20 mm × 10 mm chip takes 250 pJ and 4 cycles.)

“Chips are power limited and most power is spent moving data.”

Performance = Parallelism
Efficiency = Locality

Bill Dally’s 2009 DAC Keynote, The End of Denial Architecture

SLIDE 13

Parallelism for Performance; Locality for Efficiency

Dally: “Single-thread processors are in denial about these two facts”

We need different programming paradigms and different architectures on which to run them.
SLIDE 14

Dark Silicon

SLIDE 15

Related Work

SLIDE 16

Xilinx’s Vivado (Was xPilot, AutoESL)

SSDM (System-level Synthesis Data Model)
  • Hierarchical netlist of concurrent processes and communication channels
  • Each leaf process contains a sequential program represented by an extended LLVM IR with hardware-specific semantics
  • Port/IO interfaces, bit-vector manipulations, cycle-level notations

Hardware-Specific SSDM Semantics
  • Process port/interface semantics. FIFO: FifoRead() / FifoWrite(); Buffer: BuffRead() / BuffWrite(); Memory: MemRead() / MemWrite()
  • Bit-vector manipulation: bit extraction / concatenation / insertion; bit-width attributes for every operation and every value
  • Cycle-level notation. Clock: waitClockEvent()

SystemC input; classical high-level synthesis for processes

Jason Cong et al. ISARS 2005

SLIDE 17

Taylor and Swanson’s Conservation Cores

(Figure: a CFG of basic blocks BB0-BB2 is compiled into a custom datapath of adders, multipliers, loads, and stores plus an inter-basic-block state machine; the result is 0.01 mm² in 45 nm TSMC and runs at 1.4 GHz.)

C-core generation: code is compiled to stylized Verilog (.v) and taken through a standard CAD flow (Synopsys IC Compiler, place and route, clock tree synthesis).

Custom datapaths and controllers for loop kernels; uses the existing memory hierarchy.

Swanson, Taylor, et al. Conservation Cores. ASPLOS 2010.

SLIDE 18

Bacon et al.’s Liquid Metal

Fig. 2. Block-level diagram of DES and a Lime code snippet

JITting Lime (Java-like, side-effect-free, streaming) to FPGAs

Huang, Hormati, Bacon, and Rabbah, Liquid Metal, ECOOP 2008.

SLIDE 19

Goldstein et al.’s Phoenix

  int squares()
  {
    int i = 0, sum = 0;
    for ( ; i < 10; i++)
      sum += i * i;
    return sum;
  }

Figure 3: C program and its representation comprising three hyperblocks; each hyperblock is shown as a numbered rectangle. The dotted lines represent predicate values. (This figure omits the token edges used for memory synchronization.)

Figure 8: Memory access network and implementation of the value and token forwarding network. The LOAD produces a data value consumed by the oval node. The STORE node may depend on the load (i.e., we have a token edge between the LOAD and the STORE, shown as a dashed line). The token travels to the root of the tree, which is a load-store queue (LSQ).

C to asynchronous logic, monolithic memory

Budiu, Venkataramani, Chelcea and Goldstein, Spatial Computation, ASPLOS 2004.

SLIDE 20

Ghica et al.’s Geometry of Synthesis

(Figure 1: In-place map schematic and implementation, built from SEQ, WHILE, ASG, and DELTA components over commands, expressions, and variables.)

Algol-like imperative language to handshake circuits

Ghica, Smith, and Singh. Geometry of Synthesis IV, ICFP 2011

SLIDE 21

Greaves and Singh’s Kiwi

  public static void SendDeviceID() {
      int deviceID = 0x76;
      for (int i = 7; i > 0; i--) {
          scl = false;
          sda_out = (deviceID & 64) != 0;
          Kiwi.Pause(); // Set the i-th bit of the device ID
          scl = true;
          Kiwi.Pause(); // Pulse SCL
          scl = false;
          deviceID = deviceID << 1;
          Kiwi.Pause();
      }
  }

C# with a concurrency library to FPGAs

Greaves and Singh. Kiwi, FCCM 2008

SLIDE 22

Arvind, Hoe, et al.’s Bluespec

GCD Mod Rule: Gcd(a, b) if (a ≥ b) ∧ (b ≠ 0) → Gcd(a − b, b)

GCD Flip Rule: Gcd(a, b) if a < b → Gcd(b, a)

(Figure 1.3: Circuit for computing Gcd(a, b) from Example 1, built from enable (π) and data (δ) signals for the Mod and Flip rules.)

Guarded commands and functions to synchronous logic

Hoe and Arvind, Term Rewriting, VLSI 1999

SLIDE 23

Sheeran et al.’s Lava

(Excerpt from the Lava FFT case study; the scanned text was garbled and repeated twice, and is cleaned and deduplicated here.)

A DFT of length two is computed by a simple butterfly circuit, which takes two inputs x0 and x1 to the two outputs x0 + x1 and x0 − x1. FFT butterfly stages are constructed by riffling together the two halves of a sequence of length k, processing them by a column of k/2 butterfly circuits, and unriffling the result. Here riffle is the shuffle of a card sharp who perfectly interleaves the cards of two half decks:

  bfly :: CmplxArithmetic m => [CmplxSig] -> m [CmplxSig]
  bfly [i1, i2] = do
    o1 <- cplus (i1, i2)
    o2 <- csubtract (i1, i2)
    return [o1, o2]

  bflys :: CmplxArithmetic m => Int -> [CmplxSig] -> m [CmplxSig]
  bflys n = riffle >-> raised n two bfly >-> unriffle

Another important component is multiplication by a complex constant, implemented with a twiddle factor multiplier that maps a single complex input x to x · W_N^k. Multiplication of complete buses by −j uses the fact that W_4^1 equals −j:

  wMult :: CmplxArithmetic m => Int -> Int -> CmplxSig -> m CmplxSig
  wMult n k a = do
    twd <- w (n, k)
    ctimes (twd, a)

  minusJ :: CmplxArithmetic m => [CmplxSig] -> m [CmplxSig]
  minusJ = mapM (wMult 4 1)

Another useful component is the bit-reversal permutation, used in the first or last stage of the FFT circuits: a new wire position is the reversed binary representation of the old position. The permutation can be expressed using riffle:

  bitRev :: Monad m => Int -> [a] -> m [a]
  bitRev n = compose [ raised (n - i) two riffle | i <- [1..n] ]

(Figures: a butterfly, and a butterfly stage expressed by riffling, a column of butterflies, and unriffling.)

Functional specifications of regular structures

Bjesse, Claessen, Sheeran, and Singh. Lava, ICFP 1998

SLIDE 24

Kuper et al.’s CλaSH

fir (State (xs, hs)) x = (State (shiftInto x xs, hs), (x ⊲ xs) • hs)

Fig. 6. 4-tap FIR filter

More operational Haskell specifications of regular structures

Baaij, Kooijman, Kuper, Boeijink, and Gerards. Cλash, DSD 2010

SLIDE 25

My Crusade

SLIDE 26

Deterministic Concurrency: A Fool’s Errand?

What models of computation provide deterministic concurrency?

Synchrony: The Columbia Esterel Compiler, 2001–2006
Kahn Networks: The SHIM Model/Language, 2006–2010
The Lambda Calculus: This Project, 2010–

SLIDE 27

Our Project: Functional Programs to Hardware


SLIDE 34

Why Functional?

Referential transparency simplifies formal reasoning about programs

Inherently concurrent and deterministic (thank Church and Rosser)

Immutable data makes it vastly easier to reason about memory in the presence of concurrency

SLIDE 35

To Implement Real Algorithms, We Need

Structured, recursive data types
Recursion to handle recursive data types
Memories
Memory hierarchy

SLIDE 36

Structured, Recursive Data Types

SLIDE 37

Algebraic Data Types

In modern functional languages: ML, OCaml, Haskell, . . .

An algebraic type is a sum of product types. Basic example: a list of integers

  data IntList = Nil | Cons Int IntList

Constructing a list:

  Cons 42 (Cons 17 (Cons 2 (Cons 1 Nil)))

Summing the elements of a list:

  sum li = case li of
    Nil       → 0
    Cons x xs → x + sum xs
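The slide's list type and summation run unchanged as ordinary Haskell. A minimal self-contained sketch (the function is renamed `sumList` here only to avoid clashing with the Prelude's `sum`):

```haskell
-- Algebraic data type: an IntList is either Nil or a Cons cell
-- pairing an Int with the rest of the list.
data IntList = Nil | Cons Int IntList

-- Sum the elements by structural recursion over the two constructors.
sumList :: IntList -> Int
sumList li = case li of
  Nil       -> 0
  Cons x xs -> x + sumList xs

-- The list from the slide: 42, 17, 2, 1.
example :: IntList
example = Cons 42 (Cons 17 (Cons 2 (Cons 1 Nil)))
```

`sumList example` evaluates to 62.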

SLIDE 38

An Interpreter in One Slide

Abstract syntax tree data type:

  data Expr = Lit Int
            | Plus Expr Expr
            | Minus Expr Expr
            | Times Expr Expr

Recursive evaluation function:

  eval e = case e of
    Lit x       → x
    Plus e1 e2  → eval e1 + eval e2
    Minus e1 e2 → eval e1 − eval e2
    Times e1 e2 → eval e1 * eval e2

  eval (Plus (Lit 42) (Times (Lit 2) (Lit 50)))  gives  42 + 2 × 50 = 142

SLIDE 39

Algebraic Datatypes in Hardware: Lists

  data IntList = Cons Int IntList | Nil

(Figure: bit-level encoding of a cell; a tag bit distinguishes Cons from Nil, with a Cons cell holding an int in bits 1–32 and a pointer in bits 33–48.)
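The figure's bit-level layout can be illustrated with ordinary bit twiddling. This is only a hypothetical sketch of such an encoding: the field positions (tag in bit 0, integer in bits 1–32, pointer in bits 33–48) and the names `packCons` / `unpackCons` are assumptions for illustration, not the compiler's actual representation:

```haskell
import Data.Bits ((.&.), (.|.), shiftL, shiftR)
import Data.Word (Word64)

-- Hypothetical packing of a Cons cell: bit 0 is the tag
-- (1 = Cons, 0 = Nil), bits 1-32 hold the integer payload,
-- and bits 33-48 hold the pointer to the tail cell.
packCons :: Word64 -> Word64 -> Word64
packCons x ptr = 1 .|. (x `shiftL` 1) .|. (ptr `shiftL` 33)

-- Nil needs only the tag bit, which is 0.
packNil :: Word64
packNil = 0

-- Recover the integer and pointer fields from a packed Cons cell.
unpackCons :: Word64 -> (Word64, Word64)
unpackCons w = ( (w `shiftR` 1)  .&. 0xFFFFFFFF
               , (w `shiftR` 33) .&. 0xFFFF )
```

`unpackCons (packCons 42 7)` gives back `(42, 7)`.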

SLIDE 40

Recursion to Handle Recursive Data Types

SLIDE 41

What Do We Do With Recursion?

Compile it into tail recursion with explicit stacks [Zhai et al., CODES+ISSS 2015]

[Proceedings of the ACM Annual Conference, 1972]

Really clever idea: Sophisticated language ideas such as recursion and higher-order functions can be implemented using simpler mechanisms (e.g., tail recursion) by rewriting.

SLIDE 42

Removing Recursion: The Fib Example

  fib n = case n of
    1 → 1
    2 → 1
    n → fib (n−1) + fib (n−2)

SLIDE 43

Transform to Continuation-Passing Style

  fibk n k = case n of
    1 → k 1
    2 → k 1
    n → fibk (n−1) (λn1 →
          fibk (n−2) (λn2 → k (n1 + n2)))

  fib n = fibk n (λx → x)
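As a sanity check (not on the slides), the direct and continuation-passing versions can be run side by side; the wrapper name `fib'` is introduced here only to keep both in scope:

```haskell
-- Direct recursive Fibonacci from the previous slide.
fib :: Int -> Int
fib n = case n of
  1 -> 1
  2 -> 1
  _ -> fib (n - 1) + fib (n - 2)

-- Continuation-passing version: instead of returning, each call
-- hands its result to an explicit continuation k.
fibk :: Int -> (Int -> Int) -> Int
fibk n k = case n of
  1 -> k 1
  2 -> k 1
  _ -> fibk (n - 1) (\n1 ->
         fibk (n - 2) (\n2 -> k (n1 + n2)))

-- Recover the direct interface by passing the identity continuation.
fib' :: Int -> Int
fib' n = fibk n (\x -> x)
```

Both give the same results, e.g. `fib 10 == fib' 10 == 55`.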

SLIDE 44

Name Lambda Expressions (Lambda Lifting)

  fibk n k = case n of
    1 → k 1
    2 → k 1
    n → fibk (n−1) (k1 n k)

  k1 n k n1 = fibk (n−2) (k2 n1 k)
  k2 n1 k n2 = k (n1 + n2)
  k0 x = x

  fib n = fibk n k0

SLIDE 45

Represent Continuations with a Type

  data Cont = K0 | K1 Int Cont | K2 Int Cont

  fibk n k = case (n, k) of
    (1, k) → kk k 1
    (2, k) → kk k 1
    (n, k) → fibk (n−1) (K1 n k)

  kk k a = case (k, a) of
    ((K1 n k),  n1) → fibk (n−2) (K2 n1 k)
    ((K2 n1 k), n2) → kk k (n1 + n2)
    (K0,        x)  → x

  fib n = fibk n K0

SLIDE 46

Merge Functions

  data Cont = K0 | K1 Int Cont | K2 Int Cont
  data Call = Fibk Int Cont | KK Cont Int

  fibk z = case z of
    (Fibk 1 k)        → fibk (KK k 1)
    (Fibk 2 k)        → fibk (KK k 1)
    (Fibk n k)        → fibk (Fibk (n−1) (K1 n k))
    (KK (K1 n k) n1)  → fibk (Fibk (n−2) (K2 n1 k))
    (KK (K2 n1 k) n2) → fibk (KK k (n1 + n2))
    (KK K0 x)         → x

  fib n = fibk (Fibk n K0)

SLIDE 47

Add Explicit Memory Operations

  read  :: CRef → Cont
  write :: Cont → CRef

  data Cont = K0 | K1 Int CRef | K2 Int CRef
  data Call = Fibk Int CRef | KK Cont Int

  fibk z = case z of
    (Fibk 1 k)        → fibk (KK (read k) 1)
    (Fibk 2 k)        → fibk (KK (read k) 1)
    (Fibk n k)        → fibk (Fibk (n−1) (write (K1 n k)))
    (KK (K1 n k) n1)  → fibk (Fibk (n−2) (write (K2 n1 k)))
    (KK (K2 n1 k) n2) → fibk (KK (read k) (n1 + n2))
    (KK K0 x)         → x

  fib n = fibk (Fibk n (write K0))
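The slide's `read` and `write` are abstract memory operations destined for hardware. As a rough executable model (an assumption: here the heap is a Haskell list indexed by `CRef`, `write` appends, and nothing is ever freed), the same program can be run in software:

```haskell
-- Continuation records live in an explicit heap; CRef is an index.
type CRef = Int
data Cont = K0 | K1 Int CRef | K2 Int CRef
data Call = Fibk Int CRef | KK Cont Int

type Heap = [Cont]

-- Dereference a continuation record.
readC :: Heap -> CRef -> Cont
readC h r = h !! r

-- Allocate a new record at the end of the heap; never freed (a sketch).
writeC :: Heap -> Cont -> (Heap, CRef)
writeC h k = (h ++ [k], length h)

-- The merged fibk/kk function, now threading the heap explicitly.
fibk :: Heap -> Call -> Int
fibk h z = case z of
  Fibk 1 k        -> fibk h (KK (readC h k) 1)
  Fibk 2 k        -> fibk h (KK (readC h k) 1)
  Fibk n k        -> let (h', r) = writeC h (K1 n k)
                     in  fibk h' (Fibk (n - 1) r)
  KK (K1 n k) n1  -> let (h', r) = writeC h (K2 n1 k)
                     in  fibk h' (Fibk (n - 2) r)
  KK (K2 n1 k) n2 -> fibk h (KK (readC h k) (n1 + n2))
  KK K0 x         -> x

fib :: Int -> Int
fib n = let (h, r) = writeC [] K0 in fibk h (Fibk n r)
```

`fib 10` still gives 55, matching the earlier versions.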

SLIDE 48

Simplified Functional to Dataflow

SLIDE 49

Functional to Dataflow

Sum a list using an accumulator and tail recursion:

  sum lp s = case read lp of
    Nil       → s
    Cons x xs → sum xs (s + x)
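The accumulator version runs as ordinary Haskell once the pointer dereference is modeled as plain pattern matching (an assumption; on the slide, `read lp` is a memory operation):

```haskell
data IntList = Nil | Cons Int IntList

-- Accumulator-style (tail-recursive) sum: s carries the running total,
-- so the recursive call is the last action and needs no return stack.
sumAcc :: IntList -> Int -> Int
sumAcc lp s = case lp of
  Nil       -> s
  Cons x xs -> sumAcc xs (s + x)

example :: IntList
example = Cons 42 (Cons 17 (Cons 2 (Cons 1 Nil)))
```

`sumAcc example 0` evaluates to 62.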

(Successive slides build up the corresponding dataflow graph: a read actor, Nil/Cons dispatch, an adder for s + x, and feedback of xs and the new accumulator to the lp and s inputs.)

SLIDE 65

Non-strict functions enable pipelining

(Chart: cycles relative to strict evaluation, ranging from 0.2 to 1, for TreeMap, MergeSort, Map, Filter, DFS, and Append; bars show the speedup from non-strict functions due to pipelining and the best possible speedup from unbounded buffers.)

SLIDE 66

Dataflow to Hardware

SLIDE 67

A Latency-Insensitive Protocol

The upstream sends data and valid; the downstream answers with ready.

  valid  ready  action
  0      –      No token
  1      1      Token transferred
  1      0      Token held upstream

(Waveform: data values 1–6 cross the channel as valid and ready toggle over successive clock cycles.)

Inspired by Carloni et al. [Cao et al., Memocode 2015]
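The table's rules can be sketched as a pure function over one clock cycle. The type `Action` and function `step` are hypothetical names that mirror the table, not part of any real implementation:

```haskell
-- What happens on a channel in one clock cycle.
data Action = NoToken | Transfer | HeldUpstream
  deriving (Eq, Show)

-- One cycle of the handshake, given the upstream's valid bit and the
-- downstream's ready bit: a token moves only when both are asserted.
step :: Bool -> Bool -> Action
step False _     = NoToken       -- upstream offers nothing
step True  True  = Transfer      -- token crosses to the downstream
step True  False = HeldUpstream  -- downstream stalls; token waits
```

For example, `step True False` is `HeldUpstream`: the upstream must keep the token (and its data) stable until the downstream asserts ready.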

SLIDE 68

Input and Output Buffers

(Figure: an input buffer feeds the core, which feeds an output buffer; data and valid flow forward while ready flows backward.)

Combinational paths broken: the input buffer breaks the ready path; the output buffer breaks the data/valid path.

SLIDE 69

Larger Systems Run Just As Fast

  Splitters  Token Bits  Fmax (MHz)  ALMs   %   Registers
  2          32          167         189    1   414
  2          64          157         350    1   798
  2          128         152         672    2   1573
  32         128         137         10821  26  25536
  64         128         140         21704  52  51168
  4          64          158         700    2   1621
  8          64          145         1409   3   3261
  16         64          147         2826   7   6559
  32         64          144         5682   14  13148
  64         64          138         11404  27  26414
  128        64          140         22914  55  53087

Synthesis results on an Altera Cyclone V. 166 MHz target clock rate.

SLIDE 70

Moore’s Law is alive and well, but we hit a power wall in 2005.

Massive parallelism is now mandatory, and communication is the culprit: a 64-bit FPU is 0.1 mm² and takes 50 pJ/op at 1.5 GHz; a 64-bit 1 mm on-chip channel takes 25 pJ/word; a 64-bit off-chip channel takes 1 nJ/word; sending a 64-bit float across a 20 mm × 10 mm chip takes 250 pJ and 4 cycles.

SLIDE 71

Dark silicon is the future: transistors keep getting faster, but most must remain off.

Custom accelerators are the future, and there are many approaches (e.g., conservation cores: C code compiled to stylized Verilog through a CAD flow, giving custom datapaths and controllers for loop kernels).

Our project: a pure functional language to FPGAs.

SLIDE 72

Recap:
  Algebraic data types in hardware
  Removing recursion
  Functional to dataflow
  Dataflow to hardware

Encoding the Types

  Huffman tree nodes (19 bits): tag bit, then Leaf (8-bit char) or Branch (9-bit pointer, 9-bit pointer)
  Boolean input stream (14 bits): tag bit, then Cons (Boolean, 12-bit pointer) or Nil
  Character output stream (19 bits): tag bit, then Cons (8-bit char, 10-bit pointer) or Nil

(The slide also reprises the explicit-memory fib code from Slide 47, the list-summing dataflow example from Slide 49, and the input and output buffers from Slide 68.)