Graph Reduction Hardware Revisited Rob Stewart ( R.Stewart@hw.ac.uk ) - - PowerPoint PPT Presentation



SLIDE 1

Graph Reduction Hardware Revisited

Rob Stewart (R.Stewart@hw.ac.uk)¹, Evgenij Belikov¹, Hans-Wolfgang Loidl¹, Paulo Garcia²

14th June 2018

¹ Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh, UK

² United Technologies Research Center, Cork, Republic of Ireland

SLIDE 2

FPUs on FPGAs?

  • CPUs for general purpose software
  • GPUs for numeric computation
  • Growth of domain specific custom hardware
  • e.g. TensorFlow ASIC chip for deep learning
  • How about FPUs? (Functional Processing Units)
  • Goal: accelerated, efficient graph reduction
  • Deployment: Amazon F1 Cloud instances include FPGAs
  • Motivation: widening use of functional languages

SLIDE 3

Graph Reduction

  • 1. Write programs in a big language e.g. Haskell
  • 2. Compile to a small language
  • e.g. Haskell → GHC Core → hardware backend
  • Lambda terms

x (variable)    (λx.M) (abstraction)    (M N) (application)

  • 3. Computation by reduction

(λx.M[x]) →α (λy.M[y])    (α-conversion)

(λx.M) E →β M[x := E]    (β-reduction)

(λy.y + 1) 3 →β (3 + 1)
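The three term forms and the β rule can be sketched in Haskell; a minimal, capture-naive sketch where the names `Term`, `subst` and `beta` are illustrative, not from the talk:

```haskell
-- Lambda terms: variable, abstraction, application.
data Term
  = Var String          -- x
  | Lam String Term     -- (λx.M)
  | App Term Term       -- (M N)
  deriving (Eq, Show)

-- subst m x e computes M[x := E]. Variable capture is ignored for
-- brevity; a full implementation would α-convert bound names first.
subst :: Term -> String -> Term -> Term
subst (Var y)   x e | y == x    = e
                    | otherwise = Var y
subst (Lam y m) x e | y == x    = Lam y m              -- x is shadowed
                    | otherwise = Lam y (subst m x e)
subst (App m n) x e = App (subst m x e) (subst n x e)

-- One β step at the root: (λx.M) E → M[x := E]
beta :: Term -> Maybe Term
beta (App (Lam x m) e) = Just (subst m x e)
beta _                 = Nothing
```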

SLIDE 4

Historic Graph Reduction Machines

  • Graph reduction machines in the 1980s, e.g.
  • GRIP (University College London), ALICE (Imperial College)
  • 15-year conference series:
  • Functional Programming Languages and Computer Architecture
  • Dedicated workshop in 1986:
  • Graph Reduction, Santa Fe, New Mexico, USA.

Springer LNCS, volume 279, 1987

SLIDE 5

Graph Reduction Hardware Abandonment

"Programmed reduction systems are not so elegant as pure reduction systems, but offer the advantage that we can make use of technology developed over the last 35 years to implement von Neumann based architectures." (Richard Kieburtz, 1985)

  • Abandoned in favour of commodity-off-the-shelf (COTS) processors, e.g. Intel/AMD CPUs
  • Custom hardware took years to build
  • Free lunch: clock frequency speedups
  • Just build a compiler + runtime system in software

SLIDE 6

Graph Reduction Hardware Resurgence

"Current RISC technology will probably have increased in speed enough by the time a [graph reduction] chip could be designed and fabricated to make the exercise pointless." (Philip John Koopman Jr, 1990)

  • Historic drawbacks no longer hold thanks to FPGAs
  • hardware development time reduced
  • design iteration: FPGAs are reconfigurable
  • Hardware trend to more space rather than more performance


"Three Ages of FPGAs: A Retrospective on the First Thirty Years of FPGA Technology", Stephen M. Trimberger, Proceedings of the IEEE, 2015.

SLIDE 7

Graph Reduction

SLIDE 8

Concrete Graph Representation

(λx.(x + ((λy.y + 1) 3))) 2

[Figure: the expression drawn as a graph of application (@), lambda (λ), variable (V), primitive (P), and number (N) nodes.]


“Possible Concrete Representations”, Chapter 10 of The Implementation of Functional Programming Languages, Simon Peyton Jones, 1987.
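The cell forms in the figure can be sketched as a Haskell datatype. This is a tree-shaped approximation; a concrete heap representation would use tagged cells and pointers, and the constructor names here are illustrative:

```haskell
-- Graph node forms: application (@), lambda, variable (V),
-- primitive (P), and number (N).
data Node
  = Ap Node Node        -- @  application node
  | Lam String Node     -- λx lambda node
  | V String            -- V  variable
  | P String            -- P  primitive, e.g. "+"
  | N Int               -- N  integer literal
  deriving (Eq, Show)

-- The slide's expression (λx.(x + ((λy.y + 1) 3))) 2 as a graph,
-- with x + E spelled out as ((+ x) E).
example :: Node
example =
  Ap (Lam "x" (Ap (Ap (P "+") (V "x"))
                  (Ap (Lam "y" (Ap (Ap (P "+") (V "y")) (N 1))) (N 3))))
     (N 2)
```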

SLIDE 9

Evaluation with β-reduction


Let's compute (λx.(x + ((λy.y + 1) 3))) 2 using (λx.M) N →β M[x := N]:

(λx.(x + ((λy.y + 1) 3))) 2
  ⇒ (2 + ((λy.y + 1) 3))
  ⇒ (2 + (3 + 1))

[Figure: the corresponding graph after each reduction step.]
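The reduction sequence can be checked with a small normal-order evaluator. A self-contained sketch: `Num` and `Add` stand in for the primitive + nodes, substitution is capture-naive (safe here, since the example has no name clashes), and all names are illustrative:

```haskell
-- Lambda terms extended with integers and primitive addition.
data Term = Var String | Lam String Term | App Term Term
          | Num Int | Add Term Term
          deriving (Eq, Show)

-- Capture-naive substitution: subst x e t computes t[x := e].
subst :: String -> Term -> Term -> Term
subst x e (Var y)   | y == x = e
subst x e (Lam y m) | y /= x = Lam y (subst x e m)
subst x e (App m n) = App (subst x e m) (subst x e n)
subst x e (Add m n) = Add (subst x e m) (subst x e n)
subst _ _ t         = t

-- Evaluate by repeated β reduction plus primitive addition.
eval :: Term -> Int
eval (Num n)           = n
eval (Add m n)         = eval m + eval n
eval (App (Lam x m) e) = eval (subst x e m)   -- β step
eval t                 = error ("stuck term: " ++ show t)

-- (λx.(x + ((λy.y + 1) 3))) 2
example :: Term
example =
  App (Lam "x" (Add (Var "x")
                    (App (Lam "y" (Add (Var "y") (Num 1))) (Num 3))))
      (Num 2)
```

Evaluating `example` performs exactly the two β steps above and yields 6.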

SLIDE 10

Parallel Graph Reduction

SLIDE 11

Parallel Graph Reduction in Software

"I wonder how popular Haskell needs to become for Intel to optimise their processors for my runtime, rather than the other way around." (Simon Marlow, 2009)

  • par/pseq to enforce evaluation order
  • Parallel graph reduction with par, e.g.
  • f `par` (e `pseq` (e + f))
  • Multicore: instructions for sequential reductions in each core
  • Distributed parallel graph reduction: GUM supports par/pseq
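The par/pseq idiom can be sketched with sequential stand-ins. The real combinators come from `Control.Parallel` in the `parallel` package; the fallbacks below preserve their meaning (`par` sparks its first argument for possible parallel evaluation, `pseq` orders evaluation) but run on one core:

```haskell
-- Stand-ins for Control.Parallel's combinators (parallel package):
-- `par a b` sparks `a` and returns `b`; `pseq a b` evaluates `a`
-- to weak head normal form before returning `b`.
par :: a -> b -> b
par _a b = b

pseq :: a -> b -> b
pseq a b = a `seq` b

-- A deliberately slow function to give the spark some work.
fib :: Int -> Int
fib n | n < 2     = n
      | otherwise = fib (n - 1) + fib (n - 2)

-- The slide's idiom: spark f, force e, then combine.
parSum :: Int -> Int
parSum n = f `par` (e `pseq` (e + f))
  where e = fib n
        f = fib (n + 1)
```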


SLIDE 12

Graph Reduction on FPGAs

SLIDE 13

FPGAs versus CPUs

CPUs: heap in DDR memory, sequential β reduction in each core.

Idea: soft processor on an FPGA for parallel graph reduction.


"A Comparison of CPUs, GPUs, FPGAs, and Massively Parallel Processor Arrays for Random Number Generation", David Thomas et al., Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 2009.

SLIDE 14

Reduceron: Graph Reduction with Templates on FPGAs

  • Uses template instantiation, substitutes arguments into bodies
  • F-lite functions compiled to templates
  • Large reduction steps: on average 2 reductions per function application
  • 6 reduction operations each cost only 1 cycle because
  • parallel memory transactions
  • wide memories


“The Reduceron reconfigured and re-evaluated”, Matthew Naylor and Colin Runciman, Journal of Functional Programming, Volume 22, Issue 4-5, pp. 574-613, 2012.

SLIDE 15

Reduceron: reduction with templates

f ys x xs = g x (h xs ys)

[Figure, animated across slides 15-22: the template for f, g @2 (h @3 @1), is instantiated, reading arguments a, b and c from the stack and writing the applications g b and h c a to the heap.]

  • Reduceron has access to parallel memories:
  • 1. stack
  • 2. heap
  • 3. templates
  • FPGAs: function application in a single clock cycle

From “Dynamic analysis in the Reduceron”, Matthew Naylor and Colin Runciman, University of York, 2009. https://www.cs.york.ac.uk/fp/reduceron/memos/Memo41.pdf
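The instantiation step can be sketched in Haskell. This is an illustrative model only (real Reduceron templates also carry arities and spine lengths, omitted here): the template body for f refers to its arguments positionally, and instantiation substitutes the stack's actual arguments in one pass:

```haskell
-- Template atoms: a named function, a positional argument (@i), or
-- an application spine.
data Atom = Fun String | Arg Int | Ap [Atom]
  deriving (Eq, Show)

-- Template for f ys x xs = g x (h xs ys), with ys = @1, x = @2,
-- xs = @3 as on the slide: g @2 (h @3 @1).
templateF :: Atom
templateF = Ap [Fun "g", Arg 2, Ap [Fun "h", Arg 3, Arg 1]]

-- Instantiate a template body against the argument list taken from
-- the stack (1-based indices, matching the @i notation).
instantiate :: [Atom] -> Atom -> Atom
instantiate args (Arg i) = args !! (i - 1)
instantiate args (Ap as) = Ap (map (instantiate args) as)
instantiate _    atom    = atom
```

Applying f with ys = a, x = b, xs = c rewrites the spine to g b (h c a), the applications built up in the heap across the slides.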

SLIDE 23

Extending Reduceron for Modern FPGAs

  • 1. Parallel graph reduction
  • Fit multiple reduction cores onto the FPGA fabric
  • 2. Off-chip heap
  • Real-world programs need 100s of MBs of heap space
  • 3. Caching
  • Low-latency on-chip access for successive reductions of an expression
  • 4. Compiler optimisations
  • Profit from space- and time-saving compiler optimisations


SLIDE 24

Modern FPGAs for Graph Reduction

SLIDE 25

On-Chip Memory

Would a heap fit entirely on chip (with a garbage collector)?


SLIDE 26

On-Chip Space

"The functional language is a ballerina at an imperative square dance. A multiprocessor of appropriate design could better serve the functional language's requirements." (William Partain, 1989)

FPGA                       Slice LUTs   BRAMs   Reduceron Cores
Kintex 7 kc705             10%          30%     3
Zynq 7 zc706               9%           24%     4
Virtex 7 vc709             5%           9%      10
Virtex UltraScale vcu100   2%           3%      28

Multiple reducer cores, potential for parallel graph reduction.


SLIDE 27

Allocations off-chip

Can GHC optimisations reduce the need for off-chip allocation on the FPGA?

  • inlining,
  • strictness analysis,
  • let floating,
  • eta expansion, ...

European Symposium on Programming, 1996.

GHC 7.8 → 8.2: 72% reduced allocation for k-nucleotide, almost 100% for n-body.
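As a hand-worked sketch of the kind of allocation these optimisations remove (strictness analysis in this case; the function names are illustrative): the lazy loop below allocates a chain of (+) thunks for its accumulator, while the forced version runs in constant heap. GHC performs this transformation automatically when its analysis proves the accumulator is always demanded.

```haskell
-- Lazy accumulator: each recursive call allocates a thunk for
-- acc + n, so the heap grows linearly in n before anything is added.
sumLazy :: Int -> Int -> Int
sumLazy acc 0 = acc
sumLazy acc n = sumLazy (acc + n) (n - 1)

-- Forcing the accumulator keeps the loop in constant heap space;
-- this is what strictness analysis lets GHC do to sumLazy itself.
sumStrict :: Int -> Int -> Int
sumStrict acc 0 = acc
sumStrict acc n = let acc' = acc + n
                  in acc' `seq` sumStrict acc' (n - 1)
```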

SLIDE 29

FPGA ↔ Memory Bandwidth

What throughput is achievable for off-chip heap reads and writes?


SLIDE 30

Summary

  • Simple parallelism?
  • evaluate strict (always needed) function arguments in parallel
  • Dynamic parallelism? borrow GHC RTS ideas:
  • par/pseq for programmer controlled parallel task sizes
  • black holes to avoid duplicating work
  • load balancing between cores
  • Proposed hardware
  • HDL → synthesis → graph reduction FPGA machine
  • Dedicated cache manager
  • Off-chip memories for heap
  • Parallel reduction with multiple reduction cores
  • Compiling Haskell to it
  • Haskell → GHC Core → templates
  • profit from GHC optimisations
  • ... but GHC Core is a bigger language than F-lite (Reduceron)
  • ... therefore more challenging to support


SLIDE 31

[Diagram: the compilation landscape. Frontends: Haskell DSLs (Clash, Lava), GHC Core/System F, F-lite (a subset), STG. Intermediate forms: templates, GRIN, PilGRIM instructions. Backends: Reduceron on a Virtex 5, PilGRIM, Intel/ARM CPUs, and Verilog/VHDL for Virtex 7/UltraScale. The proposed approach compiles GHC Core to templates for an FPGA machine with:]

  • multiple reducers
  • off-chip heap
  • on-chip graph caches