SLIDE 1

Making FPGAs Programmable as Computers and Doing It At Scale

Paul Chow
High-Performance Reconfigurable Computing Group
Department of Electrical and Computer Engineering, University of Toronto

SLIDE 2

What’s the real goal?

  • Build large-scale applications with FPGAs without added pain ☺
  • Where do we stand?
  • Need for abstractions and middleware support
  • Our work at UofT to support HPC with FPGAs

SLIDE 3

OUR PHILOSOPHY

SLIDE 4

Preserve Current Programming Models

  • Program and use an FPGA just like any other software-based processor
  • (Software) programmers should not necessarily need to know that processing is done on an FPGA
    – Ability to pick FPGA execution for performance/power reasons – even better if this is automatic!

SLIDE 5

WHERE DO WE STAND?

SLIDE 6

High-Level Synthesis

  • Raises the level of abstraction above hardware design
  • Lots of great research
  • Absolutely required
  • Tremendous progress recently
  • Can describe complex computations and functions algorithmically and create hardware

But!!! We are still just building custom hardware. HLS is only a part of the big picture…

SLIDE 7

Consider Portability

  • How many of you have taken a C program written on one platform and just recompiled it to run on another?
  • How many have done the same for code targeted for an FPGA platform? (If you have even tried! ☺)
  • Much is invested in writing any application
  • Reuse, modify, enhance, evolve it
  • Code should run on any platform

SLIDE 8

Consider Design Environments

  • Well-developed in software
    – IDEs: Visual Studio, NetBeans
    – Linux + emacs/vi + gcc
    – Good open source options
  • FPGA hardware
    – Vivado, Quartus
    – Not what a software developer would expect
    – No open source options
  • Makefiles vs TCL!

SLIDE 9

COMPUTING ABSTRACTIONS FOR FPGAS

SLIDE 10

What Abstractions?

  • Memory model
    – Data[127:0] vs connect to a memory controller
  • I/O
    – read()/write() vs connect to a PCIe controller
    – USB
    – Networking – TCP/IP, UDP
  • Services
    – Filesystem, status, control
  • FPGAs have lacked all of these things (see the sketch below)
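To make the contrast concrete, here is a minimal C sketch of the software world's I/O abstraction (the device path /dev/accel0 is hypothetical): the programmer calls read() and never sees the bus protocol, which is exactly the kind of service FPGA designs have historically had to build by hand.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* The OS hides the controller: the same read() works whether the
       device sits behind PCIe, USB, or a network link. */
    int fd = open("/dev/accel0", O_RDONLY);   /* hypothetical device node */
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096];
    ssize_t n = read(fd, buf, sizeof buf);    /* no bus protocol in sight */
    if (n >= 0) printf("read %zd bytes\n", n);

    close(fd);
    return 0;
}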

SLIDE 11

Many Approaches

  • APIs to connect to FPGAs
  • Hardware threads

  ➢ Commercial (non-vendor) tools
  ➢ Commercial vendor tools
  ➢ UofT approach

  • Give a representative, not complete, set of examples

SLIDE 12

Commercial Non-Vendor Tools

  • HLS with an environment to debug, monitor performance, load, and run hardware
    – Handel-C
    – Impulse-C
    – Maxeler – proprietary hardware
  • Not broadly used because of proprietary tools (and sometimes hardware)

SLIDE 13

Commercial Vendor Tools

  • OpenCL tools – SDAccel, SDK for OpenCL
    – The data centre is a major target
  • OpenCL is an open “standard”
    – Possible to have cross-vendor, cross-platform portability, even cross-architecture (GPUs, Xeon Phi, etc.)
    – More interest than in the proprietary approaches
  • SDSoC – for C/C++
    – On SoC platforms, but could be any heterogeneous system

SLIDE 14

Why FPGA OpenCL is more like computing

  • Provides a higher-level software abstraction
    – Don’t worry about the CPU–FPGA communication layer
      • PCIe, QPI, AXI, Ethernet, etc.
    – Runtime manages bitstreams, memory allocation, data transfers
    – Transparently uses HLS for the kernels
  • A knowledgeable software person can use it
    – Must understand parallelism, basic architectural concepts, latency and throughput, and I/O for data in terms of structure and protocols
    – Doesn’t need to know about clocks
  • Early days still, but you can see where it’s going (see the host-code sketch below)
    – FPGA vendors learning to be computer companies
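As a hedged illustration of what that runtime hides, here is a minimal OpenCL 1.2 host-code sketch; the vecadd kernel and the vecadd.xclbin container name are made up for illustration, and error checking is omitted. Note that no PCIe, DMA, AXI, or clock details appear anywhere:

#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* Discover the platform and the accelerator (FPGA) device. */
    cl_platform_id plat; cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_ACCELERATOR, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    /* On an FPGA the "binary" is a precompiled bitstream container;
       the runtime takes care of loading it onto the device. */
    FILE *f = fopen("vecadd.xclbin", "rb");              /* hypothetical file */
    fseek(f, 0, SEEK_END); size_t len = ftell(f); rewind(f);
    unsigned char *bin = malloc(len); fread(bin, 1, len, f); fclose(f);
    cl_program prog = clCreateProgramWithBinary(ctx, 1, &dev, &len,
                          (const unsigned char **)&bin, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "vecadd", NULL);  /* hypothetical kernel */

    /* The runtime manages allocation and host<->device transfers. */
    enum { N = 1024 };
    float a[N], c[N];
    for (int i = 0; i < N; i++) a[i] = (float)i;
    cl_mem d_a = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  sizeof a, NULL, NULL);
    cl_mem d_c = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof c, NULL, NULL);
    clEnqueueWriteBuffer(q, d_a, CL_TRUE, 0, sizeof a, a, 0, NULL, NULL);
    clSetKernelArg(k, 0, sizeof d_a, &d_a);
    clSetKernelArg(k, 1, sizeof d_c, &d_c);
    size_t gsz = N;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &gsz, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, d_c, CL_TRUE, 0, sizeof c, c, 0, NULL, NULL);
    clFinish(q);
    printf("c[1] = %f\n", c[1]);
    return 0;
}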

SLIDE 15

OpenCL is not HLS

  • Recognize that HLS is not what makes OpenCL a computing environment
    – HLS is necessary but not sufficient
  • It’s the other stuff under the hood + HLS
    – Run-time services
      • Data transfer, memory management, bitstream loading
    – Hardware shell services
      • CPU/FPGA interconnect, DMA engine, memory controller

SLIDE 16

Scalability

  • OpenCL, as a programming model, does not scale
  • Could scale by using MPI between nodes and OpenCL to build the accelerator (sketched below)
    – As is done with MPI + OpenMP
    – Need to deal with two programming models
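A minimal sketch of that two-model structure, assuming a hypothetical helper run_fpga_kernel() standing in for OpenCL host code like the sketch above: MPI handles distribution across nodes, OpenCL handles the accelerator within each node.

#include <mpi.h>
#include <stdlib.h>

/* Hypothetical stand-in for OpenCL host code like the earlier sketch:
   in the real hybrid this would enqueue the FPGA kernel on this node. */
static void run_fpga_kernel(float *chunk, int n) {
    for (int i = 0; i < n; i++) chunk[i] *= 2.0f;   /* placeholder compute */
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    enum { N = 1 << 20 };                  /* assumes N divisible by size */
    int chunk_n = N / size;
    float *chunk = malloc(chunk_n * sizeof *chunk);
    float *all = NULL;
    if (rank == 0) {
        all = malloc(N * sizeof *all);
        for (int i = 0; i < N; i++) all[i] = 1.0f;
    }

    /* Programming model 1: MPI distributes work across the nodes... */
    MPI_Scatter(all, chunk_n, MPI_FLOAT, chunk, chunk_n, MPI_FLOAT,
                0, MPI_COMM_WORLD);
    /* ...programming model 2: OpenCL drives each node's accelerator. */
    run_fpga_kernel(chunk, chunk_n);
    MPI_Gather(chunk, chunk_n, MPI_FLOAT, all, chunk_n, MPI_FLOAT,
               0, MPI_COMM_WORLD);

    free(chunk); free(all);
    MPI_Finalize();
    return 0;
}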

SLIDE 17

WORK AT UOFT

SLIDE 18

Classic accelerator model: Master-Slave

[Figure: CPU master directing FPGA slave accelerators]

  • Need custom APIs to interact with accelerators
  • Lacks portability and scalability

SLIDE 19

Our programming model philosophy

  • Use a common API for Software and Hardware

[Figure: kernel migration through a common API. In successive stages, the same application kernel runs on the x86 CPU, then on an embedded CPU inside the FPGA, then as a custom processing element in hardware; every stage talks through the same Common API over the interconnect, so kernels migrate without changing the rest of the application.]

SLIDE 20

Common SW/HW API

  • CPU and FPGA components can initiate data transfers – they are peers
  • SW and HW components use similar call formats (see the sketch below)
  • For distributed memory and message passing, this was implemented by TMD-MPI (TMD: Toronto Molecular Dynamics)
  • For shared memory, we are building hardware infrastructure for a common API for PGAS
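A minimal sketch of what the peer model means in practice, using standard MPI calls (which TMD-MPI mirrors in hardware); the ranks-and-messages code is the same whether a given rank is a CPU process or an FPGA engine:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    float forces[256] = {0};
    if (rank == 0) {
        /* Rank 0 might be the x86 host... */
        MPI_Send(forces, 256, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* ...and rank 1 an FPGA engine: with a common API the sender
           neither knows nor cares which kind of peer it is talking to. */
        MPI_Recv(forces, 256, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received a force block\n");
    }

    MPI_Finalize();
    return 0;
}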

SLIDE 21

Why, again, a common API?

  • Developer can focus on the algorithm, exposing parallel tasks in a pure software environment
    – Easier development: SW prototyping → migration
  • Model makes no distinction between CPUs and FPGAs (in terms of data communication, synchronization)
  • Map tasks to computing elements later
    – Not as part of the initial design
  • FPGA-initiated communication relieves the CPU (even more so for one-sided communication)
  • FPGA-only systems (or one CPU + many FPGAs) can work efficiently

SLIDE 22

BUILDING A LARGE HETEROGENEOUS HPC APPLICATION WITH MPI

SLIDE 23

Molecular Dynamics

  • Simulate motion of molecules at the atomic level
  • Highly compute-intensive
  • Understand protein folding
  • Computer-aided drug design

SLIDE 24

Origin of Computational Complexity

Simulations span $10^3$ to $10^{10}$ atoms, and the classical force field below must be evaluated every timestep.

Bonded terms, O(n):

  $U_b = \sum_i k_{b,i}\,(r_i - r_{0,i})^2$

  $U_a = \sum_i k_{a,i}\,(\theta_i - \theta_{0,i})^2$

  $U_t = \sum_i \begin{cases} k_i \left[ 1 + \cos(n_i \phi_i - \gamma_i) \right], & n_i \neq 0 \\ k_i\,(\phi_i - \gamma_i)^2, & n_i = 0 \end{cases}$

Nonbonded terms, O(n^2), the origin of the computational complexity:

  $V(r) = 4\varepsilon \left[ \left( \frac{\sigma}{r} \right)^{12} - \left( \frac{\sigma}{r} \right)^{6} \right]$

  $U_C = \frac{1}{2} \sum_{\mathbf{n}} \sum_{i=1}^{N} \sum_{j=1}^{N} \frac{q_i q_j}{\left| \mathbf{r}_{ij} + \mathbf{n}\tau \right|}$

SLIDE 25

The TMD Machine

  • The Toronto Molecular Dynamics Machine
  • Use a multi-FPGA system to accelerate MD
  • Built using an MPI programming model
  • Principal algorithm developer: Chris Madill, Ph.D. candidate (now done!) in Biochemistry
    – Writes C++ using MPI, not Verilog/VHDL
  • Have used three platforms – portability
  • Plus scalability and maintainability

SLIDE 26

UofT MPI Approach (FPL 2006)

[Figure: TMD-MPI design flow, annotated "Also a system simulation" and "HLS can do this".]

SLIDE 27

Platform Evolution

  • Network of five V2Pro PCI cards (2006)
    – First to integrate hardware acceleration
    – Simple LJ fluids only
  • Network of BEE2 multi-FPGA boards (2007)
    – Added electrostatic terms
    – Added bonded terms

FPGA portability and design abstraction facilitated ongoing migration.

SLIDE 28

2010 – Xilinx/Nallatech ACP

[Figure: stack of 5 large Virtex-5 FPGAs plus 1 FPGA for the FSB PHY interface, mounted in a quad-socket Xeon server.]

SLIDE 29

Typical MD Simulator

[Figure: each CPU_i runs Process_i, computing the Bonded, Nonbonded, and PME force terms on its partition Data_i; the same process structure is replicated across all CPUs.]

SLIDE 30

TMD Machine Architecture

[Figure: a network of hardware engines – Atom Managers, Bond Engines, Short-range Nonbond Engines, Long-range Electrostatics Engines, an Input Scheduler, and a Visualizer Output – exchanging messages with calls like MPI::Send(&msg, size, dest, …).]

SLIDE 31

Target Platform for MD

[Figure: quad-socket Xeon server – three sockets carry FSB modules with NBE FPGAs (two modules also carry a PME FPGA and memory), the fourth carries the quad-core Xeon and system memory; links labelled 8.5 GB/s @ 1066 MHz (FSB) and 72.5 GB/s. Inset: initial breakdown of CPU time among short-range nonbonded, long-range electrostatic, and bond terms.]

  • 12 short-range nonbond (NBE) FPGAs
    – 2–3 pipelines per NBE FPGA; each runs 15–30x a CPU → NBE 360–1080x overall
  • 2 PME FPGAs with fast memory and fibre-optic interconnects → PME 420x
  • Bonds on the quad-core Xeon server → Bonds 1x

SLIDE 32

Performance Modeling

Problem: it is difficult to mathematically predict the expected speedup a priori due to the contentious nature of many-to-many communications.

Solution: measure the non-deterministic behaviour using Jumpshot on the software version and back-annotate the deterministic behaviour.

  • Make use of existing tools!

SLIDE 33

Single Timestep Profile

[Figure: Jumpshot timeline of a single timestep. Timestep = 108 ms (327,506 atoms).]

SLIDE 34

Performance

  • Significant overlap between all force calculations
  • 108.02 ms per timestep is equivalent to between 80 and 88 InfiniBand-connected cores at U of T’s supercomputer, SciNet
    – 160–176 hyperthreaded cores
  • Can we do better?
    – 140 cores with hardware bond engines – change the engine from SW to HW, no architectural change

SLIDE 35

Final Performance Equivalent for MD

                          FPGA/CPU               Supercomputer             Scaling Factor
Space                     5U                     17.5 x 2U                 1/7
Cooling                   N/A                    Share of 735-ton chiller  ∞?
Capital Cost              $15,000*               $120,000                  1/8
Annual Electricity Cost   $241 (assuming 500 W)  $6,758                    1/30
Performance               140 core equivalents   140 cores                 1x

*Current system is a prototype. Cost is based on projections for the next-generation system.

SLIDE 36

TMD Perspective

  • Still comparing apples to oranges.
  • Individually, hardware engines are able to sustain calculations hundreds of times faster than traditional CPUs.
  • Communication costs degrade overall performance.
  • The FPGA platform is using older CPUs and older communication links than SciNet.
  • Migrating the FPGA portion to a SciNet-compatible platform will further increase the relative performance and provide a more accurate CPU/FPGA comparison.
  • Need to integrate HLS

SLIDE 37

INTEGRATING FPGAS INTO PGAS

SLIDE 38

Partitioned Global Address Space

  • Programmer has access to all data in the system, but can distinguish between local and remote
  • Communication is one-sided: the remote application process doesn’t need to get involved in an access to its data

[Figure: processes/threads each own a slice of one global address space; a is local to Process 1, b points into a remote slice.]

// Process 1
local int a;
remote int *b;
a += *b;

Programming is data-centric/memory-centric.

SLIDE 39

One-sided communication in the software stack

[Figure: two communicating nodes, each layered as PGAS Application over PGAS Library or Language Runtime, over a One-sided Communication Library, over the Network.]

SLIDE 40

One-sided data access: GASNet

  • Global Address Space Networking
  • Software communication library for (P)GAS
  • Developed as a communication layer for UPC, Co-Array Fortran, and Titanium (Java)
  • GASNet’s Core API is built on the concept of “Active Messages”: remote memory copies followed by a Remote Procedure Call (see the sketch below)
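To make "remote memory copy followed by a remote procedure call" concrete, here is a minimal sketch against the GASNet-1 Core API; the handler index and payload are illustrative, and error checks and conduit-specific attach parameters are glossed over:

#include <gasnet.h>
#include <stdio.h>

#define HIDX_PING 200   /* application handler indices live in 128..255 */

/* Runs on the target node once the payload has arrived: this is the
   "remote procedure call" half of an Active Message. */
void ping_handler(gasnet_token_t token, void *buf, size_t nbytes) {
    printf("AM delivered %lu bytes\n", (unsigned long)nbytes);
}

int main(int argc, char **argv) {
    gasnet_init(&argc, &argv);
    gasnet_handlerentry_t htable[] = {
        { HIDX_PING, (void (*)())ping_handler }
    };
    gasnet_attach(htable, 1, GASNET_PAGESIZE, 0);

    if (gasnet_mynode() == 0) {
        char payload[64] = "hello";
        /* Copy the payload into node 1, then invoke its handler. */
        gasnet_AMRequestMedium0(1, HIDX_PING, payload, sizeof payload);
    }
    gasnet_exit(0);
}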

SLIDE 41

GASNet Active Messages

[Figure: request/reply exchange of an Active Message between two nodes – the payload is copied to the target, then its handler runs.]

SLIDE 42

GASNet API

  • Network-independent communication library
  • Provides global address space support
  • Basis of:
    – Toronto Heterogeneous GASNet (THe_GASNet)

[Figure: GASNet stack – PGAS Language Application over PGAS Language Runtime, over the GASNet Extended API and GASNet Core API, over the Network Hardware.]

SLIDE 43

How FPGAs figure into this

Our programming model philosophy: use a common API for Software and Hardware.

[Figure: the same kernel-migration diagram as Slide 19 – the kernel moves from the x86 CPU to an embedded CPU on the FPGA to a custom processing element, all behind the same Common API over the interconnect.]

SLIDE 44

THe_GASNet: A common SW/HW API

  • Toronto Heterogeneous GASNet
  • Compatible software and hardware implementations of the GASNet Core API
  • Active Messages (= data transfers) can be initiated by CPU and FPGA components
  • The common API enables easy software-to-hardware migration

SLIDE 45

THe_GASNet Core API

  • Heterogeneous multi-processor communication library
  • Provides a uniform communication standard between a software application and a hardware accelerator

[Figure: THe_GASNet stack – the PGAS Language Application and Runtime sit above the THe_GASNet Core and Extended APIs and the Network Hardware; on the FPGA side, an Accelerator Core sits above PAMS and GAScore.]

SLIDE 46

THe_GASNet Extended API

  • Provides higher flexibility in coding
  • Serves as a bridge between the Core API and higher-level PGAS runtime libraries

[Figure: the same stack with the Extended API layered above the Core API, and Extended PAMS above GAScore on the FPGA side.]

SLIDE 47

THe_GASNet Hardware: GAScore

[Figure: the CPU runs App.c against GASNet.h, with RDMA access to memory; on the FPGA, custom hardware pairs with a GAScore block on an on-chip network, giving hardware the same remote-memory access path.]

SLIDE 48

THe_GASNet Hardware: xPAMS

[Figure: the same system with a PAMS (Programmable Active Message Sequencer) block inserted between each custom core and GAScore on the on-chip network.]

SLIDE 49

xPAMS

  • Extended Programmable Active Message Sequencer (xPAMS)
    – Extended API support
    – High-level functions
  • Contains an instruction memory that is programmed by a software node
  • Provides:
    – Communication
    – Synchronization
    – Accelerator control

SLIDE 50

THe_GASNet Extended API: Memory Transfer

  • Memory transfers are divided into three types (sketched below):
    – Blocking memory transfer
    – Non-blocking memory transfer with explicit handle
    – Non-blocking memory transfer with implicit handle
  • A “handle” is a variable that stores the status of a message
    – FREE, INFLIGHT, COMPLETE
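A minimal sketch of the three transfer types, assuming the classic GASNet extended-API names that THe_GASNet mirrors; the destination node 1 and the addresses are illustrative:

#include <gasnet.h>

/* remote_addr must lie in node 1's registered segment; the node
   number and addresses here are purely illustrative. */
void transfer_examples(void *remote_addr, float *local, size_t n) {
    size_t nbytes = n * sizeof *local;

    /* 1. Blocking: returns only once the data has been delivered. */
    gasnet_put(1, remote_addr, local, nbytes);

    /* 2. Non-blocking with an explicit handle: the handle tracks one
          specific message through FREE -> INFLIGHT -> COMPLETE. */
    gasnet_handle_t h = gasnet_put_nb(1, remote_addr, local, nbytes);
    /* ... overlap computation here; gasnet_try_syncnb(h) would poll ... */
    gasnet_wait_syncnb(h);       /* block until that message completes */

    /* 3. Non-blocking with an implicit handle: the runtime tracks all
          outstanding implicit-handle operations as a group. */
    gasnet_put_nbi(1, remote_addr, local, nbytes);
    gasnet_put_nbi(1, (char *)remote_addr + nbytes, local, nbytes);
    gasnet_wait_syncnbi_all();   /* block until all of them complete */
}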

SLIDE 51

THe_GASNet Extended API: Functions

  • put()/get() operations
    – For passing data between the parallel nodes
  • Synchronization try() operations
    – Determine whether the messages have completed, without blocking
  • Synchronization wait() operations
    – Block until the messages have completed
  • Similar functions are available for the Extended PAMS

SLIDE 52

xPAMS Code

// Software node
gasnet_put_nbi(mymem, FPGAmem, 32);
gasnet_get_nbi(FPGAmem+32, mymem+32, 32);
gasnet_syncnbi_all();

// FPGA node: the same operations, issued by the PAMS
pams_put_nbi(mymem, FPGAmem, 32);
pams_get_nbi(FPGAmem+32, mymem+32, 32);
pams_syncnbi_all();

SLIDE 53

Case Study: Jacobi Iterative Method (I)

  • Solves Partial Differential Equations iteratively
  • The Jacobi method for the Heat Equation calculates the spread of heat over a surface
  • The surface is divided equally among the parallel nodes

Source: R. Willenberg

SLIDE 54

Case Study: Jacobi Iterative Method (II)

  • The value of the current cell at time T+1 is based on the values of the neighbouring cells at time T (a minimal kernel is sketched below)
  • At the end of each iteration, the edges of the neighbouring nodes are communicated

Source: R. Willenberg
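A minimal serial sketch of the update rule in C; the grid size is illustrative, boundaries are held fixed, and the halo exchange of the parallel version is only marked as a comment:

#include <string.h>

enum { NX = 64, NY = 64 };   /* illustrative grid size */

/* One Jacobi iteration of the heat equation: each interior cell at
   time T+1 is the average of its four neighbours at time T. */
static void jacobi_step(float cur[NX][NY], float next[NX][NY]) {
    for (int i = 1; i < NX - 1; i++)
        for (int j = 1; j < NY - 1; j++)
            next[i][j] = 0.25f * (cur[i-1][j] + cur[i+1][j] +
                                  cur[i][j-1] + cur[i][j+1]);
}

static void solve(float grid[NX][NY], int iterations) {
    static float tmp[NX][NY];
    memcpy(tmp, grid, sizeof tmp);        /* preserve boundary values */
    for (int t = 0; t < iterations; t++) {
        jacobi_step(grid, tmp);
        /* Parallel version: put() this node's edge rows to its
           neighbours here, before the buffers are swapped. */
        memcpy(grid, tmp, sizeof tmp);
    }
}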

SLIDE 55

Case Study: Test Platform

  • ARM Cluster
  • ARM-FPGA Cluster

SLIDE 56

Results: Runtime Performance

[Figure: runtime comparison across the platforms]

  • 1024 iterations
  • Eight nodes for the clusters
  • Two nodes for the i5

SLIDE 57

Results: Scalability

[Figure: scaling of runtime with node count]

  • 1024 iterations
  • Two nodes for the i5

SLIDE 58

Outlook: A complete heterogeneous PGAS stack

[Figure: on the PC, a C++ PGAS application sits on the THePaC++ PGAS library, the THeGASNet SW Core API (unlimited AMs optional), and an extended remote communication layer; on the FPGA, generated PAMS code and generated HDL code (from HDL dataflow templates) sit on the THeGASNet HW API (GAScore); CPU and FPGA networking connect the two.]

SLIDE 59

Outlook: THePaC++ Heterogeneous C++ PGAS library

[Figure: a C++ PGAS application compiles against the C++ PGAS library and GASNet library on CPU-based hosts; static generation produces PAMS code and high-level synthesis produces custom FPGA hardware, with dynamic generation occurring at run time; THeGASNet connects the CPU-based hosts and the FPGA.]

SLIDE 60

PGAS Perspective: Making it Work

  • Accommodate limited memory management in FPGAs
  • Keeping a software-based control-flow execution capability in the FPGA is advisable (e.g. PAMS)
  • A full runtime system needs application launch capability, including FPGA configuration
  • To leverage heterogeneity, PGAS languages/libraries need to expose and manage platform-type properties and subgroups
  • Future FPGA platforms should offer portable networking and memory abstractions to application cores

SLIDE 61

WHAT’S NEXT?

SLIDE 62

Conclusion

  • HLS is necessary but not sufficient
  • Possible to achieve acceleration for HPC applications using FPGAs
  • Need a lot of middleware support, equivalent to MPI and PGAS communication libraries, for FPGAs
    – With integrated HLS
  • Need runtimes that understand FPGAs
  • Need to include FPGAs in the standards

SLIDE 63

Questions?
