SLIDE 1

Making FPGAs Programmable as Computers and Doing It At Scale

Paul Chow
High-Performance Reconfigurable Computing Group
Department of Electrical and Computer Engineering, University of Toronto

SLIDE 2

What’s the real goal?

  • Build large-scale applications with FPGAs without added pain ☺
  • Where do we stand?
  • Need for abstractions and middleware support
  • Our work at UofT to support HPC with FPGAs

SLIDE 3

OUR PHILOSOPHY

SLIDE 4

Preserve Current Programming Models

  • Program and use an FPGA just like any other software-based processor
  • (Software) programmers should not necessarily need to know that processing is done on an FPGA
    – Ability to pick FPGA execution for performance/power reasons – even better if this is automatic!

SLIDE 5

WHERE DO WE STAND?

SLIDE 6

High-Level Synthesis

  • Raises the level of abstraction above hardware design
  • Lots of great research
  • Absolutely required
  • Tremendous progress recently
  • Can describe complex computations and functions algorithmically and create hardware

But!!! We are still just building custom hardware. HLS is only a part of the big picture…

SLIDE 7

Consider Portability

  • How many of you have taken a C program written on one platform and just recompiled it to run on another?
  • How many have done the same for code targeted for an FPGA platform? (If you have even tried! ☺)
  • Much is invested in writing any application
  • Reuse, modify, enhance, evolve it
  • Code should run on any platform

SLIDE 8

Consider Design Environments

  • Well-developed in software
    – IDEs: Visual Studio, NetBeans
    – Linux + emacs/vi + gcc
    – Good open source options
  • FPGA hardware
    – Vivado, Quartus
    – Not what a software developer would expect
    – No open source options
  • Makefiles vs TCL!

SLIDE 9

COMPUTING ABSTRACTIONS FOR FPGAS

SLIDE 10

What Abstractions?

  • Memory model
    – Data[127:0] vs connect to a memory controller
  • I/O
    – read()/write() vs connect to a PCIe controller
    – USB
    – Networking – TCP/IP, UDP
  • Services
    – Filesystem, status, control
  • FPGAs have lacked all of these things (see the sketch below)
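To make the contrast concrete, here is a minimal C sketch of the software world's I/O abstraction (the device path /dev/accel0 is hypothetical): the programmer calls read() and never sees the bus protocol, which is exactly the kind of service FPGA designs have historically had to build by hand.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* The OS hides the controller: the same read() works whether the
       device sits behind PCIe, USB, or a network link. */
    int fd = open("/dev/accel0", O_RDONLY);   /* hypothetical device node */
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096];
    ssize_t n = read(fd, buf, sizeof buf);    /* no bus protocol in sight */
    if (n >= 0) printf("read %zd bytes\n", n);

    close(fd);
    return 0;
}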

SLIDE 11

Many Approaches

  • APIs to connect to FPGAs
  • Hardware threads

  ➢ Commercial (non-vendor) tools
  ➢ Commercial vendor tools
  ➢ UofT approach

  • Give a representative, not complete, set of examples

SLIDE 12

Commercial Non-Vendor Tools

  • HLS with an environment to debug, monitor performance, load, and run hardware
    – Handel-C
    – Impulse-C
    – Maxeler – proprietary hardware
  • Not broadly used because of proprietary tools (and sometimes hardware)

SLIDE 13

Commercial Vendor Tools

  • OpenCL tools – SDAccel, SDK for OpenCL
    – The data centre is a major target
  • OpenCL is an open “standard”
    – Possible to have cross-vendor, cross-platform portability, even cross-architecture (GPUs, Xeon Phi, etc.)
    – More interest than in the proprietary approaches
  • SDSoC – for C/C++
    – On SoC platforms, but could be any heterogeneous system

SLIDE 14

Why FPGA OpenCL is more like computing

  • Provides a higher-level software abstraction
    – Don’t worry about the CPU–FPGA communication layer
      • PCIe, QPI, AXI, Ethernet, etc.
    – Runtime manages bitstreams, memory allocation, data transfers
    – Transparently uses HLS for the kernels
  • A knowledgeable software person can use it
    – Must understand parallelism, basic architectural concepts, latency and throughput, and I/O for data in terms of structure and protocols
    – Doesn’t need to know about clocks
  • Early days still, but you can see where it’s going (see the host-code sketch below)
    – FPGA vendors learning to be computer companies
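As a hedged illustration of what that runtime hides, here is a minimal OpenCL 1.2 host-code sketch; the vecadd kernel and the vecadd.xclbin container name are made up for illustration, and error checking is omitted. Note that no PCIe, DMA, AXI, or clock details appear anywhere:

#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* Discover the platform and the accelerator (FPGA) device. */
    cl_platform_id plat; cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_ACCELERATOR, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    /* On an FPGA the "binary" is a precompiled bitstream container;
       the runtime takes care of loading it onto the device. */
    FILE *f = fopen("vecadd.xclbin", "rb");              /* hypothetical file */
    fseek(f, 0, SEEK_END); size_t len = ftell(f); rewind(f);
    unsigned char *bin = malloc(len); fread(bin, 1, len, f); fclose(f);
    cl_program prog = clCreateProgramWithBinary(ctx, 1, &dev, &len,
                          (const unsigned char **)&bin, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "vecadd", NULL);  /* hypothetical kernel */

    /* The runtime manages allocation and host<->device transfers. */
    enum { N = 1024 };
    float a[N], c[N];
    for (int i = 0; i < N; i++) a[i] = (float)i;
    cl_mem d_a = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  sizeof a, NULL, NULL);
    cl_mem d_c = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof c, NULL, NULL);
    clEnqueueWriteBuffer(q, d_a, CL_TRUE, 0, sizeof a, a, 0, NULL, NULL);
    clSetKernelArg(k, 0, sizeof d_a, &d_a);
    clSetKernelArg(k, 1, sizeof d_c, &d_c);
    size_t gsz = N;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &gsz, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, d_c, CL_TRUE, 0, sizeof c, c, 0, NULL, NULL);
    clFinish(q);
    printf("c[1] = %f\n", c[1]);
    return 0;
}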

SLIDE 15

OpenCL is not HLS

  • Recognize that HLS is not what makes OpenCL a computing environment
    – HLS is necessary but not sufficient
  • It’s the other stuff under the hood + HLS
    – Run-time services
      • Data transfer, memory management, bitstream loading
    – Hardware shell services
      • CPU/FPGA interconnect, DMA engine, memory controller

SLIDE 16

Scalability

  • OpenCL, as a programming model, does not scale
  • Could scale by using MPI between nodes and OpenCL to build the accelerator (sketched below)
    – As is done with MPI + OpenMP
    – Need to deal with two programming models
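A minimal sketch of that two-model structure, assuming a hypothetical helper run_fpga_kernel() standing in for OpenCL host code like the sketch above: MPI handles distribution across nodes, OpenCL handles the accelerator within each node.

#include <mpi.h>
#include <stdlib.h>

/* Hypothetical stand-in for OpenCL host code like the earlier sketch:
   in the real hybrid this would enqueue the FPGA kernel on this node. */
static void run_fpga_kernel(float *chunk, int n) {
    for (int i = 0; i < n; i++) chunk[i] *= 2.0f;   /* placeholder compute */
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    enum { N = 1 << 20 };                  /* assumes N divisible by size */
    int chunk_n = N / size;
    float *chunk = malloc(chunk_n * sizeof *chunk);
    float *all = NULL;
    if (rank == 0) {
        all = malloc(N * sizeof *all);
        for (int i = 0; i < N; i++) all[i] = 1.0f;
    }

    /* Programming model 1: MPI distributes work across the nodes... */
    MPI_Scatter(all, chunk_n, MPI_FLOAT, chunk, chunk_n, MPI_FLOAT,
                0, MPI_COMM_WORLD);
    /* ...programming model 2: OpenCL drives each node's accelerator. */
    run_fpga_kernel(chunk, chunk_n);
    MPI_Gather(chunk, chunk_n, MPI_FLOAT, all, chunk_n, MPI_FLOAT,
               0, MPI_COMM_WORLD);

    free(chunk); free(all);
    MPI_Finalize();
    return 0;
}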

SLIDE 17

WORK AT UOFT

SLIDE 18

Classic accelerator model: Master-Slave

[Figure: CPU master directing FPGA slave accelerators]

  • Need custom APIs to interact with accelerators
  • Lacks portability and scalability

SLIDE 19

Our programming model philosophy

  • Use a common API for Software and Hardware

[Figure: kernel migration through a common API. In successive stages, the same application kernel runs on the x86 CPU, then on an embedded CPU inside the FPGA, then as a custom processing element in hardware; every stage talks through the same Common API over the interconnect, so kernels migrate without changing the rest of the application.]

SLIDE 20

Common SW/HW API

  • CPU and FPGA components can initiate data transfers – they are peers
  • SW and HW components use similar call formats (see the sketch below)
  • For distributed memory and message passing, this was implemented by TMD-MPI (TMD: Toronto Molecular Dynamics)
  • For shared memory, we are building hardware infrastructure for a common API for PGAS
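A minimal sketch of what the peer model means in practice, using standard MPI calls (which TMD-MPI mirrors in hardware); the ranks-and-messages code is the same whether a given rank is a CPU process or an FPGA engine:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    float forces[256] = {0};
    if (rank == 0) {
        /* Rank 0 might be the x86 host... */
        MPI_Send(forces, 256, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* ...and rank 1 an FPGA engine: with a common API the sender
           neither knows nor cares which kind of peer it is talking to. */
        MPI_Recv(forces, 256, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received a force block\n");
    }

    MPI_Finalize();
    return 0;
}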

SLIDE 21

Why, again, a common API?

  • Developer can focus on the algorithm, exposing parallel tasks in a pure software environment
    – Easier development: SW prototyping → migration
  • Model makes no distinction between CPUs and FPGAs (in terms of data communication, synchronization)
  • Map tasks to computing elements later
    – Not as part of the initial design
  • FPGA-initiated communication relieves the CPU (even more so for one-sided communication)
  • FPGA-only systems (or one CPU + many FPGAs) can work efficiently

SLIDE 22

BUILDING A LARGE HETEROGENEOUS HPC APPLICATION WITH MPI

SLIDE 23

Molecular Dynamics

  • Simulate motion of molecules at the atomic level
  • Highly compute-intensive
  • Understand protein folding
  • Computer-aided drug design

SLIDE 24

Origin of Computational Complexity

Simulations span $10^3$ to $10^{10}$ atoms, and the classical force field below must be evaluated every timestep.

Bonded terms, O(n):

  $U_b = \sum_i k_{b,i}\,(r_i - r_{0,i})^2$

  $U_a = \sum_i k_{a,i}\,(\theta_i - \theta_{0,i})^2$

  $U_t = \sum_i \begin{cases} k_i \left[ 1 + \cos(n_i \phi_i - \gamma_i) \right], & n_i \neq 0 \\ k_i\,(\phi_i - \gamma_i)^2, & n_i = 0 \end{cases}$

Nonbonded terms, O(n^2), the origin of the computational complexity:

  $V(r) = 4\varepsilon \left[ \left( \frac{\sigma}{r} \right)^{12} - \left( \frac{\sigma}{r} \right)^{6} \right]$

  $U_C = \frac{1}{2} \sum_{\mathbf{n}} \sum_{i=1}^{N} \sum_{j=1}^{N} \frac{q_i q_j}{\left| \mathbf{r}_{ij} + \mathbf{n}\tau \right|}$

SLIDE 25

The TMD Machine

  • The Toronto Molecular Dynamics Machine
  • Use a multi-FPGA system to accelerate MD
  • Built using an MPI programming model
  • Principal algorithm developer: Chris Madill, Ph.D. candidate (now done!) in Biochemistry
    – Writes C++ using MPI, not Verilog/VHDL
  • Have used three platforms – portability
  • Plus scalability and maintainability

SLIDE 26

UofT MPI Approach (FPL 2006)

[Figure: TMD-MPI design flow, annotated "Also a system simulation" and "HLS can do this".]

SLIDE 27

Platform Evolution

  • Network of five V2Pro PCI cards (2006)
    – First to integrate hardware acceleration
    – Simple LJ fluids only
  • Network of BEE2 multi-FPGA boards (2007)
    – Added electrostatic terms
    – Added bonded terms

FPGA portability and design abstraction facilitated ongoing migration.

SLIDE 28

2010 – Xilinx/Nallatech ACP

[Figure: stack of 5 large Virtex-5 FPGAs plus 1 FPGA for the FSB PHY interface, mounted in a quad-socket Xeon server.]

SLIDE 29

Typical MD Simulator

[Figure: each CPU_i runs Process_i, computing the Bonded, Nonbonded, and PME force terms on its partition Data_i; the same process structure is replicated across all CPUs.]

SLIDE 30

TMD Machine Architecture

[Figure: a network of hardware engines – Atom Managers, Bond Engines, Short-range Nonbond Engines, Long-range Electrostatics Engines, an Input Scheduler, and a Visualizer Output – exchanging messages with calls like MPI::Send(&msg, size, dest, …).]

SLIDE 31

Target Platform for MD

[Figure: quad-socket Xeon server – three sockets carry FSB modules with NBE FPGAs (two modules also carry a PME FPGA and memory), the fourth carries the quad-core Xeon and system memory; links labelled 8.5 GB/s @ 1066 MHz (FSB) and 72.5 GB/s. Inset: initial breakdown of CPU time among short-range nonbonded, long-range electrostatic, and bond terms.]

  • 12 short-range nonbond (NBE) FPGAs
    – 2–3 pipelines per NBE FPGA; each runs 15–30x a CPU → NBE 360–1080x overall
  • 2 PME FPGAs with fast memory and fibre-optic interconnects → PME 420x
  • Bonds on the quad-core Xeon server → Bonds 1x

SLIDE 32

Performance Modeling

Problem: it is difficult to mathematically predict the expected speedup a priori due to the contentious nature of many-to-many communications.

Solution: measure the non-deterministic behaviour using Jumpshot on the software version and back-annotate the deterministic behaviour.

  • Make use of existing tools!

SLIDE 33

Single Timestep Profile

[Figure: Jumpshot timeline of a single timestep. Timestep = 108 ms (327,506 atoms).]

SLIDE 34

Performance

  • Significant overlap between all force calculations
  • 108.02 ms per timestep is equivalent to between 80 and 88 InfiniBand-connected cores at U of T’s supercomputer, SciNet
    – 160–176 hyperthreaded cores
  • Can we do better?
    – 140 cores with hardware bond engines – change the engine from SW to HW, no architectural change

SLIDE 35

Final Performance Equivalent for MD

                          FPGA/CPU               Supercomputer             Scaling Factor
Space                     5U                     17.5 x 2U                 1/7
Cooling                   N/A                    Share of 735-ton chiller  ∞?
Capital Cost              $15,000*               $120,000                  1/8
Annual Electricity Cost   $241 (assuming 500 W)  $6,758                    1/30
Performance               140 core equivalents   140 cores                 1x

*Current system is a prototype. Cost is based on projections for the next-generation system.

SLIDE 36

TMD Perspective

  • Still comparing apples to oranges.
  • Individually, hardware engines are able to sustain calculations hundreds of times faster than traditional CPUs.
  • Communication costs degrade overall performance.
  • The FPGA platform is using older CPUs and older communication links than SciNet.
  • Migrating the FPGA portion to a SciNet-compatible platform will further increase the relative performance and provide a more accurate CPU/FPGA comparison.
  • Need to integrate HLS

SLIDE 37

INTEGRATING FPGAS INTO PGAS

SLIDE 38

Partitioned Global Address Space

  • Programmer has access to all data in the system, but can distinguish between local and remote
  • Communication is one-sided: the remote application process doesn’t need to get involved in an access to its data

[Figure: processes/threads each own a slice of one global address space; a is local to Process 1, b points into a remote slice.]

// Process 1
local int a;
remote int *b;
a += *b;

Programming is data-centric/memory-centric.

SLIDE 39

One-sided communication in the software stack

[Figure: two communicating nodes, each layered as PGAS Application over PGAS Library or Language Runtime, over a One-sided Communication Library, over the Network.]

SLIDE 40

One-sided data access: GASNet

  • Global Address Space Networking
  • Software communication library for (P)GAS
  • Developed as a communication layer for UPC, Co-Array Fortran, and Titanium (Java)
  • GASNet’s Core API is built on the concept of “Active Messages”: remote memory copies followed by a Remote Procedure Call (see the sketch below)
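To make "remote memory copy followed by a remote procedure call" concrete, here is a minimal sketch against the GASNet-1 Core API; the handler index and payload are illustrative, and error checks and conduit-specific attach parameters are glossed over:

#include <gasnet.h>
#include <stdio.h>

#define HIDX_PING 200   /* application handler indices live in 128..255 */

/* Runs on the target node once the payload has arrived: this is the
   "remote procedure call" half of an Active Message. */
void ping_handler(gasnet_token_t token, void *buf, size_t nbytes) {
    printf("AM delivered %lu bytes\n", (unsigned long)nbytes);
}

int main(int argc, char **argv) {
    gasnet_init(&argc, &argv);
    gasnet_handlerentry_t htable[] = {
        { HIDX_PING, (void (*)())ping_handler }
    };
    gasnet_attach(htable, 1, GASNET_PAGESIZE, 0);

    if (gasnet_mynode() == 0) {
        char payload[64] = "hello";
        /* Copy the payload into node 1, then invoke its handler. */
        gasnet_AMRequestMedium0(1, HIDX_PING, payload, sizeof payload);
    }
    gasnet_exit(0);
}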

SLIDE 41

GASNet Active Messages

[Figure: request/reply exchange of an Active Message between two nodes – the payload is copied to the target, then its handler runs.]

SLIDE 42

GASNet API

  • Network-independent communication library
  • Provides global address space support
  • Basis of:
    – Toronto Heterogeneous GASNet (THe_GASNet)

[Figure: GASNet stack – PGAS Language Application over PGAS Language Runtime, over the GASNet Extended API and GASNet Core API, over the Network Hardware.]

SLIDE 43

How FPGAs figure into this

Our programming model philosophy: use a common API for Software and Hardware.

[Figure: the same kernel-migration diagram as Slide 19 – the kernel moves from the x86 CPU to an embedded CPU on the FPGA to a custom processing element, all behind the same Common API over the interconnect.]

SLIDE 44

THe_GASNet: A common SW/HW API

  • Toronto Heterogeneous GASNet
  • Compatible software and hardware implementations of the GASNet Core API
  • Active Messages (= data transfers) can be initiated by CPU and FPGA components
  • The common API enables easy software-to-hardware migration

SLIDE 45

THe_GASNet Core API

  • Heterogeneous multi-processor communication library
  • Provides a uniform communication standard between a software application and a hardware accelerator

[Figure: THe_GASNet stack – the PGAS Language Application and Runtime sit above the THe_GASNet Core and Extended APIs and the Network Hardware; on the FPGA side, an Accelerator Core sits above PAMS and GAScore.]

SLIDE 46

THe_GASNet Extended API

  • Provides higher flexibility in coding
  • Serves as a bridge between the Core API and higher-level PGAS runtime libraries

[Figure: the same stack with the Extended API layered above the Core API, and Extended PAMS above GAScore on the FPGA side.]

SLIDE 47

THe_GASNet Hardware: GAScore

[Figure: the CPU runs App.c against GASNet.h, with RDMA access to memory; on the FPGA, custom hardware pairs with a GAScore block on an on-chip network, giving hardware the same remote-memory access path.]

SLIDE 48

THe_GASNet Hardware: xPAMS

[Figure: the same system with a PAMS (Programmable Active Message Sequencer) block inserted between each custom core and GAScore on the on-chip network.]

SLIDE 49

xPAMS

  • Extended Programmable Active Message Sequencer (xPAMS)
    – Extended API support
    – High-level functions
  • Contains an instruction memory that is programmed by a software node
  • Provides:
    – Communication
    – Synchronization
    – Accelerator control

SLIDE 50

THe_GASNet Extended API: Memory Transfer

  • Memory transfers are divided into three types (sketched below):
    – Blocking memory transfer
    – Non-blocking memory transfer with explicit handle
    – Non-blocking memory transfer with implicit handle
  • A “handle” is a variable that stores the status of a message
    – FREE, INFLIGHT, COMPLETE
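A minimal sketch of the three transfer types, assuming the classic GASNet extended-API names that THe_GASNet mirrors; the destination node 1 and the addresses are illustrative:

#include <gasnet.h>

/* remote_addr must lie in node 1's registered segment; the node
   number and addresses here are purely illustrative. */
void transfer_examples(void *remote_addr, float *local, size_t n) {
    size_t nbytes = n * sizeof *local;

    /* 1. Blocking: returns only once the data has been delivered. */
    gasnet_put(1, remote_addr, local, nbytes);

    /* 2. Non-blocking with an explicit handle: the handle tracks one
          specific message through FREE -> INFLIGHT -> COMPLETE. */
    gasnet_handle_t h = gasnet_put_nb(1, remote_addr, local, nbytes);
    /* ... overlap computation here; gasnet_try_syncnb(h) would poll ... */
    gasnet_wait_syncnb(h);       /* block until that message completes */

    /* 3. Non-blocking with an implicit handle: the runtime tracks all
          outstanding implicit-handle operations as a group. */
    gasnet_put_nbi(1, remote_addr, local, nbytes);
    gasnet_put_nbi(1, (char *)remote_addr + nbytes, local, nbytes);
    gasnet_wait_syncnbi_all();   /* block until all of them complete */
}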

SLIDE 51

THe_GASNet Extended API: Functions

  • put()/get() operations
    – For passing data between the parallel nodes
  • Synchronization try() operations
    – Determine whether the messages have completed, without blocking
  • Synchronization wait() operations
    – Block until the messages have completed
  • Similar functions are available for the Extended PAMS

SLIDE 52

xPAMS Code

// Software node
gasnet_put_nbi(mymem, FPGAmem, 32);
gasnet_get_nbi(FPGAmem+32, mymem+32, 32);
gasnet_syncnbi_all();

// FPGA node: the same operations, issued by the PAMS
pams_put_nbi(mymem, FPGAmem, 32);
pams_get_nbi(FPGAmem+32, mymem+32, 32);
pams_syncnbi_all();

SLIDE 53

Case Study: Jacobi Iterative Method (I)

  • Solves Partial Differential Equations iteratively
  • The Jacobi method for the Heat Equation calculates the spread of heat over a surface
  • The surface is divided equally among the parallel nodes

Source: R. Willenberg

SLIDE 54

Case Study: Jacobi Iterative Method (II)

  • The value of the current cell at time T+1 is based on the values of the neighbouring cells at time T (a minimal kernel is sketched below)
  • At the end of each iteration, the edges of the neighbouring nodes are communicated

Source: R. Willenberg
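A minimal serial sketch of the update rule in C; the grid size is illustrative, boundaries are held fixed, and the halo exchange of the parallel version is only marked as a comment:

#include <string.h>

enum { NX = 64, NY = 64 };   /* illustrative grid size */

/* One Jacobi iteration of the heat equation: each interior cell at
   time T+1 is the average of its four neighbours at time T. */
static void jacobi_step(float cur[NX][NY], float next[NX][NY]) {
    for (int i = 1; i < NX - 1; i++)
        for (int j = 1; j < NY - 1; j++)
            next[i][j] = 0.25f * (cur[i-1][j] + cur[i+1][j] +
                                  cur[i][j-1] + cur[i][j+1]);
}

static void solve(float grid[NX][NY], int iterations) {
    static float tmp[NX][NY];
    memcpy(tmp, grid, sizeof tmp);        /* preserve boundary values */
    for (int t = 0; t < iterations; t++) {
        jacobi_step(grid, tmp);
        /* Parallel version: put() this node's edge rows to its
           neighbours here, before the buffers are swapped. */
        memcpy(grid, tmp, sizeof tmp);
    }
}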

SLIDE 55

Case Study: Test Platform

  • ARM Cluster
  • ARM-FPGA Cluster

SLIDE 56

Results: Runtime Performance

[Figure: runtime comparison across the platforms]

  • 1024 iterations
  • Eight nodes for the clusters
  • Two nodes for the i5

SLIDE 57

Results: Scalability

[Figure: scaling of runtime with node count]

  • 1024 iterations
  • Two nodes for the i5

SLIDE 58

Outlook: A complete heterogeneous PGAS stack

[Figure: on the PC, a C++ PGAS application sits on the THePaC++ PGAS library, the THeGASNet SW Core API (unlimited AMs optional), and an extended remote communication layer; on the FPGA, generated PAMS code and generated HDL code (from HDL dataflow templates) sit on the THeGASNet HW API (GAScore); CPU and FPGA networking connect the two.]

SLIDE 59

Outlook: THePaC++ Heterogeneous C++ PGAS library

[Figure: a C++ PGAS application compiles against the C++ PGAS library and GASNet library on CPU-based hosts; static generation produces PAMS code and high-level synthesis produces custom FPGA hardware, with dynamic generation occurring at run time; THeGASNet connects the CPU-based hosts and the FPGA.]

SLIDE 60

PGAS Perspective: Making it Work

  • Accommodate limited memory management in FPGAs
  • Keeping a software-based control-flow execution capability in the FPGA is advisable (e.g. PAMS)
  • A full runtime system needs application launch capability, including FPGA configuration
  • To leverage heterogeneity, PGAS languages/libraries need to expose and manage platform-type properties and subgroups
  • Future FPGA platforms should offer portable networking and memory abstractions to application cores

SLIDE 61

WHAT’S NEXT?

SLIDE 62

Conclusion

  • HLS is necessary but not sufficient
  • Possible to achieve acceleration for HPC applications using FPGAs
  • Need a lot of middleware support, equivalent to MPI and PGAS communication libraries, for FPGAs
    – With integrated HLS
  • Need runtimes that understand FPGAs
  • Need to include FPGAs in the standards

SLIDE 63

Questions?
