slide-1
SLIDE 1

The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

Daniel L. Ly1, Manuel Saldaña2 and Paul Chow1

1Department of Electrical and Computer Engineering

University of Toronto

2Arches Computing Systems, Toronto, Canada

slide-2
SLIDE 2

Outline

  • Background and Motivation
  • Embedded Processor-Based Optimizations
  • Hardware Engine-Based Optimizations
  • Conclusions and Future Work

Ly D, Saldaña M, Chow P. FPT 2009 2

slide-3
SLIDE 3

Motivation

  • Message Passing Interface (MPI) is a programming model for distributed memory systems
  • Popular in high performance computing (HPC), cluster-based systems

Ly D, Saldaña M, Chow P. FPT 2009 3

slide-4
SLIDE 4

Motivation

  • Message Passing Interface (MPI) is a programming model for distributed memory systems
  • Popular in high performance computing (HPC), cluster-based systems

Ly D, Saldaña M, Chow P. FPT 2009 4

[Figure: Processor 1 and Processor 2, each with local memory]

    for (i = 1; i <= 100; i++) sum += i;

Problem: sum of the numbers from 1 to 100

slide-5
SLIDE 5

Motivation

  • Message Passing Interface (MPI) is a programming model for distributed memory systems
  • Popular in high performance computing (HPC), cluster-based systems

Ly D, Saldaña M, Chow P. FPT 2009 5

Processor 1:
    sum1 = 0;
    for (i = 1; i <= 50; i++) sum1 += i;
    MPI_Recv(sum2, ...);
    sum = sum1 + sum2;

Processor 2:
    sum1 = 0;
    for (i = 51; i <= 100; i++) sum1 += i;
    MPI_Send(sum1, ...);
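As a point of reference, a complete two-rank version of this example might look like the following (a minimal sketch written against standard MPI; ranks 0 and 1 stand in for Processor 1 and Processor 2):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, i, sum1 = 0, sum2 = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {                        /* Processor 1 */
            for (i = 1; i <= 50; i++) sum1 += i;
            MPI_Recv(&sum2, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("sum = %d\n", sum1 + sum2);  /* 5050 */
        } else if (rank == 1) {                 /* Processor 2 */
            for (i = 51; i <= 100; i++) sum1 += i;
            MPI_Send(&sum1, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }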

slide-9
SLIDE 9

Motivation

  • Strong interest in adapting MPI for embedded designs:

– Increasingly difficult to interface heterogeneous resources as FPGA chip size increases

  • MPI provides key benefits:

– Unified protocol
– Low weight and overhead
– Abstraction of end points (ranks)
– Easy prototyping

Ly D, Saldaña M, Chow P. FPT 2009 9

slide-10
SLIDE 10

Motivation

Property                            HPC Cluster          Embedded FPGA
Processor – Clock Rate              2-3 GHz              100-200 MHz
Memory – Size per node              > 1 GB               1-20 MB
Interconnect – Protocol Robustness  High                 None
Interconnect – Latency              10 μs (20k cycles)   100 ns (10 cycles)
Interconnect – Bandwidth            125 MB/s             400-800 MB/s
Components – Processing Nodes       Homogeneous          Heterogeneous

Ly D, Saldaña M, Chow P. FPT 2009 10

slide-11
SLIDE 11

Motivation

  • Interaction classes arising from heterogeneous designs:

– Class I: Software-software interactions

  • Collections of embedded processors
  • Thoroughly investigated; will not be discussed

– Class II: Software-hardware interactions

  • Embedded processors with hardware engines
  • Large variety in processing speed

– Class III: Hardware-hardware interactions

  • Collections of hardware engines
  • Hardware engines are capable of significant concurrency compared to processors

Ly D, Saldaña M, Chow P. FPT 2009 11

slide-12
SLIDE 12

Background

  • Work builds on TMD-MPI[1]

– Subset implementation of the MPI standard
– Allows hardware engines to be part of the message passing network
– Ported to Amirix PCI, BEE2, BEE3, Xilinx ACP
– Software libraries for MicroBlaze, PowerPC, Intel X86

[1] M. Saldaña et al., “MPI as an abstraction for software-hardware interaction for HPRCs,” HPRCTA, Nov. 2008.

Ly D, Saldaña M, Chow P. FPT 2009 12

slide-13
SLIDE 13

Class II: Processor-based Optimizations

  • Background
  • Direct Memory Access MPI Hardware Engine
  • Non-Interrupting, Non-Blocking Functions
  • Series of MPI Messages
  • Results and Analysis

Ly D, Saldaña M, Chow P. FPT 2009 13

slide-14
SLIDE 14

Class II: Processor-based Optimizations

Background

  • Problem 1
    – Standard message paradigm for HPC systems
      • Plentiful memory but high message latency
      • Favours combining data into a few, large messages, which are stored in memory and retrieved as needed
    – Embedded designs provide a different trade-off
      • Little memory but short message latency
      • 'Just-in-time' paradigm is preferred
        – Sending just enough data for one unit of computation, on demand

Ly D, Saldaña M, Chow P. FPT 2009 14

slide-15
SLIDE 15

Class II: Processor-based Optimizations

Background

  • Problem 2
    – Homogeneity of HPC systems
      • Each rank has similar processing capabilities
    – Heterogeneity of FPGA systems
      • Hardware engines are tailored for a specific set of functions – extremely fast processing
      • Embedded processors play the vital role of control and memory distribution – little processing

Ly D, Saldaña M, Chow P. FPT 2009 15

slide-16
SLIDE 16

Class II: Processor-based Optimizations

Background

  • ‘Just-in-time’ + Heterogeneity = producer-

consumer model

– Processors produce messages for hardware engines to consume – Generally, the message production rate of the processor is the limiting factor

Ly D, Saldaña M, Chow P. FPT 2009 16

slide-17
SLIDE 17

Class II: Processor-based Optimizations

Direct Memory Access MPI Engine

  • Typical MPI implementations use only software
  • DMA engine offloads the time-consuming message task: memory transfers
    – Frees the processor to continue execution
    – Can implement burst memory transactions
    – Time required to prepare a message is independent of message length
    – Allows messages to be queued

Ly D, Saldaña M, Chow P. FPT 2009 17

slide-20
SLIDE 20

Class II: Processor-based Optimizations

Direct Memory Access MPI Engine

Ly D, Saldaña M, Chow P. FPT 2009 20

MPI_Send(...)

  1. Processor writes 4 words:
     – destination rank
     – address of data buffer
     – message size
     – message tag
  2. PLB_MPE decodes the message header
  3. PLB_MPE transfers the data from memory
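From the processor's point of view, the whole send reduces to the four writes in step 1. The sketch below shows the kind of thing MPI_Send() might do internally; the base address, register names and offsets are illustrative only (they are not taken from the paper), and a 32-bit embedded target is assumed:

    /* Illustrative memory-mapped view of the DMA MPI engine (PLB_MPE).
       Base address and register offsets are made up for this sketch. */
    #define MPE_BASE       0x84000000u
    #define MPE_DEST_RANK  (*(volatile unsigned int *)(MPE_BASE + 0x0))
    #define MPE_BUF_ADDR   (*(volatile unsigned int *)(MPE_BASE + 0x4))
    #define MPE_MSG_SIZE   (*(volatile unsigned int *)(MPE_BASE + 0x8))
    #define MPE_MSG_TAG    (*(volatile unsigned int *)(MPE_BASE + 0xC))

    /* Queue a send: four register writes, independent of message length.
       The engine decodes the header and bursts the data out of memory
       while the processor continues executing. */
    static void mpe_queue_send(int dest_rank, const void *buf,
                               unsigned int size_words, int tag)
    {
        MPE_DEST_RANK = (unsigned int)dest_rank;
        MPE_BUF_ADDR  = (unsigned int)buf;
        MPE_MSG_SIZE  = size_words;
        MPE_MSG_TAG   = (unsigned int)tag;
    }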
slide-32
SLIDE 32

Class II: Processor-based Optimizations

Direct Memory Access MPI Engine

Ly D, Saldaña M, Chow P. FPT 2009 32

MPI_Recv(...)

  1. Processor writes 4 words:
     – source rank
     – address of data buffer
     – message size
     – message tag
  2. PLB_MPE decodes the message header
  3. PLB_MPE transfers the data to memory
  4. PLB_MPE notifies the processor
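The receive side is the mirror image, with the notification in step 4 telling the processor that the buffer now holds valid data. Again, the register map and the polled completion flag are purely illustrative:

    #define MPE_BASE        0x84000000u   /* same illustrative base as the send sketch */
    #define MPE_SRC_RANK    (*(volatile unsigned int *)(MPE_BASE + 0x10))
    #define MPE_RECV_ADDR   (*(volatile unsigned int *)(MPE_BASE + 0x14))
    #define MPE_RECV_SIZE   (*(volatile unsigned int *)(MPE_BASE + 0x18))
    #define MPE_RECV_TAG    (*(volatile unsigned int *)(MPE_BASE + 0x1C))
    #define MPE_RECV_DONE   (*(volatile unsigned int *)(MPE_BASE + 0x20))

    /* Post a receive descriptor, then wait until the engine signals that the
       data has landed in memory (the notification could equally well arrive
       as an interrupt instead of a polled flag). */
    static void mpe_blocking_recv(int src_rank, void *buf,
                                  unsigned int size_words, int tag)
    {
        MPE_SRC_RANK  = (unsigned int)src_rank;
        MPE_RECV_ADDR = (unsigned int)buf;
        MPE_RECV_SIZE = size_words;
        MPE_RECV_TAG  = (unsigned int)tag;
        while (!MPE_RECV_DONE)
            ;
    }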
slide-46
SLIDE 46

Class II: Processor-based Optimizations

Direct Memory Access MPI Engine

  • DMA engine is completely transparent to the user
    – Exact same MPI functions are called
    – DMA setup is handled by the implementation

Ly D, Saldaña M, Chow P. FPT 2009 46

slide-47
SLIDE 47

Class II: Processor-based Optimizations

Non-Interrupting, Non-Blocking Functions

Ly D, Saldaña M, Chow P. FPT 2009 47

  • Two types of MPI message functions
    – Blocking functions: return only when the buffer can be safely reused
    – Non-blocking functions: return immediately
      • A request handle is required so the message status can be checked later
  • Non-blocking functions are used to overlap communication and computation

slide-48
SLIDE 48

Class II: Processor-based Optimizations

Non-Interrupting, Non-Blocking Functions

Ly D, Saldaña M, Chow P. FPT 2009 48

  • Typical HPC non-blocking use case:

    MPI_Request request;
    ...
    MPI_Isend(..., &request);
    prepare_computation();
    MPI_Wait(&request, ...);
    finish_computation();
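Expanded into a compilable fragment (the peer rank, the buffer and the two computation stubs are placeholders), the overlap looks like this:

    #include <mpi.h>

    #define N 1024

    void overlapped_send(int peer)            /* 'peer' is a placeholder rank */
    {
        static int buf[N];
        MPI_Request request;

        MPI_Isend(buf, N, MPI_INT, peer, 0, MPI_COMM_WORLD, &request);
        /* prepare_computation(): work that does not touch buf */
        MPI_Wait(&request, MPI_STATUS_IGNORE);
        /* finish_computation(): buf may now be reused safely */
    }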

slide-49
SLIDE 49

Class II: Processor-based Optimizations

Non-Interrupting, Non-Blocking Functions

Ly D, Saldaña M, Chow P. FPT 2009 49

  • Class II interactions have a different use case
    – Hardware engines are responsible for computation
    – Embedded processors only need to send messages as fast as possible
  • DMA hardware allows messages to be queued
  • 'Fire-and-forget' message model
    – Message status is not important
    – Request handles are serviced by expensive interrupts

slide-50
SLIDE 50

Class II: Processor-based Optimizations

Non-Interrupting, Non-Blocking Functions

Ly D, Saldaña M, Chow P. FPT 2009 50

  • Standard MPI protocol provides a mechanism for 'fire-and-forget':

    MPI_Request request_dummy;
    ...
    MPI_Isend(..., &request_dummy);
    MPI_Request_free(&request_dummy);
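In the producer-consumer setting this turns into a tight send loop: each message is fired and its request handle immediately freed, so the processor never waits on completion. A sketch, with the consumer rank, buffer layout and message count made up for illustration:

    #include <mpi.h>

    #define NUM_MSGS  16
    #define MSG_WORDS 32

    void produce(int consumer_rank, int work[NUM_MSGS][MSG_WORDS])
    {
        MPI_Request request_dummy;

        for (int i = 0; i < NUM_MSGS; i++) {
            /* Fire-and-forget: the DMA engine queues the transfer,
               and the message status is never checked. */
            MPI_Isend(work[i], MSG_WORDS, MPI_INT, consumer_rank, i,
                      MPI_COMM_WORLD, &request_dummy);
            MPI_Request_free(&request_dummy);
        }
    }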

slide-51
SLIDE 51

Class II: Processor-based Optimizations

Non-Interrupting, Non-Blocking Functions

Ly D, Saldaña M, Chow P. FPT 2009 51

  • Standard implementation still incurs overhead:
    – Setting up the interrupt
    – Removing the interrupt
    – Extra function call overhead
    – Memory space for the MPI_Request data structure
  • For the 'just-in-time' message model on embedded processors, these overheads create a bottleneck

slide-52
SLIDE 52

Class II: Processor-based Optimizations

Non-Interrupting, Non-Blocking Functions

Ly D, Saldaña M, Chow P. FPT 2009 52

  • Proposed modification to the MPI protocol:

    #define MPI_REQUEST_NULL NULL
    ...
    MPI_Isend(..., MPI_REQUEST_NULL);

  • Non-blocking functions check that the request pointer is valid before setting up interrupts
  • Circumvents the overhead
  • Not standard, but a minor modification that works well for embedded processors with DMA
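Inside the library the change amounts to a single guard before any interrupt bookkeeping. The sketch below only illustrates the idea and does not reproduce the TMD-MPI source; the helper functions are hypothetical and are declared here just to keep the fragment self-contained:

    typedef struct tmd_request tmd_request_t;              /* opaque request type */

    void queue_dma_descriptor(const void *buf, int words,  /* hypothetical helpers */
                              int dest, int tag);
    void init_request(tmd_request_t *req);
    void enable_completion_interrupt(tmd_request_t *req);

    void isend_impl(const void *buf, int words, int dest, int tag,
                    tmd_request_t *request)
    {
        queue_dma_descriptor(buf, words, dest, tag);   /* DMA engine takes over */

        /* Proposed check: a NULL handle means fire-and-forget, so all
           interrupt setup/teardown and request storage are skipped. */
        if (request != 0) {
            init_request(request);
            enable_completion_interrupt(request);
        }
    }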

slide-60
SLIDE 60

Class II: Processor-based Optimizations

Series of messages – MPI_Coalesce()

Ly D, Saldaña M, Chow P. FPT 2009 60

  • MPI message without DMA

    MPI_Send() → transfer lots of data words → return

    (Legend: Non-MPI Code, Function Preamble/Postamble, MPI Function Code)

slide-61
SLIDE 61

Class II: Processor-based Optimizations

Series of messages – MPI_Coalesce()

Ly D, Saldaña M, Chow P. FPT 2009 61

  • MPI message with DMA

    MPI_Send() → transfer four words, regardless of message length → return

    (Legend: Non-MPI Code, Function Preamble/Postamble, MPI Function Code)

slide-63
SLIDE 63

Class II: Processor-based Optimizations

Series of messages – MPI_Coalesce()

Ly D, Saldaña M, Chow P. FPT 2009 63

  • MPI message with DMA

    Time breakdown: Non-MPI code 55.6%, function preamble/postamble 28.7%, MPI function code 15.6% (28.7% + 15.6% = 44.3% messaging overhead)

slide-64
SLIDE 64

Class II: Processor-based Optimizations

Series of messages – MPI_Coalesce()

Ly D, Saldaña M, Chow P. FPT 2009 64

  • MPI message with DMA

– Message queueing

    [Figure: three messages (msg 1, msg 2, msg 3) queued back-to-back through the DMA engine; legend: Non-MPI Code, Function Preamble/Postamble, MPI Function Code]

slide-68
SLIDE 68

Class II: Processor-based Optimizations

Series of messages – MPI_Coalesce()

Ly D, Saldaña M, Chow P. FPT 2009 68

  • Inline all MPI functions?

– Increases program length!

    [Figure: the inlined MPI code repeated for msg 1, msg 2 and msg 3; legend: Non-MPI Code, Function Preamble/Postamble, MPI Function Code]

slide-69
SLIDE 69

Class II: Processor-based Optimizations

Series of messages – MPI_Coalesce()

Ly D, Saldaña M, Chow P. FPT 2009 69

  • Standard MPI Functions

    void *msg_buf;
    int msg_size;
    ...
    MPI_Isend(msg_buf, msg_size, ...);
    MPI_Irecv(msg_buf, msg_size, ...);

slide-70
SLIDE 70

Class II: Processor-based Optimizations

Series of messages – MPI_Coalesce()

Ly D, Saldaña M, Chow P. FPT 2009 70

    void MPI_Coalesce (
        // MPI_Coalesce specific arguments
        MPI_Function *mpi_fn, int mpi_fn_count,
        // Array of point-to-point MPI function arguments
        void **msg_buf, int *msg_size, ...)
    {
        for (int i = 0; i < mpi_fn_count; i++) {
            if (mpi_fn[i] == MPI_Isend)
                inline MPI_Isend(msg_buf[i], msg_size[i], ...);
            else if (mpi_fn[i] == MPI_Irecv)
                inline MPI_Irecv(msg_buf[i], msg_size[i], ...);
        }
    }
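A call site for the three queued messages in the diagrams might look roughly as follows; the buffer names and sizes are invented for illustration, and the trailing arguments stay elided exactly as in the prototype above:

    int weights[64], inputs[32], results[16];          /* hypothetical buffers */

    MPI_Function fn[3]    = { MPI_Isend, MPI_Isend, MPI_Irecv };
    void        *bufs[3]  = { weights, inputs, results };
    int          sizes[3] = { 64, 32, 16 };

    MPI_Coalesce(fn, 3, bufs, sizes, ...);   /* one call, one preamble/postamble */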

slide-76
SLIDE 76

Class II: Processor-based Optimizations

Series of messages – MPI_Coalesce()

Ly D, Saldaña M, Chow P. FPT 2009 76

  • MPI_Coalesce

    [Figure: MPI_Coalesce timeline – a single function preamble/postamble around a for loop that issues msg 1, msg 2 and msg 3; legend: Non-MPI Code, Function Preamble/Postamble, MPI Function Code]

slide-80
SLIDE 80

Class II: Processor-based Optimizations

Series of messages – MPI_Coalesce()

  • MPI_Coalesce is not part of the MPI Standard
  • Behaviour can be easily reproduced

– Even when source code is not available

  • Maintains compatibility with MPI code

Ly D, Saldaña M, Chow P. FPT 2009 80
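For instance, even without access to MPI_Coalesce, the same three queued transfers can be written with standard non-blocking calls, at the cost of one function preamble/postamble per message (a minimal sketch; the peer rank, buffers and tags are illustrative):

    #include <mpi.h>

    void coalesce_fallback(int peer, int weights[64], int inputs[32], int results[16])
    {
        MPI_Request req[3];

        MPI_Isend(weights, 64, MPI_INT, peer, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(inputs,  32, MPI_INT, peer, 1, MPI_COMM_WORLD, &req[1]);
        MPI_Irecv(results, 16, MPI_INT, peer, 2, MPI_COMM_WORLD, &req[2]);

        MPI_Waitall(3, req, MPI_STATUSES_IGNORE);
    }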

slide-81
SLIDE 81

Class II: Processor-based Optimizations

Results

81

  • Application: Restricted Boltzmann Machines[2]
    – Neural network FPGA implementation
    – Platform: Berkeley Emulation Engine 2 (BEE2)
      • Five Xilinx Virtex-II Pro XC2VP70 FPGAs
      • Inter-FPGA communication:
        – Latency: 6 cycles
        – Bandwidth: 1.73 GB/s

[2] D. Ly et al., “A Multi-FPGA Architecture for Restricted Boltzmann Machines,” FPL, Sept. 2009.

Ly D, Saldaña M, Chow P. FPT 2009

slide-83
SLIDE 83

Class II: Processor-based Optimizations

Results

83 Ly D, Saldaña M, Chow P. FPT 2009

Message #   Source   Destination   Size [# of words]
    1       R0       R1
    2       R0       R1            3
    3       R0       R6
    4       R0       R6            3
    5       R0       R11
    6       R0       R11           3
    7       R0       R16
    8       R0       R16           3
    9       R0       R1            4
   10       R0       R6            4
   11       R0       R11           4
   12       R0       R16           4

slide-90
SLIDE 90

Class II: Processor-based Optimizations

Results

Ly D, Saldaña M, Chow P. FPT 2009 90

    [Chart: measured speedups of 2.33x, 3.94x and 5.32x]

slide-91
SLIDE 91

Class III: Hardware-based Optimizations

  • Background
  • Dataflow Message Passing Model

– Case Study: Vector Addition

Ly D, Saldaña M, Chow P. FPT 2009 91

slide-92
SLIDE 92

Class III: Hardware-based Optimizations

Background

  • Processor-based, software model
    – Function calls are atomic
    – Program flow is quantized in message function units
    – Cannot execute communication and computation simultaneously
  • Hardware engines
    – Significantly more parallelism
    – Communication and computation can be simultaneous

Ly D, Saldaña M, Chow P. FPT 2009 92

slide-93
SLIDE 93

Class III: Hardware-based Optimizations

Dataflow Message Passing Model

  • Standard message processing model

    MPI_Recv(...);
    compute();
    MPI_Send(...);

  • Hardware uses a dataflow model

Ly D, Saldaña M, Chow P. FPT 2009 93

[Figure: dataflow model – data streams through a Logic block]

slide-94
SLIDE 94

Class III: Hardware-based Optimizations

Case Study: Vector Addition

  • Vector Addition:
  • va comes from Rank 1, vb comes from Rank 2
  • Compute vc, send result back to Rank 1 and 2

Ly D, Saldaña M, Chow P. FPT 2009 94

    vc = va + vb        (element-wise: vc,i = va,i + vb,i)

slide-95
SLIDE 95

Class III: Hardware-based Optimizations

Case Study: Vector Addition

  • Software model:

    int va[N], vb[N], vc[N];

    MPI_Recv(va, N, MPI_INT, rank1, ...);
    MPI_Recv(vb, N, MPI_INT, rank2, ...);

    for (int i = 0; i < N; i++)
        vc[i] = va[i] + vb[i];

    MPI_Send(vc, N, MPI_INT, rank1, ...);
    MPI_Send(vc, N, MPI_INT, rank2, ...);

Ly D, Saldaña M, Chow P. FPT 2009 95

slide-106
SLIDE 106

Class III: Hardware-based Optimizations

Case Study: Vector Addition

Ly D, Saldaña M, Chow P. FPT 2009 106

  • Message transfers are atomic
    – Serializes computation and communication
  • Vector addition has great data locality
    – The entire message is not required for computation
    – Only one element of each vector is required
  • Higher granularity is required
    – A hardware dataflow approach would use pipelined computation
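In software the finer granularity can be imitated by exchanging one element at a time; the hardware engine achieves the same effect with a pipelined datapath rather than per-element calls. A sketch using standard MPI, with the two source/destination ranks passed in as parameters:

    #include <mpi.h>

    #define N 1024

    void vector_add_streamed(int rank1, int rank2)
    {
        int a, b, c;

        for (int i = 0; i < N; i++) {
            MPI_Recv(&a, 1, MPI_INT, rank1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Recv(&b, 1, MPI_INT, rank2, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            c = a + b;                     /* only one element is ever needed */
            MPI_Send(&c, 1, MPI_INT, rank1, 0, MPI_COMM_WORLD);
            MPI_Send(&c, 1, MPI_INT, rank2, 0, MPI_COMM_WORLD);
        }
    }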

slide-113
SLIDE 113

Class III: Hardware-based Optimizations

Dataflow Message Passing Model

Ly D, Saldaña M, Chow P. FPT 2009 113

  • Natural extension of MPI for hardware designers
    – Increased granularity → increased performance
    – Supports pipelining
  • A single processing element represents multiple ranks
    – Capable of transferring data from multiple sources
    – Supports data streaming
  • Full-duplex data transfer
slide-114
SLIDE 114

Conclusion and Future Work

  • MPI can be very effective for FPGA designs
    – FPGAs have different trade-offs than HPC
  • Considerations when dealing with FPGA MPI
    – Class II: DMA, non-blocking functions, MPI_Coalesce()
    – Class III: Dataflow Message Passing Model
  • Attempts to maintain compatibility with the MPI standard
    – Some incremental optimizations do not comply
    – They can be reduced to legitimate MPI code
  • Limit of where the current MPI standard applies
  • Future work: message passing using fine-grain parallelism

Ly D, Saldaña M, Chow P. FPT 2009 114

slide-115
SLIDE 115

Thank you

  • Special thanks to:

Ly D, Saldaña M, Chow P. FPT 2009 115

slide-116
SLIDE 116

Hardware Debugging Interfaces

  • Background
  • Tee Cores
  • Message Watchdog Timers

Ly D, Saldaña M, Chow P. FPT 2009 116

slide-117
SLIDE 117
  • Code compatibility allows traditional MPI software-only debugging
  • Porting to FPGA designs can still produce errors
    – Improper on-chip network setup
    – Message passing flaws in hardware cores
  • Hardware has limited visibility
    – No debuggers
    – No standard output / printf()

Ly D, Saldaña M, Chow P. FPT 2009 117

Hardware Debugging Interfaces

Background

slide-120
SLIDE 120
  • Networks typically consist of point-to-point FIFOs
  • Tee Cores:

Ly D, Saldaña M, Chow P. FPT 2009 120

Hardware Debugging Interfaces

Tee Cores

[Figure: tee cores inserted on the FIFO links between MPI cores, with the tapped traffic routed to a Processor]

slide-121
SLIDE 121
  • Transparent and does not affect original network performance
  • Allows direct tracing of the data link layer
    – Simple communication protocols
    – Easy to follow message transmissions

Ly D, Saldaña M, Chow P. FPT 2009 121

Hardware Debugging Interfaces

Tee Cores

[Figure: traced message transmissions between Rank 1 and Rank n]

slide-123
SLIDE 123
  • Unresponsive embedded systems cannot be recovered
  • Message watchdog timers are integrated with the MPI implementation source code
    – Snoop incoming messages in a transparent manner
    – If there is no activity before the timer expires, the processor gets interrupted and control is returned
  • Excellent for post-mortem analysis
    – Connect with Tee Cores for a terse debugging report

Ly D, Saldaña M, Chow P. FPT 2009 123

Hardware Debugging Interfaces

Message Watchdog Timers
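As a rough illustration of the recovery path, a processor-side timeout handler might look like the sketch below; the snooped message state, the memory-mapped view and the reporting helper are all hypothetical:

    #include <stdint.h>

    typedef struct {                       /* hypothetical snooped message state */
        volatile uint32_t last_src_rank;
        volatile uint32_t last_tag;
        volatile uint32_t words_seen;
    } msg_watchdog_t;

    extern msg_watchdog_t *watchdog;       /* illustrative memory-mapped pointer */
    void report(const char *fmt, uint32_t a, uint32_t b, uint32_t c); /* hypothetical */

    /* Invoked when no message activity is seen before the timer expires:
       control returns to the processor for post-mortem analysis. */
    void watchdog_timeout_isr(void)
    {
        report("MPI watchdog expired: src=%u tag=%u words=%u",
               watchdog->last_src_rank, watchdog->last_tag, watchdog->words_seen);
        /* ...combine with Tee core traces here for a terse debugging report... */
    }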