slide-1
SLIDE 1

The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

Daniel L. Ly1, Manuel Saldaña2 and Paul Chow1

1Department of Electrical and Computer Engineering

University of Toronto

2Arches Computing Systems, Toronto, Canada

slide-2
SLIDE 2

Outline

  • Background and Motivation
  • Embedded Processor-Based Optimizations
  • Hardware Engine-Based Optimizations
  • Conclusions and Future Work

Ly D, Saldaña M, Chow P. FPT 2009 2

slide-3
SLIDE 3

Motivation

  • Message Passing Interface (MPI) is a programming model for distributed memory systems
  • Popular in high performance computing (HPC), cluster-based systems

Ly D, Saldaña M, Chow P. FPT 2009 3

slide-4
SLIDE 4

Motivation

  • Message Passing Interface (MPI) is a programming model for distributed memory systems
  • Popular in high performance computing (HPC), cluster-based systems

Ly D, Saldaña M, Chow P. FPT 2009 4

[Figure: Processor 1 and Processor 2, each with local memory]

    for (i = 1; i <= 100; i++) sum += i;

Problem: sum of the numbers from 1 to 100

slide-5
SLIDE 5

Motivation

  • Message Passing Interface (MPI) is a programming model for distributed memory systems
  • Popular in high performance computing (HPC), cluster-based systems

Ly D, Saldaña M, Chow P. FPT 2009 5

Processor 1:
    sum1 = 0;
    for (i = 1; i <= 50; i++) sum1 += i;
    MPI_Recv(sum2, ...);
    sum = sum1 + sum2;

Processor 2:
    sum1 = 0;
    for (i = 51; i <= 100; i++) sum1 += i;
    MPI_Send(sum1, ...);
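As a point of reference, a complete two-rank version of this example might look like the following (a minimal sketch written against standard MPI; ranks 0 and 1 stand in for Processor 1 and Processor 2):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, i, sum1 = 0, sum2 = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {                        /* Processor 1 */
            for (i = 1; i <= 50; i++) sum1 += i;
            MPI_Recv(&sum2, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("sum = %d\n", sum1 + sum2);  /* 5050 */
        } else if (rank == 1) {                 /* Processor 2 */
            for (i = 51; i <= 100; i++) sum1 += i;
            MPI_Send(&sum1, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }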

slide-9
SLIDE 9

Motivation

  • Strong interest in adapting MPI for embedded designs:

– Increasingly difficult to interface heterogeneous resources as FPGA chip size increases

  • MPI provides key benefits:

– Unified protocol
– Low weight and overhead
– Abstraction of end points (ranks)
– Easy prototyping

Ly D, Saldaña M, Chow P. FPT 2009 9

slide-10
SLIDE 10

Motivation

Property                            HPC Cluster          Embedded FPGA
Processor – Clock Rate              2-3 GHz              100-200 MHz
Memory – Size per node              > 1 GB               1-20 MB
Interconnect – Protocol Robustness  High                 None
Interconnect – Latency              10 μs (20k cycles)   100 ns (10 cycles)
Interconnect – Bandwidth            125 MB/s             400-800 MB/s
Components – Processing Nodes       Homogeneous          Heterogeneous

Ly D, Saldaña M, Chow P. FPT 2009 10

slide-11
SLIDE 11

Motivation

  • Interaction classes arising from heterogeneous designs:

– Class I: Software-software interactions

  • Collections of embedded processors
  • Thoroughly investigated; will not be discussed

– Class II: Software-hardware interactions

  • Embedded processors with hardware engines
  • Large variety in processing speed

– Class III: Hardware-hardware interactions

  • Collections of hardware engines
  • Hardware engines are capable of significant concurrency compared to processors

Ly D, Saldaña M, Chow P. FPT 2009 11

slide-12
SLIDE 12

Background

  • Work builds on TMD-MPI[1]

– Subset implementation of the MPI standard
– Allows hardware engines to be part of the message passing network
– Ported to Amirix PCI, BEE2, BEE3, Xilinx ACP
– Software libraries for MicroBlaze, PowerPC, Intel X86

[1] M. Saldaña et al., “MPI as an abstraction for software-hardware interaction for HPRCs,” HPRCTA, Nov. 2008.

Ly D, Saldaña M, Chow P. FPT 2009 12

slide-13
SLIDE 13

Class II: Processor-based Optimizations

  • Background
  • Direct Memory Access MPI Hardware Engine
  • Non-Interrupting, Non-Blocking Functions
  • Series of MPI Messages
  • Results and Analysis

Ly D, Saldaña M, Chow P. FPT 2009 13

slide-14
SLIDE 14

Class II: Processor-based Optimizations

Background

  • Problem 1
    – Standard message paradigm for HPC systems
      • Plentiful memory but high message latency
      • Favours combining data into a few, large messages, which are stored in memory and retrieved as needed
    – Embedded designs provide a different trade-off
      • Little memory but short message latency
      • 'Just-in-time' paradigm is preferred
        – Sending just enough data for one unit of computation, on demand

Ly D, Saldaña M, Chow P. FPT 2009 14

slide-15
SLIDE 15

Class II: Processor-based Optimizations

Background

  • Problem 2
    – Homogeneity of HPC systems
      • Each rank has similar processing capabilities
    – Heterogeneity of FPGA systems
      • Hardware engines are tailored for a specific set of functions – extremely fast processing
      • Embedded processors play the vital role of control and memory distribution – little processing

Ly D, Saldaña M, Chow P. FPT 2009 15

slide-16
SLIDE 16

Class II: Processor-based Optimizations

Background

  • ‘Just-in-time’ + Heterogeneity = producer-

consumer model

– Processors produce messages for hardware engines to consume – Generally, the message production rate of the processor is the limiting factor

Ly D, Saldaña M, Chow P. FPT 2009 16

slide-17
SLIDE 17

Class II: Processor-based Optimizations

Direct Memory Access MPI Engine

  • Typical MPI implementations use only software
  • DMA engine offloads the time-consuming message task: memory transfers
    – Frees the processor to continue execution
    – Can implement burst memory transactions
    – Time required to prepare a message is independent of message length
    – Allows messages to be queued

Ly D, Saldaña M, Chow P. FPT 2009 17

slide-20
SLIDE 20

Class II: Processor-based Optimizations

Direct Memory Access MPI Engine

Ly D, Saldaña M, Chow P. FPT 2009 20

MPI_Send(...)

  1. Processor writes 4 words:
     – destination rank
     – address of data buffer
     – message size
     – message tag
  2. PLB_MPE decodes the message header
  3. PLB_MPE transfers the data from memory
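From the processor's point of view, the whole send reduces to the four writes in step 1. The sketch below shows the kind of thing MPI_Send() might do internally; the base address, register names and offsets are illustrative only (they are not taken from the paper), and a 32-bit embedded target is assumed:

    /* Illustrative memory-mapped view of the DMA MPI engine (PLB_MPE).
       Base address and register offsets are made up for this sketch. */
    #define MPE_BASE       0x84000000u
    #define MPE_DEST_RANK  (*(volatile unsigned int *)(MPE_BASE + 0x0))
    #define MPE_BUF_ADDR   (*(volatile unsigned int *)(MPE_BASE + 0x4))
    #define MPE_MSG_SIZE   (*(volatile unsigned int *)(MPE_BASE + 0x8))
    #define MPE_MSG_TAG    (*(volatile unsigned int *)(MPE_BASE + 0xC))

    /* Queue a send: four register writes, independent of message length.
       The engine decodes the header and bursts the data out of memory
       while the processor continues executing. */
    static void mpe_queue_send(int dest_rank, const void *buf,
                               unsigned int size_words, int tag)
    {
        MPE_DEST_RANK = (unsigned int)dest_rank;
        MPE_BUF_ADDR  = (unsigned int)buf;
        MPE_MSG_SIZE  = size_words;
        MPE_MSG_TAG   = (unsigned int)tag;
    }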
slide-32
SLIDE 32

Class II: Processor-based Optimizations

Direct Memory Access MPI Engine

Ly D, Saldaña M, Chow P. FPT 2009 32

MPI_Recv(...)

  1. Processor writes 4 words:
     – source rank
     – address of data buffer
     – message size
     – message tag
  2. PLB_MPE decodes the message header
  3. PLB_MPE transfers the data to memory
  4. PLB_MPE notifies the processor
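The receive side is the mirror image, with the notification in step 4 telling the processor that the buffer now holds valid data. Again, the register map and the polled completion flag are purely illustrative:

    #define MPE_BASE        0x84000000u   /* same illustrative base as the send sketch */
    #define MPE_SRC_RANK    (*(volatile unsigned int *)(MPE_BASE + 0x10))
    #define MPE_RECV_ADDR   (*(volatile unsigned int *)(MPE_BASE + 0x14))
    #define MPE_RECV_SIZE   (*(volatile unsigned int *)(MPE_BASE + 0x18))
    #define MPE_RECV_TAG    (*(volatile unsigned int *)(MPE_BASE + 0x1C))
    #define MPE_RECV_DONE   (*(volatile unsigned int *)(MPE_BASE + 0x20))

    /* Post a receive descriptor, then wait until the engine signals that the
       data has landed in memory (the notification could equally well arrive
       as an interrupt instead of a polled flag). */
    static void mpe_blocking_recv(int src_rank, void *buf,
                                  unsigned int size_words, int tag)
    {
        MPE_SRC_RANK  = (unsigned int)src_rank;
        MPE_RECV_ADDR = (unsigned int)buf;
        MPE_RECV_SIZE = size_words;
        MPE_RECV_TAG  = (unsigned int)tag;
        while (!MPE_RECV_DONE)
            ;
    }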
slide-46
SLIDE 46

Class II: Processor-based Optimizations

Direct Memory Access MPI Engine

  • DMA engine is completely transparent to the user
    – Exact same MPI functions are called
    – DMA setup is handled by the implementation

Ly D, Saldaña M, Chow P. FPT 2009 46

slide-47
SLIDE 47

Class II: Processor-based Optimizations

Non-Interrupting, Non-Blocking Functions

Ly D, Saldaña M, Chow P. FPT 2009 47

  • Two types of MPI message functions
    – Blocking functions: return only when the buffer can be safely reused
    – Non-blocking functions: return immediately
      • A request handle is required so the message status can be checked later
  • Non-blocking functions are used to overlap communication and computation

slide-48
SLIDE 48

Class II: Processor-based Optimizations

Non-Interrupting, Non-Blocking Functions

Ly D, Saldaña M, Chow P. FPT 2009 48

  • Typical HPC non-blocking use case:

    MPI_Request request;
    ...
    MPI_Isend(..., &request);
    prepare_computation();
    MPI_Wait(&request, ...);
    finish_computation();
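Expanded into a compilable fragment (the peer rank, the buffer and the two computation stubs are placeholders), the overlap looks like this:

    #include <mpi.h>

    #define N 1024

    void overlapped_send(int peer)            /* 'peer' is a placeholder rank */
    {
        static int buf[N];
        MPI_Request request;

        MPI_Isend(buf, N, MPI_INT, peer, 0, MPI_COMM_WORLD, &request);
        /* prepare_computation(): work that does not touch buf */
        MPI_Wait(&request, MPI_STATUS_IGNORE);
        /* finish_computation(): buf may now be reused safely */
    }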

slide-49
SLIDE 49

Class II: Processor-based Optimizations

Non-Interrupting, Non-Blocking Functions

Ly D, Saldaña M, Chow P. FPT 2009 49

  • Class II interactions have a different use case
    – Hardware engines are responsible for computation
    – Embedded processors only need to send messages as fast as possible
  • DMA hardware allows messages to be queued
  • 'Fire-and-forget' message model
    – Message status is not important
    – Request handles are serviced by expensive interrupts

slide-50
SLIDE 50

Class II: Processor-based Optimizations

Non-Interrupting, Non-Blocking Functions

Ly D, Saldaña M, Chow P. FPT 2009 50

  • Standard MPI protocol provides a mechanism for 'fire-and-forget':

    MPI_Request request_dummy;
    ...
    MPI_Isend(..., &request_dummy);
    MPI_Request_free(&request_dummy);
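In the producer-consumer setting this turns into a tight send loop: each message is fired and its request handle immediately freed, so the processor never waits on completion. A sketch, with the consumer rank, buffer layout and message count made up for illustration:

    #include <mpi.h>

    #define NUM_MSGS  16
    #define MSG_WORDS 32

    void produce(int consumer_rank, int work[NUM_MSGS][MSG_WORDS])
    {
        MPI_Request request_dummy;

        for (int i = 0; i < NUM_MSGS; i++) {
            /* Fire-and-forget: the DMA engine queues the transfer,
               and the message status is never checked. */
            MPI_Isend(work[i], MSG_WORDS, MPI_INT, consumer_rank, i,
                      MPI_COMM_WORLD, &request_dummy);
            MPI_Request_free(&request_dummy);
        }
    }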

slide-51
SLIDE 51

Class II: Processor-based Optimizations

Non-Interrupting, Non-Blocking Functions

Ly D, Saldaña M, Chow P. FPT 2009 51

  • Standard implementation still incurs overhead:
    – Setting up the interrupt
    – Removing the interrupt
    – Extra function call overhead
    – Memory space for the MPI_Request data structure
  • For the 'just-in-time' message model on embedded processors, these overheads create a bottleneck

slide-52
SLIDE 52

Class II: Processor-based Optimizations

Non-Interrupting, Non-Blocking Functions

Ly D, Saldaña M, Chow P. FPT 2009 52

  • Proposed modification to the MPI protocol:

    #define MPI_REQUEST_NULL NULL
    ...
    MPI_Isend(..., MPI_REQUEST_NULL);

  • Non-blocking functions check that the request pointer is valid before setting up interrupts
  • Circumvents the overhead
  • Not standard, but a minor modification that works well for embedded processors with DMA
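Inside the library the change amounts to a single guard before any interrupt bookkeeping. The sketch below only illustrates the idea and does not reproduce the TMD-MPI source; the helper functions are hypothetical and are declared here just to keep the fragment self-contained:

    typedef struct tmd_request tmd_request_t;              /* opaque request type */

    void queue_dma_descriptor(const void *buf, int words,  /* hypothetical helpers */
                              int dest, int tag);
    void init_request(tmd_request_t *req);
    void enable_completion_interrupt(tmd_request_t *req);

    void isend_impl(const void *buf, int words, int dest, int tag,
                    tmd_request_t *request)
    {
        queue_dma_descriptor(buf, words, dest, tag);   /* DMA engine takes over */

        /* Proposed check: a NULL handle means fire-and-forget, so all
           interrupt setup/teardown and request storage are skipped. */
        if (request != 0) {
            init_request(request);
            enable_completion_interrupt(request);
        }
    }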

slide-60
SLIDE 60

Class II: Processor-based Optimizations

Series of messages – MPI_Coalesce()

Ly D, Saldaña M, Chow P. FPT 2009 60

  • MPI message without DMA

    MPI_Send() → transfer lots of data words → return

    (Legend: Non-MPI Code, Function Preamble/Postamble, MPI Function Code)

slide-61
SLIDE 61

Class II: Processor-based Optimizations

Series of messages – MPI_Coalesce()

Ly D, Saldaña M, Chow P. FPT 2009 61

  • MPI message with DMA

    MPI_Send() → transfer four words, regardless of message length → return

    (Legend: Non-MPI Code, Function Preamble/Postamble, MPI Function Code)

slide-63
SLIDE 63

Class II: Processor-based Optimizations

Series of messages – MPI_Coalesce()

Ly D, Saldaña M, Chow P. FPT 2009 63

  • MPI message with DMA

    Time breakdown: Non-MPI code 55.6%, function preamble/postamble 28.7%, MPI function code 15.6% (28.7% + 15.6% = 44.3% messaging overhead)

slide-64
SLIDE 64

Class II: Processor-based Optimizations

Series of messages – MPI_Coalesce()

Ly D, Saldaña M, Chow P. FPT 2009 64

  • MPI message with DMA

– Message queueing

    [Figure: three messages (msg 1, msg 2, msg 3) queued back-to-back through the DMA engine; legend: Non-MPI Code, Function Preamble/Postamble, MPI Function Code]

slide-68
SLIDE 68

Class II: Processor-based Optimizations

Series of messages – MPI_Coalesce()

Ly D, Saldaña M, Chow P. FPT 2009 68

  • Inline all MPI functions?

– Increases program length!

    [Figure: the inlined MPI code repeated for msg 1, msg 2 and msg 3; legend: Non-MPI Code, Function Preamble/Postamble, MPI Function Code]

slide-69
SLIDE 69

Class II: Processor-based Optimizations

Series of messages – MPI_Coalesce()

Ly D, Saldaña M, Chow P. FPT 2009 69

  • Standard MPI Functions

    void *msg_buf;
    int msg_size;
    ...
    MPI_Isend(msg_buf, msg_size, ...);
    MPI_Irecv(msg_buf, msg_size, ...);

slide-70
SLIDE 70

Class II: Processor-based Optimizations

Series of messages – MPI_Coalesce()

Ly D, Saldaña M, Chow P. FPT 2009 70

    void MPI_Coalesce (
        // MPI_Coalesce specific arguments
        MPI_Function *mpi_fn, int mpi_fn_count,
        // Array of point-to-point MPI function arguments
        void **msg_buf, int *msg_size, ...)
    {
        for (int i = 0; i < mpi_fn_count; i++) {
            if (mpi_fn[i] == MPI_Isend)
                inline MPI_Isend(msg_buf[i], msg_size[i], ...);
            else if (mpi_fn[i] == MPI_Irecv)
                inline MPI_Irecv(msg_buf[i], msg_size[i], ...);
        }
    }
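A call site for the three queued messages in the diagrams might look roughly as follows; the buffer names and sizes are invented for illustration, and the trailing arguments stay elided exactly as in the prototype above:

    int weights[64], inputs[32], results[16];          /* hypothetical buffers */

    MPI_Function fn[3]    = { MPI_Isend, MPI_Isend, MPI_Irecv };
    void        *bufs[3]  = { weights, inputs, results };
    int          sizes[3] = { 64, 32, 16 };

    MPI_Coalesce(fn, 3, bufs, sizes, ...);   /* one call, one preamble/postamble */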

slide-76
SLIDE 76

Class II: Processor-based Optimizations

Series of messages – MPI_Coalesce()

Ly D, Saldaña M, Chow P. FPT 2009 76

  • MPI_Coalesce

    [Figure: MPI_Coalesce timeline – a single function preamble/postamble around a for loop that issues msg 1, msg 2 and msg 3; legend: Non-MPI Code, Function Preamble/Postamble, MPI Function Code]

slide-80
SLIDE 80

Class II: Processor-based Optimizations

Series of messages – MPI_Coalesce()

  • MPI_Coalesce is not part of the MPI Standard
  • Behaviour can be easily reproduced

– Even when source code is not available

  • Maintains compatibility with MPI code

Ly D, Saldaña M, Chow P. FPT 2009 80
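For instance, even without access to MPI_Coalesce, the same three queued transfers can be written with standard non-blocking calls, at the cost of one function preamble/postamble per message (a minimal sketch; the peer rank, buffers and tags are illustrative):

    #include <mpi.h>

    void coalesce_fallback(int peer, int weights[64], int inputs[32], int results[16])
    {
        MPI_Request req[3];

        MPI_Isend(weights, 64, MPI_INT, peer, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(inputs,  32, MPI_INT, peer, 1, MPI_COMM_WORLD, &req[1]);
        MPI_Irecv(results, 16, MPI_INT, peer, 2, MPI_COMM_WORLD, &req[2]);

        MPI_Waitall(3, req, MPI_STATUSES_IGNORE);
    }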

slide-81
SLIDE 81

Class II: Processor-based Optimizations

Results

81

  • Application: Restricted Boltzmann Machines[2]
    – Neural network FPGA implementation
    – Platform: Berkeley Emulation Engine 2 (BEE2)
      • Five Xilinx Virtex-II Pro XC2VP70 FPGAs
      • Inter-FPGA communication:
        – Latency: 6 cycles
        – Bandwidth: 1.73 GB/s

[2] D. Ly et al., “A Multi-FPGA Architecture for Restricted Boltzmann Machines,” FPL, Sept. 2009.

Ly D, Saldaña M, Chow P. FPT 2009

slide-83
SLIDE 83

Class II: Processor-based Optimizations

Results

83 Ly D, Saldaña M, Chow P. FPT 2009

Message #   Source   Destination   Size [# of words]
    1       R0       R1
    2       R0       R1            3
    3       R0       R6
    4       R0       R6            3
    5       R0       R11
    6       R0       R11           3
    7       R0       R16
    8       R0       R16           3
    9       R0       R1            4
   10       R0       R6            4
   11       R0       R11           4
   12       R0       R16           4

slide-90
SLIDE 90

Class II: Processor-based Optimizations

Results

Ly D, Saldaña M, Chow P. FPT 2009 90

    [Chart: measured speedups of 2.33x, 3.94x and 5.32x]

slide-91
SLIDE 91

Class III: Hardware-based Optimizations

  • Background
  • Dataflow Message Passing Model

– Case Study: Vector Addition

Ly D, Saldaña M, Chow P. FPT 2009 91

slide-92
SLIDE 92

Class III: Hardware-based Optimizations

Background

  • Processor-based, software model
    – Function calls are atomic
    – Program flow is quantized in message function units
    – Cannot execute communication and computation simultaneously
  • Hardware engines
    – Significantly more parallelism
    – Communication and computation can be simultaneous

Ly D, Saldaña M, Chow P. FPT 2009 92

slide-93
SLIDE 93

Class III: Hardware-based Optimizations

Dataflow Message Passing Model

  • Standard message processing model

    MPI_Recv(...);
    compute();
    MPI_Send(...);

  • Hardware uses a dataflow model

Ly D, Saldaña M, Chow P. FPT 2009 93

[Figure: dataflow model – data streams through a Logic block]

slide-94
SLIDE 94

Class III: Hardware-based Optimizations

Case Study: Vector Addition

  • Vector Addition:
  • va comes from Rank 1, vb comes from Rank 2
  • Compute vc, send result back to Rank 1 and 2

Ly D, Saldaña M, Chow P. FPT 2009 94

    vc = va + vb        (element-wise: vc,i = va,i + vb,i)

slide-95
SLIDE 95

Class III: Hardware-based Optimizations

Case Study: Vector Addition

  • Software model:

    int va[N], vb[N], vc[N];

    MPI_Recv(va, N, MPI_INT, rank1, ...);
    MPI_Recv(vb, N, MPI_INT, rank2, ...);

    for (int i = 0; i < N; i++)
        vc[i] = va[i] + vb[i];

    MPI_Send(vc, N, MPI_INT, rank1, ...);
    MPI_Send(vc, N, MPI_INT, rank2, ...);

Ly D, Saldaña M, Chow P. FPT 2009 95

slide-106
SLIDE 106

Class III: Hardware-based Optimizations

Case Study: Vector Addition

Ly D, Saldaña M, Chow P. FPT 2009 106

  • Message transfers are atomic
    – Serializes computation and communication
  • Vector addition has great data locality
    – The entire message is not required for computation
    – Only one element of each vector is required
  • Higher granularity is required
    – A hardware dataflow approach would use pipelined computation
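In software the finer granularity can be imitated by exchanging one element at a time; the hardware engine achieves the same effect with a pipelined datapath rather than per-element calls. A sketch using standard MPI, with the two source/destination ranks passed in as parameters:

    #include <mpi.h>

    #define N 1024

    void vector_add_streamed(int rank1, int rank2)
    {
        int a, b, c;

        for (int i = 0; i < N; i++) {
            MPI_Recv(&a, 1, MPI_INT, rank1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Recv(&b, 1, MPI_INT, rank2, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            c = a + b;                     /* only one element is ever needed */
            MPI_Send(&c, 1, MPI_INT, rank1, 0, MPI_COMM_WORLD);
            MPI_Send(&c, 1, MPI_INT, rank2, 0, MPI_COMM_WORLD);
        }
    }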

slide-113
SLIDE 113

Class III: Hardware-based Optimizations

Dataflow Message Passing Model

Ly D, Saldaña M, Chow P. FPT 2009 113

  • Natural extension of MPI for hardware designers
    – Increased granularity → increased performance
    – Supports pipelining
  • A single processing element represents multiple ranks
    – Capable of transferring data from multiple sources
    – Supports data streaming
  • Full-duplex data transfer
slide-114
SLIDE 114

Conclusion and Future Work

  • MPI can be very effective for FPGA designs
    – FPGAs have different trade-offs than HPC
  • Considerations when dealing with FPGA MPI
    – Class II: DMA, non-blocking functions, MPI_Coalesce()
    – Class III: Dataflow Message Passing Model
  • Attempts to maintain compatibility with the MPI standard
    – Some incremental optimizations do not comply
    – They can be reduced to legitimate MPI code
  • Limit of where the current MPI standard applies
  • Future work: message passing using fine-grain parallelism

Ly D, Saldaña M, Chow P. FPT 2009 114

slide-115
SLIDE 115

Thank you

  • Special thanks to:

Ly D, Saldaña M, Chow P. FPT 2009 115

slide-116
SLIDE 116

Hardware Debugging Interfaces

  • Background
  • Tee Cores
  • Message Watchdog Timers

Ly D, Saldaña M, Chow P. FPT 2009 116

slide-117
SLIDE 117
  • Code compatibility allows traditional MPI software-only debugging
  • Porting to FPGA designs can still produce errors
    – Improper on-chip network setup
    – Message passing flaws in hardware cores
  • Hardware has limited visibility
    – No debuggers
    – No standard output / printf()

Ly D, Saldaña M, Chow P. FPT 2009 117

Hardware Debugging Interfaces

Background

slide-120
SLIDE 120
  • Networks typically consist of point-to-point FIFOs
  • Tee Cores:

Ly D, Saldaña M, Chow P. FPT 2009 120

Hardware Debugging Interfaces

Tee Cores

[Figure: tee cores inserted on the FIFO links between MPI cores, with the tapped traffic routed to a Processor]

slide-121
SLIDE 121
  • Transparent and does not affect original network performance
  • Allows direct tracing of the data link layer
    – Simple communication protocols
    – Easy to follow message transmissions

Ly D, Saldaña M, Chow P. FPT 2009 121

Hardware Debugging Interfaces

Tee Cores

[Figure: traced message transmissions between Rank 1 and Rank n]

slide-123
SLIDE 123
  • Unresponsive embedded systems cannot be recovered
  • Message watchdog timers are integrated with the MPI implementation source code
    – Snoop incoming messages in a transparent manner
    – If there is no activity before the timer expires, the processor gets interrupted and control is returned
  • Excellent for post-mortem analysis
    – Connect with Tee Cores for a terse debugging report

Ly D, Saldaña M, Chow P. FPT 2009 123

Hardware Debugging Interfaces

Message Watchdog Timers
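As a rough illustration of the recovery path, a processor-side timeout handler might look like the sketch below; the snooped message state, the memory-mapped view and the reporting helper are all hypothetical:

    #include <stdint.h>

    typedef struct {                       /* hypothetical snooped message state */
        volatile uint32_t last_src_rank;
        volatile uint32_t last_tag;
        volatile uint32_t words_seen;
    } msg_watchdog_t;

    extern msg_watchdog_t *watchdog;       /* illustrative memory-mapped pointer */
    void report(const char *fmt, uint32_t a, uint32_t b, uint32_t c); /* hypothetical */

    /* Invoked when no message activity is seen before the timer expires:
       control returns to the processor for post-mortem analysis. */
    void watchdog_timeout_isr(void)
    {
        report("MPI watchdog expired: src=%u tag=%u words=%u",
               watchdog->last_src_rank, watchdog->last_tag, watchdog->words_seen);
        /* ...combine with Tee core traces here for a terse debugging report... */
    }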