The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes
Daniel L. Ly (1), Manuel Saldaña (2) and Paul Chow (1)
(1) Department of Electrical and Computer Engineering, University of Toronto
(2) Arches Computing Systems, Toronto, Canada
Problem: sum of the numbers from 1 to 100.

for (i = 1; i <= 100; i++)
    sum += i;

[Diagram: Processor 1 and Processor 2, each with its own local memory]
Processor 1:
    sum1 = 0;
    for (i = 1; i <= 50; i++)
        sum1 += i;
    MPI_Recv(&sum2, ...);
    sum = sum1 + sum2;

Processor 2:
    sum1 = 0;
    for (i = 51; i <= 100; i++)
        sum1 += i;
    MPI_Send(&sum1, ...);
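For reference, a complete runnable version of this two-processor decomposition is sketched below, assuming a standard MPI environment; the rank assignment (rank 0 = Processor 1, rank 1 = Processor 2), the tag value 0 and the use of MPI_COMM_WORLD are the usual defaults, not details taken from the slides.

#include <mpi.h>
#include <stdio.h>

/* Sketch: two-process sum of 1..100. Rank 1 sums 51..100 and sends its
 * partial result to rank 0, which sums 1..50 and combines the halves. */
int main(int argc, char **argv)
{
    int rank, i, sum1 = 0, sum2 = 0, sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (i = 1; i <= 50; i++)
            sum1 += i;
        MPI_Recv(&sum2, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        sum = sum1 + sum2;
        printf("sum = %d\n", sum);   /* expected: 5050 */
    } else if (rank == 1) {
        for (i = 51; i <= 100; i++)
            sum1 += i;
        MPI_Send(&sum1, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}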
Property                           HPC Cluster          Embedded FPGA Processor
Clock rate                         2-3 GHz              100-200 MHz
Memory size per node               > 1 GB               1-20 MB
Interconnect protocol robustness   High                 None
Interconnect latency               10 μs (20k cycles)   100 ns (10 cycles)
Interconnect bandwidth             125 MB/s             400-800 MB/s
Processing node components         Homogeneous          Heterogeneous
[1] M. Saldaña et al., “MPI as an abstraction for software-hardware interaction for HPRCs,” HPRCTA, Nov. 2008.
// Non-blocking send: start the transfer, overlap it with computation,
// then wait before reusing the buffer
MPI_Request request;
...
MPI_Isend(..., &request);
prepare_computation();
MPI_Wait(&request, ...);
finish_computation();
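As a point of reference, a minimal self-contained sketch of this overlap pattern using standard MPI calls; the buffer length, destination rank and tag are illustrative assumptions, and prepare_computation()/finish_computation() stand in for the application code from the slide.

#include <mpi.h>

static void prepare_computation(void) { /* work that does not touch buf */ }
static void finish_computation(void)  { /* work that may reuse buf */ }

/* Sketch: overlap a non-blocking send with computation (assumed values). */
void send_with_overlap(int *buf)
{
    enum { DEST = 1, TAG = 0, N = 64 };       /* illustrative, not from the slides */
    MPI_Request request;

    MPI_Isend(buf, N, MPI_INT, DEST, TAG, MPI_COMM_WORLD, &request);
    prepare_computation();                    /* runs while the message is in flight */
    MPI_Wait(&request, MPI_STATUS_IGNORE);    /* buf may be reused after this */
    finish_computation();
}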
// Workaround when the request is never needed: free it immediately,
// so the application does not have to track completion
MPI_Request request_dummy;
...
MPI_Isend(..., &request_dummy);
MPI_Request_free(&request_dummy);
// Lighter alternative: pass a null request so that no request
// object has to be allocated or tracked at all
#define MPI_REQUEST_NULL NULL
...
MPI_Isend(..., MPI_REQUEST_NULL);
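A minimal sketch of the library-side idea, under the assumption that a NULL request pointer simply means "do not track this transfer"; the stand-in types and helper names below are hypothetical and are not the embedded MPI library's actual internals.

#include <stddef.h>

typedef int emb_request_t;            /* stand-in request handle type */
#define EMB_MPI_REQUEST_NULL NULL     /* mirrors the #define above */

static void start_transfer(const void *buf, int count) { (void)buf; (void)count; }
static emb_request_t new_request(void) { return 1; }

/* Toy isend: bookkeeping is allocated only when the caller passes a real
 * request pointer; a null request skips it entirely. */
static int emb_isend(const void *buf, int count, emb_request_t *request)
{
    start_transfer(buf, count);
    if (request != EMB_MPI_REQUEST_NULL)
        *request = new_request();
    return 0;
}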
[Figure: instruction trace of an MPI_Send() call. Legend: non-MPI code; function preamble/postamble; MPI function code. The trace shows the call entering through its preamble, transferring the data words (many words for a long message; one frame notes that only four words are transferred regardless of message length), and then returning.]
[Profile, same legend: non-MPI code 55.6%; function preamble/postamble 28.7%; MPI function code 15.6%. The two MPI-related portions together account for 28.7% + 15.6% = 44.3% of execution time.]
[Figure: instruction traces of three consecutive point-to-point messages (msg 1, msg 2, msg 3), colour-coded with the same legend: non-MPI code; function preamble/postamble; MPI function code.]
void *msg_buf;
int msg_size;
...
MPI_Isend(msg_buf, msg_size, ...);
MPI_Irecv(msg_buf, msg_size, ...);
void MPI_Coalesce (
    // MPI_Coalesce-specific arguments
    MPI_Function *mpi_fn,
    int mpi_fn_count,
    // Arrays of point-to-point MPI function arguments
    void **msg_buf,
    int *msg_size,
    ...
) {
    // 'inline' is slide shorthand: the send/receive bodies are expanded in
    // place, so only MPI_Coalesce itself pays a function preamble/postamble
    for (int i = 0; i < mpi_fn_count; i++) {
        if (mpi_fn[i] == MPI_Isend)
            inline MPI_Isend(msg_buf[i], msg_size[i], ...);
        else if (mpi_fn[i] == MPI_Irecv)
            inline MPI_Irecv(msg_buf[i], msg_size[i], ...);
    }
}
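A hedged usage sketch of the prototype above, coalescing three point-to-point transfers in one call: the array contents are illustrative, and the per-message arguments beyond the buffers and sizes (ranks, tags, datatypes) are left elided exactly as in the prototype.

// Sketch only: assumes the MPI_Coalesce prototype above, with MPI_Isend and
// MPI_Irecv usable as MPI_Function values (as in the comparison inside the loop).
int a[16], b[32], c[8];                        // illustrative buffers

MPI_Function fn[3]  = { MPI_Isend, MPI_Irecv, MPI_Isend };
void        *buf[3] = { a, b, c };
int          len[3] = { 16, 32, 8 };

// One call, one function preamble/postamble for all three messages;
// the remaining per-message arguments are elided as in the prototype.
MPI_Coalesce(fn, 3, buf, len);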
[Figure: instruction trace of the same three messages issued as one coalesced call; a single for loop over msg 1, msg 2 and msg 3 sits inside one function preamble/postamble. Same legend as above.]
[1] D. Ly et al., “A Multi-FPGA Architecture for Restricted Boltzmann Machines,” FPL, Sept. 2009.
Message #   Source   Destination   Size [# of words]
    1         R0         R1
    2         R0         R1                3
    3         R0         R6
    4         R0         R6                3
    5         R0         R11
    6         R0         R11               3
    7         R0         R16
    8         R0         R16               3
    9         R0         R1                4
   10         R0         R6                4
   11         R0         R11               4
   12         R0         R16               4
MPI_Recv(...);    // receive the input data
compute();        // process it
MPI_Send(...);    // return the result
[Figure: vector addition example with input vectors a and b and result vector c, computed element-wise as c_i = a_i + b_i.]
int va[N], vb[N], vc[N];

MPI_Recv(va, N, MPI_INT, rank1, ...);   // receive vector a from rank1
MPI_Recv(vb, N, MPI_INT, rank2, ...);   // receive vector b from rank2

for (int i = 0; i < N; i++)
    vc[i] = va[i] + vb[i];              // element-wise addition

MPI_Send(vc, N, MPI_INT, rank1, ...);   // send the result back to both ranks
MPI_Send(vc, N, MPI_INT, rank2, ...);
[Figure: system architecture built from multiple hardware MPI cores and an embedded processor, each appearing as an MPI rank (Rank 1 ... Rank n).]