
rMPI: Message Passing on Multicore Processors with On-Chip Interconnect

  • 19 October 2009


Outline

— Background
— RAW microprocessor
— rMPI
— Evaluation
— Discussion


Why?

— Chips now offer on-chip networks
— Eases programmability
— MPI is a well-known standard
— Migrating an existing code base is easy
— Fine-grained program control when necessary


RAW overview

— Developed at MIT
— Tiled architecture (16 tiles in the ASIC implementation)
— 8-stage, in-order, single-issue pipeline
— 32 kB hardware-managed data cache
— 32 kB software-managed instruction cache
— 64 kB software-managed switch instruction memory


Architecture overview


RAW architecture

— The ISA allows direct control over the networks
— Four 32-bit networks

  • Two static, routed at compile time
  • Two dynamic, programmable at run time

— General Dynamic Network (GDN)

  • Used by rMPI (see the fragmentation sketch after this list)
  • 32-bit header
  • Messages of up to 32 words
  • Delivers each message atomically and in order
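
Because the GDN caps packets at 32 words, anything built on top of it must fragment longer messages in software. A minimal sketch of that fragmentation in C, assuming a hypothetical gdn_send_word() intrinsic and an illustrative header layout (the actual rMPI packet format appears on a later slide; on real RAW hardware the port is an ISA-level operand, not a C call):

```c
#include <stdint.h>

#define GDN_MAX_PAYLOAD 31   /* 32 words per packet, one taken by the header */

/* Hypothetical intrinsic: write one word to the GDN output port,
 * routed to a destination tile. Stands in for RAW's register-mapped
 * network port. */
void gdn_send_word(uint32_t dest_tile, uint32_t word);

/* Fragment an arbitrary-length message into GDN-sized packets, the
 * job rMPI performs in software below the MPI interface. The header
 * layout (source tile, last-packet flag, payload length) is
 * illustrative, not the real rMPI packet format. */
void gdn_send_message(uint32_t my_tile, uint32_t dest_tile,
                      const uint32_t *buf, uint32_t nwords)
{
    do {
        uint32_t chunk  = nwords < GDN_MAX_PAYLOAD ? nwords : GDN_MAX_PAYLOAD;
        uint32_t last   = (nwords == chunk);
        uint32_t header = (my_tile << 8) | (last << 7) | chunk;

        gdn_send_word(dest_tile, header);
        for (uint32_t i = 0; i < chunk; i++)
            gdn_send_word(dest_tile, *buf++);
        nwords -= chunk;
    } while (nwords > 0);
}
```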


RAW pipeline


rMPI

— An MPI implementation for RAW (minimal usage example below)
— Borrows ideas from LAM/MPI and MPICH
— About 75 KLOC!
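
Because rMPI implements the standard MPI interface, ordinary MPI code should compile for RAW unchanged. A minimal point-to-point example of the kind of program it targets:

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal MPI point-to-point example: rank 0 sends one integer to
 * rank 1. Code like this builds against any MPI implementation,
 * which is the portability argument behind rMPI. */
int main(int argc, char **argv)
{
    int rank, value = 42;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```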


rMPI architecture


rMPI packet format


Receiving

— Uses RAW's fast interrupt handler
— The interrupt handler sorts and reassembles incoming packets (sketched below)
— Drains the network of its contents
— Interrupt-driven design:

  • Allows asynchronous communication and computation
  • Reduces network contention
  • Avoids deadlocks from blocking sends
  • No OS layer adding delay
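
A minimal sketch of that reassembly in C, assuming a hypothetical gdn_recv_word() intrinsic and the same illustrative header layout as the earlier fragmentation sketch; the real handler runs against RAW's fast interrupt mechanism and rMPI's internal matching structures:

```c
#include <stdint.h>

#define MAX_TILES     16
#define MAX_MSG_WORDS 4096

/* Hypothetical intrinsic: drain one word from the GDN input port. */
uint32_t gdn_recv_word(void);

/* Per-sender reassembly state. The GDN delivers each packet from a
 * given sender atomically and in order, so appending per sender is
 * enough to rebuild the original message. */
static struct {
    uint32_t buf[MAX_MSG_WORDS];
    uint32_t filled;                 /* words received so far */
} inflight[MAX_TILES];

/* Invoked via RAW's fast interrupt mechanism whenever GDN data
 * arrives. Draining the network immediately lets blocked senders make
 * progress, which is how the interrupt-driven design avoids deadlock.
 * Bounds checks and the MPI matching layer are omitted. */
void gdn_interrupt_handler(void)
{
    uint32_t header = gdn_recv_word();
    uint32_t src    = (header >> 8) & 0xff;  /* illustrative layout */
    uint32_t len    = header & 0x7f;
    int      last   = (header >> 7) & 1;

    for (uint32_t i = 0; i < len; i++)
        inflight[src].buf[inflight[src].filled++] = gdn_recv_word();

    if (last) {
        /* Message complete: hand the buffer to rMPI's matching layer
         * (omitted here) and reset the reassembly state. */
        inflight[src].filled = 0;
    }
}
```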


Methodology

— rMPI results collected with a simulator
— LAM/MPI reference cluster:

  • 128 nodes
  • Two 2 GHz Opterons per node, 4 GB RAM (only one CPU used)
  • 10 Gigabit Ethernet

— Speedups are relative to a single CPU on each platform running the serial implementation (see the formula below)
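
In other words, each platform is normalized against its own serial baseline:

\[
\mathrm{speedup}(n) = \frac{T_{\mathrm{serial}}}{T_{\mathrm{parallel}}(n)}
\]

where $T_{\mathrm{serial}}$ is measured on a single CPU or tile of the same platform, so RAW and the cluster are never compared against each other's baseline.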


End-to-end overhead


End-to-end overhead comparison


Problems

— Balance between performance and programmability
— The GDN requires manual packet splitting and reassembly in software
— rMPI adds too much overhead for small messages
— Guidelines for future designers:

  • Handle packet splitting and sending automatically
  • Prevent deadlocks
  • Offer a middle ground between the raw GDN and rMPI


Performance scaling

— Jacobi relaxation

  • Low send/receive overhead
  • 16×16 to 2048×2048 matrices

— Matrix multiply
— Trapezoidal integration
— Parallel pi estimation (sketched below)
— Scalability is better for computationally intensive workloads
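
As an illustration of why these kernels scale, here is the parallel pi estimation benchmark as it is conventionally written in MPI (a sketch, not the paper's exact code):

```c
#include <mpi.h>
#include <stdio.h>

/* Classic MPI pi estimation: integrate 4/(1+x^2) over [0,1] with the
 * midpoint rule, striding the intervals across ranks. Per-rank
 * computation dominates communication (a single reduction), which is
 * why benchmarks like this scale well under rMPI. */
int main(int argc, char **argv)
{
    const long n = 1000000;          /* total intervals */
    int rank, size;
    double h, local = 0.0, pi = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    h = 1.0 / (double)n;
    for (long i = rank; i < n; i += size) {
        double x = h * ((double)i + 0.5);
        local += 4.0 / (1.0 + x * x);
    }
    local *= h;

    /* Combine the partial sums on rank 0. */
    MPI_Reduce(&local, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("pi ~= %.12f\n", pi);

    MPI_Finalize();
    return 0;
}
```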


Jacobi speedup


Speedup summary


DRAM impact


Overhead


Instruction cache size


Matrix multiply


LAM/MPI latency


Discussion!!
