QMPI: A Library for Multithreaded MPI Applications, Alex Brooks - PowerPoint PPT Presentation

SLIDE 1

QMPI: A Library for Multithreaded MPI Applications

Alex Brooks Hoang-Vu Dang Marc Snir

SLIDE 2

Outline

  • Motivation
  • Communication Model
  • Qthreads
  • QMPI
  • Summary

SLIDE 3

MOTIVATION

SLIDE 4

Issue

  • Large numbers of threads performing communication cause problems
    – Locking
    – Polling
    – Scheduling
  • As a result, there are very few hybrid MPI+pthread applications

SLIDE 5

Current MPI Design

  • MPI code is executed by the calling thread
    – Requires coarse-grain locking, which limits concurrency
    – Some implementations don't support it
  • Communication completion is observed through polling
    – Separate calls to the progress engine
  • The scheduler is unaware of which threads have become runnable

SLIDE 6

Performance

[Figure: bandwidth (MB/s) vs. message size (1 B to 4 MB), log-log scale; two panels comparing MPICH and MVAPICH.]

SLIDE 7

Goals

  • Enable efficient use of multithreaded two-sided communication
    – Light-weight threads
    – Low-overhead scheduling upon communication completion
  • Improve the programmability of multithreaded MPI

SLIDE 8

COMMUNICATION MODEL

SLIDE 9

Main idea

  • Light-weight tasks submit requests to the communication engine
  • The communication engine marks a task as runnable when its communication completes

[Diagram: worker threads submitting requests to a shared communication engine.]

SLIDE 10

QTHREADS

SLIDE 11

Introduction

  • Tasking model that supports millions of light-weight threads
  • Three main entities
    – Task: the function being executed
    – Worker: a thread that executes tasks
    – Shepherd: a queue of tasks

SLIDE 12

Synchronization

  • Full/Empty bit (FEB) semantics
    – The FEB indicates the status of the data
      • 0 (empty): the data has not been written
      • 1 (full): the data has been written
  • Read
    – Stall the task until the FEB is full, then read the data and set the bit to empty
  • Write
    – Stall until the FEB is empty, then write the data and set the bit to full
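The read/write semantics above can be sketched with a condition variable. This is a minimal illustration of FEB blocking behavior, not the Qthreads API (and in Python rather than C, for brevity):

```python
import threading

class FEB:
    """Illustrative full/empty-bit cell: the bit starts empty."""
    def __init__(self):
        self._full = False
        self._data = None
        self._cond = threading.Condition()

    def write(self, value):
        # Stall until the bit is empty, then write and mark full.
        with self._cond:
            while self._full:
                self._cond.wait()
            self._data = value
            self._full = True
            self._cond.notify_all()

    def read(self):
        # Stall until the bit is full, then read and mark empty.
        with self._cond:
            while not self._full:
                self._cond.wait()
            value = self._data
            self._full = False
            self._cond.notify_all()
            return value
```

Note that both operations block, which is exactly the hook a task scheduler can use to preempt a waiting task instead of spinning.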

SLIDE 13

Task Scheduler

  • Each worker is associated with a single shepherd
    – Tasks are pulled from the shepherd to execute
  • Tasks can be stolen from other shepherds under certain conditions
  • Tasks are preempted when waiting on synchronization
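The pull-then-steal policy above can be sketched as follows; `Shepherd` and `next_task` are illustrative names for this toy model, not Qthreads internals:

```python
from collections import deque

class Shepherd:
    """Illustrative per-shepherd task queue (a simple deque)."""
    def __init__(self):
        self.tasks = deque()

def next_task(my_shep, all_sheps):
    # A worker pulls from its own shepherd's head first...
    if my_shep.tasks:
        return my_shep.tasks.popleft()
    # ...and otherwise tries to steal from another shepherd's tail.
    for other in all_sheps:
        if other is not my_shep and other.tasks:
            return other.tasks.pop()
    return None
```

Stealing from the tail while owners pop from the head is a common way to reduce contention between a queue's owner and its thieves.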

SLIDE 14

Overview

  • Scalable over-subscription
    – Millions of tasks can be spawned with minimal performance overhead
  • Worker idle time is reduced through task preemption at synchronization points
  • “Automatic” load balancing of tasks
  • Shared-memory environment

SLIDE 15

QMPI

SLIDE 16

Overview

  • Qthreads + MPI
    – The Qthreads light-weight task model, with communication through MPI
  • Two threads are dedicated to the communication engine
    – One for communication, one for FEB management
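The division of labor above can be mimicked with an ordinary thread and a request queue. This toy model makes no real MPI calls (`CommEngine` is a hypothetical name); it only shows the shape of a dedicated progress thread that completes requests and marks the waiting task runnable:

```python
import queue
import threading

class CommEngine:
    """Toy progress engine: tasks enqueue requests, a dedicated
    thread 'completes' them and signals the waiter through an
    Event standing in for an FEB."""
    def __init__(self):
        self.requests = queue.Queue()
        self._thread = threading.Thread(target=self._progress, daemon=True)
        self._thread.start()

    def submit(self, payload):
        done = threading.Event()          # stand-in for the FEB
        result = {}
        self.requests.put((payload, result, done))
        return result, done               # the task waits on `done`

    def _progress(self):
        while True:
            payload, result, done = self.requests.get()
            result["data"] = payload      # pretend the transfer completed
            done.set()                    # mark the waiting task runnable
```

In the real design the progress thread would drive MPI requests to completion; the point of the sketch is that worker tasks never poll, they block on a synchronization object that the engine fills.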

SLIDE 17

Communication Model

[Diagram (shown incrementally across slides 17-22): workers on shepherds within a node; an FEB queue serviced by an FEB thread and a comm queue serviced by a comm thread, coordinated through a synchronization container, with the comm thread driving the network.]

SLIDE 23

Performance

[Figure: bandwidth (MB/s) vs. message size (1 B to 4 MB), log-log scale; two panels comparing MPICH and MVAPICH.]

SLIDE 24

Target Applications

  • Not beneficial for all problems
    – Little overlap in multithreaded communication can increase runtime
  • Bulk-synchronous communication
  • Oversubscription
    – Benefits directly from Qthreads

SLIDE 25

Simple Experiment

  • 5-point stencil computation
    – Send edge values to neighbors
    – Receive edge values from neighbors
    – Compute new values
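The compute step of the experiment might look like the following sketch; the halo exchange with neighbors is omitted, and this is purely illustrative, not the benchmark code:

```python
def stencil_step(grid):
    """One 5-point stencil update on the interior of a 2-D grid
    (list of lists); boundary cells are left unchanged."""
    n, m = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            # Average a cell with its four edge neighbors.
            new[i][j] = (grid[i][j] + grid[i - 1][j] + grid[i + 1][j]
                         + grid[i][j - 1] + grid[i][j + 1]) / 5.0
    return new
```

In the distributed version, each rank would send and receive its edge rows/columns before running this update on its local block.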

SLIDE 26

Results

[Figure: execution time (usec) vs. grid size (one side: 120, 1200, 12000), log scale; three panels for the Send, Receive, and Calculation phases, each comparing MPI+Pthread and QMPI.]

SLIDE 27

SUMMARY

SLIDE 28

Conclusion

  • Large numbers of threads performing communication cause problems
  • QMPI uses a communication-engine model to decrease communication overhead
  • QMPI performs much better than traditional MPI+pthreads in many situations

SLIDE 29

On-going/Future Work

  • Test QMPI with real applications
    – MiniGhost, Lulesh, UTS, etc.
  • Message aggregation
  • Push the QMPI model into MPI as an internal feature

SLIDE 30

QUESTIONS
