SLIDE 1

Service-Oriented Programming in MPI

Sarwar Alam, Humaira Kamal and Alan Wagner University of British Columbia

Network Systems Security Lab

SLIDE 2

Overview

Problem: How to provide data structures to MPI?

  • Fine-Grain MPI
  • Service-Oriented Programming
  • Performance Tuning

SLIDE 3

Issues:
  • Composition
    • Abstraction
    • Cohesion
    • Low coupling

Properties:
  • Hierarchical communication
  • Load-balancing
  • Slackness
  • Scalability

SLIDE 4

Fine-Grain MPI

SLIDE 5

MPI

  • Advantages
    • Efficient over many fabrics
    • Rich communication library
  • Disadvantages
    • Bound to OS processes
    • SPMD programming model
    • Coarse-grain
SLIDE 6

Fine-Grain MPI

Program: OS processes with co-routines (fibers)

  • Full-fledged MPI “processes”
  • Combination of OS-scheduled and user-level light-weight processes inside each OS process

(Diagram: a multicore node running one MPI process per core.)

SLIDE 7

Fine-Grain MPI

  • One model, inside and between nodes
  • Concurrent: interleaved execution within an OS process
  • Parallel: across cores on the same node and between nodes

(Diagram: Node 1 and Node 2.)

SLIDE 8

Integrated into MPICH2

SLIDE 9

System Details

SLIDE 10

Executing FG-MPI Programs

  • Example of an SPMD MPI program with 16 MPI processes, assuming two quad-core nodes.

mpiexec -nfg 2 -n 8 myprog

8 pairs of processes execute in parallel, where each pair interleaves execution.
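The command above starts 8 OS processes with 2 co-located MPI "processes" (fibers) each. A minimal Python sketch of this layout arithmetic, assuming FG-MPI's block mapping of consecutive ranks to the same OS process (the mapping and the helper name are assumptions for illustration, not the FG-MPI API):

```python
def fgmpi_layout(nfg, n):
    """Return the total rank count and, per OS process, the list of
    co-located ranks, assuming nfg consecutive ranks share one OS process."""
    total = nfg * n
    colocated = [list(range(p * nfg, (p + 1) * nfg)) for p in range(n)]
    return total, colocated

# mpiexec -nfg 2 -n 8 myprog -> 16 ranks, 8 pairs
total, colocated = fgmpi_layout(2, 8)
```

Under this assumed mapping, ranks 0 and 1 interleave inside the first OS process while the 8 OS processes run in parallel.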

SLIDE 11

Decoupled from Hardware

  • Fit the number of processes to the problem rather than to the number of cores.

mpiexec -nfg 350 -n 4 myprog

SLIDE 12

Flexibility

  • Move the boundary between light-weight, user-scheduled concurrency and processes running in parallel.

mpiexec -nfg 1000 -n 4 myprog
mpiexec -nfg 500 -n 8 myprog
mpiexec -nfg 750 -n 4 myprog : -nfg 250 -n 4 myprog

SLIDE 13

Scalability

  • Can have hundreds of thousands of MPI processes:

mpiexec -nfg 30000 -n 8 myprog

  • 100 million processes on 6500 cores:

mpiexec -nfg 16000 -n 6500 myprog

SLIDE 14

Service-Oriented Programming

  • Linked-list structure
  • Keys in sorted order
  • Similar to:
    • Distributed hash tables
    • Linda tuple spaces
SLIDE 15

Ordered Linked-List

An MPI process in the ordered list:
  • Stores one or more key values, plus the data associated with each key
  • Knows its minimum key value
  • Knows the number of items stored in the next MPI process
  • Knows the rank of the MPI process with the next larger key values
  • Links to the previous MPI process in the ordered list
  • Links to the next MPI process in the ordered list

(Diagram: example list with keys 3, 27, 43.)

SLIDE 16

Ordered Linked-List

(Diagram: list processes L0, L12, L18, L21, L28, L43, L56, L75 linked in key order, with application processes A3, A38, A45.)

SLIDE 17

Ordered Linked-List

SLIDE 18

INSERT

SLIDE 19

DELETE

SLIDE 20

FIND
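The INSERT, DELETE, and FIND slides are animations. As a rough illustration in plain Python (not the FG-MPI API; `ListProc`, `find`, and the two-process chain are hypothetical), a FIND request hops from list process to list process until it reaches the process whose key block could contain the key:

```python
class ListProc:
    """One list 'process', mirroring slide 15: a sorted block of keys
    plus the rank of the process holding the next larger keys."""
    def __init__(self, rank, keys, next_rank):
        self.rank = rank
        self.keys = sorted(keys)      # one or more key values
        self.next_rank = next_rank    # None at the tail of the list

# Toy chain matching the diagram's keys 3, 27, 43
procs = {0: ListProc(0, [3, 27], 1), 1: ListProc(1, [43], None)}

def find(start_rank, key):
    """Route a FIND along the chain, as a message would hop rank to rank."""
    rank = start_rank
    while rank is not None:
        p = procs[rank]
        if key in p.keys:
            return rank               # found: this process owns the key
        if p.keys and key < p.keys[-1]:
            return None               # key would be here if it existed
        rank = p.next_rank            # hop to the next list process
    return None
```

In FG-MPI these hops are MPI messages between (possibly co-located) processes rather than function calls.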

SLIDE 21

Ordered Linked-List

(Diagram: FIND requests F30 and F65 from application process A12 routed through the list processes.)

SLIDE 22

Shortcuts

(Diagram: a manager process M10 holds a shortcut table of key value → rank pairs and a list of free ranks; a FIND F30 from application process A12 is routed via the shortcuts to list processes.)

Local Process Ecosystem: local non-communication operations are ATOMIC.

SLIDE 23

Re-incarnation

(Diagram: the free-rank pool holds ranks 24, 30, 28; a freed process blocks in recv() and is re-incarnated as a new list process by a send() from the local process ecosystem, e.g. F24 and F28.)

Local Process Ecosystem: local non-communication operations are ATOMIC.
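The re-incarnation scheme can be caricatured as a pool of parked ranks: a rank freed by a DELETE waits blocked in recv(), and a later SPLIT wakes one to serve new keys. A hedged Python sketch (the `RankPool` class and its method names are invented for illustration, not FG-MPI code):

```python
class RankPool:
    """Pool of idle list-process ranks, each blocked in a recv()."""
    def __init__(self, free):
        self.free = list(free)        # e.g. ranks freed by earlier DELETEs

    def reincarnate(self):
        """A SPLIT sends a message to an idle rank, waking it as a
        new list process; here we just hand back the rank."""
        return self.free.pop()

    def retire(self, rank):
        """DELETE of a process's last item returns its rank to the pool."""
        self.free.append(rank)
```

Because a fiber blocked in recv() consumes no CPU, keeping a pool of parked ranks is cheap; re-incarnation avoids creating and destroying MPI processes.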

SLIDE 24

Granularity

  • Added the ability for each process to manage a collection of consecutive items.
  • Changes to INSERT: it can turn into a SPLIT operation.
  • Changes to DELETE: on delete of the last item.
  • List traversal consists of:
    • Jumping between processes
    • Jumping between co-located processes
    • Searching inside a process
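A small sketch of the granularity rule in plain Python (`insert_local` and the choice of splitting at the midpoint are assumptions for illustration, not the authors' code): when a process's block of keys grows past G, INSERT becomes a SPLIT that hands the upper half to a re-incarnated free rank.

```python
from bisect import insort

G = 4  # granularity: maximum number of keys per list process (tuning knob)

def insert_local(keys, key):
    """Insert into one process's sorted key block.
    Returns (kept_block, split_block); split_block is None unless the
    block exceeded G, in which case the upper half would be handed to
    a newly re-incarnated list process."""
    insort(keys, key)                 # "search inside a process" + insert
    if len(keys) > G:
        mid = len(keys) // 2
        return keys[:mid], keys[mid:]
    return keys, None
```

Larger G means more searching inside a process and fewer hops between processes; smaller G does the opposite, which is the trade-off slide 28 measures.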
SLIDE 25

Properties

  • Totally ordered: operations are ordered by the order in which they arrive at the root.
  • Sequentially consistent: each application process keeps a hold-back queue to return results in order.
  • No consistency: operations can occur in any order.
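The hold-back queue behind the sequentially consistent mode can be sketched as follows (hypothetical Python, assuming each request carries a sequence number; the class and method names are invented for illustration):

```python
import heapq

class HoldBack:
    """Buffer replies that arrive out of order; release them in
    sequence order so the application sees results in issue order."""
    def __init__(self):
        self.next_seq = 0             # next sequence number to release
        self.heap = []                # (seq, result) pairs held back

    def deliver(self, seq, result):
        """Accept one reply; return every result now releasable in order."""
        heapq.heappush(self.heap, (seq, result))
        out = []
        while self.heap and self.heap[0][0] == self.next_seq:
            out.append(heapq.heappop(self.heap)[1])
            self.next_seq += 1
        return out
```

With no consistency, replies would be handed to the application as they arrive and no such buffer is needed.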

SLIDE 26

Performance Tuning

  • G (granularity): the number of keys stored in each process.
  • K (asynchrony): the number of messages in the channel between list processes.
  • W (workload): the number of outstanding operations.
SLIDE 27

Steady-State Throughput

(Chart: 16,000 operations/sec vs. 5,793 operations/sec; fixed list size, evenly distributed over O × M cores.)

SLIDE 28

Granularity (G)

Fixed-size machine (176 cores), fixed list size (2^20)

(Chart: moving work from INSIDE a process to BETWEEN processes; Sequentially Consistent vs. No-consistency; 10X larger.)

SLIDE 29

W and K

W: number of outstanding requests (workload)
K: degree of asynchrony

SLIDE 30

Conclusions

  • Reduced coupling and increased cohesion
  • Scalability within clusters of multicore nodes
  • Performance-tuning controls
  • Adapts to hierarchical network fabrics
  • Distributed-systems properties pertaining to consistency

SLIDE 31

Thank-You