Design of MPI Passive Target Synchronization for a Non-Cache-Coherent Many-Core Processor

27th PARS Workshop, Hagen, Germany, May 5, 2017

Steffen Christgau, Bettina Schnor

Operating Systems and Distributed Systems Institute for Computer Science University of Potsdam, Germany


PARS 2017

  • S. Christgau (U Potsdam): MPI Passive Target Synchronization

Motivation: Distributed Hash Table (DHT)

  • hash table as cache for computational results in MPI application
  • large amount of data → distribute across processes → DHT

[Figure: the DHT is partitioned across ranks 0 … n − 1; each rank holds one local DHT part.]

  • accessing distributed data:

      - hash function returns arbitrary process and address
      - difficult to program with two-sided message passing
      - MPI passive target one-sided communication to the rescue
      - synchronization required
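The mapping from a key to "arbitrary process and address" can be sketched as follows. This is a minimal illustration, not the authors' code: the FNV-1a hash, the `dht_location` type, and the `dht_locate` function are assumptions chosen for the example, and the slides do not say which hash function or table layout is used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: which process owns an entry and where it lives. */
typedef struct {
    int    rank;   /* target process */
    size_t slot;   /* index into that process' local DHT part */
} dht_location;

/* 64-bit FNV-1a hash over an arbitrary byte string (assumed hash function). */
uint64_t fnv1a64(const void *key, size_t len)
{
    const unsigned char *p = key;
    uint64_t h = 0xcbf29ce484222325ULL;     /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;              /* FNV prime */
    }
    return h;
}

/* Map a key to (rank, slot): the remainder selects the rank,
 * the quotient selects the slot inside that rank's part. */
dht_location dht_locate(const void *key, size_t len,
                        int nprocs, size_t slots_per_rank)
{
    assert(nprocs > 0 && slots_per_rank > 0);
    uint64_t h = fnv1a64(key, len);
    dht_location loc;
    loc.rank = (int)(h % (uint64_t)nprocs);
    loc.slot = (size_t)((h / (uint64_t)nprocs) % slots_per_rank);
    return loc;
}
```

Because any process may compute any `(rank, slot)` pair at any time, the owner cannot know in advance when to post a matching receive, which is why two-sided message passing is awkward here.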

Motivation: nCC Systems

  • Future many-cores may not provide (global) cache coherence.

      - Intel Knights Landing: coherent multi-socket systems not feasible

https://www.extremetech.com/wp-content/uploads/2016/04/KnightsLanding.png

      - HPE "The Machine", EuroServer: coherence islands

https://regmedia.co.uk/2016/11/22/the_machine_universal_memory_pool_access.jpg

Research Platform

  • nCC many-core research system: Intel SCC

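      - 48 Pentium cores with L1 and L2 caches
      - no HW cache coherence

[Figure: SCC block diagram with four memory controllers (MC 0 to MC 3); each tile holds two cores with their L2 caches, a mesh interface unit (MIU), a message passing buffer (MPB), and a router (R).]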

  • This talk: design of synchronization on nCC platform.

Agenda

  • MPI Passive Target One-Sided Communication
  • Design for Passive Target Synchronization on the SCC
  • Data Structures and Algorithms
  • Data Placement
  • Outlook and Future Work

MPI One-Sided Communication

  • process memory exposed via windows
  • access to windows with window object (handle)

[Figure: each rank 0 … n − 1 exposes its local DHT part, located in the process' address space, as a window; each rank holds a window object for access.]

  • key concept: only one communication partner issues communication operations
      - origin processes issue communication operations
      - target processes are addressed by operations
      - typical RMA operations: PUT, GET, ...
      - explicit synchronization required

MPI Passive Target Synchronization

  • locks as means for synchronization, used by origins only
  • no participation of targets in synchronization (passive targets)
  • usage similar to shared memory locks
      1. acquire lock for target window: WIN_LOCK(win, rank, ...)
      2. perform operations: PUT(win, rank, ...)
      3. release lock: WIN_UNLOCK(win, rank)

  • MPI defines two lock types:
      - shared: concurrent accesses on target window allowed
      - exclusive: prevent concurrent accesses on same target window
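The difference between the two lock types can be made concrete with a small compatibility check. This is a sketch under assumptions: `lock_type`, `window_lock_state`, and `lock_compatible` are hypothetical names for illustration; in the MPI C API the two types are selected with the constants `MPI_LOCK_SHARED` and `MPI_LOCK_EXCLUSIVE`.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for MPI_LOCK_SHARED / MPI_LOCK_EXCLUSIVE. */
typedef enum { LOCK_SHARED, LOCK_EXCLUSIVE } lock_type;

/* Hypothetical bookkeeping for the lock on one target window. */
typedef struct {
    int  shared_holders;   /* origins currently holding a shared lock */
    bool exclusive_held;   /* true while one origin holds the exclusive lock */
} window_lock_state;

/* Can a new request be granted alongside the locks already held? */
bool lock_compatible(const window_lock_state *s, lock_type requested)
{
    if (s->exclusive_held)
        return false;                     /* exclusive excludes everyone */
    if (requested == LOCK_EXCLUSIVE)
        return s->shared_holders == 0;    /* needs the window to itself */
    return true;                          /* shared locks may coexist */
}
```

The check only states which combinations may overlap; the order in which waiting origins are served is left open by MPI and is what the design in this talk fixes.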

Distributed Hash Table with MPI OSC

[Figure: each rank 0 … n − 1 exposes its local DHT part (in the process' address space) as a window and accesses it through a window object.]

DHT_read

    LOCK(window_obj, target, SHARED)
    GET(window_obj, target, &data)
    UNLOCK(window_obj, target)

DHT_write

    LOCK(window_obj, target, EXCLUSIVE)
    PUT(window_obj, target, data)
    UNLOCK(window_obj, target)

In the MPI C API, these steps correspond to MPI_Win_lock (with MPI_LOCK_SHARED or MPI_LOCK_EXCLUSIVE), MPI_Get or MPI_Put, and MPI_Win_unlock.

Synchronization for the DHT

  • observation: high latency for synchronization in SCC’s MPI

      - previous work (PASA 2016): 5x lower latency with shared memory and uncached accesses instead of messages
      - synchronization semantics undefined by MPI: "much freedom for implementors"

  • assumption: (far) more DHT reads than writes

      - Readers & Writers synchronization (Courtois et al.) advantageous
      - writer precedence → recent data for readers

→ design of MPI passive target synchronization scheme with R&W semantics for SCC

Data Structures for Synchronization

use lock data structure as proposed by Mellor-Crummey/Scott (’91)

  • distributed lists of waiting readers and writers

      - no centralized object to spin on (avoid memory contention)
      - instead: per-process list entry for spinning

  • state variable: counts active/interested readers/writers
  • one lock variable per process and window

[Figure: shared memory holds one window lock per rank (L0, L1, L2 for ranks 0, 1, 2), each consisting of a writer queue, a reader queue, and a state variable; the queued per-process list entries carry a blocked flag (shown as blocked = 0, blocked = 1, blocked = 1).]
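The idea of a distributed waiting list with per-process spinning can be sketched with the plain mutual-exclusion queue lock by Mellor-Crummey and Scott. This is a simplified illustration, not the reader/writer variant the slides refer to: it has a single queue and no state variable. The names `qnode`, `queue_lock`, `lock_enqueue`, `lock_wait`, and `lock_release` are assumptions, and C11 atomics on a cache-coherent host stand in for the uncached shared-memory accesses an SCC implementation would need.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Per-process list entry: the only object its owner ever spins on. */
typedef struct qnode {
    _Atomic(struct qnode *) next;
    atomic_int blocked;
} qnode;

/* The lock itself is just a pointer to the tail of the waiting list. */
typedef struct {
    _Atomic(qnode *) tail;
} queue_lock;

/* Append own entry; returns true if the caller has to wait. */
bool lock_enqueue(queue_lock *l, qnode *me)
{
    atomic_store(&me->next, NULL);
    atomic_store(&me->blocked, 1);
    qnode *pred = atomic_exchange(&l->tail, me);
    if (pred == NULL) {                 /* list was empty: lock acquired */
        atomic_store(&me->blocked, 0);
        return false;
    }
    atomic_store(&pred->next, me);      /* link behind the predecessor */
    return true;
}

/* Spin on the own entry only, never on the shared lock variable. */
void lock_wait(qnode *me)
{
    while (atomic_load(&me->blocked)) { /* busy wait */ }
}

/* Hand the lock to the successor, or mark the list empty. */
void lock_release(queue_lock *l, qnode *me)
{
    qnode *succ = atomic_load(&me->next);
    if (succ == NULL) {
        qnode *expected = me;
        if (atomic_compare_exchange_strong(&l->tail, &expected, NULL))
            return;                     /* nobody was waiting */
        /* a successor is in the middle of linking itself: wait for it */
        while ((succ = atomic_load(&me->next)) == NULL) { }
    }
    atomic_store(&succ->blocked, 0);    /* wake exactly one waiter */
}
```

A waiting process calls `lock_enqueue` and, if it returns true, `lock_wait`. Each waiter polls only its own `blocked` flag, so the lock variable sees one access per acquire and one per release instead of continuous polling.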

Synchronization Operations

  • according to Mellor-Crummey/Scott
  • processes enter either list of readers or writers

Readers
      - start_read: blocks as long as writers are active or waiting, allows multiple active readers
      - end_read: wake first waiting writer if no active reader left

Writers
      - start_write: blocks when readers are active
      - end_write: wake up next writer (if any) or all waiting readers
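The four rules above can be traced with a small sequential bookkeeping model. It is only a sketch of the policy, assuming a single thread that records who would be active or waiting; it is not the concurrent queue-based algorithm itself. The type `rw_model` and the `model_*` functions are hypothetical names, and the model additionally assumes that a writer waits while another writer is active.

```c
#include <stdbool.h>

/* Hypothetical counters describing one window lock. */
typedef struct {
    int  active_readers;
    bool writer_active;
    int  waiting_readers;
    int  waiting_writers;
} rw_model;

/* Returns true if the reader may proceed, false if it has to wait. */
bool model_start_read(rw_model *m)
{
    if (m->writer_active || m->waiting_writers > 0) {
        m->waiting_readers++;           /* writers take precedence */
        return false;
    }
    m->active_readers++;                /* several readers may be active */
    return true;
}

void model_end_read(rw_model *m)
{
    m->active_readers--;
    if (m->active_readers == 0 && m->waiting_writers > 0) {
        m->waiting_writers--;           /* wake the first waiting writer */
        m->writer_active = true;
    }
}

/* Returns true if the writer may proceed, false if it has to wait. */
bool model_start_write(rw_model *m)
{
    if (m->active_readers > 0 || m->writer_active) {
        m->waiting_writers++;
        return false;
    }
    m->writer_active = true;
    return true;
}

void model_end_write(rw_model *m)
{
    if (m->waiting_writers > 0) {
        m->waiting_writers--;           /* next writer takes over directly */
        return;
    }
    m->writer_active = false;
    m->active_readers += m->waiting_readers;   /* wake all waiting readers */
    m->waiting_readers = 0;
}
```

With two active readers, a newly arriving writer waits, and a reader arriving after that writer also waits. This is the writer precedence that gives later readers recent data.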

R&W Synchronization inside MPI Library

MPI_Win_lock(type, target_rank, win_obj) {
    /* per-process list entry used for queuing and local spinning */
    entry = alloc_list_entry();
    win_obj.entry[target_rank] = entry;
    win_obj.entry[target_rank].lock_type = type;

    /* shared locks map to readers, exclusive locks to writers */
    if (type == SHARED)
        start_read(win_obj.lock[target_rank], entry);
    else
        start_write(win_obj.lock[target_rank], entry);
}

The unlock operation is straightforward.
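The slides do not show the unlock path. One plausible shape, mirroring the lock pseudocode, is sketched below. Everything here is an assumption for illustration: the types, the table size `MAX_RANKS`, the function name `win_unlock`, and the stub bodies of `end_read` and `end_write`, which only count calls so that the dispatch can be observed.

```c
#include <stdlib.h>

#define MAX_RANKS 48   /* assumption: one slot per SCC core */

typedef enum { WIN_LOCK_SHARED, WIN_LOCK_EXCLUSIVE } win_lock_type;

/* Per-process list entry remembered by the lock call. */
typedef struct { win_lock_type lock_type; int blocked; } list_entry;

/* Stand-in for the per-target R&W lock; it only counts calls here. */
typedef struct { int reads_ended; int writes_ended; } rw_lock;

typedef struct {
    rw_lock     lock[MAX_RANKS];
    list_entry *entry[MAX_RANKS];
} window_object;

/* Placeholders for the real end_read / end_write operations. */
void end_read(rw_lock *l, list_entry *e)  { (void)e; l->reads_ended++; }
void end_write(rw_lock *l, list_entry *e) { (void)e; l->writes_ended++; }

/* Release the lock held on target_rank; returns -1 if none is held. */
int win_unlock(int target_rank, window_object *win)
{
    list_entry *entry = win->entry[target_rank];
    if (entry == NULL)
        return -1;

    /* the lock type stored at lock time selects the release operation */
    if (entry->lock_type == WIN_LOCK_SHARED)
        end_read(&win->lock[target_rank], entry);
    else
        end_write(&win->lock[target_rank], entry);

    free(entry);
    win->entry[target_rank] = NULL;
    return 0;
}
```

Storing the lock type in the list entry at lock time is what lets the unlock call keep the short `(win, rank)` interface shown earlier.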

Data Placement

synchronization data located in shared memory

  • danger of contention on memory interface
  • speedup of memory-bound application with different synchronization data locations:

[Figure: speedup (axis ticks 16, 32, 48) over the number of MPI processes (4 to 48), with one curve each for synchronization data on memory controller 0, 1, 2, or 3 and one for distributed placement.]

  • bring spinning object close to process/core → allocate list entry in closest memory controller → local uncached spinning
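Choosing the closest memory controller for a list entry could look like the sketch below. It assumes a 6x4 tile mesh and picks the controller with the smallest Manhattan distance. The controller coordinates in `controller_pos` are illustrative placeholders, not taken from the slides or from SCC documentation, and `mesh_pos` and `closest_controller` are hypothetical names.

```c
#include <stdlib.h>

#define NUM_CONTROLLERS 4

/* Tile coordinates in the on-chip mesh (x: 0..5, y: 0..3). */
typedef struct { int x, y; } mesh_pos;

/* ASSUMPTION: placeholder attachment points of MC 0 to MC 3. */
static const mesh_pos controller_pos[NUM_CONTROLLERS] = {
    {0, 0}, {5, 0}, {0, 3}, {5, 3}
};

/* Manhattan distance as a simple proxy for mesh hops. */
static int mesh_distance(mesh_pos a, mesh_pos b)
{
    return abs(a.x - b.x) + abs(a.y - b.y);
}

/* Index of the controller nearest to the given tile;
 * ties go to the lower controller index. */
int closest_controller(mesh_pos tile)
{
    int best = 0;
    int best_dist = mesh_distance(tile, controller_pos[0]);
    for (int mc = 1; mc < NUM_CONTROLLERS; mc++) {
        int d = mesh_distance(tile, controller_pos[mc]);
        if (d < best_dist) {
            best = mc;
            best_dist = d;
        }
    }
    return best;
}
```

An allocator for list entries would call `closest_controller` once per process and place that process' entry in memory served by the returned controller, so that spinning traffic stays near the spinning core.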

Discussion

design characteristics:

  • concurrent window access: one lock per window and process
  • per-window Readers & Writers semantics
  • contention avoidance: spin on local object only
  • truly passive: no participation of the remote process in synchronization operations and communication

Christgau, Schnor: Exploring One-Sided Communication and Synchronization on a non-Cache-Coherent Many-Core Architecture. Concurrency and Computation: Practice and Experience. 2017

Summary and Outlook

Summary

  • presented design for implementing MPI passive target synchronization on nCC many-core

  • applied concepts from Mellor-Crummey/Scott to nCC processor
  • distributed data structures critical

Future Work

  • implement the presented scheme
  • evaluate performance by comparison against message-based implementation and other designs

Questions!?
