Active Access: A Mechanism for High-Performance Distributed Data-Centric Computations - PowerPoint PPT Presentation



SLIDE 1

spcl.inf.ethz.ch @spcl_eth

MACIEJ BESTA, TORSTEN HOEFLER

Active Access: A Mechanism for High-Performance Distributed Data-Centric Computations

SLIDE 2

REMOTE MEMORY ACCESS (RMA) PROGRAMMING

SLIDE 11

REMOTE MEMORY ACCESS (RMA) PROGRAMMING

[Figure: processes p and q, each with its own memory (A at p, B at q), running on Cray Blue Waters. Process p issues put(A) into q's memory and get(B) from q's memory; a final flush completes both transfers.]

SLIDE 13

REMOTE MEMORY ACCESS PROGRAMMING

  • Implemented in hardware in NICs in the majority of HPC networks (RDMA)

SLIDE 18

REMOTE MEMORY ACCESS PROGRAMMING

  • Supported by many HPC libraries and languages
SLIDE 23

REMOTE MEMORY ACCESS PROGRAMMING

  • Enables significant speedups over message passing in many types of applications, e.g.:
  • Speedup of ~1.5 for communication patterns in irregular workloads
  • Speedup of ~1.4-2 in physics computations

[1] R. Gerstenberger et al. Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One-Sided. SC13.
[2] D. Petrovic et al. High-performance RMA-based broadcast on the Intel SCC. SPAA'12.

SLIDE 30

RMA VS. MESSAGE PASSING

  • Communication in RMA is one-sided

[Figure: RMA: process p puts A directly into q's memory and flushes; no active participation by q, direct access to its memory. Message Passing: p sends a message carrying A; q must post an explicit receive, with possible queueing.]

SLIDE 34

REMOTE MEMORY ACCESS PROGRAMMING

  • Is it ideal?
  • Consider an insert in a distributed hashtable...

No hash collision:
  • 1 remote atomic
  • Up to 5x speedup over MP [1]

A hash collision:
  • 4 remote atomics + 2 remote puts
  • Significant performance drops

What we want at proc q: local execution, triggered by an active access from proc p. How to enable it in RMA?
  • Use "active" semantics
  • Use and extend IOMMUs and their paging capabilities

[1] R. Gerstenberger et al. Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One-Sided. SC13.

SLIDE 35

USE SEMANTICS FROM ACTIVE MESSAGES (AM) [1]

[Figure: process p sends to process q; q holds a table mapping addresses to handlers (A's addr -> Handler A, ..., Z's addr -> Handler Z), as in GASNet [3] and AM++ [2].]

We need active puts/gets:
  • Invoke a handler upon accessing a given page
  • Preserve one-sided RMA behavior

We use AM syntax & semantics to enable the "active" behavior.

[1] T. von Eicken et al. Active messages: a mechanism for integrated communication and computation. ISCA'92.
[2] J. J. Willcock et al. AM++: A generalized active message framework. PACT'10.
[3] D. Bonachea. GASNet Specification, v1.1. Berkeley Technical Report, 2002.

SLIDE 36

USE INPUT/OUTPUT MEMORY MANAGEMENT UNITS (IOMMUS)

[Figure: the CPU's MMU/TLB translates virtual addresses to physical addresses in main memory; analogously, the IOMMU/IOTLB translates I/O devices' device addresses to physical addresses.]

We propose IOMMUs as a way to implement the "active" behavior.

SLIDE 37

IOMMUS AND RMA

[Figure: an RDMA packet arrives at the NIC and is forwarded as PCIe packets through the IOMMU (IOTLB, device-to-page-table cache, remapping structures, W/R bits) to main memory; a fault raises an MSI and deposits an entry in a single system-wide fault log, from which SMT cores run user handlers.]

We could use this machinery somehow. But...
  • Data is discarded... extremely BAD
  • No parallelism (single log)... BAD
  • No multiplexing (single log)... BAD

SLIDE 38

ACTIVE PUTS

[Figure: the design extends the IOMMU with per-process access logs (each holding fault entries plus the request data, so data can be reused), an access log table that stores the address of each access log, and new page-table bits WL and WLD plus an IUID field, which map each page to an access log and decide on keeping or discarding the entry and its data. This enables data-centric programming.]

SLIDE 39

ACTIVE PUTS

[Figure: process p writes X to a page at q marked W = 0, WL = 1, WLD = 1. The attempt to write(X) raises a page fault (W = 0); X is moved into the access log instead of the page; the CPU then runs process(X). The page itself is not modified.]

Log both the entry and the data of an incoming put.

SLIDE 40

ACTIVE GETS

[Figure: the same design as for active puts, with two additional page-table bits, RL and RLD, which control logging of reads and of the data read.]

SLIDE 41

ACTIVE GETS

[Figure: process p reads X from a page at q marked R = 1, RL = 1, RLD = 1. Reading from the page is enabled (R = 1); X is copied into the access log and then processed.]

Log both the entry and the data accessed by a get.

Sounds like we can reuse most of the existing stuff!

SLIDE 43

INTERACTIONS WITH THE CPU

[Figure: the IOMMU notifies the CPU of logged accesses via:]
  • Interrupts
  • Polling
  • Direct notifications via scratchpads (a scratchpad memory shared with a hyperthread that runs the handler)

Are we done? Well...

SLIDE 44

CONSISTENCY

  • A weak consistency model [1]
  • Consistency on-demand: active_flush(int target_id)
  • Enforces the completion of active accesses issued by the calling process and targeted at target_id
  • Implemented with an active get issued at a special flushing page

[Figure: process p issues an active get at a special flushing page of q; the IOMMU places it in q's access log behind the earlier accesses X, Y, Z.]

[1] K. Gharachorloo et al. Memory consistency and event ordering in scalable shared-memory multiprocessors. ISCA'90.

SLIDE 45

CONSISTENCY

[Figure: the IOMMU is further extended with a flushing buffer, which contains the addresses of flushing pages and maps flushing pages to IUIDs and access logs, and with a packet tag buffer.]

SLIDE 46

How can we use it?

Let’s summarize…

[Recap: Active Messages (semantics) + IOMMUs (mechanism) give us Active Puts/Gets + Consistency.]

SLIDE 47

ACTIVE ACCESS USE-CASES

DISTRIBUTED HASHTABLE

  • Used to construct key-value stores (e.g., Memcached [1])

[Figure: each process i (i = 0 .. N-1) holds local volume i, consisting of a table of elements plus an overflow heap.]

[1] B. Fitzpatrick. Distributed caching with memcached. Linux Journal, 2004.

SLIDE 48

ACTIVE ACCESS USE-CASES

DISTRIBUTED HASHTABLE: INSERTS (RMA)

[Figure: process p inserts into q's table of elements and overflow heap using remote accesses.]

SLIDE 49

ACTIVE ACCESS USE-CASES

DISTRIBUTED HASHTABLE: INSERTS (AA)

[Figure: process p issues a single active access; the insert executes at q.]

All other accesses become local.

SLIDE 50

ACTIVE ACCESS USE-CASES

VIRTUAL GLOBAL ADDRESS SPACE (V-GAS)

[Figure: machines 0 .. N-1, each with a process, NIC, MMU, IOMMU, and memory, form a V-GAS. The MMU provides local memory protection; the IOMMU provides remote memory protection and fetches data (used for logging, fault-tolerance, etc.).]

SLIDE 51

PERFORMANCE

  • Evaluation on CSCS Monte Rosa
  • 1,496 Cray XE6 compute nodes
  • 47,872 schedulable cores
  • 46 TB memory
  • 3 microbenchmarks
  • 4 use-cases

SLIDE 52

PERFORMANCE: MICROBENCHMARKS

RAW DATA TRANSFER

  • Workload simulated with gem5 [1]
  • Data generated with:
  • PktGen [2]
  • Netmap [3]

[1] N. Binkert et al. The gem5 simulator. SIGARCH Comput. Archit. News, 2011.
[2] R. Olsson. pktgen: the Linux packet generator. Linux Symposium, 2005.
[3] L. Rizzo. netmap: a novel framework for fast packet I/O. USENIX Annual Technical Conference, 2012.

SLIDE 53

PERFORMANCE: LARGE-SCALE CODES

COMPARISON TARGETS

  • Active Access: AA-Int, AA-Poll, AA-SP
  • RMA
  • Active Messages: AM, AM-Exp, AM-Onload, AM-Ints (cf. DMAPP, RoCE, Cell, PAMI, DCMF, LAPI, MX, GASNet, AM++)

SLIDE 54

PERFORMANCE: LARGE-SCALE CODES

DISTRIBUTED HASHTABLE

[Figure: insert performance for collision rates of 5% and 25%.]


SLIDE 56

ACTIVE ACCESS: CONCLUSIONS

  • Data-centric programming: addresses of pages guide the execution of handlers
  • Alleviates RMA's problems with AMs while preserving one-sided semantics
  • Extends paging capabilities in a distributed environment
  • Use-cases: hashtables, logging schemes, counters, V-GAS, checkpointing...
  • Performance: accelerates various distributed codes
  • Uses commodity & common IOMMUs

SLIDE 57

Thank you for your attention

SLIDE 58

ACTIVE ACCESS USE-CASES

ACCELERATING LOGGING FOR RMA

  • Logging: a popular mechanism for fault-tolerance.
  • Remote communication (puts/gets) is logged.
  • Upon a process crash, the process is restored and uses the logs to replay its previous actions.
  • Logs are stored in volatile memories.
SLIDE 59

ACTIVE ACCESS USE-CASES

ACCELERATING LOGGING FOR RMA

  • Logging puts:

[Figure: proc p logs the PUT and issues it; q is modified; after a crash the logged PUT can be replayed.]

SLIDE 60

ACTIVE ACCESS USE-CASES

ACCELERATING LOGGING FOR RMA

  • Logging gets (naive):

[Figure: log the GET, then attempt to replay the GET; but p is modified in the meantime. FAIL!]

SLIDE 61

ACTIVE ACCESS USE-CASES

ACCELERATING LOGGING FOR RMA

  • Logging gets (traditional) [1]:

[Figure: p is modified; the logs are fetched to replay the GET. Bandwidth wasted.]

[1] M. Besta and T. Hoefler. Fault tolerance for remote memory access programming models. HPDC'14.

SLIDE 62

ACTIVE ACCESS USE-CASES

ACCELERATING LOGGING FOR RMA

  • Logging gets (AA):

[Figure: the IOMMU logs the GET at the target; p is modified; the logs are fetched to replay the GET.]

SLIDE 63

ACTIVE ACCESS USE-CASES

INCREMENTAL CHECKPOINTING FOR RMA

[Figure: processes 1 .. k on nodes 1 .. N alternate compute phases with barriers; on failure, a global rollback restores the last checkpoint.]

SLIDE 64

COORDINATED CHECKPOINTING (MP)

[Figure: processes 1 .. k on nodes 1 .. N alternate compute phases with barriers; on failure, a global rollback restores the last checkpoint.]


SLIDE 66

PERFORMANCE: LARGE-SCALE CODES

FAULT TOLERANCE SCHEME

[Figure: results for logging gets and for sorting time.]