spcl.inf.ethz.ch @spcl_eth
MACIEJ BESTA, TORSTEN HOEFLER
Active Access: A Mechanism for High-Performance Distributed Data-Centric Computations
REMOTE MEMORY ACCESS (RMA) PROGRAMMING
Figure (built up over several slides): processes p and q on Cray BlueWaters, each with its own memory; p's memory holds A and q's memory holds B. p puts A into q's memory, gets B from q's memory, and then flushes.
networks (RDMA)
many types of applications, e.g.:
workloads
[1] R. Gerstenberger et al. Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One-Sided. SC13
[2] D. Petrovic et al., High-performance RMA-based broadcast on the Intel SCC. SPAA’12
RMA: process p puts A into the memory of process q, then flushes; no active participation, direct access to memory
Message Passing: process p sends a message carrying A to process q; explicit receive, possible queueing
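The contrast between the two models can be made concrete with a small sketch. It is a toy model, not an implementation of either model: `Process`, `rma_put`, `mp_send`, and `mp_recv` are hypothetical names chosen for illustration, and the inbox is an assumed stand-in for the queueing mentioned on the slide.

```python
from collections import deque


class Process:
    def __init__(self, name):
        self.name = name
        self.memory = {}
        self.inbox = deque()  # used only by message passing


def rma_put(target, key, value):
    """One-sided: the origin writes directly; the target runs no code."""
    target.memory[key] = value


def mp_send(target, key, value):
    """Two-sided, step 1: the message is queued at the target."""
    target.inbox.append((key, value))


def mp_recv(proc):
    """Two-sided, step 2: the target explicitly receives and stores the data."""
    key, value = proc.inbox.popleft()
    proc.memory[key] = value


q_rma = Process("q")
rma_put(q_rma, "A", 1)   # A is in q's memory immediately

q_mp = Process("q")
mp_send(q_mp, "A", 1)    # A waits in the queue ...
mp_recv(q_mp)            # ... until q actively receives it
```

In the message passing half, nothing reaches `memory` until the target calls `mp_recv`; in the RMA half, the target never participates.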
distributed hashtable...
No hash collision: 1 remote atomic. Up to 5x speedup over MP [1]
A hash collision: 4 remote atomics + 2 remote puts. Significant performance drops
[1] R. Gerstenberger et al. Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One-Sided. SC13
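A short worked calculation shows why collisions hurt. It reads the slide as: an insert without a collision costs 1 remote atomic, and an insert with a collision costs 4 remote atomics plus 2 remote puts. As a simplifying assumption (not stated in the deck), every remote operation is counted as one unit regardless of type; the function name is hypothetical.

```python
NO_COLLISION_OPS = 1    # 1 remote atomic
COLLISION_OPS = 4 + 2   # 4 remote atomics + 2 remote puts


def expected_remote_ops(collision_rate):
    """Expected number of remote operations per insert (all ops weighted equally)."""
    if not 0.0 <= collision_rate <= 1.0:
        raise ValueError("collision_rate must be in [0, 1]")
    return (1 - collision_rate) * NO_COLLISION_OPS + collision_rate * COLLISION_OPS


for rate in (0.05, 0.25):
    print(f"collisions {rate:.0%}: {expected_remote_ops(rate):.2f} remote ops per insert")
```

Under this equal-weight assumption, a 5% collision rate gives about 1.25 remote operations per insert and a 25% rate gives about 2.25, so the average cost grows by five operations for every unit of collision probability.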
Figure: Proc p and Proc q
Local execution, triggered by an active access. In RMA?
Use “active” semantics
Use and extend I/O MMUs and their paging capabilities
Figure: processes p and q; memory holds Handler A ... Handler Z together with their addresses (A’s addr, ..., Z’s addr)
GASNet [3], AM++ [2]
[1] T. von Eicken et al. Active messages: a mechanism for integrated communication and computation. ISCA’92.
[2] J. J. Willcock et al. AM++: A generalized active message framework. PACT’10.
[3] D. Bonachea, GASNet Specification, v1.1. Berkeley Technical Report. 2002.
We need active puts/gets:
accessing a given page
behavior
We use it in syntax & semantics to enable the “active” behavior
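Figure: the CPU reaches main memory through the MMU (with its TLB), which translates virtual addresses to physical addresses; I/O devices reach main memory through the IOMMU (with its IOTLB), which translates device addresses to physical addresses.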
We propose it as a way to implement the “active” behavior
Figure: an RDMA packet arrives at the NIC and is passed on as PCIe packets to the IOMMU (IOTLB, Dev-to-PT cache). Remapping structures in main memory: Dev-to-PT and PT (with W and R bits). Faults are recorded as fault entries in a system-wide fault log, and the IOMMU signals the CPU (SMT cores) with an MSI; user handlers (Handler A) are also shown. Steps are numbered 1 to 12.
We could use it somehow. But…
Data is discarded... Extremely BAD
No parallelism (single log)... BAD
No multiplexing (single log)... BAD
Figure: the same data path with additions (marked +):
Access log (private for each process), with fault entries and request data: data can be reused
Access log table: stores addresses of each access log
IUID: maps each page to an access log; enables data-centric programming
WL, WLD: decide on keeping/discarding the entry/data
Figure: processes p and q, IOMMU, CPU, main memory; the accessed page has W = 0, WL = 1, WLD = 1
W = 0: do not modify the page
WL = 1, WLD = 1: log both the entry and the data of an incoming put
Steps (numbered 1 to 5 in the figure): attempt to write(X); page fault! (W = 0); move(X) to the access log; process(X)
Figure: the same extended data path, with RL and RLD added alongside WL, WLD, and IUID.
Figure: processes p and q, IOMMU, CPU, main memory; the accessed page has R = 1, RL = 1, RLD = 1
R = 1: enable reading from the page
RL = 1, RLD = 1: log both the entry and the data accessed by a get
Steps (numbered 1 to 4 in the figure) include copy(X) to the access log and process(X)
Sounds like we can reuse most stuff!
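The put and get flows described in the deck can be sketched as a toy model of the page bits and the access log. Everything here is illustrative: `Page`, `ToyIOMMU`, and `run_handlers` are hypothetical names, the log entry layout is invented for the sketch, and the behavior for bit combinations the deck does not show (for example W = 1 with WL = 1, or R = 0) is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    data: dict = field(default_factory=dict)
    W: int = 1    # writes may modify the page
    R: int = 1    # reads may access the page
    WL: int = 0   # log the entry of an incoming put
    WLD: int = 0  # also log the data of an incoming put
    RL: int = 0   # log the entry of an incoming get
    RLD: int = 0  # also log the data accessed by a get


@dataclass
class ToyIOMMU:
    access_log: list = field(default_factory=list)

    def put(self, page, key, value):
        if page.W:                       # W = 0: do not modify the page
            page.data[key] = value
        if page.WL:
            entry = {"op": "put", "key": key}
            if page.WLD:
                entry["data"] = value    # the data is moved to the log
            self.access_log.append(entry)

    def get(self, page, key):
        # Assumption: a get on a page with R = 0 returns nothing.
        value = page.data.get(key) if page.R else None
        if page.RL:
            entry = {"op": "get", "key": key}
            if page.RLD:
                entry["data"] = value    # the data is copied to the log
            self.access_log.append(entry)
        return value


def run_handlers(iommu, handler):
    """CPU side: process every logged access, then empty the log."""
    results = [handler(entry) for entry in iommu.access_log]
    iommu.access_log.clear()
    return results


page = Page(data={"X": 7}, W=0, WL=1, WLD=1, R=1, RL=1, RLD=1)
iommu = ToyIOMMU()
iommu.put(page, "X", 42)        # page keeps X = 7; the put lands in the log
fetched = iommu.get(page, "X")  # returns 7 and logs a copy
seen = run_handlers(iommu, lambda entry: entry["data"])
```

The design point this illustrates is that the same log mechanism serves both directions: a put with W = 0 diverts its data into the log, while a get with R = 1 still returns the data and leaves a copy behind for the handler.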
Figure: IOMMU (IOTLB, Dev-to-PT cache, access log table) and CPU (SMT cores), linked by MSI. Elements shown with + marks: scratchpad memory, Handler A, hyper thread, Var.
Well…
process and targeted at target_id
Figure: processes p and q, each with memory holding X, Y, Z; an active get; at the IOMMU, a flushing page and the access log
[1] K. Gharachorloo et al. Memory consistency and event ordering in scalable shared-memory multiprocessors. ISCA '90
Figure: IOMMU (IOTLB, Dev-to-PT cache, access log table) and CPU (SMT cores, hyper thread, scratchpad memory with Handler A ...), linked by MSI. Additions (marked +):
Flushing buffer: maps flushing pages to IUIDs and access logs
Packet tag buffer
Contains the addresses
Let’s summarize…
Active Puts/Gets, Consistency, Active Messages, IOMMUs
DISTRIBUTED HASHTABLE
Local volume 0 (at process 0): Table of elements, Overflow heap
Local volume 1 (at process 1): Table of elements, Overflow heap
Local volume N-1 (at process N-1): Table of elements, Overflow heap
[1] B. Fitzpatrick. Distributed caching with memcached. Linux journal, 2004.
DISTRIBUTED HASHTABLE: INSERTS (RMA)
Figure: Proc p inserts into the local volume at Proc q (Table of elements, Overflow heap)
DISTRIBUTED HASHTABLE: INSERTS (AA)
Figure: Proc p inserts into the local volume at Proc q (Table of elements, Overflow heap)
All other accesses become local
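One way to picture an insert where all other accesses become local is the toy model below. It is a sketch under assumptions, not the authors' implementation: the `LocalVolume` class and `active_insert` function are hypothetical, the access log is modeled as a plain list, and resolving a collision by appending to the overflow heap is an assumed policy.

```python
class LocalVolume:
    """Toy local volume: a fixed-size table of elements plus an overflow heap."""

    def __init__(self, slots):
        self.table = [None] * slots
        self.overflow = []
        self.access_log = []  # filled by incoming active puts

    def handle_logged_puts(self):
        """Runs at the owner: every step here is a local memory access."""
        for key, value in self.access_log:
            slot = hash(key) % len(self.table)
            if self.table[slot] is None:
                self.table[slot] = (key, value)
            else:
                # Collision: resolved locally, with no further remote traffic.
                self.overflow.append((key, value))
        self.access_log.clear()


def active_insert(volume, key, value):
    """Issued by the remote process: a single active put, logged at the owner."""
    volume.access_log.append((key, value))
    return 1  # number of remote operations issued


volume_q = LocalVolume(slots=4)
remote_ops = active_insert(volume_q, 1, "a")
remote_ops += active_insert(volume_q, 5, "b")  # 5 % 4 == 1, so this collides
volume_q.handle_logged_puts()
```

In this model the inserting process issues one remote operation per insert whether or not a collision occurs; the collision handling cost moves to the handler at the owner.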
VIRTUAL GLOBAL ADDRESS SPACE (V-GAS)
Machine 0, Machine 1, ..., Machine N-1: each with a process (Proc 0, Proc 1, ..., Proc N-1), an MMU, an IOMMU, a NIC, and memory
V-GAS
MMU: local memory protection
IOMMU: remote memory protection
Fetch data (used for logging, fault-tolerance, etc…)
RAW DATA TRANSFER
[1] N. Binkert et al. The gem5 simulator. SIGARCH Comput. Archit. News. 2011
[2] R. Olsson. PktGen the linux packet generator. Linux Symposium. 2005
[3] L. Rizzo. netmap: A novel framework for fast packet i/o. USENIX Annual Technical Conference. 2012
COMPARISON TARGETS
Active Access
Active Messages
DISTRIBUTED HASHTABLE
Figure panels: Collisions: 5%; Collisions: 25%
Active Access
Data-centric programming
Extends paging capabilities in a distributed environment
Alleviates RMA’s problems with AMs while preserving one-sided semantics
Addresses of pages guide the execution of handlers
Hashtables, logging schemes, counters, V-GAS, checkpointing...
Performance
Accelerates various distributed codes
Uses commodity & common IOMMUs
ACCELERATING LOGGING FOR RMA
replay its previous actions.
Proc p, Proc q
Log the PUT; replay the PUT
q is modified
Proc p, Proc q
Log the GET; attempt to replay the GET
p is modified
FAIL!
Proc p, Proc q
p is modified
Fetch the logs; replay the GET
Bandwidth wasted
[1] M. Besta and T. Hoefler. Fault tolerance for remote memory access programming models. HPDC’14.
Proc p, Proc q
Log the GET
p is modified
Fetch the logs; replay the GET
IOMMU
INCREMENTAL CHECKPOINTING FOR RMA
Figure (built up over several slides): timeline for Proc 1 to Proc k on Node 1 to Node N, showing compute phases separated by barriers, and a global rollback.
FAULT TOLERANCE SCHEME
Figure panels: Logging gets; Sorting time