Active Access: A Mechanism for High-Performance Distributed Data-Centric Computations

  1. spcl.inf.ethz.ch @spcl_eth Active Access: A Mechanism for High-Performance Distributed Data-Centric Computations. Maciej Besta, Torsten Hoefler

  2. Remote Memory Access (RMA) Programming

  3. Remote Memory Access (RMA) Programming. [Diagram: process p and its memory, holding A]

  4. Remote Memory Access (RMA) Programming. [Diagram: processes p and q, each with its own memory; the memories hold A and B]

  5. Remote Memory Access (RMA) Programming. [Diagram: processes p and q, each with its own memory holding A and B; labeled Cray BlueWaters]

  9. Remote Memory Access (RMA) Programming. [Diagram: processes p and q with memories holding A and B; a put copies A into the other memory; labeled Cray BlueWaters]

  10. Remote Memory Access (RMA) Programming. [Diagram: processes p and q with memories holding A and B; a put copies A and a get copies B; labeled Cray BlueWaters]

  11. Remote Memory Access (RMA) Programming. [Diagram: processes p and q with memories holding A and B; a put of A, a get of B, and a flush; labeled Cray BlueWaters]

  12. Remote Memory Access (RMA) Programming. [Diagram: final state after the put of A, the get of B, and the flush: both memories hold A and B; labeled Cray BlueWaters]
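
The put/get/flush sequence on slides 9-12 can be sketched as a toy model. This is an illustration only, not a real RMA library: the `Process` and `Origin` classes are hypothetical, and the model assumes that p is the origin that puts A into q's memory and gets B from it, which the flattened diagram does not state explicitly.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """A process that exposes its local memory for remote access."""
    name: str
    memory: dict = field(default_factory=dict)


class Origin:
    """Origin-side view: operations are queued and complete only at flush()."""

    def __init__(self, local: Process, target: Process):
        self.local = local
        self.target = target
        self.pending = []  # operations issued but not yet completed

    def put(self, key):
        # Simplification: snapshot the value when the put is issued.
        self.pending.append(("put", key, self.local.memory[key]))

    def get(self, key):
        # The local copy becomes valid only after flush().
        self.pending.append(("get", key, None))

    def flush(self):
        # Complete all outstanding operations; the target runs no code.
        for op, key, value in self.pending:
            if op == "put":
                self.target.memory[key] = value
            else:
                self.local.memory[key] = self.target.memory[key]
        self.pending.clear()


p = Process("p", {"A": 1})
q = Process("q", {"B": 2})
link = Origin(local=p, target=q)
link.put("A")
link.get("B")
before = (dict(p.memory), dict(q.memory))  # nothing is visible yet
link.flush()
after = (dict(p.memory), dict(q.memory))   # both memories now hold A and B
```

Only the origin calls anything here; q never executes a receive, which is the property the following slides contrast with message passing.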

  13. Remote Memory Access Programming
      - Implemented in hardware in NICs in the majority of HPC networks (RDMA)

  18. Remote Memory Access Programming
      - Supported by many HPC libraries and languages

  21. Remote Memory Access Programming
      - Enables significant speedups over message passing in many types of applications, e.g.:
      [1] R. Gerstenberger et al. Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One-Sided. SC13
      [2] D. Petrovic et al. High-performance RMA-based broadcast on the Intel SCC. SPAA’12

  22. Remote Memory Access Programming
      - Enables significant speedups over message passing in many types of applications, e.g.:
        - Speedup of ~1.5 for communication patterns in irregular workloads
      [1] R. Gerstenberger et al. Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One-Sided. SC13
      [2] D. Petrovic et al. High-performance RMA-based broadcast on the Intel SCC. SPAA’12

  23. Remote Memory Access Programming
      - Enables significant speedups over message passing in many types of applications, e.g.:
        - Speedup of ~1.5 for communication patterns in irregular workloads
        - Speedup of ~1.4-2 in physics computations
      [1] R. Gerstenberger et al. Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One-Sided. SC13
      [2] D. Petrovic et al. High-performance RMA-based broadcast on the Intel SCC. SPAA’12

  24. RMA vs. Message Passing. [Diagram, RMA: a put of A between the memories of processes p and q, followed by a flush]

  25. RMA vs. Message Passing. [Diagram, RMA: a put of A between the memories of processes p and q, followed by a flush. Message Passing: label only]

  26. RMA vs. Message Passing. [Diagram, RMA: a put of A between the memories of processes p and q, followed by a flush. Message Passing: A travels between processes p and q as a message]

  27. RMA vs. Message Passing
      - Communication in RMA is one-sided
      [Diagram, RMA: a put of A between the memories of processes p and q, followed by a flush. Message Passing: A travels between processes p and q as a message]

  28. RMA vs. Message Passing
      - Communication in RMA is one-sided
      [Diagram, RMA: one process issues the put of A, followed by a flush. Message Passing: A travels between processes p and q as a message]

  29. RMA vs. Message Passing
      - Communication in RMA is one-sided
      [Diagram, RMA: one process issues the put of A, followed by a flush; the other side is annotated "no active participation, direct access to memory". Message Passing: A travels between processes p and q as a message]

  30. RMA vs. Message Passing
      - Communication in RMA is one-sided
      [Diagram, RMA: one process issues the put of A, followed by a flush; the other side is annotated "no active participation, direct access to memory". Message Passing: one process sends A as a message; the other side is annotated "explicit receive, possible queueing"]
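
The contrast on slide 30 can be shown with a small sketch. The `Target`, `send`, and `put` names are hypothetical and exist only for this illustration; the flush step is left out to keep the comparison short.

```python
from collections import deque


class Target:
    """Receiving side with its own memory and a queue of undelivered messages."""

    def __init__(self):
        self.memory = {}
        self.inbox = deque()  # messages wait here until the target receives them

    def recv(self):
        # Message passing: the target must actively take the message.
        key, value = self.inbox.popleft()
        self.memory[key] = value


def send(target, key, value):
    # Message passing: the data is queued, not yet in the target's memory.
    target.inbox.append((key, value))


def put(target, key, value):
    # RMA: direct access to the target's memory, no participation needed.
    target.memory[key] = value


mp = Target()
send(mp, "A", 1)
visible_before_recv = "A" in mp.memory  # still queued
mp.recv()
visible_after_recv = "A" in mp.memory

rma = Target()
put(rma, "A", 1)
visible_after_put = "A" in rma.memory   # no recv() was ever called
```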

  31. Remote Memory Access Programming
      [1] R. Gerstenberger et al. Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One-Sided. SC13

  32. Remote Memory Access Programming
      - Is it ideal?
      [1] R. Gerstenberger et al. Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One-Sided. SC13

  34. Remote Memory Access Programming
      - Is it ideal?
        - Consider an insert in a distributed hashtable... [Diagram: Proc p and Proc q]
        - No hash collision: 1 remote atomic; up to 5x speedup over MP [1]
        - A hash collision: 4 remote atomics + 2 remote puts; significant performance drops
      - Use “active” semantics: local execution, triggered by an active access. In RMA?
      - How to enable it? Use and extend I/O MMUs and their paging capabilities
      [1] R. Gerstenberger et al. Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One-Sided. SC13
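
A minimal sketch of the collision-free path from slide 34, assuming the insert is a single compare-and-swap on a slot owned by the target. The `RemoteTable` class, the modulo hash, and the integer keys are hypothetical. The collision path is deliberately not implemented: the deck gives only its cost (4 remote atomics + 2 remote puts), not its protocol.

```python
class RemoteTable:
    """Slot array owned by the target; each method call models one remote operation."""

    def __init__(self, size):
        self.slots = [None] * size
        self.remote_atomics = 0  # number of remote atomics issued so far

    def compare_and_swap(self, index, expected, new):
        # Models one remote atomic: write `new` only if the slot holds `expected`.
        self.remote_atomics += 1
        old = self.slots[index]
        if old == expected:
            self.slots[index] = new
        return old


def insert(table, key):
    """Try to claim the key's slot; report a collision instead of resolving it."""
    index = key % len(table.slots)
    old = table.compare_and_swap(index, None, key)
    return "inserted" if old is None else "collision"


table = RemoteTable(size=8)
first = insert(table, 3)
ops_after_first = table.remote_atomics  # one remote atomic, as on the slide
second = insert(table, 11)              # 11 % 8 == 3, so the slot is taken
```

In the collision case the origin would have to continue with further remote operations, which is where the slide locates the performance drop.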

  35. Use Semantics from Active Messages (AM) [1], AM++ [2], GASNet [3]
      - We use it in syntax & semantics to enable the “active” behavior
      - We need active puts/gets:
        - Invoke a handler upon accessing a given page
        - Preserve one-sided RMA behavior
      [Diagram: processes p and q; a table in memory maps addresses to handlers (A’s addr: Handler A, ..., Z’s addr: Handler Z)]
      [1] T. von Eicken et al. Active messages: a mechanism for integrated communication and computation. ISCA’92.
      [2] J. J. Willcock et al. AM++: A generalized active message framework. PACT’10.
      [3] D. Bonachea. GASNet Specification, v1.1. Berkeley Technical Report. 2002.
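
The address-to-handler table on slide 35 can be sketched as follows. Everything here is an assumption made for illustration: the `ActiveMemory` class, the 4096-byte page size, and a handler that runs synchronously inside `put` (in the proposed design the handler would run on the target's CPU while the origin's put stays one-sided). The handler performs a hashtable insert locally, so a collision costs the origin no extra remote operations.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


class ActiveMemory:
    """Target memory in which some pages have a handler attached."""

    def __init__(self):
        self.handlers = {}  # page number -> handler
        self.pages = {}     # page number -> {offset: value} for ordinary pages

    def register(self, page, handler):
        self.handlers[page] = handler

    def put(self, address, value):
        page, offset = divmod(address, PAGE_SIZE)
        handler = self.handlers.get(page)
        if handler is None:
            # Ordinary page: a plain RMA put.
            self.pages.setdefault(page, {})[offset] = value
        else:
            # Active page: invoke the handler instead of writing the data.
            handler(offset, value)


buckets = {}  # hashtable local to the target


def insert_handler(offset, key):
    # Collisions are resolved locally by chaining within the bucket.
    buckets.setdefault(key % 8, []).append(key)


mem = ActiveMemory()
mem.register(page=2, handler=insert_handler)
mem.put(2 * PAGE_SIZE, 3)        # active put: runs insert_handler
mem.put(2 * PAGE_SIZE, 11)       # collides with 3, still a single put
mem.put(5 * PAGE_SIZE + 16, 99)  # ordinary put to a page without a handler
```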

  36. Use Input/Output Memory Management Units
      - We propose it as a way to implement the “active” behavior
      [Diagram: I/O devices issue device addresses, which the IOMMU (with its IOTLB) translates to physical addresses in main memory; the CPU issues virtual addresses, which the MMU (with its TLB) translates to physical addresses in main memory]

  37. IOMMUs and RMA
      - We could use it somehow. But…
        - No multiplexing (single log)... BAD
        - No parallelism (single log)... BAD
        - Data is discarded... Extremely BAD
      [Diagram, steps 1-12: an RDMA packet reaches the NIC, which issues PCIe packets to the IOMMU (Dev-to-PT cache, IOTLB); the remapping structures in main memory hold the Dev-to-PT table and the page tables (PT) with W and R bits; faults go to the system-wide fault log as fault entries; an MSI notifies the CPU (SMT cores), which runs user handlers (Handler A, ...)]

  38. Active Puts
      [Diagram: the IOMMU and RMA datapath (RDMA packet, NIC, PCIe packets, IOMMU with Dev-to-PT cache and IOTLB, remapping structures, system-wide fault log, MSI, CPU with SMT cores, user handlers), extended with:]
      - Access log table in the IOMMU: stores addresses of each access log
      - WL and WLD bits in the page table, next to W and R: decide on keeping/discarding the entry/data
      - IUID in the page table: maps each page to an access log
      - Access log (private for each process): fault entries together with the request data
      - Data can be reused
      - Enables data-centric programming

  39. Active Puts
      - Log both the entry and the data of an incoming put
      - Do not modify the page
      [Diagram, steps 1-5: process q attempts to write(X) to a page of process p; the accessed page has W = 0, WL = 1, WLD = 1; the IOMMU raises a page fault (W = 0); X is moved to the access log in main memory (Move(X)); the CPU processes it (Process(X))]
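
The flow on slide 39 can be sketched as a toy model. The bit names W, WL, and WLD come from the slides; their reading here (W permits the write, WL logs an entry, WLD also logs the data) is an interpretation of the two bullets above, and the `PageEntry`, `Iommu`, and `drain` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PageEntry:
    W: int    # 1: the put may modify the page
    WL: int   # 1: log an entry describing the put
    WLD: int  # 1: also log the data carried by the put


class Iommu:
    """Toy IOMMU that checks the page entry for every incoming put."""

    def __init__(self):
        self.page_table = {}  # page -> PageEntry
        self.memory = {}      # page -> value
        self.access_log = []  # entries waiting for the CPU

    def incoming_put(self, page, data):
        entry = self.page_table[page]
        if entry.W:
            self.memory[page] = data
        if entry.WL:
            # With WLD = 0 only the entry is kept and the data is dropped.
            logged = data if entry.WLD else None
            self.access_log.append({"page": page, "data": logged})


def drain(iommu, handler):
    """CPU side: process every logged access, then empty the log."""
    results = [handler(e["page"], e["data"]) for e in iommu.access_log]
    iommu.access_log.clear()
    return results


iommu = Iommu()
iommu.page_table[7] = PageEntry(W=0, WL=1, WLD=1)  # settings from slide 39
iommu.incoming_put(page=7, data="X")
page_untouched = 7 not in iommu.memory
logged = list(iommu.access_log)
processed = drain(iommu, lambda page, data: f"processed {data} for page {page}")
```

With W = 0 the page itself is never written, yet the put's data survives in the log and reaches the handler.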

  40. Active Gets
      [Diagram: the active-put datapath (RDMA packet, NIC, PCIe packets, IOMMU with Dev-to-PT cache, IOTLB, and access log table, remapping structures, system-wide fault log, MSI, CPU with SMT cores, user handlers, per-process access log holding fault entries with request data), with RL and RLD bits added to the page table next to W, R, WL, WLD, and IUID]
