Breaking Band: A Breakdown of High-performance Communication - PowerPoint PPT Presentation

Breaking Band: A Breakdown of High-performance Communication. Rohit Zambre,* Megan Grodowitz,⌃ Aparna Chandramowlishwaran,* Pavel Shamis⌃ (*University of California, Irvine; ⌃Arm Research)


  1. Breaking Band: A Breakdown of High-performance Communication. Rohit Zambre,* Megan Grodowitz,⌃ Aparna Chandramowlishwaran,* Pavel Shamis⌃ (*University of California, Irvine; ⌃Arm Research)

  4. ▸ Strong scaling is the way forward.
     ▸ Small messages at the limits of strong scaling.
     [Figure source: https://www.karlrupp.net/2018/02/42-years-of-microprocessor-trend-data/]
     [Figure: Evolution of the memory capacity per core in the Top500 list (Peter Kogge, "PIM & memory: The need for a revolution in architecture").]

  8. ▸ How much does a component contribute?
     ▸ If we optimize component X by Y%, by how much will communication performance improve?
     [Chart: Latency breakdown. Legend: Network, I/O, CPU. Values: 27.60%, 37.20%, 35.20%. Axis: 0 to 1000 nanoseconds.]
     [Chart: Injection overhead breakdown. Legend: Misc, Post_prog, Post. Values: 1.19%, 22.57%, 76.22%. Axis: 0 to 200 nanoseconds.]
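The second question on this slide comes down to Amdahl-style arithmetic. Below is a minimal Python sketch, assuming the components run serially so that total time is the sum of component times. `improved_total` is a hypothetical helper, not something from the talk, and pairing the 76.22% share with Post follows the chart's legend order, which is an assumption.

```python
def improved_total(share: float, reduction: float) -> float:
    """Return the new total time as a fraction of the original total.

    share:     fraction of total time spent in the component (0..1)
    reduction: fraction by which that component's time is cut (0..1)
    Assumes components do not overlap, which is a simplification.
    """
    if not (0.0 <= share <= 1.0 and 0.0 <= reduction <= 1.0):
        raise ValueError("share and reduction must be within [0, 1]")
    return 1.0 - share * reduction


if __name__ == "__main__":
    # Assumption: 76.22% is the Post share of injection overhead.
    new_total = improved_total(share=0.7622, reduction=0.20)
    print(f"new total: {new_total:.4f} of the original")
    print(f"speedup:   {1.0 / new_total:.3f}x")
```

Under this simple model, a component's share caps the gain: even removing a component entirely cannot cut the total by more than its share.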

  11. BREAKING BAND: A BREAKDOWN OF HIGH-PERFORMANCE COMMUNICATION
      CONTRIBUTIONS OF THE PAPER
      ▸ A detailed breakdown of communication performance of small messages.
      ▸ Analytical models to explain injection overhead and latency.
      ▸ Accurate to within 5% of the observed performance.
      ▸ Detailed measurement methodology to produce the breakdown on any other system configuration.
      ▸ What-if analysis for a set of optimizations.
      ▸ First work of its kind on an Arm-based server.

  12. OUTLINE
      ▸ Introduction
      ▸ Experimental setup & Measurement methodology
      ▸ Injection overhead: Modeling and breakdown
      ▸ Latency: Modeling and breakdown
      ▸ Simulated optimizations

  13. INTERNODE COMMUNICATION COMPONENTS IN HPC
      ▸ CPU: High-level Communication Protocols (HLP). Example: MPICH + UCP
      ▸ CPU: Low-level Communication Protocols (LLP). Example: UCT
      ▸ I/O: I/O subsystem. Example: Root Complex + PCI Express
      ▸ Network: NIC and switch. Example: Mellanox InfiniBand

  14. EXPERIMENTAL SETUP
      [Diagram: Node 1 and Node 2, each a TX2-based server with a Mellanox ConnectX-4 NIC; a Mellanox InfiniBand network (switch + wire); a Lecroy PCIe analyzer.]
      ▸ Software: MPICH CH4 + UCX; Hardware: Arm TX2 + PCIe + Mellanox IB
      ▸ CPU timer registers to measure CPU time.
      ▸ PCIe analyzer to measure time in other components through traces.

  15. EXPERIMENTAL SETUP (WHAT IT ACTUALLY LOOKED LIKE)
      [Photo labels: State-of-the-art cooling, PCIe trace viewer, PCIe analyzer, ConnectX-4, Node 1]

  17. USING CPU TIMERS
      Timer start
      <code of interest>
      Timer end
      Time for code of interest = Timer end - Timer start - Timer overhead
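The timer formula on this slide can be sketched in Python as follows. The talk reads CPU timer registers on the Arm server; `time.perf_counter_ns` stands in for them here, and the helper names (`timer_overhead_ns`, `time_of_interest_ns`, `measure_ns`) are hypothetical. Taking the minimum over back-to-back reads as the timer overhead is one common choice, not necessarily the authors'.

```python
import time


def timer_overhead_ns(samples: int = 10_000) -> int:
    """Estimate the cost of a start/end timer pair with nothing in between.

    Uses the minimum over many samples to filter out scheduling noise.
    """
    if samples < 1:
        raise ValueError("samples must be at least 1")
    best = None
    for _ in range(samples):
        start = time.perf_counter_ns()
        end = time.perf_counter_ns()
        delta = end - start
        if best is None or delta < best:
            best = delta
    return best


def time_of_interest_ns(start: int, end: int, overhead: int) -> int:
    """Timer end - Timer start - Timer overhead, as on the slide."""
    return end - start - overhead


def measure_ns(func, overhead: int) -> int:
    """Time one call of func and subtract the timer overhead."""
    start = time.perf_counter_ns()
    func()
    end = time.perf_counter_ns()
    return time_of_interest_ns(start, end, overhead)


if __name__ == "__main__":
    overhead = timer_overhead_ns()
    elapsed = measure_ns(lambda: sum(range(1000)), overhead)
    print(f"timer overhead: {overhead} ns, code of interest: {elapsed} ns")
```

For code that runs in a few hundred nanoseconds, the timer overhead is a significant part of the raw reading, which is why the subtraction matters.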

  18. USING CPU TIMERS
      MPI: MPI_Isend
      UCP: ucp_tag_send_nb
      UCT: uct_ep_am_short
      ▸ Measured time in different components using deltas.
      ▸ Carefully isolated callbacks/functions between layers (details in paper).
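One reading of "deltas" is that the time spent in a layer is the time of its call minus the time of the call it makes into the layer below. The Python sketch below follows that reading; it is an assumption about the method, `exclusive_times` is a hypothetical helper, and the nanosecond values are made up for illustration.

```python
def exclusive_times(inclusive):
    """Turn inclusive call times into per-layer (exclusive) times.

    inclusive: list of (name, inclusive_ns) ordered from the top of the
    call stack to the bottom. Each layer's own time is its inclusive
    time minus the inclusive time of the call directly below it.
    """
    result = {}
    for i, (name, total_ns) in enumerate(inclusive):
        below_ns = inclusive[i + 1][1] if i + 1 < len(inclusive) else 0
        result[name] = total_ns - below_ns
    return result


if __name__ == "__main__":
    # Made-up inclusive timings, not measurements from the talk.
    stack = [
        ("MPI_Isend", 200),        # MPI layer
        ("ucp_tag_send_nb", 150),  # UCP layer
        ("uct_ep_am_short", 100),  # UCT layer
    ]
    for name, own_ns in exclusive_times(stack).items():
        print(f"{name}: {own_ns} ns")
```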

  19. USING PCIE ANALYZER
      Time of event = Timestamp of packet after event - Timestamp of packet before event
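The subtraction above can be applied to a list of timestamped packets. This is a minimal Python sketch: `Packet` is a hypothetical record type, not the analyzer's export format, and the timestamps are made up.

```python
from typing import List, NamedTuple, Tuple


class Packet(NamedTuple):
    timestamp_ns: float  # capture time reported by the analyzer
    kind: str            # e.g. "MWr" or "ACK"


def time_of_event(before: Packet, after: Packet) -> float:
    """Timestamp of packet after event - timestamp of packet before event."""
    if after.timestamp_ns < before.timestamp_ns:
        raise ValueError("packets are out of order")
    return after.timestamp_ns - before.timestamp_ns


def gaps(trace: List[Packet]) -> List[Tuple[str, float]]:
    """Time between each pair of consecutive packets in a trace."""
    return [
        (f"{a.kind} -> {b.kind}", time_of_event(a, b))
        for a, b in zip(trace, trace[1:])
    ]


if __name__ == "__main__":
    # Made-up trace, for illustration only.
    trace = [Packet(1000.0, "MWr"), Packet(1120.0, "ACK"), Packet(1400.0, "MWr")]
    for label, delta_ns in gaps(trace):
        print(f"{label}: {delta_ns} ns")
```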

  20. USING PCIE ANALYZER: NIC WRITING COMPLETION
      [Diagram: NIC, analyzer, and Root Complex (RC) on the PCIe wire. Labels: TLP MWr, DLLP ACK, 2 ✕ PCIe wire.]

  22. INJECTION OVERHEAD

  28. INJECTION OVERHEAD: BACKGROUND
      [Diagram, sender side: CPU, Root Complex (RC), MEM, PCIe wire, NIC. Steps in the order the slide builds them:]
      ▸ Post (CPU, programmed IO): MWr (64B) over the PCIe wire to the NIC
      ▸ Transmit (NIC), followed by an ACK
      ▸ Write completion (NIC, completion DMA-write): MWr (64B) to MEM
      ▸ Progress (CPU)

  32. INJECTION OVERHEAD
      [Diagram: same sender-side sequence as in the background slide.]
      ▸ Overhead observed by RC = (b ✕ Post + b ✕ Progress + tot_Misc) / b = CPU_time = Post + Progress + Misc
      ▸ Overhead observed by NIC = Overhead observed by RC, given:
        (1) Credit-based flow control
        (2) Multiple outstanding PCIe transactions

  33. INJECTION OVERHEAD
      Injection overhead = CPU_time = Post + Progress + Misc (measured with CPU timer registers)
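The relation Injection overhead = Post + Progress + Misc can be checked numerically against the batch form (b ✕ Post + b ✕ Progress + tot_Misc) / b. The Python sketch below assumes `b` is the number of messages over which `tot_Misc` accumulates, so that Misc = tot_Misc / b; that reading of `b` is an assumption, the function name is hypothetical, and the input values are made up.

```python
def injection_overhead_ns(post_ns: float, progress_ns: float,
                          tot_misc_ns: float, b: int) -> float:
    """Per-message injection overhead from the batch form.

    post_ns, progress_ns: per-message Post and Progress times
    tot_misc_ns:          total Misc time over the whole batch
    b:                    number of messages in the batch (assumed meaning)
    """
    if b < 1:
        raise ValueError("b must be at least 1")
    return (b * post_ns + b * progress_ns + tot_misc_ns) / b


if __name__ == "__main__":
    # Made-up inputs, not measurements from the talk.
    post, progress, tot_misc, b = 150.0, 45.0, 300.0, 100
    overhead = injection_overhead_ns(post, progress, tot_misc, b)
    misc = tot_misc / b
    print(f"injection overhead: {overhead} ns per message")
    # The batch form and the per-message sum agree.
    print(f"Post + Progress + Misc: {post + progress + misc} ns")
```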

  34. [Chart: Injection overhead breakdown, 0 to 100 percent. Misc: 1.20%, Progress: 22.58%, Post: 76.23%.]
      ▸ Post is the performance bottleneck.
      ▸ Progress is the semantic bottleneck.
