Topology-Aware Data Aggregation for Intensive I/O on Large-Scale Supercomputers


  1. Topology-Aware Data Aggregation for Intensive I/O on Large-Scale Supercomputers
François Tessier∗, Preeti Malakar∗, Venkatram Vishwanath∗, Emmanuel Jeannot†, Florin Isaila‡
∗Argonne National Laboratory, USA; †Inria Bordeaux Sud-Ouest, France; ‡University Carlos III, Spain
November 18, 2016

  2. Data Movement at Scale
◮ Computational science simulations such as climate, heart and brain modelling, or cosmology have large I/O needs: typically around 10% to 20% of the wall time is spent in I/O

Table: Example of I/O from large simulations
  Scientific domain    | Simulation  | Data size
  Cosmology            | Q Continuum | 2 PB / simulation
  High-Energy Physics  | Higgs Boson | 10 PB / year
  Climate / Weather    | Hurricane   | 240 TB / simulation

◮ Increasing disparity between computing power and I/O performance in the largest supercomputers
Figure: Ratio of I/O bandwidth (TB/s) to compute performance (TF/s), in percent, for the #1 Top500 system, falling from roughly 0.1 to 0.0001 between 1997 and 2017

  3. Complex Architectures
◮ Complex network topologies: multidimensional tori, dragonfly, ...
◮ Partitioning of the architecture to reduce I/O interference: IBM BG/Q with I/O nodes (Figure), Cray with LNET nodes
◮ New tiers of storage/memory for data staging: MCDRAM in KNL, NVRAM, burst buffer nodes
Figure: Mira (IBM BG/Q), 49,152 nodes / 786,432 cores, 768 TB of memory, peak performance 10 PetaFLOPS. Compute nodes (PowerPC A2, 16 cores, 16 GB of DDR3 each) sit on a 5D torus network (2 GB/s per link) in 128-node Psets; bridge nodes (2 per I/O node, 2 GB/s per link) forward to I/O nodes running an I/O forwarding daemon and a GPFS client, connected over 4 GB/s links to QDR InfiniBand switches in front of the GPFS storage (27 PB, 330 GB/s)
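The compute/I-O disparity from the previous slide can be sanity-checked against the Mira figures above. This is only a back-of-the-envelope sketch: the two input numbers come from the slide, the unit conversions are ours.

```python
# Mira figures from the slide: 10 PetaFLOPS peak compute, 330 GB/s to GPFS.
peak_compute_tf = 10_000.0   # 10 PFLOPS expressed in TF/s
io_bandwidth_tb = 0.33       # 330 GB/s expressed in TB/s

# Same ratio as plotted on the previous slide: I/O (TB/s) over compute (TF/s).
ratio_percent = io_bandwidth_tb / peak_compute_tf * 100
print(f"I/O-to-compute ratio: {ratio_percent:.4f}%")  # 0.0033%
```

A ratio of 0.0033% sits squarely in the declining band of the Top500 plot on the previous slide.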

  4. Two-phase I/O
◮ Available in MPI I/O implementations such as ROMIO
◮ Improves I/O performance by writing larger data chunks
◮ Selects a subset of processes (aggregators) to aggregate data before writing it to the storage system
Limitations:
◮ Poor for small messages (from experiments)
◮ Inefficient aggregator placement policy
◮ Fails to take advantage of data model, data layout and memory hierarchy
Figure: Two-phase I/O mechanism. Processes P0-P3 each hold interleaved X, Y, Z data; in the aggregation phase (1), aggregators P0 and P2 gather the pieces into contiguous X, Y and Z blocks; in the I/O phase (2), those blocks are written to the file

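The mechanism in the figure can be sketched in a few lines of plain Python. This is an illustrative toy, not ROMIO's implementation: the rank count, the variable names, and the choice of aggregators per variable are all simplifications, and the "gather" stands in for real MPI communication.

```python
# Toy two-phase write: 4 processes each hold one interleaved (X, Y, Z)
# record; the file layout wants all X values, then all Y, then all Z.
process_data = {
    0: {"X": "x0", "Y": "y0", "Z": "z0"},
    1: {"X": "x1", "Y": "y1", "Z": "z1"},
    2: {"X": "x2", "Y": "y2", "Z": "z2"},
    3: {"X": "x3", "Y": "y3", "Z": "z3"},
}

def two_phase_write(process_data, aggregators):
    file_image, writers = [], []
    for var in ("X", "Y", "Z"):  # file order: all X, then all Y, then all Z
        aggr = aggregators[var]
        # 1 - Aggregation phase: every rank sends its piece of `var` to `aggr`.
        block = [process_data[rank][var] for rank in sorted(process_data)]
        # 2 - I/O phase: `aggr` issues one large contiguous write.
        file_image.extend(block)
        writers.append(aggr)
    return file_image, writers

file_image, writers = two_phase_write(process_data, {"X": 0, "Y": 0, "Z": 2})
print(writers)         # [0, 0, 2]
print(file_image[:4])  # ['x0', 'x1', 'x2', 'x3']
```

The point of the restructuring is visible in `file_image`: twelve small strided pieces become three large contiguous writes, which is what makes two-phase I/O pay off at the file system.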

  7. Outline
1. Context
2. Approach
3. Evaluation
4. Conclusion and Perspectives

  8. Approach
Improved aggregator placement, taking into account:
◮ The topology of the architecture
◮ The data access pattern
Efficient implementation of the two-phase I/O scheme:
◮ Captures the data model and the data layout to optimize the I/O scheduling
◮ Pipelines the aggregation phase and the I/O phase to optimize data movement
◮ Leverages one-sided communication
◮ Uses non-blocking operations to reduce synchronization

  9. Aggregator Placement: Topology-Aware Strategy
◮ ω(u, v): amount of data exchanged between nodes u and v
◮ d(u, v): number of hops from node u to node v
◮ l: the interconnect latency
◮ B_i→j: the bandwidth from node i to node j
◮ C1 = max over i ∈ V_C of ( l × d(i, A) + ω(i, A) / B_i→A )
◮ C2 = l × d(A, IO) + ω(A, IO) / (|V_C| × B_A→IO)
where V_C is the set of compute nodes, IO the I/O node and A the aggregator
Objective function:
◮ TopoAware(A) = min(C1 + C2)
◮ Computed by each process independently in O(n), with n = |V_C|
Figure: compute nodes V_C, aggregator A and I/O node IO; C1 is the cost of the compute-nodes-to-aggregator phase, C2 the cost of the aggregator-to-I/O-node phase

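The cost model can be rendered directly in code. The sketch below is our own illustrative Python version: the 1-D line topology, the per-node data sizes, and the uniform latency/bandwidth are invented for the example, not taken from a real machine.

```python
# TopoAware: pick the aggregator A minimizing C1(A) + C2(A), where C1 is the
# slowest compute-node-to-aggregator transfer and C2 is the aggregator-to-I/O
# transfer, with the I/O link's bandwidth divided across the |Vc| senders.

def topo_aware_cost(A, compute_nodes, io_node, d, omega, l, B):
    c1 = max(l * d(i, A) + omega(i, A) / B(i, A) for i in compute_nodes)
    c2 = l * d(A, io_node) + omega(A, io_node) / (len(compute_nodes) * B(A, io_node))
    return c1 + c2

def choose_aggregator(compute_nodes, io_node, d, omega, l, B):
    # Evaluating one candidate costs O(n); on the slide each process
    # evaluates its own candidacy independently.
    return min(compute_nodes,
               key=lambda A: topo_aware_cost(A, compute_nodes, io_node,
                                             d, omega, l, B))

# Toy 1-D line topology: compute nodes at positions 0..3, I/O node at 5.
IO = 5
nodes = [0, 1, 2, 3]
l = 1.0

def d(u, v): return abs(u - v)   # hops = distance along the line
def B(u, v): return 1.0          # uniform unit bandwidth

def omega(u, v):
    # Node u holds (u+1) MB; the aggregator forwards all 10 MB to the I/O node.
    return 10.0 if v == IO else u + 1.0

best = choose_aggregator(nodes, IO, d, omega, l, B)
print(best, topo_aware_cost(best, nodes, IO, d, omega, l, B))  # 3 8.5
```

With these toy numbers the model pulls the aggregator toward the I/O node (node 3), because shortening the shared C2 link outweighs the slightly longer intra-group hops in C1.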

  12. Algorithm
◮ Initialization: allocate buffers, create MPI windows, compute the tuples {round, aggregator, buffer} for each process P
Say P1 is the aggregator:
◮ P0, P1 and P2 put data in buffer 1 (round 1) of P1; P3 waits (fence)
◮ P1 writes buffer 1 to the file while aggregating data from all the ranks in buffer 2
◮ 2nd round: P1 writes buffer 2 and aggregates data from P1, P2 and P3
◮ And so on...
◮ Limitations: relies on MPI_Comm_split; at most one aggregator per node
Figure: processes 0-3 with interleaved X, Y, Z data fill the aggregator's two buffers (each of size n × block_size) round by round through RMA operations; each full buffer is flushed to the file with non-blocking MPI calls, so writing one buffer overlaps filling the other (double buffering)

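The round-by-round schedule above can be simulated in a few lines. This is a pure-Python sketch, not the authors' code: RMA puts and non-blocking writes become plain list operations, the rank-to-round schedule is simplified to one block per rank per round, and the overlap between writing one buffer and filling the other is only marked in comments.

```python
def aggregate_in_rounds(process_data, n_buffers=2):
    """One aggregator with double buffering: round r fills buffer r % 2
    while the previously filled buffer is (conceptually) being written."""
    buffers = [None] * n_buffers
    file_image = []
    n_rounds = len(next(iter(process_data.values())))  # one block per round
    for rnd in range(n_rounds):
        active = rnd % n_buffers
        # Aggregation: each rank puts its round-`rnd` block into the active
        # buffer (MPI_Put into the aggregator's window, then a fence).
        buffers[active] = [process_data[rank][rnd]
                           for rank in sorted(process_data)]
        # I/O: non-blocking write of the active buffer; in the real code this
        # overlaps the next round's aggregation into the other buffer.
        file_image.extend(buffers[active])
    return file_image

data = {0: ["x0", "y0", "z0"], 1: ["x1", "y1", "z1"],
        2: ["x2", "y2", "z2"], 3: ["x3", "y3", "z3"]}
result = aggregate_in_rounds(data)
print(result[:4], result[-4:])  # ['x0', 'x1', 'x2', 'x3'] ['z0', 'z1', 'z2', 'z3']
```

Two buffers are enough because at any instant only one buffer is being written and one is being filled; that is the double-buffering pictured on the slide.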
