SLIDE 1

Context Approach Evaluation Conclusion

Topology-Aware Data Aggregation for Intensive I/O on Large-Scale Supercomputers

François Tessier∗, Preeti Malakar∗, Venkatram Vishwanath∗, Emmanuel Jeannot†, Florin Isaila‡

∗Argonne National Laboratory, USA †Inria Bordeaux Sud-Ouest, France ‡University Carlos III, Spain

November 18, 2016

SLIDE 2


Data Movement at Scale

◮ Computational science simulations such as climate, heart and brain modelling, or cosmology have large I/O needs

Typically, around 10% to 20% of the wall time is spent in I/O

Table: Examples of I/O from large simulations

  Scientific domain     Simulation    Data size
  Cosmology             Q Continuum   2 PB / simulation
  High-Energy Physics   Higgs Boson   10 PB / year
  Climate / Weather     Hurricane     240 TB / simulation

◮ Increasing disparity between computing power and I/O performance in the largest supercomputers

Figure: Ratio of I/O bandwidth (TB/s) to compute performance (TF/s), in percent, for the #1 system of the Top 500, 1997-2017.
SLIDE 3


Complex Architectures

◮ Complex network topologies: multidimensional tori, dragonfly, ...
◮ Partitioning of the architecture to reduce I/O interference

IBM BG/Q with I/O nodes (Figure), Cray with LNET nodes

◮ New tiers of storage/memory for data staging

MCDRAM in KNL, NVRAM, Burst buffer nodes

Figure: Mira (IBM BG/Q) I/O path: compute nodes (PowerPC A2, 16 cores, 16 GB of DDR3) on a 5D torus network (2 GBps per link), bridge nodes (4 GBps per link) running the I/O forwarding daemon, then a QDR InfiniBand switch to the I/O nodes (GPFS clients) and the storage system (GPFS filesystem). Pset: 128 nodes, 2 per I/O node.

Mira

  • 49,152 nodes / 786,432 cores
  • 768 TB of memory
  • 27 PB of storage, 330 GB/s (GPFS)
  • 5D Torus network
  • Peak performance: 10 PetaFLOPS
SLIDE 4


Two-phase I/O

◮ Available in MPI I/O implementations such as ROMIO
◮ Improves I/O performance by writing larger data chunks
◮ Selects a subset of processes to aggregate data before writing it to the storage system

Limitations:

◮ Poor for small messages (from experiments)
◮ Inefficient aggregator placement policy
◮ Fails to take advantage of data model, data layout and memory hierarchy

Figure: Two-phase I/O mechanism. Processes P0-P3 each hold X, Y and Z blocks; in the aggregation phase, aggregators (P0 and P2) collect the blocks by variable, then in the I/O phase they write them to the file as contiguous chunks.
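The mechanism can be sketched in a few lines (an illustration of the two-phase idea only, not the ROMIO implementation; the process data and the aggregator list are toy values):

```python
# Toy model of two-phase I/O: four processes each hold one X, Y and Z
# block; two aggregators gather the blocks per variable (phase 1) and
# write each variable region to the file as one contiguous chunk (phase 2).

def two_phase_write(process_data, aggregators):
    """process_data: {rank: {"X": block, "Y": block, "Z": block}}."""
    ranks = sorted(process_data)
    variables = ["X", "Y", "Z"]
    # Phase 1 (aggregation): each variable region of the file is owned by
    # one aggregator, which gathers that variable from every rank.
    staged = {}
    for i, var in enumerate(variables):
        owner = aggregators[i % len(aggregators)]
        staged[var] = (owner, [process_data[r][var] for r in ranks])
    # Phase 2 (I/O): the regions are written in file order, so the file
    # holds all X blocks, then all Y blocks, then all Z blocks.
    file_image = []
    for var in variables:
        _owner, blocks = staged[var]
        file_image.extend(blocks)  # one large contiguous write per region
    return file_image

data = {r: {v: f"{v}{r}" for v in "XYZ"} for r in range(4)}
print(two_phase_write(data, aggregators=[0, 2]))
```

The point of the reshuffle is that each aggregator issues a few large writes instead of every rank issuing many small interleaved ones.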

SLIDE 7


Outline

1. Context
2. Approach
3. Evaluation
4. Conclusion and Perspectives

SLIDE 8


Approach

Improved aggregator placement, taking into account:

◮ The topology of the architecture
◮ The data access pattern

Efficient implementation of the two-phase I/O scheme:

◮ Captures the data model and the data layout to optimize the I/O scheduling
◮ Pipelines the aggregation phase and the I/O phase to optimize data movement
◮ Leverages one-sided communication
◮ Uses non-blocking operations to reduce synchronization

SLIDE 9


Aggregator Placement - Topology-aware strategy

◮ ω(u, v): amount of data exchanged between nodes u and v
◮ d(u, v): number of hops from node u to node v
◮ l: the interconnect latency
◮ B_{i→j}: the bandwidth from node i to node j
◮ C1 = max_{i ∈ V_C} ( l × d(i, A) + ω(i, A) / B_{i→A} )
◮ C2 = l × d(A, IO) + ω(A, IO) / ( |V_C| × B_{A→IO} )

with V_C the set of compute nodes, IO the I/O node and A the aggregator. C1 is shown on the compute side of the figure, C2 on the link to the I/O node.

Objective function:

◮ TopoAware: choose the aggregator A minimizing C1 + C2
◮ Computed by each process independently in O(n), n = |V_C|
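As a sketch, the selection can be prototyped as below (illustration only; the 1D topology, latency and bandwidth values are made up, and the search is a plain scan over candidates rather than the slide's distributed O(n) computation):

```python
# Sketch of the topology-aware aggregator choice: evaluate C1 + C2 for
# every candidate A among the compute nodes and keep the minimum.
# d, omega, l and B follow the notation above; values below are toy ones.

def cost(A, compute_nodes, io_node, d, omega, l, B):
    # C1: worst cost of moving data from any compute node i to A.
    c1 = max(l * d(i, A) + omega(i, A) / B(i, A) for i in compute_nodes)
    # C2: cost of forwarding the aggregated data from A to the I/O node.
    c2 = l * d(A, io_node) + omega(A, io_node) / (len(compute_nodes) * B(A, io_node))
    return c1 + c2

def topo_aware(compute_nodes, io_node, d, omega, l, B):
    return min(compute_nodes,
               key=lambda A: cost(A, compute_nodes, io_node, d, omega, l, B))

# Toy 1D topology: compute nodes 0..7, I/O node at position 10.
d = lambda u, v: abs(u - v)       # hop count
omega = lambda u, v: 1_000_000.0  # bytes exchanged (uniform)
B = lambda u, v: 2e9              # 2 GBps links
best = topo_aware(range(8), 10, d, omega, l=1e-6, B=B)
```

With uniform traffic the chosen node balances its worst-case distance to the other compute nodes against its distance to the I/O node.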

SLIDE 12


Algorithm

◮ Initialization: allocate buffers, create MPI windows, compute tuples {round, aggregator, buffer} for each process P

Let's say P1 is the aggregator:

◮ P0, P1 and P2 put data into buffer 1 (round 1) of P1; P3 waits (fence)
◮ P1 writes buffer 1 to the file and aggregates data from all the ranks into buffer 2
◮ 2nd round: P1 writes buffer 2 and aggregates data from P1, P2 and P3
◮ And so on...
◮ Limitations: relies on MPI_Comm_split; at most one aggregator per node

Figure: Double-buffered aggregation: the aggregator alternates between two buffers of size n × block_size, so that writing one buffer overlaps with filling the other (RMA operations, non-blocking MPI calls).
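The round schedule above can be sketched as follows (a toy model of the double-buffering idea, not the actual MPI code; contribution sizes are made-up values):

```python
# Toy schedule for one aggregator: pack rank contributions into rounds
# bounded by the buffer size, alternating between two buffers so that
# writing round k can overlap with aggregating round k + 1.

def schedule_rounds(contributions, buffer_size):
    """contributions: list of (rank, nbytes) in aggregation order."""
    rounds, current, used = [], [], 0
    for rank, nbytes in contributions:
        if used + nbytes > buffer_size and current:
            rounds.append(current)      # active buffer full: hand it to
            current, used = [], 0       # the writer, switch buffers
        current.append((rank, nbytes))
        used += nbytes
    if current:
        rounds.append(current)
    # Even rounds fill buffer 1, odd rounds buffer 2 (double-buffering).
    return [("buffer 1" if k % 2 == 0 else "buffer 2", r)
            for k, r in enumerate(rounds)]

# Mirrors the slide's narrative: P0-P2 fit in round 1; P3 goes to round 2.
plan = schedule_rounds([(0, 3), (1, 3), (2, 3), (3, 3)], buffer_size=9)
```

Because consecutive rounds target different buffers, the fence that closes one round only synchronizes the ranks contributing to it, while the previous buffer is still being flushed to the file.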

SLIDE 20


Micro-benchmark - Placement strategies

◮ Evaluation on Mira (BG/Q), 512 nodes, 16 ranks/node
◮ Each rank sends a data buffer whose size is chosen randomly between 0 and 2 MB
◮ Writes to /dev/null on the I/O node (aggregation and I/O phases only)
◮ Aggregation settings: 16 aggregators, 16 MB buffer size
◮ Four tested strategies:

  Shortest path: smallest distance to the I/O node
  Longest path: longest distance to the I/O node
  Greedy: lowest rank in partition (similar to the default MPICH strategy)
  Topology-aware

Figure: Aggregator placement under each strategy (S: shortest path, L: longest path, G: greedy, T: topology-aware) within an aggregation partition of compute nodes, relative to the bridge node and the storage system.

SLIDE 21


Micro-benchmark - Placement strategies


Table: Impact of the aggregator placement strategy

  Strategy         I/O bandwidth (MBps)   Aggr. time/round (ms)
  Greedy           1927.45                421.33
  Longest path     2202.91                370.40
  Shortest path    2484.39                327.08
  Topology-aware   2638.40                310.46

◮ I/O bandwidth increased by 37% compared to the Greedy strategy and by 6% over the Shortest path approach

SLIDE 22


HACC-IO

◮ I/O part of a large-scale cosmological application simulating the mass evolution of the universe with particle-mesh techniques
◮ Each process manages particles, each defined by 9 variables (38 bytes): XX, YY, ZZ, VX, VY, VZ, phi, pid and mask
◮ One file per Pset (128 nodes) vs. one single shared file
◮ Aggregation settings: 16 aggregators per Pset, 16 MB buffer size (MPICH)
◮ Average and standard deviation over 10 runs

Figure: Data layouts in HACC-IO: (1) per-process data, (2) array of structures (AoS), (3) structure of arrays (SoA).
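The difference between the two file layouts can be shown with a toy serializer (illustrative only; the records carry three fields standing in for the nine HACC variables):

```python
# AoS keeps each particle's variables together in the file; SoA stores
# one contiguous region per variable.

def to_aos(particles):
    # Array of structures: x0 y0 z0 x1 y1 z1 ...
    return [value for p in particles for value in p.values()]

def to_soa(particles):
    # Structure of arrays: x0 x1 ... y0 y1 ... z0 z1 ...
    keys = list(particles[0])
    return [p[k] for k in keys for p in particles]

parts = [{"XX": 0.0, "YY": 1.0, "ZZ": 2.0},
         {"XX": 3.0, "YY": 4.0, "ZZ": 5.0}]
print(to_aos(parts))  # [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
print(to_soa(parts))  # [0.0, 3.0, 1.0, 4.0, 2.0, 5.0]
```

The layout determines which writes are contiguous: AoS keeps each rank's particles together, while SoA interleaves every rank's contribution inside each variable region, which changes the aggregation pattern.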

SLIDE 23


HACC-IO - 1024 nodes - 16K ranks - Single shared file

Figure: Write bandwidth vs. data size per rank (0.5 to 4 MB) on 1024 nodes (16 ranks/node), single shared file, comparing topology-aware aggregation, MPI I/O and POSIX I/O for the AoS and SoA layouts.

◮ Peak is estimated at 22.4 GBps (theoretical: 28.8 GBps)
◮ Our approach achieves higher performance than the default strategies
◮ With 5K particles per rank (190 KB) and the AoS layout: 15× faster than MPI I/O

SLIDE 24


HACC-IO - 1024 nodes - 16K ranks - Sub-filing (One file per Pset)

Figure: Write bandwidth vs. data size per rank (0.5 to 4 MB) on 1024 nodes (16 ranks/node), one file per Pset, comparing topology-aware aggregation, MPI I/O and POSIX I/O for the AoS and SoA layouts.

◮ Sub-filing is an efficient approach for improved I/O performance
◮ Our topology-aware strategy achieves 90% of the peak I/O bandwidth (22.4 GBps)
◮ Significant improvement, particularly for small messages

SLIDE 25


HACC-IO - 4096 nodes - 65K ranks - Sub-filing (One file per Pset)

Figure: Write bandwidth vs. data size per rank (0.5 to 4 MB) on 4096 nodes (16 ranks/node), one file per Pset, comparing topology-aware aggregation, MPI I/O and POSIX I/O for the AoS and SoA layouts.

◮ Peak is estimated at 89.6 GBps (theoretical: 115.2 GBps)
◮ 90% of the peak I/O bandwidth achieved, as on 1024 nodes
◮ Improved I/O performance for both AoS and SoA layouts, with significant improvement on smaller messages for the SoA case (up to 43%)

SLIDE 27


Conclusion and Perspectives

Conclusion

◮ Optimized two-phase I/O library incorporating:

  Topology-aware aggregator placement
  Optimized data movement and buffering (double-buffering, one-sided communication, block-size awareness)

◮ Very good performance at scale, outperforming standard approaches
◮ On the I/O part of a cosmological application, up to 12× improvement on 65K ranks
◮ Architecture characteristics are critical for performance at scale

Next steps:

◮ Take the routing policy into account
◮ Incorporate additional data models and layouts (2D and 3D arrays)
◮ Hierarchical approach to tackle different tiers of storage

SLIDE 28


Acknowledgments

◮ Argonne Leadership Computing Facility at Argonne National Laboratory
◮ DOE Office of Science, ASCR
◮ NCSA-Inria-ANL-BSC-JSC-Riken Joint Laboratory on Extreme Scale Computing
◮ European Union Seventh Framework Programme

SLIDE 29


Thank you for your attention!

ftessier@anl.gov

SLIDE 30

Micro-benchmark - #Aggr and buffer size

◮ Evaluation on Mira (BG/Q), 1024 nodes, 16 ranks/node
◮ Each rank writes 1 MB
◮ Writes to /dev/null on the I/O node (performance of the aggregation and I/O phases only)

Table: I/O bandwidth (MBps) achieved on a simple benchmark with topology-aware aggregator placement, varying the number of aggregators per Pset and the buffer size

  #Aggr/Pset   8 MB      16 MB     32 MB
  8            7652.49   8848.28   9050.71
  16           7318.15   8774.58   9331.84
  32           6329.95   7797.12   8134.41