

  1. Interconnection Network Models for Large-Scale Performance Prediction Kishwar Ahmed, Mohammad Obaida, Jason Liu, Florida International University, FL, USA Stephan Eidenbenz, Nandakishore Santhi, Joe Zerr, Los Alamos National Laboratory, NM, USA 4th Summer of CODES July 17-18, 2018, Argonne National Laboratory, IL, USA

  2. Outline • Motivation • Performance Prediction Toolkit (PPT) • Automatic Performance Prediction • Conclusion

  3. Motivation • Rapid changes in HPC architecture • Multi-core and many-core architectures • Accelerator technologies • Complex memory hierarchies • HPC software adaptation is a constant theme: • No code is left behind: must guarantee good performance • Requires highly skilled software architects and computational physicists • Need modeling and simulation of large-scale HPC systems and applications • And the systems are getting larger (exascale systems around the corner)

  4. HPC Performance Prediction • HPC performance prediction provides insight about • Applications (e.g., scalability, performance variability) • Hardware/software (e.g., better design) • Workload behavior (present and future) • Which is useful for – • Understanding application performance issues • Improving applications and systems • Budgeting and designing efficient systems (present and future)

  5. Our Goals for Rapid Performance Prediction • Easy integration with other models of varying abstraction • Easy integration with applications (e.g., physics code) • Short development cycles • Performance and scale

  6. Outline • Motivation • Performance Prediction Toolkit (PPT) • Automatic Performance Prediction • Conclusion

  7. Performance Prediction Toolkit (PPT) • Make it simple, fast, and most of all useful • Designed to allow rapid assessment and performance prediction of large-scale applications on existing and future HPC platforms • PPT is a library of models of computational physics applications, middleware, and hardware • It allows users to predict execution time by running pseudo-code implementations of physics applications • “Scalable Codesign Performance Prediction for Computational Physics” project

  8. PPT Architecture (layered, top to bottom) • Large-Scale Scientific Applications (SNAP, TAD, MC, ...) • Message-Passing Interface (MPI) • Node Models (Processor, Memory, Cache, I/O and File Systems) and Interconnect Models (Torus, Dragonfly, Fat Tree) • Simian (Parallel Discrete-Event Simulation Engine)

  9. Simian: PDES using Interpreted Languages • Open-source, general-purpose parallel discrete-event simulation library • Independent implementations in three interpreted languages: Python, Lua, and JavaScript • Minimalistic design: about 500 lines of code with 8 common methods (Python implementation) • Simulation code can be Just-In-Time (JIT) compiled to achieve very competitive event rates • Supports the process-oriented world view (using Python greenlets and Lua coroutines)
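To make the process-oriented world view concrete, below is a minimal sketch of a greenlet-based scheduler. It is an illustrative toy under assumed semantics, not Simian's actual API: a process suspends itself by switching back to the scheduler and is resumed when its wake-up time is reached.

```python
# Minimal process-oriented discrete-event sketch (illustrative only, NOT
# Simian's API). A process is a greenlet that hands control back to the
# scheduler and is resumed when its wake-up time arrives.
import heapq
from greenlet import greenlet

class Scheduler:
    def __init__(self):
        self.now = 0.0
        self.events = []                 # (time, tie-breaker, process) heap
        self.main = greenlet.getcurrent()

    def spawn(self, func):
        proc = greenlet(lambda: func(self))
        heapq.heappush(self.events, (self.now, id(proc), proc))

    def sleep(self, delay):
        proc = greenlet.getcurrent()
        heapq.heappush(self.events, (self.now + delay, id(proc), proc))
        self.main.switch()               # suspend until resumed by run()

    def run(self):
        while self.events:
            self.now, _, proc = heapq.heappop(self.events)
            proc.switch()                # resume the process at its wake-up time

def worker(sched):
    for i in range(3):
        print(f"t={sched.now:.1f}: compute step {i}")
        sched.sleep(1.5)                 # model 1.5 time units of work

sched = Scheduler()
sched.spawn(worker)
sched.run()
```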

  10. Integrated MPI Model • Developed based on Simian (entities, processes, services) • Includes all common MPI functions • Point-to-point and collective operations • Blocking and non-blocking operations • Sub-communicators and sub-groups • Packet-oriented model • Large messages are broken down into packets (e.g., 64 B) • Reliable data transfer • Acknowledgment, retransmission, etc.
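As a rough illustration of the packet-oriented idea (with assumed bandwidth, latency, and hop values, not PPT's actual implementation), a message can be broken into 64 B packets whose pipelined transfer time is then estimated:

```python
# Illustrative packet-oriented message model (assumed parameters, not PPT's
# implementation): a message is split into fixed-size packets and each packet
# pays per-hop bandwidth and latency costs.
import math

PACKET_SIZE = 64          # bytes per packet (the slide suggests 64 B)

def packetize(msg_bytes, packet_size=PACKET_SIZE):
    """Number of packets a message is broken into."""
    return max(1, math.ceil(msg_bytes / packet_size))

def estimate_transfer_time(msg_bytes, link_bw=4.7e9, link_latency=1.3e-6, hops=3):
    """Rough store-and-forward estimate; bandwidth in bytes/s, latency in s.
    The bandwidth/latency/hop values are placeholders, not measured constants."""
    npkts = packetize(msg_bytes)
    per_packet = PACKET_SIZE / link_bw + link_latency
    # first packet crosses all hops; the remaining packets pipeline behind it
    return hops * per_packet + (npkts - 1) * per_packet

if __name__ == "__main__":
    for size in (512, 64 * 1024, 1 << 20):
        print(f"{size:>8} B -> {packetize(size):>6} packets, "
              f"~{estimate_transfer_time(size) * 1e6:.1f} us")
```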

  11. MPI Example [Figure: code listings showing the hardware configuration, the MPI application, and the call that runs the MPI simulation]
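Since the slide's code listings are not reproduced in this transcript, the sketch below shows one hypothetical way those three parts could fit together in Python. Every name in it (hardware_config, ring_app, FakeMPI) is an illustrative placeholder, not PPT's interface:

```python
# Hypothetical shape of a simulated MPI example; mirrors the three parts on
# the slide: hardware configuration, MPI application, run.

# 1. Hardware configuration: topology and scale of the modeled machine.
hardware_config = {
    "interconnect": "torus",   # e.g., torus / dragonfly / fattree
    "dims": (8, 8, 8),         # 3D torus dimensions
    "ranks": 4,                # number of simulated MPI ranks (tiny demo)
}

# 2. MPI application: pseudo-code that each simulated rank would execute.
def ring_app(mpi, rank, size):
    right, left = (rank + 1) % size, (rank - 1) % size
    mpi.send(rank, dest=right, nbytes=64 * 1024)   # pass 64 KB around a ring
    mpi.recv(rank, source=left)

# 3. Run: a stand-in "simulator" that just records the traffic each rank issues.
class FakeMPI:
    def __init__(self):
        self.log = []
    def send(self, rank, dest, nbytes):
        self.log.append((rank, "send", dest, nbytes))
    def recv(self, rank, source):
        self.log.append((rank, "recv", source, None))

mpi = FakeMPI()
for r in range(hardware_config["ranks"]):
    ring_app(mpi, r, hardware_config["ranks"])
print(mpi.log)
```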

  12. Interconnect Model [Figure: interconnect model built from Simian entities, processes, and services. A Host entity runs a receive_process() over its input/output ports; a Switch entity runs a routing_process() over ports in the ±X, ±Y, ±Z directions. A packet arrival at another Simian entity is scheduled via req.service(handle_packet_arrival), which invokes the handle_packet_arrival() Simian service.]

  13. Interconnect Model (Contd.) • Common interconnect topologies • Torus (Gemini, Blue Gene/Q) • Dragonfly (Aries) • Fat-tree (InfiniBand) • Some properties: • Emphasis on production systems • Cielo, Darter, Edison, Hopper, Mira, Sequoia, Stampede, Titan, Vulcan, … • Seamlessly integrated with MPI • Scalable to a large number of nodes • Detailed congestion modeling
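As a quick aside on scale, the following sketch sizes two of these topologies with standard textbook formulas (a dragonfly with p terminals per router, a routers per group, and h global links per router supports g = a*h + 1 fully connected groups). It is not PPT code, and the parameter values are illustrative:

```python
# Back-of-the-envelope topology sizing (textbook formulas, not PPT code).
def torus_nodes(dims):
    """Total nodes in a torus with the given per-dimension sizes."""
    n = 1
    for d in dims:
        n *= d
    return n

def dragonfly_nodes(p, a, h):
    """p terminals/router, a routers/group, h global links/router."""
    groups = a * h + 1          # fully connected group graph
    return p * a * groups

print(torus_nodes((16, 16, 16)))          # 4096-node 3D torus
print(dragonfly_nodes(p=4, a=8, h=4))     # 1056 nodes (balanced a = 2p = 2h)
```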

  14. 3D Torus – Cray’s Gemini Interconnect • 3D torus direct topology • Each building block has • 2 compute nodes • 10 torus connections • ±X ×2, ±Y, ±Z ×2 • Examples: Jaguar (ORNL), Hopper (NERSC), Cielo (LANL)
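To make the wrap-around property of a torus concrete, here is a small illustrative helper (not PPT code) that lists a node's six neighbors on a plain 3D torus; Gemini additionally doubles the X and Z links, which this generic sketch does not model:

```python
# Illustrative helper: neighbors of a node in a 3D torus wrap around in each
# dimension, which is what distinguishes a torus from a mesh.
def torus_neighbors(coord, dims):
    """coord = (x, y, z), dims = (X, Y, Z); returns the 6 neighbor coordinates."""
    neighbors = []
    for axis in range(3):
        for step in (-1, +1):
            nxt = list(coord)
            nxt[axis] = (nxt[axis] + step) % dims[axis]   # wrap-around link
            neighbors.append(tuple(nxt))
    return neighbors

# Example: corner node (0, 0, 0) of an 8x8x8 torus links back to 7 on each axis.
print(torus_neighbors((0, 0, 0), (8, 8, 8)))
```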

  15. Gemini Validation • Compared against empirical results from Hopper @ NERSC [Figure: Gemini FMA put throughput (Gbytes/sec), empirical (as reported in [2]) versus simulated, as a function of transfer size (8 B to 32 KB) for 1, 2, and 4 processes per node (PPN)]

  16. Trace-Driven Simulation • Mini-app MPI traces: • Traces generated when running mini-apps on NERSC Hopper (Cray XE6) with <=1024 cores • Each trace records the MPI calls (including timing, source/destination ranks, data size, ...) • Example trace record: Start time = 0.409470006, End time = 0.410042020, MPI call = MPI_Isend, Count = 2601, Data type = MPI_DOUBLE, Destination rank = 16, Request ID = 9
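A hypothetical parser for such a record, assuming the field order reconstructed above (the real trace format may differ):

```python
# Illustrative trace-record parser (assumed field order, not the actual
# trace format). Each record carries enough to replay the MPI call.
from collections import namedtuple

TraceRecord = namedtuple(
    "TraceRecord",
    ["start", "end", "call", "count", "datatype", "dest", "request_id"],
)

def parse_trace_line(line):
    start, end, call, count, dtype, dest, req = line.split()
    return TraceRecord(float(start), float(end), call,
                       int(count), dtype, int(dest), int(req))

rec = parse_trace_line(
    "0.409470006 0.410042020 MPI_Isend 2601 MPI_DOUBLE 16 9")
print(rec.call, rec.end - rec.start, "seconds")
```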

  17. Trace-Driven Simulation (Contd.) • For this experiment, we use: • LULESH mini-app from ExMatEx • 64 MPI processes • Run the trace for each MPI rank • Start each MPI call at exactly the same time indicated in the trace file • Store the completion time of each MPI call • Compare it with the completion time in the trace file [Figures: duration of MPI calls (nanoseconds) over a 10-second window, trace data versus simulation (with time shift)]

  18. Case Study: SN Application Proxy • SNAP is a “mini-app” (proxy application) for PARTISN • PARTISN is a code for solving the radiation transport equation for neutron and gamma transport • Uses MPI to facilitate communication • Uses the node model to compute time [Figure: predicted (SNAPSim) versus measured (SNAP) execution time in seconds as a function of the number of processes (up to 1600), for a 64 × 32 × 48 spatial mesh, 384 angles, and 42 energy groups, on NERSC’s Edison supercomputer, a Cray XC30 system with the Aries interconnect]

  19. Parallel Performance • 1500-node cluster at LANL, connected by an InfiniBand QDR interconnect • MPI_Allreduce, with different data sizes (1 KB or 4 KB) • About three times the event rate of a C++ parallel simulator (MiniSSF) [Figure: run time (seconds) and event rate versus number of cores (12 to 3072), for the 1 KB and 4 KB cases]

  20. Outline • Motivation • Performance Prediction Toolkit (PPT) • Automatic Performance Prediction • Conclusion

  21. The Framework • Purpose is to maintain accuracy, performance, flexibility, and scalability, so as to enable studies of large-scale applications • Steps of an application performance analysis: • Start with an application program • Statically analyze the program to build an abstract model • Transform it into an executable model (encompassing CPU, GPU, and communication) • Run the model with HPC simulation (for performance prediction)

  22. The Framework (Contd.)

  23. Static Analysis • Derive an abstract model • GPU computation • Identify GPU kernels • Based on COMPASS • Obtain workload (flops and memory loads/stores) • CPU computation • Transform source code to IR using LLVM • Use the Analytical Memory Model (AMM) to model resource-specific operations (e.g., loads, stores)
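As a rough illustration of the CPU-side workload extraction (not the AMM itself), one could count resource-specific operations in textual LLVM IR produced with, e.g., `clang -S -emit-llvm`; the opcode list and file name below are assumptions:

```python
# Rough illustration: count resource-specific operations in textual LLVM IR
# (e.g., produced with `clang -S -emit-llvm app.c -o app.ll`). Not the AMM.
import re
from collections import Counter

OPCODES = ("load", "store", "fadd", "fmul", "fdiv", "call")

def count_ops(ll_path):
    counts = Counter()
    with open(ll_path) as f:
        for line in f:
            for op in OPCODES:
                # opcode appears as a bare word within an IR instruction
                if re.search(rf"\b{op}\b", line):
                    counts[op] += 1
    return counts

# Example (hypothetical file name):
# print(count_ops("laplace2d.ll"))
```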

  24. GPU Model Building • OpenARC provides • Memory-to-GPU transfers (and vice versa), loads, stores, flops, etc. • Build a GPU-warp task list from the OpenARC-generated IR
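One plausible shape for such a warp task list, and how a hardware model might walk it, is sketched below; the operation names and cycle costs are invented for illustration and are not OpenARC or PPT output:

```python
# Sketch of a GPU "task list" for one warp: a sequence of (operation, count)
# pairs that a GPU hardware model can walk to accumulate predicted cycles.
# The operation names and costs are illustrative assumptions.
warp_tasklist = [
    ("GLOB_MEM_LOAD", 4),    # global-memory loads per warp iteration
    ("FLOP_SP", 12),         # single-precision floating-point operations
    ("GLOB_MEM_STORE", 1),   # global-memory store of the result
]

# Hypothetical per-operation costs in cycles for some GPU model.
COST = {"GLOB_MEM_LOAD": 400, "GLOB_MEM_STORE": 400, "FLOP_SP": 4}

def warp_cycles(tasklist, cost=COST):
    return sum(cost[op] * n for op, n in tasklist)

print(warp_cycles(warp_tasklist))   # crude cycle estimate for one warp pass
```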

  25. Execution Model • Launch application model on PPT • PPT features • Hardware models (processor, memory, GPU) • Full-fledged MPI model • Detailed interconnect models • Large-scale workload model

  26. Experiment: Runtime Prediction (CPU) • Laplace 2D benchmark • Compute-intensive application • Four different mesh sizes • With and without compiler optimizations • Two Intel Xeon processors running at 2.4 GHz • Observations • 7.08% error (with optimizations) • 3.12% error (without optimizations)

  27. Experiment: Runtime Prediction (GPU) • Application: Laplace 2D MM • Two 8-core Xeon E5-5645 @ 2.1 GHz • NVIDIA GeForce (GM204) • Observations: • 13.8% error for 1024 × 1024 • 0.16% error for 8192 × 8192

  28. Outline • Motivation • Performance Prediction Toolkit (PPT) • Automatic Performance Prediction • Conclusion

  29. Conclusion • Building a full HPC performance prediction model • PPT – Performance Prediction Toolkit • MPI model and interconnection network models (torus, dragonfly, fat-tree) • Automatic application performance prediction • Future work: • Apply dynamic analysis and ML for irregular applications • Automatic application optimization framework
