Patrick G. Bridges, Kevin Pedretti, and Dorian Arnold

  1. Department of Computer Science
     Patrick G. Bridges, Kevin Pedretti, and Dorian Arnold

  2. Scalable Systems Lab

  3. • If it doesn't exist, how can we break it?
     • What will break that we don't yet know about?

  4. • Accuracy vs. time-to-solution tradeoffs
     • Detailed: exascale-class machine to simulate an exaflop machine
     • Fast: probably only see effects we already expected to see

  5. • New machines evolving from current architectures
     • But some key features will be very different
       ◦ Memory, storage architecture
       ◦ Network interfaces
     • Leverage current machines to scale large simulations
       ◦ Emulate features similar to those on existing systems
       ◦ Completely simulate radically new features
     • Understand impact of new features across entire system
     • Trade off some accuracy for scale and time to solution

  6. • Understand the impact of modified core performance
       ◦ Many more or faster cores
       ◦ Cores with heterogeneous performance
     • Global non-coherent addressing
       ◦ Impacts programming model
       ◦ May impact OS structure
     • Persistent memory systems
     • Active messaging network interfaces
     • Impact of different kinds and rates of failures

  7. • Goal: large-scale, fast emulation of exascale systems
     • Leverage large-scale virtualization technology
     • Dilate time in the virtual machine to make minor changes to CPU/network performance
     • Simulate new features using attached SST simulator
       ◦ VMM calls into simulator to handle new devices
       ◦ Simulator runs at user level on OS that hosts VMM
     • Loosely synchronize per-node simulations

  8. Scalable Systems Lab

  9. • Run the virtual machine slower than real time
       ◦ Gives time to emulate more or faster CPUs
       ◦ Also changes behavior/speed of underlying devices (e.g. NICs)
     • Previously researched for loosely-coupled clusters
       ◦ Emulating faster NICs (DieCast)
       ◦ Uses fixed slowdown from real time
     • Requires careful management of virtual time in the virtual machine monitor
     • Not previously used for integration of simulator
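The arithmetic behind time dilation can be sketched as follows. This is a minimal illustration, not the Palacios or DieCast implementation: the class name `DilatedClock`, its fields, and the factor of 10 are all assumptions chosen for the example. The guest clock advances `dilation` times slower than the host clock, so a physical device appears `dilation` times faster from inside the guest.

```python
from dataclasses import dataclass


@dataclass
class DilatedClock:
    """Guest clock that advances `dilation` times slower than the host clock.

    Hypothetical helper for illustration; not an API from any real VMM.
    """

    dilation: float            # assumed time dilation factor (> 1 slows the guest)
    host_origin: float = 0.0   # host time (seconds) at which the guest clock started

    def guest_time(self, host_now: float) -> float:
        """Convert a host timestamp into the elapsed time the guest observes."""
        return (host_now - self.host_origin) / self.dilation

    def apparent_rate(self, physical_rate: float) -> float:
        """Rate (e.g. bits/s) a physical device appears to have inside the guest."""
        return physical_rate * self.dilation


clock = DilatedClock(dilation=10.0)
print(clock.guest_time(host_now=5.0))  # guest seconds elapsed after 5 host seconds
print(clock.apparent_rate(1e9))        # apparent bits/s of a 1 Gb/s physical NIC
```

With a fixed factor, as in the loosely-coupled cluster work mentioned above, `dilation` never changes after construction; a dynamic scheme would need to re-anchor `host_origin` whenever the factor changes.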

  10. • Simulate behavior of devices that do not yet exist
        ◦ Network interfaces
        ◦ New memory and storage devices
        ◦ Interesting processor features
      • VMM/Simulator interface
        ◦ VMM hooks physical interfaces to new device
        ◦ Invoke simulator when physical device is touched
        ◦ Pause passage of time in the local VMM when simulating
      • Using Sandia Structural Simulation Toolkit
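The hook, simulate, pause pattern above might look roughly like this. It is a sketch under stated assumptions, not the Palacios or SST interface: `GuestClock`, `on_device_access`, and `simulated_nic_send` are hypothetical names, and charging the guest the simulator's reported latency after resuming is an assumption the slides do not spell out.

```python
class GuestClock:
    """Hypothetical guest clock that can be paused while a simulator runs."""

    def __init__(self):
        self.guest_now = 0.0   # guest-visible time in seconds
        self.paused = False

    def advance(self, host_elapsed):
        """Let guest time follow host time, unless the clock is paused."""
        if not self.paused:
            self.guest_now += host_elapsed

    def pause(self):
        self.paused = True

    def resume(self):
        self.paused = False


def simulated_nic_send(nbytes):
    """Hypothetical device model: simulated latency in seconds for one send."""
    return 1e-6 + nbytes / 12.5e9   # assumed fixed overhead plus wire time


def on_device_access(clock, device_model, request, host_cost):
    """Handle a guest access to a hooked device.

    `host_cost` is the wall-clock time the simulator took; it is hidden
    from the guest because the clock is paused while the model runs.
    """
    clock.pause()
    latency = device_model(request)
    clock.advance(host_cost)      # no effect: the clock is paused
    clock.resume()
    clock.guest_now += latency    # assumption: charge only the simulated latency
    return latency


clock = GuestClock()
clock.advance(1.0)                                        # ordinary guest execution
on_device_access(clock, simulated_nic_send, 4096, host_cost=0.25)
print(clock.guest_now)   # slightly above 1.0; the 0.25 s of simulation is hidden
```

Because each node pauses for a different amount of host time, guest clocks on different nodes drift apart unless something resynchronizes them.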

  11. • Simulator runs at user level parallel to machine being simulated
      • VMM redirects calls between the VMM and the simulator
      • Causes time to pass at uneven rates in different simulations!

  12. • Complete accuracy requires synchronizing actions across multiple machines
        ◦ Preserve causality between actions on multiple machines
        ◦ Make sure time passes consistently across entire system
        ◦ Potentially very expensive
      • Fixed time dilation avoids this by synchronizing systems to a uniform clock dilated from real time
      • Not sufficient for us: uncertain simulation slowdowns!

  13. • Slack emulation: keep simulations roughly in check and assume inaccuracies are minor
      • Already been used in multicore CPU simulators
      • Extend to large-scale system simulation
      • Nodes periodically agree on slowdown factor
        ◦ Natural interface with time dilation simulation
        ◦ Low slowdown when possible, high slowdown when needed
      • Assumes highly-accurate small-scale simulations also being used to validate the simulation
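One way the periodic agreement could work is sketched below. This is an assumption-laden illustration rather than the authors' protocol: it assumes each node reports the slowdown it needs for the next epoch and that all nodes adopt the maximum (in a real cluster this would likely be a max-reduction across nodes). The function names and the `may_advance` slack check are hypothetical.

```python
def agree_on_slowdown(requested, minimum=1.0):
    """Pick one slowdown factor for the next epoch.

    `requested` holds each node's requested factor; taking the maximum
    (an assumed policy) means no node is forced to run faster than its
    simulator can keep up with.
    """
    return max(minimum, max(requested, default=minimum))


def may_advance(local_time, slowest_time, slack):
    """Slack check: a node may run ahead of the slowest node by at most `slack`."""
    return local_time - slowest_time <= slack


def run_epochs(per_epoch_requests):
    """Return the agreed slowdown factor for each epoch."""
    return [agree_on_slowdown(requests) for requests in per_epoch_requests]


epochs = [
    [1.0, 1.0, 1.0],   # no node is simulating: run near real time
    [1.0, 8.0, 2.0],   # one node has a heavy device simulation pending
    [1.5, 1.0, 1.0],   # load drops again
]
print(run_epochs(epochs))
print(may_advance(local_time=1.2, slowest_time=1.0, slack=0.5))
```

The factor rises only in epochs where some node needs it, which matches the "low slowdown when possible, high slowdown when needed" goal; the accuracy cost is whatever drift the slack window permits.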

  14. • Need tools to understand system behavior
      • Integrate performance monitoring tools at base level of simulation/emulation system
      • Understand App/OS/Hardware interactions
      • Monitor distributed interactions
      • Estimate potential inaccuracy in simulations

  15. • Leveraging Palacios HPC-oriented VMM
        ◦ Low-overhead virtualization on HPC systems
        ◦ < 5% overhead on Cray XT systems @ 4000 nodes
        ◦ Open source
      • Enhanced Palacios time virtualization features
        ◦ Can fully virtualize time
        ◦ Pause, resume, slow down guest time passage
        ◦ Adding complete time dilation support
      • Implemented interface for host-level devices to tie to simulators

  16. • Dynamic time dilation rates
      • Simulation of simple devices
        ◦ Basic persistent memory devices
        ◦ Existing NIC simulation (Cray SeaStar functional simulation)
        ◦ Global addressing
      • Basic performance monitoring device integration

  17. • DOE Office of Science, Advanced Scientific Computing Research, award number DE-SC0005050, program manager Sonia Sachs
      • Faculty sabbatical appointment from Sandia
      • Ron Brightwell for giving this presentation
      • Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under contract DE-AC04-94AL85000
