Patrick G Bridges Kevin Pedretti and Dorian Arnold Patrick G. - - PowerPoint PPT Presentation

slide-1
SLIDE 1

Department of Computer Science

Patrick G. Bridges, Kevin Pedretti, and Dorian Arnold

slide-2
SLIDE 2

Scalable Systems Lab

slide-3
SLIDE 3

If it doesn’t exist, how can we break it?

What will break that we don’t yet know about?

slide-4
SLIDE 4

Accuracy vs. time‐to‐solution tradeoffs

Detailed: exascale‐class machine to simulate an exaflop machine

Fast: probably only see effects we already expected to see

slide-5
SLIDE 5

New machines evolving from current architectures

But some key features will be very different

  • Memory, storage architecture
  • Network interfaces

Leverage current machines to scale large simulations

  • Emulate features similar to those on existing systems
  • Completely simulate radically new features

Understand impact of new features across entire system

Trade off some accuracy for scale and time to solution

slide-6
SLIDE 6

Understand the impact of modified core performance

  • Many more or faster cores
  • Cores with heterogeneous performance

Global non‐coherent addressing

  • Impacts programming model
  • May impact OS structure

Persistent memory systems

Active messaging network interfaces

Impact of different kinds and rates of failures

slide-7
SLIDE 7

Goal: Large‐scale, fast emulation of exascale systems

Leverage large‐scale virtualization technology

Dilate time in the virtual machine to make minor changes to CPU/network performance

Simulate new features using attached SST simulator

  • VMM calls into simulator to handle new devices
  • Simulator runs at user level on OS that hosts VMM

Loosely synchronize per‐node simulations

slide-8
SLIDE 8

slide-9
SLIDE 9

Run the virtual machine slower than real time

  • Gives time to emulate more or faster CPUs
  • Also changes behavior/speed of underlying devices (e.g. NICs)

Previously researched for loosely‐coupled clusters

  • Emulating faster NICs (DieCast)
  • Uses fixed slowdown from real time

Requires careful management of virtual time in the virtual machine monitor

Not previously used for integration of simulator
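
The arithmetic behind a fixed slowdown can be sketched in a few lines. This is an illustration only: the names `DilatedClock`, `tdf` (time dilation factor), and `perceived_rate`, as well as the example numbers, are assumptions and do not come from Palacios or DieCast.

```python
class DilatedClock:
    """Guest clock that advances tdf times slower than host time (hypothetical)."""

    def __init__(self, tdf: float, host_start: float = 0.0) -> None:
        if tdf <= 0:
            raise ValueError("time dilation factor must be positive")
        self.tdf = tdf
        self.host_start = host_start

    def guest_time(self, host_now: float) -> float:
        """Guest seconds elapsed once the host clock reads host_now."""
        return (host_now - self.host_start) / self.tdf


def perceived_rate(physical_rate: float, tdf: float) -> float:
    """Rate the dilated guest observes for a CPU or device that delivers
    physical_rate units per host second."""
    return physical_rate * tdf


clock = DilatedClock(tdf=10.0)
# 10 host seconds look like 1 guest second.
print(clock.guest_time(10.0))      # 1.0
# A 1 Gb/s NIC looks like a 10 Gb/s NIC from inside the guest.
print(perceived_rate(1e9, 10.0))   # 10000000000.0
```

Because every physical resource appears faster by the same factor, the VMM still has to throttle whichever resources should not speed up, which is part of the virtual time management noted above.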

slide-10
SLIDE 10

Simulate behavior of devices that do not yet exist

  • Network interfaces
  • New memory and storage devices
  • Interesting processor features

VMM/Simulator interface

  • VMM hooks physical interfaces to new device
  • Invoke simulator when physical device is touched
  • Pause passage of time in the local VMM when simulating

Using Sandia Structural Simulation Toolkit
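
The hook, invoke, and pause pattern from the VMM/Simulator interface bullets can be sketched as a toy model. Everything here is hypothetical: `ToyVMM`, `GuestClock`, `toy_nic_model`, and the device address are invented for illustration and are not the Palacios or SST interfaces. Charging the guest the simulator's modeled latency after the pause is also an assumption.

```python
from typing import Callable, Dict


class GuestClock:
    """Guest-visible clock that can be paused while a simulator runs."""

    def __init__(self) -> None:
        self.guest_ns = 0
        self.paused = False

    def advance(self, host_ns: int) -> None:
        # Host time only counts toward guest time when the clock is running.
        if not self.paused:
            self.guest_ns += host_ns


class ToyVMM:
    """Toy monitor that forwards accesses to hooked devices to a simulator."""

    def __init__(self) -> None:
        self.clock = GuestClock()
        self.hooks: Dict[int, Callable[[int], int]] = {}

    def hook_device(self, addr: int, simulate: Callable[[int], int]) -> None:
        # The callback returns the modeled device latency in nanoseconds.
        self.hooks[addr] = simulate

    def guest_access(self, addr: int, value: int, host_cost_ns: int) -> int:
        """Guest touches a hooked address; host_cost_ns is the host time
        the simulator consumed while handling the access."""
        simulate = self.hooks[addr]
        self.clock.paused = True              # stop guest time while simulating
        try:
            modeled_ns = simulate(value)
            self.clock.advance(host_cost_ns)  # ignored because the clock is paused
        finally:
            self.clock.paused = False
        # Assumption: the guest is charged only the modeled latency.
        self.clock.guest_ns += modeled_ns
        return modeled_ns


def toy_nic_model(nbytes: int) -> int:
    """Made-up latency model: 500 ns fixed cost plus 1 ns per 10 bytes."""
    return 500 + nbytes // 10


vmm = ToyVMM()
vmm.hook_device(0x1000, toy_nic_model)
vmm.clock.advance(1_000)                      # ordinary guest execution
latency = vmm.guest_access(0x1000, 1000, host_cost_ns=2_000_000)
print(latency, vmm.clock.guest_ns)            # 600 1600
```

The 2 ms of host time spent in the simulator never reaches the guest clock, so the guest sees only the 600 ns the model reports.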

slide-11
SLIDE 11

Simulator runs at user level parallel to machine being simulated

VMM redirects calls between the VMM and the simulator

Causes time to pass at uneven rates in different simulations!

slide-12
SLIDE 12

Complete accuracy requires synchronizing actions across multiple machines

  • Preserve causality between actions on multiple machines
  • Make sure time passes consistently across entire system
  • Potentially very expensive

Fixed time dilation avoids this by synchronizing systems to a uniform clock dilated from real time

Not sufficient for us: uncertain simulation slowdowns!

slide-13
SLIDE 13

Slack Emulation – keep simulations roughly in check and assume inaccuracies are minor

Already been used in multicore CPU simulators

Extend to large‐scale system simulation

Nodes periodically agree on slowdown factor

  • Natural interface with time dilation simulation
  • Low slowdown when possible, high slowdown when needed

Assumes highly‐accurate small‐scale simulations also being used to validate the simulation
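
One way the periodic agreement on a slowdown factor could work is sketched below. The slides do not say how nodes agree. Taking the maximum of the per-node requirements is an assumption, and `NodeEpoch`, `agree_on_slowdown`, and the sample numbers are hypothetical. On a real machine the `max` would likely be a collective reduction across nodes rather than a local loop.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeEpoch:
    """What one node measured during the last epoch (hypothetical record)."""
    guest_ns: int   # guest time the node covered
    host_ns: int    # host time it needed, including simulator work

    def required_slowdown(self) -> float:
        return self.host_ns / self.guest_ns


def agree_on_slowdown(epochs: List[NodeEpoch], floor: float = 1.0) -> float:
    """Pick one slowdown factor for the next epoch.

    Assumption: use the largest per-node requirement so the slowest
    simulation can keep up, and never go below `floor`.
    """
    return max(floor, max(e.required_slowdown() for e in epochs))


# One node spent 4x real time in its device simulator during this epoch.
busy = [
    NodeEpoch(guest_ns=1_000_000, host_ns=1_000_000),
    NodeEpoch(guest_ns=1_000_000, host_ns=4_000_000),
    NodeEpoch(guest_ns=1_000_000, host_ns=2_500_000),
]
print(agree_on_slowdown(busy))   # 4.0

# No simulator activity: the factor drops back to the floor.
idle = [NodeEpoch(1_000_000, 800_000), NodeEpoch(1_000_000, 1_000_000)]
print(agree_on_slowdown(idle))   # 1.0
```

Recomputing the factor each epoch gives low slowdown when the simulators are idle and high slowdown only when some node needs it.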

slide-14
SLIDE 14

Need tools to understand system behavior

Integrate performance monitoring tools at base level of simulation/emulation system

Understand App/OS/Hardware interactions

Monitor distributed interactions

Estimate potential inaccuracy in simulations

slide-15
SLIDE 15

Leveraging Palacios HPC‐oriented VMM

  • Low‐overhead virtualization on HPC systems
  • < 5% overhead on Cray XT systems @ 4000 nodes
  • Open source

Enhanced Palacios time virtualization features

  • Can fully virtualize time
  • Pause, resume, slow down guest time passage
  • Adding complete time dilation support

Implemented interface for host‐level devices to tie to simulators

slide-16
SLIDE 16

Dynamic time dilation rates

Simulation of simple devices

  • Basic persistent memory devices
  • Existing NIC simulation (Cray SeaStar functional simulation)
  • Global addressing

Basic performance monitoring device integration

slide-17
SLIDE 17

DOE Office of Science, Advanced Scientific Computing Research, award number DE‐SC0005050, program manager Sonia Sachs

Faculty sabbatical appointment from Sandia

Ron Brightwell for giving this presentation

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under contract DE‐AC04‐94AL85000
