SLIDE 1

Dynamic Task Scheduling for the Uintah Framework

Qingyu Meng, Justin Luitjens, and Martin Berzins

Thanks to DOE for funding since 1997, NSF since 2008

SLIDE 2

Uintah Applications

Virtual Soldier Angiogenesis Micropin Flow Shaped Charges Sandstone Compaction Foam Compaction Industrial Flares Plume Fires Explosions

SLIDE 3

Uintah Development

  • Uintah has been developed over a decade, based on a far-sighted design by Steve Parker
  • Complete separation of user code and parallelism

                      Tuning Expert (CS)               Domain Expert (Engineering)
Goal                  Performance, Scalability         Problem, Methods
Responsibility        Infrastructure Components        Simulation Components
Major Contributions   Load balancing, AMR,             Arches, ICE, MPM, MPM-ICE, etc.
                      task-graph based scheduling,
                      asynchronous communication
View of Program       Parallel infrastructure:         Serial code written for a patch
                      MPI, threads

SLIDE 4

How Does Uintah Work?

Task-Graph Specification

  • Computes & Requires
  • No processor or domain information

Patch-Based Domain Decomposition

SLIDE 5

Patch-Based Domain Decomposition

[Figure: a mesh of cells (holding particles) is grouped into patches; patches are assigned to processors. The adaptive mesh is periodically regridded and load balanced.]

SLIDE 6

Uintah Scalability

  • Currently scales up to 98K cores on Kraken
  • Preparing for future machines
  • Petascale
  • Exascale, 2018~2020

e.g. the Aggressive Strawman design: 742 cores per socket, 166,113,024 cores in total
(DARPA hardware report, 2009)

SLIDE 7

Software Model for Exascale

  • The Silver model for exascale software, which must support:
  • Directed dynamic graph execution
  • Latency hiding
  • Minimized synchronization and overheads
  • Adaptive resource scheduling
  • Heterogeneous processing
  • A graph-based asynchronous-task work queue model

(DARPA software report, 2009)

SLIDE 8

Graph Based Applications

[Figure: example task dependency graph]

  • Charm++: object-based virtualization
  • Intel CnC: a new language for graph-based parallelism
  • PLASMA (Dongarra): DAG-based parallel linear algebra software

SLIDE 9

Uintah Distributed Task Graph

  • Up to 2 million tasks per timestep globally
  • Tasks on local and neighboring patches
  • Callback by each patch
  • Variables in data warehouse (DW)
  • Read: get() from OldDW and NewDW
  • Write: put() to NewDW
  • Communication on cutting edges

[Figure: task graph for 4 patches, single-level ICE]
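The get()/put() pattern above can be sketched as a toy data warehouse. This is an illustrative sketch, not the real Uintah API: the class layout, the scalar-per-patch storage, and the toy velocity formula are all assumptions.

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Minimal sketch of the data-warehouse idea: a task reads its "requires"
// from the old warehouse and publishes its "computes" to the new one.
class DataWarehouse {
public:
    void put(const std::string& name, int patch, double value) {
        vars_[{name, patch}] = value;            // write a per-patch variable
    }
    double get(const std::string& name, int patch) const {
        auto it = vars_.find({name, patch});
        if (it == vars_.end())
            throw std::runtime_error("unsatisfied requires: " + name);
        return it->second;
    }
private:
    std::map<std::pair<std::string, int>, double> vars_;
};

// A "task" callback: requires pressure from OldDW, computes a velocity
// into NewDW (the formula is a placeholder, not ICE physics).
void computeVelocityTask(const DataWarehouse& oldDW, DataWarehouse& newDW,
                         int patch) {
    double p = oldDW.get("press", patch);        // requires
    newDW.put("u_vel", patch, p * 0.5);          // computes
}
```

Because tasks only name what they read and write, the infrastructure can see every dependency without the task knowing which processor produced its inputs.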

SLIDE 10

Example Uintah Task from the ICE Algorithm

Compute face-centered velocities:

[Figure: task code listing its input variables and output variables, including boundary conditions]
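A task declaration of this shape can be sketched as follows. The TaskSpec type and the variable names (press_CC, uvel_FC, ...) are illustrative assumptions in the style of ICE's cell-centered/face-centered naming, not the actual Uintah interface.

```cpp
#include <string>
#include <vector>

// Sketch of a task declaration: a task lists what it requires and what it
// computes, with no processor or domain information attached. The scheduler
// derives all communication from these declarations.
struct TaskSpec {
    std::string name;
    std::vector<std::string> requiresList;  // input variables
    std::vector<std::string> computesList;  // output variables
};

// The face-centered velocity task of the ICE algorithm, as a declaration.
TaskSpec makeFaceVelocityTask() {
    TaskSpec t;
    t.name = "ICE::computeFaceCenteredVelocities";
    t.requiresList = {"press_CC", "rho_CC", "vel_CC"};   // assumed inputs
    t.computesList = {"uvel_FC", "vvel_FC", "wvel_FC"};  // assumed outputs
    return t;
}
```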

SLIDE 11

Task Graph Compiling

SLIDE 12

Uintah Static Task Scheduler

  • Task list
  • Static analysis
  • In-order execution, in the same order for each patch
  • Task status: Running -> Finished -> Next Task
SLIDE 13

Static Task Graph Execution

1) Static: predetermined order

  • Tasks are synchronized
  • Higher waiting times

[Figure: task dependencies and execution order]

SLIDE 14

Dynamic Task Graph Execution


2) Dynamic: Execute when ready

  • Tasks are Asynchronous
  • Lower waiting times

SLIDE 15

Dynamic Multi-threaded Task Graph Execution


3) Dynamic multi-threaded (future):

  • Task-level parallelism
  • Decreases communication
  • Decreases load imbalance
  • Multicore friendly
  • Supports GPU tasks

SLIDE 16

Uintah Dynamic Task Scheduler

  • Task queues
  • Internal ready (tasks waiting on MPI)
  • External ready (tasks ready for concurrent execution)
  • Task status: Not Scheduled -> Internal Ready -> External Ready -> Running -> Finished -> new task(s) satisfied
  • Multi-threaded execution
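The two-queue design can be sketched as a toy scheduler. All names here are illustrative, and the "all MPI receives completed" check is reduced to a flag; the real scheduler would test outstanding MPI requests and feed worker threads from the external-ready queue.

```cpp
#include <deque>
#include <string>
#include <vector>

// A task whose task-graph dependencies are already satisfied.
struct Task {
    std::string name;
    bool mpiDone;   // stand-in for "all MPI receives have completed"
};

// Toy two-queue dynamic scheduler: internal-ready holds tasks still waiting
// on communication; external-ready holds tasks ready to execute.
class DynamicScheduler {
public:
    void schedule(const Task& t) { internalReady_.push_back(t); }

    // Poll internal-ready tasks; promote those whose communication finished.
    void checkMPI() {
        for (auto it = internalReady_.begin(); it != internalReady_.end();) {
            if (it->mpiDone) {
                externalReady_.push_back(*it);
                it = internalReady_.erase(it);
            } else {
                ++it;
            }
        }
    }

    // Run everything that is externally ready; return executed task names.
    std::vector<std::string> runReady() {
        std::vector<std::string> done;
        while (!externalReady_.empty()) {
            done.push_back(externalReady_.front().name);
            externalReady_.pop_front();
        }
        return done;
    }
private:
    std::deque<Task> internalReady_;   // waiting on MPI
    std::deque<Task> externalReady_;   // ready for concurrent execution
};
```

A task whose messages have not arrived never blocks the queue: other ready tasks run ahead of it, which is exactly the out-of-order behavior the slides describe.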

SLIDE 17

On-demand Data Warehouse

  • Directory-based hash map, keyed by <name, type, patchid>, for fixed-order execution
  • Variable versioning (v1, v2, v3, ...) for out-of-order execution

Example entries: del_T (Global, n/a), press (CC, patch 1), press (CC, patch 2), u_vel (FC, patch 1), ...
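Variable versioning keyed by <name, type, patchid> might look like this minimal sketch. The types and method names are assumptions, not the real Uintah implementation, which manages grid variables rather than scalars.

```cpp
#include <map>
#include <string>
#include <tuple>
#include <vector>

// Variables are keyed by <name, type, patchid>; each put() appends a new
// version, so a task executing out of order can still read the exact
// version it depends on instead of whatever happens to be newest.
using VarKey = std::tuple<std::string, std::string, int>;

class VersionedDW {
public:
    // Store a new version; returns the 1-based version number just written.
    int put(const VarKey& key, double value) {
        versions_[key].push_back(value);
        return static_cast<int>(versions_[key].size());
    }
    // Read a specific version (1-based), as recorded in the task graph.
    double get(const VarKey& key, int version) const {
        return versions_.at(key).at(version - 1);
    }
private:
    std::map<VarKey, std::vector<double>> versions_;
};
```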

SLIDE 18

Schedule Global Sync Task

  • Synchronization tasks
  • Update global variables, e.g. MPI Allreduce
  • Call third-party libraries, e.g. PETSc
  • Out-of-order issues: deadlock, load imbalance
  • Task phases
  • One global task per phase
  • The global task runs last in its phase
  • Out-of-order execution allowed within a phase

[Figure: reductions R1 and R2 issued in different orders on different processors, causing deadlock or load imbalance]
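The task-phase rule can be sketched as an ordering constraint. This is an illustrative model with assumed names: within a phase, local tasks may run in any order, but the phase's single global task (e.g. an Allreduce) runs last, and no task of a later phase starts before it.

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct PhasedTask {
    std::string name;
    int phase;
    bool global;  // true for the one synchronization task of the phase
};

// Produce a legal execution order: phases in sequence; within each phase,
// local tasks first (their relative order is free), then the global task.
std::vector<std::string> orderByPhase(std::vector<PhasedTask> tasks) {
    std::stable_sort(tasks.begin(), tasks.end(),
        [](const PhasedTask& a, const PhasedTask& b) {
            if (a.phase != b.phase) return a.phase < b.phase;
            return a.global < b.global;  // locals (false) before global (true)
        });
    std::vector<std::string> order;
    for (const auto& t : tasks) order.push_back(t.name);
    return order;
}
```

Because every processor runs the global task of phase p before anything in phase p+1, mismatched collective orderings, the deadlock scenario in the figure, cannot arise.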

SLIDE 19

Dynamic Scheduler Performance Improvements

[Figures: strong scaling (fixed problem size) and weak scaling (fixed problem size per core)]

SLIDE 20

Task prioritization algorithms

Algorithm      Queue Length   Wait Time   Overall Time
Random         3.11           18.9        315.35
FCFS           3.16           18.0        308.73
PatchOrder     4.05            7.0        187.19
MostMsg.       4.29            2.6        139.39

[Figure: executed vs. not-yet-executed tasks and their MPI sends]
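The winning MostMsg. policy can be sketched as a priority queue over the external-ready tasks. The struct fields and names are illustrative assumptions; the idea is simply to run first the tasks that satisfy the most external dependencies, so their MPI sends are posted as early as possible.

```cpp
#include <queue>
#include <string>
#include <vector>

struct ReadyTask {
    std::string name;
    int externalDeps;  // external (off-node) dependencies this task satisfies
};

// Comparator for a max-heap on externalDeps: std::priority_queue pops the
// element that compares greatest, so "less messages" sorts lower.
struct FewerMessages {
    bool operator()(const ReadyTask& a, const ReadyTask& b) const {
        return a.externalDeps < b.externalDeps;
    }
};

using PriorityReadyQueue =
    std::priority_queue<ReadyTask, std::vector<ReadyTask>, FewerMessages>;
```

With this policy, boundary tasks that unblock other nodes jump ahead of purely interior work, which matches the wait-time drop from 18.9 to 2.6 in the table.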

SLIDE 21

Granularity Effect

  • Decreasing patch size:
  • (+) Increases queue length
  • (+) More overlap, lower task wait time
  • (+) More patches, better load balance
  • (-) More MPI messages
  • (-) More regridding overhead
  • Other factors
  • Problem size
  • Implied task-level parallelism
  • Interconnect bandwidth and latency
  • CPU cache size
  • Solution: self-tuning?
SLIDE 22

Summary

  • Dynamic task scheduling
  • Supports out-of-order execution
  • Two task queues
  • Variable versioning
  • Task phases
  • Task prioritization algorithms
  • Ready-queue length and task wait time
  • Granularity effects
  • Multi-threading and self-tuning (future work)
SLIDE 23

Questions

SLIDE 24

BACKUP SLIDES

SLIDE 25

Uintah Components

[Figure: Uintah component diagram. The Simulation Controller reads an XML problem specification and drives a Simulation component (Arches, ICE, MPM, MPMICE, MPMArches, ...). The Scheduler executes tasks over MPI; the Data Archiver writes checkpoints and data I/O; the Load Balancer and Regridder interact through callbacks; Models (EoS, constitutive, ...) plug into the simulation. The components divide between the Domain Expert and the Tuning Expert.]

SLIDE 26

Uintah: Task Based Application

  • Automatic dependency analysis
  • Automatic message generation
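Automatic message generation can be sketched by matching declarations across tasks. Everything here is an illustrative model, not the Uintah implementation: by pairing one task's "computes" with another task's "requires" on a different processor, the infrastructure derives the MPI traffic itself, so application code never posts a send or receive.

```cpp
#include <string>
#include <vector>

struct DeclaredTask {
    std::string name;
    int rank;  // processor owning the task's patch
    std::vector<std::string> computesList;
    std::vector<std::string> requiresList;
};

struct Message { std::string var; int from, to; };

// Generate one message for every cross-rank computes -> requires match.
std::vector<Message> generateMessages(const std::vector<DeclaredTask>& tasks) {
    std::vector<Message> msgs;
    for (const auto& producer : tasks)
        for (const auto& var : producer.computesList)
            for (const auto& consumer : tasks)
                if (consumer.rank != producer.rank)
                    for (const auto& req : consumer.requiresList)
                        if (req == var)
                            msgs.push_back({var, producer.rank, consumer.rank});
    return msgs;
}
```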
SLIDE 27

Priority: Most Messages

  • Priority external task queue
  • Gives priority to tasks that satisfy external dependencies first

[Figure: patches on a single core, annotated with external-dependency counts]

SLIDE 28

External Dependency Counter

SLIDE 29

MPM-ICE Algorithm

Uintah originally designed for simulation of fires and explosions, e.g. metal containers embedded in large hydrocarbon fires.

  • ICE is a cell-centered finite volume method for the Navier-Stokes equations
  • ICE now handles both fast and slow flows (2009)
  • MPM is a novel method that uses particles and nodes
  • A Cartesian grid is used as a common frame of reference
  • MPM (solids) and ICE (fluids) exchange data several times per timestep, not just a boundary-condition exchange

[Figure: container with PBX explosive]