SLIDE 1

SC'09, Portland OR, Nov. 16, 2009, MIT/EAPS & Mech. Eng.

C. Evangelinos (ce107@computer.org)

MTAGS 2009

Many Task Computing for Multidisciplinary Ocean Sciences: Real-Time Uncertainty Prediction and Data Assimilation

Constantinos Evangelinos, Pierre F. J. Lermusiaux, Chris Hill, Jinshan Xu, Patrick J. Haley Jr.

MIT, Earth, Atmospheric and Planetary Sciences / Mechanical Engineering

SLIDE 2

Motivation

  • Improve the forecasting capabilities of ocean data assimilation and related fields via increased access to parallelism
  • Move the existing computational framework to a more modern, non-site-specific setup
  • Test the opportunities for executing massive task-count workflows on distributed clusters, Grid and Cloud platforms
  • Provide an external outlet to handle peak demand for compute resources during live experiments in the field
  • Explore educational possibilities in the Cloud

SLIDE 3

Ocean Data Assimilation

dx = M(x, t) + dη;  M is the model operator
y_k = H(x_k, t_k) + ε_k;  H is the measurement operator
min_x J(x_k, y_k; dη, ε_k, Q(t), R_k);  J is the objective function

Model errors are assumed Brownian: dη = N(0, Q(t)) with E{dη(t) dη(t)^T} = Q(t) dt. In fact the models are forced by processes with noise correlated in space and time (meteorology). Measurement errors are white Gaussian: ε_k = N(0, R_k).
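For readers who prefer standard notation, the same estimation problem is restated below in LaTeX. This is only a transcription of the slide's equations; the explicit dt on the model term is implied by the Brownian-increment assumption E{dη dη^T} = Q(t) dt and is an interpretation, not text from the slide.

    \begin{aligned}
    d\mathbf{x} &= \mathcal{M}(\mathbf{x},t)\,dt + d\boldsymbol{\eta}, &\quad E\{d\boldsymbol{\eta}\,d\boldsymbol{\eta}^{T}\} &= \mathbf{Q}(t)\,dt,\\
    \mathbf{y}_k &= \mathcal{H}(\mathbf{x}_k,t_k) + \boldsymbol{\varepsilon}_k, &\quad \boldsymbol{\varepsilon}_k &\sim \mathcal{N}(\mathbf{0},\mathbf{R}_k),\\
    \min_{\mathbf{x}}\; & J(\mathbf{x}_k,\mathbf{y}_k;\, d\boldsymbol{\eta},\, \boldsymbol{\varepsilon}_k,\, \mathbf{Q}(t),\, \mathbf{R}_k). &&
    \end{aligned}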

SLIDE 4

Ocean Acoustics

Estimates of the ocean temperature and salinity fields (and their uncertainties) are necessary for calculating acoustic fields and their uncertainties.

  • Sound-propagation studies often focus on vertical sections. Time is fixed and an acoustic broadband transmission loss (TL) field is computed for each ocean realization.
  • A sound source of specific frequency, location and depth is chosen. The coupled physical-acoustical covariance P for the section is computed, non-dimensionalized and used for assimilation of hydrographic and TL data.

SLIDE 5

Acoustic climatology maps

  • Underwater acoustics transmission loss variability predictions in a 56 x 33 km area northeast of Taiwan.
  • 2D propagation over a 15 km distance at 31 x 31 = 961 grid points x 8 directions.
  • Each job is a short, roughly 3-minute 2D acoustic ray propagation problem.
  • Distributed on 100 dual-core compute nodes; speedup of more than 100x in the real-time experiment, limited by the SGE overhead of scheduling short jobs (see the rough accounting below).
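A rough accounting of this workload, using only the numbers quoted on this slide; the last figure is the ideal wall time under perfect scheduling, which the observed >100x speedup approaches but does not reach because of the per-job scheduling overhead:

    31 \times 31 \times 8 = 7688 \ \text{tasks}, \qquad
    7688 \times 3\ \text{min} \approx 384\ \text{CPU-hours}, \qquad
    \frac{384\ \text{CPU-hours}}{100 \times 2\ \text{cores}} \approx 1.9\ \text{h of ideal wall time.}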

[Figure: Mean transmission loss (TL), TL STD over depth, and TL STD over bearing across the 77 km x 65 km domain (mean TL roughly 55-65 dB; STD color scales roughly 0.1-3 dB), showing the effect of internal tides and of steep bathymetry.]

SLIDE 6

Canyon Nx2D acoustics modeling

– OMAS moving sound source

Bathymetry of Mien Hua Canyon

SLIDE 7

AOSN-II Monterey Bay

SLIDE 8

Error Subspace Statistical Estimation

SLIDE 9

ESSE Surf. Temp. Error Standard Deviation Forecasts for AOSN-II

[Figure: Forecast panels for Aug 12, 13, 14 and Aug 24, 27, 28, spanning the Start of Upwelling, First Upwelling period, End of Relaxation and Second Upwelling period. Leonard and Ramp, Lead PIs.]

SLIDE 10

Serial and Parallel ESSE workflows

SLIDE 11

The ESSE workflow engine

  • Is actually (for historical and practical reasons) a heavily modified C-shell script (the master)!
    – Catches signals so it can kill all remaining jobs
  • Grid Engine, Condor and PBS variants
    – Submits and tracks singleton jobs
      • Or uses job arrays for scalability
    – Further variants depending on the I/O strategy:
      • Separate pert singletons?
      • Input/output to shared or local disk (or mixed)?
      • Shared directories store files with the execution status of each of the singleton scripts
  • Singletons need the perturbation number: tricks! (see the sketch below)
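As an illustration of one such trick: with Grid Engine job arrays the member index can be derived from $SGE_TASK_ID inside the singleton, so the master submits a single array job rather than hundreds of individual ones. The sketch below is only a guess at the shape of such a singleton; the directory layout, file names and the way pert/pemodel are invoked are assumptions, not the actual MIT scripts.

    #!/bin/csh -f
    #$ -t 1-600
    #$ -cwd
    #$ -j y
    # Hypothetical ESSE singleton: the perturbation number comes from the
    # scheduler ($SGE_TASK_ID) instead of a per-job command-line argument.
    set pert = $SGE_TASK_ID
    set rundir = /scratch/esse/run_$pert      # assumed local-disk layout
    mkdir -p $rundir
    cd $rundir

    # Perturb the initial conditions, then run the ocean model (paths assumed).
    $HOME/esse/bin/pert    $pert >& pert_$pert.log
    $HOME/esse/bin/pemodel $pert >& pemodel_$pert.log

    # Record the exit status in a shared directory that the master polls.
    echo $status > /shared/esse/status/member_$pert.done

Grid Engine's -t option and $SGE_TASK_ID are standard; Condor and PBS offer analogous mechanisms, and the Grid Engine, Condor and PBS variants mentioned above presumably differ in exactly such details.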

SLIDE 12

Multi-level parallelism in ESSE

  • Nested ocean model runs (HOPS) are run in parallel
    – Limited parallelism – 2 or 3 levels – bi-directional
  • The SVD calculation is based on parallelizable LAPACK routines (sketched below)
  • So is the convergence-check calculation.
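For context on the SVD mentioned above: if the ensemble anomalies are collected in a matrix, a (parallelizable, LAPACK-style) thin SVD yields a reduced-rank estimate of the error covariance. This is only the generic construction; ESSE's exact normalization and convergence criterion follow the ESSE references, not this sketch.

    M = \big[\,\mathbf{x}^{(1)}-\bar{\mathbf{x}},\;\dots,\;\mathbf{x}^{(N)}-\bar{\mathbf{x}}\,\big],\qquad
    M = U\,\Sigma\,V^{T},\qquad
    P \;=\; \tfrac{1}{N-1}\,M M^{T} \;\approx\; \tfrac{1}{N-1}\,U_{p}\,\Sigma_{p}^{2}\,U_{p}^{T},

where U_p and Σ_p retain only the p dominant singular vectors and values - the "error subspace".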

SLIDE 13

ESSE and ocean acoustics

  • As things stand, ESSE is used to provide the necessary temperature and salinity information for sound propagation studies.
  • The ESSE framework can also be extended to acoustic data assimilation. With significantly more compute power one can compute the whole "acoustic climate" in a 3D region:
    – providing TL for any source and receiver locations in the region as a function of time and frequency,
    – by running multiple independent tasks for different sources/frequencies/slices at different times (see the enumeration sketch below).
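To make the task explosion concrete, enumerating such an "acoustic climate" is a plain cross product over the parameters. The counts and parameter names below are made up for illustration; they are not taken from the actual experiments.

    #!/bin/csh -f
    # Hypothetical enumeration of independent acoustic TL tasks: every
    # (time, source depth, frequency, bearing slice) combination is one job.
    set times  = (00 06 12 18)
    set depths = (10 50 100)
    set freqs  = (100 400 800)
    set slices = (`seq 1 36`)

    set n = 0
    foreach t ($times)
      foreach d ($depths)
        foreach f ($freqs)
          foreach s ($slices)
            @ n++
            echo "task $n : time=$t depth=$d freq=$f slice=$s" >> tasklist.txt
          end
        end
      end
    end
    echo "$n independent tasks"    # 4 x 3 x 3 x 36 = 1296 in this toy example

Each line of tasklist.txt would then map to one short, independent TL computation - exactly the many-task pattern of the previous slides.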

SLIDE 14

Canyon Nx2D acoustics modeling

  • Acoustic transmission loss difference over 6 hours (due to internal tides or other uncertainties)
  • In the future, incorporating this with ESSE for uncertainty estimation: the computational cost will be 1800 directions x 15 locations x HUNDREDS of cases (see the worked example below).
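To give a feel for the scale (the case count here is purely illustrative; the slide only says "hundreds"):

    1800 \ \text{directions} \times 15 \ \text{locations} \times 200 \ \text{cases} \;=\; 5.4 \times 10^{6} \ \text{acoustic tasks.}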

SLIDE 15

Ocean DA/ESSE/acoustics: MTC

  • A minimum of hundreds to thousands (and with increased fidelity tens of thousands) of ocean model runs (tens of minutes or more each), preceded by an equal number of IC perturbations (seconds each)
  • File I/O intensive, both for reading and writing
  • Concurrent reads to forcing files etc.
  • Thousands of short acoustics runs (minutes each)
  • Future directions for ESSE will generate even more tasks:
    – dynamic path sampling for observing assets
    – combined physical-acoustical ESSE
SLIDE 16

“Real-time” experiments

SLIDE 17

Notable differences

From many parameter sweeps and other MTC apps:

  • there is a hard deadline associated with the execution of the ensemble workflow, as a forecast needs to be timely;
  • the size of the ensemble is dynamically adjusted according to the convergence of the ESSE workflow, which is not a DAG;
  • individual ensemble members are not significant (and their results can be ignored if unavailable) - what is important is the statistical coverage of the ensemble;
  • the full resulting dataset of each ensemble member forecast is required, not just a small set of numbers; ICs are different for each ensemble member;
  • individual forecasts within an ensemble, especially in the case of interdisciplinary interactions and nested meshes, can be parallel programs themselves.

SLIDE 18

And their implications

  • Deadline: use any Advance Reservation capabilities available
  • Dynamic: means that the actual total compute and data requirements for the forecast are not known beforehand and change dynamically
  • Dropped members: suggests that failures (due to software or hardware problems) are not catastrophic and can be tolerated. Moreover, runs that have not finished (or even started) by the forecast deadline can be safely ignored, provided they do not collectively represent a systematic hole in the statistical coverage.
  • I/O needs: mean that relatively high data storage and network bandwidth demands will be placed on the underlying infrastructure
  • Parallel ensemble members: mean that the compute requirements will not be insignificant either.

SLIDE 19

Ocean DA on local clusters

  • Local Opteron cluster
    – Opteron 250 2.4 GHz (4 GB RAM) compute nodes (single gigabit network connection)
    – Opteron 2380 2.5 GHz (24 GB RAM) head node
    – 18 TB of shared disk (NFS) over 10 Gbit Ethernet
    – 200 Gbit switch backplane
    – Grid Engine and Condor co-existing
  • Tried both Grid Engine and Condor versions of the ESSE workflows. Test with a 600-member ensemble:
    – I/O optimizations (all local dirs): from 86 down to 77 minutes
    – SGE 10-20% faster than Condor
      • without heroic tuning of the latter

SLIDE 20

Ocean DA on the Teragrid

  • Extensive use of sshfs to share directories for checking the state of runs etc. (see the sketch after the table below)
  • Remote job submissions (over (gsi)ssh)
    – part of the driver and modified singletons
  • Or Condor-C and Glide-in, with care, if root
  • Condor-G will not scale
  • Or Personal Condor & MyCluster

System   cores   pert    pemodel
ORNL     2       67.83   1823.99
Purdue   4       6.25    1107.4
local    2       6.21    1531.33
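The sshfs usage mentioned in the first bullet might look roughly like the following; the host name and paths are invented for the sketch.

    # Mount a remote Teragrid scratch directory locally so the master script
    # can poll run-status files as if they were on local disk.
    sshfs tguser@tg-login.example.teragrid.org:/scratch/tguser/esse ~/mnt/tg \
          -o reconnect,follow_symlinks
    ls ~/mnt/tg/status        # status files written by the remote singletons
    fusermount -u ~/mnt/tg    # unmount once the forecast cycle is done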

SLIDE 21

Advantages of the Teragrid

  • Enormous numbers of theoretically available cores and very large amounts of storage
    – Condor pool supposedly 14-27k cores (~1800)
  • Shared high-speed parallel filesystems
  • High-speed connections to the home cluster
  • Suites of Grid software for remote file access, job submission, control etc.
    – Mixed blessing...
  • Free, after writing the proposal to convince the Teragrid to get the SUs...

SLIDE 22

Disadvantages of the Teragrid

  • Very large heterogeneity in hardware, O/S and paths (to scratch disks etc.), requiring mods to the singleton code - user confusion.
  • Without advance reservations one cannot be guaranteed not to have to use multiple Teragrid sites to reach the desired number of processors within the deadline.
    – Backfilling can help, but per-user job limits also limit the usability of a single Teragrid site
    – Schedulers favor large processor-count runs
    – Complicated tricks to submit many jobs as one
  • Teragrid MPPs are not always suitable for scripts
  • Careful fetching of results back to home (congestion)

SLIDE 23

Ocean DA on the Cloud

  • We have been experimenting with the use of Cloud computing for more traditional HPC usage - including parallel runs of I/O-intensive, data-parallel ocean models such as MITgcm.
  • Given the limitations seen in network performance, it was natural to try and investigate the usability of Amazon EC2 for MTC applications such as ESSE.

SLIDE 24

Cloud Modes of usage

  • Stand-alone (batch) on-demand EC2 cluster
    – Torque or SGE (all-in-the-cloud or remote submits)
  • Augmented local cluster with EC2 nodes
    – We have a Torque setup
    – Used recipes for an SGE setup
    – Condor's use of EC2 is too restrictive
    – MyCluster dynamic SGE or Condor merged clusters
    – Commercial offerings (Univa UniCloud, Sun Cloud Adapter in Hedeby/SDM) for fully dynamic provisioning
  • Experimentation with parallel filesystems: PVFS2 / GlusterFS / FhGFS

SLIDE 25

Serial pert/pemodel performance

System      cores   pert    pemodel
m1.small    0.5     13.53   2850.14
m1.large    2       9.33    1817.13
m1.xlarge   4       9.14    1860.81
c1.medium   2       9.8     1008.11
c1.xlarge   8       6.67    1030.42

  • These numbers are from before the 4th eastern datacenter came online
  • A binary optimized with the PathScale compilers was used
  • All cores were loaded.
  • I/O is to local disk (EBS is slower, and so is the NFS that is used for the centrally coordinating directory of the run)
  • Total runtime is reported.
  • Better than 2.5x speedup from m1.small to c1.medium

SLIDE 26

Advantages of the Cloud

  • For all intents and purposes the response is immediate. Currently a request for a virtual EC2 cluster gets satisfied on demand, without having to worry about queue times and backfill slots.
  • The use of virtual machines allows deploying the same environment as the home cluster. This provides for a very clean integration of the two clusters.
  • Having the same software environment also means there is no need to rebuild (and in most cases revalidate) executables. Last-minute changes (because of model build-time parameter tuning) can be used ASAP instead of having to go through a build-test-deploy cycle on each remote platform.
  • EC2 allows our virtual clusters to scale at will (default limit: 20 instances)
  • Since the remote machines are under our complete control, scheduling software, policies etc. are tuned to our needs.

SLIDE 27

Cost analysis

  • Cost-wise, for example, an ESSE calculation with 1.5 GB of input data and 960 ensemble members each sending back 11 MB (about 10.6 GB in total) would cost:
    – 1.5 (GB) x $0.10 + 10.56 (GB) x $0.17 for the data
    – 2 (hr) x 20 (instances) x $0.80 for the computation
    – For a total of $33.95 (worked out below)
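Worked out line by line (all numbers taken from the breakdown above):

    1.5 \times \$0.10 \;+\; 10.56 \times \$0.17 \;\approx\; \$1.95, \qquad
    2 \times 20 \times \$0.80 \;=\; \$32.00, \qquad
    \$1.95 + \$32.00 \;=\; \$33.95.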

  • Use of reserved instances would drop the pricing for the CPU usage by more than a factor of 3.
  • Compare that to the cost of overprovisioning your local cluster resources to handle the peak load required a few times a year.

SLIDE 28

Disadvantages of the Cloud

  • Inhomogeneity needs to be kept in mind or it will bite you
  • Any extra security issues need to be worked out.
  • EC2 usage needs to be paid directly to Amazon. Amazon charges by the hour - like a cell phone, 1 hour and 1 second counts as 2 hours. It also charges for data movement in and out of EC2.
  • The performance of virtual machines is less than that of "bare metal", and the difference is more pronounced when it comes to I/O.
  • No persistent large parallel filesystem. One can be constructed on demand (just like the virtual clusters), but the Gigabit Ethernet connectivity used throughout Amazon EC2, alongside the randomization of instance placement, means that the parallel performance of the filesystem is not up to par. Horror stories...
  • Unlike national and state supercomputing facilities, Amazon's connections to the home cluster are bound to be slower and result in file transfer delays.

SLIDE 29

Future work directions

  • Reimplement the workflow engine.
    – Considering Swift – other options? Nimrod?
  • Generalize the ESSE workflow engine:
    – Use with other ocean models (MITgcm, ROMS)
  • Expand production use of ESSE:
    – Heterogeneous sites on the Teragrid
    – Open Science Grid
    – MPPs with sufficient support: Blue Gene/P?
  • Expand uses for ESSE (and the number of tasks):
    – ESSE for acoustics
    – ESSE for adaptive sampling

SLIDE 30

Which sampling on Aug 26 optimally reduces uncertainties on Aug 27?

[Figure: Four candidate tracks overlaid on the surface temperature forecast for Aug 26, and ESSE forecasts after DA of each track (DA 1-4) over Aug 24-27, starting from the IC (nowcast) and a 2-day ESSE forecast. Best predicted relative error reduction: track 1.]

  • Based on nonlinear error covariance evolution
  • For every choice of adaptive strategy, an ensemble is computed

SLIDE 31

Educational uses

  • The opportunity to host all of ESSE's computational needs on EC2 allows for a vision of ocean DA for education.
  • CITE (Cloud-computing Infrastructure and Technology for Education) – an NSF STCI project.

SLIDE 32

LEGEND in action

SLIDE 33

Conclusions

  • We described what we believe to be a new type of MTC application that is very relevant to Earth and Environmental Science applications (and prototypical of a general class of ensemble-based forecasting and estimation methods).
  • Results on a local cluster were presented, along with a discussion of the challenges of scaling out and solutions for doing so employing Grids and Clouds. I/O locality issues are among our main concerns. Cloud use is economically feasible.
  • We believe that this type of ensemble-based forecast workflow can in the future represent an important new class of MTC applications.
  • The Cloud opens up new horizons, including educational ones.

SLIDE 34

Backup slides

SLIDE 35

LCML/LEGEND

  • LCML (Legacy Computing Markup Language) is an XML Schema-based framework for capturing the build-time and run-time configuration of legacy binaries, alongside constraints.
  • It was implemented for ocean/climate models but designed for general applications that use Makefiles, imake, cmake, autoconf etc. to set up their build-time configuration (not ant).
  • LEGEND is a Java-based validating GUI generator that parses LCML files describing an application and produces a GUI for the user to build and run the model.