Survey of TeraGrid Job Distribution: Toward Specialized Serial Machines as TeraGrid Resources
Arvind Gopu, Richard Repasky, Scott McCaulay
Indiana University
June 5th 2007
June 5th 2007 Gopu et al., Survey of TeraGrid Job Distribution... 2
Introduction
Proceeding toward peta-scale computing
– On the TeraGrid: massive parallel machines with high speed low latency (HSLL) networking
– More to follow?
HSLL networking gear:
– Expensive: most often one third of system cost
– Additional technical skill set required to build and maintain
Introduction (contd…)
A considerable user base runs serial and coarse-grained parallel code
– A big chunk of compute nodes, and expensive networking gear, goes unutilized
TeraGrid job distribution (Oct 2004-06)
Backfill: system utilization vs. user satisfaction
Conclusion: specialized serial systems, or hybrid parallel machines, i.e., parts of parallel systems without an HSLL network?
Peta-scale computing on the horizon
Aggressively moving toward peta-scale computing: NSF-funded TeraGrid RP sites plus more…
– Large parallel systems: great for research! They have led to many a path-breaking research finding
BUT …
– High speed low latency networks are expensive!
– Do all researchers who do computational analyses need large parallel machines? No!
Who does not necessarily need HSLL?
Users who run:
– Legacy serial applications: more likely to have longer walltimes (since they don't use multiple CPUs)
– Coarse-grained parallel applications: embarrassingly parallel code, master-worker, etc.
Who does not necessarily need HSLL? (contd...)
Consider 64 serial jobs, each running for 72 hours (or worse), on 4-core compute nodes
– 75% of cores, i.e., 3 out of 4 cores idle (mostly) for 72 hours
- 64 cores active; 192 cores idle!
– Optical fiber or the like (usually one third of system cost) connecting those 64 nodes sits unutilized for 72 hours!
– Possibly holding a parallel user up for 72 hours
– With 8-core nodes it is even worse: 7 of 8 cores idle
Bad scenario!
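The idle-core arithmetic above can be checked with a quick back-of-the-envelope script (numbers taken from the slide; the variable names are purely illustrative):

```python
# 64 serial jobs, one per 4-core node, each running 72 hours.
jobs = 64
cores_per_node = 4
hours = 72

nodes = jobs                                  # one serial job per node
active_cores = jobs                           # one busy core per node
idle_cores = nodes * (cores_per_node - 1)     # 3 of 4 cores idle per node
idle_fraction = idle_cores / (nodes * cores_per_node)
wasted_core_hours = idle_cores * hours

print(active_cores)       # 64
print(idle_cores)         # 192
print(idle_fraction)      # 0.75
print(wasted_core_hours)  # 13824
```

Nearly fourteen thousand core-hours idle for a single batch of jobs, before even counting the unused network.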
When do serial and coarse-grained parallel jobs help?
When they're short (walltime)!
Consider 1024 serial jobs (or distributed worker tasks from a parallel code), each running for 30 minutes or less, on 4-core compute nodes
– 75% of cores, i.e., 3 out of 4 cores idle (mostly) at worst, for 30 minutes
- Again … 64 active cores; 192 cores idle
- Again … HSLL network connecting those 64 nodes unutilized
- BUT only for a short period of time;
– And most likely: not holding a large parallel job up
The scenario? Much better system utilization: backfill
– Constant flow of such jobs is good on massive parallel systems (with HSLL)
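A rough sketch of why short jobs are benign: even streamed through the same 64 nodes as the earlier example (one job per node at a time, the worst case for core utilization), the whole batch clears quickly. The batching model here is an illustrative assumption, not scheduler behavior from the talk:

```python
import math

jobs = 1024        # short serial jobs (or worker tasks)
nodes = 64         # nodes available for them
walltime_h = 0.5   # 30 minutes each

waves = math.ceil(jobs / nodes)   # batches of 64 concurrent jobs
makespan_h = waves * walltime_h   # total time the nodes are occupied

print(waves)       # 16
print(makespan_h)  # 8.0
```

So the nodes (and the HSLL gear) are tied up in short, interruptible slices rather than one 72-hour block.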
TeraGrid Job Distribution (Oct 2004-06)
Plotted job characteristics:
– Number of CPUs used by each job vs. number of jobs
– Walltime for serial jobs vs. number of jobs
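The tallies behind both plots can be sketched as below. The record fields (`ncpus`, `walltime_hours`) and the toy five-job sample are assumptions standing in for the real TeraGrid accounting logs:

```python
from collections import Counter

# Toy stand-in for two years of accounting records.
records = [
    {"ncpus": 1,   "walltime_hours": 48.0},
    {"ncpus": 1,   "walltime_hours": 0.4},
    {"ncpus": 64,  "walltime_hours": 6.0},
    {"ncpus": 1,   "walltime_hours": 72.0},
    {"ncpus": 128, "walltime_hours": 12.0},
]

# CPUs-per-job vs. number of jobs (first plot)
cpu_hist = Counter(r["ncpus"] for r in records)

# Walltime vs. number of jobs, serial jobs only (second plot)
serial_walltimes = sorted(
    r["walltime_hours"] for r in records if r["ncpus"] == 1
)

serial_share = cpu_hist[1] / len(records)
print(serial_walltimes)  # [0.4, 48.0, 72.0]
print(serial_share)      # 0.6 in this toy sample
```

The toy sample was chosen so the serial share lands near the "close to 60% serial" figure reported later in the talk.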
TeraGrid Job Distribution: Number of CPUs per job vs. Number of Jobs
TeraGrid Job Distribution: Walltime for serial jobs vs. Number of Jobs
TeraGrid Job Distribution (contd…)
The plots in previous slides:
– Show serial vs. multi-CPU jobs (close to 60% serial)
– Do not show coarse-grained vs. fine-grained parallel jobs
- Hard to determine from available logs; possible to filter at the allocation stage though
Job Distribution on TeraGrid vs. on IU resources
Is job distribution on TeraGrid completely reflective of the computational user base?
– Many legacy applications run serially (unless a web service wraps them to run differently; still a research topic)
– Coarse-grained parallel applications abound: embarrassingly parallel, with not much communication
For instance, on IU’s Big Red
– Part of the system is dedicated to TeraGrid; the rest serves local users
– Larger serial user base
– Plus, as mentioned before, users running embarrassingly parallel code do not need an HSLL network
– A continuation of the trend we've seen on past systems
Revisiting backfill
So, are serial jobs inherently bad for parallel machines? No!
Use parallel jobs with low CPU-count and walltime requirements, as well as shorter serial jobs, to fill in while the scheduler accumulates nodes for a large parallel job
A great way to increase system utilization
– But does backfill always work?
When backfill is great …
Backfill is great when there is a large number of short serial (or parallel) jobs.
– Monte Carlo simulations: lots and lots of really short serial jobs
– Applications that distribute one big task into many simultaneously running short serial/threaded tasks
– These fill up nodes that are being kept idle for a massive parallel job
Serial jobs aren't strictly required; smaller parallel jobs can also serve as backfill (for larger parallel jobs)
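The backfill test itself is simple: while the scheduler drains nodes for a big parallel job, a waiting job may borrow the idle nodes only if it will finish before the big job's reserved start time. A minimal sketch, with an illustrative function name and signature rather than a real scheduler API:

```python
def can_backfill(job_nodes, job_walltime_h, idle_nodes, hours_until_start):
    """True if the job fits on the idle nodes and ends before the reservation."""
    return job_nodes <= idle_nodes and job_walltime_h <= hours_until_start

# 32 nodes held idle for 2 hours while the scheduler accumulates nodes:
print(can_backfill(1, 0.5, 32, 2.0))    # True: short serial job slips in
print(can_backfill(1, 72.0, 32, 2.0))   # False: long serial job would delay the big job
print(can_backfill(16, 1.5, 32, 2.0))   # True: a small short parallel job also fits
```

This is why walltime, not CPU count, decides whether a job helps or hurts on a parallel machine.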
Increased queue wait times
Backfill does not work when… serial or coarse-grained parallel jobs are long (walltime)
– Long serial jobs wait in the queue because the scheduler is usually configured to give preference to large parallel jobs
– Frustration for the serial user: "Just one (or a few) long single-CPU job(s), why can't I run?"
Increased queue wait times (contd…)
Another scenario: large serial job(s) sneak(s) in…
What if a set of long (walltime) serial or coarse-grained parallel jobs sneaks into running state before a large parallel job arrives in the job queue?
– Parallel job(s) wait(s) till all or a subset of the serial jobs complete
– In this case,
- Again, 3/4ths or more of the CPUs/cores on the nodes used by the aforementioned long-walltime jobs lie idle
- And, repeating one more time: the expensive networking gear connecting those nodes also lies idle!
– Frustration for the parallel user: "My job is tailor-made for this system, but I am waiting because of these serial jobs!"
Specialized Serial Machines?
Certain resources without an HSLL network
– Or certain resources with a mix of HSLL-connected nodes and Ethernet-connected nodes
Allocate large serial apps or long-running coarse-grained parallel apps here
– Still use short serial jobs for backfill on the massive parallel machine
– The threshold will be based on need, and will vary with each allocation meeting
- Are parallel cycles in shortage, or serial cycles?
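The allocation-time routing rule above can be sketched as a small decision function. The two-hour backfill window is an illustrative assumption; as the slide notes, the real threshold would be revisited at each allocation meeting depending on which cycles are scarce:

```python
def route(ncpus, walltime_h, coarse_grained=False, backfill_window_h=2.0):
    """Route a job request to a machine class (names are illustrative)."""
    if walltime_h <= backfill_window_h:
        return "backfill on massive parallel machine"   # short: useful filler
    if ncpus == 1 or coarse_grained:
        return "specialized serial machine"             # no HSLL needed
    return "massive parallel machine (HSLL)"            # fine-grained parallel

print(route(1, 72.0))                        # long serial job
print(route(1, 0.5))                         # short serial: backfill
print(route(64, 12.0, coarse_grained=True))  # embarrassingly parallel
print(route(256, 12.0))                      # fine-grained: needs HSLL
```

Only the last case actually consumes the expensive interconnect; everything else can run on cheaper hardware or as filler.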
Lower financial barrier for new RPs
Having serial systems not only obviates wastage of expensive network gear/CPUs
It also lowers the financial barrier for new resource providers to join TeraGrid
– Spend less on the entire system!
– Or relocate funds that would have been used for HSLL networking gear toward more CPUs
Training wheels for new RPs
Plus, new RPs have a lot on their hands:
– Hooking up to the TG network backbone
– CTSS
– Accounting and usage (AMIE, etc.)
– Myrinet and InfiniBand require specialized skills to maintain… even experienced sysadmins find it challenging
Single point of entry for ALL computational users
Right now, the legacy serial code user base may find it hard to get cycles on TeraGrid (unless they have some sort of threading in their code)
Have more variety in available resources:
– Massive parallel systems
– Systems that provide serial cycles, or parallel cycles without HSLL
– SMP systems
Single point of entry for all types of computational users in the US: TeraGrid
– The 10 Gig pipe requirement may also be a barrier, as of now, for smaller resource providers
Conclusion
Myrinet, InfiniBand, etc. (high speed low latency networks) are very expensive and require specialized maintenance
A large subset of users still runs legacy serial applications or coarse-grained parallel applications
With specialized serial machines or hybrid parallel machines:
– Allocate large serial user requests and coarse-grained parallel requests to those
– Better "user experience" (especially in terms of queue wait times) for both massive parallel users and long-running serial users
– Lower financial barrier for new RPs, and a gentler learning curve for sysadmins
– TeraGrid as the single point of entry for computational users
Acknowledgments
Larry Simms, for assistance with analysis and visualization of the TeraGrid usage data using SAS
David Hart (@ SDSC) and Chris Baumbauer, for letting us query the TeraGrid central database to collect the usage data used in this paper
TG'07 organizers, for giving us the opportunity to present this paper