SLIDE 1

Survey of TeraGrid Job Distribution: Toward Specialized Serial Machines as TeraGrid Resources

Arvind Gopu, Richard Repasky, Scott McCaulay
Indiana University
June 5th 2007

SLIDE 2

Introduction

• Proceeding toward peta-scale computing
  – On the TeraGrid: massive parallel machines with high-speed, low-latency (HSLL) networking
  – More to follow?

• HSLL networking gear:
  – Expensive – most often one-third of system cost
  – Additional technical skill set required to build and maintain

SLIDE 3

Introduction (contd.)

• Considerable user base runs serial and coarse-grained parallel code
  – A big chunk of compute nodes, and the expensive networking gear, sit unutilized

• TeraGrid job distribution (Oct 2004-06)
• Backfill: system utilization vs. user satisfaction
• ... more detail ...
• Conclusion: specialized serial systems, or hybrid parallel machines, i.e., parts of parallel systems without the HSLL network?

SLIDE 4

Peta-scale computing on the horizon

• Aggressively moving toward peta-scale computing... NSF-funded TeraGrid RP sites, plus more
  – Large parallel systems – great for research! They have led to many path-breaking research findings

• BUT ...
  – High-speed, low-latency networks are expensive!
  – Do all researchers who do computational analyses need large parallel machines? No!

SLIDE 5

Who does not necessarily need HSLL?

• Users who run:
  – Legacy serial applications: more likely to have long walltimes (since they don't use multiple CPUs)
  – Coarse-grained parallel applications: embarrassingly parallel code, master-worker, etc.

SLIDE 6

Who does not necessarily need HSLL? (contd.)

• Consider 64 serial jobs, each running for 72 hours (or worse), on 4-core compute nodes
  – 75% of cores, i.e., 3 out of 4 cores per node, idle (mostly) for 72 hours
    • 64 cores active; 192 cores idle!
  – Optical fiber or the like (usually one-third of system cost) connecting those 64 nodes is unutilized for 72 hours!
  – Possibly holds up a parallel user for 72 hours
  – How about 8-core nodes? Even worse

• Bad scenario! (See the worked numbers below.)

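To make the waste concrete, here is a back-of-the-envelope calculation of the scenario above (a minimal Python sketch; the one-job-per-4-core-node assignment is an assumption for illustration):

# Back-of-the-envelope waste estimate for the long-serial-job scenario.
# Assumption: each serial job occupies a whole 4-core node by itself.
jobs = 64                 # serial jobs, one per node
cores_per_node = 4
walltime_hours = 72

total_cores = jobs * cores_per_node        # 256 cores tied up
active_cores = jobs                        # one busy core per node -> 64
idle_cores = total_cores - active_cores    # 192 idle cores
idle_core_hours = idle_cores * walltime_hours

print(f"{active_cores} active cores, {idle_cores} idle cores "
      f"({idle_cores / total_cores:.0%} idle), "
      f"{idle_core_hours} idle core-hours")
# -> 64 active cores, 192 idle cores (75% idle), 13824 idle core-hours

With 8-core nodes the idle fraction rises to 7 out of 8 cores, i.e., 87.5%.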
SLIDE 7

When do serial and coarse-grained parallel jobs help?

• When they're short (walltime)!
• Consider 1024 serial jobs (or distributed worker tasks from a parallel code), each running for 30 minutes or less, on 4-core compute nodes
  – 75% of cores, i.e., 3 out of 4 cores per node, idle (mostly) at worst, for 30 minutes
    • Again ... 64 active cores; 192 cores idle
    • Again ... the HSLL network connecting those 64 nodes is unutilized
    • BUT only for a short period of time
  – And most likely not holding up a large parallel job

• Scenario? Much better system utilization – backfill
  – A constant flow of such jobs is good on massive parallel systems (with HSLL); a rough comparison of wasted core-hours follows below

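A rough comparison of the wasted core-hours in the two scenarios (again only a sketch, under the same one-serial-job-per-4-core-node assumption; not taken from the paper's measurements):

# Idle core-hours for one wave of 64 single-core jobs on 64 four-core nodes,
# comparing the long-job (72 h) and short-job (0.5 h) scenarios from these slides.
idle_cores = 64 * (4 - 1)                                       # 192 idle cores in both cases
print("72-hour jobs  :", idle_cores * 72,  "idle core-hours")   # 13824
print("30-minute jobs:", idle_cores * 0.5, "idle core-hours")   # 96.0

The idle cores exist in both cases, but short jobs free the nodes (and the HSLL network) quickly, which is exactly what makes them usable as backfill.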
SLIDE 8

TeraGrid Job Distribution (Oct 2004-06)

• Plotted job characteristics (see the illustrative sketch below):
  – Number of CPUs used by each job vs. number of jobs
  – Walltime for serial jobs vs. number of jobs

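Purely as an illustration (the authors' actual analysis was done with SAS against the TeraGrid central database; the CSV file name and column name below are hypothetical), a CPUs-per-job distribution like the first plot could be tallied along these lines:

# Hypothetical sketch: count jobs by CPU count from a job-accounting log.
# The log layout (a CSV with a "cpus" column) is an assumption, not the
# actual TeraGrid accounting schema.
import csv
from collections import Counter

counts = Counter()
with open("teragrid_jobs.csv", newline="") as f:
    for row in csv.DictReader(f):
        counts[int(row["cpus"])] += 1

for cpus in sorted(counts):
    print(f"{cpus:5d} CPUs: {counts[cpus]} jobs")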
SLIDE 9

TeraGrid Job Distribution: Number of CPUs per job vs. Number of Jobs

SLIDE 10

TeraGrid Job Distribution: Walltime for serial jobs vs. Number of Jobs

SLIDE 11

TeraGrid Job Distribution (contd.)

• The plots in the previous slides:
  – Show serial vs. multi-CPU jobs (close to 60% serial)
  – Do not show coarse-grained vs. fine-grained parallel jobs
    • Hard to figure out from the available logs; possible to filter at the allocation stage, though

SLIDE 12

Job Distribution on TeraGrid vs. on IU resources

• Is the job distribution on the TeraGrid completely reflective of the computational user base?
  – Many legacy applications run serial (unless a web service wraps them to run differently – still a research topic)
  – Coarse-grained parallel applications abound – embarrassingly parallel, with not much communication

• For instance, on IU's Big Red:
  – Part of the system is dedicated to the TeraGrid; the rest serves local users
  – Larger serial user base
  – Plus ... (repeating what has been mentioned before) users running embarrassingly parallel code do not need the HSLL network
  – A continuation of the trend we have seen on past systems

SLIDE 13

Revisiting backfill

• So, are serial jobs inherently bad for parallel machines? No!
• Use parallel jobs with low CPU-count and walltime requirements, as well as shorter serial jobs, to fill in while the scheduler accumulates nodes for a large parallel job (a minimal sketch of this decision follows below)
• Great for increasing system utilization
  – But does backfill always work?

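Backfill is a standard feature of schedulers such as Maui/Moab and PBS; the sketch below is only a simplified rendering of the core EASY-backfill test, not any particular scheduler's implementation, and the job fields, free-core count, and reservation time are invented for illustration:

# Simplified EASY-backfill check (illustrative only).
# A waiting job may be backfilled if it fits in the currently free cores AND
# it will finish before the reservation time of the highest-priority parallel
# job that is still accumulating nodes.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    cores: int
    walltime_hours: float

def can_backfill(job: Job, free_cores: int, now: float, reservation_start: float) -> bool:
    fits = job.cores <= free_cores
    finishes_in_time = now + job.walltime_hours <= reservation_start
    return fits and finishes_in_time

# Example: 192 cores free for the next 6 hours while a large parallel job accumulates nodes.
short_serial = Job("monte-carlo-task", cores=1, walltime_hours=0.5)
long_serial  = Job("legacy-serial-run", cores=1, walltime_hours=72)
print(can_backfill(short_serial, free_cores=192, now=0.0, reservation_start=6.0))  # True
print(can_backfill(long_serial,  free_cores=192, now=0.0, reservation_start=6.0))  # False

This matches the talk's point: short serial jobs slot into the idle window, while long serial jobs cannot, and instead compete with (or delay) the parallel workload.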
SLIDE 14

When backfill is great ...

• Backfill is great when there is a large number of short serial (or parallel) jobs
  – Monte Carlo simulations – lots and lots of really short serial jobs
  – Applications that distribute one big task into short-running serial/threaded tasks that execute simultaneously
  – These fill up nodes that are being kept idle for a massive parallel job

• Serial jobs are not strictly necessary; smaller parallel jobs can also be used as backfill (for larger parallel jobs)

SLIDE 15

Increased queue wait times

• Backfill does not work when...
• Serial or coarse-grained parallel jobs have long walltimes
  – Long serial jobs wait in the queue because the scheduler is usually configured to give preference to large parallel jobs
  – Frustration for the serial user: "Just one (or a few) long single-CPU job(s), why can't I run?"

SLIDE 16

Increased queue wait times (contd.)

• Another scenario – large serial job(s) sneak(s) in...
• What if a set of long-walltime serial or coarse-grained parallel jobs sneaks into the running state before a large parallel job arrives in the job queue?
  – The parallel job(s) wait(s) till all, or a subset, of the serial jobs complete
  – In this case:
    • Again, 3/4ths or more of the CPUs/cores – on the nodes used by the aforementioned long-walltime jobs – are lying around idle
    • And, repeating one more time ... the expensive networking gear connecting those nodes is also lying around idle!
  – Frustration for the parallel user: "My job is tailor-made for this system but I am waiting because of these serial jobs!"

SLIDE 17

Specialized Serial Machines?

• Certain resources without an HSLL network
  – Or certain resources with a mix of HSLL-connected nodes and Ethernet-connected nodes

• Allocate large serial apps or long-running coarse-grained parallel apps there
  – Still use short serial jobs for backfill on the massive parallel machines
  – The threshold will be based on need, and will vary with each allocation meeting
    • Are parallel cycles in short supply (or serial cycles)?

SLIDE 18

Lower financial barrier for new RPs

• Not only does having serial systems obviate wastage of expensive network gear/CPUs ...
• It also lowers the financial barrier for new resource providers to join the TeraGrid
  – Spend less on the entire system!
  – Or relocate funds that would have been used for HSLL networking gear toward more CPUs

SLIDE 19

Training wheels for new RPs

• Plus, new RPs have a lot on their hands:
  – Hooking up to the TG network backbone
  – CTSS
  – Accounting and usage (AMIE, etc.)
  – Myrinet and InfiniBand require specialized skills to maintain... even experienced sysadmins find it challenging

SLIDE 20

Single point of entry for ALL computational users

• Right now, the legacy serial-code user base may find it hard to get cycles on the TeraGrid (unless they have some sort of threading in their code)

• Have more variety in available resources:
  – Massive parallel systems
  – Systems that provide serial cycles, or parallel cycles without HSLL
  – SMP systems

• Single point of entry for all types of computational users in the US: the TeraGrid
  – The 10 Gig pipe requirement may also be a barrier, as of now, for smaller resource providers

SLIDE 21

Conclusion

• Myrinet, InfiniBand, etc. (high-speed, low-latency networks) are very expensive and require specialized maintenance
• A large subset of users still runs legacy serial applications or coarse-grained parallel applications
• With specialized serial machines or hybrid parallel machines:
  – Allocate large serial user requests and coarse-grained parallel requests to those systems
  – Better "user experience" (especially in terms of queue wait times) for both massive parallel users and long-running serial users
  – Lower financial barrier for new RPs and a gentler learning curve for sysadmins
  – TeraGrid as the single point of entry for computational users

SLIDE 22

Acknowledgments

• Larry Simms – for assistance with analysis and visualization of the TeraGrid usage data using SAS
• David Hart (@ SDSC), Chris Baumbauer – for letting us query the TeraGrid central database to collect the usage data used in this paper
• TG'07 organizers – for giving us an opportunity to present this paper