Workload Management: NQE/LSF Status & Plans


Slide 1

Workload Management: NQE/LSF Status & Plans

Jack Thompson

Marketing Product Manager

SGI jt@sgi.com

41st Cray User Group Conference Minneapolis, Minnesota

Brian MacDonald

Technical Relationship Manager

Platform Computing brian@platform.com

Slide 2

Agenda

• NQE Transition & Status
• Migration Program
• Status of LSF on SGI and Cray Systems
• LSF Plans
• Q&A

Slide 3

NQE Transition

NQE 3.3
• Final feature release

Next Steps
• ISV solutions prevalent
  – Core competency issue
  – Multi-vendor environment
• Partner solution best choice
• Platform Computing's LSF

Slide 4

NQE Status

• Supported on SGI and Cray systems
  – Support through year-end 2004
  – Critical bugs fixed
  – Call center support
• Available for Cray SV1 systems
• Retired on non-SGI systems

Slide 5

LSF Migration Program

• Discounted pricing for systems licensed for NQE before February 1, 1999
  – Available through January 31, 2000
• Migration Guide
  – Developed jointly by Platform and SGI
• Professional services available
• Inclusion of key NQE features in LSF

Strong relationship between the SGI and Platform Computing engineering teams

Slide 6

LSF on SGI Systems

Current release is LSF 3.2
• Now available on IRIX, UNICOS, and UNICOS/mk
  – Including Cray SV1
• Also on NT and Linux
• Available from SGI
  – LSF Standard Edition, LSF Parallel, LSF Client
• Available from Platform Computing
  – LSF Analyzer, LSF MultiCluster, LSF JobScheduler, LSF Make

Slide 7

Data Center Requirements

Environments for High Performance
• Single point of control and administration
• Logically present a single system image to users, applications, and networks
• Application of policies across the consolidated platform
  – Uniform across all machines
• Uniform policies to satisfy workload performance objectives in terms of throughput, turnaround, and response time
• Improved application availability, both for failures and planned outages
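Uniform policy of this kind is typically expressed in LSF's queue configuration. A minimal sketch of an `lsb.queues` stanza, assuming hypothetical group names and share values:

```
# Hypothetical lsb.queues stanza: one queue whose fairshare policy
# applies uniformly to every host in the cluster.
Begin Queue
QUEUE_NAME  = normal
PRIORITY    = 30
FAIRSHARE   = USER_SHARES[[groupA, 70] [groupB, 30]]
HOSTS       = all            # same policy on all machines
DESCRIPTION = Default queue with uniform fairshare policy
End Queue
```

Because the queue, not the individual host, carries the policy, every machine in the consolidated platform enforces the same shares.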

Slide 8

Defining Capacity Goals

LSF can be focused on throughput guarantees
• Run as much workload on the box as possible; absolute performance is not the primary goal

[Figure: 12 jobs using 900 MB of memory, with lots of disk activity or network disk access, running on a system with 8 CPUs, 1 GB of memory, and 6 I/O channels]

Slide 9

Thresholds for Execution

[Figure: CPU utilization thresholds. At 85%, critical and lower-priority jobs run; at 90%, the queue stops accepting new jobs and low-priority jobs are suspended or migrated; at 100%, high-priority, critical workload continues]
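In LSF, utilization thresholds like these map onto per-queue scheduling (loadSched) and suspending (loadStop) limits on a load index. A hedged sketch of an `lsb.queues` fragment, assuming the 85%/90% figures from the slide:

```
Begin Queue
QUEUE_NAME = low_priority
PRIORITY   = 10
# loadSched/loadStop pair for the CPU utilization (ut) index:
# stop dispatching new jobs above 85% utilization,
# suspend running jobs above 90%.
ut         = 0.85/0.90
MIG        = 5              # migrate a suspended job after 5 minutes
End Queue
```

A high-priority queue would simply omit these thresholds, so its critical workload keeps running at 100% utilization.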

Slide 10

Defining Capability Computing
Clearly Stated Performance Goals

• Get my job done as quickly as possible using all necessary dedicated resources
• Avoid sharing and contention at all costs
• Problems can be tackled that otherwise could not be considered
• Mission-critical applications gain the undivided attention of the computing infrastructure

Slide 11

Defining Capability Computing
Supporting the Exclusive Execution Model

• Multi-box parallelism (Origin 2000)
• Mixed operation on large machines
• Optimum support for Cray T3E
• Committed product development in support of partitioning mechanisms
  – Miser (Q4 99)
  – Miser CPU sets (Q4 99)
  – OS service follow-on (XRS)

Slide 12

Resource-Based Job Placement

Selection
  – Match necessary conditions
Ordering
  – Choose the best from eligible candidates
Reservation
  – Adjust load values for selected hosts
Spanning
  – Define locality of parallel jobs
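These four phases map directly onto the sections of an LSF resource requirement string passed to `bsub -R`. A brief example (the application name `myapp` is illustrative):

```
# select:  match hosts with more than 512 MB free memory   (Selection)
# order:   prefer hosts with the lowest CPU utilization    (Ordering)
# rusage:  reserve 512 MB on each chosen host              (Reservation)
# span:    place all parallel tasks on a single host       (Spanning)
bsub -n 4 -R "select[mem>512] order[ut] rusage[mem=512] span[hosts=1]" myapp
```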

Slide 13

Single Processing Image

[Figure: submission hosts feed batch queues; the Scheduler, driven by resource information from the LIM, dispatches jobs to server hosts]
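Under this single-image model, users interact with the whole cluster as one system from any submission host, using the standard LSF commands (queue and job names are illustrative):

```
lsload                    # LIM view: load and resource information for all hosts
bqueues                   # batch queues, visible cluster-wide
bsub -q normal sleep 60   # submit from any submission host
bjobs                     # the scheduler's view of the job, wherever it runs
```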

Slide 14

System Level Integration

Parallel Application Manager / Remote Execution Server
• Placement
• Control (signals, limits, messages)
• Consolidated accounting
• SGI Array Session
  – Task startup and control
  – ASH returned to PAM
  – ASH sent to RES, used to discover per-job usage
• MPT 1.3 plug-in

Slide 15

Solutions Through Integration

MPT 1.3 + LSF Parallel 3.2
• Application checkpoint/restart
• Transparent host selection
• Accounting for ISV applications

ISV, custom scientific, and commercial applications transparently gain access to resource management services without changing their code

Slide 16

LSF 4.0 Enhancements

Scheduler
– Scalability improvements with all the bells and whistles turned on (fairshare + backfilling)
  • 20,000+ jobs
– Dynamic reconfiguration without restart
  • lim and mbatchd
– Client query scalability
  • support for thousands of clients
– Adaptive dispatch for high-throughput, short-running jobs
– Time-dependent configuration for queues
  • different configuration at night, same queue
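Time-dependent queue behavior of this sort is expressed with run windows in `lsb.queues`. A hedged sketch, assuming a hypothetical night-time window:

```
Begin Queue
QUEUE_NAME = batch
PRIORITY   = 40
# Dispatch and run jobs from this queue only between 19:00 and 07:00;
# outside the window, jobs stay pending in the same queue.
RUN_WINDOW = 19:00-7:00
End Queue
```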

Slide 17

LSF 4.0 Enhancements

Job Execution
– Improved input/output handling support
  • I/O spooling
  • Admin-defined spool directory
  • Job-level CWD discovery enhancements
– Integrated FTA supported within LSF
– Job flow
– Kill and re-queue

Administrative Improvements
– Non-shared daemon configuration support
– Automatic host type and model detection