
Workload Management: NQE/LSF Status & Plans Jack Thompson - PowerPoint PPT Presentation



  1. Workload Management: NQE/LSF Status & Plans
     Jack Thompson, Marketing Product Manager, SGI (jt@sgi.com)
     Brian MacDonald, Technical Relationship Manager, Platform Computing (brian@platform.com)
     41st Cray User Group Conference, Minneapolis, Minnesota

  2. Agenda
     • NQE Transition & Status
     • Migration Program
     • Status of LSF on SGI and Cray Systems
     • LSF Plans
     • Q&A

  3. NQE Transition
     NQE 3.3
     • Final feature release
     Next Steps
     • ISV solutions prevalent
       – Core competency issue
       – Multi-vendor environment
     • Partner solution best choice
     • Platform Computing’s LSF

  4. NQE Status
     • Supported on SGI and Cray Systems
       – Support through year-end, 2004
       – Critical bugs fixed
       – Call center support
     • Available for Cray SV1 systems
     • Retired on non-SGI systems

  5. LSF Migration Program
     • Discounted pricing for systems licensed for NQE before February 1, 1999
       – Available through January 31, 2000
     • Migration Guide
       – Developed jointly by Platform and SGI
     • Professional services available
     • Inclusion of key NQE features in LSF
     Strong relationship between SGI and Platform Computing engineering teams

  6. LSF on SGI Systems
     Current release is LSF 3.2
     • Now available on IRIX, UNICOS, UNICOS/mk
       – Including Cray SV1
     • Also on NT and Linux
     • Available from SGI
       – LSF Standard Edition, LSF Parallel, LSF Client
     • Available from Platform Computing
       – LSF Analyzer, LSF MultiCluster, LSF JobScheduler, LSF Make

  7. Data Center Requirements
     Environments for High Performance
       – Single point of control and administration
       – Logically present a single system image to users, applications and networks
       – Application of policies across the consolidated platform, uniform across all machines
       – Uniform policies to satisfy workload performance objectives in terms of throughput, turnaround and response time
       – Improved application availability, both for failures and planned outages

  8. Defining Capacity Goals
     LSF can be focused on throughput guarantees
     • Run as much workload on the box as possible; absolute performance is not the primary goal
     [Figure: a system with 8 CPUs, 1 GB of memory and 6 I/O channels, running 12 jobs that use 900 MB of memory with lots of disk activity or network disk access]

  9. Thresholds for Execution
     High Priority, Critical Workload Continues
     [Figure: CPU utilization axis with thresholds marked at 85%, 90% and 100%. Labels: "Critical and Lower Priority Jobs", "Stop Accepting New Jobs", "Low Priority Jobs Suspended or Migrated"]
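The threshold idea on this slide can be illustrated with a small sketch. The figure did not survive extraction, so the pairing of thresholds with actions is an assumption here: 85% is taken as the point where new jobs stop being accepted, and 90% as the point where low-priority jobs are suspended or migrated. The names `HostPolicy` and `policy_for` are hypothetical and are not part of LSF.

```python
from dataclasses import dataclass

# ASSUMPTION: which action belongs to which threshold is inferred from the
# slide's labels; the original figure is not recoverable.
STOP_ACCEPTING_AT = 85.0        # percent CPU utilization
SUSPEND_LOW_PRIORITY_AT = 90.0  # percent CPU utilization


@dataclass(frozen=True)
class HostPolicy:
    accept_new_jobs: bool
    run_low_priority: bool
    run_high_priority: bool


def policy_for(cpu_utilization: float) -> HostPolicy:
    """Return the actions allowed on a host at the given CPU utilization (%)."""
    if not 0.0 <= cpu_utilization <= 100.0:
        raise ValueError("cpu_utilization must be between 0 and 100")
    return HostPolicy(
        accept_new_jobs=cpu_utilization < STOP_ACCEPTING_AT,
        run_low_priority=cpu_utilization < SUSPEND_LOW_PRIORITY_AT,
        # High-priority, critical workload continues at any utilization.
        run_high_priority=True,
    )


if __name__ == "__main__":
    for load in (50.0, 87.0, 95.0):
        print(load, policy_for(load))
```

The point of the sketch is only that each threshold removes one class of work while the critical workload keeps running.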

  10. Defining Capability Computing
      Clearly Stated Performance Goals
      • Get my job done as quickly as possible using all necessary dedicated resources
      • Avoid sharing and contention at all costs
      • Problems can be tackled that otherwise could not be considered
      • Mission critical applications gain the undivided attention of the computing infrastructure

  11. Defining Capability Computing
      Supporting the Exclusive Execution Model
      • Multi-box parallelism (Origin 2000)
      • Mixed operation large machines
      • Optimum support for Cray T3E
      • Committed product development in support of partitioning mechanisms
        – Miser (Q4 99)
        – Miser CPU sets (Q4 99)
        – OS service follow-on (XRS)

  12. Resource Based Job Placement
      • Selection – Match necessary conditions
      • Ordering – Choose the best from eligible candidates
      • Reservation – Adjust load values for selected hosts
      • Spanning – Define locality of parallel jobs
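The four placement steps named on this slide can be sketched as a toy pipeline. This is a minimal illustration under stated assumptions, not LSF's scheduler: the `Host` fields, the `place` function, the choice of free memory as the selection condition, CPU utilization as the ordering key, and the reduction of spanning to "the job must occupy exactly `n_hosts` hosts" are all hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Host:
    name: str
    free_mem_mb: int
    cpu_util: float  # 0.0 (idle) .. 1.0 (fully busy)


def place(hosts: List[Host], need_mem_mb: int, n_hosts: int) -> List[str]:
    """Pick hosts for a job; return their names, or [] if it cannot be placed."""
    # Selection: keep only hosts that satisfy the necessary condition.
    eligible = [h for h in hosts if h.free_mem_mb >= need_mem_mb]
    # Ordering: prefer the least loaded of the eligible candidates.
    ranked = sorted(eligible, key=lambda h: h.cpu_util)
    # Spanning (simplified): the job needs n_hosts distinct hosts.
    if len(ranked) < n_hosts:
        return []
    chosen = ranked[:n_hosts]
    # Reservation: adjust the recorded load so the next placement decision
    # sees the memory as already committed.
    for h in chosen:
        h.free_mem_mb -= need_mem_mb
    return [h.name for h in chosen]


if __name__ == "__main__":
    cluster = [
        Host("a", 2000, 0.9),
        Host("b", 500, 0.1),
        Host("c", 1500, 0.3),
        Host("d", 1200, 0.5),
    ]
    print(place(cluster, need_mem_mb=1000, n_hosts=2))
```

Reservation is what keeps two back-to-back placements from both counting on the same free memory before the load information is refreshed.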

  13. Single Processing Image
      [Diagram labels: Resource Information, LIM, Scheduler, submission hosts, server hosts, batch queues]

  14. System Level Integration
      • Placement
      • Control (signals, limits, message)
      • Consolidated accounting
      • SGI Array Session
      • Task startup and control
      • ASH returned to PAM
      • MPT 1.3 Plug-in
      • ASH sent to RES, used to discover per job usage
      [Diagram labels: Parallel Application Manager, Remote Execution Server]

  15. Solutions Through Integration
      ISVs, custom scientific and commercial applications transparently gain access to resource management services without changing their code
      • Application Checkpoint Restart
      • Transparent host selection
      • Accounting for ISV applications
      [Diagram labels: LSF Parallel 3.2, MPT 1.3]

  16. LSF 4.0 Enhancements
      Scheduler
        – Scalability improvements with all the bells and whistles turned on (Fair-share + Back-filling)
          · 20,000+ jobs
        – Dynamic re-configuration without re-start
          · lim and mbatchd
        – Client query scalability
          · support for thousands of clients
        – Adaptive dispatch for high throughput, short running jobs
        – Time dependent configuration for queues
          · different queue for night, same queue
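Back-filling, mentioned on this slide, is a general scheduling technique: while the job at the head of the queue waits for enough CPUs, smaller jobs further back may start if they will finish before the head job could start anyway. The sketch below shows that general idea on a single CPU pool. It is not LSF's implementation, and `Job`, `backfill`, and the runtime estimates are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Job:
    name: str
    cpus: int
    runtime: int  # user-supplied estimate, in arbitrary time units


def backfill(queue: List[Job], free_cpus: int,
             running: List[Tuple[int, int]], now: int = 0) -> List[str]:
    """Return names of jobs to start now.

    queue   -- pending jobs in priority order
    running -- (end_time, cpus) for each job already executing
    """
    started: List[str] = []
    free = free_cpus
    pending = list(queue)
    running = list(running)

    # Start jobs strictly in order for as long as they fit.
    while pending and pending[0].cpus <= free:
        job = pending.pop(0)
        free -= job.cpus
        running.append((now + job.runtime, job.cpus))
        started.append(job.name)
    if not pending:
        return started

    # Shadow time: earliest moment the blocked head job will have enough CPUs.
    head = pending[0]
    available = free
    shadow = None
    for end_time, cpus in sorted(running):
        available += cpus
        if available >= head.cpus:
            shadow = end_time
            break
    if shadow is None:
        return started  # head job can never fit; skip backfilling here

    # Backfill: later jobs that fit now and finish by the shadow time
    # cannot delay the head job.
    for job in pending[1:]:
        if job.cpus <= free and now + job.runtime <= shadow:
            free -= job.cpus
            started.append(job.name)
    return started


if __name__ == "__main__":
    queue = [Job("A", 4, 5), Job("B", 2, 8), Job("C", 2, 20), Job("D", 1, 3)]
    print(backfill(queue, free_cpus=2, running=[(10, 6)]))
```

In the usage example, job A needs 4 CPUs but only 2 are free until time 10, so job B (2 CPUs, done by time 8) is started in the gap, while C would overrun the gap and D no longer fits.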

  17. LSF 4.0 Enhancements
      Job Execution
        – Improved Input/Output handling support
          · I/O Spooling
          · Admin defined spool directory
          · Job level CWD discovery enhancements
        – Integrated FTA supported within LSF
        – Job Flow
        – Kill re-queue
      Administrative Improvements
        – Non-shared daemon configuration support
        – Automatic host type and model detection
