SLIDE 1

NATIONAL ENERGY RESEARCH SCIENTIFIC COMPUTING CENTER

UPDATE ON NERSC PScheD EXPERIENCES, A CONTINUING SUCCESS STORY

Tina Butler - NERSC
Brent Draney - NERSC
Mike Welcome - NERSC
Bryan Hardy - SGI
Steve Luzmoor - SGI

This work was supported by the Director, Office of Advanced Scientific Computing Research, Division of Mathematical, Information, and Computational Sciences of the U.S. Department of Energy under contract number DE-AC03-76SF00098.

SLIDE 2

What is NERSC?

• National Energy Research Scientific Computing Center
  – Funded by DOE Office of Science
  – Located at Lawrence Berkeley National Lab
  – Provides computational resources to the following programs:
    • Fusion Energy
    • High Energy and Nuclear Sciences
    • Basic Energy Sciences
    • Biology and Environmental Research
    • Computational and Environmental Research
  – Approximately 2500 users from major universities and government labs
  – Hardware: 696-PE T3E-900, 1 J90 SE (32 CPUs) & 3 SV-1 (64 CPUs) systems

SLIDE 3

Mcurie - The NERSC T3E

• T3E-900 with 696 PEs running UNICOS/MK 2.0.4.67
• 644 APP PEs
• 256 MB per PE
• 383 GB swap space - 5 partitions, each 5-way striped
• 582 GB checkpoint file system - 5 partitions, striped
• 1500 GB /usr/tmp file system
• 7 - 25 GB home file systems, DMF managed
• All large file systems "remote mounted"

SLIDE 4

NERSC Job Mix - Application Mix

• Applications from the fields of
  – Chemistry
  – Materials Science
  – Fusion Energy
  – Geophysics
  – Biology
  – High Energy Nuclear Physics
  – Climate Modeling
  – Astrophysics
  – Computational Fluid Dynamics
• Mostly user-written codes

SLIDE 5

NERSC Job Mix - Diverse and Dynamic

App Size (PEs)   % of all Apps   % of PE Hours
2 - 16           56              6
17 - 64          38              56
65 - 128         5               29
129 - 512        1               9

App Run Time     % of all Apps   % of PE Hours
0 – 10 min       56              1
10 – 30 min      23              10
0.5 – 3.5 hr     17              49
3.5 – 12.0 hr    4               40

Mix of Development, Capacity and Capability computing

SLIDE 6

NERSC T3E Scheduling Goals

• Minimize idle time in the APP region
• Provide fast interactive response while managing the total interactive workload on the system
• Provide reasonable and even turnaround across all the batch queues
• Encourage users to scale applications to large numbers of PEs
• Provide "priority queuing" capability via NQE/NQS

SLIDE 7

Mcurie Job Flow and Control Diagram

[Diagram: batch requests enter through NQE and NQS under the NQS Control Script; interactive applications enter through the Interactive Priming Script; GRM and Psched place both kinds of work onto the application PEs.]

SLIDE 8

NERSC T3E Batch System

• NQE - holding pen for incoming requests
  – Production queues: LWS limit of 3 jobs per user
  – Debug queues: LWS limit of 1 job per user
• NQS - queues defined by PE size and time limits

Queue      PE Lim   Time Lim   Priority
pe512      512      4 hr       45
pe256      256      4 hr       30
pe128      128      4 hr       25
pe64       64       4 hr       20
pe32       32       4 hr       15
pe16       16       4 hr       10
long128    128      12 hr      27
long256    256      12 hr      28
debug_md   128      10 min     29
debug_sm   32       30 min     23
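The slide gives the queue limits but not the routing rule. A plausible sketch of how a request maps onto these queues, in Python (queue names and limits are from the table; the "smallest queue that fits" rule is an assumption, since routing is actually done by NQS/NQE):

```python
# Illustrative queue routing for the Mcurie batch queues. Limits are taken
# from the slide; the selection rule itself is an assumption, not NQS code.
QUEUES = [
    # (name, pe_limit, time_limit_minutes, priority)
    ("pe16", 16, 240, 10),
    ("pe32", 32, 240, 15),
    ("pe64", 64, 240, 20),
    ("pe128", 128, 240, 25),
    ("pe256", 256, 240, 30),
    ("pe512", 512, 240, 45),
    ("long128", 128, 720, 27),
    ("long256", 256, 720, 28),
    ("debug_md", 128, 10, 29),
    ("debug_sm", 32, 30, 23),
]

def route(pes, minutes, debug=False):
    """Pick the tightest production (or debug) queue that fits the request."""
    candidates = [
        q for q in QUEUES
        if q[1] >= pes and q[2] >= minutes
        and q[0].startswith("debug") == debug
    ]
    if not candidates:
        return None  # no queue can hold this request
    # Prefer the smallest PE limit, then the shortest time limit.
    return min(candidates, key=lambda q: (q[1], q[2]))[0]
```

For example, a 100-PE, 1-hour job would land in pe128, while the same job asking for 5 hours would need long128.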

SLIDE 9

NERSC T3E Batch System (cont.)

• NQS Control Script (Perl 5)
  – Reads configuration file
    • Contains alternate queue configurations
    • Configuration selection based on time, day of week
    • Which queues are "on", "off", "backfill", etc.
    • Specifies global, complex and queue limits
  – Gathers system state: parses output of ps, grmview, qstat, psview
  – Modifies NQS (via qmgr) to conform with selected configuration
  – Uses checkpoint/restart to switch between configurations
    • Up to 5 checkpoints done in parallel
  – Logs system state and all actions to time-stamped log file
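The cycle above can be sketched in Python (the original script is Perl 5). Only the external commands — ps, grmview, qstat, psview, qmgr — and the 5-way parallel checkpointing come from the slide; the function names, the qmgr command syntax, and the `chkpnt` stand-in are assumptions:

```python
# Sketch of the NQS control script's cycle. Command names are from the
# slide; structure and qmgr/chkpnt details are assumptions, not NERSC code.
import subprocess
from concurrent.futures import ThreadPoolExecutor

def gather_state():
    """Snapshot system state by capturing the status commands' output."""
    state = {}
    for cmd in ("ps", "grmview", "qstat", "psview"):
        result = subprocess.run([cmd], capture_output=True, text=True)
        state[cmd] = result.stdout   # the real script parses each format
    return state

def build_qmgr_script(queues):
    """Translate {queue: 'on' | 'off' | 'backfill'} into qmgr commands."""
    return "".join(
        f"{'enable' if status != 'off' else 'disable'} queue {name}\n"
        for name, status in sorted(queues.items())
    )

def apply_configuration(queues):
    """Feed generated commands to qmgr so NQS matches the configuration."""
    subprocess.run(["qmgr"], input=build_qmgr_script(queues), text=True)

def checkpoint_jobs(job_ids, parallelism=5):
    """Checkpoint running jobs before a configuration switch, 5 at a time."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        list(pool.map(lambda j: subprocess.run(["chkpnt", str(j)]), job_ids))
```

A real cycle would also write the gathered state and every action taken to a time-stamped log file, as the slide notes.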

SLIDE 10

Alternate Queue Configurations

Schedule        Configuration     Queue Status
22:00 – 01:00   Full Machine      On: pe512; Backfill: pe64, pe32, pe16
01:00 – 07:00   Batch Preferred   On: pe256, pe128, long128, long256, pe64, pe32, pe16, debug
07:00 – 22:00   Regular           On: pe128, long128, pe64, pe32, pe16, debug
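The schedule table reduces to data plus a window lookup; a minimal Python sketch (the real selection lives in the Perl control script's configuration file; the only subtlety is the 22:00 – 01:00 window wrapping past midnight):

```python
# The alternate queue configurations from the slide, as data. The lookup
# function is an illustrative sketch, not the actual Perl implementation.
CONFIGURATIONS = [
    # (start_hour, end_hour, name, {queue: status})
    (22, 1, "Full Machine",
     {"pe512": "on", "pe64": "backfill", "pe32": "backfill", "pe16": "backfill"}),
    (1, 7, "Batch Preferred",
     {q: "on" for q in ("pe256", "pe128", "long128", "long256",
                        "pe64", "pe32", "pe16", "debug")}),
    (7, 22, "Regular",
     {q: "on" for q in ("pe128", "long128", "pe64", "pe32", "pe16", "debug")}),
]

def configuration_at(hour):
    """Return (name, queues) for the window containing 'hour' (0-23)."""
    for start, end, name, queues in CONFIGURATIONS:
        if start < end:
            in_window = start <= hour < end
        else:                     # window wraps midnight, e.g. 22:00 - 01:00
            in_window = hour >= start or hour < end
        if in_window:
            return name, queues
    raise LookupError(f"no configuration covers hour {hour}")
```

At 23:00 this selects Full Machine, so pe512 runs with the small queues reduced to backfill.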

SLIDE 11

Mcurie Configuration Prior to UNICOS/MK 2.0.4

• GRM - two regions (manage interactive workload)
  – 512 PE batch-only region (maximum = 512)
  – 132 PE mixed region (maximum = 64)
    • 06:00 - 18:00 weekdays: interactive-only
    • 23:00 - 03:00 every day: batch-only
    • Otherwise: both interactive and batch allowed
  – app_max = 1, abs_app_max = 1
• Psched
  – Two psched domains - one for each region
  – Load balancer enabled
  – No gang scheduler
  – No prime jobs

SLIDE 12

Mcurie Configuration Prior to UNICOS/MK 2.0.4

• Problems
  – Applications launched on region interface
  – Applications launched in "wrong" region
  – Interactive region idle if no interactive work
  – Job size "entropy"
• Attempted Solutions
  – Torus-Pack Script
  – De-fragment Script
  – "B-sched"

SLIDE 13

Mcurie Configuration after UNICOS/MK 2.0.4 Upgrade

• GRM
  – Single uniform 644 PE APP region
  – Service limits to control interactive workload (132 day / 4 night)
  – app_max = 1, abs_app_max = 2
• Psched
  – Load balancer - 5 sec heartbeat
  – Gang scheduler - 1 hr time-slice
  – Resource manager - prime jobs
• Interactive Priming Script
  – All interactive work is "prime" from 05:30 - 22:00
• NQS Control Script
  – Large jobs run "prime"
  – 30% over-subscription (global MPP_limit = 960)
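With gang scheduling and abs_app_max = 2, more PEs can be dispatched than physically exist; the global MPP_limit of 960 caps the total. A minimal sketch of that admission test (illustrative; the actual limit is enforced through the NQS global configuration, not per-job code like this):

```python
# Over-subscription admission check: the total PEs of dispatched batch jobs
# may exceed the 644 physical APP PEs, up to the global MPP_limit of 960
# from the slide. The gang scheduler time-slices the oversubscribed PEs.
MPP_LIMIT = 960      # global limit from the NQS control script configuration
APP_PES = 644        # physical application PEs on Mcurie

def can_dispatch(running_pe_counts, requested_pes):
    """True if starting a 'requested_pes' job keeps the total under the cap."""
    return sum(running_pe_counts) + requested_pes <= MPP_LIMIT
```

So with 512-PE and 256-PE jobs already running (768 PEs, more than the machine could gang-schedule without time-slicing), a 128-PE job may still start, but a second one may not.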

SLIDE 14

Psched Success at NERSC

• Average System Utilization (Connect Time)
• Average queue wait time
  – reduced
  – decreased for large queues
• Interactive workload
  – restricted but given priority

Dates                 Utilization   Comments
10/01/98 – 03/04/99   79.4%         Prior to 2.0.4
03/05/99 – 03/24/99   85.6%         Post 2.0.4
03/25/99 – 05/08/99   90.2%         Current Configuration
05/09/99 – 09/30/99   87.3%         Allocation Problems

SLIDE 15

MPP Charging and Usage, FY 98-99

[Chart: CPU hours by date, 1-Oct-97 through 3-Sep-99; 30-day moving averages for Mcurie, Pierre, Pierre F, GC0, lost time, and overhead, against 80%, 85%, and 90% of peak CPU hours.]

SLIDE 16

Mcurie Connect Time by Application Size

7-Day Moving Average

[Chart: PE hours by date, 10/2/98 through 9/18/99, broken out by application size (PE 512, 448, 256, 224, 128, 96, 64, 32, 16) plus interactive and overhead, with 90%-of-time and maximum-time reference lines.]

SLIDE 17

Mcurie Connect Time by Application Size

30-Day Moving Average

[Chart: PE hours by date, 10/2/98 through 9/18/99, broken out by application size (PE 512, 448, 256, 224, 128, 96, 64, 32, 16) plus interactive and overhead, with 90%-of-time and maximum-time reference lines.]

SLIDE 18

Mcurie: Average Wait Time per Queue

[Chart: average wait time in hours by month, Oct-98 through Apr-99, for queues pe512, gc256, gc128, pe256, pe128, pe64, pe32, and long128.]

SLIDE 19

Mcurie Production Jobs Less than 33 PEs

[Chart: daily job counts and 30-day moving average, 10/2/98 through 9/8/99.]

SLIDE 20

Mcurie Production Jobs 33 to 96 PEs

[Chart: daily job counts and 30-day moving average, 10/2/98 through 9/9/99.]

SLIDE 21

Mcurie Production Jobs Greater than 96 PEs

[Chart: daily job counts and 30-day moving average, 10/2/98 through 9/28/99.]

SLIDE 22

Conclusions

• Psched has been very stable
• GRM service limits are an effective means of managing the interactive workload
• Prime job feature is an effective tool for
  – providing quick interactive response
  – scheduling large jobs
• System management is simplified
• Utilization is high
• Too early to declare victory with "priority queuing"