Job Coscheduling on Coupled High-End Computing Systems
Wei Tang*, Narayan Desai#, Venkatram Vishwanath#, Daniel Buettner#, Zhiling Lan*
* Illinois Institute of Technology
# Argonne National Laboratory
– Intrepid: IBM Blue Gene/P with 163,840 cores (#13 in Top500)
– Eureka: 100-node cluster with 200 GPUs (largest GPU installation)
– Ranger: SunBlade with 62,976 cores (#15 in Top500)
– Longhorn: 256-node Dell cluster, 128 GPUs
– Jaguar: Cray XT5 with 224,162 cores (#3 in Top500)
– Lens: 32-node Linux cluster, 2 GPUs
– Kraken: Cray XT5 with 98,928 cores (#8 in Top500)
– Verne: 5-node Dell cluster
– Wait time: start time minus submission time, averaged over all jobs
– Slowdown: (wait time + runtime) / runtime, averaged over all jobs
– Sync-up overhead: how many more minutes a job must wait under coscheduling, averaged over all paired jobs
– Loss of computing capability: measured in node-hours and as system utilization rate
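The metrics listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' evaluation code: the `Job` record, the `solo_start` field (the start time a paired job would have had without coscheduling), and the definition of the lost utilization rate as lost node-hours divided by available node-hours are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class Job:
    """Hypothetical job record; all times are in minutes."""
    submit: float
    start: float
    runtime: float
    # Assumption: start time the job would have had without coscheduling.
    # None means the job is not part of a coscheduled pair.
    solo_start: Optional[float] = None


def wait_time(job: Job) -> float:
    # Wait time = start time - submission time.
    return job.start - job.submit


def slowdown(job: Job) -> float:
    # Slowdown = (wait time + runtime) / runtime.
    return (wait_time(job) + job.runtime) / job.runtime


def average_wait(jobs: List[Job]) -> float:
    # Averaged over all jobs.
    return mean(wait_time(j) for j in jobs)


def average_slowdown(jobs: List[Job]) -> float:
    # Averaged over all jobs.
    return mean(slowdown(j) for j in jobs)


def average_sync_up_overhead(jobs: List[Job]) -> float:
    # Extra minutes waited because of coscheduling, averaged over paired jobs only.
    # Raises statistics.StatisticsError if there are no paired jobs.
    paired = [j for j in jobs if j.solo_start is not None]
    return mean(j.start - j.solo_start for j in paired)


def lost_utilization_rate(lost_node_hours: float, nodes: int, hours: float) -> float:
    # Assumed definition: lost node-hours as a fraction of available node-hours.
    return lost_node_hours / (nodes * hours)


jobs = [
    Job(submit=0, start=10, runtime=20),
    Job(submit=5, start=35, runtime=30, solo_start=20),
]
print(average_wait(jobs), average_slowdown(jobs), average_sync_up_overhead(jobs))
```

Only the second job in the sample list is paired, so it alone contributes to the sync-up overhead average, while both jobs count toward wait time and slowdown.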
Scheme on Intrepid-Eureka:
– HH: Hold-Hold
– HY: Hold-Yield
– YH: Yield-Hold
– YY: Yield-Yield
System utilization on Eureka: 25%, 50%, 75%
[Figure: Intrepid job sync-up overhead (average). Y-axis: minutes. X-axis: Eureka configuration (system utilization / scheme): 25%/H, 25%/Y, 50%/H, 50%/Y, 75%/H, 75%/Y. Series: hold, yield.]
[Figure: Eureka sync-up overhead (average). Y-axis: minutes. X-axis: Eureka system utilization / Intrepid scheme: 25%/H, 25%/Y, 50%/H, 50%/Y, 75%/H, 75%/Y. Series: hold, yield.]
[Figure: Intrepid loss of computing capability. Y-axes: lost system utilization rate, node-hours. X-axis: Eureka configuration (system utilization / scheme): 25%/H, 25%/Y, 50%/H, 50%/Y, 75%/H, 75%/Y.]
[Figure: Eureka loss of computing capability. Y-axes: lost system utilization rate, node-hours. X-axis: Eureka system utilization / Intrepid scheme: 25%/H, 25%/Y, 50%/H, 50%/Y, 75%/H, 75%/Y.]
[Figure: Intrepid job sync-up overhead (average). Y-axis: minutes. X-axis: mate job ratio / remote scheme: 2.5%/H, 5%/H, 10%/H, 20%/H, 33%/H. Series: hold, yield.]
[Figure: Eureka job sync-up overhead (average). Y-axis: minutes. X-axis: mate job ratio / remote scheme: 2.5%/H, 5%/H, 10%/H, 20%/H, 33%/H. Series: hold, yield.]
[Figure: Eureka loss of computing capability. Y-axes: node-hours and a percentage scale. X-axis: mate job ratio / remote scheme: 2.5%/H, 5%/H, 10%/H, 20%/H, 33%/H.]
[Figure: Intrepid loss of computing capability. Y-axes: node-hours and a percentage scale. X-axis: mate job ratio / remote scheme: 2.5%/H, 5%/H, 10%/H, 20%/H, 33%/H.]