Improving I/O Performance of HPC Applications Using Intra-Job Scheduling

SLIDE 1

Improving I/O Performance of HPC Applications Using Intra-Job Scheduling

Arnab K. Paul†, Olaf Faaland‡, Adam Moody‡, Elsa Gonsiorowski‡, Kathryn Mohror‡, Ali R. Butt†

†Virginia Tech, ‡Lawrence Livermore National Laboratory

PDSW-DISCS 2019; collocated with SC’19, Denver, CO

SLIDE 2

Motivation: The Increasing Gap

[Figure: Processor Performance vs Disk Access Time. Source: https://newsroom.intel.com/editorials/3d-xpoint-memory-storage/#gs.gqtcop]

SLIDE 3

Motivation

I/O operations become a limiting factor in application efficiency.

SLIDE 4

Motivation

I/O operations become a limiting factor in application efficiency.

Goal: improve the I/O performance of HPC applications using intra-job scheduling.

SLIDE 5

Lustre Parallel File System

  • Lustre clients
  • Management Server (MGS) with its Management Target (MGT)
  • Metadata Server (MDS) with its Metadata Target (MDT)
  • DNE (Distributed Namespace Environment) Metadata Servers and Metadata Targets
  • Object Storage Servers and Object Storage Targets (OSSs and OSTs)
  • Ethernet or InfiniBand network connecting clients to the servers for direct, parallel file access
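The "direct, parallel file access" in the diagram relies on striping: a file's data is spread over several OSTs so that clients can talk to them concurrently. The following is a minimal sketch of round-robin (RAID-0 style) striping arithmetic, assuming a fixed stripe size and stripe count. The function name `locate_stripe` and the example parameter values are illustrative assumptions and do not come from the slides.

```python
from __future__ import annotations


def locate_stripe(offset: int, stripe_size: int, stripe_count: int) -> tuple[int, int]:
    """Map a file byte offset to (index of the OST object in the file's layout,
    byte offset inside that object), assuming plain round-robin striping."""
    if offset < 0 or stripe_size <= 0 or stripe_count <= 0:
        raise ValueError("offset must be >= 0; stripe_size and stripe_count must be > 0")
    chunk = offset // stripe_size          # which stripe-sized chunk of the file
    ost_index = chunk % stripe_count       # chunks rotate over the file's OST objects
    full_rounds = chunk // stripe_count    # complete rotations before this chunk
    object_offset = full_rounds * stripe_size + offset % stripe_size
    return ost_index, object_offset


if __name__ == "__main__":
    MIB = 1024 * 1024
    # Hypothetical layout: 1 MiB stripes over 4 OST objects.
    for off in (0, MIB, 4 * MIB + 10, 5 * MIB + 5):
        print(off, locate_stripe(off, stripe_size=MIB, stripe_count=4))
```

Consecutive 1 MiB chunks land on different OST objects, which is why several clients (or several ranks of one job) can read and write disjoint regions of the same file in parallel.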

SLIDE 6

System Design

Diagram labels, in reading order:

  • Job Statistics
  • Dataset
  • Machine Learning Modeling
  • Validation
  • Models are stored

SLIDE 7

System Design

Diagram labels, in reading order:

  • Model DB
  • New jobs
  • Job scheduler
  • Currently running jobs
  • Current and new jobs’ future requests
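The diagram suggests that the job scheduler combines predicted future I/O requests for currently running jobs and new jobs. The slides do not specify a scheduling policy, so the following is only a hypothetical sketch of one way such predictions could be used: delay a new job's predicted write bursts until the aggregate predicted bandwidth stays under a capacity limit. The `Burst` type, the function names, the capacity value, and the greedy delay search are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Burst:
    """A predicted write burst (hypothetical representation)."""
    start: int        # seconds, inclusive
    end: int          # seconds, exclusive
    bandwidth: float  # e.g. GB/s


def peak_load(bursts: list[Burst]) -> float:
    """Peak aggregate bandwidth, computed with a sweep over start/end events."""
    events = []
    for b in bursts:
        events.append((b.start, b.bandwidth))
        events.append((b.end, -b.bandwidth))
    events.sort()  # at equal times, negative deltas (burst ends) sort first
    load = peak = 0.0
    for _, delta in events:
        load += delta
        peak = max(peak, load)
    return peak


def shift(bursts: list[Burst], delay: int) -> list[Burst]:
    """Return the same bursts moved later by `delay` seconds."""
    return [Burst(b.start + delay, b.end + delay, b.bandwidth) for b in bursts]


def pick_delay(running: list[Burst], new_job: list[Burst],
               capacity: float, max_delay: int) -> int:
    """Smallest delay in [0, max_delay] that keeps the predicted peak within
    `capacity`; if none exists, the delay with the lowest predicted peak."""
    best_delay, best_peak = 0, float("inf")
    for delay in range(max_delay + 1):
        peak = peak_load(running + shift(new_job, delay))
        if peak <= capacity:
            return delay
        if peak < best_peak:
            best_delay, best_peak = delay, peak
    return best_delay


if __name__ == "__main__":
    running = [Burst(0, 10, 8.0)]   # predicted burst of a running job
    new_job = [Burst(0, 5, 5.0)]    # predicted burst of a new job
    print(pick_delay(running, new_job, capacity=10.0, max_delay=20))
```

A real scheduler would also have to account for prediction error and for the cost of delaying a job; this sketch ignores both.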

SLIDE 8

Preliminary Results

  • Built a Lustre Simulator on NS3.
  • Results from time-series modeling show an accuracy of 95% in predicting job write bursts.
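The slides name neither the time-series model nor the exact accuracy definition. As a rough illustration of what "predicting job write bursts" can mean, here is a minimal sketch that labels bursts by thresholding per-interval write volume, predicts each interval with a seasonal-naive rule (repeat the label from one period earlier), and scores the fraction of matching labels. The threshold, the period, the sample data, and all function names are assumptions, and the seasonal-naive rule is a stand-in baseline, not the authors' model.

```python
from __future__ import annotations


def label_bursts(write_volume: list[float], threshold: float) -> list[bool]:
    """Mark an interval as a write burst when its volume reaches the threshold."""
    return [v >= threshold for v in write_volume]


def seasonal_naive_accuracy(labels: list[bool], period: int) -> float:
    """Predict each label as the label `period` intervals earlier and return
    the fraction of predictions that match the actual label."""
    if not 0 < period < len(labels):
        raise ValueError("period must be between 1 and len(labels) - 1")
    predicted = labels[:-period]   # labels for t - period
    actual = labels[period:]       # labels for t
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)


if __name__ == "__main__":
    # Hypothetical per-interval write volumes of a job that checkpoints
    # roughly every third interval, with one late checkpoint near the end.
    volume = [0, 0, 9, 0, 0, 9, 0, 0, 9, 0, 9, 0]
    labels = label_bursts(volume, threshold=5)
    print(seasonal_naive_accuracy(labels, period=3))
```

Periodic checkpointing makes a simple baseline like this surprisingly competitive, which is one reason a learned model's accuracy is usually reported alongside such a baseline.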

SLIDE 9

Next Steps

  • Modify the scheduler to reduce I/O contention.
  • Measure the I/O performance of the jobs as well as the overall performance of the system.

SLIDE 10

akpaul@vt.edu
http://research.cs.vt.edu/dssl/

Thank You! Q & A