SLIDE 1

Collective Prefetching for Parallel I/O Systems

Yong Chen and Philip C. Roth, Oak Ridge National Laboratory

SLIDE 2

Outline

  • I/O gap in high-performance computing
  • I/O prefetching and its limitations
  • Collective prefetching design and implementation
  • Preliminary experimental evaluation
  • Conclusion and future work
SLIDE 3

High-Performance Computing Trend

SLIDE 4

I/O for Large-scale Scientific Computing

  • Reading input and restart files
  • Reading and processing large amounts of data
  • Writing checkpoint files
  • Writing movie and history files
  • Applications tend to be data intensive

[Figure: parallel I/O system with compute nodes, a metadata server, and object storage servers]

SLIDE 5

The I/O Gap

  • Widening gap between computing and I/O
  • Widening gap between demands and I/O capability
  • Long I/O access latency leads to severe overall performance degradation
  • Limited I/O capability is often cited as the cause of low sustained performance

[Figure: application I/O demand diverging from I/O system capability as system size grows (the I/O gap); FLOPS versus disk bandwidth]

SLIDE 6

Bridging the Gap: Prefetching

  • Move data in advance and closer
  • Improve I/O system capability
  • Representative existing work
    – Patterson and Gibson, TIP, SOSP’95
    – Tran and Reed, time-series model based, TPDS’04
    – Yang et al., speculative execution, USENIX’02
    – Byna et al., signature based, SC’08
    – Blas et al., multi-level caching and prefetching for BGP, PVM/MPI’09

[Figure: timeline comparing serialized compute and I/O phases with prefetching that overlaps I/O with computation]
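To make the timeline concrete, the overlap can be sketched with nonblocking MPI-IO: start reading the next block while computing on the current one. This is only an illustrative sketch; the file name, block size, step count, and the trivial process() routine are assumptions, not details from the slides.

    /* Illustrative sketch: overlap computation with a prefetch of the next
     * block using nonblocking MPI-IO.  The file name, block size, step count,
     * and the process() routine are assumptions made for this example. */
    #include <mpi.h>
    #include <stdlib.h>

    #define BLOCK (1 << 20)                    /* 1 MB per step (assumed) */

    static void process(double *buf, size_t n) /* stand-in for the compute phase */
    {
        for (size_t i = 0; i < n; i++)
            buf[i] *= 2.0;
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nsteps = 8;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "input.dat", MPI_MODE_RDONLY,
                      MPI_INFO_NULL, &fh);

        char *cur = malloc(BLOCK), *next = malloc(BLOCK);
        MPI_Offset base = (MPI_Offset)rank * nsteps * BLOCK;

        /* Read the first block on demand. */
        MPI_File_read_at(fh, base, cur, BLOCK, MPI_BYTE, MPI_STATUS_IGNORE);

        for (int i = 1; i < nsteps; i++) {
            MPI_Request req;
            /* Prefetch block i while computing on block i-1. */
            MPI_File_iread_at(fh, base + (MPI_Offset)i * BLOCK, next, BLOCK,
                              MPI_BYTE, &req);
            process((double *)cur, BLOCK / sizeof(double));
            MPI_Wait(&req, MPI_STATUS_IGNORE); /* prefetched block is ready */
            char *tmp = cur; cur = next; next = tmp;
        }
        process((double *)cur, BLOCK / sizeof(double));

        MPI_File_close(&fh);
        free(cur);
        free(next);
        MPI_Finalize();
        return 0;
    }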

SLIDE 7

Limitation of Existing Strategies

  • The effectiveness of I/O prefetching depends on carrying out prefetches efficiently and moving data swiftly
  • Existing studies take an independent approach, without considering the correlation of accesses among processes
    – Independent prefetching
  • Multiple processes of a parallel application have strong correlation with each other with respect to I/O accesses
    – Foundation of collective I/O, data sieving, etc.
  • We propose to take advantage of this correlation
    – Parallel I/O prefetching should be done in a collective way rather than in an ad hoc, individual, and independent way

SLIDE 8

Collective Prefetching Idea

  • Take advantage of the correlation among the I/O accesses of multiple processes to optimize prefetching
  • Benefits/features (a request-coalescing sketch follows below):
    – Filter overlapping and redundant prefetch requests
    – Combine prefetch requests from multiple processes
    – Combine demand requests with prefetch requests
    – Form large and contiguous requests
    – Reduce system calls

  • A similar mechanism is exploited in optimizations such as collective I/O and data sieving, but it has not yet been studied for prefetching
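As a concrete illustration of the filtering and combining benefits above, the sketch below coalesces overlapping or adjacent (offset, length) requests gathered from multiple processes into fewer, larger contiguous requests. The io_req type and the qsort-based merge are assumptions made for the example, not the actual data structures of the prefetch delegates.

    /* Illustrative sketch: coalesce overlapping or adjacent (offset, length)
     * requests from multiple processes into large contiguous requests.
     * The io_req type and the qsort-based merge are assumptions. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { long long offset; long long length; } io_req;

    static int cmp_offset(const void *a, const void *b)
    {
        const io_req *x = a, *y = b;
        return (x->offset > y->offset) - (x->offset < y->offset);
    }

    /* Coalesce in place; returns the number of merged requests. */
    static int coalesce(io_req *reqs, int n)
    {
        if (n == 0)
            return 0;
        qsort(reqs, n, sizeof(io_req), cmp_offset);
        int m = 0;                              /* index of last merged request */
        for (int i = 1; i < n; i++) {
            long long end = reqs[m].offset + reqs[m].length;
            if (reqs[i].offset <= end) {        /* overlapping or adjacent */
                long long new_end = reqs[i].offset + reqs[i].length;
                if (new_end > end)
                    reqs[m].length = new_end - reqs[m].offset;
            } else {
                reqs[++m] = reqs[i];            /* gap: start a new request */
            }
        }
        return m + 1;
    }

    int main(void)
    {
        io_req reqs[] = { {0, 4096}, {4096, 4096}, {2048, 4096}, {16384, 4096} };
        int n = coalesce(reqs, 4);
        for (int i = 0; i < n; i++)
            printf("offset=%lld length=%lld\n", reqs[i].offset, reqs[i].length);
        return 0;
    }

In this toy run the four input requests collapse into two contiguous requests, which is the effect the prefetch delegates aim for at scale.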

SLIDE 9

Collective Prefetching Framework

[Figure: collective prefetching framework. Application processes sit above a parallel I/O middleware/library layer that provides collective I/O, two-phase I/O, caching, and collective prefetching (prefetch delegates); the middleware runs on parallel file systems (PVFS, Lustre, GPFS, PanFS) over the I/O hardware and storage devices]

SLIDE 10

MPI-IO with Collective Prefetching

  • MPI-IO and ROMIO
  • Collective I/O and Two-phase Implementation

[Figure: two-phase collective I/O. Processes 0 to 3 issue requests; aggregators 0 to 3 read their assigned file domains from the file servers in the I/O phase and redistribute data over the interconnect in the communication phase. ROMIO implements MPI-IO on top of ADIO, with backends such as UFS, PFS, PVFS2, and NFS]
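From the application side, the collective path that the two-phase implementation optimizes can be sketched as follows: each process sets a strided file view and calls a collective read, which ROMIO services with aggregators. The file name, block size, and block count are assumptions chosen for illustration.

    /* Illustrative sketch: a strided collective read through MPI-IO.  ROMIO
     * services MPI_File_read_all with the two-phase protocol (aggregators read
     * large contiguous file domains, then exchange data).  The file name,
     * block size, and block count are assumptions. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const int block = 1 << 20;             /* 1 MB blocks (assumed) */
        const int nblocks = 16;

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "input.dat", MPI_MODE_RDONLY,
                      MPI_INFO_NULL, &fh);

        /* Strided file view: each process sees every nprocs-th block. */
        MPI_Datatype filetype;
        MPI_Type_vector(nblocks, block, nprocs * block, MPI_BYTE, &filetype);
        MPI_Type_commit(&filetype);
        MPI_File_set_view(fh, (MPI_Offset)rank * block, MPI_BYTE, filetype,
                          "native", MPI_INFO_NULL);

        char *buf = malloc((size_t)nblocks * block);
        /* Collective read: ROMIO's aggregators run two-phase I/O here. */
        MPI_File_read_all(fh, buf, nblocks * block, MPI_BYTE, MPI_STATUS_IGNORE);

        MPI_Type_free(&filetype);
        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }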

SLIDE 11

Two-Phase Read Protocol in ROMIO

  • Each aggregator calculates the span of its I/O requests and exchanges it with the other aggregators
  • The aggregated span is partitioned into file domains
  • Each aggregator carries out the I/O requests for its own file domain
  • All aggregators send data to the requesting processes, and each process receives its required data

[Figure: two-phase read flow: calculate offsets & exchange, calculate file domains & requests, reads, exchange]
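A minimal sketch of the first two steps, under the simplifying assumptions that every process acts as an aggregator and that file domains are equal partitions of the aggregate span (ROMIO's actual partitioning is more involved):

    /* Illustrative sketch of steps 1-2: exchange each process's request span
     * and partition the aggregate span into file domains.  Assumes every
     * process acts as an aggregator and that file domains are equal splits;
     * ROMIO's actual partitioning is more involved. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Each process's request span [start, end); values are assumptions. */
        long long span[2];
        span[0] = (long long)rank * (4 << 20);
        span[1] = span[0] + (1 << 20);

        /* Step 1: exchange spans so every aggregator sees the global extent. */
        long long *all = malloc(2 * nprocs * sizeof(long long));
        MPI_Allgather(span, 2, MPI_LONG_LONG, all, 2, MPI_LONG_LONG,
                      MPI_COMM_WORLD);

        long long lo = all[0], hi = all[1];
        for (int i = 1; i < nprocs; i++) {
            if (all[2 * i] < lo)     lo = all[2 * i];
            if (all[2 * i + 1] > hi) hi = all[2 * i + 1];
        }

        /* Step 2: partition the aggregate span into one file domain per
         * aggregator.  Steps 3-4 (reads and the data exchange back to the
         * requesting processes) are omitted from this sketch. */
        long long fd_size  = (hi - lo + nprocs - 1) / nprocs;
        long long fd_start = lo + (long long)rank * fd_size;
        long long fd_end   = (fd_start + fd_size < hi) ? fd_start + fd_size : hi;
        printf("rank %d: file domain [%lld, %lld)\n", rank, fd_start, fd_end);

        free(all);
        MPI_Finalize();
        return 0;
    }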

SLIDE 12

Extended Protocol with Collective Prefetching

[Figure: the two-phase read flow extended with collective prefetching. Added steps: A. maintain access history; B. predict future accesses; C. place prefetched data in the cache buffer; D. check demand requests against the cache buffer; E. calculate file domains and requests including prefetch requests. The base steps (calculate offsets & exchange, reads, exchange) remain]

SLIDE 13

Collective Prefetching Algorithm

Algorithm cpf /* Collective Prefetching at MPI-IO */
Input: I/O request offset list, I/O request length list
Output: none
Begin
  1. Each aggregator maintains a recent access history of window size w
  2. Aggregators (prefetch delegates) run prediction or mining algorithms on the tracked global access history
     – Algorithms can be streaming, strided, Markov, or advanced mining algorithms such as PCA/ANN
  3. Generate prefetch requests and enqueue them in the prefetch queue (PFQ)
  4. Process requests in the PFQs together with demand accesses
  5. Filter out overlapping and redundant requests
  6. Perform the extended two-phase I/O protocol with the prefetch requests
     – Prefetched data are kept in the cache buffer to satisfy future requests
     – Exchange data to satisfy demand requests (move data to the user buffer)
End
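As an example of the kind of predictor step 2 could use, the sketch below detects a constant stride in a window of recent request offsets and generates the next few prefetch offsets. The window size, prefetch depth, and example offsets are assumptions; the algorithms mentioned above (streaming, Markov, PCA/ANN) are more elaborate.

    /* Illustrative sketch: a simple strided predictor.  Given the last w
     * request offsets, detect a constant stride and generate the next k
     * prefetch offsets.  Window size, prefetch depth, and the example
     * offsets are assumptions. */
    #include <stdio.h>

    /* Returns the number of predicted offsets written to pref
     * (0 if the history window has no constant stride). */
    static int predict_strided(const long long *hist, int w,
                               long long *pref, int k)
    {
        if (w < 2)
            return 0;
        long long stride = hist[1] - hist[0];
        for (int i = 2; i < w; i++)
            if (hist[i] - hist[i - 1] != stride)
                return 0;
        for (int i = 0; i < k; i++)
            pref[i] = hist[w - 1] + (long long)(i + 1) * stride;
        return k;
    }

    int main(void)
    {
        long long hist[4] = { 0, 4 << 20, 8 << 20, 12 << 20 };  /* 4 MB stride */
        long long pref[2];
        int n = predict_strided(hist, 4, pref, 2);
        for (int i = 0; i < n; i++)
            printf("prefetch offset %lld\n", pref[i]);
        return 0;
    }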

SLIDE 14

Preliminary Results

[Figure: sustained bandwidth (MB/s) versus number of processes (8 to 128) for MPI-IO, collective prefetching, and individual prefetching, one panel per stride size (1 MB and 4 MB)]

  • Strided access pattern, with 1 MB and 4 MB strides
    – With 1 MB stride: collective prefetching improved bandwidth by up to 22% (19% on average); individual prefetching by up to 12% (8% on average)
    – With 4 MB stride: collective prefetching improved bandwidth by up to 17% (15% on average); individual prefetching by up to 8% (6% on average)

SLIDE 15

Preliminary Results

[Figure: speedup of individual and collective prefetching versus number of processes (8 to 128), one panel per stride size (1 MB and 4 MB)]

  • Strided access pattern, with 1 MB and 4 MB strides
  • Collective prefetching outperformed individual prefetching by over one fold in speedup
  • Collective prefetching showed a more stable performance trend

SLIDE 16

Preliminary Results

[Figure: sustained bandwidth (MB/s) and speedup versus number of processes (8 to 128) for MPI-IO, collective prefetching, and individual prefetching under the nested strided access pattern]

  • Nested strided access pattern, with (1 MB, 3 MB) strides
  • Collective prefetching outperformed individual prefetching by over 66%
  • Collective prefetching showed a similarly stable performance trend

SLIDE 17

Conclusion

  • I/O has been widely recognized as the performance bottleneck for many HEC/HPC applications
  • The correlation of I/O accesses has been exploited in data sieving and collective I/O, but not yet studied for prefetching
  • We propose a new form of collective prefetching for parallel I/O systems
  • Preliminary results have demonstrated the potential of the approach
  • The idea is general and can be applied at many levels, such as the storage device level or the server level

SLIDE 18

Ongoing and Future Work

  • Exploit the potential at the server level
  • LACIO: A New Collective I/O Strategy, and I/O customization

[Figure: layout-aware collective I/O (LACIO). Logical file domains and logical blocks (LB0 to LB11) are mapped onto stripes of the physical file servers; aggregators 0 to 3 perform the I/O phase and exchange data with the processes in the communication phase over the interconnect]

SLIDE 19

Thank you.

  • Acknowledgement: Prof. Xian-He Sun of the Illinois Institute of Technology, Dr. Rajeev Thakur of Argonne National Laboratory, and Prof. Wei-Keng Liao and Prof. Alok Choudhary of Northwestern University.

Any Questions?