

  1. Collective Prefetching for Parallel I/O Systems. Yong Chen and Philip C. Roth, Oak Ridge National Laboratory. PDSW 2010.

  2. Outline
  • I/O gap in high-performance computing
  • I/O prefetching and its limitations
  • Collective prefetching design and implementation
  • Preliminary experimental evaluation
  • Conclusion and future work

  3. High-Performance Computing Trend

  4. I/O for Large-scale Scientific Computing
  • Reading input and restart files
  • Reading and processing large amounts of data
  • Writing checkpoint files
  • Writing movie and history files
  • Applications tend to be data intensive
  (Figure: compute nodes accessing a metadata server and object storage servers.)

  5. The I/O Gap
  • Widening gap between computing and I/O
  • Widening gap between application demands and I/O capability
  • Long I/O access latency leads to severe overall performance degradation
  • Limited I/O capability is often cited as the cause of low sustained performance
  (Figure: FLOPS vs. disk bandwidth; the I/O gap between application I/O demand and I/O system capability widens with system size.)

  6. Bridging the Gap: Prefetching
  • Move data in advance and closer to the application
  • Improve effective I/O system capability
  (Figure: timelines showing prefetches overlapped with compute phases to hide I/O time.)
  • Representative existing work
    – Patterson and Gibson, TIP, SOSP'95
    – Tran and Reed, time-series-model based, TPDS'04
    – Yang et al., speculative execution, USENIX'02
    – Byna et al., signature based, SC'08
    – Blas et al., multi-level caching and prefetching for BG/P, PVM/MPI'09

  7. Limitations of Existing Strategies
  • The effectiveness of I/O prefetching depends on carrying out prefetches efficiently and moving data swiftly
  • Existing studies take an independent approach, without considering the correlation of accesses among processes
    – Independent prefetching
  • The processes of a parallel application have strongly correlated I/O accesses
    – This is the foundation of collective I/O, data sieving, etc.
  • We propose to take advantage of this correlation
    – Parallel I/O prefetching should be done in a collective way rather than in an ad hoc, individual, and independent way

  8. Collective Prefetching Idea
  • Take advantage of the correlation among the I/O accesses of multiple processes to optimize prefetching
  • Benefits/features (see the coalescing sketch below)
    – Filter overlapping and redundant prefetch requests
    – Combine prefetch requests from multiple processes
    – Combine demand requests with prefetch requests
    – Form large, contiguous requests
    – Reduce the number of system calls
  • A similar mechanism is exploited in optimizations such as collective I/O and data sieving, but it has not yet been studied for prefetching
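
To make the filtering and combining step concrete, below is a minimal C sketch that coalesces a gathered list of (offset, length) prefetch requests into fewer, larger contiguous requests. The io_req_t structure, the coalesce_requests function, and the sample values are illustrative assumptions, not code from the paper.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical request descriptor: not from the paper. */
typedef struct { long long offset; long long length; } io_req_t;

static int cmp_req(const void *a, const void *b) {
    const io_req_t *x = a, *y = b;
    return (x->offset > y->offset) - (x->offset < y->offset);
}

/* Coalesce overlapping or adjacent requests in place; returns the new count. */
static int coalesce_requests(io_req_t *reqs, int n) {
    if (n == 0) return 0;
    qsort(reqs, n, sizeof(io_req_t), cmp_req);
    int out = 0;
    for (int i = 1; i < n; i++) {
        long long cur_end = reqs[out].offset + reqs[out].length;
        if (reqs[i].offset <= cur_end) {
            /* Overlapping or adjacent: extend the current request. */
            long long new_end = reqs[i].offset + reqs[i].length;
            if (new_end > cur_end)
                reqs[out].length = new_end - reqs[out].offset;
        } else {
            reqs[++out] = reqs[i];   /* Start a new contiguous request. */
        }
    }
    return out + 1;
}

int main(void) {
    /* Prefetch requests gathered from several processes (example values). */
    io_req_t reqs[] = { {0, 4}, {2, 4}, {6, 2}, {16, 4}, {16, 4} };
    int n = coalesce_requests(reqs, 5);
    for (int i = 0; i < n; i++)
        printf("offset=%lld length=%lld\n", reqs[i].offset, reqs[i].length);
    return 0;
}
```

With the example input, the five requests collapse to two: (0, 8) and (16, 4), which is the effect the slide describes at the level of aggregated prefetch queues.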

  9. Collective Prefetching Framework
  (Software stack, top to bottom:)
  • Application processes
  • Parallel I/O middleware/library with collective prefetching (prefetch delegates), alongside collective I/O, two-phase I/O, and caching
  • Parallel file systems (PVFS, Lustre, GPFS, PanFS)
  • I/O hardware, storage devices

  10. MPI-IO with Collective Prefetching
  • MPI-IO and ROMIO
  • Collective I/O and the two-phase implementation (a collective read example follows below)
  (Figure 1: ROMIO software stack: MPI-IO on top of ADIO, with drivers for UFS, NFS, PFS, PVFS2, and others.)
  (Figure 2: Two-phase collective I/O: processes 0-3 exchange data with aggregators 0-3 in the communication phase over file domains; each aggregator reads from file servers 0-3 across the interconnect in the I/O phase.)
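
For reference, the sketch below uses the standard MPI-IO collective read interface (MPI_File_read_at_all), which ROMIO services with two-phase I/O; this is the layer at which the proposed collective prefetching would sit. The file name "input.dat" and the 1 MB per-process block size are assumptions for illustration.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Each process collectively reads its own contiguous block of a shared file.
 * ROMIO services this call with two-phase I/O; collective prefetching would
 * hook in at the same layer. File name and block size are illustrative. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const MPI_Offset block = 1 << 20;               /* 1 MB per process */
    char *buf = malloc((size_t)block);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "input.dat",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

    /* Collective read: all processes call together; the offsets form a
     * pattern across the file that the aggregators can service efficiently. */
    MPI_Offset offset = (MPI_Offset)rank * block;
    MPI_File_read_at_all(fh, offset, buf, (int)block, MPI_BYTE,
                         MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}
```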

  11. Two-Phase Read Protocol in ROMIO
  • Calculate offsets and exchange: each aggregator calculates the span of the I/O requests and exchanges it with the others
  • Calculate file domains and requests: the aggregated span is partitioned into file domains
  • Reads: each aggregator carries out the I/O requests for its own file domain
  • Exchange: all aggregators send data to the requesting processes, and each process receives its required data
  (See the two-phase sketch below.)
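
The following C sketch outlines these phases with standard MPI calls, under simplifying assumptions: every process acts as an aggregator, each process requests one contiguous block, and the final redistribution step is only indicated in a comment. It is a schematic of the protocol, not ROMIO's actual implementation.

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Phase 1a: calculate the local request and exchange with all aggregators. */
    long long my_req[2] = { (long long)rank * (1 << 20), 1 << 20 }; /* offset, length */
    long long *all_req = malloc(2 * (size_t)nprocs * sizeof(long long));
    MPI_Allgather(my_req, 2, MPI_LONG_LONG, all_req, 2, MPI_LONG_LONG,
                  MPI_COMM_WORLD);

    /* Phase 1b: compute the aggregated span and partition it into file domains. */
    long long lo = all_req[0], hi = all_req[0] + all_req[1];
    for (int i = 1; i < nprocs; i++) {
        long long s = all_req[2 * i], e = s + all_req[2 * i + 1];
        if (s < lo) lo = s;
        if (e > hi) hi = e;
    }
    long long fd_size  = (hi - lo + nprocs - 1) / nprocs;
    long long fd_start = lo + (long long)rank * fd_size;
    long long fd_end   = fd_start + fd_size < hi ? fd_start + fd_size : hi;

    /* Phase 2: each aggregator reads its own file domain as one large request. */
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "input.dat", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);
    char *domain_buf = malloc((size_t)fd_size);
    if (fd_end > fd_start)
        MPI_File_read_at(fh, fd_start, domain_buf,
                         (int)(fd_end - fd_start), MPI_BYTE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    /* Communication phase: aggregators would now ship each process the pieces
     * of its original request (for example via MPI_Alltoallv); omitted here. */
    free(domain_buf);
    free(all_req);
    MPI_Finalize();
    return 0;
}
```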

  12. Extended Protocol with Collective Prefetching
  The two-phase read protocol is extended as follows (a minimal cache-check sketch follows below):
  • Calc offsets & exchange: A. maintain access history, B. predict future accesses, D. check demand requests against the cache buffer
  • Calc FDs & requests: E. calculate file domains and requests, including the prefetch requests
  • Reads
  • Exchange: C. place prefetched data in the cache buffer, and exchange demand data as before
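
As one possible illustration of step D, the hypothetical cache_lookup function below checks whether a demand request is fully covered by a previously prefetched region and, if so, satisfies it from the cache buffer without issuing new I/O. The structure and names are assumptions, not the authors' code.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical cache buffer holding one prefetched region (illustrative). */
typedef struct {
    long long offset;    /* file offset of the cached region                  */
    long long length;    /* number of valid bytes in data[]                   */
    char     *data;      /* prefetched bytes for [offset, offset + length)    */
} prefetch_cache_t;

/* Returns true and copies the bytes if [off, off + len) is fully cached. */
bool cache_lookup(const prefetch_cache_t *c, long long off, long long len,
                  char *user_buf) {
    if (off >= c->offset && off + len <= c->offset + c->length) {
        memcpy(user_buf, c->data + (off - c->offset), (size_t)len);
        return true;    /* demand request satisfied without new I/O  */
    }
    return false;       /* miss: forward the request to the read phase */
}
```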

  13. Collective Prefetching Algorithm
  Algorithm cpf /* collective prefetching at the MPI-IO layer */
  Input: I/O request offset list, I/O request length list
  Output: none
  Begin
  1. Each aggregator maintains a recent access history of window size w
  2. Aggregators/prefetch delegates run prediction or mining algorithms on the tracked global access history
     – Algorithms can be streaming, strided, or Markov predictors, or advanced mining algorithms such as PCA/ANN
  3. Generate prefetch requests and enqueue them in the prefetch queue (PFQ)
  4. Process the requests in the PFQs together with the demand accesses
  5. Filter out overlapping and redundant requests
  6. Perform the extended two-phase I/O protocol with the prefetch requests
     – Prefetched data are kept in the cache buffer to satisfy future requests
     – Exchange data to satisfy demand requests (move data to the user buffer)
  End
  (A strided-prediction sketch for step 2 follows below.)
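
As a concrete instance of the prediction step (step 2), here is a minimal strided predictor over a window of gathered access offsets; the window size, constants, and function names are illustrative assumptions rather than the authors' implementation.

```c
#include <stdio.h>

/* Minimal strided predictor (one instance of step 2): if the last w offsets
 * in the global access history advance by a constant stride, predict the
 * next k offsets to prefetch. Window size and names are assumptions. */
#define HIST_W 4   /* history window size w                   */
#define NPRED  2   /* number of prefetch requests to generate */

/* Returns the number of predictions written to pred[]; 0 if no stride found. */
int predict_strided(const long long hist[HIST_W], long long pred[NPRED]) {
    long long stride = hist[1] - hist[0];
    for (int i = 2; i < HIST_W; i++)
        if (hist[i] - hist[i - 1] != stride)
            return 0;                          /* no stable stride detected */
    for (int k = 0; k < NPRED; k++)
        pred[k] = hist[HIST_W - 1] + (k + 1) * stride;
    return NPRED;
}

int main(void) {
    /* Example: aggregated offsets from several processes, 4 MB stride. */
    long long hist[HIST_W] = { 0, 4 << 20, 8 << 20, 12 << 20 };
    long long pred[NPRED];
    int n = predict_strided(hist, pred);
    for (int i = 0; i < n; i++)
        printf("prefetch offset %lld\n", pred[i]);   /* 16 MB, 20 MB */
    return 0;
}
```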

  14. Preliminary Results
  • Strided access pattern, with 1 MB and 4 MB strides
  (Figures: sustained bandwidth in MB/s versus number of processes (8, 16, 32, 64, 128) for plain MPI-IO, collective prefetching, and individual prefetching.)
  • With 1 MB stride: collective prefetching, up to 22% improvement, 19% on average; individual prefetching, up to 12%, 8% on average
  • With 4 MB stride: collective prefetching, up to 17% improvement, 15% on average; individual prefetching, up to 8%, 6% on average

  15. Preliminary Results
  • Strided access pattern, with 1 MB and 4 MB strides
  (Figures: speedup versus number of processes (8, 16, 32, 64, 128) for individual and collective prefetching, with 1 MB and 4 MB strides.)
  • Collective prefetching outperformed individual prefetching by over one fold (more than double the speedup)
  • Collective prefetching had a more stable performance trend

  16. Preliminary Results
  • Nested strided access pattern, with a (1 MB, 3 MB) stride
  (Figures: sustained bandwidth in MB/s and speedup versus number of processes (8, 16, 32, 64, 128) for MPI-IO, collective prefetching, and individual prefetching.)
  • Collective prefetching outperformed individual prefetching by over 66%
  • Collective prefetching again showed a stable performance trend

  17. Conclusion
  • I/O has been widely recognized as the performance bottleneck for many HEC/HPC applications
  • The correlation of I/O accesses is exploited in data sieving and collective I/O, but it had not yet been exploited for prefetching
  • We propose a new form of collective prefetching for parallel I/O systems
  • Preliminary results have demonstrated its potential
  • The idea is general and can be applied at many levels, such as the storage device level or the server level

  18. Ongoing and Future Work
  • Exploit the potential at the server level
  • LACIO: a new collective I/O strategy, and I/O customization
  (Figure: LACIO example in which logical blocks LB0 to LB11, striped across four file servers (S0 to S3), are assigned to aggregators 0 to 3 so that logical file domains align with the physical data layout.)

  19. Any Questions? Thank you.
  • Acknowledgement: Prof. Xian-He Sun of Illinois Institute of Technology, Dr. Rajeev Thakur of Argonne National Laboratory, Prof. Wei-Keng Liao and Prof. Alok Choudhary of Northwestern University.
