PETTT
DISTRIBUTION STATEMENT A. Approved for public release. Distribution is unlimited.
I/O Mini-apps, Compression, and I/O Libraries for Physics-based Simulations
Presented by Sean Ziegeler (Engility PETTT) November 13, 2017
The four I/O mini-apps: "Unstruct", "Cartiso", "Struct", and "AMR"
Masks for missing or invalid data (e.g., land in an ocean model)
2D simplex noise generates synthetic mask maps: the percentage of blanked data points can be chosen, and the noise frequency governs the sizes of the blanked areas (continents vs. islands); a sketch of this masking idea follows below
4D simplex noise fills the time-variant variables
Option to load-balance the non-masked points evenly (as desired) across ranks, but this creates a load imbalance for I/O because the blanked data is still written; compression theoretically rebalances the I/O, since the blanked constants compress well
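A minimal sketch of the masking idea, assuming only numpy and scipy: the mini-apps use simplex noise, but Gaussian-smoothed random noise stands in for it here, and the function and parameter names (make_mask, blank_fraction, feature_size) are illustrative rather than taken from the mini-app sources.

```python
# Hypothetical sketch of the masking idea. The mini-apps use simplex noise;
# here, Gaussian-smoothed random noise stands in for it so that only
# numpy/scipy are required. All names are illustrative.
import numpy as np
from scipy.ndimage import gaussian_filter

def make_mask(nx, ny, blank_fraction=0.25, feature_size=8.0, seed=0):
    """Boolean mask where True marks blanked (invalid) points.

    blank_fraction -- fraction of points to blank (the "% of blanked data points")
    feature_size   -- smoothing length: larger values give a few large blanked
                      regions ("continents"), smaller values give many small
                      ones ("islands"), mimicking the noise-frequency control
    """
    rng = np.random.default_rng(seed)
    noise = gaussian_filter(rng.standard_normal((nx, ny)), sigma=feature_size)
    # Blank everything below the chosen percentile, so ~blank_fraction of the
    # domain ends up masked.
    return noise < np.percentile(noise, 100.0 * blank_fraction)

mask = make_mask(512, 512, blank_fraction=0.25, feature_size=8.0)
field = np.where(mask, 0.0, np.random.default_rng(1).random((512, 512)))
# The blanked points are constants, which is why they compress so well.
```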
[Charts: write throughput (GB/s) vs. core count for ADIOS POSIX on Broadwell (528, 4048, 8008, and 21912 cores) and KNL (512, 4096, and 8192 cores); the series compare computationally unbalanced and balanced runs, each with no compression, zlib, szip, and zfp.]
Computationally unbalanced vs. balanced (the balanced runs are I/O-unbalanced!)
ADIOS POSIX: one file per rank
Red: no compression. Blue: zlib deflate compression (think gzip). Green: szip compression. Purple: zfp (error-bounded lossy, tolerance 0.0001), ~9:1 compression on average; a minimal sketch of this zfp setting follows below.
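As an illustrative sketch of that error-bounded zfp setting, the snippet below uses zfp's Python bindings (zfpy) with the same absolute tolerance of 1e-4 and reports the resulting ratio. It is an assumption-laden stand-in: in the measured runs the compression happens inside the I/O library (e.g., as an ADIOS transform) rather than in Python, and the random array is only a placeholder for one rank's field.

```python
# Illustrative only: error-bounded lossy compression with zfp's Python
# bindings (zfpy), using the same absolute tolerance (1e-4) as the purple
# series in the plots. The real runs compress inside the I/O library.
import numpy as np
import zfpy

data = np.random.rand(64, 64, 64)              # stand-in for one rank's field
compressed = zfpy.compress_numpy(data, tolerance=1e-4)
restored = zfpy.decompress_numpy(compressed)

ratio = data.nbytes / len(compressed)
max_err = np.abs(data - restored).max()
print(f"ratio ~{ratio:.1f}:1, max abs error {max_err:.2e}")
# Smooth physics fields compress far better than random data; the talk
# reports ~9:1 on average at this tolerance.
```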
ADIOS POSIX: one file per rank
Initial scalability with core count
Computational balancing hurts performance a little, but compression sometimes helps
zfp is the fastest compression
KNL is slower
ADIOS POSIX is the fastest method without compression
[Charts: write throughput (GB/s) vs. core count for ADIOS MPI on Broadwell (528, 4048, 8008, and 21912 cores) and KNL (512, 4096, and 8192 cores); same unbalanced/balanced and compression series as above.]
ADIOS MPI: one file for all ranks (a sketch contrasting the per-rank and shared-file layouts follows below)
Good scalability with core count, especially with compression
Computational balancing hurts performance a little, but compression mostly helps
zfp is by far the fastest compression
KNL is much slower, especially for compression
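To make the difference between the two layouts concrete, here is a sketch (not ADIOS code) of the two write patterns using mpi4py: "POSIX-style" one file per rank, versus a single shared file written collectively with MPI-IO. File names and buffer sizes are placeholders.

```python
# Sketch of the two output layouts measured above, using mpi4py rather than
# ADIOS itself: one file per rank ("POSIX-style") vs. one shared file written
# collectively with MPI-IO. File names and sizes are placeholders.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
local = np.random.rand(1 << 20)                # ~8 MB of doubles per rank

# One file per rank (like ADIOS POSIX): no coordination, many files.
local.tofile(f"out.{rank}.bin")

# One shared file for all ranks (like ADIOS MPI): collective write, each rank
# at its own offset.
fh = MPI.File.Open(comm, "out_shared.bin",
                   MPI.MODE_WRONLY | MPI.MODE_CREATE)
fh.Write_at_all(rank * local.nbytes, local)
fh.Close()
```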
[Charts: write throughput (GB/s) vs. core count for ADIOS MPI-Lustre on Broadwell (528, 4048, 8008, and 21912 cores) and KNL (512, 4096, and 8192 cores); same unbalanced/balanced and compression series as above.]
ADIOS MPI-Lustre: one file for all ranks, tuned for the Lustre file system on that machine
Good scalability with core count, especially with compression
Computational balancing hurts performance a little, but compression mostly helps
zfp is by far the fastest compression
KNL is much slower, especially for compression
MPI-Lustre is the fastest method with compression
[Charts: write throughput (GB/s) vs. core count for ADIOS MPI-Aggregate on Broadwell (528, 4048, 8008, and 21912 cores) and KNL (512, 4096, and 8192 cores); same unbalanced/balanced and compression series as above.]
ADIOS MPI-Aggregate: m files, with m < the number of ranks; on Lustre, m = the number of OSTs (a sketch of the aggregation pattern follows below)
Good scalability with core count, especially with compression
Computational balancing hurts performance very little
Compression helps, but not as much
zfp is by far the fastest compression
KNL is much slower, especially for compression
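A rough sketch of that aggregation pattern, again with mpi4py rather than ADIOS: ranks are split into m groups, each group's data is gathered to one aggregator, and only the m aggregators touch the file system. In the real setting m would be the number of Lustre OSTs; the value 4 and the file names below are placeholders.

```python
# Sketch of the aggregation pattern behind ADIOS MPI-Aggregate (not ADIOS
# itself): m aggregators gather their group's data and write m files total.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
m = 4                                          # placeholder for the number of OSTs
local = np.random.rand(1 << 18)

group = rank % m                               # which aggregate file this rank feeds
subcomm = comm.Split(color=group, key=rank)
pieces = subcomm.gather(local, root=0)         # collect the group's data

if subcomm.Get_rank() == 0:                    # only aggregators do file I/O
    np.concatenate(pieces).tofile(f"aggregate.{group}.bin")
```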
[Charts: write throughput (GB/s) vs. core count for HDF5 on Broadwell (528, 4048, 8008, and 21912 cores) and KNL (512, 4096, and 8192 cores); the series compare unbalanced and balanced runs with no compression, zlib, szip, and shuffle+zlib.]
HDF5: one file for all ranks
Starts slower, but scales with core count, especially with compression
Computational balancing hurts performance a lot, but compression helps somewhat
Shuffle+zlib is the fastest compression (zfp was not available in HDF5 at the time); a minimal shuffle+zlib sketch follows below
KNL is much slower, especially for compression
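For reference, this is roughly what the shuffle+zlib filter combination looks like with h5py. It is a serial sketch for brevity (the measured runs use parallel HDF5 writing one shared file), and the file and dataset names, chunk shape, and deflate level are placeholders.

```python
# Serial h5py sketch of the shuffle + zlib (gzip) filter combination that was
# the fastest HDF5 option above. The measured runs use parallel HDF5 with one
# shared file; names, chunking, and the deflate level are placeholders.
import numpy as np
import h5py

data = np.random.rand(256, 256, 256)

with h5py.File("out.h5", "w") as f:
    f.create_dataset("field", data=data,
                     chunks=(64, 64, 64),      # filters require chunked storage
                     shuffle=True,             # byte-shuffle improves zlib ratios
                     compression="gzip",       # zlib deflate
                     compression_opts=4)       # moderate deflate level
```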
Computational load balancing with compressed output: with the right output method, it is faster than unbalanced, uncompressed output; this has always been theoretically possible, but it is rare in practice
Compression is partly computation, so it can scale with the simulation
At scale, zfp (~9:1) produces "virtual" throughput faster than the file system: writing ~9 GB of data as ~1 GB of compressed output makes the effective data rate roughly nine times the physical write rate, minus the compression overhead
Shuffle+zlib in HDF5 is also good
KNL: more cores per node → fewer nodes doing parallel I/O
Much weaker integer processing means slower compression
Complete the runs at 20k cores; begin runs at 40-60k cores
Quilting works very well for struct [separate study by SDSC] and similar apps
We hypothesize that quilting would be very poor for compression: e.g., for zfp at scale, we expect not to want quilting at all, or at least to compress on all cores and quilt afterwards for the actual I/O
Cloud runs: Google Compute Engine with a Gluster file system, 512-4096 cores; we hypothesize performance between Broadwell and KNL
Work-in-Progress Abstract Compiler-Assisted Scientific Workflow Optimization
Hadia Ahmed1, Peter Pirkelbauer2, Purushotham Bangalore2, Anthony Skjellum3
1 Lawrence Berkeley National Laboratory, 2 University of Alabama at Birmingham, 3 University of Tennessee at Chattanooga. November 13, 2017.
Introduction
Exascale systems: data analytics will face tremendous challenges on exascale systems
Many compute nodes communicate with analytics nodes
Simulations produce vast amounts of data
In-situ (in-transit) analytics is necessary to deal with limited bandwidth
Simulation and analytics code needs to be re-organized
Idea
Describe the re-organization: users specify the re-organization with an annotation language, and a tool generates an optimized version
Move code from the analytics node to the simulation (or vice versa)
Describe reductions, ...
Approach
Compiler-based: use ROSE to read, analyze, and re-organize source files
Early Results
Restructured Bonds-CSym: on a single system, we achieved speedups between 4% and 12% with Bonds-CSym restructured in a 1:1 configuration
The re-organized code eliminates storage to the file system, eliminates data container conversion, and enables further compile-time optimizations
Bonds-CSym is quadratic, so smaller input sizes exhibit larger speedups
Reduced need for network communication
Thank you
contact: Peter Pirkelbauer (UAB) e-mail: pirkelbauer@uab.edu
Hariharan Devarajan, hdevarajan@hawk.iit.edu; Anthony Kougkas, akougkas@hawk.iit.edu; Xian-He Sun, sun@iit.edu
Micro-Storage Services for Open Ethernet Drive Hariharan Devarajan, PhD Student, hdevarajan@hawk.iit.edu
Supercomputer:     K     KAUST  Tianhe-2  Trinity
# storage nodes:   2000  400    1000      400
[Slide fragments: "same power cap", "server nodes", "... is extremely heavy and poses unnecessary overheads".]
Published Work
▪ Proceedings of DataCloud'17, Denver, CO.
▪ "... approach," in Proceedings of PDSW-DISCS'16, 2017, pp. 43–48.
[Slide fragments: "... nodes would be removed", "... needs of the OED technology".]
Comprehensive Burst Buffer Evaluation
Eugen Betke, Julian Kunkel
Research Group, German Climate Computing Center (DKRZ), 2017-11-12
Objectives
Understanding how burst buffers can be used in alternative ways (today they are mainly used for absorbing I/O peaks)
Improving the runtime of I/O-intensive applications through better workflows
Reducing procurement costs through intelligent use of burst buffers
Test systems and evaluation tools
Test systems:
Kove XPD [3]: in-memory storage
DDN IME [5]: SSD-based
Cray DataWarp [2]: SSD-based
Parallel I/O benchmark tools:
NetCDF-Bench [4]: a parallel NetCDF benchmark; generates I/O load to a shared NetCDF file and mimics scientific data (many climate scientists favor NetCDF for its features and simple interface); a sketch of such a shared-file write follows below
IOR: uses the MPI-IO interface in our tests; generates I/O load to individual files
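For illustration, the snippet below sketches the kind of shared-file parallel NetCDF write that NetCDF-Bench generates, using netCDF4-python's parallel mode (the benchmark itself is a C tool). It assumes an MPI-enabled netCDF4/HDF5 build, and the file, dimension, and variable names are placeholders.

```python
# Illustrative sketch (not NetCDF-Bench, which is a C tool): every rank writes
# its own slab of one variable into a single shared NetCDF-4 file. Requires a
# parallel-enabled netCDF4/HDF5 build; all names are placeholders.
import numpy as np
from mpi4py import MPI
from netCDF4 import Dataset

comm = MPI.COMM_WORLD
rank, nranks = comm.Get_rank(), comm.Get_size()
ny = 64

ds = Dataset("shared.nc", "w", parallel=True, comm=comm, info=MPI.Info())
ds.createDimension("x", nranks * ny)
ds.createDimension("y", ny)
var = ds.createVariable("field", "f8", ("x", "y"))
var.set_collective(True)                       # use collective MPI-IO writes

var[rank * ny:(rank + 1) * ny, :] = np.random.rand(ny, ny)
ds.close()
```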
Short-term campaign storage space
Purpose: reduction of the I/O load on the main storage
Basic idea: storing temporary data on the main storage may be inefficient; instead, temporary data is stored on the burst buffer and only the results are stored on the main storage
Expectation: speed-up of I/O-intensive applications
Evaluation methodology: gathering of burst buffer characteristics
Goal: intelligent and efficient workflows
[Diagram: an I/O-intensive application writes temporary data to the burst buffer and final results to the main storage.]
Reducing procurement costs of HPCs [1]
[Diagram: compute nodes CN0, CN1, ..., CNX with 64 GB RAM each, attached to 52 PB of storage.]
Observations made on Mistral [1] (the HPC system of DKRZ):
Most applications use only a fraction of the available memory
A few memory-intensive applications have high memory requirements
Reducing procurement costs of HPCs [2]
[Diagram: compute nodes CN0, CN1, ..., CNX with 32 GB RAM each, attached to 52 PB of storage plus a remote swap area (how large?).]
Purpose: reducing total HPC costs
Basic idea: equip compute nodes with less memory; memory-intensive applications use a remote swap file system
Expectation: most programs are not affected; memory-intensive applications are affected by the swap overhead
Evaluation methodology: tracing of swap in/out with kprobes
Goal: a cost model
References
[1] HLRE-3 "Mistral". https://www.dkrz.de/Klimarechner/hpc.
[2] Cray Inc. Cray XC40 DataWarp applications I/O accelerator.
[3] Kove XPD L3 datasheet. http://kove.net/downloads/Kove-XPD-L3-datasheet.pdf. Accessed 2017-08-24.
[4] NetCDF-Bench, 2017. https://github.com/joobog/netcdf-bench.
[5] DDN Storage. Burst buffer & beyond: I/O & application acceleration.
spcl.inf.ethz.ch @spcl_eth
[Diagram: the simulator runs the simulation (1) and stores the results (2); analysis/visualization tools T1-T4 analyze the results (3). Concerns: elasticity, persistent data, I/O capacity, I/O bandwidth. Result sizes grow from megabytes to petabytes, and perhaps exabytes; at a maintenance cost of $100/TB/year, an exabyte costs $100,000,000 per year.]
[Diagram: COSMO runs the simulation (1) and stores only checkpoints (2); analysis tools T1-T4 analyze the results (3) through a Data Virtualization Layer, which gets the data (4): when a part of the simulation time is requested, a restart is issued and that part is re-simulated from the nearest checkpoint.]
How to cache? Where to cache? How to prefetch? When to prefetch?
[Diagram: the simulator runs the simulation (1), and the analysis/visualization tools T1-T4 get the simulation data directly (2), addressing elasticity, persistent data, I/O capacity, and I/O bandwidth.]
[Diagram: DVL cache hierarchy. Each process has a local cache; processes on the same node (nodes 1-4) share an intra-node cache; the nodes share an inter-node virtualizer/cache, the DVL.]
DVL-C
[Sequence, cache hit: the analysis tool's nc_open(x) is intercepted by DVL-C, which sends query(x) to the DVL index (i.query(x)) and waits for the ACK from the DVL; on a hit, the real nc_open(x) is called and the analysis tool is notified. Hit = the data was already produced by the offline simulation.]
[Sequence, cache miss: DVL-C sends query(x) and waits; the DVL notifies DVL-S on the simulator, which determines the restart checkpoint r = restart(x) and the simulation block s = simblock(x), runs simulate(r, s), writes the block (nc_open(x), nc_put ..., nc_close(x)), and inserts it into the index (i.insert(x)); the analysis tool is then notified, waits for the data, and reads it with nc_get(x, t1) over RDMA. Miss = in situ simulation.]
A hypothetical sketch of this interception flow follows below.
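To make that control flow concrete, here is a hypothetical Python sketch of the DVL-C idea: query an index before opening a NetCDF file and trigger re-simulation on a miss. The real DVL intercepts the C NetCDF API and moves data over RDMA; dvl_index and request_resimulation below are illustrative stubs, not part of the actual system.

```python
# Hypothetical sketch of the DVL-C control flow around nc_open, with stubs in
# place of the real index and the simulator-side (DVL-S) re-simulation.
from netCDF4 import Dataset

dvl_index = set()                              # stand-in for the DVL metadata index

def request_resimulation(path):
    """Stand-in for DVL-S: r = restart(path); s = simblock(path); simulate(r, s)."""
    raise NotImplementedError("handled by the simulator side in the real system")

def dvl_nc_open(path):
    if path in dvl_index:                      # hit: data already produced offline
        return Dataset(path, "r")
    request_resimulation(path)                 # miss: in situ re-simulation
    dvl_index.add(path)                        # record the newly produced block
    return Dataset(path, "r")
```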
RMA read of 10 MB [HPDC'14]:
Intra-node: 1.08 ms
Inter-node: 3.47 ms
Intra-cabinet: 7.74 ms
Inter-cabinet: 11.36 ms
Establishing the IO-500 Benchmark
Julian M. Kunkel, John Bent, Jay Lofstead, George S. Markomanolis 2017-11-13 http://www.io500.org
The IO-500
Goals: tracking storage performance; sharing best practices
Benchmarking approach: a community-driven effort; patterns cover metadata, data, and search; easy runs for optimized patterns and hard runs for naive patterns; relies on community benchmarks
[Figure: the workloads span data pattern complexity (IOR easy, IOR hard) and namespace complexity (MD easy, MD hard), plus find.]
The list: results from BeeGFS, DataWarp, IME, Spectrum Scale, and Lustre
Challenges of Establishing the Benchmark
This is a short summary of experience gained from feedback: discussions at the SC/ISC BoFs and with peers, and feedback from people executing the IO-500 on different systems. Thanks to everybody contributing.
Challenges & Approach
Representative of applications and user requirements: supply workloads that provide an upper bound for optimized applications and a performance expectation for non-optimized applications; more workloads and concurrent execution are to be integrated
Understandable and humanly comprehensible results: report meaningful metrics, keep the variability of repeated measurements low, and compute an overall score for ranking while retaining the individual values
Challenges & Approach
Portable: we ran into Python (shell) portability issues; at the C-API level, readdir() does not return the file type on DataWarp, and one system has a non-POSIX stat() call
Inclusive: cover various storage technologies and non-POSIX APIs; allow vendors to use specific optimizations (for the easy runs); enable a replacement for find (IBM Spectrum Scale has optimizations here); rely on IOR's AIORI interface (thanks to Nathan for porting mdtest); we are still in the process of supporting more storage APIs
Challenges & Approach
Scalable, i.e., runs on large-scale computers and relevant storage systems: IOR and mdtest are MPI-parallelized, and we supply a parallel find version
Lightweight: easy to set up and cheap to run; 5-minute write/creation phases limit the runtime; IOR/mdtest were extended with stonewalling phase-out options
Trustworthy: prevent (unintended) cheating by revealing all tunings made (which also shares best practice) and requiring a sufficiently large working set
Visit our Birds of a Feather at SC