Reducing Checkpoint Size in PlasComCM with Lossy Compression
14th Annual Workshop on Charm++ and its Applications
Jon Calhoun (1), Franck Cappello (1, 2), Luke Olson (1), Marc Snir (1, 2), and Sheng Di (2)
(1) University of Illinois at Urbana-Champaign
(2) Argonne National Laboratory
19 April 2016
Contact: Jon Calhoun, jccalho2@illinois.edu

Data Movement Problem
On current systems, computation is essentially free compared to the time and energy required for data transfers. What do we do with these free CPU cycles? [Keckler 2011]

Checkpoint Restart in Charm++
Native checkpoint restart:
• partner nodes
• permanent storage
Although checkpointing to a partner node is much faster, checkpointing to permanent storage is still needed. Let’s look at improving checkpointing to the parallel filesystem.

Lossless compression? [Son et al. 2014]

• Standard compression schemes are not designed for floating-point data
• Lossless floating-point schemes provide small compression factors [Son et al. 2014]

Lossy Compression
High compression ratios with lossy compression [Di and Cappello 2016]

Can applications be restarted from a lossy checkpoint?
Whenever you use floating-point values, you have already embraced various amounts of error:
• Floating-point arithmetic already suffers from error due to roundoff.
• Numerical methods used to solve PDEs and ODEs are only accurate to the order of the method.
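The two error sources named on this slide can be seen in a few lines. This is an illustrative sketch only: the forward-difference example and the function name `forward_difference_error` are assumptions chosen for the demonstration, not material from the talk.

```python
import math

# Roundoff: 0.1, 0.2 and 0.3 have no exact binary representation,
# so the sum differs from 0.3 by a tiny but nonzero amount.
roundoff_gap = abs((0.1 + 0.2) - 0.3)

def forward_difference_error(h, x=1.0):
    """Error of the first-order forward difference for d/dx sin(x)."""
    approx = (math.sin(x + h) - math.sin(x)) / h
    return abs(approx - math.cos(x))

# Truncation error: halving h should roughly halve the error of an O(h) method.
coarse = forward_difference_error(1e-2)
fine = forward_difference_error(5e-3)
observed_order = math.log2(coarse / fine)

print(f"roundoff gap:   {roundoff_gap:.3e}")
print(f"observed order: {observed_order:.3f}")
```

The observed order is estimated from two step sizes; a real convergence study would use more refinement levels.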

Understanding Error
Many lossy compression schemes allow you to specify an error bound (e.g., relative, absolute).
• How should I evaluate this error?
• Is this error detrimental?
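One way to evaluate the error is to compare the original and restored arrays directly. The sketch below is a minimal illustration under stated assumptions: `lossy_roundtrip` is a hypothetical stand-in that snaps values to a uniform grid (it is not SZ or any real compressor's API), and the relative error is normalized by the value range, which is one possible definition among several.

```python
import math

def lossy_roundtrip(values, abs_tol):
    """Hypothetical stand-in for an error-bounded lossy compressor.

    Snaps each value to a grid of spacing 2 * abs_tol, so the pointwise
    error is at most abs_tol (up to floating-point roundoff).
    """
    step = 2.0 * abs_tol
    return [round(v / step) * step for v in values]

def max_abs_error(original, restored):
    """Largest pointwise absolute difference."""
    return max(abs(a - b) for a, b in zip(original, restored))

def range_relative_error(original, restored):
    """Maximum absolute error divided by the value range of the original."""
    value_range = max(original) - min(original)
    if value_range == 0.0:
        raise ValueError("value range is zero; relative error is undefined")
    return max_abs_error(original, restored) / value_range

# Made-up test field, purely for illustration.
field = [300.0 + 25.0 * math.sin(0.01 * i) for i in range(1000)]
abs_tol = 1e-3
restored = lossy_roundtrip(field, abs_tol)

abs_err = max_abs_error(field, restored)
rel_err = range_relative_error(field, restored)
print(f"max absolute error:   {abs_err:.3e}")
print(f"range-relative error: {rel_err:.3e}")
```

Whether a given error is detrimental depends on the application; these metrics only quantify it.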

Evaluation
The accuracy of numerical methods is expressed as $O(h^p)$. If the lossy compression error tolerance is restricted to be less than the truncation error, the error added by lossy compression is hidden in the simulation. Let’s first look at a 1-D heat equation and a 1-D advection equation to understand what happens to simple PDEs.
Setup:
• Lossy compressor: SZ-0.5.5 [Di and Cappello 2016]
• Data vectors: 64-bit floating-point
• Checkpoint PDE state variables
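The idea of hiding compression error below the truncation error can be sketched for a 1-D heat equation. Everything specific here is an assumption made for illustration: the explicit FTCS scheme, the grid size, the checkpoint step, and the uniform-quantization `lossy_roundtrip` stand-in are not the talk's actual setup, and SZ itself is not used.

```python
import math

def ftcs_step(u, r):
    """One explicit FTCS step of u_t = u_xx with fixed zero boundaries."""
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = u[i] + r * (u[i + 1] - 2.0 * u[i] + u[i - 1])
    return new

def lossy_roundtrip(u, abs_tol):
    """Hypothetical stand-in for compress + decompress with an absolute bound."""
    step = 2.0 * abs_tol
    return [round(v / step) * step for v in u]

def run(n_cells, r, n_steps, checkpoint_step=None, abs_tol=None):
    """Advance sin(pi x) on [0, 1]; optionally restart once from a lossy state."""
    h = 1.0 / n_cells
    u = [math.sin(math.pi * i * h) for i in range(n_cells + 1)]
    u[0] = u[-1] = 0.0
    for step in range(1, n_steps + 1):
        u = ftcs_step(u, r)
        if step == checkpoint_step:
            u = lossy_roundtrip(u, abs_tol)  # emulate restart from lossy checkpoint
    return u

def max_diff(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

n_cells, r, n_steps = 32, 0.25, 400   # r = dt / h^2, stable for r <= 0.5
h = 1.0 / n_cells
t_end = n_steps * r * h * h
exact = [math.exp(-math.pi ** 2 * t_end) * math.sin(math.pi * i * h)
         for i in range(n_cells + 1)]

tol = 1e-6                            # chosen below the truncation error
reference = run(n_cells, r, n_steps)
restarted = run(n_cells, r, n_steps, checkpoint_step=200, abs_tol=tol)

discretization_error = max_diff(reference, exact)
checkpoint_error = max_diff(restarted, reference)
print(f"discretization error:         {discretization_error:.3e}")
print(f"error due to lossy checkpoint: {checkpoint_error:.3e}")
```

For this diffusive problem the perturbation injected at the restart cannot grow, so it stays at or below the tolerance while the discretization error dominates. Other equations (advection, for example) may behave differently.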

1-D Heat Equation Error Evolution

1-D Heat Error Evolution
[Figure: Maximum Error at Each Time-step. Y-axis: Error (log scale, 10^-11 to 10^-4). X-axis: Time (T), 0.0 to 2.0. Series: Discretization Error; Error due to lossy checkpoint.]

1-D Advection Equation Error Evolution

1-D Advection Error Evolution
[Figure: Maximum Error at Each Time-step of 1-D Advection. Y-axis: Error (log scale, 10^-6 to 10^-4). X-axis: Time (T), 0 to 5. Series: Discretization Error; Error due to lossy checkpoint.]

XPACC PlasComCM
PlasComCM:
• Coupled multiphysics code
• Checkpoint restart accomplished by AMPI
Setup:
• Navier-Stokes flow past cylinder problem
• $h_x = h_y = 0.0015$
• Checkpoint every 5000 iterations to 1e-14

PlasComCM Compression Factor
[Figure: Y-axis: Compression Factor (0 to 90). X-axis: Compression Error Tolerance (10^-14 to 10^-1). Series: density, energy, x momenta, y momenta.]
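The compression factor plotted here is the original size divided by the compressed size. The sketch below shows how that factor can be measured while sweeping the error tolerance. It is a toy pipeline under loud assumptions: uniform quantization, delta coding, and zlib stand in for a real error-bounded compressor, the sine-wave field is made up, and none of this reproduces SZ's algorithm or the PlasComCM results.

```python
import math
import struct
import zlib

def compression_factor(values, abs_tol):
    """Return (factor, max_error) for a toy quantize + delta + zlib pipeline."""
    step = 2.0 * abs_tol
    codes = [round(v / step) for v in values]            # integer grid indices
    deltas = [codes[0]] + [b - a for a, b in zip(codes, codes[1:])]
    packed = zlib.compress(struct.pack(f"<{len(deltas)}q", *deltas), 9)
    restored = [c * step for c in codes]                 # what a restart would see
    max_error = max(abs(a - b) for a, b in zip(values, restored))
    original_bytes = 8 * len(values)                     # 64-bit floats
    return original_bytes / len(packed), max_error

# Made-up smooth field standing in for a checkpointed state variable.
field = [math.sin(2.0 * math.pi * i / 4096) for i in range(4096)]

# Lossless baseline: zlib applied directly to the raw 64-bit floats.
raw = struct.pack(f"<{len(field)}d", *field)
lossless_factor = len(raw) / len(zlib.compress(raw, 9))

tolerances = (1e-8, 1e-6, 1e-4, 1e-2)
results = {tol: compression_factor(field, tol) for tol in tolerances}

print(f"lossless zlib factor: {lossless_factor:.2f}")
for tol, (factor, err) in results.items():
    print(f"tol={tol:.0e}  factor={factor:8.2f}  max error={err:.3e}")
```

Looser tolerances leave fewer distinct quantized values, which is why the factor rises as the tolerance grows.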

PlasComCM Timings
[Figure: Y-axis: Time (sec), 0.000 to 0.018. X-axis: Compression Error Tolerance (10^-14 to 10^-1). Series: density, energy, x momenta, y momenta. Solid line: compression time. Dotted line: decompression time.]

Simulation Simulation Error

What is the compression error tolerance? 1e-?
[Figure panels: Simulation; Lossy Compressed Simulation]

What is the compression error tolerance? 1e-2
[Figure panels: Simulation; Lossy Compressed Simulation]

Conclusion and Future Work
• Lossy compression can effectively reduce the size of a checkpoint without negatively affecting the solution
• Currently only applicable to file system checkpoints
• Need to discuss with users to determine acceptable error tolerance
• Investigate other applications and inputs to gain further insight
• Further leverage application properties when compressing

Acknowledgments
• This work was sponsored by the Air Force Office of Scientific Research under grant FA9550-12-1-0478.

Thank you
Any questions?
