SLIDE 1

Reducing Checkpoint Size in PlasComCM with Lossy Compression

14th Annual Workshop on Charm++ and its Applications

Jon Calhoun¹, Franck Cappello¹,², Luke Olson¹, Marc Snir¹,², and Sheng Di²

¹University of Illinois at Urbana-Champaign ²Argonne National Laboratory

19 April 2016

Jon Calhoun jccalho2@illinois.edu Reducing Checkpoint Size in PlasComCM with Lossy Compression 1/21

SLIDE 2

Data Movement Problem

On current systems, computation is essentially free compared to the time and energy required for data transfers. What do we do with these free CPU cycles? [Keckler 2011]

SLIDE 3

Checkpoint Restart in Charm++

Native checkpoint restart

  • partner nodes
  • permanent storage

Although checkpointing to a partner node is much faster, checkpointing to permanent storage is still needed. Let’s look at improving checkpointing to the parallel filesystem.

SLIDE 4

Lossless compression?

[Son et al. 2014]

SLIDE 5

  • Standard compression schemes are not designed for floating-point data
  • Lossless floating-point schemes provide small compression factors [Son et al. 2014]
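The limited benefit of general-purpose lossless compression on floating-point data can be illustrated with a small experiment. The sketch below is an assumption-laden toy, not the benchmark from [Son et al. 2014]: it packs a hypothetical smooth field as 64-bit floats and compresses it with Python's `zlib`, then reports the compression factor (uncompressed bytes divided by compressed bytes). The field and its size are made up for illustration.

```python
import math
import struct
import zlib


def compression_factor(raw: bytes, compressed: bytes) -> float:
    """Uncompressed size divided by compressed size."""
    return len(raw) / len(compressed)


# Hypothetical checkpoint data: a smooth field stored as 64-bit floats.
n = 10_000
values = [math.sin(1.5 * i / n) for i in range(n)]
raw = struct.pack(f"<{n}d", *values)

# General-purpose lossless compression (DEFLATE via zlib, highest level).
compressed = zlib.compress(raw, 9)
restored = list(struct.unpack(f"<{n}d", zlib.decompress(compressed)))

lossless_ok = restored == values  # lossless: every bit is recovered
factor = compression_factor(raw, compressed)
print(f"lossless round trip: {lossless_ok}, compression factor: {factor:.2f}")
```

Even though the field is smooth, the low-order mantissa bits of neighbouring values look essentially random to a byte-oriented compressor, which is why the factor stays small.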

SLIDE 6

Lossy Compression

High compression ratios with lossy compression [Di and Cappello 2016]
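To see why giving up exactness can raise the compression factor, here is a minimal sketch of an error-bounded lossy scheme. It is not the SZ algorithm from [Di and Cappello 2016]; it is a simple stand-in (uniform quantization to multiples of `2*eps`, delta coding, then `zlib`) chosen only because it fits in a few lines. The function names, the test field, and the tolerances are all hypothetical.

```python
import math
import struct
import zlib


def lossy_compress(values, eps):
    """Quantize to multiples of 2*eps, delta-encode the integer codes, deflate."""
    codes = [round(v / (2.0 * eps)) for v in values]
    deltas = [codes[0]] + [b - a for a, b in zip(codes, codes[1:])]
    return zlib.compress(struct.pack(f"<{len(deltas)}q", *deltas), 9)


def lossy_decompress(blob, eps):
    """Invert lossy_compress; each value is recovered to within about eps."""
    raw = zlib.decompress(blob)
    deltas = struct.unpack(f"<{len(raw) // 8}q", raw)
    values, code = [], 0
    for d in deltas:
        code += d  # undo the delta coding
        values.append(code * 2.0 * eps)
    return values


# Hypothetical smooth field stored as 64-bit floats (8 bytes per value).
n = 10_000
field = [math.sin(1.5 * i / n) for i in range(n)]

results = {}
for eps in (1e-2, 1e-6, 1e-10):
    blob = lossy_compress(field, eps)
    recon = lossy_decompress(blob, eps)
    max_err = max(abs(a - b) for a, b in zip(field, recon))
    results[eps] = (8 * n / len(blob), max_err)
    print(f"eps={eps:g}: factor={results[eps][0]:.1f}, max error={max_err:.2e}")
```

In this toy, a looser tolerance leaves fewer distinct codes to store, so the factor grows as `eps` grows, while the reconstruction error stays within the requested bound.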

SLIDE 7

Can applications be restarted from a lossy checkpoint?

Whenever you use floating-point values, you have already embraced some amount of error:

  • Floating-point arithmetic already suffers from error due to roundoff.
  • Numerical methods used to solve PDEs and ODEs are only accurate to the order of the method.
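Both error sources named above can be observed directly. The sketch below is illustrative only: it shows roundoff in a one-line sum, and it estimates the order of two standard finite-difference formulas by halving the step size `h` and comparing errors. The test function (`sin`), the evaluation point, and the step size are arbitrary choices.

```python
import math


def forward_diff(f, x, h):
    """First-order accurate approximation of f'(x)."""
    return (f(x + h) - f(x)) / h


def central_diff(f, x, h):
    """Second-order accurate approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


# Roundoff: 0.1, 0.2 and 0.3 are not exactly representable in binary.
roundoff = abs((0.1 + 0.2) - 0.3)

# Truncation error: halving h should cut the error by about 2**p
# for a method of order p.
x, h = 1.0, 1e-2
exact = math.cos(x)  # derivative of sin at x

fwd_ratio = (abs(forward_diff(math.sin, x, h) - exact)
             / abs(forward_diff(math.sin, x, h / 2) - exact))
cen_ratio = (abs(central_diff(math.sin, x, h) - exact)
             / abs(central_diff(math.sin, x, h / 2) - exact))

print(f"roundoff in 0.1 + 0.2 - 0.3: {roundoff:.3e}")
print(f"forward difference error ratio: {fwd_ratio:.3f}")
print(f"central difference error ratio: {cen_ratio:.3f}")
```

An error ratio near 2 indicates a first-order method and a ratio near 4 a second-order one, which is what "accurate to the order of the method" means in practice.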

SLIDE 8

Understanding Error

Many lossy compression schemes allow you to specify an error bound (e.g. relative, absolute).

  • How should I evaluate this error?
  • Is this error detrimental?
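One way to start evaluating the error is to measure it after a compress/decompress round trip. The sketch below uses rounding to 32-bit floats as a stand-in lossy step (an assumption made for brevity, not a real checkpoint compressor) and computes three common metrics. Whether a compressor's "relative" bound is taken per value or against the value range depends on the tool, so both variants are shown; the function names and the data are hypothetical.

```python
import math
import struct


def to_float32_and_back(values):
    """Stand-in lossy step: round each double to the nearest 32-bit float."""
    n = len(values)
    return list(struct.unpack(f"<{n}f", struct.pack(f"<{n}f", *values)))


def max_abs_error(orig, recon):
    """Largest absolute difference between original and reconstruction."""
    return max(abs(a - b) for a, b in zip(orig, recon))


def max_pointwise_rel_error(orig, recon):
    """Largest error relative to each original value (zeros are skipped)."""
    return max(abs(a - b) / abs(a) for a, b in zip(orig, recon) if a != 0.0)


def max_range_rel_error(orig, recon):
    """Largest absolute error relative to the value range of the data."""
    return max_abs_error(orig, recon) / (max(orig) - min(orig))


# Hypothetical data spanning several orders of magnitude.
original = [math.exp(0.01 * i) for i in range(1000)]
reconstructed = to_float32_and_back(original)

abs_err = max_abs_error(original, reconstructed)
rel_err = max_pointwise_rel_error(original, reconstructed)
range_err = max_range_rel_error(original, reconstructed)
print(f"max absolute error:          {abs_err:.3e}")
print(f"max pointwise relative error: {rel_err:.3e}")
print(f"max range-relative error:     {range_err:.3e}")
```

For data with a wide dynamic range the absolute and relative views can differ by orders of magnitude, which is one reason the choice of bound matters.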

SLIDE 9

Evaluation

The accuracy of a numerical method is expressed as $O(h^p)$. If the lossy compression error tolerance is restricted to be less than the truncation error, then the error added by lossy compression is hidden in the simulation.

Let’s first look at a 1-D heat equation and a 1-D advection equation to understand what happens to simple PDEs.

Setup:

  • Lossy Compressor: SZ-0.5.5 [Di and Cappello 2016]
  • Data vectors 64-bit floating-point
  • Checkpoint PDE state variables
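The idea of keeping the compression tolerance below the truncation error can be tried on a toy 1-D heat problem. The sketch below is not the experiment from the talk and does not use SZ: it assumes an explicit FTCS scheme for u_t = u_xx on [0, 1] with zero boundary values, and it imitates a lossy checkpoint by snapping the state to multiples of `2*eps` halfway through the run. Grid size, step counts, and `eps` are hypothetical choices.

```python
import math


def ftcs_step(u, r):
    """One explicit (FTCS) step of u_t = u_xx; boundary values stay fixed."""
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = u[i] + r * (u[i - 1] - 2.0 * u[i] + u[i + 1])
    return new


def lossy_round_trip(u, eps):
    """Stand-in for compress + decompress with absolute error bound eps."""
    return [round(v / (2.0 * eps)) * (2.0 * eps) for v in u]


def max_diff(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


n_cells, r, n_steps, ckpt_step = 32, 0.4, 200, 100  # r <= 0.5 keeps FTCS stable
dx = 1.0 / n_cells
dt = r * dx * dx
x = [i * dx for i in range(n_cells + 1)]
u0 = [math.sin(math.pi * xi) for xi in x]
u0[0] = u0[-1] = 0.0  # enforce the boundary condition exactly

eps = 1e-6  # tolerance chosen well below the expected discretization error
clean, lossy = u0[:], u0[:]
for step in range(1, n_steps + 1):
    clean = ftcs_step(clean, r)
    lossy = ftcs_step(lossy, r)
    if step == ckpt_step:
        lossy = lossy_round_trip(lossy, eps)  # restart from a lossy checkpoint

t_end = n_steps * dt
exact = [math.exp(-math.pi ** 2 * t_end) * math.sin(math.pi * xi) for xi in x]

discretization_error = max_diff(clean, exact)
checkpoint_error = max_diff(lossy, clean)
print(f"discretization error:          {discretization_error:.3e}")
print(f"error due to lossy checkpoint: {checkpoint_error:.3e}")
```

In this setting the perturbation introduced at the checkpoint never exceeds `eps`, so it remains well below the error the scheme already makes relative to the exact solution.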

SLIDE 10

1-D Heat Equation Error Evolution

SLIDE 11

1-D Heat Error Evolution

[Figure: Maximum Error at Each Time-step. x-axis: Time (T), 0.0 to 2.0; y-axis: Error, log scale, 10^-11 to 10^-4. Series: Discretization Error; Error due to lossy checkpoint.]

SLIDE 12

1-D Advection Equation Error Evolution

SLIDE 13

1-D Advection Error Evolution

[Figure: Maximum Error at Each Time-step of 1-D Advection. x-axis: Time (T), 1 to 5; y-axis: Error, log scale, 10^-6 to 10^-4. Series: Discretization Error; Error due to lossy checkpoint.]

SLIDE 14

XPACC PlasComCM

PlasComCM

  • Coupled multiphysics code
  • Checkpoint restart accomplished by AMPI

Setup:

  • Navier-Stokes flow past a cylinder problem
  • h_x = h_y = 0.0015
  • Checkpoint every 5000 iterations to 1e−14

SLIDE 15

PlasComCM Compression Factor

[Figure: Compression Factor (y-axis, ticks 10 to 90) versus Compression Error Tolerance (x-axis, log scale, 10^-14 to 10^-1). Series: density, energy, x momenta, y momenta.]

SLIDE 16

PlasComCM Timings

[Figure: Time (sec) (y-axis, 0.000 to 0.018) versus Compression Error Tolerance (x-axis, log scale, 10^-14 to 10^-1). Series: density, energy, x momenta, y momenta. Solid lines show compression time; dotted lines show decompression time.]

SLIDE 17

Simulation

Simulation Error

SLIDE 18

What is the compression error tolerance? 1e−?

[Figure: two panels, "Simulation" and "Lossy Compressed Simulation".]

SLIDE 19

What is the compression error tolerance? 1e−2

[Figure: two panels, "Simulation" and "Lossy Compressed Simulation".]

SLIDE 20

Conclusion and Future Work

  • Lossy compression can effectively reduce the size of a checkpoint without negatively affecting the solution
  • Currently only applicable to file system checkpoints
  • Need to discuss with users to determine acceptable error tolerance
  • Investigate other applications and inputs to gain further insight
  • Further leverage application properties when compressing

SLIDE 21

Acknowledgments

  • This work was sponsored by the Air Force Office of Scientific Research under grant FA9550-12-1-0478.

SLIDE 22

Thank you

Any questions?
