slide-1
SLIDE 1

Scaling Resiliency via machine learning and compression

Alok Choudhary

Henry and Isabel Dever Professor, EECS and Kellogg School of Management, Northwestern University
choudhar@eecs.northwestern.edu
Founder, Chairman, and Chief Scientist, 4Cinsights Inc.: A Big Data Science Company
+1 312 515 2562
alok@4Cinsights.com

slide-2
SLIDE 2


Motivation

  • Scientific simulations
    – Generate large amounts of data.
    – Data features: high entropy, spatial-temporal structure.
  • Exascale requirements*
    – Scalable system software: developing scalable system software that is power- and resilience-aware.
    – Resilience and correctness: ensuring correct scientific computation in the face of faults, reproducibility, and algorithm-verification challenges.
  • NUMARCK (NU Machine learning Algorithm for Resiliency and ChecKpointing)
    – Learn the temporal relative change and its distribution, and bound the point-wise, user-defined error.

* From the Advanced Scientific Computing Advisory Committee's Top Ten Technical Approaches for Exascale.

slide-3
SLIDE 3

Checkpointing and NUMARCK

  • Traditional checkpointing systems store raw (and uncompressed) data.
    – Cost prohibitive: the storage space and time.
    – Threatens to overwhelm the simulation and the post-simulation data analysis.
  • I/O accesses have become a limiting factor to key scientific discoveries.

Solution? NUMARCK.

slide-4
SLIDE 4


What if a Resilience and Checkpointing Solution Provided

  • Improved resilience via more frequent yet relevant checkpoints, while
  • Reducing the amount of data to be stored by an order of magnitude, and
  • Guaranteeing a user-specified tolerable maximum error rate for each data point, and
  • An order of magnitude smaller mean error for each data set, and
  • Reduced I/O time by an order of magnitude, while
  • Providing data for effective analysis and visualization?

slide-5
SLIDE 5

Motivation: “Incompressible” with Lossless Encoding

Shannon’s information theory:

H(X) = − Σ_{i=1}^{n} p(x_i) log p(x_i)

[Figure: probability distribution of the more common bit value at each bit position of double-precision rlds data. The exponent bits are compressible (low entropy); the mantissa bits are less predictable (high entropy) and essentially incompressible.]
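A minimal sketch (not from the slides) of how the per-bit entropy behind this figure could be estimated; it assumes a little-endian host, and `data` is a synthetic stand-in for one iteration of the rlds field.

```python
import numpy as np

def bitwise_entropy(values: np.ndarray) -> np.ndarray:
    """Shannon entropy H = -sum p*log2(p) at each of the 64 bit positions of float64 data."""
    raw = np.ascontiguousarray(values, dtype=np.float64).view(np.uint8).reshape(-1, 8)
    bits = np.unpackbits(raw[:, ::-1], axis=1)        # byte-reverse (little-endian host) so column 0 is the sign bit
    p1 = bits.mean(axis=0)                            # probability that each bit position is 1
    h = np.zeros(64)
    mask = (p1 > 0) & (p1 < 1)
    h[mask] = -(p1[mask] * np.log2(p1[mask]) + (1 - p1[mask]) * np.log2(1 - p1[mask]))
    return h                                          # low for sign/exponent bits, near 1 for mantissa bits

data = 300.0 * np.random.rand(100_000)                # synthetic stand-in for rlds values
print(bitwise_entropy(data).round(2))
```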

slide-6
SLIDE 6
Motivation: Still “Incompressible” with Lossy Encoding

[Figure: original rlds data vs. B-spline reconstructed rlds data. The original field is highly random; the lossy B-spline reconstruction misses extreme events and achieves only ~0.35 correlation with the original.]

slide-7
SLIDE 7

What if we analyze the Change in Value?

Observations:

  • Variable Values – distribution
  • Change in Variable Value – distribution
  • Relative Change in Variable Value - distribution

Hypothesis: The relative changes in variable values can be represented in a much smaller state space.

  • A1(t) = 100, A1(t+1) = 110 => change = 10, relative change = 10%
  • A2(t) = 5, A2(t+1) = 5.5 => change = 0.5, relative change = 10%

Observation: Simulation Represents a State Transition Model

slide-8
SLIDE 8

Sneak Preview: Relative Change is more predictable

[Figure: iterations 1 and 2 of the climate CMIP5 rlus data look random, while the relative change between iterations 1 and 2 has a distribution that can be learned.]

slide-9
SLIDE 9


Challenges

  • How to learn patterns and distributions of relative change at scale?
  • How to represent distributions at scale?
  • How to bound errors?
  • System issues
    – Data movement
    – I/O
    – Scalable software
  • Reconstruction when needed
slide-10
SLIDE 10


NUMARCK Overview

  • Forward predictive coding: transform the data by computing relative changes (ratios) from one iteration to the next.
  • Data approximation: learn the distribution of the relative change r using machine learning algorithms and store the approximated values.

Traditional checkpointing stores a full checkpoint at every step: F F F F F F F F. Machine-learning-based checkpointing stores occasional full checkpoints with change ratios in between: F C C F. (F: full checkpoint; C: change ratios.)
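A minimal NumPy sketch of the forward predictive coding step described above: iteration t+1 is expressed as its element-wise relative change from iteration t and rebuilt from an approximated ratio. The names `prev`, `curr`, and `r_approx` are illustrative, not NUMARCK's actual API.

```python
import numpy as np

def change_ratio(prev: np.ndarray, curr: np.ndarray) -> np.ndarray:
    """Element-wise relative change r such that curr == prev * (1 + r)."""
    return (curr - prev) / prev            # assumes prev contains no exact zeros

def reconstruct(prev: np.ndarray, r_approx: np.ndarray) -> np.ndarray:
    """Rebuild a change-ratio checkpoint (C) from the previous full checkpoint (F)."""
    return prev * (1.0 + r_approx)
```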

slide-11
SLIDE 11

NUMARCK: Overview

[Figure: forward coding followed by distribution learning; the reconstruction achieves ~0.99 correlation and 0.001 RMSE against the original data.]

slide-12
SLIDE 12

E.g., Distribution Learning Strategies

  • Equal-width Bins (Linear)
  • Log-scale Bins (Exponential)
  • Machine Learning – Dynamic clustering

The number of bins or clusters depends on the bits designated for storing indices and on the error tolerance. Example:

– index length (B): 8 bits
– tolerable error per point (E): 0.1%

These determine the number of clusters and the width of each cluster.
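Illustrative arithmetic for this example (B = 8 index bits, E = 0.1% per-point tolerance), using the equal-width relations quoted on the next slide:

```python
B = 8                       # bits reserved per stored index
E = 0.001                   # tolerable relative error per point (0.1%)

n_bins   = 2**B - 1         # 255 bins, as on the binning slides
width    = 2 * E            # each equal-width bin spans +/-E around its centre: 0.2%
coverage = width * n_bins   # total change-ratio range covered: 0.51, i.e. roughly +/-25.5%
print(n_bins, width, coverage)
```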

slide-13
SLIDE 13

Equal-width Distribution

[Figure: histogram of change ratios (%) vs. counts for dens, iterations 32 to 33.]

In each iteration, partition the change-ratio range into 255 bins of equal width. Each value is assigned a corresponding bin ID and represented by the center of its bin. If the difference between the original value and the approximated one is larger than the user-specified tolerance (0.1%), the value is stored as-is (i.e., it is incompressible).

Pros: easy to implement. Cons: (1) can only cover a range of 2*E*(2^B − 1); (2) the bin width is fixed at 2*E.
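A minimal sketch of this equal-width scheme, assuming the 255 bins are centred on zero change and spaced 2*E apart (consistent with the pros/cons above); it is not the NUMARCK reference implementation.

```python
import numpy as np

def equal_width_encode(ratios: np.ndarray, B: int = 8, E: float = 0.001):
    """Map change ratios to bin IDs; flag points that miss the per-point error bound E."""
    n_bins = 2**B - 1
    centers = (np.arange(n_bins) - n_bins // 2) * (2 * E)   # bin centres, 2*E apart
    idx = np.rint(ratios / (2 * E)).astype(np.int64) + n_bins // 2
    idx = np.clip(idx, 0, n_bins - 1)
    incompressible = np.abs(ratios - centers[idx]) > E      # these points are stored exactly instead
    return idx, incompressible, centers
```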

slide-14
SLIDE 14

Log-scale Distribution

[Figure: histogram of change ratios (%) vs. counts for dens, iterations 32 to 33.]

In each iteration, partition the ratio distribution into 255 bins of log-scale width.

Pros: covers a larger range with finer (narrower) bins. Cons: may not perform well for highly irregularly distributed data.
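A sketch of one way to construct log-scale bin edges; the slide does not give the exact rule, so the geometric spacing and the outer range `r_max` below are assumptions.

```python
import numpy as np

def log_scale_edges(B: int = 8, E: float = 0.001, r_max: float = 0.5):
    """Geometrically spaced (log-scale) bin edges, symmetric around zero change."""
    half = (2**B - 2) // 2                   # 127 bins per sign, plus one central bin = 255 total
    pos = np.geomspace(E, r_max, half + 1)   # edges for positive ratios: narrow near zero, wide far out
    return np.concatenate([-pos[::-1], pos]) # use e.g. np.digitize(ratios, edges) to assign bin IDs
```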

slide-15
SLIDE 15

Machine Learning (Clustering-based) Binning

[Figure: histogram of change ratios (%) vs. counts for dens, iterations 32 to 33.]

In each iteration, partition the ratio data into 255 clusters using clustering (e.g., K-means); each value is then approximated by its cluster's centroid.
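A sketch of the clustering-based variant using scikit-learn's MiniBatchKMeans as a stand-in; the slide only says "e.g., K-means", so the actual NUMARCK clustering code may differ.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def kmeans_encode(ratios: np.ndarray, B: int = 8, E: float = 0.001):
    """Cluster change ratios into 2^B - 1 bins; approximate each ratio by its centroid."""
    km = MiniBatchKMeans(n_clusters=2**B - 1, random_state=0).fit(ratios.reshape(-1, 1))
    centers = km.cluster_centers_.ravel()
    idx = km.labels_
    incompressible = np.abs(ratios - centers[idx]) > E   # fall back to exact storage
    return idx, incompressible, centers
```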

slide-16
SLIDE 16

Methodology Summary

  • Initialization: the model, initial conditions, and metadata.
  • Calculation: calculate the relative change.
  • Learning distributions: bin the relative change into N bins; index and store the bin IDs.
  • Storage: store and compress the index; store exact values for changes outside the error bounds.
  • Reconstruction: read the last available complete checkpoint; reconstruct the data values for each data point and, if required, report the error bounds.
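A sketch of a single reconstruction step under the storage layout summarized above (bin centroids for compressible points, exact storage for the rest); the NumPy array arguments are illustrative names, not NUMARCK's API.

```python
def reconstruct_step(prev, centers, bin_ids, exact_mask, exact_values):
    """Rebuild one iteration from the previous one: binned points use their bin centroid,
    incompressible points are restored from exact storage."""
    approx = prev * (1.0 + centers[bin_ids])
    approx[exact_mask] = exact_values
    return approx
```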

slide-17
SLIDE 17

NUMARCK Algorithm

  • Change ratio calculation
    – Calculate element-wise change ratios
  • Bin histogram construction
    – Assign change ratios within the error bound to bins
  • Indexing
    – Each data element is indexed by its bin ID
  • Select the top-K bins with the most elements
    – Data in the top-K bins are represented by their bin IDs
    – Data outside the top-K bins are stored as-is
  • (Optional) Apply lossless GNU ZLIB compression to the index table
    – Further reduces the data size
  • (Optional) File I/O
    – Data is saved in a self-describing netCDF/HDF5 file
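A sketch of the top-K selection and optional ZLIB steps from the list above; Python's zlib stands in for "GNU ZLIB", the index table is assumed to fit in 8-bit IDs (B = 8), and the netCDF/HDF5 output step is omitted.

```python
import zlib
import numpy as np

def topk_and_compress(bin_ids: np.ndarray, K: int = 255):
    """Keep only the K most populated bins; compress the resulting index table."""
    counts = np.bincount(bin_ids)
    top_bins = np.argsort(counts)[::-1][:K]               # bins holding the most elements
    keep = np.isin(bin_ids, top_bins)                     # these elements store just a bin ID
    index_table = bin_ids[keep].astype(np.uint8)          # assumes bin IDs already fit in 8 bits
    compressed = zlib.compress(index_table.tobytes())     # optional lossless pass
    return top_bins, keep, compressed                     # ~keep marks data stored as-is
```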

slide-18
SLIDE 18
Experimental Results: Datasets

  • FLASH is a modular, parallel multi-physics simulation code developed at the FLASH Center at the University of Chicago.
    – It is a parallel adaptive-mesh-refinement (AMR) code with a block-oriented structure.
    – A block is the unit of computation; the grid is composed of blocks.
    – Blocks consist of cells (guard and interior cells); cells contain the variable values (var 0, 1, 2, …, 23; e.g., density, pressure, and temperature).
  • CMIP is supported by the World Climate Research Programme: (1) decadal hindcast and prediction simulations; (2) long-term simulations; (3) atmosphere-only simulations.

slide-19
SLIDE 19


Evaluation metrics

  • Incompressible ratio
    – % of data that must be stored as exact values because it would fall outside the error bound if approximated.
  • Mean error rate
    – Average difference between the approximated change ratio and the real change ratio over all data.
  • Compression ratio
    – Assuming data D of size |D| is reduced to size |D'|, it is defined as (|D| − |D'|) / |D| × 100.
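The three metrics written as small helper functions (array and size arguments are illustrative):

```python
import numpy as np

def incompressible_ratio(exact_mask: np.ndarray) -> float:
    """% of data points that had to be stored as exact values."""
    return 100.0 * float(np.mean(exact_mask))

def mean_error_rate(true_ratio: np.ndarray, approx_ratio: np.ndarray) -> float:
    """Average difference between approximated and real change ratios."""
    return float(np.mean(np.abs(approx_ratio - true_ratio)))

def compression_ratio(size_D: int, size_D_prime: int) -> float:
    """(|D| - |D'|) / |D| * 100, as defined above."""
    return 100.0 * (size_D - size_D_prime) / size_D
```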

slide-20
SLIDE 20

Incompressible Ratio: Equal-width Binning

[Figure: incompressible ratio (0–14%) vs. iteration (1–33) for the FLASH dataset at a 0.1% error rate; variables: dens, eint, ener, pres, temp.]

slide-21
SLIDE 21

Incompressible Ratio: Log-scale Binning

[Figure: incompressible ratio (0–12%) vs. iteration (1–33) for the FLASH dataset at a 0.1% error rate; variables: dens, eint, ener, pres, temp.]

slide-22
SLIDE 22

Incompressible Ratio: Clustering-based Binning

[Figure: incompressible ratio (0–8%) vs. iteration (1–33) for the FLASH dataset at a 0.1% error rate; variables: dens, pres, ener, eint, temp.]

slide-23
SLIDE 23

Mean Error Rate: Clustering-based

[Figure: mean error rate (at most ~0.02%) vs. iteration (1–33) for the FLASH dataset at a 0.1% error-rate bound; variables: dens, pres, ener, eint, temp.]

slide-24
SLIDE 24

Increasing Index Size: Incompressible Ratio

[Figure: % of data that must be stored as exact values (i.e., incompressible) per iteration for rlds with 8-, 9-, and 10-bit indices (rlds-8, rlds-9, rlds-10).]

Increasing the index size from 8 to 10 bits significantly reduces the percentage of incompressible data. Note: rlds is the most difficult variable to compress with an 8-bit index.

slide-25
SLIDE 25

Different Approximations: Compression Ratio

[Figure: compression ratio per iteration for rlds with 8-, 9-, and 10-bit indices (rlds-8, rlds-9, rlds-10).]

Increasing the index size from 8 to 10 bits significantly increases the compression ratio.

slide-26
SLIDE 26

Different Tolerable Error Rates: Incompressible Ratio (0.1% - 0.5%)

[Figure: incompressible ratio per iteration for K-means binning with tolerable error rates from 0.1% to 0.5% (Kmeans-0.1 through Kmeans-0.5).]

slide-27
SLIDE 27

Scaling - Experimental Settings

Name     Application  Domain        Size per iteration  Variable dimension  Variable size
Sedov    FLASH        Astrophysics  15 MB               165*32*32*1         1.3 MB
Stir-1   FLASH        Astrophysics  3.7 GB              64*157*157*157      945 MB
Stir-2   FLASH        Astrophysics  296 GB              1024*314*314*157    59 GB
Stir-3   FLASH        Astrophysics  2.4 TB              8192*314*314*157    472 GB
ASR      ASR          Climate       103 MB              29*320*320          11 MB
CMIP     CMIP3        Climate       19 GB               42*2400*3600        1.4 GB

  • Data sets and environment:
    – FLASH datasets
      • SuperMUC at the Leibniz Supercomputing Centre, Germany: a parallel computer consisting of 9,216 nodes (16 cores per node).
      • We used up to 12,800 cores in our experiments.
    – Others
      • A Linux machine with 2 quad-core CPUs and 32 GB of memory.
slide-28
SLIDE 28

Compression ratios

  • Compared with lossy compression algorithms: ZFP (LLNL) and ISABELA (NCSU).

[Figure: compression ratio vs. iteration (1–5) for NUMARCK, ISABELA, and ZFP on the CMIP dataset (1.4 GB).]

slide-29
SLIDE 29

Scalability Experiments

[Figure: NUMARCK speedup vs. linear speedup, and runtime (sec) vs. number of cores: Stir-2 on 320–1,600 cores and Stir-3 on 3,200–12,800 cores.]

FLASH datasets (turbulence stirring test):

  • Stir-2 (59 GB) dataset
    – Number of cores: 1,600
    – Speedup: 1,404
    – Time: 2.655 sec
    – Original I/O time: 13.2 sec/iteration
  • Stir-3 (472 GB) dataset
    – Number of cores: 12,800
    – Speedup: 8,788
    – Time: 3.610 sec
    – Original I/O time: 18.0 sec

slide-30
SLIDE 30

Open Problems and Challenges

  • Optimize and/or create new machine learning algorithms
    – For higher compression and more accurate representation
    – Scalable implementation
    – Learning from historical results to optimize the “learning step”, minimizing data movement and power
    – Adaptation for anomaly detection (for resilience and analysis)
  • Use of the memory hierarchy and local SSDs
  • Incorporation into pNetCDF and other libraries
  • I/O optimizations
slide-31
SLIDE 31


THANK YOU!

Alok Choudhary
Henry and Isabel Dever Professor, EECS and Kellogg School of Management
Northwestern University
choudhar@eecs.northwestern.edu