slide-1
SLIDE 1

Scaling Resiliency via machine learning and compression

Alok Choudhary

Henry and Isabel Dever Professor, EECS and Kellogg School of Management, Northwestern University
choudhar@eecs.northwestern.edu
Founder, Chairman, and Chief Scientist, 4Cinsights Inc.: A Big Data Science Company
+1 312 515 2562
alok@4Cinsights.com

slide-2
SLIDE 2


Motivation

  • Scientific simulations
    – Generate large amounts of data.
    – Data features: high entropy, spatial-temporal structure.
  • Exascale requirements*
    – Scalable system software: developing scalable system software that is power- and resilience-aware.
    – Resilience and correctness: ensuring correct scientific computation in the face of faults, reproducibility, and algorithm-verification challenges.
  • NUMARCK (NU Machine learning Algorithm for Resiliency and ChecKpointing)
    – Learn the temporal relative change and its distribution, and bound the point-wise, user-defined error.

* From the Advanced Scientific Computing Advisory Committee's Top Ten Technical Approaches for Exascale.

slide-3
SLIDE 3

Checkpointing and NUMARCK

  • Traditional checkpointing systems store raw (and uncompressed) data.
    – Cost prohibitive: the storage space and time.
    – Threatens to overwhelm the simulation and the post-simulation data analysis.
  • I/O accesses have become a limiting factor to key scientific discoveries.

Solution? NUMARCK.

slide-4
SLIDE 4


What if a Resilience and Checkpointing Solution Provided

  • Improved resilience via more frequent yet relevant checkpoints, while
  • Reducing the amount of data to be stored by an order of magnitude, and
  • Guaranteeing a user-specified tolerable maximum error rate for each data point, and
  • An order of magnitude smaller mean error for each data set, and
  • Reduced I/O time by an order of magnitude, while
  • Providing data for effective analysis and visualization?

slide-5
SLIDE 5

Motivation: “Incompressible” with Lossless Encoding

Shannon’s information theory:

H(X) = − Σ_{i=1}^{n} p(x_i) log p(x_i)

[Figure: probability distribution of the more common bit value at each bit position of double-precision rlds data. The exponent bits are compressible (low entropy); the mantissa bits are less predictable (high entropy) and essentially incompressible.]
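A minimal sketch (not from the slides) of how the per-bit entropy behind this figure could be estimated; it assumes a little-endian host, and `data` is a synthetic stand-in for one iteration of the rlds field.

```python
import numpy as np

def bitwise_entropy(values: np.ndarray) -> np.ndarray:
    """Shannon entropy H = -sum p*log2(p) at each of the 64 bit positions of float64 data."""
    raw = np.ascontiguousarray(values, dtype=np.float64).view(np.uint8).reshape(-1, 8)
    bits = np.unpackbits(raw[:, ::-1], axis=1)        # byte-reverse (little-endian host) so column 0 is the sign bit
    p1 = bits.mean(axis=0)                            # probability that each bit position is 1
    h = np.zeros(64)
    mask = (p1 > 0) & (p1 < 1)
    h[mask] = -(p1[mask] * np.log2(p1[mask]) + (1 - p1[mask]) * np.log2(1 - p1[mask]))
    return h                                          # low for sign/exponent bits, near 1 for mantissa bits

data = 300.0 * np.random.rand(100_000)                # synthetic stand-in for rlds values
print(bitwise_entropy(data).round(2))
```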

slide-6
SLIDE 6
Motivation: Still “Incompressible” with Lossy Encoding

[Figure: original rlds data vs. B-spline reconstructed rlds data. The original field is highly random; the lossy B-spline reconstruction misses extreme events and achieves only ~0.35 correlation with the original.]

slide-7
SLIDE 7

What if we analyze the Change in Value?

Observations:

  • Variable Values – distribution
  • Change in Variable Value – distribution
  • Relative Change in Variable Value - distribution

Hypothesis: The relative changes in variable values can be represented in a much smaller state space.

  • A1(t) = 100, A1(t+1) = 110 => change = 10, relative change = 10%
  • A2(t) = 5, A2(t+1) = 5.5 => change = 0.5, relative change = 10%

Observation: Simulation Represents a State Transition Model

slide-8
SLIDE 8

Sneak Preview: Relative Change is more predictable

[Figure: iterations 1 and 2 of the climate CMIP5 rlus data look random, while the relative change between iterations 1 and 2 has a distribution that can be learned.]

slide-9
SLIDE 9


Challenges

  • How to learn patterns and distributions of relative change at scale?
  • How to represent distributions at scale?
  • How to bound errors?
  • System issues
    – Data movement
    – I/O
    – Scalable software
  • Reconstruction when needed
slide-10
SLIDE 10


NUMARCK Overview

  • Forward predictive coding: transform the data by computing relative changes (ratios) from one iteration to the next.
  • Data approximation: learn the distribution of the relative change r using machine learning algorithms and store the approximated values.

Traditional checkpointing stores a full checkpoint at every step: F F F F F F F F. Machine-learning-based checkpointing stores occasional full checkpoints with change ratios in between: F C C F. (F: full checkpoint; C: change ratios.)
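A minimal NumPy sketch of the forward predictive coding step described above: iteration t+1 is expressed as its element-wise relative change from iteration t and rebuilt from an approximated ratio. The names `prev`, `curr`, and `r_approx` are illustrative, not NUMARCK's actual API.

```python
import numpy as np

def change_ratio(prev: np.ndarray, curr: np.ndarray) -> np.ndarray:
    """Element-wise relative change r such that curr == prev * (1 + r)."""
    return (curr - prev) / prev            # assumes prev contains no exact zeros

def reconstruct(prev: np.ndarray, r_approx: np.ndarray) -> np.ndarray:
    """Rebuild a change-ratio checkpoint (C) from the previous full checkpoint (F)."""
    return prev * (1.0 + r_approx)
```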

slide-11
SLIDE 11

NUMARCK: Overview

[Figure: forward coding followed by distribution learning; the reconstruction achieves ~0.99 correlation and 0.001 RMSE against the original data.]

slide-12
SLIDE 12

E.g., Distribution Learning Strategies

  • Equal-width Bins (Linear)
  • Log-scale Bins (Exponential)
  • Machine Learning – Dynamic clustering

The number of bins or clusters depends on the bits designated for storing indices and on the error tolerance. Example:

– index length (B): 8 bits
– tolerable error per point (E): 0.1%

These determine the number of clusters and the width of each cluster.
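Illustrative arithmetic for this example (B = 8 index bits, E = 0.1% per-point tolerance), using the equal-width relations quoted on the next slide:

```python
B = 8                       # bits reserved per stored index
E = 0.001                   # tolerable relative error per point (0.1%)

n_bins   = 2**B - 1         # 255 bins, as on the binning slides
width    = 2 * E            # each equal-width bin spans +/-E around its centre: 0.2%
coverage = width * n_bins   # total change-ratio range covered: 0.51, i.e. roughly +/-25.5%
print(n_bins, width, coverage)
```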

slide-13
SLIDE 13

Equal-width Distribution

[Figure: histogram of change ratios (%) vs. counts for dens, iterations 32 to 33.]

In each iteration, partition the change-ratio range into 255 bins of equal width. Each value is assigned a corresponding bin ID and represented by the center of its bin. If the difference between the original value and the approximated one is larger than the user-specified tolerance (0.1%), the value is stored as-is (i.e., it is incompressible).

Pros: easy to implement. Cons: (1) can only cover a range of 2*E*(2^B − 1); (2) the bin width is fixed at 2*E.
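A minimal sketch of this equal-width scheme, assuming the 255 bins are centred on zero change and spaced 2*E apart (consistent with the pros/cons above); it is not the NUMARCK reference implementation.

```python
import numpy as np

def equal_width_encode(ratios: np.ndarray, B: int = 8, E: float = 0.001):
    """Map change ratios to bin IDs; flag points that miss the per-point error bound E."""
    n_bins = 2**B - 1
    centers = (np.arange(n_bins) - n_bins // 2) * (2 * E)   # bin centres, 2*E apart
    idx = np.rint(ratios / (2 * E)).astype(np.int64) + n_bins // 2
    idx = np.clip(idx, 0, n_bins - 1)
    incompressible = np.abs(ratios - centers[idx]) > E      # these points are stored exactly instead
    return idx, incompressible, centers
```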

slide-14
SLIDE 14

Log-scale Distribution

[Figure: histogram of change ratios (%) vs. counts for dens, iterations 32 to 33.]

In each iteration, partition the ratio distribution into 255 bins of log-scale width.

Pros: covers a larger range with finer (narrower) bins. Cons: may not perform well for highly irregularly distributed data.
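A sketch of one way to construct log-scale bin edges; the slide does not give the exact rule, so the geometric spacing and the outer range `r_max` below are assumptions.

```python
import numpy as np

def log_scale_edges(B: int = 8, E: float = 0.001, r_max: float = 0.5):
    """Geometrically spaced (log-scale) bin edges, symmetric around zero change."""
    half = (2**B - 2) // 2                   # 127 bins per sign, plus one central bin = 255 total
    pos = np.geomspace(E, r_max, half + 1)   # edges for positive ratios: narrow near zero, wide far out
    return np.concatenate([-pos[::-1], pos]) # use e.g. np.digitize(ratios, edges) to assign bin IDs
```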

slide-15
SLIDE 15

Machine Learning (Clustering-based) Binning

[Figure: histogram of change ratios (%) vs. counts for dens, iterations 32 to 33.]

In each iteration, partition the ratio data into 255 clusters using clustering (e.g., K-means); each value is then approximated by its cluster's centroid.
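A sketch of the clustering-based variant using scikit-learn's MiniBatchKMeans as a stand-in; the slide only says "e.g., K-means", so the actual NUMARCK clustering code may differ.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def kmeans_encode(ratios: np.ndarray, B: int = 8, E: float = 0.001):
    """Cluster change ratios into 2^B - 1 bins; approximate each ratio by its centroid."""
    km = MiniBatchKMeans(n_clusters=2**B - 1, random_state=0).fit(ratios.reshape(-1, 1))
    centers = km.cluster_centers_.ravel()
    idx = km.labels_
    incompressible = np.abs(ratios - centers[idx]) > E   # fall back to exact storage
    return idx, incompressible, centers
```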

slide-16
SLIDE 16

Methodology Summary

  • Initialization: the model, initial conditions, and metadata.
  • Calculation: calculate the relative change.
  • Learning distributions: bin the relative change into N bins; index and store the bin IDs.
  • Storage: store and compress the index; store exact values for changes outside the error bounds.
  • Reconstruction: read the last available complete checkpoint; reconstruct the data values for each data point and, if required, report the error bounds.
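A sketch of a single reconstruction step under the storage layout summarized above (bin centroids for compressible points, exact storage for the rest); the NumPy array arguments are illustrative names, not NUMARCK's API.

```python
def reconstruct_step(prev, centers, bin_ids, exact_mask, exact_values):
    """Rebuild one iteration from the previous one: binned points use their bin centroid,
    incompressible points are restored from exact storage."""
    approx = prev * (1.0 + centers[bin_ids])
    approx[exact_mask] = exact_values
    return approx
```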

slide-17
SLIDE 17

NUMARCK Algorithm

  • Change ratio calculation
    – Calculate element-wise change ratios
  • Bin histogram construction
    – Assign change ratios within the error bound to bins
  • Indexing
    – Each data element is indexed by its bin ID
  • Select the top-K bins with the most elements
    – Data in the top-K bins are represented by their bin IDs
    – Data outside the top-K bins are stored as-is
  • (Optional) Apply lossless GNU ZLIB compression to the index table
    – Further reduces the data size
  • (Optional) File I/O
    – Data is saved in a self-describing netCDF/HDF5 file
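A sketch of the top-K selection and optional ZLIB steps from the list above; Python's zlib stands in for "GNU ZLIB", the index table is assumed to fit in 8-bit IDs (B = 8), and the netCDF/HDF5 output step is omitted.

```python
import zlib
import numpy as np

def topk_and_compress(bin_ids: np.ndarray, K: int = 255):
    """Keep only the K most populated bins; compress the resulting index table."""
    counts = np.bincount(bin_ids)
    top_bins = np.argsort(counts)[::-1][:K]               # bins holding the most elements
    keep = np.isin(bin_ids, top_bins)                     # these elements store just a bin ID
    index_table = bin_ids[keep].astype(np.uint8)          # assumes bin IDs already fit in 8 bits
    compressed = zlib.compress(index_table.tobytes())     # optional lossless pass
    return top_bins, keep, compressed                     # ~keep marks data stored as-is
```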

slide-18
SLIDE 18
Experimental Results: Datasets

  • FLASH is a modular, parallel multi-physics simulation code developed at the FLASH Center at the University of Chicago.
    – It is a parallel adaptive-mesh-refinement (AMR) code with a block-oriented structure.
    – A block is the unit of computation; the grid is composed of blocks.
    – Blocks consist of cells (guard and interior cells); cells contain the variable values (var 0, 1, 2, …, 23; e.g., density, pressure, and temperature).
  • CMIP is supported by the World Climate Research Programme: (1) decadal hindcast and prediction simulations; (2) long-term simulations; (3) atmosphere-only simulations.

slide-19
SLIDE 19


Evaluation metrics

  • Incompressible ratio
    – % of data that must be stored as exact values because it would fall outside the error bound if approximated.
  • Mean error rate
    – Average difference between the approximated change ratio and the real change ratio over all data.
  • Compression ratio
    – Assuming data D of size |D| is reduced to size |D'|, it is defined as (|D| − |D'|) / |D| × 100.
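The three metrics written as small helper functions (array and size arguments are illustrative):

```python
import numpy as np

def incompressible_ratio(exact_mask: np.ndarray) -> float:
    """% of data points that had to be stored as exact values."""
    return 100.0 * float(np.mean(exact_mask))

def mean_error_rate(true_ratio: np.ndarray, approx_ratio: np.ndarray) -> float:
    """Average difference between approximated and real change ratios."""
    return float(np.mean(np.abs(approx_ratio - true_ratio)))

def compression_ratio(size_D: int, size_D_prime: int) -> float:
    """(|D| - |D'|) / |D| * 100, as defined above."""
    return 100.0 * (size_D - size_D_prime) / size_D
```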

slide-20
SLIDE 20

Incompressible Ratio: Equal-width Binning

[Figure: incompressible ratio (0–14%) vs. iteration (1–33) for the FLASH dataset at a 0.1% error rate; variables: dens, eint, ener, pres, temp.]

slide-21
SLIDE 21

Incompressible Ratio: Log-scale Binning

[Figure: incompressible ratio (0–12%) vs. iteration (1–33) for the FLASH dataset at a 0.1% error rate; variables: dens, eint, ener, pres, temp.]

slide-22
SLIDE 22

Incompressible Ratio: Clustering-based Binning

[Figure: incompressible ratio (0–8%) vs. iteration (1–33) for the FLASH dataset at a 0.1% error rate; variables: dens, pres, ener, eint, temp.]

slide-23
SLIDE 23

Mean Error Rate: Clustering-based

[Figure: mean error rate (at most ~0.02%) vs. iteration (1–33) for the FLASH dataset at a 0.1% error-rate bound; variables: dens, pres, ener, eint, temp.]

slide-24
SLIDE 24

Increasing Index Size: Incompressible Ratio

[Figure: % of data that must be stored as exact values (i.e., incompressible) per iteration for rlds with 8-, 9-, and 10-bit indices (rlds-8, rlds-9, rlds-10).]

Increasing the index size from 8 to 10 bits significantly reduces the percentage of incompressible data. Note: rlds is the most difficult variable to compress with an 8-bit index.

slide-25
SLIDE 25

Different Approximations: Compression Ratio

[Figure: compression ratio per iteration for rlds with 8-, 9-, and 10-bit indices (rlds-8, rlds-9, rlds-10).]

Increasing the index size from 8 to 10 bits significantly increases the compression ratio.

slide-26
SLIDE 26

Different Tolerable Error Rates: Incompressible Ratio (0.1% - 0.5%)

[Figure: incompressible ratio per iteration for K-means binning with tolerable error rates from 0.1% to 0.5% (Kmeans-0.1 through Kmeans-0.5).]

slide-27
SLIDE 27

Scaling - Experimental Settings

Name     Application  Domain        Size per iteration  Variable dimension  Variable size
Sedov    FLASH        Astrophysics  15 MB               165*32*32*1         1.3 MB
Stir-1   FLASH        Astrophysics  3.7 GB              64*157*157*157      945 MB
Stir-2   FLASH        Astrophysics  296 GB              1024*314*314*157    59 GB
Stir-3   FLASH        Astrophysics  2.4 TB              8192*314*314*157    472 GB
ASR      ASR          Climate       103 MB              29*320*320          11 MB
CMIP     CMIP3        Climate       19 GB               42*2400*3600        1.4 GB

  • Data sets and environment:
    – FLASH datasets
      • SuperMUC at the Leibniz Supercomputing Centre, Germany: a parallel computer consisting of 9,216 nodes (16 cores per node).
      • We used up to 12,800 cores in our experiments.
    – Others
      • A Linux machine with 2 quad-core CPUs and 32 GB of memory.
slide-28
SLIDE 28

Compression ratios

  • Compared with lossy compression algorithms: ZFP (LLNL) and ISABELA (NCSU).

[Figure: compression ratio vs. iteration (1–5) for NUMARCK, ISABELA, and ZFP on the CMIP dataset (1.4 GB).]

slide-29
SLIDE 29

Scalability Experiments

[Figure: NUMARCK speedup vs. linear speedup, and runtime (sec) vs. number of cores: Stir-2 on 320–1,600 cores and Stir-3 on 3,200–12,800 cores.]

FLASH datasets (turbulence stirring test):

  • Stir-2 (59 GB) dataset
    – Number of cores: 1,600
    – Speedup: 1,404
    – Time: 2.655 sec
    – Original I/O time: 13.2 sec/iteration
  • Stir-3 (472 GB) dataset
    – Number of cores: 12,800
    – Speedup: 8,788
    – Time: 3.610 sec
    – Original I/O time: 18.0 sec

slide-30
SLIDE 30

Open Problems and Challenges

  • Optimize and/or create new machine learning algorithms
    – For higher compression and more accurate representation
    – Scalable implementation
    – Learning from historical results to optimize the “learning step”, minimizing data movement and power
    – Adaptation for anomaly detection (for resilience and analysis)
  • Use of the memory hierarchy and local SSDs
  • Incorporation into pNetCDF and other libraries
  • I/O optimizations
slide-31
SLIDE 31


THANK YOU!

Alok Choudhary
Henry and Isabel Dever Professor, EECS and Kellogg School of Management
Northwestern University
choudhar@eecs.northwestern.edu