
On Efficient Constructions of Checkpoints
Yu Chen, Zhenming Liu, Bin Ren and Xin Jin

Checkpoint for ML applications

    def train_and_checkpoint(net, manager):
        # Recovery from checkpoint
        ckpt.restore(manager.latest_checkpoint)
        if manager.latest_checkpoint:
            print("Restored from {}".format(manager.latest_checkpoint))
        else:
            print("Initializing from scratch.")

        for _ in range(50):
            example = next(iterator)
            loss = train_step(net, example, opt)
            ckpt.step.assign_add(1)
            # Save model's state for recovery
            if int(ckpt.step) % 10 == 0:
                save_path = manager.save()
                print("Saved checkpoint for step {}: {}".format(int(ckpt.step), save_path))
                print("loss {:1.2f}".format(loss.numpy()))

Checkpoint for ML applications
• Application errors
  - Divide by zero
  - Gradient explosion
  - Dead activation
• System failures
  - Power outages
  - Unstable network
  - Unhealthy disks
• Cloud computing
  - Spot instance
  - Container rescheduling

Checkpoint for ML applications
[Figure: training timelines with checkpoints cp1, cp2 versus cp1, cp2, cp3, cp4, with the point where a failure occurs marked]
More frequent checkpointing has a lower recovery cost.

Checkpoint for ML applications
Frequent checkpointing is costly for IO and storage
• System: Decrease checkpoint frequency
• ML & System: Partial checkpoint
• ML & System & Information theory: How can we compress the model checkpoint?

Compression
• Lossy compression
  – $\ell_2$ distance-based compression
• Lossless compression
  – Model compression
How to find the redundant information?
How to design a suitable scheme?

Design
• Design principles
  – Minimize irritation to SGD
  – Maximize redundancies in residual information
• Two key components
  – Approximate tracking by delta-coding
  – Quantization and Huffman coding

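Approximate tracking by delta-coding
$\delta_t = u_t - \tilde{u}_{t-1}$
$\tilde{\delta}_t = f(\delta_t)$
$\tilde{u}_t = \tilde{u}_0 + \sum_{i \le t} \tilde{\delta}_i$
[Figure: the model states $u_0, \dots, u_{t-1}, u_t$ shown alongside their tracked approximations $\tilde{u}_0, \dots, \tilde{u}_{t-1}, \tilde{u}_t$]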
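The delta-coding recurrence on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the model state is a plain list of floats, the tracker starts from the exact initial state (so the first approximation equals `u_0`), and the lossy map `f` is a hypothetical stand-in that rounds to one decimal place rather than the quantizer described on the later slides.

```python
def track(states, f):
    """Delta-code a sequence of states u_0, u_1, ... with a lossy map f.

    Returns the tracked approximations. Assumes the first approximation
    equals u_0 exactly (an assumption of this sketch).
    """
    approx = list(states[0])
    history = [list(approx)]
    for u in states[1:]:
        # delta_t = u_t - approx_{t-1}: measured against the tracked state,
        # not against the true previous state
        delta = [a - b for a, b in zip(u, approx)]
        # lossy delta = f(delta_t), applied element-wise
        lossy = [f(d) for d in delta]
        # approx_t = approx_{t-1} + lossy delta, i.e. approx_0 plus all lossy deltas
        approx = [a + d for a, d in zip(approx, lossy)]
        history.append(list(approx))
    return history


def coarse(d):
    # Hypothetical lossy map: keep one decimal place.
    return round(d, 1)


states = [[0.0], [0.13], [0.27], [0.41]]
tracked = track(states, coarse)
errors = [abs(u[0] - a[0]) for u, a in zip(states, tracked)]
```

Because each delta is taken against the tracked state, the error after step t is just the loss introduced by `f` at that step, so it does not accumulate across checkpoints in this sketch.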

Quantization and Huffman coding
• Two stage quantization
  – Exponent-based quantization
  – Priority Promotion
• Huffman coding

Exponent-based quantization example
Input: 0.76 -0.48 0.2 0.07 0.18 0.49 0.14 0.39 0.82 0.09

Bucket       Values            Representative
e=-1, s=0    0.76, 0.82        0.79
e=-2, s=0    0.39, 0.49        0.44
e=-2, s=1    -0.48             -0.48
e=-3, s=0    0.18, 0.14, 0.2   0.17
e=-4, s=0    0.07, 0.09        0.08

Bucket codes (3 bits): 000 001 010 011 100
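The bucketing in this example can be reproduced with a short Python sketch. It assumes that `e` is the base-2 exponent of the value (the floor of log2 of its magnitude), that `s` is the sign bit, and that each bucket is represented by the mean of its members. These choices match the numbers on the slide, but the helper names and the handling of zeros are hypothetical.

```python
import math
from collections import defaultdict


def exponent_buckets(values):
    """Group values by (exponent, sign), with exponent = floor(log2(|x|))."""
    buckets = defaultdict(list)
    for x in values:
        if x == 0:
            continue  # assumption of this sketch: exact zeros are skipped
        # math.frexp returns (m, e) with x = m * 2**e and 0.5 <= |m| < 1,
        # so the exponent with mantissa in [1, 2) is e - 1.
        e = math.frexp(x)[1] - 1
        s = 0 if x > 0 else 1
        buckets[(e, s)].append(x)
    return dict(buckets)


def representatives(buckets):
    """Represent each bucket by the mean of its members."""
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


delta = [0.76, -0.48, 0.2, 0.07, 0.18, 0.49, 0.14, 0.39, 0.82, 0.09]
reps = representatives(exponent_buckets(delta))
for (e, s), rep in sorted(reps.items(), reverse=True):
    print(f"e={e}, s={s}: {rep:.2f}")
```

Rounded to two decimals, the five representatives are 0.79, 0.44, -0.48, 0.17 and 0.08, the same values shown on the slide.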

Quantization and Huffman coding
• Two stage quantization
  – Exponent-based quantization
  – Priority Promotion
• Huffman coding

Example pipeline
Input:                        0.76 -0.48 0.2 0.07 0.18 0.49 0.14 0.39 0.82 0.09
Exponent-based quantization:  representatives 0.17 0.79 0.44 -0.48 0.08 (codes 000 001 010 011 100)
Priority promotion:           representatives 0.79 0.44 -0.48 0 (codes 00 01 10 11)
Quantized values:             0.79 -0.48 0 0 0 0.44 0 0.44 0.79 0
Huffman encoding:             10 110 0 0 0 111 0 111 10 0
Checkpoint saving
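The last two stages of this example can be sketched as follows. The sketch assumes that priority promotion with `bits` bits keeps the `2**bits - 1` buckets with the largest exponents and merges everything else into a single bucket represented by 0, which is consistent with the slide's numbers. The function names, the rounding of representatives to two decimals, and the tie-breaking rules are hypothetical.

```python
import heapq
import math
from collections import Counter, defaultdict


def bucket_key(x):
    # (exponent, sign) with exponent = floor(log2(|x|)); x must be nonzero
    return (math.frexp(x)[1] - 1, 0 if x > 0 else 1)


def promote(values, bits):
    """Keep the 2**bits - 1 largest-exponent buckets; map the rest to 0."""
    buckets = defaultdict(list)
    for x in values:
        buckets[bucket_key(x)].append(x)
    # Ties at the cutoff are broken arbitrarily in this sketch.
    keep = sorted(buckets, key=lambda k: k[0], reverse=True)[: 2**bits - 1]
    reps = {k: round(sum(buckets[k]) / len(buckets[k]), 2) for k in keep}
    return [reps.get(bucket_key(x), 0.0) for x in values]


def huffman_codes(symbols):
    """Build a prefix code from symbol frequencies with a min-heap."""
    freq = Counter(symbols)
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap entries are (count, unique id, partial code table); the id breaks ties.
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (n1 + n2, next_id, merged))
        next_id += 1
    return heap[0][2]


delta = [0.76, -0.48, 0.2, 0.07, 0.18, 0.49, 0.14, 0.39, 0.82, 0.09]
promoted = promote(delta, bits=2)
codes = huffman_codes(promoted)
encoded = "".join(codes[v] for v in promoted)
```

With 2 bits, `promoted` matches the quantized row of the slide. The individual code words may differ from the slide's because Huffman ties can be broken either way, but the encoded length is 18 bits in both cases, compared with 20 bits for fixed 2-bit codes.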

Design
• System optimization
  – Asynchronous execution
  – Checkpoint merging
  – Huffman code table caching
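As an illustration of the asynchronous-execution idea only, the sketch below hands checkpoint writes to a background thread so the training loop does not wait on them. Everything in it is hypothetical: the class name `AsyncCheckpointer`, the `write_fn` callback, and the list-based state are assumptions for the example and do not describe the system presented in these slides.

```python
import queue
import threading


class AsyncCheckpointer:
    """Hypothetical helper: write checkpoints on a background thread."""

    def __init__(self, write_fn):
        self._write_fn = write_fn          # called as write_fn(step, snapshot)
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def save(self, step, state):
        # Copy the state so later training updates cannot change the snapshot.
        self._queue.put((step, list(state)))

    def _run(self):
        while True:
            item = self._queue.get()
            if item is None:               # sentinel: stop the worker
                break
            step, snapshot = item
            self._write_fn(step, snapshot)

    def close(self):
        # Flush pending writes, then stop the worker.
        self._queue.put(None)
        self._worker.join()


saved = {}
checkpointer = AsyncCheckpointer(saved.__setitem__)
state = [0.0]
for step in range(1, 4):
    state[0] += 1.0                        # stand-in for a training step
    checkpointer.save(step, state)
checkpointer.close()
```

Copying the state in `save` is the important design choice here: without it the background writer could observe a later, partially updated model.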

Evaluation
• Models
  – Logistic Regression
  – LeNet-5
  – AlexNet
  – Matrix Factorization
• Datasets
  – MNIST
  – Fashion-MNIST
  – Jester
  – MovieLens10M
• Objectives
  – Comparing the recovery cost with previous work
  – Evaluating the compression benefit brought by different approaches
  – Validating the effectiveness of priority promotion
  – Confirming the low overhead

Recovery cost comparison
[Figure: recovery cost comparison; highlighted value: 5.37x]
- Outperforms SCAR by 2.88x-5.77x and TOPN by 2.17x-4.06x at 5% checkpoint size
- Outperforms SCAR by 1.9x-4.82x and TOPN by 1.52x-2.17x at 10% checkpoint size
- LC-checkpoint has a more stable rework cost as the checkpoint size decreases

Recovery cost comparison
[Figure: recovery cost comparison; highlighted value: 4.4x]
- Outperforms SCAR by 2.88x-5.77x and TOPN by 2.17x-4.06x at 5% checkpoint size
- Outperforms SCAR by 1.9x-4.82x and TOPN by 1.52x-2.17x at 10% checkpoint size
- LC-checkpoint has a more stable rework cost as the checkpoint size decreases

Compression effect breakdown
[Figure: compression ratio breakdown; highlighted values: 85.47%, 93.73%, 95.87%]
• Exponent-based quantization (E)
• Priority promotion (P)
• Huffman coding (H)
- E yields a compression ratio of 85% on average
- P brings an extra 9.26% compression ratio on average for 2-bit and 6.23% for 3-bit
- H brings an extra 2% compression ratio with 2-bit priority promotion and 1.6% with 3-bit
- P with fewer bits yields more benefit for H
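One way to read these percentages is sketched below. It rests on two assumptions that the slide does not state: "compression ratio" means the fraction of the original size that is saved, and each parameter is originally stored as a 32-bit float. The helper names are hypothetical.

```python
def fixed_width_saving(code_bits, value_bits=32):
    """Fraction of space saved when each value is replaced by a fixed-width code."""
    return 1 - code_bits / value_bits


def saving_to_factor(saving):
    """Convert a fraction saved (e.g. 0.95) into a size-reduction factor (e.g. 20x)."""
    return 1 / (1 - saving)


print(f"{fixed_width_saving(2):.2%}")
print(f"{saving_to_factor(0.9587):.1f}x")
```

Under these assumptions, 2-bit codes alone would save 93.75%, close to the 93.73% highlighted in the figure, and a 95.87% saving corresponds to a checkpoint roughly 24 times smaller.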

The effectiveness of priority promotion
Rebuild $u_{t+m}$ by $u_t + \delta_m$
• X-axis: the id of the exponent bucket removed from $\delta_m$
• Y-axis: relative error calculated by the loss function; lower is better
- Smaller exponent buckets have a negligible impact on the model state
- 3 buckets (2 bits) and 7 buckets (3 bits) can hold most of the significant bits
[Figure: two panels, 2 bits and 3 bits]

Overhead
• Each iteration costs 91 seconds on average
• A failure occurs at the 7th iteration
• LC checkpoint saves 6 iterations (546 seconds)
• LC checkpoint has less than 4 seconds of overhead
[Figure: timeline showing 6 iterations before the failure occurs]

Conclusion
– Pose an important research question: how to compress checkpoints
– Characterize a family of compression schemes for tracking the learning process
– Design a lossy coding scheme to compress checkpoints
– Optimize the training system to achieve low-overhead checkpointing
– Achieve a compression rate of up to 28x and a recovery speedup of up to 5.77x over state-of-the-art algorithms
Thank you for your attention!
ychen39@email.wm.edu

Checkpoint for ML applications
• Classic checkpoint mechanism
  – Save model state periodically
  – Partially save model state for faster recovery
• Key technical challenge
  – Frequent checkpointing is costly for IO and storage
• How can we compress the model checkpoint?
  – Maximize the compression rate
  – The scheme needs to be optimized for ML applications
Delta encoding scheme with lossy compression
