On Efficient Constructions of Checkpoints



  1. On Efficient Constructions of Checkpoints
     Yu Chen, Zhenming Liu, Bin Ren and Xin Jin

  2. Checkpoint for ML applications

         # ckpt, iterator, opt and train_step are defined elsewhere in the training script.
         def train_and_checkpoint(net, manager):
             # Recovery from checkpoint
             ckpt.restore(manager.latest_checkpoint)
             if manager.latest_checkpoint:
                 print("Restored from {}".format(manager.latest_checkpoint))
             else:
                 print("Initializing from scratch.")

             for _ in range(50):
                 example = next(iterator)
                 loss = train_step(net, example, opt)
                 ckpt.step.assign_add(1)
                 if int(ckpt.step) % 10 == 0:
                     # Save model's state for recovery
                     save_path = manager.save()
                     print("Saved checkpoint for step {}: {}".format(int(ckpt.step), save_path))
                     print("loss {:1.2f}".format(loss.numpy()))

  3. Checkpoint for ML applications
     • Application errors
       - Divide by zero
       - Gradient explosion
       - Dead activation
     • System failures
       - Power outages
       - Unstable network
       - Unhealthy disks
     • Cloud computing
       - Spot instance
       - Container rescheduling

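  4. Checkpoint for ML applications
     [Figure: two training timelines, one with checkpoints cp1 and cp2 and one with checkpoints cp1 to cp4; a failure is marked on the timeline.]
     Frequent checkpointing has a lower recovery cost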
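The point of the timeline can be illustrated with a little arithmetic. The sketch below assumes checkpoints are written every `interval` training steps and that the rework after a failure is the number of steps since the last checkpoint; the function name and the step numbers are hypothetical and do not come from the slides.

```python
def rework_steps(failure_step, interval):
    """Steps that must be redone after a failure (illustrative model only)."""
    # Last checkpoint written at or before the failure.
    last_checkpoint = (failure_step // interval) * interval
    return failure_step - last_checkpoint

# Hypothetical run that fails at step 47.
sparse = rework_steps(47, 25)    # checkpoint every 25 steps
frequent = rework_steps(47, 10)  # checkpoint every 10 steps
print(sparse, frequent)
```

With these made-up numbers the sparser schedule redoes 22 steps and the more frequent one redoes 7, which is the trade-off against IO and storage cost that the next slide raises.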

  5. Checkpoint for ML applications
     Frequent checkpointing is costly for IO and storage
     • System: Decrease checkpoint frequency
     • ML & System: Partial checkpoint
     • ML & System & Information theory: How can we compress the model checkpoint?

  6. Compression
     • Lossy compression
       – $\ell_2$ distance-based compression
     • Lossless compression
       – Model compression
     How to find the redundant information? How to design a suitable scheme?

  7. Design
     • Design principles
       – Minimize irritation to SGD
       – Maximize redundancies in residual information
     • Two key components
       – Approximate tracking by delta-coding
       – Quantization and Huffman coding

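  8. Approximate tracking by delta-coding
     $\tilde{u}_t = u_0 + \sum_{i \le t} \tilde{\delta}_i$
     $\delta_t = u_t - \tilde{u}_{t-1}$
     $\tilde{\delta}_t = f(\delta_t)$
     [Diagram: the true states $u_0, \dots, u_{t-1}, u_t, \dots$ alongside the tracked approximations $\tilde{u}_0, \dots, \tilde{u}_{t-1}, \tilde{u}_t, \dots$]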
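The tracking recursion on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `lossy` is a hypothetical stand-in for $f$ (it rounds each coordinate to one decimal place), and the model states are made-up two-element vectors.

```python
def lossy(delta):
    """Hypothetical stand-in for f: round every coordinate to one decimal."""
    return [round(d, 1) for d in delta]

def track(states, f):
    """Return the compressed deltas and the final approximation u~_t."""
    approx = list(states[0])          # u~_0 = u_0
    deltas = []
    for u_t in states[1:]:
        # delta_t = u_t - u~_{t-1}: measured against the approximation,
        # not against the previous true state.
        delta = [u - a for u, a in zip(u_t, approx)]
        delta_tilde = f(delta)        # delta~_t = f(delta_t)
        deltas.append(delta_tilde)
        approx = [a + d for a, d in zip(approx, delta_tilde)]
    return deltas, approx

# Made-up model states u_0 .. u_3.
states = [[0.0, 0.0], [0.13, -0.27], [0.31, -0.52], [0.48, -0.80]]
deltas, approx = track(states, lossy)

# Recovery: u~_t = u_0 + sum of the stored compressed deltas.
rebuilt = list(states[0])
for d in deltas:
    rebuilt = [r + x for r, x in zip(rebuilt, d)]

max_error = max(abs(a - u) for a, u in zip(approx, states[-1]))
print(rebuilt, max_error)
```

Because each delta is taken against the previous approximation, the error after step $t$ equals $\tilde{\delta}_t - \delta_t$, the error of a single application of $f$, so in this sketch it stays within the rounding bound instead of accumulating over steps.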

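  9. Quantization and Huffman coding
     • Two stage quantization
       – Exponent-based quantization
       – Priority promotion
     • Huffman coding
     Example (exponent-based quantization): the input 0.76, -0.48, 0.2, 0.07, 0.18, 0.49, 0.14, 0.39, 0.82, 0.09 is grouped into buckets by exponent e and sign s, and each bucket is replaced by one representative value:
       e=-1, s=0: 0.76, 0.82 → 0.79
       e=-2, s=0: 0.39, 0.49 → 0.44
       e=-2, s=1: -0.48 → -0.48
       e=-3, s=0: 0.2, 0.14, 0.18 → 0.17
       e=-4, s=0: 0.07, 0.09 → 0.08
     Representatives 0.17, 0.79, 0.44, -0.48, 0.08; 3-bit codes 000, 001, 010, 011, 100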

  10. Quantization and Huffman coding
      • Two stage quantization
        – Exponent-based quantization
        – Priority promotion
      • Huffman coding
      Example pipeline:
        Input: 0.76, -0.48, 0.2, 0.07, 0.18, 0.49, 0.14, 0.39, 0.82, 0.09
        Exponent-based quantization: representatives 0.17, 0.79, 0.44, -0.48, 0.08; 3-bit codes 000, 001, 010, 011, 100
        Priority promotion: values 0.79, 0.44, -0.48, 0; 2-bit codes 00, 01, 10, 11
        Quantized sequence: 0.79, -0.48, 0, 0, 0, 0.44, 0, 0.44, 0.79, 0
        Huffman encoding: 10, 110, 0, 0, 0, 111, 0, 111, 10, 0
        Checkpoint saving
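The three stages on this slide can be replayed on the slide's own ten input values. The sketch below rests on assumptions inferred from the numbers rather than stated in the deck: each bucket's representative is taken to be the mean of its members, and priority promotion is taken to keep the 2^bits - 1 buckets with the largest exponents and map everything else to 0. All function names are hypothetical.

```python
import heapq
import math
from collections import Counter, defaultdict

def bucket_key(v):
    # math.frexp gives v = m * 2**e with 0.5 <= |m| < 1, so e - 1 is the
    # exponent in the 1.x * 2**e form used on the slide (0.76 -> e = -1).
    _, e = math.frexp(v)
    return (e - 1, 1 if v < 0 else 0)

def exponent_quantize(values):
    """Map each (exponent, sign) bucket to the mean of its members (assumed)."""
    buckets = defaultdict(list)
    for v in values:
        buckets[bucket_key(v)].append(v)
    return {k: sum(vs) / len(vs) for k, vs in buckets.items()}

def priority_promote(values, reps, bits):
    """Keep the 2**bits - 1 largest-exponent buckets; zero the rest (assumed)."""
    kept = set(sorted(reps, key=lambda k: k[0], reverse=True)[: 2**bits - 1])
    # Rounding to two decimals only mirrors how the slide prints the values.
    return [round(reps[bucket_key(v)], 2) if bucket_key(v) in kept else 0.0
            for v in values]

def huffman_codes(symbols):
    """Build a prefix code from symbol frequencies with a min-heap."""
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(Counter(symbols).items())]
    heapq.heapify(heap)
    next_id = len(heap)  # unique tie-breaker so the dicts are never compared
    while len(heap) > 1:
        n0, _, c0 = heapq.heappop(heap)
        n1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c0.items()}
        merged.update({s: "1" + c for s, c in c1.items()})
        heapq.heappush(heap, (n0 + n1, next_id, merged))
        next_id += 1
    return heap[0][2]

values = [0.76, -0.48, 0.2, 0.07, 0.18, 0.49, 0.14, 0.39, 0.82, 0.09]
reps = exponent_quantize(values)
quantized = priority_promote(values, reps, bits=2)
codes = huffman_codes(quantized)
total_bits = sum(len(codes[s]) for s in quantized)
print(quantized, total_bits)
```

On this input the quantized sequence matches the one on the slide, and the Huffman stage needs 18 bits where fixed 2-bit codes would need 20. The individual code words may differ from the slide's because ties between equally frequent symbols can be broken either way; the total length is the same.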

  11. Design
      • System optimization
        – Asynchronous execution
        – Checkpoint merging
        – Huffman code table caching

  12. Evaluation
      • Models
        – Logistic Regression
        – LeNet-5
        – AlexNet
        – Matrix Factorization
      • Dataset
        – MNIST
        – Fashion-MNIST
        – Jester
        – MovieLens10M
      • Objective
        – Comparing the recovery cost with previous works
        – Evaluating the compression benefit brought by different approaches
        – Validating the effectiveness of priority promotion
        – Confirming the low overhead

  13. Recovery cost comparison
      [Figure callout: 5.37x]
      - Outperforms SCAR by 2.88x-5.77x and TOPN by 2.17x-4.06x at 5% checkpoint size
      - Outperforms SCAR by 1.9x-4.82x and TOPN by 1.52x-2.17x at 10% checkpoint size
      - LC-checkpoint has a more stable rework cost as the checkpoint size decreases

  14. Recovery cost comparison
      [Figure callout: 4.4x]
      - Outperforms SCAR by 2.88x-5.77x and TOPN by 2.17x-4.06x at 5% checkpoint size
      - Outperforms SCAR by 1.9x-4.82x and TOPN by 1.52x-2.17x at 10% checkpoint size
      - LC-checkpoint has a more stable rework cost as the checkpoint size decreases

  15. Compression effect breakdown
      • Exponent-based quantization (E)
      • Priority promotion (P)
      • Huffman coding (H)
      [Figure callouts, in slide order: 85.47%, 93.73%, 95.87%]
      - E yields a compression ratio of 85% on average
      - P adds 9.26% to the compression ratio on average with 2 bits and 6.23% with 3 bits
      - H adds 2% to the compression ratio with 2-bit priority promotion and 1.6% with 3-bit
      - P with fewer bits yields more benefit for H

  16. The effectiveness of priority promotion
      Rebuild $u_{t+m}$ by $u_t + \delta_m$
      • X-axis: the id of the exponent bucket removed from $\delta_m$
      • Y-axis: relative error calculated by the loss function; lower is better
      - Smaller exponent buckets have negligible impact on the model state
      - 3 buckets (2 bits) and 7 buckets (3 bits) can hold most of the significant bits
      [Figure panels: 2 bits, 3 bits]

  17. Overhead
      • Each iteration costs 91 seconds on average
      • A failure occurs at the 7th iteration
      • LC checkpoint saves 6 iterations (546 seconds)
      • LC checkpoint has less than 4 seconds of overhead
      [Figure: timeline marking the 6 saved iterations and the point where the failure occurs.]

  18. Conclusion
      – Propose an important research question: how to compress checkpoints
      – Characterize a family of compression schemes for tracking the learning process
      – Design a lossy coding scheme to compress checkpoints
      – Optimize the training system to achieve low-overhead checkpointing
      – Achieve a compression rate of up to 28x and a recovery speedup of up to 5.77x over the state-of-the-art algorithms
      Thank you for your attention!
      ychen39@email.wm.edu

  19. Checkpoint for ML applications
      • Classic checkpoint mechanism
        – Save model state periodically
        – Partially save model state for faster recovery
      • Key technical challenge
        – Frequent checkpointing is costly for IO and storage
      • How can we compress model checkpoints?
        – Maximize the compression rate
        – The scheme needs to be optimized for ML applications
      Delta encoding scheme with lossy compression
