On Efficient Constructions of Checkpoints



SLIDE 1

On Efficient Constructions of Checkpoints

Yu Chen, Zhenming Liu, Bin Ren and Xin Jin

SLIDE 2

Checkpoint for ML applications

    # ckpt, iterator, opt and train_step are defined elsewhere in the program.
    def train_and_checkpoint(net, manager):
        # Recovery from checkpoint: restore the latest saved state, if any.
        ckpt.restore(manager.latest_checkpoint)
        if manager.latest_checkpoint:
            print("Restored from {}".format(manager.latest_checkpoint))
        else:
            print("Initializing from scratch.")

        for _ in range(50):
            example = next(iterator)
            loss = train_step(net, example, opt)
            ckpt.step.assign_add(1)
            if int(ckpt.step) % 10 == 0:
                # Save model’s state for recovery (every 10 steps).
                save_path = manager.save()
                print("Saved checkpoint for step {}: {}".format(int(ckpt.step), save_path))
                print("loss {:1.2f}".format(loss.numpy()))
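The same restore-then-save-periodically pattern can be sketched without any ML framework. This is only an illustration: the JSON file format, the `train_with_checkpoints` name and the toy "training step" are assumptions made for the example, not part of the talk.

```python
import json
import os
import tempfile


def train_with_checkpoints(ckpt_path, total_steps=50, every=10):
    """Toy loop: restore if a checkpoint exists, then save every `every` steps."""
    # Recovery from checkpoint, or start from scratch.
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "weight": 0.0}

    while state["step"] < total_steps:
        state["weight"] += 0.5          # stand-in for one training step
        state["step"] += 1
        if state["step"] % every == 0:
            # Write to a temporary file first so that a crash cannot leave
            # a half-written checkpoint behind.
            tmp_path = ckpt_path + ".tmp"
            with open(tmp_path, "w") as f:
                json.dump(state, f)
            os.replace(tmp_path, ckpt_path)
    return state


ckpt_file = os.path.join(tempfile.mkdtemp(), "ckpt.json")
first = train_with_checkpoints(ckpt_file, total_steps=25)    # saves at steps 10 and 20
resumed = train_with_checkpoints(ckpt_file, total_steps=50)  # resumes from step 20
```

The second call picks up from the step-20 checkpoint, so the five steps done after that checkpoint in the first run are repeated; that repeated work is the recovery cost.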

SLIDE 3

Checkpoint for ML applications

  • Application errors
    – Divide by zero
    – Gradient explosion
    – Dead activation
  • System failures
    – Power outages
    – Unstable network
    – Unhealthy disks
  • Cloud computing
    – Spot instance
    – Container rescheduling

SLIDE 4

Checkpoint for ML applications

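[Figure: two training timelines, one with checkpoints cp1 and cp2 and one with checkpoints cp1, cp2, cp3 and cp4, with a point marked "Failure occurs".]

More frequent checkpointing lowers the recovery cost.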
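A small sketch makes the relation concrete. It assumes, purely for illustration, that a checkpoint is written every `interval` iterations and that everything after the last checkpoint has to be redone; the function name and the numbers are hypothetical.

```python
def rework_iterations(failure_step, interval):
    """Iterations to redo when training fails at `failure_step` and a
    checkpoint is written every `interval` iterations."""
    last_checkpoint = (failure_step // interval) * interval
    return failure_step - last_checkpoint


# Shorter intervals leave less work to redo after the same failure.
rework = {interval: rework_iterations(99, interval) for interval in (50, 25, 10)}
print(rework)
```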

SLIDE 5

Checkpoint for ML applications

Frequent checkpointing is costly for I/O and storage.

  • ML & System: Partial checkpoint
  • System: Decrease checkpoint frequency
  • ML & System & Information theory: How can we compress the model checkpoint?

SLIDE 6

Compression

  • Lossless compression
  • Lossy compression
    – Distance-based compression ($\ell_2$)
    – Model compression

How to find the redundant information? How to design a suitable scheme?

SLIDE 7

Design

  • Design principles
    – Minimize irritation to SGD
    – Maximize redundancies in residual information
  • Two key components
    – Approximate tracking by delta-coding
    – Quantization and Huffman coding

SLIDE 8

Approximate tracking by delta-coding.

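[Figure: the sequence of model states $u_0, \dots, u_{t-1}, u_t$, each paired with its tracked approximation $\tilde{u}_0, \dots, \tilde{u}_{t-1}, \tilde{u}_t$.]

$$\tilde{u}_t = u_0 + \sum_{i \le t} \tilde{\delta}_i$$

$$\delta_t = u_t - \tilde{u}_{t-1}$$

$$\tilde{\delta}_t = f(\delta_t)$$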
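The tracking scheme can be sketched in a few lines. The lossy function `f` used here (rounding to a fixed grid) is only a stand-in chosen for the example, and the random updates stand in for training steps; neither comes from the talk. Because each delta is taken against the previous tracked state rather than the previous true state, the error of one step is corrected in the next and does not accumulate.

```python
import random


def lossy(delta, step=0.05):
    """Stand-in for f: round every entry to the nearest multiple of `step`."""
    return [round(d / step) * step for d in delta]


random.seed(0)
u = [0.0] * 4              # true model state u_0
u_tilde = list(u)          # tracked state, starts at u_0 exactly
max_err = 0.0
for t in range(1, 101):
    u = [x + random.gauss(0.0, 0.1) for x in u]               # one training update
    delta = [x - y for x, y in zip(u, u_tilde)]               # delta_t = u_t - u~_{t-1}
    u_tilde = [y + d for y, d in zip(u_tilde, lossy(delta))]  # u~_t = u~_{t-1} + f(delta_t)
    max_err = max(max_err, max(abs(x - y) for x, y in zip(u, u_tilde)))

# The tracking error never exceeds the rounding error of a single step.
print(max_err <= 0.025 + 1e-9)
```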

SLIDE 9

Quantization and Huffman coding

  • Two-stage quantization
    – Exponent-based quantization
    – Priority promotion
  • Huffman coding

[Figure: exponent-based quantization. The input values 0.76, -0.48, 0.2, 0.07, 0.18, 0.49, 0.14, 0.39, 0.82, 0.09 are grouped into buckets by exponent e and sign s; each bucket gets one representative value and a 3-bit code.]

Bucket       Representative   Code
e=-1, s=0    0.79             000
e=-2, s=0    0.44             001
e=-2, s=1    -0.48            010
e=-3, s=0    0.17             011
e=-4, s=0    0.08             100
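A minimal sketch of the bucketing step, run on the values from the figure. It assumes that the exponent of a value x is floor(log2(|x|)) and that each bucket is represented by the mean of its members; the slide does not state the rule for the representative, so the mean is an assumption that happens to reproduce the numbers shown. The function name is hypothetical.

```python
import math
from collections import defaultdict


def exponent_buckets(values):
    """Group non-zero values by (exponent, sign), exponent = floor(log2(|x|))."""
    buckets = defaultdict(list)
    for x in values:
        if x == 0:
            continue
        # frexp returns (m, p) with |x| = m * 2**p and 0.5 <= m < 1,
        # so floor(log2(|x|)) equals p - 1.
        exponent = math.frexp(abs(x))[1] - 1
        sign = 1 if x < 0 else 0
        buckets[(exponent, sign)].append(x)
    return buckets


values = [0.76, -0.48, 0.2, 0.07, 0.18, 0.49, 0.14, 0.39, 0.82, 0.09]
buckets = exponent_buckets(values)
# Assumption: the representative of a bucket is the mean of its members.
reps = {key: round(sum(v) / len(v), 2) for key, v in buckets.items()}
for (e, s), rep in sorted(reps.items(), key=lambda kv: (-kv[0][0], kv[0][1])):
    print(f"e={e}, s={s}: {rep}")
```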

SLIDE 10

Quantization and Huffman coding

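  • Two-stage quantization
    – Exponent-based quantization
    – Priority promotion
  • Huffman coding

[Figure: the pipeline Exponent-based quantization → Priority promotion → Huffman encoding → Checkpoint saving, applied to the input values 0.76, -0.48, 0.2, 0.07, 0.18, 0.49, 0.14, 0.39, 0.82, 0.09. Exponent-based quantization produces the representatives 0.79, 0.44, -0.48, 0.17, 0.08 with codes 000, 001, 010, 011, 100. After priority promotion only 0.79, 0.44 and -0.48 remain, and the result is Huffman-encoded before the checkpoint is saved.]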
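The last two stages can be sketched on the same input. Two assumptions are made here and are not confirmed by the slides: with b bits, priority promotion keeps the 2^b - 1 buckets with the largest exponents and maps every other value to zero (consistent with the "3 buckets (2 bits), 7 buckets (3 bits)" figures elsewhere in the deck), and each kept bucket is represented by the mean of its members. The Huffman routine is the textbook construction, not necessarily the authors' implementation, and all function names are hypothetical.

```python
import heapq
import math
from collections import Counter


def bucket_key(x):
    """(exponent, sign) of a non-zero x, with exponent = floor(log2(|x|))."""
    return (math.frexp(abs(x))[1] - 1, 1 if x < 0 else 0)


def priority_promotion(values, bits):
    """Keep the 2**bits - 1 largest-exponent buckets; map all other values to 0.0."""
    groups = {}
    for x in values:
        groups.setdefault(bucket_key(x), []).append(x)
    kept = sorted(groups, key=lambda k: (-k[0], k[1]))[: 2 ** bits - 1]
    reps = {k: round(sum(groups[k]) / len(groups[k]), 2) for k in kept}
    return [reps.get(bucket_key(x), 0.0) for x in values]


def huffman_code(symbols):
    """Build a prefix code in which frequent symbols get shorter code words."""
    heap = [(n, i, {sym: ""}) for i, (sym, n) in enumerate(Counter(symbols).items())]
    heapq.heapify(heap)
    tie = len(heap)  # unique tie-breaker so the dicts are never compared
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)
        n2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]


values = [0.76, -0.48, 0.2, 0.07, 0.18, 0.49, 0.14, 0.39, 0.82, 0.09]
promoted = priority_promotion(values, bits=2)
code = huffman_code(promoted)
encoded = "".join(code[v] for v in promoted)
print(promoted)
print(len(encoded), "bits instead of", 2 * len(promoted))
```

Because zero is by far the most frequent symbol after promotion, it receives the shortest code word, which is where the extra gain from Huffman coding comes from in this toy example.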

SLIDE 11

Design

  • System optimization
    – Asynchronous execution
    – Checkpoint merging
    – Huffman code table caching

SLIDE 12

Evaluation

  • Models
    – Logistic Regression
    – LeNet-5
    – AlexNet
    – Matrix Factorization
  • Objectives
    – Compare the recovery cost with previous work
    – Evaluate the compression benefit brought by the different approaches
    – Validate the effectiveness of priority promotion
    – Confirm the low overhead
  • Datasets
    – MNIST
    – Fashion-MNIST
    – Jester
    – MovieLens 10M

SLIDE 13

Recovery cost comparison

  • Outperforms SCAR by 2.88x-5.77x and TOPN by 2.17x-4.06x at 5% checkpoint size
  • Outperforms SCAR by 1.9x-4.82x and TOPN by 1.52x-2.17x at 10% checkpoint size
  • LC-checkpoint's rework cost stays more stable as the checkpoint size decreases

[Figure annotation: 5.37x]

SLIDE 14

Recovery cost comparison

  • Outperforms SCAR by 2.88x-5.77x and TOPN by 2.17x-4.06x at 5% checkpoint size
  • Outperforms SCAR by 1.9x-4.82x and TOPN by 1.52x-2.17x at 10% checkpoint size
  • LC-checkpoint's rework cost stays more stable as the checkpoint size decreases

[Figure annotation: 4.4x]

SLIDE 15

Compression effect breakdown

  • E: Exponent-based quantization
  • P: Priority promotion
  • H: Huffman coding
  • E yields a compression ratio of 85% on average
  • P adds 9.26% to the compression ratio on average with 2 bits and 6.23% with 3 bits
  • H adds a further 2% with 2-bit priority promotion and 1.6% with 3-bit priority promotion
  • P with fewer bits yields more benefit for H

[Figure annotations: 85.47%, 93.73%, 95.87%]

SLIDE 16

The effectiveness of priority promotion

  • X-axis: the ID of the exponent bucket that is removed from $\delta_m$
  • Y-axis: relative error computed with the loss function (lower is better)
  • $u_{t+m}$ is rebuilt as $u_t + \delta_m$
  • Smaller exponent buckets have negligible impact on the model state
  • 3 buckets (2 bits) and 7 buckets (3 bits) can hold most of the significant bits

[Figure panels: 2 bits, 3 bits]

SLIDE 17

Overhead

  • Each iteration costs 91 seconds on average
  • A failure occurs at the 7th iteration
  • LC-checkpoint saves 6 iterations (546 seconds)
  • LC-checkpoint has less than 4 seconds of overhead

[Figure: training timeline with the point marked "Failure occurs" and the saved span labeled "6 iterations".]
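The numbers on this slide fit together as follows. The sketch assumes that the six saved iterations are the ones completed before the failure at iteration 7, and it treats the 4-second bound as the total overhead; the slide does not say whether that bound is total or per checkpoint, so the ratio is illustrative only.

```python
iteration_seconds = 91                      # average cost of one iteration
failure_iteration = 7                       # the failure occurs here
saved_iterations = failure_iteration - 1    # iterations that need no rework
saved_seconds = saved_iterations * iteration_seconds

overhead_bound_seconds = 4                  # assumed to be the total overhead
overhead_ratio = overhead_bound_seconds / saved_seconds

print(saved_seconds)                        # 546
print(f"overhead is below {overhead_ratio:.1%} of the time saved")
```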

SLIDE 18

Conclusion

– Propose an important research question: how to compress checkpoints
– Characterize a family of compression schemes for tracking the learning process
– Design a lossy coding scheme to compress checkpoints
– Optimize the training system to achieve low-overhead checkpointing
– Achieve a compression rate of up to 28x and a recovery speedup of up to 5.77x over the state-of-the-art algorithms

Thank you for your attention!

ychen39@email.wm.edu

SLIDE 19

Checkpoint for ML applications

  • Classic checkpoint mechanism
    – Save model state periodically
    – Partially save model state for faster recovery
  • Key technical challenge
    – Frequent checkpointing is costly for I/O and storage
  • How can we compress the model checkpoint?
    – Maximize the compression rate
    – The scheme needs to be optimized for ML applications

Delta encoding scheme with lossy compression