Evaluation of a High Performance Code Compression Method
Charles Lefurgy, Eva Piccininni, and Trevor Mudge
Advanced Computer Architecture Laboratory
Electrical Engineering and Computer Science Dept.
The University of Michigan, Ann Arbor
MICRO-32, November 16-18, 1999

Motivation
• Problem: embedded code size
  – Constraints: cost, area, and power
  – Fit program in on-chip memory
  – Compilers vs. hand-coded assembly
    • Portability
    • Development costs
  – Code bloat
• Solution: code compression
  – Reduce compiled code size
  – Take advantage of instruction repetition
  – Systems use cheaper processors with smaller on-chip memories
• Implementation
  – Code size?
  – Execution speed?
[Figure: Embedded Systems. Block diagrams of CPU, RAM, ROM, and I/O, one holding the original program and one holding the compressed program.]

CodePack
• Overview
  – IBM
  – PowerPC instruction set
  – First system with instruction stream compression
  – 60% compression ratio, ± 10% performance [IBM]
    • performance gain due to prefetching
• Implementation
  – Binary executables are compressed after compilation
  – Compression dictionaries tuned to application
  – Decompression occurs on L1 cache miss
    • L1 caches hold decompressed data
    • Decompress 2 cache lines at a time (16 insns)
  – PowerPC core is unaware of compression

CodePack encoding
• 32-bit insn is split into 2 16-bit words
• Each 16-bit word compressed separately

Encoding for upper 16 bits        Entries
  00  xxx                         8
  01  xxxxx                       32
  100 xxxxxx                      64
  101 xxxxxxx                     128
  110 xxxxxxxx                    256
  111 xxxxxxxxxxxxxxxx            escape, 16 raw bits

Encoding for lower 16 bits        Entries
  00                              1 (encodes zero)
  01  xxxx                        16
  100 xxxxx                       23
  101 xxxxxxx                     128
  110 xxxxxxxx                    256
  111 xxxxxxxxxxxxxxxx            escape, 16 raw bits

Legend: leading 0/1 bits are the tag (111 is the escape), x bits are the index or, after an escape, the raw bits.
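The tag scheme for the upper 16 bits can be sketched as a small prefix decoder. This is only an illustration under stated assumptions: the tag and index widths are read off the slide's table, while the single flat dictionary with per-class base offsets, the placeholder dictionary contents, and the names `UPPER_CLASSES` and `decode_half` are hypothetical. It also places each index directly after its tag, whereas the decompression slide shows tags grouped ahead of indices.

```python
# Sketch of decoding one upper 16-bit half-word from a string of '0'/'1' bits.
# Assumption: one flat dictionary in which each tag class starts at a base offset.

# tag -> (number of index bits, base offset into the dictionary)
UPPER_CLASSES = {
    "00": (3, 0),      # 8 entries
    "01": (5, 8),      # 32 entries
    "100": (6, 40),    # 64 entries
    "101": (7, 104),   # 128 entries
    "110": (8, 232),   # 256 entries
}
RAW_TAG = "111"        # escape: the next 16 bits are the raw half-word


def decode_half(bits, pos, dictionary):
    """Decode one half-word starting at bit offset pos.

    Returns (16-bit value, bit offset just past the codeword).
    """
    if bits.startswith(RAW_TAG, pos):
        start = pos + len(RAW_TAG)
        return int(bits[start:start + 16], 2), start + 16
    for tag, (width, base) in UPPER_CLASSES.items():
        if bits.startswith(tag, pos):
            start = pos + len(tag)
            index = int(bits[start:start + width], 2)
            return dictionary[base + index], start + width
    raise ValueError("no tag matches at this offset")


# Placeholder dictionary: 488 made-up half-words (8 + 32 + 64 + 128 + 256).
demo_dictionary = list(range(0x1000, 0x1000 + 488))
value, next_pos = decode_half("00" + "101", 0, demo_dictionary)
```

Because the tags are prefix-free, the first match is the only possible match, which is what makes it cheap to find codeword boundaries.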

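CodePack decompression
[Figure: decompression flow, in four steps.]
1. L1 I-cache miss address (bit positions 0, 5, 6, 25, 26, 31 marked on the address).
2. Fetch index: look up the index table (in main memory) to get the byte-aligned block address.
3. Fetch compressed instructions: read the compressed bytes (in main memory) of the compression block (16 instructions).
4. Decompress: for each compressed instruction (hi tag, low tag, hi index, low index), look up the high dictionary and the low dictionary to get the high 16 bits and low 16 bits of the native instruction.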
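The address-to-block step and the final recombination can be sketched as follows. This is a minimal illustration, assuming one index-table entry per 16-instruction block and 4-byte native instructions; the names `locate_block`, `combine`, and `text_base` are hypothetical, and the real index table layout may differ.

```python
INSNS_PER_BLOCK = 16          # compression block size from the slide
BYTES_PER_INSN = 4            # 32-bit native instructions
BLOCK_BYTES = INSNS_PER_BLOCK * BYTES_PER_INSN  # 64 bytes of native code per block


def locate_block(miss_addr, text_base, index_table):
    """Map an I-cache miss address to (block number, compressed byte address).

    Assumption: index_table[i] holds the byte-aligned address of block i.
    """
    block = (miss_addr - text_base) // BLOCK_BYTES
    return block, index_table[block]


def combine(high16, low16):
    """Rebuild a native 32-bit instruction from its two decoded halves."""
    return ((high16 & 0xFFFF) << 16) | (low16 & 0xFFFF)


# Made-up index table: compressed blocks start at byte offsets 0, 37, and 80.
block, compressed_addr = locate_block(0x1000 + 130, 0x1000, [0, 37, 80])
insn = combine(0x3860, 0x0001)
```

The index table is needed because compressed blocks have variable length, so a block's location cannot be computed from the miss address alone.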

Compression ratio
• compression ratio = compressed size / original size
• Average: 62%
[Figure: bar chart of compression ratio (0% to 100%) for each benchmark; the benchmark labels are not recoverable.]
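As a quick worked example of the ratio, the sketch below sums the parts of a compressed program and divides by the original size. The byte counts are hypothetical and chosen only to make the arithmetic easy to follow; the function name `compression_ratio` is also an assumption.

```python
def compression_ratio(compressed_size, original_size):
    """Compressed size as a fraction of original size (smaller is better)."""
    if original_size <= 0:
        raise ValueError("original_size must be positive")
    return compressed_size / original_size


# Hypothetical byte counts for the pieces of a compressed program.
parts = {"codewords": 5600, "index_table": 400, "dictionary": 200}
ratio = compression_ratio(sum(parts.values()), 10000)
print(f"{ratio:.0%}")
```

Note that under this definition a ratio of 62% means the compressed program is 38% smaller than the original.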

CodePack programs
• Compressed executable
  – 17%-25% raw bits: not compressed!
    • Includes escape bits
    • Compiler optimizations might help
  – 5% index table
  – 2KB dictionary (fixed cost)
  – 1% pad bits

[Figure: composition of the compressed go program]
  Indices       51%
  Tags          25%
  Raw bits      14%
  Index table    5%
  Escape         3%
  Pad            1%
  Dictionary     1%

I-cache miss timing
• Native code uses critical word first
• Compressed code must be fetched sequentially
• Example shows miss to 5th instruction in cache line
  – 32-bit insns, 64-bit bus
[Figure: timelines from t=0 to t=30 cycles for three cases.
  a) Native code: instruction cache miss, instructions from main memory.
  b) Compressed code: instruction cache miss, index from main memory, codes from main memory, decompressor.
  c) Compressed code + optimizations: instruction cache miss, (A) index from index cache, codes from main memory, (B) 2 decompressors.
  Legend: L1 cache miss, fetch index, fetch instructions (first line), fetch instructions (remaining lines), decompression cycle, critical instruction word.]
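A rough back-of-envelope model of the three cases can make the ordering of costs concrete. It is not the simulator's timing and is not meant to reproduce the figure: it assumes decoding starts only after the needed bits have arrived (no overlap), one instruction per decoder per cycle, and an average compressed instruction size passed in as `bits_per_insn`. The function names and the 20-bit figure in the example are hypothetical; the 10-cycle latency and 2-cycle rate are the memory parameters from the baseline results slide.

```python
import math


def native_critical_cycles(latency):
    """Critical word first: the needed instruction is in the first bus transfer."""
    return latency


def compressed_critical_cycles(latency, rate, bus_bits, bits_per_insn,
                               critical_insn, decoders=1, index_cached=False):
    """Cycles until the critical instruction is available (pessimistic sketch)."""
    # An index cache hit removes the main memory access for the index.
    index_cycles = 0 if index_cached else latency
    # Compressed code is read sequentially up to the critical instruction.
    needed_bits = critical_insn * bits_per_insn
    transfers = math.ceil(needed_bits / bus_bits)
    fetch_cycles = latency + (transfers - 1) * rate
    # Every instruction before the critical one must be decoded first.
    decode_cycles = math.ceil(critical_insn / decoders)
    return index_cycles + fetch_cycles + decode_cycles


# Miss to the 5th instruction, 64-bit bus, assumed 20 bits per compressed insn.
native = native_critical_cycles(10)
plain = compressed_critical_cycles(10, 2, 64, 20, 5)
optimized = compressed_critical_cycles(10, 2, 64, 20, 5,
                                       decoders=2, index_cached=True)
```

Under these assumptions the index fetch is the largest single extra cost, which is consistent with the slides' later finding that the index cache provides the largest benefit.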

Baseline results
• CodePack causes up to 18% performance loss
  – SimpleScalar
  – 4-issue, out-of-order
  – 16 KB caches
  – Main memory: 10 cycle latency, 2 cycle rate
[Figure: instructions per cycle (0.0 to 1.8) for native and CodePack on cc1, go, perl, vortex.]

Optimization A: Index cache
• Remove index table access with a cache
  – A cache hit removes main memory access for index
  – optimized: 64 lines, fully assoc., 4 indices/line (<15% miss ratio)
    • Within 8% of native code
  – perfect: an infinite sized index cache
    • Within 5% of native code
[Figure: speedup over native code (0.0 to 1.2) for CodePack, optimized, and perfect on cc1, go, perl, vortex.]
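A fully associative index cache with the slide's geometry (64 lines, 4 indices per line) can be sketched in a few lines. The LRU replacement policy, the class name `IndexCache`, and the flat `index_table` list are assumptions made for illustration; the slides do not state the replacement policy.

```python
from collections import OrderedDict


class IndexCache:
    """Fully associative cache of index-table entries (LRU is an assumption)."""

    def __init__(self, lines=64, indices_per_line=4):
        self.lines = lines
        self.indices_per_line = indices_per_line
        self.store = OrderedDict()   # line tag -> tuple of cached indices
        self.hits = 0
        self.misses = 0

    def lookup(self, block, index_table):
        """Return the index for a block, filling a whole line on a miss."""
        tag = block // self.indices_per_line
        if tag in self.store:
            self.hits += 1
            self.store.move_to_end(tag)          # mark as most recently used
        else:
            self.misses += 1                     # would cost a main memory access
            start = tag * self.indices_per_line
            self.store[tag] = tuple(index_table[start:start + self.indices_per_line])
            if len(self.store) > self.lines:
                self.store.popitem(last=False)   # evict least recently used line
        return self.store[tag][block % self.indices_per_line]


# Made-up index table of 80 entries and a tiny 2-line cache to show eviction.
table = list(range(0, 800, 10))
cache = IndexCache(lines=2)
results = [cache.lookup(b, table) for b in (0, 1, 4, 8, 0)]
```

Holding several indices per line means one miss also covers the neighbouring blocks, so sequential code can hit in the cache on the following misses.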

Optimization B: More decoders
• Codeword tags enable fast extraction of codewords
  – Enables parallel decoding
• Try adding more decoders for faster decompression
• 2 decoders: performance within 13% of native code
[Figure: speedup over native code (0.0 to 1.0) for CodePack, 2 insn/cycle, 3 insn/cycle, and 16 insn/cycle on cc1, go, perl, vortex.]

Comparison of optimizations
• Index cache provides largest benefit
• Optimizations
  – index cache: 64 lines, 4 indices/line, fully assoc.
  – 2nd decoder
• Speedup over native code: 0.97 to 1.05
• Speedup over CodePack: 1.17 to 1.25
[Figure: speedup over native code (0 to 1.2) for CodePack, index cache, 2nd decoder, and both optimizations on cc1, go, perl, vortex.]

Cache effects
• Cache size controls normal CodePack slowdown
• Optimizations do well on small caches: 1.14 speedup
[Figure: go benchmark. Speedup over native code (0.0 to 1.4) for CodePack and optimized at cache sizes of 1KB, 4KB, 16KB, and 64KB.]

Memory latency
• Optimized CodePack performs better with slow memories
  – Fewer memory accesses than native code
[Figure: go benchmark. Speedup over native code (0.0 to 1.2) for CodePack and optimized at memory latencies of 0.5x, 1x, 2x, 4x, and 8x.]

Memory width
• CodePack provides speedup for small buses
• Optimizations help performance degrade gracefully as bus size increases
[Figure: go benchmark. Speedup over native code (0.0 to 1.2) for CodePack and optimized at bus sizes of 16, 32, 64, and 128 bits.]

Conclusions
• CodePack works for instruction sets other than PowerPC
• Performance can be improved at modest cost
  – Remove decompression overhead: index lookup, dictionary lookup
• Compression can speed up execution
  – Compressed code requires fewer main memory accesses
  – CodePack includes simple prefetching
• Systems that benefit most from compression
  – Narrow buses
  – Slow memories
• Workstations might benefit from compression
  – Fewer L2 misses
  – Less disk access

Web page
http://www.eecs.umich.edu/~tnm/compress
