 
Fast Software-managed Code Decompression
Charles Lefurgy and Trevor Mudge
Advanced Computer Architecture Laboratory
Electrical Engineering and Computer Science Dept.
The University of Michigan, Ann Arbor
Compiler and Architecture Support for Embedded Systems (CASES)
October 1-3, 1999
Motivation
• Problem: embedded code size
  – Constraints: cost, area, and power
  – Fit program in on-chip memory
  – Compilers vs. hand-coded assembly
• Solution: code compression
  – Reduce compiled code size
  – Take advantage of instruction repetition
• Benefits
  – On-chip memory used more effectively
  – Trade off performance for code density
  – Systems use cheaper processors with smaller on-chip memories
[Figure: two embedded systems, each with CPU, RAM, ROM, and I/O; one holds the original program, the other the compressed program.]
Hardware or software decompression?
• Hardware
  – Faster translation
  – CodePack, MIPS-16, Thumb
• Software
  – Smaller physical area
  – Lower cost
  – Quicker re-targeting to new compression algorithms
  – Rivals HW solutions on some (loopy) benchmarks
Kirovski et al., 1997
• Overview
  – Procedure Compression
  – Decompress and execute 1 procedure at a time
  – Store decompressed code in procedure cache
  – Cache management
• Results
  – 60% compression ratio on SPARC
  – 166% execution penalty with 64KB procedure cache
[Figure: HLL procedures F() and G() are compiled to native code and then LZ-compressed; at run time a decompressor and a P-cache manager handle the compressed procedures.]
Dictionary compression algorithm
• Dictionary contains unique instructions
• Replace program instructions with short index
[Figure: the original program's .text segment holds 32-bit instructions (e.g. Add r1,r2,r3 and repeated Add r1,r2,r4); the compressed program's .text segment holds 16-bit indices (e.g. 5 and 30) into a .dictionary segment of unique 32-bit instructions.]
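The scheme on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tool: the function names, the first-seen index order, and the two instruction encodings are assumptions made for the example.

```python
def compress(instructions):
    """Replace each 32-bit instruction word with an index into a
    dictionary of unique instruction words."""
    dictionary = []   # unique instruction words, in first-seen order (assumed order)
    index_of = {}     # instruction word -> its position in `dictionary`
    indices = []      # one index per original instruction
    for word in instructions:
        if word not in index_of:
            index_of[word] = len(dictionary)
            dictionary.append(word)
        indices.append(index_of[word])
    if len(dictionary) > 1 << 16:
        raise ValueError("more unique instructions than a 16-bit index can address")
    return dictionary, indices


def decompress(dictionary, indices):
    """Recover the original instruction stream by looking up every index."""
    return [dictionary[i] for i in indices]


# Hypothetical encodings standing in for "Add r1,r2,r3" and "Add r1,r2,r4".
ADD_R1_R2_R3 = 0x00221820
ADD_R1_R2_R4 = 0x00222020

program = [ADD_R1_R2_R3, ADD_R1_R2_R4, ADD_R1_R2_R4, ADD_R1_R2_R4]
dictionary, indices = compress(program)
```

Here the repeated instruction is stored once in the dictionary and referenced three times from the index stream, which is where the size saving comes from.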
Compression ratio
• compression ratio = compressed size / original size (smaller is better)
• Compression ratios
  – Dictionary: 65% - 82%
  – LZRW1: 55% - 63%

Benchmark   Original size   Dictionary compression   LZRW1 compression
cc1         1,083,168       65.4%                    60.4%
vortex        495,248       65.8%                    55.5%
go            310,576       69.6%                    63.9%
perl          267,568       73.7%                    60.2%
ijpeg         198,272       77.2%                    61.5%
mpeg2enc      119,600       82.5%                    60.5%
pegwit         88,800       79.5%                    56.7%
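As a worked example of the ratio, the sketch below estimates the dictionary scheme's compression ratio from instruction counts. It assumes the 32-bit instructions and 16-bit indices shown on the dictionary slide, and it ignores the size of the decompressor itself; the function name and the sample counts are hypothetical.

```python
INSTRUCTION_BYTES = 4  # one 32-bit instruction
INDEX_BYTES = 2        # one 16-bit dictionary index


def dictionary_compression_ratio(total_instructions, unique_instructions):
    """compressed size / original size for the dictionary scheme."""
    original = total_instructions * INSTRUCTION_BYTES
    compressed = (total_instructions * INDEX_BYTES            # index stream (.text)
                  + unique_instructions * INSTRUCTION_BYTES)  # .dictionary segment
    return compressed / original


# Hypothetical program: 1000 instructions, 300 of them unique.
ratio = dictionary_compression_ratio(1000, 300)  # (2000 + 1200) / 4000 = 0.8
```

Under these assumptions the ratio can never drop below 50%, since every instruction still costs a 16-bit index. That fits the reported dictionary ratios all lying well above that floor.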
Decompression code
• Simple
  – Small static code size: 25 instructions
• Fast
  – Less than 3 instructions per output byte
  – 74 dynamic instructions per decompressed cache line
• Algorithm
  – Invoke decompressor on L1 I-cache miss
  – Decompress 1 complete cache line
  – For each instruction in cache line
    • Read index
    • Reference dictionary with index to get instruction
    • Put instruction in I-cache
• HW Support
  – L1-cache miss exception
  – Write into I-cache
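The algorithm steps on this slide can be modelled roughly as follows. This is an illustrative Python model, not the 25-instruction handler itself. The 32-byte line size comes from the simulation-environment slide; the function name, the dict used as an I-cache, and the assumption that the compressed text starts at address 0 are made up for the example.

```python
LINE_BYTES = 32                          # I-cache line size used in the simulations
WORD_BYTES = 4                           # one 32-bit instruction
WORDS_PER_LINE = LINE_BYTES // WORD_BYTES


def handle_icache_miss(miss_addr, indices, dictionary, icache):
    """Decompress the complete cache line that contains miss_addr.

    Assumes the program starts at address 0, so instruction number i
    lives at byte address i * WORD_BYTES.
    """
    line_addr = miss_addr - (miss_addr % LINE_BYTES)
    first = line_addr // WORD_BYTES      # instruction number of the line's first word
    line = []
    for i in range(first, first + WORDS_PER_LINE):
        index = indices[i]               # read index
        line.append(dictionary[index])   # reference dictionary to get the instruction
    icache[line_addr] = line             # stands in for the "write into I-cache" support
    return line


# Hypothetical three-entry dictionary and a two-line compressed program.
dictionary = [0xA, 0xB, 0xC]
indices = [0, 1, 2, 1, 0, 2, 2, 1,
           1, 1, 1, 1, 0, 0, 0, 0]
icache = {}
line = handle_icache_miss(0x24, indices, dictionary, icache)
```

A miss at address 0x24 falls in the second line (base address 32), so all eight instructions of that line are decompressed and written, regardless of which word missed.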
Optimizations
• Partial decompression
  – decompress from missed instruction to end of cache line
  – use a valid bit per word in cache line to mark instructions at beginning of line as invalid
  – avoids decompressing instructions that may not be executed
  – up to 12% speedup
• Second register file
  – Many embedded processors have an additional register file
  – Avoid save/restore of registers when decompressor runs
  – 2nd register file with partial decompression: up to 16% speedup
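Partial decompression can be illustrated with a small variation on a line-fill routine. This is a sketch under assumptions: 32-byte lines as in the simulation setup, compressed text starting at address 0, and a Python list of booleans standing in for the per-word valid bits. All names are hypothetical.

```python
LINE_BYTES = 32
WORD_BYTES = 4
WORDS_PER_LINE = LINE_BYTES // WORD_BYTES


def handle_icache_miss_partial(miss_addr, indices, dictionary, icache):
    """Decompress only from the missed instruction to the end of its line."""
    line_addr = miss_addr - (miss_addr % LINE_BYTES)
    first = line_addr // WORD_BYTES                 # instruction number of word 0
    missed = (miss_addr - line_addr) // WORD_BYTES  # slot of the missed word
    words = [None] * WORDS_PER_LINE
    valid = [False] * WORDS_PER_LINE                # one valid bit per word
    for slot in range(missed, WORDS_PER_LINE):
        words[slot] = dictionary[indices[first + slot]]
        valid[slot] = True
    icache[line_addr] = (words, valid)              # earlier words stay invalid
    return words, valid


# Hypothetical dictionary and one compressed cache line.
dictionary = [0xA, 0xB, 0xC]
indices = [0, 1, 2, 1, 0, 2, 2, 1]
icache = {}
words, valid = handle_icache_miss_partial(0x0C, indices, dictionary, icache)
```

With a miss at byte 0x0C, only the last five words of the line are decompressed. How a later fetch of one of the invalid leading words is handled is not spelled out on the slide; presumably it misses again.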
Simulation environment
• SimpleScalar
  – Modified to support compression
• 5 stage, in-order pipeline
  – Simple embedded processor
• D-cache
  – 8KB, 16B lines, 2-way
• I-cache
  – 1 to 64KB, 32B lines, 2-way
• Memory
  – 10 cycle latency, 2 cycle rate
Performance: cc1
[Figure: slowdown relative to native code (y-axis, 0 to 6) versus I-cache size (x-axis, 1KB to 64KB) for compressed, partial, partial+regfile, and native.]
Performance: ijpeg
[Figure: slowdown relative to native code (y-axis, 0 to 6) versus I-cache size (x-axis, 1KB to 64KB) for compressed, partial, partial+regfile, and native.]
Performance summary
• Data from CINT95, MediaBench with several cache sizes
• Control slowdown by optimizing I-cache miss ratio
  – Code layout may help
[Figure: slowdown relative to native code (y-axis, 0 to 6) versus I-cache miss ratio (x-axis, 0% to 15%) for compressed, partial, and partial+regfile.]
Performance summary, cont.
• Magnification of previous graph
• Slowdown under 3x when I-miss ratio is under 2%
• Slowdown under 2x when I-miss ratio is under 1%
[Figure: slowdown relative to native code (y-axis, 0 to 4) versus I-cache miss ratio (x-axis, 0.0% to 3.0%) for compressed, partial, and partial+regfile.]
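A back-of-envelope check makes these thresholds plausible. The sketch below is not the authors' model and not a measured result. It assumes the miss ratio is counted per instruction fetched, charges each miss the 74 dynamic instructions per decompressed cache line quoted on the decompression-code slide, and ignores exception entry/exit, register save/restore, and memory latency. The function name is hypothetical.

```python
INSTRUCTIONS_PER_MISS = 74  # dynamic instructions per decompressed cache line


def instruction_count_ratio(miss_ratio):
    """Estimated dynamic instructions executed per native instruction.

    Every native instruction still executes once; a fraction `miss_ratio`
    of fetches additionally runs the decompressor.
    """
    return 1.0 + miss_ratio * INSTRUCTIONS_PER_MISS


at_one_percent = instruction_count_ratio(0.01)  # 1 + 0.74 = 1.74
at_two_percent = instruction_count_ratio(0.02)  # 1 + 1.48 = 2.48
```

Under these assumptions the estimate stays below 2x at a 1% miss ratio and below 3x at 2%. That agrees with the bounds on this slide, although the real slowdown also depends on the costs left out here.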
Conclusions
• Line-based decompression beats procedure-based
  – use normal cache as decompression buffer
  – no fragmentation management as in procedure-based decompression
  – order of magnitude performance difference
  – A previous decompressor with procedure granularity had 100x slowdown on gcc and go [Kirovski97]
• Compressed code fills the gap between native and interpreted code
  – has quick execution of native code
  – has small size of interpreted code
Web page
http://www.eecs.umich.edu/~tnm/compress