Reducing Code Size with Run-time Decompression
Charles Lefurgy, Eva Piccininni, and Trevor Mudge
Advanced Computer Architecture Laboratory
Electrical Engineering and Computer Science Dept.
The University of Michigan, Ann Arbor
2
Motivation
[Figure: Embedded systems. Original: program ROM, RAM, I/O, CPU. Compressed: compressed-program ROM, RAM, I/O, CPU.]
- Problem: embedded code size
– Constraints: cost, area, and power
– Fit program in on-chip memory
– Compilers vs. hand-coded assembly
- Portability
- Development costs
– Code bloat
- Solution: code compression
– Reduce compiled code size
– Take advantage of instruction repetition
- Implementation
– Hardware or software?
– Code size?
– Execution speed?
3
Software decompression
- Previous work
– Decompression unit: whole program [Taunton91]
- No memory savings
– Decompression unit: procedures [Kirovski97][Ernst97]
- Requires large decompression memory
- Fragmentation of decompression memory
- Slow
- Our work
– Decompression unit: 1 or 2 cache lines
– High performance focus
– New profiling method
4
Dictionary compression algorithm
- Goal: fast decompression
- Dictionary contains unique instructions
- Replace program instructions with short index
Original program (.text segment, 32 bits per instruction):
    lw r2,r3
    lw r2,r3
    lw r15,r3
    lw r15,r3
    lw r15,r3
Compressed program (.text segment contains indices, 16 bits per index):
    5 5 30 30 30
.dictionary segment (32 bits per entry):
    lw r2,r3
    lw r15,r3
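The substitution on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tool: the function names `compress` and `decompress` are hypothetical, instructions are plain strings, and indices are assigned in order of first appearance, so they come out as 0 and 1 rather than the 5 and 30 shown on the slide (where the dictionary belongs to a larger program).

```python
def compress(instructions):
    """Return (dictionary, indices): unique instructions plus one index per instruction."""
    dictionary = []          # unique instructions, in order of first appearance
    position = {}            # instruction -> its index in the dictionary
    indices = []             # the compressed .text segment
    for insn in instructions:
        if insn not in position:
            position[insn] = len(dictionary)
            dictionary.append(insn)
        indices.append(position[insn])
    return dictionary, indices


def decompress(dictionary, indices):
    """Rebuild the original instruction stream from the indices."""
    return [dictionary[i] for i in indices]


program = ["lw r2,r3", "lw r2,r3", "lw r15,r3", "lw r15,r3", "lw r15,r3"]
dictionary, indices = compress(program)
restored = decompress(dictionary, indices)
```

Decompression is a single table lookup per instruction, which is consistent with the slide's stated goal of fast decompression.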
5
Decompression
- Algorithm
- 1. I-cache miss invokes decompressor (exception handler)
- 2. Fetch index
- 3. Fetch dictionary word
- 4. Place instruction in I-cache (special instruction)
- Write directly into I-cache
- Decompressed instructions only exist in I-cache
[Figure: Processor with D-cache and I-cache; memory holds the dictionary and the indices. Example: index 5 and instruction add r1,r2,r3.]
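The four steps of the decompression algorithm can be mimicked with a toy model. This is a sketch under stated assumptions, not the authors' exception handler: `ICache`, `miss_handler`, and `fetch` are hypothetical names, the cache is unbounded with no eviction, one 32-byte line (8 instructions) is decompressed per miss, and a dictionary assignment stands in for the special instruction that writes into the I-cache.

```python
LINE_BYTES = 32                       # I-cache line size used in the simulations
INSN_BYTES = 4                        # 32-bit instructions
INSNS_PER_LINE = LINE_BYTES // INSN_BYTES


class ICache:
    """Toy I-cache: unbounded, no eviction (a simplification)."""

    def __init__(self):
        self.lines = {}               # line number -> decompressed instructions
        self.misses = 0


def miss_handler(icache, line_no, indices, dictionary):
    """Steps 2-4: rebuild one cache line from the indices and the dictionary."""
    start = line_no * INSNS_PER_LINE
    line = []
    for index in indices[start:start + INSNS_PER_LINE]:   # step 2: fetch index
        line.append(dictionary[index])                    # step 3: fetch dictionary word
    icache.lines[line_no] = line                          # step 4: place line in I-cache


def fetch(icache, pc, indices, dictionary):
    """Fetch the instruction at native address pc, decompressing on a miss."""
    line_no, offset = divmod(pc // INSN_BYTES, INSNS_PER_LINE)
    if line_no not in icache.lines:                       # step 1: miss invokes handler
        icache.misses += 1
        miss_handler(icache, line_no, indices, dictionary)
    return icache.lines[line_no][offset]


dictionary = ["lw r2,r3", "lw r15,r3", "add r1,r2,r3"]
indices = [0, 0, 1, 1, 1, 2, 2, 0, 1, 2]                  # 10 instructions, 2 cache lines
icache = ICache()
first = fetch(icache, 0, indices, dictionary)             # miss on line 0
sixth = fetch(icache, 20, indices, dictionary)            # hit on line 0
ninth = fetch(icache, 32, indices, dictionary)            # miss on line 1
```

In this model the decompressed instructions live only in `icache.lines`, mirroring the point that decompressed code exists only in the I-cache.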
6
CodePack
- Overview
– IBM
– PowerPC
– First system with instruction stream compression
– Decompress during I-cache miss
- Software CodePack
                          Dictionary        CodePack
Codewords (indices)       Fixed-length      Variable-length
Decompress granularity    1 cache line      2 cache lines
Decompression overhead    75 instructions   1120 instructions
7
Compression ratio
- compression ratio = compressed size / original size
– CodePack: 55% - 63%
– Dictionary: 65% - 82%
[Figure: Compression ratio (0% to 100%) of Dictionary and CodePack for cc1, ghostscript, go, ijpeg, mpeg2enc, pegwit, perl, vortex]
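As a worked instance of the ratio, take the five-instruction example from the dictionary compression slide: five 32-bit instructions become five 16-bit indices plus a two-entry dictionary of 32-bit words. The sketch below assumes the dictionary is counted as part of the compressed size; the function name `compression_ratio` is hypothetical.

```python
def compression_ratio(compressed_bytes, original_bytes):
    """Smaller is better: 1.0 means no savings."""
    return compressed_bytes / original_bytes


original_bytes = 5 * 4                 # five 32-bit instructions
compressed_bytes = 5 * 2 + 2 * 4       # five 16-bit indices + two 32-bit dictionary entries
ratio = compression_ratio(compressed_bytes, original_bytes)
```

Here the ratio is 18 / 20, or 90%. A program this small has too little repetition to show much benefit; the 65% - 82% range reported for the dictionary scheme comes from full benchmarks.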
8
Simulation environment
- SimpleScalar
- Pipeline: 5 stage, in-order
- I-cache: 16KB, 32B lines, 2-way
- D-cache: 8KB, 16B lines, 2-way
- Memory: 10 cycle latency, 2 cycle rate
9
Performance
- CodePack: very high overhead
- Reduce overhead by reducing cache misses
[Figure: Go benchmark. Slowdown relative to native code vs. I-cache size (4KB, 16KB, 64KB) for CodePack, Dictionary, and Native]
10
Cache miss
- Control slowdown by optimizing I-cache miss ratio
[Figure: Slowdown relative to native code vs. I-cache miss ratio (0% to 8%) for CodePack and Dictionary with 4KB, 16KB, and 64KB I-caches]
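A back-of-the-envelope model shows why the miss ratio dominates the slowdown. This is a rough sketch, not the authors' simulator: it assumes one cycle per instruction, a 10-cycle native miss penalty (the memory latency from the simulation slide), and a software handler that costs one cycle per handler instruction (75 for Dictionary, 1120 for CodePack), ignoring the handler's own memory stalls. The function name `estimated_slowdown` is hypothetical.

```python
def estimated_slowdown(miss_ratio, handler_instructions, native_miss_cycles=10):
    """Cycles per instruction with software decompression, relative to native code."""
    native_cpi = 1 + miss_ratio * native_miss_cycles          # hardware line fill
    compressed_cpi = 1 + miss_ratio * handler_instructions    # software handler per miss
    return compressed_cpi / native_cpi


dictionary_slowdown = estimated_slowdown(0.02, 75)     # 2% miss ratio, Dictionary handler
codepack_slowdown = estimated_slowdown(0.02, 1120)     # 2% miss ratio, CodePack handler
no_miss_slowdown = estimated_slowdown(0.0, 1120)       # no misses, no overhead
```

Under these assumptions the slowdown is 1.0 with no misses and grows with the miss ratio in proportion to the handler length, so a larger cache or better code placement helps CodePack far more than Dictionary.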
11
Selective compression
- Hybrid programs
– Only compress some procedures
– Trade size for speed
– Avoid decompression overhead
- Profile methods
– Count dynamic instructions
- Example: Thumb
- Use when compressed code has more instructions
- Reduce number of executed instructions
– Count cache misses
- Example: CodePack
- Use when compressed code has longer cache miss latency
- Reduce cache miss latency
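A cache-miss-driven selection policy could look like the sketch below. It is an illustration under assumptions, not the authors' exact method: procedures are ranked by profiled I-cache miss count, and the highest-ranked ones are left as native code until a byte budget is used up. The function name `select_native`, the budget parameter, and the profile data are all hypothetical.

```python
def select_native(procedures, native_budget_bytes):
    """Pick procedures to leave uncompressed.

    procedures: list of (name, size_bytes, cache_misses) from a profiling run.
    Returns the set of procedure names kept as native code.
    """
    ranked = sorted(procedures, key=lambda p: p[2], reverse=True)
    native = set()
    used = 0
    for name, size, misses in ranked:
        if misses == 0:
            break                      # procedures that never miss gain nothing
        if used + size <= native_budget_bytes:
            native.add(name)
            used += size
    return native


profile = [
    ("loop_kernel", 400, 9000),
    ("init", 1200, 50),
    ("parse", 800, 3000),
    ("unused", 500, 0),
]
chosen = select_native(profile, native_budget_bytes=1300)
```

Swapping the sort key for a dynamic instruction count would give the other profile method listed above, which suits schemes such as Thumb where compressed code executes more instructions.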
12
Cache miss profiling
- Cache miss profile reduces overhead by 50%
- Loop-oriented benchmarks benefit most
– Approach performance of native code
[Figure: Pegwit (encryption). Slowdown relative to native code (1.00 to 1.12) vs. compression ratio (60% to 100%) for CodePack: dynamic instructions and CodePack: cache miss]
13
CodePack vs. Dictionary
- More compression may have better performance
– CodePack has smaller size than Dictionary compression
– Even with some native code, CodePack is smaller
– CodePack is faster due to using more native code
[Figure: Ghostscript. Slowdown relative to native code (0.5 to 4.0) vs. compression ratio (60% to 100%) for CodePack: cache miss and Dictionary: cache miss]
14
Conclusions
- High-performance SW decompression possible
– Dictionary faster than CodePack, but 5-25% compression ratio difference
– Hardware support
- I-cache miss exception
- Store-instruction instruction (writes an instruction into the I-cache)
- Tune performance by reducing cache misses
– Cache size
– Code placement
- Selective compression
– Use cache miss profile for loop-oriented benchmarks
- Code placement affects decompression overhead
– Future: unify code placement and compression
15