SLIDE 1

Reducing Code Size with Run-time Decompression

Charles Lefurgy, Eva Piccininni, and Trevor Mudge

Advanced Computer Architecture Laboratory, Electrical Engineering and Computer Science Dept., The University of Michigan, Ann Arbor

High-Performance Computer Architecture (HPCA-6), January 10-12, 2000

SLIDE 2

Motivation

[Figure: two embedded systems, each with CPU, RAM, and I/O; one stores the original program in ROM, the other stores a compressed program in a smaller ROM]

  • Problem: embedded code size
    – Constraints: cost, area, and power
    – Fit program in on-chip memory
    – Compilers vs. hand-coded assembly
      • Portability
      • Development costs
      • Code bloat
  • Solution: code compression
    – Reduce compiled code size
    – Take advantage of instruction repetition
  • Implementation
    – Hardware or software?
    – Code size?
    – Execution speed?

SLIDE 3

Software decompression

  • Previous work
    – Decompression unit: whole program [Taunton91]
      • No memory savings
    – Decompression unit: procedures [Kirovski97][Ernst97]
      • Requires large decompression memory
      • Fragmentation of decompression memory
      • Slow
  • Our work
    – Decompression unit: 1 or 2 cache lines
    – High-performance focus
    – New profiling method

SLIDE 4

Dictionary compression algorithm

  • Goal: fast decompression
  • Dictionary contains unique instructions
  • Replace program instructions with short index

Original program (.text segment, 32-bit instructions):
    lw r2,r3   lw r2,r3   lw r15,r3   lw r15,r3   lw r15,r3

Compressed program (.text segment, 16-bit indices):
    5   5   30   30   30

Dictionary (.dictionary segment, 32-bit instructions):
    5: lw r2,r3    30: lw r15,r3
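The scheme on this slide can be sketched in a few lines of Python (a hypothetical helper, not the authors' tool). A real implementation would pack the 16-bit indices into the .text segment, but the bookkeeping is the same; the dictionary here is built from scratch, so indices start at 0 rather than the slide's 5 and 30:

```python
def dict_compress(instructions):
    """Replace each 32-bit instruction with a 16-bit index into a
    dictionary that holds each unique instruction exactly once."""
    dictionary = []   # .dictionary segment: unique instructions
    index_of = {}
    indices = []      # .text segment: one 16-bit index per instruction
    for insn in instructions:
        if insn not in index_of:
            index_of[insn] = len(dictionary)
            dictionary.append(insn)
        indices.append(index_of[insn])
    assert len(dictionary) <= 1 << 16, "16-bit index cannot reach all entries"
    return indices, dictionary

# The slide's example program: five instructions, only two unique.
text = ["lw r2,r3", "lw r2,r3", "lw r15,r3", "lw r15,r3", "lw r15,r3"]
indices, dictionary = dict_compress(text)
compressed = 2 * len(indices) + 4 * len(dictionary)  # bytes: indices + dictionary
original = 4 * len(text)                             # bytes: 32-bit instructions
# Here: 18 bytes vs. 20 bytes, i.e. a 90% compression ratio.
```

Compression pays off only when instructions repeat often enough that the shared dictionary amortizes across many indices.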

SLIDE 5

Decompression

  • Algorithm
    1. I-cache miss invokes decompressor (exception handler)
    2. Fetch index
    3. Fetch dictionary word
    4. Place instruction in I-cache (special instruction)
  • Write directly into I-cache
  • Decompressed instructions only exist in I-cache

[Figure: processor with I-cache and D-cache; on an I-cache miss, the handler reads an index (e.g. 5) from the indices in memory, looks up the dictionary, and writes the decompressed instruction (e.g. add r1,r2,r3) directly into the I-cache]
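The four steps above can be modeled in Python (names are illustrative, not the paper's code): a miss on an instruction fetch invokes the handler, which expands one line of indices through the dictionary and writes the native instructions straight into the I-cache, so decompressed code never occupies RAM:

```python
LINE_WORDS = 8  # a 32B cache line holds eight 4B instructions

def miss_handler(line_no, indices, dictionary, icache):
    """Steps 2-4: fetch the line's indices, look up each dictionary
    word, and place the decompressed line directly in the I-cache."""
    base = line_no * LINE_WORDS
    icache[line_no] = [dictionary[i] for i in indices[base:base + LINE_WORDS]]

def fetch(pc, indices, dictionary, icache):
    """Instruction fetch: step 1, a miss invokes the handler."""
    line_no, offset = divmod(pc, LINE_WORDS)
    if line_no not in icache:
        miss_handler(line_no, indices, dictionary, icache)
    return icache[line_no][offset]
```

Only lines that are actually executed ever get decompressed, which is why the scheme's overhead tracks the I-cache miss ratio.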

SLIDE 6

CodePack

  • Overview

    – IBM
    – PowerPC
    – First system with instruction stream compression
    – Decompress during I-cache miss

  • Software CodePack

                            Dictionary        CodePack
    Codewords (indices)     Fixed-length      Variable-length
    Decompress granularity  1 cache line      2 cache lines
    Decompression overhead  75 instructions   1120 instructions

SLIDE 7

Compression ratio

  • CodePack: 55% - 63%
  • Dictionary: 65% - 82%

compression ratio = compressed size / original size

[Chart: compression ratio (0%-100%) for cc1, ghostscript, go, ijpeg, mpeg2enc, pegwit, perl, and vortex; Dictionary vs. CodePack]
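A quick worked example of the metric (illustrative sizes, not the paper's measurements): lower is better, so Dictionary's 65%-82% removes less of the program than CodePack's 55%-63%.

```python
def compression_ratio(compressed_size, original_size):
    """compression ratio = compressed size / original size."""
    return compressed_size / original_size

# A 100 KB program compressed to 58 KB lands in CodePack's reported range:
assert 0.55 <= compression_ratio(58, 100) <= 0.63
```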

SLIDE 8

Simulation environment

  • SimpleScalar
  • Pipeline: 5 stage, in-order
  • I-cache: 16KB, 32B lines, 2-way
  • D-cache: 8KB, 16B lines, 2-way
  • Memory: 10 cycle latency, 2 cycle rate
SLIDE 9

Performance

  • CodePack: very high overhead
  • Reduce overhead by reducing cache misses

[Chart: go benchmark; slowdown relative to native code vs. I-cache size (4KB, 16KB, 64KB) for CodePack, Dictionary, and native]

SLIDE 10

Cache miss

  • Control slowdown by optimizing I-cache miss ratio

[Chart: slowdown relative to native code vs. I-cache miss ratio (0%-8%) for CodePack and Dictionary at 4KB, 16KB, and 64KB I-caches]

SLIDE 11

Selective compression

  • Hybrid programs
    – Only compress some procedures
    – Trade size for speed
    – Avoid decompression overhead
  • Profile methods
    – Count dynamic instructions
      • Example: Thumb
      • Use when compressed code has more instructions
      • Reduce number of executed instructions
    – Count cache misses
      • Example: CodePack
      • Use when compressed code has longer cache miss latency
      • Reduce cache miss latency
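A cache-miss profile can drive selective compression with a simple greedy pass (a sketch with assumed names, not the paper's algorithm): keep the procedures that miss most in native form until a size budget is spent, and compress the rest.

```python
def select_native(procs, native_budget):
    """procs: (name, cache_misses, native_size_bytes) per procedure.
    Greedily keep the highest-miss procedures in native form; every
    procedure outside the budget gets compressed."""
    native, used = set(), 0
    for name, misses, size in sorted(procs, key=lambda p: p[1], reverse=True):
        if used + size <= native_budget:
            native.add(name)
            used += size
    return native

# Illustrative profile: the hot loop dominates the miss count.
procs = [("hot_loop", 900, 400), ("warm", 100, 400), ("cold", 1, 400)]
# With an 800-byte budget, hot_loop and warm stay native; cold is compressed.
```

Profiling by cache misses rather than dynamic instruction counts targets exactly the event that triggers the decompressor, which is why it suits miss-driven schemes like CodePack.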
SLIDE 12

Cache miss profiling

  • Cache miss profile reduces overhead 50%
  • Loop-oriented benchmarks benefit most

– Approach performance of native code

[Chart: pegwit (encryption); slowdown relative to native code (1.00-1.12) vs. compression ratio (60%-100%) for CodePack profiled by dynamic instructions vs. by cache misses]

SLIDE 13

CodePack vs. Dictionary

  • More compression may have better performance

    – CodePack has smaller size than Dictionary compression
    – Even with some native code, CodePack is smaller
    – CodePack is faster due to using more native code

[Chart: ghostscript; slowdown relative to native code (0.5-4.0) vs. compression ratio (60%-100%) for CodePack and Dictionary, both with cache-miss profiling]

SLIDE 14

Conclusions

  • High-performance SW decompression is possible
    – Dictionary is faster than CodePack, but with a 5-25% compression ratio difference
    – Hardware support:
      • I-cache miss exception
      • Store-instruction instruction (writes directly into the I-cache)
  • Tune performance by reducing cache misses
    – Cache size
    – Code placement
  • Selective compression
    – Use cache-miss profile for loop-oriented benchmarks
  • Code placement affects decompression overhead
    – Future: unify code placement and compression

SLIDE 15

Web page

http://www.eecs.umich.edu/compress