Fast Software-managed Code Decompression: Charles Lefurgy and Trevor Mudge



slide-1
SLIDE 1

Fast Software-managed Code Decompression

Charles Lefurgy and Trevor Mudge

Advanced Computer Architecture Laboratory
Electrical Engineering and Computer Science Dept.
The University of Michigan, Ann Arbor

Compiler and Architecture Support for Embedded Systems (CASES)
October 1-3, 1999

slide-2
SLIDE 2

Motivation

  • Problem: embedded code size

– Constraints: cost, area, and power
– Fit program in on-chip memory
– Compilers vs. hand-coded assembly

  • Solution: code compression

– Reduce compiled code size
– Take advantage of instruction repetition

  • Benefits

– On-chip memory used more effectively
– Trade off performance for code density
– Systems use cheaper processors with smaller on-chip memories

[Figure: Embedded systems. One system has a CPU, RAM, I/O, and an original program ROM; the other has a CPU, RAM, I/O, and a compressed program ROM.]

slide-3
SLIDE 3

Hardware or software decompression?

  • Hardware

– Faster translation
– CodePack, MIPS-16, Thumb

  • Software

– Smaller physical area
– Lower cost
– Quicker re-targeting to new compression algorithms
– Rivals HW solutions on some (loopy) benchmarks

slide-4
SLIDE 4

Kirovski et al., 1997

  • Overview

– Procedure compression
– Decompress and execute 1 procedure at a time
– Store decompressed code in procedure cache
– Cache management

  • Results

– 60% compression ratio on SPARC
– 166% execution penalty with 64KB procedure cache

[Figure: HLL procedures F() {...} and G() {...} are compiled to native code (F: load r5,4 ...; G: addi r7,8 ...) and then LZ-compressed (LZ F: 10010...; LZ G: 00101...). A decompressor and P-cache manager accompany the compressed procedures.]

slide-5
SLIDE 5

Dictionary compression algorithm

  • Dictionary contains unique instructions
  • Replace program instructions with short index

Original program, .text segment (32 bits per instruction):

    Add r1,r2,r3
    Add r1,r2,r3
    Add r1,r2,r4
    Add r1,r2,r4
    Add r1,r2,r4

Compressed program, .text segment (contains indices, 16 bits each):

    5
    5
    30
    30
    30

Compressed program, .dictionary segment (32 bits per entry):

    Add r1,r2,r3
    Add r1,r2,r4
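The two bullets above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tool: the function name `dictionary_compress` is hypothetical, instructions are modeled as strings, and indices are assigned in first-seen order (so the example yields 0 and 1 rather than the 5 and 30 shown on the slide, which presumably come from a larger dictionary).

```python
def dictionary_compress(instructions):
    """Return (indices, dictionary) for a list of instruction words.

    Hypothetical helper: the dictionary holds each unique instruction once,
    and the program text becomes a list of indices into it.
    """
    dictionary = []   # unique instructions, in first-seen order
    index_of = {}     # instruction -> its position in the dictionary
    indices = []      # compressed .text: one index per original instruction
    for insn in instructions:
        if insn not in index_of:
            index_of[insn] = len(dictionary)
            dictionary.append(insn)
        indices.append(index_of[insn])
    return indices, dictionary


program = [
    "Add r1,r2,r3",
    "Add r1,r2,r3",
    "Add r1,r2,r4",
    "Add r1,r2,r4",
    "Add r1,r2,r4",
]
indices, dictionary = dictionary_compress(program)

# Decompression is a plain table lookup per index.
restored = [dictionary[i] for i in indices]
```

Because every instruction maps to exactly one dictionary entry, the lookup reproduces the original program exactly.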

slide-6
SLIDE 6

Compression ratio

  • compression ratio = compressed size / original size
  • Compression ratios

– Dictionary: 65% - 82%
– LZRW1: 55% - 63%

Benchmark   Original size   Dict. compression   LZRW1 compression
cc1             1,083,168               65.4%               60.4%
vortex            495,248               65.8%               55.5%
go                310,576               69.6%               63.9%
perl              267,568               73.7%               60.2%
ijpeg             198,272               77.2%               61.5%
mpeg2enc          119,600               82.5%               60.5%
pegwit             88,800               79.5%               56.7%
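As a worked example of the ratio, the sketch below applies it to the five-instruction program from the dictionary slide. It assumes every 32-bit instruction is replaced by a 16-bit index and that the dictionary counts toward the compressed size; the function names are hypothetical and the accounting is an illustration, not the authors' exact measurement method.

```python
def compression_ratio(compressed_size, original_size):
    """compression ratio = compressed size / original size (smaller is better)."""
    return compressed_size / original_size


def dictionary_compressed_bits(num_instructions, num_unique,
                               insn_bits=32, index_bits=16):
    """Size of indices plus dictionary, under the assumptions stated above."""
    return num_instructions * index_bits + num_unique * insn_bits


original_bits = 5 * 32                               # five 32-bit instructions
compressed_bits = dictionary_compressed_bits(5, 2)   # 5 indices + 2 dictionary entries
ratio = compression_ratio(compressed_bits, original_bits)
```

Under these assumptions the index stream alone is half the original size, so the ratio cannot fall below 50%, and the dictionary adds to that. On a tiny program the dictionary dominates; on larger programs its cost is spread over many more instructions.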

slide-7
SLIDE 7

Decompression code

  • Simple

– Small static code size: 25 instructions

  • Fast

– Less than 3 instructions per output byte
– 74 dynamic instructions per decompressed cache line

  • Algorithm

– Invoke decompressor on L1 I-cache miss
– Decompress 1 complete cache line
– For each instruction in cache line

      • Read index
      • Reference dictionary with index to get instruction
      • Put instruction in I-cache

  • HW support

– L1-cache miss exception
– Write into I-cache
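The algorithm bullets above can be sketched as a miss handler. This is a Python model, not the 25-instruction handler itself, and it rests on assumptions not stated on the slide: 32-byte lines and 4-byte instructions, native code starting at address 0, index k standing for the instruction at address 4k, and the I-cache modeled as a plain dict. The name `decompress_line` is hypothetical.

```python
LINE_BYTES = 32                             # assumed I-cache line size
INSN_BYTES = 4                              # 32-bit instructions
INSNS_PER_LINE = LINE_BYTES // INSN_BYTES   # 8 instructions per line


def decompress_line(miss_addr, indices, dictionary, icache):
    """Fill the whole cache line containing miss_addr; return its base address."""
    line_base = miss_addr - (miss_addr % LINE_BYTES)
    first = line_base // INSN_BYTES          # index position of the line's first word
    for slot in range(INSNS_PER_LINE):
        pos = first + slot
        if pos >= len(indices):              # line runs past the end of the program
            break
        index = indices[pos]                 # read index
        insn = dictionary[index]             # reference dictionary with index
        icache[line_base + slot * INSN_BYTES] = insn   # put instruction in I-cache
    return line_base


dictionary = ["add r1,r2,r3", "add r1,r2,r4"]
indices = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1]    # 10 instructions: one full line plus two
icache = {}
decompress_line(0x0C, indices, dictionary, icache)   # miss inside the first line
```

With a 32-byte line, 74 dynamic instructions per line works out to about 2.3 instructions per output byte, which agrees with the "less than 3" figure above.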

slide-8
SLIDE 8

Optimizations

  • Partial decompression

– Decompress from missed instruction to end of cache line
– Use a valid bit per word in cache line to mark instructions at beginning of line as invalid
– Avoids decompressing instructions that may not be executed
– Up to 12% speedup

  • Second register file

– Many embedded processors have an additional register file
– Avoid save/restore of registers when decompressor runs
– 2nd register file with partial decompression: up to 16% speedup
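Partial decompression can be sketched as a variant of the line handler that starts at the missed word and records a valid bit per word. As before, this is an illustrative Python model under assumed parameters (32-byte lines, 4-byte instructions, code starting at address 0); `decompress_partial` is a hypothetical name, and the returned lists stand in for the cache line and its valid bits.

```python
LINE_BYTES = 32                             # assumed I-cache line size
INSN_BYTES = 4                              # 32-bit instructions
INSNS_PER_LINE = LINE_BYTES // INSN_BYTES


def decompress_partial(miss_addr, indices, dictionary):
    """Decompress from the missed instruction to the end of its cache line.

    Returns (line_base, words, valid). Words before the miss stay None and
    are marked invalid, so a later fetch of them would miss again.
    """
    line_base = miss_addr - (miss_addr % LINE_BYTES)
    first_slot = (miss_addr - line_base) // INSN_BYTES
    words = [None] * INSNS_PER_LINE
    valid = [False] * INSNS_PER_LINE
    for slot in range(first_slot, INSNS_PER_LINE):
        pos = line_base // INSN_BYTES + slot
        if pos >= len(indices):             # past the end of the program
            break
        words[slot] = dictionary[indices[pos]]
        valid[slot] = True
    return line_base, words, valid


dictionary = ["add r1,r2,r3", "add r1,r2,r4"]
indices = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1]
line_base, words, valid = decompress_partial(0x0C, indices, dictionary)
```

A miss at offset 0x0C skips the first three words of the line, so only five of the eight lookups are performed.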

slide-9
SLIDE 9

Simulation environment

  • SimpleScalar

– Modified to support compression

  • 5 stage, in-order pipeline

– Simple embedded processor

  • D-cache

– 8KB, 16B lines, 2-way

  • I-cache

– 1 to 64KB, 32B lines, 2-way

  • Memory

– 10 cycle latency, 2 cycle rate

slide-10
SLIDE 10

Performance: cc1

[Figure: slowdown relative to native code (y-axis, 1 to 6) versus I-cache size (x-axis: 1KB, 4KB, 16KB, 64KB). Series: compressed, partial, partial+regfile, native.]

slide-11
SLIDE 11

Performance: ijpeg

[Figure: slowdown relative to native code (y-axis, 1 to 6) versus I-cache size (x-axis: 1KB, 4KB, 16KB, 64KB). Series: compressed, partial, partial+regfile, native.]

slide-12
SLIDE 12

Performance summary

  • Data from CINT95, MediaBench with several cache sizes
  • Control slowdown by optimizing I-cache miss ratio

– Code layout may help

[Figure: slowdown relative to native code (y-axis, 1 to 6) versus I-cache miss ratio (x-axis, 0% to 15%). Series: compressed, partial, partial+regfile.]

slide-13
SLIDE 13

Performance summary, cont.

  • Magnification of previous graph
  • Slowdown under 3x when I-miss ratio is under 2%
  • Slowdown under 2x when I-miss ratio is under 1%

[Figure: slowdown relative to native code (y-axis, 1 to 4) versus I-cache miss ratio (x-axis, 0.0% to 3.0%). Series: compressed, partial, partial+regfile.]
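A back-of-envelope model shows why slowdown tracks the I-cache miss ratio. It is not the authors' model and makes strong assumptions: the miss ratio is counted per instruction fetched, every miss costs the 74 dynamic handler instructions quoted on the decompression-code slide, all instructions cost the same, and exception entry, register save/restore, and memory latency are ignored. The function name `estimated_slowdown` is hypothetical.

```python
def estimated_slowdown(miss_ratio, handler_insns=74):
    """Rough instruction-count slowdown: 1 native instruction plus handler work per miss."""
    return 1.0 + miss_ratio * handler_insns


at_1_percent = estimated_slowdown(0.01)   # 1 + 0.01 * 74
at_2_percent = estimated_slowdown(0.02)   # 1 + 0.02 * 74
```

Under these assumptions the estimate stays below 2x at a 1% miss ratio and below 3x at 2%, in line with the bounds on the slide, though the measured curves include effects this sketch leaves out.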

slide-14
SLIDE 14

Conclusions

  • Line-based decompression beats procedure-based

– Uses the normal cache as decompression buffer
– No fragmentation management as in procedure-based decompression
– Order of magnitude performance difference
– A previous decompressor with procedure granularity had 100x slowdown on gcc and go [Kirovski97]

  • Compressed code fills the gap between native and interpreted code

– Has the quick execution of native code
– Has the small size of interpreted code

slide-15
SLIDE 15

Web page

http://www.eecs.umich.edu/~tnm/compress