SLIDE 1

Evaluation of a High Performance Code Compression Method

Charles Lefurgy, Eva Piccininni, and Trevor Mudge

Advanced Computer Architecture Laboratory, Electrical Engineering and Computer Science Dept., The University of Michigan, Ann Arbor
MICRO-32, November 16-18, 1999

SLIDE 2

Motivation

[Diagram: embedded system with CPU, RAM, I/O, and program ROM; the original program in ROM vs. a compressed program in a smaller ROM]

  • Problem: embedded code size

– Constraints: cost, area, and power
– Fit program in on-chip memory
– Compilers vs. hand-coded assembly

  • Portability
  • Development costs

– Code bloat

  • Solution: code compression

– Reduce compiled code size
– Take advantage of instruction repetition
– Systems use cheaper processors with smaller on-chip memories

  • Implementation

– Code size?
– Execution speed?

SLIDE 3

CodePack

  • Overview

– IBM
– PowerPC instruction set
– First system with instruction stream compression
– 60% compression ratio, ±10% performance [IBM]

  • Performance gain due to prefetching
  • Implementation

– Binary executables are compressed after compilation
– Compression dictionaries tuned to application
– Decompression occurs on L1 cache miss

  • L1 caches hold decompressed data
  • Decompress 2 cache lines at a time (16 insns)

– PowerPC core is unaware of compression

SLIDE 4

CodePack encoding

  • 32-bit insn is split into 2 16-bit words
  • Each 16-bit word compressed separately

[Encoding diagrams for the upper and lower 16 bits: each codeword is a short tag followed by a dictionary index, with dictionary groups of 32, 64, 128, and 256 entries; an escape tag is followed by the raw 16 bits, and the value zero gets its own short code]
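
The exact tag bit patterns did not survive in this transcript, so the following C sketch only illustrates the general tag-plus-index idea: it decodes one 16-bit half from a bitstream using made-up tag widths and group sizes (decode_half, get_bits, and the 00/01/10/11 assignments are assumptions for illustration, not the actual CodePack encoding).

#include <stdint.h>
#include <stddef.h>

typedef struct {
    const uint8_t *bytes;   /* compressed stream                */
    size_t         bitpos;  /* next bit to read (MSB first)     */
} bitstream_t;

/* Read n bits (n <= 16), MSB first. */
static uint16_t get_bits(bitstream_t *s, unsigned n)
{
    uint16_t v = 0;
    while (n--) {
        uint8_t byte = s->bytes[s->bitpos >> 3];
        v = (uint16_t)((v << 1) | ((byte >> (7 - (s->bitpos & 7))) & 1u));
        s->bitpos++;
    }
    return v;
}

/* Decode one 16-bit half using a made-up tag assignment:
 *   00           -> the value zero (no index bits)
 *   01 + 5 bits  -> index into a 32-entry dictionary group
 *   10 + 8 bits  -> index into a 256-entry dictionary group
 *   11 + 16 bits -> escape: the raw 16-bit value follows       */
uint16_t decode_half(bitstream_t *s, const uint16_t *dict)
{
    switch (get_bits(s, 2)) {
    case 0:  return 0;                           /* common zero value  */
    case 1:  return dict[get_bits(s, 5)];        /* 32-entry group     */
    case 2:  return dict[32 + get_bits(s, 8)];   /* 256-entry group    */
    default: return get_bits(s, 16);             /* raw (escaped) bits */
    }
}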

SLIDE 5

CodePack decompression

[Decompression flow: the L1 I-cache miss address selects an entry in the index table (in main memory), which gives the byte-aligned address of the compression block (16 instructions) among the compressed bytes (in main memory); high and low tags select entries in the high and low dictionaries to rebuild the high and low 16 bits of each native instruction]
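
A rough software model of this refill path is sketched below, assuming an index table of byte offsets, separate high and low dictionaries, and the decode_half() helper from the previous sketch; all names, types, and layouts are hypothetical, not the actual CodePack hardware interface.

#include <stdint.h>
#include <stddef.h>

#define BLOCK_INSNS 16            /* one compression block = 2 cache lines */

typedef struct {
    const uint8_t *bytes;
    size_t         bitpos;
} bitstream_t;

/* Decoder for one 16-bit half; see the previous sketch. */
extern uint16_t decode_half(bitstream_t *s, const uint16_t *dict);

/* Rebuild the 16 native instructions of the compression block that
 * contains the miss address.                                          */
void decompress_block(uint32_t miss_addr,
                      const uint32_t *index_table,  /* in main memory         */
                      const uint8_t  *compressed,   /* in main memory         */
                      const uint16_t *hi_dict,      /* high 16-bit dictionary */
                      const uint16_t *lo_dict,      /* low 16-bit dictionary  */
                      uint32_t        out[BLOCK_INSNS])
{
    /* 1. The miss address selects a block; the index table gives the
     *    byte-aligned start of that block's compressed codes.          */
    uint32_t block    = miss_addr / (BLOCK_INSNS * 4);
    uint32_t byte_off = index_table[block];

    /* 2. Decode the codes: each native instruction is rebuilt from one
     *    high half and one low half.                                   */
    bitstream_t s = { compressed + byte_off, 0 };
    for (int i = 0; i < BLOCK_INSNS; i++) {
        uint16_t hi = decode_half(&s, hi_dict);
        uint16_t lo = decode_half(&s, lo_dict);
        out[i] = ((uint32_t)hi << 16) | lo;
    }
    /* 3. The rebuilt lines go into the L1 I-cache; the PowerPC core
     *    never sees compressed bits.                                   */
}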

SLIDE 6

Compression ratio

  • Average: 62%

compression ratio = compressed size / original size

[Bar chart: compression ratio (0% to 100%) for applu, apsi, cc1, compress95, fpppp, go, hydro2d, ijpeg, li, m88ksim, mgrid, mpeg2enc, pegwit, perl, su2cor, swim, tomcatv, turb3d, vortex, and wave5]
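
As an illustrative example with made-up numbers: a 100 KB native executable that compresses to 62 KB has a compression ratio of 62 KB / 100 KB = 62%, so smaller ratios mean better compression.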

SLIDE 7

CodePack programs

  • Compressed executable

– 17%-25% raw bits: not compressed!

  • Includes escape bits
  • Compiler optimizations might help

– 5% index table
– 2KB dictionary (fixed cost)
– 1% pad bits

[Composition of the compressed go executable: Tags 25%, Indices 51%, Raw bits 14%, Index table 5%, Escape 3%, Dictionary 1%, Pad 1%]

SLIDE 8

I-cache miss timing

  • Native code uses critical word first
  • Compressed code must be fetched sequentially
  • Example shows miss to 5th instruction in cache line

– 32-bit insns, 64-bit bus

[Timing diagram starting at t=0 for (a) native code: instruction cache miss, instructions fetched from main memory, critical instruction word first; (b) compressed code: index fetched from main memory, then codes fetched from main memory and decompressed; (c) compressed code with optimizations: index from the index cache, codes from main memory, 2 decompressors; legend: L1 cache miss, fetch index, fetch instructions (first line, remaining lines), decompression cycle, critical instruction word]

SLIDE 9

Baseline results

  • CodePack causes up to 18% performance loss

– SimpleScalar
– 4-issue, out-of-order
– 16 KB caches
– Main memory: 10 cycle latency, 2 cycle rate

[Bar chart: instructions per cycle for cc1, go, perl, and vortex, native vs. CodePack]

SLIDE 10

Optimization A: Index cache

  • Remove index table access with a cache

– A cache hit removes main memory access for index
– optimized: 64 lines, fully assoc., 4 indices/line (<15% miss ratio)

  • Within 8% of native code

– perfect: an infinite sized index cache

  • Within 5% of native code

[Bar chart: speedup over native code for cc1, go, perl, and vortex: CodePack, optimized, and perfect]
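
A minimal software model of such an index cache is sketched below, assuming 64 fully associative lines holding 4 consecutive index-table entries each (per the slide) and round-robin replacement; the replacement policy and all names are assumptions, not the evaluated hardware design.

#include <stdint.h>
#include <stdbool.h>

#define IC_LINES    64
#define IC_PER_LINE 4

typedef struct {
    bool     valid;
    uint32_t tag;                    /* block number / IC_PER_LINE       */
    uint32_t entry[IC_PER_LINE];     /* cached index-table entries       */
} ic_line_t;

static ic_line_t ic[IC_LINES];
static unsigned  ic_victim;          /* round-robin replacement pointer  */

/* Return the index-table entry for 'block', touching the in-memory table
 * only on a miss.  '*hit' reports whether the slow main memory access
 * was avoided.                                                           */
uint32_t index_cache_lookup(uint32_t block, const uint32_t *index_table, bool *hit)
{
    uint32_t tag = block / IC_PER_LINE;
    uint32_t off = block % IC_PER_LINE;

    for (unsigned i = 0; i < IC_LINES; i++) {
        if (ic[i].valid && ic[i].tag == tag) {
            *hit = true;
            return ic[i].entry[off];
        }
    }
    /* Miss: read 4 neighbouring entries from main memory (slow path). */
    ic_line_t *line = &ic[ic_victim];
    ic_victim = (ic_victim + 1) % IC_LINES;
    line->valid = true;
    line->tag   = tag;
    for (unsigned j = 0; j < IC_PER_LINE; j++)
        line->entry[j] = index_table[tag * IC_PER_LINE + j];
    *hit = false;
    return line->entry[off];
}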

SLIDE 11

Optimization B: More decoders

  • Codeword tags enable fast extraction of codewords

– Enables parallel decoding

  • Try adding more decoders for faster decompression
  • 2 decoders: performance within 13% of native code

[Bar chart: speedup over native code for cc1, go, perl, and vortex: CodePack with 2, 3, and 16 insn/cycle decoders]
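
The reason tags help is that a codeword's total length follows from its tag alone, so the start of the next codeword can be found without fully decoding the current one. The sketch below locates the next n codeword boundaries so n decoders can each be handed one codeword; it reuses the hypothetical tag scheme from the earlier decoder sketch, not the actual CodePack codes.

#include <stdint.h>
#include <stddef.h>

/* Bits occupied by a codeword with 2-bit tag 'tag' (hypothetical scheme). */
static unsigned codeword_length(unsigned tag)
{
    switch (tag & 3) {
    case 0:  return 2;        /* zero: tag only          */
    case 1:  return 2 + 5;    /* 32-entry group index    */
    case 2:  return 2 + 8;    /* 256-entry group index   */
    default: return 2 + 16;   /* escape + raw 16 bits    */
    }
}

/* Read just the 2-bit tag at 'bitpos' without consuming the codeword. */
static unsigned peek_tag(const uint8_t *bytes, size_t bitpos)
{
    unsigned hi = (bytes[bitpos >> 3] >> (7 - (bitpos & 7))) & 1u;
    bitpos++;
    unsigned lo = (bytes[bitpos >> 3] >> (7 - (bitpos & 7))) & 1u;
    return (hi << 1) | lo;
}

/* Find the starting bit offsets of the next 'n' codewords so that 'n'
 * decoders can each be handed one codeword in the same cycle.          */
void find_codeword_starts(const uint8_t *bytes, size_t bitpos,
                          size_t starts[], unsigned n)
{
    for (unsigned i = 0; i < n; i++) {
        starts[i] = bitpos;
        bitpos += codeword_length(peek_tag(bytes, bitpos));
    }
}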

SLIDE 12

Comparison of optimizations

  • Index cache provides largest benefit
  • Optimizations

– index cache: 64 lines, 4 indices/line, fully assoc.
– 2nd decoder

  • Speedup over native code: 0.97 to 1.05
  • Speedup over CodePack: 1.17 to 1.25

[Bar chart: speedup over native code for cc1, go, perl, and vortex: CodePack, index cache, 2nd decoder, and both optimizations]

SLIDE 13

Cache effects

  • Cache size controls normal CodePack slowdown
  • Optimizations do well on small caches: 1.14 speedup

[Chart, go benchmark: speedup over native code vs. cache size (1KB, 4KB, 16KB, 64KB) for CodePack and optimized]

SLIDE 14

Memory latency

  • Optimized CodePack performs better with slow memories

– Fewer memory accesses than native code

[Chart, go benchmark: speedup over native code vs. memory latency (0.5x, 1x, 2x, 4x, 8x) for CodePack and optimized]

SLIDE 15

Memory width

  • CodePack provides speedup for small buses
  • Optimizations help performance degrade gracefully as bus size increases

[Chart, go benchmark: speedup over native code vs. bus size (16, 32, 64, 128 bits) for CodePack and optimized]

SLIDE 16

Conclusions

  • CodePack works for instruction sets other than PowerPC
  • Performance can be improved at modest cost

– Remove decompression overhead: index lookup, dictionary lookup

  • Compression can speed up execution

– Compressed code requires fewer main memory accesses
– CodePack includes simple prefetching

  • Systems that benefit most from compression

– Narrow buses
– Slow memories

  • Workstations might benefit from compression

– Fewer L2 misses
– Less disk access

SLIDE 17

Web page

http://www.eecs.umich.edu/~tnm/compress