

SLIDE 1

An Operation Rearrangement Technique for Low-Power VLIW Instruction Fetch

Dongkun Shin* and Jihong Kim
Computer Architecture Lab
School of Computer Science and Engineering
Seoul National University, Korea

SLIDE 2

School of CSE Seoul National University Workshop on Complexity-Effective Design 2

Outline

  • Motivations
  • VLIW Instruction Encodings
  • LOR Problem and Solution
  • GOR Problem and Solution
  • Experiment
  • Conclusions

SLIDE 3

Motivations

  • Many mobile devices are designed using VLIW processors for high performance, which usually consume more power than single-issue processors.
  • In digital CMOS circuits, switching activity accounts for over 90% of total power consumption.
  • We propose a post-pass optimization technique that can reduce switching activity during the instruction fetch phase in VLIW processors.

SLIDE 4

VLIW Instruction Encoding - Uncompressed

Program:

    IADD /*IntU*/ || FADD /*FpU*/ || LOAD /*MemU*/ || STORE /*MemU*/
    ISUB /*IntU*/ || IMUL /*IntU*/
    IADD /*IntU*/ || BEG /*BrU*/

Encoding (one slot per functional unit):

    IntU  IntU  FpU   FpU   MemU  MemU   CmpU  BrU
    IADD  NOP   FADD  NOP   LOAD  STORE  NOP   NOP
    ISUB  IMUL  NOP   NOP   NOP   NOP    NOP   NOP
    IADD  NOP   NOP   NOP   NOP   NOP    NOP   BEG

Alternative encoding:

    IntU  IntU  FpU   FpU   MemU  MemU   CmpU  BrU
    IADD  NOP   FADD  NOP   LOAD  STORE  NOP   NOP
    IMUL  ISUB  NOP   NOP   NOP   NOP    NOP   NOP
    IADD  NOP   NOP   NOP   NOP   NOP    NOP   BEG

SLIDE 5

VLIW Instruction Encoding - Compressed

Program:

    IADD /*IntU*/ || FADD /*FpU*/ || LOAD /*MemU*/ || STORE /*MemU*/
    ISUB /*IntU*/ || IMUL /*IntU*/
    IADD /*IntU*/ || BEG /*BrU*/

Compressed encoding (operation, functional unit, parallel bit):

    Instruction 1:  IADD IntU 1 | FADD FpU 1 | LOAD MemU 1 | STORE MemU 0
    Instruction 2:  ISUB IntU 1 | IMUL IntU 0
    Instruction 3:  IADD IntU 1 | BEG BrU 0

Alternative encoding: the same three instructions, with the operations reordered within each instruction.

Possible choices = 4! × 2! × 2! = 96

Which encoding is the best for low-power consumption?
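
The count of possible encodings follows from permuting the operations inside each instruction independently. A minimal sketch of that arithmetic, assuming every ordering of the operations within an instruction is a legal encoding (the function name `equivalent_encodings` is hypothetical, not from the slides):

```python
from math import factorial, prod

def equivalent_encodings(ops_per_instruction):
    """Number of orderings when each instruction's operations may be
    permuted freely (assumption: every permutation is a legal encoding)."""
    return prod(factorial(n) for n in ops_per_instruction)

# The example program has instructions with 4, 2 and 2 operations.
choices = equivalent_encodings([4, 2, 2])
print(choices)  # 96
```

The count grows factorially with instruction width, which is why the later slides search the space with a shortest-path formulation rather than by enumeration.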

SLIDE 6

Machine Model

  • A memory block is fetched from the main memory through the $b_{mem}$-bit width instruction bus on a cache miss.
  • Because of the compressed encoding format, several VLIW instructions are fetched together in a single fetch from the instruction cache.
  • A fetch packet consists of N operations, and $b_{mem} = b_{cache}/N$.

[Figure: External Memory → ($b_{mem}$-bit width bus) → Internal Cache → ($b_{cache}$-bit width bus) → VLIW Processor Core. A fetch packet (FP) holds operations (OP) belonging to one or more instructions (Ins).]

SLIDE 7

Basic Idea

(a) Before operation rearrangement (instruction cache contents, one fetch per row):

    00010101 10010101 10011001 00000000
    10001111 00000011 00011101 01011100    14 bit transitions
    10011101 10011001 10010001 11111110    12 bit transitions
    10100101 10001111 00011101 00011100    13 bit transitions
    Total: 39 bit transitions

(b) After operation rearrangement:

    00010101 10010101 10011001 00000000
    00011101 10001111 01011101 00000010     8 bit transitions
    10011101 10011001 11111111 10010000    10 bit transitions
    10001111 00011101 10100101 00011100    11 bit transitions
    Total: 29 bit transitions

The total number of bit changes is reduced by about 25% (from 39 to 29).
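
The transition counts on this slide can be reproduced by taking the Hamming distance between each pair of successive fetches. The sketch below uses the bit patterns from the slide; the helper names `parse_fetches` and `bit_transitions` are illustrative, not from the original.

```python
def parse_fetches(rows):
    """Turn rows of space-separated bit strings into one integer per fetch."""
    return [int(row.replace(" ", ""), 2) for row in rows]

def bit_transitions(fetches):
    """Sum of Hamming distances between each pair of successive fetches."""
    return sum(bin(a ^ b).count("1") for a, b in zip(fetches, fetches[1:]))

before = parse_fetches([
    "00010101 10010101 10011001 00000000",
    "10001111 00000011 00011101 01011100",
    "10011101 10011001 10010001 11111110",
    "10100101 10001111 00011101 00011100",
])
after = parse_fetches([
    "00010101 10010101 10011001 00000000",
    "00011101 10001111 01011101 00000010",
    "10011101 10011001 11111111 10010000",
    "10001111 00011101 10100101 00011100",
])

n_before = bit_transitions(before)
n_after = bit_transitions(after)
print(n_before, n_after)                           # 39 29
print(f"{(n_before - n_after) / n_before:.1%}")    # 25.6%
```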

SLIDE 8

Problem Formulation

Problem

How to reorder the operations of the given VLIW instructions to reduce the number of bit transitions between successive instruction fetches.

Solutions

  • Local Operation Rearrangement (LOR): each basic block is considered independently.
  • Global Operation Rearrangement (GOR): all the basic blocks are considered simultaneously.

SLIDE 9

LOR Problem

$SW_B = SW_B^{cache} + \alpha \cdot SW_B^{mem}$

  • $SW_B^{cache}$ is the number of bit changes at the internal instruction bus.
  • $SW_B^{mem}$ is the number of bit changes at the external instruction bus.
  • $\alpha$ is the load capacitance ratio of the external instruction bus to the internal instruction bus.
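
One way to evaluate this cost for a basic block is sketched below. It rests on assumptions that are not stated on the slide: each fetch packet is a tuple of N operation words, the internal bus carries whole packets, the external bus carries one operation word at a time (consistent with the machine model's bus widths), and every packet is brought in from external memory exactly once. The function names are hypothetical.

```python
def hamming(a, b):
    """Number of bit positions in which two words differ."""
    return bin(a ^ b).count("1")

def sw_cache(packets):
    """Internal bus: compare successive fetch packets slot by slot."""
    return sum(hamming(x, y)
               for p, q in zip(packets, packets[1:])
               for x, y in zip(p, q))

def sw_mem(packets):
    """External bus: operation words travel one after another.
    Assumption: every packet is fetched from memory exactly once."""
    words = [w for p in packets for w in p]
    return sum(hamming(a, b) for a, b in zip(words, words[1:]))

def sw_block(packets, alpha):
    """SW_B = SW_B^cache + alpha * SW_B^mem for one basic block."""
    return sw_cache(packets) + alpha * sw_mem(packets)

# Tiny illustrative block: two packets of two 4-bit operations each.
demo = [(0b0000, 0b1111), (0b1111, 0b1111)]
print(sw_cache(demo), sw_mem(demo), sw_block(demo, alpha=2))  # 4 4 12
```

In a real evaluation the external-bus term would be scaled by how often the block actually misses in the cache. The GOR slides list that miss rate as an input.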

SLIDE 10

LOR Problem

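[Figure: External Memory → Internal Cache → VLIW Processor Core. $SW^{mem}$ is marked on the external bus, which carries the operations OP1, OP2, ..., OPN; $SW^{cache}$ is marked on the internal bus, which carries the fetch packets FP1, FP2, FP3. $SW^{FP}_{intra}$ and $SW^{FP}_{inter}$ are marked on a sequence of fetch packets FP, FP, ...]

$SW_B = \sum SW^{FP}_{intra} + \sum SW^{FP}_{inter}$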

SLIDE 11

Solution for LOR

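[Figure: layered graph from START to END. One layer holds the nodes $FP^B_{i,1}, FP^B_{i,2}, FP^B_{i,3}, FP^B_{i,4}$, grouped as $EQ(FP^B_i)$; the next layer holds $FP^B_{i+1,1}, FP^B_{i+1,2}, FP^B_{i+1,3}, FP^B_{i+1,4}$. The graph is annotated with $SW^{FP}_{intra}$ and $SW^{FP}_{inter}$.]

$EQ(FP^B_i)$: the set of equivalent fetch packets of $FP^B_i$.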

SLIDE 12

Solution for LOR

  • We find the shortest path from START to END, which is the operation rearrangement that minimizes $SW_B$.
  • A node $v_{i+1}$ in the graph finds the node $v_i$ through which the shortest path from START to the node $v_{i+1}$ should pass.

[Figure: the layered graph from START through $FP^B_{i,1}, \dots, FP^B_{i,4}$ and $FP^B_{i+1,1}, \dots, FP^B_{i+1,4}$ to END.]
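
The shortest-path search over the layered graph can be sketched as a layer-by-layer dynamic program. This is an illustration under stated assumptions, not the authors' implementation: operations are 8-bit words whose least significant bit is the parallel bit and stays in its slot while the remaining bits are permuted (this matches the Basic Idea example data), each fetch packet holds one instruction, the edge cost is the slot-by-slot Hamming distance between successive packets, and the node cost `sw_intra` defaults to zero. All function names are hypothetical.

```python
from itertools import permutations

def hamming(a, b):
    return bin(a ^ b).count("1")

def equivalents(packet):
    """EQ(FP): all reorderings of the operation bodies. The LSB of each
    slot (assumed to be the parallel bit) keeps its position."""
    pbits = [w & 1 for w in packet]
    bodies = [w & ~1 for w in packet]
    return sorted({tuple(b | p for b, p in zip(perm, pbits))
                   for perm in permutations(bodies)})

def sw_inter(prev, cur):
    """Edge cost: bit changes between two successive fetch packets."""
    return sum(hamming(x, y) for x, y in zip(prev, cur))

def lor_shortest_path(packets, sw_intra=lambda fp: 0):
    layers = [equivalents(fp) for fp in packets]
    # best[k][v] = (cost of cheapest START -> v path, predecessor of v)
    best = [{v: (sw_intra(v), None) for v in layers[0]}]
    for layer in layers[1:]:
        prev, cur = best[-1], {}
        for v in layer:
            # Each node picks the predecessor its shortest path passes through.
            u = min(prev, key=lambda u: prev[u][0] + sw_inter(u, v))
            cur[v] = (prev[u][0] + sw_inter(u, v) + sw_intra(v), u)
        best.append(cur)
    # END: take the cheapest node of the last layer and walk back to START.
    v = min(best[-1], key=lambda v: best[-1][v][0])
    cost, path = best[-1][v][0], []
    for table in reversed(best):
        path.append(v)
        v = table[v][1]
    return cost, path[::-1]

# Fetch packets from the Basic Idea slide, before rearrangement.
before = [
    (0b00010101, 0b10010101, 0b10011001, 0b00000000),
    (0b10001111, 0b00000011, 0b00011101, 0b01011100),
    (0b10011101, 0b10011001, 0b10010001, 0b11111110),
    (0b10100101, 0b10001111, 0b00011101, 0b00011100),
]
cost, path = lor_shortest_path(before)
print(cost)
for fp in path:
    print(" ".join(f"{w:08b}" for w in fp))
```

Because the rearrangement shown on the Basic Idea slide is one of the paths in this graph, the cost found here cannot exceed that slide's 29 transitions. The work per layer is quadratic in the size of EQ(FP), which is the cost the GOR heuristic later bounds.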

SLIDE 13

GOR Problem

  • All the basic blocks in a program are considered simultaneously:

    – how many times each basic block is executed.
    – how often each basic block experiences cache misses.
    – how basic blocks are related to each other.

$SW_S = \sum \sum SW^{BB}_{inter}(bb_i, bb_j) + \sum SW^{BB}_{intra}(bb_i)$

  • $SW^{BB}_{inter}$ and $SW^{BB}_{intra}$ are represented by $SW^{FP}_{inter}$, $SW^{FP}_{intra}$, the weight of each basic block, and the cache miss rate.

SLIDE 14

Solution for GOR

[Figure: flow diagram with the labels GOR Problem, Shortest Path Problem, Graph Transformation (branch merging, loop rolling), Graph Construction, LOR Algorithm, Solution.]

This method may require an excessive amount of memory and cycles. We need a heuristic solution.

SLIDE 15

Heuristic for GOR

  • The basic blocks are not all treated equally.

    – Basic blocks with larger effects on the total switching activity are more thoroughly reordered than ones with smaller effects.

  • Not all the equivalent basic blocks in $EQ(bb_i)$ are tried in the search for an optimal solution.

    – Only $N_{cand}$ equivalent basic blocks are created and included in the graph.
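
The two ideas above can be combined in a small sketch: give each basic block a candidate budget that grows with its weight, then generate at most that many equivalent orderings. The slides do not say how the budget is scaled or which candidates are chosen, so the proportional scaling, the "first permutations" selection, and the names `candidate_budget` and `candidates` are all assumptions made for illustration.

```python
from itertools import islice, permutations

def candidate_budget(weights, n_max, n_min=1):
    """Assumed rule: N_cand scales with a block's weight (for example its
    execution count) relative to the heaviest block, never below n_min."""
    top = max(weights.values()) or 1
    return {bb: max(n_min, round(n_max * w / top))
            for bb, w in weights.items()}

def candidates(ops, n_cand):
    """At most n_cand equivalent orderings of a block's operations
    (assumed selection: the first n_cand permutations)."""
    return list(islice(permutations(ops), n_cand))

# Hypothetical weights: bb0 is a hot loop body, bb1 runs rarely.
budget = candidate_budget({"bb0": 1000, "bb1": 10}, n_max=24)
print(budget)  # {'bb0': 24, 'bb1': 1}

hot = candidates(("IADD", "FADD", "LOAD", "STORE"), budget["bb0"])
cold = candidates(("ISUB", "IMUL"), budget["bb1"])
print(len(hot), len(cold))  # 24 1
```

Capping the candidates per block bounds the graph size, trading optimality on rarely executed blocks for memory and run time.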

SLIDE 16

Experiment

TMS320C6201

  • Fixed-point DSP
  • VLIW processor that can specify eight 32-bit operations in a single 256-bit instruction.
  • Uses a compressed encoding

[Figure: External Memory → (external bus, 32-bit width) → Instruction Cache → (internal bus, 256-bit width) → VLIW Processor Core with functional units FU1 to FU8.]

SLIDE 17

Experiment Results

For our benchmark programs, the bit transitions were reduced by 34% on average.

[Figure: bar chart of relative BT/IF (y-axis, 0.0 to 1.2) for the benchmark programs vector multiply, FIR8, IIR, lattice analysis, W_vec, minerror, and their average; series: default, LOR, GOR-H.]

SLIDE 18

Conclusions

  • Described a post-pass optimal operation rearrangement method for low-power VLIW instruction fetch.

    – The switching activity was reduced by 34% on average.

  • Future work

    – The phase-ordering problem between the operation rearrangement and other compiler optimization steps.
    – The operation rearrangement problem in superscalar processors.