An Operation Rearrangement Technique for Low-Power VLIW Instruction - PowerPoint PPT Presentation

An Operation Rearrangement Technique for Low-Power VLIW Instruction Fetch
Dongkun Shin* and Jihong Kim
Computer Architecture Lab, School of Computer Science and Engineering, Seoul National University, Korea

Outline
• Motivations
• VLIW Instruction Encodings
• LOR Problem and Solution
• GOR Problem and Solution
• Experiment
• Conclusions

School of CSE, Seoul National University. Workshop on Complexity-Effective Design.

Motivations
• Many mobile devices are designed using VLIW processors for high performance, which usually consume more power than single-issue processors.
• In digital CMOS circuits, switching activity accounts for over 90% of total power consumption.
• We propose a post-pass optimization technique that can reduce switching activity during the instruction fetch phase in VLIW processors.

VLIW Instruction Encoding - Uncompressed

Program:
IADD /*IntU*/ || FADD /*FpU*/ || LOAD /*MemU*/ || STORE /*MemU*/
ISUB /*IntU*/ || IMUL /*IntU*/
IADD /*IntU*/ || BEG /*BrU*/

Encoding (one slot per functional unit):
IntU  IntU  FpU   FpU   MemU   MemU   CmpU  BrU
IADD  NOP   FADD  NOP   LOAD   STORE  NOP   NOP
ISUB  IMUL  NOP   NOP   NOP    NOP    NOP   NOP
IADD  NOP   NOP   NOP   NOP    NOP    NOP   BEG

Alternative encoding:
IntU  IntU  FpU   FpU   MemU   MemU   CmpU  BrU
NOP   IADD  NOP   FADD  STORE  LOAD   NOP   NOP
IMUL  ISUB  NOP   NOP   NOP    NOP    NOP   NOP
IADD  NOP   NOP   NOP   NOP    NOP    BEG   NOP

VLIW Instruction Encoding - Compressed

Program:
IADD /*IntU*/ || FADD /*FpU*/ || LOAD /*MemU*/ || STORE /*MemU*/
ISUB /*IntU*/ || IMUL /*IntU*/
IADD /*IntU*/ || BEG /*BrU*/

NOPs are not stored. Each operation carries a parallel bit: 1 if the next operation belongs to the same instruction, 0 if the operation ends the instruction.

Encoding:
Instruction     1     1     1     1      2     2     3     3
Operation       IADD  FADD  LOAD  STORE  ISUB  IMUL  IADD  BEG
Unit            IntU  FpU   MemU  MemU   IntU  IntU  IntU  BrU
Parallel bit    1     1     1     0      1     0     1     0

Alternative encoding:
Instruction     1     1      1     1     2     2     3     3
Operation       FADD  STORE  IADD  LOAD  IMUL  ISUB  BEG   IADD
Unit            FpU   MemU   IntU  MemU  IntU  IntU  BrU   IntU
Parallel bit    1     1      1     0     1     0     1     0

Possible choices = 4! × 2! × 2! = 96

Which encoding is best for low power consumption?
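The count of possible choices can be checked by enumeration. The following is a minimal sketch, assuming that operations may be permuted freely inside one instruction and that instruction boundaries stay fixed, as the 4! 2! 2! figure on the slide suggests. The names `program` and `equivalent_encodings` are illustrative and do not come from the presentation.

```python
from itertools import permutations, product

# Program from the slide: each inner list is one VLIW instruction.
program = [
    ["IADD", "FADD", "LOAD", "STORE"],
    ["ISUB", "IMUL"],
    ["IADD", "BEG"],
]


def equivalent_encodings(instructions):
    """Yield every operation ordering that keeps instruction boundaries fixed.

    Each encoding is a list of (operation, parallel_bit) pairs. The parallel
    bit is 1 for every operation except the last one of its instruction.
    Assumption: any order inside an instruction is a legal encoding.
    """
    per_instruction = [list(permutations(ins)) for ins in instructions]
    for choice in product(*per_instruction):
        encoding = []
        for ins in choice:
            for k, op in enumerate(ins):
                last = k == len(ins) - 1
                encoding.append((op, 0 if last else 1))
        yield encoding


encodings = list(equivalent_encodings(program))
print(len(encodings))  # 96 = 4! * 2! * 2!

# The alternative encoding shown on the slide is one of the candidates.
alternative = [
    ("FADD", 1), ("STORE", 1), ("IADD", 1), ("LOAD", 0),
    ("IMUL", 1), ("ISUB", 0),
    ("BEG", 1), ("IADD", 0),
]
print(alternative in encodings)  # True
```

Even this three-instruction program has 96 equivalent encodings, which is why the choice is treated as a search problem on the following slides.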

Machine Model
• A memory block is fetched from the main memory through the $b_{mem}$-bit width instruction bus on a cache miss.
• Because of the compressed encoding format, several VLIW instructions are fetched together in a single fetch from the instruction cache.
• A fetch packet consists of N operations, and $b_{mem} = b_{cache}/N$.

[Figure: External Memory, connected by a $b_{mem}$-bit width bus carrying operations (OP) to the Internal Cache, connected by a $b_{cache}$-bit width bus carrying fetch packets (FP) of several instructions (Ins) to the VLIW Processor Core.]

Basic Idea

(a) Before operation rearrangement (instruction cache, one fetch packet per row):
00010101 10010101 10011001 00000000
    14 bit transitions
10001111 00000011 00011101 01011100
    12 bit transitions
10011101 10011001 10010001 11111110
    13 bit transitions
10100101 10001111 00011101 00011100
Total: 39 bit transitions

(b) After operation rearrangement:
00010101 10010101 10011001 00000000
    8 bit transitions
00011101 10001111 01011101 00000010
    10 bit transitions
10011101 10011001 11111111 10010000
    11 bit transitions
10001111 00011101 10100101 00011100
Total: 29 bit transitions

The total number of bit changes is reduced by about 25% (from 39 to 29).
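The transition counts on this slide can be reproduced by taking the Hamming distance between successive fetch packets. The following is a small sketch; the helper names `hamming` and `transitions` are illustrative and not taken from the presentation.

```python
def hamming(a: str, b: str) -> int:
    """Number of bit positions in which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def transitions(rows):
    """Bit transitions between each pair of successive fetch packets."""
    packets = ["".join(row.split()) for row in rows]
    return [hamming(p, q) for p, q in zip(packets, packets[1:])]


# Instruction cache contents from the slide, one fetch packet per row.
before = [
    "00010101 10010101 10011001 00000000",
    "10001111 00000011 00011101 01011100",
    "10011101 10011001 10010001 11111110",
    "10100101 10001111 00011101 00011100",
]
after = [
    "00010101 10010101 10011001 00000000",
    "00011101 10001111 01011101 00000010",
    "10011101 10011001 11111111 10010000",
    "10001111 00011101 10100101 00011100",
]

print(transitions(before), sum(transitions(before)))  # [14, 12, 13] 39
print(transitions(after), sum(transitions(after)))    # [8, 10, 11] 29
```

The saving is (39 - 29) / 39, which is roughly 25.6%.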

Problem Formulation
• Problem: how to reorder the operations of given VLIW instructions to reduce the number of bit transitions between successive instruction fetches.
• Solutions
  – Local Operation Rearrangement (LOR): each basic block is considered independently.
  – Global Operation Rearrangement (GOR): all the basic blocks are considered simultaneously.

LOR Problem

$$SW^{B} = SW^{B}_{cache} + \alpha \cdot SW^{B}_{mem}$$

• $\alpha$ is the load capacitance ratio of the external instruction bus to the internal instruction bus.
• $SW^{B}_{cache}$ is the number of bit changes at the internal instruction bus.
• $SW^{B}_{mem}$ is the number of bit changes at the external instruction bus.
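As a worked illustration of this cost, the sketch below evaluates the sum of cache-bus bit changes and alpha times memory-bus bit changes for a toy basic block. It rests on assumptions that are not stated on the slide: fetch packets cross the internal bus as a whole, operations cross the narrower external bus one at a time, and every packet is fetched from memory exactly once in program order. The 4-bit operations and all function names are hypothetical.

```python
def popcount(x: int) -> int:
    """Number of 1 bits in a non-negative integer."""
    return bin(x).count("1")


def sw_cache(packets):
    """Bit changes on the internal bus: packet i is followed by packet i+1."""
    return sum(
        popcount(a ^ b)
        for p, q in zip(packets, packets[1:])
        for a, b in zip(p, q)
    )


def sw_mem(packets):
    """Bit changes on the external bus, assuming one operation per transfer."""
    stream = [op for packet in packets for op in packet]
    return sum(popcount(a ^ b) for a, b in zip(stream, stream[1:]))


def switching_activity(packets, alpha):
    """Total cost: cache-bus changes plus alpha times memory-bus changes."""
    return sw_cache(packets) + alpha * sw_mem(packets)


# Two fetch packets of two toy 4-bit operations each.
packets = [(0b0001, 0b0011), (0b0111, 0b0110)]
print(sw_cache(packets))                     # 4
print(sw_mem(packets))                       # 3
print(switching_activity(packets, alpha=2))  # 10
```

A larger alpha makes the external-bus term dominate, so the best ordering can change with the capacitance ratio.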

LOR Problem

[Figure: External Memory, Internal Cache, and VLIW Processor Core, with fetch packets FP 1 to FP 3 made of operations OP 1, OP 2, ..., OP N. The figure marks $SW^{FP}_{intra}$, $SW^{FP}_{inter}$, $SW_{mem}$, and $SW_{cache}$.]

$$SW^{B} = \sum_{FP} SW^{FP}_{intra} + \sum_{FP} SW^{FP}_{inter}$$

Solution for LOR

[Figure: layered graph from START to END. One layer holds the nodes $FP^{B}_{i,1}$, $FP^{B}_{i,2}$, $FP^{B}_{i,3}$, $FP^{B}_{i,4}$ and the next layer holds $FP^{B}_{i+1,1}$, $FP^{B}_{i+1,2}$, $FP^{B}_{i+1,3}$, $FP^{B}_{i+1,4}$. The graph is weighted with $SW^{FP}_{intra}$ and $SW^{FP}_{inter}$, and the edges at START and END have weight 0.]

$EQ(FP_i)$: the set of equivalent fetch packets of $FP_i$.

Solution for LOR
• We find the shortest path from START to END, which is the solution of operation rearrangement that minimizes $SW^{B}$.
• A node $v_{i+1}$ in the graph finds the node $v_i$ through which the shortest path from START to the node $v_{i+1}$ should pass.
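The shortest-path search over a layered graph can be sketched as a dynamic program in which each node records its best predecessor in the previous layer. This is an illustrative sketch, not the authors' implementation. It assumes that node weights are intra-packet bit changes, that edge weights are inter-packet bit changes, and that the equivalent packets of a fetch packet are all permutations of its operations. Every function name and the toy 4-bit operations are hypothetical.

```python
from itertools import permutations, product


def popcount(x: int) -> int:
    return bin(x).count("1")


def intra_cost(p):
    """Bit changes between neighbouring operations inside one packet."""
    return sum(popcount(a ^ b) for a, b in zip(p, p[1:]))


def inter_cost(p, q):
    """Bit changes between two successive packets, slot by slot."""
    return sum(popcount(a ^ b) for a, b in zip(p, q))


def shortest_rearrangement(layers):
    """layers[i] lists the candidates for packet i; returns (cost, chosen)."""
    cost = [intra_cost(c) for c in layers[0]]  # START edges weigh 0
    back = []  # back[i][j]: best predecessor of candidate j in layer i+1
    for prev, cur in zip(layers, layers[1:]):
        new_cost, choice = [], []
        for c in cur:
            k = min(range(len(prev)),
                    key=lambda j: cost[j] + inter_cost(prev[j], c))
            new_cost.append(cost[k] + inter_cost(prev[k], c) + intra_cost(c))
            choice.append(k)
        cost = new_cost
        back.append(choice)
    j = min(range(len(cost)), key=cost.__getitem__)  # END edges weigh 0
    total, path = cost[j], [j]
    for choice in reversed(back):  # follow predecessors back to START
        j = choice[j]
        path.append(j)
    path.reverse()
    return total, [layers[i][j] for i, j in enumerate(path)]


def brute_force(layers):
    """Exhaustive check, only feasible for very small inputs."""
    return min(
        sum(intra_cost(c) for c in combo)
        + sum(inter_cost(a, b) for a, b in zip(combo, combo[1:]))
        for combo in product(*layers)
    )


# Three toy packets of three 4-bit operations each.
packets = [(0b0001, 0b0111, 0b1000), (0b0110, 0b0000, 0b1111),
           (0b0011, 0b0101, 0b1001)]
layers = [list(permutations(p)) for p in packets]
total, chosen = shortest_rearrangement(layers)
print(total == brute_force(layers))  # True
```

The dynamic program visits each pair of candidates in adjacent layers once, while the exhaustive check grows with the product of all layer sizes.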

GOR Problem
• All the basic blocks in a program are considered simultaneously:
  – how many times each basic block is executed.
  – how often each basic block experiences cache misses.
  – how basic blocks are related to each other.

$$SW^{S} = \sum_{BB} SW^{BB}_{inter}(bb_i, bb_j) + \sum_{BB} SW^{BB}_{intra}(bb_i)$$

• $SW^{BB}_{inter}$ and $SW^{BB}_{intra}$ are expressed in terms of $SW^{FP}_{inter}$, $SW^{FP}_{intra}$, the weight of each basic block, and the cache miss rate.

Solution for GOR

[Figure: flow from the GOR Problem through Graph Construction and Graph Transformation (branch merging, loop rolling) to a Shortest Path Problem, which the LOR Algorithm solves to give the Solution.]

• This method may require an excessive amount of memory and cycles.
• We need a heuristic solution.

Heuristic for GOR
• Not all basic blocks are treated equally.
  – Basic blocks with larger effects on the total switching activity are reordered more thoroughly than ones with smaller effects.
• Not all the equivalent basic blocks in $EQ(bb_i)$ are tried when searching for a solution.
  – Only $N_{cand}$ equivalent basic blocks are created and included in the graph.

Experiment
TMS320C6201
• Fixed-point DSP
• VLIW processor that can specify eight 32-bit operations in a single 256-bit instruction.
• Uses a compressed encoding

[Figure: External Memory, connected by the 32-bit width External Bus to the Instruction Cache, connected by the 256-bit width Internal Bus to the VLIW Processor Core with functional units FU1 to FU8.]

Experiment Results

[Figure: bar chart of Relative BT/IF (y-axis, 0.0 to 1.2) for the default, LOR, and GOR-H encodings. Benchmark programs on the x-axis: vector multiply, FIR8, IIR, lattice analysis, W_vec, minerror, and average.]

For our benchmark programs, the bit transitions were reduced by 34% on average.

Conclusions
• Described a post-pass optimal operation rearrangement method for low-power VLIW instruction fetch.
  – The switching activity was reduced by 34% on average.
• Future work
  – The phase-ordering problem between operation rearrangement and other compiler optimization steps.
  – The operation rearrangement problem in superscalar processors.
