

SLIDE 1

Energy Minimization of Pipeline Processor Using a Low Voltage Pipelined Cache

Vincent J. Mooney III, Krishna Palem, Jun Cheol Park, and Kyu-won Choi Georgia Institute of Technology

{mooney, palem, jcpark, kwchoi}@ece.gatech.edu

SLIDE 2

Asilomar, Nov. 5, 2002

Outline

  • Introduction
  • Motivation and previous work
  • Approach
  • Methodology
  • Results
  • Conclusion and future work

SLIDE 3

Introduction

  • Power/energy is a topmost bottleneck in embedded systems
  • Mobile devices require longer usage time
  • Trade-off between performance and power
  • Goal: reduce power/energy without performance loss

SLIDE 4

Motivation & previous work

  • A cache is a power-hungry component of a system
  • Caches consume 42% of the power of a StrongARM 110 processor*

[Pie chart: cache vs. non-cache power]

*J. Montanaro et al., "A 160-MHz, 32-b, 0.5-W CMOS RISC microprocessor," IEEE Journal of Solid-State Circuits, 31(11):1703–1714, 1996.

SLIDE 5

Motivation & previous work

  • Intel XScale processor supports multiple frequencies and voltages
    • L. T. Clark et al., "An embedded 32-b microprocessor core for low-power and high-performance applications," IEEE Journal of Solid-State Circuits, 36(11):1599–1608, November 2001.
  • High voltage supply for critical paths and low voltage supply for non-critical paths
    • V. Moshnyaga and H. Tsuji, "Cache energy reduction by dual voltage supply," in Proc. Int. Symp. Circuits and Systems, pages 922–925, 2001.
  • Pipelining a cache to achieve lower cycle time
    • T. Chappell, B. Chappell, S. Schuster, J. Allan, S. Klepner, R. Joshi, and R. Franch, "A 2-ns cycle, 3.8-ns access 512-kb CMOS ECL SRAM with a fully pipelined architecture," IEEE Journal of Solid-State Circuits, 26(11):1577–1585, 1991.

SLIDE 6

Approach

  • Case 1: non-pipelined caches with the same voltage (Vdd) as the processor
    [5-stage pipeline: IF ID EX ME WB, with one-cycle I-cache (I$) and D-cache (D$)]
  • Case 2: caches pipelined, with a lower supply voltage and the same cycle time as Case 1
    [7-stage pipeline: IF1 IF2 ID EX ME1 ME2 WB, with two-stage I-cache (I$1, I$2) and D-cache (D$1, D$2) at lower Vdd]

SLIDE 7

Approach (Cont.)

Case 2 uses the same cycle time as Case 1: ideally the same execution time.

Case 2 saves power by using a lower supply voltage.

Two bottlenecks:

  • Branch penalty: a branch misprediction adds overhead for the pipelined instruction cache
  • Load-use penalty: a load instruction immediately followed by a dependent instruction adds overhead for the pipelined data cache
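The effect of the two penalties can be sketched with a simple cycle-count model. The penalty sizes and event counts below are illustrative assumptions, not figures from the paper:

```python
def pipelined_cycles(base_cycles, mispredictions, load_uses,
                     branch_penalty=2, load_use_penalty=1):
    """Estimate cycles for the pipelined-cache processor.

    The 2-stage instruction cache adds extra cycles per branch
    misprediction; the 2-stage data cache adds an extra stall per
    load-use hazard. Penalty sizes here are illustrative assumptions.
    """
    return (base_cycles
            + mispredictions * branch_penalty
            + load_uses * load_use_penalty)

# Illustrative run: 10,000 base cycles, 300 mispredictions, 500 load-use hazards
cycles = pipelined_cycles(10_000, 300, 500)
overhead_pct = (cycles - 10_000) / 10_000 * 100  # execution-time overhead in %
```

The model makes the later results plausible: even modest per-event penalties accumulate into a double-digit execution-time overhead.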

SLIDE 8

Methodology

[Diagram: Processor Model + Cache Model → System Energy]

SLIDE 9

Processor Model

MARS

  • A cycle-accurate Verilog model of a 5-stage RISC processor from the University of Michigan
  • Capable of running ARM instructions
  • Non-pipelined caches
  • BTFN (backward taken, forward not taken) branch prediction

MARS with 7-stage pipeline

  • 128-entry BTB (branch target buffer) with 2-bit counters
  • 2-stage IF (instruction fetch), 2-stage ME (memory access)
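As a sketch of the 2-bit counter scheme mentioned above (PC-modulo indexing and the initial counter state are assumptions, not details taken from the MARS model):

```python
class TwoBitPredictor:
    """Branch predictor table of 2-bit saturating counters.

    Sketch of the 2-bit counter scheme used with the 128-entry BTB;
    PC-modulo indexing and the initial state are assumptions.
    """
    def __init__(self, entries=128):
        self.entries = entries
        self.counters = [1] * entries  # start in "weakly not taken"

    def predict(self, pc):
        return self.counters[pc % self.entries] >= 2  # True = predict taken

    def update(self, pc, taken):
        i = pc % self.entries
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

Saturation means two consecutive mispredictions are needed to flip a strongly held prediction, which smooths out one-off events such as loop exits.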
SLIDE 10

Processor Model (Cont.)

Compile benchmarks using the ARM gcc compiler and generate hex ARM instructions (called VHX).

[Flow diagram: Benchmark Program (C/C++) → Binary Translation → ARM9-Based System Architecture → Functional Simulation (VCS) → Toggle Rate (Activity) Generation → Synthesize Verilog Model → Processor Core Power]

SLIDE 11

Processor Model (Cont.)

Functional simulation using Synopsys VCS.

Collect toggle rates of internal logic signals during the VCS simulation.

SLIDE 12

Processor Model (Cont.)

Synthesize the Verilog model using a TSMC 0.25 µm library.

SLIDE 13

Processor Model (Cont.)

Estimate power using Synopsys Power Compiler.

SLIDE 14

Cache model

CACTI 2.0*

  • An integrated cache access time, cycle time, and power model
  • Time and power estimation of each component
  • An RC-based, more detailed delay model used for technology scaling (i.e., supply voltage, threshold voltage)**

*G. Reinman and N. Jouppi, CACTI version 2.0, http://www.research.digital.com/wrl/people/jouppi/CACTI.html.
**N. Weste and K. Eshraghian, Principles of CMOS VLSI Design, Addison-Wesley, Santa Clara, California, 1992.

SLIDE 15

Cache model (Cont.)

The cache circuit is split into two parts for pipelining:

  • Pipeline stage 1: decoder, tag array, data array
  • Pipeline stage 2: mux, sense amplifier, comparator

The timing order of the circuit-level critical path is considered.

Direct-mapped, 32B block size; 16KB, 32KB, 64KB, 128KB, 256KB, and 512KB cache sizes simulated.

[Figure: CACTI 2.0 cache model]

SLIDE 16

Cache model (Cont.)

[Plots: delay for pipeline stage 1 and pipeline stage 2 (cache size: 16K–512K, block size: 32, direct mapped)]

  • Delay increases as the supply voltage is lowered
  • Delay of pipeline stage 1 also depends on the cache size
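The first trend can be sketched with the alpha-power-law delay model, delay ∝ Vdd / (Vdd − Vth)^α. The constants below are placeholders, not values fitted to the TSMC 0.25 µm process; CACTI 2.0 uses a more detailed RC-based model:

```python
def stage_delay(vdd, vth=0.5, alpha=1.3, k=1.0):
    """Relative gate delay vs. supply voltage (alpha-power law sketch).

    delay = k * vdd / (vdd - vth)**alpha; k, vth, and alpha are
    placeholder constants, not fitted process parameters.
    """
    return k * vdd / (vdd - vth) ** alpha

# Lowering Vdd from 2.75 V toward the threshold increases the delay
assert stage_delay(1.6) > stage_delay(2.75)
```

This captures why the voltage cannot be lowered arbitrarily: delay grows steeply as Vdd approaches the threshold voltage.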
SLIDE 17

Cache model (Cont.)

[Plots: energy for pipeline stage 1 and pipeline stage 2 (cache size: 16K–512K, block size: 32, direct mapped)]

  • Energy depends on the cache size and the supply voltage
SLIDE 18

SLIDE 19

Optimization of energy and delay

Pipelined cache for high performance:

  • Reduced cycle time with the same supply voltage

Pipelined cache for low power:

  • Reduced supply voltage without changing the cycle time

[Timing diagram:
Base case: cycle time = 10 ns, Vdd = 2.75 V.
Pipelined cache for high performance: cycle time = 5 ns, Vdd = 2.75 V, E = C(2.75)^2 = 7.56C.
Pipelined cache for low power: cycle time = 10 ns (with idle slack per stage), Vdd = 1.6 V, E = C(1.6)^2 = 2.56C.
Energy savings = (7.56 − 2.56)C / 7.56C × 100 = 66%]
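The 66% figure follows directly from the dynamic-energy relation E = C·Vdd², with the switched capacitance C left symbolic (1.0 here):

```python
def dynamic_energy(vdd, c=1.0):
    """Dynamic switching energy E = C * Vdd^2, in units of C."""
    return c * vdd ** 2

e_base = dynamic_energy(2.75)  # 7.5625 C
e_low = dynamic_energy(1.6)    # 2.56 C
saving = (e_base - e_low) / e_base * 100  # ~66%, as on the slide
```

Because energy scales quadratically with Vdd, even a moderate voltage reduction (2.75 V to 1.6 V) yields a large energy saving at the same cycle time.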

SLIDE 20

Optimization of energy and delay (Cont.)

Optimized supply voltage for the cache: voltage optimization procedure for the pipelined cache.

Input: Vdd range, delay_base
Output: power-optimal Vdd

Vdd range ← [2.75 V – 0.6 V]
Vdd(0) = Max(Vdd range)
For i steps do
    Calculate delay_stage1(Vdd(i))
    Calculate delay_stage2(Vdd(i))
    If Max[delay_stage1(Vdd(i)), delay_stage2(Vdd(i))] < delay_base
        Vdd_optimal = Vdd(i)
    endIf
    Decrease Vdd(i)
endFor
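A runnable sketch of the procedure: it sweeps Vdd downward and keeps the lowest value at which both pipeline stages still beat the base cycle time. The stage-delay callables in the usage example are hypothetical placeholders, not CACTI outputs:

```python
def optimal_vdd(delay_base, delay_stage1, delay_stage2,
                vdd_max=2.75, vdd_min=0.6, step=0.05):
    """Lowest Vdd in [vdd_min, vdd_max] meeting the base cycle time.

    delay_stage1/delay_stage2 map a Vdd to that stage's delay (ns).
    Assumes delay grows monotonically as Vdd is lowered, so the last
    passing Vdd in the sweep is the optimum.
    """
    vdd_optimal = None
    vdd = vdd_max
    while vdd >= vdd_min:
        if max(delay_stage1(vdd), delay_stage2(vdd)) < delay_base:
            vdd_optimal = vdd  # still meets timing; keep lowering
        vdd = round(vdd - step, 10)
    return vdd_optimal

# Hypothetical delay models (placeholders): delay blows up near Vth = 0.5 V
d1 = lambda v: 2.0 / (v - 0.5)
d2 = lambda v: 1.0 / (v - 0.5)
best = optimal_vdd(delay_base=2.0, delay_stage1=d1, delay_stage2=d2)
```

With these toy models the timing constraint 2.0/(Vdd − 0.5) < 2.0 first fails at 1.50 V, so the sweep settles on 1.55 V.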

SLIDE 21

Optimization of energy and delay (Cont.)

The pipelined cache saves a maximum of 69.60% energy.

Energy/delay for a pipelined cache:

            Base case                         Pipelined cache
Cache (KB)  Vdd (V)  Delay (ns)  Energy (nJ)  Delay1 (ns)  Delay2 (ns)  Vdd (V)  Energy (nJ)  % saving
16          2.75     0.648       5.689        0.438        0.210        1.6      1.729        69.60
32          2.75     1.021       9.019        0.814        0.206        2.0      4.534        49.73
64          2.75     1.741       15.357       1.540        0.201        2.3      10.450       31.95
128         2.75     3.190       27.942       2.991        0.199        2.5      22.767       18.52
256         2.75     6.254       54.605       6.060        0.195        2.65     50.442       7.62
512         2.75     12.224      105.477      12.030       0.194        2.7      101.422      3.84
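The 16 KB row can be checked against the procedure's timing constraint and the % saving column (numbers copied from the table above):

```python
# 16 KB row: base case 2.75 V, 0.648 ns, 5.689 nJ; pipelined stages
# 0.438 ns + 0.210 ns at 1.6 V, 1.729 nJ
delay_base, energy_base = 0.648, 5.689
delay1, delay2, energy_pipe = 0.438, 0.210, 1.729

# Each pipelined stage individually fits within the base cycle time
assert max(delay1, delay2) < delay_base

saving = (energy_base - energy_pipe) / energy_base * 100  # ~69.6%
```

Note the stage delays sum to the base delay (0.438 + 0.210 = 0.648 ns), so pipelining splits the access without lengthening it, freeing slack that the lower Vdd then consumes.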

SLIDE 22

Results

Execution time increased 15.35% on average due to the branch misprediction penalty and the load-use penalty.

  • A more accurate branch prediction scheme is required
  • Dynamic instruction scheduling (such as out-of-order execution) or static instruction scheduling (such as compiler optimization) is required

Execution time (ICache = 16 KB, DCache = 16 KB):

                                     Base case               Pipelined cache
Benchmark  Misprediction  Load use   E.T. (ns)  Power (mW)   E.T. (ns)  Power (mW)   E.T. % increase
sort_int   177            201        26595      1002         31465      1008         18.31
matmul     604            512        90485      1114         105293     1121         16.36
arith      105            151        43765      1079         47987      1086         9.65
factorial  4              1002       192345     981          221196     987          15.00
fib        125            178        40635      1057         47719      1063         17.43
Average                                                                              15.35
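The E.T. % increase column is simply the relative slowdown; for example, the sort_int row (numbers taken from the table above):

```python
# sort_int: base 26595 ns vs. pipelined 31465 ns execution time
et_base, et_pipe = 26_595, 31_465
increase = (et_pipe - et_base) / et_base * 100  # ~18.31%
```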

SLIDE 23

Results (Cont.)

Average 24.85% power saving.

Processor core power does not change much between the 5-stage and 7-stage pipelines.

The variation in total processor power depends mainly on the cache power.

Power distribution (ICache = 16 KB, DCache = 16 KB):

           Base case (mW)                     Pipelined cache (mW)
Benchmark  Core  I-Cache  D-Cache  Total      Core  I-Cache  D-Cache  Total   % reduction
sort_int   1002  411      98       1511       1008  120      25       1154    23.67
matmul     1114  450      142      1706       1121  134      37       1292    24.27
arith      1079  488      66       1634       1086  154      18       1258    22.96
factorial  981   475      118      1574       987   143      31       1161    26.24
fib        1057  513      149      1719       1063  151      39       1253    27.09
Average                                                                       24.85

SLIDE 24

Results (Cont.)

Average 13.33% energy saving.

The increase in execution time degrades the energy reduction.

To maximize the advantage of the pipelined cache, a precise branch prediction scheme and an instruction scheduler (for load-use hazards) are required.

Energy distribution (ICache = 16 KB, DCache = 16 KB):

           Base case (nJ)                          Pipelined cache (nJ)
Benchmark  Core    I-Cache  D-Cache  Total        Core    I-Cache  D-Cache  Total    % reduction
sort_int   26660   10929    2606     40195        31715   3789     794      36298    9.70
matmul     100830  40723    12823    154377       118034  14118    3898     136050   11.87
arith      47238   21363    2896     71496        52105   7406     880      60392    15.53
factorial  188628  91328    22791    302747       218220  31662    6928     256810   15.17
fib        42953   20845    6053     69851        50743   7227     1840     59810    14.38
Average                                                                              13.33
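Since energy is power integrated over execution time, the three reported averages can be cross-checked against one another (only approximately, since the slides average per-benchmark percentages rather than ratios):

```python
power_saving = 0.2485   # 24.85% average power reduction
time_increase = 0.1535  # 15.35% average execution-time increase

# energy ratio = (power ratio) * (time ratio)
energy_ratio = (1 - power_saving) * (1 + time_increase)
energy_saving = (1 - energy_ratio) * 100  # ~13.3%, near the reported 13.33%
```

This makes the headline numbers consistent: most of the power saving survives as an energy saving, but the 15.35% slowdown claws back roughly half of it.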

SLIDE 25

Conclusion and future work

  • Pipelined cache with a lower supply voltage explored
  • Maximum 69.6% cache energy saving
  • 24.85% power and 13.33% energy saved overall
  • The power savings are partially masked by the execution-time increase
  • Branch prediction and the load-use penalty must be considered to maximize the energy saving

SLIDE 26

Thank you.