
Efficient Utilization of Scratch‐Pad Memory in Preemptive Multi‐Task Systems

Hiroyuki Tomiyama Ritsumeikan University http://hiroyuki.tomiyama‐lab.org/

Motivation

• Memory is one of the most energy-hungry subsystems in embedded systems
  • Up to 50% of total energy
• Cache improves energy efficiency by reducing off-chip memory accesses
• Cache is still energy hungry because of
  • Tag comparison
  • Automatic replacement mechanism
  • Parallel accesses to multiple ways (in high-performance cache)
• Use of SPM instead of (in addition to) cache
  • Normalized read energy (calculated by CACTI 5.0)

    SPM    Direct-mapped cache    2-way cache    4-way cache
    1      1.56                   1.93           2.54

  • SPM is energy efficient but small


Overview

How to efficiently utilize SPM in the presence of multiple tasks?

For simplicity, this talk focuses on instruction memory
  • Static data is OK, but stack and heap data need special care.

Outline

SPM partitioning and code allocation for
  • Non-preemptive multi-task systems
    [Takase, Tomiyama and Takada, VLSI-DAT 2009]
  • Preemptive multi-task systems
    [Takase, Tomiyama and Takada, DATE 2010]

Code layout for inter-task interference minimization
  • [Gauthier, Ishihara, Takase, Tomiyama and Takada, CASES 2010]

Main Contributors

Hideki Takase (Ph.D. Candidate, Nagoya University)
Lovic Gauthier (Associate Professor, Kyushu University)


SPM Allocation Principle

Which memory objects should be placed in SPM?
  • Memory objects can be functions (procedures), basic blocks, or other granularity.
  • For simplicity, we consider functions as memory objects.

Knapsack problem
  • x_i: 1 if the i-th function is placed in SPM; otherwise 0
  • fetch_i: number of accesses to the i-th function (obtained by profiling)
  • size_i: code size of the i-th function

Maximize

$\sum_i fetch_i \times x_i$

Subject to

$\sum_i size_i \times x_i \le SPMsize$
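The knapsack formulation above can be solved exactly for small inputs with a standard dynamic program over the SPM capacity. The following is a minimal sketch, not the authors' ILP setup: the function name `allocate_spm` and the sample fetch counts and sizes are hypothetical, and sizes are assumed to be whole bytes.

```python
from typing import List, Tuple


def allocate_spm(fetch: List[int], size: List[int], spm_size: int) -> Tuple[int, List[int]]:
    """Solve the 0/1 knapsack: maximize sum(fetch[i] * x[i])
    subject to sum(size[i] * x[i]) <= spm_size."""
    n = len(fetch)
    # best[c] = largest total fetch count reachable with capacity c
    best = [0] * (spm_size + 1)
    # keep[i][c] = True if function i is taken in the best solution for capacity c
    keep = [[False] * (spm_size + 1) for _ in range(n)]
    for i in range(n):
        # Iterate capacity downwards so each function is used at most once.
        for c in range(spm_size, size[i] - 1, -1):
            candidate = best[c - size[i]] + fetch[i]
            if candidate > best[c]:
                best[c] = candidate
                keep[i][c] = True
    # Backtrack to recover which functions (x_i = 1) were selected.
    chosen = []
    c = spm_size
    for i in range(n - 1, -1, -1):
        if keep[i][c]:
            chosen.append(i)
            c -= size[i]
    return best[spm_size], sorted(chosen)


# Hypothetical profile: four functions and a 1 KB SPM.
total_fetch, in_spm = allocate_spm(
    fetch=[900, 500, 400, 100], size=[600, 300, 300, 200], spm_size=1024
)
print(total_fetch, in_spm)
```

With these sample numbers, functions 0 and 1 are selected: together they occupy 900 bytes and cover 1400 fetches, which no other feasible subset beats.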


Task Execution Model

Task states
  • Dormant / Ready / Running

Task scheduling policy
  • All tasks are periodic and independent
  • Fixed-priority-based scheduling
    • The highest-priority task among ready tasks gets dispatched when CPU becomes available
  • Periods and priorities of tasks are statically decided
  • No task preemption

[Figure: task state transitions. Dormant to Ready (activate), Ready to Running (dispatch), Running to Dormant (terminate)]
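The non-preemptive dispatch rule described on this slide can be sketched as a small event-driven simulation. This is an illustrative sketch only: the `Task` tuple, the function name, and the sample task set are hypothetical, time is in abstract integer units, and a smaller `priority` value is assumed to mean a higher priority.

```python
import heapq
from typing import List, NamedTuple, Tuple


class Task(NamedTuple):
    name: str
    period: int
    exec_time: int
    priority: int  # assumption: smaller value = higher priority


def simulate_nonpreemptive(tasks: List[Task], horizon: int) -> List[Tuple[int, str]]:
    """Return (start_time, task_name) for every job activated in [0, horizon)."""
    # All activations (Dormant -> Ready), sorted by release time.
    pending = sorted(
        (release, t.priority, t.name, t.exec_time)
        for t in tasks
        for release in range(0, horizon, t.period)
    )
    ready: list = []  # heap ordered by priority
    schedule = []
    time, idx = 0, 0
    while idx < len(pending) or ready:
        # Move every job released up to the current time into the ready queue.
        while idx < len(pending) and pending[idx][0] <= time:
            release, prio, name, cost = pending[idx]
            heapq.heappush(ready, (prio, release, name, cost))
            idx += 1
        if not ready:
            time = pending[idx][0]  # CPU idles until the next activation
            continue
        # Dispatch the highest-priority ready job; it runs to completion.
        _, _, name, cost = heapq.heappop(ready)
        schedule.append((time, name))
        time += cost
    return schedule


tasks = [Task("A", period=4, exec_time=1, priority=0),
         Task("B", period=8, exec_time=4, priority=1)]
print(simulate_nonpreemptive(tasks, horizon=8))
```

In this sample, the second job of A is released at time 4 while B is still running, so it starts only at time 5. That delay is the cost of forbidding preemption.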


SPM Partitioning and Code Allocation

SPM partitioning
  • Assignment of SPM address space to tasks

Code allocation
  • Assignment of memory objects to SPM

Three methods
  • Spatial method
  • Temporal method
  • Hybrid method

[Figure: three panels, one per method, showing SPM usage by Task1, Task2, and Task3 over execution time. Legend: task arrival, task is running, MM-SPM copy]


Spatial Method

• SPM space is exclusively partitioned and assigned to tasks
  • No transfer necessary between SPM and main memory
• Effective for large SPM
• ILP formulation of simultaneous partitioning and allocation
  • func_{i,j}: j-th function of i-th task
  • x_{i,j}: 1 if func_{i,j} is placed in SPM
  • period_i: period of i-th task
  • hyperperiod: least common multiple of periods

[Figure: SPM region]
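The slide lists the ILP symbols but not the objective itself. One plausible reading, stated here purely as an assumption, is that each function's fetch count is weighted by how often its task runs in one hyperperiod (hyperperiod / period_i), and that all functions of all tasks then compete for a single SPM budget. The sketch below enumerates every subset, so it only suits tiny examples; all names and numbers are hypothetical.

```python
from functools import reduce
from itertools import product
from math import gcd


def hyperperiod(periods):
    """Least common multiple of all task periods."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods)


def spatial_allocation(tasks, spm_size):
    """tasks: {task_name: (period, [(func_name, fetch, size), ...])}.

    Assumed objective: maximize sum over i, j of
    (hyperperiod / period_i) * fetch_{i,j} * x_{i,j}
    subject to the total size of selected functions <= spm_size.
    """
    hp = hyperperiod([period for period, _ in tasks.values()])
    # Flatten to (task, func, weighted fetch count, size).
    items = [
        (task, func, fetch * (hp // period), size)
        for task, (period, funcs) in tasks.items()
        for func, fetch, size in funcs
    ]
    best_gain, best_set = 0, []
    for mask in product((0, 1), repeat=len(items)):  # brute force over x_{i,j}
        used = sum(x * item[3] for x, item in zip(mask, items))
        gain = sum(x * item[2] for x, item in zip(mask, items))
        if used <= spm_size and gain > best_gain:
            best_gain = gain
            best_set = [(item[0], item[1]) for x, item in zip(mask, items) if x]
    return hp, best_gain, best_set


tasks = {
    "T1": (10, [("f", 100, 512), ("g", 40, 256)]),
    "T2": (20, [("h", 150, 512), ("k", 30, 256)]),
}
print(spatial_allocation(tasks, spm_size=1024))
```

The SPM partition falls out of the selection: each task's share is simply the total size of its selected functions, here 512 bytes for T1 and 512 bytes for T2.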


Temporal Method

• Running task may use entire SPM space
  • When dispatched, code is transferred from main memory to SPM.
• Effective for small SPM
• ILP formulation of simultaneous partitioning and allocation
  • Eoverhead_{i,j}: energy overhead for transfer of func_{i,j}
  • y_{i,j}: 1 if func_{i,j} is placed in SPM

[Figure: SPM region]
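In the temporal method each task sees the whole SPM, but every function it places there must be copied in at dispatch. A hedged way to picture the trade-off is a per-task knapsack on net gain, that is, energy saved by SPM fetches minus the transfer overhead Eoverhead_{i,j}. The energy model below (a fixed saving per fetch and a fixed copy cost per byte) is an assumption made for illustration, as are all names and numbers.

```python
from itertools import combinations


def temporal_allocation(funcs, spm_size, e_saved_per_fetch, e_copy_per_byte):
    """funcs: list of (name, fetches_per_execution, size_bytes) for ONE task.

    Assumed net gain of placing a function in SPM:
        fetches * e_saved_per_fetch - size * e_copy_per_byte
    where the second term plays the role of Eoverhead_{i,j}.
    """
    best_gain, best_subset = 0.0, ()
    for r in range(len(funcs) + 1):
        for subset in combinations(funcs, r):
            if sum(size for _, _, size in subset) > spm_size:
                continue  # does not fit into the SPM
            gain = sum(
                fetches * e_saved_per_fetch - size * e_copy_per_byte
                for _, fetches, size in subset
            )
            if gain > best_gain:
                best_gain, best_subset = gain, subset
    return best_gain, [name for name, _, _ in best_subset]


funcs = [("hot", 1000, 256), ("warm", 300, 512), ("cold", 10, 128)]
print(temporal_allocation(funcs, spm_size=1024,
                          e_saved_per_fetch=1.0, e_copy_per_byte=0.5))
```

In this sample all three functions would fit, yet `cold` is left in main memory because copying it costs more than its few fetches save. This is the main difference from the spatial method, where placement has no run-time transfer cost.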


Hybrid Method

Mixture of spatial and temporal approaches
  • More flexible than the two approaches

Partition the SPM space into two regions
  • Spatial region
  • Temporal region

Spatial region is further partitioned and assigned to tasks statically

[Figure: SPM region over execution time for Task1, Task2, and Task3, split into a spatial region and a temporal region. Legend: task arrival, task is running, MM-SPM copy]


Hybrid Method

ILP Formulation
  • Partitioning of SPM into spatial region and temporal one
  • Partitioning of spatial region into tasks
  • Code allocation for temporal region


Experimental Setup and Tools

• Simulator: SimpleScalar / ARM
  • An instruction-set simulator of ARM7TDMI microprocessor
• Compiler: arm-linux-gcc 2.95.2
• ILP solver: GNU GLPK 4.23
• Memory configurations:
  • On-chip: 16 KBytes 4-way cache + 4K / 8K / 12K / 16 KBytes SPM
    • Energy model: CACTI 4.2
  • Off-chip main memory: Mobile DDR SDRAM
    • Energy model: Micron System-Power Calculator
• Benchmark task sets (from MiBench suite)
  • TasksetA: bf / tiff2rgba
  • TasksetB: cjpeg / crc / qsort / tiff2rgba
  • TasksetC: bitcnts / cjpeg / ispell / rawcaudio / sha
  • TasksetD: bitcnts / bf / crc / dijkstra / ispell / qsort / rawcaudio / sha
  • TasksetE: bitcnts / bf / cjpeg / crc / dijkstra / ispell / qsort / rawcaudio / sha / tiff2rgba


Experimental Procedure


Results: TasksetE (10 tasks)

Std: Simple spatial method where SPM is partitioned equally to every task
Spt: Spatial method, Tmp: Temporal method, Hyb: Hybrid method

[Figure: stacked bar chart of energy [mJ] (axis 0 to 80) for Std, Spt, Tmp, and Hyb at SPM sizes 4k, 8k, 12k, and 16k, broken down into cache hit, cache miss, SPM hit, and overhead. Chart annotation: 47.2 %]


Results: TasksetA (2 tasks)

[Figure: stacked bar chart of energy [mJ] (axis 0 to 4) for Std, Spt, Tmp, and Hyb at SPM sizes 4k, 8k, 12k, and 16k, broken down into cache hit, cache miss, SPM hit, and overhead. Chart annotation: 28.4 %]


Results: TasksetA / TasksetC / TasksetE

  • Hybrid approach is consistently good
  • Increased SPM size is not always effective

[Figure: stacked bar chart of normalized energy consumption (axis 0 to 1.2) for setA, setC, and setE at SPM sizes 4K, 8K, 12K, and 16K, with bars labeled Std, Rgn, Prd, and Hyb, broken down into cache hit, cache miss, SPM hit, and overhead]


Preemptive Multi‐Task Systems

Task states
  • Dormant / Ready / Running

Task scheduling policy
  • All tasks are periodic and independent
  • Fixed-priority preemptive scheduling
    • Periods and priorities of tasks are statically decided
    • Higher-priority task preempts lower-priority task under execution

[Figure: task state transitions. Dormant to Ready (activate), Ready to Running (dispatch), Running to Ready (preempted), Running to Dormant (terminate)]


SPM Partitioning and Code Allocation

Spatial method
  • Same as non-preemptive systems

Temporal method

Hybrid method


Temporal Method

Running task may use entire SPM space

Program code is transferred at most twice per execution
  1. When the task gets started
  2. When a higher-priority task is completed and a preempted task resumes execution
     • The contents of the preempted task need to be restored into SPM


Temporal Method: ILP Formulation

  • Eoverhead_{i,j}: energy consumption of transferring func_{i,j}
  • SPMsize_tmp_i: amount of SPM space that task_i can use
  • y_{i,j}: 1 if func_{i,j} is placed in SPM


Hybrid Method

• Mixture of the two methods
  • At compile time, SPM is partitioned by the spatial method
  • At run time, a higher-priority task may preempt not only CPU but also SPM space of lower-priority tasks
  • Reduces overhead of high-priority tasks

[Figure: execution timelines of Task1, Task2, and Task3 (legend: task arrival, task is running, MM-SPM copy). Captions: Task1 preempts the SPM spaces of Task2 and Task3; the contents of SPM are restored]


Hybrid Method: ILP Formulation

• SPMsize_spt_i
  • SPM size statically assigned to task_i by spatial method
  • Constraint (1)
• SPMsize_tmp_i
  • SPM size which task_i preempts by temporal method
  • Constraint (2)


Experimental Setup and Tools

• Simulator: SkyEye-1.2.6_rc1 (ARM920T)
• ILP solver: GNU GLPK 4.23
• Compiler: arm-elf-gcc 4.1.1
• RTOS: TOPPERS/ASP Kernel (Release 1.3.2)
• Memory configurations:
  • On-chip: 4 KBytes 4-way cache + 1 / 2 / 4 / 8 KBytes SPM
  • Off-chip main memory: Mobile DDR SDRAM
  • Energy model: CACTI 5.3
• Task sets: tasks are selected from EEMBC suites
  • SetA: aifftr, basefp, bitmnp, cacheb, idctrn
  • SetB: bezier, dither, ospf, pktflow, rotate, routelookup, text
  • SetC: conven, rgbcmy, rgbriq, viterb, and SetB
  • SetD: SetA and SetC
  • The periods were set to be proportional to the tasks' execution times
  • The total CPU utilization of each task set was set to about 50%
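The period assignment rule above is a one-line calculation once a proportionality constant is fixed. The sketch below assumes the same constant k for every task (period_i = k × exec_i), which makes each task's utilization 1/k and the total n/k, so k = n / U for a target utilization U. The slides do not state the constant, so this is an assumption, and the execution times are made up.

```python
def assign_periods(exec_times, total_utilization=0.5):
    """Return periods proportional to execution times such that
    sum(exec_i / period_i) equals total_utilization.

    Assumption: one common factor k for all tasks, k = n / U.
    """
    k = len(exec_times) / total_utilization
    return [k * c for c in exec_times]


exec_times = [2.0, 5.0, 10.0]          # hypothetical execution times
periods = assign_periods(exec_times)    # k = 3 / 0.5 = 6
utilization = sum(c / p for c, p in zip(exec_times, periods))
print(periods, utilization)
```

With three tasks and a 50% target, every period is six times the task's execution time, and each task contributes one sixth of the CPU time.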


Overall Workflow


Results: SetC (11 tasks)

Std: Simple method where SPM space is partitioned equally to each task
Spt: Spatial method, Tmp: Temporal method, Hyb: Hybrid method

[Figure: stacked bar chart of energy consumption [uJ] (axis ticks 400 to 1600) for Std, Spt, Tmp, and Hyb at SPM sizes 1k, 2k, 4k, and 8k, broken down into cache hit, cache miss, SPM hit, and overhead. Chart annotation: 73 %]


Results: SetA / SetB / SetD

[Figure: stacked bar chart of normalized energy (axis 0 to 1.0) for SetA, SetB, and SetD at SPM sizes 1k, 2k, 4k, and 8k, with bars Std, Spt, Tmp, and Hyb, broken down into cache hit, cache miss, SPM hit, and overhead]

Hybrid method achieves the best energy saving


Code Layout Optimization

So far, we have not considered the addresses at which memory objects are placed in SPM.

We can further reduce the energy by optimizing code layout in the temporal region.

Base idea: Less overlap
  • Cost proportional to size of overlap
  • Not all overlaps are equivalent: the cost depends on scheduling

Fortunately, the schedule is statically known for fixed-priority, periodic tasks

[Figure: two layouts of block β0 (task t0) and block β1 (task t1) in SPM space (bytes, marks at 512 and 1K): inefficient sharing vs. efficient sharing]
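The base idea can be made concrete with a small cost function: for each pair of tasks that conflict in the schedule, charge the number of overlapping bytes times the number of conflicts. This is only a sketch of the stated proportionality, not the ILP from the cited paper; the block positions, the conflict count, and `cost_per_byte` are hypothetical.

```python
def overlap_bytes(a, b):
    """Bytes shared by two SPM blocks given as (start_offset, size)."""
    (a_start, a_size), (b_start, b_size) = a, b
    return max(0, min(a_start + a_size, b_start + b_size) - max(a_start, b_start))


def layout_cost(blocks, conflicts, cost_per_byte=1.0):
    """blocks: {task: (start_offset, size)};
    conflicts: {(task_a, task_b): number of conflicts per hyperperiod}.

    Cost is proportional to overlap size, weighted by how often
    the two tasks actually interfere in the schedule.
    """
    return sum(
        cost_per_byte * count * overlap_bytes(blocks[a], blocks[b])
        for (a, b), count in conflicts.items()
    )


conflicts = {("t1", "t0"): 2}
inefficient = {"t0": (0, 512), "t1": (256, 512)}   # blocks share 256 bytes
efficient = {"t0": (0, 512), "t1": (512, 512)}     # blocks are disjoint
print(layout_cost(inefficient, conflicts), layout_cost(efficient, conflicts))
```

Two tasks that never conflict can overlap for free, which is why the conflict counts, and therefore the schedule, matter as much as the addresses.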


Code Layout Optimization

• Schedule of hyperperiod (LCM of task periods)
  • t1 conflicts with t0 twice
  • t2 conflicts with t0 and t1 once

[Figure: schedule of t0, t1, and t2 over time within one hyperperiod, with preemption points P0 {t1: t0}, P0 {t1: t0}, and P1 {t2: t0, t1}]

• ILP formulation: See our CASES'10 paper
  • Lovic Gauthier, Tohru Ishihara, Hideki Takase, Hiroyuki Tomiyama, Hiroaki Takada, "Minimizing Inter-Task Interferences in Scratch-Pad Memory Usage for Reducing the Energy Consumption of Multi-Task Systems"
  • Simultaneously decide code allocation and layout
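Because the schedule over one hyperperiod is fixed, conflict counts like those on this slide can be obtained by simulating it. The sketch below uses unit time steps and one interpretation of "conflict", namely that each time a preempted job resumes, it conflicts once with every task that ran while it was suspended. That interpretation, the function name, and the task parameters are assumptions chosen so that the counts match the slide; the real task set is not given.

```python
from collections import Counter
from functools import reduce
from math import gcd


def count_conflicts(tasks):
    """tasks: list of (name, period, exec_time), highest priority first.
    Assumes the task set is schedulable (each job ends before its next release)."""
    hp = reduce(lambda a, b: a * b // gcd(a, b), [p for _, p, _ in tasks])
    remaining = {n: 0 for n, _, _ in tasks}      # work left in the current job
    started = {n: False for n, _, _ in tasks}    # has the current job run yet?
    intruders = {n: set() for n, _, _ in tasks}  # who ran while the job was suspended
    conflicts = Counter()
    for t in range(hp):
        for name, period, cost in tasks:
            if t % period == 0:                  # periodic activation
                remaining[name], started[name] = cost, False
                intruders[name].clear()
        # Fixed-priority preemptive dispatch: first unfinished task in priority order.
        running = next((n for n, _, _ in tasks if remaining[n] > 0), None)
        if running is None:
            continue                             # CPU idle
        for n, _, _ in tasks:
            if n != running and started[n] and remaining[n] > 0:
                intruders[n].add(running)        # n is currently preempted
        for other in intruders[running]:         # running resumes after preemption
            conflicts[(running, other)] += 1
        intruders[running].clear()
        started[running] = True
        remaining[running] -= 1
    return hp, dict(conflicts)


tasks = [("t0", 4, 1), ("t1", 8, 4), ("t2", 16, 3)]
print(count_conflicts(tasks))
```

For this hypothetical task set the hyperperiod is 16 time units: t1 is interrupted by t0 twice, and t2 is suspended once while both t0 and t1 run. Counts of this kind are the scheduling-dependent weights that a layout optimization needs.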

Dynamic Management of SPM

• Partition SPM in areas based on block boundaries
  • For simplicity, we assume a task uses a single block (with contiguous address space).
• At context switch, save modified areas (for data objects) and load areas individually where overlapped (overwritten)

[Figure: SPM space (bytes, marks at 512 and 1K) divided into areas α0 to α3 by the boundaries of blocks β0, β1, and β2 of tasks t0, t1, and t2]


SPM Management Algorithm

    for each block of next task do
        for each area covered by block do
            Save area for previous task's data
            Load area for next task
        end
    end

Implement an SPM management procedure in the task dispatcher of RTOS

Use two tables
  • Block table (static)
  • Area table (variable)


Implementation in RTOS

Block table (static)

    Block    Covered areas (Begin, End)    Inst. or Data
    β0       α0, α2                        I
    β1       α1, α3                        I
    β2       α0, α1                        D

Area table (variable)

    Area    Address     Size    Block
    α0      a0000000    384B    β2
    α1      a0000180    128B    β1
    α2      a0000200    128B    β1
    α3      a0000280    384B    β1
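The dispatcher procedure and its two tables can be sketched as follows, using the table contents shown above as the starting state. This is an illustrative model rather than the RTOS code: it only records which copy operations would be issued. Skipping areas that already hold the requested block and saving only areas that currently hold data blocks are assumptions drawn from the "save modified areas (for data objects)" and "where overlapped" wording.

```python
# Static block table: block -> (first area, last area, "I" or "D").
BLOCK_TABLE = {
    "β0": ("α0", "α2", "I"),
    "β1": ("α1", "α3", "I"),
    "β2": ("α0", "α1", "D"),
}
AREA_ORDER = ["α0", "α1", "α2", "α3"]


def initial_area_table():
    """Variable area table: area -> address, size, block currently resident."""
    return {
        "α0": {"addr": 0xA0000000, "size": 384, "block": "β2"},
        "α1": {"addr": 0xA0000180, "size": 128, "block": "β1"},
        "α2": {"addr": 0xA0000200, "size": 128, "block": "β1"},
        "α3": {"addr": 0xA0000280, "size": 384, "block": "β1"},
    }


def switch_to(blocks_of_next_task, area_table):
    """Return the (action, area, block) operations performed at dispatch
    and update the area table in place."""
    ops = []
    for block in blocks_of_next_task:
        first, last, _ = BLOCK_TABLE[block]
        lo, hi = AREA_ORDER.index(first), AREA_ORDER.index(last)
        for area in AREA_ORDER[lo:hi + 1]:
            resident = area_table[area]["block"]
            if resident == block:
                continue  # assumption: area already holds this block, nothing to copy
            if resident is not None and BLOCK_TABLE[resident][2] == "D":
                ops.append(("save", area, resident))  # write data back to main memory
            ops.append(("load", area, block))         # copy from main memory to SPM
            area_table[area]["block"] = block
    return ops


table = initial_area_table()
for op in switch_to(["β0"], table):
    print(op)
```

Dispatching a task that uses β0 touches only areas α0 to α2. Area α0 is saved first because it holds the data block β2, while α3 keeps β1 and costs nothing.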

Experimental Setup

  • Toshiba MeP processor simulator
  • 8kB Cache / 0-16kB SPM
  • Applications: from EEMBC and MiBench
  • ILP Solver: ILOG CPLEX

    Set | Tasks | Nb. of inst. mem. objs
    0   | fft, mp3 decode, mp3 decode, string search | 147
    1   | patricia, patricia, string search, string search 2 | 56
    2   | cubic, fft, patricia, qsort, string search | 108
    3   | cubic, qsort, rad2deg, adpcm encode, adpcm decode | 64
    4   | cubic, patricia, qsort, rad2deg, adpcm encode, adpcm decode, string search, string search 2 | 206
    5   | cubic, fft, mad, mpeg decode, mp3 decode, patricia, qsort, rad2deg, adpcm encode, adpcm decode, string search | 333


Results

[Figure: six panels (Set 0 to Set 5), each plotting normalized energy consumption against code SPM size (1, 2, 4, 8, 16, labeled "Kb" on the axis) for the Space, Time, Hybrid, One, and Two methods]

  • Additional 10% energy reduction over hybrid method
  • One: one block per task
  • Two: two blocks per task
  • Baseline: No SPM


Conclusions

Efficient utilization of SPM for preemptive multi-task systems
  • Simultaneous SPM partitioning and code allocation
    • Hybrid spatial/temporal method
  • Code layout optimization
  • RTOS implementation

Future work
  • Faster algorithm
  • Stack/heap memory management
    • Published in part at ESTIMedia'09 and SASIMI'10
    • Needs more work to be implemented
  • Multiprocessors with shared SPM