
SLIDE 1

12-Aug-04, Advanced Compiler Research Laboratory, School of Computer Science and Engineering, Seoul National University

Memory Hierarchy Optimizations with Compilers/Software

Jaejin Lee

Advanced Compiler Research Laboratory
School of Computer Science and Engineering
Seoul National University
jlee@cse.snu.ac.kr
http://aces.snu.ac.kr/~jlee


Outline

■ Heterogeneous Multithreading (Helper Threading)
  Intelligent Memory
  Co-execution
  Prefetching
■ Compiler-Assisted Demand Paging
■ Wrap-up

SLIDE 2


Memory Wall Problem

■ The performance gap between processors and memory
  Microprocessor performance has been improving at a rate of 60% per year.
  The access time to DRAM has been improving at a rate of less than 10% per year.
■ The performance of applications is dominated by memory.
■ Thousands of papers.


The Intelligent Memory Architecture

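[Figure: a processor chip (P.host with L1 and L2 caches) runs the main thread. It is connected through an off-the-shelf interconnection to a memory chip (P.mem with an L1 cache and DRAM) that runs the helper thread. The memory chip could be a DIMM module or a memory controller.]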

SLIDE 3


Co-execution

■ Using a compiler,
  Partition code into compute-/memory-intensive sections (so-called modules).
    ▪ Performance prediction
  The memory-intensive sections are wrapped into a helper thread.
  Statically/dynamically map the sections to the best processor.


Overview of the Co-execution Algorithm

[Figure: stages of the co-execution algorithm. Numerical applications: basic partitioning, affinity estimation (performance model), advanced partitioning, mapping, overlapping. Non-numerical applications: basic partitioning, affinity estimation (profiling), advanced partitioning, mapping. Stages are marked as static or dynamic.]

SLIDE 4


Static Mapping

■ Performance model (numerical apps)
  Execution time = Tcomp + Tmemstall
  Stack distance model for the number of misses
■ Profiling (non-numerical apps)
  Gather execution time and the number of invocations for all modules and subroutines.

$$T_{comp} = \max(N_{int} T_{int},\ N_{fp} T_{fp},\ N_{ldst} T_{ldst}) + T_{other}$$

$$T_{memstall} = \sum_{i \in caches} miss_i \times penalty_i$$
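The performance model above can be sketched in a few lines. This is a minimal illustration, not the compiler's actual implementation: the function names, the toy trace, and the use of a fully associative LRU cache for the stack distance model are all assumptions made for the example.

```python
def stack_distances(trace):
    """LRU stack distance of every access; None marks a first touch."""
    stack = []  # most recently used line is at the end
    distances = []
    for line in trace:
        if line in stack:
            pos = stack.index(line)
            # Number of distinct lines touched since the previous access.
            distances.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            distances.append(None)
        stack.append(line)
    return distances


def count_misses(trace, cache_lines):
    """Misses in a fully associative LRU cache with `cache_lines` lines (assumed cache model)."""
    return sum(1 for d in stack_distances(trace) if d is None or d >= cache_lines)


def t_comp(n_int, t_int, n_fp, t_fp, n_ldst, t_ldst, t_other=0.0):
    """Tcomp = max(Nint*Tint, Nfp*Tfp, Nldst*Tldst) + Tother."""
    return max(n_int * t_int, n_fp * t_fp, n_ldst * t_ldst) + t_other


def t_memstall(misses, penalties):
    """Tmemstall = sum over cache levels of misses * penalty."""
    return sum(misses[level] * penalties[level] for level in misses)


def execution_time(comp, memstall):
    """Execution time = Tcomp + Tmemstall."""
    return comp + memstall


# Hypothetical trace of cache-line identifiers.
trace = ["a", "b", "c", "a", "b", "d", "a"]
```

An affinity estimate would evaluate `execution_time` once with P.host parameters and once with P.mem parameters and map the module to the processor with the smaller result.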


Dynamic Mapping

■ Decision runs are performed at run time to determine affinity.
■ Coarse and CoarseR
  Decision runs are module invocations

[Figure: timelines of module invocations 1, 2, 3, 4, 5, ... on P.host and P.mem for the Coarse and CoarseR schemes.]
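A rough sketch of decision runs follows. The slides do not spell out how Coarse and CoarseR differ, so this is only one plausible reading: it assumes the first two invocations of a module serve as decision runs, one on each processor, and that later invocations go to whichever processor was faster. The function name and the timing dictionary are hypothetical.

```python
def decision_run_schedule(run_times, n_invocations):
    """Return the processor chosen for each invocation of one module.

    run_times: hypothetical measured time per invocation on each processor.
    """
    processors = ["P.host", "P.mem"]
    measured = {}
    schedule = []
    for k in range(n_invocations):
        if k < len(processors):
            # Decision run: try the module on the next untested processor.
            proc = processors[k]
            measured[proc] = run_times[proc]
        else:
            # Affinity is known: keep using the faster processor.
            proc = min(measured, key=measured.get)
        schedule.append(proc)
    return schedule
```

For example, `decision_run_schedule({"P.host": 5.0, "P.mem": 3.0}, 5)` tries each processor once and then stays on P.mem.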

SLIDE 5


Overall Speedups for Co-execution

■ Our co-execution algorithm delivers speedups that are comparable to the ideal speedup.

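[Table: speedups per application. Columns: Apps.; P.host (alone) / AdvCoarseR; P.host (alone) / OverDyn; 2 P.hosts; Amdahl's; 2-processor SGI.]

Non-numerical applications (Bzip2, Mcf, Go, M88ksim), one value group per column in extraction order:
  0.99 1.00 0.57 1.00 (average 0.89)
  1.01 1.01 1.01 1.03 (average 1.02)
  1.37 1.37 0.97 1.01 (average 1.18)

Numerical applications (Swim, Tomcatv, LU, TFFT2, Mgrid), one value group per column in extraction order:
  1.85 1.44 0.99 0.80 1.47 (average 1.31)
  2.00 1.67 1.04 1.91 1.94 (average 1.71)
  2.71 1.60 1.22 1.22 1.55 (average 1.66)
  1.67 1.17 1.26 1.42 1.05 (average 1.31)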


Correlation Prefetching in Software

■ New correlation prefetching in software using the memory thread.
■ Records sequences of miss addresses in a correlation table.
■ When the head of a sequence is seen, prefetch the rest.

[Figure: miss sequence A, B, C, ..., Z; example accesses a[foo(i)] and a[4*(i++)].]

SLIDE 6


Correlation Table

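[Figure: two correlation table organizations. Basic organization (Joseph & Grunwald): each row holds a tag and its level-1 successors (addresses of immediate successors). Advanced organization: each row holds a tag, its level-1 successors, and its level-2 successors (addresses of the next immediate successors).]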
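The table organizations above can be illustrated with a small software model. This is a sketch under stated assumptions rather than the scheme's real data layout: the class name, the most-recently-used ordering of successors, and the default sizes (`num_levels=2`, `num_succ=2`) are choices made for the example. With `num_levels=1` it behaves like the basic organization.

```python
from collections import deque


class CorrelationTable:
    """Toy correlation table: tag -> successors at level 1, level 2, ..."""

    def __init__(self, num_levels=2, num_succ=2):
        self.num_levels = num_levels
        self.num_succ = num_succ
        self.rows = {}                          # tag -> one successor list per level
        self.recent = deque(maxlen=num_levels)  # latest misses, most recent first

    def learn(self, miss):
        """Learning step: record `miss` as a successor of the previous misses."""
        for level, tag in enumerate(self.recent):
            row = self.rows.setdefault(tag, [[] for _ in range(self.num_levels)])
            succ = row[level]
            if miss in succ:
                succ.remove(miss)
            succ.insert(0, miss)      # keep the most recent successor first
            del succ[self.num_succ:]  # bound the number of successors per level
        self.recent.appendleft(miss)

    def prefetch(self, miss):
        """Prefetching step: addresses that followed `miss` in the past."""
        row = self.rows.get(miss, [])
        return [addr for level in row for addr in level]


table = CorrelationTable()
for addr in ["A", "B", "C", "Z"]:
    table.learn(addr)
```

After learning the sequence A, B, C, Z once, a later miss on A yields B (level 1) and C (level 2) as prefetch candidates with a single table lookup.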


Our Scheme

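[Figure: a processor chip with L1 and L2 caches; a North Bridge chip containing the memory controller; a DRAM chip or DIMM module containing the DRAM cells and a memory processor with its own L1 cache. Arrows are labeled 1 through 5.]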

SLIDE 7


The Mechanism of the Memory (Helper) Thread

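■ Requirements:
  Low response time
  Occupancy time < miss distance

[Figure: timeline of the memory thread. A miss address is observed; the prefetching step generates the prefetch addresses; the learning step updates the table; the thread then waits. Response time covers the prefetching step; occupancy time covers both the prefetching and the learning steps.]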
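The occupancy requirement can be checked against a measured miss-distance distribution. The helper below is a hypothetical illustration (its name and the sample numbers are not from the slides): it reports the fraction of misses for which the thread finishes both steps before the next miss arrives.

```python
def fraction_kept_up(occupancy_time, miss_distances):
    """Fraction of misses whose distance to the next miss exceeds the occupancy time."""
    if not miss_distances:
        return 1.0
    kept_up = sum(1 for distance in miss_distances if occupancy_time < distance)
    return kept_up / len(miss_distances)
```

With an occupancy time of 100 cycles and miss distances of 50, 150, 200, and 400 cycles, the thread keeps up with three of the four misses.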


Miss Distance

[Figure: distribution of miss distances (0% to 100% of misses) for CG, Equake, FT, Gap, Mcf, MST, Parser, Sparse, Tree, and the average. Distance bins: [0,40), [40,80), [80,120), [120,160), [160,200), [200,240), [240,280), [280,320), [320,360), [360,400).]

SLIDE 8


Response and Occupancy Time

[Figure: response time and occupancy time in processor cycles (axis 50 to 250) for Base, Chain, Repl, and ReplMC, each split into Busy and Mem components.]


Execution Time in DRAM

[Figure: normalized execution time (axis 0.2 to 1.2), split into Busy, UptoL2, and BeyondL2, for CG, Equake, FT, Gap, Mcf, MST, Parser, Sparse, Tree, and the average. Bars per application: NoPref, Conven4, Base, Chain, Repl, Conven4+Repl, plus Custom for some applications.]

SLIDE 9


Execution Time in MC

[Figure: normalized execution time (axis 0.2 to 1.2), split into Busy, UptoL2, and BeyondL2, for CG, Equake, FT, Gap, Mcf, MST, Parser, Sparse, Tree, and the average. Bars per application: NoPref, Conven4, BaseMC, ChainMC, ReplMC, Conven4+ReplMC.]

Active Prefetching

■ The helper thread runs the skeleton of the original code
  Address computation
  Prefetch instructions
■ More accurate prefetches
■ The helper thread should be faster than the original code
■ Synchronization overhead
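The idea of a skeleton can be shown with a toy loop. This is only an illustration: Python has no prefetch instruction, so `prefetch` is a hypothetical callback standing in for one, and both function names are invented for the example.

```python
def original_loop(a, idx):
    """Original code: address computation plus the real work."""
    total = 0
    for i in range(len(idx)):
        total += a[idx[i]] * 2 + 1
    return total


def helper_skeleton(idx, prefetch):
    """Skeleton: keeps only the address computation and issues prefetches."""
    for i in range(len(idx)):
        prefetch(idx[i])  # stand-in for a prefetch instruction on a[idx[i]]


issued = []
helper_skeleton([3, 0, 2], issued.append)
```

The skeleton touches the same addresses as the original loop, in the same order, but drops the arithmetic, which is what lets it run ahead of the main thread.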

SLIDE 10


Outline

■ Heterogeneous Multithreading (Helper Threading)
  Intelligent Memory
  Co-execution
  Prefetching
■ Compiler-Assisted Demand Paging
  Motivation
  Framework
  Example
  Performance Results
■ Wrap-up


Seoul National university Advanced Compiler tool Kit (SNACK)

SLIDE 11


SNACK Components

■ SNACK-cc: a C compiler for embedded systems
■ SNACK-c2c: C-to-C translator
■ SNACK-asm: assembler
■ SNACK-link: linker
■ SNACK-pop: post-pass optimizer
■ SNACK-jvm: embedded Java VM (planned)


Goals

■ High performance
■ Small code size
■ Low power/energy

SLIDE 12


Using SNACK,

■ Memory hierarchy optimization
  Scratchpad memory optimization
  Demand paging
  Cache design
■ Code generation for ARM and DSPs
■ Embedded Java VM research
■ Java memory model


Compiler-Assisted Demand Paging

■ Motivation
  The size of code is getting bigger → the cost of (code) memory is getting higher → energy consumption is getting bigger
■ How can code-memory size be reduced with comparable performance and energy consumption for low-end embedded systems?
  Demand paging, dynamic loading
  No virtual memory

SLIDE 13


Memory Architecture

[Figure: three memory architectures for an ARM7 CPU with separate DATA and CODE paths. NAND+SRAM: SRAM, with NAND flash as secondary storage for code and data. MROM+SRAM: mask ROM and SRAM, with NAND flash as secondary storage for data. NOR+SRAM: NOR flash and SRAM, with NAND flash as secondary storage for data.]

                               NAND+SRAM     MROM+SRAM   NOR+SRAM
Upgrade                        Easy          Difficult   Easy
Code size (for the same cost)  Small (big?)  Big         Big
Performance                    Good          Poor        Poor


Framework

[Figure: SNACK-pop flow. Inputs: ELF binary, profile information. Stages: disassemble; static call graph generation; dependence, control flow, and escape analyses; segmentation and branch expansion; clustering. Outputs: segmented binary image and page manager image (the application image).]

SLIDE 14


Segmented Paging

■ A segment consists of at least one page.
■ At least one function per page or segment.
■ Page size = the block size of the NAND flash memory

[Figure: page layout A, A, A | B, C | E | D, D | F. Function A spans three pages, functions B and C share one page, function E occupies one page, function D spans two pages, and function F occupies one page.]


Page Manager

■ Contains a page table, a buffer page table, and a segment table.
  Generated by the post-pass optimizer
■ When a page hit occurs, it simply branches to the page in the buffer.
■ Otherwise, it loads the corresponding segment into the execution buffer and then branches to the page.
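The hit and miss paths described above can be modeled in a few lines. This is a simplified sketch, not the generated page manager: it tracks whole segments rather than individual page table entries, it assumes each multi-page function forms one segment, and it evicts the oldest loaded segment first as a stand-in for the round-robin replacement of the Base strategy. The class name, segment names, and buffer size are hypothetical.

```python
from collections import deque


class PageManager:
    """Toy page manager for an execution buffer measured in pages."""

    def __init__(self, segment_of, pages_in, buffer_pages):
        self.segment_of = segment_of      # function name -> segment id
        self.pages_in = pages_in          # segment id -> number of pages
        self.buffer_pages = buffer_pages  # capacity of the execution buffer
        self.resident = deque()           # loaded segments, oldest first
        self.loads = 0                    # number of segment loads from flash

    def used_pages(self):
        return sum(self.pages_in[seg] for seg in self.resident)

    def branch(self, func):
        """Handle a branch to `func`; return "hit" or "miss"."""
        seg = self.segment_of[func]
        if seg in self.resident:
            return "hit"
        if self.pages_in[seg] > self.buffer_pages:
            raise ValueError("segment does not fit in the execution buffer")
        # Make room by evicting the oldest segments, then load the new one.
        while self.used_pages() + self.pages_in[seg] > self.buffer_pages:
            self.resident.popleft()
        self.resident.append(seg)
        self.loads += 1
        return "miss"


# Layout assumed for illustration: A = 3 pages, B and C share 1 page, D = 2 pages.
segment_of = {"A": "A", "B": "BC", "C": "BC", "D": "D", "E": "E", "F": "F"}
pages_in = {"A": 3, "BC": 1, "D": 2, "E": 1, "F": 1}
pm = PageManager(segment_of, pages_in, buffer_pages=4)
results = [pm.branch(f) for f in ["A", "B", "C", "D"]]
```

Starting from an empty four-page buffer, the calls to A and B load their segments, the call to C hits because it shares a page with B, and the call to D evicts A to make room.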

SLIDE 15


Call/Return Expansion

■ Basically, a branch to another segment is expanded into a call to the page manager.
■ Many patterns

Before expansion:
  foo: …
       BL bar
       …
       pc ← lr

After expansion:
  foo: …
       BL T1
       …
       B page_manager
  T1:  r0 ← bar's absolute address
       B page_manager


Example (1)

[Figure: NAND flash holds pages A, A, A | B, C | E | D, D | F. The execution buffer (SRAM) holds the page manager and pages A, A, A | B, C. The static call graph covers functions A to F.]

Calling sequence: A

Current call chain: A

SLIDE 16


Example (2)

[Figure: NAND flash holds pages A, A, A | B, C | E | D, D | F. The execution buffer (SRAM) holds the page manager and pages A, A, A | B, C. A calls B through the page manager (steps 1 and 2), and the lookup is a hit.]

Calling sequence: A B

Current call chain: A B


Example (3)

[Figure: NAND flash holds pages A, A, A | B, C | E | D, D | F. The execution buffer (SRAM) holds the page manager and pages A, A, A | B, C. B calls C.]

Calling sequence: A B C

Current call chain: A B C

SLIDE 17


Example (4)

[Figure: NAND flash holds pages A, A, A | B, C | E | D, D | F. The execution buffer (SRAM) holds the page manager and pages A, A, A | B, C. C returns to B.]

Calling sequence: A B C

Current call chain: A B


Example (5)

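[Figure: NAND flash holds pages A, A, A | B, C | E | D, D | F. The execution buffer (SRAM) holds the page manager and pages A, A, A | B, C. B calls D: the call reaches the page manager (step 1), the lookup is a miss, D is loaded from NAND flash (step 2), and control branches to D (step 3).]

Calling sequence: A B C D

Current call chain: A B D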

SLIDE 18


Example (6)

[Figure: NAND flash holds pages A, A, A | B, C | E | D, D | F. After the load, the execution buffer (SRAM) holds the page manager and pages B, C | D, D.]

Calling sequence: A B C D

Current call chain: A B D


Strategies

■ Base: segmented paging with round-robin replacement
■ Static: clustering with the static call graph
  Nodes with multiple parents are pinned in the resident segment.
  Parents and children are placed together.
■ Profile: pinning some segments in the SRAM using profile data (dynamic call graph)
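For the Static strategy, finding the nodes with multiple parents is a simple pass over the static call graph. The sketch below shows that one step only; the function name and the example graph (its edges are invented, since the slides do not list them) are assumptions.

```python
def multi_parent_nodes(call_graph):
    """Functions called from more than one distinct caller."""
    parents = {}
    for caller, callees in call_graph.items():
        for callee in callees:
            parents.setdefault(callee, set()).add(caller)
    return sorted(func for func, callers in parents.items() if len(callers) > 1)


# Hypothetical static call graph: caller -> list of callees.
example_graph = {"A": ["B", "E"], "B": ["C", "D"], "E": ["D", "F"]}
pinned = multi_parent_nodes(example_graph)
```

In this example only D has two callers (B and E), so D would be the candidate for the resident segment.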

SLIDE 19


Simulation Environment

■ ARMulator, ARM ADS 1.2
  ARM7, cycle accurate, no caches
  20 MHz
  NAND flash read latency is fully incorporated in the simulation.
    ▪ NAND cell → page register: 10 µs
    ▪ Reading 256 words (16 bits each) from the page register: 256 × 50 ns = 12.8 µs
■ Benchmark programs
  From MiBench and MediaBench
    ▪ Combine (qsort, dijkstra, adpcm, sha, bitcount), FFT, Epic, Unepic, MP3
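The NAND read latency figures given for the simulation can be combined into a per-page load cost. The short calculation below is a back-of-the-envelope sketch: it assumes the two latencies simply add up and ignores any page manager overhead; the constant and function names are made up for the example.

```python
CELL_TO_REGISTER_NS = 10_000  # NAND cell -> page register: 10 us
WORD_READ_NS = 50             # one 16-bit word from the page register: 50 ns
WORDS_PER_PAGE = 256
CPU_MHZ = 20


def page_load_ns():
    """Time to bring one page out of NAND flash, in nanoseconds."""
    return CELL_TO_REGISTER_NS + WORDS_PER_PAGE * WORD_READ_NS


def ns_to_cycles(ns, mhz=CPU_MHZ):
    """Convert nanoseconds to CPU cycles at the given clock frequency."""
    return ns * mhz // 1000


page_bytes = WORDS_PER_PAGE * 16 // 8
```

Under these assumptions one 512-byte page costs 22.8 µs, or 456 cycles at 20 MHz, which is why reducing the number of misses matters so much for the results that follow.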


Combine (14KB)

[Figure: normalized execution time (left axis, 0.0 to 2.0) and hit rate (right axis, 0% to 100%) versus SRAM size (7.0 KB to 13.5 KB) for the Base, Static, and Profile strategies.]

SLIDE 20


FFT (16KB)

[Figure: normalized execution time (left axis, up to 20) and hit rate (right axis, 0% to 100%) versus SRAM size (8.0 KB to 16.0 KB) for the Base, Static, and Profile strategies.]


EPIC (32KB)

[Figure: normalized execution time (left axis, up to 6) and hit rate (right axis, 0% to 100%) versus SRAM size (11.0 KB to 31.0 KB) for the Base, Static, and Profile strategies.]

SLIDE 21


UNEPIC (23KB)

[Figure: normalized execution time (left axis, 0.0 to 5.0) and hit rate (right axis, 0% to 100%) versus SRAM size (13.0 KB to 23.0 KB) for the Base, Static, and Profile strategies.]

MP3 (37KB)

[Figure: normalized execution time (left axis, 0.0 to 2.0) and hit rate (right axis, 0% to 100%) versus SRAM size (18.5 KB to 36.5 KB) for the Base, Static, and Profile strategies.]

SLIDE 22


Summary

■ 33% SRAM reduction on average, with less than 20% performance degradation and 8% more energy consumption.
■ The designer can select the SRAM size depending on their cost, energy, and real-time requirements.

Application  Original code size  SRAM size with paging  Normalized energy consumption
Combine      14KB                10KB (71%)             1.16
FFT          16KB                16KB (100%)            1.00
Epic         32KB                16KB (50%)             1.08
Unepic       23KB                18KB (78%)             1.02
MP3          37KB                22KB (59%)             1.14
Average      24.4KB              16.4KB (67%)           1.08
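The averages quoted in the summary follow directly from the per-application figures. The snippet below recomputes them as a sanity check; the dictionary layout and variable names are just a convenient encoding of the numbers on the slide.

```python
# application -> (original code size in KB, SRAM size with paging in KB, normalized energy)
results = {
    "Combine": (14, 10, 1.16),
    "FFT": (16, 16, 1.00),
    "Epic": (32, 16, 1.08),
    "Unepic": (23, 18, 1.02),
    "MP3": (37, 22, 1.14),
}

n = len(results)
avg_code_kb = sum(code for code, _, _ in results.values()) / n
avg_sram_kb = sum(sram for _, sram, _ in results.values()) / n
avg_energy = sum(energy for _, _, energy in results.values()) / n

# SRAM needed with paging, as a percentage of the original code size.
sram_percent = round(100 * avg_sram_kb / avg_code_kb)
sram_reduction_percent = 100 - sram_percent
```

The average SRAM requirement is 16.4 KB against 24.4 KB of code, about 67%, which gives the 33% reduction; the mean normalized energy of 1.08 corresponds to the 8% increase.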


Challenges

■ How to reduce the number of page manager calls (branching overhead)?
■ How to reduce the number of page/segment misses?
■ Aim at embedded systems with MMU and RTOS support
■ Speed-differentiated (a.k.a. scratchpad memory) on-chip memory optimization
■ Data paging

SLIDE 23


References

■ "Automatically Mapping Code in an Intelligent Memory Architecture", Jaejin Lee, Yan Solihin, and Josep Torrellas. HPCA 2001
■ "Using a User-Level Memory Thread for Correlation Prefetching", Yan Solihin, Jaejin Lee, and Josep Torrellas. ISCA 2002
■ "Compiler-Assisted Demand Paging for Embedded Systems with Flash Memory", Chanik Park, Junghee Lim, Kiwon Kwon, Jaejin Lee, and Sang Lyul Min. EMSoft 2004