SLIDE 1

Some Cache Optimization with Enhanced Pipeline Scheduling

Seok-Young Lee, Jaemok Lee, Soo-Mook Moon

School of Electrical Engineering & Computer Science Seoul National University, Korea

SLIDE 2

Outline

  • Motivation and background
  • Cache optimizations with Enhanced Pipeline Scheduling

  • Experimental results
  • Summary and future work
SLIDE 3

Cache Misses for Integer Programs

  • CPU stalls caused by data cache misses are serious, even in some integer programs

[Chart: CPU stall portion of the total running time (0%–100%) for 164.gzip, 175.vpr, 176.gcc, 181.mcf, 197.parser, 254.gap, 256.bzip2, 300.twolf, 445.gobmk, and 456.hmmer]

SLIDE 4

Conventional Techniques

  • Many compiler optimization techniques have been used
  • Prefetches for array-accessing loops [Mowry’92]
  • Increasing locality in loops [Wolf’91]
  • Dynamic runtime optimization [Chilimbi’02]
  • But they do not apply well to integer loops
  • Address estimation is not easy (e.g., pointer-chasing loops)
  • Complex control flows
SLIDE 5

A Better Technique

  • In integer programs, it is easier to separate “hot cache-missing loads” from their consumers by cache-miss latencies
  • Simply implemented by an increased load latency during code scheduling

Before (CPU stall if the load misses):

    load x = [y]
    use x

After (no CPU stall if the load and its consumer are separated by the cache-miss latency):

    load x = [y]
    use a
    use b
    use c
    use d
    use x

SLIDE 6

Our Proposal

  • However, naïve code scheduling is not enough
  • Code motion of hot loads can be stuck at the loop entry
  • Difficult to fill added slack cycles fully and usefully
  • Actually, it did not show a tangible impact [Choi ’02 in EPIC-2]
  • Our proposal: moving hot loads across loop iterations
SLIDE 7

Illustration of the Proposal

Naïve separation: the hoisted load gets stuck at the loop header

    load a = @b   [iter 1]
    use a
    …

Proposed separation: move the hot load across loop iterations

    load a = @b   [iter n+1, issued during iter n]
    use a         [iter n]
    …

  • A code motion for software pipelining

SLIDE 8

Some Characteristics of Hot Loads

  • Located close to the loop entry
  • Tight data dependence chains to their source operands
  • Moving a hot load requires moving its dependent instructions as well
  • Difficult to estimate the target address
  • Often in a loop with complex control flow
  • Require code motion above branches and joins
SLIDE 9

Hot load example in 181.mcf

    while (arcin) {
        tail = arcin->tail;
        if (tail->time + arcin->org_cost > latest) {
            arcin = (arc_t *)tail->mark;
            continue;
        }
        …
    }

181.mcf source code: complex and large code including an inner loop and a function call

[Figure: 181.mcf control flow graph, with the hot load tail->time near the top and an inner loop below]

  • tail->time is a pointer-chasing load in an outer loop with complex control flow, close to the loop entry

SLIDE 10

Hot load example in 164.gzip

    do {
        match = window + cur_match;
        if (*(ush*)(match + best_len - 1) != scan_end ||
            *(ush*)match != scan_start)
            continue;
    } while ((cur_match = prev[cur_match & WMASK]) > limit
             && --chain_length != 0);

164.gzip source code: complex and large code including an inner loop and a function call

[Figure: 164.gzip control flow graph, with the hot loads *(ush*)(match+best_len-1) and prev[cur_match & WMASK] marked]

SLIDE 11

Cross-Iteration Global Scheduling

  • Separating hot loads requires two types of code motions
  • Code motion across loop back-edges: software pipelining
  • Code motion across branches and joins: global scheduling
  • Needs global scheduling across loop iterations: Enhanced Pipeline Scheduling

[Figure: the hot load tail->time moved across branches, joins, and the loop back-edge]

SLIDE 12

Enhanced Pipeline Scheduling (EPS)

  • A software pipelining technique based on code motions
  • Global scheduling can be applied across loop back-edges
  • Aggressive code motions for scheduling useful instructions
  • If we exploit EPS appropriately, we can (1) separate hot loads and their consumers effectively and (2) fill the slack cycles usefully

  • Let us first review how EPS works
SLIDE 13

EPS Illustration

  • EPS repetitively (1) defines a DAG by cutting back-edges of a loop and (2) performs DAG scheduling

[Figure: the example loop

    loop:  y  = load(x)
           cc = (y == 0)
           store x @A
           x  = x + 4
           if (!cc) goto loop

is pipelined step by step. Cutting the back-edge exposes iteration n+1's load, which is moved up into iteration n through a renamed address (x' = x + 4; y = load(x')), while the preheader receives copies of iteration 1's load and increment. Repeating the cut-and-schedule step moves the load a further iteration ahead (x'' = x' + 4; y = load(x''); x' = x'')]

SLIDE 14

CPU Stall Reduction with EPS

  • We simply assign an L1-cache-miss latency to “hot” loads and schedule them with the EPS algorithm
  • Their consumer instructions will be scheduled far enough from them, even across loop iterations
  • However, this is not that simple

[Figure: a hot load and its use separated by several instructions across the loop back-edge]

SLIDE 15

Issues in Stall Reduction with EPS

  • Adding slack cycles means more aggressive code motions
  • Some aggressive code motions, such as speculative loads or join code motions, have negative side effects if performed recklessly
  • Must limit aggressive code motion
  • On the other hand, hot loads and their source definitions should be scheduled aggressively
  • Must encourage aggressive code motion
SLIDE 16

Hot-Load-Related Instructions

  • We split instructions into two groups: hot-load-related instructions and non-related instructions
  • Hot-load-related instructions are scheduled more aggressively than non-related instructions

  • Selective heuristics
SLIDE 17

Scheduling Hot-Load-Related Instructions

Before (one iteration; ======== marks cycle boundaries, “...” is other parts of the loop body):

    def d           (related instruction)
    def c           (related instruction)
    ...
    ========
    add b = c + d   (related instruction)
    ...
    ========
    ld1 a <= @b     (hot load)
    ...
    ========
    use a
    br

After (the hot load and its related instructions are scheduled aggressively across the back-edge, so iteration n+1's chain overlaps iteration n):

    ...
    def d [iter n+1]
    def c [iter n+1]
    ========
    add b = c + d [iter n+1]
    ...
    ========
    use a [iter n]
    ld1 a <= @b [iter n+1]
    ...
    ========
    br

SLIDE 18

Stall-Reducing EPS for Open-64

  • We implemented EPS in Open-64 (version 3.0), an open-source compiler for IA-64
  • http://www.open64.net/
  • EPS is positioned between register allocation and global instruction scheduling in Open-64
  • We then implemented stall reduction for EPS
  • Detect “hot” loads via profiling
SLIDE 19

Experimental Results

  • Experimental Environment
  • Intel Itanium2 processor, 900 MHz

– 256 KB L1 D-cache (an L1 cache miss takes 5 cycles)

  • 10 integer benchmarks from SPEC CPU 2000 and 2006
  • Use the Performance Monitoring Unit for detecting hot loads

– Collect load instructions whose stall overhead takes over 2% of the running time
– 12 loops in 10 benchmarks are selected
– We do not touch other loops
SLIDE 20

Experiment Configurations

  • Base: Open-64 -O3 with EPS disabled (1.0x)
  • EPS without cache optimizations
  • Strictly schedule hot loops only
  • EPS with cache optimizations
  • Strict heuristics

– Limited code motions

  • Aggressive heuristics
  • Selective heuristics for hot-load-related instructions
SLIDE 21

Stall Reduction and Performance Result

  • Stall is reduced a little compared with the EPS-without-cache-optimization configuration
  • No tangible effect on execution cycles

[Charts: stall cycles and total execution cycles, normalized to Base (y-axis 0.7–1.3), for gzip, vpr, gcc, mcf, parser, gap, bzip2, twolf, gobmk, and hmmer; configurations: Strict EPS without cache optimization, Strict EPS with cache optimization]

SLIDE 22

Stall Reduction and Performance Result

  • Stall is reduced more
  • Execution cycles do not get better

[Charts: stall cycles and total execution cycles, normalized to Base (y-axis 0.7–1.3), for the same 10 benchmarks; configurations: Strict EPS without cache optimization, Strict EPS with cache optimization, Aggressive EPS with cache optimization]

SLIDE 23

Stall Reduction and Performance Result

  • Stall is reduced as much as in the aggressive configuration
  • Execution cycles are decreased, especially for gzip and mcf

[Charts: stall cycles and total execution cycles, normalized to Base (y-axis 0.7–1.3), for the same 10 benchmarks; configurations: Strict EPS without cache optimization, Strict EPS with cache optimization, Aggressive EPS with cache optimization, Selective EPS with cache optimization]

SLIDE 24

Summary and Future Work

  • EPS-based stall reduction achieves promising results
  • Adding an L1-cache-miss latency to hot loads to separate them from their consumers
  • Aggressively schedule hot-load-related instructions only
  • Future work
  • More balanced heuristics between parallelism and stall reduction

– Canceling code motions which have no advantage for either parallelism or stall reduction after EPS

  • Handling L2-cache misses for the hottest loads
SLIDE 25

Thanks

  • Questions?