Enhanced Pipeline Scheduling - Seok-Young Lee, Jaemok Lee, Soo-Mook Moon - PowerPoint PPT Presentation
Some Cache Optimization with Enhanced Pipeline Scheduling
Seok-Young Lee, Jaemok Lee, Soo-Mook Moon
School of Electrical Engineering & Computer Science, Seoul National University, Korea
Outline
- Motivation and background
- Cache optimizations with Enhanced Pipeline Scheduling
- Experimental results
- Summary and future work
Cache Misses for Integer Programs
- CPU stalls caused by data cache misses are serious, even in some integer programs
[Figure: CPU stall portion of the total running time (0%-100%) for 164.gzip, 175.vpr, 176.gcc, 181.mcf, 197.parser, 254.gap, 256.bzip2, 300.twolf, 445.gobmk, 456.hmmer]
Conventional Techniques
- Many compiler optimization techniques have been used
- Prefetches for array-accessing loops [Mowry’92]
- Increasing locality in loops [Wolf’91]
- Dynamic runtime optimization [Chilimbi’02]
- But they do not apply well to integer loops
- Address estimation is not easy (e.g., pointer-chasing loops)
- Complex control flows
A Better Technique
- In integer programs, it is easier to separate "hot cache-missing loads" from their consumers by cache-miss latencies
- Simply implemented by an increased load latency during code scheduling
Before separation:
  load x = [y]
  use x              <- CPU stall if the load misses the cache

After separation (by the cache-miss latency):
  load x = [y]
  use a
  use b
  use c
  use d
  use x              <- no CPU stall if the load and consumer are separated
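A minimal sketch (not the paper's implementation) of the idea above: during list scheduling, a hot load is given the L1-miss latency instead of the hit latency, so its consumer is pushed away and independent instructions fill the slack. All instruction names and latencies here are illustrative.

```python
def list_schedule(insts, latency):
    """Single-issue greedy scheduler.
    insts: list of (name, deps); latency: dict name -> cycles to result."""
    done_at = {}                      # name -> cycle its result is ready
    order = []                        # (issue cycle, name)
    pending = list(insts)
    cycle = 0
    while pending:
        ready = [i for i in pending
                 if all(d in done_at and done_at[d] <= cycle
                        for d in i[1])]
        if ready:
            name, deps = ready[0]
            pending.remove((name, deps))
            done_at[name] = cycle + latency.get(name, 1)
            order.append((cycle, name))
        cycle += 1
    return {name: c for c, name in order}

body = [("load_x", []), ("use_x", ["load_x"]),
        ("use_a", []), ("use_b", []), ("use_c", []), ("use_d", [])]

hit  = list_schedule(body, {"load_x": 1})   # schedule for the L1-hit latency
miss = list_schedule(body, {"load_x": 5})   # schedule for the miss latency
# hit:  use_x issues 1 cycle after the load (stalls if the load misses)
# miss: use_x issues 5 cycles after the load; use_a..use_d fill cycles 1-4
```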
Our Proposal
- However, naïve code scheduling is not enough
- Code motion of hot loads can be stuck at the loop entry
- Difficult to fill added slack cycles fully and usefully
- In practice, it did not show a tangible impact [Choi ’02 in EPIC-2]
- Our proposal: moving hot loads across loop iterations
Illustration of the Proposal
Naïve separation: the code motion of the hot load is stuck at the loop header:
  load a = @b
  ...
  use a

Proposed separation: move the hot load across loop iterations (a code motion for software pipelining):
  [iter n]    ... load a = @b (iter n+1's load)
  [iter n+1]  use a ...
Some Characteristics of Hot Loads
- Located close to loop entry
- Tight data dependence chains to their source operands
- Moving a hot load requires moving its dependent instructions as well
- Difficult to estimate target address
- Often in a loop with complex control flow
- Require code motion above branches and joins
Hot load example in 181.mcf
while( arcin ) {
    tail = arcin->tail;
    if( tail->time + arcin->org_cost > latest ) {
        arcin = (arc_t *)tail->mark;
        continue;
    }
    ...
}
181.mcf source code. [Figure: 181.mcf control flow graph - complex and large code including an inner loop and a function call]
The hot load tail->time is a pointer-chasing load in an outer loop (containing an inner loop) with complex control flow, close to the loop entry.
Hot load example in 164.gzip
do {
    match = window + cur_match;
    if (*(ush*)(match+best_len-1) != scan_end ||
        *(ush*)match != scan_start)
        continue;
} while ((cur_match = prev[cur_match & WMASK]) > limit
         && --chain_length != 0);
164.gzip source code. [Figure: 164.gzip control flow graph - complex and large code including an inner loop and a function call]
The hot loads are *(ush*)(match+best_len-1) and prev[cur_match & WMASK].
Cross-Iteration Global Scheduling
- Separating hot loads requires two types of code motions
  - Code motion across loop back-edges: software pipelining
  - Code motion across branches and joins: global scheduling
- Needs global scheduling across loop iterations: Enhanced Pipeline Scheduling
Enhanced Pipeline Scheduling (EPS)
- A software pipelining technique based on code motions
- Global scheduling can be applied across loop back-edges
- Aggressive code motions for scheduling useful instructions
- If we exploit EPS appropriately, we can (1) separate hot loads and their consumers effectively and (2) fill the slack cycles usefully
- Let us first review how EPS works
EPS Illustration
- EPS repetitively (1) defines a DAG by cutting edges of a loop and (2) performs DAG scheduling

Original loop (cutting the back-edge turns the body into a DAG):
  preheader:
  loop: y = load(x)
        cc = (y == 0)
        x = x + 4
        if (!cc) goto loop
        store x @A

After one pass, iteration n+1's load crosses the back-edge with renaming (x' is a renamed copy of x):
  preheader: y = load(x)           ; iter 1's load
  loop: cc = (y == 0)
        x' = x + 4
        y = load(x')               ; iter n+1's load, moved above the back-edge
        x = x'
        if (!cc) goto loop
        store x @A

Repeating the step (x'' = x' + 4, x' = x'') moves the load yet another iteration ahead, so iteration n+2's load overlaps iterations n and n+1.
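A behavioral sketch (illustrative Python, not Open-64 code) of the code motion above on a simple loop: the next iteration's load is hoisted above the back-edge with renaming (x2 plays the role of x'), so the load-to-use distance spans a whole iteration. Memory is modeled as a Python list.

```python
MEM = list(range(40))                 # toy word-addressed "memory"

def original_loop(x, n):
    total = 0
    for _ in range(n):
        y = MEM[x]                    # y = load(x)
        total += y                    # use of y
        x = x + 1                     # induction update
    return total

def pipelined_loop(x, n):
    total = 0
    y = MEM[x]                        # preheader: iteration 1's load
    for _ in range(n):
        total += y                    # use the value loaded one iteration ago
        x2 = x + 1                    # renamed copy (x' = x + 1)
        y = MEM[x2]                   # iteration n+1's load, moved up
        x = x2                        # x = x' at the back-edge
    return total

# Both versions compute the same result; the pipelined one issues each
# load a full iteration before its use (plus one speculative final load).
```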
CPU Stall Reduction with EPS
- We simply add an L1-cache-miss latency for "hot" loads and schedule them with the EPS algorithm
- Their consumer instructions will be scheduled far enough from them, even across loop iterations
- However, this is not so simple
[Figure: instructions of the loop body laid out across the back-edge, with the hot load and its use separated]
Issues in Stall Reduction with EPS
- Adding slack cycles means more aggressive code motions
  - Some aggressive code motions, such as speculative loads or join code motions, have negative side-effects if performed recklessly
  - Must limit aggressive code motion
- On the other hand, hot loads and their source definitions should be scheduled aggressively
  - Must encourage aggressive code motion
Hot Load-related instructions
- We split instructions into two groups: hot-load-related instructions and non-related instructions.
- Hot-load-related instructions are scheduled more
aggressively than non-related instructions
- Selective heuristics
Scheduling Hot Load-related instructions
Before scheduling (ld1 a <= @b is the hot load):
  def d
  def c
  ========
  add b = c + d
  ========
  ld1 a <= @b
  ========
  use a
  br
  (other parts of the loop body)

After cross-iteration scheduling:
  def d          [iter n+1]
  def c          [iter n+1]
  ========
  add b = c + d  [iter n+1]
  ========
  use a          [iter n]
  ld1 a <= @b    [iter n+1]
  ========
  br
  (other parts of the loop body)

The hot load and its related instructions (def c, def d, add b = c + d) move up across the iteration boundary, while the use of a from iteration n stays behind, a full iteration away from its load.
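A hypothetical sketch of the split above: an instruction is "hot-load-related" if it is the hot load itself or feeds the load's address through the dependence chain (the load's backward slice); only these would get the aggressive heuristics. The dependence graph and names are illustrative.

```python
def backward_slice(deps, roots):
    """deps: dict name -> set of names producing its operands.
    roots: the hot loads. Returns roots plus everything feeding them."""
    related, work = set(), list(roots)
    while work:
        n = work.pop()
        if n not in related:
            related.add(n)
            work.extend(deps.get(n, ()))
    return related

# The loop body from the slide: ld1 a <= @b is the hot load.
body = {
    "def_c": set(),
    "def_d": set(),
    "add_b": {"def_c", "def_d"},      # add b = c + d
    "ld1_a": {"add_b"},               # ld1 a <= @b (hot load)
    "use_a": {"ld1_a"},               # consumer: not in the slice
    "br":    set(),
}
related = backward_slice(body, {"ld1_a"})
# related == {"ld1_a", "add_b", "def_c", "def_d"}; use_a and br are not,
# so the use stays in iteration n while its load moves to iteration n+1.
```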
Stall-Reducing EPS for Open-64
- We implemented EPS in Open-64 (version 3.0), an open-source compiler for IA-64
  - http://www.open64.net/
- EPS is positioned between register allocation and global instruction scheduling in Open-64
- We then implemented stall reduction for EPS
- Detect “hot” loads via profiling
Experimental Results
- Experimental Environment
- Intel Itanium 2 processor, 900 MHz
  - 256 KB L1 D-cache (an L1 cache miss takes 5 cycles)
- 10 integer benchmarks from SPEC CPU 2000 and 2006
- Use the Performance Monitoring Unit for detecting hot loads
  - Collect load instructions whose stall overhead takes over 2% of the running time
  - 12 loops in 10 benchmarks are selected
  - We do not touch other loops
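A hypothetical sketch of the selection rule above: a load is kept as "hot" if the stall cycles the PMU attributes to it exceed 2% of the total run cycles. The profile numbers below are made up for illustration.

```python
def select_hot_loads(stall_profile, total_cycles, threshold=0.02):
    """stall_profile: dict load_pc -> stall cycles charged to that load."""
    return {pc for pc, stalls in stall_profile.items()
            if stalls / total_cycles > threshold}

profile = {"0x4008a0": 9_000,     # 9.0% of run time: hot
           "0x4012f0": 1_200,     # 1.2%: ignored
           "0x402410": 3_500}     # 3.5%: hot
hot = select_hot_loads(profile, total_cycles=100_000)
# hot == {"0x4008a0", "0x402410"}
```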
Experiment Configurations
- Base: Open-64 -O3 with EPS disabled (1.0x)
- EPS without cache optimizations
  - Strictly schedule the hot loops only
- EPS with cache optimizations
  - Strict heuristics: limited code motions
  - Aggressive heuristics
  - Selective heuristics for hot-load-related instructions
Stall Reduction and Performance Result
- Strict EPS with cache optimization vs. strict EPS without cache optimization:
  - Stall cycles are reduced a little compared to the configuration without cache optimization
  - No tangible effects on total execution cycles
[Figure: normalized stall cycles and total execution cycles (0.7x-1.3x) for gzip, vpr, gcc, mcf, parser, gap, bzip2, twolf, gobmk, hmmer]
Stall Reduction and Performance Result
- Adding aggressive EPS with cache optimization:
  - Stall cycles are reduced more
  - Execution cycles do not get better
[Figure: normalized stall cycles and total execution cycles (0.7x-1.3x) for the same ten benchmarks]
Stall Reduction and Performance Result
- Adding selective EPS with cache optimization:
  - Stall cycles are reduced as much as in the aggressive configuration
  - Execution cycles are decreased, especially for gzip and mcf
[Figure: normalized stall cycles and total execution cycles (0.7x-1.3x) for the same ten benchmarks]
Summary and Future Work
- EPS-based stall reduction achieves promising results
  - Add the L1-cache-miss latency to hot loads to separate them from their consumers
  - Aggressively schedule only hot-load-related instructions
- Future Work
- More balanced heuristics between parallelism and stall reduction
  - Canceling code motions which have no advantage for either parallelism or stall reduction after EPS
- Handling L2-cache misses for some of the hottest loads
Thanks
- Questions?