Instruction Scheduling: List scheduling [Gibbons & Muchnick 86]
CSE 501 lecture notes, Craig Chambers


Instruction Scheduling: List scheduling [Gibbons & Muchnick 86]

Reorder instructions to better fit the target machine's pipeline. Schedule a basic block to:
• fill control transfer delay slots
• avoid using the result of multi-cycle operations (loads, floating point operations, ...) too early, obeying data dependences and avoiding interlocks
• schedule code for VLIW and superscalar machines: coordinate multiple instructions to fit the available machine resources

Previous work: exponential, O(n^4) algorithms
This work: a simple O(n^2) algorithm

Techniques:
• list scheduling, within a basic block
• trace scheduling, across conditional branches
• software pipelining, across loop iterations

Loop unrolling often can help scheduling; register allocation can hurt it.

Pipeline model

Hazards considered:
• a load followed by a use of the load's target
• a store followed by a load
• a load followed by an ALU op or a load/store with an address calculation

Loads & stores are assumed to alias, except that different offsets from a common base register (e.g. sp) do not alias.

Step 1: construct data dependence graph

Convert the linear basic block into a DAG representing data dependences:

➀ r2 := r1 + 1
➁ sp := sp - 12
➂ *A := r0
➃ r3 := *(sp+4)
➄ r4 := *(sp+8)
➅ sp := sp - 8
➆ *sp := r2
➇ r5 := *A
➈ r4 := r0 + 1
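Step 1 can be sketched in one forward pass over the block. This is a minimal, hypothetical sketch (the instruction representation — a tuple of defined registers, used registers, and a memory-op flag — is invented here, and all memory operations are conservatively assumed to alias, so the slide's base+offset disambiguation is omitted):

```python
# A hypothetical sketch: each instruction is (defs, uses, is_mem).
# All memory operations are conservatively assumed to alias.

def build_dag(block):
    """Return the set of dependence edges (i, j): j must follow i."""
    edges = set()
    last_def = {}    # reg -> index of the latest instruction defining it
    uses_since = {}  # reg -> users of the reg since its last definition
    mem_ops = []     # indices of loads/stores seen so far
    for j, (defs, uses, is_mem) in enumerate(block):
        for r in uses:                       # true (flow) dependences
            if r in last_def:
                edges.add((last_def[r], j))
        for r in defs:                       # anti & output dependences
            for i in uses_since.get(r, []):
                edges.add((i, j))
            if r in last_def:
                edges.add((last_def[r], j))
        if is_mem:                           # conservative memory dependences
            edges.update((i, j) for i in mem_ops)
            mem_ops.append(j)
        for r in defs:
            last_def[r] = j
            uses_since[r] = []
        for r in uses:
            uses_since.setdefault(r, []).append(j)
    return edges

# Instructions 2, 4, and 6 of the slide's example:
block = [({"sp"}, {"sp"}, False),   # sp := sp - 12
         ({"r3"}, {"sp"}, True),    # r3 := *(sp+4)
         ({"sp"}, {"sp"}, False)]   # sp := sp - 8
edges = build_dag(block)            # {(0, 1), (0, 2), (1, 2)}
```

The flow edge (0, 1) and the anti/output edges into instruction 2 force both sp updates to stay ordered around the load.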

Step 2: traverse dependence graph, emitting code

Maintain a set of candidate nodes whose data dependence predecessors have all been emitted:

    candidates := roots of DAG
    while |candidates| > 0 do
        select best available candidate node
        emit it
        remove it from the DAG
        add any new root nodes to the candidate set

The best node:
1. doesn't interlock with the previous instruction, or
2. does interlock with an immediate successor node, or
3. has the most immediate successor nodes, or
4. is along the longest path to the leaves of the DAG

[Previous work used lookahead in the DAG to guide the choice ⇒ complex and slow in the worst case]

Results

["Effectiveness of a Machine-Level Global Optimizer", Johnson & Miller, PLDI '86]
Compiling small benchmark programs: 7% improvement

Automatic Garbage Collection

Automatically free dead objects:
• no dangling pointers, no storage leaks (maybe)
• can have faster allocation, better memory locality

General styles:
• reference counting
• tracing: mark/sweep, mark/compact, copying

Adjectives: generational, conservative, incremental, parallel, distributed

Reference counting

For each heap-allocated object, maintain a count of the number of pointers to it:
• when an object is created, its ref count = 0
• when a new ref to an object is created, increment its ref count
• when a ref to an object is removed, decrement its ref count
• if the ref count goes to zero, delete the object

Example:

    proc foo() {
        a := new Cons;
        b := new Blob;
        c := bar(a, b);
        return c;
    }

    proc bar(x, y) {
        l := x;
        l.head := y;
        t := l.tail;
        return t;
    }
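The Step 2 loop above is essentially a greedy topological traversal. A minimal sketch, with an assumed `interlocks` predicate standing in for the pipeline model; only rules 1 and 3 of the four-rule "best node" heuristic are modeled here:

```python
from collections import defaultdict

def list_schedule(n, edges, interlocks):
    """Greedy list scheduling of n instructions.
    edges: set of (i, j) dependences; interlocks(i, j) -> True if
    emitting j right after i would stall the pipeline (an assumed
    predicate standing in for the paper's pipeline model)."""
    succs = defaultdict(set)
    preds_left = defaultdict(int)
    for i, j in edges:
        succs[i].add(j)
        preds_left[j] += 1
    candidates = {i for i in range(n) if preds_left[i] == 0}
    schedule = []
    while candidates:
        prev = schedule[-1] if schedule else None
        # Rule 1: prefer a candidate that doesn't interlock with the
        # previous instruction; rule 3: break ties by successor count.
        best = max(candidates,
                   key=lambda c: (prev is None or not interlocks(prev, c),
                                  len(succs[c])))
        candidates.remove(best)
        schedule.append(best)
        for s in succs[best]:           # emitting best may expose new roots
            preds_left[s] -= 1
            if preds_left[s] == 0:
                candidates.add(s)
    return schedule
```

Because each instruction enters and leaves the candidate set once and each selection scans the candidates, the whole loop stays within the O(n^2) bound the slide claims for this approach.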

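The reference-counting rules can be made concrete with a toy heap. Everything here (the `Heap` class, its API, the id scheme) is invented for illustration:

```python
import itertools

# A toy heap invented for illustration; object ids and the Heap API
# are not from the slides.

class Heap:
    def __init__(self):
        self.fields = {}            # oid -> set of oids it points to
        self.refcount = {}
        self._ids = itertools.count()

    def new(self):                  # creation: ref count starts at 0
        oid = next(self._ids)
        self.fields[oid] = set()
        self.refcount[oid] = 0
        return oid

    def add_ref(self, src, dst):    # new reference: increment
        self.fields[src].add(dst)
        self.refcount[dst] += 1

    def remove_ref(self, src, dst):  # dropped reference: decrement
        self.fields[src].discard(dst)
        self._dec(dst)

    def _dec(self, oid):
        self.refcount[oid] -= 1
        if self.refcount[oid] == 0:  # count hit zero: reclaim at once,
            del self.refcount[oid]   # recursively dropping its own refs
            for child in self.fields.pop(oid):
                self._dec(child)
```

Dropping the last external reference to a two-object cycle leaves both counts at 1, so neither is ever reclaimed — the slide's "cannot reclaim cyclic structures" weakness.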
Evaluation of reference counting

+ local, incremental work
+ little/no language support required
+ local ⇒ feasible for distributed systems
− cannot reclaim cyclic structures
− uses a malloc/free back-end ⇒ the heap gets fragmented
− high run-time overhead (10-20%)
  • processing of ptrs from the stack can be delayed (deferred reference counting [Deutsch & Bobrow 76])
− space cost
− no bound on the time to reclaim

Tracing collectors

Start with a set of root pointers:
• global vars
• contents of the stack & registers

Traverse objects transitively from the roots:
• visits reachable objects
• all unvisited objects are garbage

Issues:
• how to identify pointers?
• in what order to visit objects?
• how to know an object has been visited?
• how to free unvisited objects?
• how to allocate new objects?
• how to synchronize the collector and the program (the mutator)?

Identifying pointers

"Accurate": always know unambiguously where pointers are. Use some subset of the following:
• static type info & compiler support
• a run-time tagging scheme
• run-time conventions about where pointers can be

Conservative [Bartlett 88, Boehm & Weiser 88]: assume anything that looks like a pointer might be a pointer, & mark the target object reachable
+ supports GC of C, C++, etc.
What "looks" like a pointer?
• most optimistic: just aligned pointers to the beginning of objects
• what about interior pointers? off-the-end pointers? unaligned pointers?
Misses encoded pointers (e.g. xor'd ptrs), ptrs in files, ...

Mark/sweep collection

[McCarthy 60]: a stop-the-world tracing collector

Stop the application when the heap fills.

Trace reachable objects:
• set the mark bit in each object
• tracing control: depth-first, recursively using a separate stack; or depth-first, using pointer reversal

Sweep through all of memory:
• add unmarked objects to the free list
• clear the marks of marked objects

Restart the mutator:
• allocate new objects using the free list
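The trace and sweep phases can be sketched over a toy heap (an assumed dict-of-adjacency representation; the "mark bit" is a set, and tracing uses an explicit stack, the first tracing-control option above):

```python
def mark_sweep(heap, roots):
    """heap: dict oid -> list of oids it references; roots: live oids.
    Reclaims unreachable objects; returns the set of reclaimed oids."""
    marked = set()
    stack = list(roots)              # explicit stack, depth-first tracing
    while stack:
        oid = stack.pop()
        if oid in marked:
            continue
        marked.add(oid)              # "set the mark bit"
        stack.extend(heap[oid])
    garbage = set(heap) - marked     # sweep: everything unmarked is free
    for oid in garbage:
        del heap[oid]
    return garbage
```

Unlike reference counting, an unreachable cycle is simply never marked, so the sweep reclaims it.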

Evaluation of mark/sweep collection

+ collects cyclic structures
+ simple to implement
− the "embarrassing pause" problem
− poor memory locality
  • when tracing & sweeping
  • when allocating & dereferencing, due to heap fragmentation
− not suitable for distributed systems

Some improvements

Mark/compact collection: when sweeping through memory, compact rather than free
• all free memory ends up in one block at the end of the memory space; no free lists
+ reduces fragmentation
+ fast allocation
− slower to sweep
− changes pointers ⇒ requires accurate info about pointers

Generational mark/∗, incremental and/or parallel mark/∗:
+ (greatly) reduce the embarrassing pause problem
+ may be suitable for real-time collection
− more complex

Copying collection

[Cheney 70]

Divide the heap into two equal-sized semi-spaces:
• the mutator allocates in from-space
• to-space is empty

When from-space fills, do a GC:
• visit the objects referenced by the roots
• when visiting an object:
  • copy it to to-space
  • leave a forwarding pointer in the from-space version
  • if the object is visited again, just redirect the pointer to the to-space copy
• scan to-space linearly to visit reachable objects
  • to-space acts like a breadth-first-search work list
• when done scanning to-space:
  • empty from-space
  • flip: swap the roles of to-space and from-space
  • restart the mutator

Evaluation of copying collection

+ collects cyclic structures
+ supports compaction and fast allocation automatically
+ no separate traversal stack required
+ only visits reachable objects, not all objects
− requires twice the (virtual) memory; physical memory sloshes back and forth
  • could benefit from OS support
− still has the "embarrassing pause" problem
− copying can be slow
− changes pointers
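Cheney's scan/allocate discipline can be sketched with Python lists standing in for the semi-spaces (addresses and object layout are invented for illustration; to-space itself doubles as the breadth-first work list, which is why no separate traversal stack is required):

```python
def cheney_collect(from_space, roots):
    """from_space: dict addr -> list of addrs the object references.
    Returns (to_space, new_roots); to-space addresses are list indices."""
    to_space = []                  # allocation pointer = len(to_space)
    forward = {}                   # from-space addr -> to-space addr

    def copy(addr):
        if addr not in forward:    # first visit: copy the object and
            forward[addr] = len(to_space)   # leave a forwarding pointer
            to_space.append(list(from_space[addr]))
        return forward[addr]       # later visits: just redirect

    new_roots = [copy(r) for r in roots]
    scan = 0                       # scan pointer chases the allocation pointer
    while scan < len(to_space):
        obj = to_space[scan]
        for k, child in enumerate(obj):
            obj[k] = copy(child)   # redirect each field to the to-space copy
        scan += 1
    return to_space, new_roots
```

When the scan pointer catches the allocation pointer, every reachable object has been copied and every field redirected; unreachable objects were simply never touched.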

An improvement

Add a small nursery semi-space [Ungar 84]:
• the nursery fits in main memory (or the cache)
• the mutator allocates in the nursery
• GC when the nursery fills: copy the nursery + from-space to to-space
• flip: empty both the nursery and from-space

+ reduces cache misses and page faults
  • most heap memory references are satisfied in the nursery?
− the nursery + from-space can overflow to-space
− more complex

Another improvement

Add a semi-space for large objects [Caudill & Wirfs-Brock 86]:
• big objects are slow to copy, so allocate them in a separate space
• use mark/sweep in the large object space

+ no copying of big objects
− more complex

Generational GC

Observation: most objects die soon after allocation
• e.g. closures, cons cells, stack frames, numbers, ...

Idea: concentrate GC effort on young objects
• divide the heap into 2 or more generations
• GC each generation with a different frequency and algorithm

Original idea: Peter Deutsch
Generational mark/sweep: [Lieberman & Hewitt 83]
Generational copying GC: [Ungar 84]

Generation scavenging

A generational copying GC [Ungar 84]

2 generations: new-space and old-space
• new-space is managed as a 3-space copying collector
• old-space is managed using mark/sweep
• new-space is much smaller than old-space

Apply copying collection (scavenging) to new-space frequently.
If an object survives many scavenges, copy it to old-space:
• tenuring (a.k.a. promotion)
• needs some representation of the object's age

If old-space is (nearly) full, do a full GC.
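The tenuring policy of generation scavenging can be sketched as follows. This is a deliberately toy model: inter-object pointers are omitted, the roots name live objects directly, and `TENURE_AGE` is an assumed threshold, not a value from the slides:

```python
TENURE_AGE = 2   # assumed promotion threshold (hypothetical)

def scavenge(new_space, old_space, roots):
    """new_space: dict oid -> age (scavenges survived so far).
    Reachable new-space objects survive one more scavenge; objects
    that have survived TENURE_AGE scavenges are tenured into old_space.
    Returns the new-space survivors; unreached objects simply die."""
    survivors = {}
    for oid in roots:
        if oid in new_space:
            age = new_space[oid] + 1   # each scavenge ages the survivor
            if age >= TENURE_AGE:
                old_space[oid] = age   # tenuring (promotion) to old-space
            else:
                survivors[oid] = age   # copied within new-space
    return survivors
```

Because most objects die before reaching `TENURE_AGE`, the frequent cheap scavenges touch only the few survivors, which is the point of concentrating GC effort on young objects.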
