A Rapid Cache-aware Procedure Positioning Optimization to Favor - PowerPoint PPT Presentation

A Rapid Cache-aware Procedure Positioning Optimization to Favor Incremental Development
Enrico Mezzetti, Tullio Vardanega
RTAS 2013, 19th IEEE Real-Time and Embedded Technology and Applications Symposium
Philadelphia, USA, April 9-11, 2013

Outline
1 The case for incremental development
2 Incremental procedure positioning
3 Evaluation
4 Conclusion

Caches and incremental development
Incremental development: the holy grail of the verification-intensive software industry
- Natural incarnation of the divide-et-impera approach in hardware and software development
- A means to better master the complexity and costs of the industrial process
Is incremental WCET analysis even feasible?
- It relies on composability and on the early availability of timing bounds
  - The later those are determined, the worse!
- It is hindered by context-dependent hardware resources
Caches inherently wreck incrementality
- Intra-task timing behaviour is determined by the memory layout
- Not robust to software increments
  - Relatively small changes may cause significant jitter
- Only available on the final executable
  - Too late to afford costly feedback cycles!
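
The dependence of timing on memory layout can be made concrete with a small computation. The following is a minimal Python sketch under an assumed, hypothetical cache geometry (direct-mapped, 32-byte lines, 128 sets, i.e. 4 KiB) that is not taken from the presentation; the helpers `cache_sets` and `conflicting_sets` are illustrative names only.

```python
LINE_SIZE = 32  # bytes per cache line (hypothetical geometry)
NUM_SETS = 128  # direct-mapped: 128 sets x 32 B = 4 KiB (hypothetical)


def cache_sets(address, size):
    """Set indices touched by a code block occupying [address, address + size)."""
    first = address // LINE_SIZE
    last = (address + size - 1) // LINE_SIZE
    return {line % NUM_SETS for line in range(first, last + 1)}


def conflicting_sets(a, b):
    """Cache sets shared by two procedures, each given as (address, size)."""
    return cache_sets(*a) & cache_sets(*b)


p = (0x0000, 256)  # p occupies sets 0-7
print(conflicting_sets(p, (0x0F00, 128)))  # q in sets 120-123: prints set()
print(conflicting_sets(p, (0x1000, 128)))  # q shifted by 256 B: prints {0, 1, 2, 3}
```

In this toy setting a 256-byte shift of `q`, such as one caused by an increment that grew the code placed before it, moves `q` onto the sets already used by `p`, so the two procedures start evicting each other.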

Focusing on the instruction cache
Cache-aware procedure positioning
- Improves both performance and predictability
  - Avoids or reduces conflict misses
- The granularity of procedures is industrially appealing
  - Methods working on basic blocks are too fine-grained and require specialized tool support
- Reduces the potential jitter by pinning down a memory layout
Graph-based program representation
- The Weighted Call Graph $WCG_P$ for a program $P$ is an undirected weighted graph with
  - $V = \{ p \mid p \text{ is a procedure in } P \}$
  - $E = \{ (p, p') \mid p \text{ calls } p' \lor p' \text{ calls } p \} \subseteq V \times V$
  - $W_{p,p'}$: call frequency between $p$ and $p'$ in $P$
Placement heuristic
- Nodes are pairwise merged according to $\max W_{p_i,p_j}$
- The induced procedure ordering yields the actual memory layout
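
The WCG and its placement heuristic can be sketched in a few lines of Python. This is a minimal, simplified greedy merge, assuming a hypothetical list of `(caller, callee, frequency)` triples as input: the heaviest edge is collapsed first and the merged node inherits the summed weights of both ends. Real call-graph positioning heuristics also choose the orientation of merged chains, which this sketch omits; `build_wcg` and `greedy_layout` are illustrative names.

```python
from collections import defaultdict


def build_wcg(calls):
    """Undirected WCG: frozenset({p, q}) -> summed call frequency W_{p,q}."""
    weights = defaultdict(int)
    for caller, callee, freq in calls:
        if caller != callee:  # self-recursion adds no placement edge
            weights[frozenset((caller, callee))] += freq
    return dict(weights)


def greedy_layout(procs, weights):
    """Pairwise merge nodes by maximum weight and return a procedure order."""
    chains = {p: [p] for p in procs}  # merged node -> procedures it contains
    edges = dict(weights)
    while edges:
        # Heaviest edge first; ties broken by node names for determinism.
        pair = min(edges, key=lambda e: (-edges[e], tuple(sorted(e))))
        a, b = sorted(pair)
        del edges[pair]
        chains[a].extend(chains.pop(b))  # b's procedures follow a's in memory
        # Redirect b's remaining edges to the merged node a, summing weights.
        for e in [e for e in edges if b in e]:
            w = edges.pop(e)
            (other,) = e - {b}
            key = frozenset((a, other))
            edges[key] = edges.get(key, 0) + w
    order = []
    for node in sorted(chains):  # concatenate the surviving chains
        order.extend(chains[node])
    return order


calls = [("main", "p", 1), ("p", "q", 10), ("main", "r", 1), ("q", "r", 5)]
print(greedy_layout(["main", "p", "q", "r"], build_wcg(calls)))
# prints ['main', 'p', 'q', 'r']
```

Note that only call frequencies drive the merging, which is exactly the information the WCG retains.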

Limitations and drawbacks
Weaknesses of current approaches
✗ Historically focused on average-case optimization
  - Built on execution traces rather than on program structure
  - WCET-oriented approaches only recently proposed
✗ Poorly scalable to large-scale industrial systems
  - Especially WCET-oriented methods, as they rely on several iterations of static WCET analysis
✗ Only applicable at the tail end of development
  - Thus failing to account for the incremental nature of development
What we propose
✓ An alternative program representation to the WCG
  - Improving on accuracy and scalability
✓ An optimization method based on program structure
  - Holistically addressing both WCET and AVG performance
  - Incrementally applicable to subsequent software releases

Need for an alternative representation
Pitfall of WCG
- The WCG representation may be ambiguous
- With negative consequences on the computed layout
  - The sources of conflict misses are not necessarily the same
  - May lead to bad node merging (and layout)
- It fails to account for the importance of loop nests
  - Call frequencies alone are not sufficient to capture all the structural information
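
The ambiguity can be illustrated with two toy programs. In the hypothetical Python model below, a program body is a list of `(loop_bound, callees)` loops, all inside `main`: program A calls `p` and `q` in two separate loops, program B calls both in the same loop. The encoding and names are assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy model: each entry is one loop in main, (bound, callees).
PROGRAM_A = [(10, ["p"]), (10, ["q"])]  # p and q in separate loops
PROGRAM_B = [(10, ["p", "q"])]          # p and q in the same loop


def wcg(body, caller="main"):
    """WCG edge weights: call frequency between main and each callee."""
    weights = Counter()
    for bound, callees in body:
        for callee in callees:
            weights[frozenset((caller, callee))] += bound
    return weights


def same_loop_pairs(body):
    """Pairs of procedures called within the same loop."""
    pairs = set()
    for _, callees in body:
        pairs.update(frozenset(c) for c in combinations(sorted(set(callees)), 2))
    return pairs


print(wcg(PROGRAM_A) == wcg(PROGRAM_B))  # prints True
print(same_loop_pairs(PROGRAM_A), same_loop_pairs(PROGRAM_B))
```

Both programs yield the same WCG, yet only in B do `p` and `q` alternate inside a loop and thus compete for the same cache lines on every iteration; that distinction is lost in the call frequencies.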

The Loop-Call Tree structure
Basic intuition
- Procedures involved in the same loop are the most critical source of cache conflicts
- Loop nests need to be considered explicitly
Loop-Call Tree
- The Loop-Call Tree $LCT_P$ for a program $P$ is an ordered directed tree with
  - $V = \{ p \mid p \in \mathit{Proc}(P) \} \cup \{ l^p_i \mid l^p_i \text{ is the } i\text{-th loop in } p \}$
  - $E = \{ (p, p') \mid p \to p' \} \cup \{ (p, l^p_i) \} \cup \{ (l^p_i, p') \mid l^p_i \to p' \} \cup \{ (l^p_i, l^p_j) \mid l^p_j \text{ is nested inside } l^p_i \} \subseteq V \times V$
  - $B_{l^p_i}$: statically computed loop bound
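
A minimal Python sketch of an LCT encoding, assuming a hypothetical `LCTNode` class in which loop nodes carry their bound $B_{l^p_i}$. As one plausible use of the bounds (an assumption, not stated on the slide), `execution_counts` multiplies the enclosing loop bounds along each root-to-node path and sums them over every occurrence of a procedure, giving a static upper estimate of its execution frequency.

```python
from dataclasses import dataclass, field


@dataclass
class LCTNode:
    """Loop-Call Tree node: a procedure p or a loop l^p_i (hypothetical encoding)."""
    name: str
    is_loop: bool = False
    bound: int = 1  # B_{l^p_i} for loops; ignored for procedures
    children: list = field(default_factory=list)  # child LCTNode objects


def proc(name, *children):
    return LCTNode(name, False, 1, list(children))


def loop(name, bound, *children):
    return LCTNode(name, True, bound, list(children))


def execution_counts(node, multiplier=1, acc=None):
    """Per-procedure count: product of enclosing loop bounds, summed over
    every occurrence of the procedure in the tree."""
    if acc is None:
        acc = {}
    if node.is_loop:
        multiplier *= node.bound
    else:
        acc[node.name] = acc.get(node.name, 0) + multiplier
    for child in node.children:
        execution_counts(child, multiplier, acc)
    return acc


# main -> init; main -> loop (bound 10) -> {p -> loop (bound 4) -> q, r}
tree = proc("main",
            proc("init"),
            loop("l1_main", 10,
                 proc("p", loop("l1_p", 4, proc("q"))),
                 proc("r")))
print(execution_counts(tree))
# prints {'main': 1, 'init': 1, 'p': 10, 'q': 40, 'r': 10}
```

Unlike the WCG, the tree keeps `p`, `q` and `r` under the same loop node, so the loop-induced relation between them survives in the representation.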

Computing an optimal layout
LCT structural properties
- Naturally exhibits the loop-induced relation between procedures
- Subtrees can be ordered with respect to depth and execution frequency
  - Several heuristics can be defined
- Post-order depth-first traversal
  - Privileges nodes belonging to the same loop nest
Procedure selection
- Procedures in the same subtree form independent pools
- Pools are incrementally merged together
- Pool independence is broken by procedures appearing in different subtrees
  - Memory displacements are introduced in the merging step
  - Fragmentation is cured with relatively independent procedures
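
The post-order depth-first traversal can be sketched as follows. This is one possible reading of the slide, not the authors' implementation: the tree is a hypothetical `(name, is_loop, children)` tuple, children are visited in the given order (an ordering heuristic could sort them by depth or frequency first), and a shared procedure is placed only at its first occurrence and merely remembered afterwards.

```python
def post_order_procedures(node, placed=None, order=None):
    """Post-order depth-first traversal of a Loop-Call Tree.

    node is a hypothetical (name, is_loop, children) tuple. Procedures of the
    same loop nest come out contiguously; a procedure already placed is only
    remembered and never placed twice.
    """
    if placed is None:
        placed, order = set(), []
    name, is_loop, children = node
    for child in children:
        post_order_procedures(child, placed, order)
    if not is_loop and name not in placed:
        placed.add(name)
        order.append(name)
    return order


def proc(name, *children):  # procedure node
    return (name, False, list(children))


def loop(name, *children):  # loop node
    return (name, True, list(children))


tree = proc("main",
            proc("init"),
            loop("l1_main", proc("p", loop("l1_p", proc("q"))), proc("r")),
            proc("s", proc("q")))  # q is shared: it appears in two subtrees
print(post_order_procedures(tree))
# prints ['init', 'q', 'p', 'r', 's', 'main']
```

The procedures of the `l1_main` loop nest (`q`, `p`, `r`) end up adjacent, while the second occurrence of the shared `q` under `s` is what breaks pool independence.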

Example (animated figure, steps only)
- Select the first nodes
- Merge P and Q
- Keep on merging
- Q is already in the pool... just remember it
- Merge S and T
- Merge, optionally with a displacement
- U does not fit in the gap
- U fits in the gap
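
The gap-filling step can be sketched with a first-fit search. The slide only states whether U fits in the gap left by a displacement; the first-fit policy, the `(start, end)` gap encoding and the name `place_in_gap` are assumptions of this Python sketch.

```python
def place_in_gap(gaps, size):
    """First-fit placement of a procedure into a displacement gap.

    gaps: list of free (start, end) address ranges, end exclusive.
    Returns (address, remaining_gaps); address is None when no gap is large
    enough, in which case the procedure would go at the end of the layout.
    """
    for i, (start, end) in enumerate(gaps):
        if end - start >= size:
            rest = [(start + size, end)] if end - start > size else []
            return start, gaps[:i] + rest + gaps[i + 1:]
    return None, gaps


gaps = [(0x100, 0x140)]  # 64-byte hole left by a displacement
print(place_in_gap(gaps, 0x80))  # U too large: prints (None, [(256, 320)])
print(place_in_gap(gaps, 0x30))  # U fits: prints (256, [(304, 320)])
```

Filling holes with procedures that have few relations to their neighbours is what keeps the fragmentation introduced by displacements low.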

Fitting all into incremental development
Development as a sequence of incremental steps
- Qualification status should be incrementally preserved
  - For either additive or corrective increments
  - No regression outside of the modules intentionally affected
- When it comes to caches
  - The memory layout of pre-existing modules must be preserved
Incremental optimization
- The LCT intrinsically fits incremental addition
  - No assumptions on the pre-existing pools in the merging step
  - Keep the global ordering up to the increment as a set of constraints
  - Exploit them as an initial pre-existing subtree
- Naturally absorbs changes that are local to a module
  - Changes within a subtree do not affect the ordering of others
- Problems arise with shared procedures
  - They introduce dependences (i.e., displacements) within subtrees
  - Layout preservation may require high fragmentation
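
A minimal Python sketch of treating the previous layout as a set of constraints. It assumes, hypothetically, that pre-existing procedures keep their addresses verbatim and that procedures new in the increment are simply appended after the old image with a fixed alignment; the function name, inputs and the 4-byte alignment are illustrative, not taken from the presentation.

```python
def incremental_layout(previous, sizes, new_order, align=4):
    """Lay out a new release while preserving the previous addresses.

    previous:  {procedure: address} from the already-qualified release.
    sizes:     {procedure: size in bytes} in the new release.
    new_order: procedure order computed for the new release.
    """
    # Preserved procedures must not overlap: a grown one cannot stay in place.
    spans = sorted((addr, addr + sizes[p], p) for p, addr in previous.items())
    for (_, end1, p1), (start2, _, p2) in zip(spans, spans[1:]):
        if end1 > start2:
            raise ValueError(f"{p1} grew into {p2}: layout cannot be preserved")
    layout = dict(previous)
    cursor = max((end for _, end, _ in spans), default=0)
    for p in new_order:
        if p in layout:  # constraint from the previous release: keep address
            continue
        cursor = (cursor + align - 1) // align * align  # round up to alignment
        layout[p] = cursor
        cursor += sizes[p]
    return layout


old = {"a": 0x00, "b": 0x40}
sizes = {"a": 0x40, "b": 0x20, "c": 0x10}
print(incremental_layout(old, sizes, ["c", "a", "b"]))
# prints {'a': 0, 'b': 64, 'c': 96}
```

The `ValueError` branch mirrors the case where a corrective change grows a preserved procedure: keeping the old layout then requires relocation or padding, which is where fragmentation comes from.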

Evaluation
On AVG/WCET I-cache behaviour and WCET variation
- Targeting the LEON2 (SPARC V8) processor
- Focusing on reference and domain-specific benchmarks
  - Mälardalen, Mediabench, MiBench, AOCS software
- Prototype tool

Average and worst-case performance
[Figures: average-case hit ratio; worst-case hit ratio]

Corresponding global WCET performance
Assessing the overall WCET improvement
- Fairly proportional to the hit-ratio improvement, owing to the relatively simple hardware platform and setting (e.g., D-cache disabled)
[Figure: global WCET reduction]

Robustness to incremental release
Simulated incremental steps
- Modules from the AOCS benchmark (GNC, PRO, TMTC)
Confirms constant WCET behaviour for GNC
- Against up to +26% potential variation if no countermeasure is taken
- Low fragmentation: less than 2% increase in executable size
[Figure: WCET variation across releases]

Conclusion
Novel procedure positioning approach
- More accurate program representation
- Improves both average-case and worst-case performance
- Robust against incremental development
Limitations
- A better solution is still needed to handle regression in the presence of shared procedures
- Iterative (but costly) WCET-oriented approaches may provide better WCET performance
Future work
- Implement our approach as a plugin to the standard GCC compiler
- Carry out an extensive evaluation of different ordering heuristics
