FICO
a Fast Instruction Cache Optimizer
Author: Marco Garatti
Presented by: Roberto Costa
Advanced System Technology, STMicroelectronics

Motivations
• Instruction cache (icache) misses can drastically decrease code performance
• The problem is even more important for 1-level direct mapped caches
• On Lx ST210 the icache slows down the code by about 14.3% on our BenchSuite

Goals and Requirements
• Improvement of icache performance for programs compiled by our industrial compiler
• No dynamic program profiling must be necessary
• No program size increase

Cache Miss Classification
• Compulsory: the very first access to a block cannot be in the cache, so the block must be brought into the cache. These are also called cold start misses or first reference misses
• Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur because of blocks being discarded and later retrieved
• Conflict: if the block placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. These are also called collision misses

How to Decrease Misses
• Compulsory misses cannot be avoided
• Capacity misses can be decreased using two basic ideas:
  • Increasing the icache size
  • Decreasing the code size
• Conflict misses can be decreased by an appropriate layout of the program code

FICO Main Features
• Focuses on conflict misses only
• Works at function level (by reordering them)
• Relies on estimated execution profile information
• Is implemented as a linking tool
• It is usable in an industrial compiler since it is fast and it does not require any program execution to gather profiling information
• The achieved performance speed-up is about 50% of the maximum achievable
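
To make the notion of a conflict miss concrete, the following is a minimal sketch of how two functions can collide in a direct mapped icache. It assumes the 32K size and 64-byte line size quoted for the Lx ST210 icache model in the Experiments slide; the helper names `cache_lines` and `conflicts` are illustrative and are not part of FICO.

```python
CACHE_SIZE = 32 * 1024               # bytes, as quoted for the Lx ST210 icache model
LINE_SIZE = 64                       # bytes per cache line, same source
NUM_LINES = CACHE_SIZE // LINE_SIZE  # 512 lines when the cache is direct mapped


def cache_lines(address, size):
    """Return the set of cache line indices used by the bytes [address, address + size)."""
    first_block = address // LINE_SIZE
    last_block = (address + size - 1) // LINE_SIZE
    # In a direct mapped cache each memory block maps to exactly one line.
    return {block % NUM_LINES for block in range(first_block, last_block + 1)}


def conflicts(func_a, func_b):
    """Hypothetical helper: True if two (address, size) functions share a cache line."""
    return bool(cache_lines(*func_a) & cache_lines(*func_b))


# Two functions exactly one cache size apart map to the same lines and evict each other.
print(conflicts((0, 128), (32768, 64)))
# Two adjacent functions use distinct lines.
print(conflicts((0, 128), (128, 64)))
```

Reordering functions changes their addresses, and therefore which lines they share; this is the only lever a layout tool has over conflict misses.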

Compilation Flow
[Figure: compilation flow diagram. Labels: Source 1 ... Source n, Compiler, Assembler, Obj 1 ... Obj n, Libs, Linker, FICO, Exe]

Algorithm Outline
• The algorithm heuristically determines a function order to minimize function conflicts
• The order is computed by analyzing the call graph annotated with call frequencies
• The algorithm has a precise knowledge of the icache structure

Algorithm Steps
1. Compute the program call graph
2. Prune the call graph
3. Propagate local frequencies to derive global profiling information
4. Compute interesting neighbors of call-graph nodes
5. Generate an "optimal" function layout

Step 1: Call Graph
• The call graph is built through a linear scan of the program code (only direct calls are considered)
• For each call site the compiler creates an entry into an appropriate section with the local estimated call execution frequency*
• The final graph is annotated with a local execution frequency on each edge
* Execution frequencies are floating point numbers

Step 2: Graph Pruning
• The graph is pruned to speed up the overall algorithm performance (edges with execution frequency under a given threshold are deleted)
• Nodes without parents (all but main) are deleted
• Cycles in the graph are destroyed. This makes the graph a DAG. Each node that was in a loop will have its frequency increased

Step 3: Global Frequencies
$G(P) = \sum_{p \in \mathrm{Pred}(P)} G(p) \cdot L(p, P)$
• G(P) is the global frequency of P (how many times P is entered)
• L(P1,P2) is the local frequency of the edge P1 → P2
[Figure: example call graph rooted at Main with nodes P1 to P5. Edges carry local frequencies (1, 20, 40); the resulting global frequencies are Main[1], P1[1], P2[20], P3[20], P4[400], P5[40]]
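
The propagation in Step 3 can be sketched as a single pass over the pruned call graph in topological order. This is a minimal illustration, not FICO's code: the function name `global_frequencies` and the dictionary encoding of edges are assumptions, and the edge assignment in the example is a guess chosen only because it reproduces the global frequencies shown on the slide (the flattened figure does not show which edge carries which label).

```python
from collections import defaultdict, deque


def global_frequencies(edges, root="Main"):
    """Compute G(P) = sum over p in Pred(P) of G(p) * L(p, P) on a DAG.

    edges maps (caller, callee) to the local frequency L(caller, callee).
    The graph must already be acyclic, as after the pruning step.
    """
    successors = defaultdict(list)
    indegree = defaultdict(int)
    nodes = {root}
    for (src, dst), local in edges.items():
        successors[src].append((dst, local))
        indegree[dst] += 1
        nodes.update((src, dst))

    freq = {node: 0.0 for node in nodes}
    freq[root] = 1.0  # the root is entered once
    ready = deque(node for node in nodes if indegree[node] == 0)
    while ready:
        node = ready.popleft()
        for dst, local in successors[node]:
            freq[dst] += freq[node] * local
            indegree[dst] -= 1
            if indegree[dst] == 0:  # all predecessors have been processed
                ready.append(dst)
    return freq


# Assumed edge assignment (hypothetical, see the note above).
example_edges = {
    ("Main", "P1"): 1.0,
    ("Main", "P2"): 20.0,
    ("P1", "P5"): 40.0,
    ("P2", "P3"): 1.0,
    ("P2", "P4"): 20.0,
}
result = global_frequencies(example_edges)
print(result["P4"])  # 20 (entries of P2) * 20 (local frequency) = 400.0
```

Processing a node only after all of its predecessors guarantees that each G(p) is final before it is multiplied by L(p, P), which is why the cycle removal of Step 2 has to come first.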

Step 4: Computation of Neighbors
• Each node in the call graph has a set of interesting neighbors associated IF(N)
• Node B is an interesting neighbor for node A if their conflict can affect performance
• IF(N) is estimated including some of the closest relatives of N
• Depending on the call graph size, IF(N) size is tuned to let the algorithm be fast enough

Neighbors: Example
[Figure: call graph with nodes Main, P1, P2, P3, P4, P5, P6]
• This example shows 3 possible neighbors for node P4. The number of neighbors can be extended to include grandparents, grandchildren, cousins and other relatives
• Each neighbor has a conflict cost associated. The closer the two nodes, the higher this cost
• The cost is also proportional to the number of times the two functions may conflict

Step 5: Layout Computation
• Edges are sorted on their global frequency
• Tail and head of heavy edges are placed one close to each other
• Nodes are placed in the spot that minimizes the conflict cost

Function Placement
Example layout: ε(−20,∞) F2(-20,20) F1(0,50) ε(50,30) F3(50,10) ε(50,∞)
• The memory layout is modelled by blocks of these types:
  • Functions, with an offset and a size
  • Empty blocks, with an offset and a maximum size (coil)
• When a pair of functions need to be placed all the empty slots are scanned and for those big enough to accommodate the functions a placement cost is computed

Coil Size Computation
• Each time a function is placed, the coil maximum size must be recomputed
• Let F be the function being placed. Each coil laid between F and one of the non-conflicting nodes in IF(F) is resized to ensure that they will not conflict

Cost Computation
Example layout: ε(−20,∞) F2(-20,20) F1(0,50) ε(50,30) F3(50,10) ε(50,∞)
Functions to place: F4(?,20) F5(?,20)
• Each empty slot big enough to accommodate F4 and F5 is checked
• Each interesting neighbor of F4 (F5) that is already placed and that conflicts with F4 (F5) gives a contribution proportional to their conflicting frequency and distance
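
The slot scan and cost computation described above can be illustrated with a small sketch. The slides do not give the exact cost formula, so this example substitutes a simple stand-in: each already placed neighbor contributes its conflict weight times the number of cache lines it shares with the candidate position. The names `lines_of`, `placement_cost` and `best_slot`, the tuple encodings, and the example values are all hypothetical; the cache geometry is the 32K, 64-byte-line model from the Experiments slide.

```python
LINE_SIZE = 64   # bytes per line (Lx ST210 icache model)
NUM_LINES = 512  # 32K direct mapped cache / 64-byte lines


def lines_of(offset, size):
    """Cache line indices covered by a block of `size` bytes at `offset`."""
    first = offset // LINE_SIZE
    last = (offset + size - 1) // LINE_SIZE
    return {block % NUM_LINES for block in range(first, last + 1)}


def placement_cost(offset, size, placed_neighbors):
    """Stand-in cost: sum of weight * shared cache lines over placed neighbors.

    placed_neighbors is a list of (offset, size, weight) tuples, where weight
    plays the role of the conflict cost attached to an interesting neighbor.
    """
    mine = lines_of(offset, size)
    return sum(weight * len(mine & lines_of(o, s))
               for o, s, weight in placed_neighbors)


def best_slot(size, slots, placed_neighbors):
    """Scan empty slots given as (offset, max_size); max_size None means unbounded.

    Returns the offset of the cheapest slot that is big enough.
    Raises ValueError if no slot can hold the function.
    """
    candidates = [off for off, max_size in slots
                  if max_size is None or max_size >= size]
    return min(candidates,
               key=lambda off: placement_cost(off, size, placed_neighbors))


# Hypothetical example: one hot neighbor occupies cache lines 0 and 1.
neighbors = [(0, 128, 400.0)]
slots = [(32768, 256), (128, None)]  # the first slot aliases the neighbor's lines
print(placement_cost(32768, 128, neighbors))  # 2 shared lines * 400.0 = 800.0
print(best_slot(128, slots, neighbors))       # 128, which shares no line
```

A real implementation would also resize the coils after each placement, as the Coil Size Computation slide requires; that bookkeeping is left out here to keep the cost scan visible.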

Pros and Cons
Pros
• No execution profiling information is required
• Fast execution
Cons
• Relies on the call graph. If it cannot be precisely built the algorithm is not effective (indirect function calls, system calls)
• No temporal information is taken into account

Experiments
• Experiments used the Lx ST210 icache model:
  • 1 level
  • Direct access
  • 32K size
  • 64-byte line size
• Miss delay set as a typical one for an embedded system configuration
• BenchSuite includes multimedia applications and "go" as general-purpose application

Icache Impact
Benchmark    NoCache  Cache-NoOpt  Cache Impact  Compulsory  Comp Impact  Comp+Cap  Comp+Cap Impact
Adpcm        1.74     1.7          97.7%         1.7         97.7%        1.7       97.7%
Copymark     3.84     3.69         96.1%         3.78        98.4%        3.65      95.1%
Crypto       1.59     1.53         96.2%         1.58        99.4%        1.58      99.4%
Csc          3.48     3.45         99.1%         3.45        99.1%        3.45      99.1%
Dhry         0.81     0.71         87.7%         0.71        87.7%        0.71      87.7%
Go           1.26     0.73         57.9%         1.15        91.3%        0.93      73.8%
Mp2audio     1.62     1.49         92.0%         1.57        96.9%        1.51      93.2%
Mp2vloop     5.21     4.57         87.7%         5.02        96.4%        5.02      96.4%
Mp2avswitch  2.39     2.02         84.5%         2.31        96.7%        2.3       96.2%
Mp4dec       2.33     1.82         78.1%         2.21        94.8%        2.21      94.8%
Mpeg2        3.73     2.53         67.8%         3.57        95.7%        3.56      95.4%
Opendivx     3.43     2.64         77.0%         3.2         93.3%        3.2       93.3%
Tjpeg        5.1      4.69         92.0%         4.82        94.5%        4.82      94.5%
Arith Mean   2.810    2.428        85.7%         2.698       95.5%        2.665     93.6%

Legenda:
• NoCache: perfect cache
• Cache No Opt: real icache, default layout optimization
• Compulsory: effect of compulsory misses
• Comp+Cap: effect of compulsory and capacity misses
Annotations: Conflict misses. Upper bound is 93.6% and initially it was 85.7%

FICO Impact (ST210)
Benchmark    NoCache  Cache-NoOpt  IcacheOpt  Cache Impact  Speedup of FICO
Adpcm        1.74     1.7          1.7        97.7%         100.0%
Copymark     3.84     3.69         3.68       95.8%         99.7%
Crypto       1.59     1.53         1.56       98.1%         102.0%
Csc          3.48     3.45         3.45       99.1%         100.0%
Dhry         0.81     0.71         0.71       87.7%         100.0%
Go           1.26     0.73         0.74       58.7%         101.4%
Mp2audio     1.62     1.49         1.51       93.2%         101.3%
Mp2vloop     5.21     4.57         5.02       96.4%         109.8%
Mp2avswitch  2.39     2.02         2.21       92.5%         109.4%
Mp4dec       2.33     1.82         2.14       91.8%         117.6%
Mpeg2        3.73     2.53         2.68       71.8%         105.9%
Opendivx     3.43     2.64         3.01       87.8%         114.0%
Tjpeg        5.1      4.69         4.66       91.4%         99.4%
Arith Mean   2.810    2.428        2.544      89.4%         104.66%
Annotation: Average speed-up upper bound

Future Developments
• Other placement algorithms can be investigated
• Use of real profile information
• Tuning on the placement algorithm performance
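
The claim that FICO achieves about 50% of the maximum achievable speed-up can be checked against the arithmetic means in the two result tables. The sketch below assumes that the Comp+Cap column (compulsory and capacity misses only) is the upper bound for any layout optimizer, which is how the slide annotations read; the function name `fraction_recovered` is illustrative.

```python
# Arithmetic means taken from the result tables.
CACHE_NO_OPT = 2.428   # real icache, default layout
ICACHE_OPT = 2.544     # real icache, layout produced by FICO
COMP_PLUS_CAP = 2.665  # assumed upper bound: no conflict misses at all


def fraction_recovered(baseline, optimized, upper_bound):
    """Share of the gap between baseline and upper bound closed by the optimizer."""
    return (optimized - baseline) / (upper_bound - baseline)


share = fraction_recovered(CACHE_NO_OPT, ICACHE_OPT, COMP_PLUS_CAP)
print(f"{share:.1%}")  # roughly half of the achievable gain
```

The same computation on the Cache Impact percentages (85.7% before, 89.4% with FICO, 93.6% upper bound) gives a similar share, consistent with the "about 50%" figure.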
