slide-1
SLIDE 1

Hardware Modeling 2 Cache Analyses

Peter Puschner

slides credits: P. Puschner, R. Kirner, B. Huber

VU 2.0 182.101 SS 2015

slide-2
SLIDE 2

Recap: Caches in WCET Analysis

Purpose: bridge the gap between fast CPU and memory

  • Essential to analyze caches on many architectures. Example: 40 cycles for a miss on MPC755
  • What: instructions, data, BTB, TLB
  • Design: direct mapped, set/fully associative
  • Replacement policy: LRU, FIFO, PLRU, PRR
  • More characteristics: read-only / write-through / write-back, write (no) allocate, multi-level caches (inclusive/exclusive), ...

2

slide-3
SLIDE 3

Caches in WCET Analysis

For software running on hardware with caches, computing the WCET by IPET alone (CFG + CCG) becomes too complex. Ignoring caches leads to unacceptable overestimations.

→ Decomposition of WCET analysis into 2+ phases

  • 1. Categorization of memory accesses w.r.t. cache behavior (e.g., always hit, always miss, etc.); the low-level analysis uses this cache categorization.
  • 2. WCET computation: IPET with no or a simplified cache model

3

slide-4
SLIDE 4

4

Categories of Cache Behavior

  • ah (always hit): each access to the cache is a hit (MUST analysis)
  • am (always miss): each access to the cache is a miss (MAY analysis ➭ complement)
  • ps(S) (persistent): for each entering of context S, the first access is nc, but all other accesses are hits (PERSISTENCE analysis)
  • nc (not classified): the access is not classified as one of the above categories

slide-5
SLIDE 5

Direct Mapped Cache

5

[Figure: direct-mapped cache with m lines; each line holds a valid bit (v), a tag, and k bytes of data (words w1 … wk). The line is selected by ld(m) address bits, the word within a line by ld(k) offset bits; the remaining address bits form the tag.]
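The address decomposition above can be sketched in Python. This is a hypothetical illustration, not from the slides; `m` (lines) and `k` (bytes per line) are assumed to be powers of two:

```python
def split_address(addr, m=4, k=8):
    """Split an address into (tag, line, offset) for a direct-mapped
    cache with m lines of k bytes each."""
    offset_bits = k.bit_length() - 1   # ld(k) bits select the word
    line_bits = m.bit_length() - 1     # ld(m) bits select the line
    offset = addr & (k - 1)
    line = (addr >> offset_bits) & (m - 1)
    tag = addr >> (offset_bits + line_bits)
    return tag, line, offset

print(split_address(0x2B))  # -> (1, 1, 3)
```

With m = 4 and k = 8, address 0x2B (0b101011) yields offset 0b011, line 0b01, and tag 1, matching the (tag, line, offset) triples used in the following example.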

slide-6
SLIDE 6

DM-$ Analysis Example

6

Compiled from e.g.

 x, y, z = a, b, 0
 while (x > 0 && y > 0) {
   z += x-- + y--
 }
 x, y = 0, 0

Access trace (tag, line, offset) from START to END:
(0,0,0) (0,0,1) (0,1,1) (0,2,0) (0,2,1) (0,3,0) (0,3,1) (0,1,0) (1,0,0)

slide-7
SLIDE 7

DM-$ Analysis Example

7

Compiled from e.g.

 x, y, z = a, b, 0
 while (x > 0 && y > 0) {
   z += x-- + y--
 }
 x, y = 0, 0

Access trace (tag, line, offset) from START to END:
(0,0,0) (0,0,1) (0,2,0) (0,3,0) (0,3,1) (0,1,0) ... (1,0,0)

Classification:
  • (1,0,0): always miss — conflicts with (0,0,x)
  • (0,1,1) (0,2,0) (0,2,1) (0,3,0) (0,1,0): continue with 2nd loop iteration — always hit (2..n loop iteration)
  • (0,1,1) (0,2,1): always hit

slide-8
SLIDE 8

8

Cache Classification (Hit/Miss)

Goal: a mechanized analysis which classifies each cache access in a certain context (e.g., call context) as either

  Ø Always hit: in all possible executions, this access to the cache will be a cache hit (the accessed cache block is guaranteed to be in the cache)

  Ø Always miss: in all possible executions, this access to the cache will be a cache miss (the accessed cache block is guaranteed NOT to be in the cache)

  Ø Not classified: the accessed cache block may or may not be in the cache

slide-9
SLIDE 9

9

Automated Categorization of Memory Accesses

à Based on Abstract Interpretation and fixed-point analysis

  • f cache states in the CFG

à Cache update function: models changes of the cache

state for memory accesses

à Join function: Combines states at control-flow joins à Concrete Semantics: Set of possible cache

configurations (tags only, no data) at each program point

à Abstract Semantics: Efficient approximation in an

abstract, “more efficient” domain

slide-10
SLIDE 10

10

Data-Flow Analysis (DFA)

DFA analysis is based on the data-flow structure of the system behavior of interest (e.g., forward and backward propagation).

  • PRED(n) are the virtual predecessors of CFG node n regarding the data flow of interest (Cache Analysis: usually CFG predecessors)

The data domain L of the analysis forms a lattice, on which the transfer function Fn(): L → L models the semantics of the system behavior of interest. To merge two or more states, a join function ⊔: L × L → L is used to compute the least upper bound.

slide-11
SLIDE 11

11

Data-Flow Analysis (2)

Data-flow equations modeling the data flow between nodes:

IN(n) = ⊔ ( { OUT(j) | j ∈ PRED(n) } )
OUT(n) = Fn ( IN(n) )

[Figure: node n with incoming IN(n), transfer function Fn(), and outgoing OUT(n)]

slide-12
SLIDE 12

12

Data-Flow Analysis (3)

Monotonicity requirements for solving the data-flow equations iteratively:

  • the transfer functions Fn(s) as well as the join function s1 ⊔ s2 must be monotone to ensure termination of the analysis.

Monotonicity: a function f: A → B is monotone iff ∀a,a′ ∈ A. (a ⊆A a′) → ( f(a) ⊆B f(a′) )

slide-13
SLIDE 13

13

Data-Flow Analysis (4)

Iterative Algorithm to find least fixpoint for data-flow equations:

for i ← 1 to N do              /* initialize node i: */
    OUT(i) = ⊥
while (sets are still changing) do
    for i ← 1 to N do          /* recompute sets at node i: */
        IN(i)  = ⊔ ( { OUT(j) | j ∈ PRED(i) } )
        OUT(i) = Fi( IN(i) )
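As an illustration, here is a minimal Python sketch of this iterative fixpoint computation on a small hypothetical diamond-shaped CFG (nodes A–D and the `access` map are invented for the example), using a MAY-style domain: sets of tags, with set union as the join:

```python
def fixpoint(nodes, pred, transfer, bottom):
    """Iterate IN/OUT equations until no OUT set changes (least fixpoint)."""
    out = {n: set(bottom) for n in nodes}   # OUT(i) = bottom for all nodes
    changed = True
    while changed:                          # "sets are still changing"
        changed = False
        for n in nodes:
            inp = set()
            for p in pred.get(n, []):       # IN(n) = join of predecessors
                inp |= out[p]               # join = least upper bound (union)
            new = transfer(n, inp)          # OUT(n) = Fn(IN(n))
            if new != out[n]:
                out[n] = new
                changed = True
    return out

# Example CFG: A branches to B and C, which rejoin at D.
access = {"A": "x", "B": "y", "C": "z", "D": "x"}
pred = {"B": ["A"], "C": ["A"], "D": ["B", "C"]}
result = fixpoint(["A", "B", "C", "D"], pred,
                  lambda n, s: s | {access[n]}, set())
print(sorted(result["D"]))  # -> ['x', 'y', 'z']
```

A real MUST analysis would use intersection-with-maximal-age as join instead of union; the termination argument (monotone functions on a finite lattice) is the same.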

slide-14
SLIDE 14

14

Concrete & Abstract Semantics

Concrete Cache Semantics: models the semantics of the relevant aspects of the program (here: cache state & update). The concrete semantics collects the set of all possible cache states for each program point.

Abstract Cache Semantics: semantics in a different, usually finite domain, connected to the concrete semantics by an abstraction/concretization function.

slide-15
SLIDE 15

N-way Set-Associative Cache

15

[Figure: N-way set-associative cache with m sets and n ways (blocks i,1 … i,n per set i); each block (line) holds a valid bit (v), a tag, and k bytes of data (w1 … wk). The set is selected by ld(m) address bits, the word within a block by ld(k) offset bits; the replacement strategy updates blocks within one set.]

slide-16
SLIDE 16

Fully-associative Cache (Associativity N)

16

The cache is updated based on the value of the TAG. The replacement policy determines the update strategy used.

[Figure: fully-associative cache (associativity N) with ways 1 … N; each line holds a valid bit (v), a tag, and k bytes of data (w1 … wk); the address splits into tag and offset. With LRU and FIFO, one end of the order is the youngest entry and the other the oldest, which is evicted on a miss.]
slide-17
SLIDE 17

17

Concrete Cache Semantics (Fully Associative Cache)

Cache Configuration: mapping from cache line to tag S (data is irrelevant)

Domain: for each program point, the set of all possible cache states

State at start node: singleton set with the empty cache, or the set of all possible cache configurations

Update: for a cache configuration C and cache reference S, the new cache configuration C’ after accessing S

slide-18
SLIDE 18

18

Concrete LRU Update (Fully Associative Cache)

Update Function for 4-way cache (1 line per way) with LRU

access c (HIT):  [a b c d] → [c a b d]
access e (MISS): [a b c d] → [e a b c]
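A minimal Python sketch of this concrete LRU update, modeling the fully-associative cache as a list ordered from youngest to oldest (an assumed representation, not from the slides):

```python
def lru_update(cache, tag, ways=4):
    """Concrete LRU update for a fully-associative cache with `ways` lines."""
    cache = list(cache)
    if tag in cache:            # HIT: tag moves to the front (age 1)
        cache.remove(tag)
    elif len(cache) >= ways:    # MISS in a full cache: evict the oldest
        cache.pop()
    return [tag] + cache        # accessed tag is now the youngest

print(lru_update(["a", "b", "c", "d"], "c"))  # HIT  -> ['c', 'a', 'b', 'd']
print(lru_update(["a", "b", "c", "d"], "e"))  # MISS -> ['e', 'a', 'b', 'c']
```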

slide-19
SLIDE 19

19

Abstract Cache Semantics for MUST / MAY Analysis

Abstract Cache Configuration

Compact representation of a set of cache configurations:
  • MUST: for each tag S, the maximum age
  • MAY: for each tag S, the minimum age

Join:
  • MUST: for each tag S, take the maximum age of both states
  • MAY: for each tag S, take the minimum age of both states

Update (LRU):
  • the accessed tag becomes the youngest
  • MUST: for the other tags, increase the age if the tag may be aged
  • MAY: for the other tags, increase the age if the tag must be aged

slide-20
SLIDE 20

20

Abstract Cache Representation

MUST analysis: age sets [ {a} | {} | {b} | {c} ], i.e. a ≤ 1, b ≤ 3, c ≤ 4, d,e ≤ 5+
  (⊤ = ∀x, x ≤ N+1)

or

MAY analysis: age sets [ {d,e} | {} | {a} | {b} ], i.e. a ≥ 2, b ≥ 4, c ≥ 5, d,e ≥ 1
  (⊤ = ∀x, x ≥ 1)

slide-21
SLIDE 21

21

Abstract Cache Semantics (MUST Concretization)

MUST abstract state: age sets [ {a} | {} | {b} | {c} ], i.e. a ≤ 1, b ≤ 3, c ≤ 4, d,e ≤ 5+

Concretization (possible concrete cache states, youngest first):
[a b c d], [a b c e], [a b d c], [a b e c], [a c b d], [a c b e], [a d b c], [a e b c]

slide-22
SLIDE 22

22

Abstract Cache Semantics (MUST Join)

State 1: [ {a} | {} | {b} | {c} ]    (a ≤ 1, b ≤ 3, c ≤ 4, d,e ≤ 5+)
State 2: [ {} | {a} | {} | {c,d} ]   (a ≤ 2, c ≤ 4, d ≤ 4, b,e ≤ 5+)
Join:    [ {} | {a} | {} | {c} ]     (a ≤ 2, c ≤ 4, b,d,e ≤ 5+)

slide-23
SLIDE 23

23

Abstract Cache Update Function: (LRU Cache, MUST analysis)

when accessing block c: max-age’(c) = 1

max-age(d) ≥ max-age(c) → max-age’(d) = max-age(d)
max-age(d) < max-age(c) → max-age’(d) = max-age(d) + 1
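A hedged Python sketch of this MUST update rule, together with the MUST join shown on the previous slide. The representation is assumed, not from the slides: an abstract state is a dict mapping each tag to its maximum-age bound, and tags whose bound exceeds the associativity are dropped (they are not guaranteed to be cached):

```python
ASSOC = 4  # assumed associativity for the example

def must_update(state, tag):
    """LRU MUST update: the accessed tag gets age 1; another tag ages by
    one only if its bound is below the accessed tag's old bound."""
    old = state.get(tag, ASSOC + 1)        # absent tag: age beyond assoc.
    new = {t: (age if age >= old else age + 1)
           for t, age in state.items() if t != tag}
    new[tag] = 1
    return {t: a for t, a in new.items() if a <= ASSOC}

def must_join(s1, s2):
    """MUST join: a tag survives only if it is in both states;
    keep the maximum (most pessimistic) age bound."""
    return {t: max(s1[t], s2[t]) for t in s1.keys() & s2.keys()}

# Join example from the previous slide (ages as bounds):
print(must_join({"a": 1, "b": 3, "c": 4}, {"a": 2, "c": 4, "d": 4}))
# -> {'a': 2, 'c': 4}
```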

slide-24
SLIDE 24

24

Abstract Cache Update Function: (LRU Cache, MUST analysis)

when accessing block c: max-age’(c) = 1

max-age(d) ≥ max-age(c) → max-age’(d) = max-age(d)
  • 1. assume age(d) < age(c) → max-age(d) ≥ age(d) + 1
  • 2. assume age(d) > age(c) → age’(d) = age(d)

max-age(d) < max-age(c) → max-age’(d) = max-age(d) + 1
  • 1. if age(d) < age(c), age’(d) = age(d) + 1 ≤ max-age(d) + 1
  • 2. if age(d) > age(c), age’(d) = age(d) ≤ max-age(d) + 1
slide-25
SLIDE 25

25

Cache Hit/Miss Classification using MUST analysis

If at some program point tag S must be in the cache (i.e., its maximum age is less than or equal to the associativity), then the cache access is classified as ALWAYS HIT.
If at some program point it is not the case that tag S may be in the cache (i.e., its minimum age is greater than the associativity of the cache), then the cache access is classified as ALWAYS MISS.
Otherwise, the cache access is NOT CLASSIFIED.

slide-26
SLIDE 26

Abstract Cache Semantics (MAY Concretization)

MAY abstract state: age sets [ {d,e} | {a} | {} | {b} ], i.e. a ≥ 2, b ≥ 4, c ≥ 5, d,e ≥ 1

Concretization (possible concrete cache states, youngest first):
[d a e b], [e a d b], [d e a b], [e d a b]

slide-27
SLIDE 27

Abstract Cache Semantics (MAY Join)

State 1: [ {d,e} | {a} | {} | {b} ]  (a ≥ 2, b ≥ 4, c ≥ 5, d,e ≥ 1)
State 2: [ {} | {e} | {} | {a} ]     (a ≥ 4, b ≥ 5, c ≥ 5, d ≥ 5, e ≥ 2)
Join:    [ {d,e} | {a} | {} | {b} ]  (a ≥ 2, b ≥ 4, c ≥ 5, d ≥ 1, e ≥ 1)

slide-28
SLIDE 28

28

Abstract Cache Update Function: (LRU Cache, MAY analysis)

when accessing block c: min-age’(c) = 1

min-age(d) ≤ min-age(c) → min-age’(d) = min-age(d) + 1
  • 1. if age(d) > age(c) ≥ min-age(d) → age’(d) = age(d) ≥ min-age(d) + 1
  • 2. assume age(d) < age(c) → age’(d) = age(d) + 1

min-age(d) > min-age(c) → min-age’(d) = min-age(d)
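Mirroring the MUST case, a hedged Python sketch of the MAY update and join. Again the representation is assumed: a dict mapping each tag to its minimum-age bound; a tag absent from the state, or with a bound above the associativity, cannot be in the cache:

```python
ASSOC = 4  # assumed associativity for the example

def may_update(state, tag):
    """LRU MAY update: the accessed tag gets age 1; another tag ages by
    one only if its bound does not exceed the accessed tag's old bound."""
    old = state.get(tag, ASSOC + 1)
    new = {t: (age + 1 if age <= old else age)
           for t, age in state.items() if t != tag}
    new[tag] = 1
    return {t: a for t, a in new.items() if a <= ASSOC}

def may_join(s1, s2):
    """MAY join: union of tags; keep the minimum (most optimistic) age."""
    return {t: min(s1.get(t, ASSOC + 1), s2.get(t, ASSOC + 1))
            for t in s1.keys() | s2.keys()}
```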

slide-29
SLIDE 29

29

Cache Hit/Miss Classification using MUST and MAY analysis

If at some program point tag S must be in the cache (i.e., its maximum age is less than or equal to the associativity), then the cache access is classified as ALWAYS HIT.
If at some program point it is not the case that tag S may be in the cache (i.e., its minimum age is greater than the associativity of the cache), then the cache access is classified as ALWAYS MISS.
Otherwise, the cache access is NOT CLASSIFIED.

What is the benefit of ALWAYS MISS over NOT CLASSIFIED?
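The classification rule above can be sketched directly, assuming MUST and MAY states are dicts of age bounds (maximum and minimum age respectively; a hypothetical representation, with absent tags treated as beyond the associativity):

```python
ASSOC = 4  # assumed associativity

def classify(tag, must_state, may_state):
    """Combine MUST and MAY results into the hit/miss classification."""
    if must_state.get(tag, ASSOC + 1) <= ASSOC:
        return "always hit"      # max age within associativity: must be cached
    if may_state.get(tag, ASSOC + 1) > ASSOC:
        return "always miss"     # even the min age exceeds associativity
    return "not classified"

print(classify("a", {"a": 1}, {"a": 1}))   # -> always hit
print(classify("x", {}, {}))               # -> always miss
print(classify("b", {}, {"b": 2}))         # -> not classified
```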

slide-30
SLIDE 30

30

Discussion

slide-31
SLIDE 31

31

Consider a data cache (1-word line size), with the address of odd_even_counter statically known:

static unsigned odd_even_counter[2];
 ++odd_even_counter[sensor() % 2];
 ++odd_even_counter[sensor() % 2];
 ++odd_even_counter[sensor() % 2];
 ++odd_even_counter[sensor() % 2];
 ++odd_even_counter[sensor() % 2];

Which access will be a cache miss? How many accesses will be cache hits?

Persistence Analysis

slide-32
SLIDE 32

32

Sometimes we do not know whether one particular access will always be a hit or a miss. A cache element is said to be persistent (with respect to a program scope S) if, in every execution of the scope, all but the first access are guaranteed to be cache hits. Data caches benefit from persistence analysis because the address (which implies the tag) is not exactly known (e.g., arrays).

Persistence Analysis

slide-33
SLIDE 33

33

Published persistence analyses until ~2009 were unsound. Only recently have correct persistence analyses (LRU only) been developed, published e.g. in (Ju, Huynh, Roychoudhury). Abstract domain: for each tag, the set of possibly younger tags (YS) accessed in the program scope of interest. If |YS(c)| is less than the associativity of the cache, the element is persistent in the scope (i.e., it is not evicted once loaded).
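A much-simplified Python sketch of the younger-set idea, assuming a straight-line loop body given as an access trace that repeats (a strong simplification of the DFA formulation; the function names are invented):

```python
def younger_set(trace, tag):
    """Tags accessed strictly between two consecutive accesses to `tag`
    over repeated executions of the scope (so gaps wrap around)."""
    idxs = [i for i, t in enumerate(trace) if t == tag]
    ys = set()
    n = len(trace)
    for k, i in enumerate(idxs):
        j = idxs[(k + 1) % len(idxs)]        # next access, wrapping around
        pos = (i + 1) % n
        while pos != j:
            if trace[pos] != tag:
                ys.add(trace[pos])           # may be younger than `tag`
            pos = (pos + 1) % n
    return ys

def persistent(trace, assoc):
    """Tags whose younger set fits in the cache: never evicted once loaded."""
    return {t for t in set(trace) if len(younger_set(trace, t)) < assoc}

print(sorted(persistent(["a", "b", "a", "c"], 4)))  # -> ['a', 'b', 'c']
```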

DFA-based Persistence Analysis

slide-34
SLIDE 34

34

Known DFA persistence analyses only work with LRU caches. Another technique is based on static scopes (LRU, FIFO): if during one execution of a program scope at most N elements are accessed, then all of them are persistent in an N-way cache. Open problem (for all persistence analyses): how to find good program scopes? Functions and loops are obvious candidates. Which heuristics?

Scope-Based Persistence Analysis

slide-35
SLIDE 35

35

Usually assumes that the address of accessed elements is known, or lies within some small interval (e.g., if an array index is unknown). Precision can be further improved by analyzing array indices and access patterns. If the address is unknown, set-associative caches become less effective: the access may affect any set. Modularity? To improve analysis results, cache locking or cache splitting can be used, disabling the cache for “unpredictable accesses”.

Data Cache Analysis Remarks

slide-36
SLIDE 36

36

Applying the Cache Categorizations to ILP

In integer linear programming (ILP) we typically calculate the WCET by maximizing Σ xi · ti

  • ti … execution time of CFG edge i (constant)
  • xi … execution frequency of CFG edge i (to be determined)

The hit and miss counts of the cache are modeled by additional flow variables: xi = xi,h + xi,m

Thus, the updated goal function is Σ xi,h · ti,h + Σ xi,m · ti,m

slide-37
SLIDE 37

37

Applying the Cache Categorizations to ILP (2)

Depending on the cache categorization of a memory reference at edge i, additional flow constraints are added:

  • always hit [ah]: xi,m = 0

  • always miss [am]: xi,h = 0

  • global persistency [gp]: xi,h ≥ xi − 1

  • local persistency [ps(S)]: xi,h ≥ xi − (∑ xk | edge k is an entry to context S)

  • [nc]: no additional constraints are created
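These per-category constraints can be generated mechanically. A hedged sketch that emits them as plain strings (no ILP solver assumed; variable naming `x_i`, `x_i_h`, `x_i_m` is invented for the example):

```python
def cache_constraints(edge, category, entry_edges=None):
    """Return the flow constraints for a memory reference at `edge`
    with the given cache categorization."""
    cons = [f"x_{edge} = x_{edge}_h + x_{edge}_m"]   # split flow into hit/miss
    if category == "ah":            # always hit
        cons.append(f"x_{edge}_m = 0")
    elif category == "am":          # always miss
        cons.append(f"x_{edge}_h = 0")
    elif category == "gp":          # global persistency
        cons.append(f"x_{edge}_h >= x_{edge} - 1")
    elif category == "ps":          # local persistency w.r.t. scope entries
        entries = " + ".join(f"x_{k}" for k in entry_edges)
        cons.append(f"x_{edge}_h >= x_{edge} - ({entries})")
    return cons                     # "nc": only the flow-split constraint

print(cache_constraints(3, "ah"))
# -> ['x_3 = x_3_h + x_3_m', 'x_3_m = 0']
```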

slide-38
SLIDE 38

38

Remarks to DFA-Based Cache Modeling

Persistence analysis is not necessary to distinguish first from subsequent loop iterations. To this end, the CFG is virtually rewritten to separate the first loop iteration from the others (Virtual Loop Unpeeling¹). The separation of cache classification and WCET calculation in DFA-based cache analysis scales well compared to the integrated approach, where the cache classification was modeled as a cache conflict graph within the ILP problem.

¹ Sometimes called “virtual loop unrolling”

slide-39
SLIDE 39

39

Remarks to DFA-Based Cache Modeling (2)

  • The DFA-based cache analysis works quite well for set-associative caches with LRU (least recently used) replacement strategy:
    – LRU has the nice locality property that the content of one cache line is not affected by memory accesses that map to other cache lines.
  • However, to improve hardware performance, often much less predictable replacement strategies are used:
    – ColdFire MCF 5307: pseudo-round-robin replacement
    – PowerPC 750/755: pseudo-LRU replacement

slide-40
SLIDE 40

40

Remarks to DFA-Based Cache Modeling (3)

  • Avg. performance of PRR and PLRU is similar to LRU, but predictability is much worse!

Analysis results with PLRU:
MAY analysis does not yield any information at all! (starting with an unknown cache, no block is found to be removed)
MUST analysis provides some information (but less than for LRU): at most 4 blocks are found in each cache set (out of 8 blocks in practice)
Still ongoing research (WCET’2010)

Pseudo-LRU (PLRU): the cache lines are leaves of a tree where on each node of the tree a path bit is placed. The replacement line is determined by following, from the top, the path indicated by the path bits. On each regular access, the path bits along this access are set to the other direction.

[Figure: PLRU tree with path bits b0 … b6 over cache lines L0 … L7]
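The PLRU path-bit scheme can be sketched for an 8-way set, storing the 7 tree bits b0 … b6 in heap layout (an assumed encoding: bit 0 follows the left child, bit 1 the right):

```python
WAYS = 8
LEVELS = WAYS.bit_length() - 1   # 3 levels of path bits for 8 ways

def plru_victim(bits):
    """Follow the path bits from the root down to the replacement line."""
    node = 0
    for _ in range(LEVELS):
        node = 2 * node + 1 + bits[node]   # go to the indicated child
    return node - (WAYS - 1)               # leaf index = way 0..7

def plru_touch(bits, way):
    """On an access to `way`, point every bit on its path the other way."""
    node = way + (WAYS - 1)                # the leaf node for this way
    while node > 0:
        parent = (node - 1) // 2
        # set the parent's bit away from the child we came from
        bits[parent] = 0 if node == 2 * parent + 2 else 1
        node = parent

bits = [0] * 7
print(plru_victim(bits))   # -> 0
plru_touch(bits, 0)        # accessing way 0 redirects the path bits
print(plru_victim(bits))   # -> 4
```

Note how one access flips bits on shared tree nodes, so the future victim depends on accesses to other lines — the loss of LRU's locality property that makes PLRU hard to analyze.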

slide-41
SLIDE 41

41

Remarks to DFA-Based Cache Modeling (4)

Pseudo-Round-Robin (PRR): on a 4-way set-associative cache, a two-bit replacement counter is used. This counter is shared by all cache lines and is only modified (increased mod 4) on a replacement. Thus, each cache line has an influence on the others!

Analysis results with PRR:
MAY analysis does not yield any information at all! (without counter or age information, one can never know which block is removed from the cache)
MUST analysis provides only little information (much less than for LRU): when a block b is accessed, it goes into the cache, but without counter or age information we do not know which block is removed → all elements currently in the set must be removed (only 1 out of possibly 4 elements can be found to be in the cache)
With PRR, only 1 way is effectively used
FIFO caches: cache hit/miss classification is difficult (ECRTS’10)

slide-42
SLIDE 42

42

Summary & Discussion

Topic of this lecture: cache access classification
Abstract Interpretation: DFA + abstract cache states
Cache hit/miss classification: MUST/MAY analysis, for instruction caches
Replacement policies: most work published on LRU; also applicable to direct-mapped caches. FIFO, PLRU & PRR are less predictable.
Discussion: Preemption? Unpredictable accesses? Alternatives (scratchpad)?

slide-43
SLIDE 43

43

References

  • 1. CMHC: Henrik Theiling, Christian Ferdinand, Reinhard Wilhelm. Fast and Precise WCET Prediction by Separate Cache and Path Analyses. Real-Time Systems 18(2/3), Kluwer, 2000.¹

  • 2. Data-Cache Analysis: Bach Khoa Huynh, Lei Ju, and Abhik Roychoudhury. Scope-aware Data Cache Analysis for WCET Estimation. Proc. IEEE RTAS ’11, 2011.

  • 3. FIFO Cache Analysis: Daniel Grund and Jan Reineke. Precise and Efficient FIFO-Replacement Analysis Based on Static Phase Detection. Proc. 22nd Euromicro Conference on Real-Time Systems (ECRTS ’10), 2010.

¹ For persistence analysis, refer to [2], not [1]

slide-44
SLIDE 44

44

References

  • 1. Preemption: Chang-Gun Lee, Joosun Hahn, Yang-Min Seo, Sang Lyul Min, Rhan Ha, Seongsoo Hong, Chang Yun Park, Minsuk Lee, and Chong Sang Kim. Analysis of Cache-Related Preemption Delay in Fixed-Priority Preemptive Scheduling. IEEE Trans. Comput. 47(6), June 1998.

  • 2. Abstract Interpretation: Julien Bertrane, Patrick Cousot, Radhia Cousot, Jérôme Feret, Laurent Mauborgne, Antoine Miné, and Xavier Rival. Static Analysis and Verification of Aerospace Software by Abstract Interpretation. Paper 2010-3385, American Institute of Aeronautics and Astronautics (AIAA), 2010.

slide-45
SLIDE 45

Extra Material SS 2011

45

slide-46
SLIDE 46

Exercise:

2-way set-assoc cache: MUST, MAY, PS

46

Compiled from e.g.

 x, y, z = a, b, 0
 while (x > 0 && y > 0) {
   z += x-- + y--
 }
 x, y = 0, 0

Access trace (tag, set, offset) from START to END:
(0,0,0) (0,0,1) (0,1,1) (1,0,0) (1,0,1) (1,1,0) (1,1,1) (2,0,0) (0,1,0)