SLIDE 1

Real-Time Architecture

Heechul Yun

SLIDE 2

Topics

  • Introduction to Real-Time Systems, CPS
  • CPS Applications
  • Real-time architecture/OS
  • Fault tolerance, safety, security


Amazon Prime Air

SLIDE 3

Topics

  • Introduction to Real-Time Systems, CPS
  • CPS Applications
  • Real-time architecture/OS

– Real-time cache, DRAM controller designs
– Real-time microarchitecture/OS support
– Real-time support for GPU/FPGA

  • Fault tolerance, safety, security

SLIDE 4

Real-Time Computing

  • Performance vs. Determinism

– Performance: average timing
– Determinism: variance and worst-case timing

  • Traditional real-time systems

– Focused on determinism
– So that we can analyze the system at design time
– Many challenges exist in computer architecture
– In general, performance demand was not high

  • High performance real-time systems

– Such as self-driving cars and UAVs (intelligent robots)
– Demand both performance and determinism
– More difficult to satisfy both

SLIDE 5

Architecture for Intelligent Robots

  • Time predictability
  • High performance

[Figure: two-axis chart (Performance vs. Predictability) positioning performance architectures, real-time architectures, and high-performance real-time architectures.]

SLIDE 6

Challenges for Time Predictability

  • Software

– Dynamic memory allocation, virtual memory

  • Hardware

– Interrupts
– Frequency, voltage, temperature control
– Pipeline, out-of-order, superscalar execution
– Caches
– DMA devices and bus contention
– Multicore, accelerators (GPU, FPGA)

SLIDE 7

Cache

  • Small but fast memory (SRAM)
  • Hardware (cache controller) managed storage

– Mapping: physical address → mapping function → set index
– Replacement: select a victim line among the ways

  • Improve average performance
  • Transparent to software

– It just works!

  • But makes timing analysis complicated. Why?

SLIDE 8

Worst-Case Execution Time (WCET)

  • Real-time scheduling theory is based on the assumption of known WCETs of real-time tasks


Image source: [Wilhelm et al., 2008]

SLIDE 9

WCET and Caches

  • How to determine the WCET of a task?
  • The longest execution path of the task?

– Problem: the longest path can take less time to finish than shorter paths if your system has a cache!

  • Example

– Path 1: 1000 instructions, 0 cache misses
– Path 2: 500 instructions, 100 cache misses
– Cache hit: 1 cycle, cache miss: 100 cycles
– Path 2 takes much longer
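The two-path example can be checked with a few lines of arithmetic. This is a sketch; the cost model, in which a missing instruction costs the full 100 cycles instead of 1, is an assumption inferred from the slide's numbers:

```python
HIT, MISS = 1, 100  # cycles per hit / per miss, as given on the slide

def path_cycles(instructions, misses):
    # Assumed cost model: every instruction takes 1 cycle,
    # except those that miss in the cache, which take 100.
    return (instructions - misses) * HIT + misses * MISS

path1 = path_cycles(1000, 0)   # 1000 cycles
path2 = path_cycles(500, 100)  # 400*1 + 100*100 = 10400 cycles
```

Despite having half the instructions, Path 2 is an order of magnitude slower, so the longest path by instruction count is not necessarily the worst-case path.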

SLIDE 10

WCET and Caches

  • Treat all memory accesses as cache-misses?

– Problem: extremely pessimistic

  • Example

– 1000 instructions, 100 mem accesses, 10 misses

  • Cache hit: 1 cycle, cache miss: 100 cycles

– Actual = 900 + 90×1 + 10×100 = 1990 ≈ 2000 cycles
– WCET_allmiss = 900 + 100×100 = 10900 ≈ 11000 cycles

  • >5X higher
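Under the same assumed cost model as before, the slide's pessimism figure can be reproduced directly:

```python
HIT, MISS = 1, 100                   # cycles per hit / per miss
instr, mem, misses = 1000, 100, 10   # numbers from the slide

# 900 non-memory instructions + 90 hits + 10 misses
actual = (instr - mem) * 1 + (mem - misses) * HIT + misses * MISS  # 1990

# treat all 100 memory accesses as misses
wcet_allmiss = (instr - mem) * 1 + mem * MISS                      # 10900

pessimism = wcet_allmiss / actual    # about 5.5x
```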

SLIDE 11

WCET and Caches

  • Take cache hits/misses into account?

– To reduce pessimism in WCET estimation

  • How to know cache hits/misses of a given job?

– If we assume

  • the path (instruction stream) is given
  • the job is not interrupted
  • a known “good” cache replacement policy is used

– Then we can statically determine hits/misses

  • But less so when “bad” replacement policies are used

SLIDE 12

Review: Direct-Map Cache

  • Cache-line size = 2^L
  • # of cache-sets = 2^S
  • Cache size = 2^(L+S)

[Figure: direct-mapped cache; a physical address is split into tag, set index (S bits), and block offset (L bits) fields.]

SLIDE 13


Review: Set-Associative Cache

  • Cache-line size = 2^L
  • # of cache-sets = 2^S
  • # of ways = W
  • Cache size = W × 2^(L+S)

[Figure: set-associative cache with ways 1–4; a physical address is split into tag, set index (S bits), and block offset (L bits) fields.]
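The address decomposition on these two slides can be sketched as bit manipulation. Field widths L and S are as defined above; the concrete address and widths in the example are arbitrary choices for illustration:

```python
def decompose(phys_addr, L, S):
    """Split a physical address into (tag, set index, block offset)
    for a cache with 2**L-byte lines and 2**S sets."""
    offset = phys_addr & ((1 << L) - 1)        # low L bits
    index = (phys_addr >> L) & ((1 << S) - 1)  # next S bits select the set
    tag = phys_addr >> (L + S)                 # remaining high bits
    return tag, index, offset

# Example: 64-byte lines (L=6), 64 sets (S=6)
tag, idx, off = decompose(0x12345678, L=6, S=6)

# The three fields reassemble into the original address
assert (tag << 12) | (idx << 6) | off == 0x12345678
```

For a W-way set-associative cache the same index selects a set of W lines, and the tag is compared against all W ways in that set.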

SLIDE 14

Cache Replacement Policy

  • Least Recently Used (LRU)

– Evict the least recently used cache-line
– “Good” (analyzable) policy; tight analysis exists
– Expensive to maintain ordering; not used for large caches
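The behavior of true LRU on a single cache set can be sketched with an ordered dictionary. This models the policy only, not any concrete hardware implementation:

```python
from collections import OrderedDict

class LRUSet:
    """One W-way cache set with true-LRU replacement."""
    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()  # tag -> None, least recently used first

    def access(self, tag):
        """Return True on a hit, False on a miss (evicting if full)."""
        if tag in self.lines:
            self.lines.move_to_end(tag)      # becomes most recently used
            return True
        if len(self.lines) == self.ways:
            self.lines.popitem(last=False)   # evict least recently used
        self.lines[tag] = None
        return False
```

For a 2-way set, the access sequence A, B, A, C evicts B (the least recently used line), so a later access to A still hits.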

SLIDE 15

Cache Replacement Policy

  • (Tree) Pseudo-LRU

– Use a binary tree
– Each node records which half is older
– On a miss, follow the older path and flip the bits along the way
– Approximate LRU; no need to sort; practical
– But analysis is more pessimistic

[Figure: binary tree over cache-lines L0–L7; each node bit records which half is older.]

Image credit: Prof. Mikko H. Lipasti
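The tree bits described above can be sketched for a 4-way set. The bit convention here (0 points left, toward the pseudo-older half) is one common choice, assumed for illustration:

```python
class TreePLRU4:
    """Tree pseudo-LRU state for one 4-way set: 3 tree bits.
    Each bit points toward the pseudo-least-recently-used half."""
    def __init__(self):
        self.bits = [0, 0, 0]  # [root, left node, right node]

    def touch(self, way):
        # On an access, flip the bits on the path to point *away*
        # from the accessed way.
        self.bits[0] = 1 if way < 2 else 0
        node = 1 if way < 2 else 2
        self.bits[node] = 1 if way % 2 == 0 else 0

    def victim(self):
        # Follow the bits (0 = left, 1 = right) to the victim way.
        half = self.bits[0]
        return half * 2 + self.bits[1 + half]
```

Touching way 0 makes way 2 the next victim, and touching way 2 then makes way 1 the victim; the tree tracks only one bit per node, so it approximates rather than reproduces the exact LRU order.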

SLIDE 16

Cache Replacement Policy

  • (Tree) Pseudo-LRU


Image credit: https://en.wikipedia.org/wiki/Pseudo-LRU

SLIDE 17

Cache Replacement Policy

  • (Bit) PLRU or NRU (Not Recently Used)

– One MRU bit per cache-line
– Set the bit to 1 on access; when the last remaining 0 bit is set to 1, all other bits are reset to 0
– On a miss, the lowest-index line whose MRU bit is 0 is replaced
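The bit-PLRU/NRU rule above can be sketched directly, again as a model of the policy rather than of any particular processor:

```python
class BitPLRU:
    """One MRU bit per line; victim = lowest-index line with bit 0."""
    def __init__(self, ways):
        self.mru = [0] * ways

    def touch(self, way):
        self.mru[way] = 1
        if all(self.mru):                  # last remaining 0 was just set
            self.mru = [0] * len(self.mru)
            self.mru[way] = 1              # keep only the newest access

    def victim(self):
        return self.mru.index(0)
```

After all four ways of a 4-way set have been touched, the bits reset and only the last access keeps its MRU bit, so way 0 becomes the victim again.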


Udacity Lecture: https://www.youtube.com/watch?v=8CjifA2yw7s

SLIDE 18

Cache Replacement Policies

  • How to know which policy is used?

– Manual (if you are lucky)
– Reverse engineering


Image source: [Abel and Reineke, RTAS 2013]

SLIDE 19

Problems of Static Timing Analysis

  • A lot of assumptions

– The path (instruction stream) is given
– The job is not interrupted
– The processor architecture (incl. cache) is analyzable

  • Reality

– The worst-case path is difficult to know
– OS jitter changes the cache state
– Most processor architectures are NOT analyzable

SLIDE 20

Timing Anomalies

  • Locally faster != globally faster


Image source: [Wilhelm et al., 2008]


SLIDE 22

Timing Compositional Architecture

  • For what architectures does static analysis work?

– Basically simple, in-order architectures with single-level LRU caches (I/D)
– e.g., ARM7 [Axer et al., 2014]

  • Most architectures

– Not timing-compositional
– Because of prefetchers, out-of-order execution, superscalar issue, speculative execution, …

SLIDE 23

Measurement Based WCET Analysis

  • Well, actually measure the execution times
  • Tools support

– Automatically measure execution times with a subset of all possible inputs and collect a timing distribution

  • Benefits

– Can be applied to ANY processor
– Closer to the exact WCET (less pessimism)
– Widely used in practice (in industry)

  • But,

– No guarantees, because you cannot test all inputs
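The measurement-based approach can be sketched in a few lines; `task` is a hypothetical stand-in for the task under analysis, and the observed maximum is only a high-water mark, never a guaranteed WCET:

```python
import random
import statistics
import time

def task(n):
    # Hypothetical workload standing in for a real-time task.
    return sum(i * i for i in range(n))

samples = []
for _ in range(200):
    n = random.randint(100, 1000)        # a subset of the input space
    t0 = time.perf_counter()
    task(n)
    samples.append(time.perf_counter() - t0)

observed_max = max(samples)              # high-water mark, NOT the WCET
average = statistics.mean(samples)
```

In practice a safety margin is added on top of the observed maximum, precisely because untested inputs (and untested cache states) may take longer.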

SLIDE 24

Summary

  • Terminology: WCET, ACET, BCET
  • Cache-aware static timing analysis

– Possible but hard

  • Impact of cache replacement policies

– LRU (good, analyzable), PLRU (analysis is more pessimistic)

  • Timing compositional architecture

– Analyzable processor architecture (e.g., ARM7)

  • Timing anomalies

– Locally fast != globally fast on non-timing compositional architectures (i.e., most architectures)

SLIDE 25

References

  • [Vestal, 2007] Preemptive scheduling of multi-criticality systems with varying degrees of execution time assurance. In Proc. of the IEEE Real-Time Systems Symposium (RTSS), pages 239–243
  • [Wilhelm et al., 2008] The Worst-Case Execution-Time Problem: Overview of Methods and Survey of Tools, TECS
  • [Wilhelm et al., 2009] Memory hierarchies, pipelines, and buses for future architectures in time-critical embedded systems, TCAD
  • [Abel and Reineke, 2013] Measurement-based modeling of the cache replacement policy, RTAS
  • [Axer et al., 2014] Building Timing Predictable Embedded Systems, TECS
