Challenges for Timing Analysis of Multi-Core Architectures, Jan Reineke (PowerPoint PPT Presentation)



SLIDE 1

Challenges for Timing Analysis of Multi-Core Architectures

Jan Reineke @ DICE-FOPARA, Uppsala, Sweden April 22, 2017

computer science

saarland

university

SLIDE 2

The Context: Hard Real-Time Systems

Safety-critical applications:

• Avionics, automotive, train industries, manufacturing
• Embedded software must
  - compute correct control signals,
  - within time bounds.

• Side airbag in car: reaction in < 10 msec
• Crankshaft-synchronous tasks: reaction in < 45 microsec

SLIDE 3

The Timing Analysis Problem

[Diagram: set of software tasks + timing requirements + microarchitecture → do the tasks meet their timing requirements on this microarchitecture?]

SLIDE 4

“Standard Approach” for Timing Analysis

Two-phase approach:

1. Determine WCET (worst-case execution time) bounds for each task on the microarchitecture.
2. Perform response-time analysis.

Simple interface between WCET analysis and response-time analysis: WCET bounds.
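Step 2 above can be sketched with the classic fixed-point iteration for fixed-priority response-time analysis. The task set below is purely illustrative, not from the talk:

```python
import math

def response_time(C, T, i):
    """Classic response-time analysis for fixed-priority scheduling:
    R_i = C_i + sum over higher-priority tasks j of ceil(R_i / T_j) * C_j.
    Tasks are indexed by priority: index 0 is the highest priority.
    C[i] is the WCET bound of task i, T[i] its period (= deadline)."""
    R = C[i]
    while True:
        interference = sum(math.ceil(R / T[j]) * C[j] for j in range(i))
        R_new = C[i] + interference
        if R_new == R:      # fixed point reached: R is a response-time bound
            return R
        if R_new > T[i]:    # deadline exceeded: task is unschedulable
            return None
        R = R_new

# Illustrative task set: WCET bounds and periods
C, T = [1, 2, 3], [4, 6, 12]
bounds = [response_time(C, T, i) for i in range(3)]  # [1, 3, 10]
```

Note how the WCET bounds `C` are the only interface between the two phases: the iteration never looks inside the tasks.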

SLIDE 5

[Diagram: simple CPU with memory]

What does the execution time depend on?

• The input, determining which path is taken through the program.

SLIDE 6

[Diagram: simple CPU with memory, and a complex CPU (out-of-order execution, branch prediction, etc.) with L1 cache and main memory]

What does the execution time depend on?

• The input, determining which path is taken through the program.
• The state of the hardware platform:
  - due to caches, pipelining, speculation, etc.

SLIDE 7

Example of Influence of Microarchitectural State

PowerPC 755

Source: x = a + b;   Compiled code: LOAD r2, _a; LOAD r1, _b; ADD r3, r2, r1

SLIDE 8

[Diagram: the single-core platforms as before, and a multi-core platform: complex CPUs with private L1 caches, a shared L2 cache, and shared main memory]

What does the execution time depend on?

• The input, determining which path is taken through the program.
• The state of the hardware platform:
  - due to caches, pipelining, speculation, etc.
• Interference from the environment:
  - external interference, as seen from the analyzed task, on shared busses, caches, and memory.

SLIDE 9

Example of Influence of Co-running Tasks in Multi-cores

Radojkovic et al. (ACM TACO, 2012) on Intel Atom and Intel Core 2 Quad: up to 14x slow-down due to interference on shared L2 cache and memory controller.
SLIDE 10

[Diagram: the single-core platforms as before, and a multi-core platform: complex CPUs with private L1 caches, a shared L2 cache, and shared main memory]

What does the execution time depend on?

• The input, determining which path is taken through the program.
• The state of the hardware platform:
  - due to caches, pipelining, speculation, etc.
• Interference from the environment:
  - external interference, as seen from the analyzed task, on shared busses, caches, and memory.

SLIDE 11

Three Challenges:

• Modeling: How to obtain sound timing models?
• Analysis: How to precisely & efficiently bound the WCET?
• Design: How to design microarchitectures that enable precise & efficient WCET analysis?

SLIDE 12

The Modeling Challenge

Predictions about the future behavior of a system are always based on models of the system.

"All models are wrong, but some are useful." - George Box (statistician)

[Diagram: microarchitecture → ? → timing model]

SLIDE 13

The Need for Timing Models

The ISA only partially defines the behavior of microarchitectures: it abstracts from timing. How to obtain timing models?

• Hardware manuals
• Manually devised microbenchmarks
• Machine learning

Challenge: introduce a HW/SW contract that captures the timing behavior of microarchitectures.

SLIDE 14

Current Process of Deriving Timing Models

[Diagram: microarchitecture → ? → timing model]

→ Time-consuming, and
→ error-prone.

SLIDE 15

Can We Automate the Process?

[Diagram: microarchitecture → timing model]

• Perform measurements on hardware
• Infer model

SLIDE 16

Can We Automate the Process?

[Diagram: microarchitecture → timing model]

• Perform measurements on the hardware.
• Derive the timing model automatically from measurements on the hardware, using methods from automata learning.

→ No manual effort, and
→ (under certain assumptions) provably correct.

SLIDE 17

Proof-of-concept: Automatic Modeling of the Cache Hierarchy

• Can be characterized by a few parameters:
  - ABC: associativity, block size, capacity
  - Replacement policy: finite automaton

chi [Abel and Reineke, RTAS 2013] derives all of these parameters fully automatically, including previously undocumented replacement policies.

[Diagram: cache organization with A ways (associativity), N cache sets, and block size B; each cache line stores a tag and a data block]
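The "replacement policy as a finite automaton" view can be sketched for LRU: the state of one cache set is the recency order of its blocks, and each access is a transition. This is a toy illustration of the idea, not the chi tool:

```python
def lru_access(state, block, associativity=4):
    """One transition of the LRU replacement policy, viewed as a finite
    automaton. The state of a cache set is the list of cached blocks,
    ordered from most to least recently used; overflowing blocks
    (beyond the associativity) are evicted."""
    hit = block in state
    state = [block] + [b for b in state if b != block]
    return state[:associativity], hit

# Example: fill a 4-way set, then observe a hit and an eviction.
state = []
for b in ["a", "b", "c", "d"]:
    state, _ = lru_access(state, b)
state, hit = lru_access(state, "a")       # hit: "a" becomes most recently used
state, miss_hit = lru_access(state, "e")  # miss: evicts least recently used "b"
```

Since every policy of this kind is a finite automaton over access sequences, it is in principle learnable from hit/miss observations, which is what makes the measurement-based approach work.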

SLIDE 18

Modeling Challenge: Ongoing and Future Work

1. Extend automata learning techniques to account for prior knowledge [NASA Formal Methods Symposium, 2016].
2. Apply the approach to other parts of the microarchitecture:
   - Translation lookaside buffers, branch predictors
   - Shared caches in multi-cores, including their coherency protocols
   - Contemporary out-of-order cores

SLIDE 19

Analysis and Design Challenges

Precise & Efficient Timing Analysis

How to precisely and efficiently account for caches, pipelining, speculation, etc.?

Design for Predictability

How to design hardware to allow for precise and efficient analysis without sacrificing performance?

SLIDE 20

The Analysis Challenge: State of the Art

[Diagram: multi-core platform: complex CPUs with private L1 caches, a shared L2 cache, and shared main memory]

Private Caches

Precise & efficient abstractions, for

  • LRU [Ferdinand, 1999]

Not-as-precise but efficient abstractions, for

  • FIFO, PLRU, MRU [Grund and Reineke, 2008-2011]

Reasonably precise quantitative analyses, for

  • FIFO, MRU [Guan et al., 2012-2014]

Complex Pipelines

Precise but very inefficient analyses; little abstraction.
Major challenge: timing anomalies.

Shared Resources on Multicores

Major challenge: interference on shared resources
→ execution time depends on co-running tasks
→ need timing compositionality

SLIDE 21

Contributions to Analysis and Design Challenges

Design

Hardware:
• Shared DRAM Controller [CODES+ISSS 11]
• Preemption-aware Cache [RTAS 14]
• Smooth Shared Caches [WAOA 15]
• Anomaly-free Pipelines [Correct Sys. Des. 15]

Software:
• Predictable Memory Allocation [ECRTS 11]
• Compilation for Predictability [RTNS 14]

Analysis

• Caches [SIGMETRICS 08, SAS 09, WCET 10, ECRTS 10, CAV 17]
• Branch Target Buffers [RTCSA 09, JSA 10]
• Preemption Cost [WCET 09, LCTES 10, RTNS 16]
• Architecture-Parametric Timing Analysis [RTAS 14]
• Multi-Core Timing Analysis [RTNS 15, DAC 16, RTNS 16]

Predictability Assessment

• (Randomized) Caches [RTS 07, TECS 13, LITES 14, WAOA 15]
• Branch Target Buffers [JSA 10]
• Pipelines and Buses [TCAD 09]
• Load/Store-Unit [WCET 12]
• Timing Anomalies [WCET 06]
• Timing Compositionality [CRTS 13]

SLIDE 22

Timing Anomalies

A cache miss is the local worst case, yet a cache hit can lead to the global worst case. This results in nondeterminism due to uncertainty about the hardware state.

Timing Anomalies in Dynamically Scheduled Microprocessors
  • T. Lundqvist, P. Stenström. In RTSS, 1999.
SLIDE 23

Timing Anomalies

Timing Anomaly = Counterintuitive scenario in which the “local worst case” does not imply the “global worst case”. Example: Scheduling Anomaly

[Diagram: two schedules of tasks A-E on resources 1 and 2; depending on when C becomes ready, the dispatch order on the resources changes and the overall schedule length differs]

Bounds on Multiprocessing Timing Anomalies
  • R. L. Graham. SIAM Journal on Applied Mathematics, 1969. (http://epubs.siam.org/doi/abs/10.1137/0117039)
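Graham's anomaly can be reproduced with a small list scheduler. In his classic nine-job instance on three machines, shortening every job by one time unit increases the makespan from 12 to 13. The scheduler below is an illustrative sketch:

```python
def list_schedule(times, preds, machines, order):
    """Greedy list scheduling: whenever a machine is idle, it picks the
    first job in the list whose predecessors have all finished.
    Returns the makespan of the resulting schedule."""
    n = len(times)
    finish = [None] * n
    busy_until = [0] * machines
    remaining = list(order)
    t = 0
    while remaining:
        for m in range(machines):
            if busy_until[m] <= t:
                for j in remaining:
                    if all(finish[p] is not None and finish[p] <= t
                           for p in preds.get(j, [])):
                        finish[j] = t + times[j]
                        busy_until[m] = finish[j]
                        remaining.remove(j)
                        break
        # advance time to the next job completion
        events = [f for f in finish if f is not None and f > t]
        if not events:
            break
        t = min(events)
    return max(f for f in finish if f is not None)

# Graham's instance: 9 jobs, 3 machines; job 8 depends on job 0,
# jobs 4-7 depend on job 3; list order is simply 0..8.
preds = {8: [0], 4: [3], 5: [3], 6: [3], 7: [3]}
order = list(range(9))
slow = list_schedule([3, 2, 2, 2, 4, 4, 4, 4, 9], preds, 3, order)  # makespan 12
fast = list_schedule([2, 1, 1, 1, 3, 3, 3, 3, 8], preds, 3, order)  # makespan 13
```

Shortening every job (the local best case everywhere) lengthens the schedule: exactly the counterintuitive pattern behind microarchitectural timing anomalies.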

SLIDE 24

Timing Anomalies Consequences for Timing Analysis

Cannot exclude cases "locally":

→ need to consider all cases
→ may yield a "state explosion problem"

SLIDE 25

Conventional Wisdom

Simple in-order pipeline + LRU caches → no timing anomalies → timing compositional

False!

SLIDE 26

Bad News: In-order Pipelines

We show that even a simple five-stage in-order pipeline has timing anomalies:

Toward Compact Abstractions for Processor Pipelines

  • S. Hahn, J. Reineke, and R. Wilhelm. In Correct System Design, 2015.

[Diagram: five-stage in-order pipeline - Fetch (IF), Decode (ID), Execute (EX), Memory (MEM), Write-back (WB) - with I-cache, D-cache, and memory]

SLIDE 27

A Timing Anomaly

Program:
  load ...
  nop
  load r1, ...
  div ..., r1
  ret

[Diagram: pipeline timelines for the hit case and the miss case of the first load (H = cache hit, M = cache miss, IF = instruction fetch, EX = execute)]

Hit case:

  • The instruction fetch starts before the second load becomes ready.
  • It stalls the second load, which misses the cache.

Miss case:

  • The second load can catch up while the first load misses the cache.
  • The second load is prioritized over the instruction fetch.
  • Performing the load before the fetch benefits the subsequent execution.

Intuitive reason: progress in the pipeline influences the order of instruction fetches and data accesses.


SLIDE 28

Good News: Strictly In-Order Pipelines

Definition (Strictly In-Order): We call a pipeline strictly in-order if each resource processes the instructions in program order.

• Enforce memory operations (instructions and data) in order (common memory as a resource).
• Block instruction fetch until there are no potential data accesses in the pipeline.
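The fetch-blocking rule can be sketched as a predicate over the in-flight instructions. The `may_fetch` function and the instruction flags below are hypothetical modeling choices for illustration, not the paper's formalization:

```python
def may_fetch(in_flight):
    """Strictly in-order fetch rule (sketch): since the common memory is a
    single resource processed in program order, instruction fetch must wait
    while any in-flight instruction may still perform a data access."""
    return not any(ins["may_access_data"] and not ins["past_mem_stage"]
                   for ins in in_flight)

pipeline = [
    {"may_access_data": True,  "past_mem_stage": True},   # load, MEM stage done
    {"may_access_data": False, "past_mem_stage": False},  # ALU op, no data access
]
ok = may_fetch(pipeline)       # no pending data access: fetch may proceed
pipeline.append({"may_access_data": True, "past_mem_stage": False})
blocked = may_fetch(pipeline)  # pending load: fetch must wait
```

Conservatively blocking the fetch whenever a data access might still occur is what keeps all memory traffic in program order.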

SLIDE 29

Strictly In-Order Pipelines: Properties

Theorem 1 (Monotonicity): In the strictly in-order pipeline, the progress of an instruction is monotone in the progress of the other instructions.

In the blue state, each instruction has the same or more progress than in the red state.

SLIDE 30

Strictly In-Order Pipelines: Properties

Theorem 2 (Timing Anomalies): The strictly in-order pipeline is free of timing anomalies.

[Proof sketch: by monotonicity, the execution following the local best case is bounded by the execution following the local worst case.]

SLIDE 31

Multi-Core Timing Analysis

Execution time depends strongly on execution context due to interference on shared resources

[Diagram: multi-core platform: complex CPUs with private L1 caches, a shared L2 cache, and shared main memory]

SLIDE 32

“Standard Approach” for Timing Analysis

Two-phase approach:

1. Determine WCET (worst-case execution time) bounds for each task on the platform.
2. Perform response-time analysis.

Simple interface between WCET analysis and response-time analysis: WCET bounds.

Still adequate in the case of multi-cores?

SLIDE 33

Three Approaches to Timing Analysis for Multi- and Many-Cores

[Diagram: the three approaches ordered by precision and complexity]

1. Murphy
2. Integrated
3. Compositional
SLIDE 34

1. Murphy Approach

Maintain the standard two-phase approach:

1. Determine context-independent WCET bounds.
2. Perform response-time analysis.

Radojkovic et al. (ACM TACO, 2012) on Intel Atom and Intel Core 2 Quad: up to 14x slow-down due to interference on shared L2 cache and memory controller.

→ Results will be extremely pessimistic.

SLIDE 35

2. Integrated Analysis Approach

Analyze the entire task set at once in a combined WCET and response-time analysis.
→ Infeasible even for the analysis of two co-running tasks.

SLIDE 36

Three Approaches to Timing Analysis for Multi- and Many-Cores

[Diagram: Murphy, Integrated, and Compositional approaches ordered by precision and complexity]

SLIDE 37

3. Compositional Approach

1. "WCET Analysis", for each task:
   a) Compute a WCET bound assuming no interference.
   b) Compute the maximal interference generated by the task on each shared resource.
2. Perform extended response-time analysis.
SLIDE 38

3. Compositional Approach: Response-time Analysis [RTNS 15, DAC 16]

[Diagram: multi-core platform: complex CPUs with private L1 caches, a shared L2 cache, and shared main memory]

Response time of a task =
  execution time in isolation
  + interference on its core
  + interference on caches
  + interference on the bus
  + interference on memory
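The additive bound above can be sketched directly. The interference numbers below are hypothetical; the real analyses in [RTNS 15, DAC 16] derive each term from task and resource models:

```python
def compositional_response_time(exec_isolation, interference):
    """Timing-compositional bound: the response time of a task is bounded
    by its execution time in isolation plus the summed interference
    bounds on each shared resource."""
    return exec_isolation + sum(interference.values())

# Hypothetical per-resource interference bounds for one task (in cycles):
bound = compositional_response_time(
    exec_isolation=10_000,
    interference={"core": 2_000, "caches": 1_500, "bus": 800, "memory": 1_200},
)
```

The appeal of this shape is the simple interface: each interference term can be computed by a separate analysis, and soundness of the sum is exactly what timing compositionality has to guarantee.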

SLIDE 39

3. Compositional Approach: Challenges

What are good interference characterizations?
→ We want both precision and analysis efficiency.

Approaches usually rely on timing compositionality.

SLIDE 40

Timing Compositionality: By Example

[Diagram: four cores connected via a shared bus B to shared memory; a task's timing contribution combines its maximal core execution time exec_1^max with its maximal number of memory accesses µ_1^max times a per-access delay a]

Timing compositionality = the ability to simply sum up the timing contributions of different components.

Implicitly or explicitly assumed by (almost) all approaches to timing analysis for multi-cores and to cache-related preemption delays (CRPD).

SLIDE 41

Timing Compositionality of Conventional In-order Pipeline

Maximal cost of an additional cache miss?

Intuitively: the cache miss penalty. Unfortunately:

• Common case: less than the cache miss penalty.
• But worst case: ~2 times the cache miss penalty, because:

  • ongoing instruction fetch may block load
  • ongoing load may block instruction fetch
SLIDE 42

Strictly In-Order Pipelines: Properties

Theorem 3 (Timing Compositionality): The strictly in-order pipeline admits "compositional analysis with intuitive penalties."

[Proof sketch: the local worst case is bounded by the local best case plus the "natural" penalty.]

SLIDE 43

Conclusions

• Timing analysis needs timing models; models can be obtained by machine learning.
• Multi-cores require rethinking the interface between WCET analysis and response-time analysis.
• Simple in-order pipelines do not fulfill the assumptions of state-of-the-art analyses.
• The strictly in-order pipeline is free of timing anomalies and is timing-compositional.

→ A component of future predictable multi-cores!?

Thank you for your attention!


SLIDE 44

Some References

Gray-box Learning of Serial Compositions of Mealy Machines
  • A. Abel and J. Reineke. In NASA Formal Methods Symposium, 2016.

MIRROR: Symmetric Timing Analysis for Real-Time Tasks on Multicore Platforms with Shared Resources
  • W.-H. Huang, J.-J. Chen, and J. Reineke. In DAC, 2016.

A Generic and Compositional Framework for Multicore Response Time Analysis
  • S. Altmeyer, R.I. Davis, L.S. Indrusiak, C. Maiza, V. Nelis, and J. Reineke. In RTNS, 2015.

Toward Compact Abstractions for Processor Pipelines
  • S. Hahn, J. Reineke, and R. Wilhelm. In Correct System Design, 2015.

A Compiler Optimization to Increase the Efficiency of WCET Analysis
  • M. A. Maksoud and J. Reineke. In RTNS, 2014.

Architecture-Parametric Timing Analysis
  • J. Reineke and J. Doerfert. In RTAS, 2014.

Selfish-LRU: Preemption-Aware Caching for Predictability and Performance
  • J. Reineke, S. Altmeyer, D. Grund, S. Hahn, and C. Maiza. In RTAS, 2014.

Towards Compositionality in Execution Time Analysis - Definition and Challenges
  • S. Hahn, J. Reineke, and R. Wilhelm. In CRTS, 2013.

Impact of Resource Sharing on Performance and Performance Prediction: A Survey
  • A. Abel, F. Benz, J. Doerfert, B. Dörr, S. Hahn, F. Haupenthal, M. Jacobs, A. H. Moin, J. Reineke, B. Schommer, and R. Wilhelm. In CONCUR, 2013.

Measurement-based Modeling of the Cache Replacement Policy
  • A. Abel and J. Reineke. In RTAS, 2013.

A PRET Microarchitecture Implementation with Repeatable Timing and Competitive Performance
  • I. Liu, J. Reineke, D. Broman, M. Zimmer, and E.A. Lee. In ICCD, 2012.

PRET DRAM Controller: Bank Privatization for Predictability and Temporal Isolation
  • J. Reineke, I. Liu, H.D. Patel, S. Kim, and E.A. Lee. In CODES+ISSS, 2011.