SLIDE 1

EECS 388: Embedded Systems
10. Timing Analysis

Heechul Yun

SLIDE 2

Agenda

  • Execution time analysis
  • Static timing analysis
  • Measurement-based timing analysis

SLIDE 3

Execution Time Analysis

  • Will my brake-by-wire system actuate the brakes within one millisecond?
  • Will my camera-based steer-by-wire system identify a bicyclist crossing within 100 ms (10 Hz)?
  • Will my drone be able to finish computing control commands within 10 ms (100 Hz)?

SLIDE 4

Execution Time

  • Worst-Case Execution Time (WCET)
  • Best-Case Execution Time (BCET)
  • Average-Case Execution Time (ACET)
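WCET is the task's maximum execution time over all possible inputs and hardware states; BCET is the minimum; ACET is the average over typical runs.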


SLIDE 5

Execution Time

  • Real-time scheduling theory is based on the assumption of known WCETs of real-time tasks

Image source: [Wilhelm et al., 2008]

SLIDE 6

The WCET Problem

  • Given the code of a task and the platform (OS & hardware), determine the WCET of the task.

while (1) {
    read_from_sensors();
    compute();
    write_to_actuators();
    wait_till_next_period();
}

Assumptions: loops with finite bounds, no recursion, runs uninterrupted.

SLIDE 7

Timing Analysis

  • Static timing analysis
    – Input: code and an architecture model; output: WCET
  • Measurement-based timing analysis
    – Based on lots of measurements. Statistical.

SLIDE 8

Static Timing Analysis

  • Analyze the code
  • Split it into basic blocks
  • Find the longest path
    – considering loop bounds
  • Compute per-block WCETs
    – using an abstract CPU model
  • Compute the task WCET
    – by summing the WCETs of the blocks along the longest path
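For example, a minimal sketch of the final step, with purely illustrative per-block WCETs and loop bound (none of these numbers come from the slides):

/* Illustrative per-block WCETs (in cycles) from an abstract CPU model */
#define WCET_INIT   40   /* basic block before the loop */
#define WCET_BODY  120   /* worst-case loop body        */
#define WCET_EXIT   25   /* basic block after the loop  */
#define LOOP_BOUND  16   /* statically known upper bound on iterations */

/* Task WCET along the longest path: init + bound * body + exit */
unsigned task_wcet(void)
{
    return WCET_INIT + LOOP_BOUND * WCET_BODY + WCET_EXIT;
}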


SLIDE 9

WCET and Caches

  • How do we determine the WCET of a task?
  • Is it the longest execution path of the task?
    – Problem: the longest path can take less time to finish than a shorter path if your system has a cache (or caches)!
  • Example
    – Path 1: 1000 instructions, 0 cache misses
    – Path 2: 500 instructions, 100 cache misses
    – Cache hit: 1 cycle; cache miss: 100 cycles
    – Path 1 takes 1000 × 1 = 1,000 cycles, but Path 2 takes 400 × 1 + 100 × 100 = 10,400 cycles, so Path 2 takes much longer

SLIDE 10

Recall: Memory Hierarchy

[Figure: the memory hierarchy, from fast, expensive volatile memory at the top to slow, inexpensive non-volatile memory at the bottom]

SLIDE 11

SiFive FE310

  • CPU: 32-bit RISC-V
  • Clock: 320 MHz
  • SRAM: 16 KB (data) + 16 KB (instruction)
  • Flash: 4 MB
  • 32-bit data bus

SLIDE 12

Raspberry Pi 4: Broadcom BCM2711

(Image: ct.de/Maik Merten (CC BY-SA 4.0)) Image source: PC Watch.

  • CPU: 4x Cortex-A72 @ 1.5 GHz
  • L2 cache (shared): 1 MB
  • GPU: VideoCore VI @ 500 MHz
  • DRAM: 1/2/4 GB LPDDR4-3200
  • Storage: micro-SD

SLIDE 13

Processor Behavior Analysis: Cache Effects

Suppose:

  1. 32-bit processor
  2. Direct-mapped cache holds two sets
     – 4 floats per set
     – x and y stored contiguously starting at address 0x0

What happens when n = 2?
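The loop being analyzed did not survive extraction; a minimal sketch of the presumed code, using the x, y, and n referenced on the slide:

float dot_product(float *x, float *y, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        sum += x[i] * y[i];   /* alternates accesses between x and y */
    }
    return sum;
}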

Slide source: Edward A. Lee and Prabal Dutta (UCB)

SLIDE 14

Direct-Mapped Cache

[Figure: a direct-mapped cache with sets 0, 1, …, S, each holding one line (a valid bit, t tag bits, and a block of B = 2^b bytes). An m-bit address is split into t tag bits, an s-bit set index, and a b-bit block offset.]

A "set" consists of one "line". If the tag of the address matches the tag of the line, then we have a "cache hit". Otherwise, the fetch goes to main memory, updating the line.

Slide source: Edward A. Lee and Prabal Dutta (UCB)

SLIDE 15

This Particular Direct-Mapped Cache

[Figure: the same cache with two sets (set 0 and set 1); a 32-bit address is split into t = 27 tag bits, s = 1 set-index bit, and b = 4 block-offset bits. Each line holds a valid bit, t tag bits, and a block of B = 2^b = 16 bytes.]

Four floats per block, at four bytes per float, means 16 bytes, so b = 4.
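A sketch of how such an address would be decoded (the macro names are illustrative, not from the slides):

/* 32-bit address: | 27-bit tag | 1-bit set index | 4-bit block offset | */
#define OFFSET(addr)  ((addr) & 0xF)          /* low 4 bits        */
#define SET(addr)     (((addr) >> 4) & 0x1)   /* next 1 bit        */
#define TAG(addr)     ((addr) >> 5)           /* remaining 27 bits */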

Slide source: Edward A. Lee and Prabal Dutta (UCB)

SLIDE 16

Processor Behavior Analysis: Cache Effects

Suppose:

  1. 32-bit processor
  2. Direct-mapped cache holds two sets
     – 4 floats per set
     – x and y stored contiguously starting at address 0x0

What happens when n = 2? x[0] will miss, pulling x[0], x[1], y[0], and y[1] into set 0. All but one access will be a cache hit.

Slide source: Edward A. Lee and Prabal Dutta (UCB)

SLIDE 17

Processor Behavior Analysis: Cache Effects

Suppose:

  1. 32-bit processor
  2. Direct-mapped cache holds two sets
     – 4 floats per set
     – x and y stored contiguously starting at address 0x0

What happens when n = 8? x[0] will miss, pulling x[0-3] into set 0. Then y[0] will miss, pulling y[0-3] into the same set, evicting x[0-3] (y starts at byte 32, i.e., block 2, and 2 mod 2 = 0, so y also maps to set 0). Since the loop alternates between the two arrays, every access will be a miss!

Slide source: Edward A. Lee and Prabal Dutta (UCB)

SLIDE 18

Measurement-Based Timing Analysis

  • Measurement-Based Timing Analysis (MBTA)
  • Take lots of measurements under worst-case scenarios (e.g., heavy load), as in the sketch below
  • Take the maximum plus a safety margin as the WCET
  • No need for detailed architecture models
  • Commonly practiced in industry
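A minimal sketch of such a measurement loop, with a hypothetical task_body() standing in for the task under test:

#include <stdio.h>
#include <time.h>

/* Placeholder for the task under test (hypothetical) */
static void task_body(void) { }

/* Difference between two timespecs, in milliseconds */
static double elapsed_ms(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

int main(void)
{
    struct timespec start, end;
    double worst = 0.0;

    for (int i = 0; i < 1000; i++) {
        clock_gettime(CLOCK_MONOTONIC, &start);
        task_body();
        clock_gettime(CLOCK_MONOTONIC, &end);

        double ms = elapsed_ms(start, end);
        if (ms > worst)
            worst = ms;
    }
    printf("observed max: %.3f ms\n", worst);
    return 0;
}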


SLIDE 19

Real-Time DNN Control

  • ~27M floating-point multiplications and additions
    – per image frame (deadline: 50 ms)

M. Bechtel, E. McEllhiney, M. Kim, and H. Yun. "DeepPicar: A Low-cost Deep Neural Network-based Autonomous Car." In RTCSA, 2018.

SLIDE 20

First Attempt

  • 1000 samples (minus the first sample. Why?)

Execution times (ms):

        CFS (nice=0)
Mean        23.8
Max         47.9
99pct       47.4
Min         20.7
Median      20.9
Stdev.       7.7

Why?

SLIDE 21

DVFS

  • Dynamic voltage and frequency scaling (DVFS)
  • Lowering frequency/voltage saves power
  • The clock speed varies depending on the load
  • This causes timing variations
  • Disabling DVFS (pin the "performance" governor):

# echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# echo performance > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
# echo performance > /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
# echo performance > /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor
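Equivalently, as a shell loop over the same sysfs files:

# for c in 0 1 2 3; do echo performance > /sys/devices/system/cpu/cpu$c/cpufreq/scaling_governor; done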

SLIDE 22

Second Attempt (No DVFS)

  • What if there are other tasks in the system?

        CFS (nice=0)
Mean        21.0
Max         22.4
99pct       21.8
Min         20.7
Median      20.9
Stdev.       0.3

SLIDE 23

Third Attempt (Under Load)

  • 4x cpuhog tasks compete for CPU time with the DNN task (see the sketch below)

        CFS (nice=0)
Mean        31.1
Max         47.7
99pct       41.6
Min         21.6
Median      31.7
Stdev.       3.1
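cpuhog is presumably a pure compute-bound busy loop that touches almost no memory; a minimal sketch:

/* cpuhog (sketch): burns CPU cycles with negligible memory traffic */
int main(void)
{
    volatile unsigned long counter = 0;  /* volatile: keep the loop from being optimized away */
    while (1)
        counter++;
    return 0;
}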

SLIDE 24

Recall: kernel/sched/fair.c (CFS)

  • Priority to CFS weight conversion table
    – Priority (nice value): -20 (highest) to +19 (lowest)
    – kernel/sched/core.c

const int sched_prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};
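CFS allocates CPU time proportionally to these weights; adjacent nice levels differ by a factor of about 1.25 (roughly a 10% change in CPU share). For example, a nice=-5 task (weight 3121) competing with a nice=0 task (weight 1024) receives about 3121 / (3121 + 1024) ≈ 75% of the CPU.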

SLIDE 25

Fourth Attempt (Use Priority)

  • The effect may vary depending on the workload

        CFS (nice=0)  CFS (nice=-2)  CFS (nice=-5)
Mean        31.1          27.2           21.4
Max         47.7          44.9           31.3
99pct       41.6          40.8           22.4
Min         21.6          21.6           21.1
Median      31.7          22.1           21.3
Stdev.       3.1           5.8            0.4

SLIDE 26

Fifth Attempt (Use RT Scheduler)

  • Are we done? (One way to switch to the FIFO class is sketched below.)

        CFS (nice=0)  CFS (nice=-2)  CFS (nice=-5)  FIFO
Mean        31.1          27.2           21.4        21.4
Max         47.7          44.9           31.3        22.0
99pct       41.6          40.8           22.4        21.8
Min         21.6          21.6           21.1        21.1
Median      31.7          22.1           21.3        21.4
Stdev.       3.1           5.8            0.4         0.1
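One common way to move a process into the SCHED_FIFO real-time class (a sketch, not necessarily how these numbers were produced; requires root or CAP_SYS_NICE):

#include <sched.h>
#include <stdio.h>

int main(void)
{
    struct sched_param param = { .sched_priority = 50 };  /* RT priority 1..99 */

    /* pid 0 = the calling process */
    if (sched_setscheduler(0, SCHED_FIFO, &param) < 0) {
        perror("sched_setscheduler");
        return 1;
    }
    /* ... run the time-critical work here ... */
    return 0;
}

From a shell, chrt -f 50 ./task (with ./task standing in for the measured program) does the same.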

SLIDE 27

BwRead

  • Use this as the background task instead of 'cpuhog'
  • Everything else is the same.
  • Will there be any differences? If so, why?

#define MEM_SIZE (4*1024*1024)
char ptr[MEM_SIZE];
long sum = 0;

while (1) {
    /* one read per 64-byte cache line; 4 MB far exceeds the 1 MB shared L2 */
    for (int i = 0; i < MEM_SIZE; i += 64) {
        sum += ptr[i];
    }
}

SLIDE 28

Sixth Attempt (Use BwRead)

  • ~2.5X (FIFO) WCET increase! Why?

        Solo          w/ BwRead
        CFS (nice=0)  CFS (nice=0)  CFS (nice=-5)  FIFO
Mean        21.0          75.8          52.3        50.2
Max         22.4         123.0          80.1        51.7
99pct       21.8         107.8          72.4        51.3
Min         20.7          40.6          40.9        38.3
Median      20.9          81.0          50.1        50.6
Stdev.       0.3          17.7           6.1         1.9

SLIDE 29

BwWrite

  • Use this background task instead
  • Everything else is the same.
  • Will there be any differences? If so, why?

#define MEM_SIZE (4*1024*1024)
char ptr[MEM_SIZE];

while (1) {
    /* one write per 64-byte cache line */
    for (int i = 0; i < MEM_SIZE; i += 64) {
        ptr[i] = 0xff;
    }
}

SLIDE 30

Seventh Attempt (Use BwWrite)

  • ~4.7X (FIFO) WCET increase! Why?

        Solo          w/ BwWrite
        CFS (nice=0)  CFS (nice=0)  CFS (nice=-5)  FIFO
Mean        21.0         101.2          89.7        92.6
Max         22.4         194.0         137.2        99.7
99pct       21.8         172.4         119.8        97.1
Min         20.7          89.0          71.8        78.7
Median      20.9          93.0          87.5        92.5
Stdev.       0.3          22.8           7.7         1.0
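A plausible explanation (an assumption, not stated on the slide): each write miss both fetches the line and eventually writes the dirty line back, roughly doubling the DRAM traffic per miss compared to BwRead, so the interference is correspondingly worse.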

SLIDE 31

4x ARM Cortex-A72

  • Your Pi 4: 1 MB shared L2 cache, 2 GB DRAM

SLIDE 32

Shared Memory Hierarchy

  • Cache space
  • Memory bus bandwidth
  • Memory controller queues

[Figure: Cores 1-4 sharing the last-level cache (LLC), the memory controller (MC), and DRAM]

SLIDE 33

Shared Memory Hierarchy

  • Memory performance varies widely due to interference
  • Task WCET can be extremely pessimistic

[Figure: Tasks 1-4 on Cores 1-4, each issuing instruction (I) and data (D) accesses that contend in the shared cache, the memory controller (MC), and DRAM]

SLIDE 34

Multicore and Memory Hierarchy

[Figure: unicore vs. multicore. Unicore: tasks T1-T2 time-share one CPU and its memory hierarchy. Multicore: tasks T1-T8 run on Cores 1-4, which all share the memory hierarchy, with a corresponding performance impact.]

SLIDE 35

Effect of Memory Interference

  • The DNN control task suffers a >10X slowdown
    – when different tasks are co-scheduled on otherwise idle cores

[Figure: normalized execution time (axis up to 12X) of the DNN task on Cores 0-1, solo vs. co-run with BwWrite on Cores 2-3, contending in the shared LLC and DRAM]

Waqar Ali and Heechul Yun. "RT-Gang: Real-Time Gang Scheduling Framework for Safety-Critical Systems." In RTAS, 2019.

SLIDE 36

Effect of Memory Interference


https://youtu.be/Jm6KSDqlqiU

SLIDE 37

Summary

  • Timing analysis is important for time-sensitive, safety-critical real-time applications
  • Static timing analysis
    ++ Strong analytic guarantee
    -- Architecture models are hard to build and tend to be pessimistic
  • Measurement-based timing analysis
    ++ Practical; no architecture model needed
    -- No guarantee of capturing the true worst case
  • Multicore is difficult to handle