
  1. EECS 388: Embedded Systems. 10. Timing Analysis. Heechul Yun

  2. Agenda • Execution time analysis • Static timing analysis • Measurement-based timing analysis

  3. Execution Time Analysis • Will my brake-by-wire system actuate the brakes within one millisecond? • Will my camera-based steer-by-wire system identify a bicyclist crossing within 100 ms (10 Hz)? • Will my drone be able to finish computing control commands within 10 ms (100 Hz)?

  4. Execution Time • Worst-Case Execution Time (WCET) • Best-Case Execution Time (BCET) • Average-Case Execution Time (ACET)

  5. Execution Time • Real-time scheduling theory is based on the assumption of known WCETs of real-time tasks. (Image source: [Wilhelm et al., 2008])

  6. The WCET Problem • For a given code of a task and the platform (OS & hardware), determine the WCET of the task. • Assumptions: loops with finite bounds, no recursion, runs uninterrupted. • Task structure: while(1) { read_from_sensors(); compute(); write_to_actuators(); wait_till_next_period(); }
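A minimal runnable sketch of this task structure in C, assuming a Linux/POSIX target and using clock_nanosleep with an absolute release time for wait_till_next_period(); the sensor, compute, and actuator bodies are hypothetical placeholders, and the 10 ms period is only an example.

#include <time.h>

#define PERIOD_NS 10000000L  /* example period: 10 ms */

/* Hypothetical placeholders for the task body. */
static void read_from_sensors(void)  { /* sample inputs */ }
static void compute(void)            { /* control law   */ }
static void write_to_actuators(void) { /* drive outputs */ }

int main(void)
{
    struct timespec next;
    clock_gettime(CLOCK_MONOTONIC, &next);

    while (1) {
        read_from_sensors();
        compute();
        write_to_actuators();

        /* wait_till_next_period(): sleep until the next absolute release time */
        next.tv_nsec += PERIOD_NS;
        if (next.tv_nsec >= 1000000000L) {
            next.tv_nsec -= 1000000000L;
            next.tv_sec  += 1;
        }
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
    }
    return 0;
}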

  7. Timing Analysis • Static timing analysis – input: code, architecture model; output: WCET • Measurement-based timing analysis – based on lots of measurements; statistical

  8. Static Timing Analysis • Analyze the code • Split it into basic blocks • Find the longest path – considering loop bounds • Compute per-block WCETs – using an abstract CPU model • Compute the task WCET – by summing the WCETs along the longest path
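A toy sketch in C of the final summation step, assuming the basic blocks, per-block WCETs, and loop bounds have already been extracted; all numbers and the control-flow shape below are made up for illustration, not the output of a real analysis tool.

#include <stdio.h>

/* Hypothetical per-block WCETs (cycles) from an abstract CPU model. */
#define WCET_ENTRY  20
#define WCET_LOOP   50   /* one loop-body iteration */
#define LOOP_BOUND  16   /* known finite loop bound */
#define WCET_THEN   80   /* if-branch block   */
#define WCET_ELSE   30   /* else-branch block */
#define WCET_EXIT   10

static unsigned max_u(unsigned a, unsigned b) { return a > b ? a : b; }

int main(void)
{
    /* Task WCET = sum of block WCETs along the longest path,
       with each loop body weighted by its bound. */
    unsigned wcet = WCET_ENTRY
                  + LOOP_BOUND * WCET_LOOP
                  + max_u(WCET_THEN, WCET_ELSE)
                  + WCET_EXIT;
    printf("estimated WCET = %u cycles\n", wcet);
    return 0;
}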

  9. WCET and Caches • How do we determine the WCET of a task? Is it the longest execution path of the task? – Problem: the longest path can take less time to finish than shorter paths if the system has caches! • Example – Path 1: 1000 instructions, 0 cache misses – Path 2: 500 instructions, 100 cache misses – Cache hit: 1 cycle; cache miss: 100 cycles – Path 1 ≈ 1000 × 1 = 1,000 cycles, while Path 2 ≈ 400 × 1 + 100 × 100 = 10,400 cycles, so the shorter Path 2 takes much longer

  10. Recall: Memory Hierarchy (figure: memory pyramid, from fast and expensive volatile memory at the top to slow and inexpensive non-volatile memory at the bottom)

  11. SiFive FE310 • CPU: 32-bit RISC-V • Data bus: 32 bits • Clock: 320 MHz • SRAM: 16 KB (D) + 16 KB (I) • Flash: 4 MB

  12. Raspberry Pi 4: Broadcom BCM2711 • CPU: 4x Cortex-A72 @ 1.5 GHz • L2 cache (shared): 1 MB • GPU: VideoCore IV @ 500 MHz • DRAM: 1/2/4 GB LPDDR4-3200 • Storage: micro-SD (Image: ct.de/Maik Merten (CC BY-SA 4.0); image source: PC Watch)

  13. Processor Behavior Analysis: Cache Effects • Suppose: (1) a 32-bit processor; (2) a direct-mapped cache holding two sets, 4 floats per set; x and y stored contiguously starting at address 0x0. • What happens when n=2? (Slide source: Edward A. Lee and Prabal Dutta, UCB)

  14. Direct-Mapped Cache • A "set" consists of one "line": 1 valid bit, t tag bits, and a block of B = 2^b bytes (Set 0, Set 1, ..., Set S). • An m-bit address is split into t tag bits (tag), s bits (set index), and b bits (block offset). • If the tag of the address matches the tag of the selected (valid) line, then we have a "cache hit." Otherwise, the fetch goes to main memory, updating the line. (Slide source: Edward A. Lee and Prabal Dutta, UCB)

  15. This Particular Direct-Mapped Cache • Four floats per block, four bytes per float, means 16 bytes per block, so b = 4 block-offset bits. • Two sets, so s = 1 set-index bit; t = 27 tag bits; address is m = 32 bits. (Slide source: Edward A. Lee and Prabal Dutta, UCB)
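A small sketch of how a 32-bit address splits into tag / set index / block offset for this particular cache (b = 4 offset bits, s = 1 set-index bit, t = 27 tag bits); the field widths follow the slide, the helper program itself is only illustrative.

#include <stdio.h>
#include <stdint.h>

#define B_BITS 4   /* block offset: 16-byte blocks (4 floats x 4 bytes) */
#define S_BITS 1   /* set index: 2 sets */

int main(void)
{
    uint32_t addr = 0x00000010;  /* e.g., byte 16: x[4] if x starts at 0x0 */

    uint32_t offset = addr & ((1u << B_BITS) - 1);
    uint32_t set    = (addr >> B_BITS) & ((1u << S_BITS) - 1);
    uint32_t tag    = addr >> (B_BITS + S_BITS);   /* remaining 27 bits */

    printf("addr=0x%08x -> tag=0x%x, set=%u, offset=%u\n",
           (unsigned)addr, (unsigned)tag, (unsigned)set, (unsigned)offset);
    return 0;
}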

  16. Processor Behavior Analysis: Cache Effects • Same assumptions as before: 32-bit processor; direct-mapped cache holding two sets, 4 floats per set; x and y stored contiguously starting at address 0x0. • What happens when n=2? x[0] will miss, pulling x[0], x[1], y[0], and y[1] into set 0. All but one access will be a cache hit. (Slide source: Edward A. Lee and Prabal Dutta, UCB)

  17. Processor Behavior Analysis: Cache Effects • Same assumptions as before. • What happens when n=8? x[0] will miss, pulling x[0-3] into set 0. Then y[0] will miss, pulling y[0-3] into the same set, evicting x[0-3]. Every access will be a miss! (Slide source: Edward A. Lee and Prabal Dutta, UCB)
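The exact loop under analysis is not reproduced in this transcript; the sketch below is a representative C version consistent with the slides (x[i] and y[i] touched in the same iteration, arrays assumed contiguous from address 0x0), with comments noting the expected behavior on the cache described above.

#include <stdio.h>

#define N 8   /* try 2 vs. 8 to reproduce the two cases from the slides */

/* Assumed layout from the slides: x and y stored contiguously from 0x0. */
float x[N], y[N];

int main(void)
{
    float z = 0.0f;
    /* Each iteration touches x[i] and then y[i].
       n=2: x[0], x[1], y[0], y[1] all fit in one 16-byte block in set 0,
            so only the first access misses.
       n=8: x[0-3] and y[0-3] both map to set 0 (and x[4-7]/y[4-7] to set 1),
            so they keep evicting each other and every access misses. */
    for (int i = 0; i < N; i++)
        z += x[i] * y[i];
    printf("z = %f\n", z);
    return 0;
}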

  18. Measurement-Based Timing Analysis (MBTA) • Do lots of measurements under worst-case scenarios (e.g., heavy load) • Take the maximum + a safety margin as the WCET • No need for detailed architecture models • Commonly practiced in industry
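A minimal MBTA-style harness in C, assuming Linux's CLOCK_MONOTONIC and a hypothetical task_body() under test: run many iterations (ideally while worst-case background load is running), discard the first cold sample, and report the observed maximum, to which a safety margin is then added.

#include <stdio.h>
#include <time.h>

#define SAMPLES 1000

static void task_body(void) { /* hypothetical work under test, e.g. one DNN inference */ }

static double elapsed_ms(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

int main(void)
{
    double max_ms = 0.0;
    for (int i = 0; i < SAMPLES; i++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        task_body();
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ms = elapsed_ms(t0, t1);
        if (i > 0 && ms > max_ms)   /* skip the first (cold-start) sample */
            max_ms = ms;
    }
    printf("observed max = %.2f ms; add a safety margin to estimate the WCET\n", max_ms);
    return 0;
}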

  19. Real-Time DNN Control • ~27M floating-point multiplications and additions per image frame (deadline: 50 ms) (M. Bechtel, E. McEllhiney, M. Kim, H. Yun. "DeepPicar: A Low-cost Deep Neural Network-based Autonomous Car." In RTCSA, 2018)
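A back-of-the-envelope check of what those numbers imply, as illustrative arithmetic only:

#include <stdio.h>

int main(void)
{
    double ops_per_frame = 27e6;   /* ~27M multiply-and-add operations per frame */
    double deadline_s    = 0.050;  /* 50 ms per frame (20 Hz) */

    /* Sustained rate needed just to finish the arithmetic by the deadline,
       before any scheduling or memory interference is considered. */
    printf("required throughput: %.0f M ops/s\n",
           ops_per_frame / deadline_s / 1e6);
    return 0;
}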

  20. First Attempt • 1000 samples (minus the first sample. Why?)
Time (ms)  CFS (nice=0)
Mean       23.8
Max        47.9  (Why?)
99pct      47.4
Min        20.7
Median     20.9
Stdev.     7.7

  21. DVFS • Dynamic voltage and frequency scaling (DVFS) • Lower frequency/voltage saves power • Varies clock speed depending on the load • Causes timing variations • Disabling DVFS:
# echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# echo performance > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
# echo performance > /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
# echo performance > /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor

  22. Second Attempt (No DVFS)
Time (ms)  CFS (nice=0)
Mean       21.0
Max        22.4
99pct      21.8
Min        20.7
Median     20.9
Stdev.     0.3
• What if there are other tasks in the system?

  23. Third Attempt (Under Load)
Time (ms)  CFS (nice=0)
Mean       31.1
Max        47.7
99pct      41.6
Min        21.6
Median     31.7
Stdev.     3.1
• 4x 'cpuhog' tasks compete for CPU time with the DNN task

  24. Recall: kernel/sched/fair.c (CFS) • Priority to CFS weight conversion table – Priority (nice value): -20 (highest) to +19 (lowest) – kernel/sched/core.c:
const int sched_prio_to_weight[40] = {
 /* -20 */ 88761, 71755, 56483, 46273, 36291,
 /* -15 */ 29154, 23254, 18705, 14949, 11916,
 /* -10 */  9548,  7620,  6100,  4904,  3906,
 /*  -5 */  3121,  2501,  1991,  1586,  1277,
 /*   0 */  1024,   820,   655,   526,   423,
 /*   5 */   335,   272,   215,   172,   137,
 /*  10 */   110,    87,    70,    56,    45,
 /*  15 */    36,    29,    23,    18,    15,
};
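A rough sketch of how these weights translate into CPU share when two tasks compete for one core, using CFS's proportional-share idea (share = weight / sum of weights); the real scheduler involves many more factors, so this is only an approximation.

#include <stdio.h>

int main(void)
{
    /* Weights from sched_prio_to_weight: nice 0 -> 1024, nice -5 -> 3121. */
    double w_dnn    = 3121.0;   /* e.g. the DNN task at nice -5 */
    double w_cpuhog = 1024.0;   /* one competing cpuhog at nice 0 */

    double share = w_dnn / (w_dnn + w_cpuhog);
    printf("DNN share of a contended core: ~%.0f%%\n", share * 100.0);  /* about 75% */
    return 0;
}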

  25. Fourth Attempt (Use Priority)
Time (ms)  CFS (nice=0)  CFS (nice=-2)  CFS (nice=-5)
Mean       31.1          27.2           21.4
Max        47.7          44.9           31.3
99pct      41.6          40.8           22.4
Min        21.6          21.6           21.1
Median     31.7          22.1           21.3
Stdev.     3.1           5.8            0.4
• Effect may vary depending on the workloads
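One way to apply such a nice value from inside the task itself is the standard setpriority() call; a minimal sketch, assuming Linux and sufficient privileges (negative nice values normally require root or CAP_SYS_NICE).

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    /* Lower nice value -> higher CFS weight -> larger CPU share under contention. */
    if (setpriority(PRIO_PROCESS, 0 /* calling process */, -5) != 0)
        perror("setpriority");

    /* ... run the periodic control loop here ... */
    return 0;
}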

  26. Fifth Attempt (Use RT Scheduler)
Time (ms)  CFS (nice=0)  CFS (nice=-2)  CFS (nice=-5)  FIFO
Mean       31.1          27.2           21.4           21.4
Max        47.7          44.9           31.3           22.0
99pct      41.6          40.8           22.4           21.8
Min        21.6          21.6           21.1           21.1
Median     31.7          22.1           21.3           21.4
Stdev.     3.1           5.8            0.4            0.1
• Are we done?
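A minimal sketch of switching the task to the SCHED_FIFO real-time policy with sched_setscheduler(); this also requires root (or CAP_SYS_NICE), and the priority value 1 here is arbitrary.

#include <stdio.h>
#include <sched.h>

int main(void)
{
    struct sched_param sp = { .sched_priority = 1 };

    /* A SCHED_FIFO task preempts all CFS tasks and runs until it blocks or yields. */
    if (sched_setscheduler(0 /* calling process */, SCHED_FIFO, &sp) != 0)
        perror("sched_setscheduler");

    /* ... run the periodic control loop here ... */
    return 0;
}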

  27. BwRead
#define MEM_SIZE (4*1024*1024)
char ptr[MEM_SIZE];
while (1) {
    for (int i = 0; i < MEM_SIZE; i += 64) {   /* stride of one 64-byte cache line */
        sum += ptr[i];                         /* read: streams 4 MB through the cache */
    }
}
• Use this instead of 'cpuhog' as the background task. Everything else is the same.
• Will there be any differences? If so, why?

  28. Sixth Attempt (Use BwRead)
           Solo          w/ BwRead
Time (ms)  CFS (nice=0)  CFS (nice=0)  CFS (nice=-5)  FIFO
Mean       21.0          75.8          52.3           50.2
Max        22.4          123.0         80.1           51.7
99pct      21.8          107.8         72.4           51.3
Min        20.7          40.6          40.9           38.3
Median     20.9          81.0          50.1           50.6
Stdev.     0.3           17.7          6.1            1.9
• ~2.5X (FIFO) WCET increase! Why?

  29. BwWrite
#define MEM_SIZE (4*1024*1024)
char ptr[MEM_SIZE];
while (1) {
    for (int i = 0; i < MEM_SIZE; i += 64) {   /* stride of one 64-byte cache line */
        ptr[i] = 0xff;                         /* write: streams 4 MB through the cache */
    }
}
• Use this as the background task instead. Everything else is the same.
• Will there be any differences? If so, why?

  30. Seventh Attempt (Use BwWrite)
           Solo          w/ BwWrite
Time (ms)  CFS (nice=0)  CFS (nice=0)  CFS (nice=-5)  FIFO
Mean       21.0          101.2         89.7           92.6
Max        22.4          194.0         137.2          99.7
99pct      21.8          172.4         119.8          97.1
Min        20.7          89.0          71.8           78.7
Median     20.9          93.0          87.5           92.5
Stdev.     0.3           22.8          7.7            1.0
• ~4.7X (FIFO) WCET increase! Why?

  31. 4x ARM Cortex-A72 • Your Pi 4: 1 MB shared L2 cache, 2 GB DRAM

  32. Shared Memory Hierarchy (figure: Core1-Core4 share the Last-Level Cache (LLC), the Memory Controller (MC), and DRAM) • Shared resources: cache space • memory bus bandwidth • memory controller queues • ...

  33. Shared Memory Hierarchy • Memory performance varies widely due to interference • Task WCET can be extremely pessimistic (figure: Tasks 1-4 on Core1-Core4, each with private I/D caches, all sharing the shared cache, memory controller (MC), and DRAM)

  34. Multicore and Memory Hierarchy (figure: a unicore CPU running tasks T1 and T2 over one memory hierarchy vs. a multicore running tasks T1-T8 on Core1-Core4 sharing one memory hierarchy, illustrating the resulting performance impact)

  35. Effect of Memory Interference (figure: normalized execution time, solo vs. corun, for the DNN task on Cores 0-1 and BwWrite on Cores 2-3, sharing the LLC and DRAM) • DNN control task suffers >10X slowdown – when co-scheduling different tasks on idle cores (Waqar Ali and Heechul Yun. "RT-Gang: Real-Time Gang Scheduling Framework for Safety-Critical Systems." RTAS, 2019)

  36. Effect of Memory Interference (video): https://youtu.be/Jm6KSDqlqiU
