SLIDE 1

EECS 388: Embedded Systems
10. Timing Analysis

Heechul Yun

SLIDE 2

Agenda

  • Execution time analysis
  • Static timing analysis
  • Measurement-based timing analysis

SLIDE 3

Execution Time Analysis

  • Will my brake-by-wire system actuate the brakes within one millisecond?
  • Will my camera-based steer-by-wire system identify a bicyclist crossing within 100 ms (10 Hz)?
  • Will my drone be able to finish computing control commands within 10 ms (100 Hz)?

SLIDE 4

Execution Time

  • Worst-Case Execution Time (WCET)
  • Best-Case Execution Time (BCET)
  • Average-Case Execution Time (ACET)
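WCET is the task's maximum execution time over all possible inputs and hardware states; BCET is the minimum; ACET is the average over typical runs.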


SLIDE 5

Execution Time

  • Real-time scheduling theory is based on the assumption of known WCETs of real-time tasks

Image source: [Wilhelm et al., 2008]

SLIDE 6

The WCET Problem

  • Given the code of a task and the platform (OS & hardware), determine the WCET of the task.

while (1) {
    read_from_sensors();
    compute();
    write_to_actuators();
    wait_till_next_period();
}

Assumptions: loops with finite bounds, no recursion, runs uninterrupted.

SLIDE 7

Timing Analysis

  • Static timing analysis
    – Input: code and an architecture model; output: WCET
  • Measurement-based timing analysis
    – Based on lots of measurements. Statistical.

SLIDE 8

Static Timing Analysis

  • Analyze the code
  • Split it into basic blocks
  • Find the longest path
    – considering loop bounds
  • Compute per-block WCETs
    – using an abstract CPU model
  • Compute the task WCET
    – by summing the WCETs of the blocks along the longest path
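For example, a minimal sketch of the final step, with purely illustrative per-block WCETs and loop bound (none of these numbers come from the slides):

/* Illustrative per-block WCETs (in cycles) from an abstract CPU model */
#define WCET_INIT   40   /* basic block before the loop */
#define WCET_BODY  120   /* worst-case loop body        */
#define WCET_EXIT   25   /* basic block after the loop  */
#define LOOP_BOUND  16   /* statically known upper bound on iterations */

/* Task WCET along the longest path: init + bound * body + exit */
unsigned task_wcet(void)
{
    return WCET_INIT + LOOP_BOUND * WCET_BODY + WCET_EXIT;
}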


SLIDE 9

WCET and Caches

  • How do we determine the WCET of a task?
  • Is it the longest execution path of the task?
    – Problem: the longest path can take less time to finish than a shorter path if your system has a cache (or caches)!
  • Example
    – Path 1: 1000 instructions, 0 cache misses
    – Path 2: 500 instructions, 100 cache misses
    – Cache hit: 1 cycle; cache miss: 100 cycles
    – Path 1 takes 1000 × 1 = 1,000 cycles, but Path 2 takes 400 × 1 + 100 × 100 = 10,400 cycles, so Path 2 takes much longer

SLIDE 10

Recall: Memory Hierarchy

[Figure: the memory hierarchy, from fast, expensive volatile memory at the top to slow, inexpensive non-volatile memory at the bottom]

SLIDE 11

SiFive FE310

  • CPU: 32-bit RISC-V
  • Clock: 320 MHz
  • SRAM: 16 KB (data) + 16 KB (instruction)
  • Flash: 4 MB
  • 32-bit data bus

SLIDE 12

Raspberry Pi 4: Broadcom BCM2711

(Image: ct.de/Maik Merten (CC BY-SA 4.0)) Image source: PC Watch.

  • CPU: 4x Cortex-A72 @ 1.5 GHz
  • L2 cache (shared): 1 MB
  • GPU: VideoCore VI @ 500 MHz
  • DRAM: 1/2/4 GB LPDDR4-3200
  • Storage: micro-SD

SLIDE 13

Processor Behavior Analysis: Cache Effects

Suppose:

  1. 32-bit processor
  2. Direct-mapped cache holds two sets
     – 4 floats per set
     – x and y stored contiguously starting at address 0x0

What happens when n = 2?
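The loop being analyzed did not survive extraction; a minimal sketch of the presumed code, using the x, y, and n referenced on the slide:

float dot_product(float *x, float *y, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        sum += x[i] * y[i];   /* alternates accesses between x and y */
    }
    return sum;
}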

Slide source: Edward A. Lee and Prabal Dutta (UCB)

SLIDE 14

Direct-Mapped Cache

[Figure: a direct-mapped cache with sets 0, 1, …, S, each holding one line (a valid bit, t tag bits, and a block of B = 2^b bytes). An m-bit address is split into t tag bits, an s-bit set index, and a b-bit block offset.]

A "set" consists of one "line". If the tag of the address matches the tag of the line, then we have a "cache hit". Otherwise, the fetch goes to main memory, updating the line.

Slide source: Edward A. Lee and Prabal Dutta (UCB)

SLIDE 15

This Particular Direct-Mapped Cache

[Figure: the same cache with two sets (set 0 and set 1); a 32-bit address is split into t = 27 tag bits, s = 1 set-index bit, and b = 4 block-offset bits. Each line holds a valid bit, t tag bits, and a block of B = 2^b = 16 bytes.]

Four floats per block, at four bytes per float, means 16 bytes, so b = 4.
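A sketch of how such an address would be decoded (the macro names are illustrative, not from the slides):

/* 32-bit address: | 27-bit tag | 1-bit set index | 4-bit block offset | */
#define OFFSET(addr)  ((addr) & 0xF)          /* low 4 bits        */
#define SET(addr)     (((addr) >> 4) & 0x1)   /* next 1 bit        */
#define TAG(addr)     ((addr) >> 5)           /* remaining 27 bits */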

Slide source: Edward A. Lee and Prabal Dutta (UCB)

SLIDE 16

Processor Behavior Analysis: Cache Effects

Suppose:

  1. 32-bit processor
  2. Direct-mapped cache holds two sets
     – 4 floats per set
     – x and y stored contiguously starting at address 0x0

What happens when n = 2? x[0] will miss, pulling x[0], x[1], y[0], and y[1] into set 0. All but one access will be a cache hit.

Slide source: Edward A. Lee and Prabal Dutta (UCB)

SLIDE 17

Processor Behavior Analysis: Cache Effects

Suppose:

  1. 32-bit processor
  2. Direct-mapped cache holds two sets
     – 4 floats per set
     – x and y stored contiguously starting at address 0x0

What happens when n = 8? x[0] will miss, pulling x[0-3] into set 0. Then y[0] will miss, pulling y[0-3] into the same set, evicting x[0-3] (y starts at byte 32, i.e., block 2, and 2 mod 2 = 0, so y also maps to set 0). Since the loop alternates between the two arrays, every access will be a miss!

Slide source: Edward A. Lee and Prabal Dutta (UCB)

SLIDE 18

Measurement-Based Timing Analysis

  • Measurement-Based Timing Analysis (MBTA)
  • Take lots of measurements under worst-case scenarios (e.g., heavy load), as in the sketch below
  • Take the maximum plus a safety margin as the WCET
  • No need for detailed architecture models
  • Commonly practiced in industry
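A minimal sketch of such a measurement loop, with a hypothetical task_body() standing in for the task under test:

#include <stdio.h>
#include <time.h>

/* Placeholder for the task under test (hypothetical) */
static void task_body(void) { }

/* Difference between two timespecs, in milliseconds */
static double elapsed_ms(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

int main(void)
{
    struct timespec start, end;
    double worst = 0.0;

    for (int i = 0; i < 1000; i++) {
        clock_gettime(CLOCK_MONOTONIC, &start);
        task_body();
        clock_gettime(CLOCK_MONOTONIC, &end);

        double ms = elapsed_ms(start, end);
        if (ms > worst)
            worst = ms;
    }
    printf("observed max: %.3f ms\n", worst);
    return 0;
}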


SLIDE 19

Real-Time DNN Control

  • ~27M floating-point multiplications and additions
    – per image frame (deadline: 50 ms)

M. Bechtel, E. McEllhiney, M. Kim, and H. Yun. "DeepPicar: A Low-cost Deep Neural Network-based Autonomous Car." In RTCSA, 2018.

SLIDE 20

First Attempt

  • 1000 samples (minus the first sample. Why?)

Execution times (ms):

        CFS (nice=0)
Mean        23.8
Max         47.9
99pct       47.4
Min         20.7
Median      20.9
Stdev.       7.7

Why?

SLIDE 21

DVFS

  • Dynamic voltage and frequency scaling (DVFS)
  • Lowering frequency/voltage saves power
  • The clock speed varies depending on the load
  • This causes timing variations
  • Disabling DVFS (pin the "performance" governor):

# echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# echo performance > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
# echo performance > /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
# echo performance > /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor
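Equivalently, as a shell loop over the same sysfs files:

# for c in 0 1 2 3; do echo performance > /sys/devices/system/cpu/cpu$c/cpufreq/scaling_governor; done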

SLIDE 22

Second Attempt (No DVFS)

  • What if there are other tasks in the system?

        CFS (nice=0)
Mean        21.0
Max         22.4
99pct       21.8
Min         20.7
Median      20.9
Stdev.       0.3

SLIDE 23

Third Attempt (Under Load)

  • 4x cpuhog tasks compete for CPU time with the DNN task (see the sketch below)

        CFS (nice=0)
Mean        31.1
Max         47.7
99pct       41.6
Min         21.6
Median      31.7
Stdev.       3.1
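cpuhog is presumably a pure compute-bound busy loop that touches almost no memory; a minimal sketch:

/* cpuhog (sketch): burns CPU cycles with negligible memory traffic */
int main(void)
{
    volatile unsigned long counter = 0;  /* volatile: keep the loop from being optimized away */
    while (1)
        counter++;
    return 0;
}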

SLIDE 24

Recall: kernel/sched/fair.c (CFS)

  • Priority to CFS weight conversion table
    – Priority (nice value): -20 (highest) to +19 (lowest)
    – kernel/sched/core.c

const int sched_prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};
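CFS allocates CPU time proportionally to these weights; adjacent nice levels differ by a factor of about 1.25 (roughly a 10% change in CPU share). For example, a nice=-5 task (weight 3121) competing with a nice=0 task (weight 1024) receives about 3121 / (3121 + 1024) ≈ 75% of the CPU.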

SLIDE 25

Fourth Attempt (Use Priority)

  • The effect may vary depending on the workload

        CFS (nice=0)  CFS (nice=-2)  CFS (nice=-5)
Mean        31.1          27.2           21.4
Max         47.7          44.9           31.3
99pct       41.6          40.8           22.4
Min         21.6          21.6           21.1
Median      31.7          22.1           21.3
Stdev.       3.1           5.8            0.4

SLIDE 26

Fifth Attempt (Use RT Scheduler)

  • Are we done? (One way to switch to the FIFO class is sketched below.)

        CFS (nice=0)  CFS (nice=-2)  CFS (nice=-5)  FIFO
Mean        31.1          27.2           21.4        21.4
Max         47.7          44.9           31.3        22.0
99pct       41.6          40.8           22.4        21.8
Min         21.6          21.6           21.1        21.1
Median      31.7          22.1           21.3        21.4
Stdev.       3.1           5.8            0.4         0.1
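One common way to move a process into the SCHED_FIFO real-time class (a sketch, not necessarily how these numbers were produced; requires root or CAP_SYS_NICE):

#include <sched.h>
#include <stdio.h>

int main(void)
{
    struct sched_param param = { .sched_priority = 50 };  /* RT priority 1..99 */

    /* pid 0 = the calling process */
    if (sched_setscheduler(0, SCHED_FIFO, &param) < 0) {
        perror("sched_setscheduler");
        return 1;
    }
    /* ... run the time-critical work here ... */
    return 0;
}

From a shell, chrt -f 50 ./task (with ./task standing in for the measured program) does the same.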

SLIDE 27

BwRead

  • Use this as the background task instead of 'cpuhog'
  • Everything else is the same.
  • Will there be any differences? If so, why?

#define MEM_SIZE (4*1024*1024)
char ptr[MEM_SIZE];
long sum = 0;

while (1) {
    /* one read per 64-byte cache line; 4 MB far exceeds the 1 MB shared L2 */
    for (int i = 0; i < MEM_SIZE; i += 64) {
        sum += ptr[i];
    }
}

SLIDE 28

Sixth Attempt (Use BwRead)

  • ~2.5X (FIFO) WCET increase! Why?

        Solo          w/ BwRead
        CFS (nice=0)  CFS (nice=0)  CFS (nice=-5)  FIFO
Mean        21.0          75.8          52.3        50.2
Max         22.4         123.0          80.1        51.7
99pct       21.8         107.8          72.4        51.3
Min         20.7          40.6          40.9        38.3
Median      20.9          81.0          50.1        50.6
Stdev.       0.3          17.7           6.1         1.9

SLIDE 29

BwWrite

  • Use this background task instead
  • Everything else is the same.
  • Will there be any differences? If so, why?

#define MEM_SIZE (4*1024*1024)
char ptr[MEM_SIZE];

while (1) {
    /* one write per 64-byte cache line */
    for (int i = 0; i < MEM_SIZE; i += 64) {
        ptr[i] = 0xff;
    }
}

SLIDE 30

Seventh Attempt (Use BwWrite)

  • ~4.7X (FIFO) WCET increase! Why?

        Solo          w/ BwWrite
        CFS (nice=0)  CFS (nice=0)  CFS (nice=-5)  FIFO
Mean        21.0         101.2          89.7        92.6
Max         22.4         194.0         137.2        99.7
99pct       21.8         172.4         119.8        97.1
Min         20.7          89.0          71.8        78.7
Median      20.9          93.0          87.5        92.5
Stdev.       0.3          22.8           7.7         1.0
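A plausible explanation (an assumption, not stated on the slide): each write miss both fetches the line and eventually writes the dirty line back, roughly doubling the DRAM traffic per miss compared to BwRead, so the interference is correspondingly worse.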

SLIDE 31

4x ARM Cortex-A72

  • Your Pi 4: 1 MB shared L2 cache, 2 GB DRAM

SLIDE 32

Shared Memory Hierarchy

  • Cache space
  • Memory bus bandwidth
  • Memory controller queues

[Figure: Cores 1-4 sharing the last-level cache (LLC), the memory controller (MC), and DRAM]

SLIDE 33

Shared Memory Hierarchy

  • Memory performance varies widely due to interference
  • Task WCET can be extremely pessimistic

[Figure: Tasks 1-4 on Cores 1-4, each issuing instruction (I) and data (D) accesses that contend in the shared cache, the memory controller (MC), and DRAM]

SLIDE 34

Multicore and Memory Hierarchy

[Figure: unicore vs. multicore. Unicore: tasks T1-T2 time-share one CPU and its memory hierarchy. Multicore: tasks T1-T8 run on Cores 1-4, which all share the memory hierarchy, with a corresponding performance impact.]

SLIDE 35

Effect of Memory Interference

  • The DNN control task suffers a >10X slowdown
    – when different tasks are co-scheduled on otherwise idle cores

[Figure: normalized execution time (axis up to 12X) of the DNN task on Cores 0-1, solo vs. co-run with BwWrite on Cores 2-3, contending in the shared LLC and DRAM]

Waqar Ali and Heechul Yun. "RT-Gang: Real-Time Gang Scheduling Framework for Safety-Critical Systems." In RTAS, 2019.

SLIDE 36

Effect of Memory Interference


https://youtu.be/Jm6KSDqlqiU

SLIDE 37

Summary

  • Timing analysis is important for time-sensitive, safety-critical real-time applications
  • Static timing analysis
    ++ Strong analytic guarantee
    -- Architecture models are hard to build and tend to be pessimistic
  • Measurement-based timing analysis
    ++ Practical; no architecture model needed
    -- No guarantee of capturing the true worst case
  • Multicore is difficult to handle