Real-Time Multi/Many-Core Architecture, Heechul Yun

Jan 08, 2024

SLIDE 1

Real-Time Multi/Many-Core Architecture

Heechul Yun

SLIDE 2

Real-Time Multi/Many-Core Architecture

Projects on Real-Time CPU Architectures
Assigned Papers

– Shedding the Shackles of Time-Division Multiplexing, RTSS, 2018
– Deterministic Memory Abstraction and Supporting Multicore System Architecture, ECRTS, 2018

SLIDE 3

Trends in Automotive E/E Systems

A. Hamann (Bosch). “Industrial Challenge: Moving from Classical to High-Performance Real-Time Systems.” WATER, 2018.

Source: Bosch

Centralization & High-Performance HW

SLIDE 4

Modern System-on-a-Chip (SoC)

Integrate multiple cores, GPU, accelerators
Good performance, size, weight, power
Challenges: time predictability

[Figure: SoC block diagram in which Core1, Core2, GPU, NPU, ... share a cache, a memory controller (MC), and DRAM]

SLIDE 5

Worst-Case Execution Time (WCET)

Real-time scheduling theory is based on the assumption of known WCETs of real-time tasks

Image source: [Wilhelm et al., 2008]
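To illustrate how scheduling theory consumes WCETs, here is a minimal sketch of one classic test, the Liu and Layland utilization bound for rate-monotonic scheduling. This test is not named on the slide; it is used here only as an example, and it assumes independent periodic tasks whose deadlines equal their periods. The task parameters below are made up.

```python
def rm_utilization_bound(n_tasks):
    """Liu and Layland bound n * (2^(1/n) - 1) for rate-monotonic scheduling."""
    return n_tasks * (2 ** (1.0 / n_tasks) - 1)


def passes_rm_bound(tasks):
    """tasks is a list of (wcet, period) pairs in the same time unit.

    Returns True if total utilization is within the bound. The test is
    sufficient but not necessary: a False result is inconclusive.
    """
    utilization = sum(wcet / period for wcet, period in tasks)
    return utilization <= rm_utilization_bound(len(tasks))


# Hypothetical task sets: (WCET, period)
light_set = [(1, 4), (2, 8)]   # utilization 0.5
heavy_set = [(3, 4), (2, 8)]   # utilization 1.0
```

The test is only as trustworthy as the WCET values fed into it: an underestimated WCET makes a task set look schedulable when it may not be.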

SLIDE 6

Computing WCET

Static analysis

– Input: program code, architecture model
– Output: WCET
– Problem: an accurate architecture model is hard to build, and the resulting bound is pessimistic

Measurement

– No guarantee on true worst-case
– But widely used in practice
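A small sketch of why measurement gives no guarantee: the largest observed time (the high-water mark) is only a bound over the inputs that were actually tested. The task, its cycle counts, and the rare slow path below are all hypothetical.

```python
def execution_time(x):
    """Hypothetical task whose cost in cycles depends on its input."""
    cycles = 10
    if x % 2 == 0:
        cycles += 5        # common, slightly slower path
    if x == 97:
        cycles += 500      # rare slow path that the test inputs miss
    return cycles


def high_water_mark(inputs):
    """Measurement-based estimate: the largest time seen over the inputs."""
    return max(execution_time(x) for x in inputs)


measured = high_water_mark(range(0, 50))     # test campaign covers 0..49 only
true_worst = high_water_mark(range(0, 100))  # full input space 0..99
```

Here the measured estimate falls well below the true worst case because no test input triggers the slow path.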

SLIDE 7

Memory Hierarchies, Pipelines, and Buses for Future Architectures in Time-Critical Embedded Systems

IEEE TCAD, 2009

SLIDE 8

“Problematic” CPU Features

Architectures are optimized to improve average-case performance

WCET estimation is hard because of

– Pipelining
– TLBs/Caches
– Super-scalar execution
– Out-of-order scheduling
– Branch predictors
– Hardware prefetchers
– Basically anything that affects processor state

SLIDE 9

Static Timing Analysis

[Figure: text excerpt on static timing analysis; only fragments referring to control-flow analysis and processor behavior are recoverable]

SLIDE 10

Control Flow Graph (CFG)

Analyze code
Split into basic blocks
Compute per-block WCET

– use abstract CPU model
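The steps above can be sketched as a longest-path computation over the CFG, once each basic block has a WCET. This is a simplified illustration, not the method of any particular tool: it assumes an acyclic CFG (loops already bounded and unrolled), and the block names and cycle counts are made up.

```python
from functools import lru_cache


def wcet_bound(cfg, block_wcet, entry):
    """Longest path from `entry` through an acyclic CFG.

    cfg maps each basic block to its list of successor blocks;
    block_wcet maps each basic block to its per-block WCET in cycles.
    """
    @lru_cache(maxsize=None)
    def longest_from(block):
        # Worst case of this block plus the worst-case continuation.
        tail = max((longest_from(s) for s in cfg.get(block, [])), default=0)
        return block_wcet[block] + tail

    return longest_from(entry)


# Hypothetical if/else CFG with per-block WCETs in cycles.
cfg = {"entry": ["then", "else"], "then": ["exit"], "else": ["exit"], "exit": []}
block_wcet = {"entry": 5, "then": 20, "else": 8, "exit": 3}

bound = wcet_bound(cfg, block_wcet, "entry")  # entry -> then -> exit
```

Summing per-block worst cases like this is only safe when local worst cases compose into the global worst case.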

SLIDE 11

Timing Anomalies

Locally faster != globally faster

Image source: [Wilhelm et al., 2008]

SLIDE 12

Timing Anomalies

Locally faster != globally faster

Image source: [Wilhelm et al., 2008]

SLIDE 13

Challenge: Shared Memory Hierarchy

Memory performance varies widely due to interference

Task WCET can be extremely pessimistic

[Figure: four cores (Core1 to Core4), each running one task (Task 1 to Task 4) with its own I and D caches, sharing a cache, a memory controller (MC), and DRAM]

SLIDE 14

Effect of Memory Interference

DNN control task suffers >10X slowdown

– When other tasks are co-scheduled on otherwise idle cores.

[Figure: bar chart of normalized execution time, solo vs. co-run, for DNN (Core 0,1) and BwWrite (Core 2,3); the four cores share the LLC and DRAM]

Waqar Ali and Heechul Yun. “RT-Gang: Real-Time Gang Scheduling Framework for Safety-Critical Systems.” RTAS, 2019 (to appear)

SLIDE 15

Cache Denial-of-Service Attacks

Michael G. Bechtel and Heechul Yun. “Denial-of-Service Attacks on Shared Cache in Multicore: Analysis and Prevention.” In RTAS, 2019 (to appear, Outstanding Paper Award)

[Figure: four cores (Core1 to Core4) sharing the LLC, with one victim and the others acting as attackers]

Observed worst-case: >300X (times) slowdown

– On simple in-order multicores (Raspberry Pi3, Odroid C2)

Difficult to guarantee predictable timing

SLIDE 16

Real-Time CPU Architectures

PRET

– UC Berkeley.

MERASA/parMERASA project

– EU

ACROSS

– EU

ARAMIS

– Germany

EMC2

– EU

SLIDE 17

FlexPRET: A Processor Platform for Mixed-Criticality Systems

RTAS, 2014

SLIDE 18

SLIDE 19

PRET Pipeline

[Figure: pipeline diagram with stages FETCH, DECODE, REGACC, MEM, EXECUTE, EXCEPT. Instructions from THREAD#1 through THREAD#6 enter the pipeline on successive clocks; labels mark "1 clock", "Thread 1, Instruction 1", and "Thread 1, Instruction 2".]
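A minimal sketch of the thread interleaving in the pipeline figure, assuming a strict round-robin fetch order and one stage per clock (both assumptions made for this illustration). It counts how many instructions of the same thread are in the pipeline at once. Threads are numbered from 0 here, and the function names are hypothetical.

```python
STAGES = ["FETCH", "DECODE", "REGACC", "MEM", "EXECUTE", "EXCEPT"]


def interleaved_schedule(num_threads, num_cycles):
    """Map each cycle to the (stage, thread, instruction) entries active in it."""
    occupancy = {cycle: [] for cycle in range(num_cycles)}
    for fetch_cycle in range(num_cycles):
        thread = fetch_cycle % num_threads   # round-robin fetch
        instr = fetch_cycle // num_threads   # per-thread instruction index
        for offset, stage in enumerate(STAGES):
            cycle = fetch_cycle + offset     # one stage per clock
            if cycle < num_cycles:
                occupancy[cycle].append((stage, thread, instr))
    return occupancy


def max_in_flight_per_thread(occupancy):
    """Largest number of same-thread instructions in the pipeline in any cycle."""
    worst = 0
    for entries in occupancy.values():
        counts = {}
        for _stage, thread, _instr in entries:
            counts[thread] = counts.get(thread, 0) + 1
        worst = max(worst, max(counts.values(), default=0))
    return worst


six_threads = max_in_flight_per_thread(interleaved_schedule(6, 24))
three_threads = max_in_flight_per_thread(interleaved_schedule(3, 24))
```

With as many threads as pipeline stages, each thread has at most one instruction in flight, so an instruction never overlaps with an earlier instruction of its own thread. With fewer threads that property no longer holds in this model.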

SLIDE 20

FlexPRET Pipeline

SLIDE 21

Hardware Support for WCET Analysis of Hard Real-Time Multicore Systems

ISCA 2009

SLIDE 22

Analyzable Multicore Architecture

Idea 1: Bound interference on shared resources

– On-chip shared bus
– (shared) L2 cache

Idea 2: WCET computation mode

SLIDE 23

Architecture

SLIDE 24

Round-Robin Bus Arbitration

UBD = (N_HRT – 1) × L_bus
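A worked instance of the formula above. The slide does not define the symbols, so the reading here is an assumption: UBD is the upper bound on the delay one bus request can suffer, the first factor counts the other hard real-time requesters under round-robin arbitration, and the second is the latency of one bus transaction. The numbers are made up.

```python
def upper_bound_delay(n_hrt, l_bus):
    """Worst-case wait for one request under round-robin arbitration.

    n_hrt: number of hard real-time requesters sharing the bus (assumed meaning)
    l_bus: latency of one bus transaction, in cycles (assumed meaning)
    """
    if n_hrt < 1:
        raise ValueError("need at least one requester")
    # In the worst case every other requester is served once before us.
    return (n_hrt - 1) * l_bus


ubd_four_cores = upper_bound_delay(4, 2)   # (4 - 1) * 2 cycles
ubd_single = upper_bound_delay(1, 5)       # no competitors, no delay
```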

SLIDE 25

Request vs. Job-level WCET Analysis

Request-level analysis

– Assume worst-case interference for each access of the task under analysis
– Pessimistic, as not all accesses will suffer interference

Job-level analysis

– Assume the total number of competing memory accesses is known
– Can reduce pessimism
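A rough sketch of the difference between the two analyses, under simplifying assumptions that are not from the slides: round-robin bus arbitration, a fixed latency per bus transaction, and each competing access delaying the task under analysis at most once. The function and parameter names are hypothetical.

```python
def request_level_bound(n_accesses, n_hrt, l_bus):
    """Every access of the task is assumed to wait for all other requesters."""
    return n_accesses * (n_hrt - 1) * l_bus


def job_level_bound(n_accesses, n_hrt, l_bus, competing_accesses):
    """Interference is capped by the number of competing accesses that exist."""
    interfering = min(n_accesses * (n_hrt - 1), competing_accesses)
    return interfering * l_bus


# Hypothetical job: 100 accesses, 4 requesters, 2-cycle bus transactions.
per_request = request_level_bound(100, 4, 2)
per_job = job_level_bound(100, 4, 2, competing_accesses=50)
```

When the competitors issue few accesses, the job-level bound is much tighter. When they issue many, it falls back to the request-level bound.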

SLIDE 26

Summary

Timing anomalies

– Locally fast != globally fast on non-timing compositional architectures (i.e., most architectures)

Timing compositional architecture

– Free of timing anomalies

SLIDE 27

Discussion

Why is this interesting?
Are assumptions realistic?

– Task model
– Cache model
– Memory model
– CPU (pipeline) model

SLIDE 28

Discussion

Why is this interesting?
Are assumptions realistic?

– Task model
– Cache model
– Memory model
– CPU (pipeline) model

SLIDE 29

Atomic vs. Split-Transaction Bus

J. P. Shen and M. H. Lipasti. Modern Processor Design: Fundamentals of Superscalar Processors. Waveland Press, 2013.

SLIDE 30

Announcement

Mini Project #1
DeepPicar Competition

– Build a self-driving car
– Based on DeepPicar
– Competition format

SLIDE 31

Acknowledgement

Some slides are from:


– Prof. Rodolfo Pellizzoni, University of Waterloo
– Prof. Edward A. Lee, University of California, Berkeley