Efficient fine-grain parallelism in shared memory for real-time avionics
- P. Baufreton – Safran
- V. Bregeon, J. Souyris – Airbus
- K. Didier, D. Potop-Butucaru, G. Iooss – Inria
Critical real-time on multi-/many-cores

All the single-core challenges remain, and in addition:
– Significantly more concurrency
– Making the parallelization decisions is harder
– Time/space isolation facilitates the demonstration of certain properties
– Bad implementation decisions lead to poor performance
…but it is worth it.
[Figure: many-core compute cluster — 16 cores (C0–C15) plus a system core around a shared memory (SMEM), with DMA engines, D-NoC/C-NoC routers and a DSU; and a simplified platform model — cores C1–C4, DMA and I/O peripherals connected to the shared memory through an interconnect.]
[Figure: Gantt charts of applications App1–App5 scheduled on Cores 1–3 over time, showing performance lost both within partitions and due to time/space partitioning (TSP) between partitions.]
Partitioning on common multi-core platforms
– Interferences between partitions running in parallel
– Requires HW resource partitioning (e.g. caches, RAM, I/O)
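To make the interference point concrete, here is a small sketch (the cache geometry is an illustrative example, not a specific avionics platform) of why two partitions running in parallel compete in a shared cache: addresses whose set-index bits match are mapped to the same cache set.

```c
#include <stdint.h>

/* Illustration of shared-cache interference between partitions:
 * two addresses compete for the same cache set whenever their
 * set-index bits match. Geometry below (64-byte lines, 1024 sets)
 * is an example, not a real platform's. */

#define LINE_BITS 6u     /* 64-byte cache lines */
#define SET_COUNT 1024u  /* number of cache sets */

static unsigned cache_set(uintptr_t addr) {
    return (unsigned)((addr >> LINE_BITS) % SET_COUNT);
}

/* Buffers placed a multiple of SET_COUNT * 64 B apart collide
 * set-for-set, so a partition on one core can evict another
 * partition's lines; preventing this requires partitioning the
 * cache itself (e.g. page coloring or way locking). */
```

Two buffers 64 KiB apart here land in exactly the same sets, which is why hardware resource partitioning is needed before partitions may run in parallel.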
– Empty caches, reset shared devices
– But efficient heuristics exist
– Interferences due to accesses to shared resources
– Time/space isolation properties are often used to facilitate timing analysis, at the cost of efficiency
Tool flow: a Lustre/Scade functional specification, non-functional requirements (e.g. real-time), and a platform model (cores, memory) feed parallelization, real-time scheduling, and timing analysis; parallel code generation, compilers, and the linker then produce the parallel real-time executable code, guaranteeing:
– Functional correctness
– Respect of the requirements
[ACM TACO’19]
– Also improves sequential code generation
[ACM TACO’19]
[Figure: guaranteed parallelization speed-up as a function of the number of cores used (2 to 16); theoretical upper bound = 6.8x.]
void* thread_cpu0(void* unused){
  lock_init_pe(0);
  init();
  for(;;){
    global_barrier_reinit(2);
    time_wait(3000);
    global_barrier_sync(0);
    dcache_inval();
    f(i,&x);
    dcache_flush();
    unlock(1);
    lock(0,0);
    dcache_inval();
    h(x,y,&z);
    dcache_flush();
  }
}

void* thread_cpu1(void* unused){
  lock_init_pe(1);
  for(;;){
    global_barrier_sync(1);
    dcache_inval();
    g(z,&y);
    dcache_flush();
    lock(1,1);
    unlock(0);
  }
}
[Figure: dataflow graph of nodes f, g, h over variables i, x, y, z, mapped onto Core 0 and Core 1; each cycle begins with a global barrier sync, f and g run in parallel, and g reads the value of z produced in the previous cycle (z-1).]
– e.g. One memory bank per core, computations only access local bank
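Bank-aware data placement can be sketched with compiler section attributes; the section names `.core0_bank` and `.core1_bank` below are hypothetical, and a platform linker script would have to map each section to a distinct physical SMEM bank so that each core's computations only touch its local bank.

```c
#include <stdint.h>

/* Sketch of per-core bank placement (section names are assumed,
 * not a real platform API): the linker script maps .core0_bank
 * and .core1_bank to distinct physical memory banks, so the two
 * cores never contend on the memory side. */

__attribute__((section(".core0_bank"))) static int32_t x_local; /* core 0 data */
__attribute__((section(".core1_bank"))) static int32_t y_local; /* core 1 data */

int32_t core0_step(int32_t i) { x_local = i + 1; return x_local; }
int32_t core1_step(int32_t z) { y_local = 2 * z; return y_local; }
```

On a banked memory, this placement makes the "computations only access the local bank" property checkable from the link map rather than from a global interference analysis.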
– Per-node variable copies
– Per-CPU variable copies
– No variable copies (Lopht default)
– Alternating phases of computation (without communication) and global synchronization/communication
– Often used along with memory allocation (e.g. one memory bank per core)
[Figure: Gantt charts of nodes f, g, h, n on Core 0 and Core 1 comparing three scheduling policies: unconstrained, BSP scheduling, and interference-free scheduling.]