[PPT] - An Energy-Efficient Parallel Algorithm for Real-Time Near-Optimal UAV Path Planning

SLIDE 1

13.06.2017

D. Palossi et al.

1 | |

D. Palossi (a), A. Marongiu (a,b), L. Benini (a,b)
B. Forsberg (a), M. Furci (b), R. Naldi (b), L. Marconi (b)

(a) ETH Zürich, (b) University of Bologna

NVidia GTC17 – Munich, October 10th - 12th, 2017 - #23356

An Energy-Efficient Parallel Algorithm for Real-Time Near-Optimal UAV Path Planning

SLIDE 2

Introduction



There are many applications for autonomous Unmanned Aerial Vehicles (UAVs)



Surveillance



Aerial Mapping



Entertainment



Rescue Mission



Standard-size quadrotors (~50 cm, a few kg, ~100 W) → on-board computation is bounded by weight and battery



One of the fundamental functional blocks for autonomous UAVs is the path planner

We focus on standard-size quadrotors

SLIDE 4

Energy Efficiency Requirements

Current standard-size UAV vs. next generation micro/nano-size UAV (power budgets for pico-size UAV from [1]):

Size [∅, weight]: current system ∅ 50 cm / few kg; next gen system few cm / few g

Propellers power consumption: current system hundreds of W; next gen system few W / hundred mW

Processing device class: current system desktop CPU; next gen system LP/ULP embedded

Cognitive skills: fully autonomous

If we want to bring the advanced cognitive skills of state-of-the-art systems into the next generation of autonomous vehicles → energy-efficient algorithms are key

We look at parallelism combined with near-optimality as the key to meeting the energy requirements

[1] R.J. Wood, B. Finio, M. Karpelson, K. Ma, N.O. Perez-Arancibia, P.S. Sreetharan, H. Tanaka, and J.P. Whitney: Progress on "pico" air vehicles. Int. Symp. on Robotics Research (invited paper), Flagstaff, AZ, Aug. 2011.

SLIDE 5

Outline



Path Planning Application



Graph computation and exploration



Naive approximate and Atomic version



Profile-based version



Limitations of the Naive Approach



Experimental Evaluation



System Characterization



Experimental Results



The Predictable Execution Model (PREM)

SLIDE 6

Path Planning Application

Path Planning:



constantly updates the route of the vehicle based on information sensed in real time



selects the best path (according to specific metrics)



responsible for preventing collisions with dynamic, unexpected obstacles



the reactivity of the UAV depends on the path planner response time

SLIDE 16

Graph Computation

Quadrotor Automaton [1]: represents the kinematics and the constraints of the robot

Map Automaton: represents locations, possible connections, and their constraints (obstacles)

Sequence of movements: go_45 - go_45 - go_45

Obstacle detected in 2-3

[1] M. Furci, A. Paoli, and R. Naldi. A supervisory control strategy for robot-assisted search and rescue in hostile environments. In Emerging Technologies Factory Automation (ETFA), 2013 IEEE 18th Conference on, pages 1–4, Sept 2013.

SLIDE 17

Graph Computation

Sequence of movements: go_0 - go_0 - go_90 - go_90

SLIDE 20

Graph Computation



Topological information and the kinematics of the vehicle are fused into a larger graph

Automaton Synchronous Composition

Figure labels: Composition Automaton, Quadrotor Automaton, Map Automaton
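
A minimal sketch of how a synchronous composition of two automata can be computed, assuming each automaton is given as a dictionary that maps (state, event) to the next state. The toy automata, the state names, the private "turn" event, and the helper `compose` are hypothetical illustrations, not the automata from the slides; only the go_0 / go_90 event names echo the movement labels used in the deck.

```python
from collections import deque


def compose(a, b, a_init, b_init):
    """Synchronous composition of two deterministic automata.

    Each automaton is a dict {(state, event): next_state}. An event that
    belongs to both alphabets fires only if both automata enable it; an
    event private to one automaton moves only that automaton.
    """
    events_a = {event for (_, event) in a}
    events_b = {event for (_, event) in b}
    start = (a_init, b_init)
    transitions, seen, queue = {}, {start}, deque([start])
    while queue:
        sa, sb = queue.popleft()
        for event in sorted(events_a | events_b):
            na = a.get((sa, event)) if event in events_a else sa
            nb = b.get((sb, event)) if event in events_b else sb
            if na is None or nb is None:
                continue  # the event is blocked by one of the automata
            nxt = (na, nb)
            transitions[((sa, sb), event)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return transitions


# Hypothetical quadrotor automaton: the heading limits which move is allowed.
quadrotor = {
    ("heading_0", "go_0"): "heading_0",
    ("heading_0", "turn"): "heading_90",
    ("heading_90", "go_90"): "heading_90",
}
# Hypothetical map automaton: cells A, B, C and the moves that connect them.
grid = {
    ("A", "go_0"): "B",
    ("B", "go_90"): "C",
}
product = compose(quadrotor, grid, "heading_0", "A")
```

In this toy case the composed graph keeps only the moves that are feasible for both the vehicle and the map, which is the property the path planner relies on when it later searches the composed graph.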

SLIDE 21

Graph Computation



Obstacles and Safety: as soon as an obstacle is detected, the Automaton Composition graph is updated

SLIDE 22

Graph Exploration

Single Source Shortest Path problem with non-negative weights (SSSP)

Problem: find a path between two vertices (V) in a graph so that the sum of weights (W) of its constituent edges (E) is minimized.

Naive Implementation:



Near-optimal parallel implementation of the Dijkstra algorithm (global optimality)



Lock-free updates of the cost of a central node



Race conditions allowed to boost performance



The near optimality never affects the safety of the mission



Naive-Atomic variant to prevent races (atomicMin)
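
For reference, the sequential baseline that the parallel variants approximate can be sketched as a textbook Dijkstra search. This is a generic illustration, not the authors' code: the adjacency-list layout, the function names, and the small example graph are assumptions, and the parallel lock-free and atomicMin variants from the slide are not shown.

```python
import heapq


def dijkstra(graph, source):
    """Textbook Dijkstra for non-negative edge weights.

    graph maps a vertex to a list of (neighbor, weight) pairs.
    Returns (dist, pred): the best known cost and predecessor per vertex.
    """
    dist = {source: 0}
    pred = {source: None}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale queue entry, a cheaper route was already found
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                pred[v] = u
                heapq.heappush(heap, (nd, v))
    return dist, pred


def path_to(pred, target):
    """Walk the predecessor links back from target to the source."""
    path = []
    while target is not None:
        path.append(target)
        target = pred[target]
    return path[::-1]


# Hypothetical 4-vertex graph used only for illustration.
example_graph = {0: [(1, 4), (2, 1)], 1: [(3, 1)], 2: [(1, 2), (3, 5)], 3: []}
example_dist, example_pred = dijkstra(example_graph, 0)
example_path = path_to(example_pred, 3)
```

The cost update inside the inner loop is the step that the Naive version performs without locks and that the Naive-Atomic version protects with atomicMin.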

SLIDE 28

Naive Path Planner: Limitations



Poor usage of the computational power due to sparse workload distribution

Naive implementation:



high synchronization cost (fine-grained)



sparse workload



Working Threads spread among multiple warps



it requires more iterations than an optimized solution
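
To illustrate why a sparse workload wastes GPU resources, the sketch below counts how many warps contain at least one working thread, assuming the usual warp size of 32 threads on NVIDIA GPUs. The thread indices are invented for illustration and do not come from the measurements in this talk.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def warps_touched(active_threads):
    """Number of distinct warps that hold at least one working thread."""
    return len({thread // WARP_SIZE for thread in active_threads})


# Eight working threads scattered over a 1024-thread block (hypothetical).
sparse = [0, 40, 75, 130, 200, 333, 512, 900]
# The same amount of work packed into consecutive threads.
dense = list(range(8))

sparse_warps = warps_touched(sparse)
dense_warps = warps_touched(dense)
```

With the scattered indices every working thread sits in a different warp, so eight warps are scheduled for eight useful threads, whereas the packed layout needs a single warp.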

SLIDE 30

Profile-based Path Planner



To overcome the limitations of the Naive implementation we introduce a profile-based version



We introduce the concept of exploration frontiers:



enumeration of sets of vertices F, where all vertices in F_n have been visited from at least one vertex in F_m for any m: 0 < m < n



expose dense, parallel workloads



allow for a coarser synchronization scheme

Frontiers are defined during an off-line, profile-based preprocessing step
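
One plausible way to derive frontiers from an off-line run is sketched below. It assumes, and this is an assumption rather than something the slides state, that the frontier index of a vertex is its depth in the predecessor tree produced by a sequential shortest-path search, so that every vertex in a frontier is reached from a vertex in an earlier frontier. The function name and the example predecessor map are hypothetical.

```python
def frontiers_from_predecessors(pred):
    """Group vertices by their depth in a predecessor tree.

    pred maps each vertex to its predecessor (None for the source).
    Returns a list of frontiers; frontier k holds the vertices at depth k.
    """
    depth = {}
    for vertex in pred:
        chain = []
        u = vertex
        # Climb towards the source until a vertex with known depth is found.
        while u is not None and u not in depth:
            chain.append(u)
            u = pred[u]
        d = depth[u] if u is not None else -1
        for node in reversed(chain):
            d += 1
            depth[node] = d
    frontiers = [[] for _ in range(max(depth.values()) + 1)]
    for vertex, d in depth.items():
        frontiers[d].append(vertex)
    return [sorted(frontier) for frontier in frontiers]


# Hypothetical predecessor map, as a sequential Dijkstra run might produce.
example_pred = {"s": None, "a": "s", "b": "s", "c": "a", "d": "c"}
example_frontiers = frontiers_from_predecessors(example_pred)
```

All vertices inside one frontier can then be handed to parallel threads together, which is what makes the workload dense and lets synchronization happen once per frontier instead of once per vertex.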

SLIDE 31

Profile-based Path Planner



Software architecture: Preprocessing + 2-phase Near-Optimal Exploration

Diagram: Profile-based Preprocessing, off-line (ahead-of-time); Parallel exploration (1st phase) and Parallel exploration (2nd phase), on-line. Data: Frontiers Array, Transition Matrix, Deferred Array. Condition: IF Deferred Array > 0

SLIDE 32

Profile-based Path Planner



Software architecture: Preprocessing + 2-phase Near-Optimal Exploration

Diagram: Profile-based Preprocessing; Parallel exploration (1st phase); Parallel exploration (2nd phase). Data: Frontiers Array, Transition Matrix, Deferred Array. Condition: IF Deferred Array > 0



Input: static map snapshot



Sequential Dijkstra



Near-optimal exploration of frontiers



Deferred node exploration (to the 2nd phase) due to dynamic obstacles



Conditional phase



Small instance of the Naive version, exploring only the deferred nodes

SLIDE 37

Profile-based Path Planner

Profile-based implementation:



low synchronization cost (coarse-grained)



frontiers force dense workload



it requires fewer iterations than the Naive version



Profile-based approach increases thread usage through frontiers

SLIDE 47

Profile-based Path Planner



Dynamic obstacles might alter the visit order defined by the frontiers

Figure labels: Profile-based exploration (no obstacles); On-line exploration (dynamic obstacles); No predecessor/cost

SLIDE 51

Profile-based Path Planner



Dynamic obstacles might alter the visit order defined by the frontiers

Figure labels: Profile-based exploration (no obstacles); On-line exploration (dynamic obstacles); Not yet explored

SLIDE 55

Experimental Setup



Naive near-optimal parallel version (Naive)



Naive parallel with atomic intrinsic atomicMin (Naive-Atomic)



Offline profiling strategy → different frontiers



Sequential Dijkstra, fetching at each iteration the neighbor with the min cost first (Prof-Min)



Sequential Dijkstra, fetching at each iteration the neighbor with the max cost first (Prof-Max)



Fine-grained locking to prevent race conditions (Prof-Min+Lock)



System configuration:



NVidia Tegra X1 (TX1), a many-core SoC featuring a 4-core ARM Cortex-A57 CPU and a Maxwell GPU



1024 CUDA threads (max within the same block)



Vehicle speed of 4 m/s and minimum obstacle detection distance of 1 m → the planner must respond within 250 ms [1]



Vehicle speed of 20 m/s and minimum obstacle detection distance of 1 m → the planner must respond within 50 ms [2]

[1] Daniele Palossi, Michele Furci, Roberto Naldi, Andrea Marongiu, Lorenzo Marconi, and Luca Benini: An energy-efficient parallel algorithm for real-time near-optimal UAV path planning. Computing Frontiers 2016.

[2] DJI Phantom 4: https://www.dji.com/phantom-4/info
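
The two real-time bounds follow from dividing the detection distance by the vehicle speed. The short sketch below reproduces that arithmetic; the function name is hypothetical, and the model deliberately ignores sensing and actuation latency, which the slides do not quantify.

```python
def planning_deadline_ms(speed_m_per_s, detection_distance_m):
    """Time in milliseconds before the vehicle covers the detection distance."""
    return 1000.0 * detection_distance_m / speed_m_per_s


slow_deadline = planning_deadline_ms(4.0, 1.0)   # 4 m/s, 1 m detection distance
fast_deadline = planning_deadline_ms(20.0, 1.0)  # 20 m/s, 1 m detection distance
```

A planner whose response time exceeds the deadline for a given speed cannot replan before the vehicle reaches the obstacle.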

SLIDE 56

Speedup



Speedup vs. Sequential (1 ARM Cortex A57) -- 20% obstacles -- 4 map sizes

SLIDE 57

Performance vs. Path Optimality



Performance: 4 obstacle configurations -- 4 map sizes

SLIDE 58

Performance vs. Path Optimality



Performance is inversely proportional to the number of obstacles, due to the 2nd exploration phase

Plot annotations: 5x, 2x, 3x, 7x

SLIDE 59

Performance vs. Path Optimality



Real-Time upper bounds: 4 m/s → 250 ms, 20 m/s → 50 ms

SLIDE 60

Performance vs. Path Optimality



Upper bounds: only Prof-Min is fast enough to avoid obstacles when flying at 20 m/s

SLIDE 61

Performance vs. Path Optimality



Path optimality: 4 obstacle configurations -- 4 map sizes

SLIDE 62

Performance vs. Path Optimality



Path optimality: the 2nd exploration phase is a new source of inaccuracy

Plot annotations: 0.3% vs. 0%; 0.1% vs. 4.5%; 0% vs. 4%; 0.2% vs. 2.5%

SLIDE 63

Performance vs. Path Optimality



Prof-Max: insertion of a node multiple times in different frontiers → optimal path

SLIDE 64

Performance vs. Path Optimality



Locks: race conditions are negligible in the Prof-Min

SLIDE 65

Discussion



In the Naive: More obstacles → lower error



The fewer feasible paths there are → the closer to the optimal path we get (higher probability that Naive selects the optimal path)



In the Prof-Min: More obstacles → lower error vs. more obstacles → higher error



Same as for the Naive



2nd phase exploration will explore the deferred nodes only once and we do not propagate the updated costs to other nodes already visited in the 1st phase



Result: for 50% obstacles the highest error for Prof-Min is ≈ 0.5% (100×100)



The off-line profiling can be performed periodically in background on the host

SLIDE 68

The Predictable Execution Model (PREM)



How large can the interference in execution time between the two subsystems be?



Rodinia benchmarks, executing on both the GPU and the CPU, show:



up to 2.5x slow-down on CPU execution under mutual interference



up to 33x slow-down on GPU execution under mutual interference

Shared DRAM

Strong push for unified memory model in heterogeneous SoCs



Optimized to reduce performance loss



Good for programmability



How about predictability?

SLIDE 70

The Predictable Execution Model (PREM)



Predictable interval



Memory prefetching in the first phase



No cache misses in the execution phase



Non-preemptive execution



System-wide co-scheduling of memory phases from multiple actors

Requires compiler support for code re-structuring

Requires runtime techniques for global memory arbitration

Originally proposed for (multi-core) CPU. We study the applicability of this idea to heterogeneous SoCs

SLIDE 72

A heterogeneous variant of PREM



Current focus on GPU behavior (far more severely affected by interference than the CPU)

SPM as a predictable, local memory

Implement PREM phases within a single offload

Arbitration of main memory accesses via timed interrupts + shared memory

Rely on high-level constructs for offloading

Diagram (SoC): CPU complex (CORE, CORE, I$, I$, SHARED LLC); GPU complex (two clusters, each with an L1 SCRATCHPAD and eight cores C); MC; SHARED OFF-CHIP DRAM



Loop tiling

[3] Björn Forsberg, Andrea Marongiu, and Luca Benini: GPUguard: towards supporting a predictable execution model for heterogeneous SoC. DATE 2017
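
The split into a memory phase and a compute phase, combined with loop tiling, can be sketched in plain Python as follows. This is only a schematic of the idea and not the GPUguard implementation: the function name, the tile size, and the sum-of-squares workload are hypothetical, and the `local` list merely stands in for the scratchpad memory that a real GPU kernel would use.

```python
def prem_tiled_sum_of_squares(data, tile_size):
    """Process data tile by tile with separate memory and compute phases.

    Returns the result and a log of the phases that were executed.
    """
    total = 0
    phases = []
    for start in range(0, len(data), tile_size):
        # Memory phase: copy one tile from main memory into the local buffer.
        local = list(data[start:start + tile_size])
        phases.append(("memory", start, len(local)))
        # Compute phase: work only on the local copy, never on main memory.
        total += sum(x * x for x in local)
        phases.append(("compute", start, len(local)))
    return total, phases


example_total, example_phases = prem_tiled_sum_of_squares(list(range(10)), 4)
```

Because each compute phase touches only data that was fetched beforehand, the main-memory accesses are concentrated in the memory phases, which is what allows them to be arbitrated system-wide.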

SLIDE 74

PREM Evaluation: Path Planner

Increased instruction count for specialization and/or tiling

Compiler optimizations possible

Overhead due to GPU idleness

Synchronization scheme

Property of the workload

Path Planner

[4] Björn Forsberg, Daniele Palossi, Andrea Marongiu, Luca Benini: GPU-Accelerated Real-Time Path Planning and the Predictable Execution Model. ICCS 2017

SLIDE 75

PREM Evaluation: Path Planner

WCET

Near zero variance

3x reduction in WCET

Path Planner

[4] Björn Forsberg, Daniele Palossi, Andrea Marongiu, Luca Benini: GPU-Accelerated Real-Time Path Planning and the Predictable Execution Model. ICCS 2017

SLIDE 76

Conclusion



Parallelism and near-optimality to boost UAV energy efficiency/performance

Efficient use of embedded GPU

PREM techniques to guarantee predictable timing behavior

Achievements:



Profile-based version ~1000x faster than sequential and ~7x faster than Naive



Loss in accuracy limited to ~5% and never affecting the safety of the mission



PREM gives us near zero variance and a 3x reduction in WCET

SLIDE 77

Thank you for your attention.

Questions?

SLIDE 78

Backup: Memory Footprint



For Prof-Min, the memory increase is a linear function of the map size



It introduces negligible overhead for the considered problem instance

SLIDE 79

Backup: Data Packing

We pack two pieces of information into one 32-bit integer: each entry of the cost array (cost) contains both the cost and the predecessor id, so comparisons are decided by the left-most piece.
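
A minimal sketch of this packing, assuming the cost occupies the upper 16 bits and the predecessor id the lower 16 bits of the word. The 16/16 split and the helper names `pack` and `unpack` are hypothetical; the slides do not give the actual bit layout.

```python
PRED_BITS = 16                    # assumed width of the predecessor id field
PRED_MASK = (1 << PRED_BITS) - 1
COST_LIMIT = 1 << (32 - PRED_BITS)


def pack(cost, pred):
    """Store cost in the high bits and predecessor id in the low bits."""
    if not (0 <= cost < COST_LIMIT and 0 <= pred <= PRED_MASK):
        raise ValueError("cost or predecessor id does not fit in the word")
    return (cost << PRED_BITS) | pred


def unpack(word):
    """Recover (cost, predecessor id) from a packed 32-bit word."""
    return word >> PRED_BITS, word & PRED_MASK


# Three candidate updates for the same node: (cost, predecessor id).
candidates = [pack(7, 3), pack(5, 900), pack(9, 1)]
best_cost, best_pred = unpack(min(candidates))
```

Since the cost sits in the most significant bits, a single integer minimum selects the cheapest candidate and carries its predecessor along, so one minimum operation on the packed word, such as the atomicMin intrinsic named for the Naive-Atomic variant, can keep both fields consistent.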