An Energy-Efficient Parallel Algorithm for Real-Time Near-Optimal UAV Path Planning. D. Palossi (a), A. Marongiu (a,b), L. Benini (a,b), B. Forsberg (a), M. Furci (b), R. Naldi (b), L. Marconi (b). (a) ETH Zürich, (b) University of Bologna. NVidia GTC17, Munich.


SLIDE 1

An Energy-Efficient Parallel Algorithm for Real-Time Near-Optimal UAV Path Planning

D. Palossi (a), A. Marongiu (a,b), L. Benini (a,b), B. Forsberg (a), M. Furci (b), R. Naldi (b), L. Marconi (b)

(a) ETH Zürich, (b) University of Bologna

NVidia GTC17 – Munich, October 10th - 12th, 2017 - #23356

13.06.2017

D. Palossi et al.

SLIDE 2

Introduction

There are many applications for autonomous Unmanned Aerial Vehicles (UAVs):

Surveillance

Aerial Mapping

Entertainment

Rescue Mission

One of the fundamental functional blocks for autonomous UAVs is the path planner

We focus on standard-size quadrotors

Standard-size quadrotors (~50 cm, few kg, ~100 W) → computation is bounded by weight/battery

SLIDES 3-4

Energy Efficiency Requirements

                           Current system      Next Gen system
Size [∅, weight]           50 cm / few kg      few cm / few g
Propellers Power Cons.     hundreds of W       few W / hundred mW
Processing Device Class    desktop CPU         LP/ULP embedded
Cognitive Skills           fully autonomous

[Figure labels: Current standard-size UAV; Next generation micro/nano-size UAV; power budgets for pico-size UAV [1]]

If we want to bring the advanced cognitive skills of state-of-the-art systems into the next generation of autonomous vehicles → energy-efficient algorithms are key

We look into parallelism + near-optimality as the key solution to meet the energy requirements

[1] R.J. Wood, B. Finio, M. Karpelson, K. Ma, N.O. Perez-Arancibia, P.S. Sreetharan, H. Tanaka, and J.P. Whitney. Progress on "pico" air vehicles. Int. Symp. on Robotics Research (invited paper), Flagstaff, AZ, Aug. 2011.

SLIDE 5

Outline

Path Planning Application

Graph computation and exploration

Naive approximate and Atomic version

Profile-based version

Limitations of the Naive Approach

Experimental Evaluation

System Characterization

Experimental Results

The Predictable Execution Model (PREM)

SLIDE 6

Path Planning Application

Path Planning:

constantly updates the route of the vehicle based on information sensed in real time

selects the best path (according to specific metrics)

is responsible for preventing collisions with dynamic, unexpected obstacles

the reactivity of the UAV depends on the path planner response time

SLIDES 7-19

Graph Computation

Quadrotor Automaton [1]

Represents the kinematics and the constraints of the robot

Map Automaton

Represents locations, possible connections and their constraints: obstacles

Sequence of movements: go_45 - go_45 - go_45

Obstacle detected in 2-3

Sequence of movements after the obstacle is detected: go_0 - go_0 - go_90 - go_90

[1] M. Furci, A. Paoli, and R. Naldi. A supervisory control strategy for robot-assisted search and rescue in hostile environments. In Emerging Technologies & Factory Automation (ETFA), 2013 IEEE 18th Conference on, pages 1–4, Sept. 2013.

SLIDE 20

Graph Computation

Automaton Synchronous Composition

The topological information and the kinematics of the vehicle are fused into a single, bigger graph

[Figure: Quadrotor Automaton and Map Automaton combined into the Composition Automaton]
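The composition step can be illustrated with a minimal sketch. It assumes that both automata are stored as dictionaries of the form {state: {event: next_state}} and that they share the whole event alphabet, so an event is enabled in a composed state only when both components enable it. The tiny quadrotor and map automata, the heading names (h0, h45, h90) and the helper names below are hypothetical and are not taken from the slides.

```python
def compose(a, b):
    """Synchronous composition of two automata given as {state: {event: next_state}}.

    Assumption: both automata share the whole event alphabet, so an event is
    enabled in a composed state only if it is enabled in both components.
    """
    return {
        (sa, sb): {
            event: (next_a, b[sb][event])
            for event, next_a in a[sa].items()
            if event in b[sb]
        }
        for sa in a
        for sb in b
    }


def without_cell(map_aut, blocked):
    """Copy of the map automaton with every transition into `blocked` removed."""
    return {
        cell: {e: nxt for e, nxt in moves.items() if nxt != blocked}
        for cell, moves in map_aut.items()
    }


# Hypothetical quadrotor automaton: the heading may change by at most 45 degrees per move.
QUAD = {
    "h0": {"go_0": "h0", "go_45": "h45"},
    "h45": {"go_0": "h0", "go_45": "h45", "go_90": "h90"},
    "h90": {"go_45": "h45", "go_90": "h90"},
}

# Hypothetical 2x2 map: go_0 moves along x, go_90 along y, go_45 diagonally.
MAP = {
    (0, 0): {"go_0": (1, 0), "go_45": (1, 1), "go_90": (0, 1)},
    (1, 0): {"go_90": (1, 1)},
    (0, 1): {"go_0": (1, 1)},
    (1, 1): {},
}

COMPOSED = compose(QUAD, MAP)
# An obstacle in cell (1, 1) is handled by updating the map and composing again.
COMPOSED_WITH_OBSTACLE = compose(QUAD, without_cell(MAP, (1, 1)))
```

In this sketch an obstacle only edits the map automaton; a real implementation would more likely update the composed graph in place rather than recompute it.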

SLIDE 21

Graph Computation

Obstacles and Safety

As soon as an obstacle is detected, the Automaton Composition graph is updated

SLIDE 22

Graph Exploration

Single Source Shortest Path problem with non-negative weights (SSSP)

Problem: find a path between two vertices (V) in a graph so that the sum of the weights (W) of its constituent edges (E) is minimized.

Naive implementation:

Near-optimal parallel implementation of the Dijkstra algorithm (global optimality)

Lock-free updates of the cost of a central node

Race conditions allowed to boost performance

The near-optimality never affects the safety of the mission

Naive-Atomic variant to prevent races (atomicMin)
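The relaxation scheme behind the Naive-Atomic variant can be emulated sequentially. This is a rough sketch under stated assumptions, not the authors' GPU kernel: the graph is a dictionary {vertex: [(neighbour, weight), ...]}, every vertex stands in for one GPU thread, and the comparison-and-store below plays the role that atomicMin would play on the device. The function names relax_all and sssp are hypothetical.

```python
import math


def relax_all(graph, cost):
    """One sweep: every vertex with a finite cost relaxes its outgoing edges.

    On a GPU each vertex would be handled by one thread; the guarded store
    below stands in for atomicMin(&cost[v], new_cost).
    """
    changed = False
    for u, edges in graph.items():
        if cost[u] == math.inf:
            continue
        for v, weight in edges:
            new_cost = cost[u] + weight
            if new_cost < cost[v]:
                cost[v] = new_cost
                changed = True
    return changed


def sssp(graph, source):
    """Repeat sweeps until no cost changes (non-negative weights assumed)."""
    cost = {v: math.inf for v in graph}
    cost[source] = 0
    while relax_all(graph, cost):
        pass
    return cost


EXAMPLE_GRAPH = {0: [(1, 1), (2, 4)], 1: [(2, 1)], 2: []}
EXAMPLE_COST = sssp(EXAMPLE_GRAPH, 0)
```

Because this emulation is sequential it is always exact; the near-optimality discussed on the slide comes from races between concurrent threads, which the sketch does not reproduce.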


SLIDES 24-28

Naive Path Planner: Limitations

Poor usage of the computational power due to sparse workload distribution

Naive implementation:

high synchronization cost (fine-grained)

sparse workload

Working threads spread among multiple warps

it requires more iterations than an optimized solution

SLIDES 29-30

Profile-based Path Planner

To overcome the limitations of the Naive implementation we introduce a profile-based version

We introduce the concept of exploration frontiers:

an enumeration of sets of vertices F, where all vertices in F_n have been visited from at least one vertex in F_m, for any m: 0 < m < n

expose dense, parallel workloads

allow for a coarser synchronization scheme

Frontiers are defined during an off-line, profile-based preprocessing
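One plausible way to derive and use frontiers is sketched below. The slides do not spell out how frontiers are built, so the rule used here is an assumption: a sequential Dijkstra run profiles the static map, and a vertex is assigned to the frontier given by its depth in the resulting shortest-path tree. The on-line step then visits one frontier at a time, which needs a single synchronization per frontier. The names profile_frontiers and explore are hypothetical.

```python
import heapq
import math


def profile_frontiers(graph, source):
    """Off-line step: sequential Dijkstra on the static map.

    Assumption: frontier index = depth of the vertex in the shortest-path tree.
    """
    cost = {v: math.inf for v in graph}
    cost[source] = 0
    depth = {source: 0}
    heap = [(0, source)]
    done = set()
    while heap:
        c, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        for v, weight in graph[u]:
            if c + weight < cost[v]:
                cost[v] = c + weight
                depth[v] = depth[u] + 1
                heapq.heappush(heap, (cost[v], v))
    frontiers = {}
    for v, d in depth.items():
        frontiers.setdefault(d, []).append(v)
    return [frontiers[d] for d in sorted(frontiers)]


def explore(graph, source, frontiers):
    """On-line step: visit the frontiers in order."""
    cost = {v: math.inf for v in graph}
    cost[source] = 0
    for frontier in frontiers:      # one coarse synchronization per frontier
        for u in frontier:          # conceptually one thread per vertex
            for v, weight in graph[u]:
                cost[v] = min(cost[v], cost[u] + weight)
    return cost


GRAPH = {0: [(1, 1), (2, 1)], 1: [(3, 1)], 2: [(3, 2)], 3: [(4, 1)], 4: []}
FRONTIERS = profile_frontiers(GRAPH, 0)
COSTS = explore(GRAPH, 0, FRONTIERS)
```

On an unchanged map this ordering reproduces the Dijkstra costs, because every vertex is relaxed by its tree predecessor one frontier earlier. Dynamic obstacles can invalidate the ordering, and the sketch does not model that case.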

SLIDES 31-32

Profile-based Path Planner

Software architecture: Preprocessing + 2-phase Near-Optimal Exploration

Flow: Profile-based Preprocessing → Frontiers Array, Transition Matrix → Parallel exploration (1st phase) → Deferred Array → IF Deferred Array > 0 → Parallel exploration (2nd phase)

Off-line (ahead-of-time):

Profile-based Preprocessing

Input: static map snapshot

Sequential Dijkstra

On-line:

Parallel exploration (1st phase)

Near-optimal exploration of frontiers

Deferred node exploration (to the 2nd phase) due to dynamic obstacles

Parallel exploration (2nd phase)

Conditional phase

Small instance of the Naive version, exploring only the deferred nodes

SLIDES 33-37

Profile-based Path Planner

Profile-based approach increases thread usage through frontiers

Profile-based implementation:

low synchronization cost (coarse-grained)

frontiers force dense workload

it requires a lower number of iterations than the Naive version

SLIDES 38-51

Profile-based Path Planner

Dynamic obstacles might alter the visit order defined by the frontiers

Profile-based exploration (no obstacles)

On-line exploration (dynamic obstacles)

[Figure labels: No predecessor/cost; Not yet explored]


SLIDES 53-55

Experimental Setup

Naive near-optimal parallel version (Naive)

Naive parallel with atomic intrinsic atomicMin (Naive-Atomic)

Offline profiling strategy → different frontiers

Sequential Dijkstra, fetching at each iteration the neighbor with the min cost first (Prof-Min)

Sequential Dijkstra, fetching at each iteration the neighbor with the max cost first (Prof-Max)

Fine-grained locking to prevent race conditions (Prof-Min+Lock)

System configuration:

NVidia Tegra TX1, a many-core SoC featuring a 4-core ARM Cortex A57 and a Maxwell GPU

1024 CUDA threads (max within the same block)

Vehicle speed of 4 m/s and minimum obstacle detection distance of 1 meter → 250 ms [1]

Vehicle speed of 20 m/s and minimum obstacle detection distance of 1 meter → 50 ms [2]

[1] Daniele Palossi, Michele Furci, Roberto Naldi, Andrea Marongiu, Lorenzo Marconi, and Luca Benini: An energy-efficient parallel algorithm for real-time near-optimal UAV path planning. Computing Frontiers 2016.

[2] DJI Phantom 4: https://www.dji.com/phantom-4/info
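The two real-time bounds follow from dividing the minimum obstacle detection distance by the vehicle speed. The helper below is only a worked version of that arithmetic; its name and parameters are hypothetical.

```python
def planning_deadline_ms(speed_m_per_s, detection_distance_m):
    """Time available to replan before reaching an obstacle, in milliseconds."""
    return 1000.0 * detection_distance_m / speed_m_per_s


DEADLINE_SLOW_MS = planning_deadline_ms(4, 1)    # 1 m at 4 m/s
DEADLINE_FAST_MS = planning_deadline_ms(20, 1)   # 1 m at 20 m/s
```

With the values from the slide this gives 250 ms at 4 m/s and 50 ms at 20 m/s.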

SLIDE 56

Speedup

Speedup vs. Sequential (1 ARM Cortex A57) -- 20% obstacles -- 4 map sizes

SLIDES 57-60

Performance vs. Path Optimality

Performance: 4 obstacle configurations -- 4 map sizes

Performance is inversely proportional to the number of obstacles, due to the 2nd exploration phase

[Plot annotations: 5x, 2x, 3x, 7x]

Real-time upper bounds: 4 m/s → 250 ms, 20 m/s → 50 ms

Upper bounds: only Prof-Min is capable of avoiding obstacles when flying at 20 m/s

SLIDES 61-64

Performance vs. Path Optimality

Path optimality: 4 obstacle configurations -- 4 map sizes

Path optimality: the 2nd exploration phase is a new source of inaccuracy

[Plot annotations: 0.3% vs. 0%, 0.1% vs. 4.5%, 0% vs. 4%, 0.2% vs. 2.5%]

Prof-Max: insertion of a node multiple times in different frontiers → optimal path

Locks: race conditions are negligible in Prof-Min

SLIDE 65

Discussion

In the Naive: more obstacles → lower error

The fewer feasible paths there are → the closer to the optimal path we get (higher probability that Naive selects the optimal path)

In the Prof-Min, two opposing effects: more obstacles → lower error vs. more obstacles → higher error

Lower error: same as for the Naive

Higher error: the 2nd phase explores the deferred nodes only once, and the updated costs are not propagated to other nodes already visited in the 1st phase

Result: for 50% obstacles the highest error for Prof-Min is ≈ 0.5% (100×100)

The off-line profiling can be performed periodically in the background on the host


SLIDES 67-68

The Predictable Execution Model (PREM)

Shared DRAM

Strong push for a unified memory model in heterogeneous SoCs

Optimized to reduce performance loss

Good for programmability

How about predictability?

How large can the interference in execution time between the two subsystems be?

Rodinia benchmarks, executing on both the GPU and the CPU, show:

up to 2.5x slow-down on CPU execution under mutual interference

up to 33x slow-down on GPU execution under mutual interference

SLIDES 69-70

The Predictable Execution Model (PREM)

Predictable interval

Memory prefetching in the first phase

No cache misses in the execution phase

Non-preemptive execution

System-wide co-scheduling of memory phases from multiple actors

Requires compiler support for code re-structuring

Requires runtime techniques for global memory arbitration

Originally proposed for (multi-core) CPU. We study the applicability of this idea to heterogeneous SoCs
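The structure of a predictable interval can be pictured with a toy sketch. It assumes the computation is split into tiles, each processed as a memory phase that copies the tile into a local buffer followed by an execution phase that touches only that buffer. The function name, the sum-of-squares workload and the list standing in for a scratchpad are hypothetical; the scheduling and arbitration aspects of PREM are not modelled.

```python
def prem_sum_of_squares(data, tile_size):
    """Toy PREM-style loop: one memory phase and one execution phase per tile."""
    total = 0
    for start in range(0, len(data), tile_size):
        # Memory phase: the only place where "main memory" (data) is read.
        local = list(data[start:start + tile_size])
        # Execution phase: works on the local copy only, so it issues no
        # further accesses to main memory.
        total += sum(x * x for x in local)
    return total


PREM_RESULT = prem_sum_of_squares(list(range(10)), 4)
```

The result is the same as for an untiled loop; the point of the restructuring is that main-memory accesses are confined to phases that a system-wide scheduler could arbitrate.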

SLIDES 71-72

A heterogeneous variant of PREM

Current focus on GPU behavior (far more severely affected by interference than the CPU)

SPM as a predictable, local memory

Implement PREM phases within a single offload

Arbitration of main memory accesses via timed interrupts + shared memory

Rely on high-level constructs for offloading

Loop tiling

[Figure: SoC with a CPU complex (cores, I$, shared LLC), a GPU complex (clusters of cores with L1 scratchpads), a memory controller (MC) and shared off-chip DRAM]

[3] Björn Forsberg, Andrea Marongiu, and Luca Benini: GPUguard: towards supporting a predictable execution model for heterogeneous SoC. DATE 2017

SLIDES 73-74

PREM Evaluation: Path Planner

Increased instruction count for specialization and/or tiling: compiler optimizations possible

Overhead due to GPU idleness: synchronization scheme, property of the workload

[4] Björn Forsberg, Daniele Palossi, Andrea Marongiu, Luca Benini: GPU-Accelerated Real-Time Path Planning and the Predictable Execution Model. ICCS 2017

SLIDE 75

PREM Evaluation: Path Planner

WCET

  • Near zero variance
  • 3X reduction in WCET

[4] Björn Forsberg, Daniele Palossi, Andrea Marongiu, Luca Benini: GPU-Accelerated Real-Time Path Planning and the Predictable Execution Model. ICCS 2017

SLIDE 76

Conclusion

Parallelism and near-optimality to boost UAV energy efficiency/performance

Efficient use of embedded GPU

PREM techniques to guarantee predictable timing behavior

Achievements:

Profile-based version ~1000x faster than sequential and ~7x faster than Naive

Loss in accuracy limited to ~5% and never affecting the safety of the mission

PREM gives us near-zero variance and a 3x reduction in WCET

SLIDE 77

Thank you for your attention.

Questions?

SLIDE 78

Backup: Memory Footprint

For Prof-Min, the memory increase is a linear function of the map size

It introduces negligible overhead for the considered problem instance

SLIDE 79

Backup: Data Packing

We pack two pieces of information into one 32-bit integer: each entry of the cost array (cost) contains both the cost and the predecessor id. The cost occupies the left-most bits, so comparisons are decided by the cost.

tid ← get_global_id(0)   /* thread id */
for each neighbour n of tid:
    new_cost ← ((cost[tid] >> 20) + cost(tid, n)) << 20 | tid
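The packing scheme can be tried out with a small sketch. It assumes the layout implied by the shift of 20 in the pseudocode: the predecessor id sits in the lower 20 bits and the cost in the remaining upper 12 bits of a 32-bit word. The helper names pack, unpack and relax are hypothetical, and Python's min stands in for a single atomic minimum on the packed word.

```python
PRED_BITS = 20                                # lower bits: predecessor id
PRED_MASK = (1 << PRED_BITS) - 1
MAX_COST = (1 << (32 - PRED_BITS)) - 1        # upper 12 bits: cost


def pack(cost, pred):
    """Pack cost (high bits) and predecessor id (low bits) into one 32-bit word."""
    assert 0 <= cost <= MAX_COST and 0 <= pred <= PRED_MASK
    return (cost << PRED_BITS) | pred


def unpack(word):
    """Return (cost, predecessor id) stored in a packed word."""
    return word >> PRED_BITS, word & PRED_MASK


def relax(packed_cost, u, v, edge_cost):
    """Relax edge u -> v; one minimum updates cost and predecessor together."""
    candidate = pack((packed_cost[u] >> PRED_BITS) + edge_cost, u)
    packed_cost[v] = min(packed_cost[v], candidate)


PACKED = [pack(0, 0), pack(MAX_COST, 0)]      # vertex 0 is the source
relax(PACKED, 0, 1, 5)
```

Because the cost sits in the most significant bits, comparing two packed words orders them by cost first; the predecessor id only breaks ties between equal costs.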