Making OpenVX Really “Real Time”


SLIDE 1

Making OpenVX Really “Real Time”

Ming Yang¹, Tanya Amert¹, Kecheng Yang¹,², Nathan Otterness¹, James H. Anderson¹, F. Donelson Smith¹, and Shige Wang³

¹The University of North Carolina at Chapel Hill  ²Texas State University  ³General Motors Research

SLIDE 2

700 ms: the time an alert driver takes to react.

SLIDE 3
SLIDE 4

A new approach for graph scheduling

SLIDE 5

Shorter response time + Less capacity loss

SLIDE 6
  • 1. State of the art
  • 2. Our approach
  • 3. Future work


SLIDE 7

Source: https://www.khronos.org/openvx/

[Figure: an example OpenVX graph, a chain of OpenVX nodes running between native camera control and downstream application processing, with applications mapped onto diverse hardware (GPU, FPGA, DSP).]

Graph-based architecture
Portability to diverse hardware

Does OpenVX really target “real-time” processing?
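As a concrete illustration of the graph-based model, such a graph can be represented as a plain DAG and executed in dependency order. This is a minimal Python sketch with hypothetical node names, not the OpenVX C API (which is built around vx_graph objects):

```python
# Minimal sketch of an OpenVX-style graph as a DAG (hypothetical node
# names, not the OpenVX C API). Each edge points from a producer node
# to a consumer node.
graph = {
    "CameraInput": ["Blur"],
    "Blur": ["Gradient"],
    "Gradient": ["Threshold"],
    "Threshold": [],  # final output for downstream application processing
}

def topological_order(dag):
    """Return the nodes in an order that respects every producer/consumer edge."""
    indegree = {n: 0 for n in dag}
    for succs in dag.values():
        for s in succs:
            indegree[s] += 1
    ready = [n for n, d in indegree.items() if d == 0]
    order = []
    while ready:
        node = ready.pop()
        order.append(node)
        for s in dag[node]:
            indegree[s] -= 1
            if indegree[s] == 0:
                ready.append(s)
    return order

print(topological_order(graph))  # ['CameraInput', 'Blur', 'Gradient', 'Threshold']
```

A graph runtime is free to run independent nodes in parallel; a topological order is only the simplest legal schedule.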

SLIDE 8

  • 1. It lacks real-time concepts
  • 2. Entire graphs = monolithic schedulable entities

Does OpenVX really target “real-time” processing?

SLIDE 9

  • 1. It lacks real-time concepts
  • 2. Entire graphs = monolithic schedulable entities

[Figure: an example four-node graph with nodes A, B, C, and D.]

Does OpenVX really target “real-time” processing?

SLIDE 10

  • 1. It lacks real-time concepts
  • 2. Entire graphs = monolithic schedulable entities

[Figure: under monolithic scheduling, the entire four-node graph (A, B, C, D) executes as a single schedulable entity on the timeline.]

Does OpenVX really target “real-time” processing?

SLIDE 11

Prior Work

  • OpenVX nodes = schedulable entities [23, 51]


Coarse-grained scheduling

[Figure: under coarse-grained scheduling, the four-node graph is split into Tasks A-D, each scheduled as its own entity on the timeline.]

SLIDE 12

Prior Work

  • OpenVX nodes = schedulable entities [23, 51]


Coarse-grained scheduling

Remaining problems:
  • 1. More parallelism could still be exploited
  • 2. Suspension-oblivious analysis was applied, causing capacity loss

SLIDE 13

Fine-Grained Scheduling

This Work

SLIDE 14


1. Coarse-grained vs. fine-grained
2. Response-time bounds analysis
3. Case study

SLIDE 15


1. Coarse-grained vs. fine-grained
2. Response-time bounds analysis
3. Case study

SLIDE 16

Coarse-Grained Scheduling


[Figure: two schedules of the same graph. Under coarse-grained scheduling, Tasks A-D run as CPU tasks that self-suspend during GPU execution. Under fine-grained scheduling, the GPU work is split off into separate tasks E, F, and G, leaving CPU tasks A, C, and D.]

Fine-Grained Scheduling
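The difference between the two schedulings can be sketched as a graph transformation: a coarse-grained node with an embedded GPU phase is split into a CPU subtask and a GPU subtask joined by a precedence edge, so the CPU is not held while the GPU runs. A minimal Python sketch; the node name and costs (in ms) are hypothetical illustrative values, not measurements from the paper:

```python
# Sketch: turning a coarse-grained node with an embedded GPU phase into
# fine-grained CPU and GPU subtasks. The name and the costs (ms) are
# hypothetical illustrative values.
coarse_node = {"name": "B", "cpu_ms": 4, "gpu_ms": 6}

def split_node(node):
    """Emit a CPU subtask and a GPU subtask for one node, plus a
    precedence edge so the GPU part starts only after the CPU part."""
    cpu = (node["name"] + "_cpu", node["cpu_ms"], "CPU")
    gpu = (node["name"] + "_gpu", node["gpu_ms"], "GPU")
    return [cpu, gpu], [(cpu[0], gpu[0])]

subtasks, precedence = split_node(coarse_node)
print(subtasks)    # [('B_cpu', 4, 'CPU'), ('B_gpu', 6, 'GPU')]
print(precedence)  # [('B_cpu', 'B_gpu')]
```

Under coarse-grained scheduling the 6 ms GPU phase would be a self-suspension inside task B; here it becomes an independently schedulable GPU task, avoiding the capacity loss of suspension-oblivious analysis.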

SLIDE 17


1. Coarse-grained vs. fine-grained
2. Response-time bounds analysis
3. Case study

SLIDE 18

Deriving Response-Time Bounds for a DAG*

Step 1: Schedule the nodes as sporadic tasks
Step 2: Compute bounds for every node
Step 3: Sum the bounds of nodes on the critical path

* C. Liu and J. Anderson, “Supporting Soft Real-Time DAG-based Systems on Multiprocessors with No Utilization Loss,” in RTSS, 2013.
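Step 3 can be made concrete with a small computation: given a per-node response-time bound for each node (hypothetical numbers standing in for what Steps 1 and 2 would produce), the end-to-end bound is the largest sum of bounds over any source-to-sink path. A Python sketch over a six-node example DAG:

```python
from functools import lru_cache

# Hypothetical per-node response-time bounds (ms); in the real analysis
# these come from Steps 1 and 2 (sporadic-task response-time bounds).
bounds_ms = {"A": 10, "B": 12, "C": 8, "D": 9, "E": 11, "F": 7}
# Edges of the example DAG (producer -> consumers).
edges = {"A": ["B", "C"], "B": ["D"], "C": ["D", "E"], "D": ["F"], "E": ["F"], "F": []}

@lru_cache(maxsize=None)
def end_to_end_bound(node):
    """Largest sum of per-node bounds over any path from `node` to a sink
    (Step 3: sum the bounds along the critical path)."""
    tail = max((end_to_end_bound(s) for s in edges[node]), default=0)
    return bounds_ms[node] + tail

print(end_to_end_bound("A"))  # 38, via the critical path A -> B -> D -> F
```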

SLIDE 19


Deriving Response-Time Bounds for a DAG

[Figure: an example six-node DAG with nodes A-F.]

SLIDE 20


Deriving Response-Time Bounds for a DAG

[Figure: the example six-node DAG with nodes A-F.]

SLIDE 21

Deriving Response-Time Bounds for a DAG

[Figure: the example DAG with nodes A, C, and D scheduled on the CPU and nodes B, E, and F scheduled on the GPU.]

Need a response-time bound analysis for GPU tasks.

SLIDE 22

A System Model of GPU Tasks

τi = (Ci, Ti, Bi, Hi), where
  • Ci: per-block worst-case workload
  • Ti: period
  • Bi: number of blocks
  • Hi: number of threads per block (or block size)

Example: τ1 = (3, 6, 2, 1024), i.e., C1 = 3, T1 = 6, B1 = 2, and H1 = 1024.

[Figure: a schedule of τ1’s blocks on two SMs (SM0 and SM1), each SM providing 2048 threads.]
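The tuple model can be written down directly. In this Python sketch the parameter names follow the slide; the 2048-threads-per-SM capacity and the concrete values of τ1 are taken from the slide’s figure as best they can be recovered, so treat them as illustrative:

```python
from typing import NamedTuple

class GpuTask(NamedTuple):
    """GPU task model from the slide: tau_i = (C_i, T_i, B_i, H_i)."""
    C: int  # per-block worst-case workload
    T: int  # period
    B: int  # number of blocks
    H: int  # number of threads per block (block size)

SM_THREADS = 2048  # per-SM thread capacity assumed in the slide's figure

def blocks_per_sm(task: GpuTask) -> int:
    """How many of this task's blocks fit concurrently on one SM,
    considering only the thread-capacity constraint."""
    return SM_THREADS // task.H

tau1 = GpuTask(C=3, T=6, B=2, H=1024)
print(blocks_per_sm(tau1))  # 2: two 1024-thread blocks fill one 2048-thread SM
```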

SLIDE 23

Response-Time Bounds Proof Sketch

23

  • 1. We first show the necessity of a total utilization bound and intra-task parallelism via counterexamples.

SLIDE 24

Response-Time Bounds Proof Sketch


  • 1. We first show the necessity of a total utilization bound and intra-task parallelism via counterexamples.

[Figure: the same job releases 1-5 on a timeline, scheduled without and with intra-task parallelism.]

SLIDE 25

Response-Time Bounds Proof Sketch

25

  • 1. We first show the necessity of a total utilization bound and intra-task parallelism via counterexamples.
  • 2. We then bound the unfinished workload from jobs released before rk,j.
  • 3. We prove the job τk,j finishes before rk,j + Rk.

[Figure: a schedule on SM0 and SM1 showing job τk,j released at time rk,j and completing within the response-time bound Rk.]

SLIDE 26


1. Coarse-grained vs. fine-grained
2. Response-time bounds analysis
3. Case study

SLIDE 27

Case Study: Comparing Fine-Grained/Coarse-Grained/Monolithic Scheduling


  • Application: Histogram of Oriented Gradients (HOG)

[Figure: the HOG graph (Resize Image → Compute Gradients → Compute Orientation Histograms → Normalize Orientation Histograms), built from vxHOGCells and vxHOGFeatures nodes. Coarse-grained: CPU+GPU execution; fine-grained: GPU execution.]

SLIDE 28
  • Application: Histogram of Oriented Gradients (HOG)
  • 6 instances
  • 33 ms period
  • 30,000 samples
  • Platform: NVIDIA Titan V GPU + two eight-core Intel CPUs
  • Schedulers: G-EDF, G-FL (fair-lateness)

Case Study: Comparing Fine-Grained/Coarse-Grained/Monolithic Scheduling

SLIDE 29


[Figure: CDFs of response times (x-axis: time; y-axis: % of samples; left is better). 50% of samples have a response time under 60 ms.]

Case Study: Comparing Fine-Grained/Coarse-Grained/Monolithic Scheduling
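The “50% of samples under 60 ms” claim is just one point read off the CDF. A Python sketch with hypothetical sample values (the study itself used 30,000 measured samples):

```python
# Sketch: reading one point off a response-time CDF. The sample values
# here are hypothetical; the case study collected 30,000 real samples.
samples_ms = [40, 55, 58, 62, 70, 90, 48, 59, 61, 75]

def fraction_below(samples, threshold_ms):
    """Fraction of samples with response time strictly below the threshold,
    i.e. the empirical CDF evaluated just below threshold_ms."""
    return sum(1 for s in samples if s < threshold_ms) / len(samples)

print(fraction_below(samples_ms, 60))  # 0.5 -> 50% of these samples are under 60 ms
```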

SLIDE 30


FL: fair-lateness

                              [1] Fine-grained (G-FL)   [2] Coarse-grained (G-EDF)   [3] Monolithic (G-EDF)
Average Response Time (ms)    65.99                     136.57                       84669.47
Maximum Response Time (ms)    125.66                    427.07                       170091.06

Case Study: Comparing Fine-Grained/Coarse-Grained/Monolithic Scheduling

SLIDE 31


Fine-grained [1] achieves half the average response time of coarse-grained [2].

Case Study: Comparing Fine-Grained/Coarse-Grained/Monolithic Scheduling

SLIDE 32


Fine-grained [1] achieves half the average response time and one-third the maximum response time of coarse-grained [2].

Case Study: Comparing Fine-Grained/Coarse-Grained/Monolithic Scheduling

SLIDE 33


Monolithic [3] fares far worse: its average and maximum response times are orders of magnitude larger than those of [1] and [2].

Case Study: Comparing Fine-Grained/Coarse-Grained/Monolithic Scheduling

SLIDE 34


Case Study: Comparing Fine-Grained/Coarse-Grained/Monolithic Scheduling

SLIDE 35


Case Study: Comparing Fine-Grained/Coarse-Grained/Monolithic Scheduling

SLIDE 36


Case Study: Comparing Fine-Grained/Coarse-Grained/Monolithic Scheduling

SLIDE 37


                              [1] Fine-grained (G-FL)   [2] Coarse-grained (G-EDF)   [3] Monolithic (G-EDF)
Average Response Time (ms)    65.99                     136.57                       84669.47
Maximum Response Time (ms)    125.66                    427.07                       170091.06
Analytical Bound (ms)         542.39                    N/A                          N/A

Case Study: Comparing Fine-Grained/Coarse-Grained/Monolithic Scheduling

SLIDE 38


An alert driver takes 700 ms to react.

Case Study: Comparing Fine-Grained/Coarse-Grained/Monolithic Scheduling

SLIDE 39


  • The fair-lateness-based scheduler (G-FL) is beneficial: it reduced node response times by up to 9.9%.
  • The overhead of supporting fine-grained scheduling was 14.15%.

Case Study: Comparing Fine-Grained/Coarse-Grained/Monolithic Scheduling

SLIDE 40

Conclusions

  • 1. Fine-grained scheduling
  • 2. Response-time bounds analysis for GPU tasks
  • 3. Case study

SLIDE 41

Future Work

  • 1. Cycles in the graph
  • 2. Other resource constraints
  • 3. Schedulability studies


SLIDE 42

Thanks!