NeuOS: A Latency-Predictable Multi-Dimensional Optimization Framework for DNN-driven Autonomous Systems


slide-1
SLIDE 1

NeuOS: A Latency-Predictable Multi-Dimensional Optimization Framework for DNN-driven Autonomous Systems

Soroush Bateni

The University of Texas at Dallas

Cong Liu

The University of Texas at Dallas

slide-2
SLIDE 2

Deep Neural Networks (DNNs) Autonomous Embedded Systems

The tale of two worlds

Background

2

[Figure: DNN pipeline (Σ layers, FC layers, SoftMax) producing an autonomous decision]

slide-3
SLIDE 3

Deep Neural Networks (DNNs) Autonomous Embedded Systems

The tale of two worlds

Background

3

[Figure: DNN pipeline (Σ layers, FC layers, SoftMax) producing an autonomous decision]

Main Objective

  • Maximum Accuracy

Main Objectives

  • Timing predictability
  • Energy efficiency
  • Safety
slide-4
SLIDE 4

Deep Neural Networks (DNNs) Autonomous Embedded Systems

Marriage between the two worlds

Background

4

[Figure: DNN pipeline (Σ layers, FC layers, SoftMax) producing an autonomous decision]

slide-5
SLIDE 5

The big picture

Hardware/software stack for executing DNNs in Autonomous Embedded Systems

Background

5

DNN

Framework/OS

slide-6
SLIDE 6

The big picture

Hardware/software stack for executing DNNs in Autonomous Embedded Systems

Background

6

DNN

The focus of related research in AES is currently mostly on the DNN and the hardware.

Framework/OS

slide-7
SLIDE 7

The big picture

Hardware/software stack for executing DNNs in Autonomous Embedded Systems

Background

7

DNN

Efficient DNNs

  • Quantization
  • Low-rank approximation

Framework/OS

slide-8
SLIDE 8

The big picture

Hardware/software stack for executing DNNs in Autonomous Embedded Systems

Background

8

DNN

Special Processors

  • AI accelerators
  • DNN-focused SoCs

Framework/OS

slide-9
SLIDE 9

Where system software/frameworks can help

Goals

9

DNN

Challenges

  • Meet timing requirements
  • Be energy efficient
  • Minimize accuracy loss.

All the above goals must be achieved at the same time.

Framework/OS

slide-10
SLIDE 10

Timing predictable & energy efficient

Can be achieved at system level via Dynamic Voltage Frequency Scaling (DVFS).

Timing predictable & accurate

Can be achieved at application level via DNN configuration change.

Master of none

Combining the two (even at different rates) will yield unpredictable results.

Jack of all trades, master of none

Motivation

10

slide-12
SLIDE 12

Jack of all trades, master of none

Motivation

12

Need per-layer adjustments. Need per-layer adjustments. Need coordination.

slide-13
SLIDE 13

No one is alone

Multiple ResNet-50 instances executed together

The underlying system-level solution here is PredJoule1

Takeaways

The first DNN instance wins; the other instances are not as lucky, because the method used here is greedy: the chosen DVFS configurations only work well for the first DNN instance.

Motivation

1Bateni, Soroush, Husheng Zhou, Yuankun Zhu, and Cong Liu. "PredJoule: A timing-predictable energy optimization framework for deep neural networks." In 2018 IEEE Real-Time Systems Symposium (RTSS).

13

slide-14
SLIDE 14

No one is alone

Motivation

14

Need cross-DNN coordination.

slide-15
SLIDE 15

Core Targets

  • Timing predictable: the system must meet deadlines set by the system designer for the DNN.
  • Energy efficient: the system must use DVFS to achieve near-optimal energy usage for DNNs.
  • Accurate: the system can change accuracy dynamically but must do so cautiously.
  • Multi-DNN compatibility: the system should be able to coordinate and find an efficient solution for all DNN instances.

Optimization Targets

The system must also be flexible enough to adapt to different system constraints. We offer three optimization targets (switchable by an external policy controller):

  • Min Energy (Mp) is used when our design is deployed in extremely low-power scenarios such as remote sensing.
  • Max Accuracy (MA) is used when our design is deployed in extremely mission-critical scenarios.
  • Balanced Energy and Accuracy is the scenario where our design can choose what is best given the timing requirement.

Design Goals

Design

15

slide-16
SLIDE 16

LAG analysis

  • Keep track of per-layer progress

Proportional Deadline

  • Build an ideal schedule by setting per-layer sub-deadlines in proportion to each layer's execution time

Timing predictability

Design

16

[Equation: accumulative LAG computed from per-layer sub-deadlines and tracked per-layer execution times, relative to the end-to-end deadline for the DNN instance]
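The proportional sub-deadline and LAG bookkeeping described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the layer times and deadline are made-up numbers:

```python
def proportional_subdeadlines(layer_times, deadline):
    """Split an end-to-end deadline into per-layer sub-deadlines
    in proportion to each layer's profiled execution time."""
    total = sum(layer_times)
    return [deadline * t / total for t in layer_times]

def lag(subdeadlines, tracked_times, upto):
    """Accumulative LAG after layer `upto`: ideal progress (sum of
    sub-deadlines) minus actual tracked execution time. Positive LAG
    means the instance is ahead of schedule; negative means behind."""
    return sum(subdeadlines[:upto]) - sum(tracked_times[:upto])

# Example: a 4-layer DNN with a 100 ms end-to-end deadline.
subs = proportional_subdeadlines([10.0, 30.0, 40.0, 20.0], 100.0)
print(lag(subs, [12.0, 33.0], 2))  # behind by 5 ms -> -5.0
```

With this bookkeeping, a negative LAG after any layer signals that the remaining layers need a speedup (e.g., a higher DVFS state) to still meet the end-to-end deadline.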

slide-17
SLIDE 17

Building a cohort

We keep a pair of local variables for each DNN instance.

∆ Calculator

1. Based on the last reported values of LAG in the cohort, calculate a speedup (or slowdown).
2. Look up1 the best possible DVFS configuration for that slowdown.
3. The output is a list (∆) of optimal DVFS configurations for each DNN instance.

Xi Calculator

1. For each element of ∆, calculate the required (further) speedup (or slowdown) for the other DNN instances.
2. This time, look up1 the best possible approximation configuration that matches that slowdown.

1Please see the paper and the source code for more information.

Coordination

Design

17
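A highly simplified sketch of the ∆ Calculator's first two steps. The table contents, configuration names, and lookup granularity are illustrative assumptions; the real tables are profiled per device (see the paper and source code):

```python
# Hypothetical SpeedUp table: maps an achievable speedup factor to a
# DVFS configuration (cpu state, gpu state). These entries are made up.
DVFS_TABLE = {
    0.8: ("cpu_low", "gpu_low"),
    1.0: ("cpu_mid", "gpu_mid"),
    1.5: ("cpu_high", "gpu_high"),
}

def required_speedup(lag_ms, remaining_budget_ms):
    """Speedup (>1) or slowdown (<1) needed so the remaining layers
    fit the remaining budget, given the accumulated LAG."""
    return remaining_budget_ms / (remaining_budget_ms + lag_ms)

def delta_calculator(lags, budgets):
    """For each DNN instance in the cohort, pick the smallest tabled
    DVFS configuration whose speedup covers the required one."""
    deltas = []
    for lag_ms, budget in zip(lags, budgets):
        need = required_speedup(lag_ms, budget)
        candidates = [s for s in sorted(DVFS_TABLE) if s >= need]
        chosen = candidates[0] if candidates else max(DVFS_TABLE)
        deltas.append(DVFS_TABLE[chosen])
    return deltas

# One instance behind by 5 ms, one ahead by 10 ms, 50 ms budget each.
print(delta_calculator([-5.0, 10.0], [50.0, 50.0]))
# -> [('cpu_high', 'gpu_high'), ('cpu_mid', 'gpu_mid')]
```

The instance that is behind gets a faster DVFS state; the instance that is ahead can afford a slower (more energy-efficient) one, which is the cohort-wide coordination the greedy per-instance approach lacks.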

slide-18
SLIDE 18

The decision tree: overview of modes

Optimization

Design

18

[Diagram: DNN instances 1…n in the cohort report δ1…δn to the ∆ Calculator, which outputs ∆ = (TB1, TB2, …, TBo); each entry feeds a Xi Calculator.]

slide-19
SLIDE 19

The decision tree: overview of modes

Choosing a δ (DVFS configuration) will have consequences in terms of accuracy for all DNNs in the cohort. Therefore, the question is, which δ is the best?

Min. Energy (Mp) chooses the δ that has the least PowerUp value in the PowerUp/SpeedUp table, without looking at accuracy loss.

  • Max. Accuracy (MA) chooses the δ so as to minimize the value of Σ∀εj TBj.
  • Balanced Energy and Accuracy uses Bivariate Regression Analysis (BRA) to achieve a balanced approach backed by statistical analysis of the tree1.

1Please see the paper for more information.

Optimization

Design

19

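The three modes can be sketched as different scoring functions over candidate δ entries. The fields, values, and the weighted score standing in for BRA are illustrative assumptions, not the paper's tables or analysis:

```python
# Hypothetical candidate δ entries with their PowerUp value and the
# total accuracy loss they would impose on the cohort (made-up numbers).
CANDIDATES = [
    {"delta": "d1", "power_up": 1.1, "accuracy_loss": 0.05},
    {"delta": "d2", "power_up": 1.4, "accuracy_loss": 0.01},
    {"delta": "d3", "power_up": 1.2, "accuracy_loss": 0.03},
]

def choose(mode, candidates):
    if mode == "min_energy":    # Mp: least PowerUp, ignore accuracy loss
        return min(candidates, key=lambda c: c["power_up"])
    if mode == "max_accuracy":  # MA: least total accuracy loss
        return min(candidates, key=lambda c: c["accuracy_loss"])
    # Balanced: a toy weighted score standing in for the paper's BRA.
    return min(candidates, key=lambda c: c["power_up"] + 10 * c["accuracy_loss"])

print(choose("min_energy", CANDIDATES)["delta"])    # d1
print(choose("max_accuracy", CANDIDATES)["delta"])  # d2
```

Swapping the scoring function while keeping the same candidate set is what lets an external policy controller switch modes at runtime without touching the rest of the pipeline.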

slide-20
SLIDE 20

Based on Caffe

  • Available as an open-source project on GitHub

  • No need to use APIs
  • No need to redesign DNN models
  • Need to generate hash tables and a low-rank approximated version of your DNN model

Tested extensively

  • Tested on NVIDIA Jetson TX2 and Jetson AGX Xavier

  • Tested using image recognition DNNs
  • AlexNet, GoogleNet, ResNet-50, VGGNet
  • Tested using three cohort sizes
  • Small: 1 DNN instance
  • Medium: 2-4 DNN instances
  • Large: 6-8 DNN instances
  • We include a mixed scenario that uses a combination of all the DNN models

Overview

Implementation and Evaluation

20

slide-21
SLIDE 21

Energy

Evaluation

21

  • 68% avg. improvement on TX2
  • 46% avg. improvement on AGX Xavier
  • 70% avg. improvement on TX2

slide-22
SLIDE 22

Energy

Evaluation

22

slide-23
SLIDE 23

Latency

Evaluation

23

slide-24
SLIDE 24

Latency

Evaluation

24

  • 68% avg. improvement on TX2
  • 40% avg. improvement on AGX Xavier
  • 53% avg. improvement on TX2
  • 32% avg. improvement on AGX Xavier

slide-25
SLIDE 25

Small cohort

3.25% deadline miss rate.

Medium cohort

Deadline miss rate same as the small cohort.

Large cohort

Deadline miss rate same as the small cohort.

Tail Latency

Evaluation

25

slide-26
SLIDE 26

Relative Accuracy

Evaluation

26

slide-27
SLIDE 27

Flexibility

Evaluation

27

slide-28
SLIDE 28

Flexibility

Evaluation

28

11,759 DVFS configurations on Jetson TX2.

51,967 DVFS configurations on Jetson AGX Xavier.
slide-29
SLIDE 29

Computation

Relatively negligible execution overhead (in ms).

Memory

Overhead includes the low-rank version of each DNN model. The right side shows how much of the total memory of each device is occupied.

Overhead

Evaluation

29

slide-30
SLIDE 30

The system community to the rescue

Conclusion

30

  • Certain problems cannot be solved at the application level (by AI researchers) and at the hardware level separately

  • Ensuring timing predictability, energy efficiency, and accuracy for DNNs in Autonomous Embedded Systems requires coordination
  • We presented the design of NeuOS that can achieve these three goals by
  • Using LAG analysis to ensure real-time performance
  • Efficiently propagating all possible choices
  • Having flexibility in terms of choosing the best combination of configurations based on the system designer's criteria or an external policy controller

  • We extensively evaluated NeuOS
  • Using the latest AES devices
  • Using prominent image recognition DNNs
  • Under multiple configurations, including various cohort sizes
  • Against the most prominent accessible solutions available to researchers.
slide-31
SLIDE 31

Questions

Please do not hesitate to send your questions to soroush@utdallas.edu.

Source Code

https://github.com/Soroosh129/NeuOS

Thank you