Ordering Chaos: Memory-Aware Scheduling for Irregularly Wired Neural Networks on Edge Devices
Byung Hoon Ahn, Jinwon Lee, Jamie Lin, Hsin-Pai Cheng, Jilei Hou, Hadi Esmaeilzadeh
Motivation: Enabling Intelligence, Transition from Cloud to Edge
Intelligence is moving from the Cloud to the Edge for Low Latency, Privacy, and Reliability.
Motivation: How to Make Deep Neural Networks More Efficient?
Motivation: Irregularly Wired Neural Networks
Randomly Wired Neural Networks [ICCV'19] and SwiftNet [ICCV-W'19]: these efficient networks comprise many irregular wirings. We classify them as Irregularly Wired Neural Networks.
Motivation: Emerging Class of DNNs for Resource Constrained Scenarios
[Figure: Top-1 ImageNet accuracy (%) vs. multiply-and-accumulate operations (billions) and vs. number of parameters (millions). Irregularly wired neural networks (NASNet, AmoebaNet, RandWire) sit above and to the left of regular-topology networks (Inception, MobileNet, ShuffleNet, Xception, ResNet-152, SENet, ResNeXt-101, PolyNet, DPN-131); top-left is better.]
A certain class of networks requires fewer resources for the same accuracy; they are, in other words, more efficient networks.
Running Example: SwiftNet (ICCV-W’19)
| Size (8-bit) | MACs | Peak Mem | Accuracy |
| 249.7KB | 57.4M | ? | 95.13% |
[Figure: SwiftNet pipeline: 224×224 input image → Conv2D → SeparableConv → SwiftNet Cell A → Cell B → Cell C → Dense → human-presence output.]
[Figure: SwiftNet dataflow graph: MaxPool2D, Conv2D, DepthwiseConv2D, Concatenation, and Add nodes operating on 1×56×56×32 and 1×28×28×{32, 48, 64, 128, 160} activation tensors.]
Running Example: SwiftNet (ICCV-W'19)
| Size (8-bit) | MACs | Peak Mem | Accuracy |
| 249.7KB | 57.4M | 800KB? | 95.13% |
Peak Memory Footprint: 800KB (exceeds the 250KB requirement).
Today's frameworks are oblivious to the peak memory footprint issue when it comes to Irregularly Wired Neural Networks.
[Figure: activation memory (KB) over the execution schedule.]
Running Example: SwiftNet (ICCV-W’19)
| Size (8-bit) | MACs | Peak Mem | Accuracy |
| 249.7KB | 57.4M | 200KB | 95.13% |
4x improvement in peak memory footprint (cf. today's TF Lite scheduler: 800KB).
[Figure: output activations resident in memory across the schedule X, 1A, 2A, 2B, 2C, 3C, 7C, 3B, 5B, 5C, 3A, 4A, 5A, 6C, 8B, 6B, 7B, 6A, 7A, 8A, Y.]
Laziness drives innovations that improve productivity
- Steven Shapiro
Happy me :)
We cannot rely on a human expert for scheduling every time.
Manual Work vs. Automation
Our Solution
Automated Solution: Serenity (Ordering Chaos)
We propose an Automated Approach that:
[Figure: Serenity overview. Input graph G → Identity Graph Rewriter → rewritten graph → Dynamic Programming-based Scheduler ⇄ Adaptive Soft Budgeting (τ, T; flag = {'no solution', 'timeout', 'solution'}) → schedule s*.]
Identity Graph Rewriter: rewrites the graph to alleviate its activation memory footprint.
Dynamic Programming-based Scheduler: finds a memory-optimal schedule for a given input graph.
Adaptive Soft Budgeting: adaptively manages a soft budget to speed up scheduling.
(1) The Dynamic Programming-based Scheduler quickly finds a memory-optimal schedule for a fixed graph.
(2) The Identity Graph Rewriter explores another dimension, alleviating the memory footprint of the graph itself.
Search Space: Scheduling = Topological Ordering
While the execution of conventional networks (e.g., AlexNet, VGGNet, ResNet) is "streamlined" (essentially one order of execution), the execution of irregularly wired neural networks (e.g., SwiftNet Cell A) is not: many orders of execution are possible.
Search Space: Scheduling = Topological Ordering
The search space is exponentially large, and optimal solutions account for a very small fraction of the entire space.
[Figure: cumulative distribution of schedules (%) vs. peak memory footprint (KB) for SwiftNet Cell A; only 0.04% of schedules are optimal.]
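To make this concrete, the following minimal sketch (a hypothetical five-node toy graph with made-up activation sizes, not SwiftNet itself) enumerates every topological ordering and measures each one's peak activation memory; even at this scale, only one of the three valid schedules is optimal:

```python
from itertools import permutations

# Hypothetical toy DAG: node -> consumers, with made-up activation sizes.
succs = {"A": ["B", "C"], "B": ["E"], "C": ["D"], "D": ["E"], "E": []}
preds = {v: [u for u in succs if v in succs[u]] for v in succs}
size  = {"A": 1, "B": 5, "C": 5, "D": 1, "E": 1}

def is_topological(order):
    pos = {v: i for i, v in enumerate(order)}
    return all(pos[u] < pos[v] for u in succs for v in succs[u])

def peak_memory(order):
    """Simulate a schedule: allocate each node's output, free an input once
    its last consumer has executed, and track the high-water mark."""
    remaining = {v: len(succs[v]) for v in succs}    # consumers not yet run
    live, peak = set(), 0
    for v in order:
        live.add(v)                                   # allocate output of v
        peak = max(peak, sum(size[u] for u in live))  # transient footprint
        for u in preds[v]:
            remaining[u] -= 1
            if remaining[u] == 0:                     # last consumer done
                live.discard(u)                       # deallocate input
    return peak

peaks = {o: peak_memory(o) for o in permutations(succs) if is_topological(o)}
best = min(peaks.values())
print(f"{len(peaks)} schedules, best peak {best}, "
      f"{sum(p == best for p in peaks.values())} optimal")
# -> 3 schedules, best peak 7, 1 optimal (the order A, C, D, B, E)
```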
Brute Force Algorithm for Topological Ordering
[Figure: example graph G with nodes A through L.]
Recursive Topological Ordering:
Each search step tracks the scheduled nodes and the zero-indegree (schedulable) set, recording states for memoization. Starting from A, the recursion branches on every schedulable node, reaching frontiers such as {B, C, J}, then {C, D, J, G}, {B, E, F, J}, {B, C}, and so on down the search tree.
Many zero-indegree sets are redundant: different branches of the search tree arrive at the same set. Optimizing this eliminates the redundancy.
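The search tree above maps directly to code. A minimal sketch of the recursive traversal (reusing the hypothetical `succs` and `preds` from the earlier toy example): at every step it branches on each member of the zero-indegree set, backtracking after each choice:

```python
def all_schedules(succs, preds):
    """Enumerate every topological ordering by recursing on the
    zero-indegree (schedulable) set, as in the brute-force search tree."""
    indegree = {v: len(preds[v]) for v in succs}

    def recurse(schedule, frontier):
        if not frontier:
            yield tuple(schedule)            # leaf: one complete ordering
            return
        for v in sorted(frontier):           # branch on each schedulable node
            frontier.discard(v)
            unlocked = []
            for w in succs[v]:
                indegree[w] -= 1
                if indegree[w] == 0:         # all of w's predecessors done
                    unlocked.append(w)
                    frontier.add(w)
            yield from recurse(schedule + [v], frontier)
            for w in unlocked:               # backtrack: undo the choice
                frontier.discard(w)
            for w in succs[v]:
                indegree[w] += 1
            frontier.add(v)

    yield from recurse([], {v for v in succs if indegree[v] == 0})

# Different branches reach identical zero-indegree sets: scheduling A, B, C
# or A, C, B both leave {D} schedulable, so the brute force repeats work.
print(len(list(all_schedules(succs, preds))))    # -> 3 for the toy graph
```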
Dynamic Programming Algorithm for Topological Ordering
[Figure: the same example graph G; each search step now records its unique zero-indegree set z for memoization.]
Dynamic Programming-based Topological Ordering
Branches that reach the same unique zero-indegree set z are merged: for example, the orderings that schedule {A, B, C}, {A, B, J}, or {A, C, J} in different orders each collapse into a single state (with frontiers such as {D}, {G}, and {E, F}), so each subproblem is solved only once.
Dynamic Programming-based Topological Ordering can speed up the traversal of schedules significantly.
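A minimal sketch of the dynamic-programming fix (hypothetical code, not Serenity's implementation): key the recursion on the frozen set of scheduled nodes, which uniquely determines the zero-indegree set, so each unique state is expanded exactly once. Counting schedules makes the collapse visible:

```python
from functools import lru_cache

def count_schedules(succs, preds):
    """Count topological orderings with dynamic programming: each unique
    scheduled set (equivalently, zero-indegree set) is expanded only once."""
    nodes = frozenset(succs)

    @lru_cache(maxsize=None)                 # memoize on the state itself
    def count(scheduled):
        if scheduled == nodes:
            return 1
        frontier = [v for v in nodes - scheduled
                    if all(p in scheduled for p in preds[v])]
        return sum(count(scheduled | {v}) for v in frontier)

    return count(frozenset())

print(count_schedules(succs, preds))   # -> 3, visiting unique states only
```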
Overlaying Problem Constraints
[Figure: example graph G with nodes A through L, annotated with the activation memory state at search step i = 8.]
(0) Initial state: at step i = 8, the schedule so far is s8 = A, B, C, D, E, F, I, J and the zero-indegree set is z8 = {H, G}; the node chosen is u8 = H. Activations D, E, F, I, J are in memory.
(1) Schedule/Allocate H: memory now holds D, E, F, I, J, H, and the running peak µpeak is updated; s9 = A, B, C, D, E, F, I, J, H. The outdegrees of H's inputs D and E drop from 1 to 0.
(2) Deallocate: D and E have no remaining consumers, so they are freed; memory holds F, I, J, H, yielding the new footprint µ9.
Overlaying these memory constraints on the dynamic programming traversal yields a memory-optimal schedule of the nodes.
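Combining the memoized traversal with the allocate/deallocate accounting above gives a compact memory-aware scheduler. A minimal sketch under the same toy assumptions (the key observation: the set of live activations is fully determined by which nodes have been scheduled, which is what makes memoizing on that set sound):

```python
from functools import lru_cache

def memory_optimal_schedule(succs, preds, size):
    """Return (optimal peak, schedule). State = frozenset of scheduled
    nodes; an activation is live iff its producer is scheduled and it is
    a graph output or still has an unscheduled consumer."""
    nodes = frozenset(succs)

    def live_bytes(scheduled):
        return sum(size[u] for u in scheduled
                   if not succs[u]                                # output
                   or any(c not in scheduled for c in succs[u]))  # pending

    @lru_cache(maxsize=None)
    def solve(scheduled):
        if scheduled == nodes:
            return 0, ()
        best = (float("inf"), ())
        for v in nodes - scheduled:
            if any(p not in scheduled for p in preds[v]):
                continue                      # v is not zero-indegree yet
            # (1) schedule/allocate v: inputs still held plus v's output
            transient = live_bytes(scheduled) + size[v]
            # (2) deallocate: exhausted inputs disappear from live_bytes
            # of the successor state, so freeing is implicit
            tail_peak, tail = solve(scheduled | {v})
            best = min(best, (max(transient, tail_peak), (v,) + tail))
        return best

    return solve(frozenset())

print(memory_optimal_schedule(succs, preds, size))
# -> (7, ('A', 'C', 'D', 'B', 'E')) for the toy graph above
```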
Dynamic Programming-based Scheduling
[Figure: activation memory (KB) over the schedule.]
| Size (8-bit) | MACs | Peak Mem | Accuracy |
| 249.7KB | 57.4M | 200KB | 95.13% |
4x improvement in peak memory footprint (cf. today's TF Lite scheduler: 800KB).
[Figure: output activations resident in memory at each schedule step.]
Identity Graph Rewriting
Channel-wise Partitioning (rewrite concat followed by conv into partial convs followed by add):
µpeak = Σᵢ size(xᵢ) + size(y) → µpeak = maxᵢ(size(xᵢ) + size(w̄ᵢxᵢ))
Kernel-wise Partitioning (rewrite concat followed by depthwise-conv into partial depthwise-convs followed by concat):
µpeak = Σᵢ size(xᵢ) + size(y) → µpeak = maxᵢ(size(xᵢ) + size(y))
[Figure: xᵢ = i-th input, y = output, wᵢⱼ = j-th channel of the i-th kernel; w̄ᵢ is the kernel partition paired with xᵢ.]
Graph rewriting, while maintaining mathematical integrity (the outputs remain identical), allows a further reduction in the peak memory footprint.
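As a worked instance of the channel-wise rule (hypothetical sizes, for illustration only): let two 32KB inputs x1 and x2 feed a concat followed by a conv with a 32KB output y.

µpeak (before) = size(x1) + size(x2) + size(y) = 32 + 32 + 32 = 96KB
µpeak (after) = max(size(x1) + size(w̄1x1), size(x2) + size(w̄2x2)) = max(32 + 32, 32 + 32) = 64KB

Each partial product w̄ᵢxᵢ is accumulated into the output by the add node, so the result is identical while the peak drops by a third.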
Dynamic Programming-based Scheduling + Graph Rewriting
[Figure: activation memory (KB) over the schedule.]
| Size (8-bit) | MACs | Peak Mem | Accuracy |
| 249.7KB | 57.4M | 188KB | 95.13% |
A further 12KB improvement with Graph Rewriting (cf. today's TF Lite scheduler: 800KB).
Peak memory performance for different scheduling
| Scheduling Strategy | Peak Mem | Time |
| Manual Optimization + Partial Convolution | 200KB | 2 days |
| (Automatic) Dynamic Programming-based Scheduling | 200KB | ? |
| (Automatic) Dynamic Programming-based Scheduling + Graph Rewriting | 188KB | ? |
Long Compile Time is Not Good for Mental Health
Pruning without Affecting Optimality
By setting an appropriate threshold, some paths can be pruned without affecting optimality
[Figure: pruning on the example graph G. Nodes are annotated with output activation sizes (C 6, D 6, E 6, F 6, J 6, I 3, H 3, G 3), and search states (e.g., after scheduling A, B, C, D, E, I and then extending with F and H) are labeled with their peak memory so far: 23, 32, 35, 38, and so on.]
With a threshold of τ = 36, any branch whose peak memory exceeds τ (here, the states at 38) is pruned, while the rest of the tree is explored as before. As long as τ is no smaller than the optimal peak, pruning does not affect optimality.
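In code, this pruning is a single guard inside `solve(scheduled)` from the earlier scheduler sketch (hypothetical, with `live_bytes` and `size` as defined there and `tau` the soft memory budget):

```python
# Inside solve(scheduled), before recursing on a candidate node v:
transient = live_bytes(scheduled) + size[v]
if transient > tau:
    continue    # prune: this branch already exceeds the threshold
# As long as tau is at least the optimal peak, at least one complete
# schedule survives, so the result is still memory-optimal.
```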
Prohibitive Scheduling Time
Adaptive Soft Budgeting
[Figure: the number of explored schedules (∝ scheduling time) grows with the budget; too tight a budget yields no solution.]
Adaptive Soft Budgeting finds an appropriate threshold, reducing the scheduling time significantly.
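A minimal sketch of the adaptive loop (an illustration of the idea, not the paper's exact algorithm; `schedule_with_budget` is a hypothetical wrapper that runs the pruned DP search with soft budget τ and time limit T and reports one of the three flags from the overview):

```python
def adaptive_soft_budgeting(schedule_with_budget, lo, hi, time_limit):
    """Tune the soft budget tau within [lo, hi], e.g., lo = the largest
    single-step footprint and hi = the peak of a greedy schedule.
    schedule_with_budget(tau, T) returns (flag, schedule) with
    flag in {'no solution', 'timeout', 'solution'}."""
    while hi - lo >= 1:                      # down to byte granularity
        tau = (lo + hi) / 2
        flag, schedule = schedule_with_budget(tau, time_limit)
        if flag == "solution":
            return schedule                  # found within budget and time
        if flag == "no solution":
            lo = tau                         # tau below optimal peak: relax
        else:                                # 'timeout': search too loose
            hi = tau                         # tighten tau to prune harder
    return None                              # window exhausted without a fit
```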
Accelerating Automated Approach: Divide and Conquer
Many irregularly wired neural networks are hourglass-shaped, which enables divide-and-conquer.
[Figure: divide-and-conquer on an hourglass-shaped graph with nodes A through H. Divide: split the graph at its waist into subgraphs g1 = {A, B, C, D} and g2 = {E, F, G, H}. Conquer: schedule each subgraph independently, yielding sg1 and sg2. Combine: concatenate the sub-schedules into s*.]
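A sketch of this step (hypothetical code building on `memory_optimal_schedule` from the scheduler sketch; it assumes the waist is the only activation crossing the cut, as in an hourglass):

```python
def divide_and_conquer_schedule(succs, preds, size, waist):
    """Divide at `waist` (a node every path crosses), conquer each half
    with the DP scheduler, then combine by concatenating the schedules."""
    def closure(start, edges):               # nodes reachable via `edges`
        seen, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                stack.extend(edges[v])
        return seen

    g1 = closure(waist, preds)               # the waist and its ancestors
    g2 = (set(succs) - g1) | {waist}         # the waist and its descendants

    def subgraph(keep):                      # restrict edges to `keep`
        return ({v: [c for c in succs[v] if c in keep] for v in keep},
                {v: [p for p in preds[v] if p in keep] for v in keep})

    s1, p1 = subgraph(g1)
    s2, p2 = subgraph(g2)
    peak1, order1 = memory_optimal_schedule(s1, p1, size)
    peak2, order2 = memory_optimal_schedule(s2, p2, size)
    # The waist appears in both halves; keep its single allocation.
    return max(peak1, peak2), order1 + tuple(v for v in order2 if v != waist)
```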
Peak memory performance for different scheduling
| Scheduling Strategy | Peak Mem | Time |
| Manual Optimization + Partial Convolution | 200KB | 2 days |
| (Automatic) Dynamic Programming-based Scheduling | 200KB | seconds |
| (Automatic) Dynamic Programming-based Scheduling + Graph Rewriting | 188KB | minutes |
Evaluation
Evaluation: Benchmark Irregularly Wired Neural Networks
| Network | Type | Dataset | # MACs | # Weights | Top-1 Accuracy* |
| DARTS [ICLR'19] | Neural Architecture Search | ImageNet | 574.0M | 4.7M | 73.3% |
| SwiftNet [CVPR-C'19, ICCV-W'19] | Human Presence Detection | Visual Wake Words | 57.4M | 249.7K | 95.1% |
| Randomly Wired Neural Networks [ICCV'19] | Random Network Generators | CIFAR10 | 111.0M | 1.2M | 93.6% |
| Randomly Wired Neural Networks [ICCV'19] | Random Network Generators | CIFAR100 | 160.0M | 4.7M | 74.5% |
* Serenity does not affect accuracy
Evaluation: Reduction in Peak Memory Footprint
Serenity reduces the peak memory footprint by 1.68x without Graph Rewriting and 1.86x with Graph Rewriting (geomean over the benchmarks).

Reduction in peak memory over TensorFlow Lite (higher is better):
| Benchmark | DP + Memory Allocator | DP + Graph Rewriting + Memory Allocator |
| DARTS (ImageNet), Normal Cell | 1.83x | 2.20x |
| SwiftNet (Visual Wake Words), Cell A | 2.20x | 2.44x |
| SwiftNet (Visual Wake Words), Cell B | 2.39x | 2.70x |
| SwiftNet (Visual Wake Words), Cell C | 2.09x | 3.45x |
| RandWire (CIFAR10), Cell A | 1.40x | 1.40x |
| RandWire (CIFAR10), Cell B | 1.27x | 1.27x |
| RandWire (CIFAR100), Cell A | 1.68x | 1.68x |
| RandWire (CIFAR100), Cell B | 1.25x | 1.25x |
| RandWire (CIFAR100), Cell C | 1.39x | 1.39x |
| Geomean | 1.68x | 1.86x |
Evaluation: Reduction in Off-Chip Memory Communication
Serenity also reduces off-chip memory communication by 1.52x, 1.49x, 1.51x, and 1.76x for 32KB, 64KB, 128KB, and 256KB, respectively
[Figure: reduction in off-chip memory communication over TensorFlow Lite for on-chip memory sizes of 32KB, 64KB, 128KB, and 256KB (geomean 1.52x, 1.49x, 1.51x, and 1.76x). Bars marked N/A are configurations where only SERENITY fits on-chip; there, SERENITY removes off-chip communication entirely.]
Evaluation: Reduction in Off-Chip Memory Communication
In several configurations, Serenity even eliminates off-chip memory communication entirely (the N/A entries above).
Evaluation: Scheduling Time
The average scheduling time of Serenity is under a minute for the benchmark models, and it can be further improved by porting the implementation from Python to C/C++.

Scheduling time (seconds):
| Benchmark | DP + Memory Allocator | DP + Graph Rewriting + Memory Allocator |
| DARTS (ImageNet), Normal Cell | 3.2s | 3.2s |
| SwiftNet, Cell A | 5.7s | 42.1s |
| SwiftNet, Cell B | 4.5s | 30.5s |
| SwiftNet, Cell C | 27.8s | 39.3s |
| RandWire (CIFAR10), Cell A | 118.1s | 118.1s |
| RandWire (CIFAR10), Cell B | 15.1s | 15.1s |
| RandWire (CIFAR100), Cell A | 28.5s | 28.5s |
| RandWire (CIFAR100), Cell B | 74.4s | 74.4s |
| RandWire (CIFAR100), Cell C | 87.9s | 87.9s |
| Mean | 40.6s | 48.8s |
Summary and Takeaways
1. Irregularly Wired Neural Networks are an emerging class of network architectures with many upsides in terms of efficiency, but current deep learning frameworks are oblivious to the peak memory footprint challenge they introduce.
2. We leverage Dynamic Programming-based Scheduling to find an optimal schedule; devise Identity Graph Rewriting to further reduce the peak memory footprint; and develop Adaptive Soft Budgeting and Divide-and-Conquer to minimize scheduling overhead.
Future Directions
1. Expanding applications, and revisiting classical algorithms and compiler heuristics:
- Problems of optimizing memory communication and inference time can also benefit from a similar dynamic programming formulation.
2. Using machine learning techniques to find good schedules in one shot:
- Graph Neural Networks to parse and extract information from the graph
- Reinforcement Learning and other intelligent algorithms for scheduling
3. Exploring other dimensions of reducing intermediate activations:
- Quantization and Pruning are popular compression techniques
- Lossy/lossless compression of intermediate activations is an interesting future path