
Algorithm-SoC Co-Design for Mobile Continuous Vision

Yuhao Zhu

Department of Computer Science, University of Rochester
with Anand Samajdar (Georgia Tech), Matthew Mattina (ARM Research), and Paul Whatmough (ARM Research)


Mobile Continuous Vision: Excessive Energy Consumption

▸ Input: 720p video at 30 FPS
▸ Energy budget (under a 3 W TDP): 109 nJ/pixel
▸ Object detection energy consumption: 1400 nJ/pixel
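The 109 nJ/pixel budget follows from quick arithmetic. A minimal sketch (the helper name and the assumption that the full 3 W TDP is devoted to vision are ours, for illustration):

```python
# Back-of-the-envelope check of the per-pixel energy budget.
# Assumption (ours): the entire 3 W TDP is available for vision,
# and every pixel of every frame is processed.

def energy_budget_nj_per_pixel(tdp_watts, width, height, fps):
    """Energy available per pixel, in nanojoules."""
    pixels_per_second = width * height * fps
    return tdp_watts / pixels_per_second * 1e9  # J -> nJ

budget = energy_budget_nj_per_pixel(3, 1280, 720, 30)  # 720p @ 30 FPS
print(round(budget))  # -> 109, matching the slide
```

Against this budget, object detection at ~1400 nJ/pixel overshoots by roughly 13x, which is the gap the talk sets out to close.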


Application Drivers for Continuous Vision

▸ Autonomous drones
▸ ADAS
▸ Augmented reality
▸ Security cameras


Expanding the Scope

▸ Conventional scope: the vision kernels alone (RGB frames → semantic results)
▸ Our scope: the full pipeline from photons through imaging to the vision kernels, including the motion metadata the imaging stage produces
▸ Motion-based synthesis:

f(x_t) = f(x_1, …, x_{t-1}) ⊕ (x_t ⊖ x_{t-1})

where ⊖ extracts the motion (diff) between consecutive frames and ⊕ synthesizes the new result from the previous one, a cheap computation compared to full inference


Getting Motion Data

▸ The imaging pipeline already computes motion: the ISP's Bayer-domain stages (dead pixel correction, demosaic, conversion, …) feed a YUV-domain temporal denoising stage
▸ Temporal denoising block-matches frame k against frame k-1: the block at <u, v> in frame k-1 matches the block at <x, y> in frame k
▸ Motion Vector = <x - u, y - v>
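The block-matching idea behind these motion vectors can be sketched in a few lines. This is an illustrative toy (the function names, block size, and exhaustive SAD search are our assumptions, not the ISP's actual hardware algorithm):

```python
# Toy block matching: find the motion vector <x - u, y - v> by
# searching frame k-1 for the best match of a block in frame k.

def sad(frame, top, left, block, size):
    """Sum of absolute differences between a block and a frame region."""
    return sum(
        abs(frame[top + i][left + j] - block[i][j])
        for i in range(size) for j in range(size)
    )

def motion_vector(prev_frame, cur_frame, x, y, size=2, radius=2):
    """Where did the block at (x, y) in frame k come from in frame k-1?"""
    block = [row[x:x + size] for row in cur_frame[y:y + size]]
    best = None
    for v in range(max(0, y - radius), min(len(prev_frame) - size, y + radius) + 1):
        for u in range(max(0, x - radius), min(len(prev_frame[0]) - size, x + radius) + 1):
            cost = sad(prev_frame, v, u, block, size)
            if best is None or cost < best[0]:
                best = (cost, x - u, y - v)  # MV = <x - u, y - v>
    return best[1], best[2]

# A 2x2 bright patch moves one pixel right and one pixel down.
prev = [[0] * 6 for _ in range(6)]
cur = [[0] * 6 for _ in range(6)]
for i in range(2):
    for j in range(2):
        prev[1 + i][1 + j] = 255
        cur[2 + i][2 + j] = 255
print(motion_vector(prev, cur, 2, 2))  # -> (1, 1)
```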


[Timeline t0–t4: each inference frame (I-frame) is followed by extrapolated frames (E-frames). With Extrapolation Window = 2 the sequence is I, E, I, E, …; with Extrapolation Window = 3 it is I, E, E, I, E, ….]


Synthesis Operation

f(x_t) = f(x_1, …, x_{t-1}) ⊕ (x_t ⊖ x_{t-1})

▸ Synthesis operation: extrapolate the previous result using motion vectors
▸ Three challenges to address:
▹ Handle deformable parts
▹ Filter motion noise
▹ Decide when to run inference vs. extrapolation
▹ See the paper for details!
▸ Computationally efficient: extrapolation takes ~10K operations/frame vs. ~50B operations/frame for CNN inference
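One simple way to realize the extrapolation, sketched under our own assumptions (the paper's actual algorithm also handles deformation and motion noise), is to shift the previous frame's bounding box by the average motion vector inside it:

```python
# Sketch of extrapolating a detection result with motion vectors.
# Simplification of the paper's synthesis (names and structure ours).

def extrapolate_roi(roi, motion_vectors):
    """roi = (x0, y0, x1, y1); motion_vectors maps (x, y) -> (dx, dy)."""
    x0, y0, x1, y1 = roi
    inside = [mv for (x, y), mv in motion_vectors.items()
              if x0 <= x < x1 and y0 <= y < y1]
    if not inside:              # no motion info inside: keep the ROI
        return roi
    dx = sum(mv[0] for mv in inside) / len(inside)
    dy = sum(mv[1] for mv in inside) / len(inside)
    return (x0 + dx, y0 + dy, x1 + dx, y1 + dy)

# Every block inside the ROI moved 4 px right and 2 px down.
mvs = {(x, y): (4, 2) for x in range(0, 32, 8) for y in range(0, 32, 8)}
print(extrapolate_roi((0, 0, 16, 16), mvs))  # -> (4.0, 2.0, 20.0, 18.0)
```

Averaging a few thousand motion vectors and shifting four coordinates is on the order of the ~10K operations/frame cited on the slide, versus ~50B for full CNN inference.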


Euphrates: An Algorithm-SoC Co-Designed System for Energy-Efficient Mobile Continuous Vision

▸ Algorithm: motion-based tracking and detection synthesis
▸ SoC: exploits synergies across IP blocks; enables task autonomy
▸ Results: 66% energy saving with ~1% accuracy loss, validated with RTL models and board measurements


SoC Architecture

[Block diagram: a camera sensor feeds a sensor interface; the Image Signal Processor, CNN Accelerator, Motion Controller, and CPU (host) sit on the on-chip interconnect, alongside a DMA engine and a memory controller that reach the frame buffer in DRAM and the display. Two augmentations: (1) the ISP exposes motion metadata; (2) a new Motion Controller IP consumes it.]


ISP Augmentation

▸ Expose motion vectors (MVs) to the rest of the SoC
▸ Design decision: transfer MVs through DRAM
▹ One 1080p frame: 8 KB of MV traffic vs. ~6 MB of pixel data
▹ Easy to piggyback on the existing SoC communication scheme
▸ Lightweight modification to the ISP sequencer

[Diagram: inside the ISP pipeline (demosaic, color balance, …), the temporal denoising stage's motion estimation and motion compensation blocks already produce MVs; the augmented sequencer has the DMA engine write these MVs, alongside the noisy/denoised frames, to the frame buffer in DRAM over the SoC interconnect.]
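The traffic asymmetry follows from simple arithmetic. One plausible accounting (block granularity and motion-vector width are our assumptions, not figures from the talk):

```python
# Rough accounting behind "8 KB of MVs vs ~6 MB of pixels" for one
# 1080p frame. Block size and per-component width are assumed.

WIDTH, HEIGHT = 1920, 1080

pixel_bytes = WIDTH * HEIGHT * 3             # 24-bit RGB
print(pixel_bytes / 2**20)                   # ~5.9 MB of pixel data

BLOCK = 32                                   # one MV per 32x32 block (assumed)
n_mvs = (WIDTH // BLOCK) * (HEIGHT // BLOCK)
mv_bytes = n_mvs * 2 * 2                     # two 16-bit components per MV
print(mv_bytes / 2**10)                      # ~8 KB of MV traffic
```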


Motion Controller IP

▸ Why a new IP instead of directly augmenting the CNN accelerator?
▹ Keeps the design independent of the vision algorithm/architecture implementation
▸ Why a new IP instead of synthesizing on the CPU?
▹ The CPU can be switched off, enabling "always-on" vision

[Diagram: the Motion Controller sits between the ISP and the CNN Accelerator on the SoC interconnect. It contains memory-mapped configuration registers (ROI, window size, base addresses), a DMA engine, a sequencer (FSM), a motion vector buffer, and an extrapolation unit in which ROI-selection logic and a 4-way SIMD unit turn the scalar MVs and the current ROI into a new ROI.]
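The sequencer's scheduling decision can be sketched as a one-liner. Assuming the window-size register means one I-frame followed by winsize − 1 E-frames (our reading of the earlier timeline slide, not a documented register semantic):

```python
# Sketch of the sequencer's per-frame decision: inference (I-frame)
# or extrapolation (E-frame), driven by the configured window size.
# The register interpretation here is our assumption.

def frame_kind(frame_index, winsize):
    """'I' for frames that run full CNN inference, 'E' otherwise."""
    return "I" if frame_index % winsize == 0 else "E"

print("".join(frame_kind(t, 2) for t in range(6)))  # -> IEIEIE
print("".join(frame_kind(t, 3) for t in range(6)))  # -> IEEIEE
```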



Experimental Setup

▸ In-house simulator modeling a commercial mobile SoC (Nvidia Tegra X2)
▹ Real board measurements
▸ RTL models developed for IPs unavailable on the TX2
▹ CNN accelerator: 651 mW, 1.58 mm²
▹ Motion controller: 2.2 mW, 0.035 mm²
▸ Evaluated on object tracking and object detection
▹ Important domains that are building blocks for many vision applications
▹ IP vendors have started shipping standalone tracking/detection IPs
▸ Object detection baseline CNN: YOLOv2 (state-of-the-art detection results)
▸ CNN accelerator modeled with SCALE-Sim, a systolic-array-based, cycle-accurate simulator: https://github.com/ARM-software/SCALE-Sim

Evaluation Results

[Bar charts: detection accuracy (y-axis 0.1–0.7) and normalized energy (y-axis 0.25–1) for the baseline YOLOv2 and for Euphrates with extrapolation windows EW-2, EW-4, EW-8, EW-16, and EW-32; a third chart compares YOLOv2, EW-4, EW-16, and the scaled-down Tiny YOLO.]

▸ 66% system energy saving with ~1% accuracy loss
▸ More efficient than simply scaling down the CNN

EW = Extrapolation Window


Conclusions

▸ We must expand our focus from isolated accelerators to holistic SoC architecture.
▸ Euphrates co-designs the SoC with a motion-based synthesis algorithm.
▸ 66% SoC energy savings with ~1% accuracy loss; more efficient than scaling down CNNs.
Thank you!

Anand Samajdar (Georgia Tech), Paul Whatmough (ARM Research), Matt Mattina (ARM Research)