Algorithm-SoC Co-Design for Mobile Continuous Vision
Yuhao Zhu
Department of Computer Science, University of Rochester
with Anand Samajdar (Georgia Tech), Matthew Mattina (ARM Research), Paul Whatmough (ARM Research)
3
Autonomous Drones, ADAS, Augmented Reality, Security Camera
4
[Figure: pipeline Photons → Imaging → RGB Frames → Vision Kernels → Semantic Results. Conventional Scope: Vision Kernels. Our Scope: Imaging plus Vision Kernels.]
Imaging also produces Motion Metadata.
Motion-based Synthesis: diff (motion), synthesis (cheap)
5
[Figure: pipeline Photons → Imaging → RGB Frames → Vision Kernels → Semantic Results, with the Imaging stage expanded and producing Motion Metadata.]
Imaging stages:
Bayer Domain: Dead Pixel Correction, …
Conversion: Demosaic, …
YUV Domain: Temporal Denoising, …
Motion Info. (from Temporal Denoising): Frame k-1, Frame k; block positions <u, v> and <x, y>
Motion Vector = <x - u, y - v>
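The motion vector definition above can be illustrated with a small block-matching search. This is a minimal sketch, not the ISP's actual motion estimation: the exhaustive sum-of-absolute-differences (SAD) search, the names `sad`, `estimate_motion_vector` and `make_frame`, the block size, and the search range are all assumptions. It also assumes <u, v> is the block origin in Frame k-1 and <x, y> its best match in Frame k, which the slide text does not state explicitly.

```python
def sad(frame_a, frame_b, ax, ay, bx, by, size):
    """Sum of absolute differences between two size x size blocks."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            total += abs(frame_a[ay + dy][ax + dx] - frame_b[by + dy][bx + dx])
    return total


def estimate_motion_vector(prev, cur, u, v, size=2, search=2):
    """Find where the block at <u, v> in `prev` best matches in `cur`.

    Returns <x - u, y - v>, where <x, y> is the best-matching block origin.
    Assumption: exhaustive search within +/- `search` pixels.
    """
    height, width = len(cur), len(cur[0])
    best = None
    for y in range(max(0, v - search), min(height - size, v + search) + 1):
        for x in range(max(0, u - search), min(width - size, u + search) + 1):
            cost = sad(prev, cur, u, v, x, y, size)
            if best is None or cost < best[0]:
                best = (cost, x, y)
    _, x, y = best
    return (x - u, y - v)


def make_frame(width, height, bx, by):
    """Synthetic frame: zeros plus a distinctive 2x2 block at origin (bx, by)."""
    frame = [[0] * width for _ in range(height)]
    for dy in range(2):
        for dx in range(2):
            frame[by + dy][bx + dx] = 200 + 10 * dy + dx
    return frame


prev_frame = make_frame(8, 8, 2, 3)  # block at <u, v> = <2, 3>
cur_frame = make_frame(8, 8, 4, 2)   # same block at <x, y> = <4, 2>
mv = estimate_motion_vector(prev_frame, cur_frame, 2, 3)
print(mv)  # (2, -1)
```

The point of the talk is that this search already runs inside the temporal denoising stage, so the vectors come at no extra compute cost to the vision pipeline.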
6
Motion-based Synthesis: diff (motion), synthesis
▸ Synthesis operation: Extrapolate based on motion vectors
[Figure: timeline t0 to t4 alternating Inference (I-Frame) and Extrapolation (E-Frame), shown for Extrapolation Window = 2 and Extrapolation Window = 3.]
▸ Address three challenges:
▸ Extrapolation: 10K operations/frame; CNN Inference: 50B operations/frame
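The I-frame/E-frame timeline and the two operation counts above can be combined into a rough cost model. This is a sketch under stated assumptions: an extrapolation window of N is read from the timeline as one I-frame followed by N-1 E-frames, and the function names `frame_schedule` and `average_ops_per_frame` are hypothetical. Only the two per-frame operation counts come from the slide.

```python
CNN_OPS_PER_FRAME = 50e9     # "CNN Inference: 50B operations/frame" (from the slide)
EXTRAP_OPS_PER_FRAME = 10e3  # "Extrapolation: 10K operations/frame" (from the slide)


def frame_schedule(num_frames, window):
    """Label each frame 'I' (inference) or 'E' (extrapolation).

    Assumption: every window starts with one I-frame, followed by
    window - 1 E-frames.
    """
    return ["I" if t % window == 0 else "E" for t in range(num_frames)]


def average_ops_per_frame(window):
    """Average operations per frame over one extrapolation window."""
    total = CNN_OPS_PER_FRAME + (window - 1) * EXTRAP_OPS_PER_FRAME
    return total / window


print(frame_schedule(6, 2))  # ['I', 'E', 'I', 'E', 'I', 'E']
print(frame_schedule(6, 3))  # ['I', 'E', 'E', 'I', 'E', 'E']
for window in (1, 2, 3):
    print(window, average_ops_per_frame(window))
print(CNN_OPS_PER_FRAME / EXTRAP_OPS_PER_FRAME)  # 5000000.0
```

Because an E-frame costs about five million times fewer operations than an I-frame, the average cost per frame under this model is dominated by how often inference runs.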
7
66% energy saving & 1% accuracy loss with RTL/measurement.
Exploits synergies across IP
Motion-based tracking and detection synthesis.
8
[Figure: the pipeline Photons → Imaging → RGB Frames → Vision Kernels → Semantic Results shown alongside the SoC blocks Camera Sensor, Sensor Interface, Image Signal Processor, CNN Accelerator, and On-chip Interconnect.]
9
[Figure: SoC block diagram with Camera Sensor, Sensor Interface, Image Signal Processor, CNN Accelerator, CPU (Host), Memory Controller, DMA Engine, On-chip Interconnect, DRAM (Frame Buffer), and Display. Two additions are marked: (1) Metadata and (2) Motion Controller.]
10
▸ Expose motion vectors to the rest of the SoC
▸ Design decision: transfer MVs through DRAM
▸ Light-weight modification to ISP Sequencer
[Figure: ISP Pipeline with Demosaic, Color Balance, and the Temporal Denoising Stage (Motion Estimation, Motion Compensation), together with SRAM, DMA, and the ISP Sequencer on the ISP Internal Interconnect. The SoC Interconnect connects to the Frame Buffer (DRAM). Frame labels: Noisy Frame, Denoised Frame, Prev. Noisy Frame, Prev. Denoised Frame.]
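One way to get a feel for the decision to transfer MVs through DRAM is to estimate how much motion-vector data a frame produces relative to the frame itself. Every number below is an assumption chosen for illustration (1920x1080 frames, 16x16 blocks, one byte per MV component, 3 bytes per RGB pixel); none of them comes from the talk, and the helper names are hypothetical.

```python
import math


def mv_metadata_bytes(width, height, block_size, bytes_per_mv):
    """Bytes of motion-vector metadata for one frame, one MV per block."""
    blocks_x = math.ceil(width / block_size)
    blocks_y = math.ceil(height / block_size)
    return blocks_x * blocks_y * bytes_per_mv


def frame_bytes(width, height, bytes_per_pixel):
    """Bytes of pixel data for one frame."""
    return width * height * bytes_per_pixel


# Assumed configuration, for illustration only.
mv_bytes = mv_metadata_bytes(1920, 1080, block_size=16, bytes_per_mv=2)
rgb_bytes = frame_bytes(1920, 1080, bytes_per_pixel=3)

print(mv_bytes)                       # 16320
print(rgb_bytes)                      # 6220800
print(f"{mv_bytes / rgb_bytes:.2%}")  # 0.26%
```

Under these assumptions the metadata is a fraction of a percent of the frame traffic the ISP already writes to the frame buffer.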
11
[Figure: Motion Controller block diagram. Extrapolation Unit (4-Way SIMD Unit, Scalar), Motion Vector Buffer, DMA, Sequencer (FSM), ROI Selection, and memory-mapped registers (MMap Regs: ROI, Winsize, Base Addrs, Conf). Inputs: ROI and MVs. Output: New ROI.]
▸ Why a new IP rather than directly augmenting the CNN accelerator?
▸ Why a new IP rather than synthesizing on the CPU?
12
[Figure: the Motion Controller shown with the ISP and the CNN Accelerator on the SoC Interconnect.]
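The Extrapolation Unit's data flow (ROI and MVs in, New ROI out) can be sketched in a few lines. This is a simplified illustration, not the unit's actual algorithm, which the slides do not spell out: it assumes the ROI is shifted by the mean motion vector of the blocks whose centers fall inside it. The `ROI` class, `extrapolate_roi`, and the 16-pixel block size are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ROI:
    x: float       # top-left corner, in pixels
    y: float
    width: float
    height: float


def extrapolate_roi(roi, motion_vectors, block_size=16):
    """Shift `roi` by the mean motion vector of the blocks it covers.

    `motion_vectors` maps (block_col, block_row) -> (dx, dy) in pixels.
    Assumption: a block belongs to the ROI if its center lies inside it.
    """
    selected = []
    for (col, row), mv in motion_vectors.items():
        cx = col * block_size + block_size / 2  # block center
        cy = row * block_size + block_size / 2
        inside_x = roi.x <= cx < roi.x + roi.width
        inside_y = roi.y <= cy < roi.y + roi.height
        if inside_x and inside_y:
            selected.append(mv)
    if not selected:
        return roi  # no motion information: keep the previous ROI
    mean_dx = sum(dx for dx, _ in selected) / len(selected)
    mean_dy = sum(dy for _, dy in selected) / len(selected)
    return ROI(roi.x + mean_dx, roi.y + mean_dy, roi.width, roi.height)


mvs = {(0, 0): (0, 0), (1, 1): (4, -2), (2, 1): (6, -2), (5, 5): (-8, 8)}
new_roi = extrapolate_roi(ROI(x=16, y=16, width=32, height=16), mvs)
print(new_roi)  # ROI(x=21.0, y=14.0, width=32, height=16)
```

The work per ROI is a handful of additions and one division, which is consistent with the small dedicated datapath shown in the block diagram.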
13
Motion-based tracking and detection synthesis.
Exploits synergies across IP
66% energy saving & 1% accuracy loss with RTL/measurement.
14
▸ Evaluate on Object Tracking and Object Detection
▹ Important domains that are building blocks for many vision applications
▹ IP vendors have started shipping standalone tracking/detection IPs
▸ Object Detection
▹ Baseline CNN: YOLOv2 (state-of-the-art detection results)
▸ In-house simulator modeling a commercial mobile SoC: Nvidia Tegra X2 (TX2)
▹ Real board measurement
▸ Develop RTL models for IPs unavailable on TX2
▹ CNN Accelerator (651 mW, 1.58 mm²)
▹ Motion Controller (2.2 mW, 0.035 mm²)
▸ SCALESim: A systolic array-based, cycle-accurate CNN accelerator simulator
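The RTL power and area figures above give a quick sense of how small the Motion Controller is next to the CNN Accelerator. The arithmetic below uses only the four numbers from the slide; the variable names are illustrative.

```python
# Power (mW) and area (mm^2) figures as reported on the slide.
cnn_accelerator = {"power_mw": 651.0, "area_mm2": 1.58}
motion_controller = {"power_mw": 2.2, "area_mm2": 0.035}

power_ratio = motion_controller["power_mw"] / cnn_accelerator["power_mw"]
area_ratio = motion_controller["area_mm2"] / cnn_accelerator["area_mm2"]

print(f"power: {power_ratio:.2%} of the CNN Accelerator")  # power: 0.34% ...
print(f"area:  {area_ratio:.2%} of the CNN Accelerator")   # area:  2.22% ...
```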
15
[Figure: Accuracy plot comparing YOLOv2, YOLOv2 with several extrapolation windows (EW-N), and Tiny YOLO (scale-down CNN).]
EW = Extrapolation Window
16
▸ Euphrates co-designs the SoC with a motion-based synthesis algorithm.
▸ We must expand our focus from isolated accelerators to holistic SoC architecture.
▸ 66% SoC energy savings with ~1% accuracy loss
17
Anand Samajdar (Georgia Tech), Paul Whatmough (ARM Research), Matt Mattina (ARM Research)