SLIDE 1
Matching Guided Distillation
ECCV 2020
Kaiyu Yue, Jiangfan Deng, and Feng Zhou
Algorithm Research, Aibee Inc.
SLIDE 2
SLIDE 3
Motivation
- Distillation Obstacle
The gap in semantic feature structure between the intermediate features of teacher and student
- Classic Scheme
Transform the intermediate features by adding adaptation modules, such as a conv layer
- Problems
1) The adaptation module brings more parameters into training
2) The adaptation module with random initialization or a special transformation isn't friendly for distilling a pre-trained student
SLIDE 4
Matching Guided Distillation Framework
SLIDE 5
Matching Guided Distillation – Matching
Given two feature sets from the teacher and the student, we use the Hungarian method to obtain the flow-guided matrix M, which encodes the matched channel relationships.
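Below is a minimal sketch of this matching step in PyTorch, assuming each channel is flattened into a vector and compared with an L1 cost; the helper name, the cost choice, and the row tiling used to obtain a many-to-one assignment are illustrative assumptions rather than the paper's exact implementation.

```python
# Sketch only: illustrative channel matching via the Hungarian method.
import torch
from scipy.optimize import linear_sum_assignment

def match_channels(f_t, f_s):
    """f_t: teacher channels (C_t, H*W); f_s: student channels (C_s, H*W), C_s <= C_t.

    Returns a binary flow-guided matrix M of shape (C_s, C_t) with
    M[i, j] = 1 when teacher channel j is assigned to student channel i.
    """
    # Pairwise L1 distance between every (student, teacher) channel pair (assumed cost).
    cost = torch.cdist(f_s, f_t, p=1)                       # (C_s, C_t)

    # Tile the student rows so every teacher channel can be assigned
    # (a many-to-one assignment), then solve the assignment problem.
    reps = -(-f_t.shape[0] // f_s.shape[0])                 # ceil(C_t / C_s)
    tiled = cost.repeat(reps, 1).detach().cpu().numpy()     # (reps*C_s, C_t)
    rows, cols = linear_sum_assignment(tiled)               # Hungarian solver

    M = torch.zeros(f_s.shape[0], f_t.shape[0])
    # Fold the tiled rows back onto the original student channels.
    M[torch.as_tensor(rows) % f_s.shape[0], torch.as_tensor(cols)] = 1.0
    return M
```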
SLIDE 6
Matching Guided Distillation – Channels Reduction
One student channel may be matched to multiple teacher channels. We reduce the matched teacher channels into a single tensor to guide the student.
[Figure: channels reduction]
SLIDE 7
Matching Guided Distillation – Distillation
After reducing the teacher channels, we distill the student with a partial distance training loss, such as the L2 loss.
[Figure: channels reduction → distance loss]
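A minimal sketch of this loss, assuming the flow-guided matrix M from the matching step and a hypothetical reduce_fn implementing one of the reduction methods described in the next section; the names are illustrative.

```python
# Sketch only: partial L2 distance between reduced teacher features and student features.
import torch
import torch.nn.functional as F

def mgd_loss(f_t, f_s, M, reduce_fn):
    """f_t: (N, C_t, H, W) teacher features; f_s: (N, C_s, H, W) student features.

    M: (C_s, C_t) binary matching matrix; reduce_fn collapses the matched teacher
    channels of each student channel into one tensor, giving (N, C_s, H, W).
    """
    target = reduce_fn(f_t, M)                 # reduced teacher features
    return F.mse_loss(f_s, target.detach())    # L2 distance, teacher side detached
```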
SLIDE 8
Matching Guided Distillation – Coordinate Descent Optimization
The overall training takes a coordinate-descent approach, alternating between two optimization objectives: updating the flow-guided matrix M and updating the student parameters.
[Figure: coordinate descent optimization, alternating between updating the flow-guided matrix M and training the student model with SGD]
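A minimal sketch of this alternation, reusing the hypothetical match_channels, mgd_loss, and reduce_fn helpers sketched earlier; the re-matching interval, the use of a single batch for the matching cost, and the optimizer settings are assumptions rather than the paper's exact recipe.

```python
# Sketch only: coordinate descent between updating M and updating student weights.
import torch
import torch.nn.functional as F

def train_mgd(teacher, student, loader, reduce_fn, epochs=100):
    opt = torch.optim.SGD(student.parameters(), lr=0.01, momentum=0.9)
    teacher.eval()
    M = None
    for epoch in range(epochs):
        for step, (x, y) in enumerate(loader):
            with torch.no_grad():
                f_t = teacher(x)            # assumes teacher returns its intermediate features
            f_s, logits = student(x)        # assumes student returns features and logits

            # Step 1: with parameters fixed, update the flow-guided matrix M
            # (here once per epoch, from the current batch, for simplicity).
            if M is None or step == 0:
                M = match_channels(f_t.mean(0).flatten(1), f_s.mean(0).flatten(1))

            # Step 2: with M fixed, update the student parameters using SGD.
            loss = F.cross_entropy(logits, y) + mgd_loss(f_t, f_s, M, reduce_fn)
            opt.zero_grad()
            loss.backward()
            opt.step()
```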
SLIDE 9
Matching Guided Distillation Reduction Methods
SLIDE 10
Matching Guided Distillation – Channels Reduction
We propose three efficient methods for reducing teacher channels: Sparse Matching, Random Drop and Absolute Max Pooling.
SLIDE 11
Matching Guided Distillation – Sparse Matching
Each student channel matches only its most related teacher channel; unmatched teacher channels are ignored.
[Figure: matching → distance loss]
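A minimal sketch of Sparse Matching, assuming the binary matrix M and the pairwise cost matrix from the matching step; the helper name is illustrative.

```python
# Sketch only: keep, for each student channel, its single closest teacher channel.
import torch

def sparse_matching_reduce(f_t, M, cost):
    """f_t: (N, C_t, H, W); M: (C_s, C_t) binary matching matrix; cost: (C_s, C_t).

    Returns (N, C_s, H, W).
    """
    masked = cost.masked_fill(M == 0, float("inf"))   # consider matched pairs only
    best = masked.argmin(dim=1)                       # closest teacher channel per student channel
    return f_t[:, best]                               # drop all other teacher channels
```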
SLIDE 12
Matching Guided Distillation – Random Drop
For each student channel, we sample one teacher channel at random from its matched teacher channels.
[Figure: matching → distance loss]
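A minimal sketch of Random Drop under the same assumptions (binary matching matrix M, illustrative helper name).

```python
# Sketch only: keep one randomly sampled matched teacher channel per student channel.
import torch

def random_drop_reduce(f_t, M):
    """f_t: (N, C_t, H, W); M: (C_s, C_t) binary matching matrix. Returns (N, C_s, H, W)."""
    # Sample uniformly among the matched teacher channels of each student channel.
    picked = torch.multinomial(M.float(), num_samples=1).squeeze(1)   # (C_s,)
    return f_t[:, picked]
```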
SLIDE 13
Matching Guided Distillation – Absolute Max Pooling
To keep both positive and negative feature information from the teacher, we propose a novel pooling mechanism that reduces features by keeping, at each spatial location, the value with the largest absolute magnitude across the matched teacher channels.
[Figure: matching → distance loss]
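A minimal sketch of Absolute Max Pooling under the same assumptions; the loop over student channels is written for clarity rather than speed.

```python
# Sketch only: at each spatial location keep the matched teacher value with the
# largest absolute magnitude, preserving its sign.
import torch

def amp_reduce(f_t, M):
    """f_t: (N, C_t, H, W); M: (C_s, C_t) binary matching matrix. Returns (N, C_s, H, W)."""
    outs = []
    for row in M:                                        # one row per student channel
        group = f_t[:, row.bool()]                       # (N, k, H, W) matched teacher channels
        idx = group.abs().argmax(dim=1, keepdim=True)    # location of largest |value|
        outs.append(torch.gather(group, 1, idx))         # keep the signed value, (N, 1, H, W)
    return torch.cat(outs, dim=1)
```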
SLIDE 14
Matching Guided Distillation Main Results
SLIDE 15
Results – Fine-Grained Recognition on CUB-200
+3.97% on top-1; +5.44% on top-1
SLIDE 16
Results – Large-Scale Classification on ImageNet-1K
+1.83% on top-1; +2.6% on top-1
SLIDE 17
Results – Object Detection and Instance Segmentation on COCO
SLIDE 18
Summary
- MGD is lightweight and efficient for various tasks
- MGD removes the channel-number constraint between teacher and student, so it is flexible to plug into networks
- MGD is friendly for distilling a pre-trained student