Matching Guided Distillation (ECCV 2020), Kaiyu Yue, Jiangfan Deng, and Feng Zhou



SLIDE 1

Matching Guided Distillation

ECCV 2020 Kaiyu Yue, Jiangfan Deng, and Feng Zhou Algorithm Research, Aibee Inc.

SLIDE 2

Motivation

SLIDE 3

Motivation

  • Distillation Obstacle

The gap in semantic feature structure between the intermediate features of teacher and student

  • Classic Scheme

Transform the intermediate features by adding adaptation modules, such as a conv layer

  • Problems

1) The adaptation module brings extra parameters into training. 2) An adaptation module with random initialization or a special transformation is not friendly for distilling a pre-trained student.

SLIDE 4

Matching Guided Distillation Framework

SLIDE 5

Matching Guided Distillation – Matching

Given two feature sets from the teacher and the student, we use the Hungarian method to compute the flow-guided matrix M, which encodes the matched channel relationships.
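A minimal sketch of this matching step in PyTorch, assuming the cost between channels is the pairwise L2 distance between flattened channel responses and that C_t >= C_s; the helper name match_channels and the channel-tiling trick for many-to-one assignment are illustrative, not the official MGD implementation.

```python
import torch
from scipy.optimize import linear_sum_assignment

def match_channels(f_t, f_s):
    """Hungarian matching between teacher channels (C_t) and student channels (C_s).

    f_t: (N, C_t, H, W) teacher features; f_s: (N, C_s, H, W) student features.
    Returns a binary matrix M of shape (C_t, C_s) with M[i, j] = 1 when
    teacher channel i is matched to student channel j.
    """
    c_t, c_s = f_t.shape[1], f_s.shape[1]
    t = f_t.transpose(0, 1).reshape(c_t, -1)             # (C_t, N*H*W)
    s = f_s.transpose(0, 1).reshape(c_s, -1)             # (C_s, N*H*W)
    # Tile student channels so several teacher channels can map to one student channel.
    reps = -(-c_t // c_s)                                 # ceil(C_t / C_s)
    s_tiled = s.repeat(reps, 1)[:c_t]                     # (C_t, N*H*W)
    cost = torch.cdist(t, s_tiled, p=2).cpu().numpy()     # pairwise L2 distances
    rows, cols = linear_sum_assignment(cost)              # Hungarian assignment
    m = torch.zeros(c_t, c_s)
    m[rows, cols % c_s] = 1.0                             # fold tiled index back to C_s
    return m
```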

SLIDE 6

Matching Guided Distillation – Channels Reduction

One student channel may be matched to multiple teacher channels. We reduce the matched teacher channels into a single tensor for guiding the student.


SLIDE 7

Matching Guided Distillation – Distillation

After reducing the teacher channels, we distill the student using a partial distance loss between the matched features, such as an L2 loss.
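A minimal sketch of this distance loss, assuming the reduced teacher tensor already has the same shape as the student features; the function name distillation_loss is illustrative.

```python
import torch.nn.functional as F

def distillation_loss(f_t_reduced, f_s):
    """L2 (MSE) distance between reduced teacher features and student features.

    Both tensors have shape (N, C_s, H, W); the teacher side is detached
    because only the student is being trained.
    """
    return F.mse_loss(f_s, f_t_reduced.detach())
```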


SLIDE 8

Matching Guided Distillation – Coordinate Descent Optimization

The overall training takes a coordinate-descent approach that alternates between two optimization objectives: updating the flow-guided matrix and updating the model parameters.


Coordinate descent alternates between two steps: 1) updating the flow-guided matrix M, and 2) training the student model using SGD.
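A minimal sketch of this coordinate-descent loop, assuming hypothetical helpers match_channels() and reduce_channels() in the spirit of the sketches above, plus generic teacher_features / student_forward callables and a standard PyTorch data loader and optimizer; none of these names are the official MGD API.

```python
import torch
import torch.nn.functional as F

def train_mgd(teacher_features, student_forward, match_channels, reduce_channels,
              loader, sample_batch, optimizer, num_stages):
    """Alternate between updating M (matching) and updating student weights (SGD)."""
    for stage in range(num_stages):
        # Step 1: student parameters fixed; update the flow-guided matrix M.
        with torch.no_grad():
            f_t = teacher_features(sample_batch)        # (N, C_t, H, W)
            _, f_s = student_forward(sample_batch)      # (N, C_s, H, W)
            M = match_channels(f_t, f_s)

        # Step 2: M fixed; update the student parameters with SGD.
        for images, labels in loader:
            with torch.no_grad():
                f_t = teacher_features(images)
            logits, f_s = student_forward(images)       # task logits + intermediate features
            f_t_reduced = reduce_channels(f_t, M)       # e.g. absolute max pooling
            loss = F.cross_entropy(logits, labels) + F.mse_loss(f_s, f_t_reduced)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```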

SLIDE 9

Matching Guided Distillation Reduction Methods

SLIDE 10

Matching Guided Distillation – Channels Reduction

We propose three efficient methods for reducing teacher channels: Sparse Matching, Random Drop and Absolute Max Pooling.

SLIDE 11

Matching Guided Distillation – Sparse Matching

Each student channel matches only its most related teacher channel; unmatched teacher channels are ignored.
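A minimal sketch of sparse matching, assuming M is the (C_t, C_s) binary matching matrix, cost holds the same pairwise channel distances used during matching (both torch tensors), and every student channel has at least one matched teacher channel; names are illustrative.

```python
import torch

def sparse_matching(f_t, M, cost):
    """Keep, for each student channel, only its closest matched teacher channel.

    f_t: (N, C_t, H, W) teacher features; M, cost: (C_t, C_s) tensors.
    Returns a (N, C_s, H, W) tensor; unmatched teacher channels are ignored.
    """
    masked_cost = cost.masked_fill(~M.bool(), float('inf'))  # consider matched pairs only
    best_t = masked_cost.argmin(dim=0)                       # (C_s,) best teacher channel per student channel
    return f_t[:, best_t]
```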


SLIDE 12

Matching Guided Distillation – Random Drop

For each student channel, we sample one random teacher channel from the ones associated with it.
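A minimal sketch of random drop under the same assumptions (M as a (C_t, C_s) binary matching matrix, every student channel having at least one match; the helper name is hypothetical).

```python
import torch

def random_drop(f_t, M):
    """For each student channel, sample one of its matched teacher channels.

    f_t: (N, C_t, H, W) teacher features; M: (C_t, C_s) binary matching matrix.
    Returns a (N, C_s, H, W) tensor.
    """
    c_s = M.shape[1]
    picked = torch.empty(c_s, dtype=torch.long)
    for j in range(c_s):
        candidates = M[:, j].nonzero(as_tuple=True)[0]     # matched teacher channels
        idx = torch.randint(len(candidates), (1,)).item()  # uniform random choice
        picked[j] = candidates[idx]
    return f_t[:, picked]
```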


SLIDE 13

Matching Guided Distillation – Absolute Max Pooling

To keep both the positive and negative feature information of the teacher, we propose a novel pooling mechanism that reduces the matched channels by keeping, at each feature location, the element with the largest absolute value.
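A minimal sketch of absolute max pooling, again assuming M is the (C_t, C_s) binary matching matrix and every student channel has at least one matched teacher channel; at each spatial location the signed value with the largest magnitude among the matched teacher channels is kept.

```python
import torch

def absolute_max_pooling(f_t, M):
    """Reduce each group of matched teacher channels by absolute max.

    f_t: (N, C_t, H, W) teacher features; M: (C_t, C_s) binary matching matrix.
    Returns a (N, C_s, H, W) tensor that preserves the sign of the pooled values.
    """
    n, _, h, w = f_t.shape
    c_s = M.shape[1]
    out = torch.zeros(n, c_s, h, w, device=f_t.device, dtype=f_t.dtype)
    for j in range(c_s):
        group = f_t[:, M[:, j].bool()]                    # (N, k_j, H, W) matched channels
        idx = group.abs().argmax(dim=1, keepdim=True)     # location of the largest magnitude
        out[:, j] = group.gather(1, idx).squeeze(1)       # keep the signed value
    return out
```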


SLIDE 14

Matching Guided Distillation Main Results

SLIDE 15

Results – Fine-Grained Recognition on CUB-200

Top-1 accuracy gains: +3.97% and +5.44%.

SLIDE 16

Results – Large-Scale Classification on ImageNet-1K

Top-1 accuracy gains: +1.83% and +2.6%.

SLIDE 17

Results – Object Detection and Instance Segmentation on COCO

SLIDE 18

Summary

  • MGD is lightweight and efficient for various tasks
  • MGD removes the channel-number constraint between teacher and student, making it flexible to plug into networks
  • MGD is friendly for distilling a pre-trained student

Project webpage: http://kaiyuyue.com/mgd