Deep Transfer Learning for Visual Analysis
Yu-Chiang Frank Wang, Associate Professor
Dept. of Electrical Engineering, National Taiwan University
Taipei, Taiwan
2018/5/19 2nd AII Workshop
Trends of Deep Learning
Transfer Learning: What, When,
https://techcrunch.com/2017/02/08/udacity-open-sources-its-self-driving-car-simulator-for-anyone-to-use/
https://googleblog.blogspot.tw/2014/04/the-latest-chapter-for-self-driving-car.html
Detach and Adapt: Learning Cross-Domain Disentangled Deep Representation
Order-Free RNN with Visual Attention for Multi-Label Classification
Multi-Label Zero-Shot Learning with Structured Knowledge Graphs
Unsupervised Deep Transfer Learning for Person Re-Identification
[Figure: input image; panels labeled "Disentangle smile from"; transfer without supervision vs. with supervision]
Y.-C. F. Wang et al., Detach and Adapt: Learning Cross-Domain Disentangled Deep Representation, CVPR 2018
[Figure labels: attribute with supervision vs. w/o supervision; w/o label supervision; unpaired]
Comparison with related methods
Cross-domain image translation criteria: unpaired training data, multi-domains, bi-direction, joint representation
Representation disentanglement criteria: unsupervised, interpretability of disentangled factor
Pix2pix, CycleGAN, StarGAN, UNIT, DTN: cannot disentangle image representation
infoGAN, AC-GAN: cannot translate images across domains
CDRD (Ours): one criterion marked "Partially"
Order-Free RNN with Visual Attention for Multi-Label Classification
Labels: Person, Table, Sofa, Chair, TV, Lights, Carpet, …
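To make the multi-label setting concrete, here is a minimal sketch of how one image can carry several labels at once. It uses a generic multi-hot target and a per-label score threshold, which is an assumption for illustration and not the order-free RNN method presented in the talk. The names `to_multi_hot` and `decode`, the scores, and the 0.5 threshold are all hypothetical.

```python
# Label vocabulary taken from the slide (the slide's list continues with "…").
LABELS = ["Person", "Table", "Sofa", "Chair", "TV", "Lights", "Carpet"]


def to_multi_hot(present, vocabulary=LABELS):
    """Encode the set of labels present in an image as a 0/1 vector."""
    present = set(present)
    return [1 if label in present else 0 for label in vocabulary]


def decode(scores, vocabulary=LABELS, threshold=0.5):
    """Keep every label whose score reaches the threshold (hypothetical rule)."""
    return [label for label, s in zip(vocabulary, scores) if s >= threshold]


# Ground truth: an image showing a person, a sofa, and a TV.
target = to_multi_hot(["Person", "Sofa", "TV"])

# Hypothetical per-label scores, standing in for a model's output.
scores = [0.92, 0.10, 0.75, 0.40, 0.66, 0.05, 0.30]
predicted = decode(scores)

print(target)
print(predicted)
```

Unlike single-label classification, any number of entries in the target vector may be 1, so each label is decided independently in this sketch.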
[Figure: feature space, label space, and latent space; example labels: Clouds, Lake, Ocean, Water, Sky, Sun, Sunset]
Y.-C. F. Wang et al., Learning Deep Latent Spaces for Multi-Label Classification, AAAI 2017
Y.-C. F. Wang et al., Order-Free RNN with Visual Attention for Multi-Label Classification, AAAI 2018
[Figure panels: NUS-WIDE, MS-COCO]
[Figure: example images in MS-COCO with the associated attention maps; incorrect predictions with reasonable visual attention]
Unsupervised Deep Transfer Learning for Person Re-Identification
Person re-identification task: the system needs to match appearances of a person of interest across non-overlapping cameras.
[Figure: views from Camera #1, Camera #2, Camera #3, Camera #4]
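The matching step in the task statement above can be sketched as a ranking problem: given a feature vector for the person of interest from one camera, rank the people seen by another camera by similarity. This is only an illustrative sketch, assuming cosine similarity and hand-written 3-D features. The talk's method learns deep features, and the names `cosine_similarity`, `rank_gallery`, and the `person_*` IDs are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_gallery(query, gallery):
    """Return gallery IDs sorted from most to least similar to the query."""
    scores = {pid: cosine_similarity(query, feat) for pid, feat in gallery.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical features; a real system would extract them with a trained network.
query_cam1 = [0.9, 0.1, 0.2]  # person of interest as seen by Camera #1
gallery_cam2 = {              # candidates as seen by Camera #2
    "person_a": [0.1, 0.9, 0.3],
    "person_b": [0.8, 0.2, 0.1],
    "person_c": [0.2, 0.2, 0.9],
}

ranking = rank_gallery(query_cam1, gallery_cam2)
print(ranking)
```

The top-ranked gallery entry is the proposed match. Because the cameras do not overlap, the decision rests entirely on appearance features, not on position or timing.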
[Figure: network architecture. Inputs: target dataset I_t (w/o labels) and source dataset I_s (w/ labels). Components: latent encoder, latent space, latent decoder, classifier. Losses: ℒ_class, ℒ_ctrs, ℒ_diff, ℒ_rec]
[Figure labels: pose, Chair]
Stochastic & Recurrent Conditional-GAN (SR-cGAN)
Input: non-consecutive frames of interest
Output: video sequence (more than one possible output), conditioned on input anchor frames
Three Stages in Learning
[Figure: Temporal Encoder and Temporal Generator; real input vs. synthesized input; real/fake output]
Datasets: Shape Motion, KTH, MUG
[Figure: input (anchor frames) and output (synthesized video) on KTH and Shape Motion; anchor frame indices 6, 7, 11, 12, 14, 15 and 2, 3, 7, 9, 12, 14]
[Figure: different motion; input (anchor frames) and output (synthesized video); anchor frame indices 3, 5, 8, 12, 13, 14]
Person, Table, Sofa, Chair, TV, Lights, Carpet, …