Disentanglement of Visual Concepts from Classifying and Synthesizing Scenes
Bolei Zhou, The Chinese University of Hong Kong
Representation Learning
The purpose of representation learning: “To identify and disentangle the underlying explanatory factors hidden in the observed milieu of low-level sensory data.”
Bengio et al. Representation Learning: A Review and New Perspectives.
Sources of Deep Representations
Scene Recognition Object Recognition
Colorization (ECCV'16, CVPR'17); Audio prediction (ECCV'16)
Self-Supervised Learning, Image Classification, Image Generation
Outline
- Disentanglement of Concepts from Classifying Scenes
- Sanity Check Experiment: Mixture of MNIST
- Disentanglement of Visual Concepts from Synthesizing Scenes
- Future Directions
My Previous Talks
- On the importance of single units
CVPR’18 Tutorial talk: https://www.youtube.com/watch?v=1aSS5GEH58U
- Interpretable representation learning for visual intelligence
MIT thesis defense: https://www.youtube.com/watch?v=J7Zz_33ZeJc
http://places2.csail.mit.edu/demo.html https://github.com/CSAILVision/places365
Neural Networks for Scene Classification
Cafeteria (0.9)
Convolutional Neural Network (CNN)
Units as concept detectors
Unit 2 at Layer 4: Lamp; Unit 22 at Layer 5: Face; Unit 42 at Layer 3: Trademark; Unit 57 at Layer 4: Windows
What are the internal units for classifying scenes?
What is a unit doing? - Visualize the unit
[Zeiler et al., ECCV’14] [Girshick et al., CVPR’14]
Deconvolution
[Simonyan et al., ICLR'15] [Springenberg et al., ICLR'15] [Selvaraju et al., ICCV'17]
Back-propagation Image Synthesis
[Nguyen et al., NIPS’16] [Dosovitskiy et al., CVPR’16] [Mahendran, et al., CVPR’15]
Unit1: Top activated images
Data Driven Visualization
Unit2: Top activated images Unit3: Top activated images
Layer 5 https://github.com/metalbubble/cnnvisualizer
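The data-driven visualization above can be sketched as a simple ranking step: record the layer's activations over a dataset, then for each unit sort images by its peak response. A minimal sketch (array shapes and variable names are illustrative; in practice the activations are recorded with a forward hook during inference):

```python
import numpy as np

# Stand-in for recorded conv-layer activations over the dataset:
# 1000 images x 256 conv5 units x 6x6 spatial maps.
rng = np.random.default_rng(0)
acts = rng.random((1000, 256, 6, 6))

unit_scores = acts.max(axis=(2, 3))   # per-image peak activation of each unit
unit = 79
# Indices of the top-10 activated images for this unit.
top10 = np.argsort(unit_scores[:, unit])[::-1][:10]
```

Inspecting (and cropping around) those top-activated images is what the annotation step below summarizes with a word or description.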
Annotating the Interpretation of Units
Word/Description to summarize the images: ______
Amazon Mechanical Turk
Which category does the description belong to:
- Scene
- Region or surface
- Object
- Object part
- Texture or material
- Simple elements or colors
Lamp
[Zhou, Khosla, Lapedriza, Oliva, Torralba. ICLR 2015]
Interpretable Representations for Objects and Scenes
59 units as objects at conv5 of AlexNet on ImageNet
tie bird dog dog
151 units as objects at conv5 of AlexNet on Places
building windows baseball field face
Network Dissection
[Zhou*, Bau*, et al. TPAMI’18, CVPR 2017]
Interpretable Units
Examples of interpretable units at conv5 (concept, type, IoU):
- unit 107: road (object), IoU 0.16
- unit 79: car (object), IoU 0.14
- unit 252: waffled (texture), IoU 0.14
- unit 191: grid (texture), IoU 0.13
- unit 41: honeycombed (texture), IoU 0.13
- unit 144: mountain (object), IoU 0.13
- unit 88: grass (object), IoU 0.13
- unit 229: paisley (texture), IoU 0.12
[Word cloud of detected concepts, e.g. objects such as water, tree, grass, car, windowpane, mountain, building, dog, person; scenes such as street, skyscraper, building facade; textures such as honeycombed, grid, paisley, waffled, zigzagged; and the color red]
32 objects 6 scenes 6 parts 2 materials 25 textures 1 color
Quantify the Interpretability of Networks
Evaluate Unit for Semantic Segmentation
Top concept: Lamp, Intersection over Union (IoU) = 0.23. Unit 1: top activated images from the testing dataset
Testing Dataset: 60,000 images annotated with 1,200 concepts
Layer5 unit 79 car (object) IoU=0.13 Layer5 unit 107 road (object) IoU=0.15
118/256 units covering 72 unique concepts
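The unit-concept IoU scores above come from comparing each unit's binarized activation map against a concept's segmentation mask. A minimal sketch of that score (in the paper the activation map is first upsampled to image resolution, and the threshold is set so that a fixed top quantile of the unit's activations over the dataset exceeds it):

```python
import numpy as np

def unit_concept_iou(act_map, concept_mask, threshold):
    """IoU between a thresholded activation map and a concept mask."""
    binarized = act_map > threshold
    inter = np.logical_and(binarized, concept_mask).sum()
    union = np.logical_or(binarized, concept_mask).sum()
    return inter / union if union else 0.0

# Toy 2x2 example: the unit fires on the diagonal, the concept covers
# the left column -> intersection 1, union 3.
act = np.array([[0.9, 0.1],
                [0.2, 0.8]])
mask = np.array([[True, False],
                 [True, False]])
score = unit_concept_iou(act, mask, threshold=0.5)
```

A unit counts as a detector for a concept when this score clears a cutoff (0.04 in the paper), and the top concept is the one with the highest IoU.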
AlexNet VGG GoogLeNet ResNet
House Airplane
More results in the TPAMI extension paper
Interpreting Deep Visual Representations via Network Dissection: https://arxiv.org/pdf/1711.05611.pdf
- Comparison of different network architectures
- Comparison of supervisions (supervised vs. self-supervised)
Sanity Check Experiment for Disentanglement
- How to quantitatively evaluate the solution reached by a CNN?
- What are the hidden factors in object recognition and scene recognition?
Scene Recognition Object Recognition
Sanity Check Experiment for Disentanglement
A controlled classification experiment: Mixture of MNIST
10 digits from MNIST; pairwise combinations of digits: Class 1 (3,6), Class 2 (0,2), Class 3 (4,5), …, Class N
With Wentao Zhu (PKU)
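The class structure above can be sketched directly: each class is an unordered pair of distinct digits, giving C(10, 2) = 45 classes, and a sample composes one crop of each digit into a single image. A minimal sketch (the exact placement of the digits in the talk's dataset is assumed here to be side by side):

```python
import itertools
import numpy as np

# Each class is an unordered pair of distinct digits: C(10, 2) = 45 classes.
pairs = list(itertools.combinations(range(10), 2))

def make_sample(left_digit, right_digit):
    """Compose two 28x28 digit crops side by side into one 28x56 image."""
    return np.concatenate([left_digit, right_digit], axis=1)

# Stand-in crops in place of real MNIST digits.
sample = make_sample(np.zeros((28, 28)), np.ones((28, 28)))
```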
Solving Mixture of MNIST
To classify the given image into one of 45 classes
- Training data: 20,000 images
- Accuracy on validation set: 91.7%
Layer1: 10 units Layer2: 20 units Layer3: 10 units Global average pooling Softmax: 45 classes
Class number
A simple convnet for classification
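The architecture ends in global average pooling: each of the 10 conv3 feature maps is collapsed to a single scalar before the 45-way softmax, so every class score is a linear combination of per-unit responses, which is what makes the emerging digit detectors easy to read off. A minimal sketch of that pooling step (shapes illustrative):

```python
import numpy as np

def global_average_pool(features):
    """Collapse spatial dims: (batch, channels, H, W) -> (batch, channels)."""
    return features.mean(axis=(2, 3))

# 4 images, 10 conv3 units with 7x7 maps -> one scalar per unit per image.
pooled = global_average_pool(np.ones((4, 10, 7, 7)))
```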
Digit Detectors Emerge from Solving Mixture of MNIST
Unit 03 as a detector for digit 0 (top activated images and activations). Precision: @100 = 1.00, @300 = 1.00, @500 = 1.00, @700 = 0.99; precision at recall 0.25 = 0.99, at recall 0.50 = 0.98, at recall 0.75 = 0.90
Digit Detectors Emerge from Solving Mixture of MNIST
Two metrics for unit importance: alignment score and ablation effect
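Of the two metrics, the ablation effect is the more direct one: silence a single unit and measure how much validation accuracy drops. A minimal sketch (the exact formulation in the talk is assumed; `predict_fn`, `feats`, and the toy classifier below are illustrative):

```python
import numpy as np

def ablation_effect(predict_fn, features, labels, unit):
    """Accuracy drop from zeroing one unit's feature map."""
    base_acc = (predict_fn(features) == labels).mean()
    ablated = features.copy()
    ablated[:, unit] = 0.0
    return base_acc - (predict_fn(ablated) == labels).mean()

# Toy classifier that depends only on channel 0's mean response.
predict = lambda f: (f[:, 0].mean(axis=(1, 2)) > 0).astype(int)
feats = np.zeros((4, 3, 2, 2))
feats[:2, 0] = 1.0                  # first two samples activate channel 0
labels = np.array([1, 1, 0, 0])
effect = ablation_effect(predict, feats, labels, unit=0)
```

A large ablation effect means the network leans heavily on that single unit; the width and dropout experiments below probe exactly this reliance.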
Dropout Affects the Unit as Digit Detector
Baseline Baseline + Dropout on the conv3
Layer Width Affects the Unit as Digit Detector
- A wider network disentangles concepts better
- It relies less on single units
Baseline Baseline with tripling the number of units at conv3
Wider layer + Dropout
Baseline Baseline with wider layer Baseline with wider layer + dropout
Usefulness Experiment
- Treat 8 and 9 as redundant digits (shown at random in all classes)
- Effective digits: 0-7
- Number of classes: 28
Deep Neural Networks for Synthesizing Scenes
Goodfellow et al., NIPS'14; Radford et al., ICLR'15; Karras et al., 2017; Brock et al., 2018
Generative Adversarial Networks
Karras et al., 2017
How to Add or Modify Contents?
Input: Random noise Output: Synthesized image Add trees Add domes
Understanding the Internal Units in GANs
What are they doing?
David Bau, Jun-Yan Zhu, Hendrik Strobelt, Bolei Zhou, J. Tenenbaum, W. Freeman, A. Torralba. GAN Dissection: Visualizing and Understanding GANs. ICLR’19. https://arxiv.org/pdf/1811.10597.pdf
Input: Random noise Output: Synthesized image
Framework of GAN Dissection
Unit 365 draws trees. Unit 43 draws domes. Unit 14 draws grass. Unit 276 draws towers.
Units Emerge as Drawing Objects
Manipulating the Images
Unit 4 draws the lamp: synthesized images, and the same images with Unit 4 removed
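The manipulation above is an intervention at an intermediate generator layer: zero the feature maps of the chosen units (optionally only inside a spatial mask) and let the remaining layers render the edited image. A minimal sketch of that intervention (the interface is assumed; GAN Dissection implements it inside the generator's forward pass):

```python
import numpy as np

def ablate_units(features, units, mask=None):
    """Zero the given channels of (batch, channels, H, W) features.

    With a mask (H, W) in [0, 1], the units are silenced only inside
    the masked region, e.g. to remove an object from one spot.
    """
    edited = features.copy()
    if mask is None:
        edited[:, units] = 0.0
    else:
        edited[:, units] *= (1.0 - mask)
    return edited

# Silence units 1 and 3 everywhere; units 0 and 2 are untouched.
edited = ablate_units(np.ones((1, 4, 2, 2)), units=[1, 3])
```

Inserting (rather than zeroing) activations in a region is the same mechanism run in reverse, which is what powers the interactive "add trees / add domes" edits.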
Interactive Image Manipulation
All the code and paper are available at http://gandissect.csail.mit.edu
Latest Work on Using GANs to Manipulate Real Images
- Challenge: inverting the hidden code for any given image
Input: Hidden code z Output: Synthesized image
Future Directions
- Generalization & Overfitting
- Plasticity & Transfer Learning
- GAN & Deep RL
- Defense and Attack with Adversarial Examples
- Network Compression
- Interpretable Deep Learning
Why Care About Interpretability?
From the ‘alchemy’ of deep learning to the ‘chemistry’ of deep learning
Scientific Understanding