Recognize, Describe, and Generate:
Introduction of Recent Work at MIL
The University of Tokyo, NVAIL Partner Yoshitaka Ushiku
MIL: Machine Intelligence Laboratory
Beyond Human Intelligence Based on Cyber-Physical Systems
Members
Varying research topics
ICCV, CVPR, ECCV, ICML, NIPS, ICASSP, SIGdial, ACM Multimedia, ICME, ICRA, IROS, etc.
The most important thing
– Recognize
– Describe
– Generate
[Hidaka+, ICLR Workshop 2017]
– Currently WebCL is utilized.
– Now working on WebGPU.
– Even ResNet with 152 layers can be trained.
Let me show you a preliminary demonstration using MNIST!
Trained on MNIST → works on SVHN?
– Ground-truth labels are associated with the source (MNIST)
– However, there are no labels for the target (SVHN)
[Saito+, submitted to ICML 2017]
1st round: training on MNIST → add pseudo-labels for easy target samples
From the 2nd round: training on MNIST + pseudo-labeled target samples → add more pseudo-labels
[Saito+, submitted to ICML 2017]
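The iterative scheme above (train on the labeled source, pseudo-label confident target samples, retrain) can be sketched with a toy nearest-centroid classifier standing in for the CNN; the function names, the margin-based confidence, and the threshold below are illustrative assumptions, not details from the paper:

```python
import numpy as np

def fit_centroids(X, y, n_classes):
    """Per-class mean vectors: a toy stand-in for the trained network."""
    return np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])

def predict_with_confidence(centroids, X):
    """Label = nearest centroid; confidence = margin between the two closest."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    order = np.sort(d, axis=1)
    return d.argmin(axis=1), order[:, 1] - order[:, 0]

def pseudo_label_rounds(Xs, ys, Xt, n_classes, rounds=3, thresh=0.5):
    """Iteratively add confident ("easy") target samples to the training set."""
    X_train, y_train = Xs.copy(), ys.copy()
    for _ in range(rounds):
        centroids = fit_centroids(X_train, y_train, n_classes)
        labels, conf = predict_with_confidence(centroids, Xt)
        keep = conf > thresh            # easy target samples only
        X_train = np.concatenate([Xs, Xt[keep]])
        y_train = np.concatenate([ys, labels[keep]])
    return fit_centroids(X_train, y_train, n_classes)
```

Each round re-fits on source data plus the pseudo-labeled subset, so the decision boundary gradually adapts toward the target domain.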
End-to-end learning for environmental sound classification
Existing methods for speech / sound recognition:
① Feature extraction: Fourier transformation (log-mel features)
② Classification: CNN on the extracted feature map
[Tokozume+, ICASSP 2017]
Log-mel features are suitable for human speech; but for environmental sounds…?
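The conventional front end in step ① can be sketched as follows; this is a generic log-mel computation (window, hop, and filter-count values are arbitrary), not the exact settings of the cited works:

```python
import numpy as np

def mel_filterbank(n_fft, sr, n_mels=40):
    """Triangular filters spaced evenly on the mel scale."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):
            fb[i, k] = (k - l) / max(c - l, 1)   # rising slope
        for k in range(c, r):
            fb[i, k] = (r - k) / max(r - c, 1)   # falling slope
    return fb

def log_mel(signal, sr, n_fft=512, hop=256, n_mels=40):
    """STFT power spectrum -> mel filterbank -> log: the classic front end."""
    window = np.hanning(n_fft)
    frames = [signal[i:i + n_fft] * window
              for i in range(0, len(signal) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(np.array(frames), axis=1)) ** 2
    return np.log(power @ mel_filterbank(n_fft, sr, n_mels).T + 1e-10)
```

The mel spacing concentrates resolution at low frequencies, which matches human hearing and speech; the slide's question is whether that bias is right for environmental sounds.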
End-to-end learning for environmental sound classification
Proposed approach (EnvNet):
CNN for both ① feature map extraction and ② classification
[Tokozume+, ICASSP 2017]
Extracted “feature map”
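Replacing step ① means convolving the raw waveform directly. A minimal sketch of a 1-D filter bank producing a 2-D "feature map" follows; the sinusoidal filters are a hand-picked illustration, whereas EnvNet learns its filters end to end:

```python
import numpy as np

def conv1d_bank(signal, filters, stride=1):
    """Apply a bank of 1-D filters to a raw waveform -> 2-D feature map."""
    n_filters, width = filters.shape
    n_out = (len(signal) - width) // stride + 1
    out = np.empty((n_filters, n_out))
    for j in range(n_out):
        window = signal[j * stride : j * stride + width]
        out[:, j] = filters @ window
    return np.maximum(out, 0.0)   # ReLU

# Toy filter bank: sinusoids at different frequencies act like band-pass
# filters (in EnvNet the corresponding weights are learned from data).
width = 64
t = np.arange(width)
filters = np.stack([np.sin(2 * np.pi * f * t / width) for f in (1, 2, 4, 8)])
```

With stride > 1 the time axis is downsampled, so the output resembles a spectrogram-like map with one row per learned filter.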
End-to-end learning for environmental sound classification
Comparison of accuracy [%] on ESC-50 [Piczak, ACM MM 2015]
[Tokozume+, ICASSP 2017]
log-mel feature + CNN [Piczak, MLSP 2015]: 64.5
End-to-end CNN (Ours): 64.0
End-to-end CNN & log-mel feature + CNN (Ours): 71.0
EnvNet can extract discriminative features for environmental sounds
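The best-performing entry combines the end-to-end CNN with the log-mel CNN. A common recipe for such a combination is to average the two models' class probabilities; this is a sketch only, and the paper's exact fusion may differ:

```python
import numpy as np

def softmax(z):
    """Row-wise softmax over class logits."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def late_fusion(logits_a, logits_b, weight=0.5):
    """Average the two models' class probabilities, then take the argmax."""
    p = weight * softmax(logits_a) + (1 - weight) * softmax(logits_b)
    return p.argmax(axis=1)
```

The intuition: when the two front ends make different mistakes, a confident correct model can outvote a weakly wrong one, which is consistent with the combined accuracy exceeding either model alone.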
Question answering system for images
[Saito+, ICME 2017]
Q: Is it going to rain soon? Ground-truth A: yes
Q: Why is there snow on one side of the stream and clear grass on the other? Ground-truth A: shade
After integrating the image feature and the question feature: usual classification
Question: What objects are found on the bed? → Answer: bed sheets, pillow
Image → image feature vector
VQA = Multi-class classification
Current advancement: improving how to integrate the image and question features
e.g.)
[Antol+, ICCV 2015]
e.g.) Image feature (with attention) + Question feature
[Xu+Saenko, ECCV 2016]
e.g.) Bilinear multiplication
[Fukui+, EMNLP 2016]
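The integrate-then-classify view above can be sketched as follows; the elementwise-product fusion, the dimensions, and the class names are illustrative assumptions (each cited paper uses its own integration scheme):

```python
import numpy as np

rng = np.random.default_rng(0)

d, n_answers = 128, 1000          # feature dim, answer vocabulary size

def fuse(img_feat, q_feat):
    """One common integration: elementwise product of the two features."""
    return img_feat * q_feat

class AnswerClassifier:
    """VQA cast as multi-class classification over a fixed answer vocabulary."""
    def __init__(self, d, n_answers):
        self.W = rng.normal(scale=0.01, size=(n_answers, d))  # untrained weights
    def predict(self, img_feat, q_feat):
        scores = self.W @ fuse(img_feat, q_feat)
        return int(scores.argmax())   # index into the answer list

img_feat = rng.normal(size=d)     # e.g. a CNN pooling-layer output
q_feat = rng.normal(size=d)       # e.g. an LSTM's final hidden state
answer_id = AnswerClassifier(d, n_answers).predict(img_feat, q_feat)
```

Swapping `fuse` for attention-weighted features or bilinear pooling is exactly the axis along which the cited works differ.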
VQA Challenge 2016 (in CVPR 2016) Won the 1st place on abstract images w/o attention mechanism
[Saito+, ICME 2017]
Q: What fruit is yellow and brown? A: banana
Q: How many screens are there? A: 2
Q: What is the boy playing with? A: teddy bear
Q: Are there any animals swimming in the pond? A: no
[Ushiku+, ACMMM 2011]
[Figure: example images with captions, e.g. "White and gray kitten lying on its side.", "A white van parked in an empty lot.", "A white cat rests head on a stone.", "A small white dog wearing a flannel warmer.", "A small gray dog on a leash.", "A black dog standing in a grassy area."]
[ACM MM 2012, ICCV 2015]
Group of people sitting at a table with a dinner. Tourists are standing on the middle of a flat desert.
[Andrew+, BMVC 2016]
A confused man in a blue shirt is sitting on a bench. A man in a blue shirt and blue jeans is standing in the
A zebra standing in a field with a tree in the dirty background.
Two steps for adding a sentiment term [Andrew+, BMVC 2016]
The most probable noun is memorized
A man is holding a box of doughnuts. Then he and a woman are standing next each other. Then she is holding a plate of food.
[Andrew+, ICIP 2016]
Narrative
A boat is floating on the water near a mountain. And a man riding a wave on top of a surfboard. Then he on the surfboard in the water.
[Andrew+, ICIP 2016]
[Kato+, CVPR 2014]
[Figure: local descriptors d_1, d_2, d_3, …, d_j, d_k, d_m, …, d_N modeled by a distribution p(d; θ); example images: cat, camera]
Traditional pipeline for image classification:
Extracting local descriptors → Collecting descriptors → Calculating a global feature → Classifying images
Inverse problem: image reconstruction from a label
[Kato+, CVPR 2014]
cat (bombay) camera grand piano headphone joshua tree pyramid wheel chair gramophone
Optimized arrangement using: Global location cost + Adjacency cost
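As a toy illustration of trading off a global location cost against an adjacency cost (the paper's actual optimization differs), a greedy grid placement might look like:

```python
import numpy as np

def arrange(patches, preferred, grid=(4, 4), alpha=1.0):
    """Greedily place patch feature vectors on a grid, balancing the
    global location cost (distance to each patch's preferred cell) against
    the adjacency cost (feature distance to already-placed neighbors)."""
    h, w = grid
    cells = [(r, c) for r in range(h) for c in range(w)]
    placed = {}                               # cell -> patch index
    for i, (p, pref) in enumerate(zip(patches, preferred)):
        best, best_cost = None, np.inf
        for cell in cells:
            if cell in placed:
                continue
            loc = np.hypot(cell[0] - pref[0], cell[1] - pref[1])
            adj = sum(np.linalg.norm(p - patches[placed[n]])
                      for n in neighbors(cell, h, w) if n in placed)
            cost = loc + alpha * adj
            if cost < best_cost:
                best, best_cost = cell, cost
        placed[best] = i
    return placed

def neighbors(cell, h, w):
    """4-connected neighbor cells inside the grid."""
    r, c = cell
    return [(r + dr, c + dc) for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= r + dr < h and 0 <= c + dc < w]
```

With `alpha = 0` only the location cost matters; raising it favors placing visually similar patches next to each other, which is the adjacency term's role.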
Other examples
Only successful in controlled settings:
– Human faces
– Birds
– Flowers
– Additionally requires temporal consistency
– Extremely challenging
[Yamamoto+, ACMMM 2016] [Vondrick+, NIPS 2016] BEGAN [Berthelot+, 2017 Mar.] StackGAN [Zhang+, 2016 Dec.]
– C3D (3D convolutional neural network) for conditional generation with an input label
– tempCAE (temporal convolutional auto-encoder) for regularizing video to improve its naturalness
[Yamamoto+, ACMMM 2016]
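The core operation of C3D is 3-D convolution over time as well as space; a minimal sketch with a hand-made temporal-difference kernel follows (purely illustrative, unrelated to the actual architecture):

```python
import numpy as np

def conv3d(video, kernel):
    """Valid 3-D convolution of a (T, H, W) video with a (t, h, w) kernel.
    Sliding the kernel along time as well as space lets a network model
    motion, not just per-frame appearance."""
    T, H, W = video.shape
    t, h, w = kernel.shape
    out = np.empty((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(video[i:i+t, j:j+h, k:k+w] * kernel)
    return out

# A temporal-difference kernel: responds to change between adjacent frames.
motion_kernel = np.zeros((2, 3, 3))
motion_kernel[0] = -1.0 / 9
motion_kernel[1] = 1.0 / 9
```

On a static clip this kernel outputs zeros everywhere; any nonzero response indicates motion, which is the kind of structure the tempCAE regularizer is meant to keep plausible.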
[Figure: generated videos for "Car runs to left" and "Rocket flies up" — Ours (C3D+tempCAE) vs. only C3D]
Beyond Human Intelligence Based on Cyber-Physical Systems
– Recognize
– Describe
– Generate