Introduction of Recent Work at MIL, The University of Tokyo, NVAIL - PowerPoint PPT Presentation



SLIDE 1

Recognize, Describe, and Generate:

Introduction of Recent Work at MIL

The University of Tokyo, NVAIL Partner Yoshitaka Ushiku

SLIDE 2

MIL: Machine Intelligence Laboratory

Beyond Human Intelligence Based on Cyber-Physical Systems

Members

  • One Professor (Prof. Harada)
  • One Lecturer (me)
  • One Assistant Professor
  • One Postdoc
  • Two Office Administrators
  • 11 Ph.D. students
  • 23 Master's students
  • 8 Bachelor's students
  • 5 Interns

Varying research topics

ICCV, CVPR, ECCV, ICML, NIPS, ICASSP, SIGdial, ACM Multimedia, ICME, ICRA, IROS, etc.

The most important thing

We are hiring!

SLIDE 3

Journalist Robot

  • Born in 2006
  • Objective: publishing news automatically

– Recognize

  • Objects, people, actions

– Describe

  • What is happening

– Generate

  • Contents as humans do
SLIDE 4

Outline

  • Journalist Robot: the ancestor of current work at MIL
  • Our research originates with this robot

– Recognize

  • Basic: Framework for DL, Domain Adaptation
  • Classification: Single-modality, Multi-modalities

– Describe

  • Image Captioning
  • Video Captioning

– Generate

  • Image Reconstruction
  • Video Generation
SLIDE 5

Recognize

SLIDE 6

MILJS: JavaScript × Deep Learning

[Hidaka+, ICLR Workshop 2017]

SLIDE 7

MILJS: JavaScript × Deep Learning

  • Support for both learning and inference
  • Support for nodes with GPGPUs

– Currently WebCL is utilized.
– Now working on WebGPU.

  • Support for nodes w/o GPGPUs
  • No requirements to install any software

– Even ResNet with 152 layers can be trained

[Hidaka+, ICLR Workshop 2017]

Let me show you a preliminary demonstration using MNIST!

SLIDE 8

Asymmetric Tri-training for Domain Adaptation

  • Unsupervised domain adaptation

Trained on MNIST → works on SVHN?
– Ground-truth labels are associated with the source (MNIST)
– However, there are no labels for the target (SVHN)

[Saito+, submitted to ICML 2017]

SLIDE 9

Asymmetric Tri-training for Domain Adaptation

  • Asymmetric Tri-training: pseudo labels for target domain

[Saito+, submitted to ICML 2017]

SLIDE 10

Asymmetric Tri-training for Domain Adaptation

1st round: training on MNIST → add pseudo labels for easy samples
2nd round onward: training on MNIST+α → add more pseudo labels


[Saito+, submitted to ICML 2017]
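The agreement-based pseudo-labeling step above can be sketched as a toy loop. This is an illustration, not the authors' implementation: the classifier functions, the confidence threshold, and the agreement rule are assumptions for the example; in the paper, the pseudo-labeled target samples then train a third, target-specific network.

```python
# Toy sketch of agreement-based pseudo-labeling (illustrative only).
# Two source-trained classifiers label target samples; a pseudo label
# is kept only when both agree with high confidence.

def pseudo_label(f1, f2, target_samples, threshold=0.9):
    """Return (sample, pseudo_label) pairs on which f1 and f2 agree.

    f1, f2: functions mapping a sample to a (label, confidence) pair.
    """
    labeled = []
    for x in target_samples:
        y1, c1 = f1(x)
        y2, c2 = f2(x)
        if y1 == y2 and min(c1, c2) >= threshold:
            labeled.append((x, y1))
    return labeled

# Toy classifiers that "predict" the parity of an integer; f2 loses
# confidence on samples >= 10, so those receive no pseudo label.
f1 = lambda x: (x % 2, 0.95)
f2 = lambda x: (x % 2, 0.92 if x < 10 else 0.5)

pseudo = pseudo_label(f1, f2, range(12))
# Samples 0..9 are pseudo-labeled; 10 and 11 are rejected.
```

Repeating this over rounds, with the labeled pool growing each time, mirrors the "add more pseudo labels" progression on the slide.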

SLIDE 11

End-to-end learning for environmental sound classification

Existing methods for speech / sound recognition:

① Feature extraction: Fourier transformation (log-mel features)
② Classification: CNN on the extracted feature map

[Tokozume+, ICASSP 2017]


Log-mel features are suitable for human speech, but are they for environmental sounds…?

SLIDE 12

End-to-end learning for environmental sound classification

Proposed approach (EnvNet):

CNN for both ① feature map extraction and ② classification

[Tokozume+, ICASSP 2017]


Extracted “feature map”

SLIDE 13

End-to-end learning for environmental sound classification

Comparison of accuracy [%] on ESC-50 [Piczak, ACM MM 2015]

[Tokozume+, ICASSP 2017]

– log-mel feature + CNN [Piczak, MLSP 2015]: 64.5
– End-to-end CNN (Ours): 64.0
– End-to-end CNN & log-mel feature + CNN (Ours): 71.0

EnvNet can extract discriminative features for environmental sounds
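The end-to-end idea (learning filters directly on the raw waveform instead of hand-crafting log-mel features) can be sketched with a plain 1-D convolution. The kernel and waveform values below are made up for illustration; EnvNet learns its filters from data.

```python
# Sketch of the end-to-end idea: the first layers of an EnvNet-style
# model apply learned 1-D convolutions directly to the raw waveform
# rather than a hand-crafted log-mel transform. Kernel and waveform
# values are illustrative, not learned weights from the paper.

def conv1d(signal, kernel, stride=1):
    """Valid 1-D convolution (cross-correlation) over a waveform."""
    k = len(kernel)
    return [
        sum(signal[start + i] * kernel[i] for i in range(k))
        for start in range(0, len(signal) - k + 1, stride)
    ]

def relu(xs):
    """Elementwise rectifier, the usual CNN non-linearity."""
    return [max(0.0, x) for x in xs]

# A toy alternating waveform and a slope-sensitive filter.
waveform = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
kernel = [0.5, 0.0, -0.5]
feature_map = relu(conv1d(waveform, kernel))
```

Stacking such layers yields the learned "feature map" that replaces the log-mel spectrogram as the classifier's input.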

SLIDE 14

Visual Question Answering (VQA)

Question answering system for

  • Associated image
  • Question by natural language

[Saito+, ICME 2017]

Q: Is it going to rain soon?
Ground Truth A: yes

Q: Why is there snow on one side of the stream and clear grass on the other?
Ground Truth A: shade

SLIDE 15

Visual Question Answering (VQA)

After integrating the image feature and the question feature: usual classification

Example question: What objects are found on the bed? Answer: bed sheets, pillow

  • Image feature
  • Question feature
  • Integrated vector

[Saito+, ICME 2017]

VQA = Multi-class classification

SLIDE 16

Visual Question Answering

Current advancement: improving how to integrate the image feature and the question feature

  • Concatenation

e.g.) [Antol+, ICCV 2015]

  • Summation

e.g.) Image feature (with attention) + Question feature [Xu+Saenko, ECCV 2016]

  • Multiplication

e.g.) Bilinear multiplication [Fukui+, EMNLP 2016]

  • This work: DualNet, combining summation, multiplication, and concatenation

[Saito+, ICME 2017]
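The three integration schemes, and the DualNet idea of using them jointly, can be sketched on toy vectors. The feature values, dimensions, and the final concatenation of fused vectors are illustrative assumptions; real image and question features come from a CNN and an RNN respectively.

```python
# Toy sketch of three fusion schemes for an image feature v and a
# question feature q (values are made up for illustration).

def concat(v, q):
    return v + q                             # cf. [Antol+, ICCV 2015]

def summation(v, q):
    return [a + b for a, b in zip(v, q)]     # cf. [Xu+Saenko, ECCV 2016]

def multiplication(v, q):
    return [a * b for a, b in zip(v, q)]     # cf. [Fukui+, EMNLP 2016]

v = [0.25, 0.75, 0.125]   # toy image feature
q = [0.5, 0.5, 1.0]       # toy question feature

fused_concat = concat(v, q)          # dimension 6
fused_sum = summation(v, q)          # dimension 3
fused_mul = multiplication(v, q)     # dimension 3

# DualNet-style idea (simplified here): let the answer classifier see
# several fusions jointly, e.g. by concatenating them.
dual = fused_sum + fused_mul
```

Whatever the fusion, the integrated vector then feeds a standard multi-class classifier over the answer vocabulary, as on the previous slide.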
SLIDE 17

Visual Question Answering (VQA)

VQA Challenge 2016 (at CVPR 2016): won 1st place on abstract images without an attention mechanism

[Saito+, ICME 2017]

Q: What fruit is yellow and brown? A: banana
Q: How many screens are there? A: 2
Q: What is the boy playing with? A: teddy bear
Q: Are there any animals swimming in the pond? A: no

SLIDE 18

Describe

SLIDE 19

[Ushiku+, ACMMM 2011]

Automatic Image Captioning

SLIDE 20

Training Dataset

Training dataset captions:

  • A woman posing on a red scooter.
  • White and gray kitten lying on its side.
  • A white van parked in an empty lot.
  • A white cat rests head on a stone.
  • Silver car parked on side of road.
  • A small gray dog on a leash.
  • A black dog standing in a grassy area.
  • A small white dog wearing a flannel warmer.

Input image → Nearest captions:

  • A small white dog wearing a flannel warmer.
  • A small gray dog on a leash.
  • A black dog standing in a grassy area.

Output caption: A small white dog standing on a leash.

SLIDE 21

Automatic Image Captioning

[ACM MM 2012, ICCV 2015]

Group of people sitting at a table with a dinner.
Tourists are standing on the middle of a flat desert.

SLIDE 22

Image Captioning + Sentiment Terms

[Andrew+, BMVC 2016]

A confused man in a blue shirt is sitting on a bench.
A man in a blue shirt and blue jeans is standing in the overlooked water.

A zebra standing in a field with a tree in the dirty background.

SLIDE 23

Image Captioning + Sentiment Terms

Two steps for adding a sentiment term

  • 1. Usual image captioning using CNN+RNN

[Andrew+, BMVC 2016]

The most probable noun is memorized

SLIDE 24

Image Captioning + Sentiment Terms

Two steps for adding a sentiment term

  • 1. Usual image captioning using CNN+RNN
  • 2. Forced to predict sentiment term before the noun

[Andrew+, BMVC 2016]
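Step 2 can be sketched as a simple insertion before the memorized noun. The sentiment word and the helper function name here are hypothetical; in the paper, the sentiment term is predicted by the model rather than chosen by hand.

```python
# Sketch of step 2: force a sentiment term in just before the most
# probable noun memorized in step 1. The sentiment word here is a
# placeholder; the model predicts it in the actual method.

def insert_sentiment(caption, noun, sentiment):
    """Insert a sentiment term immediately before the given noun."""
    words = caption.split()
    if noun in words:
        words.insert(words.index(noun), sentiment)
    return " ".join(words)

base = "A man in a blue shirt is sitting on a bench."
with_sentiment = insert_sentiment(base, "man", "confused")
# → "A confused man in a blue shirt is sitting on a bench."
```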

SLIDE 25

Beyond Caption to Narrative

A man is holding a box of doughnuts. Then he and a woman are standing next each other. Then she is holding a plate of food.

[Andrew+, ICIP 2016]

SLIDE 26

Beyond Caption to Narrative

[Andrew+, ICIP 2016]

A man is holding a box of doughnuts.
he and a woman are standing next each other.
she is holding a plate of food.

Narrative

SLIDE 27

Beyond Caption to Narrative

A boat is floating on the water near a mountain. And a man riding a wave on top of a surfboard. Then he on the surfboard in the water.

[Andrew+, ICIP 2016]

SLIDE 28

Generate

SLIDE 29

Image Reconstruction

[Kato+, CVPR 2014]

[Figure: local descriptors d_1, d_2, …, d_N pooled into a global feature p(d; θ); example classes: cat, camera]

Traditional pipeline for image classification:

Extracting local descriptors → Collecting descriptors → Calculating a global feature → Classifying images
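As a minimal stand-in for the aggregation steps of this traditional pipeline, a bag-of-visual-words histogram can play the role of the global feature. The paper instead models the descriptor distribution p(d; θ); the codebook and descriptor values below are toy assumptions.

```python
# Toy stand-in for the pipeline's aggregation steps: pool local
# descriptors into one global feature via a bag-of-visual-words
# histogram (the paper uses a parametric distribution p(d; theta)).

def nearest_code(descriptor, codebook):
    """Index of the codebook entry closest to the descriptor (L2)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)),
               key=lambda i: dist2(descriptor, codebook[i]))

def global_feature(descriptors, codebook):
    """Normalized histogram of codebook assignments."""
    hist = [0.0] * len(codebook)
    for d in descriptors:
        hist[nearest_code(d, codebook)] += 1.0
    total = sum(hist)
    return [h / total for h in hist]

codebook = [[0.0, 0.0], [1.0, 1.0]]   # toy 2-word codebook
descriptors = [[0.1, 0.0], [0.9, 1.0], [1.0, 0.8], [0.0, 0.2]]
feature = global_feature(descriptors, codebook)
# Two descriptors fall in each cell, so the histogram is [0.5, 0.5].
```

Inverting this many-to-one pooling is exactly what makes reconstruction from a label or feature an ill-posed inverse problem.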

SLIDE 30

Image Reconstruction

[Kato+, CVPR 2014]

[Figure: the same pipeline run in reverse, from the global feature p(d; θ) back to local descriptors d_1, …, d_N; example classes: cat, camera]

Inverse problem: image reconstruction from a label

Example label: Pot

SLIDE 31

Image Reconstruction

[Kato+, CVPR 2014]

[Figure: reconstruction results for labels such as cat (Bombay), camera, grand piano, headphone, joshua tree, pyramid, wheelchair, gramophone, and pot]

Optimized arrangement using: global location cost + adjacency cost

Other examples

SLIDE 32

SLIDE 33

Video Generation

  • Image generation is still challenging

Only successful in controlled settings:
– Human faces
– Birds
– Flowers

  • Video generation is …

– Additionally requiring temporal consistency
– Extremely challenging

[Yamamoto+, ACMMM 2016]; [Vondrick+, NIPS 2016]; BEGAN [Berthelot+, Mar. 2017]; StackGAN [Zhang+, Dec. 2016]

SLIDE 34

Video Generation

  • This work: generating easy videos

– C3D (3D convolutional neural network) for conditional generation with an input label
– tempCAE (temporal convolutional auto-encoder) for regularizing the video to improve its naturalness

[Yamamoto+, ACMMM 2016]
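The temporal-consistency idea behind the tempCAE regularizer can be illustrated with a toy penalty on consecutive frames. This is an assumption-laden stand-in, not the paper's learned auto-encoder, and each "frame" here is just a short list of pixel values.

```python
# Toy illustration of temporal consistency: natural-looking clips
# change little between consecutive frames, so a smoothness penalty
# can score (or regularize) generated video. This simple sum of
# squared frame differences stands in for the learned tempCAE.

def temporal_penalty(frames):
    """Sum of squared pixel differences between consecutive frames."""
    penalty = 0.0
    for prev, cur in zip(frames, frames[1:]):
        penalty += sum((a - b) ** 2 for a, b in zip(prev, cur))
    return penalty

# Each "frame" is a short list of pixel intensities.
smooth = [[0.0, 0.0], [0.25, 0.25], [0.5, 0.5]]   # gradual motion
jumpy = [[0.0, 0.0], [1.0, 1.0], [0.0, 0.0]]      # flickering

# The smooth clip incurs a far smaller penalty than the jumpy one.
```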

SLIDE 35

Video Generation

[Yamamoto+, ACMMM 2016]

[Video results for “Car runs to left” and “Rocket flies up”: Ours (C3D+tempCAE) vs. only C3D]

SLIDE 36

Conclusion

  • MIL: Machine Intelligence Laboratory

Beyond Human Intelligence Based on Cyber-Physical Systems

  • This talk introduces some of the current research

– Recognize

  • Basic: Framework for DL, Domain Adaptation
  • Classification: Single-modality, Multi-modalities

– Describe

  • Image Captioning, Video Captioning

– Generate

  • Image Reconstruction, Video Generation