Learning Large-Scale Multimodal Data Streams – Ranking, Mining, and Machine Comprehension


SLIDE 1

Learning Large-Scale Multimodal Data Streams – Ranking, Mining, and Machine Comprehension

Hung-Yi LEE (李宏毅), National Taiwan University – http://speech.ee.ntu.edu.tw/~tlkagk/

Winston H. HSU (徐宏民), National Taiwan University & IBM TJ Watson Ctr., New York – http://winstonhsu.info/

@GTC 2017, May 8, 2017

SLIDE 2

SLIDE 3

The First AI-Generated Movie Trailer – Identifying the “Horror” Factors by Multimodal Learning

▪ The first movie trailer generated by an AI system (IBM Watson)

[Figure: trailer scenes scored along emotional dimensions – tender, scary, suspenseful]

https://www.ibm.com/blogs/think/2016/08/cognitive-movie-trailer/

SLIDE 4

Detecting Activities of Daily Living (ADL) from Egocentric Videos

▪ Activities of daily living – used in healthcare to refer to people's daily self-care activities
  – Enabling technologies for exciting applications
▪ Very challenging!!

[Figure: example ADL – brushing teeth]

https://www.advancedrm.com/measuring-adls-to-assess-needs-and-improve-independence/

SLIDE 5

Our Proposal: Beyond Objects – Leveraging More Contexts by Multimodal Learning

Objects [1]:
  • tap
  • cup
  • toothbrush

Scenes:
  • Bathroom: 0.8
  • Kitchen: 0.1
  • Living room: 0.01
  • …

Sensors:
  • Accelerometer
  • Mic.
  • Heart rate

[Figure: CNN for scene recognition over 67 scene categories]

[1] Pirsiavash and Ramanan, Detecting Activities of Daily Living in First-person Camera Views, CVPR 2012
[2] Hsieh et al., Egocentric activity recognition by leveraging multiple mid-level representations, ICME 2016

[Hsieh et al., ICME’16]
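To make the fusion concrete, here is a minimal late-fusion sketch in Python (the feature names, dimensions, and the logistic-regression classifier are illustrative assumptions, not the exact pipeline of [2]):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical pre-extracted features for N video segments.
N = 200
object_scores = np.random.rand(N, 3)    # e.g., detector scores: tap, cup, toothbrush
scene_probs   = np.random.rand(N, 67)   # e.g., softmax over 67 scene categories
sensor_stats  = np.random.rand(N, 8)    # e.g., accelerometer / mic / heart-rate statistics
labels        = np.random.randint(0, 10, size=N)  # 10 ADL classes (illustrative)

# Late fusion by simple concatenation of the per-modality representations.
fused = np.concatenate([object_scores, scene_probs, sensor_stats], axis=1)

clf = LogisticRegression(max_iter=1000).fit(fused, labels)
print("train accuracy:", clf.score(fused, labels))
```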

SLIDE 6

Experimental Results for ADL – Multimodal Learning Matters!

▪ Egocentric videos collected from 20 people (with Google Glass and GENEActiv sensors)

[Figure: accuracy (0%–70%) for different modality combinations – multimodal fusion performs best]

[1] Pirsiavash and Ramanan, Detecting Activities of Daily Living in First-person Camera Views, CVPR 2012
[2] Hsieh et al., Egocentric activity recognition by leveraging multiple mid-level representations, ICME 2016

SLIDE 7

Perception/understanding is multimodal. How to design multimodal (end-to-end) deep learning frameworks?

SLIDE 8

Outline

▪ Why learn with multimodal deep neural networks
▪ Required techniques for multimodal learning
▪ Sample projects
  – Medical segmentation by cross-modal and sequential learning
  – Cross-domain and cross-view learning for 3D retrieval
  – Speech summarization
  – Speech question answering
  – Audio word to vector

SLIDE 9

3D Medical Segmentation by Deep Neural Networks

▪ Motivations
  – 3D biomedical segmentation plays a vital role in biomedical analysis.
  – Brain tumors come in many shapes and can appear anywhere in the brain, making tumor localization very challenging.
▪ Goal
  – Perform 3D segmentation with deep methods, segmenting by stacking all the 2D slices (sequences).
▪ Observation: oncologists leverage multi-modal signals in tumor diagnosis

[Tseng et al., CVPR 2017]

SLIDE 10

Multi-Modal Biomedical Images

▪ 3D multi-modal MRI images
  – Different modalities are used to distinguish the boundaries of different tumor tissues (e.g., edema, enhancing core, non-enhancing core, necrosis)
  – Four modalities: FLAIR, T1, T1c, T2

[Figure: example slices of the four modalities – T1, T1c, T2, FLAIR]

SLIDE 11

Related Work – SegNet (2D Image)

▪ Structured as an encoder and decoder with multi-resolution fusion (MRF)
▪ But
  – ignores multi-modality
  – lacks sequential learning

Badrinarayanan et al., SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation, 2015

SLIDE 12

3D Medical Segmentation by Deep Neural Networks

▪ Our proposal – (first-ever) utilizing cross-modal learning in end-to-end sequential and convolutional neural networks, effectively aggregating multiple resolutions

[Tseng et al., CVPR 2017]

Kuan-Lun Tseng, Yen-Liang Lin, Winston Hsu and Chung-Yang Huang. Joint Sequence Learning and Cross-Modality Convolution for 3D Biomedical Segmentation. CVPR 2017

SLIDE 13

ConvLSTM – Temporally Augmented Convolutional Neural Networks

▪ Convolutional + sequential networks, e.g., convLSTM
  – Modeling spatial cues in temporal (sequential) evolution
▪ LSTM vs. convLSTM: a traditional LSTM computes its gates with matrix (dot) products; convLSTM replaces those products with convolutions.

Shi et al., Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting, NIPS 2015
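A minimal convLSTM cell sketch in PyTorch (layer sizes are illustrative assumptions; the gate equations follow the idea in Shi et al., with the matrix products of a standard LSTM replaced by convolutions):

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """One convLSTM step: all four gates computed by a single convolution."""
    def __init__(self, in_ch, hid_ch, kernel=3):
        super().__init__()
        self.hid_ch = hid_ch
        # Input and hidden state are concatenated; the output has 4*hid_ch
        # channels, one slice per gate (input, forget, output, candidate).
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch,
                               kernel_size=kernel, padding=kernel // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

# Usage on a toy sequence of feature maps (batch=1, 8 channels, 32x32).
cell = ConvLSTMCell(in_ch=8, hid_ch=16)
h = torch.zeros(1, 16, 32, 32)
c = torch.zeros(1, 16, 32, 32)
for t in range(5):                      # iterate over slices / time steps
    h, c = cell(torch.randn(1, 8, 32, 32), (h, c))
print(h.shape)                          # torch.Size([1, 16, 32, 32])
```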

SLIDE 14

Cross Modality Convolution (CMC) – For Each Slice

[Figure: per-slice pipeline – each slice's four modalities (FLAIR, T1, T1c, T2) pass through the multi-modal encoder (Conv + Batch Norm + ReLU blocks with max pooling), then cross-modality convolution (a kernel of size 4 x 1 x 1 x C convolved over the C x h x w x 4 tensor that stacks the modalities), then convolution LSTM across slices 1 … n, and finally the decoder (deconv plus Conv + Batch Norm + ReLU); detailed structure in Figure 2 of the paper.]
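A sketch of the cross-modality convolution step alone (shapes are illustrative assumptions; a 4 x 1 x 1 x C kernel spanning all modalities and channels at one spatial location can be realized as a 1x1 convolution over the concatenated channels):

```python
import torch
import torch.nn as nn

C, h, w = 256, 16, 16          # illustrative feature-map size per modality
feats = [torch.randn(1, C, h, w) for _ in range(4)]  # FLAIR, T1, T1c, T2

# Stack the four modality feature maps along the channel axis: (1, 4*C, h, w).
stacked = torch.cat(feats, dim=1)

# A 1x1 convolution over the stacked channels realizes the 4 x 1 x 1 x C
# cross-modality kernel: each of the K outputs is a learned weighted sum
# across all modalities and channels at every spatial location.
K = C                           # keep #output channels == #input channels
cmc = nn.Conv2d(4 * C, K, kernel_size=1)
fused = cmc(stacked)
print(fused.shape)              # torch.Size([1, 256, 16, 16])
```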

SLIDE 15

Comparing with the State-of-the-Art on BRATS-2015

[Figure: qualitative comparison – (a) MRI slices, (b) ground truth, (c) U-Net, (d) CMC (ours), (e) CMC + convLSTM (ours)]

▪ MRF is effective
▪ MME + CMC outperforms a regular encoder + decoder
▪ Two-phase training is an important strategy for imbalanced data
▪ convLSTM (sequential modeling) helps slightly

SLIDE 16

Sketch/Image-Based 3D Model Search

▪ Speeding up 3D design and printing
  – Current 3D shape search engines take text inputs only
  – Leveraging large-scale, freely available 3D models
▪ Various applications of 3D models: 3D printing, AR, 3D game design, etc.

[Liu et al., ACMMM’15] demo

[Lee et al., 2017]

SLIDE 17

Image-Based 3D Shape Retrieval

▪ To retrieve 3D shapes based on photo inputs
▪ Challenges:
  – Effective feature representations of 3D shapes (with CNNs)
  – Image-to-3D cross-domain similarity learning

[Figure: a query photo and its retrieved 3D shapes]

[Lee et al., 2017]

SLIDE 18

Our Proposal – Cross-Domain 3D Shape Retrieval with View Sequence Learning

▪ Novel proposal – end-to-end deep neural networks for cross-domain and cross-view learning with efficient triplet learning
▪ A brand-new problem

[Figure: the query image passes through an Image-CNN and an adaptation layer; each 3D shape's rendered views pass through a shared View-CNN and cross-view convolution; the resulting image and shape representations are ranked by L2 distance to return the top-ranked 3D shapes.]

[Lee et al., 2017]

SLIDE 19

Cross-Domain (Distance Metric) Learning: Siamese vs. Triplet Networks

[Figure: a Siamese network computes a contrastive loss over an image pair; a triplet network computes a triplet loss over anchor, positive, and negative images; in both, the CNN/DNN streams are identical with shared weights.]

Wang, Jiang, et al. "Learning fine-grained image similarity with deep ranking." CVPR 2014.
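A minimal triplet-loss sketch (hinge form with a margin; the names, shapes, and margin value are illustrative):

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss pulling the anchor toward the positive and away from the negative."""
    d_pos = F.pairwise_distance(anchor, positive)   # ||a - p||_2
    d_neg = F.pairwise_distance(anchor, negative)   # ||a - n||_2
    return F.relu(d_pos - d_neg + margin).mean()

# Toy embeddings: a batch of 8 vectors in a 128-D joint space.
a, p, n = (torch.randn(8, 128) for _ in range(3))
print(triplet_loss(a, p, n))
```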

SLIDE 20

Baseline: MVCNN – 3D Shape Feature by Max Pooling, Ignoring Sequences

▪ Straightforward but ignores view sequences
  – Each view is passed through the same CNN (shared weights)
  – View-pooling is a MAX POOLING operation

[Figure: rendered views feed shared CNN streams (conv1 → pool5); view-pooling outputs a tensor the same size as pool5, followed by fc6/fc7 (4096-D feature) and fc8 class scores (airplane, car, bed, …).]

Su, Hang, et al. "Multi-view convolutional neural networks for 3D shape recognition." ICCV 2015.
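A sketch of MVCNN-style view pooling (element-wise max across the V view feature maps; sizes are illustrative):

```python
import torch

V, C, H, W = 12, 256, 6, 6                  # 12 rendered views, pool5-like maps
view_feats = torch.randn(V, C, H, W)        # one feature map per view (shared CNN)

# View-pooling: element-wise max across the view axis; the result has the
# same size as a single pool5 map and is order-invariant (sequences ignored).
pooled = view_feats.max(dim=0).values
print(pooled.shape)                         # torch.Size([256, 6, 6])
```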

SLIDE 21

Our Proposal: Cross-Domain Triplet NN with View Sequence Learning

▪ Cross-view convolution aggregates multi-view features
▪ The adaptation layer adapts image features to the joint embedding space
▪ Late triplet sampling speeds up the training of cross-domain triplet learning

SLIDE 22

Cross-View Convolution (CVC)

▪ Stack the feature maps from V views by channel: V x (H x W x C) → H x W x V x C
▪ Convolve the new tensor with K kernels (1 x 1 x V x C)
  – Setting K == C keeps #output channels == #input channels (for fair comparison)
  – K = C = 256 = #channels of the AlexNet pool5 feature map
▪ CVC works as a weighted summation across views and channels, as sketched below
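A minimal CVC sketch (shapes are illustrative; K kernels of size 1 x 1 x V x C, spanning all views and channels at each spatial location, can be realized as a 1x1 convolution over the V*C stacked channels):

```python
import torch
import torch.nn as nn

V, H, W, C = 12, 6, 6, 256                   # 12 views of AlexNet-pool5-like maps
views = torch.randn(V, C, H, W)

# Stack the V view feature maps along the channel axis: (1, V*C, H, W).
stacked = views.reshape(1, V * C, H, W)

# K kernels of size 1 x 1 x V x C == a 1x1 conv over V*C channels: a learned
# weighted summation across views and channels at each spatial location.
K = C                                        # K == C for comparability
cvc = nn.Conv2d(V * C, K, kernel_size=1)
print(cvc(stacked).shape)                    # torch.Size([1, 256, 6, 6])
```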

SLIDE 23

Late Triplet Sampling (Fast-CDTNN) – Speeding Up Cross-Domain Learning

▪ A naive cross-domain triplet neural network (CDTNN) has three streams
▪ Fast-CDTNN has two streams: it forwards the sampled images/3D shapes once and enumerates the triplets (combinations) at the integrated triplet loss layer, as sketched below
▪ In our experiments, Fast-CDTNN is ~4x-5x faster.
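A sketch of late triplet sampling under assumed shapes and labels: embeddings are computed once per stream, and all valid (anchor, positive, negative) combinations are enumerated at the loss layer instead of forwarding three full streams:

```python
import torch
import torch.nn.functional as F

def late_triplet_loss(img_emb, img_lbl, shape_emb, shape_lbl, margin=0.2):
    """Enumerate cross-domain triplets from already-computed embeddings."""
    # Pairwise distances between every image and every 3D shape: (N_img, N_shape).
    d = torch.cdist(img_emb, shape_emb)
    same = img_lbl[:, None] == shape_lbl[None, :]      # positive-pair mask
    losses = []
    for i in range(img_emb.size(0)):                   # each image is an anchor
        d_pos = d[i][same[i]]                          # distances to positives
        d_neg = d[i][~same[i]]                         # distances to negatives
        if len(d_pos) and len(d_neg):
            # All positive x negative combinations for this anchor.
            losses.append(F.relu(d_pos[:, None] - d_neg[None, :] + margin).mean())
    return torch.stack(losses).mean()

# Toy batch: 16 image and 16 shape embeddings over 4 classes.
img_e, shp_e = torch.randn(16, 128), torch.randn(16, 128)
img_y, shp_y = torch.randint(0, 4, (16,)), torch.randint(0, 4, (16,))
print(late_triplet_loss(img_e, img_y, shp_e, shp_y))
```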

SLIDE 24

Comparisons and Datasets

▪ New image/3D shape dataset: 12,311 3D shapes and 10,000 images across 40 categories

Method                            mAP
AlexNet pool5 [1]                 7.16%
MVCNN [2]                         7.92%
Joint Embedding [3]               3.44%
CDTNN + view pooling [2]          40.85%
CDTNN + Adaptation Layer          47.84%
CDTNN + Adaptation Layer + CVC    52.67%

[1] Krizhevsky, Sutskever, and Hinton. "ImageNet classification with deep convolutional neural networks." NIPS 2012.
[2] Su, Hang, et al. "Multi-view convolutional neural networks for 3D shape recognition." ICCV 2015.
[3] Li, Yangyan, et al. "Joint embeddings of shapes and images via CNN image purification." ACM Trans. Graph., 2015.

SLIDE 25

Sample Results and Demo

[Figure: sample retrieval results for queries such as bathtub, person, bed, car, bookshelf, keyboard, guitar]

demo

SLIDE 26

Speech Summarization

SLIDE 27

Summarization

➢ Select the most informative segments to form a compact version
➢ The machine does not write summaries in its own words

[Figure: extractive summarization picks segments from the audio file (e.g., "…… deep learning is powerful ……") to form the summary.]

[Lee & Lee, Interspeech 12] [Lee & Lee, ICASSP 13] [Shiang & Lee, Interspeech 13]

SLIDE 28

Abstractive Summarization

  • Now the machine can do abstractive summarization (write summaries in its own words)
  • Title generation: an abstractive summary in one sentence

[Figure: trained on data with titles, the machine generates a title without hand-crafted rules, in its own words.]

SLIDE 29

Sequence-to-sequence

  • Input: transcriptions of audio (from automatic speech recognition, ASR); output: title

[Figure: an RNN encoder reads through the input transcription w1 w2 w3 w4 (hidden states h1 … h4); an RNN generator then emits the title words wA wB ….]
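A minimal encoder-generator sketch in PyTorch (toy vocabulary, sizes, and greedy decoding; not the exact architecture of the paper):

```python
import torch
import torch.nn as nn

vocab, emb_dim, hid = 5000, 128, 256           # toy sizes
embed = nn.Embedding(vocab, emb_dim)
encoder = nn.GRU(emb_dim, hid, batch_first=True)
generator = nn.GRU(emb_dim, hid, batch_first=True)
out = nn.Linear(hid, vocab)

transcript = torch.randint(0, vocab, (1, 30))  # ASR transcription token ids
_, h = encoder(embed(transcript))              # read through the input

# Greedy title generation, seeded with a <bos> token (id 1, illustrative).
token, title = torch.tensor([[1]]), []
for _ in range(10):
    y, h = generator(embed(token), h)
    token = out(y).argmax(-1)                  # pick the most likely next word
    title.append(token.item())
print(title)
```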

SLIDE 30

Sequence-to-sequence

  • Training data: 2M story-headline pairs

                                                 ROUGE-1   ROUGE-L
Manual transcription as input                    26.8      23.9
ASR transcription as input                       21.3      20.0
ASR transcription as input + special structure   22.9      20.9

The special structure learns to ignore misrecognized words when generating the title.

[Yu, Lee, Lee, SLT 2016]

SLIDE 31

Demo

  • http://140.112.30.37:2401/
  • https://www.youtube.com/watch?v=X3BapMl7Wv8
  • https://www.youtube.com/watch?v=hFVKpVMB-Rc
  • https://www.youtube.com/watch?v=hYf3fARyNvg
  • https://www.youtube.com/watch?v=usi8EUabU7Y
  • https://www.youtube.com/watch?v=FUmd6EnVeWw
  • From SONG TUYEN NEWS: https://www.youtube.com/channel/UC-P4mEcWZVrFfdZIuiODiTg

SLIDE 32

Speech Question Answering

SLIDE 33

Speech Question Answering

Speech question answering: the machine answers questions based on the information in spoken content.

Question: "What is a possible origin of Venus’ clouds?"
Answer: "Gases released as a result of volcanic activity"

SLIDE 34

New task for Machine Comprehension of Spoken Content

  • TOEFL Listening Comprehension Test by Machine

Question: "What is a possible origin of Venus’ clouds?"
Audio story: (the original story is 5 min long)
Choices:
  (A) gases released as a result of volcanic activity
  (B) chemical reactions caused by high surface temperatures
  (C) bursts of radio energy from the planet's surface
  (D) strong winds that blow dust into the atmosphere

SLIDE 35

New task for Machine Comprehension of Spoken Content

  • TOEFL Listening Comprehension Test by Machine

[Figure: the question "What is a possible origin of Venus’ clouds?" and the ASR transcriptions of the audio story feed a neural network, which picks one of the 4 choices, e.g., (A), as the answer.]

Using previous exams to train the network

SLIDE 36

Model Architecture

Question: "What is a possible origin of Venus’ clouds?"

Audio story, ASR transcription (excerpt): "…… It be quite possible that this be due to volcanic eruption because volcanic eruption often emit gas. If that be the case volcanism could very well be the root cause of Venus 's thick cloud cover. And also we have observe burst of radio energy from the planet 's surface. These burst be similar to what we see when volcano ……"

[Figure: speech recognition feeds semantic analysis of the story; semantic analysis of the question drives attention over the story; the model selects the choice most similar to the computed answer.]

The whole model is learned end-to-end.

SLIDE 37

More Details


SLIDE 38

Demo

SLIDE 39

Experimental Results

[Figure: accuracy (%) of random guessing and naive approaches]

Example naive approach:
  1. Find the paragraph containing the most key terms from the question.
  2. Select the choice containing the most key terms from that paragraph.

It is not easy to get a high score without understanding.
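A sketch of this naive key-term baseline (the tokenization and data layout are illustrative assumptions):

```python
def naive_qa(question, paragraphs, choices):
    """Pick the choice sharing the most key terms with the best paragraph."""
    q_terms = set(question.lower().split())
    # 1. Paragraph with the most question key terms.
    best_para = max(paragraphs, key=lambda p: len(q_terms & set(p.lower().split())))
    p_terms = set(best_para.lower().split())
    # 2. Choice with the most key terms from that paragraph.
    return max(choices, key=lambda c: len(p_terms & set(c.lower().split())))

story = ["volcanic eruption often emit gas",
         "winds blow dust into the atmosphere"]
print(naive_qa("what is a possible origin of venus clouds",
               story,
               ["gases released by volcanic activity",
                "strong winds that blow dust"]))
```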

SLIDE 40

Experimental Results

[Figure: accuracy (%) over random guessing and naive approaches – the proposed models reach 42.2% [Tseng, Shen, Lee, Lee, Interspeech’16] and 48.8% [Fan, Hsu, Lee, Lee, SLT’16].]

SLIDE 41

Analysis

  • There are three types of questions

Type 3: Connecting Information
  ➢ Understanding organization
  ➢ Connecting content
  ➢ Making inferences

SLIDE 42

Analysis

  • There are three types of questions

Type 2: Pragmatic Understanding
  ➢ Understanding the function of what is said
  ➢ Understanding the speaker's attitude

SLIDE 43

Corpus & Code

https://github.com/sunprinceS/Hierarchical-Attention-Model

SLIDE 44

Audio Word to Vector

SLIDE 45

Framework of Spoken Language Understanding Tasks

[Figure: speech recognition turns spoken content into text, which feeds spoken content retrieval, dialogue, speech summarization, and speech question answering.]

Can we bypass speech recognition?

Why? Learning speech recognition needs manual transcriptions of lots of audio, and most languages have little transcribed data.

New research direction: Audio Word to Vector

SLIDE 46

Typical Word to Vector

  • The machine represents each word by a vector representing its meaning
  • Learned from lots of text without supervision

[Figure: embedding space where dog/cat/rabbit, jump/run, and flower/tree form clusters]
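A minimal sketch with gensim (toy corpus; a real embedding needs far more text):

```python
from gensim.models import Word2Vec

# Toy corpus; repeated so the model sees each context several times.
sentences = [["dog", "runs"], ["cat", "runs"], ["rabbit", "jumps"],
             ["flower", "blooms"], ["tree", "grows"]] * 50
model = Word2Vec(sentences, vector_size=32, min_count=1, window=2)
print(model.wv.most_similar("dog", topn=2))   # nearby words in the vector space
```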

SLIDE 47

Audio Word to Vector

  • The machine also represents each audio segment by a vector (word-level)
  • Learned from lots of audio without supervision

[Figure: an audio segment maps to a word-level vector]

[Chung, Wu, Lee, Lee, Interspeech 16]

SLIDE 48

Sequence-to-sequence Auto-encoder

[Figure: the acoustic features x1 x2 x3 x4 of an audio segment feed an RNN encoder; the encoder's final state is the vector we want.]

We use a sequence-to-sequence auto-encoder here; the training is unsupervised.

SLIDE 49

Sequence-to-sequence Auto-encoder

[Figure: the RNN encoder reads the acoustic features x1 x2 x3 x4 of the audio segment; the RNN generator reconstructs them as y1 y2 y3 y4 from the encoder's vector.]

The RNN encoder and generator are jointly trained to reconstruct the input acoustic features.
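A minimal sequence-to-sequence auto-encoder sketch in PyTorch (toy feature sizes; teacher forcing on the generator; the reconstruction target is the input itself, so no labels are needed):

```python
import torch
import torch.nn as nn

feat_dim, hid = 39, 128                        # e.g., MFCC-like features (toy sizes)
encoder = nn.GRU(feat_dim, hid, batch_first=True)
generator = nn.GRU(feat_dim, hid, batch_first=True)
readout = nn.Linear(hid, feat_dim)

x = torch.randn(1, 4, feat_dim)                # one segment: x1 x2 x3 x4
_, z = encoder(x)                              # z is the audio-word vector

# Reconstruct the sequence from z (teacher forcing with the inputs here).
y, _ = generator(x, z)
loss = nn.functional.mse_loss(readout(y), x)   # train encoder+generator jointly
loss.backward()
print(z.squeeze().shape)                       # torch.Size([128])
```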

SLIDE 50

What does the machine learn?

  • Typical word to vector captures semantics:
    W(Rome) − W(Italy) + W(Germany) ≈ W(Berlin)
    W(king) − W(queen) + W(aunt) ≈ W(uncle)

  • Audio word to vector captures phonetic information:
    V(GIRL) − V(PEARL) + V(PEARLS) ≈ V(GIRLS)
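A sketch of how such analogies can be checked (toy random vectors stand in for the trained embeddings; with real vectors the nearest neighbor should be GIRLS):

```python
import numpy as np

# Toy embedding table; real vectors come from the trained encoder.
emb = {w: np.random.rand(16) for w in ["GIRL", "PEARL", "PEARLS", "GIRLS"]}

query = emb["GIRL"] - emb["PEARL"] + emb["PEARLS"]

def nearest(q, table):
    # Cosine similarity against every stored vector.
    return max(table, key=lambda w: q @ table[w] /
               (np.linalg.norm(q) * np.linalg.norm(table[w])))

print(nearest(query, emb))   # ideally "GIRLS" with real vectors
```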

SLIDE 51

Demo

SLIDE 52

Next Step ……

  • Audio word to vector with semantics

[Figure: a desired embedding space where dog/cat/cats, walk/walked/run, and flower/tree form clusters]

One day we can build all spoken language understanding applications directly from audio word to vector.

SLIDE 53

Take-Home Messages

▪ Multimodal deep learning is a "must" for practical applications
  – Perception/understanding is multimodal
  – Sensors are complementary and low-cost
▪ Dealing with issues including
  – proper networks with domain knowledge, imbalanced training data, cross-modality/cross-domain learning, training strategies, etc.
▪ Promising in many tasks
  – segmentation, ranking, summarization, question answering, unsupervised representation

SLIDE 54

Thanks and Comments!

Hung-Yi LEE (李宏毅), National Taiwan University
Winston H. HSU (徐宏民), National Taiwan University & IBM TJ Watson Ctr., New York