Learning Large-Scale Multimodal Data Streams – Ranking, Mining, and Machine Comprehension


  1. Learning Large-Scale Multimodal Data Streams – Ranking, Mining, and Machine Comprehension
Winston H. HSU (徐宏民), National Taiwan University & IBM T.J. Watson Ctr., New York – http://winstonhsu.info/
Hung-Yi LEE (李宏毅), National Taiwan University – http://speech.ee.ntu.edu.tw/~tlkagk/
@GTC 2017, May 8, 2017


  3. The First AI-Generated Movie Trailer – Identifying the “Horror” Factors by Multimodal Learning
▪ The first movie trailer generated by an AI system (IBM Watson), scoring scenes along factors such as tender, suspenseful, and scary
https://www.ibm.com/blogs/think/2016/08/cognitive-movie-trailer/

  4. Detecting Activities of Daily Living (ADL) from Egocentric Videos
▪ Activities of daily living – a term used in healthcare to refer to people's daily self-care activities
– Enabling technologies for exciting applications
▪ Very challenging!! (example ADL: brushing teeth)
https://www.advancedrm.com/measuring-adls-to-assess-needs-and-improve-independence/

  5. Our Proposal: Beyond Objects – Leveraging More Contexts by Multimodal Learning [Hsieh et al., ICME’16]
[Pipeline figure] Multiple mid-level representations are extracted and fused for activity recognition:
– Objects [1]: tap, cup, toothbrush, …
– Scenes (from a CNN for scene recognition over 67 scene classes): e.g., bathroom 0.8, kitchen 0.1, living room 0.01, …
– Sensors: accelerometer, microphone, heart rate
[1] Ramanan et al., Detecting Activities of Daily Living in First-person Camera Views, CVPR 2012
[2] Hsieh et al., Egocentric activity recognition by leveraging multiple mid-level representations, ICME 2016

  6. Experimental Results for ADL – Multimodal Learning Matters!
▪ Egocentric videos collected from 20 people (with Google Glass, GeneActiv)
[Bar chart: recognition accuracy (0–70%) comparing the baseline [1] with the multimodal approach [2]]
[1] Ramanan et al., Detecting Activities of Daily Living in First-person Camera Views, CVPR 2012
[2] Hsieh et al., Egocentric activity recognition by leveraging multiple mid-level representations, ICME 2016

  7. Perception/understanding is multimodal. How to design multimodal (end-to-end) deep learning frameworks?

  8. Outline
▪ Why learning with multimodal deep neural networks
▪ Techniques required for multimodal learning
▪ Sample projects
– Medical segmentation by cross-modal and sequential learning
– Cross-domain and cross-view learning for 3D retrieval
– Speech summarization
– Speech question answering
– Audio word to vector

  9. 3D Medical Segmentation by Deep Neural Networks [Tseng et al., CVPR 2017]
▪ Motivations
– 3D biomedical segmentation plays a vital role in biomedical analysis
– Brain tumors take many different shapes and can appear anywhere in the brain → localizing the tumors is very challenging
▪ Goal
– To perform 3D segmentation with deep methods, segmenting a volume by stacking all of its 2D slices (as sequences)
▪ Observation: oncologists leverage multi-modal signals in tumor diagnosis

  10. Multi-Modal Biomedical Images
▪ 3D multi-modal MRI images
– Different modalities are used to distinguish the boundaries of different tumor tissues (e.g., edema, enhancing core, non-enhancing core, necrosis)
– Four modalities: Flair, T1, T1c, T2

  11. Related Work – SegNet (2D Images)
▪ Structured as an encoder and decoder with multi-resolution fusion (MRF)
▪ But:
– Ignores multi-modalities
– Lacks sequential learning
Badrinarayanan et al., SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation, 2015

  12. 3D Medical Segmentation by Deep Neural Networks [Tseng et al., CVPR 2017]
▪ Our proposal – (first-ever) utilizing cross-modal learning in (end-to-end) sequential and convolutional neural networks, while effectively aggregating multiple resolutions
Kuan-Lun Tseng, Yen-Liang Lin, Winston Hsu, and Chung-Yang Huang. Joint Sequence Learning and Cross-Modality Convolution for 3D Biomedical Segmentation. CVPR 2017

  13. ConvLSTM – Temporally Augmented Convolutional Neural Networks
▪ Convolutional + sequential networks, e.g., convLSTM
– Modeling spatial cues through their temporal (sequential) evolution
▪ LSTM vs. convLSTM: a traditional LSTM employs dot products (fully connected transforms) in its gates; convLSTM replaces the dot products with convolutions, as sketched below
Shi et al., Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting, NIPS 2015
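To make the dot-product-vs-convolution contrast concrete, here is a minimal convLSTM cell sketch in PyTorch. The class and parameter names (ConvLSTMCell, in_ch, hidden_ch) are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal convLSTM cell sketch; a plain LSTM would use a fully
    connected (dot-product) layer where this uses a convolution."""
    def __init__(self, in_ch, hidden_ch, k=3):
        super().__init__()
        # One convolution produces all four gates at once.
        self.conv = nn.Conv2d(in_ch + hidden_ch, 4 * hidden_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state                        # hidden/cell states: (B, hidden_ch, H, W)
        gates = self.conv(torch.cat([x, h], dim=1))
        i, f, o, g = gates.chunk(4, dim=1)
        c = f.sigmoid() * c + i.sigmoid() * g.tanh()   # same gating as an LSTM,
        h = o.sigmoid() * c.tanh()                     # but on spatial feature maps
        return h, c
```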

  14. Cross-Modality Convolution (CMC)
[Architecture figure, panels (a)–(h)] For each slice, the four modalities (Flair, T2, T1, T1c) pass through a shared multi-modal encoder (Conv + Batch Norm + ReLU blocks with max pooling); the resulting C-channel feature maps are stacked into a tensor of shape C × h × w × 4 and fused by cross-modality convolution with kernels of size 4 × 1 × 1 × C; the fused per-slice features feed a convLSTM across slices, followed by a decoder (Deconv, then Conv + Batch Norm + ReLU). A sketch of the fusion step follows.
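A rough sketch of the cross-modality convolution idea, under my reading of the slide: stack the four per-slice modality feature maps and fuse them with kernels that span the modality axis but not space. All shapes here are illustrative assumptions.

```python
import torch
import torch.nn as nn

B, C, H, W = 2, 64, 32, 32           # batch, channels, spatial dims (illustrative)
feats = [torch.randn(B, C, H, W) for _ in range(4)]   # Flair, T2, T1, T1c encodings

x = torch.stack(feats, dim=2)        # (B, C, 4, H, W): modality axis as "depth"
# A 3D convolution with kernel (4, 1, 1) mixes the four modalities at each
# spatial location, acting as a learned weighted sum across modalities/channels.
cmc = nn.Conv3d(C, C, kernel_size=(4, 1, 1))
fused = cmc(x).squeeze(2)            # (B, C, H, W) fused per-slice feature map
```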

  15. Comparing with the State-of-the-Art on BRATS-2015
[Qualitative figure: (a) MRI slices, (b) ground truth, (c) U-Net, (d) CMC (ours), (e) CMC + convLSTM (ours)]
▪ MRF is effective
▪ Multi-modal encoder (MME) + CMC beats a regular encoder + decoder
▪ Two-phase training is an important strategy for the imbalanced data
▪ convLSTM (sequential modeling) helps slightly

  16. Demo: Sketch/Image-Based 3D Model Search [Liu et al., ACMMM’15] [Lee et al., 2017]
▪ Speeding up 3D design and printing
– Current 3D shape search engines take text inputs only
– Leveraging large-scale, freely available 3D models
▪ Various applications for 3D models: 3D printing, AR, 3D game design, etc.

  17. Image-Based 3D Shape Retrieval [Lee et al., 2017]
▪ To retrieve 3D shapes based on photo inputs
▪ Challenges:
– Effective feature representations of 3D shapes (with CNNs)
– Image-to-3D cross-domain similarity learning
[Figure: a query photo and its retrieved 3D shapes]

  18. Our Proposal – Cross-Domain 3D Shape Retrieval with View Sequence Learning [Lee et al., 2017]
▪ Novel proposal – end-to-end deep neural networks for cross-domain and cross-view learning, with efficient triplet learning
▪ A brand-new problem
[Pipeline figure: the query image goes through an Image-CNN and an adaptation layer to yield an image representation; each 3D shape's rendered views go through weight-shared View-CNNs and a cross-view convolution to yield a shape representation; top-ranked 3D shapes are returned by L2 distance]

  19. Cross-Domain (Distance Metric) Learning: Siamese vs. Triplet Networks
[Figure: a Siamese network passes two images (image1, image2) through identical, weight-shared networks (CNN/DNN) into a contrastive loss; a triplet network passes anchor, positive, and negative images through weight-shared networks into a triplet loss; both losses are sketched below]
Wang, Jiang, et al. Learning fine-grained image similarity with deep ranking. CVPR 2014
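For reference, a minimal sketch of the two losses in PyTorch; the margin values are assumptions, not taken from the slides.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(e1, e2, same, margin=1.0):
    # same == 1 for matching pairs, 0 for non-matching ones
    d = F.pairwise_distance(e1, e2)
    return (same * d.pow(2) + (1 - same) * F.relu(margin - d).pow(2)).mean()

def triplet_loss(anchor, pos, neg, margin=1.0):
    # Push the anchor closer to the positive than to the negative by a margin.
    d_pos = F.pairwise_distance(anchor, pos)
    d_neg = F.pairwise_distance(anchor, neg)
    return F.relu(d_pos - d_neg + margin).mean()
```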

  20. Baseline: MVCNN – 3D Shape Feature by Max Pooling, Ignoring Sequences
▪ Straightforward, but ignores the view sequence
– Each rendered view is passed through the same CNN (shared weights, conv1 → pool5)
– View-pooling is an element-wise MAX POOLING over the per-view pool5 features (output the same size as pool5), followed by fc6/fc7 (4096-D) and fc8 for classification (airplane, bed, …, car, …); see the sketch below
Su, Hang, et al. Multi-view convolutional neural networks for 3D shape recognition. ICCV 2015
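A minimal sketch of MVCNN-style view pooling: an element-wise max over the V per-view features, which discards view order entirely. Shapes are illustrative assumptions.

```python
import torch

V, B, D = 12, 2, 9216                  # views, batch, flattened per-view feature dim
view_feats = torch.randn(B, V, D)      # pool5 features, one row per rendered view
shape_feat, _ = view_feats.max(dim=1)  # (B, D): permuting the views changes nothing
```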

  21. Our Proposal: Cross-Domain Triplet NN with View Sequence Learning
▪ Cross-view convolution aggregates multi-view features
▪ The adaptation layer maps image features into the joint embedding space
▪ Late triplet sampling speeds up the training of cross-domain triplet learning

  22. Cross-View Convolution (CVC)
▪ Stack the feature maps from V views by channel: V × (H × W × C) → H × W × V × C
▪ Convolve the stacked tensor with K kernels of size 1 × 1 × V × C
– Setting K == C keeps #output channels == #input channels (for fair comparisons)
– K = C = 256, the number of channels in AlexNet's pool5 feature map
▪ CVC works as a weighted summation across views and channels of the CNN features, as sketched below
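A sketch of the cross-view convolution under my reading of the slide: stack the V per-view feature maps and fuse them with kernels that span all views at each spatial location. Shapes and variable names are illustrative assumptions.

```python
import torch
import torch.nn as nn

B, V, C, H, W = 2, 12, 256, 6, 6      # batch, views, channels (AlexNet pool5), spatial
views = torch.randn(B, V, C, H, W)    # per-view CNN feature maps, in view order

x = views.permute(0, 2, 1, 3, 4)      # (B, C, V, H, W): view axis as "depth"
# K = C kernels of size (V, 1, 1): each output channel is a learned weighted
# sum across all views and input channels, preserving the channel count.
cvc = nn.Conv3d(C, C, kernel_size=(V, 1, 1))
shape_feat = cvc(x).squeeze(2)        # (B, C, H, W) aggregated shape feature
```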

  23. Late Triplet Sampling (Fast-CDTNN) – Speeding Up Cross-Domain Learning
▪ A naive cross-domain triplet neural network (CDTNN) has three streams
▪ Fast-CDTNN has two streams: it forwards the sampled images and 3D shapes once, then enumerates the triplet combinations at an integrated triplet-loss layer (see the sketch below)
▪ In our experiments, Fast-CDTNN is ~4–5x faster
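A sketch of the late-sampling idea, under my reading of the slide: embed each image and shape once (two streams), then enumerate cross-domain triplets on the cached embeddings inside the loss. Function and variable names are assumptions; it also assumes each anchor has at least one positive and one negative shape in the batch.

```python
import torch
import torch.nn.functional as F

def late_triplet_loss(img_emb, shape_emb, labels, margin=1.0):
    """img_emb: (B, D) image embeddings; shape_emb: (B, D) shape embeddings;
    labels: (B,) class ids shared across the two domains."""
    d = torch.cdist(img_emb, shape_emb)          # (B, B) all image-shape distances
    same = labels[:, None] == labels[None, :]    # matching image/shape pairs
    losses = []
    for i in range(d.size(0)):                   # image i acts as the anchor
        d_pos = d[i][same[i]]                    # distances to matching shapes
        d_neg = d[i][~same[i]]                   # distances to non-matching shapes
        # enumerate all (positive, negative) combinations for this anchor
        losses.append(F.relu(d_pos[:, None] - d_neg[None, :] + margin).mean())
    return torch.stack(losses).mean()
```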
