
Deep TEN: Texture Encoding Network Hang Zhang, Jia Xue, Kristin Dana - PowerPoint PPT Presentation



  1. Deep TEN: Texture Encoding Network Hang Zhang, Jia Xue, Kristin Dana 1 Hang Zhang

  2. Highlight and Overview • Introduced Encoding-Net, a new architecture of CNNs • Achieved state-of-the-art results on texture recognition: MINC-2500, FMD, GTOS, KTH, 4D-Light • Released the arXiv paper (CVPR 2017) and a Torch implementation (GPU backend)

  3. Challenges for Texture Recognition • Orderless • Distributions

  4. Classic Vision Approaches

  5. Classic Vision Approaches Feature extraction (filterbank responses or SIFT)

  6. Classic Vision Approaches Feature extraction → Dictionary Learning

  7. Classic Vision Approaches Feature extraction → Dictionary Learning → Encoding (Bag-of-Words, VQ or VLAD)

  8. Classic Vision Approaches Feature extraction → Dictionary Learning → Encoding → Classifier

  9. Classic Vision Approaches Feature extraction → Dictionary Learning → Encoding → Classifier • The input image sizes are flexible • No domain-transfer problem

  10. Comparing to the Deep Learning Framework Classic pipeline: Feature extraction → Dictionary Learning → Encoding → SVM. CNN pipeline: Convolutional features → FC Layer • Fixed input size • Preserves spatial info • Domain transfer

  11. Comparing to the Deep Learning Framework (classic encoding pipeline vs. CNN with FC layer) • Can we bridge the gap?

  12. Hybrid Solution SIFT / Filter Bank Responses → Dictionary → Histogram Encoding → BoWs → SVM

  13. Hybrid Solution and Its Limitation Two off-the-shelf variants: BoWs (SIFT / filter bank responses → Dictionary → Histogram Encoding → SVM) and FV-CNN (pre-trained CNNs → Dictionary → Fisher Vector Encoding → SVM) • The dictionary and the encoders are fixed once built • Feature learning and encoding do not benefit from the labeled data

  14. End-to-End Encoding Off-the-shelf: BoWs (SIFT / filter bank responses → Dictionary → Histogram Encoding → SVM) and FV-CNN (pre-trained CNNs → Dictionary → Fisher Vector Encoding → SVM). End-to-end: Deep-TEN (Convolutional Layers → Dictionary → Residual Encoding (Encoding Layer) → FC Layer)

  15. Bag-of-Words (BoW) Encoder • Given a set of visual features X = {x_1, ..., x_N} and a learned codebook C = {c_1, ..., c_K} (each feature is D-dimensional, N is the number of visual features, K is the number of codewords) • The assignment weight a_ik corresponds to visual feature x_i being assigned to codeword c_k. Hard-assignment: a_ik = δ(‖x_i − c_k‖² = min_{j ∈ {1,...,K}} ‖x_i − c_j‖²) • BoW counts the occurrences of the visual words: e_k = Σ_i a_ik
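The hard-assignment BoW counting above can be sketched in a few lines of NumPy; the function and variable names here are illustrative, not from the released implementation.

```python
import numpy as np

def bow_histogram(X, C):
    """Hard-assignment Bag-of-Words: count how often each codeword
    is the nearest neighbour of a feature.
    X: (N, D) visual features, C: (K, D) codebook."""
    # squared distances between every feature and every codeword: (N, K)
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
    nearest = d2.argmin(axis=1)               # closest codeword per feature
    return np.bincount(nearest, minlength=C.shape[0])

# toy example: 4 features, 2 codewords
X = np.array([[0.1, 0.0], [0.0, 0.1], [1.0, 1.1], [0.9, 1.0]])
C = np.array([[0.0, 0.0], [1.0, 1.0]])
print(bow_histogram(X, C))  # -> [2 2]
```

Note that the output length is K regardless of N, which is what makes the representation orderless and size-independent.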

  16. Residual Encoders • The Fisher Vector concatenates the gradients of a GMM with respect to the means and standard deviations: G^μ_k ∝ Σ_{i=1}^N a_ik (x_i − μ_k) / σ_k and G^σ_k ∝ Σ_{i=1}^N a_ik [ (x_i − μ_k)² / σ_k² − 1 ] • VLAD (1st order, hard-assignment): V_k = Σ_{i : NN(x_i) = c_k} (x_i − c_k)
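A minimal NumPy sketch of the VLAD case (1st-order, hard-assignment); names are illustrative.

```python
import numpy as np

def vlad(X, C):
    """VLAD: for each codeword, sum the residuals of the features
    hard-assigned to it (no normalisation, for clarity).
    X: (N, D) features, C: (K, D) codebook -> (K, D) residual sums."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
    nearest = d2.argmin(axis=1)
    V = np.zeros_like(C, dtype=float)
    for k in range(C.shape[0]):
        V[k] = (X[nearest == k] - C[k]).sum(axis=0)  # aggregate residuals
    return V

X = np.array([[0.1, 0.0], [0.0, 0.1], [1.0, 1.1], [0.9, 1.0]])
C = np.array([[0.0, 0.0], [1.0, 1.0]])
print(vlad(X, C))  # ≈ [[0.1, 0.1], [-0.1, 0.1]]
```

The practical VLAD and Fisher Vector pipelines add normalisation steps (e.g. power and L2 normalisation) that are omitted here.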

  17. Residual Encoding Model (Encoding-Layer: Input → Assign → Residuals → Aggregate, with a Dictionary) • Residual vector: r_ik = x_i − c_k • Aggregating residuals with assignment weights: e_k = Σ_i a_ik r_ik

  18. Feature Distributions and Assigning • Soft-assignment: a_ik = exp(−γ‖r_ik‖²) / Σ_{j=1}^K exp(−γ‖r_ij‖²) • Learnable smoothing factors: a_ik = exp(−s_k‖r_ik‖²) / Σ_{j=1}^K exp(−s_j‖r_ij‖²)
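Putting slides 17 and 18 together, the forward pass of the residual encoding with learnable smoothing factors can be sketched as follows in NumPy (illustrative only; the released code is a Torch/CUDA layer):

```python
import numpy as np

def encode(X, C, s):
    """Residual encoding with per-codeword smoothing factors s_k.
    X: (N, D) features, C: (K, D) codewords, s: (K,) smoothing factors.
    Returns E: (K, D) with E_k = sum_i a_ik (x_i - c_k)."""
    R = X[:, None, :] - C[None, :, :]            # residuals r_ik, (N, K, D)
    d2 = (R ** 2).sum(-1)                        # ||r_ik||^2, (N, K)
    logits = -s[None, :] * d2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    A = np.exp(logits)
    A /= A.sum(axis=1, keepdims=True)            # soft assignments a_ik
    return (A[:, :, None] * R).sum(axis=0)       # aggregate residuals

X = np.array([[0.1, 0.0], [0.0, 0.1], [1.0, 1.1], [0.9, 1.0]])
C = np.array([[0.0, 0.0], [1.0, 1.0]])
print(encode(X, C, s=np.array([1.0, 1.0])))
```

As s_k grows the soft-assignment approaches a hard one, so the output approaches VLAD; with a fixed γ for all k it is plain soft-assignment.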

  19. End-to-End Learning (Deep-TEN: Convolutional Layers → Dictionary → Residual Encoding (Encoding Layer) → FC Layer) • The loss function is differentiable w.r.t. the input X and the layer's parameters (the dictionary C and the smoothing factors s) • The Encoding Layer can therefore be trained end-to-end by standard Stochastic Gradient Descent (SGD) with backpropagation
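The differentiability claim can be checked numerically: below, a stand-in scalar loss on the encoding is differentiated w.r.t. the dictionary by finite differences, and one gradient step is taken, mimicking SGD on the codewords. All names and the loss itself are illustrative assumptions, not the paper's training objective.

```python
import numpy as np

def encode(X, C, gamma=1.0):
    R = X[:, None, :] - C[None, :, :]
    A = np.exp(-gamma * (R ** 2).sum(-1))
    A /= A.sum(axis=1, keepdims=True)
    return (A[:, :, None] * R).sum(axis=0)

def loss(X, C):
    # stand-in scalar loss on the encoding; a real model would
    # apply a classifier and cross-entropy here
    return (encode(X, C) ** 2).sum()

def num_grad(X, C, eps=1e-5):
    # central finite differences w.r.t. every codeword entry
    G = np.zeros_like(C)
    for idx in np.ndindex(*C.shape):
        Cp, Cm = C.copy(), C.copy()
        Cp[idx] += eps
        Cm[idx] -= eps
        G[idx] = (loss(X, Cp) - loss(X, Cm)) / (2 * eps)
    return G

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
C = rng.normal(size=(2, 3))
G = num_grad(X, C)                    # well-defined, finite gradient
C_new = C - 1e-3 * G                  # one "SGD" step on the dictionary
print(loss(X, C_new) < loss(X, C))    # a small step along -G moves downhill
```

In the actual layer these gradients come from backpropagation rather than finite differences, which is what lets the dictionary be learned jointly with the convolutional features.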

  20.–22. (Figure-only slides; content not captured in the transcript)

  23. Relation to Dictionary Learning • Classic dictionary learning is usually done by unsupervised grouping (e.g. K-means) or by minimizing reconstruction error (e.g. K-SVD) • The Encoding Layer makes the inherent dictionary differentiable w.r.t. the loss function and learns the dictionary in a supervised manner
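For contrast with the supervised dictionary above, here is a bare-bones Lloyd's K-means, the classic unsupervised way of building a codebook; the deterministic initialisation and toy data are illustrative choices.

```python
import numpy as np

def kmeans_dictionary(X, K, iters=20):
    """Plain Lloyd's K-means: build a codebook without any labels
    (the Encoding Layer instead learns C from the labeled data)."""
    idx = np.linspace(0, len(X) - 1, K).astype(int)
    C = X[idx].copy()                 # deterministic spread-out init
    for _ in range(iters):
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(axis=1)    # nearest codeword per feature
        for k in range(K):
            if (assign == k).any():
                C[k] = X[assign == k].mean(axis=0)  # recentre
    return C

rng = np.random.default_rng(0)
cluster_a = rng.normal(0.0, 0.1, size=(20, 2))
cluster_b = rng.normal(5.0, 0.1, size=(20, 2))
X = np.vstack([cluster_a, cluster_b])
D = kmeans_dictionary(X, K=2)
print(D)  # codewords settle near the two cluster centres
```

Nothing in this procedure sees the class labels, which is exactly the limitation the Encoding Layer removes.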

  24. Relation to BoWs and Residual Encoders (Encoding-Layer: Input → Assign → Residuals → Aggregate, with a Dictionary) • Generalizes BoWs, VLAD & Fisher Vector • Accepts arbitrary input sizes and outputs a fixed-length representation • NetVLAD decouples the codewords from their assignments: a = g(x) instead of a = g(x, c)
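The decoupling point can be made concrete: in the Encoding-Layer style, assignments are a function of both the feature and the codewords, so moving a codeword changes them; in the NetVLAD style, assignments come from a separate linear-plus-softmax branch that never sees the codewords. This NumPy sketch uses illustrative names and random data.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def coupled_assign(X, C, s=1.0):
    # Encoding-Layer style: a = g(x, c), tied to the codewords
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
    return softmax(-s * d2)

def decoupled_assign(X, W, b):
    # NetVLAD style: a = g(x), an independent linear + softmax branch
    return softmax(X @ W + b)

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))
C = rng.normal(size=(2, 3))
C_moved = C.copy()
C_moved[0] += 1.0                 # move one codeword
W = rng.normal(size=(3, 2))
b = np.zeros(2)

# moving a codeword changes the coupled assignments ...
print(np.allclose(coupled_assign(X, C), coupled_assign(X, C_moved)))
# ... while decoupled_assign(X, W, b) does not depend on C at all
```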

  25. Relation to Global Pooling Layers • Sum pooling (avg pooling): let K = 1 and c_1 = 0, then a_i1 = 1 and e_1 = Σ_{i=1}^N x_i (divide by N for averaging) • SPP-Layer (He et al., ECCV 2014): fixes the number of bins instead of the receptive field (reshaping, arbitrary input size) • Bilinear pooling (Lin et al., ICCV 2015): sum of the outer products across different locations
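The sum-pooling special case is easy to verify with the NumPy sketch of the layer: with a single codeword fixed at the origin, the soft assignment is trivially 1 and the encoding collapses to a plain sum over features (illustrative code, not the released implementation).

```python
import numpy as np

def encode(X, C, s):
    R = X[:, None, :] - C[None, :, :]
    d2 = (R ** 2).sum(-1)
    A = np.exp(-s[None, :] * d2)
    A /= A.sum(axis=1, keepdims=True)
    return (A[:, :, None] * R).sum(axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))
# K = 1 codeword fixed at the origin: the single assignment weight is
# exactly 1, so the encoding reduces to sum pooling
C = np.zeros((1, 4))
s = np.ones(1)
print(np.allclose(encode(X, C, s)[0], X.sum(axis=0)))  # True
```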

  26. Methods Overview

  27. Domain Transfer • The residual encoding representation: e_k = Σ_i a_ik r_ik • For a visual feature x_i that appears frequently in the data: • it is likely to be close to a visual center c_k • e_k is close to zero, since r_ik = x_i − c_k ≈ 0 • e_j (j ≠ k) is also close to zero, since a_ij = exp(−s_j‖r_ij‖²) / Σ_{m=1}^K exp(−s_m‖r_im‖²) ≈ 0 • The residual encoding therefore discards frequently appearing features, which are likely to be domain specific (useful for fine-tuning pre-trained features)
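This suppression effect can be demonstrated numerically: a tight cluster of "frequent" features sitting on a codeword contributes almost nothing to the encoding, while a single distinctive feature contributes a large residual. The data and parameter values below are illustrative assumptions.

```python
import numpy as np

def encode(X, C, s):
    R = X[:, None, :] - C[None, :, :]
    d2 = (R ** 2).sum(-1)
    A = np.exp(-s[None, :] * d2)
    A /= A.sum(axis=1, keepdims=True)
    return (A[:, :, None] * R).sum(axis=0)

C = np.array([[0.0, 0.0], [3.0, 3.0]])   # two codewords
s = np.array([5.0, 5.0])
rng = np.random.default_rng(0)
frequent = 0.01 * rng.normal(size=(100, 2))  # many features on top of c_0
rare = np.array([[2.0, 2.5]])                # one distinctive feature

e_freq = encode(frequent, C, s)
e_rare = encode(rare, C, s)
# frequent features cancel against their codeword; the rare one does not
print(np.linalg.norm(e_freq), np.linalg.norm(e_rare))
```

Because each feature's contribution is independent, the frequent cluster's near-zero contribution survives unchanged when the two sets are concatenated, which is why the representation transfers well across domains.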

  28. Experiments • Datasets • Gold-standard material & texture datasets: MINC-2500, KTH, FMD • Two recent datasets: GTOS, Light Field • General recognition datasets: MIT-Indoor, Caltech-101 • Baseline approaches (off-the-shelf) • FV-SIFT (128 Gaussian components, 32K → 512) • FV-CNN (Cimpoi et al.; pre-trained VGG-VD & ResNet, 32-component GMM)

  29. Dataset Examples

  30. Deep-TEN Architecture

  31. Comparing to the Baselines

  32. Multi-size Training (using different image sizes) • Deep-TEN ideally accepts arbitrary input sizes (larger than a constant) • Train with predefined sizes, alternating between them in different epochs, without modifying the solver • Adopt single-size testing for simplicity
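The property that makes multi-size training possible is that the encoding's output shape depends only on K and D, not on the number of input features. The sketch below checks this with two feature maps of different spatial sizes (the 7×7 / 14×14 sizes are illustrative, not from the paper):

```python
import numpy as np

def encode(X, C, s):
    R = X[:, None, :] - C[None, :, :]
    d2 = (R ** 2).sum(-1)
    A = np.exp(-s[None, :] * d2)
    A /= A.sum(axis=1, keepdims=True)
    return (A[:, :, None] * R).sum(axis=0)

rng = np.random.default_rng(0)
C = rng.normal(size=(8, 16))         # K = 8 codewords, D = 16 channels
s = np.ones(8)
small = rng.normal(size=(49, 16))    # e.g. a flattened 7x7 conv feature map
large = rng.normal(size=(196, 16))   # e.g. 14x14, from a larger input image
print(encode(small, C, s).shape, encode(large, C, s).shape)  # both (8, 16)
```

So switching image sizes between epochs only changes N inside the layer; everything downstream of the Encoding Layer sees the same fixed-length representation.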

  33. Multi-size Training

  34. Comparing to the State of the Art • Prior approaches (1) rely on assembling features and (2) adopt an additional SVM classifier for classification

  35. Extra Thoughts • There are many labeled datasets: object recognition, scene understanding, material recognition • How can we benefit from all of them? • Simply merging datasets (different labeling strategies) • Sharing convolutional features (domain-transfer problem)

  36. Joint Encoding • Multi-task learning: the Encoding Layer carries the domain-specific information, while the convolutional layers are generic (shared Conv Layers feeding two encoding heads, E1 for CIFAR and E2 for STL) • Joint training on two datasets: • CIFAR-10 (50,000 training images of size 36×36) • STL-10 (5,000 training images of size 96×96)

  37. Experimental Results for Joint Training • Joint training on two datasets (simple network architecture): • CIFAR-10 (50,000 training images of size 36×36) • STL-10 (5,000 training images of size 96×96) • For reference, the state of the art on CIFAR-10 is 95.4% using a 1,001-layer ResNet (He et al., ECCV 2016)

  38. Summary • Proposed a new model • Integrates dictionary learning and encoding into a single CNN layer • Generalizes residual encoders (VLAD, FV); well suited to texture recognition, achieving state-of-the-art results • Introduced a new CNN architecture • Makes the deep learning framework more flexible by allowing arbitrary input image sizes • Carries domain-specific information, making the learned features easier to transfer

  39. Thank you! • An efficient Torch implementation with CUDA backend is available at https://github.com/zhanghang1989/Deep-Encoding
