deep structured learning

Deep Structured Learning, Chunhua Shen, School of Computer Science



  1. Deep Structured Learning. Chunhua Shen, School of Computer Science, The University of Adelaide. www.cs.adelaide.edu.au/~chhshen

  2. Street scene understanding

  3. Pedestrian detection. Our improved 2014 result was the best one before CVPR2015. S. Paisitkriangkrai, C. Shen, A. van den Hengel. Pedestrian Detection with Spatially Pooled Features and Structured Ensemble Learning, IEEE T. PAMI 2015. http://arxiv.org/abs/1409.5209

  4. Pedestrian Detection: average miss rate (Caltech data)

  5. Car detection. KITTI car dataset (average precision) with 70% overlap. We are currently ranked 2nd.

  6. Speed sign

  7. Caution sign

  8. Car and sign detection

  9. Car and sign detection

  10. Car and sign detection

  11. Text in the wild: BigData (0.5M labelled data) + Deep Learning
      • ICDAR2003 text-in-the-wild detection results

      | Method | Precision | Recall | F-measure |
      |---|---|---|---|
      | Our Method | 0.84 | 0.70 | 0.76 |
      | Max et al. (ECCV2014) | 0.89 | 0.66 | 0.75 |
      | Huang et al. (ECCV2014) | 0.84 | 0.67 | 0.75 |
      | Neumann et al. (ICDAR2011) | 0.65 | 0.64 | 0.63 |

      Slide annotation: Oxford Visual Factory, acquired by Google 2014.

  12. Text in the wild. Left: green rectangles are our results; red rectangles are ground truth. Right: results of http://textspotter.org/

  13. Text in the wild. Left: green rectangles are our results; red rectangles are ground truth. Right: results of http://textspotter.org/

  14. Car license plate detection
      • Caltech Cars-markus 1999 dataset (http://www.vision.caltech.edu/html-files/archive.html)

      | Method | Precision | Recall |
      |---|---|---|
      | Our Method | 0.960 | 0.952 |
      | Zhou (TIP2012) | 0.955 | 0.848 |
      | Lim (ICSUDET2010) | 0.837 | 0.905 |

      • AOLP dataset (http://aolpr.ntust.edu.tw/lab/)

      | Subset | Method | Precision | Recall |
      |---|---|---|---|
      | Subset_AC | Ours | 0.987 | 0.985 |
      | Subset_AC | Hsu (TVT2013) | 0.91 | 0.96 |
      | Subset_LE | Ours | 0.977 | 0.976 |
      | Subset_LE | Hsu (TVT2013) | 0.91 | 0.95 |

  15. Car license plate detection

  16. Face recognition. Deep learning based system with 400k labelled data. Faces in the wild (LFW) face verification task: ~98.2% (single model). Best reported result at CVPR2015 (deep learning + 200M labelled data): Google FaceNet, ~99.7%.

  17. Large scale image classification. Top-5 error rate: ~8% on ImageNet ILSVRC2014 (parallel training on workstations with 8 K40 GPUs). Best reported result (Google): ~4.8%.

  18. Image Captioning. Pipeline: image → CNN (extract image features) → attributes/labels/locations prediction → LSTM (language modeling) → image captions. Example caption: "A person is sitting at a dining table with plate of fruit. A tablet and bouquet of red flowers are on the side."

  19. Fig: Qualitative results for images in Microsoft COCO dataset. Selected attributes and corresponding detection scores are shown at the right side of each image. Our generated caption (in black), the Baseline results (in blue) and a human caption (in red) are shown below.

  20. Our team currently achieves the best result on 3 of the 7 evaluation metrics (BLEU-1, 2, 3). We also achieve a top-5 ranking on the other evaluation metrics.

  21. Visual Question Answering. Diagram components: image; CNN (extract image features); attributes/labels/locations prediction; knowledge base; question understanding; LSTM (language modeling); generate answer. Example: Q: What kind of glasses are they drinking out of? A: Wine

  22. Visual Question Answering
      Example 1
      • Top-5 attributes: couch, sleeping, brown, laying, dog
      • Generated caption: a dog laying on top of a couch next to a pillow.
      • Question answering:
        - Does this animal appear to be resting? yes (yes)
        - What is the color of the cushions? brown (brown)
      Example 2
      • Top-5 attributes: group, people, children, cake, grass
      • Generated caption: a group of children standing around a cake.
      • Question answering:
        - What kind of event does this look like? birthday party (birthday)
        - Is this woman trying to give the kids an uncooked cake? no (no)

  23. Visual Question Answering
      Example 1
      • Top-5 attributes: shelf, small, book, room, television
      • Generated caption: a living room with a couch and a television.
      • Question answering:
        - What pattern is on the curtain? floral (leaves)
        - What sport is being displayed on the television? football (football)
        - Is there a bookcase nearby? yes (yes)
      Example 2
      • Top-5 attributes: top, cake, fruits, table, plates
      • Generated caption: a table topped with plates of food.
      • Question answering:
        - What are the orange sticks? carrots (carrots)
        - How many carrots are in the bowls? 2 (over 10)
        - Is this set up for a party? yes (yes)

  24. Table. Results on the open-answer task for various question types on MS COCO/VQA. All results are the percentage of answers in agreement with human subjects. ‡ indicates that ground-truth attribute labels are used. Human results are also reported for reference. Reference: Image Captioning with an Intermediate Attributes Layer, Qi Wu, Chunhua Shen, Anton van den Hengel, Lingqiao Liu, Anthony Dick. http://arxiv.org/abs/1506.01144

  25. Deep Structured Learning

  26. Background: Neural Networks. Feed-forward neural networks; supervised learning. Fully-connected network: multi-layer perceptron (MLP). Convolutional neural network: LeNet [LeCun98]. Other networks: recurrent network (speech/text parsing).

  27. Convolutional Neural Network. Convolutional neural net vs. fully connected net, viewed as a 1-dimensional convolution. Input: 5*1; output: 3*1. Conv layer: apply a 3*1 filter; number of weights (model parameters): 3. Fully connected layer: number of weights: 15. CNN layer properties: 1. sparse connectivity (spatially local connections); 2. shared weights (same colour: shared weights), regardless of spatial position.
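
The weight counts on slide 27 can be checked with a small sketch. The filter values and the helper name `conv1d_valid` are illustrative assumptions, not taken from the slides; the point is only that a 3*1 filter slid over a 5*1 input yields a 3*1 output with 3 shared weights, while the equivalent dense layer needs a 3x5 matrix (15 weights).

```python
import numpy as np

def conv1d_valid(x, w):
    """Slide filter w over x without padding (cross-correlation, as in CNN layers)."""
    n_out = len(x) - len(w) + 1
    return np.array([np.dot(x[i:i + len(w)], w) for i in range(n_out)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # input: 5*1
w = np.array([1.0, 0.0, -1.0])            # filter: 3*1 (hypothetical values)

y_conv = conv1d_valid(x, w)               # output: 3*1

# The same mapping written as a fully connected layer: a 3x5 weight matrix
# whose rows are shifted copies of the filter (sparse and weight-shared).
W_fc = np.zeros((3, 5))
for i in range(3):
    W_fc[i, i:i + 3] = w
y_fc = W_fc @ x

conv_weights = w.size      # free parameters in the conv layer
fc_weights = W_fc.size     # free parameters in a general dense layer
print(conv_weights, fc_weights)
```

The dense matrix here is mostly zeros and repeats the same three values in every row, which is exactly the sparse connectivity and weight sharing the slide lists.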

  28. Network layers. A network contains not only Conv/FC layers but also: pooling layer, cross-channel normalisation layer, dropout layer, ...

  29. CNN learning. Stochastic gradient descent (SGD). Mini-batch: a small number of examples for one gradient update. Momentum (avoids gradient fluctuation). Hyperparameters for the parameter update in the t-th iteration: learning rate 0.01 (decreasing); momentum 0.9; weight decay 0.0005.
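
The update formula on slide 29 is not present in the transcript. As a hedged sketch, one common form of SGD with momentum and weight decay is shown below, using the hyperparameter values listed on the slide. The function name `sgd_momentum_step` and the toy quadratic objective are assumptions for illustration, and the slide's exact formula may differ.

```python
import numpy as np

def sgd_momentum_step(w, v, grad, lr=0.01, momentum=0.9, weight_decay=0.0005):
    """One parameter update: v <- m*v - lr*(grad + wd*w), then w <- w + v.

    This is an assumed, commonly used form, not necessarily the slide's.
    """
    v_new = momentum * v - lr * (grad + weight_decay * w)
    w_new = w + v_new
    return w_new, v_new

# Toy problem (hypothetical): minimise f(w) = 0.5 * ||w||^2, whose gradient is w.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for _ in range(300):
    grad = w                      # a mini-batch gradient would go here
    w, v = sgd_momentum_step(w, v, grad)

print(np.linalg.norm(w))          # close to zero after 300 updates
```

In practice `grad` would be the loss gradient averaged over one mini-batch, and `lr` would be decreased over time as the slide indicates.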

  30. Summary: CNN is “simple”. The whole network: repeating a simple building block ...

  31. Summary: CNN is difficult. Design the architecture; hard to train (non-convex); GPUs. Why 224? Why 11*11? Why 5*5? Why 9 layers? ...

  32. Depth estimation from single monocular images ● Depth acquisition: – Depth sensors, e.g., Kinect – Machine learning methods ● Most vision datasets are still RGB images ● Estimate depth from single RGB images – Ill-posed problem

  33. Depth Estimation From Single Monocular Images

  34. Depth Estimation From Single Monocular Images ● Useful – Scene understanding – 3D modelling – Benefit other vision tasks ● e.g., semantic labellings, pose estimations ● Challenging – No reliable depth cues ● e.g., stereo correspondence, motion information

  35. Our method ● Joint learning: Continuous CRF + deep CNN ● Exact maximization of log-likelihood ● Closed form solution for MAP inference

  36. Deep Convolutional Neural Fields (11/09/2015)

  37. Continuous CRF ● Given image x with labels y, the CRF models the conditional probability density function ● Z(x): the partition function – Z(x) is integrable here (y is continuous)

  38. Continuous CRF ● Energy function (unary + pairwise): ● Depth estimation (MAP inference):

  39. Continuous CRF ● Energy function (unary + pairwise): ● Depth estimation (MAP inference):
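
The formulas on the continuous CRF slides are not present in the transcript. Assuming the slides follow the formulation in the Liu, Shen and Lin paper cited on the conclusion slide, the model can be sketched as below. Here $\mathcal{N}$ (the set of superpixels) and $\mathcal{S}$ (the set of neighbouring superpixel pairs) are notation borrowed from that paper rather than from the slides.

```latex
\begin{align}
\Pr(\mathbf{y}\mid\mathbf{x}) &= \frac{1}{Z(\mathbf{x})}\exp\bigl(-E(\mathbf{y},\mathbf{x})\bigr), \\
Z(\mathbf{x}) &= \int_{\mathbf{y}} \exp\bigl(-E(\mathbf{y},\mathbf{x})\bigr)\,\mathrm{d}\mathbf{y}, \\
E(\mathbf{y},\mathbf{x}) &= \sum_{p\in\mathcal{N}} U(y_p,\mathbf{x})
  + \sum_{(p,q)\in\mathcal{S}} V(y_p,y_q,\mathbf{x}), \\
\mathbf{y}^{\star} &= \operatorname*{argmax}_{\mathbf{y}} \Pr(\mathbf{y}\mid\mathbf{x}).
\end{align}
```

Because $\mathbf{y}$ is continuous, $Z(\mathbf{x})$ is an integral rather than a sum over discrete labellings, which is why exact likelihood maximisation is feasible.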

  40. Potential functions ● Pairwise potential

  41. Learning ● Minimize the negative conditional log-likelihood – Optimization: back propagation

  42. Depth prediction ● MAP inference
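
The closed-form MAP inference mentioned on the "Our method" slide can be illustrated with a small sketch. It assumes the potentials used in the Liu, Shen and Lin paper cited on the conclusion slide: a quadratic unary term (y_p - z_p)^2, where z_p is the depth regressed by the CNN for superpixel p, and a pairwise term 0.5 * R_pq * (y_p - y_q)^2 summed over ordered neighbour pairs, with R a symmetric, non-negative similarity matrix with zero diagonal. Under those assumptions the energy is quadratic in y and the minimiser is y* = A^{-1} z with A = I + D - R, where D is the diagonal matrix of row sums of R. The toy values of `z` and `R` are hypothetical.

```python
import numpy as np

def map_depth(z, R):
    """Closed-form MAP estimate y* = A^{-1} z, with A = I + D - R (assumed form)."""
    n = z.shape[0]
    D = np.diag(R.sum(axis=1))          # degree matrix of the similarity graph
    A = np.eye(n) + D - R               # identity plus graph Laplacian
    return np.linalg.solve(A, z)

def energy(y, z, R):
    """E(y) = sum_p (y_p - z_p)^2 + 0.5 * sum_{p,q} R_pq (y_p - y_q)^2."""
    unary = np.sum((y - z) ** 2)
    diff = y[:, None] - y[None, :]
    pairwise = 0.5 * np.sum(R * diff ** 2)   # sum over ordered pairs
    return unary + pairwise

# Three superpixels in a chain (hypothetical values).
z = np.array([1.0, 2.0, 4.0])                # unary (CNN) depth predictions
R = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])              # neighbour similarities

y_star = map_depth(z, R)
print(y_star, energy(y_star, z, R))
```

Since A is the identity plus a graph Laplacian, it is positive definite for non-negative R, so the linear system always has a unique solution and no iterative inference is needed.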

  43. Model speedup using fully-conv networks and superpixel pooling ● Problem with DCNF model: inefficient – Need to perform convolutions for each superpixel image patch ● Model speedup: DCNF-FCSP – Deep convolutional neural fields with fully convolutional networks and superpixel pooling – Only need to perform convolutions over the entire image once

  44. DCNF-FCSP ● Fully conv nets → convolution maps ● Superpixel pooling → convolution features of superpixels

  45. Fully conv nets ● Perform convolutions over the input image to obtain convolution maps ● Deeper networks, e.g., VGG-16, GoogleNet, can be used

  46. Superpixel pooling ● Average pooling within superpixels
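
Average pooling within superpixels can be sketched as follows. The sketch assumes the convolution maps have already been brought to the same spatial resolution as the superpixel label map; the function name `superpixel_average_pool` and the toy arrays are illustrative, not from the slides.

```python
import numpy as np

def superpixel_average_pool(feature_map, labels, n_superpixels):
    """Average the C-dimensional features of all pixels in each superpixel.

    feature_map: float array of shape (H, W, C)
    labels:      int array of shape (H, W), values in [0, n_superpixels)
    returns:     array of shape (n_superpixels, C)
    """
    H, W, C = feature_map.shape
    flat_feats = feature_map.reshape(-1, C)
    flat_labels = labels.reshape(-1)

    sums = np.zeros((n_superpixels, C))
    np.add.at(sums, flat_labels, flat_feats)       # unbuffered scatter-add per superpixel
    counts = np.bincount(flat_labels, minlength=n_superpixels).astype(float)
    counts = np.maximum(counts, 1.0)               # guard against empty superpixels
    return sums / counts[:, None]

# Toy 2x2 map with one channel and two superpixels (top row, bottom row).
feature_map = np.array([[[1.0], [2.0]],
                        [[3.0], [4.0]]])
labels = np.array([[0, 0],
                   [1, 1]])
pooled = superpixel_average_pool(feature_map, labels, n_superpixels=2)
print(pooled)   # one feature vector per superpixel
```

With this layout the convolutions run once over the whole image, and each superpixel's feature vector is obtained by pooling rather than by a separate forward pass per patch.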

  47. Baseline comparison ● NYU v2

  48. DCNF vs. DCNF-FCSP ● Training time comparison

  49. State-of-the-art comparison ● NYU v2

  50. State-of-the-art comparison ● Make3D

  51. Prediction examples: NYU v2

  52. Prediction examples: Make3D

  53. Conclusion ● Deep convolutional neural fields for monocular image depth estimation ● Combine deep CNN and continuous CRF ● General learning framework. Reference: Learning Depth from Single Monocular Images Using Deep Convolutional Neural Fields, Fayao Liu, Chunhua Shen, Guosheng Lin, CVPR2015. http://arxiv.org/abs/1502.07411
