Deep Structured Learning
Chunhua Shen, School of Computer Science, The University of Adelaide
www.cs.adelaide.edu.au/~chhshen
PowerPoint PPT Presentation


SLIDE 1

Deep Structured Learning

Chunhua Shen School of Computer Science, The University of Adelaide www.cs.adelaide.edu.au/~chhshen


SLIDE 2

Street scene understanding

SLIDE 3

Pedestrian detection


Our 2014 improved result, best one before CVPR2015

  • S. Paisitkriangkrai, C. Shen, A. van den Hengel. Pedestrian Detection with Spatially Pooled Features and Structured Ensemble Learning, IEEE T. PAMI 2015. http://arxiv.org/abs/1409.5209

SLIDE 4

Pedestrian Detection: average miss rate (Caltech data)

SLIDE 5

Car detection


Kitti car dataset (average precision) with 70% overlap. We are ranked 2nd currently.

SLIDE 6

SLIDE 7

Speed sign

SLIDE 8

Caution sign

SLIDE 9

Car and sign detection

SLIDE 10

Car and sign detection

SLIDE 11

Car and sign detection

SLIDE 12

Text in the wild


BigData (0.5M Labelled data) + Deep Learning

Method                      Precision  Recall  F-measure
Our Method                  0.84       0.70    0.76
Max et al. (ECCV2014)       0.89       0.66    0.75
Huang et al. (ECCV2014)     0.84       0.67    0.75
Neumann et al. (ICDAR2011)  0.65       0.64    0.63

  • ICDAR2003 text-in-the-wild detection results

Oxford Visual Factory, acquired by Google 2014

SLIDE 13

Text in the wild


Left: green rectangles are our results; red rectangles are ground truth. Right: results of http://textspotter.org/

SLIDE 14

Text in the wild


Left: green rectangles are our results; red rectangles are ground truth. Right: results of http://textspotter.org/

SLIDE 15

Car license plate detection

Method             Precision  Recall
Our Method         0.960      0.952
Zhou (TIP2012)     0.955      0.848
Lim (ICSUDET2010)  0.837      0.905

  • Caltech Cars-markus 1999 dataset (http://www.vision.caltech.edu/html-files/archive.html)

Subset_AC                             Subset_LE
Method         Precision  Recall     Method         Precision  Recall
Ours           0.987      0.985      Ours           0.977      0.976
Hsu (TVT2013)  0.91       0.96       Hsu (TVT2013)  0.91       0.95

  • AOLP dataset (http://aolpr.ntust.edu.tw/lab/)
SLIDE 16

Car license plate detection


SLIDE 17

Face recognition

Deep-learning-based system trained with 400k labelled images. Labeled Faces in the Wild (LFW) face verification task: ~98.2% (single model). Best reported result at CVPR2015 (deep learning + 200M labelled images): Google FaceNet, ~99.7%.

SLIDE 18

Large scale Image Classification

Top-5 error rate: ~8% on ImageNet ILSVRC2014 (parallel training on workstations with 8 K40 GPUs). Best reported result (Google): ~4.8%.

SLIDE 19

Image Captioning

Pipeline: Image → Extract Image Features (CNN) → Attributes/Labels/Locations prediction → Language Modeling (LSTM) → Image Captions

Example caption: "A person is sitting at a dining table with a plate of fruit. A tablet and bouquet of red flowers are on the side."

SLIDE 20

Fig: Qualitative results for images in Microsoft COCO dataset. Selected attributes and corresponding detection scores are shown at the right side of each image. Our generated caption (in black), the Baseline results (in blue) and a human caption (in red) are shown below.

SLIDE 21

Our team currently achieves the best result on 3 out of 7 evaluation metrics (BLEU-1, 2, 3). We also achieve a top-5 ranking on the other evaluation metrics.
SLIDE 22

Visual Question Answering

Q: What kind of glasses are they drinking out of ?

Pipeline: Image → Extract Image Features (CNN) → Attributes/Labels/Locations prediction → Language Modeling (LSTM) + Question Understanding + Knowledge Base → Generate Answer

A: Wine

SLIDE 23

Visual Question Answering

Top-5 Attributes

  • couch, sleeping, brown, laying, dog

Question Answering

  • Does this animal appear to be resting?
  • yes (yes)
  • What is the color of the cushions?
  • brown (brown)

Generated caption

  • a dog laying on top of a couch next to a pillow.

Top-5 Attributes

  • group, people, children, cake, grass

Question Answering

  • What kind of event does this look like?
  • birthday party (birthday)
  • Is this woman trying to give the kids an uncooked cake?
  • no (no)

Generated caption

  • a group of children standing around a cake.
SLIDE 24

Visual Question Answering

Top-5 Attributes

  • top, cake, fruits, table, plates

Question Answering

  • What are the orange sticks?
  • carrots (carrots)
  • How many carrots are in the bowls?
  • 2 (over 10)
  • Is this set up for a party?
  • yes (yes)

Generated caption

  • a table topped with plates of food.

Top-5 Attributes

  • shelf, small, book, room, television

Question Answering

  • What pattern is on the curtain?
  • floral (leaves)
  • What sport is being displayed on the television?
  • football (football)
  • Is there a bookcase nearby?
  • yes (yes)

Generated caption

  • a living room with a couch and a television.

SLIDE 25

Table. Results on the open-answer task for various question types on MS COCO/VQA. All results are the percentage of answers in agreement with human subjects. ‡ indicates ground-truth attribute labels are used. Human results are also reported for reference.

Image Captioning with an Intermediate Attributes Layer

Qi Wu, Chunhua Shen, Anton van den Hengel, Lingqiao Liu, Anthony Dick http://arxiv.org/abs/1506.01144

SLIDE 26

Deep Structured Learning

SLIDE 27

Background: Neural Networks

Feed-forward neural networks

Supervised learning; fully-connected network

Multi-layer perceptron (MLP)

Convolutional Neural Network

LeNet [LeCun98]

Other networks:

Recurrent Network (speech/text parsing)

SLIDE 28

Convolutional Neural Network

Fully connected net vs. convolutional neural net (1-dimensional convolution):

  • Fully connected: number of weights = 15 (model parameters)
  • Convolutional: number of weights = 3

CNN layer properties:

  • 1. sparse connectivity (spatially-local connections)
  • 2. shared weights (same colour: shared weights, regardless of spatial position)

Another aspect: input 5*1; the conv layer applies a filter 3*1; output 3*1.
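The 1-D convolution described above can be sketched in a few lines of NumPy; the input and filter values below are illustrative, not from the slides:

```python
import numpy as np

def conv1d_valid(x, w):
    """1-D 'valid' convolution (cross-correlation form).

    The same 3 filter weights are reused at every position
    (weight sharing), and each output sees only 3 neighbouring
    inputs (sparse, spatially-local connectivity).
    """
    k = len(w)
    return np.array([np.dot(x[i:i + k], w) for i in range(len(x) - k + 1)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # input: 5*1
w = np.array([1.0, 0.0, -1.0])           # filter: 3*1 (only 3 weights)
y = conv1d_valid(x, w)                   # output: 3*1
```

A fully connected layer mapping the same 5 inputs to 3 outputs would need 5*3 = 15 weights; the conv layer reuses the same 3.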

SLIDE 29

Network layers

A network contains not only Conv/FC layers, but also:

  • Pooling layer
  • Cross-channel normalisation layer
  • Drop-out layer
  • ...

SLIDE 30

CNN learning

Stochastic gradient descent (SGD)

Mini-batch: a small number of examples per gradient update. Momentum (avoids gradient fluctuation).

Parameter update in the t-th iteration: momentum 0.9; learning rate 0.01, decreasing; weight decay 0.0005.
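A minimal sketch of one such update, using the slide's hyper-parameters (momentum 0.9, learning rate 0.01, weight decay 0.0005); the exact form of the update rule is an assumption, following the common velocity formulation:

```python
import numpy as np

def sgd_momentum_step(w, v, grad, lr=0.01, momentum=0.9, weight_decay=0.0005):
    """One SGD-with-momentum update; v is the velocity buffer.

    The weight-decay term acts as L2 regularisation, pulling w
    towards zero; momentum smooths successive gradient directions.
    """
    v = momentum * v - weight_decay * lr * w - lr * grad
    return w + v, v

w = np.array([1.0, -2.0])    # current parameters
v = np.zeros_like(w)         # initial velocity
grad = np.array([0.5, 0.5])  # gradient from one mini-batch
w, v = sgd_momentum_step(w, v, grad)
```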

SLIDE 31

Summary

CNN is “simple”

...

The whole network: repeating simple building blocks

SLIDE 32

Summary

CNN is difficult

  • designing the architecture
  • hard to train (non-convex)
  • needs GPUs
  • ...

Why 224? Why 11*11? Why 9 layers? Why 5*5?

SLIDE 33

Depth estimation from single monocular images

  • Depth acquisition:

– Depth sensors, e.g., Kinect – Machine learning methods

  • Most vision datasets are still RGB images
  • Estimate depth from single RGB images

– Ill-posed problem

SLIDE 34

Depth Estimation From Single Monocular Images

SLIDE 35

Depth Estimation From Single Monocular Images

  • Useful

– Scene understanding – 3D modelling – Benefit other vision tasks

  • e.g., semantic labellings, pose estimations
  • Challenging

– No reliable depth cues

  • e.g., stereo correspondence, motion information
SLIDE 36

Our method

  • Joint learning: Continuous CRF + deep CNN
  • Exact maximization of log-likelihood
  • Closed form solution for MAP inference
SLIDE 37

11/09/2015

Deep Convolutional Neural Fields

SLIDE 38

Continuous CRF

  • Given image x with labels y, the CRF models the conditional probability density function
  • Z(x) is the partition function

– Z(x) is integrable here (y is continuous)
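The equations referenced on this slide do not survive in this transcript; as a sketch consistent with the DCNF paper cited later (Liu, Shen, Lin, CVPR2015), the model is:

```latex
P(\mathbf{y}\mid\mathbf{x}) \;=\; \frac{1}{Z(\mathbf{x})}\,
  \exp\!\big(-E(\mathbf{y},\mathbf{x})\big),
\qquad
Z(\mathbf{x}) \;=\; \int \exp\!\big(-E(\mathbf{y},\mathbf{x})\big)\,\mathrm{d}\mathbf{y}.
```

Because the depths y are continuous and the energy is quadratic in y, the Gaussian integral defining Z(x) has an analytical form, which is what makes exact maximization of the log-likelihood possible.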

SLIDE 39

Continuous CRF

  • Energy function (unary + pairwise):
  • Depth estimation (MAP inference):
SLIDE 40

Continuous CRF

  • Energy function (unary + pairwise):
  • Depth estimation (MAP inference):
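The slide's formulas are missing from the transcript; a hedged reconstruction from the DCNF paper, where z_p(x) is the CNN-regressed depth of superpixel p and R_pq(x) is a CNN-predicted pairwise similarity, is:

```latex
E(\mathbf{y},\mathbf{x}) \;=\;
  \sum_{p}\big(y_p - z_p(\mathbf{x})\big)^2
  \;+\; \sum_{(p,q)} \tfrac{1}{2}\,R_{pq}(\mathbf{x})\,(y_p - y_q)^2,
\qquad
\mathbf{y}^\ast \;=\; \operatorname*{argmax}_{\mathbf{y}}\; P(\mathbf{y}\mid\mathbf{x}).
```

Because E is quadratic in y, the MAP solution y* is obtained in closed form by solving a linear system.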
SLIDE 41

Potential functions

  • Pairwise potential
SLIDE 42

Learning

  • Minimize the negative conditional log-likelihood

– Optimization: back propagation

SLIDE 43

Depth prediction

  • MAP inference
SLIDE 44

Model speedup using fully-conv networks and superpixel pooling

  • Problem with the DCNF model: inefficient

– Need to perform convolutions for each superpixel image patch

  • Model speedup: DCNF-FCSP

– Deep convolutional neural fields with fully convolutional networks and superpixel pooling

– Only need to perform convolutions over the entire image once

SLIDE 45

DCNF-FCSP

  • Fully-conv nets → convolution maps
  • Superpixel pooling → convolution features of superpixels
SLIDE 46

Fully-conv nets

  • Perform convolutions over the input image to obtain convolution maps
  • Deeper networks, e.g., VGG-16, GoogLeNet, can be used

SLIDE 47

Superpixel pooling

  • Average pooling within superpixels
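Average pooling of convolution features within superpixels can be sketched as follows; the array shapes, example values, and dictionary output format are illustrative assumptions:

```python
import numpy as np

def superpixel_avg_pool(conv_map, seg):
    """Average-pool convolution features within each superpixel.

    conv_map: (H, W, C) convolution maps from the fully-conv net.
    seg:      (H, W) integer superpixel labels.
    Returns {superpixel id: C-dimensional mean feature vector}.
    """
    return {int(s): conv_map[seg == s].mean(axis=0) for s in np.unique(seg)}

# Tiny 2x2 image with 3 feature channels and 2 superpixels.
conv_map = np.arange(2 * 2 * 3, dtype=float).reshape(2, 2, 3)
seg = np.array([[0, 0],
                [1, 1]])
feats = superpixel_avg_pool(conv_map, seg)
```

Convolutions run once over the whole image; only this cheap pooling step is per-superpixel, which is the source of the speedup.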
SLIDE 48

Baseline comparison

  • NYU v2
SLIDE 49

DCNF vs. DCNF-FCSP

  • Training time comparison
SLIDE 50

State-of-the-art comparison

  • NYU v2
SLIDE 51

State-of-the-art comparison

  • Make3D
SLIDE 52

Prediction examples: NYU v2

SLIDE 53

Prediction examples: Make3D

SLIDE 54

Conclusion

  • Deep convolutional neural fields for monocular image depth estimation

  • Combine deep CNN and continuous CRF
  • General learning framework

Learning Depth from Single Monocular Images Using Deep Convolutional Neural Fields Fayao Liu, Chunhua Shen, Guosheng Lin CVPR2015 http://arxiv.org/abs/1502.07411

SLIDE 55

Deep structured model

  • CNNs+CRFs

– formulate CNN based potential functions in CRFs

  • Convolutional Neural Networks (CNNs): feature learning
  • Conditional random fields (CRFs): model relations
  • Application on semantic segmentation

– capture the contextual information to improve performance

SLIDE 56

CRFs+CNNs

Conditional likelihood and energy function: the CNN-based (log-) potential function (factor function) can be a unary, pairwise, or higher-order potential.

  • CNN-based unary potential: measures the labelling confidence of a single variable
  • CNN-based pairwise potential: measures the confidence of the pairwise label configuration (variables y1, y2)

Factor graph: a factorization of the joint distribution of variables.
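The conditional likelihood and energy function shown (as images) on this slide can be sketched as follows; the exact parameterisation is an assumption based on the slide's description:

```latex
P(\mathbf{y}\mid\mathbf{x};\boldsymbol{\theta}) \;=\;
  \frac{1}{Z(\mathbf{x};\boldsymbol{\theta})}
  \exp\!\big(-E(\mathbf{y},\mathbf{x};\boldsymbol{\theta})\big),
\qquad
E(\mathbf{y},\mathbf{x};\boldsymbol{\theta}) \;=\;
  \sum_{p} U(y_p,\mathbf{x};\boldsymbol{\theta}_U)
  \;+\; \sum_{(p,q)} V(y_p,y_q,\mathbf{x};\boldsymbol{\theta}_V),
```

where the unary potential U and pairwise potential V are outputs of CNNs with parameters θ_U and θ_V respectively.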

SLIDE 57

Challenges in Learning CRFs+CNNs

Prediction can be made by marginal inference (e.g., message passing). CRF-CNN joint learning: learn the CNN potential functions by optimizing the CRF objective, typically minimizing the negative conditional log-likelihood (NLL), with the CNN parameters learned by stochastic gradient descent.

The partition function Z brings difficulties for optimization: each SGD iteration requires approximate marginal inference to calculate the factor marginals. CNN training needs a large number of SGD iterations, so training becomes intractable.

SLIDE 58
Solutions

  • Traditional approach:

– Applying approximate learning objectives: replace the optimization objective to avoid inference (e.g., piecewise training, pseudo-likelihood)
– Aims to learn the potential functions and then perform inference for the final prediction

  • Our approach:

– Directly target the final prediction
– Not learning the potential functions; instead, learn CNN estimators to directly output the required intermediate values in an inference algorithm
– Focus on message-passing based inference for prediction (specifically loopy BP)
– Directly learn CNNs to predict the messages

SLIDE 59

Piecewise training of CRF

A variable is conditioned only on the neighbours within a single factor.
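As a sketch (the slide's formula is not preserved in the transcript), piecewise training replaces the globally-normalised likelihood with a product of locally-normalised per-factor likelihoods:

```latex
P(\mathbf{y}\mid\mathbf{x}) \;\approx\; \prod_{F} P_F(\mathbf{y}_F\mid\mathbf{x}),
\qquad
P_F(\mathbf{y}_F\mid\mathbf{x}) \;=\;
  \frac{\exp\!\big(-E_F(\mathbf{y}_F,\mathbf{x})\big)}
       {\sum_{\mathbf{y}'_F}\exp\!\big(-E_F(\mathbf{y}'_F,\mathbf{x})\big)}.
```

Each factor normalises only over its own variables' labels, so no global (approximate) inference is needed inside the SGD loop.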

SLIDE 60

[Figure: One Unary/Pairwise Potential Network. One input image is resized into an image pyramid (scale levels 1, 2, 3); each scale passes through Network Part-1 (6 conv blocks) to produce feature maps, which are upsampled and concatenated into a combined feature map; features for one node/edge then pass through Network Part-2 (2 FC layers).]

  • Fig. 4 – An illustration of the details of one unary or one pairwise potential network. An input image is first resized into 3 scales; each resized image goes through 6 convolution blocks to output 3 feature maps. A CRF graph is then constructed, and node or edge features are generated from the feature maps. Node or edge features go through a network to generate the unary or pairwise potential network outputs. Finally, the network outputs are fed into a CRF loss function in the training stage, or an MAP inference objective for prediction.

Network Part-1:
  Conv block 1: 3 x 3 conv 64; 3 x 3 conv 64; 2 x 2 pooling
  Conv block 2: 3 x 3 conv 128; 3 x 3 conv 128; 2 x 2 pooling
  Conv block 3: 3 x 3 conv 256; 3 x 3 conv 256; 3 x 3 conv 256; 2 x 2 pooling
  Conv block 4: 3 x 3 conv 512; 3 x 3 conv 512; 3 x 3 conv 512; 2 x 2 pooling
  Conv block 5: 3 x 3 conv 512; 3 x 3 conv 512; 3 x 3 conv 512; 2 x 2 pooling
  Conv block 6: 7 x 7 conv 4096; 3 x 3 conv 512; 3 x 3 conv 512

Network Part-2:
  2 fully-connected layers: FC 512; FC 21 (unary) or FC 441 (pairwise)

  • Fig. 5 – The detailed configuration of the networks. “Network Part-1” and “Network Part-2” are described in Fig. 4. Here “Network Part-2” contains two layers, with the number of output units equal to the number of classes K (unary) or K^2 (pairwise).

Efficient Piecewise Training of Deep Structured Models for Semantic Segmentation, arXiv 2015

SLIDE 61

belief propagation: message passing based inference

A simple example of marginal inference on the node y2 (chain y1 - y2 - y3):

  • Message: a K-dimensional vector, where K is the number of classes (node states)
  • Variable-to-factor message
  • Factor-to-variable message
  • Marginal distribution (beliefs) of one variable: product of its incoming factor-to-variable messages
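The sum-product computation on this three-node chain can be sketched in NumPy; the potential tables below are made-up illustrative numbers, and on a tree (as here) belief propagation recovers the exact marginal:

```python
import numpy as np

# Chain y1 -- f12 -- y2 -- f23 -- y3, each variable has K = 2 states.
phi1 = np.array([0.6, 0.4])                 # unary potential of y1
phi2 = np.array([0.5, 0.5])                 # unary potential of y2
phi3 = np.array([0.3, 0.7])                 # unary potential of y3
psi12 = np.array([[0.9, 0.1], [0.1, 0.9]])  # pairwise factor f12
psi23 = np.array([[0.8, 0.2], [0.2, 0.8]])  # pairwise factor f23

# Leaf variable-to-factor messages are just the unaries; each
# factor-to-variable message then sums out the other variable.
m_f12_to_y2 = psi12.T @ phi1   # sum over y1: K-dimensional vector
m_f23_to_y2 = psi23 @ phi3     # sum over y3: K-dimensional vector

# Belief (marginal) of y2: unary times incoming messages, normalised.
belief_y2 = phi2 * m_f12_to_y2 * m_f23_to_y2
belief_y2 /= belief_y2.sum()

# Brute-force check against the exact marginal.
joint = np.einsum('i,j,k,ij,jk->ijk', phi1, phi2, phi3, psi12, psi23)
exact = joint.sum(axis=(0, 2))
exact /= exact.sum()
assert np.allclose(belief_y2, exact)
```

On loopy graphs these same updates are iterated (loopy BP) and the beliefs become approximate marginals.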

SLIDE 62

CNN message estimators

  • Directly learn a CNN function to output the message vector

– Don't need to learn the potential functions

The factor-to-variable message: a message prediction function formulated by a CNN. The dependent message feature vector encodes all dependent messages from the neighbouring nodes that are connected to node p by the factor F, together with the input image region.

SLIDE 63

Learning CNN message estimator

  • Define the cross-entropy loss between the ideal marginal and the estimated marginal
  • The optimization problem for learning
  • The variable marginals estimated by the CNN
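A minimal sketch of this loss; the softmax parameterisation of the estimated marginal and the example scores are assumptions for illustration:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over class scores."""
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(ideal, estimated, eps=1e-12):
    """Cross entropy between the ideal marginal (ground-truth label
    as a delta distribution) and the estimated marginal."""
    return -np.sum(ideal * np.log(estimated + eps))

# Hypothetical K = 3 class scores from the CNN message estimator
# for one node; the estimated marginal is their softmax.
scores = np.array([2.0, 0.5, -1.0])
est_marginal = softmax(scores)
ideal = np.array([1.0, 0.0, 0.0])  # ground-truth label of the node
loss = cross_entropy(ideal, est_marginal)
```

Minimising this loss over all nodes trains the CNN to output messages whose resulting beliefs match the ground-truth labels.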

SLIDE 64

Application on semantic segmentation

  • Semantic segmentation: pixel labelling
  • CNNs+CRFs: capture contextual information to improve the label

prediction of one image patch

Pipeline: Network C1 (multi-scale FCNNs) → feature map (d channels) → create the CRF graph (create nodes and pairwise connections)

SLIDE 65

Create nodes in the CRF graph: one node corresponds to one spatial position in the feature map.

Generate pairwise connections: one node connects to the nodes that lie within a spatial range box (box with the dashed lines).

SLIDE 66

Pipeline: Network C1 (multi-scale FCNNs) → feature map → create the CRF graph (nodes and pairwise connections) → perform message passing inference with the message estimator → Network C2 (fully-connected layers) predicts all factor-to-variable messages → calculate the marginals on all nodes → prediction

Networks C1 and C2 need to be learned.

SLIDE 67

Segmentation examples

SLIDE 68

SLIDE 69

video

SLIDE 70

Leaderboard

Deeply Learning the Messages in Message Passing Inference Guosheng Lin, Chunhua Shen, Ian Reid, Anton van den Hengel NIPS 2015. http://arxiv.org/abs/1506.02108

SLIDE 71

Conclusions

I am a big fan of Deep Learning + Big Data

  • GPU hungry: tens of K40’s, Titan, K80’s

Dense (per-pixel) prediction: computer vision research has been shifting from per-image prediction to per-pixel prediction, so structured learning plays an increasingly important role.