Fan Yang
National Engineering Laboratory for Video Technology School of EE & CS, Peking University
Multi-task Learning for Precise Object Search from Massive Images/Videos
Outline
Introduction
Motivation
Challenge
Multi-task learning for precise object search
Summary
IEEE Fellow ACM Fellow
Video Coding Lab System Lab Testing Lab New Media Lab SoC Lab
National Engineering Laboratory for Video Technology
Video coding algorithm: Wen Gao, Siwei Ma, Ruiqin Xiong
Video coding standards. Cooperation: CCTV, Huawei, AVS Industry Alliance
Intelligent video analysis: Tiejun Huang, Yonghong Tian, Wei Zeng, Yaowei Wang
Analyzing and mining surveillance videos; recognition-friendly video coding. Cooperation: China Security & Protection, Hisense
Mobile Visual Search: Linyu Duan, Shiliang Zhang
CDVS international standard. Cooperation: Baidu, Singapore media bureau
Media content analysis: Yizhou Wang, Tingting Jiang
Computer vision. Cooperation: Machine Intelligence Lab, Institute of Computing Technology, Chinese Academy of Sciences
Image/Video Chip: Xiaodong Xie, Huizhu Jia
Industrial production. Application: national defense, cameras, consumer electronics
Accelerating Video Encoding
Investigate methods for accelerating video encoding on Graphics Processing Units (GPUs).
Video Classification/Recognition for CDN Surveillance
Extend current state-of-the-art methods and further improve their performance, especially for CDN surveillance purposes.
Accelerating Compact Descriptors for Visual Search
Use GPUs to accelerate the CDVS extraction process.
Image Super-Resolution via Convolutional Neural Networks
Extend current state-of-the-art CNN-based super-resolution approaches and accelerate CNN inference time.
Introduction Motivation Challenge Multi-task learning for precise object search
Summary
The Big Data Era
Big Data collected/collecting by societies
More data has been created in the past two years than in the entire previous history of the human race. Data is growing faster than ever before and by the year 2020, about 1.7 megabytes of new information will be created every second for every human being on the planet.
Figure: The growth trend of Internet data (data size in EB), estimated by IDC; the share contributed by images and videos rises from 48% (at 1 EB) to about 90% (at 78.5 EB).
Surveillance Video: The Biggest Big Data
City Operation
Social Life, Public Security, Traffic, Healthcare, Surveillance Video Network, Data Center
Surveillance Video Network:
The key infrastructure of the intelligent city: >100K cameras for a middle-sized city in China
Surveillance Videos:
More than half of all big data
[online]; http://www.computer.org/web/computingnow/archive/february2014.
BUT, data is far from being analyzed and used
“Target-rich” data, i.e., data with special value, takes about 1.5% of the digital universe. To obtain such “target-rich” data, we need to analyze and mine all the data.
At the moment, less than 0.5% of all data is ever analyzed and used
Have eyes (i.e., cameras) but cannot see (i.e., recognize and search)
The Status of Current Systems: Less Smart
Boston Paris London Moscow
Surveillance Video Analysis
To develop intelligent algorithms, technologies and systems that can detect/recognize/search specific objects (e.g., pedestrian, vehicle), behavior, or events.
Enabling Technologies
Background modeling Object detection/tracking (e.g., pedestrian, vehicle) Object recognition (e.g., face) Object re-identification and search Action/Behavior detection/recognition (Abnormal) Event detection Crowd analysis Cross-camera tracking …
A Challenging Problem
How can we search for a specific object in massive images/videos?
NOT for visually similar objects BUT for exactly the same object
Detection and classification Precise object search
Gallery Query
ID=1 ID=2 ID=3 … …
Precise Object Search
Task: to search for a specific object in a large-scale dataset that contains many visually similar objects captured from different camera networks.
Search as Similarity Ranking (SaS) Search as Recognition (SaR)
Precise person search Precise vehicle search
Car Monitoring 2 Car Monitoring 3 Car Monitoring N Tollgate
Example: Detecting a Fake License Plate
Car Registry Database
Peugeot 206
Honda Accord: fake plate
Search Engine
Search Engine
Example: Tracing a Suspicious Vehicle
2014.10.19 10:12:11
2014.10.19 10:22:32 2014.10.19 10:36:33 2014.10.19 10:42:15 2014.10.19 12:42:11 2014.10.19 13:02:18
From Search to Recognition
Precise object recognition: The ultimate goal
To date, no recognition technology (including vehicle license plate recognition and face recognition) can achieve sufficiently high precision in an unconstrained environment
The success stories of Google and Baidu tell us: search can help, and in some cases even substitute for, recognition.
The task aims to find visually similar objects in a large database through visual similarity measurement and ranking.
In most cases, the returned objects that are visually similar (e.g., within the same (sub-)category, or having the same attributes such as color) are treated as correct.
Query Returned List
...
Recent Work: Deep Learning for Visual Search
Three Schemes
Direct Representation Refining by Similarity Learning Refining by Model Retraining
Wan J., Wang D., Hoi S.C.H., et al. Deep learning for content-based image retrieval: A comprehensive study. ACM MM 2014.
Refine with class labels (classification loss) Refine with side information (similarity rank loss)
Recent Work: Large-scale Clothes Image Retrieval
Cross-domain Image Retrieval
Given a user photo depicting a clothing image, the goal is to retrieve the same or attribute-similar clothing items from online shopping stores
Dual Attribute-aware Ranking Network: feature learning is driven by semantic attribute learning and a triplet visual similarity constraint.
Huang, Junshi, et al. "Cross-domain image retrieval with a dual attribute-aware ranking network." ICCV 2015.
Introduction Motivation Challenge Multi-task learning for precise object search
Summary
Challenge 1: Hard to Retrieve
Figure: Dataset sizes (number of images vs. number of classes):
CIFAR-100: 60K images, 100 classes
Caltech-256: 30K images, 256 classes
ImageNet-ILSVRC'12: 1.2M images, 1,000 classes
ImageNet: 14M images, 220K classes
Vehicle images in a province: 2.2B images, ~15M classes
Datasize-Recognition Gap
The exponentially increasing size of images and videos presents a grand challenge to pattern recognition!
Using a unified framework for analysis, recognition and search over images/videos captured in an unconstrained environment
1) Huge amounts of videos; 2) Different imaging views, illuminations, environmental conditions and image quality; 3) Visual appearance changes of the suspicious person/vehicle; 4) Other factors (e.g., lack of training data)
Zhou Kehua Case London Underground bombings Changchun Car Theft Case
Challenge 2: Hard to Identify
Difficult to distinguish different objects with similar appearance (e.g., vehicles of the same color and model)
Camera view, distance, illumination variations
Different Same
Challenge 2: Hard to Identify
Cannot depend on strong identification information such as faces or vehicle license plate numbers
Faces are unavailable in most real-world surveillance cameras; vehicle license plates may be faked. Face Image Retrieval Scenario [Li, ICCV2015]: How to search given these pictures?
✓ No frontal face image is available ✓ With some facial makeup ✓ Don't know who he is
ID Face Database Surveillance Face Database
It is also challenging because…
Introduction Motivation Challenge Multi-task learning for precise object search
Summary
Multi-task learning
Definition in Wikipedia
Multi-task learning (MTL) is an approach to machine learning that learns a problem together with other related problems at the same time, using a shared representation
Motivation
Address multiple tasks with a unified model; utilize the intrinsic relatedness between different tasks
Multi-task learning
The main question: how to learn?
1) Combine features in different tasks together 2) Share hidden nodes or model parameters across different tasks
Mixing different features together: color, shape, texture and edge features feed one image classification model that serves Tasks 1, 2 and 3.
Sharing hidden nodes in a deep neural network: per-task inputs (Input 1-3) and outputs (Output 1-3), with shared hidden layers between the input and output layers.
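The hidden-node-sharing strategy above can be sketched in a few lines (a minimal pure-Python illustration; the layer sizes and task names are hypothetical):

```python
import random

random.seed(0)

def linear(x, w):
    """One fully connected layer without bias: matrix-vector product."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def relu(v):
    return [max(0.0, a) for a in v]

# Shared hidden layer (hypothetical sizes): 4 input features -> 3 hidden units.
W_shared = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
# One task-specific output head per task: 3 hidden units -> 1 output each.
W_task = {t: [[random.uniform(-1, 1) for _ in range(3)]]
          for t in ("task1", "task2", "task3")}

def forward(x, task):
    """All tasks reuse the same hidden representation; only the head differs."""
    h = relu(linear(x, W_shared))
    return linear(h, W_task[task])[0]

x = [0.5, -1.0, 2.0, 0.1]
outputs = {t: forward(x, t) for t in W_task}
```

Because the hidden layer is shared, gradients from any task would update W_shared, so related tasks regularize one another.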
Multi-classes Classification
AlexNet
Classify 1,000 classes within a unified model
Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, Imagenet classification with deep convolutional neural networks, NIPS 2012
Object Detection
Fast R-CNN
Two tasks:
Image classification: Softmax over RoI features. Region detection: bounding-box regression.
Motorbike 0.9 Person 0.6 Ross Girshick, Fast R-CNN, ICCV 2015
Multi-task
Introduction Motivation Challenge Multi-task learning for precise object search
Summary
What is Person Re-ID?
Definition
Person re-identification (Re-ID) is the problem of matching people across non-overlapping camera views.
Challenges
A person's appearance often changes dramatically across camera views due to changes in body pose, view angle, etc.
More variations for non-rigid objects
Key challenge for precise person search
The drawbacks of person re-identification
Unsupervised methods: weak performance
Without labelled matching pairs across camera views, existing unsupervised models are unable to learn what makes a person recognizable under remarkable appearance changes.
Supervised methods: poor scalability
Existing supervised models need labelled data for each dataset. Eye-balling two views to annotate correctly matching pairs among hundreds of images is a tough job even for humans. For a camera network, the labelling cost would be prohibitively high.
300 camera pairs need to be labelled for a campus surveillance system (25 cameras)!!!
Deep Re-ID (Pair-wise)
Similarity estimation as a binary classification task: process two images at once; no explicit feature representation for each sample; different architectures across different methods
Framework of most pair-wise networks Deep Architecture 1
Same individual Different individuals
Single-task
Deep Re-ID (Pair-wise)
Siamese Network
Jointly learn color features, texture features and the metric in a unified framework; two sub-networks for feature extraction
Framework Different distance or similarity functions
Dong Yi, Zhen Lei, Stan Z. Li, Deep Metric Learning for Practical Person Re-Identification, ICPR 2014
Deep Re-ID (Pair-wise)
DeepReID
Filter pairing neural network (FPNN)
Distance measurement in the middle (patch) level Patch matching (maxout pooling)
Wei Li, Rui Zhao, Tong Xiao, Xiaogang Wang, DeepReID: Deep Filter Pairing Neural Network for Person Re-Identification, CVPR 2014
Figure: Responses of filter pairs (indicated by the colors yellow, purple, green and white) after passing the patch-matching layer. Without maxout grouping, each matrix has only one patch with a large response. Right: grouping four channels together and taking the maximum value forms a single-channel output, in which a line structure is formed.

Deep Re-ID (Pair-wise)
Cross-Input Neighborhood Differences Network
Capture local relationships in mid-level features A new layer to handle viewpoint variation across different camera views
Ejaz Ahmed, Michael Jones, Tim K. Marks, An Improved Deep Learning Architecture for Person Re-Identification, CVPR 2015
Deep Re-ID (Triplet)
Learn a feature representation explicitly via CNN
Raw image X -> Feature vector F(X)
Triplet units in training phase
Reference sample 𝑃1 Positive sample 𝑃2 Negative sample 𝑃3
Shengyong Ding, Liang Lin, Guangrun Wang, Hongyang Chao, Deep feature learning with relative distance comparison for person re-identification, Pattern Recognition 2015
Figure: Triplet unit for training: 𝑃1, 𝑃2 and 𝑃3 each pass through convolution, max pooling and fully connected layers.
Deep Re-ID (Triplet)
Relative distance constraint over F(X):
Pull images of the same individual closer; push images of different individuals further apart:
‖𝐺(𝑃1) − 𝐺(𝑃2)‖₂² < ‖𝐺(𝑃1) − 𝐺(𝑃3)‖₂²  (triplet loss)
Shengyong Ding, Liang Lin, Guangrun Wang, Hongyang Chao, Deep feature learning with relative distance comparison for person re-identification, Pattern Recognition 2015
Single-task Positive pair Negative pair
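The relative-distance constraint is usually trained with a hinge-style triplet loss; a minimal sketch (pure Python on toy 2-D embeddings; the margin value is an assumption):

```python
def l2_sq(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge on ||G(P1)-G(P2)||^2 + margin <= ||G(P1)-G(P3)||^2:
    zero once the negative is far enough from the anchor."""
    return max(0.0, l2_sq(anchor, positive) - l2_sq(anchor, negative) + margin)

# P2 (same person) lies close to P1, P3 (different person) lies far away,
# so this triplet is already satisfied and contributes no loss.
loss = triplet_loss([0.0, 0.0], [0.1, 0.0], [2.0, 2.0])
```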
Deep Person Re-Identification
Person re-identification with deep learning must battle against data volume: lack of training data
It is hard to annotate a person Re-ID dataset Transfer learning between datasets is important
Multi-task learning
Our method 1/2
Supervised deep Re-ID via transfer learning
The network
Multi-task framework with classification and verification losses. The base network uses GoogLeNet to transfer knowledge learned from ImageNet. A task-specific dropout is applied.
Mengyue Geng, Yaowei Wang, Tao Xiang, Yonghong Tian, Deep Transfer Learning for Person Re-identification, arXiv 2016
Multi-task
Our method 1/2
Supervised deep Re-ID via transfer learning
Transfer learning via two stepped fine-tuning strategy
First train only the ID classifier layer on the target data, then fine-tune the whole network on the target data
Our method 1/2
Experimental Results
Our method 2/2
Unsupervised deep Re-ID
Iterative co-training of a deep network and a dictionary: the deep network is trained with generated pseudo labels; the dictionary is trained using deep features
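The pseudo-label loop can be sketched as follows (pure Python; nearest-centroid assignment stands in for the dictionary/clustering step, and recomputing centroids stands in for retraining the deep network):

```python
def l2_sq(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(x, centers):
    """Index of the closest centroid: this yields the pseudo label."""
    return min(range(len(centers)), key=lambda k: l2_sq(x, centers[k]))

def pseudo_label_training(features, centers, iters=5):
    """Alternate between (1) generating pseudo labels by nearest-centroid
    assignment and (2) re-estimating the model from those labels
    (here: recomputing centroids from the assigned members)."""
    labels = []
    for _ in range(iters):
        labels = [nearest(x, centers) for x in features]
        for k in range(len(centers)):
            members = [x for x, l in zip(features, labels) if l == k]
            if members:
                dim = len(members[0])
                centers[k] = [sum(m[d] for m in members) / len(members)
                              for d in range(dim)]
    return labels, centers

feats = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
labels, centers = pseudo_label_training(feats, [[1.0, 1.0], [4.0, 4.0]])
```

With well-separated features the pseudo labels stabilize after one round; in the full method the "retraining" step is a deep network update rather than a centroid mean.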
Our method 2/2
Experimental Results
Pedestrain Search by Behaviors Features (e.g., Gait)
Multi-feature bipartite ranking model: to reduce the effects of multiple factors such as viewing angles, carrying objects and wearing different coat Swiss multi-round competition mechanism: Through multi-round competition, the effectiveness and efficiency of cascade ranking model can be improved remarkably.
How to do when visual appearance is unreliable?
Probe Gallery Ranking
Grouping Final Ranking
Indoor Gait-based Person Search Outdoor Gait-based Person Search
Lan Wei, Yonghong Tian, Yaowei Wang, Tiejun Huang, Swiss-System based Cascade Ranking for Gait-based Person Re-identification, Proc. 29th AAAI Conf., January 25-30, 2015, Austin, Texas, USA.
Introduction Motivation Challenge Multi-task learning for precise object search
Summary
Precise Vehicle Search
Precise Vehicle Search is not an easy task
The Twin Problem: It is very difficult to distinguish two cars from the same model and with the same color
Precise Vehicle Search
Is it really possible to distinguish two vehicles of the same model and color?
Yes, if we can find some discriminative features Attributes help precise vehicle search
Recent Work: Fine-Grained Visual Recognition
The Comprehensive Cars (CompCars)
Two scenarios: web-nature and surveillance-nature. The web-nature data contains 163 car makes with 1,716 car models: 136,726 images capturing entire cars and 27,618 images capturing car parts, with five attributes (maximum speed, displacement, number of doors, number of seats, and type of car). The surveillance-nature data contains 50,000 car images captured from the front view.
Yang, Linjie, et al. "A large-scale car dataset for fine-grained categorization and verification." CVPR 2015.
Recent Work: Vehicle re-identification
Appearance-based coarse filtering: low-level hand-crafted features and high-level semantic attributes. Plate-based accurate search: a Siamese neural network is trained for license plate verification instead of recognizing the characters. Spatiotemporal relation model: used to re-rank vehicles.
Liu, Xinchen, et al. "A Deep Learning-Based Approach to Progressive Vehicle Re-identification for Urban Surveillance." ECCV 2016.
Framework
Use a deep convolutional network for feature extraction: map the raw image data into a Euclidean feature space and use L2 distance to measure image similarity.
Hongye Liu, Yonghong Tian, Yaowei Wang, Lu Pang, Tiejun Huang, Deep Relative Distance Learning: Tell the Difference Between Similar Vehicles, Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2016.
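Once images are mapped into a Euclidean feature space, search reduces to sorting the gallery by L2 distance to the query; a minimal sketch (pure Python, toy features):

```python
def l2_sq(a, b):
    """Squared Euclidean distance (monotone in L2, so the ranking is identical)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def rank_gallery(query_feat, gallery_feats):
    """Return gallery indices sorted by ascending distance to the query."""
    return sorted(range(len(gallery_feats)),
                  key=lambda i: l2_sq(query_feat, gallery_feats[i]))

# Gallery item 1 is nearest to the query, item 0 is farthest.
order = rank_gallery([0.0, 0.0], [[3.0, 3.0], [0.1, 0.0], [1.0, 1.0]])
```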
Our Method 1: Deep Relative Distance Learning
Deep Relative Distance Learning
Drawbacks of triplet loss
Slow convergence; fails to handle some special cases
An enhanced version:
Coupled Cluster Loss (CCL)
Estimate the cluster center, then compute the cluster loss
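These two steps can be sketched as follows (pure Python, squared L2 distances; the margin and the exact form of the published CCL are simplified here for illustration):

```python
def l2_sq(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def coupled_cluster_loss(positives, negatives, margin=1.0):
    """Sketch of the coupled-cluster idea: (1) estimate the positive cluster
    center, (2) hinge-penalize any positive that is not closer to the center
    than the nearest negative by at least `margin`."""
    dim = len(positives[0])
    center = [sum(p[d] for p in positives) / len(positives) for d in range(dim)]
    d_neg = min(l2_sq(n, center) for n in negatives)
    return sum(max(0.0, l2_sq(p, center) - d_neg + margin) for p in positives)

# Tight positive cluster, distant negative: the constraint holds, loss is zero.
loss = coupled_cluster_loss([[0.0, 0.0], [0.2, 0.0]], [[3.0, 3.0]])
```

Comparing against a cluster center rather than a single anchor is what stabilizes training relative to the plain triplet loss.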
Multi-task Deep Learning
Determine whether two images show the same vehicle:
Are they of the same color and vehicle model? Do they share any common marks?
Mixed Difference Network (Multi-task learning)
One branch for attribute recognition (model, color, …) One branch for discriminative features learning
Training
Network training
Step 1: training two branches separately
Branch 1: vehicle model and color classification; batch data are selected across different vehicle models. Branch 2: coupled clusters loss (conv1-3 fixed); batch data are selected within a specific vehicle model.
Step 2: training the entire network
Set the learning rate of fc8 to 10 times that of the other layers; the weights of losses 1, 2 and 3 are set to 0.5, 0.5 and 1.0
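The two settings above can be sketched directly (pure Python; the base learning rate is a hypothetical value, not one stated in the slides):

```python
def combined_loss(losses, weights=(0.5, 0.5, 1.0)):
    """Weighted sum of the three task losses, as in training step 2."""
    return sum(w * l for w, l in zip(weights, losses))

BASE_LR = 0.001  # hypothetical base learning rate

def layer_lr(layer_name, base_lr=BASE_LR):
    """fc8 gets a learning rate 10x larger than every other layer."""
    return base_lr * 10 if layer_name == "fc8" else base_lr

total = combined_loss((1.0, 1.0, 1.0))  # 0.5 + 0.5 + 1.0
```

A larger rate on the freshly initialized fc8 layer lets the new head adapt quickly while the pretrained layers change only slowly.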
VehicleID Dataset
Dataset
221,763+ images of 26,267 vehicles (8.44 images/vehicle on average). Each vehicle has a unique ID (labeled by its license plate). 111,585 images of 13,133 vehicles have model labels (250 models).
Experimental Results
By MAP By match rate
Experimental Results
Results of precise vehicle search
Multi-grain Relationship
Given multiple attributes, the relationship between vehicle images is abstracted into multiple grains (levels). It is difficult to optimize directly under such strong constraint conditions.
generalized pairwise ranking multi-grain list ranking
Our Method 2: Multi-grain Constraints based Ranking
Ke Yan, Yonghong Tian, Yaowei Wang and Wei Zeng, "Exploiting Multi-Grain Ranking Constraints for Precisely Searching Visually-similar Vehicles." Submitted to IEEE International Conference on Computer Vision (ICCV), 2017.
Multi-grain Constraints based Ranking
Generalized pairwise ranking
Generalize conventional pairwise ranking, which considers only binary similar/dissimilar relations, to multiple relations. Jointly optimize multi-attribute classification and generalized pairwise ranking.
Notation: n is the number of image pairs; 𝑞(𝑗, 𝑟) is the prediction value on the 𝑟-th grain of the 𝑗-th pair; 𝑏𝑗 = 𝑛 means the ground-truth grain of the 𝑗-th pair is 𝑛; y indexes the attribute types (ID, model and color); 𝑏𝑧 𝑦 = 𝑛 means the ground-truth category on the 𝑧-th attribute of the 𝑦-th image is 𝑛; 𝑞(𝑦, 𝑧, 𝑘) is the prediction value on the 𝑘-th category of the 𝑦-th image. 𝜇 is a weight controlling the balance of the two tasks.
Datasets
Two high-quality, well-annotated vehicle datasets
Each image is labeled with ID, precise vehicle model and color. VD1 and VD2 are the largest high-quality annotated vehicle datasets published so far.
Multi-grain Constraints based Ranking
Experiment results
Example: Precise Vehicle Search
Does not rely on the license plate (useful for detecting fake license plates); insensitive to blurred images
query Rank no.1 Rank no.2 Rank no.3
Example: Precise Vehicle Search
Insensitive to occlusion Insensitive to car pose
A Practical System in Wendeng City
System architecture: toll-gate cameras, switches, data center, users
Input examples: Type 1 — image frames from video toll gates, resolution 1920x1144; Type 2 — images from image toll gates, resolution 1536x2048
Data volume: 1.5M images per day
Vehicle detection: 3 GPU servers (2 cores, 8 NVIDIA Tesla K40 GPUs)
Search sub-system: 6 CPU servers (Hadoop platform), 1 storage node (32 TB)
DEMO for vehicle search
CNN vs. SIFT-like Features
Experimental results
Database: 611,944 images from two cities. Query images: 1,000 images (1,000 randomly chosen vehicles, one random image each). Evaluation criterion: mean average precision (mAP)
Method               Feature size   mAP
SIFT                 4~5K (Bpi)     0.3512
Our deep feature     4K (Bpi)       0.4206
Our compact feature  1K (bpi)       0.4191
Ke Yan, Yaowei Wang, Dawei Liang, Tiejun Huang, Yonghong Tian, CNN vs. SIFT for Image Retrieval: Alternative or Complementary? Proc. ACM International Conference on Multimedia, Amsterdam, The Netherlands, Oct 2016.
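The mAP criterion used above is computed from per-query ranked relevance lists; a minimal sketch (pure Python):

```python
def average_precision(ranked_relevance):
    """AP for one query: ranked_relevance is a 0/1 list in ranked order,
    1 meaning the returned image is the correct vehicle."""
    hits, total = 0, 0.0
    for i, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            total += hits / i  # precision at each relevant position
    return total / hits if hits else 0.0

def mean_average_precision(all_queries):
    """mAP: mean of the per-query average precisions."""
    return sum(average_precision(r) for r in all_queries) / len(all_queries)

# Two toy queries: a perfect first hit, and a correct result at rank 2.
m = mean_average_precision([[1], [0, 1]])
```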
CNN vs. SIFT-like Features
Complementary between CNN and SIFT-like features
UKBench database: 10,200 images of 2,550 objects. Evaluation criterion: mean average precision (mAP)
Introduction Motivation Challenge Multi-task learning for precise object search
Summary
Summary
Beyond visual search: (traditional) image search, fine-grained image search, precise object search
Precise Person Search
Multi-task learning handles the challenge of a person's appearance changes; transfer learning handles the problem of small datasets
Precise Vehicle Search
Multi-task learning & deep relative distance learning: find discriminative features to distinguish different vehicles with similar appearance
Summary
Future Directions
Benchmarking: billion-scale benchmark datasets. Multi-task features: more discriminative global and local deep features, for both fine-grained categorization and search. Unified framework: one framework for detection, recognition and search. Efficiency: compact descriptors for multiple tasks via learning to hash.
Acknowledgement
We gratefully acknowledge the support from NVIDIA NVAIL program.
Yonghong Tian: yhtian@pku.edu.cn Fan Yang: fyang.eecs@pku.edu.cn