Multi-task Learning for Precise Object Search from Massive Images/Videos


SLIDE 1

Fan Yang

National Engineering Laboratory for Video Technology School of EE & CS, Peking University

Multi-task Learning for Precise Object Search from Massive Images/Videos

SLIDE 2

Outline

 Introduction  Motivation  Challenge  Multi-task learning for precise object search

  • 1. Multi-task based person re-identification
  • 2. Multi-task based vehicle search

 Summary

SLIDE 3

Outline

 Introduction  Motivation  Challenge  Multi-task learning for precise object search

  • 1. Multi-task based person re-identification
  • 2. Multi-task based vehicle search

 Summary

SLIDE 4

Laboratory Organization

  • Dr. Wen Gao

IEEE Fellow, ACM Fellow

  • Dr. Tiejun Huang

Video Coding Lab, System Lab, Testing Lab, New Media Lab, SoC Lab

National Engineering Laboratory for Video Technology

SLIDE 5

Research Fields and Groups

 Video coding algorithm: Wen Gao, Siwei Ma, Ruiqin Xiong

 Video coding standard  Cooperation: CCTV, Huawei, AVS Industry Alliance

 Intelligent video analysis: Tiejun Huang, Yonghong Tian, Wei Zeng, Yaowei Wang

 Analyze and mine surveillance videos, recognition-friendly video coding  Cooperation: China Security & Protection, Hisense

 Mobile Visual Search: Linyu Duan, Shiliang Zhang

 CDVS international standard  Cooperation: Baidu, Singapore media bureau

 Media content analysis: Yizhou Wang, Tingting Jiang

 Computer vision  Cooperation: Machine Intelligence Lab, Institute of Computing Technology, Chinese Academy of Sciences

 Image/Video Chip: Xiaodong Xie, Huizhu Jia

 Industrial production  Application: national defense, cameras, consumer electronics

SLIDE 6

Cooperation with NVIDIA(NVAIL)

 Accelerating Video Encoding

 Investigate acceleration methods for video encoding on Graphics Processing Units (GPUs).

 Video Classification/Recognition for CDN Surveillance

 Extend the current state-of-the-art methods and further improve their performance especially for the CDN surveillance purpose

 Accelerating Compact Descriptors for Visual Search

 Use GPU to accelerate the CDVS extracting process.

 Image Super-Resolution via Convolutional Neural Networks

 Extend the current state-of-the-art CNN-based super-resolution approaches and accelerate the inference time of CNNs.

SLIDE 7

Outline

 Introduction  Motivation  Challenge  Multi-task learning for precise object search

  • 1. Multi-task based person re-identification
  • 2. Multi-task based vehicle search

 Summary

SLIDE 8

The Big Data Era

 Big Data collected/collecting by societies

 More data has been created in the past two years than in the entire previous history of the human race.  Data is growing faster than ever before and by the year 2020, about 1.7 megabytes of new information will be created every second for every human being on the planet.

(Chart: the growth trend of Internet data, estimated by IDC. Data size grows from 1 EB to 78.5 EB, with the share of images and videos rising from 48% to 90% of the total.)

SLIDE 9

Surveillance Video: The Biggest Big Data

(Diagram: the surveillance video network and data center support city operation: social life, public security, traffic, and healthcare.)

Surveillance Video Network:

The key infrastructure of an intelligent city; >100K cameras for a middle-size city in China

Surveillance Videos:

More than half of all big data

  • T. Huang, "Surveillance Video: The Biggest Big Data," Computing Now, vol. 7, no. 2, Feb. 2014, IEEE Computer Society. [Online]: http://www.computer.org/web/computingnow/archive/february2014

SLIDE 10

BUT, data is far from being analyzed and used

 “Target rich” data, i.e., data with special value, makes up about 1.5% of the digital universe  To obtain such “target rich” data, we need to analyze and mine all the data.

 At the moment, less than 0.5% of all data is ever analyzed and used

SLIDE 11

Have Eyes (i.e., cameras) but Cannot See (i.e., recognize and search)

The Status of Current Systems: Less Smart

(Photos: surveillance cameras in Boston, Paris, London, and Moscow.)

SLIDE 12

Surveillance Video Analysis

 To develop intelligent algorithms, technologies and systems that can detect/recognize/search specific objects (e.g., pedestrian, vehicle), behavior, or events.

 Enabling Technologies

 Background modeling  Object detection/tracking (e.g., pedestrian, vehicle)  Object recognition (e.g., face)  Object re-identification and search  Action/Behavior detection/recognition  (Abnormal) Event detection  Crowd analysis  Cross-camera tracking  …


SLIDE 13

A Challenging Problem

 How can we search for a specific object from massive image or video data?

 NOT for a visually similar object  BUT for exactly the same object

(Figure: detection and classification vs. precise object search; a query is matched against a gallery of identities: ID=1, ID=2, ID=3, …)

SLIDE 14

Precise Object Search

 Task: to search a specific object from a large-scale dataset which contains a set of visually similar objects captured from different camera networks.

 Search as Similarity Ranking (SaS)  Search as Recognition (SaR)

Precise person search Precise vehicle search

SLIDE 15

Example: Detect Fake License Plate

(Figure: images from car monitoring points 2, 3, …, N and a tollgate are checked by the search engine against the car registry database; a car registered as a Peugeot 206 appears as a Honda Accord, revealing a fake plate.)

SLIDE 16

Example: Tracing a Suspicious Vehicle

(Figure: the search engine traces the vehicle across cameras at 2014.10.19 10:12:11, 10:22:32, 10:36:33, 10:42:15, 12:42:11, and 13:02:18.)

SLIDE 17

From Search to Recognition

 Precise object recognition: The ultimate goal

 To date, no recognition technology (including vehicle plate number recognition and face recognition) can achieve sufficiently high precision in an unconstrained environment

 The success stories of Google and Baidu tell us: search can help, and in some cases even substitute for, recognition.

SLIDE 18
Precise Object Search vs. Visual Search

 The task aims to find visually similar objects from a large database through visual similarity measurement and ranking

 In most cases, the returned objects that are visually similar (e.g., within the same (sub-)category, having the same attributes such as color) are treated as correct

Query Returned List

...

SLIDE 19

Recent Work: Deep Learning for Visual Search

 Three Schemes

 Direct Representation  Refining by Similarity Learning  Refining by Model Retraining

Wan J, Wang D, Hoi S C H, et al. "Deep learning for content-based image retrieval: A comprehensive study." ACM MM 2014

Refine with class labels (classification loss) Refine with side information (similarity rank loss)

SLIDE 20

Recent Work: Large-scale Clothes Image Retrieval

 Cross-domain Image Retrieval

Given a user photo depicting a clothing image, the goal is to retrieve the same or attribute-similar clothing items from online shopping stores

 Dual Attribute-aware Ranking Network

  • 1. Two sub-networks, one for each domain.
  • 2. Feature representations are driven by semantic attribute learning.
  • 3. Learning to rank by triplet visual similarity constraint.

Huang, Junshi, et al. "Cross-domain image retrieval with a dual attribute- aware ranking network." ICCV 2015.

SLIDE 21

Outline

 Introduction  Motivation  Challenge  Multi-task learning for precise object search

  • 1. Multi-task based person re-identification
  • 2. Multi-task based vehicle search

 Summary

SLIDE 22

Challenge 1: Hard to Retrieve

(Chart: the Datasize-Recognition Gap; class number vs. image size for common datasets.)

CIFAR-100: 60K images, 100 classes
Caltech-256: 30K images, 256 classes
ImageNet-ILSVRC'12: 1.2M images, 1000 classes
ImageNet: 14M images, 220K classes
Vehicle images in a province: 2.2B images, ~15M classes

The exponentially increasing size of images and videos presents a grand challenge to pattern recognition!

SLIDE 23

 Using a unified framework for analysis, recognition and search over images/videos that are captured in an unconstrained environment

1) Huge amount of videos; 2) Different imaging views, illuminations, environmental conditions and image quality; 3) Visual appearance changes of the suspicious person/vehicle; 4) Other factors (e.g., lack of training data)

Zhou Kehua Case London Underground bombings Changchun Car Theft Case


Challenge 2: Hard to Identify

SLIDE 24

 Difficult to distinguish different objects with similar appearance (e.g., vehicles of the same color and model)

 Camera view, distance, illumination variations

Different Same

Challenge 2: Hard to Identify

SLIDE 25

 Cannot depend on strong identification information such as the face or the vehicle license plate number

 The face is unavailable in most real-world surveillance cameras  The vehicle license plate may be faked

Face Image Retrieval Scenario [Li, ICCV2015]: how to search given these pictures?

✓ No front face image is available ✓ With some facial makeup ✓ We don’t know who he is

ID Face Database Surveillance Face Database

It is also challenging because…

SLIDE 26

Outline

 Introduction  Motivation  Challenge  Multi-task learning for precise object search

  • 1. Multi-task based person re-identification
  • 2. Multi-task based vehicle search

 Summary

SLIDE 27

Multi-task learning

 Definition in Wikipedia

 Multi-task learning (MTL) is an approach to machine learning that learns a problem together with other related problems at the same time, using a shared representation

 Motivation

 Address multiple tasks with a unified model  Utilize the intrinsic relatedness between different tasks

SLIDE 28

Multi-task learning

 The main question: how to learn?

 1) Combine features in different tasks together  2) Share hidden nodes or model parameters across different tasks

(Diagrams. Left: mixing different features (color, shape, texture, edge) into one image classification model serving several tasks. Right: sharing hidden layers of a deep neural network across tasks, with task-specific input and output layers.)
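The parameter-sharing scheme (2) can be sketched minimally in Python. This is an illustrative toy, not any system from the talk: all dimensions, weights and data are synthetic, and the two task heads read the same shared hidden representation, with their cross-entropy losses combined by a balance weight.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared representation: both task heads read the same hidden layer.
W_shared = rng.normal(scale=0.1, size=(8, 16))   # input dim 8 -> hidden dim 16
W_task1 = rng.normal(scale=0.1, size=(16, 3))    # head 1: 3-way classification
W_task2 = rng.normal(scale=0.1, size=(16, 5))    # head 2: 5-way classification

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(x):
    h = np.maximum(0.0, x @ W_shared)            # shared ReLU features
    return softmax(h @ W_task1), softmax(h @ W_task2)

def joint_loss(x, y1, y2, mu=0.5):
    """Weighted sum of the two tasks' cross-entropy losses."""
    p1, p2 = forward(x)
    l1 = -np.log(p1[np.arange(len(y1)), y1]).mean()
    l2 = -np.log(p2[np.arange(len(y2)), y2]).mean()
    return mu * l1 + (1.0 - mu) * l2

x = rng.normal(size=(4, 8))                      # a synthetic mini-batch
y1 = np.array([0, 1, 2, 0])                      # labels for task 1
y2 = np.array([4, 0, 2, 1])                      # labels for task 2
loss = joint_loss(x, y1, y2)
```

Training would back-propagate the joint loss through both heads into the shared weights, which is how one task's supervision benefits the other.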

SLIDE 29

Multi-class Classification

 AlexNet

 Classify 1000 classes within a unified model

Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012

SLIDE 30

Object Detection

 Fast R-CNN

 Two tasks:

 Image classification  Softmax over ROI features  Region detection  Bounding box regression

(Figure: example detections such as motorbike 0.9, person 0.6.)

Ross Girshick, Fast R-CNN, ICCV 2015

Multi-task
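A minimal sketch of the Fast R-CNN-style two-loss combination for a single region of interest (the helper names and simplified inputs are mine, not from the paper): classification contributes a softmax log-loss, and bounding-box regression contributes a smooth L1 loss that is switched off for background regions.

```python
import numpy as np

def smooth_l1(x):
    # Smooth L1: quadratic near zero, linear for large errors.
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def multitask_loss(cls_probs, true_class, pred_box, true_box, lam=1.0):
    """Joint loss for one RoI: softmax log-loss plus, for non-background
    RoIs only (true_class >= 1), a smooth L1 bounding-box regression loss."""
    l_cls = -np.log(cls_probs[true_class])
    l_loc = smooth_l1(pred_box - true_box).sum() if true_class >= 1 else 0.0
    return l_cls + lam * l_loc

# Illustrative RoI: classes are (background, person, motorbike).
probs = np.array([0.1, 0.7, 0.2])
loss = multitask_loss(probs, 1, np.array([0.1, 0.2, 0.0, 0.1]), np.zeros(4))
```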

SLIDE 31

Outline

 Introduction  Motivation  Challenge  Multi-task learning for precise object search

  • 1. Multi-task based person re-identification
  • 2. Multi-task based vehicle search

 Summary

SLIDE 32

What is Person Re-ID?

 Definition

 Person re-identification (Re-ID) is the problem of matching people across non-overlapping camera views.

 Challenges

 A person’s appearance often changes dramatically across camera views due to changes in body pose, view angle, occlusion and illumination conditions.

 More variations for non-rigid objects

SLIDE 33

Key challenge for precise person search

 The drawbacks of person re-identification

 Unsupervised methods: weak performance

 Without labelled matching pairs across camera views, existing unsupervised models are unable to learn what makes a person recognizable under remarkable appearance changes.

 Supervised methods: poor scalability

 Existing supervised models need labelled data for each dataset.  Eye-balling the two views to annotate correctly matching pairs among hundreds of images is a tough job even for humans.  For a camera network, the labelling cost would be prohibitively high.

300 camera pairs need to be labelled for a campus surveillance system (25 cameras)!!!

SLIDE 34

Deep Re-ID (Pair-wise)

 Similarity estimation as a binary classification task  Process two images once  No explicit feature representation for each sample  Different architectures across different methods

Framework of most pair-wise networks Deep Architecture 1

Same individual Different individuals

Single-task

SLIDE 35

Deep Re-ID (Pair-wise)

 Siamese Network

 Jointly learn the color feature, texture feature and metric in a unified framework  Two sub-networks for feature extraction

Framework Different distance or similarity functions

Dong Yi, Zhen Lei, Stan Z. Li, Deep Metric Learning for Practical Person Re-Identification, ICPR 2014

SLIDE 36

Deep Re-ID (Pair-wise)

 DeepReID

 Filter pairing neural network (FPNN)

 Distance measurement in the middle (patch) level  Patch matching (maxout pooling)

Wei Li, Rui Zhao, Tong Xiao, Xiaogang Wang, DeepReID: Deep Filter Pairing Neural Network for Person Re-Identification, CVPR 2014

(Figure 5: Maxout pooling. Left: responses of patches to four filter pairs (indicated by the colors yellow, purple, green and white) in two stripes. Middle: four patch displacement matrices after passing the patch matching layer; without maxout grouping, each matrix has only one patch with large response. Right: grouping four channels together and taking the maximum value forms a single-channel output; a line structure is formed.)
SLIDE 37

Deep Re-ID (Pair-wise)

 Cross-Input Neighborhood Differences Network

 Capture local relationships in mid-level features  A new layer to handle viewpoint variation across different camera views

Ejaz Ahmed, Michael Jones, Tim K. Marks, An Improved Deep Learning Architecture for Person Re-Identification, CVPR 2015

SLIDE 38

Deep Re-ID (Triplet)

 Learn a feature representation explicitly via CNN

 Raw image X -> Feature vector F(X)

 Triplet units in training phase

 Reference sample 𝑃1  Positive sample 𝑃2  Negative sample 𝑃3

Shengyong Ding, Liang Lin, Guangrun Wang, Hongyang Chao, Deep feature learning with relative distance comparison for person re-identification, Pattern Recognition 2015

(Figure: triplet unit for training; 𝑃1, 𝑃2 and 𝑃3 pass through shared convolution, max pooling and fully connected layers.)

SLIDE 39

Deep Re-ID (Triplet)

 Relative distance constraint over F(X):

 Pull images of the same individual closer  Push images of different individuals further

‖F(𝑃1) − F(𝑃2)‖₂² < ‖F(𝑃1) − F(𝑃3)‖₂² (triplet loss)

Shengyong Ding, Liang Lin, Guangrun Wang, Hongyang Chao, Deep feature learning with relative distance comparison for person re-identification, Pattern Recognition 2015

Single-task Positive pair Negative pair
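The relative distance constraint is typically enforced as a hinge loss over squared L2 distances; a minimal sketch (the margin value here is illustrative, not from the paper):

```python
import numpy as np

def triplet_loss(f_ref, f_pos, f_neg, margin=1.0):
    """Hinge on squared L2 distances: the (reference, positive) pair must be
    closer than the (reference, negative) pair by at least `margin`."""
    d_pos = np.sum((f_ref - f_pos) ** 2)
    d_neg = np.sum((f_ref - f_neg) ** 2)
    return max(0.0, d_pos - d_neg + margin)
```

When the constraint already holds with margin to spare, the loss is zero and the triplet contributes no gradient; violated triplets pull the positive in and push the negative out.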

SLIDE 40

Deep Person Re-Identification

 Person Re-Identification with deep learning  Battle against data volume  Lack of training data

 It is hard to annotate a person Re-ID dataset  Transfer learning between datasets is important

 Multi-task learning

SLIDE 41

Our method 1/2

 Supervised deep Re-ID via transfer learning

 The network

 Multi-task framework with classification and verification losses  The base network uses GoogLeNet to transfer knowledge learned from ImageNet  A task-specific dropout is applied

Mengyue Geng, Yaowei Wang, Tao Xiang, Yonghong Tian, Deep Transfer Learning for Person Re-identification, arXiv 2016

Multi-task

SLIDE 42

Our method 1/2

 Supervised deep Re-ID via transfer learning

 Transfer learning via a two-step fine-tuning strategy

 Train only ID classifier layer on target data first  Then fine-tune the whole network on target data

SLIDE 43

Our method 1/2

 Experimental Results

SLIDE 44

Our method 2/2

 Unsupervised deep Re-ID

 Iterative co-training of a deep network and a dictionary  The deep network is trained with generated pseudo labels  The dictionary is trained using deep features

SLIDE 45

Our method 2/2

 Experimental Results

SLIDE 46

 Pedestrian Search by Behavior Features (e.g., Gait)

 Multi-feature bipartite ranking model: reduces the effects of multiple factors such as viewing angle, carrying objects and wearing different coats  Swiss multi-round competition mechanism: through multi-round competition, the effectiveness and efficiency of the cascade ranking model can be improved remarkably.

What can we do when visual appearance is unreliable?


(Figure: probe and gallery sequences are compared over multiple rounds; candidates are grouped and re-ranked to produce the final ranking.)
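One plausible reading of the Swiss multi-round competition mechanism, sketched with illustrative names and data (the actual pairing and scoring rules are in the cited AAAI paper): in each round, candidates with similar running scores are paired, the one more similar to the probe under that round's feature scores a point, and the accumulated score gives the final ranking.

```python
def swiss_rank(similarities, n_rounds=3):
    """similarities: {candidate: [sim_under_feature_0, sim_under_feature_1, ...]}
    giving each gallery candidate's similarity to the probe under several features."""
    n_feats = len(next(iter(similarities.values())))
    scores = {c: 0 for c in similarities}
    for r in range(n_rounds):
        feat = r % n_feats                      # rotate through the features
        standings = sorted(scores, key=lambda c: -scores[c])
        # Pair adjacent candidates in the standings; an odd one out gets a bye.
        for a, b in zip(standings[::2], standings[1::2]):
            winner = a if similarities[a][feat] >= similarities[b][feat] else b
            scores[winner] += 1
    return sorted(scores, key=lambda c: -scores[c])

sims = {"g1": [0.9, 0.8], "g2": [0.2, 0.3], "g3": [0.5, 0.6], "g4": [0.1, 0.2]}
ranking = swiss_rank(sims, n_rounds=2)
```

Because each round only compares candidates against near-equals, strong candidates are separated quickly without exhaustively comparing every pair.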

SLIDE 47


Indoor Gait-based Person Search Outdoor Gait-based Person Search

Lan Wei, Yonghong Tian, Yaowei Wang, Tiejun Huang, Swiss-System based Cascade Ranking for Gait-based Person Re-identification, Proc. 29th AAAI Conf., January 25–30, 2015, Austin, Texas, USA.

DEMO for Person Re-ID

SLIDE 48

Outline

 Introduction  Motivation  Challenge  Multi-task learning for precise object search

  • 1. Multi-task based person re-identification
  • 2. Multi-task based vehicle search

 Summary

SLIDE 49

Precise Vehicle Search

 Precise Vehicle Search is not an easy task

 The Twin Problem: it is very difficult to distinguish two cars of the same model and the same color

SLIDE 50

Precise Vehicle Search

 Is it really possible to distinguish two vehicles of the same model and color?

 Yes, if we can find some discriminative features  Attributes help precise vehicle search

SLIDE 51

Recent Work: Fine-Grained Visual Recognition

 The Comprehensive Cars (CompCars)

 Two scenarios: web-nature and surveillance-nature  The web-nature data contains 163 car makes with 1,716 car models  There are 136,726 images capturing whole cars and 27,618 images capturing car parts  Five attributes: maximum speed, displacement, number of doors, number of seats, and type of car  The surveillance-nature data contains 50,000 car images captured from the front view.

Yang, Linjie, et al. "A large-scale car dataset for fine-grained categorization and verification." CVPR 2015.

SLIDE 52

Recent Work: Vehicle re-identification

 Appearance-based coarse filtering: low-level hand-crafted features and high-level semantic attributes  Plate-based accurate search: a Siamese neural network is trained for license plate verification instead of recognizing the characters  Spatiotemporal relation model: utilized to re-rank vehicles

Liu, Xinchen, et al. "A Deep Learning-Based Approach to Progressive Vehicle Re- identification for Urban Surveillance." ECCV, 2016.

SLIDE 53

 Framework

 Use a deep convolutional network for feature extraction  Map raw image data into a Euclidean feature space  Use L2 distance to measure image similarity
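Once images are embedded in that space, the search step reduces to nearest-neighbor ranking by L2 distance; a minimal sketch with made-up gallery features (the feature extractor itself is not shown):

```python
import numpy as np

def search(query_feat, gallery_feats, top_k=3):
    """Rank gallery IDs by L2 distance between learned embeddings."""
    ids = list(gallery_feats)
    dists = [np.linalg.norm(query_feat - gallery_feats[i]) for i in ids]
    order = np.argsort(dists)
    return [ids[i] for i in order[:top_k]]

gallery = {"a": np.array([1.0, 0.2]),
           "b": np.array([5.0, 5.0]),
           "c": np.array([0.95, 0.0])}
ranking = search(np.array([1.0, 0.0]), gallery)   # nearest embeddings first
```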

Hongye Liu, Yonghong Tian, Yaowei Wang, Lu Pang, Tiejun Huang, Deep Relative Distance Learning: Tell the Difference Between Similar Vehicles, Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2016

Our Method 1: Deep Relative Distance Learning

SLIDE 54

Deep Relative Distance Learning

 Drawbacks of triplet loss

 Slow convergence  Fail to handle some special cases

 An enhanced version:

 Coupled Cluster Loss (CCL)

(Figure: estimate the cluster center, then compute the cluster loss.)
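A hedged sketch of a coupled-clusters-style loss under one plausible reading (the exact formulation is in the CVPR 2016 paper; the margin name `alpha` is mine): estimate the positive cluster center, then require every positive sample to lie closer to the center, by a margin, than the nearest negative sample does.

```python
import numpy as np

def coupled_cluster_loss(pos_feats, neg_feats, alpha=0.5):
    """Estimate the positive cluster center, then penalize any positive
    sample that is not closer to the center (by margin alpha) than the
    nearest negative sample is."""
    center = pos_feats.mean(axis=0)
    d_pos = np.sum((pos_feats - center) ** 2, axis=1)
    d_neg_min = np.min(np.sum((neg_feats - center) ** 2, axis=1))
    return np.maximum(0.0, d_pos + alpha - d_neg_min).mean()
```

Compared with the plain triplet loss, anchoring distances at a cluster center rather than a single reference sample uses the whole positive set at once, which is one way to address the slow-convergence issue listed above.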

SLIDE 55

Multi-task Deep Learning

 Determine whether two images are of the same vehicle:

 If they are of the same color and vehicle model?  If they have any common marks?

 Mixed Difference Network (Multi-task learning)

 One branch for attribute recognition (model, color, …)  One branch for discriminative features learning

SLIDE 56

Training

 Network training

 Step 1: training two branches separately

 Branch 1: vehicle model and color classification; batch data are selected across different vehicle models  Branch 2: coupled clusters loss (conv layers 1–3 fixed); batch data are selected within a specific vehicle model

 Step 2: training the entire network

 Set the learning rate of fc8 10 times larger than other layers  Loss weights of loss 1,2,3 are set as 0.5, 0.5 and 1.0

SLIDE 57

VehicleID Dataset

 Dataset

 221, 763+ images of 26,267 vehicles(8.44 images/vehicle in average)  Each vehicle has an unique ID (labeled by its license plate)  111,585 images of 13,133 vehicles have model labels(250 models)

SLIDE 58

Experimental Results

 By mAP  By match rate

SLIDE 59

Experimental Results

 Results of precise vehicle search

SLIDE 60

 Multi-grain Relationship

 Given multiple attributes, the relationship between vehicle images is abstracted into multiple grains (levels)  It is difficult to directly optimize under such strong constraints

 generalized pairwise ranking  multi-grain list ranking

Our Method 2: Multi-grain Constraints based Ranking

Ke Yan, Yonghong Tian, Yaowei Wang and Wei Zeng, "Exploiting Multi-Grain Ranking Constraints for Precisely Searching Visually-similar Vehicles." Submitted to IEEE International Conference on Computer Vision (ICCV), 2017.

SLIDE 61

Multi-grain Constraints based Ranking

 Generalized pairwise ranking

 Generalize conventional pairwise ranking, which only considers binary similar/dissimilar relations, to multiple relations  Jointly optimize multi-attribute learning and generalized pairwise ranking

n indicates the number of image pairs. q(j, r) represents the grain prediction value on the r-th grain of the j-th pair. h(j) = n means that the ground-truth grain of the j-th pair is n. y indicates the type of attributes (ID, model and color). b_z(y) = n means that the ground-truth category on the z-th attribute of the y-th image is n. q(y, z, k) represents the prediction value on the k-th category of the z-th attribute of the y-th image. μ is a weight to control the balance of the two tasks.

SLIDE 62

Datasets

 Two high-quality and well-annotated vehicle datasets

 Each image is labeled with ID, precise vehicle model and color  VD1 and VD2 are the largest high-quality annotated vehicle datasets published so far.

SLIDE 63

Multi-grain Constraints based Ranking

 Experiment results

SLIDE 64

Example: Precise Vehicle Search

 Does not rely on the license plate (so it can detect fake license plates)  Insensitive to blurred images

query Rank no.1 Rank no.2 Rank no.3

SLIDE 65

Example: Precise Vehicle Search

 Insensitive to occlusion  Insensitive to car pose

SLIDE 66

A Practical System in Wendeng City

(Diagram: toll-gate cameras connect through switches to the data center and the search sub-system, which serves users.)

 Input: Type 1, image frames from video toll gates, resolution 1920×1144; Type 2, images from image toll gates, resolution 1536×2048  Data: 1.5M images per day  3 GPU servers (2 cores with 8 NVIDIA Tesla K40) for vehicle detection  6 CPU servers (Hadoop platform)  1 storage server (32 TB)

SLIDE 67

DEMO for vehicle search

SLIDE 68

CNN vs. SIFT-like Features

 Experimental results

 Database: 611,944 images from two cities  Query images: 1,000 images (1,000 randomly chosen vehicles, 1 random image each)  Evaluation criterion: mean average precision (mAP)

Method               Feature size   mAP
SIFT                 4~5K (Bpi)     0.3512
Our deep feature     4K (Bpi)       0.4206
Our compact feature  1K (bpi)       0.4191
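The mAP criterion used throughout these experiments can be computed as follows (a standard sketch, not the authors' evaluation code): for each query, average the precision at every rank where a correct match appears, then average over all queries.

```python
def average_precision(relevance):
    """AP for one query; `relevance` is the ranked list of 0/1 flags
    (1 = the returned image shows the same object as the query)."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

def mean_ap(per_query_relevance):
    """mAP: the mean of AP over all queries."""
    return sum(average_precision(r) for r in per_query_relevance) / len(per_query_relevance)
```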

SLIDE 69

Ke Yan, Yaowei Wang, Dawei Liang, Tiejun Huang, Yonghong Tian, CNN vs. SIFT for Image Retrieval: Alternative or Complementary? Proc. ACM International Conference on Multimedia, Amsterdam, The Netherlands, Oct 2016.

CNN vs. SIFT-like Features

 Complementary between CNN and SIFT-like features

 UKBench database: 10,200 images of 2,550 objects  Evaluation criterion: mean average precision (mAP)


SLIDE 70

Outline

 Introduction  Motivation  Challenge  Multi-task learning for precise object search

  • 1. Multi-task based person re-identification
  • 2. Multi-task based vehicle search

 Summary

SLIDE 71

Summary

 Beyond visual search: (traditional) image search → fine-grained image search → precise object search

 Precise Person Search

 Multi-task learning: handles the challenge of a person’s appearance changes  Transfer learning: handles the problem of small datasets

 Precise Vehicle Search

 Multi-task learning & deep relative distance learning: find some discriminative features to distinguish different vehicles with similar appearance

SLIDE 72

Summary

 Future Directions

 Benchmarking: billion-scale benchmark datasets  Multi-task Feature: more discriminative global and local deep features, for both fine-grained categorization and search  Unified Framework: one framework for detection, recognition and search  Efficiency: compact descriptors for multiple tasks via learning to hash

SLIDE 73

Acknowledgement

We gratefully acknowledge the support from NVIDIA NVAIL program.

SLIDE 74


Thanks!

Yonghong Tian: yhtian@pku.edu.cn Fan Yang: fyang.eecs@pku.edu.cn