Beyond detection: GANs and LSTMs to pay attention at human presence
Rita Cucchiara
Imagelab, Dipartimento di Ingegneria «Enzo Ferrari», University of Modena e Reggio Emilia, Italy. Talk @Munich, October 11, 2017
✓ 10 years of pedestrian detection [S. Zhang, R. Benenson, et al.]: ~70% accuracy on Caltech
✓ Many deep networks for pedestrian detection: CNNs + handcrafted features: 9% miss rate on the Caltech «reasonable» dataset*
✓ Object detectors: YOLOv2*** 78.6% mAP on VOC2007-12 at 40 fps. Still a margin of improvement…
* SSD [W. Liu et al., «SSD: Single Shot MultiBox Detector», 2017]
** FAST CFM [Hu, Wang, Shen, van den Hengel, Porikli, IEEE TCSVT 2017]
*** YOLOv2 [Redmon, Farhadi, arXiv 2017]
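Benchmark numbers like mAP and miss rate all rest on the same primitive: matching predicted boxes to ground truth by Intersection-over-Union. A minimal sketch (the corner box format and thresholds are conventions, not specific to any detector above):

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A prediction usually counts as a true positive when its IoU with an unmatched ground-truth box exceeds 0.5 (the PASCAL VOC convention behind the mAP figures above).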
Real-time detection of people and AGVs in working areas on embedded NVIDIA boards at Imagelab.
Standard networks are not enough: embedded vision solutions combine background subtraction and CNNs.
Thanks to PANASONIC
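One way such an embedded pipeline can combine the two stages is to keep a cheap per-pixel background model and run the CNN only where motion appears. A generic sketch under assumed update rate and threshold, not the actual Imagelab/Panasonic implementation:

```python
import numpy as np

def update_background(bg, frame, alpha=0.05):
    """Exponential running-average background model (per-pixel)."""
    return (1 - alpha) * bg + alpha * frame

def foreground_mask(bg, frame, thresh=30):
    """Pixels whose absolute difference from the background exceeds thresh;
    only regions flagged here would be passed to the (expensive) CNN."""
    return np.abs(frame.astype(float) - bg) > thresh
```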
Attributes: black hair · object · plastic bag · jacket · backpack · long trousers · male
Now CNNs can classify more than 50 attributes. Problems with:
low resolution · occlusion
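Classifying 50+ attributes is typically framed as multi-label prediction: one independent sigmoid per attribute rather than a single softmax over classes. A minimal sketch (the attribute names and the 0.5 threshold are illustrative assumptions):

```python
import numpy as np

def predict_attributes(logits, names, thresh=0.5):
    """Multi-label prediction: an independent sigmoid per attribute,
    so a person can be 'male' AND carry a 'backpack' at the same time."""
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    return [n for n, p in zip(names, probs) if p > thresh]
```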
[GAN diagram: Noise → Generator → Discriminator]
“..a generative model G captures the data distribution, ..a discriminative model D estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake” [I. Goodfellow … Y. Bengio, 2014]
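The quoted training procedure corresponds to the two-player minimax objective from the same paper:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```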
Incomplete · Low resolution
A conditional generative model p(x | c) can be obtained by adding c as input to both G and D.
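A shape-level sketch of that conditioning: the condition vector c is concatenated to the generator's noise input and to the discriminator's sample input. The dimensions and toy linear layers below are assumptions for illustration, not the talk's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 100-d noise, 10-d one-hot condition, 64-d sample.
z_dim, c_dim, x_dim = 100, 10, 64

z = rng.normal(size=z_dim)
c = np.eye(c_dim)[3]                 # condition, e.g. a class one-hot

# Conditioning p(x | c): c is appended to the input of both G and D.
W_g = rng.normal(size=(x_dim, z_dim + c_dim)) * 0.01     # toy generator weights
x_fake = np.tanh(W_g @ np.concatenate([z, c]))           # G([z; c])

W_d = rng.normal(size=(1, x_dim + c_dim)) * 0.01         # toy discriminator weights
p_real = 1 / (1 + np.exp(-(W_d @ np.concatenate([x_fake, c]))))  # D([x; c])
```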
RAP: A Richly Annotated Dataset for Pedestrian Attribute Recognition [http://rap.idealtest.org/]. Dataset dimension:
Fabbri, Calderara, Cucchiara, «Generative Adversarial Models for People Attribute Recognition in Surveillance», IEEE AVSS 2017
[Diagram: Reconstruction GAN for de-occlusion. An Encoder–Decoder Generator produces a de-occluded (fake) image from an occluded RAP input; the Discriminator receives the de-occluded (fake) image and the original image (real). Losses: SSE comparison between reconstruction and original, plus cross-entropy for the Discriminator.]
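The two losses in the diagram can be combined as sketched below; the relative weight `lam` is a hypothetical knob, not a value from the AVSS 2017 paper:

```python
import numpy as np

def sse(recon, target):
    """Sum-of-squared-errors reconstruction term (de-occluded vs. original)."""
    return float(np.sum((recon - target) ** 2))

def bce(p, label, eps=1e-7):
    """Binary cross-entropy adversarial term (real = 1, fake = 0)."""
    p = np.clip(p, eps, 1 - eps)
    return float(-(label * np.log(p) + (1 - label) * np.log(1 - p)))

def generator_loss(recon, target, d_fake, lam=1.0):
    """Generator objective: reconstruct the original AND fool the
    discriminator into labeling the reconstruction as real."""
    return sse(recon, target) + lam * bce(d_fake, 1.0)
```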
Datasets: RAP and lowRAP (a low-resolution version of RAP built at Imagelab).

[Generator: encoder–decoder.
Encoder: four 5×5 convolutions with stride 2 (SConv1–SConv4), each followed by Batch-Norm and Leaky-ReLU; feature maps: input 3×160×64 → 128×80×32 → 256×40×16 → 512×20×8 → 1024×10×4.
Decoder: four 5×5 transposed convolutions with 2× upsampling (TConv1–TConv4), each followed by Batch-Norm and ReLU, back up to the 3×160×64 output.]

[Discriminator: four 5×5 convolutions with stride 2 (SConv1–SConv4) on the input image (fake or real), down to a 1×1×1 classification output (fake or real).]
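The halving progression of the encoder feature maps can be checked with simple shape arithmetic, assuming padding 2 for the 5×5 stride-2 convolutions (the padding value is an assumption consistent with the sizes on the slide):

```python
def conv_out(size, kernel=5, stride=2, pad=2):
    """Output spatial size of a strided convolution."""
    return (size + 2 * pad - kernel) // stride + 1

def encoder_shapes(h=160, w=64, channels=(3, 128, 256, 512, 1024)):
    """Feature-map shapes through the four stride-2 encoder convs (SConv1..SConv4)."""
    shapes = [(channels[0], h, w)]
    for ch in channels[1:]:
        h, w = conv_out(h), conv_out(w)
        shapes.append((ch, h, w))
    return shapes
```

With these assumptions the spatial size halves at every stage, reproducing the slide's progression from 160×64 down to 10×4.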
De-occlusion · Super-Resolution
Network details:
- Attribute Classification Network: batch size 8, GPU 1080 Ti, training time 24 hours
- Reconstruction GAN (for de-occlusion): batch size 256, GPU 1080 Ti, training time 48 hours
- Super-Resolution GAN (for image resolution): batch size 128, GPU 1080 Ti, training time 72 hours
EU ER-FESR 2015-2018
From [D. Zhang, H. Maei, X. Wang, Y.-F. Wang, Samsung / UCSB, arXiv 2017]
* [S.-E. Wei, V. Ramakrishna, T. Kanade, Y. Sheikh, «Convolutional Pose Machines», CVPR 2016]
(e.g. “left-shoulder”)
PAF (Part Affinity Field) to assemble the detected joints: the score of a candidate limb is proportional to its alignment with the PAF associated with that type of limb.
TAF (Temporal Affinity Field) to link the corresponding joints of the same person in consecutive frames (for an unknown number of people).
[Diagram: “left knee” joint; PAF vector connecting two nodes; TAF vector connecting the same node across time.]
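The limb score described above can be sketched as a sampled line integral: PAF vectors are read at points along the candidate segment and projected onto the unit limb direction. Nearest-pixel sampling and the fixed sample count are simplifications of the interpolated version used in practice:

```python
import numpy as np

def limb_score(paf, p1, p2, n_samples=10):
    """Score a candidate limb between joints p1 and p2: average alignment
    (dot product) of the PAF with the unit vector along the limb,
    sampled at points on the segment.
    paf: (H, W, 2) field; p1, p2: (x, y) pixel coordinates."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    d = p2 - p1
    norm = np.linalg.norm(d)
    if norm == 0:
        return 0.0
    u = d / norm                             # unit limb direction
    ts = np.linspace(0, 1, n_samples)
    pts = p1[None, :] + ts[:, None] * d[None, :]
    xs, ys = pts[:, 0].astype(int), pts[:, 1].astype(int)
    vecs = paf[ys, xs]                       # sampled PAF vectors
    return float(np.mean(vecs @ u))
```

A limb perfectly aligned with the field scores 1; one perpendicular to it scores 0, which is what makes the score usable for greedy joint assembly.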
ScriptHook library: functions available to the game engine
The deep architecture and the software are property of Imagelab UNIMORE. We thank the Jump project, funded within the EU ER-FESR 2015-2020 program.
T-CPMs do not use recurrence but work on sequences of frames, refining with iterations with long convolutional layers.
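Two ingredients of that design can be sketched separately: frames stacked along channels (so ordinary convolutions see temporal context without recurrence) and belief maps refined over successive stages. The mixing rule below is an illustrative stand-in for the real convolutional stages:

```python
import numpy as np

def stack_frames(frames):
    """No recurrence: a short window of frames is concatenated along the
    channel axis so plain convolutions can see temporal context.
    (Channel stacking is an illustrative assumption, not the exact T-CPM input.)"""
    return np.concatenate(frames, axis=0)          # (K*C, H, W)

def refine_stage(features, beliefs, weight=0.5):
    """One stand-in refinement stage: in a real CPM this is a stack of
    (long) convolutional layers over features + previous belief maps."""
    return weight * features + (1 - weight) * beliefs

def cascade(features, n_stages=3):
    """Belief maps refined iteratively over successive stages, CPM-style."""
    beliefs = np.zeros_like(features)
    for _ in range(n_stages):
        beliefs = refine_stage(features, beliefs)
    return beliefs
```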
(thanks to Marcella Cornia, Giuseppe Serra and Lorenzo Baraldi)
Saliency Attentive Model (SAM): ML-Net + LSTMs
M. Cornia, L. Baraldi, G. Serra, R. Cucchiara, «A Deep Multi-Level Network for Saliency Prediction», ICPR 2016.
Benchmarks: MIT300 (Itti, Torralba et al.): more than 70 competitors since 2014; SALICON (Jiang et al., 2015):
10,000 training images · 5,000 validation images · 5,000 test images
Bottom-up saliency, detected by ML-Net trained on SALICON, on the DR(eye)VE dataset (http://imagelab.ing.unimore.it/dreyeve): saliency not driven by a task… vs. saliency trained by driving: as a passenger sees vs. as a driver sees.
Collected with SMI ETG 2w (frontal camera 720p/30fps + eye-pupil cameras at 60fps) and Garmin VirbX (1080p/25fps + GPS).
rita.cucchiara@unimore.it · http://imagelab.ing.unimore.it
Thanks to