Beyond detection: GANs and LSTMs to pay attention at human presence





SLIDE 1

Beyond detection: GANs and LSTMs to pay attention at human presence

Rita Cucchiara

Imagelab, Dipartimento di Ingegneria «Enzo Ferrari», University of Modena e Reggio Emilia, Italy. Talk @Munich, October 11, 2017

SLIDE 2

Agenda

Beyond human detection: 1) See humans 2) See what humans see. Use of GANs, iterative and recurrent neural architectures in Vision.

SLIDE 3

Beyond (People) detection

✓ 10 years of pedestrian detection [S. Zhang, R. Benenson, M. Omran, J. Hosang, B. Schiele, CVPR 2016]: about 70% accuracy on Caltech
✓ Many deep networks for pedestrian detection: CNNs + handcrafted features: 9% miss rate on the Caltech reasonable dataset*
✓ Object detectors: SSD**, YOLO, YOLOv2***.. YOLOv2: 78.6% mAP on VOC2007-12 at 40fps. Still a margin of improvement..

*FAST CFM [Hu, Wang, Shen, van den Hengel, Porikli, IEEE TCSVT 2017] **SSD [W. Liu et al., SSD: Single Shot MultiBox Detector, 2017] ***YOLOv2 [Redmon, Farhadi, arXiv 2017]

SLIDE 4

CHALLENGES IN NEW ENVIRONMENTS

Real-time detection of people and AGVs in working areas on embedded NVIDIA boards at Imagelab

Standard networks are not enough. Embedded vision solutions with background subtraction and CNNs.

SLIDE 5

GANS FOR UNDERSTANDING HUMAN PRESENCE UNDER EXTREME CONDITIONS (thanks to Matteo Fabbri and Simone Calderara)

If People Detection Solved..

thanks to PANASONIC

SLIDE 6

Attribute Classification

Example attributes: black hair, object: plastic bag, jacket, backpack, long trousers, male

Now CNNs can classify more than 50 attributes. Problems with:

  • Low resolution
  • Occlusions and self-occlusions

[Examples: low-resolution and occluded detections]

SLIDE 7

Generative Adversarial Networks

[Diagram: noise → Generator (CNN) → sample → Discriminator (CNN); generator inputs may be incomplete or low-resolution]

“..a generative model G captures the data distribution, ..a discriminative model D estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake” [I. Goodfellow .. Y. Bengio, 2014]

A conditional generative model p(x | c) can be obtained by adding c as input to both G and D.
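As an illustration (not code from the talk), the GAN objective quoted above can be sketched numerically; the function names and probability values below are hypothetical:

```python
import math

def d_loss(d_real: float, d_fake: float) -> float:
    """Discriminator loss: push D(x|c) -> 1 on real data and
    D(G(z|c)|c) -> 0 on generated data."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def g_loss(d_fake: float) -> float:
    """Non-saturating generator loss: push D(G(z|c)|c) -> 1,
    i.e. maximize the probability of D making a mistake."""
    return -math.log(d_fake)

# The generator's loss shrinks as it fools the discriminator:
print(g_loss(0.1), g_loss(0.5), g_loss(0.9))
```

The conditioning c does not change these losses; it only enters as an extra input to both networks.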

SLIDE 8

Generated Dataset

RAP: A Richly Annotated Dataset for Pedestrian Attribute Recognition [http://rap.idealtest.org/] Dataset dimension:

  • 41,585 pedestrian samples
  • 33,268 for training
  • 8,317 for testing

Dataset image resolution:

  • from 36x92 to 344x554

Fabbri, Calderara, Cucchiara Generative Adversarial Models for People Attribute Recognition in Surveillance IEEE AVSS 2017

With a GAN from Noise..

SLIDE 9

Generative Adversarial Network for De-occlusion (or Super-Resolution)

[Pipeline: an Encoder-Generator-Decoder produces a de-occluded (fake) image from an occluded input; the Discriminator classifies it against the original (real) RAP image with a cross-entropy loss, while the output is also compared to the original with an SSE loss. Datasets: RAP, plus occRAP (occluded) and lowRAP (low-resolution), built by Imagelab]
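A minimal sketch (my illustration, not Imagelab's code) of the generator objective implied by the diagram: an SSE reconstruction term against the original image plus a cross-entropy adversarial term. `adv_weight` is an assumed balancing hyperparameter, not a value from the talk:

```python
import math

def sse(recon, target):
    """Sum of squared errors between reconstruction and original image
    (both given here as flat lists of pixel values)."""
    return sum((r - t) ** 2 for r, t in zip(recon, target))

def generator_loss(recon, target, d_fake, adv_weight=0.01):
    """De-occlusion generator objective (sketch): reconstruct the
    original image (SSE term) while fooling the discriminator
    (cross-entropy term on D's output for the fake image)."""
    return sse(recon, target) - adv_weight * math.log(d_fake)
```

With a perfect reconstruction and a fully fooled discriminator (d_fake = 1), both terms vanish.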

SLIDE 10

Selected Architecture

[Architecture diagram. Generator: an encoder-decoder. The encoder applies four 5x5 stride-2 convolutions (SConv1-SConv4), each followed by Batch-Norm and Leaky-ReLU, taking the 3x160x64 input down through 80x32, 40x16 and 20x8 feature maps to 1024x10x4. The decoder applies four 5x5 transposed convolutions (TConv1-TConv4, each upsampling by 2), each followed by Batch-Norm and ReLU, back up to the 3x160x64 output. Discriminator: four 5x5 stride-2 convolutions (SConv1-SConv4) on the input image (fake or real), ending in a 1x1x1 classification (fake or real).]
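The halving of the spatial resolution at each encoder stage can be checked with the standard strided-convolution size formula. Padding of 2 for the 5x5 kernels is my assumption (it gives 'same'-style behavior), not something stated on the slide:

```python
def conv_out(size: int, kernel: int = 5, stride: int = 2, pad: int = 2) -> int:
    """Spatial output size of a strided convolution."""
    return (size + 2 * pad - kernel) // stride + 1

# Four stride-2 convolutions take the 160x64 input down to 10x4.
shapes = [(160, 64)]
for _ in range(4):
    h, w = shapes[-1]
    shapes.append((conv_out(h), conv_out(w)))
print(shapes)  # [(160, 64), (80, 32), (40, 16), (20, 8), (10, 4)]
```

The decoder's four upsample-by-2 transposed convolutions invert exactly this progression.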

SLIDE 11

RESULTS

De-occlusion Super Resolution

SLIDE 12

The Complete Approach: De-occlusion and Super-resolution For Aspect Recognition

Attribute Classification Network details: batch size 8, GPU: 1080Ti, training time: 24 hours

Reconstruction GAN (for de-occlusion) details: batch size 256, GPU: 1080Ti, training time: 48 hours

Super-Resolution GAN (for image resolution) details: batch size 128, GPU: 1080Ti, training time: 72 hours

SLIDE 13

Attribute classification

✓ Precision and recall above 75% for 50 people attributes on RAP ✓ Acceptable results for occluded shapes and good results for low-resolution shapes
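For concreteness, per-attribute precision and recall on a multi-label benchmark like RAP can be computed as below. This is an illustrative sketch; the 75% figure on the slide comes from the cited AVSS 2017 paper, not from this code:

```python
def precision_recall(pred, truth):
    """Precision and recall for one binary attribute over a set of
    samples; pred and truth are parallel lists of 0/1 labels."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# e.g. one false positive out of three positive predictions:
print(precision_recall([1, 1, 0, 1], [1, 0, 0, 1]))
```

Averaging these per-attribute scores over all 50 attributes gives the kind of aggregate figure quoted above.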

SLIDE 14

TRACKING HUMANS IN THE WILD BY JUNCTIONS WITH CPM (thanks to Fabio Lanzi and Simone Calderara)

If People Detection still not solved.. ..without detection

EU-ER- FESR 2015-2018

SLIDE 15

State-of-the-art: Recurrent Nets for object tracking

For long-term tracking:

  • YOLO network for detection (fine-tuned on PascalVOC)
  • NVIDIA GTX1080 GPU, 45fps (Python, TensorFlow)
  • 70fps with precomputed YOLO features

Recurrence is provided by an LSTM. Very fast. Still very low accuracy..

From [D.Zhang, H,Maei,X.Wang, Y-F. Wang Samsung, UCSB ArXiv 2017]

SLIDE 16

Recurrence with CPM

✓ CPM (Convolutional Pose Machines*): a sequence of convolutional nets that repeatedly produce 2D belief maps for the location of interesting parts (human junctions)
✓ A belief map is a non-parametric encoding of the spatial uncertainty of location
✓ CPM learns implicit relationships between parts
✓ It is not recurrent but a multi-stage network, trained with backpropagation

*[S-E. Wei, V. Ramakrishna, T. Kanade, Y. Sheikh, «Convolutional Pose Machines», CVPR 2016]

SLIDE 17

Without detection: Temporal CPM3

Imagelab: tracking multiple body parts with T-CPM (Temporal Convolutional Pose Machines). An iterative network (CPM) for predicting:

  • the position of joints (H)
  • their mutual association in space (P)
  • their association in time (T)
SLIDE 18

Three Branches: Heatmaps, PAFs and TAFs

  • Heatmaps model the part locations as gaussian peaks in the map; one for each joint (“nose”, “neck”, “left-shoulder”, ..)
  • PAFs (Part Affinity Fields) assemble the detected joints: the score of a candidate limb is proportional to its alignment with the PAF associated with that type of limb
  • TAFs (Temporal Affinity Fields) link the corresponding joints of the same person in consecutive frames (for an unknown number of people)

[Diagram: a “left knee” joint; a PAF vector connecting two nodes; a TAF vector connecting the same node across time]
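The first two branches can be made concrete with a toy sketch (my illustration; the grid sizes, the sigma, and the uniform line-sampling scheme are assumptions, not the paper's implementation): a gaussian belief map for one joint, and a PAF-style limb score computed as the average alignment between the vector field and the candidate limb direction:

```python
import math

def gaussian_heatmap(h, w, cx, cy, sigma=1.5):
    """Belief map for one joint: a gaussian peak at location (cx, cy)."""
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
             for x in range(w)] for y in range(h)]

def paf_limb_score(paf, j1, j2, samples=10):
    """Score a candidate limb j1 -> j2: average dot product between the
    PAF (a 2D vector field, paf[y][x] = (vx, vy)) and the unit vector
    along the limb, sampled at points on the segment."""
    (x1, y1), (x2, y2) = j1, j2
    dx, dy = x2 - x1, y2 - y1
    norm = math.hypot(dx, dy) or 1.0
    ux, uy = dx / norm, dy / norm
    total = 0.0
    for i in range(samples):
        t = i / (samples - 1)
        px, py = int(round(x1 + t * dx)), int(round(y1 + t * dy))
        vx, vy = paf[py][px]
        total += vx * ux + vy * uy  # alignment at this sample point
    return total / samples
```

A TAF score would be computed the same way, but between the positions of one joint in two consecutive frames rather than between two joints of one frame.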

SLIDE 19

Visual Example

SLIDE 20

How to provide initial annotation?

ScriptHook library

  • Photorealistic
  • Plausible dynamics
  • Lifelike entity AI
  • Access to native GTA functions
  • Customizable
  • Extracts all the information available to the game engine

SLIDE 21

T-CPM3 In Action On Tracking People In The Wild

The deep architecture and the software are property of ImageLab UNIMORE. We thank the Jump project, funded within the EU ER-FESR 2015-2020 program.

SLIDE 22

For Tracking, Action, Behavior Recognition

T-CPMs do not use recurrence but work on sequences of frames, refining with iteration through long convolutional layers

  • Problems of vanishing gradient
  • Long Short-Term Memory architectures can provide a solution for time iterations, but not for long time sequences
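To make the LSTM point concrete, here is a scalar toy LSTM step (an illustration, not the talk's model; the weight names are made up). The additive cell update c = f·c + i·g is what lets gradients survive across many time steps, unlike a purely multiplicative recurrence:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, W):
    """One LSTM step (scalar toy version). W is a dict of scalar
    weights/biases: wi/ui/bi (input gate), wf/uf/bf (forget gate),
    wo/uo/bo (output gate), wg/ug/bg (candidate)."""
    i = sigmoid(W["wi"] * x + W["ui"] * h + W["bi"])   # input gate
    f = sigmoid(W["wf"] * x + W["uf"] * h + W["bf"])   # forget gate
    o = sigmoid(W["wo"] * x + W["uo"] * h + W["bo"])   # output gate
    g = math.tanh(W["wg"] * x + W["ug"] * h + W["bg"]) # candidate value
    c = f * c + i * g          # additive cell update
    h = o * math.tanh(c)       # gated output
    return h, c
```

When the forget gate saturates near 1, the cell state is carried through almost unchanged, which is exactly the mechanism that mitigates vanishing gradients over moderate horizons.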
SLIDE 23

SALIENCY DETECTION WITH LSTMS: THE SAM ARCHITECTURE

(thanks to Marcella Cornia, Giuseppe Serra and Lorenzo Baraldi)

If target Detection is not required..

SLIDE 24

SALIENCY DETECTION @Imagelab

✓ SAM, the Saliency Attentive Model: ML-NET + LSTMs

M. Cornia, L. Baraldi, G. Serra, and R. Cucchiara. A Deep Multi-Level Network for Saliency Prediction, ICPR 2016

Benchmarks: MIT300 (Itti, Torralba et al.), more than 70 competitors since 2014; SALICON (Jiang et al. 2015), 10,000 images

SLIDE 25

Winner of the LSUN Challenge competition, CVPR 2017

Number of images: 20,000 (10,000 training, 5,000 validation, 5,000 test)

GPU: NVIDIA K80 on the GALILEO supercomputer (CINECA). Training time: ~15 hours

SLIDE 26

[Groundtruth vs. SAM predictions on the Actions in the Eye (Hollywood2) dataset]

SLIDE 27

Saliency in task-driven video

Bottom-up saliency, detected by ML-NET (trained on SALICON), on the DR(EYE)VE dataset http://imagelab.ing.unimore.it/dreyeve is saliency not driven by a task: it sees as a passenger sees. Saliency trained on driving sees as a driver sees.

SLIDE 28

SIFT-BASED REGISTRATION FRAME BY FRAME

Collected with SMI ETG 2w: frontal camera 720p/30fps + eye-pupil cameras at 60fps; GARMIN Virb X: 1080p/25fps + GPS.

SLIDE 29

Some conclusion (if any)

✓ Computer vision is now a Deep Learning based discipline
✓ Computer vision systems cannot be built without GPUs (both in training and at run-time)
✓ Conv-Nets are fundamental bricks for new architectures
✓ Autoencoders: for image generation
✓ (Conditional) Generative Adversarial Networks: for low-resolution, occluded attribute recognition
✓ Multi-layer convolutional networks for emulating recurrence, as T-CPM3 for tracking
✓ Recurrent nets and Long Short-Term Memories for short-time analysis: saliency and video captioning
✓ …

Computer Vision, Deep Architectures, GPUs

SLIDE 30

Thank you!

rita.cucchiara@unimore.it http://imagelab.ing.unimore.it

Acknowledgements

Thanks to