Computer Vision: Weakly-supervised learning from video and images - PowerPoint PPT Presentation



SLIDE 1

Computer Vision: Weakly-supervised learning from video and images

CSClub Saint Petersburg, November 17, 2014

Ivan Laptev
ivan.laptev@inria.fr
WILLOW, INRIA/ENS/CNRS, Paris

Joint work with: Piotr Bojanowski, Rémi Lajugie, Maxime Oquab, Francis Bach, Léon Bottou, Jean Ponce, Cordelia Schmid, Josef Sivic

SLIDE 2

About the company

VisionLabs is a team of professionals with deep knowledge and substantial practical experience in developing computer vision algorithms and intelligent systems.

We create and deploy computer vision technologies, opening new opportunities for changing the world around us for the better.

Contacts: Official website: http://visionlabs.ru/ Contact person: Alexander Khanin E-mail: a.khanin@visionlabs.ru Tel.: +7 (926) 988-7891

SLIDE 3

Team

Alexander Khanin, Chief Executive Officer; Alexey Nekhaev, Executive Officer; Slava Kazmin, Chief Technical Officer; Ivan Laptev, Scientific advisor; Sergey Milyaev, Senior CV engineer; Alexey Kordichev, Financial advisor; Ivan Truskov, Software developer; Sergey Cherepanov, Software developer

Our team is a symbiosis of science and business

Areas of activity

Face recognition technology: a system for detecting fraudsters in banks

License plate recognition technology: a system for vehicle access accounting and automation

Safe city technologies: a system for detecting violations and dangerous situations

SLIDE 4

Achievements

National-scale projects

SLIDE 5

We are looking for like-minded people

Creating and deploying intelligent systems. Solving interesting practical problems. Working in a friendly, ambitious team.

Thank you for your attention!

Contacts: Official website: http://visionlabs.ru/ Contact person: Alexander Khanin E-mail: a.khanin@visionlabs.ru Tel.: +7 (926) 988-7891

SLIDE 6

What is Computer Vision?

SLIDE 7

What is Computer Vision?

SLIDE 8

SLIDE 9

What is the recent progress?

1990s:

  • Research: recognition at the level of a few toy objects (COIL-20 dataset)
  • Industry: automated quality inspection (controlled lighting, scale, …)

Now:

  • Face recognition in social media
  • ImageNet: 14M images, 21K classes; 6% top-5 error rate in the 2014 challenge

SLIDE 10

Why image and video analysis?

Data:

  • ~5K image uploads every minute
  • >34K hours of video uploaded every day
  • TV channels recorded since the 60s
  • ~30M surveillance cameras in the US => ~700K video hours/day
  • ~2.5 billion new images / month
  • And even more with future wearable devices

SLIDE 11

Why looking at people?

How many person-pixels are in the video? (Movies, TV, YouTube)

SLIDE 12

Why looking at people?

How many person-pixels are in the video? Movies: 40%, TV: 35%, YouTube: 34%

SLIDE 13

How many person-pixels in our daily life?

Wearable camera data: Microsoft SenseCam dataset 

SLIDE 14

How many person-pixels in our daily life?

Wearable camera data: Microsoft SenseCam dataset 

~4%

SLIDE 15

What are the difficulties?

  • Large variations in appearance: occlusions, non-rigid motion, viewpoint changes, clothing…
  • Manual collection of training samples is prohibitive: many action classes, rare occurrence
  • Action vocabulary is not well-defined

Action Open: … Action Hugging: …

SLIDE 16

This talk:

  • Brief overview of recent techniques
  • Weakly-supervised learning from video and scripts
  • Weakly-supervised learning with convolutional neural networks

SLIDE 17

Standard visual recognition pipeline

GetOutCar AnswerPhone Kiss HandShake StandUp DriveCar

  • Collect image/video samples and corresponding class labels
  • Design an appropriate data representation with certain invariance properties
  • Design or use existing machine learning methods for learning and classification

SLIDE 18

Bag-of-Features action recognition

  • Extraction of local features (space-time patches)
  • Feature description
  • Feature quantization: K-means clustering (k=4000)
  • Occurrence histogram of visual words
  • Non-linear SVM with χ2 kernel

[Laptev, Marszałek, Schmid, Rozenfeld 2008]
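The pipeline above can be sketched in a few lines of numpy. This is a toy illustration, not the original implementation: the function names are mine, random descriptors stand in for space-time features, and fixed random centroids stand in for the k=4000 k-means vocabulary.

```python
import numpy as np

def quantize(features, centroids):
    """Assign each local feature to its nearest visual word (centroid)."""
    d = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

def bof_histogram(features, centroids):
    """L1-normalized occurrence histogram of visual words for one video."""
    words = quantize(features, centroids)
    h = np.bincount(words, minlength=len(centroids)).astype(float)
    return h / max(h.sum(), 1.0)

def chi2_kernel(h1, h2, eps=1e-10):
    """Exponentiated chi-square kernel between two BoF histograms."""
    return float(np.exp(-0.5 * (((h1 - h2) ** 2) / (h1 + h2 + eps)).sum()))

# Toy example: two "videos" of random 5-D local descriptors, 4 visual words
rng = np.random.default_rng(0)
centroids = rng.normal(size=(4, 5))   # stands in for k-means centers (k=4000 in the slide)
h1 = bof_histogram(rng.normal(size=(100, 5)), centroids)
h2 = bof_histogram(rng.normal(size=(80, 5)), centroids)
k = chi2_kernel(h1, h2)               # kernel value in (0, 1]
```

A kernel matrix of such values over all training videos would then be fed to a kernel SVM for classification.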

SLIDE 19

Action classification

Test episodes from movies “The Graduate”, “It’s a Wonderful Life”, “Indiana Jones and the Last Crusade”

SLIDE 20

Where to get training data?

Shoot actions in the lab

  • KTH dataset, Weizmann dataset, …
  • Limited variability
  • Unrealistic

Manually annotate existing content

  • HMDB, Olympic Sports, UCF50, UCF101, …
  • Very time-consuming

Use readily-available video scripts

  • www.dailyscript.com, www.movie-page.com, www.weeklyscript.com
  • Scripts are available for 1000's of hours of movies and TV series
  • Scripts describe dynamic and static content of videos
SLIDE 21

As the headwaiter takes them to a table they pass by the piano, and the woman looks at Sam. Sam, with a conscious effort, keeps his eyes on the keyboard as they go past. The headwaiter seats Ilsa...



SLIDE 25

Script-based video annotation

  • Scripts available for >500 movies (no time synchronization): www.dailyscript.com, www.movie-page.com, www.weeklyscript.com, …
  • Subtitles (with time info) are available for most movies
  • Can transfer time to scripts by text alignment

Subtitles: … 1172 01:20:17,240 --> 01:20:20,437 Why weren't you honest with me? Why'd you keep your marriage a secret? 1173 01:20:20,640 --> 01:20:23,598 It wasn't my secret, Richard. Victor wanted it that way. 1174 01:20:23,800 --> 01:20:26,189 Not even our closest friends knew about our marriage. …

Movie script: … RICK: Why weren't you honest with me? Why did you keep your marriage a secret? Rick sits down with Ilsa. ILSA: Oh, it wasn't my secret, Richard. Victor wanted it that way. Not even our closest friends knew about our marriage. … (aligned to 01:20:17 to 01:20:23)

[Laptev, Marszałek, Schmid, Rozenfeld 2008]
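The time-transfer step can be illustrated with stdlib text alignment. This is a minimal sketch, not the paper's method: `align_script_to_subtitles` and the toy data are mine, and `difflib.SequenceMatcher` stands in for whatever alignment the authors used.

```python
from difflib import SequenceMatcher

def align_script_to_subtitles(script_words, subtitle_words, subtitle_times):
    """Transfer subtitle timestamps to script words by word-level text
    alignment; subtitle_times[i] is the (start, end) of the subtitle block
    containing subtitle word i. Unmatched script words keep time None."""
    sm = SequenceMatcher(a=[w.lower() for w in script_words],
                         b=[w.lower() for w in subtitle_words], autojunk=False)
    times = [None] * len(script_words)
    for block in sm.get_matching_blocks():
        for i in range(block.size):
            times[block.a + i] = subtitle_times[block.b + i]
    return times

subs = "Why weren't you honest with me".split()
sub_times = [("01:20:17", "01:20:20")] * len(subs)
script = "RICK Why weren't you honest with me".split()
t = align_script_to_subtitles(script, subs, sub_times)
# "RICK" stays untimed; the matched dialogue words inherit the subtitle interval
```

Script passages between two timed dialogue matches can then be assigned the enclosing time interval, which is what makes scene descriptions usable as (imprecise) temporal annotation.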

SLIDE 26

Scripts as weak supervision

Challenges:

  • Imprecise temporal localization (uncertainty, e.g. 24:25 to 24:51)
  • No explicit spatial localization
  • NLP problems: scripts ≠ training labels, e.g. "… Will gets out of the Chevrolet. …" and "… Erin exits her new truck…" vs. the label Get-out-car

SLIDE 27

Previous work

Sivic, Everingham, and Zisserman, "'Who are you?': Learning Person Specific Classifiers from Video", CVPR 2009.
Buehler, Everingham, and Zisserman, "Learning sign language by watching TV (using weakly aligned subtitles)", CVPR 2009.
Duchenne, Laptev, Sivic, Bach, and Ponce, "Automatic Annotation of Human Actions in Video", ICCV 2009.

…wanted to know about the history of the trees

SLIDE 28

Joint Learning of Actors and Actions

Rick?

Rick? Walks? Walks?

[Bojanowski et al. ICCV 2013]

Rick walks up behind Ilsa

SLIDE 29

Rick Walks

Rick walks up behind Ilsa

Joint Learning of Actors and Actions

[Bojanowski et al. ICCV 2013]

SLIDE 30

Formulation: Cost function

[Figure: cost function over actor labels (Rick, Ilsa, Sam), actor image features, and an actor classifier]

SLIDE 31

Formulation: Cost function

Person p appears at least once in clip N :

p = Rick

Weak supervision from scripts:

SLIDE 32

Action a appears at least once in clip N :

a = Walk

Weak supervision from scripts:

Formulation: Cost function

SLIDE 33

Formulation: Cost function

Weak supervision from scripts:

  • Person p appears in clip N
  • Action a appears in clip N
  • Person p and Action a appear in clip N
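The "appears at least once" supervision can be written as a feasibility check on the assignment matrix Z. This is a toy sketch under my own naming (`satisfies_script`, integer name ids): the actual model of Bojanowski et al. minimizes a discriminative cost over all Z that satisfy such constraints, rather than just checking them.

```python
import numpy as np

def satisfies_script(Z, clip_of_track, script_mentions):
    """Check the 'appears at least once' constraints: Z[t, p] = 1 iff person
    track t is assigned name p; script_mentions is a set of (clip, p) pairs
    meaning the script places person p somewhere in that clip."""
    for clip, p in script_mentions:
        tracks = [t for t in range(len(Z)) if clip_of_track[t] == clip]
        if not any(Z[t][p] == 1 for t in tracks):
            return False
    return True

# Toy: 3 face tracks over clips 0 and 1; names 0 = Rick, 1 = Ilsa
clip_of_track = [0, 0, 1]
mentions = {(0, 0), (1, 1)}                 # script: Rick in clip 0, Ilsa in clip 1
Z_good = np.array([[1, 0], [0, 1], [0, 1]])
Z_bad = np.array([[0, 1], [0, 1], [1, 0]])  # nobody labelled Rick in clip 0
```

The same construction extends to the joint person-and-action case by requiring one track in the clip to carry both the mentioned name and the mentioned action.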

SLIDE 34

Image and video features

Face features:

  • Facial features [Everingham '06]
  • HOG descriptor on a normalized face image

Action features:

  • Dense Trajectory features in the person bounding box [Wang et al. '11]

SLIDE 35

Results for Person Labelling

American beauty (11 character names) Casablanca (17 character names)

SLIDE 36

Results for Person + Action Labelling

Casablanca, Walking

SLIDE 37

Finding Actions and Actors in Movies

[Bojanowski, Bach, Laptev, Ponce, Sivic, Schmid, 2013]

SLIDE 38

Action Learning with Ordering Constraints

[Bojanowski et al. ECCV 2014]

SLIDE 39

Action Learning with Ordering Constraints

[Bojanowski et al. ECCV 2014]

SLIDE 40

Cost Function

Weak supervision from ordering constraints on Z

[Figure: action labels with indices (2, 4, 1, 2, 3, 2) assigned to video time intervals]
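One way to see what the ordering constraint buys is a small dynamic program: assign time intervals to the script's ordered action sequence so that labels never go backwards. This is an illustrative sketch under my own naming; Bojanowski et al. instead optimize a convex relaxation of this assignment with Frank-Wolfe, not this exact DP.

```python
import numpy as np

def align_ordered_actions(cost):
    """cost[t, k]: cost of labeling time interval t with the k-th action in the
    script's ordered action sequence. Finds the minimum-cost non-decreasing
    assignment of intervals to actions that visits every action."""
    T, K = cost.shape
    INF = float("inf")
    D = np.full((T, K), INF)
    D[0, 0] = cost[0, 0]
    for t in range(1, T):
        for k in range(K):
            stay = D[t - 1, k]                        # same action as previous interval
            step = D[t - 1, k - 1] if k > 0 else INF  # advance to next action
            D[t, k] = cost[t, k] + min(stay, step)
    # Backtrack the optimal path from (T-1, K-1)
    k, path = K - 1, [K - 1]
    for t in range(T - 1, 0, -1):
        if k > 0 and D[t - 1, k - 1] <= D[t - 1, k]:
            k -= 1
        path.append(k)
    return list(reversed(path)), float(D[T - 1, K - 1])

# Toy: 3 intervals, script says action 0 then action 1
cost = np.array([[0.0, 9.0],
                 [9.0, 0.0],
                 [9.0, 0.0]])
path, total = align_ordered_actions(cost)   # path == [0, 1, 1]
```

The monotone-path structure is exactly why "path constraints are implicit" on the later slide: feasible assignments are paths in this (T, K) grid.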


SLIDE 43

Is the optimization tractable?

  • Path constraints are implicit
  • Cannot use off-the-shelf solvers
  • Use the Frank-Wolfe optimization algorithm
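Frank-Wolfe is attractive here because each iteration only needs a linear minimization over the feasible set, which implicit path constraints support even when projection does not. A minimal sketch on a toy domain (the probability simplex, with my own function names), not the paper's solver:

```python
import numpy as np

def frank_wolfe_simplex(grad_f, x0, n_iters=5000):
    """Frank-Wolfe over the probability simplex. The linearized subproblem
    min_s <grad, s> is solved by a simplex vertex (one-hot at the smallest
    gradient coordinate), so every iterate stays feasible with no projection."""
    x = x0.copy()
    for t in range(n_iters):
        g = grad_f(x)
        s = np.zeros_like(x)
        s[np.argmin(g)] = 1.0       # linear minimization oracle: a vertex
        gamma = 2.0 / (t + 2.0)     # standard diminishing step size
        x = (1 - gamma) * x + gamma * s
    return x

# Toy problem: minimize ||x - b||^2 over the simplex; optimum is b itself
b = np.array([0.2, 0.5, 0.3])
x = frank_wolfe_simplex(lambda v: 2 * (v - b), np.array([1.0, 0.0, 0.0]))
```

In the action-alignment setting the oracle is a shortest-path computation over feasible label paths instead of a vertex lookup, but the outer loop has the same shape.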
SLIDE 44

Results

937 video clips from 60 Hollywood movies

  • 16 action classes
  • Each clip is annotated by a sequence of n actions (2≤n≤11)
SLIDE 45

SLIDE 46

Object recognition

SLIDE 47

Convolutional Neural Networks

  • The ImageNet Large-Scale Visual Recognition Challenge is very hard: 1000 classes, 1.2M images
  • Krizhevsky et al.'s ILSVRC12 results improve over other methods by a large margin

[Figure: challenge error rates, 2012 to 2014; GoogLeNet: 6%]

SLIDE 48

CNN of Krizhevsky et al., NIPS'12

  • Learns low-level features at the first layer
  • Has some tricks, but the main principle is similar to LeCun'88
  • Has 60M parameters and 650K neurons
  • Success seems to be determined by (a) lots of labeled images and (b) a very fast GPU implementation; neither was available until very recently

SLIDE 49

Approach

  • 1. Design a training/test procedure using sliding windows
  • 2. Train adaptation layers to map labels

See also [Girshick et al. '13], [Donahue et al. '13], [Sermanet et al. '14], [Zeiler and Fergus '13]; Transfer Learning workshop at ICCV'13, ImageNet workshop at ICCV'13
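Step 2 amounts to keeping the pre-trained feature extractor frozen and fitting new classification layers for the target labels. A toy numpy sketch under my own naming: `train_adaptation_layer` is a plain softmax layer trained by gradient descent, standing in for the adaptation layers on top of frozen CNN activations.

```python
import numpy as np

def train_adaptation_layer(feats, labels, n_classes, lr=0.1, n_epochs=500):
    """Train a linear softmax layer on top of frozen features by gradient
    descent on the cross-entropy loss; the features stand in for frozen
    activations of a pre-trained network."""
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(feats.shape[1], n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[labels]                    # one-hot targets
    for _ in range(n_epochs):
        logits = feats @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        g = (p - Y) / len(feats)                     # gradient of mean cross-entropy
        W -= lr * feats.T @ g
        b -= lr * g.sum(axis=0)
    return W, b

# Toy separable "features" for two target classes
feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = np.array([0, 0, 1, 1])
W, b = train_adaptation_layer(feats, labels, 2)
pred = (feats @ W + b).argmax(axis=1)
```

In practice the features would be FC7-style activations and the layer would be trained with the framework's own optimizer, but the transfer-learning structure is the same.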

SLIDE 50

Approach – sliding window training / testing

SLIDE 51

Results

Object localization

SLIDE 52

Results

[Oquab, Bottou, Laptev, Sivic 2013, HAL-00911179]

SLIDE 53

Results

SLIDE 54

Vision works?

SLIDE 55

Vision works?

[Oquab, Bottou, Laptev, Sivic 2013, HAL-00911179]

SLIDE 56

VOC Action Classification Taster Challenge

  • Given the bounding box of a person, predict whether they are performing a given action (Playing Instrument? Reading?)
  • Encourage research on still-image activity recognition: more detailed understanding of images

SLIDE 57

Nine Action Classes

Phoning, Playing Instrument, Reading, Riding Bike, Riding Horse, Running, Taking Photo, Using Computer, Walking

SLIDE 58

CNN action recognition and localization

Qualitative results: reading

SLIDE 59

CNN action recognition and localization

Qualitative results: phoning

SLIDE 60

CNN action recognition and localization

Qualitative results: playing instrument

SLIDE 61

Results PASCAL VOC 2012

Object classification Action classification

[Oquab, Bottou, Laptev, Sivic 2013, HAL-00911179]

SLIDE 62

Are bounding boxes needed for training CNNs?

Image-level labels: Bicycle, Person

[Oquab, Bottou, Laptev, Sivic, 2014]

SLIDE 63

Motivation: labeling bounding boxes is tedious

SLIDE 64

Motivation: image-level labels are plentiful

“Beautiful red leaves in a back street of Freiburg”

[Kuznetsova et al., ACL 2013] http://www.cs.stonybrook.edu/~pkuznetsova/imgcaption/captions1K.html

SLIDE 65

Motivation: image-level labels are plentiful

“Public bikes in Warsaw during night”

https://www.flickr.com/photos/jacek_kadaj/8776008002/in/photostream/

SLIDE 66

Let the algorithm localize the object in the image

[Oquab, Bottou, Laptev, Sivic, 2014]

Example training images with bounding boxes; the locations of objects or their parts learnt by the CNN.

NB: Related to multiple instance learning, e.g. [Viola et al. '05], and weakly supervised object localization, e.g. [Pandey and Lazebnik '11], [Prest et al. '12], [Oh Song et al. ICML'14], …

SLIDE 67

Approach: search over the object's location

  • 1. Efficient window sliding to find object location hypotheses
  • 2. Image-level aggregation (max-pool over the image)
  • 3. Multi-label loss function (allow multiple objects in the image)

See also [Sermanet et al. '14] and [Chatfield et al. '14]

[Figure: network C1-C2-C3-C4-C5, FC6 (9216-dim to 4096-dim), FC7 (4096-dim), adaptation layers FCa and FCb; per-location scores for classes (motorbike, person, diningtable, pottedplant, chair, car, bus, train, …) are max-pooled over the image into a per-image score]
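The aggregation in step 2 reduces to a per-class max over the spatial score maps. A minimal numpy sketch (function name and toy shapes are mine):

```python
import numpy as np

def image_level_scores(score_maps):
    """Global max-pool: score_maps has shape (n_classes, H, W), holding the
    per-class score of the sliding window at each location. The image-level
    score for a class is the max over all locations, so the best-scoring
    window alone explains an image-level label."""
    return score_maps.reshape(score_maps.shape[0], -1).max(axis=1)

# Toy score maps for 3 classes over a 4x4 grid of window locations
maps = np.zeros((3, 4, 4))
maps[1, 2, 3] = 5.0                 # one strong response for class 1
scores = image_level_scores(maps)   # scores[1] == 5.0
```

Because only the arg-max location receives gradient, training with image-level labels implicitly pushes the network to localize the object, which is the point of the weakly supervised setup.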

SLIDE 68

  • 1. Efficient window sliding to find object location

[Figure: layers C1 C2 C3 C4 C5 FC6 FC7 FCa FCb. Convolutional feature extraction layers trained on 1512 ImageNet classes (Oquab et al., 2014); adaptation layers trained on Pascal VOC]

SLIDE 69

  • 2. Image-level aggregation using global max-pool

[Figure: same network; a global max-pool over the 20 per-class score maps produces the final per-image scores]

SLIDE 70

  • 3. Multi-label loss function (to allow for multiple objects in the image)

Sum of K (=20) log-loss functions, one for each of the K classes: loss(x, y) = sum_k log(1 + exp(-y_k f_k(x))), where f(x) is the K-vector of network outputs for image x and y is a K-vector of (+1, -1) labels indicating the presence/absence of each class.
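The sum-of-log-losses described above is a one-liner in numpy; this sketch uses my own names for the score and label vectors:

```python
import numpy as np

def multilabel_logloss(f, y):
    """Sum of K independent log-losses: f is the K-vector of image-level
    scores (after max-pooling), y is a K-vector in {-1, +1} marking the
    presence/absence of each class."""
    return float(np.log1p(np.exp(-y * f)).sum())

f = np.array([2.0, -3.0, 0.0])   # image-level class scores
y = np.array([1.0, -1.0, 1.0])   # ground-truth presence/absence
loss = multilabel_logloss(f, y)
```

Treating the K classes as independent binary problems is what lets several objects be present in the same image, unlike a softmax over classes.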

SLIDE 71

Search for objects using max-pooling

Correct label: increase the score for this class. Incorrect label: decrease the score for this class.

SLIDE 72

Search for objects using max-pooling

What is the effect of errors?

SLIDE 73

Multi-scale training and testing

Rescale the image over a range of scales [0.7 … 1.4] and run the network at each scale.

Figure 3: Weakly supervised training (chair, diningtable, sofa, pottedplant, person, car, bus, train, …)

Figure 4: Multiscale object recognition (chair, diningtable, person, pottedplant, car, bus, train, …)
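Multi-scale testing can be sketched as max-pooling over scales on top of the per-scale scores. Both helpers here are mine, and the nearest-neighbour `rescale` is only a stand-in for proper image resampling:

```python
import numpy as np

def rescale(img, s):
    """Nearest-neighbour rescale, a stand-in for proper image resampling."""
    h, w = img.shape
    nh, nw = max(1, int(round(h * s))), max(1, int(round(w * s)))
    ys = (np.arange(nh) * h / nh).astype(int)
    xs = (np.arange(nw) * w / nw).astype(int)
    return img[np.ix_(ys, xs)]

def multiscale_scores(score_fn, image, scales=(0.7, 1.0, 1.4)):
    """Run a scoring function on several rescaled copies of the image and
    max-pool the per-class scores across scales as well as locations."""
    per_scale = [score_fn(rescale(image, s)) for s in scales]
    return np.max(per_scale, axis=0)

img = np.arange(16.0).reshape(4, 4)
s = multiscale_scores(lambda im: np.array([im.max()]), img)   # s[0] == 15.0
```

Pooling over scales lets the fixed-size sliding window match objects that are larger or smaller than the window at the native resolution.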

SLIDE 74

Training videos

SLIDE 75

Test results on 80 classes in Microsoft COCO dataset
