Computer Vision: Weakly-supervised learning from video and images - PowerPoint PPT Presentation



SLIDE 1

Computer Vision: Weakly-supervised learning from video and images

CSClub Saint Petersburg, November 17, 2014

Ivan Laptev
ivan.laptev@inria.fr
WILLOW, INRIA/ENS/CNRS, Paris

Joint work with: Piotr Bojanowski, Rémi Lajugie, Maxime Oquab, Francis Bach, Léon Bottou, Jean Ponce, Cordelia Schmid, Josef Sivic

SLIDE 2

About the company

VisionLabs is a team of professionals with deep knowledge and substantial practical experience in developing computer vision algorithms and intelligent systems.

We create and deploy computer vision technologies, opening new opportunities for changing the world around us for the better.

Contacts: Official website: http://visionlabs.ru/ Contact person: Alexander Khanin E-mail: a.khanin@visionlabs.ru Tel.: +7 (926) 988-7891

SLIDE 3

Team

Alexander Khanin, Chief Executive Officer; Alexey Nekhaev, Executive Officer; Slava Kazmin, Chief Technical Officer; Ivan Laptev, Scientific advisor; Sergey Milyaev, Senior CV engineer; Alexey Kordichev, Financial advisor; Ivan Truskov, Software developer; Sergey Cherepanov, Software developer

Our team is a symbiosis of science and business

Areas of activity

Face recognition technology: a system for detecting fraudsters in banks

License plate recognition technology: a system for vehicle access accounting and automation

Safe city technologies: a system for detecting violations and dangerous situations

SLIDE 4

Achievements

National-scale projects

SLIDE 5

We are looking for like-minded people

Creating and deploying intelligent systems. Solving interesting practical problems. Working in a friendly, ambitious team.

Thank you for your attention!

Contacts: Official website: http://visionlabs.ru/ Contact person: Alexander Khanin E-mail: a.khanin@visionlabs.ru Tel.: +7 (926) 988-7891

SLIDE 6

What is Computer Vision?

SLIDE 7

What is Computer Vision?

SLIDE 8

SLIDE 9

What is the recent progress?

1990s:

  • Research: recognition at the level of a few toy objects (COIL-20 dataset)
  • Industry: automated quality inspection (controlled lighting, scale, …)

Now:

  • Face recognition in social media
  • ImageNet: 14M images, 21K classes; 6% top-5 error rate in the 2014 challenge

SLIDE 10

Why image and video analysis?

Data:

  • ~5K image uploads every minute
  • >34K hours of video uploaded every day
  • TV channels recorded since the 60s
  • ~30M surveillance cameras in the US => ~700K video hours/day
  • ~2.5 billion new images / month
  • And even more with future wearable devices

SLIDE 11

Why looking at people?

How many person-pixels are in the video? (Movies, TV, YouTube)

SLIDE 12

Why looking at people?

How many person-pixels are in the video? Movies: 40%, TV: 35%, YouTube: 34%

SLIDE 13

How many person-pixels in our daily life?

Wearable camera data: Microsoft SenseCam dataset 

SLIDE 14

How many person-pixels in our daily life?

Wearable camera data: Microsoft SenseCam dataset 

~4%

SLIDE 15

What are the difficulties?

  • Large variations in appearance: occlusions, non-rigid motion, viewpoint changes, clothing…
  • Manual collection of training samples is prohibitive: many action classes, rare occurrence
  • Action vocabulary is not well-defined

Action Open: … Action Hugging: …

SLIDE 16

This talk:

  • Brief overview of recent techniques
  • Weakly-supervised learning from video and scripts
  • Weakly-supervised learning with convolutional neural networks

SLIDE 17

Standard visual recognition pipeline

GetOutCar AnswerPhone Kiss HandShake StandUp DriveCar

  • Collect image/video samples and corresponding class labels
  • Design an appropriate data representation with certain invariance properties
  • Design or use existing machine learning methods for learning and classification

SLIDE 18

Bag-of-Features action recognition

  • Extraction of local features (space-time patches)
  • Feature description
  • Feature quantization: K-means clustering (k=4000)
  • Occurrence histogram of visual words
  • Non-linear SVM with χ2 kernel

[Laptev, Marszałek, Schmid, Rozenfeld 2008]
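The pipeline above can be sketched in a few lines of numpy. This is a toy illustration, not the original implementation: the function names are mine, random descriptors stand in for space-time features, and fixed random centroids stand in for the k=4000 k-means vocabulary.

```python
import numpy as np

def quantize(features, centroids):
    """Assign each local feature to its nearest visual word (centroid)."""
    d = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

def bof_histogram(features, centroids):
    """L1-normalized occurrence histogram of visual words for one video."""
    words = quantize(features, centroids)
    h = np.bincount(words, minlength=len(centroids)).astype(float)
    return h / max(h.sum(), 1.0)

def chi2_kernel(h1, h2, eps=1e-10):
    """Exponentiated chi-square kernel between two BoF histograms."""
    return float(np.exp(-0.5 * (((h1 - h2) ** 2) / (h1 + h2 + eps)).sum()))

# Toy example: two "videos" of random 5-D local descriptors, 4 visual words
rng = np.random.default_rng(0)
centroids = rng.normal(size=(4, 5))   # stands in for k-means centers (k=4000 in the slide)
h1 = bof_histogram(rng.normal(size=(100, 5)), centroids)
h2 = bof_histogram(rng.normal(size=(80, 5)), centroids)
k = chi2_kernel(h1, h2)               # kernel value in (0, 1]
```

A kernel matrix of such values over all training videos would then be fed to a kernel SVM for classification.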

SLIDE 19

Action classification

Test episodes from movies “The Graduate”, “It’s a Wonderful Life”, “Indiana Jones and the Last Crusade”

SLIDE 20

Where to get training data?

Shoot actions in the lab

  • KTH dataset, Weizmann dataset, …
  • Limited variability
  • Unrealistic

Manually annotate existing content

  • HMDB, Olympic Sports, UCF50, UCF101, …
  • Very time-consuming

Use readily-available video scripts

  • www.dailyscript.com, www.movie-page.com, www.weeklyscript.com
  • Scripts are available for 1000's of hours of movies and TV series
  • Scripts describe dynamic and static content of videos
SLIDE 21

As the headwaiter takes them to a table they pass by the piano, and the woman looks at Sam. Sam, with a conscious effort, keeps his eyes on the keyboard as they go past. The headwaiter seats Ilsa...



SLIDE 25

Script-based video annotation

  • Scripts available for >500 movies (no time synchronization): www.dailyscript.com, www.movie-page.com, www.weeklyscript.com, …
  • Subtitles (with time info) are available for most movies
  • Can transfer time to scripts by text alignment

Subtitles: … 1172 01:20:17,240 --> 01:20:20,437 Why weren't you honest with me? Why'd you keep your marriage a secret? 1173 01:20:20,640 --> 01:20:23,598 It wasn't my secret, Richard. Victor wanted it that way. 1174 01:20:23,800 --> 01:20:26,189 Not even our closest friends knew about our marriage. …

Movie script: … RICK: Why weren't you honest with me? Why did you keep your marriage a secret? Rick sits down with Ilsa. ILSA: Oh, it wasn't my secret, Richard. Victor wanted it that way. Not even our closest friends knew about our marriage. … (aligned to 01:20:17 to 01:20:23)

[Laptev, Marszałek, Schmid, Rozenfeld 2008]
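The time-transfer step can be illustrated with stdlib text alignment. This is a minimal sketch, not the paper's method: `align_script_to_subtitles` and the toy data are mine, and `difflib.SequenceMatcher` stands in for whatever alignment the authors used.

```python
from difflib import SequenceMatcher

def align_script_to_subtitles(script_words, subtitle_words, subtitle_times):
    """Transfer subtitle timestamps to script words by word-level text
    alignment; subtitle_times[i] is the (start, end) of the subtitle block
    containing subtitle word i. Unmatched script words keep time None."""
    sm = SequenceMatcher(a=[w.lower() for w in script_words],
                         b=[w.lower() for w in subtitle_words], autojunk=False)
    times = [None] * len(script_words)
    for block in sm.get_matching_blocks():
        for i in range(block.size):
            times[block.a + i] = subtitle_times[block.b + i]
    return times

subs = "Why weren't you honest with me".split()
sub_times = [("01:20:17", "01:20:20")] * len(subs)
script = "RICK Why weren't you honest with me".split()
t = align_script_to_subtitles(script, subs, sub_times)
# "RICK" stays untimed; the matched dialogue words inherit the subtitle interval
```

Script passages between two timed dialogue matches can then be assigned the enclosing time interval, which is what makes scene descriptions usable as (imprecise) temporal annotation.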

SLIDE 26

Scripts as weak supervision

Challenges:

  • Imprecise temporal localization (uncertainty, e.g. 24:25 to 24:51)
  • No explicit spatial localization
  • NLP problems: scripts ≠ training labels, e.g. "… Will gets out of the Chevrolet. …" and "… Erin exits her new truck…" vs. the label Get-out-car

SLIDE 27

Previous work

Sivic, Everingham, and Zisserman, "'Who are you?': Learning Person Specific Classifiers from Video", CVPR 2009.
Buehler, Everingham, and Zisserman, "Learning sign language by watching TV (using weakly aligned subtitles)", CVPR 2009.
Duchenne, Laptev, Sivic, Bach, and Ponce, "Automatic Annotation of Human Actions in Video", ICCV 2009.

…wanted to know about the history of the trees

SLIDE 28

Joint Learning of Actors and Actions

Rick?

Rick? Walks? Walks?

[Bojanowski et al. ICCV 2013]

Rick walks up behind Ilsa

SLIDE 29

Rick Walks

Rick walks up behind Ilsa

Joint Learning of Actors and Actions

[Bojanowski et al. ICCV 2013]

SLIDE 30

Formulation: Cost function

[Figure: cost function over actor labels (Rick, Ilsa, Sam), actor image features, and an actor classifier]

SLIDE 31

Formulation: Cost function

Person p appears at least once in clip N :

p = Rick

Weak supervision from scripts:

SLIDE 32

Action a appears at least once in clip N :

a = Walk

Weak supervision from scripts:

Formulation: Cost function

SLIDE 33

Formulation: Cost function

Weak supervision from scripts:

  • Person p appears in clip N
  • Action a appears in clip N
  • Person p and Action a appear in clip N
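The "appears at least once" supervision can be written as a feasibility check on the assignment matrix Z. This is a toy sketch under my own naming (`satisfies_script`, integer name ids): the actual model of Bojanowski et al. minimizes a discriminative cost over all Z that satisfy such constraints, rather than just checking them.

```python
import numpy as np

def satisfies_script(Z, clip_of_track, script_mentions):
    """Check the 'appears at least once' constraints: Z[t, p] = 1 iff person
    track t is assigned name p; script_mentions is a set of (clip, p) pairs
    meaning the script places person p somewhere in that clip."""
    for clip, p in script_mentions:
        tracks = [t for t in range(len(Z)) if clip_of_track[t] == clip]
        if not any(Z[t][p] == 1 for t in tracks):
            return False
    return True

# Toy: 3 face tracks over clips 0 and 1; names 0 = Rick, 1 = Ilsa
clip_of_track = [0, 0, 1]
mentions = {(0, 0), (1, 1)}                 # script: Rick in clip 0, Ilsa in clip 1
Z_good = np.array([[1, 0], [0, 1], [0, 1]])
Z_bad = np.array([[0, 1], [0, 1], [1, 0]])  # nobody labelled Rick in clip 0
```

The same construction extends to the joint person-and-action case by requiring one track in the clip to carry both the mentioned name and the mentioned action.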

SLIDE 34

Image and video features

Face features:

  • Facial features [Everingham '06]
  • HOG descriptor on a normalized face image

Action features:

  • Dense Trajectory features in the person bounding box [Wang et al. '11]

SLIDE 35

Results for Person Labelling

American beauty (11 character names) Casablanca (17 character names)

SLIDE 36

Results for Person + Action Labelling

Casablanca, Walking

SLIDE 37

Finding Actions and Actors in Movies

[Bojanowski, Bach, Laptev, Ponce, Sivic, Schmid, 2013]

SLIDE 38

Action Learning with Ordering Constraints

[Bojanowski et al. ECCV 2014]

SLIDE 39

Action Learning with Ordering Constraints

[Bojanowski et al. ECCV 2014]

SLIDE 40

Cost Function

Weak supervision from ordering constraints on Z

[Figure: action labels with indices (2, 4, 1, 2, 3, 2) assigned to video time intervals]
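One way to see what the ordering constraint buys is a small dynamic program: assign time intervals to the script's ordered action sequence so that labels never go backwards. This is an illustrative sketch under my own naming; Bojanowski et al. instead optimize a convex relaxation of this assignment with Frank-Wolfe, not this exact DP.

```python
import numpy as np

def align_ordered_actions(cost):
    """cost[t, k]: cost of labeling time interval t with the k-th action in the
    script's ordered action sequence. Finds the minimum-cost non-decreasing
    assignment of intervals to actions that visits every action."""
    T, K = cost.shape
    INF = float("inf")
    D = np.full((T, K), INF)
    D[0, 0] = cost[0, 0]
    for t in range(1, T):
        for k in range(K):
            stay = D[t - 1, k]                        # same action as previous interval
            step = D[t - 1, k - 1] if k > 0 else INF  # advance to next action
            D[t, k] = cost[t, k] + min(stay, step)
    # Backtrack the optimal path from (T-1, K-1)
    k, path = K - 1, [K - 1]
    for t in range(T - 1, 0, -1):
        if k > 0 and D[t - 1, k - 1] <= D[t - 1, k]:
            k -= 1
        path.append(k)
    return list(reversed(path)), float(D[T - 1, K - 1])

# Toy: 3 intervals, script says action 0 then action 1
cost = np.array([[0.0, 9.0],
                 [9.0, 0.0],
                 [9.0, 0.0]])
path, total = align_ordered_actions(cost)   # path == [0, 1, 1]
```

The monotone-path structure is exactly why "path constraints are implicit" on the later slide: feasible assignments are paths in this (T, K) grid.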


SLIDE 43

Is the optimization tractable?

  • Path constraints are implicit
  • Cannot use off-the-shelf solvers
  • Use the Frank-Wolfe optimization algorithm
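Frank-Wolfe is attractive here because each iteration only needs a linear minimization over the feasible set, which implicit path constraints support even when projection does not. A minimal sketch on a toy domain (the probability simplex, with my own function names), not the paper's solver:

```python
import numpy as np

def frank_wolfe_simplex(grad_f, x0, n_iters=5000):
    """Frank-Wolfe over the probability simplex. The linearized subproblem
    min_s <grad, s> is solved by a simplex vertex (one-hot at the smallest
    gradient coordinate), so every iterate stays feasible with no projection."""
    x = x0.copy()
    for t in range(n_iters):
        g = grad_f(x)
        s = np.zeros_like(x)
        s[np.argmin(g)] = 1.0       # linear minimization oracle: a vertex
        gamma = 2.0 / (t + 2.0)     # standard diminishing step size
        x = (1 - gamma) * x + gamma * s
    return x

# Toy problem: minimize ||x - b||^2 over the simplex; optimum is b itself
b = np.array([0.2, 0.5, 0.3])
x = frank_wolfe_simplex(lambda v: 2 * (v - b), np.array([1.0, 0.0, 0.0]))
```

In the action-alignment setting the oracle is a shortest-path computation over feasible label paths instead of a vertex lookup, but the outer loop has the same shape.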
SLIDE 44

Results

937 video clips from 60 Hollywood movies

  • 16 action classes
  • Each clip is annotated by a sequence of n actions (2≤n≤11)
SLIDE 45

SLIDE 46

Object recognition

SLIDE 47

Convolutional Neural Networks

  • The ImageNet Large-Scale Visual Recognition Challenge is very hard: 1000 classes, 1.2M images
  • Krizhevsky et al.'s ILSVRC12 results improve over other methods by a large margin

[Figure: challenge error rates, 2012 to 2014; GoogLeNet: 6%]

SLIDE 48

CNN of Krizhevsky et al., NIPS'12

  • Learns low-level features at the first layer
  • Has some tricks, but the main principle is similar to LeCun'88
  • Has 60M parameters and 650K neurons
  • Success seems to be determined by (a) lots of labeled images and (b) a very fast GPU implementation; neither was available until very recently

SLIDE 49

Approach

  • 1. Design a training/test procedure using sliding windows
  • 2. Train adaptation layers to map labels

See also [Girshick et al. '13], [Donahue et al. '13], [Sermanet et al. '14], [Zeiler and Fergus '13]; Transfer Learning workshop at ICCV'13, ImageNet workshop at ICCV'13
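Step 2 amounts to keeping the pre-trained feature extractor frozen and fitting new classification layers for the target labels. A toy numpy sketch under my own naming: `train_adaptation_layer` is a plain softmax layer trained by gradient descent, standing in for the adaptation layers on top of frozen CNN activations.

```python
import numpy as np

def train_adaptation_layer(feats, labels, n_classes, lr=0.1, n_epochs=500):
    """Train a linear softmax layer on top of frozen features by gradient
    descent on the cross-entropy loss; the features stand in for frozen
    activations of a pre-trained network."""
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(feats.shape[1], n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[labels]                    # one-hot targets
    for _ in range(n_epochs):
        logits = feats @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        g = (p - Y) / len(feats)                     # gradient of mean cross-entropy
        W -= lr * feats.T @ g
        b -= lr * g.sum(axis=0)
    return W, b

# Toy separable "features" for two target classes
feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = np.array([0, 0, 1, 1])
W, b = train_adaptation_layer(feats, labels, 2)
pred = (feats @ W + b).argmax(axis=1)
```

In practice the features would be FC7-style activations and the layer would be trained with the framework's own optimizer, but the transfer-learning structure is the same.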

SLIDE 50

Approach – sliding window training / testing

SLIDE 51

Results

Object localization

SLIDE 52

Results

[Oquab, Bottou, Laptev, Sivic 2013, HAL-00911179]

SLIDE 53

Results

SLIDE 54

Vision works?

SLIDE 55

Vision works?

[Oquab, Bottou, Laptev, Sivic 2013, HAL-00911179]

SLIDE 56

VOC Action Classification Taster Challenge

  • Given the bounding box of a person, predict whether they are performing a given action (Playing Instrument? Reading?)
  • Encourage research on still-image activity recognition: more detailed understanding of images

SLIDE 57

Nine Action Classes

Phoning, Playing Instrument, Reading, Riding Bike, Riding Horse, Running, Taking Photo, Using Computer, Walking

SLIDE 58

CNN action recognition and localization

Qualitative results: reading

SLIDE 59

CNN action recognition and localization

Qualitative results: phoning

SLIDE 60

CNN action recognition and localization

Qualitative results: playing instrument

SLIDE 61

Results PASCAL VOC 2012

Object classification Action classification

[Oquab, Bottou, Laptev, Sivic 2013, HAL-00911179]

SLIDE 62

Are bounding boxes needed for training CNNs?

Image-level labels: Bicycle, Person

[Oquab, Bottou, Laptev, Sivic, 2014]

SLIDE 63

Motivation: labeling bounding boxes is tedious

SLIDE 64

Motivation: image-level labels are plentiful

“Beautiful red leaves in a back street of Freiburg”

[Kuznetsova et al., ACL 2013] http://www.cs.stonybrook.edu/~pkuznetsova/imgcaption/captions1K.html

SLIDE 65

Motivation: image-level labels are plentiful

“Public bikes in Warsaw during night”

https://www.flickr.com/photos/jacek_kadaj/8776008002/in/photostream/

SLIDE 66

Let the algorithm localize the object in the image

[Oquab, Bottou, Laptev, Sivic, 2014]

Example training images with bounding boxes; the locations of objects or their parts learnt by the CNN.

NB: Related to multiple instance learning, e.g. [Viola et al. '05], and weakly supervised object localization, e.g. [Pandey and Lazebnik '11], [Prest et al. '12], [Oh Song et al. ICML'14], …

SLIDE 67

Approach: search over the object's location

  • 1. Efficient window sliding to find object location hypotheses
  • 2. Image-level aggregation (max-pool over the image)
  • 3. Multi-label loss function (allow multiple objects in the image)

See also [Sermanet et al. '14] and [Chatfield et al. '14]

[Figure: network C1-C2-C3-C4-C5, FC6 (9216-dim to 4096-dim), FC7 (4096-dim), adaptation layers FCa and FCb; per-location scores for classes (motorbike, person, diningtable, pottedplant, chair, car, bus, train, …) are max-pooled over the image into a per-image score]
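The aggregation in step 2 reduces to a per-class max over the spatial score maps. A minimal numpy sketch (function name and toy shapes are mine):

```python
import numpy as np

def image_level_scores(score_maps):
    """Global max-pool: score_maps has shape (n_classes, H, W), holding the
    per-class score of the sliding window at each location. The image-level
    score for a class is the max over all locations, so the best-scoring
    window alone explains an image-level label."""
    return score_maps.reshape(score_maps.shape[0], -1).max(axis=1)

# Toy score maps for 3 classes over a 4x4 grid of window locations
maps = np.zeros((3, 4, 4))
maps[1, 2, 3] = 5.0                 # one strong response for class 1
scores = image_level_scores(maps)   # scores[1] == 5.0
```

Because only the arg-max location receives gradient, training with image-level labels implicitly pushes the network to localize the object, which is the point of the weakly supervised setup.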

SLIDE 68

  • 1. Efficient window sliding to find object location

[Figure: layers C1 C2 C3 C4 C5 FC6 FC7 FCa FCb. Convolutional feature extraction layers trained on 1512 ImageNet classes (Oquab et al., 2014); adaptation layers trained on Pascal VOC]

SLIDE 69

  • 2. Image-level aggregation using global max-pool

[Figure: same network; a global max-pool over the 20 per-class score maps produces the final per-image scores]

SLIDE 70

  • 3. Multi-label loss function (to allow for multiple objects in the image)

Sum of K (=20) log-loss functions, one for each of the K classes: loss(x, y) = sum_k log(1 + exp(-y_k f_k(x))), where f(x) is the K-vector of network outputs for image x and y is a K-vector of (+1, -1) labels indicating the presence/absence of each class.
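The sum-of-log-losses described above is a one-liner in numpy; this sketch uses my own names for the score and label vectors:

```python
import numpy as np

def multilabel_logloss(f, y):
    """Sum of K independent log-losses: f is the K-vector of image-level
    scores (after max-pooling), y is a K-vector in {-1, +1} marking the
    presence/absence of each class."""
    return float(np.log1p(np.exp(-y * f)).sum())

f = np.array([2.0, -3.0, 0.0])   # image-level class scores
y = np.array([1.0, -1.0, 1.0])   # ground-truth presence/absence
loss = multilabel_logloss(f, y)
```

Treating the K classes as independent binary problems is what lets several objects be present in the same image, unlike a softmax over classes.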

SLIDE 71

Search for objects using max-pooling

Correct label: increase the score for this class. Incorrect label: decrease the score for this class.

SLIDE 72

Search for objects using max-pooling

What is the effect of errors?

SLIDE 73

Multi-scale training and testing

Rescale the image over a range of scales [0.7 … 1.4] and run the network at each scale.

Figure 3: Weakly supervised training (chair, diningtable, sofa, pottedplant, person, car, bus, train, …)

Figure 4: Multiscale object recognition (chair, diningtable, person, pottedplant, car, bus, train, …)
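Multi-scale testing can be sketched as max-pooling over scales on top of the per-scale scores. Both helpers here are mine, and the nearest-neighbour `rescale` is only a stand-in for proper image resampling:

```python
import numpy as np

def rescale(img, s):
    """Nearest-neighbour rescale, a stand-in for proper image resampling."""
    h, w = img.shape
    nh, nw = max(1, int(round(h * s))), max(1, int(round(w * s)))
    ys = (np.arange(nh) * h / nh).astype(int)
    xs = (np.arange(nw) * w / nw).astype(int)
    return img[np.ix_(ys, xs)]

def multiscale_scores(score_fn, image, scales=(0.7, 1.0, 1.4)):
    """Run a scoring function on several rescaled copies of the image and
    max-pool the per-class scores across scales as well as locations."""
    per_scale = [score_fn(rescale(image, s)) for s in scales]
    return np.max(per_scale, axis=0)

img = np.arange(16.0).reshape(4, 4)
s = multiscale_scores(lambda im: np.array([im.max()]), img)   # s[0] == 15.0
```

Pooling over scales lets the fixed-size sliding window match objects that are larger or smaller than the window at the native resolution.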

SLIDE 74

Training videos

SLIDE 75

Test results on 80 classes in Microsoft COCO dataset
