

SLIDE 1

Several cars are driving straight on the freeway. A white car is turning right on an exit. A dark car is turning left on an exit.

Learning Semantic Video Captioning

using Data Generated with Grand Theft Auto

Alex “Polis” Polichroniadis | Data Scientist, MSc
Kolia Sadeghi | Applied Mathematician, PhD
Timothy Emerick | Data Scientist, PhD

SLIDE 2


1. THE VIDEO LEARNING PROBLEM

SLIDE 3

THE PROBLEM

▸ Machine vision algorithms require large amounts of labeled data to train.
▸ Models trained on non-domain-relevant data do not transfer to the desired domain.
▸ Often, domain-relevant labeled data isn’t available.

SLIDE 4

Sample of Large Computer Vision Datasets

▸ ImageNet (http://www.image-net.org/): >14M images of >21K concepts
▸ YouTube-8M (https://research.google.com/youtube8m/): >450K hours of video of >4,700 classes

Olga Russakovsky*, Jia Deng*, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg and Li Fei-Fei (* = equal contribution). "ImageNet Large Scale Visual Recognition Challenge." IJCV, 2015.
Abu-El-Haija, Sami, et al. "YouTube-8M: A Large-Scale Video Classification Benchmark." arXiv preprint arXiv:1609.08675 (2016).

SLIDE 5

OUR APPROACH: USE VIDEO GAMES

Wow, that looks good.

SLIDE 6

Our Approach: Use Video Games

▸ With the advent of extremely powerful GPUs, graphics have become increasingly realistic over the years.
▸ Video games encode realistic movements such as walking gaits and vehicle routing.
▸ Video games are controllable via code and can expose semantic labels.

SLIDE 7

Our Approach: Use GTA V

▸ Extremely realistic graphics.
▸ Huge modding community.
▸ GPU-intensive visual mods for more realism.
▸ Of specific interest: script-hook-v has thousands of function calls.

SLIDE 8

...with some quirks.

SLIDE 9

Programmatically configurable options

▸ Vehicles
 ▹ Activities: driving, turning, waiting at a stop light.
 ▹ Describers: color, type, damage...
▸ People
 ▹ Activities: entering/exiting a vehicle, walking, standing still, talking, waiting to cross the street, parking, smoking, talking on a cell phone, carrying a firearm, planting a bomb.
 ▹ Describers: number of people, clothing color, gender.
▸ Environment
 ▹ Weather: rainy, sunny, hazy...
 ▹ Time of day.
 ▹ Camera elevation and zoom.
▸ Buildings
 ▹ Activities: people going into, out of, and walking close to buildings.
 ▹ Describers: type of building (church, mall, police station, etc.).
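The deck does not show generator code, and script-hook-v itself exposes a native API, so the following is only a minimal Python sketch of the idea behind these options: sample a scene configuration from option tables like those above, and keep the same record as the ground-truth annotation. All names and tables here are illustrative, not the actual generator.

```python
import json
import random

# Hypothetical option tables mirroring the configurable parameters listed
# above; the real generator drives GTA V through script-hook-v's native API.
VEHICLE_ACTIVITIES = ["driving", "turning", "waiting at stop light"]
VEHICLE_COLORS = ["white", "dark", "red", "blue"]
PED_ACTIVITIES = ["entering vehicle", "exiting vehicle", "walking",
                  "standing still", "talking", "waiting to cross the street",
                  "parking", "smoking", "talking on a cell phone",
                  "carrying a firearm", "planting a bomb"]
WEATHER = ["rainy", "sunny", "hazy"]

def sample_scenario(seed: int) -> dict:
    """Sample one scene configuration; the same dict doubles as the
    ground-truth annotation, so every rendered clip is labeled for free."""
    rng = random.Random(seed)
    return {
        "vehicle": {"activity": rng.choice(VEHICLE_ACTIVITIES),
                    "color": rng.choice(VEHICLE_COLORS)},
        "pedestrian": {"activity": rng.choice(PED_ACTIVITIES)},
        "environment": {"weather": rng.choice(WEATHER),
                        "time_of_day": rng.randint(0, 23),
                        "camera_elevation_m": rng.uniform(2.0, 50.0)},
    }

if __name__ == "__main__":
    print(json.dumps(sample_scenario(seed=42), indent=2))
```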

SLIDE 10

SLIDE 11

SLIDE 12

VIDEO

SLIDE 13

SLIDE 14

A simple test: YOLO 9000

Before training with annotated synthetic footage

Redmon, Joseph, and Ali Farhadi. "YOLO9000: Better, Faster, Stronger." arXiv preprint arXiv:1612.08242 (2016).

SLIDE 15

A simple test: YOLO 9000

After training with annotated synthetic footage
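The fine-tuning code itself is not in the deck; as a hedged stand-in (torchvision's Faster R-CNN rather than the authors' Darknet-based YOLO9000), this sketch shows the general recipe: feed synthetic frames with their exported ground-truth boxes and backpropagate the detector losses.

```python
import torch
import torchvision

# Stand-in sketch: the talk fine-tunes YOLO9000 (Darknet); we illustrate the
# same recipe with torchvision's Faster R-CNN because its training API is
# compact. Synthetic frames arrive with exact boxes exported by the game.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
model.train()

def synthetic_batch():
    """Placeholder loader over annotated synthetic footage: each target
    carries ground-truth boxes/labels produced by the generator."""
    images = [torch.rand(3, 480, 640)]  # one rendered frame, values in [0, 1]
    targets = [{"boxes": torch.tensor([[100.0, 120.0, 260.0, 300.0]]),
                "labels": torch.tensor([1])}]  # e.g. class 1 = "vehicle"
    return images, targets

for step in range(10):  # a token fine-tuning loop
    images, targets = synthetic_batch()
    loss_dict = model(images, targets)  # detector returns per-head losses
    loss = sum(loss_dict.values())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```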

SLIDE 16

A simple test: YOLO 9000

SLIDE 17

A simple test: YOLO 9000

SLIDE 18


2. VIDEO CAPTIONING

and other cool stuff.

SLIDE 19

Architecture

We employ cutting-edge deep learning methods: Convolutional Neural Networks (CNNs), which capture localised information in frames, and Long Short-Term Memory networks (LSTMs), which have demonstrated state-of-the-art performance in sequence captioning. We train our models on hours of fully annotated synthetic footage produced with our in-house Photorealistic Synthetic Video Generator (PSVG), and we observe domain translation between synthetic and real-world footage.
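A minimal sketch of such a CNN+LSTM pipeline, assuming a toy convolutional encoder and single-layer LSTMs; the sizes and structure are illustrative, not the PSVG-trained production model.

```python
import torch
import torch.nn as nn

class CaptionNet(nn.Module):
    """Minimal CNN+LSTM captioner sketch: encode frames, summarize the clip,
    decode a word sequence. Layer sizes are illustrative."""
    def __init__(self, vocab_size, feat_dim=512, hidden=512):
        super().__init__()
        # Per-frame CNN encoder (toy; the talk implies a deeper backbone).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(32 * 16, feat_dim))
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, frames, words):
        # frames: (B, T, 3, H, W); words: (B, L) token ids
        B, T = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(B, T, -1)
        _, state = self.encoder(feats)          # summarize the clip
        dec, _ = self.decoder(self.embed(words), state)
        return self.out(dec)                    # (B, L, vocab) logits

model = CaptionNet(vocab_size=1000)
logits = model(torch.rand(2, 8, 3, 64, 64), torch.randint(0, 1000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 1000])
```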
SLIDE 20

Attention

▸ Focus the semantic representation of frames on objects of interest, in our case pedestrians and vehicles.
▸ We corrupt input frames at train time to match real-world noise.
▸ Speedy model training thanks to a fully GPU-based architecture that takes full advantage of 8 latest-generation NVIDIA GTX 1080 Ti GPUs.

A modified version of YOLO9000 was developed and trained to produce attention labels in video.
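The exact corruption recipe is not specified in the deck; a minimal sketch, assuming additive Gaussian noise plus random pixel dropout as stand-ins for real-world sensor noise:

```python
import torch

def corrupt(frame: torch.Tensor,
            noise_std: float = 0.05,
            drop_prob: float = 0.1) -> torch.Tensor:
    """Train-time corruption sketch: add sensor-like Gaussian noise and
    randomly drop pixels, then clamp back to the valid intensity range."""
    noisy = frame + noise_std * torch.randn_like(frame)
    keep = (torch.rand_like(frame[:1]) > drop_prob).float()  # per-pixel mask
    return (noisy * keep).clamp(0.0, 1.0)

clean = torch.rand(3, 480, 640)   # a synthetic frame with values in [0, 1]
augmented = corrupt(clean)
```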

SLIDE 21

Attention

Attention Result

SLIDE 22

Attention

Input Frame Attention Result

SLIDE 23

Attention

▸ We test domain translation by applying our attention model to the open VIRAT overhead video dataset.
▸ The results confirm our hypothesis: the model produces attention labels in real-world video based purely on training with simulated data.

SLIDE 24

Captioning

We use a recurrent Long Short-Term Memory network (LSTM) to translate sequences of feature representations computed from video frames into text.

Why LSTMs?
▸ Map a sequence of frames to a sequence of words.
▸ Use multi-frame information.
▸ Capture language structure.

Inference using the combination of an LSTM and a convolutional attention model runs faster than real time on a single NVIDIA GTX TitanX (Pascal).
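At inference time a decoder like this can emit a caption greedily, one word per step. The sketch below assumes an LSTMCell decoder seeded with the clip's feature summary; the vocabulary, special tokens, and sizes are chosen for illustration.

```python
import torch
import torch.nn as nn

# Greedy decoding sketch for an LSTM captioner: seed the hidden state with
# the clip summary, then emit the most likely word per step until <eos>.
BOS, EOS, HIDDEN, VOCAB = 0, 1, 512, 1000
embed = nn.Embedding(VOCAB, HIDDEN)
cell = nn.LSTMCell(HIDDEN, HIDDEN)
out = nn.Linear(HIDDEN, VOCAB)

@torch.no_grad()
def greedy_caption(clip_summary: torch.Tensor, max_len: int = 20) -> list:
    h, c = clip_summary, torch.zeros_like(clip_summary)  # init from video
    token = torch.tensor([BOS])
    words = []
    for _ in range(max_len):
        h, c = cell(embed(token), (h, c))
        token = out(h).argmax(dim=-1)       # most likely next word
        if token.item() == EOS:
            break
        words.append(token.item())
    return words

print(greedy_caption(torch.randn(1, HIDDEN)))  # random ids: weights untrained
```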

SLIDE 25

Captioning

“A male wearing a white shirt and dark pants is walking.” “A male is crossing the street.” “A white service vehicle is parked.”

SLIDE 26

Captioning Applications

Semantic Video Search

Extracting captions from video and storing them in a semantic index allows for fast and flexible video search by text query over large amounts of video.
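A minimal sketch of such a semantic index, assuming a plain in-memory inverted index from caption words to (video, timestamp) hits; the production index and schema are not described in the deck.

```python
from collections import defaultdict

# Inverted index sketch: each caption word maps to the (video, second)
# positions where it was inferred, enabling text queries over large corpora.
index = defaultdict(set)

def add_caption(video_id: str, second: int, caption: str) -> None:
    for word in caption.lower().split():
        index[word].add((video_id, second))

def search(query: str) -> set:
    """Return (video, second) pairs whose captions contain every query term."""
    hits = [index[w] for w in query.lower().split()]
    return set.intersection(*hits) if hits else set()

add_caption("cam01.mp4", 12, "A white car is turning right on an exit")
add_caption("cam01.mp4", 45, "A dark car is turning left on an exit")
add_caption("cam02.mp4", 3, "A male is crossing the street")
print(search("white car"))   # {('cam01.mp4', 12)}
```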

SLIDE 27

VIDEO

SLIDE 28

SLIDE 29

Captioning Applications

Real-Time Alerting

Our scalable pipeline infers captions faster than 30 fps on a single NVIDIA GTX 1080 Ti GPU. This allows for indexing live video streams and providing real-time alerts for user-defined events.
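A sketch of how user-defined events might be matched against the live caption stream; the rule format and keywords here are hypothetical, since the deck only states the capability.

```python
# Hypothetical alert rules: a rule fires when a caption contains any of its
# keywords. Captions arrive from the inference pipeline faster than 30 fps.
RULES = {
    "weapon": {"firearm", "gun"},
    "loitering": {"standing", "waiting"},
}

def check_alerts(caption: str) -> list:
    """Return the names of all rules triggered by this caption."""
    words = set(caption.lower().split())
    return [name for name, keywords in RULES.items() if words & keywords]

for t, caption in [(1.0, "A male carrying a firearm is walking")]:
    for alert in check_alerts(caption):
        print(f"[ALERT {alert}] t={t:.1f}s: {caption}")
```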

SLIDE 30

Search by example

A user-defined bounding box on a video frame can be used as a query to search for similar objects of interest across an entire video dataset, at the frame level. We highlight relevant frames in the form of a heatbar.
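A sketch of the scoring step, assuming one embedding per frame and the box crop embedded with the same feature extractor (the 128-d vectors below are placeholders): cosine similarity per frame, rescaled to [0, 1], yields the heatbar.

```python
import numpy as np

rng = np.random.default_rng(0)
query = rng.normal(size=128)            # embedding of the user's box crop
frames = rng.normal(size=(600, 128))    # one pooled embedding per frame

def heatbar(query: np.ndarray, frames: np.ndarray) -> np.ndarray:
    """Cosine similarity of each frame to the query, rescaled to [0, 1]."""
    sims = frames @ query / (np.linalg.norm(frames, axis=1)
                             * np.linalg.norm(query))
    return (sims - sims.min()) / (sims.max() - sims.min() + 1e-9)

bar = heatbar(query, frames)
print(bar.argsort()[-5:])   # indices of the five most similar frames
```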

SLIDE 31

VIDEO

SLIDE 32

SLIDE 33

Conclusion

▸ Using GTA V with latest-generation NVIDIA GPUs allows us to create fully annotated, custom-tailored, photorealistic datasets programmatically.
▸ Cutting-edge neural network architectures that run on the GPU allow us to train quickly and efficiently.
▸ We observe domain translation between synthetic footage and real-world footage.
▸ Latest-generation GPU technology allows us to process frames faster than real time.
▸ We use this to achieve:
 ▹ Captioning video
 ▹ Text search in a large video corpus
 ▹ Live video captioning
 ▹ Real-time notifications
 ▹ Search by example

SLIDE 34

Future steps

▸ Use localisation in frame to thread entities temporally and produce “action tubelets” to be captioned.
▸ Improve counting of entities.
▸ Combine captioning with geolocation information from the camera to extract information about entities (see the sketch after this list):
 ▹ “At what coordinates have I seen over 200 people?”
 ▹ “At what coordinates have I seen this vehicle?”
 ▹ “At what coordinates have I seen red trucks?”
▸ Fuse different sensor modalities for tracking and geo-localisation of entities.
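As a purely hypothetical illustration of the geolocation queries above (this is future work, so no such system is described in the deck), assuming detections are logged as (lat, lon, entity) records:

```python
from collections import Counter

# Hypothetical detection log: each entry is (lat, lon, entity); aggregating
# by coordinate answers "at what coordinates have I seen X".
detections = [
    (34.05, -118.24, "person"), (34.05, -118.24, "person"),
    (34.10, -118.30, "red truck"),
]

def coords_with(entity: str, min_count: int = 1) -> list:
    """Coordinates where `entity` was detected at least `min_count` times."""
    counts = Counter((lat, lon) for lat, lon, e in detections if e == entity)
    return [coord for coord, n in counts.items() if n >= min_count]

print(coords_with("person", min_count=2))   # [(34.05, -118.24)]
```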

SLIDE 35


Thank You

Questions?