Computer Vision and Deep Learning
Introduction to Data Science 2019, University of Helsinki
Mats Sjöberg
mats.sjoberg@csc.fi
CSC – IT Center for Science
September 23, 2019
Computer vision

Giving computers the ability to understand visual information. Examples:
◮ A robot that can move around obstacles by analysing the input of its camera(s)
◮ A computer system finding images of cats among millions of images (e.g., on the Internet).
◮ The camera image needs to be digitised for computer processing
◮ Turning it into millions of discrete picture elements, or pixels
(figure: grid of pixel intensity values, e.g., 0.4941, 0.5058, . . . )
“There’s a cat among some flowers in the grass”
◮ How do we get from pixels to understanding?
◮ . . . or even some kind of useful/actionable interpretation.
Before
◮ Hand-crafted features, e.g., colour distributions, edge histograms
◮ Complicated feature selection mechanisms
◮ “Classical” machine learning, e.g., kernel methods (SVM)
About 5 years ago: deep learning
◮ End-to-end learning, i.e., the network itself learns the features
◮ Each layer typically learns a higher level of representation
◮ However: entirely data-driven, features can be hard to interpret
Computer vision was one of the first breakthroughs of deep learning.
Fully connected or dense layer
Each output $y_j$ is a weighted sum of all inputs $x_1, \ldots, x_n$ passed through an activation function $f(\cdot)$:

$$y_j = f\Big(\sum_{i=1}^{n} w_{ji}\, x_i\Big)$$

or, collecting the weights $w_{ji}$ into a matrix, in vector form:

$$y = f(W^T x)$$

(we’re ignoring the bias term here . . . )
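The dense-layer formula fits in a few lines of NumPy (a minimal illustration, not from the lecture; the bias term is omitted as on the slide, and `tanh` is just one possible activation):

```python
import numpy as np

def dense_forward(x, W, f=np.tanh):
    """Fully connected layer: y = f(W^T x).

    x: input vector of shape (n,)
    W: weight matrix of shape (n, m), one column per output neuron
    f: elementwise activation function
    """
    return f(W.T @ x)

# Tiny example: 3 inputs, 2 outputs
rng = np.random.default_rng(0)
x = rng.normal(size=3)
W = rng.normal(size=(3, 2))
y = dense_forward(x, W)
print(y.shape)  # (2,)
```

Note that `W` here has n × m entries, which is exactly the parameter count discussed on the next slide.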
◮ A feedforward network has a huge number of parameters that need to be learned
◮ Each output node interacts with every input node via the weights in W
◮ n × m weights (and that’s just one layer!)
◮ Learning is typically done with stochastic gradient descent
http://ruder.io/optimizing-gradient-descent/
◮ Gradients for each neuron obtained with backpropagation
◮ Given enough time and data the network can in theory learn to model any complex phenomenon (universal approximation theorem)
◮ In practice, we often use domain knowledge to restrict the number of parameters that need to be learned.
http://playground.tensorflow.org/
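Stochastic gradient descent itself fits in a few lines; here is a toy NumPy sketch (a one-parameter linear model rather than a neural network, so no backpropagation is needed, but the update rule is the same):

```python
import numpy as np

# Minimal stochastic gradient descent sketch (illustrative, not from the
# lecture): fit a single weight w in the model y = w * x to noisy data,
# updating on one randomly picked sample per step.
rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, size=200)
y = 3.0 * x + rng.normal(scale=0.1, size=200)  # true weight is 3.0

w, lr = 0.0, 0.1
for step in range(2000):
    i = rng.integers(len(x))               # pick one training sample
    grad = 2.0 * (w * x[i] - y[i]) * x[i]  # gradient of the squared error
    w -= lr * grad                         # take a small step downhill

print(w)  # w ends up close to the true weight 3.0
```

In a real network the per-sample gradient `grad` would come from backpropagation, and updates are usually averaged over mini-batches rather than single samples.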
While we don’t hand-craft features anymore, in practice we still apply some “expert knowledge” to make learning feasible:
◮ Neighbouring pixels are probably related (convolutions)
◮ There are common image features which can appear anywhere, such as edges, corners, etc. (weight sharing)
◮ Often the exact location of a feature isn’t important (max pooling)
⇒ Convolutional neural networks (CNN, ConvNet).
Network changes from this (figure: a fully connected layer, where every input x1 . . . x7 is connected to every output y1 . . . y7 through its own weight w11, w21, . . . , w77) . . .

to this (figure: a convolutional layer, where each output is connected only to a few neighbouring inputs, and the same weights w1, w2 are shared across all positions).
◮ We arrange the input and output neurons in 2D
◮ The output is the result of a weighted sum of a small local area in the previous layer – a convolution:

$$S(i, j) = \sum_{m}\sum_{n} I(i + m,\, j + n)\, K(m, n)$$

◮ The weights K(m, n) are what is learned.
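The sum on the slide is easy to implement directly (a minimal NumPy sketch, not from the lecture; the toy edge-detector kernel is a made-up example):

```python
import numpy as np

def conv2d(I, K):
    """'Valid' 2D convolution as on the slide:
    S(i, j) = sum_m sum_n I(i + m, j + n) * K(m, n)
    (strictly speaking cross-correlation, which is what deep
    learning frameworks implement under the name 'convolution')."""
    kh, kw = K.shape
    oh, ow = I.shape[0] - kh + 1, I.shape[1] - kw + 1
    S = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            S[i, j] = np.sum(I[i:i + kh, j:j + kw] * K)
    return S

# Toy image: dark on the left, bright on the right, and a small
# hand-crafted vertical-edge detector kernel
I = np.array([[0, 0, 1, 1],
              [0, 0, 1, 1],
              [0, 0, 1, 1],
              [0, 0, 1, 1]], dtype=float)
K = np.array([[-1, 1],
              [-1, 1]], dtype=float)
S = conv2d(I, K)
print(S)  # responds strongly exactly at the edge column
```

The same small kernel slides over every position, which is the weight sharing from the previous slide: only kh × kw weights are learned, regardless of the image size.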
◮ The convolutional layer learns several sets of weights, each a kind of feature detector
◮ These are built up in layers
◮ Until we get our end result, e.g., an object detector: “cat”
Krizhevsky et al 2012
Map activations back to the image space
Zeiler and Fergus 2014, https://arxiv.org/abs/1311.2901
◮ What we call CNNs actually also contain other types of layers
◮ Modern CNNs have a huge bag of tricks: pooling, various training shortcuts, 1×1 convolutions, inception modules, residual connections, etc.
LeNet-5 (LeCun et al 1998): INPUT 32×32 → convolutions → C1: feature maps 6@28×28 → subsampling → S2: f. maps 6@14×14 → convolutions → C3: f. maps 16@10×10 → subsampling → S4: f. maps 16@5×5 → full connection → C5: layer 120 → full connection → F6: layer 84 → Gaussian connections → OUTPUT 10.
AlexNet (Krizhevsky et al 2012)

GoogLeNet (Szegedy et al 2014)

Inception v3 (Szegedy et al 2015)

ResNet-152 (He et al 2015) https://github.com/KaimingHe/deep-residual-networks
ImageNet benchmark
◮ ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
◮ More than 1 million images
◮ Task: classify into 1000 object categories.
◮ First time won by a CNN in 2012 (Krizhevsky et al)
◮ Wide margin: top-5 error rate dropped from 26% to 16%
◮ CNNs have ruled ever since.
◮ Accuracy vs number of inference operations
◮ Circle size represents number of parameters
◮ Newer nets are better, faster, and have fewer parameters.
Image from https://arxiv.org/pdf/1605.07678.pdf
Rich feature hierarchies for accurate object detection and semantic segmentation. Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik. CVPR 2014.
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. arXiv:1506.01497
Learning Deconvolution Network for Semantic Segmentation. Hyeonwoo Noh, Seunghoon Hong, Bohyung Han. arXiv:1505.04366
https://github.com/facebookresearch/Detectron
Show and Tell: A Neural Image Caption Generator. Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan. CVPR 2015.
DenseCap: Fully Convolutional Localization Networks for Dense Captioning. Justin Johnson, Andrej Karpathy, Li Fei-Fei. CVPR 2016.
VQA: Visual Question Answering. Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Dhruv Batra, Devi Parikh. ICCV 2015.
“The coolest idea in machine learning in the last twenty years” – Yann LeCun
◮ We have two networks: a generator and a discriminator
◮ The generator produces samples, while the discriminator tries to distinguish between real data items and the generated samples
◮ The discriminator tries to learn to classify correctly, while the generator in turn tries to learn to fool the discriminator.
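The adversarial game described in the bullets above is usually written as a minimax objective (from Goodfellow et al 2014; the formula is not on the slide but summarises the description):

```latex
\min_G \max_D \; V(D, G) =
    \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z}\!\left[\log\bigl(1 - D(G(z))\bigr)\right]
```

The discriminator D maximises V (classify real vs generated samples correctly), while the generator G minimises it (fool D); z is random noise that G maps to a sample.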
Generated bedrooms
https://arxiv.org/abs/1511.06434v2
Generated “celebrities”
Progressive Growing of GANs for Improved Quality, Stability, and Variation. Tero Karras, Timo Aila, Samuli Laine, Jaakko Lehtinen. arXiv:1710.10196
CycleGAN
Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. https://junyanz.github.io/CycleGAN/
Generative Adversarial Text to Image Synthesis
https://arxiv.org/pdf/1605.05396.pdf
A Neural Algorithm of Artistic Style https://arxiv.org/pdf/1508.06576.pdf https://github.com/jcjohnson/neural-style
Recall our ImageNet benchmark . . . where do humans stand?
http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/
◮ Don’t confuse classification accuracy with understanding!
◮ Neural nets learn to optimize for a particular problem pretty well
◮ But in the end it’s just pixel statistics
◮ Humans can generalize and understand the context.
Microsoft CaptionBot: “I think it’s a group of people standing next to a man in a suit and tie.”
https://karpathy.github.io/2012/10/22/state-of-computer-vision/
◮ Deep nets can be fooled by deliberately crafted inputs
◮ Revealing: what deep nets learn is quite different from what humans learn
https://blog.openai.com/adversarial-example-research/
◮ Deep learning has been a big leap for computer vision
◮ We can solve some specific problems really well
◮ Still far away from true understanding of visual information
◮ Finnish non-profit state enterprise with special tasks
◮ Owned by the Finnish state (70%) and higher education institutions (30%)
◮ ICT expertise for research, education, public administration
◮ Services mostly free for universities and state research institutions
◮ You might have heard about: Funet, HAKA, eduroam, VIRTA, Finland’s fastest supercomputers, . . .
◮ Headquarters in Espoo (Keilaniemi), datacenter in Kajaani.
Some other services, which might be relevant for you:
◮ Notebooks – notebooks.csc.fi
  ◮ Jupyter notebooks, e.g., with deep learning environments
  ◮ Anyone with a student account can access
◮ Puhti CPU and GPU cluster
  ◮ 320 NVIDIA Volta V100 GPUs for deep learning
  ◮ requires a research project or university course
  ◮ https://research.csc.fi/dl2021-utilization
◮ In Q4/2020: EuroHPC pre-exascale supercomputer LUMI
  ◮ among the world’s fastest computers, ∼ 200 petaflop/s
  ◮ largely GPU based
  ◮ https://datacenter.csc.fi/wp/about-eurohpc/