 
IDA Machine Learning Seminars - Fall 2015. Deep Convolutional Networks and their impact on solving large scale visual recognition problems. Hossien Azizpour, Computer Vision Group, KTH. Thanks to: J. Sullivan, A. S. Razavian, A. Maki and S. Carlsson
What has Deep Learning done for Computer Vision? Deep Learning has resulted in: 1. much better automatic visual image classification and object detection, 2. much more powerful generic image representations.
What have ConvNets done for Computer Vision? ConvNets have resulted in: 1. much better automatic visual image classification and object detection, 2. much more powerful generic image representations.
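Image Classification Task: ILSVRC. [Figure: an image of a steel drum with two example outputs; one list of predicted labels (including scale, T-shirt, steel drum, drumstick, mud turtle) is marked correct (✔), the other (including scale, T-shirt, giant panda, drumstick, mud turtle) is marked incorrect (✗).] $\text{Error} = \frac{1}{100{,}000} \sum_{i=1}^{100{,}000} \mathbb{1}(\text{incorrect on image } i)$, summed over the 100,000 images. Source: Detecting avocados to zucchinis: what have we done, and where are we going? O. Russakovsky et al., ICCV 2013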
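The error measure on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, assuming that an image counts as "incorrect" when its ground-truth label is not among the k labels the system outputs (k = 5 in the usual ILSVRC setting); the function name `top_k_error` and the toy scores are made up for the example.

```python
import numpy as np

def top_k_error(scores, labels, k=5):
    """Fraction of images whose true label is not among the k highest-scoring classes."""
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    # Indices of the k largest scores per image (order within the k does not matter).
    top_k = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    # 1(incorrect on image i): the true label is missing from the k guesses.
    incorrect = ~(top_k == labels[:, None]).any(axis=1)
    return incorrect.mean()

# Toy example (hypothetical scores): 3 images, 6 classes, k = 2.
scores = np.array([
    [0.10, 0.50, 0.20, 0.05, 0.05, 0.10],  # top-2 classes: 1 and 2
    [0.30, 0.10, 0.10, 0.40, 0.05, 0.05],  # top-2 classes: 3 and 0
    [0.05, 0.10, 0.15, 0.10, 0.50, 0.10],  # top-2 classes: 4 and 2
])
labels = np.array([2, 1, 4])  # only the second image is missed
error = top_k_error(scores, labels, k=2)
```

Averaging the per-image indicator is exactly the sum divided by the number of images, so the same function applies unchanged to 100,000 images and 1000 classes.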
ConvNets → much better image classification. [Figure: classification error (%) of the winning entry by year: 2010: 28.2, 2011: 25.8, 2012: 16.4, 2013: 11.7, 2014: 6.7.] Performance of winning entry in ILSVRC competitions (2010-14). Red indicates when deep ConvNets were introduced.
How well would a human perform on ImageNet? • Andrej Karpathy, Stanford, set himself this challenge. • Replicated the 1000-way classification problem for a human. - The person is shown the test image on the left. - On the right, 13 examples from each of the 1000 classes are shown. - The person must pick 5 of the classes as the potential ground truth label.
How well would a human perform on ImageNet? • Efforts and results reported on his blog: What I learned from competing against a ConvNet on ImageNet • Estimated his own top-5 error on ImageNet as 5.1% (after some training period). • Later conjectured (Feb 2015) that a dedicated and motivated human classifier is capable of an error rate in the range of 2%–3%.
Race is on to beat human level performance. [Figure: classification error (%): 2010: 28.2, 2011: 25.8, 2012: 16.4, 2013: 11.7, 2014: 6.7, Jan: 5.33, Feb: 4.94, Mar: 4.82.] Recent progress made by Baidu, MSR and Google.
Pascal VOC: Object Detection. Classification: person, motorcycle. Detection: person, motorcycle.
ConvNets → much better object detection. [Figure: accuracy by year (2007-2015) for plant, person, chair, cat, car, aeroplane and all classes, with the arrival of deep learning marked.] Progress of object detection for the Pascal VOC 2007 challenge.
ConvNets → much better image representation
Other Common Tasks in Computer Vision • Fine-Grained classification Task: - Label the sub-categories within a class.
Other Common Tasks in Computer Vision • Attribute Classification Task: - Predict the attributes describing a scene (person, etc.)
Other Common Tasks in Computer Vision • Image Retrieval: have a database of images. Task: - Given a query image, - find images in the database with the same content as the query image. [Figure: a query image and the database images ranked closest to it, each marked as a correct or incorrect result.]
Solving these tasks often involves a complicated pipeline • Example: fine-grained classification. Pipeline stages: Image, Part Annotations, Learn Normalized Pose, Strong DPM, Extract Features (RGB, gradient, LBP), SVM. • Can IMPROVE RESULTS by replacing the complicated pipeline with a CNN representation: Image, CNN Representation, SVM. • The ConvNet used must be deep and trained on a large, diverse, labelled dataset.
What we mean by a ConvNet feature. Input Image (224 × 224 × 3) → Convolutional layers (55 × 55 × 48, 27 × 27 × 128, 13 × 13 × 192, 13 × 13 × 192, 13 × 13 × 128) → Fully connected layers (dense 4096, dense 4096) → Output (dense 1000)
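The spatial sizes 55, 27 and 13 listed for the convolutional layers follow from simple window arithmetic. The sketch below reproduces them under assumed hyperparameters: the kernel sizes, strides and paddings are not given on the slide and are chosen here as one plausible AlexNet-style configuration that yields these numbers. The names `conv_output_size`, `layers` and `trace_sizes` are illustrative.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution or pooling window: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# (name, kernel, stride, padding). ASSUMED values, not taken from the slide.
layers = [
    ("conv1", 11, 4, 2),
    ("pool1", 3, 2, 0),
    ("conv2", 5, 1, 2),
    ("pool2", 3, 2, 0),
    ("conv3", 3, 1, 1),
    ("conv4", 3, 1, 1),
    ("conv5", 3, 1, 1),
]

def trace_sizes(input_size=224):
    """Return the spatial size after each layer for a square input."""
    sizes = {}
    size = input_size
    for name, kernel, stride, padding in layers:
        size = conv_output_size(size, kernel, stride, padding)
        sizes[name] = size
    return sizes

sizes = trace_sizes(224)
```

With these assumptions, `sizes` maps conv1 to 55, pool1 and conv2 to 27, and pool2 through conv5 to 13, matching the widths and heights on the slide. The channel counts (48, 128, 192) are set by the number of filters per layer and do not follow from this formula.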
ConvNets → much better image representation. [Figure: bar chart comparing the best state-of-the-art method with off-the-shelf ConvNet features + linear SVM across classification, fine-grained recognition, attribute detection and image retrieval tasks.] Source: CNN Features off-the-shelf: an Astounding Baseline for Recognition, A. Sharif Razavian et al., arXiv, March 2014.
Reason for jump in performance: learn feature hierarchies from the data
Modern Visual Recognition Systems 1. Training Phase - Gather labelled training data. - Extract a feature representation for each training example. - Construct a decision boundary. 2. Test Phase - Extract feature representation from the test example. - Compare to the learnt decision boundary. It’s just supervised learning.
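The two phases can be sketched end to end in NumPy. This is a toy illustration only: `extract_features` is a hypothetical stand-in for a real feature extractor (for example ConvNet activations), the data are synthetic Gaussian blobs, and the decision boundary is a linear SVM fitted by plain subgradient descent on the hinge loss, which is one simple choice among many.

```python
import numpy as np

def extract_features(x):
    """Hypothetical feature extractor: here just L2-normalisation of the raw vector."""
    x = np.asarray(x, dtype=float)
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-12)

def train_linear_svm(features, labels, reg=1e-3, lr=0.1, epochs=200):
    """Fit w, b by subgradient descent on the L2-regularised hinge loss (labels in {-1, +1})."""
    n, d = features.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = labels * (features @ w + b)
        active = margins < 1  # examples that violate the margin
        grad_w = reg * w - (labels[active][:, None] * features[active]).sum(axis=0) / n
        grad_b = -labels[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict(features, w, b):
    """Compare each example to the learnt decision boundary."""
    return np.where(features @ w + b >= 0, 1, -1)

rng = np.random.default_rng(0)

# 1. Training phase: labelled data -> features -> decision boundary.
train_x = np.vstack([rng.normal(+2.0, 1.0, (50, 8)), rng.normal(-2.0, 1.0, (50, 8))])
train_y = np.array([1] * 50 + [-1] * 50)
w, b = train_linear_svm(extract_features(train_x), train_y)

# 2. Test phase: features of unseen examples -> compare to the boundary.
test_x = np.vstack([rng.normal(+2.0, 1.0, (20, 8)), rng.normal(-2.0, 1.0, (20, 8))])
test_y = np.array([1] * 20 + [-1] * 20)
accuracy = (predict(extract_features(test_x), w, b) == test_y).mean()
```

The classifier stays the same whatever features are used; replacing `extract_features` with a better representation is the only change needed to try a different feature extractor in this sketch.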
Is it a bike or a face?
Construct a decision boundary
The two extremes of feature extraction: ideal features vs. far-from-ideal features. Supervised Deep Learning allows you to learn features that are closer to ideal.
Learning Representations/Features • Traditional Pattern Recognition: fixed/handcrafted feature extraction (Feature Extractor → Trainable Classifier) • Modern Pattern Recognition: unsupervised mid-level features (Feature Extractor → Mid-level Features → Trainable Classifier) • Deep Learning: trained hierarchical representations (Low-level Features → Mid-level Features → High-level Features → Trainable Classifier) Source: Talk Computer Perception with Deep Learning by Yann LeCun
Key Properties of Deep Learning. Provides a mechanism to: • Learn a highly non-linear function (efficiently encoded in a deep structure). • Learn it from data. • Build feature hierarchies - Distributed representations - Compositionality • Perform end-to-end learning.
How? Convolutional Networks
Convolutional Networks • Are deployed in many practical applications: image recognition, speech recognition, Google’s and Baidu’s photo taggers. • Have won several competitions: ImageNet, Kaggle Facial Expression, Kaggle Multimodal Learning, German Traffic Signs, Connectomics, Handwriting... • Are applicable to array data where nearby values are correlated: images, sound, time-frequency representations, video, volumetric images, RGB-Depth images... Source: Talk Computer Perception with Deep Learning by Yann LeCun
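Why correlated nearby values matter can be seen in a single convolutional filter: the same small set of weights is applied to every local patch of the array. The sketch below is a deliberately slow, loop-based illustration for an RGB image; the function name `conv2d_single_filter`, the "valid" (no padding) convention and the ReLU non-linearity are assumptions made for the example, not details taken from the slides.

```python
import numpy as np

def conv2d_single_filter(image, kernel, bias=0.0):
    """Slide one kh x kw x C filter over an H x W x C image (no padding, stride 1), then apply ReLU."""
    h, w, c = image.shape
    kh, kw, kc = kernel.shape
    assert c == kc, "filter depth must match the number of image channels"
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # The same weights are reused at every location (weight sharing).
            patch = image[i:i + kh, j:j + kw, :]
            out[i, j] = np.sum(patch * kernel) + bias
    return np.maximum(out, 0.0)  # ReLU (assumed non-linearity)

# Toy input: a 5 x 5 RGB image of ones and a 3 x 3 x 3 averaging-style filter of ones.
image = np.ones((5, 5, 3))
kernel = np.ones((3, 3, 3))
feature_map = conv2d_single_filter(image, kernel)
```

Each output value depends only on a 3 × 3 neighbourhood across the three colour channels, so the filter has 27 weights regardless of the image size. A full layer stacks many such filters to produce one feature map per filter.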
Convolutional Network • Training is supervised and uses stochastic gradient descent. • LeCun et al. ’89, ’98 Source: Talk Computer Perception with Deep Learning by Yann LeCun
ConvNets: History • Fukushima 1980: designed a network with the same basic structure but did not train it by backpropagation. • LeCun from late 80s: figured out backpropagation for ConvNets, popularized and deployed ConvNets for OCR applications etc. • Poggio from 1999: same basic structure but learning is restricted to the top layer (k-means at second stage). • LeCun from 2006: unsupervised feature learning. • DiCarlo from 2008: large scale experiments, normalization layer. • LeCun from 2009: harsher non-linearities, normalization layer, learning unsupervised and supervised. • Mallat from 2011: provides a theory behind the architecture. • Hinton 2012: use bigger nets, GPUs, more data.
[Figure: timeline of Convolutional Neural Nets: 1988, 1998, 2012.] Reasons for breakthrough now: • Data and GPUs, • Networks have been made deeper.
Modern Convolutional Network: AlexNet 2012. Input Image (224 × 224 × 3) → Convolutional layers (55 × 55 × 48, 27 × 27 × 128, 13 × 13 × 192, 13 × 13 × 192, 13 × 13 × 128) → Fully connected layers (dense 4096, dense 4096) → Output (dense 1000)
Convolutional Networks for RGB Images: The Basic Operations