Learning from Limited Data, The Univ. of Tokyo / RIKEN AIP, Tatsuya Harada (PowerPoint PPT Presentation)



SLIDE 1

Learning from Limited Data

The Univ. of Tokyo / RIKEN AIP Tatsuya Harada

GTC March 29, 2018

SLIDE 2

Contents

Background

Deep Learning (DL) is one of the most successful machine learning methods. However, DL generally requires a huge amount of annotated data, and annotation is very expensive.

Challenge

Obtaining High Quality Deep Neural Networks from limited data

Topics

  • Learning method for supervised learning from limited data
  • Unsupervised domain adaptation using classifier discrepancy

SLIDE 3

Between-class Learning

Yuji Tokozume, Yoshitaka Ushiku, Tatsuya Harada
Learning from Between-class Examples for Deep Sound Recognition (to appear in ICLR 2018)
Between-class Learning for Image Classification (to appear in CVPR 2018)

Learning from Limited Data

  • Y. Tokozume

https://github.com/mil-tokyo/bc_learning_sound
https://github.com/mil-tokyo/bc_learning_image

SLIDE 4

Standard Supervised Learning


[Figure: one example (e.g. a dog image) is randomly selected and augmented from the training dataset and input to the model; the target output is Dog 1, Cat 0, Bird 0.]

1. Select one example from the training dataset.
2. Train the model to output 1 for the corresponding class and 0 for the other classes.

SLIDE 5

Between-class (BC) Learning

1. Select two training examples from different classes.
2. Mix those examples with a random ratio.
3. Train the model to output the mixing ratio and the mixed classes.

Proposed method

[Figure: two examples (a dog and a cat) are randomly selected and augmented from the training dataset, mixed, and input to the model; the model is trained with a KL-divergence loss to output Dog 0.7, Cat 0.3, Bird 0.]

At test time, we input a single example into the network.

Merits:
  • Generates infinite training data from limited data.
  • Learns a more discriminative feature space than standard learning.

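The selection and mixing steps above can be sketched in NumPy. This is an illustrative sketch, not the authors' code; the simple linear mix shown here is the basic form (the papers refine the mixing for sounds and images):

```python
import numpy as np

def bc_example(x1, label1, x2, label2, num_classes, rng):
    """Create one between-class training example: mix two inputs from
    different classes with a random ratio r, and return the ratio label."""
    r = rng.uniform()                 # random mixing ratio in (0, 1)
    x = r * x1 + (1.0 - r) * x2      # basic linear mix of the inputs
    t = np.zeros(num_classes)
    t[label1] = r                     # e.g. Dog: 0.7
    t[label2] = 1.0 - r               # e.g. Cat: 0.3
    return x, t

# The model is then trained with a KL-divergence loss between its
# softmax output and the ratio label t.
```

At test time a single, unmixed example is fed to the network, as the slide notes.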

SLIDE 6

BC learning for sounds

[Figure: two training examples, a dog sound with label (Dog 1, Cat 0, Bird 0) and a cat sound with label (Dog 0, Cat 1, Bird 0), are mixed with a random ratio into a single sound whose label carries the mixing ratio (Dog r, Cat 1 − r, Bird 0). G1, G2: sound pressure levels of the two sounds [dB].]
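A sketch of gain-aware mixing for sounds. The exact coefficient below is an assumption based on the stated use of sound pressure levels, not necessarily the authors' exact formula:

```python
import numpy as np

def mix_sounds(x1, x2, g1_db, g2_db, r):
    """Mix two waveforms x1, x2 with ratio r, compensating for their
    sound pressure levels g1_db, g2_db [dB]."""
    # The louder sound gets a smaller coefficient so that the audible
    # mixture matches the intended ratio r.
    p = 1.0 / (1.0 + 10.0 ** ((g1_db - g2_db) / 20.0) * (1.0 - r) / r)
    # Normalize so the mixture's energy stays comparable to the inputs.
    return (p * x1 + (1.0 - p) * x2) / np.sqrt(p ** 2 + (1.0 - p) ** 2)
```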

SLIDE 7

Results of Sound Recognition

① Various models ② Various datasets ③ Compatible with strong data augmentation ④ Surpass the human level

We can improve recognition performance for any sound network by applying BC learning.

SLIDE 8

BC Learning for Image

Images as waveforms: an image can be decomposed into a static component and a wave component. The static component would not be important, or could even have a bad effect, if CNNs treat input data as waveforms.

Proposal 1: mix two images directly (e.g. Dog 1.0 and Cat 1.0 mixed into Dog 0.5, Cat 0.5).

Proposal 2 (BC+): mix images as waveforms, removing the static component first.
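A sketch of BC+ mixing under the waveform view. Subtracting each image's mean and weighting by the images' standard deviations is my reading of the method, so treat the exact formula as an assumption:

```python
import numpy as np

def bc_plus_mix(x1, x2, r):
    """Mix two images as waveforms (BC+): subtract each image's static
    component (its mean) and weight by the images' standard deviations."""
    m1, m2 = x1.mean(), x2.mean()
    s1, s2 = x1.std(), x2.std()
    p = 1.0 / (1.0 + (s1 / s2) * (1.0 - r) / r)      # std-aware coefficient
    mixed = p * (x1 - m1) + (1.0 - p) * (x2 - m2)
    return mixed / np.sqrt(p ** 2 + (1.0 - p) ** 2)  # keep the energy stable
```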

SLIDE 9

Results on CIFAR


Our preliminary results were presented in ILSVRC2017 on July 26, 2017.

SLIDE 10

Results on ImageNet-1K

top-1 / top-5 validation error (%):
  100 epochs: Standard 20.4 / 5.3 [28]; BC (ours) 19.92 / 4.91
  150 epochs: Standard 20.44 / 5.25; BC (ours) 19.43 / 4.80

around 1% gain in top-1 error


Our preliminary results were presented in ILSVRC2017 on July 26, 2017.

SLIDE 11

How BC Learning Works


[Figure: class A, class B, and rA+(1−r)B distributions, in a less discriminative and in a more discriminative feature space.]

Small Fisher’s criterion → overlap among distributions → large BC learning loss.
Large Fisher’s criterion → no overlap among distributions → small BC learning loss.

SLIDE 12

How BC Learning Works

[Figure: classes A, B, C and the decision boundary, with the mixture rA+(1−r)B shown under large and under small inter-class correlation.]

Large correlation among classes → a mixture of classes A and B may be classified into class C → large BC learning loss.
Small correlation among classes → a mixture of classes A and B is not classified into class C → small BC learning loss.

In classification, the class distributions should be uncorrelated because the teaching signal is discrete.

SLIDE 13

Visualization using PCA

[Figure: PCA visualization of activations, standard learning vs. BC learning (ours). Activations of the 10th layer of an 11-layer CNN trained on CIFAR-10.]

  • Distributions are more compact than those from standard learning.
  • Distributions are spherical.
  • Larger Fisher’s criterion than that of standard learning (1.97 for BC learning vs. 1.76 for standard learning).
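Fisher's criterion for two classes can be computed as the squared distance between class means over the sum of within-class variances. A minimal sketch (the exact definition used on the slide is not shown in this transcript, so this is one standard formulation):

```python
import numpy as np

def fisher_criterion(a, b):
    """Two-class Fisher's criterion: between-class scatter divided by
    within-class scatter (larger = more discriminative features)."""
    mu_a, mu_b = a.mean(axis=0), b.mean(axis=0)
    between = np.sum((mu_a - mu_b) ** 2)
    within = a.var(axis=0).sum() + b.var(axis=0).sum()
    return between / within
```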

SLIDE 14

Unsupervised Domain Adaptation using Classifier Discrepancy

Adversarial Dropout Regularization
Kuniaki Saito, Yoshitaka Ushiku, Tatsuya Harada, Kate Saenko
To appear in ICLR 2018

Maximum Classifier Discrepancy for Unsupervised Domain Adaptation
Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, Tatsuya Harada
To appear in CVPR 2018 (oral presentation)

Learning from Limited Data

  • K. Saito
SLIDE 15

Domain Adaptation (DA)

Problems

A supervised learning model needs many labeled examples, and collecting them in various domains is costly.

Goal

Transfer knowledge from the source domain to the target domain, and obtain a classifier that works well on the target domain.

Unsupervised Domain Adaptation (UDA)

Labeled examples are given only in the source domain. There are no labeled examples in the target domain.

[Figure: source domain = synthetic images (labeled); target domain = real images (unlabeled).]

SLIDE 16

Related Work

Distribution matching based method

  • Match distributions of source and target features
  • Domain Classifier (GAN) [Ganin et al., 2015]
  • Maximum Mean Discrepancy [Long et al., 2015]

Problems

  • Features are aligned by looking only at hidden features.
  • The relationship between the decision boundary and target examples is not considered.
  • These methods consider only the whole distribution.

[Figure: a feature extractor feeds a category classifier (trained on labeled source examples, S) and a domain classifier (distinguishing source from target, T). Before adaptation, the source and target distributions are separated; after adaptation, they are aligned across the decision boundary.]

SLIDE 17

Proposed Approach

  • Considering class-specific distributions
  • Using the decision boundary to align distributions

[Figure: previous work aligns the whole source and target distributions; the proposed method aligns target examples of class A and class B with the source by using the decision boundary (before adaptation vs. adapted).]

SLIDE 18

Key Idea

[Figure: source and target features with two classifiers F1 and F2. Training alternates: maximize the discrepancy by learning the classifiers, then minimize the discrepancy by learning the feature space.]

A discrepancy example is a target example that gets different predictions from the two classifiers.
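The discrepancy between the two classifiers F1 and F2 can be measured as the L1 distance between their class-probability outputs; a minimal sketch (one standard formulation, not necessarily the authors' exact loss):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=-1, keepdims=True)

def discrepancy(logits1, logits2):
    """L1 distance between the class probabilities of classifiers F1, F2.
    Zero when the classifiers agree; large on ambiguous target examples."""
    return float(np.abs(softmax(logits1) - softmax(logits2)).mean())
```

The classifiers are trained to make this large on target examples, and the feature generator is then trained to make it small.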

SLIDE 19

Network Architecture and Training

[Figure: the input is fed to a feature generator; two classifiers F1 and F2 produce predictions with classification losses L1class and L2class, plus a discrepancy loss D between their outputs. Maximize D by learning the classifiers; minimize D by learning the feature generator.]

Algorithm

  • 1. Fix the feature generator, and find classifiers F1, F2 that maximize D − (L1class + L2class).
  • 2. Find the feature generator and classifiers F1, F2 that minimize L1class + L2class (minimize classification error on the source domain).
  • 3. For l = 1, …, n: fix the classifiers F1, F2, and find a feature generator that minimizes D.

SLIDE 20

Why Discrepancy Method Works Well?


For a hypothesis h: (expected error in the target domain) ≤ (expected error in the source domain) + (divergence between the source and target domains) + (shared error of the ideal joint hypothesis).

SLIDE 21

Object Classification

Synthetic images to real images (12 classes). Finetune ResNet101 [He et al., CVPR 2016] pre-trained on ImageNet. Source: synthetic images; Target: real images.

Source (Synthetic images) Target (Real images)

SLIDE 22

Semantic Segmentation

  • Simulated images (GTA5) to real images (Cityscapes)
  • Finetuning of VGG and Dilated Residual Network [Yu et al., 2017], pre-trained on ImageNet
  • Discrepancy is calculated pixel-wise
  • Evaluation by mean IoU (TP / (TP + FP + FN))
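The mean IoU metric above can be sketched as follows (class-averaged TP / (TP + FP + FN); skipping classes absent from both prediction and ground truth is one common convention and an assumption here):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union over classes: TP / (TP + FP + FN)."""
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))   # predicted c, truly c
        fp = np.sum((pred == c) & (gt != c))   # predicted c, truly other
        fn = np.sum((pred != c) & (gt == c))   # missed pixels of class c
        if tp + fp + fn > 0:                   # skip classes absent everywhere
            ious.append(tp / (tp + fp + fn))
    return float(np.mean(ious))
```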

GTA5 (Source), Cityscapes (Target)

[Chart: per-class IoU (road, sidewalk, building, wall, fence, pole, light, sign, vegetation, terrain, sky, person, rider, car, truck, bus, train, motorcycle, bicycle), source only vs. ours.]

SLIDE 23

Qualitative Results

[Figure: qualitative segmentation results: RGB input, ground truth, source only, and adapted (ours).]

SLIDE 24

Take Home Messages

Between-class learning (BC learning)

  • Mix two training examples with a random ratio.
  • Train the model to output the mixing ratio.
  • Simple and easy to implement.
  • Can be introduced independently of previous techniques: network architectures, data augmentation schemes, optimizers, etc.

Unsupervised Domain Adaptation

An unsupervised domain adaptation method using classifier discrepancy is effective.