CS839 Special Topics in AI: Deep Learning - Learning with Less Supervision



SLIDE 1

CS839 Special Topics in AI: Deep Learning

Learning with Less Supervision

Sharon Yixuan Li University of Wisconsin-Madison

October 29, 2020

SLIDE 2

Overview

  • Weakly Supervised Learning
  • Flickr100M
  • JFT300M (Google)
  • Instagram3B (Facebook)
  • Data augmentation
  • Human heuristics
  • Automated data augmentation
  • Self-supervised Learning
  • Pretext tasks (rotation, patches, colorization etc.)
  • Invariant vs. Covariant learning
  • Contrastive learning based framework (current SoTA)
SLIDE 3

Part I: Weakly Supervised Learning

SLIDE 4

Model Complexity Keeps Increasing

LeNet (LeCun et al. 1998): conv, conv, fc 120, fc 84, output 10

ResNet (He et al. 2016): >100 million parameters

SLIDE 5

[Sun et al. 2017]

SLIDE 6

Challenge: Limited labeled data

ImageNet: 1M images, ~a thousand annotation hours [Deng et al. 2009]

x1000: 1B images, ~a million annotation hours

SLIDE 7

Fully Supervised

CAT, DOG, FLOOR

Weakly Supervised

A CUTE CAT COUPLE #CAT

Unsupervised

???

Levels of Supervision

TRAINING AT SCALE

Example data sources: ImageNet (fully supervised), Instagram/Flickr (weakly supervised), crawled web images (unsupervised)

SLIDE 8

Noisy Data

Non-Visual Labels Missing Labels Incorrect Labels

#DOG #LOVE #HUSKY #CAT

SLIDE 9

Flickr 100M [Joulin et al. 2015]

SLIDE 10

JFT 300M [Sun et al. 2017]

SLIDE 11

Can we use billions of images with hashtags for pre-training?

[Mahajan et al. 2018]

SLIDE 12

Hashtag Selection

Synonyms of ImageNet labels: 1.5K hashtags, 1B images. Synonyms of nouns in WordNet: 17K hashtags, 3B images.

[Mahajan et al. 2018]

SLIDE 13

Network Architecture and Capacity

ResNeXt-101 32xCd

[Plots: number of parameters (x10^6) and FLOPs (x10^9) vs. group width C in {4, 8, 16, 32, 48} for ResNeXt-101 32xCd; Xie et al. 2016]

SLIDE 14

3.5B public Instagram images, 17K unique labels, large-capacity model (ResNeXt-101 32x48d), distributed training (350 GPUs)

85.1%

Largest Weakly Supervised Training

[Mahajan et al. 2018]

SLIDE 15

Results

SLIDE 16

* With a bigger model, we even got 85.4% top-1 accuracy on ImageNet-1K.

Target task: ImageNet

Transfer Learning Performance

SLIDE 19

* With a bigger model, we even got 85.4% top-1 accuracy on ImageNet-1K.

Target task: ImageNet Target task: CUB-2011 & Places-365

Transfer Learning Performance

SLIDE 20

Models are surprisingly robust to label "noise"

Dataset: IG-1B-17k; Network: ResNeXt-101 32x16d

SLIDE 21

Matching hashtags to the target task helps (1.5K tags)


Effect of Model Capacity

Target task: ImageNet-1K

SLIDE 22

BiT Transfer [Kolesnikov et al. 2020]

SLIDE 23

Part II: Data Augmentation

SLIDE 24

“Quokka”

Figure credit: https://github.com/aleju/imgaug

Data Augmentation

SLIDE 25

Data Augmentation

Data

Load image and label (“cat”) → CNN

SLIDE 26

Data Augmentation

Data

Load image and label → Transformation function (TF) → CNN

SLIDE 27

Data Augmentation

  • Change the pixels without changing the labels
  • Training on transformed data improves generalization
  • VERY widely used

Transformation function (TF)

SLIDE 28

Example of Transformation Functions (TFs)

Original image Color jitter Horizontal flip Random crop
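As a quick illustration, two of these TFs can be written in a few lines of NumPy. This is a minimal sketch; the function names and image shapes are mine, not from the slides:

```python
import numpy as np

def horizontal_flip(img):
    """Flip an H x W x C image left-to-right; the label is unchanged."""
    return img[:, ::-1, :]

def random_crop(img, size, rng=None):
    """Cut a random size x size window out of an H x W x C image."""
    rng = rng or np.random.default_rng()
    h, w, _ = img.shape
    top = int(rng.integers(0, h - size + 1))
    left = int(rng.integers(0, w - size + 1))
    return img[top:top + size, left:left + size, :]

img = np.arange(4 * 4 * 3).reshape(4, 4, 3)
flipped = horizontal_flip(img)   # same shape, mirrored columns
crop = random_crop(img, 2)       # shape (2, 2, 3)
```

In practice one would use a library such as torchvision, which provides these operations as composable transforms.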

SLIDE 29

Heuristic Data Augmentation

Data Augmented data

TF sequences: TF1 ... TFL (e.g. rotation, flip), chosen by a human expert
SLIDE 30

Heuristic Data Augmentation

Data Augmented data

TF sequences: TF1 ... TFL (e.g. rotation, flip), chosen by a human expert

How to automatically learn the compositions and parameterizations of TFs?

SLIDE 31

TANDA

Generator (LSTM) produces TF sequences: TF1 ... TFL (e.g. rotation, flip)

Data Augmented data

[Ratner et al. 2017] Transformation Adversarial Networks for Data Augmentations

SLIDE 32

TANDA

Generator (LSTM) produces TF sequences: TF1 ... TFL (e.g. rotation, flip)

Discriminator: real or augmented?

Data Augmented data

[Ratner et al. 2017] Transformation Adversarial Networks for Data Augmentations

SLIDE 33

TANDA

[Ratner et al. 2017] Transformation Adversarial Networks for Data Augmentations. Generated MNIST samples.

[Bar chart, heuristic augmentation vs. TANDA: CIFAR-10 +2.1%, ACE (F1 score) +1.4, Medical Imaging +3.4%]

SLIDE 34

AutoAugment

[Cubuk et al. 2018]

SLIDE 35

AutoAugment

Controller (RNN) produces TF sequences: TF1 ... TFL (e.g. rotation, flip)

Discriminator: real or augmented?

Data → Augmented data

[Cubuk et al. 2018]

SLIDE 36

AutoAugment

Controller (RNN) produces TF sequences: TF1 ... TFL (e.g. rotation, flip)

End model is trained on the augmented data; its validation accuracy R is the reward

Data → Augmented data

[Cubuk et al. 2018] State-of-the-art performance on various benchmarks; however, the computational cost is very high.

SLIDE 37

RandAugment

[Cubuk et al. 2019]

Controller (RNN) produces TF sequences: TF1 ... TFL (e.g. rotation, flip)

End model: validation accuracy R

Data → Augmented data

SLIDE 38

RandAugment

[Cubuk et al. 2019]

Data → Augmented data via randomly sampled TFs: TF1 ... TFL

(1) Random sampling over the transformation functions
(2) Grid search over the parameters of each transformation

Outperforms AutoAugment
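The search space thus collapses to two scalars: the number of sampled ops and their shared magnitude. A toy sketch of the sampling step, with a hypothetical op pool standing in for the paper's set of image ops:

```python
import random

# Illustrative op pool; the real RandAugment draws from a larger set of
# image ops (rotate, shear, solarize, ...), all sharing one magnitude.
TF_POOL = ["rotate", "shear_x", "color", "posterize", "solarize"]

def randaugment_policy(n, m, rng=random):
    """Sample n ops uniformly at random, each applied at magnitude m.
    Only (n, m) remain as hyperparameters, so a small grid search suffices."""
    return [(rng.choice(TF_POOL), m) for _ in range(n)]

policy = randaugment_policy(n=2, m=9)   # two (op, magnitude) pairs
```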

SLIDE 39

Adversarial AutoAugment

[Zhang et al. 2019]

Adversarial controller (RNN) produces TF sequences (TF1 ... TFL, e.g. rotation, flip) to maximize the end model's training loss, while the end model is trained to minimize it; the training loss serves as the reward signal.

Data → Augmented data

12x reduction in computing cost on ImageNet compared to AutoAugment; 1.36% top-1 error on CIFAR-10 (new SoTA).

SLIDE 40

Uncertainty-based sampling augmentation

Users provide transformation functions (TFs): rotate, invert, cutout, mixup

Data → Augmented data via K randomly sampled compositions of TFs

The model selects the TFs that provide the most information during training; no policy learning required

[Wu et al. 2020]
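Among the TFs above, mixup is notable because it transforms the label as well as the image. A minimal NumPy sketch (the function name and default alpha are illustrative):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Convex-combine two examples and their one-hot labels using a
    Beta(alpha, alpha)-distributed mixing weight."""
    rng = rng or np.random.default_rng()
    lam = float(rng.beta(alpha, alpha))
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

x_mix, y_mix = mixup(np.ones(4), np.array([1.0, 0.0]),
                     np.zeros(4), np.array([0.0, 1.0]))
```

The mixed label stays a valid distribution (its entries sum to 1), so the model is trained with the usual cross-entropy loss on soft targets.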

SLIDE 41

Empirical results: State of the art quality

CIFAR-10 CIFAR-100 SVHN

Improved the existing methods across domains

SoTA on CIFAR-10, CIFAR-100, and SVHN: 84.54% on CIFAR-100 using Wide-ResNet-28-10

  • Outperforming RandAugment (Cubuk et al. ’19) by 1.24%

Improved accuracy by 0.28 points on a text classification task

SLIDE 42

Check out the blog post series!

Automating the Art of Data Augmentation (Part I: Overview)
Automating the Art of Data Augmentation (Part II: Practical Methods)
Automating the Art of Data Augmentation (Part III: Theory)
Automating the Art of Data Augmentation (Part IV: New Direction)

SLIDE 43

Part III: Self-supervised Learning

SLIDE 44

Source: Yann LeCun’s talk

SLIDE 45

What if we could get labels for free for unlabeled data, and train on it in a supervised manner?

SLIDE 46

Pretext Tasks

SLIDE 47

[Gidaris et al. 2018]

Rotation
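The rotation labels come for free: each image yields four training examples for a 4-way classification task. A NumPy sketch of the labeling scheme (the function name is mine, not from the paper):

```python
import numpy as np

def rotation_pretext(img):
    """Return the four rotated copies of an H x W image together with
    their free labels 0..3, standing for 0, 90, 180, 270 degrees."""
    return [np.rot90(img, k) for k in range(4)], [0, 1, 2, 3]

img = np.arange(9).reshape(3, 3)
views, labels = rotation_pretext(img)   # a 4-way classification problem
```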


SLIDE 50

[Doersch et al., 2015]

Patches
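Here the free label is spatial: which of the eight neighboring positions a patch was taken from. A toy sketch over a 3x3 grid, ignoring details from the paper such as the gaps between patches:

```python
import numpy as np

def patch_pretext(img, rng=None):
    """Cut an image into a 3x3 grid; return the center patch, a random
    neighboring patch, and the neighbor's position index 0..7 as label."""
    rng = rng or np.random.default_rng()
    h, w = img.shape[0] // 3, img.shape[1] // 3
    cells = [img[i*h:(i+1)*h, j*w:(j+1)*w]
             for i in range(3) for j in range(3)]
    label = int(rng.integers(0, 8))
    neighbor = cells[label if label < 4 else label + 1]  # skip center (index 4)
    return cells[4], neighbor, label

center, neighbor, label = patch_pretext(np.arange(36).reshape(6, 6))
```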

SLIDE 51

Colorization

http://richzhang.github.io/colorization/

[Zhang et al. 2016]

SLIDE 52

Pretext Invariant Representation Learning (PIRL)

[Misra et al. 2019]

SLIDE 53

Pretext Invariant Representation Learning (PIRL)

[Misra et al. 2019] Positive pair Negative pairs

SLIDE 54

SimCLR

[Chen et al. 2020]
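SimCLR trains with the NT-Xent contrastive loss: two augmented views of the same image form a positive pair, and all other images in the batch serve as negatives. A NumPy sketch of the loss (variable names are mine; the real setup adds a projection head and large batches):

```python
import numpy as np

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent loss for paired embeddings z1, z2 of shape (N, D): row i of
    z1 and row i of z2 are positives; all other rows in the batch are
    negatives."""
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarities
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                    # exclude self-pairs
    n = len(z1)
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
loss_aligned = nt_xent(z, z)                       # identical views: low loss
loss_random = nt_xent(z, rng.normal(size=(8, 16))) # unrelated views: high loss
```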


SLIDE 57

Data Augmentation is the key

[Chen et al. 2020]

SLIDE 58

Unsupervised learning benefits more from bigger models

[Chen et al. 2020]

SLIDE 59

Summary

  • Weakly Supervised Learning
  • Flickr100M
  • JFT300M (Google)
  • Instagram3B (Facebook)
  • Data augmentation
  • Human heuristics
  • Automated data augmentation
  • Self-supervised Learning
  • Pretext tasks (rotation, patches, colorization etc.)
  • Invariant vs. Covariant learning
  • Contrastive learning based framework (current SoTA)
SLIDE 60

Questions?