

SLIDE 1

CS 103: Representation Learning, Information Theory and Control

Lecture 5, Feb 8, 2019

SLIDE 2

Representation Learning and Information Bottleneck

SLIDE 3

Desiderata for representations

An optimal representation z of the data x for the task y is a stochastic function z ∼ p(z|x) that is:

• Sufficient: I(z; y) = I(x; y), i.e. z retains all the information that x contains about the task.
• Minimal: I(x; z) is minimal among sufficient representations.
• Invariant to nuisances: if n is a nuisance (a factor, such as lighting or viewpoint, that affects x but is independent of the task, n ⫫ y), then I(n; z) = 0.
• Maximally disentangled: the total correlation TC(z) = KL( p(z) ‖ ∏ᵢ p(zᵢ) ) is minimized (see the numerical sketch below).
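To make total correlation concrete, here is a small numerical sketch (illustrative, not from the slides): for a Gaussian z, TC(z) has a closed form, and it vanishes exactly when the components of z are independent.

```python
import numpy as np

def gaussian_total_correlation(cov):
    """TC(z) = KL(p(z) || prod_i p(z_i)) for z ~ N(0, cov).

    For a Gaussian this reduces to 0.5 * (sum_i log cov_ii - log det cov),
    which is zero iff cov is diagonal, i.e. the components of z are
    independent (fully disentangled).
    """
    cov = np.asarray(cov, dtype=float)
    return 0.5 * (np.sum(np.log(np.diag(cov))) - np.linalg.slogdet(cov)[1])

print(gaussian_total_correlation(np.eye(2)))             # 0.0: disentangled
print(gaussian_total_correlation([[1.0, 0.9],
                                  [0.9, 1.0]]))          # ~0.83: entangled
```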

SLIDE 4

Minimal sufficient representations for deep learning

Information Bottleneck Lagrangian

A minimal sufficient representation is the solution to:

minimize over p(z|x): I(x; z)   s.t.   H(y|z) = H(y|x)

Relaxing the constraint gives a trade-off between sufficiency and minimality, regulated by the parameter β. This yields the Information Bottleneck Lagrangian:

L = H_{p,q}(y|z) + β I(z; x)

where H_{p,q}(y|z) is the cross-entropy term (enforcing sufficiency) and β I(z; x) is the regularizer (enforcing minimality).
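In practice the Lagrangian is optimized through a variational bound, as in the Deep Variational Information Bottleneck of Alemi et al. (2016): a Gaussian encoder makes I(z; x) tractable via a KL term. A minimal PyTorch sketch, with illustrative names and β value (not from the slides):

```python
import torch
import torch.nn.functional as F

def sample_z(mu, logvar):
    # Reparameterized sample from the encoder p(z|x) = N(mu, diag(exp(logvar))).
    return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

def ib_loss(logits, y, mu, logvar, beta=1e-3):
    # Sufficiency term: cross-entropy H_{p,q}(y|z) of the classifier head.
    ce = F.cross_entropy(logits, y)
    # Minimality term: KL(p(z|x) || N(0, I)) upper-bounds I(z; x).
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=1).mean()
    return ce + beta * kl
```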

SLIDE 5

We only need to enforce minimality (easy) to gain invariance (difficult)

Invariant if and only if minimal

• Proposition (Achille and Soatto, 2017). Let z be a sufficient representation and n a nuisance. Then

I(z; n) ≤ I(z; x) − I(x; y)

and there exists a nuisance n for which equality holds. Here I(z; n) measures (the lack of) invariance, I(z; x) measures minimality, and I(x; y) is a constant fixed by the task.

A representation is maximally insensitive to all nuisances iff it is minimal
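A short derivation of the bound, reconstructed under the usual Markov assumption (y, n) → x → z (a plausible sketch, not the paper's verbatim proof):

```latex
\begin{align*}
I(z;x) &\ge I(z; y, n)             && \text{data processing inequality, } (y,n) \to x \to z \\
       &= I(z; y) + I(z; n \mid y) && \text{chain rule of mutual information} \\
       &= I(x; y) + I(z; n \mid y) && \text{sufficiency: } I(z; y) = I(x; y) \\
       &\ge I(x; y) + I(z; n)      && n \perp y \implies I(z; n \mid y) \ge I(z; n)
\end{align*}
```

Rearranging the first and last lines gives I(z; n) ≤ I(z; x) − I(x; y).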

SLIDE 6

The standard architecture alone already promotes invariant representations

Corollary: Ways of enforcing invariance

1. Regularization by architecture: reducing dimension (max-pooling) or adding noise (dropout) increases minimality and invariance (see the sketch after this list).
2. Stacking layers: stacking multiple layers makes the representation increasingly minimal, and increasingly more minimal implies increasingly more invariant to nuisances.
3. Bottlenecks drop only nuisances: of the task information I(x; y) and the nuisance information I(x; n), a bottleneck discards only the latter (by sufficiency), so the classifier cannot overfit to nuisances.
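As a concrete illustration of point 1, a hedged sketch (layer sizes and rates are arbitrary): in a standard convolutional block, max-pooling and dropout already act as architectural bottlenecks.

```python
import torch.nn as nn

# A standard conv block. Both operations limit I(x; z) purely by architecture:
# max-pooling reduces dimension, dropout injects noise.
block = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),    # dimension reduction -> more minimal
    nn.Dropout(0.5),    # noise injection    -> more minimal
)
```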

SLIDE 7

Creating a soft bottleneck with controlled noise

Information Dropout: a Variational Bottleneck

L = H_{p,q}(y|z) + β I(z; x) = H_{p,q}(y|z) − β log α(x)

The bottleneck is created by injecting multiplicative log-normal noise ε ∼ log N(0, α²(x)) into the activations: the task information I(x; y) passes through, while the nuisance information I(x; n) is discarded.
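A sketch of such a layer in PyTorch, following the construction in the cited paper (the linear parameterization of α(x) and the 0.7 cap are illustrative assumptions):

```python
import torch
import torch.nn as nn

class InformationDropout(nn.Module):
    """Information-Dropout-style layer: multiplicative log-normal noise with
    input-dependent variance alpha(x), i.e. a learned soft bottleneck."""

    def __init__(self, features, max_alpha=0.7):
        super().__init__()
        self.log_alpha = nn.Linear(features, features)  # predicts log alpha(x)
        self.max_alpha = max_alpha

    def forward(self, x):
        alpha = self.log_alpha(x).exp().clamp(max=self.max_alpha)
        if self.training:
            # eps ~ logNormal(0, alpha(x)^2): multiplicative noise.
            eps = torch.exp(alpha * torch.randn_like(x))
            x = x * eps
        # Penalty -log alpha(x) (up to constants), weighted by beta in the loss.
        kl = -torch.log(alpha).mean()
        return x, kl
```

During training, the returned kl is added to the cross-entropy as β · kl, matching the Lagrangian above.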

Achille and Soatto, "Information Dropout: Learning Optimal Representations Through Noisy Computation", PAMI 2018 (arXiv 2016)

SLIDE 8

Learning invariant representations

• Deeper layers filter increasingly more nuisances.
• Stronger bottleneck = more filtering.

[Figure: only the informative part of the image is retained; all other information is discarded.]

(Achille and Soatto, 2017)

Achille and Soatto, "Information Dropout: Learning Optimal Representations Through Noisy Computation", PAMI 2018 (arXiv 2016)