SLIDE 1

Invariances in Gaussian processes

And how to learn them

ST John

PROWLER.io

SLIDE 2

Outline

1. What are invariances?
2. Why do we want to make use of them?
3. How can we construct invariant GPs?
4. Where invariant GPs are actually crucial
5. How can we figure out what invariances to employ?

SLIDE 3

What are invariances?

Function does not change under some transformation, i.e. f(σ(x)) = f(x) for all transformations σ we consider.


Can be discrete or continuous

  • Translation
  • Rotation
  • Reflection
  • Permutation
SLIDE 4

Invariance under discrete translation


Periodic functions

SLIDE 5

Invariance under discrete translation

Density(x) = Density(x + translation)

[Figure: 2D grid (x: 1–5, y: 1–4) with two translated points, (2, 3) and (3, 4), marked as having equal density]

SLIDE 6

Density of water molecules as a function of (x, y) position in the plane. One sixth of the plane already predicts the function value everywhere.

Invariance under discrete rotation

SLIDE 7

Invariance under reflection

Solar elevation measured as a function of azimuth (for different days)


Left half already predicts right half

SLIDE 8

Invariance under permutation


[100, 200, 1, 1, 1]

SLIDE 9

Invariance under permutation


[1, 200, 1, 100, 1]

SLIDE 10


Invariance under permutation


f(100, 200, 1, 1, 1) = f(1, 200, 1, 100, 1)

Different inputs but same function value
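To make this concrete, here is a minimal Python sketch (my illustration, not from the slides): a symmetric function such as a sum gives the same value for every ordering of its inputs, while a generic order-dependent function does not.

```python
import itertools

import numpy as np

def symmetric_f(x):
    """Permutation-invariant: the output does not depend on input order."""
    return np.sum(x) + np.max(x)

def generic_f(x):
    """Not invariant: the output depends on where each value sits."""
    return float(np.dot(x, np.arange(1, len(x) + 1)))

x = np.array([100, 200, 1, 1, 1])
for perm in itertools.permutations(x):
    assert symmetric_f(np.array(perm)) == symmetric_f(x)  # same value for all 120 orderings

print(generic_f(np.array([100, 200, 1, 1, 1])))  # 512.0
print(generic_f(np.array([1, 200, 1, 100, 1])))  # 809.0 -- order matters here
```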

SLIDE 11

Invariance under permutation

E( [atoms ordered 1, 3, 2] ) = E( [atoms ordered 3, 1, 2] )

SLIDE 12

Discrete symmetries

SLIDE 13

Invariance under continuous transformations

Translation, rotation

SLIDE 14

Example: image classification

Class label as a function of image pixel matrix

Label( [image of a cat] ) = “cat”

SLIDE 18

Example: image classification

Class label as a function of image pixel matrix

Label( [image of the digit 8] ) = “8”

SLIDE 19

Example: molecular energy

E( [molecule] ) = E( [transformed copy of the molecule] )

SLIDE 20

Approximately invariant…

SLIDE 21

Approximately invariant…

[Figure: a rotated “6” looks like a “9”]

SLIDE 22
2. Why do we want to use invariances?

  • Incorporate prior knowledge about the behaviour of a system
  • Physical symmetries, e.g. modelling total energy (and gradients, i.e. forces) of a set of atoms
  • Helps generalisation
  • Improved accuracy vs number of training points

SLIDE 23

Toy example

SLIDE 27

Constructing invariant GPs

We want a prior over functions that obey the chosen symmetry.

Symmetrise the function. We can do this by
  a) an appropriate mapping to an invariant space
  b) a sum over transformations
(both sketched below).
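The formulas on this slide did not survive extraction; a plausible reconstruction for a finite symmetry group G, with g an arbitrary (GP-distributed) function and φ an invariant feature map, is:

```latex
\text{(a) mapping: } f(\mathbf{x}) = g\big(\varphi(\mathbf{x})\big)
  \quad\text{with } \varphi(\sigma(\mathbf{x})) = \varphi(\mathbf{x}) \;\;\forall \sigma \in G,
\qquad
\text{(b) sum: } f(\mathbf{x}) = \sum_{\sigma \in G} g\big(\sigma(\mathbf{x})\big).
```

Either way, f(σ(x)) = f(x) for every σ in G, so a GP prior over g induces a prior over invariant functions f.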

SLIDE 28

Permutation-invariant GPs: mapping construction

SLIDE 29

Permutation-invariant GPs: sum construction

SLIDE 30

Invariant sum kernel
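The slide's kernel formula was lost in extraction. If f(x) = Σ_σ g(σ(x)) with g ~ GP(0, k), then f is itself a GP with the double-sum kernel k_inv(x, x') = Σ_{σ,σ'} k(σ(x), σ'(x')). Below is a minimal numpy sketch of that construction (an illustration; the RBF base kernel and the two-element permutation group are assumed choices, not the slide's code).

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    """Squared-exponential base kernel between two points."""
    return float(np.exp(-0.5 * np.sum((a - b) ** 2) / lengthscale**2))

def invariant_sum_kernel(x, x2, transforms, base_kernel=rbf):
    """Double sum of the base kernel over all pairs of transformed inputs."""
    return sum(base_kernel(t1(x), t2(x2)) for t1 in transforms for t2 in transforms)

# The symmetry group: the two permutations of a 2D input.
perms_2d = [lambda v: v, lambda v: v[::-1]]

x = np.array([0.3, 1.7])
y = np.array([-0.5, 2.0])
# The kernel cannot tell an input apart from its permuted version:
assert np.isclose(invariant_sum_kernel(x, y, perms_2d),
                  invariant_sum_kernel(x[::-1], y, perms_2d))
```

Because every transformation of x appears in the sum, the kernel, and hence every GP sample drawn from it, treats x and its permuted version identically.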

SLIDE 31

Samples from the prior

SLIDE 32

How can we generalise this?

SLIDE 33

Symmetry group


Transformations can be composed. The set of all compositions of transformations forms a group; this group corresponds to the symmetries.

SLIDE 34

Orbit of x: all points reachable by transformations

Example: Permutation in 2D
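In symbols (the slide's own formula was lost; the orbit notation A(·) below is mine):

```latex
A(\mathbf{x}) = \{\sigma(\mathbf{x}) : \sigma \in G\},
\qquad
A\big((x_1, x_2)\big) = \{(x_1, x_2),\, (x_2, x_1)\} \quad \text{for 2D permutations}.
```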

SLIDE 35

Examples of orbits: permutation invariance


Orbit size = 2

SLIDE 36

Examples of orbits: six-fold rotation invariance


Orbit size = 6

SLIDE 37

Examples of orbits: permutation and six-fold rotation

Orbit size = 12

SLIDE 38

Examples of orbits: continuous rotation symmetry

Uncountably infinite

SLIDE 39

Orbit of a periodic function in 1D

Countably infinite

SLIDE 40

Constructing invariant GPs: sum revisited

SLIDE 41

Applications

SLIDE 42

Molecular modelling

Time evolution of the configuration (positions of all atoms) of a system of atoms/molecules.
We need the Potential Energy Surface (PES)!
Gradients = forces (easy with GPs).

SLIDE 43

Potential Energy Surface

SLIDE 44

Modelling Potential Energy Surface

  • Approximate as a sum over k-mers (many-body expansion)
  • Invariance to rotation/translation of the local environment/k-mer
  • Invariance under permutation of equivalent atoms

SLIDE 45

Modelling Potential Energy Surface

Many-body expansion, sum over k-mers:
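The expansion itself did not survive extraction; its standard form (with x_i the position of atom i, and the series truncated at small k) is roughly:

```latex
E(\mathbf{x}_1, \dots, \mathbf{x}_N)
\;\approx\; \sum_{i} E^{(1)}(\mathbf{x}_i)
\;+\; \sum_{i<j} E^{(2)}(\mathbf{x}_i, \mathbf{x}_j)
\;+\; \sum_{i<j<l} E^{(3)}(\mathbf{x}_i, \mathbf{x}_j, \mathbf{x}_l)
\;+\; \dots
```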

SLIDE 46

Modelling Potential Energy Surface

Invariance to rotation/translation of local environment/k-mer:

Map to interatomic distances
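A small numpy sketch of this mapping (illustrative, not the slide's code): describing a k-mer by its pairwise interatomic distances removes the dependence on where the k-mer sits and how it is oriented.

```python
import numpy as np

def interatomic_distances(positions):
    """Map atom coordinates of shape (k, 3) to the vector of pairwise distances,
    which is unchanged when the whole k-mer is rotated or translated."""
    diffs = positions[:, None, :] - positions[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    iu = np.triu_indices(len(positions), k=1)
    return dists[iu]

kmer = np.random.randn(3, 3)                       # a 3-mer in 3D
angle = 0.7
rotation = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                     [np.sin(angle),  np.cos(angle), 0.0],
                     [0.0, 0.0, 1.0]])             # rotation about the z-axis
translation = np.array([1.0, -2.0, 0.5])
transformed = kmer @ rotation.T + translation
assert np.allclose(interatomic_distances(kmer), interatomic_distances(transformed))
```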

SLIDE 47

Modelling Potential Energy Surface

Invariance under permutation of equivalent atoms: sum over them!

SLIDE 48

How can we find out if an invariance is helpful?

  • As usual (like another kernel hyperparameter): marginal likelihood
  • Unlike the “regular” likelihood (equivalent to training-set RMSE):
    • Less overfitting
    • Related to generalisation

SLIDE 49

Marginal likelihood and generalisation

Measures how well part of the training set predicts the remaining training points, i.e. how accurately the model generalises during inference; similar to cross-validation, but differentiable.
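One way to see this is the standard sequential decomposition of the log marginal likelihood (not shown on the slide, but it underlies the statement):

```latex
\log p(\mathbf{y} \mid X)
= \sum_{n=1}^{N} \log p\big(y_n \mid y_1, \dots, y_{n-1}, X\big)
```

Each term scores how well the model, conditioned on part of the training data, predicts the next point, which is why a high marginal likelihood behaves like a differentiable analogue of cross-validation.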

SLIDE 50


Marginal likelihood

SLIDE 51

Summary: we have seen…

How to constrain GPs to give invariant functions
When invariance improves a model's generalisation
When invariance increases the marginal likelihood
That invariances exist in real-world problems

SLIDE 52

Questions?

Next up: how to learn invariances…

SLIDE 53

Snowflake prior

SLIDE 54

Why not just data augmentation?

Data augmentation is used in deep learning… but invariances in the prior are better:
1. Cubic scaling with the number of data points (for augmentation) vs linear scaling with invariances in the prior
2. Data augmentation gives the same predictive mean, but not the same predictive variance
3. Invariances in the GP prior give us invariant samples

SLIDE 55

Learning Invariances

with the Marginal Likelihood

Mark van der Wilk

PROWLER.io

SLIDE 56

We discussed…

1. How to constrain GPs to give invariant functions
2. When invariance increases the marginal likelihood
3. When invariance improves a model's generalisation
4. That invariances exist in real-world problems

SLIDE 57

From known invariances to learning them

We previously saw that known invariances were useful for modelling.

  • How do we exploit invariances in a problem if we don't know them a priori?
  • Can we learn a useful invariance from the data?
SLIDE 58

Model selection

  • Invariances in a GP are expressed in the kernel
  • We use the marginal likelihood to select models
  • Parameterising the orbit is all that is left
SLIDE 59

Parameterising orbits is hard

Strict invariance requires f(σ(x)) = f(x) for all σ in the symmetry group G, which we can obtain using the construction f(x) = Σ_{σ ∈ G} g(σ(x)).

I don't know how to parameterise orbits!

SLIDE 60

From orbits to distributions

  • We sum over an arbitrary set of points
  • Take the infinite limit
  • Find the resulting kernel (see the sketch below)

I do know how to parameterise distributions!
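A rough numpy sketch of this idea (my own illustration, with a simple Gaussian "jitter" distribution p_θ(x_a | x) standing in for a learned augmentation distribution): the resulting kernel is the expectation of a base kernel over augmented inputs, k(x, x') = E[k_base(x_a, x_b)] with x_a ~ p_θ(·|x) and x_b ~ p_θ(·|x'), which we can estimate by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(a, b, lengthscale=1.0):
    """Squared-exponential base kernel, applied row-wise to two sample arrays."""
    return np.exp(-0.5 * np.sum((a - b) ** 2, axis=-1) / lengthscale**2)

def sample_augmentation(x, scale, num_samples):
    """Samples from an illustrative augmentation distribution p_theta(x_a | x):
    Gaussian jitter around x, with the scale playing the role of theta."""
    return x + scale * rng.standard_normal((num_samples, x.shape[-1]))

def insensitive_kernel_mc(x, x2, scale=0.3, num_samples=1000):
    """Monte Carlo estimate of k(x, x') = E[k_base(x_a, x_b)],
    with x_a ~ p_theta(.|x) and x_b ~ p_theta(.|x')."""
    xa = sample_augmentation(x, scale, num_samples)
    xb = sample_augmentation(x2, scale, num_samples)
    return float(np.mean(rbf(xa, xb)))

print(insensitive_kernel_mc(np.array([0.0, 1.0]), np.array([0.2, 0.9])))
```

The parameters of p_θ (here just the jitter scale) become kernel hyperparameters, so they can be learned by optimising an approximation to the marginal likelihood.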

SLIDE 61

Insensitivity

  • We lose exact invariance… but this may be a blessing!

SLIDE 63

What we will do

  • Parameterise a distribution that describes the insensitivity
  • Use this distribution to define a kernel
  • Find invariance in the kernel by optimising the hyperparameters
SLIDE 64

Obstacles to inference

1. For large datasets, the matrix operations on K_ff become infeasible (O(N³) time complexity)
2. We may have non-Gaussian likelihoods (classification!)
3. We can't even evaluate the kernel!

SLIDE 65

Variational inference

1. For large datasets, the matrix operations on K_ff become infeasible (O(N³) time complexity)
2. We may have non-Gaussian likelihoods (classification!)
3. We can't even evaluate the kernel! (it is still needed for K_uu and k_un)

SLIDE 66

Interdomain inducing variables

  • Variational posterior is constructed by conditioning
  • Gaussian conditioning requires covariances

cov(u_m, f(x_n)) and cov(u_m, u_m′)

SLIDE 70

Unbiased estimation of the kernel

Unbiased estimates of μ_n, μ_n², and σ_n² give an unbiased estimate of the ELBO!

SLIDE 71

Unbiased estimation of the kernel

SLIDE 72

Unbiased estimation of the kernel

(We only need to sample one set from p_θ(x_a | x), see paper for details)

SLIDE 73

What we did

  • Parameterise a distribution that describes the insensitivity
  • Use this distribution to define a kernel
  • Approximate the marginal likelihood using the variational evidence lower bound (ELBO)
  • Find an unbiased ELBO approximation, using unbiased estimates of the kernel
  • Optimise the hyperparameters, using the gradients of the ELBO
SLIDE 74

Results

  • A single model tunes itself automatically to multiple datasets
  • Fire off the optimisation and watch it go

[Results shown on MNIST and Rotated MNIST]

SLIDE 75

Conclusions & outlook

  • We can parameterise invariant kernels
  • We can learn the parameters with a marginal likelihood approximation
  • Learned invariances improve generalisation