

SLIDE 1

Uncertainty in compositional models of alignment

Ieva Kazlauskaite, University of Bath
Neill D. F. Campbell, University of Bath
Carl Henrik Ek, University of Bristol
Ivan Ustyuzhaninov, University of Tübingen
Tom Waterson, Electronic Arts

September 2019

SLIDE 2

Motivation

Data:

  • Motion capture sequences, e.g. a jump or a golf swing.
  • Each motion corresponds to a different style or mood.

Goal: Generate new motions by interpolating between the captured clips.
Pre-processing: The clips need to be temporally aligned.

SLIDE 3

Motivation

Assume we are given some time-series data with inputs x ∈ R^N and J output sequences {yj ∈ R^N}. We know that there are multiple underlying functions that generated this data, say K such functions fk(·), and that the observed data was generated by warping the inputs to the true functions using some warping function gj(x), such that:

yj = fk(gj(x)) + noise   (1)

There are two groups of unknowns, to be found automatically: the warps gj and the latent functions fk.
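
To make Eq. (1) concrete, here is a minimal numpy sketch of the generative process; the two latent shapes (a sine and a triangle wave) and the power-law warps gj(x) = x^a are hypothetical choices for illustration, not the data used in the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
N, J, K = 100, 6, 2
x = np.linspace(0, 1, N)                # shared, evenly spaced inputs

# Two hypothetical latent functions f_k (the "true" cluster shapes).
f = [lambda t: np.sin(2 * np.pi * t),
     lambda t: 1.0 - 2.0 * np.abs(t - 0.5)]

Y = np.zeros((J, N))
for j in range(J):
    k = j % K                           # ground-truth cluster of sequence j
    a = rng.uniform(0.5, 2.0)           # monotonic power-law warp g_j(x) = x**a
    Y[j] = f[k](x ** a) + 0.05 * rng.standard_normal(N)   # y_j = f_k(g_j(x)) + noise
```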

SLIDE 4

Motivation

Unknowns:

  • Number of underlying functions K
  • Underlying functions fk(·)
  • Warps gj(·) for each sequence

[Figure: the observed data]

SLIDE 5

Motivation

Let’s try to find K using K-means clustering:

[Figures: K-means initialisation with 2 clusters; K-means initialisation with 3 clusters]

SLIDE 6

Motivation

K-means clustering vs. correct labels:

[Figures: K-means initialisation with 2 clusters; K-means initialisation with 3 clusters; correct clustering of inputs]

SLIDE 7

Motivation

A PCA scatter plot of the data:

[Figure: PCA initialisation with correct labels]
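
A short scikit-learn sketch of these diagnostics on synthetic sequences generated as in Eq. (1) (the shapes and warps are again illustrative assumptions); because the random warps dominate the Euclidean distances between sequences, both the K-means labels and the PCA coordinates tend to reflect the warps rather than the underlying functions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)
# Six sequences from two underlying shapes, each under a random power-law warp.
Y = np.stack([np.sin(2 * np.pi * x ** rng.uniform(0.5, 2.0)) if j % 2 == 0
              else 1.0 - 2.0 * np.abs(x ** rng.uniform(0.5, 2.0) - 0.5)
              for j in range(6)])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Y)
coords = PCA(n_components=2).fit_transform(Y)   # 2-D scatter of the sequences
```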

SLIDE 8

Alignment model

Three constituent parts:

  • Model of transformations (warps), gj
  • Model of sequences, fk
  • Alignment objective

SLIDE 9

Model of transformations (warps)

[Figures: observed sequences; an example warp x ↦ g(x)]

  • Parametric warps, with weights satisfying Σ_{i∈I} wi = 1, wi ≥ 0 ∀ i ∈ I (see the sketch below).
  • Nonparametric warps, for example monotonic GPs.

In general, we prefer warps that are close to the identity.

Riihimäki & Vehtari. Gaussian Processes with Monotonicity Information (2010)
K. et al. Monotonic Gaussian Process Flow (2019)
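
As a concrete instance of the parametric option, here is a minimal numpy sketch of one common construction, an assumption for illustration rather than necessarily the talk's exact parameterisation: a piecewise-linear warp whose softmax-normalised increments satisfy wi ≥ 0 and Σi wi = 1, making it monotonic with g(0) = 0 and g(1) = 1:

```python
import numpy as np

def parametric_warp(x, w):
    """Piecewise-linear monotonic warp on [0, 1].

    The unconstrained parameters w are softmax-normalised into increments
    w_i >= 0 with sum_i w_i = 1, so the warp is monotonic and endpoint-preserving.
    """
    inc = np.exp(w - w.max())
    inc /= inc.sum()
    knots = np.concatenate([[0.0], np.cumsum(inc)])
    grid = np.linspace(0.0, 1.0, len(knots))
    return np.interp(x, grid, knots)

x = np.linspace(0, 1, 100)
g = parametric_warp(x, np.random.default_rng(1).standard_normal(10))  # example warp
```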

SLIDE 10

Model of sequences

Option 1: interpolate sequences using linear interpolation or splines.
Option 2: fit GPs to the sequences.

  • principled way to handle observational noise
  • can impose priors on fk

[Figures: observed sequences; GP regression fits]
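
A minimal sketch of Option 2 using scikit-learn; the RBF kernel and initial noise level are illustrative assumptions:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)[:, None]
y = np.sin(2 * np.pi * x[:, 0]) + 0.05 * rng.standard_normal(100)

# RBF prior over f_k plus a WhiteKernel that learns the observation noise.
gp = GaussianProcessRegressor(kernel=RBF(0.1) + WhiteKernel(0.01)).fit(x, y)
mean, std = gp.predict(x, return_std=True)
```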

SLIDE 11

Notation

Assume that the observed data was generated as:

yj = fk(gj(x)) + εj,  εj ∼ N(0, βj⁻¹)   (2)

where x are fixed, linearly spaced input locations (or evenly sampled time). Then the corresponding aligned sequences are:

sj := fk(x)   (3)

The joint conditional likelihood is:

p([sj, yj]ᵀ | Gj, Xj, θj) ∼ N(0, [kθj(X, X), kθj(X, Gj); kθj(Gj, X), kθj(Gj, Gj) + βj⁻¹ I])   (4)
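
A small numpy sketch of the joint covariance in Eq. (4), assuming a squared-exponential kernel and a toy power-law warp; note that the noise precision βj enters only the observed block:

```python
import numpy as np

def k_rbf(a, b, ls=0.1):
    """Squared-exponential kernel k_theta on two 1-D input sets."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

x = np.linspace(0, 1, 50)     # evenly spaced inputs X
g = x ** 1.5                  # an example warp G_j = g_j(X)
beta = 100.0                  # noise precision beta_j

# Joint covariance of [s_j; y_j]: noise is added only where y_j is observed.
K = np.block([[k_rbf(x, x), k_rbf(x, g)],
              [k_rbf(g, x), k_rbf(g, g) + np.eye(len(x)) / beta]])
```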

SLIDE 12

Model of sequences

[Diagram: pseudo-observations S at evenly spaced inputs X; observations Y at warped inputs g(X)]

Then the goal is to:

  • Fit GPs to the observations and pseudo-observations {[g(X), X], [Y, S]} for each sequence
  • Impose the alignment constraint on the pseudo-observations {X, S}

SLIDE 13

Alignment objective

We want an alignment objective that:

  • infers the number of clusters (underlying functions) K
  • aligns the sequences within these clusters

We aim to design a clustering or dimensionality-reduction objective that is invariant to the transformations (warps) of the inputs.

SLIDE 14

Pairwise distance alignment objective

Minimise the pairwise distance between all sequences (irrespective of the underlying clusters of functions):

L = Σ_{n=1}^{J} Σ_{m=n+1}^{J} ‖sn(x) − sm(x)‖²   (5)

[Figures: warps (complexity: 1.845); aligned functions (alignment error: 1.735)]
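
Eq. (5) translates directly into code; a minimal numpy version, where S holds the J aligned sequences as rows:

```python
import numpy as np

def pairwise_alignment_loss(S):
    """Sum of squared distances between all pairs of aligned sequences, Eq. (5)."""
    J = S.shape[0]
    return sum(np.sum((S[n] - S[m]) ** 2)
               for n in range(J) for m in range(n + 1, J))

loss = pairwise_alignment_loss(np.random.default_rng(0).standard_normal((4, 100)))
```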

SLIDE 15

Traditional GP-LVM

  • Observe high-dimensional data S.
  • Find a low-dimensional representation Z that captures the structure of S.
  • Find a mapping h from Z to S.

[Diagram: latent space Z, mapping h, inputs S]

sj = h(zj, θ) + noise, where θ are the parameters of h.

SLIDE 16

Traditional GP-LVM

In a GP-LVM, the GPs are taken to be independent across the features, and the likelihood function is:

p(S | x) = Π_{d=1}^{D} p(sd | x) = Π_{d=1}^{D} N(sd | 0, K + γ⁻¹ I)   (6)

[Figures: observed data Y in matrix form; aligned data S in matrix form]
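
A minimal numpy/scipy sketch of Eq. (6), treating the D feature columns of S as independent zero-mean GP draws; K and gamma stand for the kernel matrix on the latent inputs and the noise precision:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gplvm_loglik(S, K, gamma):
    """log p(S | x) under Eq. (6): independent GPs across the D features."""
    cov = K + np.eye(K.shape[0]) / gamma
    return sum(multivariate_normal.logpdf(S[:, d], cov=cov)
               for d in range(S.shape[1]))
```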

SLIDE 17

GP-LVM as alignment objective

We impose the alignment objective by learning a low-dimensional representation Z of the pseudo-observations S:

L_GP-LVM = log p(S | Z, θh, θz, β)
         = −(N/2) log|Kzz| − (1/2) Tr(Kzz⁻¹ S Sᵀ) + log p(Z | θz) + log p(θh) + const   (7)

where −(N/2) log|Kzz| is the complexity term, −(1/2) Tr(Kzz⁻¹ S Sᵀ) is the data-fitting term, log p(Z | θz) is the prior over the latent variables, and log p(θh) is the prior over the GP mappings.

As an alignment objective, it is controlled by:

  1. the prior over the latent variables Z, p(Z) ∼ N(0, θz I)
  2. the lengthscale in the GP-LVM mapping (part of θh)

SLIDE 18

Aside: Pairwise distance alignment objective

[Figures: observations and pairwise-distance solutions for γ = 0.000, 0.100, 0.464, 2.154, 10.000]

yi^transformed = yi^input + wi, with yi, wi ∈ R^8 and penalty γ‖w‖², i = 1, 2, 3, 4

SLIDE 19

Aside: GP-LVM as alignment objective

[Figures: observations and GP-LVM solutions for γ = 0.000, 0.100, 0.464, 2.154, 10.000]

yi^transformed = yi^input + wi, with yi, wi ∈ R^8 and penalty γ‖w‖², i = 1, 2, 3, 4

SLIDE 20

Aside: Bayesian Mixture Model as alignment objective

[Figures: observations, and transformed sequences with cluster assignments (1, 2, 3), for γ = 0.000, 0.100, 0.464, 2.154, 10.000]

yi^transformed = yi^input + wi, with yi, wi ∈ R^8 and penalty γ‖w‖², i = 1, 2, 3, 4

SLIDE 21

Full objective for sequence alignment

  1. For each of the J sequences, perform standard GP regression on the observed data yj and the pseudo-observations sj by learning the hyperparameters of the GPs and the parameters of the warps.
  2. Impose the alignment objective on the pseudo-observations S.

The sum of the log-likelihoods is:

L = Σ_{j=1}^{J} L_GPj + L_GP-LVM + Σ_{j=1}^{J} log p(gj)
  = Σ_{j=1}^{J} log p([sj, yj]ᵀ | x, gj, θj, βj) + L_GP-LVM(Z, ψh, ψz, γ) + Σ_{j=1}^{J} log p(gj)   (8)
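
Schematically, Eq. (8) just adds up the pieces sketched on the earlier slides. The skeleton below is a hedged sketch of that composition, with the three terms passed in as callables rather than tied to any particular implementation:

```python
def alignment_objective(S, Y, warps, Z, joint_gp_loglik, gplvm_loglik, log_warp_prior):
    """Eq. (8): per-sequence GP fits + GP-LVM alignment term + warp priors."""
    fit = sum(joint_gp_loglik(S[j], Y[j], warps[j]) for j in range(len(warps)))
    return (fit
            + gplvm_loglik(S, Z)                        # L_GP-LVM
            + sum(log_warp_prior(g) for g in warps))    # sum_j log p(g_j)
```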

SLIDE 22

Results on ECG data

Input data and alignment with the GP-LVM objective:

[Figures: warps (complexity: 6.447); aligned functions (alignment error: 0.411); manifold locations]

SLIDE 23

Competing objectives and joint model

[Graphical model relating Y, f, g, X, S, h and Z, with noise precisions γ and β, plate over the J sequences]

SLIDE 24

Competing objectives and joint model

[Graphical model relating Y, f, g, X, S, h and Z, with noise precisions γ and β, plate over the J sequences]

Likelihood p(S | H, FX) as an equal mixture (where Sj and Sn refer to the rows and columns of S):

p(S | H, FX) = (1/2) [ Π_n N(Sn | Hn, γ⁻¹ IJ) + Π_j N(Sj | FXj, βj⁻¹ IN) ]

SLIDE 25

Multi-task learning and matrix distributions

Given data Y ∈ R^{J×N}:

  1. Each sequence (row) has a GP prior, and a free-form matrix C models the covariances between the sequences¹.
  2. Learn a sparse inverse covariance between features while accounting for a low-rank confounding covariance between samples using a GP-LVM²:

p(Y | R, C⁻¹) = N(vec(Y) | 0_{N×D}, C ⊗ R + σ² I_{N×D})   (9)

¹ Bonilla et al. Multi-task Gaussian Process Prediction (2008)
² Stegle et al. Efficient Inference in Matrix-variate Gaussian Models with iid Observation Noise (2011)
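
The Kronecker structure of Eq. (9) is straightforward to assemble in numpy; the covariances below are toy placeholders:

```python
import numpy as np

N, D = 20, 5
# Sample (row) covariance R, here an RBF kernel on an integer grid.
R = np.exp(-0.5 * (np.subtract.outer(np.arange(N), np.arange(N)) / 3.0) ** 2)
# Free-form column covariance C between the features.
C = 0.5 * np.eye(D) + 0.5 * np.ones((D, D))

Sigma = np.kron(C, R) + 0.01 * np.eye(N * D)   # covariance of vec(Y) in Eq. (9)
```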

SLIDE 26

More generally...

These types of constructions are useful when:

  1. The data has a hierarchical structure with additional constraints:
     yj = fk(gj(x)) + εj,  εj ∼ N(0, βj⁻¹)
  2. We want to perform dimensionality reduction or clustering that is invariant to a specific transformation.

SLIDE 27

Uncertainty in alignment model

SLIDE 28

Uncertainty in alignment model

While the alignment model is probabilistic, so far we have only considered point estimates and ignored the uncertainties associated with the warpings and the group assignments. Uncertainty in the alignment model arises because:

  1. the observed sequences are often noisy,
  2. the warpings are uncertain,
  3. the assignment of sequences to groups is ambiguous.

SLIDE 29

Uncertainty in alignment model

[Figures: true warps; input sequences]

SLIDE 30

Going beyond the point estimates of the warps

  • So far we have been computing point estimates of the warps (by optimising Gj directly).
  • To model warping uncertainty, we developed a nonparametric model² of monotonic warps based on the Gaussian process differential flow model¹.

[Figure: random samples from the prior]

¹ Hegde et al. Deep Learning with Differential Gaussian Process Flows (2019)
² K. et al. Monotonic Gaussian Process Flow (2019)

SLIDE 31

Fully probabilistic model - Mean-field

  • The composition of a warp (g-function) and a GP (f-function) is similar to a two-layer DGP.
  • Exact inference is also intractable, so we augment both layers with inducing points {Ug} and {Uf}.
  • The inducing points effectively define the mappings in each layer. If they are independent, the mappings do not match each other to fit the observations.

[Figures: observations; layer 1; layer 2]

Aside: Girard et al. Gaussian Process Priors With Uncertain Inputs (2013)

SLIDE 32

Beyond mean-field variational distribution

Use the optimal distribution of inducing points¹. Two components of the variational distribution:

  1. A free-form variational distribution q({Ug}) for the inducing points of the warp.
  2. For a given output G of the warp, we define q({Uf}) to be the optimal variational distribution¹ of the inducing points in a GP mapping G to the observations.

¹ M. Titsias. Variational Learning of Inducing Variables in Sparse Gaussian Processes (2009)

SLIDE 33

Beyond mean-field variational distribution

Use the optimal distribution of inducing points. Fitting the model:

  1. Sample {Ug} ∼ q({Ug}).
  2. Conditioned on this sample, sample (again) the output of the warps, G ∼ p(G | {Ug}).
  3. Conditioned on G, compute the optimal distribution of inducing points q({Uf}) and the likelihood p(Y | G) = ∫ p(Y | G, {Uf}) q({Uf}) dUf.

The only variational parameters to optimise are those of q({Ug}), which we can do by maximising p(Y | G) (using the reparametrisation trick).

Salimbeni & Deisenroth. Doubly Stochastic Variational Inference for Deep Gaussian Processes (2017)
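
A heavily simplified numpy sketch of steps 1 and 2 of this procedure, with a toy kernel and toy variational parameters, the monotonicity of the warp ignored, and step 3 (the closed-form optimal q({Uf}) of Titsias) left as a comment, purely to show the control flow:

```python
import numpy as np

rng = np.random.default_rng(0)

def k_rbf(a, b, ls=0.2):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

x = np.linspace(0, 1, 30)                 # warp-layer inputs
z = np.linspace(0, 1, 5)                  # inducing inputs of the warp layer
mu, L = np.zeros(5), 0.1 * np.eye(5)      # parameters of q({U_g})

# Step 1: sample inducing outputs via the reparametrisation trick.
u_g = mu + L @ rng.standard_normal(5)

# Step 2: conditioned on u_g, take the warp outputs G ~ p(G | {U_g});
# for brevity we use the conditional mean instead of a full sample.
Kzz = k_rbf(z, z) + 1e-6 * np.eye(5)
G = k_rbf(x, z) @ np.linalg.solve(Kzz, u_g)

# Step 3 (omitted): given G, compute Titsias' optimal q({U_f}) in closed
# form and evaluate the collapsed likelihood p(Y | G).
```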

SLIDE 34

2-layer DGP

Consider a 2-layer DGP where the first layer is monotonic:

[Figures: overall fit, warpings, and fit in warped coordinates, for the base model vs. the model with optimal inducing points]

SLIDE 35

Thank you

  • I. Kazlauskaite, C. H. Ek, N. D. F. Campbell. Gaussian Process Latent Variable Alignment Learning. AISTATS (2019)
  • I. Kazlauskaite, I. Ustyuzhaninov, C. H. Ek, N. D. F. Campbell. Sequence Alignment with Dirichlet Process Mixtures. Bayesian Nonparametrics Workshop at NIPS (2018)
  • I. Ustyuzhaninov*, I. Kazlauskaite*, C. H. Ek, N. D. F. Campbell. Monotonic Gaussian Process Flow. arXiv (2019)
  • I. Ustyuzhaninov*, I. Kazlauskaite*, M. Kaiser, E. Bodin, C. H. Ek, N. D. F. Campbell. Compositional Uncertainty in Deep Gaussian Processes. arXiv (2019)