AdaGeo: Adaptive Geometric Learning for Optimization and Sampling - PowerPoint PPT Presentation

SLIDE 1

AdaGeo: Adaptive Geometric Learning for Optimization and Sampling

Gabriele Abbati [1], Alessandra Tosi [2], Seth Flaxman [3], Michael A. Osborne [1]

[1] University of Oxford, [2] Mind Foundry Ltd, [3] Imperial College London

Afternoon Meeting on Bayesian Computation 2018, University of Reading

SLIDE 2

High-dimensional Problems

  • Gradient-based optimization
  • MCMC Sampling

Issues arising from high dimensionality:

  • non-convexity
  • strong correlations
  • multimodality

SLIDE 3

Related Work

Gradient-based optimization:

  • AdaGrad
  • AdaDelta
  • Adam
  • RMSProp

MCMC sampling:

  • Hamiltonian Monte Carlo
  • Particle Monte Carlo
  • Stochastic gradient Langevin dynamics

All of these methods focus on computing clever updates for optimization algorithms or for Markov chains.

Novelty: to the best of our knowledge, no dimensionality reduction approaches were applied in this direction before.

SLIDE 4

The Manifold Idea

After t steps of optimization or sampling, we assume the obtained points in the parameter space lie on a manifold. We then feed them to a dimensionality reduction method to find a lower-dimensional representation.

3D example: if the sampler/optimizer algorithm keeps on returning proposals on a sphere surface, that information might be used to our advantage.

Can we perform better if the algorithm acts with knowledge of the manifold?

SLIDE 5

Latent Variable Models

Latent variable models describe a set Θ through a lower-dimensional latent set Ω:

Θ = {θ_1, . . . , θ_N ∈ R^D},   Ω = {ω_1, . . . , ω_N ∈ R^Q},   linked by a map f, with Q < D,

where:

  • θ: observed variables/parameters
  • ω: latent variables
  • f: mapping
  • D, Q: dimensionalities of Θ and Ω respectively
SLIDE 6

Latent Variable Models

Latent variable models describe a set Θ through a lower-dimensional latent set Ω:

Θ = {θ_1, . . . , θ_N ∈ R^D},   Ω = {ω_1, . . . , ω_N ∈ R^Q},   with Q < D.

Mapping: θ = f(ω) + η,   with η ∼ N(0, β^{-1} I)

Dimensionality reduction as manifold identification: the lower-dimensional manifold on which the samples lie is characterized through the latent set.

SLIDE 7

Gaussian Process Latent Variable Model

The choice of dimensionality reduction method fell on the Gaussian Process Latent Variable Model [1].

GPLVM: a Gaussian Process prior is placed over the mapping f in θ = f(ω) + η.

Motivation:

  • analytically sound mathematical tool
  • full distribution over the mapping f
  • full distribution over the derivatives of the mapping f

[1] Lawrence, N., Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research (2005)
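As an illustrative aside (not part of the original slides): a minimal sketch of fitting a GPLVM to a set of collected parameter vectors, assuming the GPy library and its GPLVM model; the data, kernel choice and dimensionalities below are placeholders, not the authors' setup.

```python
import numpy as np
import GPy  # assumption: the GPy library is available and exposes a GPLVM model

# Theta: (N, D) array of points collected in parameter space; Q: latent dimensionality
Theta = np.random.randn(50, 20)   # placeholder data for illustration
Q = 3

# GPLVM with an ARD RBF kernel; optimizing fits both the latent points and the kernel
kernel = GPy.kern.RBF(input_dim=Q, ARD=True)
model = GPy.models.GPLVM(Theta, input_dim=Q, kernel=kernel)
model.optimize(messages=False)

Omega = np.array(model.X)          # (N, Q) learned latent representation of Theta
```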

SLIDE 8

Gaussian Process

Gaussian Process [2]: a collection of random variables, any finite number of which have a joint Gaussian distribution. If a real-valued stochastic process f is a GP, it is denoted as

f(·) ∼ GP(m(·), k(·, ·))

A Gaussian Process is fully specified by a mean function m(·) and a covariance function k(·, ·), where

m(ω) = E[f(ω)],   k(ω, ω′) = E[(f(ω) − m(ω))(f(ω′) − m(ω′))]

[Figure: GP regression example, showing training data, the GP regression fit, and the real function]

[2] Rasmussen, C. E., Williams, C. K. I., Gaussian Processes for Machine Learning, The MIT Press (2006)
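To make the definition concrete, a minimal NumPy sketch (an addition, not from the slides) of drawing sample functions from a zero-mean GP prior with an RBF covariance; the lengthscale and the evaluation grid are illustrative choices.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """RBF covariance k(x, x') = variance * exp(-||x - x'||^2 / (2 * lengthscale^2))."""
    sq_dists = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

# Evaluation grid in a 1-D input space
omega_grid = np.linspace(-3, 3, 100)[:, None]

# Zero mean function and RBF covariance fully specify the GP
K = rbf_kernel(omega_grid, omega_grid)

# Draw three sample functions from the prior f(.) ~ GP(0, k); small jitter for stability
samples = np.random.multivariate_normal(
    np.zeros(len(omega_grid)), K + 1e-8 * np.eye(len(omega_grid)), size=3)
```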

SLIDE 9

Gaussian Process Latent Variable Model

GPLVM: Gaussian Process prior over the mapping f in θ = f(ω) + η.

The likelihood of the data Θ given the latent Ω is obtained by:

1 marginalizing the mapping
2 optimizing the latent variables

Resulting likelihood:

p(Θ | Ω, β) = ∏_{j=1}^{D} N(θ_{:,j} | 0, K + β^{-1} I) = ∏_{j=1}^{D} N(θ_{:,j} | 0, K̃)

with the resulting noise model being

θ_{i,j} = K̃(ω_i, Ω) K̃^{-1} Θ_{:,j} + η_j
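A minimal sketch of the noise model above, i.e. the posterior mean of the mapping at new latent points; it reuses the rbf_kernel helper from the previous sketch, and beta and the lengthscale are illustrative assumptions rather than fitted values.

```python
import numpy as np

def gplvm_predict_mean(omega_star, Omega, Theta, beta=100.0, lengthscale=1.0):
    """Posterior mean of the GPLVM mapping at latent points omega_star (n_star, Q):
    theta_hat = K(omega_star, Omega) @ inv(K(Omega, Omega) + beta^-1 I) @ Theta."""
    K = rbf_kernel(Omega, Omega, lengthscale) + (1.0 / beta) * np.eye(len(Omega))  # K-tilde
    K_star = rbf_kernel(omega_star, Omega, lengthscale)                            # cross-covariance
    return K_star @ np.linalg.solve(K, Theta)                                      # (n_star, D)
```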

SLIDE 10

Gaussian Process Latent Variable Model

GPLVM: Gaussian Process prior over the mapping f in θ = f(ω) + η.

For differentiable kernels k(·, ·), the Jacobian J of the mapping f can be computed analytically:

J_{ij} = ∂f_i / ∂ω_j

But, as previously said, the GPLVM can yield the full (Gaussian) distribution over the Jacobian. If the rows of J are assumed to be independent:

p(J | Ω, β) = ∏_{i=1}^{D} N(J_{i,:} | µ_{J_{i,:}}, Σ_J)
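A minimal sketch of the mean µ_J of this Jacobian distribution, obtained by differentiating the posterior mean of the GP mapping; it assumes an RBF kernel with unit variance and reuses the rbf_kernel and noise-model conventions from the earlier sketches.

```python
import numpy as np

def gplvm_jacobian_mean(omega_star, Omega, Theta, beta=100.0, lengthscale=1.0):
    """Mean of the Jacobian J_ij = df_i/dw_j of the GPLVM mapping at a single
    latent point omega_star (shape (Q,)), for an RBF kernel."""
    K = rbf_kernel(Omega, Omega, lengthscale) + (1.0 / beta) * np.eye(len(Omega))
    alpha = np.linalg.solve(K, Theta)                                        # (N, D)
    k_star = rbf_kernel(omega_star[None, :], Omega, lengthscale).ravel()     # (N,)
    # d k(omega_star, omega_n) / d omega_star = -(omega_star - omega_n) / l^2 * k
    dK = -(omega_star[None, :] - Omega) / lengthscale**2 * k_star[:, None]   # (N, Q)
    return alpha.T @ dK                                                      # (D, Q) mean Jacobian mu_J
```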

SLIDE 11

Recap

1 After t iterations, the optimization or sampling algorithm has yielded a set of observed points Θ = {θ_1, . . . , θ_N ∈ R^D} in the parameter space.

2 A GPLVM is trained on Θ in order to build a latent space Ω that describes the lower-dimensional manifold on which the optimization/sampling is allegedly taking place.

We can (see the sketch after this list):

  • move from the latent space Ω to the observed space Θ (Θ ← Ω):

    θ = f(ω) + η

    but not vice versa (f is not invertible);

  • bring the gradients of a generic function g : Θ → R from the observed space Θ to the latent space Ω (Ω ← Θ):

    ∇_ω g(f(ω)) = µ_J ∇_θ g(θ)

    where a point estimate of J is given by the mean of its distribution.
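Putting the two directions together, a short illustrative sketch that reuses the hypothetical helpers from the earlier code; omega, Omega, Theta and grad_g stand for a current latent point, the trained latent/observed sets, and a user-supplied gradient of g.

```python
# Latent -> observed: map a latent point through the (mean of the) GPLVM mapping
theta = gplvm_predict_mean(omega[None, :], Omega, Theta).ravel()   # (D,)

# Observed -> latent: pull the gradient of g back using the mean Jacobian mu_J
grad_theta = grad_g(theta)                                          # (D,) gradient in parameter space
mu_J = gplvm_jacobian_mean(omega, Omega, Theta)                     # (D, Q)
# chain rule with J_ij = df_i/dw_j: the pullback is J^T grad_theta (written mu_J grad_theta on the slide)
grad_omega = mu_J.T @ grad_theta                                    # (Q,) gradient in latent space
```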

SLIDE 15

AdaGeo Gradient-based Optimization

Minimization problem:   θ* = arg min_θ g(θ)

Iterative scheme solution (e.g. (stochastic) gradient descent):   θ_{t+1} = θ_t − ∆θ_t(∇_θ g)

We propose, after having learned a latent representation with the GPLVM, to move the problem onto the latent space Ω.

Minimization problem:   ω* = arg min_ω g(f(ω))

Iterative scheme solution (e.g. (stochastic) gradient descent):   ω_{t+1} = ω_t − ∆ω_t(∇_ω g)


SLIDE 17

AdaGeo Gradient-based Optimization

Algorithm 1 AdaGeo gradient-based optimization (minimization)

1: while convergence is not reached do
2:    Perform T_θ iterations with classic updates on the parameter space Θ:
          ∆θ_t = ∆θ_t(∇_θ g(θ)),   θ_{t+1} = θ_t − ∆θ_t
3:    Train the GP-LVM model on the samples Θ = {θ_1, . . . , θ_{T_θ}}
4:    Continue performing T_ω iterations using the AdaGeo optimizer:
          ∆ω_t = ∆ω_t(∇_ω g(f(ω))),   ω_{t+1} = ω_t − ∆ω_t
      moving back to the parameter space with θ_{t+1} = f(ω_{t+1})
5: end while
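A compact sketch of the control flow of Algorithm 1, under simplifying assumptions: plain gradient descent in both phases, a fixed step size, a fixed number of outer rounds instead of a convergence test, and the hypothetical GPLVM helpers from the earlier sketches (fit_gplvm_latents is likewise a placeholder). It is an illustration, not the authors' implementation.

```python
import numpy as np

def adageo_minimize(grad_g, theta0, T_theta=20, T_omega=30, lr=1e-2, Q=5, n_outer=3):
    """Sketch of Algorithm 1: alternate classic gradient steps in parameter space
    with AdaGeo steps in a GPLVM latent space (helper functions are illustrative)."""
    theta = theta0.copy()
    for _ in range(n_outer):                                   # stands in for "while convergence is not reached"
        # Step 2: T_theta classic updates in parameter space, collecting the iterates
        history = []
        for _ in range(T_theta):
            theta = theta - lr * grad_g(theta)
            history.append(theta.copy())
        Theta = np.stack(history)                              # (T_theta, D)

        # Step 3: train a GPLVM on the collected samples (placeholder helper returning
        # the latent points; mapping/pullback use the sketches given earlier)
        Omega = fit_gplvm_latents(Theta, Q)                    # (T_theta, Q), hypothetical helper
        omega = Omega[-1].copy()                               # start from the latest latent point

        # Step 4: T_omega AdaGeo updates in the latent space
        for _ in range(T_omega):
            theta = gplvm_predict_mean(omega[None, :], Omega, Theta).ravel()
            mu_J = gplvm_jacobian_mean(omega, Omega, Theta)    # (D, Q)
            omega = omega - lr * (mu_J.T @ grad_g(theta))      # latent-space gradient step
        theta = gplvm_predict_mean(omega[None, :], Omega, Theta).ravel()  # back to parameter space
    return theta
```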

SLIDE 18

Experiment: Logistic Regression on MNIST

Neural network with a single hidden layer implementing logistic regression on MNIST

  • Dimension of parameter space: D = 7850
  • Dimension of latent space: Q = 9
  • Iterations: T_θ = 20 and T_ω = 30

[Figure: NN training loss function vs. number of epochs, comparing SGD and AdaGeo-SGD]

SLIDE 19

Experiment: Gaussian Process Training

Concrete compressive strength dataset [3]: regression task with 8 real variables

  • Composite kernel (RBF, Matérn, linear and bias)
  • Dimension of parameter space: D = 9
  • Dimension of latent space: Q = 3
  • Iterations: T_θ = 15 and T_ω = 15

[Figure: GP negative log-likelihood vs. number of iterations, comparing gradient descent and AdaGeo-gradient descent]

[3] Lichman, M., UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science (2013)

SLIDE 20

AdaGeo Bayesian Sampling

Bayesian sampling framework: a dataset X = {x_1, . . . , x_N} is given.

X is modeled with a generative model whose likelihood is

p(X, θ) = ∏_{i=1}^{N} p(x_i, θ),

parameterized by the vector θ ∈ R^D, with prior p(θ).

Performing statistical inference means getting insights on the posterior distribution

p(θ | X) = p(X | θ) p(θ) / p(X)

analytically or approximately through sampling.

SLIDE 21

AdaGeo Bayesian Sampling

Bayesian sampling framework: a dataset X = {x_1, . . . , x_N} is given.

X is modeled with a generative model whose likelihood is

p(X, θ) = ∏_{i=1}^{N} p(x_i, θ),

parameterized by the vector θ ∈ R^D, with prior p(θ).

Unfortunately, the denominator is often intractable. One possible approach is to approximate the integral

p(X) = ∫_Θ p(X, θ) dθ = ∫_Θ p(X | θ) p(θ) dθ

through Markov Chain Monte Carlo or similar methods.

SLIDE 22

Stochastic Gradient Langevin Dynamics

Stochastic gradient Langevin dynamics [4] combines stochastic optimization and the physical concept of Langevin dynamics to build a posterior sampler.

At each time t, a mini-batch is extracted and the parameters are updated as:

θ_{t+1} = θ_t + ∆θ_t,   ∆θ_t = (ϵ_t / 2) ( ∇_θ log p(θ_t) + (N/n) Σ_{i=1}^{n} ∇_θ log p(x_i | θ_t) ) + η_t,   η_t ∼ N(0, ϵ_t I)

with the learning rate ϵ_t satisfying:

Σ_{t=1}^{∞} ϵ_t = ∞,   Σ_{t=1}^{∞} ϵ_t² < ∞

[4] Welling, M., Teh, Y. W., Bayesian learning via stochastic gradient Langevin dynamics. Proceedings of the 28th International Conference on Machine Learning (ICML) (2011)
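A minimal sketch of one SGLD update as written above; the grad_log_prior and grad_log_lik callables and the step-size schedule are placeholders, the latter chosen only to satisfy the conditions on ϵ_t.

```python
import numpy as np

def sgld_step(theta, minibatch, t, grad_log_prior, grad_log_lik, N, eps0=1e-3):
    """One stochastic gradient Langevin dynamics update.
    grad_log_lik(x_i, theta) returns the gradient of log p(x_i | theta)."""
    eps_t = eps0 / (1.0 + t) ** 0.55            # decaying step: sum eps_t = inf, sum eps_t^2 < inf
    n = len(minibatch)
    grad = grad_log_prior(theta) + (N / n) * sum(grad_log_lik(x, theta) for x in minibatch)
    noise = np.random.normal(0.0, np.sqrt(eps_t), size=theta.shape)   # eta_t ~ N(0, eps_t I)
    return theta + 0.5 * eps_t * grad + noise
```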

SLIDE 23

AdaGeo - Stochastic Gradient Langevin Dynamics

Analogously to before, we propose to:

1 Pick your favourite sampler and produce the first t samples to build the set Θ = {θ_1, . . . , θ_N}

2 Train a GPLVM on Θ to learn the latent space Ω

3 Move the updates onto the latent space with AdaGeo - Stochastic Gradient Langevin Dynamics:

ω_{t+1} = ω_t + ∆ω_t,   ∆ω_t = (ϵ_t / 2) ( ∇_ω log p(f(ω_t)) + (N/n) Σ_{i=1}^{n} ∇_ω log p(x_{ti} | f(ω_t)) ) + η_t,   η_t ∼ N(0, ϵ_t I)
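A corresponding sketch of one AdaGeo-SGLD update in the latent space, reusing the hypothetical GPLVM helpers from the earlier sketches; latent-space gradients are obtained by pulling back observed-space gradients with the mean Jacobian, as in the recap.

```python
import numpy as np

def adageo_sgld_step(omega, Omega, Theta, minibatch, t, grad_log_prior, grad_log_lik, N, eps0=1e-3):
    """One AdaGeo-SGLD update in the latent space (illustrative helpers from earlier sketches)."""
    eps_t = eps0 / (1.0 + t) ** 0.55
    theta = gplvm_predict_mean(omega[None, :], Omega, Theta).ravel()   # f(omega_t), mean mapping
    mu_J = gplvm_jacobian_mean(omega, Omega, Theta)                    # (D, Q) mean Jacobian
    n = len(minibatch)
    grad_theta = grad_log_prior(theta) + (N / n) * sum(grad_log_lik(x, theta) for x in minibatch)
    grad_omega = mu_J.T @ grad_theta                                   # pull the gradient back to Omega
    noise = np.random.normal(0.0, np.sqrt(eps_t), size=omega.shape)
    return omega + 0.5 * eps_t * grad_omega + noise
```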

SLIDE 24

Experiment: Sampling from the Banana Distribution

The "banana" distribution has this formula:

p(θ) ∝ exp( − θ_1²/200 − (θ_2 − b θ_1² + 100b)²/2 − Σ_{j=3}^{D} θ_j² )

[Figure: contour plot of the banana distribution in the (θ_1, θ_2) plane]

θ_1 and θ_2 present the interaction shown on the left, while the other variables produce Gaussian noise.
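For illustration, a small NumPy sketch of the log-density above and its gradient (the quantities a gradient-based sampler needs); the value of b is an arbitrary choice, not taken from the slides.

```python
import numpy as np

def banana_log_density(theta, b=0.1):
    """Unnormalized log-density of the banana distribution from the slide."""
    u = theta[1] - b * theta[0]**2 + 100.0 * b
    return -theta[0]**2 / 200.0 - 0.5 * u**2 - np.sum(theta[2:]**2)

def banana_grad_log_density(theta, b=0.1):
    """Gradient of the unnormalized log-density, as used by gradient-based samplers."""
    u = theta[1] - b * theta[0]**2 + 100.0 * b
    grad = -2.0 * theta.copy()                    # derivative of the -sum_{j>=3} theta_j^2 term
    grad[0] = -theta[0] / 100.0 + 2.0 * b * theta[0] * u
    grad[1] = -u
    return grad
```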

SLIDE 25

Experiment: Sampling from the Banana Distribution

A Metropolis-Hastings sampler returns the first 100 samples drawn from a 50-dimensional banana distribution. AdaGeo-SGLD is then employed to sample from a 5-dimensional latent space.

[Figure: samples in the (θ_1, θ_2) plane and in the (θ_3, θ_15) plane, comparing the MH sampler and the AdaGeo sampler]

SLIDE 26

Bonus Round: Riemannian Extensions (theory only)

If the covariance function of a Gaussian Process is differentiable, then it is straightforward to show that the mapping f is also differentiable. Under this assumption we can compute the latent metric tensor G, which gives further information about the geometry of the latent space (distances, geodesic lines, etc.).

If J is the Jacobian of the mapping f, then

G = J⊤J

This yields a distribution over the metric tensor [5]:

G ∼ W_Q(D, Σ_J, E[J⊤] E[J])

and a point estimate can be obtained with

E[J⊤J] = E[J⊤] E[J] + D Σ_J

[5] Tosi, A., Hauberg, S., Vellido, A., Lawrence, N. D., Metrics for probabilistic geometries. Uncertainty in Artificial Intelligence (2014)
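A one-function sketch of the point estimate above, assuming the mean Jacobian µ_J (D × Q) and the shared row covariance Σ_J (Q × Q) have been extracted from the GPLVM.

```python
import numpy as np

def expected_metric_tensor(mu_J, Sigma_J):
    """Point estimate of the latent metric tensor:
    E[J^T J] = E[J^T] E[J] + D * Sigma_J, with J of shape (D, Q)."""
    D = mu_J.shape[0]
    return mu_J.T @ mu_J + D * Sigma_J
```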

SLIDE 27

Bonus Round: Stochastic Gradient Riemannian Langevin Dynamics

Stochastic gradient Riemannian Langevin dynamics [6] puts together the advantages of exploiting a known Riemannian geometry with the scalability of the stochastic optimization approaches:

θ_{t+1} = θ_t + ∆θ_t,   ∆θ_t = (ϵ_t / 2) µ(θ_t) + G^{-1/2}(θ_t) η_t,   η_t ∼ N(0, ϵ_t I),

where

µ(θ)_j = ( G^{-1}(θ) ( ∇_θ log p(θ) + (N/n) Σ_{i=1}^{n} ∇_θ log p(x_{ti} | θ) ) )_j
         − 2 Σ_{k=1}^{D} ( G^{-1}(θ) ∂G(θ)/∂θ_k G^{-1}(θ) )_{jk}
         + Σ_{k=1}^{D} ( G^{-1}(θ) )_{jk} Tr( G^{-1}(θ) ∂G(θ)/∂θ_k )

[6] Patterson, S., Teh, Y. W., Stochastic gradient Riemannian Langevin dynamics on the probability simplex. Advances in Neural Information Processing Systems (2013)

SLIDE 28

Bonus Round: AdaGeo - Stochastic Gradient Riemannian Langevin Dynamics

Analogously to the SGLD case, we can now move the update to the latent space with AdaGeo - Stochastic gradient Riemannian Langevin dynamics:

ω_{t+1} = ω_t + ∆ω_t,   ∆ω_t = (ϵ_t / 2) µ(ω_t) + G_ω^{-1/2}(ω_t) η_t,   η_t ∼ N(0, ϵ_t I),

where

µ(ω)_j = ( G_ω^{-1}(ω) ( ∇_ω log p(f(ω)) + (N/n) Σ_{i=1}^{n} ∇_ω log p(x_{ti} | f(ω)) ) )_j
         − 2 Σ_{k=1}^{Q} ( G_ω^{-1}(ω) ∂G_ω(ω)/∂ω_k G_ω^{-1}(ω) )_{jk}
         + Σ_{k=1}^{Q} ( G_ω^{-1}(ω) )_{jk} Tr( G_ω^{-1}(ω) ∂G_ω(ω)/∂ω_k )

SLIDE 29

Conclusions

  • We develop a generic framework for combining dimensionality reduction techniques with sampling and optimization methods.
  • We contribute to gradient-based optimization methods by coupling them with appropriate dimensionality reduction techniques. In particular, we improve the performance of gradient descent and stochastic gradient descent when training, respectively, a Gaussian Process and a neural network.
  • We contribute to Markov Chain Monte Carlo by developing an AdaGeo version of stochastic gradient Langevin dynamics; the information gathered through the latent space is employed to compute the steps of the Markov chain.
  • We extend the approach to stochastic gradient Riemannian Langevin dynamics, thanks to the geometric tensor naturally recovered by the GP-LVM model.

SLIDE 30

Thank you

Reference: Abbati, G., Tosi, A., Flaxman, S., Osborne, M. A. (2018). AdaGeo: Adaptive Geometric Learning for Optimization and Sampling. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS), to appear.