slide-1
SLIDE 1

Bayesian Learning from Sequential Data using Gaussian Processes with Signature Covariances

Csaba Toth Joint work with Harald Oberhauser Mathematical Institute, University of Oxford International Conference on Machine Learning, July 2020

slide-2
SLIDE 2

Overview


slide-9
SLIDE 9

Overview

Purpose of this work

  • 1. Define a Gaussian process (GP) [6] over sequences/time series
    ◮ To model functions of sequences {Seq(R^d) → R}: (f_x)_{x ∈ Seq(R^d)} ∼ GP(m(·), k(·, ·))
    ◮ Find a suitable covariance kernel k : Seq(R^d) × Seq(R^d) → R
    ◮ Seq(R^d) := {(x_{t_1}, . . . , x_{t_L}) | (t_i, x_{t_i}) ∈ R_+ × R^d, L ∈ N}

  • 2. Develop an efficient inference framework
    ◮ Standard challenges: intractable posteriors, O(N^3) scaling in the training data
    ◮ Additional challenge: potentially very high-dimensional inputs (long sequences)
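The O(N^3) scaling mentioned above comes from factorising the N × N Gram matrix in exact GP regression. A minimal numpy sketch of that baseline, not code from the paper: `rbf` is a placeholder base kernel on vectors (the talk's point is to replace it with a covariance on Seq(R^d)), and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(u, v, lengthscale=1.0):
    # placeholder base kernel on R^d; any positive-definite kernel works here
    d = u - v
    return np.exp(-0.5 * (d @ d) / lengthscale**2)

def gram(xs, ys, k):
    # Gram matrix K[i, j] = k(xs[i], ys[j])
    return np.array([[k(x, y) for y in ys] for x in xs])

def gp_posterior_mean(K, y, K_star, noise=0.1):
    # exact GP regression mean: the Cholesky factorisation of the
    # N x N matrix below is the O(N^3) step the slides refer to
    N = len(y)
    L = np.linalg.cholesky(K + noise**2 * np.eye(N))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return K_star @ alpha

X = rng.normal(size=(5, 3))   # 5 inputs; the talk replaces vectors with sequences
y = rng.normal(size=5)
K = gram(X, X, rbf)
mean = gp_posterior_mean(K, y, K, noise=0.1)
```

With many or very long sequences, neither the cubic solve nor a naive sequence kernel is affordable, which motivates the sparse inference scheme described later.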


slide-14
SLIDE 14

Overview

Suitable feature map? Signatures from stochastic analysis [2]!

Can be used to transform vector-kernels into sequence-kernels
  ◮ κ : R^d × R^d → R a kernel for vector-valued data
  ◮ [4] used signatures to define, for x, y ∈ Seq(R^d), the kernel

      k(x, y) = Σ_{m=0}^{M} σ_m^2 Σ_{1 ≤ i_1 < ··· < i_m ≤ L_x} Σ_{1 ≤ j_1 < ··· < j_m ≤ L_y} c(i) c(j) Π_{l=1}^{m} Δ_{i_l, j_l} κ(x_{i_l}, y_{j_l})

    for some explicitly given constants c(i_1, . . . , i_m), c(j_1, . . . , j_m), where

      Δ_{i,j} κ(x_i, y_j) = κ(x_{i+1}, y_{j+1}) − κ(x_i, y_{j+1}) − κ(x_{i+1}, y_j) + κ(x_i, y_j)

  ◮ Strong theoretical properties!
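For short sequences the truncated kernel above can be evaluated by brute force. The numpy sketch below is an illustration, not the paper's algorithm: the weights σ_m^2 and constants c(·) are set to 1 for readability, and the sums over ordered index tuples are enumerated explicitly, which is feasible only for tiny L_x, L_y.

```python
import numpy as np
from itertools import combinations

def delta(kappa, x, y, i, j):
    # second-order increment Delta_{i,j} kappa(x_i, y_j) from the slide
    return (kappa(x[i + 1], y[j + 1]) - kappa(x[i], y[j + 1])
            - kappa(x[i + 1], y[j]) + kappa(x[i], y[j]))

def sig_kernel(x, y, kappa, M=2):
    # brute-force truncated signature kernel with sigma_m = c(.) = 1;
    # x, y are lists of vectors, summed over increments 0..L-1
    Lx, Ly = len(x) - 1, len(y) - 1
    val = 1.0                           # m = 0 term
    for m in range(1, M + 1):
        s = 0.0
        for I in combinations(range(Lx), m):       # 1 <= i_1 < ... < i_m <= L_x
            for J in combinations(range(Ly), m):   # 1 <= j_1 < ... < j_m <= L_y
                p = 1.0
                for i, j in zip(I, J):
                    p *= delta(kappa, x, y, i, j)
                s += p
        val += s
    return val
```

A handy sanity check: with the linear kernel κ(u, v) = ⟨u, v⟩, each Δ_{i,j}κ is the inner product of increments, so at M = 1 the kernel reduces to 1 plus the inner product of the two total increments (last point minus first point).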


slide-19
SLIDE 19

Overview

Our contributions
  ◮ Bringing GPs and signatures together (+ analysis)
  ◮ Developing a tractable, efficient inference scheme

  • 1. Sparse VI [3]: non-conjugacy, large N ∈ N
  • 2. Inter-domain inducing points: long sequences (sup_{x ∈ X} L_x large)

  ◮ GPflow implementation, thorough experimental evaluation
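To make the sparse-VI idea concrete, here is a numpy sketch of a collapsed sparse-GP lower bound for the conjugate Gaussian-likelihood case. This is a stand-in, not the paper's scheme: the paper uses the uncollapsed variational bound of [3] to also handle non-conjugate likelihoods, and its inter-domain inducing points live in the signature feature domain rather than being pseudo-inputs Z as below. All names are illustrative.

```python
import numpy as np

def rbf_gram(A, B, ell=1.0):
    # RBF Gram matrix as a stand-in covariance
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / ell**2)

def collapsed_elbo(X, y, Z, noise=0.1, ell=1.0):
    # Collapsed sparse-GP bound with M inducing points Z.
    # For clarity we form the N x N matrix Qnn explicitly; a practical
    # implementation avoids this to get O(N M^2) cost instead of O(N^3).
    N, M = len(X), len(Z)
    Kmm = rbf_gram(Z, Z, ell) + 1e-8 * np.eye(M)
    Kmn = rbf_gram(Z, X, ell)
    L = np.linalg.cholesky(Kmm)
    A = np.linalg.solve(L, Kmn)                    # Qnn = A.T @ A = Knm Kmm^{-1} Kmn
    Qnn = A.T @ A
    S = Qnn + noise**2 * np.eye(N)
    Ls = np.linalg.cholesky(S)
    alpha = np.linalg.solve(Ls, y)
    logdet = 2 * np.log(np.diag(Ls)).sum()
    log_marg = -0.5 * (N * np.log(2 * np.pi) + logdet + alpha @ alpha)
    # penalty for the covariance mass the inducing points fail to explain
    trace_term = (np.trace(rbf_gram(X, X, ell)) - np.trace(Qnn)) / (2 * noise**2)
    return log_marg - trace_term

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))
y = rng.normal(size=6)
elbo_full = collapsed_elbo(X, y, X)        # inducing points = data: bound is (nearly) tight
elbo_sparse = collapsed_elbo(X, y, X[:3])  # fewer inducing points: a looser lower bound
```

The bound never exceeds the exact log marginal likelihood, and placing an inducing point at every data point recovers it (up to jitter), which is the property sparse methods trade away for scalability.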

slide-20
SLIDE 20

Signatures


slide-26
SLIDE 26

Signatures

What are signatures?

Signatures are defined on continuous-time objects, paths
  ◮ Paths(R^d) = { x ∈ C([0, T], R^d) | x_0 = 0, ‖x‖_bv < +∞ }
  ◮ Φ_m(x) = ∫_{0 < t_1 < ··· < t_m < T} ẋ_{t_1} ⊗ ··· ⊗ ẋ_{t_m} dt_1 ··· dt_m
  ◮ Φ_m(x) ∈ (R^d)^{⊗m} is what is known as a tensor of degree m ∈ N
  ◮ Φ(x) = (Φ_m(x))_{m ≥ 0} is an infinite collection of tensors with increasing degrees
  ◮ A generalization of polynomials for vector-valued data to paths (and sequences!)
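For piecewise-linear paths (a sequence joined by linear interpolation), the first two iterated integrals have closed forms as sums over the increments δ_k = x_{k+1} − x_k. A minimal numpy sketch (function name illustrative):

```python
import numpy as np

def signature_levels_1_2(x):
    # Levels 1 and 2 of the signature of the piecewise-linear path through
    # the points x[0], ..., x[L]. For such paths the iterated integrals reduce to:
    #   Phi_1 = sum_k delta_k                       (the total increment)
    #   Phi_2 = sum_{k<l} delta_k (x) delta_l  +  (1/2) sum_k delta_k (x) delta_k
    x = np.asarray(x, dtype=float)
    deltas = np.diff(x, axis=0)
    phi1 = deltas.sum(axis=0)
    d = x.shape[1]
    phi2 = np.zeros((d, d))
    running = np.zeros(d)          # partial sum of increments seen so far
    for dk in deltas:
        phi2 += np.outer(running, dk) + 0.5 * np.outer(dk, dk)
        running += dk
    return phi1, phi2
```

Two identities make good sanity checks: Φ_1 is exactly the last point minus the first, and the shuffle relation gives Φ_2 + Φ_2ᵀ = Φ_1 ⊗ Φ_1, so only the antisymmetric part of Φ_2 (the Lévy area) carries genuinely new information about the path's order.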