
SLIDE 1

Learning from Irregularly-Sampled Time Series

A Missing Data Perspective

Steven Cheng-Xian Li, Benjamin M. Marlin

University of Massachusetts Amherst

SLIDE 2

Irregularly-Sampled Time Series

Irregularly-sampled time series: Time series with non-uniform time intervals between successive measurements


SLIDE 3

Problem and Challenges


Problem: learning from a collection of irregularly-sampled time series within a common time interval

[Figure: an example collection of irregularly sampled series, value vs. time.]

Challenges:

  • Each time series is observed at arbitrary time points.
  • Different data cases may have different numbers of observations.
  • Observed samples may not be aligned in time.
  • Many real-world time series are extremely sparse.
  • Most machine learning algorithms require data lying in a fixed-dimensional feature space.

SLIDE 4

Problem and Challenges


Problem: learning from a collection of irregularly-sampled time series within a common time interval


Tasks:

  • Learning the distribution of latent temporal processes
  • Inferring the latent process associated with a time series
  • Classification of time series

This can be transformed into a missing data problem.

SLIDE 5

Index Representation of Incomplete Data

Data defined on an index set I:

  • Examples:
    • Image: pixel coordinates
    • Time series: timestamps
  • Complete data as a mapping: I → R.

Index representation of an incomplete data case (x, t):

  • t ≡ {ti}_{i=1}^{|t|} ⊂ I are the indices of observed entries.
  • xi is the corresponding value observed at ti.
  • Applicable to both finite and continuous index sets.
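To make the representation concrete, here is a minimal Python sketch; the IncompleteCase container and the example numbers are illustrative assumptions, not the paper's code.

```python
# A minimal sketch of the index representation (x, t) of one
# incomplete data case; container and values are illustrative.
from dataclasses import dataclass
import numpy as np

@dataclass
class IncompleteCase:
    t: np.ndarray  # observed indices ti, a finite subset of the index set I
    x: np.ndarray  # observed values xi = f(ti), same length as t

# An irregularly sampled series on the continuous index set I = [0, 1]:
case = IncompleteCase(
    t=np.array([0.03, 0.41, 0.42, 0.97]),  # non-uniform timestamps
    x=np.array([1.2, -0.5, -0.4, 0.8]),    # values observed at those times
)
assert len(case.t) == len(case.x)
```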

SLIDE 7

Generative Process of Incomplete Data

Generative process for an incomplete case (x, t):

  • f ∼ pθ(f)  (complete data, f : I → R)
  • t ∼ pI(t|f)  (t ∈ 2^I, a subset of I)
  • x = [f(ti)]_{i=1}^{|t|}  (values of f indexed at t)

Goal: learning the complete data distribution pθ given the incomplete dataset D = {(xi, ti)}_{i=1}^{n}.

SLIDE 8

Generative Process of Incomplete Data

Generative process for an incomplete case (x, t):

  • f ∼ pθ(f)  (complete data, f : I → R)
  • t ∼ pI(t)  (independence between f and t)
  • x = [f(ti)]_{i=1}^{|t|}  (values of f indexed at t)

Goal: learning the complete data distribution pθ given the incomplete dataset D = {(xi, ti)}_{i=1}^{n}.
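A hedged sketch of this process, with a random sinusoid standing in for pθ(f) and a homogeneous Poisson process standing in for pI(t); both concrete distributions are illustrative assumptions, not the paper's models.

```python
# Sketch of the generative process under the independence assumption:
# f ~ p_theta(f), t ~ p_I(t), x = [f(t_i)].
import numpy as np

rng = np.random.default_rng(0)
T = 1.0

def sample_f():
    a, b = rng.normal(), rng.uniform(0, 2 * np.pi)
    return lambda t: a + np.sin(2 * np.pi * t + b)  # one draw f: [0, T] -> R

def sample_t(rate=30.0):
    n = rng.poisson(rate * T)                    # homogeneous Poisson process
    return np.sort(rng.uniform(0.0, T, size=n))  # observation times t

f = sample_f()
t = sample_t()
x = f(t)  # x = [f(t_1), ..., f(t_|t|)]: values of f indexed at t
```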

SLIDE 10

Encoder-Decoder Framework for Incomplete Data

Probabilistic latent variable model. Decoder:

  • Model the data generating process: z ∼ pz(z), f = gθ(z).
  • Given t ∼ pI, the corresponding values are gθ(z, t) ≡ [f(ti)]_{i=1}^{|t|}.
  • Note: our goal is to model gθ, not pI.

SLIDE 11

Encoder-Decoder Framework for Incomplete Data

Encoder (stochastic):

  • Model the posterior distribution qφ(z|x, t)
  • Functional form: qφ(z|x, t) = qφ(z | m(x, t))
  • Example: qφ(z|x, t) = N(z|µφ(v), Σφ(v)) with v = m(x, t).
  • Different incomplete cases carry different levels of uncertainty.

Masking function m(x, t):

  • Replacing all missing entries by zero.
  • Replacing all missing entries by zero.

[Figure: two examples of the masking function, each mapping an irregularly sampled series (value x over time t ∈ [0, T]) to its zero-filled representation.]
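A minimal sketch of m for a finite index set (e.g. flattened pixel coordinates); the function name and sizes are illustrative assumptions.

```python
# Zero-filling masking function m(x, t) on a finite index set
# {0, ..., size-1}.
import numpy as np

def mask(x, t, size):
    v = np.zeros(size)
    v[t] = x  # keep observed values; missing entries stay at zero
    return v

# The encoder then conditions on v: q_phi(z | x, t) = q_phi(z | m(x, t)).
v = mask(x=np.array([1.2, -0.5]), t=np.array([3, 7]), size=10)
```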

SLIDE 12

Partial Variational Autoencoder (P-VAE)

Generative process:

  • t ∼ pI(t)
  • z ∼ p(z)
  • f = gθ(z)
  • xi ∼ p(xi|f(ti))  (i.i.d. noise), e.g. p(xi|f(ti)) = N(xi|f(ti), σ²)

Joint distribution:

p(x, t) = ∫ p(z) pI(t) ∏_{i=1}^{|t|} pθ(xi|z, ti) dz

pθ(xi|z, ti) is the shorthand for p(xi|f(ti)) with f = gθ(z).

[Diagram: P-VAE. A data case (x, t) ∼ pD is encoded by qφ to z and decoded by gθ back to x.]

SLIDE 13

Partial Variational Autoencoder (P-VAE)

Variational lower bound for log p(x, t):

E_{z∼qφ(z|x,t)} [ log ( pz(z) pI(t) ∏_{i=1}^{|t|} pθ(xi|z, ti) / qφ(z|x, t) ) ]

Learning with gradients, without pI(t) involved (the log pI(t) term does not depend on φ or θ, so it cancels from the gradient):

∇_{φ,θ} E_{z∼qφ(z|x,t)} [ log ( pz(z) ∏_{i=1}^{|t|} pθ(xi|z, ti) / qφ(z|x, t) ) ]

Kingma & Welling (2014). Auto-Encoding Variational Bayes.
Ma, et al. (2018). Partial VAE for Hybrid Recommender System.
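A hedged PyTorch sketch of this objective as a training loss. The `encoder` and `decoder` arguments are assumed placeholder modules, and the Gaussian likelihood with fixed σ follows the example on slide 12; only the observed entries (x, t) enter the loss.

```python
# Negative ELBO for the P-VAE, computed on observed entries only;
# note that p_I(t) never appears in the gradient.
import torch

def pvae_loss(encoder, decoder, x, t, sigma=0.1):
    mu, logvar = encoder(x, t)                 # q_phi(z | m(x, t))
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
    f_t = decoder(z, t)                        # g_theta(z) evaluated at times t
    log_lik = torch.distributions.Normal(f_t, sigma).log_prob(x).sum(-1)
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(-1)  # KL(q || p_z)
    return (kl - log_lik).mean()
```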

SLIDE 14

Partial Variational Autoencoder (P-VAE)

Conditional objective (lower bound for log p(x|t)):

E_{z∼qφ(z|x,t)} [ log ( pz(z) ∏_{i=1}^{|t|} pθ(xi|z, ti) / qφ(z|x, t) ) ]

Related work:

  • Partial VAE [Ma, et al., 2018]
  • Neural processes [Garnelo, et al., 2018]
  • MIWAE [Mattei & Frellsen, 2019]

SLIDES 15–18

Partial Bidirectional GAN (P-BiGAN)

  • Encoding path: a real case (x, t) ∼ pD is encoded to z ∼ qφ(z|x, t), yielding tuples {(x, t, z)}.
  • Decoding path: a latent code z′ ∼ pz and observed times (·, t′) ∼ pD are decoded to x′ = gθ(z′, t′), yielding tuples {(x′, t′, z′)}.
  • Discriminator: D(m(x, t), z) distinguishes encoded tuples from decoded ones.

Li, Jiang, Marlin (2019). MisGAN: Learning from Incomplete Data with GANs.
Donahue, et al. (2016). Adversarial Feature Learning (BiGAN).

SLIDE 19

Invertibility Property of P-BiGAN

gθ(z, t) is the shorthand notation for [f(ti)]_{i=1}^{|t|} with f = gθ(z).

Theorem: For (x, t) with non-zero probability, if z ∼ qφ(z|x, t), then gθ(z, t) = x.

SLIDE 20

Autoencoding Regularization for P-BiGAN

[Diagram: P-BiGAN with autoencoding regularization. In addition to the adversarial game against D, the encoder qφ and decoder gθ are trained with a reconstruction loss ℓ(x, x̂) between x and its reconstruction x̂ = gθ(z, t) with z ∼ qφ(z|x, t).]

SLIDE 21

Missing Data Imputation

Imputation: p(x′|t′, x, t) = E_{z∼qφ(z|x,t)} [pθ(x′|z, t′)]

Sampling: z ∼ qφ(z|x, t), f = gθ(z), x′ = [f(t′i)]_{i=1}^{|t′|}
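A sketch of this sampling procedure, reusing the placeholder `encoder` and `decoder` modules assumed in the P-VAE sketch above.

```python
# Monte Carlo imputation: draw z from the encoder posterior, decode
# the latent function, and read it out at the query times t'.
import torch

def impute(encoder, decoder, x, t, t_query, n_samples=10):
    mu, logvar = encoder(x, t)
    draws = []
    for _ in range(n_samples):
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # z ~ q_phi(z|x,t)
        draws.append(decoder(z, t_query))  # x' = [f(t'_i)] with f = g_theta(z)
    return torch.stack(draws)              # samples from p(x' | t', x, t)
```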

SLIDE 22

Supervised Learning: Classification

Adding a classification term to the objective:

E_{z∼qφ(z|x,t)} [ log ( pz(z) pθ(x|z, t) p(y|z) / qφ(z|x, t) ) ]
  = E_{qφ(z|x,t)} [ log ( pz(z) pθ(x|z, t) / qφ(z|x, t) ) ]   (regularization)
  + E_{qφ(z|x,t)} [ log p(y|z) ]   (classification)

Prediction:

ŷ = argmax_y E_{qφ(z|x,t)} [ log p(y|z) ]
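A sketch of the combined loss; `classifier`, mapping z to class logits, is an assumed module, and the negative cross-entropy equals log p(y|z).

```python
# ELBO regularization plus the classification term E_q[log p(y|z)].
import torch
import torch.nn.functional as F

def supervised_loss(encoder, decoder, classifier, x, t, y, sigma=0.1):
    mu, logvar = encoder(x, t)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
    f_t = decoder(z, t)
    log_lik = torch.distributions.Normal(f_t, sigma).log_prob(x).sum(-1)
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(-1)
    log_py = -F.cross_entropy(classifier(z), y, reduction="none")  # log p(y|z)
    return (kl - log_lik - log_py).mean()
# Prediction: average classifier log-probabilities over z ~ q_phi, then argmax.
```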

SLIDE 23

MNIST Completion

[Figure: MNIST completions by P-VAE and P-BiGAN, under square observation with 90% missing and under independent dropout with 90% missing.]

SLIDE 24

CelebA Completion

[Figure: CelebA completions by P-VAE and P-BiGAN, under square observation with 90% missing and under independent dropout with 90% missing.]

SLIDE 25

Architecture for Irregularly-Sampled Time Series

How to construct the decoder, encoder, and discriminator for a continuous index set, e.g., time series with I = [0, T]?

[Diagrams: the generic P-VAE (encoder qφ, decoder gθ) and P-BiGAN (with discriminator D) from the previous slides; each component must now operate on a continuous index set.]

SLIDE 26

Decoder for Continuous Time Series

Generative process for time series:

  • z ∼ pz(z)
  • v = CNNθ(z)  (values on evenly-spaced times u)
  • f(t) = Σ_{i=1}^{L} K(ui, t) vi / Σ_{i=1}^{L} K(ui, t)  (kernel smoother)
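A hedged sketch of the kernel smoother; the slide does not fix K, so the RBF kernel and bandwidth below are illustrative choices, and `v` stands in for the CNN output.

```python
# Kernel-smoothing decoder: f(t) = sum_i K(u_i, t) v_i / sum_i K(u_i, t).
import torch

def smooth(v, u, t, bandwidth=0.05):
    K = torch.exp(-(t[:, None] - u[None, :]) ** 2 / (2 * bandwidth ** 2))
    return (K * v[None, :]).sum(-1) / K.sum(-1)

u = torch.linspace(0, 1, 64)  # evenly-spaced times u
v = torch.randn(64)           # stands in for v = CNN_theta(z)
t = torch.rand(20)            # arbitrary query times
f_t = smooth(v, u, t)         # f evaluated at t
```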

SLIDE 27

Continuous Convolutional Layer

δ(·) is the Dirac delta function.

[Figure: a CNN encoder/discriminator on an evenly-spaced grid u over the continuous index t, using a continuous kernel w(t).]

Cross-correlation between:

  • the continuous kernel w(t)
  • the masked function m(x, t)(t) = Σ_{i=1}^{|t|} xi δ(t − ti)

SLIDE 28

Continuous Convolutional Layer

Cross-correlation between w and m(x, t):

(w ⋆ m(x, t))(u) = Σ_{i: ti ∈ neighbor(u)} w(ti − u) xi

Construct the kernel w(t) using a degree-1 B-spline.
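A sketch of one channel of this layer. With a degree-1 B-spline (hat) kernel whose support equals the grid spacing, the cross-correlation at each grid point reduces to linearly binning every observation onto its two neighboring grid nodes; the kernel choice is from the slide, the implementation details are assumptions.

```python
# Continuous convolution of the Dirac train m(x, t) with a degree-1
# B-spline (hat) kernel, evaluated on the evenly-spaced grid u.
import torch

def cont_conv(x, t, u):
    L, du = len(u), u[1] - u[0]
    out = torch.zeros(L)
    idx = torch.clamp(((t - u[0]) / du).long(), 0, L - 2)  # left grid neighbor
    frac = (t - u[idx]) / du                   # position within the grid cell
    out.scatter_add_(0, idx, (1 - frac) * x)   # hat weight on left node
    out.scatter_add_(0, idx + 1, frac * x)     # hat weight on right node
    return out

u = torch.linspace(0, 1, 32)
out = cont_conv(x=torch.randn(5), t=torch.rand(5), u=u)
```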

SLIDE 30

Architecture Overview for Continuous Time Series

[Diagram: the generic P-VAE beside the P-VAE for continuous time series, where qφ is a CNN encoder over (x, t) and gθ is a CNN decoder followed by the kernel smoother.]

SLIDE 31

MIMIC-III Mortality Prediction

  • about 53,000 labeled examples
  • 12 irregularly-sampled physiological time series
  • average mortality rate: 8.10%

[Figure: the 12 physiological variables (SpO2, HR, RR, SBP, DBP, Temp, TGCS, CRR, UO, FiO2, Glucose, pH) over a 48-hour window.]

SLIDE 32

MIMIC-III Mortality Prediction

method            AUC (%)        time (hr)   params
GRU-D†            83.88 ± 0.65   0.11        2.6K
Latent ODE‡       85.71 ± 0.38   2.62        154.7K
Cont classifier   84.87 ± 0.18   0.03        30.5K
Cont P-VAE        85.13 ± 0.43   0.04        64.8K
Cont P-BiGAN      86.02 ± 0.38   0.22        73.2K

†Che, et al. (2018). RNNs for Multivariate Time Series with Missing Values.
‡Rubanova, et al. (2019). Latent ODEs for Irregularly-Sampled Time Series.

SLIDE 33

Summary

  • Transforming the modeling of irregularly-sampled time series into a missing data problem
  • An encoder-decoder framework for the missing data problem
    • Partial VAE
    • Partial BiGAN
  • Scalable architecture for modeling continuous time series
    • Kernel smoothing decoder
    • Continuous convolutional layer

SLIDE 34

Appendix

SLIDE 35

Why Stochastic Encoders?

[Figure: imputation by a model trained with a 2-D latent code.]

SLIDE 36

Why Stochastic Encoders?


Different incomplete cases carry different levels of uncertainty

SLIDE 37

Synthetic Multivariate Time Series

Generative process:

  a ∼ N(0, 10²)
  b ∼ Uniform(0, 10)
  f1(t) = 0.8 sin(20(t + a) + sin(20(t + a)))
  f2(t) = −0.5 sin(20(t + a + 20) + sin(20(t + a + 20)))
  f3(t) = sin(12(t + b))

Observation time points are drawn from a homogeneous Poisson process with rate λ = 30 within [d, d + 0.25], where d ∼ Uniform(0, 0.75).
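The process above translates directly into NumPy; only the random seed is an added assumption.

```python
# Simulate one synthetic multivariate case exactly as specified above.
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(0, 10)        # a ~ N(0, 10^2)
b = rng.uniform(0, 10)       # b ~ Uniform(0, 10)

f = [lambda t: 0.8 * np.sin(20 * (t + a) + np.sin(20 * (t + a))),
     lambda t: -0.5 * np.sin(20 * (t + a + 20) + np.sin(20 * (t + a + 20))),
     lambda t: np.sin(12 * (t + b))]

d = rng.uniform(0, 0.75)                      # window start
n = rng.poisson(30 * 0.25)                    # Poisson process, rate 30
t_obs = np.sort(rng.uniform(d, d + 0.25, n))  # times in [d, d + 0.25]
x_obs = np.stack([fi(t_obs) for fi in f])     # the 3 channels at t_obs
```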

SLIDE 38

Synthetic Multivariate Time Series

[Figure: imputations on the synthetic data, comparing Cont P-VAE and Cont P-BiGAN against the data.]

SLIDE 39

Synthetic Multivariate Time Series

Random time series generation:

[Figure: randomly generated time series from Cont P-VAE and Cont P-BiGAN.]

SLIDE 40

MisGAN: GAN for Missing Data

[Diagram: MisGAN. The generator G produces complete data, which is masked out before the discriminator D compares it against the incomplete training data.]

Li, Jiang, Marlin (2019). MisGAN: Learning from Incomplete Data with GANs.

SLIDE 41

On the Independence Assumption

For the most general case, without the independence assumption, we use the following generative process for an incomplete case (x, t):

z ∼ pz(z),  t ∼ pI(t|z),  x = gθ(z, t)

This encodes a dependency between t and x when z is unobserved.

SLIDE 42

On the Independence Assumption

Generative process for an incomplete case (x, t): z ∼ pz(z), t ∼ pI(t|z), x = gθ(z, t).

P-VAE:

max_{φ,θ,τ} E_{(x,t)∼pD} E_{qφ(z|x,t)} [ log ( pz(z) pI(t|z) ∏_{i=1}^{|t|} pθ(xi|z, ti) / qφ(z|x, t) ) ]

P-BiGAN:

min_{θ,φ,τ} max_D  E_{(x,t)∼pD} E_{z∼qφ(z|x,t)} [ log D(x, t, z) ]
                 + E_{z∼pz(z)} E_{t∼pI(t|z)} [ log(1 − D(gθ(z, t), t, z)) ]

τ denotes the parameters of pI(t|z). For P-BiGAN, pI(t|z) can be stochastic or deterministic.