8 Approximate inference in switching linear dynamical systems using Gaussian mixtures

David Barber

8.1 Introduction

The linear dynamical system (LDS) (see Section 1.3.2) is a standard time series model in which a latent linear process generates the observations. Complex time series which are not well described globally by a single LDS may be divided into segments, each modelled by a potentially different LDS. Such models can handle situations in which the underlying model 'jumps' from one parameter setting to another. For example, a single LDS might well represent the normal flows in a chemical plant. However, if there is a break in a pipeline, the dynamics of the system changes from one set of linear flow equations to another. This scenario could be modelled by two sets of linear systems, each with different parameters, with a discrete latent variable $s_t \in \{\text{normal}, \text{pipe broken}\}$ at each time indicating which of the LDSs is most appropriate at the current time. This is called a switching LDS (SLDS) and is used in many disciplines, from econometrics to machine learning [2, 9, 15, 13, 12, 6, 5, 19, 21, 16].

8.2 The switching linear dynamical system

At each time $t$, a switch variable $s_t \in \{1, \dots, S\}$ describes which of a set of LDSs is to be used. The observation (or 'visible') variable $v_t \in \mathbb{R}^V$ is linearly related to the hidden state $h_t \in \mathbb{R}^H$ by

$$v_t = B(s_t)h_t + \eta^v(s_t), \qquad \eta^v(s_t) \sim \mathcal{N}\left(\eta^v(s_t) \mid \bar{v}(s_t), \Sigma^v(s_t)\right). \qquad (8.1)$$

Here $s_t$ describes which of the set of emission matrices $B(1), \dots, B(S)$ is active at time $t$. The observation noise $\eta^v(s_t)$ is drawn from one of a set of Gaussians with different means $\bar{v}(s_t)$ and covariances $\Sigma^v(s_t)$. The transition of the continuous hidden state $h_t$ is linear,

$$h_t = A(s_t)h_{t-1} + \eta^h(s_t), \qquad \eta^h(s_t) \sim \mathcal{N}\left(\eta^h(s_t) \mid \bar{h}(s_t), \Sigma^h(s_t)\right), \qquad (8.2)$$

and the switch variable $s_t$ selects a single transition matrix from the available set $A(1), \dots, A(S)$. The Gaussian transition noise $\eta^h(s_t)$ also depends on the switch variables. The dynamics of $s_t$ itself is Markovian, with transition $p(s_t|s_{t-1})$. For the more general 'augmented' (aSLDS) model the switch $s_t$ is dependent on both the previous $s_{t-1}$ and $h_{t-1}$.



Figure 8.1 The independence structure of the aSLDS. Square nodes $s_t$ denote discrete switch variables; $h_t$ are continuous latent/hidden variables, and $v_t$ continuous observed/visible variables. The discrete state $s_t$ determines which linear dynamical system from a finite set of linear dynamical systems is operational at time $t$. In the SLDS, links from $h$ to $s$ are not normally considered.

The probabilistic model defines a joint distribution (see Fig. 8.1)

$$p(v_{1:T}, h_{1:T}, s_{1:T}) = \prod_{t=1}^{T} p(v_t|h_t, s_t)\, p(h_t|h_{t-1}, s_t)\, p(s_t|h_{t-1}, s_{t-1}),$$

with

$$p(v_t|h_t, s_t) = \mathcal{N}\left(v_t \mid \bar{v}(s_t) + B(s_t)h_t, \Sigma^v(s_t)\right), \qquad p(h_t|h_{t-1}, s_t) = \mathcal{N}\left(h_t \mid \bar{h}(s_t) + A(s_t)h_{t-1}, \Sigma^h(s_t)\right).$$

At time $t = 1$, $p(s_1|h_0, s_0)$ denotes the prior $p(s_1)$, and $p(h_1|h_0, s_1)$ denotes $p(h_1|s_1)$. The SLDS can be thought of as a marriage between a hidden Markov model and an LDS. The SLDS is also called a jump Markov model/process, switching Kalman filter, switching linear Gaussian state space model, or conditional linear Gaussian model.
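To make the generative process concrete, the following is a minimal sketch of ancestral sampling from a standard SLDS. The parameter container `theta`, the helper name `sample_slds` and the choice of $\mathcal{N}(\bar h(s_1), \Sigma^h(s_1))$ as the prior $p(h_1|s_1)$ are illustrative assumptions of this sketch, not the chapter's notation:

```python
import numpy as np

def sample_slds(theta, pi1, trans, T, rng=None):
    """Ancestral sampling from a standard SLDS (no links from h to s).

    theta: dict with per-state lists A, B, Sigma_h, Sigma_v, h_bar, v_bar
           (a hypothetical container). pi1: prior p(s_1).
    trans: S x S matrix with trans[s_old, s_new] = p(s_t = s_new | s_{t-1} = s_old)."""
    rng = rng or np.random.default_rng()
    S = len(pi1)
    s = rng.choice(S, p=pi1)
    h = rng.multivariate_normal(theta["h_bar"][s], theta["Sigma_h"][s])
    ss, hs, vs = [], [], []
    for t in range(T):
        if t > 0:
            s = rng.choice(S, p=trans[s])                                # p(s_t | s_{t-1})
            h = rng.multivariate_normal(
                theta["h_bar"][s] + theta["A"][s] @ h, theta["Sigma_h"][s])  # Eq. (8.2)
        v = rng.multivariate_normal(
            theta["v_bar"][s] + theta["B"][s] @ h, theta["Sigma_v"][s])      # Eq. (8.1)
        ss.append(s); hs.append(h); vs.append(v)
    return np.array(ss), np.array(hs), np.array(vs)
```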

8.2.1 Exact inference is computationally intractable

Performing exact filtered and smoothed inference in the SLDS is intractable, scaling exponentially with time, see for example [16]. As an informal explanation, consider filtered posterior inference, for which, by analogy with Section 1.4.1, the forward pass is

$$p(s_{t+1}, h_{t+1}|v_{1:t+1}) = \sum_{s_t} \int_{h_t} p(s_{t+1}, h_{t+1}|s_t, h_t, v_{t+1})\, p(s_t, h_t|v_{1:t}). \qquad (8.3)$$

At time step 1, $p(s_1, h_1|v_1) = p(h_1|s_1, v_1)p(s_1|v_1)$ is an indexed Gaussian. At time step 2, due to the summation over the states $s_1$, $p(s_2, h_2|v_{1:2})$ is an indexed set of $S$ Gaussians. In general, at time $t$, $p(s_t, h_t|v_{1:t})$ is an indexed set of $S^{t-1}$ Gaussians. Even for small $t$, the number of components required to exactly represent the filtered distribution is computationally intractable. Analogously, smoothing is also intractable.

The origin of the intractability of the SLDS therefore differs from 'structural intractability' since, in terms of the cluster variables $x_{1:T}$ with $x_t \equiv (s_t, h_t)$ and visible variables $v_{1:T}$, the graph of the distribution is singly connected. From a purely graph-theoretic viewpoint, one would therefore envisage no difficulty in carrying out inference. Indeed, as we saw above, the derivation of the filtering algorithm is straightforward since the graph of $p(x_{1:T}, v_{1:T})$ is singly connected. However, the numerical representation of the messages requires an exponentially increasing number of terms. In order to deal with this intractability, several approximation schemes have been introduced [8, 9, 15, 13, 12]. Here we focus on techniques which approximate the switch conditional posteriors using a limited mixture of Gaussians. Since the exact posterior distributions are mixtures of Gaussians, but with an exponentially large number of components, the aim is to drop low-weight components such that the resulting limited number of Gaussians still accurately represents the posterior.



8.3 Gaussian sum filtering

Equation (8.3) describes the exact filtering recursion. Whilst the number of mixture components increases exponentially with time, intuitively one would expect that there is an effective time scale over which the previous visible information is relevant. In general, the influence of ancient observations will be much less relevant than that of recent observations. This suggests that a limited number of components in the Gaussian mixture should suffice to accurately represent the filtered posterior [1].

Our aim is to form a recursion for $p(s_t, h_t|v_{1:t})$ using a Gaussian mixture approximation of $p(h_t|s_t, v_{1:t})$. Given an approximation of the filtered distribution $p(s_t, h_t|v_{1:t}) \approx q(s_t, h_t|v_{1:t})$, the exact recursion (8.3) is approximated by

$$q(s_{t+1}, h_{t+1}|v_{1:t+1}) = \sum_{s_t} \int_{h_t} p(s_{t+1}, h_{t+1}|s_t, h_t, v_{t+1})\, q(s_t, h_t|v_{1:t}). \qquad (8.4)$$

This approximation to the filtered posterior at the next time step will contain $S$ times more components than at the previous time step. Therefore, to prevent an exponential explosion in mixture components, we need to collapse this mixture in a suitable way. We will deal with this once the new mixture representation for the filtered posterior has been computed. To derive the updates it is useful to break the filtered approximation from Eq. (8.4) into continuous and discrete parts,

$$q(h_t, s_t|v_{1:t}) = q(h_t|s_t, v_{1:t})\, q(s_t|v_{1:t}), \qquad (8.5)$$

and derive separate filtered update formulae, as described below. An important remark is that many techniques approximate $p(h_t|s_t, v_{1:t})$ using a single Gaussian. Naturally, this gives rise to a mixture of Gaussians for $p(h_t|v_{1:t})$. However, in making a single Gaussian approximation to $p(h_t|s_t, v_{1:t})$ the representation of the posterior may be poor. Our aim here is to maintain an accurate approximation to $p(h_t|s_t, v_{1:t})$ by using a mixture of Gaussians.

8.3.1 Continuous filtering

The exact representation of $p(h_t|s_t, v_{1:t})$ is a mixture with $S^{t-1}$ components. To retain computational feasibility we approximate this with a limited $I$-component mixture

$$q(h_t|s_t, v_{1:t}) = \sum_{i_t=1}^{I} q(h_t|i_t, s_t, v_{1:t})\, q(i_t|s_t, v_{1:t}),$$

where $q(h_t|i_t, s_t, v_{1:t})$ is a Gaussian parameterised with mean $f(i_t, s_t)$ and covariance $F(i_t, s_t)$. Strictly speaking, we should use the notation $f_t(i_t, s_t)$ since, for each time $t$, we have a set of means indexed by $i_t, s_t$, but we drop these dependencies in the notation used here.

To find a recursion for the approximating distribution, we first assume that we know the filtered approximation $q(h_t, s_t|v_{1:t})$ and then propagate this forwards using the exact dynamics. To do so, consider first the exact relation

$$q(h_{t+1}|s_{t+1}, v_{1:t+1}) = \sum_{s_t, i_t} q(h_{t+1}, s_t, i_t|s_{t+1}, v_{1:t+1}) = \sum_{s_t, i_t} q(h_{t+1}|s_t, i_t, s_{t+1}, v_{1:t+1})\, q(s_t, i_t|s_{t+1}, v_{1:t+1}). \qquad (8.6)$$



Wherever possible we substitute the exact dynamics and evaluate each of the two factors above. By decomposing the update in this way, the new filtered approximation is of the form of a Gaussian mixture, where $q(h_{t+1}|s_t, i_t, s_{t+1}, v_{1:t+1})$ is Gaussian and $q(s_t, i_t|s_{t+1}, v_{1:t+1})$ are the weights or mixing proportions of the components. We describe below how to compute these terms explicitly. Equation (8.6) produces a new Gaussian mixture with $I \times S$ components, which we collapse back to $I$ components at the end of the computation.

Evaluating $q(h_{t+1}|s_t, i_t, s_{t+1}, v_{1:t+1})$

We aim to find a filtering recursion for $q(h_{t+1}|s_t, i_t, s_{t+1}, v_{1:t+1})$. Since this is conditional on switch states and components, this corresponds to a single LDS forward step, which can be evaluated by considering first the joint distribution

$$q(h_{t+1}, v_{t+1}|s_t, i_t, s_{t+1}, v_{1:t}) = \int_{h_t} p(h_{t+1}, v_{t+1}|h_t, s_{t+1}, v_{1:t})\, q(h_t|s_t, i_t, v_{1:t}),$$

where $s_t, i_t$ have been dropped from the first factor and $s_{t+1}$ from the second by the conditional independencies of the model, and subsequently conditioning on $v_{t+1}$. To ease the burden on notation we assume $\bar{h}_t, \bar{v}_t \equiv 0$ for all $t$. By propagating $q(h_t|v_{1:t}, i_t, s_t) = \mathcal{N}\left(h_t \mid f(i_t, s_t), F(i_t, s_t)\right)$ with the dynamics (8.1) and (8.2), we obtain that $q(h_{t+1}, v_{t+1}|s_t, i_t, s_{t+1}, v_{1:t})$ is a Gaussian with covariance and mean elements

$$\Sigma_{hh} = A(s_{t+1})F(i_t, s_t)A^{\mathsf{T}}(s_{t+1}) + \Sigma^h(s_{t+1}), \qquad \Sigma_{vv} = B(s_{t+1})\Sigma_{hh}B^{\mathsf{T}}(s_{t+1}) + \Sigma^v(s_{t+1}),$$
$$\Sigma_{vh} = B(s_{t+1})\Sigma_{hh} = \Sigma_{hv}^{\mathsf{T}}, \qquad \mu_v = B(s_{t+1})A(s_{t+1})f(i_t, s_t), \qquad \mu_h = A(s_{t+1})f(i_t, s_t). \qquad (8.7)$$

These results are obtained from integrating the exact dynamics over $h_t$, using the results in Section 1.4.7. To find $q(h_{t+1}|s_t, i_t, s_{t+1}, v_{1:t+1})$ we condition $q(h_{t+1}, v_{t+1}|s_t, i_t, s_{t+1}, v_{1:t})$ on $v_{t+1}$ using the standard Gaussian conditioning formula, Section 1.4.7, to obtain $q(h_{t+1}|s_t, i_t, s_{t+1}, v_{1:t+1}) = \mathcal{N}\left(h_{t+1} \mid \mu_{h|v}, \Sigma_{h|v}\right)$ with

$$\mu_{h|v} = \mu_h + \Sigma_{hv}\Sigma_{vv}^{-1}\left(v_{t+1} - \mu_v\right), \qquad \Sigma_{h|v} = \Sigma_{hh} - \Sigma_{hv}\Sigma_{vv}^{-1}\Sigma_{vh}.$$
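The following is a minimal NumPy sketch of this single LDS forward step, under the zero-mean assumption used in the text; the helper name `lds_forward_step` stands in for the chapter's LDSFORWARD routine and is not from the source:

```python
import numpy as np

def lds_forward_step(f, F, v_next, A, B, Sigma_h, Sigma_v):
    """One LDS forward step for a fixed component (i_t, s_t) and switch s_{t+1}:
    propagate N(h_t | f, F) through the dynamics to get Eq. (8.7), then
    condition on v_{t+1}. Also returns log q(v_{t+1} | i_t, s_t, s_{t+1}, v_{1:t})."""
    # Joint statistics of (h_{t+1}, v_{t+1}) given the past, Eq. (8.7)
    Shh = A @ F @ A.T + Sigma_h
    Svv = B @ Shh @ B.T + Sigma_v
    Svh = B @ Shh                        # = Shv.T
    mu_h = A @ f
    mu_v = B @ mu_h
    # Condition on v_{t+1} via the standard Gaussian conditioning formula
    K = np.linalg.solve(Svv, Svh).T      # Shv @ inv(Svv), Svv symmetric
    mu_hv = mu_h + K @ (v_next - mu_v)
    S_hv = Shh - K @ Svh
    # Log-density of v_{t+1} under N(mu_v, Svv): this component's evidence
    d = v_next - mu_v
    _, logdet = np.linalg.slogdet(2 * np.pi * Svv)
    logp = -0.5 * (d @ np.linalg.solve(Svv, d) + logdet)
    return mu_hv, S_hv, logp
```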

Evaluating the mixture weights $q(s_t, i_t|s_{t+1}, v_{1:t+1})$

The mixture weight in Eq. (8.6) can be found from

$$q(s_t, i_t|s_{t+1}, v_{1:t+1}) \propto q(v_{t+1}|i_t, s_t, s_{t+1}, v_{1:t})\, q(s_{t+1}|i_t, s_t, v_{1:t})\, q(i_t|s_t, v_{1:t})\, q(s_t|v_{1:t}).$$

The factor $q(v_{t+1}|i_t, s_t, s_{t+1}, v_{1:t})$ is Gaussian with mean $\mu_v$ and covariance $\Sigma_{vv}$, as given in Eq. (8.7). The factors $q(i_t|s_t, v_{1:t})$ and $q(s_t|v_{1:t})$ are given from the previous filtered iteration. Finally, $q(s_{t+1}|i_t, s_t, v_{1:t})$ is found from (where angled brackets denote expectation)

$$q(s_{t+1}|i_t, s_t, v_{1:t}) = \begin{cases} \left\langle p(s_{t+1}|h_t, s_t) \right\rangle_{q(h_t|i_t, s_t, v_{1:t})} & \text{augmented SLDS}, \\ p(s_{t+1}|s_t) & \text{standard SLDS}, \end{cases} \qquad (8.8)$$



Figure 8.2 Gaussian sum filtering. The leftmost column depicts the previous Gaussian mixture approximation $q(h_t, s_t|v_{1:t})$ for two states $S = 2$ (dashed and solid) and three mixture components $I = 3$. The area of the oval indicates the weight of the component. There are $S = 2$ different linear systems which take each of the components of the mixture into a new filtered state, the arrow indicating which dynamic system is used. After one time step each mixture component branches into a further $S$ components, so that the joint approximation $q(h_{t+1}, s_{t+1}|v_{1:t+1})$ contains $S^2 I$ components (middle column). To keep the representation computationally tractable, the mixture of Gaussians for each state $s_{t+1}$ is collapsed back to $I$ components. This means that each state needs to be approximated by a smaller $I$-component mixture of Gaussians. There are many ways to achieve this. A naive but computationally efficient approach is to simply ignore the lowest weight components, as depicted in the right column.

where the result above for the standard SLDS follows from the independence assumptions present in the standard SLDS. In the aSLDS, the term in Eq. (8.8) will generally need to be computed numerically. A simple approximation is to evaluate Eq. (8.8) at the mean value of the distribution $q(h_t|i_t, s_t, v_{1:t})$. To take covariance information into account, an alternative would be to draw samples from the Gaussian $q(h_t|i_t, s_t, v_{1:t})$ and thus approximate the average of $p(s_{t+1}|h_t, s_t)$ by sampling. Note that this does not equate Gaussian sum filtering with a sequential sampling procedure, such as particle filtering, Section 1.6.4. The sampling here is exact, for which no convergence issues arise.
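A sketch of these two options (mean evaluation versus exact sampling), under the assumption that the aSLDS supplies a callable switch transition $p(s_{t+1}|h_t, s_t)$; the function and argument names are illustrative:

```python
import numpy as np

def switch_predictive(p_s_given_h, f, F, n_samples=0, rng=None):
    """Approximate <p(s_{t+1}|h_t, s_t)>_{q(h_t|i_t, s_t, v_{1:t})}, Eq. (8.8),
    for the aSLDS. p_s_given_h(h) returns the length-S vector p(s_{t+1}|h_t=h, s_t);
    f, F are the mean and covariance of q(h_t|i_t, s_t, v_{1:t})."""
    if n_samples == 0:
        return p_s_given_h(f)              # simple evaluation at the mean
    rng = rng or np.random.default_rng()
    hs = rng.multivariate_normal(f, F, size=n_samples)   # exact sampling
    return np.mean([p_s_given_h(h) for h in hs], axis=0)
```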

Closing the recursion

We are now in a position to calculate Eq. (8.6). For each setting of the variable $s_{t+1}$, we have a mixture of $I \times S$ Gaussians. To prevent the number of components increasing exponentially, we numerically collapse $q(h_{t+1}|s_{t+1}, v_{1:t+1})$ back to $I$ Gaussians to form

$$q(h_{t+1}|s_{t+1}, v_{1:t+1}) \to \sum_{i_{t+1}=1}^{I} q(h_{t+1}|i_{t+1}, s_{t+1}, v_{1:t+1})\, q(i_{t+1}|s_{t+1}, v_{1:t+1}).$$

Any method of choice may be supplied to collapse a mixture to a smaller mixture. A straightforward approach is to repeatedly merge low-weight components, as explained in Section 8.3.4. In this way the new mixture coefficients $q(i_{t+1}|s_{t+1}, v_{1:t+1})$, $i_{t+1} \in \{1, \dots, I\}$, are defined. This completes the description of how to form a recursion for the continuous filtered posterior approximation $q(h_{t+1}|s_{t+1}, v_{1:t+1})$ in Eq. (8.5).

8.3.2 Discrete filtering

A recursion for the switch variable distribution in Eq. (8.5) is given by

$$q(s_{t+1}|v_{1:t+1}) \propto \sum_{i_t, s_t} q(s_{t+1}, i_t, s_t, v_{t+1}, v_{1:t}).$$

The right-hand side of the above equation is proportional to

$$\sum_{s_t, i_t} q(v_{t+1}|s_{t+1}, i_t, s_t, v_{1:t})\, q(s_{t+1}|i_t, s_t, v_{1:t})\, q(i_t|s_t, v_{1:t})\, q(s_t|v_{1:t}),$$

for which all terms have been computed during the recursion for $q(h_{t+1}|s_{t+1}, v_{1:t+1})$. We now have all the quantities required to compute the Gaussian sum approximation of the filtering forward pass. A schematic representation of Gaussian sum filtering is given in Fig. 8.2 and the pseudo-code is presented in Algorithm 8.1.



Algorithm 8.1 aSLDS forward pass. Approximate the filtered posterior $p(s_t|v_{1:t}) \equiv \alpha_t$, $p(h_t|s_t, v_{1:t}) \equiv \sum_{i_t} w_t(i_t, s_t)\, \mathcal{N}\left(h_t \mid f_t(i_t, s_t), F_t(i_t, s_t)\right)$. Also return the approximate log-likelihood $L \equiv \log p(v_{1:T})$. $I_t$ are the number of components in each Gaussian mixture approximation. We require $I_1 = 1$, $I_2 \le S$, $I_t \le S \times I_{t-1}$. $\theta(s) = \{A(s), B(s), \Sigma^h(s), \Sigma^v(s), \bar{h}(s), \bar{v}(s)\}$. See also Algorithm 1.1.

  for $s_1 \gets 1$ to $S$ do
    $\{f_1(1, s_1), F_1(1, s_1), \hat{p}\} = \text{LDSFORWARD}(0, 0, v_1; \theta(s_1))$
    $\alpha_1 \gets p(s_1)\hat{p}$
  end for
  for $t \gets 2$ to $T$ do
    for $s_t \gets 1$ to $S$ do
      for $i \gets 1$ to $I_{t-1}$, and $s \gets 1$ to $S$ do
        $\{\mu_{x|y}(i, s), \Sigma_{x|y}(i, s), \hat{p}\} = \text{LDSFORWARD}(f_{t-1}(i, s), F_{t-1}(i, s), v_t; \theta(s_t))$
        $p^*(s_t|i, s) \equiv \left\langle p(s_t|h_{t-1}, s_{t-1} = s) \right\rangle_{p(h_{t-1}|i_{t-1}=i, s_{t-1}=s, v_{1:t-1})}$
        $p'(s_t, i, s) \gets w_{t-1}(i, s)\, p^*(s_t|i, s)\, \alpha_{t-1}(s)\, \hat{p}$
      end for
      Collapse the $I_{t-1} \times S$ mixture of Gaussians defined by $\mu_{x|y}$, $\Sigma_{x|y}$ and weights $p(i, s|s_t) \propto p'(s_t, i, s)$ to a mixture with $I_t$ components, $p(h_t|s_t, v_{1:t}) \approx \sum_{i_t=1}^{I_t} p(i_t|s_t, v_{1:t})\, p(h_t|s_t, i_t, v_{1:t})$. This defines the new means $f_t(i_t, s_t)$, covariances $F_t(i_t, s_t)$ and mixture weights $w_t(i_t, s_t) \equiv p(i_t|s_t, v_{1:t})$.
      Compute $\alpha_t(s_t) \propto \sum_{i,s} p'(s_t, i, s)$
    end for
    normalise $\alpha_t$
    $L \gets L + \log \sum_{s_t, i, s} p'(s_t, i, s)$
  end for
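A compact sketch of this forward pass for the standard SLDS (where the term $p^*(s_t|i, s)$ reduces to the Markov transition) follows, assuming zero means, the `lds_forward_step` helper above and a `collapse_mixture` routine of the kind described in Section 8.3.4 below; the names and data layout are assumptions of this sketch, not part of the algorithm's specification:

```python
import numpy as np

def gaussian_sum_filter(v, theta, pi1, trans, I_max):
    """Gaussian sum filtering (sketch of Algorithm 8.1, standard SLDS).
    Returns comps[s] = [(w_t(i,s), f_t(i,s), F_t(i,s)), ...] at the final time,
    the switch posteriors alpha, and the approximate log-likelihood L."""
    T, S, H = len(v), len(pi1), theta["A"][0].shape[0]
    pars = lambda s: (theta["A"][s], theta["B"][s],
                      theta["Sigma_h"][s], theta["Sigma_v"][s])
    # t = 1: one component per state, from LDSFORWARD(0, 0, v_1; theta(s))
    comps, alpha = [], np.zeros(S)
    for s in range(S):
        mu, Sig, logp = lds_forward_step(np.zeros(H), np.zeros((H, H)),
                                         v[0], *pars(s))
        comps.append([(1.0, mu, Sig)])
        alpha[s] = pi1[s] * np.exp(logp)
    L = np.log(alpha.sum()); alpha /= alpha.sum()
    for t in range(1, T):
        new_comps, rho = [], np.zeros(S)
        for s_new in range(S):
            ws, mus, Sigs = [], [], []
            for s in range(S):
                for (w, f, F) in comps[s]:
                    mu, Sig, logp = lds_forward_step(f, F, v[t], *pars(s_new))
                    # p'(s_t, i, s) = w_{t-1}(i,s) p(s_t|s) alpha_{t-1}(s) p_hat
                    ws.append(w * trans[s, s_new] * alpha[s] * np.exp(logp))
                    mus.append(mu); Sigs.append(Sig)
            rho[s_new] = np.sum(ws)              # unnormalised alpha_t(s_new)
            new_comps.append(collapse_mixture(mus, Sigs,
                                              np.array(ws) / rho[s_new], I_max))
        comps, alpha = new_comps, rho / rho.sum()
        L += np.log(rho.sum())
    return comps, alpha, L
```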

8.3.3 The likelihood $p(v_{1:T})$

The likelihood $p(v_{1:T})$ may be found from

$$p(v_{1:T}) = \prod_{t=0}^{T-1} p(v_{t+1}|v_{1:t}), \qquad p(v_{t+1}|v_{1:t}) \approx \sum_{i_t, s_t, s_{t+1}} q(v_{t+1}|i_t, s_t, s_{t+1}, v_{1:t})\, q(s_{t+1}|i_t, s_t, v_{1:t})\, q(i_t|s_t, v_{1:t})\, q(s_t|v_{1:t}).$$

In the above expression, all terms have been computed when forming the recursion for the filtered posterior $q(h_{t+1}, s_{t+1}|v_{1:t+1})$.

8.3.4 Collapsing Gaussians

We wish to collapse a mixture of $N$ Gaussians

$$p(x) = \sum_{i=1}^{N} p_i\, \mathcal{N}\left(x \mid \mu_i, \Sigma_i\right) \qquad (8.9)$$



to a mixture of $K < N$ Gaussians. We present a simple method which has the advantage of computational efficiency, but the disadvantage that no spatial information about the mixture is used [20]. First we describe how to collapse a mixture to a single Gaussian $\mathcal{N}(x \mid \mu, \Sigma)$. This can be achieved by setting $\mu$ and $\Sigma$ to be the mean and covariance of the mixture distribution (8.9), that is

$$\mu = \sum_i p_i \mu_i, \qquad \Sigma = \sum_i p_i\left(\Sigma_i + \mu_i\mu_i^{\mathsf{T}}\right) - \mu\mu^{\mathsf{T}}.$$
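In code, this moment-matching collapse, together with the retain-and-merge heuristic described in the next paragraph, might look as follows (a sketch; the function names are illustrative):

```python
import numpy as np

def collapse_to_single(mus, Sigmas, ws):
    """Moment-match sum_i w_i N(x|mu_i, Sigma_i) to a single Gaussian
    with the mixture's mean and covariance."""
    ws, mus, Sigmas = np.asarray(ws), np.asarray(mus), np.asarray(Sigmas)
    mu = np.einsum('i,ij->j', ws, mus)
    Sigma = np.einsum('i,ijk->jk', ws,
                      Sigmas + np.einsum('ij,ik->ijk', mus, mus)) - np.outer(mu, mu)
    return mu, Sigma

def collapse_mixture(mus, Sigmas, ws, K):
    """Reduce an N-component mixture to at most K components by retaining the
    K-1 largest-weight Gaussians and moment-matching the remainder into one."""
    order = np.argsort(ws)[::-1]
    if len(ws) <= K:
        return [(ws[i], mus[i], Sigmas[i]) for i in order]
    keep, rest = order[:K - 1], order[K - 1:]
    w_rest = sum(ws[i] for i in rest)
    mu, Sig = collapse_to_single([mus[i] for i in rest],
                                 [Sigmas[i] for i in rest],
                                 [ws[i] / w_rest for i in rest])
    return [(ws[i], mus[i], Sigmas[i]) for i in keep] + [(w_rest, mu, Sig)]
```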

To collapse to a $K$-component mixture we may retain the $K - 1$ Gaussians with the largest mixture weights and merge the remaining $N - K + 1$ Gaussians into a single Gaussian using the above method. Alternative heuristics, such as recursively merging the two Gaussians with the lowest mixture weights, are also reasonable. More sophisticated methods which retain some spatial information would clearly be potentially useful. The method presented in [15] is a suitable approach which considers removing Gaussians which are spatially similar, thereby retaining a sense of diversity over the possible solutions.

8.3.5 Relation to other methods

Gaussian sum filtering can be considered a form of 'analytical particle filtering' in which, instead of point distributions (delta functions) being propagated, Gaussians are propagated. The collapse operation to a smaller number of Gaussians is analogous to resampling in particle filtering. Since a Gaussian is more expressive than a delta function, the Gaussian sum filter is generally an improved approximation technique over using point particles. See [3] for a numerical comparison. This Gaussian sum filtering method is an instance of assumed density filtering, see Section 1.5.2.

8.4 Expectation correction

Approximating the smoothed posterior $p(h_t, s_t|v_{1:T})$ is more involved than filtering and requires additional approximations. For this reason smoothing is more prone to failure, since there are more assumptions that need to be satisfied for the approximations to hold. The route we take here is to assume that a Gaussian sum filtered approximation has been carried out, and then approximate the $\gamma$ backward pass, analogous to that of Section 1.4.3. The exact backward pass for the SLDS reads

$$p(h_t, s_t|v_{1:T}) = \sum_{s_{t+1}} \int_{h_{t+1}} p(h_t, s_t|h_{t+1}, s_{t+1}, v_{1:t})\, p(h_{t+1}, s_{t+1}|v_{1:T}), \qquad (8.10)$$

where $p(s_{t+1}|v_{1:T})$ and $p(h_{t+1}|s_{t+1}, v_{1:T})$ are the discrete and continuous components of the smoothed posterior at the next time step. The recursion runs backwards in time, beginning with the initialisation $p(h_T, s_T|v_{1:T})$ set by the filtered result (at time $t = T$, the filtered and smoothed posteriors coincide). Apart from the fact that the number of mixture components will increase at each step, computing the integral over $h_{t+1}$ in Eq. (8.10) is problematic since the conditional distribution term is non-Gaussian in $h_{t+1}$. For this reason it is more useful in deriving an approximate recursion to begin with the exact relation

$$p(s_t, h_t|v_{1:T}) = \sum_{s_{t+1}} p(s_{t+1}|v_{1:T})\, p(h_t|s_t, s_{t+1}, v_{1:T})\, p(s_t|s_{t+1}, v_{1:T}),$$



Figure 8.3 The EC backpass approximates $p(h_{t+1}|s_{t+1}, s_t, v_{1:T})$ by $p(h_{t+1}|s_{t+1}, v_{1:T})$. The motivation for this is that $s_t$ influences $h_{t+1}$ only indirectly through $h_t$. However, $h_t$ will most likely be heavily influenced by $v_{1:t}$, so that not knowing the state of $s_t$ is likely to be of secondary importance. The light shaded node is the variable we wish to find the posterior for. The values of the dark shaded nodes are known, and the dashed node indicates a known variable which is assumed unknown in forming the approximation.

which can be expressed more directly in terms of the SLDS dynamics as

$$p(s_t, h_t|v_{1:T}) = \sum_{s_{t+1}} p(s_{t+1}|v_{1:T})\, \big\langle p(h_t|h_{t+1}, s_t, s_{t+1}, v_{1:t}) \big\rangle_{p(h_{t+1}|s_t, s_{t+1}, v_{1:T})}\, \big\langle p(s_t|h_{t+1}, s_{t+1}, v_{1:T}) \big\rangle_{p(h_{t+1}|s_{t+1}, v_{1:T})},$$

where the dependence of the first factor on $v_{t+1:T}$ has been cancelled using the conditional independencies of the model. In forming the recursion we assume access to the distribution $p(s_{t+1}, h_{t+1}|v_{1:T})$ from the future time step. However, we also require the distribution $p(h_{t+1}|s_t, s_{t+1}, v_{1:T})$, which is not directly known and needs to be inferred – a computationally challenging task. In the expectation correction (EC) approach [3] one assumes the approximation (see Fig. 8.3)

$$p(h_{t+1}|s_t, s_{t+1}, v_{1:T}) \approx p(h_{t+1}|s_{t+1}, v_{1:T}), \qquad (8.11)$$

resulting in an approximate recursion for the smoothed posterior

$$p(s_t, h_t|v_{1:T}) \approx \sum_{s_{t+1}} p(s_{t+1}|v_{1:T})\, \big\langle p(h_t|h_{t+1}, s_t, s_{t+1}, v_{1:t}) \big\rangle_{h_{t+1}}\, \big\langle p(s_t|h_{t+1}, s_{t+1}, v_{1:T}) \big\rangle_{h_{t+1}},$$

where $\langle \cdot \rangle_{h_{t+1}}$ represents averaging with respect to the distribution $p(h_{t+1}|s_{t+1}, v_{1:T})$. In carrying out the above approximate recursion, we will end up with a mixture of Gaussians that grows at each time step. To avoid the exponential explosion problem, we use a finite mixture approximation, $p(h_{t+1}, s_{t+1}|v_{1:T}) \approx q(h_{t+1}, s_{t+1}|v_{1:T}) = q(h_{t+1}|s_{t+1}, v_{1:T})\, q(s_{t+1}|v_{1:T})$, and plug this into the recursion above. A recursion for the approximation is given by

$$q(h_t, s_t|v_{1:T}) = \sum_{s_{t+1}} q(s_{t+1}|v_{1:T})\, \big\langle q(h_t|h_{t+1}, s_t, s_{t+1}, v_{1:t}) \big\rangle_{q(h_{t+1}|s_{t+1}, v_{1:T})}\, \big\langle q(s_t|h_{t+1}, s_{t+1}, v_{1:t}) \big\rangle_{q(h_{t+1}|s_{t+1}, v_{1:T})}. \qquad (8.12)$$

As for filtering, wherever possible, we replace approximate terms by their exact counterparts and parameterise the posterior using

$$q(h_{t+1}, s_{t+1}|v_{1:T}) = q(h_{t+1}|s_{t+1}, v_{1:T})\, q(s_{t+1}|v_{1:T}). \qquad (8.13)$$

To reduce the notational burden, here we outline the method only for the case of using a single component approximation in both the forward and backward passes. The extension to using a mixture to approximate each $p(h_{t+1}|s_{t+1}, v_{1:T})$ is conceptually straightforward and deferred to Section 8.4.5. In the single Gaussian case we assume we have a Gaussian approximation available for $q(h_{t+1}|s_{t+1}, v_{1:T}) = \mathcal{N}\left(h_{t+1} \mid g(s_{t+1}), G(s_{t+1})\right)$, which below is 'propagated' backwards in time using the smoothing recursion.



8.4.1 Continuous smoothing

For given $s_t, s_{t+1}$, a recursion for the smoothed continuous variable is obtained using

$$q(h_t|s_t, s_{t+1}, v_{1:T}) = \int_{h_{t+1}} p(h_t|h_{t+1}, s_t, s_{t+1}, v_{1:t})\, q(h_{t+1}|s_{t+1}, v_{1:T}). \qquad (8.14)$$

In forming the recursion, we assume that we know the distribution $q(h_{t+1}|s_{t+1}, v_{1:T}) = \mathcal{N}\left(h_{t+1} \mid g(s_{t+1}), G(s_{t+1})\right)$. To compute Eq. (8.14) we then perform a single update of the LDS backward recursion, Algorithm 1.2.

8.4.2 Discrete smoothing

The second average in Eq. (8.12) corresponds to a recursion for the discrete variable and is given by

$$\big\langle q(s_t|h_{t+1}, s_{t+1}, v_{1:t}) \big\rangle_{q(h_{t+1}|s_{t+1}, v_{1:T})} \equiv q(s_t|s_{t+1}, v_{1:T}).$$

This average cannot be computed in closed form. The simplest approach is to approximate it by evaluation at the mean,¹ that is

$$\big\langle q(s_t|h_{t+1}, s_{t+1}, v_{1:t}) \big\rangle_{q(h_{t+1}|s_{t+1}, v_{1:T})} \approx q(s_t|h_{t+1}, s_{t+1}, v_{1:t})\Big|_{h_{t+1} = \langle h_{t+1}|s_{t+1}, v_{1:T}\rangle},$$

where $\langle h_{t+1}|s_{t+1}, v_{1:T}\rangle$ is the mean of $h_{t+1}$ with respect to $q(h_{t+1}|s_{t+1}, v_{1:T})$. This gives the approximation

$$\big\langle q(s_t|h_{t+1}, s_{t+1}, v_{1:t}) \big\rangle_{q(h_{t+1}|s_{t+1}, v_{1:T})} \approx \frac{1}{Z}\, \frac{e^{-\frac{1}{2} z_{t+1}^{\mathsf{T}}(s_t, s_{t+1})\, \Sigma^{-1}(s_t, s_{t+1}|v_{1:t})\, z_{t+1}(s_t, s_{t+1})}}{\sqrt{\det\left(\Sigma(s_t, s_{t+1}|v_{1:t})\right)}}\, q(s_t|s_{t+1}, v_{1:t}),$$

where $z_{t+1}(s_t, s_{t+1}) \equiv \langle h_{t+1}|s_{t+1}, v_{1:T}\rangle - \langle h_{t+1}|s_t, s_{t+1}, v_{1:t}\rangle$, and $Z$ ensures normalisation over $s_t$; $\Sigma(s_t, s_{t+1}|v_{1:t})$ is the filtered covariance of $h_{t+1}$ given $s_t, s_{t+1}$ and the observations $v_{1:t}$, which may be taken from $\Sigma_{hh}$ in Eq. (8.7). Approximations which take covariance information into account can also be considered, although the above method may suffice in practice [3, 17].

¹In general this approximation has the form $\langle f(x) \rangle_{p(x)} \approx f\left(\langle x \rangle_{p(x)}\right)$.
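A sketch of this mean-evaluation step for a fixed $s_{t+1}$; the function and argument names are chosen here for illustration:

```python
import numpy as np

def ec_switch_weights(g_next, mu_pred, Sigma_pred, q_st_given_next):
    """Approximate q(s_t | s_{t+1}, v_{1:T}) by evaluation at the mean.
    g_next: <h_{t+1} | s_{t+1}, v_{1:T}>; mu_pred[s], Sigma_pred[s]: filtered
    mean/covariance of h_{t+1} given s_t = s (Sigma_hh in Eq. (8.7));
    q_st_given_next[s] = q(s_t = s | s_{t+1}, v_{1:t})."""
    S = len(q_st_given_next)
    w = np.zeros(S)
    for s in range(S):
        z = g_next - mu_pred[s]                       # z_{t+1}(s_t, s_{t+1})
        _, logdet = np.linalg.slogdet(Sigma_pred[s])
        w[s] = np.exp(-0.5 * (z @ np.linalg.solve(Sigma_pred[s], z) + logdet)) \
               * q_st_given_next[s]
    return w / w.sum()                                # Z normalises over s_t
```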

8.4.3 Collapsing the mixture

From Sections 8.4.1 and 8.4.2 we now have all the terms to compute the approximation to Eq. (8.12). As for the filtering, however, the number of mixture components is multiplied by $S$ at each iteration. To prevent an exponential explosion of components, the mixture is then collapsed to a single Gaussian, $q(h_t, s_t|v_{1:T}) = q(h_t|s_t, v_{1:T})\, q(s_t|v_{1:T})$. The collapse to a mixture is discussed in Section 8.4.5.



8.4.4 Relation to other methods

A classical smoothing approximation for the SLDS is generalised pseudo Bayes (GPB) [2, 12, 11]. GPB makes the approximation

$$p(s_t|s_{t+1}, v_{1:T}) \approx p(s_t|s_{t+1}, v_{1:t}),$$

which depends only on the filtered posterior for $s_t$ and does not include any information passing through the variable $h_{t+1}$. Since $p(s_t|s_{t+1}, v_{1:t}) \propto p(s_{t+1}|s_t)\, p(s_t|v_{1:t})$, computing the smoothed recursion for the switch states in GPB is equivalent to running the RTS backward pass on a hidden Markov model, independently of the backward recursion for the continuous variables:

$$p(s_t|v_{1:T}) = \sum_{s_{t+1}} p(s_t, s_{t+1}|v_{1:T}) = \sum_{s_{t+1}} p(s_t|s_{t+1}, v_{1:T})\, p(s_{t+1}|v_{1:T}) \approx \sum_{s_{t+1}} p(s_t|s_{t+1}, v_{1:t})\, p(s_{t+1}|v_{1:T}) = \sum_{s_{t+1}} \frac{p(s_{t+1}|s_t)\, p(s_t|v_{1:t})}{\sum_{s_t'} p(s_{t+1}|s_t')\, p(s_t'|v_{1:t})}\, p(s_{t+1}|v_{1:T}).$$
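This HMM-style backward pass depends only on the filtered switch posteriors and the transition matrix, so it can be stated in a few lines; a sketch (function name and array layout are assumptions):

```python
import numpy as np

def gpb_switch_smoother(alpha, trans):
    """GPB smoothed switch posteriors from filtered ones alone.
    alpha[t, s] = p(s_t|v_{1:t}); trans[s, s'] = p(s_{t+1}=s'|s_t=s)."""
    T, S = alpha.shape
    gamma = np.zeros_like(alpha)
    gamma[-1] = alpha[-1]                 # filtered = smoothed at t = T
    for t in range(T - 2, -1, -1):
        joint = trans * alpha[t][:, None]                  # p(s'|s) p(s_t=s|v_{1:t})
        cond = joint / joint.sum(axis=0, keepdims=True)    # p(s_t=s|s_{t+1}=s', v_{1:t})
        gamma[t] = cond @ gamma[t + 1]
    return gamma
```

Note that no continuous-variable quantities enter this routine at all, which is precisely the limitation discussed next.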

The only information the GPB method uses to form the smoothed distribution $p(s_t|v_{1:T})$ from the filtered distribution $p(s_t|v_{1:t})$ is the Markov switch transition $p(s_{t+1}|s_t)$. This approximation discounts some information from the future, since information passed via the continuous variables is not taken into account. In contrast to GPB, EC preserves future information passing through the continuous variables.

8.4.5 Using mixtures in the expectation correction backward pass

The extension to the mixture case is straightforward, based on the representation

$$p(h_t|s_t, v_{1:T}) \approx \sum_{j_t=1}^{J} q(j_t|s_t, v_{1:T})\, q(h_t|j_t, s_t, v_{1:T}).$$

Analogously to the case with a single component,

$$q(h_t, s_t|v_{1:T}) = \sum_{i_t, j_{t+1}, s_{t+1}} p(s_{t+1}|v_{1:T})\, p(j_{t+1}|s_{t+1}, v_{1:T})\, q(h_t|j_{t+1}, s_{t+1}, i_t, s_t, v_{1:T})\, \big\langle q(i_t, s_t|h_{t+1}, j_{t+1}, s_{t+1}, v_{1:t}) \big\rangle_{q(h_{t+1}|j_{t+1}, s_{t+1}, v_{1:T})}.$$

The average in the last line of the above equation can be tackled using the same techniques as outlined in the single Gaussian case. To approximate $q(h_t|j_{t+1}, s_{t+1}, i_t, s_t, v_{1:T})$ we consider this as the marginal of the joint distribution

$$q(h_t, h_{t+1}|i_t, s_t, j_{t+1}, s_{t+1}, v_{1:T}) = q(h_t|h_{t+1}, i_t, s_t, j_{t+1}, s_{t+1}, v_{1:t})\, q(h_{t+1}|i_t, s_t, j_{t+1}, s_{t+1}, v_{1:T}).$$

As in the case of a single mixture, the problematic term is $q(h_{t+1}|i_t, s_t, j_{t+1}, s_{t+1}, v_{1:T})$. Analogously to Eq. (8.11), we make the assumption

$$q(h_{t+1}|i_t, s_t, j_{t+1}, s_{t+1}, v_{1:T}) \approx q(h_{t+1}|j_{t+1}, s_{t+1}, v_{1:T}),$$



meaning that information about the current switch state $s_t, i_t$ is ignored. We can then form

$$p(h_t|s_t, v_{1:T}) = \sum_{i_t, j_{t+1}, s_{t+1}} p(i_t, j_{t+1}, s_{t+1}|s_t, v_{1:T})\, p(h_t|i_t, s_t, j_{t+1}, s_{t+1}, v_{1:T}).$$

This mixture can then be collapsed to a smaller mixture to give

$$p(h_t|s_t, v_{1:T}) \approx \sum_{j_t} q(j_t|s_t, v_{1:T})\, q(h_t|j_t, v_{1:T}).$$

The resulting procedure is sketched in Algorithm 8.2, including using mixtures in both forward and backward passes.

Algorithm 8.2 aSLDS: EC backward pass. Approximates $p(s_t|v_{1:T})$ and $p(h_t|s_t, v_{1:T}) \equiv \sum_{j_t=1}^{J_t} u_t(j_t, s_t)\, \mathcal{N}\left(h_t \mid g_t(j_t, s_t), G_t(j_t, s_t)\right)$ using a mixture of Gaussians. $J_T = I_T$, $J_t \le S \times I_t \times J_{t+1}$. This routine needs the results from Algorithm 8.1.

  $G_T \gets F_T$, $g_T \gets f_T$, $u_T \gets w_T$
  for $t \gets T - 1$ to $1$ do
    for $s \gets 1$ to $S$, $s' \gets 1$ to $S$, $i \gets 1$ to $I_t$, $j' \gets 1$ to $J_{t+1}$ do
      $(\mu, \Sigma)(i, s, j', s') = \text{LDSBACKWARD}(g_{t+1}(j', s'), G_{t+1}(j', s'), f_t(i, s), F_t(i, s), \theta(s'))$
      $p(i_t, s_t|j_{t+1}, s_{t+1}, v_{1:T}) = \left\langle p(s_t = s, i_t = i|h_{t+1}, s_{t+1} = s', j_{t+1} = j', v_{1:t}) \right\rangle_{p(h_{t+1}|s_{t+1}=s', j_{t+1}=j', v_{1:T})}$
      $p(i, s, j', s'|v_{1:T}) \gets p(s_{t+1} = s'|v_{1:T})\, u_{t+1}(j', s')\, p(i_t, s_t|j_{t+1}, s_{t+1}, v_{1:T})$
    end for
    for $s_t \gets 1$ to $S$ do
      Collapse the mixture defined by weights $p(i_t = i, s_{t+1} = s', j_{t+1} = j'|s_t, v_{1:T}) \propto p(i, s_t, j', s'|v_{1:T})$, means $\mu(i_t, s_t, j_{t+1}, s_{t+1})$ and covariances $\Sigma(i_t, s_t, j_{t+1}, s_{t+1})$ to a mixture with $J_t$ components. This defines the new means $g_t(j_t, s_t)$, covariances $G_t(j_t, s_t)$ and mixture weights $u_t(j_t, s_t)$.
      $p(s_t|v_{1:T}) \gets \sum_{i_t, j', s'} p(i_t, s_t, j', s'|v_{1:T})$
    end for
  end for

8.5 Demonstration: traffic flow

An illustration of modelling and inference with an SLDS is given by the following network of traffic flow problem, Fig. 8.4. Here there are four junctions $a, b, c, d$, and traffic flows along the roads in the direction indicated. Traffic flows into junction $a$ and then goes via different routes to $d$. Flow out of a junction must match the flow into a junction (up to noise). There are traffic light switches at junctions $a$ and $b$ which, depending on their state, route traffic differently along the roads. Using $\phi$ to denote flow, we model the flows using the switching linear system

$$\begin{pmatrix} \phi_a(t) \\ \phi_{a\to d}(t) \\ \phi_{a\to b}(t) \\ \phi_{b\to d}(t) \\ \phi_{b\to c}(t) \\ \phi_{c\to d}(t) \end{pmatrix} = \begin{pmatrix} \phi_a(t-1) \\ \phi_a(t-1)\left(0.75\, \mathbb{I}\left[s^a(t)=1\right] + 1\, \mathbb{I}\left[s^a(t)=2\right]\right) \\ \phi_a(t-1)\left(0.25\, \mathbb{I}\left[s^a(t)=1\right] + 1\, \mathbb{I}\left[s^a(t)=3\right]\right) \\ \phi_{a\to b}(t-1)\, 0.5\, \mathbb{I}\left[s^b(t)=1\right] \\ \phi_{a\to b}(t-1)\left(0.5\, \mathbb{I}\left[s^b(t)=1\right] + 1\, \mathbb{I}\left[s^b(t)=2\right]\right) \\ \phi_{b\to c}(t-1) \end{pmatrix}$$



Figure 8.4 A representation of the traffic flow between junctions $a, b, c, d$, with traffic lights at $a$ and $b$. If $s^a = 1$, $a \to d$ and $a \to b$ carry 0.75 and 0.25 of the flow out of $a$ respectively. If $s^a = 2$ all the flow from $a$ goes through $a \to d$; for $s^a = 3$, all the flow goes through $a \to b$. For $s^b = 1$ the flow out of $b$ is split equally between $b \to d$ and $b \to c$. For $s^b = 2$ all flow out of $b$ goes along $b \to c$.

Figure 8.5 Time evolution of the traffic flow measured at two points in the network. Sensors measure the total flow into the network, $\phi_a(t)$, and the total flow out of the network, $\phi_d(t) = \phi_{a\to d}(t) + \phi_{b\to d}(t) + \phi_{c\to d}(t)$. The total inflow at $a$ undergoes a random walk. Note that the flow measured at $d$ can momentarily drop to zero if all traffic is routed through $a \to b \to c$ for two time steps.

By identifying the flows at time $t$ with a six-dimensional vector hidden variable $h_t$, we can write the above flow equations as

$$h_t = A(s_t)h_{t-1} + \eta^h_t$$

for a set of suitably defined matrices $A(s)$ indexed by the switch variable $s = (s^a, s^b)$, which takes $3 \times 2 = 6$ states. We additionally include noise terms to model cars parking or de-parking during a single time frame. The covariance $\Sigma^h$ is diagonal, with a larger variance at the inflow point $a$ to model that the total volume of traffic entering the system can vary. Noisy measurements of the flow into the network are taken at $a$,

$$v_{1,t} = \phi_a(t) + \eta^v_1(t),$$

along with a noisy measurement of the total flow out of the system at $d$,

$$v_{2,t} = \phi_d(t) = \phi_{a\to d}(t) + \phi_{b\to d}(t) + \phi_{c\to d}(t) + \eta^v_2(t).$$

The observation model can be represented by $v_t = Bh_t + \eta^v_t$ using a constant $2 \times 6$ projection matrix $B$. This is clearly a very crude model of traffic flows since in a real system one cannot have negative flows. Nevertheless, it serves to demonstrate the principles of modelling and inference using switching models. The switch variables follow a simple Markov transition $p(s_t|s_{t-1})$ which biases the switches to remain in the same state in preference to jumping to another state. Given the above system and a prior which initialises all flow at $a$, we draw samples from the model using forward (ancestral) sampling and form the observations $v_{1:100}$, Fig. 8.5. Using only the observations and the known model structure, we then attempt to infer the latent switch variables and traffic flows using Gaussian sum filtering with $I = 2$ and EC with $J = 1$ mixture components per switch state, Fig. 8.6.
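Under the flow equations above, the six transition matrices $A(s)$ and the projection $B$ can be constructed directly; a sketch (the state ordering and function name are conventions of this sketch):

```python
import numpy as np

def traffic_A(sa, sb):
    """Transition matrix A(s) for the traffic network, with hidden state
    h = (phi_a, phi_ad, phi_ab, phi_bd, phi_bc, phi_cd) and switch
    s = (sa, sb), sa in {1,2,3}, sb in {1,2}."""
    A = np.zeros((6, 6))
    A[0, 0] = 1.0                                # phi_a: random walk (plus noise)
    A[1, 0] = {1: 0.75, 2: 1.0, 3: 0.0}[sa]      # flow a -> d
    A[2, 0] = {1: 0.25, 2: 0.0, 3: 1.0}[sa]      # flow a -> b
    A[3, 2] = 0.5 if sb == 1 else 0.0            # flow b -> d
    A[4, 2] = {1: 0.5, 2: 1.0}[sb]               # flow b -> c
    A[5, 4] = 1.0                                # flow c -> d
    return A

# Observation: inflow at a, and total outflow at d
B = np.zeros((2, 6))
B[0, 0] = 1.0                 # v_1 = phi_a + noise
B[1, [1, 3, 5]] = 1.0         # v_2 = phi_ad + phi_bd + phi_cd + noise
```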

8.6 Comparison of smoothing techniques

In order to demonstrate the potential advantages of Gaussian sum approximate inference, we chose a problem that, from the viewpoint of classical signal processing, is difficult, with changes in the switches not identifiable by simply eyeballing a short-time Fourier transform of the signal. The setup is described in Fig. 8.7.




Figure 8.6 Given the observations from Fig. 8.5 we infer the flows and switch states of all the latent variables. (a) The correct latent flows through time, along with the switch variable states used to generate the data. (b) Filtered flows based on an $I = 2$ Gaussian sum forward pass approximation. Plotted are the six components of the vector $\langle h_t|v_{1:t}\rangle$, with the posterior distributions of the $s^a$ and $s^b$ traffic light states, $p(s^a_t|v_{1:t})$ and $p(s^b_t|v_{1:t})$, plotted below. (c) Smoothed flows $\langle h_t|v_{1:T}\rangle$ and corresponding smoothed switch states $p(s_t|v_{1:T})$ using EC with $J = 1$.

Figure 8.7 A sample from the inference problem for comparison of smoothing techniques. This scalar observation data ($V = 1$) of $T = 100$ time steps is generated from an LDS with two switch states, $S = 2$: $A(s) = 0.9999 \times \mathrm{orth}(\mathrm{randn}(H, H))$, $B(s) = \mathrm{randn}(V, H)$, $\bar{v}_t \equiv 0$, $\bar{h}_1 = 10 \times \mathrm{randn}(H, 1)$, $\bar{h}_{t>1} = 0$, $\Sigma^h_1 = I_H$, $p_1$ uniform. The output bias is zero. The hidden space has dimension $H = 30$, with low transition noise, $\Sigma^h(s) = 0.01 I_H$, and $p(s_{t+1}|s_t) \propto \mathbf{1}_{S \times S}$. The output noise is relatively high, $\Sigma^v(s) = 30 I_V$.
The latent trajectory $h$ ($H = 30$) is near deterministic; however, the noise in the observations ($V = 1$) produces high uncertainty as to where this trajectory is, needing to track multiple hypotheses until there is sufficient data to reliably reveal which of the likely hypotheses is correct. For this reason, the filtered distribution is typically strongly multimodal, and methods that do not address this perform poorly. The transition and emission matrices were sampled at random (see Fig. 8.7 for details); each random sampling of parameters generated a problem instance on which the rival algorithms were evaluated. We sequentially generated hidden states $h_t, s_t$ and observations $v_t$ from the known SLDS and then, given only the model and the observations, the task was to infer $p(h_t, s_t|v_{1:T})$. Since the exact computation is exponential in $T$, a formal evaluation of the method is infeasible. A simple alternative is to assume that the original sample states $s_{1:T}$ are the 'correct' inferred states, and compare the most probable posterior smoothed estimates $\arg\max_{s_t} p(s_t|v_{1:T})$ with the assumed correct sample $s_t$. Performance over a set of problem instances can then be assessed. All deterministic algorithms were initialised using the corresponding filtered results, as was Gibbs sampling. The running time of all algorithms was set to be roughly equal to that of EC using $I = J = 4$ mixture components per switch state.

8.6.1 Filtering as approximate smoothing

In Fig. 8.8 we present the accuracy of techniques that use filtering $p(s_t|v_{1:t})$ as an approximation of the smoothed posterior $p(s_t|v_{1:T})$. Many of the smoothing approaches we



Figure 8.8 The number of errors in estimating a binary switch $p(s_t|v_{1:T})$ over a time series of length $T = 100$, for the problem described in Fig. 8.7; 50 errors corresponds to random guessing. Plotted are histograms of the errors over 1000 experiments. (PF) Particle filter. (RBPF) Rao–Blackwellised PF. (EP) Expectation propagation. (GSFS) Gaussian sum filtering using a single Gaussian. (KimS) Kim's smoother, i.e. GPB, using the results from GSFS. (ECS) Expectation correction using a single Gaussian ($I = J = 1$). (GSFM) GSF using a mixture of $I = 4$ Gaussians. (KimM) Kim's smoother using the results from GSFM. (ECM) Expectation correction using a mixture with $I = J = 4$ components. In Gibbs sampling, we use the initialisation from GSFM.

examined are initialised with filtered approximations and it is therefore interesting to see the potential improvement that each smoothing algorithm makes over the filtered estimates. In each case the histogram of errors in estimating the switch variable $s_t$ for $t$ from 1 to 50 is presented, based on a set of 1000 realisations of the observed sequence.

Particle filter  We included the particle filter (PF) for comparison with Gaussian sum filtering (GSF) and to demonstrate the baseline performance of using filtering to approximate the smoothed posterior. For the PF, 1000 particles were used, with Kitagawa resampling [14]. In this case the PF performs poorly due to the difficulty of representing continuous high-dimensional posterior distributions ($H = 30$) using a limited number of point particles, and the underlying use of importance sampling in generating the samples. The Rao–Blackwellised particle filter (RBPF) [7] attempts to alleviate the difficulty of sampling in high-dimensional spaces by approximate integration over the continuous hidden variable. For the RBPF, 500 particles were used, with Kitagawa resampling. We were unable to improve on the standard PF results using this technique, most likely due to the fact that the available RBPF implementation assumed a single Gaussian per switch state.

Gaussian sum filtering  Whilst our interest is primarily in smoothing, for EC and GPB we need to initialise the smoothing algorithm with filtered estimates. Using a single Gaussian (GSFS) gives reasonable results, though using a mixture with $I = 4$ components per switch state (GSFM) improves on the single component case considerably, indicating the multimodal nature of the filtered posterior. Not surprisingly, the best filtered results are given using GSF, since this is better able to represent the variance in the filtered posterior than the PF sampling methods.

8.6.2 Approximate smoothing

Generalised pseudo Bayes  For this problem, Kim's GPB method does not improve on the Gaussian sum filtered results, using either a mixture $I = 4$ (KimM) or a single component $I = 1$ (KimS). In this case the GPB approximation of ignoring all future observations, provided only the future switch state $s_{t+1}$ is known, is too severe.



Expectation correction  For EC we use the mean approximation for the numerical integration of Eq. (8.12). In ECS, a single Gaussian per switch state is used ($I = J = 1$), with the backward recursion initialised with the filtered posterior. The method improves on GSFS, though the performance is still limited due to only using a single component. When using multiple components, $I = J = 4$, ECM has excellent performance, with a near perfect estimation of the most likely smoothed posterior class. One should bear in mind that EC, GPB, Gibbs and EP are initialised with the same GSF filtered distributions.

Gibbs sampling  Gibbs sampling provides an alternative non-deterministic procedure [4]. For a fixed state sequence $s_{1:T}$, $p(v_{1:T}|s_{1:T})$ is easily computable since this is just the likelihood of an LDS with transitions and emissions specified by the assumed given switch states. We therefore attempt to sample from the posterior $p(s_{1:T}|v_{1:T}) \propto p(v_{1:T}|s_{1:T})\, p(s_{1:T})$ directly. We implemented a standard single-site Gibbs sampler in which all switch variables are fixed except $s_t$, and then a sample from $p(s_t|s_{\setminus t}, v_{1:T})$ is drawn, with 100 update sweeps running forwards from the start to the end time and then in the reverse direction. The average of the final 80 forward-backward sweeps was used to estimate the smoothed posterior. Such a procedure may work well provided that the initial setting of $s_{1:T}$ is in a region of high probability – otherwise sampling by such individual coordinate updates may be extremely inefficient. For this case, even if the Gibbs sampler is initialised in a reasonable way (using the results from GSFM), the sampler is unable to deal with the strong temporal correlations in the posterior and on average degrades the filtering results.
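A naive sketch of one such single-site sweep, assuming a helper `loglik` that evaluates $\log p(v_{1:T}|s_{1:T})$ by running an LDS filter for the fixed switch path; note that this recomputes the full likelihood for every candidate state, which is part of why such coordinate updates are expensive:

```python
import numpy as np

def gibbs_sweep(s, v, loglik, trans, rng):
    """One forward single-site Gibbs sweep over the switch path s_{1:T}:
    resample each s_t from p(s_t|s_{\t}, v_{1:T}) ∝ p(v_{1:T}|s_{1:T}) p(s_{1:T})."""
    T, S = len(s), trans.shape[0]
    for t in range(T):
        logp = np.zeros(S)
        for k in range(S):
            s[t] = k
            logp[k] = loglik(s)                        # log p(v_{1:T} | s_{1:T})
            if t > 0:
                logp[k] += np.log(trans[s[t - 1], k])  # prior p(s_t | s_{t-1})
            if t < T - 1:
                logp[k] += np.log(trans[k, s[t + 1]])  # prior p(s_{t+1} | s_t)
        p = np.exp(logp - logp.max()); p /= p.sum()
        s[t] = rng.choice(S, p=p)
    return s
```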

Expectation propagation  Expectation propagation (EP) [18] has had considerable success in approximate inference in graphical models, and has been ported to approximate inference in the SLDS (see Chapter 7). For our particular problem, we found EP to be numerically unstable, often struggling to converge.² To encourage convergence, we used the damping method in [10], performing 20 iterations with a damping factor of 0.5. Nevertheless, the disappointing performance of EP is most likely due to conflicts resulting from numerical instabilities introduced by the frequent conversions between moment and canonical representations. In addition, the current implementation of EP assumes that the posterior is well represented by a single Gaussian per switch state, and the inherent multimodality of the posterior in this experiment may render this assumption invalid.

²Generalised EP [21], which groups variables together, improves on the results but is still far inferior to the EC results presented here – Onno Zoeter, personal communication.

8.7 Summary

Exact inference in the class of switching linear dynamical systems is computationally difficult. Whilst deriving exact filtering and smoothing recursions is straightforward, representing the filtered and smoothed posterior for time step $t$ requires a mixture of Gaussians with a number of components scaling exponentially with $t$. This suggests the Gaussian sum class of approximations in which a limited number of Gaussian components is retained. We discussed how both filtering and smoothing recursions can be derived for approximations in which multiple Gaussians are assumed per switch state, extending on previous approximations which assume a single Gaussian per switch state. In extreme cases we showed how using such mixture approximations of the switch conditional posterior can improve the accuracy of the approximation considerably. Our Gaussian sum smoothing approach, expectation correction, can be viewed as a form of analytic particle smoothing.


Rather than propagating point distributions (delta functions), as in the particle approaches, EC propagates Gaussians, which are more able to represent the variability of the distribution, particularly in high dimensions. An important consideration in time series models is numerical stability. Expectation correction uses standard forward and backward linear dynamical system message updates and is therefore able to take advantage of well-studied methods for numerically stable inference. Our smoothing technique has been successfully applied to time series of length $O(10^5)$ without numerical difficulty [17]. Code for implementing EC is available from the author's website and is part of the BRMLtoolbox.

Bibliography

[1] D. L. Alspach and H. W. Sorenson. Nonlinear Bayesian estimation using Gaussian sum approximations. IEEE Transactions on Automatic Control, 17(4):439–448, 1972.
[2] Y. Bar-Shalom and X.-R. Li. Estimation and Tracking: Principles, Techniques and Software. Artech House, 1998.
[3] D. Barber. Expectation correction for smoothing in switching linear Gaussian state space models. Journal of Machine Learning Research, 7:2515–2540, 2006.
[4] C. Carter and R. Kohn. Markov chain Monte Carlo in conditionally Gaussian state space models. Biometrika, 83:589–601, 1996.
[5] A. T. Cemgil, B. Kappen and D. Barber. A generative model for music transcription. IEEE Transactions on Audio, Speech and Language Processing, 14(2):679–694, 2006.
[6] S. Chib and M. Dueker. Non-Markovian regime switching with endogenous states and time-varying state strengths. Econometric Society 2004 North American Summer Meetings 600, Econometric Society, August 2004.
[7] A. Doucet, N. de Freitas, K. Murphy and S. Russell. Rao-Blackwellised particle filtering for dynamic Bayesian networks. In Uncertainty in Artificial Intelligence, 2000.
[8] S. Frühwirth-Schnatter. Finite Mixture and Markov Switching Models. Springer, 2006.
[9] Z. Ghahramani and G. E. Hinton. Variational learning for switching state-space models. Neural Computation, 12(4):963–996, 1998.
[10] T. Heskes and O. Zoeter. Expectation propagation for approximate inference in dynamic Bayesian networks. In A. Darwiche and N. Friedman, editors, Uncertainty in Artificial Intelligence, pages 216–223, 2002.
[11] C.-J. Kim. Dynamic linear models with Markov-switching. Journal of Econometrics, 60:1–22, 1994.
[12] C.-J. Kim and C. R. Nelson. State-Space Models with Regime Switching. MIT Press, 1999.
[13] G. Kitagawa. The two-filter formula for smoothing and an implementation of the Gaussian-sum smoother. Annals of the Institute of Statistical Mathematics, 46(4):605–623, 1994.
[14] G. Kitagawa. Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. Journal of Computational and Graphical Statistics, 5(1):1–25, 1996.
[15] U. Lerner, R. Parr, D. Koller and G. Biswas. Bayesian fault detection and diagnosis in dynamic systems. In Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI-00), pages 531–537, 2000.
[16] U. N. Lerner. Hybrid Bayesian networks for reasoning about complex systems. PhD thesis, Stanford University, 2002.
[17] B. Mesot and D. Barber. Switching linear dynamical systems for noise robust speech recognition. IEEE Transactions on Audio, Speech and Language Processing, 15(6):1850–1858, 2007.
[18] T. Minka. A family of algorithms for approximate Bayesian inference. PhD thesis, MIT Media Lab, 2001.
[19] V. Pavlovic, J. M. Rehg and J. MacCormick. Learning switching linear models of human motion. In Advances in Neural Information Processing Systems (NIPS) 13, pages 981–987. MIT Press, 2001.
[20] D. M. Titterington, A. F. M. Smith and U. E. Makov. Statistical Analysis of Finite Mixture Distributions. John Wiley & Sons, 1985.
[21] O. Zoeter. Monitoring non-linear and switching dynamical systems. PhD thesis, Radboud University Nijmegen, 2005.

Contributor David Barber, Department of Computer Science, University College London
