SLIDE 1
Travel Time Estimation using Approximate Belief States on a Hidden Markov Model
Walid Krichene
SLIDE 2
SLIDE 3
Overview
Context
Inference on a HMM
Modeling framework and exact inference
Approximate Inference: the Boyen-Koller algorithm
Graph Partitioning
SLIDE 4
Context
◮ Mobile Millennium project
◮ Travel time estimation on an arterial network
◮ Input data: probe vehicles that send their GPS locations periodically
◮ processed using path inference
◮ observation = (path, travel time along the path)
SLIDE 8
Objective
Improve the inference algorithm
◮ Time complexity is exponential in the size of the network (number of links)
◮ Solution: assume links are independent
◮ But this loses the structure of the network
◮ Need approximate inference to keep the structure
SLIDE 11
Overview
Context
Inference on a HMM
Modeling framework and exact inference
Approximate Inference: the Boyen-Koller algorithm
Graph Partitioning
SLIDE 12
Graphical Model
◮ Nodes: random variables
◮ Conditional independence: $x$ and $y$ are independent conditionally on $(n_1, n_2)$ but not on $n_1$
SLIDE 13
Hidden Markov Model
◮ Hidden variables $s_t \in \{s^1, \dots, s^N\}$
◮ Observed variables $y_t$
◮ $(s_0, \dots, s_t)$ is a Markov process
◮ Hidden variables are introduced to simplify the model
◮ Interesting because it provides efficient algorithms for inference and parameter estimation
SLIDE 16
Parametrization of a HMM
◮ Initial probability distribution $\pi_i = P(s_0^i)$
◮ Transition matrix: $T_{i,j} = P(s_{t+1}^j \mid s_t^i)$
◮ Observation model: $P(y_t \mid s_t)$
◮ These completely characterize the HMM: we can compute the probability of any event.
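For concreteness, the three objects above could be represented as follows. This is a minimal NumPy sketch with illustrative numbers (N = 3 states, a discrete observation table B), not values from the talk.

```python
import numpy as np

N = 3  # number of hidden states (illustrative)

# Initial distribution: pi[i] = P(s_0 = i)
pi = np.array([0.5, 0.3, 0.2])

# Transition matrix: T[i, j] = P(s_{t+1} = j | s_t = i); each row sums to 1
T = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])

# Observation model P(y_t | s_t), here a discrete table: B[i, k] = P(y = k | s = i)
B = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])

# The triple (pi, T, B) fully characterizes the HMM: for example,
# P(s_0 = 0, y_0 = 2, s_1 = 1, y_1 = 1) is
p_event = pi[0] * B[0, 2] * T[0, 1] * B[1, 1]
```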
SLIDE 20
Inference
General inference problem: compute $P(s_t \mid y_{0:T})$
◮ Filtering if $t = T$
◮ Prediction if $t > T$
◮ Smoothing if $t < T$
Let $y = y_{0:T}$. Then
$$P(s_t \mid y) = \frac{P(s_t, y)}{P(y)} = \frac{\alpha(s_t)\,\beta(s_t)}{\sum_{s_t} \alpha(s_t)\,\beta(s_t)}$$
where $\alpha(s_t) \triangleq P(y_0, \dots, y_t, s_t)$ and $\beta(s_t) \triangleq P(y_{t+1}, \dots, y_T \mid s_t)$.
SLIDE 23
Message passing algorithms
Recursive algorithm to compute $\alpha(s_t)$ and $\beta(s_t)$:
◮ $\alpha(s_{t+1}) = \sum_{s_t} \alpha(s_t)\, T_{s_t, s_{t+1}}\, P(y_{t+1} \mid s_{t+1})$
◮ $\beta(s_t) = \sum_{s_{t+1}} \beta(s_{t+1})\, P(y_{t+1} \mid s_{t+1})\, T_{s_t, s_{t+1}}$
◮ Complexity: $O(N^2 T)$ operations
◮ $\alpha$ recursion: for every $t$, there are $N$ possible values of $s_{t+1}$, and each $\alpha(s_{t+1})$ requires $N$ multiplications
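The two recursions translate directly into code. A minimal NumPy sketch, reusing the (pi, T, B) parametrization assumed in the earlier sketch; no rescaling is done, so it is only numerically safe for short sequences.

```python
import numpy as np

def forward_backward(pi, T, B, y):
    """alpha[t, i] = P(y_0..y_t, s_t = i); beta[t, i] = P(y_{t+1}..y_T | s_t = i)."""
    n, N = len(y), len(pi)
    alpha = np.zeros((n, N))
    beta = np.ones((n, N))          # beta(s_T) = 1 by convention

    alpha[0] = pi * B[:, y[0]]
    for t in range(n - 1):
        # alpha(s_{t+1}) = sum_{s_t} alpha(s_t) T[s_t, s_{t+1}] P(y_{t+1}|s_{t+1})
        alpha[t + 1] = (alpha[t] @ T) * B[:, y[t + 1]]
    for t in range(n - 2, -1, -1):
        # beta(s_t) = sum_{s_{t+1}} beta(s_{t+1}) P(y_{t+1}|s_{t+1}) T[s_t, s_{t+1}]
        beta[t] = T @ (B[:, y[t + 1]] * beta[t + 1])

    posterior = alpha * beta        # proportional to P(s_t | y_{0:T})
    return alpha, beta, posterior / posterior.sum(axis=1, keepdims=True)
```

Each $\alpha(s_{t+1})$ costs $N$ multiply-adds and there are $N$ of them per step, which gives exactly the $O(N^2 T)$ total.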
SLIDE 26
Parameter estimation
Parameters of the HMM: $\theta = (\pi, T, \eta)$
◮ $T$: transition matrix
◮ $\pi$: initial state probability distribution
◮ $\eta$: parameters of the observation model $P(y_t \mid s_t, \eta)$
Parameter estimation: maximize the log likelihood w.r.t. $\theta$:
$$\ln \sum_{s_0} \sum_{s_1} \cdots \sum_{s_T} \pi_{s_0} \prod_{t=0}^{T-1} T_{s_t, s_{t+1}} \prod_{t=0}^{T} P(y_t \mid s_t, \eta)$$
SLIDE 27
Expectation Maximization algorithm
◮ E step: estimate the hidden (unobserved) variables given the observed variables and the current estimate of $\theta$
◮ M step: maximize the likelihood function under the assumption that the latent variables are known (they are "filled in" with their expected values)
SLIDE 29
Expectation Maximization algorithm
In the case of HMMs:
◮ $\hat T_{ij} = \dfrac{\sum_{t=0}^{T-1} \xi(s_t^i, s_{t+1}^j)}{\sum_{t=0}^{T-1} \gamma(s_t^i)}$
◮ $\hat \eta_{ij} = \dfrac{\sum_{t=0}^{T} \gamma(s_t^i)\, y_t^j}{\sum_{t=0}^{T} \gamma(s_t^i)}$
◮ $\hat \pi_i = \dfrac{\alpha(s_0^i)\,\beta(s_0^i)}{\sum_{s_0} \alpha(s_0)\,\beta(s_0)}$
where $\xi$ and $\gamma$ are simple functions of $\alpha$ and $\beta$.
Time complexity
$O(N^2 T)$ operations
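A sketch of how the hatted quantities could be assembled from the $\alpha$ and $\beta$ messages of the previous sketch, writing out $\gamma$ and $\xi$ explicitly. This follows the standard Baum-Welch updates and is not code from the project.

```python
import numpy as np

def em_updates(alpha, beta, T, B, y):
    # gamma[t, i] = P(s_t = i | y);  xi[t, i, j] = P(s_t = i, s_{t+1} = j | y)
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    xi = alpha[:-1, :, None] * T[None] * (B[:, y[1:]].T * beta[1:])[:, None, :]
    xi /= xi.sum(axis=(1, 2), keepdims=True)

    pi_hat = gamma[0]                                         # hat pi
    T_hat = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]  # hat T
    return pi_hat, T_hat, gamma, xi
```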
SLIDE 30
Overview
Context
Inference on a HMM
Modeling framework and exact inference
Approximate Inference: the Boyen-Koller algorithm
Graph Partitioning
SLIDE 31
Modeling framework
◮ System modeled using a Hidden Markov Model
◮ $L$ links
Hidden variables
◮ Link $l$: discrete state $S_t^l \in \{1, \dots, K\}$
◮ State of the entire system: $S_t = (S_t^1, \dots, S_t^L)$
◮ $N = K^L$ possible states
◮ Markov process: $P(S_{t+1} \mid S_0, \dots, S_t) = P(S_{t+1} \mid S_t)$
Observed variables
We observe travel times: random variables whose distributions depend on the state of the links
SLIDE 34
HMM
SLIDE 35
Parametrization of the HMM
Transition model
$$T_t(s_i \to s_j) \triangleq P(s_{t+1}^j \mid s_t^i)$$
Transition matrix over the $K^L$ states
SLIDE 36
Parametrization of the HMM
Observation model
Probability of observing the response $y = (l, x_i, x_f, \delta)$ given state $s$ at time $t$:
$$O_t(s \to y) \triangleq P(y_t \mid s_t) = g_t^{l,s}(\delta) \times \int_{x_i}^{x_f} \rho_t^l(x)\, dx$$
◮ $g_t^{l,s}$: distribution of total travel time on link $l$ in state $s$
◮ $\rho_t^l$: probability distribution of vehicle locations (results from traffic assumptions)
Assumptions
Processes are time invariant during 1-hour time slices
SLIDE 38
Travel time estimation
◮ Estimate the state of the system
◮ Estimate the parameters of the models (observation)
◮ Update the estimates when new responses are observed
Belief State
$$p_t(s) \triangleq P(s_t \mid y_{0:t})$$
Probability distribution over possible states
SLIDE 42
Travel time estimation
Bayesian tracking of the belief state: forward-backward propagation ($O(N^2 T)$ time). Each update can be done in $O(N^2)$:
$$p_t \xrightarrow{T[\cdot]} q_{t+1} \xrightarrow{O_y[\cdot]} p_{t+1}$$
Parameter estimation of the model
◮ Update the parameters of the probability distribution of vehicle locations: solve
$$\max \sum_{x \in X_t^l} \ln \rho_t^l(x)$$
where $X_t^l$ are the observed vehicle locations
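One $O(N^2)$ tracking step could look as follows; `obs_lik` stands in for the observation likelihoods $O_t(s \to y)$, which in the actual model come from $g_t^{l,s}$ and $\rho_t^l$ (the name is a placeholder).

```python
import numpy as np

def track_belief(p_t, T, obs_lik):
    """One tracking step: p_t -> q_{t+1} (predict) -> p_{t+1} (correct)."""
    q_next = p_t @ T               # T[.]:   q_{t+1}(s') = sum_s p_t(s) T[s, s']
    p_next = q_next * obs_lik      # O_y[.]: weight by P(y_{t+1} | s_{t+1})
    return p_next / p_next.sum()   # renormalize to a probability distribution
```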
SLIDE 44
Parameter estimation of the model
◮ Update the transition matrix: EM algorithm in $O(N^2 T)$ operations
◮ Exact inference and parameter estimation are done in $O(N^2 T) = O(K^{2L} T)$ time
SLIDE 46
Computational intractability
Exact inference and the EM algorithm are not tractable
The size of the belief state and of the transition matrix is exponential in the size of the network. The EM algorithm takes time exponential in $L$.
◮ Assume independence of links?
◮ Use approximate tracking instead, and limit the size of the network?
SLIDE 49
Overview
Context
Inference on a HMM
Modeling framework and exact inference
Approximate Inference: the Boyen-Koller algorithm
Graph Partitioning
SLIDE 50
Approximate Belief State
Choose a family of belief states that have a compact representation.
Factorized belief state
Decompose the process into subprocesses. Approximate the probability of state $s$ by the product of the marginal probabilities of the substates $s^c$:
$$p_t(s) = P(s_t \mid y_{0:t}) \approx \prod_{c=1}^{C} P(s_t^c \mid y_{0:t}) = \prod_{c=1}^{C} \tilde p_t^c(s^c) \triangleq \tilde p_t(s)$$
SLIDE 52
Approximate Belief State
Example
A network with 3 links, $S = (S^1, S^2, S^3)$; links 1 and 2 are in cluster 1 and link 3 is in cluster 2.
$$p_t((0,1,1)) = P(S_t = (0,1,1) \mid y_{0:t}) \approx P((S_t^1, S_t^2) = (0,1) \mid y_{0:t})\; P(S_t^3 = 1 \mid y_{0:t}) = \tilde p_t^1((0,1))\; \tilde p_t^2(1) \triangleq \tilde p_t((0,1,1))$$
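The same example in code, with hypothetical cluster marginals (all numbers are illustrative):

```python
import numpy as np

K = 2  # states per link (illustrative)

# Cluster 1 = links (1, 2): marginal over the K*K joint states, indexed by (s1, s2)
p1 = np.array([[0.1, 0.4],
               [0.3, 0.2]])
# Cluster 2 = link 3: marginal over K states
p2 = np.array([0.35, 0.65])

# Approximate belief of the full state (S1, S2, S3) = (0, 1, 1)
p_tilde = p1[0, 1] * p2[1]   # 0.4 * 0.65 = 0.26
```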
SLIDE 53
Approximate Belief State
SLIDE 54
Approximate Belief State
Perform Bayesian tracking and parameter estimation (EM algorithm) on each $\tilde p$ separately.
Transition model
Assume the state of cluster $c$ at $t+1$ depends only on the state of $N(c)$ at $t$. Transition matrix of size $K^{|N(c)|} \times K^{|S_c|}$.
Inference
Inference is done on the subprocesses separately:
$$\tilde p_t \xrightarrow{T[\cdot]} \hat q_{t+1} \xrightarrow{O_y[\cdot]} \hat p_{t+1} \xrightarrow{\Pi[\cdot]} \tilde p_{t+1}$$
The new approximate belief state $\tilde p_{t+1}$ is computed as the product of the marginal distributions over the subprocesses.
The observation model and the EM algorithm are the same.
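A minimal sketch of one full BK step for the three-link example, with the projection $\Pi[\cdot]$ written out. For clarity it updates the exact joint over all $K^3$ states before projecting; a real implementation would only ever touch each cluster and its neighbors $N(c)$. All names and shapes are illustrative.

```python
import numpy as np

def bk_step(p1, p2, T_joint, obs_lik, K):
    """tilde p_t -> hat q_{t+1} -> hat p_{t+1} -> Pi[.] -> tilde p_{t+1}."""
    # Rebuild the factorized joint: tilde p_t(s1, s2, s3) = p1(s1, s2) * p2(s3)
    joint = (p1[:, :, None] * p2[None, None, :]).reshape(-1)

    # Predict and correct, exactly as in the exact tracking step
    joint = (joint @ T_joint) * obs_lik
    joint /= joint.sum()

    # Pi[.]: project back onto the product of the cluster marginals
    joint = joint.reshape(K, K, K)
    p1_new = joint.sum(axis=2)        # marginal over (S1, S2)
    p2_new = joint.sum(axis=(0, 1))   # marginal over S3
    return p1_new, p2_new
```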
SLIDE 55
Time Complexity
Let $L'$ be the maximum size of the subprocesses, $C$ the number of clusters, and $M = K^{L'}$.
◮ Time complexity of inference + parameter estimation (EM) is $O(CM^2T)$.
◮ If $L'$ is fixed: $C$ increases linearly with $L$, so the time complexity becomes $O(CT) = O(LT)$, linear in $L$, as opposed to the original algorithm's $O(K^{2L}T)$.
SLIDE 59
Approximation error
Use the Kullback-Leibler divergence (relative entropy):
$$D[p_t, \tilde p_t] = \sum_i p_t(s^i) \ln \frac{p_t(s^i)}{\tilde p_t(s^i)}$$
The error is shown to be bounded: $D[p_t, \tilde p_t] \le \dfrac{\epsilon}{(\gamma/r)^q}$ for some $\epsilon$, where each subprocess $T_c$ depends on at most $r$ others and affects at most $q$ others.
◮ Smaller error for more independent subprocesses.
◮ Trade-off between speed and approximation error.
◮ Define a partitioning of the network into subgraphs such that the clusters depend on each other weakly.
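The divergence itself is one line of code (both arguments are assumed to be strictly positive distributions over the same joint state space):

```python
import numpy as np

def kl(p, p_tilde):
    """D[p, tilde p] = sum_i p(s_i) * ln(p(s_i) / tilde p(s_i))."""
    return float(np.sum(p * np.log(p / p_tilde)))
```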
SLIDE 62
Overview
Context
Inference on a HMM
Modeling framework and exact inference
Approximate Inference: the Boyen-Koller algorithm
Graph Partitioning
SLIDE 63
Graph Partitioning
We need to cluster the network into subgraphs that interact weakly (to obtain a small approximation error). Use historical observations to define a weighted graph that describes the interaction.
Weighted graph
◮ Set $P$ of observed paths
◮ Each path $p \in P$ is a sequence of connected links $p = (l_{i_1}, \dots, l_{i_k})$
◮ Define the weight of edge $(i, j)$:
$$w_{i,j} = \frac{\#\{p \in P \mid l_i \xrightarrow{p} l_j\}}{\#\{p \in P \mid l_i \in p\}}$$
◮ Weights are normalized: $\forall i,\ \sum_j w_{i,j} = 1$
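A sketch of the weight computation from a set of observed paths. It assumes paths are given as sequences of link ids and reads $l_i \xrightarrow{p} l_j$ as "$l_i$ is immediately followed by $l_j$ in $p$".

```python
from collections import defaultdict

def interaction_weights(paths):
    """w[(i, j)] = #{p : l_i immediately followed by l_j in p} / #{p : l_i in p}."""
    pair_count = defaultdict(int)   # numerator: paths where l_i -> l_j
    link_count = defaultdict(int)   # denominator: paths containing l_i
    for p in paths:
        for li in set(p):
            link_count[li] += 1
        for pair in set(zip(p, p[1:])):   # count each path at most once per pair
            pair_count[pair] += 1
    return {(li, lj): c / link_count[li] for (li, lj), c in pair_count.items()}

# Example: two probe paths given as sequences of link ids
w = interaction_weights([(1, 2, 3), (1, 2, 4)])
# w[(1, 2)] == 1.0, w[(2, 3)] == 0.5, w[(2, 4)] == 0.5
```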
SLIDE 68
Partitioning the weighted graph
Loss function
Minimize a loss function
$$L((G_c)_{1 \le c \le C}) = \sum_{c,c'} \mathrm{cut}(G_c, G_{c'}) \quad \text{where} \quad \mathrm{cut}(G_c, G_{c'}) = \sum_{l_i \in G_c,\, l_j \in G_{c'}} w_{i,j}$$
◮ Does it yield good results?
SLIDE 70
Partitioning the weighted graph
Minimizing the cut function yields unbalanced clusters
SLIDE 71
Partitioning the weighted graph
Appropriate loss function
Minimize
$$L((G_c)_{1 \le c \le C}) = \sum_{c,c'} \mathrm{Ncut}(G_c, G_{c'}) \quad \text{where} \quad \mathrm{Ncut}(G_c, G_{c'}) = \frac{\mathrm{cut}(G_c, G_{c'})}{\sum_{l_i \in G_c,\, j} w_{i,j} + \sum_{l_j \in G_{c'},\, i} w_{i,j}}$$
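Both cut functions are available in networkx, which makes the comparison easy to try; the toy graph below is an illustrative assumption.

```python
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([
    (1, 2, 0.9), (2, 3, 0.8),   # one tightly coupled group of links
    (4, 5, 0.9),                # another group
    (3, 4, 0.1),                # weak interaction between the two groups
])

A, B = {1, 2, 3}, {4, 5}
print(nx.cut_size(G, A, B, weight="weight"))             # 0.1
print(nx.normalized_cut_size(G, A, B, weight="weight"))  # cut scaled by cluster volumes
```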
SLIDE 73
Partitioning the weighted graph
◮ Normalizing the cut function favors balanced clusters
◮ The exact solution is NP-hard
◮ Use the METIS algorithm for an approximate solution
◮ Post-process the output to obtain connected clusters
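The talk uses METIS. As a stand-in that avoids the native dependency, spectral clustering on the symmetrized weight matrix also approximately minimizes the normalized cut; a sketch under that assumption, with random illustrative data in place of real path statistics.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# W[i, j]: interaction weights between links (random illustrative data here;
# in practice this would come from the path statistics above)
rng = np.random.default_rng(0)
W = rng.random((20, 20)) * (rng.random((20, 20)) < 0.5)

# Spectral clustering expects a symmetric, non-negative affinity matrix
A = (W + W.T) / 2
labels = SpectralClustering(
    n_clusters=4, affinity="precomputed", assign_labels="discretize", random_state=0
).fit_predict(A)

clusters = [np.flatnonzero(labels == c) for c in range(4)]
# Post-processing (as on the slide) would then split any disconnected
# cluster into its connected components.
```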
SLIDE 77
Results of Graph Partitioning
Tested on historical data aggregated over 1-hour time periods, for each day of the week, over 3 months.
SLIDE 78
Results of Graph Partitioning
SLIDE 79
Results of Graph Partitioning
SLIDE 80
Results of Graph Partitioning
◮ Geographically connected clusters
◮ Connected arteries appear in the same cluster
◮ Sections of highway 80 (Bay Bridge) and neighboring links all appear in the same cluster
SLIDE 81
Summary
Adapted the BK algorithm to our model and provided a study and description of its steps. BK is promising because:
◮ Trade-off between fast computation and approximation error: we can adjust the size of the network to choose speed over accuracy.
◮ If we limit the size of the subgraphs: polynomial time in the size of the network.
◮ The error remains bounded over time.
◮ Possibility of concurrent processing.
◮ Possibility of short-term prediction: use the transition matrix learned up to time $t_0$ and propagate the belief state $p_{t_0}$ up to time $t_0 + T$:
$$p_{t_0} \xrightarrow{T} p_{t_0+1} \xrightarrow{T} \cdots \xrightarrow{T} p_{t_0+T}$$
Addressed the graph partitioning problem. The BK algorithm should now be tested to evaluate its performance. I started its implementation, and it is being carried forward by the arterial team.
SLIDE 82