SLIDE 1 Data-driven COVID-19 modeling Data-driven COVID-19 modeling
Cyprien Neverov August 28 , 2020
Student at IMT Mines Ales and intern at FAU Erlangen-Nurnberg under the supervision of Prof. Enrique Zuazua. 1
th
1
SLIDE 2 Table of Contents Table of Contents
- 1. Data-driven system identication
- 2. Modeling COVID-19
- 3. Conclusion
SLIDE 3
- 1. Data-driven system identication
- 1. Data-driven system identication
Identifying the dynamics from data is becoming a key challenge because: Data acquisition is getting cheaper Problems are getting more complex Computational power is cheaper
SLIDE 4
Sparse identication of nonlinear dynamical systems Sparse identication of nonlinear dynamical systems
Approach proposed by S. Brunton in [1]. Uses sparse regression Relies on a set of candidate functions Expresses the dynamics as a function which is a linear combination of the candidate functions: where is the state of the system.
f (t) = f(x(t)) dx dt x
SLIDE 5
- 1. Make two matrices
- 1. Make two matrices
Let's say that we have observed the system at and either observed or numerically computed its time derivative at those time points, then we can construct the two following matrices:
, , . . . , t1 t2 tm = and X = X ˙ ⎡ ⎣ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ( )
dx dt t1
( )
dx dt t2
⋮ ( )
dx dt tm
⎤ ⎦ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎡ ⎣ ⎢ ⎢ ⎢ ⎢ ⎢ x( ) t1 x( ) t2 ⋮ x( ) tm ⎤ ⎦ ⎥ ⎥ ⎥ ⎥ ⎥
SLIDE 6
- 2. Augment the state matrix
- 2. Augment the state matrix
And then we can augment the matrix with the candidate functions this will yield :
X , , … , f1 f2 fp θ(X) θ(X) = ⎡ ⎣ ⎢ ⎢ ⎢ ⎢ ⎢ (x( )) f1 t1 (x( )) f1 t2 ⋮ (x( )) f1 tm (x( )) f2 t1 (x( )) f2 t2 ⋮ (x( )) f2 tm ⋯ ⋯ ⋱ ⋯ (x( )) fp t1 (x( )) fp t2 ⋮ (x( )) fp tm ⎤ ⎦ ⎥ ⎥ ⎥ ⎥ ⎥
SLIDE 7
- 3. Solve the linear least squares
- 3. Solve the linear least squares
Now we want to nd a matrix that is a solution to: in the least squares sense. The sparsity is achieved by running the optimization several times and gradually zeroing out the values that are under a cut-off value.
ξ = θ(X)ξ X ˙
SLIDE 8
Discretized formulation Discretized formulation This algorithm also works in an iterative manner, when instead of we have : and then we seek to solve
X ˙ X2 = and X = X2 ⎡ ⎣ ⎢ ⎢ ⎢ ⎢ ⎢ x( ) t2 x( ) t3 ⋮ x( ) tm ⎤ ⎦ ⎥ ⎥ ⎥ ⎥ ⎥ ⎡ ⎣ ⎢ ⎢ ⎢ ⎢ ⎢ x( ) t1 x( ) t2 ⋮ x( ) tm−1 ⎤ ⎦ ⎥ ⎥ ⎥ ⎥ ⎥ = θ(X)ξ X2
SLIDE 9
Simple example Simple example
Oscillator with a cubic nonlinearity: From now on we will consider only polynomial terms of the variables as candidate functions: .
x ˙ y ˙ = −0.1 + 2 x3 y3 = −2 − 0.1 x3 y3 1, x, y, , xy, , , … , x2 y2 x3 yp
SLIDE 10 In [15]: plt.figure(dpi=100) plt.plot(cubic_oscillator[:, 0], label='x') plt.plot(cubic_oscillator[:, 1], label='y') plt.legend() plt.xlabel('time') plt.ylabel('state') plt.title('Cubic oscillator') plt.show()
SLIDE 11 Simple example Simple example
In [27]: X_dot, X = make_targets(cubic_oscillator, derivative=(derivative:=False)) theta_X, _ = make_polynomials(X, max_degree=3) weights, _ = sparse_regression(theta_X, X_dot, cutoff=1e-3) if derivative: weights /= t_ode[1] show_weights(weights, derivative=derivative)
function 1.00577
1.00572
- 0.00816922
- 0.0645619
- 0.00932543
- 0.00826491
0.0654499
xk+1 yk+1 1 x y x2 xy y2 x3 y x2 xy2 y3
SLIDE 12 2.
Modeling COVID-19
The trajectories of cumulative cases:
SLIDE 13 In [15]: countries_to_display = ['China', 'South Korea', 'France', 'Spain', 'Germany', 'Uruguay'] plt.figure(dpi=130) for country in countries_to_display: values = ds.cumulative(country) plt.plot(values, label=country) plt.legend() plt.xlabel('days') plt.ylabel('cumulative cases') plt.show()
SLIDE 14
Single country trajectory Single country trajectory
SLIDE 15 In [31]: country = 'France' data = ds.cumulative(country, rescaling=100000)[:100] X_dot, X = make_targets(data[..., np.newaxis], derivative=(derivative:=True)) theta_X, _ = make_polynomials(X, max_degree=5) weights, _ = sparse_regression(theta_X, X_dot, cutoff=1e-15) show_weights(weights, derivative=derivative) show_trajectory(data, weights)
function
0.372269
2.3851
0.36289
x ˙ 1 x x2 x3 x4 x5
SLIDE 16
Observations Observations
A single trajectory can be described by only two or three parameters. The algorithm is not robust. We did not have the complete evolutions for most of the countries at the time of writing. How can we use the information from countries that are more advanced into the epidemic to make predictions for countries at a more early stage ?
SLIDE 17
Multi-country model Multi-country model
What if the evolution of the number of cases in several countries could be governed by a single formula? This would require additional information: Information about the countries, the cultures; Information about the measures taken by the governments to tackle the spread of the disease. We would like to nd a function such that for any country , at any day , we have: Where is this additional information.
f c t = f( , ) xt+1,c xt,c it,c i
SLIDE 18
SINDy vs ARIMA SINDy vs ARIMA
seems to be the preferred tool of statisticians to make COVID forecasting: [6, 7]. We compared the performance of our techniques to an ARIMA model for forecasting purposes: ARIMA (https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average)
SLIDE 19
SINDy vs ARIMA SINDy vs ARIMA
But it doesn't always work as expected.
SLIDE 20 SINDy vs ARIMA SINDy vs ARIMA
Percentage of best guesses Error on 1 week forecast horizon Error on 2 weeks forecast horizon
SLIDE 21
Other approaches/experiments Other approaches/experiments
N-Beats and NNs SIR tting and recovering More precedent states ( , ) - autoregression. Rational basis functions (https://kipre.github.io/les/internship/reports/non- linear/nonlinear.html) Time dependent dynamics (https://kipre.github.io/les/internship/reports/covid_time/index.html) Control measures (https://kipre.github.io/les/internship/reports/covid_control/report.html)
xn−1 xn−2
SLIDE 22
Conclusion Conclusion
System identication is challenging. In simple settings the SINDy algorithm learns well. In complex situations not so much. It was quite early to work on data-driven approaches. It is really sensitive to small hyperparameters changes. Sensitive to small data changes. Easily overts. Random events had a big impact on the evolutions.
SLIDE 23 Other tasks Other tasks
- a small program to transform
GASLIB .net les to a MATLAB .mat. Written in C++.
implemented algorithms to deploy a service through a REST API for identifying systems from data. net2mat (https://github.com/Kipre/net2mat) System identication as a service (https://github.com/Kipre/les/tree/master/internship/siaas)
SLIDE 24 References References
[1] Brunton, Steven L., Joshua L. Proctor, and J. Nathan Kutz. 2016. “Discovering Governing Equations from Data by Sparse Identication of Nonlinear Dynamical Systems.” Proceedings of the National Academy of Sciences 113 (15): 3932–7. . [2] Hale, Thomas, Sam Webster, Anna Petherick, Toby Phillips, and Beatriz Kira. 2020. “Oxford COVID-19 Government Response Tracker.” Blavatnik School of Government. . [3] “Understanding the Coronavirus (COVID-19) Pandemic Through Data. World Bank.” n.d. . [4] Kermack, William Ogilvy, A. G. McKendrick, and Gilbert Thomas Walker. 1997. “A Contribution to the Mathematical Theory of Epidemics.” Proc. R. Soc. Lond. 12: 700–721. . [5] Andreas Kergassner, Christian Burkhardt, Dorothee Lippold, Sarah Nistler, Matthias Kergassner, Paul Steinmann, Dominik Budday, Silvia Budday. 2020. “Meso-scale modeling
- f COVID-19 spatio-temporal outbreak dynamics in Germany” medRxiv
2020.06.10.20126771; doi https://doi.org/10.1073/pnas.1517384113 (https://doi.org/10.1073/pnas.1517384113) https://github.com/OxCGRT/covid-policy-tracker/ (https://github.com/OxCGRT/covid- policy-tracker/) http://datatopics.worldbank.org/universal-health-coverage/covid19/ (http://datatopics.worldbank.org/universal-health-coverage/covid19/) https://doi.org/10.1098/rspa.1927.0118 (https://doi.org/10.1098/rspa.1927.0118) https://doi.org/10.1101/2020.06.10.20126771
SLIDE 25 [6] Guorong Ding, Xinru Li, Yang Shen, Jiao Fan. 2020. “Brief Analysis of the ARIMA model
- n the COVID-19 in Italy” medRxiv 2020.04.08.20058636; doi:
[7] Lut Bayyurt, Burcu Bayyurt. 2020. “Forecasting of COVID-19 Cases and Deaths Using ARIMA Models” medRxiv 2020.04.17.20069237; doi: [8] Keimer, Alexander & Pug, Lukas. (2020). “Modeling infectious diseases using integro- differential equations: Optimal control strategies for policy decisions and Applications in COVID-19”. 10.13140/RG.2.2.10845.44000. https://doi.org/10.1101/2020.04.08.20058636 https://doi.org/10.1101/2020.04.17.20069237 link (https://www.researchgate.net/publication/341265820_Modeling_infectious_diseases_usin differential_equations_Optimal_control_strategies_for_policy_decisions_and_Applications_in 19?channel=doi&linkId=5eb6577f299bf1287f77ed58&showFulltext=true)
SLIDE 26
Appendix Appendix
SLIDE 27 Quick introduction to epidemiology Quick introduction to epidemiology
The most basic compartmental model is the SIR (Susceptible, Infected, Recovered) introduced in [4] and is governed by the following dynamics: Where is the total population, and are the parameters of the disease, is called .
S ˙ I ˙ R ˙ = − βIS N = − γI βIS N = γI N β γ
β γ
r0
SLIDE 28
Examples for dierent Examples for dierent s
constant_r0 : piece_r0 : exp_r0 :
r0
x ↦ 2 x ↦ ⎧ ⎩ ⎨ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ 2.5, 1, 0.8, 1.1, 1, if x < 10 if x < 55 if x < 90 if x < 130 else x ↦ 20 × + ϵ e−0.1x
SLIDE 29 In [9]: integrate_sir(constant_r0)
SLIDE 30 In practice In practice
Proposed by [5] With
= S ˙k = I ˙k = Q ˙k = R ˙k = D ˙ k − βIkSk + βIkSk − ∑
l nc
β ~
klSkIl
+ ∑
l nc
β ~
klSkIl −
γ1 ω − 1 Ik + γ1 ω − 1 Ik −γ1Ik + I γ1 − γ2Qk + γ2Qk −γ2 μω 1 − μω Q +γ2 μω 1 − μω Q = β ~
kl
βCC βkβl − − − − √
N ξ
k N λ l
ax N ξ+λ
m
e−
rkl r
SLIDE 31
Inuence of the cuto value Inuence of the cuto value
Choosing a cutoff value is not trivial.
SLIDE 32
Information about the countries Information about the countries List of country indicators relevant to the pandemic by the World Bank [2], divided into 3 categories: Health: general information about health infrastructure, economics and mortality causes. Water & Sanitation: information about hygiene. Age & Population: demographic indicators.
SLIDE 33
Information about the government measures Information about the government measures Social distancing measures had a huge impact on the development of the disease and we want to take them into account. We use a "Stringency Index" provided by a group of researchers from Oxford [3]. An aggregation of policy indicators: information on containment and closure policies, such as school closures and restrictions in movement. Should not be interpreted as effectiveness of the measures.
SLIDE 34 In [14]: def plot_cases_and_stringency(country): country_data = ds.cumulative(country) fig, ax1 = plt.subplots(dpi=80) color = 'tab:red' ax1.set_xlabel('time') ax1.set_ylabel('Cases') ax1.plot(country_data[:-1], 'k.', label='cases') ax1.tick_params(axis='y') ax2 = ax1.twinx() color = 'tab:blue' ax2.set_ylabel('Stringency', color=color) ax2.plot(ds.stringency(country)[:-1], color=color) ax2.tick_params(axis='y', labelcolor=color) fig.tight_layout() plt.title(country) plt.show() widgets.interact(plot_cases_and_stringency, country=ds.all_ox_countries()[1:]);
SLIDE 35 Fitting Fitting
Experimental setting: 4857 training examples (rows in the and matrices) and 1028 test examples for 105 and 26 countries respectively. 42 variables (state + stringency + indicators). Polynomial terms with a maximum degree of 3. We neglect the terms that are constant with respect to the state. This yields more than 800 polynomial terms. The number of polynomial terms is equal to where is the number of variables and the maximum degree.
X X2
(n+r−1)! r!(n−1)!
n r
SLIDE 36 Putting it all together Putting it all together
Here is a description of the information that was used for the model:
Variable Explanation Nature The state of the system: the number of cases in the country The state, time-dependent Stringency index: how severe are the containment measures. (the control) Time-dependent Human development index Constant Total population Constant Population ages 65 and above (% of total) Constant Hospital beds (per 1,000 people) Constant ... ... ... Cause of death, by injury (% of total) Constant
We want to nd a general formula of the following form :
The country indexes were omitted for better readability.
xt i1,t i2 i3 i4 i5 i40
1
= f( , , , , , . . . , ) xt+1 xt i1,t i2 i3 i4 i40
1
SLIDE 37
Study of the inuence of the delay Study of the inuence of the delay
It is obvious that the inuence of the delay in the dynamics of the pandemic is very important [8]. But we couldn't nd any delay that would give signicantly better results than no delay.
SLIDE 38 Indicators Indicators
Hospital beds (per 1,000 people), Physicians (per 1,000 people), Nurses and midwives (per 1,000 people), UHC service coverage index, Human development index, Current health expenditure (% of GDP), Current health expenditure per capita (current US$), Current health expenditure per capita, PPP (current international $), Out-of-pocket expenditure (% of current health expenditure), Out-of-pocket expenditure per capita (current US$), Out-of-pocket expenditure per capita, PPP (current international $), Diabetes prevalence (% of population ages 20 to 79), Life expectancy at birth, female (years), Life expectancy at birth, total (years), Life expectancy at birth, male (years), Mortality rate, adult, female (per 1,000 female adults), Mortality rate, adult, male (per 1,000 male adults), Mortality from CVD, cancer, diabetes or CRD between exact ages 30 and 70 (%), Mortality from CVD, cancer, diabetes or CRD between exact ages 30 and 70, female (%), Mortality from CVD, cancer, diabetes or CRD between exact ages 30 and 70, male
SLIDE 39
Importance of the indicators Importance of the indicators
Let the importance of a variable be the normalized sum of the absolute value of the coefcients of the polynomial terms where this variable appears.
SLIDE 40
For a cuto value of For a cuto value of and sparsity of and sparsity of %
10−4
SLIDE 41
For a cuto value of For a cuto value of and sparsity of and sparsity of %
1 98
SLIDE 42 Logistic trajectories Logistic trajectories
Most of the trajectories have an evolution close to a logistic function: Where and are the parameters. Since the derivative of a logistic function is: we should be able to have some baseline results with polynomials of a maximum degree of 2.
x(t) = a 1 + e−b(t−c) a, b c (t) = x(t) × (1 − x(t))
dx dt
SLIDE 43 Integrating the identied system Integrating the identied system
We can easily generate a trajectory with
formulation and solving the ODE. But synchronizing is difcult.
= θ( )ξ xk+1 xk = θ(x)ξ x ˙