Efficient Filtering and Fitting of Models Derived from Integro-Difference Equations

Author

Evan Tate Paterson Hughes

1 Introduction

The Integro-Difference equation model (here abbreviated as IDEM ¹) is dynamics-based spatio-temporal aiming to model diffusion and convection by making the value of a process a weighted average of it’s previous time, plus noise.

[NOTE: I intend to create a more thorough background for the introduction here.]

2 Integro-difference Based Dynamics

As common and widespread as the problem is, spatio-temporal modelling still presents a great deal of difficulty. Inherently, Spatio-Temporal datasets are almost always high-dimensional, and repeated observations are usually not possible.

Traditionally, the problem has been tackled by the moments (usually the means and covariances) of the process in order to make inference (Wikle, Zammit-Mangion, and Cressie (2019), for example, call this ‘descriptive’ modelling). While this method can be sufficient for many problems, there are many cases where we are underutilizing some knowledge of the underlying dynamic systems involved. For instance, in temperature models, we know that temperature has movement (convection) and spread (diffusion), and that the state at any given time will depend on its state at previous times ². We call models which make use of this ‘dynamic’ models.

A general way of writing such hierarchical dynamical models might be

\[\begin{split} Y_{t+1}(\cdot) &= \mathcal M_t(Y_0(\cdot), \dots, Y_t(\cdot)) + \omega_t(\cdot), \quad t=0, \dots, T-1,\\ Z_t(\cdot) &= \mathcal O_t(Y_t(\cdot)) + x(\cdot)^{\intercal}\bv \beta + \epsilon_t(\cdot), \quad t=1,\dots,T. \end{split} \]

This describes the scalar random fields $Z_t(\cdot), Y_t(\cdot)\in \mathbb R$ over the space $\mathcal D\subset \mathcal R^d$, which are the observed data and unobserved dynamic process, respectively. $\mathcal M_t$ here is a non-random ‘propagation operator’, defining how the process evolves with respect to it’s previous state(s), and $\mathcal O_t$ is a non-random ‘observation operator’, defining how observations of a given process state are taken. Both these fields have random (usually time-independent) additive random fields, $\omega_t(\cdot), \epsilon_t(\cdot)$, and we also include non-random measured linear covariate terms $x(\cdot)^{\intercal}\bv \beta$.

If we discretize the space into $n$ $spatial locations $\{\bv s_i\}_{i=1,\dots, n}$, assume the operator are linear, assert a Markov condition, and assume the errors are all normal, we get a simple linear dynamic system;

\[\begin{split} \bv Y_{t+1} &= M_t\bv Y_t + \bv \omega_t, \quad t=0, \dots, T-1,\\ \tilde{\bv Z}_t &= O_t\bv Y_t + \bv \epsilon_t, \quad t=1,\dots,T, \end{split} \tag{1}\]

where we have written $\bv Y_t = (Y_t(\bv s_1),\dots, Y_t(\bv s_n))$, and similar for $\bv Z_t, \bv \epsilon_t$ and $\bv\omega_t$, and we have written $\tilde{\bv Z}_t = \bv Z_t + X^{\intercal}\bv \beta$. This is a well-known type of system, the process $Y$ can easily be estimated either directly of with a Kalman filter/smoother and variants, which will be discussed later.

However, this model is restrictive and high-dimensional; $M_t$, the primary quantities which needs estimation, is of dimension $n\times n$, of which there are $T$ matrices to be estimated. Even if we allow the propagation matrix to be invariant in time, we can still only make predictions at the stations $\{\bv s_i\}$.

This motivates a different approach; in particular, one which allows us to estimate the random field at arbitrary points $Y_t(\bv s)$ using some spectral decomposition, which would alleviate these problems.

The Integro-difference equation model attempts to generalise Equation 1 into the continuous space by replacing the discrete linear $M_t$ by a continuous integral equivalent;

\[\begin{split} Y_{t+1}(\bv s) &= \int_{\mathcal D_s} \kappa_t(\bv s,\bv r) Y_t(\bv r) d\bv r + \omega_t(\bv s), \quad t=0, \dots, T-1, \\ Z_t(\bv s) &= Y_t(\bv s) + X(\bv s)^{\intercal}\bv \beta + \epsilon_t(\bv s), \quad t=1,\dots,T. \end{split} \tag{2}\]

Where $\omega_t(\bv s)$ is a small scale gaussian variation with no temporal dynamics (Cressie and Wikle 2015 call this a ‘spatially descriptive’ component), $\bv X(\bv s)$ are spatially varying covariates (for example, in a large-scale climate scenario, this might be latitude, concentration of some chemical/element like nitrogen) $\kappa(\bv s, \bv r)$ is the driving ‘kernel’ function, and $\epsilon_t$ is a Gaussian white noise ‘measurement error’ term.

Our operator is now $\mathcal M(Y_t(\bv s)) = \int_{\mathcal D_s} \kappa_t(\bv s,\bv r) Y_t(\bv r) d\bv r$, which can model diffusion and convection by choosing the shape of $\kappa$ (which, from now on, we will assume to be temporally invariant). This kernel defines how each point in space is affected by every other point in space at the previous time. For example, if we choose a Gaussian-like shape,

\[\begin{split} \kappa(\bv s, \bv r; \bv m, a, b) = a \exp \left( -\frac{1}{b} \vert \bv s- \bv r +\bv m(\bv s)\vert^2 \right), \end{split} \]

then the ‘flow’ would be in the direction of $-\bv m(\bv s)$, and the diffusion would be controlled by $b$ and $a$. This creates a ‘spatially variant kernel’, where the direction of flow varies across the space, as in Figure 1.

3 Spectral Representations

The key to being able to computationally work with IDEMs, as perhaps originally made by Wikle and Cressie (1999), is to work with the spectral decomposition of the process, in order to coerce the model hierarchy into a more familiar linear dynamical system form, like Equation 1.

This kind of dimension-reduction allows us to parametrise spatial fields with as few or as many parameters as we want.

3.1 Process decomposition

Choose a complete class of spatial spectral basis functions, $\{\phi_i(\cdot): \mathcal D\to \mathbb R\}_{i=1,\dots}$, and decompose the process spatial field at each time;

\[\begin{split} Y_t(\bv s) \approx \sum_{i=1}^{r} \alpha_{i,t} \phi_i(\bv s), \quad t=0,\dots,T. \end{split} \tag{3}\]

where we truncate the expansion at some $r\in\mathbb N$. Notice that we can write this in vector/matrix form, where we consider the vector field $\bv \phi(\cdot) = (\phi_1(\cdot),\dots, \phi_r(\cdot))^\intercal$; considering times $t=1,2,\dots, T$, we set

\[\begin{split} \bv \phi(\bv s) &= (\phi_1(\bv s), \phi_2(\bv s), \dots, \phi_r(\bv s))^{\intercal},\\ \bv \alpha_t &= (\alpha_{1,t}, \alpha_{2,t}, \dots, \alpha_{r, t})^{\intercal}. \end{split} \tag{4}\]

Now, (Equation 3) gives us, for any $\bv s\in \mathcal D$,

\[\begin{split} Y(\bv s; t) \approx \bv \phi^{\intercal}(\bv s) \alpha(t).\\ \end{split} \tag{5}\]

We can effectively now work exclusively with $\bv \alpha_t = (\alpha_{1,t},\dots, \alpha_{r,t})^\intercal$. To do so, we need to find the evolution equation of $\bv \alpha_t$, as given below.

Theorem 1 (Spectral form of the state evolution) Define the Gram matrix;

\[\Psi \coloneq \int_{\mathcal D_s} \bv \phi(\bv s) \bv \phi(\bv s)^\intercal d\bv s. \tag{6}\]

Then, the basis coefficients evolve by the equation

\[\bv \alpha_{t+1} = M \bv\alpha_t + \bv\eta_t, \tag{7}\]

where $M = \Psi^{-1} \int\int \bv\phi(\bv s) \kappa(\bv s, \bv r)\bv\phi(\bv r)^\intercal d\bv r d \bv s$ and $\bv\eta_t =\Psi^{-1} \int \bv \phi(\bv s)\omega_t(s)d\bv s$.

Proof. (Adapting from Dewar, Scerri, and Kadirkamanathan 2008), write out the process equation, (Equation 2), using the first equation of (Equation 5);

\[Y_{t+1}(\bv s) = \bv \phi(\bv s)^{\intercal} \alpha_{t+1} = \int_{\mathcal D_s} \kappa(\bv s, \bv r) \bv\phi(\bv r)^{\intercal}\bv \alpha_t d\bv r + \omega_t(\bv s), \]

We then multiply both sides by $\bv \phi(s)$ and integrate over $\bv s$

\[\begin{split} \int_{\mathcal D_s} \bv\phi(\bv s)\bv\phi(\bv s)^{\intercal} d\bv s \bv\alpha_{t+1} &= \int\bv\phi(\bv s)\int \kappa(\bv s, \bv r)\bv\phi(\bv r)^\intercal d\bv r d \bv s\ \bv\alpha_t + \int \bv \phi(\bv s)\omega_t(s)d\bv s\\ \Psi \bv\alpha_{t+1} &= \int\int \bv\phi(\bv s)\kappa(\bv s, \bv r) \bv\phi(\bv r)^\intercal d\bv r d \bv s\ \bv\alpha_t + \int \bv \phi(\bv s)\omega_t(s)d\bv s. \end{split} \]

So, finally, pre-multipling by the inverse of the gram matrix, $\Psi^{-1}$ (Equation 6), we arrive at the result.

3.2 Spectral form of the Process Noise

We still have to set out what the process noise, $\omega_t(\bv s)$, and it’s spectral counterpart, $\bv \eta_t$, are. Dewar, Scerri, and Kadirkamanathan (2008) fix the variance of $\omega_t(\bv s)$ to be uniform and uncorrelated across space and time, with $\omega_t(\bv s) \sim \mathcal N(0,\sigma^2)$ It is then easily shown that $\bv\eta_t$ is also normal, with $\bv\eta_t \sim \mathcal N(0, \sigma^2\Psi^{-1})$.

However, in practice, we simulate in the spectral domain; that is, if we want to keep things simple, it would make sense to specify (and fit) the distribution of $\bv\eta_t$, and compute the variance of $\omega_t(\bv s)$ if needed.

Lemma 1 Let $\bv\eta_t \sim \mathcal N(0, \Sigma_\eta)$, and $\cov[\bv\eta_t, \bv \eta_{t+\tau}] =0$, $\forall \tau>0$. Then $\omega_t(\bv s)$ has covariance

\[\cov [\omega_t(\bv s), \omega_{t+\tau}(\bv r)] = \begin{cases} \bv\phi(\bv s)^\intercal \Sigma_\eta \bv\phi(\bv r) & \text{if }\tau=0\\ 0 & \text{else}\\ \end{cases} \]

Proof. Consider $\Psi \bv\eta_t$, and consider the case $\tau=0$. It is clearly normal, with zero expectation and variance (using Equation 6),

\[\begin{split} \var[\Psi \bv\eta_t] &= \Psi \var[\bv\eta_t] \Psi^\intercal = \Psi\Sigma_\eta\Psi^\intercal,\\ &= \int_{\mathcal D_s} \bv\phi(\bv s) \bv\phi(\bv s)^\intercal d\bv s \ \Sigma_\eta \ \int_{\mathcal D_s} \bv\phi(\bv r) \bv\phi(\bv r)^\intercal d\bv r\\ &= \int\int_{\mathcal D_s^2} \bv\phi(\bv s) \bv\phi(\bv s)^\intercal \ \Sigma_\eta \ \bv\phi(\bv r) \bv\phi(\bv r)^\intercal d\bv r d\bv s\\ \end{split} \tag{8}\]

Since it has zero expectation, we also have

\[\begin{split} \var[\Psi\bv\eta_t] &= \mathbb E[(\Psi\bv\eta_t) (\Psi\bv\eta_t)^\intercal] = \mathbb E[\Psi\bv\eta_t\bv\eta_t^\intercal\Psi^\intercal]\\ &= \mathbb E \left[ \int_{\mathcal D_s} \bv\phi(\bv s)\omega_t(\bv s)d\bv s \int_{\mathcal D_s} \bv \phi(\bv r)^\intercal \omega_t(\bv r) d\bv r \right]\\ &= \int\int_{\mathcal D_s^2} \bv\phi(\bv s)\ \mathbb E[\omega_t(\bv s)\omega_t(\bv r)]\ \bv \phi(\bv r)^\intercal d\bv s d \bv r. \end{split} \tag{9}\]

We can see that, comparing (Equation 8) and (Equation 9), we have

\[\cov [\omega_t(\bv s), \omega_t(\bv r)] = \mathbb E[\omega_t(\bv s)\omega_t(\bv r)]= \bv\phi(\bv s)^\intercal \Sigma_\eta \bv\phi(\bv r). \]

Since, once again, $\mathbb E[\bv\omega_t(\bv s)]=0$.

For the $\tau\neq0$ case, it is simple to show that the covariance is 0.

3.3 Kernel Parameterisations

Next is the part of the system, which defines the dynamics; the kernel function, $\kappa$. There are a few ways to handle the kernel. One of the most obvious is to expand it out into a spectral decomposition as well;

\[\kappa \approx \sum_i \beta_i\psi(\bv s, \bv r). \]

This can allow for a wide range of interestingly shaped kernel functions, but see how these basis functions must now act on $\mathbb R^2\times \mathbb R^2$; to get a wide enough space of possible functions, we would likely need many terms in the spectral expansion.

A much simpler approach would be to simply parametrise the kernel function, to $\kappa(\bv s, \bv r, \bv \theta_\kappa)$. We then establish a simple shape for the kernel (e.g. Gaussian) and rely on very few parameters (for example, scale, shape, offsets). The example kernel used in the jaxidem is a Gaussian-shape kernel;

\[\kappa(\bv s, \bv r; \bv m, a, b) = a \exp \left( -\frac{1}{b} \vert \bv s- \bv r +\bv m\vert^2 \right). \]

Of course, this kernel lacks spatial dependence. We can add spatial variance back by adding dependence on $\bv s$ to the parameters, for example, varying the offset term as $\bv m(\bv s)$. Of course, now we are back to having entire functions as parameters, but taking the spectral decomposition of the parameters we actually want to be spatially variant seems like a reasonable middle ground (Cressie and Wikle 2015). The actual parameters of such a spatially-variant kernel are then the spectral coefficients for the expansion of any spatially variant parameters, as well as any constant parameters. This is precisely what is plotting in Figure 1, where the spectral coefficients are randomly sampled from a multivariate normal distribution;

\[\begin{split} \bv m(\bv s) = \left(\begin{matrix} \sum_{i=1}^{r_m} \phi_{\kappa,i}(\bv s) m^{(x)}_i\\ \sum_{i=1}^{r_m} \phi_{\kappa,i}(\bv s) m^{(y)}_i \end{matrix}\right), \end{split} \]

where $m^{(x)}_i$ and $m^{(y)}_i$ are coefficients for the x and y coordinates respectively, and $\phi_{\kappa, i}(\bv s)$ are basis functions (e.g. bisquare ³) functions in Figure 1).

3.4 IDEM as a linear dynamical system

To summarise, we have taken a truncated spectral decomposition to write the Integro-difference equation model as a more traditional linear dynamical system form (Equation 7). All that is left is to include our observations in our system.

Lets assume that at each time $t$ there are $n_t$ observations at locations $\bv s_{1,t},\dots, \bv s_{n_{t},t}$. We write the vector of the process at these points as $\bv Y(t) = (Y(s_{1,t};t), \dots, Y(s_{n_{t},t};t))^\intercal$, and, in it’s expanded form $\bv Y_t = \Phi_t \bv\alpha_t$, where $\Phi \in \mathbb R^{r\times n_{t}}$ is

\[\begin{split} \{\Phi_{t}\}_{i, j} = \phi_{i}(s_{j,t}). \end{split} \]

For the covariates, we write the matrix $X_t = (\bv X(\bv s_{1, t}), \dots, \bv X(\bv s_{1=n_{t}, t})^\intercal$. We then have

\[\begin{split} \bv Z_t &= \Phi \bv \alpha_t + X_{t} \bv \beta + \bv \epsilon_t, \quad t = 1,\dots, T,\\ \bv \alpha_{t+1} &= M\bv \alpha_t + \bv\eta_t,\quad t = 0,1,\dots, T-1,\\ M &= \int_{\mathcal D_s}\bv\phi(\bv s) \bv\phi(\bv s)^\intercal d\bv s \int_{\mathcal D_s^2}\bv\phi(\bv s) \kappa(\bv s, \bv r; \bv\theta_\kappa)\bv\phi(\bv r)^\intercal d\bv r d \bv s, \end{split} \]

Writing $\tilde{\bv{Z}}_t = \bv Z_t - X_t \bv \beta$,

\[\begin{split} \tilde{\bv Z}_t &= \Phi_{t} \bv \alpha_t + \bv \epsilon_t,\quad &t = 1,2,\dots, T,\\ \bv \alpha_{t+1} &= M \bv \alpha_t + \bv\eta_t,\quad &t = 0,1, \dots, T-1.\\ \end{split} \tag{10}\]

We should also initialise $\bv \alpha_0 \sim \mathcal N^{r}(\bv m_{0}, \Sigma_{0})$, and fix simple distributions to the noise terms,

\[\begin{split} \bv \epsilon_{t} \overset{\mathrm{iid}}{\sim} \mathcal N_{n_\mathrm{obs}}(0,\Sigma_\epsilon),\\ \bv \eta_{t} \overset{\mathrm{iid}}{\sim} \mathcal N_{R}(0,\Sigma_\eta), \end{split} \]

which are independent in time.

As in, for example, (Wikle and Cressie 1999), Equation 10 is now in a traditional enough form that the Kalman filter can be applied to filter and compute many necessary quantities for inference, including the marginal likelihood. We can use these quantities in either an EM algorithm or a Bayesian approach, or directly maximise the marginal data likelihood

We now move on to an example simulation of this kind of model using its spectral decomposition and jaxidem.

3.5 Example Simulation

We can now use the above to simulate easily from such models; once we have chosen the appropriate decompositions, we simply compute $M$ and propagate $\bv \alpha_t$ as we would when simulating any other linear dynamic system. We then use the spectral coefficients to generate $Y_t(\bv s)$ and $Z_t(\bv s)$ in the obvious way.

jaxidem implements this in the function sim_idem, or through the more user-friendly method idem.IDEM.simulate. An object of the IDEM class contains all the necessary information about basis decompositions, and the simulate methods calls simIDEM without compromising its jit-ability (although just-in-time computation obviously isn’t as important for simulation, the jit-ed function could save compile time if someone want to simulate from many models).

The gen_example_idem method creates a simple IDEM object without many required parameters;

Code

key = jax.random.PRNGKey(42)
keys = jax.random.split(key, 3)

model = idem.gen_example_idem(keys[0], k_spat_inv=False)
# Simulation
T = 35
nobs = 50

coords = jax.random.uniform(
                keys[1],
                shape=(nobs, 2),
                minval=0,
                maxval=1,
            )

times = jnp.repeat(jnp.arange(1, T + 1), coords.shape[0])
rep_coords = jnp.tile(coords, (T, 1))
x = rep_coords[:,0]
y = rep_coords[:,1]

process_data, obs_data = model.simulate(keys[2], x, y, times)

The resulting objects are of class st_data, containing a couple of niceties for handling spatio-temporal data, while still storing all data as JAX arrays. For example, the show_plot, save_plot and save_gif methods provide easy plotting;

Code

process_data.save_plot('figure/process_data_example.png')
obs_data.save_plot('figure/obs_data_example.png')

4 The Kalman filter, and its many flavours

The Kalman filter gives us linear estimates for the distribution of $\bv\alpha_r\mid \{\bv Z_t=\bv z_t\}_{t=0,...,r}$ in any dynamical system like Equation 1. Now that we have written the IDEM in this form, this filter can now help compute estimates for the moments of the state $\bv \alpha_t$. The Kalman filter also computes the marginal data likelihood, $\pi(\{\bv z_t\}_{t=1,\dots, T}\mid \bv\theta)$, where $\bv\theta$ are the model parameters. This allows us to perform maximum-likelihood estimation (as well as any other likelihood-based method of optimization). We will not prove the Kalman filter here, (for that, see, for example, Shumway, Stoffer, and Stoffer 2000).

Since it’s initial formulation in the 50s by a variety of authors (Kálmán included) there have been many variations of the Kalman filter proposed, even as recently as this decade with the temporally paralellised Kalman filter, more technically a variant of the information form of the Kalman filter, by Särkkä and Garcı́a-Fernández (2020).

4.1 The Kalman Filter

Firstly, we should establish some notation. Write

\[\begin{split} m_{i \mid j} &= \mathbb E[\bv\alpha_i \mid \{\bv Z_t=\bv z_t\}_{t=0,\dots,j}],\\ P_{i \mid j} &= \var[\bv\alpha_i \mid \{\bv Z_t=\bv z_t\}_{t=0,\dots,j}],\\ P_{i,j \mid k} &= \cov[\bv\alpha_i, \bv\alpha_k \mid \{\bv Z_t=\bv z_t\}_{t=0,\dots,k}]. \end{split} \]

For the initial terms, we choose Bayesian-like prior moments $m_{0\mid0}=m_0$ and $P_{0\mid0}=\Sigma_0$. For convenience and generality, we write $\Sigma_\eta$ and $\Sigma_\epsilon$ for the variance matrices of the process and observations. Note that, if the number of observations change at each time point (for example, due to missing data), then $\Sigma_\epsilon$ should be time varying (even in its shape); we could either always keep it as uncorrelated so that $\Sigma_\epsilon = \mathrm{diag} (\sigma_\epsilon^2)$, or perhaps put some kind of distance-dependant covariance function to it.

To move the filter forward, that is, given $m_{t\mid t}$ and $P_{t\mid t}$, to get $m_{t+1\mid t+1}$ and $P_{t+1\mid t+1}$, we first predict

\[\begin{split} \bv m_{t+1\mid t} &= M \bv m_{t\mid t},\\ P_{t+1\mid t} &= M P_{t\mid t} M^\intercal + \Sigma_\eta, \end{split} \tag{11}\]

then we add our new information, update, with $z_{t}$;

\[\begin{split} \bv m_{t+1\mid t+1} &= \bv m_{t+1\mid t} + K_{t+1} \bv e_{t+1}\\ P_{t+1\mid t+1} &= [I- K_{t+1}\Phi_{t+1}]P_{t+1\mid t} \end{split} \tag{12}\]

where $K_{t+1}$ is the Kalman gain;

\[\begin{split} K_{t+1} = P_{t+1\mid t}\Phi_{t+1}^\intercal [\Phi_{t+1} P_{t+1\mid t} \Phi_{t+1}^\intercal + \Sigma_\epsilon]^{-1}, \quad t=0,\dots,T-1, \end{split} \]

and $\bv e_{t+1}$ are the prediction errors

\[\begin{split} \bv e_{t+1} = \tilde{\bv z}_{t+1}-\Phi_{t+1} \bv m_{t+1\mid t}, \quad t=1,\dots,T. \end{split} \]

Starting with $m_0$ and $P_0$, we can then iteratively move across the data to eventually compute $m_{T\mid T}$ and $P_{T\mid T}$.

Assuming Gaussian all random variables here are Gaussian, this is the optimal mean-square estimators for these quantities, but even outside of the Gaussian case, these are optimal for the class of linear operators.

We can compute the marginal data likelihood alongside the Kalman filter using the prediction errors $\bv e_t$. These, under the assumptions we have made about $\bv \eta_t$ and $\bv\epsilon_t$ being normal, are also normal with zero mean and variance

\[\begin{split} \mathbb V\mathrm{ar}[\bv e_t]=\Sigma_t= \Phi_{t} P_{t\mid t-1} \Phi_{t}^\intercal + \Sigma_\epsilon. \end{split} \tag{13}\]

Therefore, the log-likelihood at each time is

\[\begin{split} \mathcal L(Z\mid\bv\theta) = -\frac12\sum \log\det(\Sigma_t(\bv\theta)) - \frac12 \sum\bv e_t(\bv\theta)^\intercal\Sigma_{t}(\bv\theta)^{-1} \bv e_t(\bv\theta) - \frac{n_{t}}{2}\log(2*\pi). \end{split} \]

Summing these across time, we get the log likelihood for all the data.

A simplified example of the Kalman filter function, written to be JAX compatible, used in the package is this;

Code

@jax.jit
def kalman_filter(m_, P_0, M, PHI_obs, Sigma_eta, Sigma_eps, ztildes):
    nbasis = m_0.shape[0]
    nobs = ztildes.shape[0]

    @jax.jit
    def step(carry, z_t):
        m_tt, P_tt, _, _, ll, _ = carry

        # predict
        m_pred = M @ m_tt
        P_pred = M @ P_tt @ M.T + Sigma_eta

        # Update
        # Prediction Errors
        eps_t = z_t - PHI_obs @ m_pred

        Sigma_t = PHI_obs @ P_pred @ PHI_obs.T + Sigma_eps
        # Kalman Gain
        K_t = (jnp.linalg.solve(Sigma_t, PHI_obs)@ P_pred.T).T

        m_up = m_pred + K_t @ eps_t
        P_up = (jnp.eye(nbasis) - K_t @ PHI_obs) @ P_pred

        # likelihood of epsilon, using cholesky decomposition
        ll_new = ll - 0.5 * n * jnp.log(2*jnp.pi) - \
            0.5 * jnp.log(jnp.linalg.det(Sigma_t)) -\
            0.5 * e.T @ jnp.linalg.solve(Sigma_t, e)

        return (m_up, P_up, m_pred, P_pred, ll_new, K_t), (m_up, P_up, m_pred, P_pred, ll_new, K_t,)

    carry, seq = jl.scan(
        step,
        (m_0, P_0, m_0, P_0, 0, jnp.zeros((nbasis, nobs))),
        ztildes.T,
    )

    return (carry[4], seq[0], seq[1], seq[2][1:], seq[3][1:], seq[5][1:])

For the documentation of the method provided by the package, see filter_smoother_functions.kalman_filter.

4.2 The Information Filter

In some computational scenarios, it is beneficial to work with vectors of consistent dimension. In Python JAX, the efficient scan method works only with such arrays; JAX has no support for jagged arrays, and traditional for loops will likely lead to long compile times when jit-compiled. Although there are some tools in JAX to get around this problem (namely the jax.tree functions which allow mapping over PyTrees), scan is still a large problem; since the Kalman filter is, at it’s core, a scan-type operation (scanning over the data), this causes a large problem when the observation dimension is changing, as is frequent with many spatio-temporal data.

But it is possible to re-write the Kalman filter in a way which is compatible with this kind of data. The ‘information filter’ (sometimes called inverse Kalman filter or other names) involves transforming the data into its ‘information form’, which will always have consistent dimension, allowing us to avoid jagged scans.

The information filter is simply the Kalman filter re-written to use the Gaussian distribution’s canonical parameters ⁴, those being the information vector and the information matrix. If a Gaussian distribution has mean $\bv\mu$ and variance matrix $\Sigma$, then the corresponding information vector and information matrix is $\nu = \Sigma^{-1}\mu$ and $Q = \Sigma^{-1}$, correspondingly.

Theorem 2 The Kalman filter can be rewritten in information form as follows (for example, Khan 2005). Write

\[\begin{split} Q_{i\mid j} &= P_{i\mid j}^{-1}\\ \bv\nu_{i\mid j} &= Q_{i\mid j} \bv m_{i\mid j} \end{split} \]

and transform the observations into their ‘information form’, for $t=1,\dots, T$

\[\begin{split} I_{t} = \Phi_{t}^{\intercal} \Sigma_{\epsilon}^{-1}\Phi_{t},\\ i_{t} = \Phi_{t}^{\intercal} \Sigma_{\epsilon}^{-1} \bv z_{t}. \end{split} \tag{14}\]

The prediction step now becomes

\[\begin{split} \bv\nu_{t+1\mid t} &= (I-J_t) M^{-1}\bv\nu_{t\mid t}\\ Q_{t+1\mid t} &= (I-J_t) S_{t} \end{split} \tag{15}\]

where $S_t = M^{-\intercal} Q_{t\mid t} M^{-1}$ and $J_t = S_t [S_{t}+\Sigma_{\eta}^{-1}]^{-1}$.

Updating is now as simple as adding the information-form observations;

\[\begin{split} \bv\nu_{t+1\mid t+1} &= \bv\nu_{t+1\mid t} + i_{t+1}\\ Q_{t+1\mid t+1} &= Q_{t+1\mid t} + I_{t+1}. \end{split} \tag{16}\]

Proof in Appendix (Section 7.2.)

We can see that the information form of the observations (Equation 14) will always have the same dimension ⁵. For our purposes, this means that jax.lax.scan will work after we ‘informationify’ the data, which can be done using jax.tree.map. This is implemented in the functions information_filter and information_filter_indep (for uncorrelated errors).

There are other often cited advantages to filtering in this form. It can be quicker that the traditional form in certain cases, especially when the observation dimension is bigger than the state dimension (since you solve a smaller system of equations with $[S_t + \Sigma_\eta]^{-1}$ in the process dimension instead of $[\Phi_t P_{t+1\mid t} \Phi_t^\intercal + \Sigma_\epsilon]^{-1}$ in the observation dimension) (Assimakis, Adam, and Douladiris 2012).

The other often mentioned advantage is the ability to use a flat prior for $\alpha_0$; that is, we can set $Q_0$ as the zero matrix, without worrying about an infinite variance matrix. While this is indeed true, it is actually possible to do the same with the Kalman filter by doing the first step analytically, see Section 7.3.

As with the Kalman filter, it is also possible to get the data likelihood in-line as well. Again, we would like to stick with things in the state dimension, so working directly with the prediction errors $\bv e_t$ should be avoided. Luckily, by multiplying the errors by $\Phi_t^\intercal \Sigma_\epsilon^{-1}$, we can define the ‘information errors’ $\bv \iota_t$;

\[\begin{split} \bv \iota_t &= \Phi_t^\intercal \Sigma_\epsilon^{-1} \bv e_t = \Phi_t^\intercal \Sigma_\epsilon^{-1} \tilde{\bv z}_t -\Phi_t^\intercal \Sigma_\epsilon^{-1}\Phi_t m_{t\mid t-1}\\ &= i_t - I_tQ_{t\mid t-1}^{-1}\bv \nu_{t\mid t-1}. \end{split} \]

The variance of this quantity is also easy to find;

\[\begin{split} \var[\bv \iota_t] &= \Phi_t^\intercal \Sigma_\epsilon^{-1}\var[\bv e_t]\Sigma_\epsilon^{-1}\Phi_t\\ &= \Phi_t^\intercal \Sigma_\epsilon^{-1} [\Phi_{t} P_{t\mid t-1} \Phi_{t}^\intercal + \Sigma_\epsilon] \Sigma_\epsilon^{-1}\Phi_t\\ &= \Phi_t^\intercal \Sigma_\epsilon^{-1}\Phi_{t} Q_{t\mid t-1}^{-1} \Phi_{t}^\intercal \Sigma_\epsilon^{-1}\Phi_t \Phi_t^\intercal \Sigma_\epsilon^{-1} \Phi_t\\ &= I_t Q_{t\mid t-1}^{-1} I_t^\intercal + I_t =: \Sigma_{\iota, t}. \end{split} \]

Noting that $\bv \iota$ clearly still has mean zero, this allows us once again to compute the log likelihood, this time through $\bv\iota$

\[\begin{split} \mathcal L(z_t\mid\bv\theta) = -\frac12\sum \log\det(\Sigma_{\iota, t}(\bv\theta)) - \frac12 \sum\bv \iota_t(\bv\theta)^\intercal\Sigma_{\iota, t}(\bv\theta)^{-1} \bv \iota_t(\bv\theta) - \frac{r}{2}\log(2*\pi). \end{split} \]

4.3 The Square-Root filters

In certain high-dimensional cases, the Kalman filter (and, indeed, the information filter) can encounter numerical stability issues. For example, in the predict step of the standard Kalman filter, note the update step for the variance matrix

\[\begin{split} P_{t+1\mid t+1} &= [I- K_{t+1}\Phi_{t+1}]P_{t+1\mid t}. \end{split} \]

Somewhat masked within this equation is two (often very small) variance matrices subtracted from eachother. While analytically, the result is still guaranteed to be positive (semi-)definite, when done in floating point arithmetic (especially in single-precision or lower), the result can often be numerically indefinite. When the variances are very low (as they often become in these Kalman filters), the eigenvalues come out very close to zero and can tick over to becoming negative erroneously. This can lead to definiteness issues with all the other variance matrices, most crucially $\Sigma_t$ Equation 13. When this happens, computation of the likelihood likely fails (certainly when such a computation involves a Cholesky decomposition). Even if such is rare to happen with 64-bit precision, modern GPU hardware tends to be much more efficient with Single (32-bit) precision, so it may still be desirable to increase stability if it permits using a lower precision. The Square-root filter and the SVD filter are such algorithms.

4.3.1 The Square-root Kalman filter

The square-root Kalman filter has it’s origins soon after the standard Kalman filter gained popularity (Kaminski, Bryson, and Schmidt 1971). Of course, computational and memory constraints necessitated stable and memory-efficient approaches, while today the standard Kalman filter (and, more recently, it’s parallel counterpart, to be covered in section [TBD]) usually suffice.

As its name suggests, this variant involves carriyng through the square roots of variances ⁶ instead of the variances themselves. This leads to, at least in some sense, an increased precision, and we can always guarentee that, at least analytically, the square of these square roots (the variances) are positive (semi-)definite.

While the square root filter has been known for a long time (even used during NASA’s Apollo program), more recently, (Tracy 2022) wrote it neatly in terms of the QR decomposition, and this is what we base the presentation on here.

The key observation used for this filter is that if we have the sum of two equations where a square root is known for both, it can be written

\[\begin{split} X + Y &= A^\intercal A + B^\intercal B\\ &= \left[A^\intercal\ B^\intercal\right] \left[\begin{matrix}A\\B\end{matrix}\right] \end{split} \]

Taking the QR decomposition of the vertical block yields QR, and since $(QR)^\intercal\ (QR) = R^\intercal Q^\intercal Q R = R^\intercal R$, so $R$ is a square root of $X+Y$. This motivates the following ‘QR operator’

\[\begin{split} \mathrm qr_R(A, B), \end{split} \]

as the matrix $R$ in the QR decomposition of the block matrix

\[\begin{split} \left[\begin{matrix}A\\B\end{matrix}\right]. \end{split} \]

Beginning with the Cholesky decomposition of the initial variances, $P_0 = U_0^\intercal U_0$, $\Sigma_{\eta} = U_{\eta}^\intercal U_{\eta}$ and $\Sigma_\epsilon = U_{\epsilon}^\intercal U_{\epsilon}$ the predict step for the variance becomes

\[\begin{split} U_{t+1\mid t} = \sqrt{P_{t+1\mid t}} = \mathrm{qr}_R(U_{t\mid t} M^\intercal, U_{\eta}), \end{split} \]

with the step for the means being the same as before (Equation 11). The prediction errors, prediction variance and Kalman gain are now

\[\begin{split} \bv e_{t+1} &= \tilde{\bv z}_t - \Phi_{t+1} \bv m_{t+1\mid t},\\ \Sigma_{t+1} &= \Phi_{t+1} P_{t+1\mid t} \Phi_{t+1}^\intercal + \Sigma_\epsilon,\\ \sqrt{\Sigma_{t+1}} &= U_{e, t+1} = \mathrm{qr}_R(\Phi_{t+1} U_{t+1\mid t}, U_\epsilon),\\ K_{t+1} &= P_{t+1\mid t} \Phi_{t+1}^\intercal \Sigma_{t+1}^{-1} = U_{t+1\mid t}^\intercal U_{t+1\mid t} \Phi_{t+1}^\intercal (U_{e, t+1}^\intercal U_{e, t+1})^{-1}\\ &= (U_{e, t+1}^{-1}U_{e, t+1}^{-\intercal}\Phi_{t+1}U_{t+1\mid t}^\intercal U_{t+1\mid t})^\intercal \end{split} \]

where the last equation for the Kalman gain can easily be solved with a computationally efficient triangular solve.

Finally, the update step for the mean is simply

\[\begin{split} \bv m_{t+1\mid t+1} = \bv m_{t \mid t+1} + K_{t+1} {\bv e_{t+1}}. \end{split} \]

However, for the update we use the so-called Joseph stabilised form (sometimes used in the derivation of the Kalman filter)

\[\begin{split} P_{t+1\mid t+1} &= \mathbb C\mathrm{ov}[\bv \alpha_t - \bv m_{t+1\mid t+1}]\\ &= \mathbb C\mathrm{ov}[\bv \alpha_t - \bv m_{t \mid t+1} - K_{t+1} (\tilde{\bv z}_{t+1} - \Phi_{t+1} \bv m_{t+1\mid t})]\\ &= \mathbb C\mathrm{ov}[\bv \alpha_t - \bv m_{t \mid t+1} - K_{t+1} (\Phi_{t+1} \bv m_{t+1} + \bv \epsilon_{t+1} - \Phi_{t+1}\bv m_{t+1\mid t})]\\ &= \mathbb C\mathrm{ov}[(I - K_{t+1} \Phi_{t+1})(\bv \alpha_t - \bv m_{t+1 \mid t}) - \bv \epsilon_{t+1}]\\ &= (I - K_{t+1} \Phi_{t+1}) \mathbb C\mathrm{ov}[\bv \alpha_t - \bv m_{t+1 \mid t}](I - K_{t+1} \Phi_{t+1})^\intercal + \mathbb C\mathrm{ov}[\bv \epsilon_{t+1}]\\ &= (I - K_{t+1} \Phi_{t+1}) P_{t+1\mid t}(I - K_{t+1} \Phi_{t+1})^\intercal + \Sigma_{\epsilon} \end{split} \]

which is often simplified further to Equation 12, but as discussed that involves negation of two square root matrices; this form is more complicated and involves more matrix computation, but guarentees that the result will be positive (semi-)definite. Furthermore, this is also in a form that allows us to easily find the square root with the QR trick;

\[\begin{split} U_{t+1\mid t+1} = \sqrt{P_{t+1\mid t+1}} = \mathrm{qr}_R(U_{t+1\mid t}(I - K_{t+1} \Phi_{t+1})^{\intercal}, U_\epsilon). \end{split} \]

Of course, from here, we can similarily easily compute the data likelihood using $U_{e,t+1}$, using standard techniques; the multivariate normal likelihood is usually computed using the cholesky decomposition of the variance matrix anyway. The result is an algorithm which is of a higher order than the standard Kalman filter, but the stability is often worth the comprimise. Once jit-compiled, the function sqrt_filter_indep on a moderately sized IDEM (on a discrete GPU) on 64-bit precision ⁷ takes approximately 23.5ms, compared to kalman_filter_indep taking approximately 7.8ms, achieving similar log-likelihoods (whith some difference due to precision). However, running the code in 32-bit causes the Kalman filter likelihood computation to fail, the square-root filter succeeds at a time of 7.0ms.

4.3.2 Square-root Information filter

Very similarily, we can write the information filter using the square roots of the information matrices. We will label roots of ‘information-type’ matrices with $R$, and ‘variance-type’ matrices (their inverse) with $U$.

We now carry the data’s information matrix’s (Equation 14) square root as well, $R_t^{(I)} = \sqrt(\Phi_t^\intercal\Sigma_\epsilon^{-1} \Phi_t)$, with the same observation vector.

So, once again, beginning with the lower-triangular cholesky decomposition $Q_0 = R^\intercal R$, and the upper-triangular $\Sigma_{\eta} = U_{\eta}^\intercal U_{\eta}$ and $\Sigma_\epsilon = U_{\epsilon}^\intercal U_{\epsilon}$.

So, to predict step for the information matrix (Equation 15) becomes

\[\begin{split} Q_{t+1\mid t} &= (M Q_{t\mid t}^{-1} M^\intercal + \Sigma_\eta)^{-1}\\ &= (M R_{t\mid t}^{-1}R_{t\mid t}^{-\intercal} M^\intercal + U_\eta^\intercal U_\eta)^{-1}\\ &= \left[(M R_{t\mid t}^{-1}, U_\eta^\intercal)\left(\begin{matrix}R_{t\mid t}^{-\intercal} M^\intercal\\U_\eta\end{matrix}\right)\right]^{-1}\\ R_{t+1\mid t}^{-1} &= \mathrm{qr}_R(R_{t\mid t}^{-\intercal} M^\intercal, U_\eta) \end{split} \tag{17}\]

This must now be explicitly inverted, which isn’t a big problem since it is upper-triangular.

The update on the information vector is now

\[\begin{split} \bv\nu_{t+1\mid t} &= Q_{t+1\mid t} M Q_{t\mid t}^{-1} \bv \nu_{t\mid t}\\ &= R_{t+1\mid t}^\intercal R_{t+1\mid t} M R_{t\mid t}^{-1}R_{t\mid t}^{-\intercal} \bv \nu_{t\mid t}, \end{split} \tag{18}\]

which can be done, as in the square-root Kalman filter’s Kalman gain computation, using forward-solves.

Now the update step is

\[\begin{split} \bv\nu_{t+1\mid t+1} &= \bv\nu_{t+1\mid t} + i_{t+1}\\ Q_{t+1\mid t+1} &= Q_{t+1\mid t} + I_{t+1}\\ &= R_{t+1\mid t}^\intercal R_{t+1\mid t} + R^{(I)\intercal}_{t+1}R^{(I)}_{t+1}\\ R_{t+1\mid t+1} &= \mathrm{qr}_R(R_{t+1\mid t}, R^{(I)}_{t+1}). \end{split} \tag{19}\]

4.4 Smoothing

Beyond the filtering, another task is smoothing. That is, filters estimate $\bv m_{T\mid T}$ and $P_{T\mid T}$, but there is use for estimating $\bv m_{t\mid T}$ and $P_{t\mid T}$ for all $t=0,\dots, T$.

We simply work backwards from $\bv m_{T\mid T}$ and $P_{T\mid T}$ values using what is known as the Rauch-Tung-Striebel (RTS) smoother;

\[\begin{split} \bv m_{t-1\mid T} &= \bv m_{t-1\mid t-1} + J_{t-1}(\bv m_{t\mid T} - \bv m_{t\mid t-1}),\\ P_{t-1\mid T} &= P_{t-1\mid t-1} + J_{t-1}(P_{t\mid T} - P_{t\mid t-1})J_{t-1}^\intercal, \end{split} \tag{20}\]

where,

\[\begin{split} J_{t-1} = P_{t-1\mid t-1}M^\intercal[P_{t\mid t-1}]^{-1}. \end{split} \]

We can clearly see, then, that it is crucial to keep the values in Equation 11.

We can then also compute the lag-one cross-covariance matrices $P_{t,t-1\mid T}$ using the Lag-One Covariance Smoother. This will b useful, for example, in the expectation-maximisation algorithm later. From

\[\begin{split} P_{T,T-1\mid T} = (I - K_T\Phi_{T}) MP_{T-1\mid T-1}, \end{split} \]

we can compute the lag-one covariances

\[\begin{split} P_{t, t-1\mid T} = P_{t\mid t}J_{t-1}^\intercal + J_{t}[P_{t+1,t\mid T} - MP_{t-1\mid t-1}]J_{t-1}^\intercal \end{split} \tag{21}\]

These values can be used to implement the expectation-maximisation (EM) algorithm which will be introduced later.

5 EM Algorithm (NEEDS A LOT OF WORK, PROBABLY IGNORE FOR NOW)

Instead of the marginal data likelihood, we may instead want to work with the ‘full’ likelihood, including the unobserved process, $l(\bv z(1),\dots, \bv z(T), \bv Y(1), \dots, \bv Y(T)\mid \bv\theta)$, or, equivalently, $l(\bv z(1),\dots, \bv z(t), \bv \alpha(1), \dots, \bv\alpha(T)\mid \bv\theta)$. This is difficult to maximise directly, but can be done with the EM algorithm, consisting of two steps, which can be shown to always increase the full likelihood.

Firstly, the E step is to find the function

\[\begin{split} \mathcal Q(\bv \theta; \bv \theta') = \mathbb E_{\bv Z(t)\sim p(Z \mid \bv\alpha(t),\bv\theta)}[\log p_{\bv\theta}(Z^{(T)}, A^{(T)})\mid Z^{(T)}], \end{split} \tag{22}\]

where $Z^{(T)} = \{\bv z_t\}_{t=0,\dots,T}$, $A^{(T)} = \{\bv \alpha_t\}_{t=0,\dots,T}$ and $A^{(T-1)} = \{\bv \alpha_t\}_{t=0,\dots,T-1}$. This approximates $\log p_\theta(Z^{(T)}, A^{(T)})$.

Proposition 1 We have [NOTE: This may well be wrong in places…]

\[\begin{split} -2\mathcal Q(\bv\theta;\bv\theta') &= \mathbb E_{Z^{(T)}\sim p(Z \mid A^{(T)},\bv\theta')}[\log p_{\bv\theta}(Z^{(T)}, A^{(T)}\mid Z^{(T)} = z^{(T)})]\\ &\eqc \sigma_\epsilon^2 [\sum_{t=0}^{T}\bv z_t^{\intercal}z_t - 2\Phi_t(\sum_{t=1}^{T} \bv z_t^\intercal \bv m_{t\mid T}) - 2(\sum_{t=0}^{T} \bv z_t^T)X_t\bv\beta\\ &\quad\quad\quad +\Phi_t^\intercal(\sum_{t=0}^{T}\mathrm{tr}\{P_{t\mid T} - \bv m_{t\mid T}\bv m_{t\mid T}^{\intercal}\})\Phi_t + 2X_t\bv\beta\Phi_t(\sum_{t=0}^{T}\bv m_{t\mid T}) + (\sum_{t=1}^{T}X_t^\intercal \bv\beta^{\intercal}\bv\beta X_t)]\\ &\quad + \mathrm{tr}\{\Sigma_\eta^{-1}[(\sum_{t=1}^{T}P_{t\mid T} - m_{t\mid T}) - 2M(\sum_{t=1}^{T}P_{t,t-1\mid T} - \bv m_{t-1,T}\bv m_{t\mid T}^{\intercal})\\ &\quad\quad\quad\quad\quad + M(\sum_{t=1}^{T}P_{t-1\mid T} - \bv m_{t-1\mid T}\bv m_{t-1\mid T}^{\intercal})M^\intercal]\}\\ &\quad + \mathrm{tr}\{\Sigma_0^{-1}[P_{0\mid T} - m_{0\mid T}m_{0\mid T}^{\intercal} - 2\bv m_{0\mid T}\bv m_0 + \bv m_0\bv m_0^\intercal]\}\\ &\quad + \log(\det(\sigma_\epsilon^{2T}\Sigma_\eta^{T+1}\Sigma_0)) \end{split} \tag{23}\]

Proof. See appendix.

In the EM algorithm, we maximise the full likelihood by changing $\bv \theta$ in order to increase (Equation 23), which can be shown to guarantee that the Likelihood $L(\bv \theta)$ also increases. The idea is then that repeatedly alternating between adjusting $\bv \theta$ to increase Equation 23, and then doing the filters and smoothers to obtain new values for $\bv m_{t\mid T}$, $P_{t\mid T}$, and $P_{t,t-1\mid T}$.

6 Algorithm for Maximum Complete-data Likelihood estimation

Overall, our algorithm for Maximum Likelihood estimation is:

Set $i=0$ and take an initial guess for the parameters we are considering, $\bv\theta_0=\bv\theta_i$
Starting from $\bv m_{0\mid 0}=\bv m_0, P_{0\mid0}=\Sigma_0$, run the Kalman Filter to get $\bv m_{t\mid t}$, $P_{t\mid t}$, and $K_t$ for all $t$ Equation 12,
Starting from $\bv m_{T\mid T}, P_{T\mid T}$, run the Kalman Smoother to get $\bv m_{t\mid T}$, $P_{t\mid T}$, and $J_t$ for all $t$ (Equation 20),
Starting from $P_{T,T-1\mid T} = (I - K_nA_n) MP_{T-1\mid T-1}$, run the Lag-One Smoother to get $\bv m_{t,t-1\mid T}$ and $P_{t,t-1\mid T}$ for all $t$ Equation 21,
Use the above values to construct $\mathcal Q(\bv\theta;\bv \theta')$ in Equation 23,
Maximise the function $\mathcal Q(\bv\theta;\bv \theta')$ to get a new guess $\bv \theta_{i+1}$, then return to step 2,
Stop once a certain criteria is met.

7 Appendix

7.1 Woodbury’s identity

The following two sections will make heavy use of the Woodbury identity.

Lemma 2 (Woodbury’s Identity) We have, for conformable matrices $A, U, C, V$,

\[\begin{split} (A + UCV)^{-1} = A^{-1} - A^{-1} U (C^{-1} + VA^{-1}U)^{-1}VA^{-1}. \end{split} \tag{24}\]

Additionally, we have the variant

\[\begin{split} (A + UCV)^{-1}UC = A^{-1} U(C^{-1} + VA^{-1}U)^{-1}. \end{split} \tag{25}\]

Proof. We only prove (Equation 25), since various proofs of (Equation 24) are well known (see, for example, the Wikipedia page).

Simply multipliying (Equation 24) by $CU$, (similar to Khan 2005, although there is an error in their proof)

\[\begin{split} (A+UCV)^{-1}UC &= A^{-1}UC - A^{-1}U(C^{-1} + VA^{-1}U)^{-1}VA^{-1}UC\\ &= A^{-1}UC - A^{-1}U(C^{-1} + VA^{-1}U) [(C^{-1} +VA^{-1}U)C - I]\\ &= A^{-1}U(C^{-1}+VA^{-1}U) \end{split} \]

as needed.

7.2 Proof of Theorem 2

Proof. Firstly, for the prediction step, using $S_t = M^{-\intercal}Q_{t\mid t}M^{-1}$ and $J_t = S_t(\Sigma_\eta^{-1} + S_t)^{-1}$ and the identities Equation 24 and Equation 25,

\[\begin{split} Q_{t+1\mid t} &= P_{t+1\mid t}^{-1} = (MQ_{t\mid t}^{-1}M^\intercal + \Sigma_\eta)^{-1}\\ &= S_t - J_t S_t = (I-J_t)S_t, \end{split} \]

where we used $A=MQ_{t\mid t}^{-1}M^\intercal$, $C=\Sigma_\eta$ and $U=C=I$ in Equation 24. Thurthermore,

\[\begin{split} \bv \nu_{t+1\mid t} &= Q_{t+1\mid t} \bv m_{t+1\mid t}\\ &= Q_{t+1\mid t} M Q_{t\mid t}^{-1} \bv \nu_{t\mid t} = Q_{t+1\mid t} (M Q_{t\mid t}^{-1}) \bv \nu_{t\mid t}\\ &= (I-J_t)M^{-\intercal}Q_{t\mid t}M^{-1} (M Q_{t\mid t}^{-1}) \bv \nu_{t\mid t}\\ &= (I-J_t)M^{-\intercal} \bv \nu_{t\mid t}. \end{split} \]

For the update step,

\[\begin{split} Q_{t+1\mid t+1} &= P_{t+1\mid t+1}^{-1}\\ &= (Q_{t+1}^{-1} - Q_{t+1\mid t}^{-1}\Phi_{t+1}^\intercal[\Phi_{t+1}\Sigma_\epsilon\Phi_{t+1}^\intercal + \Sigma_\epsilon]^{-1}\Phi_{t+1}Q_{t+1\mid t}^{-1})^{-1}\\ &= ((Q_{t+1\mid t} + \Phi_{t+1}^\intercal\Sigma_\epsilon^{-1}\Phi_{t+1})^{-1})^{-1} = Q_{t+1\mid t} + \Phi_{t+1}^\intercal\Sigma_\epsilon^{-1}\Phi_{t+1}\\ &= Q_{t+1\mid t} + I_{t+1}. \end{split} \]

Then, writing $\bv m_{t+1\mid t+1}$ in terms of $Q_{t+1\mid t}$ and $\bv \nu_{t+1\mid t}$

\[\begin{split} \bv m_{t+1\mid t+1} &= Q_{t+1\mid t}^{-1} \bv \nu_{t+1\mid t} - Q_{t+1\mid t}^{-1}\Phi_{t+1}^\intercal[\Phi_{t+1}Q_{t+1\mid t}^{-1}\Phi_{t+1}^\intercal +\Sigma_\epsilon]^{-1} [\tilde{\bv z}_{t+1} - \Phi_{t+1}Q_{t+1\mid t}^{-1}\bv \nu_{t+1\mit t}]\\ &= (Q_{t+1\mid t}^{-1} - Q_{t+1\mid t}^{-1}\Phi_{t+1}^\intercal[\Phi_{t+1}Q_{t+1\mid t}^{-1}\Phi_{t+1}^\intercal +\Sigma_\epsilon]^{-1}\Phi_{t+1}Q_{t+1\mid t}^{-1})\bv \nu_{t+1\mid t} \\ &\quad + Q_{t+1\mid t}^{-1}\Phi_{t+1}^\intercal[\Phi_{t+1}Q_{t+1\mid t}^{-1}\Phi_{t+1}^\intercal +\Sigma_\epsilon]^{-1}\tilde{\bv z}_{t+1}\\ &= [Q_{t+1\mid t} + I_{t+1}]^{-1}\bv \nu_{t+1\mid t}\\ &\quad + [Q_{t+1\mid t} + I_{t+1}]^{-1}\Phi_{t+1}\Sigma_\epsilon^{-1}\tilde{\bv z}_{t+1}, \end{split} \]

and now noting that $\bv\nu_{t+1\mid t+1} = (Q_{t+1\mid t} + I_{t+1}) \bv m_{t+1\mid t+1}$, we complete the proof.

7.3 Truly Vague Prior with the Kalman Filter

It has been stated before that one of the large advantages of the information filter is the ability to use a completely vague prior $Q_{0}=0$. While this is true, it is actually possible to do this in the Kalman filter by ‘skipping’ the first step (contrary to some sources, such as the Wikipedia page as of January 2025).

Theorem 3 In the Kalman Filter (Section 4.1), if we allow $P_{0}^{-1} = 0$, effectively setting infinite variance, and assuming the propagator matrix $M$ is invertible, we have

\[\begin{split} \bv m_{1\mid1} &= (\Phi_1^\intercal \Sigma_\epsilon^{-1} \Phi_1)^{-1} \Phi_1 \Sigma_\epsilon^{-1} \tilde{\bv z}_1,\\ P_{1\mid1} &= (\Phi_1^\intercal \Sigma_\epsilon^{-1} \Phi_1)^{-1}. \end{split} \tag{26}\]

Therefore, starting with these values then continuing the filter as normal, we can perform the Kalman filter with ‘infinite’ prior variance.

[NOTE: The requirement that M be invertible should be droppable, see the proof below]

Proof. Unsurprisingly, the proof is effectively equivalent to proving the information filter and setting $Q_0 = P_0^{-1}=0$.

For the first predict step (Equation 11),

\[\begin{split} \bv m_{1\mid0} &= M \bv m_0,\\ P_{1\mid0} &= M P_0 M^\intercal + \Sigma_\eta. \end{split} \]

By (Equation 24),

\[\begin{split} P_{1\mid0}^{-1} &= \Sigma_\eta^{-1} - \Sigma_\eta^{-1} M (P_0^{-1} + M^\intercal \Sigma_\eta^{-1} M)^{-1}M^\intercal\Sigma_\eta^{-1}\\ &= \Sigma_\eta^{-1} - \Sigma_\eta^{-1} M (M^\intercal \Sigma_\eta^{-1} M)^{-1}M^\intercal\Sigma_\eta^{-1}\\ &= \Sigma_\eta^{-1} - \Sigma_\eta^{-1} = 0. \end{split} \]

So, moving to the update step (Equation 12),

\[\begin{split} \bv m_{1\mid1} = M \bv m_0 + P_{1\mid0}\Phi_1 [\Phi_1 P_{1\mid0} \Phi_1^\intercal + \Sigma_\epsilon]^{-1}(\tilde{\bv{z}}_1 - \Phi M \bv m_0).\\ \end{split} \]

Applying (Equation 25) with $A = P_{1\mid0}^{-1}, U=\Phi_1, V=\Phi_1^\intercal, C=\Sigma_\epsilon^{-1}$,

\[\begin{split} \bv m_{1\mid1} &= M \bv m_0 + (P_{1\mid0}^{-1} + \Phi_1^\intercal\Sigma_\epsilon^{-1} \Phi_1)^{-1}\Phi_1^\intercal \Sigma_\epsilon^{-1}(\tilde{\bv{z}}_1 - \Phi_1 M\bv m_0)\\ &= M \bv m_0 + (\Phi_1^\intercal\Sigma_\epsilon^{-1}\Phi_1)^{-1}\Phi_1^\intercal\Sigma_\epsilon^{-1} \tilde{\bv{z}}_1 - (\Phi_1^\intercal\Sigma_\epsilon^{-1}\Phi_1)^{-1}\Phi_1^\intercal\Sigma_\epsilon^{-1}\Phi_1M\bv m_0\\ &= (\Phi_1^\intercal\Sigma_\epsilon^{-1}\Phi_1)^{-1}\Phi_1^\intercal\Sigma_\epsilon^{-1} \tilde{\bv{z}}_1. \end{split} \]

For the variance, we apply the (Equation 24) with $A = P_{1\mid0}^{-1}, U=\Phi_1^\intercal, V=\Phi_1, C=\Sigma_\epsilon^{-1}$,

\[\begin{split} P_{1\mid1} &= (I - P_{1\mid0}\Phi_1^\intercal[\Sigma_\epsilon + \Phi_1^\intercal P_{1\mid0}\Phi_1]^{-1}\Phi_1)P_{1\mid0}\\ &= (P_{1\mid0}^{-1} + \Phi_1^\intercal \Sigma_\epsilon^{-1}\Phi_1)^{-1}\\ &= (\Phi_1^\intercal \Sigma_\epsilon^{-1}\Phi_1)^{-1}, \end{split} \]

as needed.

It is worth noting that (Equation 26) seems to make a lot of sense; namely, we expect the estimate for $\bv m_0$ to look like a correlated least squares-type estimator like this.

References

Assimakis, Nicholas, Maria Adam, and Anargyros Douladiris. 2012. “Information Filter and Kalman Filter Comparison: Selection of the Faster Filter.” In Information Engineering, 2:1–5. 1.

Cressie, Noel, and Christopher K Wikle. 2015. Statistics for Spatio-Temporal Data. John Wiley & Sons.

Dewar, Michael, Kenneth Scerri, and Visakan Kadirkamanathan. 2008. “Data-Driven Spatio-Temporal Modeling Using the Integro-Difference Equation.” IEEE Transactions on Signal Processing 57 (1): 83–91.

Kaminski, Paul, Arthur Bryson, and Stanley Schmidt. 1971. “Discrete Square Root Filtering: A Survey of Current Techniques.” IEEE Transactions on Automatic Control 16 (6): 727–36.

Khan, Mohammad Emtiyaz. 2005. “Matrix Inversion Lemma and Information Filter.” Honeywell Techonology Solutions Lab, Bangalore, India.

Liu, Xiao, Kyongmin Yeo, and Siyuan Lu. 2022. “Statistical Modeling for Spatio-Temporal Data from Stochastic Convection-Diffusion Processes.” Journal of the American Statistical Association 117 (539): 1482–99.

Särkkä, Simo, and Ángel F Garcı́a-Fernández. 2020. “Temporal Parallelization of Bayesian Smoothers.” IEEE Transactions on Automatic Control 66 (1): 299–306.

Shumway, Robert H, David S Stoffer, and David S Stoffer. 2000. Time Series Analysis and Its Applications. Vol. 3. Springer.

Tracy, Kevin. 2022. “A Square-Root Kalman Filter Using Only QR Decompositions.” arXiv Preprint arXiv:2208.06452.

Wikle, Christopher K, and Noel Cressie. 1999. “A Dimension-Reduced Approach to Space-Time Kalman Filtering.” Biometrika 86 (4): 815–29.

Wikle, Christopher K, Andrew Zammit-Mangion, and Noel Cressie. 2019. Spatio-Temporal Statistics with r. CRC Press.

Footnotes

Historically, this has been abbreviated as IDE. However, with that abbreviation almost universally meaning ‘Integrated Development Environment’, here, we choose to include the ‘M’ in the abbreviation.↩︎
at least, in a discrete-time scenario. Integro-difference based mechanics can be derived from continuous-time convection-diffusion processes, see Liu, Yeo, and Lu (2022)↩︎
The bisquare functions, here, $\phi_i(\bv s) = [1-\frac{\Vert \bv s - \bv c_i \Vert}{w_i}]^2 \cdot I(\Vert \bv s - \bv c_i \Vert < w_i)$ for $i$ ‘centroids’ or ‘knots’, $\bv c_i\in \mathcal D$, each with ‘radius’ $w_i$↩︎
that is, the parameters of the Gaussian distribution in it’s exponential family form↩︎
that being the process dimension, previously labelled $r$, the number of basis functions used in the expansion of the process↩︎
A matrix $A$ is said to be a ‘square root’ of a positive-definate matrix $X$ if $A^\intercal A = X$. Note that these square roots are not unique, but can be ‘rotated’ by an arbitrary unitary matrix. The ‘canonical’ square root is the Cholesky factor, the unique upper (or occasionally lower) triangular square root. This can be found for arbitraty square roots by taking the QR decomposition (or RQ decomposition), which effectively computes the upper-triangular square root, $R$, and the unitary transformation $Q^\intercal$ necessary to get there.↩︎
which must be explicitely enabled in JAX↩︎