
The Kalman Filter. Intuition, history and mathematical derivation.

In this article I will walk through the complete derivation behind the Kalman Filter, one of the most popular filtering algorithms for noisy environments. In addition, the following article will be about the Extended Kalman Filter and how it's used in localisation algorithms, when we have known and unknown correspondences.

This article will be a mouthful, so let's first quickly sketch what we will cover:

Table of contents

  1. The history surrounding the Kalman Filter
  2. The state space
    1. The Lotka-Volterra equations
  3. Weighted Gaussian Measurements
    1. Encoding uncertainty with the standard deviation
    2. Finding the optimal weights with Lagrange multipliers
    3. Finding the optimal weights probabilistically
  4. The Kalman Filter
    1. Deriving the Kalman Gain
    2. Deriving the standard Kalman Filter equations
  5. The Extended Kalman Filter
  6. Bibliography

The history surrounding the Kalman Filter

In 1960, Rudolf E. Kálmán published the seminal paper presenting the eponymous technique. He proposed it in the context of extracting the actual value of a quantity (or, better said, its most likely value) from a long list of noisy measurements.

The state space

But what is probably more important is Kalman's heavy use of what's called a state space. A state space is generally an abstract space, in other words a mathematical construct which has no direct mapping to what we would call a real-world space (like the 3D space). It usually arises naturally when working with differential equations, for instance in the Lotka-Volterra equations that describe predator-prey population dynamics:

Lotka-Volterra equations

I will briefly talk about the famous predator-prey model and how a state space naturally arises from it. I find this example very tangible, in the sense that we can relate it to something familiar from the animal world.

Predator-free population model

Firstly, suppose we have a population \(x\), and the environment doesn’t impose a maximum limit over this population. In this scenario, the population would grow exponentially (solving \(\frac{dx}{dt} = \alpha x\) we get \(x=Ce^{\alpha t}\)).

Predator population model

Now, suppose we introduce a population of predators, \(y\), that feeds on \(x\). We can intuitively say that there will be a decrease in the population \(x\), proportional to both the number of predators and the prey population itself. Why the latter? Because in this simplistic model, any predator has the potential to catch any prey (i.e. there are \(xy\) possible interactions, so we define a killing rate depending on that).

The predators, however, would all die if left without a source of food. In the same manner, we can define a birth rate for the predators that depends on the number of interactions they have with the prey.

After all this talk we can finally define a model of the form:

Predator-prey mathematical model

$$\begin{cases}\frac{dx}{dt} = \alpha x - \beta xy \\ \frac{dy}{dt} = -\gamma y + \delta xy \end{cases} \\ \text{Looking at the variation of } y \text{ with respect to } x \text{ (so we divide the equations above): } \\ \frac{dy}{dx} = -\frac{y}{x} \cdot \frac{\delta x - \gamma}{\beta y - \alpha} \\ \text{Separating the variables: } \frac{\beta y - \alpha}{y}dy = -\frac{\delta x - \gamma}{x}dx \\ \text{And by integrating on both sides: } \beta y - \alpha \ln(y) + const_1 = -\delta x + \gamma \ln(x) + const_2 \\ \text{We can now define: } V = \delta x - \gamma \ln(x) + \beta y - \alpha \ln(y) = const$$

\(V\) depends on the initial conditions (which fix the value of the constant), but most importantly, it is a constant quantity along the closed curves defined by the equation above. Some images from Wikipedia, for a clearer picture:

Population dynamics for predators and prey: the populations have an intrinsic periodic nature, with the prey population increasing first, followed by the predators [1].
The predator/prey state space: the closed curves described by the equation above, each with its own value of \(V\), for various initial conditions [1].

The last picture is a nice example of a STATE SPACE. State spaces are useful because they graphically describe the dynamics of a system of differential equations. With the drawing above you can quickly see the evolution for various initial conditions, something that otherwise wouldn't have been so obvious. The thing to remember in our story: Kalman heavily used state spaces.
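To make this concrete, here is a minimal Python sketch (the parameter values and initial populations are made up for illustration) that integrates the Lotka-Volterra equations and checks that \(V\) indeed stays constant along the resulting orbit:

```python
# Integrate the Lotka-Volterra equations and verify that V is (numerically)
# conserved along the trajectory. All parameter values are illustrative.
import numpy as np
from scipy.integrate import solve_ivp

alpha, beta, gamma, delta = 1.1, 0.4, 0.4, 0.1

def lotka_volterra(t, state):
    x, y = state                          # x: prey, y: predators
    return [alpha * x - beta * x * y,     # dx/dt
            -gamma * y + delta * x * y]   # dy/dt

def V(x, y):
    # The conserved quantity we derived above.
    return delta * x - gamma * np.log(x) + beta * y - alpha * np.log(y)

sol = solve_ivp(lotka_volterra, (0.0, 50.0), [10.0, 5.0], rtol=1e-9, atol=1e-9)
x, y = sol.y
print(V(x, y).min(), V(x, y).max())  # the two values should agree closely:
                                     # the orbit is a closed curve of constant V
```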

Weighted Gaussian Measurements

Before getting to the Kalman Filter, I want to speak a little about weighted Gaussian measurements.

Let's suppose we have two independent measurements of a certain quantity, made either by us or by some sort of measuring instrument. Either way, we aren't 100% sure about the value we think we see, and neither is the instrument about the value it displays (it might have some sort of intrinsic error).

Encoding uncertainty with the standard deviation

What both measurements have in common is a level of uncertainty, which we will encode using the standard deviation. We can attach (conceptually at least) a normal distribution to every measurement we make. The measurements made with a high degree of certainty will have a low uncertainty (i.e. a low variance!) and vice-versa.

With that in mind, back to our measurements. We will write them as \(x_1 \pm \sigma_1\) and \(x_2 \pm \sigma_2\). But now that we have them comes the question: what's the best estimator for \(x\)?

We will try answering in the simplest manner: a linear combination of the two measurements, \(x_{12} = w_1 x_1 + w_2x_2\), with \(w_1+w_2=1\). Now the question is: what are the best values for \(w_1\) and \(w_2\)? Fortunately, there are two different ways to reach the same result: using Lagrange multipliers, or probabilistically.

Finding the optimal weights with Lagrange multipliers

We will choose the weights that minimise the uncertainty, \(\sigma^2 = \sum_i w_i^2\sigma_i^2\), subject to \(1 - \sum_i w_i = 0\).

Leaving the constraint aside… why this formula? Short answer: this is the actual variance of the random variable defined as \(x_{12}\). We can quickly see why if we make the assumption that the two random variables are independent, an assumption which should hold most of the time in the real world. I mean, if one odometer messes around with another odometer then we're in big trouble… far bigger than our philosophical assumptions (-:

Deriving the mean and variance of a linear combination

This is Bienaymé’s formula. Let’s quickly derive it here [2]:

First the mean of a linear combination of independent random variables:

$$\mathbb{E}(\sum_i w_i x_i) = \sum_i w_i \mathbb{E}(x_i) = \sum_i w_i \mu_i $$

Secondly, the variance:

$$\begin{align}\sigma(x)^2 &= Var(x) = \mathbb{E}[(x-\mu_x)^2]=\mathbb{E}[(\sum_i w_i x_i - \sum_i w_i \mu_i)^2] \\ &= \mathbb{E}[(\sum_i w_i(x_i - \mu_i))^2] = \mathbb{E}[(\sum_i w_i(x_i-\mu_i))(\sum_j w_j(x_j - \mu_j))]\\ &= \mathbb{E}[\sum_i\sum_j w_i w_j (x_i-\mu_i)(x_j-\mu_j)] = \sum_i\sum_j w_i w_j \mathbb{E}[(x_i-\mu_i)(x_j-\mu_j)]\end{align}$$

Using the independence property we can see that \(\mathbb{E}[(x_i-\mu_i)(x_j-\mu_j)]\) is actually \(0 \; \forall \; i\neq j\). Thus the above relation simplifies to:

$$ \sigma^2 = \sum_i w_i^2 \mathbb{E}[(x_i-\mu_i)^2] = \sum_i w_i^2 \sigma_i^2$$
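For the skeptical reader, here is a quick Monte Carlo sanity check of the formula (the weights and standard deviations below are arbitrary):

```python
# Empirically check Bienaymé's formula: Var(sum_i w_i x_i) = sum_i w_i^2 sigma_i^2
# for independent x_i. All values are illustrative.
import numpy as np

rng = np.random.default_rng(42)
w = np.array([0.3, 0.7])
sigma = np.array([1.0, 2.0])

samples = rng.normal(0.0, sigma, size=(1_000_000, 2))  # independent columns
combo = samples @ w                                    # the linear combination
print(combo.var())               # empirical variance of the combination
print((w**2 * sigma**2).sum())   # 0.3^2 * 1 + 0.7^2 * 4 = 2.05
```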

Solving the Lagrangian

$$\mathcal{L}(\textbf{w}, \lambda) = \sum_i w_i^2\sigma_i^2 - \lambda (1 - \sum_i w_i)\\
\frac{\partial \mathcal{L}}{\partial w_i} = 2w_i\sigma_i^2 + \lambda = 0 \Rightarrow w_i \propto \frac{1}{\sigma_i^2} \\
\text{and because of the constraint: } \quad w_i = \frac{1/\sigma_i^2}{\sum_j 1/\sigma_j^2} \\
\text{for our two-measurement example: } \quad \hat{x} = \frac{\sigma_2^2 x_1 + \sigma_1^2 x_2}{\sigma_2^2 + \sigma_1^2}; \quad \hat{\sigma}^2 = \frac{\sigma_1^2 \sigma_2^2}{\sigma_1^2 + \sigma_2^2}$$

The hat symbol (\(\hat{\cdot}\)) denotes ESTIMATED values.

As a sidenote, look how the Lagrangian formulation led us to something very intuitive: the weights are directly proportional to the inverse of the variance, which in statistics is commonly referred to as the precision (in other words, it encodes the importance of a measurement).
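As a toy illustration, here is what inverse-variance weighting looks like in code for our two-measurement example (the measurement values are made up):

```python
# A minimal sketch of inverse-variance weighting for two independent
# measurements x1 +/- sqrt(var1) and x2 +/- sqrt(var2).
def fuse(x1, var1, x2, var2):
    w1 = (1 / var1) / (1 / var1 + 1 / var2)  # weights proportional to precision
    w2 = (1 / var2) / (1 / var1 + 1 / var2)
    x_hat = w1 * x1 + w2 * x2                # fused estimate
    var_hat = var1 * var2 / (var1 + var2)    # always <= min(var1, var2)
    return x_hat, var_hat

# Example: a precise and an imprecise reading of the same quantity.
print(fuse(10.0, 1.0, 14.0, 4.0))  # -> (10.8, 0.8): pulled toward the precise one
```

Notice that the fused variance is always smaller than either input variance: combining two measurements never hurts.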

Finding the optimal weights probabilistically

We interpret each measurement in terms of a probability distribution centered around the measured value, e.g. \(x_1\) (which becomes the mean of a normal distribution).

Thus, for the first measurement we can assume that \(P(x| x_1) = \mathcal{N}(x;x_1, \sigma_1^2)\). Because the measurements are independent, we can apply the same reasoning for the second measurement.

$$P(x|x_1, x_2) \propto P(x|x_1)\,P(x|x_2) \text{ (the two measurements are independent)} \\ P(x|x_1, x_2) \propto \exp\left(-\frac{1}{2}\left(\frac{(x-x_1)^2}{\sigma_1^2} + \frac{(x-x_2)^2}{\sigma_2^2}\right)\right)\\ \text{And after some algebra we reach a form of the type: } P(x|x_1, x_2) \propto \exp\left(-\frac{(x-\hat{x})^2}{2\hat{\sigma}^2}\right)$$ [3]

Final estimation after two independent measurements.
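We can also check numerically that the probabilistic route lands on the same answer as the Lagrangian one. The sketch below (with made-up measurement values) multiplies the two Gaussian densities on a grid and compares the product's mean and variance with the analytic \(\hat{x}\) and \(\hat{\sigma}^2\):

```python
# Multiply the two measurement densities on a grid and compare the result's
# mean/variance with the closed-form fusion formulas. Values are illustrative.
import numpy as np

x1, s1 = 10.0, 1.0   # measurement 1: mean, standard deviation
x2, s2 = 14.0, 2.0   # measurement 2

grid = np.linspace(0.0, 25.0, 100_001)
p = np.exp(-0.5 * ((grid - x1) / s1) ** 2) * np.exp(-0.5 * ((grid - x2) / s2) ** 2)
p /= np.trapz(p, grid)                       # normalise the product density

mean = np.trapz(grid * p, grid)
var = np.trapz((grid - mean) ** 2 * p, grid)
print(mean, var)                             # ~10.8 and ~0.8
print((s2**2 * x1 + s1**2 * x2) / (s1**2 + s2**2),
      (s1**2 * s2**2) / (s1**2 + s2**2))     # the analytic x_hat and sigma_hat^2
```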

The Kalman Filter

The Kalman Filter is actually a systematization of the method of weighted Gaussian measurements, in the context of systems theory. In practice, this filter is used in equipment that needs to tune its next estimated state based on its current internal state (or belief), along with the new information that comes in through measurements. The filter is therefore recursive, forming a feedback control loop.

Deriving the Kalman Gain

To capture the recursive nature of the Kalman filter, in our 1D example, we have to make some changes to the equations that describe the estimated mean and variance.

$$\begin{align}
\hat{x} = \dfrac{\sigma_2^2 x_1 + \sigma_1^2 x_2}{\sigma_2^2 + \sigma_1^2} &\rightarrow
\hat{x} = x_1 + \frac{\sigma_1^2 (x_2-x_1)}{\sigma_1^2 + \sigma_2^2} \\
\hat{\sigma}^2 = \frac{\sigma_1^2 \sigma_2^2}{\sigma_1^2 + \sigma_2^2} &\rightarrow
\hat{\sigma}^2 = \sigma_1^2 - \frac{\sigma_1^4}{\sigma_1^2+\sigma_2^2}
\end{align}\\ \text{Introducing: } \mathbf{k} = \frac{\sigma_1^2}{\sigma_1^2+\sigma_2^2} \text{ we obtain: } \\
\hat{x}=x_1 + k(x_2 - x_1) \quad (1)\\
\hat{\sigma}^2 = \sigma_1^2 - k\sigma_1^2 \quad (2)$$

\(\mathbf{k}\) is the 1D version of the Kalman Gain.

However, Kalman didn't write it like the above. He was more general and worked with the N-dimensional version of the Kalman gain. For the sake of simplicity, I will draw a simple parallel between the 1D forms of the equations (1) and (2) and the N-dimensional ones. Accordingly, the variance becomes a covariance matrix, and the division becomes multiplication by the matrix inverse:

$$x_i \rightarrow \vec x_i \quad \text{and} \quad \sigma_i^2 \rightarrow \Sigma_i, \quad i=1,2 \\
\mathbf{K} = \Sigma_1(\Sigma_1 + \Sigma_2)^{-1} \quad (4)\\
\vec{\hat{x}} = \vec x_1 + \mathbf{K}(\vec x_2 - \vec x_1) \quad (5) \\
\hat{\Sigma} = \Sigma_1 - \mathbf{K} \Sigma_1 \quad (6)$$

\(\mathbf{K}\) is the N-dimensional version of the Kalman Gain.
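A small numerical illustration (with made-up estimates and covariances) of equations (4), (5) and (6) for two 2D estimates of the same quantity:

```python
# Fuse two 2D estimates using the N-dimensional gain. Values are illustrative.
import numpy as np

x1, Sigma1 = np.array([1.0, 2.0]), np.array([[1.0, 0.0], [0.0, 4.0]])
x2, Sigma2 = np.array([1.5, 1.0]), np.array([[2.0, 0.0], [0.0, 1.0]])

K = Sigma1 @ np.linalg.inv(Sigma1 + Sigma2)   # eq. (4)
x_hat = x1 + K @ (x2 - x1)                    # eq. (5)
Sigma_hat = Sigma1 - K @ Sigma1               # eq. (6)
print(x_hat)       # each component is pulled toward the more precise estimate
print(Sigma_hat)   # smaller (more certain) than either input covariance
```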

And now, before diving in the full derivation of Kalman Filter equations we need to introduce some notations and do a quick math recap:

The covariance matrix under multiplication by a constant matrix

$$\mathrm{Cov}(x) = \Sigma \\
\mathrm{Cov}(Ax) = A \Sigma A^T \\
\mathrm{Cov}(Ax) = \mathbb{E}[(Ax – \mathbb{E}[Ax])(Ax – \mathbb{E}[Ax])^T] \\
= \mathbb{E}[(Ax – A \mathbb{E}[x])(Ax – A\mathbb{E}[x])^T] \\
= \mathbb{E}[A (x – \mathbb{E}[x])(x – \mathbb{E}[x])^T A^T] \\
= A \underbrace{\mathbb{E}[(x – \mathbb{E}[x])(x – \mathbb{E}[x])^T]}_{=\Sigma} A^T$$
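If you want to see this identity empirically, here is a quick check with arbitrary matrices (all values illustrative):

```python
# Sanity-check Cov(Ax) = A Sigma A^T by sampling from a known covariance.
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
A = np.array([[1.0, 2.0],
              [0.0, 3.0]])

x = rng.multivariate_normal(mean=[0, 0], cov=Sigma, size=200_000)
empirical = np.cov((x @ A.T).T)   # sample covariance of the transformed samples
print(empirical)                  # approximately equal to...
print(A @ Sigma @ A.T)            # ...the analytic result
```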

Deriving the Kalman Filter equations

It's important to know that the Kalman Filter works in two distinct phases: the estimation phase (which offers predictions regarding the next state, based on the current state and some external factors) and the correction phase, which refines the predictions using actual measurements.

The estimation phase

  • To capture the changes the current state (at time \(k-1\)) will bring to the next state prediction (at time \(k\)), we will use the matrix \(\mathbf{F_k}\).
  • Analogously, to capture the influence of the external factors, we will use a different matrix, \(\mathbf{B_k}\), and the vector \(\mathbf{u_k}\) for the factors themselves.
  • To capture the uncertainty induced by the external factors, we will add Gaussian noise with covariance \(\mathbf{Q_k}\).

We can define, using the above notations, the equations for the predicted states. The superscript minus sign “\(\textbf{-}\)” denotes we speak about predictions (a-priori estimations).

$$\hat{\mathbf{x}}_k^\text{-} = F_k \hat{\mathbf{x}}_{k-1} + B_k \mathbf{u_k} \\ \Sigma_k^\text{-} = F_k \Sigma_{k-1} F_k^T + Q_k$$
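As a sketch, here is what the estimation phase could look like for a hypothetical 1D constant-velocity model (the state holds position and velocity; the time step and noise values are made up for illustration):

```python
# Prediction step for a toy constant-velocity model. All numbers illustrative.
import numpy as np

dt = 0.1
F = np.array([[1.0, dt],
              [0.0, 1.0]])          # state transition: pos += vel * dt
B = np.array([[0.5 * dt**2],
              [dt]])                # control matrix: acceleration input
Q = 0.01 * np.eye(2)                # process-noise covariance

x_hat = np.array([[0.0], [1.0]])    # current belief: at 0, moving at 1 unit/s
Sigma = np.eye(2)                   # current uncertainty
u = np.array([[0.2]])               # commanded acceleration

x_pred = F @ x_hat + B @ u          # a-priori state estimate
Sigma_pred = F @ Sigma @ F.T + Q    # a-priori covariance (note: it only grows)
```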

The correction phase

Another key point: the sensors that make the measurements, although they may follow the same states, might be doing it on a different measuring scale. As a consequence, we need to bring the predictions into the sensors' space (so everyone is on the same page).

  • For this, we will introduce a new transformation, via the matrix \(\mathbf{H_k}\).
  • As I've previously stated, the sensors have limited precision. To express this, we will introduce a covariance matrix, \(\mathbf{R_k}\), while the measured values themselves (whose mean is the best estimator) will be denoted \(\mathbf{z_k}\).

In effect, by applying the transformation from the prediction space to the sensor space we obtain:

$$\begin{align}
\mu_{expected} &= H_k \hat{\mathbf{x}}_k^\text{-} \\
\Sigma_{expected} &= H_k \Sigma_k^\text{-} H_k^T
\end{align}$$

The predicted and measured values, along with their associated uncertainties.
The joint probability distribution obtained from the prediction and measurement distributions.
The newly obtained average and covariance are the best estimators.

The three images above come from a great article, also about the Kalman Filter (intuitive too) [4].

The Kalman Filter equations

$$\text{Having the predictions: } (\mu_1, \Sigma_1) = (H_k \mathbf{\hat{x}_k^\text{-}}, H_k\Sigma_k^\text{-} H_k^T) \\
\text{and the measured values: } (\mu_2, \Sigma_2) = (\mathbf{z_k}, R_k)\\
\text{introducing them in equations (4), (5), (6) we obtain: } \\
H_k \mathbf{\hat{x}_k} = H_k \mathbf{\hat{x}_k}^\text{-} + K (\mathbf{z_k} - H_k \mathbf{\hat{x}_k^\text{-}})\\
H_k \Sigma_k H_k^T = H_k \Sigma_k^\text{-} H_k^T - K H_k \Sigma_k^\text{-} H_k^T \\
K = H_k \Sigma_k^\text{-} H_k^T ( H_k \Sigma_k^\text{-} H_k^T + R_k)^{-1} \\
\text{Factoring out the leading } H_k \text{ (writing } K = H_k K' \text{) gives the standard form:} \\
\color{blue}{K' = \Sigma_k^\text{-} H_k^T ( H_k \Sigma_k^\text{-} H_k^T + R_k)^{-1}}\\
\color{blue}{\mathbf{\hat{x}_k} = \mathbf{\hat{x}_k^\text{-}} + K'(\mathbf{z_k} - H_k \mathbf{\hat{x}_k^\text{-}})}\\
\color{blue}{\Sigma_k = (I - K'H_k)\Sigma_k^\text{-}}$$
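Putting both phases together, here is a minimal, self-contained sketch (every number is made up): a 1D filter estimating a constant quantity from noisy readings, which reduces exactly to equations (1) and (2) in the correction phase:

```python
# A toy 1D Kalman filter: track a constant value from noisy measurements.
# Here F = H = 1 and there is no control input, so the update collapses
# to equations (1) and (2). All parameter values are illustrative.
import numpy as np

rng = np.random.default_rng(1)
true_value = 5.0
z = true_value + rng.normal(0.0, 0.5, size=50)  # noisy readings, R = 0.25

x_hat, sigma2 = 0.0, 100.0   # initial belief: very uncertain
Q, R = 1e-4, 0.25            # process- and measurement-noise variances

for z_k in z:
    # Estimation phase (the value is assumed constant, so only Q is added).
    x_pred, sigma2_pred = x_hat, sigma2 + Q
    # Correction phase.
    k = sigma2_pred / (sigma2_pred + R)   # 1D Kalman gain
    x_hat = x_pred + k * (z_k - x_pred)   # eq. (1)
    sigma2 = (1 - k) * sigma2_pred        # eq. (2)

print(x_hat, sigma2)  # close to 5.0, with a variance well below R
```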

The Extended Kalman Filter (EKF)

You might have noticed that everything we've discussed so far is basically just a fancy LINEAR model. As shown above, we work only with linear transformations, transforming one state space into another and so on. Unfortunately, the vast majority of real-world problems are inherently nonlinear.

Fortunately for us, we can do a bunch of fancy stuff with a nonlinear model so that, at least on small intervals, we can approximate it with its simpler linear cousin. One trick is to use a first-order Taylor approximation (because it's linear). We will use the generalised, multivariate version of the Taylor expansion, so the first coefficient will actually be the Jacobian matrix (instead of the common derivative). Having said that, it's pretty simple to generalise the basic Kalman Filter to the Extended Kalman Filter (EKF).

More precisely, we replace the linear transformations \(F_k, B_k, H_k\) with some nonlinear functions \(f\) and \(h\).

$$\text{For the nonlinear model}: \\ \begin{cases}\mathbf{x_k} = f(\mathbf{x_{k-1}}, \mathbf{u_k}) + \mathbf{w_k}, & w_k \text{ the process (model) noise} \\ \mathbf{z_k} = h(\mathbf{x_k}) + \mathbf{v_k}, & v_k \text{ the measurement noise} \end{cases} \\ \text{We have the predictions: } \\ \mathbf{\hat{x}_k^\text{-}} = f(\mathbf{\hat{x}_{k-1}}, \mathbf{u_k}) \\
\Sigma_k^\text{-} = J_f(\mathbf{\hat{x}_{k-1}})\Sigma_{k-1}J_f^T(\mathbf{\hat{x}_{k-1}}) + Q_k \\
\text{And their corrections: } \\
\color{blue}{K_k' = \Sigma_k^\text{-} J_h^T(\mathbf{\hat{x}_k^\text{-}})(J_h(\mathbf{\hat{x}_k^\text{-}})\Sigma_k^\text{-} J_h^T(\mathbf{\hat{x}_k^\text{-}}) + R_k)^{-1}}\\
\color{blue}{\mathbf{\hat{x}_k} \approx \mathbf{\hat{x}_k^\text{-}} + K_k'(\mathbf{z_k} - h(\mathbf{\hat{x}_k^\text{-}}))} \\
\color{blue}{\Sigma_k = (I-K_k' J_h(\mathbf{\hat{x}_k^\text{-}}))\Sigma_k^\text{-}}$$
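To close the loop, here is a hedged sketch of a single EKF correction step for a hypothetical range-only sensor, \(h(x)=\sqrt{p_x^2+p_y^2}\) (the sensor model and all numbers are assumptions made for illustration):

```python
# One EKF correction step: the state is a 2D position, the sensor measures
# only the range to the origin. All values are illustrative.
import numpy as np

def h(x):
    return np.array([np.hypot(x[0], x[1])])      # nonlinear measurement model

def J_h(x):
    r = np.hypot(x[0], x[1])
    return np.array([[x[0] / r, x[1] / r]])      # Jacobian of h at x

x_pred = np.array([3.0, 4.0])                    # a-priori state (px, py)
Sigma_pred = 0.5 * np.eye(2)                     # a-priori covariance
R = np.array([[0.1]])                            # measurement-noise covariance
z = np.array([5.2])                              # measured range

Hj = J_h(x_pred)                                 # linearise around the prediction
S = Hj @ Sigma_pred @ Hj.T + R
K = Sigma_pred @ Hj.T @ np.linalg.inv(S)         # EKF gain
x_hat = x_pred + K @ (z - h(x_pred))             # residual uses h itself, not Hj
Sigma = (np.eye(2) - K @ Hj) @ Sigma_pred
```

Note that the Jacobian appears only in the covariance and gain computations; the innovation \(\mathbf{z_k} - h(\mathbf{\hat{x}_k^\text{-}})\) still uses the full nonlinear \(h\).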

Conclusion

So that was it! A complete derivation of the equations behind the Kalman Filter. As you might have noticed, I tried to avoid the dense mathematical formalism. For a rigorous derivation I suggest looking at the original paper, written by the master himself [5]. And if you have any suggestions, or you see any mistakes, please let me know in the comments.

Bibliography

  1. https://en.wikipedia.org/wiki/Lotka–Volterra_equations
  2. https://newonlinecourses.science.psu.edu/stat414/node/166/
  3. https://indico.cern.ch/category/6015/attachments/192/632/Statistics_Gaussian_I.pdf
  4. https://www.bzarg.com/p/how-a-kalman-filter-works-in-pictures/
  5. https://www.cs.unc.edu/~welch/kalman/media/pdf/Kalman1960.pdf
