Question

I'm working on implementing a probability density function of a multivariate Gaussian in C++, and I'm stuck on how to best handle the cases where dimension > 2.

The pdf of a Gaussian can be written as

$$ f(\mathbf{x}) = \frac{1}{(2\pi)^{k/2}\,|\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}\,(\mathbf{x}-\boldsymbol{\mu})'\,\Sigma^{-1}\,(\mathbf{x}-\boldsymbol{\mu})\right) $$

where (A)' or A' represents the transpose of the 'matrix' created by subtracting the mean from all elements of x. In this equation, k is the number of dimensions we have, and sigma (Σ) represents the covariance matrix, which is a k × k matrix. Finally, |X| means the determinant of the matrix X.

In the univariate case, implementing the pdf is trivial. Even in the bivariate (k = 2) case, it's trivial. However, when we go further than two dimensions, the implementation is much harder.

In the bivariate case, we'd have

$$ f(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}} \exp\!\left(-\frac{1}{2(1-\rho^2)}\left[\frac{(x-\mu_x)^2}{\sigma_x^2} + \frac{(y-\mu_y)^2}{\sigma_y^2} - \frac{2\rho(x-\mu_x)(y-\mu_y)}{\sigma_x\sigma_y}\right]\right) $$

where rho is the correlation between x and y, with correlation equal to

$$ \rho = \frac{\operatorname{cov}(X, Y)}{\sigma_X\,\sigma_Y} = \frac{E\!\left[(X-\mu_X)(Y-\mu_Y)\right]}{\sigma_X\,\sigma_Y} $$

In this case, I could use Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic> to implement the first equation, or just calculate everything myself using the second equation, without benefiting from Eigen's simplified linear algebra interface.
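For the calculate-it-yourself route, a minimal sketch of the bivariate formula in plain C++ might look like this (the function and parameter names are illustrative, not from any library):

```cpp
#include <cassert>
#include <cmath>

// Bivariate normal pdf computed directly from the explicit formula,
// with no linear algebra library.
double bivariate_normal_pdf(double x, double y,
                            double mx, double my,   // means
                            double sx, double sy,   // standard deviations
                            double rho)             // correlation, |rho| < 1
{
    const double pi = std::acos(-1.0);
    const double dx = (x - mx) / sx;
    const double dy = (y - my) / sy;
    const double one_minus_r2 = 1.0 - rho * rho;
    const double z = dx * dx - 2.0 * rho * dx * dy + dy * dy;
    return std::exp(-z / (2.0 * one_minus_r2))
         / (2.0 * pi * sx * sy * std::sqrt(one_minus_r2));
}
```

As a sanity check, at the mean of a standard uncorrelated pair this reduces to 1/(2π).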

My thoughts for an attempt at the multivariate case would probably begin by extending the above equations to the multivariate case

multivariate pdf

with

multivariate pdf

My questions are:

  1. Would it be appropriate/advised to use a boost::multi_array for the n-dimensional array, or should I instead try to leverage Eigen?
  2. Should I have separate functions for the univariate/bivariate cases, or should I just abstract it all to the multivariate case using boost::multi_array (or an appropriate alternative)?

Solution

I'm a little bit out of my element here, but some thoughts:

First, from a programming view, the stock answer is "profile." That is, code it up the clearer way first. Then profile your execution to see if optimizing is worthwhile. IMHO it's probably clearer to use a matrix library to keep closer to the original math.

From a math view: I'm a bit dubious about the formula you provide for the multivariate case. It doesn't look right to me. The expression Z should be a quadratic form, and your Z isn't. Unless I'm missing something.

Here's an option you didn't mention that might make sense, especially if you're going to be evaluating the PDF multiple times for a single distribution. Start off by calculating the principal component basis of your distribution, that is, an eigenbasis for Σ. Principal component directions are orthogonal. In the principal component basis, the cross covariances are all 0, so the PDF has a simple form. When you want to evaluate, change basis on the input into the principal component basis, and then perform the simpler PDF calculation on that.

The thought being that you can calculate the change of basis matrix and principal components once up front, and then only have to do a single matrix multiplication (the change of basis) per evaluation, instead of the two matrix multiplications needed to evaluate the quadratic form (x-μ)' Σ⁻¹ (x-μ) in the standard basis.

OTHER TIPS

I have basically implemented the exp-part of the equation for the three-dimensional case in this question. At first I used a computer vision library called OpenCV, but I noticed that its C++ interface was very slow. Afterwards I tried the C interface, which was a bit faster. Finally, I decided to sacrifice flexibility and readability, implemented it without any libraries, and it was way faster.

What I am trying to say is this: When performance is important, you should consider implementing special cases for the most used number of dimensions with as little overhead as possible. Otherwise choose maintainability over speed.
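In that spirit, here is what a dependency-free specialization might look like for k = 3: just the argument of the exp, with the symmetric 3×3 inverse expanded through its adjugate (the function and parameter names are made up for this sketch):

```cpp
#include <cassert>
#include <cmath>

// -0.5 * d' Sigma^{-1} d for k = 3, fully unrolled, no library calls.
// Sigma is assumed symmetric positive definite; its inverse is applied
// as adjugate / determinant.
double mvn3_exponent(const double d[3], const double s[3][3])
{
    const double a = s[0][0], b = s[0][1], c = s[0][2];
    const double e = s[1][1], f = s[1][2], g = s[2][2];

    // Adjugate of the symmetric 3x3 Sigma (only six distinct entries).
    const double A00 = e * g - f * f;
    const double A01 = c * f - b * g;
    const double A02 = b * f - c * e;
    const double A11 = a * g - c * c;
    const double A12 = b * c - a * f;
    const double A22 = a * e - b * b;
    const double det = a * A00 + b * A01 + c * A02;

    const double quad = d[0] * d[0] * A00 + d[1] * d[1] * A11
                      + d[2] * d[2] * A22
                      + 2.0 * (d[0] * d[1] * A01 + d[0] * d[2] * A02
                               + d[1] * d[2] * A12);
    return -0.5 * quad / det;
}
```

For an identity covariance this is just -½‖d‖², which makes it easy to unit-test against the general-purpose version.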

Disclaimer: I do not know anything about the speed of Eigen or boost::multi_array (which is probably what this question is really aiming at?).

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow