Question

I'm interested in performing a cross-match between two source catalogues in an astronomical context. I'll try to explain the setup simply.

A truth catalogue of sources (each with position, shape and flux properties) is used to create a synthetic astronomical image, by convolving with the instrument response and adding noise and artefacts.

Then 'some process' is developed (by whatever means - it could be a deep learning algorithm, or it could be a person picking sources manually) which identifies sources in the synthetic image, obviously with no knowledge of the truth catalogue.

The challenge is to map the sources in the 'estimated' catalogue onto those in the truth catalogue, after which it is possible to devise a measure of the efficacy of the source recovery approach. Bear in mind that estimated positions/fluxes will differ from the truth due to noise, and that not all sources may be detected.

One possible way to do this is to use a positional cross-match algorithm such as C3 (https://arxiv.org/abs/1611.04431), and then narrow the candidate matches down by taking the 'closest' in some multi-dimensional parameter space.
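For what it's worth, the purely positional first stage of such an approach can also be done with Astropy's built-in sky matching; a minimal sketch with made-up coordinates:

```python
import astropy.units as u
from astropy.coordinates import SkyCoord

# Illustrative coordinates; real catalogues would be read from files.
truth = SkyCoord(ra=[10.10, 10.50, 11.00] * u.deg,
                 dec=[-30.20, -30.40, -30.10] * u.deg)
est = SkyCoord(ra=[10.1002, 10.4995] * u.deg,
               dec=[-30.2001, -30.4003] * u.deg)

# For each estimated source, find the nearest truth source on the sky.
idx, sep2d, _ = est.match_to_catalog_sky(truth)
# idx[i] indexes truth; sep2d[i] is the on-sky separation (an Angle).
```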

However, I thought there could very well be a more generalised maximum-likelihood estimation algorithm suited to this kind of matching task. I think a nearest-neighbours type routine sounds promising, though I'm unsure how this would scale with hundreds of thousands or millions of sources. If anyone has any suggestions for a starting point, I'd be very grateful.


Solution

The generalised algorithm I was looking for was a k-dimensional tree, a space-partitioning data structure which is commonly used in nearest-neighbour searches. There are implementations in scipy.spatial (KDTree and the faster cKDTree) and in sklearn.neighbors (also called KDTree).
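A minimal sketch of the basic pattern (the catalogues here are synthetic, generated purely for illustration):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(42)
truth_xy = rng.uniform(0.0, 100.0, size=(1000, 2))               # truth positions
est_xy = truth_xy[:400] + rng.normal(scale=0.05, size=(400, 2))  # noisy detections

tree = cKDTree(truth_xy)             # build once over the truth catalogue
dist, idx = tree.query(est_xy, k=1)  # nearest truth source for every estimate
# idx[i] is the row of truth_xy matched to est_xy[i]; dist[i] the distance.
```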

These are the underlying engines of the Astropy and AstroML catalogue cross-match methods, though for my specific use case I required the functionality of the k-d tree itself. A k-d tree allows one to search in any number of dimensions, though the dimensions need to be put on a common scale (e.g. each divided by its standard deviation) if they represent different physical properties.
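A sketch of the multi-dimensional case under my own assumptions: both catalogues are NumPy arrays whose columns are (x, y, flux, size), and each column is divided by the truth catalogue's standard deviation so the dimensions are comparable. The function name and the optional distance cut are mine, not part of scipy:

```python
import numpy as np
from scipy.spatial import cKDTree

def crossmatch(truth, est, max_dist=None):
    """Nearest-neighbour match in a scaled 4-D feature space.

    truth, est : float arrays of shape (n, 4), columns (x, y, flux, size).
    Returns (dist, idx); idx[i] == -1 marks estimates with no match
    within max_dist (in scaled units), if a cut is given.
    """
    scale = truth.std(axis=0)      # one std dev per dimension
    tree = cKDTree(truth / scale)  # build the tree in the scaled space
    dist, idx = tree.query(est / scale, k=1)
    if max_dist is not None:
        # Flag estimates with no sufficiently close truth counterpart.
        idx = np.where(dist <= max_dist, idx, -1)
    return dist, idx
```

The scaling convention matters: dividing by the standard deviation is one simple choice, but a robust scale (e.g. the median absolute deviation) may behave better for heavy-tailed flux distributions.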

In some preliminary testing I found this to be over 100 times faster than the C3 algorithm I linked above for my use case. It takes around 10 seconds to pick the nearest neighbour for each of 85K sources from a catalogue of 1.4M sources, when searching in 4 dimensions (2D position, flux and size).
