Question

I am new to Python and am trying to work out the best way to approach a data analysis problem. Apologies if this question seems basic. I essentially want help deciding whether to use tuples, dicts or a pandas DataFrame to store my data. Here is my scenario:

My data: I have a 3D spatial dataset, with data at uneven XYZ positions. The precise position of the data points is vital, so I can't resample to an even grid, which would be much easier to deal with. Each XYZ data point has an associated set of attributes, including string, integer, float and boolean values. I basically have a fairly disordered 'cloud' of data.

My aims: I want to be able to examine every point of type 'X' within the dataset, and look at the properties of all other points within a given radius (what type they are, and various other characteristics).

My question: What is the most efficient way of storing and querying this type of data? Intuitively, a pandas DataFrame with columns for x, y, z, ... would make sense, but given that I'll be working with large datasets, I am concerned about whether this is the most efficient way of doing it. Would it be sensible to create a dict where the keys are XYZ tuples and the values are further dicts containing the characteristics of each point? Is there an obvious way of doing this that I've missed?

Any help/suggestions greatly appreciated!

Thanks in advance.


Solution

Since the bottleneck in this use case appears to be the spatial queries, I would store the coordinates in a data structure that is highly optimized for spatial queries, and keep a separate dictionary from which you can retrieve the other features of each point on demand. High-performance specialized libraries, e.g. the Boost Graph Library and CGAL (for computational geometry), follow a similar philosophy (for example, see property maps in Boost: http://www.boost.org/doc/libs/1_37_0/libs/graph/doc/using_property_maps.html).
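As a rough illustration of that separation, here is a minimal sketch using only the standard library. The sample points, the attribute names (`kind`, `weight`, `active`) and the helper `neighbours_within` are all made up for this example, and the brute-force distance loop is only a stand-in for a real spatial index:

```python
import math

# Hypothetical sample data: coordinates are kept on their own,
# in the form a spatial index would consume.
coords = [
    (0.0, 0.0, 0.0),
    (0.5, 0.2, 0.1),
    (3.0, 3.0, 3.0),
    (0.1, 0.9, 0.4),
]

# Attributes live in a separate "property map", keyed by point index.
properties = {
    0: {"kind": "X", "weight": 1.5, "active": True},
    1: {"kind": "Y", "weight": 0.7, "active": False},
    2: {"kind": "X", "weight": 2.0, "active": True},
    3: {"kind": "Z", "weight": 0.3, "active": True},
}

def neighbours_within(index, radius):
    """Brute-force placeholder for a spatial index lookup:
    indices of all other points within `radius` of point `index`."""
    centre = coords[index]
    return [
        j for j, point in enumerate(coords)
        if j != index and math.dist(centre, point) <= radius
    ]

# For every point of kind 'X', look up the kinds of its neighbours.
result = {}
for i, props in properties.items():
    if props["kind"] == "X":
        result[i] = [properties[j]["kind"] for j in neighbours_within(i, 1.0)]

print(result)
```

The point of the layout is that the spatial query only ever returns indices, so the lookup structure can later be swapped for something faster without touching how the attributes are stored.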

As for the appropriate data structure, SciPy may have something that suits you: http://docs.scipy.org/doc/scipy/reference/spatial.html. KDTree would be an option for nearest-neighbour and fixed-radius queries. Pandas can certainly store the data, but it does not have specialized spatial indexing support.
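A minimal sketch of how the two could be combined, assuming NumPy, pandas and SciPy are installed. The column names and sample values are invented for illustration; the DataFrame holds the attributes, while `scipy.spatial.KDTree` handles the radius search through `query_ball_point`:

```python
import numpy as np
import pandas as pd
from scipy.spatial import KDTree

# Hypothetical sample data: one row per point, coordinates plus attributes.
df = pd.DataFrame({
    "x": [0.0, 0.5, 3.0, 0.1],
    "y": [0.0, 0.2, 3.0, 0.9],
    "z": [0.0, 0.1, 3.0, 0.4],
    "kind": ["X", "Y", "X", "Z"],
    "weight": [1.5, 0.7, 2.0, 0.3],
})

# Build the spatial index from the coordinate columns only.
coords = df[["x", "y", "z"]].to_numpy()
tree = KDTree(coords)

radius = 1.0

# Row positions of every point of kind 'X'.
x_rows = np.flatnonzero((df["kind"] == "X").to_numpy())

# One list of neighbour row positions per query point (each includes the point itself).
hits = tree.query_ball_point(coords[x_rows], r=radius)

neighbour_kinds = {}
for row, idx in zip(x_rows, hits):
    others = [j for j in idx if j != row]  # drop the query point
    # Use the row positions to pull attributes back out of the DataFrame.
    neighbour_kinds[int(row)] = sorted(df["kind"].iloc[others].tolist())

print(neighbour_kinds)
```

Because the tree returns row positions, any other column (here `weight`) can be fetched the same way with `df.iloc[others]`.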

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow