Question

I've looked all over for an answer to this one, but nothing really seems to fit the bill. I've got very large files that I'm trying to read with ATpy, and the data comes in the form of numpy arrays. For smaller files the following code has been sufficient:

import atpy
import numpy as np

sat = atpy.Table('satellite_data.tbl')

From there I build up a number of variables that I have to manipulate later for plotting purposes. It's lots of these kinds of operations:

w1 = np.array(sat['w1_column'])
w2 = np.array(sat['w2_column'])
w3 = np.array(sat['w3_column'])

colorw1w2 = w1 - w2  # just subtracting w2 values from w1 values, element by element
colorw1w3 = w1 - w3

etc.

But for very large files my computer can't handle it. I think all the data gets loaded into memory before any parsing begins, and that isn't feasible for 2 GB files. So, what can I use instead to handle files this large?

I've seen lots of posts where people break the data into chunks and use for loops to iterate over each line, but I don't think that will work for me given the nature of these files and the kinds of operations I need to do on these arrays. I can't just do a single operation on every line of the file: each line contains a number of parameters assigned to columns, and in some cases I need to do multiple operations with figures from a single column.

Honestly I don't really understand everything going on behind the scenes with ATpy and numpy. I'm new to Python, so I appreciate answers that spell it out clearly (i.e. not relying on lots of implicit coding knowledge). There has to be a clean way of parsing this, but I'm not finding it. Thanks.


Solution

For very large arrays (larger than your memory capacity) you can use PyTables, which stores arrays on disk in a clever way (using the HDF5 format) so that manipulations can be done on them without loading the entire array into memory at once. That way you won't have to manually break up your datasets or process them one line at a time.
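A minimal sketch of what that looks like, assuming the data has already been converted once into an HDF5 file (the file name satellite_data.h5, the node name satellite, and the chunk size below are placeholder assumptions, not anything from the question):

import numpy as np
import tables  # PyTables

# Open the HDF5 file read-only; nothing is pulled into memory yet.
h5 = tables.open_file('satellite_data.h5', mode='r')
table = h5.root.satellite  # the Table node holding the columns (assumed name)

chunk_size = 100000  # rows per read; tune to your available memory
parts = []
for start in range(0, table.nrows, chunk_size):
    # Table.read() returns a structured numpy array for just this slice,
    # so the usual column arithmetic works chunk by chunk.
    chunk = table.read(start=start, stop=start + chunk_size)
    parts.append(chunk['w1_column'] - chunk['w2_column'])

colorw1w2 = np.concatenate(parts)
h5.close()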

I know nothing about ATpy, so you might be better off asking on an ATpy mailing list, or at least some astronomy Python users' mailing list, as it's possible that ATpy has another solution built in.


From the PyTables website:

PyTables is a package for managing hierarchical datasets and designed to efficiently and easily cope with extremely large amounts of data.

PyTables is built on top of the HDF5 library, using the Python language and the NumPy package.

... fast, yet extremely easy to use tool for interactively browsing, processing and searching very large amounts of data. One important feature of PyTables is that it optimizes memory and disk resources so that data takes much less space...

OTHER TIPS

Look into using pandas. It's built for exactly this kind of work. But to get good performance with any solution, the data files need to be stored in a well-structured binary format like HDF5.
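For instance, pandas can stream the ASCII table in chunks and run the same column arithmetic on each piece. A sketch, assuming the table is whitespace-delimited with a header row naming the columns (adjust the read_csv options to match the real file layout):

import pandas as pd

pieces = []
# chunksize makes read_csv return an iterator of DataFrames, so the
# whole 2 GB file is never held in memory at once.
for chunk in pd.read_csv('satellite_data.tbl', delim_whitespace=True,
                         chunksize=100000):
    pieces.append(chunk['w1_column'] - chunk['w2_column'])

colorw1w2 = pd.concat(pieces, ignore_index=True)

A one-time conversion with DataFrame.to_hdf('satellite_data.h5', 'satellite') would also produce the kind of binary HDF5 file mentioned above, so later runs can skip the slow text parsing entirely.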

Licensed under: CC-BY-SA with attribution