Question

I am trying to use the netCDF4 package with Python. I am ingesting close to 20 million records of data, 28 bytes each, and then I need to write the data to a netCDF4 file. Yesterday, I tried doing it all at once, and after an hour or so of execution, Python stopped running the code with the very helpful error message:

Killed.

Anyway, doing this with subsections of the data, it becomes apparent that somewhere between 2,560,000 and 5,120,000 records the code runs out of memory and has to start swapping, and performance is, of course, greatly reduced. So, two questions: 1) Does anyone know how to make this work more efficiently? One thing I am thinking is to somehow write subsections of the data incrementally, instead of doing it all at once. Does anyone know how to do that? 2) I presume the "Killed" message happened when memory finally ran out, but I don't know for sure. Can anyone shed any light on this?

Thanks.

Addendum: netCDF4 provides an answer to this problem, which you can see in the answer I have given to my own question. So for the moment, I can move forward. But here's another question: The netCDF4 answer will not work with netCDF3, and netCDF3 is not gone by a long shot. Anyone know how to resolve this problem in the framework of netCDF3? Thanks again.


Solution

It's hard to tell what you are doing without seeing code, but you could try calling the Dataset's sync method to flush the in-memory data to disk after some amount of data has been written to the file:

http://netcdf4-python.googlecode.com/svn/trunk/docs/netCDF4.Dataset-class.html
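
A minimal sketch of what that might look like, assuming a one-dimensional unlimited dimension and writing the records in blocks (the variable names, the 100,000-record block size, and the read_records helper are illustrative, not from the original question):

    from netCDF4 import Dataset

    n_records = 20000000   # roughly the size described in the question
    block = 100000         # illustrative block size

    ds = Dataset("output.nc", "w", format="NETCDF4")
    ds.createDimension("record", None)                 # unlimited dimension
    var = ds.createVariable("data", "f8", ("record",))

    for start in range(0, n_records, block):
        stop = min(start + block, n_records)
        chunk = read_records(start, stop)   # hypothetical reader returning a NumPy array
        var[start:stop] = chunk             # write only this slice of the variable
        ds.sync()                           # flush buffered data to disk

    ds.close()

Writing slice by slice and calling sync() periodically keeps only one block in memory at a time instead of the full 20 million records.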

OTHER TIPS

There is a ready answer in netCDF4: declare the netCDF4 variable with a specified "chunksize". I used 10000, and everything proceeded very nicely. As I indicated in the addendum to my question, I would like to find a way to resolve this in netCDF3 as well, since netCDF3 is far from dead.
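
For reference, a minimal sketch of that approach with illustrative names (the relevant piece is the chunksizes argument to createVariable, which only applies to NETCDF4/HDF5-format files):

    from netCDF4 import Dataset

    ds = Dataset("output.nc", "w", format="NETCDF4")
    ds.createDimension("record", None)   # unlimited dimension
    # chunksizes sets the HDF5 chunk shape, so the variable is stored and
    # written in blocks of 10000 records rather than as one contiguous piece.
    var = ds.createVariable("data", "f8", ("record",), chunksizes=(10000,))

Because chunking is an HDF5 storage feature, this is consistent with the note above that the same trick does not carry over to the classic netCDF3 format.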

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow