Question

I would like to read from and write to a single netCDF-4 file from R. The file will be accessed by many processes at the same time (~100s for development, ~1000s for production).

What is the best way to access parallel I/O features in netCDF from within R?

What I have found:

  • It appears from the Unidata page that all I need to do is compile with parallel features enabled (--enable-parallel). Is that really all I need to do?
  • I cannot find any mention of parallel I/O in the ncdf4 package description.
  • Given that I/O is the bottleneck in my computation, any hints on how to optimize it would be welcome: are there circumstances when it would be better to write to multiple files during computation (e.g., locally), and combine the files later (e.g., using nco)?

Solution

Information related to using parallel I/O with Unidata NetCDF may be found here:

https://www.unidata.ucar.edu/software/netcdf/docs/parallel_io.html

The --enable-parallel flag is no longer necessary when configuring netCDF (I will check the documentation and update it if need be). The flag is still necessary when building the hdf5 library, however.

In order to use parallel I/O with netCDF-4, you need to make sure that it was built against an hdf5 library with parallel I/O enabled. At configure time, netCDF will query the hdf5 library to see whether or not the parallel I/O symbols are present.

  • If they are, parallel I/O for netCDF-4 is assumed.
  • If they are not, parallel I/O for netCDF-4 files is turned off.

If you are installing the netCDF library yourself, you can specify the --enable-parallel-tests flag at configure time; parallel tests will then be run by make check. You can also scan the output in config.log to see whether parallel I/O functionality was found in the hdf5 library; there should be a message noting whether or not it was enabled.

Note that there are some limitations to Parallel I/O with netCDF-4, specifically:

NetCDF-4 provides access to HDF5 parallel I/O features for netCDF-4/HDF5 files. NetCDF classic and 64-bit offset format may not be opened or created for use with parallel I/O. (They may be opened and created, but parallel I/O is not available.)

Assuming that the underlying netCDF library has parallel I/O enabled, and you are operating on the correct type of file, the standard API call invoked by ncdf4 should leverage parallel I/O automatically.
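In other words, no special calls are needed from the R side. A minimal sketch of that standard usage, assuming a pre-existing netCDF-4 file named output.nc containing a variable named temperature (both names are hypothetical, for illustration only):

```r
# Standard ncdf4 usage. If the underlying netCDF library was built
# against a parallel-enabled hdf5, these same calls can take advantage
# of parallel I/O on netCDF-4/HDF5 files without any code changes.
library(ncdf4)

nc <- nc_open("output.nc", write = TRUE)  # hypothetical file name
temp <- ncvar_get(nc, "temperature")      # hypothetical variable name

# ... compute on temp ...

ncvar_put(nc, "temperature", temp)        # write results back
nc_close(nc)
```

Whether these reads and writes actually go through the parallel I/O path depends entirely on how the netCDF and hdf5 libraries underneath ncdf4 were built, not on the R code.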

OTHER TIPS

There is one more R package dedicated to parallel handling of NetCDF files, called pbdNCDF4.
It is based on the standard ncdf4 package, so the syntax is very similar to the "traditional" approach. Further information is available on CRAN: https://cran.r-project.org/web/packages/pbdNCDF4/vignettes/pbdNCDF4-guide.pdf
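As a rough sketch of the SPMD style described in that vignette: the script below assumes an existing netCDF-4 file data.nc with a variable x (both names hypothetical), a parallel-enabled netCDF build, and launching under MPI, e.g. with mpiexec -np 4 Rscript script.R.

```r
# SPMD-style parallel access with pbdNCDF4; every MPI rank runs this
# same script. Requires pbdMPI and a parallel-enabled netCDF/hdf5.
library(pbdMPI, quietly = TRUE)
library(pbdNCDF4, quietly = TRUE)
init()

# Open the file collectively across all ranks.
nc <- nc_open_par("data.nc")     # hypothetical file name

# Request collective parallel access for the variable.
nc_var_par_access(nc, "x")       # hypothetical variable name

# Each rank could read just its own hyperslab via start/count;
# this simple sketch reads the whole variable on every rank.
vals <- ncvar_get(nc, "x")

nc_close(nc)
finalize()
```

The familiar ncdf4 functions (ncvar_get, ncvar_put, nc_close) are reused as-is; only the open call and the per-variable access mode differ from the serial workflow.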

Ward gave a fine answer. I wanted to add that there is another way to get parallel I/O features out of Unidata NetCDF-4.

NetCDF-4 has an architecture that separates the API from the back end storage implementation. Commonly, that's the NetCDF API on an HDF5 back end. But, here's the neat thing: you can also have the NetCDF API on the Northwestern/Argonne "Parallel-NetCDF" (http://cucis.ece.northwestern.edu/projects/PnetCDF/ and http://www.mcs.anl.gov/parallel-netcdf) back end.

This approach will give you a parallel I/O method to classic and 64-bit offset formatted datasets.

Both Ward and Rob gave fine answers! ;-)

But there is yet another way to get parallel I/O on classic and 64-bit offset files, through the standard netCDF API.

When netCDF is built with --enable-pnetcdf, the parallel-netcdf library is used behind the scenes to perform parallel I/O on classic, 64-bit offset, and CDF5 files (though I have not tested parallel I/O with that last format).

When opening the file, pass the NC_PNETCDF flag in the mode argument to indicate that you want to use parallel I/O for that file.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow