Question

I am currently starting a project titled "Cloud computing for time series mining algorithms using Hadoop". The data I have consists of HDF files totalling over a terabyte. As far as I know, Hadoop expects text files as input for further processing (map-reduce tasks). So one option is to convert all my .hdf files to text files, which is going to take a lot of time.

The other option is to find a way to use the raw HDF files directly in map-reduce programs. So far I have not been able to find any Java code that reads HDF files and extracts data from them. If somebody has a better idea of how to work with HDF files, I would really appreciate the help.

Thanks Ayush


Solution

For your first option, you could use a conversion tool such as h5dump (for HDF5 files) to dump the HDF data to a text format. Alternatively, you can write a small program that uses a Java HDF library to read each HDF file and write its contents out as text.
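As a rough illustration of that second route, here is a minimal sketch using the JHDF5 library (ch.systemsx.cisd.hdf5). The dataset path "/series" is a placeholder, and it assumes the dataset is a one-dimensional array of doubles; adjust both to your actual file layout.

```java
import java.io.PrintWriter;

import ch.systemsx.cisd.hdf5.HDF5Factory;
import ch.systemsx.cisd.hdf5.IHDF5Reader;

public class Hdf5ToText {
    public static void main(String[] args) throws Exception {
        String hdfFile = args[0];   // input HDF5 file
        String textFile = args[1];  // output text file

        // Open the HDF5 file read-only and pull out one dataset.
        // "/series" is a hypothetical dataset path; replace it with yours.
        IHDF5Reader reader = HDF5Factory.openForReading(hdfFile);
        PrintWriter out = new PrintWriter(textFile);
        try {
            double[] values = reader.readDoubleArray("/series");
            // One value per line, so Hadoop's default TextInputFormat
            // can later feed the records to the mappers.
            for (double v : values) {
                out.println(v);
            }
        } finally {
            out.close();
            reader.close();
        }
    }
}
```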

For your second option, SciHadoop is a good example of how to read scientific datasets within Hadoop. It uses the NetCDF-Java library to read NetCDF files. Because HDFS does not expose a POSIX API for file IO, SciHadoop adds an extra software layer that translates the POSIX calls made by the NetCDF-Java library into HDFS (Hadoop) API calls. If SciHadoop does not already support HDF files, you might take the somewhat harder path of developing a similar solution yourself.
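If you do roll your own, a common workaround for the missing POSIX layer is to have each map task copy its HDF file out of HDFS to local disk and then open it with an ordinary HDF5 library. The sketch below assumes the job's input is a text file listing one HDFS path per line and again uses JHDF5 with the hypothetical dataset path "/series"; it illustrates the pattern, not SciHadoop's actual code.

```java
import java.io.File;
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import ch.systemsx.cisd.hdf5.HDF5Factory;
import ch.systemsx.cisd.hdf5.IHDF5Reader;

public class HdfFileMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each input record is the HDFS path of one HDF5 file.
        Path hdfsPath = new Path(value.toString());
        File localCopy = File.createTempFile("hdf", ".h5");

        // Pull the file out of HDFS so the HDF5 library can use normal file IO.
        FileSystem fs = hdfsPath.getFileSystem(context.getConfiguration());
        fs.copyToLocalFile(hdfsPath, new Path(localCopy.getAbsolutePath()));

        IHDF5Reader reader = HDF5Factory.openForReading(localCopy);
        try {
            // "/series" is a placeholder dataset; emit one (file, value) pair per point.
            double[] series = reader.readDoubleArray("/series");
            for (double v : series) {
                context.write(new Text(hdfsPath.getName()), new DoubleWritable(v));
            }
        } finally {
            reader.close();
            localCopy.delete();
        }
    }
}
```

Note that this copies each file to the node's local disk, so it trades network and disk traffic for the convenience of using the HDF5 library unmodified.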

OTHER TIPS

Here are some resources:

  • SciHadoop (uses NetCDF, but may already have been extended to HDF5).
  • You can use either JHDF5 or the lower-level official Java HDF5 interface to read data from any HDF5 file inside a map-reduce task; see the sketch below.
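With the lower-level official interface the outline looks roughly like this. This is a minimal sketch using the hdf.hdf5lib JNI wrapper; in older HDF-Java releases the package is ncsa.hdf.hdf5lib and the identifiers are int rather than long, and "/series" is again just a placeholder dataset name.

```java
import hdf.hdf5lib.H5;
import hdf.hdf5lib.HDF5Constants;

public class LowLevelRead {
    public static void main(String[] args) throws Exception {
        // Open the file and the dataset read-only. "/series" is a placeholder.
        long fileId = H5.H5Fopen(args[0], HDF5Constants.H5F_ACC_RDONLY,
                HDF5Constants.H5P_DEFAULT);
        long datasetId = H5.H5Dopen(fileId, "/series", HDF5Constants.H5P_DEFAULT);

        // Ask the dataspace how many elements the (1-D) dataset holds.
        long spaceId = H5.H5Dget_space(datasetId);
        long[] dims = new long[1];
        H5.H5Sget_simple_extent_dims(spaceId, dims, null);

        // Read the whole dataset into a Java array of doubles.
        double[] data = new double[(int) dims[0]];
        H5.H5Dread(datasetId, HDF5Constants.H5T_NATIVE_DOUBLE,
                HDF5Constants.H5S_ALL, HDF5Constants.H5S_ALL,
                HDF5Constants.H5P_DEFAULT, data);

        System.out.println("read " + data.length + " values, first = " + data[0]);

        H5.H5Sclose(spaceId);
        H5.H5Dclose(datasetId);
        H5.H5Fclose(fileId);
    }
}
```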

If you cannot find suitable Java code but can work in another language, you can use Hadoop Streaming.

SciMATE (http://www.cse.ohio-state.edu/~wayi/papers/SciMATE.pdf) is another good option. It is built on a MapReduce variant that has been shown to run many scientific applications far more efficiently than stock Hadoop.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow