Question

I am currently starting a project titled "Cloud computing for time series mining algorithms using Hadoop". The data I have consists of HDF files totalling over a terabyte. As far as I know, Hadoop expects text files as input for further processing (map-reduce tasks). So one option is to convert all my .hdf files to text files, which is going to take a lot of time.

The other option is to find a way to use the raw HDF files directly in map-reduce programs. So far I have not been able to find any Java code that reads HDF files and extracts data from them. If somebody has a better idea of how to work with HDF files, I would really appreciate the help.

Thanks Ayush


Solution

For your first option, you could use a conversion tool such as h5dump (for HDF5 files) to dump the HDF data to a text format. Alternatively, you can write a small program that uses a Java HDF library to read each HDF file and write its contents out as text.
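As a rough illustration of that second route, here is a minimal sketch using the JHDF5 library (ch.systemsx.cisd.hdf5). The dataset path "/series" is a placeholder, and it assumes the dataset is a one-dimensional array of doubles; adjust both to your actual file layout.

```java
import java.io.PrintWriter;

import ch.systemsx.cisd.hdf5.HDF5Factory;
import ch.systemsx.cisd.hdf5.IHDF5Reader;

public class Hdf5ToText {
    public static void main(String[] args) throws Exception {
        String hdfFile = args[0];   // input HDF5 file
        String textFile = args[1];  // output text file

        // Open the HDF5 file read-only and pull out one dataset.
        // "/series" is a hypothetical dataset path; replace it with yours.
        IHDF5Reader reader = HDF5Factory.openForReading(hdfFile);
        PrintWriter out = new PrintWriter(textFile);
        try {
            double[] values = reader.readDoubleArray("/series");
            // One value per line, so Hadoop's default TextInputFormat
            // can later feed the records to the mappers.
            for (double v : values) {
                out.println(v);
            }
        } finally {
            out.close();
            reader.close();
        }
    }
}
```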

For your second option, SciHadoop is a good example of how to read scientific datasets within Hadoop. It uses the NetCDF-Java library to read NetCDF files. Because HDFS does not expose a POSIX API for file IO, SciHadoop adds an extra software layer that translates the POSIX calls made by the NetCDF-Java library into HDFS (Hadoop) API calls. If SciHadoop does not already support HDF files, you might take the somewhat harder path of developing a similar solution yourself.
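If you do roll your own, a common workaround for the missing POSIX layer is to have each map task copy its HDF file out of HDFS to local disk and then open it with an ordinary HDF5 library. The sketch below assumes the job's input is a text file listing one HDFS path per line and again uses JHDF5 with the hypothetical dataset path "/series"; it illustrates the pattern, not SciHadoop's actual code.

```java
import java.io.File;
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import ch.systemsx.cisd.hdf5.HDF5Factory;
import ch.systemsx.cisd.hdf5.IHDF5Reader;

public class HdfFileMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each input record is the HDFS path of one HDF5 file.
        Path hdfsPath = new Path(value.toString());
        File localCopy = File.createTempFile("hdf", ".h5");

        // Pull the file out of HDFS so the HDF5 library can use normal file IO.
        FileSystem fs = hdfsPath.getFileSystem(context.getConfiguration());
        fs.copyToLocalFile(hdfsPath, new Path(localCopy.getAbsolutePath()));

        IHDF5Reader reader = HDF5Factory.openForReading(localCopy);
        try {
            // "/series" is a placeholder dataset; emit one (file, value) pair per point.
            double[] series = reader.readDoubleArray("/series");
            for (double v : series) {
                context.write(new Text(hdfsPath.getName()), new DoubleWritable(v));
            }
        } finally {
            reader.close();
            localCopy.delete();
        }
    }
}
```

Note that this copies each file to the node's local disk, so it trades network and disk traffic for the convenience of using the HDF5 library unmodified.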

OTHER TIPS

Here are some resources:

  • SciHadoop (uses NetCDF, but may already have been extended to HDF5).
  • You can use either JHDF5 or the lower-level official Java HDF5 interface to read data from any HDF5 file inside a map-reduce task; see the sketch below.
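With the lower-level official interface the outline looks roughly like this. This is a minimal sketch using the hdf.hdf5lib JNI wrapper; in older HDF-Java releases the package is ncsa.hdf.hdf5lib and the identifiers are int rather than long, and "/series" is again just a placeholder dataset name.

```java
import hdf.hdf5lib.H5;
import hdf.hdf5lib.HDF5Constants;

public class LowLevelRead {
    public static void main(String[] args) throws Exception {
        // Open the file and the dataset read-only. "/series" is a placeholder.
        long fileId = H5.H5Fopen(args[0], HDF5Constants.H5F_ACC_RDONLY,
                HDF5Constants.H5P_DEFAULT);
        long datasetId = H5.H5Dopen(fileId, "/series", HDF5Constants.H5P_DEFAULT);

        // Ask the dataspace how many elements the (1-D) dataset holds.
        long spaceId = H5.H5Dget_space(datasetId);
        long[] dims = new long[1];
        H5.H5Sget_simple_extent_dims(spaceId, dims, null);

        // Read the whole dataset into a Java array of doubles.
        double[] data = new double[(int) dims[0]];
        H5.H5Dread(datasetId, HDF5Constants.H5T_NATIVE_DOUBLE,
                HDF5Constants.H5S_ALL, HDF5Constants.H5S_ALL,
                HDF5Constants.H5P_DEFAULT, data);

        System.out.println("read " + data.length + " values, first = " + data[0]);

        H5.H5Sclose(spaceId);
        H5.H5Dclose(datasetId);
        H5.H5Fclose(fileId);
    }
}
```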

If you cannot find suitable Java code but can work in another language, you can use Hadoop Streaming.

SciMATE (http://www.cse.ohio-state.edu/~wayi/papers/SciMATE.pdf) is another good option. It is built on a MapReduce variant that has been shown to run many scientific applications far more efficiently than stock Hadoop.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow