Question

I have one more very basic, concept-level question: what are the differences and the relationship between HDFS and the various file formats, such as sequence files (and map files, which are built on them) and HAR files?

My understanding is that HDFS is the underlying file system: we can upload raw binary files to HDFS directly (without using sequence files, HAR files, etc.), and we can also write files to HDFS in specially designed formats, such as the sequence file format (and the map file format built on it) or the HAR file format. Is that understanding correct?

Solution

HDFS is a file system and is not tied to any specific file format. It is a distributed file system that abstracts away most of the internal details of how files are actually persisted on disk (just as NFS or FAT do). It gives us a continuous view of the file and directory structure, but internally each file is split into blocks, and those blocks are replicated and stored across various nodes in the cluster.
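
To make the "format-agnostic blocks" idea concrete, here is a minimal toy model in Python. It is not the HDFS API and not how HDFS actually chooses nodes; the function names, the tiny block size, the node names, and the round-robin placement are all assumptions made up for illustration. It only shows that splitting and replicating works on raw bytes, whatever they encode.

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut a byte string into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list, replication: int) -> dict:
    """Assign each block to `replication` distinct nodes (simple round-robin).

    Real HDFS uses a rack-aware placement policy; round-robin is only a stand-in.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


# The content could be plain text, an image, or a sequence file: only bytes matter here.
data = b"any bytes at all: text, images, sequence files, archives"
blocks = split_into_blocks(data, block_size=16)  # real block sizes are far larger
placement = place_replicas(len(blocks), ["node-a", "node-b", "node-c", "node-d"], replication=3)

# Reading the file back is just concatenating the blocks in order.
assert b"".join(blocks) == data
```

The client only ever sees the reassembled byte stream, which is the "continuous view" described above.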

File formats like the sequence file are particularly well suited to the MapReduce programming paradigm because they can easily be split into pieces that are processed in parallel on different data nodes. HDFS itself, however, has no such preference: it divides a file of any format (binary, plain text, and so on) into blocks and stores them.
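
As a rough illustration of what "easily split" means, the sketch below builds a toy record file with periodic sync markers, which is the general idea that lets a reader start in the middle of a file and still find a record boundary. This is an assumed, simplified layout, not the real SequenceFile on-disk format; `SYNC`, `write_records`, and `read_split` are hypothetical names, and the fixed marker value is chosen only so the example is reproducible.

```python
import struct

# 16-byte sync marker. Fixed here for the example; assumed never to occur in record data.
SYNC = bytes(range(0xA0, 0xB0))


def write_records(records, sync_every=2):
    """Write length-prefixed records, emitting a sync marker before every `sync_every` records."""
    out = bytearray()
    for i, rec in enumerate(records):
        if i % sync_every == 0:
            out += SYNC
        out += struct.pack(">I", len(rec)) + rec
    return bytes(out)


def read_split(data, start, end):
    """Return the records of every chunk whose sync marker begins in [start, end)."""
    pos = data.find(SYNC, start)  # skip ahead to the first record boundary we can trust
    records = []
    while pos != -1 and pos < len(data):
        if data.startswith(SYNC, pos):
            if pos >= end:  # this chunk belongs to the next split
                break
            pos += len(SYNC)
            continue
        (length,) = struct.unpack_from(">I", data, pos)
        pos += 4
        records.append(data[pos:pos + length])
        pos += length
    return records


records = [f"record-{i}".encode() for i in range(10)]
data = write_records(records)

# Cut the file at arbitrary byte offsets that do not line up with record boundaries.
split_size = 64
splits = [(s, min(s + split_size, len(data))) for s in range(0, len(data), split_size)]

# Each split can be read independently (for example by a separate map task),
# and together the splits yield every record exactly once.
recovered = [rec for s, e in splits for rec in read_split(data, s, e)]
assert recovered == records
```

A plain binary blob without such markers can still be stored in HDFS block by block, but a reader dropped into the middle of it has no general way to find where the next record starts.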

Licensed under: CC-BY-SA with attribution