Question

What is the advantage of a Hadoop Sequence File over an HDFS flat (text) file? In what way is a Sequence File more efficient?

Small files can be combined and written into a sequence file, but the same can be done with an HDFS text file. I need to know the difference between the two approaches. I have been googling this for a while, so it would be helpful to get some clarity.


Solution

  1. Sequence files are appropriate when you want to store keys and their corresponding values. You can do that with text files too, but you have to parse each line yourself.
  2. A sequence file can be compressed and still be splittable, so a large file can still be processed by many map tasks in parallel. A compressed text file cannot be split unless you use a splittable compression format (bzip2, for example).
  3. Values are stored in binary form, which is usually more storage efficient. In a text file a double is written as a string of characters, which can take considerably more space than its 8-byte binary form.
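To make points 1 and 3 concrete, here is a minimal sketch in plain Python (no Hadoop involved). It compares storing a few doubles as text with storing them as 8-byte big-endian binary values, which is the width of a binary double. The record layout (`key<TAB>value`) and the names `parse_text_record` and `decode_binary_value` are made up for illustration, not part of any Hadoop API.

```python
import struct

# Three full-precision doubles used as sample values (illustrative data).
values = [3.141592653589793, 2.718281828459045, 1.4142135623730951]

# Text form: one "key<TAB>value" line per record. The reader has to split
# the line and convert the value back from characters.
text_records = [f"k{i}\t{v!r}\n".encode("utf-8") for i, v in enumerate(values)]

# Binary form: each double is packed as 8 bytes (big-endian IEEE 754).
binary_values = [struct.pack(">d", v) for v in values]

# Bytes spent on the values alone in each representation.
text_value_bytes = sum(len(repr(v)) for v in values)
binary_value_bytes = sum(len(b) for b in binary_values)


def parse_text_record(line: bytes):
    """Split a text record into (key, float value)."""
    key, value = line.decode("utf-8").rstrip("\n").split("\t")
    return key, float(value)


def decode_binary_value(raw: bytes) -> float:
    """Decode an 8-byte big-endian double."""
    return struct.unpack(">d", raw)[0]


print("text bytes for values:  ", text_value_bytes)
print("binary bytes for values:", binary_value_bytes)
```

Note that the saving depends on the data: a short value such as `1.5` takes fewer characters as text than 8 bytes, so the binary form pays off mainly for full-precision numbers and for avoiding the parsing step.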

OTHER TIPS

Advantages of Hadoop Sequence files (as per Siva's article on the hadooptutorial.info website):

  1. More compact than text files
  2. Support compression at different levels: per record or per block
  3. Files can be split and processed in parallel
  4. They can solve the "large number of small files" problem in Hadoop. Hadoop's main strength is processing large files with MapReduce jobs, and a sequence file can be used as a container for a large number of small files
  5. Temporary output of a Mapper can be stored in sequence files
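As a rough illustration of points 2 and 4, the sketch below packs many small "files" into one blob of length-prefixed key/value records and compares compressing each value on its own (record level) with compressing a whole run of records together (block level). This is a toy layout in plain Python with `zlib`, not the actual SequenceFile on-disk format; `pack_record`, `unpack_records` and the sample file names are hypothetical.

```python
import struct
import zlib


def pack_record(key: bytes, value: bytes) -> bytes:
    """Frame one pair as: key length, value length, key bytes, value bytes."""
    return struct.pack(">II", len(key), len(value)) + key + value


def unpack_records(blob: bytes):
    """Yield (key, value) pairs from a run of framed records."""
    offset = 0
    while offset < len(blob):
        key_len, value_len = struct.unpack_from(">II", blob, offset)
        offset += 8
        key = blob[offset:offset + key_len]
        offset += key_len
        value = blob[offset:offset + value_len]
        offset += value_len
        yield key, value


# Hypothetical small files: file name (key) -> file contents (value).
small_files = {
    f"log-{i:04d}.txt".encode(): f"sensor={i % 5},status=OK,reading=23.5\n".encode()
    for i in range(1000)
}

# Record-level compression: every value is compressed individually.
record_level = b"".join(
    pack_record(k, zlib.compress(v)) for k, v in small_files.items()
)

# Block-level compression: many framed records are compressed together.
block_level = zlib.compress(
    b"".join(pack_record(k, v) for k, v in small_files.items())
)

# Reading the block back recovers every small file by name.
restored = dict(unpack_records(zlib.decompress(block_level)))

print("record-level bytes:", len(record_level))
print("block-level bytes: ", len(block_level))
```

For tiny, similar records like these, block-level compression comes out much smaller, because the compressor sees the repetition across records instead of paying a fixed overhead per value. In real sequence files, record compression likewise compresses only the values, while block compression collects keys and values into blocks before compressing them.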

Disadvantages:

  1. Sequence files are append-only

Sequence files are often used as intermediate files between the stages of MapReduce processing: one stage writes its output as a sequence file and the next stage reads from it. They are compressible and fast to process. There are APIs in both Hadoop and Spark to read and write sequence files.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow